ADVANCES IN SAFETY AND RELIABILITY Proceedings of the ESREL'97 International Conference on Safety and Reliability, 17-20 June 1997, Lisbon, Portugal
Volume 1
Elsevier Science Internet Homepage:
http://www.elsevier.nl
Full catalogue information on all books, journals and electronic products.
Related Journals
Free specimen copy gladly sent on request: Elsevier Science Ltd, The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, U.K.
Advances in Engineering Software Computer Methods in Applied Mechanics and Engineering Computers and Fluids Computers and Structures Engineering Analysis with Boundary Elements Engineering Failure Analysis Engineering Structures Finite Elements in Analysis and Design International Journal of Solids and Structures Ocean Engineering Probabilistic Engineering Mechanics Reliability Engineering and System Safety Solids and Structures Structural Safety Thin-Walled Structures
ADVANCES IN SAFETY AND RELIABILITY Proceedings of the ESREL'97 International Conference on Safety and Reliability, 17-20 June 1997, Lisbon, Portugal
Volume 1 Edited by C. Guedes Soares
Technical University of Lisbon Lisbon, Portugal
PERGAMON
U.K.
Elsevier Science Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, England
U.S.A. Elsevier Science Inc., 655 Avenue of the Americas, New York, 10010, U.S.A. JAPAN Elsevier Science Japan, Tsunashima Building Annex, 3-20-12 Yushima, Bunkyo-ku, Tokyo 113, Japan
Copyright © 1997 Elsevier Science. All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means: electronic, electrostatic, magnetic tape, mechanical, photocopying, recording or otherwise, without prior permission in writing from the publisher.
First edition 1997
Library of Congress Cataloging in Publication Data A catalogue record for this title is available from the Library of Congress. British Library Cataloguing in Publication Data A catalogue record for this title is available from the British Library. ISBN 0-08-042835-5
Printed and bound in Great Britain by Redwood Books Ltd.
TABLE OF CONTENTS
Volume 1
A1: Risk Based Regulations
The Regulatory Review of Safety-Related Information Regarding Underground Radioactive Waste Disposal in England and Wales Thompson, B. G. J. and Williams, C. R. ............ 3
Developments and Practice Towards Risk Based Regulations in Various Technologies Berg, H. P. and Kafka, P ..................................................................................................................... 15
Incorporating Risk Assessment and its Results in the Decision-Making Process Le Guen, J. M ...................................................................................................................................... 27
A2: Risk Perception
Public Perceptions of Risks Associated with Major Industrial Hazard Sites Brazier, A., Irwin, A., Kelly, C., Prince, L., Simmons, P., Walker, G. P. and Wynne, B. ............ 37
Societal Risk and the Concept of Risk Aversion Vrijling, J. K. and Van Gelder, P. H. A. J. M .................................................................................... 45
From Risk Analysis to Risk Perception: Developing a Risk Communication Strategy for a Dam-Break Flood Risk Lima, M. L. de, Almeida, A. B. and Silva, D. ............ 53
A3: Integrating Management Models
Dynamic Modelling of Safety Management Hale, A. R., Bellamy, L. J., Guldenmund, F., Heming, B. H. J. and Kirwan, B. ............ 63
Understanding Safety Culture in Organisations - The Concept of Total Safety Management and its Practical Use in Audit Instruments Grote, G., Künzler, C. and Klampfer, B. ............ 71
Case Studies with Tomhid Safety Analysis Methodology Heikkilä, J. ............ 79
A4: Safety Culture and Management Attitudes
In-Depth Analysis of Organisational Factors: The Need for Field Inquiries Llory, M. ............ 89
Safety Management and Accident Prevention: the State of the Art in 14 Small and Medium-Sized Industrial Plants Seppälä, A. ............ 97
Safety Practices and Risk Attitudes in French Small Companies Favaro, M. and Davillerd, C. ............ 105
A5: Human Factors
The Integration of Human Factors in Dependability: A Vital Aspect for Risk Management Fadier, E. ............ 117
Human Errors in Maintenance Actions - An Experience Based Study Pyy, P., Laakso, K. and Reiman, L. ............ 129
Human Error and Technological Accident Prevention Ferro Fernandez, R ............................................................................................................................ 137
A6: Human Reliability
State of the Art in the Development of a Simulator Aided Approach to Human Reliability Assessment Bareith, A., Holló, E., Karsa, Z., Borbély, S. and Spurgin, A. J. ............ 147
An Approach to the Human Reliability Assessment in the Context of the Probabilistic Safety Analysis of Complex Plants Kosmowski, K. T ............................................................................................................................... 155
HRA Study of Cognitive Reliability in a NPP Training Simulator Gao, J., Hang, X.-R. and Shen, Z.-P ................................................................................................ 167
A7: Operational Errors and Support Systems
Rating Systems and Checklists as Tools in Development and Implementation of Environmental and Safety Management Systems Christiansen, H. C. and Hansen, E. ............ 175
Safety Program to Loss Prevention Control das Neves, J. A. and Pereira, C. M ................................................................................................... 181
Approach for Assessing Hazards Related to Hidden Deficiencies in Technical Systems Tomter, A ........................................................................................................................................... 189
A8: Expert Judgement in Safety Assessments
Keejam: A Knowledge Engineering Methodology for Expert Judgment Acquisition and Modeling in Probabilistic Safety Assessment Cojazzi, G., Guida, G., Pinola, L., Sardella, R. and Baroni, P. ............ 199
A Practical Case of Assessing Subjective Probabilities - A Discussion of Concepts and Evaluation of Methods Andersen, L. B., Nilsen, T., Aven, T., Guerneri, A. and Maglione, R. ............ 209
Expert Judgement in Safety Assessments Brown, D. A. and Scott, I. M. B ....................................................................................................... 217
A9: Risk Management Decision Support Systems
Risk Analysis and Decision Making: An Integrated Approach to Designing for Safety Tan, J. K. G. ............ 227
The Macro Project: Cost/Risk Evaluation of Engineering & Management Decisions Woodhouse, J. ............ 237
Development of a Methodology to Examine the Cost Effectiveness of Health and Safety Management Practices Priestley, K. N. and Livingston, A. D ............................................................................................... 247
A10: Risk Management Decision Support Systems
ADAM: An Accident Diagnostics, Analysis and Management System Esmaili, H., Orandi, S., Vijaykumar, R., Khatib-Rahbar, M., Zuchuat, O. and Schmocker, U. ............ 257
RISMAN, a Method for Risk Management of Large Infrastructure Projects de Rijke, W. G., van der Does de Bye, M. R., Buvelot, R. and Vrijling, J. K ................................ 265
Computer Supported Event Analysis in Industry with High Hazard Potential Baggen, R., Wilpert, B., Fahlbruch, B. and Miller, R ...................................................................... 273
A11: Risk Management Decision Support Systems
An Information System Supporting Design for Reliability and Maintenance Rit, J.-F. and Béraud, M.-T. ............ 281
Reliability Support System for Metallic Components Susceptible to Corrosion Related Cracking Lopes, E. Dias, Vianna, C., Carvalho, T., Schmatjko, K. J., Esmeraldo, D., Vancoille, M., van Acker, W., Boulliard, G., Phlippo, K., Jovanovic, A., Poloni, M., Bogaerts, W. and Tulp, J. 289
A Knowledge-Based System for Failure Identification Based on the HMG Method Jalashgar, A. ............ 297
A12: Software Reliability
Software and Human Reliability Evaluation: An Experimental Attempt for a Common Approach Pasquini, A., Rizzo, A. and Veneziano, V. ............ 307
An Exponential Approximation to the Exponential-Multinomial Function Sáiz de Bustamante, B. ............ 315
Analysis and Recommendations for a Reliable Programming of Software-Based Safety Systems Núñez McLeod, J., Núñez McLeod, J. and Rivera, S. S. ............ 323
A13: Safety Critical Systems
Overall Reliability Evaluation of the IEEE Benchmark Test System Using the NH-2 Program Pinheiro, J. M. S., Dornellas, C. R. R., Schilling, M. T., Melo, A. C. G. and Mello, J. C. O. ............ 333
Hardware and Software Fault Tolerance: Definition and Evaluation of Adaptive Architectures in a Distributed Computing Environment Di Giandomenico, F., Bondavalli, A., Xu, J. and Chiaradonna, S. ............ 341
A Programmable Electronic System for Safety Related Control Applications Halang, W. A. and Adamski, M. ............ 349
Systemic Failure Modes: A Model for Perrow's Normal Accidents in Complex, Safety Critical Systems Collins, R. J. and Thompson, R. ............ 357
The Essential Logic Model: A Method for Documenting Design Rationale in Safety Critical Systems Collins, R. J. ............ 365
A14: Software Reliability
3P - The Product, the Process, the Processing in Software Reliability Leclercq, P. R. ............ 375
Software Reliability Method Comparative Analysis: From the Experience to the Theory Arbaretier, E. ............ 383
Subjective Safety Analysis for Software Development Wang, J., Saeed, A. and de Lemos, R. ............ 389
A15: Software Reliability
Software Reliability Prediction Recalibration Based on the TTT-Plot Zhao, M. and Helander, M. ............ 399
Safety Monitor Synthesis Based on Hazard Scenarios Górski, J. and Nowicki, B. ............ 407
Development of a System for a Rule-Driven Analysis of Safety Critical Software Miedl, H. ............ 417
B1: PSA Applications
Individual Plant Examinations: What Perspectives Can Be Drawn? Drouin, M. T., Camp, A. L., Lehner, J., Pratt, T. and Forester, J. ............ 425
PSA for CANDU-6 Pressurized Heavy Water Reactors: Wolsong Units 2, 3 and 4 of Korea Kim, M.-K. and Park, B.-C. ............ 435
Level 2 PSA to Evaluate the Performance of the Doel 1&2 NPP Containment Under Severe Accident Conditions D'Eer, A., Boesmans, B., Auglaire, M., Wilmart, P. and Moeyaert, P. ............ 441
RAOL - Simplified Approach to Risk Monitoring in Nuclear Power Plants Simić, Z., O'Brien, J., Follen, S. and Mikuličić, V. ............ 449
B2: PSA Applications
Relative Risk Measure Suitable for Comparison of Design Alternatives of Interim Spent Nuclear Fuel Storage Facility Ferjenčík, M. ............ 459
Evaluation of Advanced Containment Features Proposed to Korean Standard Nuclear Power Plant Jin, Y., Park, S. Y., Kim, S. D. and Kim, D. H ............................................................................... 467
The Benefits of Symptom Based Procedures in a PSA (and vice-versa) Verweij, A. J. P. and de Wit, H. W ................................................................................................... 475
B3: Living QRA
On-Line Maintenance Scheduling and Risk Management - The EOOS Monitor® Approach Simić, Z., Follen, S. and Mikuličić, V. ............ 483
Supporting Risk Monitoring with off-line PSA Studies Vivalda, C., Carpignano, A. and Nordvik, J. P ................................................................................ 489
"Living" Safety Cases- A Team Approach Rawlinson, G. A ................................................................................................................................
497
B4: Waste Isolation Pilot Plant
Condensed Summary of the Systems Prioritization Method as a Decision-Aiding Approach for the Waste Isolation Pilot Plant Boak, D. M., Prindle, N. H., Bills, R. A., Hora, S., Lincoln, R., Mendenhall, F. and Weiner, R. ............ 507
Conceptual and Computational Structure of the 1996 Performance Assessment for the Waste Isolation Pilot Plant Anderson, D. R., Helton, J. C., Jow, H.-N., Marietta, M. G., Chu, M. S. Y., Shephard, L. E. and Basabilvazo, G. ............ 515
Uncertainty and Sensitivity Analysis in 1996 Performance Assessment for the Waste Isolation Pilot Plant Helton, J. C., Anderson, D. R., Jow, H.-N., Marietta, M. G. and Basabilvazo, G ......................... 525
B5: Management of Safety Assessments
Management of Safety Assessments - Lessons Learned from Experience within National Projects Wilmot, R. D. and Galson, D. A. ............ 535
Management of Performance Assessments in the Swedish Waste Disposal Programme: SKI's Views and Experiences Dverstorp, B., Kautsky, F., Norrby, S., Toverud, Ö. and Wingefors, S. ............ 541
Organisational and Management Issues in the Regulatory Assessment of Underground Radioactive Waste Disposal Thompson, B. G. J. and Sumerling, T. J ........................................................................................... 549
B6: Safety of Nuclear Waste Disposal
Safety Assessment of Complex Engineered and Natural Systems: Radioactive Waste Disposal McNeish, J. A., Balady, M. A., Vallikat, V. and Atkins, J. ............ 561
Assessing Performance of Imprecisely Characterized Systems: A Mathematical Perspective Tierney, M. S. and Rechard, R. P. ............ 569
Assessing and Presenting Risks from Deep Disposal of High-Level Radioactive Waste McKinley, I. G., McCombie, C. and Zuidema, P ............................................................................. 577
B7: Safety of Nuclear Waste Disposal
Parallel Computing in Probabilistic Safety Assessments of High-Level Nuclear Waste Pereira, A., Andersson, M. and Mendes, B ....................................................................................... 585
Exploring the Potential for Criticality in Geologic Repositories for Nuclear Waste Rechard, R. P ..................................................................................................................................... 593
B8: Industrial Safety
Derivation of Fatality Criteria for Humans Exposed to Thermal Radiation Rew, P. J. and McKay, I. P ............................................................................................................... 603
An Inherent Safety Opportunity Audit/Technology Options Analysis Ashford, N. A. and Zwetsloot, G ...................................................................................................... 613
Live Work Safety Audits Mendes, N .......................................................................................................................................... 619
B9: Industrial Safety
Technical Certification of Dangerous Equipment: A Study of the Effectiveness of Three Legally Compulsory Regimes in the Netherlands Hale, A. R., Pietersen, C. M., Heming, B. H. J., van den Brock, B., Mol, W. E. and Ribbert, C.. 631
Environmental Risk Assessment of Chemical Plants: A Process Systems Methodology Stefanis, S. K., Livingston, A. G. and Pistikopoulos, E. N. ............ 639
Process Safety Management: Performance Support and Training Systems Fiorentini, C., De Vecchi, F., Lander, E. P. and Orta, C. V. ............ 649
An Investigation of the Consequences from Possible Industrial Accidents in Northern Greece Ziomas, I. C., Poupkou, A. and Mouzakis, G. ............ 657
B10: Industrial Safety
Using Modern Database Concepts to Facilitate Exchange of Information on Major Accidents in the European Union Kirchsteiger, C ................................................................................................................................... 667
Plant Safety Improvement by Logical-Physical Simulation Piccinini, N., Fiorentini, C., Scataglini, L. and de Vecchi, F ........................................................... 675
Planning of Component Inspection." Developments in the Netherlands Heerings, J. H. and Boogaard, J ........................................................................................................ 683
B11: Modelling Physical Phenomena
Uncertainty Quantification in Probabilistic Safety Analysis of the BLEVE Phenomenon Papazoglou, I. A. and Aneziris, O. N. ............ 693
Extended Modelling and Experimental Research into Gas Explosions Mercx, W. P. M. ............ 701
Modelling of a Low Discharge Ammonia Release Dusserre, G. and Bara, A. ............ 709
B12: Pipeline Safety
Risk Assessment of Pipelines Aneziris, O. N. and Papazoglou, I. A. ............ 717
Quantified Risk Analysis in Transport of Dangerous Substances: A Comparison Between Pipelines and Roads Leonelli, P., Bonvicini, S. and Spadoni, G. ............ 725
Artificial Neural Networks for Leak Detection in Pipelines Belsito, S. and Banerjee, S. ............ 733
Volume 2
C1: Offshore Safety
On the Sensitivity of Offshore QRA Studies Vinnem, J. E. ............ 745
Details of the Offshore QRA Sensitivity Studies Vinnem, J. E. ............ 755
Designing Maintenance Programs for Not Normally Manned Offshore Installations Aamodt, K. and Reinertsen, R. ............ 763
Safety Analyses as a Tool for Safe and Cost-Efficient Design against Gas Explosions Svendsen, T. ............ 771
Probabilistic Modelling of Offshore Pool Fires Guedes Soares, C., Teixeira, A. P. and Neves, L .............................................................................. 781
C2: Offshore Safety
Experience with Fast Track Risk Assessment Used to Compare Alternative Platforms Brandtzæg, A. and Bea, R. G. ............ 791
Demanning or not Demanning? A Case Study Chamoux, P., Leroy, A. and Petit, A. ............ 799
Methods and Models for Assessing Regularity in Gas Supply from the Norwegian Gas Network to the Markets Andersen, T., Horgen, H. and Pedersen, B ....................................................................................... 807
C3: Offshore Safety (ESRA TC)
Overview of the Need for Modelling Development in Offshore QRA Studies Vinnem, J. E., Pappas, J. and Cox, R. A. ............ 819
Modelling of Material Damage and Production Loss Due to Accidents on Offshore Installations Brennan, G., Vinnem, J. E., Tveit, O., Skramstad, E., Nesje, J. D., Lund, J., Kragh, E., Pappas, J., Svendsen, T., Trbojevic, V., Bjorna, J. K., Cox, T., Kinsella, K. and Hide, A ............. 829
Probabilistic Escalation Modelling Korneliussen, G., Eknes, M. L., Haugen, K. and Selmer-Olsen, S ................................................... 837
C4: Maritime Safety
An Approach to Safety Assessment for Ship Structures Emi, H., Matoba, M., Yamamoto, N., Arima, T. and Omata, S. ............ 849
The Impact of Human Element in Marine Risk Management Gaarder, S., Rognstad, K. and Olofsson, M. ............ 857
Structured Simulator Training to Improve Maritime Safety Cross, S. J. ............ 867
C5: Maritime Safety
Risk of Environmental Impact from a Coal Harbour Hansen, N. J. ............ 877
Ship-Platform Collision Risk Analysis Haugen, S. ............ 885
Traffic Situation Assessment and Shore Based Collision Risk Recognition Baldauf, M. ............ 893
SATEST, a Method for Measuring Situation Awareness in Vessel Traffic Service Operators Wiersma, E., Heijer, T. and Hooijer, J. ............ 901
C6: Aviation Safety
Aviation Systematic Safety - Worldwide Accident Review Sayce, A. G. and Doherty, S. M. ............ 909
A Framework for Setting Risk Criteria in Aviation Nicholls, D. B. ............ 915
An Alternative Approach to Setting Reliability and Maintainability Requirements for Combat Aircraft Appleton, D. P. ............ 923
C7: Aviation Safety
Conditional Reliability Assessment of Aircraft Structures Pieracci, A. and Rackwitz, R. ............ 933
Decision Support for Aviation Safety Diagnostics: A Bayesian Approach Luxhoj, J. T. ............ 943
Aviation Systematic Safety - Occurrence Grading Scheme Sayce, A. G. and Doherty, S. M. ............ 951
C8: Railway Safety
System Reliability Model for Analysing Performance of a New Railway Formaniak, A. J., Muir, I. G. S., Nanu, L. and Pickett, A. D. F. ............ 961
Safety Communications and Management of Railway Spaces Open to the Public French, L. H. ............ 971
Application of a Quantified Risk Control Process Tong, D., Kwong, P. and Ho, C. W. ............ 977
C9: Railway Safety
Monitoring of the Railway Vehicles Availability Hudoklin, A. and Stadler, A. ............ 987
Reliability Centered Maintenance of Signalling on French Railways (SNCF) Dermenghem, J.-P. ............ 995
Quantitative Risk Analysis of Industrial and Transportation Activities in the Ravenna Area: A Second Report Spadoni, G., Leonelli, P. and Egidi, D. ............ 1003
C10: Automobile Reliability
IAM (Intelligent Actuation and Measurement) Model and its Maintenance Relevancy for Robotic Car Assembly Zeng, S. W. ............ 1013
Towards an Architecture for Safety Related Fault Tolerant Systems in Vehicles Dilger, E., Johansson, L. A., Kopetz, H., Krug, M., Lidén, P., McCall, G., Mortara, P., Müller, B., Panizza, U., Poledna, S., Schedl, A. V., Söderberg, J., Strömberg, M. and Thurner, T. ............ 1021
Using Semi-Parametric Bayesian Techniques in Reliability Validation Tasks Raoult, J. P., Gouget, N. and El Khair, A. ............ 1031
C11: Ship Equipment Reliability
Sharing Ships' Reliability, Availability, Maintainability (RAM) Information to Improve Cost Effectiveness and Safety Inozu, B., Schaedel, P. G., Roy, P. and Molinari, V. ............ 1039
Economic and Reliability Aspects of Simple and Redundant Configurations for Ship Electric Propulsion Systems Pierrat, L., Fracchia, M. and Manzini, R. ............ 1047
Life Assessment of CPPE's Power Plants Critical Components Goulão, A. and Enes, J. ............ 1057
C12: Reliability of Electronic Systems
New Methods of Field Reliability Analysis in a Harsh Environment Hernandez, R. ............ 1067
Reliability of Electronic Components - Failure Rates Prediction Methods Bot, Y., Herrer, Y., Korenfeld, H. and Gabay, Y. ............ 1075
EriView 2000 - A Tool for the Analysis of Field Statistics Oscarsson, P. and Hallberg, Ö. ............ 1083
C13: Reliability of Power Engineering Systems
Power System Reliability Assessments and Applications - A Review Allan, R. N. ............ 1093
Reliability Analysis of Distribution Systems Considering Sub-Unavailabilities Román Ubeda, J. and Rivier Abbad, J. ............ 1105
Liability for Electrical Accidents: Safety and Risk Stillman, R. H. ............ 1115
C14: Reliability of Power Engineering Systems
The New VDEW Statistics of Incidents - A Source for Component Reliability Indices of Electric Power Systems Hügel, R., Weber, Th., Lebeau, H., Böse, C. and Wellssow, W. H. ............ 1127
Different Ways to Process Failure Statistics for Use in Reliability Analysis of Electric Power Systems Lovers, G. G. and Opskar, K. A. ............ 1135
Deterministic and Probabilistic Approaches to the Dependability of Instrumentation and Control Systems: Two Complementary Assessments in Compliance with IEC 1069 Allain-Morin, G. and Hausberger, R. ............ 1143
C15: Tools and QRA Applications
A Software Method for the Preliminary Definition of Maintenance on Complex Systems Components Righini, R., Bottazzi, A., Fichera, C., Kladias, N. and Perasso, L. ............ 1155
STOPHAZ: A Support Tool for Operability and Hazard Studies Senni, S., Colombo, L. and Preston, M. L. ............ 1163
Computer Simulation and Risk Analysis around Unstable Cliffs. Application to a French Case El-Shayeb, Y., Verdel, T. and Didier, C. ............ 1171
D1: Structural Reliability and Maintenance of Bridges
Optimisation of Bridge Management Decisions Based on Reliability and Life-Cycle Cost Frangopol, D. M. and Estes, A. C. ............ 1183
Reliability Methods as a Complement to the Practice of Bridge Management Mancino, E. and Pardi, L. ............ 1195
Influence of the Fatigue Degradation and the Rheological Changes in Material on the Reliability of Bridges Sieniawska, R., Śniady, P. and Żukowski, S. ............ 1203
D2: Optimisation of Structural Systems
Optimal Allocation of Resources for Life-Cycle Management of Structures and Highway Networks Augusti, G., Ciampoli, M. and Frangopol, D. M ............................................................................ 1213
Development of a Maintenance Optimization Procedure of Structural Components in Nuclear Power Plants Bryla, Ph., Ardorino, F., Aufort, P., Jacquot, J. P., Magne, L., Monnier, B., Pitner, P., Vérité, B. and Villain, B. ............ 1221
Optimization of Thin-Walled Girders in Probabilistic Formulation Gibczyńska, T. and Bereza, P. ............ 1229
D3: Stochastic Models of Loads
Stochastic Modelling of Traffic Loads for Multilane Effect Evaluation Croce, P., Salvatore, W. and Sanpaolesi, L. ............ 1237
The Maximum of Stationary Non-Differentiable Gaussian Processes Breitung, K., Casciati, F. and Faravelli, L. ............ 1245
Robust Reliability of Mechanical Systems Ben-Haim, Y. ............ 1253
D4: Simulation in Structural Reliability
Multinormal Probability by Sequential Conditioned Importance Sampling Ambartzumian, R., Der Kiureghian, A., Ohanian, V. and Sukiasian, H. ............ 1261
Adaptive Use of Response Surfaces in the Reliability Computations of Mechanical Components Devictor, N., Marques, M. and Lemaire, M .................................................................................... 1269
From Partial Factors Method to Simulation-Based Reliability Assessment Concepts in Structural Design de Almeida, A., Marek, P. and Guitar, M ....................................................................................... 1279
Monte Carlo Simulation Challenges in Structural Mechanics: An Approach with PROMENVIR Marchante, E. M. ............ 1287
D5: Time Variant Reliability
A Concept for Deriving Partial Safety Factors for Time-Variant Reliability Rackwitz, R. ............ 1295
Time Variant Reliability of a Reinforced Concrete Column Holický, M. and Vrouwenvelder, T. ............ 1307
Probabilistic Estimation of Structure Life from the Point of View of Safety Borgoń, J., Klimaszewski, S., Smoliński, H. and Tomaszek, H. ............ 1315
D6: Structural Reliability of Dynamic Systems
Moment and Spectral Methods for Stochastic Parameter Estimation of Multi-Degree of Freedom Systems Battaini, M. and Roberts, J. B ......................................................................................................... 1323
Interval Prediction of Eigenvalues, Eigenvectors and Frequency Response Functions Teichert, W. H. and Székely, G. S. ............ 1331
Dynamic Reliability Evaluation of Hysteretic MDF Structures Considering Parameter Uncertainties Zhao, Y. G. and Ono, T ................................................................................................................... 1341
D7: Structural Control
Model-Based Diagnosis of Structural Systems Natke, H. G ....................................................................................................................................... 1351
On Controlled Linear Quadratic Gaussian Systems with Contaminated Observations Romera, R. and Villagarcia, T. ............ 1361
System Reliability Approach to Safety Analysis of Controlled Structures Battaini, M., Casciati, F. and Faravelli, L ....................................................................................... 1369
D8: Fire Safety
Risk Assessment of Building Fires Magnusson, S. E., Frantzich, H. and Kundin, J .............................................................................. 1379
Assessment of the Impact of Reliability of Fire Protection Systems on Life Safety in Buildings Yung, D. and Hadjisophocleous, G. V ............................................................................................. 1391
Fire Risk Analysis and Protective Measures for the Historic Site of Evora Serrano, M. B. and Ferreira, I. M .................................................................................................... 1399
D9: Offshore Safety (ESRA TC)
Fire Reliability of Skeletal and Plated Structures in Offshore Platforms Shetty, N. K. and Guedes Soares, C ................................................................................................ 1407
Reliability Based Factors for Fixed Steel Offshore Structural Design Efthymiou, M., van de Graaf, J. W., Tromans, P. S. and Hines, I. M ........................................... 1415
Improved Processes for Strength Assessment in the Requalification of Offshore Structures Di Cocco, N. R., Copello, S. and Piva, R ........................................................................................ 1423
Offshore Pipelines: Design Scenarios and Code Calibration by Reliability Methods Leira, B. J. ............ 1435
D10: Offshore Safety (ESRA TC)
Overview of Probabilistic Models of the Wave Environment for Reliability Assessment of Offshore Structures Bitner-Gregersen, E. M. and Guedes Soares, C ............................................................................... 1445
Reliability Analysis of the Primary Strength of an Oil Tanker - Combination of Vertical and Horizontal Wave-Induced Load Effects Casella, G. and Rizzuto, E. ............ 1457
Fatigue Reliability of Ship Hulls with Random Limit State Guedes Soares, C. and Garbatov, Y. ............ 1467
D11: Fatigue Reliability
An Engineering Methodology for Structural Integrity Assessments Using Probabilistic Fracture Mechanics Ruggieri, C. and Dodds Jr., R. H. ............ 1477
Fatigue Crack Monitoring in Parallel Time Scales Kordonsky, Kh. and Gertsbakh, I. ............ 1485
Models for Reliability and Management of NDE Data Simola, K. and Pulkkinen, U ............................................................................................................ 1491
D12: Structural Reliability
Reliability Analysis of the Stability of Slender Structures with Geometrical Imperfections Thieffry, P., Mitteau, J. C. and Lemaire, M .................................................................................... 1501
Reliability Analysis of a Stochastically Non-Linear Structural System Rozmarynowski, B. ............ 1509
STRAP: A Computer Tool for Structural Reliability Analysis Ciccotelli, M. and Meghella, M. ............ 1519
Probabilistic Approach of the Tunnel Face Stability Using the Monte Carlo Procedure Gamba, L. and Chambon, P. ............ 1527
D13: Seismic Risk and Concrete Structures
A New Statistical Model for Vrancea Earthquakes Using Prior Information from Earthquakes Before 1900 van Gelder, P. H. A. J. M. and Lungu, D ....................................................................................... 1535
The Probabilistic Evaluation of the Risk of Ground Movement Rezig, S., Favre, J. L. and Leroi, E. ............ 1543
Probabilistic Modeling of Concrete Structures in Bending for Cracking Analysis Bljuger, E. ............ 1551
An Application of the Material Combination Factor in the Design of RC Structures Krakovski, M. B. ............ 1559
D14: Variability of Material Properties
Statistical Properties of the European Production of Structural Steels Cecconi, A., Croce, P. and Salvatore, W .......................................................................................... 1567
Control of Concretes Quality Lechani, M., Ait Mohand, H. and Madiou, H. ............ 1575
Data and Consideration on the Variability of Geotechnical Properties of Soils Cherubini, C. ............ 1583
D15: Structural Reliability Design
Combining Information in the Field of Structural Engineering Scheiwiller, A. ............ 1595
Probabilistic Safety Design by a Generalized Two Parameter Approach Bertrand, G. and Haak, R. ............ 1601
Failure Estimation of Trusses by Fuzzy Sets Jendo, S. and Niczyj, J ...................................................................................................................... 1609
Volume 3
E1: Uncertainty Modelling (ESRA TC) (no entries)
E2: Uncertainty Modelling
MayDay: A Software Tool to Perform Uncertainty and Sensitivity Analysis. Capabilities and Applications Bolado, R., Moya, J. A. and Alonso, A. ............ 1621
Propagation of Uncertainty in Coherent Structures Fankhauser, H. R. and Rejdemark, K. ............ 1629
Intuition Theory and Risk Analysis Applications Islamov, R. T .................................................................................................................................... 1637
E3: Uncertainty Analysis
On the Bayesian Approach to Risk Analysis Aven, T. ............ 1647
Analysis of Uncertainty of Mechanical Components Dependability Data Nowakowski, T. ............ 1653
An Approximate Statistical Predictor for the MELCOR Code Mira, J. and Peña, D. ............ 1661
E4: Modelling Maintenance Costs
A Review of the Marginal Cost Approach for Order-Replacement Models for a Spare Unit Csenki, A. ............ 1671
The Cost Function for Periodically Tested Standby Units With Age-Replacement Maintenance Vaurio, J. K ...................................................................................................................................... 1681
The Zero Option Maintenance Strategy for Minimum Risk of Failure Smalko, Z., Jaźwiński, J. and Żurek, J. ............ 1691
E5: Maintenance Optimisation
On the Modelling of Condition Based Maintenance Scarf, P. A ......................................................................................................................................... 1701
On the Use of Equipment Criticality in Maintenance Optimization and Spare Parts Inventory Control Dekker, R. and Plasmeijer, R. ............ 1709
Availability and Spares Requirements Under the Renewal Theory for Single Repairable Units Martinez Garcia, J. ............ 1719
A Survey on the Interval Availability Distribution of Failure Prone Systems Smith, M. A. J., Aven, T., Dekker, R. and van der Duyn Schouten, F. A ..................................... 1727
E6: Maintenance Optimisation
Probability-Based Optimization of Maintenance of the River Maas Weir at Lith van Manen, S. E., Janssen, M. P. and van den Bunt, B .................................................................. 1741
Numerical Absolute & Constrained Optimization of Maintenance Based on Risk and Cost Criteria Using Genetic Algorithms Muñoz, A., Martorell, S. and Serradell, V. ............ 1749
Modelling of Test- and Inspection Procedures: a Case Study: a Branch of an Auxiliary Feed Water System (AFWS) Tombuyses, B. and Absil, P. ............ 1757
E7: Preventive Maintenance
Reliability Evaluation of Systems Subject to Partial Renewals for Preventive Maintenance Châtelet, E., Bérenguer, C. and Grall, A. ............ 1767
The Effect of Preventive Maintenance on Component Reliability Dorrepaal, J. W., Hokstad, P., Cooke, R. M. and Paulsen, J. L .................................................... 1775
Recent Results in Age Based Preventive Maintenance Planning Nachlas, J. A., Murdock, W. P., Degbotse, A. and Rao, N. R ....................................................... 1783
E8: Risk Based Maintenance
Maintenance Evaluation Using Risk Based Criteria Torres Valle, A .................................................................................................................................. 1793
Risk Ranking - The Opportunity for Cost Reduction Parkinson, W. J., Schloss, J. C., Hewitt, J. R., Hamzehee, H. G., Karpyak, S. D. and Tirsun, D. M. ............ 1801
Risk Level Indicators for Surveillance of Changes in Risk Level Øien, K., Sklet, S. and Nielsen, L. ............ 1809
E9: Reliability Data Banks (ESReDA)
The European Industry Reliability Data Bank: EIReDA Procaccia, H., Aufort, P. and Arsenis, S. ............ 1819
The ESReDA Guidebook on the Effective Use of Safety and Reliability Data with Particular Reference to "Intelligent" Selections from Raw Reliability Data Arsenis, S. P ...................................................................................................................................... 1827
Data Bank Quality Pettersson, L. ............ 1835
Development and Use of a Directory of Accident Databases involving Chemicals Pineau, J. P ....................................................................................................................................... 1841
E10: Data Analysis
Inference from Accelerated Life Tests with Generalised Life Distribution Function and with Data Driven Selection of an Ageing Model. Maciejewski, H .................................................................................................................................. 1849
Reliability of Mechanical Components - Accelerated Testing and Advanced Statistical Methods Augé, J.-C., Lallement, J. and Lyonnet, P. ............ 1859
Evaluating the Effectiveness of Different Burn-in Strategies Using Field Failure Analysis Nyborg, M. ............ 1869
E11: Data Analysis
Resampling and Bootstrap Methods in Analysis of Reliability Data Belyaev, Y. K .................................................................................................................................... 1877
Resampling and Simulation on Small-Size Samples Chevalier, M., Perez, D., Biasse, J.-M. and Mitterrand, J.-M. ............ 1883
Derivation of Frequency and Recovery Probabilities for Loss of Off-Site Power Accident Morozov, V. B. and Tokmachev, G. V. ............ 1889
E12: Bayesian Methods
Assessing the Failure-Rate-Prior-PDF after Component Modifications Aimed to Reliability Growth Clarotti, C. A., Lannoy, A. and Procaccia, H. ............ 1897
Bayesian Prediction of Risks Based on Few and Dependent Data Bigün, E. S. ............ 1903
Costs Associated with Life Testing, Truncated After the Occurrence of a Predetermined Number r of Failures De Souza Jr., D. I. ............ 1913
E13: Mathematical Methods in Reliability
Reliability Evaluation of Large Weibull Systems with Different Components Kołowrocki, K. ............ 1923
Approximation of the Reliability of Large Scale Systems with Random Number of Components Smolarek, L. ............ 1931
Explicit Shape of the Status and Reliability Functions of Any System Martinez Garcia, J. ............ 1937
Optimal Replacement Times for Systems of Multiple and Single Use Kopnov, V. A. ............ 1947
Statistical Method Based on the Weibull Distribution to Estimate the Life-Length of Banknotes León, F. ............ 1957
E14: Reliability and Stochastic Processes Control
A Simple Definition of Risk and Its Control Charts Girmes, D. H .................................................................................................................................... 1967
A Note on Safety, Reliability and Control of Systems von Collani, E ................................................................................................................................... 1977
Applications of Some Reliability Models Considering Covariates Kumar, D. and Westberg, U. ............ 1985
E15: Education and Training
Centre for Dependability and Maintenance at Luleå University of Technology - A Way to Improve Research and Education Klefsjö, B. and Kumar, U. ............ 1995
Stochastic Process Control and Quality Management Ramalhoto, M. F. and Guedes Soares, C. ............ 2003
An Interesting Real Problem to Teach Simulation Techniques to Engineering Students Villagarcia, T. and Romera, R. ............ 2011
F1: Systems Dependability
Using the Constraint Programming System Toupie for Qualitative Analysis of Industrial Systems Failures Lajeunesse, S. and Rauzy, A. ............ 2021
Modeling the Operating Environment Effects on Physically-Connected Redundant Components Marseguerra, M., Padovani, E. and Zio, E. ............ 2029
F2: Fault Tree Analysis
Better Fault Tree Analysis Via Sequential Modularization Schneeweiss, W. G. ............ 2039
BDD Based Fault-Tree Processing: A Comparison of Variable Ordering Heuristics Bouissou, M., Bruyère, F. and Rauzy, A. ............ 2045
RAMS Computation Algorithms using the Set of Good Paths of the System Lardeux, E ......................................................................................................................................... 2053
F3: Boolean Modelling
Handling Boolean Models with Loops Dutuit, Y. and Rauzy, A .................................................................................................................. 2063
An Easy-to-Implement Efficient Algorithm for Solving the Terminal Pair Reliability Problem Schneeweiss, W. and Wirsching, J .................................................................................................... 2071
Pseudo-Boolean Approach to Solving Reliability Problems Rai, S. and Trahan, J. L ................................................................................................................... 2079
F4: Dynamic Reliability
Method to Divide Dynamic Systems into Independent Sub-Systems for Reliability Computation Duhesme, E. and Laleuf, J.-C. ............ 2089
On the Application of the ISA Methodology to the Study of Accident Sequence Precursors Meléndez Asensio, E., Pérez Mulas, A. and Izquierdo Rocha, J. M. ............ 2097
Role of Time Delays in Event Trees Dessars, N. and Devooght, J ............................................................................................................ 2105
F5: Monte Carlo Methods in Reliability
Parameter Estimates for Monte Carlo Simulation of Contaminant Transport in Groundwater Marseguerra, M. and Zio, E ............................................................................................................. 2115
Variance Reduction for Simultaneous Monte Carlo Estimation of Many Markovian Unreliability Functionals Delcoux, J. L ..................................................................................................................... 2123
Variance Reduction Techniques in Monte Carlo Simulation Applied to Dynamic Reliability Labeau, P. E ..................................................................................................................... 2129
Using Discrete Event Simulation in Reliability Analysis Eisinger, S ......................................................................................................................... 2139
F6: Semi-Markov Analysis
ph-Distribution Method for Reliability Evaluation of Semi-Markov Systems Bousfiha, A. and Limnios, N ............................................................................................................ 2149
Application of Transport Equations to Model Reliability Problems of Dynamic Systems Becker, G., Camarinopoulos, L. and Micheler, M ........................................................................... 2155
Availability Analysis of a 1-out-of-2:G Non-Markovian System Operating under Fluctuating Environment Agarwal, M. and Chaudhuri, M ....................................................................................................... 2163
F7: Modelling Dependent Failures
Modeling Stochastically Dependent Failures Marseguerra, M. and Zio, E ............................................................................................................. 2173
Searching for Systemic Failure Modes Collins, R. J. and Thompson, R ....................................................................................................... 2181
VVER Specific Common Cause Failure Data Tokmachev, G. V .............................................................................................................. 2189
F8: Reliability of Degrading Systems
Reliability Analysis of Degraded System Configurations Vahl, A .............................................................................................................................. 2197
Simulator for Estimating Reliability of a System Subject to Imperfect Repair Lie, C. H., Hong, J. S., Kim, T. W., Baek, S. Y. and Lim, T. J ..................................................... 2205
Reliability Analysis of Non-Repaired Multistate Systems Korczak, E ........................................................................................................................ 2213
F9: Reliability in Networks
An Overview of Methodologies for Reliability Analysis of Multiexchange Networks Craveirinha, J. and Gomes, T ........................................................................................................... 2223
A Simulation Approach to the Estimation of Cutoff Connection Rate in the ATM Switching System Jeong, M. K., Koh, J. S. and Choi, S. H ......................................................................................... 2233
An Improved Method for Network Reliability Analysis Shen, Y.-L. and Tao, C.-X ............................................................................................................... 2239
Quantifying the Fault Tolerance of Multiple-Bus Based Systems Schneeweiss, W. G. and Küfner, H .................................................................................................. 2247
F10: Petri-Net Analysis
Analysis of a Sequential Non Coherent and Looped System with Two Approaches: Petri Nets and Neural Networks Pasquet, S., Châtelet, E., Thomas, P. and Dutuit, Y ....................................................................... 2257
Application of Petri-Nets-Based Method for Reliability Analysis of NPP Safety System Petkov, G. I ....................................................................................................................... 2265
Time-Dependent Availability Analysis for Ship Electric Redundant Propulsion Systems Using Petri Nets Pierrat, L. and Fracchia, M .............................................................................................................. 2275
F11: Dependability of Robot Systems
Failure Detection, Isolation and Recovery System Concept for the European Robotic Arm Bos, J. F. T. and Oort, M. J. A ........................................................................................................ 2285
Developing the Safety Case for Large Mobile Robots Seward, D., Quayle, S., Somerville, I. and Morrey, R ..................................................................... 2293
Failure Diagnosis and Analysis for an Autonomous Underwater Vehicle Christensen, P., Lauridsen, K. and Madsen, H. O ........................................................................... 2301
F12: Reliability in Design
Sensitivity of Passive System Behaviour Ricotti, M. E. and Zio, E ................................................................................................................. 2311
The Utility of Passive Systems for Safety and Reliability Spray, S. D. and Cooper, J. A .......................................................................................................... 2321
Improving Systems' Dependability Agarwal, J., Blockley, D. I. and Woodman, N. J ............................................................................ 2329
F13: Reliability in Design
Modelling Reliability Growth Through Innovation Ansell, J. I. and Phillips, M. J .......................................................................................................... 2341
A New Analysis Method for Reliability Design: Computational Graphic Modeling Karasawa, S., Nojo, S. and Watanabe, H ........................................................................................ 2353
Reliability Apportionment for Systems with Nonexponential Time to Failure Jaeger, M., Porat, Z. and Tzidony, D .............................................................................................. 2361
Probabilistic Reliability in Machine Tool Design Jia, Y., Cheng, X. and Jia, Z ............................................................................................................ 2367
F14: Case Studies of Systems Reliability
A Case Study on Reliability Analysis of a Multiexchange Telecommunication Network Gomes, T., Craveirinha, J., Baeta, I., Santos, R. and Pereira, J ...................................................... 2377
Reliability Analysis of Service Water Pumps by Poisson Point Processes with a Generalized Model for the Rate of Occurrence of Failures Saldanha, P. L. C., de Simone, E. A. and Frutuoso e Melo, P. F .................................................. 2385
Safety Analysis in Operation and Design of Open Pit Machines Mlynczak, M ..................................................................................................................... 2393
F15: Fuzzy Set Modelling
Applications of Fuzzy Inference Methods to Failure Modes Effects and Criticality Analysis (FMECA) Kara-Zaitri, C. and Fleming, P. V .................................................................................................... 2403
Fault Tree Analysis with Fuzzy Failure Rates Comotti, D., Di Giulio, A., Ghisleni, T., Sinisi, M. and Uguccioni, G ........................................... 2415
The Limited Applicability of Fuzzy Set Theory to Fault Tree Analysis--Negative Probabilities and other Anomalies Bischoff, K. and Bretschneider, M ................................................................................................... 2423
PREFACE
These volumes comprise the papers presented at the ESREL'97 Conference. The purpose of the ESREL Conferences is to provide a forum for the presentation of technical and scientific papers covering both methods and applications of safety and reliability in a wide range of industrial sectors and technical disciplines, enhancing cross-fertilisation between them. A broad view is taken of safety and reliability, which includes essentially probabilistically based methods or, more generally, methods that deal with the quantification of the uncertainty in our knowledge of the real world and with decision making under this uncertainty. The areas covered range from design and product liability, to availability, reliability and maintainability, to assessment and management of risks to technical systems, health and the environment, and to mathematical methods of reliability and statistical analysis of data.

The annual ESREL Conferences stem from a European initiative merging several national conferences into a pan-European safety and reliability event under the auspices of ESRA, the European Safety and Reliability Association. ESRA was started up in the late 1980s by the European Commission, and in 1992 it was established in Brussels as a non-profit-making international association, aiming at the advancement and application of safety and reliability technology in all areas of activity. ESRA strives to establish co-operation and mutual exchange of information between national and international professional societies, standard-setting organisations, industry and research groups, aiming at the advancement of the methods and applications of safety and reliability. The ESREL Conferences started from national conferences in France and the United Kingdom, λμ7 in Brest in 1990 and REL'91 in London. From 1992 onwards they acquired a more European dimension, and they have been held successively in Copenhagen, Munich, La Baule, Bournemouth and Crete from 1992 to 1996, the latter in co-operation with IAPSAM.

The 1997 Conference is organised by Instituto Superior Técnico of the Technical University of Lisbon. The technical programme has been the responsibility of an international committee of about 50 specialists from different countries, industries and technical disciplines, who reviewed the abstracts and then the full papers prior to their final acceptance. The starting point was 510 abstracts, of which 310 were accepted and from which 270 papers have been included in these proceedings for presentation at the Conference. It is important to stress that the present conference has, in fact, gone beyond the European dimension insofar as it includes papers by authors from 35 countries, 15 of which are non-European.

The organisation of the books follows closely the sessions of the Conference, which in turn is based on six parallel sessions. Each of the three volumes of the proceedings contains the papers from two parallel sessions.
The first session deals with Risk Management, including aspects of risk perception, safety culture, human factors and decision support systems. In addition, it includes topics of software reliability and of safety critical systems. The second session concentrates on applications of quantified risk assessment to the nuclear industry and to industrial safety.

The second volume includes a session dealing mainly with transportation safety, including aviation, railway, automobile and maritime, as well as aspects of offshore safety. It also covers aspects of reliability of electronic and power engineering systems. The other session deals mainly with aspects of structural reliability, which cover assessment methods and the modelling of loads and material properties, as well as applications to different types of structures such as bridges, buildings, concrete structures, highway systems, offshore and ship structures.

The last volume includes sessions on reliability based maintenance and on systems dependability. The first of the two sessions has papers on uncertainty modelling, on maintenance and on statistical analysis of data. The other includes topics such as fault-tree analysis, dynamic reliability, Boolean and semi-Markov analysis, and reliability in design.

It is hoped that such a wide programme will ensure that the conference fulfils its aim of being a forum for engineers, scientists, managers and regulators linked to different industries and technical disciplines to meet and exchange their knowledge and experience in the field of safety and reliability engineering.

In concluding, I would like to thank the authors and all those who have contributed to the organisation of the conference, in particular the members of the advisory board, the technical programme committee, the session organisers and chairpersons, the local organising committee and the conference secretariat. All have contributed to the final outcome of the conference and to the contents and organisation of these books.

Carlos Guedes Soares
ESREL '97 Conference Chairman
The 1997 Annual ESRA Conference
Organised by Instituto Superior Técnico
in Collaboration with
ESRA - European Safety and Reliability Association
ESReDA
Danish Risk Assessment Society
Institute of Quality Assurance
Institut de Sûreté de Fonctionnement
Norwegian Association for Risk and Reliability Analysis
Ordem dos Engenheiros
SRE - Scandinavian Chapter
The Safety and Reliability Society
VDI-GSP - Verein Deutscher Ingenieure
3ASI - Associazione degli Analisti di Affidabilità e Sicurezza
Sponsored by
Commission of the European Communities
Fundação Calouste Gulbenkian
Junta Nacional de Investigação Científica e Tecnológica
ESREL'97 Conference Chairman
C. Guedes Soares, PT
Advisory Board
D. Harvey, UK; I. Watson, UK; P. Kafka, DE; M. Cottam, UK; K. Petersen, DK; P.R. Leclercq, F; I. Papazoglou, GR
Technical Programme Committee
J. Anselmo, B; T. Aven, N; I.C. Bacivarov, RO; J.F. Barbet, F; A. Bareith, HG; P. Barrett, UK; G. Becker, DE; J. Biernat, PL; A. Birolini, CH; D.I. Blockley, UK; M. Brown, UK; L. Camarinopoulos, GR; C.A. Clarotti, I; R. Cooke, NL; M. Cottam, UK; J. Craveirinha, PT; C. Dennis, UK; J. Devooght, B; M.S. Elzas, NL; E. Fadier, F; L. Faravelli, I; T.A.W. Geyer, UK; T. Gulbrandsen, N; A.R. Hale, NL; L. Harms-Ringdahl, SE; M. Holický, CZ; J. Holmberg, FN; A. Hudoklin, SL; R. Islamov, R; P. Kafka, DE; B. Klefsjö, SE; K. Kołowrocki, PL; H. Kortner, N; P.R. Leclercq, F; V. Legát, CZ; M. Lemaire, F; B. Littlewood, UK; D. Lungu, RO; S. Lydersen, N; M. Marseguerra, I; B.R. Martin, UK; S. Martorell, E; I.G. McKinley, CH; J. Moltoft, DK; R. Nevell, UK; M. Newby, UK; I. Papazoglou, GR; A. Pasquini, I; P. Pyy, FN; K.E. Petersen, DK; C. Pietersen, NL; H. Procaccia, F; R. Rackwitz, DE; J-F. Raffoux, F; M.F. Ramalhoto, PT; M. Rausand, N; V. Rouhiainen, FN; A.S. Bustamante, E; G.I. Schuëller, A; A. Seppälä, FN; W. Schneeweiss, DE; J-P. Signoret, F; P. Sniady, PL; A. Sols, E; B.G.J. Thompson, UK; J.K. Vaurio, FN; J.E. Vinnem, N; E. Wolfgang, DE
Session Organisers
T. Aven, E. Fadier, A.R. Hale, L. Harms-Ringdahl, J. Holmberg, P. Kafka, J. Moltoft, M. Newby, I. Papazoglou, A. Pasquini, K.E. Petersen, R. Rackwitz, V. Rouhiainen, W. Schneeweiss, J.P. Signoret, B.G.J. Thompson, J.K. Vaurio, I. Watson
Local Organising Committee
A. Augusto Fernandes, P. Mendes, J.F. Craveirinha, R. Teixeira Duarte, A.P. Teixeira, V. Gonçalves Brito, M. Bouza Serrano, M.F. Ramalhoto

Conference Secretariat
M. Fátima Pina, Cristina Ribeiro, Sandra Robalo
A1: Risk Based Regulations
THE REGULATORY REVIEW OF SAFETY-RELATED INFORMATION REGARDING UNDERGROUND RADIOACTIVE WASTE DISPOSAL IN ENGLAND AND WALES
B G J Thompson and C R Williams
The Environment Agency of England & Wales
ABSTRACT

This paper builds upon earlier related contributions to the ESREL/PSAM Conference series and elsewhere, to outline the nature of a safety case based upon probabilistic risk analysis and the options for its assessment by a regulatory authority. Much has been published since the late 1970s concerning the methods and tools used to perform technical analyses of radiological performance by any single party in a democratic society. However, little attention seems to have been given in the literature to the way in which such analyses, undertaken independently by a regulator, are best used to probe the safety case and its underlying arguments in a coherent and traceable manner. Recent experience from the review of preliminary safety-related information for a proposed deep repository in England indicates that further consideration should be given to the regulatory methodology, to possible difficulties due to oversimplified analyses, and to careful preparation for Public Hearings.
KEYWORDS

Regulation, risk, assessment, radioactive waste disposal, probabilistic risk assessment, public decision making, decision analysis, Her Majesty's Inspectorate of Pollution, HMIP.
INTRODUCTION
Papers published in the earlier ESREL/PSAM international conferences, for example Thompson (1994), Stearn (1994), Sumerling and Read (1994), Thompson, Smith and Porter (1996), Ashworth and Porter (1996) and others, provide an ongoing account of the development and application of the post-closure risk assessment capability of the Environment Agency1, in connection with its duties to regulate the underground disposal of solid low and intermediate level radioactive wastes. The present paper builds upon the arguments advanced in these related contributions, and elsewhere, to outline the nature of a safety case based upon probabilistic risk analysis, and the options for its assessment by a regulatory authority. Although this experience has been related specifically to nuclear waste disposal, the general principles and methods outlined, together with issues of concern, should be of interest to those constructing safety-related arguments in other application areas. This paper is therefore intended to underpin a complete session on risk-based regulation at ESREL '97.
1 Originally carried out by Her Majesty's Inspectorate of Pollution (HMIP) which, on 1 April 1996, became part of the Environment Agency of England and Wales.
THE REGULATORY CONTEXT

In the United Kingdom, no person may dispose of radioactive waste except in accordance with an authorisation under the Radioactive Substances Act 1993 (RSA93), except where the waste is excluded by the Act or by an Exemption Order. The developer of a deep repository, referred to below as "the proponent", will be required to apply to the relevant Agency - namely the Environment Agency for a site in England and Wales, or the Scottish Environment Protection Agency for a site in Scotland - for authorisation of disposals on or from the repository site. Authorisation under RSA93 would include both the emplacement of the primary solid waste, without intent to retrieve it at a later time, and the discharges of any secondary liquid or gaseous radioactive arisings. There is no statutory requirement for the proponent to make an application for an authorisation under RSA93 at any particular time, although he must have such an authorisation actually to dispose of radioactive waste. The proponent may choose to make an application to the Agency early in his programme, or much nearer the time that waste is to be emplaced. The proponent is solely responsible for preparing and presenting the Agency with a satisfactory safety case, as part of an application for authorisation. The Agency is responsible for examining the quality of the scientific basis of this case, the way in which it has been applied, the quality and traceability of the data used, the way in which uncertainties have been treated, and, eventually, the conclusions offered by the proponent in regard to the safety of the proposed disposal arrangements.

A proposed repository would require planning permission under the Town and Country Planning Act 1990, in addition to authorisation under RSA93. Planning applications are made to the local planning authority, but in the case of a proposed repository the application would be called in by the relevant Environment Secretary of State and a public inquiry would be held. The Agency is likely to be asked to provide the inquiry with a technical view on the proponent's proposals which is both informed and independent. HM Nuclear Installations Inspectorate (HMNII) of the Health and Safety Executive is responsible for the safety of operations on nuclear licensed sites as defined in the Nuclear Installations Act 1965. While the present paper is not concerned with those operational nuclear safety aspects, it may be noted that any future repository would be a nuclear licensed site, and that the Agency would consult and cooperate with HMNII to ensure that the requirements of each regulatory organisation would be met by the proponent.
THE PROCESS OF REGULATORY ASSESSMENT

The nature of the safety case
A safety case may conveniently be considered as comprising four related aspects:

(a) A knowledge base of repository design, waste characteristics and quantities proposed for disposal, and the information from the geological and other parts of the site-specific investigations, all set against the wider scientific and technical source literature. This may be sub-divided into: (a1) general information, possibly reviewed and agreed beforehand, including released computer software, development and test details and quality regime; and (a2) site-specific and repository-specific information that comes only from the proponent and must be reviewed with a separate time-table.
(b) The description of the trace of all decisions, assumptions, etc., made during the development of the safety argument. This may be visualised as a decision or logic 'tree', or perhaps a 'graph'. It should make clear the use of evidence and judgements in an explicit manner that allows the results from the safety calculations to be traced back to source in a justifiable way. Each assumption may be a potential cause of bias on the results of orthodox calculations, and a means of evaluating these biases is necessary. See Thompson, Gralewski and Grindrod (1995).
(c) The quantitative analyses themselves, commonly termed the 'performance assessment', with their resulting estimates under uncertainty of radiological doses and risks during the long-term post-closure period. Other regulatory decision variables may also be considered, Environment Agency et al (1997), although the Royal Society believes that these are usually less satisfactory than dealing directly with radiological risk; see Royal Society (1994).

(d) Depending on the stage of the regulatory process, and the overall level of understanding gained to support the case, it may also be necessary to supply further arguments. These arguments would seek to justify that the proposed further information gathering from site investigation, further design work and/or the results of research and development generally will reduce ignorance and will clarify the current uncertainties. This is the problem of 'How much information, and of what kind, is sufficient to provide confidence in the analysis?', also called the 'information closure' problem. See, for instance, Bonano and Thompson (1993).

Recapitulating the previous ESREL '96/PSAM III conference papers, as summarised by Thompson, Smith and Porter (1996), experience in the UK suggests that there are three main stages of work leading to possible authorisation for disposal of solid low and intermediate level radioactive wastes:

Stage 1: Developing a system model-based method of assessment employing Monte Carlo simulation to account for uncertainties. However, industry had never submitted its own safety cases to HMIP, which therefore had to construct surrogate 'cases' for hypothetical facilities to indicate what might be expected, at that stage of the subject, in a safety argument.

Stage 2: A proponent is likely, at an intermediate state of site investigation and of the associated research programme, to submit a planning application to construct a deep repository. The long-term safety arguments will be made on the basis of a so-called Detailed Environmental and Radiological Analysis (DERA). A large-scale Public Inquiry will be held at a venue local to the proposed site. By definition, the safety case will be based on interim information. Regulators will therefore need to be satisfied on all four aspects outlined earlier before offering a provisional view concerning possible future authorisation.

Stage 3: In due course a Final Environmental and Radiological Analysis (FERA) will be submitted for authorisation to dispose of wastes under the Radioactive Substances Act (1993). In contrast to the DERA to be submitted in Stage 2, the FERA should not require further examination of fundamental scientific or engineering issues, and should provide a robust safety argument. Further work to confirm the safety case - within the same conceptual frame as that provided in the FERA - is likely to be required as the repository is operated over a period of perhaps 50 years, and will involve regular re-assessments as a basis for permitting continued emplacement of waste and other aspects of repository operation (e.g. backfilling vaults) to proceed.
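As an illustration of aspect (b), the sketch below shows one minimal way a traceable decision record might be organised so that a quantitative result can be walked back through its assumptions to source documents. The class, its field names and the example entries are invented for illustration and do not represent any actual data model used by the Agency (Python):

    # Minimal sketch of a traceable decision record (aspect (b) above);
    # names and example entries are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class Decision:
        statement: str                                   # assumption or modelling choice
        evidence: list = field(default_factory=list)     # source documents cited
        parents: list = field(default_factory=list)      # earlier decisions it rests on

        def trace(self, depth=0):
            """Walk back from a result to its sources, printing the chain."""
            print("  " * depth + self.statement + "  [" + "; ".join(self.evidence) + "]")
            for p in self.parents:
                p.trace(depth + 1)

    # Example chain: a dose estimate traced back to site data.
    site = Decision("Hydraulic conductivity range 1e-9..1e-6 m/s",
                    ["Site investigation report S-12 (hypothetical)"])
    model = Decision("2-D steady-state groundwater model adequate",
                     ["Expert elicitation minutes E-3 (hypothetical)"], parents=[site])
    result = Decision("Peak annual dose below target",
                      ["PSA run set"], parents=[model])
    result.trace()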
Options for regulatory assessment

Broadly, during Stages 2 and 3, a regulator could adopt one of four levels of assessment capability:

(i) Purely responsive: awaiting, according to a general programme, the arrival of each submission from the proponent, and only then taking action to seek independent review from in-house or external expertise.

(ii) Maintaining a continuing awareness of the subject through a Panel of Experts who can prepare themselves for a more thorough review on the basis of a general research programme. This is termed the Scientific and Technical Review (STR) capability and includes review of both repository engineering and site characterisation.
(iii) In addition to the STR, introducing some ability to check the quantitative aspects of the submission by developing and maintaining expertise, independent of the proponent, in computational modelling etc.; at least being able to re-run software to verify the calculations performed within the proponent's own general assumptions reviewed under approach (ii).

(iv) In addition to the capability of level (iii), developing and applying a completely Independent Performance Assessment (IPA) capability, including methods for decision tracking and bias evaluation. This latter approach was adopted by HMIP, Thompson (1994).
Option (iv) preserves regulatory options for any specific application in the future. However, it has never been intended that UK regulators should produce a full safety case. That safety case is entirely the responsibility of the proponent, but it is difficult for the regulators to judge such a case without knowledge of what is involved, based on "hands-on" experience of carrying out essential aspects of a safety analysis.

RECENT ASSESSMENT-RELATED EXPERIENCE

The second stage of the process is under way after a period of site investigation by the proponent. It was expected originally that the proponent would submit a planning application for repository construction in Autumn 1992, leading to a Public Inquiry in 1993/94. In response to this, HMIP, following a comprehensive contract tendering process during 1991, commissioned two lead contractors to undertake assessments of safety documentation and to provide support during the Inquiry hearings. These two principal contracts were:
the Scientific and Technical Review (STR) and the Independent Performance Assessment (IPA) project. These were assisted by two further contracts: the Assessment Information Management System (AIMS), to provide a bibliographic database and regulatory correspondence logging procedure; and the Quality Assurance (QA) contract, to review and extend the existing QA/SQA system and its documentation, and to review the proponent's approach as required.

Although no formal application to dispose of wastes would have been expected at the time of the Inquiry hearings, HMIP would have been required to give a provisional view as to the likelihood of such an authorisation being granted in due course, once the full site study had been completed and a FERA safety case submitted during the third stage outlined above.

Under the Scientific and Technical Review, detailed reviews of preliminary safety-related documentation were conducted by a panel of recognised experts, covering about twenty different disciplines. This helped HMIP to understand better how to conduct such reviews and to document them. It has also, we believe, been of help to any proponent to learn from this first experience of being exposed to such a critique, especially with regard to issues such as the traceability of results and conclusions back, through a chain of assumptions and decisions, to source literature and data.

The Independent Performance Assessment project produced two partial post-closure probabilistic systems analyses (PSA) based upon early site data and other information. During a first phase of work (February 1992 to April 1993) a simple PSA was completed for steady-state conditions, based on the current climate. This central activity was augmented by a number of ancillary studies aimed at developing understanding of the site-specific effects of certain key processes. It was followed by a second phase of work (May 1993 to September 1994) that focused on the hydrogeological performance of the site and the release and transport of radionuclides in groundwater. Two-dimensional and fully three-dimensional groundwater models were constructed using the NAMMU finite element software, although the available data were still too limited for reliable calibration of these models. A number of scoping calculations were carried out for both present-day conditions and a wide range of possible future site conditions (based on likely changes in climate, interpreted from records of the Quaternary period). These calculations explored the implications of the uncertainty at that time in our knowledge of the spatial configuration of major faults and the stratigraphy of the site, of the associated hydrogeological properties and boundary conditions, and of their changes with time.
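The scoping character of such calculations can be illustrated with a deliberately simple sketch of parameter uncertainty propagation: sampling an uncertain hydraulic conductivity K and converting it, via the Darcy velocity v = K i / n_e, into an advective travel time t = L / v. This is a toy stand-in for the NAMMU finite element analyses described above, not a description of them; all numerical values are invented:

    # Toy propagation of hydrogeological parameter uncertainty to advective
    # travel time; illustrative only, not the NAMMU analyses used in the project.
    import random, statistics

    random.seed(1)
    path_len = 500.0      # path length to the biosphere, m (assumed)
    gradient = 0.01       # hydraulic gradient (assumed)
    porosity = 0.1        # effective porosity (assumed)

    times = []
    for _ in range(1000):
        # log-uniform hydraulic conductivity, 1e-10 to 1e-7 m/s (assumed range)
        K = 10 ** random.uniform(-10.0, -7.0)
        v = K * gradient / porosity                   # Darcy velocity, m/s
        times.append(path_len / v / (3600 * 24 * 365))  # travel time, years

    times.sort()
    print("median travel time %.3g years" % statistics.median(times))
    print("5th percentile     %.3g years" % times[49])
    print("95th percentile    %.3g years" % times[949])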
Even under present-day climatic conditions, assumed to be preserved over the long-term post-closure period, some three-dimensional calculations indicated, as shown for instance in Fig. 1, that a plume of contaminant could deviate from the two-dimensional vertical section assumed in earlier estimates.
[Figure 1 shows, in plan view, the present-day coastline, the proponent's groundwater model section and a notional repository, with the Iodine-129 plume trajectory at 10,000 and 300,000 years; concentration contours at 100 m BOD, transient calculation, 'best estimate' faults.]
Figure 1. Estimated movement of contaminant assuming constant present-day climate.

A "state-of-the-art" PSA was then carried out using the TIME4 and VANDAL Monte Carlo simulation software, Thompson and Sagar (1993), incorporating a fully three-dimensional groundwater flow and radionuclide transport model that evolved over time under the influence of climate-driven changing surface boundary conditions (for instance, topographical changes, sea level, recharge, etc.). Statistical convergence of the sample mean dose (H) was well indicated after 940 realisations, using Simple Random Sampling, as the results over a period of 500,000 years show in Figure 2, which also compares the risks from the 95 %ile dose (H0.95) and from the realisation yielding the highest peak dose (Hmax).
[Figure 2 plots relative annual 'risk' against years after closure (0 to 500,000) for Iodine-129 via the drinking water pathway to a hypothetical exposed group above the repository, comparing the 'worst case' Hmax curve, the sample mean H curve with its Guttman 90% confidence interval, and the 95 %ile H0.95 curve.]
Figure 2. Comparison of different risk estimates.
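A minimal sketch of the estimators compared in Figure 2 is given below: a running sample mean to monitor convergence under Simple Random Sampling, plus the percentile and worst-case alternatives. The lognormal dose model and the risk conversion factor are invented stand-ins for the TIME4/VANDAL calculations:

    # Sketch of the three risk estimators in Figure 2, under simple random
    # sampling; the dose model and conversion factor are invented.
    import random

    random.seed(0)
    GAMMA = 0.05                                  # assumed risk-per-dose factor, per Sv

    doses, running_mean = [], []
    for n in range(1, 941):                       # 940 realisations, as in the text
        d = random.lognormvariate(-6.0, 1.5)      # sampled peak dose, Sv/y (invented)
        doses.append(d)
        running_mean.append(sum(doses) / n)       # convergence of the sample mean

    doses.sort()
    mean_dose = running_mean[-1]
    p95_dose = doses[int(0.95 * len(doses))]      # 95th percentile dose
    max_dose = doses[-1]                          # 'worst case' realisation

    print("risk from mean dose    %.2e /y" % (GAMMA * mean_dose))
    print("risk from 95%%ile dose  %.2e /y" % (GAMMA * p95_dose))
    print("risk from max dose     %.2e /y" % (GAMMA * max_dose))

Plotting the running mean against n would reproduce the kind of convergence check described in the text.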
Figure 3 compares the results from the simple Phase 1 simulation with the estimates obtained during Phase 2 of "mean risk" (γH) for different surface locations of a hypothetically exposed group drinking water from a sampling well.
[Figure 3 plots relative annual individual 'risk' (γH, based upon arithmetic mean dose) against years after closure (0 to 5 x 10^5) for several locations of the hypothetical exposed group - above the repository footprint, near the present coastline and southwest of the repository - comparing Phase 2 (three-dimensional flow, climate change) with Phase 1 (two-dimensional flow, present climate).]
Figure 3. Illustrative results from Monte Carlo analysis of the groundwater pathway.

The work concluded with an evaluation of "internal bias", see Thompson, Gralewski and Grindrod (1995), by a series of further detailed finite element calculations performed within a fractional factorial experimental design, to give a first-order estimate of the combined influences of five factors omitted from the Phase 2 PSA simulation. The influence on the sample mean and the 95 %ile estimates was inferred broadly, as shown in Figure 4, from evaluation points (X, Y and Z). The sources of potential bias were elicited in group sessions and recorded on a relational database. The latter was used to ensure full traceability of the chain of decisions taken at different stages of an assessment, Grindrod (1996).

DISCUSSION OF ISSUES

A number of issues of potential regulatory concern can be highlighted on the basis of this experience to date.

The Possible Dangers From Oversimplified Analyses

In the interests of communicating to and convincing as wide an audience as possible, it is desirable to find the simplest and most robust arguments to justify any safety case. However, experience indicates that this desire may lead to a dangerous naivety. The use of Occam's Razor, when choosing between equally well supported arguments that do not neglect the available evidence, is an excellent principle to invoke. It is right to simplify, but dangerous to over-simplify.
[Figure 4 plots relative annual individual 'risk' (γH) against years after closure (0 to 500,000), comparing the orthodox PSA curve with the curve obtained with 'bias' evaluation.]
Figure 4. Effect of modelling assumptions on risk estimates.

For example, the Dry Run 3 trial assessment, see for instance Thompson and Sagar (1993), showed that the then conventional tendency to omit quantitative consideration of the implications of sequences of long-term future climatic changes could lead to a serious underestimate of risk. Such effects may not be significant for other sites, but should evidently be considered fully in any viable safety case. In the present study, Figure 3 shows that an orthodox two-dimensional steady-state analysis gave no indication of the high estimated risks to a group situated above the repository (compare curves A and B), nor of the strong spatial variation of exposure to a drinking-water recipient (compare curves B, C and D).
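The mechanics of the fractional factorial bias evaluation described earlier (five factors omitted from the Phase 2 simulation) can be sketched as a two-level 2^(5-1) design with defining relation E = ABCD; the linear response function standing in for the detailed finite element calculations is invented:

    # Mechanics of a 2**(5-1) fractional factorial design (generator E = ABCD);
    # the response function is an invented stand-in for the FE calculations.
    from itertools import product

    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        e = a * b * c * d               # defining relation E = ABCD (resolution V)
        runs.append((a, b, c, d, e))    # 16 runs instead of the full 32

    def response(levels):
        # Hypothetical bias response (log units) as a function of factor levels.
        a, b, c, d, e = levels
        return 0.4 * a + 0.1 * b - 0.2 * c + 0.05 * d + 0.3 * e

    y = [response(r) for r in runs]

    # First-order (main) effect of each factor: contrast divided by N/2.
    for k, name in enumerate("ABCDE"):
        effect = sum(r[k] * yi for r, yi in zip(runs, y)) / (len(runs) / 2)
        print("main effect of %s: %+.2f" % (name, effect))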
The interpretation of existing regulations

"Risk" is currently capable of many interpretations and, for instance, may be related to the probability distribution of health detriment in a manner analogous to that used for industrial safety, where curves of frequency versus consequence are often employed to guide decision making. The entire distribution can be compared to particular expressions of aversion to larger consequences using, for convenience, the complementary cumulative probability distribution (CCDF) of conditional risk (γH), as illustrated in Figure 5. Using logarithmic scales, a number of envelopes are shown that might represent different attitudes to risk. Curve A expresses a stakeholder opinion that the 10^-6 p.a. target should always be met, whilst curve B only requires that 10^-4 p.a. never be exceeded. Curves C1, C2 and C3 retain a conviction that values below the target are of no concern, but these sloping lines represent an increasing degree of aversion to incurring larger risks above that target. Such curves can be derived from, or related to, personal loss functions. As long as the estimated CCDF lies to the left of the envelope of concern, the particular stakeholders should feel able to accept the safety argument, provided that all other conditions, assumptions, judgements, etc., underlying the overall case are also acceptable in their view.

The most commonly used interpretation of risk seems to be that based upon the arithmetic sample mean dose (H) multiplied by the ICRP risk-to-dose conversion factor (γ), see Thorne (1988). This interpretation implies equal aversion (or lack of it) to doses on either side of the mean. But this might not be what is desired. It might therefore be appropriate to apply an asymmetric "loss function". One way of doing this would be to estimate risk based on a percentile of doses, Smith (1993). If, for instance, the 95 %ile is considered (H0.95), then usually, but not always, higher risks are estimated, as in the example shown in Figure 2. Basing the risk upon the simulation run that gives the highest maximum dose (Hmax) yields much larger values, as the figure illustrates. Such a result had been pointed out in 1986, see Thompson and Sagar (1993), to caution against relying upon "worst case" bases for decisions.
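The envelope comparison just described reduces to a simple computation: form the empirical CCDF of conditional risk and check that it lies at or below the chosen envelope everywhere. The sketch below assumes an invented risk sample and a power-law envelope of the sloping (C-type) kind; the target, slope and distribution parameters are all illustrative:

    # Sketch of testing an estimated CCDF of conditional annual risk against a
    # sloping (C-type) envelope; all numbers are invented for illustration.
    import random

    random.seed(2)
    GAMMA = 0.05          # assumed risk-per-dose factor
    TARGET = 1e-6         # assumed annual individual risk target, per year

    risks = sorted(GAMMA * random.lognormvariate(-12.0, 2.0) for _ in range(1000))

    def ccdf(x):
        """Empirical probability that conditional risk gamma*H exceeds x."""
        return sum(1 for r in risks if r > x) / len(risks)

    def envelope(x, slope=1.0):
        """Allowed exceedance probability: 1 below the target, then a
        power-law decrease expressing aversion to larger risks."""
        return 1.0 if x <= TARGET else (TARGET / x) ** slope

    # The case 'passes' this envelope if the CCDF lies at or below it everywhere.
    passes = all(ccdf(x) <= envelope(x) for x in risks)
    print("CCDF lies inside the envelope:", passes)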
[Figure 5 plots, on logarithmic scales, the probability of exceeding a given conditional annual individual 'risk' γH (1E-16 to 1E-04), showing an illustrative CCDF at about 3.6 x 10^5 years together with possible envelopes A, B, C1, C2, C3 and D.]
Figure 5. Illustrative results compared with different possible 'risk' envelopes.
When the full consequence distribution is considered, as required by current US regulations, see for example Anderson et al (1997), then, from Figure 5, it is apparent that, at 360,000 years, the safety case might 'fail' against envelopes A or D but would 'pass' with respect to B or C.

Use of regulatory analyses to review a safety case
Much has been published since the late 1970s concerning the methods and tools used to perform technical analyses of radiological performance, as may be carried out by any single party in the process of decision making in a democratic society. Key concerns that appear to have received little attention in the literature relate to the way in which such analyses, undertaken independently by a regulator (or an interest group), are best used to probe the safety case and its underlying arguments in a coherent and traceable manner.

Figure 6 illustrates the overall process of assessment by a regulator (activities below the diagonal) of a safety case put forward by a proponent (activities above the diagonal). The first three aspects (a, b, c) of the case are represented, with parallel development by a regulator of selected parts of a similar, but independently conducted, technical analysis. As explained more fully in Sumerling and Read (1994), for instance, it is the sole responsibility of the proponent to produce a full safety case, and a comprehensive set of quantitative results would not be produced by the regulator. Hence, the decision logic would be detailed only for those regulatory activities carried out. Either player would be likely to follow an iterative procedure of technical analysis similar to that shown in Thompson (1994), for instance, which leads to an orthodox PSA result. Dose percentiles versus time might be compared, as shown, for (say) a particular sub-set of radionuclides and a key exposure pathway. Such results, obtained independently, may disagree significantly, because conventional analyses are based upon conceptual and computational models that necessarily have systematic biases due to lack of knowledge of system structure and of process representations. Bias is also likely due to numerical approximations and the unavailability of quality-assured software that incorporates the latest understanding. The hypothesis is that if such "bias evaluation", as described above for the second phase of the present project, were applied in a consistent manner to both sets of orthodox PSA results, it could bring these closer together, as sketched in Figure 6. A corresponding traceable record of the further decisions involved would also be examined, Thompson, Gralewski and Grindrod (1995). The process of regulatory review of the three aspects would then lead to the technical information upon which decisions would, in part, be made. If the overall programme of site investigation and other studies is still incomplete, as in Stage 2 described earlier, the DERA case may also contain a strategy (d) to reduce ignorance and to clarify key uncertainties, thereby ensuring adequate overall confidence in the final (FERA) arguments. Such a strategy would be expected to reduce the amount of bias to be evaluated in this later safety case.
[Figure 6 is a schematic divided by a diagonal: above it, the proponent's activities - the knowledge base (a), the comprehensive decision logic (b) and the resulting risk percentiles versus time compared with the target (c); below it, the regulator's parallel development of decision logic and quantitative results for selected aspects only, leading to a regulatory view of those selected aspects.]
Figure 6. Illustration of the process of safety case assessment.

Preparation for the formal regulatory review
Possible advantages for both parties (i.e. developer and regulator) could occur if a number of issues were resolved before the formal submission of the DERA (that is, information (a) and safety case (b, c and d)). Hence, agreement might be sought on:

a given format for recording and articulating the decision logic, and how to access this through related computer-based facilities;
a given format for site geological and related information, as, for instance, in the use of common three-dimensional data structures, and in related documents;
overall safety case document structure and content;
QA arrangements for software development and use, and for gathering and interpreting site data;
a compatible treatment of bias, to enable the parties involved at least to attempt reconciliation of lines of argument and results;
formal agreements about incontrovertible common aspects such as: thermo-chemical data; climate analogue sites and data; the waste inventory and its stable element chemistry, with uncertainties made explicit;
terminology, units and symbols;
assumptions about the characteristics of future possible exposed groups;
interpretation of 'risk';
'best practical means' ..., in conjunction with policy groups;
protocols for elicitation of subjective judgements and for peer reviews;
and, finally, what might reasonably be expected in terms of 'validation'.
CONCLUDING REMARKS

Interaction between regulators and proponents is necessary in advance of any formal safety submission, as well as in direct response to the latter as part of any authorisation process. If the preparation is done carefully there should be "no surprises" in the case presented: at most it should be a refinement of information already delivered, of arguments and decision logic already rehearsed. Software tools and techniques, whose development has been undertaken to well-established software engineering practice and quality assurance, should already have been published. Similarly, all methods underpinning the quantitative arguments and results, revealed in their definitive form in the case itself, should have been reviewed in the scientific and technical literature, Royal Society (1994).

The examination of the safety case itself can be regarded as the review of the four aspects identified in this paper, and this examination may be done using nominally four different levels of regulatory and technical capability, ranging from a classical peer review to a capability also to carry out any selected part of such a safety analysis in quantitative terms using independently developed tools and methods of analysis. It is suggested here that maintaining this capability keeps open the widest range of opportunities at each of the stages in the assessment process, thereby indicating it to be of most potential benefit to a regulatory body. The entire process is complex, and its success will depend upon good management based upon relevant "hands-on" experience. Management and organisational issues are discussed in the companion paper at this conference by Thompson and Sumerling (1997).

Uncertainty is inescapable, and must be openly and thoroughly examined in any credible case. Contrary to much wishful thinking in this subject, it is by no means guaranteed that acquiring and using more site-specific and other data in an analysis will reduce uncertainty. It will, however, enable a better understanding of the domain of uncertainty, and hence reduce ignorance. It remains, at present, an open question how much information is required to provide regulatory confidence in a safety submission.

ACKNOWLEDGEMENTS

The authors thank the Environment Agency for permission to publish this paper. The results of this work may be used in the formulation of policy but at this stage do not constitute UK Government policy. The authors are especially appreciative of the constructive comments from Mr R.E. Smith of the Environment Agency during preparation of this paper.

REFERENCES
Anderson, D.R., Helton, J.C., Jow, M-N, Marietta, M.G., Chu, M.S.Y. and Basabilvago, G. (1997). Conceptual Computational Structure of the 1996 Performance Assessment for the Waste Isolation Pilot Plant (ibid). Ashworth, A.B. and Porter, I.T. (1996). Application of PSA in the Regulation of the Drigg Low-level Waste Disposal Site. Proc. Third International Conference on Probabilistic Safety Assessment and Management, PSAM III, Crete, (June 1996). Bonano, E.J. and Thompson, B.G.J. (1993), (eds). Probabilistic Risk Assessment on Radioactive Waste, Special Issue of Reliability Engineering and System Safety, vol. 42, nos. 2, 3 (1993).
Environment Agency, Scottish Environment Protection Agency, Department of the Environment for Northern Ireland (1997). Radioactive Substances Act 1993: Disposal Facilities on Land for Low and Intermediate Level Radioactive Wastes: Guidance on Requirements for Authorisation. (January 1997).
Grindrod, P. (1996). Traceability of Argument and the Treatment of Conceptual and Parametric Uncertainties within a Safety Case, and how the Regulator may examine this by Independent Analysis. PSAM III, Crete, (June 1996).
HM Government (1993). Radioactive Substances Act 1993, HMSO, ISBN 0-10-541293-7.
ICRP (1985). Radiation Protection Principles for the Disposal of Solid Radioactive Wastes, Annals of the Intl. Commission for Rad. Prot. 15(4), ICRP Publication 46, (1985).
Royal Society (1994). Disposal of Radioactive Wastes in Deep Repositories. The Royal Society (London), ISBN 0 85403 493 5, (Nov 1994).
Smith, A.F.M. (1993). An Overview of Probabilistic and Statistical Issues in Quantitative Risk Analysis for Radioactive Waste Disposal, Parts 1 and 2, UK Govt., Dept. of Environment Reports, DoE/RR/90.073-93.074 (Jan 1993).
Stearn, S. (1994). Risk Analysis in Regulation and Risk Communication. PSAM II, San Diego, Calif. (March 1994).
Sumerling, T.J. and Read, D. (1994). Aspects of Review of a Proponent's Post-Closure Safety Assessment on behalf of a Regulator. PSAM II, San Diego, Calif. (March 1994).
Thompson, B.G.J. (1994). The HMIP Research and Development Programme for Post-Closure Risk Assessment. Proc. Second Intl. Conf. on Prob. Safety Assessment and Management, PSAM II, San Diego, Calif., USA (March 1994).
Thompson, B.G.J., Gralewski, Z.A. and Grindrod, P. (1995). On the Estimation of Bias in Post-Closure Performance Assessment of Underground Radioactive Waste Disposal. Proc. 4th Intl. High Level Rad. Waste Man. Conf., Las Vegas, Nev., USA (April 1995).
Thompson, B.G.J. and Sagar, B. (1993). The development and application of integrated procedures for post-closure assessment, based upon Monte Carlo simulation: the probabilistic systems assessment (PSA) approach. Rel. Eng. and Syst. Safety, vol. 42, pp 125-160 (1993).
Thompson, B.G.J., Smith, R.E. and Porter, I.T. (1996). Some Issues affecting the Regulatory Assessment of Long-Term Post-Closure Risks from Underground Disposal of Radioactive Wastes. Proc. Third International Conference on Probabilistic Safety Assessment and Management, PSAM III, Crete, (June 1996).
Thompson, B.G.J. and Sumerling, T.J. (1997). Organisational and Management Issues in the Regulatory Assessment of Underground Radioactive Waste Disposal (ibid).
Thorne, M.C. (1988). Assessment of the Radiological Risks of Underground Disposal of Solid Radioactive Wastes, UK Govt., Dept. of Environment Report No. DoE/RW/89.030 (Dec 1988).
DEVELOPMENTS AND PRACTICE TOWARDS RISK BASED REGULATIONS IN VARIOUS TECHNOLOGIES
H.P. Berg 1 and P. Kafka 2
1 Bundesamt für Strahlenschutz (BfS), P. O. Box 10 01 49, D-38201 Salzgitter, FRG.
2 Gesellschaft für Anlagen- und Reaktorsicherheit (GRS) mbH, Forschungsgelände, D-85748 Garching, FRG.
ABSTRACT

Over a long period of time, systems design and structure functions have been developed and estimated by the so-called trial-and-error method. With the increasing importance of complex and large-scale technologies, functional and safety problems initiated by random effects within the man-machine-milieu interaction have demanded new procedures. Safety engineering and the relevant regulations turned from a retrospective to a prospective procedure. For this prospective procedure the so-called deterministic approach, based on deterministic criteria and conservative calculations, was established first. Today there is increasing utilization of the so-called probabilistic approach, to take risk aspects into account to a larger extent. The paper illuminates some basics of safety engineering, discusses some pros and cons of the deterministic and probabilistic approaches, and shows recent developments and practices towards risk based regulations in various technologies. Specific examples are given from civil engineering, the space and aviation industry, the practice for marine structures, and the process industry. In the case of nuclear technology, current trends to support operational and maintenance decisions during plant operation not only by performance based but also by risk based considerations are shown in particular. The issue that remains unresolved in the legal environments of many countries, namely the uncertainties of probabilistic results and the vagueness of the state of knowledge, is explained. Finally, some recommendations for supporting actions towards risk based regulations are given.

KEYWORDS: Safety Engineering, Risk Based Regulations, Risk Informed Regulations, Probabilistic Safety Assessment, Probabilistic Approach.
HISTORICAL PERSPECTIVE

More than 2,400 years ago Pericles stated that "the worst thing is to rush into actions before the consequences have been properly debated", and that "the Athenians are capable at the same time of taking risks and estimating them beforehand". Realizing that the ancient Athenians already knew this codex, it is really surprising that today safety engineers have to learn again how all the various players in the real world are to be convinced that a modern society should move towards principles known to the Athenians.
As traditionally taught at universities, an engineer has to design a component, system or structure in such a way that the product functions and is safe. A building, for example, must possess a structure which provides strength and stiffness so that the entire system can perform the duties for which it was specified. Over a long period of time, systems design and structure functions were developed by the so-called trial-and-error method. That means that, based on the lessons learned from undesired events, the safety factors and the codes of engineering practice, for example, were improved step by step, in the expectation that new designs would continue to fulfil their function and be safer. With the increasing importance of complex and large-scale technologies, functional and safety problems initiated by random effects within the man-machine-milieu interaction have demanded new procedures, e.g. to estimate analytically the prospective behavior of the component, system or structure. Safety engineering turned from a retrospective to a prospective procedure (see also: Blockley (1992), Fragola (1996), Kafka (1996 B, 1996 C)).
SYSTEM DESIGN VERSUS SYSTEM ASSESSMENT

In principle, there is a fundamental distinction between the task of designing and constructing a new system and that of assessing an existing one. In the first case, the state of knowledge regarding the layout and the foreseen function of the system is restricted. However, rules, design codes and recommendations for a proper and safe design are normally available and can be used, in combination, for the creation of the various specifications and the calculation of the design parameters. The designer concentrates on the realization of the "function" of the system and the computation of the relevant point values for it. The required reliability is mainly ensured by conservative safety factors. In the second case, the state of knowledge regarding the system layout is more satisfactory: most of the parameters are measurable on the system, and operational experience is available. The duties are to assess the "function" and "malfunctions" of components and the resulting consequences for the system. Other methods and tools are needed, e.g. event trees, fault trees and uncertainty simulations. An assessment of the safety level therefore requires an integrative, or "umbrella", procedure. Such a procedure allows the evaluation of all consequences, effects and safety contributions of the system characteristics realized as a manifestation of the utilization of various rules, standards and requirements. If only the compliance with all these rules and recommendations and the correctness of the calculated design parameters are checked point by point separately, no integrative answer regarding the entire safety level can be given. The picture is complete only if the puzzle pieces are put together. In simple words: the design of a safe new system is a forward-chaining task, and the assessment of the safety level of an operating system is a backward-chaining task, each normally performed with specifically adapted methods and tools. A design procedure graphically follows an event tree structure, dealing with the task of establishing all the end states (design parameters) originating from the expected system function. Vice versa, an assessment procedure graphically follows a fault tree structure, dealing with the task of identifying all the possible root states (causes) of the malfunction of the system. This distinction is one of the reasons why an applicant (the designer) and an assessor (the regulator) are often involved in long discussions regarding the most appropriate methods and tools needed for the common aim, the realization of a safe system.
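The backward-chaining character of assessment can be made concrete with a minimal fault tree evaluation; the two-pump example, its gate structure and its probabilities are invented for illustration:

    # Minimal fault tree evaluation, illustrating the backward-chaining view:
    # the top event is traced down to basic-event causes.  The system
    # structure and probabilities below are invented for illustration.

    def AND(*children):      # all inputs must fail
        return ("AND", children)

    def OR(*children):       # any input failing is enough
        return ("OR", children)

    def prob(node, basic):
        """Probability of a gate or basic event, assuming independence."""
        if isinstance(node, str):
            return basic[node]
        kind, children = node
        ps = [prob(c, basic) for c in children]
        p = 1.0
        if kind == "AND":
            for q in ps:
                p *= q
            return p
        for q in ps:          # OR gate: 1 minus product of survivals
            p *= (1.0 - q)
        return 1.0 - p

    basic = {"pump_A": 1e-2, "pump_B": 1e-2, "power": 1e-4}
    top = OR(AND("pump_A", "pump_B"), "power")   # both pumps fail, or power lost
    print("P(top event) = %.2e" % prob(top, basic))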
DETERMINISTIC VERSUS PROBABILISTIC APPROACH
A deterministic criterion can be characterized as a pre-defined design rule whose fulfillment provides sufficient confidence that the design intent is met. These rules are established on the basis of the experience and expertise of a rule-making body, normally composed of members coming from different groups of interest, e.g. designers, operators and regulators. Such rules can be qualitative, quantitative or a mix of both (see Kafka (1995), Thadani (1996)).
The essence of this type of approach is that a deterministic analysis or calculation has to be performed, and one has to show compliance with the rules in a "checklist format" based on yes and no answers. The deterministic analysis can be characterized as a point value calculation via a functional representation of the system behavior, including so-called conservative assumptions. Thus, the deterministic analysis is a single snapshot into the space of all the possibilities which can be formed by real-world behavior. One has to be aware that this simplification implicitly means that all the other possibilities - not considered by this snapshot - are excluded (some say, as a misuse: other cases are not possible).

The strength of the deterministic approach is that the associated analysis and decision making process is relatively clear and simple. The systems analysis and the associated calculations are straightforward, and the decision making answer is "Go" or "No-Go". The weakness of this approach is the extensive use of expert judgment without any explicit consideration of the various types of uncertainties, and the lack of any information about which criteria or analysis results are more or less important with respect to the safety level. A ranking procedure regarding issues or outcomes is not possible. Additionally, the world cannot be modeled realistically using conservative assumptions: the real world normally follows the most probable circumstances and boundary conditions. In other words, an accident scenario simulated with conservative parameters represents a very rare single case within the space of all the possibilities. Finally, the deterministic approach may give the false impression that the results are "certain" and the scenarios are "true".

A probabilistic approach can be characterized by an extensive use of probabilities. This implies an extensive search for failure "possibilities" in systems analysis and the performance of quantitative calculations, because of the existence of random processes in the real world. In other words, the approach searches for the spectrum of possibilities, quantified with probabilities. Thus, the analyst has a comprehensive view of the real world rather than a snapshot (by way of a deterministic approach). Probabilistic criteria are normally expressed in terms of failure or success probabilities per demand. The strengths are the integrative and quantitative approach, which allows rankings of issues and results, explicit consideration and treatment of all types of uncertainties, and the application of an optimization process, Apostolakis (1990). A (supposed) weakness of the probabilistic approach is the more complex and time-consuming analysis and decision making process, because more information and insights have to be collected, processed and considered for decisions. A still unresolved issue in the legal environments of many countries is the fact that the probabilistic approach explicitly shows the uncertainties and the vagueness of the state of knowledge, and that the result has to be characterized as a prognostic estimation of what can or cannot happen in the future. In particular, the problem of incompleteness and sensitivity of the results is addressed. However, although the probabilistic approach is also only a model of the real world, it represents the real world in a much more realistic way.
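The contrast between the two approaches can be sketched on a toy load-versus-capacity problem: the deterministic check is a single conservative snapshot with a safety factor, while the probabilistic calculation samples the space of possibilities and estimates a failure probability. All distributions and factors below are invented:

    # Deterministic snapshot versus probabilistic sampling on a toy
    # load-vs-capacity problem; all numbers are invented for illustration.
    import random

    random.seed(3)

    # Deterministic: one conservative point calculation with a safety factor.
    design_load, nominal_capacity, safety_factor = 120.0, 200.0, 1.5
    ok = nominal_capacity >= safety_factor * design_load
    print("deterministic check:", "Go" if ok else "No-Go")

    # Probabilistic: sample the most probable circumstances and count failures.
    N, failures = 100_000, 0
    for _ in range(N):
        load = random.gauss(100.0, 15.0)        # assumed load distribution
        capacity = random.gauss(200.0, 25.0)    # assumed capacity distribution
        if load > capacity:
            failures += 1
    print("estimated failure probability: %.1e per demand" % (failures / N))

The deterministic run returns only "Go" or "No-Go"; the probabilistic run additionally quantifies how rare failure is, which is what permits ranking and optimization.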
THE ELEMENTS OF RISK
The extent and nature of most safety engineering problems are such that a suite of technologies operating as an integrated safety feature is required to prevent and mitigate potential hazards. Thus, the design, evaluation and assessment of the safety technologies is a complex decision process affected by a wide range of factors. Selecting technologies to work as an integrated safety feature is an optimization process: for each technology the objective is to maximize the positive impacts and to minimize the negative ones. There exists a widespread consensus that optimization processes need a complete model focusing on all quantifiable issues of interest. A risk model as the foundation for formal safety optimizations and decisions is widely
accepted as the preferred vehicle for an explicit and systematic consideration of all the issues affecting decision making in safety engineering (see also: Ale (1996), Aven (1996), Bonano (1996), E.C. (1993), Hessel (1996), Hirschberg (1996), Schmidt (1996), Watson (1993)). Ongoing findings and recommendations - e.g. by the Center for Risk Analysis at the Harvard School of Public Health, prepared for the U.S. Congress, CRA (1995), and by the National Research Council (NRC) of the U.S. National Academy of Science, Bonano (1996) - have strongly advocated the use of risk assessment in environmental management and other decision making processes. It can be assumed that there also exists a consensus that safety engineering of complex installations is strongly interrelated with environmental aspects. The important elements of "risk", in the context of its usefulness for safety engineering, are the identified spectrum of undesired events, the estimated frequencies and consequences of these events, and the identified and quantified spectrum of the various types of uncertainties.
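These elements map naturally onto a "risk triplet" style of bookkeeping (scenario, frequency, consequence, plus an uncertainty measure). The sketch below is our minimal illustration of such a structure; all names and numbers are invented.

```python
from dataclasses import dataclass

@dataclass
class RiskElement:
    scenario: str        # identified undesired event
    frequency: float     # estimated occurrences per installation-year
    consequence: float   # consequence measure (e.g. fatality equivalents)
    error_factor: float  # multiplicative uncertainty on the frequency

# Hypothetical spectrum of undesired events (values purely illustrative).
spectrum = [
    RiskElement("loss of coolant",      1e-4, 10.0,  5.0),
    RiskElement("fire in turbine hall", 1e-3,  0.5,  3.0),
    RiskElement("toxic release",        5e-5, 25.0, 10.0),
]

# Point estimate of the total expected detriment, and a crude upper bound
# obtained by inflating each frequency by its error factor.
total = sum(e.frequency * e.consequence for e in spectrum)
upper = sum(e.frequency * e.error_factor * e.consequence for e in spectrum)
print(f"expected detriment: {total:.2e}/yr (upper bound ~{upper:.2e}/yr)")
```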
AIMS, GOALS AND TARGETS

Safety engineering is in principle a management process: to establish aims, goals and targets, to transfer these requirements into the design and construction of the real system, and to assess the realized system with respect to compliance with the established aims, goals and targets. It is usually said that system "aims" are the general aspects, mainly focusing on the envisaged function and duties of the system. "Goals" are mostly used in the context of "safety goals", i.e. qualitative or quantitative requirements with respect to safety for the entire system, U.S. NRC (1986). "Targets" are mainly used in connection with system functions expressed in reliability and/or availability characteristics, Petersen (1992).
[Figure: pyramid diagram showing the breakdown of a risk-based Top Goal through Level 4 (Environment), Level 3 (Plant) and Level 2 (System) down to specific targets at Level 1 (Component, deterministic and/or probabilistic).]
Fig. 1: Breakdown of Targets based on a Top Goal. Kafka (1995).

All these expressions can, in principle, be mapped onto the various levels of the system (top level, subsystem level, component level or piece-part level). To establish, e.g., targets at the different system levels, a top-down procedure should be used (see Fig. 1). Starting with a safety goal at the top of this pyramid, one can break down targets at the various system levels. In reliability engineering, appropriate tools exist to execute this breakdown in a structured manner. The safety goal set at the top of the system is correlated with the question "how safe is safe enough".
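One way to read Fig. 1 is as an allocation exercise: for independent, OR-connected contributors, the subsystem failure frequencies must sum to no more than the top goal. The sketch below applies exactly that simplifying assumption; the goal, subsystem names and weights are invented for illustration.

```python
# Hypothetical top goal: frequency of the undesired top event per plant-year.
TOP_GOAL = 1e-5

# Assumed relative shares for OR-connected, independent subsystems; a real
# breakdown would come from the plant's reliability model, not from weights.
shares = {"reactor trip system": 0.5,
          "emergency cooling": 0.3,
          "containment isolation": 0.2}

# Rare-event approximation: top frequency ~ sum of subsystem frequencies,
# so each subsystem target is simply its share of the top goal.
for system, share in shares.items():
    print(f"{system:22s} target <= {share * TOP_GOAL:.1e} per year")
```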
Nowadays there exists a preference to establish this goal quantitatively, i.e., expressed by specific risk components (for example, as a frequency of occurrence of a given event scenario per plant-year). A pragmatic approach to establishing such a safety goal is used in various technologies (aviation: Haak (1995), Shooman (1990), JAA (1989); space: Preyssl (1995), Klein (1996); nuclear: Thadani (1996); offshore: DNV (1992)). Knowing and considering the "realized standard" of the existing installations, the goal for "new" installations should be, e.g., two orders of magnitude (a factor of 100) "safer" (i.e. lower in risk) than the existing installations. In other words, the frequency of a predefined catastrophic failure has to be less than 10^-x per installation-year. With the help of such a pragmatic approach one can perform the breakdown.

Besides such a breakdown procedure, there exists in most technologies a set of well-established rules and regulations (based on best engineering practice) for achieving "safe" systems in a pragmatic way. Examples are the so-called single failure criterion, spatial separation, and the defense-in-depth concept. These rules should also be considered in safety engineering to attain a high safety standard. It should be mentioned that the issue "how safe is safe enough" has a very strong relation to risk awareness, acceptance and tolerability by the public. In the context of this paper, comments and discussions on that topic are not foreseen.

WHAT IS GOING ON IN VARIOUS TECHNOLOGIES NOWADAYS

Safety engineering of large technological systems - which essentially are assembled from active components (pumps, valves), passive structures (pipes, vessels), and the operational staff managing operation and maintenance - requires a risk-based analysis process. This process involves four main issues: (1) component reliability, (2) structural reliability, (3) man-machine reliability, and (4) an integrative system reliability model. These issues and the necessary methods are more or less available and used in all technologies. Far greater differences lie in the various regulatory environments regarding which types of studies and assessments are appropriate and needed. Positions vary from the opinion that a risk-based (probabilistic) approach is not adequate (e.g. Germany, chemical industry) to a totally risk-based regulatory environment (e.g. Holland, process industry, Ale (1996)). By way of other examples, the aviation industry (e.g. Airbus, Haak (1995)) has used a probabilistic design concept for some years, including a safety goal for catastrophic accidents and reliability targets for systems and components. The space industry moves more and more to a risk-based approach for the main project decisions and to reliability targets for systems and components. Large manufacturing industries (e.g. Toyota, Ford Motor Company) move toward Reliability and Maintainability (R & M) targets for all equipment and production lines, Reichart (1994), SAE (1995). In the civil and marine structures industries there exist probabilistic design codes for the design of structures, including a set of probabilistic goals and pre-formulated calculation procedures, DNV (1992). In the nuclear industry one can observe worldwide, over recent years, a movement toward risk-based approaches and regulations which is strongly encouraged by the industry.
Probabilistic Safety Assessment (PSA) has become the tool of choice for selecting the best of several alternatives. Closely related to risk based regulation is the development of performance based rules. Such rules focus on the final result to be achieved: they do not specify the process, but instead establish the goals to be reached and the procedures by which achievement of those goals is to be judged. Inspection and enforcement activities are then based on whether or not the goals have been met. Risk based regulation has the potential both to improve nuclear power plant safety and to reduce plant operating costs. This modern form of regulation could be applied to present operating installations and to advanced designs; in fact, it would help to quantify the safety improvements of advanced designs. The application of PSA technology to the regulatory process can reduce public risks in several ways: by finding design weaknesses, by improving plant operations, and by supporting the development of severe accident management programs. For example, a traditional product of a Level 1 PSA is some estimate of the likelihood of a
nuclear power plant having a core damage or core melt event. The overall core melt frequency is estimated by summing up many thousands of accident sequences, each providing some increment of core melt frequency. If a Level 2 PSA is performed, there will also be estimates of the containment failure frequency and of the releases of radioactive material to the environment associated with each containment failure mode. A large sum, e.g. a high core melt frequency or a high containment failure frequency, can be an indication of a poor design. Further, by examining the different contributors to the aggregate value, particular areas of design weakness can be pinpointed. Therefore, PSA results can be used to evaluate the design of nuclear power plants. Numerous plant-specific design improvements have already been implemented in various nuclear plants based on PSA insights, thereby lowering nuclear risks, Garrick (1995).

More recently, PSA techniques have been applied to evaluating the operation of nuclear power plants, focusing on how plant risks vary with time, Kafka (1996A). There are several mechanisms that can cause plant risks to change over time. The performance of individual components and whole systems may degrade due to aging, or improve due to design modification or enhanced maintenance. Plant configurations also change from time to time as certain components are removed from (or restored to) service for tests and/or maintenance, while others may be removed through failure. Configurations also change when going from one plant operating mode to another, such as the transition from power operation to shutdown. Since the risk significance of a component or system is also a function of the plant's configuration, changing configurations yield different risk levels. Just as earlier applications of integral PSA results were utilized to reduce the risks due to design weaknesses, present specific PSA applications are increasingly dedicated to minimizing operational weaknesses, e.g. avoiding high-risk plant configurations.

Quantitative safety criteria and objectives - correlated with the risk to each single individual in the vicinity of the plant and/or the societal risk of the population as a whole - are used in the decision-making process, for example, in the United Kingdom, Cassidy (1996), and in the Netherlands, Ale (1996). In both countries, this safety concept is not restricted to nuclear installations but has been adopted as a more global safety policy covering all potentially hazardous industries and activities.

Another significant milestone in the development of risk based regulation in the United States was the development of quantitative safety goals and their endorsement in a Nuclear Regulatory Commission (NRC) Policy Statement in 1986, U.S. NRC (1986). This addresses the question of "how safe is safe enough?". In the years following the safety goal policy statement, the relationship between specific regulatory requirements and the risk reduction (or lack of it) achieved was investigated in a number of internal NRC studies. In 1994 the Commission approved a Probabilistic Risk Assessment (PRA) implementation plan. The plan addresses the use of PRA in all major NRC functions: reactor regulation, research, evaluation of operational data, utilization of nuclear materials, and waste disposal.
Its major elements include developing decision criteria for regulatory applications of PRA, developing pilot projects to test PRA application in specific circumstances, looking at the contribution of risk based thinking to the inspection process, and examining operator licensing issues from a risk perspective. The PRA implementation plan was followed by the publication of the Commission's PRA Policy Statement in 1995, U.S. NRC (1995). The PRA policy statement formalizes the Commission's commitment to risk informed regulation. It states, in part: "The use of PRA technology should be increased in all regulatory matters to the extent supported by the state of the art in PRA methods and data, and in a manner that complements the NRC's deterministic approach and supports the NRC's traditional defense in depth philosophy". Nowadays a minor distinction is drawn between "risk based" and "risk informed" regulation, and the Commission has begun to substitute the clearer term risk informed for risk based in its lexicon, Murphy (1995), Garrick (1995). The main elements of the implementation plan are the following. The first part defines the regulatory areas where PRA can play a role in the decision making process. The second part underlines that the current deterministic engineering approach is maintained unless a solid basis for change is established. The third part of the framework is probabilistic considerations; key elements are the use of established methods, success
criteria, human and equipment reliability data, and sensitivity and uncertainty analysis. The final part is the integration of deterministic and probabilistic considerations. The success of risk informed regulation ultimately depends on having sufficient reliability data to allow quantification of regulatory alternatives in terms of relative risk contribution. In this context the NRC is considering a new rule, submitted for public comment, which would require power reactor licensees to collect and report to the NRC certain equipment reliability data. The so-called "Maintenance Rule" also represents a step towards risk informed regulation, U.S. NRC (1996).

In the Federal Republic of Germany, the nuclear licensing procedure is essentially based on deterministic safety analysis. In the context of the periodic safety reviews which are recommended for all nuclear power plants in operation, probabilistic considerations can also be taken into account; this is a first smooth step towards the international activities, but PSA at present only supports, and does not determine, regulatory decision making, Berg (1996A). As desirable as it might be to write regulations in terms of the ultimate measure of probabilities, it is not even nearly possible to define the probability of a possible accident sequence with enough precision and enough replicability to use such probabilities as terms in the regulations' bottom line. Therefore the uncertainties have to be considered additionally. The fact that a PSA does not model all relevant issues (the problem of incompleteness) and that the results are not sufficiently robust (the problem of sensitivity) is the main reason why PSA could not be the sole tool or basis for creating a new regulatory regime in the near future. Therefore, the determination of probabilistic safety goals is not supported in Germany from the legal point of view, neither as probabilistic limits nor as orientation values.

In contrast, for the design of advanced reactor types like the European Pressurized Water Reactor (EPR), the common recommendations of the German and French advisory bodies for the regulators contain an interesting statement: "For determining the adequate combination of redundancy and diversity in safety systems, the designer may use probabilistic targets as orientation values; in that case, orientation values of 10^-6 per year for the core damage probability due to internal events for power states and for shutdown states, respectively, could be used, having in mind the necessity to consider associated uncertainties. For those internal and external hazards the probabilities of which cannot be realistically determined, provisions have to be implemented by the designer to obtain a consistent design; this is the case for earthquakes, for which the designer has to state in which way he intends to prove the existence of sufficient design margins". RSK (1994).

In some other countries utilizing nuclear energy, e.g. Finland, Canada, Sweden and Switzerland, there is an increasing trend toward the establishment and use of risk based regulations intermeshed with quantitative probabilistic safety goals. The current status and practice is shown, e.g., in Berg (1996A), SKi (1996). When properly applied, the results of a PSA can be used to identify and prioritize the importance of hardware, human actions (operation and maintenance staff activities) and plant procedures to plant risk.
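The "identify and prioritize" use of PSA results mentioned above amounts to summing sequence (or cut set) frequencies and ranking the contributors. The sketch below shows the mechanics on an invented, four-sequence Level 1 result; a real PSA would have thousands of entries.

```python
# Hypothetical Level 1 PSA output: accident sequences and their estimated
# frequencies per reactor-year (all values invented for illustration).
sequences = {
    "LOCA + ECCS failure":           4.0e-6,
    "transient + loss of feedwater": 2.5e-6,
    "station blackout":              1.5e-6,
    "ATWS":                          5.0e-7,
}

cdf = sum(sequences.values())  # overall core damage frequency
print(f"core damage frequency: {cdf:.2e} per reactor-year")

# Rank contributors: dominant fractional contributions point to the design
# or operational weaknesses worth addressing first.
for name, f in sorted(sequences.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:31s} {f:.1e}  ({f / cdf:5.1%} of CDF)")
```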
The information contained in a PSA is also important in the development of a sound risk management program that could be used for decision-making purposes. An integral part of a comprehensive risk management program at a nuclear power plant would be a living PSA that could be used as the basis for day-to-day operational and maintenance activities and for the short- and long-term assessment and prioritization of safety-related needs. Regarding risk management, the idea of such a living PSA has been supported in the framework of the German Nuclear Regulatory Research Program.
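A living PSA used as a risk monitor answers configuration questions of the kind discussed earlier: take a component out of service in the model and requantify. The toy two-cut-set fault tree below is entirely our construction and only illustrates the mechanism.

```python
# Toy plant model: core damage if (pump A AND pump B fail) OR the diesel
# fails; per-demand probabilities are invented for illustration.
BASE = {"pump_A": 1e-2, "pump_B": 1e-2, "diesel": 1e-4}

def core_damage_prob(p):
    # Rare-event approximation over the two minimal cut sets.
    return p["pump_A"] * p["pump_B"] + p["diesel"]

print(f"baseline configuration: {core_damage_prob(BASE):.1e}")

# Living-PSA query: pump A removed from service for maintenance, i.e. its
# failure probability is set to 1 -- the configuration risk jumps visibly.
maintenance = dict(BASE, pump_A=1.0)
print(f"pump A out of service:  {core_damage_prob(maintenance):.1e}")
```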
PROS AND CONS OF RISK BASED APPROACH AND REGULATIONS

Although some countries and various technologies are intensively discussing a transition from a deterministic to a risk-based approach, the pros and cons, including the challenges, of such a transition process should be taken into account. Summarizing the pros of the risk-based approach, considering its essential potential and the long-term ideas and goals, one can conclude:
• it is supported explicitly (quantitatively) by our historical experience,
• it is based on the understanding of the system and component behavior formulated in deterministic codes and calculations,
• it models the real world with all the determined, random and uncertain elements and parameters, based on our state of knowledge,
• it is quantitative and therefore appropriate for sensitivity, importance and optimization studies,
• it integrates design, manufacturing and operational aspects for safety balancing over the life cycle of the system,
• it integrates all the safety issues and therefore allows rankings and optimizations.

Specific benefits of risk based regulations are:
• to have a cost-effective approach to regulation,
• to assure that resources are focused on essential safety issues,
• to have a methodology that can be used to both enhance safety and manage operability,
• to be able to communicate results and decisions on a clearly defined basis,
• to attain an open, fair, and predictable regulatory framework.
On the other hand, there are disadvantages and difficulties posed by such a risk based approach:
• to place very heavy reliance upon the judgment of regulatory personnel as to whether standards have been complied with, whether risks have been properly identified and quantified, and whether enough preventive or mitigative measures have been taken to satisfy the proper balancing of costs and risks,
• to ensure that the regulator's forces are extremely well informed scientifically and technologically in order to produce consistent application of standards,
• to be relatively time-consuming in ensuring a sound database for decisions about risk and methods of control in assessing safety,
• to impart a high degree of uncertainty into computations of whether risks have been reduced sufficiently, which might be fertile ground for endless debate between utility and regulator.

Challenges associated with developing and implementing risk based approaches to regulation are:
• to obtain an acceptable methodology for risk assessments that is commensurate with the decisions to be made,
• to perform the needed, relevant risk assessments,
• to focus regulatory questions so that risk assessments can be useful,
• to have a regulatory structure that encourages risk based methods,
• to perform the necessary regulatory research that assures a robust, stable approach to risk based regulation,
• to effectively communicate the process, risks, and decisions to the public.
CONCLUSIONS AND RECOMMENDATIONS

A large number of contributions are available on the topic presented here, and only a few of them are referred to in this paper. In this summary we attempt to highlight important insights and remarks from the widespread developments and applications of risk based approaches in safety engineering.

• The suggestive idea of an ecologically and economically beneficial system (i.e. the higher the danger potential of a component, the more reliable the component itself and its safety features must be) can today be found in all safety engineering approaches within the various technological sectors. Nevertheless, the required and/or demonstrated safety level of systems is often evaluated by general engineering judgments, rules and requirements only.

• The use of an entire model to build up all the physical and logical interrelations of the important elements, and the adoption of quantitative methods to evaluate this model, is neither well established nor balanced across the various technologies. While nuclear technology, space, aviation, the off-shore industry and civil engineering are at the front end, chemical engineering is at the back end. The status, effort and application of the risk based approach is also very inhomogeneous across countries (e.g. Holland, Ale (1996), Norway, Aven (1996), Switzerland, BUWAL (1991), the U.K., SKi (1996), and the U.S.A., SKi (1996), U.S. Congress (1994), are at the front end, while Germany is at the back end).

• The principal value of the risk based approach is that it represents the most complete compilation of the state-of-the-art knowledge, analysis and data available for a given problem, from which to develop the integrative perspective necessary to assess the variety of relationships between initiating events, human and equipment performance, mitigation features, accident phenomena and, if necessary, health effects and consequences. The risk based approach can also usefully be applied for balancing and evaluating the system design, the manufacturing aspects and the operation, including operator actions and maintenance activities. Kafka (1996).

• The deterministic approach for designing and assessing the system with regard to safety performance generates in this context only limited insights, which are needed as basic information but are not by themselves sufficient for decision making in safety engineering. Rankings and optimizations are required. All the answers generated by the deterministic approach are uncorrelated with respect to the characteristics of the entire system.

• The significant benefit of the risk based approach, i.e. the identification and analysis of the possible event scenarios, generates many more insights than are achievable by the deterministic analysis process.

• The utilization of the risk based approach of course requires more manpower and computational effort than a safety assessment with simplified assumptions based on checklist-formatted rules. But the larger effort must be balanced against the larger benefit in terms of higher confidence in less risky systems.

• The process of the risk based approach exposes "uncertainties" and "vague" insights. This "new" situation for lawyers and regulators challenges further development of some basic laws and acts in safety engineering. The deterministic approach keeps these uncertainties and vague insights hidden; it pretends to an accuracy which is not given in the real world.
Based on these summarized statements we want to give the following recommendations:

• A risk based approach requires both risk based procedures - to do safety engineering - and risk based regulations - to control what we have done in safety engineering. Both tools should undergo further development to improve the state of knowledge and reduce uncertainties in decision making.

• As is typical for all engineering fields, although some lessons learned from applications and historical events have been beneficially adopted in the risk based approach (e.g. human factors, common cause failures), other issues (e.g. computer/software reliability) must undergo further development and research work.

• Finally, it should be stressed that in high-level information processing systems such as the risk based approach, a continuous activity should be established to collect and model the increasing state of knowledge and to quantify, as well as possible, even the so-called unquantifiable issues (e.g. the
safety culture, managerial aspects). The risk based approach is a living process; there will never be an end to the story.
REFERENCES
Ale, B.J.M., Laheij, G.M.H., Uijt de Haag, P.A.M. "Zoning Instruments for Major Accident Prevention" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 2191-2196

Center for Risk Analysis (CRA), Harvard "Reform of Risk Regulation: Achieving More Protection at Less Cost" Harvard School of Public Health, Boston, USA, March 1995

Apostolakis, G. "The Concept of Probability in Safety Assessment of Technological Systems" Science 250 (1990), pp 1359-1366

Aven, T., Nja, O., Rettedal, W. "On Risk Acceptance and Risk Interpretation" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 2191-2196

Berg, H.P. "Approach for Risk Based Regulation and Risk Management of Nuclear Power Plants" Proceedings of SRA Europe Meeting 1996, University of Surrey, Guildford, June 1996 (A)

Berg, H.P., Görtz, R., Schaefer, T., Schott, H. "Quantitative probabilistische Sicherheitskriterien für Genehmigung und Betrieb kerntechnischer Anlagen: Status und Entwicklung im internationalen Vergleich" BfS-KT-15/96, 1996 (B)

Blockley, D. (Ed.) "Engineering Safety" McGraw-Hill Book Company, 1992

Bonano, E., Peil, K. "Risk Assessment: a Defensible Foundation for Environmental Management Decision Making" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 2117-2121

BUWAL "Handbuch I zur Störfallverordnung StFV" Bundesamt für Umwelt, Wald und Landschaft (BUWAL), Schweiz, Juni 1991

Cassidy, K. "UK Risk Criteria for Siting of Hazardous Installations and Development in their Vicinity" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 1892-1898

DNV "Structural Reliability Analysis of Marine Structures" Det Norske Veritas Classification AS, N-1322 Høvik, June 1992

Fragola, J. "Design Decisions and Risk: Engineering Applications" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 1811-1816

Fragola, J., Shoomann, L. "Experience Bounds on Nuclear Plant Probabilistic Safety Assessment" 92RM-165

Garrick, J., Wakefield, D. "A Progress Report on the Status of Selected Applications of Probabilistic Risk Assessment in the U.S. Nuclear Power Industry" Proceedings, KAERI, PSA'95, November 1995, Seoul, pp 923-926

Haak, D., Airbus GmbH, Hamburg, informal communication, 1995

Hessel, P.P. "Toward Risk Based Regulation" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 339-342

Hirschberg, S., Spiekerman, G. "Comparative Evaluation of Severe Accident Risk Associated with Electricity Generation Systems" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 3945
HSK, SKi "Proceedings of Executive Meeting on Risk-Based Regulations and Inspections" Vol. I and II, SKi 96-69, 1996
International Atomic Energy Agency, IAEA "INSAG-5, The Safety of Nuclear Power" IAEA Safety Series No. 75-INSAG-5, 1994

Joint Aviation Authority, JAA "Basic Objective Requirements for all Systems on Large Transport Category Airplanes" JAR 25.1309, 1989

Joksimovich, B. "Man versus Machine in Nuclear Safety Regulation" Proceedings, KAERI, PSA'95, Seoul, Korea 1995, pp 788-794
Kafka, P. "Sicherheit grol3technischer Anlagen" TO, VDI Verlag Dtisseldorf, 9/95 September 1995, pp 354 - 357 Kafka P., Gromann A. "Where we are in Living PSA and Risk Monitoring" Proceedings, ESREL'96 PSAM III, Crete, 1996 (A), Springer Verlag, pp 1884 - 1891 Kafka, P. "Probabilistic Safety Assessment: Quantitative Process to Balance Design, Manufacturing and Operation for Safety of Plant Structures and Systems" Nuclear Engineering and Design 165 (1996) pp 33-350, 1996 (B) Kafka P. "Safety Engineering - Why Should we Move Towards Risk-Based Evaluations?" Proceedings, SKi 96-69, Vol 2, 1996 (C) Klein, M., Schueller, G.I., Esnault, P. "Guidelines for Factors of Safety for Aerospace Structures" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 1696 - 1701 Murphy, J. "Risk Based Regulation: Practical Experience in Using Risk-Related Insights to Solve Regulatory Issues" Proceedings, KAERI, PSA'95 November 1995, Seoul, pp 945 -948 Petersen, K. , Sieger, K., Kongso, H. "Setting Reliability Targets for the Great Belt Link Tunnel Equipment" Paper, ESReDA Seminar, Amsterdam, Holland, April, 1992 Preyssl, Ch. "Safety Risk Assessment and Management - the ESA Approach" Reliability Engineering and System Safety, 49 (1995) pp 303 - 309 Reichart, G., Dilger, E., Winner, H. "Iterative Safety Design Process - Ein Ansatz zur Sicherheitsanalyse und -bewertung ktinftiger Fahrzeugsysteme" VDI Bericht 1152, Dtisseldorf 1994 RSK "Gemeinsame Empfehlung von RSK und GPR f~r Sicherheitsanforderungen an zukiinftige Kernkraftwerke mit Druckwasserreaktor", admitted in 1994 SAE "Reliability and Maintainability Guideline for Manufacturing Machinery and Equipment" Society of Automotive Engineers, Inc., SAE, Warrendale, PA, USA, 1995 Schmidt, S. "Decision Analysis, Risk Research & Assessment: An Integrated Approach for Risk Management" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 1817 - 1822 Shooman, M. "Probabilistic Reliability: An Engineering Approach" 2nd Edition, Kreiger, Melbourne, FL, 1990 Thadani, A., Murphy, J. "Risk-Informed Regulation - Issues and Prospects for its Use in Reactor Regulation in den USA" Proceedings, ESREL'96 - PSAM III, Crete, 1996, Springer Verlag, pp 2172 2177
The Engineering Council (E.C.) "Guidelines on Risk Issues" The Engineering Council, U.K., 1993, ISBN 0-9516611-7-5

U.S. Congress "Risk Assessment Improvement Act of 1994" H.R. 4306, USA, 1994

U.S. NRC "Safety Goals for Nuclear Power Plant Operation" NUREG-0880, U.S. NRC, Washington DC, 1986

U.S. NRC "Probabilistic Risk Assessment Implementation Plan" U.S. NRC, Washington DC, 1994

U.S. NRC "PRA Policy Statement" U.S. NRC, Washington DC, 1995

U.S. NRC "Maintenance Rule" 10 CFR 50.65, U.S. NRC, Washington DC, 10 July 1996

Watson, I. "Developments in Risk Management" Paper, ESREL'93, Munich, VDI Verlag Düsseldorf, 1993, pp 511-521
INCORPORATING RISK ASSESSMENT AND ITS RESULTS IN THE DECISION-MAKING PROCESS

J. M. Le Guen

Health and Safety Executive, Rose Court, 2 Southwark Bridge, London SE1 7HS, UK
ABSTRACT

The degree to which risks should be controlled is essentially a question of values. Individuals tolerate different levels of risk depending on the benefits they think they will gain from taking the risks. Equally, society's tolerance of different risks varies dramatically for a whole variety of reasons, some relatively straightforward and amenable to scientific evaluation, and others complex expressions of deep-seated psychological attitudes. This paper sets out criteria which HSE has developed against that background for defining tolerable levels of risk and integrating them in the decision-making process. The criteria accept that risk assessment, more often than not, cannot produce scientific estimates of actual risks but can instead only produce conditional estimates of risks under specified sets of assumptions; that there is generally a need to achieve a balance between risks and costs; but equally that there are some risks that cannot be tolerated under any circumstances. The criteria have gained considerable acceptance within industry. They have helped HSE to meet its objective of being an open and transparent organisation by showing how decisions about risks are arrived at, and by letting duty holders understand what is expected of them and what they should expect from the regulator.

KEYWORDS

Unacceptable risk, tolerable risk, negligible risk, criteria for standard setting.

INTRODUCTION

As we approach the Millennium, contrasting public attitudes to a technological society at the beginning and at the end of the twentieth century is most revealing. In 1900, 'la belle époque', there was considerable optimism that science and technology would solve everything, transform people's lives and make the world a safer, healthier and more prosperous one to live in. A lot of this has happened. During the last hundred years the standard of living has risen dramatically; we now live longer and are generally healthier. Yet people have never been more pre-occupied with risks to health, safety and the environment. Even more disturbingly, there is today a general malaise about science and technology.

There are several reasons for this phenomenon; I will mention only a few. Firstly, the majority of people in industrialised countries these days no longer have to struggle for their daily bread. In industrialised countries obesity is now a bigger problem than malnutrition. As a result, the acceptance of industrialisation as a means to increased standards of living is no longer as readily given as when the fight against hunger and poverty overshadowed everything. This is giving rise to a growing propensity to scrutinise the benefits brought about by industrialisation against its undesirable side effects, such as the risk of being maimed or killed, or environmental pollution.
Secondly, there is a perception that the exploitation of modern technology is increasingly giving rise to risks:
- which could lead to catastrophic consequences - perceptions fuelled by greater awareness of issues such as nuclear fall-out following Chernobyl, depletion of the ozone layer, global warming and acid rain;
- where the consequences may be irreversible, eg the release of genetically modified organisms;
- which lead to inequalities because they affect some people more than others, such as those arising from the siting of a chemical plant, power station or a waste disposal facility;
- which could pose a threat to future generations, such as radioactive waste depositories.

Thirdly, whereas nearly everyone can readily assess the threat of a tiger over one's shoulder, this is not the case for risks posed by many of the newer hazards arising from industrialisation, eg pollutants in foodstuffs. People must rely instead on the opinion of experts. However, the trust placed in expert opinion as a source of reassurance is being continually eroded, particularly for those issues where the mass media have exposed controversies surrounding such opinions.

There is a school of thought emerging that society's concerns about risks have now reached such a point that the redistribution of risks in society is becoming as important politically as the redistribution of wealth (though the political system may still be lagging behind and may still be predominantly concerned with the latter). Whether this is true or not is still a matter for debate, but what cannot be denied is that:
- many managers today have less freedom in management matters which only yesterday they would have regarded as a private matter for themselves to decide, eg plans for modifying their plant within their own boundaries, what raw materials and processes they should use, or how the waste generated (or the plant itself at the end of its useful life) should be disposed of;
- more and more restrictions are being placed at international or European level on goods and services that are allowed on the market because of the risks (real or perceived) that they entail;
- ignoring or riding rough-shod over society's concerns may cause markets to collapse, give rise to calls for bureaucratic checks, and entail legal proceedings and loss of reputation;
- it is more important than ever before that the management of health and safety risks be integrated in the overall management process itself, now that it is increasingly the norm to look at the health and safety record of a firm to get an indication of the quality of its management overall.
Some parts of industry may have been slow to recognise the above. They have often concentrated on taking decisions based on the results of risk assessments and failed to be sensitive to ethical, social and cultural considerations, or have played down the assumptions and uncertainties in the risk assessment, thereby giving the impression that their results are based on sound science. Doubts have been expressed by some as to whether an approach based on risk assessment is appropriate for the regulation of risks, as witnessed by the controversy surrounding the proposals for the disposal of the Brent Spar oil platform in the UK, the collapse of the market for bottled Perrier water following the admission that the product had accidentally been contaminated by traces of benzene, and the controversy surrounding Bovine Spongiform Encephalopathy (BSE).

This paper describes the criteria, known as the Tolerability of Risk (TOR), which HSE has developed to inform decisions about the degree to which risks should be regulated and/or controlled. The approach avoids the above pitfalls by taking into account the scientific knowledge about the risks concerned, the technology available for controlling them, and public attitudes towards the risks and the benefits they engender. The criteria have gained considerable acceptance within industry.
RISK ASSESSMENT
The process of assessing risks is now an essential component of an effective strategy for countering the general hankering after a zero-risk society, for incorporating the management of health and safety in the decision-making process, and for rationalising the amount of resources that should be allocated to preventing or reducing risks. More importantly, it is being increasingly recognised that, used judiciously, it can be a powerful tool for reassuring the public that science is being used to pursue technologies whose benefits outweigh the risks, and for integrating public values in the decision-making and political process.

Assessing risks is simple in principle. It involves identifying hazards, or examining what in a particular situation could cause harm or damage, and then assessing the likelihood that harm will actually be experienced by a specified population and what the consequences would be (ie the risks). As such, a risk assessment is essentially a tool for extrapolating from available data a value or judgement which people will accept as an estimate of the risk attached to a particular activity or event. Though a sharp distinction is often made between risk assessment and risk management, the distinction is artificial. It stems from original beliefs - now known to be misguided - that assessing risks could be, by and large, a totally scientific and objective process, unlike risk management, which inevitably has to be more subjective because it has to take into account a host of other factors such as economic analysis, perception of risks, availability of alternative technologies, concerns about equity etc. In practice, an assessment of risks is also a very subjective process, since it often cannot be undertaken without making a number of assumptions. Moreover, since a risk assessment takes account of measures already in place, it invariably contains some elements of risk management. In short, though the basic principles of assessing risks may be simple, applying them is not. Indeed, risk assessment nowadays is more often than not a composite of established disciplines, including toxicology, engineering, statistics, economics, demography and psychology. Knowledge of these disciplines is required to solve many of the practical problems - described below - that are encountered during the assessment process.

Uncertainty
Uncertainties are inherent in the process of assessing risks. They all stem from imperfect knowledge and can be considered to be of two types. Known uncertainties, or 'what you know you don't know': for example, a risk estimate may be based on models which are known not to describe certain aspects of a problem, or may utilise data which are known to be of limited applicability or accuracy. Unknown uncertainties, on the other hand, arise from sources which are either unknown to those making the risk estimate, or whose significance is not appreciated - 'what you don't know you don't know'. For example, it may not be appreciated that a particular model is incomplete in its description of possible effects. In the last few years, the development of better methods for tackling problems posed by uncertainty has become increasingly important. Several techniques are available for tackling both known and unknown types of uncertainty. For instance, sensitivity testing is often used to check the importance of assumptions to the final result: small changes are made, within limits, in the data used and assumptions made, and their effects on the result of the assessment are then examined. Large changes are an indication that the assumptions need to be re-examined. A review of available methods for tackling uncertainty is beyond the scope of this paper; their influence and ramifications for informing policy decisions have been examined by Funtowicz and Ravetz (1990).
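The one-at-a-time sensitivity testing described here is easy to mechanise: perturb each input within stated limits and watch the output. The sketch below uses a deliberately trivial stand-in risk model; the function, parameter names and values are all our assumptions.

```python
# Stand-in risk model (illustrative only): individual risk as the product of
# an event frequency, a conditional probability of harm, and exposure.
def risk(frequency, p_harm, exposure):
    return frequency * p_harm * exposure

base = {"frequency": 1e-3, "p_harm": 0.1, "exposure": 0.2}
base_risk = risk(**base)

# Perturb each assumption by +/-20% in turn; a large swing in the result
# flags an assumption that needs to be re-examined.
for name in base:
    for factor in (0.8, 1.2):
        perturbed = dict(base, **{name: base[name] * factor})
        change = risk(**perturbed) / base_risk - 1.0
        print(f"{name:9s} x{factor:.1f} -> risk changes by {change:+.0%}")
```

In this linear toy model every input moves the result in proportion; in a realistic assessment some inputs dominate and others barely matter, which is exactly what the test is meant to reveal.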
Risk Perception

People's attitude to risks is greatly influenced by the way they perceive them. Indeed, the factors that tend to reflect society's concerns about risks bear a marked similarity to those mentioned earlier as influencing people's attitude to science and technology. How the risk interacts with psychological, social, cultural and institutional processes is important. Factors of particular importance include how well the
process (giving rise to the hazard) is understood, how equitably the danger is distributed, how well individuals can control their exposure, whether the risk is assumed voluntarily, the number of people likely to be killed in any one incident, and the trust that the public have in the regulator responsible for ensuring that those who create risks introduce appropriate measures to control them.

THE REGULATION OF RISKS
Risk Criteria

If they are to be successful, criteria for the regulation of risk must reflect society's values at large. As such they must take into account how people perceive risks and recognise that certain hazards give rise to societal concerns because people have an intrinsic dread of them or because they interact with the social aspirations, ethical beliefs and cultural ethos of society. Typical examples of such hazards include those associated with a nuclear power station, chemicals affecting the ozone layer or giving rise to global warming, and biotechnology. Three broad 'pure' criteria have emerged for judging the tolerability of risk:

- equity-based, which starts with the premise that all individuals have absolute rights to certain levels of protection. This usually leads to standards held to be usually acceptable in normal life, or which refer to some other premise held to establish an expectation of protection. In practice, this often results in fixing a limit to represent the maximum level of risk to which any individual should be exposed. If the risk estimate derived from the risk assessment is above the limit, the risk is held to be intolerable whatever the benefits;

- cost/benefit based, where some direct comparison is made between a value placed on the benefits of reducing risk of injury or detriment, and the costs of preventing and reducing the risks. This form of criterion may relate the comparison not to the overall benefits and costs, which may be very difficult to establish, but to the benefits and costs of an increment of risk reduction. In other words, cost/benefit based criteria compare the benefits in monetary terms obtained by the adoption of a particular risk prevention measure with the cost of introducing it;

- technology based, which essentially reflects the idea that a satisfactory level of risk prevention is attained when relevant best or good practice, or "state of the art" technology, is employed.
The Tolerability of Risk
The above criteria are not mutually exclusive. Indeed, all three are present in the framework that HSE has developed for deciding which risks are so great as to be unacceptable; which are so small that no further precautions are necessary; or, if they fall between these two states, whether the risks should be incurred, taking account of the benefits or the need to avoid some greater risk. The criteria, known as TOR (short for tolerability of risk), are illustrated in the geometry at Figure 1. The horizontal line at the top represents an upper limit above which a particular risk is for practical purposes regarded as intolerable whatever the benefit. Any activity or practice giving rise to a risk greater than this threshold would be ruled out unless it can be modified to reduce the degree of risk below this level. The line at the bottom, on the other hand, represents a threshold below which risks are considered broadly acceptable because they compare with small risks which do not worry people or cause them to alter their behaviour in any way. When incurred they only result in a very small addition to the background level of risks to which everyone is exposed during their lifetime (typically 1 in 100). Between the two lines is the region where people will tolerate risks in order to secure benefits. However, this tolerance of risks is buttressed by an expectation that people will be told the nature and level of the risks, and by confidence that the risks are being controlled as low as is reasonably practicable.
This region, known as the 'tolerability region', accommodates people's and society's willingness to live with a particular risk so as to secure social and economic benefits. Benefits for which people and society tolerate risks typically include local employment, lower cost of production, personal convenience and the maintenance of general social infrastructure for example through the availability of electricity, food or water supplies. However, while people may tolerate risks for which they can see some benefits that outweigh them and as such will indeed engage voluntarily in activities which often involve high risks, in general they want the risks to be as low as possible. Moreover, they are far less tolerant of risks imposed on them and over which they have little control. The concept of tolerability implies that existing control measures should be periodically reviewed to ensure that they are both properly applied and that they take account of changes over time, as for example, the availability of new options for reducing or eliminating risks due to technical progress.
Tolerability limits

The dividing line between the unacceptable and tolerable regions must reflect society's values at large and will depend on the nature of the hazards and the detriment they could give rise to. However, HSE has proposed that for hazardous events to which workers are exposed, a risk of death of 1 in 1,000 per year should be the dividing line between what is tolerable for the majority of workers for most of their working lives, and what is unacceptable for any but fairly exceptional groups. For members of the public who have a risk imposed on them "in the wider interest", HSE would set this limit an order of magnitude lower, at 1 in 10,000 per annum. At the other end of the spectrum, HSE believes that an individual risk of death of 1 in 1,000,000 per annum for the public (including workers) corresponds to a very low level of risk and should be considered broadly acceptable. In addition to these levels of individual risk, HSE has suggested that the chance of an accident causing societal concerns due to multiple fatalities in a single event should be less than 1 in 1,000 per year and, if possible, less than 1 in 5,000 for accidents where there is some choice whether to accept the risk of it happening - for example by allowing the erection of a hazardous installation in a built-up area. The choice of the above figures is essentially a policy decision and they are not intended to be straitjackets applied rigidly in all circumstances. The upper boundary was determined by analogy with high risk industries generally regarded as well regulated, while the lower boundary took account of people's voluntary acceptance of risks in particular situations and of risks that they usually regard as negligible in their walk of life. On the other hand, the boundary for the level of risk for events causing multiple fatalities is roughly based on an examination of the levels of risk that people were prepared to tolerate for hazards causing a major accident affecting the surrounding population, eg the survey carried out by HSE on the potential risks of the industrial installations at Canvey Island on the Thames, or the predicted annual chance of an aircraft crash in the UK killing 500 or more people. Though the above tolerability limits are defined in terms of fatalities, it is possible to apply the framework for judging the tolerability of other forms of detriment (eg non-fatal injuries) by converting them into "fatality equivalents".
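The limits just quoted lend themselves to a direct screening rule. A minimal sketch follows, using the paper's figures; the function name, interface and example value are our own.

```python
# Individual risk of death per year, as quoted in the text.
UPPER_LIMIT = {"worker": 1e-3, "public": 1e-4}  # intolerable above this
BROADLY_ACCEPTABLE = 1e-6                       # negligible below this

def tor_region(annual_risk, group="worker"):
    """Place a typical person's annual risk within the TOR geometry."""
    if annual_risk > UPPER_LIMIT[group]:
        return "unacceptable"
    if annual_risk < BROADLY_ACCEPTABLE:
        return "broadly acceptable"
    return "tolerable - reduce as low as reasonably practicable"

# Example: a typical worker with an assessed risk of 5 x 10^-5 per year.
print(tor_region(5e-5, "worker"))
```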
Typical persons

Some of the tolerability limits mentioned above relate to individual risks. But this is not as straightforward as it seems. What is a typical individual? Individuals are affected by risk differently depending, amongst other things, on their physical make-up, abilities, age etc.
When setting standards, tolerability limits, or control limits that apply to a particular activity whatever the circumstances, this problem is addressed by constructing a hypothetical type of individual, or typical person, who stands in some fixed relation to the hazard, eg the person most exposed to it, or a person living at some fixed point or with some assumed pattern of life. Other individuals can then see how their circumstances relate to this typical person and reckon whether they or their family incur a greater or smaller risk. As such, typical persons may be regarded as persons (including groups) assumed to have homogeneous characteristics who are deliberately posited for the purpose of risk assessment. They may for example be persons hypothetically constantly present at the perimeter fence of a nuclear power station, or a group of workers presumed to be exposed to a given risk for exactly forty hours per week, etc.

It is therefore very important to understand that the determination of where the risks from a particular activity lie in the TOR geometry (ie whether they lie in the unacceptable, tolerable or negligible region) is performed by calculating the risk to a typical individual exposed to the hazard under consideration for a specified period of time (usually a full working day) integrated over a full year. The calculations do not relate to the risk arising from the actual exposure of a real person to the hazard in question. Similarly, once it has been established where the activity lies in the TOR geometry, decisions on the measures that need to be taken across the board to control the risks relate to those needed to ensure the health and safety of the typical individual and not to those of an actual person undertaking the activity (though it may be necessary to adjust these measures to take account of any particular characteristic - eg a disability - peculiar to the real person). In short, the risks derived for the typical individual act as a pointer to where the activity lies on the TOR geometry, which in turn dictates the type and degree of the measures that need to be introduced across the board to eliminate or to control the risks adequately.

This approach of establishing, for any circumstances, the control measures that need to be introduced for protecting actual persons from risks, by looking at those needed to protect a typical individual, has an important consequence. Once these measures have been identified, they represent the measures that must then be introduced, even for short exposures, for protecting actual persons from the risks. For example, if the risk in operating an unguarded power press is found to be intolerable and a guard is necessary to make the risk tolerable, then the guard will need to be in place however short the period of use. Nor is it admissible to argue that there is no need to introduce risk control measures for a high risk activity because in practice the risk is shared between many employees, ensuring that the risk to any one individual is low.

REDUCING RISK AS LOW AS IS REASONABLY PRACTICABLE

In practice most risks will fall into the 'tolerability region', which will require those having a duty to control the risk to decide whether the risks have been reduced as low as is reasonably practicable (ALARP). The first step is to look at whether relevant good practice has been adopted.
Where relevant good practice is not established, duty-holders will be expected to apply risk-reducing measures balanced against their associated costs through a cost benefit assessment. In general the higher the risks the more the balance should tilt in favour of adopting the measures unless their costs are clearly excessive compared with the benefit of the risk reduction.
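Where the benefits are valued explicitly (see the next section), the balancing just described reduces to comparing the cost of a measure with the monetised value of the risk reduction it buys, with grossly disproportionate cost tipping the balance the other way. The sketch below uses the £8 per 1-in-100,000 annual risk reduction figure quoted later in this paper; the function, the disproportion factor and the example numbers are our assumptions.

```python
# Figure quoted later in the paper: GBP 8 buys a reduction in annual risk of
# death of 1 in 100,000, i.e. GBP 800,000 per statistical fatality averted.
VALUE_PER_UNIT_RISK = 8.0 / 1e-5

def measure_justified(annual_cost, risk_reduction_per_person, people_exposed,
                      disproportion=3.0):
    """Crude ALARP screen: adopt the measure unless its cost is grossly
    disproportionate to the monetised benefit (the factor is assumed)."""
    benefit = VALUE_PER_UNIT_RISK * risk_reduction_per_person * people_exposed
    return annual_cost <= disproportion * benefit

# Example: a guard costing GBP 2,000/yr that cuts each of 50 workers'
# annual risk of death by 2 x 10^-5.
print(measure_justified(2_000, 2e-5, 50))  # -> True: adopt the measure
```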
Placing a Monetary Valuation on Benefits

Cost benefit assessments can often be done on the basis of common sense judgement and without the explicit valuation of the benefits. However, there are also many situations where the benefits of reducing risk can be valued explicitly. This requires agreement on protocols, since the health and safety of people and protection of the environment are not things that are bought or sold. It is therefore difficult to find market transactions to gauge their value directly.
Nevertheless, many techniques have been developed for putting a monetary value on risk reductions to human life and health. For example, by comparing property values, it is possible to estimate how much people actually pay for cleaner air or more attractive scenery. Or again, by looking at how much more people are prepared to pay for a car fitted with an airbag, it is possible to gauge what people are prepared to pay for achieving a marginal reduction in the risk of death. Similarly, by examining how much extra is paid for undertaking particularly dangerous jobs, one can estimate what monetary compensation people require for a marginal increase in the risk of death. HSE, in undertaking cost benefit analyses, currently attributes a value of £8 for a reduction in risk of death of 1 in 100,000 per year - unless the risk is particularly dreaded or is near the margin of tolerability and subject to large uncertainties, when higher values are used.

CONCLUSIONS

Individuals tolerate different levels of risk depending on the benefits they think they will gain from taking the risks. Equally, society's tolerance of different risks varies dramatically for a whole variety of reasons, which have at their roots deep-seated psychological attitudes and beliefs. A philosophical framework which makes use of the results of a risk assessment can provide a structure for informing decisions on the risks that society and individuals are prepared to accept and for communicating how these decisions were reached. However, for the tool to work, all those who rely on it have to understand its nature and its limitations - in particular that it relates to typical rather than actual persons, and that the process is assumption- and value-laden and subject to uncertainties. In short, acceptance of decisions on risks has to take account of values established by informed debate and of people's willingness to tolerate risks in return for benefits.

HSE has developed and published such a framework, known as TOR, for establishing how far it is reasonable to regulate and control risks, having regard to established good practice; the impact of technological change; the uncertainties attached to risk estimates; and, not least, how people perceive risks and which ones they are prepared to tolerate. Since its publication, HSE has found it extremely useful for justifying its regulatory decisions, targeting the allocation of resources and reaching judgements on what is reasonably practicable when enforcing the law. It has also helped to make decisions informed by the results of a risk assessment more acceptable. More and more firms are becoming aware that the public mood has changed and that they have to be open and transparent about the risks they create and the measures they have in place for preventing or controlling those risks. TOR provides an established framework for putting such information in the public domain. Though it may not bring back the optimism of the 1900s, many have found it a most useful tool in their attempts to dispel the malaise surrounding their industry.
REFERENCES

Funtowicz, Silvio O. and Ravetz, Jerome R. (1990). Uncertainty and Quality in Science for Policy, Kluwer Academic Publishers, ISBN 0 7923 0799 2.
Health and Safety Executive (1992). The Tolerability of Risk from Nuclear Power Stations, page 32, paras 181-183, HMSO, London, ISBN 0 11 886368 1.
A2: Risk Perception
PUBLIC PERCEPTIONS OF RISKS ASSOCIATED WITH MAJOR INDUSTRIAL HAZARD SITES

A. Brazier¹, A. Irwin², C. Kelly³, L. Prince⁴, P. Simmons⁵, G. P. Walker⁶, B. Wynne⁵

¹Major Hazards Assessment Unit, Health and Safety Executive, Bootle, Merseyside, UK
²Department of Human Sciences, Brunel University, Uxbridge, Middlesex, UK
³Ergonomics and Work Psychology, Health and Safety Laboratory, Sheffield, UK
⁴INLOGOV, Birmingham University, Birmingham, UK
⁵Centre for the Study of Environmental Change, Lancaster University, Lancaster, UK
⁶Division of Geography, Staffordshire University, Stoke on Trent, UK
ABSTRACT

This paper reports on research commissioned by the Health and Safety Executive in the context of the forthcoming Seveso II Directive, to help understand how members of the public perceive the risks from major hazard installations. The research is based on comparative case studies of seven hazardous installations and uses complementary data collection and analytical methods, including contextual research, focus group discussions, siting scenarios and Q-sort exercises. The purpose, rationale, methodology and emerging results of this major 3-year research project are discussed.
KEYWORDS Risk, perception, tolerability, siting, policy, UK.
INTRODUCTION In the UK statutory controls over major hazards are designed to identify sites with a major hazard potential, to prevent and reduce the severity of accidents through requirements for safe operation, and to mitigate the consequences of accidents which do occur, through emergency planning, information to the public and land-use planning. In accordance with the role envisaged under the Seveso II Directive (Commission of the European Communities 1994, Walker 1995) the Health and Safety Executive (HSE) provides expert technical advice on risks that would be posed by new hazardous installations and risks from existing installations to proposed development in their vicinity. Planning advice is also given on the risks from notified pipelines and licensed explosives sites. The Seveso II Directive will require information on safety 37
measures and emergency planning to be made available to persons liable to be affected by a major accident originating from a major industrial hazard site. To prepare for the implementation of the Directive, HSE has commissioned an interdisciplinary research project to help understand:

• the level of public comprehension of hazards and risks from major hazard sites, for example, awareness of the nature and extent of these risks, and the sources of this information;
• public perception of these risks, that is, of the perceived likelihood of a major accident, compared with other perceived risks, and how this perception compares with assessed risk levels;
• the level of risk the public is willing to tolerate from major hazards, and the factors which influence attitudes, such as economic benefit, the public image of a company, good neighbour policies, etc.
PRACTICAL APPLICATION OF THE RESEARCH

The primary purpose of the research is to assist HSE and other Government Departments:

• to develop policies on the tolerability of risk
• to develop criteria for the siting of major hazards and the use of land within their vicinity
• to explain risk issues to local planning and hazardous substances authorities and to the public
• to evaluate the effectiveness and impact of legislation
• to provide a baseline of knowledge from which to measure future change.
THE RESEARCH IN CONTEXT

The research forms part of a wider HSE behavioural and social sciences research programme and coincides with both the revision of the HSE publication on the Tolerability of Risks from Nuclear Power Stations (HSE 1992), which is being redrafted to encompass other industries, and the revision of the 1989 HSE discussion document (HSE 1989), which suggested quantified risk criteria for land use planning within the vicinity of major industrial hazards and outlined the rationale underpinning these criteria. In the light of the many comments received on the discussion document and a further six years' experience of their use, HSE is revising the document to include more definitive approaches and criteria for control, including recommendations on:

• the siting of new hazardous installations
• the tolerable risks from existing installations
• the further development of land in the vicinity of such installations
• the tolerable risks to existing development in the vicinity of such installations
• the location of pipelines containing hazardous substances
• the control of developments (both new and existing) in the vicinity of such pipelines

and touching on:

• the local control of dangerous substances transportation risks (to dovetail with the recommendations made in the 1991 HSE report on the transportation of dangerous substances in major hazard quantities).
EXISTING RESEARCH BASE
Although major hazards have been recognised as a particular category of technological risk for over 20 years, there have been comparatively few studies of how the public perceive these risks. Much risk perception research has been concerned with how the public perceive multiple forms of risk in the abstract,
rarely including major hazard risks amongst the list of risks to be considered. Where work has been focused on particular forms of environmental risk, nuclear power and radioactive waste have figured most prominently (e.g. Van der Pligt 1992, Macgill 1987, Wynne et al 1993), with hazardous waste landfill and incineration also receiving significant attention in the US. One limitation of this latter body of work, in relation to the present research project, is that it deals with populations affected by persistent and often undetectable toxic risks of a kind that have a particular cultural resonance because of their associations with notions of 'contamination' that arouse particular public anxieties (Erikson, 1990; Beck, 1992). Although some large chemical installations may be perceived as presenting a similar risk to local populations, in the case of many major accident hazards the risk of this sort of toxic exposure (and its powerful cultural associations) is absent.

In the UK there has been a relatively small number of pieces of work concerned with major hazards. Social and Community Planning Research (Prescott-Clarke 1980, 1982) undertook a study of public perceptions of a range of risks incorporating analysis of perceptions of major hazards, through a small-scale, qualitative feasibility study on Teesside and a larger, nationally representative quantitative survey. The latter (Prescott-Clarke 1982) revealed a number of interesting patterns of perception but without analysing them in any real depth. For example, it was found that the nearer someone lived to a major industrial site the less likely they were to consider that such sites collectively posed a risk to the public, but the underlying reasoning behind this spatial trend was not explored. The discussion of the results of the focus group work on Teesside (Prescott-Clarke 1980) is in some ways more revealing, with selected quotes beginning to open up dimensions of trust in regulators, economic dependency and industry-community relations. Smith and Irwin (1984) undertook a survey of public perceptions of risk in the vicinity of major hazard industry in Halton in Cheshire, concluding that in neither of the two survey areas 'did the risks associated with factory accidents emerge as a major concern'. However, they questioned the value of attitudinal survey evidence, stressing the dangers of over-generalisation and the need to recognise the diversity and contextual nature of risk attitudes.

A number of studies have also considered public perceptions and responses in the context of information given out to the public under the CIMAH Regulations. These include research contributing to comparative European studies led by Brian Wynne (1987, 1990) and work undertaken by Jupp and Irwin (1989) around the Carrington complex in Manchester. Irwin (1995) discusses the results of questionnaire and semi-structured interview work around plants in Eccles and Clayton/Beswick, which found a generally high level of concern about factory accidents and pollution sitting alongside other concerns such as unemployment, crime and violence. The trust placed in various possible disseminators of information about hazards is explored, revealing a low level of trust in industry but also an overall pattern of scepticism and wariness about information sources.
It is stressed that both the hazards and their perception are very much embedded in the nature of the locality and the lives of local people, so that they are 'an intrinsic part of everyday social reality and the very identity of these areas' (ibid.).

Looking beyond the UK, a number of studies in the Netherlands have examined public perceptions of major accident hazards. Vlek and Stallen (1981) and Stallen and Tomas (1988) report on the results of research undertaken around the port of Rotterdam, using a number of different approaches to studying public risk perceptions. The initial study involved 700 people responding to a psychometric questionnaire asking about their personal judgements of various risky activities. Three pairs of cognitive dimensions underlying sets of individual judgements were identified - riskiness, beneficiality and acceptability - but Vlek and Stallen emphasise the limitations of the study (despite its scale), such as the problem of group average ratings hiding substantial differences in understanding and interpretation, and raise doubts over the meaning of comparisons made between major hazards and other very different types of risk. A later follow-on study in the Rotterdam area sought to extend this work by focusing on 'feelings of insecurity' when faced with the threat of a major accident, and on the importance of 'personal control' in the response made to insecurity. Four types of response to the threat of major hazards were identified - secure, accepting, vigilant and defensive - with their occurrence amongst the 600 interviewees analysed by spatial proximity to major hazard sites and by gender. Little variation across space was found, although women were found to be more anxious about the risks because they value matters of personal health and well-being more highly than men. In particular, Stallen
and Tomas argue that it is important to look at the context of statements made about risk, as the intensity of feeling about a threat is only one aspect of a more involved qualitative structure of the affect. In a study undertaken in the vicinity of the DSM chemicals complex in south Holland, with a control group further afield, Wiegman, Gutteling and Boer (1991) set out to explain the apparently greater level of acceptance of risk found in the vicinity of the hazard in Stallen and Tomas's study (a phenomenon also observed in survey research in the Halton study referred to above). Economic dependence upon the industry was found to play a part in this phenomenon but appeared to account for only a small proportion of the variation. They conclude that social learning theory - a cognitive theory which assumes that, through a process of experiential verification, the residents near the chemical complex are less affected by the 'biased' accounts of industry risks offered in the media - offers the best explanation of this effect.

When contrasted with the results of the research around nuclear power, radioactive and toxic waste sites, these studies pose interesting questions about the rather different patterns of local perception of major accident risks. Whilst there are clearly differences between the political culture of the UK and other countries, it may also be that the cultural valency, and therefore individual perceptions, of these different types of hazards plays a significant part in accounting for these different patterns of response. One limitation of most of these existing studies has been the use of a single research method or a focus on a single major hazard site (albeit with other control sites in the case of the Dutch studies). This is where we anticipate that the approach to be adopted in the present study, involving both multiple sites and a combination of psychological and socio-cultural research methods, will provide some further insight into the issues that have been raised.
APPROACH TO THE RESEARCH

The approach adopted by the research team to the study of how people experience the risks associated with major hazard sites sees the various dimensions of awareness, comprehension, perception and toleration as closely intertwined (to the extent that it may be very difficult to separate them). Our approach also sees this experience as being formed in a specific local context. It follows that central to developing an understanding of risk perception will be an exploration of the interaction of risk perceptions with attitudes towards other aspects of industry, the local area, local regulatory bodies and the broader social and economic context within which the local community is situated. This perspective in turn influences the research methods chosen and the overall research design. The comparative research design is also intended to enable us to investigate a number of more specific issues raised by recent risk perception research in Britain, Europe and the USA. In particular, within the overall aims specified earlier, the research will seek to:

• draw out the extent of diversity, consistency and stability of perceptions amongst different publics, rather than seeking only some homogenised view of a putative local 'community';
• explore the differing roles of national and local level influences on perceptions of risk from industrial sites;
• examine the role of dimensions of credibility and trust in industry and regulators in risk perceptions;
• examine the influence of implicit senses of empowerment or powerlessness held by members of the public on their interest in, and willingness to take up, information about hazards and emergency arrangements.
RESEARCH DESIGN

Case study site selection
Seven case study major hazard sites are being examined. These have been selected by the research team in consultation with the HSE, in order to provide variety in the following characteristics:
• the nature of the hazard at the site (toxic, explosive/flammable)
• the designation of the site as a top-tier 'Seveso Directive' installation, referred to in the UK as a CIMAH (Control of Industrial Major Accident Hazards Regulations 1984) site, or as a lower inventory installation, referred to in the UK as a NIHHS (Notification of Installations Handling Hazardous Substances Regulations 1982) site. CIMAH sites will have disseminated information to local publics, whilst there is no such requirement for NIHHS sites
• the physical size of the installation (ranging from a major complex to a warehouse)
• the length of time the installation has existed
• the history of accidents or publicised incidents at the plant
• the socio-economic characteristics of the surrounding population (derived from census data)
We were also constrained by the need to have a reasonably substantial population living near to the hazardous site. For this purpose we used census data to estimate how many people lived within the 'consultation distance' (CD) specified around each installation by the HSE (a sketch of this estimation step is given after Table 1). We have so far completed field work in three of the case study areas and details of these are shown in Table 1. Further case study sites are to include a major chemical/petrochemical complex, a warehouse holding ammonium nitrate and a stretch of 'major hazard' pipeline.
TABLE 1
CASE STUDY SITES

Company              Site                                Hazard type   No. of employees  Size      Year established  Extent of CD  Population in CD
Albright and Wilson  Langley, Sandwell, West Midlands    CIMAH, toxic  600               55 acres  1851              750-1000 m    7,980
Allied Colloids      Low Moor, Bradford, West Yorkshire  CIMAH, toxic  2000              42 acres  1953              1300 m        1,890
Rohm and Haas        Jarrow, South Tyneside              NIHHS, toxic  220               14 acres  1955              400 m         1,950
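As a concrete illustration of the estimation step mentioned above, the following is a minimal sketch of a census-based count of population within a consultation distance: census units whose centroids fall within the CD around a site are summed. The coordinates, CD value and population figures are invented for illustration; the paper does not describe the project's actual census processing.

```python
# Hypothetical sketch: count residents of census units whose centroids
# fall within the consultation distance (CD) around an installation.
# All coordinates and counts below are illustrative, not project data.
from math import hypot

# (x, y) grid coordinates in metres, and resident count per census unit
census_units = [
    ((500.0, 200.0), 310),
    ((900.0, -400.0), 120),
    ((2500.0, 1800.0), 450),
]
site = (0.0, 0.0)      # installation location
cd_metres = 1300.0     # e.g. a CD of 1300 m, as for Allied Colloids

population_in_cd = sum(
    pop for (x, y), pop in census_units
    if hypot(x - site[0], y - site[1]) <= cd_metres
)
print(f"Estimated population within CD: {population_in_cd}")
```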
Contextual research
The initial stage of the work in each case study area is to carry out contextual research. We see this as an important element in the research design. It involves, first of all, a search of information sources held in local studies libraries and archives. These include local press reports, local histories, industrial and company histories, local development plans and Ordnance Survey maps of the area. This research is supplemented by interviews with key local actors, including site management at the company, the HSE inspector responsible for the site, emergency planners, land use planners and the chair of the local liaison committee, together with other people from the local community such as head teachers of schools near to the site or, in some cases, local councillors.

Focus groups
The main method for finding out how people perceive, understand and feel about risk is through focus group discussions. The advantage of focus groups as a method for this kind of research is that they provide a context in which people interact with one another. This creates an opportunity to observe the way
in which people's views are formulated, expressed and challenged. It establishes a far more dynamic and in certain ways more revealing research situation than a one-to-one interview or a survey. Focus groups have not been widely used in risk perception research (Wynne et al 1993) but are extensively used in social research more generally. The approach that we have developed involves recruiting 6 groups of 6-8 local people in each area (without informing them of the principal themes of the research). Each group meets twice. Each meeting lasts ninety minutes and the two meetings take place about three weeks apart. In the first meeting a discussion guide is used and participants are led through a list of topics, beginning with their views about the area and gradually focusing in on the major hazard site. At the end of the meeting the Q-sort exercises (see below) are demonstrated to participants and they complete these in the period between the two meetings. The second meeting begins with a discussion of three stimulus sheets containing quotations which express different points of view about (a) the provision of public information about industrial hazards, (b) the regulation of industry and (c) who should be responsible for setting safety standards. In the second half of the meeting we employ two sets of materials which use planning and facility siting scenarios to stimulate discussion of a variety of risk issues, including questions of risk assessment criteria and societal risk. The first of these scenarios involves deciding about the location of new housing in the vicinity of an existing hazardous installation in an already built up area. The second relates to a choice between two possible locations for a hazardous plant extension.
Q-method

Q-method is a pattern-analytic technique developed for the study of subjectivity. It has been used in a range of disciplines, most notably political science, but has been employed to only a limited degree in previous studies of risk perception. In this project we have produced two Q-sort exercises which all the people involved in the focus groups undertake. One is concerned with people's sense of place; the second with their view of the local major hazard site. Using Q-method first involved the collection of statements about major hazard sites from a range of different sources. A sample representing the range of statements was then created, and from this we produced a set of cards, each containing a single statement related to the local major hazard site. Participants were asked to sort the cards into a predetermined pattern according to the strength of their agreement or disagreement and to record the resulting configuration on a form provided. The resulting arrangement of cards for all participants is then subjected to factor analysis to reveal any underlying patterns or factors. These factors represent distinct points of view which, to a greater or lesser extent, are held in common by some of the participants. Interpretation of these factors is subsequently validated by returning to interview a few individuals whose Q-sorts exemplify a particular factor or point of view. Some contrasting examples of the statements that people are asked to sort on major hazards are as follows:

'An industrial accident that would harm local residents is extremely unlikely.'
'The safety of local people is a top priority for Allied Colloids.'
'No-one forces people to live here - if they think it's too dangerous they can always move out.'
'The chemical industry is just a threat to everyone's health and safety.'
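To make the factor-analytic step concrete, here is a minimal numerical sketch of the kind of analysis Q-method implies: participants' sorts are correlated with one another, and the person-by-person correlation matrix is factored to reveal shared points of view. The sort data are invented, and the use of principal components is an assumption on our part; the paper does not specify the extraction method used.

```python
# A minimal sketch of a Q-method analysis: correlate persons (not items)
# and factor the person-by-person correlation matrix. Data are invented.
import numpy as np

# rows = participants, columns = statements; entries are the agreement
# scores (-3 .. +3) each participant assigned when sorting the cards
sorts = np.array([
    [ 3,  2, -1, -3,  0,  1, -2],
    [ 2,  3,  0, -2, -1,  1, -3],
    [-3, -2,  1,  3,  0, -1,  2],
    [-2, -3,  0,  2,  1, -1,  3],
])

corr = np.corrcoef(sorts)             # correlations between participants
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]     # strongest factors first

# loadings of each participant on the two strongest factors; participants
# with similar sorts load on the same factor (a shared point of view)
loadings = eigvecs[:, order[:2]] * np.sqrt(eigvals[order[:2]])
print(np.round(loadings, 2))
```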
PRELIMINARY ANALYSIS AND RESULTS As the research project is yet to be fully completed, discussion of results must be preliminary and tentative. However certain themes have emerged from the focus group sessions and analysis of the Q-sort data is initially revealing some common patterns of response.
Focus groups: comparative themes and issues

Initial thematic analysis of the focus group data has identified a number of issues and themes relevant to our understanding of people's perceptions and toleration of risks associated with these sites. These issues appear in various ways across all three case studies. They are:

• the influence of what we have called 'local memory' on perceptions of a site, this collective memory sometimes extending back a long time and passing from generation to generation;
• the importance of experience of incidents in shaping local discussion and perceptions, and relationships between the company and the local community;
• the way in which inferences about risk are drawn from sensory 'evidence' such as smells, visible plumes, the general appearance of the site and the quality of local nature;
• the way in which people reason about risk, in relation to probability and consequence arguments, trade-offs between risk and other concerns, and the scope for individual choice in risk-taking;
• the stigmatisation that can result from the presence of a hazardous site and the implications this has for local identity;
• the variety of images that people hold of the company, and the range of ways in which initiatives by the company, such as open days, community newsletters and liaison groups, are responded to;
• the strength of views on regulation and regulators, and the expectations of and trust in regulatory practice that are expressed.
Q-sort data

Our initial analysis of the Q-sort data has found that in each area two clearly differentiated, orthogonal factors, or points of view, emerged. In each case the strongest of these was a point of view which was in general distrustful of the company and had little confidence in the regulators or in the emergency services. This point of view was also characterised by concern about the risks associated with the site. The second factor represented a contrary point of view which, in general, was more trusting in the competence of the company, the regulatory authorities and the emergency services, and which consequently tended to be more tolerant of the risks. Significantly, very few of the participants in these case studies appeared to hold such a robust view. A comparative analysis of the factors identified for all three sites found them to correspond very closely, confirming that we were finding very similar patterns of response at all three sites. The most significant differences between the factors can be accounted for by specific differences in the local contexts.

CONCLUSION

The experience of bodies dealing with risk concerns has repeatedly shown the importance of developing a better understanding of public risk perceptions. However, despite much research over the past twenty years, that understanding remains somewhat elusive. There is a need for research which deals with the perception of particular forms of risk in the settings and contexts in which these are experienced, and which more explicitly recognises that living with risk is very much part of everyday life for many people. In the case of major accident hazards we have argued that such research must involve a locally focused and contextualised approach, which allows people, as much as possible, to discuss and express points of view in ways which they define and determine. Such an approach is also needed if we are to develop a deeper understanding of why particular perceptions of risk are held, and of the stability and certainty of particular points of view. The use of focus groups allied with Q-method described in this paper provides an innovative and effective approach to meeting these objectives, and to the formulation of a range of recommendations for the development of major hazard policy and regulation in the UK.
REFERENCES

Beck, U. (1992) Risk Society: Towards a New Modernity, Sage, London, UK.
Commission of the European Communities (1994) Proposal for a Council Directive on the control of major accident hazards involving dangerous substances (COMAH), COM(94) 4 final, Brussels, 26.01.94.
Erikson, K. (1990) Toxic Reckoning: Business Faces a New Kind of Fear, Harvard Business Review, 90:1, pp. 118-126.
Health and Safety Executive (1989) Risk Criteria for Land Use Planning in the Vicinity of Major Industrial Hazards, HMSO, London, UK.
Health and Safety Executive (1992) The Tolerability of Risk from Nuclear Power Stations, HMSO, London, UK.
Irwin, A. (1995) Citizen Science: A Study of People, Expertise and Sustainable Development, Routledge, London, UK.
Jupp, A. and Irwin, A. (1989) Emergency Response and the Provision of Public Information under CIMAH, Disaster Management, 1:4.
Macgill, S.M. (1987) The Politics of Anxiety, Pion, London, UK.
Prescott-Clarke, P. (1980) Public Attitudes Towards the Acceptability of Risks, Social and Community Planning Research, London, UK.
Prescott-Clarke, P. (1982) Public Attitudes Towards Industrial Work Related and Other Risks, Social and Community Planning Research, London, UK.
Smith, D. and Irwin, A. (1984) Public Attitudes to Technological Risk: the Contribution of Survey Data to Public Policy Making, Transactions of the Institute of British Geographers, 9.
Stallen, P.J.M. and Tomas, A. (1988) Public Concern about Industrial Hazards, Risk Analysis, 8:2, pp. 237-245.
Van der Pligt, J. (1992) Nuclear Energy and the Public, Blackwell, Oxford, UK.
Vlek, C. and Stallen, P.J. (1981) Judging Risks in the Small and the Large, Organizational Behaviour and Human Performance, 28, pp. 235-271.
Walker, G.P. (1995) Land use planning, industrial hazards and the 'COMAH' Directive, Land Use Policy, 12:3, pp. 187-191.
Wiegman, O., Gutteling, J.M. and Boer, H. (1991) Verification of Information through Direct Experiences with an Industrial Hazard, 12:3, pp. 325-339.
Wynne, B. (1987) Implementation of Article 8 of the EC Seveso Directive: A Study of Public Information, Report to the European Commission, DG XI.
Wynne, B. (1990) Empirical Evaluation of Public Information on Major Industrial Accident Hazards, Report to EC Joint Research Centre, Ispra.
Wynne, B. et al. (1993) Public Perceptions and the Nuclear Industry in West Cumbria, Centre for the Study of Environmental Change, University of Lancaster, Lancaster, UK.
SOCIETAL RISK AND THE CONCEPT OF RISK AVERSION

J.K. Vrijling and P.H.A.J.M. van Gelder
Department of Civil Engineering, Delft University of Technology, Stevinweg 1, 2600 GA Delft, The Netherlands
ABSTRACT

It seems generally accepted that the FN-curve is a fairly accurate description of the societal risk. However, in the communication with the public and representative decision makers, a schematisation of the FN-curve to one or two numbers may bring certain advantages. Various measures, like the Potential Loss of Life, the area under the FN-curve, the Risk Integral, etc., are proposed in the literature. Although the formulae look distinctly different at first sight, a more thorough inspection reveals that all schematisations contain as building blocks the two familiar statistical moments of the FN-curve: the expected value of the number of deaths E(N) and the standard deviation σ(N). In the paper the linear combination E(N) + k·σ(N) is proposed as a simple risk averse measure of the societal risk.
KEYWORDS
Risk analysis, acceptable risk, societal risk, individual risk, risk aversion, decision making.
INTRODUCTION
There is general agreement in the literature and in regulatory circles that risk should at least be judged from two points of view (VROM (1988, 1992), HSE (1989)). The first point of view is that of the individual, who decides to undertake an activity weighing the risks against the direct and indirect personal benefits. This first point of view leads to the personally acceptable level of risk, or the acceptable individual risk, defined as "the frequency at which an individual may be expected to sustain a given level of harm from the realisation of specified hazards". The specified level of harm is narrowed down to the loss of life in many practical cases. The second point of view is that of the society, considering the question whether an activity is acceptable in terms of the risk involved for the total population. Commonly the notion of risk in a societal context is reduced to the total number of casualties (VROM (1988, 1992), HSE (1989)), using a definition as by IoCE (1985): "the relation between frequency and the number of people suffering from a specified level of harm in a given population from the realisation of specified hazards". If the specified level of harm is narrowed down to the loss of life, the societal risk may be modelled by the frequency of exceedance curve of the number of deaths due to a specific hazard, also called the FN-curve.
The FN-curve can be seen as an exceedance curve with a related probability density function (p.d.f.) of the number of deaths. The p.d.f. of the number of deaths N_dij, given an accident for activity i at place j, can have many forms. A few types are presented here to stimulate further thinking. The first conditional p.d.f. is the Dirac, which limits the outcomes to exactly N fatalities. Other possibilities that allow a larger variation in the outcome are the exponential and the log-normal p.d.f. The probability of exceedance curves of the number of fatalities that can be derived from these two forms reflect to some extent the FN-curves found in practical quantitative risk assessment (QRA) studies. A fourth is the inverse quadratic Pareto distribution, which coincides precisely with the norm put forward by the Ministry of VROM (1988). The Pareto p.d.f. has no finite standard deviation unless the right tail is truncated (Fig. 1). Exactly the same models could be applied to the material damage that results from a disaster, if the horizontal axis is measured in monetary units. It should be noted that the proposed conditional p.d.f.'s have to be multiplied by the probability p of an accident, and that the outcome of zero fatalities with probability (1 - p) should be added to find the complete p.d.f. of the number of deaths (Fig. 1). The classical measures of expected value and standard deviation will appear to be very useful numbers to classify the risk.
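The following is a small numerical sketch of this construction: a truncated inverse-quadratic Pareto conditional p.d.f. is mixed with an accident probability p (leaving mass 1 - p at zero deaths) and summarised by E(N) and σ(N). The value of p, the truncation bounds and the sample size are illustrative assumptions only, not values from any QRA study.

```python
# Sketch: complete p.d.f. of deaths = (1-p) at zero, plus, with probability
# p, a truncated inverse-quadratic Pareto (survival (n_min/x)^2 on
# [n_min, n_max]).  E(N) and sigma(N) are estimated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
p = 1e-3                       # probability of an accident per year (assumed)
n_min, n_max = 10.0, 5000.0    # truncation bounds of the Pareto tail (assumed)

# inverse-CDF sampling of the truncated Pareto conditional distribution
u = rng.random(1_000_000)
trunc = 1.0 - (n_min / n_max) ** 2
cond_deaths = n_min / np.sqrt(1.0 - u * trunc)

accident = rng.random(u.size) < p
deaths = np.where(accident, cond_deaths, 0.0)   # complete p.d.f. incl. zero

print(f"E(N)     = {deaths.mean():.5f} per year")
print(f"sigma(N) = {deaths.std():.5f} per year")
```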
Figure 1: A theoretical p.d.f. and probability of exceedance curve for the number of deaths (inverse quadratic Pareto).

A controversy is found on the way to judge and limit the societal risk. Many apparently different judgemental numbers and normative decision rules can be discerned, as will be shown below. Also, the question whether these numbers and rules include risk aversion is the subject of debate. Some analysis makes clear, however, that the rules and numbers proposed in the literature fall into two categories: risk neutral or risk averse. The direction of development seems to be towards risk averse measures, although this trend is somewhat obscured by mathematical complexity.
DIFFERENT MEASURES AND LIMITATIONS OF SOCIETAL RISK

It seems generally accepted that the FN-curve is a fairly accurate description of the societal risk. However, in the communication with the public and representative decision makers, a schematisation of the FN-curve to one or two numbers may bring certain advantages. As to the limitation of the societal risk to acceptable levels, many different rules are proposed by scientists and regulatory bodies. Disagreement is found especially on the question whether societal risk should be judged with a risk averse or a risk neutral attitude. One of the oldest simple measures of societal risk is the Potential Loss of Life (PLL), which is defined as the expected value of the number of deaths per year:
\[
E(N) = \int_0^\infty x \, f_{N_{dij}}(x) \, dx \qquad (1)
\]
where f_{N_{dij}} is the p.d.f. of the number of deaths resulting from activity i in place j in one year.
Ale (1996) has proposed the area under the FN-curve as a simple measure of societal risk. Although this is not immediately clear, it can be mathematically proven that the area under the FN-curve equals the expected value of the number of deaths (appendix 1).
\[
E(N) = \int_0^\infty \left( 1 - F_{N_{dij}}(x) \right) dx \qquad (2)
\]
An absolute limit to the expected value of the number of deaths is not mentioned in the literature. The use of the expected value seems very valuable in the comparison of various alternatives. VROM (1988) limits the societal risk at plant level by a line that is inversely proportional to the square of the number of deaths. This absolute requirement, which formed the basis for the regulation and siting of hazardous installations and new developments in the Netherlands during the last decade, reads:
\[
1 - F_{N_{dij}}(x) < \frac{10^{-3}}{x^2} \qquad (3)
\]
for x > 10 deaths, where F_{N_{dij}} is the c.d.f. of the number of deaths resulting from activity i in place j in one year (the subscript dij will be omitted further on). The HSE (1989) remarks that the judgement of the societal risk at plant level by the VROM-rule is overly risk averse. HSE proposes to change the value of the exponent in the expression from 2 into 1, in order to form a more even judgement. In recent papers Cassidy (1996) of HSE defined the risk integral RI as an appropriate measure of societal risk that should be further explored:

\[
RI = \int_0^\infty x \left( 1 - F_N(x) \right) dx \qquad (4)
\]
A limiting value is, however, not yet attached to this new concept. Vrijling (1995) notes that the societal risk should be judged on a national level, by limiting the total number of casualties in a given year in the following way:
\[
E(N_{di}) + k \, \sigma(N_{di}) < \beta_i \cdot 100 \qquad (5)
\]
where k is the risk aversion index. Formula (5) accounts for risk aversion, which will certainly influence acceptance by a community or a society. Relatively frequent small accidents are more easily accepted than one single rare accident with large consequences, although the expected number of casualties is equal in both cases. The standard deviation of
the number of casualties will reflect this difference. The risk aversion is represented mathematically by increasing the expectation of the total number of deaths, E(N_di), by the desired multiple k of the standard deviation before the situation is tested against the norm. Rule (5) can be transformed into a similar expression valid at plant level by taking into account the number of independent installations N_A. It can also be transformed mathematically into a VROM-type of rule applicable at plant level, as shown in the same paper:

\[
1 - F_{N_{di}}(x) \leq \frac{C_i}{x^2} \quad \text{for all } x \geq 10,
\qquad \text{where } C_i = \left[ \frac{\beta_i}{k} \sqrt{\frac{100^2}{N_A}} \right]^2 \qquad (6)
\]
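As an arithmetic illustration (our own worked check, using the values quoted in the next sentence), substituting β_i = 0.03, k = 3 and N_A = 1000 into Eqn. 6 gives

\[
C_i = \left[ \frac{0.03}{3} \sqrt{\frac{100^2}{1000}} \right]^2
    = \left( 0.01 \times 3.16 \right)^2 \approx 1.0 \cdot 10^{-3},
\]

i.e. exactly the VROM limit 10^{-3}/x^2 of Eqn. 3.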
For values of β_i = 0.03, k = 3 and N_A = 1000 the rule equates exactly to the VROM-rule, which appears to be a specific case in a more general framework. Bohnenblust (1996) judges the number of casualties after correction with a factor φ(x) in an economic framework. Weighing the societal risk SR in the light of the cost of measures to improve safety, an optimal decision is reached. Changing the summation into an integral, the expression proposed by Bohnenblust reads:
\[
SR = \int_0^\infty x \, \varphi(x) \, f_N(x) \, dx \qquad (7)
\]
Although not explicitly stated by Wehr and Bohnenblust (1995), it can be deduced from a graph in the paper that φ(x) = √(x/10), so the SR measure could be expressed as:

\[
SR = \int_0^\infty \frac{x^{1.5}}{\sqrt{10}} \, f_N(x) \, dx \qquad (8)
\]

THE RISK ATTITUDE OF THE VARIOUS RULES
First it should be stated that in this paper a decision based on the expected value only is called risk neutral. Risk neutrality can be modelled with a linear utility function. In the case of a risk averse attitude, a smaller standard deviation is preferred above a larger one when the expected values are equal. In the literature this is frequently modelled by quadratic utility functions. To show the principle, the expected utility is evaluated below for a linear and a quadratic utility function:

\[
\int_0^\infty a x \, f_N(x) \, dx = a \, E(N),
\qquad
\int_0^\infty a x^2 \, f_N(x) \, dx = a \left( E(N)^2 + \sigma(N)^2 \right) \qquad (9)
\]

In the case of risk aversion the standard deviation starts to play a role. However, the strict application of quadratic utility curves has the disadvantage that the units become [death²/year], which is difficult to communicate to the public and most probably also to decision makers.
Using the concept of risk attitude, the PLL and the area under the FN-curve, which are both equal to E(N), can be classified as risk neutral measures of the societal risk. The rule proposed by Vrijling (1995), containing E(N) + k·σ(N), is clearly risk averse. Consequently the VROM-rule, which was proven to be a special case of this rule, can be similarly classified as risk averse. If the exponent of the VROM-rule is changed into 1, as proposed by HSE, only the expected value of the number of casualties is limited, which according to the definition given above should be called risk neutral. The measure proposed by Bohnenblust (1995, 1996) has an intermediate position, with an exponent of 1.5. It is very interesting to note that it can be mathematically proven (Appendix 2) that the risk integral proposed by Cassidy (1996) of HSE equals:

\[
RI = \int_0^\infty x \left( 1 - F_N(x) \right) dx = \frac{1}{2} \left( E^2(N) + \sigma^2(N) \right) \qquad (10)
\]

Apparently the need for a simple risk averse measure to schematise the FN-curve is also felt in the United Kingdom. A disadvantage of the risk integral RI might be that the units are [death²/year], and some difficulty will be met in formulating an easy to understand limiting value.
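As a numerical illustration of this classification, the sketch below (our own, with invented frequencies and death tolls, and treating small annual frequencies as probabilities - a rare-event approximation) compares a frequent small accident with a rare large one of equal E(N). The risk neutral measure cannot separate the two hazards, while E(N) + k·σ(N) and RI both penalise the rare large accident.

```python
# Two hazards with equal E(N) but different spread: risk neutral vs
# risk averse measures. Frequencies and death tolls are invented.
import math

def moments(scenarios):
    """scenarios: list of (annual frequency, deaths per event)."""
    e_n = sum(f * n for f, n in scenarios)
    e_n2 = sum(f * n * n for f, n in scenarios)   # E(N^2), rare-event approx.
    return e_n, math.sqrt(e_n2 - e_n ** 2)

frequent_small = [(1e-2, 1)]     # one death, once per hundred years
rare_large = [(1e-5, 1000)]      # a thousand deaths, once per 100,000 years

for name, sc in [("frequent small", frequent_small),
                 ("rare large", rare_large)]:
    e, s = moments(sc)
    print(f"{name:14s}: E(N)={e:.2e}  E+3*sigma={e + 3 * s:.2e}  "
          f"RI={0.5 * (e**2 + s**2):.2e}")
```

Both hazards give E(N) = 0.01 deaths/year, but E(N) + 3σ(N) and RI are roughly thirty and a thousand times larger, respectively, for the rare large accident.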
CASE STUDY

The half of Holland that lies below sea level is divided into N_A = 40 more or less independent polders surrounded by dike-rings. If it is assumed that at some future date each polder will house N_pij = 1,000,000 inhabitants, an estimate of the number of casualties in case of flooding can be made. In 1953 approximately 1% of the inhabitants drowned, giving a value of p_d|ij = 0.01. Little is known of the influence of modern technological development on this number, but the failure of energy and communication networks during the minor floods in Limburg points to a limited beneficial influence. The expected value and the standard deviation of the number of deaths in 40 independent polders per year are equal to:

\[
E(N_{di}) = 40 \, p_{fij} \cdot 10^{-2} \cdot 10^6,
\qquad
\sigma^2(N_{di}) = 40 \, p_{fij} \left( 1 - p_{fij} \right) \left( 10^{-2} \cdot 10^6 \right)^2 \qquad (11)
\]

If these expressions are substituted in the norm of Eqn. 5, the solution for β_i = 1 becomes p_fij = 3·10⁻⁷ per year. In case the aversion of the inhabitants against flooding is more extreme and β_i = 0.1, the acceptable probability of failure of the dike ring is p_fij = 3·10⁻⁹ per year.
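The substitution can be made explicit; the following worked steps are our own illustration of the calculation, neglecting the expected-value term of Eqn. 5, which is small compared with the k·σ term for such small failure probabilities. With k = 3 and β_i = 1:

\[
3 \sqrt{40 \, p_{fij}} \cdot 10^{-2} \cdot 10^{6} < 100
\;\Rightarrow\;
40 \, p_{fij} < \left( \frac{100}{3 \cdot 10^{4}} \right)^{2}
\;\Rightarrow\;
p_{fij} < 2.8 \cdot 10^{-7} \approx 3 \cdot 10^{-7}.
\]

For β_i = 0.1 the right-hand side of Eqn. 5 is ten times smaller, so the bound on p_fij becomes a hundred times smaller, approximately 3·10⁻⁹ per year.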
Figure 2: FN-curve for flooding of the Brielse polder (probability of exceedance versus number of fatalities N).

For the Brielse dike ring near Rotterdam an FN-curve (Fig. 2) has been drawn, estimating the probability of failure of the existing dikes at 10⁻⁴ per year. The FN-curve shows that there are five equally likely scenarios, with death counts varying from 15 to approximately 5000 people. As these scenarios are assumed to be independent, the combinations that claim even more casualties are less likely by an order of magnitude. Developing the local criterion for a dike ring using the values mentioned above, the constant becomes C_i = 27.8 to 0.278 for β = 1 to 0.1. Thus the present situation, based on the philosophy developed by the Deltacommittee (1960), seems insufficiently safe in the light of modern developments. Following the normative framework developed here, the acceptable probability of failure of the dike equals 6.3·10⁻⁷ to 6.3·10⁻⁹, depending on the value of β.
CONCLUSIONS

Although the FN-curve, the exceedance curve of the number of deaths, is generally accepted as a clear representation of the societal risk regarding the loss of human life, a search for one simple number to express the societal risk can be observed. Several schematisations of the FN-curve have been proposed. Some of these schematisations are completed with a limiting value to provide a decision rule. Although the formulae look distinctly different at first sight, a more thorough inspection reveals a common approach that promises a relatively rapid convergence of opinions. It appears that all schematisations contain as building blocks the two familiar statistical moments of the FN-curve: the expected value of the number of deaths E(N) and the standard deviation σ(N). The Potential Loss of Life (PLL) measure and the area under the FN-curve, as a simple measure of the societal risk, are both equal to E(N). However, no absolute use of this measure is mentioned, and consequently no limiting value is reported in the literature. Using the concept of risk aversion it was shown that the most recently proposed ways to judge societal risk are all risk averse. The well known VROM-rule, which limits the FN-curve by 10⁻³/N², appears to be a special case of a more general rule proposed by Vrijling (1995), which limits the societal risk at the Dutch national level by E(N) + k·σ(N) < β·100.
The criticism of the VROM-rule, that the exponent of 2 is overly risk averse and that a value of 1 should be preferred, leads to a limitation of E(N) only. Limiting the FN-curve by C/N places an upper bound on E(N) and must be classified as a risk neutral approach. The risk integral RI, recently proposed by HSE as an alternative measure needing further investigation, is shown to be equal to ½{E²(N) + σ²(N)}. Thus the risk integral should be classified as a risk averse measure of societal risk. The units [death²/year] in which RI is expressed may hamper the understanding of this measure by the public. A linear expression like E(N) + k·σ(N), with units [death/year], is preferred. In addition, the relatively simple relation between this measure and a VROM-type of rule provides the possibility of unification. Attention should be paid to the fact that some rules (e.g. Vrijling (1995)) limit the societal risk at a national level, while most others (VROM, HSE) address the risk at plant level. Because "many small unrestrained developments could add up to a noticeable worsening of the overall situation" (HSE (1989)), the societal risk should be limited in a concerted way at both national and local level. In the approach of Vrijling (1995) the societal risk is limited at national level and consequently, taking the number of hazardous installations into account, a VROM-type of rule is derived for plant level.
LITERATURE

Ale, B. (1996). Zoning instruments for major accident prevention. Proc. ESREL/PSAM, Crete, p. 1911.
Bohnenblust, H. (1996). Integrating technical analysis and public values in risk based decision making. Proc. ESREL/PSAM, Crete, p. 1911.
Cassidy, K. (1996). Risk criteria for the siting of hazardous installations and the development in their vicinity. Proc. ESREL/PSAM, Crete, p. 1892.
Deltacommittee (1960). Deltareport, Den Haag, The Netherlands.
HSE (1989). Risk criteria for land-use planning in the vicinity of major industrial hazards. HM Stationery Office.
Institute of Chemical Engineering (1985). Nomenclature for hazard and risk assessment in the process industries. ISBN 85 295184 1.
VROM (1988). Dutch National Environmental Plan, The Hague.
VROM (1992). Relating to risks (in Dutch), The Hague.
Vrijling, J.K. et al. (1995). A framework for risk evaluation. Journal of Hazardous Materials, 43, pp. 245-261.
Wehr, H., Fernaud, C. and Bohnenblust, H. (1995). Risk analysis and safety concept for new long railway tunnels in Austria. Proc. Safety in Road and Rail Tunnels, Granada, p. 3.
APPENDIX 1 (PROOF OF EQN 2)
\[
E(N) = \int_0^\infty x \, f_N(x) \, dx
     = \int_0^\infty \left( \int_0^x dy \right) f_N(x) \, dx
     = \int_0^\infty \int_y^\infty f_N(x) \, dx \, dy
     = \int_0^\infty \left( 1 - F_N(y) \right) dy
\]
APPENDIX 2 (PROOF OF EQN 10)

\[
\sigma^2(N) = \operatorname{var} N = E(N^2) - E^2(N),
\quad \text{so} \quad
E^2(N) + \sigma^2(N) = E(N^2).
\]
Furthermore,
\[
E(N^2) = \int_0^\infty x^2 f_N(x) \, dx
       = \int_0^\infty \left( \int_0^x 2y \, dy \right) f_N(x) \, dx
       = \int_0^\infty 2y \int_y^\infty f_N(x) \, dx \, dy
       = 2 \int_0^\infty y \left( 1 - F_N(y) \right) dy,
\]
so
\[
RI = \int_0^\infty x \left( 1 - F_N(x) \right) dx
   = \tfrac{1}{2} \left( E^2(N) + \sigma^2(N) \right).
\]
FROM RISK ANALYSIS TO RISK PERCEPTION: DEVELOPING A RISK COMMUNICATION STRATEGY FOR A DAM-BREAK FLOOD RISK

M.L. Lima¹, A. Betâmio de Almeida² and D. Silva³

¹Dept. of Organisational and Social Psychology, I.S.C.T.E., Av. Forças Armadas, 1600 Lisboa, Portugal
²Dept. of Hydraulics and Water Resources, I.S.T., Technical University of Lisbon, Av. Rovisco Pais, 1600 Lisboa, Portugal
³GES - Social Ecology Group, L.N.E.C., Av. do Brasil, 101, 1799 Lisboa Codex, Portugal
ABSTRACT
This paper describes the process of development of a risk communication strategy to implement the first structured flood warning system in Portugal in case of dam failure. This research project was funded through the NATO Science for Stability Program (NATO PO-FLOOD RISK Project). The pilot warning system was located along two sequential dams in Arade River (Algarve), which will enhance the safety of the population living in downstream valleys. The five phases of the process can be briefly described as: (a) Risk analysis and definition of inundation maps for different dam failure scenarios; (b) Descriptive analysis of human occupation of the risk area; (c) Exploratory analysis of expert and lay discourses about dam safety and risk; (d) Survey study on the public perception of dam related risk, and preventive behaviour against floods in general; (e) Definition of a communication strategy to establish a flood warning system.
KEYWORDS
risk perception, risk communication, inundation maps, population at risk, dam break, warning system
INTRODUCTION
The concept of environmental education has been used to refer to strategies which aim at making people aware of environmental problems and at changing behaviour to alleviate them. However, within this general concept, only specific types of environmental problems are usually considered: those that encourage environmentally responsible behaviour (such as conserving water and energy, recycling, etc.). Preventive action to minimise the consequences of a natural or technological disaster is not often seen as a problem within this conceptual area. Instead, it has frequently been conceived as a technical problem in which structural and technological solutions are evaluated in terms of engineering criteria (Sime, 1995), and the population at risk is conceived as a mere target of the final design of a warning system. The different approach to the two problems has important consequences for the way communication strategies are defined. While, in the first case, it is widely recognised that effective communication must take into account the needs,
expectations, values and prior behaviour of the public, in the second case communication is conceived only from the experts' point of view. This technical approach systematically promotes low levels of information about the risk among the public, in contrast to the custodians of that information, and, in case of disaster (for example a flood), the public in the impacted area is often not warned beforehand, even though the flood has been detected or forecast in advance (Handmer, 1988). In our research, and contrary to the above view of risk communication, we support an alternative approach which, as Sime (1996) puts it, considers the perspective of the public as a starting point in a warning system evaluation. Under this perspective, an effective warning system cannot be defined only in terms of the phenomena and the forecast; it must also take into account the involvement, characteristics and perceptions of those who are in danger (AEMI, 1995; Handmer, 1996; Syme, 1996; Sime, 1996). In this paper, we present the process of development of a risk communication strategy to implement the first structured flood warning system in case of dam failure in Portugal. This process was developed within a multidisciplinary integrated research project funded through the NATO Science for Stability Program, which aims at improving the safety of the population in downstream valleys. This project, entitled Dam-Break Flood Risk Management in Portugal (NATO PO-FLOOD RISK Project), integrates the results of contributions from hydraulic analysis, safety analysis, land-use management, social sciences and computer sciences (Almeida et al., 1996), and is also supported by the Portuguese water and dam safety authority (INAG), the main Portuguese power company (EDP) and the Portuguese Civil Protection Agency (SNPC). Until now, the construction of a risk communication strategy to implement the warning system has involved five stages. The aim of the communication was to establish a pilot warning system for a set of two sequential dams on the Arade River (Algarve), which will enhance the safety of the population living in downstream valleys. We will briefly describe the rationale and the accomplishments of each stage.
Stage 1: Mapping the floodable area in case of a dam-break event - Inundation maps
In this stage, several dam-break inundation studies were performed to determine the impact of a flood produced by a dam failure on the downstream area. Inundation maps were obtained for different dam failure scenarios, by numerical simulation based on computational models. According to Almeida & Viseu (1996), these types of floods are very different from ordinary natural floods for several reasons, including: (1) very high peak discharge and water depth values; (2) the eventual occurrence of movable bores and modular jumps; (3) fast and violent flooding of the banks; (4) flooding of previously dry land with abnormal dissipative effects; (5) transport of sediments and debris; (6) very difficult calibration of the models for each case. The production of inundation maps for our particular case was an important outcome of the project and a necessary step for risk assessment. From the different maps available, we decided to base our communication strategy on the worst-case scenario - the break of both the Funcho and Arade dams - and all the inundated area was considered at risk, independently of the depth of the water.
Stage 2: Describing the inundation area - Demographic and land use characterisation in a GIS
In this stage, our goal was to produce information about the characteristics of the occupation of the inundation areas in case of dam break, for that specific valley. To gather this information, the inundation map was compared with the census maps (this overlay step is sketched after Table 1). Using this methodology, we could estimate the total population at risk (PAR) (and describe its social and demographic characteristics) and the built environment in specific areas. In urban areas it was easy to obtain a very good overlap of the two maps. In rural areas, due to the dispersion of buildings, it was difficult to reproduce the flood maps in census maps. Downstream risk assessment must take into consideration several factors, including: (1) Warning time (WT): according to the literature (DeKay & McClelland, 1993; Brown & Graham, 1988), this variable is critical in predicting loss of life in case of a dam break. The model differentiates between three cases of increasing vulnerability: WT greater than 90 minutes, WT between 15 and 90 minutes, and WT below 15 minutes. We defined warning time as the time it takes the initial flood wave to reach the population at risk. (2) Community size (CS): urban areas, although more populated, are considered less
vulnerable because the buildings are less dispersed and the level of organised safety is higher (Brown & Graham, 1988). This variable was set to two levels: whether the area was a city or a set of dispersed villages. According to the criteria defined above, five risk areas were defined and described (Lima & Silva, 1996). Table 1 shows some of their characteristics.
TABLE 1
DESCRIPTION OF THE POPULATION AND BUILDINGS AT RISK BY INUNDATION AREA¹

                                        Population at risk                 Buildings at risk
Inundation area  WT         CS     PAR    O/Y rate  Illiteracy     CSS    Low
Area 1           < 15 min   Rural   674   113%      28%            43%    100%
Area 2           15-25 min  Rural   549   109%      34%            41%    100%
Area 3           25 min     Urban  3694    40%      16%            58%     89%
Area 4           25-60 min  Rural  3212    19%      15%            50%     89%
Area 5           60 min     Urban   905   114%      15%            21%     79%

¹WT: warning time; CS: community size; PAR: population at risk; O/Y rate: old/young population rate, the ratio between the number of residents over 64 years old and the number of residents below 19 years old; Illiteracy: % of residents who have not received any formal education and cannot read or write; CSS: % of buildings with a concrete support structure; Low: % of buildings with one or two floors.

This analysis shows that in that valley, according to the 1991 census data, there are about 9,000 residents at risk. Areas 3 and 4 are the more populated ones. It is also there that we find the highest percentage of young people (individuals below 19 years old). The other risk areas (1, 2 and 5) show a reversed pattern: they are less populated and have more elderly (20-25%). The population has a very low level of literacy: the percentage of illiterate individuals varies from 15 to 34%, and only 10 to 20% have more than the fourth grade at school. The educational level of the population is especially low in areas 1 and 2. Areas 3 and 4 have more buildings and dwellings than the other risk areas. Almost all buildings are used as residences, especially in the rural areas (1, 2 and 4). Areas 4 and 5 are tourist zones, and we find in those areas a higher percentage of occasional residences. The buildings in areas 3 and 4 are younger and taller than those in the other areas, and they are more often constructed using concrete support structures. Buildings in risk area 5 are especially old and were built with weaker materials. Simultaneously, a land-use analysis was performed on the same areas, to identify key structures (schools, hospitals, etc.) and to determine the plans for the coming years for that region (Farrajota & Campos, 1996). All these data were introduced into a database specially produced for this project, which also includes the characterisation of the dams (Fernandes & Andrade, 1996).
Stage 3: Identifying local actors and discourses about dam safety and risk - Qualitative study

The success of a risk communication strategy depends largely upon local involvement. Demographics cannot give us a clear picture of the different partners in the process and, as the valley under study spreads over different communities, some field work had to be done in order to identify the local authorities and opinion-makers, as well as the lay discourses about dam safety. Direct interviews were conducted with local experts (mayors, urban planners, civil protection agents, dam owners) and with 30 residents in different risk areas. The first set of interviews allowed us to identify the important structures to be contacted to implement the warning system, and their views about the problem.
The second set of interviews gave us access to the lay discourses about the dams (Lima & Silva, 1996). A content analysis was performed on these interviews and results show that the dams are perceived in a rather positive way: the benefits are very clear, especially to those more vulnerable in case of dam failure. This can be a result of the specific type of dams we studied (built for water storage to be used to field irrigation), where, and contrary to dams producing electric power, there was a resultant improvement in the farming patterns of that area. The second important result was that no one ever complained about the non-existence of warning systems. Although there was a clear assertion of the inconveniences of unexpected spillouts no one proposed a more active strategy towards dam risk. This result was also found by other authors (Bishop and Syme, 1992; Syme, 1994), and the general pattern found resembles a passive resignation, blindly trusting others to take care of their lives. Important differences in discourse were found according to risk vulnerability. Specifically, three types of lay discourse about the dams were identified: • "The darns did not bring any benefits, except maybe for flood regulation" - discourse of the younger, and from those living far away from the dams, in risk area 5 (Portim~,o).
• "The dams were very good for me, for this area and for farmers. They improved farming and helped in flood regulation. Dams can break, although it is an improbable event. I trust the dam because I know people working there, and I know that the engineers are very competent". This type of discourse was found in Silves (risk area 3), especially among retail store owners.
• "There are different types of dams, according to their material, their age, their structure, and they are variably safe. Even though I trust the older dam and there is no possibility that it breaks, although I sometimes have nightmares about it. This old dam was very good for me, as a farmer, and it improved farming and the whole quality of life in this area ". This kind of speech is typical of those farmers living very near the dams. They present a very concrete and specific way of speaking about the dams: although they are older and mainly illiterate, they speak in a much more complex and technical way about dams (thickness of wall, type of structure, etc). One other important characteristic of the discourse of this group is a need for denial of risk, often associated with reported fear. The different discourses highlighted some interesting issues. Higher levels of risk exposure is associated with risk denial. This result, which was also found for some other technological risks (van der Pligt, 1992), can be understood as an emotion-focused way of coping (Lazarus and Folkman, 1985) with an unavoidable stressor that enormous structure full of water quite near their house). The different pattern of confidence expressed by those rural and illiterate residents near the dams (a confidence in the dam itself) and by those more educated living in the nearby town (personal trust on the experts) was also a surprising and interesting result. We can associate this finding to Giddens (1990) concept of confidence in expert systems as a symptom of modem societies. Under this perspective, for those living far from the dams these are more abstract systems and confidence is produced through their access points (in our case the dam engineers or the those who work on the dam). Accordingly, they rate all dams as safe. On the contrary, for the rural population living near the dam, this structure has a very specific image and specific functions in their daily life. They don't need to trust the engineers, because the community has been living with the dam for many years and simply trusts a very familiar and useful structure.
Stage 4: Describing risk perception of dam break risks - Survey study
The qualitative study we undertook produced a vivid picture of the reality we faced. But, in order to produce a communication strategy to implement the warning system, more precise information was needed about the way people perceived those risks. In this stage, a quantitative analysis of the beliefs about dam safety was produced. A survey study was performed, intended to examine the following issues: (1) How does the population at risk in case of a dam failure disaster think about that risk? (2) How are the dams perceived by those residents? 302 residents were randomly selected among the risk areas defined above (in this case, we joined the first and second areas because they have very similar demographic structures). The main differences in the
demographic profile (Table 2) contrast the rural samples (where respondents are older and more often illiterate) with the urban ones. The subjects in our sample have a great amount of flood experience, and they live in one- or two-storey houses. This pattern is similar to the one described for the population as a whole, based on the 1991 census data for the region (Stage 2). Data were collected by direct interviews, using a structured questionnaire. The interview took place at the residence of the respondent and took an average of 32 minutes to complete.
TABLE 2
Description of the sample in the survey study

                       Sample 1    Sample 2    Sample 3      Sample 4
Risk area              1 & 2       3           4             5
Warning time           <25 m       25 m        <60 m         60 m
Community size         rural       urban       rural         urban
Main community         Barragem    Silves      Mexilhoeira   Portimão
Number of subjects     48          105         85            66
Mean age               54.7        48.7        53.7          47.4
Education (<4y)        70%         44%         71%           51%
Building: apartment    2%          28%         16%           16%
Flood experience >3    86%         64%         63%           65%
First of all, we tried to understand the perceived probability of a series of hazards. For each hazard, respondents rated its probability of occurrence on a five-point scale (1 = it is not possible to happen, 5 = it is very likely to occur). Results show (Figure 1) that bush fires are considered the most probable hazard (M=4.27, sd=.82), and dam failure the least probable one (M=2.31, sd=.98), very close to the failure of big constructions (M=2.43, sd=.90). Floods are viewed as probable (M=3.66, sd=1.01), followed by earthquakes (M=3.23, sd=.88) and house fires (M=3.01, sd=.92). As Bishop and Syme (1992) found in Australia, dam failure risk is not salient in our sample. However, there are some differences in the subjective probability estimates for dam failure among the four sub-samples: respondents in the areas closer to the dam rate it as less probable than those farther away. This is particularly true for the two rural samples.
FIGURE 1: Subjective probabilities of different hazards (mean ratings of bush fire, flood, earthquakes, house fire, big constructions and dam failure on the 1-5 probability scale)
Risk is not usually formulated by lay people in these quantitative terms, and the risk perception literature identifies a series of qualitative dimensions along which the public evaluates the risks they take. These dimensions were used to rate flood and dam failure risk. Results (Figure 2) show that flood risk is described in moderate terms. It is seen as a moderately old risk, known by those exposed and by the authorities. It is not considered catastrophic, dreadful, or as affecting the self (it is seen as a risk affecting other people more), and it is also not seen as very easily controllable by the authorities or by the self. Dam failure is seen in a harsher way. It is perceived as a rather dreadful, new and catastrophic risk, and not at all preventable by the self. It is seen as not very well known either by the authorities or by the population at risk, and the authorities are not perceived as competent to control that risk. Again, it is mainly a risk that is perceived as not directly affecting the self and as being someone else's problem. When we compare the two risks (using the appropriate statistical tests for differences between observed means), we see that dam break risk is perceived as more dreadful (t=-11.27, df=295, p<.0001), as affecting the self more (t=-9.31, df=286, p<.0001) and as more catastrophic (t=-19.26, df=273, p<.0001) than flood risk. Flood risk is perceived as better known by those affected (t=6.70, df=243, p<.0001) and by the authorities (t=-4.99, df=229, p<.0001), and as more controllable and preventable by the self (t=-10.14, df=273, p<.0001). No significant differences were found for control by the authorities, old risk, or affecting others more than the self.
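To make the reported comparison concrete, here is a minimal Python sketch of a paired t-test of the kind used above; the data, rating scale and effect sizes are invented for illustration and are not the study's data.

```python
# Hypothetical sketch of the paired comparison reported in the text
# (ratings are invented; the real study used 302 survey responses).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 296  # assumed number of paired ratings for the "dread" dimension

# Ratings of "dread" for flood and dam break risk by the same respondents
dread_flood = np.clip(rng.normal(3.5, 1.2, n).round(), 1, 7)
dread_dam = np.clip(rng.normal(4.6, 1.2, n).round(), 1, 7)

# Paired t-test: each respondent rated both risks
t, p = stats.ttest_rel(dread_flood, dread_dam)
print(f"t({n - 1}) = {t:.2f}, p = {p:.4f}")
# A negative t with a small p would indicate dam break risk is rated
# as more dreadful than flood risk, as in the paper.
```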
FIGURE 2: Perceived flood and dam failure risks (mean ratings of the two risks, from "not at all" to "very much", on the dimensions: dread, known, affects me, known by authorities, controllable by authorities, catastrophic, old, preventable by self, affects others)
To simplify these descriptions and to structure risk perception for both risks, two factor analyses were performed on these risk descriptions. Our results showed that, although with a very similar underlying structure, dam break and flood risks are seen in very different ways. People think of them in terms of possibility of control (can this risk be controlled, either by myself or by the authorities?), dread (does this risk produce fear?) and knowledge (is this risk well known?). They consider dam failure risk much more frightening, more unknown and less controllable than flood risk. In both cases, perception of risk controllability is more common in the urban samples than in the rural ones. The samples near the dam perceive dam break risk as more dreadful, and flood risk as less dreadful, than the people living near the coast.
Risk perception is not isolated from other cognitions (e.g., Vlek and Stallen, 1981). In this study we also analysed the perceived benefits of these two specific dams, the evaluations of dam safety and the beliefs that support safety evaluations. In line with the results of the qualitative study, farming benefits are particularly salient (especially for residents living closer to the dam), and the overall quality of life was also rated as having improved. Corresponding to this pattern of results, the beneficiaries of the dam are described as farmers and residents near the dam (Barragem and Silves). The region as a whole (Algarve) was also
perceived as a beneficiary, but the self was not seen that way, except by those living in the rural area near the dam. According to our results, dams are perceived as very beneficial to the region, and they are also perceived as safe structures. The reasons for trusting the safety of the dams were also analysed. The age of the dam seems to be an important criterion for safety evaluations, together with trust in the technicians. Although the explained variance of the regression equation is not high (20%), the principal predictors of dam safety evaluations are two items relating to trust in technicians and one stating confidence in the dam's age. These results show an interesting pattern of cognition about dam safety: trust in technicians is an important factor in explaining perceptions of safety, but the age of the dam is another. For the residents, an old dam, because it has had no problems for a long time, is trustworthy. The younger dam, although it has a concrete structure, arouses some suspicion, because it has not yet been proven safe.
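As a hedged illustration of the regression described above — with invented data and placeholder predictor names, not the authors' questionnaire items — a sketch of the safety-evaluation model might look as follows:

```python
# Hedged sketch of the safety-evaluation regression (invented data;
# predictor names are placeholders for the paper's questionnaire items).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 302
trust_tech_1 = rng.integers(1, 6, n)   # e.g. "I trust the dam technicians"
trust_tech_2 = rng.integers(1, 6, n)   # e.g. "the engineers are competent"
dam_age_conf = rng.integers(1, 6, n)   # e.g. "the old dam has proven itself"

# Simulated safety evaluation driven weakly by the three items, plus noise
safety_eval = (0.3 * trust_tech_1 + 0.2 * trust_tech_2
               + 0.25 * dam_age_conf + rng.normal(0, 1.5, n))

X = sm.add_constant(np.column_stack([trust_tech_1, trust_tech_2, dam_age_conf]))
model = sm.OLS(safety_eval, X).fit()
print(model.rsquared)   # in the paper, roughly .20
print(model.params)     # weights of the three predictor items
```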
Stage 5: Implementing a risk communication strategy for a warning system
The studies performed in the stages described above have given us a great deal of information with which to define the risk communication process for implementing the warning system in case of a dam-induced flood. Specifically, they have shown that:
1. As dam failure risk is perceived as a much more dramatic, uncontrollable and unfamiliar risk than flood risk, it is more difficult to accept. The flood warning system to be created should therefore be developed for floods in general, dam failure being a particular and extreme case of these. This strategy would have several other important consequences: it would prevent damage in case of natural floods or dam spillouts, and the population would have more opportunities to learn preventive behaviour.
2. Risk communication should not be conceived as a one-way strategy. In order to promote effective responses to warning systems, the decision process of the individuals must be understood, and the residents should be informed about the seriousness of the risks they face and their possibilities of response in adequate time (Janis & Mann, 1977). To fulfil these goals in our case study, where the population has been living at risk for more than 40 years, direct communication is particularly desirable. Furthermore, a local committee of residents in the risk areas could help to improve the dialogue between the experts and the public and prevent excessive responses (Syme, 1994).
3. Rural/urban distinctions are also important, both in our results and in the literature on warning response (Handmer & Ord, 1986). However, as the differences relate mainly to the age and degree of literacy of the subjects, they affect only the communication channel and not the content of the message.
Local involvement is a key factor for the success of the warning system. A GIS for dam safety management is presently under development within this project (Gamboa & Santos, 1996). This user-friendly system includes the topography, river network, roads and inundation maps of the valley, and it is a powerful instrument to promote the dialogue with the local authorities and residents. Several meetings with the local experts are already planned, and our knowledge of the technical characteristics of the dams, the valley and its population makes us an important partner in safety issues for this area.
CONCLUSION
A multidisciplinary project is often a challenge for all the researchers involved. This paper showed how we tried to bring together the results of the different disciplines involved in a research project aimed at improving the safety of the downstream valley below two dams on the Rio Arade (Algarve). Although, at first, each of the contributions seemed quite autonomous, the main goal of producing an effective communication strategy with the local authorities and population proved quite effective in connecting the research team. The
philosophy of our methodology stresses a two-way approach to risk communication, and thus we tried to involve the public and their representatives in the different stages of our research. During this year several meetings will take place, and the final success of the warning system depends mainly on our skill in involving them in their own safety.
REFERENCES
AEMI (1995). Flood Warning: An Australian Guide. Mount Macedon: Australian Emergency Management Institute.
Almeida, A.B. & Viseu, T. (1996). Dams and valley safety: a present and future challenge. Paper presented at the NATO Workshop on Dams and Safety Management at Downstream Valleys. Lisbon, Portugal, November (forthcoming publication).
Almeida, A.B., Ramos, C.M., Franco, A.B., Lima, M.L. & Santos, M.A. (1996). Dam-break flood risk and safety management at downstream valleys: A Portuguese integrated research project. Paper to be presented to the XIX Congress of the International Commission on Large Dams (ICOLD). Florence, Italy, 1997 (Q.75).
Bishop, B.J. & Syme, G.J. (1992). Community perceptions of dam safety issues. CSIRO Division of Water Resources Consultancy, Report 92/32.
Brown, C.A. & Graham, W.J. (1988). Assessing the threat to life from dam failure. Water Resources Bulletin, 24(6), 1303-1309.
DeKay, M.L. & McClelland, G.H. (1993). Predicting loss of life in cases of dam failure and flash flood. Risk Analysis, 13, 193-205.
Fernandes, J.P. & Andrade, M.J. (1996). A database for dam safety management. Paper presented at the NATO Workshop on Dams and Safety Management at Downstream Valleys. Lisbon, Portugal, November (forthcoming publication).
Gamboa, M. & Santos, M.A. (1996). A GIS for dam safety management. Paper presented at the NATO Workshop on Dams and Safety Management at Downstream Valleys. Lisbon, Portugal, November (forthcoming publication).
Giddens, A. (1990). The Consequences of Modernity. Oxford: Basil Blackwell.
Handmer, J. (1988). The performance of the Sydney flood warning system, August 1986. Disasters, 12, 37-49.
Handmer, J. (1996). Below the spillway: ensuring effective warnings in a crisis. Paper presented at the NATO Workshop on Dams and Safety Management at Downstream Valleys. Lisbon, Portugal, November (forthcoming publication).
Handmer, J.W. & Ord, K.R. (1986). Flood warning and response. In D.I. Smith and J.W. Handmer (Eds.), Flood Warning in Australia (pp. 235-251). Canberra: CRES.
Janis, I.L. & Mann, L. (1977). Decision Making: A Psychological Analysis of Conflict, Choice and Commitment. New York: Free Press.
Lazarus, R.S. & Folkman, S. (1984). Stress, Appraisal and Coping. New York: Springer.
Lima, M.L. & Silva, D. (1996). Caracterização sócio-demográfica e do edificado das zonas em risco de inundação por ruptura das barragens do Funcho e de Silves. Lisboa: L.N.E.C. (in press).
Lima, M.L. & Silva, D. (1996a). Understanding lay perceptions of dam-related risks. Paper presented to the Annual Meeting of the Society for Risk Analysis (Europe). Guildford, June.
Quarantelli, E.L. (1978). Disasters: Theory and Research. Beverly Hills: Sage Publications.
Sime, J.D. (1995). Crowd psychology and engineering. Safety Science, 21(1), 1-14.
Sime, J.D. (1996). Informative flood warnings: occupant response to risk, threat and loss of place. In J. Handmer (Ed.), Current Issues in Total Flood Warning System Design - An International Workshop. Middlesex University, September 1995 (forthcoming publication).
Slovic, P. (1987). Perception of risk. Science, 236, 280-285.
Syme, G.J. (1994). Dam break: Creating a vigilant community. Paper presented to the International Decade for Natural Disaster Reduction, Sydney.
Syme, G.J. (1996). Risk versus standards based approaches to dam safety: The consequences for community involvement in decision making and safety management in the Australian context. Paper presented at the NATO Workshop on Dams and Safety Management at Downstream Valleys. Lisbon, Portugal, November (forthcoming publication).
van der Pligt, J. (1992). Nuclear Energy and the Public. Oxford: Blackwell.
Vlek, C. & Stallen, J.P. (1981). Judging risks and benefits in the small and in the large. Organizational Behavior and Human Performance, 28, 235-271.
A3" Integrating Management Models
DYNAMIC MODELLING OF SAFETY MANAGEMENT
A.R. Hale 1, L.J. Bellamy 2, F. Guldenmund 1, B.H.J. Heming 1 and B. Kirwan 3
1 Safety Science Group, Delft University of Technology, 2628 EB Delft, NL
2 SAVE Engineering Advice Bureau, 7301 GL Apeldoorn, NL
3 Ergonomics Group, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
ABSTRACT The modelling and quantification of failure risk and of safety management began as two distinct activities. Since the 1980s they have been growing together. This paper describes the latest in a series of studies to integrate technical and management modelling so that predictions and assessments of total system risk and its control can be made. The approach emphasises the dynamics of management control of risk. It combines the PRIMA audit system with the Delft framework of safety management systems, using SADT-modelling for a systematic approach to critical management factors. The process of integration raises fundamental questions about how to articulate the two parts, which are based on different paradigms.
KEYWORDS Integrated risk modelling, safety management, audit, management dynamics
INTRODUCTION: HISTORICAL PERSPECTIVE
The assessment of failure probability and of safety management competence started life as totally independent activities. Probabilistic risk analysis is deterministic, based on progressive logical disaggregation of small numbers of tightly defined top events linked by event trees (each step subject to the test that it is a necessary and complete breakdown of the one above). It is quantitative in essence and is driven by the availability of failure data and by sensitivity analyses directed towards the most important combinations of failure determinants. Fault trees are constructed which are broken down to the level of component or system failure, where failure rate data is available from data banks or can be assessed with sufficient reliability by experts. Such data is generic by nature, collected across many organisations, and so takes no account of differences in management control competence. The fault trees used seldom, if ever, contain items related to organisational tasks or characteristics. Indeed the practice in chemical industry studies, which are largely directed at assessing off-site risks in relation to major hazards regulations, is that the factors leading to initiating events, such as loss of containment (LOC), are hardly modelled at all; such risk assessments consist mainly of event trees which model the consequences of the LOC. Safety management assessment traditionally started as a holistic assessment, largely qualitative in its aims and focused on improvement. Quantification in audits such as ISRS (DNV 1995) has been used only as an aid to prioritisation and has been based on unstructured expert judgement. The checklists used in such audits are lists of management activities not linked in any clear way to failure scenarios, or even to the direct control of the primary process and the technology of the organisation. Management systems at the level of these checklists are therefore
assumed to be generic for all types of technology (Hale & Hovden 1996). Many audit systems, particularly since the publication of the ISO 9000 series of standards, are based on models with implicit or explicit feedback and learning loops designed to improve performance. The first attempts to bridge the gap between technical risk and safety management and to develop more integrated modelling were made in the 1980s. MANAGER (Technica 1988) developed two management factors which were used to modify PRA probabilities. This work was followed by the development of PRIMA (Wright et al 1993), whilst an independent development gave rise to WPAM (Davoudian et al. 1994a, 1994b), based on task analysis and organisational factor assessment. PRIMA was developed using the analysis of LOC incidents as a basis for assessing priorities for management modelling. It links to the technical risk model at the level of the initiating events, which it splits into three: LOC through failures of pipes, vessels and hoses. The management model is divided into the phases of the plant life cycle: design (and domino effects from plant layout, etc.), manufacture, construction, operations and maintenance. Within those phases it classifies the activities relevant to preventing LOCs as hazard reviews (HAZ), routine inspection and testing (ROUT), human factors review of tasks, interfaces, procedures, etc. (HF) and checking and supervision for successful completion of work (CHECK). The analysis of the LOC incidents showed that eight of the cells of the matrix formed by these two dimensions account for the vast majority of failures. These eight areas of management were therefore used as the focus for an audit which produced an assessment of each area. These assessments could then be combined, using the proportions found in the incident study as weighting factors, to produce three management factors for the three types of technical failure (see Hurst et al 1996 for details). The audit was constructed as a check of the control and monitoring loops postulated as responsible for ensuring that the eight key activities are developed, formalised, implemented, reviewed and improved. PRIMA adopted some of the structure of existing management models, notably the planning, feedback and monitoring loops, and offered a formalised way of linking them to the technical failures, but it suffered from a number of limitations: the link to the technical failures is at a global level, providing little guidance about which management factors affect which failures in what way; the way in which the audit questions relating to the presence and working of the control and monitoring loops should be combined to produce a rating of the audit area is not clearly articulated; and the model and audit are still essentially static in their approach, recognising that learning loops must exist to keep risk managed effectively over time, but giving no clear indication of how the level of risk could vary given specified characteristics and developments in the management and technical systems. These limitations formed the starting point for a project (I-Risk) sponsored by the European Commission.
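A rough sketch of PRIMA's weighting step may help. All numbers below are assumptions for illustration: the real weights come from the LOC incident study proportions, and the mapping from weighted audit score to a failure-rate modifier is our invented placeholder, not the PRIMA formula.

```python
# Hypothetical sketch of PRIMA-style weighting (weights and scores invented).
audit_scores = {          # audit rating per management area, 0 (poor) to 1 (good)
    "HAZ-design": 0.7,
    "ROUT-maintenance": 0.5,
    "HF-operations": 0.6,
    "CHECK-construction": 0.8,
}
incident_weights = {      # assumed share of pipework LOC incidents per area
    "HAZ-design": 0.35,
    "ROUT-maintenance": 0.30,
    "HF-operations": 0.20,
    "CHECK-construction": 0.15,
}

# Weighted audit score for this failure type (pipework LOC)
score = sum(audit_scores[a] * incident_weights[a] for a in audit_scores)

# Map the score onto a multiplier for the generic failure frequency:
# a site scoring 0.5 keeps the generic rate; better sites reduce it.
generic_rate = 1e-4                        # assumed generic LOC frequency per year
management_factor = 10 ** (1 - 2 * score)  # assumed two-decade modifier span
site_rate = generic_rate * management_factor
print(score, management_factor, site_rate)
```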
IMPROVING THE MANAGEMENT MODEL: ADDING THE DELFT FRAMEWORK
The PRIMA audit and modelling system is being supplemented in the project by the approaches developed, until now independently, in Delft (see Hale 1995, Hale et al 1997 for details). This work started as a framework for safety management devised for qualitative modelling of the control of risks in the primary process of an organisation. Like PRIMA, it recognises the importance of the life cycle phases from design through to operations and maintenance (and decommissioning). It postulates in each phase three functional levels of control of risk:
• an execution function (E), which directly interfaces with the primary process in which the risks arise or are created, which operates the measures and controls that keep the risks from manifesting themselves, and which corrects deviations that would otherwise lead to harm. The primary process in each life cycle phase has a different relationship to plant safety. In the design phase it is the design process leading, as product, to a plant which has low intrinsic danger and built-in risk controls, and which is low-maintenance (since maintenance is related to some 40-50% of all major accidents in the process industry [Moll et al 1994]) and maintenance-friendly. In the operations phase it is the direct operation of the plant without loss of containment. In the maintenance phase it is the carrying out of maintenance in such a way that the maintenance staff are not injured and that the plant is returned to service in a better state for safe operation, without added failure probabilities introduced by the maintenance process.
• a planning and policy function (P), which devises and improves the methods, techniques and criteria, and allocates the resources, for safe operation (E). This is modelled as a problem-solving cycle with nine clearly defined steps directed at risk recognition and evaluation, development of control solutions, implementation and monitoring.
• a system structure and culture function (S), which sets up, implements, reviews, modifies and improves the P function within the overall company structure and management philosophy, making decisions about where and by whom the problem-solving steps in P will be carried out, what the priorities will be, and how the safety management system (SMS) will be kept motivated and functioning in line with best practice and the pressures from the market, regulators and others.
These three functions are implemented within any given company in different ways: the tasks may be given to different layers in the company hierarchy, contracted out to consultants, or even carried out by regulatory bodies (e.g. inspection of small companies, or review and certification of the SMS). This functional, as opposed to structural, modelling is an essential feature of the framework, since it offers the possibility of a generic model of an SMS which will be applicable to any company no matter how it is structurally organised. Mapping onto the structure of a given company takes place by asking how, and how effectively, the functions are carried out in the given organisation. The Delft framework maps well onto the control and monitoring loops and the life cycle phases of the PRIMA audit. It also provides a formalism for identifying more systematically where the four types of activity (HAZ, ROUT, CHECK, HF) in the PRIMA model take place and how they relate to each other. The links within and between the functional levels are made more explicit than in the PRIMA model and are broken down into more specific steps, which can be linked better to audit questions for assessment purposes. The improved model will therefore consist not of eight more or less independent loops, each affecting risk levels separately, but of one dynamically interrelated management system, showing how all aspects are connected. We return below to the issues which this step raises for quantification and for the assumptions of independence of events represented in fault trees.
GENERIC NOTATION FOR MODELLING MANAGEMENT: SADT
A second area in which the modelling is being developed is that of the formal notation for representing the various steps in the functional levels. Both PRIMA and the original Delft framework identified management activities in the SMS, but did not formalise the representation of them any further, beyond the indication that they were decisions and actions connected to each other by arrows implying information flows. The current modelling is exploring the potential of the Structured Analysis and Design Technique (SADT) (Marca & McGowan 1988) for making this process more systematic (see Hale et al 1996 for details). This notation represents activities generically as transformations, in which defined inputs (I) combine or are altered to produce outputs (O). The transformations are carried out by mechanisms or resources (R), which themselves do not get used up (at least in the short term) or incorporated into the outputs directly. Transformations are under the control of criteria and constraints (C), which represent performance indicators by which the success of the transformation can be judged. An example of a transformation relevant to the safety management system might be the operation of a safety information system, in which at least the following are relevant (a full analysis would be too lengthy for this paper). The inputs are data on possible scenarios, risk exposure, actual deviations, incidents, near misses and breakdowns, actual harm, the range of potential prevention measures between which the system must choose, the actual measures chosen and implemented with their expected effects, and practice, problems and new developments elsewhere. The transformation is the comparison of actual performance with the expected performance of prevention measures (e.g. to detect unexpected scenarios, deviations happening despite the presence of control measures, etc.), which produces as outputs reports (to the company, regulators, insurers, etc.), recommendations to change practice (E-level), plans and procedures (P-level), and the signals to trigger major redesign of the SMS at the S-level. The resources are the skills and knowledge of those who record, process and analyse the information and of those who use it, the hardware and software to record, store, analyse, distribute, discuss and use it, the time to devote to the work and the money to finance it (e.g. purchase of external information). The controls governing the transformation are the needs of the users, the requirements of external bodies for reporting, the criteria set by the SMS which have to be met, etc. The four factors are represented as arrows leading into, and out of, a box in which the activity takes place. They are placed in a clockwise order (I, C, O, R) around the faces of the box. The notation was developed for the process of knowledge engineering: the formalisation of decision-making processes in order to program them into software.
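To make the notation concrete, the following is a minimal sketch — our illustration, not part of the I-Risk tooling — of an SADT box as a Python data structure, instantiated for the safety information system example above:

```python
# Minimal sketch of an SADT box: inputs (I), controls (C), outputs (O)
# and resources (R) around a named transformation. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class SADTBox:
    name: str
    inputs: list[str] = field(default_factory=list)      # I: consumed/transformed
    controls: list[str] = field(default_factory=list)    # C: criteria & constraints
    outputs: list[str] = field(default_factory=list)     # O: products of the box
    resources: list[str] = field(default_factory=list)   # R: mechanisms, not used up
    children: list["SADTBox"] = field(default_factory=list)  # unpacked sub-tasks

safety_info_system = SADTBox(
    name="Operate safety information system",
    inputs=["deviation and incident data", "risk exposure data",
            "measures chosen and implemented"],
    controls=["user needs", "external reporting requirements", "SMS criteria"],
    outputs=["reports", "recommendations to change practice (E)",
             "plans and procedures (P)", "triggers for SMS redesign (S)"],
    resources=["analyst competence", "hardware and software", "time", "money"],
)
```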
Since the purposes of generic modelling of the SMS resemble in many ways the objectives of building decision support software, the potential of this technique seems clear. It has also been used in modelling primary processes and the risk controls on them (Rasmussen & Whetton 1993), which is an attractive long-term characteristic, since compatibility between technical and management models is a long-term aim of research. Trials in the current project have been undertaken with the modelling tool to represent the relationships between maintenance management and safety studied in earlier projects (Heming et al 1996) and to make the PRIMA audit model explicit. The SADT notation works with a hierarchical representation of any given activity. Any formulation of a task can be "unpacked" into its constituent steps, in much the same way that hierarchical task analysis is used for human reliability assessment (Kirwan & Ainsworth 1992) (cf. the use of task analysis in WPAM [Davoudian et al 1994]). The use of the safety information system sketched above can be unpacked into its sub-tasks of recording data, analysing it, updating risk inventories and scenarios, evaluating the effectiveness of prevention measures, recommending changes, and reviewing and updating the operation of the information system. In the same way each functional level of the model (S, P, E) can be split up into its component steps. The notation and its software (Logic Works, undated) require that, as the broadly formulated tasks are unpacked, all of the I, O, C and R at the higher level of aggregation be linked to each of the unpacked steps to show how they are driven and regulated. This means that the successive unpacking must also contain the explicit steps in which global performance criteria and resources are translated and allocated, respectively, in order to control and resource the successively more detailed sub-tasks of the activity. This formalism is a good discipline both for understanding and representing existing management processes, and for identifying weaknesses in them, for example steps which are under-resourced, or which have no defined criteria controlling them. In terms of the improvement of the PRIMA audit, the notation offers ways of classifying, grouping and, where necessary, supplementing or thinning out and evaluating the audit questions asked to verify the existence and functioning of the control and monitoring loops (now transformed into the three functional levels of the new framework). Trials with the notation have shown that there are conceptually only a limited number of generic types of controls and resources which are almost always relevant for any activity in the SMS. This is attractive in terms of the principle of parsimony, as well as being valuable in simplifying the process of assessing influences on any one SADT box. Only 10 factors have been identified so far as necessary.
RESOURCES:
• competence (knowledge and skills present in the heads of people)
• commitment (perhaps the least tightly defined resource, but covering the motivation to apply time and competence and to carry out the activity necessary for achieving the end result, rather than skipping it)
• hardware (measuring instruments, control devices, barriers, etc.)
• soft tools (methods, software, measuring protocols, etc.)
• time (both of people and of hardware)
• money (to pay for the development and presence of the above).
The last two are, strictly speaking, not resources according to the rules of SADT, as they are used up, but it is more convenient to model them as such, since they are usually not directly combined with other SMS or primary process inputs to create the outputs.
CONTROLS:
• output criteria (goals for safety and other objectives, e.g. production, and possible conflicts between them)
• rules (agreed and often written rules imposed by the working group or company as means of achieving the output criteria, either procedural or highly specific [see Hale & Swuste 1993 for a discussion of hierarchies of rules, related to the notion of unpacking criteria for use at different SADT levels])
• risk perception (criteria for performance internal to the persons carrying out the activity)
• supervision (criteria and rules internal to a person directly supervising and directing the activity).
It becomes clear in using these generic categories, however, that they are not independent influences. There is almost always some dependency between controls and resources, since they are physically linked: e.g. a piece of hardware used as a resource (say a relief valve) comes with a limited range of set-points, which are built-in criteria for its functioning; a person carrying out a task as a resource of competence comes with a certain commitment and risk perception, which are only partially modifiable and may vary together (e.g. Reason et al 1989 showed that
persons who rated their competence as high were more likely to violate rules on the grounds that this was not a risk for them). It is also clear that different combinations of the influences may be alternatives for good control of an activity. Competence, rules and supervision form such a trio (with a minor link to commitment). We have encountered a number of activities which can be controlled just as well by highly competent people with few imposed rules or supervision, as by people with little knowledge but clear, detailed rules and close supervision, or even by rules and no supervision, but a great commitment to stick to the rules. Considering the generic influences and their interactions forces the modeller to come to terms with these interactions. When we reach the stage of quantification of management effects, the interactions may necessitate the use of influence modelling techniques using expert judgement (e.g. Phillips et al. 1983).
The SADT notation provides a convenient way of modelling control and feedback loops. The influence of the controls (C) on any given box can be modelled as a sub-task, at the next level of unpacking, in which the "draft output", e.g. the hazard review of a design, the procedures for plant shutdown, or the completed permit-to-work before signing a job off, is subject to a quality check against the defined criteria (C). The result of this check box is the "approved output", but also a report to the higher (P) level indicating whether the output was satisfactory, or whether action must be taken to modify or reinforce the procedures which governed production of that hazard review. Each SADT box, therefore, has associated feedback loops attached to it, which take the outputs of the box and monitor them in order to learn whether changes need to be made in the Cs, Rs or Is of the system in order to make it work more safely next time. We can envisage these as the means for preventing the SMS from decaying and for adapting it constantly to new and changing circumstances. Some of the control and monitoring loops stay at the same level in the SMS; these are direct check activities which pick up deviations and correct them. Other loops circle up to a higher level (E to P, and P to S) and are responsible for the company redefining or re-emphasising the safety controls (C) on an activity, or reconsidering, refreshing or renewing the resources (R) needed to carry it out. We can postulate that the absence or weakness of these loops allows the quality or preparedness of the Cs and Rs to decline over time, which will, in turn, mean that they are less effective in ensuring that the activity or transformation in the box is carried out to a high quality, and hence that the generic failure rate being processed by the box stays low. If we wish to use such a loop in quantitative modelling as an input to the failure probability modelling, it will be necessary to make assumptions (or obtain expert judgements) about the frequency and quality with which these loops must operate in order to prevent decay (or indeed to produce improvement in the transformation). In this way more specific form is given to the checks made of the control and monitoring loops in the PRIMA audit. This example demonstrates a generic aspect of the SADT modelling, namely that, if we follow the I --> O links, we stay on one level of the SMS model.
If, on the other hand, we pursue a C or R arrow out of a transformation box and enquire how that control or resource is decided upon, provided and kept functional, we move up a level in the SMS, from E to P to S. This means we drive deeper into the management system and through into the safety culture of the organisation. It is in following these paths that we can expect to capture the common mode failures represented by a less than adequate safety culture. The quality of these links between functional levels can indicate whether the levels interact and drive each other successfully, or simply flow over each other as laminae, in which case only pseudo-control will occur, characterised by such phenomena as well-kept safety manuals which bear no relationship to reality, high levels of violation of impracticable rules, and false ideas at senior level that safety problems are under control because no reports of failures reach that level. The trials with the SADT notation show that the first levels of unpacking of the SMS within, or across, life cycle phases yield a highly generic model which seems to be applicable to any company and any technology. However, as the SADT boxes are unpacked it becomes increasingly necessary to model in relation to a specified technology and to the organisation of the SMS at a specific site. This means that the tool offers the potential for generating generic audit questions at a fairly high level of aggregation (based on the I, C, O and R of the significant management processes), which can be answered by any given company in terms of the organisational arrangements it has for carrying out those activities. The same model can be developed further by any company interested in modelling its own system in more detail; the differences between companies at this more detailed modelling level, but applied to the same technology, provide one clear definition of differences in safety culture.
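The check-and-feedback idea described above can be sketched very simply, reusing the illustrative structure from the earlier example; the function below is an invented illustration of a quality-check sub-task, not project code:

```python
# Sketch: a check step that approves a draft output against its controls (C)
# and feeds back to the level above (E -> P) when criteria are not met.
def check_output(draft_quality: float, criterion: float) -> tuple[bool, str]:
    """Quality check of a "draft output" against the defined criteria (C).

    Returns (approved, report): the approved/rejected flag is the box's
    output; the report is the feedback signal to the higher (P) level.
    """
    if draft_quality >= criterion:
        return True, "output satisfactory: no action needed"
    return False, "criteria not met: modify or reinforce governing procedures"

# An E-level activity produces a draft output of some quality (assumed score)
approved, report_to_P = check_output(draft_quality=0.62, criterion=0.75)
print(approved, "->", report_to_P)
```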
LINKING THE MANAGEMENT AND TECHNICAL MODELS
More explicit modelling of the safety management system with SADT raises fundamental questions about the link between this sort of model and the technical risk model. Current fault trees, particularly those used in chemical risk studies, do not generally contain failure elements which link directly to management system failures. Conceptually the inclusion of this sort of element is not problematic. The MORT tree (Elsea & Conger 1983) has shown this at a generic level. More recently van der Mark (1996) has produced generic fault trees for all LOC incident types and shown that 10 trees can cover all causal mechanisms. His generic trees reach the level of management factors, such as design and maintenance decisions, after two or three layers. The problem of the fault tree approach is that such factors start to appear in many different branches (e.g. design decisions related to all the various hardware risk controls). These design decisions represent potential common mode failures if they are carried out by the same design team, with the same competence, risk perception and design philosophy. Common mode failure has always presented problems for quantification in PRAs, since the basic assumption of the method is that base events in a fault tree are independent of each other. From the side of the management model, the process of unpacking the SMS in each life cycle phase is also one of increasing specificity, whereby the different aspects of the risk control system are devised and kept operational. At several levels into that unpacking of SADT boxes, the direct links to the design, maintenance and operation of specific risk controls become evident. The fact that these are sub-tasks of larger SMS tasks indicates that the very purpose of the (rational) management system is to plan and run these actions so that they are "common mode successes". The degree of common mode is given by the final point in the unpacking at which two or more of the design decisions are still represented in one SADT box. The closer that is to the specific failure mode of the technical components, the more there will be a common mode failure component. The presence and successful use of the control and monitoring loops within and between the functional levels will also indicate the degree to which the management system succeeds in linking separate failure modes into one common mode.
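To see why such common modes matter for quantification, consider a textbook beta-factor sketch (our illustration with assumed numbers, not the method of the project): two nominally redundant risk controls whose failure probabilities share a management-induced common cause.

```python
# Beta-factor sketch of common mode failure (numbers assumed).
# Each barrier fails with total probability p; a fraction beta of that
# probability is a shared (management-induced) common cause.
p = 1e-3      # assumed total failure probability per barrier
beta = 0.1    # assumed fraction of failures from the common cause

p_indep = (1 - beta) * p   # independent part per barrier
p_common = beta * p        # shared part, defeats both barriers at once

# Both barriers fail: independent parts coincide, OR the common cause occurs
p_both = p_indep**2 + p_common
print(f"assuming independence: {p**2:.1e}")   # 1.0e-06
print(f"with common mode:      {p_both:.1e}") # ~1.0e-04, dominated by beta*p
```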
Figure: Fit of technical and management models
The technical (fault tree) model and the management (SADT) model can therefore be thought of as triangles, with increasing specificity towards their bases. They have to be fitted together as shown in the figure. The representation on the left indicates highly detailed, specific linking, which would probably be qualitatively very interesting, but exceptionally time-consuming to model and a nightmare to quantify (compare the modelling of Wreathall 1989). The right-hand representation shows the linking achieved in MANAGER and PRIMA. The central figure shows the level at which the current project is aimed. In order to make that link, decisions have to be made about the point at which both significant and generic management failure (or success) components can best be modelled as common modes in the fault trees. Guidance on this can be obtained by more detailed analysis of the original LOC data used in PRIMA (Hurst et al 1991), to discover the dominant management factors in the different types of LOC incident classified by failure mechanism. One factor in deciding how far down the fault trees to go is the danger of incompleteness of the modelling as determinants are disaggregated (cf. Fischhoff et al 1978). The matching process therefore brings the issue of common mode into sharp focus. To the extent that a
management system succeeds in linking all failure determinants into one common management system, and in managing that successfully, fault tree modelling becomes redundant; the best predictor of technical failure would then be a measure of how good that unifying SMS is. It is only because any SMS fails to be all-embracing and to operate its structuring (S) and planning and procedures (P) functions perfectly that detailed modelling of how good its constituent parts are, and how those parts link to top events, is necessary. In particular, the data from the LOC analyses on which the PRIMA audit is based show that different life cycle phases, and the management failures in them, load very differently onto different failure mechanisms; see Table 1. The table shows the percentage breakdown of 502 failures according to the type of failure causing the LOC and the life cycle phase in which recovery would have been possible. (The table sums to 84%, the rest of the incidents being from other causes or unclassifiable.)
TABLE 1
LIFE CYCLE ANALYSIS OF MANAGEMENT FACTOR INFLUENCES FOR LOC INCIDENT TYPES FOR PIPEWORK FAILURES (%)

                       Corrosion  Erosion  External  Impact  Over-     Vibration  Temperature  Wrong      Operator  Defective
                                           Loading           pressure                          Equipment  error     pipe
Design/Layout            5.5       0.8      2.2       2.7     7.35      1.7        2.9          1.3        1.7       1.3
Construct./Manufact.     1.5       0        0.15      0.1     0.8       0.2        0            3.4        2.1       1.7
Operation                0.3       0        0.4       1.8     1.1       0.05       0.3          0.03       6.2       0.5
Maint.                   5.6       0.5      0.6       1.05    3.3       0.6        1.0          2.05      18.0       4.0
Since the different life cycle phases are quite commonly subcontracted to different organisations with their own SMS, it is likely that the common mode across phases is comparatively smaller than within them. Further modelling will be needed to establish in what degree of detail each life cycle phase SMS must be modelled, and how much further it will be necessary to go than the four types of control and monitoring loop already modelled in PRIMA (HAZ, HF, ROUT and CHECK - see above). The project is not yet at the stage of detailed quantification using the SADT notation, so it is not yet clear how its use may modify current risk parameters for given sites.
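As a small worked example of how data such as Table 1 could feed a weighting scheme — our reading of the table, not an analysis performed in the project — one can normalise a column into life cycle phase weights for one failure mechanism:

```python
# Sketch: turning one Table 1 column into life-cycle weights for a single
# failure mechanism (corrosion). Interpretation is ours; numbers from Table 1.
corrosion = {            # % of the 502 LOC failures, by recovery phase
    "design/layout": 5.5,
    "construction/manufacture": 1.5,
    "operation": 0.3,
    "maintenance": 5.6,
}
total = sum(corrosion.values())
weights = {phase: share / total for phase, share in corrosion.items()}
for phase, w in weights.items():
    print(f"{phase}: {w:.2f}")
# design/layout and maintenance dominate (roughly 0.43 each), suggesting
# where management modelling effort for corrosion-driven LOC would concentrate.
```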
CONCLUSIONS
SADT modelling offers a rigorous notation for modelling management systems. The current project has shown that it can clarify the relations between management factors and technical failures, and that it can model them more systematically than earlier methods. The confrontation of technical and management models places a strong emphasis on how common mode failure (or success) should be seen and quantified in risk modelling. Because the notation is hierarchical in structure, it also offers a way of reconciling the need of regulators to model management systems at generic and aggregated levels with the interest of companies in modelling their own site- or company-specific management systems qualitatively and in much detail in order to improve them. Field trials later in the project must indicate the degree to which it can be applied in practice as a modelling tool.
ACKNOWLEDGEMENTS This paper is based largely on work sponsored by the European Commission under the Environment programme, in the project I-Risk.
REFERENCES
Davoudian K., Wu J-S. & Apostolakis G. (1994a). Incorporating organisational factors into risk assessment through the analysis of work processes. Reliability Engineering and System Safety 45:1-2, 85-105.
Davoudian K., Wu J-S. & Apostolakis G. (1994b). The work process analysis model (WPAM). Reliability Engineering and System Safety 45:1-2, 107-125.
DNV (Det Norske Veritas) (1995). International Safety Rating System. International Loss Control Institute, Georgia.
Elsea K.J. & Conger D.S. (1983). Management Oversight and Risk Tree. Risk Report, 6:2. International Risk Management Institute.
Fischhoff B., Slovic P. & Lichtenstein S. (1978). Fault trees: sensitivity of estimated failure probabilities to problem presentation. J. of Experimental Psychology: Human Perception and Performance 4:2, 330-344.
Hale A.R. (1995). Modelling integrated working conditions management systems. Paper to Workshop on Improving the Working Environment: from medical-technical problem solving to a process of participative management. Swedish Institute for Work Research, Stockholm.
Hale A.R., Heming B., Carthey J. & Kirwan B. (1996). Modelling of safety management systems. Safety Science, in press.
Hale A.R. & Hovden J. (1996). Management and culture: the third age of safety. Paper to the conference on Safety Policy & Management, Worksafe Australia, Sydney.
Hale A.R., Kirwan B., Guldenmund F. & Heming B. (1996). Capturing the river: multi-level modelling of safety management. 2nd International Workshop of International Power Operators (INPO). Technical University of Berlin, Germany.
Hale A.R. & Swuste S. (1993). Safety rules: procedural freedom or action constraint? 11th NeTWork Workshop: The use of rules to achieve safety. Bad Homburg.
Heming B., Hale A.R., Smit K., van Leeuwen N.D. & Rodenburg F. (1996). Evaluating the control of safety in maintenance management in Dutch major hazard plants: a research model. In Cacciabue P.C. & Papazoglou I.A. (eds), Probabilistic Safety Assessment & Management. Springer, London, 732-737.
Hurst N.W., Bellamy L.J., Geyer T.A.W. & Astley J.A. (1991). A classification scheme for pipework failures to include human and socio-technical errors and their contribution to pipework failure frequencies. Journal of Hazardous Materials, 26, 159-186.
Hurst N.W., Young S., Donald I., Gibson H. & Muyselaar A. (1996). Measures of safety management performance and attitudes to safety at major hazard sites. J. of Loss Prevention in the Process Industry, 9:2, 161-172.
Kirwan B. & Ainsworth L.K. (eds) (1992). A Guide to Task Analysis. Taylor & Francis, London.
Logic Works (undated). Logic Works BPwin. Princeton, New Jersey.
Marca D.A. & McGowan C.L. (1988). SADT: Structured Analysis and Design Technique. McGraw-Hill, New York.
van der Mark R. (1996). Generic fault trees and the modelling of management and organisation. Graduation report, Faculty of Mathematics, Technical University of Delft.
Moll O., Hale A.R. & Smit K. (1994). Preventie van onderhoudsgerelateerde ongevallen (Prevention of maintenance-related accidents). Tijdschrift voor Toegepaste Arbowetenschap, 6, 79-86.
Phillips L.D., Humphreys P. & Embrey D.E. (1983). A socio-technical approach to assessing human reliability. London School of Economics, Decision Analysis Unit, Technical Report 83-4.
Rasmussen B. & Whetton C. (1994). Hazard identification based on plant functional modelling. European Safety & Reliability Conference, 986-995.
Reason J.T., Manstead A.S.R., Stradling S.G., Baxter J.S. & Campbell K.A. (1989). Errors and violations on the roads: a real distinction? CEC Workshop: Errors in the operation of transport systems. Cambridge.
Technica (1988). The Manager Technique. Management Safety Systems Assessment Guidelines in the Evaluation of Risk. London.
Wreathall J. (1989). A hierarchy of risk control measures with some considerations of "unorganisational" accidents. Paper to the Second World Bank Workshop on Risk Management and Safety Control. Karlstad, Sweden.
Wright M.S., Bellamy L.J., Brabazon P.G. & Hurst N.W. (1993). The evaluation and management of pipework and vessel safety. C454/014/93. Institution of Mechanical Engineers.
UNDERSTANDING SAFETY CULTURE IN ORGANIZATIONS - THE CONCEPT OF TOTAL SAFETY MANAGEMENT AND ITS PRACTICAL USE IN AUDIT INSTRUMENTS
G. Grote, C. Künzler and B. Klampfer
Work and Organizational Psychology Unit, Swiss Federal Institute of Technology (ETH), 8092 Zürich, Switzerland
ABSTRACT
Recent attempts to capture system safety have focused on the term safety culture, indicating a shift in emphasis from individual and directly accident-related factors to organizational and "latent" safety factors. Based on the sociotechnical systems approach, an extension of existing models of safety culture is suggested, termed Total Safety Management. Following the understanding of culture as deeply rooted assumptions about the interplay of people, technology and organization, shared by the members of an organization, a method attempting to capture these assumptions and to evaluate them in relation to the furthering or hindering of system safety is presented, along with first results of its application in safety audits. While past investigations into the usefulness of the Total Safety Management model for evaluating safety culture have focused on industrial systems, especially chemical production, a current project aimed at safety in civil aviation is outlined briefly in a final section.
KEYWORDS Safety culture, Total Safety Management, sociotechnical systems design, chemical production, aviation, audit instruments, risk management
INTRODUCTION
In research as well as in industrial practice, efforts aimed at increasing the safety of high-risk production systems focus more and more not only on technical and individual-centred measures, but on an integral safety management improving the interplay of technology, organization and human resources. Faulty management decisions, termed "latent errors" (Reason, 1990), which endanger the optimal functioning of the sociotechnical system and thereby increase the likelihood of errors, have become the core of safety-related system diagnosis, especially in those systems where a high safety margin has already been reached. In an attempt not only to identify such management errors, but to move yet again further back in the causal chain of accidents and incidents in trying to forecast such errors, the value systems of organizations are evaluated on the basis of notions of a "safety culture" (e.g. INSAG, 1991).
In this paper, a concept of safety culture is outlined which is based on the sociotechnical systems approach. The main focus of the concept is process safety, including individual safety as well as environmental concerns in as much as they affect and are affected by process safety concerns. Subsequently, a method for evaluating safety culture according to this concept is presented. The rationale for the method, as well as results of its application in safety management audits in petrochemical plants, are described. The usefulness of the chosen approach is discussed, and some current work extending it to civil aviation is outlined.
A SOCIOTECHNICAL MODEL OF SAFETY CULTURE: TOTAL SAFETY MANAGEMENT
The International Nuclear Safety Advisory Group (INSAG), among others, coined the term safety culture as "that assembly of characteristics and attitudes in organizations and individuals which establishes that, as an overriding priority, nuclear safety issues receive the attention warranted by their significance" (INSAG, 1991, p. 1). The group also provided a framework for the 'measurement' of safety culture by distinguishing characteristics of safety culture on the strategic, management and individual levels of an organization (IAEA, 1994). Characteristics similar to those in the INSAG model have been identified by empirical comparisons between 'safe' and 'unsafe' organizations (e.g. Cohen, 1977; Hoheisel, 1995; Zohar, 1980) and through analysis of active and latent failures having led to major accidents (Reason, 1991). Frequently mentioned characteristics are management commitment to safety, safety training and motivation, safety committees and safety rules, record keeping on accidents, sufficient inspection and communication, adequate operation and maintenance procedures, well-designed and functioning technical equipment, and good housekeeping. A major problem with those models of safety culture is their lack of integration into general models of organization and of organizational culture, understood as the deeply rooted assumptions about human, societal and ecological categories shared by the members of an organization and their expression in values, behavior patterns and artifacts found in the organization (e.g. Schein, 1985). Also, the connection between safety-related characteristics of a system and more general characteristics, like job and organizational design and the use of technology, is missing. Thereby the impression is furthered that safety can be looked upon, and promoted, as something detached from the make-up of the sociotechnical system as a whole. Under the heading of Total Quality Management, a more integral approach has been developed with respect to quality. Similarly, the establishment of a safety culture could be described as "Total Safety Management". It has been argued by the present authors that the sociotechnical systems approach can provide the basis for such a Total Safety Management (cf. Grote & Künzler, 1996a, b). The sociotechnical systems approach conceives of work systems as having a technical and a social subsystem which together determine how well the primary task of a work system can be accomplished. Only if the two subsystems are jointly optimized can maximum effectiveness be achieved. The predominant design objective is the provision of a high degree of self-regulation in the sub-units of the work system, down to its individual members, in order to allow for direct and flexible reaction to, or proactive prevention of, variances in the production process (cf. e.g. Grote, 1997; Susman, 1976; Ulich, 1994). On at least two levels the sociotechnical systems approach can be linked to safety, i.e. the definition of the primary task of a work system and the degree of self-regulation of sub-units in the system. Starting from the assumption that work systems are created and designed with respect to the optimal accomplishment of their primary task, this task should be defined not only in terms of the quantity and quality of products or services, but should contain the safety of their production as well.
Only when safety is understood as a central task inseparably linked to production, can one assume that safety measures will penetrate every part of the work system. Secondly - and contrary to most industrial practice - it is argued that a high degree of self-regulation of work teams is beneficial to safety. This assumption has two roots, one is that especially in complex systems immediate reactions to variances and disturbances in the production process as well as anticipatory
actions for their prevention necessitate the delegation of control to the lowest, i.e. shop floor, level (e.g. Perrow, 1984). The second root is the motivation model embedded in the sociotechnical approach, i.e. task orientation (Emery, 1959; Ulich, 1994). Tasks that among other things allow for a high degree of autonomy, task completeness and task feedback will further an individual's intrinsic motivation. If, as derived from the definition of the primary task, his or her task includes safety, then motivation should also be directed towards that aspect of the task. Some empirical evidence for this assumption has been provided in studies on the effects of working in semi-autonomous work teams, where it could be shown that accidents and unsafe acts decreased (Trist et al., 1963; Trist, Susman & Brown, 1977). However, as Perrow (1984) among others has pointed out, tight coupling of technical systems limits the possibilities for decentralized regulation of a system. If strong interdependencies exist between parts of the system, so that changes in one subsystem will have wide-ranging and locally not foreseeable effects in other subsystems, regulation should be centralized. Observations in so-called "high reliability" organizations like aircraft carriers indicate one possible solution to the dilemma of concurrent requirements for centralized and decentralized regulation of the system. LaPorte and Consolini (1991) described how these organizations can switch between forms of regulation with different degrees of decentralization depending on the requirements of the situation. Under 'normal' conditions, high levels of formalized, hierarchical control were found, while in states of increased alertness due to emerging threats a more team-like cooperation with local control was established. The importance of cultural factors in being able to perform these switches in control patterns was emphasized by Weick (1987). On the basis of these links between sociotechnical systems design, organizational culture and safety, it has been suggested that the aforementioned models of safety culture should be extended in a number of ways (Grote & Künzler, 1996a, b; see Figure 1): Firstly, characteristics of the work system not directly related to safety should be included, especially characteristics of job and organizational design influencing the degree of self-regulation on the shop floor. Secondly, a model of safety culture should be incorporated into a more general model of organizational culture (e.g. Schein, 1985, 1992), emphasizing complex interactions between an organization's material and immaterial reality. Thirdly, along with the suggested extensions of existing models of safety culture regarding material indicators of culture, the norms and beliefs linked to these indicators should be included. This concerns assumptions related to the chosen form of work organization, e.g. regarding the nature of human activity and the necessity of control, as well as management philosophies on automation and human reliability as they are expressed in the actual division of functions between human operators and machines.
Figure 1: Sociotechnical model of safety culture as Total Safety Management (adapted from Grote & Künzler, 1996b). The figure contrasts material characteristics of the organization (joint optimization of technology and work organization aiming at the regulation of disturbances at their source; integration of safety in organizational structures and processes) with immaterial characteristics of the organization (values and beliefs that further the integration of safety in all work processes; norms related to sociotechnical design principles like automation philosophy and beliefs concerning trust/control).
Investigations aimed at comparing work organization, safety measures and employees' perceptions of safety in four chemical plants and a transportation company indicated the usefulness of the Total Safety Management model for understanding differences in safety culture, especially between the transportation company and the chemical plants (cf. Grote & Künzler, 1996a, b). Based on these investigations, a method was developed to capture safety culture as part of safety audits as they are carried out, for instance, by insurance companies or other external agencies. The development and test of this method is presented in the following sections.
DEVELOPMENT OF A METHOD FOR EVALUATING SAFETY CULTURE

While the original investigations were carried out over the course of two years, comprising job observations and interviews with operators, interviews with safety experts and production managers as well as a survey of safety-related assessments of production personnel, a diagnostic method usable during safety management audits has to be applied in the course of a few days with as few resources as possible. At the same time, as much information from as many people as possible is needed to gain a picture detailed enough to allow inferences about an organization's culture, which by definition is difficult to capture as it is ingrained in deeply rooted, usually unconscious norms and assumptions (cf. e.g. Schein, 1985, 1992; Schneider & Shrivastava, 1988). These conflicting requirements were met by developing a questionnaire that complements the information gained from the top and middle management interviews and plant tours of which audits normally consist. By means of the questionnaire, the views of organizational members in different departments and on different hierarchical levels can be obtained on factual characteristics of safety in their company as well as on more value-oriented statements closer to 'cultural assumptions'. By analyzing the responses in terms of the nature of shared views as well as differences between organizational groups, a more varied picture can be developed regarding the way an organization handles safety, building on and extending the 'baseline' information obtained from the expert interviews. The questionnaire, which was developed for audits in petrochemical production sites, contains three parts: (I) questions on the company's safety measures, e.g. safety training, management involvement in safety issues, maintenance procedures and technical design (cf. e.g. Hoheisel, 1995; Reason, 1993; Zohar, 1980, for similar questions), (II) pairs of statements describing different 'philosophies' regarding safety measures, work organization, and use of technology, e.g. "Employees are motivated for safety by information and interesting tasks vs. Employees are bound to safety by strict control", (III) questions related to personal needs for good job performance, e.g. more vs. less procedures. In all three sets of questions, directly safety-related concerns as well as more general aspects of job and organizational design were included. While the first and third parts of the questionnaire mainly deal with 'material' characteristics of the organization, the second part aims to touch on more normative, 'immaterial' characteristics. For each pair of statements in the second part, a 'positive' side was defined based on assumptions contained in the Total Safety Management model. Questionnaire responses are analyzed in four steps. Firstly, the overall pattern of responses is looked at in order to gain a first impression of the 'average safety perception' in a plant. Secondly, relevant groups of employees within the plant are identified, e.g. based on different hierarchical levels, different departments, and different production units, and their response patterns are compared. In a third step, the questionnaire responses are related to the information obtained from expert interviews during the audit.
These analyses allow preliminary judgements on the general quality of safety measures, on the degree to which shared views on safety and the work situation exist that resemble a shared culture, and on the quality of that culture in terms of the degree to which norms crucial for a 'safety culture' exist. As a fourth step, requiring onsite analysis of the questionnaire responses, these judgements need to be put to a test, though, by discussing them - as openly as possible - in a feedback meeting with representative members of the organization audited. This
also points to a general limitation of the chosen approach, as it requires an open and not, or at least not predominantly, inspection-oriented relationship between the audit team and the audited company. The insurance company for which the questionnaire was developed therefore uses the questionnaire only for those clients with whom it maintains a good and long-term relationship. In the following, results from two audits, carried out by members of the insurer's risk management team together with one of the authors, are discussed in order to illustrate the application of the suggested method.
RESULTS FROM SAFETY MANAGEMENT AUDITS

The questionnaire has been tested in four audits. Selected results from two of these audits will be presented in order to discuss the usefulness of the chosen approach for the evaluation of safety culture. Both audits were carried out in petrochemical plants belonging to large US petrochemical companies. Plant A has about 800 employees; a total of 41 employees were included in the survey, 31 in operations and ten in maintenance. Eighteen respondents occupied lower and middle management functions, the others were operators without management functions. Plant B has about 500 employees; a total of 69 employees participated in the survey, 54 in operations, distributed among three production units, and 15 in maintenance. Twenty-five of the respondents occupied lower and middle management functions. The overall evaluation of safety measures by the respondents was more positive in plant A than in plant B. Also the more detailed assessments of safety measures in Part I of the questionnaire were generally more positive in plant A. Based on an exploratory factor analysis, the items in Part I were combined into three scales: 'formal safety' (e.g. There are sufficient written procedures, checklists etc. to ensure process safety.), 'personal safety' (e.g. There exist numerous training courses for the advancement of safe behavior.), and 'enacted safety' (e.g. In case of conflicts with production requirements safety comes first.). Interestingly, only for the scales 'personal safety' and 'enacted safety' were the differences large enough to reach statistical significance. Differences in the perception of 'formal safety' were non-significant (see Table 1). The more positive assessments by respondents in plant A coincided with a more positive evaluation of safety management by the auditors based on expert interviews. Interestingly, particularly negative assessments by the auditors concerning certain aspects like safety training and learning from near misses were found in the questionnaire responses of plant personnel as well.

TABLE 1
EVALUATION OF SAFETY MEASURES IN TWO PETROCHEMICAL PLANTS

Scale (a)          Plant A    Plant B    t-Test
Formal safety      3.58       3.40       n.s.
Personal safety    3.83       3.18       p < .001
Enacted safety     3.45       3.05       p < .01
(a) The scores are averaged responses on 5-point Likert scales for all items contained in the respective scale.

In both plants, evaluations by superiors were generally more positive than by operators, a result pointing to potential hierarchical subcultures, which has to be interpreted cautiously, however, as it might be the expression of an attributional bias rather than of basic cultural norms. Superiors, being more responsible for safety measures, might perceive them more positively in support of a more positive self-image, a finding similar to the "self-serving bias" frequently reported in attribution research (e.g. Zuckerman, 1979).
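To make the scale construction and the comparisons in Table 1 concrete, the following minimal sketch shows how such an analysis could be computed. It is not the authors' actual procedure: the data layout, column names and item-to-scale assignment are hypothetical, and a Welch t-test is used as one plausible choice of test.

import pandas as pd
from scipy import stats

# Hypothetical assignment of Part I items to the three scales derived
# from the exploratory factor analysis (item names are illustrative).
SCALES = {
    "formal safety": ["item_01", "item_04", "item_07"],
    "personal safety": ["item_02", "item_05", "item_08"],
    "enacted safety": ["item_03", "item_06", "item_09"],
}

def scale_scores(responses):
    # Average the 5-point Likert items of each scale per respondent.
    return pd.DataFrame(
        {scale: responses[items].mean(axis=1) for scale, items in SCALES.items()}
    )

def compare_plants(df):
    # Compare plants A and B on each scale (column 'plant' marks the plant).
    scores = scale_scores(df)
    for scale in SCALES:
        a = scores.loc[df["plant"] == "A", scale]
        b = scores.loc[df["plant"] == "B", scale]
        t, p = stats.ttest_ind(a, b, equal_var=False)
        print(f"{scale}: A={a.mean():.2f} B={b.mean():.2f} t={t:.2f} p={p:.3f}")

# compare_plants(pd.read_csv("questionnaire_responses.csv"))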
Especially in plant B, a number of interesting differences in responses by employees from different production units were found, questioning the existence of a strong shared culture. While responses in two units, with overall evaluations of safety measures reaching mean values of below 4 on a 7-point scale, were astonishingly negative - especially when considering that the employees were well aware of the questionnaire being part of an insurance audit - the responses in the third unit were quite positive, with a mean value of 5.5. In particular, safety training, the enactment of the 'Safety First' principle, and sufficient discussion of safety issues were aspects perceived much more negatively in the first two units than in the third. Looking at the norm-related statements of Part II of the questionnaire, yet again, respondents in the first two production units gave - based on the assumptions of the Total Safety Management model - more negative responses than those in the third unit. They, for instance, saw employees as bound to safety by strict control rather than motivated for safety by information and interesting tasks, and indicated less frequently that learning from near misses occurs in the plant. Interestingly, respondents in the first two production units stated more often - with a mean value similar to that in plant A - that personnel can make their own decisions during process upsets instead of having to follow clearly documented procedures. This result points to a controversial issue in safety management as discussed above, i.e. the degree of decision latitude provided to operators. One interpretation for the positive evaluation of plant A's safety management by employees and auditors alike is the fact that plant A management follows a highly participative strategy, furthering employee involvement at all levels and regarding all issues of safety as well as general management. This interpretation is based on empirical evidence pointing to generally positive effects of employee participation (cf. e.g. Ulich & Grote, in press), but also on the assumption that participation is a form of 'higher order autonomy' that allows for a compromise in the conflict between centralization and decentralization also discussed above (Grote, 1997). The effect of participatory decision making, e.g. concerning the development of standard operating procedures and the degree of flexibility permitted in running production processes by automated vs. manual control, will have to be studied further, however, before firm conclusions can be drawn.
DISCUSSION AND FURTHER RESEARCH PLANS
As the reported results show, the questionnaire proved useful for obtaining data on safety-related perceptions which complement and expand the information gained from expert interviews during safety audits. One generally encouraging result was the lack of a social desirability bias, as indicated by the occurrence of rather negative responses in some samples. Also, differences in the perceptions of groups of employees within a plant are a very valuable source for more differentiated interpretations of a plant's safety culture. However, great caution is required not to stretch interpretations too far: What can be achieved - especially by means of a very open discussion of the results - is helping the auditors to add a 'deeper' layer to their assessments. Also, the organization can be supported in obtaining an increased consciousness of existing norms and values and their interaction with the organization's material reality, thereby identifying furthering and hindering conditions for an integral approach to system safety. A general shortcoming of the research basis for safety indicators used in audit instruments - also inherent in the approach described - is the reliance on post hoc analyses of causal factors in accidents. In order to identify measures required for proactive safety management, more analyses of incidents and non-routine occurrences are necessary. Such analyses make it possible to identify factors directly furthering safety, as event sequences are studied that involve coping with unsafe or critical conditions. In a follow-up project, an airline's database of incidents and non-routine occurrences will be used to study the interplay of organizational and social environment, technical environment (with a special focus on flight deck automation), and human strengths and limitations in performing flight operations. Accidents and incidents in aviation - like in other complex working systems - rarely have a single cause, but instead are the
result of a series of contributory factors concerning human, technical, organizational and environmental matters. Correspondingly, a systems approach like the Total Safety Management model described above, which deals with 'root causes' of safe as well as unsafe conditions, must be chosen for risk management and risk assessment. There are many accident and incident reports which provide excellent information on what in detail went wrong, without reflecting underlying causes. An example of such 'just referring to symptoms' is the often cited crew-caused factor "pilot deviation from basic operational procedure". A Special Safety Report of the FSF (1994) discusses the problem of flight deck procedure design and asks the question why pilots deviate from SOPs: "A complex human-machine system is more than merely one or more human operators and a collection of hardware components. In order to operate a complex system successfully, the human-machine system must be supported by an organizational infrastructure of operating concepts, rules, guidelines, and documents. The coherence of such operating concepts, in terms of consistency and logic, is vitally important for the efficiency and safety of any complex system." (FSF, 1994) The objective of the follow-up project is to develop methodologies that lead causal analysis of incidents beyond the simple surface description of symptoms, thereby attempting to improve knowledge on indicators and measures for proactive safety management. A systematic analysis of incident data for at least two types of aircraft with highly automated flight decks will be carried out, concerning technical, operational and environmental factors. In addition, an instrument will be developed to gain information on the 'human side' of the events (especially which factors like training, SOPs or design components were supportive and which acted in the opposite direction) as well as on organizational circumstances. Analyzing incidents in this way offers the advantage that not only failures can be identified but also strategies for handling these failures and dealing successfully with a critical situation. This makes it possible to study "good coping practices", which can help to improve risk assessment and risk management in the sense of a Total Safety Management.
REFERENCES
Cohen, A. (1977). Factors in successful occupational safety programs. Journal of Safety Research, 9, 168-178.
Emery, F.E. (1959). Characteristics of socio-technical systems. London: Tavistock Document No. 527.
FSF (1994). Flight deck procedure designs can reduce confusion, enhance safety (Special Safety Report). Flight Safety Digest, August 1994.
Grote, G. (1997). Autonomie und Kontrolle - Zur Gestaltung automatisierter und risikoreicher Systeme. Zürich: vdf Hochschulverlag.
Grote, G. & Künzler, C. (1996a). Sicherheitskultur, Arbeitsorganisation und Technikeinsatz. Zürich: vdf Hochschulverlag.
Grote, G. & Künzler, C. (1996b). Safety culture and its reflections in job and organizational design: Total Safety Management. In A.V. Gheorghe (Ed.), Integrated regional health and environmental risk assessment and safety management. International Journal of Environment and Pollution, 6, 618-631.
Hoheisel, D. (1995). Massnahmen zur Verbesserung der Arbeitssicherheit im Betriebsvergleich. In C. Graf Hoyos & G. Wenninger (Eds.), Arbeitssicherheit und Gesundheitsschutz in Organisationen (pp. 63-78). Göttingen: Verlag für Angewandte Psychologie.
IAEA (1994). ASCOT Guidelines. Wien: International Atomic Energy Agency.
INSAG (1991). Safety culture. Safety Series No. 75-INSAG-4. Wien: International Atomic Energy Agency.
LaPorte, T.R. & Consolini, P.M. (1991). Working in practice but not in theory: Theoretical challenges of "high-reliability organizations". Journal of Public Administration Research and Theory, 1, 19-47.
Perrow, C. (1984). Normal accidents. Living with high-risk technologies. New York: Basic Books.
Reason, J. (1990). Human error. Cambridge: Cambridge University Press.
Reason, J. (1991). Too little and too late: A commentary on accident and incident reporting systems. In T.W. van der Schaaf, D.A. Lucas & A.R. Hale (Eds.), Near miss reporting as a safety tool (pp. 9-26). Oxford: Butterworth-Heinemann.
Reason, J. (1993). Managing the management risk: New approaches to organisational safety. In B. Wilpert & T. Qvale (Eds.), Reliability and safety in hazardous work systems (pp. 7-22). Hove: Lawrence Erlbaum.
Schein, E.H. (1985). Organizational culture and leadership. San Francisco: Jossey-Bass.
Schein, E.H. (1992). Organizational culture and leadership (2nd ed.). San Francisco: Jossey-Bass.
Schneider, S.C. & Shrivastava, P. (1988). Basic assumptions themes in organizations. Human Relations, 41, 493-515.
Susman, G.I. (1976). Autonomy at work. A sociotechnical analysis of participative management. New York: Praeger.
Trist, E.L., Higgin, G.W., Murray, H. & Pollock, A.B. (1963). Organizational choice: The loss, rediscovery and transformation of work tradition. London: Tavistock.
Trist, E.L., Susman, G. & Brown, G.R. (1977). An experiment in autonomous working in an American underground coal mine. Human Relations, 30, 201-236.
Ulich, E. (1994). Arbeitspsychologie (3rd ed.). Zürich: Verlag der Fachvereine; Stuttgart: Poeschel.
Ulich, E. & Grote, G. (in press). Work organization. ILO Encyclopaedia of Occupational Health and Safety, 4th ed.
Weick, K.E. (1987). Organizational culture as a source of high reliability. California Management Review, 29, 112-127.
Zohar, D. (1980). Safety climate in industrial organizations: Theoretical and applied implications. Journal of Applied Psychology, 65, 96-102.
Zuckerman, M. (1979). Attribution of success and failure revisited, or: The motivational bias is alive and well in attribution theory. Journal of Personality, 47, 245-287.
CASE STUDIES WITH TOMHID SAFETY ANALYSIS METHODOLOGY

Jouko Heikkilä
VTT Manufacturing Technology, P.O. Box 1701, FIN-33101 Tampere, Finland
e-mail: jouko.heikkila@vtt.fi
ABSTRACT
TOMHID is a new safety analysis methodology developed in an international project. It is intended to be used in an overall study of a process plant. The goal is to identify those parts which primarily require further investigation and improvement. The methodology integrates the identification of significant hazardous situations and the investigation of defects in the management of safety. A special description of a plant is created to guide the study. The description summarises the chemicals, functions, operations, organisation and equipment of the plant. The investigation of managerial tasks is based on accident scenarios which are developed according to the identified hazardous situations. The methodology was tested with three case studies. The cases included full-scale TOMHID analyses in a power plant, in a dairy and a related warehouse, and in a steel pipe factory. Possible problems in performing a TOMHID analysis were especially investigated in the case studies. Shortcomings were found especially in the guidelines for specifying defects in management systems. On the other hand, many especially positive features were noted. For example, the special plant description was considered a good means for organising information for hazard identification. Furthermore, the method developed for the analysis of potential human errors produced good results. The overall evaluation of the TOMHID methodology is that the structure and approach of the methodology appeared to be practical, and most of the methodology worked well and was easy to use.

KEYWORDS

hazard identification, safety management, process, plant, TOMHID, case study

INTRODUCTION
Many different safety analysis methods have been developed in recent decades and some of them have proved to be very practical and useful for improving safety. Preliminary Hazard Analysis (PHA), MOND and DOW indices, Hazard and Operability Study (HAZOP), Failure Mode and Effect Analysis (FMEA) and Fault Tree Analysis (FTA) are already widely used in industry. Management systems are analysed with Management Oversight and Risk Tree (MORT) and many auditing methods like Five Stars. Human factors are analysed for instance with Task Analysis and with the Technique for Human Error Rate Prediction (THERP). All these methods perform reasonably well when they are used independently and for the task they were designed for. However, when carrying out a plant level safety analysis, the following problems may arise: • Normally, purely technical descriptions, like a layout drawing, process flow diagrams or technical drawings, are used for guiding PHA, HAZOP, FMEA and FTA analyses. Human activities are not presented in these descriptions, and therefore the human contribution to accidents can easily be forgotten.
• HAZOP, FMEA and FTA are aimed at and used for quite detailed examination of possible accidents in a system. This means that only selected parts of complex systems, like a chemical plant, can be analysed with these methods, since a lot of work is required. The same problem applies to methods for human factor analysis. Moreover, with HAZOP and FMEA the analysis is started from deviations or failures, which is not a very effective way of identifying hazards, although the results of a systematic analysis can be very comprehensive. • Experiences in the application of MORT for the investigation of management systems have indicated that with this method it is not easy to gain results which are concrete enough to tell how safety management can be improved. • Methods do not systematically utilise or integrate the results of analyses carried out with some other method, which means that the combination of the results of a plant level safety analysis may remain incomplete even if the separate analyses have been carried out appropriately. Development of a new methodology was inspired by the identification of the problems listed above. The TOMHID methodology is intended for a plant level analysis of hazards in a chemical plant. As a basis of the methodology, existing methods are used as much as possible. Moreover, the methodology was developed to meet the needs derived from the identified problems listed above. The needs can be stated as follows: • A method for creating a plant description which will support a comprehensive plant level hazard analysis. The description should comprise sufficient chemical, functional, operational, organisational, constraint and equipment information about a plant. • A systematic hazard identification method which utilises the comprehensive plant description mentioned above. • A management system analysis method which will provide concrete results. • Methods for integrating the investigation of technical, human and managerial factors in a plant level analysis. The methodology was tested with two case studies within the TOMHID development project. However, three new case studies were arranged, because the first two studies were not full-size TOMHID analyses. The purpose of the case studies was to test the functionality of the methodology: how the methodology meets the needs listed above and what problems arise when carrying out a TOMHID analysis in practice.

TOMHID METHODOLOGY AND ANALYSIS CASES
The TOMHID methodology was developed for the analysis of a whole plant. The leading idea of the methodology is to investigate both the technical system and the system for running the plant within the same analysis case. Major hazards in the technical system and major weaknesses in management activities are identified. The target is to define which parts and activities of the plant should be further studied and improved. The sequence linking management activities and accident events is as follows (see also Figure 1): management activities are used to maintain and develop working conditions, so that workers can avoid errors which would contribute to accident events. The methodology has been described in detail by Rasmussen and Whetton (1993), Wells et al. (1993) and Heikkilä et al. (1995). The TOMHID development project and the methodology have been introduced by Suokas (1995).
Test Cases
The functionality of the TOMHID methodology was investigated and improved in three full-size TOMHID analyses in different industrial plants. Analyses were carried out by two research scientists at a time. One of them led the analysis and the other observed. After each analysis stage, the observations were discussed and solutions for identified problems were looked for. Representatives of the plants were encouraged to point out weaknesses of the methodology and to suggest improvements. A questionnaire for evaluating each analysis session was given to the participants of the session.
Figure 1: Link between management activities and accident events

The methodology was tested in three plants: a power plant, a dairy, and a steel pipe factory. The power plant burns coal and produces electric power for the national network. There are about 16 employees in a shift when full production power is used. Additionally there are some maintenance personnel, design personnel, administrative personnel, upper management, experts, guards etc. About 50 persons at most are working on the plant site at a time in a normal situation. The dairy produces and packs milk products. The related warehouse also stores products of other dairies. The dairy and warehouse have about 260 employees working mainly in two shifts. The steel pipe factory manufactures steel pipes starting from a steel strip. It has about 160 employees working mainly in two shifts.
Starting of Analysis

Each study was started with a meeting in which the scope and boundaries of the analysis case were defined and the case was organised. In the meeting, 2-3 TOMHID experts and about five company representatives were present. The company representatives typically included the director of the plant, the plant or company safety manager, production manager(s) and the manager of technical services. At the beginning of the meeting, the TOMHID methodology was introduced to the representatives of the company. The company representatives briefly introduced the plant and its operation. The boundaries of the analysis were defined concerning equipment, operations and organisation. The criteria for hazard identification were defined concerning the severity and type of accident consequences. Different types of accident consequences included: injuries and fatalities, loss of production, damage of property, and environmental damage. In one case the criteria were as follows: • one or more fatalities or permanent incapacity for work • costs of more than 175 000 ECUs from damaged property and loss of production together • an accident affecting the outside of the plant site in such a way that the public pays attention to it. The criteria were quite similar in the other cases. As practical matters, the contact persons were named and the dates of the next meetings were agreed. The construction of the plant description was started by outlining plant operations with a simplified process flow
diagram. At the end of the meeting, the TOMHID experts visited the plant site and received material describing the plant. After this first meeting, the analyses continued as presented in Figure 2.
Figure 2: Stages of TOMHID analysis
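The hazard selection criteria quoted above for one of the cases can be read as a simple screening predicate over the estimated consequences of a hazard. The following minimal sketch formalises them; the HazardRecord fields are hypothetical and only illustrate how such criteria could be encoded, not how the project actually recorded them.

from dataclasses import dataclass

COST_LIMIT_ECU = 175_000  # damaged property plus loss of production, per the case criteria

@dataclass
class HazardRecord:
    description: str
    fatality_or_permanent_incapacity: bool
    estimated_cost_ecu: float
    offsite_public_attention: bool

def meets_selection_criteria(h):
    # A hazardous situation is reported if any one of the three criteria holds.
    return (
        h.fatality_or_permanent_incapacity
        or h.estimated_cost_ecu > COST_LIMIT_ECU
        or h.offsite_public_attention
    )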
Construction of Plant Description

The first stage is to prepare a plant description which is used for guiding the subsequent hazard identification session. The description summarises the chemical, functional, operational, organisational and equipment information of a plant. The plant description was constructed jointly by the TOMHID experts and the representatives of the plant. The production manager and the manager of technical services typically participated in the construction of the description. The plant description is hierarchical. On the top level, the main operations and their connections are presented. For example, in the top level description, the dairy operations were divided as follows: milk production, storage functions, heating system, cooling systems, electrical systems, safety systems, information systems, management, maintenance, operational states, and other operations. Each upper level operation is divided and presented as a combination of operations until the appropriate amount of detail has been achieved. For example, the mid-level description of the milk production operation includes the following operations: material reception and intermediate storage, milk production line, and packing system. The most detailed level of description also includes the different inputs, outputs, constraints and resources of the operation. In these cases, plant descriptions included at most three hierarchical levels. An example of the description of an operation on the most detailed level is presented in Figure 3.
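As an illustration of this hierarchy, the dairy's description could be represented as a nested structure such as the following sketch. The top and mid levels follow the text above; the leaf-level entries (inputs, outputs, constraints, resources) are invented purely for illustration.

plant_description = {
    "milk production": {
        "material reception and intermediate storage": {},
        "milk production line": {},
        "packing system": {
            # most detailed level: inputs, outputs, constraints and resources
            "inputs": ["processed milk", "packaging material"],
            "outputs": ["packed milk products"],
            "constraints": ["hygiene regulations"],
            "resources": ["packing machines", "operators (two shifts)"],
        },
    },
    "storage functions": {},
    "heating system": {},
    "cooling systems": {},
    "electrical systems": {},
    "safety systems": {},
    "information systems": {},
    "management": {},
    "maintenance": {},
    "operational states": {},
    "other operations": {},
}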
Hazard Identification

In a hazard identification session, potential hazardous situations are identified and documented. The documentation of a potential hazardous situation includes a couple of the main events of a potential significant accident. The identification is carried out by the plant personnel together with a TOMHID expert. The plant personnel included the safety manager/engineer, managers of production and maintenance, supervisors from
production and maintenance, and workers from production and maintenance. In these case studies, there were two sessions lasting about three hours.
Figure 3: Description of the heating system as a part of the plant description (the figure lists the system's operational states, safety systems, control systems, inspections, inputs, operations (processes, equipment) such as heating oil storage, boilers and compressors, outputs such as steam and hot circulating water, and personnel)

The safety expert led the sessions and directed the discussion. At the beginning of the session, the criteria for selecting hazardous situations were introduced. The criteria were different in each case. Each item in the description was discussed separately. Items include each input, each operation, each output, each safety system etc. The discussion on each item was typically started by asking about the possible accidents (fire, explosion etc.) of the item (natural gas input, steam boiler, gas alarm, etc.). After that, keywords were used.
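The session leader's questions can be thought of as being generated item by item from the plant description. The sketch below illustrates this; the paper does not enumerate its keywords, so HAZOP-style guide words are used here purely as a hypothetical stand-in.

ACCIDENT_TYPES = ["fire", "explosion", "leakage", "collision"]
GUIDE_WORDS = ["no/none", "more", "less", "other than"]  # assumed, not from the paper

def prompts_for_item(item):
    # First ask about possible accidents of the item, then apply keywords.
    for accident in ACCIDENT_TYPES:
        yield f"Could {accident} occur in connection with '{item}'?"
    for word in GUIDE_WORDS:
        yield f"What if '{word}' applies to '{item}'?"

for question in prompts_for_item("natural gas input"):
    print(question)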
Reporting and Hazard Evaluation

In these cases, 32, 70 and 66 potential hazardous situations were reported. This does not directly measure the hazardousness of the plants, since the criteria were different in each case. A hazardous situation was described in a report, for example, as follows: "Handling of heavy and slippery loads: Loads handled in the warehouse are heavy and can be slippery because of ice or leakage of contents. It is possible that the worker falls or the load falls on the worker, causing injury or even death." The identified hazardous situations were evaluated by the plant representatives. The plant representatives included the director of the plant, production manager(s) and the manager of maintenance. Hazardous situations were organised in 6-7 groups according to the nature of the hazard (e.g. fire, collision, etc.) or the scope of the hazardous situation (e.g. maintenance, handling of loads, etc.). For example, in one of the cases, the groups were: 1) Material handling and transportation, 2) Input materials and energy, 3) Maintenance, monitoring and operational activities, 4) Control systems, 5) Smoke and other harmful discharges, 6) Storage and use of hazardous chemicals. Hazards in one group were evaluated at a time. Normally, 1-3 hazardous situations in each group were evaluated to require immediate improvements. In one case, for example, the following hazards were selected: handling of heavy loads in material reception, cable fire, worker falling while working in the storage hall, fire in the control room, fire in the storage hall, and possible defects in the LPG system.
Construction of Accident Scenarios

Construction of accident scenarios was started by selecting two or three of the potential hazardous situations. The potential hazardous situations are used as a starting point for the construction of accident scenarios which will
support the examination of working conditions. The potential hazardous situations were selected on the basis of criteria comprising the severity of the potential accident and how well the plant and the different hazard categories are covered. The target was that the selected hazardous situations would sufficiently cover all the different plant functions, activities, equipment and regions as well as the different types of hazards. Severity of the potential consequences is important for directing the investigation to the safety-critical areas and it may be important for motivating the interviewees. Usually interviewees are more interested in likely and personally affecting accidents than in unlikely and generally severe accidents - if these are the alternatives. A sufficient coverage is more important than the selection of the most severe potential accidents. The selected potential hazardous situations were developed into accident scenario descriptions. An accident scenario includes the description of those events which may comprise a specific accident. Events are typically failures of equipment, human errors and deviations in process functions and conditions. Failure to control (prevent, detect, mitigate, etc.) an accident development is one special type of such events. 'Failure to control' usually has a direct link to human tasks. A scenario should be quite simple, so that it is easy to understand. The scenarios in the cases handled the following topics: 'Accidental start of pump during maintenance', 'Leakage of fuel pump', 'Smouldering coal in storage', 'Collision in production line', 'Collision of forklift truck and cyclist', 'Fire of cables', 'Handling of heavy and slippery loads'. The scenario events, for example in 'Leakage of fuel pump', were: 'Blockage of filter in input line', 'Pump trembles', 'Joint breaks', 'Leakage of fuel', 'Ignition and fire'.
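A scenario of this kind is essentially an ordered chain of events. The following minimal sketch represents the 'Leakage of fuel pump' scenario quoted above; the Event and Scenario types are illustrative conveniences, not part of the TOMHID methodology itself.

from dataclasses import dataclass, field

@dataclass
class Event:
    description: str  # equipment failure, human error or process deviation

@dataclass
class Scenario:
    title: str
    events: list = field(default_factory=list)  # ordered chain of events

fuel_pump_leak = Scenario(
    title="Leakage of fuel pump",
    events=[
        Event("Blockage of filter in input line"),
        Event("Pump trembles"),
        Event("Joint breaks"),
        Event("Leakage of fuel"),
        Event("Ignition and fire"),
    ],
)

# Each event is discussed separately in the worker interviews (next stage).
for event in fuel_pump_leak.events:
    print(event.description)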
Working Condition Examination

Working conditions in the plant are examined by interviewing workers. The aim is to collect a list of examples of deficiencies in working conditions. This list is the starting point for the subsequent working condition management examination. Accident scenarios are used for inducing and directing the discussion in the worker interviews. The interviewees were workers who have the opportunity to affect the course of the described accident. In each case, 6-7 workers were interviewed. In the first case, the interviews were individual, but in the other two cases, workers were interviewed in groups of 3-4 workers. One interview lasted 2-3 hours. The working conditions are examined by discussing each event of the scenario separately. The examination was started with the identification of errors, failures or other problems possibly causing the event. After identification of such an error, conditions contributing to the error were identified. The identification of error producing conditions was supported with a checklist of the objects related to working. The checklist comprises: 'Equipment', 'Environment', 'Materials', 'Procedures', 'Individual' and 'Other factors'. The number of identified conditions was 80-160 in each case. For example, the following 'error producing' conditions were noted: • incoming packages can vary a lot, which makes their handling difficult • the space for material reception is crowded • pallets may be in bad condition • a barrel may fall from the load when the doors of the trailer are opened • the tool for checking the correct tightness of a joint is not available • repair of a joint may fail because of dirt • it is difficult to locate the actual source of smoke in the case of a cable fire • it is difficult to find a telephone in the basement, which might be needed in an emergency situation. It should be noted that, in this stage, the worker had the main responsibility for evaluating the conditions as 'error producing'.
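The interview bookkeeping can be pictured as a simple tally of conditions under the six checklist categories. A minimal sketch, with example entries taken from the list above (the recording function itself is an illustrative assumption):

from collections import defaultdict

CHECKLIST = ["Equipment", "Environment", "Materials",
             "Procedures", "Individual", "Other factors"]

conditions = defaultdict(list)  # category -> list of 'error producing' conditions

def record(category, condition):
    if category not in CHECKLIST:
        raise ValueError(f"unknown checklist category: {category}")
    conditions[category].append(condition)

record("Environment", "the space for material reception is crowded")
record("Equipment", "the tool for checking the correct tightness of a joint is not available")
record("Materials", "pallets may be in bad condition")

for category in CHECKLIST:
    print(f"{category}: {len(conditions[category])} condition(s)")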
Working Condition Management Examination

The state of working condition management was examined by interviewing the plant personnel affecting working conditions. These included designers, engineers, supervisors, managers, safety and fire protection engineers and managers, and training personnel. In each case, 10-14 persons were interviewed. Interviews were based on the list of 'error producing conditions' collected in the worker interviews. The intention of
the discussion was to find out whether an identified 'error producing condition' exists only randomly (by mistake), or whether it is caused by deficiencies in the system. The discussion was started by selecting an 'error producing condition' for which the interviewee is somehow responsible. The responsibility could be identification, decision making, or the planning, designing or implementation of improvements. The diagram presented in Figure 4 was used to support the discussion.
!:i:i~i~!~!:i:~:`~!`.!~`~i``!~!~i~!~i:i~!:i~?~!~i:i~i~i:!:i:!~i:i~`:~!`':!~i:i:i:i~i:i:i~i:.1"~':!`.1:~i:i~i~i~i~i~i~i:i:i~i~i:!:i~i~i~i:i:i``:'~i~i~i:i~i~i~i~i~i~i:i~i``:'~i`':i:i:i~i:i~i:i~i~i:i~i~!:i~i~i~i:i:i~i~i~i~
Figure 4: Sequence describing management of working conditions

After the introduction of an 'error producing condition', it was discussed what might prevent the correction of that or similar conditions. In this way, the deficiencies in the conditions for managing working conditions were identified.
Reporting on Working Condition Management

Identified problems affecting the management of working conditions were reported as a list of statements concerning poor conditions for managing working conditions. Typically, these include lack of time, lack of money, lack of competence, lack of means, and lack of or too much information for certain tasks. Each statement included a brief description of the problem, some actual statements presented in the interviews, and comments by the analysts. An example of a reported 'poor condition' is as follows: "People do not prepare for meetings. 'There are no useless meetings, but I have no time to prepare for every meeting.' 'People do not bother to prepare for meetings.' Comment: If a participant is not prepared for the meeting, it may be impossible for him to participate in the handling of the matter, or the handling is ineffective. The result may also be an inadequate or even wrong decision, if the person who has not prepared has the power and willingness to make the decision." The number of statements in each case was 104, 101 and 44. The statements were organised in groups. In one of the cases the groups were: Meetings, Available time, Quality system, Decision making, Teams and shifts, Information systems, Activity of personnel, Safety, Emergency situations, Procedures, Training, Development activities, Co-operation and interaction, Practices, Management, authority and responsibilities. In the other cases, partially the same groups were used. Other groups used were as follows: Arrangements of production, Practices of
foremen, Projects, Documentation, Maintenance activities. The specific groups in each case depended on the material obtained from the interviews. The relative number of statements in each group was presented. In order to further process the reported problems, it was suggested that appropriate groups of employees evaluate the identified problems and try to develop solutions.

RESULTS

The test cases revealed both the well-working and the unsatisfactory features of the methodology: • Definition of specific criteria for the selection of hazards during hazard identification was included in the methodology at the beginning of this project. It diminished the discussion on whether a hazard should be reported or not. • The plant description was evaluated as a good means to organise plant information for hazard identification. Identification of hazards related to 'Inputs', 'Operations' and 'Outputs' proceeded well and was productive. However, the way in which the other parts of the description were handled did not work as well, and it could not be improved. • The reporting and the procedure for evaluating the identified hazards were improved during the project. Selection of significant hazards from a group of a reasonable number of identified hazards appeared to perform well, even if it is not a very formal procedure. • Identification of 'error producing conditions' by using accident scenarios proved to be a functional solution. The number and range of the identified conditions were more than satisfactory for the purposes of a TOMHID analysis. The plant personnel were also very interested in the list of identified conditions. Unfortunately, its use as a part of the final results of the analysis is limited, since it is based on very few interviews. • Identification and, especially, reporting of problems in working condition management would require more guidance and support in the methodology. The plant representatives were much less certain how to proceed with this report than with the report of identified hazards. The plant representatives' unfamiliarity with treating organisational problems formally may partly explain this difference. The test cases proved that the concept of the TOMHID methodology is functional. TOMHID analyses produced results that the plant representatives regarded as important and useful. It was especially noted that the analysis supports the long-term safety management of a plant. It required 1.5-2.5 man-months to lead and report a TOMHID analysis. Approximately as many resources were required from the plant personnel. This was estimated to be possibly too much in commercial use, even if the plant representatives regarded the results of the analyses as very satisfactory. When compared to the possible costs of any of the identified possible accidents, the costs of the analysis are low.

REFERENCES

Heikkilä, J., Rasmussen, B., Rouhiainen, V. & Suokas, J. (1995). MIMIX - Method for investigating management impact on causes and consequences of specific hazards. VTT Research Notes 1689, Technical Research Centre of Finland, Espoo, Finland.
Rasmussen, B. & Whetton, C. (1993). Hazard identification based on plant functional modelling. Risø-R-712(EN), Risø National Laboratory, Roskilde, Denmark.
Suokas, J. (ed.) (1995). An overall knowledge-based methodology for hazard identification. Results from the TOMHID project. VTT Research Notes 1658, Technical Research Centre of Finland, Espoo, Finland.
Wells, G., Wardman, M. & Whetton, C. (1993). Preliminary safety analysis.
Journal of Loss Prevention in the Process Industries, 6:1, 47-60.
A4" Safety Culture and Management Attitudes
IN-DEPTH ANALYSIS OF ORGANIZATIONAL FACTORS. THE NEED FOR FIELD INQUIRIES
M. Llory Institut du Travail Humain, 17 rue des Espessas, 30660 Gallargues-le-Montueux, France
ABSTRACT

Analysis and interpretation of industrial and occupational accidents are shifting from the human error paradigm to the organisational accident paradigm. We present in this paper ten proposals which could form the basis of this new paradigm. Work organisation, communication failures, organisational factors, and the perceptions of operators of their own work are fundamental aspects of this way of thinking about safety. We illustrate the proposals with two different accident cases: the space shuttle Challenger accident and an accident that happened to a crane-driver.
KEYWORDS
Accidents; Accident investigation; Organisational factors; Paradigm; Precursor; Safety management; Work analysis.
INTRODUCTION

Accidents and incidents: what questions should we ask? What models should we take?
Each serious industrial or occupational accident raises the same serious and complex questions for experts and managers. Was the accident predictable? Was it avoidable? What were the root causes behind it? How far back do we need to go into the history of the human organisational structures which are implicated? Was the analysis complete and thorough? Can we put an accident down to the mere chain-effect of some important events and tangible, established facts? Can we build a general accident model and statistically pinpoint relations of cause and effect? What information can be drawn from accidents? (Llory, 1996). We must admit as well that, far from functioning regularly and smoothly, sociotechnical systems are riddled every day with little incidents or minor dysfunctionings, and that latent causes of failure, hidden technical or organisational inadequacies, tend to create potential weaknesses in these systems. From here on, accident prevention largely depends on the capacity of the organisation concerned to identify, acknowledge, master or reduce the incidents and loopholes in question.
These determining safety objectives are closely dependent on the conception that the people who are directly involved have of accidents and their causes, of incidents and their relation to accidents: that is, their conception of how events typically arise, how latent they are, and their typical potential to develop and degenerate into accidents.

From the human error paradigm to the 'organisational accident'
The paradigm of human error, as we have named it, has been built, expanded and stabilised over the last decades. By 'paradigm' we mean a group of concepts, theories and their underlying models (human and organisational models), analytical and investigatory methods, procedures for interpreting and gathering information, etc. (Vinck, 1995). Amongst other advantages, this paradigm has enabled us to progress from 'accusatory dynamics' in companies to 'functional (analysis) dynamics' (Dodier, 1995). For all that, people have never stopped criticising the paradigm of human error, sometimes vigorously. We will simply recall C. Perrow's critical analysis (1982) concerning the interpretation of the Three Mile Island accident in the corresponding report by the Presidential Commission which, in particular, qualifies the diagnosis of human operator error as a retrospective judgment. We are thus quietly introducing a fundamental question concerning analysis methods that was pointed out clearly by D. Vaughan (1996). As with any historical procedure, an interpretation - and accident or incident analysis can be considered such a procedure - has to take account of the context (organisational, psychological, or knowledge-related) before and at the time of the events, of the state of mind of the people involved, and of their understanding of the situation workwise before the accident. We must not make the mistake - sometimes in rather an insidious way - of introducing presumptions which stem from our (retrospective) knowledge of the accident or incident. With human error identification, analysts seem to come up against an insurmountable difficulty and to be hardly able to get any further than acknowledging the error and its consequences. If we go back to G. Bachelard's concept (1993), we can say that, from the epistemological point of view, human error is an obstacle when analysing incidents and accidents (M. Llory and A. Llory, 1997). Another conception of accidents, organisational this time, has been outlined and put forward in opposition to the human error paradigm. This organisational concept has been developed by authors like C. Perrow (1984), Diane Vaughan (1996) regarding the Space Shuttle Challenger accident, L. Clarke (1992) concerning the wreckage of the Exxon Valdez, and J. Reason, author of 'The Age of the Organisational Accident' (1990). Here we quote L. Clarke and J.F. Short (1993): 'In the preceding remarks one can see humans erring, but because of social structure and culture rather than individual failures. The intellectual problem is that as a theory of sociotechnical breakdown, human error presumes more than it explains, obscuring the complexities of interaction between humans, machines, and organisation. We learn more about how risks are produced by theorising organisational factors such as production pressures, managerial expectations, and regulatory effectiveness than by focusing on human error.' (p. 387). The aim of this article is to emphasise the need for field enquiries ('clinical' as opposed to statistical) in order to analyse organisational factors which could lead to accidents or accidental situations, and to extend the organisational conception of accidents by introducing work organisation aspects. Two cases have been chosen to support this reasoning throughout the paper: - the accident to the Space Shuttle Challenger, which is probably the best-analysed accident we have in terms of the human and social factors contributing to this industrial accident.
- an occupational accident we analysed ourselves directly: the fall of a motor-driven crane, which led to the serious injury of the driver.
TWO VERY DIFFERENT CASES OF ACCIDENTS
We will not give details about the analysis of the Challenger accident: there is a very rich and critical bibliography already available (Llory, 1996). However, we will be referring to Vaughan's recent work (1996), amongst others, throughout this paper. Here is a brief reminder of the circumstances surrounding the second accident. Agents from a neighbouring department ask a crane-driver to carry out a lifting manoeuvre involving a heavy load. The agents who make the request tell the crane-driver that two of the sole-plates on which the crane stands are on unstable ground (a trench has been dug for the passage of a large electrical cable, then filled in). The crane-driver reduces the supporting surface underneath the crane to avoid a possible accident, but by so doing he unbalances it and creates a risk of another kind. The electronic safety device which monitors the load, crane-jib and available supporting surface is defective. The crane is thrown off balance and the driver, trapped in his cabin and partly crushed by the load, is badly injured.
TEN PROPOSALS FOR ACCIDENT ANALYSIS BASED ON ORGANISATIONAL AND PSYCHOLOGICAL FACTORS
The following basic proposals, ten in number, form an essential part of the grounds for a psycho-organisational conception of accidents which completes and extends the organisational conception in two ways:
- the constituent elements and conditions of work organisation are taken into consideration;
- safety questions are examined through the actual people concerned, notably decision-makers and managers.
This conception also underlines the historical dimension of each accident.
Proposal number 1: Precursors and weak signals before accidents
Accidents are preceded by precursors, but these are not recognised as signals of potential disasters. 'Bad news' does not reach the upper levels of organisations, where top managers and decision-makers could stop the development of accident sequences (Llory, 1996). In the case of Challenger, 'weak signals' concerning the dysfunctioning of the booster seal-rings were not taken into account. And the people in charge of the department for which the crane-driver was doing the job were either not alarmed by various 'signals', or were unable to identify them. These were:
- the existence of a very conflictual relationship between the crane-driver and his colleagues on the one hand and the foreman on the other;
- a renewed outbreak of work accidents within the team;
- an increasing work-load for the said team, who were given tasks in assembly, dismantling, maintenance and putting away material, tasks which were 'peripheral' to the main aims and activities of the department, which were related to testing.
Proposal number 2: Communication failures, a major source of accidents
Formal structures in organisations have a tendency to rigidify what goes on within them: the circulation of information, and of signals related to potential hazards, is filtered, distorted or blocked. Over recent years, there has been a widely-accepted tendency to interpret accidents and organisational dysfunctioning in terms of communication failure (Rogers Commission, 1986), 'distorted communication' (Tombs, 1990), or 'pathology of communication' (Dejours, 1992).
Studies of communication processes within organisations would concern how densely and how smoothly information circulates, the authenticity of what the people involved say, the possibilities field personnel have of expressing their fears and doubts, and managers' and experts' levels of acceptance or tolerance of 'bad news'. These fundamental components and characteristics are implicated in organisational diagnosis and in forecasting the way organisations will evolve. Informal aspects of communication are very important and require our close attention. Communication breakdowns were analysed during the hearings of the Presidential Commission on the space shuttle Challenger accident (Rogers Commission, 1986). 'Bad news' related to the degraded social climate (see the following proposal) of the team in which the crane-driver was working did not reach the top managers.
Proposal number 3: Defensive attitudes of managers and organisational pressures
Denial of information (its refusal, in the psychological sense) is not a rare symptom among managers who are far from the field. In this way, 'bad news' from field operators or personnel can be played down or refuted. Our attention subsequently switches to the executives and managers rather than to the operators who usually make the errors. However, executives and managers often shelter behind complex attitudes, systematically underestimating the difficulties facing the field personnel. They camp firmly on extremely normative positions, while discussions with them reveal their profound ignorance of human factors and of work-related organisational factors. They tend to approach accidents from the behavioural aspect and in relation to the individual. These defensive attitudes can build up pressures within organisations: pressure to be productive and pressure to conform to procedures and prescriptions. Chapter VIII of the report on the Challenger accident was entitled 'Pressures on the System' (Rogers Commission, 1986). Some years before the crane-driver's accident, production pressures led to unfavourable decisions concerning work organisation: the team in which the crane-driver was working was reduced; the consequent overload of handling and storage tasks led to the postponing of important jobs, and to growing social tensions within the team.
Proposal number 4: From individual behaviour to work organisation
A lot of recurrent and persistent problems in organisations are related to work organisation, and not to individual behaviour. 'We only find what we are searching for': we have to incorporate a careful examination of how work is really organised into our analysis of organisations and sociotechnical systems (Llory, 1997). Diagnosing safety questions in relation to organisational factors presupposes that we analyse co-operation and communication in the judgments and decisions made in the work context (work-related difficulties or risks), particularly at the interface between executives, managers and field personnel. Analysis centres on everyday work, evaluating work-teams' and management-teams' adaptability and reactivity to unforeseen situations, and the extent to which these situations are collectively anticipated and elaborated (M. Llory and A. Llory, 1996a). This type of enquiry must be carried out with clear-cut aims and protocol, and within a strict deontological framework.
The depth and intensity of operators' personal motivation is a fundamental indicator for diagnosis and forecasting. In the case of Challenger, managers and engineers at NASA and Morton Thiokol (the subcontractors who designed and manufactured the boosters) were too busy elsewhere to attend to the conditions prevailing in their own work organisation: the second teleconference was arranged over-hastily; Morton Thiokol engineers had insufficient time to prepare arguments and visual aids; the communication problems of a teleconference involving three different sites were underestimated; discussions and the circulation of information of vital importance for the launch continued well into the night, so the participants' degree of resistance to fatigue was overestimated (Mitler, 1988); etc. After the crane-driver's accident, the managers' first analyses did not show the determining influence of work organisation behind the event: an excessive amount of work, which meant the driver did several jobs in parallel; an accumulation of problems related to storing heavy material; not enough attention paid to the maintenance of control and safety devices; a tendency to carry out jobs single-handed rather than in twos, to avoid 'unnecessary' employment (the people who requested the lifting operation were not trained to drive such machines or to understand the problems associated with them); and the failure of the management to arbitrate in conflicts.
Proposal number 5: The importance of historical aspects
We need to include historical aspects in our analysis of organisational factors in risk-prone systems and accidents. Accidents are 'rooted in history' (Rogers Commission, 1986). To explain the Challenger accident, the Rogers Presidential Commission showed it was necessary to go back thirteen years in time, and not just the fifteen hours to the beginning of the second teleconference during which the Morton Thiokol managers reversed their position. In the second case analysed, production pressure, aimed at making the company's technical installations financially viable, had led several years before to a chain of decisions and choices which had an unfavourable effect on the company's work organisation and created favourable ground for highly accident-prone work situations to arise: a reduced number of workers in the crane-driver's team; the removal of an intermediate engineering post which had ensured arbitration duties whenever conflicts arose within the team; an emphasis on 'profitable' jobs and a tendency to let 'auxiliary' or 'peripheral' tasks like storage and maintenance build up; no works foreman or pilot to provide support or guidance for the crane-driver in his work.
Proposal number 6: Examples of organisational factors
The main organisational factors which seem to be present in a lot of accidents are the following:
- the production of a local culture (or mentality) which is resistant to new ideas: see the concept of paradigm (Vaughan, 1996);
- a culture of production and profitability which generates pressure on the field personnel; the pernicious effects of this pressure are not properly assessed by middle and top managers;
- 'structural secrecy' (Vaughan, 1996);
- a system of heavy, rigid, impersonal and omnipresent procedures; this system induces a general attitude of conformity (Vaughan, 1996);
- technical failures and human conflicts that are not solved or settled by arbitration; the effects of these problems tend to last or to get worse.
In the case of the Challenger accident, two additional factors arising from D. Vaughan's analysis appear decisive:
- the apparent lack of any analysis or thought, among the managers and engineers in charge of examining the files and preparing the launch decision, about the working methods and work organisation conditions that prevailed (see Proposal number 4);
- the occulting of safety concerns and of basic safety rules (notably the principle of precaution), illustrated by the absence of safety experts at the decisive teleconference. This point was emphasised by the Rogers Commission (1986), which entitled Chapter VII of its report 'The Silent Safety Program'. The safety back-up often provided by safety experts was therefore non-existent.
Proposal number 7: From audits to field enquiries
These factors and their corresponding effects are not easy to detect by audits. Audits are traditionally carried out to assess organisational conformity to safety standards. We prefer field enquiries, which analyse the flexibility and reactivity of organisations and detect potential problems and incident precursors that cannot be determined beforehand. It is particularly important to evaluate the fluidity of bottom-up communication, as mentioned above (Proposal number 2). Enquiries demand much active and difficult attention from the participants; they can be likened to the comprehensive approach (Dejours and Abdoucheli, 1990), already recommended by Taylor (1981) for safety analysis. They presuppose a fundamental reversal from the epistemological angle, and lead us to give predominance to those who actually carry out the work and not to the experts who dictate or prescribe the phases and organisational changes of the work (M. Llory and A. Llory, 1997). T. Dwyer (1992) argues on similar lines, stressing the need for a change of paradigm. The enquiry following the crane-driver's accident was carried out in such conditions. It showed up not only unfavourable safety-related factors, the weak points of the organisational system (see Proposals number 4 and 5), but also the dynamics of the human and collective processes involved. Above all, the enquiry showed how a certain number of managerial choices - productivity pressure, work organisation methods using teams set up for the occasion, closer relations between operators and customers (who actually had access to the technical installations and hence exerted additional pressure on the personnel) - tended to affect the conditions of work organisation and to destabilise it.
Proposal number 8: Requirements for field enquiries
These field enquiries need a specific professional code of ethics and a well-adapted methodological framework. We prefer collective interviews to individual ones, and give the interviewed teams the status of working groups. Post-accident enquiries need special care, as working teams are stressed and destabilised by the event. The results of enquiries should be handed over for validation to the people involved in the events who participated in those enquiries. Above all, enquiries enable us in this way to understand accidents, their causes and the dynamics behind their development through the protagonists themselves, and also to reconstitute and restore - in a symbolic way too - socioprofessional relations which have been upset or strained by the accidents.
Proposal number 9: Inconsistencies in work descriptions
In-depth studies (of the 'clinical' type) are needed on real working practices and on what operators and managers actually perceive of normal situations. Investigations are needed after accidents, like those into the Challenger (Rogers Commission, 1986) and Dryden accidents (Reason, 1995). These investigations would allow us to detect inconsistencies between the work descriptions given by different categories of personnel (for the concept of description, see M. Llory and A. Llory, 1996b). They would be available to experts for scientific discussion, and would allow descriptive and explanatory models to be tested.
Proposal number 10: Panels for studying potential precursors
Incident precursors are usually noticed and pointed out by field personnel: before most accidents, people who are aware of potential and emerging hazards try to forewarn decision-makers, but they do not succeed. They should be able, as a last resort, to seek the help of an independent control panel or commission. It is urgent to set up such panels within organisations. These advisory commissions should be entirely independent of the organisational powers concerned and have efficient means of action at their disposal.
DISCUSSION - CONCLUSION

A radical change seems to be taking place in the conception of accidents. The human error paradigm, still dominant today, presupposes that we should have a prior and virtually thorough knowledge of the potential incidents and defects in the sociotechnical system and in procedures if we are to offset them. In this view, accidents happen because operators and certain managers do not conform to the prescribed rules (as with the Challenger accident, where the common interpretation of the Rogers Commission report put the ultimate blame on the middle managers, as Vaughan (1996) reminds us). This interpretation of accidents justifies the conformity audit as a fundamental type of safety method. But, as D. Vaughan convincingly points out, the Challenger accident is in fact a tragic case of conformity: the managers respected the rules scrupulously. It is also a tragic case of silence (Llory, 1996). With the new conception of accidents, we must go right down to the roots of how organisations function and how work is organised, to informal communication, co-operation and elaboration, which implies, as we have seen, specific enquiry methods. Comprehensive enquiry methods like this can prove effective in accident prevention.
REFERENCES
Bachelard, G. (1993). La formation de l'esprit scientifique, Librairie Philosophique J. Vrin, Paris, France.
Clarke, L. (1992). The wreck of the Exxon Valdez. In: Controversy. Politics of technical decisions, Sage Publications, Newbury Park, California, U.S.A., 80-96.
Clarke, L. and Short, J.F. (1993). Social organization and risk: some current controversies. Annu. Rev. Sociol., 19, 375-399.
Dejours, C. (1992). Pathologie de la communication. Situation de travail et espace public : le cas du nucléaire. Raisons pratiques, Ed. de l'EHESS, Paris, France, 3, 177-201.
Dejours, C. and Abdoucheli, E. (1990). Itinéraire théorique en psychopathologie du travail. Revue Prévenir, 2: 19, 3-19.
Dodier, N. (1995). Les hommes et les machines, Métailié, Paris, France.
Dwyer, T. (1992). Industrial safety engineering. Challenges of the future. Accident Analysis and Prevention, 23: 3, 265-273.
Llory, M. (1996). Accidents industriels : le coût du silence, L'Harmattan, Paris, France.
Llory, M. (1997). Human- and work-centred safety: keys to a new conception of management. Ergonomics, to be published.
Llory, M. and Llory, A. (1996a). L'intervention en entreprise : de l'analyse du comportement à celle du travail. Proc. XXXIst Conf. S.E.L.F. (Société des Ergonomes de Langue Française), Brussels, Belgium, 2, 100-108.
Llory, M. and Llory, A. (1996b). Description gestionnaire et description subjective : des discordances. Revue Internationale de Psychosociologie, 3: 5, 33-52.
Llory, M. and Llory, A. (1997). Psychodynamique du travail et prévention des accidents du travail : vers un renouvellement des analyses et des pratiques de sécurité. Colloque International de Psychodynamique et de Psychopathologie du Travail, Paris, France, 30-31 January 1997.
Mitler, M.M. et al. (1988). Catastrophes, sleep and public policy: consensus report. Sleep, 11: 1, 100-109.
Perrow, C. (1982). The President's Commission and the normal accident. In: Accident at Three Mile Island. The human dimensions, Sills, D.L., Wolf, C.P. and Shelanski, V.P., Eds., Westview Press, Boulder, Colorado, U.S.A., 16, 173-184.
Perrow, C. (1984). Normal accidents. Living with high-risk technologies, Basic Books, New York, U.S.A.
Reason, J. (1990). The age of the organizational accident. Nuclear Engineering International, July, 18-19.
Reason, J. (1995). A system approach to organizational error. Ergonomics, 38: 8, 1708-1721.
Rogers Commission (1986). Report of the Presidential Commission on the Space Shuttle Challenger Accident, Washington, D.C., U.S.A.
Taylor, D.H. (1981). The hermeneutics of accidents and safety. Ergonomics, 24: 6, 487-495.
Tombs, S. (1990). A case study in distorted communication. Institution of Chemical Engineers Symposium Series, 122, 99-111.
Vaughan, D. (1996). The Challenger launch decision, The University of Chicago Press, Chicago, U.S.A.
Vinck, D. (1995). Sociologie des sciences, Armand Colin, Paris, France.
SAFETY MANAGEMENT AND ACCIDENT PREVENTION: THE STATE OF THE ART IN 14 SMALL AND MEDIUM-SIZED INDUSTRIAL PLANTS

A. Seppala
Finnish Institute of Occupational Health, Department of Occupational Safety, Laajaniityntie 1, FIN-01620 Vantaa, Finland
ABSTRACT
Swedish-speaking workers have accidents at work about one third less frequently than Finnish-speaking workers. A study was designed to survey the safety culture in 14 small and medium-sized companies; six Finnish-speaking, six Swedish-speaking, and two bilingual companies. By means of a questionnaire, data were collected about the following issues: perception and assessment of safety hazards at work; safety attitudes; safe behavior and risk-taking; safety in work planning and in work methods; general safety actions and management issues within the plant. In all, 333 people (managers, supervisors, clerical workers, blue-collar workers) filled out the questionnaire consisting of 104 items. The results indicated that the Swedish-speaking companies were significantly better when the safety hazards of the work environment and machinery were evaluated. The senior management's attitudes towards safety were evaluated as being better in the Swedish-speaking companies. Supervisors paid more attention to the overall safety of the work environment in the Swedish-speaking companies than in the Finnish-speaking companies.
KEYWORDS
Accidents, safety hazards, safety management, safety culture
INTRODUCTION
Finland has about 5 million inhabitants and two official languages, Finnish and Swedish. The Swedish-speaking inhabitants form a minority of about 300 000 people. Most of the Swedish speakers live on the southern or western coast of Finland. Some of Finland's population is Swedish-speaking owing to the shared history and long-standing cultural and political contacts between Finland and Sweden. The cultural and social border between these two groups has traditionally been distinct. This cultural difference is reflected in a number of ways. The worker populations of these two groups seem to differ as to the outcomes of their behavior. According to a variety of national statistics, Swedish-speaking companies have a lower frequency of occupational accidents than Finnish-speaking companies. Salminen et al (1996a) observed that Swedish-speaking workers have about 20-40 % fewer occupational accidents than do Finnish-speaking workers in Finland, when adjustments are made for the branch of industry in question. These statistical findings led us to seek possible explanations in the safety culture and the safety management practices of the enterprises. Safety culture is defined here as a composite of values and attitudes towards safety that are reflected in the ways and modes of daily safety practices and behaviors at work. The safety culture indicates implicitly how to behave, how to organize things, how to reward behavior, and how to punish behavior. The culture arises from perceptions of behavioral outcomes. It is based on a long-term history of joint experience. Values and attitudes concerning safety are individual, but also depend on organizational values within a specific company. Individual values may differ from the organizational values; yet a strong organizational culture may have a primary effect on behavior. Different companies certainly have different safety cultures. It is important to identify the safety factors that are linked with accident occurrence. A previous study by Seppala (1992) reported that the accident rates of industrial enterprises were associated with their safety culture, especially with the managerial practices reflecting organizational responsibility for safety. Accident frequencies have been found to differ considerably between Finnish-speaking and Swedish-speaking companies. This study was conducted in order to investigate the possible differences in safety culture between Finnish-speaking and Swedish-speaking companies. The aim of this study, a pilot project, is to help us focus more attention on the relevant characteristics of safety culture and of safety management practices in small and medium-sized companies.
METHODS
A study was planned to survey the safety culture, i.e. the forms and expressions of safe behavior, and the implementation, effectiveness and problems of safety management in 14 small and medium-sized companies. The enterprises represented the following spheres of activity: eight metal plants, four woodworking plants, and two dairies. The number of personnel in the separate plants ranged from 24 to about 60. Six companies were predominantly Finnish-speaking and six were Swedish-speaking. Two companies had about the same number of Finnish-speaking and Swedish-speaking personnel. Data were collected by means of a questionnaire. The questionnaire on safety culture focused on the following issues:
- background data
- perception and assessment of safety hazards at work: work environment, machines and devices, work methods
- safety attitudes, safe behavior and risk-taking
- safety in work planning and in work methods
- general safety actions and management issues within the plant.
In all, 333 persons (managers, supervisors, clerical workers, blue-collar workers) filled out the questionnaire, consisting of 104 items. Accident occurrences were surveyed by means of the questionnaire, as well as by a follow-up of the statistics from the previous 5 years. This paper presents the preliminary data obtained from the project. The differences are reviewed on the basis of t-tests of the means between companies grouped according to the language predominantly spoken in them (mostly between the Finnish-speaking and Swedish-speaking companies).
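The paper does not give its computation details; the following minimal sketch shows the kind of two-sample comparison described, using SciPy. The per-respondent scores below are illustrative placeholders, not the study's data.

```python
# Minimal sketch of the comparison described above: a t-test of mean
# questionnaire scores between two company language groups.
# The arrays hold illustrative per-respondent scores, not the study's data.
import numpy as np
from scipy import stats

finnish_scores = np.array([2.1, 2.3, 1.9, 2.4, 2.2, 2.0, 2.5, 2.1])
swedish_scores = np.array([2.4, 2.6, 2.2, 2.5, 2.3, 2.7, 2.4])

# Two-sample t-test of the group means (pooled variance by default)
t_stat, p_value = stats.ttest_ind(finnish_scores, swedish_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```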
Research groups
The personnel were given the freedom to choose between a Finnish-language or a Swedish-language questionnaire form. We received 185 Finnish-language questionnaires and 145 Swedish-language questionnaires. There were no statistical differences between these groups as to the main occupational category, age, gender, length of employment, or function as a safety person. This paper describes the results only between the companies of the different language groups. We received 148 completed questionnaires from the Finnish-speaking companies, 134 from the Swedish-speaking companies, and 48 from the bilingual companies. Table 1 shows the distribution of respondents in the companies, by occupation.
TABLE 1
DISTRIBUTION OF THE RESPONDENTS IN THE FINNISH-SPEAKING, SWEDISH-SPEAKING AND BILINGUAL COMPANIES, BY OCCUPATIONAL CATEGORY

Occupational category   Finnish-speaking       Swedish-speaking       Bilingual
                        companies (n=145), %   companies (n=132), %   companies (n=47), %
management              5                      6                      15
supervision             10                     17                     17
clerical work           7                      3                      8
worker                  78                     74                     60
Total                   100                    100                    100
There were proportionally fewer workers in the bilingual companies. Otherwise there were no statistical differences as to the age or gender of the personnel, length of employment, or function as a safety person among the three groups of companies.
RESULTS
Accident frequencies of the companies

During 1990-1995, 244 occupational accidents had occurred in the companies studied (Salminen et al, 1996b). The frequency of accidents by the companies' main language is shown in Table 2. The bilingual companies had a significantly higher accident frequency than did the predominantly Swedish-speaking or Finnish-speaking companies. The accident frequency in the Swedish-speaking companies was 16% lower than in the Finnish ones; the difference was not, however, statistically significant.
TABLE 2
ACCIDENT FREQUENCY (ACCIDENTS PER 1000 PERSON-YEARS) AT THE WORKSITE, BY THE COMPANIES' MAIN LANGUAGE

Main language of the company   Companies (n)   Accidents   Person-years   Accident frequency
Finnish                        6               103         1293           79.66
Swedish                        6               66          986            66.94
Bilingual                      2               75          444            168.92
Perception and assessment of safety hazards at work

Table 3 shows the mean of the 18 hazard items according to the companies' main language.
TABLE 3
THE MEAN AND THE T-VALUES OF THE EVALUATIONS OF SAFETY HAZARDS (18 ITEMS) BY THE COMPANIES' MAIN LANGUAGE (RESPONSE SCALE: 1=POOR, DANGER EXISTS; 2=SOME DEFICIENCIES EXIST; 3=GOOD, NO DANGER)

Main language     People   Mean of the    SD     t           p<
of the company    (n)      hazard items
Finnish           148      2.20           0.36   1) -3.400   .001
Swedish           138      2.35           0.38   2) -0.320   n.s.
Bilingual         47       2.37           0.37   3) -2.817   .01

1) difference between Finnish-speaking and Swedish-speaking companies
2) difference between Swedish-speaking and bilingual companies
3) difference between Finnish-speaking and bilingual companies
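The reported t-values can be approximately reproduced from the summary statistics in Table 3. A sketch, assuming a pooled-variance two-sample t-test (small discrepancies stem from rounding in the published means and SDs):

```python
# Reproducing the Finnish vs. Swedish t-value of Table 3 from the published
# summary statistics, assuming a pooled-variance two-sample t-test.
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=2.20, std1=0.36, nobs1=148,  # Finnish-speaking companies
    mean2=2.35, std2=0.38, nobs2=138,  # Swedish-speaking companies
    equal_var=True,
)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # t close to the reported -3.400
```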
The personnel in the Swedish-speaking companies reported fewer hazards than did the personnel in the Finnish-speaking companies. The evaluation of the hazards in the Swedish-speaking companies was significantly better for the following items:
- orderliness of the work environment
- traffic arrangements, passages, hallways
- work space
- lighting of the work sites
- safety devices on the machinery
- physical strain.
It appears that management in the Swedish-speaking companies pays more attention to maintaining and improving the general work environment, as well as to ensuring safe preconditions at the worksites.
Safety attitudes, safe behavior and risk-taking

The mean of the 14 safety attitude items (response scale: 1=highly disagree; 5=highly agree) was 3.56 for the Swedish-speaking companies and 3.49 for the Finnish-speaking companies. The attitudes towards safety were thus slightly better in the Swedish-speaking companies; the difference, however, was not statistically significant. Personnel in both of the company groups indicated high interest in safety. Other workers were also seen as appreciating safety. The attitudes in the bilingual companies (mean 3.77) were significantly better than in the Finnish-speaking or Swedish-speaking companies. About 69-73% of the personnel in each group of companies reported that they took more risks when they were very busy at work. The attitudes in the Swedish-speaking companies were significantly better than those in the Finnish-speaking companies for the following single items:
- the senior management considers safety to be important
- supervisors ensure the safety at the workplace
- I can influence the safety of my job.
The attitudes in the Finnish-speaking companies, on the other hand, were significantly better for the following items:
- I take safety into consideration in my work habits
- supervisors intervene when they observe unsafe work habits
- safety personnel have an impact on safety at my workplace.
The results suggest that the Swedish-speaking companies place more emphasis on the preconditions of safety in the work and in the work environment; also, supervisors are perceived as looking after preventive safety work, and the individual employee is seen as having an active impact on safety. The Finnish-speaking companies, in turn, seemed to emphasize work habits. Supervisors are seen as intervening when unsafe work habits are observed rather than as ensuring job safety or the safety of the work environment. Finnish-speaking companies also seem to emphasize the role of the safety staff.
Safety in work planning and in work methods

The evaluations as to the safety of daily work and work methods were slightly better in the Swedish-speaking companies than in the Finnish-speaking ones. The mean of the 30 items was 2.06 for the Swedish companies and 1.99 for the Finnish companies (response scale: 1=poor; 3=good); the difference, however, was not statistically significant. The evaluations were best for the bilingual companies (mean 2.14). There were only a few significant differences in the single items between the Swedish-speaking and Finnish-speaking companies. The personnel in the Swedish companies rated the following items better than did the personnel in the Finnish companies:
- financial resources for safety
- condition of the machines and devices
- safety devices on the machines
- availability of first-aid equipment.
The results again seem to indicate that the Swedish-speaking companies took better preventive actions to ensure the safety of the work environment and machinery.
General safety actions and management issues within the plant

The personnel in the Swedish-speaking companies also evaluated the level of various safety activities and practices as slightly better than did the personnel in the Finnish-speaking companies. The mean of the 22 items concerning safety actions and safety management was 2.99 in the Swedish-speaking companies and 2.91 in the Finnish-speaking companies (response scale: 1=very poor; 5=very good); the difference was not statistically significant. The evaluations were again best in the bilingual companies, the mean of the scale being 3.21. The evaluations of the safety level were, in general, quite poor. At best, only one third to about a half of the personnel in each main-language group evaluated the safety level as good. If we examine the evaluations of the blue-collar workers only, the ratings were even slightly worse; the mean evaluation of the safety level was 2.73 for the workers in the Finnish-speaking companies, 2.84 for the workers in the Swedish-speaking companies, and 2.90 in the bilingual companies.
The worst evaluations given for safety activities are shown in Table 4. It seems that all of the small and medium-sized companies in the study had paid too little attention to the implementation and improvement of general safety practices.
TABLE 4
THE WORST EVALUATIONS OF THE SAFETY PRACTICES IN THE FINNISH-SPEAKING AND SWEDISH-SPEAKING COMPANIES

Item of safety activity                      Finnish-speaking companies,   Swedish-speaking companies,
                                             poor or very poor             poor or very poor
                                             (% of all responses)          (% of all responses)
safety education                             57                            57
safety control                               44                            46
definition of safety responsibilities        42                            46
discussion of accidents that had occurred    42                            45
safety information                           40                            47
DISCUSSION

The results indicated that the Swedish-speaking companies had slightly lower accident occurrences (a 16% lower accident frequency) than the Finnish-speaking companies in Finland. The evaluations of the safety hazards, attitudes, work and work methods, and the safety level of the companies all suggest that better preventive measures are taken in the Swedish-speaking companies. The management of the Swedish-speaking companies seems to pay more attention to maintaining and improving the general work environment and the safety of the machinery, as well as to ensuring safe preconditions at the immediate worksite. The supervisors and foremen are perceived as taking care of preventive safety work as well. Preventive safety work by supervisors is supported by better senior management attitudes. The personnel have a more direct impact on safety. The responses of the personnel in the Finnish-speaking companies indicated good personal attitudes towards safety. They also seemed to emphasize individual work habits. Supervisors are seen as intervening when unsafe work habits are observed rather than intervening in the work or work environment itself. The Finnish-speaking companies also emphasize the role of safety personnel. Owing to the small number of companies in the pilot study, the results should be considered only preliminary. Further research is needed on the differences that were found. The results of this study indicated some issues that could be focused on. No specific scientific conclusions can yet be drawn on real differences in the safety culture and management practices of the Finnish-speaking and Swedish-speaking companies. The safety evaluation of the personnel in the bilingual companies was especially problematic. The personnel in the bilingual companies evaluated the safety culture and safety practices on the whole as better than did the personnel in either the Finnish-speaking or Swedish-speaking companies. There were only two bilingual companies in the study. It is difficult to say how well they represent bilingual companies in general. For example, two bilingual companies that had better accident rates refused to participate in the study. The personnel of the bilingual companies included more clerical workers and management staff. These personnel groups usually have more positive attitudes towards safety activities than do blue-collar workers
(Seppala, 1992). There may also be some problems in the study method that was used, or in the approach adopted by the personnel when making their evaluations. The larger project, of which this study was a part, applied an on-site observation method to evaluate some safety factors, e.g. orderliness of the work environment, machine safety, work methods, ergonomics, hallways, and first-aid equipment. The results indicated that the bilingual companies had the best overall level of safety in these respects. Safety practices, attitudes towards safety and occupational hazards need to be studied more specifically in a variety of bilingual companies before any further conclusions can be drawn. The safety of the daily work and the safety practices followed were, at best, evaluated as being only mediocre. A good starting point for improvement would be to provide safety training for the specific needs identified in small and medium-sized companies, giving the personnel the opportunity to participate in the training. Another major step would be to assist small and medium-sized companies in defining and implementing a company safety policy. For instance, safety responsibilities, the provision of on-site job instruction, improved information processing, hazard identification and safety inspections, safety control, and procedures for implementing safety measures all need to be defined and specified in the company safety policy. The practices adopted in, and the recommendations derived from, large companies and corporations usually cannot be applied as such in small and medium-sized companies.
References

Salminen, S., Johansson, A., Hiltunen, E. and Strömnes, F.J. (1996a). Accident frequencies among the Finnish-speaking and Swedish-speaking workers (in Finnish). Työ ja Ihminen, 10: 2, 125-136.
Salminen, S., Johansson, A. and Seppala, A. (1996b). Differences of safety culture and production arrangements in Finnish-speaking and Swedish-speaking companies (in Finnish). Finnish Institute of Occupational Health, Helsinki, Finland.
Seppala, A. (1992). Evaluation of safety measures, their improvement and associations with occupational accidents (in Finnish). Työ ja Ihminen, suppl. 1/92.
SAFETY PRACTICES AND RISK ATTITUDES IN FRENCH SMALL COMPANIES

M. Favaro and C. Davillerd
National Research and Safety Institute (INRS), Avenue de Bourgogne, B.P. 27, 54501 Vandœuvre Cedex, FRANCE
ABSTRACT

A survey of the safety practices and risk attitudes of French SMEs has been carried out. 98 randomly sampled firms, in various business sectors and ranging in personnel from 20 to 200, were investigated. The objective was to explore the relations between safety practices, occupational accidents, risk attitudes and other managerial and economic characteristics of firms. The results show a strong differentiation in the level of safety practices observed in the surveyed firms. Further analyses show that those firms that give good decisional autonomy to Health and Safety personnel (if any) suffered the fewest accidents. Firms economically less profitable than their respective sector average tend to suffer a greater number of accidents. Firms that delivered categorical and/or negative Health and Safety opinions suffered more serious accidents and are the less safety 'active'; firms which gave more balanced and/or positive judgments suffered less serious or minor accidents and are more 'active'.

KEYWORDS
Clustering Analysis, Health and Safety Practices, Multiple Correspondence Analysis, Occupational Accidents, Principal Components Analysis, Safety Management, Small Firms, Risk Attitudes.

INTRODUCTION

Health and Safety (H&S) in French Small and Medium Enterprises (SMEs) is a matter of some concern, as companies with staffs under 300 make up about 99.5 % of existing companies and employ about 75 % of the whole workforce. Besides their numerical weight, empirical evidence shows that H&S is often poorly handled, and that it remains difficult to make any significant progress in its management (CNAM, 1994; Eakin, 1988; Favaro et al, 1997a). Large managerial differences are also empirically observable between small and larger companies (Favaro, 1992). It was therefore decided to carry out a survey using personally administered questionnaires. The objective was to deepen the existing knowledge of the characteristics and determining factors of H&S practices, attitudes, and occupational accidents occurring in small firms. After a short methodological presentation, this paper gives a presentation of the main results achieved, with an emphasis on the statistical relations between H&S data and a set of structural, economic and managerial characteristics of the firms.

METHODOLOGICAL FRAMEWORK

Together with H&S data, a broad range of information was collected in order to obtain a description of each firm participating in the survey. Information came either from the questionnaires (i.e. information given by respondents) or from other sources, including descriptive data on firms from the French INSEE (1), statistical

(1) Institut National de la Statistique et des Études Économiques (National institute for statistics and economic studies).
data on occupational accidents from the Prevention Services of the Regional Health Insurance Funds (French CRAM), and economic and financial data from a specialized private database. The type of data collected was prescribed by the overall design of the research, stressing the following five themes:
Description of the H&S characteristics of firms:
1- health and safety practices and attitudes (risk/accident analysis methods, H&S training, available H&S information, opinions on protective devices and on H&S regulations, etc.);
2- miscellaneous health and safety characteristics (dangerous equipment, health problems, H&S representatives, statistical data on accidents, etc.).
Themes for the contextual description of firms:
3- position of the firm (economic activity, organizational and technological data, etc.);
4- running of the firm (firm targets, quality certifications, marketing, position of markets, appraisal of personnel, etc.);
5- profiles of respondents: age, training, function, length of service, former activities, timetables (market prospecting activities, negotiations with suppliers, supervision of work in progress, financial control, etc.).
In all, more than 250 variables, including 195 from the questionnaire administration, were collected from 98 firms, using a probability sampling method with constraints of size (20-200 personnel bracket), activity sectors (elimination of sectors little exposed to occupational accidents) and geographical localisation (the 11 French administrative areas whose prevention services participated). Special care was taken to suit the context of small firms: direct contact, an initial survey with open-ended questions, questionnaire tests, and a procedure acceptable to respondents in terms of duration and relevance of questions. SPAD.N statistical analysis software was used to process the data (2). Most of the variables being qualitative, Multiple Correspondence Analysis (MCA) was the main technique employed, together with Clustering Analysis (CA) (Benzécri, 1992; Cox and Cox, 1994; Tenenhaus and Young, 1985). Quantitative information, e.g. on occupational accidents or economic outcomes, was also subjected to more classical Principal Components Analysis (PCA).
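SPAD.N itself is not reproduced here; as a minimal sketch of what a correspondence analysis of such qualitative data computes, the following Python fragment runs a CA on a hypothetical 0/1 indicator matrix. The `correspondence_analysis` helper, the data and the supplementary category are all illustrative assumptions, not the survey's material.

```python
# Minimal sketch of correspondence analysis on an indicator matrix,
# in the spirit of the MCA used in this survey (hypothetical data).
import numpy as np

def correspondence_analysis(Z):
    """CA of a nonnegative matrix Z (here a 0/1 indicator matrix)."""
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardised residuals, then SVD
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    row_std = U / np.sqrt(r[:, None])              # row standard coordinates
    col_pc = (Vt.T / np.sqrt(c[:, None])) * sing   # column principal coordinates
    return row_std, col_pc, sing ** 2              # eigenvalues = sing^2

# Hypothetical answers of 6 firms to 2 yes/no questions, coded as one
# indicator column per category (yes/no for each question).
Z = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
], dtype=float)

row_std, col_pc, eig = correspondence_analysis(Z)
print("eigenvalues:", np.round(eig, 4))

# An illustrative (supplementary) category is projected afterwards as the
# profile-weighted average of the row standard coordinates: it does not
# influence the axes, it is only positioned on them.
z_sup = np.array([1, 0, 0, 1, 0, 0], dtype=float)  # hypothetical category
g_sup = (z_sup / z_sup.sum()) @ row_std
print("supplementary coordinates:", np.round(g_sup[:2], 3))
```

The last lines anticipate the technique of illustrative variables used below: supplementary categories are positioned on the axes after the fact, without contributing to them.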
EXPLORATION OF HEALTH AND SAFETY PRACTICES

Two analyses are presented below, both focused on the answers given about the declared H&S practices. The first analysis refers to the whole sample (n=98). The second one focuses on the sub-sample (n=32) of respondents who declared that they employ specialized H&S personnel.

H&S practices: whole sample analysis

The exploration of H&S practices consisted of the statistical analysis of the answers delivered for the following 15 qualitative variables:
1- H&S training (2 categories: 'yes/no');
2- safety checks on equipment (3 categories: 'yes/no/does not apply');
3- maintenance procedures (2 categories);
4- general H&S initiatives (8 categories, recoded open questions; see 'GAL HS INIT.' in Figure 1);
5- accident/risk analysis methods (2 'yes/no' categories);
6- H&S estimates (2 'yes/no' categories);
7 to 12- available H&S information (6 variables x 2 'yes/no' categories; see 'INFO' in Figure 1);
13- personnel committed to H&S (2 'yes/no' categories, recoded from 4 specific variables; see the sub-sample analysis below);
14 to 15- self-initiated contacts with H&S authorities (work inspectorate and occupational medicine) (2 x 2 'yes/no' categories).
(2) SPAD.N (Système Portable d'Analyse de Données Numériques) is developed by the CISIA (Centre International de Statistique et d'Informatique Appliquées), Saint-Mandé, France.
Figure 1 gives the graphical configuration of the managers' answers on the first 2 MCA axes (3).
Figure 1: Factorial chart of Health and Safety practices (whole sample). [The chart plots the answer categories on the first two MCA axes, opposing an 'active trend' (left) to a 'passive trend' (right); 93 respondents; eigenvalues: F1 = 0.2875, F2 = 0.1079; cumulative variance = 27%; PHI2 = 1.46.]

This chart strongly suggests that axis 1 discriminates two profiles of respondents, respectively called 'active' (left side) vs 'passive' (right side) (4). Notice that axis 2 here remains difficult to qualify. To highlight the main statistical determinants of the H&S results delivered by the basic factorial analyses, we used the technique of illustrative (or supplementary) variables. It consists in projecting selected variables (here the contextual ones described above) upon factorial charts previously built with the active (or principal) variables, i.e. the more descriptive H&S variables that participated in the determination of the factorial axes. Depending on the objectives to be reached, such information may concern either the 'individuals' (row profiles) or the 'data' (column profiles) (see Benzécri, 1992; Morineau, 1994; Favaro, 1997b).

(3) For all factorial charts, only categories with valid contributions on the axes are represented. The criterion for determining 'valid contributions' is presented by Cibois (1984).
(4) A clustering analysis (CA) worked out with the same data gave further confirmation of the adequacy of the partition. This CA allowed us to add a new 'H&S practices' variable, composed of the two modalities 'active' and 'passive'.

Figure 2 shows the mean positions of a range of categories, projected as illustrative variables on the factorial chart of H&S practices. The set of variables selected here is the following: sector activities, existence of an H&S agreement with the regional prevention services, existence of an H&S committee, training/education, juridical status (e.g. family business), number of hierarchical levels (≤3; ≥4).

Figure 2: Projection of illustrative variables on the H&S practices factorial chart. [Projected categories include industry sectors, the presence, past existence or absence of an H&S committee, H&S agreements with the regional prevention services, executives' qualification and the number of hierarchical levels.]

By plotting the points referring to the selected data, this chart (Figure 2) brings to light new information about a range of determinants of safety involvement in SMEs. Thus, in this example, H&S activity (the 'active trend') is associated with a high number of hierarchical levels (V-test = -2.3 for '≥4 levels' and V-test = 2.3 for '≤3 levels'), the presence of an H&S committee (V-test = -2.7 for 'presence' and V-test = 2.7 for 'absence') and with highly educated executives (V-test = -3.4 for 'higher education level') (5). Further analyses also indicate that the sector activities do not contribute much to explaining the observed differences between levels of H&S practices. The only exception is the firms that belong to the chemical industry, which tend to be among the more 'active' ones.

(5) The V-test ('test-value') is a statistical criterion available with SPAD.N. It is conceptually closely linked to the p-value and is used here to ensure a valid interpretation of the data. As a rule, we consider a V-test ≥ 2 as statistically significant. For more information, see Morineau (1992, 1994).
H&S practices: sub-sample analysis
Most of the variables remain similar to the former analysis (made on the whole survey sample). The only exception is that seven new qualitative variables have been substituted for the former variable 'personnel committed to H&S', in order to describe in more detail the characteristics of the H&S personnel. The first four variables refer to the H&S decision autonomy in the firm: 'to supervise jobs?', 'to decide on purchases or jobs?', 'to call meetings?', 'to intervene in the workplace?'. The last three variables describe the professional profile of the H&S personnel: function (technical/administrative/other), training (technical/other than technical/without training), hierarchical position (shop floor/supervisor/management). Figure 3 gives the MCA graphical configuration of the categories of answers for the sub-sample of respondents that have declared H&S personnel.
Figure 3: Factorial chart of Health and Safety practices (sub-sample). [The chart opposes an 'active trend' with strong decision autonomy of the H&S personnel (upper-left: supervision of jobs, decisions on purchases or jobs, calling meetings, intervening in the workplace) to a 'passive trend' with weak decision autonomy (bottom-right); 32 respondents; eigenvalues: F1 = 0.2780, F2 = 0.1772; cumulative variance = 27.59%; PHI2 = 1.65.]
Examining the chart shows that this sub-sample is composed of two different H&S personnel patterns. The upper-left side of the graph ('active' firms) brings together answers that refer to a high decision autonomy. On the other hand, answers typical of a weak decision autonomy are plotted on the bottom-right side ('passive' firms). This outcome supports the argument that the presence of personnel involved in H&S, although contributing to the 'active trend', does not in all cases imply a real emphasis on safety activities. Closer examination of this sub-sample (Figure 3) brings out a statistical relationship involving the tendency to be 'active' vs 'passive' when there are H&S personnel: 'active' firms leave much more decision autonomy to their H&S personnel than 'passive' ones do.

EXPLORATION OF OCCUPATIONAL ACCIDENT DATA

PCA were computed with a range of Occupational Accident (OA) variables recorded in the firm sample over the last three years (6). Table 1 gives the basic statistics for each variable subjected to the PCA. Variables 1 and 3 are related to the frequency of OA whereas variables 2 and 4 are severity indicators.

TABLE 1
BASIC STATISTICS FOR OCCUPATIONAL ACCIDENTS

Variables (3 years; recoded)      Sample size   Mean     Standard deviation   min.    max.
1- No of accidents with LWD       87            10.31    8.15                 0.74    47.17
2- No of LWD                      87            196.83   190.76               7.29    857.84
3- No of accidents without LWD    84            7.45     7.68                 0       35.90
4- No of LWD per accident         89            21.93    18.11                0.9     100.43
(LWD = Lost Work Days)
Table 2 gives the variable/factor correlations achieved with the PCA.

TABLE 2
VARIABLE/FACTOR CORRELATIONS (OA PCA)

Variables (3 years; recoded)      Factor 1   Factor 2   Factor 3   Factor 4
1- No of accidents with LWD       n.a.       -0.34      0.32       0.23
2- No of LWD                      n.a.       0.34       0.22       -0.25
3- No of accidents without LWD    n.a.       -0.37      -0.56      -0.03
4- No of LWD per accident         0.29       n.a.       -0.19      0.17
(LWD = Lost Work Days; n.a. marks the cells shaded in the original table, corresponding to the strong correlations discussed below)
Examination of the variable/factor correlations for the first two axes (Table 2) indicates a positive and rather strong correlation between the first factor and variables 1 to 3. This first axis has therefore been qualified as a frequency axis. Factor 2 is very strongly correlated with variable 4 (a severity indicator), and moderately but negatively correlated with variables 1 and 3 (frequency indicators). This second axis has therefore been interpreted as an indicator of severity. The PCA was mainly used to look for determinants of OA in the firm sample. One result to be noticed here is the following: the projection of economic variables upon the PCA chart shows that firms economically less profitable than their respective sector average statistically tend to suffer a greater number of accidents.
(6) Raw data were delivered by the 11 (out of 16) regional prevention services that participated in the survey.
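As a sketch of how such variable/factor correlations are obtained: for standardized variables, the correlation of variable j with factor k equals the j-th entry of eigenvector k scaled by the square root of eigenvalue k. The data below are random placeholders, not the survey's accident records.

```python
# Sketch of a PCA yielding variable/factor correlations like Table 2.
# The data are random placeholders standing in for the 4 OA variables.
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=5.0, size=(87, 4))  # 87 firms x 4 OA variables

R = np.corrcoef(X, rowvar=False)     # 4 x 4 correlation matrix
eigval, eigvec = np.linalg.eigh(R)   # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]     # reorder factors by decreasing variance
eigval, eigvec = eigval[order], eigvec[:, order]

# Variable/factor correlations ('loadings') for standardized data
loadings = eigvec * np.sqrt(eigval)
print(np.round(loadings, 2))
```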
EXPLORATION OF RISK ATTITUDES
Here we will focus on risk attitudes, considered through the responses delivered by managers to a set of 20 statements that convey some of the most typical attitudes - favourable or not - observed in the field of H&S. The answers given by the respondents could be either 'rather in agreement' or 'rather in disagreement' (2 categories). The statements put forward refer to 4 main topics and are the following:

1- personal protective equipment:
1.1- may impede the work procedures,
1.2- is a complement to the implementation of more technical solutions,
1.3- is interesting because of its cheapness,
1.4- is above all a regulatory requirement to which one is subjected,

2- technical protective devices:
2.1- allow risk control,
2.2- are mostly a means to be in accordance with H&S regulations,
2.3- generate loss of time,
2.4- are very effective,

3- to comply with H&S regulations:
3.1- a necessity to ensure employees' safety,
3.2- mostly a constraint,
3.3- more and more difficult to honour,
3.4- essential to give a good image of the firm,
3.5- a necessity to protect the environment,
3.6- a cost whose benefit is difficult to see,

4- to reach good safety records:
4.1- to reduce the unpredictable to a minimum,
4.2- to take on qualified personnel,
4.3- to give priority to the work organization,
4.4- to encourage personnel to take personal responsibility for safety,
4.5- to act upon major hazards or pollutions,
4.6- to define the rules and make them applied.
Figure 4 summarizes the MCA graphical configuration of the answers given by respondents. Synthesized statements are given, along with the percentages of answers and the reference number (in brackets) for each category. The first axis (see Figure 4) suggests that firms deliver either balanced and/or positive judgments (on the left side) or categorical and/or negative H&S opinions (on the right side). If we consider each sector (I to IV) of the chart in more detail, the main opinion tendencies about H&S are the following:
- Sector I brings together managers favourable to H&S regulation but remaining qualified in their opinions;
- Sector II illustrates a general attitude of opposition to technical prevention;
- Sector III conveys an attitude of general opposition to H&S rules and regulations;
- Sector IV always refers to favourable and/or qualified attitudes about protective devices (whether technical or personal ones).
Notice that not all the statements appear on the chart. Only 11 out of 20 statistically contribute to shaping the patterns of attitudes towards risks. Two of them (No. 1.2 and 3.4) appear twice, which means that they are associated with both 'agreement' and 'disagreement' answers. Those items therefore strongly contribute to illustrating the opposition observed with this MCA in the field of risk attitudes. Further projections (as illustrative variables, see above) of the Occupational Accident outcomes and of the Safety Practices (see Figure 1) on this 'Risk Attitude mapping' (Figure 4) show that firms delivering categorical and/or negative H&S opinions (sectors II and III) suffered the more serious accidents and are the less safety 'active'. Conversely, firms which gave more balanced and/or positive judgments (sectors I and IV) suffered less serious or minor accidents and are more 'active'.
Q
Q
ev "H&S REGULATIONS ARE NOT MOSTLY A CONSTRAINT" 45%
(3.2)
L.)
"TECHNICAL SAFETY NOT VERY EFFECTIVE" Z1%
(z.4)
"ACCORDANCE WITH THEM IS GOOD FOR FIRM'S IMAGE" 50% (3.4) "BUT MORE AND MORE DIFFICULT TO HONOUR" 35% (3.3)
"MOSTLY A MEAN TO BE IN ACCORDANCE WITH H&S REGULATIONS" 38% (z.z)
"GENERATES LOSS OF TIME" 23% (2.3) FACTOR 1
-...,,,
v
"TECHNICAL AND PERSONAL PROTECTION ARE COMPLEMENTARY 61% (1.2__) "PERSONAL PROTECTIVE EQUIPMENT IS NOT ONLY A REGULATORY REQUIREMENT" 54% (1.4) "BUT IT IS AN IMPEDIMENT TO WORK" 31% (1.1) 80 respondents eigenvalues : F1=0,1385 F2=0,1t67 cumulative variance=l 5,46% PHI2=1,65
"TO HONOUR H&S REGULATIONS IS NOT NECESSARY FOR FIRM IMAGE" 49% (3.4) "NOT NECESSARYTO DEFINE RULES AND MAKE THEM APPLIED" 13% (4.6) "USELESS TO TRY TO REDUCETHE UNPREDICTABLE" 18% (4.1) "TECHNICAL AND PERSONAL PROTECTION ARE NOT COMPLEMENTARY" 29%
(i .z)
O
Figure 4: Summarized factorial chart of Risk Attitudes

CONCLUSION

This paper gives some of the most salient information produced by the analysis of the data collected during the survey that we devoted to H&S in SMEs*. A practical conclusion of the results presented in this paper is that, since safety practices and attitudes in small firms appear to depend more strongly on characteristics such as hierarchical levels, the existence of an H&S committee, economic achievement, etc., than on more technical ones (given by sector of activity), we suggest that guidance to encourage the safety "passive" firms to move towards better Health and Safety involvement should include the promotion of managerial and economic expertise, rather than focusing exclusively upon the technical side of H&S. Using the existing data file of this H&S survey in SMEs, new developments are currently underway. The objective of this further study is to compute an experimental scoring application (making use of discrimination methods, with further extension of the data file for consolidation of the results achieved) that can be applied to statistically predict H&S practices and also, if proven feasible and realistic, Occupational Accident outcomes. We assume that such a conception of an H&S scoring procedure will contribute to the advancement of decisional research and further applications in the field of Safety Management in the context of SMEs.

* Much other valuable information is not presented here, in particular answers to more open questions. For complete results, see Favaro et al. (1997a).

REFERENCES
Benzécri, J.P. (1992). Correspondence Analysis Handbook, Marcel Dekker, New York, U.S.A.
Cibois, P. (1984). L'Analyse des Données en Sociologie, PUF, Paris, France.
CNAM (Caisse Nationale d'Assurance Maladie) (1994). Rapport d'activité, CNAM, Paris, France.
Cox, T.F. and Cox, M.A.A. (1994). Multidimensional Scaling, Chapman and Hall, London, U.K.
Eakin, J.M. (1988). Occupational Health and Safety in Small Businesses, University of Calgary, Department of Community Health Sciences, Canada.
Favaro, M. (1992). Safety management through the development and the implementation of safety indicators. Safety and Reliability '92, Elsevier, London, 237-248 (Proceedings of the European Safety and Reliability Conference '92).
Favaro, M. (1996). La prévention dans les PME, I - "Situation", INRS, Notes Scientifiques et Techniques n°134.
Favaro, M., Davillerd, C. and François, M. (to be issued, 1997a). La prévention dans les PME, II - "Enquête", INRS, Notes Scientifiques et Techniques.
Favaro, M. (to be issued, 1997b). La prévention dans les PME, III - "Méthodologie", INRS, Notes Scientifiques et Techniques.
Morineau, A. (1992). L'analyse de données et les tests de cohérence dans les données d'enquête, in ASU: La qualité de l'information dans les enquêtes, Dunod, Paris, 427-440.
Morineau, A. (1994). Le "Thémascope" ou analyse structurelle des données d'enquête, in Grangé, D. and Lebart, L.: Traitements statistiques des enquêtes, Dunod, Paris, 135-159.
Tenenhaus, M. and Young, W. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika 50:1, 91-119.
A5" H u m a n Factors
THE INTEGRATION OF HUMAN FACTORS IN DEPENDABILITY: A VITAL ASPECT FOR RISK MANAGEMENT
Invited Lecture
International Conference on Safety and Reliability
June 17-20, 1997 - Lisbon, Portugal
ELIE FADIER
Institut National de Recherche et de Sécurité
Service Ergonomie et Psychologie Industrielle
F-54501 Vandoeuvre Cedex
ABSTRACT

In addition to defining the concept of what the activities of a human operator include and how they evolve as a function of time, this paper outlines a number of elements that will help us to better understand how the human operator functions: some of their characteristics, the operational "model" that they use, and their contributions and weaknesses, so that these factors can be integrated during the design of a given working situation. Following this, we discuss the impact of automation on human activity, and then finish by looking at some of the steps currently being taken to integrate human factors in industrial situations.

KEYWORDS: Automation, Dependability, Human activities, Human factors, Human reliability methods, MAFERGO
INTRODUCTION
Despite the increasing automation of industrial production and/or exploitation systems - one of the consequences of which is a reduction in the number of human operators - the role and place of human activity remain very important. These are reflected either in the performance of tasks that the technical systems cannot perform, or in the monitoring, control, regulation and recovery from random events that arise in the technical system. Mastery of the dependability (RAMS) of a system depends not only on a knowledge of how the system functions in a normal state, but also of how it is organised and configured, and of the identification, analysis and treatment of abnormal events. Control over human factors likewise requires knowledge of how human operators react, the identification of their requirements and characteristics, as well as an analysis and treatment of their weaknesses and errors. The objective of this paper is to outline a number of elements that will help us to better understand how the human operator functions: some of their characteristics, the operational "model" that they use, and their contributions and weaknesses, so that these factors can be integrated during the design of a given working situation. Following this, we will discuss the impact of automation on human activity, and then finish by looking at some of the steps currently being taken to integrate human factors in industrial situations.

EVOLUTION OF THE CONCEPT OF THE OPERATOR
The concept of what the activities of a human operator include has evolved in the same manner, and at the same time, as the technological evolution of industrial production and/or exploitation systems. At the beginning of the industrial era, production systems relied very heavily, and almost without exception, on the physical activity of their human operators, who executed repetitive tasks without considering the meaning of their actions. As technology - and its reliability - advanced, automation began, little by little, to take over the tasks of transformation and production of different products. This in turn caused humans to assume roles with a more dominant cognitive (mental) component - roles that require more reflection and that are more motivating. However, even though this observation is generally true, there are nevertheless still some situations where physical activity dominates daily work. It is useful to define three different, but complementary, "models" of a human operator:
The operator as an executor
Here, the operator is assumed to execute a pre-defined task "as faithfully as possible", which in turn supposes that the defined task can be performed and executed exactly as defined. The operator's activity is defined by the means that are to be used, and includes a relatively detailed mode of operation. An inheritance of Taylorism, this idea of the operator as executor implies that the average operator is considered as a simple component of a system, to whom the order to execute a particular task is given. Any failure on the part of the operator is immediately seen as having an effect on the performance of the system. It is this "model" that formed the basis of the quantitative approach to human reliability (Swain 1964, 1974).

The operator as an intelligent actor
In this representation, the activity of the operator has a predominantly mental component. The task, necessarily more complex, is defined as an obligation to obtain certain results, often with a particular goal to be attained. Even if the means used to complete the task are occasionally defined, more emphasis is placed on the objective to be satisfied. For this reason, the operator interacts with other elements of the system to accomplish his task. In this context, there is often a difference between the defined task and the activity carried out. This difference can be explained by the contribution that the operator makes to adapt the task to different working situations (random or unforeseen events in the technical system, external solicitations from the immediate environment or organisation). Here the operator is not a simple executor, but rather an intelligent actor (participant) who must carry out cognitive tasks (reasoning, representation, memorisation, diagnosis, information processing, etc.). The approach best adapted to the understanding and analysis of this role is the qualitative and explicative approach to human reliability - an approach that has been strongly developed by psychologists and ergonomists since the 1960s (Faverge 1966, Bainbridge 1982, Leplat 1985, Reason 1990, Hoc 1996).

The operator as a collective agent
This concept of the operator appears in instances where the work station disappears. The operator, who is never isolated, is integrated into a team with a given mission to carry out. The tasks are shared out between the team members, and performed with only a certain degree of autonomy, since they are carried out in co-operation with several partners. In this context, the personality of the individual plays a determining role in the success of the mission. Co-operation, communication, as well as the organisation of the work to be performed, become pertinent indicators of activity and of the success of the mission. Furthermore, if an emphasis is placed on the influence of the task organisation on the psyche of the individual, Human-Human relations and the effects of task sharing between operators are no longer elements that can be neglected in this approach. This is the psychological approach of the human in the work place (de Terssac et al. 1990). Finally, it also appears important to point out that the role and influence of the "decision maker", and of management and the supervisory staff, must also be taken into consideration, since the fact that they have an impact on the role, place, and performance of human beings is undeniable.

SOME SPECIFICS OF HUMAN FUNCTIONING (Fadier & Mazeau 1994)

Defined tasks and real activities
The organisation of a job (defined by its designers, management, environment, etc.) defines the constraints and formal rules that must be respected in order to carry out a given operation (tools to be used and conditions for their use, tools that are forbidden, obligatory protective equipment, operating modes, order of operations, temporal constraints for execution, documents to be filled out, etc.). The reliability of an operation would then depend on the correct execution of the defined task and the respect of rigorously defined instructions. This hypothesis does not hold together, for three reasons:
- It is occasionally difficult to define the conditions of a task with precision beforehand. The degree of precision required will vary from operator to operator, and as a function of the tasks to be completed. Implied tasks are inevitable, and constrain the operators to (Kern et al. 1989): complete the existing procedures to include the implicit parts; redefine "others" in order to adapt them to the realities of a given situation; define non-existent "procedures".
- The human operator cannot scrupulously execute a given task without understanding exactly what is asked of him. In certain cases he will interpret the instructions, thereby introducing the possibility of errors, especially when specific aspects of the behaviour of the operator have not been taken into account: e.g. becoming tired, reasoning on the basis of a mental representation of reality, etc.
- Maintenance tasks (and not only corrective maintenance) are characterised by a high degree of interaction between diagnostic activities and activities related to task execution: given what is seen by the operator, are we still within the context of the previously defined procedure? If not, is the difference significant? Should the supervisory staff be called? These questions are not always consciously asked (if they were, the difficulties would be resolved more often than not), but they do stimulate the taking of certain actions.
Work-related situations, in both maintenance and production, are characterised by:
- the presence of technical malfunctions; these are particularly damaging when they strike diagnostic systems, because they can render a solution impossible,
- process instability; machines suffering from a total malfunction are often easier to manage than machines that break down only intermittently and/or partially,
- conditions of execution that are occasionally difficult (hot environments, uncomfortable postures that do not help reflection, night-time interventions, ...),
- situations with co-activity (with those operators in charge of production who take over or who are waiting, co-operation between electricians and mechanics, ...),
- situations where maintenance and production co-operate.
Importance of the mental representation

The analysis of human errors (Leplat 1985, Rasmussen 1986, de Keyser 1989, Reason 1990), in both maintenance and production/exploitation, reveals that the behaviour of a given operator obeys some sort of logic in each of the cases observed. This logic turns out, after the fact, to be ill-adapted to the situation, since it takes only one, or a few, of the elements of the situation into account (context, operational mode, execution time, state of resources, ...). Current understanding of human functioning in the work place, and particularly of the ways in which we treat information, has demonstrated the importance of the mental representation (of the installation, of the task objectives, of the role of each person, ...) of the system that operators construct to help them decide on a course of action. Available information might have little significance if it contradicts the mental representation of the state of the system that the operator has constructed. This operational specificity, which pushes the operator to retain only that information from a given situation that confirms the pre-established hypotheses and to discard any information that contradicts them, is referred to as "confirmation bias". In each of the situations with which the operator is confronted, it is therefore important to ensure (through an analysis of the task) that he does not run the risk of "reasoning correctly for the wrong reasons" - reasons that he will not question again. The predominance of the recent past, of the more critical incidents, and ambiguous information are traps which should be avoided (Mazeau 1993).
Role of the collective work (group)

The collective (or group-oriented) nature of work in general (Leplat 1988, Six et al. 1993) is accentuated by technological evolutions. Here, it is indispensable to mark the difference between a group of operators - an operational "collective" - (a group of individuals pursuing a common goal), and a work group - or working collective - (a group of people having a common characteristic). Production operators, maintenance operators, and system methods engineers constitute working collectives. The engineer in charge of a service, the system operator, system methods engineers, the maintenance operator and the technician working for an external firm, who all meet in the control room in order to understand the cause of a malfunction and to decide together what to do, constitute a "collective operator". The way these informal groups function is much better understood today than it was previously. Certain conditions are necessary if this collective is to be efficient: the perception of a common goal and a shared interest in obtaining it, and thus the technical and organisational means needed to construct the "Common Operative Reference" (de Terssac 1990) needed to stimulate efficient communication (Mazeau 1993). An analysis of the co-operation between the operators of a given team often reveals difficulties of communication (de Keyser 1985) which are sometimes due to managerial structure (hierarchical and salary differences, ...), or to task organisation (difficulties for one operator to accept the competition from another in the same job in situations where different tasks are performed simultaneously, cost of any stoppages to the firm or to maintenance, ...). However, the most frequent source of difficulties seems to be associated with logical and representational differences stemming from divergences in the "implicit functional analysis" of the installation
(principal functions, priorities, ...). These can generate difficulties of communication that lead to contentious situations. Current computer-based tools (C.A.M.M.: Computer Aided Maintenance Management) can help to resolve these problems by facilitating the diffusion of information, provided that they respect the language, logic and priorities of each group.
Overlapping of physical, mental and psychological planes

The growing importance of aspects related to the treatment of information should not cause one to forget the other characteristics of the way humans function in the work place: operational efforts, postures that can be difficult to hold, fatigue due to physical labour, changes noted during night shifts, etc. These can all influence the ability of human beings to face up to the situations with which they are confronted. Previously taught habits, and those learned from experience, can modify one's perception of a situation and trigger a response that is poorly adapted to any modifications that may have been made to the system. Finally, psychological aspects including stress, fear, and boredom should not be forgotten, since they also influence the behaviour of humans in the work place.
The specificity of activity in automated systems The ever-increasing role assumed by automated and computerised system in certain operations in manufacturing processes has thus caused major changes in the nature of the work performed by human operators, as well as in the nature of related risks. a - Changes in the nature of risks Technological changes contribute to improvements in the direct safety of human operators to the extent that they are farther removed (and protected) from the exact site where the transformation of the product takes place. However, even if the system operator is relatively safe, an improper action in the control room can have serious consequences on the rest of the site. The safety of operators present at the site often depends directly on those actions undertaken by the operator in the control room. It is therefore necessary to design information/control systems that can help to avoid such actions, their causes and/or their consequences. It is therefore possible to say that the safety of a complex, computerised installation (and the quantity and quality of what is produced) is highly dependent on an optimal co-operation between different teams, and between humans and the computerised command and control system. b - Changes in the nature of work performed These changes can be of two types (Neboit 1991) : first, the person controlling the system is informed and acts oll the process in an indirect manner (via screen/keyboard/synoptic/control interfaces), and second, the controller generally intervenes only when the operation of the system is in a deteriorated state. Use of interfaces the operator intervenes in the process via interfaces that present complex, symbolic information. This has been made necessary due to the fact that the processes themselves have become extremely complex, and because tasks at different levels of regulation and control have been integrated into a generalised monitoring function. Because the intervention occurs in this manner, the operators' task requires mental (or cognitive) activities including the gathering/treatment of information, memorisation, representation, anticipation, and diagnosis. The nature of these activities and reasoning strategies should thus be taken into account when designing such interfaces in order to help the person who controls the process. Interventions in deteriorated situations: the operator norlnally intervenes less often during the normal operating regime, or at least where normal operating procedure is observed and can followed with a computer. In this event, the role of the operator is essentially reduced to a monitoring task. On the other hand, he intervenes more often during phases where normal operation is not observed in order to find the malfunction, diagnose its cause(s), anticipate the evolution of the process, and, when necessary, switch over to manual operation. This places the operator in a position where he must recover from incidents (Faverge 1980) with a temporal constraint - this being a potential source of information overload, stress and risk. C E R T A I N A S P E C T S OF A U T O M A T E D SYSTEMS
The massive introduction of new technologies has modified the nature of production and exploitation systems in a significant manner. The principal characteristics of these systems can be summarised as follows:
Increasing complexity

This complexity stems from a significant addition of technical materials for production, control, detection, monitoring, and diagnosis.

Mixed technologies

The excessive cost of adding new technologies often pushes those making the (financial) decisions to make changes step by step. For this reason, materials from different generations of more or less compatible technologies, of different forms and natures, are found side by side. Add to this coexistence of technologies a competitive industrial pressure that forces firms to widen their product range, and one can see that this will have severe consequences on the operators' activities.

A collective dimension to work

It has been noted that the evolution of work is first of all characterised by the disappearance of the work station, and by the increasing "discretionality" of the operator's tasks. These tasks are often defined in terms of goals and principles of execution that are more or less explicit. In this event, it is better to talk about a specific mission to be accomplished by a team than about tasks to be carried out (de Terssac 1992). A more or less explicit definition of the execution conditions gives the operator a greater or lesser degree of autonomy, and emphasises the importance of exchanges, communication and the co-ordination of tasks between the different individuals who have the same overall objective to attain.

An increasing role of human-machine interfaces

Human operators are generally removed from the site of production/exploitation in automated systems. For this reason, they generally intervene in the process by means of different types of control interfaces (screens and/or synoptics), and are called upon to manage transformed, abstract visual or audio information in different forms (analogue, symbolic, graphic), which might or might not be coded.

An increasing role for information

In order to better manage the complexity of a system, its designer provides abundant documentation on the procedures. This implies different forms of communication (written vs. oral, explicit vs. implicit). The comprehension and execution of instructions and procedures currently pose significant safety problems, and more and more emphasis is being placed on the design and maintenance of procedures (ISdF 1991, 1994). The following table summarises the impact of this massive introduction of technology on the system in general, on safety, and on the activities of the operator, and mentions certain new requirements in the area of risk management.

SOME PERSPECTIVES
What are the possible approaches that will allow one to better consider human factors in complex automated systems?
Given that humans are a specific element in a socio-technical system, whoever - or whatever - interacts with them interacts not only with their physiological and psychological components, but also with their psychic and affective components. For this reason the only practical approach is the adaptation of the system to human characteristics and requirements. This adaptation is realised using the ergonomics of working situations, and could touch upon the following aspects:
Action on the physical environment (noise, lighting, vibrations, ventilation, etc.), through classical ergonomic knowledge whereby one attempts to adapt the physical environment to the physiological characteristics of human beings while taking their activities into consideration (Cazamian 1989).

Action on Human-Machine Interfaces, from simple interfaces (command dials, colours and buttons) up to the writing of operating procedures and modes. Any form of aid to the operators can be considered as an interface in this context. This adaptation must take the physiological and cognitive characteristics of humans into consideration, along with their activities.
TABLE 1: IMPACT OF AUTOMATION ON PRODUCTION SYSTEMS

INCREASING COMPLEXITY
- Impact on the system: removal of humans from the site of transformation of the product.
- Impact on activities: increase of activity with a predominant mental component (monitoring and/or supervision, diagnosis, memorisation, representation, anticipation); intervention in deteriorated situations; compensation by humans, recovery activities.
- Impact on safety: disappearance of classical risks associated with production activities; appearance of new working situations that present a risk.
- New requirements in risk management: analysis of malfunctions and failures; proof of critical working situations.

MIXED TECHNOLOGIES
- Impact on the system: instability of production systems; variety of products; co-existence of several generations of material.
- Impact on activities: transfer of human knowledge to tools; increase of physical work load in some cases.
- Impact on safety: rigidity of automated sequences; shift change-over; breakdown of structural control.
- New requirements in risk management: new forms of aids in the work place to compensate for lack of skills; specific training.

COLLECTIVE NATURE OF WORK
- Impact on the system: disappearance of the idea of the fixed work station; permanence of activity.
- Impact on activities: co-activity and co-operation; different forms of communication (written - oral, implicit - explicit).
- Impact on safety: sharing and definition of tasks between humans, teams, and human-machine; structural controls.
- New requirements in risk management: task/activity definitions; analysis of tasks/work; consideration of organisational aspects of work; training (common operative referential).

INCREASING ROLE OF HUMAN-MACHINE INTERFACES
- Impact on the system: transformed/abstract information (visual/audio; analogue/symbolic/graphic; coded; screens/synoptics); increasing use of control interfaces.
- Impact on activities: solicitation of different mental functions, and importance of reasoning-type activities; same interface for several users.
- Impact on safety: mental representation of technical systems; vigilance.
- New requirements in risk management: adequate representation of necessary information; definition of needs.

INCREASING ROLE OF INFORMATION
- Impact on the system: abundance of documentation and procedures.
- Impact on activities: solicitation of different mental functions and importance of reasoning-type activities; increased work load.
- Impact on safety: understanding of instructions and procedures; shift change-over.
- New requirements in risk management: ergonomics of procedures, instructions and reasoning aids; development of a functional schedule of conditions; integration of humans in the production cycle.
Action on task definition - a task upstream from the sharing out of jobs between humans and machines. Up to what point should one automate? What is the role of the human in the system? System design is generally done using functional analysis techniques. This is essentially a formal framework for co-operation, or system of
rules, between the different parties contributing to the design. In fact, when a functional analysis is performed, attention is usually focused on technical functions, with the operator only being taken into account summarily (ergonomics of the wrist, colour of the interface). He is generally given those functions that the technical system cannot perform (Garrigou 1992). In so far as this last point is concerned, several studies (Fadier et al. 1991, Ligeron et al. 1991, Chevenier et al. 1991, Gauthier et al. 1994) have shown that the functional analysis methods that are currently the most widely used cannot account for the activities of future users (production activities, maintenance activities, ...), nor for areas of potential use (in particular, conditions of operation in deteriorated states), whereas ergonomics has underlined, by means of the analysis of real activities, the fact that work is composed of adaptations and adjustments, and that optimal operating conditions are very rarely seen. This difference (and sometimes contradiction) between the (theoretically) expected operation and real operation (including the management of random events) is considered one of the most important sources of "risk taking", since the operator/user needs to rectify a situation that was not foreseen during the design phase.
Action on task organisation. This must take into consideration not only the collective dimension of the tasks, but also the adaptation of the activities to new types of work and the new characteristics of automated systems. Communication, co-operation and co-ordination are three very important factors that all managers should integrate before proposing changes in the organisation of tasks in the work place.

Action on the training of the personnel. This should complement the above actions. Continuing education, learning of new techniques, and training in the management of the more common random events are all possibilities that could help the operators become more capable in specific operational conditions.

What methods should one use? In the work presented by the Institut de Sûreté de Fonctionnement, "State of the Art in the Area of Human Reliability" (Fadier et al. 1994), the majority of the methods used for human reliability and ergonomics are described, and the reader is referred to this text for more details. Nevertheless, it should be remembered that several different methods are used, depending on the sector of activity and the specifics of the activities carried out (see Table 2). Furthermore, for approximately ten years now, "benchmark exercise" studies have been carried out to help potential users in their choice of methods (Gerdes 1993, Humphreys et al. 1988, Kirwan 1988, 1992, Kirwan et al. 1995, Meister 1986, Watters 1988). The results of these studies are rather convincing. However, in order to satisfy the requirements of SMI/SMEs, and to improve the operational reliability of automated systems, the INRS has developed a method entitled "Méthode pour l'Analyse de la Fiabilité et l'ERGonomie Opérationnelle" (MAFERGO: Method for the Analysis of Operational Reliability and Ergonomics). This method (Fadier et al. 1991, Fadier et al. 1996) integrates both reliability and ergonomic approaches, and has been proposed as a method to respond to the requests of external firms. It includes four steps, which are very briefly outlined below:
- A structural-functional analysis aimed at understanding the normal operation of an existing system through the use of descriptive methods (block diagrams, flow charts), the description of the foreseen tasks, and a preliminary analysis of real activities.
- An operational analysis aimed at describing the modes of exploitation and operation of the system, as well as the availability of the technical components, by comparing them with the analysis of the spatial-temporal planning of the tasks (requirements, constraints).
- The identification of malfunctions, done using an event tree and AMDE (FMEA), together with an analysis of the consequences of the malfunctions on the operators' activities.
- An analysis of the causes of the malfunctions using a fault tree (ADD), which offers the possibility of defining the degree of fragility of the system and the scenarios of events that impose constraints on the operators (a toy evaluation of this kind is sketched after this list).
- Finally, proposals for improvements, focused on the improvement of reliability and availability, as well as on the improvement of the human-task pairing.
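As a concrete illustration of the fault-tree (ADD) step, the sketch below evaluates the probability of a top event by exhaustive enumeration of basic-event states. The gate structure, event names and probabilities are all invented for the example; real analyses work from the actual system model and use minimal cut sets or binary decision diagrams.

```python
# Toy fault-tree evaluation of the kind used in the MAFERGO malfunction
# analysis. Tree structure and basic-event probabilities are invented.
from itertools import product

# Basic events with assumed yearly failure probabilities.
p = {"sensor_drift": 0.05, "plc_fault": 0.01,
     "operator_slip": 0.08, "valve_stuck": 0.02}

# Top event: loss of the protective function.
# TOP = (sensor_drift OR plc_fault) AND (operator_slip OR valve_stuck)
def top_event(e):
    return (e["sensor_drift"] or e["plc_fault"]) and \
           (e["operator_slip"] or e["valve_stuck"])

# Exact probability by enumerating the 2^4 basic-event states
# (feasible for small trees only).
prob = 0.0
for states in product([False, True], repeat=len(p)):
    e = dict(zip(p, states))
    if top_event(e):
        weight = 1.0
        for name, occurred in e.items():
            weight *= p[name] if occurred else 1.0 - p[name]
        prob += weight

print(f"P(top event) = {prob:.5f}")
```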
TABLE 2: CATEGORIES OF METHODS USED IN HUMAN RELIABILITY

Methods of evaluation of human reliability
- Characteristics: these describe and quantify human performances in a given task in order to evaluate and predict human reliability for this type of task.
- Examples: THERP (Swain et al. 1983); TESEO (Bello et al. 1980); APJ (Seaver et al. 1982).

Methods of qualitative analysis of the human components of reliability
- Characteristics: these describe and model human variability in the execution of a task or, more generally, help to understand the psychological process that intervenes in the appearance of an error.
- Examples: Reason approach (Reason 1979); Rasmussen model (Rasmussen et al. 1987); OAET (Brown et al. 1982).

Methods for the management of human reliability
- Characteristics: these integrate the description, quantification and psychological analysis of errors in the wider context of a global approach, the objective of which is the prevention of the causes of errors and the design of systems better adapted to human capabilities and requirements.
- Examples: SHERPA (Embrey 1986); HCR (Hannaman et al. 1984); HEART (Williams 1985).

Methods using feedback and data bases
- Characteristics: these rely on an a posteriori evaluation of human reliability in reference tasks in order to infer human reliability in a similar task. Qualitative data offer the possibility of predicting the variability of behaviour for a given task, which in turn allows one to direct research and design approaches.
- Examples: A Simulator-Based Model (Beare et al. 1984); EPFH (EDF 1986); MAFERGO (Fadier et al. 1991).
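To give a feel for the quantification family in Table 2, here is a HEART-style calculation (Williams 1985): a nominal human error probability for a generic task type is multiplied, for each relevant error-producing condition, by a factor interpolated from that condition's maximum effect. All numeric values below are textbook-style illustrations, not values taken from the studies cited here.

```python
# Illustrative HEART-style human error probability (HEP) assessment.
nominal_hep = 0.003            # generic task type, nominal unreliability

# (maximum multiplier, assessed proportion of affect) per error-producing
# condition judged relevant to the task; both values are assumed here.
epcs = [(17.0, 0.4),           # e.g. shortage of time
        (10.0, 0.2)]           # e.g. poor feedback from the system

hep = nominal_hep
for max_effect, proportion in epcs:
    hep *= (max_effect - 1.0) * proportion + 1.0

print(f"assessed HEP = {hep:.4f}")   # 0.003 * 7.4 * 2.8 = 0.0622
```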
CONCLUSION

Let us recall that work is composed of a permanent cycle of adaptations and adjustments, and that in order to manage this adaptation the operator needs a degree of autonomy. The difficulty for designers lies in the need to maintain the level of performance of systems while allowing the operator sufficient room to manoeuvre in order to be able to complete his mission. The human factor should not simply be reduced to human errors. It is true that human reliability is not infallible (Faverge 1970, Mazeau 1993), and that any human intervention will have both positive and negative aspects. Fixating only on the negative aspects would necessarily lead to neglecting a significant portion of reality, and would cause us to opt for biased choices. How do we avoid human errors without inhibiting the adaptive function of human activities? This is the question that specialists are currently looking to answer.

In fact, error should be seen and analysed not only from the point of view of its consequences on the system, nor as a deviation from a prescribed procedure, but also as the symptom of a poor operator-task pairing. A reliable system is one that can be recovered, and is thus capable of managing any malfunctions, which in turn implies that the latter be well understood.
REFERENCES
BAINBRIDGE, L. (1982). Irony of automation. 1st IFAC Congress "Analysis, design and evaluation of man-machine systems", Baden-Baden.
BEARE, A.N., DORRIS, R.E., BOWELL, C.E., CROWE, D.S. & KOZINSKY, E.J. (1984). NUREG/CR-3309, U.S. Nuclear Regulatory Commission.
BELLO, G.C. & COLOMBANI, V. (1980). The human factors in risk analyses of process plants. The control room operator model, "TESEO". Reliability Engineering, 1, 3-14.
BROWN, R.G., VONHERRMAN, J.L. & QUILLIAN, J.F. (1982). Operator action event trees for the Zion 1 pressurized water reactor. NUREG/CR-2888.
CAZAMIAN, P. (1989). Traité d'ergonomie. Octares Editions.
CHEVENIER, C., MERCIER, C. & RAGAZZO, D. (1991). Bilan d'une adaptation de SADT dans un contexte contrôle-commande de production. Journées Internationales de Génie Logiciel de Toulouse.
DE KEYSER, V. (1985). Les communications dans les systèmes automatisés: champs cognitifs et supports d'information chez les opérateurs. In: Noizet, Bouilanger, Bresson et al. (Eds.): La communication. Presses Universitaires de France, Paris.
DE KEYSER, V. (1989). L'erreur humaine. La Recherche, 216, 1444-1455.
DE TERSSAC, G. (1992). Autonomie dans le travail. Octares Editions.
DE TERSSAC, G. & CHABAUD, C. (1990). Référentiel opératif commun et fiabilité. In: J. Leplat & G. de Terssac (Eds.): Les facteurs humains de la fiabilité dans les systèmes complexes, chapitre 4, 112-139. Octares Editions.
EDF (1986). Evaluation probabiliste de la fiabilité humaine et essais sur simulateur. Mosneron-Dupin, Grosdeva, Saloiu & Berger. In: Colloque INRS "Les facteurs humains de la fiabilité", avril 1991.
EMBREY, D.E. (1986). A systematic approach for assessing and reducing human error in process plants. Human Reliability Associates Ltd, Delton.
FADIER, E. et al. (1994). Etat de l'art dans le domaine de la fiabilité humaine. Octares Editions.
FADIER, E. (1990). Fiabilité humaine: méthode d'analyse et domaine d'application. In: J. Leplat & G. de Terssac (Eds.): Les facteurs humains de la fiabilité dans les systèmes complexes, chapitre 2. Octares Editions.
FADIER, E. & MAZEAU, M. (1994). Tâches de maintenance et rôle de l'homme. Journées AFIM/ISdF "Apport de la maintenabilité, de la maintenance et des facteurs humains à la sûreté de fonctionnement", Paris, 22-23 novembre 1994.
FADIER, E. & NEBOIT, M. (1996). Proposition d'une méthode d'analyse de la fiabilité opérationnelle intégrant l'analyse ergonomique. Communication faite au 10e Colloque de fiabilité et maintenabilité, Saint-Malo (France), 1er-3 octobre 1996. Editions SEE.
FADIER, E., POYET, C. & NEBOIT, M. (1991). Advantages of an integrated approach of reliability and ergonomical analysis: application to a hybrid system of sequential production. In: Y. Queinnec & F. Daniellou (Eds.): Designing for Everyone. Taylor & Francis.
FAVERGE, J.M. (1980). Le travail en tant qu'activité de récupération. Bulletin de Psychologie, 33:344, 203-206.
FAVERGE, J.M. (1966). L'ergonomie des processus industriels. Editions de l'Institut de Sociologie de l'Université Libre de Bruxelles.
FAVERGE, J.M. (1970). L'homme, agent de fiabilité et d'infiabilité. Ergonomics, 13:3, 301-327.
GARRIGOU, A. (1992). Les apports des confrontations d'orientations socio-cognitives au sein de processus de conception participatifs: le rôle de l'ergonomie. CNAM, Thèse de Doctorat d'Ergonomie.
GAUTHIER, F. & CHARRON, F. (1994). Conception en ingénierie simultanée et la santé et la sécurité du travail. 1er Congrès pluridisciplinaire "Qualité et sûreté de fonctionnement", Compiègne (France), 17-18 novembre 1994.
GERDES, V. (1993). HRA Techniques. A Selection Matrix. In: Proceedings SRE Symposium "Reliability a competitive edge", October 4-6, 1993.
HANNAMAN, G.W., SPURGIN, A.J. & LUKIC, Y.D. (1984). Human cognitive reliability model for PRA analysis. Electric Power Research Institute, Palo Alto.
HOC, J.M. (1996). Supervision et contrôle de processus. La cognition en situation dynamique. Presses Universitaires de Grenoble.
HUMPHREYS, P., EMBREY, D.E., KIRWAN, B. & REA, K. (1988). Human reliability assessors guide. Safety and Reliability Directorate, United Kingdom Atomic Energy Authority, Wigshaw Lane.
ISdF (1991). Conception et maintenance des procédures. Projet 11/92.
ISdF (1994). Aide à la conception, à la validation et à la maintenance des procédures. Projet 7/94.
KERN, H. & SCHUMANN, M. (1989). La fin de la division du travail? La rationalisation dans la production industrielle. Editions Masson/Sciences de l'Homme, Paris, 417 p.
KIRWAN, B. (1988). A comparative evaluation of five human factors assessment techniques. In: B.A. Sayers (Ed.): Human factors and decision making: their influence on safety and reliability. Elsevier Applied Science, 87-109.
KIRWAN, B. (1992). Human error identification in human reliability assessment. Part 1: Overview of approaches. Applied Ergonomics, 23:5, 299-318.
KIRWAN, B. (1992). Human error identification in human reliability assessment. Part 2: Detailed comparison of techniques. Applied Ergonomics, 23:6, 371-381.
KIRWAN, B., KENNEDY, R. & TAYLOR-ADAMS, S. (1995). A validation study of three human reliability quantification techniques: THERP, HEART, and JHEDI. In: Watson & Cottam (Eds.): Proceedings ESREL '95. The Chameleon Press Ltd.
LEPLAT, J. (1985). Erreur humaine, fiabilité humaine dans le travail. Armand Colin, Paris.
LEPLAT, J. (1988). Organisation of activity in collective tasks. Workshop "Distributed decision making", Bad Hamburg, 25 p.
LIGERON, J.C., SALAUN, Y. & RINGLER, J. (1991). L'analyse fonctionnelle en matière de sûreté de fonctionnement. Projet ISdF 1/1991.
MAZEAU, M. (1993). L'homme, agent de fiabilité faillible. Performances Humaines & Techniques, dossier "Fiabilité et erreurs humaines", 66.
MEISTER, D. (1986). Human Factors Testing and Evaluation. Elsevier, 424 p.
NEBOIT, M. (1991). Ergonomie et fiabilité humaine dans la conduite des systèmes complexes. Revue Générale d'Electricité, 5, 36-40.
RASMUSSEN, J., DUNCAN, K. & LEPLAT, J. (1987). New technology and human error. John Wiley & Sons, Chichester.
RASMUSSEN, J. (1986). Information processing and human-machine interaction: an approach to cognitive engineering. A.P. Sage (Ed.), North-Holland series in system science and engineering 12.
REASON, J. (1979). Actions not as planned: the price of automatization. In: G. Underwood & R. Stevens (Eds.): Aspects of consciousness. Volume 1: Psychological issues. Academic Press, London.
REASON, J. (1990). Human error. Department of Psychology, University of Manchester.
SEAVER, D.A., STILLWELL, W.G. & SCHWARTZ, J.P. (1982). Expert estimation of human error probabilities in nuclear power plant operations: a review of probability assessment and scaling. NUREG/CR-2255. Decision Science Consortium Inc., Falls Church.
SIX, F. & VAXEVANOGLOU, X. (1989). Les aspects collectifs du travail. Octares Editions.
SWAIN, A.D. (1964). THERP. Sandia Laboratories, Albuquerque. Report SC.R 64 1338.
SWAIN, A.D. (1974). Human factors associated with prescribed action links. Sandia Laboratories, Albuquerque. Report SC.R 74 0051, 35 p.
SWAIN, A.D. & GUTTMANN, H.E. (1983). Handbook of human reliability analysis with emphasis on nuclear power plant applications. NUREG/CR-1278, U.S. Nuclear Regulatory Commission.
WATTERS, T. (1988). Human factors reliability benchmark exercise. In: B.A. Sayers (Ed.): Human factors and decision making: their influence on safety and reliability. Elsevier Applied Science, 110-125.
HUMAN ERRORS IN MAINTENANCE ACTIONS
AN EXPERIENCE-BASED STUDY

Pekka Pyy 1, Kari Laakso 1, Lasse Reiman 2
1 VTT Automation, PL 1301, 02044 VTT, FINLAND, [email protected], [email protected]
2 STUK, PL 14, 00881 Helsinki, FINLAND, [email protected]
ABSTRACT
In this paper, a study of human errors related to nuclear power plant (NPP) maintenance is presented. About 4400 maintenance history fault information records and licensee event reports of an NPP from the years 1992-94 were analyzed. Considerable time was spent verifying the results with the maintenance personnel of the NPP. The paper discusses the scope and objectives of the study, the method used and the results obtained. The results are presented separately for single human errors, common cause failures (CCFs) and single human errors leading to multiple effects due to shared equipment. Finally, the validity of the results is discussed and conclusions are drawn, with some recommendations.
KEYWORDS
Maintenance, human errors, human factors, human reliability, nuclear power plants, statistics, common cause failures, fault history.
INTRODUCTION

In the research concerning human behaviour and human error possibilities, main attention has traditionally been focused upon the control room crew performance in disturbance and accident conditions. The control room operators have an essential role in disturbance management. On the other hand, maintenance errors may also have an impact on the severity of a disturbance by disabling safety related equipment. The chances of the operators to manage the situation are worsened if latent equipment failures due to imperfect test and maintenance activities exist. Especially common cause failures, affecting several trains of a safety related system, may have a significant contribution to the reactor core damage risk. Therefore, the dependence of errors between tasks performed in redundant trains is an issue of extreme importance. The topic is highlighted in many earlier studies, e.g. in (Reiman 1994). The objectives of the study were to identify human errors related to maintenance, give examples of the origin and appearance of human induced common cause failure mechanisms, generate numerical indicators
explaining maintenance related errors and their influence, and to try to find improvements. Human reliability data estimation based on the material, and detailed root cause analyses, were so far excluded from the study scope, due to the extensive effort required for the data screening. In future, however, human reliability data analysis is an important topic to be studied.
USED METHOD
It is useful to study which kinds of human maintenance related errors can pass on to otherwise operable components and cause latent unavailability. The kernel of the thinking behind the study is that equipment may be declared operable although it actually is not capable of fulfilling all its functions. This latent unavailability may be due to many phenomena, e.g. human actions such as forgetting a restoration, and imperfectness of testing. The principle of latent unavailability born via maintenance related actions and not revealed thereafter is shown in Figure 1.
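The barrier logic of Figure 1 can be expressed as a simple multiplicative model: each detection opportunity removes a fraction of the remaining latent errors. The detection probabilities below are assumed values for illustration only; the study itself characterises the corresponding fractions empirically from the fault history.

```python
# Sketch of the Figure 1 barrier model: an error introduced during
# maintenance passes successive detection opportunities, and the residual
# probability that it is still latent shrinks at each one.
p_error = 0.02                      # error introduced by a maintenance task
barriers = [("post-maintenance QC / work-permit return", 0.5),
            ("administrative / functional check", 0.3),
            ("periodic test", 0.4)]

residual = p_error
for name, p_detect in barriers:
    residual *= (1.0 - p_detect)
    print(f"after {name}: residual latent unavailability = {residual:.4f}")
# Whatever remains is revealed only by demands or disturbance conditions.
```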
Figure 1. Model of the birth of an error in a maintenance action and its consequences (component degradation) passing several barriers. The unavailability decreases at each barrier due to possibilities to detect the fault.

The database used in the study consisted of the fault and repair history records, completed with other utility reporting, of one NPP located in the Nordic countries. The failure reports covered 4407 failure and repair records in the maintenance history from the calendar years 1992-1994. Human errors related to maintenance, such as omissions or wrong settings, may result in single or dependent component failures. The number of single failures is high when compared to the CCFs, called here HCCFs (Human CCFs). An HCCF may be born by repeated wrong actions, or by correct actions bringing e.g. an unfit set point or spare part to the target system. In special cases, single human errors may also cause multiple consequences due to latent technical interactions between components and systems; these may be called human shared equipment failures, HSEFs. A failure is considered to be an HCCF or HSEF via its consequences, i.e. redundant components are faulted in parallel. Similarly, human induced non-critical faults (HCCNs) have been studied, due to their precursor importance and due to error mechanisms similar to those of HCCFs. Thus, the scope of the CCF definition used here is to some extent different from that used in Probabilistic Safety Assessments (PSAs). In the cases where the human originated dependence between faults remained unclear even after interviews, single errors were assumed. For the exact definitions of the terms, see (Laakso et al. 1997).

The first part of the study was the screening of the 4407 fault records, their feedback information and 16 utility reports to find human error candidate cases. The details facilitating the identification were the coding of the fault causes and the text description based on the findings in repair. The screening phase was carried out by
two analysts in a sequential order, resulting in about 500 candidate records that were discussed with the maintenance foremen of the utility. That first review resulted in 334 candidate cases, of which 6 came from sources other than the fault history data (see Figure 2). Further investigation with the utility personnel revealed that, of these 334 cases, 126 fault history records could be further grouped into 37 different dependence cases (HCCFs, HCCNs and HSEFs). This reduction was due to the fact that multiple fault records in the database often originated in repeated erroneous actions. Thus, the number of single human errors found in the fault history was 206. Furthermore, 11 dependence cases preliminarily classified as HCCFs, and 2 single errors, were found to be caused by other reasons, e.g. by ageing, rather than by human errors. The flow of the screening is shown in Figure 2.

[Figure 2 flow chart: utility fault history records (~4400) and utility reports, e.g. root cause studies (16) → search and screening of human errors based on failure reporting and additional utility reports (~500 candidate records) → verification of the search results with the maintenance foremen (~166 records other than human errors screened out) → 334 cases → human error classification into single failures, candidate common cause failures etc. (126 HCCF/HCCN candidate fault records; 2 ageing caused records) → statistics of 206 single errors, and 37 dependence cases: 14 HCCFs and HCCNs (refined root cause analysis) and 23 other cases, e.g. 12 human shared equipment dependences.]
Figure 2. Flow chart of the screening of human errors related to maintenance activities.

The 334 maintenance records, including 6 cases from other utility reports, were classified according to the following explaining factors: direct cause of the error, type of equipment involved, time of error origin, time of error detection, and type of action that revealed the error. The underlying causes of human errors, such as work planning and co-operation deficiencies, were also studied in the cases of dependent errors. The analysis results of the data are discussed in the following chapters.
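The screening step described above, flagging records by fault-cause code and by keywords in the free-text repair description, can be sketched as follows. The field names, cause codes and keywords are hypothetical, since the actual database schema is not reproduced in the paper.

```python
# Minimal sketch of the record screening step: flag fault records whose
# cause code or free-text repair description suggests a human error.
records = [
    {"id": 1, "cause_code": "installation", "text": "limit switch left disconnected"},
    {"id": 2, "cause_code": "ageing",       "text": "worn bearing replaced"},
    {"id": 3, "cause_code": "misc",         "text": "wrong set point after calibration"},
]
KEYWORDS = ("left", "wrong", "forgotten", "not restored", "loose")

candidates = [r for r in records
              if r["cause_code"] == "installation"
              or any(k in r["text"] for k in KEYWORDS)]
print([r["id"] for r in candidates])   # -> [1, 3]
```

In the real study, the candidates produced by such a pass were then verified record by record with the maintenance foremen, which is what reduced the ~500 candidates to 334 confirmed cases.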
RESULTS

As may be deduced from Figure 2, about 7.5 % of all failure records could finally be identified as human errors related to maintenance. Human errors could be found in all four cause categories used by the plant personnel. This may easily be understood against the background that the cause categories used, e.g. 'operating and maintenance personnel' and 'failure in installation or earlier', are not mutually exclusive. Similarly, where sub-categories were used, e.g. 'foreign labour' and 'installation error' are not mutually exclusive. The frequent use of categories other than 'miscellaneous', given that the fault was caused by human actions,
however, speaks for a realistic classification of human errors and increases the reliability of the database used. In the following, the results are discussed separately for single human errors and dependent ones (HCCFs, HCCNs and HSEFs).
Single human failures

Together, 206 single human errors were identified, as shown in Figure 2. According to the results presented in TABLE 1, control & instrumentation (84 cases) and electrical equipment (40 cases) are most frequently affected, with a share of about 60 %. An interesting observation can be made with regard to process valves, dampers and hatches: their maintenance errors and wrong alignments are often modelled in PSAs but, here, their share is rather low (17 %). This supports the idea that more safety emphasis should be put on complex systems including instrumentation, control, protection and electrical power supply. In order to study the error types that occurred, a practice oriented application of the (Swain & Guttmann 1983) taxonomy, relating the error phenotypes omission and commission to the consequences, was used, with an extension to wrong setting errors. Commission errors dominate the results with a share of 67 %, and especially the category 'other (error of) commission than wrong direction' is frequent (54 %). That category consists of several types of wrong actions, such as carelessness, confusion of object, and the use of too much or too little force causing bad connections, untight bolts, broken pieces etc. The finding is interesting, since human reliability analysis (HRA) studies often concentrate upon omission errors and, in some cases, wrong settings. In practice, however, many commission errors result in consequences similar to omissions. The equipment type has a significant effect on the distribution of the error types. Based on TABLE 1, the share of commission errors in instrumentation and control (I&C) equipment is exceptionally high. In contrast, many omissions take place in actions on instrument line block valves (67 % of all error modes there). With regard to the valves, the result could be expected: error types such as wrong direction and wrong settings are not very likely to take place in process equipment. The amount of wrong settings in I&C and electrical equipment was only app. 12-13 % of the total, which can be regarded as rather expected, too. It is worth noticing that although many errors take place in I&C, it is not always the respective maintenance team that has committed the error. Due to the fact that many of the errors are of the commission type, other teams, e.g. mechanical maintenance, can also cause instrument faults by crushing objects, by wrongly installing instrumentation lines etc. Although a detailed study of the underlying causes of single human errors was not possible, many commission errors had their background in work planning and in system design & layout.
TABLE 1. SINGLE HUMAN ERROR CAUSE DISTRIBUTION AMONG COMPONENT CATEGORIES

                              I&C         Mechanical  Electrical  Valves  Instr. line  TOTAL
                              components  components  components          valves
Omission error                    13           7          14         7        8          49
Wrong settings                    11           0           5         2        0          18
Commission, wrong direction       11           3           7         6        0          27
Commission, other                 49          26          14        19        4         112
TOTAL                             84          36          40        34       12         206
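The shares quoted in the text can be re-derived directly from TABLE 1; the short check below, in plain Python, reproduces the ~60 % share of I&C plus electrical equipment, the ~17 % valve share and the 67 % overall commission share.

```python
# Re-computing the shares quoted in the text from TABLE 1.
counts = {  # error type -> (I&C, mechanical, electrical, valves, instr. line valves)
    "omission":               (13, 7, 14, 7, 8),
    "wrong settings":         (11, 0, 5, 2, 0),
    "commission, wrong dir.": (11, 3, 7, 6, 0),
    "commission, other":      (49, 26, 14, 19, 4),
}
col_totals = [sum(row[i] for row in counts.values()) for i in range(5)]
total = sum(col_totals)                                    # 206

print("I&C + electrical share: %.0f %%"
      % (100 * (col_totals[0] + col_totals[2]) / total))   # ~60 %
print("process valve share: %.0f %%"
      % (100 * col_totals[3] / total))                     # ~17 %
print("commission share: %.0f %%"
      % (100 * (sum(counts["commission, wrong dir."])
                + sum(counts["commission, other"])) / total))  # ~67 %
```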
Most preventive maintenance, modifications and testing activities take place during the annual refuelling outage. It is therefore not surprising that 127 (app. 62 %) of the single human error cases also stem from that period. It is more interesting to study the detection time of the errors. There, the situation is somewhat reversed: only about 32 % of all single human errors were detected during an outage. Approximately 60 % of the single errors were detected during power operation, and 7 % during plant start-up. Preventive actions revealed only app. 20 % of the single human errors. About 94 % of the errors born during the power operating mode of the plant were also detected in that same mode.

The errors born during an outage were studied further in order to assess the proportion remaining latent until the plant start-up. The result is shown in Figure 3. Approximately 51 % of the induced errors were also discovered during the same outage. However, the residual 49 % remained latent until the plant start-up, or even until power operation. This result can partly be explained by the fact that many of the faults caused are rather negligible from the safety and economy point of view. However, faults existed in safety systems, too. No thorough analyses have been performed to verify whether human errors in safety systems have a different detection distribution than other errors. Although somewhat better detection before plant start-up can be judged based on the data, no major conclusions can be made. Preventive actions, e.g. testing and preventive maintenance, were not specifically effective: only 17 % of the errors were detected by them. The low ratio may be explained by the fact that many of these single human errors are not critical from the safety point of view. Thus, they are noticed during plant walk-throughs for various purposes, by alarms, or occasionally while working nearby.
Figure 3. Plant operating mode at the time of detection of the 127 single human errors stemming from outages (left) and the 78 stemming from the operating period (right).

The numbers of omission, wrong direction/sequence and wrong setting errors remained quite stable through 1992-1994, and according to a Chi-square test the calendar year was not a statistically significant explanatory factor. The yearly numbers obtained for omission and commission errors are somewhat higher than those presented in (Reiman 1994). Reiman found approx. 9.6 omissions and 3.8 wrong-direction commissions per year through 1981-1991, whereas the findings of this study are 16.3 and 9, respectively. The difference is mostly due to the more extensive scope of this study, since the search for human errors covered all the maintenance records and not only those pre-classified as human errors. Thus, in searching for human maintenance errors, all fault cause categories should be investigated in order to avoid underestimation. The yearly distribution of wrong settings was not studied in (Reiman 1994).
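Such a test can be reproduced with standard tools. A minimal sketch using scipy; the yearly counts below are hypothetical placeholders, since the per-year table is not reproduced in the paper:

```python
from scipy.stats import chi2_contingency

# Hypothetical 3-year x 3-error-type contingency table (counts per calendar year);
# the real per-year counts are not given in the paper.
table = [
    [16, 9, 6],   # 1992: omissions, wrong direction, wrong settings
    [17, 8, 7],   # 1993
    [16, 10, 5],  # 1994
]

chi2, p, dof, expected = chi2_contingency(table)
# A large p-value means the calendar year is not a statistically
# significant explanatory factor for the error-type distribution.
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```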
Dependent human failures

Dependent equipment faults originally classified as human induced were found in 126 maintenance records and 4 other plant reports. After a thorough and effort-intensive analysis, this amount could be reduced to 44 records referring to 14 HCCF and HCCN cases, and 15 records referring to 12 HSEFs (see earlier text for acronym clarifications). In several cases there were more fault records in the maintenance history than actual human errors, since some errors affected several components. It is worth mentioning that, in the course of this study, many dependence mechanisms first regarded as wrong settings could be screened out, because they were actually caused by e.g. ageing, even though the plant fault records had originally classified them as human errors. Had their analysis been carried out on a superficial level, wrong settings would have dominated the results of the HCCF and HCCN study. The results obtained for HCCFs and HCCNs are, surprisingly, rather similar to those for single human errors. Again, instrumentation (10 cases) and electrical equipment (4 cases) dominate as target equipment groups; no other equipment groups were identified. The corresponding numbers among the critical faults (HCCFs) are 5 and 3 cases. The dominant error type category is, again, commission (6 cases, of which 5 are HCCFs) with a 42 % contribution. Dependent wrong settings (4 cases, 29 %) are also a common category, but all of them appear to be non-critical, instrumentation-related HCCNs. The contribution of wrong settings is, in any case, more significant than in the case of single errors. Omission was the error type in 2 cases (1 HCCF and 1 HCCN, both related to instrumentation) and wrong direction in 2 further instrument-related HCCFs. Two HCCF error mechanisms originated during power operation; one of them was detected in a periodic test and the other one otherwise, as part of routine activities, during power operation. Figure 4 presents the detection modes of the 12 dependent human errors introduced during outages. About 62 % of all of them, and about 68 % of the HCCFs, remained undetected at least until the start-up. The detection times of HCCFs and HCCNs in safety related systems follow roughly the same distribution. Of these dependent faults, 8 out of 12, and of the HCCFs 5 out of 6, were in safety systems or in their support systems. Thus, the fact that some errors are negligible in their consequences, as many single human errors are, does not explain the results. Three faults were discovered from two plant units. One has to bear in mind, however, that the database becomes rather small when HCCFs and HCCNs are studied. This does not overrule the fact that a large proportion of the dependencies remained undetected, and one of them was even discovered through a plant level disturbance.
Figure 4. Distribution of the detection modes of human-induced dependencies introduced during an outage (12 cases; in one case the origin was an outage at one unit and power operation at another).
The thorough analysis of the dependent human errors allowed further inferences about the occasions of their origin and discovery. Modifications are an important source of dependencies, with a share of almost 50 % (7 cases). The distribution of detection occasions is quite flat, although periodic testing and alarms together produce almost 50 % of the total.
Figure 5. The detailed detection (left) and origin activity (right) types of dependent human errors (14 cases).

The result is interesting from the safety point of view, because it is difficult to know which kinds of hazards are introduced by new equipment requiring new skills and practices. Nuclear utilities normally carry out extensive start-up testing programmes for their new equipment. In the future, however, even more thorough planning, coordination and testing of backfittings and modifications could result in even better equipment performance.
Human induced shared equipment failures

In addition to the HCCFs and HCCNs, altogether 12 HSEFs (see earlier text for acronym clarifications) were identified in the course of the study. Of these, 3 were omissions and 7 were commissions other than wrong-direction errors. All the omissions were cases where one single error, e.g. a delayed or forgotten restoration or a missed point in the instructions, led to multiple consequences in equipment. HSEFs can be seen as important from the safety point of view, since the probability of one error is normally higher than that of repeated ones. Seven HSEFs originated during an outage and 5 during power operation. Eight HSEFs were detected during the power operating mode. Three of the 7 HSEFs originating during an outage remained latent until power operation, which is somewhat less than for HCCFs and HCCNs. All the HSEFs originating during power operation were also detected in that operating mode. All this is rather analogous to the single human errors. As another analogy, preventive actions were a rather insignificant detection means, with only a 17 % share.
DISCUSSION

The most significant uncertainties in the results of this work are related to the data and its uses, i.e., identifying human errors based on the fault records and classifying them based on the raw data. The cause categories used in the plant maintenance records neither directly addressed failures in redundant components nor explicitly allowed human error classifications. This may have led to some error mechanisms going unnoticed. Besides, at the outset of the study, the target was set to identify dependent human errors causing faults in redundant components or trains. As a result, otherwise correlated dependence mechanisms were left outside accurate consideration and may appear as single errors.
The authors wish to point out that there are also other data sources that could be used to complete the results, e.g. quarterly reports, annual outage event reports, test and calibration protocols, control room log books, work orders and modification data. These were utilised only to a limited extent in the study, in order to limit the effort; for example, finding evidence in the calibration protocols requires very experienced researchers and a great deal of resources. The human error type classification used in the study was not very detailed. More detailed taxonomies, such as those presented in (Reason 1990), would have required considerably more work, and the results might still have been rather uncertain. Thus, the attempt to map e.g. the cognitive error mechanisms involved was left outside the study, as was very detailed statistical significance testing of the results. The latter decision was supported by the fact that, in most cases, the influences of given factors on the data could be inferred otherwise. In the future it may be interesting to compare the results with the PSA and the plant technical specifications, to draw further conclusions e.g. about the safety significance of the results.
CONCLUSIONS

The study was capable of producing interesting results, since a large amount of plant specific maintenance data was used as source material. Many of the results were expected, e.g. that outages are an important source of human errors due to the high amount of maintenance and modification work. Many of the maintenance related errors stemming from outages remain undetected until power operation; in that respect, single and dependent human errors showed rather similar behaviour. One has to remember, however, that maintenance outages are vital for ensuring the safe operation of an NPP. Maintenance related human errors may have a significant safety influence, but considerably more safety degradation would probably be caused if no maintenance took place. Although a significant number of human errors took place in safety related systems, not all of them were functionally critical. In addition, the most common error type was the very wide category 'other commission error', often related to carelessness, e.g. too much or too little force being used. Instrumentation and electrical components seem to be prone to human errors, partly due to the vulnerability and partly due to the complexity of the equipment and its uses. More emphasis may be needed in HRA, in the future, on studying these components. Plant modifications appeared as a very important source of dependent human errors. Thus, more extensive planning, co-ordination and testing of modifications is suggested. This includes better interaction between the different design branches and maintenance groups of a nuclear installation. The topic may be very plant specific, and no generic conclusions should be drawn without revisiting each plant's own experience. The amount of work used in the study was extensive; especially going beyond the maintenance database, in the form of interviews and analyses, was resource consuming. Still, in many cases it was difficult to extract exact information, and additional analyses may be required to verify some results. Yet plant maintenance records offer the best database for maintenance related human errors, and their wider utilisation is recommended in the future.
REFERENCES
Laakso, K., Pyy, P. and Reiman, L. (1997). Human Errors in Maintenance Actions at Nuclear Power Plants. Research Report, STUK-YTO Series. In press.
Reason, J. (1990). Human Error. Cambridge University Press. 302 pp.
Reiman, L. (1994). Expert Judgment in Analysis of Human and Organizational Behaviour at Nuclear Power Plants. STUK-A 118, Helsinki. 226 pp.
Swain, A. D. and Guttmann, H. E. (1983). Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications. NUREG/CR-1278, US Nuclear Regulatory Commission, Washington, DC.
HUMAN ERROR AND TECHNOLOGICAL
ACCIDENT PREVENTION
R. Ferro Fernandez National Center for Nuclear Safety, Ministry of Science, Technology and Environment, Cuba
ABSTRACT

Human error prevention is essential in assuring the safety of hazardous industries, since human error has been one of the most important contributors to several industrial accidents around the world during the last decades. These accidents resulted in human victims, costly material damage due to production, installation and equipment losses and, in the case of contaminating industries, environmental effects. Today it is widely recognized that industrial safety and reliability depend not only on system and equipment reliability, but also on human reliability, both in normal and in emergency conditions. This paper presents a general background on the importance of human error prevention, some ways to reduce the probability of error occurrence and the need to achieve a high safety culture at all levels of industry, and also describes the efforts made in this field in our country by the National Group of Probabilistic Safety Assessment.
KEYWORDS
Human Factors, Human Error, Human Reliability, Technological Accident, Accident Prevention, Safety.
INTRODUCTION

Human error prevention is essential in assuring safety in industries with potential risks. Man has been one of the major contributors to several of the biggest technological accidents that have occurred in the world during the last decades. The consequences have ranged from economic damage, due to production losses and/or the destruction of equipment and installations, to human victims and, in the case of contaminating industries, environmental effects. The human involvement is widely known in the accidents at the chemical plant in Bhopal (India, 1984), at the Three Mile Island (USA, 1979) and Chernobyl (Ukraine, 1986) nuclear power plants, in the Challenger space shuttle disaster (USA, 1986), in the Clapham Junction train collision (United Kingdom, 1988) and in other major disasters. In the petrochemical industry, operational error has resulted in losses of many millions. It is considered that 50-80 % of the accidents in commercial aviation are due to pilot errors (Caruso, 1990). It is evident that human factors problems can affect any industry, in any country. Hence, human error prevention and the control of its incidence are recognized as one of the most important challenges posed to
engineering today. In our country we are also making an important effort in that direction. The approaches we follow and some of the results achieved are presented in this paper.
MAN AS PART OF THE TECHNOLOGICAL SYSTEM

As accidents have shown, the safety and reliability of technological systems depend on both equipment reliability and human reliability. Man continues to play an important role in the operation of technological systems, either as a direct and active operator in less automated systems, or in control and surveillance functions in the most automated ones. Errors in maintenance, test and calibration activities can lead to system unavailability or miscalibration. Omission or commission errors during an emergency might complicate the accident sequence or aggravate its consequences. On the other hand, human knowledge and experience can generate human actions, not previously included in operating procedures, that might recover an almost lost situation or mitigate its consequences. This dependence of safe system operation on human performance has contributed to a wider and more comprehensive understanding of the technological system, defining it as an assembly of mechanical and human components interacting with each other to accomplish the system function. Only the appropriate matching of these two components, considering the limitations and capabilities of both, can assure an important reduction of system failures due to human errors. For that reason, it is necessary to influence all those elements of the man-machine interface that could be a source of human failures and errors leading to system failures. It is important to ensure an ergonomic design of the equipment, working tools and working environment, aimed not only at providing individual safety and comfort when performing a task, but also at achieving high human reliability from the system safety point of view. Studies on human physical and psychological characteristics are essential to understand human behaviour, the response mechanisms in the presence of certain stimuli, and the human interrelation with other environmental factors. The training systems and requirements are also of prime importance. The development of analytical methods for human reliability assessment completes the efforts made in identifying and improving the weaknesses in man-machine interaction.
LATENT FAILURES

The attention paid to the so-called latent failures in organizations has increased considerably during the last years, in addition to the active failures on which the emphasis was focused before (Organizacion de la Aviacion Civil Internacional, 1993). Active failures are those human errors and violations that have an immediate and visible effect, that is, they are the main or one of the most important causes of accidents. Latent failures, on the other hand, are those failures in the organization that remain hidden during the time preceding an accident and which create the conditions for the active failures to arise, resulting in an accident. In relation to latent failures, two key terms have appeared: safety culture and safety management.
Safety Culture

The achievement of an effective safety culture in organizations and individuals is one of the goals proposed in some industrial sectors to increase safety and reduce man-induced accidents. The term appeared for the first time in the nuclear industry after the Chernobyl accident in 1986, later becoming one of the basic safety principles for nuclear installations (International Atomic Energy Agency, 1988). The concept is defined as follows:
Safety culture is that assembly of characteristics and attitudes in organizations and individuals which establishes that, as an overriding priority, nuclear plant safety issues receive the attention warranted by their significance (International Atomic Energy Agency, 1991). This concept comprises the two angles of the problem: the collective aspect, i.e., the organization as a whole and the environment and conditions it creates for its personnel, and the individual aspect, i.e., the individual's attitudes and capabilities to use this environment effectively for safety purposes. Safety culture is encouraged in order to achieve, in the personnel at all levels of the organization, an all-pervading safety thinking, an inherently questioning attitude, the prevention of complacency, a commitment to excellence, and the fostering of both personal accountability and corporate self-regulation in safety matters (International Atomic Energy Agency, 1991).
Safety Management

The development of an effective safety culture can be promoted by establishing a safety management system, which sets the basis regulating the functioning of the organization and individual behaviour in safety-related matters, as well as the way they are self-controlled. This system includes definitions of safety policies, delimitation of responsibilities, training systems, manager and personnel attitudes, etc. It constitutes one of the principal measures for the identification and eradication of latent failures, and it is being developed in many industrial sectors.
HUMAN FACTORS TREATMENT FOR TECHNOLOGICAL ACCIDENT PREVENTION

Because of the multidisciplinary character of human factors problems, several approaches exist to deal with them. In our case, we are working along four directions:
- Equipment, tools and working environment: studies related to the ergonomics of equipment, control board and control room design, working environmental conditions, and operating procedures.
- Man-inherent issues: studies in the different human sciences, personnel selection and training requirements.
- Human reliability analysis: the assimilation, use and development of analytical methods, human error data collection, and root cause analyses.
- Organizational issues: all activities in the field of safety culture and safety management.
ACTIVITIES PERFORMED IN THE FIELD OF HUMAN FACTORS

At the end of the 1980s a National Group for Probabilistic Safety Analyses (NGPSA) was created, composed of several specialists from different Cuban institutions working in the area of safety analyses. The aim of this group was the assimilation of probabilistic safety analysis (PSA) techniques in order to accomplish such an analysis for our first nuclear power plant, under construction in Juragua, Cienfuegos province. For that purpose a Technical Assistance Project, CUB/9/008, sponsored by the International Atomic Energy Agency (IAEA), was carried out, which allowed our personnel to acquire basic knowledge on several matters, including human reliability. The stoppage at the Juragua NPP, which started at the end of 1992, led to a decrease in the rhythm of the construction works and to a general contraction of nuclear activities in the country. This situation gave us the chance to extend the experience gained in performing such analyses from the nuclear industry to other, conventional industries in our country. This period has been very important for the maturing of our group.
Activities performed in the nuclear sector.
Human actions treatment in the pre-operational probabilistic safety analysis of the Juragua NPP. Due to the construction stage of the NPP, operating documentation, such as procedures and other information on the future operation of the plant, is limited. For that reason it was necessary to simplify the human reliability analysis task in this PSA study. The main objectives were the identification of human actions and the selection of the most important ones using screening values of Human Error Probabilities (HEP); a schematic sketch of this screening step is given after TABLE 1. A procedure for human actions treatment (PM 0107) was developed for this task as part of the PSA Manual of Quality Assurance instructions and procedures. The scope of the human reliability task covered the following:
- Type 1 (pre-accident) and Type 3 (post-accident, procedural) human actions
- calibration actions that can result in common cause failures
- generic screening values of HEP
- qualitative assessment of dependence
- selection of the most important human actions
- no detailed analyses of the selected human actions were performed.
The NPP Technical Project and the piping and instrumentation diagrams were reviewed in order to identify possible human actions. The PSA studies, operating procedures and operating regimes of other similar NPPs were also considered in postulating human actions. Some of the results are:
1. The appropriate application of the rules and guidelines of PM 0107 allowed several human actions to be discarded from further consideration in the study.
2. The results of the identification and modelling of human actions, and of the selection of the most important ones, are summarized in TABLE 1.
TABLE 1. RESULTS OF HUMAN ACTION IDENTIFICATION, MODELLING AND SELECTION

                       Modelled human actions   Selected most important human actions
Type 1 human actions   19                       6
Type 3 human actions   48                       30
Total                  67                       36
These preliminary results provide the operating organization with useful information that has to be taken into account in the future organization and documentation of plant operation.
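The screening step described above can be illustrated schematically: each identified action receives a conservative generic HEP, and only the actions whose contribution exceeds a cutoff are kept for detailed treatment. All names and numbers below are illustrative assumptions, not values from procedure PM 0107:

```python
# Illustrative HEP screening, not the actual PM 0107 procedure.
# Each candidate action gets a conservative generic screening HEP;
# actions whose risk contribution exceeds a cutoff are kept for the PSA model.
GENERIC_HEP = {"type1": 3e-2, "type3": 1e-1}   # assumed conservative values
CUTOFF = 1e-3                                   # assumed contribution cutoff

actions = [
    # (name, type, frequency weight of the demanding scenario per year)
    ("miscalibration of level transmitters", "type1", 1e-1),
    ("manual start of auxiliary feedwater",  "type3", 1e-2),
    ("realignment after test",               "type1", 1e-4),
]

selected = [
    (name, GENERIC_HEP[kind] * freq)
    for name, kind, freq in actions
    if GENERIC_HEP[kind] * freq >= CUTOFF
]
print(selected)  # actions surviving the screening, with their contributions
```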
Assessment of safety culture in regulatory activities. Based on the IAEA guidelines for organizational self-assessment of safety culture (International Atomic Energy Agency, 1994), a project for assessing some aspects of safety culture in the nuclear activities performed in our country was prepared. The scope of the study comprises all kinds of nuclear-related organizations, such as nuclear and radioactive facilities, supporting organizations and the Regulatory Body, which are to be covered in three stages. The study started with the assessment of the Regulatory Body's activities, considering the role of governmental organizations in setting up the national basis of safety culture.
In order to learn the opinions about the Regulatory Body's work, a survey was distributed among personnel from different organizations in several provinces. The review of the responses revealed some items that require further analysis to find the causes and to take appropriate measures that could contribute to improving the Regulatory Body's safety culture. These items are:
1. the Regulatory Body's role and functions
2. legislation on the responsibilities and attributions of the different kinds of nuclear organizations; safety policies
3. application of regulatory requirements
4. the Regulatory Body's interference
5. dissemination and feedback of safety information
6. interaction between the Regulatory Body and the utilities
7. image and benefits of the regulatory activity.
This study will continue during 1997 to cover the remaining organizations included in the proposed scope.

Project for a probabilistic safety assessment (PSA) of an industrial irradiator.
A gamma irradiator for food preservation purposes has been operating in Cuba since 1987. In spite of the wide utilization and the known benefits of industrial irradiators, several accidents with human victims have occurred throughout the world. The analysis of these accidents and the subsequent recommendations have focused mainly on human factors problems (International Atomic Energy Agency, 1994). In view of this situation, a project to perform a PSA for this irradiator has been prepared, in which the human reliability task will be one of the most important, considering the high number of manual activities required for the operation of this installation. The objectives of the study are:
1. to obtain information to support plant operation through the improvement of operating procedures, maintenance strategies, test and surveillance programmes and personnel training
2. to promote a risk-based safety culture among the operating personnel.
The study is to be initiated soon, as part of the licence renewal due to the introduction of modifications to the original design.

Activities in the sectors of conventional industries.
The extension of the risk and safety analysis experience to other, conventional (non-nuclear) industrial sectors started with a group of activities addressed initially to the promotion and dissemination of these methods and later to their practical utilization in projects for industry. The tasks related to human factors issues have been of major interest, not limited to human reliability but also including other issues in the field of human factors, such as working conditions and environment, personnel training systems, operating procedures, and safety culture and management. The activities performed up to now are briefly described below.

Promotion and training activities.
Post-graduate and training courses: several courses on "Safety and Reliability of Industrial Systems" have been held to familiarize managers and specialists of the risk industries with the methods used and with human factors influences. Cathedra for Industrial Safety: created in 1994 with specialists from different industrial sectors such as nuclear, electricity, aeronautics, biotechnology, chemical, fire protection, biosafety and others, with the aim of joining efforts in the field of safety and accident prevention in the
national industry. This Cathedra promotes conferences, technical meetings, inter-industrial projects and other activities; human factors is one of its study areas. National workshops on safety of industry: a workshop is held annually with the purpose of showing the most recent results and studies performed in the country in the field of safety and accident prevention. Studies on human factors have been presented at all the workshops, safety culture being one of the most discussed items.
Application in the biotechnological industry. During 1995-96 a study on the reliability of two critical systems of one of the production plants was carried out. The study included an analysis of human actions using generic HEP values. In spite of the highly conservative design of the systems, 21 human actions were identified and included in the logic models. At the same time, several recommendations to improve the operating conditions for human error prevention were formulated in areas such as:
- labelling systems
- indicator lamps
- accessibility of manual operating devices
- operating procedures and logbooks
- personnel training systems
- alarms.
Application in the hazardous chemical industries. In 1996 a study on the possibilities and consequences of accidents involving ammonia releases in an industrial refrigeration facility was accomplished. As part of this study, a general assessment of several human factors issues was performed, and recommendations were given in the areas of labelling of systems and components and accessibility of equipment during emergency situations. In 1997 a more comprehensive study, starting from these preliminary results, will be undertaken in this facility.
Safety regulations for the industry. Due to the importance of human error prevention and the need to achieve a high safety culture in such matters, we are actively working in the elaboration of several national regulations. These regulations address the control of major risks in industry and environmental protection. In this way, the attention paid to human factors will be increased both in new projects and designs and in existing facilities.
CONCLUSIONS

The studies we have performed up to now have confirmed the importance of human factors assessment for increasing safety and reliability in the risk industries. In most cases the recommendations were not too costly, and their contribution to human error prevention is important. Such assessment has become an important part of any safety study we perform for industry, thus achieving a comprehensive understanding of safety and reliability problems. The attention paid to human factors issues has to be increased as an important way of accident prevention. Of prime importance are the organizational and safety culture issues, without relegating other human factors issues such as the improvement of designs and operating conditions, the development and wider use of human reliability
assessment methods, and the enhancement of training systems. We will continue working on these matters, contributing in this way to the worldwide effort towards accident prevention.
REFERENCES

Caruso, J. G. (1990). Importancia de los Factores Humanos en la Seguridad de las Instalaciones Nucleares. Revista de Seguridad Radiologica, No. 1, 61-71.
International Atomic Energy Agency (1988). Basic Safety Principles for Nuclear Power Plants. Safety Series No. 75-INSAG-3. IAEA Division of Publications, Austria.
International Atomic Energy Agency (1991). Safety Culture. Safety Series No. 75-INSAG-4. IAEA Division of Publications, Austria.
International Atomic Energy Agency (1994). Guidelines for Organizational Self-Assessment of Safety Culture and for Reviews by the Assessment of Safety Culture in Organizations Team. IAEA-TECDOC-743. IAEA Division of Publications, Austria.
International Atomic Energy Agency (1994). Lessons Learned from Radiological Accidents in Industrial Irradiators. IAEA Division of Publications, Austria.
Organizacion de la Aviacion Civil Internacional (1993). Factores Humanos, Gestion y Organizacion. Compendio sobre Factores Humanos Num. 10, Circular 247-AN/148.
A6" Human Reliability
STATE OF THE ART IN THE DEVELOPMENT OF A SIMULATOR AIDED APPROACH TO HUMAN RELIABILITY ASSESSMENT
A. Bareith 1, E. Holló 1, Z. Karsa 1, S. Borbély 2 and A. J. Spurgin 3
1 VEIKI Institute for Electric Power Research Co., Budapest, POB 233, H-1368, HUNGARY
2 Paks Nuclear Power Plant Co., Paks, POB 71, H-7031, HUNGARY
3 Consultant, San Diego, CA, USA
ABSTRACT

As a follow-up to a previous effort, operator reliability experiments were performed at the full scope training simulator of the Paks Nuclear Power Plant (Hungary) in late 1995. The objectives of the experiments were to improve the human reliability model developed from the previous study and to provide feedback for training enhancement at the plant. In the simulator sessions the control room crews of Paks were exposed to three PSA-type, multi-failure simulated transients. Extensive data collection was performed during the experiments, including computerised data retrieval on plant and operator responses as well as observer data collection on operator actions, performance shaping factors, and individual and team skills. Following the observations, most of the experimental data were converted into a common data base for the purposes of data analysis. Many of the planned analyses have already been performed. The project has now stepped into the final stage of the analysis process: the interpretation of the results. This paper briefly summarises the main features of the data collection and analysis methodology used at Paks for the simulator experiments. More importantly, a discussion is given concerning the current data analysis results.
KEYWORDS

Data analysis, data collection, experiment, human factors, human reliability, operator, operator training, probabilistic safety assessment, simulator, statistics
INTRODUCTION
A Level 1 Probabilistic Safety Assessment (PSA) was performed for the Paks Nuclear Power Plant of Hungary between 1992 and 1994. In the framework of the Paks PSA study, 120 operator reliability experiments were carried out at the full scope plant simulator to provide input for modelling and quantifying operator (or crew) reliability for post-initiator dynamic responses. The Human Reliability Analysis (HRA) for Paks relied extensively on these simulator studies. According to the results of the Paks PSA, the fractional contribution of those accident sequences (minimal cut sets) that contain human errors is over 90 %. To reduce this number, a decision was made to launch a project on the further investigation of safety related human errors and human reliability.
OBJECTIVES

A major part of the project is a new series of simulator experiments on control room crew responses during transients. While the previous effort in 1992 was aimed at providing input data to the HRA, the objectives of the new experiments are to
1. refine the HRA model developed from the former simulator tests, and
2. develop recommendations for improving operator reliability.
The experience from the earlier tests helped a lot in the new study. However, the modified objectives increased the complexity of the experiments considerably.
APPROACH
In the first step of the study it was determined what kind of data would be necessary to meet the objectives. These data needs were then used as a basis to (1) set up the data collection requirements and (2) design the simulator exercises. The design phase was finalised by pilot tests, followed by the real experiments. The last stages of the programme are the data analysis and the interpretation of the analysis results. In the process of defining data collection needs and the associated data collection requirements, a taxonomy was created for the data to be collected either by observers or by automated data recording tools. This taxonomy covers important aspects of control room crew operation, such as
- scenario events and chronology
- plant responses and parameter changes
- features of operator diagnosis, decision making and actions
- effect of performance influences, e.g. task complexity, man-machine interface, emergency operating procedures
- performance competencies, e.g. knowledge adequacy, communications, supervisor leadership
- operator errors and error causes.
Matrix type data forms and a pre-programmed bar-code reader system (with the accompanying bar-coded data sheets) were used to help produce the observer data. Additional information was collected by task-specific simulator software and video recorders. Three transients with multiple failures were designed for the purposes of the experiments:
1. medium size loss of coolant accident (LOCA) with the high pressure emergency core cooling pumps unavailable
2. main steam line break outside the containment, with important motor driven valves unavailable
3. loss of electric load to self-consumption.
These transients and the associated operator interactions are included in the PSA model of the plant. Detailed descriptions were produced for each scenario as part of developing lesson guides for the three types of simulator sessions. The lesson guides were put together jointly by the training personnel and the PSA/HRA team, using a newly developed systematic approach that helps the instructors to define learning objectives and to conduct and evaluate an exercise more consistently than the previous practice. The operator reliability experiments took place during the six weeks of the autumn operators' refresher training period at the Paks Simulator Centre in 1995. The 24 crews that work for the four units were tested for each scenario. Thus, the observations covered 24 x 3 = 72 simulator sessions in total. The experiments were performed similarly to the usual refresher training sessions. In addition to the trainers, an observer team of three followed the simulator exercises and the subsequent evaluations for the purposes of data collection. The members of the observer team were an HRA analyst, a PSA specialist from the plant, and a training and
simulator expert. Table 1 gives a summary of how the data collection tasks were assigned to the various observers. For the automated retrieval of transient data, most of the data collecting equipment was operated by a training instructor. Following a session, the training instructors joined the observer team for a closing discussion. Each experiment was concluded by setting up a data package of all the data collected during the session.
TABLE 1
DATA COLLECTION RESPONSIBILITIES OF THE OBSERVERS

OBSERVER DELEGATED FROM            OBSERVER'S RESPONSIBILITY
Simulator Department of Paks NPP   Diagnosis, Use of Procedures
Analyses Department of Paks NPP    Man-Machine Interface, Cognitive Behaviour, Leadership, Communication
VEIKI PSA team                     Operator Errors, Error Type, Error Causes, Overall Performance Measures
All                                Additional Observer Notes
Subsequent to the observations, the analysis of the experimental data was started. As a first step, a data analysis plan was developed. This was done in parallel with a data reduction, verification and interpretation process, as required by the various analysis techniques. Many of the planned analyses have already been performed. The project has now stepped into the final stage of the analysis process: the interpretation of the results.
ANALYSIS OF SIMULATOR DATA
In the remaining chapters the data analysis is discussed in some more detail, with examples of important data analysis findings. The discussion below follows the main steps taken during the data analysis so far.
Data Reduction
In order to enable the use of a wide range of analysis methods, the raw experimental data were pre-processed, which resulted in an integrated data base. As referred to above, the raw data consisted of computer data files, observer data and video records. From the computer data files the following data were selected and put into the data base:
- time stamps of scenario events, including simulated malfunctions, control room signals, and plant and operator responses
- important plant parameter changes.
The raw observer data were available on paper based data tables and in data files created by the bar-code readers. Within the observer data two main subgroups were defined: one for data related to a given human-system interaction (HI), and another concerned with measures of overall crew performance and performance competencies. The HI level data represent the features of crew operation in terms of diagnosis, decision and task execution, the effect of performance influences including ergonomic factors, as well as the operator errors made and the error causes observed. The measures of overall (or global) performance describe the effectiveness of crew performance, various aspects of team skills, etc. The raw observer data were converted into numeric variables and then put into a spreadsheet type data base. The integrated data
base consists of three two-dimensional arrays representing the three scenarios. For a control room crew, the number of variables within a scenario is around 500.
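The structure of such an integrated data base can be sketched as follows (the shapes follow the text; all names and values are placeholders):

```python
import numpy as np
import pandas as pd

# Illustrative shape only: 24 crews x ~500 observer/simulator variables
# per scenario, mirroring the integrated data base described in the text.
scenarios = ["loca", "steam_line_break", "loss_of_load"]
n_crews, n_vars = 24, 500

rng = np.random.default_rng(0)  # placeholder values
data = {
    s: pd.DataFrame(rng.random((n_crews, n_vars)),
                    index=[f"crew_{i:02d}" for i in range(n_crews)],
                    columns=[f"var_{j:03d}" for j in range(n_vars)])
    for s in scenarios
}
print(data["loca"].shape)  # (24, 500)
```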
Analysis Methods
In accordance with the objectives of the experiments, the data analysis had to be focused on methods that could (1) provide input into the HRA model development and (2) yield insights into potential human factors improvements. There is an obvious and strong relationship between the two areas, which implies performing the same types of analyses. However, there are also specific methods and analysis tools that serve one objective better than the other. For example, some simple statistics of operator responses can be usefully applied for training evaluation and improvement purposes while being less important from the point of view of HRA. In the previous data analysis of the simulator tests in 1992, a decision tree method was used to embody the observer data on operator performance into the HRA framework, which worked reasonably well - see Bareith et al. (1996). However, during the phase 1 tests it also became clear that more effort was needed to improve the establishment of the simulator aided HRA model. Examples of the areas that need special attention as far as the HRA model development is concerned are as follows:
- circumstances of operator errors and deviations from the expected operational strategy, e.g. error modes, mechanisms, causes, etc.
- relationship between experimental measurements and human error probability (HEP)
- HEP predictions from time reliability or team skills considerations
- changes in crew response as compared to the previous tests.
With respect to human factors and training, the focus of the analysis is to understand what affects the performance of the crews. Given that we understand the effects on performance, measures can be identified to help improve performance by training or by changes in the man-machine interface (MMI) or in the procedures. Performance of the crews is measured by the ability to control the plant within acceptable parameter limits and to avoid deviations from the expected series of actions that the crews are supposed to take in response to the accident scenario. Some errors occur systematically: a given action is not performed correctly by the majority of the crews, or a given crew tends to make more errors than the others. Other errors have a more random nature. So as to generate insights into potential ways of improving crew performance, the analysis was focused on two aspects: (1) the effect of operator actions on the plant, and (2) how the crews operate as a unit, i.e. team skills. The most important analyses performed for both HRA and human factors improvement purposes are as follows:
- comparative analysis of crew performance effectiveness for each HI and each scenario, using the computer event logs and data on plant parameter changes
- comparisons with the results of the 1992 experiments
- analysis of observed operator errors and deviations with causal decomposition, using both observer and simulator data
- statistical evaluation of time response data, including time reliability curves (a sketch follows this paragraph), control charts, etc.
- statistical analysis of correlation between performance influences and the effectiveness of crew responses, in terms of the number of errors/deviations and subjective performance measures
- statistical evaluation of crew competencies.
Similarly to the data collection phase, the analysis has been performed as a joint work of the trainers and the HRA/PSA analysts, with the participation of a US consultant. As can be seen from the description above, the analysis has relied on both statistical methods and subjective evaluation.
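One of the analyses listed above, the time reliability curve, relates the probability of crew non-response to the time available. A minimal sketch under one common assumption (lognormally distributed response times); both parameters are invented, not estimated from the Paks data:

```python
import math

# Sketch of a time-reliability curve: probability that the crew has NOT yet
# responded by time t, modelled here with a lognormal response-time
# distribution. The parameters below are illustrative assumptions.
MEDIAN_T = 300.0   # assumed median response time, seconds
SIGMA = 0.8        # assumed lognormal shape parameter

def non_response_prob(t: float) -> float:
    """P(response time > t) for a lognormal response-time model."""
    z = math.log(t / MEDIAN_T) / SIGMA
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return 1.0 - phi

for t in (60, 300, 900, 1800):
    print(f"t = {t:4d} s  P(no response) = {non_response_prob(t):.3f}")
```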
The statistical methods applied range from simple descriptive statistics to advanced non-linear statistics. Because of the features (distribution and sample size) of the data base variables, mostly non-parametric tests were used for the correlation analysis.
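For the correlation analyses mentioned here, rank-based tests are the natural choice. A minimal sketch with scipy, using invented crew-level placeholder data:

```python
from scipy.stats import spearmanr

# Hypothetical crew-level data: a 1-5 rating of a performance influence
# (e.g. supervisor leadership) and the number of errors/deviations observed.
leadership = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3, 4, 5]
n_errors   = [1, 3, 0, 4, 2, 3, 1, 1, 5, 2, 2, 0]

rho, p = spearmanr(leadership, n_errors)
# Rank-based, so suitable for small samples of ordinal ratings.
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```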
Examples of Results

Probably the most important area of data analysis was the analysis of errors. In the discussion below we will be referring to operator errors. By error we mean failure to take an action that is required by training. Such errors are not necessarily identical to those defined in a PSA model as basic events. From the safety point of view, the significance of the errors observed varies substantially; the presence of errors does not imply that the crew led the plant to core damage. By deviations we mean (1) deviation from the expected response (which can result in an error), or (2) an unexpectedly long time to initiate a response or perform an interaction. In the error analysis process, the presence of errors and deviations from the expected responses was identified for each HI. The most probable root causes of each problem and the type of the error were determined, too. The root cause classification was based on a categorisation scheme pre-defined during the design of the experiments. In addition, weighting factors were used to measure the relative contributions of the various causes to an error/deviation. During the root cause analysis, recommendations for error reduction measures were developed HI by HI. By summarising the analysis results for the various HIs, a hierarchy has been developed for the overall error characteristics. This hierarchy is shown in Figure 1. As the figure indicates, errors and deviations are dominated by problems with the Man-Machine Interface (58.74 %), Supervisor Leadership (12.93 %), Knowledge Adequacy (9.45 %) and Communications (9.09 %). As in the previous experiments, difficulties in the use of procedures have a minor effect. It should be emphasised that the latter finding reflects the way the Paks operators use the procedures; it is not a detailed evaluation of the procedures themselves, since the experiments were very limited in this respect due to the limited number of accident scenarios observed. The contribution of diagnosis problems displays a substantial increase as compared to the earlier experiments (44 % versus 15 %). (Note that the man-machine system is defined in a very broad sense, and thus diagnosis problems appear under Processing within the Man-Machine Interface subgroup in Figure 1.) An explanation of the higher importance of diagnosis is the fact that the scenarios of the new study were much more complex than in 1992, which also made the conditions of diagnosis more difficult. Diagnosis problems lead us to the issue of procedures, because two of the three experimental scenarios are not covered by the procedures in the form they were designed for the experiments. Consequently, the operators could not find much written guidance to help them in the course of making a diagnosis. Another area studied is the types of errors observed. The distribution of error types was determined for each scenario, and overall statistics were produced. A distinction was made between intentional and unintentional errors, further subdivided based on Reason (1990). The results are summarised in Table 2. The analysis of errors and deviations has led to the following conclusions:
- Complex, multi-failure situations tend to result in more diagnosis difficulties, which increases the likelihood of mistake type errors.
- Due to the way the Paks control room crews are organised and operate, and because of the weaknesses of the procedures, operator knowledge is the most important precondition of successful crew response in multi-failure situations.
- There is an increase in the number of errors as the scenarios require a higher level of team integration (leadership and communication).
- For a number of HIs, weaknesses in control room design (displays and controls) unfavourably affect crew performance.
Figure 1: Error Cause Hierarchy. [The original figure is a tree diagram decomposing 100.00 % of the operator errors into cause categories; the legible top-level contributions are Man-Machine Interface 58.74 % (with subgroups Display 2.10 %, Control 12.94 % and Processing 43.70 %), Supervisor Leadership 12.93 %, Knowledge Adequacy 9.45 %, Communication 9.09 % and Procedures 5.59 %.]
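The weighted aggregation of root causes behind a hierarchy like Figure 1 can be sketched as follows; the records and weighting factors below are invented placeholders, not data from the study:

```python
from collections import defaultdict

# Illustrative aggregation of weighted root causes into an error-cause
# hierarchy like Figure 1; the records and weights are placeholders.
observations = [
    # (HI id, cause category, weighting factor)
    ("HI-03", "man-machine interface/processing", 0.7),
    ("HI-03", "knowledge adequacy", 0.3),
    ("HI-07", "supervisor leadership", 1.0),
    ("HI-11", "communication", 0.5),
    ("HI-11", "man-machine interface/display", 0.5),
]

totals = defaultdict(float)
for _, cause, weight in observations:
    totals[cause] += weight

grand_total = sum(totals.values())
for cause, w in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{cause:40s} {100 * w / grand_total:5.1f} %")
```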
TABLE 2
DISTRIBUTION OF ERROR TYPES

                        LAPSE     SLIP      MISTAKE   RULE VIOLATION
Primary LOCA            40.00 %   3.33 %    56.67 %   0.00 %
Steam line break        11.54 %   1.28 %    84.62 %   2.56 %
Loss of electric load   33.33 %   16.67 %   50.00 %   0.00 %
All                     29.05 %   3.33 %    66.67 %   0.95 %
During the observations some of the crew competencies were measured by subjective, behaviourally anchored five-point rating scales, with 1 being the "worst" and 5 being the "best". The analysis of these measurements was made mainly for training evaluation purposes. With respect to the overall success of crew operation, the average rating of the 24 crews is 3.42 for the LOCA scenario, the same for the steam line break, and 4.08 for the loss of electric load. The coefficient of skewness is -1.53 for the electrical problem, -0.36 for the LOCA situation and 0.09 for the steam line break. The results of this highly subjective estimation are in good agreement with those of the error analysis, namely: the secondary side steam line break with total loss of feedwater supply appeared to be the most challenging task for the crews, while the frequently trained electrical disturbance was mostly handled very successfully. Similar statistics were produced for other crew competencies, such as crew knowledge, team integration, communications, and compliance with safety requirements. An extract of the simplest statistics is given in Table 3. The main lesson from the data of this table is that most competencies show the same behaviour across the three accident scenarios as the measure of overall success. Another example from the analysis of global competencies is presented in Table 4. It shows the results of a Kruskal-Wallis test between the variables for overall success and communications with personnel outside the control room, for the steam line break scenario. The latter result is very typical: in most cases the competencies are highly interrelated.
TABLE 3
CHARACTERISTICS OF CREW COMPETENCIES

                               LOCA             STEAM L. BREAK   ELECTRIC         ALL
PARAMETER                      MEAN   VARIANCE  MEAN   VARIANCE  MEAN   VARIANCE  MEAN
Overall success                3.417  0.974     3.417  0.929     4.083  1.018     3.639
Knowledge of situation         3.875  0.850     3.417  1.213     4.458  0.779     3.917
Team cohesion                  3.708  1.042     4.125  0.885     4.458  0.932     4.097
Internal communication         4.083  0.830     4.000  0.834     4.417  0.984     4.167
External communication         3.826  0.776     4.042  0.859     4.000  0.797     3.956
Compliance with safety goals   3.375  1.279     3.000  1.474     3.188  n.a.      3.188
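The statistics in TABLE 3 are plain descriptive measures of the five-point ratings. A minimal sketch; the per-crew ratings below are placeholders chosen to reproduce the LOCA mean, not the study's raw data:

```python
from statistics import mean, variance
from scipy.stats import skew

# Hypothetical five-point overall-success ratings for 24 crews; chosen so the
# mean matches the reported LOCA value (3.417), while variance and skewness
# will differ from the reported 0.974 and -0.36.
ratings = [3, 4, 2, 5, 3, 4, 3, 2, 4, 5, 3, 3,
           4, 2, 3, 4, 5, 3, 2, 4, 3, 4, 3, 4]

print(f"mean = {mean(ratings):.3f}")          # 3.417
print(f"variance = {variance(ratings):.3f}")  # sample variance
print(f"skewness = {skew(ratings):.2f}")      # coefficient of skewness
```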
TABLE 4
RESULTS OF A KRUSKAL-WALLIS TEST
Kruskal-Wallis ANOVA by ranks
Dependent variable: FGLOBSUM - OVERALL SUCCESS
Independent (grouping) variable: FGLOBKK - EXTERNAL COMMUNICATION

GROUP   CODE   VALID N   SUM OF RANKS
1       2      1         2.5
2       3      5         32
3       4      10        124
4       5      8         141.5

Kruskal-Wallis test: H(3, N=24) = 11.079, p = 0.011
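The H statistic can be recomputed from the rank sums in TABLE 4. A minimal sketch of the standard formula, without the correction for tied ratings (five-point ratings are heavily tied, which is why the reported, tie-corrected H = 11.079 is somewhat higher):

```python
# Kruskal-Wallis H from group rank sums:
# H = 12 / (N * (N + 1)) * sum(R_i^2 / n_i) - 3 * (N + 1)
groups = [
    # (valid N, sum of ranks) per external-communication rating group
    (1, 2.5),
    (5, 32.0),
    (10, 124.0),
    (8, 141.5),
]

N = sum(n for n, _ in groups)                 # 24 crews
H = 12 / (N * (N + 1)) * sum(r**2 / n for n, r in groups) - 3 * (N + 1)
print(f"H = {H:.2f} (uncorrected for ties)")  # ~10.0 without the tie correction
```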
CONCLUSIONS

As far as the HRA related objective of the study is concerned, the results from the analysis of the simulator data are now being integrated into the HRA framework. In this process the HRA implications of the currently available results will be studied; additional analyses may also become necessary. The results of the forthcoming HRA model development are expected to be used extensively in the ongoing and future probabilistic safety analyses and PSA applications for Paks. Based on the experience from the simulator tests and on the insights gained from the data analyses so far, some conclusions have already been drawn and recommendations made for potential safety improvement measures. These conclusions and recommendations fall into the areas of simulator training, the emergency operating procedures and their use, as well as the man-machine interface complex. Taking simulator training as an example, the Paks plant has been recommended to consider the following for the purposes of improving training effectiveness, and thus increasing operator reliability and plant safety:
- inclusion of more multi-failure scenarios in the continuing training programme, with complexity similar to that of the scenarios designed for the experiments
- more focus on compliance with safety goals and procedural requirements during training
- more specific training on incident/accident diagnosis
- training on improving communications and supervisor leadership
- development of a new protocol of training evaluation.
ACKNOWLEDGEMENTS

This work has been carried out with the financial support and encouragement of the Paks NPP, the Hungarian National Committee of Technological Development and the U.S.-Hungarian Science and Technology Joint Fund. Special thanks go to the Paks Simulator Department for their assistance.
REFERENCES

Bareith, A., Borbély, S., Holló, E. and Spurgin, A. J. (1996). Treatment of Human Factors for Safety Improvements at the Paks Nuclear Power Plant. In: Cacciabue, C. and Papazoglou, I. A. (eds.), Probabilistic Safety Assessment and Management, ESREL '96 - PSAM-III, June 24-28, 1996, Crete, Greece. Vol. 2, 1191-1196.
Reason, J. (1990). Human Error. Cambridge University Press, New York, USA.
AN APPROACH TO THE HUMAN RELIABILITY ASSESSMENT IN THE CONTEXT OF THE PROBABILISTIC SAFETY ANALYSIS OF COMPLEX PLANTS
K. T. Kosmowski
Department of Electrical and Control Engineering, Technical University of Gdańsk, Narutowicza 11/12, 80-952 Gdańsk, Poland
ABSTRACT

The paper outlines an approach to the human reliability assessment in the context of the probabilistic safety analysis (PSA) of complex plants. The more important related methodological issues, currently discussed in the domain literature, are described. These include: human performance and error taxonomies, cognitive aspects of human behaviour, performance shaping factors (PSFs), mechanisms of human errors, qualitative and quantitative probabilistic modelling, expert judgement, modelling of dependencies, and the representation and treatment of uncertainties in quantitative assessments. The influence of organisational factors and management on the reliability and safety of the plant is emphasised. The approach has been implemented in a prototype software system for data and knowledge acquisition to perform 'living' PSA, supporting safety related decision making throughout the lifetime of a complex plant.
KEYWORDS

Human factors, Human errors, Performance shaping factors, Human reliability analysis (HRA), Probabilistic safety analysis (PSA), Dependencies, Uncertainties, Fuzzy set theory, Expert opinions.
INTRODUCTION
The influence of the so-called human factor on the safety of complex industrial systems, nuclear power plants (NPPs) and hazardous chemical installations in particular, is widely recognised. Human-system interactions, or simply human interactions, is the term that describes all interfaces between humans and the system (Moieni et al. 1994). Errors committed by individuals managing, operating and maintaining these systems are often the most significant causes of accidents and of the risk associated with their operation (Dougherty and Fragola 1988, Paté-Cornell and Murphy 1996). On the other hand, the state of the art of the human reliability methodology, and the methods/techniques available at present for the probabilistic assessment of human failures, indicate that this methodology is not mature. This has been confirmed by the results of experimental research aimed at validating the human reliability analysis (HRA) models more frequently used in engineering practice. Another problem is associated with the fact that human reliability assessments are profoundly dependent on expert
judgement (Mosleh et al. 1988). The results of HRA benchmark exercises, summarised e.g. in (Poucet 1988), have shown that the human error probabilities (HEP) and the assessed frequencies of some accident situations for a specified plant, obtained by different groups of HRA experts, can differ by orders of magnitude. Current limitations of existing HRA approaches become apparent when the role of the human operators is explicitly examined in the context of real events at complex plants (Dougherty 1993). Cognitive modelling is becoming a fundamental issue of the HRA methodology (Cacciabue 1992). Therefore, considerable research efforts have been undertaken at research institutions and regulatory bodies to improve the methodology, verify models and propose approaches standardising the HRA performed within PSA studies (Woods et al. 1990, Ives 1991, IAEA 1992, Llory 1992, Lydell 1992, Kosmowski et al. 1994, Reer et al. 1996). Special attention is required in the treatment of errors of commission, dependencies between hardware failures and human errors, and dependencies between latent failures and active human errors.
HUMAN PERFORMANCE AND ERRORS
Human behaviour types
The distinction of three categories of human behaviour has been proposed by Rasmussen (1983). His conceptual framework assumes three cognitive levels of human behaviour: skill-based (highly practised tasks that can be performed as more or less subconscious routines governed by stored patterns of behaviour), rule-based (performance of less familiar tasks in which a person follows remembered or written rules) and knowledge-based (performance of novel actions when familiar patterns and rules cannot be applied directly, and actions follow information processing with the inclusion of diagnosis, planning and decision making). However, it is known from real situations and HRA practice that the distinction between skill-based and rule-based actions is often more or less arbitrary. A similar difficulty is often encountered with the distinction between rule-based and knowledge-based behaviour (Reason 1990).
Error types
The behaviour types described above seem to involve different error mechanisms, which may mean radically different reliability characteristics (IAEA 1992). Human errors are often classified into the following kinds (Dougherty & Fragola 1988): a slip is an attentional failure (for example, an error in implementing a plan, decision or intention, or an unintended action); a lapse is a momentary memory failure (for example, an error in recalling a step in a task or forgetting intentions); a mistake is an error in establishing a course of actions, for example, an error in diagnosis, planning or decision making. Slips and lapses are unintended actions. They can occur during the execution of skill-based actions. Mistakes are intended actions. They are committed when knowledge-based actions are planned and executed. Mistakes are associated with more serious error mechanisms, as they lead to incorrect understanding of the situation and to conceiving an inappropriate plan of resulting actions. Mistakes can also occur in the selection and execution of rule-based actions, for example, due to inappropriate selection of a procedure (IAEA 1992).
Cognitive errors and incorrect human outputs
A cognitive error is an action or inaction that is based on a decision (a diagnosis, plan, etc., i.e. the error's causation has a 'high level' cognitive content) that produces an effect not intended by the actor or that is
inappropriate considering the situation in hindsight. A classification scheme for human errors used in THERP (Swain and Guttmann 1983) is related mainly to the observable outputs resulting from incorrect human behaviour, which are categorised as errors of commission or omission. An error of commission (EOC) is an incorrect performance of a system-required task or action, or the performance of some extraneous task or action that is not required by the system and which has the potential for contributing to a system-defined failure. An error of omission (EOM) is a failure to perform a task or action. Thus, a commission error is considered as an action, rather than an inaction, potentially but not necessarily a cognitive error (Swain's commission slip is the counterexample), and a cognitive error may not be a commission error but a cognitive omission or a violation (Reason 1990), which is not an error. The distinction between errors of omission and errors of commission is almost entirely set by the conditions represented in the functional and logical models of the plant developed in PSA studies.
Phases of actions

In PSA two main phases are considered: the time prior to an initiating event and the time after the initiating event of an abnormal situation. Human actions and errors can be divided, so as to relate them to the phases before and during an accident, into three main categories (Dougherty & Fragola 1988, IAEA 1992): (A) actions/errors in planned activities, i.e. so-called pre-initiator events that cause equipment/systems to be unavailable (dormant failures) when required post-initiator; (B) errors in planned activities that lead directly, either by themselves or in combination with equipment failures, to initiating events/faults (e.g. unplanned plant shutdown), i.e. human-induced initiators; (C) actions/errors in event-driven (off-normal) activities, i.e. post-initiator events; these can be safety actions or errors that aggravate the accident sequence. Thus, human interactions can affect the safety of the plant in a variety of ways. Unsafe acts can be classified as latent and active human failures. The impact of unintended actions can be immediate, or the resulting failures can lie dormant until an accident, being causes of the unavailability of a protecting system or of a weakness in a safety barrier. It is important to understand the potential of these failures to affect risk, especially in analysing multiple unsafe acts and possible dependencies between latent and active failures in a given accident sequence.
Error mechanisms and performance shaping factors

Unsafe actions can come about through different error mechanisms. Error mechanisms are not directly observable; only their consequences as unsafe acts can be observed (Dougherty 1993). Examples of error mechanisms include: attentional failures, memory failures, situational appraisal failures, and knowledge failures. Different error mechanisms are primarily associated with different kinds of unsafe acts. For example, incomplete knowledge and failures in situational appraisal are error mechanisms associated with mistakes, whereas failures in attention and memory failures are associated respectively with slips and lapses. Any factor that influences human performance is designated as a performance shaping factor (PSF). Many factors have been distinguished which can potentially influence the performance of the personnel operating the plant. Some of them depend on the plant conditions. PSFs can be divided into external (those outside the individual) and internal, with a special distinction for stressors (Swain & Guttmann 1983). In HRA usually only several PSFs are considered, depending on the models applied. One of the most important PSFs, and one difficult to include in HRA models, is stress. Other important PSFs to be considered in some methods are, for example: quality of control room design, training quality of the staff and other quality assurance aspects including administrative procedures, operational procedures, and operator redundancy.
HUMAN RELIABILITY ANALYSIS IN THE CONTEXT OF PROBABILISTIC ASSESSMENT OF A COMPLEX PLANT
The scope of probabilistic assessments supporting the safety oriented decision making

Several steps are distinguished in the probabilistic assessment process of a complex plant, including HRA, aimed at safety oriented decision making. Some of them can be supported by an expert system (Kosmowski et al. 1996). In parentheses a short characteristic of the consecutive steps is given:
A. Decomposition of the plant and classifying of accident initiating events (oriented on goals and functions of the plant's systems).
B. Construction of event trees and classifying of abnormal events, states of the plant and accident sequences (main effort was put on probabilistic safety analyses of level 1).
C. Construction of logical structures to account for equipment failures and human errors (the fault trees and HRA event trees for the assumed level of detail).
D. Initial assessment of frequencies for distinguished classes of: plant states, accident sequences, external releases and consequences (and initial reduction of logical structures).
E. Modifying of fault and event trees to include human errors and dependencies.
F. Quantitative probabilistic assessment for classes of: plant states, accident sequences, external releases and consequences (and uncertainty assessment).
G. Calculation of risk indices under uncertainties.
H. Comparing probabilistic results with quantitative criteria (and proposing candidates for risk reduction).
I. Cost-effectiveness analysis and risk reduction oriented decision making (ranking of alternatives and decision making according to some preference criteria).
Taking into account the purpose of this paper and its limited space, only some selected issues will be outlined below.
Human factors and human reliability analysis

Dougherty (1993) distinguishes four types of HRA modelling approaches: procedural, temporal, influential, and contextual. The influential and contextual approaches are considered as non-linear output metaphors for human performance modelling. He notices that the influential approach (holistic at quantification) and the contextual approach (reductionist, but not simply subtasks) may find themselves indistinguishable at the quantification stage because of the paucity of actual data. He also suggests that the influential and the contextual approaches may merge into a single approach. It should be emphasised that the contextual approach requires much more task analysis and situational analysis than, for example, has been exhibited in many analyses using variants of the SLIM technique (Humphreys 1988, Gertman and Blackman 1994). A framework for the contextual analysis combining plant engineering, human factors, HRA and PRA is presented in Figure 1. In a complex plant there is usually a crew of operators plus considerable
supporting personnel. Events are detected almost solely by means of a complex technical system of instrumentation and alarms. During an abnormal situation the operators follow the emergency operating procedures (EOPs). The goal to be reached is success oriented, i.e. to reach a plant state for which the consequences are mitigated. Because the situation can be dynamic (Hammond 1988) and/or involve multiple failures, it can be very difficult to diagnose the situation, to make a decision concerning the goal and to plan actions to reach this goal. Therefore, in complex dynamic situations floating goals can be pursued. There can also be conflicting goals as candidates for selection, so a matrix of goals can be specified (Dougherty 1993). Goals may have preconditions, which amount to subgoals. Each subgoal has some tasks that are needed to meet the goal. Plant conditions and the perception of the situation influence error mechanisms (Figure 1) by setting the context which determines the sensitivity of plant personnel to particular PSFs, thereby providing the opportunities for error mechanisms to become manifest and result in unsafe acts. The same error mechanism may lead to very different unsafe actions depending on the plant conditions. In addition, unsafe actions can change plant conditions, which in turn create the potential for additional PSFs to become relevant in influencing particular error mechanisms and further unsafe actions. There are several means to incorporate the human factors and results of HRA into the logical and probabilistic models of the plant (Hannaman and Spurgin 1984). In the current modelling approach it is possible to change the structure of the logical models and/or to modify probabilities of relevant events, using ordered fuzzy quantifiers with a relevant correction function on fuzzy probabilities, for:
• latent failures included mainly in fault trees;
• cognitive errors, especially errors of commission, evaluated contextually and included at higher levels in fault trees but mainly in event trees;
• the dependency level for a failure event, assessed at consecutive branching points of a given event tree with regard to the context of failure events in the sequence.
Figure 1" A framework for combining the plant engineering, human factors, HRA and PSA Basic probabilities of latent and active failures events (slips, lapses and some mistakes) are evaluated using conventional HRA techniques such as THERP and ASEP-HRAP with a support of an expert system (Kosmowski 1995). These basic probabilities can be corrected using fuzzy quantifiers to account for some additional attributes of the situation considered. In current research works more attention is paid to contextual HRA modelling approaches, especially to description of more significant factors and
probabilistic quantification of cognitive errors, especially errors of commission (based on the framework shown in Figure 1). As the basic technique to perform complex attributive analyses with the contribution of experts, the SLIM technique has been selected (Kosmowski 1995). To limit the subjectivity of the structural analysis and attributive assessments in performing HRA and PRA, a prototype software system based on expert system technology has been designed (Kosmowski 1996).
Uncertainty representation and treatment based on the fuzzy set theory

For representing and treating uncertainties, the fuzzy set theory has been applied (Dubois and Prade 1988, Kwiesielewicz and Kosmowski 1994). Probabilistic assessments are usually based partly on statistical data and partly on subjective expert judgement. In consequence we have different types of uncertainties to handle, namely stochastic uncertainty and uncertainty due to fuzziness (Cai 1996). However, there is no single framework to operate on both types of data. To overcome this problem we propose to transform statistical data into the possibilistic form and then to use a fuzzy approach for further calculations. There is a significant development of methods and considerable interest in applying the fuzzy set theory in reliability and safety assessments (Bowles and Peláez 1995, Cai 1996, Chung and Ahn 1992, Misra 1993, Onisawa 1995). In order to transform stochastic data into possibilistic form we apply the probability-possibility transformation introduced by Dubois and Prade (1982):

\pi(x) = \int_X \min(p(x), p(q)) \, dq    (1)

where \pi(x) is the possibility distribution and p(x) is the probability density function. The discrete form of (1) was used for the calculation of fuzzy probabilities or frequencies (Kwiesielewicz & Kosmowski 1994).
In order to calculate the fuzzy probability of a complex event (e.g. a top event of a fault tree) we assume that the probabilities of basic events, as fuzzy numbers, can be decomposed into a finite number of \alpha-cuts (in our case real number intervals), then use interval analysis (Moore 1966) and finally recompose a solution fuzzy probability of this complex event. Assume that two fuzzy numbers \tilde{a} and \tilde{b} have \alpha-cuts a^{\alpha} = [a^{\alpha}_{L}, a^{\alpha}_{U}] and b^{\alpha} = [b^{\alpha}_{L}, b^{\alpha}_{U}] respectively. To handle \alpha-cuts for \alpha \in [0,1] we use basic algebraic operations on positive real intervals (Moore 1966):

\tilde{c} = \tilde{a} \oplus \tilde{b} \Rightarrow \forall \alpha: (c^{\alpha}_{L}, c^{\alpha}_{U}) = (a^{\alpha}_{L} + b^{\alpha}_{L}, \; a^{\alpha}_{U} + b^{\alpha}_{U})    (2)

\tilde{c} = \tilde{a} \ominus \tilde{b} \Rightarrow \forall \alpha: (c^{\alpha}_{L}, c^{\alpha}_{U}) = (a^{\alpha}_{L} - b^{\alpha}_{U}, \; a^{\alpha}_{U} - b^{\alpha}_{L})    (3)

\tilde{c} = \tilde{a} \otimes \tilde{b} \Rightarrow \forall \alpha: (c^{\alpha}_{L}, c^{\alpha}_{U}) = (a^{\alpha}_{L} \cdot b^{\alpha}_{L}, \; a^{\alpha}_{U} \cdot b^{\alpha}_{U})    (4)

\tilde{c} = \tilde{a} \oslash \tilde{b} \Rightarrow \forall \alpha: (c^{\alpha}_{L}, c^{\alpha}_{U}) = (a^{\alpha}_{L} / b^{\alpha}_{U}, \; a^{\alpha}_{U} / b^{\alpha}_{L})    (5)
It is easy to see that the interval operations reduce to algebraic operations on the lower (left) and upper (right) bounds of intervals and can be extended to matrix-vector operations. Generalising the matrix formalism from (Kaplan 1982) we can write a formula for the calculation of fuzzy frequencies of distinguished plant states

\tilde{F}_Y = \tilde{F} \otimes \tilde{M}    (6)

where \tilde{F} is a vector of fuzzy frequencies of initiating events, \tilde{M} is a plant response matrix containing the conditional probabilities that the i-th initiator will result in the j-th plant state, and \otimes is the symbol of multiplication of fuzzy vectors and matrices containing fuzzy numbers. In a similar way vectors of fuzzy frequencies of external releases \tilde{F}_R and consequences \tilde{F}_X can be obtained, which are used for the calculation of risk indices.
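Equation (6) then amounts to ordinary matrix-vector multiplication with the interval operations above substituted for scalar addition and multiplication; a short sketch reusing the Interval class from the previous example (data layout is hypothetical):

def fuzzy_mat_vec(F, M):
    # F: list of fuzzy initiator frequencies, each a dict alpha -> Interval.
    # M: plant response matrix, M[i][j] a fuzzy conditional probability.
    # Returns the vector of fuzzy plant state frequencies F_Y, equation (6).
    alphas = list(F[0].keys())
    n_states = len(M[0])
    FY = []
    for j in range(n_states):
        acc = {a: Interval(0.0, 0.0) for a in alphas}
        for i, Fi in enumerate(F):
            for a in alphas:
                acc[a] = acc[a] + Fi[a] * M[i][j][a]
        FY.append(acc)
    return FY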
Treating of dependencies

Dependence mechanisms associated with a combination of latent and active human failures are particularly important in probabilistic assessments, because this combination can both initiate an accident sequence and cause failure of the installed safety barriers and defences. This can dramatically increase the frequency of an emergency condition of the plant and change the relative contribution of such sequences to risk, compared with sequences for which such failures were assumed to be truly independent. It was assumed that the dependencies considered in the quantitative evaluation of event trees, associated with phenomenological influences or human-system interactions, will be represented by linguistic expressions, due to the complexity of the problem and the many factors involved (Swain & Guttmann 1983, Misra 1993, Onisawa 1995). In cases of potential dependencies between failure events in an event tree, we treat a branching point of this tree in a special way, introducing a correction function for the fuzzy conditional probability of a given branch:

\tilde{q}^{e}_{l,m} = \chi^{e}_{l,m}(\tilde{q}'_{l,m})    (7)
where \chi^{e}_{l,m}(\cdot) is the dependence correction function for the conditional probability \tilde{q}'_{l,m} in a branch of level (l) and number (m) of a given event tree; this function is constructed with regard to a linguistic statement e \in E. The correction functions, based on the proposal of Swain & Guttmann (1983) known as the positive dependence model, are presented in Table 1. The conditional probabilities \tilde{q}'_{l,m} are assessed from fuzzy probabilistic models based on a project data base (developed on the basis of external data bases and/or expert judgement) with regard to events including equipment failures and/or human errors. In more complex cases a fault tree with fuzzy probabilities is used as the logical modelling framework (Tanaka et al. 1983). In a computer programme for quantifying the event tree it is possible to assess the fuzzy probability of sequence k (a failure path) with or without correction of the probabilities of consecutive branches:

\tilde{m}^{e}_{k} = \prod_{(l,m) \in V_k} \tilde{q}^{e}_{l,m}    (8)

where V_k is the set of pairs (l,m) denoting the level (l) and number (m) of the branching points belonging to accident sequence k.

TABLE 1
DEPENDENCE CORRECTION FUNCTIONS

Linguistic expression (e \in E)   Dependence correction function \chi^{e}_{l,m}(\cdot) for the probability \tilde{q}'_{l,m} in branch l,m
Zero Dependence (ZD)              \tilde{q}^{e}_{l,m} = \tilde{q}'_{l,m}
Low Dependence (LD)               \tilde{q}^{e}_{l,m} = (1 \oplus 19 \otimes \tilde{q}'_{l,m}) \oslash 20
Moderate Dependence (MD)          \tilde{q}^{e}_{l,m} = (1 \oplus 6 \otimes \tilde{q}'_{l,m}) \oslash 7
High Dependence (HD)              \tilde{q}^{e}_{l,m} = (1 \oplus \tilde{q}'_{l,m}) \oslash 2
Complete Dependence (CD)          \tilde{q}^{e}_{l,m} = 1
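A sketch of the positive dependence corrections of Table 1 applied cut-by-cut to a fuzzy conditional probability (reusing the Interval class above; an illustrative implementation, not the authors' code):

def correct_dependence(q, level):
    # q: fuzzy conditional probability as a dict alpha -> Interval.
    # level: one of 'ZD', 'LD', 'MD', 'HD', 'CD' (Table 1).
    one = Interval(1.0, 1.0)
    if level == 'ZD':
        return q
    if level == 'CD':
        return {a: Interval(1.0, 1.0) for a in q}
    k = {'LD': 19.0, 'MD': 6.0, 'HD': 1.0}[level]  # (1 + k*q) / (k + 1)
    return {a: (one + Interval(k, k) * q[a]) / Interval(k + 1.0, k + 1.0)
            for a in q}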
For illustrating the evaluation process of the components of the matrix \tilde{M}, an example of an event tree is shown in Figure 2. A top event A_i in such a tree can be the loss of a function due to hardware failure(s) and/or human error(s). Failures are represented graphically on this tree by downward bifurcations at the nodes
(branching points). The relevant events are assessed using a fault tree and/or an HRA event tree (Swain and Guttmann 1983). As described above, the human reliability analysis is usually very complex. The context of a given situation, including latent and active failures in the accident sequence, should be considered. There is also the potential for dependent hardware failures (partly due to human errors). It was assumed that latent errors (L) committed during maintenance, calibration, etc. can affect every node (Figure 2). For different accident sequences the dynamics of the plant, the safety related conditions and the factors influencing human behaviour can be quite different, requiring operators to follow the appropriate emergency operating procedures (EOPs) and to decide about the current safety goal for a given plant state. Such complex man-technology interactions require careful consideration of potential dependencies, which can be quite different for a given top event in various accident sequences. As a result, for example, the probability \tilde{q}^{e}_{2,1} can be different from \tilde{q}^{e}_{2,2}.
Figure 2: An example of the event tree (top events A_1, A_2, A_3; plant damage states Y_1, Y_2, Y_3)

If an accident sequence contains both failure and success branches, it is necessary to modify formula (8), replacing in the appropriate places the failure fuzzy probability \tilde{q}^{e}_{l,m} by the success branch fuzzy probability \tilde{p}^{e}_{l,m} calculated from the following formula

\tilde{p}^{e}_{l,m} = 1 \ominus \tilde{q}^{e}_{l,m}    (9)

where the symbol \ominus denotes the subtraction of fuzzy numbers (3); in this formula the real number 1 is treated as a special case of a fuzzy number. Knowing the probabilities (fuzzy numbers) \tilde{m}^{e}_{k} of the consecutive sequences of the event tree (for the i-th initiator) and the plant damage state y_j for each sequence, the components of the matrix \tilde{M} are calculated from the following formula

\tilde{m}_{ij} = \bigoplus_{k \in K_{ij}} \tilde{m}^{e}_{k}    (10)

where \oplus is the symbol of the addition of fuzzy probabilities (fuzzy numbers) and K_{ij} denotes the set of sequences with the end state y_j in the i-th event tree.
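To make formulas (8)-(10) concrete, a hypothetical sketch that multiplies the (corrected) branch probabilities along a sequence, using 1 \ominus q for success branches, and then sums the sequences ending in the same damage state (Interval class as above):

def sequence_probability(branches):
    # branches: list of (q, is_failure) pairs along one accident sequence;
    # q is a fuzzy branch probability (dict alpha -> Interval). Failure
    # branches contribute q, success branches 1 (-) q -- formulas (8)-(9).
    one = Interval(1.0, 1.0)
    alphas = list(branches[0][0].keys())
    m = {a: Interval(1.0, 1.0) for a in alphas}
    for q, is_failure in branches:
        for a in alphas:
            m[a] = m[a] * (q[a] if is_failure else one - q[a])
    return m

def plant_state_component(sequences):
    # Sum of the fuzzy probabilities of the sequences ending in the same
    # plant damage state -- formula (10).
    alphas = list(sequences[0].keys())
    total = {a: Interval(0.0, 0.0) for a in alphas}
    for m in sequences:
        for a in alphas:
            total[a] = total[a] + m[a]
    return total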
The analysis of human failure events and the assessment of human reliability is usually quite complex, requiring appropriate techniques that take into account the relevant performance shaping factors (PSFs), the error types with their psychological mechanisms, and the recovery potential (Kosmowski 1995). Human failures can degrade the safety related functions, which can be considered at three levels of the plant hierarchy: (1) components, (2) subsystems and (3) inter-systems, in consecutive accident sequences. Dependencies associated with human actions and errors, as well as recovery potential, can be analysed for the first two levels using, for example, a method described in THERP (Swain and Guttmann 1983). Below, an approach is outlined for the assessment of inter-system dependencies for an accident sequence of the event tree. The analysis of complex man-technology interactions is based profoundly on expert judgement and, therefore, the related assessments can only be approximate. It is proposed to use linguistic statements and apply a fuzzy set theory framework in a similar way as described above. For evaluating the level of dependency (ZD, LD, MD, HD, CD) for a given node of the event tree, a set of rules has been proposed and implemented in a prototype expert system. This system is used in the following way. The expert system asks the user for an identifier of a given branching point and starts the evaluation session, in which the analyst answers questions by selecting one or more options associated with the description of conditions, factors and the situation context. After this phase the system presents the selected data and asks for confirmation. If the data are confirmed by the user, the inference process starts on a set of rules, giving the dependency level as the result. An example of a simplified rule is as follows:

IF the_contribution_of_latent_failures is IMPORTANT
AND the_behaviour_type is RULE_BASED
AND EOPs_quality_for_the_situation is MODERATE
AND the_stress_level is HIGH
AND the_time_window is SHORT
THEN the_dependence_level is HIGH

The inferred dependency level is then used for the calculation of fuzzy conditional probabilities according to Table 1 and formulas (7)-(10). The last node in the sequence is treated in a special way to take into account global recovery factors and phenomenological uncertainties. The expert system is not applicable when the probabilities of branches associated with dependent human failures in a given accident sequence have been qualitatively and quantitatively assessed using holistic multi-expert techniques such as SLIM or APJ.
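Such a rule base can be mimicked by a simple forward-chaining lookup; the sketch below is a hypothetical stand-in for the prototype expert system (attribute and value names are illustrative) and feeds its result into the correction function defined earlier:

def infer_dependence_level(facts):
    # facts: attribute name -> linguistic value chosen by the analyst.
    # Rules are (conditions, inferred level); the first full match wins.
    rules = [
        ({'latent_contribution': 'IMPORTANT', 'behaviour_type': 'RULE_BASED',
          'EOP_quality': 'MODERATE', 'stress_level': 'HIGH',
          'time_window': 'SHORT'}, 'HD'),
        ({'behaviour_type': 'SKILL_BASED', 'stress_level': 'LOW'}, 'LD'),
    ]
    for conditions, level in rules:
        if all(facts.get(k) == v for k, v in conditions.items()):
            return level
    return 'ZD'  # default: treat the branches as independent

facts = {'latent_contribution': 'IMPORTANT', 'behaviour_type': 'RULE_BASED',
         'EOP_quality': 'MODERATE', 'stress_level': 'HIGH',
         'time_window': 'SHORT'}
level = infer_dependence_level(facts)         # -> 'HD'
# q_corrected = correct_dependence(q, level)  # then apply Table 1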
CONCLUDING REMARKS

Human failure events very significantly influence the risk associated with the operation of a complex plant. In the probabilistic modelling of a sociotechnical system it is necessary to consider potential cognitive human errors, especially errors of commission, and possible dependencies between failure events. This should be carried out in contextual analyses within a framework which combines plant engineering, human factors, HRA and PRA. This framework has been developed for designing a prototype expert system to support the integrated probabilistic modelling of complex plants, including the human reliability analysis. The methods and procedures of probabilistic assessments under uncertainties based on the fuzzy set theory have been developed. It is possible to calculate fuzzy frequencies of accident sequences and risk indices. The logic models and probabilistic assessments of the plant can be easily modified during the lifetime of the plant, as new evidence becomes available or additional knowledge is acquired, to support safety related decision making. In the approach described in this paper the method of positive dependence proposed by Swain was adapted. However, a more general method for evaluating inter-system dependencies in accident sequences should be developed, to identify factors which potentially degrade safety related functions, taking into account structural and phenomenological dependencies, dependent equipment failures (common cause failures) and human failures, as well as potential influences of organisational factors. Due to the complexity of the problem and the many sources of uncertainty involved, the development of methods based on the fuzzy set theory is becoming attractive. To limit the variability and subjectivity of analyses,
applying expert system technology is proposed, although the quality aspects of its design require a systematic problem representation, the development of pragmatic dependency models and knowledge acquisition from PSA and HRA experts.
REFERENCES
Bowles, J.B. and Peláez, C.E. (1995). Application of fuzzy logic to reliability engineering. Proceedings of the IEEE 83:3, 435-449.
Cacciabue, P.C. (1992). Cognitive modelling: A fundamental issue for human reliability assessment methodology? Reliability Engineering and System Safety 38, 91-97.
Cai, K.-Y. (1996). Introduction to Fuzzy Reliability. Kluwer Academic Publishers, Boston, USA.
Chung, M.H. and Ahn, K.I. (1992). Assessment of the potential applicability of fuzzy set theory to accident progression event trees with phenomenological uncertainties. Reliability Engineering and System Safety 37, 237-252.
Dougherty, E.M. and Fragola, J.R. (1988). Human Reliability Analysis: A Systems Engineering Approach with Nuclear Power Plant Applications. A Wiley-Interscience Publication, John Wiley & Sons Inc., New York, USA.
Dougherty, E. (1993). Context and human reliability analysis. Reliability Engineering and System Safety 41, 25-47.
Dubois, D. and Prade, H. (1982). On several representations of an uncertain body of evidence. In: M.M. Gupta and E. Sanchez (Eds.), Fuzzy Information and Decision Processes. North-Holland.
Dubois, D. and Prade, H. (1988). Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Press, New York, USA.
Gertman, D.I. and Blackman, H.S. (1994). Human Reliability and Safety Analysis Data Handbook. John Wiley & Sons Inc., New York, USA.
Hammond, K.R. (1988). Judgement and decision making in dynamic tasks. Information and Decision Technologies 14, 3-14.
Hannaman, G.W. and Spurgin, A.J. (1984). Systematic Human Action Reliability Procedure (SHARP). EPRI NP-3583, Research Project 2170-3, USA.
Humphreys, P. (Ed.) (1988). Human Reliability Assessor Guide. RTS 88/95Q, Safety and Reliability Directorate, UK.
IAEA (1992). Procedure for Conducting Human Reliability Analysis in Probabilistic Safety Assessment. International Atomic Energy Agency, draft report, Vienna, Austria.
Ives, G. (1991). Developing expert systems to analyze human performance in NPP events. Journal of ENS (European Nuclear Society), Nuclear Europe Worldscan, 11-12.
Kaplan, S. (1982). Matrix theory formalism for event tree analysis: application to nuclear risk analysis. Risk Analysis 2, 9-18.
Kosmowski, K.T., Degen, G., Mertens, J. and Reer, B. (1994). Development of Advanced Methods and Related Software for Human Reliability Evaluation within Probabilistic Safety Analyses. Berichte des Forschungszentrums Jülich, 2928, Germany.
Kosmowski, K.T. (1995). Issues of the human reliability analysis in the context of probabilistic studies. International Journal of Occupational Safety and Ergonomics 1:3, 276-293.
Kosmowski, K.T., Duzinkiewicz, K., Drapella, A., Augusiak, A., Baum, G., Jackowiak, M. and Kwiesielewicz, M. (1996). Features of a knowledge based system REPSA1ES for probabilistic safety analysis. In: Development of Safety Related Expert Systems. International Atomic Energy Agency, IAEA-TECDOC-856, 183-202, Vienna, Austria.
Kwiesielewicz, M. and Kosmowski, K.T. (1994). Uncertainty representation frameworks for probabilistic safety studies. Technical University of Gdańsk, The Scientific Papers of Electrical Engineering Faculty 7, 135-148.
Llory, M.A. (1992). Human reliability and human factors in complex organizations: epistemological and critical analysis - practical avenues to action. Reliability Engineering and System Safety 38, 109-117.
Lydell, B.O.Y. (1992). Human reliability methodology. A discussion of the state of the art. Reliability Engineering and System Safety 36, 15-21.
Misra, K.B. (Ed.) (1993). New Trends in System Reliability Evaluation. Elsevier Science Publishers B.V., Amsterdam, The Netherlands.
Moieni, P., Spurgin, A.J. and Singh, A. (1994). Advances in human reliability analysis methodology. Part I: Frameworks, models and data. Part II: PC-based HRA software. Reliability Engineering and System Safety 44, 27-66.
Moore, R. (1966). Interval Analysis. Prentice Hall, Englewood Cliffs.
Mosleh, A., Bier, V.M. and Apostolakis, G. (1988). A critique of current practice for the use of expert opinions in probabilistic risk assessment. Reliability Engineering and System Safety 20, 63-85.
Onisawa, T. (1995). System reliability from the viewpoint of evaluation and fuzzy set theory approach. In: Onisawa, T. and Kacprzyk, J. (Eds.), Reliability and Safety Analysis under Fuzziness. Physica-Verlag, Heidelberg, Germany.
Paté-Cornell, M.E. and Murphy, D.M. (1996). Human and management factors in probabilistic risk analysis: the SAM approach and observations from recent applications. Reliability Engineering and System Safety 53, 115-126.
Poucet, A. (1988). Survey of methods used to assess human reliability in the human factors reliability benchmark exercise. Reliability Engineering and System Safety 22, 257-268.
Rasmussen, J. (1983). Skills, rules, knowledge; signals, signs and symbols and other distinctions in human performance models. IEEE Transactions on Systems, Man and Cybernetics SMC-13:3.
Reason, J. (1990). Human Error. Cambridge University Press, USA.
Reer, B., Sträter, O. and Mertens, J. (1996). Evaluation of Human Reliability Analysis Methods Addressing Cognitive Error Modelling and Quantification. Berichte des Forschungszentrums Jülich, 3222, Germany.
Swain, A.D. and Guttmann, H.E. (1983). Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Application. NUREG/CR-1278, USA.
Tanaka, H., Fan, L.T., Lai, F.S. and Toguchi, K. (1983). Fault-tree analysis by fuzzy probability. IEEE Transactions on Reliability R-32:5, 453-457.
Woods, D.D., Pople, H.E. and Roth, E.M. (1990). The Cognitive Environment Simulation as a Tool for Modeling Human Performance and Reliability. Westinghouse Science and Technology Center, Pittsburgh, NUREG/CR-5213, USA.
HRA STUDY OF COGNITIVE RELIABILITY IN A NPP TRAINING SIMULATOR

Gao Jia¹, Huang Xiang-Rui² and Shen Zu-Pei²

¹University of Science & Technology Beijing, Beijing, 100083, China
²Tsinghua University, Beijing, 100084, China
Email: huangrui@pwrs.eea.tsinghua.edu.cn
ABSTRACT

This paper describes a simulator-based experiment for the evaluation of Chinese NPP crews' reliability using response time data and offers some opinions on the results based on the HCR model. It is useful for improving the training quality in NPPs and also serves as an effort towards the study of HRA in PRA.
KEY WORDS Human Reliability Analysis, Probabilistic Risk Assessment, Human Cognitive Reliability Model, Simulator Experiment, Crew Response Time, S-R-K Classification Model
1. INTRODUCTION

Human Reliability Analysis (HRA) is an important issue for NPP safety assessment, but it is still difficult to deal with, since there are no fully appropriate models and data for HRA. Although expert judgment can be used to identify and estimate human error probabilities, the variations to be expected are
considerable at present. Therefore, steps have been identified and taken by countries worldwide with operating nuclear power plants to improve the quality of HRA studies. The study of operator reliability using full-scale nuclear power plant simulators can meet two aspects of demand: one is to collect operator response time and reliability data which can fill gaps in the database; the other is to enhance understanding of operator performance during emergency events. These data can form the basis for quantifying human reliability in probabilistic risk assessment (PRA). The National Nuclear Safety Administration of China has paid great attention to developing the technique of HRA and has viewed it as one of the key projects in the Eighth and Ninth five-year plans of the nation [1][2]. We have carried out a research program of experiments on a simulator and obtained some staged results.
2. OBJECTIVES AND REQUIREMENTS
The aim of the operator reliability experiment is to collect operator response time and reliability data using the full-scale nuclear power plant simulator at Tsinghua University with actual operating crews from an NPP. In addition, the three parameters of the Weibull distribution in the HCR model suited to the characteristics of Chinese crews are expected to be obtained. The requirements of the experiment are as follows:
(1) The scenarios used in the simulator sessions should be selected to represent, to the feasible extent, some of the sequences and key human interactions (HIs) known to be important in PRA studies and helpful for the verification of the HCR model.
(2) The crews' response times should be recorded accurately during the accident sequences. Two observers were in charge of this task using stopwatches; meanwhile, a video recorder was also used throughout the sessions.
(3) During the sessions, the operators were asked to fill out the MMPI and Cattell questionnaires to acquire their personal psychological information, to serve as a reference for the crew response time data.
(4) Analysis of the collected data could be carried out using the Tsinghua computer code HRAS (Human Reliability Analysis System) to obtain the crew non-response probability fitting curves and the parameters of the HCR model, which would be of help to the HRA database in PRA.
(5) Based on the conclusions drawn from the experiment, constructive suggestions could be put forward to improve the training methods and the safe operation of NPPs in China.
(6) To develop, preliminarily, an approach to collecting data from simulators, do preparatory work for the construction of an HRA database and lay the foundation for further study in PRA.
The experiment was performed on the 950 MW three-loop PWR full-scale simulator at Tsinghua University; 54 operators were divided into 10 crews. For the benefit of the study of HRA in PRA, 5 scenarios of accident sequences were selected: SGTR, LOCA, MSLB, LOSP1 (without following execution) and LOSP2 (with following execution). These were included in the training sessions, have relatively high probabilities of occurrence, and are very important in the analysis of the safe operation of NPPs, especially regarding the impact of human interactions on the progression of these 5 accident sequences. Table 1 shows the natures of the selected scenarios and Table 2 shows the background information for the crews. This experiment only focused on required actions performed accurately under limited time, that is, free of errors. In other words, only the actions associated with P2 were expected [3].
TABLE 1
THE CHARACTERISTICS OF SELECTED SCENARIOS

No.  Scenario  Num. of Malfunctions  Num. of Cues  Num. of HIs  Type of behaviour
1    SGTR      1                     4             1            Rule
2    MSLB      1                     4             2            Rule
3    LOCA      1                     7             2            Rule
4    LOSP1     1                     4             1            Rule
5    LOSP2     1                     4             1            Rule
TABLE 2
CREWS' BACKGROUND INFORMATION

Item               Category     Num. of persons  Percentage
Age                <30          25               46%
                   30-40        28               52%
                   >40          1                2%
Education          College      10               19%
                   University   44               81%
Experience         <5 years     23               42.6%
                   ≥5 years     31               57.4%
Performance Level  Good         30               55.5%
                   Average      9                16.7%
                   uncounted    15               27.8%
Position Status    SRO          18               33%
                   RO           10               19%
                   Personnel    26               48%

3. RESULTS AND INTERPRETATION OF COLLECTED DATA
The experiment data were collected from 5 pre-defined scenarios including 5 human interactions. The response time is the time from the original signal(s) or cue(s) of the plant disturbance up to the time at which the operator initiates the first correct action. After screening and statistical treatment of the data using the Tsinghua code HRAS, Table 3 gives the normalized response times for each crew during the 5 different scenarios.
TABLE 3
CREWS' NORMALIZED RESPONSE TIMES

No. of crew  SGTR (HI1)  MSLB (HI2)  LOCA (HI3)  LOSP1 (HI4)  LOSP2 (HI5)
1            0.647       1.703       0.360       0.757        2.440
2            1.215       0.505       1.533       1.540        2.200
3            0.845       1.040       0.545       0.820        2.010
4            1.330       0.837       1.020       0.820        0.900
5            1.440       0.870       1.308       0.873        0.360
6            0.510       0.735       0.580       1.495        2.290
7            0.915       1.320       1.500       1.180        0.420
8            1.175       1.395       1.240       0.585        0.510
9            1.585       1.390       1.160       3.720        0.660
10           1.190       1.180       1.098       1.706        0.510
Mean         1.100       1.150       1.190       1.200        1.270
SD           0.450       0.600       0.640       0.830        0.810
4. ANALYSIS AND RESULTS

(1) The Human Cognitive Reliability (HCR) model was proposed in 1984 by Hannaman, Spurgin & Lukic [4][5]. It was developed for the quantification of crew success (or non-response) probability as a function of time, allowing for the various types of human behavior (i.e. skill-, rule- and knowledge-based) that can result in significantly different non-response probabilities. The HCR model is mathematically represented by three curves of normalized three-parameter Weibull distributions, one for each behavior category, as follows:

P(t) = \exp\{ -[ ((t/T_{0.5}) - C_{\gamma i}) / C_{\eta i} ]^{\beta_i} \}   for t/T_{0.5} \geq C_{\gamma i}
P(t) = 1.0   for t/T_{0.5} < C_{\gamma i}

where
t = the available time to finish the task
T_{0.5} = crew median response time
C_{\gamma i}, C_{\eta i}, \beta_i = location, scale and shape parameters associated with the category of cognitive behavior
P(t) = crew non-response probability at time t

TABLE 4
THE FITTING PARAMETERS OF THE WEIBULL DISTRIBUTION FOR 5 HIs

HIs    Cγ     Cη     β
SGTR   0.07   1.10   2.25
MSLB   0.18   1.05   1.48
LOCA   0.04   1.19   1.59
LOSP1  0.18   1.06   0.65
LOSP2  0.38   1.10   0.65
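A minimal Python sketch evaluating the HCR non-response probability with the fitted parameters of Table 4 (the function name and the example call are illustrative):

import math

def hcr_nonresponse(t, t50, c_gamma, c_eta, beta):
    # HCR crew non-response probability at time t: a normalized
    # three-parameter Weibull with location c_gamma, scale c_eta, shape beta.
    x = t / t50
    if x <= c_gamma:
        return 1.0
    return math.exp(-(((x - c_gamma) / c_eta) ** beta))

# SGTR parameters from Table 4; t and t50 in the same time units
print(hcr_nonresponse(t=2.0, t50=1.0, c_gamma=0.07, c_eta=1.10, beta=2.25))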
In our experiment we have used our own-developed software HRAS (Human Reliability Analysis System) for building the framework for the operator response time database, compiling data, fitting distribution curves with Weibull and lognormal distributions, and providing various statistical treatment functions, tables and figures.
The time window t can be estimated using thermal-hydraulic calculations or transient analysis; it includes the total time of detection, diagnosis and decision-making and the time for completing the action. T_{0.5} and σ can be obtained from experiments under various conditions on simulators or by expert judgment.
(2) The fitting curves of Non-Response Probabilities (NRP) were made with Weibull and lognormal distributions, that is, operator diagnosis non-response probability versus normalized time (i.e. the ratio of the actual crew response time to the median response time) for the 5 human interactions. Table 4 lists the fitting parameters of the Weibull distribution. Figures 1 to 6 show the NRP fitting curves with the Weibull distribution.
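Such fitting can be reproduced with standard tools; the sketch below uses scipy's three-parameter Weibull on the normalized SGTR times of Table 3 (this is not the HRAS code, and the estimates will differ from Table 4 depending on data screening):

import numpy as np
from scipy.stats import weibull_min

# Normalized SGTR response times from Table 3
times = np.array([0.647, 1.215, 0.845, 1.330, 1.440,
                  0.510, 0.915, 1.175, 1.585, 1.190])

# Maximum-likelihood fit returns (shape beta, location, scale); the
# location parameter plays the role of C_gamma in the HCR notation.
beta, loc, scale = weibull_min.fit(times)
print(beta, loc, scale)

# The non-response probability is the survival function of the fitted law
print(weibull_min.sf(2.0, beta, loc=loc, scale=scale))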
FIGURE 2
SGTR fittin~ curve
MSLB fittinz curve
1
nrp
nrp
0. I
0.1
0.01 O.
|
|
. . . . . .
1
.
.
.
.
.
.
0.01 0. I
.
10
1
normalized time
|
i
i
i
i
i
i iI
i
. . . . .
! normalized time
FIGURE 3
FIGURE 4
LOCA fitting curve
LOSP 1fitting curve
lo
nrp
nrp
0. I
0.1
i 0.01 0.1
,
normallzecl tlme
,
0.01 0. I
| , ,
n
~ l FIGURE 6
FIGURE 5
Rule-based fitting curve
LOSP2 fitting curve nrp
. . . . .
w
.
.
.
.
.
.
0
nlp
.
0.1
0,
1
t 0.01 0.1
. . . . . . . . . . . . . . 1
normalized time
,,
0.01 0. I
O .
.
.
.
.
.
.
.
i
1
normalized time
. . . . . . .
/o
I0
5. DISCUSSION AND RECOMMENDATIONS

(1) It is feasible to use the HCR model with either a Weibull or a lognormal distribution to assess crew non-response probability, because response times are variable. The conclusions drawn from the experiment are similar to those reported by other countries, but the data show distinct differences. The fitting parameters depend heavily on the screening and analysis of the data; the important factors include the determination of the types of HIs, the value of the recorded data, and the analysis and interpretation of anomalous data points, all of which can produce an obvious influence on the results.
(2) From the analysis of the data, it is not easy to distinguish the groups of skill- and rule-based behavior, so we view all the HIs as rule-based behavior except LOSP2, which demands the crews to operate according to a certain strategy and shows some trend towards knowledge-based behavior. In some scenarios, the cognitive process of experienced crews may be viewed as skill-based behavior, while new crews may show more characteristics of the rule-based type.
(3) The analysis shows that the study of crew response time on simulators can be helpful to on-line risk monitors and safety management in NPPs. For the same HI, different crews may have various median times; well-trained crews may need less time than others. The SDs, in general, increase as the cognitive behavioral type associated with the task moves from skill- to knowledge-based behavior. So, the median time and SD can be used for evaluating the quality of training and predicting the reliability of the crews.
(4) The study of operator response behavior on simulators is of use to HRA in PRA, and helps to improve the training methods and the quality of procedures so that ambiguity and confusion are avoided during accident sequence progression. Therefore, making full use of the existing simulators in Chinese NPPs to develop further studies of human reliability is a meaningful strategy for plant efficiency and availability.
(5) Further study of operator reliability will be focused on: (a) refining the approach of data collection on simulators and the detailed experiment planning, (b) setting up the Chinese crew reliability database, (c) enhancing the application of the experimental results from simulators to risk management in NPPs, (d) carrying out research on a dynamic human cognitive reliability model for HRA in China.
References
1. Research for Human Factor Engineering Application. Eighth Five-year National Science and Technology Key Project: 85-213-04-0316, 1995. Technical Report by National Nuclear Safety Center and Research Group (China).
2. Huang, Xiang-Rui (1993). A simulator-based evaluation of operator behavior in a Chinese Nuclear Power Plant under ATWS condition. Reliability Engineering and System Safety 40, 195-197.
3. IAEA-TECDOC-592 (1991). Case study on the use of PSA methods: Human reliability analysis.
4. Spurgin, A.J. (1990). Operator Reliability Experiments Using Power Plant Simulators. Vol. 1: Executive Summary; Vol. 2: Technical Report. EPRI NP-6937.
5. Bareith, A., Karsa, Z., Kiss, I. and Spurgin, A. (1994). Crew Reliability Experiments at Paks NPP: Application to HRA. IAEA Technical Committee Meeting on Advances in HRA and PRA, Szentendre, Hungary, Sept. 20-23.
A7" Operational Errors and Support Systems
RATING SYSTEMS AND CHECKLISTS AS TOOLS IN DEVELOPMENT AND IMPLEMENTATION OF ENVIRONMENTAL AND SAFETY MANAGEMENT SYSTEMS

Hans Chr. Christiansen¹ *¹ and Else Hansen²

¹Deloitte & Touche Environmental Services, DK-1780 Copenhagen V, Denmark
²Kommunekemi a/s, DK-5800 Nyborg, Denmark
ABSTRACT
In the development and implementation of an environmental and safety management system at Kommunekemi a/s, the central Danish hazardous waste treatment plant, a rating system was used as a sort of catalogue of new ideas and a checklist to fulfil the requirements of the standards BS 7750 (British Standard 7750, Specification of an Environmental Management System) and the European Union Eco-Management and Audit Scheme (EMAS). Rating systems are often used for evaluating companies' environmental or safety performance as a form of retrospective assessment, but in this case a rating system was used in a more forward-looking approach, as a development tool. It is the intention that the specially designed rating system for Kommunekemi, KKRS, will be used at regular intervals as a source of inspiration, at the same time as an evaluation of the system against the environmental standards, and to give a rating of the ongoing work done by the personnel at the company.
KEYWORDS
Rating system, checklists, management system, environment, safety, evaluation, development, implementation.
INTRODUCTION
Rating systems or checklists have been used for a long time as tools for testing the fulfilment of specified requirements or for ensuring that activities are carried out. This is well known for technical issues (for instance inspections, testing and material certification) as well as management issues (auditing and certification of quality systems etc.). These activities are different kinds of conformity audits or tests, and the purpose is to test whether the requirements of the rating system, checklist or standard are fulfilled. The intention of the test is that the answer should be positive to all questions, or at least to a defined number of the questions. The auditor and auditee are not to decide whether the questions are relevant or not.
*¹ The author was until July 1996 Environmental and Safety Manager at Kommunekemi a/s.
In this use, the questions are often closed questions and they are strictly related to the matter under investigation. Especially for technical issues the questions are often very exact. There is no intention to broaden the scope of the investigation, and the only thing of interest is whether the answer to the question is "yes" or "no". This way of using a rating system or checklist gives a retrospective view of how the system works or how the performance of the company is, in other words a sort of documentation of the work done. When the rating has been finished, the next step will then be to make some corrections and re-engineering of the system or the activity, then do a new test and start a new cycle.
EXAMPLE 1
TRADITIONAL TYPES OF QUESTIONS IN RATING SYSTEMS AND CHECKLISTS

- Is there an alarm summary permanently on display?
- Is there an emergency shutdown capability?
- Have all piping, equipment, seals, gaskets etc. been pressure tested (hydro-test and/or leak detection as appropriate)?
- Are all instruments calibrated?
- Have all control logic, interlocks and emergency shutdown controls been tested on an item-by-item basis?
The traditional use of rating systems and checklists is often as a tool to investigate more technical issues, because the interest lies in the question of whether something is functioning or not. Because of this, the questions are often closed questions related to one or a few possibilities. Another approach to rating systems and checklists is to use them as development tools. Here rating systems and checklists are used both to evaluate the existing work and to give inspiration for further work. This approach can be of great importance for developing environmental and safety management systems or other types of management systems. Rating systems and checklists focus here on the management systems, activities etc. To do this, the rating systems and checklists must give advice and inspiration and must for this purpose reflect a sort of "best practice" in the area of the activities that are evaluated. The rating system represents a tool that presents different possibilities for development to the company. The company is challenged and forced to take decisions related to the questions and the information that the questions reflect. The intention in this approach is not that the company shall just try to fulfil all questions. The rating systems and checklists contain a huge number of questions, and some of the activities the company might simply decide not to go for. The approach leads the company through a process of self-evaluation and helps the company to form a system that meets the company's demands. At the same time, the tools should still give an evaluation or rating of the existing activities. This makes it possible to test whether the activities are improving, and it makes the situation of the company visible to the management. This inspiration is given by questioning with open questions that do not define exactly what to do but ask the company to evaluate how it intends to fulfil the question and implement the decided activity. It is not enough to know whether the answer is "yes" or "no". It is more important to verify that the company has professionally carried out evaluations and implemented methods to manage the risks. The questions are changed from asking if well-defined methods or techniques are implemented to reduce a type of risk, to asking if
the company has implemented a system to manage the risk through an ongoing process of evaluation, decision and implementation. The questions in this approach to rating systems and checklists mean that the questionnaires are typically not related to technical issues but more to the management area.
EXAMPLE 2
QUESTIONS IN RATING SYSTEMS WITHIN A DEVELOPMENT APPROACH

- Has a person or team been named to co-ordinate the engineering controls program?
- Does this person or team have adequate qualifications and resources to co-ordinate the engineering controls program?
- Do design standards assure that, wherever possible and needed:
  - The layout of consoles is logical, consistent and effective?
  - There is an emergency shutdown capability?
  - There is an alarm summary permanently on display?
- At mechanical completion is there a system to assure that the following critical checks and tests have been properly completed:
  - Pressure testing of piping, equipment, seals, gaskets etc. (hydro-test and/or leak detection as appropriate)?
  - Calibration of instruments?
  - Checking of control logic, interlocks and emergency shutdown controls on an item-by-item basis, as may be feasible?
Another new feature of rating systems is the illustrative information which can be delivered, showing the level of performance or the deviations from a wanted level of performance or from fulfilment of a standard. The rating systems from Det Norske Veritas, the International Safety Rating System and the International Environmental Rating System, both use this new feature. An example is shown in Figure 1. The rating systems use graphical illustrations to show the performance level of the rated companies and to illustrate the improvements, or lack of improvements, since the last rating. With this feature it is very easy to illustrate for the management level how the performance of the company develops and the direction of the performance. Deloitte & Touche Environmental Services has developed a range of systems to analyse "the gap" between the actual performance of the company and specified standards or programs ("Gap Analysis"). The gap is here understood as the differences or lacks between the performance and the specifications in the standard or the program. Systems for gap analysis function more like traditional rating systems: a questionnaire based on the standard against which the company's performance is to be measured. It can for example be the British Standard 7750 for environmental management systems. The difference between a traditional rating
system and gap analysis lies in the way the presentation of the result is done and to whom the result is directed. Gap analysis is mostly reported in a graphical manner, where graphics can, for example, show the percentage of fulfilment of the different elements of the standard, but of course also in text. The target group for gap analysis is mostly the senior management level in the company, and for this reason the graphical reporting approach is a very illustrative and effective form of communication.
[Bar chart showing percentage scores for each ISRS element, from Leadership & Administration, Leadership Training, Planned Inspections & Maintenance, Critical Task Analysis & Procedures, Accident/Incident Investigation and Task Observations through Emergency Preparedness, Rules & Work Permits, Knowledge & Skill Training, Engineering & Change Management, Hiring and Placement, Materials & Services Management and Off-the-Job Safety]
Figure 1: Example of a graphical illustration from the International Safety Rating System.

The advantage of the gap analysis is the easy and illustrative reporting form, which allows the senior management level to focus on the more general management lines and does not force them to go into the technical details.
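As an illustration of how such a gap analysis can be reduced to percentages of fulfilment per standard element, a minimal Python sketch (element names and data are hypothetical, not the Deloitte & Touche system):

def gap_analysis(answers):
    # answers: element name -> list of booleans, one per checklist question
    # (True = requirement fulfilled). Returns percent fulfilment per element.
    return {element: 100.0 * sum(qs) / len(qs)
            for element, qs in answers.items()}

answers = {
    "Environmental policy": [True, True, False, True],
    "Register of effects":  [True, False, False, False],
    "Operational control":  [True, True, True, False],
}
for element, pct in gap_analysis(answers).items():
    bar = "#" * int(pct // 10)   # crude text version of the bar chart
    print("%-22s %5.1f%% %s" % (element, pct, bar))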
THE CASE

In September 1994 it was decided that a combined environmental and safety management system should be developed and implemented at Kommunekemi a/s, and it was also decided that the system had to be certified before July 1996. There was no updated safety assessment or environmental assessment for the whole site area, but approximately 25 to 30 different older or newer assessments for the different plant areas. For some of the plants the authorities had stipulated yearly emission reports.
The reason for the decision was that there had been three minor inexplicable explosions in one of the two rotary kiln incinerators at the company, and one of the recommendations from the investigation team, formed by experts from government agencies and the company, was that a documented safety management system should be established. This recommendation was changed by the board to a combined safety and environmental management system. On this basis it was decided, first, to make an initial rating with the International Safety Rating System, ISRS, from Det Norske Veritas (DNV) to establish the actual level of safety performance and, secondly, to develop a combined environmental and safety rating system tailor-made to Kommunekemi's requirements. This combined tailor-made system was named the Kommunekemi Rating System, KKRS, to illustrate that the system was made specifically for the requirements of Kommunekemi and that large parts of the elements of the rating system were relevant only to Kommunekemi. At the initial rating audit with ISRS the performance level of Kommunekemi's safety work was clarified, and in the same period the level of environmental performance was evaluated by carrying out an environmental review of the activities at the site at the element level. This audit and review gave an agreed basis for the future work, and immediately after the ISRS audit was finished the work of developing KKRS started. It was in this way possible to take advantage of the established knowledge of Kommunekemi's level of environmental and safety performance. KKRS was built on the two DNV rating systems, ISRS and the International Environmental Rating System, IERS, but the original system elements and questions were modified to cover the nature and work of Kommunekemi and the fact that the scope was to develop a management system in accordance with the two environmental management standards, BS 7750 from BSI and the European Union's EMAS, and the Responsible Care program from the European Chemical Industry Council, CEFIC. This was done especially with questions based on the content of the standard, BS 7750, and elements from the EMAS regulation, but the questions were transformed from the traditional "yes or no" type to the more development-oriented approach. In the development of KKRS special regard was taken to the requirements in the standards and to the fulfilment of national environmental legislation, but because the scope of KKRS also included safety matters, the national Danish legislation on worker protection was written into the rating system. By doing this the workers at Kommunekemi felt that the idea of implementing an environmental and safety system was not only a prestige project for the senior management; they felt that the new system was also of great importance for their daily work at the different plants. The system was able to bring occupational safety up on the desks of the board and the senior managers. When the KKRS manual was finished in the beginning of 1995 it was printed as a book and distributed to all managers in the company, from supervisors to the managing director. The KKRS manual was then used at all levels and in all departments as a list of ideas for the development work of forming the environmental and safety management system. It was now possible to review the stated levels of environmental and safety performance against the standards' specifications in the different departments.
The actual level of environmental and safety performance was measured against the descriptions in KKRS, which reflected the standards' requirements, making it possible to detect what was missing from the standards' point of view. This was done both by the local staff in the different departments and by the staff of the central environmental and safety department. Although all managers at the company had the KKRS manual on their desks, most of the work with the manual was done by the central environmental and safety department, which was appointed as the department responsible for the development and implementation of the environmental and safety management system.
But it was easier to discuss the different matters with managers from the different departments, because a common basis had been established for the development of the management system. Just by rating the company's performance and writing new systems it is not possible to change attitudes and daily work, and one of the most central elements of the whole implementation process at Kommunekemi was a training programme in environmental and safety matters for all personnel. This training programme gave all personnel a common basis for the future discussions on environmental and safety matters. In the training programme the department managers were trained first; they then participated in the training of the supervisors, and both groups together then participated in the training of the rest of the personnel, the "blue collars". By carrying out the training programme in this way, all levels were able to hear their superiors commit themselves to the environmental and safety management system and accept the requirements of the system. It was also possible for personnel from the lower levels to give feedback on the developed system directly to the responsible managers. Another benefit of the training programme was that it became very difficult for superiors to refuse their personnel's wishes to follow the ways of working described in the environmental and safety management system, since they had themselves taught them to do so.
As a summary of the development and implementation process of an environmental and safety management system at Kommunekemi, the following can be stated. The use of KKRS gave the advantage that KKRS functioned as a list of ideas for how the work in the area should be done. By using KKRS in this forward-looking way it was possible to change the direction of the scope from backwards to forwards, because KKRS included so many evaluating questions that could be used as building blocks for the further work. From the identified missing requirements of the standards it was possible to determine the need for procedures and other types of documentation, and it was then possible to begin describing the parts of the needed management system which the evaluation had shown were not in place. The systematic work with KKRS made it possible to identify and establish the core elements of the management system on the basis of the list of ideas in KKRS. The management system was to be a common system for the whole company, not for the individual departments, but the system had to reflect the structure and nature of all the different departments which together describe and create Kommunekemi as a functioning company. The next step was then to establish the environmental and safety handbook on the basis of documented procedures and instructions; from this step onwards the work was built more and more directly on the requirements of the chosen standards, BS 7750 and EMAS. But the work with the Kommunekemi Rating System has not finished; there are still many opportunities in the KKRS manual for the future. Kommunekemi obtained its certificate for BS 7750 in June 1996 and was verified and registered in accordance with EMAS in September 1996.
SAFETY PROGRAM TO LOSS PREVENTION CONTROL
João Alexandre das Neves, Carlos Morgado Pereira
CPPE - Companhia Portuguesa de Produção de Electricidade, SA
Rua Mousinho da Silveira, 10 - 1250 Lisboa, Portugal
ABSTRACT
This paper deals with the theme of prevention and safety, within the realm of organisational factors and safety culture, drawing on the experience of CPPE - the Portuguese Producer of Electricity. Prevention and Integrated Safety in thermal power plants applies a management program to solve problems of loss prevention and control. This program uses a methodology which allows risk to be minimised in a systematic way, or maintained in a potential state, sufficiently reduced or controlled within acceptable levels.
PRESENTATION
As a result of EDP's re-organisation process, fourteen companies have been created, one of them being CPPE - Companhia Portuguesa de Produção de Electricidade, SA. CPPE administrates an electro-generation system which includes 7 thermal and 25 hydroelectric power plants with a total installed capacity of 7,050 MW. The energy produced in 1996 was about 24,426 GWh, which represents 79.2% of national demand. The Thermal Production Direction of CPPE is responsible for the management of the thermal power plant system. This system is composed of 2 coal plants, 2 gas turbine plants and 3 fuel-oil plants (one of them with a cogeneration system). The whole system has a generating capacity of 3,555.7 MW. In 1996 the production of electricity in these power plants reached 11,293 GWh; this value represents 36.6% of the national needs. To measure the efficiency and the quality of service provided, the Thermal Production Direction takes as a reference UNIPEDE's indicators, whose values in 1996 were the following:
• Availability - 91.9%
• Unplanned unavailability - 2.5%
• Reliability (unplanned trips / 7000 h) - 9.05
• Air Quality Index - 2.0
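To make the form of these figures concrete, the short sketch below computes indicators of this kind. The formulas are simplified, assumed definitions for illustration only — the official UNIPEDE indicator definitions include corrections (e.g. for partial outages) not shown here — and all input figures are made up:

    # Hedged sketch: simplified, assumed forms of the indicators quoted above.

    def availability(available_hours, period_hours):
        """Percentage of the period in which the unit was available."""
        return 100.0 * available_hours / period_hours

    def unplanned_unavailability(forced_outage_hours, period_hours):
        """Percentage of the period lost to unplanned (forced) outages."""
        return 100.0 * forced_outage_hours / period_hours

    def reliability_index(unplanned_trips, operating_hours):
        """Unplanned trips normalised to 7,000 operating hours."""
        return unplanned_trips * 7000.0 / operating_hours

    # Illustrative figures for one unit over a calendar year (8,760 h):
    print(f"availability: {availability(8051, 8760):.1f}%")                      # ~91.9%
    print(f"unplanned unavailability: {unplanned_unavailability(219, 8760):.1f}%")  # ~2.5%
    print(f"trips per 7000 h: {reliability_index(8, 6200):.2f}")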
INTRODUCTION
This paper deals with the theme of prevention and safety, within the topic of organisational factors and safety culture, drawing on the experience of CPPE - the Portuguese Producer of Electricity.
Security is nowadays one of the most relevant components of enterprise strategy, as it is an important factor in sustaining high productivity and competitiveness. The development of a modern enterprise can be evaluated through the level of application and observance of the safety regulations stated by management. Success or failure of the "zero accident" objective depends on the safety policy. The risk evaluation necessary to this process is achieved through a potential risk survey and through implementation and management. On the other hand, the definition of a safety strategy in productive activities should result in a culture of greater confidence, which strengthens the application of the safety plan.

Prevention and Safety, being one of the activities in a power plant, is a service rendered by all of those who work at the plant. The diversity of risks among the different roles in such a plant demands a detailed analysis to detect risk levels and define the protection objectives. To review procedures, to alter attitudes or behaviours and to assure the availability of safety practices should be a priority of management. Every activity should be accompanied by the safety and prevention component. Considering the statements already made, Prevention and Integrated Safety in power plants applies a plan which promotes a set of actions assuring, at every moment, the security levels required by universal standards. A safety plan established by a management program is capable of providing a solid basis of management procedures to solve problems of loss prevention and control. It consists of a set of actions integrated in sub-programs, with interactions between them, which in an articulated way establish a protection wall against hazards. This wall is made of blocks, each of them representing a sub-program of action, forming a shield against risk. The lack of availability of one of these sub-programs can result in a weakness of the facility, which would become vulnerable to hazards.
PREVENTION AND SAFETY POLICIES
In an enterprise, the definition of a Prevention and Safety Policy is the beginning of a Safety Culture supported by several practices. Aware of its social responsibility and its importance in a modern society, the Thermal Production Direction of CPPE has a Prevention and Safety Policy, integrated in a strategy defined by the EDP group, with the following main purpose:
"Reach for safety and patrimony protection and assuming the firm "s responsibilities to its community, through identification of hazards and proper management ". The existence of a written policy about safety and its diffusion reinforces a culture of trust which will lead to a greater efficiency in the implementation of a Safety Plan. A written statement of corporate management policy or philosophy regarding loss prevention and control is fundamental to properly manage these problems. This statement, widely published throughout the organisation, gives clear testimony that management has made loss prevention and control a corporate objective.
By establishing a Safety Plan made of programs which address the interaction between people, hazards and prevention, management creates a solid basis for Loss Prevention and Control. Each program is presented independently, but their structure is similar for all of them. The success of a plan like this needs the cooperation of all the workers: they must be convinced that the Safety System is really important, and its rules should be accepted by everyone. It is therefore necessary to disseminate the Global Program and all the sub-programs.
SUB-PROGRAMS OF ACTION
Impairments to Fire Protection Systems
A protection impairment occurs when fire or explosion protective systems are shut off or otherwise taken out of service. These systems have to be inspected and submitted to maintenance; this is the only way to assure their availability in case of fire or explosion. Many large losses might have been minimised if a fire or explosion protective system had not been impaired. There are three types of impairments to fire protection systems:
- Emergency impairment: occurs when an unforeseen incident partially or totally impairs the effectiveness of a fire or explosion protective system.
- Planned impairment: occurs when it is necessary to shut down a fire or explosion protective system for maintenance or modification. These operations have to be carefully planned to minimise the period of impairment.
- Hidden impairment: one which is not known to exist and is therefore the most serious type. A good inspection program can reveal the hidden impairment, thus allowing prompt restoration of vital protective equipment.
It is therefore necessary to implement the correct actions to minimise the shut-off period and its resulting risks, to reach maximum safety at the place of the occurrence. The implementation of an impairment management program requires several basic steps:
1- Assign the responsibility for impairment supervision; this responsibility should be assigned to the head of the Safety Department,
2- Inform department heads about the areas where protection is out of service,
3- Shut down hazardous processes such as cutting, welding and other hot work,
4- Prohibit smoking throughout the affected area,
5- Inform the public fire department about the situation so that they may act if a fire occurs,
6- Supplement manual fire fighting facilities by temporary addition of extra fire extinguishers and charged hose lines, and personnel to make use of them,
7- Adopt a detailed plan for supervision,
8- Train in special hazards and provide standard operational procedures for them.
Each action plan should follow these guidelines:
1- Limit the frequency, extent and duration of all impairments,
2- Work continuously on impaired equipment until it is restored to service,
3- Reduce the possibility of fire during the impairment by shutting down hazardous processes,
4- Enhance surveillance and fire fighting capability during the impairment,
5- After the impairment, restore all fire protection systems promptly and verify, by appropriate tests, that all fire protection systems have indeed been restored.
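As a concrete illustration of the steps above, the sketch below shows one possible shape for an impairment permit record. It is a hypothetical design, not something described in the paper; every name and field is an illustrative assumption:

    # Hedged sketch of an impairment permit record; all names assumed.
    from dataclasses import dataclass, field
    from datetime import datetime
    from enum import Enum
    from typing import List, Optional, Tuple

    class ImpairmentType(Enum):
        EMERGENCY = "emergency"  # unforeseen incident disables protection
        PLANNED = "planned"      # shutdown for maintenance or modification
        HIDDEN = "hidden"        # unknown until revealed by inspection

    @dataclass
    class ImpairmentPermit:
        system: str              # e.g. a sprinkler loop
        area: str
        kind: ImpairmentType
        supervisor: str          # step 1: responsibility assigned
        started: datetime
        notified: List[Tuple[str, datetime]] = field(default_factory=list)
        restored: Optional[datetime] = None

        def notify(self, party: str) -> None:
            """Steps 2 and 5: record notification of department heads, fire brigade."""
            self.notified.append((party, datetime.now()))

        def close(self) -> None:
            """Guideline 5: restore promptly and verify by test before closing."""
            self.restored = datetime.now()

    permit = ImpairmentPermit("sprinkler loop 3", "boiler house",
                              ImpairmentType.PLANNED, "Safety Dept. head",
                              datetime.now())
    permit.notify("public fire department")
    permit.close()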
Smoking Regulations
Numerous fires are attributed to carelessly discarded lighted smoking materials. To control this source of ignition, management must define areas of the facility where smoking is permitted and "No Smoking" zones. The action program consists of defining a plan for the whole facility and determining the areas in which smoking can be permitted and those in which it must be absolutely prohibited, according to their risk of fire. Warehouses where combustibles are stored, and areas in which explosives, flammable gases or liquids, or combustion ashes may be present, are examples of places containing severe hazards where smoking cannot be allowed under any circumstances. In these areas lighters and matches should be prohibited. The implementation of a management program for smoking regulations requires some basic steps such as:
1. Clearly mark "No Smoking" zones and inform employees of the reasons for the prohibition,
2. Clearly mark areas in which smoking is permitted and provide for proper disposal of smoking materials (ashtrays, ventilation equipment, fire protection systems, etc.),
3. Promote the observance of the smoking regulations,
4. Inform visitors and outside contractors of the smoking regulations, and make sure they observe them.
Hazard Evaluation
Prevention requires the development of adequate techniques. Hazard evaluation is the first matter to analyse in safety concerns, and its main objective is to create a list of all the elements of the productive system Man-Machine-Environment which could cause accidents. In a plant, and particularly in a thermal power plant, there are operations, equipment, machines and processes that involve several hazards. Since each hazard represents a potential loss, it is important to identify and evaluate the hazard. Knowing the hazard and its evolution, management may create a program of Loss Prevention. The implementation of a Hazard Evaluation management program needs some basic steps, such as:
1. Careful choice of the equipment and processes to be analysed,
2. Determination of which components, systems or procedures are "critical". Critical components, systems or procedures are those that, if out of service, could result in a catastrophic loss,
3. Determination of the type and level of loss prevention controls to be implemented in accordance with the results of the study and the magnitude of the potential loss.
The search for the right procedures to identify and evaluate the hazards led to the development of a Safety Systems study. That study concluded that hazards were products of the interaction of the persons involved, the machines or equipment in use, the environment and the influences of management. The identification and evaluation of hazards can be done using different methods, depending on circumstances and objectives. Once the hazards are identified, they should be qualified and communicated in a logical manner to enable management to set priorities for loss prevention and control measures. Critical systems and components typically monitor or control pressure, temperature, electric power, rotor position or other quantities. When a deviation from normal conditions occurs, they alarm, initiate the corrective action, or shut down the process or equipment to avoid undesirable consequences.
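The priority-setting step just described can be made concrete with a simple risk-ranking sketch. The 1-5 likelihood and severity scales and the example hazards are illustrative assumptions, not data from the paper:

    # Hedged sketch: ranking identified hazards by likelihood x severity
    # so that management can set loss-prevention priorities. All entries assumed.
    hazards = [
        # (hazard, likelihood 1-5, severity 1-5)
        ("fuel-oil leak near burner front", 3, 5),
        ("hot-work sparks in cable gallery", 2, 4),
        ("turbine overspeed on load rejection", 1, 5),
    ]

    for name, likelihood, severity in sorted(
            hazards, key=lambda h: h[1] * h[2], reverse=True):
        print(f"priority {likelihood * severity:2d}: {name}")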
Employee Training
Statistical studies of incidents show that about 85% of accidents in industry are due to human error. One important way of reducing loss caused by human error is to provide employee training in proper work practices. This training may include information on specific job skills, safety practices, and company regulations and requirements. In fact, if employees frequently make technical failures in routine jobs, it is because their skills are not being learned properly. Even so, there are still safety problems that can be addressed by employee training, and management should be attentive to them. To develop an employee training course to minimise human error, management should:
1- Be aware that on-the-job training in safety is the best training;
2- Establish the scope of all segments of the program, according to the safety strategy;
3- Have the training program segments written down;
4- Show employees the need to improve their performance;
5- Establish training and retraining schedules;
6- Be sure that training programs include the most important aspects of particular jobs and critical procedures;
7- Include in employee safety training information about safety equipment.
Insurance Companies' Recommendations
Recommendations made by insurance companies during their visits to power plants are aimed at reducing loss potential. The implementation of a management program to minimise loss potential at the facility should:
1- Analyse the recommendations and decide which should be implemented;
2- Analyse their costs;
3- Ensure that any change projected by the engineering teams is analysed by the insurance technicians;
4- Carefully schedule the implementation of the accepted recommendations;
5- Inform the insurance company about the rejected recommendations and the reasons for that option.
Pre-Emergency Plan
Emergencies are exceptional situations which may endanger people or goods. If they occur, they should be controlled by means of a pre-emergency plan; if unexpected, the emergency will lead to worse consequences. The Pre-Emergency Plan (PEP) is a form of self-protection, and it can be defined as the combination of persons and material means available to fight, in a proper way, against a situation of emergency. If management wants an efficient PEP, the firm will need well-trained, coordinated and resourceful people, ready to act properly at the right time. To create a PEP the emergencies must be classified. There are some important steps to remember in that classification:
- Hazard identification and evaluation,
- Implementation of the PEP and its revisions,
- Definition of means and appointment of employees who are able to:
• Defend other employees and the population against risk;
• Minimise losses at the plant or outside the plant;
• Minimise the environmental impact.
The implementation of a PEP has the following steps (a minimal sketch of the classification step follows this list):
- Emergency classification;
- Hazard evaluation / anticipation of emergency conditions;
- Definition of available means;
- Creation of a decision tree to help fight the emergency;
- Emergency training for all the employees;
- Creation of emergency brigades;
- Specific training for their members;
- Creation of simulation drills;
- Cooperation with external agencies (Police, Civil Defence, Local Fire Department);
- Permanent updating of the PEP in order to keep pace with changes in property, facilities and processes.
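As referenced above, the following is a minimal sketch of how the emergency-classification step could be encoded as a simple decision rule. The three levels, the criteria and the responses are assumptions for illustration only:

    # Hedged sketch of an emergency classification rule; levels, criteria
    # and responses are illustrative assumptions.
    def classify_emergency(people_at_risk: bool, confined_to_unit: bool) -> str:
        if people_at_risk:
            return "level 3: activate full PEP and alert external agencies"
        if not confined_to_unit:
            return "level 2: mobilise emergency brigades"
        return "level 1: local response by unit personnel"

    print(classify_emergency(people_at_risk=False, confined_to_unit=False))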
Hazardous Materials
At power plants there are processes in which substances with hazardous properties are used. It is important to identify those substances, to inform all the employees about their risks, and to teach those employees the precautions for handling them. There are many chemical substances with hazardous properties such as flammability, radioactivity, toxicity, etc. The implementation of a program to evaluate the hazardous materials and to reduce loss potential has the following aspects:
- Determination of the physical properties of each chemical substance handled at the facility;
- Evaluation of the hazardous properties and determination of the relative hazard level of each substance and any necessary handling precautions;
- Establishment of methods for disseminating the hazard information and handling precautions to the employees and to the emergency department;
- Establishment of a method for assisting in the development of process hazard evaluations;
- Creation of a file of "Hazardous Materials" with all the information about each of these materials (identification, handling, storage, precautions, emergency procedures, etc.);
- Safety identification according to EC legislation.
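One possible shape of an entry in the "Hazardous Materials" file described above is sketched below; the field names and example values are assumptions based on the listed contents, not material from the paper:

    # Hedged sketch of one record in the "Hazardous Materials" file;
    # all field names and values are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class HazardousMaterialRecord:
        identification: str       # substance name, supplier reference
        physical_properties: str  # flammability, toxicity, radioactivity, ...
        hazard_level: int         # relative hazard ranking
        handling_precautions: str
        storage: str
        emergency_procedures: str
        ec_safety_label: str      # safety identification per EC legislation

    record = HazardousMaterialRecord(
        identification="heavy fuel oil",
        physical_properties="flammable liquid",
        hazard_level=3,
        handling_precautions="no ignition sources; ventilated transfer",
        storage="bunded tank farm, No Smoking zone",
        emergency_procedures="foam suppression; notify emergency department",
        ec_safety_label="(assumed illustrative label)",
    )
    print(record.identification, "- hazard level", record.hazard_level)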
Security Surveillance
Prompt detection of adverse conditions is crucial to effective loss prevention and control. Many catastrophic losses would have remained routine failures if detected promptly and given an appropriate response. In conjunction with fire and explosion protective systems and the various other management programs for loss prevention and control, fire protection and security surveillance provide a means of continuously monitoring the facility for conditions which might lead to a fire, explosion or other incident. Management should develop a written surveillance plan for fire protection and safety to be certain that the facility is checked regularly. To accomplish this, management should:
- Determine which areas are occupied during working hours;
- Establish a plan for surveillance of the unoccupied areas;
- Create a program of guard service to detect unsafe conditions and to correct them, if possible, during the tour; a report should be written during the tour and reviewed afterwards;
- Schedule employee training for those guards;
- Limit access to the facility.
Hot Work Permission
Welding, brazing, flame or plasma cutting, hot riveting and other activities that produce sparks or use flame are important working methods in thermal power plants. Hot work is a hazardous process and that hazard must be controlled. The principal hazard associated with portable hot work equipment is the introduction of unauthorised ignition sources into "No Smoking" areas. The implementation of a management program to control and reduce the hazards of hot work should follow these principles:
- Perform hot work in a properly arranged maintenance shop, except when the job cannot be moved to it;
- Assign a supervisor for hot work;
- Establish written procedures for hot work;
- Establish and implement a permit system;
- The authorisation for hot work must be given in writing and must include the proper safety procedures for that work.
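A minimal sketch of the permit-system principle follows; the precondition checks and messages are assumptions for illustration, not the paper's procedure:

    # Hedged sketch of a hot-work permit check; preconditions assumed.
    def issue_hot_work_permit(job_movable_to_shop: bool,
                              supervisor_assigned: bool,
                              written_procedures: bool,
                              safety_measures: list) -> str:
        if job_movable_to_shop:
            return "no permit needed: perform the work in the maintenance shop"
        if not (supervisor_assigned and written_procedures and safety_measures):
            return "permit refused: preconditions not met"
        return ("permit issued in writing; required measures: "
                + ", ".join(safety_measures))

    print(issue_hot_work_permit(False, True, True,
                                ["fire watch posted", "extinguisher at hand"]))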
Fire Protection Equipment Maintenance and Inspection
Fire protection equipment deteriorates with the passing of time and is also vulnerable to external influences such as a corrosive environment, careless use and accidental damage. All protection equipment should be frequently inspected and tested to determine its availability and the need for maintenance. The implementation of a management program to supervise the operational readiness of the protection equipment should follow these principles:
- Creation of an inspection plan for protection equipment;
- Selection and training of individuals to supervise the execution of the plan;
- Initiation of a complete survey of the facility's fire protection equipment for the purpose of developing a customised inspection report form;
- Establishment of contracts for periodic testing and maintenance of this equipment;
- Establishment of a specific program for exercising and testing the equipment.
Safety Audits, Internal and of Contractors
Evaluating is comparing with a model. Loss Prevention and Safety compares models and processes to reach the established objectives. Safety audits are made to evaluate whether the objectives were reached and to find the weaknesses and the proper ways to correct those weaknesses. Management should implement an audit program following these steps:
- The person responsible for the plant is the one who requests the audit;
- All declarations made to the audit team are anonymous;
- An efficient audit needs a complete prior diagnosis, and it should have three different parts: pre-audit, audit and post-audit.
Safety Meetings to Analyse Accidents and Dysfunctions in the Man/Machine Relation
Safety procedures are a management concern and they should be implemented after a methodical analysis. Information plays an extremely important role during the search for data for hazard evaluation and during the creation of a document containing all the safety procedures of each activity. The creation of "Safety Subcommissions", with representatives of the employees and representatives of management, and the safety meetings organised by these subcommissions, lead to a culture of trust and to an improvement of safety. Safety Subcommissions, headed by the Plant Director, should:
- Analyse and define the rules of Labour Safety and Housekeeping;
- Recommend safety practices and motivate the employees towards a culture of safety;
- Analyse and develop a plan that leads to better working conditions;
- Analyse and approve the reports made during the meetings on Labour Safety and Housekeeping;
- Analyse statistics of labour accidents and their causes, and propose procedures to prevent them;
- Analyse reports about labour accidents;
- Create a plan for employee training in Labour Safety and Housekeeping.
Health and Safety Plan During Overhauls
The overhaul of a unit (steam generator + turbogenerator) in a power plant includes maintenance, inspection and testing operations in different areas such as mechanical, electrical, chemical, control equipment, etc. Its duration is about 6 weeks of continuous work. Most maintenance work is carried out by contractors' employees: in an overhaul there are about 30 contractors and subcontractors, and at the busiest period there may be about 600 people working simultaneously in maintenance operations. To reduce the probability of accidents or to minimise their consequences, management should implement a combination of procedures which will reduce the hazards and contribute to improving safety conditions during the overhaul. A document including all the written information about housekeeping, health and safety during the overhaul should be created. This document, called the Health and Safety Plan, is evidence of management involvement in safety. For its implementation, management and contractors cooperate to:
- Identify and evaluate the hazards of each activity;
- Suggest procedures to minimise the hazards;
- Create a proper integration of contractors' activities in the overhaul;
- Establish adequate surveillance of employees' safety;
- Create a First Aid Plan;
- Train employees in prevention matters.
CONCLUSION
The control of results, both in terms of accidents and in terms of safety conditions, is among the advantages of this program. Another aspect that can be highlighted is the strengthening of the mechanisms of hazard evaluation. This results from the creation of an accident/incident file and its statistical analysis, both made to implement the proper procedures to improve Loss Prevention.
APPROACH FOR ASSESSING HAZARDS RELATED TO HIDDEN DEFICIENCIES IN TECHNICAL SYSTEMS
Arild Tomter
Kongsberg Gruppen ASA - Kongsberg Defence Systems, P.O. Box 1003, N-3601 Kongsberg, Norway
ABSTRACT
Any man-made system starts with man. All human beings make errors. Errors made by system designers may penetrate into the system, where they may reside as unintended and hidden deficiencies. A hidden deficiency may later in the system lifetime manifest itself as a failure. This may initiate an unintended chain of events, which may be hazardous. Such hazards are outside the scope of conventional safety analysis techniques and methodologies. A different approach for addressing such hazards is presented. The approach is based on assessing the system's proneness to the presence of hidden deficiencies, combined with assessing the system's ability to control associated failures and to prevent any uncontrolled chain of events leading to an accident. Proneness to hidden deficiencies is assessed by evaluating various «error-prone» system attributes, which are assumed to correlate with such proneness. The system's ability to control associated failures is assessed by evaluating various system characteristics which contribute to a general «immunity», or act as special hazard protection mechanisms. This approach is the result of a theoretical and logical deduction process, with no supporting experience, verification or validation activities. It is, however, the author's hope that the basic approach is appreciated and considered interesting within the system safety research environment.
KEYWORDS
System, deficiency, safety, hazard, accident, error-proneness, immunity.
INTRODUCTION
All established and commonly used techniques and methodologies for safety analysis of technical systems are based on a common fundamental approach: identification and description of possible hazardous events, analysis of possible causes and associated probabilities, and evaluation of likely consequences and associated degree of severity. Potential hazards and accidents which are not recognised remain outside the scope of these techniques and methodologies. Nevertheless, history shows enough examples of accident developments which in advance were determined to be impossible, or whose potential was not recognised at all. «Titanic» went down despite being «unsinkable», «Herald of Free Enterprise» capsized because nobody considered the possibility of the bow door
remaining open upon departure, «Estonia» went down due to a bow door hinge construction too weak to withstand the rough weather, and «Ariane 5» failed due to an unrecognised software inadequacy. One may in all these examples put the blame on human errors: not realising the iceberg threat (Titanic), a seaman taking a nap while on watch instead of closing the bow door (Herald of Free Enterprise), a construction engineer providing too weak a hinge construction (Estonia), a system designer providing inadequate software (Ariane 5). And of course, the blame on human error is, literally speaking, correct, as it usually is in large accidents. But the Titanic captain probably did not recognise icebergs as representing a significant hazard, the seaman on Herald of Free Enterprise probably did not envision the potential catastrophic consequences of taking a nap, the Estonia engineer was probably not aware of the construction weakness of the bow door hinge, and the Ariane 5 system designer probably did not consider the possibility of a software inadequacy (the software had proved reliable in Ariane 4). Usually, when catastrophes result from human errors, the humans making the error were not aware of the occurrence of the fatal action (or lack of action), they were not aware of its inadequacy, or they did not realise the hazard potential at all. Human beings are not designed for perfection and infallibility. Human beings do make errors, and will continue to do so. Our ability to overview all details, interactions and behavioural patterns and modes of a large and complicated system is limited. Hence, all activities and processes where humans are involved, including all man-made systems, are exposed to unintended limitations, errors and deficiencies, and the presence of such limitations, errors and deficiencies should be presumed.
THE CONCEPT OF VULNERABILITY
The term vulnerability has various meanings in the literature, e.g. vulnerability to adverse conditions (Rosness 1991), vulnerability («internal vulnerability») to malfunction because of component wearout (Meister 1991), and vulnerability («external vulnerability») to an external adversary system or situation (Meister 1991).

«... given the system characteristics, multiple and unexpected interactions of failures are inevitable. This is an expression of an integral characteristic of the system, not a statement of frequency.» (Perrow 1984)

This statement introduces a different aspect of vulnerability, addressing the potential for malfunctions resulting from the presence of unintended and unknown internal design deficiencies. Such deficiencies may be present from system «birth» due to human error during the design and/or production process. They may remain hidden for periods of time, and manifest themselves upon occurrence of some specific triggering conditions. The triggering condition may be fully «normal» and completely within the intended design envelope, and may not be adversary at all.
It is evident that this type of vulnerability may cause hazardous events which will not be identified, and which accordingly remain outside the scope of established safety analysis techniques and methodologies. A different approach for assessing the potential for such hazards in a system is needed. Such an approach has to rest upon some basic understanding of the fundamental mechanisms causing hidden deficiencies, as well as the mechanisms by which a deficiency manifests itself and develops into a hazardous chain of events. In trying to develop such fundamental understanding, let us look to the human body. Although much may be, and is, said about the human race and human activity on earth, the human body is, when looked upon in a strict system context, a superb system demonstrating an amazing degree of adaptability, survivability and fault tolerance. Reason (1988) considers latent failures in an organisation analogous to the medical concept of resident pathogens, which in combination with triggering factors bring about disease. His discussion was presented as a model for understanding failures within organisational and managerial systems. A corresponding discussion of the human body as a «system» operating within its environment may, however, be of good help in understanding corresponding mechanisms in technical systems.
The human body is continuously exposed to various strains and threats from a more or less hostile environment, e.g. hostile micro-organisms like bacteria and viruses, which continuously try to intrude into the body. Successful intrusion may cause disease. However, our body contains a sophisticated defence system, which counterattacks and tries to destroy the intruders. This immunity system is a major prerequisite for our survivability in a hostile environment. When we are healthy, this defence system succeeds in the continuous «war» and intruders are defeated. Occasionally, our body may be impaired or exposed to an extra strong attack, and our defence system needs to mobilise and activate all its capabilities, and fight an intensive battle for some time. We experience this mobilisation and intensive battle as being sick. This model may, however, not explain the existence of chronic diseases. Chronic diseases are not a general condition affecting the entire population, but rather a specific condition striking certain individuals only. Hence, it seems logical to search for explanations of the occurrence of a chronic disease in a disturbance or deficiency within the body of the individual affected (a «system fault»), either within internal physiological processes (i.e. «system performance») or within the body's internal defence system. Such a fault may be present from birth, i.e. an integral characteristic of the «system» caused by a «system design or production error». The fault may, however, remain invisible as a hidden deficiency (i.e. as a predisposition to the associated disease), until some specific condition triggers its manifestation. The triggering condition may be quite normal and not represent any extraordinary level of strain or hostility at all, but happens to «match» the hidden deficiency, leading to its manifestation and the outbreak of the associated disease symptoms. By applying this fundamental reasoning to technical systems, the potential for hazardous events may be considered as related to some kind of predisposition of systems to such phenomena. Such predisposition should be considered a basic property of the individual system. This point of view raises two crucial questions: 1. What type of mechanisms or characteristics of a technical system may impact the likelihood and/or amount of hidden deficiencies residing in the system? What makes a system prone to such hidden deficiencies? 2. What type of mechanisms may be feasible to prevent a hidden deficiency from manifesting itself as a hazardous event? How can defence mechanisms or a system-internal «immunity» counteract the potential hazardous effect of hidden deficiencies in a technical system?
SYSTEM CHARACTERISTICS AND PRONENESS TO PRESENCE OF HIDDEN DEFICIENCIES
Any man-made system is created through human processes of specification, design, production and implementation of the system. Unintended system deficiencies may be introduced through these processes, and may reside in the final system as an integral part of the system's total set of behavioural rules and patterns. Hidden deficiencies are unintended results of human activities, and may enter into and survive the entire system creation process due to fundamental limitations in human capabilities and capacities. When searching for system characteristics which may impact a system's proneness to hidden deficiencies, we should look for characteristics which have a bearing on fundamental human limitations. System characteristics increasing the level of mental capacity needed to maintain a complete and detailed perception and overview of all aspects of the system behaviour are expected to increase the probability of unintended and hidden deficiencies entering and surviving the system creation process. System characteristics which tend to simplify the overall and detailed overview of all aspects of the system are, on the other hand, expected to facilitate prevention, detection and elimination of such deficiencies within the system creation process. System characteristics should also be defined in a general context, independent of the specific application or missions of the system. Hence, we should search for characteristics of the basic system structure as opposed to system functions. Based on Meister's (1991) structuring of systems in structural elements and
Perrow's (1984) concepts of complexity and coupling, six system characteristics are established as being expected to impact the probability of unintended system deficiencies entering and surviving the system creation process: 1) Complexity, 2) Coupling, 3) Size, 4) Differentiation, 5) Organisation, 6) Indeterminacy.
Complexity
System complexity is dealt with in the literature from various aspects and viewpoints: interdependency relationships among units and subsystems (Meister 1991); unfamiliar, unplanned, unexpected sequences which are invisible or difficult to comprehend, i.e. «interactive complexity» (Perrow 1984); interconnections between parts and non-linear relationships between crucial variables (Wahlström 1990). It is reasonable to assume that a system with extensive dependencies, a large number of parts, interconnections and relationships, and a large amount of complex interactions imposes a heavier challenge on the human mental capacity in establishing a complete comprehension of all aspects of the total system's behavioural modes and parts than does a system with limited dependencies, number of parts and interactions. This view is supported by Meister (1991) and Wahlström (1990), who both claim that complexity makes the system more difficult to understand. Hence, the system characteristic Complexity is considered to have a major impact on a system's proneness to presence of hidden deficiencies.
Coupling
Coupling within a system concerns how subsystems or parts are interconnected with each other. Perrow (1984) introduced the terms tight and loose coupling. Tight coupling is characterised by close and direct interconnections. The effect of an event occurring in a process state or part of a tightly coupled system tends to transmit through various parts of the system, and is more likely to cause a cascading effect. Loose coupling generally incorporates buffer capabilities between different stages in the process. The effect of an event occurring in a process state or part of a loosely coupled system is more likely to be absorbed within the system, causing only limited impacts. It is reasonable to assume that coupling is a relevant characteristic regarding proneness to accidents. The important effect seems, however, to rest on the capability to absorb and limit the effects of occurring failures, thus acting as a defence mechanism counteracting potential hazardous effects of a hidden deficiency. Such defence mechanisms are further discussed below. Nevertheless, it also seems reasonable to assume that a tightly coupled system imposes a heavier challenge on the human mental capacity in maintaining an overall overview and comprehension of the system than does a loosely coupled system. Hence, the system characteristic Coupling is considered to have some impact on a system's proneness to presence of hidden deficiencies.
Size
Our intuitive way of characterising a system's size is by rather imprecise terms like small, medium, large, representing a perception of one system property. Whether the property is expressed as small, medium or large depends, however, on the specific reference frame. A «small» nuclear power plant may objectively be larger than a «large» weapon control system. Despite this problem of taxonomy for measuring and expressing size as a system characteristic, size is nevertheless a real property which needs to be addressed.
It seems obvious that large systems generally incorporate a higher number of subsystems and parts, as well as interrelationships, interconnections and interactions, than do smaller systems. Hence, large systems are assumed to be generally more demanding on the human mental capacity in maintaining an overall overview and comprehension of the entire system's behavioural modes and patterns. This view is also supported by Meister (1991) and Wahlström (1990), who both consider size to be related to complexity. Hence, the system characteristic Size is considered to have a distinct impact on a system's proneness to presence of hidden deficiencies.
Differentiation
Differentiation means «the differences among units and subsystems within a single system» (Meister 1991). Hence, differentiation within a system depends on how the system is defined with respect to its boundaries, and on which attributes of units and subsystems are addressed. For the purpose of assessing the proneness of a system to hidden deficiencies, differences which may be relevant to the system designer's ability to completely overview and comprehend all interaction paths and relations within the system should be taken into account. From this viewpoint the following aspects of differentiation are considered relevant: 1) Mission of the subsystem, i.e. the prime function to be performed by the subsystem. 2) Operator involvement, i.e. unmanned, operator controlled, operator assessment/decision, execution task. 3) Degree of autonomy, i.e. ability to perform autonomously, or dependency on other subsystems. It is reasonable to assume a positive correlation between the degree of differentiation between various subsystems and the need to co-ordinate their individual performance. It is further reasonable to assume that an increased need for co-ordination tends to increase the mental capacity needed for maintaining a complete and detailed overview and comprehension of all behavioural paths and modes of the system as a whole. Thus, differentiation within a system is expected to be positively correlated with proneness to hidden deficiencies within the system. The strength of this correlation should, however, not be overemphasised, and related effects already accounted for, like complexity and size, should be excluded from this evaluation. Hence, the system characteristic Differentiation is considered to have some impact on a system's proneness to presence of hidden deficiencies.
Organisation
A system's organisation is the way in which units and subsystems are arranged in relation to each other within the context of the total system. Some key aspects of a system organisation are rigidity versus flexibility, centralisation versus decentralisation, distribution of authority, and formality versus informality. In organisations characterised by rigidity, centralisation, authoritarianism and formality, performance is more directed by general or specific regulations and guidance originating at a superior level, e.g. as operator procedures, special orders, etc. Flexible, decentralised and informal organisations are, on the other hand, more characterised by improvisation and more creative local-level resolution of problems, as the available and allowed range of response options is broader. A «determinate» type of organisation incorporates, as mentioned above, a narrower set of response options. This will facilitate stability and predictability of overall system behaviour. Hence, it seems reasonable to assume that a «determinate» system generally enhances the ability to predict system behaviour and to maintain a complete and overall overview of all behavioural paths and modes of the system as a whole. This, in turn, is assumed to facilitate the ability to detect and eliminate system deficiencies within the system creation process.
Hence, the system characteristic Organisation is considered to have a distinct impact on a system's proneness to presence of hidden deficiencies.
Indeterminacy
Indeterminacy may be described as a function of various elements (Meister 1991): input characteristics, need for interpretation of input, emphasis on information processing, required amount of decision making, procedural variability, and available response options. Indeterminacy is also considered to be closely related to uncertainty. The total uncertainty of a system may result from two different uncertainty components: 1) System uncertainty - a built-in system property, due to e.g. sensor limitations or procedural flexibility. 2) Situation uncertainty - reflecting uncontrollable impacts from the environment, e.g. enemy, weather. It seems reasonable to assume that the design of indeterminate systems is generally more difficult and challenging for the designer than that of determinate systems. It is further reasonable to assume that built-in system uncertainties are associated with less predictability of system behaviour in a given situation, and that this will tend to increase the system's proneness to presence of hidden deficiencies. A corresponding reasoning may be applied to systems subject to situation uncertainties. Such systems should be designed for adequate response to various external situations. With dominant situation uncertainties, the overall set of system responses and behaviour options will increase and be more complicated, and the potential for the system being exposed to situations to which it is not able to respond adequately will increase. Hence, the system characteristic Indeterminacy is considered to have an extensive impact on a system's proneness to presence of hidden deficiencies.
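A hedged sketch of how the six characteristics and their stated qualitative impact levels ("some", "distinct", "major", "extensive") might be combined into a single proneness score is given below. The numeric weights and the 0-1 ratings are illustrative assumptions only; the paper itself stops at the qualitative level:

    # Hedged sketch: a crude weighted proneness score over the six system
    # characteristics. Weights encode the qualitative impact levels stated
    # above; both weights and ratings are illustrative assumptions.
    IMPACT_WEIGHTS = {
        "complexity": 3,       # major impact
        "coupling": 1,         # some impact
        "size": 2,             # distinct impact
        "differentiation": 1,  # some impact
        "organisation": 2,     # distinct impact
        "indeterminacy": 4,    # extensive impact
    }

    def proneness_score(ratings: dict) -> float:
        """Normalised weighted sum of 0-1 ratings; higher = more prone."""
        total = sum(IMPACT_WEIGHTS[c] * ratings[c] for c in IMPACT_WEIGHTS)
        return total / sum(IMPACT_WEIGHTS.values())

    ratings = {"complexity": 0.8, "coupling": 0.6, "size": 0.7,
               "differentiation": 0.4, "organisation": 0.5, "indeterminacy": 0.9}
    print(f"proneness score: {proneness_score(ratings):.2f}")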
SYSTEM CHARACTERISTICS AS «IMMUNITY» CAPABILITIES
A hidden deficiency in a system will, when «matched» by a corresponding triggering condition, manifest itself as an unintended failure. This may, in turn, initiate an uncontrolled chain of events finally leading to an accident. The probability of this occurring is, however, dependent on the system's ability to counteract and prevent such hazardous development, i.e. the system's internal «immunity» capabilities. The «immunity» of a system is considered to consist of various safety defence mechanisms and barriers. These may be of two types: 1) General system properties which facilitate the ability to tolerate local failures and absorb their effects within the system component concerned. Such ability is considered closely associated with resilience capabilities incorporated in the system. 2) Special protection strategies aiming at preventing and limiting hazardous effects of uncontrolled chains of events. Such strategies should be developed based on a well-recognised accident development model.
System resilience - the key to general «hazard immunity» capabilities
The discussion of resilience as a key to the hazard immunity of a system is mainly based on Foster (1993). He states the need for giving resilience greater attention in decision-making processes and in systems in general (i.e. social, political, economical, ecological, biological as well as technical systems). While resilience is generally considered to incorporate a broad set of dimensions, three dimensions are considered most relevant for technical systems: System, Operational and Physical characteristics. Resilience related to the System dimension is impacted by the following characteristics:
1) Dependency on external variables. Heavy dependency on external and uncontrollable variables reduces the degree of resilience, while acceptance of extensive ranges of external variables provides greater ability to accept uncontrolled and unpredicted fluctuations of values. 2) Diversity versus specialisation. A highly specialised system may provide higher productivity and output rate, but is more adapted to the specific conditions which it is designed for. Significant deviations from these conditions may for a highly specialised system produce serious common-cause failures, while systems of high diversity will more easily tolerate such deviations and isolate their effects. 3) Functional redundancy. It is obvious that functional redundancy enhances the system's ability to continue its performance in failure situations, thus preventing system collapses and related accidents.

Resilience related to the Operational dimension is impacted by the following characteristics: 1) Reversibility. As the results of decisions are usually impacted by uncertain or unrecognised conditions, it is reasonable to consider increased reversibility of processes to enhance system resilience and the ability to tolerate failures and counteract adversary effects. 2) Incremental application of resources. This implies that a system with reduced capability, e.g. caused by breakdown of certain system components, is able to continue a reduced set of operational functions (degraded operations). It is reasonable to consider this option of responding to a failure situation as an important mechanism for preventing serious system collapses and accidents. 3) Hierarchical embedding. This implies the capability of individual subsystems to continue their operations in an autonomous mode in a situation of failure of other parts of the system. Its contribution to system resilience, and the corresponding ability to prevent serious system collapses and accidents, is strongly similar to the incremental application of resources discussed above.

Resilience related to the Physical dimension is impacted by the following characteristics: 1) Cellular structure. Cellular structure is heavily employed in nature, combined with the ability of cells to divide and replace lost cells. This concept of nature constitutes a superior recuperative ability, as other cells can take over functions from lost cells while new cells are created to replace the lost ones. It is obvious that such a structure provides for extensive resilience and increased system survivability. 2) Mobility. Mobility is widely used in nature as a survival strategy. In military strategy, the importance of mobility for increased resilience and survivability is well accepted. Technical systems may also benefit from mobility in order to avoid unexpected and adverse external strains and threats. 3) «Loose coupling». This term was introduced by Perrow (1984) and focuses on how subsystems or parts are interconnected within the system (see the discussion above). A «loosely coupled» system tends to incorporate buffer capabilities between various stages in the system process, thus increasing the ability to locally absorb adverse effects of failures and prevent an uncontrolled cascading development.
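To illustrate how the two halves of the approach — proneness and «immunity» — could be combined, the sketch below pairs the proneness scorer above with an immunity score over the resilience characteristics just discussed, and nets the two. The combination rule (proneness x (1 - immunity)) and all numbers are assumptions for illustration, not a formula from the paper:

    # Hedged sketch: netting a proneness score against an "immunity" score
    # built from the resilience characteristics above. The combination rule
    # and all ratings are illustrative assumptions, not the paper's formula.
    RESILIENCE_CHARACTERISTICS = [
        "low external dependency", "diversity", "functional redundancy",     # System
        "reversibility", "incremental resources", "hierarchical embedding",  # Operational
        "cellular structure", "mobility", "loose coupling",                  # Physical
    ]

    def immunity_score(ratings: dict) -> float:
        """Unweighted mean of 0-1 ratings over the nine characteristics."""
        return sum(ratings[c] for c in RESILIENCE_CHARACTERISTICS) / len(
            RESILIENCE_CHARACTERISTICS)

    def net_hazard_potential(proneness: float, immunity: float) -> float:
        """Assumed rule: residual potential = proneness x (1 - immunity)."""
        return proneness * (1.0 - immunity)

    ratings = {c: 0.5 for c in RESILIENCE_CHARACTERISTICS}
    print(f"net hazard potential: "
          f"{net_hazard_potential(0.68, immunity_score(ratings)):.2f}")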
Accident prevention strategies - specialised hazard protection mechanisms
The general energy model explains an accident as being caused by an uncontrolled release of energy, while injuries to personnel are caused by transfer of energy in excess of body injury thresholds. From this basic model Haddon (1980) developed a number of accident prevention strategies within three groups, focusing on the Energy source, Barriers and the Victim.
Strategies related to the energy source include the following:
- prevent build-up of energy
- modify the characteristics of the energy
- limit the amount of energy
- prevent uncontrolled release of energy
- modify the rate and concentration of released energy.
These strategies aim at eliminating or reducing the amount of energy subject to potential uncontrolled release, and the corresponding potential for damage or injuries in situations of system failure. This group of strategies involves basically the system itself and associated operations.
Strategies related to barriers include the following:
- separate the source of energy and potential victims in time or space (e.g. safety zones, time slots)
- separate by means of physical barriers (e.g. protection covers, fences).
These strategies aim at preventing potential victims from being in the proximity if and when energy might be released, thus preventing personnel from experiencing injuries. This group of strategies involves not only the system, but the related safety organisation, procedures and measures as well.

Strategies related to the victim include the following:
- improve the victim's ability to endure an energy flow (e.g. safety glasses, gloves, boots, etc.)
- limit the development of injury (e.g. rescue and first aid organisation, assets and capability)
- stabilise, repair and rehabilitate (e.g. medical treatment, hospital capabilities).
These strategies aim at limiting the injuries if an accident, despite the «immunity» and safety defence mechanisms incorporated in the system, does occur. This group of strategies involves not only the system and the related organisation, procedures and safety assets, but also the general public infrastructure regarding rescue and health services and capabilities.
CONCLUSIONS
Conventional safety analysis techniques and methodologies do not address safety hazards of a system caused by unintended and hidden system deficiencies. The approach presented in this paper addresses just this type of safety hazard in technical systems. The approach results from a theoretical and logical deduction process, based on the author's viewpoints regarding general human fallibility, basic limitations of human mental capabilities, and technical systems as products of human processes. Although the approach was tested on a specific system as part of the deduction process, it is not backed by any experience. The author acknowledges that the approach needs critical discussion, evaluation and further refinement, as well as systematic testing, verification and validation activities, before any adequacy and validity as a practical tool may be claimed. It is the author's hope that this will be considered an interesting challenge by the system safety research environment.
REFERENCES
Foster, Harold D. (1993). Resilience Theory and System Evaluation. NATO ASI (Advanced Science Institutes) Series F: Computer and Systems Sciences, Vol. 110 (Verification and Validation of Complex Systems: Human Factors Issues), 35-61.
Haddon, W. (1980). The basic strategies for reducing damage from hazards of all kinds.
Meister, David (1991). Psychology of System Design. Elsevier, New York.
Perrow, Charles (1984). Normal Accidents. Living with High-Risk Technologies. Basic Books, New York.
Reason, James (1988). Resident Pathogens and Risk Management. Paper given to the First World Bank Workshop on Safety Control and Risk Management, Washington DC.
Rosness, Ragnar (1991). Vulnerability in Complex Systems - Directions for Research. Paper presented at NTNU Continuing Education Course in Safety Management, Trondheim 1993.
Wahlström, Björn (1990). Vulnerability, environmental accidents and communication. Invited paper presented at the 9th Nordic Conference for Accident Researchers - NOFS90, Svalbard 1990.
A8" Expert Judgement in Safety Assessments
KEEJAM: A KNOWLEDGE ENGINEERING METHODOLOGY FOR EXPERT JUDGMENT ACQUISITION AND MODELING IN PROBABILISTIC SAFETY ASSESSMENT
G. Cojazzi, G. Guida*, L. Pinola, R. Sardella°, P. Baroni*
European Commission, Joint Research Centre, Ispra, ISIS, Italy
*Università degli Studi di Brescia, DEA, Brescia, Italy
°Università degli Studi di Bologna, DIEM, Bologna, Italy
ABSTRACT
In this paper a novel expert judgment methodology, called KEEJAM (Knowledge Engineering Expert Judgment Acquisition and Modeling), which is intended to be applicable to a variety of expert judgment elicitation contexts, is described. The KEEJAM methodology takes as input the definition of the application domain of interest and the specification of the expert judgment tasks to be faced. This approach provides structured and disciplined support to the normative expert, namely the knowledge engineer, in eliciting the knowledge and the reasoning strategies of the experts, building consistent knowledge models, and applying such models to the solution of the expert judgment tasks considered. The KEEJAM methodology is organized into five phases: start-up, design, knowledge acquisition and modeling, exploitation and refinement, synthesis and release. The main features of each phase are described in the paper, and practical suggestions for tailoring the KEEJAM methodology to concrete cases are provided. The KEEJAM methodology is currently being applied in the framework of a European Benchmark Exercise on Expert Judgment Techniques in PSA level 2 (Cojazzi et al., 1996).
KEYWORDS
Probabilistic Safety Assessment, Expert Judgment, Knowledge Engineering, Knowledge Acquisition and Modeling.
INTRODUCTION
Among the primary objectives of Probabilistic Safety Assessment (PSA) of complex industrial systems, such as nuclear power plants, is the representation of the overall risks related to their existence and operation. Due to the complexity of the systems under analysis, a large amount of information is inevitably involved. Different types of knowledge sources must be considered and integrated to obtain a global assessment, and the inherent uncertainty that affects the conclusions reached must be correctly quantified. Expert judgment is generally needed to identify, interpret and process the available information and the related uncertainty. As a matter of fact, structured expert judgment, substantiated by adequate rationales, is required and accepted as a major source of information when:
• no experimental data or sufficiently validated computer codes are available for predicting the occurrence and the evolution of the phenomena of interest (incomplete knowledge of the governing phenomena);
• the interpretation of available data or the applicability or validity of current models and codes are questioned.
In such cases, expert judgment use is one of the most critical tasks within the overall safety assessment process, and EJ methods must allow expert judgment to be included in an auditable fashion (IAEA, 1994). In fact, the credibility of the overall assessment heavily depends on the traceability, scrutability and accountability of the expert judgment process; therefore, specific methodologies for formal and structured expert judgment have been designed and applied (USNRC, 1990), (Cooke, 1991), (Mosleh and Apostolakis, 1984), (Kaplan, 1992), (Sandri, 1991), (Pulkkinen, 1994). Since the pioneering work performed for the completion of the NUREG-1150 study (USNRC, 1990), structured expert judgment has been proposed and used in a variety of studies, spanning from waste repository studies (Bonano et al., 1990) to probabilistic consequence analysis (Harper et al., 1995). Notwithstanding the success gained by those EJ applications, the analysis of the characteristics of EJ in the PSA domain has suggested the possibility of a completely different approach to structured expert judgment, taking advantage of methods and techniques developed in the field of knowledge engineering. In order to properly design a new methodology, a set of requirements for a knowledge-based approach to EJ acquisition was first developed in (Guida et al., 1996). Starting from that initial work, this paper describes a first version of the Knowledge Engineering Expert Judgment Acquisition and Modeling (KEEJAM) methodology for structured expert judgment in PSA.

The paper is organized as follows. Section 2 presents the motivations underlying the proposed approach and Section 3 discusses the main objectives pursued. Section 4 introduces the basic definition of KEEJAM, while the detailed organization of the methodology into phases and tasks is presented and discussed in Section 5. Some concluding remarks are given in Section 6.
MOTIVATIONS

The perspective of the proposed EJ methodology is that expert judgment is a knowledge problem and, hence, a correct treatment of the expert judgment issue should consider in detail the knowledge the experts base their judgments upon. The focus of attention is thus shifted from the final numerical opinions of the experts back to the knowledge, the hypotheses, the data, and the assumptions the experts employ to formulate their judgments.

Considering expert judgment a knowledge problem implies that, first of all, the experts' knowledge and the problem-solving strategies applied to solve the problem at hand are acquired. Then, in order to exploit this knowledge, it is necessary to formalise the different problem-solving approaches proposed by the experts and build up self-consistent models. The resulting knowledge models constitute an explicit representation of the state of the art in the field of interest, which is made available to the user in a scientifically communicable, correctable and updatable form. Then, by means of an analysis of the knowledge models, the root causes of the differences in the experts' final numerical estimates can be identified. Differences due to misunderstandings or trivial calculation errors can be resolved, and expert consensus can be reached; otherwise, it is possible to distinguish between actual disagreements and simply alternative points of view. Hence, the aggregation of the different experts' opinions can be made by means of an integration of the different data and reasoning mechanisms exploited by the experts, according to the analysis of their alternative points of view and actual semantics.

The analysis of the knowledge models also supports the identification of the various forms of imperfection (imprecision, vagueness, uncertainty, etc.) that characterise the knowledge itself. As a consequence, it is
possible to justify the selection of a suitable formalism for uncertainty representation in the specific case at hand, according to the characteristics of the domain, and avoiding any a-priori superimposition of a particular uncertainty formalism. Some of the main high-level objectives of the Knowledge Engineering Expert Judgment Acquisition and Modeling (KEEJAM) methodology are reported hereafter.
Modeling of the knowledge behind expert judgments. KEEJAM focuses not only on the elicitation and combination of the experts' numerical judgments, but provides an explicit and formal representation of the knowledge and of the reasoning processes the experts use to derive such results. How the experts reason, the methods they use to solve the problem under study, the assumptions and decisions they make, the knowledge they exploit in their reasoning and all relevant justifications are acquired and formalised in executable knowledge models.

Representation of uncertainty. Starting from the analysis of the explicit, formalised knowledge models, KEEJAM aims at identifying, on a case-by-case basis, a representation for uncertainty which is justified by the specific characteristics of the domain knowledge at hand. The uncertainty representation should be cognitively plausible, i.e., it should respect the actual semantics as intended by the domain experts.
Justification. KEEJAM aims at providing deep justifications of the results achieved, both from the normative and the substantive points of view. The deep reasons that motivate and justify (both technically and legally) the results obtained are brought to light, and the models and the assumptions that are behind the achievement of a given result are made explicit in detail. Moreover, justifications are used to support a critical comparison and integration of the results obtained from the different experts.
Integration of knowledge sources. KEEJAM aims at making explicit the reasons behind the different experts' results; these may stem from a variety of causes (use of different codes, different inputs, etc.). The analysis of these reasons permits more structured forms of aggregation than the mere numerical treatment of the different judgments.
Scrutability and maintainability. KEEJAM provides explicit and formal (executable) models of the experts' problem solving, leading to a deep scrutability of the results and to the possibility of easy maintenance and updating of the models themselves. Knowledge engineering can offer suitable techniques for eliciting and modeling the experts' knowledge; an introduction to knowledge acquisition and modeling techniques can be found in: (Barr and Feigenbaum, 1981), (Kidd, 1987), (Greenwell, 1988), (Diaper, 1989), (Boose, 1990), (Scott and Clayton, 1991), (Ford and Bradshaw, 1993).

In synthesis, the proposed new methodology for a structured and formal expert judgment process, based on a knowledge engineering approach, aims at achieving two fundamental objectives:
o developing explicit and formal models of expert knowledge and reasoning that can be employed to reproduce the results of expert judgment in a mechanical way, through a transparent and justified process;
o modeling the domain knowledge uncertainties with the aim of obtaining an effective uncertainty representation, cognitively plausible and, above all, justifiable from the substantive point of view.
BASIC CONCEPTS AND PHASES DEFINITION
The KEEJAM methodology is structured into phases and tasks. The methodology is defined in general terms in order to be applicable to a variety of expert judgment acquisition contexts. Therefore, KEEJAM explicitly includes a preliminary phase (Phase 1, see later) devoted to the verification of the prerequisites for applicability and to the tailoring of the methodology to the specific case considered. The KEEJAM methodology is globally aimed at developing conceptual models of expert knowledge and reasoning in the application domain of
interest. A conceptual model is an explicit representation of the knowledge relevant to the application domain, and of the ways this knowledge is used to solve the class of problems of interest (Hayes-Roth et al., 1983), (Guida and Tasso, 1994). The KEEJAM methodology includes five phases:
• Phase 1: Start-up is devoted to a preliminary analysis of the application domain, to verify the feasibility and appropriateness of the knowledge-based approach, to define the requirements of the knowledge and reasoning models to be developed, to tailor the methodology to the case at hand, to define the project team, and to plan the methodology application.
• Phase 2: Design is aimed at defining appropriate techniques for the representation of the types of knowledge and reasoning strategies relevant to the application domain of interest, also including the treatment of the imperfections that may affect knowledge and reasoning.
• Phase 3: Knowledge acquisition and modeling is devoted to acquiring knowledge from the identified knowledge sources (domain experts, written documents and real-world contexts) and to developing a domain conceptual model that meets the stated requirements.
• Phase 4: Exploitation and refinement is devoted to exploiting the conceptual model developed for carrying out the assigned set of expert judgment tasks.
• Phase 5: Synthesis and release collects the results obtained and produces a suitable documentation of the work done.
In addition, the structuring of KEEJAM also allows the implementation of quality control and assurance processes.
METHODOLOGY ORGANIZATION

The basic organization of KEEJAM into phases and tasks is reported below. This structure should be considered as a reference framework.

INPUT: definition of the application domain; definition of the expert judgment tasks

1. START-UP
1.1 familiarization with the application domain and domain experts
1.2 preliminary analysis of the application domain through interview of domain experts
1.3 assessment of the feasibility and appropriateness of a knowledge engineering approach
1.4 definition of conceptual modeling requirements
1.5 selection of knowledge sources
1.6 definition of documentation standards
1.7 definition of quality control and assurance processes
1.8 definition of project team
1.9 tailoring of the methodology
1.10 work planning
OUTPUT: feasibility assessment; conceptual modeling requirements; knowledge sources; documentation standards; quality control and assurance processes; project team; tailored methodology; work plan

2. DESIGN
2.1 domain analysis and design/choice of techniques for representing knowledge and reasoning
2.2 analysis of the various forms of imperfection that affect knowledge and reasoning and design/choice of techniques for uncertainty representation and processing
2.3 design/choice of suitable conceptual modeling languages
2.4 testing of the selected conceptual modeling languages with domain experts and refinement
2.5 definition of specific requirements for knowledge acquisition and modeling
OUTPUT: basic technical choices; conceptual modeling languages and knowledge acquisition and modeling requirements

3. KNOWLEDGE ACQUISITION AND MODELING
3.1 knowledge acquisition planning
3.2 knowledge acquisition and modeling
    repeat
    o knowledge elicitation from knowledge sources
    o knowledge modeling and representation
    o knowledge integration
    o knowledge validation and refinement with domain experts
    until the conceptual model meets the stated requirements
3.3 refinement and validation of conceptual model with domain experts
OUTPUT: conceptual model

4. EXPLOITATION AND REFINEMENT
for each expert judgment task:
4.1 task execution and expert judgment formulation
4.2 development of justifications
4.3 critical analysis with domain experts
4.4 final model refinement and validation with domain experts
4.5 result collection and documentation
OUTPUT: expert judgment formulation for the tasks considered and justification

5. SYNTHESIS AND RELEASE
5.1 documentation collection and organization
5.2 quality records collection and synthesis
5.3 critical analysis of the work done
5.4 writing of the final technical report
OUTPUT: complete documentation; quality certification

The following remarks provide some comments and notes for a better and more concrete understanding of the KEEJAM methodology defined above.

Phase 1 (Start-up) is devoted to defining the detailed technical bases for the expert judgment acquisition project.

Task 1.2 (Preliminary analysis of the application domain through interview of domain experts) focuses on:
o analysis of the domain knowledge;
o analysis of the expert judgment tasks considered;
o analysis of the problem-solving strategies adopted by the experts.

Task 1.3 (Assessment of the feasibility and appropriateness of a knowledge engineering approach) aims at assessing whether the application domain can be considered knowledge intensive, i.e., whether the domain experts rely on large and varied knowledge sets, and whether a structured collection of domain knowledge and problem-solving strategies is of actual interest for the final EJ user.
Task 1.4 (Definition of conceptual modeling requirements) considers the basic conceptual modeling requirements, such as the purpose, the scope, the suitable granularity of the model and the knowledge epistemological types, and the requirements of the language to be used for the formalization of the conceptual model.

Task 1.5 (Selection of knowledge sources) aims at identifying the knowledge sources to be exploited in the project, namely: domain experts, written documents, and real-world contexts (Guida and Tasso, 1994). The analysis of the available knowledge sources takes into account the following main attributes:
o quality and cost of a knowledge source;
o coverage, with respect to the application domain considered;
o necessity of the knowledge source in order to provide essential domain or problem-solving knowledge;
o availability, especially important for domain experts.
The selection then aims at:
o covering all parts and facets of the application domain;
o ensuring an adequate quality of the elicited knowledge;
o minimizing knowledge acquisition costs (effort and time);
o ensuring a smooth and effective knowledge acquisition process through a sufficient level of availability of the selected knowledge sources and a limited degree of redundancy.

Task 1.8 (Definition of the project team) defines the project team completely: the role of each team member, their function and their expected engagement in the project.

Task 1.10 (Work planning) focuses on the scheduling of phases and tasks, the definition of intermediate results, the responsibilities, the milestones and the control actions, including possible re-planning.

Phase 2 (Design) deals with the basic design choices that will drive the subsequent activities of knowledge acquisition and conceptual modeling.

Task 2.3 (Design/choice of suitable conceptual modeling languages) aims at defining specific representation languages appropriate for the application domain considered (Barr and Feigenbaum, 1981). It should rely as much as possible on previous proposals and experience, in such a way as to support standardization and reuse.

Phase 3 (Knowledge acquisition and modeling) is aimed at building a conceptual model of the application domain of interest, encompassing both domain and problem-solving knowledge. The conceptual model developed may consist of paper models or computer-based models, according to the characteristics of the case at hand and to the user needs envisaged.

Task 3.1 (Knowledge acquisition planning) aims at defining the knowledge acquisition plan, containing a precise specification of all activities relevant to the knowledge acquisition task, their logical organization, and their temporal scheduling. Each knowledge acquisition activity is identified by the following information (see the sketch below):
o goal to be attained: knowledge elicitation, modeling, representation, integration, refinement, validation;
o scope: specific topics to be considered;
o involved knowledge source;
o knowledge elicitation technique to be used;
o supporting materials to be exploited;
o scheduled date and place of execution of the knowledge acquisition activity.
Generally, a knowledge acquisition activity has one goal, involves one knowledge source, and uses one elicitation technique. Composite knowledge acquisition activities involving more goals, knowledge sources and elicitation techniques are always intricate and rarely effective.
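To make the structure of a plan entry concrete, the activity attributes listed above can be captured in a small record type. The following Python sketch is purely illustrative — the paper prescribes no data format, and all class names, field names and example values are our own invention:

```python
from dataclasses import dataclass, field
from enum import Enum

class Goal(Enum):
    # The six goals a knowledge acquisition activity may pursue (Task 3.1).
    ELICITATION = "knowledge elicitation"
    MODELING = "knowledge modeling"
    REPRESENTATION = "knowledge representation"
    INTEGRATION = "knowledge integration"
    REFINEMENT = "knowledge refinement"
    VALIDATION = "knowledge validation"

@dataclass
class AcquisitionActivity:
    """One entry of the knowledge acquisition plan defined in Task 3.1."""
    goal: Goal                 # one goal per activity, as the text recommends
    scope: str                 # specific topics to be considered
    source: str                # domain expert, written document or real-world context
    technique: str             # knowledge elicitation technique to be used
    materials: list = field(default_factory=list)  # supporting materials
    when_and_where: str = ""   # scheduled date and place of execution

# Example plan entry (all values invented for illustration):
activity = AcquisitionActivity(
    goal=Goal.ELICITATION,
    scope="severe accident progression phenomena",
    source="domain expert #1",
    technique="structured interview",
    materials=["plant documentation extract"],
    when_and_where="1996-10-03, JRC Ispra",
)
```

Keeping one goal, one source and one technique per record mirrors the text's warning that composite activities are rarely effective.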
General criteria to be taken into account in the design of a knowledge acquisition plan include: alternating knowledge acquisition activities with different goals and different knowledge acquisition techniques, and allocating enough time between
knowledge acquisition activities in order to achieve a deep understanding of the knowledge acquired and to permit a critical analysis of the representations produced. Task 3.1 may also refine, if appropriate, the selection of knowledge sources carried out in Task 1.5.

Task 3.2 (Knowledge acquisition and modeling) is the core step of Phase 3 and of the whole KEEJAM methodology. It is of primary importance that the knowledge engineers in charge of it master a sufficiently large set of knowledge acquisition techniques - see (Kidd, 1987), (Greenwell, 1988), (Diaper, 1989), (Scott and Clayton, 1991), (Firley and Hellens, 1991), (Ford and Bradshaw, 1993).

Phase 4 (Exploitation and refinement) concerns the execution of the conceptual model to derive expert judgment results. The execution may be manual, if the conceptual model is composed of paper-executable models that are sufficiently simple and small, or computer-supported, if the knowledge models are computer-based, large and complex.

Task 4.2 (Development of justifications) is a fundamental step of Phase 4. In fact, one of the main reasons for a knowledge-based approach to expert judgment is having clear and well-founded justifications of the results obtained. Justifications are an essential component of problem solving based on a knowledge-level model. In the expert judgment process, these justifications represent a crucial quality factor; they are the basis for assuring some very important properties of the results obtained, such as coherence, verifiability and scientific defensibility.

Task 4.4 (Model refinement and validation with domain experts) is also an important step of Phase 4. In fact, when models are put to work, many adjustments and refinements may emerge which can significantly improve the applicability of the conceptual model to the considered cases and the quality of the results. Of course, any modification and extension of the models should be discussed with domain experts.

Phase 5 (Synthesis and release) aims at carrying out a final critical analysis of the obtained results, paying particular attention to quality issues. The final project documentation is produced as well.
CONCLUSIONS

In this paper a first high-level definition of KEEJAM, a Knowledge Engineering methodology for Expert Judgment Acquisition and Modeling in PSA, has been presented. Differently from other methodologies, KEEJAM focuses on the acquisition and modeling of the knowledge underlying the formulation of the judgments provided by the experts. In fact, in our opinion, only by resorting to explicit knowledge and reasoning models is it possible to assure such fundamental properties as validity, transparency and justification of the results of the whole EJ process. Expected benefits deriving from the application of the KEEJAM methodology include:
o explicit and formal modeling of the knowledge and reasoning on which expert judgments are based;
o cognitively plausible and substantively justified representation of knowledge imperfections;
o the possibility of aggregating the opinions of the experts by integrating several different knowledge sources (domain experts, written documents, real-world contexts) into a unitary and consistent conceptual model;
o the possibility of incremental refinement of the models developed;
o justification of expert judgment results, in terms of both the knowledge and the reasoning strategies used to derive them.
On the other hand, some weak points are foreseen:
o KEEJAM application is expected to be rather complex and costly, requiring substantial effort and time;
o KEEJAM requires as a fundamental resource the availability of recognized domain experts, actually available to cooperate for long periods and capable of carrying out hard introspective work with
the knowledge engineer; the cooperation of domain experts is a key point for the success of a knowledge engineering project and, in some cases, it may be difficult to obtain, thus constituting a possible bottleneck.
The KEEJAM methodology is currently being applied in the framework of a European Benchmark Exercise on Expert Judgment Techniques in PSA level 2 (Cojazzi et al., 1996). The application of the methodology to this case and the evaluation of the results will be the subject of future work.
REFERENCES
Barr, A. and Feigenbaum, E. A. (1981). The Handbook of Artificial Intelligence, Kaufman, Los Altos, California, USA, 141-222.
Bonano, E. J., Hora, S. C., Keeney, R. L. and von Winterfeldt, D. (1990). Elicitation and Use of Expert Judgment in Performance Assessment for High-Level Radioactive Waste Repositories, United States Nuclear Regulatory Commission, NUREG/CR-5411, Washington DC, USA.
Boose, J. H. and Gaines, B. R. (Eds.) (1990). The Foundations of Knowledge Acquisition - Knowledge-Based Systems, Vol. 4, Academic Press, London, UK.
Cojazzi, G., Pinola, L. and Sardella, R. (1996). The JRC Ispra Benchmark Exercise on Expert Judgment Techniques in PSA Level 2: Design Criteria and General Framework, Proc. of Int. Topical Meeting on Probabilistic Safety Assessment (PSA '96), Sep. 29 - Oct. 3, 1996, Park City, Utah, USA.
Cooke, R. M. (1991). Experts in Uncertainty, Oxford University Press, New York.
Diaper, D. (Ed.) (1989). Knowledge Elicitation: Principles, Techniques and Applications, Ellis Horwood, Chichester, UK.
Guida, G., Baroni, P., Cojazzi, G., Pinola, L. and Sardella, R. (1996). Preliminary Requirements for a Knowledge Engineering Approach to Expert Judgment Elicitation in Probabilistic Safety Assessment, in P. C. Cacciabue and I. A. Papazoglou (Eds.), Proc. of ESREL '96 - PSAM III International Conference on Probabilistic Safety Assessment and Management, Crete, Greece, 1996, Springer-Verlag, London, UK.
Guida, G. and Tasso, C. (1994). Design and Development of Knowledge-Based Systems: From Life Cycle to Methodology, John Wiley & Sons, Chichester, UK.
Firley, M. and Hellens, M. (1991). Knowledge Elicitation - A Practical Handbook, Prentice-Hall, London, UK.
Ford, K. and Bradshaw, J. M. (Eds.) (1993). Knowledge Acquisition as Modeling, John Wiley & Sons, Chichester, UK.
Greenwell, M. (1988). Knowledge Engineering for Expert Systems, Ellis Horwood, Chichester, UK.
Harper, F. T., Hora, S. C., Young, M. L., Miller, L. A., Lui, C. H., McKay, M. D., Helton, J. C., Goossens, L. H. J., Cooke, R. M., Pasler-Sauer, J., Kraan, B. and Jones, J. A. (1995). Probabilistic Accident Consequence Uncertainty Analysis. Dispersion and Deposition Uncertainty Assessment. Main Report, US NRC, NUREG/CR-6244, EC EUR 15855 EN, SAND94-1453.
Hayes-Roth, F., Waterman, D. and Lenat, D. (1983). Building Expert Systems, Addison-Wesley, Reading, Massachusetts, USA.
IAEA (1994). Procedures for Conducting Probabilistic Safety Assessments of Nuclear Power Plants (Level 2), International Atomic Energy Agency, Safety Series No. 50-P-8, Vienna, Austria.
Kaplan, S. (1992). 'Expert information' versus 'expert opinions'. Another approach to the problem of eliciting/combining/using expert knowledge in PRA, Reliability Engineering and System Safety, 35, 61-72.
Kidd, A. (Ed.) (1987). Knowledge Acquisition for Expert Systems: A Practical Handbook, Plenum Press, New York.
Mosleh, A. and Apostolakis, G. (1984). Models for the Use of Expert Opinions, in Low-Probability/High-Consequence Risk Analysis: Issues, Methods, and Case Studies, R. A. Waller and V. T. Covello (Eds.), Plenum Press, New York.
Pulkkinen, U. (1994). Statistical Models for Expert Judgment and Wear Prediction, VTT Publications 181, Technical Research Centre of Finland, Espoo, 65 p. + app. 80 p.
Sandri, S. (1991). La Combinaison de l'Information Incertaine et ses Aspects Algorithmiques, Thèse présentée à l'Université Paul Sabatier, Toulouse, France.
Scott, A. C. and Clayton, J. E. (1991). A Practical Guide to Knowledge Acquisition, Addison-Wesley, Reading, MA.
USNRC (1990). Severe Accident Risks: An Assessment for Five U.S. Nuclear Power Plants, United States Nuclear Regulatory Commission, NUREG-1150, Vol. 1, Washington DC, USA.
A PRACTICAL CASE OF ASSESSING SUBJECTIVE PROBABILITIES: A DISCUSSION OF CONCEPTS AND EVALUATION OF METHODS
L. Berg Andersen 1, T. Nilsen 2, T. Aven 3, A. Guerneri 4 and R. Maglione 4
1 Well Service Technology A/S, P.O. Box 61, 5061 Kokstad, Bergen, Norway
2 RF-Rogaland Research, Stavanger, Norway
3 Stavanger University College, Stavanger, Norway
4 Agip SpA, Milan, Italy
ABSTRACT
The main objectives of this paper are to present and discuss project experiences concerning a subjective probability assessment process and thus contribute to a sound basis for future quantitative risk analysis (QRA) in a full Bayesian setting. The paper clarifies some of the most important basic concepts and describes factors that significantly affect the credibility of the expert evaluations. Problems and solutions related to alternative ways of transferring the expert information into figures are discussed on the basis of practical cases of assessing subjective probabilities. These cases demonstrated the possibility and the potential of using experts for assigning probabilities and thus transferring knowledge into probability values. However, some well known problems associated with assigning subjective probabilities were recognised, i.e. superficiality and lack of precision. The problem of imprecision increased significantly when evaluating events associated with the lower part of the probability scale. Our experiences suggest that the main causes of superficiality and lack of precision were related to the expert's difficulties in expressing his degree of belief in small figures, lack of training in assigning probabilities, and problems in maintaining the expert's motivation.
KEY WORDS
Subjective probability, uncertainty, probability assessment, risk analysis, decision making, blowout risk, expert judgement.
INTRODUCTION
This paper is based on the experiences gained through the R&D project "Stochastic Modelling for the Quantification of Kick and Blowout Risk During Exploration Drilling" (Berg Andersen and Aven, 1996). The project aims to develop a computerised risk management tool that strengthens the basis for decision making regarding well control problems in the planning phase of new exploratory wells.
Well specific parameters must be considered on a fairly detailed level in order to establish risk models that support the management's decision making in the planning of new exploration wells. Consequently, the
complexity of the risk model increases and the available amount of hard data for assessing coherences and probabilities decreases. Berg Andersen and Pedersen (1996) concluded that a subjectivistic probability model should be adopted when modelling for the quantification of well specific risks related to the kick phenomenon (uncontrolled flow of hydrocarbon from the formation to the wellbore). Systematised drilling experience, in terms of available hard data and expert judgements, was to be combined in a full Bayesian setting. The model conclusions are expressed by a well specific kick probability, interpreted as a degree of belief.
Objectives and Scope of Work

The main objectives of this paper are to present and discuss project experiences concerning a subjective probability assessment process and thus contribute to a sound basis for future quantitative risk analysis (QRA) in a full Bayesian setting. Major challenges recognised when applying a subjective probability model in a QRA are unclear definitions of basic concepts, understanding the factors affecting the credibility of the expert judgements, and the lack of proper probability assessment methods. This paper seeks to clarify some of the most important basic concepts and describe factors that significantly affect the credibility of the expert evaluations. Finally, problems and solutions related to alternative ways of transferring the expert information into figures are discussed on the basis of practical cases of assessing subjective probabilities.
BASIC CONCEPTS
Probability and Frequency

Let A denote the occurrence of an accidental event and X the number of accidental events occurring. In the classical approach we assume that there exist values p and m such that p equals the true probability of A and m equals the expected value (frequency) of the random variable X, i.e. p = P(A) and m = EX. The probability p represents the relative fraction of times the event A occurs if the situation analysed were hypothetically repeated an infinite number of times. The expected value m represents the mean of X if the situation analysed were hypothetically repeated an infinite number of times. The true values p and m are objective but unknown characteristics of the event A and the quantity X, respectively. Thus these unobservable parameters are to be estimated in the risk analysis, using models like fault trees and event trees. In the classical approach the uncertainty notion relates to the accuracy of the estimates of p and m, i.e. the gap between the unobservable assumed true value and the estimate generated by the risk analysis.
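As a reading aid only (no such example appears in the original text), the classical interpretation can be mimicked numerically: the hypothetical infinite repetition is replaced by a large but finite number of simulated repetitions, and the classical uncertainty is then the gap between the estimates and the assumed true values. All figures in this Python sketch are invented:

```python
import random

random.seed(1)

p_true = 0.02   # assumed true probability p = P(A), unknown in practice
m_true = 0.5    # assumed true expected number of events m = EX per period
n = 10_000      # finite stand-in for the hypothetical infinite repetition

# Relative fraction of repetitions in which A occurs -> estimate of p.
p_hat = sum(random.random() < p_true for _ in range(n)) / n

def count_events(rate):
    # Events in one period, counted via exponential inter-arrival times.
    t, k = random.expovariate(rate), 0
    while t <= 1.0:
        k += 1
        t += random.expovariate(rate)
    return k

# Mean of the observed counts X -> estimate of m.
m_hat = sum(count_events(m_true) for _ in range(n)) / n

# In the classical approach, "uncertainty" refers to these estimation gaps,
# not to the occurrence of A itself:
print(f"p_hat = {p_hat:.4f} (assumed true p = {p_true})")
print(f"m_hat = {m_hat:.4f} (assumed true m = {m_true})")
```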
A subjective probability represents a degree of belief. In the Bayesian approach we use the probability notion to express the uncertainty related to the values of observable quantities such as A and X. Hence, in a full Bayesian setting the meaning of uncertainty is completely different from what we mean by uncertainty in the classical model. What is uncertain is the occurrence of the event A and the value of X, and the probabilities P(A) and P(X = x) express this uncertainty. In a full Bayesian setting there is no sense in speaking about the uncertainty of the probability P, because such reasoning would presuppose the existence of a true value of P.
Expert Judgements

Judgements are inferences or evaluations that go beyond obvious statements of fact, data, or the conventions of a discipline. Factual judgements are beliefs or opinions about propositions that can, in principle, be proven to be right or wrong. Value judgements are expressions of preferences among alternatives based on tradeoffs and priorities. Because values are inherently personal, value judgements cannot be proven right or wrong. Most real judgements mix factual and value elements (Otway and von Winterfeldt, 1992).
Many judgements require a special type of expertise. Making inferences about parameters and assigning probabilities to events must be based on special knowledge of the technical system, of data and model results, and of the parameters and events relevant to the analysis. We call any judgement requiring special expertise an "expert judgement". Expert judgements can be the result of either informal or formal processes. Informal processes are implicit, unstructured, and undocumented; formal processes are explicit, structured, and documented.
Information

All probability assignments are conditional on some information. Thus, we write P(A|I) where I represents the information related to event A. Words and phrases like "wrong", "encumbered with errors", and the like have previously been used in connection with subjective probability assessments. The concept of information must, however, be defined before such words and phrases give meaning in a Bayesian setting.
In order to define the concept of information in a Bayesian setting, consider the following supposition. A "total" information It related to event A exists. If this information is available, event A will either be true or false. Hence P(A|It) is either 1 or 0. If A represents the occurrence of an accidental event, total information means that we know with 100 percent certainty how the equipment will perform and how the personnel involved will operate the equipment. Of course, in practice this will never be the case, since there will always be some uncertainty related to the outcome of future events.
The assessor's probability statement will be based on an information set Ia. This information set might be very small compared to the total information It and also of a different form. Based on the information Ia the assessor expresses his/her degree of belief regarding the outcome of an event by means of a probability. The level of information Ia can be appraised by others, and thus room is given for acceptance or rejection of the assessor's judgement. From the decision-maker's point of view, the information Ia does not necessarily represent the "best available information" related to event A. For example, the assessor might not have access to relevant failure data, or he might be emphasising insignificant factors, etc. The "best available information", denoted Ie, will be a reference level when the decision maker evaluates the credibility of the assessor. Hence, a large gap between Ie and Ia may lead to rejection of the expert judgement. On the other hand, if the decision maker considers the assessor to have the best available information (knowledge), Ie and Ia coincide.
The level of information Ie is defined through the qualities given to it by the person who wishes to evaluate the credibility of the assessor. Thus, we note that the interpretation of the "best available information" Ie related to an event is subjective and may vary from one evaluator to another.
FACTORS AFFECTING THE CREDIBILITY OF EXPERT JUDGEMENTS

Subjective probability is a measure of degree of belief, which reflects one's state of information. It is not only subjective but also variable, since it can change from one situation to another. In general it is not possible to obtain repeated independent measurements of subjective probability from the same individual, because he/she is likely to remember his/her previous thoughts and responses. Consequently, there are no procedures for the measurement of belief that permit the application of the law of large numbers to reduce measurement errors.
The difficulties involved in applying standard measurement criteria of reliability and validity to the measurement of belief give rise to the question of how to evaluate and improve assessments of subjective probability. Lindley et al. (1979) apply three types of criteria, called pragmatic, semantic (calibration) and syntactic. The pragmatic criterion, which is based on comparison with "true" values, is found irrelevant in a full Bayesian framework where true probabilities do not exist. The semantic criterion is of course relevant, but in a reliability/risk context the probabilities are typically small (rare events), which makes it difficult and most often impossible to carry out a meaningful calibration. Coherence is clearly essential if we are to treat assessments as probabilities and manipulate them according to probability laws; thus the syntactic criterion applies.
Several experts can be involved in assessing input probabilities to a risk model that systematises and pulls together expertise in different areas related to the same overall phenomenon. In a risk management context an evaluation of the analysis results requires reasonable consistency among the experts' degrees of belief and objective data ("reasonable" is, of course, a highly subjective statement), cf. the semantic and syntactic criteria. Consistency is important when the management evaluates whether the Bayesian risk analysis results are useful as a basis for improved decision making. However, the credibility of the expert judgements is perhaps even more important. Four main problem areas are identified as having the potential of decreasing the credibility of expert judgements:
1. a gap, as judged by the decision maker, between the assessor's state of knowledge Ia and the "best available information" Ie;
2. the decision maker considers the "best available information" Ie to be insufficient;
3. motivational aspects;
4. superficiality or randomness involved in transforming the assessor's state of knowledge into a numerical value.
If the decision maker considers the assessor's level of information (knowledge) to be significantly lower than the "best available information", he/she will not have confidence in the results. The decision maker will be sceptical of the assessor as an expert. Trying to use the best expertise available does not fully solve this problem, since in practice there will always be time and cost constraints. Sensitivity and criticality analyses should be used as guidelines for deciding when to call for additional expertise and/or a more comprehensive decomposition of the specific problem under analysis. Even if the expert is considered to have the "best available information", there could be a confidence problem. The decision maker may judge the best available information to be insufficient, and further studies are required to give a better basis for the probability assessment.
An expert may assess a probability that completely or partially reflects inappropriate motives rather than his deeply felt belief regarding a specific event's outcome. As an example, it is hard to believe that a sales representative on commission would make a completely unprejudiced judgement of two safety valves, one of which belongs to a competing firm. Another example is an engineer who was involved in the design process and later is asked to judge the probability of failure of an item he personally recommended to be installed. The engineer vouches for the item and assigns a very low failure probability.
The management may reject the sales representative's judgement without much consideration since they believe that inappropriate motives have influenced his judgement. The engineer's judgement might not be rejected just as easily since he obviously is a company expert in this area. On the other hand, incentives are present that might affect his probability assessment. Motivational aspects will always be an important part of evaluating the credibility and thus the usefulness of analyses that include expert judgements. In general we should be aware of the existence of incentives that often lead people to report probabilities that do not entirely reflect their true beliefs.
The fourth problem, related to superficiality or randomness, is partially solved through adopting appropriate probability assessment methods that help to avoid superficiality and to some extent compensate for the assessor's possible lack of feeling for numerical values. Different probability assessment methods are discussed in the next section.
People tend to use rather primitive cognitive techniques when assessing probabilities, i.e. so-called heuristics. Heuristics for assessing probabilities are easy and intuitive ways to deal with uncertain situations, in which the assessor unconsciously tends to put too much weight on insignificant factors; the main heuristics are availability, anchoring and adjusting, and representativeness (Tversky and Kahneman, 1974):
• The occurrence of events for which the expert can easily retrieve similar events from memory is likely to be given higher probabilities than the occurrence of events that are less vivid and/or completely unknown to the expert ("availability" heuristic)
• The expert tends to choose an initial anchor. Extreme points are then assessed by adjusting away from the anchor. One of the consequences is often a low probability for extreme outcomes ("anchoring and adjusting" heuristic)
• The expert assesses a probability by comparing his/her knowledge about the phenomenon with the stereotypical member of a specific category. The closer the similarity between the two, the higher the judged probability of membership in the category ("representativeness" heuristic)
Despite the fact that people are subject to such deficiencies, evidence suggests that individuals can learn to become good at assessing probabilities. The spheres of decision making and weather forecasting are both good examples of this.
METHODS FOR ASSESSING SUBJECTIVE PROBABILITIES
Why bother to use methods when assessing subjective probabilities? Basically, we could have the expert assess the probability directly by asking, "What is your belief regarding the probability that event A will occur?" If the expert is at all able to give an answer to a direct question like this, he/she may place little confidence in the answer given. Of course, when assigning probability values the expert should utilise his/her intuition regarding probabilities. However, people have different feelings for numerical values as well as varyingly developed intuition regarding probabilities. Specific methods were tested in order to find appropriate ways of quantifying the expert's degree of belief. The main purpose of the methods was to avoid superficiality and unwanted variability, ensure the credibility of the assessed values, and make the expert express numerically, as accurately as possible, his deep-felt belief in the outcome of events. The methods tested by means of expertise on offshore drilling operations were:
1. Betting
2. Lottery games
3. Direct probability assessment/Odds making
The Betting method is based upon finding amounts to win or lose that make the expert indifferent about betting for or against a specific event's occurrence. As an example, bets related to an event were formulated as:
Bet 1: Win X = 1000 dollars if the mud pump fails during drilling, else lose Y = 100 dollars.
Bet 2: Lose X = 1000 dollars if the mud pump fails during drilling, else win Y = 100 dollars.

Our approach was to offer bets that first favoured one side and then the other, gradually adjusting the payoffs in each round. The indifference point was found by adjusting the bet appropriately, making it more or less attractive depending on the expert's response to the previous bet. The probability related to pump failure in the above example can be expressed by:

P(pump failure) = Y/(Y+X) = 100/(100+1000) = 0.091

The second method to be tested was the comparison of two lottery-like games. As an example, the expert was asked to compare the lottery:

Win 100,000 dollars if the mud pump fails during drilling, else win nothing

with the reference lottery:

Win 100,000 dollars with known probability p = 0.1, and win nothing with known probability p = 0.9

The idea was to adjust the probability in the reference lottery until the indifference point was found. When the expert was indifferent about which lottery to choose, the subjective probability related to the event's occurrence had to be the p that made him indifferent. Adopting this method required that the expert was familiar with some sort of probability mechanism. Visualising the probabilities related to different prizes through the "wheel of fortune" was found to be an appropriate alternative. The expert's indifference point was found by adjusting the shaded sector of the wheel to represent the probability of winning in the reference lottery. Figure 1 shows an example of the computer program that was developed to serve as a probability assessment aid.
Figure 1: Windows from the probability assessment software, i.e. the betting and lottery approaches

The probability assessment software automatically changed the bets and the reference lottery in response to the expert's preferences. The wheel of fortune divides when the probability becomes less than 4%, i.e. less than an angle of 15 degrees. Thus, handling small probabilities by means of the wheel of fortune technique
required the assessor to consider two or more independent wheels. When the wheels were imagined rotated, the shaded portions that reflect the event's occurrence had to arrive at the arrow on all of the wheels.
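The adjustment schemes described above are easy to state in code. The Python sketch below is ours, not the authors' software: the function `expert_prefers_event_side` is a hypothetical stand-in for the expert's round-by-round answers, and the figures reuse the mud-pump example from the text.

```python
def bet_probability(x_win, y_lose):
    """Probability implied by indifference between Bet 1 and Bet 2:
    zero expected value gives p*X - (1 - p)*Y = 0, i.e. p = Y / (X + Y)."""
    return y_lose / (x_win + y_lose)

# Worked example from the text: X = 1000, Y = 100 dollars -> p ~ 0.091.
assert abs(bet_probability(1000, 100) - 100 / 1100) < 1e-12

def find_indifference(expert_prefers_event_side, lo=0.0, hi=1.0, tol=1e-3):
    """Adjust the reference-lottery probability p until the expert is
    indifferent (a bisection stand-in for the gradual adjustment)."""
    while hi - lo > tol:
        p = (lo + hi) / 2
        if expert_prefers_event_side(p):
            lo = p  # event lottery preferred: the expert's probability exceeds p
        else:
            hi = p  # reference lottery preferred: the expert's probability is below p
    return (lo + hi) / 2

# Small probabilities on the wheel of fortune: below about 4% the wheel
# divides, and the event is shown as independent wheels that must all "hit".
p_wheels = [0.05, 0.02]
p_event = 1.0
for p in p_wheels:
    p_event *= p  # joint probability of all wheels hitting: 0.001
```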
In practice both approaches turned out to be relatively time consuming. The expert even seemed to perceive the computerised probability assessment tools as boring, roundabout ways of transforming his knowledge into figures, and his motivation for carefully considering the bets and lotteries gradually evaporated. The reluctance related to the betting and lottery methods, in combination with a significant strengthening of the expert's feeling for numerical values, resulted in direct reference to the probability p or the odds against the event in question.
The following conclusions were made with respect to betting, lotteries and odds making as tools for assessing the expert's degree of belief:
• The betting and lottery approaches contributed to full utilisation of the probability scale and increased the expert's feeling for numerical values
• The betting and lottery approaches were time consuming processes that, after a while, promoted motivational problems and thus superficiality
• The probability mechanism in the betting method seemed hard for the assessor to understand
• Odds making and direct probability assessment were both time efficient approaches
• It proved difficult to fully utilise the probability scale when adopting the odds making and direct probability assessment approaches, i.e. only "even" numbers like 1 to 100, 1 to 500, 1 to 1000, etc. were assessed
Discussion
The practical cases of assessing subjective probabilities demonstrated the possibility and the potential of using experts to assign probabilities and thus transfer knowledge into probability figures. However, some well known problems associated with assigning subjective probabilities were recognised, i.e. superficiality and lack of precision. The problem of imprecision increased significantly when evaluating events associated with the lower part of the probability scale. Our experiences suggest that the main causes of superficiality and lack of precision were related to the expert's difficulties in expressing his degree of belief in small figures, lack of training in assigning probabilities, and problems in maintaining the expert's motivation.

A critical probability level, under which the expert found it hard to express his degree of belief in numbers, was recognised during the assessment sessions. When evaluating low probability events the expert struggled with the choice between probability figures that deviated by one or more orders of magnitude. Experiences from the assessment sessions suggest that this problem can be significantly reduced by training. In the early assessments the critical level at which the expert seemed to lose contact with the probability scale was approximately 0.01. At the end of the session this critical level was reduced by about one order of magnitude, to 0.001. Whether extensive training alone can fully solve this problem is, however, questionable. The relatively frequent appearance of low probability events (10^-3 to 10^-6) in traditional QRAs stimulates consideration of alternative approaches to this problem.

Introducing the concept of reference levels implies supplementing the numerical probability scale with a set of known reference events associated with commonly agreed probabilities. Supporting the expert in the assessment process with some sort of illustrative reference events for comparison purposes is considered a step in the right direction. How to interpret the reference level is, however, not obvious. In a full Bayesian approach we cannot interpret the reference levels as true probabilities. At this point the most natural way of interpreting the concept of reference levels that is not in conflict with the basic principles of the full Bayesian approach relates to some sort of "standardisation". We may interpret the application of the
reference levels as a way of increasing the expert's knowledge or level of information Ia. The probabilities assigned to the reference events must be interpreted as commonly accepted standard figures that reflect the uncertainties related to the events' outcomes. The application of a set of reference events towards the end of the assessment sessions was shown to provide a useful aid for the expert in assigning small probabilities. Further work is required concerning both the practical and theoretical aspects of "standardisation" in a Bayesian framework before any firm conclusions can be made. An alternative approach to the problem is to construct the risk models in such a way that the number of low probability events is minimised. In order to reduce the problem of assessing probabilities of rare events we may seek the optimal balance between a sufficient level of modelling detail for decision making purposes and the avoidance of a large number of low probability events.

Expressing a deep-felt degree of belief in terms of probability figures forces the expert into comprehensive consideration of all relevant information. Furthermore, the expert is required to think about uncertainties in a systematic but, nevertheless, unfamiliar way. Obviously, to carry out such a demanding process over time a certain degree of motivation is required. In some probability assignment situations, like gambling, investments, or making critical decisions concerning life and death, such motivation is naturally provided. In situations of repeatedly assessing input probabilities for a risk model, on the other hand, the same degree of motivation can hardly be expected. Hence, the necessary motivation must be ensured by other means in order to avoid superficiality and to obtain probabilities that reflect the expert's deep-felt degree of belief. The development of the probability assessment software (ref. Figure 1) was an attempt to maintain the expert's motivation throughout the assessment process. Emphasis was placed on the user interface, where the probability mechanism was visualised and fictive sums of money were put at stake. As mentioned previously, however, the expert got bored after some time. Further work is required to improve the methods and guidelines in order to avoid superficiality and unwanted variability in future subjective probability assessments.

The lessons learned from the practical cases of assessing subjective probabilities have provided a sound basis for solving the remaining problems highlighted in this paper. In order to provide decision making support by means of QRAs and at the same time ensure a consistent interpretation of the risk model conclusions, we see no other alternative than combining hard data and expert judgements in a full Bayesian setting.
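The reference-level idea lends itself to a simple assessment aid. The Python sketch below is our hypothetical illustration: the reference events and their "standard" figures are invented placeholders, not the commonly agreed values the text calls for.

```python
import math

# Reference events with commonly accepted standard figures
# (placeholder events and values, for illustration only).
REFERENCE_EVENTS = {
    "drawing a named card from a shuffled deck": 1 / 52,       # ~1.9e-2
    "two named strangers sharing a birthday": 1 / 365,         # ~2.7e-3
    "rolling the same die face three times running": 1 / 216,  # ~4.6e-3
    "guessing a four-digit PIN in one try": 1 / 10_000,        # 1.0e-4
}

def nearest_references(p, k=2):
    """Return the k reference events closest to p on a log scale, to anchor
    the expert when p lies below the critical level (about 0.01 here)."""
    return sorted(
        REFERENCE_EVENTS.items(),
        key=lambda item: abs(math.log10(item[1]) - math.log10(p)),
    )[:k]

# Anchors for an assessment around one in a thousand:
for event, prob in nearest_references(0.001):
    print(f"{event}: {prob:.1e}")
```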
REFERENCES

Berg Andersen, L. and Aven, T. (1996): "Stochastic Modelling for the Quantification of Kick and Blowout Risk During Exploration Drilling" - The KickRisk project, Draft report, RF-Rogaland Research, Stavanger, Norway.
Berg Andersen, L. and Pedersen, S. (1996): "Stochastic Kick Modelling - Pre-study", Report RF-96/050, RF-Rogaland Research, Stavanger, Norway.
Otway, H. and von Winterfeldt, D. (1992): "Expert Judgement in Risk Analysis and Management: Process, Context, and Pitfalls", Risk Analysis, 12, No. 1.
Lindley, D.V., Tversky, A. and Brown, R.V. (1979): "On the Reconciliation of Probability Assessments", Journal of the Royal Statistical Society A, 142, Part 2, pp. 146-180.
Tversky, A. and Kahneman, D. (1974): "Judgement under Uncertainty: Heuristics and Biases", Science, 185, pp. 1124-1131.
EXPERT JUDGEMENT IN SAFETY ASSESSMENTS
Debbie A Brown 1 and Ian M B Scott 2
1 WS Atkins Safety and Reliability, WS Atkins (Consultants) Limited, Woodcote Grove, Ashley Road, Epsom, Surrey, KT18 5BW, United Kingdom.
2 WS Atkins Safety and Reliability, WS Atkins (Consultants) Limited, 610 Birchwood Boulevard, Birchwood, Warrington, Cheshire, WA3 7WA, United Kingdom.
ABSTRACT

One of the principal requirements of professionally qualified and experienced safety analysts is to use their skills to undertake meaningful assessments of safety. Usually, these skills are pro-active and contribute to developed assessments which justify the degree of safety (or risk) associated with particular occupational activities. However, safety professionals can also be required to give expert judgement after a workplace accident or incidence of occupational ill health, to advise a court of law on the effectiveness or applicability of risk control measures that were in place at the time of the incident. Expert judgement in the circumstances of accident reconstruction has to be undertaken particularly thoughtfully, as criminal enforcement processes or civil liability claims may already be under way. The degree of information made available to the expert may vary considerably, yet a judgement is expected to be made to which the expert may be held personally accountable in a court of law. The information may emanate from a variety of direct and indirect sources. Examples include witness statements given in personal interviews by the parties involved or technical data provided by manufacturers. Equally, information may have to be sought by research, inspection or examination of materials, equipment or substances. In some cases it is based upon undisputed technical fact. However, in the absence of technical fact, an expert's opinions may prove to be significant. In these circumstances, the expert's competence, and his interpretation of the available information, become vital to reaching a judgement. His competence, credibility and judgement may all be subjected to critical and adversarial cross-examination in the courtroom. By reference to recent expert witness case studies, in which varying degrees of information were available, the authors outline the key known factors in each case and discuss their contribution to the understanding of events. The authors discuss the implications for 'expert judgement' in the safety assessments based on their respective competence and consider the main influences which led to the formation of their judgements.
KEYWORDS

expert witness, risk assessment, safety case, safety assessment, accidents, occupational ill-health
INTRODUCTION

One of the fundamental tasks of a safety professional is to assess the significance of data and to react accordingly. This inevitably means problem solving for others who are faced with a situation that has an inherent risk to safety at work, an occupational health consequence or, increasingly, significant effects on the natural environment. The speed of this reaction may often depend upon the perceived danger on the part of the person who created the problem in the first place.
Consequently, much of the work of safety professionals can be reactive to problems (such as accidents and incidents at work, or chronic instances of exposure to harmful materials) with the balance of time devoted to more proactive measures to reduce risk at work and promote the organisation's safety management system. This situation can exist in relative equilibrium until it is upset by an external influence, the consequences of which are inevitably to lay open the safety management system (and by direct consequence, the work of the safety professional) to inspection or examination (rarely at a time when it is convenient to do so).
Some of the work of professional safety consultants is to skilfully, and with purpose, analyse the issues involved in safety assessments that have failed to meet their intended performance. The intention of this paper is to identify some of the issues involved in the post-evaluation of "failed" safety assessments that subsequently become an issue of dispute. Such disputes are often brought to law for arbitration, and part of our work results in the forensic presentation of expert judgement.
THE NEED FOR SAFETY ASSESSMENT
Safety assessments are required in virtually all types of industrial undertaking and commercial enterprise where people are at work or where contractors and members of the public are exposed to the risk of injury. Up until 1992, many safety assessments were "statutory" requirements; examples include those required under the CIMAH Regulations 1984 (as amended) and the Offshore Installations (Safety Case) Regulations 1992. Other safety assessments (or "safety cases") were voluntary, used by the commissioning organisation to justify a particular course of action in terms of an adequate assessment of the risk.
What are safety assessments?

Since the introduction of the Management of Health and Safety at Work Regulations 1992, there has been a supplementary mechanism to incorporate virtually all non-statutory safety cases and safety reports into safety assessments or risk assessment studies, as required under Regulation 3. Consequently, safety assessment is now a statutory requirement. Possible exceptions remain, such as in transport, and where the number of employees in the business is fewer than five. The latter case still requires some form of safety assessment, but this does not have to be formally recorded or documented.
The first signs of trouble

As with issues of reliability in the operation of plant and equipment, it is precisely because something familiar and frequently used works that it becomes unremarkable. Notice is rarely taken of safety systems which seem to be operating successfully; it is only when the system "goes down" that maintenance or repair is offered. Inevitably, the first signs of trouble, in no particular order of significance, can be identified when failures in the safety system are noticed, when there is a rise in accidents and incidents at work, or when enforcement action is taken.
Why do safety assessments fail to meet their intended performance?

These first signs of trouble indicate that the safety assessment no longer satisfies workplace or operating conditions. The reasons for this are rarely simple and may include organisational failures, a gulf between expectations and the work as actually undertaken, or changes in the scope of the safety assessment.
SPECIFIC CASE STUDIES

The reader will be familiar with some of the publicised reviews of safety assessments following major accidents such as Piper Alpha. The opportunities to contribute to the investigation of these catastrophic events are rare. Most safety practitioners are much more likely to be called upon in connection with accidents and incidences of occupational ill-health at work. Many of these cases involve lost time injury incidents, and it is still relatively rare to be faced with the need to investigate a fatality. The following examples are typical of the majority of accident or ill-health cases that may require investigation and expert judgement.
Safety Assessment

Representative examples of both plaintiff and defendant work are included but, importantly, it is difficult to show that expert judgements of the original safety assessments were treated equally. The fundamental assumption here, as in all work of this type, is that a basic assessment of safety at work was made in the first place.
Generally, the safety practitioner approaches safety assessment in a pro-active manner, using established analytical methodologies to evaluate hazards, assess risks, identify possible consequences and assign an appropriate risk "value" in the light of the risk reduction measures in situ and functioning immediately prior to the event. These methods are invariably well structured, logical and systematic in their approach, enabling identification and assessment of failures in systems, processes or procedures which could have implications for health and safety. The methodologies at the disposal of the safety practitioner have almost certainly evolved over a long period of time and have grown in their professional acceptance. They enable the consequences of failures to be considered quantitatively or qualitatively in terms of the degree of acceptability of risk, and further enable risk mitigation measures to be implemented in accordance with perceived priority.
Accident investigation and reconstruction

Of the methods available, the epidemiological approach is of particular relevance to the expert safety practitioner. This technique involves analysis of historical performance data from which the expert might reconstruct event sequences, identify relationships in failure (accident) events and determine historical fact. Accident reconstruction is often difficult and is frequently subject to the views of the expert, based on her or his years of experience, and other key influential factors.
Accident reconstruction extends beyond mere accident investigation and causation. The expert has to imagine, by means of the often limited facts and information available to him, not only the chronology of the events surrounding a particular incident but must also construct a profile of the organisation and its safety culture, in order to determine the facts which will enable him to make an expert judgement regarding the correct or appropriate accountability for an injury or incidence of ill-health.
The general principle behind accident investigation is to construct event sequences or chains of events and to identify the direct and indirect causes which led to the accident, principally with a view to amending the safety management strategy to prevent recurrence or further loss. The fundamental doctrine of accident investigation is "not to apportion blame or fault". As an outcome of accident reconstruction, however, the expert is expected to make just such a judgement. Whilst the expert has to maintain impartiality, he is required to indicate his opinion regarding accountability for the cause of the injury or ill-health. Clearly, where expert judgement is involved, "taking sides" is unacceptable, but this must nevertheless be differentiated from truly giving an opinion.

Accident investigation is generally conducted as soon as possible following the event to establish the facts and causation sequence. This usually involves interviewing witnesses, inspecting the scene of the accident and critically reviewing working practices. Accident reconstruction by an engineer is, traditionally, most unlikely to take place immediately following an incident. Indeed, an expert may be asked to make a judgement upon circumstances which took place many years previously, as the law allows for a civil action to be brought within three years of identification of the injury or ill-health. By way of illustration, the authors have made expert judgements on cases where premises have long since been demolished and companies have ceased functioning.
Accident reconstruction is therefore a reactive process, often associated with the development of a legal case. The expert's report, whilst identifying possible measures which could have been employed to mitigate risk at the time, clearly has no role to play in making recommendations to management to prevent recurrence of an accident, as might be expected as part of a company's internal accident investigation process.
CASE X - MANUAL HANDLING (BACK) INJURY

In this case the expert was appointed by the plaintiff's solicitor. The plaintiff had incurred a back injury as a result of single-handedly delivering items of furniture from a light goods vehicle. The instructing solicitor was young and efficient. Written instructions to proceed were accompanied by extensive paper records including a statement from the plaintiff, and copies of Particulars, Further and Better Particulars, a medical report and company documentation (including job sheets and work programmes). Seemingly, to contain costs to a tight budget, the solicitor did not permit the expert to interview the plaintiff; nor was a vehicle inspection made.
The accident reconstruction focused predominantly on interpretation of documentation. The availability of such detailed documentation, particularly from the defendant, helped the expert to corroborate the plaintiff's evidence. To make progress, it was necessary to identify and establish the key facts surrounding the circumstances of the case. As with most reconstructions, these included the organisational structure and reporting line, the plaintiff's employment contract details, the working environment, the plaintiff's competence, the type, and system, of work, and the identification of the likely chain of events that led to the injury. The medical report was accepted as evidence in its own right. Concurrent investigation involved a literature search for guidance documentation which would have been readily available at the time of the accident, and a review of the prevailing legislation, with a view to establishing general duties and levels of knowledge of manual handling principles. The expert also contacted a vehicle manufacturer's agent to substantiate technical data concerning the vehicle's specification. This ergonomic and anthropometric data corroborated the information provided by the plaintiff.
The defendant alleged contributory negligence, stating that the plaintiff was aware that he should have requested assistance with the delivery of heavy or oversize items. The evaluation of the organisation and its structure, a table-top exercise involving cross-reference of documentation, supplemented by additional information from the plaintiff, indicated that levels of internal communication were poor, the management structure was ineffective and there were no suitable mechanisms in place for task management.
Key known factors that contributed to understanding for the development of an expert opinion in this case comprised:

• statements from both the plaintiff and the defendant;
• evidence of surgical correction of the back injury;
• very detailed job cards and records;
• material which could be readily cross-referenced to substantiate facts;
• detailed data on the vehicle specification such that the "workplace" was established; and
• a plea of contributory negligence made by the defence.
This case exhibited an exceptional amount of informative and well documented material. This expedited the process, contained costs and helped to draw the case to an efficient and satisfactory conclusion in the plaintiff's favour.
CASE Y - REPETITIVE STRAIN INJURY
In this case the expert was appointed by the plaintiff's solicitor. The plaintiff, a typist, had incurred a repetitive strain injury to her left wrist. Written instructions to proceed were accompanied by the plaintiff's statement and two medical reports made by the same examining doctor. The plaintiff's statement was confusing and, in some instances, contradictory, and the expert sought to interview the plaintiff to clarify the relevant issues. Several attempts were made to set up a tripartite meeting at the workplace, but every attempt was thwarted at the "eleventh hour" by the defendant. To progress the case, therefore, the expert eventually held the meeting at the plaintiff's home whilst a court injunction was raised to gain access to inspect the workplace.
The instructing solicitor took the somewhat unusual step of advising the expert to deal directly with the plaintiff and the defendant's solicitor. The loss of this intermediary resulted not only in increased work for the expert but, perhaps more significantly, also increased the expert's professional and personal involvement in the case. Generally, an expert has no role to play within litigation other than being "an expert". By his very nature, however, the safety professional is genuinely concerned with the health, safety and welfare of individuals, and it often takes considerable effort to dissociate oneself and not become emotionally drawn in. The legal profession does not generally acknowledge these professional ethics, wanting, quite simply, an expert's opinion.
In this particular case, however, the solicitor positively encouraged the expert's involvement. Whether this was a legal ploy and a planned attempt to influence the expert, or merely another facet of an instructing solicitor who seemed disorganised and uninterested in the case, is difficult to say. However, it did serve as a major distraction and accentuated the importance of the expert remaining as independent and unbiased as possible. This became progressively more difficult as the solicitor kept neither the expert nor the plaintiff informed. The expert had to deal with issues with which she was unacquainted, and the plaintiff was fast becoming frustrated about the lack of progress. She increasingly turned to the technical expert for reassurance.
Accident reconstruction focused on the plaintiff's evidence, corroborated as far as possible by cross-reference to the medical evidence. This particular case has been included within this discussion paper as it highlights several key issues for the safety practitioner as expert witness.
Firstly, the availability of information. This was limited, riddled with discrepancies and solely in support of the plaintiff. From this, however, the expert carefully constructed a view of the organisation, its management structure and its working procedures. In the absence of company documentation, and following the cancellation
of two planned workplace inspections, this focused predominantly upon a review of legislation and relevant guidance. The final report was structured around safe working practices that could have been expected during the period in question, and in particular an evaluation of the conditions which could be expected to lead to the onset of injury. The injury itself formed an additional complication, as medical opinion was contradictory.

Secondly, the degree of control that a solicitor exercises over a case and the effect that this can have on progress. The solicitor's case management caused considerable stress for all parties involved. Frustration was levelled at the expert by the plaintiff on several occasions. A great deal of interpersonal skill and patience was required to manage the situation and, whilst it did not influence the expert's final opinion, it hindered progress.

Key known factors that contributed to the development of an expert opinion in this case comprised:

• a confused statement from the plaintiff and no information or co-operation from the defendant;
• physical evidence of an injury was seen, in that the person wore a wrist splint;
• no work procedures and records or workplace description remained;
• it was reported that no work equipment was available for examination;
• the outspoken nature of the plaintiff, who had a work career fractured by several periods of maternity leave, which contributed to her losing her place in the organisation's structure; and
• little factual information could be cross-referenced.

Within litigation, this is perhaps an exceptional case. It contrasts starkly with case X and is notable for its unusual approach and ineffective use of resources. The plaintiff will perhaps be most remembered as an outspoken character who contributed little to easing the often strained relations between ourselves and her lawyer. The case was settled out of court in the plaintiff's favour.
CASE Z - OCCUPATIONAL ASTHMA

In this case the expert was acting upon instructions from an insurance company in support of a claim brought against one of their clients. The plaintiff alleged occupational asthma through exposure to chemicals used in the manufacture of polyurethane tyres. The expert was provided with a broad outline of the details of the alleged case. No additional information, particularly regarding the plaintiff or his condition, was provided. The case was further complicated by the fact that the polyurethane workshop was now under new management, the civil action being brought against the former employer and owner of the premises, who had sold out to landlords. Despite there being no legal obligation to do so, the new owner agreed to allow the expert to inspect the premises and prepare a report.

The inspection gave the expert the opportunity to witness the manufacturing process at first hand and to meet some of the plaintiff's ex-colleagues. The non-availability of evidence from the plaintiff, or of medical confirmation of his condition, was ultimately inconsequential to the development of the expert's judgement. The working conditions were self-evident and showed, in the 1990s, an appalling standard of workplace occupational hygiene. The undocumented systems of work and workplace instructions, combined with the employees' lack of awareness and understanding of the dangers associated with the chemical process and of the risk mitigation measures, led to the expert opinion that exposure in the workplace could well have caused, or contributed to, the alleged incidence of occupational asthma. Accident reconstruction focused on workplace inspection.

Key known factors that contributed to understanding for the development of an expert opinion in this case comprised:

• no statement from the plaintiff and no information or co-operation from the defendant;
• no medical evidence of an injury was disclosed;
• no work procedures and records or historical workplace description remained;
• the work equipment was available for examination; and
• a site visit was made possible and allowed for direct interpretation of the workplace conditions.

The principal conclusions from this expert judgement case showed that work practices were not as imagined and were wholly inadequate compared with other industrial situations handling the same or comparable materials. Lack of information from either side hampered the gathering of essential facts such as work procedures or occupational health records. On balance, however, physical inspection of the workplace enabled a direct interpretation of the working conditions, which it was understood had changed little since the time of the employees' first reported ill-health. This case contrasts strongly with the first two reported cases in that so much information had to be reconstructed: practical experience of the working conditions had to be superimposed on theoretical conditions and standards to see where they coincided and where deficiencies existed. The case is still progressing with detailed review of medical evidence.

CONCLUSIONS

Undertaking expert judgement work on safety assessments is part of the skill set necessary for safety professionals. The work is interesting and challenging, but also time consuming and laborious, especially in corroborating information, accessing obsolete legislation and standards, and preparing reports for court. The work must be undertaken by appropriately trained practitioners and ideally should continue under the supervision of experienced staff who can offer guidance when necessary.

The three cases cited demonstrate that accident reconstruction can be effectively undertaken despite quite significant differences in the availability of information. The expert may at times consider that he does not have sufficient evidential information to undertake a comprehensive review; at times he may have very limited information indeed. But circumstantial evidence is just one element in the safety evaluation process. In retrospect, and with accident investigation experience, it is far easier to model accident causation chains. Accident causation naturally gravitates towards management accountability because of the emphasis of legislative requirements and the need to manage safety.

The basis of a safety management system is predominantly the installation and maintenance of effective management control strategies aimed principally at preventing accidents and occupational ill-health and reducing the likelihood of loss. There are established key factors which contribute to the development and implementation of an effective system. However, even the most sophisticated and successful systems may contain minor flaws which would, in normal circumstances, pass unobserved. The deficiency may only be highlighted pro-actively as a result of an audit, or reactively as a result of an accident or incident of occupational ill-health. Many actions are brought as a result of deficiencies which, had the accident or ill-health not occurred, would have passed unnoticed. The expert should focus upon collecting as much factual information as possible from which to model accident causation chains.
This should involve, for instance, information regarding the legislation prevailing at the time in question, the guidance documentation readily available at that time, the general knowledge and understanding of the hazards and risks, and the evaluation methods likely to have been used for evaluating the risks, over which the available evidence is superimposed. The amount and availability of information to the expert appears to differ quite considerably depending upon the instructing party, the age of the case and the competence and experience of the solicitor. Whilst the litigation process inherently influences the availability of information (discovery) to the parties involved in an action, the instructing solicitor controls the flow of information to the expert.
What is apparent is that there is considerable opportunity for the expert to be far from fully acquainted with the entire facts surrounding a case. An expert representing the plaintiff, for example, invariably has less
substantiated documentary information and is dependent upon the plaintiff's personal recollections and medical evidence. Often this expert has restricted access to records regarding company safety performance.
The type of information sought, and the development of the ultimate opinion, is likely to rest not only upon the competence of the expert within his own technical field, but also upon the professional ethics of the expert as a safety practitioner. Issues of competence, credibility and judgement are fundamental and because of this there are an increasing number of registers being formed in Europe for specialists in particular technical fields.
In our opinion, there is a basic set of requirements that has to be met before a review of safety assessments can be undertaken. For the pre-1992 statutory cases, there is often a clear check-list of requirements for information given in the Regulations themselves. The methods for undertaking these cases are well understood and, provided the methods are still robust and the software etc. used in forming a judgement is still credible, the review is often straightforward. In examples of occupational accidents and ill-health the position is often less clear, and the following check-list of requirements may prove helpful. There should be:

1. review of documentary evidence, work instructions and procedures, safe systems of work etc.;
2. an opportunity to meet the plaintiff to gather first hand information about the work that he or she undertook;
3. physical access to the workplace;
4. inspection of plant and equipment and engineers' reports;
5. original drawings or dimensions of the workplace;
6. medical records or personnel records of employees;
7. training records for employees, evidence of trade union membership;
8. statements made by the plaintiff and relevant photographs;
9. factual statements from enforcing authority Inspectors and previous inspection reports.

Obviously, not all these issues can be guaranteed in every case, but at least three of the factors given above should be present for the expert to have a fair chance of coming to a reasonable judgement about the viability of the safety assessment that has been reviewed. Items 1 and/or 2 on the above list are virtually essential. In the absence of documentation and records, accident reconstruction is likely to be manufactured from recollections of past events. These are likely to be fragmented, perhaps even distorted with time. Some events which may prove significant to the development of the expert's opinion may be so insignificant to other people involved that they are forgotten or ignored. It is therefore essential that the expert has the opportunity to interview the plaintiff to obtain all the facts which he believes to be significant to his judgement. Item 3 is a desirable requirement, and the remaining items are particularly helpful in preparing the expert judgement.
A9" Risk Management Decision Support Systems
RISK ANALYSIS AND DECISION MAKING: AN INTEGRATED APPROACH TO DESIGNING FOR SAFETY

John Kian Guan Tan
Engineering Design Centre, The University of Newcastle upon Tyne, Newcastle, NE1 7RU, UK
ABSTRACT

An integrated approach to designing for safety, incorporating risk analysis techniques and Multiple Criteria Decision Making (MCDM), is presented in this paper. Designing for safety incorporates specific probabilistic risk analysis methodologies within the design process. The decision problems in designing for safety are formulated as MCDM problems. The various potentially conflicting design objectives can then be met as closely as possible within the constraints of limited financial resources and technological knowledge. A made-to-order (MTO) engineering product is used as an example. Failure Mode, Effects and Criticality Analysis (FMECA) and the Boolean Representation Method (BRM) are employed to identify all faults and their respective failure probabilities. A Multiple Objective Decision Making (MODM) technique, the Interactive Step Trade-off Method (ISTM), is used to search for efficient design solutions. The use of MCDM together with PRA permits the use of rational principles and the incorporation of judgements; it also promotes comparability among objectives in making decisions. The methodology presented thus provides an integrated approach to designing for safety.
KEYWORDS

Probabilistic Risk Analysis (PRA), Multiple Criteria Decision Making (MCDM), Multiple Objective Decision Making (MODM), Interactive Step Trade-off Method (ISTM), Boolean Representation Method (BRM), Failure Mode, Effects and Criticality Analysis (FMECA)
INTRODUCTION

Probabilistic Risk Analysis (PRA) methods have been gaining acceptance and popularity for assessing a given system for reliability and risk. Such methods are used either to comply with regulations or to meet self-imposed goals in the design process. Safety criteria are important in current product design procedures right from the initial design stages, especially for large complex engineering projects. Designing for safety incorporates specific PRA methodologies within the design process. This is one of the most effective ways of reducing or eliminating potentially serious risks downstream. In this paper, the terms risk analysis and PRA are used interchangeably. PRA refers to studies of process or equipment failure or operability (Henley and Kumamoto, 1991). It provides the means for safety and reliability assessments; what it does not provide is the means to assist the designer to decide how to achieve the best safety standard within a given set of constraints. This problem is further complicated by the presence of conflicting design objectives. Hence decision tools are required to obtain a "best" compromise design that meets the requirements of the various design objectives as closely as possible. The presence of conflicting objectives requires compromises to be made, either explicitly or implicitly. In this respect MCDM methods are particularly suitable and useful. MCDM problems are classified into MODM and MADM problems. A MADM problem is typically a selection problem: it involves ranking a finite set of alternatives, or choosing the most preferred solution from a given alternative set, based on multiple attributes reflecting the technical and economic performance of a design. A MODM problem is more complicated than a selection problem in that the alternatives are not given but are hidden in a set of constraints. Hence a MODM problem involves both the generation of alternatives from the constraints and the evaluation of the design in terms of the multiple objectives. In a single objective optimisation problem, the solution is generally expressed as an optimal solution. In a multiple objective optimisation, the solution is generally expressed as solution points on the Pareto surface, or as the generation of the Pareto surface.

When an engineering product is designed, identification of all possible failure conditions and their corresponding probabilities is a primary task. When the risks are judged to be unacceptable, the design will require rearrangement or modification, so as to include either some protection devices or more reliable components or both. The decision making process in designing for safety involves choosing between a number of such possible actions. Each modification is likely to affect the achievement of various objectives of the design. An example of the conflict between design objectives is that it is usually not possible to have a design which simultaneously maximises safety and performance whilst minimising initial capital cost and maintenance costs. The presence of these design objectives, many of which are in potential conflict, requires the use of decision making tools, either formal or informal, to obtain the best compromise solution.

A made-to-order (MTO) engineering product will be used as an example to illustrate the principles described. MTO products are usually complex assemblies of components for which building and testing of prototypes is not possible. Risk analysis is carried out for such an MTO engineering product. The PRA methods employed are FMECA and BRM, and the MCDM method is ISTM, a MODM technique. Analysis of the results from the PRA and the MODM problem will then enable the selection of the "best" design alternative, intuitively by the decision-maker(s) (DM) or with the aid of Multiple Attribute Decision Making (MADM) techniques. The BRM and MODM methods are taken from various software support environments developed by the Engineering Design Centre at the University of Newcastle, UK.
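The notion of efficient (Pareto-optimal) solutions referred to above can be made concrete with a short sketch. The following Python fragment is illustrative only (the candidate designs and their objective values are invented, not taken from the paper): it filters a finite set of design alternatives down to the non-dominated set when every objective is to be minimised.

```python
# Minimal Pareto-dominance filter: keep the non-dominated designs when
# every objective (e.g. failure probability, cost) is to be minimised.

def dominates(a, b):
    """True if design `a` is at least as good as `b` in every objective
    and strictly better in at least one (all objectives minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs):
    """Return the subset of `designs` not dominated by any other design."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs if other is not d)]

# Hypothetical alternatives: (failure probability, cost in £1,000)
designs = [(0.010, 520), (0.015, 440), (0.012, 480), (0.020, 430), (0.011, 560)]
print(pareto_front(designs))   # (0.011, 560) is dominated by (0.010, 520)
```

In the setting of this paper, the "designs" would be alternative system configurations, and the objectives the top-event failure probabilities and costs; a MADM method would then rank the surviving non-dominated alternatives.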
PREVIOUS WORK

Most of the earlier related work by other authors in the field of reliability study was in the category of reliability optimisation. Misra and Ljubojevic (1973) presented an optimisation approach that includes a cost constraint. The idea of a trade-off between cost and reliability was emphasised even in this single objective optimisation approach. Nakagawa and Nakashima (1977) presented a method for obtaining an optimal reliability allocation of an n-stage series system by a heuristic method. Multiple objective optimisation techniques were also explored and discussed. Inagaki et al. (1978) explored the use of multiple objective optimisation for reliability allocation in a series system with time-dependent reliability. Multiple objective fuzzy optimisation was applied to solve a multiobjective reliability apportionment problem for a series system by Dhingra (1992). The applications of MCDM techniques in reliability and safety analysis have been limited to rather stereotyped, textbook-style problems, unlike the applications of MCDM in many other fields; MCDM applications in water resources, for example, have achieved particularly outstanding results. The main issue of decision making within the context of risk assessment is not adequately covered and addressed.
STATEMENT OF THE PROBLEM

Consider a system which can be broken down into sub-systems, and subsequently into components. The system's safety is a function of the sub-systems' and, in turn, the components' reliability. The reliability variables may have multiple states. Let $p$ be the number of components or sub-systems, $m$ the number of top-event failures, and $s$ the number of states of each reliability variable. Further, let the failure probabilities of each component be $X_{li}$. The various PRA conducted would yield the top-event failure probability $F_j$ in the form:

$$F_j = f(X_{l1}, X_{l2}, \ldots, X_{lp}) \qquad (1)$$

where $X_{li}$ ($l = 1, 2, \ldots, s$) represents the $i$th component's reliability variable with multiple states. The form of the function $f$ is directly influenced by the configuration of the system.
Hence minimising the failure probabilities $F_j$ becomes an optimisation problem. In order to realistically deal with the complexity of the various conflicting design objectives, multiple objective optimisation techniques are used. The functions $F_j$ obtained from PRA methods can then be treated as objective functions in a MODM optimisation formulation. Other objective functions are formulated to reflect the nature of the problem and the intention to achieve the specified design objectives. A MODM problem may be defined as the following vector mathematical programming problem:

$$\text{optimise: } f(X) = \{f_1(X), f_2(X), \ldots, f_k(X)\} \qquad (2)$$

$$\text{such that: } X \in \Omega, \quad X = [x_1 \; x_2 \; \cdots \; x_n] \qquad (3)$$

$$\Omega = \left\{ X \;\middle|\; g_i(X) \le 0 \; (i = 1, \ldots, n_1); \; h_j(X) = 0 \; (j = 1, \ldots, n_2); \; s_l(X) \le 0 \; (l = 1, \ldots, n_3); \; e_m(X) = 0 \; (m = 1, \ldots, n_4) \right\} \qquad (4)$$
where $x_i$ ($i = 1, \ldots, n$) are design variables, $f_t(X)$ ($1 \le t \le k$) is a linear or nonlinear objective function, $g_i(X)$ ($0 < i \le n_1$) is a nonlinear inequality constraint function, $h_j(X)$ ($0 < j \le n_2$) a nonlinear equality constraint function, $s_l(X)$ ($0 < l \le n_3$) a linear inequality constraint function and $e_m(X)$ ($0 < m \le n_4$) a linear equality constraint function. Generally speaking, there is no design that could optimise all of the objective functions simultaneously. Compromises or trade-offs among the objectives are inevitable. The final solution of the problem in equations (2) to (4) is the "best" compromise or most preferred solution, which attains the objectives as closely as possible to the preferences indicated by the designer.
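A standard way to generate individual efficient solutions of problem (2)-(4) is to scalarise the objective vector. The sketch below is a hedged illustration rather than the paper's software: the two objectives, the weights and the single constraint are all invented, and a simple weighted sum is solved with scipy. Varying the weights traces out different points of the Pareto surface.

```python
# Weighted-sum scalarisation of a two-objective MODM problem:
# minimise w1*f1(x) + w2*f2(x) subject to one inequality constraint.
import numpy as np
from scipy.optimize import minimize

def f1(x):  # e.g. a failure-probability-like objective (invented)
    return (x[0] - 1.0) ** 2 + 0.1 * x[1] ** 2

def f2(x):  # e.g. a cost-like objective (invented)
    return x[0] ** 2 + (x[1] - 2.0) ** 2

# scipy's "ineq" convention is fun(x) >= 0, so this encodes x0 + x1 <= 4.
constraints = [{"type": "ineq", "fun": lambda x: 4.0 - (x[0] + x[1])}]

for w1 in (0.2, 0.5, 0.8):   # different weightings -> different efficient points
    res = minimize(lambda x: w1 * f1(x) + (1.0 - w1) * f2(x),
                   x0=np.zeros(2), constraints=constraints)
    print(f"w1={w1:.1f}: x={res.x.round(3)}, f1={f1(res.x):.3f}, f2={f2(res.x):.3f}")
```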
SOLUTION PROCEDURE

PRA Methods
The solution procedure begins with PRA. The PRA methods selected are FMECA and BRM. These methods are selected because of the following advantages over other PRA methods:

• They can deal with engineering systems with multiple state variables and feedback loops.
• Top events of a large engineering system with a relatively high level of innovation can be identified.
• Omissions of failure causes are less likely than in Fault Tree Analysis.
• The information produced by FMECA can be used directly for Boolean representation modelling.
• A single Boolean representation model is enough to deal with all possible faults.
Assumptions made:

• The components or sub-systems at the same analysis level are considered to be independent.
• A continuous variable can be expressed by two or more discrete states such as high, normal and low, each of which corresponds to a certain range of values.
• Failure variables can be represented by probability distributions.

The PRA methods used here combine FMECA and the BRM to systematically identify and assess all system failure events and their respective causes. FMECA is a bottom-up approach and is usually carried out on the basis of the evaluation of hardware elements. In FMECA, how combinations of failure modes affect the system performance and safety is not studied. The Boolean representation (BR) tables are constructed by studying all possible combinations of the input variables to produce the output variables.

An engineering system can be described in terms of components and their interactions. A component can be described by a set of input events and a set of output events. Each output event specifies the state of the output, and a set of input events specifies the states of the inputs. Each event may have several states. For example, the output pressure from a valve may be assigned to one of five states such as too high, high, normal, low and too low, each of which corresponds to a range of values. The interactions of components can be modelled by studying the system process diagram. Once the components and their interactions have been modelled, Boolean representation modelling can be started at the component level, progressed up to the sub-system level, and finally to the system level in order to obtain the final BR description.

The BRM begins after the completion of the FMECA. At the component level, the BR descriptions of the components can be constructed. The failure modes identified in the FMECA of a component can be used as the input attributes of the BR table. The table describes the conditions which must exist for the occurrence of the identified component output states. The last column of the BR table describes the states of the output of the component, while the other columns prescribe the states of the input attributes. Each row represents a possible condition for the occurrence of the component's output state. When constructed directly from FMECA, the BR table usually has some degree of redundancy. The rules of simplification can be applied to absorb and merge redundant rows and redundant attributes to generate the irreducible BR of the component (a small sketch of this row absorption follows below).

After all the BR tables of the components have been constructed, the sub-system's BR table can be generated using a process of aggregation. Intermediate variables need to be eliminated by substituting them with primary variables according to the interactions of the components. After the elimination, the rules of simplification should be applied again to produce the irreducible BR table of the sub-system. After the construction of all the sub-systems' BR tables, the BRM can progress up to the system level, and the same procedures are repeated until ultimately the irreducible BR table for the system is constructed. The rules of deduction of extra prime implicants can then be applied to the irreducible BR table to obtain the final system BR table. The final system BR table contains all the prime implicants associated with the system output states. A prime implicant can be considered to be the equivalent of a cut set in a fault tree analysis, but for systems with multiple state variables.
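The row-absorption step behind the "rules of simplification" can be sketched in a few lines. The fragment below is a toy illustration for two-state (N/F) attributes only, with an invented component table; the BRM software described in the paper is considerably more general. Two rows with the same output state that differ in exactly one attribute position merge into one row with the don't-care symbol '*' at that position.

```python
# Toy BR-table row absorption for two-state (N/F) attributes.

def merge_once(rows):
    """One absorption pass over a list of (inputs, output) pairs, where
    inputs is a tuple over {'N', 'F', '*'}. Returns (rows, changed)."""
    for i, (ai, oi) in enumerate(rows):
        for j, (aj, oj) in enumerate(rows):
            if i < j and oi == oj:
                diff = [k for k in range(len(ai)) if ai[k] != aj[k]]
                if len(diff) == 1:          # differ in exactly one attribute
                    merged = list(ai)
                    merged[diff[0]] = '*'   # that attribute becomes don't-care
                    rest = [r for k, r in enumerate(rows) if k not in (i, j)]
                    return rest + [(tuple(merged), oi)], True
    return rows, False

def simplify(rows):
    """Repeat absorption until the table is irreducible."""
    changed = True
    while changed:
        rows, changed = merge_once(rows)
    return rows

# Hypothetical table: output H1 occurs for (HM1=F, HM2=N), (F, F) and (N, F);
# the first two rows merge to (F, *).
table = [(('F', 'N'), 'H1'), (('F', 'F'), 'H1'), (('N', 'F'), 'H1')]
print(simplify(table))   # [(('N', 'F'), 'H1'), (('F', '*'), 'H1')]
```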
The simultaneous occurrence of the basic events associated with any of the prime implicants $C_1, C_2, \ldots, C_N$ will result in the occurrence of the top event $T_C$. Hence the probability of occurrence of the top event $T_C$ can be calculated by equation (5):

$$P(T_C) = P(C_1 \cup C_2 \cup \cdots \cup C_N) = \sum_{i=1}^{N} P(C_i) - \sum_{i=1,\, i \neq j}^{N} P(C_i \cap C_j) + \cdots + (-1)^{N-1} P(C_1 \cap C_2 \cap \cdots \cap C_N) \qquad (5)$$
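For small numbers of prime implicants, equation (5) can be evaluated directly. The following sketch is illustrative only: it assumes statistically independent basic events with known state probabilities, and the variable names, prime implicants and probability values are all invented.

```python
# Equation (5): P(C1 u C2 u ... u CN) by inclusion-exclusion, assuming
# statistically independent basic events (all values hypothetical).
from itertools import combinations
from math import prod

p_event = {('HM1', 'F'): 0.05, ('HM2', 'F'): 0.01, ('V3', 'F'): 0.02}

def p_conjunction(implicants):
    """Probability that all given prime implicants occur together.
    Each implicant is a dict {variable: state}; conflicting requirements
    on the same variable make the conjunction impossible."""
    merged = {}
    for imp in implicants:
        for var, state in imp.items():
            if merged.setdefault(var, state) != state:
                return 0.0
    return prod(p_event[(var, state)] for var, state in merged.items())

def p_top_event(prime_implicants):
    n, total = len(prime_implicants), 0.0
    for k in range(1, n + 1):                      # subset sizes 1..N
        for subset in combinations(prime_implicants, k):
            total += (-1) ** (k - 1) * p_conjunction(subset)
    return total

C = [{'HM1': 'F'}, {'HM2': 'F', 'V3': 'F'}]        # hypothetical prime implicants
print(p_top_event(C))   # 0.05 + 0.01*0.02 - 0.05*0.01*0.02
```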
MODM Methods

Interactive Step Trade-off Method (ISTM)
The ISTM is a learning-oriented method. It allows a designer to search the efficient solution frontier of a MODM problem by means of trade-off analysis. In ISTM, the objectives are divided into three subsets. Suppose a feasible or an efficient design is given: the designer is required to decide whether an objective has to be improved from the current status, should be kept at least at the current status, or may be sacrificed from the current status. Based on this trade-off analysis, ISTM will try to find a new efficient design which satisfies the designer's requirements and improves the assigned objectives to the largest possible extent. The ISTM provides a designer with a powerful tool to search for good efficient designs from which the best compromise may be evolved.

Once the three subsets of objectives are classified, an auxiliary problem that transforms the three subsets of objectives into another optimisation problem can be defined. The various objectives are either improved, sacrificed or maintained in the auxiliary optimisation problem, in which the SLP is again used to solve for the new efficient solution. The process is repeated interactively until the decision maker is satisfied with the optimal solution(s) derived. By repeated use of ISTM interactions, the entire or partial Pareto surface of the problem can be determined. Computer models have been developed for the implementation of the BRM and ISTM. The computer programs are written in C and run on Sun SPARC workstations.
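The flavour of a single ISTM iteration can be imitated with a standard constrained optimisation: improve one objective while keeping another at least at its current status. The sketch below is a hedged stand-in, not the authors' SLP-based C implementation; the two objectives and the starting design are invented, and an epsilon-constraint auxiliary problem is solved with scipy.

```python
# One trade-off step in the spirit of ISTM: starting from a current design,
# improve f1 while f2 is kept at least as good as its current value.
import numpy as np
from scipy.optimize import minimize

def f1(x):  # objective to be improved (e.g. a top-event failure probability)
    return (x[0] - 1.0) ** 2 + x[1] ** 2

def f2(x):  # objective to be maintained (e.g. cost)
    return x[0] ** 2 + (x[1] - 1.0) ** 2

x_current = np.array([0.5, 0.5])
f2_cap = f2(x_current)          # "keep at least the current status"

res = minimize(f1, x_current,
               constraints=[{"type": "ineq", "fun": lambda x: f2_cap - f2(x)}])
print("new design:", res.x.round(3),
      "f1:", round(f1(res.x), 4), "f2:", round(f2(res.x), 4))
```

Repeating such steps with different subsets of objectives to improve, maintain or sacrifice yields further efficient designs, which is how a Pareto surface such as the one in the example below can be traced out.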
AN EXAMPLE

A hydraulic hoist transmission system of a crane is used as an example. This system is used to control the crane motions of hoisting up and hoisting down loads as required by the operator. It consists of five sub-systems, namely a hydraulic oil tank, an auxiliary system, a control system, a protection system and a hydraulic servo transmission system. Each sub-system is associated with several failure modes. The occurrence of each failure mode associated with each sub-system may result in certain possible consequences, with a severity class depending on the nature of the failure mode and the interactions of the sub-systems. The main objective is to achieve the highest level of safety at the minimum cost.

There are three top events, namely hoisting up continuously not as required, S1; lowering continuously not as required, S2; and no output from the package motor of the hydraulic servo transmission system, S3. It is also established that S1 and S2 may have serious consequences for the safety of the operating personnel and therefore carry a higher priority than S3. S1, S2 and S3 all contribute to the repair cost and the loss-of-earnings cost, and there is therefore a motivation to keep these failure probabilities to a minimum.

FMECA is carried out on the five sub-systems. Results from the FMECA are summarised in the form of tables; it is not possible to show all these tables due to the space constraints of the conference proceedings. Table 1 shows the FMECA of the hydraulic tank as representative; similar tables are constructed for the other four sub-systems. For this example only severity classes 1, 2 and 3 (i.e. events that carry serious consequences) are considered for the construction of BR tables. The results from the FMECA are used to construct the BR tables of the sub-systems by studying each possible combination of input attributes, which are possible failure modes. The BR table for the hydraulic tank is shown in Table 2. In the BR tables, N stands for "Not happening" of a variable, F stands for "Failure happening" and * stands for "Don't care".
TABLE 1: FMECA OF THE HYDRAULIC TANK

Failure mode number | Failure mode | Failure mode rate | Effects on system | Detecting method | Severity
1 | oil temperature too high or too low | 0.443 | reduce efficiency | self annunciating | 4
2 | level gauge failure | 0.05 | could result in insufficient oil supply | self annunciating and by maintenance |
3 | major leak | 0.01 | no flow for the system supply | self annunciating |
4 | minor leak | 0.395 | none | self annunciating | 4

Severity: 1 (serious consequences) to 4 (minor consequences)
TABLE 2: BR TABLE OF THE HYDRAULIC TANK

[Columns: HM1 | HM2 | Ho; the N/F/* entries of the table body are not recoverable in this copy.]

HM1: major leak in the hydraulic tank
HM2: level gauge failure
Ho: the output variable of the oil supply tank
H1: no oil supply from the oil tank
H2: supplying oil from the oil tank
The final system BR table of the hydraulic hoist transmission system is obtained by the method described, using the output from the BRM programme; a part of it is shown in Table 3.
TABLE 3: PART OF THE FINAL BR TABLE

[Rows of prime implicants assigning F, N or * to the failure-mode variables of the five sub-systems (column groups H, A, M, C, P and S), with the final column giving the output state (top event S1); the column alignment of the table body is not recoverable in this copy.]
MODM Formulation
There are a number of ways of setting up the objective functions. The main considerations in this example are cost and risks. Four objective functions are set up to reflect these considerations. The first considerations are the top events S1 and S2, since the intention of keeping these two events, which carry serious consequences, low has been expressed. The second consideration is the minimisation of the general risk index, which is the sum of all the top-event failure probabilities. The general Risk Index (RI) expresses the desire to keep all failure probabilities to a minimum, which also means that a better Mean Time Between Failures (MTBF) can be achieved. The last consideration is the minimisation of the cost, or the economic resources needed to sustain the system. It is generally not possible to achieve all these objectives simultaneously. However, with MODM formulations, the decision maker will have a clear idea of the range of feasible alternatives available to him. The MODM formulation can be represented by equations (6) to (9) as follows:
$$\min \; \text{Risk}_1 = f_1(X_{l1}, X_{l2}, \ldots, X_{lp}) \qquad (6)$$

$$\min \; \text{Risk}_2 = f_2(X_{l1}, X_{l2}, \ldots, X_{lp}) \qquad (7)$$

$$\min \; \text{RI} = \sum_{i=1}^{3} f_i(X_{l1}, X_{l2}, \ldots, X_{lp}) \qquad (8)$$

$$\min \; \text{Cost} = \text{COST}_T(X) + \text{COST}_M(X) + \text{COST}_R(X) \qquad (9)$$

where COST_T is the top-event caused cost, COST_M the maintenance cost and COST_R the repair cost.
ISTM is used to generate the solution for the above MODM problem. A total of approximately 60 interactive iterations is required to enable the plotting of the Pareto surface. The solution to the above problem is a three-dimensional Pareto surface; the solutions are plotted as a series of two-dimensional graphs as shown in Figure 1. The feasible regions for the alternative solutions are shown in Figure 1. The cost objective is used as a common axis to relate the trade-offs between the various objectives. It can be seen that the probabilities and RI fall within a reasonable range, and that the critical cost falls at around £440,000. The interesting feature that a decision maker can gather from this graph is the choice of probabilities for S1, S2 and RI. Supposing the decision maker is determined to keep only S1 to a minimum at the expense of the other objectives, it can be seen from the graph that the failure probability of S1 can indeed be very low. However, the other objectives would then remain at relatively high failure probability values. For example, at an investment of £500,000, the DM has the choice of the probability limits shown in Table 4. Of course, the sensible choices are values that fall in between these limits; the "best" compromise for a £500,000 investment could very well be a solution that offers a reasonably low S1, S2 and RI. These various alternatives could be selected intuitively by the DM or formulated as an MADM problem for further rational selection (a small weighted-scoring sketch of such a selection follows below).
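Where a formal MADM selection is preferred over intuition, even a simple weighted-sum rating makes the comparison explicit. The following sketch is illustrative only: the three alternatives are invented (loosely shaped like the limits in Table 4), the weights are arbitrary, and the rating scheme is a plain normalised weighted sum rather than any specific method from the paper.

```python
# Simple weighted-sum MADM ranking of design alternatives; every attribute
# (failure probabilities, risk index, cost) is to be minimised.
alternatives = {                 # hypothetical efficient designs at ~£500,000
    "A": {"P_S1": 0.006, "P_S2": 0.905, "RI": 0.394, "Cost": 500.0},
    "B": {"P_S1": 0.818, "P_S2": 0.072, "RI": 0.394, "Cost": 500.0},
    "C": {"P_S1": 0.400, "P_S2": 0.400, "RI": 0.300, "Cost": 500.0},
}
weights = {"P_S1": 0.35, "P_S2": 0.35, "RI": 0.2, "Cost": 0.1}

def score(alt):
    """Lower is better: each attribute is scaled by the worst value seen."""
    worst = {k: max(a[k] for a in alternatives.values()) for k in weights}
    return sum(weights[k] * alt[k] / worst[k] for k in weights)

for name in sorted(alternatives, key=lambda n: score(alternatives[n])):
    print(name, round(score(alternatives[name]), 3))
```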
CONCLUDING REMARKS

A new integrated approach to designing for safety, incorporating PRA methods and MODM techniques, has been presented. The ability to provide decision-making support based on the analytical power of PRA enables the DM or designer to select the "best" alternative solution, one that satisfies a set of conflicting objectives as closely as possible. FMECA and BRM are used in conjunction with ISTM, a MODM technique, to produce the desired integration. A hydraulic hoist transmission system is used to illustrate the use of this integrated approach. It can be said that the MODM technique offers a new and exciting complement to PRA. The DM is able to reason about and compare alternatives based on defined objectives. Better decisions in risk analysis during the design stages mean better utilisation of limited resources, which will lead to a more holistic approach towards designing for safety.
[The figure plots the risk objectives S1, S2 and RI against cost (£1,000) over approximately £400,000 to £560,000; the plotted data points are not recoverable in this copy.]

FIGURE 1: PLOT OF RISK VS COST (PARETO SURFACE)
TABLE 4: PROBABILITY LIMITS AT £500,000

P(S1) (×10⁻²) | P(S2) (×10⁻²) | RI
0.006121 | 0.904782 | 0.394434
0.818294 | 0.072332 | 0.394434
1.577761 | 1.577761 | 0.180032
REFERENCES

Dhingra, A. K. (1992). Optimal Apportionment of Reliability & Redundancy in Series Systems Under Multiple Objectives. IEEE Transactions on Reliability 40:4, 576-582.
Dixon, P. (1964). Decision tables and their applications. Computers and Automation 13:4, 376-386.
Goicoechea, A., Hansen, D. R. and Duckstein, L. (1982). Multiobjective Decision Analysis with Engineering and Business Applications. John Wiley & Sons.
Haimes, Y. Y. and Stakhiv, E. Z. (eds.) (1985). Risk-Based Decision Making in Water Resources. American Society of Civil Engineers.
Haimes, Y. Y., Hall, W. A. and Freedman, H. T. (1975). Multiobjective Optimization in Water Resources Systems. Elsevier Scientific Publishing Company.
Henley, E. J. and Kumamoto, H. (1991). Probabilistic Risk Assessment: Reliability Engineering, Design, and Analysis. IEEE Press, USA.
Inagaki, T., Inoue, K. and Akashi, H. (1978). Interactive Optimization of System Reliability Under Multiple Objectives. IEEE Transactions on Reliability R-27:4, 264-267.
Misra, K. B. and Ljubojevic, M. D. (1973). Optimal Reliability Design of a System: A New Look. IEEE Transactions on Reliability R-22:5, 255-258.
Nakagawa, Y. and Nakashima, K. (1977). A Heuristic Method for Determining Optimal Reliability Allocation. IEEE Transactions on Reliability R-26:3, 156-161.
Salem, S. L. (1979). Decision Table Development and Application to the Construction of Fault Trees. Nuclear Technology 42, 51-64.
Schmuhl, J., Hartmann, R., Muller, H. and Hartmann, K. (1996). Structural Parameter Approach and Multicriteria Optimization Techniques for Complex Chemical Engineering Design. Computers & Chemical Engineering 20:Suppl., S327-S332.
Sen, P. (1991). Marine Design: The Multiple Criteria Approach. RINA, 261-276.
Sen, P. and Tan, K. G. (1994). Modelling Uncertainty in Cost Estimation. 13th International Cost Engineering Congress, October 1994, London, pp. CE 9.
Sen, P. and Yang, J. B. (1993). A Multiple Criteria Decision Support Environment for Engineering Design. Proceedings of the 9th International Conference on Engineering Design, The Hague, The Netherlands, August 1993.
Tan, K. G. and Lau, L. (1990). Probabilistic Approach to Engineering Cost Estimation. Ngee Ann Journal, Journal of Ngee Ann Polytechnic, 3:1, 12 pages.
Tan, K. G. (1996). A Framework for Decision Support Systems in "Designing for Safety". PhD First Year Report, Engineering Design Centre, University of Newcastle, December 1996.
Wang, J., Yang, J. B., Sen, P. and Ruxton, T. (1996). Safety based design and maintenance optimisation of large marine engineering systems. Applied Ocean Research 18:1, 13 pages.
Yang, J. B., Chen, C. and Zhang, Z. J. (1990). The Interactive Step Trade-Off Method (ISTM) for Multiobjective Optimization. IEEE Transactions on Systems, Man, and Cybernetics 20:3, 688-695.
Yang, J. B. and Sen, P. (1994). Interactive Trade-off Analysis and Preference Modelling for Preliminary Multiobjective Ship Design Synthesis. Technical Paper, EDC, University of Newcastle, EDCN/MCDM/PAPERS/18/1.
THE MACRO PROJECT: COST/RISK EVALUATION OF ENGINEERING & MANAGEMENT DECISIONS
John Woodhouse The Woodhouse Partnership Ltd, Headley Common Rd, Newbury, Berkshire, RG19 8LT United Kingdom.
1. Abstract

The European EUREKA project 'MACRO' (project no. EU1488) is assembling quantitative techniques for cost- and risk-based engineering/maintenance decisions, and is developing the necessary guidance for their use, particularly in weak-data circumstances. This paper outlines the scope of the project, the key components identified so far and the "Asset Performance Toolkit" that will result. Preliminary studies and feasibility trials have already shown scope for very substantial impact (improved asset availability and life cycle cost reductions). MACRO is bridging the gap between reliability engineering/asset management and the 'front line' business of industrial decision-making.
1.1 Keywords

Risk-based, Decision-making, Cost/Benefit, Optimisation, Asset Management
2. Introduction

The essence of Asset Management is getting the best value for money out of strategic assets over their lifespan or any specified lesser period. Inevitably, the effectiveness of any business in managing its assets is governed by the component decisions that are taken. High quality decisions will manifest themselves through value added to the business. Assigning value to more abstract concepts (such as increased safety or public confidence) inevitably involves a degree of subjective judgement. Nevertheless, a structured approach to the task is crucial if a company wants to establish a robust and defensible decision making policy. It is impossible to contemplate an assessment of 'value added' without considering the likelihood of success and the risks associated with each possible scenario. Nevertheless, the sophistication and analytical detail of modern reliability engineering methods are often wasted: either the methods are understood by too few, or the available data is of limited quality, inadequate volume or restrictive application. Usable quantitative techniques for decision analysis are still rare in industry. The 'front line' environment of poor data, problem complexity and commercial pressure allows little time or even motivation to perform any depth of analysis. Yet the cost- and risk-consequences of relying on 'engineering judgement' are staggering. Wider use of even the most basic reliability
engineering methods can make very substantial business impact; in one recent case, 8-figure annual savings were achieved through reduced maintenance bills and improved plant availability¹. This paper discusses the key drivers in an asset-centred business. It identifies the compatibility of the MACRO project with the necessary practical responses to such business drivers. The practical answers bridge the current gulf between academic theory and industrial practice: this void cannot be spanned simply by hoping that asset managers can be converted magically into reliability experts. The subjects are too complex and, apart from the lucky few, the audience is not used to thinking in terms of probability or business risk. Selected elements need to be prepared and fed to the industrial front line in pre-digested morsels.
3. The MACRO Project

The MACRO project is a 3-year research and development programme (started in December 1995), supported by the UK Department of Trade & Industry and operating under the European EUREKA MAINE umbrella. The collaborating organisations are:

The Woodhouse Partnership Ltd (project managers)
Yorkshire Electricity
A/S Norske Shell
ATL Consulting Services Ltd
Brown & Root UK Ltd
Det Norske Veritas
Hozelock Ltd
Institute of Asset Management
The National Grid Company PLC
Websters Mouldings Ltd
3.1 Deliverables

The terms of reference for the project identify two key deliverables:

• A decision navigation guide that helps industrial decision-makers to identify which tools, techniques and data requirements are necessary for which decision types and operational/business circumstances.

• A set of modular cost/risk evaluation and optimisation software utilities, called the "Asset Performance Toolkit". This software is designed for use with varying data quality and is applicable to a wide range of industrial decision-support requirements. One of the first modules to be completed (APT-SPARES) is illustrated in this paper (see the sketch below).
Section 6 of this paper describes the range of facilities being developed.
¹ Utility company 1995 study of five selected maintenance intervals (inspections, overhauls, functional tests and painting/lubrication), based on quantified but range-estimated hazard rate characteristics and failure consequences.
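To make the cost/risk trade-off behind a spares-holding decision concrete, here is a minimal sketch; it is emphatically not the APT-SPARES module itself, and the demand rate, lead time and costs are invented. It balances the annual cost of holding spares against the expected cost of a stock-out during the replenishment lead time, assuming Poisson-distributed demand.

```python
# Cost/risk view of a spares holding level: annual holding cost versus the
# expected annual cost of stock-outs during replenishment lead time,
# assuming Poisson demand (all figures hypothetical).
from math import exp, factorial

demand_per_year = 4.0       # mean demand for the spare part
lead_time_years = 0.25      # replenishment lead time
holding_cost = 800.0        # £/unit/year to keep a spare on the shelf
shortage_cost = 50_000.0    # £ consequence of one stock-out (downtime etc.)

lam = demand_per_year * lead_time_years   # mean demand over the lead time

def p_stockout(stock):
    """P(lead-time demand > stock) for Poisson-distributed demand."""
    return 1.0 - sum(exp(-lam) * lam ** k / factorial(k) for k in range(stock + 1))

for stock in range(6):
    # Rough model: ~one replenishment cycle per unit demanded per year,
    # each cycle exposed to the stock-out probability above.
    expected = holding_cost * stock + shortage_cost * demand_per_year * p_stockout(stock)
    print(f"stock={stock}: P(stock-out)={p_stockout(stock):.4f}, "
          f"expected annual cost=£{expected:,.0f}")
```

With these invented figures the expected annual cost is minimised at a holding level of four spares; the point of such a utility is to make exactly this kind of trade-off visible to the decision-maker even when the inputs are only range estimates.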
4. Defining the need
MACRO was conceived to provide some standards and practical utilities for front line decision-makers. It includes the application of several, individually familiar, analysis techniques and will, in particular, generate guidance on which tool or technique is applicable in which circumstance. Clearly the boundaries for such a toolkit are difficult to define, and the first phase of the project involved a feasibility and definition task. This first phase defined the modularity of the toolkit - which groupings of facilities best fit the range of decisions faced by the same individual (or, increasingly, multi-disciplined team). The following list is certainly not exhaustive but is presented as a working selection of decision types and decision-makers. It provides a representative coverage of the practical requirements.
4.1 Specific Decision Responsibilities
4.1.1 Maintenance/plant/asset managers
• Justify maintenance budget: the sum of individual justifications of preventive tasks, inspections, shutdowns and housekeeping/overheads, e.g. if we reduced preventive maintenance by 10%, what would be the net cost/risk impact?
• Justify specific shutdowns (work content, timing, cost/risk impact)
• Set intervals for periodic shutdowns
• Set intervals for major inspections or other planned maintenance tasks
• Set holding levels for critical spares (incl. supplier comparisons etc.)
• Compare in-house with contractor options for specific jobs or ranges of jobs, i.e. "what if?" considerations of differences in operating costs, overheads, work quality, downtime and response times, resulting equipment reliability etc.
4.1.2 Engineers & Technologists
The same decisions are required as for a maintenance/plant/asset manager, usually at an equipment-specific level, PLUS operational problem-solving, triggered by:
• high maintenance or operating cost
• unreliability, inefficiency, bottlenecking
• a recent 'big bang' event
• external requirement (e.g. legislation), management 'concern'
• level of general irritation, hunch, accidental discovery
Such problem-solving involves evaluation of specific design/operations changes, preventive, detective (inspection/condition-based) or corrective maintenance strategies, and contingency planning options (such as spares, bypass facilities etc.). This is likely to encompass:
• Maintenance strategy review, either from an existing task list (review & filtering) or from 'first principles' of Failure Modes & Effects or Deterioration Modelling
• Spares requirements reviews
• Individual Planned Maintenance, Inspections, Condition Monitoring and equipment replacement intervals, particularly with poor data
• Condition reaction points for corrective maintenance or modification
• Repair versus replacement comparisons
4.1.3 Operations or Production Engineers/Managers
Operational problem-solving triggered by:
• planned & unplanned downtime
• operating costs
• production output levels
• personnel productivity
• production quality
• operational efficiency
• TPM/TQM metrics, management 'concern'
• recent 'big bang' event
• external requirement (e.g. legislation)
• level of general irritation, hunch, accidental discovery
De-bottlenecking
Batch sizes & product mix (e.g. contribution analysis)
Project evaluations:
• Equipment modifications
• Process changes
• New technology
Shutdown intervals
Major asset replacement justification & timing (incl. refurbishment versus replacement comparisons).
4.2 Project Engineers/Contracting Engineers
• Project evaluations, in greenfield and brownfield circumstances
• Design option comparisons: Life Cycle Costing comparisons, pay-back analysis, assessment of system effects
• Initial maintenance strategy recommendations
• Initial spares requirement recommendations
• Manning and other resource levels
• Evaluation of life extension options
• Vendor & tender evaluations/comparisons
• Life Cycle Cost quantification (including risk exposures).
The next stage in translation towards a suitable 'toolkit' involved a high level separation of potentially helpful technologies. In the following section, we present the preliminary working list of such activities and their related possible aids:
5. Decision Support Requirements
5.1 Operating Environment
5.1.1 Problem identification
Task | Appropriate Aids
Performance Indicator hierarchy and usage | Procedures
Top-10 data: reliability, maintenance cost, downtime | Procedures/tailored IT
Total impact assessment (incl. lost opportunity costs) | Procedures/tailored IT
On-line or automatic problem diagnosis | CBM, AI & neural nets etc.
Programmed review and continuous improvement infrastructure | Procedures

5.1.2 Problem interpretation/investigation
Task | Appropriate Aids
Root Cause Analysis | SPC/Procedures
Performance Indicator 'drill-down' methods | Procedure/tailored IT
Pattern-finding (incl. SPC, Fourier techniques, RELIAN etc.) | Procedure/software

5.1.3 Evaluating possible solutions
Task | Appropriate Aids
Design or usage changes | Calculation/simulation
Preventive Maintenance tasks | Calculation
Inspection/condition-based maintenance tasks/tools | Calculation
Contingency planning options (impact reduction) | Calculation
Spares and stock holding strategy | Calculation/simulation
Manning levels and contractor requirements | Calculation/simulation
Evaluation of overall system availability effects | Simulation
5.2 PROJECT ENVIRONMENT
Task | Appropriate Aids
Project (LCC) evaluation | Calculation
Design option comparisons | Calculation
System configuration comparisons | Simulation
Overall system performance/availability prediction | Simulation
Operating strategy - performance vs risk trade-off | Calculation
Operating strategy - equipment loading vs. lifespan/performance | Calculation/simulation
Operating strategy - seasonality/external factors impact | Procedure/simulation
Maintenance strategy - preventive intervals | Calculation
Maintenance strategy - detective (inspection & condition monitoring) intervals | Calculation
Maintenance strategy - corrective (contingency option evaluations) | Calculation
Spares and stock holding strategy | Calculation/simulation
Manning levels and contractor requirements | Calculation/simulation
NOTE: 'Procedures' refers to the need for practical guidance or organisation rather than analytical tools, 'Tailored IT' relates to local information systems requirements, 'Calculation' includes any pure mathematical solutions, and 'Simulation' covers subjects most suitable for dynamic simulation and Monte Carlo or other sampling.
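To make the 'Simulation' category concrete: Monte Carlo sampling lets three-point range estimates (worst / most likely / best) be propagated to a cost/risk result. The sketch below is purely illustrative; the figures and variable names are invented and it is not MACRO code.

```python
# Purely illustrative Monte Carlo sketch (not MACRO code): propagate
# worst / most-likely / best range estimates to an annual cost/risk figure.
import random

def sample_triangular(worst, likely, best):
    # A triangular distribution is a simple way to sample a three-point estimate.
    lo, hi = min(worst, best), max(worst, best)
    return random.triangular(lo, hi, likely)

def annual_cost_sample():
    downtime_cost = sample_triangular(3000.0, 2500.0, 2000.0)  # cost per failure
    failures_per_year = sample_triangular(0.5, 0.2, 0.1)
    pm_cost_per_year = sample_triangular(900.0, 700.0, 600.0)
    return pm_cost_per_year + failures_per_year * downtime_cost

samples = sorted(annual_cost_sample() for _ in range(10000))
print("5th / 50th / 95th percentile annual cost:",
      round(samples[500]), round(samples[5000]), round(samples[9500]))
```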
Finally in the definition phase, MACRO considered a practical grouping of the underlying mathematical techniques that would support the calculation aids. At an early stage, it was considered uneconomic and of limited value to enter far into the simulation arena - there are several proprietary tools and development languages already on the market, and the MACRO involvement should be limited to guidance on when and where they should be used. The MACRO utilities should, on the other hand, concentrate on the use of mathematical modelling techniques and high-speed "what if?" analysis, particularly in the circumstance of limited or poor quality data. The following represents the current groups of analytical techniques:
6. Main modules for the Asset Performance Toolkit
6.1 ASSET LIFE & MODIFICATIONS:
Analysis functions: Asset life cycles, life extension options, project evaluation, repair/replace decisions.
6.2 INTRUSIVE (PRE-PLANNED WORK) MAINTENANCE TASKS:
Analysis functions: Optimal maintenance intervals, cost/risk evaluation of preventive maintenance, maintenance opportunities, manufacturers' recommendations & warranties, legal constraints.
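The interval-optimisation idea behind this module can be pictured with the classic cost-per-unit-time trade-off: frequent preventive work raises planned cost, infrequent work raises failure risk. The sketch below illustrates that principle under an assumed Weibull wear-out law with invented costs; it is not the APT algorithm itself.

```python
# Illustrative cost/risk trade-off for a preventive maintenance interval
# (assumed Weibull wear-out and invented costs; not the APT algorithm).
import math

def cost_rate(interval, pm_cost, failure_cost, eta, beta):
    # Probability of failing before the planned task at t = interval.
    p_fail = 1.0 - math.exp(-((interval / eta) ** beta))
    expected_cost = pm_cost * (1.0 - p_fail) + failure_cost * p_fail
    # Crude expected cycle length: full interval on survival, half on failure.
    expected_length = interval * (1.0 - p_fail) + 0.5 * interval * p_fail
    return expected_cost / expected_length

best = min(range(3, 61, 3),
           key=lambda t: cost_rate(t, pm_cost=2000.0, failure_cost=25000.0,
                                   eta=48.0, beta=2.5))
print("Lowest combined cost/risk rate at a PM interval of", best, "months")
```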
6.3 CONDITION-BASED MAINTENANCE TASKS:
Analysis functions: Optimal inspection and condition monitoring intervals, failure-finding & test intervals, optimal condition reaction points.
6.4 WORK GROUPING AND SHUTDOWN STRATEGY:
Analysis functions: Optimal task grouping, shutdown intervals, age- versus block-based maintenance, opportunity evaluations.
6.5 MATERIALS & RESOURCES:
Analysis functions: Optimal spares holding levels, evaluation of alternate locations, supply methods, pooling options.
Example process summary:
[Figure: cost/risk inputs (labour costs, materials costs, lost opportunity costs, penalty & consequential costs, premiums paid for legal compliance, environment, staff welfare & public image, plus CAPEX, OPEX and disposal costs) feed the calculation of optimum reaction points, "with numbers attached". A chart headed "Using weak information" plots worst case, most likely and best case cost curves against inspection interval (months), indicating the maximum range for the decision.]
7. The Raw Material: "Hard" & "Soft" data
In general, the data is entered by range-estimated examples (e.g. "50-60% would reach 5 years without major failure"). Survival (or cumulative failed) information is certainly easier to estimate than instantaneous probabilities (hazard rates), but it is also possible to use more detailed reliability information if it is available. Otherwise, curves are fitted to the estimates, drawing from a range of common curve types depending upon best-fit criteria. Clearly, an interface with existing Failure Modes & Effects Analysis (FMEA) or Reliability Centred Maintenance (RCM) studies is also valuable. This will populate the failure mode descriptions and provide some starting points to the estimation of probability patterns (usually in the form of estimated mean failure rates or deterioration timescales). An early discovery has been the usual inadequacy of Weibull plots - due to the combination of limited (or zero) historical data and multiple potential failure modes. MACRO is making radical improvements in the description and usage of failure patterns and reliability modelling. For example, the Maintenance module will be capable of handling several concurrent hazard rate curves, each calculated and 'fitted' to range-estimated points on the cumulative failure/survival curve.
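To make the curve-fitting step concrete, the following minimal sketch fits a two-parameter Weibull cumulative-failure curve to a few point estimates taken from range estimates (e.g. the midpoint of "50-60% survive 5 years" read as roughly 45% failed). The data points are invented and this illustrates the principle only, not MACRO's own fitting logic.

```python
# Minimal curve-fitting sketch (invented data; not MACRO's own fitting logic).
import numpy as np
from scipy.optimize import curve_fit

def weibull_cdf(t, eta, beta):
    # Cumulative fraction failed by time t under a two-parameter Weibull law.
    return 1.0 - np.exp(-((t / eta) ** beta))

# Midpoints of range-estimated cumulative failure fractions at 2, 5 and 10 years.
t_pts = np.array([2.0, 5.0, 10.0])
f_pts = np.array([0.10, 0.45, 0.85])

(eta, beta), _ = curve_fit(weibull_cdf, t_pts, f_pts,
                           p0=(5.0, 1.5), bounds=(1e-6, np.inf))
print(f"Fitted Weibull scale eta = {eta:.2f} years, shape beta = {beta:.2f}")
```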
8. Other Influences
8.1 Operational efficiency
One of the commonest opportunities for cost savings is in energy and materials consumption, or production output efficiency - and there is often a close relationship between reliability and such operational performance. Yet rarely has this been explored or successfully handled in a flexible manner. MACRO has incorporated facilities for evaluating and optimising the combination of risks and efficiencies, whether the latter manifests as variations in energy or consumables (inputs), or as degradation of productive output (volume or quality). Examples (already studied and optimised) include the de-coking strategy for an industrial furnace - a combination of efficiency-oriented maintenance with reliability/availability repercussions.
8.2 Lost opportunity costs
This is perhaps the single most elusive piece of data in the puzzle. Uncertainty in reliability data is often swamped by the lack of knowledge about the financial consequences of downtime. Here MACRO provides some important lessons, forcing the cost of unavailability, lost opportunity and other 'intangibles' out into the open. Considerable scope is possible in the use of reference examples, "what if?" sensitivity testing and standardisation. Some guidelines on criticality analysis and process mapping (failure consequence assessment) are being generated. This is also an area of potential direct interfacing with process and plant availability simulators.
8.3 Others
There are several other 'special case' areas of influence upon maintenance and replacement intervals. Firstly, there is the issue of maintenance cost escalation: this includes costs that are themselves a function of time (for example, the degree of preparation required before painting, which increases with time since it was last performed). In addition, legal, safety or environmental constraints may also apply - MACRO methods will include calculation of the "premium paid for compliance", over and above the direct cost/risk impact to the company. Specific downtime or business-related opportunities may exist for certain intervals or one-off occasions. Warranties or service agreements may affect the economic consequences of some of the failure modes. Each of these factors has been considered for breadth of applicability, form of presentation and practical handling method.
9. An example: evaluating spares requirements
This area is relatively well researched and partially developed in various forms already. MACRO is collating these facilities and extending them to a full risk-based evaluation of spares, raw materials and consumable stock-holdings. Some innovations have been needed and developed to handle factors not incorporated before: the project has completed a slow-moving spares evaluator, APT-SPARES, which exceeds the capabilities of any other existing algorithms. Until this point, the state-of-the-art was represented by some risk analysis tools that perform queuing theory calculations, using Poisson models for demand distribution. On the economic side, such tools consider the cost of capital tied up, storage and maintenance costs, and some provision for depreciation or loss in re-sale value (not to be confused with the depreciation applied for taxation purposes). APT-SPARES adds, among other factors, the
whole area of criticality (or stock-out consequences), the various forms of replacement timescale that are possible (emergency re-order, workshop repair etc.) and constraints on useful life for the spare (technology overtake, shelf-life, cessation of usage/demand). This necessitates a much-escalated calculation of conditional probabilities, as the Markov model has more states and possible combinations. The resulting analysis is a combination of generated probability distributions, iterative sampling, sensitivity and confidence testing, and economic calculation. A printout of the APT-SPARES module is attached to illustrate the type of analysis report being generated by MACRO. The user interface concentrates on the "what if?" style of operation, with a choice of 'single analysis' and 'batch' studies (the latter provides a spreadsheet layout, allowing block assignment of variables, global "what ifs?" and various database views). Results of the real-life application of APT-SPARES have already revealed big opportunities across a wide range of assets within a company. In one case, £6-8 million is being saved on the spares requirements for a new petrochemical installation. In another, stockholdings have been reduced from £900k to £350k as a direct result - without jeopardising risk exposures. In this simple example, we present the output of one study on one type of equipment. Turbine rotors are expensive, have long replacement leadtimes and are generally very reliable. Nevertheless, the turbines are also usually very critical - failure consequences can be great in terms of process downtime, performance losses, alternative generation costs etc. Consequently, for a population of three such turbines, a spare rotor was considered worthwhile. Application of a more comprehensive cost/risk analysis, as evaluated by APT-SPARES, in this case supported a change in the holding policy (to the holding of two spares), despite the relatively high cost of such equipment and the high reparability of any failures that might occur (80% estimated reparable within 6 weeks). The net benefit of holding a second spare is worth £10,000/year (combined effect of risk reduction and extra costs involved).
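The flavour of the underlying queuing-theory calculation can be sketched very simply: with Poisson demand over the re-supply leadtime, the annual cost of holding n spares is the holding cost plus the expected cost of stock-outs. The figures below are invented and the model is deliberately stripped down (no repair route, shelf-life or demand cessation), so it is only a caricature of what APT-SPARES does.

```python
# Stripped-down spares holding sketch (invented figures; not APT-SPARES).
from scipy.stats import poisson

def annual_cost(n_spares, failures_per_year, leadtime_years,
                holding_cost_per_spare, stockout_cost):
    demand_in_leadtime = failures_per_year * leadtime_years
    # Probability that demand during one re-supply leadtime exceeds the stock.
    p_stockout = poisson.sf(n_spares, demand_in_leadtime)
    expected_stockouts_per_year = failures_per_year * p_stockout
    return (n_spares * holding_cost_per_spare
            + expected_stockouts_per_year * stockout_cost)

for n in range(4):
    cost = annual_cost(n, failures_per_year=0.3, leadtime_years=1.0,
                       holding_cost_per_spare=15000.0, stockout_cost=500000.0)
    print(f"{n} spare(s): expected annual cost £{cost:,.0f}")
```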
APT-SPARES analysis example
[Screenshot of the APT-SPARES analysis report: stock-holding options compared by purchase & ownership cost, with re-supply, repair and demand-cessation parameters and the resulting cost/risk totals.]
The above is a small sample only - the full analysis includes sensitivity testing (most data is usually range-estimated). The other analysis modules are in various stages of release as the following schedule indicates:
Schedule of Deliverables (Procedures, Training & Software)
• Slow-moving spares: June '96
• Maintenance intervals: Mar '97
• Project evaluation & Life Cycle Costing: Feb '97 / Sep '97
• Risk-based inspection: Oct '97
• Fast-moving materials: May '98
• Shutdown strategies: Sep '98
10. Conclusions
The justification for engineering, resource and maintenance strategies is not robust, complete or defensible if it excludes risk exposures and probabilistic factors. Yet conceptual complexities, lack of relevant data, and limited time for decision-making all conspire to make rigorous cost/risk evaluation a challenge. In this paper, however, a project has been introduced that is tackling some of the core issues head-on: the systematic treatment of cost and risk 'trade-off' where hard data is most limited. Such handling of uncertainty is a fundamental element of effective life-cycle management of physical assets. The generic principles being developed as part of this project are clearly applicable across the whole range of industrial operations. The specific results already attained reveal how important this area is becoming, and how urgent is the adoption of basic skills and tools. The MACRO Project is an interesting and promising development - innovative thinking has been brought together from a range of organisations who are all committed to implementing cost/risk decision-making in their respective activities. The tangible results are already emerging.
DEVELOPMENT OF A METHODOLOGY TO EXAMINE THE COST EFFECTIVENESS OF HEALTH AND SAFETY MANAGEMENT PRACTICES
K.N. Priestley and A.D. Livingston
Human Factors Department, WS Atkins Safety and Reliability, Birchwood Boulevard, Warrington, Cheshire, WA3 7WA, UK
ABSTRACT
This study seeks to establish what weaknesses in safety management practices actually cost a company. The methodology proposed includes the development of data collection methods which allow the costs and the root causes of accidents to be identified. If the investigation of root causes is carried out in sufficient detail, the 'latent failures' within the safety management system can be identified, and costs may then be apportioned to elements of safety management via the root causes. By measuring safety management costs before and after the introduction of health and safety initiatives, comparisons can be made to examine what level of savings (if any) has been accrued by the company. Once sufficient data has been gathered it is envisaged that such a methodology may also reveal which areas of safety management practice bring about the larger returns on investment. This paper reports on the initial phase of the study.
KEYWORDS
Safety Management Research, Accident Costs, Root Cause Analysis, Methodology, Case Study, Health and Safety Executive, UK
INTRODUCTION
Two recent UK Health and Safety Executive (HSE) publications - Successful Health & Safety Management (1991) and The Costs of Accidents at Work (1993) - suggest that the development of effective health and safety management systems will reduce the costs of occupational ill health, injury and death; and that the associated savings in financial costs will make for more efficient and profitable organisations. In order to establish more evidence of the specific relationship between health and safety management and the commercial savings that this can provide, the HSE commissioned WS Atkins to develop a methodology to enable analysis of the cost effectiveness of health and safety management practices, and to trial and evaluate this methodology within industry. The study seeks to establish the actual costs to a company of weaknesses in safety management practices by adopting the philosophy that root causes of an accident may be related to specific elements of safety
management. The HSE's model of health and safety management consists of six main elements: Policy, Organising, Planning, Implementing, Measuring, and Reviewing.
The objective of this initial phase of the study was to develop a method to identify the costs of accidents and another method to investigate the root causes of accidents, to enable the latent failures/weaknesses in the safety management system to be revealed. This process will enable the costs of a particular accident to be attributed to the appropriate safety management element/s. Although it was recognised that the contributions of safety management failures to an accident may not be equal, it was considered that a judgement on the percentage contribution would prove difficult and overly subjective. It was therefore decided to divide the costs equally between the root causes identified for each accident. Differences in contribution should become apparent through the frequency of root cause occurrence, if the assumption that a system failure is implicated in many accidents is true. This assumption will be examined in the later phases of the study.
An advantage of developing a methodology to calculate the costs of safety management failure is that more accurate comparisons of the effects of safety initiatives can be made by measuring safety management costs before and after the introduction of health and safety initiatives. Once sufficient data has been gathered it is envisaged that such a methodology would also help to reveal which areas of safety management practice bring about the larger returns on investment.
The activities for the first phase of the study were identified as the simplification of the costing methodology previously used by the HSE, and the development of a root cause analysis methodology which would allow safety management failures to be identified. The methods developed for the simplification of the costing of accidents and the identification of root causes were tested in a pilot study in a manufacturing company in July/August 1995.
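The equal-apportionment rule described above is simple to mechanise: each accident's cost is divided by the number of root causes identified, and the shares are accumulated per safety management element. A minimal sketch with hypothetical accident records:

```python
# Minimal sketch of equal cost apportionment across root causes,
# accumulated per HSE safety management element (hypothetical data).
from collections import defaultdict

accidents = [
    {"cost": 600.0, "elements": ["Planning", "Implementing"]},
    {"cost": 150.0, "elements": ["Organising"]},
    {"cost": 900.0, "elements": ["Planning", "Measuring", "Reviewing"]},
]

element_costs = defaultdict(float)
for accident in accidents:
    share = accident["cost"] / len(accident["elements"])
    for element in accident["elements"]:
        element_costs[element] += share

for element, cost in sorted(element_costs.items()):
    print(f"{element}: £{cost:.2f}")
```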
ACCIDENT COSTING METHODOLOGY
The accident costing methodology essentially provides a structure and prompt for individuals to seek out and record information on the costs arising out of an accident. These costs may be financial, for example costs associated with replacement parts and equipment, or they may be lost opportunity costs which arise when staff are paid for no production, or are redirected to tasks dealing with the consequences of the accident.
A complex methodology for investigating the costs of accidents that occur within an organisation had already been published by the HSE. This required shift managers/supervisors to initially complete four different forms, which were then forwarded to different departments for further costs to be added. In addition, weekly updates of one of the forms were also required. The information on these forms then had to be coded and the costs from different areas collated until the final figure was reached. The aim of simplifying this methodology was to reduce the amount of paperwork, make the data more easily auditable, and reduce the amount of time for completion, without losing the accuracy of what was recorded.
For the pilot study, potential factors which could contribute to the accident costs were listed on a reference table: see Table 1. This reference table acted as a prompt/memory aid to the user so that all cost factors were considered. The costs were then recorded on a single form (the Accident Cost (AC) Form), which was kept by the user and updated as and when the costs were identified. The form required a table to be completed, which acted as a spreadsheet when the costs came to be calculated. The headings of the Accident Cost Form are presented in Figure 1. If costs were incurred by other departments it was the responsibility of the user to contact the relevant department and record the costs on the one form.
TABLE 1: ACCIDENT COST FACTORS
ACF Code | Accident Cost Factor | Examples
01 | Initial response to accident | First aid, fire fighting, stopping machinery, making area safe
02 | Cleaning up | Time/costs incurred cleaning an area where an incident has occurred
03 | Transport | Of injured person (to hospital/home), of vehicles
04 | Absence | Short term, from workstation to obtain treatment (returning same day); long term, through injury/illness (more than a day)
05 | Assessing/rescheduling production | Planning, organising for work targets to be met
06 | Lost production and wasted time | Individuals waiting to be able to start work, delays
07 | Replacement labour | Other company staff, temps from an agency
08 | Reworking product | Time spent/costs in bringing substandard product/s up to standard
09 | Repairing damage/faults | To plant and equipment; repairing/resetting faults
10 | Hiring/purchasing | Tools, plant, equipment, services etc.
11 | Disposing | Waste, materials, product, plant, equipment
12 | Consultants fees | Investigation of accident, H&S surveys, advice on remedial measures
13 | Administration | Preparing forms, reports; conducting investigations, meetings; dealing with outside agencies
14 | Other | Please specify on AC Form
A second form (the Managerial/Administration (MAT) Form) was devised to capture any managerial/administration costs that were not easily attributable to one incident. These were required to be completed weekly by managers.
Figure 1: Headings on the Accident Cost (AC) Form
ACF Code (see ACF Table) | Description (Job Title or Item) | External/Internal/Overtime or Materials | Business Unit | Grade | £ Rate/hr or Item Cost | Time (hours worked) or No. of Items | Overtime Multiplier
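The spreadsheet arithmetic implied by these headings is simply rate (or item cost) times hours (or item count) times any overtime multiplier, summed over the rows. A minimal sketch with hypothetical rows:

```python
# Minimal sketch of totalling AC Form rows (hypothetical entries).
rows = [
    # (ACF code, description, £ rate/hr or item cost, hours or items, overtime multiplier)
    ("01", "First aider, initial response", 9.50, 0.5, 1.0),
    ("06", "Operators waiting to restart",  8.00, 6.0, 1.0),
    ("07", "Agency replacement labour",    12.00, 8.0, 1.5),
    ("10", "Replacement machine guard",    85.00, 1.0, 1.0),
]

total = sum(rate * quantity * multiplier
            for _, _, rate, quantity, multiplier in rows)
print(f"Cost recorded for this accident: £{total:.2f}")
```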
ROOT CAUSE ANALYSIS
The development of the root cause analysis technique is clearly a fundamental cornerstone of this research. At the outset, the requirements of the research determined a number of criteria which it was felt the technique needed to meet, namely:
• The method must be consistent with the safety management principles described in Successful Health & Safety Management.
• The method must allow the root causes identified to be coded, to allow cross-referencing with the safety management elements to be undertaken. This will require the technique to be prescriptive to some extent, to reduce variability in interpretation between analysts and to aid statistical compilation and analysis of results.
• The method should be simple and practical to apply: it will be applied by company personnel, and a large number of relatively 'small' events (first aid accidents and near misses) would need to be investigated in a short period.
• The method should be intuitive, to enable ease of learning with minimum training.
• The method should be compatible with UK legislative regimes and safety culture.
• The method should focus on management and organisational issues.
A literature review of the available root cause analysis (RCA) techniques was initially undertaken to examine whether any of the methods lent themselves to the above requirements. Some techniques could not be translated easily onto the management model presented by the HSE because they were based on different models of safety management, e.g. techniques such as SCAT (Bird and Germain, 1985) and MORT (Johnson, 1980). Other techniques examined, such as TOR (Weaver, 1987), were not sufficiently prescriptive for the coding requirements of the research. None of the 21 methods reviewed met the exact requirements of the research project, so it was decided to develop a bespoke methodology. This comprised:
Accident Report Form: a front-end form completed by shift managers/supervisors to distribute the workload. It records accident details, and asks a series of questions about hazards, control measures, behaviour and immediate causes.
Root Causes Analysis Reference Table: used by a nominated person where negative responses are given on the Accident Report Form. It directs the user to the appropriate Root Causes Tables.
Root Causes Tables: designed to help the user consider the range of possible causes associated with a particular event and to search for evidence. If there was insufficient information submitted on the Accident Report Form, further investigation was required to identify whether the issues covered by the Root Causes Tables were relevant to the accident.
The issues covered in the tables were identified from the literature review of the different root cause analysis techniques. All the root causes named in the techniques were listed and grouped under common headings
which were developed into the tables. These lists were compared with the issues discussed in the HSE model and any omissions were addressed. The titles of the tables were:
• Risk Assessment
• Policies and Standards
• Verbal Communication
• Written Communication
• Information
• Supervision
• Training
• Procedures/Working Practices
• Monitoring and Reviewing
• Maintenance
• Design/Procurement
• Working Environment
• Recruitment
• Role Clarity and Accountability
• Medical/Health Surveillance
• Cooperation.
The root causes tables were cross referenced to encourage the nominated person to question progressively further behind the issues raised until they identified the latent failures in the safety management system.
FINDINGS FROM PILOT STUDY
Cost of Accidents
The revised method reduced the number of forms raised per accident during the pilot study to just two. However, it was found during the pilot study that the MAT form was not used: users at the company preferred to record managers' time on the AC form and use the appropriate code for administration. As a result of the pilot it was decided to simplify the system still further in the future and use only the AC form. Using this methodology, it was easier to collate and verify the information recorded. Coding was performed in situ by the users when completing the form, rather than at the end of the study, saving time and increasing accuracy. Thus it was judged that the simplification of the accident costing methodology had proved to be successful.
A total of £10,938 of costs was recorded over the data collection period. Accidents were recorded and their root causes investigated for a three-week period. The costs of the accidents noted during this time were monitored for a further two weeks in case any costs, e.g. staff absence time, ran over the three-week study period. During the study period 23 accidents were reported; the average cost for the different classifications of accident outcomes used by the company is shown in Table 2. It can be seen that the most expensive events were the near misses. The company involved operated a 24 hours/day, 7 days/week batch process operation with a perishable product. Typically the near miss events affected machinery or process, resulting in a loss of product or a batch which could not be recovered, and resulting in large expenses.
TABLE 2: AVERAGE COST OF DIFFERENT CLASSES OF ACCIDENT
Type of Outcome | Number of Events | Average Cost (£)
Over 1 day's absence | 2 | 456.75
First Aid Injury | 5 | 29.19
Near Miss | 16 | 617.43
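As a consistency check, the class averages reproduce the recorded total: (2 × £456.75) + (5 × £29.19) + (16 × £617.43) ≈ £10,938.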
Root Causes Analysis
Examination of the results from the pilot study showed that the root causes analysis tool was not a success for the purposes of the research. Despite training, and acknowledgement by the company of a 'no-blame' policy and the multi-factor causation of accidents, the results showed that in half the cases only one root cause was identified, and that typically the items nearest to 'blame' suggestions were selected, e.g. 'Procedures not followed' or 'Supervision less than adequate'. The cross-referencing provided to assist the users in examining what lay behind these issues was not used, and as a result only active failures were identified. Furthermore, from the information recorded on the forms it was not possible for the study organisers to progress the investigation after the study, because the reasons for selection of the root cause were not given. In the absence of latent failure identification it was not possible to allocate accident costs to the safety management elements of the HSE's model with any degree of confidence.
A number of methodological problems were identified. Firstly, while the research required the identification of latent failures, the emphasis on making the tool 'user friendly' and intuitive resulted in active and latent failures being combined on the tables. In some instances it would be necessary to follow the cross-referencing to reach the latent failures. Simplification of the accident investigation techniques had therefore led to loss of accuracy, resulting in a preponderance of active failure identification. Secondly, it was assumed that the safety staff within the organisation would be familiar with accident investigation methods and also with the HSE's model. The evidence from the completed forms indicated that this was not necessarily the case, and in fact no formal training in investigation techniques had been given.
As a consequence of the first pilot study, the expectations of the RCA tool and the method of application within an organisation have been revised. Root cause analysis will only be undertaken by trained individuals with support from the researchers; simplification for wider application was not successful. The method and training will place greater emphasis on investigative techniques such as events and causal factors charting (see Johnson, 1980) and the acquisition of 'evidence' to support the decisions made to accept or reject a root cause as a contributory factor to an accident.
A different approach has been developed to the categorisation of the root causes. The latent failures are defined at the outset using the HSE's model of a safety management system: see Table 3. It is the responsibility of the investigators to establish the links with the active failures through their investigation. The uniqueness of each incident makes this a preferable approach. Once the investigators have identified the active failures through the events and causal factors technique, they must then examine the risk control system concerned to identify where the latent failures occurred.
FURTHER WORK
At the time of submission of this paper, the revisions to the root causes analysis methodology are about to be piloted in another organisation. When the data collection methods have been finalised, a series of case studies is proposed in different industries. These will measure the performance of the safety management system using an auditing technique and assess the costs associated with safety management failures. An action plan for improvements will be identified and the costs associated with these improvements recorded. Once sufficient time has elapsed the measures will be repeated to see what impact has been made in terms of audit scores and the costs associated with failures. Such case studies will also enable exploration of the relationship between effective health and safety management and accident costs in a number of industry sectors.
TABLE 3: DEFINITION OF LATENT FAILURES USING THE HSE SAFETY MANAGEMENT MODEL
HSE Health and Safety Management Element | Lack Of | Less Than Adequate (LTA)
POLICY | No policy defined | Policy definition is LTA
ORGANISING | No system specified; system is not implemented | Specification of system is LTA; implementation of system is LTA
PLANNING | No plan developed; no standard developed | Plan is LTA; standard is LTA
IMPLEMENTING | Plan is not implemented; standard is not implemented | Implementation of plan is LTA; implementation of standard is LTA
MEASURING | No procedure specified; procedure is not implemented | Specification of procedure is LTA; implementation of procedure is LTA
REVIEWING | No procedure specified; procedure is not implemented | Specification of procedure is LTA; implementation of procedure is LTA
REFERENCES
Bird, F.E. Jr and Germain, G.L. (1985). Practical Loss Control Leadership, ILCI, Loganville, Georgia.
Health and Safety Executive (1991). Successful Health & Safety Management, HS(G)65, HMSO, UK.
Health and Safety Executive (1993). The Costs of Accidents at Work, HS(G)96, HMSO, UK.
Johnson, W.G. (1980). MORT Safety Assurance Systems, Marcel Dekker, New York.
Weaver, D.A. (1987). Technic of Operations Review (TOR), in Modern Accident Investigation and Analysis, ed. T.S. Ferry (1988), New York: Wiley.
A10: Risk Management Decision Support Systems
ADAM: AN ACCIDENT DIAGNOSTICS, ANALYSIS AND MANAGEMENT SYSTEM
H. Esmaili 1, S. Orandi 1, R. Vijaykumar 1, M. Khatib-Rahbar 1, O. Zuchuat 2 and U. Schmocker 2
1 Energy Research, Inc., P.O. Box 2034, Rockville, Maryland 20847-2034, USA
2 Swiss Federal Nuclear Safety Inspectorate, CH-5232 Villigen-HSK, Switzerland
ABSTRACT
A novel approach for the on-line monitoring of potential accidents in nuclear power plants is under development. The present Accident Diagnostics, Analysis and Management (ADAM) system is intended to enable efficient utilization of real-time plant data for improved implementation of off-site emergency actions, improved training of the emergency response team, and to assist in regulatory decisions regarding implementation of potential accident management strategies. In addition, in the off-line mode, ADAM can also be used for application to severe accident and Probabilistic Safety Analyses (PSAs). ADAM consists of thermodynamics, heat transfer, combustion, fluid flow, fission product release, fission product transport, and other ancillary modules. The various modules are integrated into a code package that executes within a WINDOWS 95™ operating environment, using extensive graphics-based controls and parametric modeling features. It is designed to run several orders of magnitude faster than real time, with extensive capability for parametric and configurational sensitivity studies. The models are being benchmarked against more detailed codes and data. The paper presents the detailed modeling framework of ADAM, including results of applications to both accident monitoring and accident management strategy evaluations for use in regulatory decision making.
KEYWORDS
ADAM, Severe Accidents, Simulation, Diagnostics, Real-time, Thermal-hydraulics, Accident Management
1. INTRODUCTION
The management of severe accidents is expected to be under the direction of the control room operators, working together with the plant technical support and the accident response team. However, important utility actions will require the approval of the inspectorate, which can only be provided if appropriate technical information is available regarding the actual plant condition, the observed symptoms, and the potential impact of implementing selected accident management actions. The WINDOWS 95™-based ADAM system is developed at Energy Research, Inc. (ERI) to provide the analyst with simple computational tools to analyze an accident based on the available plant data and
simulation. The current version is developed for a BWR-6/Mark-III plant in Switzerland, and is designed to run several orders of magnitude faster than real time. ADAM operates in two modes:
(1) On-line accident diagnostics/monitoring mode - In this mode, selective plant parameters (as measured by plant sensors), arriving into ADAM every 2 minutes, are used to assess the margins to core damage, containment failure, vent actuation, and hydrogen combustion (through appropriate alarms). In addition, the states of the reactor, containment and auxiliary building are diagnosed, including rule-based assessment of the symptoms to arrive at the most likely scenario.
(2) Accident management and simulation mode - In this mode, ADAM can be used to simulate various scenarios to determine the potential impact of the available severe accident management strategies (or Emergency Operating Procedures [EOPs]) on the accident evolution. Alternatively, ADAM can also be used as an accident analysis tool to simulate a variety of operational and severe accident scenarios. ADAM capabilities include evaluation of PSA success criteria, operator actions, emergency operating procedures, and severe accident mitigation strategies.
Other simulation tools which have been designed to assist in the analysis of accidents in nuclear power plants include CAMS, developed by OECD-Halden, and MARS, developed by FAI (Dawson et al., 1996). The design of CAMS is based on a number of basic modules that perform certain tasks, including signal validation, the tracking simulator and state identification, the predictive simulator, and the strategy simulator. There is no signal validation counterpart in ADAM, as this was not the objective of ADAM, and also because the number of signals in ADAM is limited. The purpose of the CAMS tracking simulator is to calculate the quantities that are not measured, but can be calculated from other measured quantities. The analysis section in ADAM also provides the user with calculated process variables, which include thermal-hydraulic parameters such as the drywell pressure, the air and steam concentrations inside the containment, and various safety margins. The ADAM system is designed to meet the objectives of the analysts at the accident response center, who only have limited on-line information about the status of the plant. The implementation of complicated models is therefore avoided in ADAM. In addition, such models require access to a larger number of process measurements (perhaps at shorter time intervals) that may be available to the operators, but not to the analysts in the accident response center.
MARS performs two basic functions, i.e., tracking and prediction, using the thermal-hydraulic modeling in the MAAP computer code. MARS is designed to run faster than real time, and runs on a CONVEX computer system, which is a UNIX-based operating system. ADAM is designed to run several orders of magnitude faster than real time on a Personal Computer (PC) platform.
2. ADAM ACCIDENT MONITORING AND DIAGNOSTICS MODE
The plant is nodalized into five volumes for diagnostic purposes, which include the reactor pressure vessel, the drywell, the weir wall region, the containment, and the auxiliary building. A total of 23 measured parameters (primary variables), as received from the plant, are available in ADAM to analyze the conditions inside the plant. A limited number of these primary variables (reactor water level and pressure; containment pressure, temperature, hydrogen concentration; and suppression pool water level and temperature) are used to calculate the secondary variables (e.g., air concentration inside the drywell). A single primary variable may involve more than one measurement, depending on the range or the location of the instrument; however, all measured quantities are displayed in ADAM. The various features of ADAM in the monitoring/diagnostics mode are discussed in the following.
In ADAM, an accident is detected if both the water level and the pressure in the reactor fall outside of the normal operating range. The normal operating range for each sensor is derived based on the information in the plant safety analysis report and limited plant measurements. A simple diagnostics logic is implemented in ADAM to determine the most likely cause. The accident type is generally categorized into a drywell Loss of
Coolant Accident (LOCA) or a transient, depending on the pressure change inside the containment and the reactor pressure. Since only seven primary variables are used to calculate the secondary variables, additional information, logic, and assumptions must be used. The assumptions in the calculations of the secondary variables depend on the event type, i.e., LOCA or transient. The conservation of mass and energy, along with the assumption of the ideal gas law, are used to derive secondary variables. In addition, adjustments are also required to detect a hydrogen burn inside the containment, and the suppression pool main vent activation for flow from the containment to the drywell. There are no vacuum breakers between the drywell and the containment, and an overpressure in the containment due to a hydrogen burn can result in water flow from the containment into the drywell and result in the flooding of the drywell region. These phenomena are all accounted for in the ADAM diagnostic logic.
In ADAM, a margin is defined as the time required until a certain condition is satisfied. There are currently five calculated margins: (1) core uncovery, (2) containment venting, (3) containment failure, (4) suppression pool saturation, and (5) hydrogen combustion. There are provisions for a number of "alarms" within the ADAM logic to inform the analyst of certain conditions in the plant. It should be recognized that the scope of this "alarm" implementation is only based on the measured variables received from the plant. There are currently 16 alarms modeled in ADAM. In addition to the alarms display, information is also provided in ADAM to help the analyst in examining the status of the reactor and the containment. The status display provides a snapshot of the reactor and containment states.
Figure 1 shows the graphical display of ADAM in the diagnostic mode. Upon activation, ADAM automatically opens a "System Overview and Alarm Status" display window, which is the main controller. The information in the "System Alarms" and the "Reactor and Containment State" is continuously updated and written in the overview window. A complete list of the alarms and their activation states can also be displayed, as shown in the upper part of the figure. Figures 2 and 3 show the reactor pressure vessel and containment display system in ADAM. Each display is equipped with gauges that record the measured variables. The gauge colors (green, orange, or red) notify the analyst whether the measured variable is within certain limits imposed by the user based on the plant operating conditions. A number of gauges have two tabs, which switch the display between the normal scale and the extended scale, or different locations for the same basic measurement (e.g., reactor water level or containment pressure). ADAM can also show the water level in the suppression pool or inside the reactor using animation (dark areas in the figures). More information on each measured variable, such as instrument location, can also be displayed. A display window showing the conditions inside the secondary containment is also available in ADAM (not shown).
A time history of each variable can be displayed, as shown in Figure 4. Two graphs representing the containment pressure and hydrogen concentration are shown. At the bottom of each graph display, the user can choose the "View Size" or "View Location" for that variable. Using the view size control, the graph can show the entire time history of the variable (this is shown in Figure 4).
The view location option shows selected intervals in the time history of the variable. The analysis display shows the measured (primary) variables as well as the calculated (secondary) variables. A time history of all these secondary variables is also available and can be graphically displayed. The safety margins discussed earlier are embedded in the analysis display.
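To illustrate the margin concept defined in this section (the time remaining until a condition is satisfied), the sketch below extrapolates recent level readings to a limit. It is only a schematic with invented readings; ADAM's actual margin calculations are not reproduced here.

```python
# Schematic margin-to-limit estimate (invented readings; not ADAM's logic).
def margin_to_limit(times_s, levels_m, limit_m):
    # Trend from the two most recent samples (plant data arrives every 2 minutes).
    dt = times_s[-1] - times_s[-2]
    rate = (levels_m[-1] - levels_m[-2]) / dt
    if rate >= 0.0:
        return float("inf")  # level is not falling, so no finite margin
    return (limit_m - levels_m[-1]) / rate

times = [0.0, 120.0, 240.0]          # s
levels = [12.0, 11.4, 10.8]          # reactor water level, m
print("Estimated margin to core uncovery: %.0f s"
      % margin_to_limit(times, levels, limit_m=8.0))
```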
3. ACCIDENT MANAGEMENT AND SIMULATION MODE
A number of models have been implemented in ADAM to simulate accident progression in a nuclear power plant. They include:
• Non-equilibrium, separated flow thermal-hydraulics (including critical and non-critical flows)
• Heat transfer to structures
• Simple, parametric fuel heat-up, meltdown, relocation, and debris quenching
• Simple, parametric fission product release and transport (for both in-vessel and ex-vessel)
• Hydrogen and CO generation, transport and combustion
• Core-concrete interactions
• Emergency Core Cooling System (ECCS) and decay heat removal systems
ADAM includes provisions for operator actions in order to examine accident management strategies and their consequences. The simulation code can also be used to generate data for the diagnostic mode to assist in the visual display of the accident. The plant initial conditions and information about the type of the accident are user-specified.
Figures 5 and 6 show the comparison of ADAM and MELCOR (Summers et al., 1995) for two transients without the operation of the ECCS (with and without ADS activation). The ADAM simulations were performed on a personal computer with a Pentium™ processor, running about 100 times faster than real time with a timestep of 0.5 seconds. For design basis accidents that do not involve core melting or combustion events, the timestep can be increased, resulting in faster code execution. The results of both accident scenarios show good comparison with the MELCOR calculations. Some differences in the containment pressurization following hydrogen combustion are observed in Figure 5, mainly due to modeling differences and the nodalization employed in MELCOR. The model parameters in ADAM were chosen to mimic the MELCOR modeling framework to the extent possible. Although there are differences between the code-calculated results of ADAM and MELCOR, the general trends are preserved. Comparison of the ADAM predictions with experimental data is underway.
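The fast, fixed-timestep style of simulation described above can be caricatured with a single lumped energy balance stepped at the quoted 0.5 s interval. The decay-heat shape (a Way-Wigner-type power law) and all plant parameters below are assumptions for illustration; this is not ADAM's thermal-hydraulic model.

```python
# Schematic fixed-timestep integration (assumed parameters; not ADAM's models):
# suppression pool heat-up under a decay-heat source.
DT = 0.5                 # timestep, s (the value quoted for ADAM)
POOL_MASS = 3.0e6        # kg of pool water (assumed)
CP_WATER = 4186.0        # J/(kg K)
RATED_POWER = 3.0e9      # W thermal (assumed)

def decay_heat_w(t_s):
    # Way-Wigner-type approximation, for illustration only.
    return RATED_POWER * 0.066 * max(t_s, 1.0) ** -0.2

temp_k, t = 305.0, 0.0
while t < 5000.0:
    temp_k += decay_heat_w(t) * DT / (POOL_MASS * CP_WATER)
    t += DT
print(f"Pool temperature after {t:.0f} s: {temp_k:.1f} K")
```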
4. CONCLUDING REMARKS
The ADAM system is intended to enable efficient utilization of real-time data from a nuclear power plant for improved implementation of off-site emergency actions, to assist in regulatory decisions regarding accident management strategies, and to be used for severe accident training. In addition, ADAM can be used for application to severe accidents and PSAs, and is designed to run several orders of magnitude faster than real time, with extensive capability for sensitivity studies. The various modules are integrated into a code package that executes within a WINDOWS 95™ operating environment, using extensive graphics-based controls. ADAM has been used to simulate various design basis and severe accident sequences (including LOCAs and transients), and the results of ADAM calculations show good comparison with more sophisticated codes such as MELCOR. The development of ADAM for other types of nuclear power plant designs is planned.
5. REFERENCES
Dawson, S.M., Raines, J.C., Hammersley, R.J., Munuera, A., Alonso, J.R., and Aleza, S. (1996). Faster than real time accident management predictions at CSN using MARS. International Topical Meeting on Probabilistic Safety Assessment, I, 497-504.
Summers, R.M., Cole, R.K., Smith, R.C., Stuart, D.S., Thompson, S.L., Hodge, S.A., Hyman, C.R., and Sanders, R.L. (1995). MELCOR Computer Code Manuals, NUREG/CR-6119, SAND93-2185.
ACKNOWLEDGMENTS This work is sponsored by the Swiss Federal Nuclear Safety Inspectorate.
Figure 1: Organization of the system overview display in ADAM for a BWR/6-Mark-III (screenshot: system alarms, RPV water level and pressure, and reactor and containment state).
Figure 2: Organization of the reactor pressure vessel and drywell display in ADAM for a BWR/6-Mark-III (screenshot: water level in the RPV and associated gauge displays).
Figure 3: Organization of the containment display in ADAM for a BWR/6-Mark-III (screenshot: suppression pool water level and temperature, containment H2 concentration and dose rate gauges).
[Figure 4 screenshot: time history plots of containment pressure (bar, normal scale) and containment H2 concentration (% H2)]

Figure 4: Organization of the time history plot displays in ADAM for a BWR/6-Mark-III
[Figure 5 plots: reactor water level, core axial level 2 fuel temperature, and containment pressure versus time (0-9000 s), ADAM compared with MELCOR]

Figure 5: Comparison of ADAM and MELCOR for a Transient Without ADS Activation
[Figure 6 plots: reactor pressure, reactor water level, and suppression pool temperature versus time (0-5000 s), ADAM compared with MELCOR]

Figure 6: Comparison of ADAM and MELCOR for a Transient With ADS Activation
RISMAN, A METHOD FOR RISK MANAGEMENT OF LARGE INFRASTRUCTURE PROJECTS

W.G. de Rijke(1), M.R. van der Does de Bye(2), R. Buvelot(3) and J.K. Vrijling(4)

(1) Ministry of Transport, Public Works and Water Management, Civil Engineering Division, P.O. Box 20.000, 3502 LA Utrecht, The Netherlands
(2) Twijnstra Gudde Management Consultants BV, P.O. Box 907, 3800 AX Amersfoort, The Netherlands
(3) Dutch Railway Company NS Railinfrabeheer BV, P.O. Box 2025, 3500 HA Utrecht, The Netherlands
(4) Delft University of Technology, P.O. Box 5048, 2600 GA Delft, The Netherlands
ABSTRACT
In the Netherlands a large number of infrastructure projects, varying from relatively simple to very complex, is planned and executed each year. Due to the increased complexity of these projects it is becoming more and more difficult to ensure that the project targets are reached within the scheduled time and costs. In co-operation with Management Consultant Twijnstra Gudde BV and the Delft University of Technology, the Dutch Ministry of Transport and NS Railinfrabeheer BV have developed a stepwise method (the so-called RISMAN method) to analyse, quantify and manage project related risks. In this paper the method is explained in detail, and its application is illustrated by two case studies.
KEYWORDS
Risk management, Risk analysis, Risk assessment, Project management, Infrastructure project, Project risks
INTRODUCTION
Infrastructure projects can have a strong impact on the environment. In the densely populated Netherlands, legal procedures have been implemented to regulate the interaction between the planned infrastructure and local interests and to ensure that the negative effects on people and the environment are acceptably low. The time consuming character of the procedures, the technical complexity and the political sensitivity of the projects make it hard for the project manager to be certain about the project costs and the project duration. Good project management asks for control of time, money, quality, information and organisation
but also for control of project related risks. To support the project manager in his need to improve project control, the Dutch Ministry of Transport and NS Railinfrabeheer BV, in co-operation with Management Consultant Twijnstra Gudde BV and the Delft University of Technology, have developed a stepwise method to analyse and quantify the technical and non-technical project risks and to support decision making on risk reducing measures. The developed RISMAN method is now operational and is applied to large infrastructure projects in the Netherlands, such as the Betuweroute Rail Freight Freeway, the High Speed Train Link and the Amsterdam North-South Underground.
PRACTICAL APPLICATION

The RISMAN method has been developed to provide a tool for the project manager which can be applied during the various phases of large infrastructure projects. During the initial phase the identification of the uncertainties is emphasized, while later on the accent shifts towards the quantification of the uncertainties and the calculation of the project risk. The project risks can only be determined in co-operation with management and specialists closely involved in the project. The structured discussions and the integrated approach to the project risks have a very positive impact on the risk awareness of the project management and staff. Since the risk assessment team audits the project on the basis of its own expertise, new, previously unconsidered risks may be identified and the known risks may be reconsidered. In these cases the risk analysis leads to a redefinition of the critical project success factors.
THE RISMAN METHOD

The method contains six steps (see Figure 1):
1. determination of purpose;
2. identification of uncertainties;
3. quantification of uncertainties;
4. calculation of project risk;
5. identification and quantification of measures;
6. calculation of the effectivity of measures.

Figure 1: Overview of the RISMAN method
Determination of purpose
The project manager has to determine the goal of risk management. The RISMAN method will usually be applied to manage the aspects of time and/or money, but other aspects, such as quality or safety, are also possible. For the chosen aspect a risk criterion is defined, expressed in terms of a maximum allowable level and a probability of exceeding that level. A second decision concerns the scope of the risk analysis. In some cases the analysis has to be done for the integral project (including all project items and all project phases); in other cases only a specific part of the project may be the subject of the analysis. It is essential that the level of the risk analysis is in balance with the level of detail of the available project data. The RISMAN method offers the flexibility to perform a global analysis or a detailed quantitative risk assessment (see Table 1).
TABLE 1
POSSIBLE WAYS TO APPLY THE RISMAN METHOD

Step                                        A    B    C    D
Identification of uncertainties             x    x    x    x
Quantification of uncertainties                  x    x    x
Calculation of project risk                      x    x    x
Identification of measures                  x         x    x
Quantification of measures                            x    x
Calculation of effectivity of measures                     x
Identification of uncertainties
In this step the various uncertainties are defined. Three types of uncertainty have been identified:
- normal uncertainties: the deviations due to the stochastic character of the parameters involved;
- special events: unplanned events with a low probability of occurrence but a high impact on project costs and/or project duration;
- plan uncertainties: this type of uncertainty mainly occurs during the initial phases of a project, when a range of alternatives is under consideration. Decisions on alternatives are usually linked to project milestones.
The special events are identified by means of an event matrix. One axis presents the cost or time breakdown of the project; the other axis comprises a number of risk clusters. The proper angle of incidence when looking at a project is of vital importance in order to obtain a full identification of the potential risks related to the project. We have learned that the following 'seven spectacles' cover most of the risks involved with large infrastructural projects (a sketch of such an event matrix is given after this list):
- technical: normally a lot of attention is already paid to uncertainties of this kind;
- organisational: not only may inefficiency within the project organisation affect the final result, the formal relation with the owner of the project might also have its impact on the results;
- political: this aspect is especially important for projects with a high political sensitivity (examples: Channel Tunnel, high-speed train links). Delays and/or cost overruns due to a long decision cycle or extra requirements are not unknown;
- geographical: large projects normally interfere with other objects and plans in their direct surroundings, and problems might occur on the various interfaces;
- financial: both funding problems (for example the need for private funding) and interaction with the exploitation of the object might cause risks;
- social: acceptance by society of extra disturbances and the increasing involvement of citizens in the decision process are examples of factors affecting the project result;
- legal: evolving rules and regulations affect the progress of the project. In some cases a lack of jurisprudence might cause indecisiveness.
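To make the event-matrix idea concrete, a minimal sketch follows (Python). The cost items, clusters and events shown are illustrative assumptions, not data from the paper:

```python
# Hypothetical event matrix: one axis is the cost breakdown of the project,
# the other axis is the seven RISMAN risk clusters ("spectacles").
clusters = ["technical", "organisational", "political", "geographical",
            "financial", "social", "legal"]
cost_items = ["barrier structure", "foundation", "closure system"]

# Each cell collects the special events identified for that combination.
event_matrix = {item: {c: [] for c in clusters} for item in cost_items}
event_matrix["foundation"]["technical"].append("additional foundation due to error")
event_matrix["closure system"]["legal"].append("permit delayed by appeal procedure")

# List all identified special events, cell by cell.
for item, row in event_matrix.items():
    for cluster, events in row.items():
        for event in events:
            print(f"{item} / {cluster}: {event}")
```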
Data are gathered by means of databases, expert opinions and specific analyses by the RISMAN team. Based on mutual experience a first selection can be made, thus allowing the team to focus on the main uncertainties. The (sometimes complex) relations between the various events can be analysed by means of a cause-consequence chart or fault tree. A decision tree analysis can be useful when considering the plan uncertainty of the project.
Quantification of uncertainties

In this step a value is assigned to each uncertainty. When the exact probability function for a normal uncertainty is lacking, use is made of a triangular distribution function for which the lowest, most likely and highest possible values are estimated. The special events are characterised by a probability of occurrence and the cost and/or time effect in case of occurrence. For normal uncertainties and special events statistical information may be available in some cases; however, the use of experts is also often required. When dealing with expert opinions one should be aware that the result can be influenced by statistical scatter and risk perception aspects. Methodologically it is important that each expert can give an individual opinion; however, consensus among the experts about the identified risks can save a lot of time in the communication process. We have learned that a combined individual and groupwise approach to the experts gives satisfactory results.
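As an illustration, the following sketch shows one possible way to turn individual expert opinions into the parameters of a triangular distribution. The aggregation rule and all numbers are our assumptions, not part of the RISMAN method itself:

```python
# Hypothetical expert estimates (million ECU) for one cost item:
# each expert gives (lowest, most likely, highest) values.
expert_estimates = [
    (10.0, 13.0, 18.0),
    (11.0, 14.0, 20.0),
    (9.5, 13.5, 19.0),
]

# One simple aggregation rule (an assumption, not prescribed by RISMAN):
# take the widest plausible range and average the most likely values.
low  = min(e[0] for e in expert_estimates)
mode = sum(e[1] for e in expert_estimates) / len(expert_estimates)
high = max(e[2] for e in expert_estimates)
print(f"triangular parameters: low={low}, mode={mode:.1f}, high={high}")
```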
Calculation of project risk

In this step the total project risk is calculated. Based on the values quantified in the previous step the following results can be obtained:
- the statistical distribution function of the total project costs or project duration, from which the mean value and standard deviation are derived;
- the probability of exceedance of the initially planned total project costs or project duration;
- the contribution of each uncertainty to the total standard deviation of the project costs or project duration, which specifically gives insight into the most important uncertainties.
The calculations are done by means of Monte Carlo simulation (a minimal sketch is given below). Special attention must be paid to possibly economically or technically dependent uncertainties, which can significantly influence the calculated results. It is recommended to investigate the effect of possible dependence by means of sensitivity analysis. When evaluating the results one should be aware that they are based on the quantified uncertainties used as input to the simulation. During the risk analysis process, risks are also identified which cannot be quantified. These qualitative risks can have a major effect on the project results and should also be communicated to the project manager.
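A minimal sketch of such a Monte Carlo calculation is given below (Python). The cost items, special events and budget are hypothetical, and independence between all uncertainties is assumed, which, as noted above, must be checked in practice:

```python
import random

# Hypothetical normal uncertainties (million ECU): (lowest, most likely, highest)
cost_items = {
    "concrete parts":    (10.0, 14.0, 20.0),
    "civil engineering": (8.0, 10.0, 13.0),
    "bottom protection": (4.0, 5.0, 7.0),
}
# Hypothetical special events: (probability of occurrence, cost effect)
special_events = {
    "additional sluice needed":         (0.05, 11.0),
    "change of planned closure system": (0.70, 0.5),
}

INITIAL_BUDGET = 30.0   # million ECU, hypothetical initial estimate
N = 10_000              # number of Monte Carlo simulations

totals = []
for _ in range(N):
    # Sample each normal uncertainty from its triangular distribution.
    total = sum(random.triangular(low, high, mode)
                for low, mode, high in cost_items.values())
    # Add the effect of each special event that happens to occur.
    total += sum(effect for p, effect in special_events.values()
                 if random.random() < p)
    totals.append(total)

mean = sum(totals) / N
std = (sum((t - mean) ** 2 for t in totals) / (N - 1)) ** 0.5
p_exceed = sum(t > INITIAL_BUDGET for t in totals) / N
print(f"mean = {mean:.1f} MECU, std dev = {std:.1f} MECU, "
      f"P(exceed budget) = {p_exceed:.0%}")
```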
Identification and quantification of measures

The project manager has to decide whether the results of the risk analysis are acceptable when compared to the initially formulated risk criteria. The willingness to accept a limited exceedance of the planned project cost or planning can be formulated as

B = μ + k·σ

where μ and σ are the calculated mean value and standard deviation respectively, k is a policy factor and B is the maximum available budget or project duration. (For instance, with k = 1 the project manager accepts a probability of about 15% that budget B will be exceeded.) This criterion can be combined with other risk criteria, for instance a project phase dependent maximum level of σ/μ.
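The 15% figure follows from a normal approximation of the cost distribution, since P(X > μ + σ) = 1 − Φ(1) ≈ 0.159. A small numerical sketch with hypothetical μ and σ:

```python
from statistics import NormalDist

mu, sigma = 31.2, 3.4   # hypothetical mean and std dev of project cost (MECU)
k = 1.0                 # policy factor chosen by the project manager

B = mu + k * sigma
# Under a normal approximation of the total-cost distribution,
# the probability of exceeding B is 1 - Phi(k) (about 15.9% for k = 1).
p_exceed = 1.0 - NormalDist().cdf(k)
print(f"budget B = {B:.1f} MECU, P(cost > B) = {p_exceed:.1%}")
```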
When the project risk criteria are exceeded, risk control measures should be taken. It is cost efficient to focus the measures on the most important uncertainties. During this step both the identification and the quantification of measures are investigated. In principle two kinds of measures can be considered:
- carrying the risk oneself after extra investments to make the risk acceptable; examples are additional soil investigation and safety training of personnel. Such measures are preventive or responsive;
- transferring the risk in exchange for money; examples are insurance and backwards or forwards contracting of risks.
For each measure both the costs and the effect on the project risk have to be estimated. In doing so, one should be aware of the interaction between the various uncertainties, for reducing one risk can easily introduce or enlarge another.
Calculation of effectivity of measures

Finally the project risk is recalculated by implementing the quantified measures in the Monte Carlo simulation. As a result an overview can be presented to the project manager in which, per measure, the effect on the mean value and the standard deviation can be seen.
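A sketch of such a before/after comparison for a single measure follows; the numbers are hypothetical, and a real study would rerun the full simulation over all uncertainties rather than one event in isolation:

```python
import random

def simulate(p_event, effect, n=10_000):
    """Monte Carlo mean and std dev of the extra cost (MECU) of one special event."""
    samples = [effect if random.random() < p_event else 0.0 for _ in range(n)]
    mean = sum(samples) / n
    std = (sum((s - mean) ** 2 for s in samples) / (n - 1)) ** 0.5
    return mean, std

# Hypothetical: additional soil investigation costing 0.2 MECU lowers the
# probability of an "additional foundation" event from 0.30 to 0.05.
before = simulate(0.30, 2.0)
after = simulate(0.05, 2.0)
measure_cost = 0.2
print(f"before: mean = {before[0]:.2f} MECU, std = {before[1]:.2f} MECU")
print(f"after : mean = {after[0] + measure_cost:.2f} MECU, std = {after[1]:.2f} MECU")
```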
CASE STUDY: PROJECT RAMSPOL
In order to protect the inland of the Netherlands against flooding from the IJsselmeer, the Dutch Government decided in 1991 to build a storm surge barrier near Ramspol. In the spring of 1995 the design alternatives were presented, and in 1996 the design concept of a flexible barrier was chosen. At this phase of the project the costs to build the barrier were estimated at 42 million ECU. The RISMAN method has been applied to gain insight into the financial risk of the project. To identify and quantify the uncertainties, interviews were held with the project manager, a number of specialists and the local government. First the uncertainties in the planned cost items were estimated by means of a triangular distribution function. Next, considering the seven spectacles of RISMAN, the special events were identified and quantified, see Table 2.
TABLE 2
SPECIAL EVENTS PROJECT RAMSPOL, EFFECT IN MILLION ECU

Special event                                                  Probability             Effect
Additional sluice needed                                       unlikely (< 0.05)       9.1 - 13.6
Major change of design concept due to technical complexity     possible (0.05 - 0.5)   1.8 - 3.1
Protective measures against ice damage needed                  possible (0.05 - 0.5)   0.3
Additional power supply station needed                         unlikely (< 0.05)       0.3
Change of planned closure system                               likely (0.5 - 0.95)     0.5
Redesign of dykes due to change of maximum water levels        possible (0.05 - 0.5)   0.1
More dredging and deposit of polluted soil than planned        possible (0.05 - 0.5)   1.0
Additional foundation due to error                             possible (0.05 - 0.5)   1.4 - 2.3
Additional functionality of barrier                            likely (0.5 - 0.95)     0.9
Measures to compensate for damage to environment               very likely (0.95 - 1)  0.3
Furthermore, a number of risks were identified concerning the project organisation and the possible effect of legal procedures. These risks could not be quantified but were nevertheless communicated as important issues to the project manager. Based on the quantified normal uncertainties and special events, the total financial risk of the project was calculated by means of Monte Carlo simulation (number of simulations = 10,000). The results are listed in Table 3, in which the lower row represents the calculated total project costs.

TABLE 3
PROBABILISTIC RESULTS OF THE MONTE CARLO SIMULATION: TOTAL PROJECT COSTS IN MILLION ECU
[table values: the calculated total project costs at the 50%, 85%, 15% and 2% exceedance levels]
In Table 4 an overview is given of the five major uncertainties in the cost estimate and their calculated contributions to the total standard deviation of the project costs.
TABLE 4
UNCERTAINTY TOP 5

Uncertainty                                 Contribution to standard deviation
Costs of concrete parts of the barrier      26%
Major change of design concept              16%
Additional sluice needed                    14%
Costs of civil engineering                   8%
Costs of bottom protection works             8%
The results of the calculation show that, when the identified uncertainties are taken into account, the initial project budget will certainly be insufficient. The project manager was advised to take the necessary steps to enlarge the budget, to take measures to control the main uncertainties and to update the risk analysis on a regular basis.
CASE STUDY: PROJECT BETUWEROUTE
In the coming years the transportation of bulk freight will increase significantly, not only in the Netherlands but throughout Europe. In order to handle this increase the Dutch Government has planned to build the Betuweroute, a Trans European Rail Freight Freeway (TERFF), over which 60 million tonnes of freight per annum will be transported by 2010. The 160 kilometre long Betuweroute starts at the Maasvlakte Terminal at Europort and ends at Zevenaar, close to the German border. The total project costs will amount to approximately 4 billion ECU. The first preparations started in 1990, the construction phase of the main contracts will start in the beginning of 1999, and the project will be completed in 2004. NS Railinfrabeheer BV has the task of completing the Betuweroute project in the most efficient way
possible. In order to support the management it was decided to implement methods for structured risk assessment and risk control based on the RISMAN method. The scale of the project imposes limitations on the scope of the risk assessment. Considering the size and the complexity of the project and the level of experience with the RISMAN method, three risk analyses have been carried out: two based on sub-projects (demarcated parts of the Betuweroute) of different size (Botlek Rail Tunnel and Harbour Rail Line) and one based on the complete project. The following main findings have been experienced with regard to the usefulness of structured risk assessment and risk management:
1. The RISMAN method delivered potentially useful data for project management and increased risk awareness within the Managementgroup Betuweroute.
2. The risk analyses generated hundreds of possible risks, of which many were similar in two or more analyses.
3. Several identified risks have organisational backgrounds and could not be assessed on probability and effect. They were labelled as events that must be prevented in any case.
4. Many risks would never be part of the main project risks but should nevertheless be dealt with.
5. The analyses generated all kinds of risks, many of them being part of supporting processes. For many risks it was not easy to immediately appoint a responsible manager within the Managementgroup, since the subject was related to several persons.
It became clear that the enormous size and complexity of the Betuweroute project require an additional model to enable management of the risk management process. The project management should be able to assign identified risks to responsible actors who control a specific risk, i.e. take and implement measures and check the actual effects. This requires risk management to be coherent with the project organisation. Use is made of the 'value chain' approach of Porter (1986), as presented in Figure 2, in which primary and supporting processes are defined which can be linked with the project organisation.
[Figure 2 diagram: primary processes P1 Scope/Definition, P2 Preliminary design, P3 Approvals (P3.a Archeology, P3.b Piping and wiring relocation, P3.c Town planning approvals, P3.d Permits, P3.e Grounds acquisition, P3.g Site clearance and redevelopment), P4 Tender & Contracting, P5 Management of construction contracts, P6 Hand-over; supporting processes S1 Planning and cost control, S2 Procurement, S3 Quality assurance, S4 Document control, S5 Information systems, S6 Public relations]

Figure 2: Risk management model

This risk management model is valid for the complete project as well as for sub-projects. Another benefit of this model is that risk assessment and management can be done for each process separately, primary or supporting. This enables tailor-made implementation for each process without losing the integral risk approach for the project. A database can be used holding similar risks in different sub-projects and
processes. So when different sub-projects are analysed, use can be made of the same data. In this way a high degree of efficiency can be reached. The risks per sub-project, as well as the importance of a risk for the complete project, can then easily be assessed and managed.
CONCLUSIONS

Due to the technical, procedural and political complexity of infrastructural projects it is becoming more and more difficult to make a reliable estimate of the costs and the duration based only on the judgement and experience of the project manager. A risk analysis can be an excellent supporting tool for the project manager to gain insight into the main uncertainties. Thus he can make a better judgement of the reliability of the estimate and take steps to control the risks. The developed RISMAN method is not a new concept, but it helps the project manager in a practical way to structure the uncertainties and to make a rational choice between risk reducing strategies. The value of the RISMAN method has been proven on a number of major projects in the Netherlands.
REFERENCES

Porter, M. E. (1986). Competition in Global Industries. Harvard Business School Press, USA.
Vrijling, J. K. and Redeker, F. R. (1993). The risks involved in major infrastructural projects. Proc. Options for Tunneling, Elsevier Science Publ., pp. 37-39.
De Leeuw van Weenen, R. P., De Rijke, W. G. et al. (1995). Project RISMAN, Case Study Ramspol: Analysis of the Financial Risks (in Dutch). Bouwdienst, Utrecht.
Blazer, J., Stam, D. et al. (1996). The RISMAN Method: A Tool for Risk Management of Large Infrastructure Projects (in Dutch). ISBN 9074411110X.
COMPUTER SUPPORTED EVENT ANALYSIS IN INDUSTRY WITH HIGH HAZARD POTENTIAL

R. Baggen, B. Wilpert, B. Fahlbruch and R. Miller

FSS - Research Center System Safety, Berlin University of Technology, Dovestraße 1-5, 10587 Berlin, Germany
ABSTRACT

In the current paper, organizational learning is described as a key concept for arriving at a systematic recognition of human and organizational factors in the genesis of system safety related incidents. The Safety through Organizational Learning (SOL) approach aims at installing an organizational learning system in the German nuclear industry. As part of this approach, the SOL event analysis procedure was designed to overcome the limitations of existing methods, namely the concentration on technical causes of events and several shortcomings of traditional checklist approaches. As a further improvement of the SOL method, features of a Computer Supported Event Analysis (CSEA) procedure are outlined. Computer support allows us to make sure that event analyses are conducted in standardized ways. It can be used to effectively transport current knowledge on human and organizational factors to the places where it is needed. Steps in the realization of the computer tool are described.
KEYWORDS

system safety, organizational learning, event analysis, human factors, organizational factors
SAFETY THROUGH ORGANIZATIONAL LEARNING: THE SOL APPROACH

It is now widely recognized that industries with high hazard potentials should not only strive for improvement of technical systems but also make sure that human and organizational factors contributing to system safety are thoroughly and continuously considered (e.g. Reason, 1990). Since any improvement in system safety can only be made on the basis of available knowledge, organizational learning systems should provide the necessary information. This requires methods to gather and evaluate knowledge on the contribution of organizational factors to system safety. The Safety through Organizational Learning (SOL) approach developed by the Research Center System Safety aims at an improvement of system safety by proposing an organizational learning system for the German nuclear industry. It is based on feedback oriented analyses of events relevant to system safety. SOL consists of a method for event analysis, an event reporting system and a database designed to keep the accumulated safety related knowledge for later retrieval. Although it was primarily developed for the nuclear industry, SOL can be applied in any other industrial high risk domain. The event analysis method in SOL was developed to overcome several shortcomings we found in existing approaches in the nuclear industry. Most of them offer only a limited number of categories for human or organizational failure. As a result, organizational factors are neglected as contributing categories in the genesis of incidents. This leads analysts to identify "hard" technical facts while overlooking organizational factors, often considered "soft" and difficult to assess. Also, we found that psychological knowledge on test construction and problem solving is often ignored in existing event analysis procedures. Finding contributing factors with SOL means more than just checking for the existence of some predefined causes from a checklist. In contrast to a more technical view of event causation, the SOL approach particularly tackles the problem solving nature of event analyses. Analysts have to be aware that their results form a collective reconstruction of the past reality, to be used as a starting point for improvements in safety. This requires a thorough understanding of event genesis from human, technical, group, organizational and environmental factors as well as some fundamentals of causal attribution in human problem solving. Also, information about the event cannot come from technical domains only but may include asking witnesses, reading internal documents and logfiles, conducting technical analyses, medical examinations and so on. Event analyses in SOL consist of three steps: (1) situational description, (2) identification of contributing factors and (3) report generation.
Situational description. As a first step in the analysis of an event, a situational description is prepared. It should depict in graphical form all available information on the event, in chronological order and separately for every actor involved. This can best be described by asking "what?", "when?" and "who?" questions about the event. The materials provided in SOL give several hints for arriving at a correct and comprehensive event description. A crucial issue is to prevent analysts from prematurely asking questions about contributing factors or trying to find causal explanations while still busy with describing the event.

Identification of contributing factors. Based on the situational description, the analysis proceeds with the identification of factors contributing to the event. This procedure is associated with asking "why?" questions for every part of the event description. Since this is a crucial step for the finding of human or organizational failures, an identification aid is provided that covers twenty areas of human and organizational factors:
- representation of information
- information transfer
- information processing
- working conditions
- personal performance
- operation scheduling
- violations
- responsibility
- control and supervision
- group influence
- rules and procedures
- qualification
- training and selection criteria
- organizational leadership and goals
- feedback of experience
- safety principles
- quality management
- maintenance
- regulatory and consulting bodies
- technical components
The factors were gathered by thorough literature reviews and expert rankings on the importance of a particular factor for event genesis under a socio-technical perspective. Every factor is expressed by a general question (e.g. "Do you have evidence for an influence of the working conditions on the operator performance?" for the factor "working conditions"), is further illustrated by specific examples (e.g. "noise", "heat", "time pressure" etc.), and is provided with links to other factors that should also be considered for the event at hand. If a factor is found relevant during an analysis, it is added as a contributing factor to the graphical representation of the situation. Afterwards, searching has to be continued by following the links; a sketch of such a linked structure is given below. It is mandatory that at least two factors are found for every part of the event description. Because SOL aims at standardization of the process of event analysis rather than its contents, analysts are encouraged to add their own factors. In sum, the design of the identification aid is meant to stimulate creative problem solving with respect to human factors and their contribution to safety related events. The provision of detailed operationalizations, the links within the identification aid, and the strict separation of the situational description from the finding of contributing factors should effectively overcome the limitations of traditional checklists.
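The identification aid can be pictured as a small linked data structure. In the sketch below (Python) the factor names come from the list above, while the second question, the examples and the links are illustrative assumptions, not the official SOL contents:

```python
# A minimal sketch of the SOL identification aid as a linked structure.
identification_aid = {
    "working conditions": {
        "question": "Do you have evidence for an influence of the working "
                    "conditions on the operator performance?",
        "examples": ["noise", "heat", "time pressure"],
        "links": ["personal performance", "operation scheduling"],
    },
    "personal performance": {
        "question": "Do you have evidence for an influence of personal "
                    "performance?",  # illustrative wording
        "examples": ["fatigue", "stress"],
        "links": ["qualification", "working conditions"],
    },
}

def linked_factors(factor):
    """Return the factors to consider next once `factor` was found relevant."""
    return identification_aid[factor]["links"]

print(linked_factors("working conditions"))
```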
Report generation. Reporting on the event is the final step in every event analysis procedure. Through reports, the knowledge gathered from an event analysis is communicated to the organization or to regulatory bodies. An event analysis procedure must make sure that all essential information is included and that the report contains descriptors to allow statistical processing and later retrieval. Report generation in SOL is fitted to the requirements of incident reporting systems in the German nuclear industry. Within the SOL approach, it is recommended to conduct event analyses in teams of 3 to 6 human factors experts. Besides being necessary to discover all relevant aspects of an incident, this also opens a stage for internal discussion of an event. Group discussions can be seen as a first feedback loop within a plant that gives an opportunity for immediate safety related improvement in organizational issues. The SOL event analysis procedures were empirically tested. First, by thorough reviews of the human factors literature, all contributing factors covered in the identification aid were checked for their relevance and independence. Second, discussions with experts from the application domain revealed that SOL offers new insights into organizational failure and possible remedies. Third, empirical tests of the procedures showed that the problems associated with traditional methods, namely checklist approaches, could indeed be resolved effectively. For a more detailed description of the SOL approach to event analysis, see Fahlbruch & Wilpert (1995), Wilpert et al. (1994) and Becker et al. (1996).
A TOOL FOR COMPUTER SUPPORTED EVENT ANALYSIS (CSEA)

Although, as confirmed by empirical testing, the current state of the event analysis procedure in SOL allows its practical use, we intend to further improve the handling and precision of the method with computer support for SOL. Computer support offers the following advantages over paper-and-pencil methods for event analysis:
- controlled separation of situational description and identification of contributing factors;
- automatic checks for consistency of the results wherever possible;
- standardized forms for entering event information;
- helpful functionality to avoid cumbersome paper work;
- easy-to-use functions for report generation;
- presentation of up-to-date human factors knowledge in the place where event analyses are conducted.
Most experts see analysing an event as a process of creative problem solving that cannot be supported by a computer. This is true if a CSEA system is thought of as software that proposes and finds causes on its own, like an expert system. Expert systems are of limited usefulness in event analysis, because they may prevent the experts from conducting the analysis with the required scrutiny and make them rely on the answers given by the system. Consequently, it is vital for any computer support of event analysis that the experts remain fully responsible for the results they achieve. The modular structure of a CSEA system based on the SOL approach to event analysis is as follows (see Figure 1):
Entering and editing of event descriptions. The first module is mainly a graphical interface that allows a description of an event with the help of event building blocks. In the event description, SOL uses event building blocks (Ferry, 1988; Fahlbruch & Wilpert, 1995) to model the chronological order of molecular events attributable to single actors. The event description module should make sure that the chronological order and other testable features of the building blocks are correct.
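A minimal sketch of an event building block and one automatic consistency check (chronological order per actor) is shown below; the field names are assumptions based on the description above, not the tool's actual data model:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BuildingBlock:
    actor: str        # "who?"
    action: str       # "what?"
    time: datetime    # "when?"

def chronology_ok(blocks):
    """Check that the blocks of each actor are entered in time order."""
    by_actor = {}
    for b in blocks:
        by_actor.setdefault(b.actor, []).append(b)
    return all(all(a.time <= b.time for a, b in zip(seq, seq[1:]))
               for seq in by_actor.values())

blocks = [BuildingBlock("operator", "opened valve", datetime(1996, 5, 31, 10, 0)),
          BuildingBlock("operator", "acknowledged alarm", datetime(1996, 5, 31, 10, 2))]
print(chronology_ok(blocks))  # True
```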
Identification of contributing factors. The module for identification of contributing factors is only to be invoked when the event description phase has terminated correctly. This module helps in the identification process by providing a hypertext form of the existing SOL identification aid.
In addition, the module is the place to implement further supporting functions and a vivid presentation of human factors knowledge using multimedia elements such as photographs, video sequences, sounds, texts or figures. In order to give a complete account of the knowledge needed for event analysis, the following domains should be covered:
- event analysis procedures in the human factors domain
- the twenty contributing factors from the SOL identification aid
- interview techniques
- group dynamics
- human error and system safety research
All information provided by the supporting functions should be adaptable to the level of expertise of the analysts. It should be linkable to an ongoing analysis process, much like an on-line help system or an illustrated textbook, but care must be taken to avoid premature conclusions about contributing factors. Nevertheless, a trade-off between the comprehensiveness of the documentation and the resources of a limited research project has to be anticipated.
[Figure 1 diagram: event information (interviews, documents, technical analyses, other information systems) feeds the event description module; the identification of contributing factors module draws on the SOL identification aid and the human factors knowledge documentation; the report generation module feeds the internal reporting system and, according to national regulations, the national reporting system]

Figure 1: Module structure of the CSEA software

Together, these features should help to further promote the consideration of human factors knowledge in cases where both a technical and a human factors explanation of an event is possible. This is especially important since there is an ongoing debate within the German nuclear industry about the fruitfulness of the concept of an organizational safety culture and of event analyses with an organizational emphasis.
In addition, the CSEA sottware has to be designed to support group work of human factor experts. This means to make sure that all participants of an analysis get an overview of the state of their cooperative work at every time. According to the SOL concept it is also mandatory that every participant is able to make active contributions to the analysis results. This requires the use of appropriate technical facilities for PC support of small group interaction, e.g. overhead projection or individual mouse pointers.
Besides supporting the immediate analysis work, technical and formal requirements for feeding the results achieved with the CSEA software back to local and national reporting systems will be considered. This means analysing the present technical support and fitting the CSEA system into existing chains for report transmission and storage. It also requires scrutinizing the challenges and problems that networking systems (e.g. the internet) pose for event analysis procedures and reporting systems. For example, one could think of event analyses conducted via the net with experts from remote companies or consulting agencies. Another option is to support information gathering with anonymous reporting systems for near misses.
REALIZATION OF THE CSEA SOFTWARE

Before implementing the CSEA software, it is important to exploit all relevant sources of information to make sure that the projected system meets the requirements in practice.

CSEA systems already in use. To begin with, existing CSEA approaches have to be analysed. The most impressive example of a CSEA system comes from Japan, where the utility companies developed the JAESS tool to support event analysis according to the J-HPES method (Takano et al., 1994). Although this tool has the same aim as the projected system, it differs considerably in the underlying event analysis procedures. Event analysis with JAESS and the J-HPES approach is far more complicated than with the SOL method, since it consists of fourteen steps and fully prescribes the whole analysis procedure. J-HPES makes extensive use of checklists and thus cannot avoid their methodological problems. JAESS does not explicitly support team work of human factors experts as SOL does. In the Netherlands, Prof. Hale and his coworkers programmed the ISA software to support non-experts in event analysis. ISA is used to study and document events neglected by human factors departments due to their minor significance for system safety (Koorneef & Hale, 1995). In contrast to SOL or J-HPES, ISA is far more deterministic in the finding of contributing factors, because it has to be used by non-experts in the human factors domain and advises them on achieving improvements in the safety of work in a local setting. Several other systems are already in use for the transmission, documentation, storage and retrieval of event reports (e.g. the SACRE system at Electricité de France, see Colas, 1993). Most of them are mixtures of event analysis and retrieval systems in that they rely on reports already available. Basically, they are used to reanalyse reports from different origins with a set of descriptors and to store them for later statistical evaluation.
Interviews with experts and rapid prototyping. Software developers who do not take care of the practical requirements articulated by prospective users risk constructing systems that will never be used. We therefore want to interview experts from the German nuclear industry, consulting agencies and regulatory bodies to check our functional outline against the requirements in nuclear plants and the national reporting system. This requires preparing a preliminary presentation of the software, e.g. as a mockup of the user interface. Preliminary interviews with representatives from the nuclear industry revealed that there is a great need for supporting software. Local experts have to deliver up to fifteen event reports per year to internal or national reporting systems. Most experts in the German nuclear industry are educated as engineers and must be trained to deal with the additional demands of human factors oriented event analysis. Another important requirement is to maintain compatibility with other analysis software already in use (e.g. for technical or ergonomic problems) as well as with computer platforms and networks.
Analysis of the human factors knowledge in the literature. A third step is to check the knowledge available in the literature for later implementation in the system. Basically, this means selecting significant principles widely accepted as prerequisites for a positive safety culture, as well as psychological knowledge related to the process of event analysis. It also requires further specifying the ways of presenting the knowledge to human factors experts doing event analysis. The following steps will then be carried out to specify the CSEA software in detail and to prepare the later implementation work:
Specification of functional requirements. As a first step in the realization process, we have to write down the functional requirements of the CSEA software. This means describing in some detail what functions the software will support and how the user interfaces should look. In addition, we have to specify the modules and their cooperation in technical terms, that is, the interfaces between them (e.g. data formats, protocols etc.).

Selection of the implementation tools. The second and last step to be accomplished is to select a software environment for the later implementation of the CSEA modules. Since this is a crucial choice in every software project, care must be taken to find a tool that is easy to learn, provides sufficient predefined functions to ease the programming work and, last but not least, offers options for extensions of the CSEA software.
CONCLUSIONS

In the present paper, the SOL approach and a system for Computer Supported Event Analysis were briefly outlined. As a result of the work described we want to arrive at a prototype of a CSEA software. After the realization of the prototype and its documentation, implementation work has to be continued with an empirical evaluation of the functionality in a practical setting. For this to be most successful, several steps of information gathering were described. Nevertheless, most of the work still has to be done. Once ready, the software should help to promote the idea of an organizational learning system in the German nuclear industry. Such a system should help to better understand the influence of human and organizational factors on system safety.
REFERENCES

Becker, G., Wilpert, B., Miller, R., Fahlbruch, B., Fank, M., Freitag, M., Giesa, H.-G., Hoffmann, S. and Schleifer, L. (1996). Einfluß des Menschen auf die Sicherheit von Kernkraftwerken. Analyse der Ursachen von "menschlichem Fehlverhalten" beim Betrieb von Kernkraftwerken. Schriftenreihe Reaktorsicherheit und Strahlenschutz (BMU-1996-454). Bonn: Bundesminister für Umwelt, Naturschutz und Reaktorsicherheit.
Colas, A. (1993). Human Factors and Safety-Performance-Quality in Operations at Electricité de France's Nuclear Power Plants. EdF Human Factors Group, Paris.
Fahlbruch, B. and Wilpert, B. (1995). Event Analysis as a Problem Solving Process. Paper presented at the 13th International NeTWork Workshop on "Event Analysis in the Context of Safety Management Systems", 11-13 May 1995, Bad Homburg, Germany.
Ferry, T. S. (1988). Modern Accident Investigation and Analysis. New York: Wiley.
Koorneef, F. and Hale, A. (1995). Organisational Feedback from Accidents at Work. Paper presented at the 13th International NeTWork Workshop on "Event Analysis in the Context of Safety Management Systems", 11-13 May 1995, Bad Homburg, Germany.
Reason, J. (1990). Human Error. Cambridge: Cambridge University Press.
Takano, K., Sawayanagi, K., Iwai, S. and Kabetani, T. (1994). Analysis and Evaluation System for Human Related Incidents at Nuclear Power Plants. Paper presented at the First International Conference on HF Research in Nuclear Power Operations (ICNPO), 31 October - 2 November 1994, Berlin.
Wilpert, B., Fank, M., Fahlbruch, B., Freitag, M., Giesa, H. G., Miller, R. and Becker, G. (1994). Weiterentwicklung der Erfassung und Auswertung von meldepflichtigen Vorkommnissen und sonstigen registrierten Ereignissen beim Betrieb von Kernkraftwerken hinsichtlich menschlichen Fehlverhaltens. Schriftenreihe Reaktorsicherheit und Strahlenschutz (BMU-1996-457). Bonn: Bundesminister für Umwelt, Naturschutz und Reaktorsicherheit.
A11" Risk Management Decision Support Systems
AN INFORMATION SYSTEM SUPPORTING DESIGN FOR RELIABILITY AND MAINTENANCE
J.-F. Rit and M.-T. Béraud

Electricité de France, 6 quai Watier, Chatou, F-78400
ABSTRACT

EDF is currently developing a methodology to integrate availability, operating experience and maintenance in the design of power plants. This involves studies that depend closely on each other's results and assumptions about the reliability and operations of the plant. Therefore a supporting information system must be carefully designed. Concurrently with the development of the methodology, a research oriented information system was designed and built. It is based on the database model of a logistic support repository that we tailored to our needs.
KEYWORDS

Information system, design, reliability, availability, operating experience, maintenance, integrated logistic support, data base.
PURPOSE OF THE INFORMATION SYSTEM
Designing Nuclear Power Plants for Improved Operation and Maintenance

Electricité de France is currently developing a methodology, CIDEM, which stands for "Design integrating availability, operating experience and maintenance", for the design of power plants. CIDEM is based on a reviewing process described by Degrave and Martin-Onraet (1995). Along the project, designs are submitted to the CIDEM team, which evaluates them with respect to criteria of availability, maintenance costs and personnel exposure to radiation. This evaluation is preferably decomposed into an allocation and consolidation process so that the plant, along with the criteria, can be broken down into manageable components. Among the studies conducted in the project, we have selected three types for the scope of the information system and hence the scope of this paper:
- analysis of operating experience, which consists in establishing, by delving into the event databases of related plants or generic reliability databases, the reliability data that will be the comparison basis for the values expected from the design;
- allocation and prediction of forced unavailability, using a reliability model according to the method given by Bourgade et al. (1996);
- accounting for maintenance, which confronts the design and its allocated availability with the cost and performance of the maintenance program; such a program, a valuable by-product of the study, is the collection of the main maintenance tasks, along with their frequency, selected according to the Reliability Centered Maintenance (RCM) method described by Jacquot (1996); the whole study, detailed by Degrave et al. (1996) and Degrave et al. (1997), follows the principles of Integrated Logistic Support (ILS).
Sharing Information for Consistency and Efficiency

Considering the complexity of a nuclear plant and the length of its design and life cycle, the CIDEM studies, which are highly data consuming, need assistance to handle a large amount of information. In their guidelines for improving the operations and maintenance of nuclear power plants, Mazour et al. (1996) state the need for a data repository that must be filled by the end of the design and should be the foundation of an information system supporting operations. In addition, they suggest that the contents of such a repository should be jointly defined as soon as possible. Following this recommendation, we conducted a research oriented effort for the design of an information system concurrently with the development of the CIDEM methodology. A demonstration software was implemented so that experience is gained about the actual interest and feasibility of integrating studies through shared information. The demonstration is based on case studies of the Chemical and Volume Control System of the European Pressurized Reactor, a program involving EDF in the design of a new nuclear power plant. This paper relates the approach taken to design and build the information system and the first conclusions that can be drawn after its completion. We contend that building a system meeting all the aforementioned needs is a novel experience. It sheds an interesting light on the status of the relations between the three connected domains of systems reliability analysis, operations feedback analysis and maintenance optimization. On a more practical level, we hope to present to a reader attempting a similar endeavor enough elements to reproduce and go beyond what we achieved.
DESIGNING THE INFORMATION SYSTEM
A First Building Block: the Logistic Support Analysis Repository Model

Designing an information system meant to support a work process not yet defined appears to be difficult if not objectionable. Yet there is a growing tendency towards reducing the length of the design cycle, and information system building, not being a productive task with respect to the design, is under considerable pressure and must be anticipated. However, as long as the methodology of design for reliability and maintenance evolves, which is still the case in the European Pressurized Reactor project, the information system cannot be final. Nonetheless, we took the opportunity to play an active role in the definition and organization of an integrated design process by concurrently designing the information system. To avoid new development for each type of study, we relied on pre-existing computer tools. Thus we integrated software from the EDF failure reporting system (see Lannoy and Procaccia, 1996), reliability analysis tools described by Bouissou and Bourgade (1997) and our RCM workstation (see Jacquot, 1996). Rather, we focused on the means to make these tools work together. We did this by means of a centralized database that would simplify data exchange mechanisms, store the common data and implement a common perspective on the main reliability and maintenance concepts (see Fig. 1).
[Fig. 1 diagram: functional analysis, reliability analysis and maintenance tools connected through the central CIDEM data base]

Fig. 1. Basic system architecture

We must emphasize that we do not look for a repository that would cover all data and concepts used in the project; rather, we aim to centralize shared data and coincidental concepts. As an example, operational feedback events, as well as fault trees, were not selected for inclusion in the central database: they are managed by the relevant particular tool. Yet, we realized that an existing software collection is not enough to design quickly and efficiently the model of the desired centralized database. Joint design tends to drift into long discussions and disambiguation of basic concepts like failure or repair. This is why we used, as a working hypothesis, the framework of the US Department of Defense (DOD, 1991) norm on integrated logistic support (MIL-STD-1388). This norm is twofold: the first part (1388-1A) describes a methodology for, among other goals, integrating reliability and maintenance studies in the design process. Although we did not use this part as a project wide reference, we felt that its scope was adequate. The second part (1388-2B) describes the model of a database supporting these studies. Its main components are a data dictionary of about 500 Data Element Definitions (DED), a collection of tables defining links between the data and a collection of reports that specify how data should be extracted and organized.
A tailoring process

The 1388 model is supposed to be tailored to the needs of each project. A selection among the reports can serve as a functional specification from which a subset of data and tables can be inferred. We took a slightly different approach, since our need was putting together already existing tools and study types. We chose to tailor the data dictionary by selecting only the subset of the DED that defines information exchanged between at least two studies, augmenting them with our own DED if need be. About 65 data element definitions were selected from the norm and 15 added for our needs. It soon appeared to us that the definitions were very terse; we strongly needed the relation information to grasp the supposed meaning of the data. As the table form given by the norm is not adequate for easy understanding, we reformulated the tables into an entity-relationship model for the selected data (a
THE 1388 MODEL AND CIDEM
A Rough but Workable Framework

Our first comment on the adequacy of the 1388 norm is that it did act as an effective catalyst to build a common model. One must also admit that a data dictionary is very limited for defining the concepts revolving around the function, reliability and operations of a system. Moreover, although it must have been a tremendous effort, this kind of dictionary is more a compilation of concepts underlying pre-existing MIL standards, and theory has made progress since then. For example, the model relies on the implicit assumption that all failure laws follow an exponential distribution function with a time independent parameter; the definition of the failure rate is actually the definition of its maximum likelihood estimate, and the difference between demand related and time related failures is considered a mere question of unit. Accepting these as simplifications based on pragmatism, we proceeded further.
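For the exponential law with constant rate, the point estimate in question is simply the number of observed failures divided by the cumulative operating time. A two-line sketch with hypothetical operating-experience counts:

```python
# Maximum likelihood estimate of the failure rate under an exponential
# (constant-rate) failure law: lambda_hat = failures / cumulative time.
n_failures = 7           # hypothetical number of observed failures
total_hours = 350_000.0  # hypothetical cumulative operating time (h)

lambda_hat = n_failures / total_hours
print(f"estimated failure rate: {lambda_hat:.2e} per hour")
```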
The New Perspectives of 1388

Several aspects of the rationale underlying the 1388 model were new to us and we found them useful.
System breakdown

The whole model is organized around a hierarchical breakdown of the system, a power plant in our case. The resulting tree structure (which we will call the plant breakdown in the following) is strictly enforced. While there is nothing revolutionary in this, two points must be emphasized:
- most data is related to a tree node, that is, a particular, well identified subset of the plant; there is no mechanism to represent "generic" or family-related data, so the difference from a general-purpose reliability data bank is clear. Also, one can get a feel for the volume of data by multiplying the number of DED by the number of nodes in the plant breakdown: a few dozen DED actually mean a large database!
- there is no semantics attached to a particular depth or level of decomposition; thus, the identification coding of the subparts of the system relates only to a position in a tree. Moreover, the model structure is completely independent of the depth level: for example, any plant breakdown item can be associated with failure modes and maintenance tasks. This is not standard practice at EDF: the code of an item gives information on which subsystem the item belongs to and what kind of equipment is involved, and a failure mode or a maintenance task on the whole plant or on a subsystem is never considered.
Reliability data

The 1388 model defines four different values for a given reliability parameter, say a failure rate: Compared, Allocated, Predicted and Measured (CAPM). Despite the terseness of the standard on their meaning, we took:
- compared values as the results of operating feedback analysis on currently running equipment or installations, mapped to the current design;
- allocated values as the goals established in a top-down direction;
- predicted values as the values expected in the project, possibly on the basis of compared values and incorporating, for example, expected improvements in technology or operating conditions pertaining to the design;
- measured values as values measured after the design is built.
We found this distinction very useful to clarify the scope of each study. For example, operational feedback analysts tend to compute predictions directly from observed data (sometimes combined with expert assessments).
Fig. 2. A simplified conceptual model (its entities include the plant breakdown items with maintenance costs, catalog items and spare parts, maintenance tasks with durations, frequencies and personnel, failure modes with seriousness and effects, and the CAPM RAM and failure mode parameters)
A simplified version of the model is shown in figure 2. The three subsets of data sketched in figure 1 are made apparent in the structure of the graph: on the top and left-hand sides are the data related to the tree-like description of the plant; on the right-hand side are the data related to reliability, around the failure mode entity; on the bottom are the data related to maintenance, around the task entity.
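To give a concrete feel for the shape of this model, a minimal sketch in Python follows; the class and field names are illustrative shorthand for the entities of figure 2, not the actual CIDEM schema or the 1388 table names.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CAPMValues:
    """The four CAPM variants of one reliability parameter (e.g. a failure rate)."""
    compared: Optional[float] = None    # from operating feedback, mapped to the design
    allocated: Optional[float] = None   # top-down goal
    predicted: Optional[float] = None   # expected in the project
    measured: Optional[float] = None    # observed once the design is built

@dataclass
class FailureMode:
    code: str
    failure_rate: CAPMValues = field(default_factory=CAPMValues)
    mttr_hours: CAPMValues = field(default_factory=CAPMValues)

@dataclass
class MaintenanceTask:
    designation: str
    duration_hours: float
    frequency_per_year: float

@dataclass
class BreakdownItem:
    """A node of the plant breakdown tree; most data hangs off such nodes."""
    logistic_code: str
    designation: str
    parent: Optional["BreakdownItem"] = None
    failure_modes: list = field(default_factory=list)
    tasks: list = field(default_factory=list)
    feedback_sample_code: Optional[str] = None  # pointer to the feedback analysis

pump = BreakdownItem("RCP-01", "primary pump")
pump.failure_modes.append(FailureMode("FM-SEAL-LEAK"))
pump.tasks.append(MaintenanceTask("seal replacement", 8.0, 0.5))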
At the same time, the allocation process assigns objectives on the basis of what is currently possible and thus needs compared data. We have to say, however, that when the item is a whole new design with no close running equivalent, the amount of interpretation required to produce a figure blurs the distinction between a compared and a predicted value.
Maintenance tasks

In order to be able to define the resources necessary for operations and maintenance, the 1388 model gives an "operational perspective" to maintenance tasks. A maintenance task must correspond to a unit of work actually performed by identified workers using identified tools. This is in contrast with the highly abstracted "task" underlying an MTTR attached to a failure mode for the use of a reliability analyst. It is also quite different from a task selected by RCM, which tends to abstract away auxiliary and induced tasks like preliminary diagnosis or scaffolding erection. Yet the three perspectives must coexist. We must say that we have not established links in the model between, for example, a "reliability" MTTR and a maintenance task duration. Thus we only enforce a weak consistency on durations.

What we Added to the Model
To fulfill our goal of facilitating data exchange, we felt the need to add data elements to the model (marked with a bold A label on figure 2). First, there are new data elements that could not belong to a general model because they are specific to the system. In our case, for example, personnel exposure to radiation is an important design goal, given in the form of a cumulative annual dose "for the plant". For health purposes, exposure data are currently collected on an individual worker basis, according to regulations. We determined that for design optimization the relevant data element should be the cumulative dose associated with each maintenance task; hence our addition to the model. However, going from observed data on the personnel to data on tasks implies knowing who did what. Such knowledge is difficult to derive from the current data, for ethical and legal reasons, because they were not collected for that purpose. From a broader perspective, we had to add CAPM values to failure mode dependent reliability figures. The 1388 model only provides for CAPM values attached to a plant breakdown item, or, so to speak, "all failure modes considered together". One can interpret this as a stronger emphasis on reliability analysis in the CIDEM studies. Finally, there is scarcely any provision in the model to justify the source of values obtained from operational feedback analysis. Nobody in the project would blindly use such figures. On the other hand, much care is taken by the feedback analyst to account for the selected samples and provide, if need be, likelihood distributions. Therefore, we added for each item an operating feedback sample code that is a pointer to the feedback analysis conducted to produce all the "compared" values related to the item. For the moment a qualitative examination is enough; no need was expressed for the use of numerical data like confidence intervals.
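As an illustration of the kind of aggregation this implies, going from per-worker exposure records to a per-task cumulative dose, a minimal sketch follows; the record fields and task codes are hypothetical, not those of the CIDEM database.

from collections import defaultdict

# Hypothetical per-worker exposure records: (worker_id, task_code, dose_mSv).
# Deriving per-task doses from these requires knowing who did what.
exposure_records = [
    ("W01", "T-PUMP-OVERHAUL", 0.12),
    ("W02", "T-PUMP-OVERHAUL", 0.09),
    ("W01", "T-VALVE-INSPECTION", 0.03),
]

def cumulative_dose_per_task(records):
    """Sum individual doses over all workers involved in each task."""
    dose = defaultdict(float)
    for _worker, task, msv in records:
        dose[task] += msv
    return dict(dose)

print(cumulative_dose_per_task(exposure_records))
# e.g. pump overhaul ~0.21 mSv, valve inspection 0.03 mSv (up to float rounding)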
FEEDING AND USING THE INFORMATION SYSTEM
A demonstration scenario, linking the various CIDEM studies by their data flows, was elaborated to demonstrate their integration through the sharing of key data. This completed our static approach of building a data model and allowed feedback on shaping the design for reliability and maintenance process. Furthermore, this was useful in avoiding the syndrome of defining an extensive model and being driven to collect useless data. Figure 2 is actually close to the result of an a posteriori filtering of the aforementioned 80 or so data elements.
Detailing the scenario is beyond the scope of this paper. However, in addition to showing the interdependencies and proper sequencing of the studies mentioned at the beginning of this paper, the scenario underlined the need for an overlooked task: producing the plant breakdown along with the relevant failure modes. The difficulty in this task is to elaborate this breakdown so that it is relevant to all studies, and to do so before these studies are started. If these conditions are not met, each study will use its own breakdown. As a consequence, the results will be related to slightly different plant subsystems or components. Then any global reasoning, trading, for example, availability against investment cost and maintenance policy, will be rendered impossible unless all results are "harmonized", meaning in effect rerunning the studies. If an inadequate breakdown is forced upon the studies, the results, if there are any, will be meaningless to their producers; when confronted with each other they will clearly be wrong and require long and difficult tweaking. We experienced both situations to a certain degree: EDF has long had a standard plant breakdown system; it is not fully adequate for reliability studies, operations feedback analysis or maintenance optimization, and as a consequence each domain uses a different breakdown. However, we were able to reach an adequate consensus on the breakdown by focusing on the "right" depth level of the breakdown, which we call the functional groups level. This level (which is not, however, met at a uniform depth in the breakdown) roughly addresses what are often called components in the safety and reliability literature. At this level, we observe that we can quite easily link a piece of equipment with a basic function, quite easily link this function to the behavior of the plant, quite easily measure failures on parts of this equipment and finally relate them to maintenance tasks. Such a level has been identified and used in RCM studies. Its usefulness then justified the elaboration by Saby (1995) of a catalog of all the generic functional groups that are met in the thermo-hydraulic systems of a power plant. We relied on this catalog and showed that its use can be extended to build reliability models. We claim that if all CIDEM studies use different breakdowns coinciding on this level, they will have a good level of consistency. Unfortunately, the fuzziness of the preceding paragraphs is eloquent in showing the lack of an adequate theory that would give rules and principles on what a functional group is and what its failure modes are. We could only observe, after using the catalog, that it was adequate to our purpose.
CONCLUSIONS AND PERSPECTIVES
The 1388 model is a useful and workable starting point for structuring reliability and maintenance information at the design stage. However, we had to augment it to suit our needs of integrating availability studies more closely and of justifying operations feedback rigorously. The quality of case studies is greatly improved. Firstly, the data available in the central base curb shaky estimations of unavailable input data. Secondly, results output to the base are de facto submitted to a review process based on consistency with already present data and on overall rigor in the definition of common concepts. Such a model must however be augmented with a system breakdown that will ensure a coincidence of the studies, at least at some kind of component level. Such a breakdown level must be finely determined, and theoretical ways of describing it are still lacking. Now that the CIDEM process is more settled, we wish to expand the scope of the information system to a larger set of studies conducted in the CIDEM project, studies that push the limits of our model framework. First, we will have to embed our currently autonomous system into an overall CAD system. EDF is currently implementing a new CAD system (CAO 2000) that will manage the common technical description of the plant. It will be much more powerful and rigorous in the handling of libraries containing
generic data, which will make the studies much easier. Moreover, the CIDEM database currently gives an instant view of the data in the project, without accounting for the history and interdependence of the studies. The study monitor of the CAD system should remedy that.
Acknowledgments

The authors wish to thank C. Degrave, A. Lannoy, E. Bourgade, M. Bouissou, C. Meuwisse, D. Vasseur and C. Martin-Mattei for their comments on the drafts of this paper as well as for their contribution to the work described in it.

References
Bouissou, M. and Bourgade, E. (1997). Unavailability evaluation and allocation at the design stage for electric power plants: methods and tools. In Proc. Ann. Reliability and Maintainability Symposium (RAMS 97), pages 91-99, Philadelphia. IEEE.
Bourgade, E., Degrave, C., and Lannoy, A. (1996). Performance improvements for electrical power plants: designing-in the concept of availability. In Cacciabue, P. and Papazoglou, I., editors, Probabilistic Safety Assessment and Management: ESREL 96 - PSAM III, volume 1, pages 158-163. ESRA, IAPSAM, Springer Verlag.
Degrave, C. and Martin-Onraet, M. (1995). Integrating availability and maintenance objectives in plant design, EDF approach. In Third International Conference on Nuclear Engineering (ICONE-3), volume 3, pages 1483-1488, Kyoto. ASME-JSME.
Degrave, C., Martin-Onraet, M., and Meuwisse, C. (1996). Integrated logistic support concept in the design of nuclear power plants. In Fourth International Conference on Nuclear Engineering (ICONE-4), volume 4, pages 27-31, New Orleans. ASME-JSME.
Degrave, C., Meuwisse, C., Hamon, L., and Martin-Mattei, C. (1997). Taking maintenance into account in the design of nuclear power plants. In Fifth International Conference on Nuclear Engineering (ICONE-5), Nice. (to appear).
DOD (1991). DOD requirements for a logistic support analysis record. Military Standard MIL-STD-1388-2B, Department of Defense, United States of America.
Jacquot, J.-P. (1996). A survey of research projects in maintenance optimization for Electricité de France power plants. In Proc. 1996 ASME Pressure Vessels and Piping Conference, volume 332, pages 83-88, Montreal.
Lannoy, A. and Procaccia, H. (1996). The EDF failure reporting system process, presentation and prospects. Reliability Engineering and System Safety, 51(2):147-158.
Mazour, T. et al. (1996). Designing nuclear power plants for improved operation and maintenance. Technical Report IAEA-TECDOC-906, International Atomic Energy Agency.
Saby, P. (1995). Projet OMF, description générique des matériels et de leurs défaillances, note technique D4002.42.81-94/049 Indice 2, Électricité de France Production Transport, Exploitation du parc nucléaire, Département maintenance.
RELIABILITY SUPPORT SYSTEM FOR METALLIC COMPONENTS SUSCEPTIBLE TO CORROSION RELATED CRACKING

E. Dias Lopes1, C. Vianna1, T. Carvalho1, K. J. Schmatjko2, Duarte Esmeraldo3, M. Vancoille4, Wim Van Acker4, Gert Boulliard4, Kristel Phlippo4, A. Jovanovic5, Marco Poloni5, W. Bogaerts6, Jack Tulp7

1 ISQ - Instituto de Soldadura e Qualidade - Estrada Nacional 249 Km 3, Cabanas, Leião (Tagus Park), Apartado 119, 2781 Oeiras, Portugal
2 Siemens KWU - Hammerbacherstrasse 12/14, D-91050 Erlangen, Germany
3 Quimigal Adubos - Zona Industrial de Estarreja, 3860 Estarreja, Portugal
4 METAlogic - Kapeldreef 60, B-3001 Leuven, Belgium
5 MPA Stuttgart - Pfaffenwaldring 32, 70569 Stuttgart, Germany
6 KULeuven - De Croylaan 2, B-3001 Leuven, Belgium
7 NCD - Postbus 120, 3720 AC Bilthoven, Netherlands
ABSTRACT
This document presents the objectives and a technical description of a software system developed in the framework of a European Community R&D project (BRITE EURAM project BE5936). It also includes a description of the achieved results. The research was directed at developing an (off-line) automatic Corrosion & Materials Advisor for improved design and maintenance of corrosion-liable equipment in process industries and power plants (new ones and those already in operation). A comprehensive modular software system was developed covering a wide range of aspects of corrosion cracking prevention and analysis related to several combinations of materials & environments. The system modules can be summarised as follows:
CRAI Module: • a suite of intelligent systems and tools for the assessment of component cracking risk and expected service life that is part of a "corrosion simulator", with automated problem-solving routines and user programming tools, plus related Background Information compiled as databases and hypermedia;
FRACTAL Module: • a set of engineering tools for corrosion failure analysis plus related Background Information comprising an Atlas of Case Histories and guidelines in hypermedia format.
DESIGNER Module:
• A system for intelligent guidance through a collection of guidelines, standards and examples, selecting solutions for a broad spectrum of design situations liable to corrosion. The overall system is an integrated and interactive Corrosion Advisory and Engineering Workstation for the assessment of component behaviour and cracking risks.
INTRODUCTION

This paper presents the objectives and a technical description of a software system developed in the framework of a European Community R&D project. It also includes a detailed description of the achieved results. The research was directed at developing an (off-line) automatic Corrosion & Materials Advisor for improved design and maintenance of corrosion-liable equipment in process industries and power plants (new ones and those already in operation). A comprehensive modular software system has been developed covering a wide range of aspects of corrosion cracking prevention and analysis related to several combinations of materials & environments. The overall system is an integrated and interactive Corrosion Advisory and Engineering Workstation for the assessment of component behaviour and cracking risks that:
• acts as a CORROSION SIMULATOR, allowing the user to analyse different what-if scenarios when changing process or other conditions: "What kind of corrosion problems can be expected if the chemical composition of the environment, temperature, pressure, inhibitor additions, design, mechanical load... change? What will be the corrosion rate? How high is the risk and how dangerous are the anticipated corrosion phenomena?..."
• acts as an INTELLIGENT INFORMATION BANK guiding the user on how to mitigate corrosion related cracking, and allowing him to access necessary background information: an engineering library of codes, guidelines, examples, images, reports, troubleshooting procedures and routines...
The tool delivers support in:
(1) design: when something has to be built, and the question is: will the component corrode, crack or deteriorate in some other way; how can the design be improved/optimised to avoid corrosion-initiated problems;
(2) safety analysis in operating plants: when something has been built and the question is: what is the potential damage and expected risk under the actual process conditions and observed (rates of) corrosion, e.g. as revealed during intermediate inspection or maintenance. This would help, for example, in setting the appropriate process parameters, as well as in 'predictive maintenance' or scheduling of inspections;
(3) failure and damage analysis: when the equipment has deteriorated and requires maintenance, and the questions are: what is the failure mode and how can the problem be overcome.
INNOVATIVE ASPECTS

The main innovative aspects, with respect to other corrosion prevention software tools already on the market, are:

Integration in a single system of various "pieces of knowledge" directed at preventing and analysing corrosion failures.
Development of a corrosion simulator capable of quantifying or giving a qualitative assessment of corrosion risks.

Openness of the system: the system is almost fully open, allowing the user (a corrosion expert) to create and program his own rules of assessment or quantification of corrosion risk, through neural network analysis or rule-based expert systems; to store his own design rules, specifications or standards; to record his own failure case histories in a very comprehensive and systematic ATLAS of Case Histories; and to save information and the work session performed with the system in the NOTEPAD facility, which is present throughout the modules.

Tutorial Use & Easy Basic Information Search: the system has a set of databases combined with a significant amount of Background Information (in hypermedia format) which, all together, form a very useful and broad, structured body of scientific and technological information on corrosion aspects. All these features can be used both for information search and for tutorial purposes in a systematic and quick way. The system provides a set of WORKED EXAMPLES that illustrate, through the solution of real cases, how the system can be used for the solution of different kinds of corrosion engineering problems.

Combination of different software modules (realised with commercial shells) to develop an easy-to-use, user-friendly system, without need of complex and expensive computer systems.
SYSTEM DEVELOPMENT

The objectives were achieved by performing the following R&D tasks:
1. Structuring and analysing knowledge already available (case histories, construction rules, standards, data sets, etc.) in order to derive corrosion cracking models.
2. Performing some additional testing in specific target areas.
3. Customising appropriate knowledge engineering methods and tools, including the use of expert system, hypermedia and neural network technologies.
4. Validating the system features through realistic examples of industrial problems like design aspects for corrosion prevention, failure analysis, and corrosion risk assessment.
The system covers a wide range of combinations of materials vs. phenomena, as can be seen in the table below:
TABLE 1
CRACKING PROBLEM MATRIX COVERED IN THE PROJECT
The matrix crosses the phenomena (chloride-induced SCC; caustic cracking; IGA & intergranular SCC by high-temperature water and by nitrates; HE-HIC; corrosion fatigue & strain-induced cracking) with the material families (austenitic stainless steels; Ni & Ni-based alloys; C-steels & low-alloy steels; Ti).
SYSTEM ARCHITECTURE

The system architecture is presented below:
Figure 1 - Scheme of the ORACLE System architecture: the shell provides system administration, report generation and user guidance for the Advisor, FRACTAL and DESIGNER modules, on top of the background information (FRACTAL background information, ATLAS of corrosion cracking case histories, corrosion cracking phenomena readings, materials database), external databases (Perinorm, Active Library of Corrosion, ACHEMA product database) and WWW search.

The ORACLE shell is an integrated environment that co-ordinates all the modules and sub-modules to improve the effectiveness of working with the ORACLE system. The shell has a user accessibility pattern that classifies users in four categories. The Administrators are responsible for the system configuration (external applications, user creation, passwords...). The Experts are in charge of managing the system technical contents (insertion of guidelines/standards, creation of neural network-based models, insertion of rules). The Standard classification is for performing failure analysis and corrosion risk assessment, without permission to modify and/or enlarge the relevant information of the system. The Beginners classification is for training purposes, enabling the user to navigate the system but without permission to modify relevant information. The automatic report facility, Notepad, is a text-based component where all the actions and results are reported, putting together the material necessary to edit a complete report about the work session.
BRIEF DESCRIPTION OF THE MAIN MODULES
CRAI Module

The core of the "Corrosion Simulator" is a corrosion cracking risk assessment module called CRAI (Corrosion Cracking Risk Analysis Instrument). It may be used in design (evaluation of "fitness for performance" of a specific design concept), as well as in failure analysis tasks (e.g. evaluation of a priori corrosion risks under the component's operational conditions).
Corrosion cracking risks can be assessed in two ways: either by investigating the qualitative indications of a cracking risk (i.e. resistant / resistant unless... / not resistant...) or by investigating the quantitative risk (i.e. what is the chance that the material will fail). The first type of investigation is essentially useful to determine whether or not further detailed analysis (quantitative assessment) is necessary or useful, especially in those circumstances where the initial investigation yields a "resistant unless..." result. This dual nature of the CRAI design is pictured in Figure 2.
Figure 2 - The overall CRAI architecture: a common interface under the ORACLE shell feeds, after preprocessing, both the Qualitative Reasoning System (with an Extraneous Corrosion Factors module) and the Quantitative Reasoning Modules, which share an Output Module.

A common interface gives access to both the qualitative and quantitative reasoning systems. The qualitative reasoning system makes a general assessment of corrosion cracking risks based on (general) information about material, environment, equipment and some environmental parameters (concentrations, temperature, pressure, velocity, pH, heat treatment, welding). A more detailed qualitative assessment can be made using the Extraneous Corrosion Factors module, which takes into account a wider range of environmental parameters. Quantitative Reasoning Modules are available for a selected set (represented by X and Y in Figure 2) of environment/material combinations. These modules are detailed models, each competent within a limited domain. The modules can be rule-based expert system modules, neural networks or other algorithmic procedures. CRAI runs on (high-end) PCs running Microsoft Windows® or Microsoft Windows 95®. CRAI is a hybrid system and uses a mix of software packages. Database parts of the system have been implemented using Microsoft Access with a Microsoft Visual Basic based interface. Neural network parts are plain C code as generated by the NeuralWorks Professional II/Plus software package. Expert system parts have been implemented using Kappa from IntelliCorp.
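To illustrate the style of the qualitative reasoning described above, a minimal sketch follows (in Python for brevity, although CRAI itself mixes Access, Visual Basic, C and Kappa); the material, thresholds and verdicts are illustrative assumptions only, not CRAI's actual rule base or validated corrosion data.

def chloride_scc_risk(material: str, temperature_c: float, chloride_ppm: float) -> str:
    """Toy qualitative screening for chloride-induced SCC.

    Returns 'resistant', 'resistant unless...', or 'not resistant',
    mirroring the three qualitative verdicts CRAI works with.
    Thresholds are illustrative, not validated corrosion data.
    """
    if material != "austenitic stainless steel":
        return "resistant"  # this toy rule set only covers one material
    if temperature_c < 60 and chloride_ppm < 100:
        return "resistant"
    if temperature_c < 60 or chloride_ppm < 100:
        return "resistant unless..."  # flag for detailed quantitative assessment
    return "not resistant"

print(chloride_scc_risk("austenitic stainless steel", 80.0, 500.0))  # not resistant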
FRACTAL Module

FRACTAL is a hybrid knowledge-based system containing a number of integrated modules (database, neural network, electronic flow-chart and hypermedia documents). The system supports all the main tasks of a corrosion engineer during failure assessment:
a) the description of the present situation (corrosion damage, failure of a component),
b) the evaluation of the findings based on past experience and test results,
c) the assessment of the failure reasons and
d) the proposal of remedies to improve the present situation.
FRACTAL supports these tasks by means of the following activities:
• storage of case histories in a database and query of similar case histories as information support;
• context-sensitive access to relevant information (i.e. guidelines) in hypertext format, depending on the present case;
• first rough assessment using a neural network-based data mining module;
• detailed investigation supported by a knowledge-based flowcharting module;
• a reporting facility to document all the steps performed.
The links to the other modules of the ORACLE system enlarge the spectrum of information available to the end-user during the assessment work. The software system consists of four main applications:
1) A case histories database (currently with about 800 case histories), where each case analysed can be stored with several attributes related to material, component and environment (a toy sketch of an attribute-matching query is given after this list).
2) A neural network-based data mining module, enabling trends to be identified within similar case histories and giving indications for the general assessment.
3) An assessment route flow-chart (implemented by means of the ExpertChart application), which indicates the different steps to perform for the complete assessment, from the on-site investigation to the compiling of a final report.
4) Hypertext-based background information (dynamically linked to the flow-charts) to support the user in classifying the present case.
These applications are integrated into one software tool, which has the task:
• to support the navigation through the different applications during the user's session;
• to assure a user-friendly interface with the support of an extensive Help facility;
• to facilitate the task of reporting by means of the FRACTAL Notepad, where all the activities and the user remarks are automatically stored.
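The toy query sketch below shows one plausible way to rank case histories by shared attribute values; the field names and scoring scheme are illustrative assumptions, not the actual FRACTAL database schema.

# Toy similarity query over case histories, matching on shared attributes.
cases = [
    {"id": 1, "material": "austenitic SS", "environment": "chloride", "component": "heat exchanger"},
    {"id": 2, "material": "C-steel", "environment": "caustic", "component": "vessel"},
    {"id": 3, "material": "austenitic SS", "environment": "chloride", "component": "piping"},
]

def similar_cases(query: dict, histories: list, min_matches: int = 2) -> list:
    """Return case histories sharing at least min_matches attribute values with the query."""
    def score(case):
        return sum(1 for key, value in query.items() if case.get(key) == value)
    return [case for case in histories if score(case) >= min_matches]

print(similar_cases({"material": "austenitic SS", "environment": "chloride"}, cases))
# -> cases 1 and 3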
DESIGNER MODULE
The DESIGNER module complements the other modules as a tool for the assessment of designs with respect to corrosion liability. This is achieved by 'intelligent' guidance of the user through a collection of standards and codes, design guidelines, checklists and examples selected for their relevance to corrosion-induced cracking of plant components. Due to the predominantly non-numeric character of this information, an expert system (ES) component of the case-based reasoning type has been developed for focused access to the information contents stored in the module. In addition, the various 'pieces of knowledge' are linked by a particular hypertext technique, which retrieves explanatory information from the same information base during the session. Furthermore, there is a link to the respective examples in the collection of the FRACTAL module, and other links are prepared for searches in external sources after their attachment by the user.
One of the main conditions of development was the request for an 'open system', i.e. the module allows storage of new information and editing and deletion of the contents by the user. This is realised by appropriate editing functions of the module. The use of the module is supported by its own help function based on the WINDOWS software, which gives advice on all elements of the GUI and access to the topics by search. The software platform used for the development of DESIGNER was MS-Access throughout, first as version 1.1, then 2.0 for the final form. The use of MS applications, to facilitate the merging of the modules first developed in parallel as stand-alone versions, was agreed from the beginning. While the detailed description of all functions of the module has to be left to the manual, the following sections discuss the functionality of DESIGNER from its underlying principles and the steps of development.
RESULTS AND CONCLUSIONS

The research was directed at developing an (off-line) automated Corrosion & Materials Advisor for improved design and maintenance of corrosion-liable equipment in process industries and power plants (new ones and those already in operation). The ORACLE System has been successful in achieving its principal objective of providing a Corrosion and Materials Advisor to improve the assessment of corrosion-related cracking risks and to enhance design and maintenance practices in process industries and power plants. The applicability and benefits of the developed system have been demonstrated through a number of realistic examples, herein named Worked Examples (WE). A comprehensive modular software system has been developed covering a wide range of aspects of corrosion cracking prevention and analysis related to several combinations of materials & environments. The system developed is rather innovative and more complete than other systems related to corrosion prevention. The main aspects to point out are:

Integration in a single system of various "pieces of knowledge".

Development of a corrosion simulator capable of quantifying or giving a qualitative assessment of corrosion risks.

Openness of the system: the system is almost fully open, allowing the user to create and program his own rules of assessment or quantification of corrosion risk, through neural network analysis or rule-based expert systems; to store his own design rules, specifications or standards; and to record his own failure case histories in a very comprehensive and systematic ATLAS of Case Histories.
A KNOWLEDGE-BASED SYSTEM FOR FAILURE IDENTIFICATION BASED ON THE HMG METHOD

Atoosa Jalashgar

Systems Analysis Department, Risø National Laboratory, DK-4000 Roskilde, Denmark
ABSTRACT This paper introduces the main features of a knowledge-based system to perform failure identification of technical systems on the basis of the function-oriented system analysis method Hybrid MFM-GTST (HMG). The building blocks of the system, called SINA, comprise two groups of generic and application-specific knowledge bases, a diagnosis knowledge base to accomplish the task of failure identification by using the information received from the two groups of knowledge bases, and a user interface. The bases along with an implemented prototype of the knowledge-based system are explained.
KEYWORDS Function-oriented system analysis, HMG method, failure identification, knowledge-based system.
INTRODUCTION

In recent years, research within the area of function-oriented modelling of complex technical systems has been intensified, and parts of the effort have been formalised in some well-defined methodologies. The function-oriented system analysis method HMG (Jalashgar (1997), Jalashgar et al (1996)) uses a terminology to support acquiring knowledge about functions of a technical system with regard to a predefined set of goals, and it utilises two function-oriented approaches to represent this knowledge in a model, which is directly used for the task of failure identification. The prime motivation behind the method has been to be able to reveal system failures that cause degradation or lack of functions but are usually hidden due to the availability of other functions that are sufficient for the attainment of the overall goals. The knowledge identification is supported by defining and grouping basic aspects of technical systems according to a developed terminology. Briefly, the terminology is as follows:
Goals: are intended states of a technical system.
Functions: are roles of the system components that separately or in combination contribute to the attainment of the goals. A function can be active or passive, depending on whether the component that realises it is active or passive. Active components are those that possess some degree of autonomy and thus can be the agent in realising one or more of their functions, which will be active functions. Passive components can only realise passive functions.
Behaviours: are activities possessed by active components of the system, and whose existence, if intended, is necessary for all the active functions of such components to be realised.
Capabilities: are the abilities of system components to bring the components or the system into different states. Hence, behaviours and functions are some of the effects, and all the intended effects, of the capabilities, respectively. In addition, those capabilities that result in an arbitrary state of the component at an arbitrary time are called active capabilities. Focusing on a specific set of system goals defined by, say, the designer, and considering a particular component within the system, its capabilities can be categorised as follows:
1) Capabilities that contribute to the implementation of the functions of the component and thus to the achievement of the system goals. Clearly, these capabilities are intended.
2) Capabilities that are unwanted and almost immediately prevent the realisation of the functions of the component and the achievement of the system goals.
3) Capabilities that are unintended but do not by themselves affect the functions of the component or the system goals at any time.
4) Capabilities that are unintended and do affect the functions of the component, but do not have any immediate effect on the system goals.
Physics: are the causal interactions and the interrelationships among variables and parameters of different system components.
Physical Structures: are the tangible and visible aspects of the system, as opposed to its intangible aspects, being goals, functions, behaviours, capabilities and physics.
The terms manifest capabilities and latent capabilities are used to address the first two and the last two categories respectively. The knowledge representation is performed by applying the two function-oriented approaches, the Multilevel Flow Modelling (Lind (1994)) and the Goal Tree-Success Tree (Kim et al (1987)) methods. The MFM method is used to represent the overall goals, the groups of functions (each group is called a function-type), and the interrelations among the goals and the function-types in a technical system. The GTST method is used to model the capabilities of the system components that can result in different functions depending on the goal context. The resulting HMG model is then obtained by attaching the appropriate GTST models of the components to the system functions in the MFM model. A knowledge-based system to perform failure identification of technical systems on the basis of their HMG models has been developed. The building blocks of the system, called SINA, comprise two groups of generic and application-specific knowledge bases, a diagnosis knowledge base to accomplish the task of failure identification by using the two groups of knowledge bases, and a user interface to handle the intercommunication between not only the user and the knowledge bases, but also the user and the software environment utilised to develop the bases. A description of the bases and a prototype of the knowledge-based system is provided in the following.

THE GENERIC KNOWLEDGE BASES
The generic knowledge bases are those that are not related to a particular technical system subject to modelling and analysis, but are used to build the MFM model of the systems, the GTST models of the components, and the HMG model of the systems. A technical system will henceforth be called a system-object or simply an s-object. The bases are:
1) A language base that contains means to use the English language for expressing goals, functions, capabilities and physical structures in the HMG models.
2) A component base that contains information about different classes of physical components involved in various systems.
3) An MFM base that contains MFM concepts to represent the goals, the function-types and their interrelationships in different s-objects.
4) A GTST base that contains GTST concepts to represent the capabilities and the physical structure of the components, and their interconnections.
5) An HMG base that contains a set of rules for how to connect the GTST models to a certain function-type in the MFM model of the s-objects.
In the following, each generic knowledge base is described.
The Language Base

The prime aim of this knowledge base is to offer a common vocabulary, so that the expressions for goals and functions of the s-object, and those for the capabilities and physical structures of the components involved, can be formed in a consistent and compact manner. The base consists of a library of allowed English words to form phrases for expressing goals, functions, capabilities and physical structures, a set of syntax rules to suggest the possible words the user can choose during the construction of the phrases, and a dedicated editor. The library has a hierarchical structure and an attribute table is attached to each class in the library. The table provides information about the class inheritance and the allowed preceding and subsequent classes. The rules in the rule set are responsible for checking whether the typed word within the editor belongs to the library, and whether the word matches with its preceding and subsequent words. Furthermore, the rules slightly distinguish among phrases formed to describe a goal, a function, a capability or a physical structure.
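A rough sketch of the kind of adjacency check such syntax rules perform follows; the word classes and the allowed-successor table are invented for illustration, and the actual library is far richer.

# Toy adjacency check in the spirit of the language base: each word class
# lists the classes allowed to follow it. Classes and table are illustrative.
allowed_next = {
    "verb": {"article", "noun"},
    "article": {"noun", "adjective"},
    "adjective": {"noun"},
    "noun": {"preposition"},
    "preposition": {"article", "noun"},
}
word_class = {"maintain": "verb", "the": "article", "required": "adjective",
              "pressure": "noun", "of": "preposition", "water": "noun"}

def phrase_is_valid(words):
    """Check every consecutive pair of words against the adjacency table."""
    classes = [word_class[w] for w in words]
    return all(b in allowed_next[a] for a, b in zip(classes, classes[1:]))

print(phrase_is_valid(["maintain", "the", "required", "pressure"]))  # True
print(phrase_is_valid(["the", "maintain", "pressure"]))              # False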
The Component Base

The component base provides means to create objects for each specific component of a certain s-object, to gather all information about the component in its corresponding object, and to logically relate the component to its superior classes already defined in the base. Hence, the component base contributes to a modular knowledge representation of the s-object and to a more efficient preparation of its HMG model. The base contains a component library, and a procedure to detect ambiguities among created component classes in the library. Similar to the library in the language base, the component library also has a hierarchical structure with an attribute table attached to each class in the hierarchy. The information in the table includes the class name, the name of its superior class, the shape, the type (active or passive), the number of input/output stubs, the parameters, the operating, monitoring and control variables and their relationships, the operational boundaries, the names of the allowed connected components, and references to the capabilities and the physical structure (i.e., to the GTST models) of the component. The ambiguity detection procedure in the component base identifies classes that have two or several groups of superior classes, that is, classes that have been defined twice or more in the library.
The MFM Base

The purpose of the MFM base is to facilitate the preparation of the MFM model of a certain s-object, so that the overall goals and function-types of the s-object can directly be reflected by the model. The base consists of a hierarchically structured library of the MFM classes with an attribute table attached to each class. The base also contains:
1) a set of syntax checking rules to assure that the syntax of a flow-structure is not in violation of the syntax rules defined in the MFM method (Lind (1990)),
2) a set of propagation rules to determine how a given value assigned to certain function-types in a given flow-structure should propagate through the rest of the function-types within the model, and
3) a set of conflict detection rules to investigate whether the propagated flow value for each function-type is in accordance with the measured flow value for the function-type (a rough sketch of such rules follows this list).
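As a rough illustration of what such propagation and conflict detection rules can look like, a minimal sketch follows; it assumes a linear chain of mass-flow function-types, which is an illustrative simplification, not the actual rule set of the MFM base.

# Minimal sketch: propagate a known flow value along a chain of flow
# function-types and flag conflicts with measured values. A real MFM
# flow structure is a graph with sources, transports, storages and sinks;
# a linear chain at steady state is assumed here purely for illustration.
chain = ["source", "transport-1", "storage", "transport-2", "sink"]
measured = {"transport-1": 5.0, "transport-2": 3.0}  # sensor readings

def propagate(known_value: float, tolerance: float = 0.1):
    """Assign known_value to every function-type in the chain (steady-state
    mass balance) and report those whose measurement disagrees."""
    conflicts = []
    for ftype in chain:
        propagated = known_value
        if ftype in measured and abs(measured[ftype] - propagated) > tolerance:
            conflicts.append((ftype, propagated, measured[ftype]))
    return conflicts

print(propagate(5.0))  # -> [('transport-2', 5.0, 3.0)]: a conflict to diagnose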
The GTST Base

The aim of the GTST base is to provide means to prepare the GTST models of the capabilities and physical structure of the components. The base comprises a hierarchically structured library of the GTST classes with an attribute table attached to each class, a set of syntax checking rules, a set of propagation rules to decide the status of the elements in the models, and a set of conflict detection rules to find instantaneous conflicts among the status values of the model elements. Additionally, the base provides some procedures that handle the relations between the GTST models and the components, and between the GTST models and the function-types.
The HMG Base

The purpose of the HMG knowledge base is to provide means for building the HMG model of s-objects. Since the HMG model of an s-object is created by connecting the function-types in the MFM model of the s-object to
their completely-realising and partly-realising components, the entire HMG base consists of a couple of rules for adding two attributes to the attribute table of every function-type in the MFM model of an s-object.

THE APPLICATION-SPECIFIC KNOWLEDGE BASES

The application-specific knowledge bases are those that are generated for a certain s-object, and therefore contain all information about the s-object. They use the generic knowledge bases to build the MFM model of the s-object, the GTST models of the components involved, and the resulting HMG model of the s-object. The bases for each s-object are:
1) An application base that contains information concerning the physical structure and operation of a specific s-object that is about to be modelled and analysed. The component base is used to represent the physical structure of the s-object. The attribute tables of the created component instances and globally accessible parameters and variables, as well as rules and procedures, are used to represent the operation of the s-object.
2) An application-specific MFM base that contains the MFM model of the s-object, and therefore all information concerning the goals and functions of the s-object. The base also contains a set of rules and procedures to automatically assign and alter various attribute values of the MFM elements in the model on the basis of the information received from the application base. The generic MFM base is used to create the model.
3) An application-specific GTST base that contains the GTST models of the capabilities and physical structure of the components in the s-object. The generic GTST base is used to create the models and to relate them to the component instances in the application base and the function-type instances in the application-specific MFM base.
4) An application-specific HMG base that generates the HMG model of the s-object by using the generic HMG base, the application-specific GTST base and the application-specific MFM base.

THE DIAGNOSIS KNOWLEDGE BASE

The generic and application-specific knowledge bases constitute the necessary means to generate the HMG model of different s-objects, and to perform the diagnosis entirely on the basis of the model. However, the latter task also implies the necessity of an inference engine, which is placed in the diagnosis knowledge base. The inference engine uses the attribute values of the MFM and the GTST elements in the respective models in order to reason about the functional and the operational state of the s-objects.

THE PROTOTYPE

An application-oriented prototype has been developed in the object-oriented software environment G2 (Gensym (1995)). The application used is the Water Supply Process Control system (henceforth called the WSPC-system) introduced in Jalashgar (1995) and HMG-modelled in Jalashgar (1996). Figure 1 shows the WSPC-system and a part of its HMG model. The knowledge kept in the prototype is organised by means of several workspaces, which can be categorised in five groups. The first group comprises a simulator of the WSPC-system, the second group concerns the MFM model of the system, the third group includes the GTST models of the components, and the fourth group assigns the GTST models to their corresponding components in the system, and to their corresponding function-types in the MFM model, so that the latter becomes the HMG model. Finally, the last group performs the diagnosis on the basis of the model. As it appears, the workspaces of the prototype are knowledge bases.
A generic part of an MFM toolbox developed by Jan Eric Larsson and described in Larsson (1992) is used to implement a part of the second group of the workspaces.
The Simulation Group

This group consists of five workspaces containing:
1) a group of component classes,
2) a schematic of the WSPC-system formed by the instances of the component classes,
3) a group of procedures to simulate the closed-loop dynamics of the system,
4) a group of methods to determine values of the proper component attributes on the basis of the simulated dynamics, and
5) a group of global parameters and variables together with two procedures to activate the simulation.
The activation also includes the assignment of data to the components and to the MFM model of the system. Figure 2 shows the last group in the Activation workspace.
The Activation workspace contains those procedures that activate the WSPC-simulator as well as the data flow to the components and function-types in the MFM model. Since these procedures deal with data that belong to the whole system rather than to a particular component or function-type, these data are also placed in the workspace. They include past and present measured, controlled, actuating and disturbance variables. Constant parameters of the WSPC-system, as well as some transient data, are also placed in the workspace. The simulator starts running for a certain time interval through a call to calling-procedure. As soon as the interval has passed, the decide-which-goal procedure is called to evaluate the simulated data, on the basis of which it is decided whether and how the overall goal of the control system is achieved. The block on the right side of the procedure icons is a G2 action button, the pressing of which makes a synchronised call to both procedures. In calling-procedure, the simulated data are forwarded to the components and to the MFM model as soon as they are updated.
Figure 1: The WSPC-system and its HMG model. (The system comprises a water source, pipelines, a pump with an adjusting knob, a membrane vessel, a pressure sensor and a control section; the HMG model relates the goals G1-G3 to the mass-flow function-types mf1-mf14.)

The MFM Group
This group consists of three workspaces: 1) an MFM-toolbox workspace that provides the necessary tools to create an MFM model of an arbitrary system, 2) a workspace that contains the MFM model of the WSPC-system, and 3) a workspace that contains the procedure that forwards the simulated data from the components of the system to the corresponding function-types in the MFM model. This procedure is called in calling-procedure in the Activation workspace.
Figure 2: The Activation workspace in the group of simulation workspaces.
The GTST Group

This group consists of two workspaces: 1) a workspace that contains the classes of GTST elements and also the GTST models of the components, and 2) a workspace that contains some procedures that, for an arbitrary GTST model, carry out the propagation of the status of the GTST elements through the model.
The GTST-Model-Assignment Group

This group comprises two workspaces: 1) a workspace that assigns the proper GTST models to the components of the system, and 2) a workspace that assigns the proper GTST models to the function-types in the MFM model. Hence, the latter results in the HMG model of the WSPC-system.
The Diagnosis Group

This group has only one workspace, HMG-Based-Diagnosis, which contains the necessary procedures to perform the diagnosis on the basis of the HMG model of the system. Additionally, the workspace contains procedures to reset the results of the diagnosis, so that a new diagnosis can start. Figure 3 shows the workspace. The main diagnosis procedure wspc-diagnosis and the reset-hmg procedure are called through their respective action buttons. The wspc-diagnosis procedure only contains three calls, to freq-cont, onoff-pumpon and onoff-pumpoff. Each of these three procedures investigates the status of the goal of the MFM model first, and on that basis one of the procedures will proceed. The procedures do not diagnose on the basis of the propagated attribute values of the function-types. However, in order to reduce the search domain, they are supplied with the information about whether the flow value of mbf3 (the function-type having the consumer tap as its completely-realising component, see Figure 1) is zero or not. Hence, the only sources of information for the procedures are the entire search domain provided by the HMG model, the status of the goal, and the status of the flow value of mbf3.
Figure 3: The HMG-Based-Diagnosis workspace.

RUNNING THE PROTOTYPE

The following provides a simple procedure to run the prototype. This includes the simulation of the WSPC-system, the generation of its HMG model and the diagnosis of the state of the system on the basis of the HMG model. The results of a single run of the prototype are also displayed and discussed. The procedure is as follows:
1) Start G2. The HMG model of the WSPC-system will be created automatically. However, the attributes of the elements of the models have not yet received their values. This is the job of the diagnosis.
2) Open the Activation workspace and the HMG-Based-Diagnosis workspace on the main menu.
3) Press the button labelled "open the tap and find out which goal is achieved" on the Activation workspace. The simulation will then run for a given time interval indicated in the table of the main simulation procedure calling-procedure. This also means that the values of the proper attributes of the component and function-type instances will be updated on the basis of the simulated values.
4) Press the button labelled "WSPC Diagnosis" on the HMG-Based-Diagnosis workspace, as soon as calling-procedure times out. For the convenience of the user, the MFM-Model and the Mass-Flow-Structure workspaces, along with the status of the goal, will appear automatically. The elements of the GTST models of the function-type instances (and also those of the component instances) have now received their values as the result of the diagnosis.
5) After investigating the GTST models, reset the diagnosis by pressing the "reset HMG" button on the HMG-Based-Diagnosis workspace.
6) Start from step 3.
The following provides a close look at the outcome of a run of the prototype. The information received by the diagnosis is:
1. The search domain is the HMG model.
2. The goal of required flow and pressure is achieved through the on/off control of the frequency, and while the pump is running.
3. The flow value of mbf3 is not zero.
The second piece of information indicates that the onoff-pumpon procedure must have performed the diagnosis. On the basis of the third piece of information, it is concluded that all function-types have been realised. Based on this conclusion, the following assignments take place as the results of the diagnosis:
1. The capability-type attribute of the blocks in the manifest GTST models (those that represent the capabilities belonging to the first category) of the function-types is set to manifest-1.
2. The capability-type attribute of the blocks in the GTST models of the partly-realising components (Jalashgar (1996)) of the three function-types msof1, mbf2 and mstf1 is also set to manifest-1.
3. The capability-type attribute of the blocks in the latent GTST models (those that represent the capabilities belonging to the fourth category) of the function-types is set to latent-2.
The reason for the third assignment is that there is a potential for a hidden failure somewhere in the system, although the overall goal is achieved. Figure 4 displays the attribute table of the pipeline between the membrane vessel and the consumer tap, and the attribute table of one block in the latent GTST model of mtf6, the completely-realising component of which is that specific pipeline. The figure indicates that there is a leak in the pipeline, and the information about it is detected and represented in the latent GTST model of the pipeline, despite the fact that the overall goal of the system is achieved. The leak is a hidden failure. In the explained procedure, the diagnosis is performed off-line, meaning that the GTST models are examined after calling-procedure times out. An on-line diagnosis can easily be implemented by placing the call to the wspc-diagnosis procedure in calling-procedure.

REFERENCES

Jalashgar (1997): Atoosa Jalashgar, "Identification of Hidden Failures Based on the Function-Oriented System Analysis Method HMG", accepted for publication in International Journal of Intelligent Systems (Spring 1997).
Jalashgar (1996): Atoosa Jalashgar, Mohammad Modarres, "Identification of Hidden Failures in Control Systems: A Functional Modeling Approach", presented at FLINS'96, The Second International FLINS Workshop on Intelligent Systems and Soft Computing for Nuclear Science and Industry, Mol, Belgium, 1996 (Proceedings: pp 205-213).
Jalashgar (1995): Atoosa Jalashgar, "Applications of functional modelling in control systems", presented at The Third International Workshop on Functional Modelling of Complex Technical Systems, Maryland, USA, 1995 (Proceedings: pp 65-80).
Lind (1994): Morten Lind, "Modelling goals and functions of complex industrial plants", Applied Artificial Intelligence, 8, 2, pp. 259-283, 1994.
Kim et al (1987): S. Kim and M. Modarres, "MOAS: A real-time operator advisory system", Nuclear Engineering and Design, 104, pp. 67-81, 1987.
Lind (1990): Morten Lind, "Representing Goals and Functions of Complex Systems, An Introduction to Multilevel Flow Modelling", Institute of Automation, Technical University of Denmark, 1990.
Gensym (1995): Gensym Intelligent Real-Time Systems, "G2 Reference Manual, Version 4.0", Gensym Corporation, Cambridge, MA, USA, 1995.
Larsson (1992): Jan Eric Larsson, "Knowledge-Based Methods for Control Systems", Ph.D. Thesis, Department of Automatic Control, Lund Institute of Technology, 1992.
[Figure 4 shows two attribute tables: the pipeline table, with inflow 0.295 and outflow 0.195, and a block in the latent GTST model, with capability-type latent-2, priority 1 and the goal "can deliver water".]
Figure 4: The contents of two attribute tables after the diagnosis has taken place.
A12" Software Reliability
SOFTWARE AND HUMAN RELIABILITY EVALUATION: AN EXPERIMENTAL ATTEMPT FOR A COMMON APPROACH

A. Pasquini 1, A. Rizzo 2 and V. Veneziano 3
1 Department of New Technology, ENEA, Via Anguillarese 301, 00060 Roma, ITALY
2 Department of Communication Science, University of Siena, Via del Giglio 14, 53100 Siena, ITALY
3 Centre for Software Reliability, City University, Northampton Square, London EC1V 0HB, United Kingdom
ABSTRACT

There are analogies between the following processes: software reliability growth due to testing and the related fault removal; improvement of the man-machine interface due to preliminary operative feedback; and improvement of operator performance due to learning. Only the first of these processes is currently modelled using mathematical methods, called software reliability growth models. This paper explains why these methods should be extended to model the reliability growth process of the whole complex, i.e. the human, interface and software system. To support the feasibility of the approach, the paper describes an experiment in which the reliability of a software control system, of its graphic man-machine interface and of the operators is evaluated during the phases of software testing, man-machine interface improvement and training. Preliminary results confirm the applicability of the models and encourage further investigation of the validity of the approach.
KEYWORDS

Cognitive science, hardware reliability, human reliability, reliability growth, reliability trend models, software reliability, system reliability.
INTRODUCTION

Digital computers play an increasingly important role in process control applications. They are replacing or supporting operators in functions requiring the solution of complex problems or prompt decisions based on a large amount of information. In some applications the resulting control systems can be defined safety critical, since their failure could produce severe consequences in terms of human life, environmental impact or economic losses. Examples are process control systems in chemical or nuclear power plants, in space vehicles, transportation and medical life support devices.

This work was partially supported by the European Union - DGXII - Program Human Capital and Mobility, via the "OLOS" research network (Contract CHR-X-CT94-0577).
Safety-critical systems require an assessment activity to verify that they are able to perform their functions in specified use environments. This activity would benefit from evaluation methodologies that consider these systems as a whole and not as the simple sum of their parts. Indeed, analysis of accidents involving such systems has shown that they are rarely due to the simple failure of one of their components. Accidents are the outcome of a composite causal scenario where human, software and hardware failures combine in a complex pattern. Well-known examples include: space applications, where the Phobos I flight control system and the ground control caused the failure of the space mission; medicine, where a combination of an architectural flaw, a software fault and operator misbehaviour in the Therac 25 radiation therapy machine caused the over-radiation and death of some cancer patients; and nuclear power, where a failure of the Crystal River process control system and the operator caused a radioactive water release. These examples are all drawn from large-scale, low-volume systems. The same kind of problem will also surface in increasingly volume-produced items that incorporate programmable components interacting with operators. An obvious area is the automotive industry, where there are strong pressures to decrease costs and increase functionality through the use of programmable elements. In contrast, dependability analysis and evaluation of safety-critical systems are based on techniques and methodologies that treat human and computer separately. Most integration efforts are limited to hardware and software components, with the questionable assumption that their evaluation can be performed independently and then combined, for example using traditional reliability graphs. Therefore the assessors of these systems have the difficult task of integrating the results of completely different and incompatible methodologies at different stages of advancement.
STATE OF THE ART IN QUANTITATIVE RELIABILITY EVALUATION

Software reliability growth models attempt to predict the reliability of software on the basis of its failure and fault removal history. This history is defined as the realisation of a sequence of random variables T_1, T_2, ..., T_n, where T_i denotes the time spent in testing the program, after the fault causing the (i-1)th failure has been removed, until the ith failure occurs. The approach followed by reliability growth models is a black-box approach. No attention is given to the single actions causing the reliability growth or to their interactions. The focus is on their effect, that is, on the reliability growth process in its entirety. Several models have been proposed to estimate reliability in terms of Mean Time To Failure or number of residual faults. Malaiya and Srimani (1992) and Xie (1993) contain detailed surveys of most of these models. Brocklehurst and Littlewood (1992), Brocklehurst et al (1992) and Iannino et al (1984) contain proposals to decide the most appropriate model for each application, to combine the information they provide, or to compare them. These models may provide a first, rough reliability estimation and support project management. In other words, they represent a modest but well-understood prediction tool for decision-makers.

In the field of cognitive science, the approach followed when analysing the reliability of humans in control and supervision is rather different. Cognitive scientists refuse to adopt a black-box approach to model human behaviour. The focus of current cognitive engineering is on optimising the role of the individual in human-machine systems, by understanding how people acquire information, represent it internally and use it to guide their behaviour. Little attention has been paid to the quantitative evaluation of human reliability. Cognitive scientists have tried to understand the meaning and the sequence of human actions when performing control functions, see for example Rasmussen (1986). They have developed cognitive models in which the single information-processing activities and external actions are considered, as in Sheridan (1988). These models are far too complex for a quantification of their elements and of their interactions. They are used mainly for qualitative considerations, with the aim of improving training, equipment design and procedures. Quantitative considerations have concerned essentially timing aspects of human perceptual-motor learning. Human perceptual-motor performance improves with practice, with a relationship that is approximately proportional to a power of the amount of practice, as in Card et al. (1986). This relation, called the power law of practice, applies to all skilled behaviour, both cognitive and sensory-motor, see for example Newell and Rosenbloom (1981). But little use of it has been made to describe changes in the
quality of performance. Most of the work in the area of man-machine interfaces has aimed at providing conditions to optimise human performance and to reduce the probability of failures due to the interface. There is rarely any attempt to predict the nature and likelihood of specific human errors. Guidelines and checklists have been produced to improve the design of interfaces in new systems or to evaluate possible deficiencies in existing ones, see for example Smith and Mosler (1987) and Rizzo et al. (1996). Quantification concerns mainly aspects of the usability of interfaces and aims at comparing different implementation solutions during design and prototyping.

System reliability studies are based mainly on the use of formal techniques such as Fault Tree Analysis. Use of this technique, and the need to quantify the probability that human actions are successfully carried out, raised the need for human error probability estimation. A methodology to provide such an estimation is described in Bell and Swain (1985); it encompasses task analysis and human error rate prediction. In Fault Tree Analysis, and in similar formal techniques for system reliability evaluation, there is a very mechanistic view of humans: they are modelled as hardware components that provide a function when required. The black-box approach is in fact applied to humans. A detailed critical analysis of this approach and of its limits is given in Rasmussen et al. (1987).

This paper considers the possibility of using a different approach in the reliability analysis and prediction of control systems. The reliability growth process of the system is modelled as a whole, considering the human, interface and software system together. This is done by analysing the failure history of the whole system during the phases of software testing and fault removal, operator training, and interface evaluation and improvement.
THE RELIABILITY GROWTH PROCESS AND ITS MODELLING

There are analogies between the following processes: software reliability growth due to testing and the related fault removal; improvement of the man-machine interface due to preliminary operative feedback; and improvement of operator performance due to learning. These processes are shown in Table 1, together with a list of the main techniques used to stress and improve the reliability of the human, interface and software components. The table also shows a list of the main events leading to system evolution and to the consequent reliability growth. All the processes lead to component reliability growth, and these growths are likely to have similar characteristics. For example, all of them have well-known limiting factors, such as the limits inherent in the use of an operative profile. In some cases the events leading to reliability growth may not have the desired effect, but even in these cases there is some similarity in the behaviour of the different system components. For example, human learning does not give a complete guarantee that the human who has learned the reason for a particular failure will not fail in those circumstances again; there is only a reduction in the probability of this happening. For software, something producing a similar effect on reliability may happen: there is a certain probability that new faults are introduced during the fault removal process. Even if a fix ensures that the same input conditions will not cause the same failure to occur once again, it may happen that the software reliability does not increase (because of new faults introduced) or increases less than expected. But, in general, we can assume that the reliability of the whole system increases because of the combined effect of these processes.

Similarities in reliability growths suggest the possibility of using reliability trend models to model and estimate the reliability of the whole system. In particular, software reliability growth models seem to be adequate for this purpose. Recent works such as Littlewood and Strigini (1993) and Brocklehurst and Littlewood (1992) have shown that the evaluation of model performance (statistical evaluation of the fit between estimated and real data) and recalibration of the model are sometimes more important than the characteristics of the model itself. Due to the complexity of control systems using human and computer components, a new model risks having parameters whose physical meaning is ambiguous or not
adequately considered. For this reason, the correct use of existing models seems more advisable than the development of new, specific ones.
TABLE 1
PROCESSES LEADING TO RELIABILITY GROWTH

Software:
- Technique used to stress the component: testing according to a specified operational profile
- Event leading to system modification: system, sub-system or module failure
- Event leading to reliability growth: fault detection and removal

Interface:
- Technique used to stress the component: evaluation of the interface using guidelines, simulated operative usage and preliminary operative feedback
- Event leading to system modification: system failure, interface evaluation results, operative usage feedback
- Event leading to reliability growth: interface modification and improvement

Operator:
- Technique used to stress the component: training using the estimated operative usage of the system
- Event leading to system modification: system failure, system abnormal/normal working conditions
- Event leading to reliability growth: learning from experience (increasing skill, building rules, increasing knowledge)
The potential positive results of modelling system reliability growth are quite evident. Quantification of the reliability growth can support: the identification of stopping criteria for operator training; the identification of stopping criteria for interface modification and improvement; the comparison of different possible interfaces on the basis of the operator reliability growth; and the comparison of the effectiveness of different testing, evaluation and training strategies.
PRELIMINARY RESULTS FROM AN EXPERIMENT

Modern cognitive psychology, see for example McClelland and Rumelhart (1986), considers information processing as a process distributed between two poles, the human brain and the environment. Thus the external representation of information has a relevant influence on its processing. Zhang and Norman (1994) have shown how explicit representation of implied information improves problem-solving performance. Hancock and Meshkati (1988) defined human mental workload as the properly cognitive side of the complex problem of information processing in human factors psychology. This perspective is used in a systemic approach to the experimental analysis of human-machine control system behaviour. The experiment is based on the use of: a simulated dynamic process to be controlled; a software control system, designed for the purpose, with its graphic man-machine interface; and operators co-operating with the control system to keep the dynamics under control. We selected a process able to stress the operator, demanding a cognitive effort analogous to that required in real supervision and control tasks. Controlling the process requires both strategic and sensory-motor skills. The simulated process is a modified version of the one presented in Van Gelder (1980). The operator is required to provide a specific mixture of three different fluids into a vessel, by means of a complex of tubes and pumps with a non-linear response. The composition of the mixture changes with time, and the operator also has to perform some additional, less important, control activities at the same time (control of the liquid level in additional tanks, etc.). A
software system, developed for the purpose, simulates the dynamic process to be controlled, provides the interface between the operator and the software control system, and records all the failures that occur during the phases of software control system testing, man-machine interface improvement and training of the operators. The experimental system is shown in Figure 1.
[Figure 1 block diagram: the control system, its interface, the failure recording system and the simulated system to be controlled.]
Figure 1" Experimental system A sample of 20 subjects, male and female students and young researchers with reasonable computer confidence was selected. During training, subject were required to reach, an adequate, pre-defined level of control of the process. Preliminary results of this experiment show a significant increase in reliability during the initial phases of training in all the subject, then the level of reliability stabilise very slowly at a value changing from subject to subject. Figure 2 shows the data concerning a specific subject with a typical behaviour. The number of cycles he was able to keep the process under control is compared with the number of training sessions.
[Figure 2 plot, "Learning (based on error) trend": number of cycles under control (0 to 7000) against the number of training sessions.]
Figure 2: Learning trend of a subject with typical behaviour

Three software reliability growth models, Jelinski-Moranda, Littlewood-Verrall and NHPP, were applied to these data. Results are shown in Figure 3, where the raw data are compared with those provided by the models.
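To make the fitting step concrete, the following is a minimal C sketch (not the authors' implementation) of the standard maximum-likelihood fit of the Jelinski-Moranda model to a sequence of inter-failure times. The data in it are hypothetical, and the closed-form equation for phi together with the one-dimensional search for N follow the usual textbook formulation of this model.

#include <stdio.h>

/* Jelinski-Moranda: the hazard before the i-th failure is phi*(N - i + 1).
   The standard MLE gives phi = n / sum((N-i+1)*t_i), with N solving
   sum_i 1/(N-i+1) = n * sum(t_i) / sum((N-i+1)*t_i).                     */
static double g(double N, const double *t, int n)
{
    double st = 0.0, swt = 0.0, sinv = 0.0;
    for (int i = 1; i <= n; i++) {
        st   += t[i - 1];
        swt  += (N - i + 1) * t[i - 1];
        sinv += 1.0 / (N - i + 1);
    }
    return sinv - n * st / swt;   /* positive below the root, negative above */
}

int main(void)
{
    /* hypothetical inter-failure times showing moderate reliability growth */
    double t[] = { 5, 6, 5, 7, 6, 8, 7, 9, 8, 10 };
    int n = sizeof t / sizeof t[0];

    /* bisection for N: g changes sign between n and a large upper bound
       when the data show growth that is neither absent nor extreme       */
    double lo = (double)n, hi = 1000.0;
    for (int it = 0; it < 200; it++) {
        double mid = 0.5 * (lo + hi);
        if (g(mid, t, n) > 0.0) lo = mid; else hi = mid;
    }
    double N = 0.5 * (lo + hi), swt = 0.0;
    for (int i = 1; i <= n; i++) swt += (N - i + 1) * t[i - 1];
    double phi = n / swt;

    printf("estimated initial faults N = %.2f, phi = %.5f\n", N, phi);
    printf("predicted MTTF for the next interval: %.2f\n",
           1.0 / (phi * (N - n)));
    return 0;
}

Note that a finite, non-degenerate estimate of N only exists when the data show some, but not extreme, reliability growth; this caveat applies equally when such models are fitted to operator learning data.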
[Figure 3 plot: raw data compared with the Jelinski-Moranda, Littlewood-Verrall and NHPP (TBE) model predictions; the axes show cumulative time between failures and the failure probability for the next cycle.]
Figure 3" Raw data compared with software reliability growth models. Goodness of fit between data and software reliability growth models was tested by means of the Kolmogorov Smirnov Distance. Results are shown in figure 4.
[Figure 4 plot of the failure probability (CDF), with legend Jelinski-Moranda, Littlewood-Verrall and NHPP (TBE); the reported statistics are: KS statistic 0.384101 (biased), KS statistic 0.307273 (biased), KS statistic 0.326200 (unbiased).]
Figure 4: Goodness of fit, tested on failure probability

From these preliminary data, software reliability growth models seem to be adequate for estimating the reliability of the operators in this specific context and under these artificial experimental conditions. Nevertheless, some problems arise from the dissimilarities between humans and software and the way in which they improve their reliability. Results of training and the quality of human responses are strongly affected
Indeed, even though software reliability growth models adopt a black-box approach in describing the system to which they are applied, this does not imply extending the black box to model human behaviour. Instead, software reliability growth models can be considered a quantitative assessment complementary to the qualitative assessment performed within the cognitive science approach. The tools designed within cognitive engineering, see for example Dumas and Redish (1993), are more adequate for analysing human behaviour and for identifying the possible causes of misbehaviour. However, the cognitive approach lacks formal tools to quantify reliability, and also to assess and quantify whether the solutions adopted for specific problems have improved reliability. Even if reliability varies greatly among humans, and it is therefore difficult to generalise information obtained from a specific subject, useful information can be obtained from trend analysis (this is quite often the case even in software reliability modelling). The added value of these measurements, in design and operation phases, can be understood when they are applied repeatedly to the human-machine system under assessment. In such cases, the trend in the reliability assessment can point out whether the design solutions or management decisions are moving in the desired direction. Moreover, they can allow the development of a common language between the so far separate fields of human and software dependability. It is along this direction that we want to extend our approach: applying the reliability growth models iteratively after each redesign or improvement of the system.
CONCLUSIONS

This paper analysed the possibility of using software reliability growth models for control systems based on humans and computers. Software reliability growth models represent a modest but well-understood prediction tool for decision-makers. Extending their applicability to the whole control system would offer quantitative support to several decision-making activities in the phases of operator training and interface design and improvement. Preliminary results obtained under experimental conditions confirm the applicability of the models and encourage further investigation of the validity of the approach under real conditions.
REFERENCES
Bell, B. J. and Swain, A. D. (1985). Overview of a procedure for human reliability analysis. Hazard Prevention 1, 22-25.
Brocklehurst, S. and Littlewood, B. (1992). New Ways to Get Accurate Reliability Measures. IEEE Software 9:4, 34-42.
Brocklehurst, S., Lu, M. and Littlewood, B. (1992). Combination of predictions obtained from different software reliability growth models. Proceedings of the 10th Annual Software Reliability Symposium, June, Denver, Colorado, 24-33.
Card, S. K., Moran, T. P. and Newell, A. (1986). The Model Human Processor: An Engineering Model of Human Performance, Chapter 45. In K. R. Boff, L. Kaufman, J. P. Thomas (Eds.), Handbook of Perception and Human Performance. John Wiley and Sons, New York.
Dumas, J. S. and Redish, J. C. (1993). A practical guide to usability testing. Ablex, Norwood, NJ.
Hancock, P. A. and Meshkati, N. (Eds.) (1988). Human mental workload. North-Holland, Amsterdam.
Iannino, A., Musa, J. D., Okumoto, K. and Littlewood, B. (1984). Criteria for Software Reliability Model Comparisons. IEEE Transactions on Software Engineering 10:6, 687-691.
Lindgaard, G. (1994). Usability Testing and System Evaluation. Chapman & Hall, London.
Littlewood, B. and Strigini, L. (1993). Validation of ultra-high dependability for software-based systems. Communications of the ACM 36:11, 69-88.
Malaiya, Y. K. and Srimani, P. K. (1991). Software Reliability Models: Theoretical Developments, Evaluation, and Application. IEEE CS Press.
McClelland, J. L. and Rumelhart, D. E. (Eds.) (1986). Parallel distributed processing: Explorations in the microstructure of cognition. 2 vols., MIT Press, Cambridge, Mass.
Newell, A. and Rosenbloom, P. S. (1981). Mechanisms of skill acquisition and the law of practice. In J. R. Anderson (Ed.), Cognitive skills and their acquisition. Erlbaum, Hillsdale, NJ.
Rasmussen, J., Duncan, K. and Leplat, J. (1987). New Technology and Human Error. John Wiley and Sons, New York.
Rasmussen, J. (1986). Information Processing and Human-Machine Interaction. Elsevier Science Publishers, North Holland.
Rizzo, A., Parlangeli, O., Marchigiani, E. and Bagnara, S. (1996). Guidelines for managing human error. SIGCHI Bulletin 6, 125-131.
Sheridan, T. B. (1988). Task Allocation and Supervisory Control. In M. Helander (Ed.), Handbook of Human-Computer Interaction. Elsevier Science Publishers, North Holland.
Smith and Mosler (1987). Too much too soon: Information overload. IEEE Spectrum, June, 51-55.
Van Gelder, K. (1980). Human Error with a Blending-Process Simulator. IEEE Transactions on Reliability R-29:3, 258-264.
Xie, M. (1993). Software Reliability Models - a Selected Annotated Bibliography. Software Testing, Verification and Reliability 3, 3-28.
Zhang, J. and Norman, D. A. (1994). Representations in Distributed Cognitive Tasks. Cognitive Science 18:1, 87-122.
AN EXPONENTIAL APPROXIMATION TO THE EXPONENTIAL-MULTINOMIAL FUNCTION

B. Sáiz de Bustamante
Nuclear Engineering Department, Madrid Technical University, José Abascal 2, E-28006 Madrid
ABSTRACT
This paper presents a reliability function (EMRF) obtained after a detailed study of the software failure process during testing periods. The function allows reliability evaluation of systems with finite or countably infinite states and different repair modes. In order to gain applicability the function is simplified, while maintaining its basic assumptions, and a second function (EERF) is obtained. In order to study the conditions for interchangeability, the mathematical properties of both functions are analysed and two error functions are defined. It is verified that the approximation is valid for usual system operation time ranges.
KEYWORDS

Expanded-Exponential Function, Exponential-Multinomial Function, Poisson Process, Software Failure, Software Fault, Software Reliability.
INTRODUCTION
At present, a common interest in all production processes and enterprise activities is to achieve a high quality level. The usual practice is to follow a methodology designed to guarantee such a quality level and, once the process is finished, to assess all quality attributes of the product obtained in relation to the previously fixed quality aims. Software development is a good example of this common practice. Software quality is defined by means of a set of attributes whose measurement is not an easy task. These attributes are related either to system operation or to system evolution, as shown in Table 1, where a brief definition of each attribute is given [1]. Of the eight attributes presented, reliability may be chosen as the most studied one, due to its importance at operation time. It is defined as the probability that no failure occurs during a given period of time t', counted from the last (kth) failure occurrence [3]:
R(t'|t_k) = Pr[T'_k > t'] = 1 - F_{T'_k}(t')    (1)

where T'_k is a random variable that denotes the time between the kth and (k+1)th failures, and t_k denotes the time to the kth failure occurrence. F_{T'_k}(t') is the distribution function assumed for the time between failures, whose
parameters are dependent on the product properties and on the development process. Therefore reliability depends on the assumptions made to define the distribution function for the time between failures.
TABLE 1
SOFTWARE QUALITY ATTRIBUTES

System operation attributes:
RELIABILITY - works according to specification
EFFICIENCY - works efficiently in terms of time and space
ERGONOMICS - easy to use
INTEGRITY - safe operation

System evolution attributes:
PORTABILITY - physical support independence
REUSABILITY - its components may be reused
INTEROPERABILITY - easy interaction with other systems
MAINTAINABILITY - easy fault correction and adaptation to changes in specifications
The distribution function corresponding to the reliability function presented in this paper is obtained by assuming that the software failure process during testing periods is a Non-Homogeneous Poisson Process (NHPP). With regard to the correction process, correction time is considered negligible, so the failure process and the correction process are treated as simultaneous; moreover, each correction performed may not have a successful result: the fault that causes a failure may not be detected, or a new fault may even be introduced when trying to fix the detected one. These assumptions yield an exponential-multinomial reliability function (EMRF). EMRF is a general reliability model from which other models may be obtained by setting some of its parameters. Nevertheless, it presents a mathematical complexity that makes its general application difficult and cumbersome, so an approximation is developed. It is found that EMRF may be approximated by an exponential function, called in this paper the Expanded Exponential Reliability Function (EERF).
THE EXPONENTIAL-MULTINOMIAL RELIABILITY FUNCTION
This function is obtained considering the following assumptions about software development and operation:
1. Software failures are independent of one another, and they occur according to a failure intensity λ(t) which is time dependent.
2. The probability of two or more errors occurring simultaneously is negligible.
3. The failure intensity λ(t) is proportional to the number of faults present at time t.
4. The time taken to correct a fault is considered negligible.
5. The fault that causes a failure is corrected according to three correction probabilities:
• p - probability of perfectly correcting the fault (perfect correction);
• q - probability of leaving the system with the same number of faults as before the failure, due to the impossibility of correcting the fault that caused the failure (imperfect correction);
• r - probability of introducing a new fault when trying to correct the fault that caused the last failure (additive correction).
These probabilities are considered to be constant during the testing period under study.
The last assumption is specific to this function, while the rest have been tried and validated in previous software reliability models. The third assumption was disputed when first presented by Jelinski and Moranda in 1972, because it gives the same weight to all faults, and it may be argued that not all faults contribute equally to program unreliability, as stated by Littlewood and Verrall in 1973. Nevertheless, it may be considered that over time, and mainly due to operational profile diversity, the effect of fault significance will tend to average out. Several models have been developed on the basis of either assumption [5].
Function Development

The first three assumptions allow the use of Poisson definitions to obtain the probability distribution for the time between failures. Considering that the corrections performed may have three possible outcomes, according to three constant probabilities, the correction process follows a multinomial distribution. The failure intensity λ(t) is assumed to be proportional to the number of faults present at time t, X_k(t), which can be written as:

X_k(t) = n0 - [k - I_k - 2J_k] = n0 - k + I_k + 2J_k    (2)

where n0 indicates the number of faults at the beginning of the testing period, k the number of failures that have occurred up to time t, and I_k and J_k are random variables that represent the number of imperfect corrections and the number of additive corrections performed up to the kth failure. Therefore the failure intensity is given by:

λ(t) = λ0 [n0 - k + I_k + 2J_k]    (3)

According to the definition of the distribution function and to the total probability theorem, and taking into account that at time t_k there are X_k faults, k failures having occurred and i_k imperfect corrections and j_k additive corrections having been performed, the distribution function for the time between the kth and the (k+1)th failures is given by:

F_{T'_k}(t') = Σ_{i_k=0}^{k} Σ_{j_k=0}^{k} Pr[T'_k ≤ t' | X_k = n0 - k + i_k + 2j_k] Pr[X_k = n0 - k + i_k + 2j_k]    (4)

As the correction process follows a multinomial distribution:

Pr[X_k = n0 - k + i_k + 2j_k] = k! / (i_k! j_k! (k - i_k - j_k)!) q^{i_k} r^{j_k} p^{k - i_k - j_k}    (5)

and, due to the failure rate assumed for the Poisson process considered:

Pr[T'_k ≤ t' | X_k = n0 - k + i_k + 2j_k] = 1 - exp[-(n0 - k + i_k + 2j_k) λ0 t']    (6)

the distribution function for the time between the kth and the (k+1)th failures, given by the product of the two probability distributions above, is:

F_{T'_k}(t') = Σ_{i_k} Σ_{j_k} {1 - exp[-(n0 - k + i_k + 2j_k) λ0 t']} k! / (i_k! j_k! (k - i_k - j_k)!) q^{i_k} r^{j_k} p^{k - i_k - j_k}
            = (p + q + r)^k - [p + q exp(-λ0 t') + r exp(-2λ0 t')]^k exp[-(n0 - k) λ0 t']
            = 1 - [p + q exp(-λ0 t') + r exp(-2λ0 t')]^k exp[-(n0 - k) λ0 t']    (7)

since p + q + r = 1. Consequently, software reliability after the kth failure is:
R(t'|t_k) = [p + q exp(-λ0 t') + r exp(-2λ0 t')]^k exp[-(n0 - k) λ0 t']    (8)

which is the exponential-multinomial reliability function, a generalisation of the exponential-binomial function developed by R. E. Barlow and F. Proschan in 1975 [6]. If the parameter r is set to zero in Eqn. 8, the exponential-binomial function is obtained. The exponential-multinomial function with parameters λ0 = 0.088, n0 = 39, k = 30, p = 0.8, q = 0.4 and r = 0.06 is illustrated in Fig. 1, curve R1(t). The R1 curve can be considered the generic shape of the exponential-multinomial function.
A Global Reliability Function

The exponential-multinomial function may be considered a global reliability function, since other reliability functions, based on different assumptions, can be derived by setting some of its parameters. Jelinski and Moranda [4] published in 1972 the first software reliability model, considering the same assumptions as in the present case for the software failure process, but the correction process was not studied. It was assumed that the fault that causes a failure is always detected and corrected, which is equivalent to considering the following values for the correction probabilities: p = 1, q = 0 and r = 0. If those values are substituted into the exponential-multinomial function, the function given by Jelinski and Moranda to quantify software reliability is obtained, as given below:
R(t'|t_k) = exp[-λ0 (n0 - k) t']    (9)

In 1979 the concept of imperfect correction was introduced by K. Okumoto in his doctoral dissertation [6]. His model considered the same assumptions about the software failure process as Jelinski and Moranda, but introduced the concept of imperfect correction. The model includes two correction probabilities, p and q, which verify p + q = 1. The software reliability function is then the exponential-binomial function:
R(t'|t_k) = [p + q exp(-λ0 t')]^k exp[-(n0 - k) λ0 t']    (10)
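The reductions above can be checked numerically. The following C sketch (an illustration, not part of the paper) evaluates Eqn. 8 directly and verifies that setting p = 1, q = r = 0 reproduces Eqn. 9, while r = 0 gives the exponential-binomial form of Eqn. 10. The general-case parameter set is taken from the worked example later in the paper (it satisfies p + q + r = 1); the exponential-binomial pair p = 0.8, q = 0.2 is a hypothetical choice.

#include <stdio.h>
#include <math.h>

/* Exponential-multinomial reliability function, Eqn. 8:
   R(t'|t_k) = [p + q*exp(-l0*t) + r*exp(-2*l0*t)]^k * exp(-(n0-k)*l0*t) */
double emrf(double t, double l0, int n0, int k, double p, double q, double r)
{
    double base = p + q * exp(-l0 * t) + r * exp(-2.0 * l0 * t);
    return pow(base, k) * exp(-(n0 - k) * l0 * t);
}

int main(void)
{
    double l0 = 0.088, t = 1.0;
    int n0 = 39, k = 30;

    /* general case (a parameter set with p + q + r = 1) */
    printf("EMRF: %f\n", emrf(t, l0, n0, k, 0.60, 0.35, 0.05));
    /* p = 1, q = r = 0 recovers Jelinski-Moranda, Eqn. 9 */
    printf("J-M : %f vs %f\n",
           emrf(t, l0, n0, k, 1.0, 0.0, 0.0), exp(-l0 * (n0 - k) * t));
    /* r = 0 recovers the exponential-binomial function, Eqn. 10 */
    printf("E-B : %f\n", emrf(t, l0, n0, k, 0.8, 0.2, 0.0));
    return 0;
}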
Mathematical Properties

The application of the exponential-multinomial reliability function to the analysis of software reliability requires prior estimation of its parameters, which are:
• n0, the number of faults contained in the software at the beginning of the testing period;
• p, the probability of performing a perfect correction;
• r, the probability of performing an additive correction;
• λ0, the failure intensity proportionality constant.
Nevertheless, the exponential-multinomial function can be considered a reliability function only when n0 ≥ k, that is, when the number of failures that have occurred is less than the number of faults initially present in the system. For values of k greater than n0 the exponential-multinomial function does not fulfil the condition contained in the definition of reliability, lim R(t) = 0 as t tends to infinity [7]. The exponential-multinomial function is characterised by the following properties:
• it always presents a decreasing failure rate;
• it is an increasing function of the number of failures k.
These properties assure reliability growth during the testing period [7].
THE EXPANDED EXPONENTIAL RELIABILITY FUNCTION

The exponential-multinomial reliability function is a complex mathematical expression, and its application to the analysis of software testing evolution may therefore not be an easy task. This complexity is due to the detailed assumptions made about the software development process. In order to facilitate the use of EMRF, an exponential approximation, called the expanded exponential reliability function (EERF), is presented. First the software reliability model developed by Jelinski and Moranda is considered. In this case the failure intensity is given by:

λ(t) = λ0 (n0 - k)    (11)

where n0 - k is the number of faults present in the system after the kth failure, because all corrections performed are assumed to have been successful. The probability density function for the time between failures is:

f_{T'_k}(t') = λ0 (n0 - k) exp[-λ0 (n0 - k) t']    (12)
If, in Eqn. 11, the number of faults present in the system after the kth failure when all corrections have been successful, n0 - k, is substituted by the expected number of faults according to the exponential-multinomial reliability function assumptions (Eqn. 2):

E[X_k] = E[n0 - k + I_k + 2J_k] = n0 - k(p - r)    (13)

the probability density function for the time between failures and the reliability function can be written as:

f_{T'_k}(t') = λ0 (n0 - k(p - r)) exp[-λ0 (n0 - k(p - r)) t']    (14)

R(t'|t_k) = exp[-λ0 (n0 - k(p - r)) t']    (15)

The exponential reliability function in Eqn. 15 is also obtained if the exponential-multinomial reliability function (Eqn. 8) is expanded in a Taylor series and the second-order terms are considered negligible. This function may be considered an analytic approximation to the exponential-multinomial reliability function. It is called the expanded exponential reliability function because it widens the traditional reliability function R(t) = exp(-λ0 t), its exponent being a linear function of the correction probabilities.
The expanded exponential function is also illustrated in Fig. 1, curve R2(t), with the same parameters as the exponential-multinomial function: λ0 = 0.088, n0 = 39, k = 30, p = 0.8, q = 0.4 and r = 0.06.
[Figure 1 plot: curves R1(k,t) and R2(k,t).]
Fig. 1: Reliability functions R1 & R2
Mathematical Properties

The application of EERF to the analysis of software reliability requires prior estimation of its parameters, which are:
• n0, the number of faults contained in the software at the beginning of the testing period;
• p, the probability of performing a perfect correction;
• r, the probability of performing an additive correction;
• λ0, the failure intensity proportionality constant;
which are the same as in the case of EMRF. This function is also limited in its application according to the values of its parameters: for k > n0/(p - r) the function cannot be considered a reliability function. The application range is rather wider than in the case of EMRF, since p - r is always less than 1 [7] (for example, with n0 = 39 and p - r = 0.74, EERF remains a valid reliability function up to k ≈ 52 failures, against 39 for EMRF). The EERF failure rate is constant between failures and decreases as the number of failures k increases. There is no relation between the failure rates of the two functions [7]. In the case of EERF the failure rate coincides with the failure intensity, which is a consequence of the failure rate being constant between failures [7].
ERROR ANALYSIS

Fig. 2 depicts the difference between the exponential-multinomial reliability function R1(t) and the expanded exponential reliability function R2(t), for the set of parameters given above (λ0 = 0.088, n0 = 39, k = 30, p = 0.8, q = 0.4 and r = 0.06), as the software trial goes on. The Fig. 2 curve presents a maximum at time a = 1.415, much longer than the process mean time between failures, which is 0.699.
[Figure 2 plot: the absolute error fi(k,t) against t, from 0 to 10; the vertical scale runs from 0 to 0.01.]
Fig. 2: Absolute error R1 - R2
This curve can be considered general [7], therefore a change of parameters will produce a similar curve with a maximum. The absolute error of approximating EMRF by EERF can be defined by the Chebyshev norm:
||R1(t) - R2(t)|| = max_t |R1(t) - R2(t)| = d    (16)

which gives an upper bound to the error for each particular case. Therefore:

R2(t) - d ≤ R1(t) ≤ R2(t) + d    (17)

If e denotes the relative error corresponding to d, the maximum of R1(t) - R2(t), the following expression is obtained:

d ≈ e R2(t)    (18)
which allows the error of approximating EMRF by EERF to be delimited, once e is known. For the Fig. 2 curve maximum, corresponding to time a = 1.415, an absolute error of d = 0.004279 and a relative error of e = 0.0088 are obtained. The absolute error d and the relative error e at the maximum of each curve R1(t) - R2(t), determined by its abscissa a, are monotonically increasing functions of k, the number of software failures, as shown in Figures 3, 4 and 5, where all parameters are the ones given in the first paragraph of this section, except for k, which takes values between 1 and 39 [7]. Similarly, assuming a constant k, the absolute and relative errors, and the abscissa of the Chebyshev norm, are increasing functions of p, with r constant, and decreasing functions of r, with p constant [7].
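Since the Chebyshev norm has no convenient closed form here, it can be recovered numerically. The C sketch below (an illustration, not the author's procedure) scans R1 - R2 on a fine grid of t; the parameter set is the one used in the worked example at the end of this section (it satisfies p + q + r = 1), and the grid range and step are arbitrary choices.

#include <stdio.h>
#include <math.h>

/* Exponential-multinomial reliability function, Eqn. 8 */
double emrf(double t, double l0, int n0, int k, double p, double q, double r)
{
    return pow(p + q * exp(-l0 * t) + r * exp(-2.0 * l0 * t), k)
         * exp(-(n0 - k) * l0 * t);
}

/* Expanded exponential reliability function, Eqn. 15 */
double eerf(double t, double l0, int n0, int k, double p, double r)
{
    return exp(-l0 * (n0 - k * (p - r)) * t);
}

int main(void)
{
    /* parameter set of the worked example (p + q + r = 1) */
    double l0 = 0.088, p = 0.60, q = 0.35, r = 0.05;
    int n0 = 39, k = 30;

    double d = 0.0, a = 0.0;          /* Chebyshev norm and its abscissa */
    for (double t = 0.0; t <= 20.0; t += 0.001) {
        double diff = fabs(emrf(t, l0, n0, k, p, q, r)
                         - eerf(t, l0, n0, k, p, r));
        if (diff > d) { d = diff; a = t; }
    }
    double e = d / eerf(a, l0, n0, k, p, r);   /* relative error at a */
    printf("a = %.3f  d = %.5f  e = %.4f\n", a, d, e);
    /* compare with R2 = 0.82037 at t = 0.1 in the worked example */
    printf("R2(0.1) = %.5f\n", eerf(0.1, l0, n0, k, p, r));
    return 0;
}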
[Figures 3, 4 and 5: plots of a, d and e against the number of failures k, from 0 to 40.]
Fig. 3: Times to max. errors (a). Fig. 4: Max. abs. errors (d). Fig. 5: Relative errors (e).
In order to facilitate estimation of the absolute errors d and relative errors e at the Chebyshev norm abscissa a, a set of contour plots has been prepared mapping d, e and a for each k value, keeping λ0 = 0.088 and n0 = 39. Figures 6, 7 and 8 present the a, d and e contour plots for k = 30, with p represented on the x-axis and r on the y-axis.
0.4-"
0.~ 0
.
0.08 0 1
0.9
0. 40. 50
0.3 0.3~
uO"
)2 0. 25 O. 3 I
0. 5 "
O. 03 0.~
0.~
o. 90 1
O.
6
0
I
0.2
I
0.4
0.6
0.8
I
6 10
0 40 050
01.2
I 0.4
0!6
M
M Fig.6. Times to max. errors (a)
\o
0.1"
6
0!8
0
0.2"
~
6
I 0.2
0!4
I 0.6
0.8
M Fig.7. Max absolute errors (d)
Fig. 8. Relative errors (e)
For example, supposing the following set of parameters, λ0 = 0.088, n0 = 39, k = 30, p = 0.6, q = 0.35 and r = 0.05, the following values may be obtained from Figures 6, 7 and 8: Chebyshev norm abscissa, a ≈ 1; absolute error at a, d ≈ 0.0056; relative error at a, e ≈ 0.04. The process reliability at t = a/10 ≈ 0.1 is, according to EERF, R2 = 0.82037, and the approximation error bound, from Eqn. 18, is d = e·R2 = 0.0334. Therefore, according to Eqn. 17, 0.786959 ≤ R1 ≤ 0.853781. Nevertheless, it must be emphasised that e is a limiting relative error.
CONCLUSIONS
The exponential-multinomial reliability function presented allows software reliability growth quantification during testing periods. It is an increasing function of the number of failures that have occurred, and presents a decreasing failure rate with operation time, which matches the objective of any test: to detect and fix all possible faults introduced during system development. EMRF is a global reliability function from which other reliability functions may be obtained by setting some of its parameters. The assumptions on which this function is based describe the software failure process accurately, but yield a complex function; it is therefore simplified, maintaining its basic assumptions, and the expanded exponential reliability function is obtained. This function is simpler, and its usage may therefore be more adequate. Comparing the two functions, it is observed that they yield the same values at the beginning of the testing period studied, and that EERF becomes more conservative as the number of failures increases, thereby suggesting longer testing periods. Besides, EERF presents a wider application range. The error analysis indicates that for values of time equal to or less than the time between failures the error committed is negligible, so EERF may substitute EMRF for normal operation times. The main contribution of this work to software reliability modelling theory is therefore a simple model that considers the possibility of imperfect fixes, either because the fault that causes a failure is not corrected or because a new problem is introduced when trying to correct a detected fault. The next step is to study the application of both functions presented to evaluate the reliability evolution of any system whose repair time may be considered negligible compared with its operation time.
REFERENCES
[1] Bittanti, S. (1987): Software Reliability Modelling and Identification. Springer-Verlag, New York, USA.
[2] Colombo, A. G., Sáiz de Bustamante, A. (1990). Systems Reliability Assessment. Kluwer Academic Publishers, Dordrecht, NL.
[3] IEEE Standards Board (1989): IEEE Standard Dictionary of Measures to Produce Reliable Software. The Institute of Electrical and Electronics Engineers, Inc., New York, USA.
[4] Jelinski, Z., Moranda, P. B. (1972): Software Reliability Research. In: Statistical Computer Performance Evaluation, W. Freiberger (Ed.). New York, USA.
[5] Musa, J. D., Iannino, A., Okumoto, K. (1987): Software Reliability: Measurement, Prediction, Application. McGraw-Hill, New York, USA.
[6] Okumoto, K. (1979): Stochastic Modelling for Reliability and Performance Measures of Software Systems with Applications. Dissertation, UMI Dissertation Information Service, Michigan, USA.
[7] Sáiz de Bustamante, B. (1996). El Error como Proceso de Markov No Homogéneo. Aplicación al Análisis de la Fiabilidad de los Sistemas Lógicos. Dissertation, UMI Dissertation Information Service, Michigan, USA.
[8] Sáiz de Bustamante, B. (1994): Software Reliability Modelling: A Poisson Type Model. Proceedings of the Fourth European Conference on Software Quality, Basel, Switzerland, October 17-20, 1994, pp. 312-319.
[9] Sáiz de Bustamante, B. (1994): Software Reliability Modelling: Synthesis and Outlook. Proceedings of the 9th International Conference on Reliability and Maintainability, La Baule, France, 30 May - 3 June, 1994, pp. 468-475.
ANALYSIS AND RECOMMENDATIONS FOR A RELIABLE PROGRAMMING OF SOFTWARE-BASED SAFETY SYSTEMS

Juan Nuñez Mc Leod, Jorge Nuñez Mc Leod and Selva S. Rivera
CEDIAC - Centro de Estudios de Ingeniería Asistida por Computadora, Engineering Faculty, Cuyo National University, CC 405, 5500, Mendoza, Argentina
ABSTRACT
The present paper summarizes the results of several studies performed for the development of high-reliability software on i486 microprocessors, towards its utilization in control and safety systems for nuclear power plants. The work is based on software programmed in the C language. Several recommendations oriented to high-reliability software are analyzed, relating the requirements on the high-level language to their influence at assembler level. Several metrics are implemented that allow the quantification of the results achieved. New metrics were developed and others were adapted, in order to obtain more efficient indexes for software description. Such metrics help to visualize the conformance of the software under development to the quality rules in use. A specific program developed to assist the reliability analyst in this quantification is also presented in the paper. It analyzes an executable program written in C, disassembling it and evaluating its internal structures.
KEYWORDS

Reliability, software, metrics, structures, C language, assembler language.
INTRODUCTION

In order to write a program, a high-level language is usually used, as it greatly simplifies the writing work for the programmer. However, such a high-level language is not directly interpreted by the computer, which makes it necessary to use a compiler. The compiler transforms the high-level program (source code) into a low-level program (object code), which is understandable by the computer. The result of this compilation
process is a program which is functionally equivalent, but with an internal structure that may be substantially different from that of the source code. A program written in a high-level language, consisting of a series of linear instructions without any jump or bifurcation, may end up with bifurcations in the equivalent low-level program. When developing programs in a high-level language, the first quality factors affected (linearity, simplicity, conciseness, etc.) are those most appreciated by reliability analysts and hardware engineers. Therefore, one of the first steps is to provide both the programmers and the reliability analysts with the knowledge and the tools to fix the development procedures and to evaluate whether the programs include these factors. It is important to point out that this refers to a large quantity of small programs (or routines), each of which has a big impact on the reliability of the control or safety system.
PURPOSE
The present work is oriented to the development of reliable software for use in control and safety systems. It is motivated by the control and safety systems of the Multi Purpose Reactor currently under construction in Egypt. These systems use an i486 platform programmed in C and in assembler. The purpose is to provide the following aids for both the programmer and the reliability analyst:
• Information in the form of recommendations that allow programs to be written in a high-level language (C). These programs, after compilation, should generate machine-level code that complies with the Quality Assurance Plan.
• Metrics to assist the programmer, in order to evaluate the best programming alternatives.
• Metrics to assist the reliability analyst, in order to identify routines which are poorly developed from the reliability point of view.
• A simple and adequate tool to provide the first quality control steps in a Software Quality Assurance program.
PROGRAMMING STRUCTURES ANALYSIS

One of the goals of this study is to generate a series of recommendations on programming rules at high level, in our case in the C language. These recommendations are oriented to obtaining machine code that is as linear as possible. The linearity of a code allows for a correct control procedure, in real time, during its execution. This is due to the fact that the programming lines must be executed sequentially and in a unique order. The linearity level acceptable for a certain code will depend on the code's purpose. For example, if the code's function is to control a certain parameter, which is sensed on a 2-out-of-3 measurement, the code may be perfectly linear. The same is not true if the code has to decide among a series of different actions, as a function of the information provided by other codes, as in the preceding example. Code linearity is directly related to code testability (Karl, 1995): the more linear the code, the easier it is to test and the less time is required for this activity. Another objective is to obtain concise code (IEC-880, 1986). This is difficult to do when working in
a high-level language. The optimum would be to develop the code directly in assembler language; the compromise solution is to adopt assembler language where it becomes necessary. In the next sections, some cases are developed, for a correct interpretation and documentation of the recommendations that follow.
Data conversion
The data conversion job is performed with the mathematical coprocessor in two steps. The first step is to load the numerical value, with the load instruction, in its original format, either floating point or integer, in their diverse sizes; the value is then stored according to the format required. When working in the ascending direction, that is, when converting integer to floating point, this scheme is used directly; but when converting in the reverse direction, a general conversion module that converts to 64-bit integers is used (Brey, 1991). Independently of this module, the validity of the assignment must be checked. The call to this conversion module may be avoided by working directly on the source code, as can be observed in Table 1. For simplicity, the validation, similar for both examples, has been omitted. In the lower part of the table, a comparison of linearity and conciseness can be observed, showing the clear advantages of example 2.
TABLE 1
NUMERICAL DATA CONVERSION

Example 1:
{
  double xd;
  long xl;
  short xs;
  xd = 10;
  xl = (long) xd;
  xs = (short) xd;
}
assembler code length: 24 lines; jumps: 2; calls: none

Example 2:
{
  double xd;
  long xl;
  short xs;
  xd = 10;
  _asm {
    fld   [xd]
    fistp [xl]
    fld   [xd]
    fistp [xs]
  }
}
assembler code length: 6 lines; jumps: none; calls: none
Library functions

There is a series of mathematical functions, mainly transcendental functions, already implemented in the mathematical coprocessor of the i486 (Brey, 1991). For generality reasons, and in order to simplify compilation, the compiler does not use any of these functions, implementing its own (library functions). A library function returns a certain value as a function of one or more arguments, or none. However, the call to a library function implies a series of argument adaptations. For example, the required argument may be double precision and the actual argument single precision, a fact that implies a
preliminary conversion before the call to the calculation module (in a similar manner as shown before). On the other hand, all the library functions of a certain class have a similar structure, in order to ease their use by the compiler. An example can be observed in Table 2.
TABLE 2
LIBRARY FUNCTION

Example 1:
{
  double a, b;
  a = 20.0;
  b = sqrt( a );
}
assembler code length: 102 lines; jumps: 16; calls: 4

Example 2:
{
  double a, b;
  a = 20.0;
  _asm {
    fld   [a]
    fsqrt
    fstp  [b]
  }
}
assembler code length: 5 lines; jumps: none; calls: none
Logical operations

Logical operations are based on comparisons that allow a decision to be taken one way or the other, according to whether the condition is true or false. These decisions constitute, structurally, jumps that break the linearity of the structure. As an example, a classic comparison is shown in Table 3, together with the resulting assembler code. In this case, the use of '*' instead of '&&' (logical AND) allows the linearization of the calculation without further consequences.
TABLE 3
LOGICAL OPERATIONS

Example 1:
{
  int ai, bi, ci, d;
  ...
  d = ai > bi && ai < ci;
}
      mov   eax, dword ptr [ai]
      cmp   dword ptr [bi], eax
      jnl   jp1
      mov   eax, dword ptr [ai]
      cmp   dword ptr [ci], eax
      jle   jp1
      mov   dword ptr [d], 00000001
      jmp   jp2
jp1:  mov   dword ptr [d], 00000000
jp2:  ...
assembler code length: 9 lines; jumps: 3; calls: none

Example 2:
{
  int ai, bi, ci, d;
  ...
  d = (ai > bi) * (ai < ci);
}
      mov    eax, dword ptr [ai]
      xor    ecx, ecx
      cmp    dword ptr [ci], eax
      setnle cl
      mov    eax, dword ptr [ai]
      xor    edx, edx
      cmp    dword ptr [bi], eax
      setl   dl
      imul   ecx, edx
      mov    dword ptr [d], ecx
assembler code length: 10 lines; jumps: none; calls: none
Conditional jumps (If-then)

For this type of sentence, the argument calculation can be linearized but, in general, it is impossible to avoid at least one conditional jump, which is the essence of the IF conditional sentence. The coding of this sentence can be linearized if the objectives are simple expressions related among themselves. For example, Table 4 shows different linearization possibilities for the case of assigning to one variable the greater value found among two other variables.
TABLE 4
IF CONDITIONAL STATEMENT

Example 1:
{
  int ai, bi, ci, d;
  ...
  if (ai > bi) ci = ai; else ci = bi;
}
assembler code length: 8 lines; jumps: 2; calls: none

Example 2:
{
  int ai, bi, ci, d;
  ...
  ci = ai > bi ? ai : bi;
}
assembler code length: 6 lines; jumps: 1; calls: none

Example 3:
{
  int ai, bi, ci, d;
  ...
  ci = (ai > bi) * ai + (ai <= bi) * bi;
}
assembler code length: 12 lines; jumps: none; calls: none
Loop structures (For, While and Do-while)

Something similar to what occurs for the IF sentences also occurs for this kind of sentence. The argument calculation can be linearized, but the existence of at least two jumps, one conditional and the other unconditional, cannot be avoided, because they constitute the essence of the loop structure.
RECOMMENDATIONS

• Avoid data conversion and, where needed, perform this function at processor level in assembler language.
• Use the coprocessor arithmetic routines, or specific library functions for the type of data used.
• Transform logical expressions into Boolean algebraic equations.
• When using a logical bifurcation sentence, or a loop sentence, pay special attention to the logical expression that controls it.
• Use, whenever possible, implicit IF sentences. Bear in mind that an IF-THEN sentence will add at least one conditional jump, and an IF-THEN-ELSE sentence will add at least one conditional and one unconditional jump. Each conditional jump implies additional testing effort, because the tests must cover not only all the variable ranges but also all the different execution alternatives.
• Loop sentences will always add one conditional and one unconditional jump and, bearing in mind the difficulties that these offer to programmers, their use is not recommended (a worked example combining these rules is sketched below).
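As an illustration of how these recommendations combine, the following hypothetical routine (not taken from the paper) implements a 2-out-of-3 vote, the kind of decision mentioned earlier for a parameter sensed on a 2-out-of-3 measurement, using Boolean algebra instead of IF sentences and logical AND, so that the compiler can emit straight-line code:

#include <stdio.h>

/* 2-out-of-3 voting on three trip conditions, written without && or if:
   each comparison yields 0 or 1 and the results are combined
   arithmetically, allowing jump-free code generation (setcc/imul). */
int trip_2oo3(double m1, double m2, double m3, double limit)
{
    int a = m1 > limit;                   /* each flag is 0 or 1       */
    int b = m2 > limit;
    int c = m3 > limit;
    return (a * b + a * c + b * c) > 0;   /* true iff two or more trip */
}

int main(void)
{
    printf("%d\n", trip_2oo3(101.0, 99.0, 102.5, 100.0));  /* 1: trip    */
    printf("%d\n", trip_2oo3(101.0, 99.0,  98.0, 100.0));  /* 0: no trip */
    return 0;
}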
METRICS

The quantification of the quality factors is performed with so-called software metrics (Karl, 1995). The use of certain metrics is suggested in this chapter, trying to give the programmer the tools for the control and follow-up of the code without overloading his work.
Cyclomatic Complexity

One of the goals of software programmers aiming for high quality is verification of the software that is as simple as possible, in order to eliminate all the errors. To allow for easy verification and maintenance of the software, it is recommended that every module in the program have a cyclomatic complexity not exceeding 10. The McCabe cyclomatic complexity measure was designed to indicate the maintainability of a program (Kan, 1995); it can also be used to indicate the effort required to test a program. The cyclomatic complexity is equal to the number of binary decisions in a program plus 1. A 3-option decision is counted as 2 binary decisions. A CASE sentence with n possible ways is counted as (n-1) binary decisions. The iteration of a loop is counted as a binary decision. A small counting example is given below.
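The following hypothetical C routine (an illustration, not from the paper) shows how the counting rules above apply:

/* cyclomatic complexity = number of binary decisions + 1 */
int classify(int x, int mode)
{
    if (x < 0)                    /* 1 binary decision              */
        return -1;
    switch (mode) {               /* 3 ways: counts as 2 binary     */
    case 0:  x += 1; break;
    case 1:  x += 2; break;
    default: x += 3; break;
    }
    for (int i = 0; i < 3; i++)   /* loop iteration: 1 binary       */
        x++;
    return x;                     /* 4 decisions in total: CC = 5   */
}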
Fan-in and Fan-out
The cyclomatic complexity evaluates the complexity of each module, and implicitly assumes that each module in the program is a separate entity. The structural metrics try to take into account, and quantify, the interactions between the modules of a program or system. The most common structural metrics are fan-in and fan-out (Kan, 1995). Fan-in is the number of modules that call a given module, and fan-out is the number of modules called by a given module. Modules that have a large fan-in and fan-out must be relatively small and simple. Big and complex modules should have a small fan-in but may have a big fan-out. Modules with both a large fan-in and a large fan-out may indicate a bad structural design.
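As an illustrative sketch only, fan-in and fan-out can be derived from a module call matrix; the matrix below is a made-up example:

    #include <stdio.h>
    #define N 4

    /* calls[i][j] == 1 means module i calls module j (hypothetical data) */
    static const int calls[N][N] = {
        {0, 1, 1, 0},
        {0, 0, 1, 0},
        {0, 0, 0, 1},
        {0, 0, 0, 0}
    };

    int main(void)
    {
        for (int m = 0; m < N; m++) {
            int fan_in = 0, fan_out = 0;
            for (int k = 0; k < N; k++) {
                fan_in  += calls[k][m];   /* modules calling m    */
                fan_out += calls[m][k];   /* modules called by m  */
            }
            printf("module %d: fan-in %d, fan-out %d\n", m, fan_in, fan_out);
        }
        return 0;
    }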
Cycle-Ability Index

In the present work, a cycle-ability index is proposed. This index indicates the complexity of each module, taking into account the inherent difficulty of the decision-making process, the difficulty the programmer has in understanding the problem, the consequences of the proposed solution and, finally, the use and refreshing of values in the decision-making process. The following table indicates the cycle-ability index values for different statements:
TABLE 5
STATEMENTS WEIGHTED FOR CYCLE-ABILITY INDEX

Statements                   Cycle-ability index
If-then-else                 1
Switch or Case               2
For, While and Do-while      3
Linearity index

A linearity index is proposed. It allows for the quantification of the number of calls to subroutines and of the conditional and unconditional jumps present in the code. In this way, it is straightforward to fix and control limits for codes with different functional capabilities.

I_L = (nr. of subroutines called) + (nr. of conditional jumps) + (nr. of unconditional jumps)    (1)
Self-containing index

It indicates the degree to which the code performs all its implicit and explicit functions by itself (IAEA, 1988). The calculation formula is as follows:

I_S = (nr. of delegated functions) / (nr. of delegated functions + nr. of not delegated functions)    (2)

Equation (2) accounts for those functions explicitly invoked by the program from the operating system (delegated) and those that, although available, are not invoked (not delegated).
Maturity index

The maturity index is associated with program robustness. Robustness (IAEA, 1988) is defined as the degree to which the code can continue its execution despite some violation of the specification hypotheses. The maturity index approaches 1 exponentially and can be calculated with equation (3):

I_M = (MT - (Fa + Fm + Fe)) / MT    (3)

where MT is the total number of modules, Fa the number of added modules, Fm the number of modified modules and Fe the number of eliminated modules.
SPECIFIC PROGRAM TO ASSIST THE RELIABILITY ANALYST

A specific program to assist the reliability analyst was developed using MS-Visual C++ under the Windows95 platform. This program allows the analysis of machine-language code, the separation of routines and the calculation of the metrics, all in an automatic form.
Disassembling

The study of the program structure at low level cannot be performed directly, because the program is constituted as a long chain of ones and zeroes, grouped in octets, whose meaning is decoded by the microprocessor in sequential form. Therefore, to be able to study the structure of this kind of program, a compilation in the reverse direction is performed, in order to obtain as a result a higher level program, but
keeping the machine-code structure. The most adequate language for this task is assembler language, because it has a direct correspondence to the machine coding. The specific program presents its results in this language, as illustrated in Tables 1 to 4, which were obtained with the mentioned program. The programmer is able to see the code in assembler language and can observe how the modifications performed in the source code are reflected in the assembler code. This allows for discussion and interaction with the reliability analyst.

Metrics

The program calculates, automatically and for each routine, the following metrics: Cyclomatic Complexity Index, Fan-in, Fan-out and Linearity Index.
CONCLUSIONS

Several studies relating the source coding with the object coding have been used to demonstrate how they can favorably influence some software quality factors. Based on these, a series of recommendations for programmers has been generated. A basic tool for software quality analysis has also been developed, which is useful not only for programmers but also for reliability analysts. Several indexes that may assist both programmers and reliability analysts have been implemented. These indexes allow for the establishment of principles for the development and evaluation of programming alternatives.
CONTINUATION

The next steps are directed towards performing comparative evaluations of the source and the object code, including reliability models, and studying certain hardware characteristics that may severely affect program reliability. With reference to this last aspect, the analysis of failures produced by induced errors in the computer stack has already started.
REFERENCES
Brey, B. (1991). The Intel Microprocessors, Prentice-Hall.
IAEA (1988). T.R. Nr. 282, Manual on Quality Assurance for Computer Software Related to the Safety of Nuclear Power Plants.
IEC Standard, Publication 880 (1986). Software for Computers in the Safety Systems of Nuclear Power Stations, International Electrotechnical Commission.
Kan, S. (1995). Metrics and Models in Software Quality Engineering, Addison-Wesley Publishing Co.
A13" Safety Critical Systems
OVERALL RELIABILITY EVALUATION OF THE IEEE BENCHMARK TEST SYSTEM USING THE NH-2 PROGRAM

J. M. S. Pinheiro 1, C. R. R. Dornellas 2, M. Th. Schilling 3, A. C. G. Melo 2 and J. C. O. Mello 2

1 CERJ - Electricity Company of Rio de Janeiro, Rua Luiz Leopoldo Fernandes Pinheiro 517 Sala 1604, cep 24016-100, Niterói, Rio de Janeiro, Brazil
2 CEPEL - Centro de Pesquisas de Energia Elétrica, Av. Um s/n - Ilha do Fundão - Cidade Universitária, cep 21941-590, Rio de Janeiro, Brazil
3 CAA/UFF, Fluminense Federal University, Rua Passo da Pátria 156, Bloco E, sala 350, cep 24210-240, Niterói, Brazil
ABSTRACT

This paper presents a detailed reliability assessment of the new RTS-96, obtained with the NH-2 (3.5 release) program. A set of numerical results is presented and interpreted, including investigations related to independent power production and load management strategies. A comprehensive analysis of the effect of several wheeling transactions on the overall system reliability is also considered. The numerical results obtained are useful for comparisons of existing techniques for bulk reliability evaluation.
KEYWORDS
Computer tools for risk and reliability analysis, application of probabilistic methods, bulk reliability, IEEE reliability test system, wheeling, independent power production, load management.
INTRODUCTION

During the nineties, the power industry all over the world has been submitted to accelerated changes that are giving rise to interesting and difficult questions related to aspects such as: wheeling strategies, transmission access issues, emission constraints, independent power production, demand side management, equipment ageing, the prospect of massive utilization of electric vehicles and the building of a world-wide energy grid, growing complexities related to power quality issues and security, the characterization of a large spectrum of ancillary services, increasing utilization of non-conventional energy sources, and several other polemical topics. Designing efficient test systems to investigate those problems is a tough but needed task. Reflecting these trends, both academia and industry were challenged to take effective steps to face the challenging new requirements. The answer from academia materialized shortly in the form of an ad-hoc reliability test system for teaching overall power system reliability assessment, emphasising the evaluation of acceptable customer services [1].
In 1979 the first version [2] of the IEEE Reliability Test System (RTS-79) was developed with the aim "to satisfy the need for a standardized data base to test and compare results from different power system reliability evaluation methodologies". Even at that time, enhancements to the system were already expected to be incorporated to face special applications and new trends in the power production business. Accordingly, a second version of the RTS was proposed in 1986, expanding the data relating to the generating system [3]. Data such as unit scheduled maintenance, unit derated state representation, load forecast uncertainties, and interconnection ties were then supplied. After 1986, several other useful reliability test systems were also proposed and published [4-6], which helped to narrow the gap between the growing industry needs and the available computational tools. Recently [7], an enhanced test system (RTS-96) for use in bulk power system reliability evaluation studies was proposed. It is expected that this new system will permit comparative and benchmark studies to be performed on new and already existing reliability evaluation techniques. This paper presents a detailed reliability assessment of the new RTS-96, obtained with the NH-2 (3.5 release) program [8]. A set of numerical results is presented and interpreted, including investigations related to independent power production and load management strategies. A comprehensive analysis of the effect of several wheeling transactions on the overall system reliability is also considered. The numerical results obtained are useful for comparisons of existing techniques for bulk reliability evaluation.
COMPARATIVE STUDY: OLD VS. NEW SYSTEM

The original IEEE RTS has 24 buses, 38 branches and a load of 2 850 MW. This system has proved to be quite useful and continues to be largely utilized by several researchers. However, some drawbacks have been pointed out such as the lack of data for DC links, interconnections, hydrological effects, system dynamic behavior, etc. Some of these deficiencies have been overcome in the proposed new RTS. The new IEEE RTS-96 has 3 well defined electrical areas interconnected by 5 tie-lines. The system has 73 buses, 120 branches, and 96 generating units with a total generating capacity of 10 215 MW (3 405 MW in each area) for a system peak load of 8 550 MW (2 850 MW in each area). An optional DC link representation, and several kinds of data which were unavailable to the old RTS, are also provided. Each one of the areas is identical to the original RTS-79 (with the exception of sub-area 32). Figure 1 shows a simplified sketch of the new system.
Figure 1: IEEE RTS-96 Sketch (Areas 10, 20 and 30).

For the sake of comparison, Table 1 presents a set of reliability indices obtained for both the old and new systems. In this specific comparison only the peak load condition was analysed. Composite reliability indices were calculated by the Monte Carlo simulation method. The relative uncertainty (β) shown in Table 1 was attained within 13 069 samples. The adequacy evaluation of each state was carried out through an AC power
flow model coupled to a successive linearization based remedial actions model. Violations caused by contingencies were eliminated by rescheduling of system generators, adjustments in the voltage profile and in the taps of LTC transformers and, if necessary, load curtailment. All these remedial actions are automatically activated in the program by the RDSQ option. The basic reliability indices are the loss of load probability (LOLP) and expectation (LOLE), expected energy not supplied (EENS), loss of load frequency (LOLF), loss of load duration (LOLD), and system problems probability (SPP). The latter refers to the system condition before applying remedial actions. Some values of the severity index [9] are also occasionally presented. In Table 1 it is again confirmed that the new test system is more reliable than the old one, since all global indices have shown improvements. For instance, the EENS of the old system is 3.85 times higher than the value for the new RTS. These results confirm the benefits brought by interconnections for improving power system reliability.
TABLE 1
COMPOSITE RELIABILITY INDICES (PEAK LOAD)

Indices             RTS-79 Value   β (%)    RTS-96 Value   β (%)
SPP (%)             25.31          -        7.1            -
LOLP (%)            15.12          3.57     4.92           3.84
LOLE (h/year)       1 324.92       3.57     431.66         3.84
EENS (MWh/year)     220 300.0      5.00     57 191.68      6.52
LOLF (occ./year)    27.31          9.23     25.47          6.74
LOLD (h)            48.50          8.46     16.94          5.50
Severity (min)      4 529.55       5.00     401.34         6.52
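For illustration only, the sampling behind such composite indices can be sketched as follows. This toy program samples independent unit outages against a fixed load, whereas NH-2 evaluates every sampled state with an AC power flow and a remedial actions model; all parameter values here are hypothetical:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int    units = 32;      /* hypothetical generating units */
        const double cap   = 100.0;   /* MW per unit                   */
        const double q     = 0.05;    /* forced outage rate            */
        const double load  = 2850.0;  /* MW system load                */
        const long   n     = 200000;  /* number of sampled states      */
        long   failures = 0;
        double ens_sum  = 0.0;
        srand(1997);                  /* rand() is crude but fine here */
        for (long s = 0; s < n; s++) {
            double avail = 0.0;
            for (int u = 0; u < units; u++)
                if ((double)rand() / RAND_MAX >= q)   /* unit is up */
                    avail += cap;
            if (avail < load) {
                failures++;
                ens_sum += load - avail;   /* curtailed power, MW */
            }
        }
        printf("LOLP = %.4f %%\n", 100.0 * (double)failures / n);
        printf("EENS = %.1f MWh/year\n", 8760.0 * ens_sum / n);
        return 0;
    }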
Still looking at Table 1, it is worthwhile to note that the old system shows a severity value (4 529 min) classified [9] as degree 4, which means "catastrophic consequences to consumers including system collapse or black-out". On the other hand, the new system has a severity reaching 401.34 min, classified as degree 3, meaning "very serious impact to all consumers". These results confirm that the traditional peak load bulk reliability analysis is quite pessimistic. It is also interesting to assess the relative contributions of generation, transmission and composite events to the reliability indices. In Table 2, for the old RTS, the generation, transmission and composite failures account for 77%, 3%, 20% and 76%, 1%, 23% of the LOLP and EENS indices, respectively. In Table 3, for the new RTS, these contributions are 33%, 16%, 51% and 53%, 1%, 46%, for the same indices. It is seen that the new RTS is still generation dominated, although the relative contribution of generation has decreased. As a consequence, the relative contribution of composite failures has increased. Therefore, to better analyse the performance of the new RTS, a composite reliability tool is required.
TABLE 2
RELATIVE CONTRIBUTION OF FAILURES ON RTS-79

                 LOLP    EENS
Generation       77 %    76 %
Transmission      3 %     1 %
Composite        20 %    23 %
TABLE 3
RELATIVE CONTRIBUTION OF FAILURES ON RTS-96

                 LOLP    EENS
Generation       33 %    53 %
Transmission     16 %     1 %
Composite        51 %    46 %
LOAD MODELLING

In order to proceed with an accurate reliability assessment, the RTS-96 hourly load data was aggregated into 10 load levels, using clustering techniques [8]. Figure 2 presents the load transition diagram obtained and utilized in all results depicted in this section. In this figure, the load level probabilities and transition rates are also shown.
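As a hedged sketch of such a clustering step, one-dimensional k-means over the hourly loads yields the levels and their probabilities; the actual technique of [8] may differ in detail, and the load profile below is synthetic:

    #include <stdio.h>
    #include <math.h>

    #define HOURS  8736
    #define LEVELS 10

    /* One-dimensional k-means: aggregate hourly loads into LEVELS load
       levels; each level's probability is its share of the hours. */
    void cluster_loads(const double load[HOURS],
                       double level[LEVELS], double prob[LEVELS])
    {
        static int member[HOURS];
        for (int k = 0; k < LEVELS; k++)          /* crude initial seeds */
            level[k] = load[(HOURS / LEVELS) * k];
        for (int it = 0; it < 50; it++) {
            for (int h = 0; h < HOURS; h++) {     /* assignment step */
                int best = 0;
                for (int k = 1; k < LEVELS; k++)
                    if (fabs(load[h] - level[k]) <
                        fabs(load[h] - level[best]))
                        best = k;
                member[h] = best;
            }
            for (int k = 0; k < LEVELS; k++) {    /* update step */
                double sum = 0.0; int cnt = 0;
                for (int h = 0; h < HOURS; h++)
                    if (member[h] == k) { sum += load[h]; cnt++; }
                if (cnt) level[k] = sum / cnt;
                prob[k] = (double)cnt / HOURS;
            }
        }
    }

    int main(void)
    {
        static double load[HOURS];
        double level[LEVELS], prob[LEVELS];
        for (int h = 0; h < HOURS; h++)           /* synthetic profile */
            load[h] = 8550.0 * (0.6 + 0.4 * fabs(sin(h * 0.001)));
        cluster_loads(load, level, prob);
        for (int k = 0; k < LEVELS; k++)
            printf("level %.0f MW, prob %.3f\n", level[k], prob[k]);
        return 0;
    }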
Figure 2: RTS-96 Load Transition Diagram.
BENCHMARK STUDY

The results for this case (see Table 4) are taken as reference. Active and reactive redispatching as well as tap changes in LTC transformers (RDSQ) are allowed to tentatively remedy the appearing failure modes. The system was submitted to a Monte Carlo analysis. The Monte Carlo assessment takes into account both generation and/or transmission contingencies, allowing the evaluation of composite reliability indices. The simulations are performed for each one of the 10 load scenarios (see Fig. 2), obtaining conditioned reliability indices. The final indices are calculated as the weighted sum of the conditional values, using the load level probabilities as weights. The total sample size encompasses 904 495 observations, optimally allocated among the load levels by using a stratification technique. A coefficient of variation β of 4.9% or lower was attained, indicating that the evaluated reliability indices have a satisfactory precision level.
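The weighting step itself is elementary; a sketch (array contents are placeholders):

    /* Final index = sum over the load levels of the conditional index
       estimated at each level times the level's probability. */
    double weighted_index(const double cond[], const double prob[], int n)
    {
        double total = 0.0;
        for (int k = 0; k < n; k++)
            total += cond[k] * prob[k];
        return total;
    }

Called with the 10 conditional values of an index and the level probabilities of Fig. 2, this would reproduce the corresponding final index.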
This analysis was carried out on a Digital Alpha Server 1000 workstation and the execution time was about 62 minutes. Of this total time, almost 52% was spent in the optimization step of the remedial actions evaluation. In Table 4, comparing the Monte Carlo evaluated SPP (2.03%) and LOLP (0.041%) indices, one can see that the effectiveness of the remedial actions was 97.98% in this case. As suggested in [9], the obtained severity value (0.49 min) classifies the RTS-96 in degree zero (system in acceptable normal condition). The LOLP (0.041%) is equivalent to 1.5 days of failures in 10 years.

TABLE 4
IEEE RTS-96 BENCHMARK RELIABILITY INDICES (MONTE CARLO)

Indices             Value
SPP (%)             2.03
LOLP (%)            .041
LOLE (h/year)       3.57
EENS (MWh/year)     69.98
LOLF (occ./year)    1.11
LOLD (h)            3.21
Severity (min)      .49
Comparing Tables 1 and 4, one can also see the importance of careful load modeling. While the EENS related to the peak load study is 57 191.68 MWh/year, the same index drops by a factor of 820.8 to a value of 69.68 MWh/year when the load is represented in more detail (see Fig. 2). The system areas' reliability is shown in Table 5. It is interesting to observe that, while the three areas are topologically almost identical (except for the interconnection nodes), area A is significantly more reliable than the other two, both of which have approximately the same reliability level. In all areas the 230 kV grid is stronger than the 138 kV one. Table 6 shows the failure modes' relative contributions to the system reliability. It is seen that islanding and voltage violations are the two main failure causes.

TABLE 5
AREA RELIABILITY
Area Identification          EENS (MWh/year)   Relative Contribution (%)
Area A (code 11), 138 kV     7.48              11
Area A (code 12), 230 kV     2.01              3
Area B (code 21), 138 kV     29.32             42
Area B (code 22), 230 kV     1.59              2
Area C (code 31), 138 kV     25.92             37
Area C (code 32), 230 kV     3.66              5
Total                        69.98             100
INFLUENCE OF WHEELING TRANSACTIONS ON BULK RELIABILITY

As a by-product of the reliability analysis, expected values of sensitivity indices, measuring the variation of the system energy not supplied with respect to variations of the load, were obtained. These sensitivities were used to select sites for the installation of new generation or loads. Bus 122 (Aubrey, 230 kV) was identified as the RTS node most sensitive to any system load increment, and therefore was chosen to be one of the buses involved in the wheeling transaction. Bus 204 (Bailey, 138 kV) was chosen, by chance, to be the other one, since it is in another electrical area, has a lower voltage level, has no generation, is radially connected and its sensitivity to load increase is negligible. In Table 7, three cases were analysed, involving 10, 50 and 100 MW transactions from bus 122 to bus 204.
TABLE 6
RELATIVE CONTRIBUTION OF FAILURE MODES (MONTE CARLO INDICES)

Failure Mode            SPP     LOLP    LOLE     EENS     LOLF
Islanding               1.82    .013    1.13     47.8     .40
Overload                -       -       .00031   .12      .00032
Voltage (over+under)    .025    .024    2.1      11.5     .58
Island + Voltage        .16     .003    .28      3.78     .89
Overload + Voltage      -       -       .0027    .0096    .0013
Others                  .028    .001    .058     6.89     .046
Total                   2.03    .041    3.57     69.98    1.11
Table 7 shows the impact of a wheeling transaction [10] on the system reliability levels, simulated for peak load conditions (8 550 MW) and a more stringent coefficient of variation (see also Table 1). It is seen that all transactions cause a modest impact on the overall system reliability indices. For instance, when the flow goes from bus 122 to bus 204, the EENS increases 0.6%, 2.0% and 4.4% for 10, 50 and 100 MW transactions, respectively. These results suggest that wheeling transactions, in general, should not be contracted without a previous comprehensive bulk reliability analysis of the entire power system.
TABLE 7
WHEELING FROM BUS 122 TO BUS 204 (PEAK LOAD)

Indices (Monte Carlo)   Base        10 MW       50 MW       100 MW      β (%)
LOLP (%)                4.97        5.01        4.96        4.96        3.0
LOLE (h/y)              435.15      439.16      434.80      435.01      3.0
EPNS (MW)               6.88        6.92        7.02        7.19        5.0
EENS (MWh/y)            60 309.12   60 668.64   61 452.47   62 964.79   5.0
LOLD (h)                16.29       16.60       16.63       16.13       4.1
LOLF (occ./y)           26.70       26.45       26.15       26.97       5.1
Severity (min)          423.22      425.25      428.74      436.75      5.0
Table 8 shows a detailed investigation of the 10 MW wheeling transaction considering four cases: (i) W10YY (as for yes, yes), where both the generation and the load, respectively at the source and sink nodes, can be changed to minimize the amount of load cut; in this case the wheeling transaction allows load interruption; (ii) W10NN, in which neither the source nor the sink node power injections can be changed; the wheeling transaction is taken as firm; (iii) W10YN, where the generation can be changed but the load is taken as firm; (iv) W10NY, where the generation is firm but the load is interruptible. It is quite interesting to note that when the 10 MW is not interruptible (W10NN, W10YN) the system overall reliability is worse from the sole point of view of EENS and EPNS. However, the other indices show a slight improvement. Energy-wise, case W10NY is the most reliable. This shows how critical to the system bulk reliability the load connected to the sink node is. This is evident since, when this load is allowed to be cut, the system EENS drops.
These results suggest that commercial contracts involving firm power transactions should be carefully designed to avoid overall system degradation.
TABLE 8
WHEELING FROM BUS 122 TO BUS 204 FOR A 10 MW TRANSACTION (PEAK LOAD)

Indices (Monte Carlo)   W10YY       W10NN       W10YN       W10NY       β (%)
LOLP (%)                5.01        4.27        4.28        5.01        3.0
LOLE (h/y)              439.16      373.92      374.52      439.16      3.0
EPNS (MW)               6.92        7.46        7.49        6.91        5.0
EENS (MWh/y)            60 668.64   65 332.51   65 644.77   60 565.30   5.0
LOLD (h)                16.60       15.31       15.33       16.56       4.1
LOLF (occ./y)           26.45       24.43       24.43       26.52       5.1
Severity (min)          425.25      458.47      460.66      424.52      5.0
FINAL REMARKS
This paper has presented a detailed reliability assessment of the newly released IEEE RTS-96, including investigations related to independent power production and load management strategies. It was verified that the new system is more robust than the original IEEE RTS-79, in the sense that the reliability indices of the new system (see Table 1) are better. A comprehensive analysis of the effect of several wheeling transactions on the overall system reliability was also considered [11-13]. The numerical results obtained are useful for comparisons of existing techniques for bulk reliability evaluation. It was hinted that the establishment of commercial contracts involving firm wheeling transactions between an independent power producer and a client might cause an overall reliability degradation from the point of view of expected energy not supplied. This suggests that the load interruption costs incurred by other system consumers may be increased at the expense of a higher reliability level offered to just one specific and privileged consumer. This practice is not fair if other clients are not given the means to protect themselves. This also emphasizes the need for further investigations concerning spatial risk coordination for the benefit of all system clients.
ACKNOWLEDGMENTS
Part of this work was supported by the Brazilian National Council for Scientific and Technological Development (CNPq) under project grant no. 522.849/96-2 and by the Research Support Foundation of Rio de Janeiro State (FAPERJ) under project grant no. E 26/170.068/95-APQ1, both awarded to the third author.
REFERENCES
[1] R. Billinton, S. Jonnavithula (1996). A Test System for Teaching Overall Power System Reliability Assessment. IEEE PES Winter Meeting, paper 96 WM056-2 PWRS. [2] IEEE RTS Task Force of APM Subcommittee (1979). IEEE Reliability Test System. IEEE Trans. on PAS, Vol. PAS-98, No.6, 2047-2054.
[3] N. M. K. Abdel-Gawad, R. Billinton, R. N. Allan (1986). The IEEE Reliability Test System - Extensions to and Evaluation of the Generating System. IEEE Trans. on PS, Vol. PWRS-1, 1-7.
[4] E. Khan, G. Nourbakhsh, J. Oteng-Adjei, K. Chu, K. Debnath, L. Goel, N. Chowdhury, P. Kos, R. Billinton, S. Kumar (1989). A Reliability Test System for Educational Purposes - Basic Data. IEEE Trans. on PS, Vol. PWRS-4, 1238-1244.
[5] E. Khan, G. Nourbakhsh, J. Oteng-Adjei, K. Chu, N. Chowdhury, P. Kos, R. Billinton, S. Kumar (1990). A Reliability Test System for Educational Purposes - Basic Results. IEEE Trans. on PS, Vol. PWRS-5, 319-325.
[6] I. Sjarief, K. S. So, L. Goel, R. Billinton, R. N. Allan (1991). A Reliability Test System for Educational Purposes - Basic Distribution System Data and Results. IEEE Trans. on PS, Vol. PWRS-6, 813-820.
[7] IEEE RTS Task Force of APM Subcommittee (1996). The IEEE Reliability Test System - 1996. IEEE PES Winter Meeting, paper 96 WM326-9 PWRS.
[8] A. C. G. Melo, A. M. Leite da Silva, J. C. O. Mello, M. Th. Schilling, M. V. F. Pereira (1995). Relevant Factors in Loss of Load Cost Evaluation in Power Systems Planning. Proc. of the IEEE Stockholm Power Tech Int. Symp., Stockholm, Sweden, 117-122.
[9] IEEE Task Force on Bulk Power System Reliability Reporting Guidelines (1995). Reporting Bulk Power System Delivery Point Reliability. IEEE PES Summer Meeting, paper 95 SM513-2 PWRS.
[10] F. Gbeddy, R. Billinton (1996). Impact of Power Wheeling on Composite System Adequacy Evaluation. Electrical Power & Energy Systems, Vol. 18:3, 143-151.
[11] J. C. O. Mello, A. C. G. Melo, X. Vieira Filho, M. V. F. Pereira, J. C. G. Praga, E. M. T. Nery, B. G. Gorenstin, S. Granville (1996). Power System Reliability Evaluation in a Competitive Framework. CIGRE, 38-202, Paris.
[12] R. Billinton, L. Salvaderi, J. D. McCalley, H. Chao, Yh. Seitz, R. N. Allan, J. Odom (1996). Reliability Issues in Today's Electric Power Utility Environment. IEEE PES Winter Meeting Panel Session, Baltimore.
[13] CIGRE T. F. 38.03.11 (1997). Adequacy Assessment of Interconnected Systems in Presence of Wheeling and Trading: Indices, Modelling and Techniques. CIGRE Symposium, Tours, France.
HARDWARE AND SOFTWARE FAULT TOLERANCE: DEFINITION AND EVALUATION OF ADAPTIVE ARCHITECTURES IN A DISTRIBUTED COMPUTING ENVIRONMENT
F. Di Giandomenico 1, A. Bondavalli 2, J. Xu 3 and S. Chiaradonna 1

1 IEI/CNR, Pisa, Italy; 2 CNUCE/CNR, Pisa, Italy; 3 Computing Laboratory, University of Newcastle upon Tyne, UK
ABSTRACT
This paper discusses the issue of providing tolerance to both hardware and software faults by defining several hybrid-fault-tolerant architectures, which can co-exist and work simultaneously on top of the supporting environment, and introduces a systematic method for evaluating their dependability, efficiency and response time. To address general-purpose distributed systems where multiple unrelated applications may compete for system resources, our architectural solutions have an important concern with adaptation in the use of redundancy according to system conditions.

KEYWORDS

Adaptive architectures, dependability, distributed computing environments, efficiency, hardware and software fault tolerance, response time.

1 INTRODUCTION
The need for developing a unified method for tolerating both hardware and software faults has been recognised in the last few years, and several proposals in this direction have already appeared in the literature (Dugan, et al (1995); Kim, et al (1989); Lala, et al (1988); Laprie, et al (1990); Wu, et al (1994)). The idea of implementing fault tolerance in separate layers (e.g. hardware and software layers) of a computing system helps in managing the complexity of the derived architectural solutions, but could result in too weak an approach to coping with the relationships existing between hardware and software behaviour. Integrating provisions for coping with both hardware and software faults can reduce the overlapping of fault tolerance techniques, thus improving efficiency and performance. Most existing studies assume that a fixed amount of hardware and system resources is bound statically to a given fault tolerant structure. The development of an architecture is thus completely isolated from the environment in which it is intended to operate. Therefore, their focus is restricted to the reliability aspects, without any reference to considerations of efficiency and performance, which are doubtlessly of high interest when making a system design choice. In fact, in a general-purpose information-processing distributed system multiple unrelated applications may compete for system resources such as processors, memories and communication devices, thereby exhibiting highly varying and dynamic system characteristics. By focusing on this type of system, in this paper we extend previous work on the topic of combined architectures for tolerating hardware and software faults and address efficiency and performance as well as reliability issues. In particular, the objective of our work is twofold. First, we define several architectures by extending existing software fault tolerance schemes to the treatment of both hardware and software faults. We distinguish between static strategies that always consume a fixed amount of resources and dynamic (i.e., adaptive) strategies that use
additional resources only when an error is detected, in the hope that efficiency and performance will be improved. We are mainly concerned with dynamic strategies, and two typical dynamic schemes are exploited: recovery blocks (RB) (Randell (1975)) and Self-Configuring Optimal Programming (SCOP) (Bondavalli, et al (1993)). N-version programming (NVP) (Avizienis, et al (1977)) and NVP with a tie-breaker (NVP-TB) (Tai, et al (1993)) are chosen as two representatives of static schemes for the sake of comparison. Secondly, we introduce a method for analysing the proposed architectures with respect to reliability, efficiency and response time aspects, and give examples of quantitative evaluations. Given the very high complexity involved in an analysis based on a completely distributed, varying environment, here we restrict the potentiality of such a varying environment by introducing a few assumptions, which obviously limit the realism of our analysis. Nevertheless, our analysis is a first contribution in the direction of evaluating a fault tolerant architecture under dependability, performance and efficiency aspects. The rest of the paper is organised as follows. In Section 2 several hybrid-fault-tolerant architectures are defined. In Section 3 we evaluate the dependability of the architectures under consideration. Section 4 analyses these architectures with respect to resource cost and response speed. Due to space limitations, only the analysis of the SCOP instance is detailed in Sections 3 and 4; the complete derivations for the other architectures can be found in Di Giandomenico, et al (1995). Conclusions are given in Section 5.

2 ARCHITECTURAL SOLUTIONS FOR HYBRID FAULT TOLERANCE
This section defines a set of architectures for hardware and software fault tolerance under the assumption of a distributed supporting environment. Each of our architectural solutions is single-application-oriented. Conceptually, an architecture must request processing resources from the supporting system upon invocation and return them to the system when the required computation terminates. During the computation, the architecture may apply for additional resources if necessary. For a given fault-tolerant application, an architecture contains a set of software variants designed independently (mainly for coping with residual design faults), an adjudicator (Anderson (1986)) (e.g. an acceptance test or voter) for the selection of an acceptable output from the results of those variants, and a control program managing the execution of the variants and taking proper actions in response to the adjudicator output. The related programs and input/output data may be stored on the disks of some hardware nodes. The adjudicator is supposed to be replicated on all the hardware nodes supporting a specific architecture, but a selected node is responsible for taking a final decision from the local decisions and for producing the outputs of the architecture. (As it would be short and simple, the final adjudication is assumed to be highly dependable.) Control programs are organised in a manner similar to the organisation of adjudicators. An architecture is denoted by a group of multi-elements X(F, N, Hb, Hmax, ...) where:

X indicates a specific architecture for hybrid fault tolerance, equivalent to the name of the selected scheme for software fault tolerance, such as RB and NVP;

F indicates the number of faults to be tolerated and is further expressed in a detailed form (f, i, j), in which f is the number of hybrid (that is, hardware or software) faults to be tolerated, i is the number of hardware faults to be tolerated assuming perfect software, and j is the number of software faults to be tolerated assuming perfect hardware;

N is the number of application-specific software variants;

Hb is the basic (minimum) number of hardware nodes an architecture needs to achieve the given level of hybrid fault tolerance F;

Hmax is the maximum (total) number of hardware nodes an architecture needs to achieve a given level of hybrid fault tolerance F when the worst fault situation occurs.

Since realistic implementations of software fault tolerance are mostly based on two or three software variants (Laprie, et al (1990)), we will restrict our interest to such particular instances.
2.1 Dynamic Architectures: SCOP((1, 2, 1), 3, 2, 4) and RB((1, 2, 1), 2, 2, 3)
An execution of the SCOP((1, 2, 1), 3, 2, 4) architecture is divided into two phases (see Figure 1). In the first phase, Variant 1 and Variant 2 run on two hardware components and the adjudicator compares their
results. Consistent results are accepted immediately. Otherwise, the second phase begins and executes Variant 3 and Variant 1 on two additional hardware nodes. The adjudicator will decide in the end according to all four results, seeking a 2-out-of-4 majority. At least one (hardware or software) fault is tolerated; if no software fault manifests itself during the computation, up to two hardware faults will be tolerated.
Figure 1" An instance of the SCOP architecture (a) and of the RB architecture (b). The primary variant V1 in the RB((1, 2, 1), 2, 2, 3) architecture is executed on two hardware components and the results produced by the replicated variants are compared. If they agree, acceptance tests are applied to them. This agreeing result will be released unless both acceptance tests reject it. In the last case. the variant V2 will be executed on an additional node and its result released. If the results produced in the first phase disagree, a diagnostic routine must be applied to the two hardware nodes employed. If one hardware node is found non-faulty, then the result produced by the v.ariant running on it is released, otherwise an additional node is requested to execute the variant V2, whose result is then released without any further check. At least one hardware or software fault is tolerated, assuming perfect diagnostic routines. Both SCOP and RB architectures are highly efficient when no fault manifests itself during computation, which is the most likely situation. However, the application that uses these architectural solutions must be prepared to accept a rare, but still possible heavy degradation in case the second phase is necessary.
2.2 Static Architectures: NVP((1, 2, 1), 3, 4, 4) and NVP-TB((1, 2, 1), 3, 4, 4)
The NVP((1, 2, 1), 3, 4, 4) instance requests four hardware nodes. Three software variants are executed in parallel on these nodes (according to the schema of Figure 2) and their results compared, seeking a 2-out-of-4 majority. This architecture can tolerate any one hybrid fault. If no software fault manifests itself during the computation, two hardware faults can be tolerated.
Figure 2: Instances of the NVP and NVP-TB architectures.

The NVP-TB approach has been recently introduced by Tai, et al (1993) with the aim of enhancing the performance of basic NVP by modifying the operational usage of the NVP redundancy. The NVP-TB((1, 2, 1), 3, 4, 4) architecture requests three software variants distributed on four hardware nodes as in the NVP architecture. The variants are executed in parallel but, as soon as two results by two different variants are produced, a first adjudication phase starts and, only if disagreeing results are observed, a second adjudication phase is executed involving all four results, seeking a 2-out-of-4 majority. From the operational point of view, the NVP-TB instance differs from the SCOP instance in that the three software variants plus the replicated one are always started, independently of the results of the first adjudication phase. However,
this architecture, although classified as a static one, does improve on NVP in performance, since in most cases only the first two results produced by the two faster variants are needed to complete the computation. To simplify the notation, in the following the above defined architectures will be simply referred to as SCOP, RB, NVP and NVP-TB.
3 DEPENDABILITY EVALUATION

This section contains the dependability evaluation of the architectures defined in Section 2, adopting a Markov approach. We extend the analysis in Laprie, et al (1990) by considering a different set of faults our architectures are to tolerate and by introducing a different model that allows analysing both hardware and software faults in a combined framework, starting from a set of special software failures that would lead to the failure of the whole architecture regardless of the hardware conditions. Hardware failures are considered only when they affect the whole architecture, alone or together with some software failures. The basic assumptions for our evaluation are as follows:
1) failures of hardware nodes are independent; this is a reasonable assumption considering the nowadays well-established techniques for hardware design; the probability that correct software variants running on failed hardware nodes produce the same incorrect outputs is assumed to be negligible;
2) compensation among failures does not happen, neither between software variants, nor between variants and their adjudicator, nor between hardware components and variants;
3) for architectures with multiple phases, the adjudicator exercised in more than one phase will show the same (erroneous or correct) software behaviour throughout all the phases;
4) hardware faults are independent of software faults (and vice versa): a failure in a hardware component will cause an incorrect output of the software variant running on it, but will have no influence upon activating a fault in the variant itself;
5) failures of the underlying communication system are not addressed explicitly (though a failure in the link connecting two nodes may be considered as a failure in the sending or receiving node).

TABLE 1
FAILURE TYPES, NOTATION AND VALUES FOR SCOP
Failure Type                                                      Notation    Prob. Value
3 variants fail with consistent results                           q3v         10^-10
2 variants fail with consistent results (the 3rd variant
  may fail with a different result)                               q2v         qiv x 10^-3
The adjudicator fails, selecting an erroneous result (given
  that an erroneous result has been produced)                     qvd         10^-10
A variant fails, given that none of the above events happens      qiv         from 10^-5 to 10^-3
A majority exists, but the adjudicator fails to recognise it
  (without releasing any result)                                  qd          10^-9
A hardware node fails, affecting the variant and/or the
  adjudicator running on it                                       qh          10^-9
Table 1 shows the relevant types of failures of software and hardware components for the SCOP architecture, together with the values used in the subsequent evaluation example; the detailed dependability model of the SCOP architecture is illustrated in Figure 3. This model is a slightly simplified one, in which the states representing the execution of the second phase are introduced only when necessary in order to distinguish among different behaviours of the architecture. Table 2 briefly explains the meanings of the states and arcs in the figure. Using the set of intermediate parameters shown in the right side of Figure 3, the failure probability of the SCOP architecture is:

Q_SCOP = q1 + (1 - q1) qiv^2 + P_I (qd + (1 - qd)(qh^2 (1 - P_II (1 - qiv)) + q3 q4)) + (1 - q1) q2 (qiv + (1 - qiv) qd + (1 - qiv)(1 - qd)(1 - P_IV))

Similar models are derived for the other architectures and the corresponding expressions of the failure probability determined. Due to space limitations, they are omitted here; the failure probability expressions for RB, NVP and NVP-TB are in Di Giandomenico, et al (1995).
The intermediate parameters used in Figure 3 are:

q1 = 3 q2v + q3v + 3 qvd
q2 = 2 qiv (1 - qiv)
q3 = 2 qh (1 - qh)
q4 = qiv + (1 - qiv) qh^2
P_I = (1 - qiv)(1 - q1)
P_II = (1 - qh)^2
P_IV = (1 - qh)^4

Figure 3: The dependability model for SCOP.

TABLE 2
MEANINGS OF THE STATES AND ARCS IN FIGURE 3

I, F, S: initial, failure and success states of an execution, respectively (F and S are absorbing states)
VP: 2 variants are executed on 2 nodes in the first phase; the arc from VP to F is labelled with the sum of the probabilities of the software failures causing the failure of the whole architecture without considering the hardware behaviour (i.e. independent failure of the executed variants, and common mode failures between the variants and between the variants and the adjudicator)
SP1: one variant executes correctly while the other fails, and the second phase is performed
SP2: the two variants execute correctly; the behaviour of the adjudicator is then examined
Fv1: just one variant fails after the first phase; the hardware behaviour is then examined
Ss: in the first phase, software components including the adjudicator execute correctly; according to the hardware behaviour, the state S is reached or the second phase is executed (states Fh1 and Fh2)
Fh1: the second phase operates due to the failure of one hardware node during the first phase; the behaviour of both hardware and software components in the second phase is analysed
Fh2: the second phase operates due to the failure of two nodes; success or failure of the whole execution depends on the behaviour of both hardware and variants in the second phase
Figure 4 shows, as an example, a plot of the functions representing the failure probabilities of the four architectures; the numerical values in Table 1 have been chosen for the dependability parameters of SCOP, while similar values have been assigned to the corresponding parameters of the other architectures.
"~ u ~ ~ ~
(x 10"6) 4.0 :3.5 3.0 2.5 2.
"
scop ._ / / nvp-tb-~//
O
"~ 1.5 ~I 1 . o ~
O.5 0 0
7-
,
,
-4 -3 Probability of independent failures of variants: qiv ( 10^) -$
Figure 4: Plot of the probabilities of failure for the four architectures. Despite of certain practical implications, the parameters setting chosen simply constitute a line in the space
of all the possible combinations; this does not allow deriving any general conclusion about the dependability of the four architectures. However, the example seems to be quite consistent with some intuitive conclusions. Since the influence of hardware failures is relatively small according to the set of parameters chosen, it is the software behaviour that makes the major contribution to the failure of the whole architecture. Previous work, for example in Bondavalli, et al (1993), has shown that there is no evidence that, in the general case, one of RB, NVP and SCOP is better than the others from the software dependability point of view under comparable conditions. This explains why the curves appear very close in the figure.

4 RESOURCE COST AND RESPONSE TIME
In this section the average resource consumption (i.e. the average number of hardware nodes required) and response time are estimated for each execution of a given architecture. The numerical evaluations performed are simply meant as examples to show how such an evaluation can be made rather than to derive definitive conclusions about the behaviour of the four architectures examined.
4.1 Average Resource Consumption
An execution of the SCOP and RB schemes may require two phases. From the dependability evaluation, their probability of termination at the end of the first phase can be obtained. Here, only the expression relative to the SCOP architecture is reported; the complete derivation is in Di Giandomenico, et al (1995):

P1_SCOP = P_I (1 - qd) P_II + (q2v + q3v)(1 - qd) P_II + 2 qvd

Then, the average resource consumptions of the SCOP and RB architectures in one execution are:

AV.RES_SCOP = 2 + 2 (1 - P1_SCOP) and AV.RES_RB = 2 + (1 - P1_RB)

NVP executes all of its variants in parallel; therefore it has a constant resource consumption equal to 4. NVP-TB, although it stops as soon as two equal results are produced by two different variants, generally executes all of its variants in parallel. Thus, in most cases it also has a resource consumption equal to 4.
Figure 5: Average processing node utilisation vs. failure probability.

From this simple analysis, we can conclude that dynamic architectures have an average resource consumption lower than static architectures. Figure 5 shows the plot of AV.RES_SCOP and AV.RES_RB as a function of the probability qiv (the same dependability parameter values as in Section 3 are used). Since for most plausible values SCOP and RB show a very high probability of stopping at the end of the first phase, their average hardware consumption is almost constant and very close to the amount required to start executing.
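In code, the average consumption formulas are one-liners (a fragment for illustration):

    /* Average nodes per execution, from the probability p1 of stopping
       at the end of the first phase (equations above). */
    double av_res_scop(double p1) { return 2.0 + 2.0 * (1.0 - p1); }
    double av_res_rb(double p1)   { return 2.0 + 1.0 * (1.0 - p1); }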
4.2 Response Time

The response time analysis is conducted assuming that all the processing nodes are required from the
supporting system, and the proper software is then loaded on them before execution takes place; this scenario seems the most appropriate to the general-purpose distributed systems which we focus upon. To analyse response time we follow the same approach used in Chiaradonna, et al (1994) and Tai, et al (1993), but include the times relative to hardware components. It is assumed that the time needed to obtain a processing node and to load the software is an independent and exponentially distributed random variable Wi with parameter λwi. The execution times of different combinations of variant/node pairs are also assumed to be independent and exponentially distributed random variables Ei with parameter λi, particularly Yd with parameter λd for the adjudicator. Designating with Yc the duration of an execution of the given architecture, we will derive its distribution. Let Yw1 = max{W1, W2} and Ye1 = max{E1, E2} denote the times necessary for obtaining two processing nodes and executing two variants in the first phase, respectively. Similarly, concerning the second phase, Yw2 = max{W3, W4} and Ye2 = max{E3, E4}. Therefore, the execution time Yc for SCOP is:

Yc = Yc1 = Yw1 + Ye1 + Yd = max{W1, W2} + max{E1, E2} + Yd, with probability P1_SCOP
Yc = Yc2 = Yw1 + Ye1 + Yd + Yw2 + Ye2 + Yd, with probability (1 - P1_SCOP)
The expressions for Yc relative to the other architectures can be found in Di Giandomenico, et al (1995). An example of response time estimation in a realistic situation follows. Reasonable values used for the time parameters of SCOP are: λi = 1/5 msec^-1, i = 1..4; λd = 2 msec^-1; λwi = 1/50 and 2 msec^-1, i = 1..4; similar values have been adopted for the corresponding parameters of the other architectures under analysis. The probability of termination at some phase has been computed based on the same values already used in the previous dependability analysis, assigning the value 10^-4 to qiv.
Figure 6: Distribution of Yc for λwi = 1/50 (a) and λwi = 2 (b).

Figure 6 shows the plots of the pdf of Yc in the case of no timing constraints. When the time to acquire a hardware node is significantly longer than the execution time of a variant, dynamic architectures are better than static ones with respect to the average response time, due to the lower number of nodes the former need to start an execution (see Figure 6(a)). In case the time for acquiring a node becomes equal to or smaller than the execution time of a variant, the average response time is mainly determined by the execution time of the variants and, with the parameter values chosen for this example, NVP-TB shows the best behaviour (see Figure 6(b)). This is not surprising, since NVP-TB, although classified as a static architecture because of its fixed requirement of four hardware nodes to start an execution, has a dynamic operative behaviour. The choice of identical (exponential) distributions for the execution times of the variants contributes to favouring NVP-TB (for example, the time for executing two variants among four comes out to be significantly lower than executing two variants among exactly two). Changing the distribution and/or the parameters λi could lead to different results. In this experiment, the RB architecture behaves worse than SCOP, but still no worse than NVP.

5 CONCLUSIONS
This work has presented the definition of a number of dynamic and static architectures for tolerating both
hardware and software faults, and a method for evaluating them with respect to dependability, resource consumption and response time aspects, assuming a dynamic, although limited, supporting environment. The analytical results have shown that dynamic architectures have better resource utilisation than static ones, without compromising dependability. Even from the response time point of view, dynamic architectures, although they often have a longer worst-case response time, in certain application scenarios may have a higher probability of making a timely response than static designs. The assumptions made obviously limit the realism of the analysis performed, so that no definitive conclusion on the dependability and efficiency of the considered architectures in the general case can be derived from it. Another serious problem, common to any work of this kind, is the difficulty in obtaining sound estimates for the parameter values (like correlation parameters). Our work is not supposed to solve this problem; rather, our intention is to put the current work (perhaps the best possible results) forward to a more practical stage by collectively considering dependability, performance and efficiency aspects. Notwithstanding that models cannot in general be employed for producing reliable predictions about a system, their usefulness in showing the relative importance of the different phenomena which the designer may hope to control is undoubtedly of primary importance. Models could be usefully employed, for example, in comparing different designs to understand the most appropriate solution, and to highlight problems within the design. It will be the objective of future studies to improve the work by gradually releasing the assumptions made here.
Acknowledgement. This work has been partially supported by the ESPRIT Long Term Research Project 20072 on Design for Validation (DeVa), and by the Italian CNR Project "Metodologie, architetture, ambienti di progetto e valutazione per sistemi di elaborazione distribuiti".
REFERENCES

Anderson, T. (1986). A Structured Decision Mechanism for Diverse Software, in Proc. 5th Symposium on Reliability in Distributed Software and Data Base Systems, Los Angeles, California, 125-129.
Avizienis, A. and Chen, L. (1977). On the Implementation of N-Version Programming for Software Fault Tolerance During Program Execution, in Proc. COMPSAC 77, 149-155.
Bondavalli, A., Di Giandomenico, F. and Xu, J. (1993). A Cost-Effective and Flexible Scheme for Software Fault Tolerance. Journal of Computer Systems Science and Engineering 8:4, 234-244.
Chiaradonna, S., Bondavalli, A. and Strigini, L. (1994). On Performability Modeling and Evaluation of Software Fault Tolerance Structures, in Proc. EDCC-1, Berlin, Germany, 97-114.
Di Giandomenico, F., Bondavalli, A. and Xu, J. (1995). Hardware and Software Fault Tolerance: Adaptive Architectures in Distributed Computing Environments, Technical Report B4-15, IEI-CNR.
Dugan, J. B. and Lyu, M. (1995). Dependability Modeling for Fault-Tolerant Software and Systems, M. Lyu Ed., WILEY, 109-138.
Kim, K. H. and Welch, H. O. (1989). Distributed Execution of Recovery Blocks: an Approach for Uniform Treatment of Hardware and Software Faults in Real-Time Applications. IEEE Trans. on Computers C-38:5, 626-636.
Lala, J. H. and Alger, L. S. (1988). Hardware and Software Fault Tolerance: a Unified Architectural Approach, in Proc. IEEE 18th Intern. Symp. on Fault Tolerant Computing, 240-245.
Laprie, J. C., Arlat, J., Beounes, C. and Kanoun, K. (1990). Definition and Analysis of Hardware-and-Software Fault-Tolerant Architectures. IEEE Computer 23:7, 39-51.
Randell, B. (1975). System Structure for Software Fault Tolerance. IEEE Trans. on Software Engineering SE-1:2, 220-232.
Tai, A. T., Avizienis, A. and Meyer, J. F. (1993). Performability Enhancement of Fault-Tolerant Software. IEEE Trans. on Reliability, Sp. Issue on Fault Tolerant Software R-42:2, 227-237.
Wu, J., Wang, Y. and Fernandez, E. B. (1994). A Uniform Approach to Software and Hardware Fault Tolerance. Journal of Systems and Software 26, 117-127.
A PROGRAMMABLE ELECTRONIC SYSTEM FOR SAFETY RELATED CONTROL APPLICATIONS

Wolfgang A. Halang 1 and Marian Adamski 2

1 Faculty of Electrical Engineering, FernUniversitaet, 58084 Hagen, Germany
2 Department of Computer Engineering and Electronics, Technical University of Zielona Gora, 65-246 Zielona Gora, Poland
ABSTRACT

A low complexity, fault detecting computer architecture for utilisation in programmable logic controllers to be employed in safety related process control applications is presented. The cyclic operating mode of PLCs and a specification level, graphical programming paradigm based on the interconnection of application oriented standard software function blocks are architecturally supported. Thus, by design, there is no semantic gap between the programming and machine execution levels, enabling the safety licensing of application software by an extremely simple, but rigorous method, viz., diverse back translation.

KEYWORDS

Safety related control, programmable electronic system, programmable logic controller, computer architecture, function block programming, safety licensing of software, diverse back translation.

INTRODUCTION

Economical considerations impose stringent boundary conditions on the development and utilisation of technical systems. This holds for safety related systems as well. Since manpower is becoming increasingly expensive, safety related systems also need to be highly flexible, in order to be able to adjust them to changing requirements at low cost. In other words, safety related systems must be program controlled. Thus, the use of hard-wired safety systems will diminish in favour of computer based ones. In society, on the other hand, there is an increasing awareness of and demand for dependable technical systems, in order not to endanger human lives and to prevent environmental disasters. Computer based technical systems have the special property that they consist of hardware and software. The latter knows no faults caused by wear, environmental events etc. Instead, all errors are design errors, i.e., of systematic nature, and their causes are always latently present. Hence, dependability of software cannot be achieved by reducing the number of errors contained to some low level (which is generally greater than zero) by testing, checks, or other heuristic methods, but only by rigorously proving that it is error-free. Taking the high complexity of software into account, this objective can be reached with the present state of the art only in exceptional cases, which is the reason why the licensing authorities are very reluctant to approve safety related systems whose behaviour is exclusively program controlled. In general, safety licensing is still denied for highly safety critical systems relying on software of non-trivial complexity.
To provide a remedy for this unsatisfactory situation, the architecture of a customised real-time computer control system was developed and a prototype built, which can carry out safety related functions within the framework of distributed process control systems or programmable logic controllers. It explicitly supports sequence controls as defined in the standard IEC 1131-3 (1992) and required by many automation programs, including safety related ones. The architecture can be safety licensed in its entirety by exploiting the intrinsic properties of a special, but not untypical, case that has been identified in industrial control problems. Here the complexity turns out to be manageable, because we restrict our attention to rather simple computing systems in the form of PLCs, and because there exist application domains demanding only software of limited variability, which may be implemented in a well-structured way by interconnecting carefully designed and rigorously verified "software ICs". The architecture features full temporal predictability, determinism, and supervision of program execution and of all other activities of the computer system, and it supports the software verification method of diverse back translation as devised by Krebs and Haspel (1984). By closing the semantic gap between software requirements and hardware capabilities, it relinquishes the need for compilers and operating systems, which themselves cannot be safety licensed.

THE SOFTWARE ENGINEERING PARADIGM

Standardisation bodies of the chemical industry have identified and defined in the VDI/VDE-Richtlinie 3696 (1995) a set of 67 application specific function modules suitable for formulating - on a very high level, employing the graphical "Function Block Diagram" and "Sequential Function Chart" languages defined by the IEC International Standard 1131-3 (1992) - the large majority of the automation problems that occur. Written in the IEC 1131-3 high level language "Structured Text", the source code of these software modules usually does not exceed two pages in length. Therefore, their correctness can be formally proven, e.g., using predicate calculus, but also by symbolic execution or, in some cases, even complete testing. This analysis of process automation suggests the introduction of a new programming paradigm, as exemplified in Figure 1, viz., to graphically compose software out of high level, user oriented building blocks instead of out of low level, machine oriented ones. Essentially, for any application area there are specific sets of basic function modules, although certain functions like analogue and digital input and output have general relevance. For the formulation of automation applications with safety properties, basic function modules are only interconnected with each other, i.e., single basic functions are invoked one after the other and, in the course of this, they pass parameters. Besides the provision for constants as external input parameters, the basic functions' instances and the parameter flows between them are the only language elements used on this programming level. Owing to the simple structure this logic is able to assume, the corresponding object code contains no features other than sequences of procedure calls and some internal moves of data.
Figure 1: A graphically formulated process control program
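To make the paradigm concrete, the following minimal sketch (ours, not the authors'; all block names and values are invented for illustration) shows an application program that consists of nothing but instances of pre-verified function blocks and the parameter flow between them, mirroring Figure 1.

```python
# Minimal sketch of the function block paradigm: the application level
# contains only block invocations plus parameter passing -- no loops, no
# arithmetic, no machine-oriented constructs.

def analog_in(channel, process_image):
    """Pre-verified input block: fetch a value from the cyclic process image."""
    return process_image[channel]

def limit(x, lo, hi):
    """Pre-verified limiter block."""
    return min(max(x, lo), hi)

def analog_out(channel, x, process_image):
    """Pre-verified output block: latch a value into the process image."""
    process_image[channel] = x

def application_cycle(process_image):
    # The whole application program is just a wiring of block instances.
    raw = analog_in("AI3", process_image)
    safe = limit(raw, 0.0, 100.0)
    analog_out("AO1", safe, process_image)

image = {"AI3": 123.4, "AO1": None}
application_cycle(image)   # image["AO1"] is now 100.0 (clamped)
```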
Many automation programs, including safety related ones, have the form of sequence controls composed of steps and transitions. Hence, linear sequences of steps and alternative branches of such sequences, as shown in Figure 2, need to be architecturally supported. Parallel branches in sequential function charts should either be implemented by hardware parallelism or already be resolved by the application programmer in the form of explicit serialisation. While in a step, an associated program, called an action, developed according to the above paradigm, is being executed. Also, for the sake of a clear concept, for easy comprehension and verification, and in order not to leave the Petri net concept of sequence controls, we only permit the use of non-stored actions. All other types of actions as defined in IEC 1131-3 can be expressed in terms of non-stored ones and re-formulated sequential control logic.

THE SAFETY ORIENTED ARCHITECTURE

Since our objective is not to save hardware costs, but to facilitate the understandability of implemented software and of its execution process, we designed an architecture, shown in Figure 3, with, conceptually, two different processors: a control flow processor (master) and a basic function block processor (slave). These two processors are implemented by separate physical units. Thus, we achieve a clear physical separation of concerns: execution of the basic function modules in the slave processor, and all other tasks, i.e., execution control, sequential function chart processing, and function module invocation, assigned to the master. This concept implies that the application code is restricted to the control flow processor, on which the project specific safety licensing can concentrate. To enable the detection of faults in the hardware, a dual-channel configuration is chosen, as displayed in Figure 4, which also supports diversity in the form of different master processors and different slave processors. All processing is performed simultaneously on two processors each, and all data communicated are subject to comparison. The basic function processors perform all data manipulations and take care of the communication with the environment. The master and slave processors communicate with each other through FIFO-queues. Clearly, the masters' and slaves' programs, though co-ordinated via communication, can be separated. This separation makes it possible to transfer data access and data protection issues from software to hardware, thus increasing the controller's dependability. The master and slave processors execute programs in co-ordination with each other as follows. The master processors request the slaves to execute a function block by sending the latter's identification and the corresponding parameters and, if need be, also the block's internal state values via one of the FIFO-queues to the slave processors. There the object program implementing the function block is performed, and the generated results and new internal states are sent to the master processors through the other FIFO-queue. The elaboration of the function block ends with fetching these data from the output FIFO-queue and storing them in the masters' RAM memories.
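The master/slave co-ordination just described can be sketched as follows. This is a hedged software illustration, not the actual hardware protocol: the block table, message format and single-channel simplification are our assumptions, and the dual-channel comparison is omitted for brevity.

```python
import queue
import threading

# Invented block table: block 1 is a simple adder with pass-through state.
BLOCKS = {1: lambda params, state: (params["a"] + params["b"], state)}

request_fifo, result_fifo = queue.Queue(), queue.Queue()

def slave():
    """Memoryless function co-processor: executes blocks on request."""
    while True:
        block_id, params, state = request_fifo.get()      # blocks while FIFO empty
        result, new_state = BLOCKS[block_id](params, state)
        result_fifo.put((result, new_state))              # results + new internal state

threading.Thread(target=slave, daemon=True).start()

def master_invoke(block_id, params, state):
    """Master side: send identification, parameters and state; fetch results."""
    request_fifo.put((block_id, params, state))
    result, new_state = result_fifo.get()   # stored back in the master's RAM
    return result, new_state

print(master_invoke(1, {"a": 2, "b": 3}, state=None))   # -> (5, None)
```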
Figure 2: A sequential function chart
The results and internal states are stored in the masters' memories. The slaves' memories, if needed at all, are only used temporarily while elaborating function blocks. Hence, the slaves may be viewed as memoryless function co-processors or dedicated calculators. A number of fail safe comparators, checking the outputs from the master processors before they reach the slaves and vice versa, completes a fault detecting two-channel configuration. Any inequality detected by the comparators generates an error signal (see below), which stops the controller and sets the outputs to safe states. These states are provided by fail safe hardware. To prevent any modification by malfunctions, there is no program RAM; instead, all programs are provided in read only memories (ROMs). The code of the basic function modules resides in mask programmed ROMs, which are produced under the supervision of and released by the licensing authorities, after the latter have rigorously established the correctness of the modules and their translation into object code. On the other hand, the sequences of module invocations together with the corresponding parameter passing, representing the application programs at the architectural level, are written into (E)PROMs by the user. This part of the software is subject to project specific verification, again to be performed by the licensing authorities, which finally still need to install and seal the (E)PROMs in the target process control computers. The master/slave configuration was chosen to physically separate two system parts from one another: one whose software needs to be verified only once, and another executing the application specific software.
Figure 3: Concept of a PES for safety related control

Besides program memory, the masters' address spaces also comprise RAM memory and FIFO input/output registers, command registers, two step registers each (viz., step identifier and step initial address), and transition condition registers. Furthermore, there are program counters and single-bit step-clock-occurred registers, which are not programmer accessible. Additionally, other units are memory mapped into the masters' address spaces to create and receive control signals for the access of ROM, RAM, and FIFO-queues. We have implemented the master using a Field Programmable Gate Array (FPGA) manufactured by Xilinx (1994). Such an FPGA VLSI design is suitable for formal verification. Verified hardware assures correct execution of logically sequenced instructions. For the envisioned purpose just two instructions are required, viz., MOVE and STEP. The MOVE instruction has two operands, which directly point to locations in the address space. Thus, the memories and the above mentioned registers can be read and written. A read from a FIFO input register implies that the processor has to wait while the input FIFO-queue register is empty. In case of writing into an output FIFO-queue register, the processor likewise has to wait while the register is full. Execution of a MOVE implies program counter incrementation.
The programs executed by the master processors consist of sequences of steps. Behind the program segment of each step, a STEP instruction with a next-step-address as operand is inserted, which checks whether the segment was executed within a step cycle frame or not. The step cycle is a periodic signal generated by the system clock, establishing the basic time reference for PLC operation. The length of the cycle is selected so as to accommodate within its duration the execution of the most time consuming step occurring in an application (class). If the execution of a segment does not terminate within a step cycle, an error signal is generated, which indicates an overload situation or a run time error. Then program execution is stopped immediately, and suitable error handling is carried out by external fail safe hardware. Normally, however, segment execution terminates before the instant of the next step cycle signal. Then the processors wait until the end of the present cycle period. When the clock signal finally occurs, the step-clock-occurred registers are set. According to the contents of the transition condition registers, it is decided whether the step segment is executed once more, or whether execution of the logically subsequent step is commenced, i.e., whether the program counters are re-loaded from the step-initial-address registers, or whether another segment's initial program address is loaded from the STEP instruction's operand, the next-step-address. Since only one step is active at any given time, and since program branching is only possible in this restricted form within the framework of executing STEP instructions, this mechanism very effectively prevents erroneous access to code of other (inactive) steps as well as to program locations other than the beginnings of step segments. The design objective in providing FIFOs is to implement easily synchronisable and understandable communication links, which decouple the master and slave processors with respect to their execution speeds. The FIFO-queues consist of a fall-through memory and two single-bit status registers each, viz., FULL and EMPTY, indicating the filling states of the FIFOs. The status registers are not user accessible. They are set and reset by the FIFO control hardware and, if set, they cause a MOVE to a FIFO's input port or from an output port, respectively, to wait until space in the FIFO becomes available or data arrive. The comparison for equality of the outputs from the two master processors and of the inputs from the two slave processors, respectively, is carried out by two fast comparators placed into the FIFO-queues. Since the responsibility for detecting errors in the system rests on these comparators, they need to meet high dependability requirements and are therefore implemented in fail safe technology as described by Schuck (1987). A comparator is connected to two FIFOs' outputs. The first data elements from each input queue are latched and subsequently compared with each other. If the two latches do not hold the same value, an error signal is generated, which stops the operation of the entire system. Otherwise, the value is transferred into both output FIFOs.
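The comparator principle lends itself to a compact illustration. The sketch below is a software analogue of our own devising, not the fail-safe hardware of Schuck (1987): it latches the first element of each channel's FIFO, forwards the value only on agreement, and otherwise raises the global error signal that forces the safe state.

```python
# Sketch of the fail-safe comparison principle on two channel FIFOs.

SAFE_OUTPUT = 0x00   # safe state (here simulated; provided by hardware in the PES)

class GlobalError(Exception):
    """Stands in for the global error signal that stops the controller."""

def compare_and_forward(fifo_a, fifo_b, output_fifo):
    a, b = fifo_a.pop(0), fifo_b.pop(0)   # latch first element of each queue
    if a != b:
        raise GlobalError("channel disagreement")   # stops the entire system
    output_fifo.append(a)                 # equal values are passed on

out = []
try:
    compare_and_forward([0x2A], [0x2A], out)   # channels agree: value forwarded
    compare_and_forward([0x2A], [0x2B], out)   # channels disagree
except GlobalError:
    out = [SAFE_OUTPUT]                        # system stop, safe state output
print(out)   # [0]
```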
Communication with external technical processes takes place through fault detecting input/output driver units attached to the slave processors. Output data words generated by the two slaves are first checked for equality in a HIMA-Planar fail safe comparator of P. Hildebrandt GmbH Co KG (1992) and, subsequently, they are latched in an output port. If the output data are not identical, an error signal is generated, leading to a system stop. To achieve full determinism of execution time behaviour, the basic cycle was introduced as the maximum step execution period. Although it determines exactly and a priori the cyclic execution of the single steps, the processing instants of the various operations within a cycle may still vary and thus remain undetermined. Since a precisely predictable timing behaviour is only important for input and output operations, temporal predictability is achieved as follows. Digital input data are read by the drivers at the beginning of each cycle and stored en bloc in two independent RAM buffers assigned to the respective slaves. The cycle start is signaled by the step-clock-occurred register. Only after that are the data made available for further processing, thus providing predictability in timing. Following a STEP command from the masters, the slaves may access the data at any time during the cycle. Our controller prototype has 264 digital inputs. The input driver for all of these signals is implemented using an FPGA, just like the masters. This decreases the number of chips required and simplifies the circuit board. The output driver of the prototype has two independent 8-bit registers assigned to the slaves. Output data bytes generated by the slaves are latched in the registers at the end of every cycle (also signaled by the step-clock-occurred register). The data are first checked for equality in a HIMA-Planar fail safe comparator and,
subsequently, they are latched in an output port, becoming effective for the environment. Hence, the prototype has 8 digital outputs (although expansion to further banks of 8 digital outputs each is not a problem). However, if the output bytes are not identical, an error signal is generated, leading to a system stop. Safe states are then output by HIMA-Planar hardware.
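The temporally predictable scan cycle can be paraphrased in a few lines of Python. This is our sketch only: the cycle length and helper names are invented, and the busy wait merely stands in for the hardware step clock.

```python
import time

CYCLE = 0.010   # basic step cycle in seconds (application dependent)

def scan_cycle(read_inputs, execute_step, write_outputs):
    """One step cycle: I/O instants are fixed regardless of when the step's
    work actually completes inside the cycle."""
    deadline = time.monotonic() + CYCLE
    image = read_inputs()          # inputs latched en bloc at cycle start
    outputs = execute_step(image)  # may finish at any instant within the cycle
    if time.monotonic() > deadline:
        raise RuntimeError("step overran its cycle: error signal, safe state")
    while time.monotonic() < deadline:
        pass                       # wait for the step clock signal
    write_outputs(outputs)         # outputs latched at cycle end
```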
Figure 4: Block diagram of fault detecting master-slave PLC

The FIFO-queue and output comparators mentioned above are components of a global comparator unit, which also receives operation monitoring signals from processor watch-dog timers and correctness signals from other units. Based on all these, a global correctness signal, in other words a negated global error, is generated and fed back to all units of the controller. Naturally, each unit can only operate while this signal indicates "no error". Otherwise, the units stop and the controller outputs are set to safe states. The global error signal is also output, allowing external hardware to be triggered as well.

SOFTWARE SAFETY LICENSING

First of all, the elements of an employed function block set are rigorously verified with appropriate formal methods. This takes place together with the safety licensing of the hardware, before any such system is put into service. The details of the function blocks' implementation on the slave processors are part of the architecture and thus remain invisible from the application programming point of view. Application software is safety licensed by subjecting the object code loaded into the master processors to diverse back translation, a verification method developed in the course of the Halden experimental nuclear power plant project by Krebs and Haspel (1984). This technique consists of reading machine programs out of computer memory and giving them to a number of teams working without any mutual contact. Working entirely by hand, these teams disassemble and decompile the code, from which they finally try to regain the specification. A safety licence is granted to a software product if its original specification agrees with the inversely obtained re-specifications. Of course, in general this method is extremely cumbersome, time consuming, and expensive. This is due to the semantic gap between a specification formulated in terms of user functions on one hand and the usual machine instructions carrying them out on the other. Applying the programming paradigm of basic function modules, however, a specification is directly mapped onto sequences of procedure invocations and parameter passing. It takes only minimum effort to verify a master program by interpreting such code, which just implements a particular module interconnection pattern, and by re-drawing the corresponding graphical program specification.
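Why the method becomes cheap on this architecture can be illustrated with a toy back translator. This is our sketch; the opcode names and encoding are invented, not the machine's actual instruction format. Since master object code is nothing but a module interconnection pattern, the block diagram can be regained from it almost mechanically.

```python
# Invented object-code listing: calls to pre-verified blocks plus one STEP.
object_code = [
    ("CALL", "ANALOG_IN",  ["AI3"],        "t1"),   # invoke block, result -> t1
    ("CALL", "LIMIT",      ["t1", 0, 100], "t2"),
    ("CALL", "ANALOG_OUT", ["AO1", "t2"],  None),
    ("STEP", "next_step_address"),
]

def back_translate(code):
    """Regain the graphical specification: blocks plus parameter flow."""
    for instr in code:
        if instr[0] == "CALL":
            _, block, args, result = instr
            wires = ", ".join(str(a) for a in args)
            print(f"block {block}({wires})" + (f" -> {result}" if result else ""))
        elif instr[0] == "STEP":
            print(f"end of step, successor at {instr[1]}")

back_translate(object_code)
```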
Diverse back translation is especially well suited for verifying the correct implementation of graphically specified programs on the architecture introduced above. This is due to the following reasons:

1. The method is essentially informal, easily comprehensible, and immediately applicable without any training. Thus, it is extremely well suited for use on the application programming level by people with the most heterogeneous educational backgrounds. Its ease of understanding and use inherently fosters error-free application.

2. The effects of highly complex utility and compiler-like programs, whose correctness cannot be established rigorously, are verified as well.

3. Since graphical programming based on application oriented function blocks has the quality of a specification level problem description, and because by design there is no semantic gap in our architecture between the levels interfacing to humans and to the machine, diverse back translation leads back in one easy step from machine code to the problem specification.

4. For our architecture, once a certain set of function blocks has been formally proven correct, the effort required to employ diverse back translation for the safety licensing of application programs is several orders of magnitude smaller than for the von Neumann architecture.

CONCLUSION

This paper addresses a pressing problem. It does not present a solution to all open questions in safety related computing, but it makes a practically useful beginning applicable to a wide class of industrial control problems. In a constructive way, and using only available methods and hardware technology, a computer architecture was devised enabling the safety licensing of complete PESs including the software. The concept's main achievement is that the otherwise cumbersome and expensive software verification method of diverse back translation becomes feasible through architectural support. It is hoped that the approach will ultimately lead to the replacement of discrete or relay logic by PESs executing licensed software to implement safety critical functions in process control. A prototype has been built in the framework of a doctoral thesis. Once the ongoing intensive tests have been completed, the prototype will be employed in emergency shutdown systems and as part of distributed automation systems for chemical plants.

REFERENCES
P. Hildebrandt GmbH Co KG (1992). Fail Safe Electronic Controls, Tech. Info 92.08.
IEC International Standard 1131-3 (1992). Programmable Controllers, Part 3: Programming Languages, International Electrotechnical Commission, Geneva.
Krebs, H. and Haspel, U. (1984). Ein Verfahren zur Software-Verifikation. Regelungstechnische Praxis 26, 73-78.
Schuck, H. (1987). Analoger Fensterkomparator in Fail-safe-Technik, Technische Universität Braunschweig.
VDI/VDE-Richtlinie 3696 (1995). Vendor Independent Configuration of Distributed Process Control Systems, Beuth Verlag, Berlin.
Xilinx (1994). XACT Software User Guide.
SYSTEMIC FAILURE MODES: A MODEL FOR PERROW'S NORMAL ACCIDENTS IN COMPLEX, SAFETY CRITICAL SYSTEMS
R.J. Collins¹ and R. Thompson²
¹ Sentient Systems Limited, 1 Church Street, The Square, Wimborne, Dorset, BH21 1JH, UK
² 17 Belmore Road, Lymington, Hants, SO41 3NU
ABSTRACT

In 1984 Charles Perrow produced the book 'Normal Accidents', which presented the 'Normal Accident Hypothesis' (NAH). The NAH states that there are inevitable failure modes of complex, highly coupled systems that are not predictable and hence not preventable. Complexity Theory provides us with tools for understanding systems that exhibit emergent behaviour; in other words, behaviours that arise spontaneously as a product of complexity and do not admit to reductionist explanations. The authors argue that complexity theory provides a framework within which the NAH can be understood. The authors introduce the term 'Systemic Failure Modes' (SFM), defined to be undesirable behaviours of systems that emerge as a function of system complexity and that are not reducible to smaller (more atomic) constituent components. SFM form a sub-set of Perrow's Normal Accidents and are accessible to the production of testable models. The authors review the fields of complexity theory and of general systems theory and re-state the Normal Accident Hypothesis in these terms.
KEYWORDS
Normal Accidents, Complex Systems, Safety, Systemic, Chaos
INTRODUCTION: THE NORMAL ACCIDENT HYPOTHESIS

In 1984 Charles Perrow produced the book Normal Accidents: Living with High-Risk Technologies. The book has been most commonly referenced as a large collection of case-studies of accidents (for example, see Reason (1990) and Tenner (1996)). The book does contain a wide range of case-study material relating to accidents and disasters; however, we have been able to find no reference that treats the central thesis of the text. It is Perrow's central thesis, the 'Normal Accident Hypothesis', with which this paper deals. The Normal Accident Hypothesis (NAH) is succinctly expressed in Perrow's own words:

"[There are] characteristics of high-risk technologies that suggest that no matter how effective conventional safety devices are, there is a form of accidents that is inevitable"

and...
"If interactive complexity and tight coupling - system characteristics - inevitably will produce an accident, I believe we are justified in calling it a 'normal accident', or a 'system accident'. The odd term 'normal accident' is meant to signal that, given the system characteristics, multiple and unexpected interactions of failures are inevitable. This is an expression of an integral characteristic of the system, not a statement of frequency."
The Normal Accident Hypothesis is controversial indeed, and this may explain the failure of other authors to engage with the subject. We consider that it does merit analysis, since the conclusion that Perrow draws from it is of such great importance. We believe that efforts should be made either to debunk the hypothesis and move on, or otherwise to consider the implications that would arise from the hypothesis being substantiated. There are other features of the book Normal Accidents that may have prevented it from being more fully addressed by the scientific and engineering communities. Firstly, an attempt is made by Perrow to substantiate the Normal Accident Hypothesis by reference to a large body of case study material. The readings are certainly suggestive that some untreated and unpreventable mechanism underlies a wide range of accidents. However, suggestion is not proof. Perrow is a Professor of Sociology, and it has been pointed out that a major difference between the social sciences and the physical sciences is the differing emphasis on empirical and observational evidence (Shipman 1982). Problems in the social sciences do not admit easily to experimentation, and thus sociologists may be constrained to present many arguments on the basis of observation and inferential explanation. Such explanations are less familiar and less convincing to scientists and engineers more used to empirical evidence. In this paper we will attempt to provide some analytical substantiation for the Normal Accident Hypothesis. Whereas Perrow has provided the high-level indicators of the phenomenon, we intend to indicate some low-level mechanisms by which it might arise. We shall attempt to provide supporting evidence for the hypothesis based on the emerging field of complexity theory. Our assertion is somewhat stronger even than Perrow's. His view seems to be that in complex, highly coupled systems, the inevitable faults of component parts are bound to precipitate in unpredictable, and ultimately catastrophic, ways. We include these cases but consider also situations in which no individual component failure occurs and yet the system as a whole manifests a 'pathological' failure behaviour. In the following section we consider the history of thought concerning systems and the emerging field of complexity theory. We review the arguments supporting the notion that there may well be emergent properties of systems (in this case failure mechanisms) that are not traceable to the individual system components.
SYSTEM THINKING

The well-worn phrase 'the whole is more than the sum of its parts' occurs frequently in texts on systems theory and design (von Bertalanffy 1968; Meister 1991) and has been attributed by Koestler (1978) to have originated from Smuts:

"A whole, which is more than the sum of its parts, has something internal, some inwardness of structure and function, some specific inner relations, some internality of character of nature which constitutes that more" (Smuts 1926)

Mattessich (1982) has argued that the 'Systems' approach is a direct embodiment of the holistic paradigms of philosophers as diverse as Lao-tse, Heraclitus, Leibniz, Vico, Hegel, Marx, Whitehead, Driesch and others. We shall attempt to deconstruct the somewhat flowery language of Smuts and re-interpret it in line with an analysis of complex systems. We shall explore the nature of the 'more' that Smuts refers to in the context of safety and failure analysis. Smuts attempted to express this concept algebraically, and we would reformulate his notion thus:
\[
\text{Whole} = \sum_{i=1}^{n} \text{part}_i + x \qquad (1)
\]
A key argument of our thesis is that the 'x' in Eqn. 1 does exist in some meaningful, demonstrable manner. The general notion that systems exhibit properties over and above those of their constituent parts has been widely attributed to Hegel¹. This notion includes the view that the parts of a system cannot be understood in isolation from the whole.
REDUCTIONISM AND DECOMPOSITION

A distinction must be made between 'decomposition' and 'reduction'. To decompose is to separate or resolve into constituent parts or elements. Reduction is the process of describing a phenomenon in terms of more 'basic' or 'primitive' phenomena to which the first is then considered to be equivalent. Analysis, as we see it, is often decompositional and yet not always successfully reductive. We will argue that something important may be lost in the process of decomposition, such that the products are not a true reduction of the whole. When we use the word 'analysis' in the sense of 'safety analysis' it is likely that we mean to connote the general process of determining errors and the possibility and consequences of faults. However, 'analysis' has been taken to have an operational meaning:

"'Analytical procedure' means that an entity investigated be resolved into, and hence can be constituted or reconstituted from, the parts put together" (von Bertalanffy 1968)

That is, 'analysis' is by definition reductionist. But as Bertalanffy goes on to say:

"Application of the analytical procedure depends on two conditions. The first is that interactions between 'parts' be non-existent or weak... The second condition is that the relations describing the behaviour of the parts be linear... These conditions are not fulfilled in the entities called systems"
If this view is true, then the term 'System Analysis' is at best an oxymoron and potentially an example of an atomically reduced self-contradictory statement.
REDUCTIONISM AND DECOMPOSITION IN ENGINEERING ANALYSIS

System safety analysis invariably follows a reductionist, decompositional approach. Fault-trees are an exemplar of this general rule. Beginning with specific potential behaviours of a system, they are used to reason about root causes through a process of decomposition. However, only in certain, well defined cases does this decomposition encompass the combination or interaction of failure modes. An exhaustive search of potential failure cases would resemble a maximal fault-tree in which all combinations of all possible failure modes were considered. Clearly this is not practical for any but the most trivial system. The argument for this is that the 'combinatorial explosion' problem resists analysis of permutations of failures.
¹ We have been unable to find an accessible reference to substantiate this attribution. It would appear (rather self-referentially) that the concept of "whole greater than parts" is not expressed in this form in the writings of Hegel. Rather, it seems to be encoded in some complex, global manner not accessible to the reductionist techniques available to non-philosophers.
It has been argued that both Fault Tree Analysis (FTA) and Failure Modes and Effects Criticality Analysis (FMECA) exhibit both inductive and deductive properties (Collins and Leathley 1995). The reasoning processes involved in these analyses involve generalisation and specialisation steps to varying degrees. More importantly, the success of both types of analysis is predicated on the applicability of a reductionist approach. The other major techniques of safety analysis are essentially reductionist: Zonal Analysis (ZA) considers the physical proximity of components and the unintended energy or information flow between them. Common mode failure analysis (CMFA) considers shared susceptibilities between components. Systematic design faults are considered, but as mentioned previously, the use of the word 'systematic' refers to the process of design and the ubiquitous occurrence of the error or fault in diverse parts of a system, rather than a 'pathology' of the system structure as a whole. Such analyses can intersect the functional or failure domain hierarchies considered by FTA and FMECA, but they still act in a reductionist manner since they impose their own structural hierarchies. For ZA this hierarchy is one of physical proximity. For CMFA the hierarchies may be of location, of design, of manufacture, etc. It should also be considered that not only are these techniques characteristically reductionist, they also operate within the domain of failure itself. We can at least conceive of a system failure that is a product of the interaction of the component parts of a complex system under conditions such that none of the individual components has failed. It is this class of high-level failures that we term 'Systemic'. It is not surprising that reductionist, decompositional thinking predominates in system safety analysis, since it has dominated science as a whole since the 17th century (Meister 1991). However, there is a rejuvenation of interest in more holistic concepts of systems as a result of pressure from problems that simply do not admit to a reductionist approach (Casti 1979). We should now consider whether any weight can be given to the notion of Systemic failure modes, that is, failures arising at the system level that are not attributable to behaviour (failures) at a lower level of abstraction. If such failures are inconceivable, then it will not concern us too much if our analysis methods are exclusively reductive. However, if credence can be given to the concept, then some effort will be required to review the safety analyst's dependence on decompositional, reductionist techniques.
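The scale of the combinatorial explosion is easy to make concrete; the following arithmetic is our illustration, not from the original text:

```latex
% Number of failure-mode combinations a 'maximal' fault tree would face,
% given n independent component failure modes:
\[
  N = \sum_{k=1}^{n} \binom{n}{k} = 2^{n} - 1,
  \qquad \text{e.g. } n = 50 \;\Rightarrow\; N = 2^{50} - 1 \approx 1.1 \times 10^{15}.
\]
```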
COMPLEXITY AND CHAOS

Casti has referred to the three 'C' words of system theory: 'Connectivity', 'Complexity' and 'Catastrophe' (Casti 1979). We take Perrow's 'Coupling' to be synonymous with Casti's 'Connectivity' and add to his list 'Chaos' (Casti makes a similar addition in his later work (Casti 1994)). Casti's book Complexification provides a catalogue of arguments against the reductionist approach:

"[The] reason for trying to create a science of the complex is to get a handle on the limits of reductionism as a universal problem solving approach" (Casti 1994, p. 273)

Casti describes 'Complexification' as the 'Science of Surprise', and this relates to our consideration of the Normal Accident Hypothesis. Unpredicted failure modes of systems are surprises indeed. Complexity theory concerns a number of properties of systems that do not admit to a reductionist approach, such as emergent behaviour, chaotic behaviour and 'deterministic randomness'. It is the recognition of these phenomena that leads to the central argument of this paper: that it may be possible to use the ideas of complexification to build computational models that lend weight to the Normal Accident Hypothesis. The next sections briefly describe what is meant by the terms 'Complexity' and 'Chaos', which have a particular bearing on the sections that follow.
Complexity

Perrow does not provide a formal definition of 'complexity' within the book Normal Accidents. 'Complexity' has become a much over-used term in recent years and often eludes definition (von Neumann 1966). For example, readers of Dealing with Complexity, a text on systems science by Flood and Carson (1988), are left to infer what 'complexity' actually means from a series of observations such as "complex situations are often partly or wholly unobservable". Within the field of complexity theory the most commonly quoted definition of complexity is that of algorithmic complexity, due to Kolmogorov (1965) and Chaitin (1966; 1970; 1974; 1982)². This definition relates specifically to the complexity of a string of binary digits, although it can be generalised to other systems. It holds that a measure of complexity is provided by the shortest algorithm that can produce the string. In other words, it is equivalent to the most dense coding of the information used to describe a particular system. This definition is attractive at a theoretical level, since it leads directly to a number of interesting conclusions in number theory (for example, that most real numbers are 'random', i.e. not producible algorithmically by any program significantly shorter than the number itself). However, it is of little practical value to the engineer, since this 'measure' of complexity cannot, in general, be computed. Bennett (1990) has gathered a diverse collection of definitions of the term 'complexity' applicable to physical and biological systems. These definitions refer to such properties as high free energy; the ability of a system to be programmed to perform like a Universal Turing Machine; the existence of long-range order in the system; and 'Thermodynamic Depth', the amount of entropy produced in the system's evolution. Each of these is theoretically sound but of little practical value to the engineer, who needs a usable metric for complexity. In the field of Software Engineering the McCabe complexity measure (amongst others) is used to measure the complexity of computer programs (McCabe 1976). This measure is based on a graph representing the control flow of the program; it counts decision paths and loops and is related to the difficulty of testing the program effectively. It seems likely that systems engineers require similar measures of complexity for real physical systems. In safety related systems an important aspect of complexity is likely to be the 'cognitive' complexity of a system or a situation. This seems only tenuously linked with the pure, mathematical and physical definitions of complexity previously mentioned. For example, a human being may battle for an extended period with a Chinese string puzzle in an attempt to separate one closed loop from another. Topologically, however, the problem is trivial, since the two closed loops are never 'joined' in a mathematical sense. What a human being finds hard or is confused by is likely to have as much to do with the function of the human brain as with the domain of the problem. We can do no more than alert the reader to the difficulties associated with the term 'complexity'. Like Perrow, we shall side-step this issue and adopt the intuitive, 'dictionary' definition of complexity rather than a formal one.
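The McCabe measure is simple enough to state in a few lines of code. The sketch below is our illustration of the standard cyclomatic formula V(G) = E - N + 2P (edges, nodes, connected components), applied to an invented control-flow graph; it is not taken from McCabe's paper.

```python
def cyclomatic_complexity(edges, num_components=1):
    """McCabe cyclomatic complexity V(G) = E - N + 2P of a control-flow graph."""
    nodes = {n for e in edges for n in e}
    return len(edges) - len(nodes) + 2 * num_components

# Invented example: a loop whose body contains an if/else.
# entry -> test -> (then | else) -> join -> back to test -> exit
cfg = [("entry", "test"), ("test", "then"), ("test", "else"),
       ("then", "join"), ("else", "join"), ("join", "test"),
       ("test", "exit")]
print(cyclomatic_complexity(cfg))   # 7 edges - 6 nodes + 2 = 3
```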
² Curiously, by Chaitin's own admission, this definition was developed and published by Kolmogorov several years before Chaitin published his own work. In a curious reversal of normal academic procedure, Chaitin appears to have been adopted as the 'inventor' of this important definition. There seems to be no good reason for this other than that Chaitin is (1) American and (2) alive, whilst Kolmogorov is (1) Russian and (2) dead.
Chaos
'Chaos' has been adopted as the short-hand term for the behaviour of non-linear dynamical systems more formally referred to as deterministic unpredictability (Gleick 1987; Hilborn 1994). Chaos refers to the situation in which the future behaviour of a system is difficult to predict over a long period because it depends on arbitrarily small variations in the current state. Since it is impossible to observe the current state of a system with infinite accuracy, future behaviours cannot be predicted with accuracy beyond a certain point. In technical terms, a chaotic system is one in which trajectories through the system state-space diverge exponentially from each other (up to some overall limiting boundary conditions for the system). Such divergence may be measured using the Lyapunov exponent, a statistic developed to measure chaotic behaviour. If two paths start close together with a separation d₀ at time t = 0 and diverge so that their separation at time t satisfies the expression:
\[
d(t) = d_0\, e^{\lambda t} \qquad (2)
\]
then the parameter λ is called the Lyapunov exponent for the trajectories. If λ is positive, the behaviour is considered to be chaotic.
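A rough numerical illustration (ours; the map, parameters and step count are chosen purely for convenience) estimates λ for the logistic map from the divergence of two nearby trajectories, following Eqn. 2:

```python
import math

def lyapunov_estimate(r, x0, d0=1e-8, steps=20):
    """Estimate lambda ~ (1/t) ln(d(t)/d0) for the logistic map x' = r*x*(1-x),
    by iterating two trajectories started a distance d0 apart."""
    x, y = x0, x0 + d0
    for _ in range(steps):
        x, y = r * x * (1 - x), r * y * (1 - y)
    return math.log(abs(y - x) / d0) / steps

print(lyapunov_estimate(4.0, 0.2))   # positive (roughly ln 2): chaotic regime
print(lyapunov_estimate(2.5, 0.2))   # negative: nearby trajectories converge
```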
THE IMPLICATIONS OF CHAOTIC BEHAVIOUR

A direct consequence of chaotic behaviour is that the long term future states of a system are impossible to assess from observations of starting conditions and past paths. Although predictions of future behaviour can be made for a short time ahead, the accuracy of predictions reduces rapidly with the period of time for which the prediction is made. Such behaviour might have serious consequences for human operators involved in the control of such systems. As systems move into a chaotic region of their behaviour, the computational effort associated with 'correct' control decisions increases exponentially. In other words, systems become essentially uncontrollable by the normal mechanisms. In Normal Accidents Perrow provides a number of case studies of shipping accidents in which the paths of the ships involved were both 'pathological' and clearly unpredictable to the captains involved. These types of case study provide suggestive evidence for systemic failure modes that might be tested through observation and through the production of computational models.
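A one-line consequence of Eqn. 2 makes this quantitative (our derivation, under the stated exponential-divergence assumption): if the initial state is known to accuracy d₀ and the tolerable prediction error is Δ, then the usable prediction horizon grows only logarithmically as measurement accuracy improves:

```latex
% Prediction horizon implied by Eqn. 2: solve d_0 e^{\lambda t} = \Delta for t.
\[
  t_{\max} = \frac{1}{\lambda}\,\ln\frac{\Delta}{d_0},
\]
% so a tenfold improvement in measurement accuracy (d_0 -> d_0/10) buys only
% a fixed increment (\ln 10)/\lambda in prediction horizon.
```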
CONCLUSION

This paper has reviewed the emerging field of complexity theory and the field of general systems theory with respect to system level failure modes. We have termed these failure modes 'Systemic' and have argued that they provide a testable explanation of some of the events termed 'Normal Accidents' by Perrow. Central to our thesis has been the consideration of the system level thinking that is characteristic of Perrow's original argument. This has been contrasted with the decompositional, reductionist mode commonly adopted in failure and safety analysis. The term 'Systemic Failure Mode' refers to system level failure modes that are not products of the failure modes of the constituent components of a system. We have shown that a feature of complex systems is that they may exhibit 'emergent' behaviours. We have argued that pathological behaviour at the system level might also be a potential emergent property of safety critical systems. This is our expression of Perrow's Normal Accidents. By equating certain system level failures with pathological emergent behaviours of complex systems, we have arrived at a position where a testable model might be constructed. If Systemic Failure Modes exist, then it should be possible to build computational models that exhibit these types of pathological emergent
behaviours. This paper has presented the philosophical and theoretical basis for the existence of Systemic Failure Modes. A second paper by the authors presents a computational model of such failure modes, based on the arguments presented here (Collins and Thompson, 1997).
REFERENCES
Bennett, C.H. (1990). How to Define Complexity in Physics and Why. In: W.H. Zurek (Ed.), Complexity, Entropy and the Physics of Information, SFI Studies in the Sciences of Complexity Vol. VIII. Addison Wesley: Redwood City, CA.
Casti, J. (1979). Connectivity, Complexity and Catastrophe in Large-Scale Systems. John Wiley and Sons: Chichester.
Casti, J.L. (1994). Complexification: Explaining a Paradoxical World through the Science of Surprise. Abacus: London.
Chaitin, G. (1966). On the Length of Programs for Computing Finite Binary Sequences. Journal of the Association for Computing Machinery 13:4, 547-569.
Chaitin, G. (1970). On the Difficulty of Computations. IEEE Transactions on Information Theory IT-16, 5-9.
Chaitin, G. (1974). Information Theoretic Computational Complexity. IEEE Transactions on Information Theory IT-20, 10-15.
Chaitin, G. (1982). Algorithmic Information Theory. Encyclopaedia of Statistical Sciences, Volume 1, 38-41. Wiley: New York.
Collins, R.J. and Leathley, B. (1995). Psychological Predispositions to Errors in Safety, Reliability and Failure Analysis. Safety and Reliability 14:3, 6-42.
Collins, R.J. and Thompson, R. (1997). Searching for Systemic Failure Modes. ESREL 97.
Flood, R.L. and Carson, E.R. (1988). Dealing with Complexity: An Introduction to the Theory and Application of Systems Science. Plenum Press: New York.
Garfinkel, A. (1993). Reductionism. In: R. Boyd, P. Gasper and J.D. Trout (Eds.), The Philosophy of Science. Massachusetts Institute of Technology: Massachusetts.
Gleick, J. (1987). Chaos: Making a New Science. Abacus: London.
Hilborn, R.C. (1994). Chaos and Non-linear Dynamics: An Introduction for Scientists and Engineers. Oxford University Press: Oxford.
Koestler, A. (1978). Janus: A Summing Up. Hutchinson and Co.: London.
Kolmogorov, A. (1965). Three Approaches to the Definition of the Concept 'Amount of Information'. Problemy Peredachi Informatsii.
Mattessich, R. (1982). The Systems Approach: Its Variety of Aspects. Journal of the American Society for Information Science, November 1982, 383-394.
McCabe, T. (1976). A Complexity Measure. IEEE Transactions on Software Engineering SE-2:4, 308-320.
Meister, D. (1991). Psychology of System Design. Elsevier: Amsterdam.
Perrow, C. (1984). Normal Accidents: Living with High-Risk Technologies. Basic Books: USA.
Reason, J. (1990). Human Error. Cambridge University Press: Cambridge, UK.
Shipman, M. (1982). The Limitations of Social Research. Longman: London.
Smuts, J.C. (1926). Holism and Evolution. MacMillan and Co.: London.
Tenner, E. (1996). Why Things Bite Back: New Technology and the Revenge Effect. Fourth Estate: London.
von Bertalanffy, L. (1968). General Systems Theory: Foundations, Development, Applications. Allen Lane, The Penguin Press.
von Neumann, J. (1966). Theory of Self-Reproducing Automata. University of Illinois Press: Illinois.
THE ESSENTIAL LOGIC MODEL: A METHOD FOR DOCUMENTING DESIGN RATIONALE IN SAFETY CRITICAL SYSTEMS

R.J. Collins
Sentient Systems Limited, 1 Church Street, The Square, Wimborne, Dorset, UK
ABSTRACT

This paper describes a diagramming convention called an 'Essential Logic Model' (ELM). An ELM tells the 'story' of a system and the decisions that were made during its design. The ELM shows the relationship between broad classes of requirements and how these relate to the explanatory concepts developed and used during system development. It is argued that, if safe systems are to be produced and maintained, then there must be harmony in the design decisions made by groups of engineers. An ELM is a didactic tool to be employed between system developers and maintainers to externalise and record a logical series of thought processes that result in a design realisation. Experience suggests that this process of externalisation and documentation serves to highlight errors and inconsistencies in thinking. This seems to be particularly true in the case of large, complex systems, when individuals are unlikely to be 'experts' in all aspects of the system.
KEYWORDS

Software Engineering, Systems Engineering, Safety, Documentation, Diagram, Requirement, Essential Logic Model
RECORDING AND COMMUNICATING IDEAS ABOUT AN EVOLVING SYSTEM DESIGN

A considerable amount of case-study material is available to allow us to conclude that a large fraction of errors in safety-critical system design result from errors in communication between the various groups involved in systems development. For example, in his review of the root causes of safety related software errors, Lutz (1993) has stated:

"Safety related interface faults are associated overwhelmingly with communication errors between a development team and others (often between software developers and systems engineers)." (emphasis added)

It has been argued that engineers are actually predisposed to making certain types of error in the development of safety related systems. Many of these errors are associated with the formation of incorrect concepts about
the system, or failures to successfully communicate ideas from one person or group to another (Collins and Leathley 1995). A key component of communication within the systems development and (design) maintenance environment is the technical documentation that supports it. A final design implementation may be considered the realisation of a design process. However, system design documentation often records 'how a design is' rather than why it is as it is. Project documentation may record the realisation but lose the essence of the logical process that resulted in that realisation. Although a history of change records might show the material evolution of a design, that is not equivalent to capturing the underlying logical threads of evolving ideas within a development group. Rann, Turner and Whitworth (1994) have said:

"...a large part of software development effort is spent on making programs 'evolve' to fit new needs. It is common to find that small changes take much longer to implement than would be expected. One reason for this is that it takes programmers some time to understand the system before they can even think about making changes."

It is considered that this issue is not confined solely to software and may be particularly problematic in the case of safety critical systems developments. Errors may be introduced into designs during the modification and update processes because engineers fail to appreciate the logic behind particular aspects of the design. Skills may be very diverse within an extended development team on a large system development. It may be argued that, to increase harmony between design decisions made by different groups, each group must be able both to communicate and to understand why particular decisions have been made. It is considered that existing documentation methods, which encode what a system must do and how that function is achieved, may fail to record why specific design decisions were made. If system updating efforts are to be consonant with the original design effort, then system engineers must understand why things are the way they are. The Essential Logic Model is a method for documenting the 'why' information of system development. The following text presents a straightforward method for expressing and documenting 'why' things are as they are within a system design.
THE ESSENTIAL LOGIC MODEL

An 'Essential Logic Model' (ELM) tells the 'story' of the system and the decisions that were made during its construction. The ELM shows the relationship between broad classes of requirements and the explanatory concepts developed and used during system development. The diagram also shows the important processes that exist within the system and the 'objects' that the new system will interface to and on which it is dependent. It is often the case that, during some stage of a large system development, requirements for the system as a whole are decomposed and attributed to components of the system. The ELM communicates the reasoning that underlies this decomposition. The ELM shows how the collection of requirements, existing objects, processes and the explanatory concepts applied to them lead on to the new objects proposed as part of the system. It is considered that the ELM finds a natural place within the collection of Object Oriented Analysis (OOA) conventions and techniques. The relationships between the defined objects and the existing system objects
may be documented in an Information Model (IM) or Entity Relationship Diagram (ERD) (Shlaer and Mellor 1992). This diagram and its accompanying documentation show how the various components will relate to each other and highlight their important interfaces. If an ERD shows what a system is, then the ELM shows why the system is as it is.
THE ELM SYMBOLOGY

An ELM consists of four types of 'boxes', as illustrated in Figure 1. In this diagram, 'Objects' are exactly synonymous with the objects of OOA. 'Processes' are synonymous with the processes of Data-Flow Diagrams. 'Requirements' are intended to capture important constraints on system implementation. 'Concepts' are the explanatory notions we introduce to make sense of the system and of the problem domain in general. Each box is given a unique number to facilitate referencing. In an ELM, boxes contain short, explicit textual descriptions. Boxes are joined by lines intended to represent the 'flow of logic'. In other words, the lines are not strongly typed in the sense that they represent data-flow or relationships. Rather, they connote the way in which one 'idea' leads to another. The 'idea' in this case may be an abstract concept, but it may also be an explicit requirement, a process or an object. Accompanying an ELM diagram is a narrative text that provides more details of the 'story' or 'logic' of the design evolution.
Figure 1: The Four Symbols of an Essential Logic Model (An Object, A Process, A Requirement, A Concept)
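As an informal aside (our sketch, not part of the published convention), the four-symbol vocabulary maps naturally onto a small data structure, which suggests how ELMs could be captured by tooling rather than by drawing packages alone:

```python
from dataclasses import dataclass, field

BOX_TYPES = {"object", "process", "requirement", "concept"}

@dataclass
class Box:
    number: int          # unique number to facilitate referencing
    kind: str            # one of BOX_TYPES
    text: str            # short, explicit textual description
    narrative: str = ""  # accompanying 'story' text

@dataclass
class ELM:
    boxes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)   # (from_number, to_number)

    def add(self, box):
        assert box.kind in BOX_TYPES
        self.boxes[box.number] = box

    def link(self, a, b):
        self.links.append((a, b))   # untyped 'flow of logic': idea a leads to b

elm = ELM()
elm.add(Box(1, "requirement", "Provide altitude information to pilot"))
elm.add(Box(2, "concept", "Altitude is a high-integrity function"))
elm.add(Box(3, "object", "Radar altimeter plus barometric backup"))
elm.link(1, 2)
elm.link(2, 3)
```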
EXAMPLES OF USING THE ELM IN SAFETY RELATED APPLICATIONS

It is considered that there are three related uses of the ELM in the context of safety:

1. To make design trade-offs explicit;
2. To make explicit the logic for safety related design features, in order to communicate with engineers who are not safety specialists;
3. To document design rationale so that design modifications made in ignorance of the original intent do not become safety issues. That is, to prevent errors being introduced as a result of design modification because the original design rationale was not understood.
The following sections consider each of these cases in turn:
Documenting the Logic of Safety Related Design Trade-offs

It is clear that in any non-trivial system, design trade-offs will be made with respect to safety. For example, a decision might be made to implement functional components in a diverse manner in order to reduce the risk of common mode failures due to systematic design faults. Such a design decision might be 'self-evident' to the design authority or arise as a result of safety analysis performed by specialists. It is likely, however, that such a decision will be at odds with other desirable properties of the system. For example, for reasons of supportability or production cost it may be desirable to minimise the variety of components used in a design. Such a design trade-off will be made during the course of system development and, hopefully, a rational and considered judgement will ultimately be made. It is important to document both the fact that this trade-off has been made and the arguments and logic that result in the final decision. Figure 2 shows a fragment of an ELM that documents a safety related design trade-off and its relation to a final implementation. This (fictitious) example documents the relationship between various requirements and the final implementation of the design as a diverse system. The diagram links both functional and non-functional requirements and explains the final design decision by emphasising that (in this case) safety issues take precedence over logistic and supportability issues. Notice that in the simple examples presented here, the narrative text that normally accompanies an ELM has been omitted. In more complex cases it is unlikely that the full range of comments could be written within the symbols without making the diagram visually cluttered, and thus a separate text should be provided.
Making the Rationale of Safety Related Design Features Explicit

Engineering of large systems is a social activity in that it requires groups of people with differing expertise and world views to communicate and work together towards a common objective. Since it is unlikely that any individual will understand all aspects of a complex system, it is necessary for individuals to make design decisions that are not fully understood by other team members. ELMs can be used as a didactic tool between groups of systems engineers to communicate the rationale for design decisions, in order to increase the harmony between disparate system elements. The example of Figure 3 may be 'self-evident' to many safety specialists, but the same may not be true of other system designers. In this diagram the ELM is being used to explain why the design is as it is.
Figure 2: ELM showing example logic that might lead to diverse altitude measures for an aircraft
Figure 3: Example of an ELM used to communicate the logic of safety related design decisions
Recording Design Rationale to Support Updates and Modifications

The two examples above concerned the use of ELMs during the primary development programme. However, many systems undergo updates and modification during their in-service life. The engineering teams dedicated to in-service updates may be quite distinct from the original development teams, and updates may be performed some period of time after the original development programme. If the communication of design rationale can be lacking even during system development, how much greater is the problem likely to be after a period of years has elapsed? Both of the previous examples are relevant here. During in-service updates it will be important for system modifiers to understand why design decisions have been made, in order to ensure that their modifications are compatible with the design intent of the originators. Personal experience suggests that inappropriate
'optimisations' in system re-design may arise as a result of not fully comprehending the reasons that design decisions were originally made. The converse is sometimes also the case. In any complex design there may be a number of arbitrary decisions made on the basis that there is no strong evidence to force a design in one direction or another. During redesign, there can be a tendency to adopt post-hoc rationalisations for design decisions that were in fact quite arbitrary. At the very least this can lead to wasted time and effort as system modifiers struggle to provide a logic for arbitrary decisions.
PRACTICAL EXPERIENCE OF USING ELMS

Practical experience of using ELMs has been gained on two major projects, the first of which was regarded as a safety-critical application. The second project had some safety related elements but was not considered to be safety-critical. In both cases ELMs were adopted quickly in the organisation after introduction. The key features of ELMs that seemed to support adoption were their relative simplicity and their power for communicating 'trains of thought'. Most users had little difficulty in learning the notation and no training courses were considered necessary. Most users learnt the ELM convention from early drafts of this paper or from practical examples of drawings being used by other staff. In both cases, ELMs were drawn using PC-based drawing tools intended for 'flow-chart' drawing. In the first instance ABC-Flowcharter was used; in the second, 'Visio' was chosen. Both of these tools provided all of the facilities required to draw ELMs, to modify and edit them and to export them into a word-processing package. In the first project the major benefit of ELMs was seen to be the ability to communicate design rationale between skilled engineers with very different areas of technical expertise. This project required skills from a number of diverse disciplines represented by individuals from a variety of engineering cultures. The ELMs provided an easy-to-use mechanism for communicating the logic of design decisions between specialist groups. The second project was composed of a much more homogeneous group of engineers and thus interdisciplinary problems were less pronounced. However, the system was more complex and there was difficulty in reasoning through the impact of each design decision. In this case ELMs were used to track the multitude of individual design decisions in order to develop an overall understanding of the design logic of the system. A number of insights were gained that led to design errors being detected and potential optimisations being identified. The expressed view was that a principal application of the ELM would be the long-term documentation of the system in support of in-service updates.
CONCLUSION

The ELM has so far been used on two major system developments by the author, as described above. The diagrams were generally found to be easy to learn and straightforward to use. The ELM was considered to be a useful addition to the armoury of system analysis and documentation tools, and the author is keen to see the method adopted on other system developments. The drawings seem to enable design errors to be detected and to support engineers in their communications. In particular, they seem to help overcome 'cultural' differences between engineers from different technological disciplines: they allow the externalisation of concepts and flows of logic that may be self-evident to one group whilst being quite opaque to another.
A14" Software Reliability
3P: THE PRODUCT, THE PROCESS, THE PROCESSING IN SOFTWARE RELIABILITY

P.R. Leclercq
D3E Division, MATRA BAe Dynamics, 20-22 Rue Grande Dame Rose, 78141 Vélizy-Villacoublay, France
Tel: 33 1 34881517, Fax: 33 1 34882345
ABSTRACT

This paper presents how Product, Process and Processing are used for software/program reliability predictions. More precisely, it describes the metrics tree that integrates basic characteristics, step by step, into intermediate results such as module reliabilities and, finally, into overall program reliability. At each step, information is provided on the intermediate elements that lead to the overall reliability.
KEYWORDS

Reliability prediction, metrics, process monitoring, development management, quality assurance.
INTRODUCTION

The growing share of software, and of the functions realised by software, in current products (systems or equipment) leads project leaders to consider reliability a key element among their priorities. Furthermore, to account for reliability like the other software performance attributes, we must specify, assess and measure reliability figures. Trade-off evaluations therefore have to be conducted early in the development to be efficient, starting from the specification phase. We classify these actions as 'software reliability predictions'. This is why we, and other companies, conduct studies on the subject. Some major improvements have been gained in recent years. Beyond software alone, their goal is to consider software and hardware as inseparable parts of the overall system or equipment. We detail hereafter the motivation for, and the results of, an intensive effort to answer these questions, and present some results of the studies conducted in this domain.
POSITION OF THE PROBLEM
During the past decade, software reliability has been widely discussed. ANSI/IEEE defined it precisely as:
'The probability that software will not cause the failure of a system for a specified time under specified conditions. The probability is a function of the inputs to, and use of, the system as well as a function of the existence of faults in the software. The inputs to the system determine whether existing faults, if any, are encountered.' [ANSI/IEEE91]
Nevertheless, even though a definition exists, the notion requires some clarification. In this field, comparison with hardware raises some difficulties: there is no failure mechanism by which software fatigues, wears out or burns out. In the present case, we have defects introduced into the code that result from faults in the development process. The activation of these faults leads to bugs, whose evidence is a failure, i.e. unsuitable effect(s) on the system run [ANSI/AIAA93]. There is no modification of the software, but only the activation, by non-monitored inputs, of a fault. Even when tested and relieved of the majority of its faults, a software product may retain some remaining ones. Operational conditions may then activate faults not encountered during tests, without any warning of their occurrence. Consequently, we consider that a quantification of this occurrence probability is as necessary for software as it is for hardware, in order to obtain the result for the overall system. We note that quantification is not an objective in itself: it is the means to obtain elements of judgement, and confidence in the software from its development through to its use.
Reliability predictions

We think that predictions must be made as early as possible to be efficient, even before the software exists, when only preliminary studies have been conducted. The reason is to correct defects while it is easy, at a lower cost than later. Generally, the predictive methods used for software fall into the category of software reliability growth methods; that is, they examine the decrease of the bug occurrence rate while the software is under test. Besides their other difficulties, these methods have in common that they are only applicable very late in the development, when corrections may be very expensive to apply. The best known of these methods are the Goel-Okumoto, Littlewood and S models.
Need of a model

Modelling is today a tool that allows a process to be described from its inputs in order to reveal the characteristics of its outputs. An analytic model is a characterised, calibrated tool that allows the performances of applications to be analysed. Modelling is generally complex, long to evaluate and difficult to experiment with. Nevertheless, a model is a simple tool for examining the influence of the inputs and getting an idea of the choices in the process [IEEE 1992]. In the present case, we modelled the software running in operational conditions defined by an 'operational mission'. In those conditions, we are very close to the conditions of the IEEE software reliability definition. The present difficulty in establishing the model is due to the large complexity of software, which may count several million lines of code in a high-level language.
APPROACH

Basis
The area of software reliability is very complex. This leads us to split the area into sub-areas, in order to obtain easier questions to solve. To this end we conducted intensive studies to identify these sub-areas, both through literature surveys and in-house experiments. Rapidly, major sub-areas were
identified and successively refined to obtain axioms and models. In the spirit of an IEEE directive, a large activity was conducted [IEEE1992]. Nevertheless, the limits of its application were mainly due to the number of programs available as candidates for that purpose. The work was conducted through a maturation loop, as defined in Figure 1.
Figure 1: Maturation loop.
Figure 1 identifies:
1. An initial list of sub-areas.
2. Axioms (statements evident in themselves, which cannot be directly demonstrated but only verified afterwards by experiment) that define relations between different characteristics.
3. Metrics, analytical models that order the relations between characteristics.
4. Validations, applications to several programs to measure the correctness and accuracy of the metric results applied to a sample of programs/modules.
5. Suitable? A critical examination of the results of the metrics and of the measures on programs. If these different results are compatible, we decide that they are suitable and declare the metrics valid.
6. If not suitable, we return to an earlier phase of the loop and modify the axiom(s), until the results are suitable.
Remark: the maturation loop may be updated when we change the type of program(s) used for validation.
3P sub-areas definitions

We based the identification of sub-areas mainly on Soistman, McCall and in-house studies [SOIST85], [MCALL87]. Three sub-areas seemed particularly interesting for future developments:
• The software itself, the Product.
• The Process used to develop the product.
• How we use the product, the Processing.
In this approach, the software reliability of a program [P] appears as the focus of a net of the previously identified sub-areas.
The product
At this step we define the product as the program we want to obtain, to realise different functions within an equipment or system that includes the program/software. The product is described in the Software Requirement Specification (SRS). This document defines precisely what is wanted: the function that has to be realised, when, and under which conditions. In the early steps the specification itself, and during development the Software Detailed Design Specification (SDDS), describe the program. In these documents, the difficulties attached to such a product and the major constraints appear. It is possible to define the assemblies that will be part of the program. We name them 'modules', a generic term for sub-assemblies of the program. Each module may be described as a 'black box' with a generic view. The item list describing modules is:
• The main Operational application type (i.e., control, real-time, computational... module type).
• The Mission variability (the variety of 'missions', or operation modes, supported).
• The Functional complexity (a module may be required to perform more than one specific task).
• The System interaction (interface with hardware, with other software, with an operator).
• The Input domain variability (a large or narrow domain, with or without the effect of data errors).
• The Expected size (the range defined by the level at which the study is performed).
The process

From the beginning, with the specification, we consider the process that leads to the production of a program composed of modules. This process is based on the organisation of the project team and on the use of workbenches, methodologies and tools. These characteristics help to avoid the introduction of faults into the program. For that reason, they are named Enhancement Avoidance Characteristics. Some of these characteristics are:
• An independent, or not independent, quality assurance organisation.
• A rigorous, or not rigorous, software development plan.
• Use of methods and/or tools for establishing the specification.
• Automated test tools.
When development is running, faults are unfortunately introduced, and the verification and validation of the program consist of removing these faults by testing. So detection measures and tests are employed. These are named Enhancement Detection Characteristics. We list some of them:
• Code inspection, frequent or not.
• Formal, or informal, change notices for specification and detailed analysis.
• System integration tests.
The profile

The operational profile is one of the key factors in software reliability. Because of this factor, we cannot assume that all the conditions encountered in operational use of the program have been tested in the late validation phases of the program development. So it is necessary to define a mission profile as 'close' as possible to the program's actual use. An average of the activation of the operational functions for which the program has been designed defines this mission profile. An activation matrix of the program's modules translates this profile.
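To make the idea concrete, the following minimal sketch (not from the paper; all module names, rates, durations and activation fractions are invented) shows how an activation matrix over mission phases can weight per-module failure rates into a program-level figure:

# Illustrative sketch only: a mission profile expressed as an activation
# matrix over program modules, used to weight per-module failure rates.
# All numbers and names are hypothetical.

# Per-module failure rates (failures per hour of activation).
module_failure_rates = {"control": 2e-5, "real_time": 5e-5, "computational": 1e-5}

# Mission profile: fraction of each phase during which each module is active.
mission_phases = [
    {"duration_h": 0.5, "activation": {"control": 1.0, "real_time": 0.2, "computational": 0.1}},
    {"duration_h": 2.0, "activation": {"control": 0.3, "real_time": 1.0, "computational": 0.8}},
]

def mission_failure_rate(phases, rates):
    """Time-averaged program failure rate over the mission profile."""
    total_time = sum(p["duration_h"] for p in phases)
    weighted = sum(
        p["duration_h"] * sum(p["activation"][m] * rates[m] for m in rates)
        for p in phases
    )
    return weighted / total_time

print(f"Program failure rate: {mission_failure_rate(mission_phases, module_failure_rates):.2e} /h")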
Reliability metrics framework

The problem to solve is thus to obtain first the axioms and then the main metric(s). The axioms are obtained after intensive examination of results gathered in the field, examining data describing software bugs and measures performed by the software workbench used to develop the program. Attached to each axiom are metrics, which are the model translation of the axioms and which define the relations between the different Ps. As this approach proved valid through the application of the metrics to new programs, it was refined to be more accurate and precise. The present status of the methodology may therefore be represented as a net that links basic characteristics and specifies the P sub-areas. Figure 2 illustrates the steps taken to obtain the overall program reliability figure, from metrics at P level (Product, Process, Profile), themselves based on basic characteristics.
Reliability metrics net

At the present time it is necessary to identify more precisely:
• The different basic characteristics.
• The metrics which link these characteristics.
[Figure 2 is a diagram: a program-level metric is built from the profile, product and process metrics, each of which rests on basic characteristics and other characteristics.]
Figure 2: Software reliability metrics framework
[Figure 3 is a diagram: program reliability is derived from module reliabilities; each module's inherent reliability is computed from its inherent characteristics, with the program allotted into modules (the product sub-area); avoidance and detection characteristics yield the process avoidance and detection factors and the enhancement factor (the process sub-area); and per-module non-tested calls, total call numbers and non-identical calls, together with the activation matrix and functional activation, form the profile sub-area.]
Figure 3: Software reliability metrics net
After several iterations according to the maturation loop of Figure 1, a validated situation appears. The identification of sub-areas was a useful tool for making progress. However, the distribution turns out not to be so simple: several basic characteristics are linked from one sub-area to another. Under those conditions, the present network is the one defined in Figure 3, where the icons have the following definitions:
• A preparation task.
• An input set.
• A metric result.
Figure 3 must be read bottom-up and from left to right. If we detail this figure, we count a total of 8 major metrics plus some other sub-metrics. These metrics gather, today, a set of 48 generic characteristics retained for their influence on software reliability. These characteristics are split into three categories, counting respectively 8, 22 and 18 items:
• Inherent characteristics.
• Enhancement avoidance characteristics.
• Enhancement detection characteristics.
We present hereafter one of the most significant metrics. Additionally, it is necessary to specify the profile, which defines the duration of the mission and how and when the functions of the program are solicited.
A significant metric

The number of non-tested calls of a program, Nr, that may lead to a bug is one of the original metrics we developed. Measurements of programs were recorded, and after different trials the following surface appeared suitable.
[Figure 4 is a surface plot of Nr against Ri and Na.]
Figure 4: Number of non-tested calls
Its formulation is:

Nr = Ke * Nd,   or   Nr = k0 * (Ri * Ln(Na)) * Na^((1 - En) / (1 - Ean))    (1)

Where:
- Nr is the number of non-tested calls that may lead to a bug being revealed in operation.
- Ke is a coefficient of test efficiency, evaluated from the enhancement factors.
- Ean is the enhancement factor including only the avoidance characteristics applied during development.
- En is the enhancement factor including both avoidance and detection characteristics.
- Ri is the inherent reliability, which describes the complexity of the specification.
- k0 is a form coefficient.
Na, En, Ean and Ri are evaluated with the help of other metrics. Unfortunately we are not able to detail them here for lack of space. Some are described in [LEC92].
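As an illustration of how metric (1) behaves under the reconstruction given above (the exact published form could differ), here is a minimal sketch; all parameter values are purely hypothetical and k0 is left at 1, since the paper does not publish its calibration:

import math

def non_tested_calls(ri, na, en, ean, k0=1.0):
    """Sketch of metric (1) as reconstructed above: the expected number of
    non-tested calls Nr grows with specification complexity (Ri) and program
    size (Na), and shrinks as the full enhancement factor En (avoidance plus
    detection) rises above the avoidance-only factor Ean.
    Values are illustrative, not calibrated."""
    exponent = (1.0 - en) / (1.0 - ean)   # plays the role of the test-efficiency term
    return k0 * ri * math.log(na) * na ** exponent

# Hypothetical module: moderate complexity, 500 calls, strong detection measures.
print(non_tested_calls(ri=0.4, na=500, en=0.9, ean=0.6))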
DATA BASE

Such a set of metrics will operate only if we establish values for its characteristics. This is a great difficulty, and many people have worked in this field. For this purpose, many have examined data bases and bug collections, and tried to obtain statistical correlations between characteristics. Unfortunately, as far as we are aware, the results are not at the level that was expected. A. Goel, with whom we discussed this problem, thinks that no answer can probably be obtained this way. The reasons may be summarised as follows. The characteristics expected to be interesting were not recorded, or were insufficiently recorded: when the data bases were built, these characteristics had not been retained as relevant to the use of the data base, so the results were poor. In addition, the number of recorded programs needed is so large, owing to the number of characteristics to discriminate, that it is difficult to imagine it could easily be achieved. We have therefore adopted another approach, which consists of gathering the experience of 'software experts' who are able, from experience, to define which characteristics are influential and how influential they are. For this purpose, a set of questionnaires has been developed to provide a flexible frame allowing comparison and statistics. These statistics provide the best estimation of the influence of the different characteristics. Under these conditions we have gathered studies based on that assumption. The most interesting was conducted by Soistman [SOIS85] under a US DoD contract. The approach was conducted at several levels:
• The identification of characteristics that have an important impact on software reliability.
• A first series of characteristics that aggregate software reliability into metrics and sub-metrics based on common recognition.
• A quantification of these characteristics in the same way.
The resulting quantification represents the 'expert opinion' on the characteristics that determine software reliability. It was necessary to split the overall inquiry into several questionnaires that are easier to manage; the frame of each sub-questionnaire is closely related to the sub-areas defined previously. Today we have enlarged the data base, which contains around one hundred experts' opinions.
EXPERIENCE

Overview
With this approach to software reliability, studies have been conducted on several projects. In doing so we gained several kinds of experience:
1. Validation of the metrics set.
2. In case of problems, inputs for updating the set in some particular areas.
3. Operational experience in the management of reliability predictions in a project.
4. Operational results directly applicable to the development of the concerned project.
Details
After the present experiments we consider this set of metrics applicable to a large diversity of programs. Nevertheless, the results must be used with care: for each result, the customer of the methodology must be told why the different results were obtained. We do not consider that we have the same confidence level as for hardware, so the results are better used for trade-offs than as absolutes. For a while we had problems with programs composed of tens of thousands or millions of lines of code; today that problem is solved. At present, one of the more difficult problems concerns not the methodology itself but the tool that processes the metrics, owing to the accuracy of some calculations when we study networks. We now have a large experience of managing reliability predictions in different projects in the fields of defence, telecommunications, energy, etc., on programs running on PCs and workstations and on dedicated developments for application or system programs. Answers have been provided for improvements at the specification level, at the development process level, in the program architecture, etc. Several of them were obtained in collaboration with other companies.
FUTURE

Improvement will continue in different directions:
1. Enrichment of the expert data base.
2. Treatment of applications, to continue validating the robustness of the sets of metrics, including comparison with bug collections on operational programs.
3. A tentative action to adopt a closer and complementary approach with the Capability Maturity Model (CMM) and with the European SPICE initiative.
References

ANSI/AIAA (1993). ANSI Recommended Practice for Software Reliability. Report R-013-1992.
ANSI/IEEE (1991). IEEE Standard Glossary of Software Engineering Terminology. Standard 729-1991.
Fiorentino, E. and Soistman, E. (1985). Combined hardware/software reliability prediction methodology. Proceedings of the Annual Reliability and Maintainability Symposium.
IEEE (1992). IEEE Standard for a Software Quality Metrics Methodology. Standard 1061-1992.
McCall et al. (1987). Methodology for software reliability prediction. Study report RADC 87-171.
Leclercq, P.R. (1992). A software reliability assessment model. Proceedings of the Annual Reliability and Maintainability Symposium, Las Vegas, 294-298.
Soistman, E.C. and Ragsdale, K.B. (1985). Impact of Hardware/Software Faults on System Reliability. Study report RADC-TR-85-228.
SOFTWARE RELIABILITY METHOD COMPARATIVE ANALYSIS - FROM THE EXPERIENCE TO THE THEORY

Emmanuel Arbaretier
SOFRETEN, Parc Saint Christophe, 10 avenue de l'entreprise, 95865 Cergy Pontoise Cedex
ABSTRACT

This paper deals with the use of different predictive analysis methods for software applications, so that the maintenance problems specific to them may be anticipated. The first part concerns the different approaches which can be adopted, including the extreme method which consists of obtaining a complete mathematical validation of the specification of the software and generating the associated source code automatically; this source code is assumed to be perfect, meaning it has no defect, and it will be subjected to no maintenance action except those corresponding to evolutions of the software. The second part presents the qualitative approach, which makes it possible to anticipate very precisely the characteristics of the maintenance tasks, through the identification of the most vulnerable and critical parts of the software. The third part describes the contribution of the quantitative methods, in particular different methods of measuring the improvement of the code due to maintenance actions. The fourth part deals with the subtle problem of adapting concepts commonly applied in the world of hardware to the world of software.
KEYWORDS

Reliability, Software, Hardware, Failure Analysis, Methods, Reliability Growth, Formal Languages, Likelihood Function
THE DIFFERENT APPROACHES: QUALITATIVE, QUANTITATIVE, FORMAL

Different approaches exist in the field of software reliability, and some of these approaches are better adapted to mastering maintenance issues than others:
- the qualitative approaches make it possible to identify the most vulnerable parts of the software and to orient the test and validation actions towards these elements, by targeting and provoking the maintenance actions on them; this aspect is developed in the second section, through the example of the AEEL method (Analyse des Effets des Erreurs Logicielles - Analysis of the Effects of Software Errors);
- the quantitative approaches try to apply to the world of software, as far as possible, a transposition of the theoretical tools used in the world of hardware;
- the formal approaches consist of applying to the specifications of the software mechanisms of logical proof which prevent the appearance of failures due to design errors; some of these languages are associated
with automatic code generators which allow maximal reliability of the software to be expected. These languages can, for example, be classified according to the methods which support them:
• algebraic: Z, VDM, RAISE, B
• with communicating behaviour: LOTOS, ESTELLE, SDL
• synchronous: LUSTRE, SIGNAL, ESTEREL.
AEEL: FROM DESIGN TO SOFTWARE SUPPORT
The AEEL method can be considered as an adaptation to software systems of the FMECA methodology as applied to hardware. The parallel use of FMECA and AEEL studies means that the hardware and software parts of a system can be submitted to the same dependability analyses, by comparable methods, within the frame of a homogeneous approach. The AEEL method makes it possible to emphasise critical points during the development phases of the software, to increase the efficiency of the Quality Assurance process performed during this development cycle through a better selection of the test and validation tasks, and to ease the identification of corrective actions and design improvement recommendations. The AEEL method evaluates the criticality of a software product from the criticality of each of its components. The principle of the analysis is to assume design error hypotheses about each elementary component, and to identify the consequences of these errors on:
- the operating modes of the module in which they appear;
- the operating modes of the other modules;
- the global operating modes of the software or of the system.
The aims of an AEEL are to:
- emphasise the weakest points in the design of the software, by determining the components whose defects may have the most critical consequences;
- identify the components (procedures, modules...) which are most critical within the architecture of the software, because of their complexity or their strategic characteristics with respect to the operating modes of the system;
- influence the test/validation policy of the software, and more generally the development Quality Assurance process, by giving it a more precise and efficient orientation;
- anticipate coming functional evolutions from observed limitations, particularly those due to the improvement of the software with regard to its potential failures.
AEEL studies concern software with high operational requirements, expressed for example with the following quality indicators:
emphasize the weakest points in the design of the software by determining the components the defauts of which may have the most critical consequences identify the components (procedures, modules..) which are most critical in the frame of the architecture of the software because of their complexity or their strategic characteristics as to the operating modes of the system influence the test/validation policy of the software, and more generally the development Quality Assurance process, by giving it a more precise and efficient orientation anticipate the coming functional evolutions from observed limitations, and particularly due to the improvement of the software, regarding its potential failures The AEEL studies concern softwares with high operational requirements, expressed for example with following quality indicators:
-
performance : high volume of information, maximal response times, multiple transactions and requests, high input and output flows
-reliability/availability : high cost due to the interruption of the system mission, or to its unavailability -
functionalities : complexity and sophistication of the functionalities, extreme diversity and complexity of the operational scenarios
- safety : software which is critical, as to the safety of hardware and personnel
- human factors: software which is going to be used by a large population of users, heterogeneous, not very familiar with computers, and liable to reject the application.
The AEEL is based on the following steps:
- to define the hypotheses of the analysis;
- to identify the components of the software which are going to be submitted to the analysis, justifying the choices and evaluating the corresponding workload;
- for every component previously selected, to determine the consequences of the different types of error assumed under the initial hypotheses on the operating modes of the software, at different levels of the system, up to the global level, which corresponds to its main functions;
- to describe the detection devices, the test and validation tasks, as well as the corrective actions associated with these defects;
- to perform a synthesis of these elementary analyses at the level of every component, so as to define general design improvement recommendations for the software, or preferential functional evolutions to schedule later.
The tasks which are part of the AEEL methodology can be grouped into the following steps:
- Step 1: Preparation of the hypotheses
  • selection of the themes and objectives of the AEEL;
  • definition of the criticality scale (criticality levels);
  • list of the types of errors which must be simulated and analysed on each component;
  • interface of the AEEL with the conduct of the project.
- Step 2: System Analysis and Workload Definition
  • realisation of the Functional Analysis worksheets for every selected theme;
  • preparation, classification and sorting of the modules;
  • work schedule of phase 3.
- Step 3: Realisation of the AEEL
  • building of the different AEEL worksheets;
  • global synthesis of the AEEL;
  • production of the AEEL report;
  • updating of the AEEL worksheets and report.
The AEEL study is iterated through the V development cycle of the software. Nevertheless, it is valuable for several iterations to have been performed before the coding phase, so that the modifications selected during the design process may be taken into account at the lowest cost; this way, it is possible to apply the test and validation tasks of the software components, in relation with the simulated defects, in the later phases of unit testing, integration and validation.

RELIABILITY GROWTH MODELS: LIMITS OF THE ANALOGY CONCERNING QUANTITATIVE TECHNIQUES APPLIED TO HARDWARE AND SOFTWARE RELIABILITY

The interactions between software maintenance and software reliability are expressed through failure rate parameters, with time-decreasing formulas such as:
MUSA:          λ(t) = C / (M0 T0) · [M0 - (i - 1)]     for t ∈ [ti-1, ti]
GOEL-OKUMOTO:  λ(t) = a b exp(-b t)                    for t ∈ [ti-1, ti]
SHANTIKUMAR:   λ(t) = [N - (i - 1)] a b exp(-b t)      for t ∈ [ti-1, ti]
with i varying between 1 and n and indexing the debug actions on the module whose failure rate λ we want to evaluate, and ti - ti-1 identifying the time intervals between two subsequent debug actions. In these models, the parameters C, M0, T0, a, b, N are estimated through maximisation of the likelihood function, expressed from the series of time intervals separating the different debug actions, as measured by the development engineers. These models describe the reliability growth process of the software modules in different ways; they have different properties, and give different interpretations of what happens when a failure appears, when the corrective action is performed, and how far the module is then improved. The hypothesis under which the failure rate λ decreases over time seems fairly realistic, and the sudden jump of the curve at the moment a correction is performed is also rather convincing, if we consider that modifications introduced into software genuinely improve its reliability each time. But these models are satisfactory to a rather unequal degree. For example, in MUSA's model, the stepwise decrease of the failure rate, interval after interval, is somewhat suspect, because one may think that the longer a program is used without showing any failure, the more its reliability increases, and thus the more its failure rate decreases. One would, for example, prefer to observe a regular decrease of λ within every interval separating two successive failures: this is precisely what GOEL-OKUMOTO's model describes, but on the other hand it does not take into account the discrete decrease of λ at every correction. In fact, SHANTIKUMAR's model seems to achieve a synthesis of the hypotheses of MUSA and GOEL-OKUMOTO, but the numerical resolution of the equations for calculating its parameters leads to much greater difficulties. It is impossible not to mention LITTLEWOOD's model, if only to illustrate the theoretical criticism of the hypotheses involved in the previous models. For example, MUSA's model, according to which the failure rate is rigorously proportional to the number of remaining errors and decreases by the same quantity every time a correction is applied, is highly questionable; indeed, depending on their importance, errors have a large or a small influence on the failure rate. For example, an error positioned in a software module which is activated very often will make a much more important contribution than an error located in a part of the program which is called more rarely. Moreover, one has to note that the most critical errors will probably be detected first; hence, when half of the errors have been detected, λ will have been reduced by much more than half. That is why, to take into account the uncertainty characterising the severity of the different errors corrected, LITTLEWOOD considers the parameters λi as independent random variables distributed according to Gamma distribution laws. The models presented above are very widely used and have been submitted to numerous validations. For this purpose, one can use KOLMOGOROV-SMIRNOV tests to measure the difference between the theoretical distribution function, that is to say the one computed from the model (intervals between failures, or cumulative number of failures at a given time), and the collected data.
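Before turning to validation, here is a minimal sketch (not from the paper) of the estimation step: fitting the GOEL-OKUMOTO intensity by maximum likelihood to an invented set of failure times, assuming numpy and scipy are available:

import numpy as np
from scipy.optimize import minimize

# Cumulative failure times (hours) observed during test -- hypothetical data.
t = np.array([12.0, 30.0, 55.0, 90.0, 140.0, 210.0, 300.0, 420.0])
T = 500.0  # total observation time

def neg_log_likelihood(params):
    """Negative log-likelihood of the GOEL-OKUMOTO NHPP with intensity
    lambda(t) = a*b*exp(-b*t): the sum of log-intensities at the failure
    times minus the expected number of failures a*(1 - exp(-b*T))."""
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    return -(np.sum(np.log(a * b) - b * t) - a * (1.0 - np.exp(-b * T)))

res = minimize(neg_log_likelihood, x0=[10.0, 0.01], method="Nelder-Mead")
a_hat, b_hat = res.x
print(f"a = {a_hat:.2f} (expected total faults), b = {b_hat:.5f}")
print(f"current failure rate: {a_hat * b_hat * np.exp(-b_hat * T):.4f} /h")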
A validation is replicative if the data have participated in the estimation of the parameters of the model, and predictive otherwise.
The practical use of these models tends to show that LITTLEWOOD's model presents a better replicative validation criterion (64% of the tests accepted) and a better predictive validation criterion (65% of the tests accepted) than the two other models (MUSA: 46 and 50%; GOEL-OKUMOTO: 50 and 45%), if we are interested in modelling the distribution functions of the intervals between failures. If we are interested in the distribution function of the number of failures detected at a given time, the three models give similar results for the replicative validation criterion (around 75% of the tests accepted) and for the predictive validation criterion. In summary, LITTLEWOOD's model provides more precise results than the other models, but at the price of more difficult mathematical processing (the parameters are much more difficult to compute).
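Continuing the sketch above (and assuming its fitted values a_hat and b_hat are in scope), a replicative KOLMOGOROV-SMIRNOV check can be illustrated by rescaling the failure times through the fitted mean-value function, under which they should be approximately uniform; this is only indicative of the idea, not the validation protocol used in the studies cited above:

import numpy as np
from scipy.stats import kstest

def mean_value(t, a, b):
    """Mean-value function m(t) of the fitted GOEL-OKUMOTO model."""
    return a * (1.0 - np.exp(-b * t))

# Under the fitted model, m(t_i)/m(T) should look like a uniform sample on (0, 1);
# the KS statistic measures the distance between model and data. This check is
# replicative, since t was used in the fit.
u = mean_value(t, a_hat, b_hat) / mean_value(T, a_hat, b_hat)
stat, p_value = kstest(u, "uniform")
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")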
ADAPTATION OF THE RELIABILITY CONCEPT TO THE DEVELOPMENT PROCESS OF THE SOFTWARE

The adaptation of the different SLI concepts to the development process of software must be done taking into account the fundamental differences between the inherent nature of software and that of hardware; the following table shows some of these differences:

1. Hardware: failures may be caused by defects which appeared during the design, manufacturing, use or maintenance phases.
   Software: failures are mainly due to design errors, reproduction errors, or maintenance actions introducing regression.
2. Hardware: failures may be caused by wear-out phenomena or energy exchange; more often than not, anticipating signs may help avoid the failure.
   Software: there is no wear-out phenomenon; most of the time, software failures occur when it is not possible to anticipate them.
3. Hardware: the maintenance actions applied to an equipment cannot increase its inherent reliability; they only enable its overhaul, according to a maintenance policy defined precisely in advance.
   Software: the only possible maintenance action is the redesign of a piece of source code through its reprogramming, under the condition that the initial defect has been removed and no additional defect introduced (no-regression hypothesis); this can increase its inherent reliability.
4. Hardware: reliability depends on wear-out or screening phenomena; the failure rate can be decreasing, constant or increasing with respect to the time in use.
   Software: reliability is not so much time-dependent; it may improve over time, but this is not a direct dependence on the time in use, rather a dependence on the time dedicated to detection and error correction.
5. Hardware: reliability is related to environmental factors.
   Software: the external environment does not generally affect reliability, except if it has an influence on the inputs of the software.
6. Hardware: reliability can be predicted theoretically, from knowledge of the design of the hardware as well as its conditions of use.
   Software: reliability cannot be predicted from a theoretical base, as it depends exclusively on human factors playing their role at the moment of the design.
7. Hardware: reliability may be improved through the introduction of redundancies.
   Software: reliability may be improved by the introduction of redundancies only if the software elements constituting the redundancies have been developed and tested by different teams; in that case, however, it is necessary to introduce voting.
8. Hardware: last, and above all, the support policy of a hardware system can and must be defined in advance, in its total representation, up to the complete description of the content of every task.
   Software: the maintenance policy of a software product can only be defined in its principles; in fact, the maintenance tasks must be anticipated as much as possible, for a first and final definitive application.
CONCLUSION

Among the existing software reliability analysis methods, only one family has not been taken into account in this paper: the one dealing with the computation of quality measurement indicators which quantify the quality characteristics of the source code of a software product. These methods did not interest us as much as the others here, because they focus only on parameters which are indirectly linked to the reliability of the software. Among the methods described above, the formal methods are the most satisfactory from the theoretical and intellectual point of view, because they take fully into account the fundamental differences between software reliability and hardware reliability: they are able to prove that a given behaviour can never be reached by a piece of software, or that a property is always verified. They can simply prove the 'zero defect' claim, which makes no sense in hardware reliability. The only problem is that they are very difficult to apply and require very specialised skills: for the time being, developing a formal model of a software product costs almost the same time and energy as developing the software itself. Our company has adopted this approach only on very limited pieces of algorithms, on the basis of qualitative formal processing associated with Petri net modelling. The two other methods (failure rate computation through reliability growth modelling, and defect simulation through AEEL) derive from an analogy with hardware reliability methods. We have practised the first only in an experimental way, to compare different types of models, but we have not yet found a customer able to provide, on an industrial basis, a sufficient data collection from every developer. The method we have used most over the last three years is the qualitative analysis described in this paper as AEEL; we applied it within the methodological framework of working groups, and it helped us to improve both software which had been developed over ten years and software which existed only through its specifications. Every time, it helped the development team to identify what should clearly be done, either to be sure of avoiding a certain type of most-dreaded failure event due to a bug, or to be sure that such a bug is not in the software.
REFERENCES

Littlewood, B. Likelihood function of a debugging model for computer software reliability. IEEE Transactions on Reliability, R-30:2, 145.
Musa, J. The measurement and management of software reliability. Proceedings of the IEEE, 68:9.
SOFRETEN.
Subjective Safety Analysis for Software Development

J. Wang¹, A. Saeed² and R. de Lemos²
¹Department of Engineering and Technology Management, Liverpool John Moores University, L3 3AF, UK
²Centre for Software Reliability, Department of Computing Science, University of Newcastle upon Tyne, NE1 7RU, UK
ABSTRACT

This paper presents a framework for subjective safety analysis of software requirements specifications for safety-critical systems. The framework incorporates fuzzy set modelling and evidential reasoning to assess the safety associated with safety requirements specifications. Fuzzy set theory is used to model each safety rule and evidential reasoning is employed to synthesize the information produced. Three basic parameters - failure likelihood, consequence severity and failure consequence probability - are used to analyse a safety rule (a basic element of a software requirements specification) in terms of membership functions. The subjective safety description associated with the safety rule is then mapped back to a scale of pre-defined safety expressions, which are also characterised in terms of membership functions. This mapping produces the safety evaluation associated with the safety rule, expressed in terms of the degrees to which the subjective safety description belongs to the pre-defined safety expressions. These degrees represent the uncertainty in the safety evaluation associated with the safety rule. The information produced for all safety rules can then be synthesized using an evidential reasoning approach to obtain the safety evaluation associated with the safety requirements specifications. The developed framework is capable of dealing with multiple safety analysts who make judgements on each safety rule.

KEYWORDS
Fuzzy sets, software safety analysis, subjective safety analysis, evidential reasoning, information models and formal notations.

1. INTRODUCTION

The increased employment of computer-based systems for the implementation of critical functions has introduced new challenges for the development and assessment of software. For assessment, evidence must be provided to demonstrate that the risk associated with the software is acceptable within the overall system risk, IEC (1992). It has been proposed that an effective approach to assess and reduce the contribution of software failures to system risk is to conduct safety analysis in parallel with the phase of requirements
analysis, within the software development lifecycle, Saeed et al (1995). In accordance with the proposed approach, the outputs of the requirements analysis are safety requirements specifications for the software, expressed in a formal notation. An information model is used as a structure to record the relationships between critical failure behaviours of the overall system (i.e. accidents and hazards) and the safety requirements specifications (safety constraints and safety strategies) for the software, de Lemos et al (1995). The results of the safety analysis provide arguments which support the validity of the relationships encoded in an instance of the information model, thereby providing evidence that the risk posed by software is acceptable. Safety analysis can be conducted on a qualitative or quantitative basis. Qualitative safety analysis aims to confirm that under normal circumstances the safety requirements specifications will prevent the system from entering a hazard state, and to examine the impact on hazards of defects in the specifications and violations of associated assumptions. Qualitative safety analysis can be conducted effectively by applying formal verification techniques and safety analysis techniques, Saeed et al (1994). Quantitative safety analysis aims to deal with the limitations of qualitative safety analysis by providing a measure of the safety associated with the safety requirements specifications. For software development, the measures should identify whether the risk associated with a specification is acceptable and, when alternative specifications are proposed, provide a basis for decision making. For traditional technologies, quantitative analysis is conducted in terms of probability distributions of primitive failure events. However, it is difficult to determine precisely probability distributions for the issues which can affect software safety. A novel approach pursued in this work is to express uncertainty in the safety associated with safety requirements specifications in terms of vague and imprecise descriptors like 'reasonably low' - terms commonly used by safety analysts that can be expressed in fuzzy set theory, Wang et al (1995). This paper proposes a framework for subjective safety analysis of requirements specifications, based on fuzzy set modelling and evidential reasoning, Wang et al (1995). Deductive analysis starts with the stipulation of an acceptable level of safety for each accident as a linguistic variable (a pre-defined fuzzy safety expression) and dictates acceptable levels of safety for the hazards, from which a stipulated risk level (a numerical measure) is calculated. An alternative is to determine a linguistic variable for a hazard on the basis of a traditional estimate of acceptable risk for that hazard. The risk of a hazard is controlled by the safety strategies defined to maintain the safety constraint that will exclude the hazard. The safety strategies are defined in terms of safety rules, which are based upon assumptions (also expressed formally) under which safe behaviour is maintained; these rules are characterized as primitive elements of a safety strategy. Inductive analysis starts with the application of fuzzy set theory to analyse these elements using three basic parameters - failure likelihood, consequence severity and failure consequence probability - in terms of membership functions.
The subjective safety description associated with an element is mapped back to a scale of pre-defined safety expressions, to determine the uncertain safety evaluation (i.e. the extent to which the rule belongs to each expression on the scale) for each element. The safety evaluations for the elements are synthesized using evidential reasoning to obtain the safety evaluation for a safety strategy. These safety expressions can be used to rank alternative safety strategies, supporting development decisions for risk reduction. A similar synthesis process is used to obtain the safety expression of a safety constraint from the safety expressions of the associated safety strategies. An estimated risk level is then computed for each safety constraint and compared with the stipulated risk level for the associated hazard, to confirm that the risk is acceptable. As the development proceeds, the safety strategies will be refined into more detailed specifications, and the additional information can be used to re-evaluate the initial assessments and further direct risk reduction. The approach is capable of dealing with evidence from diverse sources, such as multiple safety analysts who make judgements on safety based on the results of different techniques. The feasibility of the framework was illustrated by application to a railway safety problem, Wang et al (1996).
2. THE ANALYSIS OF SOFTWARE SAFETY REQUIREMENTS

The framework for subjective safety analysis presented in this paper is described in the context of a systematic approach to the analysis of safety requirements, Saeed et al (1995). The systematic approach partitions the analysis into smaller phases; each phase corresponds to a domain of analysis (a particular scope of the analysis, e.g. a component of the system) in which requirements analysis and safety analysis are conducted in parallel. The results of applying the approach are encoded in an information model, the Safety Specification Graph (SSG), which records the safety requirements specifications obtained in each phase, such as accidents (AC), hazards (HZ), safety constraints (SC - a condition that negates a hazard) and safety strategies (SS - a scheme to maintain a safety constraint), together with their logical relationships. An SSG is represented as a linear graph, in which a node represents a safety specification and an edge denotes that a relationship exists between a pair of safety specifications. For a system for which I accidents have been identified, the SSG consists of I component graphs, one for each accident. The SSG records three kinds of relationships:
• Coverage. Absence of all hazards associated with an accident ensures that the accident does not occur.
• Exclusion. A safety constraint excludes all the associated hazards.
• Refinement. A safety strategy maintains all the specifications of the previous layer to which it is linked.
Two characteristics of an SSG that make it amenable to subjective safety analysis are that the requirements specifications are expressed in a formal notation and that the logical relationships between them are explicitly encoded. These support, respectively, a better judgement over factors related to a single specification and over factors dependent upon the interrelationships between specifications.

3. A FRAMEWORK FOR SUBJECTIVE SAFETY ANALYSIS OF SOFTWARE SAFETY

A framework for hierarchical subjective safety analysis of safety requirements specifications, for the initial layers of an SSG, is proposed as shown in Figure 1; an ellipse represents the safety evaluation of the named specification and an arrow gives the propagation direction of safety information from one level to another. The safety analysis comprises a top-down process and a bottom-up process. The top-down process leads to a stipulated risk level for each hazard: an expression is stipulated for each accident and then used to derive acceptable levels of safety for the hazards associated with the accident, via the coverage relationships. The bottom-up process starts by associating safety evaluations with the safety rules at level 5; these are then used to determine the safety evaluations associated with the safety strategies at level 4, which in turn determine the safety evaluations associated with the corresponding safety constraints at level 3. Between levels 2 and 3 a comparison is conducted between the safety values stipulated for the hazards and those of the safety constraints that aim to exclude the hazards; this is used to determine whether the safety associated with the requirements specifications is acceptable. The framework consists of three main activities: the stipulation of a safety level for each hazard, the estimation of a safety evaluation for each safety constraint, and a comparison between the stipulated and estimated descriptions of safety. In this paper, we will focus on the approach to the estimation of safety (see section 4).
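As a side illustration of the SSG structure described in section 2, the sketch below encodes a small hypothetical instance, with the coverage, exclusion and refinement relationships recorded explicitly; all node names are invented:

# Hypothetical sketch of a Safety Specification Graph instance: one component
# graph per accident, with coverage, exclusion and refinement edges recorded
# explicitly. Node names are invented for illustration.
ssg = {
    "accidents":   {"AC1": {"hazards": ["HZ1_1", "HZ1_2"]}},           # coverage
    "hazards":     {"HZ1_1": {"excluded_by": "SC1_1"},                  # exclusion
                    "HZ1_2": {"excluded_by": "SC1_2"}},
    "constraints": {"SC1_1": {"strategies": ["SS1_1_1", "SS1_1_2"]}},   # refinement
    "strategies":  {"SS1_1_1": {"rules": ["Rule1_1_1_1", "Rule1_1_1_2"]}},
}

def rules_for_accident(ssg, ac):
    """Walk the graph from an accident down to its safety rules."""
    rules = []
    for hz in ssg["accidents"][ac]["hazards"]:
        sc = ssg["hazards"][hz]["excluded_by"]
        for ss in ssg["constraints"].get(sc, {}).get("strategies", []):
            rules += ssg["strategies"].get(ss, {}).get("rules", [])
    return rules

print(rules_for_accident(ssg, "AC1"))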
3.1 Stipulation of Safety

The safety associated with the safety requirements specifications should be contained to a level that is acceptable, depending on the particular situation in hand. This requires that the risk level associated with HZi,j be stipulated. It is commonly understood that safety can be described using linguistic variables, such as 'poor', 'fair', 'average' and 'good', which are referred to as safety expressions (see section 4.1.4) and provide a scale. The procedure used to associate a safety expression with a hazard will depend on the situation: it can
be derived from a safety expression of the accident ACi or from a traditional estimate of acceptable risk for the hazard. To obtain the level of risk associated with HZi,j in terms of numerical values for comparison purposes, it is necessary to describe the linguistic variable using numerical values. The numerical values associated with the four defined safety expressions can be calculated by studying the categories and membership values associated with the safety expressions.
[Figure 1 is a diagram: stipulated safety propagates top-down from level 1 (accidents) through level 2 (hazards), estimated safety propagates bottom-up from level 5 (safety rules) through level 4 (safety strategies) to level 3 (safety constraints), and a comparison is made between levels 2 and 3.]
Key: ACi - accident i; I - number of accidents; HZi,j - hazard j of ACi; J(i) - number of hazards of ACi; SCi,j - safety constraint j for HZi,j; SSi,j,k - safety strategy k of SCi,j; K(j) - number of strategies for SCi,j; Rulei,j,k,l - safety rule l associated with SSi,j,k; L(k) - number of rules for SSi,j,k.
Figure 1: A hierarchical framework for subjective safety analysis
3.3 Comparison of Safety The comparison of the stipulated risk level and the estimated risk level associated with SC/, j can then be carried out to see if the risk is acceptable, by converting the estimated safety evaluation to a numerical value using the same scale as used for determining the stipulated level. If the risk associated with SCi,j is acceptable, the produced information can be used as evidence to support certification, otherwise it may be required to modify the safety requirements specifications to increase the level of safety. After modifications some parts of the safety analysis will need to be conducted again to make sure that the required level of risk has been contained.
Advances in Safety and Reliability: ESREL '97
393
4. APPROACHES FOR SUBJECTIVE ESTIMATION OF SAFETY
To provide a subjective estimate for the safety of software requirements specifications, fuzzy set modelling techniques are used to model the judgements of safety analysts and evidential reasoning is used for the hierarchical propagation of the safety judgements. The main activities of the overall process are illustrated by the SADT diagram in Figure 2. Basic parameter estimates by safety analyst 1
Ci,j,k,l, 1
Pre-defined Safety Expressions
I
uzzyset /
Lijkll ....
r--[ manipulation
1"4,j, k, l,l
~1
t"'l
Si, j, k, l!IL[1 [ Safety
Safety description by safety analyst
ghi .°f. safe 1- ty analysts
identification
S(Si, j, k, l, 1 )[~...~ Uncertain safety evaluation judged by safety analyst 1 , Evidential °. reasoning S(Si, j, k, l, N) ----~ Uncertain safety evaluation judged by safety analyst N
Consensus safety evaluation S(Si, j, k, l) --
Figure 2: Activities for the Subjective Estimation of Safety
4.1. Fuzzy Set Modelling: Safety Definition

The safety associated with a safety rule (say, Rule_{i,j,k,l}) can be modelled by studying the associated failure likelihood, consequence severity and failure consequence probability, as described earlier. These three parameters can be described by linguistic variables, which can in turn be described by membership functions. A membership function consists of membership values to categories. The typical linguistic variables for describing failure likelihood, consequence severity and failure consequence probability may be defined in terms of membership degrees belonging to seven categories, as recommended in Karwowski and Mital (1986); for details of our definitions see Wang et al (1996). The membership degrees of the typical linguistic variables are not exclusive with respect to a category, which makes it easier for safety analysts to make judgements on a safety rule. It is obviously possible to have some flexibility in the definition of membership functions for the typical linguistic variables to suit different situations.
4.1.1 Local Safety Parameter

The failure likelihood can be assigned by a safety analyst examining a safety rule, specifically by estimating the likelihood that the safety rule will be violated. To estimate the failure likelihood, an analyst would use such variables as 'highly frequent', 'frequent', 'reasonably frequent', 'average', 'reasonably low', 'low' and 'very low'.
4.1.2 Global Safety Parameters

The consequence severity and the failure consequence probability are parameters derived from specifications at higher layers. To estimate the consequence severity, an analyst would use such variables as 'catastrophic', 'critical', 'marginal' and 'negligible'. The consequence severity can be assigned by studying the severity class of the potential accident caused by the violation of a safety rule (in fact, it should be the same for all safety rules connected to an accident). However, it may be comparatively difficult for safety analysts to assign membership degrees for the failure consequence probability, described using variables such as
'definite', 'highly likely', 'reasonably likely', 'likely', 'reasonably unlikely', 'unlikely' and 'highly unlikely'. This is because it may be necessary to study the logical relations between safety strategies and between the hazards leading to the accident. The failure consequence probability for safety rule Rule_{i,j,k,l} is denoted by E_{i,j,k,l}. Four conditional probabilities need to be estimated to determine E_{i,j,k,l}: e_{i,j,k,l} - SS_{i,j,k} is violated if Rule_{i,j,k,l} is violated; e_{i,j,k} - SC_{i,j} is violated given that SS_{i,j,k} is violated; e_{i,j} - HZ_{i,j} occurs if SC_{i,j} is violated; and e_i - AC_i happens if HZ_{i,j} occurs. Multiple analysts may be involved in the identification of the individual conditional probabilities. The failure consequence probability is estimated on the basis of these probabilities; for example, if e_{i,j,k,l}, e_{i,j,k}, e_{i,j} and e_i are all estimated as 'low', then the literal estimate for E_{i,j,k,l} would be 'low'. Obviously experience, together with an appreciation of the logical structure of the SSG, would enable a more informed assignment of membership degrees for the failure consequence probability.

4.1.3 Combination of Parameters

Suppose L_{i,j,k,l} represents the fuzzy set of the failure likelihood associated with Rule_{i,j,k,l} (i.e. the likelihood that Rule_{i,j,k,l} is violated) and C_{i,j,k,l} represents the fuzzy set of the consequence severity. The subjective safety description S_{i,j,k,l} for Rule_{i,j,k,l} can be defined as in (1), Karwowski and Mital (1986), where the symbol '∘' represents the composition operation and '×' the Cartesian product operation:

$$S_{i,j,k,l} = C_{i,j,k,l} \circ (E_{i,j,k,l} \times L_{i,j,k,l}) \qquad (1)$$

The relationship between the membership functions associated with S_{i,j,k,l}, C_{i,j,k,l}, E_{i,j,k,l} and L_{i,j,k,l} is:

$$\mu_{S_{i,j,k,l}} = \mu_{C_{i,j,k,l}} \circ (\mu_{E_{i,j,k,l}} \times \mu_{L_{i,j,k,l}}) \qquad (2)$$
where $\mu_{S_{i,j,k,l}}$ is the membership function for S_{i,j,k,l}, and the other terms are similarly defined.

4.1.4 Fuzzy Safety Identification

To evaluate S_{i,j,k,l} in terms of the basic safety expressions, it is necessary to characterize the expressions using membership degrees with respect to the same categories, so that the obtained subjective safety description can be mapped back to the pre-defined safety expressions. When characterizing the safety expressions, conditions such as (3) need to be satisfied to confine the safety expression space within a certain extent; for details see Wang et al (1996).

$$\mu_{S^{poor}_{i,j,k,l}} = \mu_{C^{catastrophic}_{i,j,k,l}} \circ (\mu_{E^{definite}_{i,j,k,l}} \times \mu_{L^{frequent}_{i,j,k,l}}) \qquad (3)$$
The variables 'poor', 'fair', 'average' and 'good' are the safety expressions (m = 1, 2, 3 or 4, respectively). Each fuzzy expression is defined as a set of seven pairs, in which the first element is the membership category and the second the membership degree:

poor    = {(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0.75), (7, 1)}      (4)
fair    = {(1, 0), (2, 0), (3, 0), (4, 0.5), (5, 1), (6, 0.25), (7, 0)}    (5)
average = {(1, 0), (2, 0.25), (3, 1), (4, 0.5), (5, 0), (6, 0), (7, 0)}    (6)
good    = {(1, 1), (2, 0.75), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0)}      (7)
The extent to which S_{i,j,k,l} belongs to the mth (m = 1, 2, 3 or 4) safety expression can be obtained using the Best-Fit method, Schmucker (1984), and is denoted by $\beta^{m}_{i,j,k,l}$.
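The following sketch illustrates how the composition in (1)-(2) and the Best-Fit identification can be operationalised. The max-min reading of the composition and the use of normalised reciprocal Euclidean distances for the Best-Fit degrees are assumptions made here for illustration, and the example judgement vectors are invented; they are not the authors' definitions.

import math

def compose(C, E, L):
    """One common max-min reading of S = C o (E x L); C, E, L are
    membership vectors over the seven categories."""
    # Height of the pointwise intersection of C and E.
    h = max(min(c, e) for c, e in zip(C, E))
    # mu_S(k) = max_j min(mu_C(j), mu_E(j), mu_L(k)) = min(mu_L(k), h).
    return [min(l, h) for l in L]

def best_fit(S, expressions):
    """Degrees beta_m to which S belongs to each safety expression,
    as normalised reciprocal Euclidean distances."""
    dists = {name: math.sqrt(sum((s - e) ** 2 for s, e in zip(S, expr)))
             for name, expr in expressions.items()}
    inv = {name: (1.0 / d if d > 0 else float("inf"))
           for name, d in dists.items()}
    if any(math.isinf(v) for v in inv.values()):   # exact match found
        return {name: 1.0 if math.isinf(v) else 0.0
                for name, v in inv.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}

# Safety expressions (4)-(7) as membership vectors over categories 1..7.
EXPRESSIONS = {
    "poor":    [0, 0, 0, 0, 0, 0.75, 1],
    "fair":    [0, 0, 0, 0.5, 1, 0.25, 0],
    "average": [0, 0.25, 1, 0.5, 0, 0, 0],
    "good":    [1, 0.75, 0, 0, 0, 0, 0],
}
# Illustrative (invented) analyst judgements for one safety rule.
L = [0, 0, 0.25, 1, 0.5, 0, 0]   # failure likelihood
C = [0, 0.5, 1, 0.5, 0, 0, 0]    # consequence severity
E = [0, 0, 0.5, 1, 0.5, 0, 0]    # failure consequence probability
S = compose(C, E, L)
print(best_fit(S, EXPRESSIONS))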
4.2. Evidential Reasoning: Hierarchical Propagation for Safety Synthesis

Evidential reasoning is used to synthesize the judgements of different safety analysts, in order to determine a safety evaluation for each rule, and then to propagate the safety evaluations up the levels of the framework.

4.2.1 Fuzzy Set Modelling by Multiple Safety Analysts

If multiple safety analysts are involved in the safety analysis process, their judgements need to be synthesized. A diagram for synthesizing the judgements on a safety rule produced by multiple safety analysts is shown in Figure 2. Suppose there are N safety analysts who assign membership degrees for the three basic safety parameters associated with a safety rule, and let L_{i,j,k,l,n}, C_{i,j,k,l,n} and E_{i,j,k,l,n} represent the three basic safety parameters associated with Rule_{i,j,k,l} as judged by safety analyst n (n = 1, ..., N). The subjective safety description S_{i,j,k,l,n} associated with Rule_{i,j,k,l} judged by safety analyst n can be obtained as:

$$S_{i,j,k,l,n} = C_{i,j,k,l,n} \circ (E_{i,j,k,l,n} \times L_{i,j,k,l,n}) \qquad (8)$$
S_{i,j,k,l,n} (n = 1, ..., N) can be mapped back to the defined safety expressions to identify the uncertain safety evaluation S(S_{i,j,k,l,n}) associated with Rule_{i,j,k,l} as judged by safety analyst n. Suppose $\beta^{m}_{i,j,k,l,n}$ (m = 1, 2, 3 or 4) represents the extent to which S_{i,j,k,l,n} belongs to the mth safety expression. S(S_{i,j,k,l,n}) can then be expressed in the following form:

$$S(S_{i,j,k,l,n}) = \{(\beta^{1}_{i,j,k,l,n}, \text{'poor'}),\ (\beta^{2}_{i,j,k,l,n}, \text{'fair'}),\ (\beta^{3}_{i,j,k,l,n}, \text{'average'}),\ (\beta^{4}_{i,j,k,l,n}, \text{'good'})\} \qquad (9)$$
It is then required to synthesize all S(S_{i,j,k,l,n}) (n = 1, ..., N) to obtain the safety evaluation associated with Rule_{i,j,k,l}. An evidential reasoning approach can be employed to synthesize the S(S_{i,j,k,l,n}) and to take into account the weight of each safety analyst without losing any useful safety information. Evidential reasoning is well suited to handling uncertain and inconsistent safety evaluations, Yang and Sen (1994), and is based on the principle that a given hypothesis becomes more likely to be true as more pieces of evidence support it. In Figure 2, whether the safety evaluation associated with a safety rule belongs to 'poor', 'fair', 'average' or 'good' can be regarded as a hypothesis. If the judgement on a safety rule produced by a safety analyst is to some extent evaluated as 'good', for example, then the safety associated with the safety rule would be to some extent evaluated as 'good', depending on the judgement itself and the weight of the safety analyst in the evaluation process. The application of the evidential reasoning approach provides a systematic way of synthesizing such uncertain safety evaluations involving multiple analysts' judgements to produce the safety evaluation for a safety rule.

4.2.2 Hierarchical Propagation of Safety Evaluations

After the safety evaluation associated with each safety rule has been obtained, the safety evaluations associated with all rules Rule_{i,j,k,l} (l = 1, ..., L(k)) are synthesized to obtain the safety evaluation associated with SS_{i,j,k}. Then the safety evaluations produced for all SS_{i,j,k} (k = 1, ..., K(j)) are synthesized to obtain the safety evaluation associated with SC_{i,j}.
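A compact sketch of a weight-aware evidential combination in the spirit of Yang and Sen (1994) is given below; it can be applied both to combining the analysts' judgements on one rule and to the level-by-level synthesis. The normalisation conventions differ between published versions of the evidential reasoning algorithm, so this is an illustration rather than a reimplementation of the cited method, and the beta-vectors and weights are invented.

def er_combine(assessments, weights):
    """assessments: list of beta-vectors (one per analyst or per child
    specification) over the safety expressions; weights sum to one."""
    M = len(assessments[0])
    # Basic probability masses of the first assessment.
    m = [weights[0] * b for b in assessments[0]]
    H = 1.0 - sum(m)                       # unassigned (remaining) mass
    for beta, w in zip(assessments[1:], weights[1:]):
        mn = [w * b for b in beta]
        Hn = 1.0 - sum(mn)
        # Normalising factor that redistributes the conflicting mass.
        conflict = sum(m[i] * mn[j]
                       for i in range(M) for j in range(M) if i != j)
        K = 1.0 / (1.0 - conflict)
        m = [K * (m[i] * mn[i] + m[i] * Hn + H * mn[i]) for i in range(M)]
        H = K * H * Hn
    # Reassign the remaining mass proportionally (one common convention).
    return [mi / (1.0 - H) for mi in m]

# Two analysts judging one safety rule over ('poor','fair','average','good').
betas = [[0.0, 0.2, 0.6, 0.2],
         [0.0, 0.4, 0.5, 0.1]]
print(er_combine(betas, [0.5, 0.5]))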
5. CONCLUSIONS

A framework incorporating fuzzy set modelling and evidential reasoning is proposed for the subjective safety analysis of software requirements specifications for safety-critical systems. In this framework, a fuzzy set modelling method is used to analyse the safety associated with a safety rule, which is judged in terms of three
basic parameters by multiple safety analysts. An evidential reasoning approach is then used to synthesize the information produced to obtain the safety evaluation associated with the safety requirements specifications. Finally, a comparison is made between the estimated safety evaluation and the stipulated risk level. The proposed framework can be used as an alternative approach for conducting safety analysis of software specifications, especially in situations where there is a lack of quantitative safety data for probabilistic risk analysis and where non-numerical safety data must be dealt with. Enhancements to the approach include conducting a Failure Mode, Effects and Criticality Analysis (FMECA) of each safety rule and then employing fuzzy set modelling at the failure mode level. This may make it more effective and efficient for safety analysts to make judgements. Other factors, such as the assumptions on the basis of which specifications are produced, may also need to be taken into account to increase the effectiveness of the framework and to facilitate more practical applications.

ACKNOWLEDGEMENT
The authors acknowledge the financial support of the COPERNICUS Joint Research Project ISAT, and thank Dr. J. B. Yang of Birmingham University for allowing the use of his software.

REFERENCES
Andersson, L. (1988). The Theory of Possibility and Fuzzy Sets: New Ideas for Risk Analysis and Decision Making, Swedish Council for Building Research.
de Lemos, R., Saeed, A. and Anderson, T. (1994). On the safety analysis of requirements specifications. Proceedings of the 13th International Conference on Computer Safety, Reliability and Security (SAFECOMP'94), Ed. Victor Maggioli, Anaheim, CA, 217-227.
de Lemos, R., Saeed, A. and Anderson, T. (1995). Analysing safety requirements for process control systems. IEEE Software 12:3, 42-53.
International Electrotechnical Commission (1992). IEC/SC65A: Functional Safety of Electrical/Electronic/Programmable Electronic Systems: Generic Aspects. IEC (Secretariat) 123.
Karwowski, W. and Mital, A. (1986). Potential applications of fuzzy sets in industrial safety engineering. Fuzzy Sets and Systems 19, 105-120.
Keller, A. Z. and Kara-Zaitri, C. (1989). Further applications of fuzzy logic to reliability assessment and safety analysis. Microelectronics and Reliability 29:3, 399-404.
Saeed, A., de Lemos, R. and Anderson, T. (1994). An approach to the risk analysis of safety specifications. Proceedings of the 9th Annual Conference on Computer Assurance (COMPASS'94), Gaithersburg, MD, 209-222.
Saeed, A., de Lemos, R. and Anderson, T. (1995). On the safety analysis of requirements specifications for safety-critical software. ISA Transactions 34:3, 283-295.
Schmucker, K. J. (1984). Fuzzy Sets, Natural Language Computations and Risk Analysis, Computer Science Press.
Wang, J., Yang, J. B. and Sen, P. (1995). Safety analysis and synthesis using fuzzy set modelling and evidential reasoning. Reliability Engineering and System Safety 47, 103-118.
Wang, J., Saeed, A. and de Lemos, R. (1996). Subjective Safety Analysis of Safety Requirements Specifications. Technical Report, Dept. of Computing Science, University of Newcastle upon Tyne (to appear).
Yang, J. B. and Sen, P. (1994). A general multi-level evaluation process for hybrid MADM with uncertainty. IEEE Transactions on Systems, Man and Cybernetics 24:10, 1458-1473.
A15: Software Reliability
SOFTWARE RELIABILITY PREDICTION RECALIBRATION BASED ON THE TTT-PLOT

M. Zhao and M. Helander
Division of Quality Technology and Management
Linköping University, S-581 83 Linköping, Sweden
ABSTRACT
An important aspect of software reliability modelling is to make a good reliability prediction system. Although many software reliability growth models have been proposed, no single model has high prediction ability for different data sets. As a result, analysts must consider a large number of models to try to get good reliability predictions. Software reliability prediction recalibration is one technique to overcome the difficulty of model selection, in the sense that only one basic model is needed for making a raw prediction system. A new prediction system is then created based on the basic model and by learning from the behaviour of the raw prediction system. However, the spline function technique is needed to find a recalibrating distribution function, which is not easy to apply in practice. In this paper we use the TTT-plot technique, well known in hardware reliability, for identifying the recalibrating distribution function. It is easier to apply and at the same time has the advantage of supporting the use of software reliability predictions in decision making, such as the determination of the optimal release time of software. We give a numerical example to illustrate the technique.

KEY WORDS

Software reliability, TTT-plot, recalibration, spline functions, lifetime distribution.

INTRODUCTION
An important aspect of software reliability modeling is to make a good reliability prediction system. A reliability prediction system is commonly established by applying a software reliability growth model and estimating its parameters from testing data. Many publications have contributed to finding better models and assessing their prediction ability. However, it has been realized that no single model, out of the more than fifty previously proposed, can possess high prediction ability in all circumstances, see e.g. Abdel-Ghaly et al. (1986), Brocklehurst et al. (1990), Littlewood (1989), Xie (1991) and Zhao (1994). This fact forces analysts to consider a large number of models for a given data set in order to find a better model. However, it is still difficult to guarantee that a better model can make good predictions for the future behavior of system failures. In order to achieve high prediction accuracy, a new approach, called reliability prediction recalibration, has been considered by Brocklehurst et al. (1990). In this approach, the spline function technique is used to recalibrate the raw prediction system and to create a new prediction system. Some improvement of the accuracy of the prediction can be achieved. In this paper, we consider the prediction recalibration of software reliability by making use of the well-known properties of the TTT-plot, see Bergman and Klefsjö (1984). Based on the pattern of the corresponding TTT-plot, it is easier to find an appropriate lifetime distribution that can be used for prediction recalibration. Compared to the method used in Brocklehurst et al. (1990), where spline functions were used to approximate the distribution function represented by u-plots, the application of the TTT-plot is straightforward
and easier to carry out. Numerical examples are provided in this paper to highlight the technique of TTT-plots in software reliability prediction recalibration.

PREDICTION SYSTEM AND TTT-PLOT

A central topic in software reliability modelling is to establish and assess the performance of a reliability prediction system by which the software reliability can be predicted based on testing data. This section describes how a software reliability prediction system is built and how the TTT-plot is used to measure the performance of the prediction system. For other methods to measure the ability of a prediction system, see e.g. Abdel-Ghaly et al. (1986), Brocklehurst et al. (1990), Littlewood (1989), Musa et al. (1987) and Zhao (1994).

Software Reliability Model and Prediction System
The common testing data is of the form displayed in Figure 1.

[Figure 1 shows the failure process of the software system on a time axis: past failures at t_1, t_2, t_3, ..., t_n up to 'now', and future failures at T_{n+1}, T_{n+2}, ... .]

Figure 1: Failure process of a software system

During a testing period of time t, successive failures occur at time points t_i, i = 1, 2, ..., n. We want to estimate the distribution function of the waiting time W_{n+1} to the next failure, or the software reliability function defined by

$$R(x) = P(W_{n+1} > x) \qquad (1)$$
We only consider nonhomogeneous Poisson process (NHPP) models in this study, but note that the technique described can be applied to other models as well. It is known, see e.g. Musa et al. (1987), Xie (1991) and Zhao (1994), that the reliability function defined by (1) is equal to

$$R(x|\theta) = \exp(-[M(t+x,\theta) - M(t,\theta)]) \qquad (2)$$

where M(t, θ) is the mean value function of an NHPP model, and θ is the unknown model parameter. In order to evaluate the reliability function, the unknown parameter θ in R(x|θ) is replaced by its estimate, which is commonly given by the maximum likelihood (ML) method. After obtaining the estimate $\hat\theta$, the future behavior is predicted using the prediction function $F(x) = 1 - R(x, \hat\theta)$. When additional data are available, the parameter is re-estimated and a new prediction function is produced. In this way, we say that a prediction system is set up. One important problem is how to assess the performance of a prediction system. The objectives of prediction assessment are two-fold. One is to set up a criterion by which it can be determined whether a prediction system is good or not. The other is the selection of a prediction system. During the last few decades, some approaches have been developed for assessing prediction systems. However, much effort has been devoted to model selection based on prediction criteria, see Abdel-Ghaly et al. (1986), Brocklehurst et al. (1990), Littlewood (1989), Musa et al. (1987) and Zhao (1994). When a prediction system is not able to perform well, it is usually thought that the underlying model is inappropriate, so that one has to look for other models.
Performance of the Prediction System

Suppose that we start to predict the process with t_1, t_2, ..., t_n and estimate the parameter θ. The estimate is denoted by $\hat\theta_n$. When T_{n+1} is observed, the estimate $\hat\theta_n$ is replaced by $\hat\theta_{n+1}$, which is estimated using the data t_1, t_2, ..., t_n, T_{n+1}. Such a prediction process is continued until the time when we want to assess the prediction system, for instance when m more failures have been observed. If the prediction performs well, it is expected that the sequence

$$x_{ni} = M(T_{n+i}, \hat\theta_{n+i-1}) - M(T_{n+i-1}, \hat\theta_{n+i-1}), \quad i = 1, 2, \ldots, m,$$

will look like a random sample from an exponential distribution, since this holds exactly if both the model and the estimates are perfect, see Abdel-Ghaly et al. (1986), Littlewood (1989) and Zhao (1994). Strictly speaking, the sequence {x_{ni}} can never be a random sample from an exponential distribution. Some analysis approaches for the exponential distribution can nevertheless serve as an empirical method for assessing the performance of the prediction system by treating {x_{ni}} as a random sample from an exponential distribution. One simple approach for testing exponential distributions is the TTT-plot technique. Here TTT stands for
total time on test, which is computed from the sequence {x_{ni}} as follows. Let $x_{(n1)} \le x_{(n2)} \le \cdots \le x_{(nm)}$ be the ordered sequence formed by x_{n1}, x_{n2}, ..., x_{nm}. Define $s_0 = 0$, $s_1 = m\,x_{(n1)}$ and

$$s_i = \sum_{j=1}^{i} (m-j+1)\,(x_{(nj)} - x_{(n,j-1)}), \quad i = 2, 3, \ldots, m.$$

Then the TTT-plot is produced by plotting the points $(i/m,\; s_i/s_m)$, i = 0, 1, ..., m, and connecting these points by line segments.
The TTT-plot was originally considered by Barlow and Campo (1975) for analyzing lifetime data. It is a simple tool for identifying the properties of lifetime distributions, see Bergman and Klefsjö (1984) for a survey of the TTT-plot technique. In the present problem, the prediction system can be said to work well if the TTT-plot is close to the line of unit slope.
TTT-plot and Scaled TTT-transform

Let F(t) be a lifetime distribution with finite mean μ. The scaled TTT-transform of F is defined as

$$\varphi(u) = \frac{1}{\mu} \int_{0}^{F^{-1}(u)} \bar{F}(t)\,dt, \quad 0 \le u \le 1, \qquad (3)$$

where $F^{-1}(u) = \inf\{t : F(t) \ge u\}$ and $\bar F(x) = 1 - F(x)$. It is well known that $\varphi(u) = u$, $0 \le u \le 1$, if F is an exponential distribution. Furthermore, φ(u) is concave if F is IFR (increasing failure rate) and convex if F is DFR (decreasing failure rate). This property helps us identify the class of distributions when we create a new prediction system based on a raw prediction system. Figure 2 gives some examples of scaled TTT-transforms. Note that the TTT-transform of a distribution is independent of scale and, in particular, it is parameter free if F is an exponential distribution. Under rather general conditions, the TTT-plot from a random sample of F converges, uniformly and with probability one, to the scaled TTT-transform as n → ∞, see Langberg et al. (1980) for details. This asymptotic property implies that the pattern of a TTT-plot can indicate the aging properties of a distribution. Therefore, we can use the TTT-plot to identify the aging property of the distribution.
[Figure 2 shows four scaled TTT-transform curves on the unit square.]

Figure 2: Scaled TTT-transforms from four lifetime distributions: (1) Weibull distribution with shape parameter 2; (2) exponential distribution; (3) log-normal distribution with μ = 0, σ = 1; (4) Weibull distribution with shape parameter 0.5.

RECALIBRATION OF THE RAW PREDICTION SYSTEM

Suppose that we have a data set of software failures. A new prediction system is built in the following steps (a sketch following the list illustrates steps d-f):

a. Divide the data into two parts.
b. Make a raw prediction system based on the first part of the data and the selected model.
c. Create the TTT-plot from the predictions for the second part of the data.
d. Determine a recalibrating distribution F that best matches the pattern of the TTT-plot.
e. Estimate the parameters of the distribution function F using the data forming the TTT-plot.
f. Predict the distribution function of the next failure time T_{n+1} by
$$P(T_{n+1} \le x) = F(-\log(R(x))), \qquad (4)$$

where R(x) is the reliability function in the raw prediction system.

g. When new data arrive, repeat steps e and f.

Note that when the prediction is perfect, the distribution of $-\log(R(T_{n+1}))$ is exponential. Because of the error of the model, the distribution is F, so the distribution of T_{n+1} is recalibrated with formula (4).

NUMERICAL EXAMPLE

In this section, we apply the recalibration technique to the data set USBAR from City University, London. The data set contains a total of 397 failures and their occurrence times. The raw prediction system is made using the Goel-Okumoto model proposed by Goel and Okumoto (1979), which is one of the most widely used software reliability growth models. The mean value function of the Goel-Okumoto model is given by
$$M(t) = a(1 - e^{-bt}), \qquad (5)$$

where parameter a is the expected number of failures, and b is a parameter representing the intensity of testing. The ML estimates of the parameters a and b are determined by the equations

$$\hat a = \frac{N(t)}{1 - e^{-\hat b t}} \qquad (6)$$

$$\frac{1}{\hat b} - \frac{t\,e^{-\hat b t}}{1 - e^{-\hat b t}} = \frac{1}{N(t)} \sum_{i=1}^{N(t)} t_i \qquad (7)$$
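The ML equations can be solved numerically as sketched below: equation (7) involves only b and can be solved by bisection, after which (6) gives a. The bracketing and iteration counts are illustrative choices, and a root is assumed to exist in the widened bracket.

import math

def fit_goel_okumoto(times, t):
    """times: failure times t_1..t_n observed in (0, t]."""
    n = len(times)
    mean_t = sum(times) / n

    def g(b):
        # Equation (7) rearranged so that g(b) = 0 at the ML estimate.
        return (1.0 / b
                - t * math.exp(-b * t) / (1.0 - math.exp(-b * t))
                - mean_t)

    lo, hi = 1e-12, 1.0
    while g(hi) > 0:          # widen until the root is bracketed
        hi *= 2.0
    for _ in range(200):      # bisection: g is positive at lo
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = n / (1.0 - math.exp(-b * t))   # equation (6)
    return a, b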
For this data set, the Goel-Okumoto model is not a suitable model, see Figure 3. However, the recalibration technique can improve the predictions considerably.

[Figure 3 plots the cumulative number of failures (0-400) against occurrence times (0-12 × 10^4).]

Figure 3: Plot of the cumulative number of failures against occurrence times, and the modelling by the Goel-Okumoto model.

We start to evaluate the raw prediction system from the 250th failure until the 300th failure. The corresponding TTT-plot is shown in Figure 4.
[Figure 4 shows the TTT-plot on the unit square.]

Figure 4: TTT-plot for the raw prediction system using the failure times from 250 to 300.

The TTT-plot shows that the raw prediction system does not work well, and the shape of the TTT-plot implies that a Weibull distribution with a shape parameter less than one is a candidate for the prediction recalibration. From the 301st failure we use a Weibull distribution and create a new prediction system. Figure 5 is the TTT-plot made from the new prediction system employing the Weibull distribution as the recalibrating distribution. We see that the prediction system works well for this data set. Compared with the raw prediction, see Figure 6 for the TTT-plot from the raw prediction system, the new prediction is greatly improved.
[Figure 5 shows the TTT-plot on the unit square, close to the diagonal.]

Figure 5: TTT-plot of the new prediction system for the second part of data set USBAR.
[Figure 6 shows the TTT-plot of the raw prediction system on the unit square.]

Figure 6: TTT-plot of the raw prediction for the second part of data set USBAR.

SUMMARY

In this paper we have proposed an approach for software reliability prediction recalibration. Compared with the recalibration previously studied, the TTT-plot method has two main advantages. First, it is easier to implement, since the procedure involves standard techniques such as the parameter estimation of Weibull distributions. The application of spline functions for approximating the u-plot may not be familiar to software engineers. Second, one of the most important aspects of software reliability prediction is its usefulness for decision making. For example, the optimal release time of software is one important consideration. However, several studies have shown, see e.g. Zhao and Xie (1993), that the determination of the optimal release time is difficult due to the poor prediction of software reliability. Since the TTT-plot approach considered here provides us with a parametric model with high prediction ability, the determination of the optimal release time can be improved by this approach.

REFERENCES

Abdel-Ghaly, A.A., Chan, P.Y. and Littlewood, B. (1986). Evaluation of competing software reliability predictions. IEEE Trans. Software Eng., SE-12, 950-967.
Barlow, R.E. and Campo, R. (1975). Total time on test processes and applications to failure data analysis. In: Reliability and Fault Tree Analysis, ed. Barlow, R.E. et al., Philadelphia, pp. 451-481.
Bergman, B. and Klefsjö, B. (1984). The total time on test concept and its use in reliability. Operations Research, 32, 596-606.
Brocklehurst, S., Chan, P.Y., Littlewood, B. and Snell, J. (1990). Recalibrating software reliability models. IEEE Trans. Software Eng., SE-16, 456-470.
Goel, A.L. and Okumoto, K. (1979). Time-dependent error-detection rate model for software reliability and other performance measures. IEEE Trans. Reliability, R-28, 206-211.
Langberg, N.A., Leon, R.V. and Proschan, F. (1980). Characterization of non-parametric classes of life distributions. Ann. Prob., 8, 1163-1170.
Littlewood, B. (1989). Predicting software reliability. Phil. Trans. R. Soc. London A, 327, 513-527.
Musa, J.D., Iannino, A. and Okumoto, K. (1987). Software Reliability: Measurement, Prediction, Application. McGraw-Hill, New York.
Xie, M. (1991). Software Reliability Modelling, World Scientific Publisher, Singapore.
Zhao, M. and Xie, M. (1993). Robustness of optimum software release policies. Proc. of the 4th Int. Symp. on Software Reliability Engineering, pp. 218-225.
Zhao, M. (1994). Nonhomogeneous Poisson processes and their application in software reliability. PhD Dissertation, No. 336, Linköping University.
SAFETY MONITOR SYNTHESIS BASED ON HAZARD SCENARIOS

Janusz Górski (1,2) and Bartosz Nowicki (1)
(1) Centre of Software Engineering, ITTI, Mansfelda 4, 60-854 Poznań 6, Poland, {gorski, nowicki}@itti.efp.poznan.pl
(2) Department of Applied Informatics, Technical University of Gdańsk, 80-954 Gdańsk, Poland
ABSTRACT

It is common practice that safety-critical software systems are extensively verified and validated before their deployment. However, even the most advanced verification methods are not capable of removing all faults from the system. Moreover, in many cases the safe operation of the system depends on a set of (sometimes implicit) design assumptions related to its target environment. If such an assumption is violated (due to the unpredictability of the environment), it may result in a hazardous situation. To deal with the above problem, the safety monitoring principle can be adopted while designing safety-critical applications. The paper presents an approach to the systematic synthesis of a safety monitor device. The approach is based on object-oriented modelling of the application, followed by the identification of possible scenarios leading to a hazardous situation. The scenarios are then used to synthesise a safety monitor: an automaton which is capable of detecting that a hazardous scenario is developing in the system.
KEYWORDS

safety, safety monitoring, safety monitor synthesis, object orientation
INTRODUCTION

Safety is an attribute of the whole application and should be considered within a broad context including the control system, the plant and the environment. Often, safety depends on factors which are beyond the direct reach of the control system, Górski et al (1995). During control system design there are many (direct or indirect) assumptions whose validity is taken for granted; e.g. while designing a road crossing control system it is commonly assumed that car drivers obey the road signalling lights. No control algorithm is capable of preventing an accident caused by a driver ignoring the road lights in such a system. Although system designers have no means to enforce that the validity of such assumptions is maintained throughout system operation, they can still equip the system with additional mechanisms which observe the environment and check whether those assumptions remain valid. The above considerations lead to the concept of a safety monitor: a device which continuously observes the system and verifies that previously identified (potentially dangerous) behavioural patterns do not occur. If such a threat is discovered, the monitor raises an alarm, which in turn can trigger corrective actions in the system. The idea of the safety monitor conforms to international regulations on safety-critical systems, International Electrotechnical Commission (1994), where continuous on-line supervision is recommended in order to identify hazardous situations before they become accidents. Such a monitoring
system can activate other safety devices which aim at preventing the occurrence of an accident (e.g. initiate moving the system to a fail-safe state). An approach to the synthesis of a device monitoring the validity of the environmental assumptions, based on temporal logic, can be found in Górski (1989). The problem with this approach was that it was difficult to scale up to larger (i.e. more realistic) situations. This paper presents a systematic approach to deriving a safety monitor from a set of hazard scenarios. A hazard scenario is a sequence of events which, if it occurs, moves the system to an unsafe state. A hazard scenario is a consequence of faults (both design and random faults) present in the system. To get a set of hazard scenarios for a given application we start with building an object-oriented model of the application, then we follow the method of Górski et al (1996) to extend the model with safety aspects, and finally we apply the method presented in Górski et al (1995) to enrich the model with possible fault modes. The resulting model is then simulated and the appropriate hazard scenarios are captured. The key idea of the approach is that the monitor 'learns' these scenarios in order to recognise them in the actual system, at run-time. Throughout the paper we use the object-oriented methodology presented in Rumbaugh et al (1991). The presentation refers to a well-known case study, the gas burner system. As the starting point we assume the results of the object-oriented safety analysis of the gas burner system presented in Górski et al (1996).
SAFETY MONITORING

The set of all states possible for a given system constitutes its state space. This space can be split into subdomains which differ regarding the criterion of their selection. One possible distinction is given below:
• correct states - the states admitted by the system functional requirements (mission related),
• incorrect states - the states which contradict the mission requirements.
Another distinction, made from the safety standpoint, can be as follows:
• dangerous states - hazardous states directly leading to a hazard,
• safe states - those states that are not dangerous.
Let us assume that for each dangerous state H we can define a monitoring criterion which selects a set of safe states surrounding H, called the danger zone of H. It is assumed that the system can reach H only by passing through the associated danger zone, i.e. any scenario leading to H will visit some states of the danger zone of H before entering H. The safety monitor of a given dangerous state H is a device which continuously observes actual system states and compares them against the defined monitoring criterion of H. Whenever the system visits a state belonging to the danger zone of H, the monitor raises an alarm signal. Putting the above idea into practice encounters the following problems:
• identification of the monitoring criterion,
• access to the system state - not all relevant system parameters are directly measurable,
• complexity of the monitoring device - a complex monitor increases the complexity of the overall system and may introduce new threats and decrease the overall reliability.
A monitor is an additional safety mechanism which is incorporated into the existing system to strengthen the safety guarantees of the whole application. Therefore, there are some important quality features required from such a mechanism, Górski et al (1997):
High sensitivity. The monitor should not 'overlook' any situation which would lead to a dangerous state.
False alarm elimination. The monitor should not be 'oversensitive' in the sense of generating spurious alarms when it is not necessary.
Early warning. When raising an alarm, the monitor should leave enough time for an appropriate reaction of the system, before the dangerous state occurs.
Feasibility. The monitor should be physically and technically feasible within reasonable cost limits.
Independence. Both the algorithm of monitoring and the technology of its implementation should be different from those employed in the target system, in order to avoid common mode failures.
Simplicity. With a simple device it is easier to meet a high reliability level. Using a monitoring device whose reliability is lower than the reliability of the target system would not be a wise solution.

OVERVIEW OF THE METHOD

The method of safety monitor synthesis presented in the paper comprises the following steps:
Step 1: Hazard scenario identification

A sequence of events leading to a hazardous state, together with the timing relations between those events, is called a hazard scenario. Such a scenario can be triggered by a fault in the system or in its environment. In our approach we concentrate on deriving those scenarios from a model. Simulation of the model is one way of capturing hazard scenarios. Alternatively, if the problem is of a reasonable size, all hazard scenarios can be captured by reachability analysis, c.f. Holzman (1991). In that case we have a guarantee that all hazard scenarios related to the faults covered by the model have been identified.
Step 2: Implantation

The identified hazard scenarios form the base for the monitor specification. Eventually, the monitor is to be linked into the actual application and will be driven by the application events. We call this the implantation of the monitor into the application. Successful implantation requires that the monitor has access to the events generated by the application. This means that those events have to be measurable, i.e. there must be some technical means to detect their occurrence. Normally, we can expect that all events exchanged through the interface between the plant and the control system are measurable (as the control system has direct access to all sensors, detectors and actuators of the plant). However, the events of a hazard scenario are not guaranteed to be measurable, as they may be defined in terms of some application parameters which are beyond the control_system-plant interface. This problem is not trivial and sometimes cannot be solved by just adding an additional sensor to the plant (as measuring some parameters can be physically or economically infeasible).
Step 3: Monitor definition For each hazard scenario we define a state machine which accepts only the sequence of events belonging to the scenario and refuses any other. Acceptance of the full scenario moves the state machine to the ALARM state. The monitor is a union of all state machines related to the considered set of hazard scenarios. Thus, the monitor raises an ALARM if and when any of the related state machines moves to the ALARM state.
Step 4: Tuning

Tuning of a monitor refers to actions which ensure that the monitor neither raises alarms prematurely nor delays the alarm signalling too long, until the hazard is about to happen. As the resulting monitor is based on the hazard scenarios, it raises the ALARM together with entering the hazardous state. This is clearly too late, as at that point the system is already unsafe. Therefore we have to move the ALARM generation backwards in the scenario and in this way force the warning to occur earlier, before the system actually enters the hazard. This is achieved by cutting off a final subsequence of the hazard scenario and adding ALARM at the end of the remaining part.
It is often the case that a hazard scenario comprises, in its initial part, some events which are generated during the system initialisation phase. After the initialisation phase the system works in a (perhaps endless) loop. In our approach we concentrate on the 'normal' working conditions of the system and therefore we exclude the initialisation phase from monitoring; accordingly, we cut off events from the beginning of the hazard scenario. The above tuning of a monitor is highly dependent on application-specific knowledge. In both cases, removing too many events causes the monitor to raise false alarms.
HAZARD SCENARIO IDENTIFICATION Let us assume that a set of system hazards is known. The method presented in this paper does not deal with hazard identification. Instead, we concentrate on detection and analysis of system behaviours which can lead to a hazardous state (hazard scenarios). This is accomplished in three phases:
Phase 1: Development of the object model

The object model of the application is developed in accordance with the OMT (Object Modelling Technique) method, Rumbaugh et al (1991). The model is sufficiently broad to provide for safety analysis: it includes objects representing the control system, the controlled system and the external environment. While the class model defines the structural properties of the system, the corresponding dynamic model describes the behavioural properties of objects and the related constraints. The functional model of OMT is of less importance in control-dominated applications. The dynamic model is expressed in the formalism of statecharts, Harel (1987), and forms the base for analysing the dynamic aspects of the system. Although OMT provides a firm base for system modelling, it is predominantly used for mission-oriented modelling and needs some extensions to provide for safety analysis. The extensions are elaborated in Phase 2 and Phase 3.
Phase 2: Extending the object model to cover the safety aspects

The model developed during Phase 1 is oriented towards the system mission. However, even if the mission involves high risk, the model does not identify this possibility in an explicit way. In order to cover this aspect we develop an additional model which explicitly distinguishes between safe and unsafe states. Such a model is shown in Figure 1.

[Figure 1 shows a generic two-state model with transitions between the states 'safe' and 'unsafe'.]

Figure 1: A generic model of a hazard

The model distinguishes between two states, safe and unsafe, so it is built from a different point of view compared to the mission-oriented model: the focus is on modelling the safety aspect of the system. As a result we have two different models of the same system:
• the model focused on the mission aspects, and
• the model focused on the safety aspects.
The two models are then merged into one, in which the interference between the mission and safety requirements can be studied and analysed. Hazard reachability analysis is performed to identify possible hazard scenarios. The details of this approach can be found in Górski et al (1996).
Phase 3: Enriching the object model with faults

The focus here is on the vulnerability of the system to some classes of faults. In particular, it is checked whether some faulty behaviours of objects could endanger safety. Frequently, a threat to safety is the consequence of a faulty behaviour of some part of the system. Such a fault is represented as a departure from the (explicit or implicit) assumptions about the intended behaviour. Faults originate in fault-initiating objects. The method employs a generic list of possible faulty situations which can be identified with respect to a given transition in the dynamic model of the fault initiator. The decision on which faults should be taken into account depends on the application context. The extended specification (with faults included) of the fault initiator object replaces the original one in the object-oriented model. Then the usual reachability analysis is performed to check if the hazard is possible in the presence of the considered faults. The details of this approach can be found in Górski et al (1995).
GAS BURNER EXAMPLE

The example considered in this paper is a gas burner system. The set of objects of this system, together with the flows of events among those objects, is presented in Figure 2. The mission is to supply heat to a technological process. The model comprises the burning chamber, gas valve, ignition device, temperature sensor, control system, environment and operator. The computer-based controller receives signals from the system operator to start and stop the heating process (Heatingon, Heatingoff). The controller controls the ignition device (Sygignon) and the valve on the gas supply pipe (Syggason, Syggasoff). The ignition device generates sparks (Spark) in response to the command received from the controller. A spark is generated after the delay necessary for charging the ignition device; signals from the controller received during the charging process are ignored. The gas valve provides gas to the burning chamber (Gason, Gasoff) depending on the command from the control system. The increase of the gas concentration in the burning chamber is proportional to the time period during which the valve remains open (assuming that there is no ignition). Similarly, if the valve closes, the gas concentration decreases proportionally to time, due to ventilation. The minimal gas concentration sufficient for successful gas inflammation is known. Not every spark generated while the gas concentration is sufficient for ignition will result in gas inflammation, due to the possibility of a temporary draught (Draught). A draught can also blow the flame out. The information about the presence (Flameon) or absence (Flameoff) of flame in the burning chamber is (with some delay) detected by the temperature sensor and then passed to the control system (Cold, Hot). It is assumed that the temperature sensor can fail and generate the Hot event even if there is no fire in the chamber.
[Figure 2 shows the event flow among the objects: the Environment (Draught) and the Ignition device (Spark) act on the Burning Chamber; the Burning Chamber signals Flameon/Flameoff to the Temperature Sensor, which passes Cold/Hot to the Control system; the Gas Valve receives Syggason/Syggasoff from Control and issues Gason/Gasoff to the Burning Chamber; Control receives Heatingon/Heatingoff from the Operator and issues Sygignon to the Ignition device.]

Figure 2: Event flow diagram for the gas burner system
MONITOR REALISATION

In this section we present how the idea of monitor synthesis based on hazard scenarios can be accomplished within an object-oriented modelling framework. It is assumed that the behaviour of objects is expressed
using the formalism of statecharts, Harel (1987), and that STATEMATE, i-Logix (1996), is used as the supporting tool. The presentation refers to the gas burner case study introduced in the previous section.
Step 1: Hazard scenario identification

The STATEMATE tool provides for both simulation, in interactive and batch modes, and reachability analysis. If the reachability analysis shows that an unsafe state is reachable, a TRC file is created. It comprises a very detailed description of what happened in the system during consecutive steps of the simulation; among other things, it includes information concerning modifications of variables and the generation of events. This information is sufficient to describe the scenario related to a given simulation run. A reachability analysis performed on the gas burner model revealed a number of hazard scenarios belonging to two classes. The first class includes scenarios caused by a design fault in the system. It relates to the situation where the operator repeatedly switches the heating on and off. If the frequency is high enough, the procedure of preparing the burning chamber for ignition cannot be completed. Consequently, the gas concentration increases and eventually reaches the dangerous limit. The second class includes scenarios caused by a random error of the temperature sensor and relates to the situation where the sensor acknowledges the presence of fire in the burning chamber while actually there is no fire. The control system is informed that the gas is burnt and keeps the gas valve open, which finally leads to an unsafe gas concentration. A relevant part of the TRC file (only the first two steps) obtained from the gas burner model simulation is shown in Figure 3.
0 clock units value change_type
Step: 2 Phase : -- < C H A N G E S > elem_type Data-item T I 1 Int Event SYGGASON X Int State HEATING I Int State PREPARINGTOIGNI o o o
1 clock units value change_type
1 Time : name (type)
I Int
Figure 3" Part of a hazard scenario recorded in the TRC file Hazard scenarios are captured as TRC files. Some difficulty arises however if a faulty event participates in the scenario (e.g. when the fault model assumes that the same transition can be fired with and without occurrence of its activating event). In such situation the system behaviour is non-deterministic and a given scenario can move the system to either unsafe or safe state. We avoid this problem by introducing additional events, called error events, which explicitly trigger faulty transitions. A faulty transition is fired only when the corresponding error event occurs. The error events are generated during the reachability analysis process to cover all relevant situations of fault occurrence. Then, if a given fault contributes to the reachability of an unsafe state the corresponding TRC file will contain the trace of the corresponding error event.
Step 2: Implantation

The problem of implantation is described in more detail in Górski et al (1997). In general terms, implantation aims at defining the way of incorporating the monitor into the actual application. The set of events occurring in the application is split into measurable and unmeasurable events. We assume that all events exchanged through the control_system-plant interface and all error events are measurable. Consequently, the following events in the gas burner application are measurable: Heatingon, Heatingoff, Sygignon, Syggason, Syggasoff, Cold, Hot and Error_hot (the event representing the fault of the temperature sensor). In this step all previously captured TRC files are filtered in such a way that only the measurable events, with the timing relations between them, remain (i.e. the hazard scenarios are expressed in terms of measurable events only). Figure 4 presents a hazard scenario obtained from a TRC file representing the situation of frequent alternation between heating on and heating off. It is interpreted as follows: the UNSAFE state is entered if, 1 time unit after the start of the system, the Heatingon event occurs and then, 1 time unit later, Syggason occurs and then, 9 time units later, Heatingoff occurs and then, 1 time unit later, the events Syggasoff and Heatingon occur simultaneously, and so on. Finally, 10 time units after the last Syggason event the system moves to the UNSAFE state.

START 1 HEATINGON 1 SYGGASON 9 HEATINGOFF 1 SYGGASOFF HEATINGON 9 HEATINGOFF 1 SYGGASOFF HEATINGON 1 SYGGASON 10 UNSAFE

Figure 4: Hazard scenario covering frequent alternation between heating on and off
1 HEATINGON
1 SYGGASON
9 SYGIGNON
5 HOT
ERROR
HOT
25 U N S A F E
Figure 5- Hazard scenario covering the temperature sensor error
Step 3: Monitor definition In this step the statechart model of the monitor is built. First, each hazard scenario is translated to the corresponding statechart model and then the compound model consisting of all hazard scenario models is developed. For each hazard scenario, we define a state machine model which accepts only the sequence of events belonging to the scenario and refuses any other, Hopcroft et al (1979). The events arriving in the order determined by the given hazard scenario drive the model and cause that the current state changes until it reaches the final state called here ALARM. Reaching ALARM means that whole hazard scenario actually happened and that safety has been violated. Figure 6 presents the model of the hazard scenario of Figure 4. The solid line transitions represent the acceptance of a subsequent event from the scenario. Initially, the model is in its initial state (s 1). If Heatingon event occurs it moves to s2. Then if after 1 time unit after entering s2 (event tm(1)) the Syggason event occurs the model moves to s3. Here the model stays 9 time units and then if Heatingoff occurs it moves to s4. It goes forward like this and eventually reaches the ALARM state. According to the semantics of statecharts if while being in a state, an unexpected event occurs or the expected event does not occur in the required time moment, the current state remains unchanged. This semantics is different from what we require: whenever something unexpected happens the current hazard
414
J. G6rski and B. Nowicki
scenario should be cancelled and the whole process should start from the beginning. In order to solve this problem we introduced so called quit transition which are represented by dotted lines. The quit transitions are fired by events which are not expected in a given state. For instance in Figure 6, while being in s l, any event different than Heatingon (represented by the label ALL\Heatingon) causes that the model returns to the initial state (which is s l itself- in statecharts linking the transition to the contour of the superstate is equal to linking this transition to the initial state). Similarly, if while being in s2 Syggason does not occur during 1 time unit or any other event occurs the model moves to the initial state.
ii
i
tm(1) and tm(9) and H e a t i n g o n r . ~ = . , I S y g g a s o n ~,.,~=.,~ H e a t i n g o f f t-- 3 ~t-J ~ LL \ ' tm(2) o r _' tm(10) or eatingon : ALL \ : ALL \ ~, S y g g a s o n ~ Heatingoff
o o o
ALAR : ALL V
Figure 6: Model for the hazard from Figure 4 The statechart models representing individual hazard scenarios are defined in a straightforward way. The safety monitor model is then defined as a concurrent composition of models of the individual scenarios. In the Figure 7 the safety monitor model of the gas burner system is presented. The state HAZARD SCENARIOS comprise of many substates separated by dotted lines which represent concurrent composition. Each concurrent state contains a single model of hazard scenario. For simplicity only two hazard scenarios (belonging to two considered classes) are included. The monitor raises an ALARM (moves to ALARM state) when any of the related hazard scenarios moves to the ALARMi state.
/" l--
4k tm(5) and
HS2
;
;
Heatingon ~Error [q~,,1Lsl; : ooo is2" = ~
ALL \ Heatingon L
__ .
.
L2nI--- ~-.'-
-
•-sn;
tm(6) or ALL \ {Hot, .
.
ooo
2"o-'-"2
-
;
--- "--" --' "~m(1T'an'd" ~ ' 1 " HeatingonF--~,.~ S y g g a s o n
:ALL\ .
:Lsz ;
Heatingon
' trn(2)or :ALL\
tm(25~ ALL
-
"
_~.._~
v',-l~,
.
&' Hot and ; hot ~
: ooo
--
,-~--ALL :
~
1
/
/
I
Figure 7" The monitor model The monitor model presented above can then undergo further optimisation aiming at reduction of its size. This however is outside the scope of this paper.
Step 4: Tuning

In its present state, the monitor detects hazards instead of predicting them. Tuning aims at building the predictive facility into the monitor. In the gas burner example there is no initialisation procedure, so there is no point in cutting off an initial subsequence of the hazard scenarios. On the other hand, to achieve early warning, a final subsequence must be cut off. The decision on what should be removed is application dependent. In the considered example, the ALARM states in the hazard scenario models could be moved backwards by, e.g., 10 time units. In this way we leave some time between the alarm signal and the actual hazard occurrence, assuming that this is enough to perform an appropriate corrective action (e.g. close the valve and start the ventilation process).
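Under the scenario encoding of the earlier sketches, this tuning step amounts to shortening the final wait-for-hazard step; the helper below is an illustration of that idea, not part of the authors' tooling.

def tune(scenario, early_warning):
    """Return a shortened scenario whose full acceptance raises the
    ALARM 'early_warning' time units before the hazard state."""
    delay_to_hazard, _ = scenario[-1]
    assert early_warning <= delay_to_hazard, "cannot warn that early"
    trimmed = scenario[:-1]
    # Replace the final wait-for-hazard step with a shorter timeout
    # step expecting no events (accepted when the time elapses).
    trimmed.append((delay_to_hazard - early_warning, frozenset()))
    return trimmed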
CONCLUSIONS

This paper gives a systematic method for developing a safety monitor using the object-oriented model of a given application. Hazard scenarios are first identified through simulation. They are expressed in terms of measurable events, which in turn allows the resulting monitor to be linked to the actual application. The hazard scenarios are then translated into statechart models and grouped into one monitor model. The final activity of tuning aims at ensuring the appropriate sensitivity of the monitor device. An advantage of our method is that it can be used either before or after system deployment. The structure of the monitor is such that, as new experience with the system becomes available, new hazard scenarios can be easily incorporated. Although the monitor model looks complicated, it is built according to a simple template, so despite its size the model should be considered simple. In the case of large systems the number of different hazard scenarios can grow and their direct 'enumeration' can become unmanageable. This is particularly true if we aim at including all possible faults and their potential influence on system hazards. To overcome this problem the authors have proposed another approach, Górski et al (1997), where the monitor does not concentrate on particular faults and their contribution to safety violation but rather observes more general system characteristics indicating hazard occurrence. Consequently, regardless of the cause of the hazardous behaviour, the monitor raises an alarm when safety is about to be violated. The approach presented in this paper provides the 'ideological foundation' of safety monitoring, while the approach of Górski et al (1997) provides a pragmatic solution. The method has been successfully applied to several small case studies. The application of the method to an industrial case study (an extra-high-voltage substation) is presently under development.
ACKNOWLEDGEMENT
The authors would like to acknowledge the support of the EU Copernicus ISAT (Integration of Safety Analysis Techniques) project.
REFERENCES
Górski, J. (1989). Deriving Safety Monitors from Formal Specifications. Proceedings of Safety of Computer Control Systems SAFECOMP'89, Vienna (Austria), 123-128.
Górski, J., Nowicki, B. (1995). Object-Oriented Approach to Safety Analysis. Proceedings of the 1st Annual ENCRESS Conference, Bruges (Belgium), 338-350.
Górski, J., Nowicki, B. (1996). Safety Analysis Based on Object-Oriented Modelling of Critical Systems. Proceedings of the 15th International Conference on Safety, Reliability and Security SAFECOMP'96, Vienna (Austria), 46-60.
Górski, J., Nowicki, B. (1997). Object Oriented Model Based Safety Monitor Synthesis. Paper accepted to ENCRESS'97.
Harel, D. (1987). Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming 8, 231-274.
Holzmann, G.J. (1991). Design and Validation of Computer Protocols, Prentice Hall.
Hopcroft, J.E., Ullman, J.D. (1979). Introduction to Automata Theory, Languages and Computation, Addison-Wesley.
International Electrotechnical Commission (1994). IEC 1508 Functional Safety: Safety-Related Systems.
i-Logix (1996). STATEMATE - Technical Documentation.
Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., Lorensen, W. (1991). Object-Oriented Modelling and Design, Prentice Hall Int.
DEVELOPMENT OF A SYSTEM FOR A RULE-DRIVEN ANALYSIS OF SAFETY CRITICAL SOFTWARE
Horst Miedl
Institut für Sicherheitstechnologie (ISTec) GmbH, Forschungsgelände, D-85748 Garching, Germany
ABSTRACT
This work describes a system for the rule-driven analysis (REVEAL) of safety critical software and the development of a tool performing that task. REVEAL aids in the assessment of high level language software. It is based on a static analysis of the control flow and data flow of the source code. One aim of REVEAL is to check automatically - whenever possible - whether the source code adheres to the coding rules of specific guidelines and standards.
KEYWORDS
Safety critical computer systems, high level language, I&C function, software reliability, software qualification, software analysis, static analysis, control flow, data flow, software metrics
INTRODUCTION
Following the technological development in the I&C field, analogue techniques are more and more being replaced by digital systems in almost every industry. Surveillance and recording tasks in nuclear power plants are already taken over by computerised systems. Digital systems are even used in safety relevant applications, where the effort for their qualification is very high. An essential feature of digital systems is the implementation of the required functionality by software. The process of software development is often characterised by the automatic generation of high level language source code from formal, application-oriented specification techniques. As the determination of ultrahigh reliability figures for safety critical software is hardly possible, the national and international guidelines and standards are mainly based on the qualitative evaluation of software. This means that these guidelines and standards present plenty of requirements on properties which the software should or should not have. The proof of all these requirements may be extensive. Some of those requirements are based on syntactical or notational properties that can be measured and collected automatically by a software analysis tool. There are many "general-purpose" software analysis tools, both static and dynamic, which help in analyzing the source code, e.g., by presenting the control flow graph or by performing
data flow analysis. However, they are not designed to assess adherence to the specific requirements of guidelines and standards in the nuclear field. Against the background of the development of I&C systems in the nuclear field which are based on digital techniques and implemented in high level languages, it is essential that the assessor or licensee has a tool with which he can automatically and uniformly qualify the high level language software. Therefore, a project has been initiated at the Federal Ministry of Education, Science, Research and Technology to cover these aspects. Thus, one aim of REVEAL is the proof of conformance of the source code with given requirements as stated, for example, in international standards. This will relieve the assessor or licensee of tedious and error prone work and guarantee that the assessment is objective and uniformly performed with acceptable effort. Not only the measurement of software reliability poses problems, but also the measurement of other quality attributes like usability, maintainability, portability, robustness, etc. Here REVEAL aims at collecting source code metrics that are considered to be correlated with quality attributes. Furthermore, REVEAL builds the call graph hierarchy and performs control flow and data flow analysis. Summarising, it can be said that REVEAL will assist in the qualification of digital systems to be used in safety relevant applications, where the qualification of software is of special importance.
SYSTEM OVERVIEW
To make REVEAL applicable to high level languages in general, the analysis kernel works on the control flow, data flow and software metrics information first extracted from the high level language source code. The following diagram sketches the structure of REVEAL:
[Diagram: a front-end extracts data flow, control flow and software metrics information from the high level language source code; this information, together with the guidelines and standards, is input to the analysis kernel.]
On the right hand side of the diagram the collection of specific requirements as stated in the guidelines and standards is shown. These input components have to be designed flexibly, allowing the user to select project specific standards from the whole set of standards. In addition, the integration of new standards into the whole set of standards, or the removal of obsolete ones, requires the flexibility of this component. The analysis kernel consists of the following main modules:
- control flow analysis
- call graph analysis
- data flow analysis
- rule-driven analysis (i.e., conformance with guidelines and standards)
- software metrics analysis
These analyses will support an assessor of safety critical software in:
- understanding the high level source code,
- proving the functionality represented by the source code against the specification,
- detecting violations of standards and guidelines for coding,
- defining test strategies and test cases,
- identifying source code metrics for software attributes that may serve as acceptance or non-acceptance criteria.
Development of Front-ends
REVEAL is conceived in such a way that a common processing and analysis kernel can be used for different programming languages. At present these languages are ANSI-C and FORTRAN-90, which cover the programming languages mainly used in the I&C field. The front-ends take over the language-dependent part of the analysis; that means they process the source code following a certain set of rules and pass it to the analysis kernel using a common interface. Furthermore, they have to recognise, collect and store basic metrics in a common database. To minimise the effort for the front-ends, PC based versions of the standard UNIX tools for lexical analysis and parser generation, LEX and YACC, are used.
Definition of the Interface between the Front-ends and the Analysis Kernel
The output of the front-ends is stored in a common format which also considers the needs of the analysis kernel. The output includes:
a) an extended cross-reference list to make the data flow analysis possible, i.e., a description of every identifier of the source code with its place of declaration (line number), the type of identifier (pointer, function, "simple variable", ...), the data type information, its nesting level, the module name it belongs to, and its dimension, if it is an identifier of a compound data type like an array or pointer. Moreover, the definition and reference of each identifier is captured by the front-end of REVEAL, collecting the following information (a sketch of such records is given at the end of this subsection):
- the line number,
- the type of operation (e.g., initialization, function call, assignment, logical operation, ...), and
- the nesting level.
A major problem here is the treatment of identifiers in compound data types, e.g.:
a.b = c
Evidently the identifier b is defined in that expression. But REVEAL also has to keep the information that this b is part of a specific compound data type, i.e., REVEAL has to analyse and process the "prefix" of the expression to extract and store the information that b is the component of a structure. Otherwise it could be confused with an identifier of the same name declared as a simple data type. Similar problems exist for the right hand side of expressions (reference part) whenever identifiers are part of compound data types.
b) a graphical description of the source code for the control flow and call graph analysis. The call graph information consists of the module names (i.e., functions and procedures of the source code program), their number of nodes (i.e., number of statements) and their sequence within the source code file. The hierarchy of the modules, that means the connections between them, is represented by the following relation:
<calling module> <called module>
E.g.:
main f1
main f2
f1 f2
That means that the module "main" calls the modules "f1" and "f2", and "f2" is also called by "f1". The flow graph information consists of the node list (i.e., a list of the line numbers of the statements of a module) and a relation between these nodes, which represents the edges of the control flow graph. If there is more than one statement per line, the line number is extended by a character suffix to differentiate between the statements. E.g.:
Node list: 1, 2a, 2b, 3, 4
Edges: (1, 2a), (2a, 2b), (2b, 3), (3, 4)
The example shows a simple linear sequence of statements whereby two statements exist in line 2. A sketch of how these relations can be read back into graphs is given below.
c) basic metrics from the source code. The number and type of metrics to collect depend on the necessities that arise from the analysis of the coding requirements as stated in the relevant guidelines and standards. Besides that, the analysis of the metrics correlated with quality attributes also feeds back on the collection of metrics.
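As announced above, the two textual formats can be read back into graph structures roughly as follows; this is a sketch only, and the function names are assumptions, not part of REVEAL.

    def parse_call_graph(lines):
        """Read '<calling module> <called module>' pairs into an adjacency map."""
        graph = {}
        for line in lines:
            caller, callee = line.split()
            graph.setdefault(caller, []).append(callee)
            graph.setdefault(callee, [])
        return graph

    cg = parse_call_graph(["main f1", "main f2", "f1 f2"])
    # -> {'main': ['f1', 'f2'], 'f1': ['f2'], 'f2': []}

    def parse_flow_graph(nodes, edges):
        """Node names such as '2a'/'2b' distinguish statements sharing a line."""
        succ = {n: [] for n in nodes}
        for a, b in edges:
            succ[a].append(b)
        return succ

    fg = parse_flow_graph(["1", "2a", "2b", "3", "4"],
                          [("1", "2a"), ("2a", "2b"), ("2b", "3"), ("3", "4")])
    # -> a linear chain 1 -> 2a -> 2b -> 3 -> 4, as in the example above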
A database system is used for the interface between the front-ends and the analysis kernel. This also facilitates later enhancements of the database if, for example, new guidelines or standards arise. The same database also contains the coding rules extracted from the guidelines and standards.
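As a sketch of the cross-reference records of item a) above, the stored identifier information might be represented as follows; the class and field names are illustrative assumptions, not REVEAL's actual format.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Occurrence:
        """One definition or reference of an identifier."""
        line: int            # line number in the source file
        operation: str       # "initialization", "function call", "assignment", ...
        nesting_level: int

    @dataclass
    class Identifier:
        name: str
        decl_line: int       # line number of the declaration
        kind: str            # "pointer", "function", "simple variable", ...
        data_type: str       # data type information, e.g. "int", "struct s"
        nesting_level: int
        module: str          # module (function/procedure) it belongs to
        dimension: Optional[int] = None  # for compound types such as arrays
        definitions: List[Occurrence] = field(default_factory=list)
        references: List[Occurrence] = field(default_factory=list)

    # e.g. the structure component b, defined by "a.b = c" in line 12
    b = Identifier("a.b", decl_line=3, kind="simple variable", data_type="int",
                   nesting_level=1, module="main")
    b.definitions.append(Occurrence(line=12, operation="assignment", nesting_level=1))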
Development of the Analysis Kernel
The analysis kernel produces results which:
- present the macro structure of the source code (call graph of modules),
- show the micro structure of the source code (control flow within modules),
- identify data dependencies (data flow),
- form the basis for higher level metrics (software metrics),
- prove the conformity of the source code with specific requirements as stated in the relevant guidelines and standards (rule-driven analysis).
That is performed by static analysis, where static means that the code does not need to be executed. The call graph of the source code describes how the modules interact with each other. For each module a structural analysis of its control flow graph is made. The structural analysis identifies the code elements of the module (branches, loops, ...) including their nesting. For data analysis, it is necessary to take into account all the paths in the source code and establish the effects of data manipulations along all these paths. Such an analysis is done by reaching definitions analysis (RDA). The RDA is carried out module-wise. It combines the information about variable references and definitions with information about the control flow graph of a module obtained during structural analysis. The basic idea of RDA is to treat definitions of variables like individuals that travel in a control flow graph and then to see where they can reach. Knowing the variables referenced in a node, and all definitions reaching that node, it is possible to determine whether any reference uses an uninitialised value. Similarly, unused definitions can be found. (A sketch of this analysis is given below.) REVEAL collects basic metrics to calculate higher level metrics. The higher level metrics shall serve as measures for quality attributes like readability, portability, etc. Finally, REVEAL has to prove the conformity with specific requirements as stated in the relevant rules and standards. Below are a few example requirements (taken from IEC 880 and IEC 1508):
R1. The number of input and output parameters of a module should be limited to a minimum.
R2. Limited use of pointers.
R3. A module should have only one entry. Single exits are recommended.
Requirements R1 and R2 do not have an explicit threshold value. That is often the case with requirements from the guidelines and standards; mostly they are formulated in a qualitative manner. Thus, some default values must be defined for those requirements to enable the comparison with the metric values gathered from the source code. These default values may also be language dependent and therefore have to be changeable by the user. Requirement R3 is quite easy to check: one only has to extract the number of entry and exit points for each module of the source code. To make REVEAL flexible it is foreseen that the user can incorporate his "own" guidelines and standards. That also implies the possibility to select a certain set of standards, and even single requirements from the whole set of standards, to be applied to the source code.
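The following is a minimal sketch of such a reaching definitions analysis, iterating to a fixpoint over the control flow graph; the representation (definitions as (variable, node) pairs) is an assumption for illustration, not REVEAL's implementation.

    def reaching_definitions(succ, gen, kill):
        """succ: node -> successors; gen/kill: node -> sets of (var, node) defs."""
        preds = {n: [] for n in succ}
        for n, ss in succ.items():
            for s in ss:
                preds[s].append(n)
        in_, out = {n: set() for n in succ}, {n: set() for n in succ}
        work = list(succ)
        while work:                          # iterate until a fixpoint is reached
            n = work.pop()
            in_[n] = set().union(*(out[p] for p in preds[n]))
            new_out = gen[n] | (in_[n] - kill[n])
            if new_out != out[n]:
                out[n] = new_out
                work.extend(succ[n])
        return in_

    def uninitialised_uses(in_, refs):
        """A reference to v reached by no definition of v uses an uninitialised value."""
        return [(n, v) for n, vs in refs.items() for v in vs
                if not any(d[0] == v for d in in_[n])]

    # x is defined in node 1 and referenced in node 2; y is referenced undefined
    succ = {1: [2], 2: []}
    gen, kill = {1: {("x", 1)}, 2: set()}, {1: set(), 2: set()}
    print(uninitialised_uses(reaching_definitions(succ, gen, kill),
                             {2: ["x", "y"]}))   # -> [(2, 'y')]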
An analysis of the coding requirements from different guidelines and standards found no contradictions between them, but there are sometimes differences in rigor (e.g., prohibited, avoided, limited, ...).
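A sketch of how such rules, their rigor levels and user-changeable default thresholds might be kept configurable is given below; the rule identifiers, defaults and metric names are assumptions for illustration only.

    # default thresholds for qualitatively formulated requirements (R1, R2)
    # and the rigor with which each requirement is stated in the standards
    RULES = {
        "R1_io_parameters": {"rigor": "limited",    "max": 5},
        "R2_pointer_uses":  {"rigor": "avoided",    "max": 3},
        "R3_extra_entries": {"rigor": "prohibited", "max": 0},  # entries beyond one
    }

    def check_module(metrics, overrides=None):
        """Compare measured module metrics against (possibly overridden) limits."""
        findings = []
        for rule, cfg in RULES.items():
            limit = (overrides or {}).get(rule, cfg["max"])
            if metrics.get(rule, 0) > limit:
                findings.append((rule, cfg["rigor"], metrics[rule], limit))
        return findings

    print(check_module({"R1_io_parameters": 8, "R3_extra_entries": 0}))
    # -> [('R1_io_parameters', 'limited', 8, 5)]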
Development of the User Interface
To interact with REVEAL, the user, i.e., the assessor or licensee, needs an interface that allows him to select the requirements from the guidelines and standards to be checked and the analyses (e.g., data flow analysis) to be performed. For that purpose a graphical user interface is used.
CONCLUDING REMARKS
The front-ends for the programming languages ANSI-C and FORTRAN-90 have been developed. The development of the ANSI-C front-end is nearly complete. It fully produces the control flow information as an abstract representation of a graph. Apart from some specific aspects regarding arrays and pointers, it also produces the data flow information for an ANSI-C program. The data flow information consists of a variable description comprising several specific lists (tables for variables, parameters, arrays, functions, structures and types) and a cross-reference list describing the usage of these variables in the ANSI-C program. For FORTRAN-90, those parts of the front-end that extract the control flow information have been developed. This was carried out based on a restrictively used FORTRAN-90 grammar. Regarding the control flow analysis, a prototype already exists which identifies loops and single-entry/single-exit sections of the control flow graph, isolates the corresponding subgraphs and lists the paths in these subgraphs. Input of the control flow module is an ASCII file describing the control flow graph in a formal way; it is extracted from the source code based on the same concept that is used by the front-ends for ANSI-C and FORTRAN-90 to represent the control flow graphs. Concerning the reaching definitions analysis as part of the data flow analysis, the theoretical basis already exists and now has to be implemented. A large number of guidelines and standards have been investigated as to the degree to which they contain requirements or recommendations on coding. The general impression was, however, that few of them define coding requirements in a form suitable to be checked automatically by a software analysis tool. Those that can be checked automatically have been extracted and now have to be correlated to the corresponding source code metrics.
B1: PSA Applications
INDIVIDUAL PLANT EXAMINATIONS: WHAT PERSPECTIVES CAN BE DRAWN?
M.T. Drouin 1, A.L. Camp 2, J. Lehner 3, T. Pratt 3, J. Forester 2
1 U.S. Nuclear Regulatory Commission, Washington D.C. 20555 USA
2 Sandia National Laboratories, Albuquerque, NM 87185 USA
3 Brookhaven National Laboratory, Upton, NY 11973 USA
ABSTRACT
The U.S. Nuclear Regulatory Commission (NRC) issued Generic Letter (GL) 88-20 in November 1988, requesting that all licensees perform an Individual Plant Examination (IPE) "to identify any plant-specific vulnerabilities to severe accidents and report the results to the Commission." The purpose and scope of the IPE effort includes examining internal events occurring at full power, including those initiated by internal flooding. In response, the staff received 75 IPE submittals covering 108 nuclear power plant units. The staff then examined the IPE submittals to determine what the collective IPE results imply about the safety of U.S. nuclear power plants and how the IPE program has affected reactor safety. This paper summarizes the results of the IPE Insights Program examination.
KEYWORDS
severe accident, core damage frequency, IPE, vulnerability, containment performance, human reliability, safety goal
IMPACT OF THE IPE PROGRAM ON REACTOR SAFETY
The primary goal of the IPE Program was for licensees to "identify plant-specific vulnerabilities to severe accidents that could be fixed with low-cost improvements." However, GL 88-20 did not specifically define what constitutes a vulnerability; hence, there is considerable diversity in the criteria used to define a vulnerability. In addition, it is not always clear whether a licensee is identifying a finding as a "vulnerability" or as some other issue worthy of attention. Therefore, a problem considered to be a vulnerability at one plant may not have been specifically identified as a vulnerability at another plant. In fact, less than half of the licensees actually identified "vulnerabilities" in their IPE submittals; however, nearly all of the licensees identified other areas warranting investigation for potential improvements. Thus, the IPE program has served as a catalyst for further improving the overall safety of nuclear power plants. Only four licensees with boiling water reactor (BWR) plants and 15 licensees with pressurized water reactor (PWR) plants explicitly stated that their plants had vulnerabilities. Although no common vulnerabilities were identified, the following vulnerabilities can be considered applicable to many BWRs:
• failure of water supplies to isolation condensers
• failure to maintain high-pressure coolant injection systems when residual heat removal has failed
• failure to control low-pressure injection during an anticipated transient without scram (ATWS)
• drywell steel shell melt-through as a Mark I containment issue
Similarly, the following vulnerabilities can be considered applicable to many PWRs:
• loss of reactor coolant pump (RCP) seals leading to a loss of coolant accident (LOCA)
• design and maintenance problems that reduce turbine-driven auxiliary feedwater pump reliability
• internal flooding caused by component failures
• failure of the operator to switch over from the coolant injection phase to the recirculation phase
• loss of critical switchgear ventilation equipment leading to loss of emergency buses
• need to enhance operator guidance for depressurization during steam generator tube ruptures
• inadequate surveillance of specific valves leading to interfacing system LOCAs
• loss of specific electrical buses
• compressed air system failures
• inability to cross-tie buses during loss of power conditions
In addition, almost all of the licensees identified plant improvements to address perceived weaknesses in plant design or operation. (Over 500 proposed improvements were identified by the plants.) Most of these plant improvements are classified as procedural/operational changes, design/hardware changes, or both. Few of the improvements involve maintenance-related changes. Typically, the procedural or design changes entail revised training in order to properly implement the actual change. The specific improvements vary from plant to plant. However, numerous improvements that had a significant impact on plant safety include changes to AC and DC power, coolant injection systems, decay heat removal systems, heating, ventilating and air conditioning, and PWR RCP seals.
CORE DAMAGE FREQUENCY (REACTOR DESIGN) PERSPECTIVES
In many ways, the IPE results are consistent with the results of previous NRC and industry risk studies. The IPE results indicate that the plant core damage frequency (CDF) is often determined by many different sequences (in combination), rather than being dominated by a single sequence or failure mechanism. The largest contributors to plant CDF and the dominant failures contributing to those sequences vary considerably among the plants (e.g., some are dominated by LOCAs, while others are dominated by station blackout). However, for most plants, support systems are important to the results because support system failures can result in failures of multiple front-line systems. Further, the support system designs and the dependency of front-line systems on support systems vary considerably among the plants. That variation explains much of the variability observed in the IPE results. The CDFs reported in the IPE submittals are lower, on average, for BWR plants than for PWR plants, as shown in Figure 1. Although both BWR and PWR results are strongly affected by the support system considerations discussed above, a few key differences between the two types of plants contribute to this tendency for lower BWR CDFs and cause a difference in the relative contributions of the accident sequences to plant CDF. The most significant difference is that BWRs have more injection systems than PWRs and can depressurize more easily to use low-pressure injection (LPI) systems. This gives BWRs a lower average contribution from LOCAs. However, the results for individual plants can vary from this general trend. As shown in Figure 1, the CDFs for many BWR plants are actually higher than the CDFs for many PWR plants. The variation in the CDFs is primarily driven by a combination of the following factors:
• plant design differences (primarily in support systems such as cooling water, electrical power, ventilation, and air systems)
• variability in modeling assumptions (including whether the models accounted for alternative accident mitigating systems)
• differences in data values (including human error probabilities) used in quantifying the models
[Figure 1: scatter plot of CDFs, 1E-8 to 1E-3 per reactor-year, for BWRs and PWRs]
Figure 1: Summary of BWR and PWR CDFs as reported in the IPEs.
Table 1 summarizes the key observations regarding the importance and variability of accident classes commonly modeled and discussed in the IPEs.
TABLE 1
OVERVIEW OF KEY CDF OBSERVATIONS
Key Observations By Accident Class
Transients (other than station blackouts and ATWS) --
Important contributor for most plants because of reliance on support systems; failure of such systems can defeat redundancy in front-line systems
Both plant-specific design differences and IPE modeling assumptions contribute to variability in results:
• use of alternative systems for injection at BWRs
• variability in the probability that an operator will fail to depressurize the vessel for LPI in BWRs
• availability of an isolation condenser in older BWRs for sequences with loss of decay heat removal (DHR)
• susceptibility to harsh environments affecting the availability of coolant injection following loss of DHR
• capability to use feed-and-bleed cooling for PWRs
• susceptibility to RCP seal LOCAs for PWRs
• ability to depressurize the reactor coolant system in PWRs, affecting the ability to use LPI
• ability to cross-tie systems to provide additional redundancy
Station blackouts --
Significant contributor for most plants, with variability driven by:
• number of redundant and diverse emergency AC power sources
• availability of alternative offsite power sources
• length of battery life
• availability of firewater as a diverse injection system for BWRs
• susceptibility to RCP seal LOCAs for PWRs
ATWS --
Normally a low contributor to plant CDF because of the reliable scram function and successful operator responses
BWR variability mostly driven by modeling of human errors and availability of an alternative boron injection system
PWR variability mostly driven by plant operating characteristics, IPE modeling assumptions, and assessment of the fraction of time the plant has an unfavorable moderator temperature coefficient
Internal Flood --
Small contributor for most plants because of the separation of systems and compartmentalization in the reactor building, but significant for some because of plant-specific designs
Largest contributors involve service water breaks
LOCAs (other than interfacing system LOCAs (ISLOCAs) and SGTRs) --
Significant contributors for many PWRs with manual switchover to emergency core cooling recirculation mode
BWRs generally have lower LOCA CDFs than PWRs for the following reasons:
• BWRs have more injection systems
• BWRs can more readily depressurize to use low-pressure systems
ISLOCAs --
Small contributor to plant CDF for BWRs and PWRs because of the low frequency of the initiator
Higher relative contribution to early release frequency for PWRs than BWRs because of the low early failure frequency from other causes for PWRs
SGTR --
Normally a small contributor to CDF for PWRs because of opportunities for the operator to isolate a break and terminate an accident, but an important contributor to early release frequency
CONTAINMENT PERFORMANCE (CONTAINMENT DESIGN) PERSPECTIVES
For the most part, when the accident progression analyses in the IPEs are viewed globally, they are consistent with typical containment performance analyses. Failure mechanisms identified in the past as being important are also shown to be important in the IPEs. In general, the IPEs confirmed that the large volume PWR containments are more robust than the smaller BWR pressure suppression containments in meeting the challenges of severe accidents. Because of the risk importance of early releases, the containment performance analysis descriptions found in the IPE submittals emphasized the phenomena, mechanisms, and accident scenarios that can lead to such releases. These involve early structural failure of the containment, containment bypass, containment isolation failures and, for some BWR plants, deliberate venting of the containment.
As a group, the large dry PWR containments analyzed in the IPEs have significantly smaller conditional probabilities of early structural failure (given core melt) than the BWR pressure suppression containments analyzed. Nonetheless, containment bypass and isolation failures are generally more significant for the PWR containments. As seen in Figure 2, however, these general trends are often not true for individual IPEs because of the considerable range in the results. For instance, conditional containment failure probabilities (CCFPs) for both early and late containment failure for a number of large dry PWR containments are higher than those reported for some of the BWR pressure suppression containments.
[Figure 2: scatter plot of conditional probabilities, 0.0 to 1.0, of bypass, early failure and late failure for PWRs and BWRs]
Figure 2: Summary of CCFPs for BWRs and PWRs as reported in the IPEs.
The results for BWRs, grouped by containment type, follow expected trends and indicate that, in general, Mark I containments are more likely to fail during a severe accident than the later Mark II and Mark III designs. However, the ranges of predicted failure probabilities are quite high for all BWR containment designs and there is significant overlap of the results, given core damage. A large variability also exists in the contributions of the different failure modes for each BWR containment group. However, plants in all three BWR containment groups found a significant probability of early or late structural failure, given core damage. The containment performance results for PWRs indicate that most of the containments have relatively low conditional probabilities of early failure, although a large variability exists in the contributions of the different failure modes for both large dry and ice condenser containments. The results presented in the IPE submittals are consistent with previous studies regarding radionuclide release. The containment failure modes identified as resulting in an early release of radionuclides to the environment are containment bypass, isolation failure, and early containment structural failure. In BWR pressure suppression containments, early venting also leads to an early release in a few cases. A significant early release is of particular concern because of the potential for severe consequences as a result of the short time allowed for radioactive decay and natural deposition, as well as for accident response actions (such as evacuation of the population in the vicinity of the plant). What is considered to be a significant release varies among the licensees. For many, a significant release includes instances involving a release fraction of volatile radionuclides equal to or greater than ten percent of core inventory.
Using this definition, the reported conditional probability of significant early release varies from less than 0.01 to 0.5 for the BWR IPEs and from less than 0.01 to 0.3 for the PWR IPEs. In the BWR IPEs, significant early releases are almost exclusively caused by early containment failure, while containment bypass (especially SGTR) plays an important role in the reported PWR releases. Table 2 summarizes key observations regarding containment performance.
TABLE 2
OVERVIEW OF KEY CONTAINMENT PERFORMANCE OBSERVATIONS
Key Observations By Containment Failure Mode
Early Failure --
On average, the large volume containments of PWRs are less likely to have early structural failures than the smaller BWR pressure suppression containments
Overpressure failures (primarily from ATWS), fuel coolant interaction, and direct impingement of core debris on the containment boundary are important contributors to early failure for BWR containments
The higher early structural failures of BWR Mark I containments versus the later BWR containments are driven to a large extent by drywell shell melt-through*
In a few BWR analyses, early venting contributes to early releases
Phenomena associated with high-pressure melt ejection are the leading causes of early failure for PWR containments*
Isolation failures are significant in a number of large, dry and subatmospheric containments
The low early failure frequencies for ice condensers relative to the other PWRs appear to be driven by analysis assumptions rather than plant features
For both BWR and PWR plants, specific design features lead to a number of unique and significant containment failure modes
Bypass --
Probability of bypass is generally higher in PWRs, in part, because of the use of steam generators, and because the greater pressure differential between the primary and secondary systems may increase the likelihood of an ISLOCA
Bypass, especially SGTR, is an important contributor to early release for PWR containment types
Bypass is generally not important for BWRs
Late Failure --
Overpressurization when containment heat removal is lost is the primary cause of late failure in most PWR and some BWR containments
High pressure and temperature loads caused by core-concrete interactions are important for late failure in BWR containments
Containment venting is important for avoiding late uncontrolled failure in some Mark I containments
The larger volumes of the Mark III containments (relative to Mark I and Mark II containments) are partly responsible for their lower late failure probabilities in comparison to the other BWR containments
The likelihood of late failure often depends on the mission times assumed in the analysis
*There has been a significant change in the state of knowledge regarding some severe accident phenomena in the time since the IPE analyses were performed.
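As a worked illustration of this classification (with invented numbers, not data from any IPE): the conditional probability of significant early release is the frequency of early release sequences whose volatile release fraction reaches ten percent of core inventory, divided by the CDF.

    # (frequency per ry, volatile release fraction, early release?)
    sequences = [
        (4.0e-6, 0.20, True),    # early containment structural failure
        (1.0e-6, 0.12, True),    # bypass, e.g. SGTR
        (2.0e-5, 0.02, False),   # late overpressure failure
    ]
    cdf = 5.0e-5                 # total core damage frequency (illustrative)

    ser = sum(f for f, rf, early in sequences if early and rf >= 0.10)
    print(f"conditional probability of significant early release: {ser/cdf:.2f}")
    # -> 0.10, within the 0.01-0.5 range reported for the BWR IPEs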
HUMAN ACTION (OPERATIONAL) PERSPECTIVES
Only a few specific human actions are consistently important for either BWRs or PWRs as reported in the IPEs. For BWRs, the actions include manual depressurization of the vessel, initiation of standby liquid control during an ATWS, containment venting, and alignment of containment or suppression pool cooling. Manual depressurization of the vessel is more important than expected, because most plant operators are directed by the emergency operating procedures to inhibit the automatic depressurization system (ADS) and, when ADS is inhibited, the operator must manually depressurize the vessel. Only three human actions are important in more than 50% of the IPE submittals for PWRs. These include the switchover to recirculation during LOCAs, initiation of feed-and-bleed, and the actions associated with depressurization and cooldown. Plant-specific features, such as the size of the refueling water storage tank and the degree of automation of the switchover to recirculation, are key in determining the importance of these actions. While the IPE results indicate that human error can be a significant contributor to CDF, in most cases there is little evidence that the human reliability analysis (HRA) quantification method per se has a major impact on the results. Nevertheless, numerous factors influence the quantification of human error probabilities (HEPs) and introduce significant variability in the resulting HEPs, even for essentially identical actions. General categories of such factors include plant characteristics, modeling details, sequence-specific attributes (e.g., patterns of successes and failures in a given sequence), dependencies, the performance shaping factors modeled, application of the HRA method (correctness and thoroughness), and the biases of both the analysts performing the HRA and the plant personnel from whom selected information and judgments are obtained. Although most of these factors introduce appropriate variability in the results (i.e., the derived HEPs reflect "real" differences such as time availability and scenario-specific factors), several have the potential to cause invalid variability. In order to examine the extent to which variability in the results from the BWRs is caused by real rather than artifactual differences, the HEPs from several of the more important human actions appearing in the IPEs were examined across plants. The results from this examination indicated that some of the variability in the HEP values may have been an artifact of the way in which HRA methods are applied. Nonetheless, in most cases, it appears that there were explanations for much of the observed variability in HEPs and in the results of the HRAs across the IPEs. However, such an assertion does not necessarily imply that the HEP values are generally valid. Reasonable consistency can be obtained in HRA without necessarily producing valid HEPs. An HEP is only valid to the extent that a correct and thorough application of HRA principles has occurred. For example, if a licensee simply assumed (without adequate analysis) that their plant is "average" in terms of many of the relevant PSFs for a given event, but appropriately considers the time available for the event in a given context, the value obtained for that event may be similar to those obtained for other plants.
Yet, the resulting value may be optimistic or pessimistic relative to the value that would have been obtained if the licensee had conducted a detailed examination of the relevant plant-specific factors. Thus, to reiterate, consistency does not necessarily imply validity. In addition, because many of the licensees failed to perform high-quality HRAs, it is possible that the licensees obtained HEP values that are not appropriate for their plants.
IPEs WITH RESPECT TO RISK-INFORMED REGULATION
In performing their IPEs, licensees elected to perform a Level 1 probabilistic risk analysis (PRA) and a limited Level 2 PRA. In addition, the majority of the licensees have indicated their intention to maintain and update these PRAs for future use. These IPEs/PRAs can provide the foundation for the increased future use of PRA in risk-informed regulation. However, before an IPE/PRA can be used beyond its original purpose (GL 88-20), the quality of the IPE/PRA will need to meet the standards established for the specific application, which the IPE/PRA may or may not currently meet. The CDF analyses in the IPEs are generally robust and generally use acceptable methods. Given the limited staff review, it is believed that the licensees, collectively, have identified the important accident sequences. Therefore, if a particular application requires only the identification of important sequences (not a relative
ranking of those accidents), most of the Level 1 PRAs for these IPEs are adequate. The staff reviews of the individual IPEs identified any relevant exceptions to this conclusion. In regard to the containment performance and source term calculations (Level 2), these analyses are generally simplified or of lesser quality than the CDF analyses, primarily because of the use of some methods that are limited in nature. Many of the analyses relied heavily on either the use of the MAAP code or the use of a set of industry position papers, neither of which has a comprehensive treatment of severe accident phenomena. Although the core damage analyses in the IPEs are generally robust, the staff identified weaknesses in certain areas, primarily including analysis of plant-specific data, common cause failure data, and human reliability. The most important shortcoming for some of the IPEs is the HRA, with the most significant concern being the use of invalid HRA assumptions that did not produce consistently reasonable results.
OVERALL CONCLUSIONS AND OBSERVATIONS
In considering the perspectives discussed above, and the results reported in the IPE submittals, certain conclusions and observations can be drawn, as summarized below:
• As a result of the IPE program, licensees have generally developed in-house capability with an increased understanding of PRA and severe accidents. Further, the IPE program has served as a catalyst for further improving the overall safety of nuclear power plants, and therefore, the generic letter initiative has clearly been a success.
• Areas and issues have been identified where the staff plans to pursue some type of follow-up activity. Areas under consideration are plant improvements, containment performance improvement items either not implemented or not addressed in the IPE submittal, and plants with relatively high CDF or conditional containment failure probability (greater than 1E-4/ry and 0.1, respectively).
• If an IPE is to be used to support risk-informed regulation, then additional review may be needed in areas where the IPEs appear to be weak, depending upon the application of the PRA.
• Examining the IPE results against the results of NUREG-1150 comparisons with safety goals, a fraction of the plants have the potential for early fatality risk levels that could approach the safety goals' quantitative health objectives.
• Many of the BWR and PWR plant improvements address station blackout (SBO) concerns and originated as a result of the SBO rule. These improvements had a significant impact in reducing the SBO CDF (an average reduction of approximately 2E-5 per reactor-year (ry), as estimated from the CDFs reported by licensees in the IPEs). With the SBO rule implemented, the average SBO CDF is approximately 9E-6/ry, ranging from negligible to approximately 3E-5/ry. Although the majority of the plants that implemented the SBO rule have achieved the goal of limiting the average SBO contribution to core damage to about 1E-5/ry, a few plants are slightly above the goal. In comparing the IPE results to NUREG-1150, the average CDFs estimated for both BWRs and PWRs in NUREG-1150 fall within the ranges of the CDFs estimated in the IPEs; the relative contributions of accident sequences in the IPE results are also consistent with the NUREG-1150 results; the conditional probabilities of early containment failure reported in NUREG-1150 (mean values) also fall within the range of the IPE results for each containment type. Generally, the specific perspectives discussed in NUREG-1150 are consistent with the IPE results; however, the results of the IPEs do not indicate (as discussed in NUREG-1150) that the likelihood of early containment failure is higher for ice condenser designs than for large dry and subatmospheric designs. The opposite trend (as seen in the IPEs) appears to be driven by the modeling assumptions made in the five ice condenser IPEs rather than by any phenomenological or design-related reasons.
• IPE results indicate areas in PRA where standardization is needed.
433
•
Unresolved safety issue (USI) A-45 ("Shutdown Decay Heat Removal Requirements") and certain other USIs and genetic safety issues (GSIs), primarily GSI-23 ("Reactor Coolant Pump Seal Failures"), GSI-105 ("Interfacing System LOCA in Light Water Reactors") and GSI-130 ("Essential Service Water System Failures at Multi-Unit Sites"), were proposed by licensees for resolution on a plant-specific basis. Other safety issues resulting from the IPEs were identified as candidates for further investigation.
•
Areas where further research regarding both severe accident behavior and analytical techniques would be useful and should be considered were identified.
•
Information from the IPEs/PRAs can be used to support a diversity of activities such as plant inspection accident management strategies, maintenance rule implementation, and risk-informed regulation.
REFERENCES USNRC, "Individual Plant Examination Program: Perspectives on Reactor Safety and Plant Performance," NUREG-1560, Draft Report for Comment, October 1996. USNRC, "Individual Plant Examination for Severe Accident Vulnerabilities - 10 CFR 50.54(f)," Generic Letter 88-20, November 23, 1988. USNRC, "Severe Accident Risks: An Assessment for Five U.S. Nuclear Power Plants," NUREG-1150, December 1990.
PSA FOR CANDU-6 PRESSURIZED HEAVY WATER REACTORS: WOLSONG UNITS 2, 3, AND 4 OF KOREA
Myung-Ki Kim and Byoung-Chul Park
1 Korea Electric Power Research Institute, 103-16, Munji-dong, Yusung-gu, Taejon, Korea
1 KEPRI is a research center of Korea Electric Power Corporation (KEPCO).
ABSTRACT
Level 1 and 2 probabilistic safety assessments (PSAs) for both internal and external events are being performed to meet one of the conditions for a construction permit for Wolsong units 2, 3, and 4 in Korea. These units are CANDU-6 Pressurized Heavy Water Reactors (PHWRs), and the study is the first comprehensive level 1 and 2 PSA for CANDU type plants in the world. The detailed PSA includes extensive fault tree and event tree analyses, human reliability analysis, and common cause failure analysis. Event trees have been developed for 35 internal initiating event groups. The preliminary results show that the total core damage frequency for each of Wolsong units 2, 3, and 4 is similar to that for a typical PWR plant.
KEYWORDS
PSA, CANDU, Core Damage Frequency, Severe Accidents, Plant Damage State, Wolsong
INTRODUCTION
Since the accident at TMI Unit 2 in 1979, the importance of the prevention and mitigation of severe accidents in nuclear power plants has increased. Many countries have focused research on understanding severe accidents, in order to identify ways to improve the safety of nuclear power plants. In Korea, the regulatory body, i.e., the Korea Institute of Nuclear Safety (KINS), issued the Nuclear Safety Policy Statement in September 1994, which requests a plant-specific safety evaluation using a probabilistic safety assessment, and a severe accident management program based on this evaluation. To meet the construction permit condition requiring a probabilistic safety assessment for the Wolsong PHWRs, Korea Electric Power Corporation (KEPCO) has established a program to perform level 1 and 2 PSAs including external events analysis. The project consists of two phases, a feasibility study (Phase 1) and a main study (Phase 2). The Wolsong 2, 3 and 4 units are CANDU-6 type plants, and there have been no comprehensive PSAs for them as there have been for Light Water Reactors (LWRs). The feasibility study was performed over a five month period ('93.8 - '94.1). Its purpose was to determine the scope and methodology of the study for the CANDU type plant. The study showed
that even though the Wolsong units have a different design concept compared to LWRs, full-scope level 1 and 2 PSAs including external events are needed to verify the safety, understand the most likely severe accident sequences, and provide the necessary information for the accident management program. As a result, level 1 and 2 PSAs for both internal and external events were started in September 1995 and will continue until September 1997. A detailed PSA has been performed in cooperation with the Korea Atomic Energy Research Institute (KAERI), Korea Power Engineering Company (KOPEC), and Atomic Energy of Canada Limited (AECL), including extensive fault tree and event tree analyses, human reliability analysis, and common cause failure analysis. This paper describes the interim results along with the insights gained, largely focusing on the level 1 PSA.
PLANT DESCRIPTION
Wolsong Pressurized Heavy Water Reactors (CANDU-6 type, 600 MWe each) are located in the southeast of the Korean peninsula. Units 2, 3, and 4 are under construction and virtually identical, with a limited number of shared facilities. The NSSS is supplied by AECL. The reactor is a horizontal pressure-tube design with fuel channels. Each fuel channel consists of a pressure tube surrounded by a calandria tube with a CO2 gap. The moderator system cools and circulates the moderator in the calandria and acts as an emergency heat sink following a loss of coolant accident (LOCA). The primary heat transport system is arranged in two closed circuits to reduce the blowdown rate of reactor coolant in case of a LOCA. Three special safety systems (the Emergency Core Cooling system, the Shutdown system, and the Containment Dousing system) are installed for preventing and mitigating potential accidents.
SCOPE AND METHODOLOGY
Based on the feasibility study, the scope of this study has been determined as follows: level 1 and 2 PSAs for both internal and external (seismic, internal flooding, fire and other events) events, as well as a framework for the accident management program. The level 1 internal events analysis was performed based on the Wolsong Unit 2 PSA study preliminarily performed by AECL: refining the list of initiating events, where necessary, by surveying the operating experience of Wolsong Unit 1 (already in operation); regrouping the initiating events for the efficiency of the work and the ease of review; performing additional analysis for defining the success criteria; incorporating common cause failures in system modeling; and supplementing the human reliability analysis. In the level 1 analysis for external events, the seismic analysis was performed using the NUREG-1407 method with Canadian practice. The fire and flooding analyses were performed using a probabilistic method, and other external events such as high wind were analyzed using the progressive screening approach as recommended in the IPEEE guidance document. It was identified that there has been no experience of external events analysis for CANDU plants in the world. Also, previous PSAs have indicated that the risk from external events can be a significant contributor to the core damage frequency, depending on the unique features of the plants under assessment. The level 2 PSA is being performed for the core damage sequences whose frequencies are 1.0E-7/ry or higher, with a redefinition of plant damage states in connection with the construction of containment event trees. Also, the containment ultimate strength, containment event trees, and source terms have been analyzed considering the characteristics of CANDU plants.
INITIATING EVENTS
The first step of the level 1 PSA is the identification of initiating events that induce abnormal conditions and
eventually may result in a core damage accident. The initiating events for Wolsong units 2, 3, and 4 were selected from operating experience and failure mode and effects analysis (FMEA) to identify initiating events that either have happened or could happen as a result of hardware failures or human error. Particular attention was paid to the FMEA of support systems, which resulted in initiating events associated with the component cooling system. They were finally regrouped into 35 initiating event groups according to similar plant response. Table 1 shows some representative initiating events for the Wolsong plants along with their frequencies and the error factors of their lognormal distributions.
EVENT TREES
The purpose of an event tree is to determine the plant response to an initiating event and to identify the mitigating systems and necessary operator actions required to bring the plant to some final state following any accident sequence. In association with the mitigating system fault trees, it is used to perform accident sequence quantification to derive the frequency of the endpoint or final state of a particular accident sequence. The endpoint is either a plant success state, where fuel cooling is maintained with no radioactive release into containment, or a core damage state. The event tree also addresses the combination of the initiating event followed by mitigating system failures. AECL, the CANDU plant designer, implemented 11 plant damage states (PDSs) in order to define the final state of the event tree. PDS 0 represents those accidents which cause the most severe plant damage, i.e., rapid loss of core integrity, and PDS 10 represents the most minor transients, such as deuterium deflagration in the cover gas. In the level 1 PSA study, the core damage state is used as the final state instead of the PDS, to keep consistency with the same level of PSA for PWRs. PDS 0, 1, and 2, as classified in the AECL PSA study, are defined as core damage. Each event tree for an initiating event is constructed on the basis of plant behaviour following the accident.
Table 1. Some Representative Initiating Events
Initiating Event | Frequency (per year) | Error Factor | Description
IE-LKC1 | 8.17E-2 | 2.0 | Heat Transfer System Leak - Within Operating D2O Feed Pump Capacity
IE-SGTR | 1.0E-3 | 10.0 | Heat Transfer System Leak - Steam Generator Tube Rupture
IE-SL | 1.35E-2 | 10.0 | Small LOCA
IE-PCTR | 8.46E-4 | 10.0 | Pressure Tube and Calandria Tube Rupture
IE-FBS | 2.0E-4 | 10.0 | Feeder Stagnation Break
IE-BMTR | 1.0E-5 | 10.0 | Small LOCA - Multiple Steam Generator Tube Rupture
IE-LRVO | 1.54E-2 | 3.0 | Heat Transport Liquid Relief Valves Fail Open
IE-LL1 | 2.0E-4 | 10.0 | Large LOCA - Large Diameter Pipe Break not Involving Containment
IE-HPFP | 5.62E-1 | 3.0 | Partial Loss of Heat Transfer System Pumped Flow
IE-FWPV | 2.05E-1 | 3.0 | Loss of Feedwater Supply to SGs due to Failure of Pump and Valve
IE-FWB1 | 7.96E-4 | 3.0 | Asymmetric FW Line Break - Outside Reactor Building Downstream of FW Regulation Station
IE-FWB2 | 8.6E-5 | 3.0 | Asymmetric FW Line Break - Outside Reactor Building Upstream of FW Regulation Station
IE-LOCV | 5.12E-2 | 2.0 | Loss of Condenser Vacuum
IE-T | 2.8 | 5.6 | General Transient
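As a sketch of how the error factors in Table 1 can be used in quantification (assuming the tabulated frequency is the lognormal median and the error factor EF is the ratio of the 95th percentile to the median, so that sigma = ln(EF)/1.645):

    import math, random

    def sample_frequency(median, ef, rng=random):
        """Draw one initiating event frequency from a lognormal distribution."""
        sigma = math.log(ef) / 1.645
        return median * math.exp(rng.gauss(0.0, sigma))

    # e.g. IE-SL (small LOCA): median 1.35E-2/yr, error factor 10
    samples = sorted(sample_frequency(1.35e-2, 10.0) for _ in range(10000))
    print(samples[499], samples[9499])  # ~ median/10 and ~ median*10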
HUMAN RELIABILITY ANALYSIS
Human errors may occur before the initiating events, or while an operator is responding to an abnormal situation such as a loss-of-coolant accident or a general transient. The results of the human reliability analysis (HRA) can be validated through a review of the Abnormal Operation Manual and the simulator experiments of the CANDU 6 NPP. In this study, the basic information for the HRA was obtained from interviews with Wolsong unit 1 personnel and from other PSA documents for CANDU NPPs. The HRA follows the Systematic Human Action Reliability Procedure (SHARP) approach; for a detailed analysis of the significant human actions, the ASEP or THERP methods are used. Also, a quantitative screening method using conservative values is used. The dependencies between multiple human actions are identified through the preliminary sequence quantification. The assumptions and methods used in the study are listed in Table 2.
Table 2. The Assumptions and Methods used in the HRA
Item | Assumption / Method
Screening Value | 5.0E-2
HEP Calculation | P = Pd + Pa - Pd x Pa
Dependency | 5 types according to THERP
Detailed HRA Method | ASEP/THERP
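A worked sketch of the conventions in Table 2 (the numerical HEPs below are invented for illustration): the total HEP combines the diagnosis and action parts, and a dependent second action is adjusted with the five THERP dependence levels.

    def total_hep(pd, pa):
        """Table 2 convention: P = Pd + Pa - Pd * Pa."""
        return pd + pa - pd * pa

    def dependent_hep(p, level):
        """THERP conditional HEP of an action given failure of the preceding one."""
        return {"ZD": p,                    # zero dependence
                "LD": (1 + 19 * p) / 20,    # low
                "MD": (1 + 6 * p) / 7,      # moderate
                "HD": (1 + p) / 2,          # high
                "CD": 1.0}[level]           # complete dependence

    p1 = total_hep(pd=1.0e-2, pa=5.0e-3)   # first action: ~1.495E-2
    p2 = dependent_hep(5.0e-2, "HD")       # second action under high dependence
    print(p1, p2)                          # -> ~0.01495 and 0.525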
COMMON CAUSE FAILURE
Common Cause Failure (CCF) is generally regarded as one of the most dominant contributors to core damage frequency. However, there is no CCF database for CANDU plants. Hence, it was decided to adapt the PWR CCF database to the Wolsong CANDU plants. Our comparison of the databases for PWRs and CANDUs indicates that the levels of data are different for the major components for which CCF is modeled. For example, while the PWR data for pumps combine the mechanical and electrical parts, the CANDU data treat them separately, so that we consider both as basic events in the fault tree model. In order to apply the PWR CCF data to the detailed fault tree for a CANDU pump, we reviewed the component boundaries in the fault tree model and made the levels of detail the same. The CCF is quantified using the beta factor method and, where its results are considered too conservative, the Multiple Greek Letter (MGL) method is used instead.
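A minimal sketch of the beta factor model mentioned above, with illustrative numbers: a fraction beta of a component's total failure probability Qt is treated as a common cause event failing all redundant trains at once.

    def beta_factor_ccf(qt, beta, n_trains):
        """Failure probability of all n redundant trains under the beta factor model."""
        q_ind = (1.0 - beta) * qt    # independent part of each train's failure
        q_ccf = beta * qt            # common cause part, shared by all trains
        return q_ind ** n_trains + q_ccf

    print(beta_factor_ccf(qt=1.0e-3, beta=0.1, n_trains=2))
    # -> ~1.0E-4: the common cause term dominates the redundant pair's failure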
DATA BASE
Component reliability data for this study have been compiled primarily from operating experience at Ontario Hydro's generating stations, but include other sources, such as CANDU 6 operating experience where available. The primary source of data is "Component Reliability Data for CANDU Nuclear Stations", which was compiled in 1986. Where required data are not available, data from other sources such as Ontario Hydro fossil fuel station operating experience and external sources have been used. In addition, we have developed a limited database for Wolsong unit 1, covering, for example, human error data and some major components. For this, we interviewed the plant personnel, collected plant trouble reports and generated plant specific data. Those data are incorporated into the HRA and used to estimate the probability of some undeveloped events.
ACCIDENT SEQUENCE QUANTIFICATION
The objective of accident sequence quantification is to provide an evaluation of each individual accident sequence's impact on, and contribution to, the frequency of core damage states. The quantification of core damage frequency is performed using the KIRAP code, in which the logical loops between the support systems are solved
automatically. After the cut sets are generated, they are processed to remove mutually exclusive events and cut sets which violate the success criteria of the accident sequence. We expect the frequencies of sequences which lead to beyond design basis states to be 1.0E-6/ry or less; therefore a truncation limit of 1.0E-10 is selected for the accident sequence quantification and used for all sequences, in order to make sure all significant contributors to a sequence are included in the generated cut sets and, on the other hand, to limit the number of cut sets to a manageable number. After the minimal cut sets for the sequences that result in the core damage state are obtained, the rule based recovery analysis is performed according to the recovery actions listed in Table 3. The unavailability of the major systems is shown in Table 4.
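The quantification step just described can be sketched as follows; the basic event names and probabilities are invented for illustration, and the retained cut sets are combined with the usual min cut set upper bound.

    TRUNC = 1.0e-10   # truncation limit used for all sequences

    def cutset_value(cutset, p):
        value = 1.0
        for ev in cutset:
            value *= p[ev]
        return value

    def sequence_frequency(cutsets, p):
        kept = [q for q in (cutset_value(cs, p) for cs in cutsets) if q >= TRUNC]
        upper = 1.0
        for q in kept:               # min cut set upper bound: 1 - prod(1 - qi)
            upper *= (1.0 - q)
        return 1.0 - upper

    p = {"IE-FWPV": 2.05e-1, "SDC": 1.0e-3, "EWS": 2.0e-3, "REC": 1.0e-1}
    print(sequence_frequency([["IE-FWPV", "SDC", "EWS", "REC"]], p))
    # -> ~4.1E-8/ry for this single (illustrative) cut set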
UNCERTAINTY AND SENSITIVITY ANALYSES

KIRAP's UNCERT program is used to determine the uncertainty of the system failure probabilities. The uncertainty due to reliability data is being analyzed, and the uncertainty due to data modeling and major assumptions is analyzed in detail. Table 5 shows the sensitivity of the core damage frequency to the human reliability analysis, the mission time, the common cause failure analysis, and the surveillance test intervals; positive and negative effects denote a decrease and an increase in CDF, respectively.
RESULTS AND DISCUSSION

A phase 2 PSA for Wolsong PHWR 2, 3, and 4 has been performed with extensive fault tree and event tree analyses, including detailed human reliability and common cause failure analyses based on standard PSA techniques. Thirty-five internal event trees have been assessed in terms of their safety implications. The preliminary analysis indicates that the total core damage frequency with recovery actions is of the order of 1.0E-4/ry. The analysis showed that the detailed human reliability analysis and the use of a 24-hour mission time reduce the CDF. In this study, we give credit to second human actions, with conservatism about the dependency between human actions; overall, such second human actions reduce the total CDF of the plant. On the other hand, the inclusion of CCF analysis increases the CDF to some extent. The dominant sequence contributing to the core damage frequency is FWPV-11, namely loss of feedwater supply to the steam generators due to failure of pumps or valves; it contributes approximately 27% of the total CDF. Most of the sequences above 1.0E-6/ry include failure of the shutdown cooling system and the emergency water supply system. Sensitivity analysis is being carried out to optimize the surveillance test intervals for shutdown systems #1 and #2, the ECCS, and the shutdown cooling system. Also, the accident management program for the Wolsong PHWR units is being developed using the severe accident insights gained from this study.
ACKNOWLEDGMENT The review and comments by Drs. Inn Seock Kim (KEPRI) and Joon Eon Yang (Korea Atomic Energy Research Institute) are gratefully acknowledged.
REFERENCES
Swain, A.D. (1987). Accident Sequence Evaluation Program Human Reliability Analysis Procedure, NUREG/CR-4772, S.N.L.
Swain, A.D. and Guttmann, H.E. (1983). Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications, NUREG/CR-1278, S.N.L.
Chen, J.T., et al. (1991). Procedure and Submittal Guidance for Individual Plant Examination of External Events (IPEEE) for Severe Accident Vulnerabilities, NUREG-1407, NRC.
Hannaman, G.W. (1984). Systematic Human Action Reliability Procedure (SHARP), EPRI NP-3583, EPRI.
Han, S.H. (1989). KAERI Integrated Reliability Analysis Code Package (KIRAP), KAERI-PSA-002, KAERI.
Santamaura, P.A., et al. (1995). Overview of the CANDU 6 Wolsong NPP 2/3/4 Probabilistic Safety Assessment, Probabilistic Safety Assessment Methodology and Applications, PSA '95, November 26-30, 1995, Seoul, Korea.
Table 3. Some Representative Recovery Actions

No | Action      | Probability | Description
1  | OR-CL4-30   | 7.6E-1      | Operator restores Class IV power within 30 minutes
2  | OR-CL4-60   | 6.2E-1      | Operator restores Class IV power within 1 hour
3  | OR-SST      | 0.1         | Restore SST within 12 hours
4  | OR-PHT-S    | 0.1         | Transfer to PHT mode SDC when SDC pumps fail due to mechanical problem
5  | OR-N2-FW    | 0.1         | Connect N2 bottles to regulating valves in the condensate/feedwater system to restore feedwater supply
6  | OR-DA-MKP   | 0.1         | Make up to deaerator from demineralized water system via condenser hot well and condensate extraction pump
7  | OR-RFT-AFW  | 0.1         | Operator transfers water source to RFT for AFW
8  | OR-MANUAL   | 0.1         | Operator opens/closes the manual valve locally
Table 4. Unavailability of Main Systems

System                        | Unavailability | Remarks
Shutdown System No. 1         | 5.94E-4        |
Shutdown System No. 2         | 1.64E-3        |
ECCS (High/Low Pressure)      | 6.175E-3       |
ECCS (Long Term)              | 4.278E-3       |
Shutdown Cooling System       | 8.263E-2       | Test Interval (3M -> 1Y)
Main Feedwater System         | 4.073E-3       | No CCF Effect
Auxiliary Feedwater System    | 2.573E-2       | Little CCF Effect
Emergency Water Supply System | 6.91E-3        | Test Interval (3M -> 1Y)
Table 5. Effects on Core Damage Frequency

Positive Effect                             | Negative Effect
Detailed Human Reliability Analysis         | Common Cause Failure Analysis (Rough Estimation Based on CCF Parameters of PWR)
Use of 24 hour Mission Time                 | Extension of Test Intervals for SDCS and ECCS
Second Operator Action at Event Tree Level  |
Rule-Based Recovery for All Sequences       |
LEVEL 2 PSA TO EVALUATE THE PERFORMANCE OF THE DOEL 1&2 NPP CONTAINMENT UNDER SEVERE ACCIDENT CONDITIONS

A. D'Eer 1, B. Boesmans 1, M. Auglaire 1, P. Wilmart 1, P. Moeyaert 2
1 Tractebel Energy Engineering, Avenue Ariane 7, 1200 Brussels
2 Electrabel, Doel 1&2 NPP
ABSTRACT
The objective of the Doel 1&2 level 2 PSA is to evaluate in probabilistic terms the performance of the containment for core damage scenarios. The progression of the severe accident and its loads on the containment are assessed by means of a logical model referred to as the Accident Progression Event Tree (APET). The PSA level 2 analysis shows that the Doel 1&2 containment prevents early loss of containment integrity during a severe accident. This is due to the low contribution of containment bypasses and to the extremely low probability of early structural containment failures. The late containment ruptures are dominated by basemat melt-through, whereas the late containment leaks mainly result from static overpressurisation. Sensitivity calculations have been performed to assess the contribution of the different severe accident management (SAM) measures (e.g. auto-catalytic hydrogen recombiners) to the reduction of the containment failure probability.
KEYWORDS
PSA, level 2, containment performance, severe accident, APET, containment failure modes, SAM.
INTRODUCTION
In Belgium, each Nuclear Power Plant must be re-examined after ten years from the viewpoint of safety. The objective of this compulsory review is to compare the actual safety level of the unit with the safety level which would result from the application of the rules existing at the time of review. In this context, it was found desirable to perform a Probabilistic Safety Assessment in support of the ten-yearly back-fitting process. The Doel 1&2 units are twin two-loop Westinghouse PWRs, each with a power output of 400 MWe. The primary containment is a spherical steel shell.
The objective of the Doel 1&2 level 2 PSA is to evaluate in probabilistic terms the performance of the containment for core damage scenarios. In addition, the contribution of several severe accident management (SAM) measures (e.g. auto-catalytic hydrogen recombiners) to the reduction of the containment failure probability is assessed.
THE PLANT DAMAGE STATES

The Doel 1&2 level 1 PSA analysis describes all accident sequences leading to core damage. The latter is defined as "at least 20% of the cladding material reaches a temperature above 1000°C". Beyond this temperature, the oxidation reaction of the zirconium present in the cladding escalates, and the integrity of the first barrier to the fission products is no longer guaranteed. The core is referred to as being damaged. The initiating events considered in the level 1 analysis cover all internal events, for all power and shutdown plant operation modes. The level 2 PSA analysis is limited to the power modes only. The purpose of the Plant Damage State analysis task is to group the numerous and detailed level 1 core damage sequences into a limited number of states, such that the progression of the severe accident and the response of the containment will be identical for all the core damage sequences belonging to a given state. The Doel 1&2 Plant Damage States are characterised by 13 attributes (a grouping sketch in code follows the list):
• initiating event,
• pressure of the reactor coolant system,
• pressure of the secondary system,
• containment isolation,
• timing to core damage,
• status of AC power supply,
• status of DC power supply,
• status of SG feedwater supply,
• status of pressuriser relief valves,
• status of high head safety injection,
• status of low head safety injection,
• status of containment spray,
• status of containment heat removal.
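A minimal grouping sketch, assuming a simple list-of-dicts representation of the level 1 sequences (an illustration, not the Doel 1&2 implementation): each core damage sequence is mapped onto the tuple of its 13 attribute values, and sequences sharing a tuple form one Plant Damage State.

from collections import defaultdict

ATTRIBUTES = (
    "initiating_event", "rcs_pressure", "secondary_pressure",
    "containment_isolation", "timing_to_core_damage", "ac_power",
    "dc_power", "sg_feedwater", "przr_relief_valves", "hhsi", "lhsi",
    "containment_spray", "containment_heat_removal",
)

def group_into_pds(sequences):
    # sequences: iterable of dicts holding the 13 attribute values plus
    # a "frequency" entry; sequences with identical attribute tuples
    # fall into the same Plant Damage State.
    pds = defaultdict(float)
    for seq in sequences:
        key = tuple(seq[a] for a in ATTRIBUTES)
        pds[key] += seq["frequency"]
    return pds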
THE ACCIDENT PROGRESSION EVENT TREE

The key feature of a PSA level 2 analysis is the development of a severe Accident Progression Event Tree (APET) model. The Doel 1&2 APET describes in detail the mechanisms leading to containment failure, by developing the progression of the severe accident starting from core damage. The model deals with both phenomenological events and system-oriented events (safety injection and recirculation in the RCS, containment heat removal). The following phenomena were addressed in the model:
• RCS pressurisation due to hydrogen accumulation in the SG tubes
• hot leg / surge line failure
• stuck-open PORV as a result of cycling
• SG induced tube rupture due to creep failure
• core degradation as a function of possible recovery of injection
• in-vessel hydrogen production, transport to the containment, and hydrogen burns
• in-vessel fuel coolant interactions
• vessel failure
• vessel thrust forces
• high pressure melt ejection
• direct containment heating
• ex-vessel fuel coolant interactions
• containment pressurisation
• core concrete interactions
In this large event tree, the evolution of the operation of key safety systems (high and low head safety injection, containment spray and fan coolers) during the progression of the accident is precisely described. In this way the interaction between severe accident phenomena and the operation of key safety systems during core degradation can be adequately modelled. A few examples are: recovery of low pressure in-vessel injection as a result of RCS pressure decrease, failure of fan coolers due to a hydrogen burn, actuation of the spray signal due to steam build-up.
QUANTIFICATION OF THE APET The events appearing in the APET need to be quantified, implying that split fractions need to be assigned to different physically possible outcomes. Several sources of information have been used to quantify these split fractions. These include a considerable number of supporting calculations performed using a plant specific input deck for the severe accident code MELCOR, detailed measurements on engineering drawings to obtain geometrical containment data, laboratory measurements of concrete composition and properties, system reliability data (PSA level 1 fault trees), and finally human reliability analysis based on analysis of plant procedures. To promote a systematic approach to quantify the split fractions, specific quantification guidelines have been developed for assigning split fraction values. The evaluation of the APET by means of the EVNTRE code, i.e., the calculation of containment failure probabilities, is performed for initial boundary conditions being the Plant Damage States (PDS), which characterise the status of the plant.
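A minimal sketch of this kind of evaluation: starting from a PDS probability, each APET question multiplies the running path probability by the split fraction of the chosen outcome, and the leaves give the end-state probabilities. The toy tree and numbers below are hypothetical and stand in for the far larger EVNTRE model.

def evaluate_event_tree(pds_probability, branch_points):
    # branch_points: list of (question, {outcome: split fraction}).
    # Each branch point multiplies the running path probability by the
    # split fraction; the leaves are the end-state probabilities.
    end_states = {(): pds_probability}
    for question, fractions in branch_points:
        new_states = {}
        for path, prob in end_states.items():
            for outcome, split in fractions.items():
                new_states[path + ((question, outcome),)] = prob * split
        end_states = new_states
    return end_states

toy_tree = [
    ("hydrogen burn", {"yes": 0.2, "no": 0.8}),      # hypothetical splits
    ("vessel failure", {"yes": 0.6, "no": 0.4}),
]
end_state_probs = evaluate_event_tree(1.0, toy_tree)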
THE CONTAINMENT FAILURE MODES

Containment failure refers to loss of integrity of the last fission product barrier, i.e., structural containment failure, isolation failure (failure to isolate a penetration) and containment bypass (steam generator tube rupture or interfacing system LOCA). According to their size, structural failures are divided into leaks and ruptures. A leak implies that the size of the containment failure prevents containment pressurisation due to phenomena such as the generation of non-condensable gases from core concrete interactions and the steam production from molten debris cooling in the absence of containment heat removal. A rupture, on the other hand, is defined such that the size of the containment failure is large enough to depressurise the containment from its ultimate capacity down to atmospheric conditions in one hour. The Doel 1&2 APET allows for containment failure due to the following phenomena:
• hydrogen combustion,
• slow overpressurisation,
• vessel thrust forces,
• in-vessel fuel-coolant interactions,
• steam spike,
• direct containment heating,
• ex-vessel fuel-coolant interactions,
• molten core/concrete interactions leading to basemat melt-through (BMMT).
Regarding the timing of structural containment failure, three separate categories are defined, namely: containment bypass (before core damage or during the in-vessel phase), early containment failures (prior to or at vessel breach), and late containment failures (after vessel failure). Six Plant End States (PES) have been defined on the basis of both timing and size of the failures. The resulting Plant End States are:
• bypass,
• early leak,
• early rupture,
• late leak,
• late rupture,
• intact.
THE IMPACT OF SAM MEASURES ON THE CONTAINMENT PERFORMANCE

The influence on the containment performance of several severe accident management measures has been investigated. The first two measures are already part of the existing plant and its operating procedures, and have been implemented on the basis of deterministic safety evaluation, namely:
• depressurisation of the reactor coolant system during core damage;
• the presence of auto-catalytic hydrogen recombiners, which have recently been installed.
The last two measures result from instructions in the accidental operating procedures, which are intended to prevent core damage. However, as a result of the Doel 1&2 containment layout, these actions have an impact on the progression of the severe accident and can therefore be interpreted in the context of a level 2 PSA as SAMs, namely:
• use of the RWST to inject into the RCS or to spray into the containment, aiming at reactor vessel cavity flooding;
• use of a second RWST allowing for external cooling of the reactor vessel.
RCS depressurisation after core damage
Depressurisation of the primary system by opening the PORVs is foreseen in the plant procedures when the temperature measured by the core exit thermocouples exceeds 650°C, indicating that the core has at least been partially uncovered and that the core degradation process has started. Depressurisation of the primary system results in:
• increasing the probability to reflood the core and arrest the core degradation process by means of low head safety injection;
• reducing the probability of reactor vessel failure at high pressure, which may cause severe challenges to the integrity of the containment;
• reducing the probability of an induced steam generator tube rupture as a result of creep failure.
This SAM action is incorporated as a separate event in the RCS pressure evolution subtree. Human reliability analysis is used to provide an appropriate value for the associated split fraction.
Hydrogen recombiners

The design criteria for the autocatalytic hydrogen recombiners guarantee that the hydrogen concentration in containment remains below 5% (by volume) for a worst-case severe accident scenario, selected among a list of credible scenarios. Therefore, the presence of recombiners is modelled in the APET by ensuring that the hydrogen concentration always remains in the low concentration range, for which ignition of the H2/H2O/air mixture is ruled out. Consequently, containment failures due to hydrogen burns are insignificant.
Reactor cavity flooding

As a result of its design, it is rather likely that the reactor cavity will be flooded if the RWST content is injected into the containment due to injection in the RCS or due to containment spray operation. The presence of water in the cavity is essential in the debris quenching process, as it enhances the probability of obtaining a coolable debris bed configuration in the cavity after vessel failure and thus reduces the risk of basemat melt-through.
External vessel cooling

Plant procedures also call for the use of a second RWST (refill or use of the RWST belonging to the twin plant). Although the purpose of this action is the restoration of injection into the reactor coolant system, it also allows flooding the containment to a level which is sufficient to establish external vessel cooling in an attempt to prevent vessel failure. This action is modelled by introducing into the APET a specific model for external vessel cooling: a sufficient quantity of water in the containment (2 RWSTs) and long-term containment heat removal to condense the produced steam. Successful external vessel cooling is guaranteed only if the fraction of melted core is sufficiently low. If the total amount of core material has melted, success of external vessel cooling is uncertain.
EVALUATION OF THE CONTAINMENT PERFORMANCE

Table 1 shows the probabilities of the different containment failure modes, for the reference case and for a series of sensitivity calculations. These results are still under review by the Utility and the Regulatory Body. The reference case, corresponding to the present status of the plant, includes all 4 existing SAM measures. The ruptures dominantly result from basemat melt-through. The leaks mainly result from static overpressurisation of the containment due to loss of containment heat removal. It follows that early containment failures are extremely unlikely. This can be easily explained by the high ratio of containment free volume to core power for the Doel 1&2 NPP. Furthermore, sensitivity calculations have been performed to assess the contribution of the different SAM measures described earlier to the reduction of the containment failure probability. The results of these calculations confirm the positive impact of the SAM measures. The base case corresponds to an evaluation with no SAM measures at all.
TABLE 1. CONDITIONAL CONTAINMENT FAILURE PROBABILITIES (PRELIMINARY RESULTS)
(SAM 1: RCS depressurisation; SAM 2: H2 catalytic recombiners; SAM 3: reactor cavity flooding; SAM 4: external vessel cooling)
Implementation of the first measure, RCS depressurisation during core damage (sensitivity calculation 1), does not alter the base case results. There are basically two reasons for this: firstly, most high pressure accident sequences also imply failure of low head safety injection, and secondly, the containment structure of the Doel 1&2 NPP is strong, such that containment failure as a result of reactor coolant system failure or of vessel failure at high pressure is insignificant. Additional implementation of the hydrogen auto-catalytic recombiners (sensitivity calculation 2) reduces the occurrence of late leaks drastically. Indeed, as a result of these recombiners, failure of the containment fan coolers due to hydrogen burns is prevented, such that static overpressurisation due to loss of containment heat removal is also prevented. The impact of these recombiners on the containment performance is undoubtedly positive. However, even if late static overpressurisation can be considerably reduced, late containment rupture may still occur due to basemat melt-through, as the reactor cavity remains dry for a considerable fraction of the severe accident scenarios. This is the reason why late ruptures due to basemat melt-through increase compared to the previous case. Furthermore, additional implementation of reactor cavity flooding (sensitivity calculation 3) largely prevents containment rupture due to basemat melt-through if the amount of corium in the reactor cavity is limited. However, if containment heat removal has failed, which is the case for a large fraction of the core damage scenarios, the containment may still fail due to static overpressurisation. Indeed, the decay heat is removed from the reactor cavity to the containment free volume. As a result, late containment ruptures due to basemat melt-through become late containment leaks due to static overpressurisation. Finally, additional implementation of external vessel cooling (reference case) only slightly reduces the risk of late containment ruptures due to basemat melt-through. External vessel cooling reduces the risk of vessel failure, and therefore reduces the probability of basemat melt-through. The reason why only a very slight reduction in risk is observed is that containment heat removal is needed to avoid containment overpressurisation and to enable condensation of the generated steam in order to refill the reactor cavity.
CONCLUSION The PSA level 2 analysis shows that the Doel 1&2 containment prevents early loss of containment integrity during a severe accident. This is due to the low contribution of containment bypasses and to the extremely low probability of early structural containment failures. Sensitivity calculations confirm the benefit of the existing SAM measures.
REFERENCES
U.S. Nuclear Regulatory Commission (1990). Severe Accident Risks: An Assessment for Five U.S. Nuclear Power Plants, NUREG-1150 Final Report, Washington, D.C., U.S.
RAOL - SIMPLIFIED APPROACH TO RISK MONITORING IN NUCLEAR POWER PLANTS

Zdenko Šimić 1,2, Jim O'Brien 1, Steve Follen 1, Vladimir Mikuličić 2
1 Yankee Atomic Electric Company, 580 Main St., Bolton, MA 01740-1398, U.S.A.
[email protected], [email protected], [email protected]
2 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Republic of Croatia
[email protected], [email protected]
ABSTRACT

Probabilistic Risk Assessment (PRA) can provide safety status information for a plant in different configurations; additional effort is needed, however, to do this in real time for on-line operation. This paper describes an approach to the use of PRA to achieve these goals. A Risk Assessment On-Line (RAOL) application was developed to monitor maintenance (on-line and planned) activities. RAOL is based on the results from a full-scope PRA and engineering/operational judgment, and incorporates a user-friendly program interface. Results from RAOL can be used by planners or operators to effectively manage the level of risk by controlling the actual plant configuration.
KEYWORDS
Probabilistic Risk Assessment, on-line risk monitoring, maintenance
BACKGROUND AND INTRODUCTION

A Probabilistic Risk Assessment (PRA) provides valuable information regarding Nuclear Power Plant (NPP) sensitivity to various events. The results of a PRA can be used to identify and prioritize the importance of different hardware, human actions and operating procedures to plant safety. PRA models the way accidents occur and progress at an NPP. The analyzed accidents are those leading to core damage. Each way the accident can occur is represented by an accident sequence. Each accident sequence consists of an initiating event and mitigating system failures that lead to core damage.
A plant-specific PRA provides a whole spectrum of results: the core damage frequency, the ranking of dominant accident sequences, component/system failure contributions and human error contributions (operation, test and maintenance). Initiating event contributions and individual sequence contributions are also known. The information contained in the PRA can be used as a very effective tool for Risk Management. Here we will discuss how a plant-specific PRA, with some modification and adaptation, can be used to supply additional information for assessing new plant configurations. When a safety engineer knows the expected systems configuration (from the current plant state and the maintenance schedule), it is possible from the PRA to evaluate the impact of this configuration on the plant Core Damage Frequency (CDF), i.e., plant safety. PRA can also provide a list of currently operable equipment most important to safety in the analyzed configuration, and prioritize the restoration of inoperable equipment. Prioritization of different equipment can simply be achieved by a relative comparison of impact on the plant CDF. For this purpose, it is necessary to have a living PRA model that is updated on a regular basis, and which can be quantified in a minimal amount of time. There is also a need to modify existing PRA models for this specific application with fast requantification capabilities. Possible solutions to this problem are discussed below:
1. Current PRA models are based on a combination of event trees and fault trees, which can differ significantly in size and level of detail from plant to plant. Linked fault trees and large event trees are two extremes in PRA methodologies. Quantification of the full PRA model provides the most accurate results, but is time consuming. The PRA model needs to be properly optimized to deliver results in time for an appropriate solution. With the right optimization, accurate results can be achieved with a reasonable number of accident sequences.
2. Less resource-intensive solutions can be achieved by using minimal cut sets instead of the complete PRA model (see the sketch after this list). This is certainly a less complete and less accurate solution, but with a carefully chosen set of minimal cut sets it can result in a very acceptable and dramatically faster solution.
3. The simplest solutions can be achieved by using a matrix of relevant systems and components for the same purpose. This matrix is derived from the original PRA model, and the relevant elements (systems and components) are determined from an operating standpoint. The application of this method is very limited, because multiple combinations cannot be accounted for.
The solution presented in this paper, the Risk Assessment On-line (RAOL) application, is a mixed approach between the matrix and cut-set methods. The method is more flexible than the matrix solution, but simplicity is still preserved. The application and results of RAOL are discussed below.
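A minimal sketch of the cut-set option (2), assuming the rare-event approximation: out-of-service equipment is set to failure probability 1.0 and the retained minimal cut sets are re-summed. The cut sets, component names and probabilities shown are hypothetical.

def conditional_cdf(cutsets, base_prob, out_of_service):
    # Rare-event approximation of the CDF for a given configuration:
    # out-of-service equipment is assigned failure probability 1.0.
    cdf = 0.0
    for cs in cutsets:
        p = 1.0
        for event in cs:
            p *= 1.0 if event in out_of_service else base_prob[event]
        cdf += p
    return cdf

cutsets = [("EDG-A", "EDG-B"), ("EDG-A", "AFW-B"), ("AFW-A", "AFW-B")]
base_prob = {"EDG-A": 2e-2, "EDG-B": 2e-2, "AFW-A": 5e-3, "AFW-B": 5e-3}
risk_now = conditional_cdf(cutsets, base_prob, out_of_service={"EDG-B"})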
APPLICATION DESCRIPTION AND RESULTS

This application was originally developed as a spreadsheet solution for the Maine Yankee (MY) NPP. RAOL is the modified version, developed with Microsoft (MS) Visual Basic. In this version, the interface between the user and the data/results is significantly improved, and the presentation of results is also optimized. This application was developed primarily to monitor on-line maintenance activities. It is based on results from a full-scope PRA with importance measures, engineering/operational judgment, and industry experience. The purpose of this application is to generate instantaneous and cumulative risk information for various potential plant configurations. This knowledge can then be used by planners/operators to keep plant configurations within an acceptable level of risk.
Figure 2: Dialog window for the event description.

The assessment process is designed to estimate the instantaneous risk associated with any given configuration of systems, sub-systems and trains, and the cumulative risk that results from that configuration. Figure 1 shows the Key Safety Functions (KSF) and systems included in the model. Table 1 shows the RAOL database table headings.

TABLE 1. RAOL DATABASE TABLE HEADINGS
Key Safety Function Table: KSF, KSF_ID
EventType Table: EventType, Event_ID, ESubType
SYS Table: SYS_ID, DescPC, DescSUL, Usage, SubsNo, Sub1, Sub2, Sub3, Sub4, KSF_ID, NOTES
Cycle Table: Cycle, StartDate, ONLSAMver
RIL (Cumulative Risk Log) Table: ConfigStart, PlantMode, MaxOT, PlantScore, ConfigDuration, CalcType
SUL (System Unavailability Log) Table: SYS_ID, Subsys, PlantMode, Event_ID, OOS_Time, BIS_Time, OOS_Hours, WorkOrder, Notes, LogType
To do this calculation, the model uses Risk Achievement Worth (RAW) values derived from the Maine Yankee PRA as weighting factors to assess the importance of various out-of-service configurations. The instantaneous risk is evaluated, and the cumulative risk is constrained by a maximum outage time for the existing configuration. A description of how one system is evaluated will explain the level of complexity incorporated in this model. A RAW value for SYS9 (Auxiliary Feedwater) is combined with the following systems: SYS6 (Component Cooling Water), SYS8 (Emergency Feedwater - EFW), SYS28 (Alternate Shutdown Sys.), SYS29 (EFCV Air), and all External Events (Seismic, Fire, Weather). SYS9 directly impacts the state of Heat Sink (KSF 2). The impact of SYS9 also depends on the status of other systems, for example: SYS7 (Main Feed), SYS10 (Steam Dump and Turbine Bypass), SYS11 (Condensate System), SYS25 (115kV Off-site Power), SYS26 (Service Water), and SYS32 (Switchgear Ventilation). Table 2 shows the complexity of these relations.
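A minimal sketch of the scoring logic just described, under the assumption that independent out-of-service systems contribute their RAW values additively while known inter-related combinations are overridden by a pre-run special case; the numbers (loosely echoing Table 2) and the dictionary contents are illustrative, not Maine Yankee data.

RAW = {"SYS8": 10.2, "SYS9": 4.79, "SYS28": 1.3}      # hypothetical values
SPECIAL_CASES = {frozenset({"SYS8", "SYS9"}): 58.1}   # pre-run code case

def configuration_score(out_of_service):
    # Known functionally inter-related combinations override the sum
    # with a pre-computed special case; otherwise the systems are
    # treated as independent and their RAW values are "added".
    oos = set(out_of_service)
    for combo, value in SPECIAL_CASES.items():
        if combo <= oos:
            return value
    return sum(RAW[s] for s in oos)

score = configuration_score({"SYS8", "SYS9"})   # -> 58.1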
This specific application is an innovative, simplified approach, yielding valuable results for different configurations in the MY NPP. Use of this application can improve the understanding of the basic principles and problems in a risk monitor approach. The applied method post-processes (filters) PRA results from the Level 1 and Level 2 assessments. Development of the values used in this method starts with the generation of matrix-type Risk Achievement Worth (RAW) values for system trains. Expert judgments, reviews of plant dependencies, reviews of mitigation requirements for various initiators and existing code run results are then used to determine whether the systems/trains in a potential out-of-service combination are independent or functionally inter-related. If independent, the values for each system/train can be "added". If functionally inter-related (e.g., Emergency Feedwater and Auxiliary Feedwater both perform secondary heat removal), a special code case must be run. Various methods are used to constrain the number of cases that need to be run; for example, surrogate (more conservative) cases may be used for a combination. The results of the analyses are then programmed into the RAOL application. Table 2 contains an example illustrating how the method is implemented. This application is not designed to supersede any other plant control; it is intended to be used in conjunction with the existing plant Technical Specifications (TS) and Administrative Controls. An on-line maintenance safety assessment can be performed:
• For scheduled activities identified during daily maintenance planning meetings.
• When equipment failures occur which may alter the results of the assessment performed at the daily maintenance planning meeting.
• Before authorizing unscheduled maintenance activities that may alter the results of the assessment performed at the daily maintenance planning meeting.
RAOL can be used as a tool to assess the safety implications of long-range maintenance schedules when they are being developed.
Figure 1: System Status Tab: Screen with current plant configuration.

RAOL was developed for the MS Windows 3.1 and MS Windows 95 operating system environments, using MS Visual Basic 4.0 Professional as the programming language. For all database operations the MS Jet database engine is used, primarily with the MS Access database format. By using the database engine, it is possible to reach virtually all other database formats (dBASE, Open Database Connectivity, client-server databases, etc.). This, for example, is very important for communication with the maintenance schedule database.
Figure 3: Plant Status Tab: Screen with summary information about current plant status.

If, during the process of evaluating the various system/train requirements for each KSF, it is determined that a cross-train situation exists, then the model ensures that the assessment reflects this condition. This requirement is applicable to any cross-train condition, whether or not it is prohibited by TS. This function is designed to evaluate cross-train conditions that are not specifically identified by the KSF quantification model. For example: Train A of High Pressure Safety Injection and Train B of Containment Spray; AC Distribution Bus A and Train B of Low Pressure Safety Injection; or Emergency Diesel Generator 1B and EFW train A. This function provides a message and color code.

TABLE 2. EXAMPLE OF SYSTEM AND EXTERNAL EVENT DEPENDENCIES: SYS-9
Number of Auxiliary Feedwater Trains Available (1, 0): a nested spreadsheet expression of the form @MAX(@IF(...), ...) combining the states of SYS6, SYS8 and SYS9, the external events EE1-EE3, and the RAW values 58.1, 4.79 and 2.5 together with SS28 and SS29 (S# - SYS; SS# - RAW values; EE - external event; 0, 1, 2 - SYS state).
With these interrelations, most configurations are precise or conservative for the configuration analyzed; however, the calculated result could be somewhat non-conservative for certain extremely rare configurations. The same applies to the cross-train situation. The RAOL program is fully applicable only to full power operation. Figure 3 shows the application tab where the final plant status is evaluated. Figure 4 shows the tab where the risk history log is presented. The RAOL program has three protected user levels. It is possible to compute the current actual risk and perform separate "what if" analyses to explore operating strategies and outage options. The plant can use the collected equipment unavailability data as feedback into the living PRA model.
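Because the original spreadsheet expression in Table 2 is only partially recoverable, the following is a hypothetical Python rendering of a Table 2 style rule: a chain of guarded conditions checked in priority order, returning a RAW weight for the SYS9 (Auxiliary Feedwater) state. The thresholds and return values are illustrative, not the actual model.

def sys9_weight(S6, S8, S9, EE1, EE2, EE3, SS28=0.0, SS29=0.0):
    # 0/1/2 encode the system state and EE1-EE3 the external event
    # states, as in the table legend; weights are illustrative only.
    if S9 == 0 and S8 <= 1:
        return 58.1                    # AFW out with EFW degraded
    if S9 == 0 and (EE1 == 2 or EE2 == 2 or EE3 == 1):
        return 58.1                    # AFW out during an external event
    if S9 == 0 and EE3 == 2:
        return 4.79
    if S9 == 0 and S6 < 2:
        return 4.79                    # AFW out with degraded CCW
    if S9 == 0:
        return max(2.5, SS28, SS29)    # AFW out, alternates credited
    return 0.0                         # AFW available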
Figure 4: Risk Log Tab: Screen with plant safety history.

The program consists of several submodes: Monitoring, Planning, Comparison, and Database. These four submodes give the user complete on-line and planning functionality. Figure 5 shows the Planning submode screen. The current plant configuration can be changed through the System Status 1 and 2 tabs (see Figure 1). If equipment is either going out of service or is being restored, it is necessary to fill in the dialog entries in Figure 2. The risk profile can be seen from the Cumulative Risk Log tab (Figure 4). The current plant status is visible from the Plant Status tab (Figure 3). Finally, various types of reports can also be generated from the System Unavailability Log (SUL) and Cumulative Risk Log (RIL) database tables (see Table 1).
Figure 5: Planning submode: Screen with plant maintenance schedule.

The preliminary response from the end users is very encouraging. The main strengths of RAOL are described as: an easy-to-use interface, a display that complies with plant terminology, and the fact that plant users do not have to understand PRA concepts and terminology to use the program. After using RAOL, MY plans to decide on its future development for more comprehensive use.
CONCLUSION

RAOL is an application that evaluates a plant configuration (actual or planned) and provides information including: instantaneous and cumulative risk of core damage, recommended Allowed Outage Time (AOT), important components, and summary reports. This application can be a valuable tool for use by plant operators and maintenance planning personnel to operate the plant safely. RAOL can be used to plan and execute safer planned and unplanned equipment outages during plant operation. It is also possible for this application to demonstrate compliance with those aspects of the Maintenance Rule that require maintenance to be more "risk aware". With good initial results, further development of this application can continue in additional important directions: addressing and balancing all the risk factors (initiating events, core damage, and containment failure), incorporating additional PRA Level 2 results, expanding operating modes (shutdown), and making the model more accurate by using minimal cut sets or a modified PRA model. Some of these directions are not really related to the limitations of the RAOL application; one example is expanding the covered operating modes.
REFERENCES

Samanta, P.K., Vesely, W.E., Kim, I.S. (1991). Study of Operational Risk-Based Configuration Control, NUREG/CR-5641.
ERIN (1995). PRA Application Guide, EPRI TR-105396.
Vesely, W.E., and Rezos, J.T. (1995). Risk-Based Maintenance Modeling: Prioritization of Maintenance Importances and Quantification of Maintenance Effectiveness, NUREG/CR-6002.
Yankee Atomic Electric Company (1992). "Maine Yankee Probabilistic Risk Assessment," YAEC-1824.
Acknowledgements

The authors would like to thank the International Atomic Energy Agency for the support of this work through the fellowship.
B2" PSA Applications
RELATIVE RISK MEASURE SUITABLE FOR COMPARISON OF DESIGN ALTERNATIVES OF INTERIM SPENT NUCLEAR FUEL STORAGE FACILITY

Miloš Ferjenčík
Department of Theory and Technology of Explosives, University of Pardubice
nám. Čs. legii 565, CZ-532 10 Pardubice, Czech Republic
ABSTRACT
Accessible reports on risk assessment of interim spent nuclear fuel storage facilities presume that only releases of radioactive substances represent undesired consequences. However, releases represent only a certain part of the undesired consequences. Many other events are connected with safety and are able to cause losses to the operating company. Based on this, the following two presumptions are made: 1. Any event causing a disturbance of a safety function of the storage facility is an incident event. 2. Any disturbance of a safety function is an undesired consequence. If the facility safety functions are identified and if the severity of their disturbances is quantified, then it is possible to combine consequence severity quantifications and event frequencies into a risk measure. Construction and application of such a risk measure are described in this paper. The measure is shown to be a tool suitable for comparison of interim storage technology design alternatives.
KEYWORDS interim spent nuclear fuel storage facility, relative ranking, risk comparison, risk analysis, risk measure, safety function
FACILITY, ITS PURPOSE AND DESIGN ALTERNATIVES
When spent nuclear fuel is removed from a power reactor it is usually cooled for a few years in a water pool. The pool is usually located inside the nuclear power plant containment building in close vicinity to the reactor. After this period the spent nuclear fuel is usually transferred to an interim spent nuclear fuel storage facility (independent spent fuel storage installation). The facility is located outside the power plant containment. It may be either co-located with a nuclear installation (e.g. power plant or reprocessing plant) or sited independently of other nuclear installations. The interim spent nuclear fuel storage facility's purpose is to provide for the safe, stable and secure storage of spent nuclear fuel before it is reprocessed or disposed of as radioactive waste. It is usually supposed that the fuel packages will spend a few tens of years in the facility.
Generally the facility consists of four areas:
1. Fuel units in transportation packages are transported to the Transportation Package Handling Area.
2. Fuel units are unloaded, checked and prepared for storage in the Fuel Unit Handling Area.
3. Fuel units are transferred through the Fuel Unit Transfer Area and loaded into storage packages.
4. Fuel units in storage packages are stored in the Storage Area.
The dual-purpose (transportation and storage) package concept is often exploited in modern facilities, substantially simplifying the above general description. There are many options as regards fuel packages, transportation, handling and storage systems. A company intending to construct and operate an interim spent nuclear fuel storage facility has more than a single facility design alternative to choose from.
TASK TO BE SOLVED

Safety of the facility is one of its most important properties. Safety is mentioned above in the purpose description, it is watched by the public, and it is supervised by a regulatory office. It is natural that the company intending to operate the facility asks about the safety of the design alternatives, and that it makes safety (or the inverse term, risk) one of the criteria in the decision-making process directed at selecting the most suitable design alternative of the facility. The question is: WHAT DESIGN ALTERNATIVE OF THE FACILITY WILL YIELD THE LOWEST RISK? (Q1) The term risk in the question is understood as the possibility that the facility will cause undesired consequences during its life.
COMMON PROCEDURE TO SOLVE THE TASK
The common approach to this task presumes that the undesired consequences are sufficiently represented by the radioactive substances and radiation released and spread in potential incident events. Undesired consequences are presumed to be measurable using a suitable radiological quantity, mostly dose or dose rate. The following risk evaluation procedure using risk measure A is based on the above presumption (see Radiological Risk Assessment in [1]):

Design alternative risk evaluation procedure:
1. Determine the set of events (generally scenarios) able to cause radioactive release in the design alternative.
2. Determine the event frequencies fi.
3. Determine the site boundary dose rates Ai caused by the events.
4. Determine the risk measure A = Σ(fi × Ai).

It may be stated that the following question is answered using the risk measure A and the above risk evaluation procedure: WHAT DESIGN ALTERNATIVE OF THE FACILITY WILL YIELD THE LOWEST RISK OF RADIOACTIVE RELEASE MEASURED BY RISK MEASURE A? (Q2)
REASONS TO MODIFY THE PROCEDURE
It is expected that the risk evaluation results will help to determine a single design alternative yielding the lowest risk. This means that results similar to those shown on the left side of Figure 1 are expected. The left side results justify the sentence: Design alternative α is the best one.
However, it is probable that a real situation will not be so simple. For instance, a definite Risk Criterion value (e.g. as defined in [1]) may be given and our results may be a few orders of magnitude below it. Or the results may look as shown on the right side of Figure 1. (Dashed lines represent lower and upper uncertainty bounds of the results.) Both cases would mean the same: design alternatives are practically equivalent as regards their risk.
Figure 1: Two possible sets of results using the risk measure A
462
M. Ferjen61"k
MODIFIED
PROCEDURE
The modified approach to the task presumes that the safety of the facility may be described using a set of safety functions. The term to be safe is understood as to fulfil each of the safe(y functions ,fully and permanently. Any disturbance of any safety function is considered to be an undesired consequence. Possible disturbances of every safety function are ranked in order to measure undesired consequence magnitudes. Undesired consequences are presumed to be measurable using a special quantity- relative consequence ranking. The following risk evaluation procedure (analogous to the common procedure) using risk measure B is based on the above presumptions. Modified risk evaluation procedure: 1. Determine the 2. Determine the 3. Determine the 4. Determine the
set of events (scenarios) able to disturb a safety function in the design alternative. event frequencies fi. relative consequence ranking Bi of the events. risk measure B = E(f~ × Bi).
The modified procedure is able to encompass a substantially broader set of possible safety problems than the original procedure since the radiation protection is only one of the facility safety functions. The procedure is able to integrate every problem evaluated by the original one. Nevertheless the result risk measure value is a relative number, its physical interpretation is difficult and probably it is not suitable for other purpose than for comparison of design alternatives. The following question is answered when the risk measure B is applied: WHAT DESIGN ALTERNATIVE OF THE FACILITY WILL YIELD THE LOWEST RISK OF ANY SAFETY PROBLEMS MEASURED BY RISK MEASURE B? (Q3) It seems to be much closer to question (Q 1) than the question (Q2). risk
--F-
E(f,×BJ I I re_i__
I I I
m
I
I I I
I u
ml--
. . . .
R
I I I I -- ur--
m
I
I
T Figure 2: Two possible sets of results using the risk measure B (shadowed parts belong to risk measure A) It is expected that the suggested modification may help provide an unambiguous answer in cases where the common procedure is not able to provide it. It is expected that the modified approach is able to help the operating company avoid the decision causing bigger problems in the future. These objectives are reached if the resultant overall picture of the alternatives is similar to any side of Figure 2. Columns are divided into two
Advances in Safety and Reliability: ESREL '97
463
parts to respect the division of the events into those causing radioactive release and those causing no release. The results on Figure 2 justify the sentence: Design alternative ot is the best one. However even if the overall risk picture is similar to the right side of the Figure 1, the situation is better now since there is no doubt that the risk measure B represents the whole thinkable range of the undesired consequences. Design alternatives may be considered to be equivalent in regards risk. Three problems have to be solved before it is possible to apply the modified procedure : 1. The set of the safety functions has to be identified. 2. Safety function disturbances have to be ranked. 3. Consequence ranking values have to be determined. A solution of the problems is described in the next three sections. Relative risk measure B defined below in detail seemsto be simple enough as not to be excessively laborious while still being able to record important differences in the design alternatives.
SAFETY FUNCTIONS OF THE FACILITY The following list of safety functions is based on the safety guide [2] where interim spent nuclear fuel storage facility design requirements are identified. The succession of safety functions is not of crucial importance. Detail explanations should be found in the referenced safety guide. Safety function 1: fuel packages subcriticality Safety function 2: fuel packages physical protection This safety function is subdivided into two subfunctions: 2a: fuel packages physical protection even if unauthorised persons gain access to them 2b: limited access (only to authorised personnel) to the fuel packages Safety function 3: radiation protection Safety function 4: fuel packages containment Safety function 5: permanent heat removal of the fuel packages Safety function 6: permanent handling possibility of the fuel packages Note: The safety functions are not disjunct in any sense. Complex relationships exist among them.
SAFETY FUNCTIONS DISTURBANCE CLASSES If the number of the safety function disturbance classes is high better distinguishability of results is enabled but it is more difficult to obtain the results. If the number of classes is low the results may be obtained easily but there is a danger that they will not differentiate risk of design alternatives. The same number of classes for each of the safety functions seems to be the best arranged solution. The mentioned above considerations resulted into a decision to divide the safety function disturbances into five classes: low, medium, high, very high, and extreme. Examples in Table 1 give some ideas how the specific distinctions between disturbance classes are defined.
CONSEQUENCE CLASSIFICATION TABLE Relative consequence ranking values are given in the Table 1.
464
M. F e r j e n 6 ~ TABLE 1 RELATIVE CONSEQUENCE RANKING
Safety Function Disturbance Ranking low
medium
high
107 / one fuel package close to criticality
108 / local criticality in one fuel package
109 / local criticality in a few packages
1010/ 1011 / global criticality in global criticality in one fuel package a few packages
SF2a" physical protection
105 / partial penetration through envelopes of one package
106 / complete penetr. through envelopes of one package
107 / complete penetr. through envelopes of a few packages
l0 s / destruction of one fuel package
SF2b: limited access to fuel packages
1/ a) one package a few hours with failed protection barriers, b) partial penetr. through barriers
SFI"
subcriticality
SF3:
radiation protection
°M
very high
extreme
109 / destruction of a few fuel packages
10 3 / 10 2 / 10 4 / 10/ a) a few fuel a) many packages a) many packages a) many packages packages a few a few hours with many hours with a few days with hours with failed failed barriers, failed barriers, failed barriers, protection barriers, b) penetration b) penetration b) penetration b) complete penetr. through barriers to through barriers to through barriers to through barriers one fuel package a few packages many packages
106 / 10 -~/ a) small gaseous a) small gaseous release from one release from a few fuel package, fuel packages, b) one fuel b) medium package a few days gaseous release without neutron from one fuel shield package, c) a few fuel packages a few days without neutron shield
108 / 109/ 107 / a) complete a) small gaseous rel a) medium gaseous from many release from many gaseous release from many fuel fuel packages, packages, packages, b) medium gaseous b) complete b) complete solid rel. from gaseous release particles release from a few fuel a few packages, from one package, c) complete gas. rel. packages, c) partial solid c) partial solid from 1 package, particles release d) many packages a particles release from a few fuel from one fuel few days without packages package neutron shield
SF4: package containment
10 3 / partial containment envelopes failure of one fuel package
10 4 / partial containment envelopes failure of a few fuel packages
105 / a) partial envelope failure of many packages, b) complete envelope failure of one package incl. small part of clads
106 / complete contaimnent envelopes failure of a few packages including small part of fuel clads inside
107 / complete containment envelopes failure of many packages including small part of fuel clads inside
SF5:
10/ one fuel package heat removal substantial deterioration for many hours
10 2 / a) one package substantial deterioration for a few days, b) a few packages deterioration for many hours
10 3 / deterioration of a) one package for many days, b) a few packages for a few days, c) many packages for many hours
10 4 / a) a few packages deterioration for many days, b) many fuel packages deterioration for a few days
105 / many fuel packages heat removal substantial deterioration for many days
1/ a) one package a few days in unusual position, b) one fuel package resealed
10/ a) a few packages a few days in unusual position, b) a few fuel packages resealed
102 / standard handling impossible with one fuel package for many days
103 / standard handling impossible with a few fuel packages for many days
10 4 / standard handling impossible with many fuel packages for many days
,,,
heat removal
SF6: permanent handling possibility
Relative Consequence Ranking / Example Events
The determination of consequence ranking values is the most difficult problem which has to be solved before the risk measure B can be evaluated. Assigned values define the relationships in the set of safety function classes in both vertical (functions) and horizontal (disturbances) directions. They have to represent a mixture of objective relations, common sense and experience. The following comments explain the most important rules applied to determine the ranking values:
1. Small (part, release) means approximately 1%, medium (part, release) means approximately 10%, and complete means 100% or close to it. A complete release is 10× worse than a medium release, and so on.
2. A few means between 1 and 10, typically 3; many means 10 or more.
3. A few damaged packages are 10× worse than 1 package. Many damaged packages are 10× worse than a few packages.
4. A disturbance duration of many hours is 10× worse than a few hours, a few days is 10× worse than many hours, and many days is 10× worse than a few days.
5. Destruction of a few fuel packages is equivalent to the partial solid particles release from a few fuel packages.
6. Complete penetration through the envelopes of a few packages is equivalent to a medium gaseous release from a few packages.
7. Complete containment envelopes failure of many packages including a small part of the fuel clads inside is equivalent to a small gaseous release from many packages.
8. Many fuel packages heat removal substantial deterioration for many days is supposed to be equivalent to the partial envelope failure of many packages.
RISK MEASURE B APPLICATION
The main purpose of Table 1 is to enable evaluation of the consequences of any identified event (scenario). Often more than one safety function is disturbed; in this case the highest consequence ranking value assigned to the separate disturbances is used as the event consequence ranking value. Example 1: The fall of a fuel package during a transfer without tightness disturbance, i.e. without any radioactive leakage. The consequence ranking value depends on three points: Has any envelope of the package containment failed? How much time was the package in an unusual position? How was it protected during this time? A value between 1 and 1000 is possible. Example 2: Protection barrier (fence, gate) damaged by extreme wind or snowstorm. The consequence ranking value depends on two points: How many packages are inside the barrier? How long has the barrier been damaged? A value between 1 and 1000 is possible. Example 3: Fire in the main transformer room, no electricity for handling and protection equipment. The ranking value depends on the duration of the power outage and on the number of packages in the facility. Example 4: Linear combinations of ranking values may be used. If it is, for instance, known that the fall of the package causes a small gaseous release with probability 0.1 and partial containment failure with probability 0.9, the consequence ranking value may be computed as 0.1 × 10^5 + 0.9 × 10^3 = 1.09 × 10^4.
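These rules translate directly into a small sketch: the ranking of an event is the maximum over the safety functions it disturbs, and uncertain outcomes are combined linearly as in Example 4. The function names and data structures are illustrative, not from the paper.

def event_ranking(disturbances):
    # disturbances: {safety function: ranking value from Table 1};
    # the highest value among the separate disturbances is used.
    return max(disturbances.values())

def expected_ranking(outcomes):
    # outcomes: list of (probability, ranking) for mutually exclusive
    # outcomes of one event, combined linearly as in Example 4.
    return sum(p * b for p, b in outcomes)

# Example 4: small gaseous release (SF3 low, 1e5) with p = 0.1 and
# partial containment failure (SF4 low, 1e3) with p = 0.9:
b = expected_ranking([(0.1, 1e5), (0.9, 1e3)])   # = 1.09e4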
EVALUATION OF ADVANCED CONTAINMENT FEATURES PROPOSED TO KOREAN STANDARD NUCLEAR POWER PLANT
Y. Jin, S. Y. Park, S. D. Kim, D. H. Kim
Integrated Safety Assessment, Korea Atomic Energy Research Institute, P.O. Box 105, Yusong, Taejon, KOREA
ABSTRACT
The Korean Standard Nuclear Power Plant (KSNP) has adopted many advanced design features to enhance its containment performance during severe accidents as well as to reduce the chances of core damage. The robust design of the containment and its hydrogen mixing capability reduce the containment failure probability significantly. In addition to these features, new systems are proposed for the KSNP: an advanced design of the cavity geometry, a reactor cavity flooding system, hydrogen igniters, and a containment filtered venting system. Before these proposals are adopted in the KSNP, their effectiveness on containment performance has been assessed systematically. Containment event trees and sensitivity analyses are used to quantify the effectiveness of these design features. The overall results indicate that these new features do not improve the containment performance significantly, except for the containment filtered venting system. But the adoption of the containment filtered venting system should be examined carefully because an accidental failure of this system may result in undue risk to the public.
KEYWORDS
KSNP, Containment Performance, Level 2 PSA, CONPAS, Capture Volume in Cavity, Reactor Cavity Flooding System, Hydrogen Igniter, Containment Filtered Venting System.
INTRODUCTION

Korea achieved high economic growth in the 1980s. To support this economic growth, a long-term electricity supply plan was set up and is revised every five years. According to this plan, nuclear power occupies one third of the total electricity capacity and more than 40% of the total electricity generation up to the year 2020. To accommodate this long-term plan, nuclear power plants (NPPs) have to be built continuously. Standardization of nuclear power plants was initiated in 1983 to build future NPPs economically and stably. A 1000 MWe PWR with a large dry containment was selected as the Korean standard NPP. The KSNP has adopted many advanced features to reduce the possibility of severe accidents and the consequences of those accidents compared with conventional PWRs. The advanced features which contribute to reducing the core damage possibility are a safety depressurization system, an alternate AC source, and an advanced design of the auxiliary feedwater system.
In addition to these systems, advanced containment features are considered to mitigate severe accidents: a corium capture volume in the reactor cavity to hold the corium in the cavity during the high pressure melt ejection process, a reactor cavity flooding system (RFS) for debris cooling in the reactor cavity, hydrogen igniters to control the hydrogen concentration in the containment, and a containment filtered venting system (CFVS) to prevent containment failure from overpressurization. The effectiveness of the advanced design features on containment performance was evaluated quantitatively through sensitivity studies.
IMPORTANT PLANT FEATURES
The KSNP is a 1000 MWe PWR with a large dry containment. Its rated power is 2815 MWt. It has two steam generators and four reactor coolant pumps, like Combustion Engineering's System 80. To comply with the station blackout rule of the USNRC, an alternate diesel generator is equipped to supply AC power in the event of loss of the emergency diesel generators. The safety depressurization system (SDS) was designed to prevent core damage even in the event of total loss of feedwater, by feed and bleed operation with the high pressure injection system. This SDS can make the RCS pressure low enough to prevent high pressure melt ejection at reactor vessel rupture. This means that DCH, which was hypothesized in the Zion PSA as a high contributor to early containment failure, is no longer a significant threat to the KSNP. In addition to these systems, an advanced design of the auxiliary feedwater system was adopted to reduce the probability of core damage, thanks to its high availability. The KSNP adopted a large dry containment with a robust design. The containment building data for the KSNP (KEPCO, 1996a) and the Zion and Surry plants (USNRC, 1990) are compared in Table 1.
TABLE "1 CONTAINMENT BUILDING DATA Thermal Power
Containment Volume
Volume/Power
Design Pressure
Failure Pressure
(MWt)
(f13)
(ft3/MWt)
(psig)
(psig)
KSNP
2815
2.8 x 106
995
54
178
Surry
2441
1.8 x 106
737
45
141
Zion
3236
2.7 x 106
834
47
149
Plant
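The Volume/Power column is simply the ratio of the two preceding columns; a quick arithmetic check in Python:

# Quick arithmetic check of the Volume/Power column in Table 1.
plants = {"KSNP": (2815, 2.8e6), "Surry": (2441, 1.8e6), "Zion": (3236, 2.7e6)}
for name, (power_mwt, volume_ft3) in plants.items():
    print(name, round(volume_ft3 / power_mwt))  # KSNP 995, Surry 737, Zion 834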
As shown, the KSNP has a larger containment free volume to power ratio compared to the Zion and Surry plants. Also, the design pressure (54 psig) and the failure pressure (178 psig) are much higher than the others. This robust containment design can reduce the containment failure probability significantly. Furthermore, many advanced design features proposed by the EPRI ALWR Requirement Document (EPRI, 1987) have already been adopted or are under consideration to enhance the containment performance in the event of severe core damage accidents. The corium capture volume in the reactor cavity is one of these features and has already been adopted. Figure 1 shows a schematic of the cavity geometry with the capture volume. The capability of the capture volume for the holdup of corium in the cavity during the high pressure melt ejection process was verified by various experiments (S. B. Kim et al., 1994). The effectiveness of this volume depends on its capacity. The capacity of the capture volume for the KSNP is equal to the total volume of corium in the reactor. There is a path between the containment floor and the reactor cavity, which allows water accumulation in the reactor cavity and eventually submergence of the lower part of the reactor vessel in the water if RWT water is introduced into the containment building.
Figure 1: A schematic of cavity geometry with capture volume.

The reactor cavity flooding system is proposed to supply water to the reactor cavity to cool the debris in the reactor cavity when RWT water is not injected into the containment building. The water source is located outside the containment building. This water can be introduced into the reactor cavity directly through a dedicated system which consists of two trains of pumps and valves. This system can be operated even in the event of station blackout. Hydrogen igniters are considered for installation all over the containment to control the hydrogen concentration in the containment. At present the type of igniters has not been selected; a catalytic igniter is one of the choices. After selection of the igniter type, the locations and number of igniters will be determined. The containment filtered venting system is under consideration for the control of overpressurization. The functions of these systems and their effects on severe accidents are summarized in Table 2.
TABLE 2 ADVANCED DESIGN FEATURES IN KSNP AND THEIR FUNCTIONS

Plant Unique Feature: Corium capture volume in the reactor cavity
  Function: Holdup of corium in cavity
  Effect on Severe Accident: Reduction of DCH effect and early containment failure probability

Plant Unique Feature: Reactor cavity flooding system
  Function: To supply water into reactor cavity
  Effect on Severe Accident: Increase of debris coolability

Plant Unique Feature: Hydrogen igniters
  Function: To keep hydrogen concentration low
  Effect on Severe Accident: Reduction of possibility of deflagration-to-detonation transition and detonation

Plant Unique Feature: Containment filtered venting system
  Function: To prevent containment overpressurization
  Effect on Severe Accident: Reduction of containment failure from slow overpressurization
ANALYSIS METHOD AND RESULTS

The methodology used in the containment performance analysis is consistent with the "PRA Procedures Guide" (USNRC, 1983) and "Individual Plant Examination: Submittal Guidance" (USNRC, 1989).

The first step in the analysis is to define the plant damage states (PDSs), which describe the plant states at the onset of core damage. The core-melt sequence is extended to illustrate the status of the containment safeguards and the condition of the cavity. Forty-five PDSs are defined to group the core damage sequences of the level 1 PSA based on the similarity of their accident progressions. Each PDS then has its own source term characteristics regarding the fission product transport and deposition within the containment.

The second step in the analysis is to develop the containment event tree. Containment event trees (CETs) are used to model the containment responses by depicting the various phenomenological processes, containment conditions, and containment failure modes. A CET predicts the accident sequence progression from core melt to radionuclide release to the environment. The CET is constructed in sufficient detail to address the important phenomena that significantly affect the containment integrity and the radiological source term. Ninety-five end points are developed from the general CET in this analysis. The detailed phenomena or operator actions for top events in the CET are treated in decomposition event trees (DETs). The ultimate strength of the containment for static loads inside the containment is evaluated in this step. The distribution of failure pressure is used to estimate the likelihood of containment failure for a given accident sequence.

The final step is to quantify the CETs, where each PDS is propagated through the CET. At each branch point on the CET, a branch probability is determined for the PDSs. A DET is a decomposition of a CET event into a more detailed set of events or factors that cause, or contribute to, the occurrence of the CET event; DETs are used to aid in quantifying the CET branch probabilities. The branch probabilities in the DETs are obtained either from the data associated with the PDS or from expert judgement. Extensive calculations using the Modular Accident Analysis Program (MAAP), the ABAQUS code, previous PRA reports, and other reports on severe accident phenomena are used to derive the expert judgements on the DET branch probabilities.

The containment performance was evaluated through these three steps without consideration of the advanced containment features described in the previous section. The effects of the new design features are propagated and analyzed through sensitivity analysis. Sensitivity analysis can investigate the effect of changes in input variables on output predictions. It contributes to identifying the aspects of the containment modelling to which the overall results are sensitive, and possible weaknesses in the analysis or areas which may require further effort or support. Through this analysis, the effects of the new design features are analyzed. All this work was done using the CONPAS code (KAERI, 1996). The sensitivity analysis can be represented by the following equation:

ΔPm = f(ΔPevent)

where ΔPm is the change in the conditional probability of a containment event tree (CET), ΔPevent is the change in the value of a basic event probability in a decomposition event tree (DET), and f is the functional relationship between the two, defined by the overall CET analysis model.
The approach is to vary the probability of an event in the related DETs (KEPCO, 1996b). The sensitivity calculation involves setting one branch probability to one, with all other branch probabilities of that top event set to zero, and recalculating the CETs. The CET requantification is repeated for each branch. The effectiveness of the four advanced features was analyzed and the results are summarized in Table 3.
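A minimal sketch of this requantification scheme with a toy two-top-event CET (the structure and all numbers are illustrative assumptions, not the KSNP model):

# Toy containment event tree: one branch probability is forced to 1 (its
# alternatives to 0) and the CET end-state frequencies are recomputed.
branch_prob = {
    ("DCH", "yes"): 0.05, ("DCH", "no"): 0.95,
    ("H2BURN", "yes"): 0.10, ("H2BURN", "no"): 0.90,
}
sequences = [
    ([("DCH", "yes")], "EARLY CF"),
    ([("DCH", "no"), ("H2BURN", "yes")], "LATE CF"),
    ([("DCH", "no"), ("H2BURN", "no")], "NO CF"),
]

def quantify(pds_freq, forced=None):
    """Propagate a plant damage state frequency through the CET; `forced`
    maps a top event to the branch whose probability is set to one."""
    forced = forced or {}
    out = {}
    for path, end_state in sequences:
        p = pds_freq
        for event, branch in path:
            if event in forced:
                p *= 1.0 if forced[event] == branch else 0.0
            else:
                p *= branch_prob[(event, branch)]
        out[end_state] = out.get(end_state, 0.0) + p
    return out

print(quantify(1.0e-5))                         # base case
print(quantify(1.0e-5, forced={"DCH": "no"}))   # e.g. full credit for the SDS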
TABLE 3 RESULTS OF SENSITIVITY STUDY FOR ADVANCED DESIGN FEATURES

Case                      NO CF           BYPASS          LATE CF         BMT             ECF
Base Case                 1.988 × 10⁻⁵    1.436 × 10⁻⁶    4.966 × 10⁻⁶    1.221 × 10⁻⁶    2.888 × 10⁻⁷
                          (71.5%)         (5.2%)          (17.9%)         (4.4%)          (1.0%)
Case 1 (with CV)          1.987 × 10⁻⁵    1.436 × 10⁻⁶    4.967 × 10⁻⁶    1.227 × 10⁻⁶    2.855 × 10⁻⁷
                          (71.5%)         (5.2%)          (17.9%)         (4.4%)          (1.0%)
Case 2 (with RFS)         1.994 × 10⁻⁵    1.436 × 10⁻⁶    4.961 × 10⁻⁶    1.169 × 10⁻⁶    2.888 × 10⁻⁷
                          (71.7%)         (5.2%)          (17.8%)         (4.2%)          (1.0%)
Case 3 (with Igniters)    1.995 × 10⁻⁵    1.436 × 10⁻⁶    4.851 × 10⁻⁶    1.265 × 10⁻⁶    2.888 × 10⁻⁷
                          (71.8%)         (5.2%)          (17.5%)         (4.6%)          (1.0%)
Case 4 (with CFVS)        2.427 × 10⁻⁵    1.436 × 10⁻⁶    0               1.626 × 10⁻⁶    2.888 × 10⁻⁷
                          (87.3%)         (5.2%)          (0.0%)          (5.9%)          (1.0%)

Notes: Base Case: without any advanced design features; Case 1: with capture volume in the reactor cavity only; Case 2: with reactor cavity flooding system only; Case 3: with hydrogen igniters only; Case 4: with containment filtered venting system only.
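The percentages in Table 3 are each mode's share of the row total; checking the Base Case row:

# Consistency check of the Base Case row: each percentage is the mode's
# frequency divided by the row total.
base = {"NO CF": 1.988e-5, "BYPASS": 1.436e-6, "LATE CF": 4.966e-6,
        "BMT": 1.221e-6, "ECF": 2.888e-7}
total = sum(base.values())
print({k: round(100 * v / total, 1) for k, v in base.items()})
# {'NO CF': 71.5, 'BYPASS': 5.2, 'LATE CF': 17.9, 'BMT': 4.4, 'ECF': 1.0}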
Case 1 shows that the corium capture volume hardly affected the containment failure probability. The reasons for this result are as follows. The high pressure core melt sequences (P > 2000 psia) occupy 83% of the total core damage frequency. Calculations (S. H. Hong et al., 1997) showed that the corium capture volume is not effective in holding corium in the cavity at this pressure: 78.5% and 77.8% of the total corium is ejected out of the cavity without and with the corium capture volume, respectively. For medium RCS pressures, the capture volume holds slightly more corium (about 5-10% of the total corium generated). According to the MAAP runs, the containment peak pressure corresponding to 100% corium ejection out of the cavity is 150 psia, which is far below the mean failure pressure of the containment building (193 psia). Even assuming corium holdup in the cavity of up to 40% due to the capture volume, the resulting peak pressure reduction of about 30 psi does not affect the containment failure probability much in this pressure range. The accumulation of slightly more corium in the capture volume barely increased the possibility of the formation of a noncoolable corium geometry, i.e., the BMT (basemat melt-through) probability.

The effect of the reactor cavity flooding system is evaluated in Case 2. The result indicates that the BMT probability was reduced by 0.2% by the operation of the reactor cavity flooding system (RFS). This is due to the unique KSNP design, which allows water to flow into the reactor cavity through a pathway from the containment floor when refueling water tank (RWT) water is injected into the containment. This means that the reactor cavity will be filled with water if RWT water is injected into the containment, regardless of the operation of the RFS. As most of the accident sequences require RWT water in the containment through safety injection or containment sprays, and the failure probability of these systems on demand is very low, the RFS is not effective in these sequences. Also, the installation of an alternate AC source, which significantly reduces the possibility of station blackout sequences, makes the RFS less effective in improving containment performance.
Case 3 analyzes the effect of hydrogen igniters. The result shows that hydrogen igniters do not improve the containment performance much. The large containment free volume to power ratio and the existence of flow paths between compartments keep the hydrogen concentration low by mixing the gas all over the containment and preclude local accumulation of hydrogen. This reduces the possibility of local detonation. If all the zircaloy cladding were oxidized, the hydrogen concentration could reach up to 12.5% in the KSNP. The amount of hydrogen generated by zircaloy oxidation depends on the accident sequence. According to the MAAP4 calculations, a maximum of 70% of the zircaloy could be oxidized before reactor vessel failure. The hydrogen concentration then reaches 8-9% if uniformly distributed. This amount of hydrogen cannot threaten the containment integrity, so the operation of hydrogen igniters did not affect the probability of early containment failure. Hydrogen can also be generated during the core-concrete interaction (CCI) process; during this process carbon monoxide, which is also combustible, is generated as well. But CCI is a slow process and the hydrogen (including carbon monoxide) concentration increases very slowly. The hydrogen igniters could limit the accumulation of hydrogen during the late phase of an accident and reduce the probability of late containment failure slightly.

Continuous generation of noncondensible gases causes containment failure by overpressurization. The effect of the containment filtered venting system is analyzed and the result is summarized in Case 4. The capacity of the CFVS is sufficient to prevent slow pressurization. As shown in Table 3, the operation of this system reduces the probability of late containment failure significantly. The containment filtered venting system could be the best of the four systems proposed for the KSNP in improving containment performance, but the adoption of this system should be examined carefully because an accidental failure of this system may result in undue risk to the public.
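As a rough check of the hydrogen figures above, scaling the bounding concentration linearly with the oxidized fraction gives 0.70 × 12.5% ≈ 8.8%, consistent with the 8-9% quoted (the exact value depends on the containment gas inventory).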
CONCLUSIONS
The effectiveness of the four new design features proposed for the KSNP on containment performance was evaluated quantitatively by containment event tree and sensitivity studies. The overall results indicate that these new features do not improve the containment performance significantly, except for the containment filtered venting system. The capture volume, which can hold corium in the cavity and reduces the amount of corium ejected out of the cavity during the high pressure melt ejection process, reduces the early containment failure probability slightly, but adversely increases the possibility of basemat melt-through. Due to the safety depressurization system and the robust design of the containment, the reduction of the early containment failure probability was negligible. Flow paths between the containment floor and the cavity reduce the effect of the reactor cavity flooding system. The large containment free volume and the flow paths between compartments, which facilitate natural circulation of hydrogen all over the containment, reduce the effect of the hydrogen igniters. The containment filtered venting system reduces the probability of late containment failure significantly, but its adoption should be examined carefully because an accidental failure of this system may result in undue risk to the public.
REFERENCES
EPRI. (1987). Advanced LWR Requirement Document.
KAERI. (1996). CONPAS (Containment Performance Analysis System) 1.0 User's Manual. KAERI/TR-651/96.
KEPCO. (1996a). ULCHIN UNITS 3&4 Final Safety Analysis Report.
KEPCO. (1996b). ULCHIN UNITS 3&4 Final Probabilistic Safety Assessment Report (Rev. 0).
S. B. Kim, et al. (1994). Improvement of Reactor Cavity Design for Mitigation of Direct Containment Heating. The Workshop on Severe Accident Research in Japan, SARJ-94.
S. H. Hong, et al. (1997). A study on the debris dispersal fraction from the cavity with the capture volume during HPME. The Second International Conference on Advanced Reactor Safety, ARS'97.
USNRC. (1983). PRA Procedures Guide. NUREG/CR-2300.
USNRC. (1989). Individual Plant Examination: Submittal Guidance. NUREG-1335.
USNRC. (1990). Severe Accident Risks: An Assessment for Five U.S. Nuclear Power Plants. NUREG-1150.
THE BENEFITS OF SYMPTOM BASED PROCEDURES IN A PSA (AND VICE VERSA)
A. J. P. Verweij (1) and H. W. de Wit (2)
(1) Gemeenschappelijke Kernenergiecentrale Nederland, Waalbandijk 112a, 6669 MG, Dodewaard, Netherlands
(2) KEMA, Utrechtseweg 310, 6800 ET, Arnhem, Netherlands
ABSTRACT

Over the last two decades the nuclear industry has put a lot of effort into the generation of symptom based procedures in addition to the "old" event based procedures. These symptom based procedures have proven themselves during many simulator sessions in which difficult scenarios were handled. When a PSA was carried out at GKN, large similarities were found between the procedure training scenarios and the PSA scenarios. It turned out that the symptom based procedures were an additional tool in the development of the event trees. Nevertheless, the event based procedures are still important in the final quantification of the PSA. Conversely, the PSA results are used to improve the training of the operators and to cross-check the completeness of the procedures.
KEYWORDS
Symptom based procedures, event based procedures, PSA

HISTORIC OVERVIEW

Before the incident at the Three Mile Island (TMI) nuclear power plant (NPP) on March 28, 1979, the nuclear industry used event based procedures in order to stabilize plant conditions after an initiating event. The TMI incident showed that such event based procedures were not providing the required response in all situations. This resulted in the requirement for the nuclear industry to develop a new set of procedures which had to be able to handle all types of incidents in an appropriate way. For the Boiling Water Reactors (BWR), the BWR Owners Group (BWROG) developed a set of symptom-based Emergency Procedure Guidelines (EPG) from which procedures are to be derived. These guidelines evolved from a limited set of guidelines for the reactor in 1990 to a complete set of guidelines for the complete plant (revision 4), and even a newer release in 1996 incorporating accident management strategies. At Gemeenschappelijke Kernenergiecentrale Nederland (GKN) the event based procedures are still operable. However, a set of symptom based procedures derived from the BWROG guidelines revision 4 was implemented in 1991. The development of the plant specific symptom based procedures started in 1987 and finished in 1989. Final implementation followed two years later, after intensive operator training and discussions with the authorities to gain approval.

Since the early seventies, Probabilistic Safety Analysis (PSA) developed from a simple estimate to a full-bodied assessment of safety. The first major study was released by the Nuclear Regulatory Commission (NRC) in October 1975, also known as WASH-1400. Since this initial study, new requirements followed regarding safety assessments, finally resulting in consequence analyses incorporating all types of initiators during all operational modes.
The first PSA study at GKN was carried out in 1975-76 by KEMA and was a limited analysis based upon the WASH-1400 analysis. Since 1990 new requirements have been issued by the Dutch nuclear regulatory body. Starting in 1990, this resulted by 1992 in a level 1 PSA (system failure modes resulting in core damage) for internal events, a screening of external events, a detailed internal flooding analysis, and a limited level 2 PSA for internal events. The next step was to extend this PSA to all operational modes (except external events in shutdown mode) and events, and the performance of full-scope level 2 and level 3 analyses. This was finished at the beginning of 1994. At the end of 1993 the PSA (levels 1, 2 and 3) was started for the upgraded plant; the upgrade was to be performed in 1996-1997, with the planned modifications as given in the safety concept report. These results were used for the comparison (as-built versus upgraded plant) in the Environmental Impact Report which was submitted for the license. In December 1995 a project started to perform a complete consequence analysis incorporating all operational modes and all types of initiators, with as input the details of the upgraded plant as given by the detail engineering of the upgrade project. By October 1996 this project had delivered the complete level 1 analyses and had started the level 2 analysis to determine the release rates and frequencies. At that time the shareholders made the decision to close the plant at the end of the ongoing fuel cycle, which was in March this year. Due to this decision the PSA project was terminated.

THE PROCEDURES
Event based procedures

In order for an operator to use the correct emergency procedure upon a given initiator, he has to make a correct diagnosis. In many cases it is not an easy task to make such a diagnosis, not even for an experienced operator. Different scenarios may look similar at first sight and a wrong diagnosis may be made. Once this happens, the consequences might become severe. The number of event based procedures needed to cover all the different sequences is large. This is caused by the many possible combinations of an initiator with numerous failures or unavailabilities of components (a disaster is always due to a combination of small failures and malfunctions).
Symptom based procedures

The symptom based procedures at GKN are derived from the BWROG EPG revision 4 written for the General Electric (GE) BWRs. The BWROG emergency guidelines revision 4 were developed with the requirement that they should be able to cope with mechanistically possible scenarios. The premise of these guidelines is to control key parameters, so-called safety functions, and prevent them from exceeding prescribed limits. Upon exceedance of the limits of key parameters, actions are initiated. The actions are not based upon the initiators or the available systems, but are the actions needed to keep a key parameter within the required limits. The symptom based procedures do not provide the details: e.g. the procedures instruct the operator to start an emergency generator, but do not provide the information on how to do this. The details of the actions are given in system support procedures; also, a lot of knowledge, like the handling of the equipment, is part of the standard operator training.

Event based versus symptom based
Both types of procedures have their advantages and disadvantages. The advantage of the event based procedures is that all information needed to stabilize the plant is optimized to the sequence of the event. However, this requires a complete diagnosis and understanding of the status of the plant; otherwise unnecessary or wrong steps are taken. It also means that, in order to use the valid procedure, the procedure has to be available. This implies that the scenario has been identified at an earlier stage and a procedure has been generated or modified specifically for the identified scenario. The advantage of the symptom based procedures is that an operator does not need to know what the initiator was, nor what the current status of components and systems is. The procedures direct the operator to use systems in response to the exceedance of limits of key parameters. All systems (safety, normal operation or support) which are able to regain control of the plant, which means control of the safety functions, are taken into account. The operators are trained to use the most effective system which is available. The second advantage is that there is only one set of procedures capable of dealing with all types of initiators evolving into all types of different scenarios.
A disadvantage of the implementation of the symptom based procedures is the format. Because of the high level architecture of the symptom based procedures, a textual format is not workable: it is possible that multiple paths are valid at the same time. The workability was improved by putting the procedures in a graphical format, a flowchart. However, this methodology differs from the event based procedures, which are typically presented in a textual format. This difference resulted in additional training effort in order to get operators to the required level of skill, specifically because they had never been trained in graphical procedures.

PROBABILISTIC SAFETY ASSESSMENT

Typically the PSAs for NPPs are divided into 3 levels. The level 1 analysis results in the so-called core damage frequency (CDF), the level 2 analysis results in the release rates with the corresponding frequencies, and the level 3 analysis results in the consequences for the environment. For this paper we have limited ourselves to the level 1 analysis, given the fact that most operator actions are incorporated to keep the CDF as low as possible. The logic model created to analyze an NPP incorporates initiators, component failures and human failures. However, the modelling and quantification of the human aspect in a probabilistic model is a difficult task, maybe even the most difficult one: failure rates of "technical" issues are discussed but are still rather straightforward, even for infrequently occurring events like earthquakes in geologically stable zones. Large discussions, however, are going on between experts regarding the modelling and quantification of human actions; the different insights and opinions keep generating these discussions.
PROCEDURES SUPPORTING THE DEVELOPMENT OF A PSA

For the initial PSA for GKN, Functional Event Sequence Diagrams (FESDs) were created in order to describe the sequences. An FESD is a graphical structure that traces an accident sequence from its initiation to some pre-assigned end state. The graphical format consists of nodes that represent actions, symptoms, etc., and lines that represent progression in time (e.g. rightward) or change of state (e.g. downward). The major uses of the FESDs are to (1) support the development of the PSA event trees and (2) support the recovery analysis work. FESDs provide a framework for the systematic examination of operator actions relevant to the PSA. During the initial development of the FESDs it turned out that the FESDs were basically representing the path through the symptom based procedures towards the end state. This basically meant that the FESDs were no longer providing additional information; the information was already available and integrated into the training scenarios. Eventually this led to a stop in the updating of the FESDs.

Once it was determined that the paths through the symptom based procedures were comparable with the event trees, a check of the developed event trees was an easy task. However, it is a task to be carried out carefully, although the basis of the procedures should cover all mechanistically possible scenarios. The uniqueness of the GKN reactor, due to the natural circulation of the water-steam mixture through the core, results in timeframes that differ from those of the other GE BWRs for which the guidelines were written. This implies that, although procedures are generated with care, a cross check with initiators and possible scenarios has to be made to ensure completeness.

The support procedures for the symptom based procedures are event based. These procedures are nevertheless needed for the quantification of the human actions. The symptom based procedures point at the actions, but the event based procedures provide the information on how operator actions are carried out specifically. The number of operator actions for one task, in combination with factors like "available time", "training" and "location", is an input for the quantification of the task. A typical example is given below of an operator action to align alternate make-up to the isolation condenser. Due to the symptom based procedures, the event based procedures are simplified. The event based procedures are used whenever the symptom based procedures direct this. As such no diagnostic effort is required in carrying out these event based procedures, and in terms of quantifying the human actions that is where a large part of the risk lies.
Example operator action as used as a recovery action for the PSA
RECOVERY ANALYSIS WORKSHEET

Non-recovery event name: NHLL_ICMU
Gate:
Discussion of Recovery Options: Provide alternative makeup (using BBS, portable fire pump at the site or from the fire brigade of the town Dodewaard) to the secondary side of the IC after failure of the DHS normal supply.
Timeline / Sequences: T/B/I/C/S, T/HI/I/L1, T/HI/I/L2, T/H1/I/C/S, T/H1/I/A, T/H1/I/R, F, ZGCUS
Quantification Method: Procedures: M/I NCS 18-02; HRA: TRC, MF; Supplementary: Data, Model/Fault Tree, Dependent Event
Location: ex-control room (outside Reactor building)
Available Time: 24 hr
Human Engineering: Outside Reactor Building, hook-up hose (IC level indication locally at the outside wall of RG)
Training: Normal training; simulator training; procedure training (aux. operators)
Total Factor Index: 5
Quantitative Analysis: P = 3.2E-03
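The paper does not state how the factor indices combine into P = 3.2E-03. Purely as an illustration of a factor-index scheme, here is a sketch in which a nominal error probability is scaled by performance-shaping multipliers (all names and values are hypothetical, not the GKN/KEMA method):

# Hypothetical factor-index scheme, for illustration only.
BASE_HEP = 1.0e-3  # assumed nominal human error probability

MULTIPLIERS = {  # assumed performance-shaping multipliers
    "location: outside control room": 2.0,
    "available time: ample (24 hr)": 0.8,
    "training: normal + simulator + procedure": 2.0,
}

def recovery_probability(base=BASE_HEP, factors=MULTIPLIERS):
    """Non-recovery probability as a base value scaled by factor multipliers."""
    p = base
    for m in factors.values():
        p *= m
    return p

print(f"P = {recovery_probability():.1e}")  # P = 3.2e-03 with these values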
A PSA SUPPORTING PROCEDURES

A PSA can be used in various ways to improve the overall safety and availability of a plant. In this paper the possible impact of the PSA results on the procedures is highlighted. After the first level 1 quantification of the PSA, a number of failure modes were identified which are easily taken care of. However, the corresponding actions, although often trivial, were not proceduralized. As such it was at first not possible to take credit for these actions during the quantification of recovery actions. The identified actions were then incorporated in the plant procedures and training of the operators was performed. The check of the PSA event trees using the symptom based procedures can also be turned around: using the event trees and the results of the analysis it is possible to check whether all failure modes are covered by the procedures. Identified failure modes not covered by the procedures resulted in changes which were incorporated and trained. Operator training was also improved with the results of the PSA, see Verweij A.J.P. and De Wit H.W. (1993). The completeness of the training was checked by comparing the training scenarios against the PSA event trees and results. The training in specific high vulnerability areas was intensified and the number of plausible training scenarios was increased, mainly on the simulator, based on the quantification results.

CONCLUDING STATEMENTS

Although no explicit numbers are presented in this paper, we believe that the qualitative argumentation given in the paper supports the following statements and conclusions:
- Symptom based procedures, by taking away the diagnostic effort, improve the safety profile.
- Symptom based procedures provide a check for the developed event trees.
- Event based procedures are still needed for certain tasks.
- PSAs provide a check for the completeness and correctness of the procedures.
- PSAs provide material to improve the training of operators.
RELATED PAPER

Verweij A.J.P. and De Wit H.W. (1993). Optimalization of Operator Training with PSA Methods and Results. PSA'93, January 26-29, 1993, Clearwater, Florida.
B3" Living QRA
ON-LINE MAINTENANCE SCHEDULING AND RISK MANAGEMENT - THE EOOS MONITOR® APPROACH
Zdenko Šimić (1,2), Steve Follen (1), Vladimir Mikuličić (2)
(1) Yankee Atomic Electric Company, 580 Main St., Bolton, MA 01740-1398, U.S.A.
simic@yankee.com, [email protected]
(2) Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Republic of Croatia
zdenko.simic@fer.hr, vladimir.mikulicic@fer.hr
ABSTRACT

It is known that Probabilistic Safety Assessment (PSA) can provide safety status information for a plant in different configurations; additional effort is needed, however, to do this in real time for on-line operation. This paper describes an approach to the use of PSA to achieve these goals. The Equipment Out Of Service (EOOS) tool was developed to monitor nuclear power plant safety from the probabilistic point of view. EOOS uses a model developed from a full-scope PSA and incorporates a user-friendly program interface. Results from EOOS can be used by planners or operators to effectively manage the level of risk by controlling the actual plant configuration.
KEYWORDS
Probabilistic Safety Assessment, on-line risk monitoring, maintenance
EOOS MONITOR

This paper presents the approach used for the development of the on-line risk model for the Seabrook NPP inside the Equipment Out of Service (EOOS) Monitor software tool, and discusses the obtained results and special EOOS Monitor capabilities. The EOOS Monitor is a PC based tool providing accurate and fast calculations of risk, in order to monitor nuclear power plant safety from the probabilistic point of view. The EOOS Monitor is designed for three types of users with separate, distinct needs. Scheduler: a user concerned with scheduling equipment outages;
Operator: a user concerned with the current plant status; and Probabilistic Safety Assessment (PSA) Analyst: a risk assessment analyst whose mission is to support the scheduler and the operator. The EOOS software was developed by Science Applications International Corporation (SAIC) to use and integrate available modules from EPRI's Risk and Reliability Workstation. PSA results can be included through a minimal cut set representation, the complete model, or a combination of these two approaches. An advantage of EOOS is the easy transformation from an initial PSA model to an EOOS environment. EOOS can support different approaches in PSA: small event trees with big fault trees, or big/moderate event trees with small fault trees (support state approach). This means that all fault tree formats are supported (e.g., CAFTA, Riskman, Grafter, etc.). For the big fault tree approach, EOOS is designed to easily build the initial model and to do the recalculation directly. EOOS can also receive and include operational and scheduling information (i.e., events and activities). Table 1 lists the EOOS capabilities.

TABLE 1 EOOS CAPABILITIES

- List current active items;
- Calculate plant safety measure;
- Risk profile tracing for important components and events;
- Different calculation options: pregenerated cutsets, regenerated cutsets, and a hybrid method;
- Direct connection with original fault trees and the complete PSA model;
- Cut set browsing;
- Connection of P&ID and plant model;
- Schedule importing and editing functions;
- System alignment and operating status for full plant configuration description;
- Easy extension over the original PSA model (environmental effects, etc.);
- Importance ranking;
- User accounts and access rights;
- Colour thresholds for risk measures;
- Out of service reasons;
- Allowable outage time calculation;
- Expandability to other operating modes;
- Status panel custom defined for each operating mode;
- Calculation of the scheduled risk profile with and without the impact of current equipment OOS;
- Rich set of model building tools: status panel editor, hotspot editor, database editor, "wizards" for different steps of EOOS model building;
- Large set of parameters controllable by the user.
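As a sketch of the "pregenerated cutsets" calculation option and the colour thresholds listed above (the cut sets, probabilities and thresholds are illustrative assumptions, not Seabrook data):

import math

# Minimal cut sets and basic event probabilities (illustrative only).
cutsets = [["DG-A", "DG-B"], ["DG-A", "SW-B"], ["CCW-PUMP", "SW-B"]]
base_prob = {"DG-A": 1e-2, "DG-B": 1e-2, "SW-B": 5e-3, "CCW-PUMP": 2e-3}

def risk(out_of_service=()):
    """Rare-event point estimate over the cut sets; an out-of-service
    component's basic event is set to probability 1.0 (unavailable)."""
    prob = dict(base_prob, **{e: 1.0 for e in out_of_service})
    return sum(math.prod(prob[e] for e in cs) for cs in cutsets)

def colour(value, green=2e-4, yellow=1e-3):
    """Map the risk measure onto the operator's colour thresholds."""
    return "green" if value < green else "yellow" if value < yellow else "red"

print(risk(), colour(risk()))                  # baseline: 1.6e-04, green
print(risk(["DG-B"]), colour(risk(["DG-B"])))  # DG-B out: ~1.0e-02, red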
EOOS MODEL BUILDING

Basically, the EOOS model is constructed from hierarchically structured database relationships between different plant items. The types of items are: activity, test, system, train, clearance, component, and basic event. This is the basis for evaluating a specific plant event with the PSA model. Table 2 lists the major steps of EOOS model building. The figures included in this paper present some important EOOS capabilities. Figure 1 shows the EOOS screen for operators. With this safety panel, EOOS can help operators to focus on safety. The operators' panel shows: a numerical measure of plant safety that reflects changes in equipment status; the maximum time allowed in a particular plant configuration based on the plant safety value; the status of plant systems affected by various test and maintenance activities (i.e., it provides "defence-in-depth" information); and a list of current activities.
TABLE 2 CHECKLIST FOR BUILDING EOOS MODEL

Basic Setup for Operator's Screen:
- Relationship data files: Systems; Trains; Components; Basic Events
- Risk Calculation with Cutsets: Pre-generated cutsets; Risk meter colour thresholds
- Status Calculation: Fault tree for status panel; Status panel configuration

Setup for Scheduler's Screen:
- Schedule Data: Data import program; Activity file
- Relationship data files: Clearances; Status Gate List
- Relationship data files: Activities; Tests

Advanced Setup for Operator's Screen:
- Risk Calculation with Fault Tree: Modules; Initiators; Recovery Rules; Mutually Exclusive List
- P&ID Displays
- Environmental Effects
- System Alignments

Other Advanced Features:
- User Accounts and Access Rights
- Reasons for Work Activities
- Operating Modes
Figure 1: EOOS Screen for Operators.
Figure 2 shows how the user can see more information about components currently out of service. This graphical interface also works for entering information about components out of service during operation.
Figure 2: Connection between HotSpot P&ID and Operators Panel.

Results from a proposed schedule of test and maintenance activities are visible on a separate screen. Figure 3 shows all the information from the scheduled activities and their influence on plant safety. The user can see the proposed schedule of activities, the separate impact of these activities on the plant systems, and the final impact on plant safety. All these EOOS capabilities place serious demands on the effort for model preparation. We were able to transfer the complete PSA model results for the Seabrook NPP to the EOOS model. This includes the following groups of activities: changing the initial PSA model, building the mapping database tables, connecting P&IDs with the relevant basic events and fault tree gates, and creating a specific fault tree added for the operator panel purposes. For the scheduling part it is also necessary to make additional PSA model changes and build the relevant mapping database tables (that is, activities, testing). Experience with EOOS is continuing with more complete model preparation and results testing. The first results are promising.
Figure 3: EOOS Scheduler's Display.
CONCLUSION

The paper documents an effort to transfer the complete Probabilistic Safety Assessment model results for a nuclear power plant into the Equipment Out of Service (EOOS) Monitor model. (The EOOS model makes it possible to monitor nuclear power plant safety from the probabilistic point of view.) The idea was to make on-line risk management possible and to achieve on-line maintenance scheduling. For these purposes many changes had to be made: changing the initial PSA model, building the mapping database tables, connecting P&IDs with the relevant basic events and fault tree gates, and creating a specific fault tree added for the operator panel purposes. The first results are promising.
SUPPORTING RISK MONITORING WITH OFF-LINE PSA STUDIES
C. Vivalda (1), A. Carpignano (2), J. P. Nordvik (3)
(1) Bureau Veritas Paris, Marine Division, Ocean Engineering Management, 17bis, Place des Reflets, La Defense 2, 92400 Courbevoie, France
(2) Politecnico di Torino, Dipartimento di Energetica, C.so Duca degli Abruzzi 24, 10129 Torino, Italy
(3) European Commission, Joint Research Centre, Institute for Systems, Informatics and Safety, T.P. 210, 21020 Ispra (VA), Italy
ABSTRACT

The paper describes the starting points of the development of a general methodology devoted to the risk management of complex systems, based on the use of Belief Networks (BN) and Influence Diagrams (ID). The objective of the methodology is to exploit the knowledge resulting from activities such as Probabilistic Safety Assessments (PSA) and Safety Cases, which are carried out during the design and construction phases of the complex system life cycle, to support risk management during the system operational phases. The first step of the approach is the transformation of an Event Tree (ET) into a BN and the resolution of the BN using probabilistic data supplied by Fault Trees (FT) as prior probabilities. Using the BN, influences other than those considered during the FT-ET analysis, such as detected faults, control and maintenance operations, can easily be added to the network to complete and update the knowledge of the system. The resolution method and the implementation in the STARS software tool are investigated.
KEYWORDS
Risk Management, Risk Monitoring, PSA, Safety Cases, Belief Networks, Fault Trees, Event Trees, Complex Systems, STARS.
INTRODUCTION
Risk management represents a basic activity to be performed during the operation of safety critical systems. On-line risk management is a heavy and difficult task, as a good estimation of risk requires taking into account a wide variety of contributing factors that need to be identified and classified according to their importance. In addition, end users are sometimes unable to recognise the final consequences of the propagation of a rare, or apparently not relevant, initiating event. The proposed methodology addresses these issues by taking advantage of the existence of the Safety Case available for the system, based on a classical Probabilistic Safety Assessment (PSA). In fact, the Safety Case supplies the designers and users of plants with a large amount of qualitative and probabilistic data related to Reliability, Availability, Maintainability and Safety (RAMS). Its main goal is to guarantee to the user and to the community that the designed system will be safe during its whole life cycle.
The methodology foresees the integration of this knowledge with data derived from system operation, to provide operators with a tool able to perform "on-line" risk estimation and to support decision-making with respect to risk management. This methodology will be integrated into the STARS (Software Tools for Analysis of Reliability and Safety) software package (Nordvik et al., 1995) developed by the Joint Research Centre of Ispra. STARS is a software toolkit for managing and integrating safety analyses of complex systems. Its principal domain of application is related to "off-line" RAMS activities within the context of the safety life cycle of a system. The integration of the present methodology into the STARS package will extend its application to the operation of complex systems. The methodology is quite general, so it can be applied to different industrial systems, e.g. nuclear, chemical, oil, naval, space, and aeronautic.

The methodology is based on the use of techniques such as Belief Networks (BN) and Influence Diagrams (ID) to perform risk management, as already proposed in other works devoted to this subject (Seridji et al., 1995). Belief networks (Sarkar, 1996) are directed acyclic graphs in which nodes represent finite propositional variables. The belief accorded to different propositions is stated as probabilities, and the strengths of the dependencies across propositions are quantified by conditional probabilities (Neapolitan, 1990). BNs offer a representation of undesired situations taking into account the dependencies among operational and maintenance actions, the evolution of operational parameters, technical failures, and environmental conditions. These quantities identify the nodes of the network. Dependencies are displayed as directed arcs between the nodes. Network evolution leads to an estimation of the risk related to the undesired situation, both in terms of expected occurrence and of related damage. Evolution requires that a prior probability and a conditional probability be associated with each node of the network. These probabilities are generally set using expert judgement and system operational experience. The Safety Case can provide useful input to the construction of such a network, for the identification of the relevant events, of their dependencies, and of all related probabilities.

Inputs to this approach are the data supplied by FT-ET analyses. The basic idea is to transform ETs, already developed during PSA studies, into a basic BN which will be further improved using operational information. The resulting network will be used to forecast the system evolution. These forecasts will be performed when unexpected (i.e. failed) situations are met, environmental conditions change, or management intervenes on the system configuration. The forecast is made through the instantiation of some variables and the resolution of the network. To be solved, the network needs as input the prior probability of the root and the conditional probabilities of the descendant propositional variables. In Safety Cases based on classical PSA approaches, these probabilities are supplied by the FTs which are appended to the ETs to evaluate the sequence probabilities. Due to faults or management actions, the system configuration changes and its state evolves into a new one. This implies that the belief network also evolves and new probability values are needed to solve it.
These new probability values can be instantiated from the knowledge of external data or evaluated by solving new FTs, modified according to the actual configuration of the system. The forecast of the system evolution can be done under different conditions, i.e. without any maintenance intervention or with a few alternative maintenance policies. Compared to the ET approach, BNs allow an easier integration of operational influences and expert judgement into the provisional models and the knowledge provided by the Safety Case. The described approach makes the BNs able to support risk monitoring, but cannot support any decision-making activities. Influence Diagrams, an extension of BNs containing decision nodes, will be introduced to address decision support, to better manage or maintain the system and, hence, to mitigate consequences (Seridji et al., 1995).
APPLICATION OF PSA TO RISK MONITORING
Safety Cases based on Probabilistic Safety Assessments are becoming more and more important in many technological fields, as they provide a deep knowledge of the risk-related problems as well as a way to overcome them.
They are very useful during the design phases of a system, as they support the designer in optimising the system from the risk point of view. They are also extensively used to verify the risk level of already designed or existing systems. A PSA provides the designers and the users with a large amount of data related to Reliability, Availability, Maintainability and Safety aspects. It represents a forecast of what could happen in the system during accidental sequences. It also looks for the causes of that behaviour. The basic analyses performed in a PSA are cause-consequence analyses, mainly based on the FT-ET approach. These analyses are both qualitative and quantitative. Using qualitative investigations, such as Failure Mode and Effect Analysis or Hazard and Operability Analysis, the safety analyst identifies the expected accident initiators leading to the undesired consequences. ET analysis is successfully applied to investigate the accidental sequences of events arising from these initiators, according to the system failures, operator actions and environmental conditions that are relevant in this context. The output of this approach consists of a set of possible accidental sequences for each initiating event and carefully describes all the dependencies among the sequence events. Dependencies among events and cause-consequence relations can be studied by FT analysis. This provides a mechanism to describe each event in terms of elementary causes such as equipment failures, human errors, and environmental conditions.
Figure 1: The methodological approach (the Safety Case yields a basic network, which is completed into a final network using system operation issues and monitoring (control, errors, maintenance, failures, environment), and then submitted to network evaluation).
FT quantification also provides conditional probabilities for each event belonging to the accidental sequence. In this way, FT analysis provides input data to the ET for the evaluation of the probability of the accidental sequences. The Safety Case becomes an "off-line" representation of the system, describing all the safety critical issues for some predefined management and environmental conditions that are safety relevant. It cannot fully represent daily operations, which are crucial to perform "on-line" risk monitoring and are characterised by an unpredictable sequence of faults, control or maintenance actions, and environmental changes. Using the PSA approach, the resulting Safety Case provides the following information useful to support on-line risk monitoring:
* a collection of possible hazards deriving from system operation;
* a collection of ET-FTs describing each hazard in terms of cause-consequence relations among the elementary events and their dependencies;
* a provisional estimation of conditional probabilities for sequence events.
In the following we briefly describe how this information can be used to support the construction and the evaluation of BNs aimed at on-line risk monitoring. The hazard collection focuses on the networks necessary to describe and cover most of the system safety critical situations. For each situation, the structure of the ET can be transformed into a basic graph according to some rules that will be shown in detail in the next section. The basic network will be completed by the introduction of new nodes representing real operational and maintenance actions, environmental conditions, and management goals. The final network becomes an "on-line" representation of the system configurations.

During the risk monitoring phase, the events represented in the network nodes can be considered as symptoms of some unknown initiating events with unknown consequences. These events represent the appearance of anomalies detected by the control system, or of changes in maintenance and management policies. The anomalies themselves can often be identified with nodes derived from primary or secondary events of the FTs normally linked to the ET. When an anomaly appears, the risk monitoring system selects the relevant BN and evaluates its expected consequences as well as the potential initiating events. System operational parameters read by the control system, maintenance and management actions, and environmental changes can be used to update the conditional or prior probabilities of events, sometimes recalculating the system FTs, in order to obtain the instantaneous risk estimation by network re-evaluation. The approach to be followed to solve the networks depends mainly on the features of their structure, and rigorous as well as approximate methods can be employed (Neapolitan, 1990; Sarkar et al., 1995). The decision on how to manage the risk will be taken as a consequence of the previous probabilistic analysis.
FROM ET-FT TO BELIEF NETWORKS
The transformation of an ET into a BN is theoretically possible due to the fact that the two techniques are based on the same fundamental hypotheses, i.e.:
* both the events of an ET belonging to the same branching node and the events contained in a propositional variable of a BN are mutually exclusive and exhaustive events;
* ET and BN require the same input variables;
* both the events of an ET and the events contained in a propositional variable of a BN are conditionally independent, i.e. each event in the directed acyclic graph is independent of all its subsequent events.
Moreover, in order to make the transformation possible:
* the structure of the Belief Network has to be completely connected, i.e. each propositional variable is dependent on all its preceding variables;
* the ET has to be ordered and pruned.
The transformation of an ET into the corresponding BN can be achieved according to the following procedure:
1. identification of the initiating event of the ET with the root of the BN;
2. identification of all branching nodes of the ET with the propositional variables of the BN;
3. identification of all events of a branching node of the ET with the events of the corresponding propositional variable;
4. attribution of the probability of the initiating event in the ET to the prior probability of occurrence of the root of the BN;
5. attribution of the conditional probability of each descending event of the ET to the conditional probability of occurrence of the corresponding event of the BN;
6. construction of the Directed Acyclic Graph (DAG) according to the following rule: each propositional variable is connected with arcs to all the preceding branching nodes appearing in the ET, which represent its parents. The direction of the arcs is from the parents to the propositional variable.

The probability of occurrence of a sequence of events in the ET corresponds to the joint probability of the same set of events in the BN. It follows that risk estimation using a BN is based on the evaluation of the joint probabilities of the sets of events leading to undesired consequences.
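As an illustration of the six-step procedure, the following sketch builds the variables, events and completely connected DAG of a BN from an ordered ET description. The plain-dictionary representation and names are assumptions of this example, not the authors' implementation.

```python
# Minimal sketch of the ET -> BN transformation procedure above.
# Data structures (plain dicts/lists) are illustrative assumptions.

def et_to_bn(initiating_event, branching_nodes):
    """initiating_event: (name, probability) of the ET initiator.
    branching_nodes: ordered list of (variable, [event, ...]) pairs,
    one per ET branching node (events mutually exclusive, the ET
    already ordered and pruned, as the text requires)."""
    root, p_root = initiating_event
    # Steps 1-3: root and propositional variables with their events.
    variables = [root] + [var for var, _ in branching_nodes]
    events = {root: [root]}
    events.update(dict(branching_nodes))
    # Step 6: completely connected DAG - each variable receives arcs
    # from all preceding variables, which are its parents in the BN.
    parents = {var: variables[:i] for i, var in enumerate(variables)}
    # Steps 4-5: priors/conditionals are carried over from the ET
    # (only the root prior is filled in here, as an example).
    priors = {root: p_root}
    return {"variables": variables, "events": events,
            "parents": parents, "priors": priors}

# A small ET with initiating event IE and two branching nodes:
bn = et_to_bn(("IE", 1e-2), [
    ("X1", ["x11", "x12"]),
    ("X2", ["x21", "x22", "x23"]),
])
print(bn["parents"]["X2"])    # -> ['IE', 'X1']
```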
CASE STUDY
One technological field of application of the proposed methodology is the naval industry. The example given relates to passenger/Ro-Ro vessels, for which Safety Cases based on FT-ET application have been developed. Making reference to the results obtained in the Methodology Report (Spouge, 1996), the proposed case study refers to the "Grounding incident". The generic ET is reported in fig. 2.

[Figure 2: Event tree for the Generic Grounding Incident. The branch points and conditional probabilities recoverable from the figure are:
Grounding incident (initiating event): a = 0.02 per ship per year.
Incident importance: minor incident, b1 = 0.76; serious casualty, b2 = 0.24.
Flooding event: no flooding, c1 = 0.59; flooding double bottom only, c2 = 0.32; flooding above DB, c3 = 0.09.
Ship position: hard aground, d1 = 0.33; floats free, d2 = 0.67.
Floating position: remains afloat, e1 = 0.8; slow sinking, e2 = 0.1; rapid capsize, e3 = 0.1.
Grounding incident codes: S1, S2.1, S2.2, S3.1, S3.2.1, S3.2.2 and S3.2.3 (rapid capsize).]
[Figure 3: Portion of the Fault Tree describing the Generic Grounding Incident. The fragment develops the top event "Grounding" through the intermediate event "Grounding due to steering failure", combining "Steering failure" with the condition "Steering failure causes grounding"; "Steering gear failure" and "Power failure" appear as basic events beneath "Steering failure".]
The conditional probability values used for the evaluation of the ET have been derived from the solution of the appended FTs and from historical data. One appended FT, related to the "Grounding incident", is partially shown in fig. 3. The basic BN, shown in fig. 4, is drawn using the rules explained in the previous section.

[Figure 4: Basic Belief Network for the Generic Grounding Incident derived from the ET. The nodes A, B, C, D, E carry the probability tables:
P(A): p(a);
P(B): p(b1|a), p(b2|a);
P(C): p(c1|b2,a), p(c2|b2,a), p(c3|b2,a);
P(D): p(d1|c3,b2,a), p(d2|c3,b2,a);
P(E): p(e1|d2,c3,b2,a), p(e2|d2,c3,b2,a), p(e3|d2,c3,b2,a).]

To the basic BN, developed on the five events belonging to the ET, the following events have been added in order to take into consideration the ship operations and the characteristics of the specific context:

F: situation recognition (f1 = yes, f2 = no)
G: crew training (g1 = good, g2 = sufficient, g3 = poor)

The final BN used for risk monitoring is presented in fig. 5.
[Figure 5: The final BN for the Generic Grounding Incident representing the system "on line". The nodes carry the probability tables:
P(A): p(a);
P(F): p(f1), p(f2);
P(G): p(g1), p(g2), p(g3);
P(B'): p(b1|a,f1), p(b2|a,f1), p(b1|a,f2), p(b2|a,f2);
P(C'): p(ci|b2,a,gj) for i = 1..3 and j = 1..3;
P(D): p(d1|c3,b2,a), p(d2|c3,b2,a);
P(E): p(e1|d2,c3,b2,a), p(e2|d2,c3,b2,a), p(e3|d2,c3,b2,a).]

The initial probabilities to be used for risk evaluation during the monitoring phase are those evaluated for the ET and supplied by historical data and expert judgement. The modification of one or more of these probability values results in a change of the risk estimation and gives information to the operator on the ship safety conditions. For example, if during navigation the ship control system detects a steering gear failure (identified as E3 in the FT), this becomes a certain event in the FT and its failure rate changes (in this case
becoming infinite). Based on this knowledge, the FT evaluation can be updated and a new estimation of risk is made by the solution of the BN. Risk estimation is based on the evaluation of the joint probabilities of events leading to undesired consequences. For example, for the sequence ending in "rapid capsize" (sequence S3.2.3 in the ET) the joint probability to be evaluated is P(e3, d2, c3, g1, b2, f1, a), when the values of g1 and f1 have been instantiated. To evaluate the risk, this probability has to be combined with the magnitude of the consequences.
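As an illustration, the sketch below evaluates this joint probability as the chain-rule product over the instantiated sequence, following the probability tables of fig. 5. The ET values of fig. 2 are used where the paper gives them; p(f1), p(g1) and the context-modified conditionals of B' and C' are invented for the example, as marked.

```python
# Sketch: joint probability of the "rapid capsize" sequence (S3.2.3)
# in the final BN of fig. 5, by the chain rule. Values marked ASSUMED
# are illustrative only, not taken from the paper.

p = {
    "a": 0.02,               # grounding incident, per ship per year (fig. 2)
    "f1": 0.9,               # situation recognised -- ASSUMED
    "g1": 0.7,               # crew training good   -- ASSUMED
    "b2|a,f1": 0.24,         # serious casualty; ET value reused -- ASSUMED
    "c3|b2,a,g1": 0.09,      # flooding above DB; ET value reused -- ASSUMED
    "d2|c3,b2,a": 0.67,      # floats free (fig. 2)
    "e3|d2,c3,b2,a": 0.10,   # rapid capsize (fig. 2)
}

# Chain-rule product over the instantiated sequence a, f1, b2, g1, c3, d2, e3
joint = (p["a"] * p["f1"] * p["b2|a,f1"] * p["g1"]
         * p["c3|b2,a,g1"] * p["d2|c3,b2,a"] * p["e3|d2,c3,b2,a"])
print(f"P(e3,d2,c3,g1,b2,f1,a) = {joint:.3e}")   # ~1.8e-05 per ship per year
```

If the control system later detected the steering gear failure discussed above, p["a"] would be recomputed from the re-evaluated FT and the product re-run, which is the network re-evaluation step described in the text.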
IMPLEMENTATION INTO THE STARS ENVIRONMENT

STARS is an open software system to model industrial systems and assist the analyst in safety and reliability studies (Nordvik et al., 1995). STARS provides an answer to several problems concerning complex system design.

Industrial systems are complex. They are the result of an extensive engineering activity which starts from high-level objectives that are translated into functional requirements, and refined into low-level operative specifications and system design. Once built, systems are prone to changes, e.g. design modifications, replacement of broken or old parts, and general ageing of equipment. Consequently, as a system evolves, it should be re-assessed against its initial requirements. This situation calls for improved methods and tools to capture the overall system complexity and assist the various professionals involved in the design, operation and maintenance activities. Ideally, a single information system should be able to collect, structure and manage any information on an industrial system and provide efficient retrieval and updating mechanisms to describe the system along its life-cycle. This information system would also support - and possibly automate - specific tasks of this life-cycle. This is the long-term objective that the STARS system aims to achieve, as far as safety and reliability issues are concerned.

Several tools are available to store generic information regarding the working domain (Taxonomy editor), to model the plant (Plant editor) and to easily perform system analysis (tools for Failure Mode and Effect Analysis, Fault and Event Tree Analysis). All the tools are fully integrated, so that it becomes easy to navigate within the model itself and from the model to the analysis and vice versa. Integration also promotes data coherence and facilitates updating. Three different models - the structural model, the functional model and the behavioural model - describe all the information related to the layout and hardware structure of the system, the behaviour of its components and the functions that they perform. The models support automatic Failure Mode and Effect Analysis and Fault Tree and Event Tree construction. The use of software tools therefore simplifies the analyst's work in storing data, creating automatic relations between different data, evaluating information, searching for particular data, producing reports on the analysis and, above all, updating data and analysis products during the design evolution, allowing a "living" safety assessment with useful feedback to designers.

These features make STARS the right framework in which to implement the risk monitoring methodology described above. In terms of system modelling, using the plant editor, the analyst can enrich the Safety Case description with new information regarding operations, maintenance and environmental conditions until the "on line" system model that fills the BN nodes is obtained. This improvement does not require heavy modifications to the tools available today, except for the configuration manager which, operating in an "on line" mode, needs a more versatile approach to record and set a larger number of different configurations, set either by the user or by an interface directly connected to the control panel of the system. The main improvement in STARS, for the risk monitoring applications, consists of the specification and implementation of an editor to draw Belief Networks and the related analyser able to evaluate the net.
A general Risk Monitor interface will be designed to drive all the tools involved in this kind of application, i.e. the conversion of ETs into BNs, the interface between the real system and its STARS model, the re-evaluation of FTs in order to update network probabilities and, finally, the dialog between the STARS environment and the system operator.
CONCLUSIONS

The present paper shows how important "off line" information related to risk, obtained during the design and construction phases of an industrial system, can be usefully exploited for the development of an "on line" risk monitoring system. In fact, Safety Cases based on the PSA approach, and in particular on FT-ET techniques, provide useful knowledge for modelling a basic representation of system behaviour. The proved equivalence between ETs and BNs allows the construction of basic BNs. BNs being more flexible than ETs, they allow the integration into the basic network of operational information, updating the monitoring system on the basis of system evolution and environmental changes. The FT-ET approach also provides the prior and conditional probabilities needed to start the process of "on line" risk evaluation. The integration of the methodology in the STARS environment, devoted to "off line" risk studies, will improve the process of information transfer and the updating of conditional probability evaluation through the use of FT analysis. The feasibility of the risk monitoring system through transfer and update of "off line" information is proved in the simple case study presented. The refinement of the approach is ongoing and more extensive applications will be considered. A further step in the development of the methodology will be the introduction of Influence Diagrams to support the operators in the decision process.
ACKNOWLEDGEMENTS
The authors wish to thank Abder Seridji and Alessandra Mosso for their helpful contribution in defining the methodology and revising the work.
REFERENCES
Neapolitan, R. E. (1990). Probabilistic Reasoning in Expert Systems: Theory and Algorithms. Wiley Interscience, New York, U.S.

Nordvik, J. P., Carpignano, A. and Poucet, A. (1995). Computer-based system modelling for reliability and safety analysis. Proceedings of the American Nuclear Society Meeting, Philadelphia, 25-29 June 1995, American Nuclear Society Inc., La Grange Park, U.S.

Sarkar, S. and Murthy, I. (1995). Criteria to evaluate approximate belief network representations in expert systems. Decision Support Systems, 15, Elsevier.

Sarkar, S. and Murthy, I. (1996). Constructing efficient belief network structures with expert provided information. IEEE Transactions on Knowledge and Data Engineering, 8:1.

Seridji, A., Regan, J. P. and Mangini, M. (1995). The advanced risk management system: a rational approach to risk management in the offshore industry. Proceedings of the American Nuclear Society Meeting, Philadelphia, 25-29 June 1995, American Nuclear Society Inc., La Grange Park, U.S.

Spouge, J. (1996). Safety Assessment of Passenger/Ro-Ro Vessels: Methodology Report. The North West European Project on Safety of Passenger/Ro-Ro Vessels, DNV Technica Report C6185, October 1996.
"LIVING" SAFETY CASES - A TEAM APPROACH
G. A. Rawlinson
British Nuclear Fuels plc, Sellafield, Cumbria, UK
ABSTRACT

The Thermal Oxide Reprocessing Plant (THORP) is the central feature of a major investment programme at the BNFL Sellafield Site. The plant has a wide scope of operations and this has resulted in an operational safety case, the THORP Plant Safety Case (PSC), which is the largest such case to date at BNFL Sellafield. This paper explores those features considered to be of primary importance in keeping such a large and complex document as live as is practicable, recognising the changing state of the facility as it progresses through its various active commissioning and operational phases. Key features of the safety case are also discussed.
KEYWORDS

THORP, PSC, Safety, Plant, Live, Ownership
INTRODUCTION

BNFL's Thermal Oxide Reprocessing Plant (THORP) is the third generation of irradiated nuclear fuel reprocessing plant to be built on the Sellafield site and reprocesses enriched Uranium fuel principally from Light Water Reactors (LWR) in Europe and Japan and Advanced Gas-Cooled Reactors (AGR) in the UK. Encompassed within the scope of THORP operations are:
• Fuel flask receipt and storage operations.
• Fuel handling, shearing and dissolution.
• Chemical reprocessing.
• Product purification and storage.
• Waste management and disposal.
• Decontamination operations.
The size of the facility and its wide scope of operations provided a significant challenge with regard to initial Safety Case preparation, and an ongoing challenge to maintain the accuracy of the case as commissioning progresses and the plant receives an increasing challenge.
The current safety case, termed the THORP Plant Safety Case (PSC), has already successfully supported the application to the regulatory bodies for consent to actively commission the facility and when updated will support a further application to the regulators to allow full plant operation. The THORP PSC is intended, as far as is achievable noting its size and complexity, to be a "live" document in order that it may reflect the changing state of the facility. The objective of this paper is to discuss the controls which have been implemented with regard to achieving this goal and to highlight those which have been demonstrated to be of "key" importance. A brief overview of the safety case is initially presented in order to give some idea of its size and scope and to introduce the reader to its principal components.
OVERVIEW OF THE THORP PSC

The current THORP PSC is based largely on the projections of plant performance and reliability made at the design stage. Active commissioning is now however nearing completion and there is a need to incorporate the experience gained during this phase in order to reflect actual plant performance. The safety case will therefore develop through several phases and thus will eventually change from being based purely on design predictions to one based on observed plant performance. The implementation of each of these phases of the documentation will ensure that the safety case remains as live as is practicable. The discussion which follows, which is necessarily brief, concentrates on the following topics:
• PSC - Background and Evolution
• PSC - Assessment Methodology
• PSC - The Safe Envelope
PSC - Background and Evolution

The PSC addresses all aspects of operational safety and provides a detailed description of the plant and its processes. It is based on the work carried out for the Design Safety Report, which supported the application to build and inactively commission the facility, but is expanded in scope. This is in order to give further consideration to features which are in part or wholly operator based (e.g. maintenance activities) and to more fully integrate the operations of THORP into the wider framework of Sellafield operations. It is a large document which currently runs to some 17 volumes incorporating approximately 170 individual risk assessments. The layout of the safety case is arranged to be readily usable by the final customer, namely the plant operator, and to meet this demand the individual assessments have been prepared around clearly defined plant units or areas. This has probably increased the paper loading but has ensured that the safety assessors and plant operators are using the same plant and process terminology, to their mutual benefit. The PSC will evolve through several formal stages and will gradually expand in content to reflect the increasing experience of plant operation. The formal review processes to be described later in this paper are aimed at controlling the required documentation changes.
PSC - Assessment Methodology

Fault Identification

The fault conditions assessed in the PSC are listed, on an area basis, in a "Fault Identification Schedule". This was compiled using information extracted from the design documentation, which used extensive
HAZOP studies, and from existing Sellafield Safety Cases for similar plants which incorporate plant experience and equally had their basis in HAZOP.
Risk Assessment

Each identified hazard is subject to detailed analysis, and it is these analyses that form the major part of the safety case documentation and have the major impact with regard to placing limits on plant and process operations. These "risk" analyses use as a basis a methodology which has been developed over several years at Sellafield, with the assessment of faults being primarily probabilistic (although where appropriate deterministic arguments have been made). The conclusions of the individual risk assessments are judged against established criteria, such criteria having been in use at Sellafield since the early 1980s. These are concerned with the tolerability of the risk posed to members of the public from the whole of the Sellafield site as a result of fault conditions that could potentially arise. Each plant of significance is therefore allocated a proportion of the overall site criteria. Two types of criteria are used, namely "primary" risk criteria, which limit the summed mortality risk for the most exposed member of the relevant critical group, and "secondary" criteria, which account more for the public perception of maloperations on the plant and in particular consequence aversion. The "secondary" criteria limit the frequency of certain categories of events to levels below those which are necessary on pure risk grounds. The assessment of a given fault must therefore meet the constraints of the appropriate criteria. For each criterion there is, therefore, a consequence and a frequency component. For example, the summed frequency of an operator receiving an accidental radiation exposure > 50 mSv must not be greater than 1E-4 y-1.

Frequency estimations carried out as part of the probabilistic risk assessment are performed using specific information on plant equipment performance, which is obtained from a reliability database. For THORP this was built up throughout the design period, with additional data added from other sources such as manufacturers' literature and published databases. Similarly a human factors database is available, with experts providing advice and ultimately independent verification.
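As a minimal illustration of such a summed-frequency check, the sketch below totals hypothetical fault frequencies for one exposure category and compares the result against the criterion quoted above; all fault data are invented.

```python
# Illustration of the summed-frequency check described above. The
# 1E-4 /y limit for operator exposures > 50 mSv is the criterion
# quoted in the text; the fault frequencies are hypothetical.

CRITERION = 1e-4   # /y, summed frequency limit for > 50 mSv operator dose

fault_frequencies = {"fault_A": 3e-5, "fault_B": 2e-5, "fault_C": 4e-5}
total = sum(fault_frequencies.values())
print(f"summed frequency = {total:.1e} /y -> "
      f"{'meets' if total <= CRITERION else 'exceeds'} criterion")
```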
PSC - The Safe Envelope

A key feature of the PSC, and indeed one of direct relevance to the licence under which BNFL is permitted to carry out its operations at Sellafield, is to establish a Safe Envelope of Operation. This is principally specified by the designation of key operational controls termed "Operating Rules and Safety Mechanisms". The basic definitions of these are:
• Operating Rule - a limit or condition which is necessary in the interests of safety. This is a high level control against which adequate compliance needs to be demonstrated.
• Safety Mechanism - a system / item which is key in maintaining a safe condition or is required to support an Operating Rule.
The safety case employs a sensitivity analysis in order to determine whether there is "numerical" justification for such measures to be recommended although it is always open to the plant operator to specify additional rules or equipment as deemed necessary. In addition "Rules or Equipment" can be designated from the safety assessment on the basis that they reflect key assumptions e.g. they significantly limit the potential consequences of a given scenario. In addition to the above "high level" items other categories of Instruction and Equipment are used to further define the safe envelope of plant operations. All of the above items, which are key findings of the safety
case, are compiled in standard tables which occur at the same location in each risk analysis in order to make this important information readily accessible.
CONTROL OF CHANGE

The requirement for a change in the safety case largely occurs as a consequence of modifications to Plant or Process. The need for these can arise from several sources, principally:
• A requirement for optimisation of plant performance
• A change in the scope of plant operations
• A need to rectify observed plant deficiencies
• A need to replace on-plant equipment
In addition to this ongoing requirement there is also a need to periodically check that the fundamental basis of the Safety Case remains valid. It is the procedures associated with these two principal items which form the main theme of the following discussion. Each of these key processes requires a heavy involvement of both plant and safety department personnel in order that they can be adequately implemented in a reasonable timescale. It is essential therefore that close communication links are developed between both groups, together with a firm understanding of the key concepts of the Safety Case.

Modifications to Plant or Process
In order for the safety case to remain valid it is essential that it adequately addresses the actual plant status and the full scope of operations being carried out, i.e. that it is as "live" a document as is possible. This provides a significant challenge with respect to such a large body of documentation addressing such a wide scope of operations, and without close control it would be easy for the clarity of what actually constitutes the safety case to be lost. One of the key features is therefore to control modifications to plant or process. The experience at Sellafield is that the number of required modifications can be large for a facility of the size and complexity of THORP. This is to be expected noting that new information is continually being gathered as a result of ongoing commissioning operations. It is essential however to recognise that even seemingly insignificant modifications have the potential to have a large effect on plant operability and safety. As such, modifications to plant or process, no matter how small, need to be adequately considered. To this end there is strict control over the implementation of plant modifications. This involves the raising of a written proposal for each intended modification and the subjecting of this proposal to a rigorous and structured review process. This review not only addresses the potential radiological and conventional hazards during both the implementation phase and following final installation but provides a facility for controlling the implementation of the modification with regard to other on-plant issues, e.g. the need to change drawings or Plant Maintenance Schedule details, institute further operator training etc. The items reviewed when a proposal is considered are numerous, with those of key significance with regard to the safety case being to check:
• If there is a potential effect on the scope of the existing safety case.
• If there is a potential to affect existing Operating Rules or Safety Mechanisms.
• If there are any safety assessments required prior to the implementation of the modification.
• If there is a need to consider the modification at a HAZOP study.
• If there is adequate testing to be carried out following the modification to ensure that installed systems are functioning as intended and that existing plant has not been affected during the installation.
The PMP process involves and requires endorsement by both the plant operators and the safety department. In this way it ensures that the intended modifications are given full consideration by suitably qualified and experienced personnel and that the safety case reflects "actuality". Following the structured review process the modification proposal is classified with respect to its potential radiological significance. The derived category of the modification then determines the minimum safety case requirements and authorisation route prior to the commencement of the modification. Depending on the category this may require the seeking of consent from the regulatory bodies and from the THORP main safety committee. In this way it is ensured that significant modifications are fully incorporated into the safety case prior to them being implemented. This is essential in maintaining the currency of the safety case arguments.
Periodic Formal Review of the Plant Safety Case

The site licence for Sellafield requires arrangements to be in place for the "systematic and periodic review of safety cases", which is generally performed annually. In addition, for THORP the safety case is required to gradually incorporate commissioning experience. The purpose of the periodic reviews is to:
• Confirm the adequacy and assumptions of the Safety Case and identify any changes required in order to reflect the experiences of actual plant performance during the active commissioning phase.
• Confirm the continuing validity of the safety case assumptions as the plant moves towards its optimum design throughput.
To achieve these objectives the safety case is considered, section by section, against a number of topics. Those considered key with regard to safety case control are now discussed in brief:
Comments Raised by the Operators on the Basic Scope of the Assessment

Throughout active commissioning the performance of the plant is monitored against the original design predictions and it is inevitable that at certain times variations will occur between the predicted performance and that actually observed. This part of the checking procedure therefore focuses on those features concerned with the general basis of the assessment and considers whether the fundamental assumptions of how the plant operates and what operations are carried out remain valid.
Incorporation of Items Arising since the Previous Safety Case Issue

As detailed earlier, all changes to plant and process need to go through a rigorous checking procedure which involves consideration of the potential for the modification to affect the safety case arguments. A judgement is made at the time as to the significance of these changes and, if not required to be incorporated immediately, it is noted in a central file of safety case items. The periodic review is an ideal opportunity to review these items with regard to incorporation or otherwise, in order that the safety case reflects operations and practice as closely as possible.
Consideration of the Potential Cumulative Effects of Plant Modifications

THORP encompasses a wide scope of operations and is continually raising issues arising from ongoing commissioning experience and the need to optimise plant performance. It is perhaps then not surprising that a large number of modifications are carried out to plant and process over the period of one year. This check on the safety case entails a more global consideration of the modifications carried out with the view of
identifying any cumulative effects, i.e. modifications which in themselves were not significant but which, when considered in the light of other subsequent modifications, may achieve greater significance. In addition this check serves to highlight areas giving cause for concern, e.g. a large number of modifications in one area.
Consideration of Safety Memoranda Raised

Modifications to plant and process often require safety case rework on a short timescale in order to give assurance that an adequate safety margin is being maintained. This rework is often carried out using the vehicle of a Safety Memorandum. This document, which addresses the specific safety issues concerned and highlights the need for any additional controls over and above those already existing in the Safety Case, ensures that at all times the safety case addresses those items of key safety significance and maintains the safe envelope of operation. In this respect these documents are key to maintaining the safety case as a live document. The recommendations of a memorandum are of identical importance to those from the main body of the safety case. This part of the safety case review therefore considers the memoranda in place associated with the specific area of plant undergoing the review. Any memoranda dealing with long term issues, i.e. those currently addressing a shortfall in the safety case, are then incorporated into the Safety Case proper. This aids the clarity of the safety case in that it minimises the amount of documentation associated with the hazards in one particular area of plant. If memoranda were allowed to accumulate, the definition of what constituted the safety case for a given area would be lost, with a potential detrimental effect on safety.
Review of Incidents and Events, Audits and Inspections

This section of the review is concerned with learning the lessons of on-site incidents and involves consideration of the incident investigation reports and recommendations to determine whether the existing Safety Case is adequate or requires update. In addition the reports of any audit findings and recommendations are similarly reviewed for potential incorporation.
Maintenance and Reliability Experience

The THORP PSC includes a large number of frequency estimations against given fault sequences. In making these estimations it is necessary to assume a given reliability value for the items deemed important within the sequence. This reliability data needs to be reviewed in the light of actual experience. The reliability of key items within the assessment is therefore reviewed against actual data recorded over the period of operation and, if necessary, the frequency estimation is updated to reflect this actual data. In this way the Safety Case more accurately reflects the actual performance of plant items operating in their actual environment and under the maintenance regime they actually see, as compared to the design predictions. In terms of the human reliability figures assumed in the safety case, these are reviewed in the light of actual experience of operations. In addition, plant procedures and training are also reviewed to examine whether the assumptions made in the safety case as to the availability of a quality operator response to potential fault conditions remain valid.
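A minimal sketch of this kind of comparison is given below; the item names, assumed rates and observed counts are invented, and a real review would use interval estimates rather than the simple point estimate shown.

```python
# Sketch of the reliability review described above: compare the failure
# rate assumed in the safety case with the rate observed on plant, and
# flag items whose frequency estimations may need updating.
# All numbers are hypothetical.

items = [
    # (item, assumed rate /y, observed failures, observed exposure years)
    ("extract fan",     0.10, 3, 12.0),
    ("interlock relay", 0.02, 0, 25.0),
]

for name, assumed, failures, years in items:
    observed = failures / years           # simple point estimate
    if observed > assumed:                # optimism in the design data
        print(f"{name}: observed {observed:.3f}/y > assumed {assumed:.3f}/y"
              " -> update frequency estimations")
    else:
        print(f"{name}: assumption holds ({observed:.3f}/y)")
```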
Quantities of Radioactive Material Discharges

Plant discharges of radioactive material are closely monitored for all effluent routes. The purpose of the review is therefore to compare the measured performance against that predicted. This not only confirms that the plant is being operated within its discharge authorisation but indicates the performance of effluent clean-
up systems for which much credit is taken in the Safety Case.
Compliance with Constraints Imposed by the Safety Case, e.g. Operating Rules and Instructions

The purpose of this part of the review is to ensure that the instructions are workable and do not give rise to difficulty in demonstrating compliance, i.e. that it can be readily demonstrated that the rule or instruction is being adequately adhered to. In addition, on the basis of observed plant performance it may prove operationally beneficial to review the need for an instruction whilst remaining within the safety envelope. This is a particularly important part of the review process and much effort has been put into ensuring that rules and instructions are clear, unambiguous and readily complied with. An observed shortfall in the usability of an instruction may well lead to a reconsideration of the wording, or indeed of the safety case, since a poorly complied with rule or instruction does not add to safety. In addition to the check on these "high level" controls a review is also made of the validity of the "key parameters" of the safety case. An example of these could for instance be the number of flask moves assumed, or the inventory assumed for certain packages, and these are appropriately listed in the assessments in order that they are readily visible.
Ownership of the THORP PSC and the Need for Close Communication and Teamwork

Throughout the production and authorisation stages of the various issues of the THORP PSC there has been heavy involvement of the plant operators and plant management, and this has been essential in not only ensuring accuracy of information but encouraging safety case "ownership" and developing close contacts between the operators and the safety department. This involvement has included thorough document consideration at plant safety committees which not only consider the technical and operational implications of the safety case presented but provide the necessary link between the paper's conclusions and related plant documentation. One of the most important concepts to recognise is that the Safety Case belongs to and is the responsibility of the plant operators. This is key in developing a case that is both adequate and workable and in ensuring that its principal features are fully understood by those charged with the responsibility of operating and maintaining potentially hazardous plant. The experience gained in the process of preparing and maintaining the THORP PSC at Sellafield has indicated that close and open communication is essential with respect to maintaining an adequate safety case. In addition, the team working of both plant operators and safety department personnel is crucial if the required controls are to be adequately implemented in a sensible timescale. The experience at Sellafield is that these features are greatly aided by:
• Having established teams for safety case issues in the safety department and on plant, i.e. a clear contact route.
• The appointment of Safety Case managers on plant who are responsible for the upkeep and adequacy of the safety case for their particular plant area.
• The involvement of Plant Safety Department personnel as independent members on both the Modification Proposal and Management Safety Committees. This ensures that safety case matters are adequately captured and addressed and that close contacts are established. This involvement is site wide and thus ensures a cross fertilisation of safety information and experiences between the different areas.
CONCLUSIONS

Large and complex Safety Cases can be successfully controlled and maintained as "living" documents provided that certain controls are implemented. Of particular importance is the control of modifications made to plant or process and the periodic review of the safety case more generally to ensure that its key concepts remain valid. The experience at Sellafield, during the preparation and maintenance of a large and complex safety case during a period of significant on-plant change, has highlighted several principal features. These are considered to be the main lessons learned and to be applicable to general safety case preparation and maintenance. These are that:
• Safety case documentation must be readily understandable and usable by the final customer, namely the plant operator. Key constraints must be clearly presented and be in a readily accessible form.
• Document ownership by the plant operators must be developed. This can be aided by the close involvement of the plant operators throughout the preparation of the safety case.
• The safety case should be prepared in a manner which facilitates subsequent review. This is crucial for large and complex documentation which relies for its arguments on data arising from various sources. The use of controlled databases of information, with cross references to each instance where information is used, proves essential with regard to ensuring continuing accuracy.
• Close control must be exercised over interim safety arguments required to address safety aspects on a short timescale, such that what constitutes the safe envelope of operation does not become clouded. Recommendations from interim safety documentation must be seen as part of the safety case proper.
• "Actual" plant performance must be closely monitored to confirm safety case assumptions. The importance of this data collection with regard to the safety of the plant must be understood. Additionally, procedures must be in place to ensure that quality information is gathered following any on-plant incident.
• Instructions addressing key aspects of the safety case must be clear, concise and workable, and operational compliance must be demonstrable. Careful wording is essential in order to clearly identify the intent of these prime requirements, since a badly worded instruction does not add to safety.
• Management systems must be put in place early in the life of the safety case in order to facilitate both the ongoing control of modifications and the information gathering required for the periodic safety case review process.
It has been the experience at Sellafield that of "overall" importance with regard to the preparation and maintenance of an adequate safety case is the development of close and open communication links and understanding between the Safety Department and the plant operators. This develops an awareness and understanding of key safety case concepts and arguments and ensures that safety case issues achieve the required priority. In this way safety case management is truly a team approach.
B4" Waste Isolation Pilot Plant
CONDENSED SUMMARY OF THE SYSTEMS PRIORITIZATION METHOD AS A DECISION-AIDING APPROACH FOR THE WASTE ISOLATION PILOT PLANT

D. M. Boak,a N. H. Prindle,a R. A. Bills,b S. Hora,c R. Lincoln,a F. Mendenhall,a and R. Weinera

a Sandia National Laboratories, Albuquerque, NM 87185
b U.S. Department of Energy, Carlsbad, NM 88221
c University of Hawaii at Hilo, Hilo, HI 96720
ABSTRACT
In March 1994, the U.S. Department of Energy Carlsbad Area Office (DOE/CAO) implemented a performance-based decision-aiding method to assist in programmatic prioritization within the Waste Isolation Pilot Plant (WIPP) project. The prioritization was with respect to 40 CFR Part 191.13(a) and 40 CFR Part 268.6, U.S. Environmental Protection Agency (EPA) requirements for long-term isolation of radioactive and hazardous wastes.1 The Systems Prioritization Method (SPM) was designed by Sandia National Laboratories to: 1) identify programmatic options (activities), their costs and durations; 2) analyze combinations of activities in terms of their predicted contribution to long-term performance of the WIPP disposal system; and 3) analyze cost, duration, and performance tradeoffs. SPM results were the basis for activities recommended to DOE/CAO in May 1995. SPM identified eight activities (less than 15% of the 58 proposed for consideration) predicted to be essential in addressing key regulatory issues. The SPM method proved useful for risk- or performance-based prioritization in which options are interdependent and system behavior is nonlinear.

KEY WORDS
Decision analysis, probabilistic performance assessment, geologic disposal, radioactive waste, hazardous waste, risk-based prioritization

INTRODUCTION
The Systems Prioritization Method (SPM) is a performance-based, decision-aiding method developed by Sandia National Laboratories (SNL) for the U.S. Department of Energy Carlsbad Area Office (DOE/CAO) to assist in programmatic prioritization within the Waste Isolation Pilot Plant (WIPP) project. SPM was designed to 1) identify programmatic options (activities), their costs and durations; 2) analyze combinations of activities (activity sets) in terms of their predicted contribution to the WIPP disposal system with respect to EPA long-term performance requirements in 40 CFR 191.13(a) (EPA, 1993) and 40 CFR 268.6 (EPA, 1992); and 3) analyze cost, duration, and performance tradeoffs. The second iteration of SPM (SPM-2), completed in March 1995, determined the most viable combinations of scientific investigations, engineered alternatives (EAs), and waste acceptance criteria (WAC) for supporting the final compliance certification application for WIPP. The
1 The WIPP Land Withdrawal Act amendments of 1996 effectively removed the need for WIPP to demonstrate compliance with 40 CFR Part 268.6.
results of the second iteration of SPM (SPM-2) were the basis for recommendations to DOE/CAO in May 1995 for programmatic prioritization within the WIPP project. SPM identified eight activities (less than 15% of the 58 proposed for consideration) predicted to be essential for addressing key regulatory issues. This paper is a condensed summary of SPM, its implementation and key results (Boak et al, 1996; Helton et al, 1996; Prindle et al, 1996a, b, and c).

KEY STEPS AND CONCEPTS

The goal of SPM was to provide information about how potential activities--scientific investigations, engineered alternatives, and waste acceptance criteria--when viewed singly or in combination, could contribute to a demonstration of compliance with performance requirements for the WIPP disposal system. For each combination of activities (activity sets), SPM was used to calculate the probability of demonstrating compliance (PDC) if the activity set were implemented. The activity set's PDC, cost, and duration were contained in a decision matrix that was analyzed to find programmatic options that maximized the PDC while minimizing activity set cost and duration. SNL performance assessment models were used to estimate how the disposal system might perform if activities were implemented, and this evaluation was the basis for calculating each activity set's PDC. SPM analyzed roughly 46,700 activity sets. Probabilistic performance calculations for these activity sets resulted in over 1.3 million complementary cumulative distribution functions (CCDFs). A relational database on a 600-megabyte CD-ROM was used to store performance assessment results, data analysis and visualization tools, information about the activities, electronic copies of 40 CFR 191 and 40 CFR 268, technical reference papers, and the draft SPM report (Harris et al, 1996). Copies of the CD-ROM were distributed to interested members of the public, WIPP participants, and the EPA.

SPM can be described in terms of eleven key steps (Figure 1):
1) Define the performance objective(s);
2) Develop a technical baseline2 for SPM calculations;
3) Perform computer modeling of the baseline;
4) Determine whether the baseline is predicted to succeed or fail in meeting the objectives;
5) If the baseline fails to meet performance objectives, identify activities that, if implemented, could improve a predicted ability to meet the performance objectives, and elicit potential outcomes for those activities (if the baseline passes, proceed to Step 11);
6) Evaluate the baseline combined with potential outcomes of combinations of activities;
7) Create a decision matrix containing the performance results, cost, and duration for all activities and perform decision analysis to develop final recommendations;
8) Make programmatic decisions about which activities to implement, if any;
9) Implement the selected activities;
10) Update the technical baseline with actual results from the activities, and iterate the process from Step 3 as necessary until the baseline is predicted to meet the performance objectives; and,
11) Perform final performance assessment calculations with approved data and models when the baseline is predicted to comply.

SPM is distinct from performance assessment calculations for compliance in important ways. SPM, in effect, is a strategic planning approach that applied performance assessment codes at a level of abstraction sufficient to discriminate between programmatic options but insufficient for the rigor and detail required in a complete performance assessment. Maintaining this separation is important to keep probabilistic calculations tractable and to maintain an efficient planning process.

Key to how SPM works is understanding the relationship between the regulatory performance objectives, the input to and output of the performance calculations, and the tradeoff analysis between activity sets' PDC, cost, and duration. Performance assessment models are used by the WIPP project to produce information about the predicted long-term performance of the disposal system that can be compared to the regulatory requirements (WIPP PA, 1993). For WIPP, this means calculating a CCDF for radionuclide releases, which represents the probability distribution of summed, normalized radionuclide releases from the disposal system to the accessible environment, and estimating potential releases of regulated volatile organic compounds and heavy metals. The WIPP disposal system is predicted to be in compliance with the containment requirements if 1) no point on the CCDF exceeds the summed normalized release limits in 40 CFR 191.13(a) and if 2) hazardous constituent concentrations in soil do not exceed the limits in 40 CFR 268.6.
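As an illustration of this point-wise test, the sketch below compares an invented CCDF against the two-point limit commonly quoted for 40 CFR 191.13(a) (exceedance probability below 0.1 at a summed normalized release of 1, and below 0.001 at a release of 10); those limit values are stated here as background, not taken from this paper.

```python
# Sketch of the compliance test described above: a CCDF of summed,
# normalized releases R is compared point-wise against release limits.
# Limit values are the commonly quoted 40 CFR 191.13(a) pair, stated
# here as an assumption; the CCDF points are invented for illustration.

LIMITS = [(1.0, 0.1), (10.0, 1e-3)]    # (release R, max exceedance prob.)

def complies(ccdf):
    """ccdf: list of (R, P[Release > R]) points, R ascending."""
    return all(p < p_max
               for r, p in ccdf
               for r_min, p_max in LIMITS
               if r >= r_min)           # every point at/above R must sit
                                        # below the corresponding limit

example_ccdf = [(0.1, 0.5), (1.0, 0.05), (10.0, 5e-4), (100.0, 1e-6)]
print(complies(example_ccdf))          # -> True
```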
2 The technical baseline represents a current state of knowledge about the disposal system that is based on existing scientific knowledge, experimental data, and completed technical work. The technical baseline is encoded in performance assessment computational models and input parameters.
[Figure 1 (caption below) is a flowchart of the eleven SPM steps, with boxes including "1. Define Performance Objectives", "2. Define Technical Baseline", "5. Identify Activities and Elicit Potential Outcomes, Probabilities, Costs, and Durations", "6. Model Potential Outcomes of Activities", "7. Decision Analysis", "9. Implement Activities", and "11. Final Performance Calculations for Compliance", connected by Yes/No decision branches.]
Figure 1. Key steps of SPM as applied to WIPP.

While the regulatory release limits are fixed, estimates of the predicted performance of the WIPP disposal system are not; they are determined by a state of knowledge that changes over time as a result of performing scientific investigations, implementing EAs, or modifying WACs. The changed state of knowledge can alter the position of the CCDF with respect to the release limits, and the state of knowledge can be expressed, in part, through probability distributions. For example, although it is not possible to predict the solubility of plutonium in WIPP brines with absolute certainty, a range of solubilities under various chemical conditions and based on many types of existing information can be postulated, thus defining a portion of the SPM-2 calculational baseline.

Consider a scientific experiment (activity A) designed to more accurately determine the solubility of plutonium in brine. The experimental design anticipates a range of possible outcomes based on both published information and expert judgment. For simplicity, suppose that the set of experimental outcomes can be classified into two ranges (actually probability distributions), from lowest to highest solubility. Denote the event that the experimental outcomes for activity A are in the first range by xA1 and in the second range by xA2. Denote the two probability distributions corresponding to the two experimental outcomes by fA1 and fA2. After the experiment has been completed, the state of knowledge about plutonium solubility changes to reflect the new information produced by the experiment. All uncertainty, however, will not be resolved by the experiment, because uncertain repository conditions make it impossible to know with certainty what the chemical environment--and thus actual solubility--will actually be. Nonetheless, after the experiment is completed, residual uncertainty about the solubility can, again, be expressed through a probability distribution that reflects the new information and new expert judgments; this process can continue until the cost of further work is no longer justified by the potential results.

Now, suppose that we use expert judgment to specify the potential experimental outcomes xij and associated probability distributions fij (where i, j are the activity and activity outcome identifiers, respectively) before conducting the experiment, and use these probability distributions in performance assessment models to estimate the consequences. In addition to providing the xij and fij, we also use expert judgment to specify the relative likelihoods or probabilities of the events xA1 and xA2, pA1 and pA2 respectively. Suppose that performance calculations predict that, if activity A is conducted alone, event xA2 will indicate compliance with long-term performance requirements for radionuclide and hazardous material containment, but that the event xA1 will indicate noncompliance. The predicted probability of successfully demonstrating compliance for this activity--viewed prior to
conducting the experiment--is then pA2. Note that SPM-2 results showed that, when conducted alone, no single activity had a non-zero PDC, i.e., none was sufficient to produce a CCDF indicating compliance with long-term performance requirements.

Finally, consider an activity set that is composed of two activities, A and B, each with two possible outcomes, and suppose that performance results show that compliance is indicated only if 1) activity A has outcome xA1 and activity B has outcome xB2, or if 2) activity A has outcome xA2 and activity B has outcome xB2. The PDC for the activity set consisting of A and B would then equal (pA1 × pB2) + (pA2 × pB2). Because each SPM activity has at least two outcomes and because activity sets consist of between one and 26 activities, activity sets can have anywhere between two and nearly 60,000 possible outcome combinations, each of which corresponds to a CCDF and a RCRA soil concentration. Thus, the PDC for an activity set represents a logically straightforward but very computationally intense set of calculations. SPM-2 results showed that many activity sets were predicted to produce a CCDF indicating compliance with the long-term radionuclide containment requirement (Boak et al, 1996; Prindle et al, 1996c).

SPM-2 RESULTS

The first iteration of SPM (SPM-1), which was completed in September 1994, prototyped the approach implemented in the second iteration (SPM-2). SPM-2, completed in March 1995, was the basis for programmatic decision making. WIPP project technical staff, stakeholders, and oversight groups contributed to establishing the SPM-2 baseline. Technical teams also defined proposed activities and were elicited on the predicted outcomes of those activities. Trained elicitors external to the WIPP project formally elicited the technical baseline and proposed scientific activities from the technical teams. DOE/CAO and the Westinghouse Waste Isolation Division provided information regarding EAs, potential changes to WACs, and other programmatic guidance. Potential outcomes were initially elicited for 58 discrete activities, including 37 scientific investigations, 18 EAs, and three WACs; these were screened to 26 activities (Table 1), including 21 scientific investigations, three EAs, and two WACs (Prindle et al, 1996b). SPM-2 used existing WIPP performance assessment computer codes, with modifications required to model the baseline and activity sets, to calculate CCDFs of potential radionuclide releases. SPM-2 evaluated more than 600,000 possible activity sets. Activities without performance impact were removed from the decision matrix, reducing the number to roughly 46,700. SPM-2 results indicated that PDC generally increased, as expected, with increasing cost and duration. Figure 2 shows the highly nonlinear structure of the results in terms of the PDC versus activity set cost. Programmatic interdependencies were also apparent from general trends in the data and are discussed in the next section, which summarizes the statistical regression analysis of the SPM-2 results. The SPM-2 baseline calculation predicted release of radionuclides in violation of 40 CFR 191.13(a) but compliance with respect to 40 CFR 268.6. About 40% of the SPM-2 activity sets had a PDC of 0 (i.e., with no predicted value in supporting a demonstration of compliance). Of the remaining 60% of the SPM-2 activity sets, one half had a PDC equal to one. When conducted alone, all single activities--whether scientific investigation, EA, or WAC--had a zero PDC.
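The following sketch reproduces this two-activity example by brute-force enumeration: every outcome combination of the activity set is generated, and the probabilities of the combinations whose modelled performance indicates compliance are summed. The outcome probabilities and the set of compliant combinations are illustrative assumptions.

```python
# Sketch of the PDC calculation for the two-activity example above.
# Probabilities and the compliance table are invented for illustration.

from itertools import product

outcomes = {"A": {"xA1": 0.4, "xA2": 0.6},   # pA1, pA2 -- illustrative
            "B": {"xB1": 0.3, "xB2": 0.7}}   # pB1, pB2 -- illustrative

# Combinations predicted (by performance assessment runs) to comply:
compliant = {("xA1", "xB2"), ("xA2", "xB2")}

def pdc(outcomes, compliant):
    acts = sorted(outcomes)                  # fixed activity order
    total = 0.0
    for combo in product(*(outcomes[a] for a in acts)):
        if combo in compliant:               # sum P of compliant combos
            prob = 1.0
            for a, x in zip(acts, combo):
                prob *= outcomes[a][x]
            total += prob
    return total

print(pdc(outcomes, compliant))   # (pA1 + pA2) * pB2 = 0.7
```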
Activity sets with a PDC of 1.0 included one of two scientific investigations for colloids (either NS 8.1 or NS 8.2) and one of two EAs (either EA 1 or EA 2). Note that EAs and WACs were assumed to be optimally effective and were assigned a 100% probability of yielding the predicted performance. Subsequent sensitivity studies investigated the impact of this assumption on the final decision. Two WACs were analyzed by SPM-2. In the WAC-1 activity, steel drums used to store the waste were replaced with non-corrodible materials. WAC-1 added costs to the program and slightly reduced the PDC.3 WAC-2, the elimination of all high-molecular weight organic compounds (such as soils) from the waste, had no discernible impact on the PDC.
3 Refer to Prindle et al. (1996c) for a more detailed discussion of the interesting interdependencies discovered among activities. The SPM work demonstrated the importance of quantitative modeling to estimate the value of potential activities in a non-linear system.
TABLE 1. ACTIVITIES ANALYZED IN SPM-2

Activity | Code
Actinide Source Term (AST)
  Dissolved Actinide Solubilities for Oxidation States +III - +VI | AST 1.1
  Dissolved Actinide Solubilities for Oxidation States +III - +V | AST 1.2
Disposal Room (DR)
  Decomposed Waste Properties | DR 1
  Blowout Releases | DR 2
  Non-Blowout Releases | DR 3
Seals and Rock Mechanics (SL and RM)
  Rock Mechanics Studies of Short- and Long-term Components | RM 1
  Seals Tests | SL 4
Salado (SAL)
  Lab/Field Properties of Anhydrite | SAL 1
  Halite Far-Field Pore Pressure | SAL 2
  Halite Lab/Field Properties | SAL 3
  Fingering/Channeling Studies - Existing Data | SAL 4.1
  Fingering/Channeling Studies - New Data | SAL 4.2
  Anhydrite Fracture Studies | SAL 4.3
Non-Salado (NS)
  Dewey Lake - Paper and Low-Effort Field Studies | NS 1
  Culebra Fracture/Matrix/Flow - Lab | NS 2
  Culebra Fracture/Matrix/Flow - Field | NS 3
  Multi-Well Tracer Test | NS 4
  Sorbing Tracer Test | NS 5
  Chemical Retardation for Th, Np, Pu, U, and Am | NS 7
  Concentrations and Transport of Colloid Carriers: High-Molecular Weight Organic Compounds (HWMOC) and Microbes | NS 8.1
  Enhanced Colloid Experimental Program | NS 8.2
Engineered Alternatives (EAs)
  Passive Markers | EA 3
  Backfill with pH Buffer | EA 1
  Backfill with pH Buffer and Waste Form Modification | EA 2
Waste Acceptance Criteria (WAC)
  Non-Corroding Waste Containers | WAC 1
  Elimination of Humic-Containing Waste Drums | WAC 2
Based on these results, DOE/CAO had a preliminary decision to make, which was to either: 1) depend on a program consisting of EAs and minimal scientific investigations to provide a basis for the final compliance calculations; or 2) reserve EAs for possible use in providing assurance and depend on the scientific investigations to demonstrate compliance with 40 CFR 191.13(a) and 40 CFR 268.6. In May 1995, DOE/CAO chose the second option. Additional work has been conducted on EAs since the completion of SPM, and the final balance between predicted performance of the geologic system, EAs, and WACs is addressed in the compliance certification application (U.S. DOE, 1996). The final programmatic recommendations to DOE/CAO in May 1995 considered the SPM-2 results, sensitivity and uncertainty analyses, and existing information such as the 1992 WIPP PA Sensitivity Analysis (WIPP PA, 1993).

ANALYSIS OF SPM-2 RESULTS

SPM-2 generated roughly 46,700 unique activity sets. In order to determine the most favorable activity set(s) for meeting the DOE/CAO objectives, a statistical regression analysis was conducted. This analysis employed a
logit regression methodology. A logit regression assumes that a probability, p (or other number bounded by 0 and 1), is related to several independent variables through Eqn. 1:

log [p/(1-p)] = Σ bi xi    (1)

where xi is an indicator variable (equal to 0 or 1) and bi is a regression coefficient to be estimated. Here, p is the PDC. Because the left side of the equation is unbounded at p = 0 and p = 1, the PDC values were decreased slightly towards 0.5 as shown in Eqn. 2:

p = (p - 0.5)(1 - ε) + 0.5    (2)

where ε is a small number such as 0.01.

An initial inspection of activity sets in the decision matrix revealed two very strong relationships. First, if neither colloid activity (NS 8.1 nor NS 8.2) was included in an activity set, the PDC was 0. Second, if either NS 8.1 or NS 8.2 was in an activity set, the PDC was equal to 1 as long as an EA (EA 1 or EA 2) was also in the activity set, and less than 1 otherwise. Both of these relations were always true, and thus the first relation provided a sufficient condition for a PDC equal to 0. The second relation provided a condition that was both necessary and sufficient for the PDC to equal 1. These two relations logically limited the PDC of activity sets without EA 1 or EA 2 to the range 0 ≤ PDC < 1.
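Returning to Eqns. 1 and 2, the small sketch below shows why the adjustment of Eqn. 2 is needed: it keeps the logit finite for the many activity sets whose PDC is exactly 0 or 1, so the regression can be fitted over the whole decision matrix. ε = 0.01 as suggested in the text.

```python
# Sketch of Eqns. 1-2: squeeze PDC values away from 0 and 1 (Eqn. 2)
# so the logit transform (left side of Eqn. 1) stays finite.

import math

def squeeze(p, eps=0.01):
    return (p - 0.5) * (1 - eps) + 0.5       # Eqn. 2

def logit(p):
    return math.log(p / (1 - p))             # left side of Eqn. 1

for pdc in (0.0, 0.5, 1.0):
    q = squeeze(pdc)
    print(f"PDC={pdc:.2f} -> squeezed={q:.3f}, logit={logit(q):+.3f}")
# PDC=0 maps to 0.005 and PDC=1 to 0.995, so log[p/(1-p)] is defined
# for every activity set in the decision matrix.
```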
4 A pareto-optimal series maximizes the PDC gained per dollar invested and consists of activities such that, at every point, there can be no higher PDC at the same cost level.
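The footnote's definition translates directly into a simple frontier extraction over the decision matrix, sketched below with invented (cost, PDC) entries.

```python
# Sketch of the pareto-optimal selection defined in the footnote: keep
# only activity sets for which no cheaper set achieves an equal or
# higher PDC. Decision-matrix entries are invented for illustration.

sets = [("S1", 5.0, 0.20), ("S2", 8.0, 0.15), ("S3", 12.0, 0.60),
        ("S4", 20.0, 0.55), ("S5", 30.0, 0.96)]   # (name, cost, PDC)

def pareto(sets):
    best_pdc, frontier = -1.0, []
    for name, cost, pdc in sorted(sets, key=lambda s: s[1]):  # by cost
        if pdc > best_pdc:            # strictly improves on cheaper sets
            frontier.append((name, cost, pdc))
            best_pdc = pdc
    return frontier

print(pareto(sets))   # -> [('S1', 5.0, 0.2), ('S3', 12.0, 0.6), ('S5', 30.0, 0.96)]
```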
the series only brings minuscule improvements. A PDC of 0.96 is achieved from the duration-constrained pareto-optimal series.

[Figure 2 near here. Axes: PDC (0.0 to 1.0) versus cumulative cost ($1,000, 0 to 40,000). Legend: NS 2 and NS 3 - fracture versus matrix flow in Culebra (lab calculations; lab/field studies); NS 4 - multi-well tracer test; NS 5 - sorbing tracer test; NS 7 - laboratory Kds; NS 8.1 - colloid studies; RM 1 - rock mechanics; SL 4 - seals tests; DR 2 - spallings release; AST 1.2 - solubilities for +III, +IV, +V. Series shown: unconstrained duration, duration < 19 months, and a suboptimal series.]
Figure 2. PDC versus activity set cost for pareto-optimal and sub-optimal activity series. The two series on the left are both considered pareto-optimal; that is, neither series can be bettered simultaneously in both cost and PDC for its respective duration.

Faced with programmatic options limited to scientific investigations--without EAs or WAC modifications--both the duration-constrained and unconstrained activity series appear to be logical programmatic choices. However, the duration-constrained series, which eliminated two scientific activities (NS 3 and NS 5), resulted in virtually the same PDC as the unconstrained series at lower cost. SPM-2 results were the basis for recommendations to DOE/CAO in May 1995 for programmatic prioritization. DOE/CAO chose the duration-constrained series.

CONCLUSIONS

SPM identified eight key activities (less than 15% of the initial 58 activities proposed for consideration) for WIPP that, if implemented, were predicted to lead to a positive demonstration of compliance with EPA long-term performance requirements with a high level of confidence. Moreover, analysis of the results indicated that optimal programmatic options existed and that activities could be systematically cut or added if budgets changed. The analysis indicated that a demonstration of compliance could be anticipated within the DOE/CAO WIPP Disposal Decision Plan schedule. These eight key activities have now been completed, WIPP performance assessment calculations now indicate compliance with applicable EPA long-term performance requirements, and a Compliance Certification Application was submitted to the EPA on October 29, 1996 (U.S. DOE, 1996).

SPM focused on work to achieve compliance with long-term disposal system performance requirements and helped eliminate concerns that activities were not clearly and demonstrably focused on addressing regulatory issues. The use of quantitative analyses balanced with expert judgment proved essential in developing insights about decision options in a highly nonlinear system. SPM built upon the power of both performance assessment and decision analysis techniques, providing insights for decision making.
REFERENCES
Boak, D. M., Prindle, N. H., Bills, R. A., Hora, S., Lincoln, R., Mendenhall, F., and Weiner, R. (1996). Summary of the Systems Prioritization Method (SPM) as a Decision-Aiding Tool for the Waste Isolation Pilot Plant, Waste Management '96, Tucson, AZ, February 25-29, 1996. SAND95-1998C. Albuquerque, NM: Sandia National Laboratories.

EPA (Environmental Protection Agency) (1992). Land Disposal Restrictions, Code of Federal Regulations 40, Part 268. Washington, DC: Superintendent of Documents, U.S. Government Printing Office.

EPA (Environmental Protection Agency) (1993). 40 CFR Part 191: Environmental Radiation Protection Standards for the Management and Disposal of Spent Nuclear Fuel, High-Level and Transuranic Radioactive Wastes, Final Rule. Federal Register, Vol. 58, no. 242, 66398-66416.

Harris, C. L., Boak, D. M., Prindle, N. H., and Beyeler, W. (1996). The Systems Prioritization Method (SPM) CD-ROM Demonstration for Waste Management '96, Waste Management '96, Tucson, AZ, February 25-29, 1996. SAND95-2015C. Albuquerque, NM: Sandia National Laboratories.

Helton, J. C., Anderson, D. R., Baker, B. L., Bean, J. E., Berglund, J. W., Beyeler, W., Blaine, R., Economy, K., Garner, J. W., Hora, S. C., Lincoln, R. C., Marietta, M. G., Mendenhall, F. T., Prindle, N. H., Rudeen, D. K., Schreiber, J. D., Shiver, A. W., Smith, L. N., Swift, P. N., and Vaughn, P. (1996). Computational Implementation of a Systems Prioritization Methodology for the Waste Isolation Pilot Plant: A Preliminary Example. SAND94-3069. Albuquerque, NM: Sandia National Laboratories.

Prindle, N. H., Mendenhall, F. T., Boak, D. M., Beyeler, W., Rudeen, D., Lincoln, R. C., Trauth, K., Anderson, D. R., Marietta, M. G., and Helton, J. C. (1996a). The Second Iteration of the Systems Prioritization Method: A Systems Prioritization and Decision-Aiding Tool for the Waste Isolation Pilot Plant. Volume I: Synopsis of Method and Results. SAND95-2017/1. Albuquerque, NM: Sandia National Laboratories.

Prindle, N. H., Mendenhall, F. T., Beyeler, W., Trauth, K., Hora, S., Rudeen, D., and Boak, D. M. (1996b). The Second Iteration of the Systems Prioritization Method: A Systems Prioritization and Decision-Aiding Tool for the Waste Isolation Pilot Plant. Volume II: Summary of Technical Input and Model Implementation. SAND95-2017/2. Albuquerque, NM: Sandia National Laboratories.

Prindle, N. H., Boak, D. M., Weiner, R. F., Beyeler, W., Hora, S., Marietta, M. G., Helton, J. C., Rudeen, D., Jow, H., and Tierney, M. (1996c). The Second Iteration of the Systems Prioritization Method: A Systems Prioritization and Decision-Aiding Tool for the Waste Isolation Pilot Plant. Volume III: Analysis for Final Programmatic Recommendations. SAND95-2017/3. Albuquerque, NM: Sandia National Laboratories.

U.S. DOE (Department of Energy), Carlsbad Area Office. (1996). Title 40 CFR Part 191: Compliance Certification Application for the Waste Isolation Pilot Plant. DOE/CAO-1996-2184. Carlsbad, NM: U.S. Department of Energy, Waste Isolation Pilot Plant, Carlsbad Area Office. 21 Volumes.

WIPP PA (Performance Assessment) Department. (1993). Preliminary Performance Assessment for the Waste Isolation Pilot Plant, December 1992. Volume 4: Uncertainty and Sensitivity Analyses for 40 CFR 191, Subpart B. SAND92-0700/4. Albuquerque, NM: Sandia National Laboratories.
CONCEPTUAL AND COMPUTATIONAL STRUCTURE OF THE 1996 PERFORMANCE ASSESSMENT FOR THE WASTE ISOLATION PILOT PLANT D.R. Anderson, 1 J.C. Helton, 2 H.-N. Jow, 1 M.G. Marietta, 1 M.S.Y. Chu, 1 L.E. Shephard, 1 G. Basabilvazo 3 1 Sandia National Laboratories, Albuquerque, NM 87185-1328 USA 2 Department of Mathematics, Arizona State University, Tempe, AZ 85287-1804 USA 3 U.S. Department of Energy, Carlsbad, NM 88221 USA
ABSTRACT

The Waste Isolation Pilot Plant (WIPP) is being developed by the U.S. Department of Energy for the geologic (deep underground) disposal of transuranic waste. An application for the certification of the WIPP for such disposal was submitted to the U.S. Environmental Protection Agency (EPA) in October 1996, and is currently under review, with a decision anticipated in late 1997. An important component of the certification application is a performance assessment (PA) for the WIPP carried out by Sandia National Laboratories. The final outcome of the PA is a complementary cumulative distribution function (CCDF) for radionuclide releases from the WIPP to the accessible environment and an assessment of the confidence with which this CCDF can be estimated. This presentation describes the conceptual and computational structure used to develop the preceding CCDF.

KEYWORDS

Performance assessment, radioactive waste disposal, uncertainty analysis, Waste Isolation Pilot Plant.
1. INTRODUCTION

The Waste Isolation Pilot Plant (WIPP) is located in southeastern New Mexico and is being developed by the U.S. Department of Energy (DOE) for the geologic (deep underground) disposal of transuranic (TRU) waste. Waste disposal will take place in panels excavated in bedded salt approximately 2000 ft below the land surface (Figure 1, Helton 1996). As part of the development process for the WIPP, a sequence of performance assessments (PAs) has been carried out by Sandia National Laboratories (SNL) to organize knowledge currently available about the WIPP and to provide guidance for future research and development efforts (WIPP PA 1991-1992, 1992-1993). The structure of these PAs derives from the U.S. Environmental Protection Agency's (EPA's) regulation for the geologic disposal of radioactive waste: 40 CFR 191, Subpart B: Environmental Radiation Protection Standards for the Management and Disposal of Spent Nuclear Fuel, High-Level and Transuranic Radioactive Wastes (U.S. EPA 1993). The most recent iteration of these PAs was completed in the summer of 1996 and supports an application by the DOE to the EPA for the certification of the WIPP for the disposal of TRU waste (U.S. DOE 1996). This paper presents an overview of the conceptual and computational structure used in this PA to assess compliance with 40 CFR 191 and,
together with its companion paper (Helton et al. 1997), provides an update to the preliminary description given in Helton (1996). The following is the central requirement of 40 CFR 191, Subpart B, and the primary focus of this paper:

§ 191.13 Containment requirements.
(a) Disposal systems for spent nuclear fuel or high-level or transuranic radioactive wastes shall be designed to provide a reasonable expectation, based upon performance assessments, that cumulative releases of radionuclides to the accessible environment for 10,000 years after disposal from all significant processes and events that may affect the disposal system shall: (1) Have a likelihood of less than one chance in 10 of exceeding the quantities calculated according to Table 1 (Appendix A); and (2) Have a likelihood of less than one chance in 1,000 of exceeding ten times the quantities calculated according to Table 1 (Appendix A).
(b) Performance assessments need not provide complete assurance that the requirements of 191.13(a) will be met. Because of the long time period involved and the nature of the events and processes of interest, there will inevitably be substantial uncertainties in projecting disposal system performance. Proof of the future performance of a disposal system is not to be had in the ordinary sense of the word in situations that deal with much shorter time frames. Instead, what is required is a reasonable expectation, on the basis of the record before the implementing agency, that compliance with 191.13(a) will be achieved.

Containment Requirement 191.13(a) refers to "quantities calculated according to Table 1 (Appendix A)," which means a normalized radionuclide release to the accessible environment based on the type of waste being disposed of, the initial waste inventory, and the release that takes place (App. A, U.S. EPA 1985). Table 1 (Appendix A) of U.S. EPA 1985 specifies allowable releases (i.e., release limits) for individual radionuclides. The WIPP is intended for TRU waste, which is defined to be "waste containing more than 100 nanocuries of alpha-emitting transuranic isotopes, with half-lives greater than twenty years, per gram of waste" (p. 38084, U.S. EPA 1985). Specifically, the normalized release R for transuranic waste is defined by
$$R = \sum_i \left(Q_i / L_i\right)\left(1 \times 10^6\ \mathrm{Ci} / C\right) \qquad (1)$$
where $Q_i$ is the cumulative release of radionuclide $i$ to the accessible environment during the 10,000-yr period following closure of the repository (Ci), $L_i$ is the release limit (Ci) for radionuclide $i$ (Table 1, App. A, U.S. EPA 1985) and $C$ is the amount of TRU waste emplaced in the repository (Ci). For the 1996 WIPP PA, $C = 3.44 \times 10^6$ Ci.

To help clarify the intent of 40 CFR 191, the EPA also published 40 CFR 194, Criteria for the Certification and Recertification of the Waste Isolation Pilot Plant's Compliance with 40 CFR Part 191 Disposal Regulations; Final Rule (U.S. EPA 1996). There, the following elaboration on the intent of 40 CFR 191.13 appears (pp. 5242-5243, U.S. EPA 1996):

§ 194.34 Results of performance assessments.
(a) The results of performance assessments shall be assembled into "complementary, cumulative distribution functions" (CCDFs) that represent the probability of exceeding various levels of cumulative release caused by all significant processes and events.
(b) Probability distributions for uncertain disposal system parameter values used in performance assessments shall be developed and documented in any compliance application.
(c) Computational techniques, which draw random samples from across the entire range of the probability distributions developed pursuant to paragraph (b) of this section, shall be used in generating CCDFs and shall be documented in any compliance application.
(d) The number of CCDFs generated shall be large enough such that, at cumulative releases of 1 and 10, the maximum CCDF generated exceeds the 99th percentile of the population of
CCDFs with at least a 0.95 probability.
(e) Any compliance application shall display the full range of CCDFs generated.
(f) Any compliance application shall provide information which demonstrates that there is at least a 95 percent level of statistical confidence that the mean of the population of CCDFs meets the containment requirements of § 191.13 of this chapter.

When viewed at a high level, three basic entities underlie the results required in 191.13 and 194.34 and ultimately determine the conceptual and computational structure of the 1996 WIPP PA: EN1, a probabilistic characterization of the likelihood of different futures occurring at the WIPP site over the next 10,000 yr; EN2, a procedure for estimating the radionuclide releases to the accessible environment associated with each of the possible futures that could occur at the WIPP site over the next 10,000 yr; and EN3, a probabilistic characterization of the uncertainty in the parameters used in the definition of EN1 and EN2. Together, EN1 and EN2 give rise to the CCDF specified in 191.13(a) (Figure 1), and EN3 corresponds to the distributions indicated in 194.34(b).
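Once the cumulative releases $Q_i$ are available, Eq. (1) is straightforward to evaluate. A minimal sketch follows; the 100 Ci release limits are the values listed for these radionuclides in Table 1 (App. A, U.S. EPA 1985), while the $Q_i$ values are placeholders rather than results from the 1996 PA:

    # Normalized release R of Eq. (1): R = sum_i (Q_i / L_i) * (1e6 Ci / C).
    C = 3.44e6  # Ci of TRU waste emplaced in the repository (1996 WIPP PA)

    releases = {               # radionuclide: (Q_i in Ci, L_i in Ci)
        "Am-241": (12.0, 100.0),   # Q_i values here are illustrative only
        "Pu-239": (30.0, 100.0),
        "U-234":  (0.5, 100.0),
    }

    R = sum(Q / L for Q, L in releases.values()) * (1.0e6 / C)
    print(f"normalized release R = {R:.4f} EPA units")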
2. EN1: PROBABILISTIC CHARACTERIZATION OF DIFFERENT FUTURES

The entity EN1 is the outcome of the scenario development process for the WIPP and provides a probabilistic characterization of the likelihood of different futures that could occur at the WIPP over the next 10,000 yr as specified in 191.13(a). When viewed formally, EN1 is defined by a probability space $(\mathcal{S}_{st}, \mathbb{S}_{st}, p_{st})$, with the sample space $\mathcal{S}_{st}$ given by

$$\mathcal{S}_{st} = \{\mathbf{x}_{st}: \mathbf{x}_{st} \text{ is a possible 10,000 yr sequence of occurrences at the WIPP}\}. \qquad (2)$$

The subscript $st$ refers to stochastic (i.e., aleatory) uncertainty and is used because $(\mathcal{S}_{st}, \mathbb{S}_{st}, p_{st})$ provides a probabilistic characterization of occurrences that may take place in the future.
[Figure 1: Boundary line and associated CCDF specified in 191.13(a). The boundary line passes through the points (1, 0.1) and (10, 0.001); the abscissa is R, release to accessible environment ($10^{-5}$ to $10^2$), and the ordinate is the probability of release > R ($10^{-6}$ to 1). The CCDF is defined by

$$\mathrm{prob}(\mathrm{Rel} > R) = \int_{\mathcal{S}_{st}} \delta_R[f(\mathbf{x}_{st})]\, d_{st}(\mathbf{x}_{st})\, dV_{st}, \quad \text{where } \delta_R[f(\mathbf{x}_{st})] = \begin{cases} 1 & \text{if } f(\mathbf{x}_{st}) > R \\ 0 & \text{otherwise.} \end{cases}$$

]
The following guidance is given by the EPA (p. 5242, U.S. EPA 1996):

§ 194.32 Scope of performance assessments.
(a) Performance assessments shall consider natural processes and events, mining, deep drilling, and shallow drilling that may affect the disposal system during the regulatory time frame.
(b) Assessments of mining effects may be limited to changes in the hydraulic conductivity of the hydrogeologic units of the disposal system from excavation mining for natural resources. Mining shall be assumed to occur with a one in 100 probability in each century of the regulatory time frame.

This guidance and an extensive review of possible disruptions at the WIPP led to drilling intrusions and potash mining being the only occurrences incorporated into the definition of $\mathcal{S}_{st}$. Specifically, the elements $\mathbf{x}_{st}$ of $\mathcal{S}_{st}$ are vectors of the form

$$\mathbf{x}_{st} = [\underbrace{t_1, l_1, e_1, b_1, p_1, a_1}_{1\text{st intrusion}},\ \underbrace{t_2, l_2, e_2, b_2, p_2, a_2}_{2\text{nd intrusion}},\ \ldots,\ \underbrace{t_n, l_n, e_n, b_n, p_n, a_n}_{n\text{th intrusion}},\ t_{min}] \qquad (3)$$
in the 1996 WIPP PA, where $n$ is the number of drilling intrusions, $t_i$ is the time (yr) of the $i$th intrusion, $l_i$ designates the location of the $i$th intrusion, $e_i$ designates the penetration of an excavated or nonexcavated area by the $i$th intrusion, $b_i$ designates whether or not the $i$th intrusion penetrates pressurized brine in the Castile Formation, $p_i$ designates the plugging procedure used with the $i$th intrusion (i.e., continuous plug, two discrete plugs, three discrete plugs), $a_i$ designates the type of waste penetrated by the $i$th intrusion (i.e., no waste, contact-handled (CH) waste, remote-handled (RH) waste), and $t_{min}$ is the time (yr) at which potash mining occurs.

The following guidance is also given by the EPA (p. 5242, U.S. EPA 1996):

§ 194.33 Consideration of drilling events in performance assessments.
(2) In performance assessments, drilling events shall be assumed to occur in the Delaware Basin at random intervals in time and space during the regulatory time frame.
(3) The frequency of deep drilling shall be calculated in the following manner: (i) Identify deep drilling that has occurred for each resource in the Delaware Basin over the past 100 years prior to the time at which a compliance application is prepared. (ii) The total rate of deep drilling shall be the sum of the rates of deep drilling for each resource.

This guidance led to the drilling rate in the vicinity of the WIPP being determined as 46.8 intrusions/km²/10⁴ yr, which leads to a rate of

$$\lambda_d = (46.8/\mathrm{km}^2/10^4\ \mathrm{yr})(0.6285\ \mathrm{km}^2) = 2.94 \times 10^{-3}\ \mathrm{yr}^{-1} \qquad (4)$$
for intrusions into the area (0.6285 km²) marked by a berm used as part of a passive marker system (Figure 3, Helton 1996). Further, 100 yr of active institutional control (§194.41, p. 5243, U.S. EPA 1996) and 600 yr of passive institutional control (§194.43, p. 5243, U.S. EPA 1996) lead to the following time-dependent drilling rate: $\lambda_d(t) = 0\ \mathrm{yr}^{-1}$ for $0 < t < 100$ yr, $\lambda_d(t) = 2.94 \times 10^{-5}\ \mathrm{yr}^{-1}$ for $100 < t < 700$ yr, and $\lambda_d(t) = 2.94 \times 10^{-3}\ \mathrm{yr}^{-1}$ for $700 < t < 10{,}000$ yr.

Drilling intrusions are assumed to be equally likely to occur at each node used in a discretization of the repository (Figure 3, Helton 1996). Further, the analysis uses specified probabilities for: encountering no waste (0.80) (i.e., an intrusion into a nonexcavated area), CH waste (0.18), or RH waste (0.02); encountering pressurized brine (0.08); and use of a one- (0.02), two- (0.68) or three-plug (0.30) procedure to seal boreholes. The CH waste is emplaced in the repository in 55-gallon drums that come from 569 distinct waste streams, which also have assigned probabilities. As the CH waste
is emplaced in the repository in drums stacked three high, each drilling intrusion into CH waste is assumed to intersect three randomly selected waste streams, which leads to the vector notation used for $a_i$. Finally, the distribution for $t_{min}$ is defined by the assumption that potash mining occurs at a rate of $\lambda_m = 1 \times 10^{-4}\ \mathrm{yr}^{-1}$ (194.32(b)), with a time-dependent rate $\lambda_m(t)$ then defined in the same manner as shown for $\lambda_d(t)$. The preceding assumptions define $(\mathcal{S}_{st}, \mathbb{S}_{st}, p_{st})$.
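A future $\mathbf{x}_{st}$ of the form in Eq. (3) can be sampled directly from these assumptions. The sketch below (illustrative Python, not the PA software) draws intrusion times from the piecewise-constant Poisson rate $\lambda_d(t)$ and intrusion attributes from the stated probabilities; intrusion locations, waste-stream identities, and the institutional-control scaling of the mining rate are simplified or omitted:

    import numpy as np

    rng = np.random.default_rng()

    # Piecewise-constant drilling rate lambda_d(t) in 1/yr (Eq. 4 plus the
    # active/passive institutional-control assumptions in the text).
    INTERVALS = [(0.0, 100.0, 0.0),
                 (100.0, 700.0, 2.94e-5),
                 (700.0, 10000.0, 2.94e-3)]

    def sample_future():
        # Sample the drilling intrusions of one x_st (Eq. 3); locations and
        # waste-stream identities are omitted for brevity.
        intrusions = []
        for t0, t1, lam in INTERVALS:
            if lam == 0.0:
                continue
            n = rng.poisson(lam * (t1 - t0))  # homogeneous Poisson on [t0, t1)
            for t in np.sort(rng.uniform(t0, t1, size=n)):
                intrusions.append({
                    "time_yr": float(t),
                    "waste": rng.choice(["none", "CH", "RH"], p=[0.80, 0.18, 0.02]),
                    "castile_brine": bool(rng.random() < 0.08),
                    "plugs": int(rng.choice([1, 2, 3], p=[0.02, 0.68, 0.30])),
                })
        # Mining time t_min at rate 1e-4/yr; the institutional-control scaling
        # of lambda_m(t) is simplified to "no mining before 700 yr" here.
        t_min = 700.0 + rng.exponential(1.0 / 1.0e-4)
        return intrusions, (t_min if t_min < 10000.0 else None)

    intrusions, t_min = sample_future()
    print(len(intrusions), "intrusions; mining at", t_min)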
3. EN2: ESTIMATION OF RELEASES

The entity EN2 is the outcome of the model development process for the WIPP and provides a way to estimate radionuclide releases to the accessible environment (i.e., values for $Q_i$ and hence $R$ in Eq. (1)) for the different futures (i.e., elements $\mathbf{x}_{st}$ of $\mathcal{S}_{st}$) that could occur at the WIPP. Estimation of environmental releases corresponds to evaluation of the function $f$ in Figure 1. Release mechanisms associated with $f$ include direct removal to the surface at the time of a drilling intrusion (i.e., cuttings, spallings, brine flow) and release subsequent to a drilling intrusion due to brine flow up a borehole with a degraded plug (i.e., groundwater transport).

The primary computational models in the 1996 WIPP PA are illustrated in Figure 2. Most of these models involve the numerical solution of partial differential equations used to represent material deformation, fluid flow or radionuclide transport. It is the models in Figure 2 that actually define the function $f$ in Figure 1. As indicated in Figure 1, the CCDF specified in 191.13(a) can be formally defined by an integral of $f$ over $\mathcal{S}_{st}$. In practice, this CCDF is never obtained by direct evaluation of an integral due to the complexity of $f$ and $\mathcal{S}_{st}$. Rather, an approximation procedure based on importance sampling (Helton and Iuzzolino 1993) or Monte Carlo (random) sampling (Helton and Shiver 1996) is used. The 1996 WIPP PA uses a Monte Carlo procedure. Specifically, elements $\mathbf{x}_{st,i}$, $i = 1, 2, \ldots, nS$, are randomly sampled from $\mathcal{S}_{st}$ in consistency with the definition of $(\mathcal{S}_{st}, \mathbb{S}_{st}, p_{st})$. Then, the integral in Figure 1, and hence the associated CCDF, is approximated by

$$\mathrm{prob}(\mathrm{Rel} > R) = \int_{\mathcal{S}_{st}} \delta_R[f(\mathbf{x}_{st})]\, d_{st}(\mathbf{x}_{st})\, dV_{st} \approx \sum_{i=1}^{nS} \delta_R[f(\mathbf{x}_{st,i})]/nS. \qquad (5)$$
[Figure 2: Computer programs (models) used in the 1996 WIPP PA. A schematic cross-section of the disposal system (shafts, upper shaft seal system, panel seals, anhydrite layers, marker bed MB139, brine reservoir, Culebra, and the boundary of the accessible environment; not to scale) is annotated with the models: CUTTINGS_S (release of cuttings, spallings and brine to the surface), GRASP-INV (transmissivity fields), BRAGFLO (brine flow), PANEL/NUTS (radionuclide concentration), and SECOFL2D/SECOTP2D (flow and transport in the Culebra).]
The models in Figure 2 are too computationally intensive to permit their evaluation for every element $\mathbf{x}_{st,i}$ of $\mathcal{S}_{st}$ in Eq. (5). Due to this constraint, the models in Figure 2 are evaluated for representative elements of $\mathcal{S}_{st}$ and then the results of these evaluations are used to construct values of $f$ for the large number of $\mathbf{x}_{st,i}$ in Eq. (5).
4. EN3: PROBABILISTIC CHARACTERIZATION OF PARAMETER UNCERTAINTY

The entity EN3 is the outcome of the data development effort for the WIPP and provides a probabilistic characterization of the uncertainty in the parameters that underlie the WIPP PA. When viewed formally, EN3 is defined by a probability space $(\mathcal{S}_{su}, \mathbb{S}_{su}, p_{su})$, with the sample space $\mathcal{S}_{su}$ given by

$$\mathcal{S}_{su} = \{\mathbf{x}_{su}: \mathbf{x}_{su} \text{ is possibly the correct vector of parameter values to use in the WIPP PA}\}. \qquad (6)$$

The subscript $su$ refers to subjective (i.e., epistemic) uncertainty and is used because $(\mathcal{S}_{su}, \mathbb{S}_{su}, p_{su})$ provides a probabilistic characterization of where the appropriate inputs to use in the WIPP PA are believed to be located. In practice, $\mathbf{x}_{su}$ is a vector of the form $\mathbf{x}_{su} = [x_1, x_2, \ldots, x_{nV}]$, where $nV$ is the number of uncertain variables under consideration, and $(\mathcal{S}_{su}, \mathbb{S}_{su}, p_{su})$ is obtained by specifying a distribution $D_j$, $j = 1, 2, \ldots, nV$, for each element $x_j$ of $\mathbf{x}_{su}$. The preceding distributions correspond to the distributions in 194.34(b). In concept, some elements of $\mathbf{x}_{su}$ can affect the definition of $(\mathcal{S}_{st}, \mathbb{S}_{st}, p_{st})$ (e.g., the rate constant $\lambda_d$ in Eq. (4) used to define the Poisson process for drilling intrusions) and other elements relate to the models in Figure 2 that determine the function $f$ in Figure 1 and Eq. (5) (e.g., radionuclide solubilities in Castile brine or fracture spacing in the Culebra Dolomite). However, all elements of $\mathbf{x}_{su}$ in the 1996 WIPP PA relate to the models in Figure 2 (Table 1, Helton et al. 1997).

If the value for $\mathbf{x}_{su}$ was precisely known, then the CCDF in Figure 1 could be determined with certainty and compared with the boundary line specified in 191.13(a). However, given the complexity of the WIPP site and the 10,000 yr period under consideration, $\mathbf{x}_{su}$ can never be known with certainty. Rather, uncertainty in
$\mathbf{x}_{su}$ as characterized by $(\mathcal{S}_{su}, \mathbb{S}_{su}, p_{su})$ will lead to a distribution of CCDFs as indicated in 194.34(c) and (e) (Figure 3a). The proximity of this distribution to the boundary line in Figure 1 provides an indication of the confidence that 191.13(a) will be met as required in 191.13(b).

The distribution of CCDFs in Figure 3a can be summarized by distributions of exceedance probabilities conditional on individual release values (Figure 3b). In concept, these distributions are defined by double integrals over $\mathcal{S}_{su}$ and $\mathcal{S}_{st}$ (Helton 1996). In practice, these integrals are too complex to permit a closed-form evaluation. Instead, the 1996 WIPP PA uses Latin hypercube sampling (McKay et al. 1979) to evaluate the integrals over $\mathcal{S}_{su}$ and, as indicated in Eq. (5), simple random sampling to evaluate the integrals over $\mathcal{S}_{st}$. Specifically, a Latin hypercube sample (LHS) $\mathbf{x}_{su,k}$, $k = 1, 2, \ldots, nLHS$, is generated from $\mathcal{S}_{su}$ in consistency with the definition of $(\mathcal{S}_{su}, \mathbb{S}_{su}, p_{su})$, and a random sample as indicated in conjunction with Eq. (5) is generated from $\mathcal{S}_{st}$ in consistency with the definition of $(\mathcal{S}_{st}, \mathbb{S}_{st}, p_{st})$. The quantile values in Figure 3b are then approximated by solving
$$\mathrm{prob}(p \le P \mid R) \approx 1 - \left[\sum_{k=1}^{nLHS} \delta_P\!\left(\sum_{i=1}^{nS} \delta_R[f(\mathbf{x}_{st,i}, \mathbf{x}_{su,k})]/nS\right)\right] \Big/ nLHS \qquad (7)$$

for $P$ with $\mathrm{prob}(p \le P \mid R) = 0.1$, 0.5 and 0.9, respectively, where $\delta_P$ is defined analogously to $\delta_R$ in Figure 1. In the preceding, $f$ is shown as a function of $\mathbf{x}_{st,i}$ and $\mathbf{x}_{su,k}$ to emphasize that its evaluation depends on both of these quantities; the summations derive from integrals over $\mathcal{S}_{su}$ and $\mathcal{S}_{st}$, respectively. Similarly, the mean exceedance probability $\bar{P}$ is approximated by
$$\bar{P} \approx \left[\sum_{k=1}^{nLHS} \left(\sum_{i=1}^{nS} \delta_R[f(\mathbf{x}_{st,i}, \mathbf{x}_{su,k})]/nS\right)\right] \Big/ nLHS. \qquad (8)$$

[Figure 3: Distribution of CCDFs for total normalized release to accessible environment due to cuttings, spallings and direct brine release: (3a) individual CCDFs, and (3b) mean and quantile curves. Both frames are for replicate R1 (100 observations, 10,000 futures per observation) and plot probability of release > R against normalized release R (EPA units), with the EPA limit shown; frame 3b shows the mean and the 10th, 50th and 90th quantile curves.]
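Eqs. (7)-(8) amount to computing, for each LHS element, a conditional exceedance probability over the nS futures and then summarizing across the nLHS elements. A sketch under that reading (array shapes and values are hypothetical):

    import numpy as np

    def exceedance_summary(f_vals, r_grid, quantiles=(0.1, 0.5, 0.9)):
        # f_vals : (nLHS, nS) normalized releases f(x_st,i, x_su,k),
        #          one row of nS sampled futures per LHS element.
        # p[k, j] = conditional CCDF for LHS element k at release r_grid[j].
        f_vals = np.asarray(f_vals)
        p = (f_vals[:, :, None] > r_grid[None, None, :]).mean(axis=1)
        curves = {f"P{q:g}": np.quantile(p, q, axis=0) for q in quantiles}
        curves["mean"] = p.mean(axis=0)   # Eq. (8)
        return curves

    # Hypothetical use: nLHS = 100 elements, nS = 1000 futures each.
    rng = np.random.default_rng(2)
    f_vals = rng.lognormal(-6.0, 2.0, size=(100, 1000))
    curves = exceedance_summary(f_vals, np.logspace(-5, 2, 30))
    print(curves["mean"][:5])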
The results of the preceding calculations are typically displayed by plotting quantile values (e.g., $P_{0.1}$, $P_{0.5}$, $P_{0.9}$ obtained for $\mathrm{prob}(p \le P \mid R) = 0.1$, 0.5, 0.9) and also the mean values (i.e., $\bar{P}$) for the exceedance probabilities above individual release values (i.e., $R$) and then connecting these points to form continuous curves (Figure 3b).

5. COMPUTATIONAL DETAILS OF THE 1996 WIPP PA
The requirements in 194.34(c), (d) and (f) interact in determining the details of the 1996 WIPP PA. Requirements 194.34(c) and (d) can be satisfied with a random sample from $\mathcal{S}_{su}$ of size 298 (i.e., $1 - 0.99^n \ge 0.95$ yields $n = 298$). However, the WIPP PA decided to use Latin hypercube sampling because of the efficient manner in which it stratifies across the range of each uncertain parameter (McKay et al. 1979) and the observed stability of uncertainty and sensitivity analysis results produced in past analyses that involved a separation of stochastic and subjective uncertainty (Iman and Helton 1991, Helton et al. 1995). Given that Latin hypercube sampling is to be used, the confidence intervals required in 194.34(f) can be obtained with a replicated sampling technique proposed by R.L. Iman (1982). In this technique, the LHS indicated in conjunction with Eq. (7) is repeatedly generated with different random seeds. These samples lead to a sequence $\bar{P}_r(R)$, $r = 1, 2, \ldots, nR$, of estimated mean exceedance probabilities, where $\bar{P}_r(R)$ defines the mean CCDF obtained for sample $r$ (i.e., $\bar{P}_r(R)$ is the mean probability that a normalized release of size $R$ will be exceeded) and $nR$ is the number of independent LHSs generated with different random seeds. Then,
$$\bar{P}(R) = \sum_{r=1}^{nR} \bar{P}_r(R)/nR \quad \text{and} \quad SE(R) = \left[\sum_{r=1}^{nR} \left(\bar{P}_r(R) - \bar{P}(R)\right)^2 \Big/ nR(nR-1)\right]^{1/2} \qquad (9)$$
provide an additional estimate of the mean CCDF and an estimate of the standard error associated with the mean exceedance probabilities. The t-distribution with $nR-1$ degrees of freedom can be used to place confidence intervals around the mean exceedance probabilities for individual $R$ values (i.e., around $\bar{P}(R)$). Specifically, the $1-\alpha$ confidence interval is given by $\bar{P}(R) \pm t_{1-\alpha/2}\, SE(R)$, where $t_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the t-distribution with $nR-1$ degrees of freedom (e.g., $t_{1-\alpha/2} = 4.303$ for $\alpha = 0.05$ and $nR = 3$). The same procedure can also be used to place pointwise confidence intervals around percentile curves.

To implement the preceding procedure, the 1996 WIPP PA used $nR = 3$ replicated LHSs of size $nLHS = 100$ each, with these replicated samples denoted by R1, R2 and R3. This produced a total of 300 observations, which is approximately the same as the sample size of 298 indicated above. Each sample was generated with the restricted pairing technique developed by Iman and Conover (1982) to induce specified rank correlations between correlated variables and also to assure that uncorrelated variables had correlations close to zero.

Once the indicated LHSs were generated, calculations were performed with the models in Figure 2 for the individual sample elements. The number of individual model calculations was too large to describe here. However, the basic strategy was to avoid the unnecessary proliferation of computationally demanding calculations by identifying situations where (1) a single computationally demanding calculation could be used to supply input to several less demanding calculations, (2) mathematical properties of the models could be used to extend the results of a single calculation to many different situations, or (3) a relatively inexpensive screening calculation could be used to determine if a more detailed, and hence more expensive, calculation was needed. As examples, (1) each BRAGFLO calculation, which involves the numerical solution of a system of nonlinear partial differential equations and is quite demanding computationally (i.e., 1-2 hrs of CPU time on a Digital VAX Alpha using VMS), was used to supply conditions that were used in a number of different calculations with the CUTTINGS_S, BRAGFLO_DBR, NUTS and PANEL models indicated in Figure 2; (2) the linearity of the system of partial differential equations that underlies SECOTP2D made it possible to perform transport calculations for unit releases of individual radionuclides to the Culebra Dolomite and then use the outcome of these calculations to construct transport results for arbitrary time-dependent radionuclide releases into the Culebra; and (3) transport calculations with NUTS were initially performed with a nondecaying tracer and then calculations with radionuclides were only performed for those cases that had a potential to result in a radionuclide release from the repository. The analysis effort for all three replicates was still quite large and involved 1800 BRAGFLO calculations, 15,600 CUTTINGS_S calculations, 15,600 BRAGFLO_DBR calculations, approximately 1500 screening and 500 full calculations with NUTS, 2100 PANEL calculations, 100 GRASP_INV calculations, 600 SECOFL2D calculations, and 1200 SECOTP2D calculations.

The outcome of these calculations was a set of results for each LHS element. As discussed in conjunction with Eq. (5), Monte Carlo procedures were then used to construct a CCDF for each LHS element. Specifically, this CCDF was produced from $nS = 10{,}000$ randomly selected futures of the form shown in Eq. (3), where $nS$ is the sample size in Eq. (5). Once each future $\mathbf{x}_{st,i}$ was sampled, the corresponding normalized release $f(\mathbf{x}_{st,i})$ was constructed from releases to the accessible environment calculated with CUTTINGS_S, BRAGFLO_DBR, PANEL, NUTS and SECOTP2D.
In this procedure, extensive algebraic manipulations and interpolations were performed to estimate releases for futures involving multiple intrusions from the results of the previously indicated calculations for one or two intrusions at fixed points in time. Once values for $f(\mathbf{x}_{st,i})$ were determined, which correspond to the normalized release $R$ in Eq. (1), the CCDF specified in 191.13(a) was readily constructed. Repetition of the preceding procedure for each LHS element yielded a distribution of CCDFs for each of the $nR = 3$ replicates as requested in 194.34(e); results for replicate R1 are shown in Figure 3. Further, the replicated samples and the procedure in Eq. (9) provided a basis for the estimation of confidence intervals as requested in 194.34(f); results proved to be quite stable across the three replicates (Figure 4).
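Eq. (9) and the associated t-interval are simple to implement once the replicate mean curves are available; a sketch follows (the replicate values are made up):

    import numpy as np
    from scipy import stats

    def mean_ccdf_ci(rep_means, alpha=0.05):
        # Eq. (9): pooled mean CCDF and t-based confidence band from nR
        # replicated LHSs; rep_means has one row P_r(R) per replicate.
        rep_means = np.asarray(rep_means)
        nR = rep_means.shape[0]
        pbar = rep_means.mean(axis=0)
        se = np.sqrt(((rep_means - pbar) ** 2).sum(axis=0) / (nR * (nR - 1)))
        t = stats.t.ppf(1.0 - alpha / 2.0, df=nR - 1)  # 4.303 for nR = 3
        return pbar, pbar - t * se, pbar + t * se

    # Hypothetical use with three made-up replicate curves at two R values:
    reps = np.array([[0.10, 0.02], [0.12, 0.03], [0.11, 0.025]])
    mean, lo, hi = mean_ccdf_ci(reps)
    print(mean, lo, hi)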
[Figure 4: Distribution of CCDFs for total normalized release to accessible environment due to cuttings, spallings and direct brine release: (4a) mean and quantile curves for individual replicates, and (4b) 95% confidence intervals (CIs) on mean exceedance probabilities obtained from all three replicates. Both frames use replicates R1, R2 and R3 (300 observations, 10,000 futures per observation) and plot probability of release > R against normalized release R (EPA units), with the EPA limit shown.]

6. STATUS
The calculations described in this presentation were completed in September of 1996 and formed the basis of an application by the DOE to the EPA in October of 1996 for the certification of the WIPP for the disposal of TRU waste (U.S. DOE 1996). This application is currently under review by the EPA and a decision is expected in late 1997. If certified, the WIPP will be the first facility for the geologic (deep underground) disposal of radioactive waste to begin operation.

REFERENCES
Helton, J.C. (1996). Computational Structure of a Performance Assessment Involving Stochastic and Subjective Uncertainty. Proceedings of the 1996 Winter Simulation Conference (eds. J.M. Charnes, et al.), pp. 239-247.

Helton, J.C. and Iuzzolino, H.J. (1993). Construction of Complementary Cumulative Distribution Functions for Comparison with the EPA Release Limits for Radioactive Waste Disposal. Reliability Engineering and System Safety 40: 277-293.

Helton, J.C. and Shiver, A.W. (1996). A Monte Carlo Procedure for the Construction of Complementary Cumulative Distribution Functions for Comparison with the EPA Release Limits for Radioactive Waste Disposal. Risk Analysis 16: 43-55.

Helton, J.C., et al. (1995). Robustness of an Uncertainty and Sensitivity Analysis of Early Exposure Results with the MACCS Reactor Accident Consequence Model. Reliability Engineering and System Safety 48: 129-148.

Helton, J.C., et al. (1997). Uncertainty and Sensitivity Analysis in the 1996 Performance Assessment for the Waste Isolation Pilot Plant. Proceedings of International Conference on Safety and Reliability, Lisbon, 17-20 June 1997, to appear.
Iman, R.L. (1982). Statistical Methods for Including Uncertainties Associated with Geologic Isolation of Radioactive Waste Which Allow for a Comparison with Licensing Criteria. Proceedings of the Symposium on Uncertainties Associated with the Regulation of the Geologic Disposal of High-Level Radioactive Waste, Gatlinburg, Tennessee, March 9-13, 1981, ed. D.C. Kocher. NUREG/CP-0022, CONF-810372. Oak Ridge National Laboratory, Oak Ridge, Tennessee, 145-157.

Iman, R.L. and Conover, W.J. (1982). A Distribution-Free Approach to Inducing Rank Correlation Among Input Variables. Communications in Statistics B11: 311-334.

Iman, R.L. and Helton, J.C. (1991). The Repeatability of Uncertainty and Sensitivity Analyses for Complex Probabilistic Risk Assessments. Risk Analysis 11: 591-606.

McKay, M.D., Conover, W.J. and Beckman, R.J. (1979). A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics 21: 239-245.

U.S. DOE (Department of Energy). (1996). Title 40 CFR Part 191 Compliance Certification Application for the Waste Isolation Pilot Plant. DOE/CAO-1996-2184. Carlsbad, NM: U.S. Department of Energy, Carlsbad Area Office.

U.S. EPA (Environmental Protection Agency). (1985). Environmental Standards for the Management and Disposal of Spent Nuclear Fuel, High-Level and Transuranic Radioactive Waste; Final Rule, 40 CFR Part 191. Federal Register 50: 38066-38089.

U.S. EPA (Environmental Protection Agency). (1993). Environmental Radiation Protection Standards for the Management and Disposal of Spent Nuclear Fuel, High-Level and Transuranic Radioactive Waste; Final Rule, 40 CFR Part 191. Federal Register 58: 66398-66416.

U.S. EPA (Environmental Protection Agency). (1996). 40 CFR Part 194: Criteria for the Certification and Recertification of the Waste Isolation Pilot Plant's Compliance With the 40 CFR Part 191 Disposal Regulations; Final Rule. Federal Register 61: 5224-5245.

WIPP PA (Performance Assessment) Division. (1991-1992). Preliminary Comparison with 40 CFR Part 191, Subpart B for the Waste Isolation Pilot Plant, December 1991, Vols. 1-4. SAND91-0893/1-4. Sandia National Laboratories, Albuquerque, New Mexico.

WIPP PA (Performance Assessment) Department. (1992-1993). Preliminary Performance Assessment for the Waste Isolation Pilot Plant, December 1992, Vols. 1-5. SAND92-0700/1-5. Sandia National Laboratories, Albuquerque, New Mexico.
UNCERTAINTY AND SENSITIVITY ANALYSIS IN THE 1996 PERFORMANCE ASSESSMENT FOR THE WASTE ISOLATION PILOT PLANT J.C. Helton, 1 D.R. Anderson, 2 H.-N. Jow, 1 M.G. Marietta, 2 G. Basabilvazo 3 1 Department of Mathematics, Arizona State University, Tempe, AZ 85287-1804 USA 2 Sandia National Laboratories, Albuquerque, NM 87185-1328 USA 3 U.S. Department of Energy, Carlsbad, NM 88221 USA
ABSTRACT

The preceding paper, "Conceptual and Computational Structure of the 1996 Performance Assessment for the Waste Isolation Pilot Plant," describes the overall structure of the performance assessment (PA) carried out by Sandia National Laboratories to support the U.S. Department of Energy's application to the U.S. Environmental Protection Agency for the certification of the Waste Isolation Pilot Plant (WIPP) for the geologic disposal of transuranic waste. An important part of this structure is the use of Latin hypercube sampling to propagate subjective (i.e., epistemic) uncertainty through the analysis. This propagation creates a mapping from uncertain analysis inputs to analysis results. The mapping itself provides a summary of the uncertainty in analysis outcomes that can be presented with box plots or cumulative distribution functions. Further, sensitivity analysis results can be obtained by exploring this mapping with regression-based techniques (e.g., stepwise regression analysis, partial correlation analysis, examination of scatterplots). Example uncertainty and sensitivity results obtained in the 1996 WIPP PA are presented and discussed.

KEYWORDS

Performance assessment, radioactive waste disposal, sensitivity analysis, uncertainty analysis, Waste Isolation Pilot Plant.

1. INTRODUCTION

The preceding paper (Anderson et al. 1997) describes the conceptual and computational structure of the 1996 performance assessment (PA) for the Waste Isolation Pilot Plant (WIPP), which was carried out to support an application by the U.S. Department of Energy to the U.S. Environmental Protection Agency (EPA) for the certification of the WIPP for the geologic disposal of transuranic waste (U.S. DOE 1996). An important part of this structure is the use of Latin hypercube sampling (McKay et al. 1979) to generate a mapping from imprecisely known analysis inputs to analysis outcomes of interest. This mapping provides both a display of the uncertainty in analysis outcomes (i.e., uncertainty analysis) and a basis for investigating the effects of individual inputs on these outcomes (i.e., sensitivity analysis). The sensitivity analysis procedures that can be used include examination of scatterplots, stepwise regression analysis, and partial correlation analysis (Helton 1993). The WIPP PA involves a sequence of linked models, with each model providing input to the next model in the sequence. Uncertainty and sensitivity analysis at each model interface provides (1) assurance that the results crossing the interface have been calculated correctly, (2) insights on
how to organize the overall calculation in a computationally efficient manner, and (3) guidance for future model development and data acquisition. Example uncertainty and sensitivity analysis results from the 1996 WIPP PA are presented and discussed.

2. BRAGFLO: TWO PHASE FLOW IN VICINITY OF REPOSITORY

The BRAGFLO model (Fig. 2, Anderson et al. 1997) is used to represent two phase (i.e., gas and brine) flow in the vicinity of the repository. An important result calculated by BRAGFLO is the pressure in the repository as a function of time (Figure 1a), with this pressure influencing spallings and direct brine releases, which are releases directly to the surface at the time of a drilling intrusion, and also radionuclide transport away from the repository in anhydrite marker beds. The spread in the curves in Figure 1a is due to the effects of imprecisely-known inputs to the analysis (i.e., subjective or epistemic uncertainty). One way to identify the variables that are dominating the uncertainty is to calculate partial rank correlation coefficients (PRCCs) between pressures at individual times and the variables in the Latin hypercube sample (LHS) (Figure 1b), with WMICDFLG, WGRCOR, WASTWICK and HALPOR being identified as the dominant variables (see Table 1 for variable definitions). The positive effects indicated for these variables result because increasing WMICDFLG increases gas generation by microbial processes, increasing WGRCOR increases the rate at which gas is generated by the corrosion of steel, and increasing WASTWICK and HALPOR increases the amount of brine available for consumption in the corrosion process.

The results in Figure 1 are for undisturbed conditions. An important event in the 1996 WIPP PA is the occurrence of a drilling intrusion into the repository, which causes a major alteration of the pressure conditions (Figure 2a). Although PRCCs and also stepwise regression analysis were successful in identifying the variables giving rise to the uncertainty in Figure 1a, they performed very poorly for the pressure results in Figure 2a. When this occurs, the examination of scatterplots is often an effective way to identify influential variables, with a strong but highly nonlinear relationship being identified between pressure after a drilling intrusion and BHPRM (Figure 2b). The complex pattern involving pressure and BHPRM results from the role that BHPRM plays in influencing two phase flow in the borehole, with gas flowing up the borehole and brine typically flowing down the borehole.
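PRCCs are not a standard library one-liner; the sketch below uses the usual construction (rank-transform, regress the remaining inputs out of both the candidate input and the output, then correlate the residuals) and is assumed here rather than taken from the WIPP PA software:

    import numpy as np
    from scipy import stats

    def prcc(X, y):
        # Partial rank correlation coefficient of each column of X with y.
        # X: (n, nv) sampled inputs (e.g., one LHS replicate); y: (n,) output.
        Xr = np.apply_along_axis(stats.rankdata, 0, np.asarray(X, float))
        yr = stats.rankdata(y)
        n, nv = Xr.shape
        out = np.empty(nv)
        for j in range(nv):
            others = np.column_stack([np.ones(n), np.delete(Xr, j, axis=1)])
            # Residuals of x_j and y after removing linear effects of others.
            rx = Xr[:, j] - others @ np.linalg.lstsq(others, Xr[:, j], rcond=None)[0]
            ry = yr - others @ np.linalg.lstsq(others, yr, rcond=None)[0]
            out[j] = np.corrcoef(rx, ry)[0, 1]
        return out

    # Hypothetical use: 100 LHS elements, 4 inputs, synthetic monotone response.
    rng = np.random.default_rng(3)
    X = rng.uniform(size=(100, 4))
    y = 3.0 * X[:, 0] + np.exp(X[:, 1]) + 0.1 * rng.normal(size=100)
    print(prcc(X, y))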
[Figure 1: Uncertainty and sensitivity analysis results for pressure (Pa) in waste panel under undisturbed conditions: (1a) time-dependent, volume-averaged pressures for 100 LHS elements in replicate R1, and (1b) PRCCs obtained from pooling of replicates R1, R2 and R3. See Sect. 5 of Anderson et al. (1997) for a discussion of replicates.]
TABLE 1. EXAMPLES OF THE nV = 57 UNCERTAIN VARIABLES CONSIDERED IN THE 1996 WIPP PA

ANHPRM--Logarithm of anhydrite permeability (m²). Used in BRAGFLO. Distribution: Student's with 5 degrees of freedom. Range: -21.0 to -17.1 (i.e., permeability range is 1 × 10⁻²¹ to 1 × 10⁻¹⁷·¹ m²). Mean, median: -18.9, -18.9. Correlation: -0.99 rank correlation with ANHCOMP (bulk compressibility of anhydrite, Pa⁻¹).

BHPRM--Logarithm of borehole permeability (m²). Used in BRAGFLO. Distribution: Uniform. Range: -14 to -11 (i.e., permeability range is 1 × 10⁻¹⁴ to 1 × 10⁻¹¹ m²). Mean, median: -12.5, -12.5.

BPCOMP--Logarithm of bulk compressibility of brine pocket (Pa⁻¹). Used in BRAGFLO. Distribution: Triangular. Range: -11.3 to -8.00 (i.e., bulk compressibility range is 1 × 10⁻¹¹·³ to 1 × 10⁻⁸ Pa⁻¹). Mean, mode: -9.80, -10.0. Correlation: -0.75 rank correlation with BPPRM (logarithm of brine pocket permeability, m²).

BPINTPRS--Initial pressure in brine pocket (Pa). Used in BRAGFLO. Distribution: Triangular. Range: 1.11 × 10⁷ to 1.70 × 10⁷ Pa. Mean, mode: 1.36 × 10⁷ Pa, 1.27 × 10⁷ Pa.

CFRCSP--Culebra fracture spacing (m). Used in SECOTP2D. Equal to half the distance between fractures. Distribution: Uniform. Range: 0.05 to 0.5 m. Mean, median: 0.275 m, 0.275 m.

CMRTRDU--Culebra matrix retardation for uranium (dimensionless). Defined as function of other uncertain variables. Used in SECOTP2D. Not a sampled variable.

CVEL--Norm of fluid velocity vector (m/s) calculated by SECOFL2D. Representative of fluid velocity used in SECOTP2D. Not a sampled variable.

HALPOR--Initial value for halite porosity (dimensionless). Used in BRAGFLO. Distribution: Piecewise uniform. Range: 1.0 × 10⁻³ to 3 × 10⁻². Mean, median: 1.28 × 10⁻², 1.00 × 10⁻².

HALPRM--Logarithm of halite permeability (m²). Used in BRAGFLO. Distribution: Uniform. Range: -24 to -21 (i.e., permeability range is 1 × 10⁻²⁴ to 1 × 10⁻²¹ m²). Mean, median: -22.5, -22.5. Correlation: -0.99 rank correlation with HALCOMP (bulk compressibility of halite, Pa⁻¹).

WASTWICK--Increase in brine saturation of waste due to capillary forces (dimensionless). Used in BRAGFLO. Distribution: Uniform. Range: 0 to 1. Mean, median: 0.5, 0.5.

WGRCOR--Corrosion rate for steel under inundated conditions in the absence of CO₂ (m/s). Used in BRAGFLO. Distribution: Uniform. Range: 0 to 1.58 × 10⁻¹⁴ m/s. Mean, median: 7.94 × 10⁻¹⁵ m/s, 7.94 × 10⁻¹⁵ m/s.

WMICDFLG--Pointer variable for microbial degradation of cellulose. Used in BRAGFLO. Distribution: Discrete, with 50% 0, 25% 1, 25% 2. WMICDFLG = 0, 1, 2 implies no microbial degradation of cellulose; microbial degradation of only cellulose; microbial degradation of cellulose, plastic and rubber.

WPRTDIAM--Waste particle diameter (m). Used in CUTTINGS_S. Distribution: Loguniform. Range: 4.0 × 10⁻⁵ to 2.0 × 10⁻¹ m. Mean, median: 2.35 × 10⁻² m, 2.80 × 10⁻² m.

WRBRNSAT--Residual brine saturation in waste (dimensionless). Used in BRAGFLO and BRAGFLO_DBR. Distribution: Uniform. Range: 0 to 0.552. Mean, median: 0.276, 0.276.

WTAUFAIL--Shear strength of waste (Pa). Used in CUTTINGS_S. Distribution: Uniform. Range: 0.05 to 10 Pa. Mean, median: 5.03 Pa, 5.03 Pa.
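For illustration, a Latin hypercube sample over a few of the ranges in Table 1 can be generated with scipy's qmc module; the Iman-Conover restricted-pairing step used in the PA to control rank correlations is omitted in this sketch:

    import numpy as np
    from scipy.stats import qmc

    # LHS over three Table 1 variables (uniform and loguniform ranges as
    # listed there); correlated variables would need restricted pairing.
    sampler = qmc.LatinHypercube(d=3, seed=4)
    u = sampler.random(n=100)                  # 100 points in [0, 1)^3

    BHPRM = -14.0 + u[:, 0] * 3.0              # uniform on [-14, -11] (log10 m^2)
    WTAUFAIL = 0.05 + u[:, 1] * (10.0 - 0.05)  # uniform on [0.05, 10] Pa
    WPRTDIAM = 10 ** (np.log10(4.0e-5)
                      + u[:, 2] * (np.log10(2.0e-1) - np.log10(4.0e-5)))  # loguniform

    print(BHPRM[:3], WTAUFAIL[:3], WPRTDIAM[:3])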
3. NUTS AND PANEL: RELEASE FROM REPOSITORY

The NUTS and PANEL models (Figure 2, Anderson et al. 1997) are used to represent long-term radionuclide transport away from the repository due to flowing groundwater. The NUTS model was used for undisturbed (i.e., E0) conditions, single drilling intrusions that penetrate pressurized brine in the Castile Formation (i.e., E1 intrusions), and single drilling intrusions that do not penetrate pressurized brine in the Castile Formation (i.e., E2 intrusions). The PANEL model was used for the penetration of a waste panel by two or more drilling intrusions, of which at least one penetrates pressurized brine in the Castile Formation (i.e., an E2E1 intrusion). Most sample elements resulted in little or no radionuclide release being predicted by NUTS and PANEL due to failure of the repository to fill with brine, with the greatest number of releases occurring for the E2E1 intrusions. A single example of uncertainty and sensitivity analysis results for long-term radionuclide release from the repository is given in Figure 3. The box plots in Figure 3a provide a compact way to display the uncertainty in a number of variables in a single plot frame. The variable BHPRM dominates radionuclide release, with no releases typically occurring for small values of BHPRM due to a failure of the intruded waste panel to fill with brine (Figure 3b).

4. SECOFL2D AND SECOTP2D: FLOW AND TRANSPORT IN THE CULEBRA DOLOMITE

The SECOFL2D and SECOTP2D models (Figure 2, Anderson et al. 1997) are used to represent brine flow and radionuclide transport, respectively, in the Culebra Dolomite. In the 1996 WIPP PA, only one sample element had the potential for radionuclide transport from the repository to the accessible environment. However, no releases to the Culebra occurred for this sample element. Thus, no release to the accessible environment due to transport through the Culebra occurred in the calculations performed to support the 1996 WIPP PA. To provide perspective on the processes affecting radionuclide transport in the Culebra, an additional set of calculations was performed for replicate R1 for a release of 1 kg of U-234 at the repository at time 0 yr and the transport of this release in the near vicinity of the repository (i.e., within a few 10's of meters) (Figure 4). The release was rapidly attenuated, with the dominant variables being CMATRDU, CVEL and CFRCSP. Specifically, CMATRDU characterizes the effects of sorption in the rock matrix, CVEL characterizes fluid velocity in fractures, and CFRCSP affects the amount of diffusion that takes place from fractures into the surrounding rock matrix.

5. CUTTINGS_S AND BRAGFLO_DBR: CCDFs FOR DIRECT RELEASES

The CUTTINGS_S model was used to represent direct releases (i.e., at the time of a drilling intrusion) due to cuttings and cavings and also to spallings; BRAGFLO_DBR was used to represent direct releases due to brine flow up the intruding borehole (Figure 2, Anderson et al. 1997). Due to the absence of significant transport in the Culebra and also through the anhydrite marker beds, releases to the accessible environment in the 1996 WIPP PA were dominated by the cuttings and cavings, spallings and direct brine releases (Figure 5a-c). The individual CCDFs in Figure 5a-c were constructed with Monte Carlo procedures (Sect. 3, Anderson et al. 1997). As comparison of Figure 3a of Anderson et al.
(1997) and Figure 5a-c of this presentation shows, the total release CCDFs tend to be dominated by the cuttings and cavings component of the release.
The sensitivity of CCDFs to uncertain inputs can be assessed with PRCCs. As an example, the uncertainty in the CCDFs for spallings releases is dominated by WMICDFLG, WPRTDIAM, HALPOR and WGRCOR (Figure 5d). The positive effects for WMICDFLG, HALPOR and WGRCOR result from their roles in increasing repository pressure; the negative effect for WPRTDIAM results because large particles are less mobile than small particles. Another possibility is to convert the individual CCDFs to expected values and then to perform a stepwise regression analysis on these expected values. Then, variable importance is indicated by the order in which the variables enter the regression analysis, the changes in R² values as variables enter the regression model, and the standardized regression coefficients associated with the
variables selected in the analysis. The cuttings and cavings release is completely dominated by WTAUFAIL (i.e., a standardized rank regression coefficient (SRRC) of -1.00 and an R² value of 1.00); a number of variables affect the other release modes (Table 2).

[Figure 2: Uncertainty and sensitivity analysis results for pressure (Pa) in waste panel after a drilling intrusion at 1000 yr that does not penetrate pressurized brine in the Castile Formation (i.e., an E2 intrusion): (2a) time-dependent, volume-averaged pressures for 100 LHS elements in replicate R1, and (2b) scatterplot for pressure at 10,000 yr versus BHPRM.]

[Figure 3: Uncertainty and sensitivity analysis results for an E2E1 intrusion with the E1 intrusion occurring at 1000 yr: (3a) box plots for release of individual radionuclides, and (3b) scatterplot for normalized release of Am-241 versus BHPRM.]

[Figure 4: Uncertainty and sensitivity analysis results for transport of a 1 kg release of U-234 at the repository across a boundary 10 m from the release point: (4a) cumulative releases for individual sample elements, and (4b) PRCCs.]

6.0 SUMMARY

The 1996 WIPP PA maintained a separation of stochastic (i.e., aleatory) and subjective (i.e., epistemic) uncertainty. In conjunction with a Latin hypercube-based uncertainty analysis, this separation enabled the PA to show that there is a high degree of confidence that the EPA regulation 40 CFR 191.13 will be met; specifically, the distributions of CCDFs in Figure 5a-c and also Figure 3 of Anderson et al. (1997) fall substantially below the specified boundary line. The associated sensitivity analysis helps explain why analysis outcomes behave in particular ways and also provides an important check on the correctness of individual results. Further, the uncertainty and sensitivity analysis results provide valuable guidance for model development, data acquisition, and the structure of future PAs.

REFERENCES
Anderson, D.R. et al. (1997). Conceptual and Computational Structure of the 1996 Performance Assessment for the Waste Isolation Pilot Plant. Proceedings of International Conference on Safety and Reliability, Lisbon, 17-20 June 1997, to appear.

Helton, J.C. (1993). Uncertainty and Sensitivity Analysis Techniques for Use in Performance Assessment for Radioactive Waste Disposal. Reliability Engineering and System Safety 42: 327-367.

McKay, M.D., Conover, W.J. and Beckman, R.J. (1979). A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics 21: 239-245.

U.S. DOE (Department of Energy). (1996). Title 40 CFR Part 191 Compliance Certification Application for the Waste Isolation Pilot Plant. DOE/CAO-1996-2184. Carlsbad, NM: U.S. Department of Energy, Carlsbad Area Office.
Figure 5. Uncertainty and sensitivity analysis results for direct releases: (5a) CCDFs for cuttings and cavings, (5b) CCDFs for spallings, (5c) CCDFs for direct brine releases, and (5d) PRCCs for CCDFs for spallings
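The CCDF families in Figure 5 can be assembled along the following lines; this is a deliberately minimal sketch with invented distributions, not the PA calculation. Each epistemic (LHS) sample element receives its own CCDF estimated from many sampled futures, and comparison with 40 CFR 191.13 amounts to checking that the family lies below the boundary line (for instance, an exceedance probability of no more than 0.1 at a normalized release of 1 EPA unit).

import numpy as np

rng = np.random.default_rng(7)

n_lhs, n_futures = 100, 10000      # 100 observations, 10000 futures/observation
EPA_R, EPA_PROB = 1.0, 0.1         # one boundary point of 40 CFR 191.13

# Hypothetical epistemic uncertainty: each LHS element gets its own median release.
medians = 10.0 ** rng.uniform(-4.0, -2.0, size=n_lhs)

worst = 0.0
for m in medians:
    # Aleatory uncertainty: sample futures and estimate one CCDF per element.
    releases = m * rng.lognormal(mean=0.0, sigma=1.5, size=n_futures)
    exceed_prob = np.mean(releases > EPA_R)   # CCDF evaluated at R = 1 EPA unit
    worst = max(worst, exceed_prob)

print(f"max P(R > {EPA_R}) across the CCDF family: {worst:.1e} (boundary {EPA_PROB})")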
B5: Management of Safety Assessments
MANAGEMENT OF SAFETY ASSESSMENTS: LESSONS LEARNED FROM EXPERIENCE WITHIN NATIONAL PROJECTS

R.D. Wilmot and D.A. Galson

Galson Sciences Ltd., 5 Grosvenor House, Melton Road, Oakham, Rutland, LE15 6AX, UK
ABSTRACT

Safety assessments of large installations, including radioactive waste disposal facilities, pose a wide range of management issues. The Environment Agency sponsored an informal international seminar to discuss these issues, with participants from regulators, proponents and contractors. The seminar concluded that there was a role for quantitative assessments carried out by the regulator, supported by dialogue between the proponent and the regulator. In addition to establishing trust and credibility, such assessments should help in the setting of regulatory guidance. Sensitivity analyses can indicate which are the important issues in an assessment, and can provide the basis for decision-making to allocate resources and ensure value for money. The seminar did not resolve all the issues of concern to managers that were raised, and concluded that an international forum for personnel involved in regulatory assessment work would be a useful follow-on.
KEYWORDS
Management, safety assessments, performance assessments, regulation, radioactive waste disposal, decision making
INTRODUCTION AND BACKGROUND
Within England and Wales, the Environment Agency¹, as the responsible authority for the licensing of radioactive waste disposal, has been developing, demonstrating, and applying an independent risk assessment capability. An informal international seminar on the management of safety assessments was sponsored by the Environment Agency and held in Epsom, UK, on 10-11 March 1997. The purpose of the seminar was to discuss recent experience and views on institutional management and technical management of the assessment process, and to evaluate the overall management lessons learned in a number of assessments conducted recently in OECD countries. While the main focus of the meeting was on radioactive waste disposal, the seminar also provided a perspective on lessons and management experience from safety assessments conducted within other industry sectors.
¹ The Environment Agency was formed on 1 April 1996 from the merger of Her Majesty's Inspectorate of Pollution, the National Rivers Authority and the Waste Regulatory Authorities.
The seminar was attended by representatives from Spain, Sweden, Switzerland, the United Kingdom and the United States. The majority of participants were from regulatory organisations, with a few from proponents and from contractors.
Institutional Management of Assessments

Over the past 15 years, postclosure safety assessments have been undertaken in every OECD Member country concerned with the disposal of long-lived radioactive wastes in near-surface and deep repositories. The results of such assessments will be an important part of the licensing documentation prepared by repository proponents. In a number of countries, regulators have been developing an independent assessment capability, with the intention of providing better insight into the proponent's safety case. An important issue of underlying concern to both regulators and proponents is the need to ensure that such independent assessments conducted by regulators make use of the most up-to-date site data and understanding, research reports, software tools, and quality assurance (QA) procedures, so as to ensure that the independent assessment work can contribute constructively to the debate on repository safety. Yet there remains a conundrum: the need for the regulator and proponent to remain at arm's length can introduce difficulties in ensuring that the independent assessments and regulatory review make use of the best available information obtained by the proponent. It is in the interest of both regulator and proponent to ensure that independent assessments serve as a valuable contribution to the overall process of decision making. The seminar explored
possible working relationships between the assessment programmes of proponents and regulators.
Technical Management of Assessments

Developing the technical bases for a safety assessment is a multidisciplinary task, placing high demands on management and requiring appropriate organisational structures. Little has been written about the management of these assessments, yet assessment management may be the single most important activity in contributing to the success of the exercise (Thompson et al., 1993). Individual assessments may typically extend over several years and cost the equivalent of several millions of dollars. The seminar explored means
for the efficient and cost-effective conduct of assessments.
THE SEMINAR
Introductory presentations were given by Brian Thompson (Environment Agency), who discussed general management issues as well as the experience gained from a series of assessments, and Soren Norrby (Swedish Nuclear Power Inspectorate - SKI), who provided a perspective on the regulator's role in decision making. Further details of these programmes are provided in companion papers (Thompson and Sumerling, 1997; Dverstorp et al., 1997). The remainder of the seminar was divided into two broad topics covering the institutional and technical aspects of assessment management:

• The Role and Conduct of Independent Assessments by Regulators.
• Managing a Multi-Disciplinary and Multi-Contractor Team.
Participants gave presentations of 15-20 minutes focussing on one of these topics; these are summarised below. Each session of the seminar was concluded by a discussion period and a general discussion was held at the end of the meeting. The main points raised in these discussions are presented in the following section.
Presentations
The Role and Conduct of Independent Assessments by Regulators

Participants described a range of approaches adopted by regulators for undertaking independent assessments. Regulators who have developed capabilities for undertaking full performance assessments (PAs) have gained benefits in terms of the development of regulatory tools, insights into the QA aspects of data collection, and the establishment of in-house expertise. Other regulators have undertaken partial assessments that focus on the modelling of particular subsystems to help the regulator understand what issues are important. A third approach described was to require the proponent to undertake supplemental PA calculations using the proponent's own codes; in this case, the regulator does not maintain an independent PA capability but does design the supplemental calculations. By undertaking calculations in-house, rather than asking the proponent to undertake them based on the regulator's conceptual models and parameter values, the regulator could demonstrate its understanding and capability to other stakeholders. This approach is also efficient in terms of time. An in-house team may, however, be less flexible than a team of external contractors if major changes in the scope of work are required at short notice.

The majority of regulators engage in a number of distinct phases of work prior to licensing: the development of licensing guidance, pre-application consultation and review, and review of an application. Whatever source of technical support is used, there is a need for regulatory staff to provide a perspective during each phase that ensures decisions are made and issues resolved, so that research is not allowed to continue indefinitely. The extent to which regulators take active steps to involve stakeholders in the decision-making process was also described; these steps may include workshops, technical exchanges, public hearings, Internet sites, telephone information lines, and the maintenance of public registers.
Managing a Multi-Disciplinary and Multi-Contractor Team

A system model used for the assessment of a radioactive waste disposal facility accounts for a wide range of processes and process interactions. These are treated at various levels of detail, and there must be a means of accounting for uncertainty in both the models used and the parameter values. The use of detailed models for the most important processes may reduce uncertainty but has cost and schedule implications. The use in PAs of simplified models, look-up tables and elicited parameter values may reduce costs and shorten schedules, but requires accounting for greater levels of uncertainty. A decision analysis tool based on a trade-off between activities conducted to reduce uncertainty in the PA, cost and schedule was developed and used for the Waste Isolation Pilot Plant (WIPP) in the US to determine which experimental and model development programmes should continue and which could be reduced or cancelled (Helton et al., 1996); a toy version of this trade-off is sketched below.

An important pre-requisite to any assessment project is fixing the management boundary conditions, so that the assessment context, the methodologies to be used, and the deliverables are all determined as early in the project as possible. A second major lesson learnt from experience is the importance of information flow, so that those who need information have it when needed. Recent information technology developments, such as intranets and co-operative working software, could help to remove some of the bottlenecks in distributing documentation to multi-participant teams. This in turn may allow for more time and resources to be devoted to understanding and presenting the results of assessments.
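The trade-off mentioned above can be caricatured as a toy prioritization. This is not the systems prioritization methodology of Helton et al. (1996): the activities, scores, budget and the greedy selection rule below are all invented to show the general shape of such a decision aid.

from dataclasses import dataclass

@dataclass
class Activity:
    name: str
    uncertainty_reduction: float   # expected reduction in output uncertainty (0-1)
    cost: float                    # k$, illustrative
    months: float                  # schedule impact, illustrative

# Hypothetical candidate programmes; all numbers are invented.
candidates = [
    Activity("Canister corrosion experiments", 0.30, 800, 18),
    Activity("Refined two-phase flow model",   0.20, 500, 12),
    Activity("Extra sorption measurements",    0.05, 300,  6),
    Activity("Detailed geochemistry model",    0.10, 900, 24),
]

BUDGET = 1500    # k$ available
DEADLINE = 20    # months available for any single activity

# Greedy trade-off: rank by uncertainty reduction per unit cost, then select
# subject to budget and schedule. A real prioritization would optimise over
# whole portfolios and model how reductions combine.
ranked = sorted(candidates, key=lambda a: a.uncertainty_reduction / a.cost, reverse=True)

spent, selected = 0.0, []
for a in ranked:
    if spent + a.cost <= BUDGET and a.months <= DEADLINE:
        selected.append(a)
        spent += a.cost

for a in selected:
    print(f"fund: {a.name} (reduction {a.uncertainty_reduction:.2f}, cost {a.cost:.0f} k$)")
print(f"total cost: {spent:.0f} k$")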
One presentation described the regulatory review of the structural integrity of the Forth Rail Bridge by the Health and Safety Executive (HSE), in which the proponent had funded the majority of the work while, through a joint working group, the regulator both had control over what was done and saved the cost and time of a separate review (HSE, 1996). This close co-operation of proponent and regulator was accepted by those who had raised the original safety questions, with no apparent concerns regarding collusion.
Discussion

The major themes that arose during the discussion periods were the interactions between regulators and proponents, the use of quantitative assessments by regulators, organisational and cultural conflicts and their resolution, and the use of contractors.
Interactions between regulators and proponents

There was consensus concerning the importance of the regulator and industry communicating on both the qualitative and quantitative aspects of the assessment process ahead of licensing. This communication could serve as a basis for the provision of guidance to the proponent on acceptable practice. As a minimum, such guidance could establish a document structure that would aid the preparation of assessment documentation by the proponent and the planning of reviews by the regulator. Other possible areas for guidance were also mentioned:

• The use of common data and standards - for radioactive waste disposal, this could include waste inventories, thermodynamic data and terminology.
• Protocols for peer review and elicitation of expert opinion.
• Expectations concerning "validation" of models.
Outside the field of radioactive waste disposal it has been demonstrated that it is possible for a proponent and a regulator to cooperate very closely, to the extent of producing a joint assessment. There was discussion of whether a similar approach could be adopted in the regulation of radioactive waste disposal. In the USA, the Environmental Protection Agency (USEPA) and Department of Energy (USDOE) are working more and more closely as the application for the WIPP repository for disposal of transuranic waste progresses. This cooperation includes supplemental calculations undertaken by the USDOE as specified by the USEPA. Any future legal challenge will be against the USEPA rather than USDOE, but any defence will rely on the assessment by the proponent. It is therefore important that the regulator has confidence in the assessment, and that the proponent provides whatever information and support is necessary to provide this confidence. Joint assessments in other countries were considered unlikely, either because of different legal systems, or because of the low level of trust by the public for the nuclear industry. In particular, some participants considered that trust and credibility established by the regulator could be lost through the conduct of joint assessments with the proponent.
The use of quantitative assessments by regulators

There was consensus concerning the central role of PA in regulatory decision making, although regulators differ in the extent to which they undertake independent quantitative assessment work. Where independent codes have been developed by the regulator, this is done to establish and maintain competence and credibility in terms of both software development issues and process understanding, and to illustrate acceptable practice; it remains the responsibility of the proponent to undertake and present a safety case for a
particular site or facility. Other regulators consider that any quantitative assessment work required for an understanding of the issues can be done by undertaking, or requiring the proponent to undertake, specific calculations using existing codes. As PA matures as a subject, the use of existing codes may become more widespread as independent codes written under rigorous quality regimes become more widely available. Whatever the approach to quantitative assessments, there should be a mechanism for informing the proponent and stakeholders of views or guidance gained by the conduct of independent calculations by the regulator. This is true regardless of the mechanism used for funding work by the regulator. Where there is a direct "polluter pays" system of funding, the proponent may wish to see a progressive closure of issues by the regulator, ensuring that the benefits of previously funded work are not lost. Where funding of the regulator by the proponent is less direct, and comes through central government, the taxpayer will require a demonstration of value for money which may include a similar approach to closing issues. On the other hand, even where there is an active prelicensing dialogue between regulator and proponent, the regulator may wish to retain flexibility and reserve the right to comment on all aspects of a final application.
Organisational and cultural conflicts

Participants in the seminar who had been involved in mature assessments all stressed the need for a process to determine which areas of work should be funded and which areas, although interesting to particular researchers, could be discontinued. Sensitivity analyses can be used to determine the important issues, although this must be coupled to a decision analysis tool or optimisation scheme if the analysis is to be extended to ensure value for money. The definition of value in this context can be difficult, as it is necessary to determine the usefulness of research which, by its nature, may have an uncertain outcome. Experimental programmes intended to reduce uncertainty about parameter values by increasing the number of measured data are perhaps more amenable to this approach. It was also noted that incentives for completion can help to alleviate the tendency for some researchers to continue work on particular topics ad infinitum.
The use of contractors

Whatever the proportion of work that is done by external contractors, all the participants stressed that there must be a sufficient level of in-house regulatory expertise. There is a need to have sufficient in-house staff to manage external contractors, to ensure that an integrated programme of work is undertaken, and to have an understanding of the issues so that guidance and regulatory views can be published in addition to contractor reports. Amongst participants, the HSE major hazards group had the greatest degree of in-house expertise. In the case of the radioactive waste disposal regulators, however, there was a marked variation in the level of in-house staffing, with the Environment Agency having the lowest level, both in absolute terms and in relation to the size of the national nuclear programme.
CONCLUSIONS

The seminar was a valuable forum for discussion, and there was a considerable degree of unanimity amongst participants as to the issues relevant to both the technical and institutional management of assessments. The presentations and discussions did not, of course, provide answers to all the issues raised, and there remain a number of topics where further discussion would help the Environment Agency, and other organisations, to develop appropriate strategies. These topics include:

• The need to establish trust and credibility as a competent regulator through dialogue with a wide range of stakeholders.
• The need to establish both the institutional constraints and technical scope of an assessment early on in any project.
• Agreement on a documentation structure for safety cases and supporting material.
• The provision of regulatory guidance to the proponent.
• Methods for resolving issues in advance of an application.
• Assessing the cost implications and value for money of any independent assessment work, and the use of decision analysis to help guide research, site characterisation, and repository development programmes.
Because these topics are relevant to other regulators, it may be appropriate to establish an international forum for personnel involved in regulatory assessment work. Existing international fora bring together regulators at a policy level, but do not provide for those with direct responsibility for managing and undertaking assessments. The Environment Agency has already taken initial steps to establish a new international group of this kind.
ACKNOWLEDGEMENTS

This work was funded by the Environment Agency and the results may be used for the formulation of policy, but do not at this stage constitute UK Government policy. We would like to thank Brian Thompson of the Environment Agency for instigating the project and reviewing this paper.
REFERENCES
Dverstorp, B., Kautsky, F., Norrby, S., Toverud, O. and Wingefors, S. (1997). Management of performance assessments in the Swedish waste disposal programme: SKI's views and experiences. Proceedings of an International Conference on Safety and Reliability - ESREL '97, Lisbon, Portugal (June 1997). (in press)

Helton, J.C., Anderson, D.R., Baker, B.L. and 17 others (1996). Computational Implementation of a Systems Prioritization Methodology for the Waste Isolation Pilot Plant: A Preliminary Example. SAND94-3069, Sandia National Laboratories, Albuquerque, NM.

HSE (1996). An assessment by HSE of the structural integrity of the Forth Rail Bridge. Health and Safety Executive, London.

Thompson, B.G.J. and Sumerling, T.J. (1997). Organisational and management issues in the regulatory assessment of underground radioactive waste disposal. Proceedings of an International Conference on Safety and Reliability - ESREL '97, Lisbon, Portugal (June 1997). (in press)

Thompson, B.G.J., Wakerley, M.W. and Sumerling, T.J. (1993). Recent management experience of UK performance assessments of radioactive waste disposal. Proc. 4th Intl. Conf. on High-Level Radioactive Waste Management, Las Vegas, NV (May 1993).
MANAGEMENT OF PERFORMANCE ASSESSMENTS IN THE SWEDISH WASTE DISPOSAL PROGRAMME: SKI'S VIEWS AND EXPERIENCES

B. Dverstorp, F. Kautsky, S. Norrby, O. Toverud and S. Wingefors

Swedish Nuclear Power Inspectorate, S-106 58 Stockholm, Sweden
ABSTRACT

In Sweden, the nuclear industry (through SKB) has the full responsibility to develop a safe solution for disposal of spent fuel and nuclear waste. Up-coming license applications comprise an encapsulation plant and a deep repository. The safety authorities (SKI and SSI) are responsible for reviewing SKB's RD&D programme and for reviewing license applications. The development of a repository for spent nuclear fuel is a stepwise process with recurrent up-dated Performance Assessments (PAs) and regulatory reviews as an important basis for decisions. In preparation for these reviews SKI has built an independent competence in PA by carrying out its own PA exercises (Project-90 and SITE-94). Although demanding tasks, these PA projects have provided SKI with a PA methodology that can be up-dated and applied as a regulatory tool in future reviews. The experiences from these PA projects will also assist in defining what SKI expects of SKB in future license applications.

KEYWORDS

Performance Assessment, Management, Deep Repository, Spent Nuclear Fuel, Regulatory Review, License Application, Quality Assurance, Long-Term Safety, Scenarios, Crystalline Rock.

INTRODUCTION

According to the "Act on Nuclear Activities", the owner of a nuclear reactor has the full responsibility for handling and final disposal of spent nuclear fuel and nuclear waste. This also includes a responsibility to perform necessary R&D activities and to present a Research, Development and Demonstration (RD&D) programme every third year. The reactor owners have set up a joint company, the Swedish Nuclear Fuel and Waste Management Co. (SKB), to perform necessary R&D activities and to construct and operate nuclear facilities of joint interest to the industry. The Swedish Nuclear Power Inspectorate (SKI) is the main authority for review and supervision of the responsibilities put on the nuclear power reactor owners. SKI reviews applications for licenses to construct and operate nuclear facilities and supervises operation of such facilities. SKI reviews and supervises the required RD&D programme, and also the funding system for future costs.

Swedish policy is to finally dispose of all types of waste from the Swedish nuclear power programme within Sweden. According to this policy, the direct disposal of spent fuel is the preferred option. The use of nuclear energy has been much debated in Sweden. According to existing plans, now under debate, the use of nuclear reactors will be phased out so that no reactor is in operation after 2010. The nuclear power programme comprises twelve nuclear reactors, a central intermediate storage facility for spent fuel (CLAB), in operation since 1985, and a repository for the final disposal of low and intermediate level radioactive waste (SFR), in operation since 1988. Plans exist for the expansion of the facilities so that all storage of spent fuel and disposal of the short-lived nuclear waste (including decommissioning waste) can be accommodated.
In addition, SKB is planning for siting, construction and operation of both a facility for encapsulation of spent fuel in copper/steel canisters and a deep geological repository for spent fuel. These plans also include a repository for long-lived low and intermediate level waste. The siting process for the repository for spent fuel is in its preliminary stages. Feasibility studies are going on in local municipalities. Figure 1 schematically illustrates a model of a stepwise decision-making process for the first stage of the development of a deep repository (from SKI, 1996a). In different phases of the licensing process, Performance Assessments (PAs) will be the main basis for decisions.

EXPERIENCES FROM EARLIER REVIEWS

For many years SKI has been involved in regulatory reviews of PAs carried out and presented by the nuclear industry through SKB. At the end of the 1970's, the KBS-1 report was presented. This was an assessment of long term performance of vitrified waste from the reprocessing of spent fuel. Swedish policy was then changed to direct disposal of spent fuel, and the safety of that disposal concept was assessed in the KBS-2 report. The KBS-3 report (SKBF/KBS, 1983), presented at the beginning of the 80's, also assessed the long term safety of spent fuel. Since the Swedish policy changed from reprocessing to direct disposal of spent fuel, the disposal concept has remained much the same over the years, comprising a long-lived stable canister (copper or copper/steel) surrounded by bentonite clay, deposited in vertical holes drilled from a tunnel system at a depth of about 500 m in crystalline basement.

PA is not only focussed on spent fuel disposal. The Swedish repository for low and intermediate level waste (SFR) was licensed during the 80's. It is obvious that PA methods for spent fuel and for low and intermediate level waste should in principle be the same, even if the time perspectives may be different. In the review of the safety assessment report for SFR, which was performed by SKI and SSI (the Swedish Radiation Protection Institute) jointly, scenarios and the related probabilities were discussed. It was accepted that scenarios resulting in doses well above the limit for individual dose (0.1 mSv/year) could be accepted if the probability of their occurrence was very low. A discussion of that type requires a well developed safety assessment methodology including scenario and uncertainty analysis.

The continuous development of the KBS-3 concept, including plans for further R&D, has been presented by SKB, and reviewed by SKI, every third year (1986, 1989, 1992 and 1995). In 1991 SKB also presented the SKB-91 safety assessment exercise (SKB, 1992), which is an update of primarily the geological aspects of the earlier KBS-3 report. Rather early, at the beginning of the 80's, it became obvious that SKI needed to develop its own competence in PA methodology for regulatory purposes. SKI thus initiated international projects on groundwater and radionuclide transport in geological media, INTRACOIN, HYDROCOIN and INTRAVAL, and, later, also the ongoing project DECOVALEX on thermo-hydro-mechanical modelling. SKI also carried out two major competence building activities on PA of deep geological disposal, as described below. In all its different regulatory roles, as described above, SKI has found it valuable, and to a great extent necessary, to attain a high and independent competence in PA methodology. This competence is useful not only in assessing licence applications but also in reviewing SKB's plans for future R&D, in focussing SKI's own R&D activities and in communicating SKI's views on safety aspects to the general public.
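The probability-weighted screening argument applied in the SFR review (doses above the individual dose limit tolerated only at very low scenario probability) can be sketched as a simple risk summation. Everything below, the scenario set, probabilities, doses, risk factor and risk target, is invented for illustration and is not drawn from the SFR assessment.

# Illustrative scenario screening; all numbers are invented.
ANNUAL_DOSE_LIMIT = 1.0e-4   # Sv/yr (0.1 mSv/yr individual dose limit)
RISK_FACTOR = 0.06           # health detriment per Sv (assumed ICRP-style value)
RISK_TARGET = 1.0e-6         # assumed acceptable annual individual risk

scenarios = [
    # (name, annual probability, peak annual dose in Sv if the scenario occurs)
    ("normal evolution",     1.0,    1.0e-5),
    ("well drilled at site", 1.0e-4, 5.0e-3),   # dose above limit, low probability
    ("glacial disruption",   1.0e-6, 1.0e-2),
]

# Aggregated annual risk: sum over scenarios of probability x dose x risk factor.
risk = sum(p * dose * RISK_FACTOR for _, p, dose in scenarios)
print(f"aggregated annual risk: {risk:.1e} (target {RISK_TARGET:.0e})")

for name, p, dose in scenarios:
    note = ("dose > limit; acceptability rests on low probability"
            if dose > ANNUAL_DOSE_LIMIT else "dose within limit")
    print(f"{name}: p = {p:.0e}, dose = {dose:.0e} Sv/yr ({note})")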
ROLES OF PERFORMANCE ASSESSMENT IN THE LICENSING PROCESS

Pre-Licensing Phase

Building of competence and resources

The ambition of SKI has been to build and rely on its own competence and resources as much as possible in the licensing of facilities for management of nuclear waste. This ambition derives partly from earlier experiences where it was found difficult to rely entirely on external experts with sometimes diverging opinions in important matters. A full coverage of all necessary expertise is not feasible, however, with the limited personnel resources of SKI. In addition, the coverage will always be more or less influenced by the available competence and contact network of staff members. Therefore, external consultants and experts will also be needed in all stages of the licensing process, in particular for fields where in-house expertise is lacking and for those parts of PAs, e.g. site characterization, where the volume of work is too large to be handled effectively by the staff alone.
Figure 1. Schematic illustration of the decision-making process related to SKB's programme for the encapsulation plant and the deep repository for spent nuclear fuel. NRL = the Act Concerning Management of Natural Resources. KTL = the Act on Nuclear Activities. Auth./Gov. = permission by authorities or government decision (modified from SKI, 1996a). (The figure notes that, according to a Government decision from May 1995, a system evaluation, where the "KBS-3" method is evaluated and compared to the Zero Alternative and other alternatives, should be submitted in connection with the application for a permit to construct the encapsulation plant.)
The building of in-house competence is achieved by:

• Conducting an independent research programme, e.g. in order to get independent answers to questions about model applicability, as a scientific basis for SKI's own modelling efforts, and as a complement to SKB's programme. Another very important aspect is the development of methods for scenario development and system analysis. This work must be done in such a way that SKI does not assume SKB's responsibilities. As an example, it would not be acceptable if the solution to critical issues in SKB's safety case relied entirely on research results obtained within SKI's programme.

• Development of computer codes and data bases for PAs. Focus has been on the mainstream modelling of an integrated PA, i.e. models for radionuclide release and transport, groundwater flow, chemical modelling and rock mechanics. Independent databases of generic parameters such as radionuclide inventories in spent fuel and thermodynamic data have been developed and are maintained. It is also the aim that release and transport calculations should be made in-house and understood in full by the ordinary staff. This will contribute to ensuring a basic understanding of which factors and assumptions are critical and should be subject to particular scrutiny in the assessment work.

• Application of available knowledge, models/codes, data bases and manpower (staff together with external consultants and experts) in integrated PA exercises. These efforts have been invaluable for the building of personal competence, and for reviewing in-house and otherwise available knowledge, assessment tools, and methodologies.

• Establishment of a national/international network of external experts and consultants that may be used for the review of SKB's programme as well as the PAs of licence applications.
It is quite obvious that the different items mentioned above are interdependent and of mutual benefit to a very great extent. Another aspect that should be as obvious is that the competence building work will never be finished. As long as PAs are needed, new scientific and technical advances must be taken into account and employed in the analyses. This fact must not be forgotten, in particular after licensing (see below). Finally, the supply of personnel and consultants, as well as ways to preserve the attained level of competence, are important strategic questions that need to be observed in both research and personnel planning.
Review of implementor's programme

SKI's reviews of SKB's programme for RD&D every third year have been excellent opportunities to get a thorough and integrated understanding of the status of PA. On the other hand, it is quite clear that the knowledge and experiences gained by the competence building work and the PA exercises have been invaluable for SKI's ability to perform these reviews. In commending the programmes to the government, SKI has been able to enforce requirements on SKB's programme, e.g. concerning timing and scope of the PAs in the future licensing process.
Requirements on performance assessments for licence applications

Based on the experiences from the work presented above and on internationally accepted standards and documents (IAEA, CEC, NEA), SKI will issue regulations or guidelines which prescribe the content (e.g. inclusion of different scenarios) and format of SKB's future PA reporting. An important aspect is the extent and detail with which SKB will have to assess alternatives to the main disposal concept within the framework of an Environmental Impact Statement.
Inspection of site investigations

The investigations of selected sites for a repository, and the subsequent management and evaluation of site specific data by SKB, will be followed closely by SKI. Quality assurance is particularly important. In addition to increasing the confidence in SKB's results, this effort will facilitate and speed up the review work performed after submission of a licence application.
Licensing Phase

Much of the PA work otherwise needed during the licensing phase, i.e. after submission of an application to build a waste management facility such as a repository, will be much facilitated by the work performed in advance as described above. However, some strategic questions related to the regulatory review still need to be resolved. Among these may be mentioned, in particular, the extent of SKI's own PA and the integration of PA results into a full EIS document. The latter should lead to the judgement by the authority of the acceptability of
the facility from both safety and radiation protection points of view.
Construction, Operation and (Post-)Closure Phases

The PA needs to be maintained during construction of the facility, e.g. taking into account acceptable changes of layout and material properties. The operation permit of the facility will thus be based on a renewal of the PA submitted in connection with the application to start construction. The operation of the facility will usually be permitted on the condition that recurrent up-dating of the PA is made, e.g. every tenth year. For the closure of a repository (and the decommissioning of other facilities), a renewal of the PA will most probably be required. After closure of a repository no institutional control should be necessary according to nationally accepted principles. Nevertheless, the continual development of science and technology will probably lead to a wish for checking of old PAs.

DEVELOPMENT OF PERFORMANCE ASSESSMENT AS A REGULATORY TOOL

In order to build an independent PA capability SKI has performed its own PA projects. The first fully integrated PA project, SKI Project-90 (SKI, 1991), was aimed at testing calculational tools for PAs. Project-90 was based on a hypothetical site and generic data from Swedish bedrock. Project-90 provided SKI with valuable experience regarding organization and planning of a large interdisciplinary research project. It also pointed at important research areas requiring further work, including methods for handling of site-specific data and treatment of scenarios and uncertainties. Guided by these findings and an international peer review of Project-90, a new PA project, SKI SITE-94 (SKI, 1996b), was started in 1992.

SITE-94 is a site-specific PA of a hypothetical KBS-3 type repository, using real site investigation data from SKB's Hard Rock Laboratory site at Äspö, in SE Sweden. A key theme of the project was to determine how site-specific data should be assimilated into the PA process and to evaluate how uncertainties inherent in site characterization will influence PA results. In part this was an in-house learning exercise, as SKI, in contrast to SKB, did not have any experience in handling field data. Major efforts were made to improve the link between measured data and the use of these data in PA, including the development of an approach to handling geosphere variability and the model and data uncertainty resulting from this. The approach comprised a re-interpretation of much of the 'raw' data from the site investigation at Äspö, an attempt at integrated analysis of all interdisciplinary site data and an evaluation of time history. For the detailed modelling of groundwater flow, the analyses demonstrate how to apply multiple conceptual models in order to handle conceptual model uncertainty in a PA.

It is evident that only on rare occasions are field data used directly in radionuclide release and transport calculations. Instead, the raw data are interpreted by experts in geology, rock mechanics, hydrogeology and geochemistry, who then provide interpreted or manipulated parameters for the consequence analysis calculations. Care was taken to record data sources and to document analyses as well as more judgmental assessments of these data. A traceable record and a systematic treatment of different types of uncertainties appear to be crucial when assessing the results of a PA, and also for the possibilities to provide relevant feedback to site characterization.
They are key elements for the successful planning of a site characterization programme and for SKI, when evaluating such programmes. The SITE-94 project has also resulted in the development of a comprehensive methodology for structuring and conducting PAs, for development of scenarios and for defining calculation cases for consequence analysis which takes full account of the uncertainties involved. The major advances of this work have been the development of:

• a rigorous methodology for defining and managing all the Features, Events and Processes (FEPs) and their interactions through the development of Process Influence Diagrams (PIDs) (a minimal sketch of such an influence structure follows this list),
• a quantitative technique for incorporating expert judgement of uncertainty through the process of assigning importance levels to influences between FEPs,
• a procedure which integrates all the activities in the structuring and completion of an assessment exercise (data selection, model definition, calculation case production, uncertainty management etc.), through the concept of an Assessment Model Flowchart (AMF),
• a comprehensive documentation system for all the decisions made in building the PA, which ensures that the way in which all FEPs have been incorporated, or relegated from the assessment, is recorded.
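As a minimal sketch of the influence structure behind a PID, the following represents FEPs as nodes in a directed graph whose edges carry expert-assigned importance levels, with low-importance influences relegated but still recorded. The FEP names and importance values are invented; this is not the SITE-94 implementation.

from collections import defaultdict

# FEP influence network; names and importance levels (1 = low, 3 = high)
# are invented for illustration.
influences = defaultdict(list)   # source FEP -> list of (target FEP, importance)

def add_influence(source, target, importance):
    """Record that `source` influences `target`, with an expert-judged importance."""
    influences[source].append((target, importance))

add_influence("Climate evolution",  "Groundwater flow",     3)
add_influence("Groundwater flow",   "Canister corrosion",   2)
add_influence("Canister corrosion", "Radionuclide release", 3)
add_influence("Rock stress change", "Groundwater flow",     1)

def screen(threshold):
    """Split influences into those carried into the assessment and those
    relegated (but still recorded, preserving the audit trail)."""
    kept, relegated = [], []
    for source, targets in influences.items():
        for target, importance in targets:
            bucket = kept if importance >= threshold else relegated
            bucket.append((source, target, importance))
    return kept, relegated

kept, relegated = screen(threshold=2)
print("modelled influences:   ", kept)
print("relegated but recorded:", relegated)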
In doing this, SITE-94 has moved a long way forward from the position at the end of Project-90. Over the same period SKB have also developed their PA and scenario analysis methodology. For example, the RES matrix methodology currently applied by SKB (Eng et al., 1994) has many resemblances to the methodology developed within SITE-94, but there are also important differences. For this reason, it appears that the Process Influence Diagrams developed within SITE-94 should be highly relevant tools for reviewing future SKB assessments. Assessing the SKB field data bases also revealed problems in how to represent such data in an efficient and reproducible form for users other than SKB staff. Such lessons have been passed on to SKB during the course of the project.

Good Quality Assurance will be an important aspect of the design, construction, operation and closure of a repository. It is also an important foundation of safety assessment, ensuring that analyses are comprehensive, traceable, reproducible and credible. Any regulatory submission would be expected to be supported by a well tried and tested QA system, particularly allowing reviewers to assess the quality and validity of all the data used by the applicant and required for use in any independent assessment by SKI. It is important to realize that the 'audit trail' would need to extend back in time to cover all data that might be used in an assessment, and it will thus not be adequate to apply QA at a late stage of the programme. Information abstracted from early projects where no QA was applied must be treated in a qualified manner. Where such information relates to clearly sensitive aspects of a submission, it is unlikely that it would be regarded as adequate without properly QA'd supporting data. There was no formal work in SITE-94 directed at setting in place a QA procedure for use in an assessment. However, the PA methodology developed, and the enhancement of the traceability of information flow from site characterization data to consequence calculations, will form a useful basis for data management and decision tracking. SITE-94 has also begun to contribute some important Quality Assurance tools, such as the Process Influence Diagrams and Assessment Model Flowcharts discussed above.

Several of the findings in the SITE-94 project will be of direct value for SKI's future regulatory work. In particular, SITE-94 has:

• Identified clearly some of the most significant issues in preparing and justifying a safety case. This will assist in defining what SKI expects of SKB in a future regulatory submission and the emphasis which will need to be placed on each issue. SKI will also take advantage of these findings in the recurrent reviews of SKB's RD&D programme.
• Clarified the capabilities and areas of weakness of SKI and the tools which it has available, in terms of its ability to respond to a regulatory safety submission.
• Identified topics where further work is required either by SKI or SKB to sort out residual technical issues, including canister performance, derivation of geosphere retention properties, treatment of spatial variability, buffer evolution and modelling of time dependent processes such as climate evolution.
SKI'S REQUIREMENTS ON THE NUCLEAR INDUSTRY (SKB)

SKI has recently been given the right to issue general regulations within the nuclear waste area. Important regulations to be issued by SKI within the next few years include requirements for long-term safety and guidelines on safety assessment reporting. However, up to now most of SKI's requirements on the nuclear industry with regard to safety assessments for a deep repository have been expressed as recommendations to the government in connection with SKI's reviews of SKB's RD&D programmes. In most cases these recommendations have been manifested through the government's decisions on these RD&D programmes. Recommendations have also been given, in connection with regularly held meetings, directly to SKB. Based on SKI's recommendations in the review of SKB's RD&D programmes 92 and 95 (SKI, 1993 and SKI, 1996a), the government has decided (government decisions, May 18, 1995 and December 19, 1996) that SKB shall in the forthcoming R&D work perform an integrated evaluation of the whole deep disposal system (system evaluation). This includes safety assessments of the encapsulation plant, the transportation system, a post-closure assessment of the deep repository as well as assessments of alternatives to the KBS-3 disposal concept. The overall objective of this system evaluation is to assess the preferred disposal system prior to further commitments to the main (KBS-3) disposal system.
An important aspect of this system evaluation will be to identify couplings between different components of the disposal system. Examples of such couplings are the performance requirements for the canister that need to be derived from assessments of the operating stage as well as for the long-term safety of the deep repository. Throughout the reviews of SKB's RD&D programme SKI has stressed the need for using PAs as means to integrate R&D in different areas. For example, in the development of a site investigation programme it is essential that new methods are assessed in relation to the data needs in the PA. In particular, in its review of SKB's RD&D Programme 95, SKI has recommended SKB to perform an in-depth and comprehensive postclosure Safety Assessment of SKB's proposed disposal concept (KBS-3), using site-specific data from SKB's existing study sites in crystalline rock in Sweden (see Figure 1). This recommendation was subsequently manifested in a government decision (December 19, 1996) stating that "SKB shall perform a safety assessment of the long-term safety of the deep repository before an application for a permit to site and construct the encapsulation plant is submitted, as well as prior to the start of site investigations". In SKI's view, there are several reasons why such a safety assessment should be presented and subjected to an evaluation by national and international experts:

• a comprehensive, up-to-date safety assessment is a necessary part of the system evaluation,
• the design and quality requirements for the canister must be in agreement with an up-to-date safety assessment of the entire system,
• the design of measurement programmes for site investigations and the evaluation of measured data from these investigations must be in agreement with an up-to-date safety assessment of the entire system,
• SKB's safety assessment methods and models (e.g. treatment of scenarios and uncertainty, handling of site-specific data and validation issues in a PA) must be thoroughly evaluated in the light of the developments which have taken place since SKB presented its first detailed KBS-3 analysis 13 years ago.
In its reviews of SKB's RD&D programmes and in other contacts with SKB, SKI has regularly commended SKB's development of PA methodology. The content of the most recent SKI PA project (SITE-94) will serve as an important input to SKI when providing further directions to SKB on what needs to be included in a PA as part of a license application.

TECHNICAL ASPECTS ON MANAGEMENT OF PERFORMANCE ASSESSMENTS

The task of performing PA projects within a small organization like SKI presents several problems regarding planning and timing, partly due to resource conflicts with ordinary regulatory work. However, much of the work is a "once-and-for-all" effort that has provided SKI with a useful tool kit for carrying out assessment calculations, both in terms of numerical and scoping analyses. This clearly needs to be kept under review and be up-dated. However, even if Project-90 and SITE-94 are not complete PA studies, future efforts would not have to repeat the time consuming work of establishing the assessment methodology all over again. Future work can be devoted primarily to improvement of different aspects of the methodologies developed.

SKI is to a large extent dependent on the availability of a stable group of external experts and consultants in order to maintain and further develop its capability to carry out PAs. Although this makes SKI vulnerable, it has proved manageable to maintain a core of key consultants tied to SKI's research programme over long periods of time. The flexibility to freely engage new consultants, nationally as well as from abroad, also has several benefits considering the changing research objectives in different phases of the waste disposal programme.

SKI's PA projects have given SKI valuable experience regarding management of a full PA. Although the basis for SKI's PA capability was founded on SKI's previous PA project, Project-90 (SKI, 1991), the step to a site-specific PA, like SITE-94, has involved a great deal of time consuming development work. Any organization going from generic to site-specific assessments is likely to face similar challenges, which points to a need for applying and evaluating PA methodology using site-specific data at the early stages of a waste disposal programme.

The reporting of SKI's PA projects consists of a rather detailed PA report and a number of technical background reports aimed at SKB, other PA experts and the scientific community. In order to meet the demands from the intensified EIA process, SKI is currently producing more accessible documentation of PAs for the public. Peer reviews and participation in international PA groups like NEA/IPAG are regarded as important elements of the evaluation of overall confidence and QA of SKI's PAs.
CONCLUDING REMARKS

The development of a repository for spent nuclear fuel in Sweden is a stepwise process with recurrent up-dated PAs and regulatory reviews as the main basis for decisions. Thus, over the coming years SKI will be called upon to carry out reviews and technical analyses of both generic and site-specific assessments prepared by SKB as part of the RD&D and site selection programmes. The regulatory review and evaluation of compliance in the implementor's license application is a task that requires an experienced professional team. In the pre-licensing phases SKI has prepared itself by building an independent competence in PA by conducting its own PA exercises (Project-90 and SITE-94). Although the task of performing PA projects within a small organization like SKI is a demanding one, it has proved worthwhile and has provided SKI with several important benefits. In particular, it has permitted some major technical advances in PA methodology that can be up-dated to be applied as a regulatory tool in upcoming reviews of license applications. The experiences from SKI's PA work will also serve as an important input to SKI when developing guidelines and regulations to SKB on what needs to be included in a PA in a license application and when reviewing SKB's recurrent RD&D programmes.

The comprehensiveness and quality of the implementor's PA are essential for a successful evaluation of a license application for a deep repository. Therefore it is necessary that the regulator clarifies its requirements in advance. Key components of such a PA include:

• scientific basis for the predictions (validation aspects, applicability of models etc),
• comprehensiveness of the system description,
• selection of scenarios,
• evaluation of uncertainty,
• transparency and traceability of data, assumptions, models and information flow in the PA.

Other critical requirements concern QA of the site characterization (i.e. the site-specific data and models used) and QA of the engineered barriers (e.g. non-destructive testing of sealed canisters). In its reviews of SKB's RD&D programmes SKI has emphasized the central role of PA as a means for integrating and evaluating R&D and for analyzing couplings between different components of the disposal system (e.g. the encapsulation plant, the canister, the transportation system and the deep repository). These views have been implemented, through government decisions, as requirements on SKB's programme, e.g. concerning timing and scope of the PAs in the future licensing process.

REFERENCES
Eng, T., Hudson, J., Stephansson, O., Skagius, K. and Wiborgh, M. (1994). Scenario Development Methodologies. SKB TR 94-28, Swedish Nuclear Fuel and Waste Management Co., Stockholm.

SKBF/KBS (1983). Final Storage of Spent Nuclear Fuel - KBS-3. Report, Swedish Nuclear Fuel Supply Co./Division KBS, Stockholm.

SKB (1992). SKB 91: Final Disposal of Spent Nuclear Fuel. Importance of the Bedrock for Safety. SKB TR 92-20, Swedish Nuclear Fuel and Waste Management Co., Stockholm.

SKI (1991). SKI Project-90, Volumes I and II. SKI Technical Report 91:23, Swedish Nuclear Power Inspectorate, Stockholm.

SKI (1993). SKI's Evaluation of SKB's RD&D Programme 92. SKI Technical Report 93:3, Swedish Nuclear Power Inspectorate, Stockholm.

SKI (1996a). SKI's Evaluation of SKB's RD&D Programme 95. SKI Report 96:57, Swedish Nuclear Power Inspectorate, Stockholm.

SKI (1996b). SKI SITE-94: Deep Repository Performance Assessment Project. SKI Report 96:36, Swedish Nuclear Power Inspectorate, Stockholm.
ORGANISATIONAL AND MANAGEMENT ISSUES IN THE REGULATORY ASSESSMENT OF UNDERGROUND RADIOACTIVE WASTE DISPOSAL

B.G.J. Thompson¹ and T.J. Sumerling²

¹ Environment Agency of England and Wales, London, UK
² Safety Assessment Management Ltd, Whitchurch-on-Thames, UK
ABSTRACT
The approach to management of multi-disciplinary contractor-based work, used by a UK Government regulatory body (Her Majesty's Inspectorate of Pollution (HMIP)¹), to develop, test and apply methods of risk assessment, was described by Thompson et al (1993). Two case studies were discussed, and concern was expressed that few, if any, formal publications existed to make more generally available the hard-won knowledge currently possessed by individuals and organisations with practical experience. The danger is that expensive lessons will be relearnt in the coming years after experienced staff move on or retire. The present paper develops these themes further on the basis of recent experience gained in a four year project (1991-95) to review and independently assess preliminary safety-related documentation associated with the development of a deep disposal facility in the UK for solid low and intermediate level radioactive wastes.
KEYWORDS
Regulation; radioactive waste disposal; risk assessment; project management; multi-disciplinary working; Her Majesty's Inspectorate of Pollution; HMIP.
INTRODUCTION
The approach to management of multi-disciplinary contractor-based work, used by a UK Government regulatory body (Her Majesty's Inspectorate of Pollution (HMIP)¹), to develop, test and apply methods of risk assessment, was described by Thompson et al (1993). Two case studies were discussed, and concern was expressed that few, if any, formal publications existed to make more generally available the hard-won knowledge currently possessed by individuals and organisations with practical experience. The danger is that expensive lessons may need to be learnt in the coming years after experienced staff move on or retire.
¹ Now merged with the National Rivers Authority and the Waste Regulatory Authorities (since 1 April 1996) to form the Environment Agency of England and Wales.
A number of issues were discussed, including:

• the client research management environment, and use of computer based information management systems;
• estimating and controlling resources and time scales;
• multi-disciplinary teamworking, communication and conflict resolution;
• accommodating new information and computational tools during an assessment;
• alternative documentation structures for reporting an assessment;
• confidence, quality assurance (QA), and peer review.
The present paper develops these themes further on the basis of recent experience gained in a four year project (1991-1995) to review and independently assess preliminary safety-related documentation associated with the development of a deep disposal facility in the UK for solid low and intermediate level radioactive wastes. New factors emerged also, especially the influence of the relationships between regulator and proponent, which may affect the conduct, communications and outcome of a project. THE ASSESSMENT PROJECT (1991-1995) As outlined in the previous conference, Thompson et al (1996), the activities in the UK leading to a possible authorisation of a new deep disposal facility fall into three stages. The period up to the late 1980s may be regarded as the first of these, when no safety submissions were received from industry, and HMIP had consequently to construct surrogate "cases" for hypothetical facilities in order to indicate to what might be expected in a safety case at that stage of development of the subject. The second stage involves deep borehole investigation of a possible site, with the original expectation of seeking planning permission for construction of a deep repository for solid low and intermediate level radioactive wastes through a public inquiry in due course. The basis of this application would be a so called Detailed Environmental and Radiological Analysis (DERA) using interim data from the on going research and site studies at the time. If this application were successful, a third stage would commence some years later with a Final Environmental and Radiological Analysis (the FERA) submitted to regulators for authorisation to dispose of wastes under the Radioactive Substances Act (1993). During 1991, following a comprehensive competitive tendering exercise, HMIP commissioned two lead contractors to undertake assessments of safety documentation, and to provide support during the inquiry hearings. These two principal contracts were: •
• the Scientific and Technical Review (STR); and
• the Independent Performance Assessment (IPA) project.

These were assisted by two further contracts:

• the Assessment Information Management System (AIMS), to provide a bibliographic database and a regulatory correspondence logging procedure; and
• the Quality Assurance (Q) contract, to review and extend the existing QA/SQA system and its documentation, and to review the proponent's approach as required.
The STR involved a panel of 15 recognised experts covering a wide range of subject disciplines with management and editorial services provided by three lead contractor staff. The IPA project employed a total of 63 professional staff, comprising those from the lead contractor together with those from the several subcontractors and individual consultants involved. The majority of personnel were not employed full time on HMIP work in either project. The resources used between January 1992 and December 1994 totalled about 6,000 person days. Table 1 presents a breakdown in terms of task headings designed to allow comparison with resources discussed in the earlier management review, Thompson et al (1993).
TABLE 1
Resources (professional person-days) expended on the activities during various assessments up to completion of draft technical reports.

Activities covered: data collation; summary model documentation; formulation of conceptual models; elicitation of subjective data; formulation of numerical models; systems calculations and PRA; high-risk re-analysis; evaluation of uncertainty and bias; overview/summary documentation; project and technical management; QA; computer systems(2); supplementary studies; and reviews and advice.

Totals (person-days) by assessment: Drigg(1), 2500; Dry Run 3(3), 1602; and, for the 1991-95 assessment project, IPA phase 1, 2595; IPA phase 2, 2484; and STR, 1010 (of which project and technical management 200, QA 60, and reviews and advice 750; the remaining activities were not applicable to, or not done within, the STR).

(1) Not separated from within estimates for bias evaluation.
(2) Includes commissioning of new parallel computer workstation.
(3) See Thompson, Wakerley and Sumerling (1993).
Figure 1 provides a barchart summary of the programme of work of the IPA project:

(A) shows the programme envisaged at the start of this contract, including support for preparation of Proofs of Evidence by HMIP for the Public Inquiry, and continued assistance during the hearings themselves (Phase 3).

(B) shows the revised schedule due to the delay of the Inquiry to 1997.¹ Phase 3 was no longer relevant in the contracted period to 31 December 1994. Phase 1 was extended to April 1993 to complete a conventional PSA for steady-state conditions, based upon the current climate, and a large number of ancillary studies were conducted to develop understanding of site-specific features and processes, and to prepare for Phase 2.
¹ Since further delayed until about 2002.
(C) The second phase began in May 1993, and was then rescheduled early in 1994 because the proponent's timetable had again altered, necessitating early completion of the IPA work in September 1994, as shown. Rather larger amounts of geological data had been obtained by this time, and a "state of the art" PSA was then carried out using the TIME4 and VANDAL Monte Carlo simulation software, incorporating a fully three-dimensional groundwater flow and radionuclide transport model that evolved over time under the influence of climate-driven changing surface boundary conditions (for instance, topographic changes, sea level, recharge, etc.).
[Figure 1 is a barchart showing three versions of the IPA programme against a 1991-1996 timeline: (A) the schedule in the original contract; (B) the schedule as revised in April 1993; and (C) the project shortened and rescheduled in March 1994, with markers for the expected delivery of information from the proponent, the original and revised contract end dates, completion of Phase 1 and the reduced Phase 2, release of drafts, peer review, client overview, the formation of the Environment Agency, and development of Agency assessment policy in 1996.]

Figure 1. Independent Assessment Project Programme Revisions.

The entire assessment-related project has been documented extensively as contractor drafts, and the structured sets of technical reports produced under IPA Phases 1 and 2 are shown in Figures 2A and 2B, respectively.

DISCUSSION OF THE ISSUES
Use Of Resources

In comparison with the examples quoted previously, Thompson et al (1993), this assessment-related project was much larger and more complex to manage, combining for the first time in the HMIP programme a detailed peer review (the STR) of safety-related documentation with a quantitative assessment (the IPA). The situation was also affected by the external changes described above, and by the difficulty of obtaining expected information in the given time periods.
[Figure 2A is a report-structure diagram. Phase 1 Technical Overview: TR-ZI-1. Phase 1 Assessment: TR-ZI-2 Surface Environment; TR-ZI-3 Geology & Hydrogeology; TR-ZI-4 Repository & Nearfield; TR-ZI-5 Waste & Radiological Data; TR-ZI-6 Geochemistry & Transport; TR-ZI-7 Environmental Change; TR-ZI-8 VANDAL System Model Development; TR-ZI-9 VANDAL System Model PRA. Preparation for Phase 2 Assessment: TR-ZI-2 to 6 as opposite; TR-ZI-10 VANDAL Enhancement; TR-ZI-11 Identification of Processes; TR-ZI-12 Uncertainty and Bias Audit; TR-ZI-13; TR-ZI-14 Gas Generation & Migration; TR-ZI-15 Hydrogeological Spatial Variability; TR-ZI-16 Matrix Diffusion; TR-ZI-17 Visualisation & Presentation; TR-ZI-18 Hydrological Network; TR-ZI-19 Human Actions; TR-ZI-20 Alternative Climate Driver.]

Figure 2A. Phase 1 Document Structure.
[Figure 2B is a report-structure diagram. Technical Overview: TR-Z2-0. Phase 2 reports: TR-Z2-1 Status Report (June 1993); TR-Z2-2 Specification for VANT4; TR-Z2-3 Verification of Phase 1 Model; TR-Z2-4 Foundation of System Modelling; TR-Z2-5 Evaluation of Site Data; TR-Z2-6 Environmental Change Modelling; TR-Z2-7 Hydrogeological Modelling; TR-Z2-8 Spatial Variability Modelling; TR-Z2-9 Chemical Modelling; TR-Z2-10 Fractured Rock Modelling; TR-Z2-11 System Calculations and PRA; TR-Z2-12 Uncertainty and Bias Evaluation; TR-Z2-13 Evaluation of GGISs; TR-Z2-14 Procedure for Review of Nirex; TR-Z2-15 Biosphere Model Enhancements; TR-Z2-16 Technical Overview VANT4; TR-Z2-17 Elicitation; together with the TR-ZI-1 Phase 1 Overview, Phase 1 Assessment and Preparation for Phase 2 report sets.]

Figure 2B. Phase 2 Document Structure.
In comparison with the Dry Run 3 trial, which was under the full control of HMIP, it was expected that the new work would involve a higher proportion of effort on management, as indeed seen in Table 1, because HMIP had less control over the timing of the supply of information. This is also true of bias evaluation, QA and the commissioning of the new parallel computer system, for example. However, contrary to expectation, a lower percentage was required for collating and interpreting data. In retrospect, this may have been due to the unavailability of site borehole information at the time; an increased proportion of time devoted to elicitation of judgements, as a temporary substitute for such information, could therefore be expected.
Effect Of External Influences

Project organisation and management structures proposed by the lead contractors, and accepted by the client at the time, proved to be resilient in the face of the external factors outlined above. However, as the charts used in the Project Quality Assurance Plan (PQAP) reveal, there was no clear distinction, in the minds of project managers, between the flows of control over resource allocation and the transfers of technical information that resulted in plans of work proposed within the team. Much of the information supplied by the proponent was obtained through meetings and correspondence with the regulatory staff, and not by direct contact between technical contractors on both sides. This slowed the resolution of technical queries, and also drew the client manager more deeply into technical discussions within the IPA project than had been anticipated. Figures 3A and 3B attempt to clarify the nature of the interactions that occurred in practice during the final phase of the IPA by separating information exchanges on technical matters from those related to project management and, see Figure 3B, from the control flow itself.
[Figure 3A is an organisation diagram showing technical information flows between the Proponent, the Client Manager and Regulator (HMIP), other contracts and QA, the Contractor Management Team, the Document Manager, the System Integration Group (SIG), and the Deputy Project Manager and Technical Advisors.]

Figure 3A. Technical Information Flows during Phase 2 of the Independent Performance Assessment.
Communication Within The Client-Contractor Team

There has been some disagreement, both inside HMIP and between the contractors themselves, about the degree of interaction that would have been desirable between the STR and IPA projects. Clear separation was intended originally, when defining these two contracts in 1991, to enable the Expert Panel to be largely independent so that an effective internal peer review of IPA reports might be made. In practice, there were several individuals acting as members of the core management groups of both the IPA and STR projects. Progress and technical reports were exchanged regularly.
[Figure 3B is an organisation diagram showing project management information and control flows between the Client Manager, the Quality Manager, the Document Manager, the Project Director, the Project Manager, the Deputy PM and Advisors, and the Project Management Core Group (PMCG).]

Figure 3B. Project Management Information and Control Flows during Phase 2 of the Independent Performance Assessment.

Multi-Disciplinary Working
At the more detailed technical level of each project it proved more difficult to instill a broad appreciation of the overall project and of the regulatory objectives involved, especially as most of the staff were only part time on HMIP work. The formation of a Systems Integration Group (SIG), see Figure 3A, for the second phase overcame some of these difficulties, however. The process of elicitation of professional judgements about:

(a) the overall system conceptual model content and structure;
(b) the possible bias due to assumptions and approximation in sub-system representations; and
(c) the probability distributions of values of key parameters,

was carried out in an effective manner using balanced Working Groups of Experimentalists, Modellers and end-Users (EMU groups), as advocated earlier, Thompson et al (1993).

Project Control
Progress reporting, and the use of formal methods such as PERT, as recommended previously, proved generally effective, although (still) not in the culture of the contractor staff except for those carrying out systems and computer-related work. Visualisation of progress using "wavefronts" was required by the client, and provided reasonable clarity during the later stages of the IPA project, although, surprisingly, computer-based tools require some improvement in their graphical interfaces.
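For readers unfamiliar with the technique, the core of a PERT-style analysis is a longest-path (forward-pass) calculation over a task network. The Python sketch below uses an invented task fragment with hypothetical durations, not the actual IPA programme.

def critical_path(tasks):
    # tasks: {name: (duration_days, [predecessors])}. The forward pass
    # computes each task's earliest finish; the maximum over all tasks
    # is the project duration along the critical path.
    finish = {}
    def earliest_finish(name):
        if name not in finish:
            duration, preds = tasks[name]
            finish[name] = duration + max((earliest_finish(p) for p in preds), default=0)
        return finish[name]
    return max(earliest_finish(t) for t in tasks)

# Hypothetical fragment of an assessment programme:
tasks = {
    "data collation":       (40, []),
    "conceptual models":    (60, ["data collation"]),
    "elicitation":          (30, ["data collation"]),
    "numerical models":     (50, ["conceptual models", "elicitation"]),
    "systems calculations": (45, ["numerical models"]),
}
print(critical_path(tasks), "working days on the critical path")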
Quality audits were carried out regularly in accordance with the PQAP. Only minor non-conformities were recorded, and these were readily corrected.
Documentation
The deliverables shown in Figures 2A and 2B, and the reviews from the STR, were supported by a very large number of management and computer-systems-related documents, internal QA audit reports, extensive records of data, case study calculation records, results files, test and acceptance records, software listings, and configuration control records. Overall, a high level of structure and traceability was achieved in the STR and the IPA Phase 1 reports; but because of the truncation of the second phase the same standards were not fully achieved there. The clarity of the "lateral" or "longitudinal" structure depicted in the previous paper, Thompson et al (1993), was not obtained here, although the documentation of the final systems and PRA studies approached the level of integration of the Dry Run 3 documentation. The need for summary documentation was realised for the IPA through the Management Overview and Technical Overview reports produced. The latter represent the opinions of the lead contractor and, subsequently, the client manager has drafted a further interpretation of the entire project as a "Perspective" report for internal briefing, to help the Agency formulate its policy regarding future assessments.
[Figure 4 is a hierarchy chart of the Radioactive Waste Disposal Assessment Centre QA procedures, covering: management of the disposal assessment project; operation and recording of meetings (BSD.002); maintenance and dissemination of a project glossary; external management of software standards (CMAN, BSD.004); software quality plans, software development life cycles, reviews and requirements tracing; information and documentation control and management; assessment of proposals for disposal of radioactive waste; management of research contracts; independent analysis and assessment of proposals for disposal of solid radioactive waste (BSD.006); scientific and technical review of the post-closure safety case (BSD.032); reviews of an applicant's site investigation (BSD.033); and progress management.]

Figure 4. Quality Assurance Documentation.
Confidence Building

A number of actions can be taken by management to generate confidence in assessment-related work. Emphasis was given previously to the importance of QA, especially in relation to the development and application of computer software. The relatively simple standards and procedures existing in 1992 have now been replaced progressively by an extensive series of thirty-five detailed QA procedures, some of which are shown in Figure 4. Much of this new work was done under the "Q" contract listed above, in parallel with the later stages of the assessment project, and then afterwards. This QA-related work drew upon the experience gained therein to enable the review (STR) and quantitative assessment (IPA) aspects of regulation to be set out in the BSD.032 and BSD.006 procedures. The procedures for review of site investigation and for the conduct of exchanges between HMIP and the proponent are documented as BSD.033 and BSD.002, respectively. During this period issues such as "validation", Wilmot and Galson (1994), and the "perception and communication of risk", Wilmot, Galson and Kemp (1995), were identified for Seminars that were sponsored under the wider HMIP research programme.

Traceability

Confidence in safety cases, and in their associated assessments, will depend critically upon the availability of clear, explicit and complete records of all decisions, assumptions and other information used to establish the line of argument, and of the results themselves. To demonstrate one possible approach to this problem, a relational database was developed and applied during the IPA project, and an initial evaluation of related biases was carried out. See Grindrod (1996), and Thompson, Gralewski and Grindrod (1995), for further information.

CONCLUDING REMARKS

The many conclusions about good management practice drawn up in the previous paper were borne in mind during the planning, execution and subsequent appraisal of the recent assessment-related project undertaken between 1991 and March 1996. Much that was done in this project represented a unique accomplishment in technical terms, by a regulator, in the subject to date, as explained in the companion paper at this conference; see Thompson and Williams (1997). The experience of carrying out these studies has been extremely beneficial. It has challenged the then extant procedures being considered for use by regulators, and the related technology, tools and use of site-specific data and other knowledge. However, it is more important at this stage, in a potentially long programme of assessment activity, to provide a mechanism to expose issues and difficulties that proponents themselves may expect to encounter in developing and articulating a safety case, and which the Environment Agency as regulator will need to question. Difficulties are likely to stem principally from changes in the proponent's programme, and from the unpredictable supply of information needed for regulatory review and evaluation at that time. This reinforces the final comment in the previous paper that "... it is necessary to have a robust and flexible management approach to accommodate the quite dramatic changes in the plans of industry." To ensure the knowledge gained is available to the Agency in the future, the present authors believe it is advisable to maintain an experienced professional team on a continuing basis, also to avoid loss of support due to potential "conflicts of interest" if contractors are employed elsewhere.
Not only technical skills need maintaining, but also those of management. At least some of the internal organisational and communication difficulties within the IPA initially may be due to the lack of experience of some key personnel, who built up improved working practices as the
project developed. These, and other issues raised above, will be taken into account by the Environment Agency in deciding technical policy regarding regulatory assessments for the coming years. The lessons briefly expounded in this paper should be of interest to other similar groups undertaking safety-assessment-related work, and so an international seminar on the "Management of Risk Assessments" is to be held on behalf of the Agency early in 1997; it is hoped to report on this to ESREL '97 also.

ACKNOWLEDGEMENTS

The authors thank the Environment Agency for permission to publish this paper. The results of this work may be used in the formulation of policy, but the views expressed in this paper do not necessarily represent policy.
REFERENCES

Grindrod, P. (1996). Traceability of Argument and the Treatment of Conceptual and Parametric Uncertainties within a Safety Case, and how a Regulator may examine this by Independent Analysis. Proc. ESREL '96/PSAM-III Conference, Crete (June 1996).

The Radioactive Substances Act 1993. HMSO (ISBN 0-10-541293-7).

Thompson, B.G.J., Gralewski, Z.A. and Grindrod, P. (1995). On the Estimation of Bias in Post-Closure Performance Assessments of Underground Radioactive Waste Disposal. Proc. 6th Intl. Conf. on High-Level Radioactive Waste Management, Las Vegas, Nevada, USA (May 1995).

Thompson, B.G.J., Smith, R.E. and Porter, I.T. (1996). Some Issues affecting the Regulatory Assessment of Long-Term Post-Closure Risks from Underground Disposal of Radioactive Wastes. Proc. ESREL '96/PSAM-III Conference, Crete (June 1996).

Thompson, B.G.J., Wakerley, M. and Sumerling, T.J. (1993). Recent Management Experience of UK Performance Assessments of Radioactive Waste Disposal. Proc. 4th Intl. Conf. on High-Level Radioactive Waste Management, Las Vegas, Nevada, USA (May 1993).

Thompson, B.G.J. and Williams, C.R. (1997). The Regulatory Review of Safety-related Information regarding Underground Radioactive Waste Disposal in England and Wales (ibid).

Wilmot, R.D. and Galson, D.A. (1994). Validation: What should a Regulator look for in a Safety Case? Proc. Second Intl. Conf. on Probabilistic Safety Assessment and Management, PSAM-II, San Diego, California, USA (March 1994).

Wilmot, R.D., Galson, D.A. and Kemp, R.O. (1995). HMIP Seminar Proceedings: Risk Perception and Communication. UK Dept of the Environment Report DOE/HMIP/RR/95.011.
B6" Safety of Nuclear Waste Disposal
SAFETY ASSESSMENT OF COMPLEX ENGINEERED AND NATURAL SYSTEMS: RADIOACTIVE WASTE DISPOSAL

J. A. McNeish¹, M. A. Balady², V. Vallikat¹ and J. Atkins

¹ INTERA Inc., a Duke Engineering and Services Company, 1180 Town Center Drive, Las Vegas, Nevada 89134, USA
² TRW, Inc., 1180 Town Center Drive, Las Vegas, Nevada 89134, USA
ABSTRACT

Evaluation of deep, geologic disposal of nuclear waste requires the probabilistic safety assessment of a complex system over long time periods. A probabilistic approach is required because of the inherent complexities present in the system from the coupling of various processes and sub-systems, parameter and model uncertainties, spatial and temporal variabilities, and the multiplicity of designs and scenarios. Both the engineered and natural systems are included in the evaluation. Each system has aspects with considerable uncertainty, both in important parameters and in overall conceptual models. The study presented herein provides a probabilistic safety assessment of a potential repository system for multiple engineered barrier system (EBS) design and conceptual model configurations (CRWMS M&O, 1996a) and considers the effects of uncertainty on the overall results. The assessment is based on data and process models available at the time of the study and does not necessarily represent the current safety evaluation. In fact, the percolation flux through the repository system is now expected to be higher than the estimate used for this study. The potential effects of higher percolation fluxes are currently under study. The safety of the system was assessed for both 10,000 and 1,000,000 years. Use of alternative conceptual models also produced major improvement in safety. For example, use of a more realistic engineered system release model produced improvement of over an order of magnitude in safety. Alternative measurement locations for the safety assessment produced substantial increases in safety, though the results are based on uncertain dilution factors in the transporting groundwater.
KEYWORDS Radioactive waste disposal, environmental systems, uncertainty, probabilistic safety assessment, complex systems, chaos
INTRODUCTION

Evaluation of deep, geologic disposal of nuclear waste requires the probabilistic safety assessment of a complex system over long time periods. This is due to the inherent complexities present in the system from
the coupling of various processes and sub-systems, parameter and model uncertainties, spatial and temporal variabilities, and the multiplicity of designs and scenarios. Both the engineered and natural systems are included in the evaluation. Each system has aspects with considerable uncertainty, both in important parameters and in overall conceptual models. The study presented herein provides an assessment of the safety of the system for multiple design and conceptual model configurations, and considers the effects of uncertainty on the overall results. The assessment represents the understanding of the repository system at the time of the study. As such, recent data on such parameters as the percolation flux through the repository have not been included; these data may alter the overall conclusions as to the functioning of particular parts of the system, but not the general approach toward analysis of such complex systems.
APPROACH
Total System Performance Assessment (TSPA) (known in other projects as Probabilistic Safety Assessment (PSA)) is a tool for integrating the various data and process model information for a potential radioactive waste disposal system into a single probabilistic safety assessment. TSPAs require the explicit quantification of the relevant processes and process interactions. Total system performance assessments explicitly acknowledge the uncertainty in the process models and parameters and strive to evaluate the impact of this uncertainty on the overall performance. The aim of a total system performance assessment is to be as complete and reasonably conservative as possible, and to assure that the descriptions of the predictive models and parameters are sufficient to ascertain their accuracy. In addition to providing a quantitative basis for evaluating the suitability of the site to meet regulatory objectives, such assessments are useful in helping to define the most significant processes, the information gaps and uncertainties regarding these processes and the corresponding parameters, and therefore the additional information required in order to have a more robust and defensible assessment of the overall performance.

The overall philosophy of an assessment of total system performance is (1) to use models and parameters which are as representative as current information allows for those processes that may affect the predicted behavior of the system, and (2) to predict the responses of the natural and engineered components of the system that are expected to result from the emplacement of wastes in the potential repository. In those cases where representative information is not available or is very uncertain, bounding or conservative assumptions must be made; the predicted performance then remains realistic, but may be lower than would be the case if more optimistic assumptions were included in the analyses. The performance assessment process requires the explicit treatment of uncertainty and variability of natural phenomena. The impact of the uncertainty is directly evaluated in the assessments themselves due to the stochastic nature of the analyses.

Several sources of information were utilized in the analyses conducted for the study reported herein. The study used a probabilistic safety assessment model. Field and laboratory data were incorporated directly, or as abstracted probability distributions or functions, into the PSA model. Detailed process-level modeling of some physical processes (e.g., thermohydrology of the system) was conducted to produce response surfaces for use in the probabilistic safety assessment model. The probabilistic safety assessment model used in the evaluation was RIP (Repository Integration Program; Golder Associates, 1994). RIP has been used to evaluate several potential nuclear waste disposal sites around the world. It provides the capability to incorporate a number of uncertain parameters as well as process-level modules if so desired. Alternative design and conceptual model configurations were evaluated, and safety was assessed in terms of dose to a maximally exposed individual. Additionally, the effect of uncertainty in selected parameters and conceptual models on the overall dose was evaluated. The simulated 1,000,000-year peak dose at a location 5 km or 30 km from the source is presented herein as the key performance measure for the analyses.
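The realization loop at the heart of such a probabilistic assessment is simple to sketch. The following minimal Python fragment is illustrative only: the parameter names, distributions and response function are invented stand-ins, not those of RIP or of the present study.

import random

def sample_parameters(rng):
    # Illustrative epistemic distributions; real assessments abstract
    # these from field/laboratory data and expert elicitation.
    return {
        "infiltration_mm_yr": rng.lognormvariate(0.0, 1.0),   # percolation flux
        "wp_failure_time_yr": rng.uniform(1.0e3, 1.0e5),      # container lifetime
        "dilution_factor":    rng.uniform(10.0, 1000.0),      # saturated-zone mixing
    }

def peak_dose_rem_yr(p):
    # Placeholder total-system response: release scaled by flux,
    # delayed by package failure, reduced by dilution.
    release = 1.0e-2 * p["infiltration_mm_yr"] / p["wp_failure_time_yr"]
    return release / p["dilution_factor"] * 1.0e6

rng = random.Random(42)
realizations = [peak_dose_rem_yr(sample_parameters(rng)) for _ in range(100)]
realizations.sort()
print("approximate median peak dose (rem/yr):", realizations[50])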
DESCRIPTION OF SYSTEM

The key components of the repository system include: the engineered barrier system (e.g., waste form, waste
package, invert, drift liner, backfill), the geosphere (unsaturated and saturated zones), and the biosphere. The performance assessment tool used in the analysis described herein contains four component modules: (1) waste package behavior and radionuclide release model, (2) radionuclide transport pathways model, (3) disruptive events model, and (4) biosphere dose/risk model (see Figure 1). The disruptive events model was not used in these analyses and will not be described. However, disruptive events will be included in future evaluations. The other modules are summarized briefly below.
[Figure 1 diagram: the engineered barrier system, geosphere and biosphere components feed the performance measures of containment and peak dose.]

Figure 1. Key components in potential repository system.
Waste Form/Waste Package

The waste package behavior and radionuclide release model input requirements are descriptions of the radionuclide inventories in the waste packages, a description of near field environmental conditions (which may be defined as temporally and spatially variable), and subjective estimates of high-level parameters describing container failure, matrix alteration/dissolution, and radionuclide mass transfer. Waste package degradation profiles, along with matrix alteration/dissolution rates, are used to compute the rate at which radionuclides are exposed. Once exposed, the rate of mass transfer out of and away from the waste package is calculated. The output from this component (for each system realization) consists of time histories of release for each radionuclide from the waste packages, and acts as the input for the transport pathways component. The source term for the waste form and waste package degradation was developed from both laboratory data on the dissolution of the waste form and corrosion data reduced from the literature (CRWMS M&O, 1995).
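As a hedged illustration of this component's logic, the sketch below steps a single package through failure, matrix alteration and mass transfer; the first-order rates, time stepping and units are assumptions made for the example, not the model documented in CRWMS M&O (1995).

def release_history(inventory_ci, fail_time_yr, alteration_rate_per_yr,
                    transfer_rate_per_yr, t_end_yr, dt_yr):
    # After container failure the waste matrix alters at a fixed fractional
    # rate, exposing inventory; exposed mass then leaves the package at a
    # first-order transfer rate. Returns (time, release rate in Ci/yr) pairs.
    exposed, released, history = 0.0, 0.0, []
    t = 0.0
    while t <= t_end_yr:
        if t >= fail_time_yr:
            newly_exposed = alteration_rate_per_yr * dt_yr * (inventory_ci - released - exposed)
            exposed += newly_exposed
            out = transfer_rate_per_yr * dt_yr * exposed
            exposed -= out
            released += out
            history.append((t, out / dt_yr))
        else:
            history.append((t, 0.0))
        t += dt_yr
    return history

hist = release_history(1000.0, 5.0e3, 1.0e-4, 1.0e-2, 2.0e4, 100.0)
print("peak release rate (Ci/yr):", max(rate for _, rate in hist))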
Natural System

The radionuclide transport pathways model simulates radionuclide transport through the near and far field in a probabilistic mode. The model uses a phenomenological approach that attempts to describe rather than explain the transport system. The resulting transport algorithm is based on a network of user-defined pathways. The pathways may be used for both flow balance and radionuclide transport purposes, and may account for either gas or liquid phase transport. An important purpose of a pathway in the model is to represent large-scale heterogeneity of the hydrologic system, such as geologic structures and formation-scale hydrostratigraphy. Geosphere pathways may be subdivided into flow modes, which address heterogeneity at the local scale (e.g., flow in rock matrix, flow in fractures). The unsaturated and saturated zone pathways were developed based on process-level modeling of the geosphere. The system was subdivided into 6 unsaturated columns directly underneath the repository which fed into the saturated zone pathway. Additional information on the pathways model can be found in CRWMS M&O (1995).
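For illustration, such a pathway network can be approximated as a chain of well-mixed compartments with first-order transfer slowed by a sorption retardation factor; the sketch below makes that simplifying assumption and is not the project's transport algorithm.

def transport_chain(source_rate, n_cells, residence_yr, retardation, n_steps, dt_yr):
    # 1-D chain of well-mixed pathway cells; each cell passes mass to the
    # next at an effective rate slowed by sorption (retardation factor).
    cells = [0.0] * n_cells
    outflow = []
    k = 1.0 / (residence_yr * retardation)   # effective transfer rate, 1/yr
    for step in range(n_steps):
        flux_in = source_rate(step * dt_yr)  # Ci/yr entering the first cell
        for i in range(n_cells):
            out = k * cells[i] * dt_yr
            cells[i] += flux_in * dt_yr - out
            flux_in = out / dt_yr
        outflow.append(flux_in)              # discharge to the biosphere, Ci/yr
    return outflow

out = transport_chain(lambda t: 1.0 if t < 1000.0 else 0.0,
                      n_cells=5, residence_yr=200.0, retardation=10.0,
                      n_steps=500, dt_yr=10.0)
print("peak discharge (Ci/yr):", max(out))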
Biosphere

The biosphere dose/risk module describes the fate and effect of radionuclides in the biosphere. The biosphere module allows the user to define dose receptors in the system. Receptors receive radiation doses from specified geosphere (e.g., a water supply aquifer) or biosphere (e.g., a pond, or flora and fauna)
564
J.A. McNeish et al.
pathways. Concentrations in these pathways are converted to radiation doses (or cancer risks) based on user-defined conversion factors. The biosphere model used in these analyses simply incorporated dose conversion factors from the U.S. EPA for the simulated releases from the repository to the maximally exposed individual located a certain distance from the repository and drinking 2 liters of water per day from a well located in the contaminated saturated zone.
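Under this exposure scenario the dose arithmetic reduces to a one-line calculation, as the sketch below shows; the well-water concentration and the dose-conversion factor are hypothetical placeholders, not EPA values.

L_PER_DAY = 2.0            # drinking-water intake stated in the scenario
DAYS_PER_YEAR = 365.25

def annual_dose_rem(conc_ci_per_l, dcf_rem_per_ci):
    # Dose = well-water concentration x annual water intake x ingestion
    # dose-conversion factor (both numerical inputs are placeholders).
    intake_ci = conc_ci_per_l * L_PER_DAY * DAYS_PER_YEAR
    return intake_ci * dcf_rem_per_ci

# e.g. a nuclide at 1e-12 Ci/L with a hypothetical DCF of 4e3 rem/Ci:
print(annual_dose_rem(1.0e-12, 4.0e3), "rem/yr")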
ABSTRACTION

The analysis of complex systems in a probabilistic framework requires simulation of multiple realizations of the system to capture the effect of the uncertainty in the system. Due to computational limitations in representing the complexity of the processes involved, as well as incomplete knowledge of some of the systems, we have developed efficient, representative simplifications or abstractions of some of the processes present in the system. The abstractions are calibrated against detailed process models to the extent possible and implemented in the probabilistic safety assessment analysis. Examples of abstractions include response surfaces (e.g., temperature and relative humidity as a function of time from thermo-hydrologic calculations) and dimensionality reduction (e.g., use of a 1-dimensional representation of a 3-dimensional hydrologic flow system). Figure 2 presents an example of the abstraction process for thermo-hydrologic results. These simplifications allow the probabilistic safety assessment to be conducted in a computationally efficient manner that appropriately represents the components of the complex system which are important to performance.

[Figure 2 diagram: multiple runs of a thermo-hydrologic model (e.g., NUFT) are condensed into a response surface of temperature (T) and relative humidity (RH).]

Figure 2. Example abstraction process for response surface
SENSITIVITY EVALUATIONS

The safety assessment, undertaken to determine the significance of additional engineered barriers in complementing the waste packages and the natural system performance, evaluated multiple designs, conceptual models, data values, and performance measures. The design configurations presented include the addition of a backfill barrier, and alternative waste package spacing (thermal hydrology effects). The backfill barrier refers to the emplacement of material (e.g., sand, rock, etc.) in the drifts over the waste packages to provide additional protection to the waste packages by reducing the relative humidity at the waste package surface, thus delaying corrosion. The conceptual model configurations evaluated included cathodic protection of the inner barrier of the waste packages, and an alternative location for dose measurement. Each of the designs and conceptual models possesses uncertainty in performance as well. The analyses were conservative because the uncertainty in data or models was weighted toward values which would cause reduced calculated repository performance. A common representation of the uncertainty in the system is a complementary cumulative distribution function (CCDF). For these analyses, 100 realizations were conducted for each sensitivity case. The peak dose from each realization was plotted on the CCDF to determine the range in peak dose for the 100 realizations. For alternative designs and conceptual models, separate sets of realizations were conducted.
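Constructing an empirical CCDF from a set of equally weighted realizations is straightforward, as the following sketch shows; the five peak-dose values are invented for the example.

def ccdf(values):
    # Complementary cumulative distribution: the fraction of realizations
    # whose peak dose exceeds each observed value.
    n = len(values)
    pts = sorted(values)
    return [(v, (n - i - 1) / n) for i, v in enumerate(pts)]

peak_doses = [1e-6, 3e-5, 8e-7, 2e-4, 5e-6]   # rem/yr, illustrative only
for dose, prob_exceed in ccdf(peak_doses):
    print(f"{dose:9.1e}  P(exceed) = {prob_exceed:.2f}")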
Use of backfill.
One of the major purposes of the study was to evaluate the effect of backfill on performance: if backfill were used, would performance increase substantially? The CCDFs of peak dose results over 1,000,000 years at 5 km from the source are presented in Figure 3 for the simulated case with square waste package spacing. The difference in peak dose between the backfill and no-backfill cases at the 50th percentile is approximately an order of magnitude. Also, there is a small reduction in the range of uncertainty in the system for the backfilled system versus the non-backfilled system.

[Figure 3 plot: CCDFs of 1,000,000-year total peak dose to the accessible environment (rem/yr, about 10^-7 to 10^2) for the backfill and no-backfill cases.]

Figure 3. Effect of backfill on system safety
Alternative Waste Package Spacing.
Waste package spacing options provide alternative designs which have uncertainty in their effects on the system safety, due to the altered thermal hydrology of the system. Waste package spacing designs which eliminate cool spots hold promise in terms of keeping percolation away from the waste packages. A simplified representation of the difference between the square waste package spacing design (advanced conceptual design) and a "line loading alternative" (i.e., packages nearly end to end along a drift, drift spacing increased, hot and cold waste packages interspersed) was evaluated. The simulated difference at 1,000,000 years shown in Figure 4 for the 5 km, backfill case was a factor of 2 or 3, not a significant amount considering the other uncertainties in the system.

[Figure 4 plot: CCDFs of 1,000,000-year total peak dose to the accessible environment (rem/yr) for the advanced conceptual design and the line loading alternative.]

Figure 4. Effect of waste package spacing on system safety
Cathodic protection.
A conceptual model evaluated in the study was the inclusion of a simple cathodic protection model in the waste package degradation model. The evaluation of the alternative conceptual model provided a glimpse of the benefits to be gained by taking performance credit for cathodic protection of the waste package. The rates appropriate for this mechanism are currently being evaluated in laboratory tests. However, the assumption that the inner barrier is protected until 50% of the outer barrier is degraded produces a reduction in the peak dose of over an order of magnitude at the 50th percentile for the 5 km backfill case (see Figure 5). The uncertainty as shown in the CCDF is also reduced, from over 3 orders of magnitude to slightly over 2 orders of magnitude (99th percentile), by incorporating the cathodic protection performance credit.

[Figure 5 plot: CCDFs of 1,000,000-year total peak dose to the accessible environment (rem/yr) with and without cathodic protection (50%).]

Figure 5. Effect of cathodic protection on system safety
Distance to maximally exposed individual.
The distance from the source to the maximally exposed individual, at which doses are calculated, is set by regulation. The additional dispersion and dilution of the radionuclide source term which occurs over longer distances leads to a reduction of the peak dose. Uncertainty in the flow field increases slightly with increasing distance from the source term. To investigate potential changes in dose due to increased distance from the source, doses were evaluated at the original regulatory distance of 5 km and at a possible new regulatory distance of 30 km. The reduction in peak dose was linearly proportional to the factor of increase in the dilution and dispersion of the system for the backfill advanced conceptual design case (see Figure 6).
E F F E C T OF UNCERTAINTIES
Effect on Peak Dose
The effects of the uncertainties in the system were evaluated in terms of how they translate into the overall dose in the system. The correlation of large uncertainties in specific parameters with large uncertainties in the safety measure (dose to the maximally exposed individual at a specified location) was evaluated through the use of stochastic simulations for the specified range of uncertainty in selected parameters and design configurations. Using linear regression, parameters such as repository-level infiltration and waste package degradation were found to produce direct impacts on the safety measure. Other parameters, such as the solubility
of key radionuclides and conceptual models such as rockfall in the repository were shown to have less impact on the overall safety measure.
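As a stand-in for the regression used in the study, the sketch below ranks parameters by the magnitude of their correlation with the peak dose; the sampled values and doses are invented for the example.

import statistics

def influence_ranking(samples, doses):
    # Pearson correlation between each sampled parameter and the peak dose,
    # a simple surrogate for regression-based sensitivity measures.
    def corr(xs, ys):
        mx, my = statistics.mean(xs), statistics.mean(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
        return num / den
    return sorted(((name, abs(corr(vals, doses))) for name, vals in samples.items()),
                  key=lambda t: t[1], reverse=True)

samples = {"infiltration": [1.2, 3.4, 0.8, 2.1],   # illustrative samples
           "solubility":   [2.0, 2.1, 1.9, 2.0]}
doses = [1e-5, 3e-5, 7e-6, 2e-5]                   # illustrative peak doses
print(influence_ranking(samples, doses))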
[Figure 6 plot: CCDFs of 1,000,000-year total peak dose to the accessible environment (rem/yr) evaluated at 5 km and at 30 km from the source.]

Figure 6. Effect of distance to dose measurement on system safety
Evaluation of Uncertainty

The uncertainty in the safety of the system was evaluated by considering the range of values on CCDFs of the peak dose for alternative conceptual models and designs from the 100 realizations. Another approach would be to combine the uncertainty for the alternative conceptual models to determine the overall uncertainty in the system. The difficulty in this approach is defining the initial probability for each of the alternative conceptual models. The evaluation then comes down to a reliance on expert judgement and iterative analyses with increasingly better models and more sophisticated data to represent the processes present at the site. Sensitivity analyses provide direction as to which parameters and conceptual models are important, and steps can be taken to reduce the uncertainty by collecting additional data or conducting additional analyses. The uncertainty can also be evaluated with the perspective that the current regulatory requirements address demonstrating reasonable assurance of performance, not absolute proof.
Issues/Limitations on this Type of Analysis

The evaluation of a complex system like a potential nuclear waste repository is a daunting task due to the long time periods of evaluation and the uncertainty involved. Unlike other complex projects such as nuclear reactors or complex chemical plants, a repository cannot be constructed just to see what happens after several thousand years. Small-scale testing and prototyping can be conducted, but scale-up to a full-size repository is untenable due to the long time periods of observation required. The long-term nature of many of the processes (e.g., corrosion) produces a difficult analysis problem. How, for example, do we extrapolate the corrosion rates out to time periods of 10,000 or 1,000,000 years from the few years of data we currently have? Standard extrapolation techniques can be used, though expert judgement is still required in the development of the appropriate extrapolation. The approach to these types of problems continues to evolve. However, in spite of the continuing development of sophisticated models for use as predictive tools, experience-based judgement still must be utilized to produce reasonable assurance of the safety of the system. Another complicated aspect of the analyses is the possibility of significant effects from processes which are not incorporated in our model. That is, perhaps the system as built will not be just the cumulative effect of
the processes which we have incorporated, but may incur a combination of compound effects and important but unexpected/unpredicted events that lead to failure of the system. Due to the many degrees of freedom in the problem, it is difficult to constrain the problem to determine unique solutions. Again, we must resort to the use of expert judgement to attempt to constrain the problem and provide reasonable assurance that the key processes affecting performance have been appropriately evaluated.
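To make the extrapolation question raised above concrete, the sketch below fits a power-law corrosion model to a few years of hypothetical depth data and extrapolates it to 10,000 years; both the functional form and the data are assumptions chosen for illustration, which is exactly where the expert judgement discussed above enters.

import math

def fit_power_law(times_yr, depths_mm):
    # Least-squares fit of log(depth) = log(a) + b*log(t) to short-term
    # corrosion data, then used far beyond the observations.
    n = len(times_yr)
    lx = [math.log(t) for t in times_yr]
    ly = [math.log(d) for d in depths_mm]
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / sum((x - mx) ** 2 for x in lx)
    a = math.exp(my - b * mx)
    return lambda t: a * t ** b

# Hypothetical depth measurements over the first ten years:
depth = fit_power_law([1, 2, 5, 10], [0.02, 0.03, 0.05, 0.07])
print("extrapolated corrosion depth at 10,000 years (mm):", depth(10_000))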
CONCLUSIONS

The safety of a complex repository system was assessed for extended periods of time. The results of the study were highly dependent on the assumed design of the system (e.g., backfill, waste package spacing) and the assumed active processes in the system (e.g., cathodic protection). Use of alternative conceptual models, such as the cathodic protection model, produced major improvement in safety. This more realistic waste package degradation model produced improvement of over an order of magnitude in safety. Alternative measurement locations for the safety assessment produced substantial increases in safety, though the results are based on uncertain dilution factors in the transporting groundwater. The evaluation of complex systems with significant uncertainty over long time periods requires creative methods of analysis which draw from probabilistic analysis techniques, expert judgement, and consistent iterative sensitivity analyses. Due to the long time periods and multiple processes involved in radioactive waste repository performance, the assessment of the safety of the system requires the ability to analyze the uncertainty in the system and gradually attempt to reduce the overall uncertainty. As the understanding of the processes of the system evolves, updates to the analyses which incorporate new information must be conducted. In the evaluated repository system, an example of such information is the percolation flux through the repository, which is now considered to be higher than the values used for these analyses. Updates in information about such key parameters can significantly alter the conclusions of such analyses, but do not alter the approach to analysis of such systems.
REFERENCES
CRWMS M&O (1995). Total System Performance Assessment-1995: An Evaluation of the Potential Yucca Mountain Repository. B00000000-01717-2200-00136, Rev. 01. Prepared for U. S. Department of Energy, Las Vegas, NV. CRWMS M&O (1996a). Engineered Barrier System Performance Requirements Systems Study Report. BB0000000-01717-5705-00001, Rev 02. Prepared for U. S. Department of Energy, Las Vegas, NV. CRWMS M&O (1996b). Controlled Design Assumptions Document. B00000000-01717-4600-0032, Rev 03. Prepared for U. S. Department of Energy, Las Vegas, NV. CRWMS M&O (1996c). Test Model Abstractions for Total System Performance. B00000000-01717-220000173. Prepared for U. S. Department of Energy, Las Vegas, NV. Golder Associates, Inc. (1994). RIP Performance Assessment and Strategy Evaluation Model: Theory Manual and User's Guide, Version 3.20. Golder Associates, Redmond, WA.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the support of the U.S. DOE for the study reported herein, and the work of several others which was used in the analyses conducted for this paper. Dr. Joon Lee and Dr. Thomas Buscheck contributed background process-model results used in the probabilistic safety assessment reported herein. Note: the opinions expressed herein are those of the authors only.
ASSESSING PERFORMANCE OF IMPRECISELY CHARACTERIZED SYSTEMS: A MATHEMATICAL PERSPECTIVE

Martin S. Tierney and Rob P. Rechard

WIPP Performance Assessment Department, Sandia National Laboratories, P.O. Box 5800, Albuquerque, NM 87185
ABSTRACT

The current popular methodology for assessing performance of imprecisely characterized systems, i.e., model building followed by sensitivity and uncertainty analyses using Monte Carlo methods, represents a practical solution to the problem of mapping surfaces in a high-dimensional parameter space that are defined by the systems' performance variables; programs of pointwise mapping of performance variables are abandoned for an approach that characterizes surface topography in terms of statistical properties such as mean surface elevation, or percentiles of surface elevation. Statistical properties of the surfaces can be represented by expectation integrals over the multi-dimensional domain of uncertain model parameters; the expectation operation is taken with respect to distributions defined for the epistemically uncertain model parameters. Monte Carlo methods are used to numerically evaluate the expectation integrals; the drawing of random numbers introduces aleatory uncertainty into the performance assessment and opens up the use of the classical techniques of parameter estimation and hypothesis testing to determine whether the model system meets performance criteria.

KEYWORDS
Performance assessment, uncertainty analysis, response surfaces, epistemic uncertainty, aleatory uncertainty, Monte Carlo methods.

INTRODUCTION

Though greatly differing in purpose and level of technical detail, modern probabilistic system studies, such as probabilistic risk analyses (PRAs), probabilistic safety analyses (PSAs) and performance assessments (PAs), usually follow the same pathway of investigations: a mathematical model of the system is constructed and parameterized, important model parameters are identified (sensitivity analyses), and uncertainties in model parameters are propagated through the model's equations to obtain measures of uncertainty in practically important dependent variables of the model (uncertainty analyses). In addition to these steps, the investigations of sensitivity and uncertainty usually employ Monte Carlo methods, even in cases where those methods are unnecessary and simple analytical or numerical solutions are available. The methodology just described is so popular that it forms the framework of several commercial computer programs for risk and decision analysis (e.g., Decisioneering, 1993). We therefore believe we are justified in calling it the "standard methodology." An application of the standard methodology to the performance
assessment of a geologic disposal system for transuranic wastes is described elsewhere in these Proceedings (Anderson et al., 1997; Helton et al., 1997). The purpose of the present, brief paper is to provide a perspective on the natural mathematical structures behind the standard methodology and on the ways those structures can limit what can be concluded from probabilistic systems studies.

The mathematical structures at the heart of any simulation of systems involving many independent variables are surfaces in a space of many dimensions (sometimes called the system's response surfaces). We suspect that the standard methodology evolved as a natural attempt to circumvent the computationally difficult problem of providing contour maps for response surfaces. But, as we hope to show by means of some simple examples, applications of the standard methodology are usually successful only when the systems models' response surfaces are smooth (i.e., nearly planar), or when it is feasible to exercise the system's computational model 10^4 to 10^6 times. In view of the trend in PA applications towards the use of complex and nonlinear mathematics in systems models (e.g., the solving of systems of nonlinear partial differential equations: see examples in Anderson et al., 1997) and the attending possibility of computationally intensive system models whose response surfaces exhibit "rugged terrain," we advise caution in the unthinking use of the standard methodology for problems that involve comparisons between model outcomes and standards (determination of compliance) or intercomparisons between model outcomes that correspond to different system configurations (systems prioritization problems).
MEASURING SYSTEM PERFORMANCE WITH MODELS

The performance of a proposed technological system or social policy can be efficiently tested by simulations with a mathematical model of the system. A mathematical model of the total system, often a concatenation of many subsystem models, can be viewed as defining a vector function y = f(x) that maps a point x = (x_1, x_2, x_3, ..., x_N) in the space of all model parameters, S_N, into a single point y = (y_1, y_2, y_3, ..., y_M) in what will here be called the space of system performance variables, C_M. Dimensions of S_N can be large (100-1000) for complicated systems modeled at a high level of detail. In the applications we have seen, dimensions of C_M are usually smaller (1-10), and depend upon the number of distinct performance measures against which the system will be tested. The function f(·) is defined by the mathematical relationships embedded in the model of the total system and can range from a smooth, nearly linear function of all model parameters to a highly nonlinear relationship between model parameters and performance variables. We will here assume that f(·) is not an intrinsically stochastic mapping (Glimm, 1991). If the forms of the equations that determine model system behavior are unambiguous (no alternate conceptual models) and a precise value can be assigned to each model parameter, then a performance assessment can simply be a single running of the computer code implementing the model to obtain single values of each of the performance variables y_1, y_2, y_3, ..., y_M. These point estimates of performance can then be compared with relevant standards or criteria and the results of the comparison communicated to decision makers. In our experience, matters are seldom this simple. There is almost always imprecision in some model parameters (parametric uncertainty) and ambiguity in some of the equations used to represent system behavior (model uncertainty). We will here assume that model uncertainty can be treated parametrically by attributing weights to the mathematical expressions describing each alternative conceptual model. Some experts advise against this assumption: see Morgan et al. (1990, p. 67).
MODEL PERFORMANCE SURFACES

When there is uncertainty in the values to be assigned to some model parameters, the system's configuration, and therefore the system's performance, are no longer reliably represented by single points in S_N and C_M respectively; instead one must study the collection of system behaviors that arise from the collection of parameter vectors contained in the subspace of S_N defined by the ranges of the N_u < N imprecisely specified parameters. As the vector of model parameters x is moved through this subspace (here called S_Nu), the M model performance variables, y_m = f_m(x), m = 1, 2, ..., M, will trace out M performance surfaces in (N_u + 1)-dimensional Euclidean space. These performance surfaces, sometimes called response surfaces,
contain all information about model system behavior that is consistent with the assumed systems performance variables and knowledge of only the ranges of the N_u uncertain parameters. A fundamental mathematical problem in assessments of the performance of imprecisely characterized systems is thus seen to be the problem of making a sufficiently detailed map of the system's performance surfaces, one that is fine enough to discern extremes of system behavior. Systems with N_u between 10 and several hundred have been studied (Helton, 1994) by methods that do not directly employ the concept of a performance surface. In fact, detailed mapping of performance surfaces is seldom done; detailed mapping can be computationally difficult for certain nonlinear models even if N_u is two or three, and becomes practically impossible when N_u is of the order of one hundred. Nonetheless, there is heuristic value in attempts to visualize the topography of surfaces in high-dimensional space by building intuition on examples in three dimensions. One such example is provided in Figure 1, where level curves (contours) of two hypothetical surfaces of the form y = f(x_1, x_2) are shown: surface (A) exhibits "rugged" topography and also illustrates the need for a finely spaced grid in the x_1-x_2 plane if one is to resolve the sharp peaks and valleys of the surface. Surface (B) exhibits "smooth" topography, typical of what one would obtain if y were a nearly linear function of the parameters (x_1, x_2); obviously, a coarse grid of points in the x_1-x_2 plane would be adequate to resolve such a surface.
STATISTICAL CHARACTERIZATION OF PERFORMANCE SURFACES

A standard methodology for assessing performance of imprecisely characterized systems has grown out of the need to circumvent the problem of mapping performance surfaces in high-dimensional space. The central feature of this methodology is the adoption of a statistical or populational point of view towards the characterization of performance surfaces; programs of pointwise mapping of high-dimensional surfaces are abandoned for an approach that views surface topography by means of its statistical characteristics, such as mean surface elevation or percentiles of surface elevation. This statistical approach involves two steps: (1) creation of probability models that quantify the epistemic uncertainty in each of the N_u uncertain parameters, and (2) calculation of expected properties of performance surface elevations, the expectation integral (see below) being taken with respect to the distribution defined in step (1). In implementing step (1), uncertain model parameters are implicitly treated as random variables X_1, X_2, X_3, ..., X_Nu with joint probability distribution

F(x) = Pr{X_1 ≤ x_1, X_2 ≤ x_2, X_3 ≤ x_3, ..., X_Nu ≤ x_Nu}.

Little can be said here about the construction of F(x); the process usually involves judicious use of all information concerning the ranges and central values of real-life counterparts to the uncertain model parameters (Tierney, 1990). Step (2) is accomplished in principle by forming multidimensional integrals that express the expectation of performance-surface properties of interest. For example, the expectation (or average) of the heights of a surface defined by a single performance variable y = f(x) is given by the integral
(1)
SNu Other statistical measures of performance surfaces, such as the joint probability distribution of the performance variables Yl, y2, Y3..... YM, may in principle be obtained by forming multidimensional integrals similar to Eqn. 1 (see Whittle, 1992, or pg. 3-9 of Tierney, 1991).
572
A. Rough Topography
B. Smooth Topography
M.S. Tierney and R.R Rechard
Figure 1: Contours of Performance Surfaces in Three Dimensions indicates possible quadrature points for numerical integration.
+
TRl-6342-4999-0
Advances in Safety and Reliability: ESREL '97
573
MONTE CARLO INTEGRATION
The statistical approach to surface characterization used in the standard methodology does not really solve the technical problems associated with rugged surfaces in high-dimensional space, since the computational difficulties of exploring the topography of these surfaces are merely exchanged for other computational problems: namely, the problems of accurate numerical integration of multidimensional integrals such as Eqn. 1. A variety of methods has been proposed for the evaluation of multidimensional integrals (for a recent review, see Spanier and Maize, 1994). Most of these methods involve the use of pseudo-random numbers and, of this latter class, Crude Monte Carlo (Ross, 1985) or its variant, Latin Hypercube Sampling (Iman and Shortencarier, 1984), seem the most popular. Using Crude Monte Carlo, the integral of Eqn. 1 would be estimated by the sum

$$I = \frac{1}{K}\sum_{k=1}^{K} f(\mathbf{X}_k) \tag{2}$$

where X₁, X₂, X₃, ..., X_K are independent, identically distributed samples of the vector of uncertain parameters, x ∈ S_Nu, that are "drawn" from the joint distribution F(·) mentioned earlier. It is well known that the sum (2) converges (in a certain sense) to the value of the integral in Eqn. 1 as the number of samples, K, approaches infinity. In practice, though, it may not be practical to make K large (e.g., a single evaluation of the function f(·) may be time consuming) and, because of the finite number of terms, the estimates of I in Eqn. 2 may fluctuate in value as the sum is recomputed using new sets of "independent" pseudo-random numbers. In other words, successive estimates of I, say I₁, I₂, I₃, ..., I_L, are themselves samples of an independent, identically distributed random variable whose (unknown) mean value is the value of the integral in Eqn. 1. Intuitively, for fixed sample size K, the variance of these estimates will be larger if the performance surface is rugged, as in Figure 1A, and smaller if the performance surface is smooth, as in Figure 1B. Similarly, for a fixed performance surface, the fluctuations in the estimates of I in Eqn. 2 can be large if K is a small number. To see why this is so, imagine evaluating Eqn. 2 at the four quadrature points shown on Figure 1A to obtain an estimate I₁. Now move those four quadrature points to other locations in the x₁-x₂ plane, and repeat the evaluation of the sum in Eqn. 2 to obtain a second estimate I₂. Because of the rugged surface in Figure 1A, it is clear that estimates I₁ and I₂ could differ greatly. The difference between the two estimates would generally be much less if the same experiment had been performed on the smooth surface in Figure 1B. Differences in the two estimates would be negligible (from a practical point of view) if, instead of four points, we had placed about 10⁴ quadrature points in the x₁-x₂ planes of Figure 1. Use of the Crude Monte Carlo method (or the slightly more efficient LHS method) to approximate expectation integrals (Eqn. 1) has both benefits and costs. The benefits arise from the conscious exploitation of the aleatory (or "luck-of-the-draw") type of uncertainty implicit in random number generators, which in turn allows proper use of the tools of classical statistics, such as point and interval estimation (Blom, 1989), to infer properties of performance surface topography: by being able to obtain independent, identically distributed estimates of the value of an expectation integral, we are allowed to place, with confidence in our method, quantitative confidence intervals on performance surface properties that are being compared with standards for the performance of the system. The disadvantages of Monte Carlo integration become apparent when time limitations or computational expense preclude taking a sufficiently large number of samples in the sums typified by Eqn. 2. In that case, confidence intervals on critical performance measures may be so wide that no meaningful comparison with system performance criteria or standards can be made. In particular, the precision of values of performance measures that is necessary in systems prioritization problems is usually not achievable when Monte Carlo integration is used with the standard methodology.
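The fluctuation of the replicate estimates I₁, ..., I_L can be demonstrated with a hedged sketch. The two integrands below are hypothetical stand-ins for the smooth and rugged surfaces of Figure 1, not functions from any actual assessment.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_smooth(x):
    # nearly linear in (x1, x2): the "smooth topography" of Figure 1B
    return 1.0 + 0.5 * x[:, 0] + 0.3 * x[:, 1]

def f_rugged(x):
    # a sharp peak superposed on a plane: the "rugged topography" of Figure 1A
    return (1.0 + 0.5 * x[:, 0] + 0.3 * x[:, 1]
            + 5.0 * np.exp(-500.0 * np.sum((x - 0.5) ** 2, axis=1)))

def replicate_estimates(f, K, L):
    # L independent replicates I_1, ..., I_L of the K-sample estimator (Eqn. 2)
    return np.array([f(rng.random((K, 2))).mean() for _ in range(L)])

for K in (4, 10_000):
    for name, f in (("smooth", f_smooth), ("rugged", f_rugged)):
        I = replicate_estimates(f, K, L=100)
        print(f"{name:6s} K={K:6d}  mean={I.mean():.4f}  spread={I.std(ddof=1):.4f}")
```

With K = 4 the rugged integrand produces a far larger spread between replicates than the smooth one; increasing K shrinks the spread roughly as 1/√K, which is the quantitative version of the quadrature-point argument above.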
Systems prioritization problems require the ability to make comparisons between predicted model performance measures that correspond to different system configurations, and wide confidence intervals on the predicted performance measures make meaningful comparisons difficult if not impossible. In one recent
systems prioritization study (Prindle et al., 1996), it became clear that, because of some computationally intensive nonlinear subsystem models, use of the full standard methodology, even with small-sample Monte Carlo integration, would entail an impractically large number of calculations to discriminate between the various action alternatives for experimentally characterizing the system. The idea of incorporating uncertainty into the comparison of action alternatives, and at the same time reliably discriminating between the performance of those alternatives, had to be abandoned in favor of calculations involving only discrimination between point estimates of the consequences of the action alternatives; the point estimates used as independent variables are the mean or median values of the parameters that characterize the particular action alternative. For these reasons, we suggest that caution be used when applying Monte Carlo methods in regulatory compliance calculations or in systems prioritization problems.

REFERENCES
Anderson, D.R. et al. (1997). Conceptual and Computational Structure of the 1996 Performance Assessment for the Waste Isolation Pilot Plant. Proceedings of the International Conference on Safety and Reliability, Lisbon, 17-20 June 1997, to appear.
Blom, G. (1989). Probability and Statistics: Theory and Applications. Springer-Verlag, New York, NY.
Decisioneering (1993). Crystal Ball Version 3.0: Forecasting and Risk Analysis for Spreadsheet Users, User Manual. Decisioneering, Denver, CO.
Glimm, J. (1991). Nonlinear and Stochastic Phenomena: The Grand Challenge for Partial Differential Equations. SIAM Review 33:4, 626-643.
Helton, J.C. (1994). Treatment of Uncertainty in Performance Assessments for Complex Systems. Risk Analysis 14:4, 483-511.
Helton, J.C. et al. (1997). Uncertainty and Sensitivity Analysis in the 1996 Performance Assessment for the Waste Isolation Pilot Plant. Proceedings of the International Conference on Safety and Reliability, Lisbon, 17-20 June 1997, to appear.
Iman, R.L. and Shortencarier, M.J. (1984). A FORTRAN 77 Program and User's Guide for the Generation of Latin Hypercube and Random Samples for Use with Computer Models. SAND83-2365, Sandia National Laboratories, Albuquerque, NM.
Morgan, M.G., Henrion, M. and Small, M. (1990). Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, Cambridge, U.K.
Prindle, N.H., Mendenhall, F.T., Boak, D.M., Beyeler, W., Rudeen, D., Lincoln, R.C., Trauth, K., Anderson, D.R., Marietta, M.G. and Helton, J.C. (1996). The Second Iteration of the Systems Prioritization Method: A Systems Prioritization and Decision-Aiding Tool for the Waste Isolation Pilot Plant, Volume 1: Synopsis of Method and Results. SAND95-2017/1, Sandia National Laboratories, Albuquerque, NM.
Ross, S.M. (1985). Introduction to Probability Models, Third Edition. Academic Press, Orlando, FL.
Spanier, J. and Maize, E.H. (1994). Quasi-Random Methods for Estimating Integrals Using Relatively Small Samples. SIAM Review 36:1, 18-44.
Tierney, M.S. (1990). Constructing Probability Distributions of Uncertain Variables in Models of the Performance of the Waste Isolation Pilot Plant: The 1990 Performance Simulations. SAND90-2510, Sandia National Laboratories, Albuquerque, NM.
Tierney, M.S. (1991). Combining Scenarios in a Calculation of the Overall Probability Distribution of Cumulative Releases of Radioactivity from the Waste Isolation Pilot Plant, Southeastern New Mexico. SAND90-0838, Sandia National Laboratories, Albuquerque, NM.
Whittle, P. (1992). Probability via Expectation, Third Edition. Springer-Verlag, New York, NY.
ACKNOWLEDGMENTS
This work was supported by the United States Department of Energy under Contract DE-AC04-94AL85000. Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin company, for the United States Department of Energy.
ASSESSING AND PRESENTING RISKS FROM DEEP DISPOSAL OF HIGH-LEVEL RADIOACTIVE WASTE
McKinley, I.G., McCombie, C. and Zuidema, P.
Nagra (National Cooperative for the Disposal of Radioactive Waste), Switzerland
ABSTRACT In principle, repositories for high-level radioactive waste (HLW) are probably amongst the safest engineered facilities ever planned. This safety is provided by the combination of very deep disposal (500 - 1000 m below surface) in a stable geological formation and use of a series of engineered barriers. Health hazards from a sealed repository result from either very slow processes which result in low releases of radioactivity to the biosphere in the far distant future (after hundreds of thousands or even millions of years) or very low probability, catastrophic events. Key questions are, however, how can the safety of such facilities be quantitatively assessed with sufficient reliability and how can the results of such assessments be effectively communicated to the wide range of interested parties? This paper discusses how these questions might be addressed, focusing on the example of a concept for deep disposal of HLW in Northern Switzerland.
KEYWORDS Low-level risk, long timescales, radioactive waste, performance assessment, risk presentation
INTRODUCTION
Disposal of the high-level radioactive waste (HLW) resulting from nuclear power production is often cited as one of the great, unsolved problems of our technological age. The fact that this opinion is widely held means that scientists in the disposal area face particular challenges. One is to work against the subjective, irrational fear of radioactive waste by the general public (as compared, for example, to chemical wastes of comparable or higher toxicity and longevity). The other main challenge is posed by legislation requiring demonstration of very high levels of safety for tens of thousands of years, or even "for all time"; this is clearly novel. The first question to be asked is whether the legislative requirements are reasonable or not: is it technically feasible to demonstrate acceptable performance of an engineered facility for a million years? Even if this first question can be answered positively at a purely technical level, is it possible to communicate the safety case in a convincing manner to wider audiences: regulatory authorities, politicians, the scientific community and the general public? In this paper we illustrate the arguments used to answer the former question, taking the concept for HLW disposal in Northern Switzerland as a typical example. The latter question is trickier for pure scientists, involving, as it does, ethical and sociological aspects. Nevertheless, considerable progress is being made in this area involving approaches (for example, natural analogue studies) which may be applicable to other areas where very long-term, low-level risks must be assessed and communicated to the public.
HLW: NATURE AND DISPOSAL CONCEPT
High-level radioactive waste includes both spent fuel from reactors, if such fuel is disposed of directly, and the most active solidified residues produced if it is decided to reprocess the fuel. In both cases, the waste is so radioactive that it produces significant radiogenic heat. Most disposal concepts call for interim storage of such waste for 40-50 years prior to disposal, to decrease thermal output and thus avoid high temperatures in the repository. Even after 40 years
cooling, a typical block of vitrified reprocessing waste would contain ~7 × 10¹⁵ Bq of β/γ activity, ~5 × 10¹³ Bq of α-activity and produce almost 600 W of heat (Nagra, 1994a). Although HLW contains many important radionuclides with half-lives of tens of thousands of years or more, total activity is initially dominated by much shorter lived isotopes. After 1000 years, indeed, the above β/γ activities and heat output values have decreased by 2 orders of magnitude and the α-activity has dropped by a factor of five. This waste has a high toxicity which persists for very long periods of time, but has the advantage that only very small quantities are produced; nuclear power provides ~40% of the electricity in Switzerland but would produce a total volume of only ~500 m³ of vitrified HLW over the planned 40 year lifetime of the reactors involved (Nagra, 1994b). Because volumes are so low, a concept for deep disposal can be developed which tends to be somewhat overdesigned (Fig. 1). The present Swiss reference case for HLW reprocessing waste would involve encapsulating the waste in a 25 cm thick cast steel canister which is emplaced horizontally in specially excavated tunnels which are backfilled with compacted bentonite clay. The repository would be constructed at depths of up to 1 km below surface in either crystalline basement rock or a low permeability sedimentary formation (Opalinus Clay).
ASSESSING LONG-TERM REPOSITORY PERFORMANCE
In principle, a deep HLW repository is an extremely simple system, certainly when compared to a nuclear power plant, a petrochemicals complex or a jumbo jet. Indeed, over normal engineering timescales of years or decades, this system appears almost static. It is only when timescales of millennia are considered that the significance of very slow geological, geochemical and hydrogeological processes becomes apparent. To ensure that the behaviour of the repository engineered barriers within their slowly evolving geological environment is assessed in a comprehensive manner, a formal scenario development procedure has been utilised. This procedure includes the following generic stages:

1) For the defined HLW disposal system, documenting our current understanding of the processes which determine its behaviour, and identifying the basic characteristics that are expected to ensure long-term safety.

2) Developing a catalogue of all potentially relevant features, events and processes (FEPs) based on current understanding of the system and auditing this against international experience.

3) Developing a System Concept: a description of the behaviour of the repository and its environment, incorporating scientific understanding and indicating the interactions of all relevant FEPs.

4) Developing a Safety Assessment Concept: a conceptual model of all FEPs to be taken into account in assessment calculations; comparing this with the assessment models available and identifying any important FEPs that cannot be analysed using existing models; defining the Reference Scenario and alternative scenarios for safety assessment calculations.

5) Developing a Robust Safety Assessment Concept: including only those FEPs that can be relied upon to enhance safety, plus consideration of all detrimental FEPs; defining the calculations required for a robust safety case.
As indicated in Fig. 2, various stages of the scenario development will make use of results of detailed performance calculations and safety assessment calculations. The scenario development for vitrified HLW disposal in the crystalline basement is described by Sumerling et al. (1993), while the associated performance and safety assessment modelling are documented in Nagra (1994a). Under the expected conditions, solute transport through the engineered barriers will occur by diffusion and corrosion reactions will proceed only very slowly. Even after the canister integrity is lost (after more than 1000 years), the glass will dissolve at a very low rate and most radionuclides will decay to insignificance within the bentonite backfill. The small concentrations of long-lived radionuclides which are eventually released from the engineered barriers will be further reduced by retardation and decay in the host rock and dilution in overlying aquifers, so that calculated peak doses to a critical group would occur only in the distant future (after about 300 000 years) and lie 3 orders of magnitude below the regulatory guidelines.
Figure 1: The system of safety barriers for disposal of HLW in the crystalline basement of Northern Switzerland. The barriers and their functions are: glass matrix in a steel flask (low corrosion rate of the glass, high resistance to radiation damage, homogeneous radionuclide distribution); steel canister (completely isolates the waste for >1000 years; corrosion products act as a chemical buffer); bentonite (low solute transport rates by diffusion, retardation of radionuclide transport by sorption, chemical buffer, low radionuclide solubility in porewater, colloid filter, plasticity giving self-healing following physical disturbance); and the geological barriers, comprising the repository zone (low water flux, favourable hydrochemistry, mechanical stability) and the geosphere (retardation of radionuclides by sorption and matrix diffusion, reduction of radionuclide concentration by dilution and radioactive decay, physical protection of the engineered barriers, e.g. from glacial erosion).
Figure 2: The central rôle of the scenario development procedure for assessing repository performance (FEPs = Features, Events and Processes). The flowchart connects system understanding (built on regulatory criteria, site investigations, laboratory studies, design studies and detailed modelling) with the identification of FEPs, their audit against international studies and FEP lists, and their screening into a FEP catalogue; scenario development then produces, in turn, the System Concept, the Safety Assessment Concept and the Robust Safety Assessment Concept (each screening step setting aside reserve FEPs), supported by detailed performance modelling and safety assessment modelling, and ends in a robust demonstration of safety and in requirements for the next phase.
Of particular interest is the "robust" scenario which, based on the philosophy presented by McCombie et al. (1991), attempts to bound the worst expected performance of the repository system. Even though the host rock and surrounding geological formations can greatly retard any radionuclides released from the repository, it is difficult to preclude the existence (or formation within the next few millennia) of flow paths which short-circuit this barrier. The robust case thus focuses on the engineered barriers; here only minimum credit is taken for the geology, which serves predominantly to protect these barriers and isolate them from processes occurring on the surface. The well-established low corrosion rate of the glass waste form, the low solubility of some important radionuclides and their slow diffusion through compacted bentonite, together with dilution in the eventual exfiltration zone, are sufficient to ensure that performance guidelines are met (Fig. 3).
PRESENTATION OF REPOSITORY ASSESSMENT RESULTS
Safety assessments of the Swiss HLW disposal system indicate that, under expected conditions, no significant releases of radioactivity are to be expected within one million years after disposal. After this time period, the toxicity of the repository has decreased to levels similar to that of the radioactivity contained in the overlying granite, and further quantitative analysis becomes meaningless. A critical point to communicate to the general public is that the analyses carried out do not "predict the future"; they merely evaluate the potential consequences for a wide set of possible futures (scenarios) which should bound the range of expected system behaviour, but need not directly simulate reality. The presentation of quantitative dose/time plots for such scenarios can give a false impression of the level of confidence in numerical values: even for a specified evolution of the system, the accuracy with which it can be simulated decreases with time. To draw attention to this latter point, dose/time curves have background shading (cf. Fig. 3); the clear area up to 10⁴ years indicates that we have reasonable confidence in numerical values until the next expected period of glaciation. Thereafter, results are much more illustrative and, by 10⁶ years, can be considered only as an indicator of whether calculated releases are increasing or decreasing. Shading also indicates a cut-off for completely negligible doses at 10⁻⁷ mSv a⁻¹. In fact, this cut-off is probably far too low, being ~2 orders of magnitude below the dose resulting from drinking one glass of average Swiss mineral water per year. However, a more reasonable cut-off, 10⁻⁴ mSv a⁻¹ for example, would result in the calculated doses for many scenarios not being displayed because they lie far below this limit. Two further points which emerge when presenting the safety case are demonstration of the completeness of the scenarios considered and verification/validation of the models/databases used to quantify these scenarios. In both these cases, confusion arises because (as in all applied science) rigorous mathematical proof of completeness or validation is not possible and hence somewhat weaker use of these terms must be accepted. An element of expert opinion or consensus is unavoidable (cf. McCombie & McKinley, 1993; Sheng et al., 1993). An important confidence-building measure for the issues of scenario completeness and model verification (including complete assessment model chains) involves comparison exercises between independent groups working on similar repository systems. Validation, demonstrating that the codes and databases are sufficiently realistic for the scenario to be simulated, is somewhat more difficult. A certain amount of support for the models/databases can be provided by well designed laboratory or field experiments; the very long timescales involved can, however, be approached only by studies of natural analogues (McKinley, 1993; Miller et al., 1994). Nevertheless, despite the increasing uncertainty in boundary conditions as systems of increasing age are studied, natural analogues can provide convincing support for the general applicability of some models and the "reasonableness" of the conclusions derived. For the general public, in particular, qualitative demonstration of the persistence of extremely rich uranium ore bodies for hundreds of millions of years in relevant environments can be more convincing than mathematical analyses of expected repository performance.
Even for more technical audiences, the consistency of extrapolation of laboratory data with analogue observations can be reassuring (e.g. comparison of short-term iron corrosion data from the laboratory with the corrosion observed in archaeological artefacts which have been buried for two or more millennia).
Figure 3: Time development of annual individual doses in the Robust Scenario. The calculated annual dose (logarithmic scale, 10⁻⁸ to 10² mSv per year) is plotted against time after repository closure (10² to 10⁶ years); the regulatory guideline (0.1 mSv per year) and the band of natural radiation exposures are marked.
In Switzerland, our experience has been that it is possible to communicate the technical arguments supporting the safety of our repository concept to most of our target audiences. A convincing case can be made that, under expected conditions, the health consequences from such a facility would be negligible. Concern tends to be focused on very low probability events (meteorite impact, major earthquake, etc.) which could render the entire analysis inapplicable. For repositories at great depth, most "catastrophic" events in the biosphere (floods, dam-bursts, etc.) would have insignificant effects on safety. In most cases, indeed, repositories would be little influenced by long-term processes associated with the next ice age. Major catastrophes, like the impact of a meteorite large enough to disturb a 1 km deep repository, can be argued both to have extremely low probability (~10⁻¹² a⁻¹) and to have associated direct physical effects which would render the radiological effects of repository perturbation negligible in comparison. An open area which remains is human intrusion, whether accidental or deliberate. It is plainly meaningless to attempt to predict the development of either society or technology over decades, much less centuries or millennia. All that can be done is to argue that accidental intrusion is unlikely (due to great depth and the absence of resources) and will not lead to any major environmental disaster, while deliberate intrusion is extremely costly based on present-day technology. This limitation in the analysis must be accepted and, in the Swiss case, is incorporated in the relevant performance guidelines.
CONCLUSIONS
Regulatory authorities specify quantitative performance goals for deep HLW repositories which are required to be met for geological periods of time. A formal methodology has been developed and widely applied to assess the extremely low risk from such repositories and to develop a safety case to demonstrate compliance with guidelines. Although based on quantitative analyses, the safety case also rests upon a wide range of qualitative arguments to support the reasonableness of the conclusions reached. For presentation to the general public, such qualitative arguments become more important and are a focus for public information projects. The general distrust and fear of all things nuclear makes public discussion of risks associated with radioactive waste disposal particularly challenging. It does, however, seem possible to convince key audiences that repository safety can be demonstrated. As the spotlight slowly shifts to focus attention on the very long-term consequences of other major engineering projects, it may be that some of the approaches developed in nuclear waste management can be usefully adopted elsewhere.
REFERENCES
McCombie, C., McKinley, I.G. and Zuidema, P. (1991). Sufficient validation: the value of robustness in performance assessment and system design. In: GEOVAL-1990, Symposium on Validation of Geosphere Flow and Transport Models, pp. 598-610, OECD/NEA, Paris, France.
McCombie, C. and McKinley, I. (1993). Validation: another perspective. Groundwater, 31, pp. 530-531.
McKinley, I.G. (1993). The rôle of natural analogues in nuclear waste repository performance assessment with particular emphasis on experience in Switzerland. Reliability Engineering and System Safety, 42, 233-246.
Miller, W., Alexander, R., Chapman, N., McKinley, I. and Smellie, J. (1993). Natural analogue studies in the geological disposal of radioactive wastes. Nagra Technical Report series, NTB 93-03, Nagra, Wettingen, Switzerland.
Nagra (1994a). Kristallin-I: Safety Assessment Report. Nagra Technical Report series, NTB 93-22, Nagra, Wettingen, Switzerland.
Nagra (1994b). Kristallin-I: Conclusions from the regional investigation programme for siting a HLW repository in the crystalline basement of Northern Switzerland. Nagra Technical Report series, NTB 93-09E, Nagra, Wettingen, Switzerland.
Sheng, G., Elzas, M.S., Oren, T.I. and Cronhjort, B.T. (1993). Model validation: a systemic and systematic approach. Reliability Engineering and System Safety, 42, 247-259.
Sumerling, T.J., Zuidema, P., Grogan, H.A. and van Dorp, F. (1993). Scenario development for safety demonstration for deep geological disposal in Switzerland. Proc. 4th Int. Conf. on High-Level Radioactive Waste Management, Las Vegas, USA, pp. 1085-1092, American Nuclear Society, La Grange Park, IL, USA.
B7" Safety of Nuclear Waste Disposal
PARALLEL COMPUTING IN PROBABILISTIC SAFETY ASSESSMENTS OF HIGH-LEVEL NUCLEAR WASTE
A. Pereira, M. Andersson and B. Mendes
Department of Physics, Stockholm University, Box 6730, S-113 85 Stockholm, Sweden
ABSTRACT
The influence of rock heterogeneity on contaminant migration is analysed in a Monte Carlo study of transport of Technetium in fractured media. The CRYSTAL3D model, developed to address heterogeneity in Monte Carlo transport calculations, has been used for this purpose. The CPU and memory requirements of this model made it necessary to make use of high-performance computing; therefore a Monte Carlo driver implemented on a parallel supercomputer has been used to monitor the CRYSTAL3D model. The results of a sensitivity analysis show that the geometry of the media, expressed in this model by the F-ratio number, has a very important impact on the consequence distribution. It is also concluded that it is not possible to evaluate the importance of the wet surface area by intercomparison of case simulations with this parameter as the only feature that distinguishes those cases. Therefore, to be able to complete the study of the impact of heterogeneity of fractured media on the estimation of consequences from Monte Carlo calculations of radionuclide transport, it will be necessary to introduce in the parallel driver the ability to sample the wet surface area and the Darcy velocity parameters.
KEYWORDS Radioactive waste, fractured media, heterogeneity, sensitivity analysis, parallel computing.
1. INTRODUCTION
In performance assessments of underground radioactive waste repositories it is common to consider the system to be simulated as comprising three compartments. The first one is the near-field, which includes the repository with its tunnels, shafts and silos, i.e. the engineered barriers, and also the environment adjacent to those barriers. The second compartment is the far-field, a natural barrier formed by the surrounding rock which delays the transport of most of the radionuclides leaving the near-field; finally, the third compartment is the biosphere. This paper focuses on some aspects of heterogeneity of fractured media (the far-field) in the way it is handled by models adequate for Monte Carlo simulations. To achieve our goal we have applied the CRYSTAL3D model, Andersson et al. (1997), to perform a variation study of some far-field parameters and to analyse their influence on the transport of Technetium-99. This nuclide is one of the most important
species of high-level radioactive wastes (HLW), due to its mobility in the aqueous phase and its complex chemistry. The CRYSTAL3D model simulates radionuclide transport pathways between the near-field and the biosphere. Each pathway is formed by fractures connected in series. The fractures are generated stochastically within a given volume defined by the code user. Two geometric properties of those fractures are sampled from probability density functions (pdfs) given as input to the model. Those properties are the lengths of individual fractures and their orientation in space. The fractures are characterised not only by their geometric properties but also by chemical and physical properties, for instance sorption on fracture walls, matrix diffusion, longitudinal dispersion, wet surface area, etc. These properties are also sampled from pdfs. The variation of these properties along the pathway forms the mechanism used in CRYSTAL3D to model the heterogeneity of the fractured media. CRYSTAL3D calls the CRYSTAL model as a subroutine (see Section 3).
2. THE PARALLEL MONTE CARLO DRIVER
The stochastic sampling is done by a parallel Monte Carlo driver, now under development for the European project GESAMAC. This driver enables one to perform probabilistic assessments or limited parameter variability studies by sampling parameters from pdfs. The fundamental choices made by the user when running the code concern the number of variables that are to be sampled from pdfs, the number of samples, the sampling technique to be used and the models to be coupled to the driver for use in the simulations. The basic conceptual structure of the parallel driver is shown in Figure 1. Its simplicity arises from the fact that it incorporates transport calculations that are independent from each other.
Figure 1: SPMD model of the parallel driver.

The circles represent the different machine nodes. Node In reads input data, samples the parameters from different pdfs and generates the input matrix of all parameters. This matrix is scattered to nodes N1, N2, etc. All nodes now work independently of each other with the same transport code. The final results are gathered by node Out and printed to output files in a convenient way for statistical post-processing. Thus, there is no communication between nodes N1, N2, etc., which minimises communication costs; it is because of this that we expect this single-program multiple-data (SPMD) model to be the most suitable one for our M.C. driver. The generation of random numbers at only one node proved to be a good strategy because it is sufficiently fast and avoids the introduction of spurious correlations between parameters, which is not uncommon without a sophisticated parallel random number generator. It is also important in some cases to deliberately introduce correlations between geological parameters. Here once again, it is very useful to have the whole matrix of sampled parameters in only one node, before calling the correlation subroutine to introduce the desired dependencies between the parameters.
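As a rough sketch of how such an SPMD driver can be organized, the fragment below uses mpi4py; this is our assumption for illustration only, since the GESAMAC driver is not described at code level in the paper, and transport_model is a trivial placeholder standing in for CRYSTAL3D.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def transport_model(params):
    # placeholder for the transport code; returns a fake "peak release"
    return float(np.sum(params))

if rank == 0:
    # node "In": sample all parameters at one node, then split row-wise
    rng = np.random.default_rng(1)
    matrix = rng.random((1000, 4))           # 1000 realisations, 4 parameters
    chunks = np.array_split(matrix, size)
else:
    chunks = None

my_rows = comm.scatter(chunks, root=0)       # no node-to-node communication
my_results = [transport_model(row) for row in my_rows]
results = comm.gather(my_results, root=0)    # node "Out" collects everything

if rank == 0:
    flat = [r for part in results for r in part]
    print(len(flat), "results ready for statistical post-processing")
```

The design choice mirrors the text: all sampling (and any correlation-inducing step) happens on one node before scattering, and the worker nodes never communicate with one another, so communication cost is limited to one scatter and one gather.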
3. THE TRANSPORT OF TECHNETIUM IN FRACTURED MEDIA
The migration of Tc-99 in the far-field is, in CRYSTAL (Robinson and Worgan, 1992), described by the mechanism of advection/dispersion and is modelled by the following equations:

$$\frac{\partial}{\partial t}\left(\theta R_i c_i\right) + \nabla\cdot\left(\mathbf{q}\,c_i\right) - \nabla\cdot\left(\theta D_L \nabla c_i\right) = \lambda_{i-1}\,\theta R_{i-1} c_{i-1} - \lambda_i\,\theta R_i c_i + a\,\theta_m D_m \left.\frac{\partial c_{mi}}{\partial w}\right|_{w=0} \tag{1}$$

$$\frac{\partial}{\partial t}\left(R_{mi} c_{mi}\right) = D_m\,\frac{\partial^2 c_{mi}}{\partial w^2} + \lambda_{i-1}\,R_{m,i-1} c_{m,i-1} - \lambda_i\,R_{mi} c_{mi} \tag{2}$$

with:

$$R_i = 1 + \frac{K_{di}\,\rho_m\,(1-\theta_m)}{\theta}, \qquad R_{mi} = 1 + \frac{K_{dmi}\,\rho_m\,(1-\theta_m)}{\theta_m}$$

and where:
i - index representing the i-th nuclide in a decay chain
c_i - concentration of nuclide i in water flowing in the fractures [moles/m³]
c_mi - concentration of nuclide i in matrix pore water [moles/m³]
w - maximum penetration distance, perpendicular to the fracture [m]
D_L - longitudinal dispersion [m²/s]
a - specific wet surface area (per volume of rock mass) [m⁻¹]
θ - rock mass porosity
θ_m - matrix porosity
D_m - matrix diffusivity [m²/s]
q - groundwater flow or Darcy flow [m³/m²/s]
K_di - distribution coefficient on the fracture surface of nuclide i [m³/kg]
R_i - retention coefficient on the fracture of nuclide i
K_dmi - distribution coefficient in the matrix of nuclide i [m³/kg]
R_mi - retention coefficient in the matrix of nuclide i
λ_i - decay constant of nuclide i [s⁻¹]
ρ_m - density of rock matrix [kg/m³]

Because we are focusing on the far-field we have not simulated the transport in the near-field, i.e. the leaching process from the canister and the subsequent transport of Tc-99 through the bentonite and the rock layer adjacent to it. We assume instead a hypothetical source term to describe the input to the far-field model, which is the same as the one used in Andersson et al. (1997).
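For orientation only, the sketch below integrates a drastically simplified form of Eqn. 1: a single nuclide in one dimension, constant coefficients and no matrix-diffusion coupling, solved with an explicit upwind scheme. All numerical choices are ours, and the parameter values follow one reading of Table 1, so the output is illustrative rather than a CRYSTAL3D result.

```python
import numpy as np

Lp    = 141.9                  # mean pathway length [m] (Section 5.1)
nx    = 300
dx    = Lp / nx
q     = 1.6e-3                 # Darcy flux [m/y] (assumed from Table 1)
theta = 1.0e-3                 # rock mass porosity (assumed from Table 1)
v     = q / theta              # transport velocity in the fracture [m/y]
DL    = 1.0                    # longitudinal dispersion [m^2/y] (assumed)
R     = 2.83                   # fracture retention coefficient (Table 1 mean)
lam   = np.log(2.0) / 2.13e5   # decay constant of Tc-99 [1/y]

dt = 0.5 / (v / dx + 2.0 * DL / dx**2)   # explicit stability limit
c  = np.zeros(nx)

t, t_end = 0.0, 400.0
while t < t_end:
    c[0] = 1.0                                            # constant inlet source
    adv  = -v * (c[1:-1] - c[:-2]) / dx                   # upwind advection
    disp = DL * (c[2:] - 2.0 * c[1:-1] + c[:-2]) / dx**2  # central dispersion
    c[1:-1] += dt * ((adv + disp) / R - lam * c[1:-1])    # retarded transport + decay
    c[-1] = c[-2]                                         # zero-gradient outlet
    t += dt

print(f"relative outlet concentration after {t_end:.0f} y: {c[-1]:.3f}")
```

With these numbers the retarded front (effective velocity v/R) crosses the pathway in roughly 250 years, so the outlet concentration approaches unity by 400 years; decay of the long-lived ⁹⁹Tc is negligible over this transit, consistent with the geosphere acting only as a delay barrier.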
4. CASE STUDIES
The transport of Technetium in the far-field is assumed to be under reducing conditions. Three case studies are defined by the variation of the wet surface parameter, which assumes three distinct values, one for each case simulation. Case I corresponds to the lowest value of wet surface area in Table 1 and case III to the highest value. Fracture lengths are assumed to vary between 1 and 10 meters. These lengths are distributed according to a uniform pdf. The distance between the far-field inlet and the biosphere is 100 meters. The inlet area is 1 m² and the outlet area 100 m². The half-life of ⁹⁹Tc is 2.13 × 10⁵ years. The remaining geosphere
parameters and the pdfs from which their values are sampled are also shown in Table 1. The wet surface area cannot, in the current version of the parallel driver, be sampled from a pdf, and therefore we have defined the three case studies by means of the three values of that parameter shown in Table 1. The groundwater flow rate, or its equivalent the Darcy velocity q, cannot be varied along the transport pathway for mass-conservation reasons associated with limitations of the CRYSTAL model.
5. RESULTS
It is not feasible to run the CRYSTAL3D model on a normal computer due to CPU-time, disk-storage and memory requirements. The large amount of generated output data is also a downside of this model and, although we had access to a parallel supercomputer for the simulations, we found it necessary to limit the number of samples to one thousand for each of the three cases. The real limitations were those associated with the statistical software capabilities of our local workstation at the present moment, which do not allow us to perform sensitivity analyses on data from very large simulations. For the same reason we have reduced the distance between the inlet and outlet areas from 500 meters to 100 meters, which implies a lower number of fractures for each pathway and consequently a manageable amount of data used in the analysis.
TABLE 1
PARAMETER DISTRIBUTIONS USED IN THE PROBABILISTIC CALCULATIONS

Param.   Unit     pdf        Min          Mean                        Max
D_L      [m²/y]   Logunif.   1.2 × 10⁻⁴                               1
q        [m/y]    Const.†                 1.6 × 10⁻³
w        [m]      Uniform    0.05                                     1.0
D_m      [m²/y]   Const.                  1.0 × 10⁻⁴
θ        -        Const.                  1.0 × 10⁻³
θ_m      -        Const.                  0.43
R        -        Logunif.                2.83
a        [m⁻¹]    Const.*                 10⁻², 1.6 × 10⁻², 10⁻¹

† The Darcy velocity is kept constant. * The wet surface area is equal to 10⁻², 1.6 × 10⁻², 10⁻¹ [m⁻¹] for cases I, II and III respectively.
5.1 Uncertainty Results
The total length of each pathway is distributed according to the histogram shown in Figure 2a. The lengths varied between 119.3 and 180.5 meters. The mean value for the pathway length is 141.9 meters. The distribution resembles a normal one, but a normality test has not been performed on the data. The difference between the peak releases for the three cases is insignificant; this observation is examined in the sensitivity analysis of the next section. Figure 2b therefore shows only the distribution of peak releases for case II.
5.2 Sensitivity Analysis
In safety assessments of radioactive waste, peak releases are a very useful consequence measure. We have used here peak release rates expressed in [Bq/y], which implies, from the sensitivity analysis (S.A.) point of view, that we should examine the impact of the geosphere parameters on these peak release rates.
One of the consequences of the averaging of properties along transport pathways is a characteristically smaller variance of the output distribution compared with corresponding distributions delivered by models which do not consider the effects of heterogeneity of fractured media in Monte Carlo simulations. This fact implies, in turn, that we do not need to rely on rank-based sensitivity analysis to obtain sensitivity values that are statistically significant. We have therefore used Pearson correlation coefficients to order the variable parameters or parameter groups according to their influence on peak release rates for the three cases. The above-mentioned groups of parameters that we have considered are the Peclet number and the F-ratio. These numbers are given by:

$$Pe = \frac{qL}{\theta D_L} \tag{3}$$

$$F = \frac{aL}{q} \tag{4}$$

In our case the variation of the F-ratio is equivalent to the variation of the total path length L, because the wet surface area a and the Darcy velocity q were kept constant.
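A minimal sketch of this sensitivity measure is given below. It is our illustration: the "peak release" is a synthetic anti-correlated response, not CRYSTAL3D output, and the constants are our reading of Table 1 for case II; the dispersion bounds are assumed.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
K = 1000

# Pathway length: sum of ~23 fracture lengths drawn from U(1, 10) m (Section 4)
L = rng.uniform(1.0, 10.0, size=(K, 23)).sum(axis=1)

q, theta, a = 1.6e-3, 1.0e-3, 1.6e-2       # case II constants (our Table 1 reading)
DL = 10.0 ** rng.uniform(-1.0, 1.0, K)     # assumed log-uniform dispersion [m^2/y]

Pe = q * L / (theta * DL)                  # Eqn. 3
F  = a * L / q                             # Eqn. 4

# Synthetic peak release, anti-correlated with pathway length plus noise
peak = 1.5e7 - 8.0e3 * (L - L.mean()) + rng.normal(0.0, 3.0e5, K)

for name, x in (("Pe", Pe), ("F-ratio", F)):
    r, p = pearsonr(x, peak)
    print(f"{name:8s} r = {r:+.2f}  (p = {p:.1e})")
```

Because F is proportional to L here, it inherits the full anti-correlation, while Pe, diluted by the random dispersion, shows a weaker coefficient; this is the same qualitative pattern reported in Table 3.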
Figure 2: The distribution of the pathway lengths [m] (left) and of the peak release rate [Bq/y] of ⁹⁹Tc (right).

The release rates of ⁹⁹Tc, regardless of their time of occurrence (peak releases), for the three cases studied here are presented in Table 2.
TABLE 2
CENTRAL VALUES FOR THE RELEASE RATE REGARDLESS OF TIME IN [Bq/y]

                Case I        Case II       Case III
Peak release    1.48 × 10⁷    1.46 × 10⁷    1.21 × 10⁷
stand. dev.     4.07 × 10⁵    4.43 × 10⁵    7.29 × 10⁵
min.            1.32 × 10⁷    1.28 × 10⁷    9.45 × 10⁶
max.            1.60 × 10⁷    1.57 × 10⁷    1.41 × 10⁷
confid. -95%    1.484 × 10⁷   1.453 × 10⁷   1.208 × 10⁷
confid. +95%    1.489 × 10⁷   1.459 × 10⁷   1.217 × 10⁷
Each set of pathways in our three simulation cases is formed by fractures in series, so it is not obvious how to define a one-to-one relation between the parameters and the peak releases for the purpose of performing a sensitivity analysis. We have chosen to compute, for each pathway, the median of the fracture parameters, with the exception of the fracture length, which is an additive parameter; therefore in this case the value of
L for each realisation is equal to the sum of the lengths of the individual fractures. Table 3 shows the results obtained using the Pearson correlation coefficients. From Table 3, we can rank the geosphere parameters in decreasing order of importance as: F-ratio >> R >> Pe >> w >> D_L.
The F-ratio is the most important parameter, followed by the retention coefficient, the Peclet number and the wet surface area. The longitudinal dispersion is not a relevant parameter.
TABLE 3
PEARSON CORRELATION COEFFICIENTS BETWEEN GEOSPHERE PARAMETERS AND PEAK RELEASE RATE

           Case I   Case II   Case III
D_L        +0.09    +0.09     +0.14
w          -0.33    -0.31     -0.28
R          -0.42    -0.46     -0.56
Pe         -0.34    -0.33     -0.34
F-ratio    -0.53    -0.73     -0.60

All values are significant at the 95% confidence level.
For all but the longitudinal dispersion D_L, the parameters show, at the 95% confidence level, an anti-correlation with peak release rates, i.e. lower parameter values result in increased release rates. Due to their dependence on the total fracture length L, the F-ratio and Peclet numbers are closely related in our model to the geometry of the fractured media. The rank of those parameters reflects that the averaging of properties along the transport pathways (i.e. the heterogeneity) has important consequences for the output distribution. The scatter plots of Figure 3 illustrate the correlation patterns between peak releases and parameters for case I. The scatter plots for the other cases are very similar. The ellipses are the bounds of the 95% confidence levels. The physically important F-ratio and Peclet numbers can be used in deterministic variation studies to cover the part of the parameter space that results in higher release rates, by an appropriate choice of the parameters that define those quantities. For instance, for the case of the F-ratio, a high Darcy velocity and a low wet surface parameter result, for a given transport length, in a high release rate; but doubling both the wet surface area a and the Darcy velocity q results in the same F-ratio and could be discarded as a new variation case. Considering that the difference between the three cases relies on the wet surface area a, we conclude that the span of one order of magnitude between the lowest value of a (case I) and the highest one (case III) is not reflected in significant differences between the peak releases. There are two main reasons for the apparently low importance of the wet surface area parameter a in the far-field transport of ⁹⁹Tc. The first one is that it varies between 10⁻² and 10⁻¹ m⁻¹, which, in combination with low penetration depths (between 0.05 meters and 1 meter) and the low Kd values of Technetium, represents a low buffer capacity for the rock. Therefore the crystalline rock functions in practice only as a delay barrier and not as a buffer†. The second reason has to do with the approach used here, which illustrates the need to be very careful in the interpretation of S.A. results. In fact, one must distinguish between the impact of parameters within a simulation case and between simulation cases. The wet surface area has been varied between cases, but if it had been possible to vary it within each of the three cases, it would probably have been shown to be an important parameter,
i.e. the correlation value between the peak release and the parameter a should be relatively high (according to previous experience). This argument is reinforced by the observation that the variation between cases of the Pearson correlation coefficients for the first three parameters in Table 3 is insignificant; it is therefore expected that, if we had varied the parameter a by sampling it from a distribution and had instead based the three cases on keeping, for instance, the retardation coefficient constant within each case, we would observe for that coefficient a similar behaviour to that of the parameter a. In summary, for the case of transport of Technetium and using the data set of Table 1, we cannot exclude that the wet surface area is a less relevant parameter than the other geosphere parameters.

† The reader should observe that, in the absence of experimental data, the interval for the variation of w is hypothetical and probably conservative.

Figure 3: Scatter plots of peak release rate of case II versus parameters (retention coefficient, longitudinal dispersion, penetration depth and F-ratio).
6. SUMMARY AND CONCLUSIONS
We have applied the CRYSTAL3D model to introduce heterogeneity in Monte Carlo calculations of transport of Technetium in the geosphere. A Monte Carlo driver implemented on an IBM SP2 parallel supercomputer was used to monitor the CRYSTAL3D model. Due to limitations, in the present phase of this project, in the examination of large amounts of data, we found it necessary to reduce the number of variable parameters to four and the number of fractures per pathway to 23 on average. The Darcy velocity was kept constant in all simulations. To vary the wet surface area, three case studies were constructed, using three different values for this parameter. It was observed that the impact of this parameter on the peak release rates could not be conveniently extracted by this procedure. For the data set used in this paper, the distribution of the release rates shows a relatively small variance. Probably the output distribution of peak releases would display a higher variance if the Darcy velocity had been varied between realisations, due to a spread in water transit times. The calculations indicate that, in the case of migration of Technetium, the geosphere environment acts only as a delay barrier and not as a buffer. The magnitude of the mean peak release is practically the same as the
peak of the source term. The S.A. analysis shows that (in the absence of variation of the Darcy velocity) the F-ratio is the most important parameter. It will be necessary to include the variation of the Darcy velocity and the wet surface area to be able to draw definite conclusions on the impact of heterogeneity on the estimation of the release rates by means of the CRYSTAL3D model. Another improvement needed in the near future is the development of dedicated software that can cope with statistical post-processing of large amounts of data and allow the visualisation of these results. We conclude also that the SPMD model is, by its simplicity and transparency, a good option for Monte Carlo calculations using 1D transport codes.
ACKNOWLEDGEMENTS
The authors thank the Swedish Nuclear Power Inspectorate for support of our work in the field of radioactive waste management. The European Commission is also gratefully acknowledged for financial support to develop the parallel driver as part of the GESAMAC project (contract F14W-CT95-0017), within the frame of the R&D programme "Nuclear Fission Safety" (1994-1998).
REFERENCES
Andersson, M., Mendes, B. and Pereira, A. (1997). The Impact of Heterogeneity of Fractured Media in Monte Carlo Assessments of High-Level Nuclear Waste. To be printed in the proceedings of the WM'97 conference, Tucson, Arizona, USA.
Robinson, P. and Worgan, K. (1992). CRYSTAL: A Model of a Fractured Rock Geosphere for Performance Assessment within the SKI Project-90. SKI Technical Report 91:13, Stockholm, Sweden.
EXPLORING THE POTENTIAL FOR CRITICALITY IN GEOLOGIC REPOSITORIES FOR NUCLEAR WASTE
Rob P. Rechard
Nuclear Waste Programs Center, Sandia National Laboratories, Albuquerque, NM 87185-1328
ABSTRACT
Arguments on criticality in a geologic disposal system can be grouped according to the two main aspects of risk: probability and consequence. Within the probability category, this paper suggests that the arguments be organized by physical, hydrologic, and geochemical constraints, which correspond to the main disciplines involved in exploration of the criticality issue (i.e., nuclear engineers, hydrologists, and geochemists). The arguments can then be further subdivided according to location within the disposal system (e.g., at container, near container, or in far field), with time used as a final subdivision, if necessary. Heuristic arguments for this approach are presented, followed by examples of specific arguments used in support of the potential repositories at the Waste Isolation Pilot Plant (WIPP) and the Yucca Mountain Project (YMP) in the United States, which demonstrate the difficulty of creating conditions conducive to criticality within a nuclear waste geologic disposal system.

KEYWORDS
Criticality, nuclear waste repository, highly enriched spent nuclear fuel, risk assessment, performance assessment, Waste Isolation Pilot Plant, Yucca Mountain Project, radioactive waste disposal, uncertainty analysis.

INTRODUCTION
Like any surface facility that handles fissile material, a geologic disposal system for nuclear waste must be assessed for the probability and consequences of a sustained nuclear chain reaction (i.e., criticality). In the United States, criticality has been listed as an event to be considered in scenario development since 1979 (Bingham and Barr, 1980); since 1981, the International Atomic Energy Agency (IAEA) has listed criticality for consideration in its guidance to member countries siting nuclear waste repositories. Over the years, the potential for criticality has been examined for two projects in the United States: (1) the Waste Isolation Pilot Plant (WIPP), a facility built for the disposal of waste contaminated with transuranic radioisotopes created during manufacture of nuclear weapons, and (2) the Yucca Mountain Project (YMP), a possible repository for commercial and defense high-level waste and spent nuclear fuel. Because of the diverse scientific disciplines required, studies on criticality in geologic disposal systems often describe only one aspect of the criticality issue rather than their interaction. For example, Allen (1978) reported on the size of spherically shaped critical masses of actinides while Brookins (1978) described several geochemical constraints on accumulating critical masses of actinides in geologic repositories.
Recently, Rechard et al. (Rechard, ed., 1993; 1995; Rechard et al., 1996; 1997) explored the criticality event in salt (WIPP) and tuff (YMP) repositories. For both the WIPP and the YMP, the studies have evaluated whether the criticality event (or processes leading up to criticality) should be included in general scenarios or whether it can be omitted based on low probability and/or low consequence. This paper reviews the approach used in the latter papers and shows some typical results.

PERFORMANCE ASSESSMENT
The approach described is part of the general process of assessing the performance of a geologic disposal system in the United States. The overall risk analysis process for a nuclear waste disposal system is usually called a performance assessment; it assesses whether a waste disposal system meets a set of performance criteria.
Performance Criteria
The current standard by which the probability and consequences of criticality are assessed in the United States is the U.S. Environmental Protection Agency (EPA) standard, 40 CFR Part 191, Environmental Standards for the Management and Disposal of Spent Nuclear Fuel, High-Level and Transuranic Radioactive Wastes. Although 40 CFR 191 lacks specific guidance regarding the occurrence of criticality after closure of a repository, any risks associated with a critical condition are evaluated under the general provisions of the standard. The primary provision is the Containment Requirements, which require an analysis to evaluate probabilities of cumulative release at the disposal system boundary over 10,000 yr and compare the results against the numerical criterion of 40 CFR 191. The U.S. Nuclear Regulatory Commission is responsible for implementing a future EPA standard for YMP and will likely clarify the type of criticality control or the type of calculations required to demonstrate a low level of criticality risk after repository closure (i.e., low probability or low consequence); the expectation for the latter is that low probability will be emphasized.

Criticality as an Event
To determine the probability and consequences of an unwanted occurrence, the disposal system must be characterized and mathematically modeled. In an abstract sense, this characterization determines the parameter space of the conceptual model of the disposal system. The parameters can be succinctly symbolized as D = [x₁, x₂, ..., x_nP], where nP is the total number of parameters of the model of the disposal system and the parameters are descriptors of various features, events, and processes of the disposal system (Rechard, 1995). For example, some of the parameters may define conditions important to criticality. In analyzing the risk of a critical condition developing in a repository, two approaches can be used. The first approach treats a critical condition as an event and develops scenarios that contain this event. In an abstract sense, the process of forming this scenario consists of assuming criticality and thereby focusing on a subset not including those parameters unique to criticality. The process of forming scenarios focuses attention on a particular aspect of the system being modeled. Forming scenarios can also help screen out particular features, events, and processes of the disposal system that do not influence the outcomes under study. In addition to evaluating consequences, this approach requires that a probability be assigned to the criticality event, a difficult process because the scenario is composed of natural processes. Although the probability of a natural process occurring may be high (i.e., corrosion of a container), whether the process will induce conditions that promote a critical condition is not easy to discern. Furthermore, the probability of natural processes occurring changes with time (e.g., from climate change). However, despite the difficulties, bounding calculations on probability and consequences can be useful in conveying the arguments to a wide audience.
The second approach fully simulates the evolution of the disposal system as the container degrades to monitor whether conditions exist under which a criticality could occur. This approach permits a better understanding of the phenomena necessary to promote a nuclear chain reaction in a repository and the likely initial and boundary conditions for criticality (see Rechard, ed., 1995). A drawback to this approach is that the modeling is complex and thus difficult to convey briefly.
PROBABILITY OF CRITICALITY
EPA guidance in 40 CFR 191 (Appendix C) allows the omission of categories of features, events, and processes with probabilities of occurrence of less than 10⁻⁴ in 10,000 yr. Therefore, if the probability of all scenarios containing the criticality event is less than 10⁻⁴ in 10,000 yr (i.e., P{Sj} < 10⁻⁴), then the criticality event can be omitted. Given that the probabilities of the events that make up a scenario Sj are independent, any scenario containing a criticality event has probability less than 10⁻⁴ if the probability of criticality is less than 10⁻⁴. For example, if P{Sj} = P{HI} · ... · P{C}, then P{Sj} < 10⁻⁴ if P{C} < 10⁻⁴, where P{HI} is the probability of the inadvertent human intrusion event and P{C} is the probability of a criticality event.
Factors of Probability
Assuming independence of the conditional factors, the probability of criticality is the conditional probability times the probabilities that no physical, hydrologic, or geochemical constraints apply. That is,

P{C} = P{C | p ∩ h ∩ c} · P{p} · P{h} · P{c},

where P{C | p ∩ h ∩ c} is the conditional probability of criticality given no physical, hydrologic, or geochemical constraints (equal to 1 if all necessary conditions are factored out), P{p} is the probability of no physical constraints, P{h} is the probability of no hydrologic constraints, and P{c} is the probability of no geochemical constraints to criticality. This simple abstract view of the probability of criticality permits arguments to be conveniently organized with regard to arguing the possibility or impossibility of criticality in a nuclear waste repository. This approach is convenient particularly because the contributions of the various scientific disciplines are readily apparent.
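The screening logic reduces to a product of factors compared against the regulatory threshold. The tiny sketch below uses deliberately hypothetical placeholder values, not numbers from the WIPP or YMP assessments.

```python
# Hypothetical factor values chosen only to illustrate the arithmetic
P_phys, P_hydro, P_chem = 1.0e-1, 1.0e-2, 1.0e-5   # P{p}, P{h}, P{c}
P_cond = 1.0                                        # P{C | p ∩ h ∩ c}, bounded by 1

P_C = P_cond * P_phys * P_hydro * P_chem            # probability of criticality
P_HI = 1.0e-2                                       # human-intrusion event (assumed)
P_scenario = P_HI * P_C                             # P{Sj} under independence

threshold = 1.0e-4                                  # 40 CFR 191: 1e-4 in 10,000 yr
print(f"P{{C}} = {P_C:.1e}; scenario screened out: {P_scenario < threshold}")
```

Because every factor is at most 1, it suffices to show that any single factor (here the geochemical one) is below 10⁻⁴ to screen the whole category out, which is the structure of the examples that follow.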
Example of Estimating Probabilities

To demonstrate the application of the approach, a few examples of arguments used in support of performance assessment for the WIPP and for work related to disposal of highly enriched spent nuclear fuel at YMP are presented. The examples evaluate the probability of each of the three factors that comprise P{C}. (In practice, the author also categorizes arguments by location.)
Probability of Physical Constraints

Criticality depends on neutrons and their interaction with matter; hence a critical condition depends not only on the quantity of fissile material but also on its concentration and shape and on any other material mixed with or surrounding the fissile material, whether as a solid (e.g., containment vessel), liquid (e.g., solvent), or gas, that reflects or absorbs the neutrons. The temperature of the material also has an important influence. In describing the possibilities of fissile material mixtures, the behavior of fissile material with water is most often presented as the fissile mass versus the fissile concentration. A common curve shape for this relationship is represented by a ²³⁹Pu–H₂O mixture. For criticality to occur, the mass must be greater than 0.5 kg and the amount of ²³⁹Pu mass in a unit volume of material must be greater than 7 kg/m³ (~7000 ppm) when pure water is the primary interactive (moderating) substance and the other substances are fairly transparent to neutrons (Figure 1a). For a mixture of ²³⁹PuO₂, dolomite (porosity φ = 16%), and brine, or a mixture of ²³⁹Pu and tuff, this concentration limit for criticality is ~3 kg/m³ (3000 ppm) (Figure 1a). Hence the solid concentration of plutonium or uranium (Figure 1b) must reach levels that are considered to be economically mineable ore bodies (> ~1000 ppm for uranium when located near the surface). The criticality limits for the various geologic materials presented here are based on calculations made with MCNP™ (Monte Carlo code for Neutron and Photon transport [Briesmeister, ed., 1986]).
The maximum fissile mass collected in geologic media depends on time (e.g., rates of fluid flow carrying and depositing fissile material) and so depends upon the regulatory period, unless a geometrical constraint on the maximum mass or volume exists. Hence, the minimum critical mass is not always a useful criterion in geologic media. In contrast, the limiting concentrations of plutonium and uranium can be easily compared with the various solid concentrations possible through natural phenomena such as dissolution, adsorption, and precipitation. For example, using data from an assessment of the WIPP performed in 1996, a solution of pure ²³⁹PuO₂ at a concentration of 3 kg/m³ corresponds to 12 mM, a concentration 30 times greater than the maximum solubility of Pu(IV). Similarly, a solution of pure ²³⁵UO₂ at a concentration of 10 kg/m³ corresponds to 37 mM. Consequently, a solution of either dissolved plutonium or dissolved uranium cannot go critical in the repository or elsewhere. Rather, the dissolved fissile material must be concentrated, through adsorption and precipitation, for example.
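The unit conversion behind this solubility comparison is straightforward; the sketch below is my reconstruction, with the molar masses inferred from the quoted millimolar values rather than stated in the paper.

    # Sketch of the kg/m^3 -> mM conversion used in the solubility argument.
    def kg_per_m3_to_mM(conc_kg_m3, molar_mass_g_mol):
        # mol/m^3 is numerically equal to mmol/L, i.e., millimolar
        return conc_kg_m3 * 1000.0 / molar_mass_g_mol

    print(kg_per_m3_to_mM(3.0, 239.0))   # 239Pu: ~12.6 mM, quoted as 12 mM
    print(kg_per_m3_to_mM(10.0, 267.0))  # 235UO2 (235 + 32 g/mol): ~37.5 mM, quoted as 37 mM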
Probability of Geochemical Constraints

Once the fissile mass leaves a repository, the general tendency is for the radionuclides to disperse rather than concentrate. Within the WIPP disposal system, five mechanisms conceivably exist to cause a concentration of fissile material in one location: concentrated solution, compaction of waste, adsorption on mineral surfaces (e.g., ion exchange or surface complexation), filtration of colloid material, and precipitation. However, no special features exist to make these mechanisms feasible. Below is an example showing that the probability of adsorption by itself, P{c}, is clearly less than 10⁻⁴, such that P{C} is also less than 10⁻⁴. One condition upon which the potential for criticality from adsorption of uranium depends is the density of adsorption sites on dolomite. The amount of adsorption that can occur in a volume of Culebra dolomite at the WIPP is limited. An adsorbed uranium concentration of 10 kg/m³ corresponds to 2.5 × 10²⁵ atoms per m³ of dolomite. The measured surface area on a carefully crushed and lightly acid-washed sample of dolomite from a shaft at the WIPP, as evaluated by surface area analysis, was 620 m²/kg dolomite, which corresponds to 1.5 × 10⁶ m² of surface area per m³ of dolomite (assuming a porosity of 16% and a dolomite grain density of 2820 kg/m³). Dividing the adsorbed uranium concentration by the surface area gives a site density of ~17 atoms/nm². Although this site density is of the same order of magnitude as that of synthetically prepared goethite (α-FeOOH), a very effective adsorbent, most highly adsorptive minerals have much less adsorptive capacity (about 2 sites/nm²) (Rechard et al., 1996). More importantly, the uranium at the WIPP is only ~5% enriched (initially), and so the available adsorptive sites are much more likely to be filled with ²³⁸U than ²³⁵U. To get 10 kg/m³ (Figure 1) of ²³⁵U at 5% enrichment would require 333 sites/nm². Thus, obtaining the critical concentration requires more than the entire sorptive capacity of a highly adsorptive material (by which the dolomite at the WIPP is reasonably bounded).
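The site-density arithmetic can be reproduced directly from the numbers quoted above; the sketch below is a check, not part of the original screening memorandum.

    # Sketch reproducing the ~17 atoms/nm^2 site-density estimate.
    N_A = 6.022e23          # Avogadro's number, 1/mol
    M_U = 0.235             # kg/mol, approximate molar mass of 235U
    porosity = 0.16
    grain_density = 2820.0  # kg/m^3, dolomite grains
    spec_area = 620.0       # m^2 per kg of dolomite (measured)

    # Surface area per m^3 of porous dolomite: ~1.5e6 m^2
    area_per_m3 = spec_area * grain_density * (1.0 - porosity)

    # Uranium atoms per m^3 at an adsorbed concentration of 10 kg/m^3: ~2.5e25
    atoms_per_m3 = 10.0 / M_U * N_A

    # Site density in atoms per nm^2 (1 m^2 = 1e18 nm^2): ~17
    print(atoms_per_m3 / area_per_m3 / 1e18)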
Estimate of Conditional Probability

Rechard et al. (1997) present an argument for evaluating the conditional probability P{C | p ∩ h ∩ c} · P{h}, i.e., the probability conditional on the presence of proper geochemical conditions (such as a strong oxidizing source and few other materials [e.g., other actinides] that readily absorb neutrons being coprecipitated with the fissile material) and proper physical conditions. Portions of this argument are presented below. The only known empirical basis for determining a rate of formation of a critical mass in nature is the number of uranium sites in the world that have gone critical. The ~16 reactor zones of the Oklo ore deposit and a few reactor zones in other ore deposits in the Francevillian basin in Gabon, Africa, are the only sites known to have gone critical. One item of potential use is a rough estimate of the "efficiency" of nature in collecting fissile material into a critical condition as the result of strong reductants in a localized area. At Oklo, the rate of formation of the reactors is unknown, but the upper bound is the time required to form the uranium-rich layer, ~3 × 10⁷ yr (2.00 × 10⁹ yr minus 1.97 × 10⁹ yr). The lower bound is possibly the minimum operating
[Figure 1 appears here: two log-log plots of critical mass (kg) versus fissile concentration (kg/m³). (a) ²³⁹Pu: curves for Pu/H₂O (experimental), Pu/tuff (φ = 12%), Pu/halite/brine (φ = 1.3%), PuO₂/Culebra/brine (φ = 16%), PuO₂/tuff/J-13 (φ = 12%), and Pu/Culebra/brine (φ = 16%), with the WIPP inventory of ²³⁹Pu and the concentration limit below which criticality is impossible marked. (b) ²³⁵U: curves for U/H₂O (experimental), UO₂/CO₂ (5%), ²³⁵U/Culebra/brine-saturated (φ = 8%), and UO₂/tuff/J-13, with the WIPP inventory of ²³⁵U marked. TRI-6342-4836-1]
Figure 1. Critical masses of fissile material in a spherical shape as a function of fissile concentration when mixed with various substances and reflected by the same substance without fissile material (after Rechard et al., 1996). (a) Critical mass of ²³⁹Pu at 100% wt. (b) Critical mass of ²³⁵U at 93.2% wt. unless noted otherwise.
life of the reactors, ~2 × 10⁵ yr. Furthermore, the six zones for which data exist involved ~800 metric tonnes of uranium (heavy metal) (MTHM). Thus, the maximum formation rate is ~3.75 × 10⁻⁸ events/yr/MTHM (6 events/[800 MTHM × 2 × 10⁵ yr]). At the YMP, this rate is conditional on a container of waste being under a dripping fracture and on more infiltration occurring through the mountain than is now thought to occur. In these calculations, the spacing of the potentially wet fractures was ~25 m, which is a frequent spacing of wet fractures in the E and O tunnels that are located under Rainier Mesa at Yucca Mountain (Rechard et al., 1996). With a fracture spacing of 25 m and a container length of 5 m, approximately 20% of the containers would eventually fail from wet fractures. At the YMP, the majority of fuel is expected to come from commercial power reactors. However, YMP may accept some U.S. Department of Energy (DOE) fuel and high level waste; of the latter waste, only the highly enriched uranium spent fuel (210 MTHM) is likely to exhibit a tendency to go critical. Therefore, the rate of formation, r(t), is a constant and equal to 1.6 × 10⁻⁶ events/yr (0.2 × 210 MTHM × 3.75 × 10⁻⁸ events/yr/MTHM). To be consistent with 40 CFR 191, the probability in the first 10,000 yr is determined. The probability model is based on the failure-rate function defined by r(t) = −(d/dt) ln[1 − F(t)], where t is the time elapsed since the disposal system was closed and F(t) denotes the cumulative distribution function for the first time, T, at which failure occurs (i.e., F(t) = P{T ≤ t}). This equation can be integrated to give

    F(t) = 1 − exp(−∫₀ᵗ r(x) dx)                                        (1)
In the first 10,000 yr, however, the containers must first fail and then any boron in the containers must be separated from the uranium, which requires at least 7300 yr to occur (described in Rechard et al., 1997). Integrating Eqn. 1 from 7300 to 10,000 yr yields a probability of 4 × 10⁻³ for P{C | p ∩ h ∩ c} · P{h}.
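For a constant rate, the integral in Eqn. 1 reduces to r times the elapsed interval; the sketch below is a check on the quoted result, using the values in the text.

    # Sketch of Eqn. (1) for a constant formation rate.
    import math

    r = 1.6e-6               # events/yr, constant rate of formation at YMP
    t1, t2 = 7300.0, 10000.0 # yr, boron separation complete to end of period

    F = 1.0 - math.exp(-r * (t2 - t1))
    print(f"{F:.1e}")        # ~4.3e-3, i.e., the 4 x 10^-3 quoted above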
CONSEQUENCES OF CRITICALITY

If the consequences are negligible, then a basis is established for neglecting the criticality event. For criticality to be important after closure of the repository, assuming that a criticality could occur, it would have to either (1) degrade the ability of the disposal system to contain nuclear waste by generating significant amounts of kinetic or additional heat energy or (2) produce more hazardous waste than originally present such that the dose at the accessible environment is greater. Following are examples of estimating consequences.
Beneficial Burn-Up of Plutonium at WIPP

In general, the consequences of a ²³⁹Pu criticality at the WIPP are beneficial because, after about 100 years following criticality, the fissioning produces fission products with fewer EPA units (a surrogate for health risk) than those present prior to the criticality (Table I).
Bounds on Total Energy (Fissions) at YMP

Accidents and experiments with moderated and unmoderated fast rates of assembly release similar amounts of energy (as represented by fissions), between 10¹⁵ and 10²⁰ fissions (Rechard et al., 1997). These incidents provide an empirical bound on the energy release from a criticality event because in an unsaturated repository, the criticality is assumed to occur at atmospheric pressure with a breached container. Here we assume a maximum number of fissions per event of 10²⁰ and a rate of one event per day (defended below). Consequently an increase in the inventory of radionuclides from one critical event of 10²⁰ fissions would be negligible. One critical event occurring every day for 10,000 yr would amount to ~10²⁵ fissions, and one critical event occurring every day for 1 million yr would amount to ~10²⁷ fissions. For comparison, a 70,000-MTHM YMP repository of spent fuel with burn-up of 40,000 MWd/MTHM would represent on the
TABLE I. SUMMED EPA UNITS VERSUS TIME BASED ON RELEASE LIMITS (SURROGATE OF HEALTH RISK) IN 40 CFR 191 FOR FISSION PRODUCTS OF ²³⁹Pu (RECHARD ET AL., 1997)

    Time (yr)    Summed EPA units of fission products
    0            1.000
    0.003        11.179
    10           8.8538
    100          1.1047
    110          0.8794
    1000         0.0004
    10000        0.0003
order of 10³¹ fissions. Thus, one critical event per day for 1 million yr (10²⁷ fissions per container) is only 0.01% of the fission inventory represented by a 70,000-MTHM repository.
Increased Heat at YMP
The thermal energy released from criticality at YMP would be small since the fissile material would be at atmospheric pressure (Rechard et al., 1997). Thus the maximum temperature during the criticality event would be below 373 K in order to maintain the presence of the water moderator. This maximum temperature is used to determine the rate (power) of approximately 1 criticality event/day as follows. Assuming that the power input must equal the radiative energy transfer from the surface of the cylindrical container through the air gap to the surface of the tunnel yields a steady-state power, Q,

    Q = 2πrℓ σ(Tw⁴ − Tt⁴) / [1/εw + (r/R)²(1/εt − 1)] = 13 kW ≈ 10²⁰ fissions/day        (2)

where r is the radius of the container (0.885 m), R is the radius of the tunnel (2.15 m), Tw is the absolute temperature of the container surface (373 K), Tt is the absolute far-field temperature of the tuff (303 K), ℓ is the container length (5.1 m), σ is the Stefan–Boltzmann constant (5.67 × 10⁻⁸ W/(m²·K⁴)), and εt and εw are the total emissivities of the tuff and the oxidized carbon steel container at 100°C (0.8). In reality, the container would be corroded and the fissile mass would be on the tunnel floor, but the approximation is consistent with the rough estimates. For the DOE fuel and high level waste in the YMP repository (Rechard, ed., 1995), the thermal energy in the year 2030 would be ~4000 kW. Hence, the 13 kW/container thermal energy produced is less than 1% (per container) of the thermal power normally produced from radioactive decay of DOE fuel and high level waste.
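Equation (2) can be checked numerically with the listed values; the sketch below does so (rounding accounts for the quoted 13 kW).

    # Sketch evaluating Eqn. (2) with the geometry, temperatures, and
    # emissivities listed above.
    import math

    r, R = 0.885, 2.15       # m, container and tunnel radii
    length = 5.1             # m, container length
    Tw, Tt = 373.0, 303.0    # K, container surface and far-field tuff
    eps_w = eps_t = 0.8      # total emissivities
    sigma = 5.67e-8          # W/(m^2 K^4), Stefan-Boltzmann constant

    Q = (2.0 * math.pi * r * length * sigma * (Tw**4 - Tt**4)
         / (1.0 / eps_w + (r / R)**2 * (1.0 / eps_t - 1.0)))
    print(f"Q = {Q / 1e3:.1f} kW")   # ~13.6 kW, quoted as 13 kW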
SUMMARY

Criticality in a nuclear repository can be examined either through simulation or as an event that then becomes part of various scenarios. Although the author has applied both approaches, the latter approach is discussed here. In presenting arguments about the possibility or impossibility of criticality when it is expressed as an event, the author has found it convenient to organize the arguments according to the two main aspects of risk (probability and consequences), further categorizing the probability grouping by its components based on pertinent phenomena as follows:

    Risk{Crit} = P{Crit} · C{Crit} = P{Crit | phy ∩ hydro ∩ chem} · P{phy} · P{hydro} · P{chem} · C{Crit}
These components can be further divided with regard to location within the disposal system (e.g., at the container, near the container, and in the far field). This simple approach is useful primarily because the scientific disciplines necessary to examine the criticality issue are easily distinguished. Although transferring information between disciplines remains a challenge, discussing criticality limits in terms of concentration has proved useful in information exchanges because fissile material concentration is more clearly dependent on geochemical processes than is absolute mass.
REFERENCES

Allen, E.J. (1978). Criticality Analysis of Aggregations of Actinides from Commercial Nuclear Waste in Geological Storage, ORNL/TM-6458. Oak Ridge National Laboratory, Oak Ridge, TN.

Bingham, F.W., and Barr, G.E. (1980). Development of Scenarios for the Long-Term Release of Radionuclides from the Proposed Waste Isolation Pilot Plant in Southeastern New Mexico. Scientific Basis for Nuclear Waste Management, Proceedings of the International Symposium, Boston, MA, November 27-30, 1979. Ed. C.J.M. Northrup, Jr., SAND79-0955C. Plenum Press, New York, NY. 2, 771-778.

Briesmeister, J.F., ed. (1986). MCNP: A General Monte Carlo Code for Neutron and Photon Transport. Version 3a. LA-7396-M-Rev. 2. Los Alamos National Laboratory, Los Alamos, NM.

Brookins, D.G. (1978). Geochemical Constraints on Accumulation of Actinide Critical Masses from Stored Nuclear Waste in Natural Rock Repositories. ONWI-17. Office of Nuclear Waste Isolation (ONWI), Battelle Memorial Institute, Columbus, OH.

Rechard, R.P., ed. (1993). Initial Performance Assessment of the Disposal of Spent Nuclear Fuel and High-Level Waste Stored at Idaho National Engineering Laboratory. SAND93-2330/1/2. 1-2. Sandia National Laboratories, Albuquerque, NM.

Rechard, R.P. (1995). An Introduction to the Mechanics of Performance Assessment Using Examples of Calculations Done for the Waste Isolation Pilot Plant Between 1990 and 1992. SAND93-1378. Sandia National Laboratories, Albuquerque, NM.

Rechard, R.P., ed. (1995). Performance Assessment of the Direct Disposal in Unsaturated Tuff of Spent Nuclear Fuel and High-Level Waste Owned by U.S. Department of Energy. SAND94-2563/1/2/3. Sandia National Laboratories, Albuquerque, NM.

Rechard, R.P., Stockman, C.T., Sanchez, L.S., Rath, J.S., and Liscom-Powell, J. (1996). FEP Screening Argument; RNT-1: Nuclear Criticality in Near Field and Far Field. Screening Memorandum of Record (SMOR), SWCF-A: 1.2.07.3:PA:QA:TSK:RNT-1. Sandia National Laboratories, Albuquerque, NM. (Copy on file in Sandia WIPP Central Files, WPO 40818.)

Rechard, R.P., Tierney, M.S., Sanchez, L.S., and Martell, M-A. (1997). Bounding Estimates for Critical Events when Directly Disposing Highly Enriched Spent Nuclear Fuel in Unsaturated Tuff. Risk Analysis 17:1, 32-49.
ACKNOWLEDGMENTS

This work was supported by the United States Department of Energy under Contract DE-AC04-94AL85000. Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin company, for the United States Department of Energy.
B8: Industrial Safety
DERIVATION OF FATALITY CRITERIA FOR HUMANS EXPOSED TO THERMAL RADIATION

P. J. Rew¹ and I. P. McKay²

¹ WS Atkins Safety & Reliability, Woodcote Grove, Ashley Road, Epsom, Surrey, KT18 5BW, UK
² Health & Safety Executive, St Anne's House, University Road, Bootle, Merseyside, L20 3RA, UK
ABSTRACT

A review of the literature has been undertaken in order to assess the current status of the modelling of the effects of thermal radiation on humans. The applicability of models is considered through reference to medical data, which show significant recent improvements in the treatment of burns. The paper outlines the determination of fatality criteria based on the nature of the thermal radiation source (ultraviolet or infrared), the age distribution of the exposed population, their typical level of clothing and the effectiveness of medical treatment. The results of the review suggest that the 'Dangerous Dose' criterion of 1000 (kW/m²)⁴/³s is a reasonable estimate of the thermal dose at which serious burns may be received, or a small percentage of an average population may die. The LD₅₀ (Lethal Dose) equivalent was estimated to be in the range 1460 to 3500 (kW/m²)⁴/³s. A value of no greater than 2000 (kW/m²)⁴/³s is suggested, noting the considerable uncertainty in the determination of such criteria and that there is justification for the use of lower values.
KEYWORDS
Fatality criteria, thermal radiation, safety assessment, lethal dose, dangerous dose.
INTRODUCTION

The estimation of the effects of thermal radiation on humans is a key step in the assessment of risk for installations where flammable liquids or gases are stored. Various approaches are used in the assessment of the effects of thermal radiation on humans. A simple assessment of the likelihood of fatality, or level of injury, can be made using thresholds of incident heat flux, I. More detailed analysis may be undertaken using probit models, based on the incident thermal dose, I⁴/³t, received. Early probit models were based on data from nuclear explosions (Eisenberg et al., 1975). However, these probit functions have been re-assessed (TNO, 1992) in the light of the difference in characteristics between the ultraviolet radiation from nuclear incidents and the infrared radiation from typical hydrocarbon fires. Also, there have been significant recent improvements in the medical treatment of burns, resulting in lower levels of mortality. A further consideration is the significant impact that escape from the fire site may have on the probability of fatality of
the exposed population. Thus, most risk assessment methodologies must also consider the speed of escape, the delay before escape begins and the distance travelled to reach shelter from the heat radiation. This paper outlines a methodology used to derive fatality criteria based on the nature of the thermal radiation source, the age distribution of the exposed population, their typical level of clothing and the effectiveness of medical treatment. Thus, technical justification is provided for proposed fatality criteria for an average population, in particular a 'Lethal Dose' (LD₅₀) equivalent, and uncertainties in deriving such criteria are discussed. Further discussion of the subject is given by Hockey & Rew (1996) and Rew (1997).
DEFINITION OF DOSAGE CRITERIA
Dangerous Dose

A dangerous dose is one which gives rise to all of the following effects:
a) severe distress to almost everyone;
b) a substantial proportion of the exposed population requiring medical attention;
c) some people seriously injured, requiring prolonged treatment;
d) any highly susceptible people might be killed.
In other words, the dangerous dose is that which would give rise to a small (say 1% to 5%) probability of fatality for a typical population. Note that the dangerous dose is related to the thermal radiation criterion given by Kinsman (1991) where, for an average population, a dose of 1000 (kW/m²)⁴/³s is given as the level which may cause serious burns to many people, with a small percentage dying as a consequence. The corresponding dosage for a vulnerable population, defined as one that includes people who may not respond effectively to evacuation procedures in an emergency, is 500 (kW/m²)⁴/³s. As discussed below, the approximate dosage for third degree (or full thickness) burns is 1000 (kW/m²)⁴/³s. Since full thickness burns are most significant in causing fatality (see Section 3.3), and this severity of burn requires intensive and prolonged treatment (skin grafting etc.), the Kinsman criterion for an average population seems reasonable. It should be noted that 1000 (kW/m²)⁴/³s is also the dosage given by the Eisenberg et al. (1975) Vulnerability Model for a 1% probability of fatality for a normally dressed average population. Although Eisenberg's model is known to include non-conservatism due to its use of nuclear incident data, it appears that, at least at low levels of fatality, this is offset by significant improvements in medical treatment since 1945, as discussed by Lawrence (1991).
Significant Likelihood of Death (SLOD)/LD₅₀ equivalent

This is defined as the level of exposure to a specified hazardous substance or event for which an exposed population would expect a significant probability of death. For toxic gas inhalation hazards, it is usual to assume the LD₅₀ to be representative of the SLOD dosage. The following definition of the toxic lethal dose (LD₅₀) is given by the IChemE (1985): 'the quantity of material administered orally or by skin absorption which results in the death of 50% of the group within a 14-day observation period.'
LD₅₀ EQUIVALENT BASED ON CURRENT METHODOLOGIES

Various methodologies are available which can be used to predict the thermal dose which gives a 50% probability of fatality for a normally dressed average population. These LD₅₀ equivalent values are given in Table 1 below, with the 1% probability of fatality calculated from each methodology for comparison. It can be
seen that there is a wide variation in predicted dose for both 1% and 50% probability of fatality. Without ignition of clothing, the maximum probability of fatality given by the TNO (1992) methodology is 14% for typical patterns of clothing cover. This results from the use of the Bull (1971) Mortality Chart for total burn area, while using a probit model for second degree burn area alone. However, TNO (1992) does state that ignition of clothing can be assumed to give a 100% probability of fatality, and that this occurs at between 2.5 × 10⁴ and 4.5 × 10⁴ (kW/m²)²s. The lower bound of this ignition criterion is equivalent to a dosage of 2700 (kW/m²)⁴/³s for an exposure of 30 seconds, reducing to 1800 (kW/m²)⁴/³s for an exposure of 10 seconds.

TABLE 1
COMPARISON OF CURRENT METHODOLOGIES

                                Dosage (kW/m²)⁴/³s for probability of fatality of:
    Methodology                 1%        50%
    Eisenberg et al (1975)      960       2380
    Tsao & Perry (1979)         420       1050
    TNO (1992)                  520       -
    Lees (1994)                 1655      4440¹

    ¹ based on ignition of clothing at 3600 (kW/m²)⁴/³s
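For the first row of Table 1, a commonly quoted form of the Eisenberg et al (1975) probit is Y = -14.9 + 2.56 ln(D), with the dose D in (kW/m²)⁴/³s; this form is an assumption here, not stated in the paper, but inverting it reproduces the tabulated 1% and 50% doses.

    # Hedged sketch: inverting the (assumed) Eisenberg probit for the
    # doses giving 1% and 50% probability of fatality.
    import math
    from statistics import NormalDist

    def dose_for_fatality(p):
        y = 5.0 + NormalDist().inv_cdf(p)   # probit value for probability p
        return math.exp((y + 14.9) / 2.56)  # dose in (kW/m^2)^(4/3)s

    print(round(dose_for_fatality(0.01)))   # ~960
    print(round(dose_for_fatality(0.50)))   # ~2380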
BURN AREA FOR 50% PROBABILITY OF FATALITY

The relationship between burn area and probability of fatality, for an average UK population group, is illustrated in Figure 1. The age distribution used for the UK population group is as given by the Central Statistical Office (1991), and is for the year 1991. Use of predicted population characteristics for 2001 makes a negligible difference to the analysis. Note that 'full thickness burns' are equivalent to third degree burns and cause complete destruction of the dermis.

[Figure 1 appears here: probability of fatality (%) versus burn area (%), with curves for full thickness burn with inhalation injury, full thickness burn, and total burn area.]
Figure 1 Relationship between burn area and fatality for an average population

The curve for total (second and third degree) burn area is based on the Mortality Chart produced by Lawrence (1991) and it can be seen that, in order to cause 50% fatality in an average population, a total burn
area of 50% is required. Alternatively, based on the model of Clark & Fromm (1987), between 30% and 45% body surface area of full thickness burn is required. The lower level of full thickness burn area assumes that the victims also receive some form of inhalation injury. In reality, for industrial hydrocarbon fire incidents, inhalation injury is unlikely to affect many of the victims. However, in order to be conservative, it is assumed that 50% of an average population will die for a full thickness burn area of 30% of total body surface area. It should be noted that, within the exposed population, there will be higher levels of fatality for certain vulnerable age groups. Thus, based on the calculations above, for the burn area resulting in 50% fatality in an average population, 96% of exposed people older than 84 will die, compared to 40% of those between the ages of 15 and 34.
EXPOSED SKIN AREA
Typical exposed body surface areas for a selection of population groups are summarised in Table 2, based on the Brandwonden (1979) method. For adults with typical levels of clothing, the exposed body surface area is 20%, and it is this value which is used by TNO (1992) in their Green Book methodology. For young children, or adults in shorts, the exposed skin area may be approximately 30-35%. Only in exceptional circumstances will average exposed areas be as high as 70% (say in beach resorts etc.). In this study, the typical exposed area is taken as 30%, which, as discussed above, can be assumed to give a 50% probability of fatality if fully covered by full thickness burns. It should be noted that, when calculating the burn area required to cause 50% fatality in a population, consideration was not given to the distribution of clothing levels that may be found in an average population group. For all age groups, an exposed area of 30% is considered to be pessimistic for most weather conditions in the UK. Taking into account the small fraction of an average population which may have exposed skin areas above 30% would have a negligible effect on predictions of fatality levels. Groups with higher mean exposed skin areas (those on beaches, playing fields etc.) cannot be considered to be a representative population and require special consideration.

TABLE 2
TYPICAL EXPOSED BODY SURFACE AREAS

    Population group                    Body areas exposed                    Uncovered area (% total body surface)
    Young children - typical clothing   face, neck, lower arms, hands         30
    Adults - typical clothing           as above                              20
    Adults - sportswear                 as above plus lower legs              35
    Adults - hot summer day             as above plus trunk and upper arms    70
THERMAL DOSE FOR FULL THICKNESS BURNS
Having defined the full thickness burn area required to cause a 50% probability of fatality in an average population, it is necessary to define the thermal radiation dose which will produce this severity of burn. There is considerable uncertainty in defining this dose, both due to uncertainties in the experimental data (which tended to be obtained on porcine skin with various types of thermal radiation source) and because of the variation in susceptibility to burns of various body areas. Table 3 summarises experimental and incident data relating thermal dose to level of burn injury, and it can be seen that there is little published data for third degree, or full thickness, burns. The single set of data for third degree burns under infrared radiation is for burns from flame contact on porcine skin and falls within the second degree burn range. Therefore it is not considered to be suitable for use as the threshold dose for third degree burns. Tsao & Perry (1979) suggest that an ultraviolet dose can be related to an infrared dose by dividing it by 2.23, which appears to be confirmed by the data given below. Thus, it may be possible to use
the two ultraviolet datasets for third degree burns, giving 550-800 (kW/m²)⁴/³s and 1400 (kW/m²)⁴/³s for Glasstone & Dolan (1977) and Hinshaw (1957) respectively. The Glasstone & Dolan (1977) data is not consistent with the majority of the second degree burn data. It is also based on nuclear incident data and is therefore likely to be prone to large uncertainties (exposure duration, effect of shelter and distance of victims from source). Therefore it also is not considered to be suitable for use as the threshold dose. This leaves the Hinshaw (1957) data, obtained using a carbon arc lamp (assumed to produce radiation close to the ultraviolet end of the visible spectrum) radiating onto porcine skin. The values given in Table 3 for the Hinshaw data are those given by Hymes et al (1996) and are based on a correlation relating thermal dose to depth of damaged skin. Analysis of the Hinshaw data gives a standard deviation of 12.5% with respect to the Hymes correlation for the predicted thermal dose, suggesting a range for the third degree, or full thickness, burn value (based on two standard deviations) of 1000 to 1750 (kW/m²)⁴/³s. The lower end of this range is used in this analysis, although, due to the uncertainty in the determination of this value, and in the interpretation of the data on which it is based, consideration is also given to the use of the upper limit of the second degree burn data (730 (kW/m²)⁴/³s) as the threshold dose for full thickness burns.

TABLE 3
COMPARISON OF ULTRAVIOLET AND INFRARED BURN DATA (DOSES IN (kW/m²)⁴/³s)

    Burn severity   Ultraviolet                                  Infrared
    First degree    290         Tsao & Perry (1979) d*           c.80       Mehta et al (1973) pf
                    260-440     Eisenberg et al (1975) *         130        Tsao & Perry (1979) d
                    300-440     Glasstone & Dolan (1977) *
    Second degree   670-960     Glasstone & Dolan (1977) *       240        Stoll & Green (1958)
                    810-950     Eisenberg et al (1975) *         270-410    Stoll & Green (1958) r
                    c.1000      Mixter (1954) p~                 c.350      Mehta et al (1973) pf
                    1100        Hinshaw (1957) p~                290-540    Williams et al (1973) f
                                                                 730        Arnold et al (1973) f
    Third degree    1220-1790   Glasstone & Dolan (1977) *       c.500      Mehta et al (1973) pf
                    3100        Hinshaw (1957) p~

    ~ = carbon arc lamp; f = burns from flame contact; * = nuclear incident data; p = porcine skin;
    d = derived from a range of experiments; r = white rat skin
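The ultraviolet-to-infrared conversion and the two-standard-deviation band on the Hinshaw data can be written out directly; the sketch below reproduces the figures used in the argument (the quoted 1000-1750 range appears to be a rounding of the computed band).

    # Sketch of the UV -> IR conversion (Tsao & Perry factor of 2.23) and
    # the +/- 2 sigma band (12.5% standard deviation) on the Hinshaw dose.
    UV_TO_IR = 2.23

    print(1220 / UV_TO_IR, 1790 / UV_TO_IR)  # Glasstone & Dolan: ~547-803

    ir = 3100 / UV_TO_IR                     # Hinshaw: ~1390, i.e., ~1400
    print(ir * (1 - 0.25), ir * (1 + 0.25))  # ~1040-1740, quoted as 1000-1750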
As discussed by Lees (1994), the effective thermal radiation dose received by a victim will depend on body geometry. When defining a dose criterion for exposure to thermal radiation, it is usual to give a value for the maximum cumulative dosage incident at the victim's location. In practice, this dosage will be distributed over the body surface, the particular distribution depending on the mode of escape and on body geometry. In the simplest case, it could be assumed that the dose is spread evenly between the front and back of the body, and that the exposed body areas could be treated as plane surfaces. Thus the effective dose received over the exposed skin is half of the cumulative thermal dose incident at the victim's location. In reality, the geometry of the body is not planar. If parts of the body are treated as circular cylinders then the mean dose received over the exposed skin areas would be a factor of π lower than the cumulative incident dose, assuming even distribution of radiation over the exposed areas. Similarly, for spherical body parts, the reduction factor will be 4. In practice, the exposed areas of the body will be a combination of planar, cylindrical and spherical geometry. Also, the victim's body will not necessarily be perpendicular to the direction of incidence of the thermal radiation, and so the effective dose received per skin area will be further reduced for the planar and cylindrical parts of the body. For the purpose of defining an LD₅₀ equivalent, it is conservatively assumed that the mean dose received on the exposed skin area of a victim is half that of the cumulative incident dose. Therefore, the cumulative incident dose required to produce a 50% probability of fatality, i.e. the LD₅₀ equivalent, is double that required to give full thickness burns.
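The LD₅₀-equivalent range then follows by doubling the candidate full thickness burn doses; the sketch below makes the arithmetic explicit (the factor of 2 is the planar front/back assumption; π or 4 would apply to cylindrical or spherical parts).

    # Sketch of the doubling step from skin dose to incident dose.
    GEOMETRY_FACTOR = 2.0   # planar assumption; pi (cylinder) or 4 (sphere)

    for skin_dose in (730.0, 1000.0, 1750.0):   # candidate full thickness doses
        print(skin_dose, "->", GEOMETRY_FACTOR * skin_dose)
    # 2 x 730 = 1460 and 2 x 1750 = 3500 bracket the quoted LD50 range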
Based on the range for full thickness burns discussed above, the required cumulative incident dose can be assumed to be between 1460 and 3500 (kW/m²)⁴/³s.

IGNITION OF CLOTHING

A further consideration in the prediction of fatality due to thermal radiation incidents is ignition of clothing. In general, it is assumed that the ignition of clothing will result in certain fatality, which may be pessimistic. Certainly, fatality levels for hydrocarbon fire incidents are likely to be of the order of 50% or above and, therefore, the ignition of clothing may have an effect on the LD₅₀ equivalent for certain scenarios. There are also significant uncertainties relating to the prediction of the intensity and duration of thermal radiation required to cause ignition of clothing. Criteria for ignition of clothing are compared in Figure 2, where the TNO (1992) criterion is shown to be more conservative than the Hymes et al (1996) model. The lower bound of the TNO criterion of 2.5 × 10⁴ (kW/m²)²s only becomes significant to the definition of the LD₅₀ equivalent for an exposure duration of less than 13 seconds. Note that, for a duration of 10 seconds, the equivalent thermal dose for ignition corresponding to the lower bound of the TNO (1992) criterion is approximately 1800 (kW/m²)⁴/³s. It is assumed that, once an item of clothing ignites, flame spread will result in burns over a large proportion of the body surface. Thus a dose of 1800 (kW/m²)⁴/³s needs only to be incident at one point on the body in order to result in extensive burn injury and a high probability of fatality.
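The conversion from the TNO ignition criterion, expressed as I²t, to an equivalent thermal dose I⁴/³t assumes a constant incident flux over the exposure; a sketch of that conversion follows (a check on the quoted 1800 and 2700 (kW/m²)⁴/³s values, not a published formula).

    # Sketch converting an I^2*t ignition criterion, in (kW/m^2)^2 s, to the
    # equivalent thermal dose I^(4/3)*t at a given constant-flux duration.
    def ignition_dose(i2t, duration_s):
        flux = (i2t / duration_s) ** 0.5      # constant flux, kW/m^2
        return flux ** (4.0 / 3.0) * duration_s

    print(round(ignition_dose(2.5e4, 10)))    # ~1840, quoted as ~1800
    print(round(ignition_dose(2.5e4, 30)))    # ~2660, quoted as ~2700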
[Figure 2 appears here: log-log plot of ignition criteria versus exposure duration (s), showing Hymes et al data points for clothing types (type 1 PE/cotton, piloted and unpiloted; type 4 denim, piloted; type 10 cotton, piloted; type 11 acetate/nylon, melts; type 19 fire retardant, piloted; type 20 wool, piloted) together with the TNO (1992) and Lees (1994) criteria.]
Figure 2 Comparison of Hymes et al model with TNO and Lees criteria
DERIVATION OF LD₅₀ EQUIVALENT

Based on the above arguments, for an average UK population, an LD₅₀ equivalent in the range 1460 to 3500 (kW/m²)⁴/³s could be postulated; a value of 2000 (kW/m²)⁴/³s is proposed in this paper, as it is considered that use of 730 (kW/m²)⁴/³s for full thickness burns is pessimistic and 1000 (kW/m²)⁴/³s seems a more reasonable estimate of the lower bound for this degree of burn. In deriving this value, three key conservative assumptions have been made, as follows:

1. Inhalation injury is assumed to occur for all members of the exposed population, giving a full thickness burn area of 30% (rather than 45% without inhalation injury) for a 50% probability of fatality.
2. The typical level of exposed body area is 30%, compared to the 20% used by TNO (1992).
3. The exposed skin areas can be assumed to be planar, with radiation distributed evenly between them.
This conservatism is countered by the following uncertainties:

1. No consideration has been given to the effect of thermal doses of greater than 1000 (kW/m²)⁴/³s being incident on the exposed skin areas. This will occur if the radiation dose is not evenly distributed between the back and front of the victim. The medical data reviewed in this study does not specifically consider the effect on fatality of body tissue being damaged significantly beyond the dermis (a full thickness burn is defined as one that damages the full dermis layer). However, it is assumed that the records upon which the models or mortality charts were based included victims with such injuries. As noted by Hockey & Rew (1996), serious burns can damage muscle tissue and bones and have the potential to cause fatality even for small burn areas.
2. There is considerable uncertainty in the use of 1000 (kW/m²)⁴/³s as the dose which produces full thickness burns on human skin. As discussed above, the experimental work of Hinshaw (1957) suggests that the thermal dose for full thickness burns lies in the range of 1000 to 1750 (kW/m²)⁴/³s.

The effect of the use of the above assumptions in the production of the proposed LD₅₀ equivalent, and the uncertainties relating to the prediction of the thermal dose required to cause full thickness burns, are illustrated in Figure 3. This figure also shows the range of criteria used as the threshold level for ignition of clothing, marked as the shaded area. It can be seen from the figure that the proposed LD₅₀ equivalent is similar to the lower bound for ignition of clothing. The lower bound is that given by TNO (1992), which for an exposure duration of 10 seconds is equivalent to a dose of 1800 (kW/m²)⁴/³s. This is the value which would need to be adopted to encompass the range of uncertainties in the modelling of ignition of clothing. The model of Hymes et al (1996) suggests that, for an exposure duration of between 10 and 30 seconds, some clothing types will not ignite until exposed to a thermal dose of greater than 6000 (kW/m²)⁴/³s. Ignition of clothing can be considered to produce a 50% probability of fatality (or higher). Thus removing any of the conservatism discussed above would not significantly change the value of the proposed LD₅₀ equivalent. In fact, for short exposure durations, use of the TNO (1992) ignition criteria would suggest use of a lower LD₅₀ equivalent.
[Figure 3 appears here: chart of the uncertainty ranges in the LD₅₀ equivalent against thermal dose (1000 to 4000 (kW/m²)⁴/³s), for the assumptions on full thickness burn dose (2nd degree upper limit to 3rd degree minimum and maximum), body shape (planar to spherical), unclothed area (70% to 20%) and inhalation injury (yes/no), together with the overall uncertainty; the shaded area marks the range of threshold levels for clothing ignition.]
Figure 3 Uncertainty in the definition of the LD₅₀ equivalent

CONCLUSIONS

The value for the LD₅₀ equivalent is in the range 1460 to 3500 (kW/m²)⁴/³s, based on the cumulative maximum incident thermal radiation at the location of the exposed population. A value of 2000 (kW/m²)⁴/³s is proposed. If the lower bound of the TNO (1992) criteria for ignition of clothing is used, then the LD₅₀ equivalent should be reduced to 1800 (kW/m²)⁴/³s. It should also be noted that the modelling of fatality within a hazard assessment will be highly dependent on the particular scenario modelled. For example, the
proposed LD₅₀ equivalent is not suitable for events in which fire engulfment of personnel occurs, where different heat transfer mechanisms exist (not radiation alone), exposure is shorter but more intense, and exposed body surface areas are greater. There is a large amount of uncertainty in the determination of the proposed LD₅₀ equivalent, as discussed above. However, it seems reasonable to use the proposed thermal dose as that for which there is a significant probability of death and, therefore, as a 'significant likelihood of death' (SLOD) criterion. The exact probability of fatality will vary depending on the age of the population, its response to the incident and its level of clothing. The information presented in this report on the severity of burn injury, typical levels of clothing and medical data (relating area of burn injury to fatality) can be used as a guide to predict the probability of fatality for non-typical cases, for example school playing fields or holiday resorts. In those cases, the proposed LD₅₀ equivalent can only be used as an estimate of the number of fatalities expected.
ACKNOWLEDGEMENT
The work described in this paper has been undertaken on behalf of the UK Health & Safety Executive. However, the views expressed in this paper are those of the authors and are not, except where the context indicates, necessarily those of the HSE.
REFERENCES
Arnold et al. (1973). Hazards from Burning Garments, Gillette Research Institute, NTIS: COM-73-10957.

Brandwonden (1979). Philips-Duphar Nederland B.V., Amsterdam.

Bull, J. P. (1971). Revised Analysis of Mortality due to Burns, The Lancet, 1133-34.

Central Statistical Office (1991). Annual Abstract of Statistics, HMSO.

Clark, W. & Fromm, B. S. (1987). Burn Mortality - Experience at a Regional Burn Unit, Acta Chirurgica Scandinavica Supplementum 537, Stockholm.

Eisenberg, N. A. et al. (1975). Vulnerability Model: A Simulation System for Assessing Damage Resulting From Marine Spills (VM1), ADA-015-245, US Coast Guard NTIS Report No. CG-D-137-75.

Glasstone, S. & Dolan, P. J. (1977). The Effects of Nuclear Weapons, 3rd Edition.

Hinshaw, J. R. (1957). Histologic Studies of Some Reactions of Skin to Radiant Thermal Energy, ASME Paper 57-SA-71.

Hockey, S. M. & Rew, P. J. (1996). Review of Human Response to Thermal Radiation, HSE Contractor Report WSA/RSU8000/026, HSE Books.

Hymes, I., Boydell, W. & Prescott, B. (1996). Thermal Radiation: Physiological and Pathological Effects, Major Hazards Monograph, IChemE.

IChemE (1985). Nomenclature for Hazard and Risk Assessment in the Process Industries.

Kinsman, P. (1991). Major Hazard Assessment: Survey of Current Methodologies and Information Sources, HSE Specialist Inspector Reports No. 29.

Lawrence, J. C. (1991). The Mortality of Burns, Fire Safety Journal 17.

Lees, F. P. (1994). The Assessment of Major Hazards: A Model for Fatal Injury from Burns, Trans. IChemE Part B 72, August.

Mehta, A. K., Wong, F. & Williams, G. C. (1973). Measurement of Flammability and Burn Potential of Fabrics, Summary Report to NSF Grant #GI-31881, MIT.

Mixter, G. (1954). The Empirical Relation Between Time and Intensity of Applied Thermal Energy in Production of 2+ Burns in Pigs, University of Rochester Report No. UR-316, Contract W-7041-eng-49.

Rew, P. J. (1997). LD₅₀ Equivalent for the Effect of Thermal Radiation on Humans, RSU3520/R72.027, HSE Books, UK.

Stoll, A. M. & Green, L. C. (1958). The Production of Burns by Thermal Radiation of Medium Intensity, ASME 58-A-219.

TNO (1992). A Model for the Determination of Possible Damage, CPR 16E.

Tsao, C. K. & Perry, W. W. (1979). Modifications to the Vulnerability Model: A Simulation System for Assessing Damage Resulting From Marine Spills (VM4), ADA 075 231, US Coast Guard NTIS Report No. CG-D-38-79.
AN INHERENT SAFETY OPPORTUNITY AUDIT/TECHNOLOGY OPTIONS ANALYSIS

Nicholas A. Ashford¹ and Gerard Zwetsloot²

¹ Massachusetts Institute of Technology, Cambridge, Mass., USA and Ergonomia, Ltd., Athens, Greece
² Dutch Institute for the Working Environment NIA-TNO, Amsterdam, the Netherlands
ABSTRACT

A methodology is presented for encouraging firms to undertake primary accident prevention through an inherent safety opportunity audit or technology options analysis. Experience gained from its application in firms in the Netherlands and Greece will be discussed.
KEYWORDS

accident prevention, inherent safety, occupational safety, prevention, safety, safety audit, technology options, technology assessment
INTRODUCTION
It is now generally recognized that in order to make significant advances in accident prevention, the focus of industrial firms must shift from assessing the risks of existing production and manufacturing systems to discovering technological alternatives, i.e. from the identification of problems to the identification of solutions (Ashford et al. 1993). The underlying premise of this project is that encouraging the industrial firm to perform technology options analysis (TOA) and to consider technological changes through an inherent safety opportunity audit will advance the adoption of primary prevention strategies that alter production systems so that they carry fewer inherent risks. In many cases, alternative production processes exist which completely, or almost completely, eliminate the use of highly toxic, volatile, or flammable chemicals. Normal accidents arising in these systems result in significantly less harmful chemical reactions or releases. Replacement of existing production systems by such benign chemical processes (sometimes called "green chemistry"), as well as by non-chemical approaches, is an example of primary accident prevention. Primary accident prevention approaches are similar to cleaner production/pollution prevention in that fundamental changes to the production system are contemplated. In contrast, secondary accident prevention parallels end-of-pipe pollution control, with minimum changes to the fundamental production system. Industry often approaches cleaner production/pollution prevention and accident prevention quite separately, missing the opportunity to make production changes which address both problems simultaneously (Zwetsloot 1994). Acquiring knowledge about primary prevention/inherently safer technologies is essential for industry. The presentation of this paper will report progress on a project investigating the feasibility of developing an inherent
safety opportunity audit/technology options analysis to encourage the adoption of primary prevention approaches by firms in the Netherlands and in Greece. These approaches involve both technological and managerial changes. Firms must have the willingness, the opportunity, and the capability to change. An inherent safety opportunity audit provides firms with information that enhances their capability; having then been alerted to new possibilities, firms may also become more willing to change.
A METHODOLOGY FOR UNDERTAKING AN INHERENT SAFETY OPPORTUNITY AUDIT/TECHNOLOGY OPTIONS ANALYSIS
We describe below the methodology employed in working with individual firms by NIA-TNO and Ergonomia, Ltd. in the Netherlands and Greece.
Phase One
1. Start-up and Obtaining Commitment from the Firm
• Obtain general commitment and cooperation from management
• Select possible (parts of the) plant/unit/process/division
• Obtain the specific commitment of the management of that (part of the) plant/unit/process/division
• Formulate and formalize project goals and project plan
2. Initial Design and Preparation
• Form a 5-7 member project team within the selected plant/division: firm members should be representatives of Safety (possibly including members of the health and safety committee), Technology (e.g., a design/chief engineer), Operations, Maintenance, Quality Control, and Management; one or two representatives from NIA-TNO/Ergonomia should also participate.
• Choose the project team manager (a major firm pioneer with some authority).
• The project team should construct the project plan.
• Project team to obtain commitment from all members, gather background information, and organize an informational meeting within the plant/division.
• Project team to work in all of the subsequent aspects of the project.
3. Conduct a traditional Safety Audit
This safety audit is used for identifying inputs and material flows, processes and intermediates, and final products--but with special attention paid to human-material/process/equipment interactions that could result in (a) sudden and accidental releases/spills, (b) mechanical failure-based injuries, and (c) physical injuries--cuts, abrasions, etc.--as well as ergonomic hazards. Additional sources of adverse effects/safety problem areas are records/knowledge of in-plant accidents/near misses, equipment failures, customer complaints, inadequate secondary prevention/safety procedures and equipment (including components that can be rendered non-operable upon unanticipated events), and inadequacies in suppliers of material and equipment or maintenance services (see 4 below).
4. Selection of candidate processes or operations within the firm
Select candidate processes or operations within the firm that warrant special attention. The criteria for identifying these include three categories: (a) general safety information, (b) symptoms of inherent unsafety, and (c) inefficiency of safety management, as detailed below (a simple checklist sketch follows the three lists):
general safety information
• the results of the risk assessment conducted in the firm
• findings stemming from statistical process control
• databases for reliability of components, materials, etc.
• records from test activities (e.g. of devices, components or software)
• evaluations from accident-preparedness tests
• environmental permits
• liability assessments
• occupational or environmental safety reports (for the Post-Seveso Guideline)
• life cycle assessments
• environmental impact assessment documents
• results from HAZOPs, fault-tree analyses, etc.
• findings from safety or environmental audits

symptoms of inherent unsafety
• incidents, near-misses, and spills (reported or not), including analyses thereof resulting in the identification of direct and root causes
• accidents (recorded or not), including analyses thereof resulting in the identification of direct and root causes
• records from trouble-shooting activities
• added-on technical safety measures (back-up systems, collective and personal protective measures, etc.)
• presence of obviously-hazardous situations (including hazardous materials)
• economic data about failure costs from risk-management activities
• complaints about product safety from customers
• records of non-conformance of product quality

inefficiency of safety management
• number and nature of procedures or prescriptions to control hazards (e.g., in safety directives or handbooks)
• number and nature of necessary work permits
• schedules for preventive maintenance
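As a simple illustration only (this structure is not part of the published methodology), the three criteria categories can be carried as a checklist against which candidate processes are tallied; the sketch below shows one way a project team might record them, with abridged and hypothetical item names.

    # Illustrative checklist sketch for step 4; category and item names are
    # abridged from the lists above, and the scoring scheme is hypothetical.
    SELECTION_CRITERIA = {
        "general safety information": [
            "risk assessment results", "HAZOP / fault-tree results",
            "safety or environmental audit findings",
        ],
        "symptoms of inherent unsafety": [
            "incidents, near-misses and spills", "added-on safety measures",
            "obviously-hazardous situations",
        ],
        "inefficiency of safety management": [
            "many control procedures", "many work permits",
            "heavy preventive maintenance schedules",
        ],
    }

    def score(flags):
        """Count flagged criteria per category for one candidate process."""
        return {cat: len(set(items) & flags)
                for cat, items in SELECTION_CRITERIA.items()}

    print(score({"incidents, near-misses and spills", "HAZOP / fault-tree results"}))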
5. Functional review
Review of the functional purposes of materials, equipment, processes and operations--noting obvious inefficiencies in material/water/energy use and gradual pollution, and obvious hazards due to spatial combinations of functions.
6. Specific set of search questions
Construction of a specific set of search questions to guide identification of opportunities for material substitution, equipment modification/substitution, changes in work practices and organization, modifications in plant layout, and changes in final product, making use of the following table of preventive principles relevant to inherent safety.
TABLE 1
PREVENTIVE PRINCIPLES APPLYING TO MATERIALS/SUBSTANCES, PROCESSES AND PRODUCTS RELEVANT TO INHERENT SAFETY (DERIVED FROM ZWETSLOOT 1994)

Preventive principle: preclude or eliminate inherent safety risks
• materials & substances: low human toxicity; low eco-toxicity; no or low flammability; no skin-penetrating properties; low volatility; no or low dust-forming properties
• processes: low energy intensity; simple, integrated design (minimal added-on measures); ergonomic design
• products: ergonomic design

Preventive principle: preclude or eliminate sources of safety hazards
• materials & substances: containment
• processes: ease of maintenance and disassembly; low pollution intensity; controllability (broad tolerance of maloperation or poor maintenance)
• products: foolproof for essential functions; adequate information for customers & workers; identifiability of components; separability of components
Solicit and cull
firm operatives, maintenance persons, supervisors, engineers, safety experts, workers, unions; include suggestions that had previously been made by these "local experts" and "rumours" or topics not usually addressed openly minutes of safety meetings, from cross-functional meetings, e.g., company-contractor, operations-maintenance, etc.
Advances in Safety and Reliability: ESREL '97 •
617
data about technical/process alternatives that have been evaluated but not implemented evaluations of ad hoc solutions, especially arising in the context of trouble-shooting activities
8. Construction of search process for information on options/alternatives.
Planning the process of using external sources;potentially useful are solution databases (such as compiled by Lyngby, DK. the Danish EPA and NIA-TNO), safety performance/benchmarking data, literature on process safety and reliability, literature on cleaner production/pollution prevention, academic experts/researchers-including the NIA-TNO/Ergonomia project staff, in-plant expertise including plant workers/union, suppliers, equipment manufactures, other domestic firms, foreign firms and technology, and national/international unions.
9. Identification of promising options
Identification of promising altematives/options for materials, equipment, processes, operations, work practices and organization.
10. Design of consistent set of system changes
With the involvement of both production and safety/environmental people, design intemally-consistent sets of 2-3 alternative overall system changes encompassing multiple component changes related to 9 above.
11. Feasibility study
Conduct feasibility studies utilizing rough relative economic (cost) and safety assessment for these 2-3 system changes. Also included are environmental impacts and organizational impacts and requirements.
12. Commitment of the project team
Present results of the feasibility studies to the project team.
13. Recommendations to management
Recommend system changes to the firm.
Phase Two
14.
Supporting decision making
Mobilize the decision-making processes within the plant/unit to implement the selected system, recognizing overall firm imperatives and constraints.
15.
Preparation of implementation
Work with in-plant personnel (both production and safety/environmental people, and the safety and health committee) to design general approach to changes in plant.
618
N.A. Ashford and G. Zwetsloot
Phase Three 16. Monitoring of actual design changes.
In-plant project team to monitor and evaluate the progress and success of the implemented options/system on the bases of safety, quality, technology, costs, and environmental impact.
Phase Four 17. Evaluation of overall project
Project team to evaluate the outcome of inherent safety project in the firm and formulate additional recommendations. This includes the results of plant management evaluation.
COMMENT The experience and results obtained in implementing the inherent safety opportunity audit/technology options analysis in the participating firms will be presented at the conference.
REFERENCES Ashford, N., Gobbell, J., Lachman, J., Matthiesen, M., Minzner, A., and Stone, R. (1993) The Encouragement
of Technological Change for Preventing Chemical Accidents: Moving Firms from Secondary Prevention and Mitigation to Primary Prevention, Center for Technology, Policy and Industrial Development, Massachusetts Institute of Technology, Cambridge, Massachusetts. Zwetsloot, G. (1994) Joint Management of Working Conditions, Environment and Quality: In Search of Synergy and Organizational Learning, Dutch Institute for the Working Environment (NIA-TNO), AMsterdam, The Netherlands.
LIVE WORK SAFETY AUDITS

Nuno Mendes

Serviço Prevenção e Segurança Interempresas, EDP - Electricidade de Portugal, S.A., 1070 Lisboa, Portugal
ABSTRACT

Electrical networks are subject to scheduled shut-downs. Aiming to reduce them, intervention techniques for live overhead lines (better known as live work) have been created. They thus allow the maintenance and modification of installations without interrupting the electric power supply. After an introductory approach to the most relevant points of live work techniques, the author argues that carrying out safety audits is a management method that permits the prompt detection of non-conformities; checking whether standards and procedures are correctly applied; and assessing contractors' activities, with reflection in their qualification. The different elements related to the implementation of audits are presented: the composition of the auditing team, the planning, the elaboration of checklists, the execution, as well as the most important points to be checked at the crew's quarters and at the worksite.
KEYWORDS
Electrical networks maintenance, live work, safety audits.
1. INTRODUCTION
With the liberalisation of the market and the pursuit of gains in competitiveness, the electrical power companies opted to concentrate their efforts on fulfilling the needs of their clients, who have become more demanding and conscious of their rights, resorting at an increasing rate to service providers for their construction and maintenance works. The increase in demand was matched by an increase in the number of contractors found on the market, and the competition among them has led to a higher turnover of specialised workers. On the other hand, the flexibility of the labour laws makes it relatively easy to hire new workers, who often have neither the necessary preparation (training and experience) nor a suitable safety culture with regard to the tasks they are called upon to perform. Such circumstances add to the safety difficulties, and it falls upon the contracting companies to define the procedures which will minimise situations of risk and to check that the job is executed safely and with the intended quality.
2. THE LIVE WORK TECHNIQUES' MOST RELEVANT ASPECTS
2.1. Live Work
Live Work (LW) - a generic term indicating the various working methods used to work on or near electrical installations whilst energized; in particular, all work in which a worker can enter the defined live working zone either with parts of his body or with tools, equipment or devices being handled.
In a technologically more advanced and electricity-dependent world, any flaw in the supply may cause damage or, at least, trouble to consumers: from the blocking of an elevator, to the halting of work in a factory, to the loss of files in a computer application, everyone has been involved in at least one of these situations. The electrical networks (aerial or underground) are permanently exposed to the most varied types of aggression, natural (atmospheric discharges) or human (falling trees, excavations), which cause unexpected, and at times inevitable, interruptions, despite the efforts undertaken in the introduction of better network equipment and in raising the awareness of the people who work in the vicinity. The same networks (especially the distribution ones) are also subjected to programmed shut-downs, resulting from the connection of new consumers, topological alterations due to the construction of buildings or roads, as well as preventive maintenance actions, since they are made up of a myriad of elements (conductors, poles, cross arms, terminals, transformers, protections, ...) with very diverse reliabilities. The LW techniques allow for the execution of maintenance and modification works on installations without any interruption of the power supply. Some examples of works which can be performed live are presented below:
• At low voltage (LV): replacement of poles, conductors and distribution boxes; connection/disconnection of consumers;
• At medium voltage (MV): mounting of line disconnectors; replacement of pin insulators by string insulators; connection/disconnection of feeders;
• At high (HV) and very high voltage (VHV): application of anti-vibrators; repair of conductors; replacement of any type of terminal on lines and in primary substations.
Despite the progress already achieved, the development of LW can still be followed throughout the world, through new kinds of jobs, such as the washing of insulators with pressurised water jets (or, more recently, with special abrasives), or through new means of operator support:
• On VHV aerial lines the operators intervene from a "bucket" suspended from a helicopter;
• There are about a hundred robots in Japan which are capable of, amongst other things, inserting a line disconnector into a 30 kV line. Each of these robots is made up of an insulating ground booth housing an operator who manoeuvres two, also insulating, arms which act as the operator's hands; this assembly is mounted on an articulated arm whose base is supported on the chassis of a vehicle.
2.2. The Safety of the LW
In executing live work, the lineman is perfectly aware of the elevated degree of hazard of the physical agent he is "manipulating" (electricity), but he is also sure that he is provided with perfectly suitable means and techniques which allow him to work with low levels of risk. LW thus constitutes a significant field of application of the Integrated Safety principles, since:
• Only workers expressly trained in the LW techniques are allowed to work live. The personnel authorised to perform these works were selected (special relevance is given to medical examinations) and trained in the most suitable LW methodology, in accordance with a specific work programme, thus obtaining a high intervention reliability;
• The qualification diplomas given to those who did well in the respective courses are valid for one year and must therefore be renewed. A LW worker who has been 6 months without performing live work must return to training;
• The tools and equipment were designed, manufactured and tested for the purpose (both from an insulation point of view and for their mechanical function), and are subjected to periodical controls;
• The prescribed methods and the organisation of the works were profoundly and rigorously studied;
• All factors capable of providing the LW participants with a better performance were considered through the analysis of the several working posts;
• The work organisation is extremely rigorous, as it complies with a well defined set of parameters, with each of the main "actors" (Exploitation Manager, Crew Chief and Executors) playing their role to perfection.
2.3. The LW in the EDP
LW has been a reality within the EDP Group's Distribution Companies since 1981 at LV and since 1982 at MV; as of 1988 such work has also been executed by companies which provide services to EDP.
2.4. Legal and regulatory support
The LW practice in electrical installations is authorised in the Portuguese legislation, in the Safety Regulations regarding the electrical sector (HV electrical lines; primary substations and MV/LV and dividing substations; LV electrical power distribution networks; electrical power utilisation installations). These texts only cover the general principles, with the specific regulations, elaborated by the EDP, being made up of the following documents:
• General Prescriptions which, in compliance with the Safety Regulations, establish the set of safety conditions to be observed by specialised crews in the execution of LW in Electrical Power Distribution Installations;
• Work Execution Conditions, which define the general rules to comply with in the performance of a LW. These conditions establish the work preparation, the tools' utilisation and the correct verification of the work modalities. They also include rules regarding the atmospheric conditions and the Special Exploitation Regimen (see 2.6);
• Operational Processes, which establish a minimal set of sequential operations to comply with in the execution of certain works;
• Technical Cards and Operational Methods, regarding each piece of equipment or tool, with the description of their characteristics and utilisation conditions; they also describe the conservation, maintenance, transportation and tool control conditions.
2.5. Working methods
One discerns three working methods, according to the executor's situation in regard to the live parts and according to the means he uses to protect himself from electrisation and short-circuit risks:
• Rubber Glove Working: the executor works in contact, protecting himself (by means of dielectric gloves) from the live bare parts, which in turn must also be covered with insulating materials (blankets, covers and screens). In the EDP, this is the method used at LV;
• Hot Stick Working: the executor maintains a permanent distance equal to or above a Minimum Approach Distance between his hands (as well as all parts of his body) and the bare live part on which he intervenes. He works with the help of tools mounted at the end of insulating poles. This is the method utilised, in the EDP, on MV networks;
• Bare Hand Working: the executor works with his "bare hands" (that is, without any dielectric gloves but with mechanical protection gloves) at the potential of the part he is working on. He finds himself in a situation identical to that of a bird on an electrical line. Consequently, the executor's whole Evolution Zone is at the referred potential, and a Minimum Approach Distance must be maintained between this zone and all other parts at the service voltage. This is the method chosen for interventions on HV and VHV aerial networks.
2.6. Special exploitation regimen
The execution of a LW on a MV network implies the existence of a special safety procedure which consists of placing the installation on which one is intervening under a Special Exploitation Regimen (SER). One considers a MV installation as being under SER when: any automatic reclosing is rendered impossible; the time delay of the installation's selective protections has been eliminated; all appropriate measures have been taken in regard to the device for the detection of resistant earths; and no restoration of voltage after tripping may be made without the acknowledgement of the Crew Chief. The SER does not protect the executor from the possibility of an electrical accident, but it prevents its consequences from being aggravated, apart from averting possible electrisation of other executors who come to the aid of the injured one.
3. LW PERFORMED BY SERVICE PROVIDERS
There are presently, within the EDP Group, around fifty service providers in the domains of LV, MV and HV. The emergence of a market of service providers with the capacity to work live forced the EDP to create quality mechanisms with the objective of guaranteeing desirable safety levels for personnel and installations. It is important that the high safety standards reached world-wide today, and especially in the EDP, continue to be guaranteed, by means of a rigorous control of the quality of the provided service, in which safety is a determining factor.
3.1. The qualification and selection of the service provider as a prevention factor
With the implementation of a qualification process for service providers, one intends to give weight to the technical aspects and simultaneously to safety and working conditions, foreseeing eventual risks which result from the utilisation of less qualified service providers. This recognition will create the conditions for the establishment and development of partnership relations between the contracting company and the service providers, namely where it concerns safety and quality assurance, with the following immediate advantages:
• it facilitates the definition of common objectives in regard to safety;
• it establishes permanent communication channels between the Company's safety services and the service provider's;
• it facilitates the elaboration and application of safety plans;
• it facilitates the execution of joint initiatives in the domain of information and worker training.
3.2. How safety integrates the qualification and selection process of the service providers
The utilisation of service providers is based on the principle of opting for the one(s) most apt for the execution of a certain job, considering not only its specifics but also its inherent risks. The selection of the most apt must be made by means of a process which bears in mind the service providers' performance according to a set of parameters which include, amongst others: their technical capacity; their economic and financial capacity; the safety, hygiene and health conditions at work; and quality assurance (quality of the executed works, compliance with deadlines, commercial relations). With regard to the parameter "safety, hygiene and health conditions at work", the classifications are given with the participation of the safety services.
The evaluation of the service providers is made according to a scale of values which normally varies between 1 and 4, with the following value/concept relation: Level I - very good; II - good; III - average; IV - poor. The service provider classified at level IV may not be registered in the list of qualified service providers and will consequently be hindered from executing any jobs for the Company; the one classified at levels I, II or III is recognised as being qualified and will be registered in the list of service providers available for consultation. The service provider classified at level III is understood as being subject to precautionary recommendations, and is therefore subject to tighter control.
3.3. How the "safety" parameter is evaluated
The evaluation of the safety parameter is based on the analysis of four factors:
• compliance with safety rules and dispositions;
• efficiency in the correction of registered anomalies;
• frequency rate;
• severity rate.
The first two are related to the fulfilment, by the service provider, of the applicable rules and regulations related to safety, the job's safety plan (should there be one) and the Contractual Dispositions on Safety, Hygiene and Health included in the specifications. The safety audits constitute an objective means to evaluate the degree of satisfaction with regard to this set of evaluation factors, either through the obtained results and found defects, or through the service provider's action in repairing anomalies and their collaboration with the Company in solving safety problems. The last two factors reflect the results obtained with regard to accidents during the works executed for the Company and their evolution. The analysis of the accident rates, though not in itself conclusive, complements and confirms the evaluation made through the safety audits, whose results allow one to evaluate in a more rigorous way the safety culture of the service provider. The weighting of the frequency and severity rates is normally carried out on the basis of a scale of values, in the definition of which one considers the values obtained by the various service providers, the indicators at national level and the company's experience. Such scales must not be applied indiscriminately to any type of activity without a previous analysis of their suitability, bearing in mind, namely, the size of the service provider. More important than the comparison of rates between different companies is the observation of the respective evolution of each one. On the other hand, the value scales must be periodically reviewed in order to reflect the positive evolution of the working conditions and the natural expectations regarding the improvement of the accident indicators. The requirements imposed in the qualification process for "LW Service Provider" foresee an audit by representatives of the Direcção-Central de Tecnologia e Aprovisionamento (DTA) and by a representative of the Distribution Company in the area where the service provider is established. Although the origins of the training are most diversified, all service providing companies are subject to the regulation in force in the EDP Group of Companies.
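The paper does not give formulas for the frequency and severity rates it mentions; a minimal sketch is given below under the assumption that the common ILO-style convention per hours worked is meant, with banding thresholds for the level I-IV scale that are invented purely for illustration:

```python
# Hypothetical sketch: accident frequency/severity rates in a common
# ILO-style convention (the paper does not state its own formulas),
# plus an illustrative banding into the levels I-IV described above.
# All thresholds below are invented for illustration.

def frequency_rate(lost_time_accidents: int, hours_worked: float) -> float:
    # lost-time accidents per million hours worked
    return lost_time_accidents * 1_000_000 / hours_worked

def severity_rate(days_lost: float, hours_worked: float) -> float:
    # days lost per thousand hours worked
    return days_lost * 1_000 / hours_worked

def level(rate: float, thresholds=(10.0, 25.0, 50.0)) -> str:
    # hypothetical banding: lower rates map to better levels
    for band, name in zip(thresholds, ("I", "II", "III")):
        if rate <= band:
            return name
    return "IV"  # level IV: excluded from the qualified-provider list

fr = frequency_rate(3, 250_000)  # e.g. 3 lost-time accidents, 250,000 h
print(fr, level(fr))             # 12.0 II
```

As the text stresses, such a banding would have to be reviewed periodically and adapted to the size of each service provider before being applied.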
4. LW SAFETY AUDITS
The performance of an audit presupposes the combination of several factors: auditors independent of the audited crew; the existence of a norm, internal regulation or operational procedure; checklists prepared beforehand; agreement by and participation of the audited party; and registration of the observations made during the audit.
4.1. The auditing crew requirements
The auditing crew must:
• be made up of three members without any hierarchical or functional bond to the audited party: a functional leader of the Technical Area (co-ordinator), advised by two specialists in LW and in safety and prevention;
• be informed in detail about the organisational and technical aspects of the crews to be audited;
• act with discretion, in an objective and reasonable manner, analysing, interviewing, evaluating and recommending, without interfering with, opposing or hindering the normal operation of the audited party.
4.2. Audit planning
During planning, the set of actions needed to carry out the audit is put in order, namely: the gathering of base documentation; the scope of the work to be audited; the duration of the audit; and the parameters to be integrated in the checklists.
4.3. Checklists
The checklists must include:
• the identification of the auditors, reference number and date;
• the identification of the operational unit and the audited crew;
• the location of the installation where the work is performed;
• the applicable regulation;
• the registration of the observations made, the persons contacted and the conclusions reached.
The lists may include alternative-answer questions (yes/no) or questions with the possibility of explanation, leading eventually to other questions, thereby ensuring that the largest amount of information possible is obtained in order to reach the auditing objectives.
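As an illustration of how such a checklist could be represented, here is a minimal sketch; the field names are this sketch's own invention, mirroring the items listed above, and do not come from EDP's actual forms:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal sketch of a checklist record mirroring the items above.
# Field names are illustrative, not taken from EDP's documentation.

@dataclass
class ChecklistItem:
    question: str
    answer: Optional[bool] = None   # yes/no alternative answer
    explanation: str = ""           # optional free-text explanation
    follow_ups: List["ChecklistItem"] = field(default_factory=list)

@dataclass
class AuditChecklist:
    auditors: List[str]
    reference_number: str
    date: str
    operational_unit: str
    audited_crew: str
    installation_location: str
    applicable_regulation: str
    items: List[ChecklistItem] = field(default_factory=list)
    observations: str = ""
    contacted_persons: List[str] = field(default_factory=list)
    conclusions: str = ""
```

The nested follow_ups list reflects the idea that a yes/no answer may lead on to further questions.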
4.4. The performance of the audit
The audit is initiated by a preliminary meeting in which the auditing crew and the hierarchy of the crew to be audited participate.
4.5. Methodology to follow
The following general methodology rules must be applied during the audit:
• utilisation of the checklists;
• objective confirmation of compliance with what is established in the procedures, through direct observations and interviews with the audited party;
• evaluation of the audited personnel's knowledge of the procedures;
• identification of non-conformities;
• establishment of the cause-effect relation for the non-conformities found;
• registration of all non-conformities, as well as suggestions for their correction;
• communication of the conclusions of the audit to the audited party.
The auditing crew will meet directly afterwards with the objective of summing up the verifications made, in order to elaborate a final report. The final report, destined to be presented to the hierarchy of the audited crew, must contain all the elements considered relevant during the audit, pointing out as well the solutions for the correction of the detected non-conformities.
5. BASIC ASPECTS TO VERIFY
5.1. At the site where the auditing crew is established
5.1.1. General organisation
Ensure: the good operation of the work programming, and make a survey of eventual difficulties in its fulfilment (logistics, lack of material, etc.); the correct management of the means in relation to the number of jobs to execute.
5.1.2. Tools and equipment
Check: the storage; the good apparent state and cleanliness; the equipment and tool control process before their utilisation; the registration of periodical tests.
5.2. Aspects to verify at the work location
5.2.1. Regulatory and working documentation
Check:
• that the executors possess the regulatory LW qualification;
• the correct filling out of the Live Intervention Request;
• that the work is being performed under an issued written document: a Live Intervention Authorisation, for MV, or a Live Intervention Licence, for LV;
• the correct elaboration of the Working Plan, where the most important phases of the whole process are determined;
• the adequacy of the preparation to the job to be executed.
5.2.2. Preparation of the LW intervention
This must follow the sequence of actions below.
Feasibility of the intended intervention: verification of the possibility of executing the required works under regulatory conditions.
Reconnaissance of the work to be performed and of the site: following the previous action, the person responsible for the preparation of the job gathers, on the execution site, the set of elements necessary and fundamental for the elaboration of a working plan, namely:
• General installation data: the installation's characteristics, such as the nature and section of the conductors, type of poles involved, accessories and equipment utilised, spans upstream and downstream from the intervention, existing unevenness, state of the conductors, etc.
• Accessibility to the work site: verification and study of the various existing accesses, not only for vehicles but also for the transportation of equipment and tools.
• Special Exploitation Regimen (specific to MV): selection of the output or outputs to be placed under SER and other measures to be taken, eventually, in order to ensure, throughout the intervention, permanent compliance with the safety conditions.
Necessary material for the intervention: listing of the set of tools and equipment.
5.2.3. Organisation of the work location
Local conditions: check the positioning of the vehicles; the signalling of the working zones; the placing of tool racks, tarpaulin, service and manoeuvring ropes.
Calculations: check that the mechanical calculations of the line, necessary for the execution of the intervention, have been performed.
Equipment and tools: check that the homologated LW tools and equipment are being used.
Preparation and discussion of the Working Plan: check that the Working Plan has been discussed and analysed by the whole crew.
5.2.4. Personal protective equipment
Check that the executor's equipment is in a good state and is composed of: helmet (eventually with chin-strap); boots (homologated; rubber, high-topped in case of humid ground, or leather, half-topped in case of dry ground); working clothing (waterproof clothing in case of rain); safety harness (aerial jobs); working gloves (LV, two pairs: insulating gloves and mechanical protection gloves with silicone cuffs); spectacles or face shield (protection against ultraviolet light and arc by-product particles).
5.2.5. Collective protection equipment
Check the existence and good state of: fuse puller handle (LV); insulating matting (LV); phasing tester; no-voltage detector (MV).
5.2.6. Work development
Ensure:
• that there are no atmospheric conditions adverse to the execution of the job;
• that the line is under SER (specific to MV);
• that the Crew Chief controls and checks everything which goes on in the Working Zone, namely: checks that the various phases elapse in accordance with the Work Execution Conditions and the defined Operational Processes; warns the executors that they may not change working positions without his express authorisation; interrupts the work and proceeds, with the remaining crew members, to a new analysis of the operations to execute when an execution problem different from the one initially foreseen occurs; and checks that everyone has understood his new task.
5.2.7. Temporary interruption of the work
Check that the Crew Chief has guaranteed the safety of the work location with regard to the public.
5.2.8. End of work
Check that the job was correctly executed and that the Crew Chief informs the Exploitation Manager of its end.
6. EXPERIENCES
The results of the LW safety audits constitute, together with other parameters, a factor in the assessment of the maintenance of the contractor's qualification. One must, lastly, register that the implementation of the qualification procedures, initiated about two years ago, has been well accepted by the service providing companies, as it gives them a better guarantee of the functioning of the rules of competition. In this way they know that they stand on a perfectly equal footing with regard to the minimal requirements demanded by the contracting company, independently of the territorial location and of the department which promotes the contract award.
B9" Industrial Safety
TECHNICAL CERTIFICATION OF DANGEROUS EQUIPMENT: A STUDY OF THE EFFECTIVENESS OF THREE LEGALLY COMPULSORY REGIMES IN THE NETHERLANDS
A.R. Hale¹, C.M. Pietersen², B.H.J. Heming¹, B. van den Broek³, W.E. Mol³, C. Ribbert¹

¹ Safety Science Group, Delft University of Technology, 2628 EB Delft, NL
² AEA Technology Netherlands B.V., 2514 AB Den Haag, NL
³ Dutch Institute for Working Conditions (NIA-TNO), 1070 AR Amsterdam, NL
ABSTRACT
As part of the process of introducing more decentralisation and market forces in the area of safety and health regulation, the Dutch Ministry of Social Affairs is planning changes to the certification regimes for lifts, cranes and pressure vessels. This paper reports a study commissioned to evaluate the effectiveness of the regimes of periodic inspection and certification in the past and to make recommendations related to introducing competition between certifiers for in-service certification. The study found very little statistical evidence relating to the regimes. What was available, together with the opinions of the interested parties, leads to the conclusion that the current regimes work well as a contribution to technical safety. A number of recommendations are made about improving the regimes and safeguarding their achievements if more competition is introduced. The paper ends with a brief summary of the Dutch government proposals based on the research.
KEYWORDS
Technical certification, lifts, cranes, pressure vessels, risk graph.
INTRODUCTION
Compulsory inspection of equipment has been used as a safety measure for more than a century. Three types of equipment which are subject to such inspection almost universally are pressure vessels for steam and other chemical substances, either in gaseous or liquid form, lifts (particularly those for the carriage of persons) and cranes and other lifting gear. In the Netherlands all three are subject to periodic inspection by one centrally designated certification agency. Changes in the European regulations for these three types of equipment, and a reconsideration of the role of the government in regulating safety, led the Dutch Ministry of Social Affairs and Employment to reconsider the regulations in force in the Netherlands governing these inspection and certification regimes. A research study was commissioned to evaluate the effectiveness of the existing regimes (Pietersen et al 1996). In particular the Ministry was interested in the need for any changes in the criteria for certification and the organisation of the regimes themselves in the light of a possible opening up of certification to competing certifying bodies in a free(r) market. With the introduction of the European CE-mark for bringing products onto the European market, national governments lose the ability to require separate certification of manufactured equipment, since this would reintroduce trade barriers. The CE-mark is regulated by standards under Article 100A of the European Treaty and is put on equipment by the manufacturer, in certain cases, including for the three types considered here, once the
equipment has been tested by an independent certifying body. These notified bodies (nobo's) are accredited in each of the European states and their approvals are valid for the whole European Union. Only once the equipment has been taken into use may national governments require additional standards or certificates, under Article 118A of the European Treaty. Among such certificates are the periodic inspection regimes which are dealt with in this study. These may include an inspection at the time of taking into use of the equipment if it is constructed on site (e.g. tower crane, combined pressure vessels in one plant).
FRAMEWORK FOR THE STUDY
A periodic inspection regime can be described as a number of steps, as shown in Figure 1. All of these must be effective and efficient in order to make the regime both safe and cost-effective.

Figure 1: Framework for a periodic inspection regime
(Diagram elements: manufacture and installation; definition of certifiable object; registration; eventual certification at taking into use; certification criteria (expertise, frequency, methods); certification, with improvement and re-certification where poor; reporting; and evaluation of the regime, fed by accidents, defect analysis and user satisfaction.)
The study formulated questions about the following areas to provide the basis for data collection and interviews about the different steps:
1. Are all the appropriate types of equipment covered by the legal requirements, based on how dangerous they are?
2. Is all the required equipment registered for inspection and certification?
3. Are the inspection criteria appropriate: clear and covering the factors which lead to accidents?
4. Is the frequency of inspection appropriate and is it kept to?
5. Are the available expertise and methods of inspection satisfactory?
6. Are a significant number of (serious) defects found and corrected?
7. Are there accidents with the types of equipment which can be traced to the functioning of the regime?
8. What would the expected effects be of removing the inspection monopoly and introducing competition?
METHOD
The study was conducted using interviews with the actors in each certification regime: the certifying body with a monopoly of the legal inspection, other certifying bodies doing similar work, manufacturers, users of equipment (employers and employees), and regulators. The records of the certifying bodies were studied to extract data about the effectiveness of the regimes in so far as this was possible. It is particularly worthy of note that none of the certifying bodies had their data available in a form which could be easily analysed; most was not computerised, and there was no consistent classification of the types of defects found during inspections. This necessarily limits the quantitative results which can be presented. Arrangements were made for a more detailed study of a sample of the card records of the pressure vessel certifying body. The records of all 427 deviations found during inspections between 1976 and
1991 were analysed for type of failure and, for some categories of equipment, when it was detected (by periodic inspection or during use). Secondly, a random sample of 450 of the approximately 140,000 individual record cards, covering 1506 inspections, was analysed to find estimates of the percentage of periodic inspections which discovered deviations. Some results of this analysis are presented below. The records of the crane certifier could not be studied because the organisation was in the process of moving premises. The lift certifier was not willing to grant access to the records for reasons of commercial privacy. The results of the study were presented to and discussed with the regulators and later with a workshop of all interested parties.
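Percentages estimated from such a sample carry sampling uncertainty; a minimal sketch of how the precision of an estimate of this kind could be gauged, assuming simple random sampling, is given below. The deviation count used is hypothetical, not the study's figure:

```python
import math

# Illustrative sketch only: precision of a percentage estimated from a
# random sample of inspection records. n matches the 1506 sampled
# inspections mentioned above; the deviation count is invented.

n = 1506
deviations = 127                               # hypothetical count

p_hat = deviations / n
se = math.sqrt(p_hat * (1.0 - p_hat) / n)      # normal-approximation SE
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se  # approx. 95% interval

print(f"estimate {100*p_hat:.1f}%, 95% CI {100*lo:.1f}% to {100*hi:.1f}%")
```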
RESULTS
Table 1 gives a summary of the way in which the regimes are currently organised and the number of inspections and certifications which are conducted on average each year. This table refers only to the compulsory certification regimes and not to additional voluntary inspections. Data on results is taken from annual reports, except in the case of the pressure vessels, where results of the sample survey are used.
Table 1: Summary of the certification regimes and their results

Monopoly certifier - Lifts: Liftinstituut; Cranes: Aboma/Keboma; Pressure vessels: Stoomwezen.
Equipment falling under compulsory periodic inspection - Lifts: lifts where the cage can be used by a person, travel vertical ± 15°; Cranes: > 10 ton metre; Pressure vessels: vessels for steam and other gases and liquids under pressure.
Frequency of inspections - Lifts: at taking into use, after 12 months and then every 18 months; Cranes: at taking into use, after 3 years and then every 2 years, plus on modification and repair; Pressure vessels: every 1, 2 or 4 years depending on type and contents.
Approx. no. of inspections/year - Lifts: 37,000; Cranes: 3,200; Pressure vessels: 13,000.
Other certifiers and certifications in area - Lifts: non-compulsory inspections by other certifying bodies; Cranes: legally required inspection by a competent person for all cranes > 2 ton every 12 months, done by other certifying bodies and trained personnel; Pressure vessels: user inspectorates of large firms (esp. in the process industry) carry out inspection under supervision of Stoomwezen.
Results¹, % OK at first inspection - Lifts: 47%; Cranes: 70%; Pressure vessels: 91.2%.
% OK after immediate correction - Lifts: 50% (mainly door contact problems, no safety implications); Cranes: 26%; Pressure vessels: 7.5%.
% uncertifiable or requiring major correction - Lifts: 3%; Cranes: 4%; Pressure vessels: 1.3%.
Number of dangerous situations reported to Ministry (1995) - Lifts: 10; Pressure vessels: 38.

¹ For lifts and cranes data is from 1995; for pressure vessels it is from the random survey of cards over 1976-91 (see above).

Table 2 gives the results of the interviews with the interested parties in relation to the questions set out above.
Table 2: Interview results on research questions

1. Are all the appropriate types of equipment covered by the legal requirements, based on how dangerous they are?
   Lifts: Users satisfied on this point, despite lack of accident data.
   Cranes: Multi-functional excavation equipment not covered; otherwise users satisfied with what falls under the regulations.
   Pressure vessels: Satisfied for individual vessels. Concern over combinations of vessels constructed on site and not subject to start-up inspection.
2. Is all the required equipment registered for inspection and certification?
   Lifts: Overview lost since 1995 as registration of new lifts no longer compulsory. Concern about lifts in small housing blocks.
   Cranes: Overview may be lost as from 1997, since certification on taking into use of new cranes may cease then.
   Pressure vessels: Some under-registration in horticulture of N2 vessels. Overview will be lost when compulsory inspection of new vessels ceases.
3. Are the inspection criteria appropriate: clear and covering the factors which lead to accidents?
   Lifts: Standards are realistic. Some small interpretation differences. Accident factors for maintenance and installation not covered. Technical factors covered (except cage closure on pre-1978 lifts).
   Cranes: Standards are realistic, but a clear CEN norm is needed to ensure harmonisation. Accidents mainly due to operator error or violation. Ergonomic aspects not covered.
   Pressure vessels: Standards are realistic. Some interpretation differences across inspectors and regions.
4. Is the frequency of inspection appropriate and is it kept to?
   Lifts: Satisfactory. Much additional inspection on a voluntary basis.
   Cranes: Satisfactory, both for the compulsory inspection by Keboma and the 12-monthly inspection by a trained expert. Twelve-monthly inspection by a certified body perhaps desirable. Frequency based on hours used instead of calendar time perhaps desirable.
   Pressure vessels: Satisfactory, given the flexibility with which the certifier grants extensions for well-argued reasons. Formalisation of criteria for risk-based inspection necessary.
5. Are the available expertise and methods of inspection satisfactory?
   Lifts: Yes.
   Cranes: Generally satisfactory. Manufacturers doubt the up-to-date expertise of certifying bodies and argue for manufacturer certification.
   Pressure vessels: Yes. User inspectorates argue for regularisation of their position and a greater role in inspection.
6. Are a significant number of (serious) defects found and corrected?
   Lifts: Yes. Value of regime not questioned.
   Cranes: Yes. Contribution of regime to safety not questioned.
   Pressure vessels: Yes. Value of inspection regime not doubted.
7. Are there accidents with the types of equipment which can be traced to the functioning of the regime?
   Lifts: No reliable data. None known by interviewees. Accidents mainly in installation, maintenance and inspection.
   Cranes: No reliable data. None known by interviewees. Accidents mainly due to operator error.
   Pressure vessels: No reliable data. None known by interviewees.
8. What would the expected effects be of removing the inspection monopoly and introducing competition?
   Lifts: Need for clearer norms. Expect more client-orientation. Loss of centre of expertise.
   Cranes: Need for a clearer CEN norm. Competition welcomed. Possible loss of concentration of expertise.
   Pressure vessels: Need for rules on risk-based inspection criteria. Client-orientation should increase. Loss of overview of equipment.
The overview of what equipment falls under the regulations will disappear with the loss of the monopoly, since the requirements for registration of new equipment have ceased, or will cease. Only at the time of compulsory user certification will registration take place with one or other of the approved certifying bodies, and no system of reminders for certification will therefore be possible. The government appears to have no intention of requiring this, despite the problems this may cause for tracking down equipment, which may more easily slip through the net of inspection as a result. This was seen by interested parties as a negative effect of the removal of the requirement for inspection before equipment can be taken into use, coupled with the introduction of the CE-mark. A further concern in that respect was the adequacy of the European supervision of the uniform quality of independent inspection of new equipment carried out by nobo's in all European Union countries. Users had a certain suspicion of equipment not certified by bodies known to them. It should, however, be pointed out that any user retains the right to require additional certification of any equipment purchased by himself or used under contract to him. It is to be hoped that the value for society as a whole of the retention (and indeed extension) of an overview of the equipment and its hazards, as a means of monitoring safety and as feedback to designers and users of equipment, may be able to prevail over local opposition and the apparent government indifference to this issue. Apart from minor concerns with the existing regimes, the users were more concerned that any new regulations and the introduction of market competition should not jeopardise the advantages of the existing systems. Competition was felt to offer the advantage that certifiers would be more sensitive to client needs and demands, but the danger was recognised that it could threaten the level of safety if the criteria and procedures for inspection, and especially for the granting of any extensions or exemptions, were not clearly specified; perhaps even more clearly than is now the case. The danger therefore lurks that decentralisation of regulation could result in pressures for increased bureaucratisation and even thicker rule books and standards in order to preserve harmonisation. A general concern for all three regimes was expressed that removal of the certification monopoly could threaten the level of expertise built up by the certifying bodies, which could result in the longer term in a "loss of memory" for less frequent and older problems. Finally the study revealed underlying issues about who should be allowed to certify equipment. The manufacturers of lifting equipment felt that the technical expertise of engineers not employed by them could never keep up to date with new developments as well as they could. On the other hand, others doubted the independence of manufacturers. In the area of pressure vessel inspection a similar concern about the role of user inspectorates was the subject of strong debate. The independence of their inspections is currently guaranteed by supervision of their programmes and reports by the certifying body. When this monopoly is removed it is not clear whether this role should be taken over by the Dutch Accreditation Council, which also accredits certifying bodies, or whether certifying bodies should take it on. The former arrangement would make user inspectorates in more ways the equivalents of certifying bodies in their own right; the latter would create the anomalous position that certifying bodies were being asked to assess potential rivals for their direct certification business.
GOVERNMENT PROPOSALS
At a workshop in October 1996, at which the results of the research study were reported to the interested parties, the Ministry set out its proposals for the new regime of periodic certification in the light of the findings and its general policy (Ministry of Social Affairs & Employment 1996b). The comments and recommendations of the research were accepted except that:
1. Criteria relating to user comfort, working conditions for installers, maintenance personnel and inspectors and ergonomics, c.q. error prevention for operators, were not seen to be appropriate matters for regulations about periodic inspection, but for employer-employee agreement.
2. The need for a central register for equipment was not seen as a problem.
3. Concerns over nobo's from other countries were felt to be soluble by users making their own agreements with suppliers over what certificates to accept.
From the analysis of the pressure vessel records some idea could be gained of the origins of the failures. The analysis was complicated by the lack of data on the total number of pieces of equipment in the population in different subcategories (e.g. by type of metal used for construction). It was also not always possible to distinguish defects discovered during periodic inspections from those found when the certifier was called in after the user had detected a possible defect and wished to have it checked (Stoomwezen had no monopoly on that sort of inspection). Table 3 indicates the sort of data which can be obtained from the analysis of the defect records.
Table 3: Origins of failures detected in periodic (P) and other (O) inspections

Origin of failure        Total (all equipment)
Inspection                  4
Design error              100
Manufacturing error       102
Use error                  45
Corrosion                  48
Stress corrosion            5
Erosion                     -
Creep                      46
Fatigue                    57
Other                      20
Total                     427

Totals by equipment type: LPG (P) 19; LPG (O) 66; PV (P) 25; PV (O) 31; Stor. (P) 4; Stor. (O) 2; HE (P) 6; HE (O) 23; SR (O/P) 42; Boil. (O/P) 101; Pipe (O/P) 95; Div. (O/P) 13.

PV = pressure vessel; Stor. = storage tank; HE = heat exchanger; SR = steam receiver; Boil. = all types of boiler; Div. = diverse. "Use error" combines the categories "operations", "wrong use" and "over-pressure" from the forms.
For the equipment where data is available on the type of inspection that uncovered the defect, 30% of the known defects were discovered at periodic inspections.
CONCLUSIONS FROM THE FIELD STUDY
The study demonstrated a remarkable degree of satisfaction with the current regimes for technical certification in the Netherlands. All parties pointed to the relatively high percentage of defects found by the regime and the low numbers (if not absence) of accidents from technical failures in the three types of equipment. Frequent mention was made of the preventive working of the compulsory regimes, leading to more voluntary inspection and corrective actions, so that expensive equipment was not put out of service at the legal inspection. What is remarkable, however, is that this satisfaction seems to be based almost entirely on qualitative and subjective assessments. None of the regimes could easily produce comprehensive data to prove the worth of the inspections; none could demonstrate clearly what types of defects were discovered, nor whether these had any implications for the design or use of the equipment. In other words, the regulatory system currently has no systematic feedback and learning loop based on data analysis by which its effectiveness can be assessed, nor whereby proposed changes in it can be objectively evaluated. A reliable register of accidents with the equipment types (either by the certifiers or by the Ministry) is not currently available. The value of analysis of incident and defect data was shown by one user inspectorate for pressure vessels, which was able to demonstrate the value of its inspections, track the results of changes in frequency and make proposals for improvements on the basis of its (unfortunately confidential) data. The analysis of the sample of the pressure vessel records carried out during this research showed that much better classification and population data would be necessary in order to formulate clear preventive strategies to reduce the number of defects, but the high proportion of manufacturing errors is striking. A national system of analysis of inspection data would be worth establishing, if it were computerised and gave rather more depth of causal information. The incipient removal of the monopoly of periodic inspections will make this more difficult to set up, since more certifying bodies will have to be persuaded to collaborate.
The main feature of the government's proposals for regulation is a more consistent use of risk-based regulation. The type of certification regime to be applied would be decided by application of a risk graph (Figure 2) which weighs the estimated probability of failure of the equipment (F), the likelihood that someone is in the vicinity if it fails (B) and the maximum expected effect of failure (E).
Figure 2: Risk graph for deciding certification regime

        E1   E2/B1   E2/B2   E3/B1   E3/B2   E4
  F3     1     2       3       4       5      6
  F2     0     1       2       3       4      5
  F1     0     0       1       2       3      4

Where:
F1 = very small chance of failure, not sensitive to ageing
F2 = small chance, fails functionally rather than catastrophically
F3 = relatively large chance, sensitive to ageing
B1 = persons seldom in danger area
B2 = persons often or always in danger area
E1 = slight injury
E2 = permanent disability / 1 death
E3 = several deaths
E4 = catastrophic, many deaths
The proposed regimes are divided into 0-3 (no legal requirement for certification) and 4-6 (increasingly tight legal requirements):
0 = Voluntary inspection by persons after instruction and training
1 = Voluntary inspection by experts appointed by the employer and specially trained
2 = As 1, but inspectors have a position independent of those with an interest in the results. Training on approved courses
3 = As 2, but criteria such as EN 45012/ISO 9000 or EN 45004 applied to independent user inspectorate
4 = Compulsory legal certification according to government approved rules by an independent expert certifying body accredited under EN 45011 by the Dutch Accreditation Council. Possibility for the technical part of inspection to be delegated to certified user inspectorates
5 = As 4, but no delegation of technical inspection permitted
6 = As 5, but with close government supervision, or carried out by the government
For Class 3 the government would reserve the right to require certification based on other arguments than those represented in the risk graph (e.g. environmental damage, practical experience, regimes in surrounding countries, political, economic or employment reasons). This preserves a region of discussion. In the first working out of the classes for the three types of equipment the Ministry comes no higher than class 4 for any of them. Lifts for persons would fall in class 4, unless they can only carry one person (class 3). This would represent only a minor relaxation of the current compulsory legal certification. The only crane type which warrants class 4, according to the proposals, is that for lifting persons more than 3 metres. The current equipment which must be compulsorily certified would only reach class 3 and would fall into the voluntary sector; however, the government proposes to use the additional arguments cited above to retain it in the compulsory sphere. Pressure vessels for all purposes would be grouped in class 4 above a certain volume and pressure level, which would vary per material contained in the vessel; the chosen limits would shift some equipment across the boundary of compulsory certification.
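Reading a class off the risk graph is a simple table lookup; a minimal sketch encoding Figure 2 is given below, assuming (as the figure's column headings suggest) that B is irrelevant for the effect categories E1 and E4. The function and variable names are this sketch's own, not the Ministry's:

```python
# Minimal sketch of the Figure 2 risk graph as a table lookup.
# Columns: E1, E2/B1, E2/B2, E3/B1, E3/B2, E4 (B ignored for E1 and E4).

RISK_GRAPH = {
    ("E1", None): 0, ("E2", "B1"): 1, ("E2", "B2"): 2,
    ("E3", "B1"): 3, ("E3", "B2"): 4, ("E4", None): 5,
}
ROWS = {"F3": [1, 2, 3, 4, 5, 6],
        "F2": [0, 1, 2, 3, 4, 5],
        "F1": [0, 0, 1, 2, 3, 4]}

def certification_class(f: str, b: str, e: str) -> int:
    """Return the class 0-6: classes 0-3 carry no legal certification
    requirement, classes 4-6 increasingly tight legal requirements."""
    key = (e, None) if e in ("E1", "E4") else (e, b)
    return ROWS[f][RISK_GRAPH[key]]

# Example: relatively large failure chance (F3), persons often in the
# danger area (B2), several deaths possible (E3) -> class 5
print(certification_class("F3", "B2", "E3"))  # 5
```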
Comment
The use of a risk graph seems a sensible method of bringing a certain nuance into the certification requirements. Such a graph has been used in other applications for machinery standards and is proposed in the new Dutch approach
to risk-based regulation for all legislation in the area of working conditions. The main area of discussion in the use of such a graph is the interpretation of the categories: in particular, what are regarded as slight or significant chances of failure, and whether the expected maximum effect should be interpreted as a "maximum credible accident", the maximum that has occurred according to records in the country concerned, or some other more conservative criterion. In respect of the latter it is striking that none of the equipment is rated as having a potential effect above E3; i.e. none is rated as having catastrophic potential, despite known cases of tower cranes collapsing on crowded streets or major explosions after vessel ruptures. The risk graph, therefore, gives much food for discussion. It can contribute to clarifying the factors which lead to decisions to subject equipment to one sort of regime or another, but will demand intensive and careful discussion with all parties concerned to arrive at consensus. The rejection of the need for legal regulations concerning criteria other than construction requirements in periodic inspections seems to be a reflection of the current Dutch policy to limit government intervention in regulation to clearly provable issues of major safety concern. The fact that the ergonomics of crane controls and the provisions for the protection of lift maintenance staff can easily be checked at periodic inspections is a matter which the government wishes to leave to interested parties to include in the criteria if they wish. The mechanism exists for this to be realised through the influence exercised by standards committees and advisory boards of certifying bodies in modifying and approving the certification procedures and criteria for their work.
REFERENCES

Ministry of Social Affairs & Employment (1996b). Goed keuren? SZW op zoek naar een samenhangend keuringsbeleid. (Good certification? Social Affairs in search of a co-ordinated certification policy). Den Haag: Ministerie van Sociale Zaken en Werkgelegenheid.
Pietersen, C.M., Hale, A.R., Heming, B.H.J., van den Broek, B. and Mol, W.E. (1996). Evaluatieonderzoek periodieke gebruikskeuringen van arbeidsmiddelen. (Evaluation research of the periodic certification of work equipment). Den Haag: AEA Technology.
ENVIRONMENTAL RISK ASSESSMENT OF CHEMICAL PLANTS: A PROCESS SYSTEMS METHODOLOGY

S.K. Stefanis, A.G. Livingston and E.N. Pistikopoulos
Centre for Process Systems Engineering, Department of Chemical Engineering, Imperial College, London SW7 2BY, U.K.
ABSTRACT
A Methodology for Environmental Impact Minimization (MEIM) of routine and non-routine releases is presented in this paper. The methodology, which embeds environmental impact assessment techniques within a process optimization framework, involves the proper definition of a consistent boundary around the process of interest, identification of the emissions inventory, quantification of environmental impact via proper metrics and inclusion of environmental criteria within process modelling and optimization tools. Interactions between cost and routine/non-routine environmental impact objectives are explored, while implications for maintenance policies are also investigated. The steps of the theoretical analysis and the potential of the proposed methodology are illustrated with a simplified chemical process involving methane chlorination.
KEYWORDS
Environmental impact; non-routine releases; process optimization; preventive maintenance.
INTRODUCTION
Environmental risk assessment is typically concerned with the estimation of the damage caused to humans by hazardous pollutants and is traditionally defined as the likelihood of an adverse health effect, such as a carcinogenic death, due to an exposure to an environmental hazard (Lapp, 1991). Yet, little emphasis has been given to environmental effects such as the actual air or water damage, ozone depletion etc. Christou (1996) proposed a framework for developing an integrated approach for environmental risk assessment, which relies on qualitative hazard identification techniques (such as HAZOP, FMEA; see Montague, 1990). This approach focuses on post-release calculations (i.e. fate of pollutants and their health effects) rather than the actual source of pollution and its causes, either intended or unintended, which for process plants are strongly linked to aspects of plant design and operation. Environmental risk management is currently performed at the post-assessment level in an iterative fashion based mainly on operational aspects of the plant in question (HMSO, 1995), using health but not environmental indicators (Sarigiannis and Volta, 1996). Currently, there is no formal process optimization based approach to fully explore the interactions between process design, operation (including maintenance) and environmental impact due to risk and unexpected events. Recently, we introduced a Methodology for Environmental Impact Minimization (MEIM) which embeds Life Cycle Analysis (LCA) principles within an optimization framework for continuous as well as batch processes (Pistikopoulos et al., 1994; Stefanis et al., 1996) to quantify the environmental impact of routine releases. The main steps of MEIM include: (i) definition of a process system boundary, (ii) environmental impact assessment on a short or long term basis and, (iii) incorporation of environmental impact
criteria explicitly as process design objectives together with economics in a multiobjective optimization setting. MEIM is an effective tool for a rigorous assessment of the interaction between industrial technology and the environment, helping identify design and operation options to reduce pollution at source by minimizing process routine releases. A key characteristic of non-routine releases is that they are often related to equipment failures and the probabilistic occurrence of external events, such as unexpected leaks and human errors. Industrial risk frequency graphs indicate that non-routine releases can significantly influence the environmental damage related to a process system. Unlike extreme cases such as major accidents (occurring at very low frequencies with serious consequences) and routine releases (highly frequent, causing minor environmental damage), non-routine releases, placed in between, often cause moderately severe adverse effects and may therefore result in considerable risk levels. This necessitates the development of an integrated framework that will properly account for non-routine process waste generation due to "unexpected/undesired" events while simultaneously assessing the environmental impact of routine waste releases. Since the environmental impact of a non-routine release depends on its probability of occurrence, the machinery of reliability theory can be employed to provide such a formal link, as for example used in the FRAMS methodology (see Thomaidis and Pistikopoulos, 1995). Summarizing, the objectives of this paper are: (i) to quantify principles of MEIM so as to provide an integrated and rigorous framework to assess in a systematic way the adverse effects of industrial processes on ecosystems during normal as well as abnormal conditions, (ii) to study the effects of plant design and operation on the environmental impact of Routine and Non-Routine Releases, and (iii) to establish the fundamental theory and computational tools to arrive at cost optimal designs featuring minimum environmental impact via the use of multiobjective optimization techniques.

METHODOLOGY FOR ENVIRONMENTAL RISK ASSESSMENT OF ROUTINE/NON-ROUTINE RELEASES
In the context of this work, Environmental Risk (ER) is the measure of potential threats to the environment taking into account that undesired events (scheduled/unscheduled) will lead to environmental degradation. Qualitatively, Environmental Risk represents the probability of environmental damage due to undesired events multiplied by the severity of the environmental degradation. In accordance with the principles of MEIM, the system boundary around the process of interest is first specified. Concentrating mainly on process waste generation, the following framework for minimizing routine and non-routine releases is proposed (see Figure 1).
Figure 1: Environmental Impact Assessment of Routine/Non-Routine Releases (routine releases such as process wastes, and non-routine releases such as leakages, fugitive emissions, accidental releases and off-spec product, are combined with reliability/maintenance data and environmental limits into short-term impact indices [CTAM, CTWM, SMD, GWI, POCP, SODP] and long-term indices [UCTAM, UCTWM, USMD] for air pollution, water pollution and solid waste)
ROUTINE AND NON-ROUTINE EMISSIONS INVENTORY

The process of interest is examined in detail to determine wastes that are regularly emitted into the air, aquatic or soil environment, and various non-routine releases such as: 1) Accidental Releases, mainly due to the occurrence of scenarios such as leakage, equipment failure, human error, etc.; 2) Fugitive Emissions, which involve small leaks or spills from pumps or flanges and are generally tolerated in industry; 3) Releases from Process Deviations, caused during start-up, shut-down and maintenance procedures and also by changes in operating conditions (temperatures, pressures) and in various plant parameters such as feed variations; 4) Episode Releases, as a result of sudden weather changes or other occurrences. The overall inventory is represented by a waste vector, as shown in Figure 1, which consequently needs to be assessed.

ASSESSMENT OF ENVIRONMENTAL DAMAGE
All routine and non-routine releases are grouped systematically in terms of the environmental damage caused on a short- or long-term basis. For the fully operable state (routine process system status), the Environmental Impact (EI) vector shown below represents the damage caused to the environment during intended plant operation on a time basis (usually one hour of operation, ignoring pollutant intermedia partitioning), i.e. the environmental impact of routine releases:

EI = \sum_{w=1}^{W} \left[ \mathrm{CTAM}_w \;\; \mathrm{CTWM}_w \;\; \mathrm{SMD}_w \;\; \mathrm{GWI}_w \;\; \mathrm{POI}_w \;\; \mathrm{SODI}_w \right]^{T}_{\mathrm{process}} \qquad (1)
comprising indices that measure air pollution CTAM [kg air/h], water pollution CTWM [kg water/h], solid wastes SMD [kg solids/h], global warming GWI [kg CO2/h], photochemical oxidation POI [kg ethylene/h] and stratospheric ozone depletion SODI [kg CFC11/h] for each waste w, depending on the current legislation limits and the mass of pollutant disposed¹. When an equipment failure, or any other event which causes the system to deviate significantly from its normal operating status, occurs, it defines a new operating state for which a corresponding environmental impact, similar to (1), can be computed. This new operating state will have an associated probability of occurrence, which in general will be a function of equipment reliability models and other data (maintenance, safety events, statistical charts for spills, etc.). We denote the set of potential discrete operating states in which a process system can reside over its operating time horizon H as the state space K, with a corresponding probability P_k(t), k ∈ K, where t denotes time (since the reliability of the processing system is a function of time). A combined environmental impact vector for routine and non-routine releases, CRNREI, can then be introduced to represent the average environmental damage of a given process design during normal and unexpected operation within a specified time horizon [0,H], as follows.

STEP 1:
(a) Define all operating states K of a process system; (b) determine the corresponding environmental impact vector EI_k, k ∈ K, from (1).
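Step 1(b) can be illustrated with a small numerical sketch. The fragment below is not from the paper: the waste flowrates are invented, only the CTAM and GWI components of the six-component vector are formed, and it assumes the simple definitions suggested by the units (CTAM as emitted mass divided by a maximum acceptable concentration, GWI as emitted mass weighted by a global warming potential); the limit values are those of the chloromethane example data given later in the paper.

```python
# Illustrative computation of two components of the EI vector of eq. (1): CTAM
# (mass of air needed to dilute each pollutant to its acceptable concentration)
# and GWI (CO2-equivalent emission rate).
wastes = {"CH3Cl": 0.8, "CCl4": 0.05, "HCl": 1.2}              # routine releases [kg/h]
MAC = {"CH3Cl": 8.333e-6, "CCl4": 8.333e-6, "HCl": 8.333e-5}   # legislation limits [kg/tn air]
GWP = {"CH3Cl": 5, "CCl4": 1300}                               # [kg CO2/kg pollutant]

CTAM = sum(m / MAC[w] for w, m in wastes.items())              # [tn air/h]
GWI = sum(m * GWP.get(w, 0.0) for w, m in wastes.items())      # [kg CO2/h]
print(f"CTAM = {CTAM:.3e} tn air/h, GWI = {GWI:.1f} kg CO2/h")
```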
STEP 2:
(a) Estimate the reliability (unavailability) of each piece of equipment as a function of time, R_j(t) [Q_j(t)]. For example, if Weibull functions are used to describe equipment reliability,

R_j(t) = 1 - \int_0^t \mathrm{weif}(\alpha_j;\beta_j)\,dt, \quad j \in S_k; \qquad Q_j(t) = \int_0^t \mathrm{weif}(\alpha_j;\beta_j)\,dt, \quad j \in \bar{S}_k \qquad (2)
where S_k (\bar{S}_k) is the index set of operational (failed) components of the equipment in state k, and α_j, β_j are the scale and shape factors of the Weibull function. (b) Determine the probability of each state k, e.g. assuming statistically independent equipment failures:

P_k(t) = \prod_{j \in S_k} R_j(t) \prod_{j \in \bar{S}_k} Q_j(t), \qquad k \in K \qquad (3)
¹These indices rely on the assumption that pollutants contribute linearly; extensions to include fate considerations are described elsewhere (Stefanis, 1996).
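A minimal numerical sketch of this machinery may help; it combines equations (2)-(3) above with the time-averaging of Steps 3 and 4 below, for a hypothetical two-component system with exponentially shaped Weibull reliabilities (β = 1) and invented per-state impact values.

```python
# Sketch of Steps 2-4: component reliabilities -> state probabilities -> EI(t)
# -> time-averaged CRNREI and NREI. All parameter values are hypothetical.
import numpy as np

alpha, beta = 1.2e5, 1.0            # Weibull scale [h] and shape, both components
H = 8760.0                          # time horizon [h]
t = np.linspace(0.0, H, 2001)

R = np.exp(-(t / alpha) ** beta)    # component reliability R_j(t), eq. (2)
Q = 1.0 - R                         # component unavailability Q_j(t)

# State space K for two independent components: (up,up), (up,down), (down,up), (down,down)
P = np.array([R * R, R * Q, Q * R, Q * Q])       # state probabilities, eq. (3)

# Environmental impact of each state (e.g. CTAM-like index); state 0 is fully operable
EI_k = np.array([100.0, 340.0, 250.0, 900.0])

EI_t = (P * EI_k[:, None]).sum(axis=0)           # EI(t), eq. (4)
CRNREI = np.trapz(EI_t, t) / H                   # eq. (5)
NREI_t = (P * (EI_k - EI_k[0])[:, None]).sum(axis=0)
NREI = np.trapz(NREI_t, t) / H                   # eqs. (6)-(8)

print(f"CRNREI = {CRNREI:.1f}, NREI = {NREI:.1f} (same units as EI)")
```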
STEP 3: Calculate the Environmental Impact vector as a function of time, EI(t):

EI(t) = \sum_{k \in K} P_k(t)\, EI_k \qquad (4)
STEP 4: Determine the combined Environmental Impact of Routine and Non-Routine releases for a given time horizon H:

CRNREI = \frac{1}{H}\int_0^H EI(t)\,dt = \frac{1}{H}\int_0^H \sum_{k \in K} P_k(t)\, EI_k\,dt \qquad (5)
Qualitatively, this vector represents the average environmental impact of the process design over all possible system states within a specified time horizon H. It therefore measures the overall environmental performance of the system under both expected and unexpected events. The closer this vector is to the Environmental Impact vector of the initial state (denoted here as the fully operable state o), the lower the environmental risk the system conveys. Note that the Environmental Impact vector attributed to non-routine releases, NREI, over the time horizon can easily be computed as follows:
NREI_k = EI_k - EI^{o}, \qquad k \in K \qquad (6)
where EI^o is the Environmental Impact metric corresponding to the fully operable state, i.e. it denotes routine waste releases.
NREI(t) = \sum_{k \in K} P_k(t)\, NREI_k \qquad (7)
NREI = \frac{1}{H}\int_0^H NREI(t)\,dt = \frac{1}{H}\int_0^H \sum_{k \in K} P_k(t)\, NREI_k\,dt \qquad (8)
Qualitatively, NREI represents the average environmental impact due to non-routine releases. For the fully operable state, from (6), NREI = 0, as expected.

STEP 5: Design Optimization for Minimum Environmental Impact and Environmental Risk (optional).

The combined environmental impact vector, as defined above, provides an estimate of the average environmental performance of the system taking into account both routine and non-routine releases. In the analysis presented so far, decisions regarding the process design itself (for example, volumes of equipment) were considered fixed. A subsequent question is then how to obtain a minimum cost design while ensuring that the system is capable of keeping routine and non-routine release levels as low as possible. Conceptually, this problem can be posed as the following multiobjective optimization problem (9), using the ε-constraint method (Hwang, 1979):

\min_{x,y} \; c^T y + F(x)
\mathrm{s.t.} \quad h(x) = 0, \quad g(x) \le 0,
B \cdot y + C \cdot x \le D,
NREI(x,y) = \frac{1}{H}\int_0^H \sum_{k \in K} P_k(t)\,(EI_k - EI^{o})\,dt,
CRNREI(x,y) = \frac{1}{H}\int_0^H \sum_{k \in K} P_k(t)\, EI_k\,dt \le \varepsilon,
x \in X, \quad y \in Y \subseteq \{0,1\}^n \qquad (9)
The continuous variables x represent flows, operating conditions and design variables. The binary variables y denote the potential existence of process unit blocks and, optionally, streams and interconnections; ε is a parameter vector that imposes stricter legislation on pollutant discharge. The binary variables typically appear linearly, as they are included in the objective function to represent fixed charges in the purchase of process equipment (in the term c^T y) and in the constraints to enforce logical conditions (B·y + C·x ≤ D). The term F(x) denotes purchase costs for process equipment, raw material purchase costs, product/byproduct sales revenues and utility costs. The sizing equations correspond to h(x) = 0, and the inequality constraints g(x) ≤ 0 include design and product specifications and other legislative limits, which are typically linear inequalities.
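The ε-constraint idea behind problem (9) can be sketched on a toy continuous problem. In the fragment below, the cost and impact functions are invented, the binary structural decisions y are taken as fixed, and scipy's SLSQP solver stands in for the mixed-integer machinery the full problem would require; sweeping ε traces out an approximate cost-versus-impact trade-off curve.

```python
# Epsilon-constraint sketch: minimize cost subject to impact(x) <= eps,
# for a sequence of progressively stricter eps values (all functions invented).
import numpy as np
from scipy.optimize import minimize

def cost(x):      # hypothetical economic objective F(x)
    return (x[0] - 2.0) ** 2 + 0.5 * x[1]

def impact(x):    # hypothetical CRNREI-type environmental metric
    return np.exp(-x[1]) + 0.1 * x[0]

pareto = []
for eps in np.linspace(1.0, 0.2, 5):
    res = minimize(
        cost, x0=[1.0, 1.0],
        constraints=[{"type": "ineq", "fun": lambda x, e=eps: e - impact(x)}],
        bounds=[(0.0, 4.0), (0.0, 4.0)],
    )
    pareto.append((eps, res.fun, impact(res.x)))

for eps, c, ei in pareto:
    print(f"eps = {eps:.2f}   min cost = {c:.3f}   impact = {ei:.3f}")
```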
STEP 6: Environmental Risk Implications for Maintenance (optional).
Having identified the most environmentally benign and economically optimal set of designs with respect to all release scenarios, the idea of criticality analysis (Thomaidis and Pistikopoulos, 1995) can be applied to identify and rank the events that are most critical with respect to plant performance and the environment. More specifically, we are interested in the sensitivity of the environmental risk, σ(t), to the probability of an event l, p_l*:

\sigma(t) = \frac{\partial NREI(t)}{\partial p_{l^*}} = \sum_{k \in K} NREI_k \left\{ \frac{\partial P_k(t)}{\partial p_{l^*}} \right\} \qquad (10)

since the estimation of NREI_k is not influenced by p_l*. Note that (10) allows equipment/events to be ranked according to their corresponding criticality index as a function of time. While the mathematical details for the estimation of σ(t) in (10) are described elsewhere (Thomaidis and Pistikopoulos, 1995; Stefanis, 1996), the results of such a ranking can be used as guidelines for maintenance and environmental optimisation, based on quantitative information regarding maintenance resources (number of service crews, job durations, etc.) and tasks (equipment maintenance specifications, list of scheduled preventive maintenance activities). The designer can then explore opportunities for maintenance execution based on a formal assessment of the deterioration of the operating, and hence environmental, system performance over time, and of the relative effect of restoring the performance of critical equipment on the environmental damage caused by unintended emissions. These issues are shown in more detail in the next section.
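Numerically, the sensitivity in equation (10) can be approximated by a finite difference on the event probabilities, as in the following sketch with two invented independent events and invented per-state impact deviations.

```python
# Finite-difference sketch of the criticality index of eq. (10): perturb the
# probability of one event and observe the change in NREI. Values are invented.
import numpy as np

def state_probs(p):                    # two independent events -> 4 states
    p1, p2 = p
    return np.array([(1 - p1) * (1 - p2), p1 * (1 - p2), (1 - p1) * p2, p1 * p2])

NREI_k = np.array([0.0, 240.0, 150.0, 800.0])   # deviations from fully operable state

def NREI(p):
    return state_probs(p) @ NREI_k

p = np.array([0.02, 0.05])             # event probabilities at some time t
h = 1e-6
sigma = [(NREI(p + h * np.eye(2)[l]) - NREI(p)) / h for l in range(2)]
print("criticality indices:", sigma)   # larger sigma -> more critical event
```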
EXAMPLE: PRODUCTION OF CHLOROMETHANES
Consider the simplified chloromethane reaction subsystem (Austin, 1984; Fusillo and Powers, 1988) shown in Figure 2. Chloromethanes are produced by the following set of reactions:

CH4 + Cl2 → CH3Cl + HCl,  CH3Cl + Cl2 → CH2Cl2 + HCl,  CH2Cl2 + Cl2 → CHCl3 + HCl,  CHCl3 + Cl2 → CCl4 + HCl
which take place in the gas phase with chlorine as the limiting reactant. The design must be such that chlorine is not allowed to accumulate in large quantities in the reaction system, due to explosion hazards; it should therefore not exceed a specified stoichiometric amount with respect to the methane reactor feed. The system is equipped with vents to the atmosphere and the separation system (which is not included in this case, for simplicity). There is an air feed line that is open when the system is not operating. Pressure effects are negligible and the reactor operates at 3 atm. A two-stage recycle compressor with intercooler, assumed to operate adiabatically, is required, followed by a gas-fired heater which ensures that the inlet reactor gases, partially preheated by the recycle gases, reach a sufficiently high temperature to minimize heat control problems.
Figure 2: Simplified Chlorination Flowsheet (CH4 and Cl2 feeds, mixers MIX-1 and MIX-2, reactor R-1, recycle compressor CR-1, heater HTR-1, a vent to the atmosphere, an air inlet used only for start-up, and the product stream to separations)

While the kinetics of the reaction scheme are given by McKetta (1989), the following operating constraints need to be satisfied for inherently safe operation, in order to produce a stream of 50 kgmol/h to be fed directly to the separation block:
400 ≤ Reactor Temperature (°C) ≤ 457
Air Feed = 0
Chlorine to Methane Molar Feed Ratio ≤ 3
Temperatures much above 450°C cannot be tolerated, since pyrolysis would occur. Pyrolysis is a very exothermic reaction and, once initiated, quickly reaches explosive violence. The presence of oxygen in the system decreases the rate of the reaction (1.25 wt% oxygen in the reactor feed decreases the rate of chlorination by approximately 50% in the studied temperature range), as it behaves as an inhibitor. High chlorine to methane molar feed ratios result in the accumulation of large amounts of chlorine in the system, which may lead to explosion; for this reason material input flowrates are adjusted so that the chlorine to methane molar ratio at the inlet of the reactor has a value of 1.3. Most of the process equipment is highly reliable, apart from: (i) the recycle compressor system, whose performance is described by a Weibull function, and (ii) the measuring devices monitoring the ratio of chlorine to methane fed to the reactor, the air feed flow and the reaction temperature. The measurement errors are regarded as discrete events and, as their probability drifts with respect to time, they are described by an exponential density function of the form f(t) = λ exp(−λt). In addition, the exponential distribution model is used to describe the probability of occurrence of external events, such as gaseous leaks from the recycle piping system, that have occurred in the past. Table 1 summarizes the required reliability data for each event.
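For the exponential event model, the probability that an event has occurred by time t is 1 − exp(−λt). The following fragment shows how quickly such events accumulate over the horizon; the rate used is illustrative, not one of the values in Table 1.

```python
# Cumulative probability of an exponentially distributed event by time t.
import math

lam = 3e-5   # illustrative event rate [1/h]
for t in (1e3, 1e4, 3.5e4):   # up to roughly the 4-year horizon
    print(f"t = {t:8.0f} h   P(event by t) = {1.0 - math.exp(-lam * t):.3f}")
```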
Table 1: Example Reliability Data (Horizon H = 4 yr)

CR-1: α = 120000 1/h, β = 1, MTTR = 72 h

Event      ERR Cl2:CH4 = ±8%   ERR TREA = ±5%   1 mm Leak   3 mm Leak   FO2 = 0.1 kgmole/h
λ (1/h)    3×10⁻…              5×10⁻…           1×10⁻…      4×10⁻…      1×10⁻…
The following environmental data (Habersatter, 1991; UK Ecolabelling Board, 1993) are also supplied for the process of interest:

Chemical   Maximum Acceptable Concentration (kg/tn air)   Global Warming Potential (kg CO2/kg pol.)
Cl2        1.67×10⁻⁵                                      -
CH4        0.0125                                         11
CH3Cl      8.333×10⁻⁶                                     5
CH2Cl2     8.333×10⁻⁶                                     15
CHCl3      8.333×10⁻⁶                                     25
CCl4       8.333×10⁻⁶                                     1300
HCl        8.333×10⁻⁵                                     -
O2         -                                              -
System Boundary & Emissions Inventory

The system boundary is considered around the methane chlorination process, and therefore the emissions inventory consists mainly of chlorinated hydrocarbons, unreacted raw materials and byproducts vented to the atmosphere:

Waste Vector = [Cl2, CH4, CH3Cl, CH2Cl2, CHCl3, CCl4, HCl, O2]_process

Environmental Impact Assessment of Routine and Non-Routine Releases

The waste vector defined above is aggregated into an environmental impact vector of low dimensionality, reflecting the actual damage caused to the environment. In this case, the metrics employed to investigate the routine/non-routine environmental behaviour of the process are:

EI = [\, \mathrm{CTAM} \;\; \mathrm{GWI} \,]^{T}_{\mathrm{process}} \qquad (11)
and depend on the mass of pollutant discharged, the maximum acceptable concentration limits and the global warming potentials defined by the user (see the table above). The probability of the system degrading into a non-operable state is negligible, since both mixers, the inlet valves and the reactor are fully reliable. The external events are all assumed to cause degradation to operable states with decreased reliability; therefore, according to Table 2, there are 31 operable degraded system states. The state probability estimation according to equation (3) indicates that: (i) a 1 mm leak on the recycle is more likely to occur than any other undesired event, (ii) all external events have greater probabilities of occurrence than failure of CR-1, and (iii) simultaneous occurrence of more than two undesired events is most rare.
Table 2: Chlorination System Degraded States (each of the 31 degraded states k is defined by a combination of the undesired events ERR Cl2:CH4 = 8%, ERR TREA = 5%, 1 mm Leak, 3 mm Leak, FO2 = 0.1 kgmole/h and CR-1 fails)
Design Optimization for Minimum Environmental Risk

The optimization problem is posed as explained in section 2.3; the design variable to be optimized is the volume of the reactor, VR (1.5 ≤ VR (m³) ≤ 3), and the degrees of freedom for each operable state are listed below:

675 ≤ NOMINAL REACTOR TEMPERATURE (K) ≤ 730
0.2 ≤ RECYCLE TO SEPARATIONS MOLAR RATIO ≤ 0.971
900 ≤ HEATER OUTLET TEMPERATURE (K) ≤ 1200
The results, summarized in Table 3, reveal some interesting points:
• Cost optimization yields a smaller reactor (2.44 m³) but at the same time results in a substantially increased global warming impact due to non-routine releases.
• By minimizing the expected value of the Critical Air Mass, an 8% reduction of the environmental risk NREI_CTAM can be achieved compared to the corresponding cost-optimal value (see Table 3). In addition, the environmental risk related to global warming is reduced by almost 33%. However, one has to pay an economic penalty for pollution reduction in this case, as optimization of CRNREI_CTAM has a negative effect on the economics of the process (30% increase in cost).
• Optimization of CRNREI_GWI yields the most interesting results, since the contribution of non-routine releases with respect to global warming is reduced six-fold! At the same time, the annual cost and the critical air mass are maintained at low levels, and the optimal reactor design is quite similar to the cost-optimal one.
• The dynamic response of the environmental risk NREI(t) corresponding to the cost-optimal case is presented in Figure 3 and shows that both the GWI and CTAM risks increase with respect to time, as the reliability of the system decays. Note that both environmental metrics are based on the steady-state environmental behaviour of pollutants; in the context of this work the time dependence is a result of the reliability analysis. The time-averaged integral of the dynamic response yields the risk values presented in Table 3.
• Figure 4 demonstrates the deviation of the environmental impact metric CTAM from its fully operable state value for each of the 31 degraded states. As can be observed, failure of CR-1 (states 7, 12, 16, 19, 23, 29, 31) results in significantly increased damage in every case. The following trends can also be observed (see Figure 4): (i) the air pollution damage that corresponds to optimization of CRNREI_CTAM is consistently lower for each state apart from state 2 (measurement error in the molar feed ratio of the reactants), verifying the fact that the total CTAM is optimal in this case, and (ii) minimization of CRNREI_GWI results in a larger CTAM in the states above k = 22; the overall CTAM, though, does not increase significantly because of their low probability of occurrence. Similar results are obtained for the GWI deviation.
Table 3: Summary of Results

                         min Expected COST   min CRNREI_CTAM   min CRNREI_GWI
Annual Cost (M$)         2622                2630              2445
NREI_CTAM (kg air/h)     209670              195225            253540
NREI_GWI (kg CO2/h)      12878               8612              2414
V_R (m³)                 2.44                2.6               2.49
Figure 3: Environmental Risk Response with respect to Time (GWI risk [kg CO2/h] and CTAM risk [10⁶ kg air/h] versus time [h])
Critical Equipment & Preventive Maintenance Policy

In order to detect the process bottlenecks with respect to environmental risk, a criticality analysis is performed with respect to the environmental impact vector of non-routine releases, NREI. The scaled criticality index σ, presented in Table 4, demonstrates that failure of the recycle compressor is the main bottleneck of the process, as it has the largest effect on the environmental damage, followed by the leaks on the recycle and finally by the measurement errors.
Table 4: Criticality Index of Equipment Failures for Example 2

Event         CR-1 fails   3 mm Leak   1 mm Leak   ERR Cl2:CH4 = ±8%   FO2 = 0.1 kgmole/h   ERR TREA = ±5%
σ (t = 0)     1            0.001       0.001       0.001               0.001                0.001
σ (t = 1 yr)  1            0.076       0.072       0.001               0.001                0.001
The preventive maintenance policy obtained to satisfy NREI_GWI(t) ≤ 1000 kg CO2 is presented in Figure 5. The equipment maintenance policy dictates that CR-1 must be maintained every 5000 h of operation.
Figure 4: CTAM Deviation from the Fully Operable Case for Each Degraded State (deviation in 10⁶ kg air/h for the min Cost, min CRNREI_CTAM and min CRNREI_GWI designs; the CR-1 failure states 7, 12, 16, 19, 23, 29 and 31 dominate)

Figure 5: NREI_GWI Response and Maintenance Policy for Minimum Global Warming - Example 2 (NREI_GWI [kg CO2/h] versus time [h])

References

Aelion, V., F. Castells and A. Veroutis (1995). Life Cycle Inventory Analysis of Chemical Processes. Environmental Progress 14(3), 193-200.
Austin, G.T. (1984). Shreve's Chemical Process Industries, 5th ed. McGraw Hill.
Christou, M.D. (1996). Environmental Risk Assessment and Management: Towards an Integrated Approach. Proceedings of Probabilistic Safety Assessment and Management '96, Crete, Greece, vol. 2, 700-704.
Directive, Seveso (n.d.). Chemistry in Britain 32(10), 15-20.
Geoffrion, A.M. (1974). Generalized Benders Decomposition. J. Optimization Theory Applications 10, 237-260.
Habersatter, K. (1991). BUWAL Report: Ecobalance of Packaging Materials, State of 1990, 1st ed. F.O.E.F.L.
HMSO (1995). A Guide to Risk Assessment and Risk Management for Environmental Protection. Department of the Environment, UK.
Lapp, S.A. (1991). A Risk Evaluation System. ChemTech, pp. 700-704.
McKetta, J.I. (1991). Chemical Engineering Design Encyclopedia, 1st ed. Marcel Dekker.
Montague, D.F. (1990). Process Risk Evaluation - What Method to Use? Reliability Engineering and System Safety 29, 27-53.
Pistikopoulos, E.N., S.K. Stefanis and A.G. Livingston (1994). A Methodology for Minimum Environmental Impact Analysis. AIChE Symposium Series, Volume on Pollution Prevention through Process and Product Modifications 90(303), 139-151.
Sarigiannis, D.A. and G. Volta (1996). Ecological Vulnerability Analysis: Towards a New Paradigm for Industrial Development. Proceedings of Probabilistic Safety Assessment and Management '96, Crete, Greece, vol. 1, 478-484.
Stefanis, S.K., A.G. Livingston and E.N. Pistikopoulos (1996). Environmental Impact Considerations in the Optimal Design and Scheduling of Batch Processes. Computers Chem. Engng.
Thomaidis, T.V. (1995). Incorporation of Flexibility, Reliability, Availability, Maintenance and Safety in Process Operations and Design. Ph.D. Thesis, Imperial College.
Thomaidis, T.V. and E.N. Pistikopoulos (1995). Optimal Design of Reliable and Flexible Process Systems. IEEE Transactions on Reliability 44(2), 243-250.
UK Ecolabelling Board (1993). Criteria for Hairspray Ecolabels.
PROCESS SAFETY MANAGEMENT: PERFORMANCE SUPPORT AND TRAINING SYSTEMS

Carlo Fiorentini¹, Francesca De Vecchi¹, Elliott P. Lander², Carlos Vilagut Orta³

¹ TECSA S.p.A., Via Oratorio 7, I-20016 Pero (MI), Italy, e-mail: [email protected]
² ATR Applied Training Resources, 12337 Jones Road, Suite 350, Houston, Texas 77070, USA, e-mail: [email protected]
³ TECSA IBERICA S.A., Córcega, 50 entlo 1a, 08029 Barcelona, Spain, e-mail: [email protected]
ABSTRACT

The Authors start from the observations that major disasters continue to occur in the CPI, that similar accidents had already occurred a few years before, that safety is crucially dependent on management and management systems, and that the conventional on-the-job training approach is no longer applicable. These constraints, coupled with the new compliance requirements of major hazards legislation in the EU and in the USA, oblige management to identify new, advanced solutions.
KEYWORDS

Process Safety Management, Training, Risk Management, Major Hazards, Maintenance Management, Operator Training, Document Management.
1. INTRODUCTION
Many references document the number of incidents, disasters and heavy losses due to the lack of sound management procedures:
- The enquiry into the Piper Alpha disaster (L. Cullen, 1990) (1) states: (...) safety is crucially dependent on management and systems (...).
- The report of the CEC (G. Drogaris, 1993) (2) states: (...) The conclusions of several accident investigations or studies on safety-related issues often show that similar lessons had already been learned a few years before in similar accidents. This indicates that, although the knowledge needed to prevent accidents and/or to minimize their consequences is often available, there is: a) a lack of the proper safety culture to enable effective use of this knowledge; and b) a lack of a structured communication system to disseminate this knowledge (...).
Another expert (3) (I. Nimmo, 1996) underlines:
- in a short period the US CPI suffered 24 incidents: 12 deaths, hundreds hurt, $1B+ losses, $10B+ impact;
- Recorded accident causes:
  * Human Error 85% (people 10%)
  * Equipment 15% (people, research, design, construction, installation, operation, maintenance, etc.)
- Problems observed:
  * Inadequate precision of temporal information (e.g. lack of true alarm order)
  * Excessive nuisance alarms due to weak conditional alarming capabilities
  * Inadequate anticipation of process disturbances
  * Lack of real-time, root-cause analysis (symptom-based alarming)
  * Limited or time-consuming access to procedures or operating instructions
Coupled with this situation, other significant changes have occurred during the last twenty years, for example:
- industrial accidents are for the most part not new, and are the consequence of the fact that organizations do not learn from past experience;
- past experience is often lost with the retirement of experienced workers, managers and foremen;
- the attitude of new staff towards work has changed considerably, so the past approach of "training on the job" is no longer applicable with success.
2. GENERAL REQUIREMENTS FOR PROCESS SAFETY MANAGEMENT

2.1 At European level

• The major step towards a new comprehensive approach to Process Safety Management took place with the European Directive 82/501 (4), which has been implemented into the national legislation of the EU member states with minor differences due to the local safety legislation framework.
• A second step forward has been the position taken by the European Parliament (5).
• A third major step has been the adoption of the EU framework Directive 89/391 (6), along with many others (7). This framework Directive requires every employer to ensure the health and safety of their workers in every aspect relating to their work. The general safety requirement applies to all employers, no matter what the size or type of the business. The framework Directive lists the following general principles, which must be followed in ensuring health and safety:
a) if possible, avoiding a risk altogether;
b) evaluating risks which cannot be avoided;
c) combating risks at source;
d) wherever possible, adapting work to the individual;
e) taking advantage of technological and technical progress;
f) replacing dangerous things with less dangerous ones or ones which are not dangerous;
g) developing a coherent prevention policy;
h) giving priority to collective over individual measures;
i) ensuring workers understand what they have to do.
This new approach stresses the following. The risk assessment should:
* assume workers behave like real people; if they do not follow instructions, you should take this into account;
* include operations which are not routine, such as maintenance, as well as normal work;
* be systematic, for example working through types of risk or types of activity;
* identify everyone who may be affected;
* consider whether any particular group of workers or others are at special risk.
The employer must review and, if necessary, update its risk assessment if either there are reasons to suspect it is not accurate or complete, or there is a significant change in the matters covered by it.
2.2 At US level
a) At federal level. The major step forward has been the Federal Standard 29 CFR 1910.119, "Process Safety Management of Highly Hazardous Chemicals". The Clean Air Act amendments require the US Environmental Protection Agency (EPA) to promulgate rules on chemical accident prevention. This new rule, which focuses on the protection of the general public, requires the development and implementation of a Risk Management Program. This obligation has been fulfilled with the publication of the final Risk Management Program (RMP) rule, 40 CFR 68. The EPA rule builds on and expands upon the requirements of the US Occupational Safety and Health Administration (OSHA) Process Safety Management (PSM) rule, 29 CFR 1910.119.
b) State and private organizations. Other adopted or published recommended practices are: API, "Management of Process Hazards" (API Recommended Practice 750); State of California, "Guidance for the Preparation of a Risk Management and Prevention Program"; Los Angeles County, "Risk Management & Prevention Program Guidelines".
2.3 EU Compliance view of Process Safety Management

A further step forward has been taken with the EU Directive 94/C106/04 (COMAH), which encourages the establishment of sound Process Safety Management systems, etc. In this framework, a series of quality assurance recommendations and standards derived from ISO are being implemented with the purpose of better defining the whole matter. For a better understanding of this process, it seems of interest to include a comparison (Table 1, derived from a previous work (8)) with a cross-reference between the Italian Standardization Organization (UNI) Standard E02.07.536.0 and the Responsible Care policies.
2.4 US Compliance view of Process Safety Management

a) The US OSHA 29 CFR 1910.119. The Federal Standard 29 CFR 1910.119, Process Safety Management, consists of fourteen components. The regulation requires employers to manage hazards associated with processes involving materials identified as highly hazardous. This rule affects any facility that produces, uses, stores, transports or handles these materials in amounts equal to or greater than the specified threshold quantities. The regulation specifically requires that the facility develop a Process Safety Management program with the components listed below (9).
b) The EPA RMP rule 40 CFR 68. The federal standard Risk Management Program (RMP), 40 CFR 68, includes two more components of particular interest with respect to OSHA: Hazard Assessment and Risk Management Plan/Registration.
3. THE NEED FOR AN INTEGRATED APPROACH TO PROCESS SAFETY MANAGEMENT (PSAM APPROACH)

There are many driving forces in the process industry for reducing the cost of compliance with PSM regulations (EU Directives 82/501, 89/391, etc.; OSHA 1910.119; EPA RMP 40 CFR 68; etc.). ATR has used the PSAM approach for the development of an electronic performance support and interactive training and testing system (PRISM). The following is a brief list of the driving forces in the process industry towards the PSAM approach and the resulting benefits for the companies.
TABLE 1: CROSS-REFERENCE BETWEEN THE UNI STANDARD E02.07.536.0 AND THE RESPONSIBLE CARE POLICIES (sections 1a-1m: safety organisation/management, process risk management, training and employee participation, incident/near-miss analysis, communication, emergency equipment, personal protective equipment, common risk management, maintenance of general equipment, materials handling, housekeeping; sections 2a-2m: health organisation/management, health hazards, exposure control, risk exposure evaluation, health control, first aid, disease prevention, emergency management, training and employee participation, records, internal audit, quality assurance program; sections 3a-3g: environmental organisation/management, gaseous emissions, liquid effluents, wastes, noise, aesthetic impact, material transportation)
TABLE 2: THE DRIVING FORCES FOR AN INTEGRATED APPROACH ON RISK MANAGEMENT

Driving force: Increased regulations and compliance issues (OSHA 1910.119, EPA RMP 40 CFR 68) / ISO 9000 registration.
Proposed PSAM approach: The PSAM approach provides a way to consolidate required information into a single source, and keep that information up-to-date and easily accessible. It can use available information from a variety of sources and formats to decrease the time to reach compliance or registration.

Driving force: Increased competition & lower operating margins.
Proposed PSAM approach: The PSAM approach allows a company to consolidate its vital information into a logical, easy-to-access system. Anyone who needs this information can access it lightning-fast. Thus, people can make better, more informed decisions and more people can be involved in the critical path.

Driving force: Reduced resources and budgets.
Proposed PSAM approach: As budgets are trimmed, the number of employees is often reduced. The PSAM approach can help you take full advantage of employees, as its extensive, effective training capabilities can help produce personnel who work smarter; these people can tap into information quickly to fill in any gaps in knowledge as the work force is reduced.

Driving force: Increased liabilities.
Proposed PSAM approach: The PSAM approach allows critical safety information and procedures to be collected into one source, and made readily available to the workers who need it. They can react more quickly and stop upsets and abnormal situations early on. Employee injuries can be avoided or lessened, and downtime due to emergency situations can be reduced.

Driving force: Need for employee participation.
Proposed PSAM approach: The PSAM approach allows more people to be involved in the process, as anyone can aid in the development of information and training modules. Its multimedia capabilities permit anyone who can use a camera, video camera, or tape recorder to add vital information to the electronic system.

Driving force: Need to lessen overtime load.
Proposed PSAM approach: Using Computer-Based Training (CBT) capabilities, training time can be reduced; studies show that CBT is much more effective, yet takes far less time to complete, than traditional classroom instruction. Training can be done on any PC on-shift.

Driving force: Less instructional personnel.
Proposed PSAM approach: The system, which is PC-based, can incorporate self-paced training and testing modules; this eliminates the need for pulling people from their work to hold a class. All training materials can be created on-line with all required information. In fact, information can be enhanced with text, graphics, photos, video, animation, and sound combined.

Driving force: Increased automation complexity.
Proposed PSAM approach: The system's CBT capabilities can mean a higher retention rate of technical information. Relevant support information (illustrations, expanded explanations, definitions, video, animation, etc.) can be integrated into training. While working to improve skills, personnel always have fingertip access to necessary refresher information.

Driving force: Need to retain knowledge & experience from retiring personnel.
Proposed PSAM approach: The system, with multimedia capabilities, lets workers use video and sound to capture the experiences and knowledge of retiring personnel. This information can then be consolidated in the system, which acts as a central repository.

Driving force: Lower experience levels of workers.
Proposed PSAM approach: The system provides capabilities for efficient, effective training with related quizzes and tests to reinforce knowledge. Test results can be documented and tracked. Training materials can be modified for refresher training.
Following is the flow-sheet with the main functions of the Integrated Performance Support System developed using the PSAM approach.
FIGURE 1: MAIN FUNCTIONS OF THE INTEGRATED PERFORMANCE SUPPORT SYSTEM (linking process safety information, industry consensus standards, assessment, training, procedures and integrity management)
The system considers an open-architecture, object-oriented development system. It allows trained users to quickly develop powerful performance support and training modules for control room operators and other operating personnel. It can integrate "point-of-need" access to critical operation control and safety information management. This enables employees to do their job more efficiently and safely.

Open architecture

This approach provides an easy-to-learn-and-use graphical interface allowing users to logically define information types for quick retrieval. Through user-defined menus and targeted graphical displays the user can rapidly call up: process safety information, material safety data sheets, process overviews, plant photos,
and other types of information, e.g. graphics, AutoCAD diagrams, Intergraph and word-processing documents and graphics, digitized video, sounds, animation and links to databases. Users do not have to re-create currently available documents; they can use existing formats quickly and effectively. The PSAM approach provides Dynamic Data Exchange (DDE), Object Linking and Embedding (OLE) and dynamic link library (DLL) support for full connectivity to other software and access to network databases. The system provides on-line help and decision support, including pop-up procedures when connected to DCS systems, and serves as an infrastructure to support future dynamic process simulation and computer-based training models.

Object-oriented information management

This approach:
- provides a new and better way to develop, incorporate, retrieve and re-use the enormous varieties and amounts of site-specific, process-related information;
- uses a simple way of assigning information and data types to classes (controllers, indicators, valves, pumps, exchangers, sections, vessels, personnel, process areas, etc.).
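The class-assignment idea can be sketched as follows; the class names, areas and documents below are invented for illustration and do not reflect the actual product's data model.

```python
# Sketch of class-based tagging: each document is assigned an equipment class
# and a plant area so it can be retrieved at the point of need.
from dataclasses import dataclass

@dataclass
class InfoObject:
    doc_id: str
    equipment_class: str   # e.g. "pump", "valve", "controller"
    area: str
    title: str

store = [
    InfoObject("P-101-SOP", "pump", "unit-100", "Start-up procedure, feed pump"),
    InfoObject("V-204-MSDS", "valve", "unit-200", "MSDS, chlorine isolation valve"),
    InfoObject("C-301-TRN", "controller", "unit-300", "Training module, TIC-301"),
]

def find(equipment_class=None, area=None):
    """Return all objects matching the given class and/or area filters."""
    return [o for o in store
            if (equipment_class is None or o.equipment_class == equipment_class)
            and (area is None or o.area == area)]

for o in find(equipment_class="pump"):
    print(o.doc_id, "-", o.title)
```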
What does the company world look like?

The system allows one to look at the entire corporation, perhaps just one region, or, most likely, a single site with various manufacturing areas within it. If you are mainly interested in training in one area of the site, consider what other information may be needed in that area. If you are working in only one area, are there other areas of the plant that might also want access to the material you develop? The suggested approach allows a "wide-angle" view to be used when determining how to define the world of the system to be developed.

Who will develop the site's system?

A long-term advantage is that, using the suggested approach, the development of the system enables a large cross-section of the company organization to work together. A designated engineer or integrator, working to company standards, should be assigned. He/she would be responsible for integrating the various components into logical subject areas, developing menus, the organizational structure, and overall links to information.

Who will access the system?

Access can be made available to anyone chosen at the company site. This would include employees such as operators, technicians, engineers and secretaries. Since users must be registered and assigned a password and access level, access to various parts of the system will depend upon their expected use.

End user

End users can access the informational and training subject areas, viewing and learning about vital process safety information, emergency procedures and general process control information. Users can be trained at their own pace, even on-shift. Performance and qualification tests can be integrated into the lessons, or a test can be linked to a Management of Change (MOC) document to ensure that people understand it. The system's Post Office feature allows administrators to send messages to end users about MOCs or other important issues; relevant files can be attached to messages, so a message about an MOC could have the actual MOC and a related quiz sent with it. Administrators can then monitor who read their messages, who opened the associated files, who completed related quizzes, etc. Tests and learning modules can be assigned, scheduled, and tracked within the system environment. Results can be tracked, with hard copies provided through the Report feature.
An alarm manager can be integrated with the DCS, with specific alarms being linked to relevant information. When the alarm is generated, the linked information is displayed. It basically provides on-line, situation-specific help for users.
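A sketch of such an alarm-to-information link is shown below; the alarm tags and linked document names are invented, and a real integration would go through the DCS interface rather than a Python dictionary.

```python
# Sketch of the alarm-to-information link: when the DCS raises an alarm tag,
# the documents linked to that tag are displayed at the operator's console.
alarm_links = {
    "TI-301-HI": ["reactor-overtemp-procedure", "pyrolysis-hazard-note"],
    "FI-102-LO": ["feed-ratio-check", "Cl2-accumulation-warning"],
}

def on_alarm(tag):
    for doc in alarm_links.get(tag, ["no linked information"]):
        print(f"[{tag}] display: {doc}")

on_alarm("TI-301-HI")
```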
4. CONCLUSIONS

Abnormal plant operations cost the US process industry over $20 billion. About five major losses occur each year, but much of the cost is due to numerous minor upsets that mean lost time, equipment replacement, lessened quality, pollution and so on. Each one also has the potential of becoming a major catastrophe. According to studies of process upsets, almost half are caused by people-related factors such as insufficient procedures, incorrect actions and inadequate work practices; equipment- and process-related causes rank second and third, respectively. In follow-up incident reports, two recommendations emerge: companies should establish and distribute new procedures, and conduct better training. The spirit of the EU Directives and the OSHA and EPA regulations is to make companies arm workers with the latest, correct information, so they can quickly intervene in a potentially serious upset and prevent it from turning catastrophic. These regulations not only require that employees or their representatives have access to information regarding hazardous materials in a given facility, they also require that processing facilities develop and implement a training program and maintain applicable training records. The approach presented was designed to address the information and training needs of companies that must comply with these new regulations, through a quick and efficient implementation of a computerized system meeting all the requirements, including: Process Hazard Analysis (PHA); normal (SOP), abnormal and emergency procedures; Material Safety Data Sheets (MSDS); Management of Change (MOC); and Hazardous Operations. These documents must be up-to-date, ready, and easily accessible. The process safety management performance support and training system presented provides an integrated, easily-accessible, point-of-need source of the information workers need in order to respond quickly and correctly. Armed with the right information and reacting quickly, life-threatening situations, spills or leaks can be avoided or lessened. The downtime needed to repair the damage caused by accidents can be minimized, as workers respond more rapidly to process upsets, with a corresponding increase in the safety level.
BIBLIOGRAPHY

(1) Cullen, Lord (1990), The Enquiry into the Piper Alpha Disaster, HMSO.
(2) Drogaris, G. (1993), Lessons Learned from Accidents Notified, EU - JRC - ISPRA - VA - Italy.
(3) Nimmo, I. (1996), "ASM" Abnormal Situation Management, PRISM '96 Users' Conf., Houston, TX.
(4) EEC Directive on the major accident hazards of certain industrial activities, adopted in 1982. (...) The essential requirement under the CEC proposal remains the Safety Report for the larger establishments covered by the proposal, submitted to the Competent Authorities; this Safety Report will have to include a completely new section on Management Systems and Organization of the establishment.
(5) European Parliament, "Industrial Hazards: the Seveso Directive", Project Paper N. 1, Luxembourg, 29 April 1993 (Doc. PE 171.126).
(6) Council Directive of 12 June 1989 on the introduction of measures to encourage improvements in the safety and health of workers at work.
(7) Council Directives: 89/654; 89/655; 89/656; 90/269; 90/270; 90/334; 90/679 and 91/383.
(8) Ielasi, R. (1995), "Ambiente, Qualità, Sicurezza: Un approccio integrato" [Environment, Quality, Safety: An Integrated Approach], 3ASI, Milano.
(9) Courtesy of Atallah S., 1995.
AN INVESTIGATION OF THE CONSEQUENCES FROM POSSIBLE INDUSTRIAL ACCIDENTS IN NORTHERN GREECE

Ioannis C. Ziomas¹, A. Poupkou¹ and G. Mouzakis²

¹ Laboratory of Atmospheric Physics, Aristotle University of Thessaloniki, 54006 Thessaloniki, Greece
² Ministry of Environment, Land Use and Public Works, Athens, Greece.
ABSTRACT
Following the SEVESO Directive, safety reports have been produced for the major industries in Greece. The reports are being evaluated by the Competent Authorities and the appropriate emergency plans are being constructed. In the present work, quantitative and qualitative results are presented from the analysis of the possible accidents at 17 industrial installations in northern Greece. The possible accidents have been grouped, and the analysis includes the estimated consequences and the zones of influence for different proposed threat limit values. A brief description of the methodology followed for the evaluation of the safety reports is also presented and discussed.
KEYWORDS
Pool fire, Fireball, Flash fire, UVCE, Dispersion
INTRODUCTION

Since 1988 the Greek legislation has been harmonised with the Community Directive 82/501/1982, also known as the "Seveso Directive" (Ziomas et al., 1994). At present, the Public Organisations involved in the application of the national legislation concerning major accidents are:
1. The Ministry of Industry, Research and Technology (MIRT).
2. The Ministry of Environment, Land Use and Public Works (MELUPW).
3. The Ministry of Labour (ML).
4. The Ministry of Public Order (MPO), to which the Fire Corps belongs.
5. The Ministry of Health and Providence (MHP).
6. The Organisations of Local Governments (OLG).

The on-site emergency plan of an installation describes all the procedures and actions to be taken in case of emergency. The aim of the on-site emergency plan is to increase the safety of the installation and to reduce the consequences of any kind due to possible accidents. Depending on the severity of the accident, on the area affected and, of course, on the number of people in danger, there are off-site plans of different levels: a) Off-site Emergency Plans at the National Level and b) Off-site Emergency Plans at the Prefecture Level. In general, the Ministry of Environment, Land Use and Public Works keeps a file of the dangerous chemical substances, draws up a catalogue of experts, collects the various reports concerning an accident, gives technical advice and performs pollution measurements. In the framework of the Safety Reports evaluation procedure, the Laboratory of Atmospheric Physics, Aristotle University of Thessaloniki, under a contract with the Ministry of Environment, Land Use and Public Works, performed an extended analysis of the possible accidents at 17 industrial installations located in Northern Greece. These installations include LPG and liquid fuel storage, refineries and chemical industries. In the present work quantitative and qualitative results from the analysis of the possible accidents are presented. The analysis includes the estimated consequences and the zones of influence for different proposed threat limit values.
METHODOLOGY
Categorization of accidents and models used

The possible accidents can be grouped in three major categories, based on their consequences.

Accidents where flammable and/or explosive substances are involved, and the consequences are therefore due to thermal heat flux and blast effects

The phenomena examined for these accidents, depending on the substances involved and the storage conditions, include: 1) pool fire, 2) fireball, 3) jet flame, 4) flash fire and 5) unconfined vapor cloud explosion (UVCE). The methodology followed for the analysis of this type of accident is mainly based on the research performed by The Netherlands Organisation of Applied Scientific Research (TNO, 1989 and 1992). Since different models may be used for the same type of accident, a selection was made after several discussions with the competent authorities and other scientists involved in the evaluation procedure of the safety reports of the various industrial installations in Greece. More specifically: 1) for the pool fires the "source" model was used, which assumes that the fire has a cylindrical shape; 2) for the fireballs the "point" model was selected; 3) for the flash fires the DEGADIS model (Spicer et al., 1989) was used for the calculation of the dispersion of the vapor cloud up to the distance at which the concentration of the substance reached the lower flammability limit; 4) finally, for the UVCE the same procedure as for the flash fire was followed and, in case the concentration of the vapor cloud was above the lower flammability limit within the installation, the ignition was assumed to occur at the nearest road.

Accidents where toxic substances are released
Depending on the substances involved in the accidental release and on their storage conditions, two types of models were used: 1) Gaussian models, in case the released plume was neutral or buoyant, and 2) the DEGADIS model, when the released cloud was initially heavier than air.
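For the neutral/buoyant case, a minimal Gaussian plume sketch is shown below; the power-law dispersion coefficients are rough textbook (Briggs rural) values for stability class D and are not necessarily those used in the study, and the release parameters are invented.

```python
# Ground-level Gaussian plume concentration for a continuous elevated release,
# with full ground reflection.
import math

def ground_conc(Q, u, x, y, H):
    """Concentration [kg/m3] at downwind distance x, crosswind offset y [m]."""
    sig_y = 0.08 * x / math.sqrt(1.0 + 0.0001 * x)   # Briggs rural, class D
    sig_z = 0.06 * x / math.sqrt(1.0 + 0.0015 * x)
    return (Q / (math.pi * u * sig_y * sig_z)
            * math.exp(-y**2 / (2 * sig_y**2))
            * math.exp(-H**2 / (2 * sig_z**2)))

# Hypothetical 1 kg/s release at 10 m height, 5 m/s wind (the study's D5 case)
for x in (500.0, 1000.0, 2000.0):
    c = ground_conc(1.0, 5.0, x, 0.0, 10.0)
    print(f"x = {x:6.0f} m   C = {c * 1e6:8.1f} mg/m3")
```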
Evaluation of the consequences

The Greek competent authorities have not officially adopted any limits concerning the evaluation of consequences from the various accidents. Thus the complete evaluation was not possible and only preliminary conclusions can be drawn. The assumptions applied for the calculation of the thermal and toxic doses were as follows:
Thermal dose

For non-instantaneous accidents (e.g. pool fire), the thermal heat flux was calculated first, and then the thermal dose for two cases: 1) the observer, 5 s after the beginning of the accident, moves away from the source of heat with a speed of 4 m s⁻¹, up to a distance at which the thermal flux is 1 kW m⁻²; 2) the observer is in a vertical position, standing at the same location for 60 s. For accidents with a duration of a few seconds (e.g. fireball), only the case of the standing observer was considered and the thermal dose was calculated for a time equal to the total duration of the phenomenon.
Toxic dose

For accidents that involve toxic substances, the spatial distribution of the substance concentrations was calculated first, and then the toxic dose for a standing observer exposed to the toxic plume for 30 min. When the toxic substances examined were sulfur dioxide, nitrogen oxides or smoke, the European Union ambient air concentration limits were used.
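Both dose definitions reduce to simple integrals, as the following sketch shows. The inverse-square heat-flux field and all numerical values are invented; only the general recipe (thermal dose as the time integral of q^(4/3), toxic dose as Cⁿ·t with n = 2 for ammonia, as in Table 9) follows the text.

```python
# Thermal dose for standing and moving observers, and a toxic dose, computed
# with invented numbers for illustration.
def flux(r, q100=20.0):
    """Heat flux [kW/m2], hypothetical inverse-square decay (20 kW/m2 at 100 m)."""
    return q100 * (100.0 / r) ** 2

# 1) standing observer at 150 m for 60 s
q = flux(150.0)
dose_standing = (q ** (4.0 / 3.0)) * 60.0

# 2) observer starts moving away at 4 m/s after 5 s, until the flux drops to 1 kW/m2
dt, t, r, dose_moving = 0.1, 0.0, 150.0, 0.0
while flux(r) > 1.0:
    dose_moving += flux(r) ** (4.0 / 3.0) * dt
    t += dt
    if t > 5.0:
        r += 4.0 * dt

print(f"thermal dose: standing {dose_standing:.0f}, moving {dose_moving:.0f} (kW/m2)^(4/3) s")

# toxic dose for a 30-min exposure at concentration C [mg/m3], n = 2 for ammonia
C, n = 1450.0, 2
print(f"toxic dose: {C ** n * 30.0:.2e} (mg/m3)^n min")
```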
Blast effects

As far as the blast effects were concerned, only the overpressure due to the explosion was calculated at various distances.
RESULTS

Two meteorological conditions were considered: 1) atmospheric stability class D with wind speed 5 m s⁻¹ and 2) atmospheric stability class F with wind speed 2 m s⁻¹. For both cases the ambient temperature was assumed to be 30°C, the relative humidity 50% and the inversion height 300 m. If the maximum duration of the accident was more than 1 hour, the duration of the calculations was limited to 1 hour. Four different typical LPG storage tanks (LPG stored at ambient temperature) were examined, namely: LPG1 (tank volume about 35 m³), LPG2 (tank volume about 200 m³), LPG3 (tank volume about 500 m³) and LPG4 (tank volume about 2500 m³, with a 1600 m² bund). Concerning the liquid fuel storage facilities, a typical floating roof tank (LFST) containing about 60000 m³ of crude oil was examined. Finally, two typical ammonia storage tanks were examined, one cryogenic (NH3 (1): 15000 tn of ammonia stored at -33.4°C) and one pressurized (NH3 (2): 1800 tn of ammonia stored at 0°C). The results from the analysis of the most significant possible accidents (catastrophic failure) are summarised in Tables 1 to 9. More specifically, the results presented in each table are as follows. In Tables 1-3 the heat flux calculations and the corresponding thermal doses from pool fires are shown. It appears that, if a spreading pool is formed (tanks LPG1, LPG2 and LPG3), the heat flux and the thermal dose at various distances depend on the amount of the liquid stored. It is noted that the consequences from LPG storage tanks are less significant when a bund exists (tank LPG4). This is due to the fact that the bund limits the burning surface, while the LPG released on the ground in the absence of a bund forms a spreading pool with a considerably larger surface.
In Tables 4 and 5 the heat flux calculations and the corresponding thermal doses from fireballs are shown. It appears that the consequences depend on the mass of LPG which forms the fireball. Blank entries in Tables 4 and 5 occur because the corresponding distances are within the fireball. In Table 6 the maximum hazardous distances for a flash fire are shown. The shape of the dispersed plume is usually ellipsoidal, and the maximum distance corresponds to the largest axis of the ellipse. Since wind direction is not involved in our calculations, no estimates are given for the small axis. In Table 7 the distances to specified overpressures are shown. It is noted that the bund in the case of tank LPG4 results in a smaller evaporation rate and thus less significant consequences. Concerning the tanks without a bund (LPG1, LPG2 and LPG3), the calculated overpressures depend on the mass content of the tank. In Tables 8 and 9 the concentration of ammonia and the corresponding toxic dose at various distances are shown. As expected, the consequences are far more significant in the case of accidental releases from the pressurized tank, since a heavier-than-air plume is formed.
TABLE 1: HEAT FLUX AT VARIOUS DISTANCES FROM A POOL FIRE

Installation   Heat flux (kW m⁻²) at various distances (m)
               100     200     300     400     500
LPG1           5.6     1.5     -       -       -
LPG2           18.1    6.3     2.9     1.7     1.1
LPG3           22.7    8.7     4.2     2.4     1.5
LPG4           7.7     2.2     1.0     -       -
LFST           14.1    4.5     2.1     1.2     -
TABLE 2: THERMAL DOSE AT VARIOUS DISTANCES FROM A POOL FIRE (STANDING OBSERVER)

Installation   Thermal dose ((kW m⁻²)⁴ᐟ³ s) at various distances (m)
               100     200     300     400     500
LPG1           592     107     -       -       -
LPG2           2845    702     258     122     67
LPG3           3871    1072    411     197     109
LPG4           921     177     60      -       -
LFST           2051    444     157     73      -
TABLE 3: THERMAL DOSE AT VARIOUS DISTANCES FROM A POOL FIRE (MOVING OBSERVER)

Installation   Thermal dose ((kW m⁻²)⁴ᐟ³ s) at various distances (m)
               100     200     300     400     500
LPG1           176     26      -       -       -
LPG2           1148    356     139     53      11
LPG3           1705    593     257     120     52
LPG4           296     59      5       -       -
LFST           752     202     66      14      -
TABLE 4: HEAT FLUX AT VARIOUS DISTANCES FROM A FIREBALL

Installation   Heat flux (kW m⁻²) at various distances (m)
               100     200     300     400     500
LPG1           75.2    23.8    11.1    6.4     4.1
LPG2           154.9   71.6    36.4    21.6    14.2
LPG3           -       85.8    50.1    30.2    20.0
LPG4           -       -       177     125     89
TABLE 5: THERMAL DOSE AT VARIOUS DISTANCES FROM A FIREBALL

Installation   Thermal dose ((kW m⁻²)⁴ᐟ³ s) at various distances (m)
               100     200     300     400     500
LPG1           2883    622     226     107     60
LPG2           11810   4223    1718    855     488
LPG3           -       7257    3750    1803    1062
LPG4           -       -       28774   17812   11432
TABLE 6: HAZARDOUS DISTANCES FOR A FLASH FIRE

Installation   Distance (m)
               D5      F2
LPG1           132     260
LPG2           556     2176
LPG3           857     3700
LPG4           262     536
TABLE 7: DISTANCE TO SPECIFIED OVERPRESSURES FROM AN UVCE

               Distance (m) to specified overpressures
               10000 Pa      20000 Pa      30000 Pa      50000 Pa      70000 Pa
Installation   D5    F2      D5    F2      D5    F2      D5    F2      D5    F2
LPG1           283   485     190   330     158   278     133   237     123   219
LPG2           580   725     377   450     310   358     255   285     233   254
LPG3           708   907     429   528     336   403     262   302     230   258
LPG4           430   715     280   471     231   390     190   325     174   297
TABLE 8: CONCENTRATION AT VARIOUS DISTANCES FOR TOXIC SUBSTANCE RELEASES

            Concentration (mg m⁻³) at various distances (m)
            500               1000              2000              3000
Substance   D5      F2        D5      F2        D5      F2        D5      F2
NH3 (1)     60      500       175     60        -       -         -       -
NH3 (2)     26700   1.5×10⁵   6300    11200     1450    788       581     -
TABLE 9: TOXIC DOSE AT VARIOUS DISTANCES FROM TOXIC SUBSTANCE RELEASES

                Toxic dose ((mg m⁻³)ⁿ min) at various distances (m)
                500                 1000                2000                3000
Substance   n   D5       F2         D5       F2         D5       F2         D5        F2
NH3 (1)     2   10⁵      6×10⁶      6×10⁵    5×10⁴      -        -          -         -
NH3 (2)     2   -        1.3×10¹¹   1.7×10⁹  -          1.3×10⁸  10⁷        1.4×10⁶   -
CONCLUSIONS
It should be noted that all the results presented are for worst cases, and would be associated with a low frequency of occurrence; hence the risks may be acceptable if risk-based criteria are used. Since, as already mentioned above, the Greek competent authorities have not officially adopted any limits concerning the evaluation of consequences from the various accidents, it is not possible to justify the exact extent of the dangerous zones around the sources of possible accidents. Due to that fact, the authors do not intend to draw their own conclusions, and thus only rough estimates are presented in the following paragraphs. The reader may refer to the international literature (e.g. TNO, 1989) for relating the results of the present work to fatality/injury criteria. For the LPG storage facilities it appears that the possible consequences are:
• for a pool fire, the hazardous effects are limited within the area of the installation;
• for a fireball, the hazardous effects extend outside the installation;
• for a flash fire, the dangerous area extends outside the installation;
• for a UVCE, the dangerous area extends outside the installation.
It is noted that the consequences from LPG storage tanks are less significant when a bund exists (e.g. LPG4 in Tables 1-3). This is due to the fact that the bund limits the burning or evaporating surface. On the contrary, the LPG released on the ground in the absence of a bund forms a spreading pool with a considerably larger surface, and thus a larger burning area or a larger evaporation rate. For the ammonia storage facilities it appears that the possible consequences extend far from the installation. As expected, the consequences are more significant in the case of accidental releases from the pressurized tank, since a heavier-than-air plume is formed.
REFERENCES

Spicer, T. and J. Havens, 1989, "User's Guide for the DEGADIS 2.1 Dense Gas Dispersion Model", Cincinnati: U.S. EPA, EPA-450/4-89-019.
TNO, 1989, "Methods for the Determination of Possible Damage to People and Objects Resulting from Releases of Hazardous Materials (Green Book)", The Netherlands Organisation of Applied Scientific Research.
TNO, 1992, "Methods for the Calculation of Physical Effects (Yellow Book, 2nd edition)", The Netherlands Organisation of Applied Scientific Research.
Ziomas, I. C., Tzoumaka, P. N., Fiorentini, C., Romano, A., Locatelli, M., 1994, "Lessons Learned from Emergencies after Accidents in Greece and Italy Involving Dangerous Chemical Substances", Office for Official Publications of the EC, ISBN 92-826-8222-6.
B 1O" Industrial Safety
USING MODERN DATABASE CONCEPTS TO FACILITATE EXCHANGE OF INFORMATION ON MAJOR ACCIDENTS IN THE EUROPEAN UNION

C. Kirchsteiger
European Commission, Joint Research Centre, Major Accident Hazards Bureau, TP 670, 21020 Ispra (Va), Italy
e-mail: [email protected]; Tel.: +39 332 78 9391; Fax: +39 332 78 9007
ABSTRACT
This paper describes the background, functioning and status of the new version of the Major Accident Reporting System, MARS 3.0, which is dedicated to collecting, in a consistent way, data on major accidents involving dangerous substances from the Member States of the European Union, to analysing and statistically processing them, and to creating subsets of all non-confidential data and analysis results for export to all participating Member States. This information exchange and analysis tool is made up of two connected parts: one for each local unit (i.e., for the Competent Authority of each participating Member State), and one central part for the European Commission. The local as well as the central parts of this information network can serve both as data logging systems and, at different levels of complexity, as data analysis tools. The analysis of events notified to MARS 3.0 is based on two alternative types of approaches: Boolean searches of coded categories of event characteristics (for the local and central units) and hypertext-search based pattern analysis of free-text descriptions of event characteristics (for the central unit only). This second option allows the accident causes to be identified and consistently analysed in a much more efficient and objective way, and the succession of the disruptive factors leading to the accident to be identified.
KEYWORDS
risk management, industrial risk, risk regulation, database, accident database, major accidents, industrial accidents, Seveso Directive
IMPLEMENTATION AND TASKS OF THE MAJOR ACCIDENT REPORTING SYSTEM IN THE EUROPEAN UNION
Several initiatives have been taken in many countries around the world to provide industry, governmental and research institutions with high quality information on industrial accidents as a means of accident prevention. Databases containing information on industrial accident events consist of specific sets of data variables describing the causes, circumstances, evolution and consequences of past accident events, and are designed to meet several objectives, often complementing each other:
• to assist in the formulation of new regulations and safety management principles,
• to cope with "what-if" requests from the public, industrial management boards and governmental institutions,
• to assist in risk analysis, as a
  • qualitative tool, identifying relevant accident scenarios,
  • quantitative tool, producing estimates of event frequencies.
Council Directive 82/501/EEC on the Control of Major-Accident Hazards Involving Dangerous Substances ("Seveso I"), European Council (1982), requires the Competent Authorities (CAs) of the (currently) 15 Member States of the European Union to notify to the European Commission all (non-nuclear, non-military, non-mining, non-transport related) major accidents involving dangerous substances which occurred in their respective countries. For this purpose, the Commission set up in 1984 an industrial accident notification scheme, the Major Accident Reporting System (MARS), operated and maintained by the Major Accident Hazards Bureau (MAHB) of the Commission's Joint Research Centre in Ispra, Italy. Besides the Member States, other European countries outside the European Union, such as Norway or some of the Eastern European countries, are already participating in MARS on a voluntary basis or have indicated their interest in doing so.

The accidents reported are collected in a register and information system (the "MARS database") and analysed in order to:
• classify the accidents according to various parameters, in particular the type of industrial activity, the substances involved, the consequences, and the causative factors,
• extract the lessons learned from the accidents, to prevent the recurrence of similar accidents and to mitigate the consequences of accidents which do occur.

The system currently holds information on about 280 accidents and incidents, submitted on a hardcopy (paper) basis, in either English or one of the other 10 official languages, to the Commission in confidence for the use of the Commission and the National Authorities of the Member States. MAHB staff translate the submitted forms into English (the working language of MARS) if necessary, check the consistency of the information, confirm any unclear, missing or inconsistent points with the national CA concerned, and input it into the computerised MARS database system. The number of events reported so far is - fortunately - not very large, but what makes this database unusual among industrial accident databases is its level of detail, which is usually sufficient to establish the detailed causes of an accident, both the intermediate causes and the underlying ones.

Part of the information submitted to MARS is considered confidential if it calls into question, for example, the confidentiality of the deliberations of the CAs and the Commission, public security, personal data and/or files, commercial and industrial secrets, etc. However, without violating these aspects, access to the register and information system has always been open to government departments of the Member States, industry, non-governmental organisations and research organisations working in this field.
STRUCTURE AND USAGE OF INFORMATION INCLUDED IN MARS

Information on the major accidents to be notified to MARS consists of both character and numeric types of data, in free-text as well as selection-list formats, on the events and circumstances leading to the major accident, descriptions of the evolution of the accident, the consequences (impact on humans, material loss, ecological harm, ...), the emergency responses and the lessons learned. To ensure basic consistency in the understanding of the characteristics of an accident event to be notified, the CAs in the Member States have to put these data into a special agreed-upon format, the so-called "Accident Notification Form". As a result of an iterative process, two such MARS reporting forms have been established: the "short report" is intended for immediate notification of an accident, and the "full report" is prepared when the
accident has been fully investigated, and the causes, the evolution of the accident and the consequences are fully understood. In certain cases further information comes to light - for example in the course of judicial proceedings - and there is provision for the "full report" information to be further modified. The "short report" gives essential information concerning the accident, in a selection-list and associated free-text format, under the following 7 headings (data variables):
• accident types,
• substances directly involved,
• immediate sources of accident,
• suspected causes,
• immediate effects,
• emergency measures taken,
• immediate lessons learned.
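To make the structure of such a notification concrete, the sketch below models the seven short-report headings as a typed record. This is an illustrative reconstruction in Python, not the actual MARS implementation; all class and field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ShortReport:
    """Illustrative model of the 7 MARS 'short report' headings (hypothetical names)."""
    accident_types: List[str] = field(default_factory=list)
    substances_involved: List[str] = field(default_factory=list)
    immediate_sources: List[str] = field(default_factory=list)
    suspected_causes: List[str] = field(default_factory=list)
    immediate_effects: List[str] = field(default_factory=list)
    emergency_measures: List[str] = field(default_factory=list)
    immediate_lessons_learned: str = ""

report = ShortReport(
    accident_types=["release"],
    substances_involved=["ammonia"],
    suspected_causes=["component failure"],
)
```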
The "full report" is much more analytic, and involves more work in its preparation. To provide the possibility of using as many original information as possible (and to validate the considerably subjective assignment of codes from selection lists), there are always sufficient free-text fields available in this form to describe facts connected with an accident. Using the information included in "full reports", the MARS database can be searched under almost 200 different headings (data variables), e.g.: type of accident (~25 coded subtypes for the 5 "short report" types) in terms of." • major occurrence, • initiating event, • associated event, dangerous substances (>200 named substances from Seveso I / II or CAS number, uses of substances in plant, estimates of quantities): • total establishment inventory, • relevant inventory (in)directly involved, source of accident: • type of industry where accident occurred (~20 coded subtypes), • type of activity being carried out (~15 coded subtypes), • systems / components directly involved (~ 15 coded subtypes), causative factors (immediate and underlying: ~35 coded subtypes), number of people affected (fatalities, injuries, people at risk), ecological harm (~20 coded subtypes), national heritage loss, material loss, disruption of community life, emergency measures taken / still required / continuous contamination or danger (~30 coded subtypes), accident response: • pre- / post-accident evaluation, • evaluation of safety organisation, • evaluation of ecological impact control, official actions taken, lessons-learned,
Using this information, regular summaries of the accidents notified are prepared for the Committee of "Seveso Directive" CAs, as well as occasional specific studies of lessons learned from accidents, both for the Committee and - with identifying details removed - for the general public. Since the software structure of the previous database system, MARS 2.0, did not allow hypertext retrieval besides Boolean searches of coded database categories, lessons-learned types of data analyses to identify overall patterns in the accident
data could be performed on a largely manual basis only. Along with corresponding studies and presentations at the periodic meetings of the Committee of CAs, several open publications on lessons-learned types of analyses of the non-confidential part of the data have appeared in recent years, see e.g. Drogaris (1993), Rasmussen (1996), Papadakis et al. (1996). In addition, specific requests are continuously received by MAHB from external institutions to perform topical analyses on non-confidential MARS data (e.g., related to the circumstances of accidents in fuel storages and underground storages, oil pollution accidents and incidents, vapour cloud explosions, etc.).
NEW INFORMATION EXCHANGE SYSTEM (MARS 3.0)

The accidents included in MARS are so-called "major accidents", as defined in general terms by the original text of the Directive, European Council (1982), and in a later amendment, European Commission (1988), without quantitative threshold criteria on, for example, the event consequences. Therefore, since until recently (i.e., on the basis of "Seveso I") only general guidance has been available as to what constitutes a "major accident", it has to be assumed that different interpretations of this term have led to slightly different practices in notifying accidents. Yet, however fuzzy the concept of a "major industrial accident" may have been in the past, there has always been a general understanding that all these events have as a basic common feature at least the potential to affect many people.

The Commission has recently published a new Directive, "Seveso II", European Council (1997), replacing and strengthening the original Directive. Significant MARS-related changes are implied by "Seveso II", in particular concerning the criteria for the notification of an accident to the Commission, information systems and exchanges, and the confidentiality of the information submitted. Besides giving a clear and unequivocal definition of what constitutes a "major accident" - resulting in an overall lowering of the threshold criteria for the notification of an accident to the Commission and thus in an increased number of events to be reported - the new Directive calls for a more open approach to the supply of information to the public, both from the Member States and from the Commission, supported by a precise definition of which type of information has to be kept confidential from interested parties other than the Committee of "Seveso Directive" CAs (see the first Section above). The number and contents of the MARS data variables as summarised in the "Accident Notification Form" are, however, not subject to changes.

In the light of the new requirements and the presumably significant increase in event notifications, the software structure of MARS had to be completely changed. In its Version 3.0 it consists of a distributed self-standing data logging system and analysis tool running on an MS-Windows platform ("local DOS-MARS"), supported by a centralised DOS and UNIX data management system ("central DOS-MARS", "central UNIX-MARS"), which reaches the required efficiency with the help of a UNIX-based relational database management system. This concept ensures the management of large and complex data sets, consisting of data of several different object classes.

Due to the significantly larger number of events to be included in MARS 3.0, a manual evaluation of the free-text elements of accident case histories to formulate lessons-learned types of results is no longer possible. Database evaluation based on Boolean analysis of event codes only, however, bears the risk of dependence on often subjective code assignments, which - especially in the case of identification of the underlying accident causes - can never be completely eliminated. Therefore, to assure a high rate of automatic capture of relevant information, a new method based on the indexing of relevant free-text elements had to be developed for the central UNIX-based part of MARS 3.0. Each "non-trivial" element in each free-text field of each MARS event is inserted into a general MARS thesaurus, from which specific user-defined sub-thesauri can be created.
A sub-thesaurus can be topic-specific, e.g. related to human errors,
and consists of strings of expressions connected by their similarity. On this basis, complex queries, such as user-defined cluster and pattern analyses, are possible.

Another important element in the functioning of MARS 3.0 is the new way of exchanging information between the local units and the central database. By using their local MARS 3.0 unit, the CAs of the Member States can create, under a user-friendly Windows-guided environment, their accident data files in ASCII format by writing accident descriptions in the English language and assigning codes (see the previous Section). Before saving such a file, all data are automatically checked by the system for consistency and completeness of information. Consistency is assured by a variety of logical tests across the data variables; completeness is assured by defining all short-report and a large number of full-report data variables as obligatory (blank input in such fields prevents saving of files). Accident event files are then sent by the CAs on 3.5" diskettes or via e-mail to MAHB, which reads them into its central DOS part of MARS 3.0 for quality checking and, if necessary, for further editing in accordance with the respective CA. Next, the agreed-upon "final version" of an event data file is exported to the central UNIX part and included in the total population of MARS events.

Both the central and local units include options to perform statistical evaluations and to generate corresponding reports, which can further be edited by using standard word-processor tools in an MS-Windows environment. For more detailed analyses, extracting lessons learned from the accidents reported, MAHB can perform cluster and pattern analyses on the entire data set on its UNIX part of MARS 3.0. To support the building of local MARS databases in each Member State, MAHB makes periodic copies of data subsets from its central UNIX part and distributes them to each participating CA. Each such subset consists of all CA-specific data as well as all non-confidential data of all other CAs. Further, MARS data analysis results will periodically be distributed to the CAs in an electronic format for further processing. The basic functioning of this information exchange system is depicted in Figure 1; a minimal sketch of the free-text indexing idea follows the figure.
Figure 1: Basic Structure of the New Information Exchange System MARS 3.0
STATUS AND OUTLOOK ON THE FUTURE USE OF MARS 3.0

From May to October 1996, the detailed software specification and design of MARS 3.0 were defined and discussed with various international bodies, including all CAs, resulting in the network type of database structure outlined above (DOS part for the CAs, DOS/UNIX part for MAHB). After the actual software development and testing were finished in early 1997, the entire database contents of MARS 2.0 were transferred to the new system. Following that, a workshop on the practical use of MARS 3.0 was organised by MAHB in Ispra for those CAs with an active short-term interest in using the new system for the purposes of exchanging and discussing information, experiences and analysis results on major accidents with the Commission. As of January 1997, 13 of the 15 CAs of the Member States of the European Union had declared such a short-term interest in MARS 3.0 and thus received from MAHB their copy of the local DOS part of the system.

Although there is an overall legal obligation to notify information on major accidents to the Commission, neither the "Seveso I" nor the "Seveso II" Directive gives - for good reasons - a detailed technical specification of how this should actually be accomplished. Therefore, Member States can in principle continue to send hardcopy reports on major accident occurrences to the Commission in one of the 10 official languages of the European Union other than English. However, many real benefits are related to usage of the new system by the CAs:
• CAs can easily create and edit their MARS-relevant accident events in MARS-consistent format,
• CAs can easily send the resulting electronic data files in standard ASCII format to MAHB (on diskettes, via e-mail),
• CAs receive from MAHB periodic electronic updates of the contents of MAHB's central UNIX-based MARS database (i.e., all their own and all non-confidential data from all other CAs; see the sketch at the end of this section),
• CAs can thus build up their own local accidents database in MARS format,
• CAs can make statistical evaluations of their accidents data and generate corresponding reports.

In accordance with the Seveso II requirements of more open access to all non-confidential information, possible participants in the MARS 3.0 information exchange system are not only the CAs of the Member States and the European Commission, but also all other interested parties in the area, such as "industry or trade associations, trade unions, non-governmental organisations in the field of the protection of the environment and other international or research organisations working in the field", European Council (1997). The long-term goal of MARS 3.0 is to develop an information network that provides electronic access to all major industrial accident knowledge and experience within the European Union, to anyone (at different levels of detail, depending on the confidentiality of the raw data), anytime, anywhere. Any such network is bound by successful communication, which is a precursor to more sophisticated structures and purposes of a system. For MARS, this purpose is improved policies and practices on industrial accident prevention, mitigation and response through successful international co-operation and information exchange. Further, as the information society develops and expands, it can be expected that in the next few years more and more demands for "instantaneous" on-line access to MARS data and analysis results will arise (e.g., via the WWW).
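As a rough sketch of the periodic update subsets mentioned in the list above (each CA receives all of its own data plus all non-confidential data of the other CAs), the following hypothetical Python filter shows the selection rule; the record fields are assumptions, not the MARS schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MarsEvent:
    event_id: int
    reporting_ca: str   # e.g. country code of the reporting Competent Authority
    confidential: bool

def export_subset(events: List[MarsEvent], ca: str) -> List[MarsEvent]:
    """Subset for one CA: all of its own events plus all non-confidential events of others."""
    return [e for e in events if e.reporting_ca == ca or not e.confidential]

events = [MarsEvent(1, "DE", True), MarsEvent(2, "FR", True), MarsEvent(3, "FR", False)]
print([e.event_id for e in export_subset(events, "DE")])  # -> [1, 3]
```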
Although, from a technical point of view, MARS could be put on the Internet in the very near future, a complete integration of the MARS database into the WWW does not seem desirable at this moment, mainly for reasons of data security.
ACKNOWLEDGEMENTS

The valuable discussions with colleagues from the Major Accident Hazards Bureau in Ispra, Directorate General XI/E.1 in Brussels and the National Authorities in the Member States, and their comments and assistance in defining and implementing MARS 3.0, are gratefully acknowledged.
REFERENCES
European Council (1982). Council Directive 82/501/EEC on the Major Accident Hazards of Certain Industrial Activities ("Seveso I"). Official Journal of the European Communities.
Drogaris, G. (1993). Major Accident Reporting System: Lessons Learned from Accidents Notified. Elsevier Science Publishers, Amsterdam, The Netherlands.
Drogaris, G. (1993). Learning from Major Accidents Involving Dangerous Substances. Safety Science 16, 89-113.
Rasmussen, K. (1996). The Experience with the Major Accident Reporting System from 1984 to 1993. European Commission, Joint Research Centre, EUR 16341 EN.
Papadakis, G. and Amendola, A. (1996). In: Probabilistic Safety Assessment and Management '96, Springer Verlag, Berlin, Germany, 101-106.
European Commission (1988). Report on the Application in the Member States of Directive 82/501/EEC of 24 June 1982 on the Major Accident Hazards of Certain Industrial Activities. COM(88) 261.
European Council (1997). Council Directive 96/82/EC on the Major Accident Hazards of Certain Industrial Activities ("Seveso II"). Official Journal of the European Communities.
PLANT SAFETY IMPROVEMENT BY LOGICAL - PHYSICAL SIMULATION
N. Piccinini°, C. Fiorentini^, L. Scataglini* and F. De Vecchi^
° Politecnico di Torino (C.so Duca degli Abruzzi, 24 - I-10100 Torino, Italy)
* Agip SpA (Via Emilia, 1 - I-20095 San Donato Milanese - MI, Italy)
^ TECSA SpA (Via Oratorio, 7 - I-20016 Pero - MI, Italy)
ABSTRACT
This paper describes the way Integrated Dynamic Decision Analysis (IDDA) can be used to represent the ordinary, reduced and incidental states of a plant and all its irregular operating conditions, and determine their occurrence patterns (event analysis). The description is related to a real case of a natural gas drying plant. The versatility of the analysis program was exploited to simulate several plant improvements designed to minimize the risk of the plant being shut down through the spurious intervention of its protection systems.
KEYWORDS
Decision Analysis, Probability, Event Analysis, Optimization, Phenomenological Analysis, Integrated Analysis, Protection systems
INTRODUCTION

This paper describes how it was possible to modify and reoptimise the protection systems of a natural gas drying plant so as to prevent their spurious intervention and the consequent discharge of large quantities of gas into the atmosphere. The probability of such events must be kept below 10⁻⁶ according to the indications supplied by the Owner of the plant. The study was conducted with a computerized methodology known as Integrated Dynamic Decision Analysis (IDDA). It is based on logical and probabilistic techniques using the decision analysis tree for modeling, and is the latest product of a research line developed by R. Galvagni over the course of many years [1-5].

As a decision analysis, IDDA is based on a rigorous application of logic to define and depict all the possible alternative, mutually incompatible scenarios among which the choice has to be made. Each alternative scenario is developed and presented according to a cause-consequence logic approach. In this approach both logic rules and probability evaluations are applied dynamically, in that each piece of information progressively received can be used to define the successive logic path and the conditional probabilities of the following events. But besides being dynamic, a decision analysis tool must also be integrated with the physical behavior of the phenomena that the logical event trajectory implies. In order to be effective, a scenario in
the form of a logical trajectory has to represent both its logic-probabilistic and its physic-phenomenological evolutions. Further, knowledge of the phenomenology directly supplies the consequence related to each alternative scenario, giving, along with the probability supplied by the logic-probabilistic elaboration, a unique overall representation; that is, the information necessary to define the basic decision parameter given by "Risk", when evaluated over every possible foreseen alternative.

IDDA was applied to a natural gas drying plant that is likely to release large quantities of gas into the atmosphere in the event of the unavailability of the downstream distribution network. This eventuality can also lead to pressure transients in excess of the design values of the drying columns, and hence likely to damage them [6-7]. Greater gas consumption in recent years has led to the employment of plants in increasingly critical conditions. In other words, the intervention values of a plant's protection systems may be extremely close to those at which it normally operates, and spurious shutdowns are becoming increasingly frequent.
THE PLANT

The plant referred to in this paper is located downstream from an exhausted natural gas field now employed as a buffer store to meet heavy consumption demands during the winter. In summer, in fact, when consumption is well below the average, gas is piped off from the distribution network and stored in the field. In winter, this gas is returned to the network. Before this can be done, however, it requires thorough drying.

The layout of the drying plant is illustrated in Fig. 1. An intake manifold receives four input lines 1-4, each with a shutdown valve (SDV 75-78). Lines 1-4 join each other before the shutdown valve SDV 79. The gas is taken from the manifold to two parts of the plant. The first branches into four lines, the second into three, each with a drying column. The maximum design pressures of the first four columns are 72.5 barg (D1-D2) and 74.5 barg (D3-D4); they can all handle 3.5 × 10⁶ Nm³/d of gas. The second three (D5-D7) have a maximum design pressure of 78.5 barg and can handle 4.5 × 10⁶ Nm³/d of gas. Upstream from each column there is a shutdown valve (SDV 80-86) and a pressure control valve (PCV 61-67), each governed by a control system (PSH 1-7 and PIC 10-16) installed on the respective columns. Downstream from the columns, each line has another flow control valve (FCV 68-74) actuated by a pressure indicator controller (PIC 9) mounted on the collection manifold. Downstream from these regulating valves, the two parts of the plant meet on a single output manifold with a shutdown valve (SDV 87). This line is also fitted with a high-pressure switch (PSH 8) that actuates all the plant's shutdown valves.

Damage to the columns caused by excessive pressure is avoided by the installation on each upstream line of a one-way valve followed by a blow-down. The four lines from columns D1-D4 meet on the first blow-down manifold and the three from columns D5-D7 on the second. Each manifold is fitted with 4 PSVs, calibrated at 72.3 barg on the first and at 78.5 barg on the second. The PSVs can discharge all the flow routed to the plant. The maximum design flow rate for the first manifold is 8.88 × 10⁶ Nm³/d and that for the second is 12.72 × 10⁶ Nm³/d, while that of the discharge to the atmosphere is 15 × 10⁶ Nm³/d, so that the Mach number is less than 1.
EVENT ANALYSIS

The plant is thus composed of 77 components (61 valves, 8 high-pressure switches and 8 pressure regulators) capable of causing excess pressures, and 2⁷⁷ (≈1.5 × 10²³) system states are possible. IDDA, taken to scenarios with probabilities low enough not to exclude events of interest (a value of 10⁻⁹ was chosen in the light of previous experiments), showed that the number of alternatives was still beyond the handling capacity of the program [6-7]. It can be seen from the description of the plant that its configuration does not allow any reduction in the
number of alternatives through the use of symmetries. This reduction was therefore obtained in successive steps. A check was made at each step to ensure that the assumptions and eliminations would not have a significant effect on the dynamic response of the system (pressure and flow-rate transients). A peculiarity of IDDA, in fact, is that it associates physical models with each operating alternative generated. In the present case, in addition to supplying the occurrence probability, it traces the pressure and flow-rate transient diagrams for each section of the plant. The phenomenological model takes account of the:
• volumes in each part of the plant,
• flow rates and volumes during regular operation,
• load losses in the pipes,
• regulating and shutdown valve intervention times and characteristics,
• safety valve opening and reclosing characteristics,
• protection device intervention thresholds.
A minimal sketch of the probability-cutoff idea used to keep the scenario tree tractable is given after this list.
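To illustrate the pruning principle (scenarios are developed only while their probability stays above a cutoff, here the paper's 10⁻⁹), the sketch below expands a toy event tree of binary components and discards branches below the cutoff. This is a deliberate simplification: IDDA itself applies logic rules and conditional probabilities, not the plain independence assumed here.

```python
from typing import List, Tuple

def expand_tree(fail_probs: List[float], cutoff: float = 1e-9) -> List[Tuple[Tuple[bool, ...], float]]:
    """Enumerate component-state scenarios, pruning branches whose probability < cutoff."""
    scenarios = [((), 1.0)]
    for p_fail in fail_probs:
        nxt = []
        for states, prob in scenarios:
            for failed, p in ((False, 1.0 - p_fail), (True, p_fail)):
                branch = prob * p
                if branch >= cutoff:          # prune negligible branches
                    nxt.append((states + (failed,), branch))
        scenarios = nxt
    return scenarios

# Toy example: 10 identical valves, each failing with probability 1e-3
tree = expand_tree([1e-3] * 10)
print(len(tree), "retained alternatives out of", 2 ** 10)
```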
Step I - Identification of the initiating events

Examination of the plant shows that five events could cause a pressure transient capable of leading to a flare-off, namely:
1. Spurious closure of the downstream shutdown valve SDV 87 (closing time 28 s).
2. Spurious closure of the downstream valve SDV 88. As a conservative estimate, it is assumed that this valve closes in 4 s, so as to take all the possible high-pressure-to-the-consumer abnormalities into account.
3. Spurious closure of one of the FCVs downstream from the drying columns. Calculations have been made for FCV 68 (column D1, 3.5 × 10⁶ Nm³/d) and FCV 72 (column D5, 4.5 × 10⁶ Nm³/d) (closing time: 10 s).
4. Spurious operation of PSH 8, resulting in closure of the SDVs, including the downstream SDV 87.
5. The transmitter on the intake manifold under-reads the real operating pressure. This transmitter acts on both the central input and the output regulating valves. A spurious intervention of this kind leads to the closure (75% maximum) of the valves upstream from the columns and makes it impossible for those downstream from the separators to close.
In view of the very fast reactions of the plant to each of these causes of excessive pressure, it is unreasonable to suppose that several events could overlap. Attention has thus been directed to the possibility that one or several parts of the protection system fail to intervene following a single initiating event.
Step II - Reduction of the number of alternatives by event analysis
Central input lines (clusters)

The plant receives four clusters of eight lines, each line with its pressure regulating valve closed by the pressure indicator controller (PIC 9) on the manifold (Fig. 1). These 32 valves may work correctly or fail, giving 2³² = 4.3 × 10⁹ operation alternatives. Since the flows from the clusters are not all the same, there are no identical operating situations. The phenomenological method, however, demonstrates that the failure of one valve to close rather than another does not result in any significant differences in the pressure peaks in the columns or in the blow-off peaks. An evaluation was therefore made of the probability that "k" out of the "n" valves present would fail to regulate, by analysing separately cluster A and clusters B/C/D as a whole. It was found that this probability was 5.5 × 10⁻⁸ for 10 of the 24 valves on B/C/D. The probability of all the time trajectories that would result in the failure of more than 10 valves was added to this probability to give that for the failure to close of 10 or more
valves. A similar procedure was used for cluster A, where there were 7 mutually exclusive alternatives. The total number of alternatives was therefore 170 instead of the initial 4.3 × 10⁹. A minimal numerical sketch of this k-out-of-n grouping follows.
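As a back-of-the-envelope illustration of the k-out-of-n grouping, the sketch below computes the probability that exactly k of n valves fail to regulate and the tail probability of k or more failures. Treating the valves as identical and independent is an assumption made here for illustration only; the paper's time-trajectory analysis is richer, and the per-valve failure probability used below is invented.

```python
from math import comb

def p_exactly_k(n: int, k: int, p: float) -> float:
    """Binomial probability that exactly k of n identical, independent valves fail."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_at_least_k(n: int, k: int, p: float) -> float:
    """Tail probability of k or more failures (the grouped 'k or more valves open' event)."""
    return sum(p_exactly_k(n, i, p) for i in range(k, n + 1))

# Illustrative numbers only: 24 valves (clusters B/C/D), per-valve failure probability 0.01
print(f"P(exactly 10 fail) = {p_exactly_k(24, 10, 0.01):.2e}")
print(f"P(10 or more fail) = {p_at_least_k(24, 10, 0.01):.2e}")
```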
Fig. 1: Plant layout (clusters A-D feeding the intake manifold, drying columns D1-D7, blow-down manifolds)
Drying columns

Columns D1-D2, D3-D4 and D5-D7 were analysed separately. Examination of the transients showed that excessive pressure values are never reached in D3-D4, because their maximum design pressure is 74.5 barg and the PSVs come into operation at 72.5 barg. If the protection system fails to intervene, the operating conditions are such that the maximum design pressures of both D1-D2 and D5-D7 are exceeded. The behavior patterns and consequences are much the same in each case. An assessment has therefore been made of the probability that one or more columns of each set remain open. The transients also showed that, because of the extremely long response time of the shutdown system, it is enough for the regulation system to be inoperative to produce a pressure peak in excess of the maximum design value. For this reason, the probability that one or more columns in each set fail to regulate has also been calculated.
TABLE 1
PROBABILITY THAT ONE OR MORE COLUMNS REMAIN OPEN

                        D1-D2                            D5-D7
Lines open   PSH 8 operating   PSH 8 faulty   PSH 8 operating   PSH 8 faulty
0            9.50 × 10⁻¹       4.89 × 10⁻²    9.50 × 10⁻¹       4.88 × 10⁻²
1            3.88 × 10⁻⁵       2.76 × 10⁻⁴    5.83 × 10⁻⁵       4.13 × 10⁻⁴
2            1.74 × 10⁻¹⁰      3.88 × 10⁻⁷    4.27 × 10⁻¹⁰      1.16 × 10⁻⁶
3            -                 -              -                 1.02 × 10⁻⁹
Step III - Reduction of the number of alternatives by analysis of the initiating events (transients)

The results of the separate analysis of each of the initiating events described in Step I are set out below.
1. Spurious closure of the central output shutdown valve (SDV 87). It was found that both column excess-pressure and flare-off pressure transients can occur. In the case of D1-D2, the simple failure of the regulation system to operate results in excessive column pressure, whereas in that of D5-D7 this will only take place if they remain open. The number of input cluster regulation valves that fail to close has no significant effect as far as the generation of excessive pressure is concerned. As for the flow discharged to the blow-down manifolds or the flare-off, the maximum design values are only exceeded in the event of PIC 9 and PSH 8 failing to intervene at the same time.
2. High pressure to the consumer. What has been said with regard to the previous event is equally applicable.
3. Spurious closure of an FCV downstream from the drying columns. As in the preceding cases, analysis of the transients shows that no major problems arise with regard to the flow flared off or discharged to the blow-down manifolds, whereas the non-operation of one or more regulation valves results in a rise in pressure above the maximum design value.
4. Spurious operation of high-pressure switch PSH 8. The reaction times of the system are such that in this event the delivery of gas to the central plant is stopped a few seconds after the closure of SDV 87. Problems are thus confined to the case in which one of the columns D1-D2 remains open, since for the other columns the intervention of their SDVs has proved sufficient. The discharge from the flare is limited. If more than one column stays open, the excess pressure is distributed and does not rise above the set values of the PSVs. Its contribution to the overall risk is negligible.
5. Spurious operation of PIC 9. No problems arise with respect to the flow flared off or discharged to the blow-down manifolds, whereas non-operation of the regulation system of D1-D2 causes an increase in pressure to above the maximum design value. Excessive pressure in D5-D7, on the other hand, will only occur if one line remains open.
The number of possible alternatives for each initiating event is 288 (leading to excessive pressure) and 167 (leading to flare-off) for initiating events 1 and 2; 15 for initiating event 3; and 570 for initiating event 5. A probability cut-off of 10⁻⁹ has been considered.
Step IV - Estimation of the probabilities of occurrence

Table 2 expresses the consequences of each initiating event as the probability that the following variables will rise above their preset threshold values:
• flow of gas delivered to the manifolds or flared off,
• pressure reached inside the drying columns,
• gas discharge Mach number.
TABLE 2
PROBABILITY OF EXCEEDING THE DESIGN VALUES (SUMMARY)

Initiating  P column D1/D2  P column D3/D4  P column D5/D7  Q manifold 10"   Q manifold 12"    Q flare-off   Mach
event       > 72.5 barg     > 74.5 barg     > 78.5 barg     > 8.88 MNm³/d    > 12.72 MNm³/d    > 15 MNm³/d   > 1
1           3.56 × 10⁻³     -               5.16 × 10⁻³     < 10⁻¹⁰          < 10⁻¹⁰           < 10⁻¹⁰       < 10⁻¹⁰
2           1.10 × 10⁻³     -               1.59 × 10⁻³     < 10⁻¹⁰          < 10⁻¹⁰           < 10⁻¹⁰       < 10⁻¹⁰
3           9.60 × 10⁻³     -               1.91 × 10⁻⁴     < 10⁻¹⁰          < 10⁻¹⁰           < 10⁻¹⁰       < 10⁻¹⁰
4           3.70 × 10⁻⁶     -               -               < 10⁻¹⁰          < 10⁻¹⁰           < 10⁻¹⁰       < 10⁻¹⁰
5           2.50 × 10⁻³     -               1.44 × 10⁻²     < 10⁻¹⁰          < 10⁻¹⁰           < 10⁻¹⁰       < 10⁻¹⁰
TOTAL       3.81 × 10⁻² (excess column pressure, all columns)                < 10⁻¹⁰           < 10⁻¹⁰       < 10⁻¹⁰
OPTIMISATION OF THE PROTECTION SYSTEMS

Since the aim of this study was to simulate plant improvements designed to minimise the risk of the plant being shut down through the spurious intervention of its protection systems, the most significant transients were analysed in detail. Account was also taken of the fact that the possibility of replacing the columns or the PSVs had been discarded a priori as too costly. Four initial modifications were therefore examined:
a) reduction of the SDV intervention and closing times (use of an electric instead of a pneumatic signal),
b) increase of the wellhead valve pressures to about 70 barg,
c) reduction of the set point for PSH 8 on the output manifold to 72 barg,
d) reduction of the set point for PIC 9 on the intake manifold to 71 barg.
The changes introduced with respect to each initiating event are illustrated below:
1. Reduction of the SDV intervention and closing times has proved an extremely important factor, since the spurious closure of SDV 87 can now only result in excessive pressure in D1-D2 if they remain fully open. For this to happen in D5-D7, there must be a simultaneous failure of both PIC 9 and PSH 8.
2. Here, too, the situation in the new arrangement remains the same as in case 1.
3. Spurious closure of an FCV downstream from the drying columns no longer causes excessive column pressure. The discharge of the flare and blow-down manifolds is always nil.
4. Reduction of the central SDV and increase of the output SDV intervention and closing times mean that spurious operation of high-pressure switch PSH 8 no longer causes problems.
5. The comments made with regard to case 1 are equally applicable here.
Table 3 expresses the consequences of each initiating event in terms of the probability that the pressure reached in the drying columns will exceed the maximum design values. As can be seen, the modifications envisaged reduce the probability of occurrence of excessive pressure by two orders of magnitude. This, however, is not enough to bring the values within the limits set by the Owner of the plant. Other measures are thus required; these could, for example, reduce the probability of occurrence of the initiating events.
TABLE 3
PROBABILITY OF EXCEEDING THE DESIGN VALUES AFTER APPLICATION OF THE INITIAL MODIFICATIONS

Initiating  P column D1/D2  P column D3/D4  P column D5/D7
event       > 72.5 barg     > 74.5 barg     > 78.5 barg
1           1.25 × 10⁻⁶     -               1.93 × 10⁻⁵
2           3.87 × 10⁻⁷     -               5.83 × 10⁻⁶
3/4         -               -               -
5           1.27 × 10⁻⁴     -               1.87 × 10⁻⁴
TOTAL                                       3.41 × 10⁻⁴
e) In the case of event 1, the situation that would be created if a limit switch signal activating the shutdown system, with an occurrence frequency of 6.96 × 10⁻⁶ occ./h, were installed on the central output valve is illustrated in Table 4.
f) In the case of event 5, it would be possible to install a second pressure transmitter linked to the first, so that both would need to operate (spuriously) in order to close the central regulating valves. It is also necessary to make provision for a system allowing continuous monitoring of these two transmitters, so as to detect any significant variations and thus identify possible measurement errors on the part of one of them. A reading exceeding a pre-established value should send the system into the alert state. In the estimates, it has been assumed that a faulty transmitter can be replaced within 24 hours. The results of this new arrangement are shown in Table 4. The value obtained is below the threshold defined by the Owner.
TABLE 4
PROBABILITY OF EXCEEDING THE DESIGN VALUES

Case            Column pressure D1/D2    Column pressure D5/D7    TOTAL
                > 72.5 barg              > 78.5 barg
e (operating)   1.08 × 10⁻⁸              2.44 × 10⁻⁹              1.32 × 10⁻⁸
e (fail)        1.88 × 10⁻⁸              2.90 × 10⁻⁷              3.08 × 10⁻⁷
f               3.32 × 10⁻⁸              4.89 × 10⁻⁸              8.22 × 10⁻⁸
The effects of these various changes are summarised in Table 5. As can be seen, the probability of a discharge to the manifolds and/or the flare is below the threshold defined by the Owner of the plant, whereas the probability of excessive column pressure is of the same order of magnitude. Event 2 (high pressure to the consumer) provides the major contribution to this eventuality: its probability of occurrence has been assumed to be 10⁻². A more precise definition of this value, or the adoption of operating, management and safety procedures agreed with consumers, should bring this parameter, too, within the limits laid down.
TABLE 5
PROBABILITY OF EXCEEDING THE DESIGN VALUES (SUMMARY)

Initiating event   Pressure > design value
1                  3.21 × 10⁻⁷
2                  6.22 × 10⁻⁶
3/4                -
5                  8.22 × 10⁻⁸
TOTAL              6.62 × 10⁻⁶
CONCLUSIONS

This study was conducted by combining the probabilistic logic approach with physical simulation of the operating parameters (flow rate and pressure). Its results demonstrate that:
- the probability of a flare discharge is rather high (3.8 × 10⁻²) in the original configuration,
- the problems raised by discharge to the blow-down manifolds or the flare are rather limited.
The need to install new columns or alter their maximum pressure values has been avoided by simply adopting the following measures:
a. reduction of the SDV intervention and closing times (use of an electric instead of a pneumatic signal),
b. increase of the wellhead valve pressures to about 70 barg,
c. reduction of the set point for PSH 8 on the output manifold to 72 barg,
d. reduction of the set point for PIC 9 on the intake manifold to 71 barg,
e. installation on the central output valve of a limit switch signal activating the shutdown system,
f. installation of a second pressure transmitter linked to PIC 9, so that both would need to operate (spuriously) in order to close the central regulating valves.
With these modifications, the probability of excessive pressure in the columns is 6.62 × 10⁻⁶. This value can be regarded as acceptable, since the probability of discharge to the flare or the blow-down manifolds with a flow rate exceeding the design values is less than 10⁻¹⁰. This application has shown that once a value regarded as acceptable, because it embraces the complete set of alternatives, becomes available, operating decisions can be taken to improve the safety of the plant. Furthermore, this result has been reached at very little expense.
REFERENCES

(1) Felicetti, F., Galvagni, R. and Zappellini, G. (1978). "Analisi di sicurezza - Metodologia Semiprobabilistica e suo Sviluppo Applicativo", CNEN-RT/DISP, No. 78, 10.
(2) Galvagni, R. and Clementel, S. (1989). "Risk Analysis as an Instrument of Design", in: Safety Design Criteria for Industrial Plants, M. Cumo and A. Naviglio (eds.), CRC Press, Boca Raton.
(3) Senni, S., Semenza, M. G. and Galvagni, R. (1991). "A.D.M.I.R.A. - An Analytical Dynamic Methodology for Integrated Risk Assessment", Int. Conf. on Probabilistic Safety Assessment and Management, Beverly Hills, Feb. 4-7, 1991, G. Apostolakis (ed.), Elsevier, 20, 303-321.
(4) Antona, E., Fragola, I. and Galvagni, R. (1993). "Risk based decision analysis in design", 4th Conference SRA Europe, Rome, October 18-20, 1993.
(5) Galvagni, R., Ciarambino, I. and Piccinini, N. (1994). "Malfunctioning evaluation of pressure regulating installation by Integrated Dynamic Decision Analysis", Int. Conf. PSAM II, San Diego, March 20-25, 1994.
(6) Galvagni, R., Ciarambino, I., Fiorentini, C. and Piccinini, N. (1995). "Integrated risk analysis of natural gas drying plant", 2nd Conf. on Chemical Process Engineering, Firenze, May 15-17, 1995.
(7) Piccinini, N., Ciarambino, I., Scataglini, L., Fiorentini, C. and Atallah, S. (1996). "Application of Integrated Dynamic Decision Analysis (IDDA) to a gas treatment facility", Chemputers IV Conference and Exposition, Houston, Texas, March 11-13, 1996.
PLANNING OF COMPONENT INSPECTION: DEVELOPMENTS IN THE NETHERLANDS

J.H. Heerings¹ and J. Boogaard²
¹ Project Office for Research on Materials and Production Technology, Apeldoorn, NL
² Group Mechanical Plant Services, DSM Services, Engineering Stamicarbon; Chairman of the Dutch Quality Surveillance and Non-Destructive Testing Society, Geleen, NL
ABSTRACT

In The Netherlands the inspection intervals for pressurised equipment are defined in the so-called Stoomwezen Rules, which are compulsory by law. The prescribed intervals are fixed and more or less dependent on the type of equipment. In order to create more flexibility in inspection planning, a group-sponsored project was recently undertaken, aimed at the development of a methodology for the strategic planning of component inspection in the process industry. Checkable criteria have been formulated with regard to the conditions under which extension of the currently prescribed intervals can be considered justifiable. The methodology is based on the concept of 'risk based inspection', including both the 'probability of failure' and the 'consequence of failure'. In addition, the methodology has been tuned to the existing Dutch regulations. The new methodology is therefore considered not to be a replacement of the existing Rules but rather an addition, which enhances their implementation. This year a validation programme will be run to gain wide acceptance and to achieve implementation of the 'risk based inspection' concept in the Dutch regulations.
KEYWORDS
Inspection planning, Risk based inspection, Process industry, Pressure vessels, Extension inspection interval, Consequence of failure, Probability of failure, Organizational manageability
INTRODUCTION
By long tradition, periodic inspections have been used by the process industry as a means of ensuring the continued safety and reliability of equipment and structures. They are prescribed by the government as legislative authority (e.g. within the scope of the Pressure Vessel Inspectorate or of environmental regulations) or they have been initiated by companies themselves. In particular, the intervals at which the prescribed inspections must be carried out have at times been felt to be 'too rigid' and an obstacle to efficient production. In the process industry over the years there have been considerable changes in the operation of process installations:
- the processes have become more controllable, because more experience has been gained and/or because improved process monitoring equipment has become available; there is less load on the integrity of the equipment.
- more experience is available regarding the nature and the expected growth of damage phenomena that may occur in structures and equipment during operation, and the manner in which appropriate maintenance work can help to avoid them.
- new and improved inspection techniques have become available.
- Quality Assurance systems in accordance with 'ISO 9000' are leading to more structured and thus better control of operation and to improved maintenance.
Production and maintenance are becoming more and more integrated, and there is a powerful drive towards optimising the overall operating costs (sometimes right from the start of the planning phase). During this optimisation drive the necessity for traditional inspections and the fixed inspection intervals, among other factors, has been critically evaluated. This has already resulted in some relaxation of the rigidity of the intervals concerned, such as are set out in the "Rules for Pressure Vessels" [1] (abbreviated to RToD in Dutch), following the recently created possibility of "stretching", under stringent conditions, the interval for individual items of equipment. This possibility is formulated in Appendix 1 of Sheet T0102 of the RToD and shall be referred to as 'T0102' in this paper. There is, however, a conviction, shared by both industry and the supervising bodies, that a system for further relaxation of inspection intervals must be possible; the question still remaining is how this can be done while retaining a "guarantee" that safety will remain undiminished.
In order to develop this idea, the project "Condition monitoring of process equipment" [2] has been carried out in collaboration with the Dutch process industry and the government, under the leadership of the TNO Metals Institute. This resulted in the setting up of a methodology with which the permissibility of proposed inspection intervals can be judged. The methodology is in the first place aimed at equipment that must satisfy the "Rules for Pressure Vessels". In the next few months the methodology will be validated in practice, so that wide acceptance should be possible.
NEWLY DEVELOPED METHODOLOGY

In drawing up an inspection programme for a given installation, the nature and extent of the inspection activities and their frequency must first be defined. In the current methodology the laying down of the inspection intervals is the central consideration, and the nature and extent are regarded as dependent variables. Since sheet T0102 (from the Dutch Rules for Pressure Vessels) defines fixed intervals for inspection, the term 'interval extension' or 'flexibilising inspection' is often used. Following this line of thought, a systematics of judgement has been drawn up to allow the inspection interval in a given situation to be determined. In determining the possible inspection interval a distinction is made between three inspection regimes, viz.:
* regime 1: no extension of interval allowed
* regime 2: interval extension allowed up to the maximum of T0102
* regime 3: interval extension allowed beyond the maximum of T0102
Which regime is to be applied to a given situation is determined on the basis of three factors, viz.:
* determination of the hazard class
* determination of the technical control
* determination of the organizational manageability
The combination of the second and third factors determines in fact the total manageability, and thus represents a definite probability of failure. Therefore the three factors together represent both the 'consequence of failure' and the 'probability of failure'. In UK and US literature the term 'risk' is commonly used to express the combination of 'consequence of failure' and 'probability of failure'. So the newly developed methodology is in line with the international trend to use the concept of 'risk based inspection', see Figure 1.
Figure 1: The basic elements of risk based inspection (hazard class, technical control, organizational manageability)

This trend is clearly illustrated by the risk based inspection methodology which has recently been developed by DNV under the auspices of the American Petroleum Institute (API) [3,4]. In addition, various process industries have developed company-specific methodologies, which are mostly based on both the elements 'consequence' and 'probability'. However, two main groups can be distinguished: some of the 'risk based methodologies' are based on the use of a statistical figure to express the probability of failure, whereas other methodologies use the estimated (residual) lifetime as a basis. The methodology presented in this paper belongs to the second category and is therefore based on determination of the residual lifetime of the equipment concerned. This choice was made because of the intended purpose of the methodology, that is, to derive inspection intervals in absolute terms. The use of statistical data is considered more appropriate in the case of comparative studies, for instance when priorities in inspection must be set.

In the newly developed methodology two successive steps lead to the inspection interval: firstly the so-called 'inspection regime' is defined, and secondly the inspection interval is derived. The 'inspection regime' is defined by the diagrams in Figure 2. Once an installation has been classified in a given hazard class (1, 2 or 3), the corresponding diagram applies to that category. Rating of the organizational manageability is carried out once every three years and is identical for all the installations of one company or site. The technical control is established for individual installations or parts thereof.
Figure 2: Determination of the extent of flexibilization (for each hazard class 1-3, a diagram assigns inspection regime 1 (no extension), regime 2 (extension up to the maximum of T0102) or regime 3 (beyond T0102) as a function of the technical control (little / reasonable / great) and the organizational manageability (standard / good / advanced))
After the determination of the applicable inspection regime, the length of the inspection interval can be established. In regimes 1 and 2 this follows the rules of T0102. In regime 3 the inspection interval may exceed the maxima given in T0102, and a maximum of technical manageability is therefore required. In this regime the inspection interval is determined on the basis of the residual lifetime of the equipment concerned; this means that the degradation of the condition, and therefore the progress of the relevant damage process, must be predictable. In fact, knowledge of the progress of damage, together with knowledge of the minimum required condition for the equipment concerned, leads to a prediction of the residual lifetime. The maximum permissible inspection interval is based on a defined fraction of the predicted residual lifetime, see Figure 3. Naturally, situations are possible in which the inspection interval so calculated is shorter than the maximum interval specified in T0102; the term 'interval extension beyond T0102' means only that the interval may in principle go beyond T0102. Thus in regime 3 there is indeed a flexible inspection interval.
Figure 3: Determination of the inspection interval based on residual lifetime (condition versus time; the residual lifetime tR runs until the minimal required condition is reached, and the inspection interval tI is derived from tR via a safety factor)

In order to determine the maximum permissible interval, a safety factor is used that expresses the relationship between the residual lifetime and the maximum inspection interval. A value of 4 is used as standard. This value is, moreover, dependent on the organizational manageability: if the actual level of the organization increases, the safety factor may be reduced to 3 or even 2. A higher level of operating organization thus leads to a longer inspection interval. From the selected division into inspection regimes according to Figure 2, it appears that the influence of the organizational manageability is very small for hazard classes 1 and 2; in the manner of working given above, that influence is instead expressed by the safety factor applied, and thus by the length of the inspection interval. The calculation of the residual lifetime on the basis of damage progress is naturally possible only if one or more damage processes actually take place. In the event that no damage process can occur (according to the international state of the art), the residual lifetime is assumed to be infinite. For such a case the inspection interval is limited by an absolute maximum, i.e. five times the regular interval (cf. T0102). A minimal numerical sketch of this interval rule follows.
DETERMINATION OF THE HAZARD CLASS

Various classification schemes exist for the purpose of assigning plant to a safety, damage or effect category. The methodology proposed here uses assignment to 'hazard classes 1, 2 or 3' according to the RToD, sheet G0701. The following parameters are taken into consideration in the determination of the hazard class according to G0701: flammability, toxicity, quantity, pressure, temperature and the presence of an enclosure. The most important reasons to work with this methodology are:
* for all existing pressure vessels the applicable hazard class is known, so that no additional effort is required for assignment;
* compliance with the RToD and current practice for requesting interval extension according to T0102.
Although a choice has been made for assignment according to G0701, in the long term it may be better to select another system. An important disadvantage of G0701 is that it takes no account of the financial consequences and the possible effects on the continuity of production.
DETERMINATION OF THE TECHNICAL AND ORGANIZATIONAL MANAGEABILITY

A questionnaire has been drawn up to assist in the determination of both the technical and the organizational manageability, which considers all the aspects that may be of influence. The technical control is qualified as little, reasonable or great. The qualification 'great' can be assigned only if it is possible to predict the deterioration of the integrity of a plant. This must be proved in the form of a graph in which not only the deterioration in condition, but also the minimum required level (corresponding to the 'end of lifetime') is given for the piece of equipment concerned. If such a prediction is possible, the reliability of the prediction must also be quantified; a questionnaire for this is also available. The organizational manageability is qualified as standard, good or advanced. The term 'standard' refers broadly to an organization in which inspection is based mainly on the observance of legal requirements; for the qualification 'good', inspection is explicitly directed towards the maintenance of integrity, while the term 'advanced' is applicable if management comprises combined inspection, maintenance and production. In order to make an objective and uniform judgement, criteria have been formulated for each question against which the answer to that question can be checked, so that assignment to one of the above-named qualifications is possible. An important benefit of this approach is that it shows clearly which aspects are determinant (or: form the critical factor) for the final judgement, so that there is direct insight into what improvements could lead to a higher score.
SOME EXAMPLES

In the development project the methodology has been applied to the following four cases.

Creep of steam pipes

Steam pipelines are subject to creep, fatigue (both low cycle and high cycle fatigue), thermal shock and corrosion during a shut-down. In most countries inspection is intensified once a certain operating time is exceeded, usually set at a value of 0.6 for the so-called 'usage factor'. Extension of inspection intervals can therefore only be considered before the usage factor reaches this 0.6 threshold value. The degree of extension appeared to be limited to the maximum intervals from T0102 because of the potential occurrence of low cycle fatigue.
Internal corrosion in urea production plant

In a certain part of a urea production plant, general corrosion may take place due to the presence of a carbamate solution in a stainless steel pipeline. The corrosion rate is typically 0.05 to 0.10 mm per year in the case of passive corrosion. Nevertheless, uncontrolled processing may lead to active corrosion with a very high rate. In order to detect a possible change from passive to active corrosion, a measuring device based on the principle of neutron activation has been installed. The evaluation showed that the inspection intervals may be defined based on the residual lifetime.

Corrosion under insulation

In a given naphtha plant it has been shown that corrosion under insulation of pipelines may occur from 10 to 15 years after commissioning. Currently available NDT methods are not considered sufficient to control this type of degradation. An integral approach is needed, including training of maintenance and production personnel, quality control and guidance during application, and specific inspections. In future, reliable NDT techniques may be developed; in the current situation no extension of the inspection intervals appeared possible.

Steam drums

Potential degradation may occur as a result of cracking due to thermal stresses (high gradients during start-up and shut-down) and standstill pitting corrosion. Control of the temperature gradient across the wall, and in time during start-up, is considered an effective measure, in addition to measurements of oxygen, pressure and water level. Ultrasonic inspection from the outside may be considered an alternative to internal inspection. The inspection interval could be based on the assessment of the residual lifetime, provided sufficient control of the temperature gradient is ensured.

Application of the newly developed methodology to the above-mentioned cases showed that the broad analysis embodied in the methodology is really necessary in order to arrive at a good judgement, and that application of the methodology leads to a correct insight into the actual manageability. Depending on the particular case, the conclusion was that extension of the interval was not possible, possible according to T0102, or possible beyond it. An overview of the rating of the three determining factors, together with the conclusion regarding the inspection regime, is shown in Table 1. It must be noted that the evaluation of these cases is based on an overall exercise with the methodology, so that the conclusions cannot be considered definitive.

TABLE 1
OVERVIEW OF THE APPLICATION OF THE METHODOLOGY

Case                                        | Hazard class | Technical control | Organizational manageability | Inspection regime
Creep of steam pipes                        | 1            | reasonable        | not determined               | extension up to T0102
Internal corrosion in urea production plant | 2            | great             | good                         | beyond T0102
Corrosion under insulation                  | 2            | little            | not determined               | no extension
Steam drums                                 | 1            | great             | standard                     | beyond T0102
BENEFITS
The methodology formulated here shows a number of benefits. A logical and effective programme of inspection can be drawn up that is well matched to the overall production process, because use is made of the actual situation (the real damage processes) and the degree of manageability. Although the area of applicability and the conditions of the Dutch Rules for Pressure Vessels (particularly T0102) were closely considered during development of the methodology, the results need not remain restricted to these. The checkable criteria provide clarity for both the inspecting authority and the user regarding the conditions under which inspection programmes are justified (and permissible). The user gains the possibility of compensating for the costs associated with a high degree of management (both technical and organizational) of the technical integrity by applying modified inspection intervals, so that the economics of production can be optimized.
WHAT NEXT?

Extensive validation of the methodology has not yet taken place. The methodology presented here must therefore be considered a guideline, and it is clear that modifications and supplements will be required before national acceptance is possible. The methodology offers the basis for a clear and systematic discussion regarding the flexibility of inspection intervals, and it is certainly possible to modify and further develop it during a 'validation phase'. In the meantime it has appeared that this opinion is shared by many process industries and by the Dutch advisory committee 'Technical Committee for Pressure Vessels'. Therefore a collaborative programme has been defined, to be started at the beginning of 1997. This programme is aimed at achieving national acceptance of the concept of 'risk based inspection' in the Dutch Rules. The objective will be realised by the evaluation of a number of process units making use of the newly developed methodology.
REFERENCES
[1] Dutch licensing authority Stoomwezen (1994). Rules for Pressure Vessels, Staatsuitgeverij, Den Haag, NL.
[2] Heerings, J.H. (1995). Final report "Condition monitoring of process equipment", KINT, Heemstede, NL (in Dutch).
[3] Aller, J.E., Dunlavy, R. and Riggs, K.R. (1993). The Risk-Based Management System: A New Tool for Assessing Mechanical Integrity. ASME Reliability and Risk in Pressure Vessels and Piping, PVP-Vol. 251, 115-125.
[4] Reynolds, J.T. (1995). Risk Based Inspection Improves Safety of Pressure Equipment. Oil & Gas Journal, 37-40.
B11: Modelling Physical Phenomena
UNCERTAINTY QUANTIFICATION IN PROBABILISTIC SAFETY ANALYSIS OF THE BLEVE PHENOMENON
I.A. Papazoglou, O.N. Aneziris
Institute of Nuclear Technology - Radiation Protection, National Center for Scientific Research "DEMOKRITOS", Aghia Paraskevi 15310, Greece
ABSTRACT
This paper presents a methodology for risk estimation owing to the phenomenon of boiling liquid expanding vapour explosion (BLEVE), when either the model or its various parameters are not precisely known. A BLEVE takes place when a tank containing liquefied petroleum gas (LPG) is exposed to fire and fails catastrophically. Two models have been used for the estimation of the heat radiation to the population from a fireball, namely the mass-independent emissive power model and Roberts' model. In both models the thermal flux is expressed in terms of parameters such as the radius, the duration and the maximum emissive power of the fireball, which might not be known precisely. The effect of parameter uncertainty on the estimated risk is presented for both models for a tank containing 200 tonnes of propane.
KEYWORDS BLEVE, Uncertainty, Risk assessment, chemical industry, LPG
INTRODUCTION
This paper presents the methodology for estimating the individual risk owing to the phenomenon of Boiling Liquid Expanding Vapour Explosion (BLEVE), when either the model or various parameters are not precisely known. A BLEVE occurs when there is a sudden loss of containment of a pressure vessel containing a superheated liquid or a liquefied gas. The primary cause is usually an external flame impinging on the shell of a vessel above the liquid level, weakening the container and leading to sudden shell rupture. If the released liquid is flammable, a fireball may result. The procedural steps for quantifying risk for the specific case of the BLEVE phenomenon are presented. All uncertainties which appear in the BLEVE model, such as the emissive power, the diameter of the fireball and the duration of the phenomenon, are described and quantified. Finally, the results of the uncertainty risk assessment from a BLEVE of a tank containing 200 tonnes of propane are presented.
RISK ASSESSMENT OF FLAMMABLE SUBSTANCES FROM BLEVE
The methodology for quantification of risk from installations handling toxic or flammable substances has been presented in detail by Papazoglou et al. (1996). Here only the steps for risk assessment of the BLEVE phenomenon will be described, and in particular those for estimating the conditional (on a BLEVE) risk, that is, the risk given that a BLEVE has occurred.
Consequence Assessment from BLEVE

This phase aims at establishing the consequences to the health of the public and workers, and the associated probabilities, owing to the BLEVE phenomenon. Immediate health effects can be estimated by calculating the heat flux at each point around the site and establishing a thermal radiation/response model. The objective of this step is to determine the possible health effects and their conditional probabilities. Three steps can be distinguished for consequences from BLEVE:
a) estimation of heat radiation;
b) dose assessment;
c) consequence assessment.

Let:
Q_r(x,y) be the heat radiation at point (x,y);
d_r(x,y) be the level of adverse exposure, that is, the exposure to the adverse effect integrated over time;
p_r(x,y) be the conditional probability of fatality for an individual at location (x,y).
The conditional probability of fatality p_r(x,y) can be calculated through the following chain:

BLEVE → radiation Q_r(x,y,t) → dose d_r(x,y) → probit → p_r(x,y)

The dose d_r(x,y) is a function of the heat flux and the duration t_d of the phenomenon:

d_r(x,y) = Q_r(x,y)^{4/3} \, t_d \, 10^{-4}    (1)
where:
Q_r(x,y) is the thermal flux at point (x,y) (W/m²)
t_d is the duration of the BLEVE phenomenon (sec)

The conditional probability of fatality for an individual p_r(x,y) at location (x,y) in case of BLEVE is calculated from equations (2) and (3):

p_r(x,y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{P-5} e^{-u^2/2} \, du    (2)

P = -14.9 + 2.56 \ln d_r(x,y)    (3)

where P is the 'probit' value for the flammable substance due to heat radiation, as proposed by the Green Book.
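As a quick numerical illustration of equations (1)-(3), the following Python sketch chains dose, probit and fatality probability; the standard normal CDF replaces the integral in equation (2), and the input values in the example are illustrative only.

```python
import math
from scipy.stats import norm

def fatality_probability(thermal_flux_w_m2, duration_s):
    """Conditional probability of fatality from heat radiation, Eqns (1)-(3)."""
    dose = thermal_flux_w_m2 ** (4.0 / 3.0) * duration_s * 1e-4  # Eqn (1)
    probit = -14.9 + 2.56 * math.log(dose)                       # Eqn (3)
    return norm.cdf(probit - 5.0)                                # Eqn (2)

# Example: 35 kW/m^2 received for 15 s (illustrative values).
print(fatality_probability(35e3, 15.0))
```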
Heat Radiation Q

The heat radiation Q which an individual may receive in case of a fireball can be calculated from the following equation:

Q = E_{max} \, \tau_a \, V_F    (4)

where:
Q: radiation (W/m²)
E_max: emissive power per unit area (W/m²)
τ_a: atmospheric transmissivity, considered constant = 0.7
V_F: view factor
Emissive power

Emissive power is an uncertain parameter in the calculation of heat radiation. Two main approaches for the estimation of emissive power appear in the literature, namely:
a) emissive power independent of mass, described in the Yellow Book and in CCPS/AIChE (1994);
b) emissive power depending on mass, proposed by Roberts A.F. and CCPS/AIChE (1994).

Emissive power independent of mass

According to this approach the maximum emissive power has a constant value, which does not depend on the mass of flammable substance. This value has been estimated from experiments. Three sets of experiments have been reported in the literature and their results appear in Table 1. The Yellow Book proposes 180 kW/m² and CCPS/AIChE (1994) 350 kW/m².
TABLE 1
EXPERIMENTAL RESULTS OF MAXIMUM EMISSIVE POWER (E_max)

Reference                   | Fuels        | Fuel Mass (kg) | E_max (kW/m²)
a) Hasegawa and Sato (1977) | C5H12        | 0.3-30         | 110-413
b) Johnson et al. (1990)    | C4H10, C3H8  | 1000-2000      | 320-375
c) T.A. Roberts (1995)      | C3H8         | 279-1708       | 320-415

Emissive power depending on mass
According to this approach the maximum emissive power is a function of the fuel mass, the radius and the duration of the fireball, and is given by equation (5):

E_{max} = \frac{f \, M \, H_c}{\pi \, D_F^2 \, t_d}    (5)

where:
f: fraction of the heat released by combustion that is radiated from the fireball
M: mass of combustion (kg)
H_c: heat of combustion (J/kg)
D_F: diameter of fireball (m)
t_d: duration of fireball (sec)
E_max: maximum emissive power (W/m²)

f is a function of the pressure in the tank and is calculated according to equation (6), proposed by Roberts A.F.:

f = 0.27 \, P^{0.32}    (6)

where P is the pressure in the tank in MPa.
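A minimal sketch of equations (5) and (6) follows; the heat of combustion for propane is an assumed literature value, not given in the paper.

```python
import math

def roberts_emax(mass_kg, pressure_mpa, diameter_m, duration_s,
                 heat_of_combustion=4.64e7):  # J/kg, assumed value for propane
    """Mass-dependent maximum emissive power (W/m^2), Eqns (5)-(6)."""
    f = 0.27 * pressure_mpa ** 0.32                      # Eqn (6)
    return (f * mass_kg * heat_of_combustion
            / (math.pi * diameter_m ** 2 * duration_s))  # Eqn (5)
```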
View Factor
The view factor of a point on a plane surface located at a distance L from the center of a sphere with diameter D_F depends not only on L and D_F but also on the orientation of the surface with respect to the fireball. The simplest case is when the surface is vertical; the view factor is then given by equation (7), as proposed by CCPS/AIChE (1989):

V_F = \frac{D_F^2}{4 L^2}    (7)

where:
V_F: view factor
D_F: fireball diameter (m)
L: distance from the center of the fireball (m)

For a point on the xy plane with co-ordinates (x,y), the distance L from the center of the fireball is assumed to be:

L = \sqrt{(D_F/2)^2 + x^2 + y^2}    (8)
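A direct transcription of equations (7) and (8) as a small helper (names are illustrative):

```python
def view_factor(d_fireball_m, x, y):
    """Vertical-surface view factor, Eqns (7)-(8)."""
    L2 = (d_fireball_m / 2) ** 2 + x ** 2 + y ** 2  # L^2, Eqn (8)
    return d_fireball_m ** 2 / (4 * L2)             # Eqn (7)
```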
Diameter of the fireball

The equation which has been proposed by all researchers for the diameter of the fireball is the following:

D_F = a M^b    (9)

where:
D_F: diameter of fireball (m)
M: mass of fireball (kg)
a, b: parameters

Parameters a and b are not precisely known. Several researchers have proposed different values for these parameters, as shown in Table 2. The fireball diameter might therefore be treated as an uncertain variable.
TABLE 2
VALUES OF PARAMETERS a, b, c, d FOR THE QUANTIFICATION OF THE BLEVE DIAMETER AND DURATION

a    | b     | c     | d     | Reference
3.51 | 0.33  | 0.32  | 0.33  | Lihou and Maund 1982
5.8  | 0.33  | 0.45  | 0.33  | Roberts 1982
6.48 | 0.325 | 0.825 | 0.26  | Pietersen 1985, Yellow Book
5.88 | 0.333 | 1.09  | 0.167 | Williamson and Mann 1991
5.33 | 0.327 | 1.09  | 0.327 | Moorhouse and Pritchard 1982
5.28 | 0.277 | 1.1   | 0.097 | Hasegawa and Sato 1977
6.28 | 0.33  | 2.53  | 0.17  | Fay and Lewis 1977
6.36 | 0.325 | 2.57  | 0.167 | Lihou and Maund 1982
Duration of the BLEVE phenomenon

The duration of the fireball is an important factor in the assessment of E_max in the Roberts model (equation 5) and of the dose in both approaches, as is evident from equation (1). The equation which has been proposed by all researchers for the duration of the BLEVE is the following:

t_d = c M^d    (10)

where:
t_d: duration of fireball (sec)
M: mass of fireball (kg)
c, d: parameters

The parameters c and d are not precisely known. Table 2 also presents the values of c and d proposed by several researchers. Given the values of M, a, b, c, d and E_max (depending on the model), the heat flux (Q_r), dose (d_r) and individual risk (p_r) can be calculated from equations (1)-(10).
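For illustration, the sketch below evaluates equations (9) and (10) with the Pietersen 1985 / Yellow Book parameter set from Table 2; any other row of the table could be substituted.

```python
def fireball_diameter(mass_kg, a=6.48, b=0.325):
    """Fireball diameter in metres, Eqn (9) (Pietersen/Yellow Book values)."""
    return a * mass_kg ** b

def fireball_duration(mass_kg, c=0.825, d=0.26):
    """Fireball duration in seconds, Eqn (10) (Pietersen/Yellow Book values)."""
    return c * mass_kg ** d

# 200 tonnes of propane, as in the case study of this paper.
print(fireball_diameter(200e3), fireball_duration(200e3))
```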
QUANTIFICATION OF UNCERTAINTIES

From the discussion in the previous section it follows that the assessment of risk from the BLEVE phenomenon is characterised by a lack of precise knowledge both of the associated models and of the values of their parameters. These uncertainties have been quantified for the two distinct models (i.e. the mass-independent emissive power approach and the Roberts approach).
Emissive power independent of mass
By virtue of equations (1),(4),(7),(8),(9),(10) it follows that according to this model the dose is given by equation (11):

d_r(x,y) = \left[ \frac{E_{max} \, \tau_a \, (a M^b)^2}{4\left((a M^b/2)^2 + x^2 + y^2\right)} \right]^{4/3} c M^d \, 10^{-4}    (11)

where:
d_r(x,y): dose at point (x,y) ((kW/m²)^{4/3} sec)
E_max: emissive power per unit area (W/m²)
M: mass of combustion (kg)
τ_a: atmospheric transmissivity
a, b, c, d: parameters
x, y: coordinates
In this approach the uncertain parameters are: the maximum emissive power E_max, a and b (parameters determining the diameter of the fireball), and c and d (parameters determining the duration of the phenomenon). Risk is calculated from equations (2),(3),(11) and is a function of these uncertain parameters:

p_r = f(E_{max}, a, b, c, d)    (12)

It follows that if the uncertainty in these parameters is quantified by considering each of them as a random variable distributed according to a known probability density function (pdf), then p_r is also a random variable. A pdf for each random variable has been estimated based on the data given in Table 2 and is presented in Table 3. Owing to the complexity of the dependence of the individual risk p_r(x,y) on the uncertain parameters E_max, a, b, c, d, the pdf of p_r has been estimated through Monte Carlo simulation using the Latin Hypercube Sampling (LHS) method. Percentiles of the conditional individual risk p_r have been calculated for a tank of 200 tonnes of propane as a function of distance. Results are shown in Figure 1 and discussed in the conclusions.
TABLE 3
PROBABILITY DENSITY FUNCTIONS OF E_max, a, b, c, d

Uncertain parameter | PDF
E_max               | Uniform (min: 200, max: 350)
a                   | Logistic (mean: 5.655, parameter σ: 0.537)
b                   | Normal (μ: 0.319, σ: 0.02)
c                   | Rayleigh (parameter b: 1.037)
d                   | Triangular (min: 4.7·10^-2, most likely: 0.19, max: 0.38)
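A sketch of this uncertainty propagation is given below: the Table 3 marginals are sampled with a Latin Hypercube design and pushed through equations (11), (2) and (3). The sample size, seed, use of scipy and the conversion of E_max from kW/m² to W/m² are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.stats import qmc, uniform, logistic, norm, rayleigh, triang

sampler = qmc.LatinHypercube(d=5, seed=0)
u = sampler.random(n=1000)                       # LHS design on [0,1)^5

# Inverse-CDF transform to the Table 3 marginals (E_max assumed in kW/m^2).
emax = uniform(loc=200e3, scale=150e3).ppf(u[:, 0])        # W/m^2
a = logistic(loc=5.655, scale=0.537).ppf(u[:, 1])
b = norm(loc=0.319, scale=0.02).ppf(u[:, 2])
c = rayleigh(scale=1.037).ppf(u[:, 3])
d = triang(c=(0.19 - 0.047) / (0.38 - 0.047),
           loc=0.047, scale=0.38 - 0.047).ppf(u[:, 4])

M, x, y, tau = 200e3, 600.0, 0.0, 0.7            # 200 t propane, 600 m downwind
L2 = (a * M ** b / 2) ** 2 + x ** 2 + y ** 2     # Eqn (8), squared
flux = emax * tau * (a * M ** b) ** 2 / (4 * L2) # Eqns (4),(7),(9)
dose = flux ** (4 / 3) * c * M ** d * 1e-4       # Eqn (11)
p_r = norm.cdf(-14.9 + 2.56 * np.log(dose) - 5)  # Eqns (2)-(3)
print(np.percentile(p_r, [5, 50, 95]))           # percentiles of the risk pdf
```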
Figure 1. Percentiles of conditional probability of fatality. Uncertain parameters Emax, a,b,c,d
Emissive power depending on mass (Roberts' approach)

By virtue of equations (1),(4),(5),(7),(8),(9),(10) it follows that according to this model the dose is given by equation (13):

d_r(x,y) = \left[ \frac{f \, \tau_a \, H_c \, M^{1-d}}{4 \pi c \left((a M^b/2)^2 + x^2 + y^2\right)} \right]^{4/3} c M^d \, 10^{-4}    (13)

where:
d_r(x,y): dose at point (x,y) ((kW/m²)^{4/3} sec)
f: fraction of the heat released by combustion that is radiated from the fireball
M: mass of combustion (kg)
H_c: heat of combustion (J/kg)
τ_a: atmospheric transmissivity
a, b, c, d: parameters
x, y: coordinates

In this approach the uncertain parameters are a, b, c and d. Risk is a function of these uncertain parameters and is given by equation (14):

p_r = f(a, b, c, d)    (14)
The pdfs of a, b, c and d have already been presented in Table 3. A similar approach was followed, and the percentiles of the pdf of the conditional individual risk p_r for a tank of 200 tonnes are shown in Figure 2 and discussed in the conclusions.
Figure 2. Percentiles of conditional probability of fatality. Uncertain parameters a,b,c,d
CONCLUSIONS

Figure 1 presents some of the percentiles of the individual risk versus distance for the mass-independent emissive power model. These results can be interpreted as follows: at a distance of 600 m from the center of the fireball, the conditional probability of fatality (p_r) is less than or equal to 9×10^-9 with probability 50%, to 4×10^-7 with probability 60%, to 10^-5 with probability 70%, to 5×10^-4 with probability 80%, and to 8×10^-2 with probability 95%. Figure 2 presents the corresponding results for the mass-dependent emissive power model. At a distance of 600 m from the center of the fireball, the conditional probability of fatality (p_r) is less than 10^-4 with probability 5%, and less than or equal to 8×10^-4 with probability 20%, to 2×10^-3 with probability 40%, to 8×10^-3 with probability 60%, to 2×10^-2 with probability 80%, and to 7×10^-2 with probability 95%. It is noteworthy that the model allowing the emissive power to depend on the mass of the substance involved in the BLEVE phenomenon yields results characterised by less uncertainty than the model assuming no such dependence. For example, the 90% confidence interval for the distance beyond which the conditional individual risk falls below 10^-3 is [550 m, 750 m] for the former model, while it is [200 m, 800 m] for the latter. Furthermore, the upper limit of the distance beyond which the level of individual risk is less than any given value is almost the same for the two models. For example, individual risk will be less than 10^-5 beyond 900 m with probability 95% for both models. Decisions that require such a high degree of confidence therefore do not depend on the choice of the BLEVE model. On the other hand, if such a high degree of confidence is not justified by the associated implications, further research is required to establish the relative validity of each model and reduce the associated uncertainties.
REFERENCES
CCPS/AIChE. (1994). Guidelines for Evaluating the Characteristics of Vapor Cloud Explosions, Flash Fires, and BLEVEs, AIChE, New York: CCPS/AIChE.
CCPS/AIChE. (1989). Guidelines for Chemical Process Quantitative Risk Analysis, AIChE, New York: CCPS/AIChE.
Fay, J.A. and Lewis, D.H. (1977). Unsteady burning of unconfined fuel vapour clouds. Sixteenth Symposium (International) on Combustion, 1397-1404. Pittsburgh: The Combustion Institute.
Hasegawa, K. and Sato, K. (1977). Study on the fireball following steam explosion of n-pentane. Second Int. Symp. on Loss Prevention and Safety Promotion in the Process Ind., pp 297-304. Heidelberg.
Johnson, D.M., Pritchard, M.J. and Wickens, M.J. (1990). Large scale catastrophic releases of flammable liquids. Commission of the European Communities Report, Contract No EV4T.0014.UK(H).
Lihou, D.A. and Maund, J.K. (1982). Thermal radiation hazard from fireballs. I. Chem. E. Symp. Ser. No. 71, pp 191-225.
Moorhouse, J. and Pritchard, M.J. (1982). Thermal radiation from large pool fires and thermals - literature review. I. Chem. E. Symp. Series No. 71.
Papazoglou, I.A., Aneziris, O., Bonanos, G. and Christou, M. (1996). SOCRATES: a computerized toolkit for quantification of the risk from accidental releases of toxic and/or flammable substances, in Gheorghe, A.V. (Editor), Integrated Regional Health and Environmental Risk Assessment and Safety Management, Int. J. Environment and Pollution, Vol. 6, Nos 4-6, pp 500-533.
Pietersen, C.M. (1985). Analysis of the LPG incident in San Juan Ixhuatepec, Mexico City, 19 November 1984, Report TNO Division of Technology for Society.
Roberts, A.F. (1981). Thermal radiation hazards from releases of LPG from pressurised storage. Fire Safety Journal, 4, 197.
Williamson, B.R. and Mann, R.B. (1981). Thermal hazards from propane (LPG) fireballs. Combust. Sci. Tech., 25, 141-145.
Yellow Book. Committee for the Prevention of Disasters (1992). Methods for the calculation of the physical effects of the escape of dangerous materials, TNO, Voorburg, The Netherlands.
EXTENDED MODELLING AND EXPERIMENTAL RESEARCH INTO GAS EXPLOSIONS

W.P.M. Mercx
TNO Prins Maurits Laboratory, PO Box 45, Rijswijk, The Netherlands
ABSTRACT

This paper summarises the work that has been performed by the nine participants in the project EMERGE, which was partially funded by the European Commission. The general aim of the project was to improve the various techniques for predicting the effects of gas explosions. A large number of experiments have been carried out on various geometrical scales with realistic obstacle configurations. The data have been used to investigate and validate submodels used in the various CFD codes capable of simulating the gas explosion process. Special items studied comprised:
• the influence of pre-ignition turbulence on the explosion process;
• the interaction of the explosion-induced expansion flow with obstacles;
• the load induced on obstacles located in the combustion zone.

KEYWORDS

Gas explosion, experimentation, turbulence, explosion load, computerised simulation, obstacles.

INTRODUCTION

A project called 'Extended Modelling and Experimental Research into Gas Explosions (EMERGE)' has been carried out, partly sponsored by the European Commission DG XVII in the 'Environment' programme. EMERGE is a follow-up to MERGE (Modelling and Experimental Research into Gas Explosions). An initial step towards collaborative research on gas explosions was taken in the project MERGE. A number of experimental programmes were performed in MERGE to generate data on the influence turbulence has on flame acceleration and overpressure generation. In particular, the influence of obstacles present in the gas cloud and the influence of jets were investigated. Experiments were performed on various geometrical scales to study the influence of scale. The modelling part of MERGE consisted of the improvement of Computational Fluid Dynamics (CFD) models capable of simulating gas explosions. The experimental results were used for code validation and for validation of various scaling techniques (Mercx, 1995a). A similar approach was adopted for EMERGE. This project also consisted of two main parts: an experimental part and a modelling part. Again the experiments were devoted to studying the influence of turbulence. In contrast to MERGE, the experimental set-ups were more complex in order to comply better with realistic situations. Also, the influence of initial turbulence, i.e. turbulence already in existence before ignition occurs, on the explosion process has been studied. The modelling part of EMERGE concentrated on the improvement of sub-grid obstacle modelling as applied in CFD codes. The outcome of EMERGE has been reported in Mercx et al. (1997). This paper gives an overview of the main activities that were performed. Within the restricted volume of this paper, it is inevitable that a selection is made from the large number of activities carried out in this project.
OBJECTIVES

The key objective of the project was to simultaneously improve the accuracy and establish the applicability of techniques to predict the characteristics of a vapour cloud explosion, i.e.:
• experimental scaling techniques, since they have the potential to simulate in detail large-scale explosion behaviour in smaller scale experimental rigs;
• simple approximate theoretical tools for routine use by industrial hazard engineers;
• models based upon Computational Fluid Dynamics (CFD), which provide the most practical and cost effective route for extrapolating from large-scale experiments to full-scale scenarios.
Since the greatest confidence lies in the results of large-scale experiments, the project aimed to establish a significant large-scale data set against which the predictive methods can be evaluated. In addition, a substantial fundamental data set is provided, against which the relative accuracy of the various sub-models used in the theoretical techniques can be evaluated and further developed. A number of deficiencies have been identified in the present knowledge of each of the three predictive techniques. There is still a need to establish the valid range of application of the scaling techniques and, in particular, it remains to be determined whether the methods can be applied more accurately. The guidance available for the application of simple practical explosion source models is limited and needs a better physical footing. The following uncertainties in the sub-models used in the applicable CFD codes have been identified:
• the accuracy with which loads on structures can be predicted;
• drag and momentum losses caused by realistic obstructions in steady and transient flows;
• turbulence and detailed flow structures caused by steady and accelerating flows over different shaped obstacles;
• the effects of initial turbulence and jet flows on the turbulent combustion rates and the detailed structure of a turbulent explosion flame.

TASKS

The work programme of the project was subdivided into a number of tasks in order to cover the deficiencies identified:
1. experiments to determine the influence of realistic obstacle and release environments on the explosion source strength;
2. further development of experimental scaling and simple model predictive techniques for the explosion source;
3. experimental measurements of the loads on structures within the combustion zone;
4. further development of numerical models for predicting loads on structures within the combustion zone;
5. detailed measurements of drag loads and turbulence characteristics on structures in unidirectional steady and transient flow past structures;
6. extended validation of sub-grid drag models used in numerical codes;
7. detailed experimental measurements of the effects of obstacles and pre-ignition turbulence in explosions;
8. further development of numerical sub-models based on the results of the previous task.
Given the limited length of this publication, a selection has to be made of the studied subjects to be presented here in more detail. Therefore, only the work performed in tasks 1 and 6 will be highlighted.

EXPERIMENTS TO INVESTIGATE THE INFLUENCE OF AN INITIAL TURBULENCE FIELD ON THE EXPLOSION PROCESS
Test set-ups

The typical obstacle configuration used in the MERGE project was again used to perform small-, medium- and large-scale tests in order to extend the MERGE database in a consistent way. The MERGE obstacle configurations typically consisted of a number of cylinders orientated in three perpendicular directions. Configuration parameters were the number of cylinders, the cylinder diameter and the distance between cylinders. The outer dimensions of the obstacle configurations were scaled 1:2.25:4.5 for small, medium and large scale, respectively. Obstacle diameters and distances were scaled accordingly. Typical outer dimensions were approximately 2×2×1 m³, 4.5×4.5×2.25 m³ and 9×9×4.5 m³ (Mercx et al., 1995b). Ignition was always in the centre of the array at ground level.
Figure 1: Schematic view of obstacle configurations with installed turbulence-generating jet lances

In order to obtain an initial turbulence field which could be reproduced on the three geometrical scales, four lances were installed symmetrically in the obstacle configurations. The turbulence field was created by the interference of four jets flowing through the lances. A schematic overview is shown in Figure 1. A similar four-lance set-up for the generation of turbulence was used in the scaled-down mock-up of an offshore platform module in order to study initial turbulence in a realistic environment. Furthermore, tests were done in a tent without additional obstacles, to study the influence of the initial turbulence field separately from the influence of turbulence created by the interaction of the expansion flow with obstacles.

Results
Quantification of turbulence field

Initial experiments were performed to quantify the flow required to obtain a reproducible turbulence field on the three geometrical scales. Values for the r.m.s. turbulent velocity should preferably be in the range of 1-2 m/s for a so-called low initial turbulence field and 10-20 m/s for a high one. Laser Doppler Anemometry as well as Pulsed Hot Wire Anemometry were used to quantify the turbulence field. It appeared that an initial turbulence field could be created only in the immediate vicinity of the ignition point. The region of the turbulence field extended to about two obstacle distances in each direction from the ignition location. Comparison of r.m.s. values for the small and medium scales showed that the turbulence levels could be reproduced at similar sample points within a variation of about 20%. Figure 2 shows pictures of two gas explosions in a tent at a late stage of flame propagation for two different jet flow modes: no jet flow (zero turbulence) and medium initial turbulence. These pictures clearly demonstrate the influence of the initial turbulence field. Initial flame speeds were higher when there was initial turbulence. Flame speeds reduced when the flame left the initial turbulence region.
Tests in regular obstacle arrays

All tests were done with roughly stoichiometric mixtures of methane or propane in air. In some tests the oxygen content was increased, following Catlin (1991), to simulate the tests on the larger scales with the same mixture but without oxygen enrichment. Tests were done with three levels of initial turbulence: none, low and high.
The small-scale tests were all done with an obstacle diameter D of 19.1 mm and an obstacle distance of 4.65D. The obstacle arrays had either 12 or 20 cylinders in each horizontal direction. The maximum overpressures, averaged over the sample locations inside the obstacle configuration, varied from 7 kPa for methane with no initial turbulence and 12 cylinders in a horizontal row, to 140 kPa for propane, oxygen-enriched to 24%, with high initial turbulence and 20 cylinders in a horizontal row. Sixteen medium-scale tests were reported. The obstacle diameter used was 43 mm. Obstacle distances of 4.65D (20 cylinders in a horizontal row) and 7D (14 cylinders) were used. Averaged maximum overpressures varied from 14 kPa for methane/air with an obstacle distance of 7D, to 325 kPa for propane/air with an oxygen concentration of 22.5% and an obstacle distance of 4.65D. A total of four large-scale tests were performed with an obstacle diameter of 82 mm and an obstacle distance of 4.65D. Averaged maximum overpressures were 100 kPa for methane/air and 300 kPa for propane/air mixtures. All large-scale propane and medium-scale oxygen-enriched propane tests exhibited transition to detonation. The overpressures given above are for the fast combustion process, not for the detonation process, which generated pressure spikes with far higher overpressures. It appeared that, contrary to what was expected, the maximum overpressure was not influenced by the level of initial turbulence. Results were comparable to those obtained in MERGE. The only remarkable difference noticed was the time of maximum overpressure (Figure 3). Apparently the flame is accelerated in the very early stage, while it is still in the region of initial turbulence. As flame speeds are relatively low then, no increase in overpressure is noticed. Outside the region of initial turbulence the flame decelerates, until the expansion flow-induced turbulence takes over and accelerates the flame to high flame speeds and accompanying overpressures.
Figure 2.a: Late stage of flame propagation; no initial turbulence (propane/air mixture) (picture from Christian Michelsen Research)
Figure 2.b: Late stage of flame propagation; low initial turbulence (methane/air mixture) (picture from Christian Michelsen Research)

Tests in scaled offshore module

The flammable mixture with which the mock-up was filled was ignited at several locations: in the focus of the four jets, as in the TNO-PML and BG tests, as well as at other locations. An increase in overpressure was noticed only when the ignition location was in the focus of the four jets. Contrary to the tests with the regular obstacle configurations, the overpressures increased by 50 to 80%.
Figure 3: Comparison of overpressures and times of maximum overpressures for the large-scale methane tests for zero (a) (MERGE), low (b) and high (c) initial turbulence levels. (Figure from British Gas Research)
Evaluation

The influence of the initial turbulence appeared to be very minor in the tests with regular obstacle configurations. It was limited to an enhancement of the flame speed in the early stage, by which the time to maximum overpressure was reduced, depending on the level of turbulence. It is concluded that in cases with a larger region of initial turbulence the additional acceleration of the flame will continue, and an increase in maximum overpressure may be expected. Overpressures were increased by initial turbulence in the offshore module tests, although a similar system was used to produce the turbulence and the same level of initial turbulence was obtained. The difference may be that, due to the difference in obstacle size, the turbulence length scale was different in the two situations.

OVERALL CODE VALIDATION: PREDICTION OF LARGE-SCALE EXPLOSION EXPERIMENTS WITH INITIAL TURBULENCE

The explosion codes FLACS/CMR (van Wingerden et al., 1993), AutoReaGas/TNO-PML (van den Berg et al., 1994), EXSIM/Tel-Tek (Hjertager et al., 1992) and COBRA/BG (Catlin et al., 1995) were applied to calculate the initial turbulence gas explosion experiments. The modelling approach for ignition of the vapour cloud with initial turbulence was validated against the medium-scale experiments with initial turbulence. Predictions were performed for the corresponding large-scale experiments prior to their execution. This exercise is similar to the one performed in MERGE (Popat et al., 1996). Additionally, the Shell fractal scaling method was used to predict the large-scale test results based on the medium-scale test results. For a change in scale by a factor L, the fractal scaling theory predicts that overpressures increase by a factor of L^0.712 and times by a factor of L^0.644. All codes were capable of predicting the main finding of the medium-scale experiments, namely that the peak pressure is not influenced much by the initial turbulence region close to ignition; it is only the time from ignition to peak pressure that is influenced. Figure 4 shows a typical result. All modellers used their own basic assumptions and modelled the initial turbulence field as was considered best for their code in order to come up with large-scale predictions. Predictions were made for all sample locations inside the vapour cloud. The results for the predictions at the sample point closest to the ignition point are given in Tables 1 and 2. The large-scale propane/air tests exhibited transition to detonation; no code is capable of simulating this process, and the overpressures given in Tables 1 and 2 are for the deflagrative combustion. The bands, as a percentage of the experimental result, within which the predictions fall are -40 to +7% for the peak overpressure, -70 to +20% for the duration and -47 to +65% for the time of maximum pressure. The predictions for methane and propane according to the various codes are quite different, the propane pressure being consistently underestimated. The degree of agreement between predictions and experiments is somewhat better than in the previous exercise in MERGE.
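As a small illustration of the fractal scaling rule quoted above (the overpressure exponent 0.712 is as stated in the text; L = 2 is the medium-to-large step implied by the 1:2.25:4.5 geometry, and the function name is illustrative):

```python
def fractal_scale(p_medium_kpa, t_medium_ms, L=2.0):
    """Shell fractal scaling: overpressure x L^0.712, times x L^0.644."""
    return p_medium_kpa * L ** 0.712, t_medium_ms * L ** 0.644
```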
Figure 4: AutoReaGas calculations for medium-scale tests with zero (a), low (b) or high (c) turbulence intensity.
TABLE 1
COMPARISON OF CFD PREDICTIONS, SCALING AND MEASUREMENTS FOR BG METHANE/AIR LARGE-SCALE EXPERIMENTS AT THE TRANSDUCER LOCATION CLOSEST TO THE IGNITION LOCATION

                 |       Low turbulence        |       High turbulence
                 | Pmax   Duration   t(Pmax)   | Pmax   Duration   t(Pmax)
                 | (kPa)  (ms)       (ms)      | (kPa)  (ms)       (ms)
Experiment       | 100    30         112       | 110    32         70
FLACS            | 98     39         128       | 89     39         99
AutoReaGas       | 107    32         96        | 109    31         77
COBRA            | 96     31         85        | 96     32         77
EXSIM            | 72     33         173       | 73     33         124
Fractal scaling  | 100    28         111       | 107    29         81
TABLE 2
COMPARISON OF CFD PREDICTIONS, SCALING AND MEASUREMENTS FOR BG PROPANE/AIR LARGE-SCALE EXPERIMENTS AT THE TRANSDUCER LOCATION CLOSEST TO THE IGNITION LOCATION

                 |       Low turbulence        |       High turbulence
                 | Pmax   Duration   t(Pmax)   | Pmax   Duration   t(Pmax)
                 | (kPa)  (ms)       (ms)      | (kPa)  (ms)       (ms)
Experiment       | 330    14         90        | 320    39         101
FLACS            | 261    24         128       | 257    23         99
AutoReaGas       | 245    24         81        | 278    25         65
COBRA            | 192    24         70        | 193    23         63
EXSIM            | 232    23         120       | 229    17         71
Fractal scaling  | 223    19         94        | 294    -          62
CONCLUSIONS AND RECOMMENDATIONS

An extensive research programme has been carried out in the field of gas explosions, of which only two of the items studied could be described in this paper. The research project comprised the performance of experiments as well as theoretical investigations. The large number of experiments performed on the various items was necessary to partially cover the lack of experimental data. Therefore, experiments were done at various scientific levels, all with one general subject in mind: to investigate the influence of pre-ignition turbulence. Experiments were done:
• at a practical level, to study circumstances as realistic as possible;
• at a generic level, to supply data detailed enough to investigate and validate practical as well as CFD-type models;
• at laboratory level, to provide data for development and testing of submodels used in CFD codes;
• at laboratory level, to perform fundamental studies on specific physical features with respect to gas explosions.
The experimental data obtained form a valuable extension of the MERGE database. By studying the experimental results, a considerable step forward has been taken in understanding the mechanisms of gas explosions. It has been demonstrated that the explosion effects in realistic circumstances might be more severe than would be expected a priori. Progress has been made in the development of practical models for source strength prediction. Fundamental research into turbulent premixed flames shows that the present level of modelling adopted in the CFD codes can give acceptable results for practical applications. However, much improvement is required before all details of the physics involved can be simulated satisfactorily.
The exercise of predicting the large-scale experimental results with CFD codes showed better results than the previous exercise in MERGE, although the situation to simulate was more complicated because of the initial turbulence field. Until now, research into gas explosions in realistic scenarios has been concentrated on the realistic representation of the obstacles and obstructions present in the gas cloud. Very high overpressure values were obtained in the experiments. It must be realised that most of the tests were done with homogeneous concentrations of the gaseous fuel. It can be questioned whether these high overpressures will also occur in situations with non-homogeneous concentration distributions. A future direction for gas explosion research should be the investigation of realistic flammable clouds.
REFERENCES

Berg A.C. van den, The H.G., Mercx W.P.M., Mouilleau Y. and Hayhurst C.J. (1994). AutoReaGas - a CFD tool for gas explosion hazard analysis. Procs of Symp. on Loss Prevention and Safety Promotion in the Process Industries, 19-22 June 1995, Antwerp, Belgium.
Catlin C.A. (1991). Scale effects on the external combustion caused by venting of a confined explosion. Combustion and Flame, 83, 399-411.
Catlin C.A., Fairweather M. and Ibrahim S.S. (1995). Predictions of turbulent premixed flame propagation in explosion tubes. Combustion and Flame, 102, 115-128.
Hjertager B.H., Solberg T. and Nymoen K.O. (1992). Computer modelling of turbulent gas explosion propagation in offshore modules. J. Loss Prevention in the Process Industries, 5, 165-174.
Mercx W.P.M., Johnson D.M. and Puttock J.S. (1995a). Validation of scaling techniques for experimental vapour cloud explosion investigations. Process Safety Progress, 14-2, 120-130.
Mercx W.P.M. (1995b). Modelling and experimental research into gas explosions. Procs of Symp. on Loss Prevention and Safety Promotion in the Process Industries, 19-22 June 1995, Antwerp, Belgium.
Mercx W.P.M. (editor, 1997). Extended modelling and experimental research into gas explosions - final summary report for the project EMERGE, EC contract EV5T-CT-93-0274. TNO Prins Maurits Laboratory, Rijswijk, The Netherlands.
Popat N.R., Catlin C.A., Arntzen B.J., Hjertager B.H., Solberg T., Saeter O., Lindstedt R.P. and Berg A.C. van den (1996). Investigations to improve and assess the accuracy of computational fluid dynamic (CFD) based explosion models. J. Hazardous Materials, 45, 1-25.
Wingerden K. van, Storvik I., Arntzen B., Teigland R., Bakke J.R., Sand I.O. and Sorheim H.R. (1993). FLACS-93, a new explosion simulator. Procs of the 2nd Int. Conf. and Exh. on Offshore Structural Design Against Extreme Loads, ERA report 93-0843, pp 5.2.1-5.2.14, London, United Kingdom.
MODELLING OF A LOW DISCHARGE AMMONIA RELEASE

Gilles Dusserre, Aude Bara
Ecole Nationale Supérieure des Techniques Industrielles et des Mines d'Alès, Laboratoire Génie de l'Environnement Industriel, 6, Avenue de Clavières, 30319 Alès Cedex
ABSTRACT

Experiments have been carried out by the School of Mines on ammonia liquefied under pressure. The aim was to assess the ability of Gaussian models to predict the concentration of heavy gases in the near field when a quick, order-of-magnitude response is needed, e.g. for firemen at a chemical accident. Using the Doury standard deviations (σ) leads to underprediction of the concentrations compared with those observed on the site; the Pasquill σ seem to give better results. Of course, these conclusions depend on the estimation of the atmospheric stability class, which is sometimes difficult since meteorological measurements may be lacking. They also depend on the experimental conditions tested here, and particularly on the discharge flow rate, so the results given in the present paper cannot yet be extended to the larger releases that can be expected in real accidents.
INTRODUCTION

Today's firefighters are confronted with a particular kind of hazard: the spillage of toxic or flammable substances, which can occur either during transportation or at an industrial storage site. In case of such an event the firemen need quick information on the chemical concentrations to which they may be exposed, in order to act as quickly but also as safely as possible. Since the integrated dispersion codes require many input data, which may be impossible to collect or measure on site, and since 3-D models need a long calculation time, Gaussian models seem to be the most appropriate answer to this special kind of need. Many experimental heavy gas releases have already been carried out in Europe (air/Freon mixtures released at Thorney Island (Koopman, 1988)) and in the USA (ammonia tests of Desert Tortoise (Goldwire, 1986)), which provided data to validate the models. But they all deal with large amounts of hazardous materials, and the concentrations were measured at distances much greater than 100 meters. It seems important to collect data on smaller releases so as to validate the models in the near field, because accidents can occur at every scale, including continuous leaks from a small storage. In order to evaluate the ability of simple Gaussian models to predict the near-field concentrations resulting from the complex dispersion of heavy gases, the School of Mines has chosen to carry out experiments on an airport located in the south of France. In the next sections the experimental conditions will be described, and the observed concentrations compared to the predicted ones.
Experimental Facilities

Since such experiments need a large and flat area, the present tests took place on an airport in the south of France, whose ground was made of dirt and gravel (maximum roughness of 1 cm). To guarantee the safety of the experiments, the presence and help of the local firefighters was necessary. The tests also constituted good practice for the firemen, since the use of a toxic chemical placed them in conditions very similar to those of an accident. The chemical substance chosen was ammonia, since it is the only hazardous material having a heavy gas behaviour under particular conditions, coupled with a relatively low toxicity compared to other heavier-than-air gases like chlorine. The ammonia was stored liquefied under its vapour pressure (8.6 bar at 20°C) and released from a 44 kg bottle placed head-down in order to generate a 2-phase flow. The jet was pointed nearly horizontally at 30 cm above the ground. The discharge flow rate, measured by weighing the bottle before and after the release, was approximately 10 kg/min. Under the sluice-gate of the storage, a metallic pool was placed on the soil in order to retain the liquid phase. An additional device protected the firemen acting directly on the bottle from liquid ammonia projections. This appliance can be regarded as a factor reducing the ammonia concentration, since it keeps the liquid phase at a colder temperature than the ambient one, thus reducing the evaporation of the ammonia pool. The ammonia concentrations were determined by trapping the chemical in hydrochloric acid and measuring the ammonium ions formed by means of spectrophotometry with the Nessler reagent (Norme AFNOR NF T 90-015). The gas was pumped at a constant rate during the whole release. This device allowed averaged ground-level concentrations to be measured as the cloud crossed over the measurement points (see Figure 1).
Figure 1. Experimental facilities on the site (plan view: NH3 source and retention pool, wind direction, and ground-level measurement points at downwind distances of 13, 25, 35 and 50 m with crosswind offsets between -8 m and +8 m)
Two meteorological stations measured data continuously throughout the trials, at heights of 2 and 4 meters: humidity level, ambient temperature, and speed and direction of the wind. The atmospheric stability on the site was estimated from observation of the cloud cover and the averaged wind speed, in accordance with the Pasquill-Turner tables.
AMMONIA CLOUD BEHAVIOUR

The ammonia is stored at high pressure and at a temperature above its boiling point. Thus its release to the atmosphere, which brings the initial pressure down to atmospheric pressure, leads to a 2-phase flow. A thermodynamic flash occurs immediately as the bottle is opened. One fraction of the liquid cools to the ammonia boiling point (240 K) and spreads in the shape of a pool, while the other part of the liquid phase forms an aerosol. Indeed, under the violent depressurization the liquid is broken up into droplets which are held up in the gaseous cloud (Alp and Mathias, 1991). The resulting cloud has the appearance of a dense white mist. This phenomenon can be enhanced, depending on the ambient humidity level, since the air moisture may condense in contact with the cold gas. The presence of the aerosols (Hodin, 1996) gives the cloud a density much higher than that of the surrounding air, and the cloud behaves as a heavy gas. It goes through a gravity slumping stage due to its excess density, and air is entrained into it. Independently of the jetting influence due to the release mode, the cloud dispersion is greatly influenced by the wind (Koopman). The wind can accelerate the dispersion of the gas by increasing the entrainment of air and the evaporation of the aerosols. When the cloud is sufficiently diluted in the ambient air, it behaves as a passive gas and the dispersion is only due to the natural atmospheric turbulence.
Experimental results

In this paper we have chosen to present one of the several tests carried out since January 1996, namely the release made on July 25th.
Meteorological conditions on July 25th

The data measured on the site on this day are presented in Table 1.
TABLE 1
METEOROLOGICAL CONDITIONS ON JULY 25TH

Ambient temperature | 16°C
Humidity level      | 58%
Wind speed          | 4 m/s
Wind direction      | South-East
The sky being cloudy, and taking into account the wind speed of 4 m/s, we can estimate that the atmospheric stability class was C or D, in accordance with the Pasquill-Turner definitions (Turner, 1967).
Ammonia concentrations

Table 2 presents the ammonia concentrations (ppm) measured on the site.
TABLE 2
AMMONIA CONCENTRATIONS (PPM) MEASURED ON THE JULY 25TH TRIAL

Crosswind distance y (m) | x = 13 m | x = 25 m | x = 35 m | x = 50 m
8                        | 860      | 2080     |          |
6                        | 2620     | 2310     | 210      | 170
-3                       | 105      |          |          |
-6                       | 65       |          |          |
-8                       | 80       | 60       |          |
Unfortunately, the measurement axis did not coincide exactly with the main wind direction, which explains why the ammonia concentrations are higher for y>0. We can estimate that there was an angle of approximately 15° between these two axes. This misalignment, which cannot be reproduced by any model (Duijm, 1996), is taken into account in the next section by comparing the predicted concentrations on the axis to the observed concentrations for y>0.
GAUSSIAN MODELLING

The Gaussian model used here is the puff model for continuous releases (Crabol, 1995). In accordance with the literature (Fülleringer, 1995), and in order to reflect crisis-management practice (simple calculations), we assume that 60% of the ammonia discharge participates in the formation of the cloud (flash and aerosols), i.e. a gaseous flow rate of 6 kg/min. The parameters used to measure the performance of the models are the Mean Relative Bias (MRB), the Mean Relative Square Error (MRSE) and the Factor-of-2 (FA-2) (Duijm, 1996). The MRB ranges from -2 to 2 with an optimum value of 0; a negative MRB indicates underprediction and a positive MRB overprediction. The FA-2, commonly used in the case of heavy gases, ranges from an optimal value of 100% down to 0. For the present evaluation, the data at 50 m from the source will not be used, because it was not possible to estimate the real ammonia concentration in the wind direction at that distance.
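A sketch of these performance measures is given below, using the definitions common in the dense-gas model-evaluation literature (Duijm et al., 1996); the exact formulas are not reproduced in the paper, so the function bodies are assumptions consistent with the stated ranges and sign conventions.

```python
import numpy as np

def mrb(pred, obs):
    """Mean Relative Bias, range -2..2, 0 optimal; negative = underprediction."""
    return np.mean(2 * (np.asarray(pred) - obs) / (np.asarray(pred) + obs))

def mrse(pred, obs):
    """Mean Relative Square Error, 0 optimal."""
    return np.mean(4 * (np.asarray(pred) - obs) ** 2 / (np.asarray(pred) + obs) ** 2)

def fa2(pred, obs):
    """Percentage of predictions within a factor of 2 of the observations."""
    r = np.asarray(pred) / np.asarray(obs)
    return 100.0 * np.mean((r >= 0.5) & (r <= 2.0))
```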
CEA-Doury model

Taking into account the meteorological conditions observed, the atmospheric diffusion can be regarded as normal (Doury, 1984). The derived modelling gives the results shown in Table 3, from which MRB = -0.847, MRSE = 1.217 and FA-2 = 50% were calculated, indicating relatively poor performance. This is especially the case for the crosswind predictions (y ≠ 0). But Doury worked above all on nuclear dispersion problems, at much longer distances (100 to 1000 km).
TABLE 3. PREDICTED CONCENTRATIONS (PPM) WITH THE DOURY SIGMAS IN ND (NORMAL DIFFUSION)

                        Downwind distance x (m)
Crosswind y (m)       13       25       35       50
0                   3442     1153      656      361
3                    208        -        -        -
6                     30        -        -        -
8                     17        -        -        -
Pasquill sigmas

As explained above, the atmospheric stability can be evaluated as class C or D. Accordingly, two simulations have been performed (Tables 4 and 5).
TABLE 4. PREDICTED CONCENTRATIONS (PPM) WITH THE PASQUILL SIGMAS IN STABILITY C

                        Downwind distance x (m)
Crosswind y (m)       13       25       35       50
0                   3471      947      487      241
3                    774        -        -        -
6                    184        -        -        -
8                    109        -        -        -
Under atmospheric stability C, the performance parameters are MRB = -0.317, MRSE = 0.324 and FA-2 = 67%. These results highlight the fact that the Pasquill prediction is better than the Doury prediction.
TABLE 5. PREDICTED CONCENTRATIONS (PPM) WITH THE PASQUILL SIGMAS IN STABILITY D

                        Downwind distance x (m)
Crosswind y (m)       13       25       35       50
0                   8985     1430      736      365
3                   1206        -        -        -
6                    290        -        -        -
8                    172        -        -        -
For stability class D, MRB = 0.158, MRSE = 0.378 and FA-2 = 83%. This shows that in this case the model overpredicts the concentrations overall, particularly near the ammonia source.
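For orientation, a minimal ground-level centerline sketch of this kind of calculation is given below. It assumes Briggs rural sigma correlations and full ground reflection, whereas the paper uses the Doury and Pasquill-Turner parameterisations, so the numbers will not reproduce Tables 3 to 5 exactly.

import numpy as np

def sigmas(x, stability):
    """Briggs rural dispersion coefficients (assumed; not the paper's sigmas)."""
    if stability == "C":
        return 0.11 * x / np.sqrt(1 + 1e-4 * x), 0.08 * x / np.sqrt(1 + 2e-4 * x)
    if stability == "D":
        return 0.08 * x / np.sqrt(1 + 1e-4 * x), 0.06 * x / np.sqrt(1 + 1.5e-3 * x)
    raise ValueError(stability)

Q, u, M = 6.0 / 60.0, 4.0, 17.03   # source 6 kg/min in kg/s, wind m/s, NH3 g/mol

def centerline_ppm(x, stability):
    sy, sz = sigmas(x, stability)
    c = Q / (np.pi * sy * sz * u)     # kg/m3, ground-level source with reflection
    return c * 1e6 * 24.45 / M        # mg/m3 -> ppm (molar volume near 25 degC)

for x in (13, 25, 35, 50):
    print(x, round(centerline_ppm(x, "C")), round(centerline_ppm(x, "D")))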
DISCUSSION

The following figure presents a comparison between the experimental data (in the presumed direction of the wind) and the different predicted concentrations on the axis of the model. From this graph, it appears that the Doury and Pasquill models are equivalent in predicting the concentrations on the axis. It must be remembered, however, that the crosswind concentrations are much underpredicted by Doury, whereas Pasquill gives better results.

Figure 2. Comparison between the experimental and the predicted data (Trial, Doury DN, Pasquill C4, Pasquill D4; concentration in ppm versus distance in m).
CONCLUSION

The Gaussian models could be employed to predict the near-field concentrations in an emergency, taking into account the special conditions tested during the present trials: low discharge rate, C or D atmospheric stability, etc. It also seems that the Pasquill model is better at predicting the crosswind concentrations.
ACKNOWLEDGEMENTS

We would like to thank the firemen from the local fire stations of Alès and La Grand'Combe.
REFERENCES
Alp E., Mathias C.S. (1991). COBRA: a heavy gas/liquid spill and dispersion modelling system. Journal of Loss Prevention in the Process Industries, Vol. 4, p. 139.
Crabol B. (1995). Méthodes d'évaluation de la dispersion des gaz dans l'atmosphère. CEA.
Doury A. (1984). Une méthode d'approximation instantanée de la dispersion atmosphérique. Rapport DAS n°52, CEA/IPSN.
Duijm N.J., Ott S., Nielsen M. (1996). An evaluation of validation procedures and test parameters for dense gas dispersion models. Journal of Loss Prevention in the Process Industries, Vol. 9, n°5, pp. 323-338.
Fülleringer D. (1995). Evaluation de la sûreté des installations utilisant de l'ammoniac. Rapport du Département d'Evaluation de Sûreté, IPSN.
Goldwire H.C. (1986). Large-scale ammonia spill tests. Chemical Engineering Progress, n°4, pp. 35-40.
Hodin A. (1996). Modélisation du débit à la brèche et du jet généré par une fuite d'ammoniac en phase liquide: état des connaissances et modélisation des rejets diphasiques. Note d'Etude, Direction de l'Equipement d'EDF, Centre Lyonnais d'Ingénierie.
Koopman R.P. (1988). Atmospheric dispersion of large scale spills. Chem. Eng. Comm., Vol. 63, pp. 61-86.
Koopman R.P., Ermak D.L., Chan S.T. (1989). A review of recent field tests and mathematical modelling of atmospheric dispersion of large spills of denser than air gases. Atmospheric Environment, Vol. 23, n°4, pp. 731-745.
Norme AFNOR NF T 90-015. Dosage de l'azote ammoniacal.
Turner D.B. (1967). Workbook of atmospheric dispersion estimates. US Department of Health, Education and Welfare.
B12: Pipeline Safety
RISK ASSESSMENT OF PIPELINES

O.N. Aneziris, I.A. Papazoglou

Institute of Nuclear Technology-Radiation Protection, National Center for Scientific Research "DEMOKRITOS", Aghia Paraskevi 15310, Greece.
ABSTRACT
Methodological steps for the Quantitative Risk Assessment of pipelines containing either toxic or flammable substances are presented. The effect of break location on the results is studied through a sensitivity analysis of individual risk with respect to the number of possible break locations along the length of a pipeline carrying ammonia (toxic) or LPG (flammable). The computer package SOCRATES (Safety Optimization Criteria and Risk Assessment Tools for Emergency and Siting) has been used to indicate the effect of the discretization of the pipeline on the isorisk contours, as well as on the area within a certain risk level.
KEYWORDS
Pipelines, Risk assessment, chemical industry, LPG, ammonia.
PIPELINE RISK ASSESSMENT
Many studies have been performed regarding the causes of pipeline failure and their frequencies, as described by Geyer et al, Muhlbauer et al and Hovey et al, but only a few address pipeline risk assessment, such as the one performed by Bodner et al concerning the release of toxic hydrogen sulfide.
Assessment of damage states and their frequency of occurrence

The first step in the Quantified Risk Analysis of a pipeline consists in identifying the potential damage states and the causes of failure. The main causes of pipeline failures are corrosion, overpressure, human error, third party damage and impact, as described by Hurst et al and by Muhlbauer. Damage states are defined in terms of the size and location of the break and the phase of the release. For certain causes of failure the location of the release can be determined (e.g. the location of a valve which is inadvertently opened owing to a human error). For other causes, like corrosion, the possible position of the break can extend over the entire length of the pipeline. It follows that at least one important parameter for damage state determination is characterised by uncertainty.
In this analysis this problem is addressed by considering a number J of possible break locations evenly distributed along the length of the pipeline and having equal probability of occurrence. Let f be the frequency with which a break is expected in a pipeline of total length L. Then the above assumption is equivalent to considering one installation (pipeline) characterised by J plant damage states (one for each possible location), each having a frequency of occurrence

f_j = f / J    (1)
The paper calculates the effect of the number J of possible break locations on the calculated risk.
Consequence Assessment of the Release of Toxic or Flammable Material

Next, the consequences to public and worker health owing to the release of hazardous substances must be established.
Toxic substances

For toxic substances the assessment of the consequences involves the following procedural steps:
Determination of Release Categories for Toxic Materials. A release category defines all necessary physical conditions, phenomena and parameters that uniquely determine the concentration of a toxic substance at each point in the area around the source or, equivalently, all the conditions (installation-dependent and environmental) that affect atmospheric dispersion. It includes the quantity and physical conditions of the substance released from its containment (outflow models), the evaporation rate (if released in liquid form), and the weather conditions. Details of the outflow and evaporation models are given by Papazoglou et al (1996) and by the Yellow Book. In the case of a pipeline break, the release category is defined by the diameter of the orifice, the pump rate, the duration of the release and the weather conditions required by the dispersion model. The weather parameters considered are: atmospheric temperature, weather stability, wind speed, and wind direction. Uncertainties have been assumed only in the weather conditions.
Atmospheric Dispersion of Toxic Materials. In this step a model simulating the dispersion of a toxic substance is established. The model estimates the concentration of the toxic substance as a function of time and space. Each release category leads to a specific concentration level for each point in time and space. Atmospheric dispersion in this analysis has been calculated by a simple box model for the dispersion of heavier-than-air gases over flat terrain contained in SOCRATES. The model is based on the one presented by Jagger and is described by Papazoglou et al, 1992.
Dose Assessment. Given the concentration of the toxic substance, an individual in the general area of the installation will receive a certain dose (by inhalation) of the toxic substance. This also depends on any implemented emergency response plan. For toxic substances the dose is calculated on the basis of the concentrations calculated by the dispersion model and the exposure of an individual to these concentrations. For details the reader is referred to Papazoglou et al, 1996 and the Green Book.
Consequence Assessment. A dose/response model receives as input the dose calculated by the dose model and calculates the probability of fatality for the individual receiving the dose (see Papazoglou et al, 1996 and the Green Book).

Flammable Substances
A parallel set of major steps can be distinguished for the assessment of the consequences of released flammable substances.
Determination of Release Categories of Flammable Material. A release category for flammable materials defines all necessary physical conditions, phenomena and parameters that uniquely determine the level of thermal flux or the overpressure at each point in the area around the emission source. For example, in the case of LPG, it is established whether a jet fire will take place or whether an explosion or flash fire will result following atmospheric dispersion of the gas.
Estimation of Heat Radiation and Peak Overpressure. In this step, a model for simulating the heat radiation or the peak overpressure resulting from the released flammable material and the associated physical phenomenon is established. Next the heat radiation and/or the peak overpressure is calculated. In the case of a pipe release of a flammable substance which ignites immediately, the levels of thermal radiation as a function of distance are calculated according to jet models. More details on the jet model are presented in the Yellow Book. If there is a break of a pipe containing liquefied flammable gas, the gas will disperse and, if it encounters an ignition source, the mixture of gas and air may either explode and cause damage to the surroundings owing to the shock wave, or burn as a flash fire over a short period (delayed ignition). Details on the flash fire and explosion models are presented in the Yellow Book.
Dose Assessment. The integrated, over time, exposure of an individual to the extreme phenomenon generated by the flammable material is calculated. This defines the "dose" an individual receives. For substances causing high levels of thermal radiation, the dose is calculated on the basis of thermal fluxes, while in the case of explosions it is calculated in terms of the overpressure (see Papazoglou et al, or the Green Book).
Consequence Assessment. Appropriate dose/response models, receiving as input the dose of heat radiation or overpressure, calculate the probability of fatality or injury of the individual receiving the dose (see Papazoglou et al, or the Green Book).
Risk Integration

Integration of the results obtained so far, that is, combining the frequencies of the various accidents with the corresponding consequences, results in the quantification of risk. Here the measure of individual risk is used. Individual fatality risk is defined as the frequency (probability per unit time) that an individual at a specific location (x,y) relative to the installation will die as a result of an accident in the installation. Individual fatality risk is usually expressed per unit of time (e.g. per year) of installation operation. Individual fatality risk is calculated as
follows. Let:

j: an index spanning the space of possible break locations (j = 1, ..., J)
r: an index spanning the space of possible release categories (r = 1, ..., R)
f_rj: the frequency of the r-th release category at the j-th break location
p_rj(x,y): the conditional probability of fatality for an individual at location (x,y), given release category r and break location j
R(x,y): the frequency of fatality for an individual at location (x,y) (individual risk).

It follows that:

R(x,y) = Σ_j Σ_r p_rj(x,y) · f_rj,    j = 1, ..., J;  r = 1, ..., R    (2)
It follows from equation (2) that, for a given set of assumptions about weather conditions and other parameters affecting the release categories, R(x,y) depends on the degree of discretization (J) of the pipeline.
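As a sketch of how equations (1) and (2) combine, the following hypothetical fragment evaluates R(x,y) by direct summation; the fatality-probability callable and the even split over release categories are illustrative assumptions, not the SOCRATES implementation.

def individual_risk(x, y, freqs, p_fatal):
    """Equation (2) as a direct double sum. freqs[j][r] holds f_rj; p_fatal is
    a hypothetical callable returning p_rj(x, y) from the consequence models."""
    return sum(freqs[j][r] * p_fatal(x, y, j, r)
               for j in range(len(freqs)) for r in range(len(freqs[j])))

# Even split of a total break frequency f over J locations (equation 1),
# with a toy uniform split over R_cat release categories:
f_total, J, R_cat = 1e-4, 5, 3
freqs = [[f_total / J / R_cat] * R_cat for _ in range(J)]
print(individual_risk(0.0, 100.0, freqs, lambda x, y, j, r: 0.01))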
RESULTS FROM PIPELINE RISK ASSESSMENT
Risk assessment has been performed for two pipelines: one containing flammable LPG and the other toxic ammonia. Both pipes connect a ship containing the loaded substance to a tank. Isorisk contours and the risk areas between particular individual risk levels for pipelines with various break locations have been calculated and are presented in the following. The above analysis has been performed with the computer code SOCRATES (Papazoglou et al, 1996).
Pipeline containing flammable LPG

A four-inch, 1000 m long pipeline loads an LPG tank at a rate of 120 kg/s when the ship pumps are operating. If it fails, LPG will be released to the environment for ten minutes. In case of immediate ignition a jet fire will occur, while in case of delayed ignition either a flash fire or an explosion will occur. Isorisk curves (conditional on a break in the pipeline) in case of an explosion following a release are given in Figures 1 and 2 for two cases: a) 1 segment of 1000 m and b) 50 segments of 20 m. Table 1 gives the variation of the maximum distance between particular conditional risk levels as a function of the number of segments of the pipeline. It follows that this distance does not change with finer discretization after considering about two segments of 500 m each. That is, no substantial accuracy is achieved by segmenting the pipeline into more than 2 segments.
TABLE 1
MAXIMUM DISTANCE OF PARTICULAR INDIVIDUAL RISK LEVEL IN CASE OF EXPLOSION (m)

RISK LEVEL      NUMBER OF SEGMENTS
                1      2      5      10     20     50
R=10⁻⁴          970    930    930    930    930    930
R=10⁻³          670    690    670    670    650    650
R=10⁻²          450    370    370    370    370    370
A similar analysis has been performed for the case of a flash fire following a release. Table 2 gives the variation of the area between particular conditional risk levels as a function of the number of segments of the pipeline. It follows that the area within the various risk levels does not change with finer discretization after considering about five segments of 200 m each. That is, no substantial accuracy is achieved by segmenting the pipeline into more than 5 segments.
Figure 1: Isorisk curves (conditional) in case of explosion. One segment.
Figure 2: Isorisk curves (conditional) in case of explosion. Fifty segments.
TABLE 2
MAXIMUM DISTANCE OF PARTICULAR INDIVIDUAL RISK LEVEL IN CASE OF FLASH FIRE (m)

RISK LEVEL      NUMBER OF SEGMENTS
                1      2      5      10     20     50
R=10⁻⁴          810    810    810    810    810    810
R=10⁻³          560    510    520    520    520    520
R=10⁻²          310    270    220    220    220    220
Pipeline containing ammonia

A 20 cm diameter, 1000 m long pipeline loads ammonia to a tank at a rate of 100 kg/s when the ship pumps are operating. If it fails, ammonia will be released to the environment and the population will be exposed to the toxic cloud for ten minutes. Isorisk curves (conditional on a break in the pipeline) following a release are given in Figures 3 and 4 for the case of 1 segment and of 50 segments of 20 m each, respectively. Table 3 gives the variation of the maximum distance between the pipeline and particular conditional risk levels as a function of the number of segments of the pipeline. It follows that the area within the various risk levels does not change with finer discretization after considering about five segments of 200 m each. That is, no substantial accuracy is achieved by segmenting the pipeline into more than 5 segments.
Figure 3: Isorisk curves (conditional) in case of ammonia release with rate 100 kg/s for 600 s. One segment.
Figure 4: Isorisk curves (conditional) in case of ammonia release with rate 100 kg/s for 600 s. Fifty segments.
TABLE 3
MAXIMUM DISTANCE OF PARTICULAR INDIVIDUAL RISK LEVEL IN CASE OF AMMONIA RELEASE 100 kg/s FOR 600 s (m)

RISK LEVEL      NUMBER OF SEGMENTS
                1      2      5      10     20     50
R=10⁻⁴          1670   1580   1580   1580   1580   1580
R=10⁻³          1250   1140   1170   1170   1170   1170
R=10⁻²          825    710    730    730    730    730
R=10⁻¹          450    300    320    320    320    320
Isorisk curves (conditional on a break in the pipeline) following a release are given in Figures 5 and 6 for the case of 1 segment and of 50 segments of 20 m each, respectively, when the exposure time to the toxic cloud is reduced to five minutes. Table 4 gives the variation of the maximum distance between the pipeline and particular conditional risk levels as a function of the number of segments of the pipeline. It follows that the area within the various risk levels does not change with finer discretization after considering about ten segments of 100 m each.
Figure 5: Isorisk curves (conditional) in case of ammonia release with rate 100 kg/s for 300 s. One segment.
Figure 6: Isorisk curves (conditional) in case of ammonia release with rate 100 kg/s for 300 s. Fifty segments.
TABLE 4
MAXIMUM DISTANCE OF PARTICULAR INDIVIDUAL RISK LEVEL IN CASE OF AMMONIA RELEASE 100 kg/s FOR 300 s (m)

RISK LEVEL      NUMBER OF SEGMENTS
                1      2      5      10     20     50
R=10⁻⁴          1175   1080   1070   1050   1050   1050
R=10⁻³          725    660    660    640    640    640
R=10⁻²          450    400    380    380    380    370
R=10⁻¹          250    170    150    150    140    120
CONCLUSIONS

Discretization of pipelines for QRA purposes could play an important role in a certain class of decisions. From the examples presented here it follows that in the case of flammable material the area of a particular conditional risk level (e.g. 10⁻²) could extend up to 450 m from the pipeline with no discretization, while only up to 370 m with discretization. This difference of 80 m per km of pipeline (on each side) might be of rather significant value for pipelines of moderate to large lengths. In general the importance of discretization depends on the particular conditions of the accident and hence on the extent of the consequences of interest. The larger the extent of the consequences with respect to the pipeline length under consideration, the lower the importance of segmentation. This is exemplified by the case of the ammonia release (100 kg/s for 600 s), where the 10⁻² individual risk area contains the 1 km pipeline in the case of one segment (Fig. 4). In this case discretization does not affect the results (see Table 3).
If, on the other hand, the extent of the consequences is small with respect to the length of the pipeline, as in the case of the shorter ammonia release (100 kg/s for 300 s), then a higher degree of discretization is required to realistically assess the required distances. It follows that a simple analysis like the one presented in this paper is recommended in the case of pipelines of significant length and/or where the value of the affected land is of particular importance.
REFERENCES
Bodner, A.I., Greenwood, B.W., Hudson, J.M. (1990). Risk Analysis of a Sour Gas Pipeline Using a Personal Computer. Rel. Eng. and System Safety, Vol. 30, 455.
Geyer, T.A.W., Bellamy, L.J., Astley, J.A., Hurst, N.W. (1990). Prevent Pipe Failures due to Human Errors. Chemical Engineering Progress, November, 66-69.
Green Book. Committee for the Prevention of Disasters (1989). Methods for the calculation of possible damage. TNO, Voorburg, The Netherlands.
Hovey, D.J., Farmer, E.J. (1993). Pipeline accident, failure probability determined from historical data. Oil & Gas Journal, July 12, 104-107.
Hurst, N.W., Bellamy, L.J., Geyer, T.A.W., Astley, J.A. (1991). A classification scheme for pipework failures to include human and sociotechnical errors and their contribution to pipework failure frequencies. Journal of Hazardous Materials, 26, 159-186.
Jagger, S.F. (1983). Development of CRUNCH: A Dispersion Model for Continuous Releases of a Denser-than-air Vapour into the Atmosphere. SRD Report R229, UKAEA, Safety and Reliability Directorate, Warrington.
LPG A Study. A Comparative analysis of the risks inherent in the storage, transhipment, transport and use of LPG and motor spirit, 10 Main Report LPG (1983). TNO, Voorburg, The Netherlands.
Muhlbauer, W.K. (1992). A proactive approach to pipeline risk assessment. Pipe Line Industry, July, 29-31.
Papazoglou, I.A., Aneziris, O., Bonanos, G., and Christou, M. (1996). SOCRATES: a computerized toolkit for quantification of the risk from accidental releases of toxic and/or flammable substances. In Gheorghe, A.V. (Ed.), Integrated Regional Health and Environmental Risk Assessment and Safety Management, published in Int. J. Environment and Pollution, Vol. 6, Nos 4-6, pp. 500-533.
Papazoglou, I.A., Christou, M., Nivolianitou, Z., Aneziris, O. (1992). On the management of severe chemical accidents. DECARA: A computer code for consequence analysis in chemical installations. Case study: Ammonia plant. Journal of Hazardous Materials, 31, 135-153.
Yellow Book. Committee for the Prevention of Disasters (1992). Methods for the calculation of physical effects of escape of dangerous materials. TNO, Voorburg, The Netherlands.
QUANTIFIED RISK ANALYSIS IN TRANSPORT OF DANGEROUS SUBSTANCES: A COMPARISON BETWEEN PIPELINES AND ROADS

P. Leonelli¹, S. Bonvicini¹, G. Spadoni¹

¹Department of Chemical, Mining Engineering and Environmental Technologies, University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
ABSTRACT
A study on Quantified Risk Analysis applied to the transport of dangerous chemicals is presented. Particular attention has been devoted to the analysis of the societal risk, represented by means of F/N curves, arising from the transport of dangerous goods by both road and pipelines. The three steps which form the procedure for evaluating societal risk are described: first, the modelling of the population distribution, namely the definition of the population map; the second step consists in identifying and describing all possible accidents that may occur during the transport of dangerous goods; finally, a procedure must be applied that allows the societal risk to be calculated. Tests are carried out to compare the societal risk resulting from pipeline and road transport of ammonia. The comparison of the Quantified Risk Analysis results is performed resorting to fuzzy logic, in order to take into account the uncertainties connected with the identification of accidents.
KEYWORDS

Dangerous goods transportation, Societal Risk, Roads, Pipelines, F/N curves, Algorithm, Uncertainty evaluation, Fuzzy logic.
INTRODUCTION
The importance of assessing risks in the transportation of dangerous goods has been shown by several historical analyses (Brockoff 1992), which pointed out that their magnitude is, in many cases, quite similar to that resulting from the chemical and process industries. As a consequence, Quantified Risk Analysis, largely used to evaluate the impact of process industry risks on the territory, has recently been adopted for analysing risk in transport too (Brockoff 1992; Spadoni et al. 1995). In order to quantify the risk to which a population living in an industrialised area is subjected, the societal risk, represented by means of F/N curves (frequencies of accidents involving N or more fatalities), is usually evaluated. In the procedure described here, this important risk measure is evaluated by performing the following three phases. The first one is the identification, on a population map, of the areas of different densities for different land uses, where people may be considered uniformly distributed, of the roads (people linearly distributed) and of the aggregation centres (e.g. schools, hospitals and commercial sites). The modelling of the population distribution also takes into account the probability of people being indoors.
The second step consists of the identification and evaluation of all possible accidents that may occur during the transport of dangerous goods (in our calculations only the transport of a toxic substance like ammonia is considered, but the procedure can also deal with other toxic substances like chlorine, or flammables like motor spirit or LPG). The characterisation of the accidents - regarding likelihood and size of breakage; rate, physical aspect and duration of the release - has been carried out, and then consequence and probit models are used to obtain vulnerability maps for each possible accidental scenario, which can be stored once and for all, because they do not depend on the population distribution. In this way all possible accidental events are converted into both a vulnerability distribution around the source risk point and a related likelihood of occurrence. Finally a procedure is developed that allows the contribution to the societal risk of all people categories to be calculated, by overlapping, with an efficient mathematical algorithm, the vulnerability maps of the possible accidents on the user-defined impact area. The procedure describes the linear risk source, i.e. the pipeline or tanker's route, as a sort of travelling accident.
POPULATION DISTRIBUTION MODEL

As far as population modelling is concerned, a suitable description of the population distribution in the area of interest is required. The population is subdivided into three categories: uniformly distributed, linearly distributed and aggregated in specific centres. Uniformly distributed population is described by partitioning the impact area into rectangular subareas, where the population density may be considered uniform owing to the use of the subareas (e.g. residential quarter, urban area, park, ...). Points are used to describe centres of people aggregation, i.e. hospitals, commercial centres, schools and so on. Finally, people on roads are better described by means of linearly distributed population, whose density is connected to the road traffic. In order to allow the description of changes in population density, thus taking into account its dependence on time, the year and the day may be subdivided into several periods (for example seasons, day time/night time). For each period characterised by a different population distribution the procedure evaluates accident occurrence frequencies and the people involved in each scenario.
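A minimal sketch of how the three population categories could be represented as data structures is given below; the type and field names are illustrative assumptions, not taken from the paper's code.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UniformArea:                   # rectangular subarea, uniform density
    corner_min: Tuple[float, float]  # (x, y) of lower-left corner
    corner_max: Tuple[float, float]  # (x, y) of upper-right corner
    density: float                   # persons / m2
    indoor_fraction: float           # may vary per time partitioning

@dataclass
class RoadSegment:                   # linearly distributed (on-road) population
    vertices: List[Tuple[float, float]]  # polyline of the road
    density: float                   # persons / m
    indoor_fraction: float

@dataclass
class AggregationCentre:             # school, hospital, commercial centre, ...
    position: Tuple[float, float]
    persons: int
    indoor_fraction: float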
IDENTIFICATION AND EVALUATION OF POSSIBLE ACCIDENTS

The identification of possible accidental scenarios due to a material release is related to the conveyed dangerous substances and the means of transport. It is worth noticing that the identification of possible accidental scenarios is completely independent of the population distribution, and consequently can be performed taking into account the design characteristics of the tanker or pipeline and the properties of the conveyed dangerous substance. Typical accidental release cases for road or pipeline transport of toxic or flammable substances may be identified, and the characterisation of their accidental scenarios - regarding likelihood and size of breakage; rate, physical aspect and duration of the release - may be carried out by using engineering judgement, literature information and proper simulation models. Consequence models enable the calculation of the spatial distribution of the physical effects (toxic gas concentration, thermal radiation and blast overpressure) in the impact area, under the assumption of typical weather conditions. Lastly, probit models allow the physical effects to be translated into vulnerability maps. Owing to their independence from the population distribution, the vulnerability maps, i.e. the description of the accidental scenarios and their related likelihood of occurrence, can be stored and then utilised to evaluate the number of persons involved in each accident.
Pipelines accidents

In a risk analysis, typical pipeline accidents are described as provoked by breakages of different sizes: engineering judgement, supported by historical searches, leads the analyst to choose which and how many classes of rupture must be considered (usually a small hole, a medium one and the guillotine breakage are assumed). To describe a pipeline accident, the frequency of occurrence f of a given release per unit length (releases year⁻¹ km⁻¹) of the pipeline must be evaluated. The parameter f for the pipeline can be derived from:

f = λ_P · p_c · p_i · p_w · τ_P    (1)

where λ_P is the average release frequency (releases year⁻¹ km⁻¹), p_c is the probability that the release belongs to a certain rupture class, p_i is the ignition probability (for flammable substances only), p_w is the probability of a given wind direction (in a sector of 1°) and meteorological situation for the accidental scenario, and τ_P is the fraction of the time partitioning during which the pipeline is active. Pipeline incident data involving unintentional release of the pipeline content have been collected over the last decades by all major gas or oil transmission system operators, both in Western Europe and in North America. These data form an extensive database from which the pipeline incident frequencies can be deduced. The incidents are classified according to the initial cause, which can be an external interference, corrosion, a construction defect or material failure, or a natural hazard. Pipeline release frequencies mainly depend on the wall thickness, the diameter, the depth at which the pipe is buried, and the year of construction (modern pipelines are safer than older ones because of improved construction standards and rigorous material testing). Our calculations refer to an ammonia pipeline whose features are summarised in table 1:
TABLE 1

Diameter: 203 mm
Wall thickness: 11.1 mm
Earth cover: 1.5 m
Year of construction: 1990
Design flow: 10.4 kg/s
Isolation valve spacing: 10 km
The release frequencies for such a pipeline have been taken from EGIG 1993; different values are given for a pinhole, a medium size hole and the guillotine break. These release frequencies and the diameters chosen to represent the pinhole and the medium size hole are presented in table 2:
TABLE 2

HOLE TYPE           RELEASE FREQUENCY [λ_P·p_c] (ev/(km·y))   HOLE SIZE (mm)
Pinhole             1.7 E-4                                   20
Medium hole         1.7 E-4                                   100
Guillotine break    0.6 E-4                                   /
Tanker accidents

For road transport, the frequency of occurrence f of a tanker release per unit length (releases year⁻¹ km⁻¹) is evaluated by resorting to the following equation:

f = λ_R · p_rel · p_c · p_i · p_w · n_v · τ_v    (2)

where λ_R is the average incident rate (incidents vehicle⁻¹ km⁻¹), p_rel is the probability of a release once an incident has occurred, and the term n_v·τ_v represents the number of travelling tankers in each time partitioning. The tanker incident rate λ_R depends on the road type (i.e. motorways, main roads, urban roads) and the traffic load, while p_rel and p_c depend on the tanker features (for example the wall thickness). In our calculations data inferred from HSE 1991 were used, both for the release frequencies and the breakage sizes. These data are shown in table 3:
TABLE 3

RUPTURE CLASS          RELEASE FREQUENCY [λ_R·p_rel·p_c]   HOLE SIZE
Small hole             7 E-10 ev/(km·vehicle)              25 mm
Medium hole            4.3 E-10 ev/(km·vehicle)            50 mm
Catastrophic rupture   0.48 E-10 ev/(km·vehicle)           /
The number of tankers travelling on the road was chosen so that the road delivery capacity equals the pipeline capacity (the tanker tonnage is assumed to be 20000 kg).
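The following fragment illustrates the order of magnitude of the frequency build-up of equations (1) and (2), using the medium-hole data of Tables 2 and 3; the wind-sector probability and duty fraction are assumed values, and the ignition term is dropped since ammonia is treated as toxic rather than flammable.

lam_p_pc = 1.7e-4                 # pipeline medium hole, releases/(km yr), Table 2
p_w, tau_p = 1.0 / 360.0, 0.25    # assumed 1-degree wind sector, assumed duty fraction
f_pipe = lam_p_pc * p_w * tau_p   # scenario frequency, releases/(km yr)

lam_r = 4.3e-10                   # tanker medium hole, releases/(km vehicle), Table 3
tonnage = 20000.0                 # kg per tanker (from the text)
n_v = 10.4 * 3600 * 24 * 365 * tau_p / tonnage   # trips/yr matching pipeline capacity
f_road = lam_r * p_w * n_v
print(f"{f_pipe:.2e} vs {f_road:.2e} releases/(km yr)")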
Vulnerability maps

In the numerical procedure a scenario is represented by a vulnerability distribution on a ξ/η Cartesian reference frame, whose origin is the point where the release occurs, ξ being the downwind direction (Fig. 1). Its vulnerability data are stored in a matrix which represents the distribution, calculated once and for all, on a non-uniform grid in the ξ/η plane. The ξ/η Cartesian axes can be rotated and translated over the area of interest in order to describe both the changes in the wind direction and the position where accidents occur along the tanker route or pipeline.

Figure 1: vulnerability map
SOCIETAL RISK ALGORITHM

Once the frequencies of occurrence are evaluated, the number of fatalities N for each accidental scenario located in a generic point of the route is calculated by recalling the vulnerability values stored for a point risk source. In order to evaluate the number of fatalities N due to a scenario, equation (3) is utilised:

N = Σ_{i=1..n_l} ρ_{l,i} ∫_{l_i} V_p [x_{l,i} μ + (1 - x_{l,i})] dl + Σ_{i=1..n_A} ρ_{A,i} ∫_{A_i} V_p [x_{A,i} μ + (1 - x_{A,i})] dA + Σ_{i=1..n_C} P_{C,i} V_p [x_{C,i} μ + (1 - x_{C,i})]    (3)

where n_l, n_A and n_C are respectively the numbers of lines, rectangles and points on the population map, ρ_l and ρ_A are the corresponding people densities, P_C the number of people in each aggregation centre, and x_l, x_A and x_C the fractions of people staying indoors. In equation (3), V_p is the vulnerability stored in the vulnerability maps and μ is the mitigation factor deriving from being indoors. In order to evaluate the vulnerability V_p and to perform the integration steps, the vulnerability matrix is linearly interpolated, obtaining a continuous function. Equation (3) is solved by locating the origin of the ξ/η Cartesian frame on the release point and orienting the ξ/η plane to take the wind direction into account, in this way overlapping the vulnerability and population maps. An efficient numerical algorithm based on the circuitation theorem has been developed in order to accelerate the surface integration in equation (3), which constitutes the slowest step. Once each scenario of the point risk source is characterised by a number of fatalities N and a frequency per unit length f, the frequencies of all scenarios involving a specific user-defined range of fatalities are added up to evaluate f_N. The procedure then integrates the f_N values along the linear risk source in order to obtain the F/N curves (Leonelli, Spadoni 1996).
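As an illustration, the area term of equation (3) can be sketched on a uniform grid as follows; the indoor weighting and all names are assumptions of this sketch, not the authors' algorithm (which additionally uses the circuitation theorem to speed up the surface integration).

import numpy as np

def area_term(vuln_cells, cell_area, rho, x_in, mu):
    """One area contribution to equation (3): rho * integral of
    V_p * [x*mu + (1 - x)] dA, evaluated on grid cells that overlap one
    population rectangle, already rotated into the wind-aligned frame."""
    weight = x_in * mu + (1.0 - x_in)   # indoor fraction mitigated by mu
    return rho * weight * vuln_cells.sum() * cell_area

vuln = np.array([[0.9, 0.4], [0.5, 0.1]])   # interpolated vulnerability values
print(area_term(vuln, cell_area=100.0, rho=8.33e-3, x_in=0.8, mu=0.3))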
UNCERTAINTY EVALUATION

In order to better evaluate the impact of dangerous goods transportation on the population, the effect of the uncertainties of the input data on the final results should be quantified. Fuzzy logic is a means of quantifying uncertainties which, compared with other more traditional techniques (for example the Monte Carlo method), has the advantage of requiring only a limited number of model calculations. Using fuzzy logic (Quelch, Cameron 1994) it is possible to represent a parameter A by a membership function, which takes values in the interval [0,1]; this function can be viewed as a possibility distribution as opposed to a probability distribution, since the sum of the grades of membership of a fuzzy number is not required to equal 1 and it is possible for more than one member to have a grade of membership equal to 1. In a QRA applied to the transport of hazardous materials an important parameter, always affected by uncertainty, is the frequency of occurrence per unit length f of the linear risk source. Suppose some risk analysts are asked to choose a release frequency for a given pipeline: imagine the majority of them say that f has values between 6E-4 and 7E-4 rel km⁻¹ y⁻¹, two other smaller parties say that acceptable values for f are respectively 4.5E-4 and 9E-4, and only a few of them say that f is smaller than 4.5E-4 or greater than 9E-4, while no one chooses values below 3E-4 or above 1E-3. Applying fuzzy logic to f, we can say that the membership of f is 0 for f > 1E-3 or f < 3E-4, is equal to 1 for 6E-4 ≤ f ≤ 7E-4, and varies linearly in between (Figure 2).

Figure 2: fuzzy representation of a generic pipeline failure rate

The fuzzy number f describes the statement 'the value of f is between 0.3 and 1 and most likely around 0.7 ev 1000⁻¹ km⁻¹ y⁻¹' without omitting any of the information given in the vague statement; the closer a value
of f is to 0.7, the closer the corresponding value of the membership function is to 1. A fuzzy number can be represented as a series of intervals at different membership levels, called α-cuts. Considering the fuzzy number f, for instance, the α-cut 0.5 is represented by an interval whose endpoints are the values of f where the membership function value is 0.5, i.e. 4.5E-4 and 9.0E-4; f can be represented by the following α-cuts:

α_0 = [3E-4, 1E-3]; α_0.2 = [3.6E-4, 9.4E-4]; α_0.5 = [4.5E-4, 9.0E-4]; α_0.8 = [5.4E-4, 7.6E-4]; α_1 = [6E-4, 7E-4].

Operations can be performed on these α-cuts using interval arithmetic, which involves combining the endpoints of the intervals according to the operation (addition, multiplication, etc.) and selecting the lower endpoint of the solution to be the minimum value and the upper endpoint to be the maximum value. In this paper only the release frequencies for each accidental scenario and the respective hole sizes, both for the pipeline and the road, have been regarded as uncertain parameters described by fuzzy numbers. An example of a fuzzy number used in the QRA of the transport by road is shown in figure 3.
Figure 3: fuzzy representation of the pipeline pinhole diameter

Model calculations have been performed for some α-cuts, i.e. for α = 0, 0.2, 0.4, 0.6, 0.8, 1.
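A minimal sketch of α-cut propagation through interval arithmetic is shown below; the cut endpoints are those quoted above for the pipeline failure rate, while the crisp 10 km route length is an illustrative assumption.

cuts_f = {0.0: (3.0e-4, 1.0e-3), 0.2: (3.6e-4, 9.4e-4), 0.5: (4.5e-4, 9.0e-4),
          0.8: (5.4e-4, 7.6e-4), 1.0: (6.0e-4, 7.0e-4)}   # alpha-cut intervals

def interval_mul(a, b):
    """Interval product: min/max over the four endpoint products."""
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return min(p), max(p)

# Propagate a crisp 10 km route length to obtain a fuzzy per-route frequency,
# one alpha-cut at a time; the result is itself a set of alpha-cut intervals.
route = (10.0, 10.0)
print({a: interval_mul(iv, route) for a, iv in cuts_f.items()})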
COMPARISON TEST

In order to compare the societal risk due to the transport of dangerous goods by means of pipeline or tankers, the area plotted in figure 4 has been analysed.

Figure 4: test area (population density values are: 8.33E-3 persons m⁻² in the urban area, 2.27E-3 persons m⁻² in the rural area and 0.06 persons m⁻¹ on the road; people in aggregation centres are: 200 in the hospital, 240 in both schools and 400 in the commercial centre)
Results

The main results of our calculation are shown in figure 5, where the F/N curves for the pipeline and the road are reported with their uncertainty bands (obtained with the α-cut 0.2). It is worth noticing that for certain N the two bands cross: for small N the road F values are greater than the pipeline F values, while the opposite occurs for large N, and only pipeline accidents can kill more than 100 people.
Figure 5: Comparison of F/N curves for road and pipeline

This means that it is not possible to say a priori that the pipeline is certainly safer than the road or vice versa; however it is true that there are some preventive measures that can lower the pipeline F/N curves which are not applicable to the road. For example, the pipeline failure frequency can be reduced by adopting a greater wall thickness or a greater cover depth, and the total release can be reduced by lowering the isolation valve spacing. Other advantages come from the adoption of protective measures like an emergency plan: for example, alerting people to close doors and windows, to switch off the ventilation systems and to stay indoors after an incident has happened can reduce the consequences of a pipeline accident. It is worth noticing that an emergency plan can reduce only the off-road fatalities, but not the on-road ones: since the latter give the larger contribution to the total road fatalities, considering an emergency plan for a tanker accident does not bring great advantages. The best risk reduction measure, both for the road and for the pipeline off-road fatalities, would be to move them far away from the town. Figure 5 also shows that the F/N curves with their uncertainty bands, both for the pipeline and the road, lie in the UK Health and Safety Executive ALARP region (in a log-log graph this region is bounded by two lines, the upper one representing the maximum tolerable risk level, the lower one a negligible risk level). The ALARP region is the zone where risk needs to be reduced As Low As Reasonably Practicable: the few simple risk reduction measures proposed above for the pipeline could move the pipeline F/N curve to the negligible risk level. Looking at figure 5, it can be seen that, while the curves delimiting the pipeline uncertainty band are nearly parallel for all N, the road uncertainty band is very large for small N and very narrow for greater N. This is shown also in figures 6 and 7, where the cumulative frequency F is represented as a fuzzy number for N = 10 and N = 40 respectively.
Figure 6: fuzzy representation of F for N = 10 (road and pipeline)
Figure 7: fuzzy representation of F for N = 40 (road and pipeline)
CONCLUSIONS

The evaluation of societal risk gives a measure of the impact of the transport of dangerous goods on the population and requires particular attention. As a matter of fact, the F/N curves may be utilised to distinguish among cases of negligible risk, cases of unacceptable risk, and cases where a risk reduction is convenient or necessary. The proposed procedure is designed to reduce the numerical errors introduced in evaluating the number of fatalities and the corresponding frequencies, and to be easily extended to different modes of transportation. It is worth noticing that the flexibility of the procedure allows fuzzy logic to be used in order to quantify the influence of the uncertainties in the input data on the results. In fact, F/N curves are better described by means of F/N bands that can be compared with the ALARP region.
REFERENCES

Brockoff, L.H. (1992). A Risk Management Model for Transport of Dangerous Goods. EUR 14675EN, JRC.
EGIG (European Gas pipeline Incident data Group) (1993). Gas Pipeline Incidents, Report 1970-1992.
Health & Safety Commission (1991). Major Hazard Aspects of the Transport of Dangerous Substances. HMSO, London.
Leonelli, P. and Spadoni, G. (1996). A new numerical procedure for calculating societal risk from road transport of dangerous substances. Proceedings of the Annual S.R.A. Meeting, Guildford (UK), 243-246.
Quelch, J. and Cameron, I. (1994). Uncertainty representation and propagation in quantified risk assessment using fuzzy sets. J. Loss Prev. Process Ind., 7:6, 463-473.
Spadoni, G., Leonelli, P., Verlicchi, P. and Fiore, R. (1995). A numerical procedure for assessing risks from road transport of dangerous substances. J. Loss Prev. Process Ind., 8:4, 245-251.
ARTIFICIAL NEURAL NETWORKS FOR LEAK DETECTION IN PIPELINES

Salvatore Belsito¹ and Sanjoy Banerjee²

¹Center for Energy and Environmental Technologies, Consorzio Pisa Ricerche, Piazza A. D'Ancona 1, 56100 Pisa (Italy). E-mail: [email protected]
²Department of Chemical Engineering, University of California, Santa Barbara (CA).
ABSTRACT

This paper presents the results obtained in the development of a leak detection system based on innovative processing technology. The leak detection system uses Artificial Neural Networks (ANNs) for the processing of field data. An existing pipeline carrying hazardous material (liquified ammonia) has been selected as a reference for evaluating the detection system. The database used for the present work was derived by developing, and subsequently running, a computer code able to simulate flow in pipelines. Normal operation field data have also been used to assess the code's capabilities. Noise and instrumentation drift are superimposed on the numerical predictions provided by the code. The system that has been developed includes pre-processing of the data to correct for the "packing" effect due to liquid compressibility. The system is able to detect the presence of the smallest simulated leaks (1% of the inlet flowrate) and gives a very low incidence of spurious alarms even in these cases. ANNs have also been developed for leak positioning along the pipeline. The positions of large leaks (5%, 10% of the inlet flowrate) are predicted with very high accuracy. This is particularly useful because for large leaks the location is important in order to undertake mitigatory actions. About 50% of 1% leaks are correctly located. This percentage increases to about 75% for 2% leaks.
KEYWORDS

Leak Detection, Artificial Neural Networks, Pipeline Failure, Artificial Intelligence, Packing Effect
INTRODUCTION

Leak detection at an acceptable cost is important in the mitigation of accidents. Losses from catastrophic failures can cause significant damage to the environment, especially in undersea and remote land areas. Such leaks are the easiest to detect, being accompanied by large pressure drops and mass flow rate discrepancies. Another dangerous kind of leak is the small, difficult-to-detect one. Leaks as small as 1% of the nominal flowrate can cause large amounts of dangerous fluid to be discharged before they are detected. The early detection of such small leaks is therefore one of the most difficult goals of a leak detection system.
Many leak detection methods based on process measurements in liquid and gas pipelines have been proposed and installed (Ellul (1989), Billmann et al. (1987), Stouffs et al. (1993), Hamande et al. (1995), Parry et al. (1992)). Detection of small to medium sized leaks in liquified gas pipelines requires sophisticated on-line processing of field data related to pressure, temperature and flowrates. Difficulties arise due to noise, drift and compressibility effects that become significant even for liquids, because of the large volume between measurement points (Stouffs et al. (1993)). From the analysis of state-of-the-art leak detection methods, it appears that no single method is universally applicable. As a consequence, more than one leak detection system is often implemented in important pipelines. Artificial neural network-based systems may offer an alternative method of leak detection, giving high reliability and resilience to noise and drift. An artificial neural network (ANN) can be regarded as a nonlinear mathematical function which transforms a set of input variables into a set of output variables (Bishop, 1994), the transformation function depending on weights determined on the basis of a set of training data or examples. Artificial neural networks have the capability of learning a general solution of a problem from a limited number of examples. The problem of leak detection seems to have the characteristics appropriate to the use of artificial neural networks for its solution (Bishop (1994)), viz.: (i) it is difficult to find an adequate first-principle or model-based solution, (ii) new data must be processed at high speed and (iii) the system must be robust to noisy signals. Another important aspect for the application of neural networks is outlined by Bishop (1994): a large set of training data must be available. This crucial aspect appears to be met by the enormous amount of data being measured and recorded at a pipeline central station. In the present case the data should ideally consist of field measurements of process variables (pressure, flowrate, temperature) made in the presence of leaks. In fact, no-leak field data can be used for part of the training operation, but in addition, data measured in the presence of leaks are also needed. These are normally not available, and alternative strategies must be pursued to generate such data. In the present work, the data for network training were generated by a computer code developed to simulate the flow in pipelines with and without the presence of leaks. The generated data must be representative of field conditions, i.e. the noise and drift of the instrumentation installed in the field must be included in the data.
DESCRIPTION OF THE METHOD

Reference Pipeline

The main purpose of the present work is to develop a leak detection system with improved performance compared to available systems. An existing pipeline carrying hazardous material (liquified ammonia) has been selected as the reference (Table 1).
TABLE 1
PIPELINE CHARACTERISTICS

Pipeline length: 74668 m
Pipeline diameter: 8" (0.203 m)
Carried fluid: Liquid ammonia
Number of monitoring stations: 13
Operating flowrate: 10.4 kg/s (max)
Operating pressure: 2.22 MPa (arrival) - 2.82 MPa (starting)
Development of a Deterministic Model

A large number of data, covering the whole range of operating and fault conditions, should ideally be available in order to successfully train an ANN. Field data measured in the presence of leaks in a pipeline are generally not available, as mentioned earlier. Owing to the lack of such data for ANN training, the database used
for the present work was derived by developing and subsequently running a computer code able to simulate flow in pipelines. Normal operation field data have been used to validate the code's capabilities. The computer code is based on the balance equations for mass, momentum and energy, solved by a finite difference scheme. The analyses performed with the code showed that in the reference pipeline two-phase flow does not occur when leaks of the size of interest for the aims of the present work are formed (i.e. leaks with flowrates less than 20% of the flowrate in the pipeline). This is due to the large subcooling of the ammonia in the pipeline (about 40°C) and also to the small depressurization following the leak occurrence (about 5 kPa for a 1% leak for the pipeline considered here). The situation can be different in other pipelines where two-phase flow can occur. Nevertheless, the numerical scheme adopted in the code is able to represent two-phase flow conditions.
Set up of the Database

The code has been used to generate a database for training the artificial neural networks. For each simulated condition, the time evolution of the inlet and outlet flowrates and the pressures calculated at the 13 measurement stations located along the pipeline have been recorded. Two databases have been used: the first consists of about 300 runs made under different conditions; the second consists of about 1000 runs and includes the first. The need for a larger database arose during the development of the leak location system. Numerical data with and without leaks have been generated. Among the patterns without leaks, different operating conditions have been simulated by varying both the inlet flowrate and the outlet pressure (these quantities are kept constant within each run, though). For the cases with leaks, spanning the simulated range of operating conditions (i.e. outlet pressure and inlet flowrate), both the size and the position of the leak along the pipeline have been varied.
Noisy Signals Database

The data generated with the computer code by solving the balance equations do not consider instrumentation noise. Noise and drift of the instrumentation needed to be superimposed on the numerical predictions. In order to do that, the measurement systems used in the pipeline were characterized. Due to the lack of suitable information on field noise, the noise superimposed on the data was derived from the specifications of the instruments. In particular, worst-case assumptions were made when deciding the operating mode (digital or analog): the largest possible values for noise were assumed. Later in the activity, when field data were made available, it was verified that the assumed noise characteristics are consistent with those derived from the instrumentation recordings.
Design of Artificial Neural Networks

The database previously described has been used for ANN training and testing. During these activities, feedforward ANNs have been utilized with the classical backpropagation algorithm for training. The data have been partitioned into two sets: a training set used for network training and a test set (not used for ANN training) used to verify the capabilities of the network in the analysis of data not used in the learning phase.
DEVELOPMENT OF THE LEAK DETECTION SYSTEM

The development of the system was subdivided into two main tasks: set up of the system for leak dimension detection and set up of the system for leak positioning. The first system would work on-line by
monitoring the pipeline status and giving an alarm if a leak is detected, thereby triggering the second system, which locates the leak along the line. For leak size detection, both data without leaks and data with leaks are used for network training. The aim is to develop an artificial neural network that can detect leaks down to sizes of 1% of the inlet flowrate. Clearly, one fundamental requirement is that the system should not give spurious alarms in spite of operational variations. For leak positioning, the leak is assumed to have occurred, and therefore data without leaks are not needed.
Development of the Leak Sizing Network

The basic configuration of the sizing ANN, i.e. the choice of the architecture, of the learning parameters and of the scaling and formats of the patterns, was set up using the first database (300 cases), without noise being superimposed. The information contained in the database is relative to the steady state which is reached some time after the leak has occurred. In particular, the values of pressure at the thirteen monitoring stations and the inlet and outlet flowrates are used. This first part of the development of the leak sizing ANN was performed because of the shorter computing time required for training an ANN with non-noisy data rather than with data with embedded noise. This strategy allows rapid acquisition of the main network parameters, besides the definition of the optimal normalization for the physical quantities contained in the patterns. As a result of this activity, a feedforward multilayer perceptron using sigmoidal activation functions was set up. The artificial neural network consists of three layers: the input layer has 15 nodes, each of them corresponding to one signal measured in the pipeline; the hidden layer has 7 nodes; and the single output gives the size of the leak. The 15 input nodes receive the signals coming from the two flowmeters and the 13 pressure meters installed along the reference pipeline.
Figure 1: Predicted leak dimension for no-noise data (training set). Figure 2: Predicted leak dimension for no-noise data (test set).

Figs. 1 and 2 show the predictions of the network for the training and test sets. In the figures, histograms of the number of cases predicted with a certain leak dimension are reported. The various sets are for different simulated leaks (no leak, 1%, 2%, 5%, 10% of the inlet flowrate). The ANN was able to correctly predict both leak occurrences and dimensions. Furthermore, it was able to discriminate cases with the smallest leak (1% of the inlet flowrate) from cases without any leak. Since the histogram of no-leak patterns can easily be separated from the one referring to patterns with 1% leaks, no spurious alarms would be generated. In a second step, noisy signals were used. The first attempt was to develop a network with the same architecture as used before. This resulted in an ANN that was no longer able to recognize the dimension of a leak or to distinguish the 1% leaks from cases where no leak is present in the line.
To proceed, moving averages were applied and proved effective in noise reduction, even though their application can slow down the network response. Sensitivity analyses also showed a strong effect of flowmeter drift. Information related to flowmeter drift was therefore provided to the artificial neural network. This was done by inputting the difference between the measurements of the two flowmeters recorded at some instant before the actual time at which detection is done (optimization of such time delays led us to choose Δt = 1800 s). It must be noticed that this difference, owing to the short time constants of the line (in terms of inlet/outlet flowrate correlation), does not depend very much on the operating conditions (e.g. operating variations in pressure and in the flowrate itself) but is indeed indicative of flowmeter drift. It is implicitly assumed that the drift does not vary significantly during this time Δt. Many patterns are derived from a single set of data generated by the code, superimposing zero-average noise and drift that are randomly selected. In this way the ANN faced situations in which different inputs (because of the noise) gave the same leak dimension.
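A minimal sketch of this pre-processing step is given below; the sampling rate, the window length and all names are assumptions of this sketch, while the 1800 s delay follows the text.

import numpy as np

def build_pattern(q_in, q_out, pressures, k, window=60, delay=1800):
    """Assemble the 16 network inputs at sample k (1 s sampling assumed):
    moving-averaged flowrates and pressures, plus the inlet/outlet flow
    difference recorded `delay` seconds earlier as a drift indicator."""
    ma = lambda s: float(np.mean(s[k - window:k]))      # moving-average filter
    drift = float(q_in[k - delay] - q_out[k - delay])   # flowmeter drift proxy
    return np.array([ma(q_in), ma(q_out), *[ma(p) for p in pressures], drift])

t = np.arange(4000)
demo = build_pattern(8 + 0.01 * np.random.randn(4000),
                     8 + 0.01 * np.random.randn(4000),
                     [2.5 + 0.001 * np.random.randn(4000) for _ in range(13)],
                     k=3600)
print(demo.shape)   # (16,)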
Figure 3: Predicted leak dimension for noisy data (training set). Figure 4: Predicted leak dimension for noisy data (test set). [Histograms of number of cases versus leak dimension as a fraction of the inlet flowrate, 0.00 to 0.10.]
The ANN that was successfully trained after pre-processing the data as discussed above is a feedforward (16-6-1) network of 23 nodes with one hidden layer. It is very similar to the network developed for the noise-free case, but with one additional input: the difference between the measurements of the two flowmeters 1800 s earlier. The performance of this network can be observed in Figures 3 and 4. The neural network appears able to detect the smallest simulated leak (1% of the inlet flowrate) and to distinguish situations without a leak from cases in which the smallest design leak is present in the line.
Assessment of the Leak Sizing Network

At this point a leak sizing ANN had been successfully trained using noisy, numerically generated patterns. Designing a system to be operated in the field requires that routine transient operations be handled by the apparatus without giving rise to spurious alarms. To this end, our leak sizing network was fed with real field data. Typical trends of such data are reported in Figure 5, which shows about 8000 s of inlet and outlet flowrates recorded in the reference pipeline. The performance of the ANN fed with these field data is shown in Figure 6: at several instants the ANN predicts a leak of up to about 5% of the actual inlet flowrate, whereas in reality no such event occurred. The most serious difficulty arose not from noise and drift but from the "packing" effect due to liquid compressibility: since the flow in a real pipeline changes continually, the inlet and outlet flowrates show significant differences caused by compressibility. To correct for this effect, the data fed on-line to the ANN must be preprocessed. In this case the preprocessing was done by the computer code used for the simulations, which ran faster than real time on a modern PC. Another ANN could possibly have been developed to handle the preprocessing, but conventional ANN technology did not prove successful for this task.
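A hedged sketch of the packing-effect correction: the pipeline code, run on-line with the measured boundary conditions, predicts the inlet/outlet imbalance expected from compressibility alone, and this prediction is removed before the signals reach the ANN. The function and variable names below are illustrative, not the authors' interface.

    def remove_packing_effect(measured, simulated_no_leak):
        """Subtract, signal by signal, the no-leak values predicted by the
        on-line flow simulation, leaving only leak-induced deviations."""
        return {name: measured[name] - simulated_no_leak[name]
                for name in measured}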
The new output of the leak sizing ANN is reported in Fig. 7, which shows a behavior that is fairly robust against transient field signals. With a warning threshold set at 0.75% of the inlet flowrate, no alarms are generated in this case.
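The warning logic then reduces to a simple threshold test; a sketch using the 0.75% value from the text:

    WARNING_THRESHOLD = 0.0075  # 0.75% of the inlet flowrate

    def leak_warning(predicted_fraction):
        """True when the ANN's leak-size estimate exceeds the threshold;
        in Fig. 7 the treated field signals stay below it."""
        return predicted_fraction > WARNING_THRESHOLD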
Figure 5: Flowrate field data showing a typical operational transient. Figure 6: Response of the ANN to field signals without leak: non-treated signals. Figure 7: Response of the ANN to field signals without leak: treated signals. Figure 8: Response of the ANN to field signals with the effect of a leak simulated. [Time axes span roughly 0 to 9000 s; the predicted leak dimension is expressed as a fraction of the inlet flowrate.]
Leak Sizing Test

To confirm the reliability of the system in leak detection, we set up data representative of a leak. When a leak develops, a sudden (even if very small) drop occurs in the pressure measured along the pipeline and in the outlet flowrate. The numerical code that was developed is able to calculate this pressure and flowrate drop and was used extensively during database set-up. To obtain data representative of a leak occurring under field conditions (i.e. transient conditions), we subtracted the pressure and flowrate drops calculated by the code for various leak flowrates (1%, 2%, 5% and 10% of the inlet flowrate) from the data measured on line. The signals derived from the field measurements, including the effect of the leak, were then fed to the leak sizing system described above. The results of this test are reported in Fig. 8, where the leak dimension calculated by the ANN-based system is shown. The system clearly detects the leak, giving a signal that exceeds the threshold even for the smallest leaks.
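A sketch of how these synthetic test signals could be assembled: the code-calculated drops for a given leak size are subtracted from the field recordings (all names here are assumptions).

    def superimpose_leak(field_pressures, field_q_out, dp_leak, dq_leak):
        """field_pressures: measured pressures at the monitoring stations;
        dp_leak, dq_leak: pressure and outlet-flowrate drops computed by the
        pipeline code for the chosen leak size (1%, 2%, 5% or 10%)."""
        pressures = [p - dp for p, dp in zip(field_pressures, dp_leak)]
        return pressures, field_q_out - dq_leak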
Development of the Leak Positioning Network

To develop the leak location system, the pipeline was divided into 8 segments of similar length, delimited by sectioning valves. The output of the ANN consists of 8 neurons, one for each segment; the neuron with the highest output value determines the leak location. The pressure drops caused by a leak, rather than the absolute pressures measured by the transducers along the line, were provided as input to the neural network. A leak causes a pressure decrease at the various transducer positions and, in addition, a drop in the outlet flowrate; the quantities then reach a new steady state in which the pressures and the outlet flowrate are constant. The combination of pressure decreases contains the information on leak position. In addition, the leak detection ANN provides the dimension of the leak, and this information was also used as an input to the leak location ANN. Moving averages were used to reduce noise: a filter with a time window of 21 measurements was applied. The ANN was trained on data with superimposed noise, and the database used from this point on was the most complete one, with the largest number of patterns (about 1000). Extensive field data became available only at this stage of the research and were used directly to obtain information on the behavior of the field instrumentation. The results confirmed the assumptions previously made about noise behavior.
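A minimal sketch of the location stage as described: the filtered station-wise pressure drops and the sizing ANN's leak-size estimate feed a network with 8 output neurons, and the index of the largest output names the leaking segment. The hidden-layer size and weights are placeholders, not values reported in the paper.

    import numpy as np

    def locate_leak(pressure_drops, leak_size, W1, b1, W2, b2):
        """pressure_drops: moving-average-filtered pressure decreases at the
        monitoring stations; leak_size: estimate from the sizing ANN.
        Returns the index (0-7) of the segment with the highest output."""
        x = np.append(pressure_drops, leak_size)
        hidden = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # sigmoidal layer (assumed)
        outputs = hidden @ W2 + b2                      # one neuron per segment
        return int(np.argmax(outputs))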
TABLE 2
TEST SET CORRECT CLASSIFICATIONS

Leak size    Correct classifications
1%           42/75 (56%)
2%           61/73 (84%)
3%           40/56 (71%)
5%           27/29 (93%)
10%          31/32 (97%)
TABLE 3
TEST SET CORRECT CLASSIFICATIONS, COUNTING AS "CORRECT" PREDICTIONS IN THE EXACT SEGMENT OR IN AN ADJACENT SEGMENT

Leak size    Correct classifications
1%           62/75 (83%)
2%           69/73 (95%)
3%           53/56 (95%)
5%           28/29 (97%)
10%          32/32 (100%)
Table 2 reports the percentage of correct locations for leaks of different sizes: 56% success is obtained for the smallest leaks, while the location is almost certain for leaks larger than 5%. This is particularly important for initiating mitigating actions. An interesting result of the leak location system is shown in Table 3, which reports the percentage of success when a prediction is counted as "correct" if it falls in the segment where the leak occurs or in an adjacent segment. With this coarser criterion the success probability for 1% leaks rises to 83%. Further analysis shows that the effect of noise on the results is not large: the performance of an ANN trained on data with double the noise is comparable to the results shown in Tables 2 and 3. The effect of the number of points used for the moving average is also not very significant: similar results were obtained using 9 and 41 points for the moving-average filter.
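The two success measures of Tables 2 and 3 amount to the following hypothetical scoring routine (exact segment versus exact-or-adjacent segment):

    def location_scores(predicted, actual):
        """predicted, actual: segment indices (0-7) for a set of test leaks.
        Returns (exact, within_one) success fractions, as in Tables 2 and 3."""
        n = len(actual)
        exact = sum(p == a for p, a in zip(predicted, actual)) / n
        within_one = sum(abs(p - a) <= 1 for p, a in zip(predicted, actual)) / n
        return exact, within_one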
CONCLUSIONS

The behavior of ANNs for leak detection and location has been investigated. The data were generated with a computer code developed for this purpose, and noise characteristic of the field instruments was superimposed. ANNs were developed separately for leak detection and for leak location. The final database consists of about 1000 examples.
The ANN for leak detection has been trained and tested successfully with noisy signals. It is able to detect leaks as small as 1% of the flowrate, and there is no overlap between cases with such a leak and cases without leaks; as a consequence, spurious alarms are not generated. It was necessary to pre-process the data to take into account the "packing" effect caused by variations in the boundary conditions of the line: owing to liquid compressibility and the large volume between the measurement stations, differences between inlet and outlet flowrate are present. Pre-processing is done by running on-line the code simulating the flow in the pipeline with the same boundary conditions as in the line. The position of large leaks (5%, 10% of the inlet flowrate) is predicted with very high accuracy even when noisy signals are used. This is particularly interesting because fast location of large leaks matters most for safety. About 50% of 1% leaks are correctly located, and the percentage of correct locations increases to about 75% for 2% leaks. It is worth noting that an incorrect location usually places the leak in a segment adjacent to the one in which it is actually present.
ACKNOWLEDGEMENTS

The project described was undertaken within the framework of the Research Program DEPIRE, sponsored under the CEC ENVIRONMENT Program, with Snamprogetti (Coordinator), Consorzio Pisa Ricerche, TNO and AEA Technology as partners. The sponsorship of the EC, DGXII (Contract EV5V-CT94-0419) is gratefully acknowledged.
KEYWORD INDEX
1-out-of-2:G system 2163 I&O function 417
absolute optimization 1749 absorption probabilities 2163 accelerated failure time models 1985 accelerated life testing 1849 accelerated testing 1859 acceptable 915 acceptable risk 45 acceptance criteria 1379 accident analysis 951 costs 247 database 667 investigation 89 management 257 prevention 137, 613 sequence precursor assessment 2097 accidents 89, 97, 189, 217, 1681, 1841 ADAM 257 adaptive architectures 341 adjusted mathematical models 1351 advection-dispersion model 2115 aerodromes 915 age factor 1749 policy 1947 replacement 1783 ageing model 1849 aggregating opinions 1903 aging 1741 aircraft 915 aircraft structure 933 airports 915 alarms 1391 aleatory uncertainty 569 algorithm 725, 2239 alignment chart 1559 ammonia 717 analysis 1653 APET 441 application of probabilistic methods 333 apportionment 2361 area risk assessment 1003 artificial intelligence 733 artificial neural networks 733 assembler language 323 assessment 3, 1143 asset management 237 asymptotic methods 1727
asymptotic reliability 1931 ATM switching system 2233 audit 63 instruments 71 autoCAD based software tool 2197 automation 117 automotive 1021 autonomous underwater vehicle (AUV) 2301 availability 281, 799, 961, 987, 1039, 1047, 1057, 1155, 1767 analysis 2415 of any system as a renewal equation 1719 computation 2053 modelling 807 aviation 71, 915 safety diagnostics 943 axioms of intuition 1637
background information 289 banknote circulation system 1957 banknote life-length 1957 basic subgraph 2239 Bayes 1819 Bayesian analysis 1647 Bayesian belief networks 943 Bayesian estimation 1859 Bayesian inference 1031 Bayesian linear model 1595 Bayesian methods 1903 Bayesian models 1491 Bayesian prediction 1903 Bayesian statistics 1535, 1661, 2341 Bayesian upgrading 1195 beams 1551 belief networks 489 benchmark 2045 bending moment 1229 binary decision diagram (BDD) 2045 BLEVE 693 blowout risk 209 Boolean forms 2239 Boolean models with loops 2063 Boolean orthogonalization 2197 Boolean polynomials 2053 Boolean representation method (BRM) 227 bootstrap 1877 branching stochastic processes 2115, 2173 bridges 1183, 1195, 1203, 1213, 1237 bridge management 1183 brittle fracture 1477
Bromwich Integral (or Mellin theorem) as Laplace inverse 1719 buffer stop collision risk assessment 977 building 1575 building fires 1379 bulk reliability 333 burn-in 1869 C language 323 calibration 1423, 1567, 1903 Campbell's theorem 1245 CAN-bus 2293 CANDU 435 capture volume in cavity 467 case study 79, 247 AFWS 1757 causal factors 951 causal probabilistic networks 943 centre for dependability and maintenance 1995 certainty equivalence principle 1351 chaos 357, 561, 2181, 2329 checklists 175 chemical industry 693, 717 chemical production 71 civil aviation 909, 951 CMM structure 1013 coefficient of variation 1583 of the safety margin 1601 cognitive science 307 coherent structure 629 coherent system 1629, 1937 collision avoidance 893 coloured point processes 1775 combination of loads 1457 combining information 1595 common cause 1741 failures 129, 2189 communication 971 comparing reliability methods 2265 competing risk 1775 complex systems 357, 489, 561, 2181 complexly structured reliability data 1877 component 1681 groups 2189 relationship 1937 reliability 1307 indices 1127 computational method 2353 computer aided reliability engineering (CARE) 1075 computer architecture 349 codes 1661 simulation 2393 tools for risk and reliability analysis 333
computerised analysis tools 1775 computerised simulation 701 concrete (reinforced) 1551 concrete bridges 1183 column 1307 products 1575 concretes quality control 1575 condition based maintenance 1701 condition-based predictive JIT maintenance 1013 condition monitoring 1013, 1701 conditional probability 1261 confidence level 1629 confined areas 771 conjugate priors 1897 CONPAS 467 consequence of failure 683 constraint 1477 automata 2021 programming system Toupie 2021 containment failure modes 441 filtered venting system 467 performance 425, 441,467 contaminant transport 2115 continuity of supply 1105 control 1977 charts 1967 charts/reliability/maintenance/safety research projects 2003 controlled structure 1369 convexity 1253 convolution 2011 core damage frequency 425, 435, 1801 corrosion 1183, 1435 and materials advisor 289 simulator 289 corrosion-liable equipment 289 cost 1183, 1681, 1741, 1749 cost/benefit 237 cost reduction 1801 coupling 1269 covariates 1985 Cox distributions 2149, 2163 Cox models 1859 CPPE - (Portuguese producer of electricity) 181 crack growth 933, 1315, 1467 crack propagation 1485 cranes 631 crew response time 167 criteria for standard setting 27 criterion of failure 1253 criticality 593, 763, 995 crystalline rock 541
cumulative hazard function 1877 cumulative number of outages 2353 cumulative outage time 2353 cut sets 2089, 2265 cutoff connections 2233 dam break 53 damage 1183, 1195, 1351 dangerous dose 603 dangerous goods transportation 725, 1003 data analysis 147, 1841, 1859, 2341 bank 1819 collection 147, 1841 exchange 1835 database 281, 667, 857, 987, 1039, 1835, 2189 databooks 1827 decision analysis 3, 507, 675 decision making 45, 209, 237, 535, 791, 1647 process 1637 decomposition method 2089 deep repository 541 deficiency 189 definitions 1835 degradation 995 mechanism 1221 degraded system configurations 2197 demand failure probability 1749 demanning 799 density function 1601 dependability 117, 341, 1629, 1995, 2149, 2329 data 1653 dependable systems 1021 dependencies 155 dependent failures 2029, 2173 design 281, 1519, 1551 criteria 1183, 1415 explosion loads 771 models 1379 scenarios 1435 destructive tests 1575 detection 2285 deteriorating structures 1213 deterministic 1143 development 175 management 375 diagnosis 1351, 2285 diagnostics 257 diagram 365 digraphs 2265 direct modelling 2139 Dirichlet prior 1031 disjoint terms 2239
dispersion 657 distributed computing environments 341 distribution functions 1083, 1535 systems 1105 distributions 1093, 2139 diverse back translation 349 document management 649 documentation 365 Doury standard deviation 709 down time costs 1709 driver assistance 1021 duration models 1031 duty of care 1115 dynamic characteristic 1341, 2105, 2129, 2155 dynamic systems 2089, 2155 earthquakes 1535 economic analysis 1047 effects and criticality analysis (FMECA) 227 efficiency 341 EIAMUG 1013 EIReDA 1819 electric motors 1827 electric propulsion systems 2275 electrical networks maintenance 619 electrical propulsion systems 1047 electronic circuit board 1629 emergency core cooling system 2311 emergency preparedness 829 engine room 849 environment 175 influence 2029 environmental impact 639, 877, 1171 environmental systems 561 epistemic uncertainty 569 equipment criticality 1709 testing 1801 errorproneness 189 escalating fire 745, 755 escalation 819, 837 ESReDA 1819 essential logic model 365 ETA 849 EUReDATA 1819 Eurocodes 1307 evaluation 175 methods 2213 of test procedures 1757 event analysis 273, 675 event based procedures 475 event time tracking logic 2205 event trees 489, 2105
evidential reasoning 389 Evora 1399 exact reliability 2265 excitation 1323 expanded-exponential function 315 expected test time 1913 expected total cost 1913 experience feedback 995 experiment 147 experimental design 1269 experimentation 701 expert assessments 1903 expert judgement 199, 209, 951 expert opinions 155, 1897 expert systems 943 expert witness 217 explosion 781 load 701 exponential law 1883 exponential-multinomial function 315 extension inspection interval 683 extension principle 2423 extreme natural action 1369 extreme value distribution 1245 extreme values 1237 F - N curves 725, 1003 fail silence 1021 failure 995, 1519, 2285, 2393 analysis 383 diagnosis 2301 estimation 1609 failures mode 227 free life 923 identification 297 intensity 1083 mechanism 1527 probability 1269 rate 1075, 2233 dynamic analysis 1155 statistics 1135 FAST 1621 fatal accident database 951 fatality criteria 603 fatigue 933, 1315, 1467 life 1485, 2367 fault history 129 prevention 1013 tolerance 1021 tolerant system 923 trees 489, 2039, 2045, 2063, 2071, 2079, 2139, 2247, 2265 analysis 2415, 2423 faults 1351 field data 1897
collection 987 field return 1067 fieldbus technology 1013 financial risk management 1967 finite elements 1477, 1501 finite period 2353 fire and explosion 819 fire hazard 1399 protection 1391 reliability 1407 fireball 657 fires 837, 849 first and second order (FORM/SORM) methods 1407 first crossing problem 1203 first outage time 2353 flammable substances 709 flash fire 657 flaws 1869 flooding 1741 flow diagrams 2257 fluctuating environment 2163 fluctuation scale 1583 FMECA 2403 Fokker-Planck equation 2105 FORM 1341, 1519 formal languages 383 fractal processes 1245 fractured media 585 French standards 1171 frequency 1889 model 885 response function 1331 FSA 849 FTA 849 function block programming 349 function-oriented system analysis 297 functional breakdown 763 functional modelling 2301 Fussell-Vesely 1801 fuzzy constraint 1749 fuzzy decision support system 2403 fuzzy fault tree analysis 2423 fuzzy implications 2403 fuzzy inference 2403 fuzzy logic 725 fuzzy probability 2423 fuzzy sets 389, 1609, 2415, 2423 theory 155 gas explosions 701, 771 Gaussian models 709 Gaussian processes 1245 Gaussian wide-band process 1323 generalized extreme value 1535
generalized model 2385 generalised modus ponens 2403 generation 1093 genetic algorithm 1749 geologic disposal 507 geometrical imperfections 1501 geotechnical risk 1171 goodness of fit testing 1849 graph connectedness 2071 graphical methods 1985 gravity dams 1519 Great Britain 799 Gretener method 1399 groundwater 2115 GTST method 2301 hardware 383 reliability 307 hardware and software fault tolerance 341 hazards 189, 1543, 2293 hazard field 357, 2181 identification 79 registration system (HRS) 977 hazardous material 1841 hazardous waste 507 hazop 1155 health and safety executive 247 health and safety practices 105 heat transfer correlations 2311 heavy gases 709 Her Majesty's Inspectorate of Pollution (HMIP) 3, 549 heterogeneity 585 heuristic 2045 hierarchical decomposition 2063 hierarchically fault tree synthesis 2247 highly enriched spent nuclear fuel 593 historic center 1399 HMG method 297 housing 1575 HSE management 819 human activities 117 human cognitive reliability method 167 human errors 129, 137, 155, 791, 867 human factors 117, 129, 137, 147, 155, 273, 857, 901 human mind behavior modeling 1637 human reliability 129, 137, 147, 307, 425 analysis (HRA) 155, 167 methods 117 Hurst coefficient 1245 hydraulic modeling 877 hydrogen combustion 1661 igniter 467
I-beam 1229 IAM model 1013 idempotence property 1937 IEC 605 1883 IEEE reliability test system 333 immunity 189 imperfect repair 2205 implementation 175 importance 1155, 2341 factors 2053 sampling 1261 imposed load 1307 in-service testing 1801 incompatibility 2321 independent cut-set 2089 independent power production 333 independent sub-systems 2089 individual risk 45 industrial accidents 667 industrial activities 1003 industrial risk 667 industrial system failures 2021 information models and formal notations 389 processing 1595 system 281 infrastructure project 265 inherent safety 613 inherent SHE 1155 innovation 2341 inoperability 2321 inspection 1467 planning 683 program 763 instrumentation and control systems 1143 insurance 829 integrated analysis 675 integrated logistic support 281 integrated risk modelling 63 integrated safety 181 assessment 2097 intelligent systems 289 interactive step trade-off method (ISTM) 227 interdisciplinary post-doctoral programs 2003 interdisciplinary postgraduate degrees 2003 interim spent nuclear fuel storage facility 459 interval availability 1727 intervals 1681 intuition theory 1637 inundation maps 53 inventory 1671 control 1709 IPE 425
irregular series-parallel systems 1931 isolation 2321
jet fires 781, 1407 jump processes 1295 k-terminal problems 2247 Kalman filter 1351 knowledge acquisition and modeling 199 base 1351 engineering 199 knowledge-based system 297 Kolmogorov-Dmitriev 2173 theory 2115 KSNP 467 land-use planning 1003 landslides 1543 Lantau and airport railway 977 large databases 1827 large systems 1931 law 1115 leak detection 733 learning curves 867 lethal dose 603 level 2 441 PSA 467 liability 971, 1115 license application 541 life consumption 1057 life cycle 1039, 1213 cost 829, 1183 life distribution function 1849 life extension 1057 life safety 1391 lifetime cost 791 lifetime distribution 399 lifts 631 likelihood function 383 limit analysis 1527 limit load 1501 limit theorems 1923 linear quadratic Gaussian problem 1351 linear regression 1595 live 497 live work 619 living PSA 1793 load management 333 models 1295 local risk contours 1003 log-linear rate model 2385 log-normal distribution 2367 logic network 1937
long-term safety 541 long timescales 577 loss of off-site power 1889 loss prevention 181 losses per unit time 1947 low-level risk 577 LPG 693,717 M(t)-curves 1869 machine 2393 machine tool design 2367 MAFERGO 117 magnitude recurrence 1535 maintainability 961, 1039 and safety (RAMS) 1047 maintenance 129, 281,449, 483, 995, 1155, 1195, 1467, 1681, 1701, 1709, 1741, 1757, 1793, 1801, 2139 engineering 1995 free operating periods 923 interval 995 management 649, 1995 information systems 1701 optimisation 1775 program 763 scheduling 1783 significant systems 763 strategies 1691 major accidents 667 risks 1003 major hazards 649 man-machine systems 901 management 535 costs 1155 dynamics 63 program 181 system 175 marginal cost analysis 1671 maritime safety 867, 901 Markov 2123 graph steady state analysis 2275 processes 1195, 2155 system 2149 theory 1727 Markovian model 2053 Markovian modelling 1757 material combination factor 1559 material damage 829 maximum distribution 1245 maximum likelihood method 1315 mean time between failure (MTBF) 1047 mean time-to-system-failure 2163 mechanical component 1653 mechanical reliability 1859 memorization 2129 metacomputer 1287
method of maximum likelihood 1485 methodology 247, 799, 901 development 819 methods 383 metrics 323, 375 minimal cuts 2257 minimal paths of a network 2239 minimum method 1859 mission analysis 933 mission management system 2301 model-checking 2021 model uncertainty 1647 modelling 2329 models 1551 of uncertainty 1253 modular approach 2089 modularization 2039 moment equations 1323 moments and characteristic function of a distribution 1937 monitoring 1977 monotone multistate structures 2213 Monte Carlo 1261, 1287, 2029, 2123 Markov chains 2341 methods 569 simulation model 1171 simulations 837, 961, 1135, 1379, 1527, 1767, 2011, 2129, 2257 MPLD method 2301 MTBF 1075 MTTF 2361 μ-calculus 2021 multi-degree of freedom systems 1323 multi disciplinary working 549 multilane effects 1237 multilayered perceptron 2257 multinormal distribution 1261 multinormal probability 1261 multi-phased homogeneous Markovian models 1767 multi state systems 1727 multiple-bus based systems 2247 multiple correspondence analysis 105 multiple criteria decision making (MCDM) 227 multiple objective decision making (MODM) 227 multiplicative connection model 2029 multiplicative models 1985 multistate systems 2213 natural circulation 2311 natural gas infrastructure 807 naval applications 1047 negligence 1115 negligible risk 27
network reliability analysis 2239 neural networks 2257 no-failure 1253 nodalization 2311 non coherent looped systems 2257 non-destructive examinations 1491 non-destructive tests 1575 non-deterministic 2293 non-differentiable processes 1245 nonexponential lifetime 2361 nonhomogeneous Poisson point processes 2385 non-linear dynamics 2329 non-Markovian 2163 non-parametric methods 1067 non-parametric statistics 1621 non-probabilistic theory 1253 non-repaired systems 2213 non-routine releases 639 non-structural components 1369 normal accidents 357, 2181 normal distribution 1261 normal processes 1245 NORSOK 763 North sea 799 not normally manned offshore installation 763 nuclear power plants 129, 1143, 2265, 2385 nuclear safety 1661 nuclear waste repository 593 nucleation 933 numerical algorithms 1261 numerical approximation 1783 numerical quadrature 1897 object orientation 407 objective function 1229 obstacles 701 occupational accidents 105 occupational ill-health 217 occupational safety 613 occurrence grading 909 office building 1391 offshore 799 installations 771 platforms 1415, 1809 pool fires 781 structures 1423, 1445, 1509 oil 799 and gas production 837 tankers 1457 on-line risk monitoring 449, 483 open pit 2393 operating experience 281, 1889 operation 2393 operator 147
operator training 147, 649 optimal control 1351 optimal resource allocation 1213 optimization 237, 675, 1183, 1229, 1519, 1681, 1709, 1741, 2045 organizational factors 89, 273, 1809 organizational learning 273 organizational manageability 683 outage costs 1093 outage impact measure 2353 outage models 1127 overhaul period 1749 ownership 497 packing effect 733 paradigm 89 parallel computing 585 parallel systems 1559 parallel time scales 1485 parameter estimation 2115, 2189 uncertainty 1341, 1647 parametric optimisation 1083 partial differential equations 2155 partial renewals 1767 partial safety factors 1295, 1567 parts count 1075 parts stress 1075 Pasquill 709 passive 2321 perception 37 performance 867 assessments 515, 525, 535, 541, 569, 577, 593 distribution function 2197 evaluation 2223, 2377 function 1341, 1601 measures 2213 shaping factors 155 perturbation factor 1601 petri nets 799, 1883, 2265, 2275 ph-distribution 2149 phenomenological analysis 675 physically-connected components 2029 pipeline failure 733 layability and installation 1435 stability 1435 pipelines 717, 725 plan for risk prevention 1171 planned preventive maintenance 1767 plant 79, 497 damage state 435 plated structures 1407 platform 799 point estimators 1877
point processes 1245 Poisson process 315 policy 37 polynomial expansions 1501 pool fire 657, 1407 population at risk 53 power generating systems reliability 2011 power law rate model 2385 power lines 1115 power systems 1093, 1135 precursor 89 predictive reliability 1897 pressure vessels 631, 683 prevention 613 preventive maintenance 639, 1013 preventive renewal 1691 PRIAM 1013 principal components analysis 105 prior information 1535 probabilistic 1143 probabilistic approach 15, 1543 probabilistic design 1279, 1601 probabilistic dynamics 2129 probabilistic estimation 1315 probabilistic formulation 1229 probabilistic fracture mechanics 1477 probabilistic methods 837, 1183 probabilistic models 781, 1445 probabilistic performance assessment 507 probabilistic reliability 2367 probabilistic risk analysis (PRA) 227 probabilistic risk assessment 3, 167, 449 probabilistic safety analysis (PSA) 155 probabilistic safety assessment 15, 147, 199, 483, 561, 909, 1221, 2097 probability 675, 1287, 1583 assessment 209, 1903 distribution 2353 of failure 683 generating functions 2173 paper 1595 process 79 hazard analysis 1155 industry 683 monitoring 375 optimization 639 safety management 649 production loss 829 profitability 1977 programmable electronic system 349 programmable logic controller 349 progressive collapse 1407 project management 265, 549 project risks 265 PROMENVIR 1287 propagation of uncertainty 1629
proportional hazard model 1031 protection systems 675 PSA 435, 441, 475, 489, 849, 1793 applications 1793 PSC 497 pseudo-Boolean 2079 public decision making 3 public equipment 1575 public spaces 971 QRA 745, 755, 819 qualitative analyses 2021 quality 1083, 1835, 2403 assurance 375, 541 quantified risk analysis 829 quantitative characteristics 987 quantitative risk analysis 885, 1809 radar accuracy 893 radioactive waste 507, 577, 585 disposal 3, 515, 525, 535, 549, 561, 593 railway 961, 971 network 995 vehicles 987 RAM analysis 2275 random eigenvalue problem 1331 random fields 1287 random shocks 1977 random variables 1287, 1509, 1967 ranking of investment opportunities 1967 rate of occurrence of failures 2385 rating system 175 raw reliability data 1827 RC structures 1559 reactor cavity flooding system 467 real-time 257 reassessment 1423 rebars 1559 recalibration 399 reconciliation 1903 recovery 2285 rectangular wave processes 1295 redundancy 2173 applications 2275 regenerative systems 1727 regularity 807, 995 regulation 3, 535, 549, 915 regulatory burden reduction 1801 regulatory review 541 relative ranking 459 reliability 281, 323, 383, 799, 961, 987, 1031, 1039, 1047, 1057, 1083, 1093, 1105, 1155, 1203, 1213, 1245, 1287, 1391, 1423, 1467, 1501, 1519, 1527, 1559, 1629, 1727, 1741, 1767, 1819, 1841, 1883, 1977, 2089, 2123, 2139, 2149, 2233, 2321, 2361, 2393, 2403 analysis 1047, 1135, 1937, 2223, 2353, 2377, 2385 block diagram 961, 2197 centered maintenance 995, 1221 characteristics 1985 data 1835 databases 807, 1775 degradation 923 design 1877 evaluation 2079 expression 2265 factor 2367 growth 307, 383, 1897, 2341 index 1307, 1379, 1601 of large systems 1923 measures 2213, 2353 of NDE 1491 networks 2063 parameters 1749 prediction 375, 1859 test 1653 theory 2213 trend models 307 worth 1093 remanent life assessment 1057 renewal models 1783 renewal process and replacement equipments 1719 repair 1519 effect 2205 times 1105 repair/replace as required 1057 repairable systems 1869, 2385 replacement 1671 times 1947 requirement 365, 923 resampling 1877 research projects 1995 resistance 1567 response processes 1323 response surfaces 569, 1269 approach 1341 response time 341 restoration probability 1889 rheological properties 1203 risk 3, 37, 915, 1115, 1543, 1749, 1793, 2155 achievement worth 1801 analysis 45, 209, 265, 459, 837, 877, 885, 1809, 1841 applications 1637 assessment 217, 265, 549, 593, 693, 717, 791, 1379, 2063 aversion 45 communication 53
comparison 459 control process 977 cost 791 criteria 915 of failure 1691 graph 631 importance measures 1801 index 977 informed regulations 15 level indicators 1809 management 71, 265, 489, 649, 667, 857, 915 measure 459 monitoring 489 overview 745, 755 perception 53 to personnel 885 presentation 577 priority number 2403 ranking 909 reducing measures 829, 857 regulation 667 risk-based 237 risk based inspection 683, 1221 risk-based prioritization 507 risk based regulations 15, 1801 risk-influencing factors 1809 risks attitudes 105 risks control 1003 road networks 1213 roads 725 robots 2285, 2293 robotic car assembly 1013 robust reliability 1253 robustness 1351 rock falls 1171 root cause analysis 247 S-R-K classification model 167 safe passing distance 893 safety 137, 175, 189, 357, 365, 407, 497, 613, 971, 995, 1039, 1115, 1155, 1315, 1351, 1681, 1977, 2105, 2181, 2321, 2329, 2393 advisor 1155 analyses 771 assessments 217, 535, 603 audits 613, 619 cases 217, 489, 2293 critical computer systems 417 culture 71, 97 engineering 15 function 459 goal 425 hazards 97 index 1379
licensing of software 349 management 63, 79, 89, 97, 105, 915, 1155, 1809 research 247 system 977 manager 2293 of maritime traffic 893 monitor synthesis 407 monitoring 407 preventive systems 745, 755 related 1021 control 349 system 2265 SAM 441 sample size 1913 sampling 1261 scatter 1287 scenarios 541 self weight 1307 semi-Markov system 2149 semi-parametric 1031 sensitivity 1519, 2311 analysis 525, 585, 1621 studies 745, 755 sequential quadratic programming 1519 service water pumps 2385 sets 2139 severe accidents 257, 425, 435, 441 Seveso Directive 667 Shannon expansion 2039, 2247 ship 1039 accidents 857 collision 877 grounding 877 ship-platform collision 885 shipping 867 shore based collision risk recognition 893 signalling 995 simple definition of risk 1967 simulation 257, 867, 1261, 1279, 1287, 2139, 2233, 2329 methods 1509 procedure 2205 techniques 1883 simulation-based reliability assessment and design 1279 simulator 147 experiment 167 single and multiple variable inversion 2079 siting 37 situation assessment 893 skeletal structures 1407 slender structures 1501 sliding stability 1519 slope slides 1171 small firms 105
small-size sample 1883 smart structures 1369 smoke 781 societal risk 45, 725 socio-economics 1741 sociotechnical systems design 71 software 323, 383 analysis 417 engineering 365 failure 315 fault 315 metrics 417 qualification 417 reliability 307, 315, 399, 417 safety analysis 389 soil 1583 space 2285 spare ordering models 1671 spare parts 1709 spectral input-output relationships 1323 spent nuclear fuel 541 spline functions 399 splines 1661 sprinklers 1391 s,t-problems 2071, 2247 stability 1501 standard 1143 standby 1681 failure rate 1749 STARS 489 states graph 2089 states of a system and their components 1937 statics theory 1575 stationary processes 1245 statistical analysis 1083, 1551, 1567 statistical methods 1457 statistics 129, 147, 1583 steady-state availability 2163 steel 1567 steel bridges 1183 stochastic model 1323 of traffic flow 1203 stochastic petri nets 1767, 2257 stochastic process 1237 stochastic vibrations 1203 strong explosion 745, 755 structural codes 1567 structural collapse 1407 structural components 1221, 1369 structural integrity 1477, 1491 structural reliability 933, 1183, 1195, 1261, 1269, 1279, 1415, 1457, 1509, 1609 analysis 885 model 1221 structural risk 791
structural systems 1351 structure life 1315 structures 323, 1567, 2393 subjective probabilities 1647, 1903 subjective probability 209 subjective safety analysis 389 submarine pipelines 1435 sum-of-disjoint products 2079 symptom based procedures 475 system 189 identification 1351 operation 1135 reliability 307, 1067 safety 273 unreliability 2029 systematic analysis 951 systemic 357, 2181 systems engineering 365 systems of single use 1947 taxonomies 1827 taxonomy 857 Taylor's series 1937 technical certification 631 technological accident 137 technology assessment 613 technology options 613 telecommunication networks 2223, 2377 temporal dependency 2089 terminal pair reliability 2071 testing 1681 tests 1551 theory of complex variable functions (integrals) 1719 thermal-hydraulics 257 thermal power plants 181 thermal radiation 603 thin-walled girders 1229 THORP 497 time to testing 1913 time-dependent availability analysis 2275 time-sequential simulation 1135 time-triggered 1021 time-variant reliability 1295 tolerability 37 tolerable 915 tolerable risk 27 tolerance interval 1653 TOMHID 79 topological formula 2265 total productive maintenance 1995 total safety management 71 toxic gas 781 toxic substances 709 traffic actions 1237
data 1237 management systems 901 model 1237 training 649, 867 transient analysis 2097 transmission 1093 element 2367 network planning 1135 transport equations 2155 Trièves database 1543 true standard Gaussian distribution 1967 truncation decision 1913 truss structures 1609 truth table 1937 TTT-plot 399 turbulence 701 two parameter model 1601 type II censoring 1913
UK 37, 247 ultimate strength 1457 unacceptable risk 27 unavailability 1681, 1793, 2353 uncertain structural parameters 1331 uncertainties 155, 1287, 1351, 1445 uncertainty 209, 561, 693, 1629, 1653, 1841 analysis 515, 525, 569, 593, 1621, 1647 evaluation 725 propagation 2415 undergraduate course module 1995 uneven seabed 1435 unstable cliffs 1171 unstructured environment 2293 upcrossings 1245 upper bound theorem 1527 UVCE 657
variance reduction 2123, 2129 techniques 1621 VDEW statistics of incidents 1127 VTS 893 vulnerability 425 VVER reactors 1889, 2189
wall thickness sizing 1435 warning system 53 Waste Isolation Pilot Plant 515, 525, 593 wave data 1445 environment 1445 loads 1457 models 1445 wearing process 1977 weather statistics 1135 Weibull distribution 1957 Weibull failure model 1913 Weibull law 1883 Weibull stress 1477 Weibull systems 1923 wheeling 333 wind action 1307 Wolsong 435 work analysis 89 x-by-wire 1021
Yucca Mountain Project 593 zero option 1691
AUTHOR INDEX
Aamodt, K. 763 Absil, P. 1757 Adamski, M. 349 Agarwal, J. 2329 Agarwal, M. 2163 Ait Mohand, H. 1575 Alexandre das Neves, J. 181 Allain-Morin, G. 1143 Allan, R. N. 1093 Alonso, A. 1621 Ambartzumian, R. 1261 Andersen, L. B. 209 Andersen, T. 807 Anderson, D. R. 515, 525 Andersson, M. 585 Aneziris, O. N. 693, 717 Ansell, J. I. 2341 Appleton, D. P. 923 Arbaretier, E. 383 Ardorino, F. 1221 Arima, T. 849 Arsenis, S. P. 1819, 1827 Ashford, N. A. 613 Atkins, J. 561 Aufort, P. 1221, 1819 Auge, J.-C. 1859 Auglaire, M. 441 Augusti, G. 1213 Aven, T. 209, 1647, 1727
Baek, S. Y. 2205 Baeta, I. 2377 Baggen, R. 273 Balady, M. A. 561 Baldauf, M. 893 Banerjee, S. 733 Bara, A. 709 Bareith, A. 147 Baroni, P. 199 Basabilvazo, G. 515, 525 Battaini, M. 1323, 1369 Bea, R. G. 791 Becker, G. 2155 Bellamy, L. J. 63 Belsito, S. 733 Belyaev, Y. K. 1877 Ben-Haim, Y. 1253 Béraud, M.-T. 281 Bérenguer, C. 1767 Bereza, P. 1229 Berg, H. P. 15
Bertrand, G. 1601 Betâmio de Almeida, A. 53 Biasse, J.-M. 1883 Bigün, E. S. 1903 Bills, R. 507 Bischoff, A. K. 2423 Bitner-Gregersen, E. M. 1445 Bjorna, J. K. 829 Bljuger, E. 1551 Blockley, D. I. 2329 Boak, D. M. 507 Boesmans, B. 441 Bogaerts, W. 289 Bolado, R. 1621 Bondavalli, A. 341 Bonvicini, S. 725 Boogaard, J. 683 Borbély, S. 147 Borgoń, J. 1315 Bos, J. F. T. 2285 Böse, C. 1127 Bot, Y. 1075 Bottazzi, A. 1155 Bouissou, M. 2045 Boulliard, G. 289 Bousfiha, A. 2149 Brandtzæg, A. 791 Brazier, A. 37 Breitung, K. 1245 Brennan, G. 829 Bretschneider, M. 2423 Brown, D. A. 217 Bruyère, F. 2045 Bryla, Ph. 1221 Buvelot, R. 265
Camarinopoulos, L. 2155 Camp, A. L. 425 Carpignano, A. 489 Carvalho, T. 289 Casciati, F. 1245, 1369 Casella, G. 1457 Cecconi, A. 1567 Chambon, P. 1527 Chamoux, P. 799 Châtelet, E. 1767, 2257 Chaudhuri, M. 2163 Cheng, X. 2367 Cherubini, C. 1583 Chevalier, M. 1883 Chiaradonna, S. 341
Choi, S.H. 2233 Christensen, P. 2301 Christiansen, H.C. 175 Chu, M. S.Y. 515 Ciampoli, M. 1213 Ciccotelli, M. 1519 Clarotti, C.A. 1897 Cojazzi, G. 199 Collings, R.J. 357 Collins, R.J. 365, 2181 Colombo, L. 1163 Comotti, D. 2415 Cooke, R . M . 1775 Cooper, J.A. 2321 Copello, S. 1423 Cox, R.A. 819 Cox, T. 829 Craveirinha, J. 2223, 2377 Croce, P. 1237, 1567 Cross, S.J. 867 Csenki, A. 1971 D'Eer, A. 441 de Almeida, A. 1279 de Lemos, R. 389 de Rijke, W . G . 265 de Simone, E.A. 2385 de Souza Jr, D.I. 1913 De Vecchi, F. 649, 675 de Wit, H.W. 475 Degbotse, A. 1783 Dekker, R. 1709, 1727 Delcoux, J.L. 2123 Der Kiureghian, A. 1261 Dermenghem, J.-P. 995 Dessars, N. 2105 Devictor, N. 1269 Devooght, J. 2105 Dias Lopes, E. 289 Di Cocco, N . R . 1423 Di Giandomenico, F. 341 Di Giulio, A. 2415 Didier, C. 1171 Dilger, E. 1021 Dodds Jr, R . H . 1477 Doherty, S.M. 909, 951 Dornellas, C. R . R . 333 Dorrepaal, J.W. 1775 Drouin, M.T. 425 Duhesme, E. 2089 Dusserre, G. 709 Dutuit, Y. 2063, 2257 Dverstorp, B. 541 Efthymiou, M. 1415 Egidi, D. 1003
Eisinger, S. 2139 Eknes, M. L. 837 El Khair, A. 1031 El-Shayeb, Y. 1171 Emi, H. 849 Enes, J. 1057 Esmaili, H. 257 Esmeraldo, D. 289 Estes, A. C. 1183 Fadier, E. 117 Fahlbruch, B. 273 Faravelli, L. 1245, 1369 Favaro, M. 105 Favre, J. L. 1543 Ferjenčik, M. 459 Ferreira, I. M. 1399 Ferro Fernandez, R. 137 Fichera, C. 1155 Fiorentini, C. 649, 675 Fleming, P. V. 2403 Follen, S. 449, 483 Forester, J. 425 Formaniak, A. J. 961 Fracchia, M. 1047, 2275 Frangopol, D. M. 1183, 1213 Frankhauser, H. R. 1629 Frantzich, H. 1379 French, L. H. 971 Frutuoso e Melo, P. F. 2385 Gaarder, S. 857 Gabay, Y. 1075 Galson, D. A. 535 Gamba, L. 1527 Gao, J. 167 Garbatov, Y. 1467 Gertsbakh, I. 1485 Ghisleni, T. 2415 Gibczyńska, T. 1229 Girmes, D. H. 1967 Gomes, T. 2223, 2377 Górski, J. 407 Gouget, N. 1031 Goulão, A. 1057 Grall, A. 1767 Grote, G. 71 Guedes Soares, C. 781, 1407, 1445, 1467, 2003 Guerneri, A. 209 Guida, G. 199 Guldenmund, F. 63 Gustar, M. 1279 Haak, R. 1601
Hadjisophocleous, G. V. 1391 Halang, W. A. 349 Hale, A. R. 63, 631 Hallberg, O. 1083 Hamzehee, H. G. 1801 Hang, X.-R. 167 Hansen, E. 175 Hansen, N. J. 877 Haugen, K. 837 Haugen, S. 885 Hausberger, R. 1143 Heerings, J. H. 683 Heijer, T. 901 Heikkilä, J. 79 Helander, M. 399 Helton, J. C. 515, 525 Heming, B. H. J. 63, 631 Hernandez, R. 1067 Herrer, Y. 1075 Hewitt, J. R. 1801 Hide, A. 829 Hines, I. M. 1415 Ho, C. W. 977 Hokstad, P. 1775 Holický, M. 1307 Holló, E. 147 Hong, J. S. 2205 Hooijer, J. 901 Hora, S. 507 Horgen, H. 807 Hudoklin, A. 989 Hügel, R. 1127 Inozu, B. 1039 Irwin, A. 37 Islamov, R. T. 1637 Izquierdo Rocha, J. M. 2097
Karpyak, S.D. 1801 Karsa, Z. 147 Kautsky, F. 541 Kelly, C. 37 Khatib-Rahbar, M. 257 Kim, D . H . 467 Kim, M.-K. 435 Kim, S.D. 467 Kim, T.W. 2205 Kinsella, K. 829 Kirchsteiger, C. 667 Kirwan, B. 63 Kladias, N. 1155 Klampfer, B. 71 Klefsj6, B. 1995 Klimaszewski, S. 1315 Koh, J.S. 2233 Kolowrocki, K. 1923 Kopetz, H. 1021 Kopnov, V.A. 1947 Korczak, E. 2213 Kordonsky, Kh. 1485 Korenfeld, H. 1075 Korneliussen, G. 837 Kosmowski, K.T. 155 Kragh, E. 829 Krakovski, M.B. 1559 Krug, M. 1021 Kfifner, H. 2247 Kumar, D. 1985 Kumar, U. 1995 Kundin, J. 1379 Kfinzler, C. 71 Kwong, P. 977 Laakso, K. 129 Labeau, P.E. 2129 Lajeunesse, S. 2021 Laleuf, J.-C. 2089 Lallement, J. 1859 Lander, E.P. 649 Lannoy, A. 1897 Lardeux, E. 2053 Lauridsen, K. 2301 Le Guen, J.M. 27 Lebeau, H. 1127 Lechani, M. 1575 Leclercq, P.R. 375 Lehner, J. 425 Leira, B.J. 1435 Lemaire, M. 1269, 1501 Le6n, F. 1957 Leonelli, P. 725, 1003 Leroi, E. 1543 Leroy, A. 799 Lid6n, P. 1021
Lie, C. H. 2205 Lim, T. J. 2205 Lima, M. L. 53 Limnios, N. 2149 Lincoln, R. 507 Livingston, A. D. 247 Livingston, A. G. 639 Llory, M. 89 Løvås, G. G. 1135 Lund, J. 829 Lungu, D. 1535 Luxhoj, J. T. 943 Lyonnet, P. 1859 Maciejewski, H. 1849 Madiou, H. 1575 Madsen, H. O. 2301 Maglione, R. 209 Magne, L. 1221 Magnusson, S. E. 1379 Mancino, E. 1195 Manzini, R. 1047 Marchante, E. M. 1287 Marek, P. 1279 Marietta, M. G. 515, 525 Marques, M. 1269 Marseguerra, M. 2029, 2115, 2173 Martinez Garcia, J. 1719, 1937 Martorell, S. 1749 Matoba, M. 849 McCall, G. 1021 McCombie, C. 577 McKay, I. P. 603 McKinley, I. G. 577 McNeish, J. A. 561 Meghella, M. 1519 Meléndez Asensio, E. 2097 Mello, J. C. O. 333 Melo, A. C. G. 333 Mendenhall, F. 507 Mendes, B. 585 Mendes, N. 619 Mercx, W. P. M. 701 Micheler, M. 2155 Miedl, H. 417 Mikuličić, V. 449, 483 Miller, R. 273 Mira, J. 1661 Mitteau, J. C. 1501 Mitterrand, J.-M. 1883 Mlynczak, M. 2393 Moeyaert, P. 441 Mol, W. E. 631 Molinari, V. 1039 Monnier, B. 1221 Morgado Pereira, C. 181
Morozov, V. B. 1889 Morrey, R. 2293 Mouzakis, G. 657 Moya, J. A. 1621 Muir, I. G. S. 961 Müller, B. 1021 Muñoz, A. 1749 Murdock, W. P. 1783 Nachlas, J. A. 1783 Nanu, L. 961 Natke, H. G. 1351 Nesje, J. D. 829 Neves, L. 781 Nicholls, D. B. 915 Niczyj, J. 1609 Nielsen, L. 1809 Nilsen, T. 209 Nojo, S. 2353 Nordvik, J. P. 489 Norrby, S. 541 Nowakowski, T. 1653 Nowicki, B. 407 Núñez McLeod, Jorge 323 Núñez McLeod, Juan 323 Nyborg, M. 1869 O'Brien, J. 449 Ohanian, V. 1261 Øien, K. 1809 Olofsson, M. 857 Omata, S. 849 Ono, T. 1341 Oort, M. J. A. 2285 Opskar, K. A. 1135 Orandi, S. 257 Oscarsson, P. 1083 Padovani, E. 2029 Panizza, U. 1021 Papazoglou, I. A. 693, 717 Pappas, J. 819, 829 Pardi, L. 1195 Park, B.-C. 435 Park, S. Y. 467 Parkinson, W. J. 1801 Pasquet, S. 2257 Pasquini, A. 307 Paulsen, J. L. 1775 Pedersen, B. 807 Peña, D. 1661 Perasso, L. 1155 Pereira, A. 585 Pereira, J. 2377
Author Index Perez, D. 1883 P6rez Mulas, A. 2097 Petit, A. 799 Petkov, G . I . 2265 Pettersson, L. 1835 Phillips, M . J . 2341 Phlippo, K. 289 Piccinini, N. 675 Pickett, A. D . F . 961 Pieracci, A. 933 Pierrat, L. 1047, 2275 Pietersen, C . M . 631 Pineau, J.P. 1841 Pinheiro, J. M.S. 333 Pinola, L. 199 Pistikopoulos, E. N. 639 Pitner, P. 1221 Piva, R. 1423 Plasmeijer, R. 1709 Poledna, S. 1021 Poloni, M. 289 Porat, Z. 2361 Poupkou, A. 657 Pratt, T. 425 Preston, M . L . 1163 Priestly, K . N . 247 Prince, L. 37 Prindle, N . H . 507 Procaccia, H. 1819, 1897 Pulkkinen, U. 1491 Pyy, P. 129 Quayle, S.
2293
Rackwitz, R. 933, 1295 Rai, S. 2079 Ramalhoto, M . F . 2003 Rao, N . R . 1783 Raoult, J.P. 1031 Rauzy, A. 2021, 2045, 2063 Rawlinson, G . A . 497 Rechard, R . P . 569, 593 Reiman, L. 129 Reinertsen, R. 763 Rejdemark, K. 1629 Rew, P.J. 603 Rezig, S. 1543 Ribbert, C. 631 Ricotti, M . E . 2311 Righini, R. 1155 Rit, J.-F. 281 Rivera, S.S. 323 Rivier Abbad, J. 1105 Rizzo, A. 307 Rizzuto, E. 1457 Roberts, J.B. 1323
Rognstad, K. 857 Román Ubeda, J. 1105 Romera, R. 1361, 2011 Rozmarynowski, B. 1509 Ruggieri, C. 1477 Saeed, A. 389 Sáiz de Bustamante, B. 315 Saldanha, P. L. C. 2385 Salvatore, W. 1237, 1567 Sanpaolesi, L. 1237 Santos, R. 2377 Sardella, R. 199 Sayce, A. G. 909, 951 Scarf, P. A. 1701 Scataglini, L. 675 Schaedel, P. G. 1039 Schedl, A. V. 1021 Scheiwiller, A. 1595 Schilling, M. Th. 333 Schloss, J. C. 1801 Schmatjko, K. J. 289 Schmocker, U. 257 Schneeweiss, W. G. 2039, 2071, 2247 Scott, I. M. B. 217 Selmer-Olsen, S. 837 Senni, S. 1163 Seppala, A. 97 Serradell, V. 1749 Serrano, M. B. 1399 Seward, D. 2293 Shen, Y.-L. 2239 Shen, Z.-P. 167 Shephard, L. E. 515 Shetty, N. K. 1407 Sieniawska, R. 1203 Silva, D. 53 Simić, Z. 449, 483 Simmons, P. 37 Simola, K. 1491 Sinisi, M. 2415 Sklet, S. 1809 Skramstad, E. 829 Smalko, Z. 1691 Smith, M. A. J. 1727 Smolarek, L. 1931 Smoliński, H. 1315 Śniady, P. 1203 Söderberg, J. 1021 Somerville, I. 2293 Spadoni, G. 725, 1003 Spray, S. D. 2321 Spurgin, A. J. 147 Stadler, A. 989 Stefanis, S. K. 639
Stillman, R. H. 1115 Strömberg, M. 1021 Sukiasian, H. 1261 Sumerling, T. J. 549 Svendsen, T. 771, 829 Székely, G. S. 1331 Tan, J. K. G. 227 Tao, C.-X. 2239 Teichert, W. H. 1331 Teixeira, A. P. 781 Thieffry, P. 1501 Thomas, P. 2257 Thompson, B. G. J. 3, 549 Thompson, R. 357, 2181 Thurner, T. 1021 Tierney, M. S. 569 Tirsun, D. M. 1801 Tokmachev, G. V. 1889, 2189 Tomaszek, H. 1315 Tombuyses, B. 1757 Tomter, A. 189 Tong, D. 977 Torres Valle, A. 1793 Toverud, O. 541 Trahan, J. L. 2079 Trbojevic, V. 829 Tromans, P. S. 1415 Tulp, J. 289 Tveit, O. 829 Tzidony, D. 2361 Uguccioni, G. 2415
Vahl, A. 2197 Vallikat, V. 561 van Acker, W. 289 van Gelder, P. H. A. J. M. 45, 1535 van Manen, S. E. 1741 van de Graaf, J. W. 1415 van den Broek, B. 631 van den Bunt, B. 1741 van der Does de Bye, M. R. 265 van der Duyn Schouten, F. A. 1727 Vancoille, M. 289 Vaurio, J. K. 1681 Veneziano, V. 307 Verdel, T. 1171
Vérité, B. 1221 Verweij, A. J. P. 475 Vianna, C. 289 Vijaykumar, R. 257 Vilagut Orta, C. 649 Villagarcia, T. 1361, 2011 Villain, B. 1221 Vinnem, J. E. 745, 755, 819, 829 Vivalda, C. 489 von Collani, E. 1977 Vrijling, J. K. 45, 265 Vrouwenvelder, T. 1307 Walker, G. P. 37 Wang, J. 389 Watanabe, H. 2353 Weber, T. 1127 Weiner, R. 507 Wellssow, W. H. 1127 Westberg, U. 1985 Wiersma, E. 901 Williams, C. R. 3 Wilmart, P. 441 Wilmot, R. D. 535 Wilpert, B. 273 Wingefors, S. 541 Wirsching, J. 2071 Woodhouse, J. 237 Woodman, N. J. 2329 Wynne, B. 37 Xu, J. 341
Yamamoto, N. 849 Yung, D. 1391 Zeng, S. W. 1013 Zhao, M. 399 Zhao, Y. G. 1341 Zio, E. 2029, 2115, 2173, 2311 Ziomas, I. C. 657 Zuchuat, O. 257 Zuidema, P. 577 Zukowski, S. 1203 Zurek, J. 1691 Zwetsloot, G. 613