IEEE Press
445 Hoes Lane
Piscataway, NJ 08855

IEEE Press Editorial Board
Lajos Hanzo, Editor in Chief

R. Abhari, J. Anderson, G. W. Arnold, F. Canavero, M. El-Hawary, B-M. Haemmerli, M. Lanzerotti, D. Jacobson, O. P. Malik, S. Nahavandi, T. Samad, G. Zobrist
Kenneth Moore, Director of IEEE Book and Information Services (BIS)
MAINTAINING MISSION CRITICAL SYSTEMS IN A 24/7 ENVIRONMENT Second Edition
Peter M. Curtis
IEEE Press Series on Power Engineering

A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2011 by the Institute of Electrical and Electronics Engineers, Inc. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:

Curtis, Peter M., author.
  Maintaining mission critical systems in a 24/7 environment / Peter M. Curtis. — 2nd ed.
    p. cm. — (IEEE Press series on power engineering)
  ISBN 978-0-470-65042-4 (hardback)
  1. Reliability (Engineering) I. Title.
  TA169.C87 2011
  620'.00452—dc22
  2010049606

Printed in Singapore.

oBook ISBN: 978-1-118-04164-2
ePDF ISBN: 978-1-118-04162-8
ePub ISBN: 978-1-118-04163-5

10 9 8 7 6 5 4 3 2 1
CONTENTS
Foreword
Preface
Acknowledgments
1 An Overview of Reliability and Resiliency in Today's Mission Critical Environment
  1.1 Introduction
  1.2 Risk Assessment
    1.2.1 Levels of Risk
  1.3 Capital Cost Versus Operation Cost
  1.4 Critical Environment Workflow and Change Management
    1.4.1 Change Management
    1.4.2 Escalation Procedures
  1.5 Testing and Commissioning
  1.6 Documentation and the Human Factor
  1.7 Education and Training
  1.8 Operation and Maintenance
  1.9 Employee Certification
  1.10 Standards and Benchmarking
  1.11 Conclusion
  1.12 Risk Analysis and Improvement

2 Energy Security and Its Effect on Business Resiliency
  2.1 Introduction
  2.2 Risks Related to Information Security
  2.3 How Risks are Addressed
  2.4 Use of Distributed Generation
  2.5 Documentation and Its Relation to Information Security
  2.6 Smart Grid
  2.7 Conclusion
  2.8 Risk Analysis and Improvement
3 Mission Critical Engineering with an Overview of Green Technologies
  3.1 Introduction
  3.2 Companies' Expectations: Risk Tolerance and Reliability
  3.3 Identifying the Appropriate Redundancy in a Mission Critical Facility
    3.3.1 Load Classifications
  3.4 Improving Reliability, Maintainability, and Proactive Preventative Maintenance
  3.5 The Mission Critical Facilities Manager and the Importance of the Boardroom
  3.6 Quantifying Reliability and Availability
    3.6.1 Review of Reliability Terminology
  3.7 Design Considerations for the Mission Critical Data Center
    3.7.1 Data Center Certification
  3.8 The Evolution of Mission Critical Facility Design
  3.9 Human Factors and the Commissioning Process
  3.10 Short-Circuit and Coordination Studies
    3.10.1 Short-Circuit Study
    3.10.2 Coordination Study
  3.11 Introduction to Direct Current in the Data Center
    3.11.1 Advantages of DC Distribution
    3.11.2 DC Lighting
    3.11.3 DC Storage Options
    3.11.4 Renewable Energy Integration
    3.11.5 DC and Combined Cooling, Heat, and Power
    3.11.6 Current State of the Art
    3.11.7 Safety Issues
    3.11.8 Maintenance
    3.11.9 Education and Training
    3.11.10 Future Vision
  3.12 Containerized Systems Overview
  3.13 Conclusion
4 Mission Critical Electrical System Maintenance and Safety
  4.1 Introduction
  4.2 The History of the Maintenance Supervisor and the Evolution of the Mission Critical Facilities Engineer
  4.3 Internal Building Deficiencies and Analysis
  4.4 Evaluating Your System
  4.5 Choosing a Maintenance Approach
    4.5.1 Annual Preventive Maintenance
  4.6 Safe Electrical Maintenance
    4.6.1 Standards and Regulations
    4.6.2 Electrical Safety: Arc Flash
    4.6.3 Personal Protective Equipment (PPE)
    4.6.4 Lockout/Tagout
  4.7 Maintenance of Typical Electrical Distribution Equipment
    4.7.1 Thermal Scanning and Thermal Monitoring
    4.7.2 15 kV Class Equipment
    4.7.3 480 Volt Switchgear
    4.7.4 Motor Control Centers and Panel Boards
    4.7.5 Automatic Transfer Switches
    4.7.6 Automatic Static Transfer Switches (ASTS)
    4.7.7 Power Distribution Units
    4.7.8 277/480 Volt Transformers
    4.7.9 Uninterruptible Power Systems
  4.8 Being Proactive in Evaluating Test Reports
  4.9 Conclusion
5 Standby Generators: Operations and Maintenance
  5.1 Introduction
  5.2 The Necessity for Standby Power
  5.3 Emergency, Legally Required, and Optional Systems
  5.4 Standby Systems That Are Legally Required
  5.5 Optional Standby Systems
  5.6 Understanding Your Power Requirements
  5.7 Management Commitment and Training
    5.7.1 Lockout/Tagout
    5.7.2 Training
  5.8 Standby Generator Systems Maintenance Procedures
    5.8.1 Maintenance Record Keeping and Data Trending
    5.8.2 Engine
    5.8.3 Coolant System
    5.8.4 Control System
    5.8.5 Generator Mechanics
    5.8.6 Automatic and Manual Switchgear
    5.8.7 Load Bank Testing
  5.9 Documentation Plan
    5.9.1 Proper Documentation and Forms
    5.9.2 Record Keeping
  5.10 Emergency Procedures
  5.11 Cold Start and Load Acceptance
  5.12 Nonlinear Load Problems
    5.12.1 Line Notches and Harmonic Current
    5.12.2 Step Loading
    5.12.3 Voltage Rise
    5.12.4 Frequency Fluctuation
    5.12.5 Synchronizing to Bypass
    5.12.6 Automatic Transfer Switch
  5.13 Conclusion
6 Fuel Systems Design and Maintenance
  6.1 Introduction
  6.2 Brief Discussion on Diesel Engines
  6.3 Bulk Storage Tank Selection
    6.3.1 Aboveground Tanks
    6.3.2 Modern Underground Tanks and Piping Systems
  6.4 Codes and Standards
  6.5 Recommended Practices for All Tanks
  6.6 Fuel Distribution System Configuration
  6.7 Day Tank Control System
  6.8 Diesel Fuel and a Fuel Quality Assurance Program
    6.8.1 Fuel Needs and Procurement Guidelines
    6.8.2 New Fuel Shipment Prereceipt Inspection
    6.8.3 Analysis of New Fuel Prior to Transfer to On-Site Storage
    6.8.4 Monthly Fuel System Maintenance
    6.8.5 Quarterly or Semiannual Monitoring of On-Site Bulk Fuel
    6.8.6 Remediation
  6.9 Conclusion
7 Power Transfer Switch Technology, Applications, and Maintenance
  7.1 Introduction
  7.2 Transfer Switch Technology and Applications
  7.3 Types of Power Transfer Switches
    7.3.1 Manual Transfer Switches
    7.3.2 Automatic Transfer Switches
  7.4 Control Devices
    7.4.1 Time Delays
    7.4.2 In-Phase Monitor
    7.4.3 Test Switches
    7.4.4 Exercise Clock
    7.4.5 Voltage and Frequency Sensing Controls
  7.5 Design Features
    7.5.1 Close Against High In-Rush Currents
    7.5.2 Withstand and Closing Rating (WCR)
    7.5.3 Carry Full Rated Current Continuously
    7.5.4 Interrupt Current
  7.6 Additional Characteristics and Ratings of ATS
    7.6.1 NEMA Classification
    7.6.2 System Voltage Ratings
    7.6.3 ATS Sizing
    7.6.4 Seismic Requirement
  7.7 Installation and Commissioning, Maintenance, and Safety
    7.7.1 Installation and Commissioning
    7.7.2 Maintenance and Safety
    7.7.3 Maintenance Tasks
    7.7.4 Drawings and Manuals
    7.7.5 Testing and Training
  7.8 General Recommendations
  7.9 Conclusion
8 Static Transfer Switch
  8.1 Introduction
  8.2 Overview
    8.2.1 Major Components
  8.3 Typical Static Switch, One-Line Diagram
    8.3.1 Normal Operation
    8.3.2 Bypass Operation
    8.3.3 STS and STS/Transformers Configurations
  8.4 STS Technology and Application
    8.4.1 General Parameters
    8.4.2 STS Location and Type
    8.4.3 Advantages and Disadvantages of the Primary and Secondary STS/Transformer Systems
    8.4.4 Monitoring, Data Logging, and Data Management
    8.4.5 Downstream Device Monitoring
    8.4.6 STS Remote Communication
    8.4.7 Security
    8.4.8 Human Engineering and Eliminating Human Errors
    8.4.9 Reliability and Availability
    8.4.10 Repairability and Maintainability
    8.4.11 Fault Tolerance and Abnormal Operation
  8.5 Testing
  8.6 Conclusion
9 Fundamentals of Power Quality
  9.1 Introduction
  9.2 Electricity Basics
    9.2.1 Basic Circuit
    9.2.2 Power Factor
  9.3 Transmission of Power
    9.3.1 Life Cycle of Electricity
    9.3.2 Single-Phase and Three-Phase Power Basics
    9.3.3 Unreliable Power versus Reliable Power
  9.4 Understanding Power Problems
    9.4.1 Power Quality Transients
    9.4.2 RMS Variations
    9.4.3 Causes of Power Line Disturbances
    9.4.4 Power Line Disturbance Levels
  9.5 Tolerances of Computer Equipment
    9.5.1 CBEMA Curve
    9.5.2 ITIC Curve
    9.5.3 Purpose of Curves
  9.6 Power Monitoring
  9.7 The Deregulation Wildcard
  9.8 Conclusion
10 UPS Systems: Applications and Maintenance with an Overview of Green Technologies
  10.1 Introduction
    10.1.1 Green Technologies and Reliability Overview
  10.2 Purpose of UPS Systems
  10.3 General Description of UPS Systems
    10.3.1 What is a UPS System?
    10.3.2 How Does a UPS System Work?
    10.3.3 Static UPS Systems
    10.3.4 Online
    10.3.5 Double Conversion
    10.3.6 Double Conversion UPS Power Path
  10.4 Components of a Static UPS System
    10.4.1 Power Control Devices
  10.5 Online Line Interactive UPS Systems
  10.6 Offline (Standby)
  10.7 The Evolution of Static UPS Technology
    10.7.1 Emergence of the IGBT
    10.7.2 Two- and Three-Level Rectifier/Inverter Topology
  10.8 Rotary UPS Systems
    10.8.1 UPSs Using Diesel
    10.8.2 Hybrid UPS Systems
  10.9 Redundancy, Configurations, and Topology
    10.9.1 N Configuration
    10.9.2 N+1 Configuration
    10.9.3 Isolated Redundant Configuration
    10.9.4 N+2 Configuration
    10.9.5 2N Configuration
    10.9.6 2N+1 Configuration
    10.9.7 Distributed Redundant/Catcher UPS
    10.9.8 Eco-Mode for Static UPS
    10.9.9 Availability Calculations
  10.10 Energy Storage Devices
    10.10.1 Battery
    10.10.2 Flywheel Energy
  10.11 UPS Maintenance and Testing
    10.11.1 Physical Preventive Maintenance (PM)
    10.11.2 Protection Settings, Calibration, and Guidelines
    10.11.3 Functional Load Testing
    10.11.4 Steady-State Load Test
    10.11.5 Steady-State Load Test at 0%, 50%, and 100% of Load
    10.11.6 Harmonic Analysis and Testing
    10.11.7 Filter Integrity and Testing
    10.11.8 Transient Response Load Test
    10.11.9 Module Fault Test
    10.11.10 Battery Rundown Test
  10.12 Static UPS and Maintenance
  10.13 UPS Management
  10.14 Conclusion

11 Data Center Cooling Systems
  11.1 Introduction
  11.2 Background Information
  11.3 Cooling within Datacom Rooms
  11.4 Cooling Systems
    11.4.1 Air Side
    11.4.2 Cooling-Medium Side
  11.5 Components Outside the Datacom Room
    11.5.1 Refrigeration Equipment—Chillers
    11.5.2 Heat-Rejection Equipment
    11.5.3 Energy-Recovery Equipment
    11.5.4 Heat Exchangers
  11.6 Components Inside the Datacom Rooms
    11.6.1 CRAC Units
  11.7 Conclusion
12 Data Center Cooling Efficiency: Concepts and Advanced Technologies
  12.1 Introduction
    12.1.1 Data Center Efficiency Measurement
  12.2 Heat Transfer Inside Data Centers
    12.2.1 Heat Generation
    12.2.2 Heat Return
    12.2.3 Cooling Air
  12.3 Cooling and Other Airflow Topics
    12.3.1 Leakage
    12.3.2 Mixing and Its Relationship to Efficiency
    12.3.3 Recirculation
    12.3.4 Venturi Effect
    12.3.5 Vortex Effect
    12.3.6 CRAC/CRAH Types
    12.3.7 Potential CRAC Operation Issues
    12.3.8 Sensible Versus Latent Cooling
    12.3.9 Humidity Control
    12.3.10 CRAC Fighting—Too Many CRACs
  12.4 Design Approaches for Data Center Cooling
    12.4.1 Hot Aisle/Cold Aisle
    12.4.2 Cold-Aisle Containment
    12.4.3 In-Row Cooling with Hot-Aisle Containment
    12.4.4 Overhead Supplemental Cooling
    12.4.5 Chimney or Ducted Returns
    12.4.6 Advanced Active Airflow Management for Server Cabinets
  12.5 Additional Considerations
    12.5.1 Active Air Movement
    12.5.2 Adaptive Capacity
    12.5.3 Liquid Cooling
    12.5.4 Cold Storage
  12.6 Hardware and Associated Efficiencies
    12.6.1 Server Efficiency
    12.6.2 Server Virtualization
    12.6.3 Multicore Processors
    12.6.4 Blade Servers
    12.6.5 Energy-Efficient Servers
    12.6.6 Power Managed Servers
    12.6.7 Effects of Dynamic Server Loads on Cooling
  12.7 Best Practices
  12.8 Efficiency Problem Solving
  12.9 Conclusion
  12.10 Conversions, Formulas, Guidelines

13 Raised Access Floors
  13.1 Introduction
    13.1.1 What is an Access Floor?
    13.1.2 What are Typical Applications for Access Floors?
    13.1.3 Why Use an Access Floor?
  13.2 Design Considerations
    13.2.1 Determine the Structural Performance Required
    13.2.2 Determine the Required Finished Floor Height
    13.2.3 Determine the Understructure Support Design Type Required
    13.2.4 Determine the Appropriate Floor Finish
    13.2.5 Airflow Requirements
  13.3 Safety Concerns
    13.3.1 Removal and Reinstallation of Panels
    13.3.2 Removing Panels
    13.3.3 Stringer Systems
    13.3.4 Protection of the Floor from Heavy Loads
    13.3.5 Grounding the Access Floor
    13.3.6 Fire Protection
    13.3.7 Zinc Whiskers
  13.4 Panel Cutting
    13.4.1 Safety Requirements for Cutting Panels
    13.4.2 Guidelines for Cutting Panels
    13.4.3 Cutout Locations in Panel—Supplemental Support for Cut Panels
    13.4.4 Saws and Blades for Panel Cutting
    13.4.5 Interior Cutout Procedure
    13.4.6 Round Cutout Procedure
    13.4.7 Installing Protective Trim Around Cut Edges
    13.4.8 Cutting and Installing the Trim
  13.5 Access Floor Maintenance
    13.5.1 Best Practices for Standard High-Pressure Laminate Floor Tile (HPL) and for Vinyl Conductive and Static Dissipative Tile
    13.5.2 Damp Mopping Procedure for HPL and Conductive and Static Dissipative Vinyl Tile
    13.5.3 Cleaning the Floor Cavity
  13.6 Troubleshooting
    13.6.1 Making Pedestal Height Adjustments
    13.6.2 Rocking Panel Condition
    13.6.3 Panel Lipping Condition (Panel Sitting High)
    13.6.4 Out-of-Square Stringer Grid (Twisted Grid)
    13.6.5 Tipping at Perimeter Panels
    13.6.6 Tight Floor or Loose Floor—Floor Systems Laminated with HPL Tile
  13.7 Additional Design Considerations
    13.7.1 LEED Certification
    13.7.2 Energy Efficiency—Hot and Cold Air Containment
    13.7.3 Airflow Distribution and CFD Analysis
  13.8 Conclusion
14 Fire Protection in Mission Critical Infrastructures
  14.1 Introduction
  14.2 Philosophy
  14.3 Alarm and Notification
  14.4 Early Detection
  14.5 Fire Suppression
  14.6 System Designs
    14.6.1 Stages of a Fire
    14.6.2 Fire and Building Codes
  14.7 Fire Detection
  14.8 Fire Suppression Systems
    14.8.1 Water Mist Systems
    14.8.2 Carbon Dioxide Systems
    14.8.3 Clean Agent Systems
    14.8.4 Inert Gas Agents
    14.8.5 IG-541
    14.8.6 IG-55
    14.8.7 Chemical Clean Agents
    14.8.8 Portable Fire Extinguishers
    14.8.9 Clean Agents and the Environment
    14.8.10 Conclusion

Appendix A  Policies and Regulations
  A.1 Introduction
  A.2 Industry Policies and Regulations
    A.2.1 USA PATRIOT Act
    A.2.2 Sarbanes-Oxley Act (SOX)
    A.2.3 Comprehensive Environmental Response, Compensation, and Liability Act of 1980
    A.2.4 Executive Order 13423—Strengthening Federal Environmental, Energy, and Transportation Management
    A.2.5 ISO 27000 Information Security Management Systems (ISMS)
    A.2.6 The National Strategy for the Physical Protection of Critical Infrastructure and Key Assets
    A.2.7 2009 National Infrastructure Protection Plan
    A.2.8 North American Electric Reliability Corporation (NERC) Critical Infrastructure Protection Program
    A.2.9 U.S. Securities and Exchange Commission (SEC)
    A.2.10 Sound Practices to Strengthen the Resilience of the U.S. Financial System
    A.2.11 C4I—Command, Control, Communications, Computers, and Intelligence
    A.2.12 Basel II Accord
    A.2.13 National Institute of Standards and Technology (NIST)
    A.2.14 Business Continuity Management Agencies and Regulating Organizations
    A.2.15 FFIEC—Federal Financial Institutions Examination Council
    A.2.16 National Fire Protection Association 1600 Standard on Disaster/Emergency Management and Business Continuity Programs
    A.2.17 Private Sector Preparedness Act
  A.3 Data Protection
  A.4 Encryption
    A.4.1 Protecting Critical Data through Security and Vaulting
  A.5 Business Continuity Plan (BCP)
  A.6 Conclusion
Appendix B  Consolidated List of Key Questions
Appendix C  Airflow Management: A Systems Approach
  C.1 Introduction
  C.2 Control is the Key
    C.2.1 Benefits of Control
  C.3 Obtaining Control
    C.3.1 Lower Return Air ΔT Versus Higher ΔT
  C.4 Air Management Technologies
    C.4.1 In-Row Cooling
    C.4.2 Overhead Cooling
    C.4.3 Containment Strategies
    C.4.4 Active-Air Management
    C.4.5 A Benchmark Study for Comparison
  C.5 Conclusion
Glossary

Bibliography

Index
IEEE Press Series on Power Engineering
FOREWORD
Our lives, livelihoods, and way of life are increasingly dependent on computers and data communication, and this dependence increasingly relies on critical facilities or data centers where servers, mainframes, storage devices, and communication gear are brought together. In short, we are becoming a datacentric or data center society.

We are all witnessing the extraordinary expansion of the Internet, from social media to search engines, games, content distribution, and e-commerce. The advent of cloud computing, the "everything as a service" model, will further amplify the importance of the data center as a hub for the new wired world. Consequently, there is an ever-increasing demand on our information infrastructure, especially our data centers, changing the way we design, build, use, and maintain these facilities. However, industry experts have been slow to document and communicate the vital processes, tools, and techniques needed to do this. Not only is ours a dynamic environment, but it also is complex, requiring an understanding of electrical, mechanical, fire protection, and security systems, as well as reliability concepts, operating processes, and much more.

I realized the great benefit Peter Curtis' book will bring to our mission critical community soon after I started reviewing the manuscript. I believe this is the first attempt to provide a comprehensive overview of all of the interrelated systems, components, and processes that define the data center space, and the results are remarkable.

Data center facilities are shaped by a paradox: critical infrastructure support systems and the facilities housing them are designed to last 15 years or more, whereas the IT equipment typically has a life of about three years. Thus, every few years we are faced with major IT changes that dramatically alter the computer technology and invariably impact the demand for power, heat dissipation, and the physical characteristics of the facility design and operation.

In addition, the last few years have seen a growing focus on energy efficiency and sustainability, reflecting society's effort to reduce its carbon footprint and reverse global warming. Data centers are particularly targeted by these efforts because they are such huge users of power.

It is no secret that one of the most difficult challenges facing our industry is our ability to objectively assess risk and critical facility robustness. In general, we lack the metrics needed to quantify reliability and availability—the ability to identify and align the function or business mission of each building with its performance expectation.
Other industries, particularly aircraft maintenance and nuclear power plants, have spent years developing analytical tools to assess systems resiliency, and the work has yielded substantial performance improvements. In addition, the concept of reliability is sometimes misunderstood by professionals serving the data center industry. Curtis' efforts to define and explain reliability concepts will help improve performance in the mission critical space.

Further, the process of integrating all of the interrelated components—programming space allocation, design, redundancy level planning, engineered systems quality, construction, commissioning, operation, documentation, contingency planning, personnel training, and so on—to achieve reliability objectives is clear and well reasoned. The book plainly demonstrates how and why each element must be addressed to achieve reliability goals. Although this concept appears obvious, it often is not fully understood or accepted, especially in our changing IT world, where we are facing the challenge of finding the right balance between complexity, automation, and human intervention.

The comprehensive review of essential electrical and mechanical systems populating these facilities, from uninterruptible power supplies to chillers and generators, yields great benefits, not only from a functional standpoint, but also because it provides the necessary maintenance and testing data needed for effective system operation. And, maybe most importantly, Curtis recognizes and deals with the vital human factor, "... perhaps the most poorly understood aspect of process safety and reliability management."

I am confident that time will validate the approach and ideas covered here. Meanwhile, we will all profit from this admirable effort to bring a better understanding to this complex and fast-changing environment.

PETER GROSS
November 2010
PREFACE
The evolution of our digital society has sped up our lives so significantly that a critical event now unfolds with extreme rapidity. Because of this societal transformation, our critical infrastructure is more susceptible and vulnerable to catastrophic and escalating failures than ever. Consequently, we now need to incorporate a new mindset and decision-making structure, process, and tools in order to manage unpredictable events such as power outages, natural disasters, or manmade incidents. This book facilitates bridging the gap between a critical event and the mission critical infrastructure that needs to be in place and working according to detailed specifications in order to manage the event with safety, confidence, and situational awareness.

The intent of this text is to provide the foundations of mission critical infrastructure from an engineering and operations perspective. It should be noted that this book is a work in progress and does not include every detail. It does, however, provide the foundation topics as well as more advanced subjects of critical infrastructure that are relevant to society today. Topics such as reliability; resiliency; energy security; UPS systems; standby generators; automatic and static transfer switches; power quality; data center cooling, efficiency, and airflow management; fire systems; raised floors; mission critical engineering; safety; and green practices are discussed in layman's terms.

It is imperative that anyone wishing to enter the mission critical industry have the necessary foundation to operate critical systems, with the goal of reducing operational risk, improving safety, and decreasing greenhouse gases. Not having the proper standardized training in place will lead to lost lives and cascading failures of every type due to poor operational decisions.
The purpose of this book is to transition today's facilities and IT engineers into the growing mission critical industry and properly equip them to deal with the hazards and challenges of a field that is such a linchpin for the reliability and resiliency of our ever-evolving digital society.
Accompanying this book is a complete accredited online course for professional engineers, union personnel, corporate facilities engineers, property managers, IT managers, and policymakers. To learn more about the professional development hours (PDHs) offered for online training, go to www.missioncritical.powermanage.com. If you would like to suggest new material for the next edition, please feel free to contact me at 510 Grumman Road West, Morrelly Homeland Security Center, Bethpage, NY 11714,
[email protected]. PETER M. CURTIS
Bethpage, New York May 2011
ACKNOWLEDGMENTS
Creating this book would not have been possible through the efforts of only one person. I have attended various conferences throughout my career, including 7x24 Exchange, IFMA, AFE, Data Center Dynamics, AFCOM, and BOMI, and harvested insights offered by many mission critical professionals from all walks of the industry. I am grateful for the professional relationships built at these conferences, which allowed the sharing of the knowledge, know-how, information, and experiences upon which this book is based. I am also grateful to IEEE/Wiley for taking on this project almost 12 years ago. The material that began as content for online educational classes grew into an entire manuscript passionately assembled by all who came across it.

Professionals in the mission critical field have witnessed its evolution from a fledgling 40-hour-a-week operation into the 24/7 environment that our digital society demands today. The people responsible for the growth and maintenance of the industry have amassed an invaluable cache of knowledge and experience along the way. Compiling this information into a book provides a way for those new to this industry to tap into the years of experience that have emerged since the industry's humble beginnings.

This book's intended audience includes every business that understands the consequences of downtime and seeks to improve its own business resiliency. Reviewed by senior management, technicians, vendors, manufacturers, and contractors alike, this book gives a comprehensive, 360-degree perspective on the mission critical industry as it stands today. Its importance lies in its use as a foundation toward a seamless transition to the next stages of education and training for the mission critical industry.
I am grateful to many people and organizations for their help, support, and contributions that have enabled this information to be shared with the next generation of mission critical engineers and business continuity professionals.
Chapter Contributors

• Don Beaty, P.E., DLB Associates (Chapter 11—Data Center Cooling Systems)
• Charles Berry, Power Management Concepts, LLC (Chapter 12—Data Center Cooling Efficiency, and Chapter 10—UPS Systems)
• Tom Bronack, CBCP (Appendix A—Policies and Regulations)
• Dan Catalfu, Tate Access Floors (Chapter 13—Raised Access Floors)
• Howard L. Chesneau, Fuel Quality Services (Chapter 6—Fuel Systems and Design)
• George E. Ello, Long Island Power Authority (Chapter 2—Energy Security)
• Edward English III, Fuel Quality Services (Chapter 6—Fuel Systems and Design)
• Brian K. Fabel, P.E., ORR Protection Systems (Chapter 14—Fire Protection)
• James P. Fulton, Ph.D., Suffolk Community College (Appendix C—Airflow Management)
• John Golde, P.E., Golde Engineering, PC (Chapter 3—Mission Critical Electrical System Maintenance and Safety)
• John Kammeter (Chapter 8—Static Transfer Switches)
• Walter Phelps, Degree Controls, Inc. (Chapter 12—Data Center Cooling Efficiency)
• Dean Richards, Mitsubishi Electric Power Products (Chapter 10—UPS Systems)
• Ron Ritorto, P.E., Mission Critical Fuel Systems (Chapter 6—Fuel Systems and Design)
Technical Reviewers and Editors

• Scott Alwine, Tate Access Floors, Inc. (Chapter 13—Raised Access Floors)
• Bill Campbell, Emerson Network Power (Chapter 10—UPS Systems)
• Steve Carter, Orr Corporation (Chapter 14—Fire Protection)
• Thomas Corona, Jones Lang LaSalle (Chapter 4—Mission Critical Electrical System Maintenance and Safety, and Chapter 12—Data Center Cooling Efficiency)
• John C. Day, PDI Corp. (Chapter 8—Static Transfer Switches)
• John De Angelo, Power Service Concepts, Inc. (Chapter 10—UPS Systems)
• John Diamond, DAS Associates (Chapter 5—Standby Generators, and Chapter 10—An Overview of UPS Systems)
• Doug Dittmas, East Penn Manufacturing Company (Chapter 10—UPS Systems)
• Michael Fluegeman, P.E., PlanNet Consulting (Chapter 7—Power Transfer Switch Technology, and Chapter 10—UPS Systems)
• Steve Guzzardo, Hewlett-Packard (Chapter 1—Reliability and Resiliency)
• Richard Greco, P.E., California Data Center Design Group (Chapter 3—Mission Critical Engineering)
• Ross M. Ignall, Dranetz-BMI (Chapter 9—Fundamentals of Power Quality)
• Ellen Leinfuss, Dranetz-BMI (Chapter 9—Fundamentals of Power Quality)
• Teresa Lindsey, BITS (Chapter Questions)
• Wai-Lin Litzke, Brookhaven National Labs (Appendix A—Policies and Regulations)
• Michael Mallia, AFCO Systems (Appendix C—Airflow Management)
• Joseph McPartland III, American Power Conversion (Chapter 10—UPS Systems)
• Stefan Miesbach, Siemens (Chapter 2—Energy Security)
• Kevin McCarthy, EDG2 Inc. (Chapter 10—UPS Systems)
• Mark Mills, Digital Power Group
• David P. Mulholland, PDI (Chapter 8—Static Transfer Switches)
• Gary Olsen, P.E., Cummins (Chapter 5—Standby Generators)
• Ted Pappas, Keyspan Engineering (Chapter 3—Mission Critical Engineering)
• Anthony Pinkey, Layer Zero Power Systems, Inc. (Chapter 8—Static Transfer Switches)
• Richard Rotanz, Applied Science Foundation for Homeland Security (Appendix A—Policies and Regulations)
• Dan Sabino, Power Management Concepts, LLC (Chapter 7—Power Transfer Switch Technology)
• Douglas H. Sandberg, GHI Group (Chapter 7—Power Transfer Switch Technology)
• Ron Shapiro, P.E., Cosentini Mission Critical (Chapter 4—Mission Critical Electrical System Maintenance and Safety)
• Terri Sinski, Strategic Planning Partners (Appendix A—Policies and Regulations)
• Kenneth Uhlman, P.E., Eaton/Cutler Hammer (Technical Discussions)
• Steve Vechy, Enersys (Chapter 10—UPS Systems)

I am grateful to Dr. Robert Amundsen, Director of the Energy Management Graduate Program at New York Institute of Technology, who gave me my first teaching opportunity in 1994. It allowed me to continually develop professionally, learn, and pollinate many groups with the information presented in this book.

I would like to thank two early pioneers of this industry for defining what mission critical really means to me and the industry, and I appreciate the knowledge they have imparted to me: Borio Gatto, for sharing his engineering wisdom, guidance, and advice, and Peter Gross, P.E., for his special message in the Foreword and for expanding my view of the mission critical world.

Thank you to my good friends and colleagues for their continued support, technical dialogue, feedback, and advice over the years: Thomas Weingarten, P.E.
of Power Management Concepts, LLC, and Vecas Gray, P.E., and Mark Keller, Esq., of Abramson & Keller. I would also like to thank Lois Hutchinson for assisting in organizing and editing the initial material, acting as my sounding board, and supporting me while I was assimilating my ideas, and also for keeping me on track and focused when necessary. I would like to express gratitude to all the early contributors, students, mentors, interns, and organizations that I have been working with and learning from over the years for their assistance, guidance, advice, research, and organization of the information presented in this book: John Altherr, P.E., Nada Anid, Ph.D., Elijan "Al" Avdic, Tala Awad, Anna Benson, Charles Berry, Nancy Camacho, Joseph Cappiello, Ralph Ciardulli, Charles Cottitta, Guy Davi, Kenneth Davis, Brad Dennis, David Donatov, Stephen Worn, David Dunatov, Andres Fortino, Ph.D., Ralph Günther, P.E., Kevin Heslin, Patrick Hoehn, Hartley Jean-Aimee, Al Law, John Nadon, John Mezic, John Montana, David Nowak, Jay O'Neill, Daniel Ortiz, Shawn Paul, Arnold Peterson, P.E., Victoria Pierre-Louis, Richard Realmuto, P.E., Michael Recher, Adam Redler, Edward Rosavitch, P.E., Christie Rotanz, Brad Weingast, Jack Willis, P.E., Anthony Wilson, and my special friends at 7x24 Exchange, AFE Region 4, Data Center Dynamics, Long Island Forum for Technology (www.lift.org), and Mission Critical Magazine. Special thanks to my friends and partners Bill Leone, CPA, and Eduardo Browne for sharing my initial vision and building upon it. They have been pillars of strength for me. Most of all, I would like to express gratitude to my loving and understanding wife, Elizabeth, for her continual support, encouragement, and acceptance of my creative process, and for leaving the house when I needed extreme quiet to focus and concentrate. I am also grateful to Elizabeth for her editorial assistance and feedback when my ideas were flowing 24/7. Thank you, Love! This book is dedicated to Brian K. Fabel, Al Baker, Bill Mann, Kenneth Morrelly, and John Kammeter—five inspirational men who contributed vastly to the mission critical and homeland security industries and enriched all of our lives. You will be missed. I would also like to dedicate this book to the men, women, and children who lost their lives on 9/11 and to the emergency personnel who responded to this horrific event. "We all die.
The goal isn't to live forever; the goal is to create something that will."
—Chuck Palahniuk

Lastly, my deepest apologies go to anyone I have forgotten.

P.M.C.
1 AN OVERVIEW OF RELIABILITY AND RESILIENCY IN TODAY'S MISSION CRITICAL ENVIRONMENT
1.1 INTRODUCTION
Continuous, clean, and uninterrupted power and cooling is the lifeblood of any data center, especially one that operates 24 hours a day, 7 days a week. Critical enterprise power is the power without which an organization would quickly be unable to achieve its business objectives. Today, more than ever, enterprises of all types and sizes are demanding 24-hour system availability. This means enterprises must have 24-hour power and cooling day after day, year after year. One such example is the banking and financial services industry. Business practices mandate continuous uptime for all computer and network equipment to facilitate round-the-clock trading and banking activities anywhere and everywhere in the world. Banking and financial service firms are completely intolerant of unscheduled downtime, given the guaranteed loss of business that invariably results. However, providing the best equipment is not enough to ensure 24-hour operation throughout the year. The goal is to achieve reliable 24-hour power and cooling at all times, regardless of the technological sophistication of the equipment or the demands placed upon that equipment by the end user, be it a business or municipality. Today, all industries are constantly expanding to meet the needs of the growing global digital economy. Industry as a whole has been innovative in the design and use of the latest technologies, driving its businesses to become increasingly digitized in
this highly competitive business environment. Industry is progressively more dependent on continuous operation of its data centers in reaction to the competitive realities of a global economy. To achieve optimum reliability when the supply and availability of power are becoming less certain is challenging, to say the least. The data center of the past required only the installation of stand-alone protective electrical and mechanical equipment, mainly for computer rooms. Data centers today operate on a much larger scale, 24/7. The proliferation of distributed systems using hundreds of desktop PCs and workstations connected through LANs and WANs, simultaneously running dozens of software business applications and reporting tools, makes each building a computer room. When we add up the total number of locations utilized by each bank all over the world via the Internet, we realize the necessity of a critical infrastructure and the associated benefits of uptime and reliability. The face of corporate America was severely scarred in the last decade by a number of historically significant events: the collapse of the dot-com bubble and the high-profile corporate scandals. These events have taken a significant toll on financial markets and have served to deflate the faith and confidence of investors. In response, governments and other global organizations enacted new laws, policies, and regulations or revised existing ones. In the United States, these included the Sarbanes-Oxley Act of 2002 (SOX) and the U.S. PATRIOT Act, while the Basel II framework was developed internationally. In addition to management accountability, another embedded component of SOX makes it imperative that companies not risk losing data, or even risk downtime that could jeopardize accessing information in a timely fashion. These laws can actually improve business productivity and processes. Many companies thoughtlessly fail to consider installing backup equipment or the proper redundancy based on their risk profile.
Then, when the lights go out due to a major power outage, these same companies suddenly wake up, taking a huge hit operationally and financially. During the months following the Northeast blackout of 2003, there was a marked increase in the installation of uninterruptible power supply (UPS) systems and standby generators. Small and large businesses alike learned how susceptible they are to power disturbances and the associated costs of not being prepared. Some businesses that were not typically considered mission critical learned that they could not afford to be unprotected during a power outage. The Northeast blackout of 2003 emphasized the interdependencies across the critical infrastructure and the cascading impacts that occur when one component falters. Most automated teller machines (ATMs) in the affected areas stopped working, although several had backup systems that enabled them to function for a short period. Soon after the power went out, the Comptroller of the Currency signed an order authorizing national banks to close at their discretion. Governors in a number of affected states made similar proclamations for state-chartered depository institutions. The end result was a loss of revenue and profits, and almost a loss of confidence in our financial system. More prudent planning and the proper level of investment in mission critical infrastructure for electric, water, and telecommunications utilities, coupled with proactive building infrastructure preparation and operations, could have saved the banking and financial services industry millions of dollars. At the present time, the risks associated with cascading power supply interruptions from the public electrical grid in the United States have increased due to the ever-increasing reliance on computer and related technologies. Today, there are close to one trillion devices and one billion people connected to the World Wide Web. As the number of computers and related technologies continues to multiply in this increasingly digital world, the demand for reliable, quality power increases as well. Businesses are not only competing in the marketplace to deliver whatever goods and services are produced for consumption; they must also compete to hire the best engineers, from a dwindling pool of talent, who can design the infrastructures needed to obtain and deliver reliable power and cooling. This keeps the mission critical manufacturing and technology centers up and running, with the ability to produce the very goods and services that sustain them. The idea that businesses today must compete for the best talent to obtain reliable power is not new, and neither are the consequences of failing to meet this challenge. Without reliable power, there are no goods and services for sale, no revenues, and no profits—only losses. Hiring and keeping the best-trained engineers, employing the very best analyses, making the best strategic choices, and following the best operational plans to keep ahead of the power supply curve is essential for any technologically sophisticated business to thrive and prosper. A key to success is to provide proper training and educational resources to engineers so they may increase their knowledge and keep current on the latest mission critical technologies available all over the world, which is one of the purposes of this book. In addition, all companies need to develop educational systems and certification programs for young mission critical engineers to help offset the shrinking workforce available to sustain the growing mission critical industry.
It is also essential for critical industries to constantly and systematically evaluate their mission critical systems, assess and reassess their level of risk tolerance versus the cost of downtime, and plan for future upgrades in equipment and services that are designed to meet business needs and ensure uninterrupted power and cooling supplies in the years ahead. Simply put, minimizing unplanned downtime reduces risk. Unfortunately, the most common approach is reactive, that is, spending time and resources to repair a faulty piece of equipment after it fails as opposed to identifying when the equipment is likely to fail and repairing or replacing it without interruption. If the utility goes down, install a generator. If a ground-fault trips critical loads, redesign the distribution system. If a lightning strike burns power supplies, install a new lightning protection system. Such measures certainly make sense, as they address real risks associated with the critical infrastructure; however they are always performed after the harm has occurred. Strategic planning can identify internal risks and provide a prioritized plan for reliability improvements that identify the root causes of failure before they occur. In the world of high-powered business, owners of real estate have come to learn that they, too, must meet the demands for reliable power supply to their tenants. As more and more buildings are required to deliver service guarantees, management must decide what performance is required from each facility in the building. Availability levels of 99.999% (5.25 minutes of downtime per year) allow virtually no facility downtime for maintenance or other planned or unplanned events. Moving toward high reliability is imperative. Moreover, avoiding the problems that can cause outages and unscheduled downtime never ends. Even planning and impact assessments are tasks that are never completed; they should be reviewed at least once every budget cycle.
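The downtime budget implied by an availability figure falls out of simple arithmetic, which the short Python sketch below illustrates (the fragment is an editorial illustration, not from the original text):

```python
# Convert an availability percentage into an annual downtime allowance.
# "Five nines" works out to roughly 5.26 minutes per year, the figure
# commonly rounded in industry discussions.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Annual downtime budget implied by a given availability level."""
    return MINUTES_PER_YEAR * (1.0 - availability_pct / 100.0)

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% availability -> {downtime_minutes_per_year(pct):.2f} min/yr")
```

Each added "nine" cuts the budget by an order of magnitude (roughly 525.6, 52.6, and 5.3 minutes per year, respectively), which is why five-nines facilities leave virtually no window for planned maintenance.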
The evolution of data center design and function has been driven by the need for uninterrupted power. Data centers now employ many unique designs developed specifically to achieve the goal of uninterrupted power within defined project constraints based on technological need, budget limitations, and the specific tasks each center must achieve to function usefully and efficiently. Providing continuous operation under all foreseeable risks of failure, such as power outages, equipment breakdown, internal fires, and so on, requires use of modern design and modeling techniques to enhance reliability. These include redundant systems and components, standby power generation, fuel systems, automatic transfer and static switches, pure power quality, UPS systems, cooling systems, raised access floors, and fire protection, as well as the use of probability risk analysis modeling software (each will be discussed in detail later) to predict potential future outages, develop maintenance, and upgrade action plans for all major systems. Also vital to the facilities life cycle is two-way communication between upper management and facilities management. Only when both ends fully understand the three pillars of infrastructure reliability—design, maintenance, and operation of critical environments (including the potential risk of downtime and recovery time)—can they fund and implement an effective plan. Because the costs associated with reliability enhancements are significant, sound decisions can only be made by quantifying performance benefits against downtime cost estimates for each upgrade option to determine the best course of action. Planning and careful implementation will minimize disruptions while making the business case to fund necessary capital improvements and implement comprehensive maintenance strategies. 
When the business case for additional redundancy, specialized consultants, documentation, and ongoing training reaches the boardroom, the entire organization can be galvanized to prevent catastrophic data losses, damage to capital equipment, and danger to life and limb.
1.2 RISK ASSESSMENT
Critical industries require an extraordinary degree of planning and assessment. It is important to identify the best strategies to reach the targeted level of reliability. In order to design a critical building with the appropriate level of reliability, the cost of downtime and the associated risks need to be assessed. It is important to understand that downtime occurs due to more than one type of failure: design failures, catastrophic failures, equipment failures, or failures due to human error. Each type of failure will require a different approach to prevention. A solid and realistic approach to business resiliency must be a priority, especially because the present critical infrastructure is inevitably designed with all the eggs located in one basket. Within the banking and financial services industries, planning the critical area places considerable pressure on designing an infrastructure that evolves in an effort to support continuous business growth. Routine maintenance and upgrading of equipment alone do not ensure continuous availability. The 24/7 operation of such services means an absence of scheduled interruptions for any reason, including routine maintenance, modifications, and upgrades. The main question is how and why infrastructure failures occur. Employing new methods of distributing critical power, understanding capital constraints, and developing processes that minimize human error are some key factors in improving recovery time in the event critical systems are impacted by base-building failures. Infrastructure reliability can be enhanced by conducting a formal risk management assessment (RMA) and gap analysis, and by following the guidelines of the critical area program (CAP). The RMA and the CAP are used in other industries and are customized specifically for the needs of data center environments. The RMA is an exercise that produces a system of detailed, documented processes, procedures, and checks and balances designed to minimize operator and service-provider errors. The CAP ensures that only trained and qualified people are associated with and authorized to have access to critical sites. These programs, coupled with probability risk assessment (PRA), address the hazards of data center uptime. The PRA looks at the probability of failure of each type of electrical power equipment. A PRA can be used to predict availability, number of failures per year, and annual downtime. The PRA, RMA, and CAP are facilitating agents when assessing each step listed below:

• Engineering and design
• Project management
• Testing and commissioning
• Documentation
• Education and training
• Operation and maintenance
• Employee certification
• Risk indicators related to ignoring the facility life cycle process
• Standards and benchmarking
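The PRA quantities just named (availability, failures per year, annual downtime) can be sketched in their simplest steady-state form. The MTBF/MTTR figures and the independence assumption below are illustrative placeholders, not values from the text; a real PRA would use field failure data:

```python
# Steady-state availability sketch for a PRA. All MTBF/MTTR numbers are
# hypothetical placeholders, not measured equipment data.

HOURS_PER_YEAR = 8760.0

def availability(mtbf_h: float, mttr_h: float) -> float:
    # Fraction of time the unit is up: uptime / (uptime + repair time)
    return mtbf_h / (mtbf_h + mttr_h)

def failures_per_year(mtbf_h: float) -> float:
    return HOURS_PER_YEAR / mtbf_h

def annual_downtime_hours(a: float) -> float:
    return (1.0 - a) * HOURS_PER_YEAR

def redundant(a: float, n: int) -> float:
    # n independent parallel units: the system is down only if all n are down
    return 1.0 - (1.0 - a) ** n

single = availability(mtbf_h=30000.0, mttr_h=8.0)  # one hypothetical UPS module
pair = redundant(single, 2)                        # 2N-style redundant pair

print(f"single module: {failures_per_year(30000.0):.2f} failures/yr, "
      f"{annual_downtime_hours(single):.2f} h down/yr")
print(f"redundant pair: {annual_downtime_hours(pair) * 60:.3f} min down/yr")
```

The independence assumption behind the parallel formula is optimistic; common-mode failures (shared switchgear, human error) are exactly what the RMA and CAP processes exist to catch.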
Industry regulations and policies are more stringent than ever. They are heavily influenced by Basel II, the Sarbanes-Oxley Act (SOX), NFPA 1600, and the U.S. Securities and Exchange Commission (SEC). Basel II recommends "three pillars"—risk appraisal and control, supervision of assets, and monitoring of financial markets—to bring stability to the financial system and other critical industries. Basel II implementation involves identifying operational risk, then allocating adequate capital to cover potential loss. As a response to corporate scandals in the last decade, SOX came into force in 2002 and contains the following sections:

• The financial statement published by issuers is required to be accurate (Sec. 401)
• Issuers are required to publish information in their annual reports (Sec. 404)
• Issuers are required to disclose to the public, on an urgent basis, information on material changes in their financial condition or operations (Sec. 409)
• Penalties of fines and/or imprisonment are imposed for not complying (Sec. 802)

The purpose of the NFPA 1600 Standard is to help the disaster management, emergency management, and business continuity communities cope with critical events.
Keeping up with the rapid changes in technology has been a longstanding priority. The constant dilemma of meeting the required changes within an already constrained budget can become a limiting factor in achieving optimum reliability.
1.2.1 Levels of Risk
Risk can be described as the worst possible scenario that might occur while performing a task within a facility. Risk assesses how much we know or can predict about unforeseen circumstances. As we review risk, it is essential that the facility IT team has the proper change management processes and procedures in place for planned events, so that downtime can be minimized. Reducing the frequency of these events and understanding their impact is the key to proper critical environment management. Table 1.1 shows the three typical levels of impact—high, medium, and low—that result from event occurrence.
Table 1.1. Levels of risk impact on facilities

High: It will cause an immediate interruption to the clients' critical operations, such as:
• Activity requiring a planned major utility service outage, or temporary elimination of system redundancy in the critical environment
• Activity that would disrupt critical production operations
• Activity that would likely result in an unplanned outage or disruption of operations if unsuccessful

Medium: There is time to recover without impacting the clients' critical operations, including any:
• Activity requiring a planned service outage that does not affect systems but may impact noncritical operations
• Activity that involves a significant reduction in system redundancy
• Activity that is not likely to result in an unplanned outage in the critical environment or disruption of operations if unsuccessful

Low: It will not interrupt operations and will have minimal potential of affecting the clients' critical operations, including:
• Activity involving systems directly supporting operations but whose execution will be transparent to operations
• Activity that cannot result in an unplanned outage of the critical environment or impact operations if unsuccessful

None: Activity not associated with the critical environment

1.3 CAPITAL COSTS VERSUS OPERATION COSTS

Businesses are at the mercy of the mission critical facilities sustaining them. Each year, billions of capital dollars are spent on the electrical and mechanical infrastructure that supports IT around the globe. It is important to keep in mind that downtime can cost companies millions of dollars per hour or more. An estimated 94% of all businesses that suffer a large data loss go out of business within two years, regardless of the size of the business. The daily operations of our economic system and our way of life depend on critical infrastructure being available 100% of the time, with no exceptions. Critical industries are operating continuously, 365 days a year. Because conducting daily operations necessitates the use of new technology, more and more servers are being packed into a single rack. The growing number of servers operating 24/7 increases the need for power, cooling, and airflow. When a disaster causes the facility to experience lengthy downtime, a prepared organization is able to quickly resume normal business operations by using a predetermined recovery strategy. Strategy selection involves focusing on key risk areas and selecting a strategy for each one. Also, in an effort to boost reliability and security, the potential impacts and probabilities of these risks, as well as the costs to prevent or mitigate damages and the time needed to recover, should be established. Many organizations associate disaster recovery and business continuity only with IT and communication functions and miss other critical areas that can seriously impact their business. Within these areas may be a multitude of critical systems that require maintenance, the development of procedures, and appropriate documentation. Some of these systems are listed later in Table 1.3. One major area that necessitates strategy development is the banking and financial services industry. The absence of a strategy that guarantees recovery has an impact on employees, facilities, power, customer service, billing, and customer and public relations.
All areas require a clear, well-thought-out strategy based on recovery time objectives, cost, profitability impact, and safety. The strategic decision is based on some of the following factors:

• The maximum allowable delay time prior to the initiation of the recovery process
• The time frame required to execute the recovery process once it begins
• The minimum computer configurations required to process critical applications
• The minimum communication devices and backup circuits required for critical applications
• The minimum space requirements for essential staff members and equipment
• The total cost involved in the recovery process and the total loss as a result of downtime

Developing strategies with implementation steps means no time is wasted in a recovery scenario. The focus is to implement the plan quickly and successfully, and in order to accomplish this, people must be properly trained. Is the person you hired 3 months ago up to this task? The right strategies, properly implemented, will effectively mitigate damages, minimize disruptions, reduce the cost of downtime, and remove threats to life and safety.
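One way to ground the strategic decision is to compare candidate strategies by expected annual cost: mitigation spend plus expected downtime losses. Every figure in this sketch is invented for illustration:

```python
# Hypothetical comparison of recovery/mitigation strategies by expected
# annual cost. All numbers are assumptions for illustration only.

COST_PER_DOWNTIME_HOUR = 500_000  # assumed loss rate for a critical operation

# name: (annual cost of the measure, expected downtime hours per year)
strategies = {
    "reactive (no standby power)": (0, 12.0),
    "standby generator": (150_000, 1.5),
    "generator plus 2N UPS": (400_000, 0.1),
}

def expected_annual_cost(measure_cost: float, downtime_h: float) -> float:
    return measure_cost + downtime_h * COST_PER_DOWNTIME_HOUR

for name, (cost, hours) in strategies.items():
    print(f"{name:28s} ${expected_annual_cost(cost, hours):>12,.0f}/yr")
```

Under these assumed numbers the heavier investment wins easily, which is the general point of the section: the comparison only has meaning once downtime cost, recovery times, and event probabilities have been quantified.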
1.4 CRITICAL ENVIRONMENT WORKFLOW AND CHANGE MANAGEMENT

To assure reliable operation, a critical environment workflow and change management process must be established and followed. Commensurate roles and responsibilities of the engineering, technology, and security groups must be developed, implemented, and adhered to in order to manage both planned and unplanned events and associated risks. The critical environment (CE) is defined as the physical space and the systems within a facility that are uniquely configured, sized, and dedicated to supporting specific critical business operations as defined by the user. There are many such rooms and areas within facilities in today's ever-changing environment. Some are located within the building's structure, whereas others are located outside. Regardless of where a CE may be located, these locations have an immediate impact on the client's ability to maintain business operations and continuity. Examples of some of these CE areas can be seen in Table 1.2.
Table 1.2. Critical Areas
Data centers
Operations center
Electrical switchgear rooms
Network equipment rooms (NER)
Intermediate distribution frames (IDF)
Main distribution frames (MDF)
Main equipment rooms (MER)
Telecom rooms (TR)
Switching and hub rooms
Voice telephone and data closets
Server rooms
Business continuity and technology recovery rooms
Tape silo and StorageTek rooms
Local area network (LAN) rooms
Business operations control rooms
Uninterruptible power supply (UPS) rooms
Command centers
Chiller rooms and thermal energy storage spaces
Building management, monitoring, and automation centers
Mechanical equipment rooms
Standby emergency power (SEP) generator and switchgear rooms
Critical infrastructure systems are prevalent throughout a facility. Depending on the facility size, there could be many redundant systems supporting the same critical environment. Knowing which systems could impact the clients' critical functions and operations is paramount. Some of these systems are listed in Table 1.3.
Table 1.3. Critical systems
Compressed air systems
Utility power feeder systems
Diesel engine fuel systems
Fire and life safety systems
Natural gas supply systems
Electrical distribution and grounding systems
Condenser water systems
Telephone and fiber optic communications systems
Standby emergency power (SEP) systems
Glycol systems
Environmental control systems (chillers, CRACs, etc.)
Water service systems
Building management systems (BMS)
Boilers
Uninterruptible power supply (UPS) systems

1.4.1 Change Management

Change management is a process for managing and communicating change across relevant functions and business units to ensure and deliver integration of procedures and processes. Note that during emergency situations, established emergency response and escalation procedures must be followed. When work must be done within the critical environment, ranging from simple or routine cleaning and inspection tasks to very complex and detailed preventive maintenance, corrective maintenance, or construction efforts, it is essential that an orderly and thorough approach to work planning and execution be undertaken. In every instance where work is planned in the critical environment, all departments must ensure that risk to company operations is thoroughly assessed and that appropriate risk mitigation is in place while the work is performed. The level of detail required in a method of procedure (MOP) must be correlated to the complexity of the work and the magnitude of the potential risk. Relatively complex or high-risk work must be meticulously detailed in the MOP; the detail required for less complex work would not necessarily be as extensive. The bottom line is that a properly developed, reviewed, and approved MOP will result in reduced risk to business operations. Required change request information includes:

• Who is doing the work?
• What systems will be affected?
• Which areas of the building will be affected?
• Is there redundancy for the system being disrupted?
• Are there detailed procedures for the proposed task?
• Is assistance needed from other lines of business?
• What hardware will be moved, added, or changed?
• How long will the task last?
If an outage is required:

• How long will power be out?
• Are there any critical points during the process (high risk) that can be identified?
• Will those systems be protected by UPS, generators, or other redundancy?
• What kind of backup systems are available if a problem arises?
• Will utility power be taken down?
• Are other feeds to the building affected?
• If redundancy is to be reduced, what redundancy will be lost and for how long?

1.4.2 Escalation Procedures
The purpose of the critical escalation procedures is to allow for the successful response to a critical site event. Following the escalation process assures proper notification and timely response. By assessing the event first, critical information will be available early on. It is important that a chain of command be followed because when events arise, teams need to ensure that communication and reactions are escalated in the proper fashion.
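The impact levels of Table 1.1 and the change-request questions above can be combined into a simple triage rule. The field names and decision order below are a hypothetical sketch, not a published standard:

```python
# Hypothetical triage of a change request into the Table 1.1 impact
# levels. Field names and decision order are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ChangeRequest:
    affects_critical_environment: bool  # touches any system in Table 1.3?
    requires_major_outage: bool         # planned major utility service outage?
    could_cause_unplanned_outage: bool  # plausible disruption if work fails?
    reduces_redundancy: bool            # temporary loss of N+1 / 2N margin?

def risk_level(cr: ChangeRequest) -> str:
    if not cr.affects_critical_environment:
        return "None"
    if cr.requires_major_outage or cr.could_cause_unplanned_outage:
        return "High"
    if cr.reduces_redundancy:
        return "Medium"
    return "Low"

# A UPS maintenance bypass: no outage expected, but redundancy is reduced.
print(risk_level(ChangeRequest(True, False, False, True)))  # Medium
```

Encoding the triage this way forces the change-request questions to be answered explicitly before work is scheduled, which is the point of the MOP review described above.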
1.5 TESTING AND COMMISSIONING
The definition of commissioning (Cx) has developed over the past 12 years from being nothing more than vendor start-up to the full quality-control process it is today. Some still cling to the idea that commissioning is something that starts at the end of construction or after the design is complete. In some circles, they talk about levels of commissioning, and level one starts after the design is complete, at factory acceptance testing (FAT). The use of levels of commissioning goes out the window as soon as Leadership in Energy and Environmental Design (LEED) certification is injected into the mix, as LEED requires design reviews and other requirements that start earlier in the project. The best
definition of commissioning can be found in ASHRAE's Guideline 0.* ASHRAE's Commissioning Guideline 0-2005 is a recognized model and good resource that explains commissioning as a quality-control process in detail, and it can be applied to critical facilities with some embellishment when it comes to verifying critical system performance. The quality-control process given by ASHRAE is in phases, starting with the predesign phase and continuing through the occupancy and operations phase. A summary of the phases given in the ASHRAE Guideline is presented below, with some additions for mission critical facilities; these phases should be included in all mission critical facility projects.

• Predesign phase
—Document owner's project requirements and basis of design
• Design phase
—Commissioning-focused design review
—Writing Cx specifications
• Construction phase
—Factory acceptance testing (FAT)
—Construction checklisting
—Start-up (prefunctional) testing
• Acceptance phase
—Site acceptance (functional) testing
—Acceptance testing to verify performance of critical equipment
—Integrated testing
—O&M document review
—Staff training oversight
—Develop and prove out EOPs, SOPs, and MOPs
• Occupancy and operations phase
—Continual review and updating of materials
—Continual training of O&M staff
—Reliability assurance testing (continual commissioning)

A quality-control process would never be overlooked in any valuable production project. Why would we neglect quality control for critical facility projects? In the case of critical facility projects, quality control, or commissioning, starts at the predesign phase so we can make sure the owner's project requirements (OPRs) are fully developed and track all relevant documents, such as the basis of design (BOD). We do this so that during value engineering we can evaluate any impact to the BOD and OPR, and if a proposed change does not meet the documented requirements, the team must sign off on it.
It should be clear that these documents are the foundation of any project and part of the quality-control process required for the commissioning of a critical facility.

*www.wbdg.org/pdfs/comm_def.pdf

In the design phase, we have a commissioning-focused design review that should not be confused with a peer review. In a commissioning-focused design review, the commissioning authority (CxA) should provide input on making the building and systems easier to commission and comment on equipment layout as it pertains to operational symmetry, to help prevent operational staff from making errors during crisis situations. The CxA should also verify that bid documents adequately specify building commissioning, as this will help reduce vendor change orders. In many cases, it is better to have the CxA provide commissioning specifications and have them included in the prepurchase and other bid sets. The focused review also needs to verify that there are adequate monitoring and control points specified to facilitate commissioning and O&M (trending capabilities, test ports, control points, gauges, and thermometers). The review also needs to cover the design as it pertains to the reliability and redundancy standards of the owner and of the industry, and to verify that the building O&M plan and documentation requirements specified are adequate. During the construction phase, much of the rudimentary testing is accomplished. During this time, the factory acceptance testing is being conducted, and it is important to have the CxA involved to verify that the controls and interlocks will work with the complete system. During this time, equipment is being delivered and installed. During this installation, the vendors and GC should be verifying the installation using construction checklists provided by the CxA. These checklists basically track the construction process and verify that the vendor delivered what was paid for in good condition, and that it was installed properly and has the proper clearance.
The vendor start-up will follow and, if performed in accordance with the agreed procedures, all the functions, including all the alarms, will be verified. It is important to track all these documents and have them signed off by the vendors as proof of proper start-up. In some cases, vendors will sign off the documents without performing all the requirements, and that will slow down the acceptance and integrated testing. In such cases of improper start-up, the delays can be back-charged to the responsible vendor.

In the acceptance phase, the CxA will first operate all the equipment in all configurations and verify proper start-up by the vendors. Once this is completed, the CxA should verify the performance of certain equipment without using vendor-provided test gear. This is done to keep the quality control process in the hands of the CxA and to make sure that only calibrated equipment is used and the calculations are performed without bias. The equipment listed below is recommended to be subjected to this extra acceptance test phase:

1. Emergency power systems and controls
2. Uninterruptible power supply (UPS) systems and batteries
3. Flywheel energy storage systems
4. Static transfer switches (STS) and all associated controls
If the acceptance testing is done properly, it will, at a minimum, verify that the equipment is worthy of carrying the critical load. In some cases, deficiencies found during this process have forced the vendors to meet their own specifications and improve product quality.

We are now ready for integrated testing, and the intent of this test is to verify that the building and all the systems work together to meet the client's design requirements. Some hints for a successful integrated test that proves proper operation and no unwanted system interaction are:

1. Perform a full data center heat-load test, including any enclosed cooling systems
2. Perform integrated testing at 25%, 50%, 75%, and 100% of the design load
3. Use data loggers on the data center floor to verify measured data and BMS controls
4. Check all operating modes, including maintenance configurations

Staff training and operations documents need to be provided before we can start operations. Proper training must be given to the staff for systems and integrated operations. I would suggest that the vendors provide system training and the CxA provide the overall operations documents. The operations documents should include maintenance operation procedures (MOP), emergency operation procedures (EOP), standard operation procedures (SOP), and alarm response procedures (ARP). With properly trained staff and proper operations documents, human error faults can be minimized.

As we stated earlier, the commissioning process continues into the occupancy phase to maintain operational continuity. A yearly review and update of training, as required by system upgrades or operational requirements, maintains staff and procedural quality. For mission critical facilities, a yearly verification of performance for critical electrical systems prevents loss of productivity due to system degradation; this is referred to as reliability assurance testing. These tests should be similar to those performed in the system acceptance-test procedures used during the acceptance phase and should use the original data for trending any changes in the system.
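The third hint, cross-checking independent data-logger readings against BMS-reported values, can be reduced to a simple tolerance comparison. The sketch below is illustrative only; the point names and the 1.0 °C tolerance are hypothetical assumptions, not values from the text:

```python
# Cross-check independent data-logger readings against BMS-reported values.
# Any point that deviates beyond tolerance is flagged for investigation
# before the integrated test is accepted.

TOLERANCE_C = 1.0  # hypothetical acceptance tolerance, in degrees Celsius

def flag_deviations(logger_readings, bms_readings, tolerance=TOLERANCE_C):
    """Return the points where logger and BMS disagree beyond tolerance."""
    flagged = {}
    for point, logged in logger_readings.items():
        reported = bms_readings.get(point)
        if reported is None or abs(logged - reported) > tolerance:
            flagged[point] = (logged, reported)
    return flagged

# Hypothetical data-center floor points at the 50% load step
logger = {"cold-aisle-1": 21.8, "cold-aisle-2": 22.4, "hot-aisle-1": 35.1}
bms    = {"cold-aisle-1": 22.0, "cold-aisle-2": 24.1, "hot-aisle-1": 35.0}

print(flag_deviations(logger, bms))  # → {'cold-aisle-2': (22.4, 24.1)}
```

A point missing from the BMS entirely is also flagged, since an unmonitored point defeats the purpose of the cross-check.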
The reliability assurance testing should be performed after the vendor has provided preventive maintenance (PM). The reason we perform these tests after the vendor preventive maintenance routine is that the vendor will have interacted with a commissioned system and disassembled some portions of it. In some cases, vendors provide updated software or control boards. The system then needs to be certified through reliability assurance testing to be worthy of the critical load. Remember that vendor-provided PM does not measure performance or track system degradation, so without a reliability assurance testing program the quality control process will have been compromised.

Before the facility goes online, it is crucial to resolve all potential equipment problems (technology, operations, etc.). This is the construction team's sole opportunity to integrate and commission all the systems, due to the facility's 24/7 mission critical status. At this point in the project, all systems installed have been tested at the factory and witnessed by a competent CxA familiar with the equipment, processes, and procedures. Once the equipment is delivered, set in place, and wired, it is time for the second phase of certified testing and integration. The importance of this phase is to verify and certify that all components work together and to fine-tune, calibrate, and integrate the systems. There is a tremendous amount of preparation for this phase. The facilities engineer must work with the factory, field engineers, and independent test consultants to coordinate testing and calibration. Critical circuit breakers must be tested and calibrated prior to placing any critical electrical load on them. When all the tests are completed, the facilities engineer must compile the certified test reports, which will establish a benchmark for all future testing. The last phase is to train the staff on each major piece of equipment and prepare for the transition to operations.

Many decisions regarding how and when to service a facility's mission critical electrical and mechanical equipment are going to be subjective. The objective is easy: a high level of safety and reliability from the equipment, components, and systems. But discovering the most cost-effective and practical methods required to accomplish this can be challenging. Network with colleagues, consult knowledgeable sources, and review industry and professional standards and best practices before choosing the approach best suited to your maintenance goals. Also, keep in mind that the individuals performing the testing and service should have the best training and experience available. You depend on their conscientiousness and decision-making ability to avoid potential problems with perhaps the most crucial equipment in your building.

Most importantly, learn from your experiences and those of others. Maintenance programs should be continuously improving. If a scheduled procedure has not previously identified a problem, consider adjusting the schedule accordingly. Examine your maintenance programs on a regular basis and make appropriate adjustments so that they constantly improve.
Acceptance and maintenance testing are pointless unless the test results are evaluated and compared to standards and to previous test reports that have established benchmarks. It is imperative to recognize failing equipment and to take appropriate action as soon as possible. Common practice in this industry is for technicians to perform maintenance without reviewing prior work tickets and records. This approach defeats the value of benchmarking and trending, and must be improved. Mission critical facility engineers can then keep objectives in perspective and know their options when faced with a real emergency. The importance of taking every opportunity to perform preventive maintenance thoroughly and completely, especially in mission critical facilities, cannot be stressed enough. Otherwise, the next opportunity will come at a much higher price: downtime, lost business, and lost potential clients, not to mention the safety issues that arise when technicians rush to fix a maintenance problem. So do it correctly ahead of time and avoid shortcuts, because it will be very difficult to do it again.
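The benchmarking-and-trending discipline described above amounts to a simple rule: compare each new test result against the original acceptance benchmark and flag drift beyond an allowed limit. A minimal sketch, with hypothetical parameter names and a hypothetical 5% drift limit:

```python
# Trend current test results against the original acceptance benchmarks.
# A parameter drifting more than the allowed fraction from its benchmark
# signals degradation and warrants corrective action.

DRIFT_LIMIT = 0.05  # hypothetical: flag more than 5% deviation from benchmark

def degradation_report(benchmark, current, limit=DRIFT_LIMIT):
    """Return (parameter, benchmark, current, drift %) for flagged items."""
    report = []
    for param, base in benchmark.items():
        now = current[param]
        drift = abs(now - base) / base  # assumes a nonzero benchmark value
        if drift > limit:
            report.append((param, base, now, round(drift * 100, 1)))
    return report

# Hypothetical acceptance-phase benchmarks versus this year's results
benchmark = {"ups_efficiency_pct": 94.0, "battery_runtime_min": 15.0,
             "transfer_time_ms": 4.0}
current   = {"ups_efficiency_pct": 93.1, "battery_runtime_min": 12.8,
             "transfer_time_ms": 4.1}

for param, base, now, pct in degradation_report(benchmark, current):
    print(f"{param}: {base} -> {now} ({pct}% drift)")
# prints: battery_runtime_min: 15.0 -> 12.8 (14.7% drift)
```

The battery runtime is flagged because it has drifted well beyond the limit, exactly the kind of early warning the trending program is meant to produce.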
1.6 DOCUMENTATION AND THE HUMAN FACTOR
The mission critical industry's focus on physical infrastructure enhancements descends from the early stages of the trade, when all efforts were directed solely toward design and construction techniques to enhance mission critical equipment. Twenty-five years ago, the technology supporting mission critical loads was simple. There was little sophistication in the electrical load profile; at that time the industry was in its infancy. Over time, data centers have grown from a few mainframes supporting minimal software applications to server farms that can occupy 100,000 ft² or more, with Google and Microsoft being prime examples. As more processing power is required to sustain our global economy, the electrical and mechanical systems supporting the critical load have become increasingly complex. With businesses relying on this infrastructure, more capital dollars were invested to improve the uptime of business lines.

Today, billions of dollars are invested on an enterprise level in the infrastructure that supports a business's 24/7 applications; the major investments are normally in design, equipment procurement, and project management. Few capital dollars are invested in documentation, change management, education and training, or operations and maintenance. The initial capital investment is just the tip of the iceberg (Figure 1.1).

Years ago, most organizations relied heavily on their workforce to retain much of the information regarding the mission critical systems. A large body of personnel had a similar level of expertise, and they remained with their companies for decades. Therefore, little emphasis was placed on maintaining a living document for the critical infrastructure. Tables 1.4 to 1.6 identify questions with regard to managing loss of personnel, documentation, and managing during a critical event.

The mission critical industry can no longer manage critical systems as it did 25 years ago. The requirements are very different today in that the sophisticated nature of the data center infrastructure requires the constant refreshing and updating of documentation. One way to achieve this is to include in a capital project a living document system that provides the level of granularity necessary to operate a mission critical infrastructure. This will assist in keeping the living document current each time a project is completed or a milestone is reached.
Figure 1.1 Hidden costs of operations.

Table 1.4 Managing loss of critical personnel

The issues: employee turnover, retirement, sick leave, or vacation. Was knowledge lost? Where is the existing documentation? How are new employees trained? What risks are faced during the transition?

Table 1.5 Documentation issues

The issue: traditional documentation systems are inconsistent, inaccessible, and unstructured. How is information shared? Is system data readily available? Where is the documentation? How are revisions approved and made available to all users?

Table 1.6 Managing during critical events

The threats: fires, natural disasters, blackouts, and intentional disruption. Who should be contacted? Is your critical system data defined? Where are the procedures? Will you be able to respond in time?

Accurate information is the first level of support; it provides first responders the intelligence they need to make informed decisions during critical events. It also acts as a succession plan as employees retire and new employees are hired, thus reducing risk and shortening the learning curve. Remember that more than 50% of all downtime can be traced to human error. Human error as a cause of hazard scenarios must be identified, and the factors that influence human errors must be considered. Human error is a given and will arise in all stages of the process. It is vital that the factors influencing the likelihood of errors be identified and assessed to determine whether improvements in the human factors design of a process are needed. Surprisingly, human factors are perhaps the most poorly understood aspect of process safety and reliability management.

Balancing system design and training operating staff in a cost-effective manner is essential to critical infrastructure planning. When designing a mission critical facility, the level of complexity and ease of maintainability is a major concern (Figures 1.2 and 1.3). When there is a problem, the facilities manager (FM) is under enormous pressure to isolate the faulty system while maintaining data center loads and other critical loads.
Figure 1.2 Typical screenshot of mission critical access. (Courtesy of Power Management Concepts, LLC.)
Figure 1.3 Mission critical access screenshot. (Courtesy of Power Management Concepts, LLC.)
The FM does not have the time to go through complex switching procedures during a critical event. A recipe for human error exists when systems are complex, especially if key system operators and documentation of emergency action procedures (EAP) and standard operating procedures (SOP) are not immediately available or have not been reviewed and updated periodically. A simpler electrical system design will allow quicker and easier troubleshooting during this critical time.

To further complicate the problem, equipment manufacturers and service providers are challenged to find and retain the industry's top technicians within their own companies. As 24/7 operations become more prevalent, the available talent pool will continue to diminish. This suggests that response times could increase from the current standard of four hours to a much higher and less tolerable timeframe. The need for a simplified, easily accessible, and well-documented design is only further substantiated by the growing imbalance between supply of and demand for highly qualified mission critical technicians.

When designing a mission critical facility, a budgeting and auditing plan should be established. Each year, substantial amounts of money are spent on building infrastructure, but inadequate capital is allocated to sustain that critical environment through proper documentation, education, and training.
1.7 EDUCATION AND TRAINING
Technology has been progressing faster than Moore's Law. Despite attaining high levels of technological standards in the mission critical industry, most of today's financial resources remain allocated to planning, engineering, equipment procurement, project management, and continued research and development. Unfortunately, little attention is given to the actual management of these systems. As equipment reliability increases, a larger percentage of downtime results from actions by personnel who were not properly trained or did not have access to accurate data during crisis events. The diversity among mission critical systems severely hinders people's ability to fully understand and master all necessary equipment and relevant information.

In the past, a greater percentage of people were hands-on, and it was natural for many families to make their own home and auto repairs out of necessity. In doing so, they became mechanically inclined and attained an understanding of how systems operate. This experience gave a number of today's mission critical professionals a set of skills to build upon. Today's "Nintendo generation" is gaining a slightly different set of skills through computers, software, and video games. They are gaining valuable experience with IT systems and will have a solid foundation on which to continue developing more advanced IT skills. The next step is to create a strong succession plan that teaches them how critical infrastructure operates and connects their already abundant IT knowledge to engineering. Then, existing professionals can show them how to apply that knowledge in the field. The best strategy may be to start training successors as early as possible so that, upon retirement of current staff, someone is trained with the necessary experience to take on operational responsibilities. Such training may be online (Figure 1.4). New college programs that include internships should be developed and made attractive for young
engineers. These programs need to show real career path options and align with corporate needs. It is time to invest in our future, so that the people who will be running the critical infrastructure of our country will have the skill sets needed to meet and exceed our current standards. We need to constantly evolve and improve as professionals or risk becoming extinct. If these issues are not addressed in a timely and proper manner, we jeopardize the foundation of how our everyday business is run and our e-commerce is generated. Imagine what would happen if, due to inadequate training, no one fully understood how to operate and maintain our critical infrastructure before all the experience-hardened experts retired. That being said, certified training programs should be developed by industry and instituted so that there are established standards and best practices. It is only through education and training that we can guarantee facility employees are knowledgeable about all equipment and processes.

Figure 1.4 Screenshot of an online training program. (Courtesy of Power Management Concepts, LLC, and Mission Critical Magazine.)
1.8 OPERATION AND MAINTENANCE
What can facility managers do to ensure that their critical system is as reliable as possible? The seven steps to improved reliability and maintainability are:

1. Planning and impact assessment
2. Engineering and design
3. Project management
4. Testing and commissioning
5. Documentation
6. Continuing education and training programs
7. Operations and maintenance
Hire competent professionals to advise at each step of the way. When building a data processing center in an existing building, you do not have the luxury of designing the electrical and mechanical systems from scratch. A competent engineer will design a system that makes the most of the existing building design. However, before investing precious capital, be sure you understand your business requirements for the next three to five years, as well as the reliability levels you must sustain. Use contractors who are experienced in data processing installations. Have an experienced firm inspect all systems, for example by performing tests on circuit breakers and using thermal-scan equipment to find hot spots due to high-resistance connections or faulty equipment. Finally, plan for routine shutdowns of your facility so that you can perform preventive maintenance on critical equipment.

Facility managers as well as senior management must not underestimate the cost-effectiveness of a thorough preventive maintenance program. Maintenance is not a luxury; it is a necessity. Do you want electrical outages to be scheduled or unscheduled? Better yet, can you afford to deal with the consequences of an unscheduled outage?

Integrating the ideal infrastructure is just about impossible. Therefore, seek out the best possible industry authorities to solve your problems. Competent consultants will have the knowledge, tools, testing equipment, training, and experience necessary to understand the risk tolerance of your company, as well as to recommend and implement the proper and most advanced proven designs. Whichever firms you choose, always ask for sample reports, testing procedures, and references. Your decisions will determine the system's ultimate reliability, as well as the ease of system maintenance.
Seek experienced professionals from within your own company as well as outside professionals: information systems, property and operations managers, space planners, and the best consultants in the industry for all engineering disciplines. The bottom line is to have proven organizations working on your project.
1.9 EMPLOYEE CERTIFICATION
Empowering employees to function effectively and efficiently can be achieved through a well-planned certification program. Employees have a vested interest in working with management to reduce risk. Empowering employees to take charge in times of crisis creates valuable communication allies who not only reinforce core messages internally, but also carry them into daily operations. The internal crisis communication should be conducted using established communication channels and venues in addition to those that may have been developed to manage specific crisis scenarios.
Whichever method of internal crisis communication a company chooses, the more upfront management is about what is happening, the better informed and more confident employees feel. In this way, security can be placed on an operation or task by requiring that an employee be certified to perform that action. Certification terms should be defined by industry best practices. Furthermore, the company's risk profile should include training and periodic recertification. Should these evaluations fall below standard over a period of time, the system could recommend decertification.

Technology is driving itself faster than ever. Large investments are made in new technologies to keep up to date with advancements, yet industries are still faced with operational challenges. One possible reason for this is the limited training provided to employees operating the mission critical equipment. Employee certification is crucial not only to keep up with advanced technology, but also to promote quick emergency response and situational awareness. In the last few years, technologies have been developed to solve the technical problem of linking equipment and making it interact; but without well-trained personnel, how can we confirm that an employee meets the complex requirements of the facility and ensure high levels of reliability?
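The recertification policy described above (periodic evaluation, with decertification recommended when results fall below standard over time) can be expressed as a small status check. The passing score and the two-consecutive-failures rule below are hypothetical assumptions, not industry-defined terms:

```python
# Recommend an employee's certification status from recent evaluation scores.
# Policy (hypothetical): certified if the latest evaluation passed; recommend
# decertification if two or more consecutive evaluations fell below standard.

PASSING_SCORE = 80  # hypothetical standard

def certification_status(scores, passing=PASSING_SCORE):
    """scores: evaluation results, oldest first."""
    if not scores:
        return "not certified"
    # Count consecutive failures ending with the most recent evaluation
    recent_failures = 0
    for score in reversed(scores):
        if score < passing:
            recent_failures += 1
        else:
            break
    if recent_failures == 0:
        return "certified"
    if recent_failures >= 2:
        return "recommend decertification"
    return "recertification required"

print(certification_status([92, 85, 88]))  # certified
print(certification_status([90, 76]))      # recertification required
print(certification_status([82, 74, 71]))  # recommend decertification
```

In practice such a rule would live inside the access-control system, so that a task requiring certification is simply unavailable to anyone whose status has lapsed.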
1.10 STANDARDS AND BENCHMARKING
The past decade has seen wrenching change for many organizations. As firms and institutions have looked for ways to survive and remain profitable, a simple but powerful change strategy called benchmarking has become popular. The underlying rationale for the benchmarking process is that learning by example, from best-practice cases, is the most effective means of understanding the principles and specifics of effective practices.

Recovery and redundancy together cannot provide sufficient resiliency if they can be disrupted by a single unpredictable event. A mission critical data center must be able to endure hazards of nature, such as earthquakes, tornadoes, floods, and other natural disasters, as well as human-made events. Great care should be taken to safeguard the critical functions that minimize downtime. Standards should be established with guidelines and mandatory requirements for continuity of business applications. Procedures should be developed for the systematic sharing of safety- and performance-related material, best practices, and standards.

The key is to benchmark the facility on a routine basis with the goal of identifying performance deviations from the original design specifications. Done properly, this provides an early warning mechanism that allows a potential failure to be addressed and corrected before it occurs. Once deficiencies are identified, and before any corrective action can be taken, a method of operation (MOP) must be written. The MOP will clearly stipulate step-by-step procedures and conditions, including who is to be present, the documentation required, the phasing of work, and the state in which the system is to be placed after the work is completed. The MOP will greatly minimize errors and potential system downtime by identifying the responsibilities of vendors, contractors, the owner, the testing entity, and anyone else involved. In addition, a program of ongoing
operational staff training and procedures is important for dealing with emergencies outside the regular maintenance program.

The most important aspect of benchmarking is that it is a process driven by the participants, whose goal is to improve their organization. It is a process through which participants learn about successful practices in other organizations and then draw on those cases to develop solutions most suitable for their own organizations. True process benchmarking identifies the hows and whys of performance gaps and helps organizations learn and understand how to perform to higher standards of practice. Keep in mind that you cannot improve if you do not measure and benchmark.
1.11 CONCLUSION
Industries are becoming increasingly dependent on continuous business operations every day. As a result, companies need to understand the level of reliability that they can supply to their customers and evaluate how it can be improved or maintained. The following chapters will reinforce the concept that reliability and resiliency are dependent on an array of variables, such as education and training, operation and maintenance, documentation, and testing and commissioning. It is the responsibility of employees at all levels of a hierarchy to communicate and develop best practices that will strengthen their business.
1.12 RISK ANALYSIS AND IMPROVEMENT
Below is a list of questions that you may wish to ask yourself about the needs analysis and risk assessment of the mission critical infrastructure you are supporting with regard to reliability and resiliency. Your answers to these questions should help to shed some light on areas where you can improve your operations.

1. How much does each minute, hour, or day of operational downtime cost your company if a specific facility is lost?
2. Have you determined your recovery time objectives for each of your business processes?
3. Does your financial institution conduct comprehensive business impact analyses (BIAs) and risk assessments?
4. Have you considered disruption scenarios and the likelihood of disruption affecting information services, technology, personnel, facilities, and service providers in your risk assessments?
5. Have your disruption scenarios included both internal and external sources, such as natural events (e.g., fires, floods, severe weather), technical events (e.g., communication failure, power outages, equipment and software failure), and malicious activity (e.g., network security attacks, fraud, terrorism)?
6. Does this BIA identify and prioritize business functions and state the maximum allowable downtime for critical business functions?
7. Does the BIA estimate data loss and transaction backlog that may result from critical business function downtime?
8. Have you prepared a list of "critical facilities" to include any location where a critical operation is performed, including all work area environments such as branch backroom operations facilities, headquarters, or data centers?
9. Have you classified each critical facility using a critical facility ranking/rating system such as the Tier I, II, III, and IV rating categories?
10. Has a condition assessment been performed on each critical facility?
11. Has a facility risk assessment been conducted for each of your key critical facilities?
12. Do you know the critical, essential, and discretionary loads in each critical facility?
13. Must you comply with the regulatory requirements and guidelines discussed in this chapter?
14. Are any internal corporate risk and compliance policies applicable?
15. Have you identified business continuity requirements and expectations?
16. Has a gap analysis been performed between the capabilities of each company facility and the corresponding business process recovery time objectives residing in that facility?
17. Based on the gap analysis, have you determined the infrastructure needs for your critical facilities?
18. Have you considered fault tolerance and maintainability in your facility infrastructure requirements?
19. Given your new design requirements, have you applied reliability modeling to optimize a cost-effective solution?
20. Have you planned for rapid recovery and timely resumption of critical operations following a wide-scale disruption?
21. Following the loss of accessibility of staff in at least one major operating location, how will you recover in a timely manner and resume critical operations?
22. Are you highly confident, through ongoing use or robust testing, that critical internal and external continuity arrangements are effective and compatible?
23. Have you identified clearing and settlement activities in support of critical financial markets?
24. Do you employ and maintain sufficient geographically dispersed resources to meet recovery and resumption activities?
25. Is your organization sure that there is diversity in the labor pool of the primary and backup sites, such that a wide-scale event would not simultaneously affect the labor pool of both sites?
26. Do you routinely use or test recovery and resumption arrangements?
27. Are you familiar with National Fire Protection Association (NFPA) 1600, Standard on Disaster/Emergency Management and Business Continuity Programs, which provides a standardized basis for disaster/emergency management planning and business continuity programs in private and public sectors by providing common program elements, techniques, and processes?
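Question 1 in the list above, the cost of each minute, hour, or day of downtime, can be approached with back-of-the-envelope arithmetic: prorate the revenue that flows through the facility down to the unit of time. The figures below are purely hypothetical:

```python
# Rough downtime-cost model: prorate the annual revenue at risk through a
# facility down to cost per minute, hour, and day. All figures hypothetical.

annual_revenue = 500_000_000      # hypothetical revenue through the facility, $
revenue_fraction_at_risk = 0.60   # hypothetical share halted by an outage

cost_per_year = annual_revenue * revenue_fraction_at_risk
cost_per_day = cost_per_year / 365
cost_per_hour = cost_per_day / 24
cost_per_minute = cost_per_hour / 60

print(f"per day:    ${cost_per_day:,.0f}")
print(f"per hour:   ${cost_per_hour:,.0f}")
print(f"per minute: ${cost_per_minute:,.0f}")
```

Even this crude model makes the trade-off concrete: a preventive maintenance program costing a few hundred thousand dollars a year is cheap insurance against a single multi-hour unscheduled outage.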
2 ENERGY SECURITY AND ITS EFFECT ON BUSINESS RESILIENCY
2.1 INTRODUCTION
Our nation's antiquated energy infrastructure and dependence on cheap fuel present significant risks to our security. The transportation sector alone keeps the economy moving and accounts for about two-thirds of all U.S. oil consumption. Electricity, on the other hand, plays a uniquely important role in the operations of all industries and public services. The loss of electricity for any length of time compromises data and communication networks, as digital electrical loads provide physical and operational security for all mission critical infrastructures. Without oil and electricity, our economy and security would come to a complete halt.

Of concern to those considering energy security is the dependence of the United States on foreign energy sources. Fossil fuels make up the bulk of our energy sources, as seen in Figure 2.1, leaving us vulnerable to disruptions in their supply as well as to their contribution to the release of greenhouse gases and other pollutants. Developing a more diverse energy portfolio is the solution to this issue; we must take a phased approach to introducing renewable energy sources into the power grid. Mission critical facilities may take a similar path, albeit on a smaller scale. By introducing renewable energy sources on-site, facilities may reduce their dependency on the grid and improve resiliency in the event of an outage by complementing conventional energy with alternatives.
Figure 2.1. U.S. energy sources. (Courtesy of CleanPower.org.)

The electrical grid is so important that it was addressed in a late-2002 White House briefing involving the President's Critical Infrastructure Protection Board. The briefing specifically noted that the electric power grid now stands "in the crosshairs of knowledgeable enemies who understand that all other critical-infrastructure components depend on uninterrupted electricity for their day-to-day operations." In October 2009, the federal government announced plans to inject $8 billion into grid modernization efforts nationwide. The funding will be used to upgrade the existing grid, augment manufacturing of infrastructure components, and install 18 million smart meters.

Computer hackers pose a significant threat to our information systems. There have been instances in which hackers have gained access to an electric power plant's computers and could have triggered major power interruptions. These events indicate how vulnerable and fragile our critical infrastructure really is. The electric grid is not the only area we need to be concerned with; government, military, and business networks are all at risk, and officials are not acting swiftly enough to address these vulnerabilities. Steps need to be taken to improve information security and mitigate the threat of cyber attacks.

The government is a major target for cyber attacks. From 2005 to 2007 alone, the Homeland Security Department, responsible for protecting civilian computer systems, suffered 850 cyber attacks, and such attacks are becoming more frequent, targeted, and sophisticated. A distinction should be made between the actions of discontented teenagers and those of foreign or domestic operatives with the backing of a large organization or nation, who pose a far greater threat.
Whereas the former group may launch an isolated attack, the latter group has the resources to damage large computer networks and remotely interrupt power production or delivery. Military networks need to be safeguarded as well. Research shows that cyber threats are emerging and that adversaries are searching for ways to remotely disrupt operations. Denial-of-service attacks, a major threat, are accomplished by bombarding a computer system
with automated message traffic, causing bandwidth overload, the effects of which could approach the magnitude of a weapon of mass destruction. Analyses of the vulnerabilities of other critical infrastructure sectors have reached similar conclusions: the loss of electric power quickly brings down communications and financial networks; cripples the movement of oil, gas, water, and traffic; and paralyzes emergency response services. Conversely, a disruption in the transportation of coal, oil, and gas can also bring down central power plants along with the power grid. The stark reality is that a sustained interruption of any energy delivery system would cripple our country. The national electric grid is inherently vulnerable because a small number of large central power plants are linked to millions of customers by hundreds of thousands of miles of exposed transmission and distribution lines. Nearly all high-voltage electric lines run above ground throughout the country, with only a handful of high-voltage lines serving major metropolitan areas. The national electric grid is a vast, sprawling, multitiered structure that reaches everywhere and is used by everyone. The North American electric grid and the Internet are the largest networks on the planet. When one key transmission line fails, its load is spread to other lines, which may become overloaded and also fail, causing a domino effect and cascading outages. Most accidental grid interruptions last less than two seconds, and many power-quality issues involve problems that persist for only a few cycles or milliseconds. In most areas of the country, electric outages of less than a couple of hours occur only a few times per year, with longer outages even less common.
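The cascading mechanism described above can be illustrated with a toy simulation. This is not a power-flow calculation; the line loads, capacities, and the even-redistribution rule are invented purely for illustration:

```python
# Toy illustration of cascading overload: when a line fails, the load it was
# carrying is spread evenly over the surviving lines; any survivor pushed past
# its capacity also fails, and the process repeats until no new failures occur.
# Loads and capacities are invented for illustration only.

def simulate_cascade(loads, capacities, initial_failure):
    failed = {initial_failure}
    changed = True
    while changed:
        changed = False
        alive = [i for i in range(len(loads)) if i not in failed]
        if not alive:
            break
        # Load shed by all failed lines, split evenly among survivors.
        shed = sum(loads[i] for i in failed)
        share = shed / len(alive)
        for i in alive:
            if loads[i] + share > capacities[i]:
                failed.add(i)
                changed = True
    return sorted(failed)

# Five identical lines running at 85% utilization: one fault takes down all five.
print(simulate_cascade([85] * 5, [100] * 5, initial_failure=0))
# → [0, 1, 2, 3, 4]

# The same grid at 50% utilization absorbs the fault with no cascade.
print(simulate_cascade([50] * 5, [100] * 5, initial_failure=0))
# → [0]
```

The toy model captures the point of the text: the closer the surviving lines run to their capacity limits, the more likely a single failure is to propagate.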
Unless deliberate action is involved, the risk that several high-voltage lines feeding a metropolitan area from several different points will fail simultaneously is low, and when one high-voltage transmission line does fail, resources are dispatched quickly to isolate the problem and make the appropriate repairs and any necessary improvements. Deliberate assaults, by contrast, are much more likely to disable multiple nodes on the network simultaneously. A 2002 National Academy of Sciences report drove this reality home, observing: "A coordinated attack on a selected set of key points in the [electrical] system could result in a long-term, multi-state blackout. While power might be restored in parts of the region within a matter of days or weeks, acute shortages could mandate rolling blackouts for as long as several years."* Operations that can afford to simply shut down and wait out short blackouts may not be able to take that approach in response to the mounting threat of longer outages. Future plans for implementing a "smart grid" will reduce the effects of a deliberate attack on the system. Over 90% of the top tier of the grid is typically fueled by coal, uranium, water, or gas, and the remainder by oil and nonhydro renewables such as solar photovoltaic, geothermal, biomass, and wind. The relative amount of each fuel used for electricity generation is shown in Figure 2.2. The 1% "other" category is composed of fuels such as propane, batteries, tire-derived fuels, and hydrogen. Technologies such as batteries and pumped storage can be used to store power produced off-peak and release it at peak demand. Each lower tier is typically fueled by the electric power delivered from the tier above.

*National Research Council, Making the Nation Safer: The Role of Science and Technology in Countering Terrorism, National Academy Press, 2002.

Power plants in the top tier deliver electrical power via miles of
Figure 2.2. Fuel sources for electricity generation in the United States in 2009. (Source: eia.doe.gov.)
high-voltage, long-haul transmission lines, which feed power into substations. The substations, in turn, dispatch power through miles of local distribution lines. A few large power plants can provide all the power required by a large city; many communities are served by just a handful of smaller power plants, or by fractional shares of a few larger power plants. Since 2000, production from coal-fired plants has increased slightly, but generation from natural gas has increased by over 50% and wind power has increased almost tenfold. Nuclear power also saw a slight increase through higher efficiencies at existing plants; no new nuclear power plants have come online in the United States since 1996. Many different power plants operate in tandem to maintain power flows over regions spanning thousands of miles. In principle, segments of the grid can be cut off when transformers fail or lines go down, so that failures can be isolated before they cascade and disrupt power supplies over much larger regions. The effectiveness of such failure isolation depends on the level of spending on the electric grid, which has been in decline for years as a consequence of electric industry deregulation. Where investments are made, monitoring is improved to adequately transfer loads through the distribution system, bringing the system back up more quickly after an outage. The same strategies of isolation and redundancy are used on private premises to assure the supply of power to critical loads, insulating those loads from problems that may affect the grid. Switches control the flow of power throughout the grid, from the power plant down to the ultimate load. Interties between high-voltage transmission lines in the top tiers allow even the very largest plants to supplement and back up each other. When power stops flowing through the bottom tiers of the public grid, on-premise generators are designed to start up automatically.
In defining priorities and designing new transmission and distribution, collaboration between utilities and critical power customers is becoming increasingly important,
most notably because power is essential for maintaining critical services for first responders, 911 call centers, air traffic control, wireline and wireless carriers, emergency response crews, hospitals, and data centers, among others. Critical facilities often have their own on-site backup generators, and some, such as hospitals, are required by code to have them, so that a loss of utility power can be remedied promptly. However, owners and users of these critical facilities must provide adequate maintenance for the local generators and periodically exercise them under load to assure that they operate properly and reliably when called upon. In addition, the facility must either tolerate or mitigate the initial power interruption from its onset until the backup generator successfully starts and is able to pick up the loads. From even a cursory review of the challenges faced by today's electrical grid, it is clear that a long-term solution for our production, distribution, and security needs will require a synthesis of modern digital, renewable, and power distribution technologies. The smart grid is one solution proposed to fit these needs. An initial step will be the introduction of smart meters, which will allow end users to closely monitor and control their energy usage, as well as sell back excess energy produced by on-site renewable resources and generators. Such a digital system, when integrated on a large scale, will allow utilities to produce power more efficiently and deliver it where it is needed, as well as help decrease the frequency and severity of outages. Energy security has serious repercussions for mission critical facilities: if the power is not flowing, business comes to a screeching halt. While improving energy security is a national and global imperative, facility owners and managers can also take steps of their own to ensure the continued operation and success of their businesses.
This may manifest itself in many different ways, through improving physical and cyber security, decreasing reliance on the electrical grid, improving employee training to decrease the occurrence of preventable service outages, and developing an effective disaster recovery plan.
2.2 RISKS RELATED TO INFORMATION SECURITY
The security of all of these networks is the subject of urgent, ongoing assessment. Much of the analysis has focused on physical and cyber security: protecting the physical structures themselves or the computers that are used to control them. But their greatest vulnerability is the loss of the power upon which every aspect of their control and operation ultimately depends. Although the multiple layers of the utility's critical infrastructure are highly interdependent, electric power is, more often than not, the prime mover, the key enabler of all the others. In the past, however, the energy industry has not typically focused on information security risks, and has been even less concerned about privacy. Equipment failures due to information security vulnerabilities are not usually anticipated, and except for an acknowledgement of damage caused by data theft, the exploitation of those vulnerabilities is not usually seen as a likely cause of catastrophic events. The root cause of the August 2003 Northeast blackout was attributed to "Human decisions by various organizations, corporate and industry policy deficiencies, and inadequate management."* Proper policies backed by strong information security measures are part of the solution, as are solid training programs that include refresher courses in emergency action, alarm response, and standard operating procedures. According to the Federal Energy Regulatory Commission, both domestic and foreign hackers are now devoting considerable time and capital to mapping the technology infrastructures of companies. The network exploitation required to explore and map a network must be done whether the intruder intends to steal information, bring the network down, or corrupt data. Information security experts believe that such exploitation may be the cause of a few recent major blackouts. Hackers are, in effect, digital spies with the ability to steal information or disrupt networks remotely. Officials need to be more aware of security breaches, as they are a national and global security issue. Intellectual capital and industrial secrets are at risk, and keeping the risks quiet only makes the situation worse. The private sector, which owns most of the information networks that operate power plants, dams, and other critical infrastructure, needs to do more to improve security and protect critical data. A cyber attack could disrupt critical operations and impact customers. The smart grid, being a digital system, would be vulnerable to cyber attacks. To address this hazard, recommendations have been made to build the smart grid from the ground up with security in mind. An intelligent system would be able to detect intrusions and bypass affected nodes to keep electricity flowing to consumers. This capacity to "heal" through "smart" switches installed throughout the network would create a grid that is more resilient to deliberate attacks and natural disasters. How do power outages relate to the level of reliability your company requires from an energy standpoint?
Facilities can generally be classified by tiers, with Tier I being the most basic and Tier IV the most reliable. The reason for having different tiers is due in large part to maintainability: the ability to maintain the facility without shutting it down. Tiers I and II must be shut down to perform maintenance; Tiers III and IV are deemed concurrently maintainable. Critical functions will usually require a facility in the Tier III to Tier IV range, or will rely on other strategies such as colocation. Although rare, it is possible for critical business functions to be located in a Tier II or even a Tier I facility configuration, despite the fact that both lack full backup and redundancy support. This practice is not encouraged. Figure 2.3 identifies types of electric load interruptions associated with the recent significant power outages shown in Table 2.1. In fact, the energy industry is just beginning to utilize the latest operation technology. Some organizations lack even accurate and up-to-date information to provide first responders to grid outages with the intelligence and support necessary to make informed decisions during critical events. Keeping personnel motivated, trained, and ready to respond to emergencies is a challenge, made even greater without an appropriate records retrieval program in place. Augmenting security for utilities is seeing some progress. The U.S. federal government is taking steps to enhance physical and cyber security for utilities. The Critical Infrastructure Protection Cyber Security Standards, mandated by the Federal Energy Regulatory Commission (FERC), are designed to reduce the risk to the reliability of the electric utility system and enhance security by protecting critical cyber assets (CCAs). The Cyber Security Standards require utilities to implement and document a program to identify, classify, and protect information associated with CCAs. Some facilities, control centers, and substations must undergo security assessment and augmentation when identified as critical assets. Access to these critical assets, whether in person or through cyber and electronic means, has to be authorized and will be controlled, monitored (with an immediate response to all unauthorized access attempts), and logged. Physical access will likely be controlled by the use of card reader systems. To be authorized for access, affected employees, contractors, and vendors are required to undergo an appropriate level of personnel risk assessment consisting of identity verification, a seven-year criminal record search, and a terrorist watch list search. In addition, they are also required to attend annual cyber security training and regular security awareness training. Most utilities are required to be compliant with the North American Electric Reliability Corporation (NERC) Cyber Security Standards CIP-002 through CIP-009. In order to be compliant, a number of physical security access control requirements must be met at bulk power electric substations: substations that handle large power transmission capabilities, not merely local electric distribution. The requirements are to control, monitor, and log access to the critical cyber assets contained within the control houses at these substations. There is also a noncompliance self-reporting requirement that obligates utilities to report to NERC any known violation of the CIP standards. Another growing threat to be wary of is an EMP, or electromagnetic pulse, attack.

Figure 2.3. Potential causes of load interruption or downtime.

*Bullock, J. A., et al., Introduction to Homeland Security, 2nd ed., Elsevier Butterworth-Heinemann, 2005; www.electripedia.info/reliability.asp.
Table 2.1. Recent significant power outages

Location | Cause | Effect
Denver, CO | Power rerouting for maintenance caused a system trip (7/1/2010) | 7 substations brought down; 100,000 customers without power for 15 minutes and 20,000 without power for over an hour
Chicago, IL | Severe storms (6/22/2010) | 550,000 customers lost power for approximately 4 days
Kentucky | Winter storm (12/25/2009) | 607,000 customers without power
Siberia | Long-term maintenance negligence (8/17/2009) | 75 workers killed; 2-day blackout; oil spill
Florida | Significant equipment failure (2/26/2008) | 4,400,000 left without power
New England | Lightning storms caused debris to damage power transmission lines (1/14/2008) | 20,000 people reported power loss over the span of a week-long storm
San Francisco, CA | Data center backup power generators failed (7/24/2007) | 40,000 customers directly affected; Internet users worldwide could not access affected sites
Los Angeles, CA | Massive power outage; utility worker wiring error (9/12/2005) | Traffic and public transportation problems; fear of a terrorist attack
Indonesia | Transmission line failure between Java and Bali (8/18/2005) | 100 million without power
Gulf Coast (Florida/New Orleans) | 2004/2005 hurricanes: Ivan, Charley, Frances, Katrina, etc. | Millions of customers without power, water, food, and shelter; government records lost to flooding
China | 20-million-kilowatt power shortage, equivalent to the typical demand of the entire state of New York (Summer 2005) | Multiple sporadic brownouts; government shut down the least energy-efficient consumers
Greece | Temperatures near 104°F; mismanagement of the electric grid (7/12/2004) | Over half of the country left without power
O'Hare Airport, Chicago, IL | Electrical explosion (7/12/2004) | Lost power to two terminals; flight delays over the course of a day
Logan Airport, Boston, MA | Electrical substation malfunction (7/5/2004) | Flight delays; security screening shut down for 4 hours
Italy | Power line failures, bad weather (9/29/2003) | Nationwide power outage; 57 million people affected
London | National grid failure (8/29/2003) | Over 250,000 commuters stranded
Northeast, Midwest, and Canada | Human decisions by various organizations, corporate and industry policy deficiencies, inadequate management (8/14/2003) | 50 million people affected; 61,800 MW of capacity unavailable
Brazil | Lightning strike (3/11/1999) | 75 million without power

Source: Google Alert—Major Power Outages.

An EMP can be generated via a nuclear warhead detonated above the Earth's atmosphere, or by a "regular" explosive device paired with the right electrically sourced magnetic field. Essentially, this type of attack causes a massive electric surge that can potentially exceed 10,000 volts per meter. An EMP attack would damage computers, electronics, electrical networks, and control systems. In an era that has become completely reliant on digital technology, such an assault would not only cause disorder, but would completely shatter a country's ability to operate normally. The United States has established the Electromagnetic Pulse Commission to analyze the growing EMP threat. The main areas the commission recommends for decreasing susceptibility are deterrence, defense, and protection and recovery. Alongside other international organizations, the EMP Commission is helping to develop a framework for infrastructure protection. But what organizations have been developed within the mission critical industry to educate and set standards for protecting vital infrastructures against such attacks? Today, EMP receives little attention within critical industries, and an EMP protection engineering discipline may need to be developed to begin educating facility managers about this concern. Planning rationally for infrequent but grave contingencies is inherently difficult. Organizations that have prepared properly for yesterday's risk profiles may be unprepared for tomorrow's. The risk-of-failure profiles of the past reflect relatively benign threats, including routine equipment failures, lightning strikes on power lines, and such small-scale hazards as squirrels chewing through insulators or cars colliding with utility poles. Now, in addition to concerns about weather-related outages (hurricanes and ice storms in particular), as well as recent experiences underscoring the possibility of widespread operational outages, there is also the heightened concern of deliberate attacks on the grid.
The latter changes the risk profile fundamentally: it poses the risk of outages that last a long time and extend over wide areas. The possibility that the parent company of an energy provider may link its own network to the networks that control the power grid makes the threat of attack increasingly likely. The planning challenge now shifts from issues of power quality or reliability to issues of business sustainability. Planning must now take into account outages that last not for seconds or a single hour, but for days, as a result of deliberate actions. Mission critical facilities do not have the luxury of being able to shut down or run at a reduced capacity during outages, whether they last minutes, hours, or days. Disaster
recovery plans are a necessity for mission critical facilities, involving the proper training of business continuity personnel to enact enterprise-level plans for business resiliency. In the past, a company may have had a single employee responsible for emergency procedures at all of its locations, but the reality of today's threats necessitates a larger organization to support mission critical facilities. Those in charge of data centers must be familiar with the plans for their facility and be able to work with local utilities and emergency personnel to ensure uptime continuity.
2.3 HOW RISKS ARE ADDRESSED
The need to provide continuous operation under all foreseeable risks of failure, such as power outages, equipment breakdown, natural phenomena, and terrorist attacks, requires the use of many techniques to enhance reliability and resiliency. These techniques include redundant systems and components, such as standby power generation, UPS systems, automatic transfer switches, and static transfer switches, and the use of probability-risk-analysis modeling software. Such software identifies potential weaknesses of the critical infrastructure and supports the development of maintenance programs and upgrade action plans for all major systems. Electric utility transmission and distribution system planners attempt to predict future load growth, design capital projects to construct the necessary additional capacity, and design adequate redundancy in case the main supply line ever fails. This design concept identifies preferred and alternate (contingency) supplies. It is used by most utilities and is commonly referred to as an n + 1 or n + 2 design. However, the alternate supplies are not infinite in number. Electric system supply redundancy can be constructed in a number of ways. One method is to construct a power plant in an area (a load pocket) that needs the electricity. Power plants are huge capital investments, however, and must be taken offline periodically for extended maintenance and upgrades. Another way of bringing in power as well as redundancy is extending transmission facilities from an adjacent area. This, too, can be a costly and time-consuming process, especially taking into account the permitting processes in many states. The upshot is that either strategy may ultimately be employed, depending on the circumstances. However, state governmental utility regulators, permit grantors, and even the Federal Energy Regulatory Commission (FERC) all have a voice in how generation and transmission systems are constructed and reinforced in the future.
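The value of an n + 1 or n + 2 design can be quantified with a simple probability sketch. The unit availability figure below is invented for illustration, and the model assumes independent failures, which real systems violate through common-mode events such as fuel contamination or a shared control fault:

```python
from math import comb

def availability_n_of_m(n, m, unit_availability):
    """Probability that at least n of m independent units are available
    (binomial model; each unit is up with probability unit_availability)."""
    p = unit_availability
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(n, m + 1))

# Suppose a facility needs n = 2 generator-sized units to carry its load,
# and each unit is independently available 99% of the time (assumed figure).
need = 2
p = 0.99
for spares in (0, 1, 2):  # n, n + 1, and n + 2 configurations
    a = availability_n_of_m(need, need + spares, p)
    print(f"n + {spares}: availability = {a:.6f}")
```

Under these assumed numbers, moving from n to n + 1 cuts the probability of insufficient capacity from roughly 2% to under 0.03%, which is why the marginal spare unit dominates redundancy planning even though the alternate supplies "are not infinite in number."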
Overall electric reliability also depends on how the utility transmission and distribution facilities are constructed. Whereas lightning strikes affect overhead and underground facilities alike, storms affect overhead-constructed facilities to a much greater extent than they affect underground facilities. In spite of this disadvantage, the vast majority of utility facilities are overhead, constructed on poles. Underground construction, though more insulated from mechanical storm damage, is more expensive to build, needs more redundancy designed into the system, and is more labor intensive to troubleshoot and repair. Damage to overhead poles and wires is immediately visible and quickly repaired, and end-of-life replacement is also much less costly. Pure economics clearly favors overhead construction. As a result, owners of critical facilities will be predominantly supplied by overhead utilities that are subject to more, but shorter, interruptions and are easier to repair. Even with the implementation of increased utility transmission and distribution system monitoring and data communications, effectively building toward the smart grid, outages may become less frequent and be corrected faster, but they will still occur. Factors such as overhead versus underground construction, alternate feeder supplies, and susceptibility to weather and other mechanical damage all play into a utility's overall reliability record. Electric utility reliability metrics employ several measurements, but the most commonly used are:

• The System Average Interruption Frequency Index (SAIFI). This measures the average number of sustained interruptions experienced per customer per year. For example, a SAIFI of 0.9 indicates that the utility's average customer experiences a sustained electric interruption about every 12/0.9 ≈ 13 months.
• The Customer Average Interruption Duration Index (CAIDI). This is the average number of outage minutes experienced by each customer that experiences a sustained interruption. For example, if a utility has a CAIDI of 120 minutes, then on average a power outage will be restored within 120 minutes. However, since utilities generally follow a prioritized restoration practice whereby outages affecting many customers are addressed before smaller and single-customer outages, the larger outages may be restored in 10 minutes (via automated switching), whereas smaller outages may take as long as 300 minutes to restore. These long and short outages together produce the overall CAIDI average of 120 minutes.
• The Momentary Average Interruption Frequency Index (MAIFI). This measures the average number of momentary interruptions experienced by utility customers. Depending upon state regulations, momentary interruptions are defined as any interruption lasting less than 2 to 5 minutes.
If the criterion is less than 5 minutes, an interruption of 4 minutes and 59 seconds will not count toward the utility's SAIFI metric, but is instead considered momentary. It should be noted that in the mission critical industry, an outage of even eight milliseconds can be catastrophic if the facility is not properly protected by a UPS or if the critical systems do not operate according to their design specifications. Each metric measures a different statistic, and any organization can consult its local utility for the actual historical reliability metrics as they pertain to the facility's specific feeder supply. For a forward-thinking organization, these utility reliability indices will drive the level of redundancy and business resiliency built into the critical infrastructure. The organization should, with the local utility's input, assess the utility's SAIFI, CAIDI, and MAIFI for both the utility's service territory and the local distribution circuit supplying power to that business. Once these historical reliability metrics are known, the organization can plan for the likeliest and most feasible outage scenarios (many sustained interruptions but few momentary outages, long utility repair times, etc.). As mentioned previously, to
address human risk factors, SOPs, EAPs, and ARPs (standard operating procedures, emergency action plans, and alarm response procedures) need to be available at a moment's notice so trained personnel can respond with situational awareness and confidence. Many companies use Web-based information management systems to address human risk factors. A living Web-based document system can produce a database of perpetually refreshed knowledge, providing the level of granularity necessary to operate, maintain, and repair mission critical infrastructure. Keeping the ever-changing documents current and secure can then be easily addressed each time a capital project is completed or an infrastructure change is made. One such program is M.C. Access, a Web-based document portal shown in Figure 2.4. It is important to secure this critical infrastructure knowledge and also to leverage this asset for employee training and succession planning. Events such as the terrorist attacks of September 11th, the Northeast blackout of 2003, the 2006 hurricane season, and the outages in Italy and Greece in 2003 and 2004, respectively, which left many millions without power, have emphasized our interdependencies with other critical infrastructures, most notably telecommunications. There are numerous strategies and sector-specific plans, such as Basel II, the U.S. PATRIOT Act, SOX, and NFPA 1600, all of which highlight the responsibility of the private sector for increasing resiliency and redundancy in business processes and systems. These events have also prompted the revision of laws, regulations, and policies governing the reliability and resiliency of the power industry. Some of these measures also
Figure 2.4. Mission critical access—a Web portal database program. (Courtesy of Power Management Concepts, LLC.)
delineate controls required of some critical infrastructure sectors to maintain business-critical operations during a critical event (see Appendix A for further information). The unintended consequence of identifying vulnerabilities is that such diligence can actually invite attacks tailored to take advantage of them. To avoid this, one must anticipate the vulnerabilities created by responses to the existing ones. New and better technologies for energy supply and efficient end use will clearly be required if the daunting challenges of the decades ahead are to be adequately addressed. In 2000, the Electric Power Research Institute (EPRI) launched a consortium dedicated to improving electric power reliability for the new digital economy. Participants in this endeavor, known as the Consortium for Electric Infrastructure to Support a Digital Society (CEIDS), include power providers and a broad spectrum of electric reliability stakeholders. Participation in CEIDS is also open to digital equipment manufacturers, companies whose productivity depends on a highly reliable electricity supply, and industry trade associations. According to EPRI, CEIDS (now known as IntelliGrid) represents the second phase of a bold, two-phase national effort to improve overall power system reliability. The first phase of the plan, called the Power Delivery Reliability Initiative and launched in early 2000, brought together more than twenty North American electric utilities as well as several trade associations to make immediate and clearly necessary improvements to utility transmission and distribution systems. In the second phase, CEIDS addresses more specifically the growing demand for "digital quality" electricity. "Unless the needs of diverse market segments are met through a combination of power delivery and end-use technologies, U.S.
productivity growth and prosperity will increasingly be constrained," explains Karl Stahlkopf, a former Vice President of Power Delivery at EPRI. "It's important that CEIDS study the impact of reliability on a wide spectrum of industries and determine the level of reliability each requires." Specifically, CEIDS focuses on three reliability goals:

1. Preparing high-voltage transmission networks for the increased capacity and enhanced reliability needed to support a stable wholesale power market.
2. Determining how distribution systems can best integrate low-cost power from the transmission system with an increasing number of distributed generation and storage options.
3. Analyzing ways to provide digital equipment, such as computers and network interfaces, with an appropriate level of built-in protection.

It is only through these wide-reaching efforts to involve all industry constituencies that the industry can raise the bar with respect to protective measures and knowledge sharing.
2.4 USE OF DISTRIBUTED GENERATION
The way electricity was produced before the advent of our modern electric grid has received renewed interest over the past couple of decades. Unlike the large centralized
power plants located at the top tier of our electric grid, much smaller distributed generation (DG) systems are now being deployed at the bottom tier, typically installed on-site by the end electric user. This trend is spurred on by growing concern over our aging electric infrastructure, widespread outages like those that occurred in 2003, customer desire for an alternative to their grid electric supply, and environmental impact. Some new technologies, such as fuel cells and microturbines, that have much lower emissions than fossil-fuel central power plants are being used on-site for a wide variety of applications. Other "green" renewable technologies, including solar and wind, are also becoming more widely applied as distributed resources (DR). They are helped financially by an assortment of federal, state, and utility incentives that provide various grants, rebates, or tax benefits to help justify the installation. Some generalizations that favor DG include:

1. Since most DG is of relatively small scale, it is more modular and can be sized to match a facility's base load, sized to just supplement the grid supply, or, in cogeneration applications, sized to just meet the thermal requirements.

2. In general, DG is a cleaner alternative to grid electricity. Solar and wind power produce no emissions and completely offset the emissions and greenhouse gas production of central power plants.

3. Even with fossil-fuel-based DG, if waste-heat recovery offsets the use of some separate fuel combustion for thermal needs, a net reduction in greenhouse gas emissions can be realized.

4. Many DG systems are designed to run grid-parallel, and only operate when the grid supply is available, such as systems that employ induction generators or line-commutated inverters. DG systems that use synchronous generators or self-commutated inverters can also be designed to run grid-independent to provide standby power.

5. Of the various DG benefits, some may be less tangible than others, but, in general, most DG is installed with the objective of reducing the owner's energy costs while also providing an acceptable rate of return on the capital investment. For cogeneration systems, it is the spark spread—the difference in cost between the DG fuel and the cost of grid electricity—along with the amount of waste heat that can be recovered that determines energy savings. For renewable-energy systems, renewable-energy credits and/or carbon credits are another revenue stream that can improve the economics.

6. To the extent that distributed generation uses renewable energy (e.g., solar or wind), or uses fossil fuels more efficiently (i.e., cogeneration units in which power is produced and waste heat is also recaptured and used), overall greenhouse gas production is reduced.

These are some typical barriers that discourage DG:

1. Grid interconnect requirements vary from state to state and utility to utility. The time and costs associated with the approval process can be a deterrent for would-be self-generators.
2. The economics of combustion-based DG are very dependent on spark spread. The volatility of energy prices today makes it difficult to project long-term cost of ownership. For renewable technologies like wind and solar, which typically have a high initial cost, the availability of incentives like rebates and tax credits greatly impacts system payback. The high rate-of-return requirements of many businesses typically pose a challenge for prospective DG projects.

3. Distributed generation must disconnect from the utility whenever there is a momentary or sustained interruption. This requirement is intended to prevent distributed generation units that are outside of the utility's direct control from backfeeding into the utility distribution system and jeopardizing the safety of linemen who may be working to restore the grid supply. With induction generators and line-commutated inverters, this causes the distributed generation units to drop offline and not supply backup power to their host building. Synchronous generators and self-commutated inverters that can operate in island mode must employ an intertie breaker to disconnect the host site from the grid while the DG continues to provide backup power. In this case, synchronizing equipment is required so that the DG can be reparalleled when the grid supply returns. The cost of these controls and relay protection can impact DG economics, especially for smaller DG systems, since the cost of these controls is not directly proportional to DG system size. In addition, most interconnection requirements mandate that after a relay protection trip, the DG must wait at least 5 minutes before reconnecting to the grid. Since grid disturbances that trip protective relays are fairly common, some percentage of an owner's demand savings is lost each time the DG system must disconnect from the grid.

4. Utility distribution feeders have a limited capacity to accept distributed generation. For example, in New York State, if the DG capacity exceeds 50 kW on a single-phase branch of a radial distribution circuit, or 150 kW on a single distribution feeder, the interconnect applicant may have to pay for a "coordinated electric system interconnection review."

5. The fundamental issue is that the electric utility is both the supplier of last resort and is responsible for the quality of power (such as it is today) delivered to every metered customer. This means that utilities will endeavor to live up to these responsibilities and design a system robust enough to minimize liabilities due to poor power quality. Therefore, the technical requirements placed upon distributed generation designs will be very stringent, in an effort to make such generation units utility grade in quality.

The above factors tend to both drive and discourage renewable energy and distributed generation economies at the same time. A rational plan must be developed that fosters a diverse energy supply, advances easy-to-permit distributed resources for a redundant supply, and cultivates distributed generation into a dependable source of capacity. Given the technical and institutional realities, there is still much room within which any critical facility owner can design a power supply system that includes at least these
three key components: reasonably reliable utility electric service, a reliable and well-maintained backup power generation system, and a distributed generation unit using high-efficiency or renewable energy sources to drive down energy costs while reducing air pollution and greenhouse gas emissions. In certain installations, the distributed generation unit can be engineered to work in concert with the backup generation system.

Fuel cell technology has been used in mission critical environments since NASA's Gemini space missions of the mid-1960s, which used hydrogen and oxygen to produce power. Modern fuel cells, which convert natural gas to electricity without combustion, are deployed as combined heat and power (CHP) systems to provide reliable, high-quality power and recoverable waste heat, while reducing the carbon footprint of a facility. To make the best use of a fuel cell installation, it should be sized at or below the base load of the data center or other critical facility that it will serve. In areas with reliable grid service, the financial viability of fuel cells depends on low natural gas prices, but fuel cells may also be deployed as the primary power source in areas with poor grid reliability due to a lack of utility investment or extreme weather conditions.

There are also many examples of data centers incorporating photovoltaic (PV) technology in their power systems. Thanks to federal and state subsidies and incentives, PV technology is often a cost-effective way to use green technology and lower costs, with a payback as short as a few years. One proposed larger-scale solution is the use of virtual power plants, whereby multiple distributed generation resources are linked together via the Internet so they can be managed as a single entity. This model allows a mix of resources to work together to negate some of the disadvantages and power quality issues traditionally associated with small energy sources.
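A minimal sketch of the spark-spread arithmetic described in this section, in Python. The grid price, gas price, and heat rate below are assumed illustrative values, not figures from this chapter:

```python
def spark_spread_per_kwh(grid_price_kwh: float,
                         gas_price_mmbtu: float,
                         heat_rate_btu_per_kwh: float) -> float:
    """Spark spread: what a kWh costs from the grid minus what the
    fuel to generate that kWh on-site costs (USD/kWh)."""
    fuel_cost_kwh = gas_price_mmbtu * heat_rate_btu_per_kwh / 1_000_000
    return grid_price_kwh - fuel_cost_kwh

# Assumed values: grid power at $0.15/kWh, natural gas at $8/MMBtu,
# and an on-site generator heat rate of 12,000 Btu/kWh.
spread = spark_spread_per_kwh(0.15, 8.0, 12_000)
print(f"${spread:.3f}/kWh")  # $0.054/kWh
```

A positive spread means on-site generation beats the grid on fuel cost alone; recovered waste heat, as the text notes, adds a further credit by displacing boiler fuel.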
2.5 DOCUMENTATION AND ITS RELATION TO INFORMATION SECURITY

In recent years, critical infrastructure drawings have been found on unsecured laptop computers, in garbage pails, and blowing around the streets of major cities. These security leaks provide an opportunity for cyber threats to occur, and make our national infrastructure vulnerable to people who want to disrupt the electrical grid or specific critical buildings vital to our national and economic security. Examples of these security leaks include a major banking and finance company's laptop computer that was found in India with critical infrastructure drawings on it, transportation drawings found in a trash can outside a major transportation hub, and, most recently, the New York City Freedom Tower drawings found in the trash. These situations can compromise corporate and national safety and security if the documents fall into the wrong hands. Business officials traveling abroad are also a major target for information theft. Spyware installed on electronic devices and laptops can open communications with outside networks, exposing information stored on them. In the environment we live in today, we need a steadfast plan to secure invaluable information such as critical drawings, procedures, and business processes. The following items should be considered when you are evaluating your internal security.

Security Questions:

1. Have you addressed physical security concerns?
2. Have all infrastructures been evaluated for the type of security protection needed (e.g., card control, camera recording, key control)?
3. If remote dial-in or Internet access is provided to any infrastructure system, have you safeguarded against hacking, or do you permit read-only functionality?
4. How frequently do you review and update access-permission authorization lists?
5. Are critical locations included in security inspection rounds?

Network and Access:

1. Do you have a secured network between your facility's IT installations?
2. Do you have an individual on your IT staff responsible for managing the security infrastructure of your data?
3. Do you have an online file repository? If so, how is use of the repository monitored, logged, and audited?
4. How is data retrieved from the repository and then kept secure once it leaves the repository?
5. Is your file repository available through the public Internet?

Techniques for Addressing Information Security:

1. Enforce strong password management for properly identifying and authenticating users.
2. Authorize user access to only permit access needed to perform job functions.
3. Encrypt sensitive data.
4. Effectively monitor changes on mainframe computers.
5. Physically identify and protect computer resources.

Enhancements that Can Improve Security and Reliability:

1. Periodic assessments of the risk and magnitude of harm that could result from the unauthorized access, use, disclosure, disruption, modification, or destruction of information and information systems.
2. Policies and procedures that:
   • Are based on risk assessments
   • Cost-effectively reduce risks
   • Ensure that information security is addressed throughout the life cycle of each system
   • Ensure compliance with applicable requirements
3. Plans for providing adequate information security for networks, facilities, and systems.
4. Security awareness training to inform personnel of information security risks and of their responsibilities in complying with agency policies, procedures, and practices.
5. A process for planning, implementing, evaluating, and documenting remedial action to address deficiencies in information security policies, procedures, or practices.
6. Plans and procedures to ensure continuity of operations for information systems.

Recommendations for Executive Action:

1. Update policies and procedures for configuring mainframe operations to ensure that they provide the necessary detail for controlling and documenting changes.
2. Identify individuals with significant security responsibilities and ensure that they receive specialized training.
3. Expand the scope of testing and evaluating controls to ensure more comprehensive testing.
4. Enhance contractor oversight to better ensure that contractors' noncompliance with information security policies is detected.
5. Update remedial action plans to ensure that they include what, if any, resources are required to implement corrective actions.
6. Identify and prioritize critical business processes as part of contingency planning.
7. Test contingency plans at least annually.
2.6 SMART GRID
The smart grid (Figure 2.5) is the convergence of electric distribution systems and modern digital technology. Whereas our current electric grid was designed for the one-way flow of energy, the smart grid will allow bidirectional energy flow and two-way digital communication over the same distribution system. Such communication potential would allow utilities to overhaul their pricing plans and provide time-of-day metering, charging more for electricity consumed during peak hours. This would encourage consumers to program their appliances to operate during off-peak periods that have lower electric rates, thereby saving money. While the consumer is using lower-priced electricity, the utility is encouraging load shifting into periods of lower electric demand, thereby reducing peak-period electric usage. This will defer major utility capital improvements such as distribution system upgrades and the construction of additional generation capacity. Smart meters should also provide greater grid accessibility for distributed generation equipment through the net metering capabilities built into these meters. And instead of relying entirely on costly peaking plants to handle peak loads, the smart grid will facilitate the participation of customer-owned load-shedding equipment and on-site generation in demand-response programs.

Figure 2.5. The smart grid network and its features.

Security is a key component of the development of the smart grid. In March 2007, the DHS Aurora Generator Test demonstrated the dangers of a hacker attack on the grid by remotely causing a diesel-powered generator to overheat and fail. Such a scenario is becoming increasingly likely as utilities and system operators become more dependent on digital technology and the Internet to control their assets. With new solutions come new challenges; for example, should data from smart meters be leaked, it could give would-be criminals an indication that a home is unoccupied, making utility customers more susceptible to break-ins. The smart grid must be designed with inherent security, robust enough to prevent unauthorized access, or at least capable of providing early warning of tampering attempts so that damage can be minimized. In September 2009, the National Institute of Standards and Technology (NIST) issued a road map for developing smart grid deployment standards, with security being a top priority. It remains to be seen what action the federal government will take with regard to this issue, but many expect military-grade security to be a necessity for critical portions of the smart grid.

In summary, the generalized conception of the smart grid includes the installation of communication links, high-voltage switches, and smart electric meters that would enable:

• Automatic switches that detect system faults and open to isolate just the faulted areas, keeping the major portion of the grid intact
• Real-time load-flow information to identify system load pockets for local generation dispatch
• Creation of pricing mechanisms and rate structures based upon actual power supply costs, updated by the hour or even by the minute, to encourage customers to shift power use to system off-peak times
• Dispatch on or off of customer-owned generation units to better match system needs
• IP-addressable electric meters for customers, allowing two-way communication between the electric meter and the utility. This will enable the interchange of unique pricing, voltage, and power-quality information on a meter-by-meter basis.
• Remote turn-on or turn-off of electric service by the utility, assuming that smart meters have this capability. Turning off a meter for nonpayment, or if an electrical fault is sensed within the building electric service, would be possible.
• Smart-meter alerts to customers, or real-time electric load information that will enable building load-management systems to automatically schedule noncritical equipment operations, such as shutting down or delaying the start of air conditioning compressors, resulting in more effective peak shaving.

There are major cost hurdles to be overcome in the design, planning, and implementation of any smart grid. To the extent that the federal government has endorsed its development and provided some funding, it is starting to take shape. Exactly what this will mean for mission critical facilities remains to be seen, but it is likely that it will at a minimum allow greater flexibility in installing on-site power-generating devices and facilitate the shifting of discretionary loads from peak periods to off-peak periods that have lower electric rates.
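The load-shifting incentive described above reduces to simple arithmetic under a two-tier time-of-use tariff. The rates and the amount of shiftable load below are assumptions for illustration; actual tariffs vary by utility:

```python
def annual_shift_savings(kwh_shifted_per_day: float,
                         peak_rate: float,
                         offpeak_rate: float,
                         days: int = 365) -> float:
    """Savings from moving daily kWh out of peak-rate hours into
    off-peak hours (USD/year)."""
    return kwh_shifted_per_day * (peak_rate - offpeak_rate) * days

# Assumed tariff: $0.24/kWh on peak, $0.09/kWh off peak; a building
# that can delay 200 kWh/day of discretionary load (e.g., precooling
# or delayed compressor starts):
print(f"${annual_shift_savings(200, 0.24, 0.09):,.0f}/year")  # $10,950/year
```

Note that this captures only the energy charge; avoided demand charges during peak periods often add substantially to the savings.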
2.7 CONCLUSION
It is important to address the physical and cyber security needs of critical infrastructures including systems, facilities, and assets. Security requirements may include capabilities to prevent and protect against both physical and digital intrusion, hazards, threats, and incidents, and to expeditiously recover and reconstitute critical services. Personnel should be trained and made aware of security risks and consequences, ensuring that sensitive information will not be leaked and lead to a security breach. A sensible and cost-effective security approach can provide a protection level achieved through design, construction, and operation that mitigates adverse impact to systems, facilities, and assets. This can include vulnerability and risk assessment methodologies that identify prevention, protection, monitoring, detection, and sensor systems to be deployed in the design. Also, less frequent but greater consequence risks must be taken into consideration. No longer are accidents the only risks to be expected; deliberate attacks must be anticipated as well. The increased use of advanced information technology, coupled with the prevalence of hacking and unauthorized access to electronic networks, requires physical security to be complemented by cyber security considerations. Hacking techniques are
becoming more sophisticated, and before enabling remote access for monitoring and/or control of critical infrastructure systems, cyber security protection must be assured. Major damage can be done remotely, and the greatest effort should be made to prevent illicit access to critical networks. For mission critical facilities, this cements the need for the design of backup power systems, including UPS, generators, transfer switches, and so on, to be on par with the challenges facing us as today's digital society evolves and necessitates the digitalization of the power grid. The importance of ongoing and technical maintenance programs coupled with strong training and education should also be stressed. Without these elements, no business recovery plan will be successful.
2.8 RISK ANALYSIS AND IMPROVEMENT
Below is a list of questions about power utilities that you may wish to ask yourself about the mission critical infrastructure you are supporting. Your answers to these questions should help to shed some light on areas where you can improve your operations.

1. Do you have a working and ongoing relationship with your electric power utility?
2. Do you know who in your organization currently has a relationship with your electric power utility, such as facilities management or accounts payable?
3. Do you understand your electric power utility's electric service priority (ESP) protocols?
4. Do you understand your electric power utility's restoration plan?
5. Are you involved with your electric power utility's crisis management/disaster recovery tests?
6. Have you identified regulatory guidelines or business continuity requirements that necessitate planning with your electric power utility?
7. What is the relationship between the regional source power grid and the local distribution systems?
8. What are the redundancies and the related recovery capacity for both the source grid and local distribution networks?
9. What is the process of restoration for source grid outages?
10. What is the process of restoration for local network distribution outages?
11. How many network areas are there in your city?
12. What are the interrelationships between each network segment and the source feeds?
13. Does your infrastructure meet basic standard contingency requirements for route grid design?
14. What are the recovery-time objectives for restoring impacted operations in any given area?
15. What are the recovery-time objectives for restoring impacted operations in any given network?
16. What are the restoration priorities to customers, both business and residential?
17. What are the criteria for rating in terms of service restoration?
18. Where does your industry rank in the priority-restoration scheme?
19. How do you currently inform clients of a service interruption and the estimated time for restoration?
20. What are the types of service disruptions, planned or unplanned, that your location could possibly experience?
21. Could you provide a list of outages, types of outages, and length of disruptions that have affected your location during the last 12 months?
22. What are the reliability indices and who uses them?
23. During an outage, would you be willing to pass along information regarding the scope of interruptions to a central industry source, for example, an industry business-continuity command center?
24. Are the local and regional power utilities cooperating in terms of providing emergency service? If so, in what way? If not, what are the concerns surrounding the lack of cooperation?
25. Would you be willing to provide schematics to select individuals and/or organizations on a nondisclosure basis?
26. Could you share your lessons learned from the events of 9/11 and the Northeast outage of 8/14/03?
27. Are you familiar with the Critical Infrastructure Assurance Guidelines for Municipal Governments document written by the Washington Military Department Emergency Management Division? If so, would you describe where your city stands in regard to the guidelines set forth in that document?
28. Independent of the utility's capability to restore power to its customers, can you summarize your internal business continuity plans, including preparedness for natural and manmade disasters (including but not limited to weather-related events, pandemics, and terrorism)?
3 MISSION CRITICAL ENGINEERING WITH AN OVERVIEW OF GREEN TECHNOLOGIES
3.1 INTRODUCTION
Businesses that are motivated to plug into the information age require reliability and flexibility, regardless of whether they are large Fortune 1000 corporations or small companies serving global customers. This is the reality of conducting business today. Whatever type of business they are in, many organizations have realized that a 24/7 operation is imperative. An hour of downtime can wreak havoc on project schedules, resulting in lost hours rekeying electronic data, not to mention the potential for losing millions of dollars.

Twenty-five years ago, the facilities manager (FM) was responsible for the integrity of the building. As long as the electrical equipment worked 95% of the time, the FM was doing a good job. When there was a problem with downtime, it was usually a computer fault. As technology improved on both the hardware and software fronts, producers of information technology began to design their hardware and software systems with redundancy. As a result of IT's efforts, computer systems have become so reliable that they are only down during scheduled upgrades. Today, the major reasons for downtime are human error, utility failures, poor power quality, power distribution failures, and environmental system failures (although the percentage remains small). When a problem does occur, the facilities manager is usually the one in the hot seat. Problems are not limited to power quality; they also arise when staff have not been properly trained for certain situations. Further complicating matters, recruiting qualified inside staff and outside consultants can be difficult, as facilities management, protection equipment manufacturers, and consulting firms are all competing for the same talent pool.

Minimizing unplanned downtime reduces risk but, unfortunately, the most common approach is reactive, that is, spending time and resources to repair a faulty piece of equipment after it has failed. However, strategic planning can identify internal risks and provide a prioritized plan for reliability improvements. Also, only when everyone fully understands the potential risk of outages, including recovery time, can they fund and implement an effective plan. Because the costs associated with reliability enhancement are significant, sound decisions can only be made by quantifying the performance benefits and weighing the options against their respective risks. Planning and careful implementation will minimize disruptions while making the business case to fund capital improvements and maintenance strategies. When the business case for additional redundancies, consultants, and ongoing training reaches the boardroom, the entire organization can be galvanized to prevent catastrophic data losses, damage to capital equipment, and even danger to life safety.

Maintaining Mission Critical Systems in a 24/7 Environment, Second Edition. Peter M. Curtis. © 2011 the Institute of Electrical and Electronics Engineers, Inc. Published 2011 by John Wiley & Sons, Inc.
3.2 COMPANIES' EXPECTATIONS: RISK TOLERANCE AND RELIABILITY

In order to design a building with the appropriate level of reliability, a company must first assess the cost of downtime and determine its associated risk tolerance. Because recovery time is now a significant component of downtime, downtime can no longer be equated to simple power availability, measured in terms of one nine (90%) or six nines (99.9999%). Today, recovery times are typically many times longer than outages, since operations have become much more complex. Is a 32-second outage really only 32 seconds? Is it perhaps 2 hours or 2 days? The real question is: how long does it take to fully recover from the 32-second outage and return to normal operational status? Although measuring in terms of nines has its limitations, it remains a useful measurement. Table 3.1 shows the law of nines for a 24/7 facility.

In new 24/7 facilities, it is imperative not only to design and integrate the most reliable systems, but also to keep them simple. When there is a problem, the facilities manager is under enormous pressure to isolate the faulty system without disrupting any critical electrical loads, and does not have the luxury of time for complex switching procedures during a critical event. An overly complex system can be a quick recipe for failure via human error if key personnel who understand the system functionality are unavailable. When designing a critical facility, it is important that the building design does not outsmart the facilities manager. Companies can also maximize profits and minimize cost by using the simplest design approach possible. In older buildings, facility engineers and senior management need to evaluate the cost of operating with obsolete electrical distribution systems and the associated risk of an outage.
Table 3.1. Law of nines

% Uptime/reliability level    Downtime per year
99%                           87.6 hours
99.9%                         8.76 hours
99.99%                        52 minutes
99.999%                       5.25 minutes
99.9999%                      32 seconds

Where a high potential for losses exists, serious capital expenditures to upgrade the electrical distribution system are monetarily justified by senior management. The cost of downtime across a spectrum of industries has exploded in recent years, as business has become completely computer dependent and systems have become increasingly complex (Table 3.2).

Imagine that you are the manager responsible for a major data center that provides approval of checks and other online electronic transactions for American Express, MasterCard, and Visa. On the biggest shopping day of the year, the day after Thanksgiving, you find out that the data center has lost its utility service. Your first reaction is that the data center has a UPS and standby generator, so there is no problem, right? However, the standby generator is not starting due to a fuel problem, and the data center will shut down in 15 minutes, the amount of time the UPS system batteries can supply power at full load. The penalty for not being proactive is loss of revenue, potential loss of major clients, and, if the problem is large enough, your business could be at risk of financial collapse. You, the manager, could have avoided this nightmare scenario by exercising the standby generator every week for 30 minutes—the proverbial ounce of prevention.

There are about three times as many UPS systems in use today as there were 10 years ago, and many more companies are still discovering their worth after losing data during a power line disturbance. Do you want electrical outages to be scheduled or unscheduled? Serious facilities engineers use comprehensive preventative maintenance procedures to avoid being caught off-guard.
Table 3.2. The cost of downtime*

Industry                  Average cost per hour
Brokerage                 $6,400,000
Energy                    $2,800,000
Credit card operations    $2,600,000
Telecommunications        $2,000,000
Manufacturing             $1,600,000
Retail                    $1,100,000
Health care               $640,000
Media                     $340,000
Human life                Priceless

*Prepared by a disaster-planning consultant of Contingency Planning Research.
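Tables 3.1 and 3.2 can be combined into a rough annual-exposure estimate: expected downtime hours at a given reliability level multiplied by an industry's hourly cost. The sketch below assumes, purely for illustration, that every lost hour carries the full hourly cost, and it ignores recovery time, which the text stresses can far exceed the outage itself:

```python
HOURS_PER_YEAR = 8760

def downtime_hours_per_year(availability: float) -> float:
    """Law of nines: expected downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

def annual_exposure(availability: float, cost_per_hour: float) -> float:
    """Rough annual downtime cost, assuming a constant cost per lost hour."""
    return downtime_hours_per_year(availability) * cost_per_hour

# 99.9% ("three nines") uptime -> 8.76 hours of downtime per year.
print(f"{downtime_hours_per_year(0.999):.2f} h/yr")        # 8.76 h/yr
# At a brokerage's $6,400,000 per hour (Table 3.2):
print(f"${annual_exposure(0.999, 6_400_000):,.0f}/yr")     # $56,064,000/yr
```

Running the same arithmetic at five nines (99.999%) cuts the exposure by two orders of magnitude, which is the quantified argument for redundancy that the next section develops.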
Many companies do not consider installing backup equipment until after an incident has already occurred. During the months following the U.S. Northeast blackout of 2003, the industry experienced a boom in the installation of UPS systems and standby generators. Small and large businesses alike learned how susceptible they are to power disturbances and the associated costs of not being prepared. Some businesses that are not typically considered mission critical learned that they could not afford to be unprotected during a power outage. For example, the blackout of 2003 destroyed $250 million of perishable food in New York City alone.* Businesses everywhere, and of every type, are reassessing their level of risk tolerance and cost of downtime.
3.3 IDENTIFYING THE APPROPRIATE REDUNDANCY IN A MISSION CRITICAL FACILITY

Mission critical facilities cannot be susceptible to an outage at any time, including during maintenance of the subsystems. Therefore, careful consideration must be given to evaluating and implementing redundancy in systems design. Redundancy is typically classified into N + 1 and N + 2 configurations, and is normally applied to these systems:

• Utility service
• Power distribution
• UPS
• Emergency generator
• Fuel system supplying the emergency generator
• Mechanical systems
• Fire-protection systems
A standard N + 1 system is a combination of two basic schemes that meet the criteria of furnishing an essential component plus one additional component for backup. This design provides the best of both worlds, albeit at a steep price when no economies of scale are considered. A standard system protects critical equipment and provides long-term protection to critical operations. In a true N + 1 design, each subsystem is configured in a parallel redundant arrangement such that full load may be served even if one system is offline due to scheduled maintenance or a system failure. The next level of reliability is a premium system. The premium system meets the criteria of an N + 2 design by providing the essential component plus two components for backup. It also utilizes dual electric service from two different utility substations. Under this configuration, any one of the backup components can be taken offline for maintenance while still retaining N + 1 reliability. It is also recommended that the electric services be installed underground rather than overhead.

*Source: New York City Comptroller William Thompson.
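The reliability gain from adding one redundant unit can be made concrete with a short probability sketch. This is illustrative only and not from the text; the per-unit availability figure is a hypothetical assumption, and independent failures are assumed:

```python
from math import comb

def loss_probability(unit_availability: float, units: int, required: int) -> float:
    """Probability that fewer than `required` of `units` independent,
    identical units are available, i.e., the critical load is dropped."""
    p = unit_availability
    q = 1.0 - p
    # Sum the binomial probabilities of 0 .. required-1 units being up.
    return sum(comb(units, k) * p**k * q**(units - k) for k in range(required))

A_UNIT = 0.99  # hypothetical availability of a single UPS module

single = loss_probability(A_UNIT, units=1, required=1)    # N: the one unit must work
n_plus_1 = loss_probability(A_UNIT, units=2, required=1)  # N + 1: either of two suffices

print(f"P(load lost), single unit: {single:.4f}")    # 0.0100
print(f"P(load lost), N + 1:       {n_plus_1:.4f}")  # 0.0001
```

Going from N to N + 1 cuts the loss probability from q to q squared, which is why a premium N + 2 design can give up one backup unit for maintenance and still retain N + 1 protection.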
Facilities can also be classified into different tiers based on their required level of reliability and maintainability. Tiers for mission critical facilities range from I to IV, with Tier IV being the most maintainable (Table 3.3).
3.3.1 Load Classifications
• Critical Load. Requires 100% uptime. Must have uninterrupted power input to safeguard against facility damage or losses, prevent danger and injury to personnel, or keep critical business functions online.
• Essential Load. Supports routine site operations. Able to tolerate power failures without data loss or affecting overall business continuity.
• Discretionary Load. Load that indirectly supports operation of the facility, such as administrative and office functions. Can be shed without affecting overall business continuity in order to keep critical loads online.
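The three load classes imply a shedding order when capacity runs short: discretionary loads drop first, essential loads next, and critical loads never. A minimal sketch, with hypothetical demand figures (not from the text):

```python
# Illustrative load-shedding order for the three load classes above:
# discretionary drops first, essential next, critical is never shed.

SHED_ORDER = ["discretionary", "essential"]  # critical intentionally absent

def loads_to_shed(deficit_kw: float, loads: dict) -> list:
    """Return load classes to shed, in order, until the deficit is covered.
    `loads` maps class name -> present demand in kW."""
    shed, recovered = [], 0.0
    for cls in SHED_ORDER:
        if recovered >= deficit_kw:
            break
        shed.append(cls)
        recovered += loads.get(cls, 0.0)
    return shed

demand = {"critical": 400.0, "essential": 250.0, "discretionary": 150.0}
print(loads_to_shed(100.0, demand))  # ['discretionary']
print(loads_to_shed(300.0, demand))  # ['discretionary', 'essential']
```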
Table 3.3. Uptime tiers

Tier I—basic nonredundant
• No redundancy
• Susceptible to interruptions from planned and unplanned activities
• Equipment configurations are the minimum required for the equipment to operate
• Operation errors or failures will cause an interruption in service

Tier II—basic redundant
• Limited backup and redundancy
• Susceptible to disruptions from planned and unplanned activities
• May contain limited criticality functions that can be shut down properly without adverse effects on business
• UPS and/or generator backup may be installed for parts of the building
• Failures may cause a disruption in facility service

Tier III—concurrently maintainable
• Full single system backup and redundancy (N + 1)
• Planned preventative and programmable maintenance activities, repairs, testing, etc. can be conducted without interruption of service
• Errors in operation or spontaneous failures of infrastructure may cause disruption of power to the load

Tier IV—fault and failure tolerant
• Facility functions cannot tolerate any downtime
• No single points of failure, and multiple system backup with automated recovery (2N)
• Capable of withstanding one or more component failures, errors, or other events without disrupting power to the load
• Full load can be supported on one path without disruption while maintenance/testing is performed on the other

Source: Uptime Institute.
3.4 IMPROVING RELIABILITY, MAINTAINABILITY, AND PROACTIVE PREVENTATIVE MAINTENANCE

The average human heart beats approximately 70 times a minute, or a bit more than once per second. Imagine if a heart missed three beats in a minute. This would be considered a major power-line disturbance if we were to compare it to an electrical distribution system. Take the electrical distribution system that feeds your facility or, better yet, the output of the UPS system, and interrupt it for 3 seconds. This is an eternity for computer hardware. The critical load is disrupted, your computers crash, and your business loses two days' worth of labor or, worse, is fined $10 to $20 million by the federal government because it did not receive the quota of $500 billion of transaction reports by the allocated time. All this could have been prevented if electrical maintenance and testing were performed on a routine basis and the failed electrical connections were detected and repaired. Repairs could have been quickly implemented during the biannual infrared scanning program that takes place before building maintenance shutdowns.

What can the data processing or facility manager do to ensure that their electrical system is as reliable as possible? The seven steps to improved reliability and maintainability are (Figure 3.1):

1. Planning and impact assessment
2. Engineering and design
3. Project management
4. Testing and commissioning
Figure 3.1. The seven steps form a continuous cycle of evaluation, implementation, preparation, and maintenance. (Courtesy of Power Management Concepts, LLC.)
5. Documentation
6. Education/training and certifications with annual recertification
7. Operations and maintenance

When designing a data processing center, it is important to hire competent professionals to advise at each step of the way. If the data processing center is being installed in an existing building, you do not have the luxury of designing the electrical system from scratch. A proficient electrical engineer will design a system that makes the most of the existing electrical distribution. Use electrical contractors who are experienced in data processing installations. Do not attempt to save money by filling a conduit to its full 40% fill capacity, because new, state-of-the-art equipment is deinstalled as quickly as it is installed, and those same number 12 wires will need to come out of the conduit without disturbing the working computer hardware. Have an experienced electrical testing firm inspect the electrical system, perform tests on circuit breakers, and use thermal-scan equipment to find hot spots due to improper connections or faulty equipment. Finally, plan for routine facility shutdowns to perform preventative maintenance on all critical equipment.

Facility managers must not underestimate the cost-effectiveness of a thorough preventative maintenance program, nor must they allow senior management to do so. Critical system maintenance is not a luxury; it is a necessity. Again, do you want electrical outages to be scheduled or unscheduled? Integrating the ideal critical infrastructure is just about impossible. Therefore, seek out the best possible industry authorities to solve your problems. Competent consultants will have the knowledge, tools, testing equipment, training, and experience necessary to understand the risk tolerance of your company, as well as to recommend and implement the proper and most advanced proven designs. Equipment manufacturers and service providers are challenged to find and retain the industry's top technicians.
As 24/7 operations become more prevalent, the available talent pool will diminish. This could cause response times to increase from the current industry standard of 4 hours. The human element therefore has a significant impact on risk and reliability. No matter which firms you choose, always ask for sample reports, testing procedures, and references. Your decisions will determine the system's ultimate reliability, as well as how easy the system is to maintain. Seek experienced professionals, from both your own company and third parties, for information systems, property and operations management, and space planning, and engage the best consultants in the industry for all engineering disciplines. The bottom line is to have proven organizations working on your project.
3.5 THE MISSION CRITICAL FACILITIES MANAGER AND THE IMPORTANCE OF THE BOARDROOM

To date, mission critical facilities managers have not achieved high levels of prestige within the corporate world. This means that if the requirements are 24/7, forever, the mission critical facilities manager must work hard to have a voice in the boardroom. The board can then become a powerful voice that supports the facilities manager and
establishes a standard for managing the risks associated with older equipment or maintenance cuts. For instance, relying on a UPS system that has reached the end of its useful life but is still deployed due to cost constraints increases the risk of failure. The facilities manager is in a unique position to advise and paint vivid scenarios for the board. Imagine incurring losses due to downtime, plus damage to the capital equipment that is keeping the Fortune 1000 company in business. Board members understand this language; it is comparable to managing and analyzing other types of risk, such as whether to invest in emerging markets in unstable economies. The risk is one and the same; the loss is measured in the bottom line. The facilities engineering department should be run and evaluated just like any other business line; it should show a profit. But instead of increased revenue, the business line shows increased uptime, which can be equated monetarily, plus far less risk. It is imperative that the facilities engineering department be given the tools and the human resources necessary to implement the correct preventative maintenance, training, and document management requirements, with the support of all company business lines.
3.6 QUANTIFYING RELIABILITY AND AVAILABILITY
Data center reliability ultimately depends on the organization as a whole weighing the dangers of outages against available enhancement measures. Reliability modeling is an essential tool for designing and evaluating mission critical facilities. The conceptual phase, or programming, of the design should include full probabilistic risk assessment (PRA) methodology. The design team must quantify performance (reliability and availability) against cost in order to push fundamental design decisions through the approval process. Reliability predictions are only as good as the ability to model the actual system. In past reliability studies, major insight was gained into various electrical distribution configurations using IEEE Standard 493-1997, Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems, also known as the IEEE Gold Book. The Gold Book is also the major source of data on failure and repair rates for electrical equipment. There are, however, aspects of the electrical distribution system for a critical facility that differ from other industrial and commercial facilities. Therefore, internal data accumulated from the engineer's practical experience is needed to complement the Gold Book information.

Reliability analysis with PRA software provides a number of significant improvements over earlier, conventional reliability methods. The software incorporates reliability models to evaluate and calculate reliability, availability, unreliability, and unavailability. The results are compared to a cost analysis to help reach a decision about facility design. The process of evaluating reliability includes:

• Analyzing any existing system and calculating the reliability of the facility as it is currently configured
• Developing solutions that will increase the reliability of the facility
• Calculating the reliability with the solutions applied to the existing systems
• Evaluating the cost of applying the solutions
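The evaluation steps above can be sketched numerically. All figures below are hypothetical placeholders (the downtime cost echoes the retail row of Table 3.2); a real study would derive the availability numbers from PRA modeling rather than assume them:

```python
# Sketch of the evaluation loop: compute unavailability as-is and with a
# proposed upgrade, then weigh avoided downtime against the upgrade cost.
# All figures are hypothetical placeholders, not values from the text.

HOURS_PER_YEAR = 8760

def annual_downtime_hours(availability: float) -> float:
    return (1.0 - availability) * HOURS_PER_YEAR

as_is    = 0.9995   # availability of the existing configuration (assumed)
upgraded = 0.99999  # availability with the proposed N + 1 solution (assumed)
cost_of_upgrade = 750_000           # one-time cost, $ (assumed)
cost_per_downtime_hour = 1_100_000  # $/h, e.g., retail (Table 3.2)

avoided = annual_downtime_hours(as_is) - annual_downtime_hours(upgraded)
savings = avoided * cost_per_downtime_hour

print(f"Avoided downtime: {avoided:.2f} h/yr")
print(f"Expected annual savings: ${savings:,.0f} vs ${cost_of_upgrade:,} upgrade")
```

Framing the result as avoided downtime cost per year, against a one-time capital cost, is the language the approval process understands.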
3.6.1 Review of Reliability Terminology
Reliability (R) is the probability that a product or service will operate properly for a specified period of time under design operating conditions without failure. The failure rate (λ) is defined as the probability that a failure per unit time will occur in the interval, given that no failure has occurred prior to the beginning of the interval. For a constant failure rate λ, reliability as a function of time is

R(t) = e^(-λt)

Mean time between failures (MTBF), as its name implies, is the mean of the probability distribution function of the time between failures. For a statistically large sample, it is the average time the equipment performed its intended function between failures. For the example of a constant failure rate,

MTBF = 1/λ

Mean time to repair (MTTR) is the average time it takes to repair the failure and get the equipment back into service. Availability (A) is the long-term average fraction of time that a component or system is in service and satisfactorily performing its intended function. This is also called steady-state availability. Availability is defined as the mean time between failures divided by the mean time between failures plus the mean time to repair:

A = MTBF/(MTBF + MTTR)

High reliability means that there is a high probability of good performance in a given time interval. High availability is a function of failure frequency and repair times, and is a more accurate indication of data center performance. As more and more buildings are required to deliver service guarantees, management must decide what performance is required from the facility. Availability levels of 99.999% (5.25 minutes of downtime per year) allow virtually no facility downtime for maintenance or other planned or unplanned events. Therefore, moving toward high reliability is imperative. Since the 1980s, the overall percentage of downtime events caused by facilities has grown as computers have become more reliable.
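The definitions above translate directly into a few lines of code. The MTBF and MTTR figures below are illustrative assumptions, not values from the text:

```python
# Direct transcription of the reliability formulas above, with
# illustrative (hypothetical) MTBF/MTTR figures for a single module.
import math

def reliability(failure_rate: float, t: float) -> float:
    """R(t) = e^(-lambda * t), for a constant failure rate."""
    return math.exp(-failure_rate * t)

def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

mtbf = 100_000.0  # hours, hypothetical
mttr = 8.0        # hours, hypothetical
lam = 1.0 / mtbf  # constant failure rate: MTBF = 1/lambda

A = availability(mtbf, mttr)
minutes_down_per_year = (1.0 - A) * 8760 * 60

print(f"R(1 year) = {reliability(lam, 8760):.4f}")   # 0.9161
print(f"A         = {A:.6f}")                        # 0.999920
print(f"Downtime  = {minutes_down_per_year:.1f} min/yr")  # 42.0
```

Note how a very long MTBF still leaves roughly 42 minutes of expected annual downtime once an 8-hour repair time is factored in, which is why availability, not reliability alone, is the better performance indicator.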
And although this percentage remains small, total availability is dramatically affected because repair times for facility events are so high. A further analysis of downtime caused by facility failures indicates that utility outages have actually declined, primarily due to installation of standby generators. The most common response to these trends is reactive, that is, spending time and resources to repair the offender. If a utility goes down, install a generator. If a ground
fault trips critical loads, redesign the distribution system. If a lightning strike burns power supplies, install a new lightning protection system. Such measures certainly make sense, as they address real risks in the data center. However, strategic planning can identify internal risks and provide a prioritized plan for reliability improvements. Planning and careful implementation will minimize disruptions while making the business case to fund these projects.

As technological advances find their way onto the data center's raised floor, the facility will be impacted in unexpected ways. As equipment footprints shrink, free floor area is populated with more hardware. However, because the smaller equipment emits the same amount of heat, power and cooling densities grow dramatically and floor space for cooling equipment increases. The large footprint now required for reliable power without planned downtime (switchgear, generators, UPS modules, and batteries) also affects the planning and maintenance of data center facilities. Over the last two decades, the cost of the facility relative to the computer hardware it houses has not grown proportionately. Budget priorities that favor computer hardware over facilities improvement can lead to insufficient performance. The best way to ensure a balanced allocation of capital is to prepare a business analysis that includes costs associated with the risk of downtime. This cost depends on the consequences of an unplanned service outage in that facility and the probability that an outage will occur.
3.7 DESIGN CONSIDERATIONS FOR THE MISSION CRITICAL DATA CENTER

In most mission critical facilities, the data center constitutes the critical load in the daily operations of a company or institution. The costs of hardware and software can run to $3000-$4000 per square foot, often resulting in an investment of many millions of dollars depending on size and risk profile. In a data center of only 5000 square feet, you could be responsible for a $20 million capital investment, without even considering the cost of downtime. Combined, the cost of downtime and damage to equipment could be catastrophic. Proper data center design and operations will protect the investment and minimize downtime. Early in the planning process, an array of experienced professionals must review all the factors that affect operations. This is no time to be a "jack of all trades, master of none." Here are basic steps critical to designing and developing a successful mission critical data center:

• Determine the needs of the client and the required reliability of the mission critical data center.
• Develop the configuration for the hardware.
• Calculate the air, water, and power requirements.
• Determine your total space requirements.
• Validate the specific site. Be sure that the site is well located, away from natural disasters, and that electric, telecommunications, and water utilities can provide the high level of reliability your company requires.
• Develop a layout after all parties agree.
• Design the mission critical infrastructure to an N + 1, N + 2, or higher redundancy level, depending on the risk profile and reliability requirements.
• Once a design is agreed upon, prepare a budgetary estimate for the project.
• Have a competent consulting engineer prepare specifications and bid packages for equipment purchases and construction contracts. Use only vendors that are familiar with and experienced in the mission critical industry.
• After bids are opened, select and interview vendors. Take time to carefully choose the right vendor. Make sure you see their work, ask many questions, verify references, and be sure that everybody is on the same page.
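The sizing steps above amount to straightforward arithmetic. A rough sketch using the $4000-per-square-foot figure cited earlier; the power density and cooling overhead are hypothetical planning assumptions, not values from the text:

```python
# Rough sizing arithmetic for the planning steps above.
# Density and cooling-overhead values are hypothetical placeholders.

floor_area_sqft = 5_000
hardware_cost_per_sqft = 4_000   # $/sq ft, upper end of the cited range
it_power_density = 100           # W/sq ft, assumed average
cooling_overhead = 0.6           # assumed: 0.6 W of cooling per W of IT load

capital = floor_area_sqft * hardware_cost_per_sqft
it_load_kw = floor_area_sqft * it_power_density / 1000
total_load_kw = it_load_kw * (1 + cooling_overhead)

print(f"Hardware investment: ${capital:,}")         # $20,000,000
print(f"IT load:             {it_load_kw:.0f} kW")  # 500 kW
print(f"IT + cooling load:   {total_load_kw:.0f} kW")  # 800 kW
```

Numbers like these feed directly into the air, water, power, and space requirements in the step list, and reproduce the $20 million figure for a 5000-square-foot room.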
3.7.1 Data Center Certification
In addition to the previously mentioned design considerations, the Leadership in Energy and Environmental Design (LEED) certification is also important to consider. LEED is an internationally recognized green building certification system developed by the U.S. Green Building Council and should not be overlooked when designing a data center. LEED certification will impress customers as well as the boardroom, since its implementation requires energy efficiency improvements. Although initial costs for LEED certification are 2-4% higher than standard building construction, it theoretically results in an energy savings annuity over the facility's lifecycle. LEED has yet to develop specific standards that apply to data centers in the accreditation process in its current version 2.2. However, a LEED data center draft proposal aimed at remediating this situation through specific standards associated with data centers is under development. In addition, LEED 2009 introduced a new method of weighting credits and systems, called LEED Bookshelf, that will allow the creation of new credits, which is very promising for data centers' energy challenges. Certification involves categories such as sustainable sites, water efficiency, energy and atmosphere, materials and resources, indoor environmental quality, innovation and design, and regional priority. A 100-point scale with 10 bonus points is assigned to these categories, and there are four levels of certification: certified, silver, gold, and platinum. The most points can be earned in the energy and atmosphere category. Points are given for building designs that are capable of tracking building performance, managing refrigerants to reduce CFCs, and using renewable energy. Many data centers have attained LEED certifications in an effort to curb energy consumption and reduce GHGs. One of the first data centers in the world to achieve LEED Platinum was the Citigroup Data Center in Frankfurt, Germany, in April 2009.
Some features implemented to achieve certification were the use of fresh-air free cooling, reverse osmosis water treatment in cooling towers to reduce water use, a vegetated roof, a vegetated green wall irrigated using harvested rain water, extensive use of server virtualization, and a data center layout that reduces required cabling by 250 km. New data center designs should take these practices and expand on them in order to achieve LEED certification. Another internationally used rating system similar to LEED is the Building Research Establishment Environmental Assessment Method (BREEAM), which is based
in the United Kingdom. As its name suggests, BREEAM is an environmental assessment method for all types of new and existing buildings that provides best practices for sustainable design. BREEAM has a set of standard categories, called schemes, by which a building can be assessed. The array of schemes covers a wide range of facilities, including prisons, schools, offices, and industrial factories. The system is also remarkably flexible: if a building does not fall under the standard categories, BREEAM will develop customized criteria to evaluate it. The BREEAM ratings that can be attained are Pass, Good, Very Good, and Excellent. In 2010, BREEAM launched a new scheme, called BREEAM Data Centres, to evaluate the environmental performance of data centers. Data centers are unique environments: they have few employees but still use huge amounts of energy. In addition, this intensive energy use will only increase with rising computing needs and demands. According to BREEAM, this new framework could reduce energy use in data centers by more than 50%.* Over the next year, it is hoped that this new scheme will be refined even further to provide design, construction, and operation of data centers with the most efficient and effective standards possible.
3.8 THE EVOLUTION OF MISSION CRITICAL FACILITY DESIGN
To avoid downtime, facilities managers also must understand the trends that affect the utilization of data centers, given the rapid evolution of technology and its requirements on power distribution systems. The degree of power reliability in a data center will impact the design of the facility infrastructure, the technology plant, system architecture, and end-user connectivity. Today, data centers are pushed to the limit. Servers are crammed into racks, and their high-performance processors all add up to outrageous power consumption. In the early 2000s, data center power consumption increased about 25% each year, according to Hewlett-Packard. At the same time, processor performance has gone up 500%, and as equipment footprints shrink, free floor area is populated with more hardware. However, because the smaller equipment still emits the same amount of heat per unit, cooling densities are growing dramatically. Traditional design density, measured in watts per square foot, has grown enormously and can also be expressed as transactions per watt. All this increased processing power generates heat, but if the data center gets too hot, all applications grind to a halt.

Many data center designers (and their clients) would like to build for a 20-year life cycle, yet the reality is that most cannot realistically look beyond 2 to 5 years. As companies push to wring more data-crunching ability from the same real estate, the lynchpin technology of future data centers will not necessarily involve greater processing power or more servers, but improved heat dissipation and better air flow management. To combat high temperatures and maintain the current trend toward more powerful processors, engineers are reintroducing an old technology, liquid cooling, which was used to cool mainframe computers a decade ago. To successfully reintroduce liquid cooling into computer rooms, standards will need to be developed; this is another arena in which standardization can promote reliable solutions that mitigate risk for the industry.

The large footprint now required for reliable power without planned downtime also affects the planning and maintenance of data center facilities. Over the past two decades, the cost of the facility relative to the computer hardware it houses has not grown proportionately. Budget priorities that favor computer hardware over facilities improvement can lead to insufficient performance. The best way to ensure a balanced allocation of capital is to prepare a business analysis that shows the costs associated with the risk of downtime.

*http://www.breeam.org/newsdetails.jsp?id=672.
3.9 HUMAN FACTORS AND THE COMMISSIONING PROCESS
There is no such thing as plug and play when critical infrastructure is deployed or existing systems are overhauled to support a company's changing business mission. Reliability is not guaranteed simply by installing new equipment, or even building an entirely new data center. An aggressive and rigorous design, failure mode analysis, testing/commissioning process, and operations plan proportional to the facility's criticality level are necessities, not options. Of particular importance are the actual commissioning process and the development of a detailed operations plan. More budget dollars should be allocated to testing and commissioning, documentation, education and training, and operations and maintenance, because more than 50% of data center downtime can be traced to human error. Due to the facility's 24/7 mission critical status, this will be the construction team's sole opportunity to integrate and commission all of the systems. At this point in the project, a competent, independent test engineer familiar with the equipment has witnessed testing of all installed systems at the factory.

Commissioning is a systematic process of ensuring, through documented verification, that all building systems perform according to the design intent and to the future owner's operational needs. The goal is to provide the owner with a safe and reliable installation. A commissioning agent who serves as the owner's representative usually manages the commissioning process. The commissioning agent's role is to facilitate a highly interactive process of verifying that the project is installed correctly and operating as designed. This is achieved through coordination with the owner, design team, construction team, equipment vendors, and third-party commissioning provider during the various phases of the project.
As previously mentioned, ASHRAE's Commissioning Guideline 0-2005 is a recognized model and a good resource that explains this process in detail and can be applied to critical systems. Prior to installation at the site, all equipment should undergo factory acceptance testing that is witnessed by an independent test engineer familiar with the equipment and the testing procedures. However, relying on the factory acceptance test alone is not sufficient. Once the equipment is delivered, set in place, and wired, and functional testing is completed, integrated system testing begins. The integrated system test verifies and certifies that all components work together as a fully integrated system. This is the time to resolve all potential equipment problems. There is no one-size-fits-all formula. Before a new data center or renovation within an existing building goes online, it is
crucial to ensure that the systems are burned in and failure scenarios are tested, no matter the schedule, milestones, and pressures. You will not have a chance to do this phase over, so get it right the first time. A tremendous amount of coordination is required to fine-tune and calibrate each component. For example, critical circuit breakers must be tested and calibrated prior to exposing them to any critical electrical load. After all tests are complete, results must be compiled for all equipment and certified test reports prepared, establishing a benchmark for all future testing.

Scheduling time to educate staff during systems integration testing is not considered part of the commissioning process but is extremely important in order to reduce human error. This activity can be considered part of the transition-to-operations process. Hands-on training is invaluable because it improves situational awareness and operator confidence, which in turn reduces human error. The training can also break through misplaced confidence: sometimes we deem ourselves ready for a task when we really are not. Handing off the new infrastructure or facility to fully trained and prepared operations teams improves success and uptime throughout the facility lifecycle. When you couple the right design with the right operations plan (training programs, documentation/methods of procedure (MOPs), and preventative maintenance), the entire organization will be much better prepared to manage through critical events and the unexpected. If proper training and preparation are not done during the commissioning stage, building engineers will not become familiar with the various processes and procedures. Learning on the job increases operational risk. Knowing this up front, there is absolutely no reason not to commit the necessary budget for training, technical maintenance programs, accurate documentation and storage, and, finally, some type of credible certification that is continually revisited.
If done correctly, potential mishaps and near misses will be avoided and the reduced risk will be like an annuity paying reliability/availability dividends.
3.10 SHORT-CIRCUIT AND COORDINATION STUDIES

3.10.1 Short-Circuit Study
Whenever a fault occurs in an electrical power system, relatively high currents flow, producing large amounts of destructive energy in the forms of heat and magnetic forces. When an electrical fault exceeds the interrupting rating of the protective device, or the fault exceeds the rating of equipment such as switchgear or panel boards, the consequences can be devastating, including injury, damaged electrical equipment, and costly downtime. A short-circuit study (SCS) is required to establish minimum equipment ratings for power system components to withstand the mechanical and thermal stresses that occur during a fault; accordingly, SCSs are mandated by Article 110.9 of NEC 2008. The NEC also requires equipment markings and nameplates showing the ratings established by the SCS. These include industrial control panels (409.110), industrial machinery (670.3[A]), HVAC equipment (440.4[B]), meter disconnect switches (230.82[3]), and motor controllers (430.8). The short-circuit calculations that are the basis of an SCS involve the reduction of
the electric power distribution system supplying current to each fault location to a Thévenin equivalent circuit of system voltage sources and impedances at each bus. [Refer to ANSI/IEEE Standard 399, Recommended Practice for Power System Analysis (Color Book Series, the Brown Book).] These calculations can be done by hand, but since they can be very tedious for mid-size to large distribution systems, they are usually performed using the same specialized software that models the power system under normal load flow conditions to determine system power flows and voltage drop. When the software is used for an SCS, the model is used to calculate the resultant fault current at each bus in the system. Typically, results are presented for both three-phase and line-to-ground faults. (Software used by electric utility companies also calculates line-to-line fault current values.)

In order for the electrical distribution system analysis software to produce reliable results, the electrical distribution system being analyzed must be modeled accurately. All current sources must be investigated and input into the model, including:

• The Electric Utility. As a minimum, the service voltage, the available three-phase and single line-to-ground short-circuit current (usually given in amps or MVA), and the three-phase and line-to-ground equivalent circuit reactance/resistance ratio, referred to as the X/R ratio, are required. Many utilities also provide the equivalent system impedances at the point of service connection. The system short-circuit contribution data must be requested from the electric utility.
• On-site Generation. If the worst-case short-circuit levels are to be calculated, all generation that can operate in parallel with the utility supply must be included in the model. Rated voltage, kVA, power factor, and generator subtransient and transient impedances should be obtained from the equipment nameplate or manufacturer for input into the model.
• In-Service Motors.
Rotating electric motors have stored kinetic energy and under fault conditions can act as generators for a short period, returning some of this stored energy to the fault in the form of short-circuit current. All across-the-line induction motors and synchronous motors should be modeled. If induction motors are supplied by variable-frequency drives, only motors fed from regenerative drives should be included. Typically, lumping contributing motors together on each bus provides reasonably accurate results. Software packages have default impedances and power factors that cover typical motors. Actual impedances should be used where known, especially for very large motors.

• UPS Systems. UPS systems are often limited in the amount of short-circuit current they can contribute to a downstream fault. Consult with the equipment supplier for the characteristics of any such equipment on the system being analyzed.

All impedances in the system must be defined and input:

• Transformers. Correct modeling of transformers is critical to an accurate model. Winding voltage ratings, kVA ratings, and impedances should be obtained from transformer nameplates or manufacturers. Winding connections must be input, as well as any ground impedances.
MISSION CRITICAL ENGINEERING WITH AN OVERVIEW OF GREEN TECHNOLOGIES
• Cables. Wire and cable size, length, and routing type must be included in the model. Where existing cable sizes are not known, use the NEC as a reference and include any assumptions made in the report as a reference for future investigations. Cable lengths can be estimated from a site visit and inspection, or by making reasonable assumptions from building and site plans. Whether cable is routed in a steel conduit or raceway can also make a difference in cable impedance.

• Reactors. Where line reactors are included to limit short-circuit current or minimize harmonics, they should be included in the model. Obtain ratings from the equipment nameplate or the manufacturer.

Once the electrical distribution system model is created, the system can be analyzed for all operating scenarios. These might include generators operating or not, bus-tie breakers closed or open, and so on (Figure 3.2). The maximum short-circuit fault current calculated from any of the scenarios should be recorded at each bus in the distribution system. Protective device manufacturers assign short-circuit interrupting and
Figure 3.2. Sample SCS Screenshot. (Courtesy of Power Management Concepts LLC.)
fault-withstand ratings to their equipment, signifying the maximum fault condition under which the device may be safely applied. The currents calculated in a SCS are used to specify the required ratings for new equipment and to evaluate the adequacy of existing system components to withstand and interrupt these high-magnitude currents. Once the electrical distribution system is modeled, proposed changes in system configuration can be analyzed in order to determine what, if any, equipment must be upgraded to support the proposed changes. An up-to-date SCS is also a requirement for protective device coordination studies as well as arc-flash hazard analysis.
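The bus-by-bus calculation described above reduces, for a simple radial circuit, to forming the Thévenin impedance seen from the fault and dividing it into the driving voltage. The following Python sketch illustrates the arithmetic only; the service parameters and transformer rating are hypothetical, and impedance magnitudes are simply added (ignoring the X/R angles and ANSI adjustment factors that real SCS software applies):

```python
import math

def utility_impedance_ohms(v_ll_kv: float, sc_mva: float) -> float:
    """Thévenin source impedance (ohms/phase) from the utility's
    available short-circuit MVA, referred to the stated voltage."""
    return v_ll_kv ** 2 / sc_mva  # kV^2 / MVA = ohms

def transformer_impedance_ohms(v_ll_kv: float, kva: float, z_pct: float) -> float:
    """Transformer impedance in ohms, referred to its secondary,
    from the nameplate %Z and kVA rating."""
    z_base = (v_ll_kv * 1000) ** 2 / (kva * 1000)
    return z_base * z_pct / 100.0

def three_phase_fault_amps(v_ll_volts: float, z_total_ohms: float) -> float:
    """Symmetrical three-phase bolted-fault current at the bus."""
    return v_ll_volts / (math.sqrt(3) * z_total_ohms)

# Hypothetical service: 500 MVA available from the utility, feeding a
# 1500 kVA, 5.75 %Z transformer with a 480 V secondary.
z_util = utility_impedance_ohms(0.48, 500)        # utility Z referred to 480 V
z_xfmr = transformer_impedance_ohms(0.48, 1500, 5.75)
i_fault = three_phase_fault_amps(480, z_util + z_xfmr)
print(f"Bolted three-phase fault at the 480 V bus: {i_fault / 1000:.1f} kA")
```

Note how the stiff utility source contributes far less impedance than the transformer, so the transformer %Z dominates the secondary fault level, which is why nameplate data must be accurate in the model.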
3.10.2 Coordination Study
The goal of a protective-device coordination study is to determine overcurrent device settings and selections that maximize selectivity and power system reliability. In a well-coordinated protective-device scheme, faults are cleared by the nearest upstream protective device. This minimizes the portion of the electrical distribution system interrupted as a result of a fault or other disturbance. At the main distribution panel level, feeder breakers and fuses should trip before the main device. Likewise, panelboard branch breakers should trip before the feeder breaker or fuse supplying the panel. As with short-circuit studies, a protective-device coordination study is usually performed using the same specialized integrated software that is used for the short-circuit, load-flow, and arc-flash calculations. Protective device types, ratings, and settings can be incorporated while the model is built, or added after initial studies are completed. The use of computerized calculations allows the system protection engineer to evaluate a number of setting options in a short period of time, thereby allowing him or her to fine-tune settings to achieve the best possible coordination. Using the computer model, a time-current curve (TCC) is developed for each circuit fed from service switchboards and other critical buses. The curve includes the overcurrent devices for the largest loads in series on the circuit, the worst case from a coordination standpoint. All fuses, breakers, and electromechanical or electronic relays and trip devices are entered into the computer model, if not entered previously. Transformer and cable damage curves should be selected for inclusion on the TCC to verify that critical equipment is being protected (Figure 3.3). Transformer inrush points must be included to verify that the protective device feeding the unit will not operate when the transformer is energized.
Device selections and settings from the database are reviewed to determine if changes are required to improve coordination. If so, settings or selections are modified and the resulting TCCs are printed for inclusion in the report. Protective device settings and selections are also summarized in tabular form. In many cases, protective-device coordination is a matter of compromise. Rather than being a choice of black or white, device coordination requires making selections that result in the best coordination that can be achieved with the devices that are installed: fuse characteristics are not as varied as those of electronic trip devices, instantaneous elements in series cannot be reliably coordinated, and transformer protection must be selected to protect the unit from fault damage while passing inrush current when it is energized. Fortunately, where power system reliability must be maximized and, therefore, strict
Figure 3.3. Sample TCC curve analysis. (Courtesy of Power Management Concepts, LLC.)
coordination is required, electronic relays and trip devices offer a variety of settings, curve shapes, and other functions that allow the system protection engineer to achieve this goal. Electronic relays come with a variety of trip characteristic curves which, along with time-delay and pick-up settings, allow a great deal of flexibility when programming the device. The zone-selective interlocking feature available in many static trip devices allows the arming of instantaneous settings on breakers in series without losing coordination. The upstream breaker trip device (such as the main device on a bus) communicates with downstream breakers. If the downstream device sees a fault current event, it sends a signal to the main device to block the tripping of the main breaker, thus allowing the downstream device to operate and minimizing the extent of the electrical system affected by the fault. As discussed, state-of-the-art protective devices can make a significant contribution to protective-device coordination, minimizing or eliminating unnecessary outages due to compromised coordination. If project requirements demand strict coordination, electrical equipment selection may be affected. It is, therefore, important to consider how these requirements affect equipment selection, specification, and layout as early
in the project as possible. Selection of the wrong type of equipment may negate the ability to take advantage of the technological advances discussed above.
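The selectivity that a TCC conveys graphically can also be checked numerically: for a fault current seen by both devices in series, the downstream device's operating time plus a coordination margin must not exceed the upstream device's. The sketch below uses the IEC 60255 standard-inverse relay characteristic; the pickup currents, time multipliers, and 0.3-second margin are hypothetical values chosen for illustration, not recommendations:

```python
def iec_standard_inverse(tms: float, pickup_amps: float, fault_amps: float) -> float:
    """IEC 60255 standard-inverse operating time in seconds:
    t = TMS * 0.14 / (M**0.02 - 1), with M the pickup multiple."""
    m = fault_amps / pickup_amps
    return tms * 0.14 / (m ** 0.02 - 1)

def coordinated(t_down: float, t_up: float, margin: float = 0.3) -> bool:
    """True if the upstream device waits at least `margin` seconds
    longer than the downstream device at the same fault current."""
    return (t_up - t_down) >= margin

fault = 4000.0  # amps, a fault seen by both devices in series
t_feeder = iec_standard_inverse(tms=0.05, pickup_amps=400, fault_amps=fault)
t_main = iec_standard_inverse(tms=0.20, pickup_amps=800, fault_amps=fault)
print(f"feeder {t_feeder:.2f} s, main {t_main:.2f} s, "
      f"coordinated: {coordinated(t_feeder, t_main)}")
```

A full study repeats this comparison across the whole range of fault currents (and against transformer damage and inrush points), which is exactly what plotting both curves on one TCC accomplishes at a glance.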
3.11 INTRODUCTION TO DIRECT CURRENT IN THE DATA CENTER

All of our modern electronic equipment relies on solid-state semiconductor technology, which operates only on direct current (DC). According to the Green Buildings Forum, 72% of the energy used in the United States is consumed in commercial real estate buildings. A study by the University of Virginia shows that 80% of this energy is used by semiconductor technology, which means that this much AC power must be converted to DC. The 2007 EPA "Report to Congress on Server and Data Center Energy Efficiency" shows that data centers in the United States have the potential to save up to $4 billion in annual electricity costs through more energy-efficient equipment and operations. One movement to accomplish this that is currently gaining acceptance among data center designers, and garnering support from the Electric Power Research Institute (EPRI), is the use of a direct current distribution system.

When we step back and look at the end-to-end power distribution in a conventional data center, we see several power conversions taking place. Incoming AC utility power is first rectified to DC at the UPS for the purpose of connecting to DC battery storage. It is then inverted back to AC for distribution to the server racks. At the racks, the AC is rectified back to DC again in the power supplies for each server. We may ask ourselves: are all these back-and-forth conversions really necessary? The answer is no.
3.11.1 Advantages of DC Distribution
Suppose we made only one conversion of the incoming AC to DC at the UPS, and then distributed the DC throughout the facility without any further conversions, thereby eliminating two power conversions along with the attendant losses. Not only would this make for a more efficient system, it would also reduce the number of components, thereby eliminating points of failure, making for a more reliable system. As an example, a 380 V DC distribution system will result in a 200% increase in reliability, a 33% reduction in required floor space, and a 28% improvement in efficiency over conventional UPS systems, or a 9% improvement over best-in-class AC UPS architectures. A DC distribution system would also facilitate the integration of on-site DC-generating power sources, such as solar PV arrays, wind power, and fuel cells, which all can provide DC power without a single power conversion. For example, the best published efficiency for a fuel cell is 50%. According to UTC Power, by eliminating the AC power conditioning subassembly and utilizing waste heat in a combined cooling, heating, and power application, efficiencies can exceed 85% with a high load factor. Figure 3.4 shows the various power conversions that are required in a conventional AC distribution system, along with those that are required to connect alternate energy
sources. Figure 3.5 shows the fewer conversions that are required with a DC distribution system and the simpler integration of alternate energy sources.

Figure 3.4. Traditional AC distribution.

This kind of DC distribution system can provide up to a 25% decrease in energy usage when compared to a traditional AC power distribution system by eliminating the losses associated with the inefficiencies of power conversion equipment. It also reduces the initial cost for electrical distribution equipment by about half. And since
Figure 3.5. DC distribution.
there are fewer conversions from AC to DC and DC to AC, which translates into less equipment, the distribution system occupies significantly less space. Proponents of DC power in data centers tout fewer single points of failure in DC systems due to the reduction in components. There is also the accompanying decrease in heat production. This reduces cooling capacity requirements and provides further reductions in operating costs.
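The efficiency argument above is simply the multiplication of per-stage conversion efficiencies in series. The sketch below makes that arithmetic explicit; the per-stage figures are illustrative assumptions for a double-conversion AC path versus a 380 V DC path, not measured data from any system cited in this chapter:

```python
from functools import reduce

def chain_efficiency(stages) -> float:
    """Overall efficiency of power-conversion stages in series
    (the product of the individual stage efficiencies)."""
    return reduce(lambda a, b: a * b, stages, 1.0)

# Illustrative (assumed) per-stage efficiencies:
ac_path = {"UPS rectifier": 0.96,
           "UPS inverter": 0.96,
           "server PSU (AC-to-DC)": 0.90}
dc_path = {"rectifier (AC to 380 V DC)": 0.96,
           "server DC-DC front end": 0.95}

eff_ac = chain_efficiency(ac_path.values())
eff_dc = chain_efficiency(dc_path.values())
print(f"AC path: {eff_ac:.1%}, DC path: {eff_dc:.1%}, "
      f"relative gain: {(eff_dc - eff_ac) / eff_ac:.1%}")
```

Because losses compound multiplicatively, removing even one conversion stage improves the end-to-end figure by more than that stage's individual loss might suggest, and every watt not lost is also a watt of heat the cooling plant never has to remove.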
3.11.2 DC Lighting
Besides all of the electronic equipment that is found in the data center, another significant consumer of electricity is the lighting system. Fluorescent lighting is typically fed with 120 V AC, or 277 V AC in larger commercial buildings. Since fluorescent lamps vary by type and length, they operate on varying voltages and currents. A fluorescent ballast is used to adjust the supplied voltage and regulate current accordingly. In an effort to improve fluorescent lighting efficiency, the electronic ballast (Figure 3.6) was invented in the 1970s. Besides adjusting the supplied voltage, electronic ballasts also changed the frequency from 60 Hz to 20,000 Hz or higher, which substantially eliminates flicker while also providing higher system efficiency. By supplying the lighting system with DC, we eliminate another power conversion and improve data center efficiency even further.
3.11.3 DC Storage Options
The fact that electrical energy is most easily stored in batteries as DC is the primary reason power conversions are necessary when AC distribution is used in the data center. The two types of batteries most commonly used in conjunction with a UPS are lead-acid wet cell (flooded) and valve-regulated lead-acid (VRLA). Another storage technology that is gaining popularity is the flywheel. Conventional batteries and flywheels are covered in more detail in Chapter 10. Also worth mentioning here are some newer storage technologies: zinc bromide flow batteries and a megawatt-class battery that uses a sodium-sulfur electrolyte, both of which are, by nature, DC sources. Both of these batteries have the potential to be used in grid-storage-class applications and may undergo many deep discharges without suffering ill effects.
Figure 3.6. Electronic ballast. (Courtesy of Antron Electronics Co., Ltd.)
As of this writing, sodium sulfur batteries have already been installed at several locations globally, including a wind farm in Japan and a bus depot in Garden City, NY. Zinc bromide batteries have yet to prove themselves in anything more than a handful of demonstration installations, but the technology seems promising for the storage of electricity from intermittent sources and load-leveling applications, similar to sodium sulfur batteries.
3.11.4 Renewable Energy Integration
Solar power is the most common renewable resource that can be used for the on-site generation of electrical power for a data center. Since photovoltaic (PV) arrays produce DC electricity, they are easily integrated with a DC distribution system. Only a voltage regulator is required, or a charge controller if the PV array is used to charge battery storage. Wind power is another option. Inverter-based wind turbines already produce DC power. The power conversion that would normally be needed to convert the DC to AC is not required, eliminating the inverter and any need for synchronization.
3.11.5 DC and Combined Cooling, Heat, and Power
Combined cooling, heat, and power (CCHP), also known as cogeneration, is an ideal strategy for improving data center energy efficiency. This is accomplished by using some form of power generation equipment that also produces thermal energy as a byproduct. Since data centers typically require 24/7 cooling, the recovered thermal energy can be used to activate absorption chillers that turn the waste heat into free cooling (Figure 3.7). There are two types of power generating equipment that can be easily integrated with a DC distribution system. One is a fuel cell. The fuel cell is an electrochemical device that produces DC electricity, much like a battery (Figure 3.8). Only a voltage regulator is needed to match the DC voltage of the fuel cell to the DC distribution voltage. Since the chemical reaction within the fuel cell produces heat, this thermal energy can be recovered and used to drive a water-based absorption chiller. The microturbine is another prime mover that can produce DC power (Figure 3.9). Microturbines operate on the Brayton cycle to rotate a small permanent-magnet alternator at very high rpm to generate high-frequency AC, which is rectified to DC. A voltage regulator can be used to match this DC to the DC distribution voltage. The turbine exhaust is run through a heat recovery absorption chiller to make chilled water.
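The high overall efficiencies claimed for CCHP follow from crediting both the electrical output and the recovered heat against the fuel input. A minimal sketch of that accounting, assuming the 50% electrical efficiency quoted earlier for a fuel cell; the fuel input, heat-recovery fraction, and single-effect absorption-chiller COP are illustrative assumptions:

```python
def cchp_utilization(fuel_kw: float, electric_kw: float,
                     recovered_heat_kw: float) -> float:
    """Fraction of fuel energy put to use when both the electrical
    output and the recovered thermal energy are credited."""
    return (electric_kw + recovered_heat_kw) / fuel_kw

FUEL_KW = 400.0            # assumed fuel energy input to the fuel cell
electric = FUEL_KW * 0.50  # 50% electrical efficiency, per the text
heat = FUEL_KW * 0.35      # assume 35% of fuel energy recovered as heat

util = cchp_utilization(FUEL_KW, electric, heat)
cooling = heat * 0.7       # assumed single-effect absorption chiller COP ~0.7
print(f"overall fuel utilization {util:.0%}, "
      f"cooling delivered {cooling:.0f} kW")
```

With these assumed numbers, utilization reaches the mid-80-percent range cited by UTC Power, while the recovered heat displaces electrically driven chiller capacity the data center would otherwise have to run around the clock.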
3.11.6 Current State of the Art
In general, operating costs for DC distribution are lower than AC distribution. However, since DC distribution is a relatively new concept, early adopters will no doubt be faced with higher installation costs, even though less equipment is needed, because specialized DC equipment must be developed. Product development is already underway. In the case of servers and storage hardware, some manufacturers may not offer
Figure 3.7. Absorption chiller. (Courtesy of Yazaki Energy Systems, Inc.)
DC solutions, but OEMs are beginning to prepare offerings with DC input directly to servers and storage systems. Delta Electronics has announced the first commercial offering of 380 V DC fans, servers, UPS units, and racks with DC distribution power supplies. Commercial proof-of-concept demonstrations have been installed in Sweden, Japan, and California. Direct Power Technologies, Inc. (DPTI) is working with Satcon
Figure 3.8. Typical fuel cell. (Courtesy UTC Power.)
Figure 3.9. Microturbine CCHP system. (Courtesy UTC Power. The pictured system is the 360M, which has six microturbines. There are also five- and six-microturbine versions.)
Power Systems to develop high-efficiency rectifiers and DC-DC converters specifically for DC system architecture. Satcon is currently providing power conversion products for the Navy's new DD(X) destroyer program for a DC-based integrated power system. DPTI is also working with EPRI to develop high-efficiency rectifiers and DC-DC converters specifically for DC system architecture. DPTI is working with Anderson Power to develop a 380/400 V DC connector and also with multiple companies to develop a 380/400 V DC plug strip.
3.11.7 Safety Issues
One of the first steps in accomplishing the paradigm shift to DC is developing some consensus as to what DC voltage level is best for distribution, given the need to keep current levels and conductor ampacity within reason. The schools of thought range from 48 to 550 V DC. The voltage levels under consideration fall under the National Electrical Code, ANSI/NFPA 70, which defines low voltage as any voltage up to 600 V. The grounding of DC systems is also covered in the NEC, in Sections 250-3 and 250-22. Once the preferred voltage level is established, more specific codes and standards will have to be developed. Codes and standards development is in progress overseas: the European standard ETSI EN 300 132-3 V1.2.1 (2003-08) covers DC systems up to 400 V.
3.11.8 Maintenance
One of the most valuable data center maintenance tools is the power quality monitor. Whereas the monitoring of AC power quality has been done for many years, doing the same thing on DC distribution systems will require different equipment. Fortunately, there are already some instruments available (Figure 3.10) that are up to the task, like the PSL PQube® AC/DC Power Monitor and the Dranetz-BMI Encore Series Model 61000.
3.11.9 Education and Training
Initially, there may be some difficulty in locating personnel with the appropriate experience to install and maintain these DC systems. With some aggressive training programs, existing personnel will develop the necessary skills. This is the same training and education that is currently underway in the photovoltaic industry, where electrical distribution systems rated up to 600 V DC are key ingredients. It should also be noted that there will remain a need for AC expertise, since data center HVAC systems will no doubt continue to be powered by AC, and, consequently, still need the conventional AC distribution.
3.11.10 Future Vision
Looking to the future, global work on fusion reactors has demonstrated that future power generation should be DC. For long-distance transmission, high-voltage DC systems are less expensive and suffer lower electrical losses. High-temperature superconductors promise to revolutionize power distribution by providing near-lossless transmission of electrical power. The development of superconductors with transition temperatures higher than the boiling point of liquid nitrogen has made the concept of superconducting power lines commercially feasible, at least for high-load applications. It has been estimated that losses would be halved using this method, since the necessary refrigeration equipment would consume about half the power saved by the elimination of the majority of resistive losses. Some companies, such as Consolidated Edison and American Superconductor in the United States, have already begun commercial production of such systems. Many people are unaware that superconductors are already being used commercially to construct the ultrastrong magnets used in magnetic resonance imaging (MRI) scanners.

All the advantages and benefits of the DC alternative will have to be proven over time. The industry must continue the dialogue and take steps to ensure that the concept is thoroughly vetted. Small-scale demonstrations are a good first step. Pentadyne, a leader in flywheel energy storage, assembled the first bench test at their Chatsworth, CA headquarters in 2005. An industry group sponsored by the California Energy Commission through Lawrence Berkeley National Labs and headed by EPRI Solutions and Ecos Consulting is planning more demonstrations of data center applications. Commercial proof-of-concept demonstrations are also being conducted in Sweden and Japan. Although retrofitting existing facilities for DC may not be cost-effective in the near term, the DC alternative should be thoroughly investigated when designing a new data center. The increase in efficiency and the associated reduction in operating costs will be the primary driving force behind a DC revolution. And the fact that DC promises higher tier ratings with lower capital costs makes improved reliability another big attraction. As with any change that upsets the status quo, the conversion to DC will no doubt meet with opposition, but the benefits seem too compelling to pass up.

Figure 3.10. DC monitoring equipment.
3.12 CONTAINERIZED SYSTEMS OVERVIEW
When you hear the term containerized systems, modular data centers are probably the first technology to come to mind. Initially, this technology was aimed primarily at disaster recovery operations. However, with the increasing challenges of high-density computing, including cooling, power distribution, and the continual expansion of facilities to meet customer needs, containerized systems and modular infrastructure have emerged as an innovative solution to these problems. Several vendors offer fully containerized data centers housed in trailer-like enclosures (Figure 3.11), marketed both as a rapid expansion solution and, for some companies, as a complete data center architecture. The units are fitted with specially designed racks and chilled-water cooling units. Electrical power, chilled water, and network connections are all that is required to commission a complete data center. These modules can be connected to a centralized power/cooling plant, or each can be individually equipped with its own dedicated power and cooling systems. This is proving to be a quintessential plug-and-play solution because it allows companies to accurately forecast their computing needs and efficiently expand their data centers to meet them. Energy efficiency is reported to be higher than in traditional data centers, due in part to the compact, high-density design. As a result, companies such as Google and Microsoft are finding this technology to be invaluable. It is important to note that modular systems can include an array of technologies not solely restricted to a shipping container. They may incorporate modularized infrastructure such as prefabricated UPS rooms and chiller and generator containers. The benefit
of these prefabricated solutions is that their construction quality is higher and more tightly controlled than on a chaotic construction site, thereby ensuring more reliable equipment. Additional benefits include easy integration with existing sites, reduced construction costs, and portability and flexibility of relocation.

Figure 3.11. Open rear door of a containerized data center. (Courtesy Sun Corporation.)
3.13 CONCLUSION
Designing or managing a mission critical facility is not a role to be taken lightly. As technology continues to develop and advance, critical data centers, hospitals, disaster recovery sites, and other mission critical facilities are only going to become more complex. I have included some questions in this section of the chapter to help you assess the status of your facility. Take care when answering these questions and use them as a guide to improving your operations.
Installation

1. Is your organization sure that there is diversity in the labor pool of the primary and backup sites, such that a wide-scale event would not simultaneously affect the labor pool of both sites?
2. Do you routinely use or test recovery and resumption arrangements?
3. Are you familiar with National Fire Protection Association (NFPA) 1600, Standard on Disaster/Emergency Management and Business Continuity Programs, which provides a standardized basis for disaster/emergency management planning and business continuity programs in private and public sectors by providing common program elements, techniques, and processes?
4. Has the owner, working with an engineering professional, developed a design intent document to clearly identify quantifiable requirements?
5. Have you prepared a basis of design document that memorializes, in a narrative form, the project intent, future expansion options, types of infrastructure systems to be utilized, applicable codes and standards to be followed, design assumptions, and project team decisions and understandings?
6. Will you be provided the opportunity to update the basis of design document to reflect changes made during the construction and commissioning process?
7. Are the criteria for testing all systems and outlines of the commissioning process identified and incorporated into the design documents?
8. Have you identified a qualified engineering and design firm to conduct a peer project-design review?
9. Have you considered directly hiring the commissioning agent to provide true independence?
10. Have you discussed and agreed to a division of responsibilities between the construction manager and the commissioning agent?
11. Do you plan to hire the ultimate operating staff ahead of the actual turnover to operations so they will benefit from participation in the design, construction, and commissioning of the facility?
12. Have you made a decision on the commissioning agent early enough in the process to allow participation and input on commissioning issues and design review by the selected agent?
13. Is the proper level of fire protection in place?
14. Is the equipment (UPS or generator) being placed in a location prone to flooding or other water damage?
15. Do the generator day tanks or underground fuel tanks meet local environmental rules?
16. Does the battery room have proper ventilation?
17. Has adequate cooling or heating been specified for the UPS, switchgear, or generator room?
18. Are the heating and cooling for the mechanical rooms on the power protection system?
19. Have local noise ordinances been reviewed, and does all the equipment comply with them?
20. Are the posting and enforcement of no-smoking bans adequate, specifying, for example, no smoking within 100 feet?
21. Are water detection devices used to alert building management of flooding issues?
Procurement

22. Is there a benefit to using an existing vendor or supplier for standardization of process, common spare parts, or confidence in service response?
23. Have the commissioning, factory, and site-testing requirement specifications been included in the bid documentation?
24. If the project is bid, have you conducted a technical compliance review that identifies exceptions, alternatives, substitutions, or noncompliance to the specifications?
25. Are the procurement team members versed in the technical nuances and terminology of the job?
26. If delivery time is critical to the project, have you considered adding late-penalty clauses to the installation or equipment contracts?
27. Have you included a bonus for early completion of the project?
28. Have you obtained unit rates for potential change orders?
29. Have you obtained a GMP (guaranteed maximum price) from contractors?
30. Have you discussed preferential pricing discounts that may be available if your institution or your engineer and contractors have other similar large purchases occurring?
Construction

31. Do you intend to create and maintain a list of observations and concerns that will serve as a checklist during the acceptance process to ensure that these items are not overlooked?
32. Will members of the design, construction, commissioning agent, and operations teams attend the factory-acceptance tests for major components and systems such as UPS, generators, batteries, switchgear, and chillers?
33. During the construction phase, do you expect to develop and circulate for comment the start-up plans, documentation formats, and prefunctional checklists that will be used during start-up and acceptance testing?
34. Since interaction between the construction manager and the commissioning agent is key, will you encourage attendance at the weekly construction-status meetings by the commissioning team?
35. Will an independent commissioning and acceptance meeting be run by the commissioning agent, ensuring that everything needed for that process is on target?
36. Will you encourage the construction, commissioning, and operations staff to walk the job site regularly to identify access and maintainability issues?
37. If the job site is an operating critical site, do you have a risk assessment and change-control mechanism in place to ensure reliability?
38. Have you established a process to have independent verification that labeling on equipment and power circuits is correct?
Commissioning and Acceptance

39. Do testing data result sheets identify expected acceptable result ranges?
40. Are control sequences, checklists, and procedures written in plain language, not technical jargon that is easily misunderstood?
41. Have all instrumentation, test equipment, actuators, and sensing devices been checked and calibrated?
42. Is system acceptance testing scheduled after mechanical system balancing and electrical cable and breaker testing are complete?
43. Have you listed the systems and components to be commissioned?
44. Has a detailed script sequencing all activities been developed?
45. Are all participants aware of their responsibilities and the protocols to be followed?
46. Does a team directory with all contact information exist, and is it available to all involved parties?
47. Have you planned an "all hands on deck" meeting to walk through and finalize the commissioning schedule and scripted activities?
48. Have the format and content of the final report been determined in advance to ensure that all needed data is recorded and activities are scheduled?
49. Have you arranged for the future facility operations staff to witness and participate in the commissioning and testing efforts?
50. Who is responsible for ensuring that all appropriate safety methods and procedures are deployed during the testing process?
51. Is there a process in place that ensures that training records are maintained and updated?
52. Who is coordinating training and ensuring that all prescribed training takes place?
53. Will you videotape training sessions to capture key points and for use as refresher training?
54. Is the training you provide both general systems training and training specifically targeted to the types of infrastructure within the facility?
55. Have all vendors performed component-level verification and completed prefunctional checklists prior to system-level testing?
56. Has all system-level acceptance testing been completed prior to commencing the full system integration testing and "pull the plug" power failure scenario?
57. Is a process developed to capture all changes made and to ensure that these changes are captured on the appropriate as-built drawings, procedures, and design documents?
58. Do you plan to reperform acceptance testing if a failure or anomalies occur during commissioning and testing?
59. Who will maintain the running punch list of incomplete items and track resolution status?
3.13 CONCLUSION
Transition to Operations

60. Have you established specific operations planning meetings to discuss the logistics of transferring newly constructed systems to the facility operations staff?
61. Is all as-built documentation, such as drawings, specifications, and technical manuals, complete, and has it been turned over to the operations staff?
62. Have position descriptions been prepared that clearly define the roles and responsibilities of the facility staff?
63. Are standard operating procedures (SOPs), emergency action procedures (EAPs), updated policies, and change-control processes in place to govern the newly installed systems?
64. Has the facility operations staff been provided with warranty, maintenance, repair, and supplier-contact information?
65. Have spare parts lists, post-commissioning setpoint schedules, the testing, adjusting, and balancing (TAB) report, and recommissioning manuals been given to the operations staff?
66. Are the warranty start and expiration dates identified?
67. Have maintenance and repair contracts been executed and put into place for the equipment?
68. Have minimum response times for service, distances to travel, and emergency 24/7 spare stock locations been identified?
Security Considerations

69. Have you addressed physical security concerns?
70. Have all infrastructures been evaluated for the type of security protection needed (e.g., card control, camera recording, and key control)?
71. Are the diesel oil tank and oil fill pipe in a secure location?
72. If remote dial-in or Internet access is provided to any infrastructure system, have you safeguarded it against hacking, or do you permit read-only functionality?
73. How frequently do you review and update access-permission authorization lists?
74. Are critical locations included in security inspection rounds?
Documentation

75. What emergency plans, if any, exist for the facility?
76. Where are emergency plans documented (including the relevant internal and external contacts for taking action)?
77. How are contacts reached in the event of an emergency?
78. How are plans audited and changed over time?
79. Do you have complete drawings, documentation, and technical specifications of your mission critical infrastructure, including electrical utility, in-facility electrical systems (including power distribution and ATS), gas and steam utilities, UPS/battery/generator, HVAC, security, and fire suppression?
80. What documentation, if any, exists to describe the layout, design, and equipment used in these systems?
81. How many forms does this documentation require?
82. How is the documentation stored?
83. Who has access to this documentation, and how do you control access?
84. How often does the infrastructure change?
85. Who is responsible for documenting change?
86. How is the information audited?
87. Can usage of facility documentation be audited and tracked?
88. Do you keep a historical record of changes to documentation?
89. Is a formal technical training program in place?
90. Is there a process in place that ensures personnel have proper instrumentation and that the instrumentation is calibrated as recommended?
91. Are accidents and near-miss incidents documented?
92. Is there a process in place that ensures action will be taken to update procedures following accidents or near-miss events?
93. How much space does your physical documentation occupy today?
94. How quickly can you access the existing documentation?
95. If a consultant is used to make changes to documentation, how are consultant deliverables tracked?
96. Is your organization able to quickly provide information to legal authorities, including emergency response staff (e.g., fire and police)?
97. How are designs or other configuration changes to infrastructure approved or disapproved?
98. How are these approvals communicated to responsible staff?
99. Does workflow documentation exist for answering staff questions about what to do at each stage of documentation development?
100. In the case of multiple facilities, how is documentation from one facility transferred or made available to another?
101. What kind of reporting on facility infrastructure is required for management?
102. What kind of financial reporting is required in terms of facility infrastructure assets?
103. How are costs tracked for facility infrastructure assets?
104. Is facility infrastructure documentation duplicated in multiple locations for restoration in the event of loss?
105. How much time would it take to replace the documentation in the event of loss?
106. How do you track space utilization (including cable management) within the facility?
107. Do you use any change management methodology (e.g., ITIL) in the day-to-day configuration management of the facility?
Staff and Training

108. How many operations and maintenance staff do you have within the building?
109. How many of these staff do you consider to be the facility's "subject matter experts"?
110. How many staff members manage the operations of the building?
111. Are specific staff members responsible for specific portions of the building infrastructure?
112. How long has each of your operations and maintenance staff, on average, been in his or her position?
113. What kind of ongoing training, if any, do you provide for your operations and maintenance staff?
114. Do training records exist?
115. Is there a process in place to ensure that training records are maintained and updated?
116. Is there a process in place that identifies arrangements for training?
117. Is there a process in place that ensures the training program is periodically reviewed and that identifies changes required?
118. Is the training you provide general training, or is it specific to an area of infrastructure within the facility?
119. How do you design changes to your facility systems?
120. Do you handle documentation management with separate staff, or do you consider it to be the responsibility of the staff making the change?
4 MISSION CRITICAL ELECTRICAL SYSTEM MAINTENANCE AND SAFETY
4.1 INTRODUCTION
Before a corporation or institution embarks on a design to upgrade an existing building or erect a new one, key personnel must work together to evaluate how much protection is required. Although the up-front costs for protection can be somewhat startling, providing inadequate protection for mission critical systems can lead to millions in lost revenue, risks to life and safety, and perhaps a tarnished corporate reputation. The challenge is finding the right level of protection for your business requirements.

However, investing a significant amount of time and money in systems design and equipment to safeguard a building from failure is just the beginning. An effective maintenance and testing program for your mission critical systems is key to protecting the investment. Maintenance procedures and schedules must be developed, staff properly trained, spare parts provisioned, and mission critical electrical equipment performance tested and evaluated regularly. Predictive maintenance, preventive maintenance, and reliability centered maintenance (RCM) programs all play an important role in the reliability of electrical distribution systems.

How often should electrical maintenance be performed? Every 6 months, or every 6 years? Part of the answer lies in what level of reliability your company can live with. More accurately, what are your company's expectations in terms of risk tolerance or
goals with regard to uptime? As previously discussed, if your company can live with 99% reliability, or 87.6 hours of downtime per year, then the answer would be to run a maintenance program every 3 to 5 years. However, if 99.999% reliability, or 5.25 minutes of downtime per year, is mandatory, then you need to perform an aggressive preventive maintenance program every 6 months. The cost of this hard-line maintenance program could range between $250 and $500 annually per kilowatt (kW), not including the staff to manage the program. The human resource cost will vary depending on the location and complexity of your facility.

There are several excellent resources available for developing the basis of an electrical testing and maintenance program. One is the InterNational Electrical Testing Association (NETA), which publishes the Maintenance Testing Specifications, recommending frequencies for maintenance tests. The recommendations are based on equipment condition and reliability requirements. Another is the National Fire Protection Association's (NFPA) 70B, Recommended Practice for Electrical Equipment Maintenance. These publications, along with manufacturers' recommendations, give guidance on the testing and maintenance tasks and schedules to incorporate in your maintenance program.

Over the last ten years, there have been significant changes in the design and application of infrastructure support systems for mission critical facilities. This has been driven by the desire for higher reliability levels and fueled by continuing technological innovation. The resulting system configurations have yielded equipment that is more fault tolerant. Herein lies an opportunity. The customary response to this increase in equipment has been to increase preventive maintenance tasks and schedules. However, since these systems are more fault tolerant, the effect(s) of a failure may be insignificant. Consequently, different approaches to maintenance can be taken.
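The downtime figures quoted above follow directly from the availability percentage. A minimal sketch of the arithmetic (the function name and the extra availability levels are our own additions, not the book's):

```python
# Convert an availability target into allowable annual downtime.
# Reproduces the figures in the text: 99% gives 87.6 h/yr, and
# 99.999% gives roughly 5.25 minutes per year.

HOURS_PER_YEAR = 8760  # 365 days x 24 hours

def annual_downtime_hours(availability: float) -> float:
    """Allowable downtime (hours per year) for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR

for label, a in [("99%", 0.99), ("99.9%", 0.999),
                 ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    hours = annual_downtime_hours(a)
    print(f"{label:>8}: {hours:8.2f} h/yr ({hours * 60:.1f} min)")
```

Each additional "nine" of availability cuts the downtime budget by a factor of ten, which is why maintenance intervals and cost climb so steeply between the two cases in the text.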
In addition, a significant number of mission critical failures are directly attributable to human error. Therefore, reducing human intervention during various maintenance tasks can have a substantial impact on improving availability.

Reliability centered maintenance (RCM) was developed in the 1960s by the aircraft industry when it recognized that the cost of aircraft maintenance was becoming prohibitive while reliability results remained well below acceptable levels. Since then, RCM strategies have been developed and adapted by many industries. Traditionally, the goal of a maintenance program has been to reduce and avoid equipment failures. Facilities management goes to great lengths to prevent the failure of a device or system, regardless of the failure's consequence for the facility's mission. RCM shifts the focus from failure avoidance to understanding and mitigating the failure's effect upon the process it protects. This is a major shift from a calendar-based program. For many facilities, the net effect will be a change from the present labor-intensive process driven by the operating group, as well as by the equipment suppliers, to a more selective method wherein the understanding of the system responses to component failures plays a critical role. Although the benefits are proven, the process needs to be supported by thorough analysis.

Reliability centered maintenance analyzes how each component and/or system can functionally fail. The effects of each failure are analyzed and ranked according to their impact on safety, environment, business mission, and cost. Failures that are
deemed to have a significant impact are further analyzed to determine the root causes. Finally, preventative or predictive maintenance is assigned based on findings of the analysis, with emphasis on condition-based procedures to help ensure the optimal performance level.
4.2 THE HISTORY OF THE MAINTENANCE SUPERVISOR AND THE EVOLUTION OF THE MISSION CRITICAL FACILITIES ENGINEER

Managing a facilities engineering department in a corporation or institution has changed significantly in the past three decades. Thirty years ago, the most important qualification for a facilities manager was exposure to different trades and hands-on experience gained from working up through the ranks of maintenance, beginning at the lowest level and progressing through a craft by way of on-the-job training or various apprentice programs. Once this craftsperson had mastered all the trades and gained knowledge of the facility from on-the-job experience, upper management would promote him to maintenance supervisor.

The maintenance supervisor was an important player in getting equipment and the facility up and running quickly, in spite of day-to-day problems, procedures, and policies. This was the key to advancement. That supervisor was regarded as someone who would do whatever it took to get down equipment operating quickly, whatever the situation. Operations managers would expect the maintenance supervisor to take $200 out of his own pocket in order to buy that critical part. Coming into the facility at all hours of the night, on weekends, and on holidays further projected the image that this maintenance supervisor was giving his all for the company.

After many years of performing at this level of responsibility, the maintenance supervisor advanced to a facilities management position with responsibility for maintenance supervisors, craftspeople, budgeting, and overall operations. This promotion was pivotal in his career. Because he had worked hard and made many sacrifices, he reached the highest levels in the facilities engineering industry. The facilities manager continued to get involved with the details of how equipment is repaired and facilities are maintained.
He also continued to ensure that he was called at any hour of the day or night to personally address problems in the facilities. The secret to his success was reacting to problems and resolving them immediately.

Many facilities managers have tried to shift from reacting to problems to anticipating and planning for them, but fall short of achieving a true 24/7 mission critical operations department. The most advanced technologies and programs are researched and deployed, and personnel are assigned to utilize them. Yet, with all of these proactive approaches, the majority of facilities organizations fall short of adequate levels of functionality, in some cases due to limited human or monetary resources. The dilemma is that these tools overwhelm many facilities managers; day-to-day management of the facilities becomes a Sisyphean task. Today, the facilities manager continues with the tenacious effort and determination to resolve problems, conflicts, and issues. Additionally, the facilities manager must
know, understand, and implement the programs, policies, and procedures dictated by state, local, and federal agencies. He must also support the corporate governance of his employer. Today's manager must formulate capital and operational budgets and report departmental activities to senior management. He must stay aware of safety and environmental issues, all while his role grows ever more critical. To be successful when running mission critical 24/7 operations, a proactive approach (not a reactive one) is imperative. The days of brute force and reacting to problems are gone. The real key to success is planning, anticipation, and consistency. Today's facility manager is a manager in every sense of the word.

Most facilities organizations appear to run smoothly on the surface: critical equipment is up and running; craftspeople have more projects than ever; and facilities managers and supervisors are juggling administrative, engineering, technical, and operational duties. Yet, when we look beyond the façade, we can see that the fundamentals of maintenance engineering are changing along with developments in technology. The drive toward state-of-the-art technologies, practices, and programs must not blind facilities management to the need to ensure that the building blocks of good maintenance are applied. Technologies such as database-driven preventive maintenance programs are another tool in the facility manager's toolbox. Like any other tool, they are most effective when used in the manner in which they were designed to be used.
In order to keep up and build a successful mission critical facilities operations department, the facilities manager must develop and sustain:

• A clear mission, vision, or strategy for the facilities organization
• Narrative descriptions of the design intent and basis-of-design documents
• Written control sequences of operation that clearly explain to anyone how the facility is supposed to operate
• Standard and emergency operating procedures
• Detailed, up-to-date drawings
• Ongoing employee skills assessment and training programs
• Effective document management systems
• Maintenance planning and scheduling programs
• Maintenance inventory control
• Competent consultants in all engineering disciplines
• The most advanced communications, hardware, and software
• Adequate capital and operating budgets
• Senior management support

The facility maintenance manager must keep his team focused on its primary responsibility: maintenance of the mission critical electrical distribution system. Apart from reacting to equipment failures when they do occur and getting the facility back in operation, preventive maintenance must be the department's top priority. Many maintenance electricians would much prefer to be working on the project-related tasks of installing new systems and equipment. Such project tasks should not take priority over preventive maintenance; equipment and system downtime is a direct result of a loss of priority in maintenance responsibilities.

With the increasing complexity of mission critical facilities and the technologies involved, the emphasis has shifted away from the resourceful craftsman who made repairs on the fly. The focus has transitioned to the mission critical facilities engineer charged with ensuring 24/7 operation, where repairs and preventive maintenance take place through thoughtful design and operation of the facility. Indeed, the fundamentals of maintenance engineering and the role of the mission critical facilities engineer continue to evolve and grow in importance with each new technology applied.
4.3 INTERNAL BUILDING DEFICIENCIES AND ANALYSIS
In addition to having a robust UPS system, building design must eliminate the probability of single points of failure occurring simultaneously in several locations throughout the electrical distribution system. Mission critical facilities engineers have exclusive accountability for identifying and eliminating problems that can lead to points of failure in the power distribution infrastructure. Failures can occur anywhere in an electrical distribution system: utility service entrance, main circuit breakers, transformers, disconnect switches, standby generators, automatic transfer switches, branch circuits, and receptacles, to name a few. A reliability block diagram must be constructed and analyzed to quantify these single points of failure.

In a mission critical power distribution system, there are two basic architectures for power distribution to critical loads:

1. Electrical power from the UPS system is connected directly through transformers, or redundantly from two sources through static transfer switches, to remotely mounted circuit breaker panels with branch circuits running to the critical loads.

2. Power from the UPS system is fed to power distribution units (PDUs), which house transformers and panelboards, with branch circuits running to the critical loads.

Typically, in data or operations centers with raised floors, liquid-tight flexible conduit runs the branch circuits from the PDUs to the critical loads. This allows for easy removal, replacement, and relocation of the circuit when computer equipment is upgraded or changed. In an office application, the typical setup involves a wall-mounted panelboard hard wired to a dedicated receptacle feeding computer equipment. Where the UPS system output voltage is 480 volts, transformers in the PDUs are required to step the voltage down to 120/208 volts.

Electrical loads equipped with two power cords are common.
This equipment will accept two sources of power and may either operate on a single cord at a time or split the load between the two cords. In the event that one power source fails, the load draws on the second source and the application remains online. This is a tremendous step toward computer equipment that is tolerant of power quality problems or internal electrical interruptions. However, it does challenge the facility to provide two different sources of power to the two cords on each piece of equipment. This can mean the complete duplication of the electrical distribution system from the service entrance to the load, and the elimination of scheduled downtime for maintenance. Two utility switchboards, along with redundant utility substations, power dual UPS systems that feed dual PDUs. This system topology constitutes what is known as a 2N system. Although dual-corded loads may be serviced by an N or N + 1 system, far greater reliability is realized when power is brought to the load from separate substations.
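The reliability-block-diagram arithmetic behind the N-versus-2N comparison can be sketched briefly. The per-component availabilities below are illustrative assumptions, not data from any particular facility; the point is the structure of the calculation, in which series components multiply availabilities and redundant paths multiply unavailabilities:

```python
def series(*availabilities):
    """Availability of components in series: every component must be up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

def parallel(a1, a2):
    """Availability of two redundant paths: at least one must be up."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)

# One delivery path: service entrance -> switchboard -> UPS -> PDU
# (assumed per-component availabilities, for illustration only)
path = series(0.9999, 0.99995, 0.9995, 0.9999)

single = path                # N topology: the path itself is a single point of failure
dual = parallel(path, path)  # 2N topology: duplicated path to a dual-corded load

print(f"N  availability: {single:.6f}")
print(f"2N availability: {dual:.9f}")
```

Even with these rough numbers, duplicating the path squares the unavailability, which is why a 2N topology to dual-corded loads dominates any improvement achievable by maintaining a single path harder.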
4.4 EVALUATING YOUR SYSTEM
Read the operation and maintenance manuals for all electrical distribution equipment and you will likely find a recommendation for preventive maintenance. However, can you justify preventive maintenance service for your mission critical equipment? Is comprehensive annual maintenance cost-effective for the user? Should all of a mission critical facility's power distribution equipment be serviced on the same schedule? If not, what factors should be considered in establishing a realistic and effective preventive maintenance program for a given facility?

To answer these questions, first evaluate the impact to your business if there were an electrical equipment failure. Then examine the environment in which the electrical equipment operates. Electrical distribution equipment is most reliable when kept dust-free, dry, cool, clean, tight, and, where appropriate, lubricated. Electrical switchgear, for example, if kept in an air-conditioned space, regardless of maintenance practices, will be considerably more reliable than identical equipment installed in the middle of a hot, dirty, and dusty operating floor. In such uncontrolled environmental locations, it would be difficult to pick an appropriate maintenance interval.

The amount of load on the electrical equipment is also critical to its life. Loading governs how much internal heat is produced, which will expedite the aging process of nearly all insulating materials and dry out lubricants in devices such as circuit breakers. High current levels can lead to overheating at cable connectors, circuit breaker lugs, interconnections between bus and draw-out devices, contact surfaces, and so on. An important exception is equipment located outdoors or in a humid environment. In those instances (especially prevalent in 15 kV-class equipment), if the electrical load is too light, the heat created by the flow of current will be insufficient to prevent condensation on insulating surfaces.
If the proper number and size of equipment heaters are installed and operating in the 15 kV-class equipment, condensation is less of a concern.

Operating voltage levels should also influence the maintenance interval. Tracking across contaminated insulating surfaces can proceed much more rapidly on 15 kV-class equipment than at 480 V. Also, higher voltage equipment is often more critical to the successful operation of a larger portion of the building or plant; therefore, more frequent preventive maintenance can be justified.

Consider the age of electrical equipment. It is important to perform preventive maintenance on older equipment more frequently. Test new equipment before placing it in service, then implement a preventive maintenance schedule based on manufacturers' and experienced consultants' recommendations. Certified testing will identify defects that can cause catastrophic failures or other problems. Additionally, it will document problems and repair/replacement, as well as the need for additional testing while still under warranty.

Another consideration is the type and quality of equipment initially installed. Was lowest first cost a primary consideration? If the engineer's design specified high-quality, conservatively rated equipment and it was installed correctly, you are at a good preventive maintenance starting point. It will be less difficult and less expensive to achieve continued reliable operation from a properly designed project. The benefits of quality design and well-thought-out specifications will aid preventive maintenance, because higher quality equipment typically has the features that enable service.

On the other hand, do not allow simple minor service checks to become the building standard. It is easy to test only the protective relays (because it can be accomplished during normal operation) and let important shutdown services slide. If minimal service becomes the norm, most of the benefits of effective preventive maintenance will not be realized. Therefore, demand regular, major preventive maintenance on your mission critical equipment. Then you can expect continued, reliable operation.
4.5 CHOOSING A MAINTENANCE APPROACH
NFPA 70B says, "Electrical equipment deterioration is normal, but equipment failure is not inevitable. As soon as new equipment is installed, a process of normal deterioration begins. Unchecked, the deterioration process can cause malfunction or an electrical failure. Effective maintenance programs lower costs and improve reliability. A maintenance program will not eliminate failures but can reduce failures to acceptable levels." There is little doubt that establishing an electrical maintenance program is of paramount importance. There are various approaches for establishing a maintenance program. In most cases, a program will include a blend of the strategies listed below:

• Preventive Maintenance (PM) is the completion of tasks performed on a defined schedule. The purpose of PM is to extend the life of equipment and detect wear as an indicator of pending failure. Maintenance procedures describing the tasks to be performed are fundamental to a PM program. They instruct the technician on what to do, what tools and equipment to use, what to look for, how to do it, and when to do it. Tasks can be created for routine maintenance items or for breakdown repairs.

• Predictive Maintenance uses instrumentation to detect the condition of equipment and identify pending failures (Figure 4.1). A predictive maintenance program uses these equipment condition indices for scheduling maintenance tasks. Successful predictive maintenance requires a higher level of instrumentation, monitoring, and analysis than does preventive maintenance.

• Reliability Centered Maintenance (RCM) is the analytical approach to optimize reliability and maintenance tasks with respect to the operational requirements of the business. Reliability, as it relates to business goals, is the fundamental objective of the RCM process. RCM is not equipment-centric, but business-centric. Reliability centered maintenance analyzes each system and how it can functionally fail. The effects of each failure are analyzed and ranked according to their impact on safety, mission, and cost. Those failures that are deemed to have a significant impact are further explored to determine the root causes. Finally, maintenance is assigned based on effectiveness, with a focus on condition-based tasks. RCM requires a very high level of information gathering and analysis, and will likely take several years to establish and significant resources to maintain.

Figure 4.1. Theory of predictive maintenance.
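The RCM ranking step described above is often carried out as an FMEA-style scoring exercise. The sketch below is a hypothetical illustration only: the failure modes, the 1-to-5 scoring scale, and the multiplicative significance index are our assumptions, not a prescription from the RCM literature. It does, however, show why a failure in a fault-tolerant subsystem (the N+1 cooling unit) ranks far below a failure with full mission impact:

```python
# Hypothetical failure modes, each scored 1 (low) to 5 (high) for its
# impact on safety, mission, and cost. All entries are illustrative.
failure_modes = [
    # (component, failure mode, safety, mission, cost)
    ("UPS battery string", "open cell ends battery autonomy", 2, 5, 3),
    ("ATS", "fails to transfer on utility outage", 3, 5, 4),
    ("CRAC unit (N+1)", "single-unit compressor failure", 1, 1, 2),
    ("Main switchgear", "breaker fails to clear a fault", 5, 5, 5),
]

def significance(mode):
    """Multiplicative significance index over the three impact scores."""
    _component, _failure, safety, mission, cost = mode
    return safety * mission * cost

# Rank failures so that root-cause analysis effort goes to the most
# significant ones first, as the RCM process prescribes.
ranked = sorted(failure_modes, key=significance, reverse=True)
for mode in ranked:
    print(f"{significance(mode):3d}  {mode[0]}: {mode[1]}")
```

The redundant cooling unit scores lowest precisely because its failure effect on the mission is insignificant, which is the opportunity for a more selective maintenance program noted earlier in the chapter.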
4.5.1 Annual Preventive Maintenance
The objective of annual preventive maintenance is to reach the optimum point at which the benefits of reliability are maximized while the cost of maintenance is minimized. Essentially, it relies upon the use of a checklist of tests and adjustments to "tune up" the apparatus or system. Through this process, a facility can develop an additional list of problem areas that need to be addressed immediately or within the coming months. It also allows facilities to develop documentation of persistent issues and keep a log of past issues.

Testing items such as transformer oil and medium-voltage (MV) cable on a regular basis can also be very helpful in indicating trends that can be used to diagnose equipment aging and deterioration before they result in failure. Transformer-oil gas content, for example, can indicate a "hot" connection inside the transformer, corona discharge in the unit, or deterioration of the insulating capabilities. Several such conditions can be addressed, and the deterioration halted, before the unit fails and results in downtime for the mission critical facility.

In most cases, it is impossible to perform complete preventive maintenance on energized gear due to the requirements of NFPA 70E and the inherent limitations of personal protective equipment (PPE). Therefore, reliance on PPE alone should be restricted to limited situations, such as taking measurements, operating switches or breakers, locking out/tagging out gear, and drawing out or reracking breakers for maintenance. In order to provide a safe environment for maintenance personnel, system design must include "concurrent maintainability" (the flexibility that allows sections to be isolated and secured without interruption of critical loads or impact to system functionality). Unfortunately, many legacy systems do not contain this design feature and cannot be fully maintained. In this case, your options would include:

• Do not perform complete maintenance.
• Provide temporary hard-wired, wrap-around circuits.
• Place personnel in harm's way and hope for the best (a violation of OSHA requirements).

Another essential element of annual maintenance is to ensure that a formal electrical safety program is established at the facility and that it requires vendor/contractor compliance. Equipping and training personnel in comprehensive safety procedures and the use of PPE is important to both employee safety and equipment safety. Strict adherence to an electrical safety program, and maintenance of proper documentation, safeguards both workers and the company and is not to be considered optional under present regulations.
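The oil-sample trending described above lends itself to simple rate-of-change screening: rather than reacting only to a single absolute reading, flag any gas whose concentration jumps sharply between samples. The gas list, ppm readings, and the 20 ppm rise threshold below are all illustrative assumptions; a real program applies the limits published in dissolved-gas-analysis guides:

```python
# Chronological ppm readings from successive annual oil samples
# (illustrative values, not from any standard or real transformer).
samples = {
    "hydrogen":  [12, 14, 15, 48],
    "acetylene": [0, 0, 1, 1],
    "ethylene":  [5, 6, 7, 8],
}

RISE_THRESHOLD_PPM = 20  # assumed sample-to-sample rise worth investigating

def flagged_gases(history, threshold):
    """Return gases whose ppm rose by more than `threshold` between samples."""
    flags = []
    for gas, readings in history.items():
        jumps = [later - earlier for earlier, later in zip(readings, readings[1:])]
        if any(j > threshold for j in jumps):
            flags.append(gas)
    return flags

print(flagged_gases(samples, RISE_THRESHOLD_PPM))
```

Here the hydrogen jump from 15 to 48 ppm is flagged even though every absolute reading might individually look benign, which is exactly the trend-based early warning the text describes.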
4.6 SAFE ELECTRICAL MAINTENANCE

4.6.1 Standards and Regulations
Electrical construction hazards are plentiful. There are many types, including electrical shock, short circuits, arc flash, and trauma. Electrical standards and regulations are necessary to ensure a safe working environment. The associated codes and standards discussed include National Electric Code (NFPA 70), Standardfor Electrical Safety in the Workplace (NFPA 70E); Occupational Safety and Health Administration (OSHA); and IEEE Guide for Performing Arc-Flash Hazard Calculations (IEEE 1584). The following are critical to a successful, safe electrical maintenance program: 1. Prior to contemplating electrical maintenance, it is critical to have up-todate one-line diagrams. A facility's one-line diagram is a baseline document that indicates how power flows from source(s) to loads. Having a correct and up-to-date one-line diagram reduces the chance of inadvertent power transmission from an alternate source (back-feed) and unexpected downstream energized parts, and increases overall familiarity with the facility's electrical distribution system. These drawings should be accompanied by a current short-circuit calculation and protective-device coordination study. These will ensure that equipment is properly rated and that protective devices will have been programmed/set to trip in sequence to reduce mass outages due to misapplication and minimize the portion of the electrical distribution system that must be
interrupted in order to clear a fault. The requirements found in NFPA 70E can be helpful in securing funds to perform these needed engineering studies.
2. The National Electrical Code, which typically applies to new or renovation work, does not have specific requirements regarding when maintenance must be completed. However, the NEC does require that electrical equipment such as switchboards, panelboards, industrial control panels, meter socket enclosures, and motor control centers that are in other than dwelling occupancies, and that are likely to require examination, adjustment, servicing, or maintenance while energized, be field marked with a label to warn qualified persons of potential electric arc flash hazards. Keep in mind that this is a field-applied warning label only; no application- or site-specific information is required.
3. OSHA Regulation 29 CFR Subpart S, 1910.333 states that "Live parts to which an employee may be exposed shall be de-energized before the employee works on or near them...." It goes on to state that working on or near energized electrical equipment is permissible if "... the employer can demonstrate that deenergizing introduces additional or increased hazards or is infeasible due to equipment design or operational limitations." The determination of whether work may occur on energized equipment is thus based largely on one's definition of "infeasible." This is a discussion that must be documented among the individual decision makers regarding maintenance in your facility. This OSHA statement establishes a policy that energized electrical work is not typical or ordinary, and it requires that an energized electrical work permit be generated.
4. Once it is determined that work will occur on or near live parts to which an employee may be exposed, the investigation to determine whether a hazard is present, and what to do about it, is the responsibility of the employer of the electrical personnel or the employer's delegate.
Compliance with the OSHA regulations leads the investigator in part to NFPA 70E-2009, Standard for Electrical Safety in the Workplace, and IEEE 1584-2002, Guide for Performing Arc Flash Hazard Calculations, for the hazard calculation.
5. NFPA 70E contains many requirements for maintaining safe electrical working practices, including protection of workers from shock and traumatic injury.
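The value of the protective-device coordination study mentioned in item 1 can be illustrated with a simple selectivity check: at a given fault current, each upstream device should trip more slowly than the device below it by some margin. The sketch below is purely illustrative; the device names, trip times, and margin are invented for the example and are not drawn from any real study.

```python
# Illustrative sketch only: verify that protective devices trip in sequence
# (downstream fastest) at a given fault current. All device data are hypothetical.

def is_selective(devices, fault_ka, margin_s=0.1):
    """devices: list of (name, trip_time_fn) ordered downstream to upstream.
    Returns True if each upstream device clears at least margin_s slower
    than the device directly below it at the given fault current (kA)."""
    times = [fn(fault_ka) for _, fn in devices]
    return all(t_up - t_dn >= margin_s for t_dn, t_up in zip(times, times[1:]))

# Hypothetical clearing times (seconds) along one 10 kA fault path
branch = ("branch breaker", lambda ka: 0.05)
feeder = ("feeder breaker", lambda ka: 0.30)
main   = ("main breaker",   lambda ka: 0.60)

print(is_selective([branch, feeder, main], fault_ka=10))  # True: trips in sequence
```

In a real study, the trip-time functions would come from the manufacturers' time-current curves, and the engineer would check selectivity across the full range of available fault currents, not a single point.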
4.6.2 Electrical Safety: Arc Flash
Two important terms used when discussing electrical faults are bolted faults and arc faults. Traditionally, electrical equipment has been designed and tested to standards identifying its short-circuit rating, whether that is the equipment's ability to function as an overcurrent protective device that opens and clears a fault, or its ability simply to withstand or survive that amount of bolted fault current. Bolted faults are conditions that occur in a test lab and, on very rare occasions, in the field. A bolted fault, as the name implies, involves the solid connection of any combination of energized (ungrounded) electrical phases or of an electrical phase and ground. Most equipment currently does not carry ratings related to its ability to withstand or survive arc faults.
IEEE Guide for Testing Metal-Enclosed Switchgear Rated Up to 38 kV for Internal Arcing Faults (C37.20.7-2007) provides guidance for the design and selection of arc-resistant switchgear. Arc faults include the inadvertent connection of energized electrical phases (ungrounded) or an electrical phase and ground with the air acting as an electrical path. A bolted fault will typically result in the rapid operation of an upstream overcurrent-protective device, resulting in relatively little equipment damage. An arcing fault will cause lower fault current to flow as compared to a bolted fault, and result in longer clearing times of an upstream overcurrent-protective device. This is because the air, which would typically act as an insulator between phases or between phase and ground, is acting as a conductor. Arc faults may result in high-temperature radiation, molten metal, shrapnel, sound waves, pressure waves, metallic vapor, and rapid expansion of plasma (ionized air). Much of what we do while maintaining mission critical electrical infrastructure is intended to avoid unplanned downtime as a result of this problem, so the last thing we want is an arcing fault.
An arc flash is literally a ball of fire resulting from an electrical fault. Ralph Lee's technical paper entitled "The Other Electrical Hazard: Electric Arc Blast Burns"* provides the following graphic definition: "current passing through a vapor of the arc terminal conductive metal." Recognition of this hazard has increased emphasis on safety standards and changed the traditional approach to maintenance and repair. However, serious workplace injuries and fatalities from electrical arc flash incidents continue to occur each year. Here are a few arc flash facts for consideration:
• An arc flash reaches temperatures of 35,000°F (the face of the sun is believed to be approximately 10,000°F).
• The material vaporized by an arc flash expands 70,000 times in volume.
• The pressure blast from an arc flash moves at 5000 to 6000 feet per second with a physical force of 500 to 600 pounds per square inch.
One mitigation strategy is to specify and install fault-tolerant switchgear. Manufacturers are racing to develop designs that incorporate a variety of solutions and features in order to create market differentiation and mitigate the danger of arc flashes. Designs may include guides or shutters designed to direct an arc up and out, safely away from personnel. New arc flash detectors trip upstream breakers or trigger downstream devices to create a bolted fault. Some breaker manufacturers are incorporating a service switch on circuit breakers that temporarily overrides normal trip settings so that, should an event occur during maintenance or testing, the subject breaker will trip as quickly as possible. Installation of special viewports enables IR scans to be done without opening covers and doors. However, the best viewport lens material is fragile and may not stand up well over time; a more robust material may not offer the most desirable transmission rate of infrared energy and can degrade over time. Viewports may also not allow for a thorough scan of all bus and cable joints or connections due to a limited field of view. This is especially true when bus or cables are "stacked" behind one another.
*Lee, R. H., "The Other Electrical Hazard: Electric Arc Blast Burns," IEEE Transactions on Industry Applications, IA-18, 246-261, 1982.
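NFPA 70E and IEEE 1584 provide quantitative methods for estimating the incident energy behind these facts. A minimal sketch of the simpler "Lee method" (derived from Ralph Lee's work) is shown below. The constant and unit conventions are stated here as assumptions to be verified against the current edition of the standard, which for equipment in enclosures uses a more detailed empirical model, not this open-air approximation.

```python
def lee_incident_energy(v_kv, i_bf_ka, t_s, d_mm):
    """Approximate open-air incident energy (cal/cm^2) per the Lee method
    as commonly stated in NFPA 70E / IEEE 1584: E = 5.12e5 * V * Ibf * t / D^2,
    with V in kV, Ibf in kA (bolted fault current), t in seconds (clearing
    time), and D in mm (working distance). Verify constant and units against
    the standard before any real use; this is an illustrative sketch only."""
    return 5.12e5 * v_kv * i_bf_ka * t_s / d_mm**2

# Example: 480 V system, 20 kA bolted fault, 0.1 s clearing, 455 mm (18 in.)
e = lee_incident_energy(0.48, 20, 0.1, 455)
print(round(e, 2))  # about 2.37 cal/cm^2
```

Note how strongly the result depends on clearing time and working distance: doubling the clearing time doubles the incident energy, which is why the maintenance service switch described above (forcing the fastest possible trip) directly reduces the hazard.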
4.6.3 Personal Protective Equipment (PPE)
When it is determined that work will be performed on energized equipment, an energized-work permit is required. A sample of this document can be found in NFPA 70E, Annex J. It includes information documenting why and how the energized work will be performed.
NFPA 70E presents a table [130.7(C)(9)(a)] for simplified personal protective equipment selection, but its use is contingent on compliance with certain criteria (notes) at the end of the table (see Table 4.1) related to bolted fault current values and protective-device clearing times, information that is not normally available to electricians assigned to work on the equipment. Any deviation from the criteria makes the simplified table unusable. Where the table is deemed inappropriate for the system due to compliance issues with the table notes, a detailed calculation/analysis must be performed. Procedures and equations for detailed analysis can be found in NFPA 70E and IEEE 1584. The detailed calculation/analysis may be recommended even if the NFPA 70E tables are used, as it sometimes results in lower personal protective equipment (PPE) requirements: the detailed calculations can address work-specific conditions, such as increased distance to the energized parts, resulting in lower incident energy and thus a lower PPE level selection.
It is important to understand that the use of the prescribed level of PPE does not eliminate the danger to the electrician performing energized electrical work; it merely helps to manage the risk. Arc flash PPE only limits the damage to the skin resulting from an arc flash incident to a second-degree burn, which does not permanently damage skin cells. Also, PPE does not protect one from the physical trauma resulting from such an explosive event. In summary, energized work puts an electrician at an elevated level of risk and is not to be taken lightly.
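Once an incident energy has been calculated, selecting the minimum hazard/risk category reduces to a lookup against the maximum energy levels listed in Table 4.2. The sketch below uses the thresholds from that table but is illustrative only; it is not a substitute for the full NFPA 70E selection procedure, which also considers shock protection, boundaries, and task-specific notes.

```python
# Illustrative only: map a calculated incident energy (cal/cm^2) to the
# minimum hazard/risk category per the maximum energy levels of Table 4.2.
PPE_MAX_ENERGY = [(1, 5.0), (2, 8.0), (3, 25.0), (4, 40.0)]  # (category, cal/cm^2)

def ppe_category(incident_energy):
    """Return the lowest PPE category whose maximum energy level covers
    the calculated incident energy."""
    for category, max_cal in PPE_MAX_ENERGY:
        if incident_energy <= max_cal:
            return category
    # Beyond 40 cal/cm^2, no listed PPE category applies: deenergize instead.
    raise ValueError("incident energy exceeds Category 4 PPE; deenergize the equipment")

print(ppe_category(2.4))   # 1
print(ppe_category(12.0))  # 3
```

The deliberate failure above 40 cal/cm² mirrors the point made in the text: PPE manages risk up to a limit, and beyond that limit the only acceptable control is an electrically safe work condition.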
PPE is also required while performing tasks that do not require an energized electrical work permit, such as troubleshooting and taking measurements. Figure 4.2 from NFPA 70E indicates boundaries which, if crossed, require PPE and/or an energized electrical work permit.
Classification of Hazard/Risk Category. Table 4.2 shows the minimum recommended PPE required to protect personnel at various maximum energy levels in accordance with the specified classification of hazard/risk categories (NFPA 70E lists other PPE required for each category, such as hearing protection and face and eye protection).
PPE Description
• PPE should cover all clothing that can be ignited and should not restrict visibility and movement.
• Nonconductive protective headwear is required when in contact with live parts. The face, neck, and chin should be protected.
Table 4.1. Types of personal protection
Task (assumes equipment is energized and work is done within the flash-protection boundary) | Hazard/risk category | V-rated gloves | V-rated tools

Panelboards rated 240 V and below
  Circuit breaker (CB) or fused switch operation with covers on | 0 | N | N
  CB or fused switch operation with covers off | 0 | N | N
  Work on energized parts, including voltage testing | 1 | Y | Y
  Remove/install CBs or fused switches | 1 | Y | Y
  Removal of bolted covers (to expose bare, energized parts) | 1 | N | N
  Opening hinged covers (to expose bare, energized parts) | 0 | N | N

Panelboards or switchboards rated >240 V and up to 600 V
  CB or fused switch operation with covers on | 0 | N | N
  CB or fused switch operation with covers off | 1 | N | N
  Work on energized parts, including voltage sensing | 2 | Y | Y

600 V class switchgear (with power circuit breakers or fused switches)
  CB or fused switch operation with enclosure doors closed | 0 | N | N
  Reading a panel meter while operating meter switch | 0 | N | N
  CB or fused switch operation with enclosure doors open | 1 | N | N
  Work on control circuits with energized parts 120 V or below, exposed | 0 | Y | Y
  Work on control circuits with energized parts >120 V, exposed | 2 | Y | Y
  Insertion or removal of CBs from cubicles, doors open | 3 | N | N
  Insertion or removal of CBs from cubicles, doors closed | 2 | N | N
  Application of safety grounds, after voltage test | 2 | Y | N
  Removal of bolted covers (to expose bare, energized parts) | 3 | N | N
  Opening hinged covers (to expose bare, energized parts) | 2 | N | N

Other 600 V class (277 V–600 V nominal) equipment
  Removal of bolted covers (to expose bare, energized parts) | 2 | N | N
  Opening hinged covers (to expose bare, energized parts) | 1 | N | N
  Work on energized parts, including voltage testing | 2 | Y | Y
  Application of safety grounds, after voltage test | 2 | Y | N
  Insertion or removal | 2 | Y | N
  Cable trough or tray cover removal or installation | 1 | N | N
  Miscellaneous equipment cover removal or installation | 1 | N | N

Source: NFPA 70E, 130.7(C)(9)(a).
• Safety glasses should be worn underneath the headgear.
• Multiple layers of clothing provide greater thermal insulation.
• A bib overall worn with a shirt provides higher protection to the chest area.
• A coverall is a garment combining shirt and pants.
• Jackets are usually multilayered, similar to multilayered shirts.
• The hood, part of the headgear, has face protection and fire-resistant fabric over the head, ears, neck, and shoulders.
Figure 4.2. Arc flash boundaries. (From NFPA 70E-2009, used with permission.)
Table 4.3 shows glove and boot classifications and the corresponding maximum voltages.
Electrical PPE Inspection
• Insulating PPE must be inspected for damage before each day's use and immediately following any incident that can reasonably be suspected of having caused damage. Insulating gloves must be given an air test when inspected.
• PPE must be stored in a location and in such a manner as to protect it from light, temperature extremes, excessive humidity, ozone, and other harmful substances and conditions.
• If PPE has a hole, a tear, an embedded foreign object, or texture changes such as swelling, softening, or hardening, or if it becomes sticky or elastic, it should not be used.
PPE should be tested on a regular basis. Table 4.4 shows the testing frequency for certain types of PPE.
Working Near Generators
• Authorized employees who conduct the servicing/inspections must comply with the lockout/tagout requirements previously mentioned.
Table 4.2. Minimum recommended PPE
Category | Maximum energy level | Typical PPE examples
0 | N/A | Nonmelting, flammable materials
1 | 5 cal/cm² | Fire-resistant shirt and pants
2 | 8 cal/cm² | Cotton underwear plus fire-resistant shirt and pants
3 | 25 cal/cm² | Cotton underwear plus fire-resistant shirt and pants plus fire-resistant coverall
4 | 40 cal/cm² | Cotton underwear plus fire-resistant shirt and pants plus double-layer switching coat and pants
Source: NFPA 70E.
Notes: V-rated gloves are gloves rated and tested for the maximum line-to-line voltage upon which work will be done. V-rated tools are tools rated and tested for the maximum line-to-line voltage upon which work will be done. For systems that are 600 V or less, the flash-protection boundary shall be a minimum of 4 ft in front of "live" electrical equipment.
Table 4.3. Glove and boot classification
Glove/boot voltage classification | Maximum working voltage | Proof test voltage, kV
Class 00 | 500 | 2.5
Class 0 | 1,000 | 5.0
Class 1 | 7,500 | 10
Class 2 | 17,000 | 20
Class 3 | 26,500 | 30
Class 4 | 36,000 | 40
Source: OSHA Regulations 29 CFR 1910.137, Table I-5, Rubber Insulating Equipment Voltage Requirements, http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=STANDARDS&p_id=9787.
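Selecting the minimum glove class for a job is a direct lookup against the maximum working voltages in Table 4.3. A small sketch (illustrative only; actual selection must also consider AC vs. DC service, ozone class, and the proof-test records required by 29 CFR 1910.137):

```python
# Sketch: choose the minimum rubber insulating glove class for a given
# working voltage, using the maximum-use voltages of Table 4.3.
GLOVE_CLASSES = [("Class 00", 500), ("Class 0", 1000), ("Class 1", 7500),
                 ("Class 2", 17000), ("Class 3", 26500), ("Class 4", 36000)]

def glove_class(working_voltage):
    """Return the lowest glove class whose maximum working voltage covers
    the given system voltage."""
    for name, max_v in GLOVE_CLASSES:
        if working_voltage <= max_v:
            return name
    raise ValueError("no rubber glove class is rated for this voltage")

print(glove_class(480))    # Class 00
print(glove_class(13800))  # Class 2
```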
Table 4.4. Frequency of PPE testing
Type of equipment | When to test
Rubber insulating line hose | Upon indication that insulating value is suspect
Rubber insulating covers | Upon indication that insulating value is suspect
Rubber insulating blankets | Before first issue and every 12 months
Rubber insulating gloves | Before first issue and every 6 months
Rubber insulating sleeves | Before first issue and every 12 months
Source: OSHA Regulations 29 CFR 1910.137, Table I-6, Rubber Insulating Equipment Test Intervals, http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=STANDARDS&p_id=9787.
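A facility's PPE program can drive retest scheduling directly from the intervals in Table 4.4. The sketch below is a minimal illustration (the month arithmetic and day clamping are simplifications; a real program would use the site's CMMS and its own business rules):

```python
# Sketch: compute the next required electrical retest date for rubber
# insulating equipment from the intervals of Table 4.4.
from datetime import date

TEST_INTERVAL_MONTHS = {            # per Table 4.4
    "line hose": None, "covers": None,   # retest only when value is suspect
    "blankets": 12, "gloves": 6, "sleeves": 12,
}

def next_test_due(equipment, last_test):
    """Return the next retest date, or None for items tested only on suspicion.
    Day-of-month is clamped to 28 so the result is valid in any month."""
    months = TEST_INTERVAL_MONTHS[equipment]
    if months is None:
        return None
    total = last_test.month - 1 + months
    return date(last_test.year + total // 12, total % 12 + 1, min(last_test.day, 28))

print(next_test_due("gloves", date(2011, 1, 15)))  # 2011-07-15
```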
• Control devices should be mechanically locked out to prevent the release of stored energy, for example, the unexpected start of a fan belt. Engine emergency stop button(s) are not acceptable as a means to prevent inadvertent operation during engine work.
• When working near a running generator, proper ear protection must be worn.
• If working near moving parts or rotating machinery, avoid wearing loose clothing and jewelry, and tie back long hair.
• Guards, barriers, and access plates must be maintained to prevent employees from contacting moving parts.
Working with Fuel Oil
• All petroleum products, including fuel oil, are potentially dangerous. Heated fuel oil may generate vapors that are flammable, explosive, and dangerous if inhaled.
• Personnel must have thorough knowledge of these hazards.
• The following list covers the most important precautions:
1. Do not allow anyone to smoke or to carry matches or lighters while handling fuel oil.
2. Use only an approved type of protected light when working near fuel oil.
3. Do not allow oil to accumulate in bilges, voids, and so forth. The vapor from even a small pool of heated fuel oil can cause an explosion.
4. Never raise the temperature of fuel oil above 120°F in fuel oil tanks.
Note: Recent environmental regulations have resulted in changes to fuel formulations that will result in a reduced shelf life for diesel fuels. These changes require that facilities work closely with their fuel suppliers to ensure that a reliable fuel supply is available on-site when it is required for generator operation.
Working with Batteries
1. Operate all battery-related switches and breakers and provide lockout/tagout devices and personal protective equipment (PPE) at all times as required by the National Electrical Code (NEC) and the Occupational Safety and Health Administration (OSHA).
2. Provide appropriate personal protective equipment in accordance with NEC, OSHA, and IEEE guidelines, including but not limited to:
• Safety glasses with side shields, goggles, or face shields, as appropriate
• Acid-resistant gloves
• Protective aprons and safety shoes
• Portable or stationary water facilities in the battery vicinity for rinsing eyes and skin in case of contact with acid electrolyte
• Bicarbonate of soda solution, mixed as 100 grams of bicarbonate of soda to 1 liter of water, to neutralize lead-acid battery spillage. Note: The removal and/or neutralization of an acid spill may result in the production of hazardous waste. The service contractor must comply with appropriate governmental regulations.
• Class C fire extinguisher. Note: Some manufacturers do not recommend the use of CO2 fire extinguishers due to the potential for thermal shock.
• Acid- (or alkali-) neutralizing solution and spill-containment kit
• Lifting devices of adequate capacity, when required
• Adequately insulated tools and instrumentation
In addition to the safety guidelines mentioned above, the following IEEE-recommended precautions are to be observed:
• Use caution when working on batteries because they present shock and arcing hazards.
• Neutralize static buildup just before working on the battery by contacting the nearest effectively grounded surface.
• Check the voltage to ground (AC and DC) before working around the battery. If the voltage is other than anticipated or is considered to be in an unsafe range, do not work on the battery until the situation is understood and/or corrected. Wear protective equipment suitable for the voltage.
• Prohibit smoking and open flame, and avoid arcing in the immediate vicinity of the battery.
• Provide adequate ventilation, and follow the manufacturer's recommendations during charging.
• Ensure unobstructed egress from the battery work area.
• Avoid wearing metallic objects such as jewelry while working on the battery.
• Ensure that the work area is suitably illuminated.
• Follow the manufacturer's recommendations regarding cell orientation.
• Follow the manufacturer's instructions regarding lifting and handling of cells.
• An uninterruptible power system (UPS) or other system might not be equipped with an isolation transformer. In addition to DC voltage, an AC voltage may also be present. Lack of an isolation transformer may provide a direct path to ground of the DC supply to the UPS. This can substantially increase the electrocution and short-circuit hazards.
Provide all required tools and rigging for the required work, including but not limited to:
• General tools for panel cover removal and reinstallation as well as minor repairs
• Multimeters for verification of power. Note: All test instruments should be checked and calibrated annually against working standards that are traceable to the National Bureau of Standards. All calibration certification must be available and provided to the manager upon request.
• Calibrated torque wrenches (foot-pound and inch-pound increments)
• Socket sets
• Cleaning solvents (denatured alcohol) and lint-free rags
• Circuit tracers
• Ladders
• Hoists and cranes
• Battery-spill containment kit
• Battery manufacturer's recommended anticorrosion compound for terminals and posts
Suggested Codes and Standards
• OSHA Regulations 29 CFR
• Safety-related work practices (1910.331–1910.335)
• Personal protective equipment (1910.132–1910.138)
• National Electric Code®, National Fire Protection Association, 2002
• NFPA 70E, Standard for Electrical Safety in the Workplace, National Fire Protection Association, 2004
4.6.4 Lockout/Tagout
Before any employee performs maintenance or repair on a piece of equipment subject to unexpected energizing, start-up, or release of stored energy, proper lockout/tagout procedures must be in place to isolate and render the equipment inoperative. All parties must be informed of the planned servicing and the possibility of an unexpected power outage.
Lockout is the procedure of using keyed or combination security devices (locks) to prevent the accidental activation of mechanical or electrical equipment. In conjunction with lockout, tags are used to visibly indicate that equipment is not to be energized, that locks should not be removed without authorization, and that the disconnecting means should not be operated without permission. This is known as tagout. Lockout/tagout protocols and procedures must be implemented by employees in order to protect both life and equipment.
Where all or some of the electrical distribution is scheduled to be deenergized as part of the work procedures or electrical-work permit, NFPA 70E offers detailed suggestions regarding lockout/tagout, including a sample lockout/tagout procedure (NFPA 70E Annex G). Placing equipment in an electrically safe work condition per the NFPA 70E standard is essential to electrical safety and work practices around electrical equipment. OSHA's standard for the control of hazardous energy sources requires employers to establish and implement procedures to disable machinery and equipment and to prevent the release of hazardous energy sources while maintenance and servicing activities are being performed.
Below is a detailed description of OSHA's lockout/tagout procedures developed by Thomas L. Bean, in collaboration with Timothy W. Butcher and Timothy Lawrence, as part of a fact sheet developed at the Ohio State University.*
Phase 1—Lockout/Tagout, Deenergize Machinery or Equipment
1. The authorized employee notifies all affected people that a lockout/tagout procedure is ready to begin.
2. The machinery or equipment is deenergized.
3. The authorized employee releases or restrains all stored energy.
4. All locks and tags are checked for defects. If any are found, the lock or tag is discarded and replaced.
5. The authorized employee places a personalized lock or tag on the energy-isolating device.
6. The authorized employee holds the key throughout the duration of the service or maintenance.
7. If multiple vendors will be servicing or performing maintenance on the same equipment, an authorized employee from each vendor will place their own lockout/tagout device.
8. The authorized employee tries starting the machinery or equipment to ensure that it has been isolated from its energy source. The machinery is then deenergized again after this test.
9. The machinery or equipment is now ready for service or maintenance.
Phase 2—Return the Machinery or Equipment to Production
1. The authorized employee checks the machinery or equipment to be certain no tools have been left behind.
2. All safety guards are checked to be certain that they have been replaced properly.
3. All affected people are notified that the machinery or equipment is about to go back into production.
4. The authorized employee performs a secondary check of the area to ensure that no one is exposed to danger.
5. The authorized employee removes the locks and/or tags from the energy-isolating device and restores energy to the machinery or equipment.
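The multiple-vendor rule in step 7 of Phase 1 is the part of the procedure most easily mishandled: a group lockout behaves like a set of personal locks, and energy must not be restorable until every lock has been removed by its owner. A toy model of that invariant (names and class design are invented for illustration; this is a teaching sketch, not a control system):

```python
# Toy model of group lockout/tagout: energy cannot be restored until every
# authorized employee has personally removed his or her own lock.
class LockoutPoint:
    def __init__(self):
        self.locks = set()

    def apply_lock(self, employee):
        self.locks.add(employee)

    def remove_lock(self, employee):
        # Only the lock's owner removes a personal lock (step 6: owner keeps the key).
        self.locks.discard(employee)

    def restore_energy(self):
        if self.locks:
            raise RuntimeError(f"locks still applied: {sorted(self.locks)}")
        return "energy restored"

point = LockoutPoint()
point.apply_lock("vendor A technician")
point.apply_lock("owner electrician")
point.remove_lock("vendor A technician")
try:
    point.restore_energy()            # blocked: one personal lock remains
except RuntimeError as err:
    print(err)
point.remove_lock("owner electrician")
print(point.restore_energy())
```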
*The fact sheet describing OSHA's lockout/tagout standard may be found at http://ohioline.osu.edu/aex-fact/pdf/0595.pdf.
4.7 MAINTENANCE OF TYPICAL ELECTRICAL DISTRIBUTION EQUIPMENT
In order to provide more specific recommendations, let us examine a select portion of a high-rise office building with mission critical loads. The system comprises 15 kV
vacuum circuit breakers with protective relays, solid dielectric-shielded feeder cable, and a fused load-break switch. On the 480 V system, there is switchgear with molded-case circuit breakers on the main and feeders. The mission critical loads are then connected to motor control centers and panel boards with molded-case circuit breakers.
4.7.1 Thermal Scanning and Thermal Monitoring
A high-rise building of this type can easily justify the cost of an infrared scan once per year. There is no more cost-effective maintenance procedure that can positively identify potential problems quickly and easily and without a shutdown. Its principal limitation is that there are certain areas of the system and equipment where the necessary covers cannot be removed while in service; without line-of-sight access, the scanning equipment is very limited in its capabilities. Where electrical panel covers cannot be removed during normal working hours, hinged panel covers should be installed during scheduled building shutdowns until all mission critical panel board covers have been changed to hinged covers. This will prevent a branch circuit breaker from tripping or opening when panel board covers are removed. As we know, all it takes is the right outage at the right time to create chaos in a data center, and this is a small price to pay to avoid such an occurrence.
For many key businesses—financial, media, petrochemical, oil and gas, telecommunications, large-scale manufacturing, and so on—the consequences of sudden, unexpected power failures can include severe financial and/or safety issues. As mentioned previously, the currently accepted method of noncontact thermal inspection has been thermography via thermal imaging cameras. However, this gives only a snapshot of the day; problems could arise the next day, undetected. Another solution is a thermal monitoring system that is permanently installed within the electrical distribution system. These systems are small, low-cost, accurate, noncontact, infrared sensors that are permanently fitted where they can directly monitor key components like motors, pumps, drive bearings, gearboxes, and generators, and even inside enclosures to monitor high- and low-voltage switchgear (Figure 4.3). The sensors feed back signals to
Figure 4.3. A small thermographic camera and a typical installation. (Courtesy of Exetherm.)
a PC, where automatic data logging provides instant on-screen trend graphs (Figure 4.4). Two alarm levels per sensor automatically activate in the event that preset levels are exceeded, thus identifying problem components and allowing planned maintenance to be carried out. These sensors, which require no external power, measure the target temperature in relation to ambient, with the signal indicating the °C rise above ambient, thus avoiding inaccuracies due to weather conditions, environment, and so on. Here are some advantages to this approach:
• 7 × 24 × 365 thermal monitoring provides a constant stream of thermal data as opposed to the "snapshot in time" afforded by traditional infrared scans. Combined with comprehensive power monitoring, trend analysis becomes a new weapon in the arsenal of failure prevention.
• Doors and covers need not be opened, thereby eliminating the need for PPE and reducing personal risk.
• Interpretation of the information is simplified. Combined with comprehensive power monitoring, a thermal map can be developed and trends monitored so that anomalies can be addressed long before a failure occurs. Combined with comprehensive power measurement/analysis, this also pinpoints inefficiencies so they may be corrected and energy demand reduced.
Additional advantages include the following:
• This technology uses a nonproprietary, open protocol for integration with an existing BMS.
• Information is available globally for those responsible for operations.
• No power supply is required for thermal sensors.
• Devices are self-calibrating.
• Direct-contact devices are available where line-of-sight measurement is impractical.
Figure 4.4. Sample IR scan trending. Left: the benchmark correlation between power and the delta T of the bus or cable joint being monitored. Right: an abnormal rise in delta T for the same power benchmark, an early indicator of a potential failure. (Courtesy of Exetherm.)
• Depending on the equipment involved, the time periods between traditional maintenance events may be lengthened or, in some cases, events may be eliminated.
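The two-level alarm behavior described above reduces to a threshold check on each sensor's temperature rise over ambient. A minimal sketch follows; the warning and critical thresholds here are invented for illustration, since real settings come from the monitored equipment and the monitoring vendor:

```python
# Sketch of two-level delta-T alarming for permanently installed IR sensors.
# Thresholds are illustrative only, not vendor or standards values.
def alarm_level(delta_t_c, warn_c=15.0, critical_c=30.0):
    """delta_t_c: measured rise over ambient (deg C) for one monitored joint.
    Using rise over ambient, rather than absolute temperature, avoids the
    weather/environment inaccuracies noted in the text."""
    if delta_t_c >= critical_c:
        return "CRITICAL: schedule immediate inspection"
    if delta_t_c >= warn_c:
        return "WARNING: trend and plan maintenance"
    return "normal"

readings = {"main bus joint": 8.2, "feeder 3 lug": 18.5, "ATS terminal": 31.0}
for point, dt in readings.items():
    print(point, "->", alarm_level(dt))
```

A practical system would also correlate each delta T against load (as in Figure 4.4), since a joint that runs hotter at the same power benchmark is the real early warning, not the absolute rise alone.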
4.7.2 15 kV Class Equipment
Some procedures that should be a part of a preventive maintenance program include the following:
• Insulators should be inspected for damage and/or contamination.
• Cables, buses, and their connections should be visually inspected for dirt, tracking, water damage, overheating, or oxidation.
• Liquid-filled transformers should be checked for temperature, liquid level, and nitrogen pressure.
• Transformer oil should be sampled and tested for moisture and for gases that indicate a variety of insulation failure processes.
• Dry transformers should be checked for temperature and given a physical inspection.
• Switchgear encompasses enclosures, switches, interrupting devices, regulating devices, metering, controls, conductors, terminations, and other equipment that should be inspected and have the recommended deenergized and energized tests performed.
• Complete testing and recalibration of the relays.
• Vacuum integrity tests on the breaker interrupters.
• Insulation-resistance measurements on the breaker and switch.
• Accurate determination of the contact resistance over the complete current path of each device, checking for oxidation and carbon buildup.
• Operational checks of the mechanical mechanisms, interlocks, auxiliary devices, and so on.
• Inspection and cleaning of contacts and insulators, and lubrication of certain parts.
• Analysis of all results by someone with the proper training and experience who is on-site and participating in the work.
Remember that the observations and analysis of experienced engineers cannot be replaced by a software program that "trends" the data obtained, or by a later comparison of the results to existing or manufacturers' standards.
4.7.3 480 Volt Switchgear
The 480 V switchgear and breakers serve the same function as the higher-voltage switchgear. It is easier to justify less frequent service for this equipment because insulation failure can be a slower process at the lower voltage and, if there is a problem, it will affect a smaller portion of the system. Two years is a common interval, and many buildings choose 3 years between thorough maintenance checkouts on this type of equipment. A comprehensive maintenance program at this level is very similar
to the medium-voltage program above. One change results from the fact that lower-voltage breakers have completely self-contained protection systems rather than separately mounted protective relays and current transformers. The best protection-confirmation test is primary current testing: a large test device injects high currents at very low voltage to simulate overloads and faults, so the breaker is tested and operated as it must operate during a fault or overload situation. For breakers with solid-state protective devices, the breaker manufacturer's secondary test device can be used for a less extensive test. This can range from a test nearly equivalent to injecting high current levels to little more than powering up the trip unit and activating an internal self-test feature. The latter method may be appropriate at alternate service periods, but it cannot be considered a comprehensive evaluation and should not be relied upon as proper maintenance.
Lower-voltage breaker contacts should be cleaned and inspected after the arc chutes are removed. Precisely measure the resistance of the current path on each pole, as well as the insulation resistance between poles and to ground. Verify proper operation and lubrication of the open/close mechanisms and the various interlocks. Again, inspection and analysis of the results by a knowledgeable and trusted service technician familiar with the equipment and its previous history should conclude the test. Refer to manufacturer recommendations and professional society standards for specific recommendations on the equipment installed at your site.
4.7.4 Motor Control Centers and Panel Boards
Located further downstream are the motor control centers and panel boards. These components should be thoroughly examined with an infrared scanning device once a year. Testing and service beyond that level can vary considerably; having a testing company scan your electrical distribution equipment can cost upwards of $1500 per day during normal working hours. Molded-case breakers are not designed for servicing and, consequently, many buildings limit them to scanning only. An exception is where molded- or insulated-case breakers are used in place of the air circuit breakers at the substation. There they serve a more critical role, frame sizes can be quite large, and testing is more easily justified. If the breakers are of non-draw-out construction, as most circuit breakers are, primary current injection testing is more difficult, especially on the large frame sizes. Secondary testing of solid-state circuit breakers, along with an accurate determination of the contact resistances, might be justified more often than for air circuit breakers. At a minimum, exercise molded-case breakers by manual opening and closing on a consistent schedule. If so equipped, push the trip button; this exercises the same mechanism that the internal protective devices do. If periodic testing by primary current injection is available, bring the test equipment to the breaker mounting locations and test the breakers in place wherever practical.
4.7.5 Automatic Transfer Switches
The automatic transfer switch (ATS) is a critical component of the emergency power system. Correct preventive maintenance of an ATS depends on the type of switch and where it is located within the building. There are four basic types of ATS:
1. Break-before-make (open transition) is most common and allows the load to be interrupted during the transfer from normal to emergency power or from one source to another.
2. Make-before-break (closed transition) allows a transfer to occur without dropping the critical load.
3. Delayed transition (center off) is designed for applications in which large inrush currents exist, allowing the magnetic fields associated with inductive loads to completely collapse before reconnection.
4. Static transfer switches (STS) do not incorporate the traditional mechanical transfer switch. These transfer switches instead rely on transistor or SCR technology, where subcycle transfers are possible.

The following is a list of basic preventive maintenance items for automatic transfer switches.

• Infrared scan the ATS while loaded to identify hot spots and high-resistance connections.
• Deenergize the switchgear and isolate it electrically. Make sure the standby power generator is locked out and tagged out.
• Remove arc chutes and pole covers and conduct a visual inspection of the main and arcing contacts.
• Test and recalibrate all trip-sensing and time-delay functions of the ATS. This step will vary depending on the manufacturer.
• Vacuum the dust from the switchgear and accessory panels. Never use air to blow out dirt; you may force debris into the switch mechanism.
• Inspect for signs of moisture, previous wetness, or dripping.
• Clean grime with a solvent approved by the manufacturer.
• Inspect all insulating parts for cracks and discoloration caused by excessive heat.
• Lubricate all mechanical slide equipment, such as rollers and cams.
• Exercise the switch mechanically and electrically.
• After maintaining the switch, test it for full functionality.
4.7.6 Automatic Static Transfer Switches (ASTS)
• Infrared scan the ASTS while loaded to identify hot spots and high-resistance connections.
• Visually inspect the cabinet and components of the unit.
• Take voltage readings (phase to phase) of inputs 1 and 2 and the output.
• Record the voltages and compare the measured readings to the monitor. Calibrate metering as required.
• Retrieve information on detected events from the static switch's event log.
• Clear the event log.
• Test: Manually switch from the preferred source to the alternate source a prescribed number of times.
• Replace any covers or dead fronts and close unit doors.
4.7.7 Power Distribution Units

• Visually inspect the unit: transformer, circuit breakers, and cable connections.
• Check all LED/LCD displays.
• Measure voltages and current.
• Check wires and cables for discoloration, shorts, and tightness.
• Perform infrared scanning on each unit and visually inspect the transformer.
• Check for missing screws and bolts.
• Check all monitors and control voltage.
• Check and calibrate monitors.
• Check high/low thresholds.
• Check for worn or defective parts.
• Check for general cleanliness of units, especially vents, and vacuum out transformer windings and enclosure if dirt and dust are present.

4.7.8 277/480 Volt Transformers
As long as transformers operate within their designed temperature range and specifications, they will operate reliably. However, if the transformer is not ventilated adequately and is operating hotter than the manufacturer's specifications, problems will occur. The most important maintenance function for transformers is to verify proper ventilation and heat removal. Even though there are no moving parts in a transformer, except for fans in larger units, perform the following preventive maintenance:

• Infrared scan for loose connections and hot spots, then retorque per manufacturer's specifications.
• Remove power.
• Inspect wire for discoloration.
• Vacuum inside the transformer and remove dust.
• Verify that the system is properly grounded.
4.7.9 Uninterruptible Power Systems
There are many configurations of uninterruptible power supply installations, such as single-module systems, parallel redundant, and isolated redundant. But there are common tests and maintenance procedures for the major components of these systems. It is important to perform the manufacturer's recommended maintenance in addition to independent certified performance testing.
Visual and Mechanical Inspection

1. Inspect physical, electrical, and mechanical condition.
2. Check for correct anchorage, required area clearances, and correct alignment.
3. Verify that fuse sizes and types correspond to drawings.
4. Test all electrical and mechanical interlock systems for correct operation and sequencing.
5. Inspect all bolted electrical connections for high resistance.
6. Verify tightness of accessible bolted electrical connections by the calibrated torque-wrench method in accordance with the manufacturer's published data.
7. Perform a thermographic survey at full load.
8. Thoroughly clean the unit prior to tests, unless as-found and as-left tests are required.
9. Check operation of forced ventilation.
10. Verify that filters are clean and in place and/or vents are clear. Replace dirty filters as necessary.
Electrical Tests

1. Perform resistance measurements through all bolted connections with a low-resistance ohmmeter, if applicable.
2. Test static transfer from inverter to bypass and back at 25%, 50%, 75%, and 100% load.
3. Set the free-running frequency of the oscillator.
4. Test the DC undervoltage trip level on the inverter input breaker. Set according to manufacturer's published data.
5. Test alarm circuits.
6. Verify sync indicators for static switch and bypass switches.
7. Perform electrical tests for UPS system breakers.
8. Perform electrical tests for UPS system automatic transfer switches.
9. Perform electrical tests for UPS system batteries for 5 minutes at full load.
10. See above for a more complete battery maintenance procedure, if applicable.

Test Values

1. Compare bolted connection resistances to values of similar connections.
2. Bolt-torque levels must be in accordance with manufacturer's specifications.
3. Micro-ohm or millivolt drop values must not exceed the high levels of the normal range as indicated in the manufacturer's published data. If manufacturer's data is not available, investigate any values that deviate from similar connections by more than 25% of the lowest value.
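The 25% deviation rule in Test Values item 3 can be applied mechanically to a set of readings. The following is a minimal sketch, not from the book: the function name and the micro-ohm readings are illustrative, and the rule is assumed to apply only when manufacturer's data is unavailable, as stated above.

```python
def flag_deviations(readings_microohm, threshold=0.25):
    """Return connections whose measured resistance exceeds the
    lowest similar reading by more than `threshold` (25% per the
    rule used when manufacturer's data is not available)."""
    lowest = min(readings_microohm.values())
    limit = lowest * (1 + threshold)
    return {name: r for name, r in readings_microohm.items() if r > limit}

# Hypothetical micro-ohm readings for similar bolted connections
readings = {"A-phase": 42.0, "B-phase": 44.5, "C-phase": 61.0}
suspect = flag_deviations(readings)
print(suspect)  # {'C-phase': 61.0} -- more than 25% above the 42.0 low
```

Here the C-phase reading (61.0) exceeds 42.0 × 1.25 = 52.5 and would be flagged for investigation, while B-phase (44.5) would not.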
4.8 BEING PROACTIVE IN EVALUATING TEST REPORTS
Acceptance and maintenance testing are pointless unless the test results are objectively evaluated and compared to standards and to previous test reports that have established benchmarks. It is imperative to recognize failing equipment and to take appropriate action as soon as possible. However, it is common practice for maintenance personnel to perform maintenance without reviewing prior maintenance records. This approach defeats the purpose of benchmarking and trending and must be avoided. The importance of taking every opportunity to perform preventive maintenance thoroughly and completely, especially in mission critical facilities, cannot be stressed enough. Skipping that work means the next opportunity will come at a much higher price: downtime and lost business and clients, not to mention the safety issues that arise when technicians rush to fix a maintenance problem. So do it right ahead of time, and do not take shortcuts.
4.9 CONCLUSION
Many decisions regarding how and when to service a facility's mission critical electrical power distribution equipment are going to be subjective. It is easy to choose the objective: a high level of safety and reliability from the equipment, components, and systems. But discovering the most cost-effective and practical methods required to get there can be a challenge. Network with colleagues and knowledgeable sources, and review industry and professional standards before choosing the approach best suited to your maintenance goals. Also, keep in mind that the individuals performing the testing and service should have the best education, skills, training, and experience available. You depend on their conscientiousness and decision making to avoid future problems with perhaps the most crucial equipment in your building. Most importantly, learn from your experiences and those of others. Maintenance programs should be continuously improving. If a task has historically not identified a problem at the scheduled interval, consider adjusting the schedule accordingly. Examine your maintenance programs on a regular basis and make appropriate adjustments to take advantage of improvements in technology and changes in codes, regulations, and procedures.
5 STANDBY GENERATORS: OPERATIONS AND MAINTENANCE
5.1 INTRODUCTION
Failure of a standby diesel generator to start, whether due to insufficiently charged batteries, faulty fuel pumps, a control or main circuit breaker left in the off position, or low coolant level, is inexcusable. In a facility where the diesel generator is used for life safety or to serve mission critical loads, the cost of generator failure can be immeasurable. Standby generator reliability is contingent on appropriate equipment selection, proper system design and installation, and proper operations and maintenance. To confirm that maintenance is being done correctly, it is imperative to develop a comprehensive documentation and performance plan. Since standby generators run only occasionally, the schedule for performing specific maintenance tasks is usually stated in terms of daily, weekly, monthly, semiannual, and annual time frames rather than in hours of operation. Diesel engines are one of the most reliable sources of emergency power. Every day, thousands of truckers successfully start a diesel engine without a second thought. An emergency diesel generator should be no exception. However, because truckers operate their engines daily, they tend to be more familiar with the operation and maintenance needs of those engines. Engines operated daily get the use and attention they require. Most emergency diesel generators are neglected because they are not used on a consistent basis and, consequently, are not being maintained.

Maintaining Mission Critical Systems in a 24/7 Environment, Second Edition. Peter M. Curtis. © 2011 the Institute of Electrical and Electronics Engineers, Inc. Published 2011 by John Wiley & Sons, Inc.
More importantly, engines in daily service are effectively tested under real load conditions every time they are used. Running an emergency generator on a periodic basis does not provide a significant amount of operational experience. Facilities engineers who have been running one-hour weekly plant exercises or load tests for ten years have less than 600 hours of operational history. It is not until a disaster occurs, and a long run of 100 consecutive hours or more is needed, that one sees the issues associated with power plant operations and their associated maintenance requirements. This chapter familiarizes the reader with developing an effective standby generator operations and maintenance program.
5.2 THE NECESSITY FOR STANDBY POWER
The necessity for standby power is an essential concern for businesses that need to operate continuously. Traditionally, standby power has been synonymous with emergency generators. With the advent of very sensitive electronic devices, however, an emergency generator alone can fall short of meeting requirements for standby power. There are a number of ways a standby power system can be provided for a load, namely, motor-generator sets, uninterruptible power systems (UPS), and emergency generators. The combination of the last two is used most often to provide protection from all power-quality and availability problems. The main motive in the use of standby power is to increase the availability of equipment. There are several root causes for why a standby power source may be required: lack of reliable service for the area, utility supply/demand problems, or weather-related problems. A stark recent example of a utility's inability to supply power was the 2003 Northeast blackout, which, due to overloads and untested and malfunctioning utility distribution protection systems, left millions on the East Coast without power. Weather-related problems are normally caused by a loss of transmission service due to high winds, tornadoes, severe snowstorms, or the formation of ice on transmission lines. In general, a standby generation system can be justified for two main reasons: dealing with a chronic problem that leads to power loss, or protecting the system from utility interruptions. The solutions to these can be very different. Electrical loads in power systems can be roughly divided into two groups. The first group consists of traditional linear electrical loads, such as lighting and motors. The second group consists of more sensitive nonlinear electronic devices, such as computers.
The first group can resume normal operation after short power interruptions, provided (a) the interruption is not repeated continuously and the device has a relatively high tolerance for over- and undervoltages, voltage sags, and voltage swells; and (b) the device operates successfully under moderate levels of noise and harmonics. In other words, the normal variations and power quality problems of the utility service do not affect these loads significantly, and as soon as the load is energized with auxiliary power, the equipment resumes its function immediately. For these loads, an emergency generator is sufficient to serve as standby power. After loss of utility service, an emergency generator usually begins to energize the load in about 10 seconds.
On the other hand, the second group of loads, electronic devices, will be affected by power quality problems. Moreover, they cannot sustain a momentary outage without possible hardware failure or software lockup. Therefore, an emergency generator alone cannot meet the standby power requirements of this equipment. As mentioned previously, a UPS will provide a reliable source of conditioned power, keeping the load energized without interruption after the loss of utility service. A UPS's main purpose is to serve as an online power-conditioning device that protects against unwanted power line disturbances and short-term outages, while standby generators supply power to the load through the UPS over the long term. With the proliferation of real-time computing in LAN, midrange, and mainframe environments, a standby power source is a necessity in many commercial and institutional facilities. A number of power problems can harm the operation of electrical equipment, including blackouts, brownouts, voltage swells and sags, under- and overvoltage conditions, power surges, and spikes. Power received from the utility is usually stable and reliable, so its presence is almost taken for granted. There is a tendency to presume the source of power problems is external but, in reality, more than 50% of power troubles stem from in-house equipment. Although major power blackouts are uncommon, a major electrical outage in July 1996 impacted more than 2 million in the northwestern United States. In August 1996, another power outage affected more than 4 million in five states. Similarly, several areas in New England and Canada were without electric power in the winter of 1998. Localized blackouts due to weather occur more frequently; thunderstorms, tornadoes, ice storms, and strong heat waves, such as those experienced in the summer of 1999, have all resulted in power outages.
To protect the critical equipment and reduce or eliminate the risk of downtime during a long-term power interruption, organizations rely on backup generators.
5.3 EMERGENCY, LEGALLY REQUIRED, AND OPTIONAL SYSTEMS

One of the most misunderstood electrical terms applied to the design of electrical distribution is the word "emergency." Although all emergency systems are standby in nature, there are numerous standby systems that are not emergency systems within the meaning of the National Electrical Code. These other systems, whether legally required or optional, are subject to considerably different requirements. It is not permissible to pick and choose among the rules of the NEC when applying the design of an alternate power source to supply the circuits of either an emergency or a standby system. Once an alternate power source is correctly classified as an emergency power source, the provisions that apply to standby systems must not be used to relax the more stringent emergency wiring requirements. A legally required standby power source is mandated either by law or by a legally enforceable administrative agency. Emergency systems and legally required standby systems share a common thread in that both are legally required. If a standby system is not legally required, then it is neither an emergency system nor a legally required standby system. Once it is established that the law requires the standby system, it must be classified as either an emergency system or simply a required standby system. In some cases, the applicable regulation will specify the precise code article. In other cases, this determination can be difficult to make, since there is some overlap in the two articles. The relative length of time that power can be interrupted without undue hazard is the most useful criterion. In a true emergency system, there are three important considerations in identifying a load as one requiring an emergency source of power, all of which restate the basic question of how long an outage would be permissible. The nature of the occupancy is the first consideration, with specific reference to the number of people congregated at any one time. A large assembly of people in any single location is ripe for hysteria in the event of a fire, particularly when the area goes dark. Panic must be avoided, since it can develop with intense speed and contribute to more casualties than the fire or other problem that caused it. Therefore, buildings with high occupancy levels, such as high-rise buildings and large auditoriums, usually require emergency systems. Criticality of the loads is the second consideration. Egress lighting and exit directional signs must be available at all times. Additionally, lighting, signaling, and communication systems, especially those that are imperative to public safety, must also be available with minimal interruption. Other loads, such as fire pumps or ventilation systems important to life safety, will be unable to perform their intended function if they are disconnected from the normal power source for any length of time. Danger to staff during an outage is the third consideration. Some industrial processes, although not involving high levels of occupancy, are dangerous to workers should power be unexpectedly interrupted. To the greatest degree possible, the applicable guideline will designate the areas or loads that must be served by the emergency system.
One document often referred to for this purpose is NFPA 101. It states that all industrial occupancies must have emergency lighting for designated corridors, stairs, and so on that lead to an exit. There are exceptions for uninhabited operations and those that allow adequate daylight for all egress routes during production. A note advises authorities having jurisdiction (AHJ) to review large locker rooms and laboratories using hazardous chemicals, to be sure that major egress aisles have adequate emergency illumination.
5.4 STANDBY SYSTEMS THAT ARE LEGALLY REQUIRED

Provisions of the NEC apply to the installation, operation, and maintenance of legally required systems other than those classified as emergency systems. The key to separating legally required standby systems from emergency systems is the length of time an outage can be permitted. Legally required standby systems are not as critical in terms of time for recovery, although they may be very critical to concerns other than personnel safety. They are also directed at the performance of selected electrical loads, rather than the safe exit of personnel. For instance, there are several rules requiring standby power for large sewage treatment facilities. In this case, the facility must remain in operation in order to prevent environmental problems. Another example is the system pressure of the local water utility, which must always be maintained for fire protection as well as public health and safety. Although critical, this is a different type of concern, allowing a longer time delay between loss and recovery than is permitted for lighting that is crucial to emergency evacuation.
5.5 OPTIONAL STANDBY SYSTEMS
Optional standby systems are unrelated to life safety. They are intended to supply on-site generated power to selected loads, such as mission critical electrical and mechanical infrastructure loads, either automatically or manually. Because the requirements for true emergency circuits differ so greatly from those served by legally required or optional standby sources, care must be exercised to select the correct code articles that apply to the system in question.
5.6 UNDERSTANDING YOUR POWER REQUIREMENTS
Managers cannot evaluate the type of backup power required for an operation without a thorough understanding of the organization's power requirements. Managers need to decide whether they want to shut down the facility when a disturbance occurs or ride through it. Since the needs of organizations vary greatly, even within the same industry, it is difficult to develop a single corrective solution that is valid for varying situations. Facilities managers can use answers to the following questions to aid in making the correct decision:

1. What would be the impact of power outages on mission critical equipment in the organization?
2. Is the impact a nuisance, or does it have major operational and monetary consequences?
3. What is the reliability of the normal power source?
4. What are the common causes of power failures in the organization? Are they overloaded power lines, weather related, or other causes?
5. What is the common duration of power failures? In most cases, there are far more momentary "blips" and other aberrations than true outages. Are there brief interruptions, or brownouts/blackouts that can last minutes, hours, or days?
6. Are there common power quality problems, such as voltage swells, voltage sags, ripples, over- or undervoltages, or harmonics?
7. Do the power interruptions stem from secondary power-quality problems generated by other in-house equipment?
5.7 MANAGEMENT COMMITMENT AND TRAINING
The actions discussed in this chapter and book cannot occur without the support of senior management and administration. This assistance is critical for the reliable operation of the critical infrastructure. Making management aware of and responsive to the costs involved in maintaining standby systems is possibly the most demanding task the user faces. It is the responsibility and mission of the user to work with management to institute reasonable operating budgets and expectations based on the level of risk the organization accepts. Both budgets and goals must be balanced against the ultimate cost of a power loss. Management and the user need to make sure that each is providing the other with realistic expectations and budgets. It is in this area that maintenance records are of vital importance. They form the basis for accurately projecting future cost estimates based upon current actual costs. Items that consume too much of the maintenance budget can be singled out for replacement. Records of power quality problems, and of the costs avoided because of system reliability, are certain to be invaluable in this process. There needs to be a specific plan to monitor and maintain the system, a system of checks to verify that the maintenance is done properly, and tests to be sure that the equipment has not deteriorated. There also has to be a commitment to continuous training to keep the knowledge fresh in the minds of the critical people who are in charge of dealing with power failures. One major issue in mission critical facilities is getting managers to allow proper testing of the system, such as an annual pull-the-plug test. To accomplish this, you actually need to expose the system to a power failure to know that it, and all associated functions, including people, perform accordingly. That does insert a bit of risk into power reliability, but it is far better to find a problem under controlled conditions.
5.7.1 Lockout/Tagout
Repairs made to equipment and systems using electricity require that the energy be controlled so workers are not electrocuted. The means of accomplishing this is called lockout/tagout. At points where there are receptacles, circuit breakers, fuse boxes, and switches, facilities must use a lock that prevents the energy from being applied and post a label warning personnel that the equipment has been deenergized and that power will not be applied until repairs are made. The lock also prevents the switch from accidentally being thrown into the closed or "on" position. The locks should be key operated, and the keys stored in a safe, secure area. It is also a good idea to have a trustworthy person handle the keys and keep them on a key ring, with each key properly labeled. This person should be in charge of repairs made to power sources. Before beginning maintenance, be aware that a generator can start without warning and has rotating parts that can cause serious injury to the unwary. Some other safety concerns are:

1. Never wear loose clothing around a generator.
2. Always stay clear of rotating and hot components.
3. Hearing protection and safety glasses must be worn any time the generator set is running.
4. Be aware of high voltages.
5. Only qualified and trained personnel should work on standby generators.
5.7.2 Training
Facilities managers should be aware of and take advantage of available resources to help companies meet the challenges of safety issues and stay informed as regulations change from year to year. Since power failures are rare, it is nearly impossible for people to remember what to do when they occur, so training is needed to keep actions fresh and provide practice so they are properly performed. One such remedy is enhanced training. Whether the training is done in-house or performed by an outside source, a complete and thorough safety training program should be implemented. Some safety topics take more time to cover, but time should never limit what is spent learning ways to keep workers safe. After all, workers are an organization's most precious asset. Hold regular safety training sessions, usually every year, and practice sessions quarterly. This schedule will instill basic safety requirements while helping workers remember procedures for donning and removing personal protective equipment. Practice sessions can easily be done in-house. There should be a formal program documenting that each critical person completed the training, was tested on proper operations, and performed properly. Safety products made with durable material suited for rough applications can make the difference between worker productivity and worker death. It is also wise to buy upgrades to safety equipment, because technology changes rapidly, and up-to-date equipment will help ensure complete worker safety. Managers must remember that safety pays. Cutting back on worker safety is not only detrimental to productivity; it is also expensive. How much is the company willing to pay in medical expenses and lost time versus staying in the clear with OSHA? In the 1970s, worker deaths were at an all-time high. Since then, increased awareness and safety training have reduced these figures, but that does not mean that OSHA does not still target companies, especially repeat offenders. Remember that the cost of being safe is insignificant in comparison to the debilitating nature of injuries and the value of a human life.
5.8 STANDBY GENERATOR SYSTEMS MAINTENANCE PROCEDURES

When a standby generator (Figure 5.1) fails to start, it is usually due to an oversight by the people accountable for its maintenance. Generator sets are reliable devices, but failures often occur because of accessory failures. Managers must take a system approach to power to achieve greater reliability. As stated before, people play a big part in this: they need to be trained and monitored and their performance documented. Standby generators are too dependable and too easily maintained for failures to be excusable. If facilities engineers do their part to establish maintenance and testing programs that prove its reliability, the standby generator can be counted on to perform in an emergency situation.

Figure 5.1. Generator. (Courtesy of ADDA.)

As dependable as standby generators are, that reliability is only as good as the ongoing maintenance and testing programs behind it.
5.8.1 Maintenance Record Keeping and Data Trending
One of the most important elements in any successful quality assurance program is the generation and storage of relevant information. Relevant information can include items such as procurement contracts, certificates of conformance, analysis results, system modification, operational events, operational logs, drawings, emergency procedures, and contact information. It is important to realize that by generating and maintaining the appropriate documentation, it is possible to review important trends and significant events or maintenance activities for their impact on system operation. Sections 5.8.2 through 5.8.7 cover suggested maintenance items for standby generators.
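To illustrate how stored records support trend review, here is a minimal sketch of a maintenance log and a trend check. The record fields, names, and readings are illustrative assumptions, not a format prescribed by the book; adapt them to your own documentation program.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class MaintenanceRecord:
    """One entry in an equipment maintenance log (fields are
    illustrative; real programs track many more items)."""
    when: date
    equipment: str
    parameter: str      # e.g., "contact resistance (micro-ohm)"
    value: float
    technician: str

def is_trending_up(records, parameter, min_points=3):
    """True if the last `min_points` readings of `parameter`
    increase monotonically -- a cue to investigate before failure."""
    vals = [r.value for r in sorted(records, key=lambda r: r.when)
            if r.parameter == parameter][-min_points:]
    return len(vals) == min_points and all(a < b for a, b in zip(vals, vals[1:]))

# Hypothetical annual readings for one automatic transfer switch
log = [
    MaintenanceRecord(date(2009, 5, 1), "ATS-1", "contact resistance", 38.0, "JD"),
    MaintenanceRecord(date(2010, 5, 3), "ATS-1", "contact resistance", 41.5, "JD"),
    MaintenanceRecord(date(2011, 5, 2), "ATS-1", "contact resistance", 47.0, "MK"),
]
print(is_trending_up(log, "contact resistance"))  # True -- rising year over year
```

A steadily rising contact resistance would not trip any single-visit threshold, which is exactly why reviewing prior records, rather than maintaining in isolation, catches deterioration early.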
5.8.2 Engine
1. Verify and record oil pressure and water temperature.
2. Inspect the air intake system, including air filter condition, crankcase breather, and turbocharger.
3. Inspect the muffler system, drain the condensation trap (if applicable), and verify rain cap operation.
4. Inspect the engine-starting system and verify cable integrity and connections.
5. Inspect exhaust flex coupling and piping for leaks and proper connection.
6. Check for abnormal vibration or noise.
5.8.3 Coolant System
1. Inspect clamps, verify condition of all hoses, and identify any visual leaks.
2. Check temperature gauges for proper operation of the engine-jacket water heater.
3. Test the coolant's freezing point and verify coolant level.
4. Test the coolant additive package for proper corrosion inhibitors.
5. Inspect belt condition and tension; correct as required.
6. Inspect the radiator core for visual blockage or obstructions. Keep it clean, and remove any nearby debris that might blow into it and block it.
7. Inspect for proper operation of intake louvers, motorized or gravity type (if applicable).
8. Verify proper operation of remote radiator motor and belt condition (if applicable).
5.8.4 Control System
1. Verify and record output voltage and adjust the voltage regulator if necessary. Note that the only people who should be adjusting voltage regulators and governing and protective equipment are those specifically trained to make the adjustments.
2. Calibrate control meters.
3. Verify and record output frequency and adjust the governor if necessary.
4. Verify operation of all lamps on the control panel.
5. Inspect for any loose connections and terminals or discoloration. Thermographic inspection works well here.
5.8.5 Generator Mechanics
1. Inspect and lubricate the generator ball bearing. (Note: this is usually necessary only for older generators; most new models come lubricated for life. Also, when possible, this should be done by a qualified technician.)
2. Look for blocked cooling air passages around the alternator and check the general condition of the generator.
3. Inspect for abnormal vibration.
4. Verify connections and insulation condition.
5. Verify that the ground is properly attached.
6. Verify proper operation of the shunt trip on the mainline circuit breaker (if applicable).
5.8.6 Automatic and Manual Switchgear
1. Verify proper operation of the exercise clock (adjust if necessary).
2. Visually inspect all contacts and connection points.
3. Perform building load test (if practical).
4. Verify operation of all lamps on the control panel.
5.8.7 Load Bank Testing
Load bank testing (Figure 5.2) exercises a single piece of equipment. Tests verify proper operation of generators and associated equipment such as transfer switches. Exercising is part of a maintenance process to verify load-carrying capability and, sometimes, to remove deposits from injection systems that build up due to light-load operation. Perform a test of approximately 4 hours and produce a detailed report on the performance of your system. The duration of the test can be shorter in warm climates and may need to be longer in colder ones. The important factor is to run the test long enough that all temperatures (engine and generator) stabilize. (In addition to ensuring a proper test, this helps drive off condensation.) During the test, all vital areas should be monitored and recorded, including electrical power output, cooling system performance, fuel delivery system, and instrumentation. All of this is done with no interruption of normal power! Load bank testing is the only sure way to test your system. Manufacturers recommend that testing always be done at full load.
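The stabilization criterion above (run until engine and generator temperatures level off) can be expressed as a simple check on logged readings. The following Python sketch is illustrative only: the six-reading window and 2-degree tolerance are assumptions, not manufacturer figures.

```python
def temperatures_stabilized(readings, window=6, tolerance=2.0):
    """Return True when the last `window` logged temperature readings vary by
    no more than `tolerance` degrees. Window size and tolerance are
    illustrative assumptions, not manufacturer values."""
    if len(readings) < window:
        return False
    recent = readings[-window:]
    return max(recent) - min(recent) <= tolerance

# Coolant temperature (°F) logged at fixed intervals: climbing, then leveling off.
log = [150, 165, 178, 186, 191, 194, 195, 196, 195, 196, 196, 195]
print(temperatures_stabilized(log))  # True: the last six readings span only 1 °F
```

The same check can be applied independently to each monitored temperature so the test is not ended until every reading has leveled off.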
5.9 DOCUMENTATION PLAN

5.9.1 Proper Documentation and Forms
Figure 5.2. Load bank testing. (Courtesy of ADDA.)

When the generator is operating, whether in an exercise mode or the load-test mode, it must be checked for proper performance. Many facilities have chosen an automated weekly exercise timer. One of the unpleasant aspects of this type of system is that the generator is rarely checked when its weekly exercising is automated. Anytime the generator is run, it must be checked, and all operating parameters verified and documented. It is through the exercising and testing of the generator that deficiencies become evident. Yet many facilities with automated weekly exercise timers on the automatic transfer switches do not know that the generator has run, except by reading the incremental run-hour clock. If your facility selects this automated feature, be sure that the building engineer or maintenance technician is also available, so that the generator never runs without being checked and having the proper documentation completed. A weekly exercise is recommended but not always practical; however, manufacturers recommend that generators be exercised at least once per month.
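Since the run-hour clock is often the only evidence that an automated exercise actually occurred, weekly meter readings can be compared to flag missed runs. A minimal Python sketch, with an assumed 15-minute minimum run time:

```python
def missed_exercises(weekly_hour_readings, min_runtime_h=0.25):
    """Given run-hour meter readings taken once per week, return the indices of
    weeks in which the meter advanced less than `min_runtime_h` hours, i.e.,
    the automated exercise apparently did not run. The 15-minute threshold
    is an assumption, not a manufacturer figure."""
    missed = []
    for week in range(1, len(weekly_hour_readings)):
        if weekly_hour_readings[week] - weekly_hour_readings[week - 1] < min_runtime_h:
            missed.append(week)
    return missed

readings = [1203.0, 1203.5, 1204.0, 1204.0, 1204.6]  # no advance in week 3
print(missed_exercises(readings))  # [3]
```

A flagged week should trigger an investigation and a documented manual exercise, in keeping with the checking discipline described above.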
5.9.2 Record Keeping
Keeping good records of inspections can give better insight into potential problems that might not otherwise be obvious. Certain types of deterioration occur so gradually that they are hard to detect in a single inspection; combined with prior inspection data, however, a trend toward potential failure may become evident. For instance, measuring the insulation value of a cable without any past data might show it to be satisfactory, but compared against prior tests, the insulation value might be seen to have dropped a certain percentage at every test. It would then be easy to estimate an approximate failure time for the cable, before which the cable would need to be replaced.

The success of an effective PM program is based on good planning. When deciding when to schedule a shutdown, pick a time that will have minimal impact on the operation.

The continual proliferation of electronic equipment has created higher dependence on electrical power systems. As the hardware for electronic devices becomes more robust, the reliability of the electrical distribution may become the weak link in overall system capability. Many reliability considerations that were once reserved exclusively for mission critical applications are now commonplace in other facilities. Facilities managers play a more critical role than ever in the operational and financial success of a company. Now, when facilities managers speak about the electrical distribution system, facility executives are much more likely to listen.

The Internet is also becoming a key tool in shaping the way businesses communicate with customers. A growing number of maintenance operations are setting up home pages designed, among other things, to provide information to and foster communication with customers. The most active in this regard have been colleges and universities.
Additionally, a Web-based application can be set up so everybody can keep track of scheduled maintenance as well as to provide pertinent documents and spreadsheets.
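The insulation-value trend described above lends itself to a simple least-squares extrapolation. The Python sketch below is a hypothetical illustration: the readings, six-month test interval, and 25 MΩ minimum are invented for the example, and any real threshold should come from the applicable test standard.

```python
def estimate_failure_month(test_months, insulation_megohms, minimum_megohms):
    """Fit a least-squares line to periodic insulation-resistance readings and
    estimate the month at which the trend crosses a minimum acceptable value.
    Returns None if the readings show no downward trend."""
    n = len(test_months)
    mean_x = sum(test_months) / n
    mean_y = sum(insulation_megohms) / n
    sxx = sum((x - mean_x) ** 2 for x in test_months)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_months, insulation_megohms))
    slope = sxy / sxx
    if slope >= 0:
        return None  # insulation is steady or improving; nothing to extrapolate
    intercept = mean_y - slope * mean_x
    return (minimum_megohms - intercept) / slope

# Readings every 6 months show a steady decline from 100 MΩ toward an assumed
# 25 MΩ minimum acceptable value.
months = [0, 6, 12, 18]
values = [100.0, 90.0, 81.0, 72.0]
print(round(estimate_failure_month(months, values, 25.0), 1))  # about month 48.2
```

The estimate is only as good as the linear-decline assumption, but it gives a defensible planning horizon for scheduling the replacement before failure.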
5.10 EMERGENCY PROCEDURES
Even with the best testing and maintenance procedures, it is wise to prepare for a potential generator-set failure so that its effects are minimized. The generator-set emergency stops and shut-offs should be easily identifiable and labeled as such. A posted basic troubleshooting guide should be readily available right beside the unit, with emergency call numbers for affected areas and repair contractors. A clean copy of the generator's operation and maintenance (O&M) manuals, complete with schematics, should be kept in the generator room at all times. Never leave the original O&M manuals in the generator room, as they might become soiled and unusable. Keep the O&M originals in a safe place for making additional copies or for an emergency if any of the copies get misplaced or otherwise become unusable. The emergency stop and any contingency plan also need to be tested to prove their proper operation and effectiveness. This regular testing and review is key because it lets responsible people demonstrate their understanding of the system design and their ability to respond to the common failure modes of the system. It is recommended that the generator-set logbook contain the previously noted emergency telephone numbers, basic troubleshooting guidelines, and contingency plans. Although this may be redundant with the posted procedures, it provides an extra safeguard that is strongly recommended. With the emergency procedures detailed in the generator-set logbook, more precise information and instructions can be provided to the operating technician in an emergency situation.
5.11 COLD START AND LOAD ACCEPTANCE
An important concern of the standby-generator design engineer is how much time it takes for the standby or emergency generator system to sense and react to a loss of power or other power quality problems. A cold start does not mean a completely cold engine, but rather an engine that is not at operating temperature. Certain codes and standards stipulate that emergency generators must be able to pick up the emergency loads within 10 seconds in the United States and 15 seconds in Canada following a power failure. Once these loads are online, other critical loads can be connected. In most cases, there are two different standby generator systems: the first is for emergency loads such as the life safety systems; the second is for critical systems that support the data center's critical electrical and mechanical loads. The design criterion for a 10-second cold start is to equip the generator with coolant heaters or block heaters. The coolant heaters are necessary to start the machine and enable it to pick up full rated load in one step.

Figure 5.3. Generator control cabinet. (Courtesy of ADDA.)
5.12 NONLINEAR LOAD PROBLEMS
If the facility's electrical loads, such as computer power supplies, variable speed drives, electronic lighting ballasts, or other similar nonlinear electrical equipment, are furnished with switch-mode power supplies, it is imperative that you advise the generator supplier so that proper steps can be taken to avoid equipment overheating or other problems due to harmonics. Some generator manufacturers recommend low-impedance generators and have developed winding-design techniques to reduce the effects of the harmonic currents generated. The key issue is maintaining the ability to produce a stable voltage waveform and a stable frequency. This is mostly a matter of voltage regulation system design, but it can also be affected by the governing (fuel control) system design. The winding arrangement is not as critical as the impedance of the machine relative to the utility service: the closer it is to what the utility provides, the more like the utility it will operate and the lower the probability of problems. In some instances, the generator may have to be derated and the neutral size increased to safely supply power to nonlinear loads.

When a standby generator and UPS are integrated in a system together, problems occur that typically do not exist when a UPS system or generator operates alone. Problems arise only when the UPS and standby generator are required to function together. Neither the UPS nor the standby generator manufacturer is at fault, and both manufacturers would probably need to work together to solve the problem. The following are common problems and solutions when applying a design that incorporates a standby generator and UPS.
5.12.1 Line Notches and Harmonic Current
The UPS manufacturer can address both line notches and harmonic currents with a properly designed passive filter. Most generator manufacturers have derating information to address harmonic heating problems. However, an input filter on the UPS that reduces the harmonics to less than 10% at full load eliminates the need to derate the generator.
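The 10% figure can be checked by computing the total harmonic distortion (THD) of the UPS input current from its harmonic spectrum. A short Python illustration with hypothetical current values:

```python
import math

def thd_percent(fundamental_amps, harmonic_amps):
    """Total harmonic distortion of a current waveform as a percentage of the
    fundamental: the root-sum-square of the harmonic magnitudes divided by
    the fundamental magnitude."""
    rss = math.sqrt(sum(h * h for h in harmonic_amps))
    return 100.0 * rss / fundamental_amps

# Hypothetical filtered UPS input-current spectrum (amps): fundamental plus
# residual 5th, 7th, and 11th harmonics.
thd = thd_percent(100.0, [6.0, 4.0, 3.0])
print(round(thd, 1))  # 7.8, i.e., below the 10% level at which derating is avoided
```

In practice these magnitudes would come from a power-quality analyzer reading taken at full load.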
5.12.2 Step Loading
When a generator starts and the ATS connecting it to the UPS transfers, the instantaneous application of the load to the generator will cause sudden swings in both voltage and frequency. This condition can generally be avoided by verifying that the UPS has a walk-in feature. This requires that the UPS rectifier have some means of controlling power flow so that the power draw of the UPS can be applied to the generator gradually, over a 10-20 second time frame.
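The walk-in behavior amounts to ramping the rectifier's power draw over the 10-20 second window. A minimal Python sketch, assuming a simple linear ramp (real rectifier controls may shape the profile differently):

```python
def walk_in_setpoints(full_load_kw, walk_in_seconds, step_seconds=1.0):
    """Generate a linearly increasing series of rectifier power setpoints so
    the UPS load is applied to the generator gradually rather than in one
    step. A linear profile is an assumption; real rectifier controls may
    shape the ramp differently."""
    steps = int(walk_in_seconds / step_seconds)
    return [full_load_kw * (i + 1) / steps for i in range(steps)]

# Walk a 500 kW UPS draw onto the generator over 10 seconds in 1-second steps.
ramp = walk_in_setpoints(500.0, 10.0)
print(ramp[0], ramp[-1])  # 50.0 500.0
```

Each step is small enough that the governor and voltage regulator can follow without the sudden swings described above.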
5.12.3 Voltage Rise
This is an application problem that occurs when a generator is sized closely to the UPS and there is little or no other electrical load on the generator. When the UPS is first connected to the generator by the ATS, its charger has turned off so that it may begin its power walk-in routine. If the input filter is the only load on the generator, it may provide increased excitation energy for the generator. The issue is the capability of the alternator to absorb the reactive power generated by the filters; the amount that can be absorbed varies considerably between different machines, even from the same manufacturer. System designers should evaluate this capability in the initial design of the system to avoid problems. The outcome is that the voltage drifts upward without control until limited, at approximately 120% of nominal, by some fundamental generator design constraint, typically magnetic saturation of the generator iron. If the voltage does reach 120% of nominal, it will be damaging or disruptive to system operation. However, a UPS that disconnects its filter when its charger is off avoids this predicament altogether.
5.12.4 Frequency Fluctuation
Generators possess inherent limitations on how closely they can regulate frequency in response to changing electrical loads. The behavior is complicated and involves not only generator characteristics, such as rotational inertia and governor speed response, but also the electrical load's reaction to frequency changes. The UPS charger, conversely, has its own inherent limitations on how closely it can control its power draw from a source with fluctuations in voltage and frequency. Since both the generator controls and the UPS charger controls are affected by and respond to frequency, an otherwise small frequency fluctuation may become a nuisance. The most noticeable effect of this fluctuation is a recurring alarm on the bypass of the UPS, announcing that the generator frequency is changing faster than the UPS inverter can follow. To minimize or eliminate frequency fluctuation problems, good control design from both the engine-generator and UPS manufacturers is required. The engine must have a responsive governor, appropriately sized and adjusted for the system, and the UPS should have a control responsive to fast frequency fluctuations.
5.12.5 Synchronizing to Bypass
Some applications require the UPS to synchronize to bypass so that the critical load may be transferred to the generator. This generally places tighter demands on the generator for frequency and voltage stability, and when this is the case, the system integration problem may be intensified. As described above, good control design can usually resolve this problem.
5.12.6 Automatic Transfer Switch
Most generator/UPS projects incorporate automatic transfer switches that switch the UPS back to utility power once it becomes available again. The speed of transfer can be an obstacle and may result in a failed transfer, which in turn can lead to nuisance tripping of circuit breakers or damage to loads. If the ATS also serves motor loads, such as HVAC systems, the UPS input filter will supply excitation energy during the transfer. If the transfer occurs too fast, causing an unexpected phase change in the voltage, the consequences can be devastating for both the motors and the UPS. One of the best solutions is simply to slow the transfer switch operating speed so that the damaging condition does not occur; rather than switching from source to source in one-tenth of a second, slowing to one-half of a second will resolve the problem. UPS manufacturers can also address this problem by providing a fast means of detecting the transfer and disconnecting the filter.
5.13 CONCLUSION
This chapter has discussed selected ways to enhance the time factor in the terms generally used to describe reliability. This is not, and should not be regarded as, the final statement on reliable design and operation of standby generators. Many areas of this subject matter warrant further assessment. There is also a considerable lack of hard data on this subject, making practical recommendations difficult to substantiate. For example, there is virtually no data on the trade-offs between maintenance and forced outages. It is self-evident that some maintenance is required, but what is the optimum level? This question is easily answered if your organization's level of risk has been evaluated. However, there are certain common elements in both reliable and unreliable systems that I think are relevant, and I have listed them accordingly. These suggestions, when seriously considered, will help to increase the level of reliability of the standby system.
6 FUEL SYSTEMS DESIGN AND MAINTENANCE
6.1 INTRODUCTION
The sudden loss of electrical power has a different meaning to different people. If asked, someone may describe it as an inconvenience, such as not being able to play a game on the computer, watch television, or make coffee. But ask the same question of an intensive care nurse, law enforcement officer, electric utility worker, or a data center manager in the banking and financial industries, and the sudden loss of electricity may be explained in terms of loss of life, civil unrest, regional blackout, or possible financial upheaval.

Diesel engines have played a pivotal role in providing electrical power in many diverse nonemergency and emergency circumstances due to their reliable start-up and their ability to operate continuously under varying load conditions. Diesel engines are a source of power and electricity; therefore, diesel fuel is the lifeblood of any emergency or backup power system. Under normal everyday use, diesel engines consume diesel fuel well within the allotted shelf life of the fuel. However, fuel for diesel engines dedicated to providing periodic emergency power is generally stored well beyond the expected shelf life. This means that emergency diesel systems can potentially become inoperable due to issues associated with degradation of fuel quality. This chapter discusses the issues associated with long-term storage of diesel fuel, fuel quality, and how to monitor the useful life of diesel fuel dedicated to emergency power systems for mission critical applications.
6.2 BRIEF DISCUSSION OF DIESEL ENGINES
Little did Rudolph Diesel realize that his fascination with steam engine efficiency ratios would lead to the invention of the diesel engine, which quickly found its niche within applications that require power with high torque and low RPM. Modern diesel engines offer many advantages over other types of prime movers. Today, the diesel engine provides power and propulsion in a variety of applications such as agriculture, construction, transportation, heavy long-distance hauling, marine propulsion, mining, military vehicles, stationary power systems, and emergency power generators.

The diesel engine does not have igniters or spark plugs to assure proper fuel ignition, as do gasoline engines or jet turbines. Instead, the diesel engine depends totally on the compression stroke of the piston to create high cylinder temperatures to ignite the air-fuel mixture in the engine. This unique characteristic of a diesel engine requires the ignition to occur very close to the top of the piston stroke, such that the expanding burning fuel-air mixture produces the power stroke.

The original diesel engine engineered by Rudolph Diesel was designed to operate using vegetable oil, which some have speculated to be peanut oil. But with a little convincing, Rudolph Diesel redesigned his engine to run on a crude oil fractional distillate known as "middle distillate" or what is commonly referred to today as diesel fuel. Over the years, the technology used to convert crude oil into diesel fuel had an influence on the stability and shelf life of the fuel product. However, other factors, such as storage temperature, exposure to oxygen, and the presence of water and other fuel contaminants, can have a much greater influence on the shelf life of diesel fuel. Anecdotally, there are a number of examples of diesel fuel performing well after years of storage; however, there are a greater number of examples of diesel fuel failing after just months of storage.
Typically, consumption of diesel fuel occurs within 30 to 90 days after production. However, consumption of diesel fuel dedicated to mission critical systems can take place well beyond the general shelf life of 12 months. It is not uncommon for diesel fuel in mission critical systems to be stored well in excess of 12 months before being replaced. Consequently, onsite fuel storage and a fuel quality surveillance program play critical roles in the long-term availability and continuous operation of mission critical diesel engines.
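A fuel quality surveillance program can start with something as simple as tracking fuel age against the nominal shelf life. The Python sketch below uses the general 12-month figure cited above; the 75% warning threshold and the coarse 30-day month are assumptions for illustration.

```python
import datetime

def fuel_age_status(delivery_date, today, shelf_life_months=12):
    """Classify stored diesel fuel by age against a nominal shelf life.
    The 12-month figure and the 75% warning threshold are assumptions;
    actual limits depend on storage conditions and the fuel-quality program."""
    age_days = (today - delivery_date).days
    limit_days = shelf_life_months * 30  # coarse 30-day month approximation
    if age_days > limit_days:
        return "test or replace"
    if age_days > 0.75 * limit_days:
        return "schedule fuel-quality testing"
    return "within shelf life"

print(fuel_age_status(datetime.date(2010, 1, 15), datetime.date(2011, 3, 1)))  # test or replace
```

Age alone is not a substitute for laboratory fuel testing, but a check like this ensures that aging inventory is never simply forgotten.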
6.3 BULK STORAGE TANK SELECTION
Selecting the main bulk storage tanks is an important first step in system design. Many factors must be evaluated in order to establish the best configuration: aboveground or underground, single wall or double wall, steel or fiberglass. Since 1998, federal and state codes governing the installation of underground tanks have made it more attractive to consider aboveground bulk storage tanks. From an environmental standpoint, underground tanks have become a difficult management issue. In 1982, the President of the United States signed into law the reauthorization of the Resource Conservation and Recovery Act (RCRA). Federal and state codes now require very sophisticated storage and monitoring systems to be placed into operation in order to protect the nation's groundwater supply. In some instances, property owners prohibit the installation of petroleum underground storage tanks on their properties. On the other hand, aboveground storage tanks occupy large amounts of real estate and may not be esthetically pleasing. Here are some factors to consider when deciding between aboveground and underground bulk storage tanks.
6.3.1 Aboveground Tanks
Aboveground tanks are environmentally safer than underground tanks. Tanks and piping systems that are exposed aboveground are more stable and easier to monitor for leaks than underground systems. Since the tanks and piping are not subject to ground movement caused by settlement and frost, there is less likelihood of damage to the system and, consequently, less likelihood of an undetected leak.

However, if large tanks are required, an aboveground storage system can be awkward and will occupy large amounts of real estate. Code officials, especially fire departments, may have special requirements or may completely forbid the installation of large quantities of petroleum products in aboveground tanks. Day tanks and fuel headers may be located at the same elevation as, or occasionally lower than, the bulk storage tank, making gravity drains from these devices impossible to configure unless the generators are on or above the second floor of the building. Filling the tanks may require special equipment, and in some cases codes have even required secondary containment systems for the delivery truck, because the "pumped fill" is considered more vulnerable to catastrophic failures that can empty a compartment of the delivery truck. Additionally, constant thermal changes can stress the fuel, causing it to become unstable. Diesel fuel that is exposed to subfreezing temperatures during the winter requires special conditioning to make it suitable for low-temperature operability. If the fuel is not conditioned properly, wax formation may occur, clogging filters and fuel lines. Issues stemming from poor fuel quality are discussed further in Section 6.8 of this chapter.

Aboveground tanks are almost always fabricated from mild carbon steel. The outer surface of the tank is normally painted white to prevent the fuel from becoming heated due to sun exposure. The tank design should always include a manway and ladder to allow access to the tank for cleaning, inspection, and repair. Epoxy internal coatings compatible with diesel fuel give the fuel extra protection from eventual contamination by internal rust; the epoxy liner also provides extra corrosion protection to the internal tank surface. All aboveground tanks should be fabricated to UL142 or UL2085 standards and should bear the appropriate label. The UL142 tank is an unprotected steel tank in either single- or double-wall configuration. The UL2085 tank is a special tank encased in concrete; the concrete provides a listed fire rating.
6.3.2 Modern Underground Tanks and Piping Systems
Modern underground tanks and piping systems have become more reliable, easier to install, and generally easier to operate over the past several years. Since the tanks are usually buried under parking lots or driveways, they occupy little or no real estate. Drain and return lines from fuel headers and day tanks can drain by gravity back to their respective bulk storage tanks, making for more efficient piping systems.

On the down side, federal, state, and local codes usually impose construction standards that may be difficult to comply with. The underground storage tank must be designed and installed in accordance with the Code of Federal Regulations, 40 CFR 280. In addition, the appropriate state and local codes must also be followed. State and local governments must, at a minimum, follow the federal code; however, they may implement local codes that are more stringent. Obtaining permits can be a long and tedious process, with many agencies involved in the approval process. Most important, since the tank and piping system are out of sight, leaks can go undetected for long periods of time if monitoring systems and inventory control are not meticulously maintained.

Underground tanks are most commonly fabricated from mild carbon steel or fiberglass-reinforced epoxy. Steel tanks must be specially coated with either fiberglass-reinforced epoxy or urethane coatings to protect them from external corrosion. In addition to the external coating, some steel tanks are provided with cathodic protection to protect the exterior of the tank from corrosion. Fiberglass tanks are usually not coated. The most common manufacturing standard for steel tanks is UL58; for fiberglass tanks, it is UL1316. Similar to aboveground tank designs, underground tanks should always include a manway and ladder to allow access to the tank for cleaning, inspection, and repair.
For steel tanks, epoxy internal coatings compatible with diesel fuel give the fuel extra protection from eventual contamination by internal rust; they also provide extra corrosion protection to the internal tank surface.
6.4 CODES AND STANDARDS
Codes and standards should always be reviewed prior to any discussion of the type and location of the bulk storage tank. A manager should understand that an improperly designed fuel system is a huge risk, not just with regard to reliability but also with respect to environmental concerns. An improperly installed or maintained fuel tank can expose a company to millions of dollars of liability for cleanup of spilled fuel. It is a good idea to have a preliminary meeting with the governing code agencies in order to discuss the conceptual design and requirements of the system. Do not assume that merely because the state building code allows the use of aboveground storage tanks, there will not be opposition from one of the other agencies that may have jurisdiction. Fire marshals, in particular, may be completely opposed to the installation of large aboveground diesel storage tanks, especially when they are located in metropolitan areas.

The usual sequence to follow when establishing the priority of compliance is to start with the state building code. The state building code will normally reference either NFPA or the Uniform Fire Code for the installation of flammable and combustible storage systems. NFPA 30, Flammable and Combustible Liquids Code, addresses the installation of storage tanks and piping systems. NFPA 37, Standard for the Installation and Use of Stationary Combustion Engines and Gas Turbines, addresses the piping systems, day tanks, and flow control systems and, in some instances, references NFPA 30. The Uniform Fire Code, Article 79, Flammable and Combustible Liquids, addresses the installation of flammable liquid storage systems. In addition, the U.S. EPA has codes in place that address performance standards for both aboveground and underground storage systems. Underground storage tank standards are covered in 40 CFR 280; this code sets minimum standards that must be followed by all states, and individual states have the option of requiring more stringent standards. Aboveground storage tanks are covered in 40 CFR 112. This code also covers certain topics relative to threshold storage limits, for both aboveground and underground tanks, that require the owner to maintain a document known as a spill prevention, control, and countermeasure plan.
6.5 RECOMMENDED PRACTICES FOR ALL TANKS
There are some recommended practices that should be adhered to regardless of what the codes require. Bulk storage tanks should always be provided with some form of secondary containment, regardless of the requirements of the local authority. Anyone who has ever been involved in a project connected to a leaking underground storage tank understands the astronomical costs that can be incurred. In addition, data centers are usually high-profile facilities, and the negative press associated with a petroleum spill contaminating underground water supplies can be horrific. The most popular system, and the easiest to maintain, is the double-wall tank. Open containment dikes for aboveground tanks are a maintenance headache, especially in cold climates where removing snow and ice is very difficult. Similarly, for underground tanks, cutoff walls and flexible liners are not effective or efficient ways to provide secondary containment; the double-wall tank is efficient and easiest to maintain.

Piping systems for underground tanks can also be furnished in a multitude of different configurations. Secondary containment should always be provided for underground piping. Like the underground tank, the piping is out of sight and can cause grave environmental damage if a leak goes undetected for long periods of time. Some of the choices for piping are double-wall carbon steel, double-wall stainless steel, and double-wall fiberglass. Probably the most popular system in use today is the double-wall flexible piping system. It is fairly easy to install and is UL listed for use with underground petroleum storage systems. Since underground piping runs each have only two connections (one at the tank and the other where the piping leaves the ground and attaches either to a device or an aboveground pipe), there is little opportunity for leaks at joints.
There are complete systems available, including high-density polyethylene termination sumps that provide secondary containment where the piping attaches to the tank and also allow access from grade without excavation to service the
termination fittings, as well as submersible pumps and any other equipment, such as tank monitoring equipment. Figure 6.1 describes basic installation practices for all tanks, whether aboveground or underground. Ideally, the floor of the fuel storage tank should be installed with a slight pitch to encourage the collection and removal of accumulated water from the low point in the tank. As discussed in Section 6.8 of this chapter, water accumulation in a fuel storage tank is associated with the risk of microbial contamination, which can degrade the quality of the stored diesel fuel, corrode the storage tank, and result in fuel system leaks. A spill-containment fill box should be located at the opposite end of the tank with a drop tube terminating close to the bottom of the tank. Introducing new fuel at this end of the tank will help to move any water accumulated at the bottom of the tank to the opposite end, where the water pumpout connection is located. The fuel maintenance system should draw fuel from the lowest point possible in the tank and return it to the opposite end of the tank. Piping systems for aboveground tanks provide fewer choices: carbon steel and stainless steel are the primary options. When the piping runs are in open areas such as generator rooms and machine runs, the piping is generally installed as single-wall pipe. If the piping runs are in more critical areas, or where codes require secondary containment, it is usually installed as a double-wall piping system. However, there are some problems associated with the double-wall system. If a leak develops in the primary pipe, it can be difficult to locate the source of the problem, since the leak tends to fill the annular space between the primary and secondary pipe. Another concern is the inability to contain valves, flexible connectors, and other vulnerable fittings.
Typically, the double-wall pipe is terminated with a bulkhead at valves and flexible connectors, leaving these
Figure 6.1. Basic installation practices for all tanks, whether aboveground or underground. (Courtesy of Mission Critical Fuel Systems.)
devices unprotected. Since the piping systems are under rather low pressure and the product, diesel fuel, is not very corrosive, leaks are usually minor. For this reason, wherever possible, install single-wall carbon steel pipe and either place it into trenches or provide curbs around the area where it is installed. The area is then monitored with leak-detection equipment. If a leak develops, an alarm warns the facility operations personnel of the problem. Since the piping is exposed, the source of the leak can be readily determined and corrected. The pipe trench or curb will also contain the leak, thereby preventing widespread damage. Fuel distribution piping materials are of major concern. Many of the performance problems that occur with diesel fuel systems are directly related to the materials of construction. There are two categories of materials that should be avoided. The first group is described in NFPA 30. These materials include low-melting-point materials that may soften and fail under exposure to fire. They include aluminum, copper, brass, cast iron, and plastics. There is, however, one exception to this rule. NFPA 30 allows the use of nonmetallic piping systems, including piping systems that incorporate secondary containment. These systems may be used underground if built to recognized standards and installed and used within the scope of Underwriters Laboratories Inc.'s Standard for Nonmetallic Underground Piping for Liquids, UL 971. The second group is materials that contain copper or zinc. Although these materials may be commonly found in many fuel distribution systems, their contact with diesel fuel should be avoided, as described in ASTM D975, Standard Specification for Diesel Fuel Oils. When copper or zinc comes in contact with diesel fuel, these materials can form gummy substances that can cause filter clogging. The most popular and practical piping material is carbon steel.
The piping should comply with the applicable sections of ANSI B31, American National Standard Code for Pressure Piping. Pipe joints, wherever possible, should be either butt-welded or socket-welded. Where mechanical joints are required, flanges provide the best seal. Flange gaskets should be fiber-impregnated with Viton®. Threaded connections should be avoided wherever possible. Flexible connectors should be installed wherever movement or vibration exists between piping and machinery. The flexible connectors should be listed for the appropriate service. Careful consideration should be given to the type and placement of isolation valves. Carbon steel or stainless steel ball valves with Teflon® trim and Viton® seals (Teflon® and Viton® are registered trademarks of DuPont Performance Elastomers) provide excellent service. Once again, threaded end connections should be avoided; socket-welded or flanged end connections provide excellent service. If the end connections are welded, consider the use of three-piece ball valves in order to facilitate valve service and repair without disturbing the welded joint. Fire-safe valve designs should be considered, especially where codes may mandate their use. Thermal expansion of diesel fuel in piping systems should be carefully examined during the design process. The temperature of diesel fuel supplied from underground tanks is approximately 60°F. When the fuel enters a warm building and resides in the fuel piping system while the generators are not running, it will expand due to the change in temperature. If the particular pipe run has closed control valves, check
valves, or manual valves at both ends of the pipe run, the expanded fuel develops enormous pressures and leaks through threaded fittings, flange gaskets, valve seals or wherever it can find a weak spot. These nuisance leaks usually occur a day or two after the generators are tested. Once the fuel expands and is expelled from the system, the leak ceases until the next generator run. If this condition is present in an existing fuel system, thermal expansion is usually the culprit. Adding pressure relief valves to the section of piping associated with the problem is the ultimate fix. The blow-off from the PRV should be piped to a fuel oil return pipe that empties into the primary fuel storage tank. Fuel maintenance systems or fuel polishing systems are frequently included in new installations where diesel fuel is stored for prolonged periods of time, that is, one year or longer. These systems were not common several decades ago. However, these systems are becoming a must for anyone who stores diesel fuel for emergency standby power. Fuel quality will be discussed in greater detail later in this chapter. As shown in Figure 6.1, the tank is pitched to a low area to encourage settlement of water. The suction stub for the fuel maintenance system is terminated as close to the bottom of the tank as possible at this end of the tank. When the system operates, fuel is pumped through a series of particulate filters and water separators, and then returned to the tank at the opposite end. The operation of the system is timed in order to circulate approximately 20% to 25% of the tank volume once per week. The tank fill connection should be in a spill containment fill box or spill bucket. The spill bucket is designed to contain small spills created during connection or disconnection of the delivery hose. The capacity of the spill bucket is dictated by the authority having jurisdiction. 
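The thermal-expansion and fuel-polishing rules of thumb above lend themselves to quick arithmetic. The sketch below is illustrative only: the expansion coefficient and compressibility are typical published values for diesel fuel, not figures from this chapter, and actual designs should use manufacturer data.

```python
# Back-of-envelope checks for two rules of thumb in the text.
# Assumed property values are typical published figures, not from this chapter:
BETA_PER_F = 0.00046      # diesel volumetric thermal expansion, 1/degF (typical)
KAPPA_PER_PSI = 5.0e-6    # diesel compressibility, 1/psi (typical)

def trapped_pipe_pressure_rise_psi(delta_t_f: float) -> float:
    """Pressure rise of diesel locked between closed valves in rigid pipe."""
    return BETA_PER_F * delta_t_f / KAPPA_PER_PSI

def expansion_volume_gal(pipe_volume_gal: float, delta_t_f: float) -> float:
    """Volume a relief valve must pass as the trapped fuel warms."""
    return pipe_volume_gal * BETA_PER_F * delta_t_f

def weekly_polishing_hours(tank_gal: float, pump_gpm: float,
                           turnover_fraction: float = 0.225) -> float:
    """Run time per week to circulate ~20-25% of the tank volume."""
    return tank_gal * turnover_fraction / pump_gpm / 60.0

# 60 degF fuel warming to 90 degF inside the building:
print(trapped_pipe_pressure_rise_psi(30))   # thousands of psi if no relief valve
print(expansion_volume_gal(17.4, 30))       # only a fraction of a gallon expelled
print(weekly_polishing_hours(10_000, 10))   # a few hours of pump run per week
```

The pressure result shows why even a tiny expelled volume produces the nuisance leaks described above, and why a small pressure relief valve piped back to the storage tank is the fix.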
The fill box should be placed at the high end of the tank, encouraging the movement of any sediment and water to the opposite end of the tank, where it will be removed by the fuel maintenance system. Many states now require annual leak testing of these spill containment fill boxes. A failed test means replacement of the spill bucket. Since it is cast into the concrete top pad over the tank, replacement is a costly proposition resulting in removal and replacement of the concrete pad. Several manufacturers have developed a double-wall retractable spill bucket that can be replaced without removal of the concrete; this should be considered for new installations. A connection similar to the fill connection and spill containment fill box should be located as close as possible to the opposite, or low, end of the tank. From this connection, any accumulated water may be removed by manually pumping it from the tank. An access manhole and internal ladder provide convenient access to the tank for cleaning and maintenance. For underground tanks, the manhole should be accessible through an access chamber that terminates at final grade. The bulk storage tanks should be provided with an automatic gauging and monitoring system. Systems range from a simple direct-reading gauge to very sophisticated electronic monitoring systems that provide automatic inventory control, temperature-compensated delivery reports, system leak monitoring, and underground tank precision testing. Many systems include gateways that allow them to interface with many of the popular building automation systems (BAS). These systems are especially beneficial since any unusual conditions are easily identified at the BAS console. An evaluation should be performed to determine the appropriate system for the application. Prior to deciding on a tank monitoring system, a compliance check should be performed to establish whether the local authority has special requirements for tank monitoring.
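To make the BAS interface concrete, here is a minimal, hypothetical sketch of the kind of threshold logic a tank gauging gateway might present at a BAS console. All names, fields, and alarm limits are invented for illustration; real systems expose vendor-specific points.

```python
# Hypothetical sketch of gauging-system alarm logic reported to a BAS console.
# Field names and thresholds are illustrative, not from any vendor's product.
from dataclasses import dataclass

@dataclass
class TankReading:
    level_pct: float          # fuel level, % of usable capacity
    water_level_in: float     # water detected at the tank bottom, inches
    interstitial_leak: bool   # sensor in the annular space of a double-wall tank

def bas_alarms(r: TankReading,
               low_pct: float = 50.0,
               high_pct: float = 95.0,
               max_water_in: float = 0.75) -> list:
    """Return alarm strings for the BAS console; an empty list means normal."""
    alarms = []
    if r.interstitial_leak:
        alarms.append("LEAK: interstitial sensor wet")
    if r.water_level_in > max_water_in:
        alarms.append("WATER: bottom water above limit")
    if r.level_pct < low_pct:
        alarms.append("LOW LEVEL: schedule fuel delivery")
    elif r.level_pct > high_pct:
        alarms.append("HIGH LEVEL: possible overfill")
    return alarms

print(bas_alarms(TankReading(level_pct=42.0, water_level_in=1.0,
                             interstitial_leak=False)))
```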
6.6 FUEL DISTRIBUTION SYSTEM CONFIGURATION
The next step is assembling the pieces into a fuel storage and distribution system. It is strongly advised to divide the bulk fuel storage into at least two tanks. This is a practical approach to providing a means to quarantine new fuel deliveries and allows a laboratory analysis to be performed before the new fuel is introduced into the active fuel system. Fuel testing and maintenance will be covered later in the chapter. Additionally, if it is necessary to empty one of the tanks for maintenance purposes, if the fuel becomes contaminated in one of the tanks, or if for any other reason one of the tanks becomes unusable, switching to the alternate tank is a simple task. Without the alternate tank, it may be necessary to install a temporary tank and piping system, which can be a major undertaking. The simplest of fuel distribution systems is a single tank associated with a single diesel generator. Common configurations are underground tank, aboveground tank, and generator subbase tank. Suction and return piping must be sized within the operating limits of the generator's fuel pump; improper sizing is a common error. The fuel flow rate used for calculating the pipe size is approximately three times the fuel consumption rate; the additional fuel is used to cool the diesel engine injectors and is then returned through the return line to the storage tank. A more common system is a combination of bulk storage tanks and smaller day tanks located closer to the generators. The day tanks can be furnished in several different configurations. Generally, the configuration is one day tank per generator. This makes each generator system autonomous for better reliability. However, day tanks can provide fuel to several generators. The size of the day tank is dependent on several factors, including a predetermined period of fuel supply, usually calculated at generator full load.
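As a numeric illustration of the sizing guidance above, the 3x flow rule and the full-load supply period can be sketched as below. The consumption figure of roughly 0.07 gal per kWh at full load is a generic planning value, not from this text; always size from the engine manufacturer's fuel curves.

```python
# Illustrative sizing arithmetic for the rules of thumb in the text.
# GAL_PER_KWH_FULL_LOAD is a typical planning value, not manufacturer data.
GAL_PER_KWH_FULL_LOAD = 0.07

def full_load_consumption_gph(gen_kw: float) -> float:
    """Approximate full-load fuel consumption in gallons per hour."""
    return gen_kw * GAL_PER_KWH_FULL_LOAD

def pipe_sizing_flow_gph(gen_kw: float) -> float:
    """Suction/return lines see roughly 3x consumption; the excess cools injectors."""
    return 3.0 * full_load_consumption_gph(gen_kw)

def day_tank_gal(gen_kw: float, hours: float) -> float:
    """Day tank capacity for a chosen supply period at full load."""
    return full_load_consumption_gph(gen_kw) * hours

# A hypothetical 2000 kW generator with a 4 hour day tank:
print(full_load_consumption_gph(2000))   # about 140 gal/h
print(pipe_sizing_flow_gph(2000))        # about 420 gal/h for pipe sizing
print(day_tank_gal(2000, 4))             # about 560 gal day tank
```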
Other determining factors are maximum fuel quantity threshold limits as specified by the governing code. Figure 6.2 shows a typical fuel storage and distribution system flow diagram utilizing two bulk storage tanks and three day tanks. The fuel transfer pumping system consists of two submersible petroleum transfer pumps in each of the bulk storage tanks. Systems utilizing submersible petroleum pumps are efficient and dependable, and the pumps are more easily selected. The alternative is the positive displacement fuel transfer pump, which typically is located inside the facility, usually in an equipment room. If a positive displacement pump is selected, it is vital that the pressure loss in the fuel suction line from the bulk storage tank to the pump inlet be carefully calculated within the pump performance limits. Each of the submersible fuel transfer pumps should contain a dedicated supply pipe from the pump to the manifold loop that supplies the day tanks. In particular, if the bulk storage tanks are located underground, the supply piping is also underground and vulnerable to damage from ground movement due to settling and frost. It may also be damaged by excavating. This is a good place to mention that underground piping, especially nonmetallic piping systems, is very fragile. Therefore, it is good practice to protect the piping with concrete top slabs and early warning tape.
Figure 6.2. Typical fuel storage and distribution system flow diagram.
The fuel oil supply manifold shown in Figure 6.2 is piped in a loop configuration. This allows fuel flow to the day tanks from two separate directions. By placing isolation valves at strategic locations in the manifold, sections of the manifold can be isolated for maintenance or repair without disabling the entire system. Final fuel distribution is accomplished by supplying fuel via two control valves in parallel piping circuits to each day tank. Since a failed control valve will disable the day tank that it services, providing a redundant valve in a parallel circuit eliminates the single point of failure created if a single control valve is used. Control valves should always fail closed during a power failure in order to prevent a day tank overflow. As a backup to the control valves, there should be a manual bypass valve that will allow the tank to be filled manually in case a catastrophic control failure disables both of the control valves or the day tank control panel. Finally, manual isolation valves should always be provided around the control valves. This allows isolation of a control valve for servicing without disabling the remainder of the system. A day tank overflow line should be provided as a final spill-prevention device. In the event of a control system or control valve failure, the overflow line should be adequately sized to return fuel to the bulk storage tank and prevent the eventual release of
diesel fuel into the facility or to the environment. The applicable code should be closely followed in order to comply. If a diverter valve, shown in Figure 6.2, is required to divert fuel to the selected supply tank, the ports should be drilled for transflow. This will prevent the valve from stopping flow if the actuator stalls between valve positions. An overflow line can be provided between the bulk storage tanks to prevent an overflow if the diverter valve actuator fails, leaving the valve sequenced to the wrong tank. This will prevent the overflow of one of the bulk storage tanks if fuel is supplied from one tank and inadvertently returned to the other. The overflow line should not be confused with an equalization line. An equalization line enters both bulk storage tanks through the top and terminates near the bottom. The line is then primed with fuel, causing a siphon to begin, which allows the fuel level in both tanks to equalize. It also allows contaminated fuel or water from one tank to transfer to the other tank. For this reason, the use of an equalization line should be avoided.
6.7 DAY TANK CONTROL SYSTEM
There are many commercially available day tank control systems. They vary from simple packaged systems, best suited to controlling a single day tank associated with a single bulk storage tank, to very elaborate packaged systems that control multiple pumps associated with multiple bulk storage tanks and multiple day tanks. There are several important steps that should be followed when establishing the appropriate system, beginning with the expected degree of reliability. The bulk of this discussion will be directed to the highest degree of reliability; from that level, optional equipment can be eliminated or tailored to fit the budget and the degree of reliability necessary. Step 1 is to establish the day tank capacity. Assuming that the generators and day tanks are located inside the facility, as opposed to the generators being packaged and located in a parking lot, code restrictions will dictate the threshold limit of fuel that may be placed inside the facility. There are a few techniques that may increase that threshold limit. One technique is to place the day tanks in fire-rated rooms. In addition to increasing the threshold limit, this will also provide some additional physical protection for the day tanks. There is also an exception, added to NFPA 37 in the 1998 edition, stating that "Fuel tanks of any size shall be permitted within engine rooms or mechanical spaces provided the engine or mechanical room is designed using recognized engineering practices with suitable fire detection, fire suppression, and containment means to prevent the spread of fire beyond the room of origin." Common day tank capacities range from 20 minutes to as much as 24 hours of fuel supply at full load. Between 2 and 4 hours of fuel supply is a very common range. An important consideration is the amount of fuel remaining after a low-level alarm is activated.
If the facility is manned 24/7, 1 hour of fuel supply after the low-level alarm is activated is usually adequate. The day tanks should be provided with some form of secondary containment, such as a double-wall tank or an open-top or closed-top dike. Open-top dikes are difficult to keep clean. On the other hand, if a leak develops at a device that is installed through the top of the tank, the open-top dike will capture it, whereas closed-top dikes
and double-wall tanks will not. The tank should be labeled either UL 142 or, in some special cases as required by the permitting authority, UL 2085. Step 2 is to establish a pumping and piping system configuration. As mentioned earlier, submersible petroleum pumps are efficient and dependable, and the pumps are more easily selected than positive displacement pumps. The pumps should be labeled in accordance with the UL 79 standard. When establishing the piping system configuration, it is most efficient to place bulk storage tanks and day tanks in a configuration that allows the day tank overflow and day tank drain to flow by gravity back to the bulk storage tanks. By doing this, commissioning and periodic testing of the control system is simplified, since fuel must be drained from the day tanks in order to adequately exercise and test the level control system. If the piping system does not allow for a gravity drain, a pumped overflow will be necessary. Prior to selecting a pumped overflow, discuss the concept with the permitting agency. Step 3 is to establish a control system. The control system that affords the best reliability utilizes separate control panels for each of the day tanks and pump controls. An important concept in the control system design is the ability to allow manual operation of the entire system in the event of a catastrophic failure of the control system. This includes manual operation of pumps, bypass valves that will allow fuel flow around control valves, and manual tank gauging. Keeping in mind that a day tank may contain several hours' worth of fuel for the connected generator, it is possible to start a pump, manually open a bypass valve, and fill the day tank in just a few minutes. With the day tank now full, the facility operator can more easily assess the control problem and resolve it. Figure 6.3 shows a poorly arranged system. This type of arrangement depends on the control panel to control pumps and control valves.
A failure of something as simple as the power feed to the control panel will disable the entire system. Figure 6.4 describes a system using the same components, but arranges them into a configuration that allows the system to operate in a manual mode if the control panel fails. Additionally, the motor starters each have a diverse electrical feed, which increases the reliability further. This system, however, still depends on a single control panel for logic for the entire system. A further improvement would be to provide individual control panels for each of the day tanks with a separate panel to control the pumps. This will allow the day tanks to function individually, preventing a single point of failure at the fuel system control panel. With this system, if the pump controls fail, a pump could be started manually and the day tanks would continue to cycle with their independent control panels. Communication between all panels can either be hard wired or via a communication bus (see Figure 6.5).
Figure 6.3. Poorly arranged system. (Courtesy of Mission Critical Fuel Systems.)
Figure 6.4. System using the same components as in Figure 6.3, arranged into a configuration that allows the system to operate in a manual mode if the control panel fails. (Courtesy of Mission Critical Fuel Systems.)
Figure 6.5. System using individual components to prevent a single point of failure. (Courtesy of Mission Critical Fuel Systems.)
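The fill cycle that each independent day tank panel performs can be sketched as a simple hysteresis loop with fail-closed behavior on loss of control power. This is an illustrative model only: setpoint names and values are invented, and real panels are implemented in relay logic or PLCs.

```python
# Minimal sketch of a per-day-tank level-control cycle as described in the text.
# Setpoint names and values are illustrative, not from the text.
LOW_SP = 60.0        # % level: open fill valve, call for transfer pump
HIGH_SP = 90.0       # % level: close fill valve
CRITICAL_LOW = 25.0  # % level: low-level alarm

class DayTankController:
    def __init__(self):
        self.valve_open = False

    def step(self, level_pct: float, control_power_ok: bool) -> dict:
        if not control_power_ok:
            # Control valves fail closed on power loss to prevent overflow;
            # the tank must then be filled via the manual bypass valve.
            self.valve_open = False
            return {"valve_open": False, "alarm": "CONTROL POWER FAIL"}
        if level_pct <= LOW_SP:
            self.valve_open = True
        elif level_pct >= HIGH_SP:
            self.valve_open = False
        alarm = "LOW FUEL" if level_pct <= CRITICAL_LOW else None
        return {"valve_open": self.valve_open, "alarm": alarm}

c = DayTankController()
print(c.step(55.0, True))   # below LOW_SP: valve opens
print(c.step(75.0, True))   # mid-band: valve stays open (hysteresis)
print(c.step(92.0, True))   # above HIGH_SP: valve closes
print(c.step(70.0, False))  # control power lost: valve fails closed
```

The hysteresis band is what lets each tank "continue to cycle" on its own panel even if the shared pump controls are being operated manually.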
Figure 6.6 further increases reliability by adding redundancy for critical day tank components. The first location is a redundant control valve. Control valves are notoriously prone to failure. Providing two valves in parallel piping circuits increases reliability. Motor-actuated ball valves are less troublesome than solenoid valves, but come at a higher cost. Control valves should fail closed to prevent day tank overflow during power failure. The control valves should also have separate power circuits with separate fuse protection. A common single point of failure is to provide two control valves both of which are wired to the same circuit in the control panel. A short circuit in one valve actuator will disable both control valves if they are being fed from the same circuit. When redundant devices are incorporated into the controls scheme, it is of utmost importance to provide continuous monitoring for each device. Getting back to the control valve scenario, if there are two control valves in parallel fluid circuits, as shown in Figure 6.6, a failure of one of the valves would not be apparent without valve monitoring. The day tank would appear to function normally with diesel fuel flowing through the operating valve. The first indication of a problem appears when the redundant valve also fails. Monitoring devices such as actuator end switches or flow switches should be used to monitor the devices for failure.
Figure 6.6. System that further increases reliability by adding redundancy for critical day tank components. (Courtesy of Mission Critical Fuel Systems.)
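The value of monitoring redundant devices can be illustrated with a toy simulation: two valves in parallel whose failures are either annunciated and repaired at once, or discovered only at periodic inspection. The failure probability and inspection interval are arbitrary illustrative numbers, not field failure data.

```python
# Toy Monte Carlo comparing parallel redundant fill valves with and without
# failure monitoring. P_FAIL and the inspection interval are illustrative.
import random

P_FAIL = 0.01      # chance a working valve fails on any given fill demand
DEMANDS = 100_000

def system_failures(detect_immediately: bool, inspect_every: int = 50) -> int:
    random.seed(1)
    ok = [True, True]          # two valves in parallel piping circuits
    failures = 0
    for demand in range(DEMANDS):
        for i in (0, 1):
            if ok[i] and random.random() < P_FAIL:
                ok[i] = False  # valve fails silently
        if not any(ok):
            failures += 1      # day tank cannot fill: system failure
        if detect_immediately:
            ok = [True, True]  # monitoring flags the failure; repair at once
        elif demand % inspect_every == 0:
            ok = [True, True]  # failure found only at a periodic inspection
    return failures

print(system_failures(True))   # monitored: both must fail on the same demand
print(system_failures(False))  # unmonitored: latent single failures dominate
```

The unmonitored case fails orders of magnitude more often, which is the point made above: without end switches or flow switches, the first indication of trouble is when the second valve also fails.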
The last item is selection of day-tank level-control devices. Float-type controls are the most common devices used for day tank control. If this type of device is used, the floats should be either stainless steel or another high-quality material compatible with diesel fuel. Avoid the use of copper or copper-containing alloys for reasons stated earlier in this chapter. Noncontact-level measuring devices such as ultrasonic transmitters are increasingly popular. Due to the decline in diesel fuel quality over the past few decades, gummy substances may begin to form on float controllers, rendering them inoperative. The noncontact devices are less likely to be affected by this condition.
6.8 DIESEL FUEL AND A FUEL QUALITY ASSURANCE PROGRAM
Following the enactment of the 1970 Clean Air Act, the United States Environmental Protection Agency (USEPA) has implemented programs designed to improve ambient air quality by reducing the emissions from highway, nonroad, locomotive, and marine engines. Generally, these programs involve the reduction of sulfur in the nation's diesel fuel supply. Recently, the nation's diesel fuel supply has undergone two major changes intended to improve ambient air quality by reducing sulfur content: the introduction of ultralow sulfur diesel and the introduction of biodiesel fuels. In 2006, the USEPA launched the ultralow sulfur diesel (ULSD) program. The ultralow concentration of sulfur in the diesel fuel (≤15 ppm) protects the newer emission systems, such as catalytic converters and diesel particulate filters of the newer model diesel engines, which are designed to remove harmful emissions and particulates from diesel engine exhaust to very low levels. The goal of the program is to ensure that by the year 2014 all highway, nonroad,* locomotive, and marine vehicles will be using a diesel fuel that meets the maximum specification of 15 parts per million (ppm) sulfur. After years of testing and evaluation, biomass-derived diesel fuel (i.e., biodiesel) was introduced into the nation's fuel supply. Biodiesel blended fuels such as B5 and B20 offer another way to reduce sulfur and our need for foreign crude by extending current fuel supplies, and offer individuals a "green" alternative to power their diesel engines. In 2008, under the jurisdiction of ASTM† Committee D02 on Petroleum Products and Lubricants and Subcommittee E, ASTM D975, Standard Specification for Diesel Fuel Oils, was modified to reflect the increased use of biodiesel and the practice of adding up to five percent biodiesel to certain grades of diesel fuel, allowing users to comply with state mandates for the use of renewable diesel fuel.
Also in 2008, ASTM D7467, Standard Specification for Diesel Fuel Oil, Biodiesel Blends, was created to address fuel quality specifications for biodiesel blends ranging from B6 up to B20. It is estimated that over 95% of petroleum-based diesel fuel is manufactured, released to market, and consumed within a relatively short period of three months or less. During this period, the diesel fuel easily remains in specification and is always available for use, excluding fuel quality issues that may arise due to unsatisfactory fuel storage system conditions, cleanliness issues, or unusually low temperatures. Historically, most conventional diesel fuel has been very stable and has been stored successfully for many years with minimal degradation and limited remediation required. Emergency diesel generators play a critical role in supplying emergency power when there is a loss of power. Typically, the design function of the emergency diesel generator requires that it be able to operate for extended periods of time before refueling. In some cases, large volumes of diesel fuel must be stored on-site in order to meet the operational requirement. However, during the long periods between emergencies, these systems are operated or tested much less frequently. This generally results in a very small volume of diesel fuel consumed during a 12 month period. Therefore, the storage time of on-site diesel fuel can be in excess of 12 months, which is classified as long-term storage. In some cases, diesel fuel that is stored for longer than 12 months has the potential to undergo fuel degradation that could adversely affect fuel quality, which in turn can affect the operability and reliability of the emergency diesel generators. Experience has demonstrated that small amounts of organic sediment that form in diesel fuel stored over long periods of time must be removed by filtering, or "polishing," the fuel in order to keep the bulk-stored diesel fuel clean and suitable for immediate use. Another issue that affects the long-term storage of petroleum-based diesel fuels is the presence of biodiesel blended with diesel fuel. To date, the industry's experience with biodiesel blends has resulted in concerns regarding the stability of the fuel under conditions of long-term storage.
*http://www.epa.gov/nonroad-diesel/2004fr/industry.htm
†ASTM International, 100 Barr Harbor Drive, P.O. Box C700, West Conshohocken, PA 19428-2959, www.astm.org.
Typically, for diesel fuels that have been blended with up to 5% biodiesel, it is strongly advised by the industry that these fuels be consumed within six months.* Therefore, due to the short-term storage restrictions associated with biodiesel, diesel fuel allocated for use in emergency power generators should be restricted to petroleum-based diesel fuel that contains no biodiesel (i.e., fatty acid methyl ester, or FAME).
*"Biodiesel Myths Busted," http://www.biodiesel.org/pdf_files/fuelfactsheets/Myths_and_Facts.pdf; "Biodiesel Usage Checklist," http://www.biodiesel.org/pdf_files/Usage_Checklist.pdf.
The basis for every successful fuel quality assurance program is a clear program objective, valid performance criteria, valid quality indicators, and a willingness to make process improvements from lessons learned. In the case of mission critical systems involving emergency diesel generators, the program objective is to ensure the reliability and operability of the diesel fuel so that mission critical systems are operable when called upon for service. A fuel quality assurance program considers a variety of issues that can impact fuel quality and system operability, such as:
1. Type, quality, and special considerations of diesel fuel needed to operate on-site diesel engines
2. Identifying a local supplier that can meet the fuel quality needs of the facility
3. Identifying a local and competent laboratory that can perform the necessary diesel fuel analysis
4. Factors that can affect the bulk diesel fuel, such as the physical condition and location of the on-site bulk fuel storage tank as well as the annual weather conditions
5. Identifying industry changes that can negatively impact long-term quality of the diesel fuel
6. Prereceipt inspections of routine diesel fuel deliveries
7. Monitoring the ongoing condition of the on-site fuel during long-term storage conditions
8. Remediation process(es) to restore the on-site diesel fuel supply to a stable and useful condition
6.8.1 Fuel Needs and Procurement Guidelines
The first step in a fuel quality assurance program is to determine the fuel type and fuel quality needed in order to establish fuel quality purchasing specifications based on accepted industry standards (Table 6.1). Additional information relevant to the long-term storage of fuel that is not available in accepted industry standards should be obtained from the diesel fuel manufacturer, the distributor, the diesel engine and/or fuel injector manufacturers, industry peer groups such as ASTM, or qualified industry consultants. Other issues that are directly related to fuel quality include purchasing diesel fuel from a reputable dealer and having access to a qualified diesel fuel laboratory to perform the necessary fuel testing. In the United States, the industry standards for diesel fuel are detailed in ASTM D975, Standard Specification for Diesel Fuel Oils. The year of issue or revision is indicated by a suffix containing two or more figures, such as D975-09, signifying that the standard was revised in 2009. A second or third issue within the same year is indicated by a letter designation, for example, D975-09a or D975-09b. Therefore, it is always prudent and in the best interest of fuel system operability and reliability to purchase diesel fuel using the specifications quoted from the latest edition of D975. In order to further ensure and document the type, quality, and condition of the diesel fuel being purchased, the fuel dealer should agree to provide the purchaser with a certificate of analysis (COA) for each shipment of diesel fuel delivered to the site and to state on the COA that biodiesel was not added to the diesel fuel.
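The ASTM designation convention described above (a two-digit year suffix plus an optional letter for subsequent issues within that year) is simple enough to check mechanically. The helper below is a hypothetical illustration and assumes a 2000s-era two-digit year.

```python
# Hypothetical parser for the ASTM designation convention described in the
# text, e.g. "D975-09a" is a subsequent issue of the 2009 revision.
import re

def parse_astm_designation(code: str) -> dict:
    """Split an ASTM code like 'D975-09a' into standard, year, and issue letter."""
    m = re.fullmatch(r"D(\d+)-(\d{2})([a-z]?)", code)
    if not m:
        raise ValueError(f"unrecognized designation: {code}")
    standard, yy, rev = m.groups()
    return {
        "standard": f"D{standard}",
        "year": 2000 + int(yy),   # assumes a 2000s-era two-digit suffix
        "revision": rev or None,  # letter marking a later issue within the year
    }

print(parse_astm_designation("D975-09a"))
# {'standard': 'D975', 'year': 2009, 'revision': 'a'}
```

Such a check could, for instance, flag a purchase specification that cites an out-of-date edition of D975.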
6.8.2 New Fuel Shipment Prereceipt Inspection
In some cases, new fuel shipments may contain residual fuel such as gasoline, E10, or biodiesel from a prior fuel shipment, or water and debris that can contaminate an existing bulk fuel supply. A simple way to protect the on-site bulk diesel fuel supply is to perform a prereceipt inspection before accepting the fuel shipment (Table 6.2). When a diesel fuel shipment arrives on-site for delivery, the paperwork accompanying the shipment should be reviewed carefully to ensure that the fuel shipment matches the fuel purchase. If possible, the delivery truck compartments should also be inspected to ensure that the fuel and compartments are clear and bright,
FUEL SYSTEMS DESIGN AND MAINTENANCE
Table 6.1. Fuel procurement guidelines

Discussion
Prior to developing fuel specifications for diesel fuel oil, the purchaser should be aware of the diesel fuel oil type, quality, and special considerations, as specified by the engine manufacturer, to ensure uninterrupted fuel supply to the diesel engine when needed, regardless of weather conditions or storage limitations. Additionally, the purchaser should interview several fuel manufacturers or their authorized representatives in the local area to ensure that the type and quality of fuel is always readily available and accessible, is not delivered to site with unnecessary contaminants, and does not contain biodiesel, which can shorten the shelf life of the diesel fuel oil or result in unnecessary system remediation.

Issue(s)
1. Determine the type of diesel fuel oil and special requirements necessary to operate the on-site diesel engine(s).
2. The diesel fuel oil is purchased or is stipulated to be purchased to address engine operation requirements, special considerations, and winter conditions.
3. The delivery truck providing the fuel delivery to the on-site bulk storage tank has cleared/cleaned the affected fuel compartments of any prior fuel delivery, especially if it contained a fuel product with a lower flash point, biodiesel (B100), or biodiesel blend (BXX), which is commonly called switch loading.
4. Require that the diesel fuel oil shipment does not contain biodiesel.†
5. The supplier will provide a "Certificate of Analysis" (COA) with each fuel load delivered to the site, presented to the buyer upon delivery of the fuel. The COA is to be stored in the on-site document storage for future reference or dispute resolution.
6. Every diesel fuel oil order delivered to the site will be clear of excess free water, sediment, or suspended matter (i.e., clear and bright) visible at the bottom of each fuel compartment containing the fuel delivery.
7. The supplier, buyer, and buyer's petroleum laboratory should agree to a reasonable period of time that the fuel delivery truck is to remain on site to undergo the fuel quality inspections (Table 6.2) and analysis of the new fuel (Table 6.3).
8. Any fuel shipment that fails a prereceipt quality inspection (Table 6.2) or new fuel test (Table 6.3) can be returned to the supplier without recourse and replaced with in-specification diesel fuel oil.

Reference(s)/recommendation(s)*
1. Self-explanatory
2. ASTM D975-10, Section 1, Section 7, Table 1 with footnotes, and application Appendices
3. ASTM D975-10, Section 3 and Section 7.3
4. ASTM D975-10, Section 3 and Section 7.3
5. Certificate of Analysis to verify fuel type and quality
6. ASTM D975-10, Section 1, Section 6
7. To perform prereceipt inspections and testing
8. ASTM D975-10, Section 1, Section 6
*Unique references specified above may be modified or updated in later revisions to ASTM D975. Be certain to consult the latest revision of ASTM D975 regarding fuel quality issues.
†"Biodiesel Myths Busted," http://www.biodiesel.org/pdf_files/fuelfactsheets/Myths_and_Facts.pdf. "Biodiesel Usage Checklist," http://www.biodiesel.org/pdf_files/Usage_Checklist.pdf.
6.8 DIESEL FUEL AND A FUEL QUALITY ASSURANCE PROGRAM
Table 6.2. New fuel shipment prereceipt inspection

Part 1
Description
As a first line of defense to protect the on-site bulk diesel fuel storage, the buyer should attempt to visually inspect the fuel shipment and review the certificate of analysis (COA) to confirm that the diesel fuel shipment is correct and that the shipment is absent of any free water, sediment, and suspended matter at the bottom of each fuel compartment (i.e., clear and bright) prior to testing the new fuel shipment or transfer to on-site bulk storage.

Issue(s)
1. Fuel meets the purchase specification.
2. The fuel compartment(s) of the delivery truck are visually absent of free water, sediment, and suspended matter (i.e., clear and bright).
3. The bulk fuel shipment is clear and bright.

Reference(s)/recommendation(s)*
1. Purchase Agreement, ASTM D975-10, and COA
2. ASTM D975-10, Section 1 and Section 6
3. ASTM D4176 with "Free Water—Pass" and "Particulates—Pass"; Sample Fuel Shipment—ASTM D975-10, Appendix X2

Part 2—Red Dyed New Fuel Shipment Prereceipt Inspection
Description
As a first line of defense to protect the on-site bulk diesel fuel storage, the buyer should attempt to visually inspect the fuel shipment and review the certificate of analysis (COA) to confirm that the diesel fuel shipment is correct and that the shipment is absent of any free water, sediment, and suspended matter at the bottom of each fuel compartment (i.e., clear and bright) prior to testing the new fuel shipment or transfer to on-site bulk storage. However, the presence of red dye in the fuel can make it difficult to perform an adequate visual inspection.

Issue(s)
1. The red dyed fuel meets the purchase specification.
2. A composite sample of fuel from the fuel delivery truck is clear and bright.

Reference(s)/recommendation(s)*
1. Purchase Agreement and ASTM D975-10, COA
2. ASTM D4176 with "Water—Pass" and "Particulates—Pass". Note: The fuel dye may interfere with the test. See the discussion in the method regarding interferences.
*Unique references specified above may be modified or updated in later revisions to ASTM D975. Be certain to consult the latest revision of ASTM D975 regarding fuel quality issues.
meaning that there are no signs or indications of excess and undissolved water, sediment, or suspended matter. While there is no color or odor requirement for diesel fuel, if a shipment has an appearance or odor that is distinctly different from previous shipments, further testing can be warranted to verify that the fuel is indeed the correct fuel. For example, the density, flash point, and color should be very close to the values on the certificate of analysis to verify that the product being delivered is the certified fuel.
The purchase agreement should contain provisions that allow the purchaser to reject a fuel shipment without penalty when there is evidence of an incorrect product shipment, free water, sediment, or suspended matter in the fuel shipment that can potentially contaminate the on-site bulk fuel storage system and degrade the existing fuel.
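The accept/reject decision in Table 6.2 can be sketched as a simple checklist evaluator. This is an illustrative sketch only; the field names and the `prereceipt_inspection` function are our own inventions, not part of ASTM D975 or D4176, and a real inspection record would carry more detail (compartment numbers, inspector, COA reference).

```python
# A minimal sketch of the prereceipt decision in Table 6.2. All checks must
# pass or the shipment is rejected without penalty under the purchase
# agreement. Field names are hypothetical.

def prereceipt_inspection(shipment):
    """Return (accept, failures) for a delivery, per the checks above."""
    failures = []
    if not shipment.get("matches_purchase_order"):
        failures.append("paperwork does not match fuel purchase")
    if not shipment.get("coa_provided"):
        failures.append("no certificate of analysis (COA)")
    if shipment.get("free_water_present"):
        failures.append("free water visible in compartment")
    if shipment.get("particulates_present"):
        failures.append("sediment/suspended matter visible")
    return (not failures, failures)

good = {"matches_purchase_order": True, "coa_provided": True,
        "free_water_present": False, "particulates_present": False}
bad = dict(good, free_water_present=True)
```

A shipment like `bad` would be turned away; the failure list doubles as the documentation to present to the supplier.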
6.8.3 Analysis of New Fuel Prior to Transfer to On-Site Storage
Some fuel quality issues with a new fuel shipment that can compromise the quality of the on-site bulk diesel fuel are not visually discernible. Therefore, it is recommended that after the fuel shipment has passed the visual inspection, been found to be clear and bright, and is ready to be transferred on-site, a composite sample of the new diesel fuel shipment be collected and sent to a local laboratory selected by the purchaser to be tested for biodiesel content, flash point, and clear and bright condition (Table 6.3). The results from this testing will further validate that the fuel shipment is ready for transfer to on-site bulk storage. If the purchaser has contracted a qualified laboratory for service and arrangements are in place to analyze the fuel sample immediately once it has been delivered to the laboratory, test results can be available for review between the laboratory and purchaser within a reasonable period of time. Provisions must be made between the purchaser and fuel supplier to allocate time to wait for test results prior to fuel transfer. If the fuel analysis results indicate that the fuel does not correspond to the diesel fuel specifications, then provisions should exist in the purchase agreement for the buyer to reject the shipment. If the fuel supplier refuses to allocate time for the fuel
Table 6.3. Analysis of new fuel shipment prior to transfer to on-site bulk storage

Description
As a second line of defense to protect the on-site bulk diesel fuel supply, the purchaser should obtain a composite sample of the diesel fuel oil shipment in order that it can be analyzed for the following quality indicators: biodiesel content, flash point, clear and bright, and color.

Issue(s)
1. Obtain a composite fuel sample.
2. Biodiesel
3. Flash point
4. Clear and bright
5. Color

Reference(s)/recommendation(s)*
1. ASTM D975-10, Appendix X2
2. None acceptable—ASTM D975-10, Section 7.3.3; standard test method ASTM D7173 or EN 14078. In case of dispute, ASTM D7173 is the referee test.
3. See ASTM D975-10, Table 1 and Footnote E. Also refer to the certificate of analysis.
4. ASTM D4176, Procedure 2—Water Absent = Pass, Water Present = Fail; Particulates Absent = Pass, Particulates Present = Fail. Note: The fuel dye may interfere with the test. See the discussion in the method regarding interferences.
5. Compare to the description on the certificate of analysis.
*Unique references specified above may be modified or updated in later revisions to ASTM D975. Be certain to consult the latest revision of ASTM D975 regarding fuel quality issues.
verification, then the purchaser may need to make other arrangements regarding future fuel supply.
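The Table 6.3 acceptance step can likewise be expressed as a small comparison of laboratory results against the purchase specification. This is a sketch under stated assumptions: the specification values and field names below are placeholders that the purchaser would take from the purchase agreement and the COA, not values taken from this book or from ASTM D975.

```python
# Illustrative acceptance logic for the Table 6.3 analysis. The spec dict
# (maximum biodiesel content, minimum flash point) is hypothetical and would
# come from the purchase agreement; it is not quoted from any standard here.

def evaluate_new_fuel(result, spec):
    """Compare composite-sample lab results against the purchased fuel spec."""
    reasons = []
    if result["biodiesel_pct"] > spec["max_biodiesel_pct"]:
        reasons.append("biodiesel detected")
    if result["flash_point_c"] < spec["min_flash_point_c"]:
        reasons.append("flash point below specification")
    if not result["clear_and_bright"]:
        reasons.append("failed clear-and-bright (ASTM D4176)")
    return ("accept", []) if not reasons else ("reject", reasons)

spec = {"max_biodiesel_pct": 0.0, "min_flash_point_c": 52.0}  # example values only
passing = {"biodiesel_pct": 0.0, "flash_point_c": 61.5, "clear_and_bright": True}
failing = {"biodiesel_pct": 2.3, "flash_point_c": 61.5, "clear_and_bright": True}
```

A "reject" result, with its reasons, is the evidence the buyer would cite when invoking the rejection provision of the purchase agreement.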
6.8.4 Monthly Fuel System Maintenance
Regardless of whether diesel fuel is being stored for retail sale, municipal use, mining operations, or mission critical applications, exposure to contaminants, storage conditions, and nature's elements can influence the long-term quality and condition of the diesel fuel stored within the tank. Ideally, all diesel fuel storage should be maintained full, dry, and cool (Table 6.4), although, as a practical matter, this may not be possible in all circumstances. Keeping the fuel storage tank full at all times keeps the air space (ullage) at the top of the tank to a minimum. Maintaining a minimal ullage reduces the volume of outside air that can enter the tank. This is important because as the outside air cools in the ullage during the evening hours, the moisture from the air condenses and forms free water droplets that adhere to the wall of the fuel storage tank and/or collect at the bottom of the fuel storage tank. Over time, enough water can accumulate on the wall and the tank floor to promote and sustain microbial contamination and growth. It is a recommended industry practice to keep diesel fuel as dry as reasonably possible by routinely draining fuel storage tanks of accumulated water. If it is determined that the bulk storage tank has microbial contamination, then it is recommended to remediate the fuel system with an effective and efficacious biocide. Maintaining the fuel cool and at a constant temperature assists in reducing or slowing the degradation reactions within the bulk fuel. This is largely dependent upon the
Table 6.4. Monthly fuel system maintenance/surveillance for water and microbial growth

Description
Perform monthly tank surveillance for accumulated water and microbial growth. On-site bulk fuel storage tanks can accumulate water from fuel deliveries or as a result of the cooling of moist air in the tank ullage into water droplets. Accumulated water can initiate fuel degradation issues and, if not removed, can grow and sustain microorganisms. Microorganisms are responsible for fuel degradation, filter plugging, and operational problems, as well as tank wall corrosion.

Issue(s)
1. Drain and record volume of accumulated water removed from the tank.
2. Microbial growth

Reference(s)/recommendation(s)*
1. ASTM D975-10, Appendix X3 and X3.7. Record the volume of water drained from the bulk storage tank. If results indicate an increasing volume of water being drained from the tank, determine and remediate the source of the water.
2. ASTM D975-09a, Appendix X1.15.1, X3.4, and X6
*Unique references specified above may be modified or updated in later revisions to ASTM D975. Be certain to consult the latest revision of ASTM D975 regarding fuel quality issues.
location of the fuel storage tank. This is easily accomplished with underground storage tanks because the ground temperature is fairly cool and constant relative to the diurnal fluctuations of the outside air temperature. Indoor fuel storage tanks or vaults have similar temperature controls. However, in some circumstances mechanical or auxiliary equipment located near the fuel storage tank may impart a heat load onto the fuel storage tank, which elevates the temperature of the fuel inside, resulting in accelerated fuel degradation issues. Conversely, aboveground storage tanks (AST) are generally outdoors and exposed to the elements and the diurnal cycle of heat during daytime hours and cooling during the night. Over time, continuous heating and cooling of the fuel can lead to accelerated fuel degradation issues as well as accumulation of water inside the fuel storage tank.
6.8.5 Quarterly or Semiannual Monitoring of On-Site Bulk Fuel
At least quarterly or semiannually, the on-site bulk fuel storage system and fuel inventory should be evaluated to ascertain the ongoing quality of the existing bulk fuel inventory as well as the accumulation of water, microbial growth, and evidence of fuel degradation. Just as important are the locations from which the samples are drawn, so that the fuel samples are meaningful and produce valid results (Table 6.5). Test results provide the owner/operator the opportunity to trend, evaluate, and assess fuel quality issues as they unfold, so corrective actions can be implemented proactively to preserve fuel quality and system operability. Fuels that are stored for periods of time that exceed the manufacturer's recommended shelf life can form sediments or gums that can overload filters or plug injectors. Although diesel fuels (i.e., middle distillates) are inherently stable, if stored for extended periods of time they can undergo degradation under specific oxidizing conditions. Additionally, for older diesel engines that circulate diesel fuel for cooling and lubrication during operation, the diesel fuel undergoes continuous heating and cooling and runs the risk of undergoing thermal degradation. Accumulation of water in bulk fuel storage can promote uncontrolled microbial contamination in fuel systems that can also cause or contribute to a variety of problems, including increased corrosivity of the fuel, decreased stability, and filterability problems. Microbial contamination in fuel systems can also cause or contribute to system damage such as pitting corrosion and fuel system leaks. Aged petroleum products can contain acidic constituents that are present as additives or as degradation products formed during service.
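One simple way to flag the kind of continuous adverse trend that quarterly or semiannual results are meant to reveal is to alert when a monitored value, such as acid number, rises in each of the last few sampling periods. The sketch below is illustrative only; the window size is an arbitrary choice, and a real program would trend each parameter against its own specification limit rather than just its direction.

```python
# Hypothetical trend flag for quarterly/semiannual lab results: alert when
# the last `window` readings of a monitored parameter are strictly
# increasing. The window of 3 is an arbitrary illustrative choice.

def adverse_trend(readings, window=3):
    """True if the last `window` readings are strictly increasing."""
    if len(readings) < window:
        return False
    tail = readings[-window:]
    return all(b > a for a, b in zip(tail, tail[1:]))

# Example: quarterly acid-number results (values are made up).
acid_numbers = [0.06, 0.05, 0.07, 0.09, 0.12]
```

Here `adverse_trend(acid_numbers)` would flag the tank for a closer look, since the last three results climb steadily.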
6.8.6 Remediation
Diesel fuels that have quarterly or semiannual test results that indicate a trend toward oxidative instability, thermal instability, or the presence of microbial contamination need to be remediated. There are several actions that can be taken by the owner/operator of the system to address these issues. Fuel additives that are available in the market can improve the suitability of fuels for long-term storage and thermal stability, but can be unsuccessful for fuels with
Table 6.5. Quarterly or semiannual on-site bulk fuel testing

Description
Sample and analyze the on-site bulk diesel fuel supply each quarter or semiannually in order to monitor fuel quality during on-site storage. Contact the approved fuel laboratory for instructions on sample volume and sample containers.

Issue(s)
1. Sampling, containers, and sample handling
2. Storage and thermal stability of diesel fuels
3. Oxidation stability (ASTM D2274 or D5304)
4. Particulates (ASTM D6217)
5. Thermal stability (ASTM D6468)
6. Microbial contamination
7. Biodiesel content
8. Acid number
9. Hydroperoxides

Reference(s)/recommendation(s)*
1. ASTM D975-10, Appendix X2
2. ASTM D975-10, Appendix X3
3. ASTM D975-10, Appendix X3, X3.5, X3.5.1, and X3.6
4. ASTM D975-10, Appendix X3, X3.6, and X3.6.3
5. ASTM D975-10, Appendix X3, X3.5, X3.5.3, X3.6, X3.6.4, X3.6.5, and X3.10
6. ASTM D975-10, Appendix X1.15, X3, X3.4.2, X3.7, X3.9.2, and X6 (tank bottom sample)
7. ASTM D7173
8. ASTM D664
9. ASTM D3703

Comment(s)
A. Monitor and trend test results for items 3-7 above. Test results that indicate a continuous adverse trend may indicate the need to remediate the on-site bulk fuel and/or storage tank.
B. The addition of items 7-9 above (biodiesel, acid number, and hydroperoxides) provides the user with secondary quality indicators to monitor the fuel quality of long-term stored fuel.

*Unique references specified above may be modified or updated in later revisions to ASTM D975. Be certain to consult the latest revision of ASTM D975 regarding fuel quality issues.
markedly poor stability properties. Fuel additives generally contain one or more of the following functionalities: antioxidant, metal deactivator, corrosion inhibitor, or dispersant. Most diesel fuels are treated with additives at the refinery, which is adequate to keep diesel fuels stable for consumption within three to six months after being refined. However, diesel fuel that is stored long term (>12 months) may require additional additives during the storage cycle. Once the fuel degradation process begins, some fuel additives available on the market can temporarily stabilize the fuel condition until the fuel is filtered, polished, and cleaned, but they cannot reverse or improve the condition of the degraded fuel product. The ability of degraded and oxidized fuel to act as a catalyst that initiates and facilitates the degradation of new fuel is so strong that diluting a small amount of old, oxidized fuel with a tank load of new fuel will accelerate the degradation of the new fuel. Degraded and reactive fuel must all be used or removed, and the tank emptied and cleaned, to make it suitable for long-term fuel storage again.
Microbial contamination associated with fuel and fuel system problems generally involves bacteria and fungi (i.e., yeast and mold) only. In some literature, algae have been improperly referenced as microorganisms that can grow in fuel systems. This is not the case, since algae require sunlight for metabolic processing. Bacteria and fungi in fuel systems have several resource requirements that need to be in place to grow and proliferate, including water, nutrients, and fuel. Microbes require water for hydration and to remain viable. Sources of water include fuel deliveries and condensation of moist air in the tank ullage. Nutrients such as phosphorus, nitrogen, and sulfur are provided by the fuel or water. The diesel fuel acts as a source of food energy and carbon for growth and biomass development. Uncontrolled microbial contamination of fuel and fuel systems can result in biodegradation of the fuel, cause recurring filter plugging problems, or cause various forms of corrosion that could potentially result in a fuel system leak or costly and unnecessary system maintenance. The elimination of microbial activity can be accomplished using a biocide. A biocide should be selected based on its ability to eliminate microorganisms in the water phase, overall effectiveness, and efficacy. Note that even a thorough cleaning of a tank, which can remove all microbial growth from visible surfaces, is usually not enough to remove all microbes or their spores from piping and valves. Thus, after a contaminated fuel system has been cleaned, a biocide is often used to kill any remaining microbes in out-of-the-way piping or places to ensure a complete job.
6.9 CONCLUSION
For decades, diesel engines have been a source of power for a myriad of applications, and they will continue to be for the foreseeable future. Over the decades, market demands and environmental requirements have influenced the methods used to extract, refine, and blend the fractional component commonly referred to as diesel fuel, which, in turn, have influenced the product's long-term stability. Looking forward, environmental and market demands may require the use of more nonpetroleum components in the final blending of diesel fuels, which could affect product quality. For most applications, diesel fuel is consumed within the product shelf life, but mission critical systems require diesel fuel to remain stable and perform well beyond the normal shelf life. To date, experience has shown that if bulk storage tanks that contain diesel fuel are maintained cool, at constant temperature, dry (with minimal or no accumulated water), and full (to reduce exposure to oxygen), the diesel fuel tends to remain stable for longer periods of time. Coupled with a simple but effective fuel quality assurance program to continuously monitor fuel quality, mission critical systems can be expected to perform reasonably when called upon for service.
7 POWER TRANSFER SWITCH TECHNOLOGY, APPLICATIONS, AND MAINTENANCE
7.1 INTRODUCTION
This chapter is about the power transfer switch and its role in the emergency power system of mission critical facilities. The power transfer switch comes in many varieties; however, we will concentrate on the most common—the automatic transfer switch (ATS). The term mission critical has traditionally been applied to the data center; however, with the introduction of the NEC's Article 708, Critical Operations Power Systems (COPS), we can extend this mantle to any system or, specifically, power system on which a critical operation depends. And while Article 708 aims at governmental, municipal, and other operations critical to society, we can also consider that any business venture, big or small, depends on systems critical to its mission. Mission critical operations require continuous electrical power. An interruption in the flow of power could result in significant financial losses unless emergency power comes online quickly. A prolonged power outage will affect the operation of any business. From data centers to hospitals to wastewater treatment facilities, critical loads must be supported. Blackouts, such as the one that plunged the northeastern United States and parts of Canada into darkness on August 14, 2003, drove home the importance of emergency power systems. Emergency power systems must be understood, tested, inspected,
maintained, and documented. Facilities that followed regular maintenance programs had few, if any, problems when the power went out, while many of those that did not follow routine maintenance practices, had inaccurate drawings, insufficiently documented procedures, or insufficiently trained operating staff experienced major problems or, in some cases, complete failures. Both the National Fire Protection Association (NFPA) and the National Electrical Manufacturers Association (NEMA) recognize the need for preventive maintenance. A maintenance program and schedule should be established for each particular installation in order to detect, reduce, and eliminate impending issues that cause unexpected downtime. A preventive maintenance program includes periodic testing, tightening of connections, removal of dust and dirt, and replacement of contacts when visual inspection reveals excessive contact erosion. The NFPA requires that an emergency power transfer system be tested for at least 30 minutes at least every 30 days, although 30 minutes per week is ideal. Although many standards stress the importance of a comprehensive maintenance program, few provide sufficient detail. Your original equipment manufacturer (OEM) is your best source for detailed recommended practice to maintain reliability and protect your warranty.

Automatic transfer switches are the heart of the emergency system. If the ATS does not function properly, loads could be left high and dry, no matter how sophisticated the rest of the system may be. Any discussion of power switching and control has to include the automatic transfer switch. Actually, the term automatic transfer switch is much too specific. Not all power transfer switches are "automatic" and not all power transfer mechanisms are power transfer switches. In order to be truly classified as an automatic transfer switch in the United States, units must meet the requirements of UL 1008.
All automatic transfer switching devices share some common characteristics, either in construction or application. Automatic transfer switches provide for a transfer between the normal source (usually utility) and the emergency source (the engine or turbine-generator set) on site. An ATS contains two elements: the transfer switching mechanism and the control panel. The ATS control panel will recognize loss of the normal source of power, initiate a start signal to the emergency source, monitor the quality of the emergency power according to customer preferences programmed into the control panel, and, if acceptable, initiate an automatic transfer to the emergency source. Once power has been restored, the ATS control panel will sample the quality of the normal power source again according to customer preferences programmed into the control panel, initiate a timer in order to wait until power is stable, transfer the load back to the normal source, and allow the engine or turbine-generator to run unloaded or cool down for a predetermined time period.

For ATSs to work automatically, quickly, and dependably, they must be properly selected, sized, installed, and maintained. Different types of ATSs are available with various options. Selection of a new or replacement ATS may seem straightforward; however, a careful evaluation of the facility's mission and electrical infrastructure will help determine the specifications that satisfy the technical requirements and cost considerations for a particular location and application.

NFPA 70E has drastically changed the way we perform maintenance and repairs of emergency power systems. Holding briefings, establishing safe work boundaries, and utilizing personal protective equipment (PPE) are all meant to mitigate the hazard of arc flash and electrical shock. Unfortunately, the reality is that when wearing PPE, it is impossible to perform many of the tasks required by maintenance routines. This means that there are only a few tasks that can safely be accomplished wearing PPE. Comprehensive maintenance requires equipment to be deenergized, locked out, and tagged. The issue at hand is that many emergency power systems containing ATSs do not afford the necessary flexibility to isolate and secure these circuits without affecting the connected load. One solution may be the selection of an isolation-bypass-type ATS. The bypass switch isolates the ATS while maintaining power to critical infrastructure. Some ATS manufacturers provide an integrated ATS bypass switch. Switchgear configurations can also provide an ATS maintenance bypass-isolation function, which often has advantages over an integral bypass switch. Some bypass switches allow for load transfer while on bypass; others do not. The critical nature of emergency and standby power systems applications dictates the importance of rigorously maintaining the equipment involved. A good preventive maintenance program that includes operator training, maintenance, and testing of ATSs and the larger integrated system will maximize system reliability.
7.2 TRANSFER SWITCH TECHNOLOGY AND APPLICATIONS
Power transfer switches are available in low-voltage (<600 VAC) and medium-voltage (600 V to 35 kV) versions. The most familiar power transfer switching application is automatic low voltage, and that is where we will direct our discussion. Low-voltage automatic transfer switches range in size from 30 to 4000 A. When the ATS controls sense a significant drop in normal source voltage (typically 80% of nominal voltage), the ATS begins the automatic transition to the emergency source. Depending on the application, a programmed delay may be employed to ride through power glitches. Once the emergency source is ready to accept load, the ATS transfers the load automatically. Let us walk through a typical transfer scenario:

1. The ATS control panel senses a loss of the normal source.
2. The ATS control panel provides a contact closure to signal the generator set to start after an adjustable time delay to prevent unnecessary engine operation for momentary power outages.
3. The ATS control panel monitors the acceptability of the emergency source. The parameters that determine acceptability are programmable. When voltage and frequency have reached an acceptable level, transfer is made from the normal to the emergency source. When the normal power source is restored, the ATS control panel again samples the voltage and frequency to determine whether or not the source is acceptable.
4. After an adjustable time delay to ensure that the source is stable, the load is transferred back to the normal power source. This time delay is intended to time out only after power has remained within acceptable limits for a certain period of time. Typically, additional interruptions will reset this timer until the full period remains uninterrupted.
5. Following the retransfer to the normal source, the ATS control panel allows the engine or turbine-generator to run unloaded or cool down before the run signal is removed and the unit shut down.

The steps above constitute the basic function of every type of ATS when a failure of the normal source occurs. That is where the similarity stops. There are many different applications for the ATS. Accordingly, manufacturers have developed different types and features to deal with specific applications. Health-care applications, for example, may require closed transition operation to eliminate power disruptions during testing or a selective transfer. Data center applications, on the other hand, typically require three-pole, open transition switches, as the bulk of the loads are balanced three-phase UPS or HVAC loads. ATSs can also be applied for redundancy purposes, to switch between two utility sources, between two generator sources, or for small loads downstream of two larger ATSs. If an emergency source other than a diesel or turbine generator is to be used (most commonly an uninterruptible power supply), the manufacturer must understand the requirement at the time the ATS is ordered, in case special engineering is required for the application.
7.3 TYPES OF POWER TRANSFER SWITCHES

7.3.1 Manual Transfer Switches
Manual power transfer switches commonly come in two types: (1) manually initiated, electrically operated and (2) manually operated, quick make, quick break. Both types require the presence of a qualified operator. The handle or manual controls must be accessible from the exterior of the switch enclosure in order to allow operation without personal protective equipment in accordance with NFPA 70E. The manually initiated, electrically operated transfer switch (TS) has two positions: normal and emergency. A qualified operator must operate the controls deliberately, which acknowledges the intent to effect a transfer and causes the TS mechanism to operate. The manually operated, quick make, quick break TS also has two positions: normal and emergency. This type of TS is totally manual. Setting the handle to the normal position causes the normal power source to be connected to the load. Setting the handle to the emergency position causes the emergency power source to be connected to the load. If the emergency source is a generator set, it must be started and an acceptable voltage and frequency must be established before or after the transfer, depending on the rating of the transfer switch. Some of these units are designed to perform a hot-to-hot transfer under load, and some require the transfer to be made first and the generator or utility breaker to be closed after the TS has been transferred.

Remember that emergency systems as defined in Article 517 and Article 700 of the National Electrical Code and in NFPA 99 must automatically supply alternate power to equipment that is vital to the protection of human life. Electrical power must be automatically restored within 10 seconds of power interruption. Manual transfer switches will not meet these code requirements; automatic transfer switches are required.
7.3.2 Automatic Transfer Switches
These switches are available in ratings from 30 to 4000 amperes for low-voltage (<600 VAC) applications (see Figures 7.1-7.3). Different types of transfer switches are available with various options, including bypass-isolation switches. ATSs equipped with bypass isolation allow the automatic portion of the ATS to be inspected, tested, and maintained (in some designs, the automatic portion can be drawn out and serviced or replaced) without any interruption of power to the load.

Open Transition Automatic Transfer Switches. A standard automatic transfer switch is designed to monitor its normal source of power, usually the incoming utility. When the normal source is interrupted, the ATS control panel sends a start signal to the backup generators, samples the power level once the generator starts, and transfers to the alternate (emergency) source. The ATS will remain connected to the emergency power source for the duration of the utility outage. Upon resumption of commercial power, the ATS controls will confirm the viability of that source and retransfer to the normal source following an adjustable time delay to ensure source stability. This return transfer occurs between two live sources that are not synchronized. For many applications, such as UPS systems and inductive motor loads, it is desirable to synchronize the transfer using a passive sync-check device. This lessens the negative dynamic effects of an out-of-phase transfer.

Figure 7.1. Break-before-make ATS.

Figure 7.2. Basic ATS enclosure.

Figure 7.3. Break-before-make ATS (bypass isolation).

Open Transition Automatic Transfer Switches with Bypass Isolation. Open transition automatic transfer switches with bypass isolation incorporate all of the standard features explained above with a second manual transfer device (Figure 7.4). This manual power transfer device is connected in parallel and interlocked with the automatic switching mechanism. Under normal operating conditions, this bypass/isolation mechanism provides a means to bypass the automatic power transfer device if there is an operational problem or for routine maintenance. This device is fully rated and capable of making a manual transfer between the normal and emergency source. Once the bypass/isolation device has been connected and transferred to the appropriate source, the automatic power switching mechanism may be placed in one of two positions:
Figure 7.4. ATS enclosure equipped with bypass isolation.
1. Position 1 is the test position. In this position, the main power poles of the automatic power switching device are disconnected; however, control power is still available so that the ATS may be tested by qualified maintenance personnel equipped with personal protective equipment appropriate to the hazard level identified, as defined by NFPA 70E.

2. Position 2 is the isolate position. In this position, the ATS mechanism is completely disconnected from all power and may be completely removed for invasive maintenance and repair.

Most manufacturers provide a variety of accessories to fine-tune the device to the application. It is important, no matter what type of power transfer switch you incorporate, that it be coordinated with other system components. For example, the speed of the transfer may present a problem for a UPS, VFD, or other downstream load. In this case, a presignal from the ATS may be desirable. This may be a contact closure or opening prior to the actual transfer, allowing a connected device to revert to a controlled state or shut down before the transfer takes place. Following the transfer, the action is reversed and all system components return to normal operation. In a UPS, you may wish to force the system to its battery until the transfer is complete.

The coordination of set points in the system can be a major issue. Take the voltage drop-out, or lowest acceptable value, for example. Consider what would happen if the ATS controls were programmed with a lower drop-out value than the UPS. In a brownout, the UPS might go to its battery while the ATS would remain connected to a low utility source because its critical set point had not been reached. This also means the engine generators would never start. The end result might be for the UPS to run out of battery backup and crash.

There may also be interfaces with other building systems, depending on the circumstances or local building regulations.
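The brownout set-point coordination problem described above can be checked with simple arithmetic: if the UPS drops to battery at a higher voltage than the ATS drop-out setting, a sustained sag between the two set points drains the UPS battery while the ATS never calls for the generators. This is an illustrative sketch; the function name and percentage values are assumptions, not vendor defaults.

```python
def coordination_gap(nominal_v, ups_dropout_pct, ats_dropout_pct):
    """Return True if the set points are miscoordinated: the UPS reacts
    to a sag that the ATS controls would ride through indefinitely."""
    ups_dropout_v = nominal_v * ups_dropout_pct / 100.0
    ats_dropout_v = nominal_v * ats_dropout_pct / 100.0
    # UPS trips to battery first while the ATS stays on the sagging utility
    return ups_dropout_v > ats_dropout_v

# Example: 480 V system, UPS goes to battery at 90% while the ATS waits for 85%.
assert coordination_gap(480, 90, 85)       # miscoordinated: battery will drain
assert not coordination_gap(480, 85, 90)   # ATS transfers before the UPS trips
```

The point of the check is only that the ATS drop-out threshold should sit at or above the UPS drop-out threshold, so a brownout deep enough to disturb the UPS also starts the engines.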
In some cases, you may wish to signal all elevators to return to the ground floor in the event of a power failure. In this case, a presignal from the ATS could be interfaced with the elevator controls for that purpose.

Closed Transition Transfer Switches (CTTS). Closed transition transfer switches (Figure 7.5) permit an overlapping or closed transition transfer between two hot sources without load interruption. This transfer mode is very desirable in mission critical operations such as data centers or health-care facilities. Upon initial power outage, the CTTS operates exactly the same as an open transition unit: when the load sees a loss of power, the CTTS controls will signal the engine generator to start, sample the power produced, and then connect the load to the alternate or emergency source. Upon restoration of the normal power source, the CTTS controls will briefly connect both sources in parallel before isolating the emergency source, allowing a smooth, uninterrupted transfer to occur. The CTTS controls employ a passive bandwidth monitor and must wait until both sources are synchronized within 5 electrical degrees of one another. This type of unit adds immeasurable flexibility to any power system. The obvious advantage is not having to sustain a second hit when transferring back to the normal power source following a power outage. This avoids a second hit to sensitive equipment or an unnecessary use of the UPS battery. Some operations use the closed transition advantage to transfer to the alternate source in-house in advance of a storm or for maintenance operations.

Figure 7.5. Closed transition ATS.

CTTS units are typically more expensive due to the advanced switching and control mechanisms required to safely execute an overlap transition. The CTTS will necessarily connect the normal utility and standby generator sources together for a very brief time. Protective circuits in the CTTS controls prevent prolonged parallel connection of the commercial and emergency sources. Power utility company requirements must be understood and followed for the safety of utility service personnel. Some power authorities may require additional outboard protective relays.

Closed Transition Automatic Transfer Switches with Bypass Isolation. Closed transition automatic transfer switches with bypass isolation (Figure 7.6) incorporate all of the standard features explained above with a second manual transfer device. This manual power transfer device is connected in parallel and interlocked with the automatic switching mechanism. Under normal operating conditions, this bypass/isolation mechanism provides a means to bypass the automatic power transfer device if there is an operational problem or for routine maintenance. This device is fully rated and capable of making a manual transfer between the normal and emergency source. Once the bypass/isolation device has been connected and transferred to the appropriate source, the automatic power switching mechanism may be placed in one of two positions:

1. Position 1 is the test position. In this position, the main power poles of the automatic power switching device are disconnected; however, control power is still available so that the ATS may be tested by qualified maintenance personnel equipped with personal protective equipment appropriate to the hazard level identified, as defined by NFPA 70E.

2. Position 2 is the isolate position. In this position, the ATS mechanism is completely disconnected from all power and may be completely removed for invasive maintenance and repair.

Figure 7.6. Closed transition ATS with isolation bypass.

Figure 7.7. Closed transition ATS enclosure.
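The passive sync-check behavior described for the CTTS above can be sketched as a phase-window test: the controls measure the angle between the two sources and become permissive only within a narrow bandwidth (the text cites 5 electrical degrees). This is a minimal illustrative sketch; the function names and the wraparound handling are assumptions, not a vendor algorithm.

```python
import math  # imported for completeness; the sketch uses only modular arithmetic

def phase_difference_deg(angle_a, angle_b):
    """Smallest angular difference between two phase angles, in degrees."""
    diff = abs(angle_a - angle_b) % 360.0
    return min(diff, 360.0 - diff)

def transfer_permitted(angle_a, angle_b, window_deg=5.0):
    """Permissive only when the sources lie within the sync window."""
    return phase_difference_deg(angle_a, angle_b) <= window_deg

assert transfer_permitted(10.0, 12.5)     # 2.5 electrical degrees apart
assert not transfer_permitted(0.0, 90.0)  # far out of phase: block the overlap
assert transfer_permitted(359.0, 1.0)     # wraparound case: only 2 degrees apart
```

Because the monitor is passive, it simply waits for the sources to drift into this window rather than actively driving the generator into phase.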
Delayed Transition Switches. Delayed transition transfer switches (DTTSs, Figure 7.8) are designed to provide an extended open period between sources for a number of reasons. The design mimics that of the closed transition ATS, with a programmable open time between sources. The most common application is to deal effectively with the inrush current typically associated with large inductive loads (transformers, motors, etc.). On a cold start, large inductive loads require substantial current in order to establish a magnetic field; this can be 8 to 10 times the normal running current. Since this is normal and not considered a fault, protective devices need to ignore this transient.

Figure 7.8. Delayed transition ATS.

During power switching operations, we must also consider the inrush current created as a result of an open transition between two hot sources. When power is removed from large inductive loads, their windings briefly give up residual power as their magnetic fields collapse. This is known as back EMF (electromotive force). We also have to consider the phase angle of the alternate source: a transfer between two hot sources that is too quick and ignores the phase angle of the alternate source results in massive current. The DTTS has two separate operating mechanisms with a programmable delay, or open time, between normal and emergency. By programming an appropriate open window, we can allow the disconnected load's field to collapse and the back EMF to dissipate before reconnecting to the alternate source, leaving only normal black-start inrush to consider.

Aside from handling large inductive loads, there are other applications that may be effectively handled by a DTTS. UPS or VFD applications can be touchy. As with large inductive loads, a CTTS works well if you can employ it; while not ideal, a DTTS may be the answer for these applications. Depending on the type and vintage of a UPS, today's open transition ATS may be too fast for the UPS controls to deal with. In this case, a DTTS provides a definitive open period, which the legacy UPS will interpret as a loss of power, reverting to its battery until it sees the alternate source as present and viable. The downside is that every hit to the battery takes a toll on battery life. Variable frequency drives and other connected equipment may have the same type of issue with a very fast open transition between two hot sources. These devices can usually be managed to do an orderly shutdown and a sequential restart. DTTS units are available with the isolation bypass feature, as are many other types of ATS (Figures 7.9 and 7.10).
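The open-window sizing discussed above can be illustrated with a simple decay calculation: residual motor voltage (back EMF) decays roughly exponentially with the machine's open-circuit time constant, so the programmed open time should at least cover the decay to a safe residual level. The exponential model, time constant, and threshold here are stated assumptions for illustration only, not design values.

```python
import math

def min_open_time(time_constant_s, residual_limit_pct):
    """Seconds for back EMF to decay from 100% to residual_limit_pct,
    assuming v(t) = 100 * exp(-t / time_constant_s)."""
    return time_constant_s * math.log(100.0 / residual_limit_pct)

# e.g. an assumed 0.5 s open-circuit time constant, decaying to 25% residual
# voltage before reconnection, suggests roughly a 0.7 s minimum open window:
delay = min_open_time(0.5, 25.0)
```

In practice the open time is a programmable setting on the DTTS, chosen with the motor and transformer characteristics of the actual connected load in mind.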
However, it is important to note that the isolation bypass feature is a manually operated unit, which does not incorporate the ability to delay transition between the sources. It is still possible to connect the isolation bypass feature to the critical load without interruption; however, should it become necessary to manually transfer between two hot sources while using the isolation bypass feature, the operator must use caution: open the source breakers, transfer the bypass MTS, and reclose the appropriate breaker after the transfer is completed. As with any switching operation, the use of appropriate PPE is required.

Soft-Load Power Transfer Switching Devices. Soft-load power transfer switching devices are a key element of the modern emergency power supply system. Unlike open transition (break before make), delayed transition (center off), or closed transition (make before break) switches, the soft-load power transfer switch combines the best features of each in a very special design (Figure 7.11). A soft-load power transfer switch provides a closed transition or overlapping transfer between two energized sources (normal and emergency). A traditional closed transition power transfer switch does the same; however, the overlap time is intentionally limited, a passive sync-check function is employed, and the entire load is shifted in three steps (detect coincidental synchronism, parallel both sources, then disconnect from the source you are leaving).
Figure 7.9. Delayed transition ATS with bypass isolation.
Figure 7.10. Delayed transition transfer switch enclosure.
Figure 7.11. Soft-load power transfer switch.

A soft-load power transfer switch uses an active synchronizer to control or drive the emergency source prime mover into synchronism, and a load controller to gradually walk the load from one source to the other until the source you are leaving is unloaded to a predetermined threshold and disconnected. When transferring from a hot normal source to a hot emergency source, the load controller will drive the prime mover of the emergency source to assume more load until the normal source load is reduced to a specified value, say 200 kW. At this point, the normal source is disconnected and the entire load is supplied by the emergency source. Upon return to the normal source, the scenario above is reversed: first, the emergency source is driven into synchronism with the normal source, then the load controller throttles back the emergency source prime mover, walking the load off the emergency source softly. At a predetermined load set point, the emergency source is disconnected. The key components of these types of units include:

• A dual-operator, closed transition transfer switching mechanism designed according to UL 1008
• A power measurement device to provide precise information to a specially designed controller
• Control for the connected engine generator set in all modes of operation
• Utility-grade protective devices as required by the utility authority involved
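The load-walking sequence above can be sketched as a stepped ramp: the load controller shifts load onto the oncoming source in increments until the load remaining on the outgoing source falls below the disconnect threshold (the text uses 200 kW as an example value). This is a schematic sketch only; the step size, function name, and ramp discretization are assumptions, not controller behavior from any product.

```python
def soft_transfer(total_load_kw, step_kw, disconnect_threshold_kw):
    """Return the load carried by the outgoing source at each ramp step,
    ending once it falls to the level at which it may be disconnected."""
    remaining = total_load_kw
    profile = [remaining]
    while remaining > disconnect_threshold_kw:
        # the oncoming source's prime mover picks up another increment
        remaining = max(remaining - step_kw, 0.0)
        profile.append(remaining)
    return profile

steps = soft_transfer(1000.0, 250.0, 200.0)
# outgoing source ramps 1000 -> 750 -> 500 -> 250 -> 0 kW, then disconnects
```

The same profile run in reverse illustrates the return to normal: the emergency prime mover is throttled back until it carries less than the set point, then it is disconnected.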
Breaker Pair ATSs. Most standard, cataloged ATS designs are essentially large contactors. For most applications, the contactor ATS design carries the advantages of lower cost and smaller physical size, as well as simplicity and the resulting higher reliability. These units also comply with UL 1008 as true automatic transfer switches. However, there are some standard ATS designs, and many customized ATS designs, that use a pair of electrically operated circuit breakers instead of contactors. The circuit breakers are typically controlled by a programmable logic controller (PLC). Advantages of breaker pair ATSs include better adjustment capability of timing and overlapping sequences, easier-to-obtain high fault-current ratings, and better reliability for large load switching (4000 A). Disadvantages of breaker pair ATSs include the complexity and lower reliability of motor operators and PLCs, programming issues, and so on. Except for specialized applications, it is recommended to avoid customized breaker pair ATSs, as operation and service can be an ongoing challenge.
7.4 CONTROL DEVICES
Most major ATS manufacturers provide a variety of options, available either as standard control features or as optional discrete devices.
7.4.1 Time Delays
Time delays are crucial in many ATS applications and can range from a few seconds to 30 minutes. Often included as standard on ATSs, time delays allow for power stabilization and coordination of equipment.

Start Time Delay. Start time delays can override a momentary deviation in the normal power supply. This feature prevents the starting of an engine or transfer of load during momentary voltage dips. Timing starts at the instant of normal power interruption or dip. If the duration of the interruption exceeds the time delay, the ATS signals the generator set to start and the transfer sequence begins. The start time delay should not be set greater than a few seconds for emergency power, because the NEC requires automatic restoration of electrical power within 10 seconds of power interruption, and you need to allow several seconds for the engine to start and come up to speed in addition to the starting time delay. Delays should be long enough to prevent a response to voltage sags caused by nearby faults or a momentary loss of voltage from the operation of reclosers in the electrical distribution system. However, in specialized applications such as UPS systems with limited energy storage backup time, especially in the case of batteryless flywheel systems, engine start delays need to be minimal, as low as 1 second or less.

Transfer Time Delay. Transfer time delays allow adequate time for the generator set to stabilize at rated voltage and frequency before the load is applied. The range for adjustment is 0 to 5 minutes.
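The start-time-delay constraint above is a simple budget: the delay setting, plus engine start and stabilization, plus the transfer itself must fit within the NEC's 10-second window. A quick arithmetic sketch, with illustrative component times (the 10 s budget is from the NEC requirement quoted above; the other values are assumptions):

```python
def within_nec_budget(start_delay_s, engine_start_s, transfer_s, budget_s=10.0):
    """True if the total response time fits the 10 s emergency-power window."""
    return start_delay_s + engine_start_s + transfer_s <= budget_s

# A 3 s start delay leaves room for a typical diesel start and transfer:
assert within_nec_budget(3.0, 5.0, 1.0)        # 9 s total: acceptable
assert not within_nec_budget(6.0, 5.0, 1.0)    # 12 s: delay set too long
```

This is why the text cautions against start delays of more than a few seconds on emergency circuits: every second of deliberate delay comes directly out of the engine's starting allowance.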
Return to Utility Time Delay. This feature allows time for the normal power source to stabilize before the load is retransferred. It also allows the generator set to operate under load for a minimum time and prevents multiple transfers back and forth during temporary power recoveries. The range for adjustment is 0 to 30 minutes. Some facility operators prefer to turn the ATS to a manual or test position once the facility is on generator following a power outage. The philosophy here is that if utility power fails once, it may fail again soon after returning. They will then run on generator until the storm has passed, or for the rest of the day until the bulk of the work shift is over, and then place the ATS back into automatic and observe the time-out and retransfer of the standby system back to normal.

Engine Warm-up Time Delay. Engine warm-up time delays allow time for automatic warm-up of engines prior to loading during testing. This feature is not available from all ATS vendors. Frequent loading of cold engines (including those with jacket water heaters) may increase wear and thermal stresses. Because standby engines normally accumulate a relatively small number of operating hours, the added wear and stress may not be of concern, especially for small and medium-sized engines.

Engine Cool-down Time Delay. Stop-time delays allow the generator set to cool down at no-load conditions before stopping, and should be specified on all ATSs associated with diesel generators. This feature minimizes the possibility of a trip, and in some cases an engine lockout, due to an increase of the cooling-water temperature after shutdown of a hot engine. The delay logic does permit an immediate transfer of load back to the generator set if the normal power source is lost during the cool-down period. ATSs with stop-time delay also conform to vendor recommendations for operation of larger engines. The range for adjustment is 0 to 10 minutes.
7.4.2 In-Phase Monitor
This device monitors the relative phase angle between the normal and emergency power sources. It permits a transfer from normal to emergency, and from emergency to normal, only when the phase angle differences are acceptable. The advantage of the in-phase transfer is that connected motors continue to operate with little disturbance to the electrical system or to the process. An in-phase monitor prevents the residual voltage of the motors from causing an out-of-phase condition, which can in turn cause serious damage to the motors or other mechanical loads. Transferring power to a source that is out of phase can result in an excessive current transient, which may trip the overcurrent protective device or clear fuses. In-phase transfer is principally suited for low-slip motors driving high-inertia constant loads on secondary distribution systems. The in-phase monitor feature is also suitable for most UPS applications. In-phase monitors are inexpensive compared to closed-transition or delayed-transition ATSs. Some vendors provide different in-phase monitor modules, depending on whether the alternate power source (i.e., diesel generator) has a governor that operates in isochronous or droop mode. Consider this option for all ATS applications that will supply motor loads and will not tolerate momentary interruptions. The cost for this option is relatively low and, in some switch designs, the cost of a later retrofit may be quite high.

In-phase monitors are passive devices and become permissive only when the two sources coincidentally align within the allowable bandwidth and phase displacement. It is important that the generator be adjusted to provide an offset in frequency from the mains and thereby create slip between the two sources. Typically, this frequency offset occurs without effort; however, if the generator is operating at exactly the same frequency as the normal source (but not synchronized), it can become stuck out of phase for an undesirably long period of time during the retransfer operation. Alternatively, a synchronizer may be employed to drive the emergency generator into phase if so desired. This has to be engineered by the manufacturer at the time of order.
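The slip requirement above lends itself to a small calculation: with a deliberate frequency offset between generator and mains, the relative phase angle sweeps through a full 360 degrees once per slip cycle, so the sources pass through the in-phase monitor's permissive window periodically; with zero offset they can stay stuck out of phase indefinitely. A minimal sketch, with illustrative frequencies:

```python
def slip_cycle_seconds(gen_hz, mains_hz):
    """Time for the relative phase angle between the sources to sweep a
    full 360 degrees. Returns None when there is no slip (the sources are
    locked at the same frequency and may stay stuck out of phase)."""
    slip = abs(gen_hz - mains_hz)
    return None if slip == 0 else 1.0 / slip

assert slip_cycle_seconds(60.5, 60.0) == 2.0   # 0.5 Hz slip: window every 2 s
assert slip_cycle_seconds(60.0, 60.0) is None  # no slip: stuck out of phase
```

The practical takeaway matches the text: a small deliberate offset guarantees the permissive window comes around regularly, while an exact frequency match (without synchronization) guarantees nothing.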
7.4.3 Test Switches
Regular testing is necessary both to maintain the backup power system and to verify its reliability. Regular testing can uncover problems that could cause a malfunction during an actual power failure. A manually initiated test switch simulates failure of the normal power source, starts the transfer of the load to the emergency power source, and then retransfers it either when the test switch is returned to normal or after a selected time delay. The test switch does not require shutting off the normal source; it transfers the load between two live sources, avoiding the several seconds of power loss during time-out and engine start. The advantage of a test switch is that it does not cause a total loss of power to connected loads, as opening the main disconnect (a pull-the-plug test) to simulate a failure would. The test switch therefore reduces the risk of complications, including a failure of the normal source breaker to reclose. It also allows the system to be proven under controlled conditions. It is common for the connected load characteristics to change over time, either by planned expansion of the system or by inadvertent connection of new loads. A periodic controlled function test ensures the performance of the system. (If the ATS is not closed in transition, the load will be briefly affected when the ATS transfers to emergency and returns to normal.)

It is important to understand that although use of the test switch is convenient and does mitigate some of the risks as noted, most manufacturers require the test switch to be held for a certain time period, say 5 seconds, in order to confirm that the test request is deliberate and not simply caused by inadvertent operation of the switch. When using the test switch in applications in which reaction time is critical, this delay must be factored into the total time required to start the emergency engine or turbine-generator, sample the output, and transfer to the critical load.
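The hold-to-confirm requirement above simply adds to the total response time of a test. A trivial arithmetic sketch, with illustrative component times (only the roughly 5 s hold figure comes from the text; the rest are assumptions):

```python
def total_test_response_s(hold_s, engine_start_s, sample_s, transfer_s):
    """Total time from pressing the test switch until the critical load is
    on the emergency source. All component times are illustrative."""
    return hold_s + engine_start_s + sample_s + transfer_s

# e.g. a 5 s confirm hold + 8 s engine start + 2 s sampling + 0.5 s transfer:
elapsed = total_test_response_s(5.0, 8.0, 2.0, 0.5)  # 15.5 s end to end
```

For time-critical drills, the hold period is easy to overlook when estimating how long a commanded test actually takes to reach the load.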
7.4.4 Exercise Clock
An exercise clock is a programmable time switch. It automatically initiates starting and stopping of the generator set for scheduled exercise periods. The transfer of load from the normal to the alternate source is not necessary but is highly recommended in most cases. As mentioned previously, the standby generator system should be exercised as often as 30 minutes per week, or at least monthly. Operating diesel engines repeatedly or for prolonged periods of time without load can cause wet stacking of unburned diesel fuel oil in the exhaust system. Wet stacking can lead to other diesel engine problems, including explosions in the exhaust or muffler system. Many operating engineers prefer to transfer building load onto the engine once every four tests to reduce wet stacking. Many operating engineers either prefer to be present during each and every automatic test of the standby power system, or they disable the automatic exerciser so that they can manually initiate the test.
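The weekly-exercise practice above, with load transferred on every fourth test, can be expressed as a simple schedule. This is purely a scheduling sketch of the practice described; the function name and the weekly cadence are assumptions, not a code requirement.

```python
def exercise_plan(weeks, loaded_every=4):
    """Return (week_number, kind) entries: a loaded test every Nth week,
    no-load exercise runs in between."""
    return [(w, "loaded" if w % loaded_every == 0 else "no-load")
            for w in range(1, weeks + 1)]

plan = exercise_plan(8)
# weeks 4 and 8 carry building load; the other six weeks run no-load
```

An automatic exerciser programs exactly this kind of calendar, though, as the text notes, many operating engineers disable it and run each test manually so they can be present.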
7.4.5 Voltage and Frequency Sensing Controls
Voltage and frequency measurements are necessary elements of any manufacturer's ATS control panel. The addition of local and remote metering enhances the information available to building operations personnel. Meters offer precise readings of voltage and frequency. Enhanced ATS metering functions, when properly applied, allow building operations to understand and segment power usage and its associated cost. It is also beneficial to be able to monitor the load on the ATS at any given time, including when operating on the normal source. If load metering is included in the ATS, then it might not be necessary in downstream switchboards. Local metering can be captured remotely through a wide range of communications protocols; the most common is Modbus. Remote metering (especially load metering) is important for building automation systems and load-shed schemes. Enhanced metering packages add cost as current transformers (CTs) must be added to the power conductors.
7.5 DESIGN FEATURES
ATS switches rated 100 A or larger are typically mechanically held in both the normal and emergency positions. In such switches, the main contact pressure relies on the mechanical locking mechanism of the main operator. Automatic operation of these switches in either direction requires electrical control power. Therefore, the main poles of the transfer switch should be of a type that is held closed mechanically even when the control circuit is deenergized.
7.5.1 Close Against High In-Rush Currents
When a transfer switch closes onto a source, the contacts may be required to handle a substantial in-rush of current. The amount of current depends on the loads. In the case of a motor load, inrush current can be six to eight times the normal running current. If the two power sources are not synchronized, or the residual voltage of the motor is (worst case) 180° out of phase with the source, the motor may draw as much as 15 times the normal running current. Only minimal contact bounce should occur during switching, and the contact material must have adequate thermal capacity so that melting will not occur as a result of high in-rush current. UL requires that a transfer switch be capable of withstanding an in-rush current of 20 times its full-load rating.
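The multipliers above make for quick worst-case arithmetic when judging what the contacts must close against. The 8x, 15x, and 20x figures come from the text; the function and the example 100 A running current are illustrative.

```python
def worst_case_closing_amps(running_amps, out_of_phase=False):
    """Estimated closing current: ~8x running current for normal motor
    inrush, ~15x when closing 180 degrees out of phase against residual
    motor voltage (multipliers per the discussion above)."""
    multiplier = 15.0 if out_of_phase else 8.0
    return running_amps * multiplier

# For a 100 A motor load:
assert worst_case_closing_amps(100.0) == 800.0                       # 8x inrush
assert worst_case_closing_amps(100.0, out_of_phase=True) == 1500.0   # 15x worst case
```

Both figures sit inside the 20x full-load withstand that UL requires of the switch, which is the margin the standard is providing for.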
7.5.2 Withstand and Closing Rating (WCR)
When a short circuit occurs in a power system, a surge of current will flow through the system to the fault until the upstream protective device opens. If the device supplying the normal source side of the transfer switch opens, the transfer switch will sense a loss of normal power and initiate a transfer to the emergency source. This means that the transfer switch is closing into the fault, and the entire available fault current from that source will flow through the switch. Power transfer switches are designed to carry fault current for a finite amount of time (cycles) and have a specific withstand and closing rating (WCR). Transfer switches should be able to safely close into the fault and remain undamaged until the upstream overcurrent device opens. UL sets a minimum acceptable WCR for transfer switches. The correct ATS WCR must be specified depending on the available fault current at the point of ATS application. Standard ATS WCR ratings available from manufacturers typically increase as the continuous ampere ratings increase.
7.5.3 Carry Full Rated Current Continuously
A transfer switch must continuously carry current to critical loads for a minimum life of 20 years. Periodic infrared thermal scanning of ATS load-carrying components, when loaded from both the normal and emergency sources, is recommended. The maximum continuous load current should be estimated when selecting a transfer switch. The selected transfer switch must be able to carry a continuous load current equal to or greater than the calculated value and must be rated for the class of the connected load.
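The last two subsections reduce to a two-part selection check: a candidate ATS must carry the calculated continuous load current and have a WCR at or above the available fault current at its point of application. A hedged sketch; the candidate list, its ratings, and the tuple format are illustrative assumptions, not catalog data.

```python
def select_ats(candidates, load_amps, fault_ka):
    """candidates: list of (continuous_amps, wcr_ka) tuples, assumed sorted
    smallest first. Returns the first adequate rating pair, or None if no
    candidate satisfies both criteria."""
    for continuous_amps, wcr_ka in candidates:
        if continuous_amps >= load_amps and wcr_ka >= fault_ka:
            return (continuous_amps, wcr_ka)
    return None

# Illustrative frame sizes (continuous amps, WCR in kA):
catalog = [(260, 35), (400, 50), (600, 65), (1000, 85)]
assert select_ats(catalog, 350, 42) == (400, 50)   # next frame up fits both
assert select_ats(catalog, 350, 60) == (600, 65)   # fault duty forces a larger frame
assert select_ats(catalog, 1200, 42) is None       # nothing in this list is adequate
```

Note how fault duty alone can force a larger frame even when the continuous load would fit a smaller one, which matches the text's observation that WCR tends to rise with the continuous rating.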
7.5.4 Interrupt Current
An arc is drawn when the contacts switch from one source to another under load. The arc persists longer when the voltage is higher, the power factor is lower, or the current is higher. This arc must be properly extinguished before the contacts connect to the other source. Arc-interrupting devices such as arc splitters are often used; wide contact gaps can also extinguish an arc. Unless no-load transfers are used exclusively, the ATS must be able to interrupt load current. Power transfer switches that meet UL 1008 incorporate proper arc-quenching ability.
7.6 ADDITIONAL CHARACTERISTICS AND RATINGS OF ATS

7.6.1 NEMA Classification
Transfer switches are classified as NEMA Type A or NEMA Type B. The Type A transfer switch will not provide the required overcurrent protection, so an overcurrent protection device should be installed ahead of the ATS to clear fault current. The Type B transfer switch is capable of providing the required overcurrent protection within the transfer switch itself.
7.6.2 System Voltage Ratings
The system voltage rating of transfer switches is normally 120, 208, 240, 480, or 600 VAC, single-phase or polyphase. Standard frequencies are 50 and 60 Hz.
7.6.3 ATS Sizing
For most electrical and mechanical equipment, it is generally more efficient and cost-effective to use fewer, larger pieces of equipment. For ATS applications, however, the opposite is generally true: ATS cost, physical size, and reliability concerns increase dramatically as the ampere rating increases. Rather than using one large 2000 or 3000 A ATS, it is often much better to break the loads into smaller increments and use ATSs rated at 225-1200 A. Contactor-type ATSs are arguably less feasible above 3000 A, and breaker-pair-type designs or customized switchgear should be used if ATS applications of 4000 A or larger are required.
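The sizing guidance above amounts to partitioning the total load across several smaller switches. A toy grouping sketch, assuming a greedy first-fit strategy and made-up branch loads; the 1200 A group ceiling reflects the 225-1200 A range the text recommends:

```python
def split_into_ats_groups(load_amps_list, max_group_amps=1200):
    """Greedy first-fit grouping of branch loads (in amps) into ATS groups,
    largest loads placed first. Purely illustrative."""
    groups = []
    for amps in sorted(load_amps_list, reverse=True):
        for group in groups:
            if sum(group) + amps <= max_group_amps:
                group.append(amps)
                break
        else:
            groups.append([amps])  # no existing group fits: start a new ATS
    return groups

groups = split_into_ats_groups([400, 350, 600, 300, 500, 250])
# 2400 A of load served by several ATSs of <= 1200 A each, not one 3000 A unit
```

Real distribution design would also weigh redundancy, selective coordination, and load class, but the sketch captures the basic trade: several small frames in place of one large one.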
7.6.4 Seismic Requirement
The Department of Energy and local, regional, and national building codes have specific seismic requirements for facilities in certain geographic areas of the country. Seismic testing is becoming a normal requirement for electrical equipment installed in specific locales. Calculations are no longer sufficient. Manufacturers must subject standard units to shake-table tests at certified testing facilities. These tests not only require that mechanical and electrical components remain secured and free from damage but also that the unit will operate while subjected to the shake-table tests for a specified period. In the case of power transfer switches, a specific number of transfer cycles (transfer from normal to emergency and return) are required.
7.7 INSTALLATION AND COMMISSIONING, MAINTENANCE, AND SAFETY

7.7.1 Installation and Commissioning
It was not that long ago that the process of putting a new piece of gear or a system online was referred to simply as start-up. This activity varied greatly and, more often than not, was based primarily on the experience of the expert doing the work. The proliferation of data and the demand for high-nines reliability have transformed start-up into the very deliberate process we call commissioning. Commissioning is a cycle of events that actually begins at the design stage. Let us focus on the final four stages of this process as it relates to the lifeblood of any mission critical facility, the emergency power supply system (EPSS). Shipping, staging, and proper installation are vitally important. Imagine the contractor who poured the concrete pad in and around the bottom of an ATS because he forgot to provide the 4" raised pad prior to installation, the generator controls that sat exposed to the elements on a rooftop for an extended period of time prior to enclosure,
or the ATS that was dropped off the truck and not inspected until it was time for installation. All of these situations actually occurred, caused considerable delay in the project schedule, and had to be repaired prior to commissioning. All three were also avoidable.

There are three primary types of shipment options:

1. The most common is free on board (FOB), "a term used in shipping to refer to the place where the buyer becomes responsible for the shipment and the shipping charges. Example: If the buyer lives in Des Moines and buys a product F.O.B. New York, the buyer must pay the shipping charges from New York to Des Moines and is responsible for seeing that it is properly insured during that shipment." (Source: www.reference.com)

2. Transfer of ownership at the owner's dock is called a destination contract, "a contract of sale in which a seller bears the risk of loss all the way, until the shipment of goods reaches its named place of arrival (or port of destination)." (Source: www.businessdictionary.com)

3. Another option is ship in place. This option allows the manufacturer to transfer ownership and invoice the customer while the material remains in the custody of the manufacturer.

Failure by the owner or the owner's designated representative to properly inspect the shipment upon arrival is a mistake. Electrical equipment damaged in shipment, staging, or installation may result in failure, damage, or personal injury. Inspection and testing include the following:

• Component verification is the process of auditing all delivered material. Do not wait until the last minute to discover that something is missing or damaged. This is also the time for the site team (general contractor, electrical contractor, mechanical contractor, riggers, the commissioning agent, the consultant, the owner's representative, and the equipment manufacturer's representatives) to review the physical layout, electrical and mechanical installation, structural considerations or impediments, and the master schedule.
Once the freight has been accepted, it is most probably the installing contractor's responsibility to stage and protect the equipment until actual installation begins. The equipment should be placed in a safe, dry area and partially unpacked for inspection and removal of the manufacturer's handling and installation instructions. If the equipment is to be stored for an extended period, protect it from dirt, water, and physical damage. Heat should be used to prevent condensation inside the equipment. If a forklift is used after the equipment is unpacked, transport the unit in the upright position and take care not to damage any handles, knobs, or other protrusions. If the equipment is to be lifted, lifting plates should be fastened to the frame with hardware as recommended in the mounting instructions. Lifting cables or chains should be long enough so that the distance between the top of the equipment and
the hoist is greater than one-half the longest horizontal dimension of the equipment. • System construction verification is the inspection of all electrical, mechanical, and environmental aspects of the installation to determine that installation is complete, safe, and compliant with design specifications, codes, and standards. It is a critical step prior to applying power to begin actual operational verification and performance testing. • Individual system operation verification is where all systems and subsystems supplied by different manufacturers come together for the first time. All individual subsystems are tested to ensure that they are operationally ready and safe. For example, each individual engine or turbine generator set would be started and all individual controls, alarms, indicators and support systems tested with and without load. The UPS system would be tested in the same way. When this stage is complete, there should be a high degree of confidence that individual pieces and subsystems will operate as one integrated machine. Take advantage of this period to optimize nonrelated systems testing. For example, if part of engine generator or UPS testing involves a heat run or burn-in, the load-bank heat discharge can be used to load and test the facility's air conditioning system. Care must also be taken to properly coordinate concurrent testing of interconnected and interdependent systems. Failure to properly manage the master schedule will put you in danger of missing crucial deadlines that may affect occupancy, business continuity, and so on. • Integrated system operation verification (site acceptance test) is the final stage of installation. This is literally an integrated test of all building systems that support the facility. It is at this point that all the care taken in the prior steps pays off. It is also most probably the point in the schedule that is the most compressed. 
Aside from confirmation of the normal sequence of operations, this is the demonstration of failure scenarios and reactions. This is true for the operations staff as well as the equipment. More than one data center has experienced disaster at the hands of an inexperienced or ill-trained operator. If something does go wrong, responsible personnel must be prepared to address the situation rapidly, so make training a key part of this final stage. The process of commissioning is complex. The team entrusted to design, manufacture, construct, test, and operate a mission critical data center must be best in class. Anything less will cause, at the very least, delays; at worst, failure. As the owner, operator, or stakeholder, understanding the process is essential to successful commissioning and to the life expectancy of your mission critical facility.
7.7.2 Maintenance and Safety
Maintenance is intended to be preventive or predictive. The nature of maintenance has changed in many ways over the past 10 years. There are more maintenance providers than ever before; the following are examples:
• Original equipment manufacturer (OEM) organizations, which service primarily the brands manufactured by their parent companies

• Independent service representatives, licensed and trained to provide OEM service support

• Third-party maintenance providers, organizations not necessarily tied to any manufacturer

As you might imagine, there are substantial differences in capability, OEM recognition and backing, scope of work provided, and, of course, price. Let us analyze typical ATS maintenance. First and foremost, it is not possible to provide comprehensive maintenance on any piece of electrical equipment that cannot be deenergized. This is primarily due to the increased emphasis on safety in NFPA 70E, OSHA regulations, and other standards. Consider the following facts:

• In 2007, there were 212 electrically related fatalities and 2540 electrically related injuries.

• Arc flash accidents that result in serious injury or fatality occur 5 to 10 times a day in the United States.

• The temperature of an arcing fault can reach 35,000°F (the surface of the Sun is about 10,000°F).

• The face of the fireball is a pressure wave that can exert 600 pounds of force per square inch on a person.

• The sound pressure level of an arc flash is 160 dB (rock concerts do not exceed 115 dB).

• An arc flash can generate metal fragments and molten metal droplets moving at 5000 ft/sec.

• An arc flash can produce copper vapor that will destroy lungs if inhaled.

All of this means that all of us must observe the policies, regulations, and best practices designed to protect workers and equipment. Due to their design, many ATSs have exposed energized buses or operating mechanisms inside the cabinet door. Hazardous voltages will be present in the ATS during inspection and testing. Wear proper personal protective equipment (PPE) for arc flash safety at all times when handling equipment; review NFPA 70E-2009, Section 130.7, for more information.
When operating the ATS test switch, if so equipped, use the ATS isolation bypass feature. Also, workers should never stand in front of the ATS cabinet door. Instead, they should stand to the hinged side of the ATS cabinet door. An electrical arc or blast can occur during switching if there is component failure. A strong fault can create a fireball and the pressure can swing the door open. This blast can potentially cause death or severe burns. Maintenance personnel should also never stand in front of an open ATS cabinet door during switch operation. Most new ATS designs have control components
mounted on the door, away from power-carrying conductors. These control devices can then be easily disconnected for troubleshooting. The use of recording devices may occasionally be necessary to perform measurements during troubleshooting of control circuits in older switches if control components are hard-wired and are adjacent to high-power conductors and moving components. Remember that an ATS is connected to two power sources. When performing work requiring a deenergized ATS, make sure that both power sources are disconnected, locked out, and tagged out in accordance with NFPA 70E. Always verify absence of voltage before starting work. Admittedly, the problem we face is that many installations are unable to isolate and secure power so that safe service may be performed. But can't we use PPE to work on energized switches? Yes and no. The question is not, "Can you perform maintenance on energized equipment?" but, rather, "What maintenance work may be safely performed on energized equipment?" The answer is that the only tasks that may be safely performed while wearing PPE are operation of switches and circuit breakers and taking readings for the record. The use of PPE protects a worker if a flash occurs while they are exposed but severely hinders the range of tasks that can be safely accomplished: • PPE limits peripheral vision and may affect color recognition. • PPE limits manual dexterity due to the bulkiness of the clothing and required gloves. Therefore, attempting to handle delicate parts or small tools may indeed create a hazard should a tool slip or part be dropped. • Last but not least is the danger of heat prostration. The rate of rise over ambient temperature can be a problem anywhere but it is exponentially worse in warm climates or seasons. Infrared (IR) scans are very useful if done correctly by qualified personnel. Personnel performing infrared scans must have equipment doors open. 
Accordingly, they must wear PPE appropriate for the category of risk to which they are exposed. Remember that infrared scanning is the capture and analysis of heat signatures; the equipment being scanned must be operating at normal loads for the scan to be fully effective. Scanning is typically done separately from maintenance so that problems can be identified in advance and those performing maintenance can be prepared to address them during the maintenance window. Also remember that infrared scanning, like many other electrical tests, represents a snapshot in time and is subjective. There are now viewing ports available for new gear or retrofits that allow IR scans with covers and doors secured. However, even the highest-quality port may degrade the resolution of the image by as much as 20% when new, and ports tend to degrade further over time. Ports also limit the field of view. Take, for example, a three-phase bus in which the bars are stacked horizontally behind one another: it would be virtually impossible to get an effective IR scan of all three bars or joints through a viewing port. As mentioned in Section 4.7.1, recent advances in technology have produced onboard heat sensors permanently mounted at joints and connections. Combined with
power measurement and monitoring devices, it is possible to constantly monitor a thermal map and take advance action to avoid failures.
7.7.3 Maintenance Tasks
Once the ATS has been secured and both the emergency and the normal sources have been locked out and tested to verify that there is no voltage, the hands-on tasks may begin. Exercising and performance testing should be performed per the manufacturer's recommended procedures and by a qualified testing and maintenance organization. Basic procedures include cleaning, meter calibration, contact inspection, micro-ohmmeter readings across closed contacts, load transfer testing, and timer testing.
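As an illustration of how the micro-ohmmeter readings mentioned above might be evaluated, the sketch below flags poles whose contact resistance exceeds an absolute limit or deviates sharply from the best pole. The 100 micro-ohm limit and the 50% deviation rule are hypothetical assumptions; the manufacturer's published acceptance criteria govern the real evaluation.

```python
# Hypothetical sketch: evaluating micro-ohmmeter readings taken across
# closed ATS contacts during maintenance. The 100 uOhm absolute limit and
# the 50% pole-to-pole deviation rule are illustrative assumptions only.

def evaluate_contact_readings(readings_uohm: dict[str, float],
                              max_uohm: float = 100.0,
                              max_deviation: float = 0.50) -> list[str]:
    """Return a list of findings; an empty list means all poles pass."""
    findings = []
    lowest = min(readings_uohm.values())  # best (lowest-resistance) pole
    for pole, value in readings_uohm.items():
        if value > max_uohm:
            findings.append(f"{pole}: {value:.0f} uOhm exceeds "
                            f"{max_uohm:.0f} uOhm limit")
        elif lowest > 0 and (value - lowest) / lowest > max_deviation:
            findings.append(f"{pole}: deviates more than "
                            f"{max_deviation:.0%} from best pole")
    return findings
```

A set of readings such as 40, 45, and 42 micro-ohms across phases A, B, and C would pass; a pole reading well above its neighbors would be flagged for further inspection.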
7.7.4 Drawings and Manuals
Upon the completion of installation and commissioning, all documentation should be updated to reflect the as-built condition. This up-to-date documentation must be managed and kept current; any changes over time must be reflected accurately. Before attempting to test, inspect, or maintain an emergency power transfer system, make sure you have up-to-date reference material and manuals. Keep these in a clean and readily accessible location. For convenience, additional copies should be readily accessible; they can usually be easily obtained from the manufacturer. Accurate as-built drawings and manuals are an indispensable aid in:

• Understanding the sequence of operation of the equipment
• Establishing test and inspection procedures as well as training
• Determining the type and frequency of tests
• Troubleshooting
• Stocking spare parts
• Conforming to code requirements for periodic preventive maintenance

7.7.5 Testing and Training
Regular testing of an emergency power system, together with training of operators, will help to uncover potential problems that could cause a malfunction or misoperation during an actual power failure. Furthermore, regular testing gives maintenance personnel an opportunity to become familiar with the equipment and observe the sequence of operation. By being familiar with the equipment, personnel on duty during an actual power failure can respond more quickly and correctly if a malfunction occurs. Field surveys have shown that an automatic emergency power system should be tested under conditions simulating a normal utility source failure. Experience shows that when a system is tested automatically, with the emergency generating equipment under load for at least one-half hour, the rate of discovery of potential problems is almost twice that of tests in which the generating equipment is started manually and allowed to run unloaded.
A test switch on the transfer switch will accurately simulate a normal power failure, and some manufacturers supply this as standard equipment. The advantage of a test switch is that it does not cause a loss of power to connected loads, as occurs when the main disconnect is opened to simulate a failure. A test switch opens one of the sensing circuits, and the controls react as if there were an actual power failure while the loads stay connected to the normal source. As in an actual failure, power is interrupted to the loads only while the load is being transferred from the normal to the emergency source and vice versa, an interruption of only a few cycles.

Keep in mind that some manufacturers require you to hold the test switch in position for a few seconds before the mechanism will react. The reason for this is to be sure the activation of the test switch is deliberate and not accidental. This also adds time to the cycle: if your test requires a certain reaction time, you have to deduct the test switch delay period from the total reaction time.

If, however, the main disconnect is opened to "simulate" loss of normal power (sometimes referred to as a pull-the-plug test), then power to the load could be interrupted for up to 10 seconds. The interruption could be much longer if a malfunction occurs during start-up of the emergency generator or if there is a problem reclosing the normal feeder switch or circuit breaker. When a test switch is used, fewer maintenance personnel are needed because the potential for an extended power outage during the test is significantly reduced. However, a test-switch test is not as thorough as a performance test.

Though it is not always recommended, the control circuits of an emergency power system can be designed so that a clock will automatically initiate a test of the system at a designated hour on a specific day for a predetermined test period.
This type of test can be restricted to running the engine generator set unloaded or can include an actual transfer of loads to the generator output. Using a clock to initiate testing under load or no-load conditions is not recommended for the following reasons:

• With no load interruption, the engine generator set is tested unloaded, which can adversely affect engine performance. In addition, the transfer switch is not tested and the interconnecting control wires are not proven. The transfer switch is a mechanical device that must be maintained. Some power transfer switches are lubricated for life, but others require periodic cleaning and lubrication, especially in locations subject to dust, dirt, or other contaminants.

• A clock malfunction or improper setting could result in a test at a most inopportune time.

• Clocks are usually set to initiate a test on weekends or during early morning hours. If a malfunction occurs, skeleton crews are usually not equipped to handle it.

• When an automatic test clock is employed, equipment should be labeled to caution those working in and around the installation that equipment should be secured, locked out, and tagged out to prevent unexpected operation while personnel are working.

When a test switch is used to test an emergency power transfer system in the automatic mode, all relays, timing, and sensing circuits operate in accordance with the engineered sequence of operation. If the test procedure is written in accordance with this predictable sequence, then it is possible for one person to initiate a test and then move to another location to initiate another action or to monitor and record certain test results. The automatic sequence of operation should not vary greatly over time, so maintenance personnel movements can be correlated to that regular sequence. Further, test procedures can include emergency procedures to cover any malfunction. This is added insurance.

Key people should be trained to react almost instinctively when certain malfunctions occur. For example, in a multiple-engine standby generator system with automatic load-shed capabilities, an overspeed condition on one set may cause certain loads to be shed, retransferring them from the emergency source of power back to the normal source. This is necessary to avoid overloading the remaining generator sets. When performing a cold start, load blocks are added gradually so that the bus will not be overloaded. The highest-priority loads are added first on a cold start and shed last in the event of a failure. The test procedure should simulate such a condition and include steps to:

1. Confirm that the loads were retransferred.

2. Reset the failed generator and put it back online.

3. Confirm that the load is again transferred to the emergency source of power.

4. Be aware of vulnerabilities of ATS load equipment, including chillers and other HVAC equipment (especially those powered by variable frequency drives) and UPS equipment. UPS systems are always vulnerable during ATS transfer and retransfer operations. Although UPS batteries may be designed and rated to carry the load for power interruptions of 15 minutes or more, numerous failure modes exist by which the UPS can fail immediately or within seconds of power loss. Many facility engineers believe that live ATS load transfer testing provides a good test of UPS batteries, since the batteries carry the load for several seconds. This is not a reliable test of UPS batteries and connections: although the batteries may pass a test that requires operation for five or ten seconds, they may not survive a test for fifteen seconds. Many UPS systems successfully survive the forward transfer to standby power during testing or an actual power outage, only to drop the load during retransfer to normal because a marginal component failed or the batteries were on the edge of failure during the forward transfer.
A minimum checklist for an emergency power system test should include the following items:

1. The time from when the test switch is activated until the engine starts to crank

2. The time required for the engine to achieve nominal voltage and frequency after the test switch is activated

3. The time the transfer switch remains on the emergency generator after the test switch is released

4. The time the generator system runs unloaded after the load circuits have been retransferred back to the normal source of power
5. Proper shutdown of the engine generator set and engine auxiliary devices
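The five checklist items above lend themselves to a simple test record. The sketch below is a hypothetical structure for logging the measured times; the 10-second readiness limit is an assumption (a commonly cited figure for systems required to accept load within 10 seconds), and the real acceptance values come from your applicable code and design documents.

```python
# Hypothetical sketch: a record of the five emergency power system test
# checklist measurements, with an illustrative 10 s readiness limit.

from dataclasses import dataclass

@dataclass
class EPSSTestRecord:
    crank_delay_s: float       # item 1: test switch activation to engine crank
    ready_time_s: float        # item 2: activation to nominal voltage/frequency
    retransfer_delay_s: float  # item 3: time on emergency after switch release
    cooldown_run_s: float      # item 4: unloaded run after retransfer
    shutdown_clean: bool       # item 5: proper engine/auxiliary shutdown

    def findings(self, max_ready_s: float = 10.0) -> list[str]:
        """Return deficiencies found; an empty list means the test passed."""
        issues = []
        if self.ready_time_s > max_ready_s:
            issues.append(f"generator not ready within {max_ready_s} s")
        if not self.shutdown_clean:
            issues.append("shutdown sequence did not complete properly")
        return issues
```

Keeping each test as a structured record makes it easy to trend crank delay and readiness time across successive tests, which is exactly the kind of slow degradation regular testing is meant to expose.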
7.8 GENERAL RECOMMENDATIONS
Choosing the right power transfer switch for your application is essential to the operation of your facility. The device selected must also interface properly with other system components. For example, voltage detection set points must be uniform. If they are not, the UPS could sense a low voltage and revert to battery to protect the critical load while the power transfer switch does not react at all. Result? The engine generators will never start, the UPS battery will soon be exhausted, and the critical loads will crash. When selecting an ATS and hardware, the following criteria are recommended:

• Automatic transfer switches transfer the load within six cycles. Make sure the power transfer switch you specify is the proper type and is equipped with the proper features for your application; failure to pay close attention may cause severe damage. For example, power transfer switches without proper phase-angle monitoring may subject motors to severe mechanical and electrical stress, which can cause motor winding insulation to fail. Excessive torque may also damage the motor coupling or loads. Cost will increase as more hardware and options are added, but the cost of wear and maintenance on other components, such as diesel engines, may justify the use of timing modules, which facilitate routine testing.

• Maintain continuous operation of a UPS system by selecting a closed-transition ATS, an open-transition ATS with an in-phase monitor, motor load disconnect control devices, or an ATS with programmed (delayed) transition.

• Take future expansion and load growth into account (and revisit the issue periodically).

• Equipment design must consider maintenance requirements. Bypass-isolation switches are relatively high-cost options but may be necessary to perform proper maintenance when loads cannot be deenergized.
The control circuit design should permit safe and thorough troubleshooting without deenergizing loads or requiring electricians to work in close proximity to high-power, high-voltage conductors. • For applications in which downtime cannot be tolerated and electrical power to the load is critical, an external or integral bypass-isolation switch (nonload break design) used with an ATS is the best choice.
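The set-point coordination problem described at the start of this section (a UPS that reverts to battery at a voltage the transfer switch never recognizes as a failure) reduces to a simple comparison. The percentage thresholds below are illustrative assumptions, not recommended settings.

```python
# Hypothetical sketch of the set-point coordination check described above:
# if the UPS reverts to battery at a voltage the ATS does not recognize as
# a source failure, the generators never start and the battery runs out.

def setpoints_coordinated(ups_dropout_pct: float,
                          ats_dropout_pct: float) -> bool:
    """The ATS must declare source failure at a voltage no lower than the
    point where the UPS transfers to battery (both in % of nominal)."""
    return ats_dropout_pct >= ups_dropout_pct

# Coordinated: the ATS starts the engines (dropout at 90% of nominal)
# before the UPS battery is the only thing carrying the load (85%).
assert setpoints_coordinated(ups_dropout_pct=85.0, ats_dropout_pct=90.0)

# Miscoordinated: the UPS goes to battery at 90%, but the ATS waits
# for 80% and may never act -- the failure mode described in the text.
assert not setpoints_coordinated(ups_dropout_pct=90.0, ats_dropout_pct=80.0)
```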
7.9 CONCLUSION
The ATS is one of the basic building blocks of a reliable backup power system and plays a vital role in power reliability, which is necessary for critical infrastructures. Interruptions in the flow of power can cause significant losses, so ensuring that emergency power comes on quickly is imperative. A properly selected, installed, and maintained ATS provides a quick, automatic transfer that eliminates having to switch utility service to the backup generator manually, reducing the downtime that a critical infrastructure experiences. The next chapter focuses on static transfer switches (STSs), which play a similar role to ATSs but can switch between primary and secondary sources in less than one-half cycle, or about 8 milliseconds, rather than six cycles.
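The transfer times quoted above follow directly from the power frequency; a short calculation, assuming a 60 Hz system, shows where the figures come from.

```python
# Arithmetic behind the transfer times quoted above, assuming a 60 Hz system.

FREQ_HZ = 60.0
CYCLE_MS = 1000.0 / FREQ_HZ        # one cycle is about 16.67 ms

ats_transfer_ms = 6 * CYCLE_MS     # ATS: within six cycles
sts_transfer_ms = 0.5 * CYCLE_MS   # STS: less than one-half cycle

assert round(ats_transfer_ms) == 100     # roughly 100 ms
assert round(sts_transfer_ms, 1) == 8.3  # the ~8 ms figure in the text
```

On a 50 Hz system the same cycle counts would correspond to 120 ms and 10 ms respectively.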
8 STATIC TRANSFER SWITCH
8.1 INTRODUCTION
The UPS and the static transfer switch (STS) are the alpha and omega of the power system in mission critical sites. The UPS and its accompanying backup power plant/ATS, as covered in previous chapters, are the first line of power quality defense: the UPS removes all power anomalies generated external to the site. The STS is the last line of power quality defense and removes all power problems that occur in the site between the UPS and the STS. The STS does not protect against power component failures downstream of the STS; it merely transfers between sources so that the customer's loads are preserved. In this manner, the static switch is used to protect the information and/or financial business that is transacted on a computer server. The STS receives power from two or three independent power flow paths; generally, the paths derive power from one or more UPSs. If one power path fails, the function of the STS is to transfer the load to the other power path. The transfer of the load between the two power paths that feed the static switch is fast enough to ensure that the load (the customer's information/business transacted over the servers supported by the switch) is always preserved. These independent power paths are set up so that they do not share components such as circuit breakers, power panels, distribution panels, transformers, conduit, wiring ducts, or wiring trays. The STS performs four major functions:
1. It allows maintenance and repair of the upstream site components without affecting the load.

2. It prevents upstream component failure from affecting the load.

3. It reduces transformer inrush current from transformers connected to the STS output, which reduces circuit-breaker tripping and UPS bypassing.

4. It provides data to help management meet the key elements of Sarbanes-Oxley (SOX), Basel II, and NFPA 1600; that is, it provides data for production risk management, production upset analysis, and security management, as discussed in Appendix A.
8.2 OVERVIEW

8.2.1 Major Components
The STS contains only three major components, excluding the sheet metal:

1. Protective and isolating circuit breakers or MCSWs

2. Logic, especially triple-modular redundant logic to ensure the safety of the loads (three independent sets of control logic)

3. Two sets of SCRs, configured as single-phase or three-phase switches

Circuit Breakers. The STS can be supplied with standard circuit breakers, but generally it is supplied with circuit breakers containing only instantaneous trips. These circuit breakers with only instantaneous trips are referred to as nonautomatic circuit breakers or molded-case switches (MCSWs). The circuit breakers or MCSWs are selected to have an ampere-squared-second (I2t) rating that protects the silicon-controlled rectifiers (SCRs) from damage when the STS output is shorted. If the STS is supplied with MCSWs, upstream source circuit breakers or MCSWs must be rated to protect the STS internal power conductors and even the MCSW frames. The continuous current rating of the STS must be equal to or less than the continuous rating of the upstream protective device. To provide hot-swap capability, all the MCSWs are plug-in or draw-out. Most STSs are supplied with two paralleled isolation circuit breakers or MCSWs for two reasons:

1. Redundancy

2. A spare MCSW: if all MCSWs supplied are identical, one isolation MCSW can be unplugged and used to temporarily replace any of the other MCSWs.

Logic. Since the logic is the heart of the STS, it is typically specified to be triple-modular redundant with voting circuits. This redundancy ensures that the switch supports the loads (the business operating on the servers located downstream of the switch). Part of the logic function is the power supplies that power the logic elements. These
logic power supplies are also typically specified as triple redundant for the same reasons listed above. The SCR gate drivers are generally specified to be redundant for the S1 SCRs and redundant for the S2 SCRs, thereby requiring six different sets of gate drivers. Again, this is designed so that the switch fails safe and supports the loads. Triple-modular redundant gate drives are also available (three independent sets of six gate drives). The logic provides the following functions:

• Monitors both sources and the output power and determines if and when the load should be transferred

• Provides signals to the SCRs to turn them on and off when transferring or retransferring the load from one source to the other

• Provides the interface, generally multiscreen and graphical, between the operator and the STS

• Provides metering, status indicators, and alarm indicators, locally on the operator interface and remotely via communications links

• Inhibits transfers of short-circuit currents or transfers to a source that is worse (in terms of power and/or power quality) than the connected source

• Learns the environment and adjusts transfer parameters to prevent continuous nuisance transfers due to incoming-power noise

• Captures waveforms for three cycles before and after transfer events

• Captures waveforms for three cycles before and after transfer-inhibit events

Silicon-Controlled Rectifiers. The silicon-controlled rectifiers (SCRs) are the electronic elements that control the power flow between the source and the load and, therefore, control which source will provide power to the load. An SCR can be turned on, or gated on, by the logic; turning it off requires removing the gate drive and either interrupting or reversing the power flow through the SCR, which can be accomplished by natural or forced commutation.
The SCRs must be rated for the available I2t let-through of the MCSWs and for the continuous rated current of the STS.
8.3 TYPICAL STATIC SWITCH, ONE-LINE DIAGRAM

8.3.1 Normal Operation
Refer to Figure 8.1. Power is supplied to the STS from two sources: S1 and S2. In normal operation, both of the bypass circuit breakers or MCSWs are open and all other circuit breakers or MCSWs are closed. The logic can control the SCR switches to be on or off. Normally, one of the SCR switches is on and is supplying power to the load. If the source supplying power to the load has an anomaly, the logic will disconnect the failing source and turn on the SCRs
182
STATIC TRANSFER SWITCH
being powered by the good source. The transfer of the load by turning off one set of SCRs and turning on the other set of SCRs can be accomplished in four milliseconds or less; thus, the transfer is transparent to the load.

As the discussion above indicates, the STS is a dual-position switch designed to connect a critical load to one of two separate sources. The source to which the load is normally connected is designated the preferred source and the other, the alternate source. Either source can be designated as the preferred or alternate via operator selection. The load is transferred to the preferred source whenever that source is within specified tolerance, unless retransfer is inhibited via operator selection. When retransfer is inhibited, the load is transferred from one source to the other only when the connected source deviates from the specified tolerance.

Figure 8.1. Typical static switch, one-line diagram. Each source feeds an input CB and an SCR switch under logic control; the SCR switches join at two paralleled isolation CBs, and the S1 and S2 bypass CBs route around the SCR paths. S1 = power source 1; S2 = power source 2; SCR = silicon-controlled rectifier; CB = circuit breaker.
8.3.2 Bypass Operation
If the SCRs or logic must be serviced or upgraded by a factory-authorized technician, the STS can be placed in bypass operation by closing one of the bypass circuit breakers shown in the one-line diagram. After the bypass circuit breaker is closed, the two input circuit breakers and the isolation circuit breakers or MCSWs can be opened, thus isolating the SCRs and logic for servicing without disrupting power to the critical loads. Following the reverse procedure, the load can be transferred from bypass back to normal operation without disrupting power to the critical loads. This type of operation should only be performed by a trained technician because improper sequencing of this procedure may compromise the power feeding the servers. The circuit breakers or MCSWs are interlocked in the following manner to prevent the interconnecting of the two sources:

• The bypass circuit breakers are interlocked so that only one can be closed at a time.
• The source 1 input circuit breakers or MCSWs and the source 2 bypass circuit breakers or MCSWs are interlocked so that only one can be closed at a time.
• The source 2 input circuit breakers or MCSWs and the source 1 bypass circuit breakers or MCSWs are interlocked so that only one can be closed at a time.
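The three interlock rules above can be expressed as a single validity check on breaker states. In a real STS the interlocks are mechanical and electrical, not software; the function below is only a sketch of the logic (True meaning closed):

```python
def interlocks_ok(s1_input, s2_input, s1_bypass, s2_bypass):
    """Return True if a breaker-state combination satisfies the interlocks."""
    if s1_bypass and s2_bypass:
        return False    # only one bypass CB may be closed at a time
    if s1_input and s2_bypass:
        return False    # S1 input and S2 bypass: only one closed at a time
    if s2_input and s1_bypass:
        return False    # S2 input and S1 bypass: only one closed at a time
    return True

# Normal operation: both inputs closed, both bypasses open
print(interlocks_ok(True, True, False, False))   # True
# Bypass to S1: S1 bypass closed, S2 input must be open
print(interlocks_ok(True, False, True, False))   # True
```

Any state the function rejects would interconnect the two sources, which is exactly what the mechanical interlocks prevent.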
8.3.3 STS and STS/Transformer Configurations
The STS can be specified as a stand-alone unit, but it is generally specified as part of an STS/transformer system. There are two STS/transformer configurations:

1. A primary system, in which the transformer is connected to the STS output
2. A secondary system, in which each STS input is fed from a transformer

There are many iterations of these configurations and different trade-offs made when choosing between stand-alone, primary, and secondary systems. Some of these trade-offs are outlined later in this chapter.
8.4 STS TECHNOLOGY AND APPLICATION

8.4.1 General Parameters
The STS/transformer system is the last line of defense, and the STS logic provides the transfer algorithm, the operator interface, status data, alarm data, and forensics data. The STS or the STS/transformer system must have or perform the following:

• The STS must be as electrically close to the load as possible.
• The STS must provide data to help management meet the key elements of the policies and regulations identified in Appendix A (i.e., production risk management data, production upset data, etc.) and provide production security.
• The STS must communicate status locally and remotely, including waveform capture, possible problems (predictive maintenance), immediate problems (alarms), and forensics (data for failure analysis).
• The STS and the overall system design must be resilient to human error, by both operator and maintenance personnel.
• The STS must provide data logging of any configuration changes and record when and by whom any changes are made.
• The STS and/or the STS/transformer system must have 99.999% availability.
• The STS and/or the STS/transformer system should minimize the mean time to repair (MTTR).
• The STS and/or the STS/transformer system must be internally fault tolerant, must be fail safe, and must not fail due to external faults and abnormalities.
• The STS and/or the STS/transformer system must be safely repairable while maintaining load power.
• The STS and/or the STS/transformer system should have hot-swappable modules and components.
8.4.2 STS Location and Type
The STS should be as electrically close to the load as possible. A rack-mounted STS is used to supply this dual-power-path redundancy as close as possible to rack-mounted loads, such as servers. Typically, these are single-phase static switches, and contactors often replace SCRs in rack-mounted STSs. Most data centers will have three-phase static switches in their design. The server rack systems are typically supplied single-phase power from a remote power panel (RPP) or a remote distribution unit (RDU), which are in turn fed power from an STS or STS system. The STS or STS system can be stand-alone or part of a primary or secondary static switch power distribution system, as previously outlined. These STSs or STS systems will typically feed the RDU, which in turn feeds the servers in the racks in a data center.
8.4.3 Advantages and Disadvantages of the Primary and Secondary STS/Transformer Systems

Each configuration has advantages and disadvantages, and the decision is different for every data center. Some of the decision-making criteria are listed below as a sample:

Primary System Advantages
• Less costly, since only one transformer is required
• Less floor area required, since there is only one transformer

Secondary System Advantages
• High level of maintainability. Since there are two transformers, one can be serviced or replaced while the other transformer is powering the load.
• Higher level of redundancy. The STS is closer to the load, so a transformer failure will not cause a power outage.
• The possibility of transformer in-rush is much less. The STS in a primary system should have a low in-rush transfer algorithm; this algorithm is not required in a secondary system.

There are many opinions and theories about which system is better, but the best fit will be a result of considering all of the design criteria for each individual data center.
8.4.4 Monitoring, Data Logging, and Data Management
The STS should incorporate features that allow a company's management to meet the key elements of SOX, Basel II, NFPA, and other regulations (i.e., production risk management data, production upset data, etc.) and provide production security.
The STS should communicate status locally and remotely, including waveform capture, possible problems (predictive maintenance), immediate problems (alarms), and forensics (data for failure analysis). Most STSs have a dynamic, multiscreen graphic display rather than a text display. The multiscreen graphic display can present information as icons, images, and text; this combination allows the operator to absorb the information more quickly than with a text display. Another advantage of the graphic display is the ability for the operator to view the waveforms of the two sources and of the STS output. The graphical/touch-screen operator interface is complex, however, and a failure leaves the operator blind. A simple backup interface with LED indicators and manual control switches is an important safeguard and ensures that the switch can be properly operated if the monitor fails for any reason. Be sure to have up-to-date procedures that the operator can use.
8.4.5 Downstream Device Monitoring
Some STSs have a feature that allows the site manager to route downstream distribution devices into the monitor of the switch. The advantage of this arrangement is that the switch serves as a hub for monitoring these panels and subfeed breakers, which improves visibility of the power scheme down to the individual breaker. Access to this information allows the data center manager to better utilize existing equipment when adding loads and to pinpoint areas where no more loads should be added. Another advantage is that this hub feature in the STS minimizes overall cost for the owner, as these points will not need to be monitored separately by the building management system and independently wired. In general, the ability to see the power down to the individual breaker improves the data center manager's decision making and, although this capability can be retrofitted to an existing site, the most economical time to implement it is during the construction phase of a new data center.
8.4.6 STS Remote Communication
It is becoming increasingly important for power quality products to provide information along with ensuring the quality of the power being provided. If the system is being overloaded or performance degradation is occurring, operator and maintenance personnel must be alerted (alarms). If any component in the system fails, data center personnel need to be able to establish what happened (forensics) in order to prevent future occurrences. The STS monitor should provide a log of events, alarms, voltage and current values, frequency, waveform captures, and other information. The information is useful when displayed locally, but most data center, facility, and IT managers will want to view the information remotely. The data needs to be communicated in a universal language in order to be viewed with the site management system, and there are several standard protocols for these communications. The STS should have, as a standard or optional feature, all of the necessary hardware and software for this monitoring of the STS status and alarm data and the accompanying waveforms. The typical protocols needed are:
• Modbus RTU or ASCII
• Modbus TCP (Modbus over Ethernet)
• SNMP (Simple Network Management Protocol)
• Web-enabled communication
• Data available using a Web browser such as Internet Explorer
• Summary data that is e-mailed at specified times
Other features sometimes requested are cell or land-line modems to alert users of alarms and status changes.
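As a concrete example of the protocols listed above, a monitoring host polling an STS over Modbus TCP sends a fixed 12-byte request for function 0x03 (read holding registers). The sketch below builds such a frame with Python's standard library; the unit ID and register address are placeholders, since the actual register map is vendor specific:

```python
import struct

def modbus_read_request(transaction_id, unit_id, start_addr, count):
    """Build a Modbus TCP ADU for function 0x03 (read holding registers)."""
    # PDU: function code (1 byte), starting address (2), register count (2)
    pdu = struct.pack(">BHH", 0x03, start_addr, count)
    # MBAP header: transaction id, protocol id (0), length (unit id + PDU), unit id
    mbap = struct.pack(">HHHB", transaction_id, 0x0000, len(pdu) + 1, unit_id)
    return mbap + pdu

frame = modbus_read_request(transaction_id=1, unit_id=0x11,
                            start_addr=0x0000, count=4)
print(frame.hex())   # 0001 0000 0006 11 03 0000 0004 (spaces added for clarity)
```

The response carries the register values in the same big-endian framing; a building management system would decode them against the vendor's published register map.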
8.4.7 Security
Security is important for all equipment in mission critical sites, and especially for the UPS and STS systems, because of the sensitive loads they feed. The quality of the power protection afforded by the UPS and STS is largely determined by operating parameters that are set in each unit, so protecting the UPS and STS operating parameters against unauthorized changes is a primary concern when setting up an internal/external firewall for the security system.

Two other levels of security concern access to the mechanical devices in the switch. The first level starts with the doors of the units, which should be secured via a lock with a specific key. The second level starts with the circuit breakers or MCSWs and limits the ability of an untrained person to toggle or change their status, which could cause power losses. The caveat for the first-level (door) security is that the doors must be opened in emergency situations to operate bypass circuit breakers or MCSWs; hence, the keys to equipment doors must be readily available.

There are two issues to be addressed in security management: preventing problems and preventing the recurrence of problems. Prevention of problems requires physical and psychological barriers. A lock on a door is a physical barrier, and alarming a door with a sign indicating the door is alarmed is a psychological barrier; both are effective. A system involving both a psychological barrier and a requirement for a specific personnel identification device is especially effective. Preventing the recurrence of problems requires logging enough data to make a full forensic analysis of a security problem. From the forensic analysis, a solution can be formulated and implemented.
Some of the security features offered by STS manufacturers are:

• The sign-in process to change the STS operating parameters should include a password and a PIN, so the operator who is signed in can be identified and logged with a time/date stamp. The password provides physical security; the PIN provides psychological security and data collection.
• The sign-in is only good for a predetermined amount of time.
• The use of a fingerprint and/or swipe-card device instead of a password and PIN to allow changes to the operation of the STS.
• Provide data to help management meet the security elements mentioned in Chapter 2 and provide data for security management.
• Use door-opening alarms if the user does not sign in first. Entry should be logged and time/date stamped. Since the change of status of a circuit breaker is logged and time/date stamped, the operator changing circuit breaker status can be identified.
• Be aware that vendors will plug in their laptops during maintenance activities. This can cause a system problem if the laptop has a virus.

The security features should be an important part of the specification used to procure an STS.
8.4.8 Human Engineering and Eliminating Human Errors
It has been stated that greater than 50% of mission critical site problems are caused by human error. It has also been written that a human has difficulty following written instructions in high-stress situations. Operators under stress respond well to visual text instructions with icons and simultaneously voiced instructions. STS suppliers address human errors by the following methods:

• Operating instructions written in text on a graphics screen, as well as voiced, to avoid misoperation
• Screen icons to guide operators in a step-by-step fashion through the command steps
• One display for STS/transformer systems, so that the operator has a common place to view data
• STS and STS/transformer systems with door alarms for psychological barriers
• Password and PIN or fingerprint/swipe card for psychological barriers

Since human engineering can help eliminate the human error component of mission critical site problems, the human engineering features should be an important part of the specification used to procure an STS.
8.4.9 Reliability and Availability
As mentioned in Chapter 3, availability is a function of reliability, or mean time between failures (MTBF), and mean time to repair (MTTR):

Availability = MTBF/(MTBF + MTTR)

When availability is equal to one, there is no downtime. Availability approaches 1 as MTBF approaches infinity or as repair time approaches zero. MTBF is a function of the number of components and the stress levels on the components. The STS logic is generally the limiting factor of MTBF due to the large number of components mounted on the PCBs.
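The availability formula above is easy to exercise numerically; the MTBF and MTTR figures below are illustrative, not vendor data:

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(avail):
    """Expected annual downtime implied by an availability figure."""
    return (1 - avail) * 8760 * 60   # 8760 hours in a year

# Example: MTBF of 100,000 hours, MTTR of 4 hours
a = availability(100_000, 4)
print(round(a, 6))                              # 0.99996
print(round(downtime_minutes_per_year(a), 1))   # 21.0 minutes per year

# "Five nines" (99.999%) corresponds to just over 5 minutes a year
print(round(downtime_minutes_per_year(0.99999), 2))   # 5.26
```

The arithmetic makes the design trade-off concrete: doubling MTBF and halving MTTR both push availability toward 1, which is why the chapter stresses redundancy and hot-swappability together.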
A new failure mode is becoming apparent. Radiation from space is becoming a problem as the junction areas in new FPGAs (field-programmable gate arrays) and other ICs become smaller and smaller and operating frequencies become higher and higher. The logic circuits and memories become prone to switching unexpectedly when hit by neutrons and/or other space particles. Neutrons are especially dangerous, since it may take 10 feet of concrete to shield chips from neutron radiation.

The most effective method of increasing MTBF is via redundancy. Most STS manufacturers can provide the following redundancies:

• Redundant logic power supplies.
• Redundant operator interface. The graphical/touch-screen operator interface is complex, and a failure leaves the operator blind. A simple backup interface with LED indicators and manual control switches is important.
• Redundant SCR gate drivers. This would mean having two drivers for the S1 SCRs and two drivers for the S2 SCRs. Since the drivers are in parallel, a failure of one cannot affect the other.
• Redundant logic.
• One manufacturer offers triple-redundant data-acquisition, logic, and gate-drive boards (10 times more reliable).

Triple redundancy with voting is the most recognized redundancy method. Triple redundancy is accomplished by having three circuits generate signals and routing the three signals to a voting circuit. The output of the voting is based on at least two signals being the same. The implementation of triple redundancy presents problems, as illustrated in the following example. Assume the nonredundant STS logic contains 500 components on each of three PCBs and has a logic and STS mean time to first failure (MTTF) of 9 years, and thus assume a logic failure would cause a loss of load power. In an effort to achieve triple redundancy, the designer could use three sets of PCBs (nine total) and add a simple voting circuit. This would probably increase the STS MTTF to 100 years.
However, part of the logic would fail in 3 years, since there are three times the components. The logic failure would not cause an STS output failure, but the logic would have to be repaired to return the system to the triple-redundant configuration. Therefore, using this method of triple redundancy would cause too much downtime for repair. The service costs would also increase. One way to implement triple redundancy well is to use a new system on a chip (SoC) and other ICs to reduce the parts count, achieving a logic mean time to first failure of 8-10 years and an STS mean time to first failure of 80-100 years. A quick way to estimate the logic parts count is by measuring the PCB area, excluding the logic power supplies (the magnetics require a large area). The logic of a typical nonredundant STS may contain 400-450 square inches of PCB area. The logic of a triple-redundant STS should contain only 5-10% more PCB area than the nonredundant logic. Note: If the owner is willing to pay a premium, there is no downtime with triple-modular-redundant control logic paths. The STS is still fully functional while waiting for repair after a component failure.
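The two-out-of-three voting described above can be sketched in a few lines. Real voters are hardware circuits, so this is only a model of the majority rule:

```python
def vote(a, b, c):
    """Majority vote of three redundant logic outputs: the result is
    whatever at least two channels agree on (None if all three differ)."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None   # triple disagreement: flag for alarm and repair

# One failed channel is outvoted by the two healthy ones:
print(vote("transfer", "transfer", "hold"))   # transfer
print(vote("hold", "transfer", "hold"))       # hold
```

The voter masks any single-channel fault, which is why a failed board degrades redundancy without interrupting the load; the triple disagreement case is the one that must raise an alarm.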
8.4.10 Repairability and Maintainability
The STS and/or the STS/transformer system should have hot-swappable modules, PCBs, and components. If the STS has to be placed in bypass mode for maintenance or repair, all protection is lost for the servicing period. If the component is truly hot-swappable, the load will be protected while the component is swapped. To be hot-swappable, the component must have the following attributes:

• The STS continues to protect the load within the CBEMA and/or ITIC power profiles while hot-swapping.
• The hot-swapping process should take less than 3 minutes.
• Fasteners securing the component must be captive or nonconductive, so that if a fastener were dropped, there would be no consequence.
• The removed connector must have guarded pins so that there is no event if a connector pin touches metal or the pins on another connector.
• The section exposed in order to hot-swap the component should not contain any voltage higher than 42 V (UL allows humans to be exposed to voltages up to 42 V).

Even if the STS is placed in bypass for the service period, the components should have the hot-swappable attributes so that service is quick and safe for personnel and the load.
8.4.11 Fault Tolerance and Abnormal Operation
The STS and/or the STS/transformer system should be internally fault tolerant and fail-safe, and should remain fault tolerant and fail-safe in the face of external faults and abnormalities. Redundancy does not necessarily make the STS fault tolerant. For example, if a shorted capacitor on one PCB or module could pull down the logic power to other PCBs, triple redundancy does little good, since all circuits will fail. Abnormal operation is the operation of the STS when abnormal situations occur, such as:

• Load short circuits. The STS should not transfer the short to the other source due to low voltage.
• Both sources have reduced voltage, beyond 15%. The load should transfer to the best source.
• Transferring magnetic loads between out-of-phase sources. The in-rush should not trip any internal or external circuit breakers or MCSWs.

Fault tolerance and abnormal operation must be specified and tested on the purchased STS.
8.5 TESTING
The following tests should be performed on a sample or production STS:

• Standard performance test, including transfer times, alarm and status indications, and monitoring functions.
• Redundancy testing:
  - Disconnect all but one logic power supply while the STS is operating and verify alarms.
  - Disconnect one logic power supply, then open each power source one at a time, and verify alarms and operation.
  - Disconnect enough fans to check redundancy when operating at full load and verify alarms and operation.
  - Disconnect one SCR gate driver on each source at the same time, verify that load transfer is proper, and then verify alarms and operation.
  - Remove the connector supplying logic power on each PCB and module, verify STS operation, and verify alarms and operation.
  - Verify the backup operator interface.
  - If there are serial signal buses, such as CAN buses, break each serial loop and verify operation and alarms.
• Reduce the voltage on one source by 20% and then reduce the voltage on the other source by 22%. The STS should transfer to the best source.
• Short the STS output, verify that the STS does not transfer to the other side, and, by reclosing the circuit breakers or MCSWs, verify that the STS is fully operational. Generally, factory test stations have less than 12 kA available, and all STSs have short-circuit ratings of 22 kA or higher.
• Verify the low in-rush algorithm used in the STS by connecting a transformer on the STS output and manually transferring the transformer load with the sources 0, 15, 120, and 180 degrees out of phase.
• Repairability. Remove and install several PCBs or modules with the load hot, in bypass or via hot swap.
• Verify internal fault tolerance:
  - Short the logic power supplies on each PCB or module; the load must be supported.
  - Remove or stop the crystal or oscillator on each processor, DSP, or FPGA; the load must be supported.
8.6 CONCLUSION
A static transfer switch is a critical component in a mission critical facility, working to protect the facility from a variety of power issues. Distribution problems that could ordinarily lead to outages, such as improperly made crimped or bolted connections, circuit breaker failure, overheating and/or failure of the conductors, failure of the upstream UPS, and the opening, shorting, or overheating of a site transformer, can be mitigated through the use of STSs. When designing, specifying, and operating STSs in a mission critical environment, do not overlook the following: monitoring and data collection, waveform capture of transfer and transfer-inhibit events, remote communications, redundancy, maintainability, reliability, and fault tolerance. If all of these are considered, the STS procured will indeed be a stable and reliable last line of defense in the mission critical power system.
9 FUNDAMENTALS OF POWER QUALITY
9.1 INTRODUCTION
You would not run your high-performance car on crude oil, so why run your mission critical computer system on low-quality power? Your data center needs high-octane power to keep it revving at optimum capacity. Just as a high-performance car needs quality fuel to operate at peak performance, today's digital equipment and data centers can falter when subjected to less-than-perfect power. As a network manager, facilities engineer, or chief engineer, you need to know how to avoid potential power pitfalls that can put undue wear and tear on your computers, data, and mission critical infrastructure.

Do not take the power from your wall outlet or local power station for granted. A common misconception is that utility power is consistent and steady. It isn't. The power we receive from the grid is 99.9% available, which translates into approximately 8 hours of downtime per year. That level was once acceptable, but today's data centers and other 24/7 operations require a much higher level of availability. For those facilities, the requirement is closer to 99.9999% availability, less than one minute of downtime a year. The newer-generation loads, with microprocessor-based controls and a multitude of electronic devices, are much more sensitive to power quality variations than the equipment used in the past. No matter where you are located, spikes, surges, brownouts, and other power problems are potential hazards to your computer
network, the engine that runs your business. Figure 9.1 shows the average cost of downtime for a range of industry sectors.

Some power problems stem from obvious sources, such as lightning. According to the National Weather Service, most regions in the United States experience at least 40 thunderstorms a year. In addition, it is estimated that 10% of all the thunderstorms that occur in the United States each year are considered severe. These storms, sometimes accompanied by high winds and flooding, can cause damage to aboveground and underground wires, thus interfering with the high-quality power your mission critical infrastructure requires.

Whereas the outside world poses numerous threats to your power quality, what happens inside your building can be just as dangerous. In fact, about 70% of all power quality disturbances are generated inside the facility, by the very equipment you purchase to run your business. Microprocessor-based equipment and computer systems oftentimes inject harmonics and other power quality events into the power infrastructure. If your office building is more than 10 years old, chances are its wiring was not designed to meet the demands of a growing number of PCs, laser printers, scanners, LANs, midranges, and mainframes. These older buildings are wired for loads such as elevators, copiers, and mechanical equipment. As more nonlinear loads, or switching loads, are used in the building, overloading of the electrical distribution systems occurs more frequently, leading to more power quality problems and failures that can harm valuable data and result in equipment loss or malfunction. Furthermore, electronic devices and other electrical equipment can feed power anomalies back through your power distribution system, producing an effect similar to viruses that have the potential to spread and infect sensitive electronics.
Many devices interconnect via a network, which means that the failure of any component has far-reaching consequences.
Figure 9.1. Cost of downtime.
Although you may not have control over nature, insufficient wiring, and the unpredictability of electricity, you can shield your mission critical infrastructure and data center from these hazards.
9.2 ELECTRICITY BASICS
Electricity must flow from a positive to a negative terminal, and the amount of electricity produced depends on the difference in potential between these two terminals. When a path forms, electrons travel to reach the protons and create electricity. The difference of potential depends upon the amount of electrons on the negative terminal; the greater the charge, the higher the difference in potential.

Imagine two tanks of water, equal in circumference and height, set apart from each other but connected at the bottom by a pipe and a closed valve. The first tank is full of water and creates a maximum pressure at the valve, whereas the second tank is completely empty. When the valve opens, the water flows into the second tank until it reaches equilibrium. The rate at which the water flows depends on the pressure difference between the two points. With electrons, the movement depends on the difference in electrical potential, or voltage, across the circuit.

Voltage is the force or "pressure" that is available to push electrons through a conductor from the negative to the positive side of the circuit. This force, measured in volts (V), is represented by the symbol E. While the voltage is pushing electricity, current measures the rate of flow. Current, measured in amperes, is represented by the symbol I. Current is determined by the voltage difference. In a water pipe, if the pressure differential is great, then the flow of water is swift, but as the pressure decreases, the rate of flow slowly dwindles until it reaches equilibrium. Voltage and current are directly proportional when resistance is constant (Ohm's Law).

Resistance is the opposition to electrical current by an object. This resistance to flow, represented by R, is measured in ohms. Resistance is analogous to the width and smoothness of the pipe in the two-tanks-of-water analogy. Each of these three quantities is affected by the other two and can be represented by the equation E = I x R (Ohm's Law).
This equation can also be written as R = E/I or I = E/R. For example, consider a circuit with a voltage of 12 V and a resistance of 4 ohms. The current can be found by using the equation I = E/R: 12 volts/4 ohms is equal to 3 amperes of current.

Another term used to describe electrical quantity is power, expressed as volt-amperes (VA; 1 kVA = 1000 VA). It is possible to have a very large voltage, but without current flow, the power realized is insignificant. Conversely, a large current flow with only a moderate voltage can have a catastrophic effect, as in a ground-faulted circuit. Another manner of expressing true power is in watts (W) or kilowatts (kW). In simplified terms, wattage is equal to voltage multiplied by amperage times a power factor that accounts for the type of load (inductive, capacitive, or resistive). Typical power factors range from 0.8 to 0.95.
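The worked example can be checked in code; the helpers below simply restate Ohm's law and the simplified wattage formula from the text:

```python
def current(volts, ohms):
    """Ohm's law: I = E / R."""
    return volts / ohms

def power_watts(volts, amps, power_factor=1.0):
    """Simplified real power from the text: W = E x I x PF."""
    return volts * amps * power_factor

print(current(12, 4))               # 3.0 A, the example above
print(power_watts(120, 10, 0.9))    # 1080.0 W
```

The second call illustrates the closing point of the paragraph: at the same 120 V and 10 A, the real power delivered shrinks as the power factor falls below 1.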
9.2.1 Basic Circuit
An electric circuit comprises a power source, wires, the load, and a means of controlling the power, such as a switch. Direct current (DC) flows in only one direction: the traveling electrons go only from negative to positive. For current to flow, the circuit must be closed. The switch provides a means for opening and closing the circuit. Remember that current favors the path of least resistance.

Manufacturers rate each piece of equipment for a given voltage, current limit, operating duty, horsepower or kW power consumption, kVA, temperature, enclosure, and so on. Equipment voltage ratings are the maximum voltage that can be safely applied continuously. The type and quantity of insulation and the physical spacing between electrically energized parts determine the voltage rating. Actual operating voltages are normally lower. The maximum allowable temperature at which components can properly operate continuously determines the current rating. Most equipment is protected internally against overheating by fuses or circuit breakers.
9.2.2 Power Factor
The power factor is the ratio of active power to apparent power in an alternating current (AC) system. There are four important components involved with the power factor: active power, reactive power, apparent power, and phase angle. Active power, also called real power or true power (P), is measured in watts (W) and is the amount of power available to do work. Reactive power (Q) is measured in reactive volt-amperes (VAR) and is expressed as an imaginary number in calculations. It is caused by capacitive and inductive loads storing power and returning it to the power source. Apparent power (S) is a combination of active (real) and reactive (imaginary) power and can be expressed as a vector. The angle between the current and voltage waveforms, known as the phase angle, is expressed as the Greek letter phi (φ) and is also equivalent to the angle between the real power and apparent power vectors.

A common nonmathematical analogy for power factor involves a mug of beer (Figure 9.2). The foam in the mug is the reactive power, and the liquid beer is the active or real power. The total contents of the mug make up the apparent power. With electricity, just as with beverages, it is desirable to have the most active power (beer) and to minimize the amount of reactive power (foam).

Power factor is always between 0 and 1 and may be described as leading or lagging, determined by the sign of the phase angle between the current and voltage waveforms. Purely resistive loads do not affect power factor, whereas capacitive loads will cause a leading power factor and inductive loads will cause a lagging power factor. Accordingly, a leading power factor can be corrected by adding inductive loads, and a lagging power factor can be corrected by adding capacitive loads. Power factor should always be considered when managing a facility, since many utilities will levy charges on commercial users with poor power factors. When correcting the power factor, it should be brought as close to 1 as possible.
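The relationship among the four quantities follows directly from PF = cos(φ); the 100 kVA, 0.8 PF figures below are illustrative:

```python
import math

def power_components(apparent_va, power_factor):
    """Split apparent power S into active P and reactive Q:
    P = S * cos(phi), Q = S * sin(phi), where PF = cos(phi)."""
    p = apparent_va * power_factor
    q = apparent_va * math.sin(math.acos(power_factor))
    return p, q

# A 100 kVA load at 0.8 power factor:
p, q = power_components(100_000, 0.8)
print(round(p))   # 80000 W of real power (the "beer")
print(round(q))   # 60000 VAR of reactive power (the "foam")
```

Note the 3-4-5 triangle hidden in the example: 80 kW and 60 kVAR combine vectorially to the 100 kVA of apparent power, which is why utilities penalize low power factors, as the full 100 kVA of current capacity must be supplied to deliver only 80 kW of work.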
Figure 9.2. Power factor beer mug analogy.

Power factor is important because systems with low power factors require greater currents to drive the same load than systems with high power factors. Low power factors also increase the amount of heat dissipated as a result of the higher current draw. Transformers, in particular, are vulnerable to power factor problems. The voltage on the secondary side of a transformer is affected by the loads connected to it; if the loads reduce the power factor, the voltage on the secondary side may drop. For this reason, power factor correction on transformer loads is important. Uninterruptible power supply (UPS) systems can also cause power factor problems because of their rectifier properties and switching components. These cause distortions on the input that are reflected back into the power source, and the distortions affect all equipment connected to that source, including the UPS itself.

Figure 9.3 depicts the different effects of power factor on the current waveform. The solid line represents the synchronous current and voltage for a purely resistive load, which has a power factor of 1 and a phase angle of 0°. The dotted line represents a leading current with a positive phase angle and a power factor less than 1; the leading current reaches its peak magnitude before the voltage reaches its peak. The dashed line represents a lagging current with a negative phase angle and a power factor less than 1; the lagging current reaches its peak magnitude after the voltage reaches its peak.
9.3 TRANSMISSION OF POWER
Power disturbances are increasing, not only because of utilities, but also because of the stress applied to the nation's electrical grid. Starting from the utility company to the workplace, the journey that power takes can be affected in many different ways. Facilities managers must identify a power quality problem in order to implement a proper corrective action, but first the cycle of power must be understood. This section gives a brief background on the production and distribution of electricity.
FUNDAMENTALS OF POWER QUALITY
Figure 9.3. Power factor waveforms.
9.3.1 Life Cycle of Electricity
There are four phases in the life cycle of electricity: generation, transmission, distribution, and end use. Generation is the actual process of converting other forms of energy to electrical energy, usually at a power generating plant. Utility companies generate their electrical power from many different energy sources, including nuclear, coal, oil, and hydroelectric. The next step, transmission, transports the electricity over long distances to supply towns and cities. The electrical energy is then distributed to the residential and commercial consumers, or end users. Electricity is very rarely the end product; it is used for kinetic work (e.g., to turn the blades of a fan), in solid-state devices, or converted to other forms of energy, such as light and heat.

In reference to power quality, transmission and distribution are most important; these two phases of the journey produce the most significant power quality disturbances. Once the power is generated, transformers step up the voltage to between 300,000 and 750,000 V in order to minimize the current flow. The heat dissipated in a conductor is proportional to the square of the current (I²R), so by keeping the voltage high and the current low, there are fewer disruptions caused by overheating. Once the electricity reaches the town, a transformer steps the voltage down to 66,000 V (typical) for local distribution. The voltage gets stepped down again to about 13,000 V (typical) at smaller substations before reaching your critical infrastructure, where it is stepped down to a usable 480 or 208 V. Each panelboard or load center then feeds branch circuits, sending 120 V convenience power over 15-amp branch circuits to the wall receptacles. Figure 9.4 lays out the transmission and
distribution process, demonstrating the flow of electricity from the generating station to end users such as industry, commercial, and residences.

Figure 9.4. Typical electric system. (Courtesy of Dranetz-BMI.)

Alternating current (AC) is preferred over direct current (DC) because it is easy to generate and transformers can adjust the voltage for many different power requirements. However, as mentioned in Chapter 3, DC power is now being considered in data center applications because of its efficiency and reliability attributes.

Weather can be crucial to reliable transmission of power. When winds are high, aboveground lines can sway too close together, creating arcs and turbulent voltage variations. An arc occurs when two objects of different potential create a spark between them: a flow of current through the air.
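The reason for stepping the voltage up can be made concrete with a quick I²R calculation. This is a minimal sketch assuming unity power factor; the 10 MW load and 10 Ω line resistance are illustrative figures, not from the text:

```python
def line_loss_watts(power_w, voltage_v, line_resistance_ohm):
    """I^2 * R loss for delivering a given power at a given voltage.
    Current I = P / V (unity power factor assumed); loss = I^2 * R."""
    current = power_w / voltage_v
    return current ** 2 * line_resistance_ohm

# Delivering 10 MW over a line with 10 ohms of resistance (illustrative):
for v in (13_000, 66_000, 345_000, 750_000):
    loss = line_loss_watts(10e6, v, 10)
    print(f"{v:>7} V -> {loss / 1e3:10.1f} kW lost in the line")
```

Because the loss falls with the square of the voltage, raising the transmission voltage by a factor of ten cuts the heating loss by a factor of one hundred.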
9.3.2 Single-Phase and Three-Phase Power Basics
You cannot see electricity, but you can record it. Figure 9.5 shows what one cycle of undistorted electricity looks like. In the United States, a cycle occurs 60 times per second. If a 1-second recording shows 60 cycles, look at the wave on a scale of milliseconds or even microseconds to see the distortions. Sophisticated recording equipment can capture the waveform graphically so it can be reviewed for disturbances; this equipment is referred to as disturbance- or power-monitoring equipment. The cycle shown in Figure 9.5 is a sine wave with a fundamental frequency of 60 Hz, the waveform model for reliable power.

Single-Phase Power. Some of the most common service designs using single-phase power are:
• 120 V, single-phase, two-wire: used for small residences and isolated small loads up to about 6 kVA
• 120/240 V, single-phase, three-wire: used for larger loads up to about 19 kVA
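The 120 V service level and the 60 Hz fundamental are tied together by the RMS definition: an undistorted 120 V RMS sine wave actually peaks near 170 V. A short sketch, using only values stated in the text:

```python
import math

F = 60.0                       # U.S. fundamental frequency, in Hz
V_RMS = 120.0                  # nominal service voltage (RMS)
V_PEAK = V_RMS * math.sqrt(2)  # ~170 V peak for an undistorted sine

def v(t):
    """Instantaneous voltage of the undistorted fundamental waveform."""
    return V_PEAK * math.sin(2 * math.pi * F * t)

# Sample one full cycle (1/60 s) and confirm the RMS works out to 120 V:
n = 10_000
samples = [v(k / (n * F)) for k in range(n)]
rms = math.sqrt(sum(s * s for s in samples) / n)
print(f"peak = {V_PEAK:.1f} V, measured RMS = {rms:.1f} V")
```

This is exactly what disturbance-monitoring equipment does in principle: sample the waveform fast enough to resolve features far shorter than one 16.7 ms cycle.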
Figure 9.5. A cycle of electricity. This undistorted sine wave is also known as the fundamental waveform. (Courtesy of Dranetz-BMI.)
Depending upon the rating of the service transformer system, voltages can be 120/240, 115/230, or 110/220. The 120/240 V, single-phase systems come from a single center-tapped transformer. When two hot legs and the neutral of a three-phase system are used, it is possible to have 120/208 V, single-phase, three-wire systems, which are commonly found within residential and commercial buildings.

Three-Phase Power. Three-phase power takes three single-phase sine waves and places them 120 degrees apart (Figure 9.6). A three-phase wiring system powers 1.73 (√3) times as much load as a single-phase system. The result is an almost smooth line instead of an up and down (on/off) effect.
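The "almost smooth line" claim can be checked numerically: for a balanced resistive load, the instantaneous powers of the three phases pulse individually, but their sum is constant in time. A minimal sketch (normalized 1 V / 1 A peaks are assumptions for illustration):

```python
import math

def three_phase_instantaneous_power(t, v_peak=1.0, i_peak=1.0, f=60.0):
    """Sum of instantaneous power on three phases spaced 120 degrees apart,
    feeding a balanced resistive load (voltage and current in phase)."""
    total = 0.0
    for k in range(3):
        shift = 2 * math.pi * k / 3          # 0, 120, and 240 degrees
        v = v_peak * math.sin(2 * math.pi * f * t - shift)
        i = i_peak * math.sin(2 * math.pi * f * t - shift)
        total += v * i
    return total

# Sample across one 60 Hz cycle: each phase pulses, but the sum stays flat
# at 3/2 * v_peak * i_peak = 1.5.
values = [three_phase_instantaneous_power(k / 6000) for k in range(100)]
print(f"min = {min(values):.6f}, max = {max(values):.6f}")
```

The constant total is why three-phase motors run smoothly while single-phase power delivers energy in pulses.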
Figure 9.6. Three-phase power is produced from the rotating windings of a generator. (Courtesy of Dranetz-BMI.)
Some of the most common service designs using three-phase power are:

• 120/208 V, three-phase, four-wire systems are a common service arrangement from the local electric utility for large buildings.
• 277/480 V, three-phase, four-wire systems use the 277 V level for fluorescent lights, 480 V for heavy machinery, and relatively small, dry-type, closet-installed transformers to step the voltage down to 120 V for convenience receptacles and other loads. This system is ideal for high-rise office buildings and large industrial buildings.
• 2400/4160 V, three-phase, four-wire systems are used only in very large commercial or industrial applications where on-site equipment mandates this incoming system design (e.g., large paper mills and manufacturing plants).
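The voltage pairs in these service designs are not arbitrary: in each case, the line-to-line voltage is √3 times the line-to-neutral voltage. A quick check, using only the nominal values listed above:

```python
import math

SQRT3 = math.sqrt(3)  # ratio of line-to-line to line-to-neutral voltage

# Each service design pairs a line-to-neutral voltage with a line-to-line
# voltage that is sqrt(3) times larger (nominal values round slightly):
for v_ln, v_ll in ((120, 208), (277, 480), (2400, 4160)):
    computed = v_ln * SQRT3
    print(f"{v_ln} V x sqrt(3) = {computed:.0f} V (nominal {v_ll} V)")
```

This is also why two hot legs plus a neutral of a 120 V three-phase system yield 208 V, as noted in the single-phase discussion.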
9.3.3 Unreliable Power versus Reliable Power
Reliable power is clean power that lets equipment function properly with little or no stress. When measuring reliable power, the monitor records an unchanging, stable sine wave at a rate of 60 cycles per second. A continuous sine wave that never changes and never ends would never cause problems, but it is almost impossible to find. Unreliable power exhibits any type of problem: sags, swells, flicker, dimming of lights, frequent malfunctions, and even deficient amounts of power that make equipment work harder. Unreliable power problems can affect many day-to-day activities. When measuring this type of power, the monitor may record so many disruptions that the sine wave is unrecognizable. Figure 9.7 identifies the most common forms of power quality disturbances.
Figure 9.7. Types and relative frequency of power quality disturbances. (Courtesy of Dranetz-BMI.)
9.4 UNDERSTANDING POWER PROBLEMS
A wide range of businesses and organizations rely on computer systems and a multitude of networked equipment to carry out mission critical functions. Often, network links fail because the equipment is not adequately protected from poor power quality. When a facility is subjected to poor power quality, the electrical devices and equipment are stressed beyond their design limits, which leads to premature failures and poor overall performance. Power supply problems can have significant consequences for business operations: damaged equipment, lost or corrupted data, and even risks to life safety can all result from power quality problems. In order to institute appropriate solutions, facilities engineers must clearly understand the causes and effects of specific power problems that may interrupt the flow of data across an enterprise. Table 9.1 outlines the causes and effects of these common power problems.

IEEE 1159, IEEE Recommended Practice for Monitoring Electric Power Quality, is the prevailing U.S. standard for categorizing power quality disturbances. The categories are:

• Transients
• RMS variations (including sags, swells, and interruptions)
  - Short-duration variations
  - Long-duration variations
  - Sustained variations
• Waveform distortions
  - DC offset
  - Harmonics
  - Interharmonics
  - Notching
• Voltage fluctuations
• Power frequency variations
9.4.1 Power Quality Transients
Transients, which are common to many electrical distribution systems, can permanently damage equipment. Transients are changes in voltage and/or current, usually initiated by some type of switching event, such as energizing a capacitor bank or electric motors, or by interference in the utility company's power lines. Transients are often called "ghosts" in the system because they come and go intermittently and may or may not impact your equipment. Transients last much less than a full cycle, oftentimes a millisecond or less. Broadly speaking, transients are classified as either impulsive or oscillatory. An impulsive transient is a sudden, non-power-frequency change in the steady-state condition of voltage, current, or both, that is unidirectional: either positive (adding energy) or negative (removing energy). Oscillatory transients swing through both positive and negative polarity.
Table 9.1. Causes and effects of power problems

High-voltage transients (spikes and surges)
  Possible causes: lightning; utility grid switching; heavy industrial equipment; arcing faults
  Effects: equipment failure; system lockup; data loss; nuisance tripping of ASDs

Harmonics
  Possible causes: switch-mode power supplies; nonlinear loads; ASDs; single-phase computers
  Effects: high neutral currents; overheated neutral conductors; overheated distribution and power transformers

Voltage fluctuations
  Possible causes: overloaded distribution networks; heavy equipment startup; power line faults; planned and unplanned brownouts; power-factor-correction capacitors; unstable generators; motor startups
  Effects: system lockup; motor overheating; system shutdown; lamp burnout; data corruption and loss; reduced performance; control loss

Power outages and interruptions
  Possible causes: blackouts; faulted or overloaded power lines; backup generator startup; downed and broken power lines
  Effects: system crash; system lockup; lost data; lost production; control loss; power supply damage; lost communication lines; complete shutdown

Low-voltage electrical noise
  Possible causes: electronic equipment; switching devices; motorized equipment; improper grounding; fault-clearing devices, contactors, and relays; photocopiers
  Effects: data corruption; erroneous command functions; short-term timing signal variations; changes in processing states; drive-state and buffer changes; erroneous amplifier signal levels; loss of synchronous states; servomechanism control instability; in-process information loss; protective circuit activation
Transients on AC lines, also known as surges, glitches, sags, spikes, and impulses, occur very quickly. Events like lightning and the switching of a reactive load commonly cause these transients. Although lightning generates the most impressive transients, switching large motors and transformers or capacitors creates transients big enough to affect today's delicate microprocessor-based equipment. Note that common industry practice is to refer to transients as "surges," even though spike is more descriptive. Technically, the term surge refers to an overvoltage condition of a longer duration.
Whether caused by energization of power-factor capacitors, loose connections, load or source switching, or lightning, these transients happen so quickly that they are imperceptible to the eye. However, recorded results from a power monitoring analyzer will reveal what types of transients are affecting the system. Transients can cause equipment damage, data corruption, reduced system life, and a host of irreproducible or unidentifiable problems. Figure 9.8 shows different types of transients and their effect on the normal waveform.
9.4.2 RMS Variations
Normal power follows a near-perfect sine wave. An RMS variation is typically one cycle or longer. Power quality disturbances can be recorded and later identified by comparing the associated waveform deviations with the normal sinusoidal curve. For instance, a swell makes the waveform increase outside of the normal range for a short period of time; a sag is the opposite, falling below the normal voltage range for a short time. Overvoltage and undervoltage conditions last longer than swells and sags and can cause serious problems. If these events occur often and go undetected, they will stress and eventually deteriorate sensitive equipment. Figure 9.9 shows the most common RMS variations.

Voltage Sags. Voltage sags usually occur when a large load is energized, creating a voltage dip. The voltage drops more than 10% below nominal for a period of one-half cycle to 1 minute. Here, the sine wave is less than its normal range and is not functioning to potential. For digital devices, a sag can be as perilous as an outage. Sags are usually caused by switching on any piece of equipment with a large in-rush current, including induction motors, compressors, elevators, and resistive heating elements. A typical motor startup draws six to ten times its normal operating current. Figure 9.10 shows the signature sag resulting from a motor start: voltage drops quickly, then steadily increases to nominal.
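Sag detection along these lines (more than 10% below nominal, for half a cycle up to a minute) is easy to sketch against a series of per-cycle RMS readings. This is an illustrative simplification; real disturbance monitors evaluate sub-cycle windows, and the sample series below is invented to mimic a motor start:

```python
def find_sags(rms_series, nominal=120.0, cycle_s=1 / 60, threshold=0.90):
    """Scan a per-cycle RMS voltage series for sags: dips below 90% of
    nominal lasting up to a minute. Returns (start_index, duration_s,
    minimum_voltage) tuples. Sketch only; monitors use sub-cycle windows."""
    sags, start = [], None
    for i, v in enumerate(rms_series):
        if v < nominal * threshold:
            if start is None:
                start = i            # sag begins
        elif start is not None:
            duration = (i - start) * cycle_s
            if duration <= 60:       # IEEE-style upper bound on a sag
                sags.append((start, duration, min(rms_series[start:i])))
            start = None
    return sags

# Invented motor-start signature: a sharp dip, then steady recovery.
series = [120] * 10 + [100, 102, 105, 108, 110, 112] + [120] * 10
print(find_sags(series))
```

Dips that never cross the 90% threshold (like the 110 and 112 V readings during recovery) are treated as normal fluctuation, not part of the sag.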
Figure 9.8. Transients shown as waveforms. (Courtesy of Dranetz-BMI.)
Figure 9.9. Waveforms of RMS variations. (Courtesy of Dranetz-BMI.)

Voltage Swells. Voltage swells usually occur when a large load is turned off, creating a voltage jump. A voltage swell is similar to shutting off a water valve too quickly and creating water hammer. The voltage rises more than 10% above nominal for one-half cycle to 1 minute; the sine wave exceeds its normal range. Swells are common in areas with large production plants; when a large load goes offline, a swell is almost inevitable. Figure 9.11 shows a swell generated during a fault.

Figure 9.10. Motor-start waveform signature. (Courtesy of Dranetz-BMI.)

Figure 9.11. Voltage swell time line. (Courtesy of Dranetz-BMI.)

Undervoltage. Undervoltage (a long-duration sag, as defined by IEEE) occurs when voltage dips more than 10% below nominal for more than 1 minute. An undervoltage condition is more common than an overvoltage condition because of the proliferation of potential sources: heavy equipment being turned on, large electrical motors being started, and the switching of power mains (internal or utility). Undervoltage conditions cause problems such as computer memory loss, data errors, flickering lights, and equipment shutoff.

Overvoltage. Overvoltage (a long-duration swell, as defined by IEEE) occurs when voltage rises more than 10% above nominal for more than 1 minute. The most common cause is heavy electrical equipment going offline. Overvoltage conditions cause the same kinds of problems: computer memory loss, data errors, flickering lights, and equipment shutoff.

Brownouts. A brownout is a long-duration undervoltage condition that lasts from hours to days. Brownouts are typically caused by heavy electrical demand with insufficient utility capacity.

Blackouts. A blackout is a zero-voltage condition that lasts for more than two cycles. The tripping of a circuit breaker, a power distribution failure, or a utility power failure may cause a blackout. The effects of an unanticipated blackout on networked computers include data damage, data loss, file corruption, and hardware failure.

Voltage Fluctuations. Voltage fluctuations are increases or decreases in voltage that stay within ±10% of the normal power range. The current never remains consistent as the voltage fluctuates up and down; the effect is similar to a load going on- and offline repeatedly. These variations should be kept within a ±5% range for the optimal performance of equipment. Voltage fluctuations are commonly attributable to overloaded transformers or loose connections in the distribution system.

Voltage Imbalance.
Voltage imbalance, as seen in Figure 9.12, is a steady-state abnormality defined by the maximum deviation (from the average of the three-phase voltages or currents) divided by the average, expressed as a percentage. Voltage imbalance can be the result of capacitor bank anomalies, such as a blown fuse on one phase of a three-phase bank. It is most important for three-phase motor loads.
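The imbalance definition above (maximum deviation from the average, divided by the average) maps directly to code. The 480/480/460 V readings below are illustrative, chosen to mimic one phase dragged down:

```python
def voltage_imbalance_pct(va, vb, vc):
    """Voltage imbalance as defined in the text: the maximum deviation
    from the average of the three phase voltages, divided by the
    average, expressed as a percentage."""
    avg = (va + vb + vc) / 3
    max_dev = max(abs(va - avg), abs(vb - avg), abs(vc - avg))
    return 100 * max_dev / avg

# One phase dragged down (e.g., a blown capacitor-bank fuse on that phase):
print(f"imbalance = {voltage_imbalance_pct(480, 480, 460):.2f}%")
```

For three-phase motors, even a small percentage of imbalance produces a disproportionate amount of extra heating, which is why the text singles out motor loads.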
Figure 9.12. Imbalance time line. (Courtesy of Dranetz-BMI.)
Voltage Notches. Voltage notches (Figure 9.13) occur on a continuous basis (a steady-state abnormality), and they can be characterized through the harmonic spectrum of the affected voltage. Three-phase rectifiers (found in DC drives, UPS systems, adjustable speed drives, and other applications) that have continuous DC current are the most important cause of voltage notching.
9.4.3 Causes of Power Line Disturbances
There are many causes of power line disturbances; however, all disturbances fall into two categories: externally generated and internally generated. Externally generated disturbances originate from the utility, whereas internally generated disturbances stem from conditions within the facility. The source of the disturbances may determine the frequency of occurrence. For example, if a downed power line initiates a disturbance externally, the disturbance is random and not likely to be repeated. However, internal cycling of a large motor may cause repeated disturbances of predictable frequency. Again, facilities engineers must understand the source of the problem to determine a solution to prevent recurrence. Lightning. When lightning actually strikes a system, it causes the voltage on the system to increase, which subsequently induces a current into the system. This transient can instantaneously overload a circuit breaker, causing a power outage. The major problem with lightning is its unpredictability. When and where lightning strikes has a great impact on a facility, its equipment, and personnel safety. A facility manager can only take precautions to dampen transients and provide a safe environment.
Figure 9.13. Notching waveform. (Courtesy of Dranetz-BMI.)

There is no single way to protect yourself and the equipment from lightning. However, lightning arrestors and diverters can help minimize the effects of a lightning strike. Diverters create a path to a grounding conductor, leading the voltage to a driven ground rod for safety. Many facilities install other devices, called surge protectors, to protect key critical equipment. A surge protector acts as a clamp on the power line, preventing rapid rises in voltage from moving past the choke point.

Electrostatic Discharge. Electrostatic discharge (ESD), commonly referred to as static electricity, also produces transients. Static electricity is similar in nature to lightning. When an electric charge is trapped along the borders of insulating materials, it builds up until the charge is forcefully separated, creating a significant voltage and sometimes a spark. ESD transients do not originate on the power lines and are not associated with utility power input. ESD requires an electrostatic-generating source (such as carpeting and rubber-soled shoes) and is exacerbated by a low-humidity environment. The high voltages and extremely rapid discharges can damage electronic devices. Static electricity builds on the human body; touching a part of a computer, like the keyboard, allows a discharge into the device, potentially damaging microprocessor components. Minimize ESD by providing paths to ground so that charges dissipate before they build to a damaging level. Since humid air is a better conductor than dry air, maintaining a humidity of 40 to 60% helps control electrostatic buildup.

Harmonics.
There are two different types of AC electrical loads:
1. Linear loads are such things as motors, heaters, and capacitors that draw current in a linear fashion without creating harmonic disturbances.
2. Nonlinear loads are such things as switching power supplies and electronic ballasts that draw current intermittently and produce harmonic distortions within the facility's distribution.

Harmonics cause many problems within transformers, generators, and UPSs: overheating, excessive currents and voltages, and an increase in electromagnetic interference (EMI). Major sources of harmonics are variable speed motor drives, electronic ballasts, personal computers, and power supplies. Any device that is nonlinear and draws current in pulses is a risk to an electrical system. Power supplies are a particular concern because they draw current only at the peak of the waveform, when voltage is at its highest, instead of extracting current continuously. The discontinuous flow produces great distortion, changing the current and adding heating strain.

Figure 9.14 shows how harmonics distort the sine wave. The fundamental waveform and the fifth harmonic are both represented separately. When added together, the result is a disfigured sine wave: the normal peak voltage is gone, replaced with two bulges. What happens to a sine wave affected by three or four harmonics at the same time? Harmonics have tremendous ability to distort the waveforms consumed by sensitive equipment. A common way to visualize harmonics is with the harmonic spectrum (Figure 9.15). IEEE Standard 519, IEEE Recommended Practices and Requirements for Harmonic Control in Electrical Power Systems, recommends that the voltage total harmonic distortion (THD) should not exceed 5% (including all the added frequencies) and that no single harmonic should exceed 3% of the fundamental.

Figure 9.14. The fundamental and the fifth harmonic. (Courtesy of Dranetz-BMI.)

Figure 9.15. Harmonic spectrum indicating a problem. (Courtesy of Dranetz-BMI.)

Harmonics are generated when voltage is produced at a frequency other than the fundamental. Harmonics are typically produced in whole-number multiples of the fundamental frequency (60 Hz). For example, the third harmonic is three times the fundamental frequency, or 180 Hz; the fourth harmonic is four times the fundamental, or 240 Hz; and so on up to an infinite number of harmonics. In general, the larger the multiple, the less effect the harmonic has on a power system, because the magnitude typically decreases with frequency. You should be concerned about harmonics when they occur over a long period, as they can have a cumulative effect on the power system. Electronic power supplies, adjustable-speed drives, and computer systems in general can add harmonics to a system and cause damage over time. Of significant concern are the triplen harmonics (odd multiples of three), as they add in the neutral conductor, causing overheated distribution neutrals and transformers (Figure 9.16).

PHASE SEQUENCE. An important aspect of harmonics is the phase sequence, as the sequence determines how the harmonic will affect the operation of a load. Phase sequence is determined by the order in which a waveform crosses zero, and the result is a positive, negative, or zero sequence. Table 9.2 shows the frequency and phase sequence of whole-number harmonics. Positive sequence harmonics (first, fourth, seventh, etc.) have the same phase sequence and rotational direction as the first harmonic (fundamental); the positive sequence slightly increases the amount of heat in the electrical system. Negative sequence harmonics (second, fifth, eighth, etc.) have a phase sequence opposite to the fundamental's. The reverse rotational direction of negative sequence harmonics causes additional heat throughout a system and reduces the forward torque of motors, decreasing motor output. Zero sequence harmonics (third, sixth, ninth, etc.) are neither positive nor negative. They have no rotational direction, so the zero sequence harmonics do not cancel out but add together in the neutral conductor.
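The THD figure used in this kind of assessment is simply the RMS of all harmonic components relative to the fundamental. A small sketch; the 4 V third harmonic and 2.4 V fifth harmonic are illustrative values, not measurements from the text:

```python
import math

def thd_pct(fundamental_rms, harmonic_rms):
    """Total harmonic distortion: the RMS sum of the harmonic components,
    relative to the fundamental, expressed as a percentage."""
    return 100 * math.sqrt(sum(h * h for h in harmonic_rms)) / fundamental_rms

# Illustrative: 120 V fundamental with a 4 V 3rd and a 2.4 V 5th harmonic.
harmonics = [4.0, 2.4]
print(f"THD = {thd_pct(120, harmonics):.2f}%")
# The text cites IEEE 519's recommendation: keep voltage THD at or below
# 5%, and any single harmonic at or below 3% of the fundamental.
```

In this example each individual harmonic is within the 3% single-harmonic guideline and the combined THD is under 5%, so the spectrum would pass both checks.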
The neutral conductor contains little or no resistance, which causes big problems because there is nothing to limit the current flow. Even numbers (second, fourth, sixth, etc.) usually do not occur in properly operating power systems; odd numbers (third, fifth, seventh, etc.) are more likely. The most
harmful harmonics are the odd multiples of the third harmonic. Table 9.2 shows the zero sequence harmonics (the most harmful) to be multiples of three. Because the even harmonics do not cause problems, the odd multiples of three are called the triplen harmonics (third, ninth, fifteenth, twenty-first, etc.). Triplens cause overloads on neutral conductors, telephone interference, and transformer overheating. Because harmonics are produced in multiples, they can also affect an electric system by combining with each other. Certain multiples of harmonics are easy to identify, but when they come in pairs, like a third and a fifth harmonic, the recorded waveform becomes distorted and difficult to identify (Figure 9.17).

Figure 9.16. The additive effect of the triplen harmonics. (Courtesy of Dranetz-BMI.)
Table 9.2. The frequency and phase sequence of whole-number harmonics

Harmonic              Frequency (Hz)   Sequence
Fundamental (first)   60               Positive (+)
Second                120              Negative (-)
Third                 180              Zero (0)
Fourth                240              Positive (+)
Fifth                 300              Negative (-)
Sixth                 360              Zero (0)
Seventh               420              Positive (+)
Eighth                480              Negative (-)
Ninth                 540              Zero (0)
Tenth                 600              Positive (+)
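The pattern in Table 9.2 repeats every three harmonics, so the sequence of any harmonic follows from its order modulo 3. A small sketch of that rule:

```python
def harmonic_sequence(n):
    """Phase sequence of the nth harmonic, following Table 9.2:
    1st, 4th, 7th, ... positive; 2nd, 5th, 8th, ... negative;
    3rd, 6th, 9th, ... zero (the set that adds in the neutral)."""
    return {1: "positive", 2: "negative", 0: "zero"}[n % 3]

def harmonic_frequency(n, fundamental=60):
    """Frequency of the nth harmonic of a 60 Hz fundamental."""
    return n * fundamental

for n in range(1, 10):
    print(f"{n}: {harmonic_frequency(n)} Hz, {harmonic_sequence(n)} sequence")
```

Note that the troublesome triplens (3rd, 9th, 15th, ...) all land in the zero-sequence column, which is why they accumulate in the neutral conductor rather than canceling.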
Figure 9.17. Harmonic distortion waveform. (Courtesy of Dranetz-BMI.)

Utility Outages. An outage, or blackout, is the most obvious power quality problem and the most costly to prevent. An outage is defined as the complete loss of power, whether momentary or extended. Utility outages are commonly attributable to rodents and other small animals eating through insulation or nesting in electric distribution gear; downed power lines from accidents or storms; disruption of underground cables caused by excavation or flooding; utility system damage from lightning strikes; and overloaded switching circuits. In simple terms, in an outage there is no incoming source of power. The consequences are widespread, amounting to more than an inconvenience: computer data can be lost, telephone communications may be out of service, equipment may fail, and products and even lives could be lost (as in hospitals or public transportation). Figure 9.18 shows an outage waveform captured during the August 2003 blackout that hit eight northeastern states and part of Canada.

EMI and RFI. Other, less damaging, disturbance initiators are radio-frequency interference (RFI) and electromagnetic interference (EMI). EMI and RFI are high-frequency power-line noises, usually induced by consumer electronics, welding equipment, and brush-type motors. Problems induced by RFI and EMI are usually solved by careful attention to grounding and shielding, or by an isolation transformer.
9.4.4 Power Line Disturbance Levels
Table 9.3 categorizes power line disturbances by threshold level, type, and duration. Type I disturbances are transient in nature, Type II are considered momentary in duration, whereas Type III disturbances are classified as power outages. Duration is the important factor because the longer the problem lasts, the worse the effects on an electrical system.
9.5 TOLERANCES OF COMPUTER EQUIPMENT
Computer tolerances of unstable power vary from manufacturer to manufacturer. The most important considerations are the duration and magnitude of the disturbances that the equipment can withstand. Understanding the tolerances of computer processing equipment (how well the computer power supply will bear up under transient-induced stress) will assist the facility manager (FM) in avoiding information loss and equipment failure. Most solid-state electronic loads will shut down after a certain amount of power instability. If tolerances are known, the FM can take measures to prevent this type of inadvertent shutdown.

Figure 9.18. Interruption time line. (Courtesy of Dranetz-BMI.)

Table 9.3. Power line disturbances and characteristics

Type I: transient and oscillatory overvoltage
  Causes: lightning; power network switching (particularly large capacitors or inductors); operation of on-site loads
  Threshold level: 200 to 400% of rated RMS voltage or higher (peak instantaneous, above or below rated RMS)
  Duration: spikes of 0.5 to 200 microseconds, up to 16.7 milliseconds, at frequencies of 0.2 to 5 kHz and higher
  Duration in cycles (60 Hz): 0

Type II: momentary undervoltage or overvoltage
  Causes: faults on the power system; large load changes; utility equipment malfunctions; on-site load changes
  Threshold level: below 80 to 85% and above 110% of rated RMS voltage
  Duration: from 0.06 to 1 second, depending on the type of power system and on-site distribution
  Duration in cycles (60 Hz): 0.5

Type III: outage
  Causes: faults on the power system; unacceptable load changes; utility or on-site equipment malfunctions
  Threshold level: below 80 to 85% of rated RMS voltage
  Duration: from 2 to 60 seconds if correction is automatic; unlimited if manual
  Duration in cycles (60 Hz): 120

The typical RMS voltage tolerance for electronic equipment is +6% and -13% of nominal voltage, although some electronic equipment requires tighter voltage tolerances. When voltage spikes or sags beyond the computer manufacturer's tolerances, the equipment may malfunction or fail. In practice, sags are more common than spikes. Let us review:

• Starting large electric loads that pull line voltage below acceptable levels usually causes sags; large loads taken offline usually cause spikes.
• Inadequate wire size causing overloaded circuits, and poor electric connections (loose or partial contact), also commonly cause sags.

End users often identify sags by noticing momentary dimming in their lighting. A drop of just 1 V on a 120 V circuit can be seen in the reduced output of an incandescent lamp. More serious sags and surges may cause computer disk read/write errors and malfunctions of other electronic devices. Brownouts (persistent undervoltage conditions) cause more general consumer concern due to widespread failures. A steady state is a range of ±10% of nominal voltage; voltage and current are monitored using the root mean square (RMS) method to determine whether they vary slowly or constantly. When a large number of nonlinear, single-phase devices operate from a three-phase (208/120 V) power system, the building wires, transformers, and neutral conductors are subjected to increased risk from harmonic-induced overloading. A steady state is the goal, and the following curves provide guidelines to prevent mishaps.
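The typical +6% / -13% tolerance band cited above is easy to express as a check against RMS readings. A minimal sketch, using the 120 V nominal from the text:

```python
def within_tolerance(v_rms, nominal=120.0, plus_pct=6.0, minus_pct=13.0):
    """Check an RMS voltage against the typical electronic-equipment
    tolerance of +6% / -13% of nominal cited in the text."""
    low = nominal * (1 - minus_pct / 100)   # 104.4 V floor for 120 V nominal
    high = nominal * (1 + plus_pct / 100)   # 127.2 V ceiling
    return low <= v_rms <= high

print(within_tolerance(110))  # a mild sag, still inside the band
print(within_tolerance(100))  # sagged below the -13% floor
```

Note the asymmetry: equipment generally rides through more undervoltage than overvoltage, which is why the band extends further below nominal than above it.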
9.5.1
CBEMA Curve
The Computer and Business Equipment Manufacturers Association (CBEMA) curve represents power quality tolerances by displaying comparative data. The curve provides generic voltage limits within which most computers will operate; it is purely a rule of thumb. Because computers differ in specifications and tolerances, the curve is used as a standard for sensitive equipment and as a generic guide. When analyzing a power distribution system, the recorded data are plotted against the CBEMA curve (Figure 9.19) to determine which disturbances fall outside the area between the two curves, that is, outside the envelope of the power quality window. For example, in Figure 9.19, point A represents recorded data that had no effect on the system's function. The next data point, point B, occurred within milliseconds of point A, yet fell outside the acceptable range of the CBEMA curve and into the danger zone. Power disturbances that lie outside the acceptable zone of the CBEMA curve place computer processing equipment at risk of losing information or shutting down.
Figure 9.19. CBEMA curve.
9.5.2
ITIC Curve
CBEMA refined this rule of thumb when it reorganized as the Information Technology Industry Council (ITIC) and updated the comparative data. The ITIC curve is more applicable to modern computers and to testing the performance of electronic equipment. With the significant increase in the use of new power conversion technology, this updated curve has become very useful. The ITIC curve, like the CBEMA curve, has a tolerance zone in which a system is generally considered safe from malfunction. Figure 9.20 shows the ITIC curve, a more technical curve typically used for higher-performance computer processing equipment.
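A simplified classifier against the undervoltage side of the ITIC envelope might look like the sketch below. The breakpoints used here are approximate values commonly quoted for the curve; consult the published ITIC curve itself, including its overvoltage limits, before relying on them:

```python
# Approximate undervoltage breakpoints of the ITIC curve (percent of nominal
# voltage vs. event duration in seconds). These values are an illustrative
# simplification; the authoritative envelope is the published ITIC curve.
SAG_LIMITS = [
    (0.02, 0.0),   # up to 20 ms: even a complete interruption is tolerated
    (0.5, 70.0),   # 20 ms to 0.5 s: voltage must stay at or above 70%
    (10.0, 80.0),  # 0.5 s to 10 s: voltage must stay at or above 80%
]
STEADY_STATE_LOW = 90.0  # beyond 10 s, voltage must stay within -10%

def sag_within_envelope(percent_of_nominal, duration_s):
    """True if an undervoltage event falls inside the tolerance envelope."""
    for max_duration, min_percent in SAG_LIMITS:
        if duration_s <= max_duration:
            return percent_of_nominal >= min_percent
    return percent_of_nominal >= STEADY_STATE_LOW

# A 10 ms complete dropout is inside the envelope; a 1 s sag to 60% is not.
print(sag_within_envelope(0.0, 0.010))  # → True
print(sag_within_envelope(60.0, 1.0))   # → False
```

Each recorded disturbance (magnitude, duration) can be run through a check like this to separate points such as A and B in the figures.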
9.5.3
Purpose of Curves
The CBEMA and ITIC curves may be applied to any electronic equipment that is sensitive to power disturbances, such as computers or programmable logic controllers. Looking at the CBEMA and ITIC curves, you can see the danger areas versus the voltage tolerance envelope. If a recorded voltage disturbance falls outside of the tolerance envelope, the equipment is susceptible to disruption or an outage. Determining the number and type of disturbances outside the guidelines will assist the designer or facility manager in planning corrective action to prevent loss of critical data and equipment.
9.6
POWER MONITORING
Look at power as a raw material resource instead of an expense. Today's digital environment is demanding, especially for uptime, nonlinear loads, and power quality assurance.

Figure 9.20. ITIC curve. (The plotted envelope is applicable to 120, 120/208, and 120/240 nominal voltages; the horizontal axis gives the duration of the disturbance in cycles (c) and seconds (s), from microseconds up to steady state.)

Power monitors provide data for benchmarking the power infrastructure and then sense, analyze, and report disturbances continually. The FM can view voltage and current waveforms, view sags and swells over time, study harmonic distortion, capture transients, and then plot all the data on a CBEMA or ITIC curve. Instead of reacting to problems individually, the FM should trend the numbers and types of disturbances. Only then can he or she properly evaluate remediation strategies and develop recommendations for power conditioning technologies. Acting promptly but thoughtfully to solve power disturbances will save money in downtime, equipment repair cost, and load analysis. Remember, power distribution systems are only reliable when closely monitored and properly maintained.

The primary purpose of power monitoring is to determine the quality of power being provided to the facility and the effects of equipment loads on the building distribution system. Although disturbances may be expected, the source of a disturbance is not always predictable. Power monitoring lets the FM determine whether a power quality event occurred on the utility or facility side of the meter, then pinpoint the cause of the disturbance and take countermeasures, hopefully before computer problems arise. For example, consider a voltage spike that consistently occurs around the same time every day. By monitoring the power, the FM can track down the source of the overload. By correlating disturbances with any equipment performance problems or failures, the FM makes a strong business case for remedial action.

Here, we are looking at the importance of power in terms of cost. By managing power, you can maximize the use of power at minimum cost while protecting your
sensitive electrical equipment. In computer equipment, improved power quality can save energy, improve productivity (efficiency), and increase system reliability, sometimes extending the useful life of computer systems. Thanks to the power management and analysis software now available, power quality monitoring is effective in a broad range of applications.

Power costs comprise electrical energy costs, power delivery costs, and downtime. Common power disturbances affect these cost centers and thus can be managed better with monitoring. Below are some examples:

• Electrical energy costs come from the types and qualities of services purchased. The utility usually handles billing, but any billing errors would likely go undetected without power monitoring.
• Monitoring can save money by evaluating the energy costs of different departments. The FM can then reduce usage in vacant or underproducing cost centers. Power monitoring in combination with utility rate analysis can help the FM negotiate utility rates.
• Conserving energy in noncritical areas during peak hours helps to avoid high electric demand charges.
• Power delivery costs comprise equipment, depreciation, and maintenance. Equipment stressed by unseen transients may require increased maintenance for consistent performance. The costs of repairs to maintain sensitive and expensive equipment continually increase.
• Downtime is the easiest way to lose a whole day of profit. Retailing, banking, credit card transactions, and many types of data processing can lose millions of dollars for every hour of downtime.

The bottom line is that if the tolerances of the equipment are known, then consistently staying within those limits is cost-effective. Data from the Electric Power Research Institute show the duration of interruptions as percentages of all interruptions (Figure 9.21). The most common disturbances last only milliseconds to microseconds.
Although these brief disturbances are almost invisible on a recording of a 60 Hz sine wave, they are nonetheless damaging to equipment. The most notable fact about this chart is the last column: 3% of interruptions last 30 minutes or longer. With millions of dollars at stake, and life and safety at risk, many companies would find such outages catastrophic. Utilizing a power monitoring system also allows for improved efficiency and utilization of the existing power system. By having reliable, firsthand data, the FM can determine if and where additional load can currently be added, and then plan for future corporate and technological growth. In an organization that depends upon computerized systems for day-to-day operation, power monitoring data is a very functional tool.
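As one way an FM might trend logged interruptions by duration, in the spirit of the EPRI chart in Figure 9.21 (the event list and bucket boundaries below are hypothetical, not EPRI data):

```python
from collections import Counter

# Hypothetical logged interruption durations in seconds. Real data would
# come from the facility's power monitoring system, not a hard-coded list.
events = [0.008, 0.02, 0.5, 0.05, 2.0, 0.01, 45.0, 0.1, 0.03, 2100.0]

def bucket(duration_s):
    """Assign an interruption to a duration bucket for trend reporting."""
    if duration_s < 0.1:
        return "< 6 cycles"     # under 0.1 s at 60 Hz
    if duration_s < 1.0:
        return "0.1 s - 1 s"
    if duration_s < 60.0:
        return "1 s - 1 min"
    if duration_s < 1800.0:
        return "1 - 30 min"
    return ">= 30 min"

counts = Counter(bucket(e) for e in events)
for name, n in counts.items():
    print(f"{name}: {n / len(events):.0%}")
```

Trended this way, the distribution makes it obvious whether a site is dominated by subsecond events (which a UPS rides through) or by long outages that require generator capacity.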
9.7
THE DEREGULATION WILDCARD
Answer these questions about your business: How important is the continuity of your power supply? How will utility deregulation affect the risk of disruption?
Figure 9.21. A portion of data obtained from the Distribution Power Quality Monitoring Project survey conducted by EPRI and Electrotek Concepts. This survey produced information on voltage sags and swells, transient disturbances, momentary disruptions, and harmonic resonance conditions.
POWER QUALITY TROUBLESHOOTING CHECKLIST: LOAD PROBLEMS

Problem Observed or Reported: ☐ Overloads Tripping ☐ Computer Hardware Problems ☐ Erratic Operation ☐ Shortened Life ☐ Other
Load Type: ☐ Computer ☐ Printer/Copier ☐ PLC ☐ Drive ☐ Motor ☐ Other
Problem Pattern:
  Day(s) of week: ☐ Continuous ☐ Random ☐ Monday ☐ Tuesday ☐ Wednesday ☐ Thursday ☐ Friday ☐ Saturday ☐ Sunday
  Time(s): ☐ Continuous ☐ Random ☐ Always Same Time ☐ Morning ☐ Afternoon ☐ Evening ☐ Night
Possible Problem(s): ☐ Operator Error ☐ Power Interruption ☐ Sags or Undervoltage ☐ Swells or Overvoltage ☐ Harmonics ☐ Transients or Noise ☐ Improper Wiring/Grounding ☐ Undersized Load ☐ Undersized System ☐ Improperly Sized Protection Devices ☐ Other
Measurements Taken at Load: Line Voltage ___ V; Neutral-to-Ground Voltage ___ V; Current ___ A; Power ___ W, ___ VA, ___ VAR; Power Factor ___ PF, ___ DPF; Voltage THD ___; Current THD ___; K Factor ___; Other ___
Waveform Shape:
  Voltage Waveform: ☐ Sinusoidal ☐ Non-Sinusoidal ☐ Flat-Topped ☐ Other
  Current Waveform: ☐ Sinusoidal ☐ Non-Sinusoidal ☐ Pulsed ☐ Other
Measurements Taken at Load (Over Time): Normal Voltage ___ V; Lowest Sag ___ V at ___ (Time); Highest Swell ___ V at ___ (Time), ___ % above normal; Highest Inrush Current ___ A at ___ (Time); Number of Transients Recorded ___ over a time period of ___ at a level of ___
Possible Problem Solution(s): ☐ UPS ☐ K-Rated Transformer ☐ Isolation Transformer ☐ Zig-Zag Transformer ☐ Line Voltage Regulator ☐ Surge Suppressor ☐ Power Conditioner ☐ Proper Wiring and Grounding ☐ Harmonic Filter ☐ Derate Load ☐ Proper Fuses/CBs/Monitors ☐ Other

Figure 9.22. Load problems—power quality troubleshooting checklist.
Deregulation is the government's way of preventing monopolies in the utility field. By changing some laws, smaller companies are able to supply power to consumers without restrictions imposed by large utilities. The intended benefit is cheaper power: utilities will compete to serve end users, and the open market will set the price. The problem with deregulation is the quality of the power supplied. Power produced at a cheap price and sold at a cheap price is probably cheap power. The least-cost utility will have the greatest number of clients, which may cause widespread problems. Making a phone call on Christmas is a good analogy: depending on your provider, you may have to wait for the lines to clear before you can place the call. Such factors are crucial to a mission critical facility, in which power cannot
POWER QUALITY TROUBLESHOOTING CHECKLIST: BUILDING DISTRIBUTION PROBLEMS

Problem Observed or Reported: ☐ CBs Tripping/Fuses Blowing ☐ Conduit Overheating ☐ Overheated Neutrals ☐ Electrical Shocks ☐ Damaged Equipment ☐ Humming/Buzzing Noise ☐ Other
Distribution Type: ☐ 1φ ☐ 3φ Y ☐ 3φ Δ ☐ Fuses ☐ CBs ☐ Voltage(s) ___ V ☐ Amperage Rating ___ A ☐ Other
Problem Pattern:
  Day(s) of week: ☐ Continuous ☐ Random ☐ Monday ☐ Tuesday ☐ Wednesday ☐ Thursday ☐ Friday ☐ Saturday ☐ Sunday
  Time(s): ☐ Continuous ☐ Random ☐ Always Same Time ☐ Morning ☐ Afternoon ☐ Evening ☐ Night
Problem or Distribution History:
  Has problem been observed or reported before? ☐ No ☐ Yes
  Was any corrective action taken? ☐ No ☐ Yes
  Are other parts of system affected? ☐ No ☐ Yes
  Are nonlinear loads in area, or on same circuit? ☐ No ☐ Yes
  Has there been any recent work or changes made to system? ☐ No ☐ Yes
  Are large power loads being switched ON/OFF? ☐ No ☐ Yes
  Is panel properly grounded? ☐ No ☐ Yes
Possible Problem(s): ☐ Conductors Undersized (Hot) ☐ Neutral Conductors Shared/Undersized ☐ High Number of Nonlinear Loads ☐ Voltage/Current Unbalance ☐ Harmonics ☐ System Undersized ☐ Improper Wiring/Grounding ☐ Other
Measurements Taken: Taken at Panel Located at ___; Voltage ___ V; Current ___ A; Power ___ W, ___ VA, ___ VAR; Power Factor ___ PF, ___ DPF; Voltage THD ___; Current THD ___; K Factor ___; Other ___
Waveform Shape:
  Voltage Waveform: ☐ Sinusoidal ☐ Non-Sinusoidal ☐ Flat-Topped ☐ Other
  Current Waveform: ☐ Sinusoidal ☐ Non-Sinusoidal ☐ Pulsed ☐ Other
Measurements Taken at Load (Over Time): Normal Voltage ___ V; Lowest Sag ___ V at ___ (Time); Highest Swell ___ V at ___ (Time), ___ % above normal; Number of Transients Recorded ___ over a time period of ___ at a level of ___
Possible Problem Solution(s): ☐ Oversize Neutrals ☐ Run Separate Neutrals ☐ Additional Transformer ☐ Separate Loads ☐ Harmonic Filter ☐ Proper Wiring and Grounding ☐ Proper Fuses/CBs/Monitors ☐ Additional Subpanel ☐ Surge Suppressor ☐ Power Factor Correction Capacitors ☐ Other

Figure 9.23. Building distribution problems—power quality troubleshooting checklist.
diminish during peak hours, surges and swells cannot be tolerated, and outages cannot be risked. With the technology revolution permeating all levels of society, the awareness of power quality becomes a very important issue. Today's sophisticated technology makes even instantaneous disturbances problematic. With little assurance of continuity of power, sophisticated backup power is imperative. Under deregulation, many utility companies need to cut costs and have no choice
POWER QUALITY TROUBLESHOOTING CHECKLIST: FACILITY TRANSFORMER AND MAIN SERVICE EQUIPMENT PROBLEMS

Problem Observed or Reported: ☐ CBs Tripping/Fuses Blowing ☐ Conduit Overheating ☐ Overheated Neutrals ☐ Electrical Shocks ☐ Damaged Equipment ☐ Humming/Buzzing Noise ☐ Other
Distribution Type: ☐ 1φ ☐ 3φ Y ☐ 3φ Δ ☐ Fuses ☐ CBs ☐ Voltage(s) ___ V ☐ Amperage Rating ___ A ☐ Other
Problem Pattern:
  Day(s) of week: ☐ Continuous ☐ Random ☐ Monday ☐ Tuesday ☐ Wednesday ☐ Thursday ☐ Friday ☐ Saturday ☐ Sunday
  Time(s): ☐ Continuous ☐ Random ☐ Always Same Time ☐ Morning ☐ Afternoon ☐ Evening ☐ Night
Problem or Distribution History:
  Has problem been observed or reported before? ☐ No ☐ Yes
  Was any corrective action taken? ☐ No ☐ Yes
  Are other parts of system affected? ☐ No ☐ Yes
  Are nonlinear loads in area, or on same circuit? ☐ No ☐ Yes
  Has there been any recent work or changes made to system? ☐ No ☐ Yes
  Are large power loads being switched ON/OFF? ☐ No ☐ Yes
  Is main service panel properly grounded? ☐ No ☐ Yes
  Are any subpanels properly grounded? ☐ No ☐ Yes
  Has there been a recent lightning storm? ☐ No ☐ Yes
  Has there been a recent utility feeder outage? ☐ No ☐ Yes
Possible Problem(s): ☐ Conductors Undersized (Hot) ☐ Neutral Conductors Shared/Undersized ☐ High Number of Nonlinear Loads ☐ Voltage/Current Unbalance ☐ Harmonics ☐ System Undersized ☐ Improper Wiring/Grounding ☐ Other
Measurements Taken: Taken at Panel Located at ___; Voltage ___ V; Current ___ A; Power ___ W, ___ VA, ___ VAR; Power Factor ___ PF, ___ DPF; Voltage THD ___; Current THD ___; K Factor ___; Other ___
Waveform Shape:
  Voltage Waveform: ☐ Sinusoidal ☐ Non-Sinusoidal ☐ Flat-Topped ☐ Other
  Current Waveform: ☐ Sinusoidal ☐ Non-Sinusoidal ☐ Pulsed ☐ Other
Measurements Taken at Load (Over Time): Normal Voltage ___ V; Lowest Sag ___ V; Highest Swell ___ V, ___ % above normal; Number of Transients Recorded ___ over a time period of ___ at a level of ___
Possible Problem Solution(s): ☐ Oversize Neutrals ☐ Run Separate Neutrals ☐ Additional Transformer ☐ Harmonic Filter ☐ Change to K-Rated Transformer ☐ Add Subpanel ☐ Separate Loads ☐ Proper Wiring and Grounding ☐ Proper Fuses/CBs/Monitors ☐ Power Factor Correction Capacitors ☐ Change Transformer Size ☐ Other

Figure 9.24. Facility transformer and main service equipment problems—power quality troubleshooting checklist.
but to terminate employees and postpone overdue system upgrades. Monitoring your electric distribution system can warn you of impending utility failures and avert disaster. The more knowledge you have about your own system, the better you can service its needs. This becomes important when negotiating with your energy provider.
9.8
CONCLUSION
Power quality problems are common and stem from a variety of sources. Power monitoring can help determine the source and magnitude of these problems before damage is done to expensive equipment. Here is a prescription for finding the simplest, least-expensive solutions:

1. Interview the personnel who observe the operation, and take measurements. Use the data to decide what type of problem has occurred.
2. Identify a solution.
3. Implement the solution and fix the problem.
4. Observe and measure all data in the problem area to confirm that the solution is working.
5. Institute a procedure to prevent this type of problem in the future.

Power problems are not always easy to find, and a checklist makes troubleshooting much simpler. Figure 9.22 provides a checklist for load problems. A record of the occurrence can help pinpoint load problems by revealing the specific times that a problem occurs and what other indications or symptoms were coincident. Figure 9.23 can be used for troubleshooting building distribution problems, and Figure 9.24 for facility transformer and main service equipment problems. Understanding the various types of power quality issues that can be present in mission critical infrastructure, and having the necessary knowledge to remedy these issues, will help a facility manager maintain the highest levels of availability and reliability.
10 UPS SYSTEMS: APPLICATIONS AND MAINTENANCE WITH AN OVERVIEW OF GREEN TECHNOLOGIES
10.1
INTRODUCTION
For today's mission critical facility managers, critical power management is an essential tool for providing comprehensive power protection with optimal functionality and power usage effectiveness (PUE). With the wide range of choices available, selecting the right power protection architecture for your network can be a daunting task. In this chapter, we cover the basics of uninterruptible power supply (UPS) systems, theory of operation, descriptions of UPS power solutions/configurations, and UPS maintenance guidelines. The different UPS topologies, trends, and technological advancement of semiconductors utilized within UPS systems are also addressed.
10.1.1
Green Technologies and Reliability Overview
Computer systems and software applications are growing more important every day, and the applications that drive these requirements are expanding at an alarming rate. It is difficult to find a business that is not dependent on computers in its daily operations. The uninterrupted flow of data and information has become vital to the operation of every size business today, from bar-code-scanning cash registers to integrated manufacturing control systems to massive 24/7 banking and finance computer systems to social networking and the emerging power controls for smart grid systems.
The advent of the Information Age has added the task of providing backup power without allowing an interruption of even one-half cycle (about 8 milliseconds at 60 Hz) in order to avoid unplanned shutdowns of computer systems. Uninterruptible power supplies have become a necessity for powering large or small systems in facilities where personal safety and financial loss are at risk if power is interrupted. Today's power quality and reliability needs remain, but efficiency and energy consumption have become additional requirements that can no longer be ignored.

The development of the UPS industry is analogous to the quest for greater fuel economy in the automobile industry to reduce the world's dependence on oil, which was not widely seen as a concern until recently, when the associated longer-term risks, including threats to national security, were exposed. The recent Car Allowance Rebate System (CARS, or Cash for Clunkers) incentive and tax rebate program is evidence that this issue is being considered and that solutions are being developed. In relation to mission critical infrastructure and UPS system efficiencies, this is also being addressed by organizations and engineering consultants through Leadership in Energy and Environmental Design (LEED) certification, the EPA Energy Star Program for Data Centers, and The Green Grid's power usage metrics, such as PUE and Data Center Infrastructure Efficiency (DCiE), which will be discussed in Chapter 12.

Today's mission critical infrastructure, the failure of which exposes an enterprise to extreme liability, risk, and financial losses, includes communication technology operations such as e-commerce, e-mail, and the Internet, which were in their infancy little more than a decade ago.
When you combine critical communications infrastructure with the diverse IT networks, databases, applications, and hardware and software platforms that operate just-in-time inventory control, manufacturing scheduling, labor tracking, raw material ordering, financial management, and so on, the dependence on information flow grows even greater. Facilities management professionals must bridge various platforms to provide an integrated and seamless technology solution that supports the management of all devices and applications.

In addition, today's design considerations for mission critical data centers must also address the growing requirement to be "green." The increasing pressure to reduce the energy consumption of mission critical facilities has led to a popular new metric: power usage effectiveness (PUE). Reducing energy consumption and the cost of ownership are becoming the top priorities in the facility and IT manager's decision-making process. Facilities and IT management professionals now not only face the need for 24/7 availability, but they must also consider the associated capital and operating costs and the PUE of their data centers. These are very large, complex, and real problems that are faced every day.

Mission critical power solutions for enterprise management must be available for organizations and systems both large and small. When power flow ceases, so does the flow of information driving the enterprise. Power solutions must provide the optimum solution for a given environment, regardless of the size of the application. Enterprise-wide systems demand flexibility and reliability to maximize performance, energy efficiency, and system availability, and today they must also support the green initiatives of the organization. Minimizing the number of systems, subsystems, and components incorporated into a given network application simplifies the delivery of reliable power because there are
fewer points of failure. When it comes to managing these devices, facility professionals face enormous difficulties in keeping up with the complex structure and performance of their networks. The implementation of high-availability software applications at major data centers requires constant processing 24 hours a day, seven days a week. Compounding these difficulties are the redundancy required to maintain continuous operation and the need to evaluate total cost of ownership and return on investment in the associated mission critical infrastructure. Such continuous operation also means an absence of unscheduled or scheduled interruptions for any reason, including routine maintenance, periodic system-level testing, modifications, upgrades, or failures. Various UPS topologies are available today to satisfy these stringent reliability and maintainability requirements, and UPS manufacturers are striving to produce ever-higher-efficiency systems in an effort to become greener.

By deploying the proper power management solution, facility managers gain control of critical power resources while adding functionality and protection to multiple networks. Many power management products offer easy installation, scalable architecture, remote monitoring and control, and other convenient features that facilitate power monitoring and increase total system availability while decreasing operating and maintenance costs, in addition to reducing the risks to protected equipment.
10.2
PURPOSE OF UPS SYSTEMS
In the industrial era, our economic system developed a dependency on electric power. Electricity became necessary to provide lighting, energy to spin motors, and heat for fabrication processes. As our electrical dependency grew, processes were developed that required continuous, uninterrupted power. For example, car manufacturers use an automated step-by-step system for painting new cars. Every part must be coated precisely, and if any errors occur the process must be restarted. A power interruption to this process would cost the company large amounts of money in downtime and production delays. Consequently, manufacturers could not afford to lose power to critical motors, control systems, or process systems that would shut down production.

As the industrial era transitioned into the information era, creative electromechanical and electrochemical solutions to this need for uninterrupted power were developed. The simplest was attaching a backup battery system to the electrical load; however, this arrangement only worked for direct-current (DC) powered devices. As alternating current (AC) became the preferred form of electricity in our transmission and distribution systems, it became more difficult to provide standby power: the added task of remaining synchronized with the standard frequency of the utility grid (60 Hz in North America, 50 Hz in Europe and most of Asia) had to be accomplished.

The first general-purpose electronic computer, the ENIAC, was unveiled in 1946. It filled an entire room and had essentially no memory. Today, computers and data communications are critical to most aspects of our lives. These new computing systems can be no more reliable than the power from which they operate and the continuous cooling they require. Reliability and power usage are factors that must be accounted for within the critical infrastructure.
Utility power varies in reliability from utility to utility and from day to day, but it is generally assumed to have no better than 99.9% availability. Even within a given utility's service territory, there can be, and often are, wide variances in reliability. At 99.9% availability, the utility could have interruptions totaling 8 hours and 46 minutes over the course of a year. Utility power with emergency generator backup is no longer considered adequate for electronic loads that cannot handle even a momentary interruption in power. As a result, the UPS industry was created.

The double-conversion UPS was the first topology created, back in the 1960s. These systems have a rectifier that converts AC input power to DC power and an inverter that converts the DC power back to AC output power. A battery is connected to the DC link. When the input power fails, the input of the UPS shuts down but the output continues to operate as if nothing had happened, drawing power from the DC backup source with no change of state. Therefore, the UPS (excluding the DC backup) does not have a significant failure rate during an input power failure. If the power-conditioning or waveform-shaping equipment randomly fails while utility (or generator) power is available, the UPS will automatically and rapidly reverse-transfer the load directly to the utility bypass source. In the event of a UPS failure or overload, the load is saved, the failure is acknowledged and repaired, and the load is forward-transferred back onto normal UPS power. With full or true online double-conversion (rectifier-inverter or motor-generator) designs, power is completely conditioned to tight specifications, and the output voltage and frequency are completely isolated from the input. True online double-conversion UPS systems emerged as the preferred topology because they offer the highest reliability and quality of continuous conditioned power to the critical load.
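The 8-hour-46-minute figure follows directly from the arithmetic, as a quick sketch verifies:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

def annual_downtime_minutes(availability):
    """Expected yearly downtime, in minutes, for a given availability."""
    return HOURS_PER_YEAR * (1.0 - availability) * 60.0

minutes = annual_downtime_minutes(0.999)
hours, mins = divmod(round(minutes), 60)
print(f"99.9% availability -> about {hours} h {mins} min of downtime per year")
```

The same function shows why "another nine" matters: 99.99% availability cuts the expected downtime to under an hour per year.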
This continued for a long time, until users recently realized that there is a trade-off: the double-conversion UPS has a lower system efficiency than other UPS concepts (offline, standby, or line-interactive types of UPS systems). This is due to the switching losses of the semiconductor power devices in the conventional online double-conversion circuit and the associated filter-component losses. Every ampere in a double-conversion UPS is converted twice, which is inefficient but reliable. In addition, online UPS systems are generally more expensive than other UPS systems. In today's greening world, UPS manufacturers have realized that it is no longer enough to offer a highly technical, reliable, top-quality online UPS system that integrates seamlessly into the overall power infrastructure; it must also be compact and lightweight, and offer high operating efficiencies throughout the full load spectrum. This can only be achieved with the advancement of technology and is addressed in the following pages.

There are many UPS designs that can be commonly characterized as offline or line-interactive. One system is called delta conversion, and another, single conversion. All of these systems have input filtering just like the double-conversion UPS; the difference in efficiency is that they are not double-converting every ampere. Here is a brief description of these technologies:

• Offline UPS. This UPS allows utility power to flow through a static transfer switch (STS), the alternate side of which is connected to a rectifier and a battery.
When power fails or goes out of tolerance, the STS switches and the load goes on the battery. The S&C company makes an offline, medium-voltage UPS that can handle up to 20 MW and is made to go outdoors.
• Single conversion. This UPS allows current to flow through a choke, at full output voltage to the load. When the power fails, the system transfers to the already functioning rectifier and battery or flywheel. The Active Power company makes this type of UPS in sizes up to 1.2 MW in a single enclosure with the backup flywheels.
• Delta conversion. This UPS uses the laws of transformers to its advantage: current does not flow in the primary of a transformer unless current flows in the secondary, and the waveshape of the current in the primary reflects the waveshape of the current in the secondary. The delta-conversion UPS has a one-fifth-sized rectifier that circulates pure sine wave current in the secondary of a one-to-five turns ratio transformer, so if the rectifier pulses 5 amps in the secondary, 25 amps will flow in the primary. The primary is connected to the UPS output, where a full-size rectifier applies full output voltage constantly.

These UPS systems all have one common advantage: they are all over 97% efficient, compared to the double-conversion UPS efficiency of roughly 92%. They cost less to operate and require half of the air conditioning that the double-conversion UPS does. The power lost to inefficiency leaves the UPS as heat, and these designs emit less than half the heat of a double-conversion system, so they save money in more than one way. They also take advantage of today's high-speed communications networks and of the fact that today's insulated gate bipolar transistors (IGBTs) can be switched thousands of times a second.
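The turns-ratio arithmetic and the heat comparison above can both be checked with a short sketch (the 500 kW load is an illustrative value of my choosing; the 92% and 97% efficiencies are the figures from the text, and real products vary):

```python
TURNS_RATIO = 5  # the one-to-five transformer in the delta-conversion example

def primary_current(secondary_amps):
    """Current reflected into the primary of a 1:5 turns ratio transformer."""
    return secondary_amps * TURNS_RATIO

def heat_loss_kw(load_kw, efficiency):
    """Power dissipated as heat inside a UPS serving a given load."""
    return load_kw / efficiency - load_kw

print(primary_current(5))  # → 25 A in the primary, as stated in the text

# Heat rejected for a hypothetical 500 kW critical load:
double_conv = heat_loss_kw(500.0, 0.92)  # double conversion, about 43.5 kW
line_inter = heat_loss_kw(500.0, 0.97)   # higher-efficiency design, ~15.5 kW
print(f"heat ratio: {line_inter / double_conv:.0%}")  # well under half
```

The second comparison is where the air-conditioning savings come from: every kilowatt of UPS loss must also be removed by the cooling plant, so halving the heat roughly halves that portion of the cooling load as well.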
of many of the offline or line-interactive UPS systems must be drawn down to correct for source voltage and frequency variations that an online UPS can correct without drawing down the batteries. To minimize battery drawdown, or "hits" on the battery, these UPS designs typically allow a wider window of power imperfections to be passed on to the load. They are built to function within the CBEMA-ITIC curve and keep the computers up and running. Fortunately, the vast majority of modern IT equipment can tolerate a much wider range of voltage and frequency, and brief power outages; it is far more tolerant than older IT equipment. Many IT power supplies today are designed for application worldwide and can operate from roughly 90-250 V and 45-65 Hz. Therefore, power availability has become much more important than power quality.

Advantages of offline and line-interactive UPS designs include lower cost, higher efficiency, smaller physical size, and quieter normal operation. A significant disadvantage of these designs is that the offline or lightly loaded portion of the system must rapidly respond to a power failure or voltage fluctuation.

An emergency standby power system is off until it is manually or automatically started when the utility power fails. It may take as little as 7 seconds or as long as a few minutes to start a standby
generator system, depending on how many engines are in the system, whether they are diesel or turbine engines, the setting of time delays, and the size of the system. If three or more generators need to synchronize before enough capacity is online to accept load, this can add many seconds or more to the start-up time. A delay of this length can leave critical systems unable to function reliably if the UPS and DC backup are not properly designed and integrated into the system. A flywheel energy storage system is an option only where parallel generators are not used and a distributed generator topology is used instead. Equipment can be divided into four categories based on reliability requirements and the time the equipment can operate without electrical power:
1. Data processing equipment that requires uninterrupted power
2. Safety equipment, defined by codes, that requires power restoration within seconds
3. Essential equipment that can tolerate an interruption of minutes
4. Noncritical equipment that accepts prolonged utility interruptions
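As a rough illustration of what the efficiency figures quoted earlier (about 97% for offline and line-interactive designs versus about 92% for double conversion) mean in practice, the following Python sketch computes the loss that leaves a UPS module as heat. The 500 kW load is an assumed, illustrative figure, not one from the text:

```python
def ups_heat_kw(load_kw: float, efficiency: float) -> float:
    """Power drawn from the source minus power delivered to the load:
    the difference leaves the UPS module as heat."""
    return load_kw / efficiency - load_kw

# Assumed 500 kW critical load, efficiencies as quoted in the text:
heat_97 = ups_heat_kw(500, 0.97)   # ~15.5 kW of heat at 97% efficiency
heat_92 = ups_heat_kw(500, 0.92)   # ~43.5 kW of heat at 92% efficiency
# The more efficient design emits well under half the heat, which is
# what drives the air conditioning savings described above.
```

Every kilowatt of heat avoided also avoids the cooling energy needed to remove it, so the savings compound.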
10.3 GENERAL DESCRIPTION OF UPS SYSTEMS
10.3.1 What is a UPS System?
UPS stands for uninterruptible power supply. Critical loads are usually protected with a backup source of power: a standby generator, batteries, or some other alternate source of electrical energy. Today's critical loads are typically computers and other sensitive electronic equipment that will malfunction if their power supply is interrupted for more than a few milliseconds, so the mission critical industry needed a way to instantly transfer these sensitive systems onto an alternate source of power without dropping the critical load. When the backup power source can be seamlessly connected to the critical load, we have what is commonly referred to as a UPS system. UPS systems can regulate, filter, and otherwise condition the utility supply and seamlessly switch a critical load to backup power in the event of a subcycle power interruption. They also reduce or eliminate power quality problems caused by voltage sags, surges, or high-speed transients.
10.3.2 How Does a UPS System Work?
The UPS system requires AC input power, typically supplied by the local utility. Since the reliability and quality of commercial AC power no longer meet the needs of today's mission critical industry, a UPS is required to condition and filter this supply to eliminate transients and compensate for voltage sags and surges. If the normal source becomes unusable or is lost completely, the UPS will draw power from batteries (or another stored energy source) until a standby turbine or diesel generator can be brought online to provide an alternate input source of AC power for the UPS. UPS systems can be found in sizes that range from 50 VA all the way up to 20 MW. There are three main types of UPS systems: static, rotary, or some combination of both that forms a hybrid system. The output of a rotary UPS is produced through the use of rotating machinery, much as electricity is produced at large central power plants. Only rotating equipment can produce the pure sine wave that is a characteristic of alternating current (AC). Static UPS systems produce high-quality output power without the use of any rotating components, using solid-state devices to condition and/or convert unregulated input power (typically from a DC power source such as a battery) into dependable, stable AC output power. Full online double-conversion UPS designs convert the input AC to DC through the use of a rectifier. An inverter then changes the DC back into AC, which is fed to power distribution units. Batteries or flywheels are used to store DC energy. The required backup time is determined by the critical load requirements of the facility. For long-duration power outage protection, UPS systems typically incorporate enough battery backup to provide sufficient time to start a standby generator and get it ready to accept the critical load; the generator can then carry the load as long as a reliable source of fuel is available. An unlimited variety of hybrid UPS systems can be configured using any number of static and rotating devices to improve the power quality of any source of input power. Hybrid systems can also incorporate on-site generating systems such as wind turbines, photovoltaics, and fuel cells.
10.3.3 Static UPS Systems
Static UPS systems are electronically controlled, solid-state systems designed to provide a continuous source of conditioned, reliable electrical power to critical equipment. This power is provided without rotating power components. Double-conversion static systems electronically rectify AC utility power into DC power, which is then paralleled with a DC power storage device, typically a battery. The battery is kept fully charged during normal operation. The DC power is then inverted electronically back into AC power that supplies the critical load. In the event of a loss of utility power, energy from the DC energy storage system is instantaneously available to continue providing power to the critical load. The result is effectively an uninterrupted AC power supply. It is difficult to differentiate between all the providers of static UPS systems in terms of reliability because there are a tremendous number of variables. Most providers offer a high degree of sophistication, ranging from simple power conversion to sophisticated synchronization and filtering arrangements. More often, the reliability of a UPS system will be driven by the overall system design, operation, and maintenance than by the particular type of internal UPS technology. UPS reliability will typically follow the bathtub curve. Brand new units have a higher failure rate that rapidly decreases over time, then goes up again near the end of life. UPS systems greater than five years old will have increased risk of component failure, including batteries, capacitors, diodes, contactors, motor operators, and power
230
UPS SYSTEMS
switching devices, including transistors and thyristors/silicon-controlled rectifiers (SCRs). If the UPS is not properly maintained, the user should not expect the system to be reliable, just as if an automobile were not properly maintained: you cannot expect continuous availability and operation without proper preventative maintenance. Static UPSs are complicated, highly technical pieces of equipment. With proper maintenance and change-out of components, larger UPS systems can reliably last 15 years or more. Routine maintenance service and technical knowledge about the device are important factors in reliable service of the equipment. Not all power protection devices are created equal, so understanding their capabilities and limitations is very important. The next sections explain the leading UPS technologies and the level of protection they offer. There are several types of static UPS designs on the market, including online double conversion, online delta conversion, offline/standby, and line-interactive UPS designs.
10.3.4 Online
Most engineers believe that online UPS systems provide the highest level of power protection and are the ideal choice for shielding your organization's most important computing and processing installations. An online UPS provides a full range of power conditioning. The internal components of an online UPS function all of the time, continuously powering the load and providing both conditioned electrical power and protection against line disturbances and outages. Online UPS designs offer complete protection and isolation from all types of power quality problems: power surges, high-voltage spikes, switching transients, power sags, electrical-line noise, frequency variation, brownouts, and short-duration blackouts. In addition, they provide consistent computer-grade power within tight specifications not possible with offline systems. For these reasons, they have traditionally been used for mission critical applications that demand high-productivity data processing and high systems availability.
10.3.5 Double Conversion
As previously mentioned, this technology uses the combination of a double-conversion (AC-to-DC and DC-to-AC) power circuit, which continuously powers the load, to provide both conditioned electrical power and outage protection. With the rectifier/inverter (double-conversion) UPS, input power is first converted to DC. Normal supply of power feeds the rectifier and the AC load is continually supplied by the output of the inverter. The DC power is used to charge the batteries and to continuously operate the inverter under any load conditions required. The inverter then converts the DC power back to AC power. There is an instantaneous static bypass switch path should any components fail in the rectifier/inverter power path (the static bypass switch is part of the UPS system, which is either internal or in a separate cabinet, depending on system size). Figure 10.1 shows the basic schematic of a double-conversion online UPS system.
Figure 10.1. Double conversion in normal operation. (Courtesy of Emerson Network Power.)
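The normal, battery, and bypass modes described in this and the following paragraphs can be summarized in a simple source-selection sketch. This is a hypothetical simplification in Python; real transfer logic also handles synchronization, walk-in, and retransfer:

```python
def double_conversion_path(utility_ok: bool,
                           rectifier_inverter_ok: bool,
                           battery_available: bool) -> str:
    """Return which path feeds the critical load (simplified model)."""
    if not rectifier_inverter_ok:
        # Rectifier/inverter fault or overload: static bypass carries the load.
        return "static bypass (unconditioned utility)"
    if utility_ok:
        # Normal operation: AC -> rectifier -> DC -> inverter -> load.
        return "rectifier/inverter"
    if battery_available:
        # Utility outage: the battery feeds the inverter until a generator starts.
        return "battery/inverter"
    return "load lost"
```

The priority order in the function mirrors the text: the conditioned inverter path is preferred, the battery covers source outages, and the static bypass covers internal faults.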
10.3.6 Double-Conversion UPS Power Path
Normally, AC power is rectified to DC, with a portion used to trickle charge the battery and the rest converted back to AC and supplied to the critical load (Figure 10.1). When the utility power fails, the battery takes over supplying power to the critical load (Figure 10.2). Since the battery stores DC energy, the power is converted to AC by the inverter and then supplied to the critical load. This supply lasts only a limited amount of time (depending on the size of the battery plant), typically 5-15 minutes. Much longer (multiple-hour) runtimes are possible but not usually realistic due to the cost and space requirements for batteries. In the event of an overload or random failure in the rectifier/inverter power path, utility power is typically available at the bypass input. In this case, the UPS system's static bypass switch will instantly reverse transfer the critical load to the bypass utility source.
Figure 10.2. Double conversion on battery. (Courtesy of Emerson Network Power.)
Figure 10.3 shows input power fed directly through the static bypass switch to the critical load. From time to time, the UPS system will need to be taken offline for maintenance (bypassed). In addition to the static (electronic) bypass, UPS systems may have an internal mechanical bypass (as shown in Figure 10.3), usually a contactor or motorized switch/circuit breaker, or an external bypass power path. Large UPS systems typically include an automatic static bypass plus manual internal and external bypass paths. Because unconditioned utility power may be undesirable even under maintenance conditions, some users will switch to a standby generator or a redundant UPS system as the bypass source rather than using the utility source for this purpose. Figure 10.4 shows how the bypass power source may feed unconditioned power to the critical load.
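The 5-15 minute battery runtimes mentioned above can be estimated from the usable battery energy and the critical load. The sketch below assumes a 95% inverter efficiency (an illustrative figure) and ignores rate-dependent capacity effects such as the Peukert effect in lead-acid batteries:

```python
def battery_minutes(usable_kwh: float, load_kw: float,
                    inverter_efficiency: float = 0.95) -> float:
    """Estimate battery ride-through time for a given critical load.

    Simplified: real battery capacity falls at high discharge rates
    (Peukert effect), and the 95% inverter efficiency is an assumption.
    """
    dc_kw = load_kw / inverter_efficiency   # power drawn from the DC bus
    return 60.0 * usable_kwh / dc_kw

# e.g., 100 kWh of usable battery at a 400 kW load gives 14.25 minutes,
# comfortably within the typical window needed to start a generator.
```

Doubling the runtime requires doubling the battery plant, which is why multiple-hour runtimes are rarely economical.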
10.4 COMPONENTS OF A STATIC UPS SYSTEM
10.4.1 Power Control Devices
Silicon-Controlled Rectifier (SCR). A silicon-controlled rectifier (SCR), sometimes referred to as a thyristor, is a solid-state electronic switching device that can be turned on by a relatively small current signal applied to its gate that causes it to conduct. Large SCR devices are capable of handling thousands of amperes of current. The switch can be operated at a speed much faster than a mechanical device, into the millisecond range. The SCR is composed of silicon-based semiconducting materials. In their deenergized state, the materials are excellent insulators and will not allow current to pass through. Yet, by applying a slight current via a gate signal, the SCR is instantaneously energized and becomes an excellent conductor of electricity. The SCR remains an excellent conductor until its gate signal is deenergized and the current flow is stopped or
Figure 10.3. Double conversion, internal bypass. (Courtesy of Emerson Network Power.)
Figure 10.4. Double conversion, bypass. (Courtesy of Emerson Network Power.)
reversed. Removal of the SCR gate signal by itself will not turn off an SCR. The ability to precisely time the turn-off of an SCR was one of the major challenges for UPS inverter designers for decades. Methods to rapidly turn off inverter SCRs include auxiliary commutation using stored energy in a bank of capacitors. To turn off the SCRs, the stored energy is released to back feed into the SCRs and reverse the current flow enough to reach a zero crossing. The result of forced or auxiliary commutation was quite often blown fuses. Modern high-power transistors (see IGBT below) have largely made SCRs obsolete for inverter designs. SCRs are still typically used in the rectifier or battery-charger portion of larger capacity UPS systems as AC power control switches because of the natural commutation provided by the input voltage waveform. Because the AC power naturally reverses itself 120 times per second, SCRs are an efficient, cost-effective device for converting AC power into DC power; however, this brings further challenges to engineers when considering integration into the overall mission critical power infrastructure, specifically harmonic and generator compatibility.

Insulated Gate Bipolar Transistors (IGBT). There are other devices that function similarly to the SCR, such as the insulated gate bipolar transistor (IGBT). A key feature of the IGBT is that it rapidly turns off with the removal of its gate signal, making it easier to control. This is why IGBTs are better suited for switching the DC power from the rectifier output or a storage battery and are typically utilized in the inverter sections of newer UPS designs. Using IGBTs in UPS rectifiers can eliminate the majority of the input voltage and current harmonic problems normally associated with silicon-controlled rectifiers (SCRs).
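The latching behavior that distinguishes an SCR from an IGBT can be captured in a few lines. This is an idealized model of the behavior described above (conduction, once started, ends only at a current zero crossing), not a device-physics simulation:

```python
def scr_conducts(gate_fired: bool, was_conducting: bool, current_a: float) -> bool:
    """Idealized SCR latching: a gate pulse turns the device on, but removing
    the gate does NOT turn it off; conduction stops only when the current
    reaches zero (natural commutation)."""
    if current_a <= 0.0:
        return False                       # zero crossing: device commutates off
    return was_conducting or gate_fired    # latched on, or freshly triggered
```

An IGBT, by contrast, would simply return `gate_fired` whenever it could conduct, which is exactly why it is easier to control in an inverter.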
With a pulse-width-modulated (PWM) controlled IGBT UPS rectifier, harmonics can be reduced to between 3 and 7% without custom input filters and their attendant leading power factor concerns, regardless of load from 50 to 100%. However, IGBT-based rectifiers have typically been less efficient, due to device switching losses, and more expensive. In summary, this explains why many large-capacity UPS systems available today do not have full IGBT rectifier/converter sections: they cost more and have higher
switching losses that affect overall efficiency. The natural commutation of the AC input sinusoidal voltage waveform allows automatic turn-off of SCRs and diodes due to the inherent forward conduction characteristics of these devices. IGBT switching loss and filter size and weight are, therefore, not new issues; however, with the need for more compact, lightweight, and higher efficiency online UPS systems, UPS manufacturers are always evaluating new design topologies to satisfy the market. Full-IGBT UPS systems that are transformerless (silicon-rich but with less copper), together with the evolution of power converter stages from two-level to three-level topologies, mean that cost-effective and highly efficient solutions are becoming available.

Rectifier/Battery Charger. In double-conversion UPS designs, the rectifier/charger converts alternating current supplied from a utility or generator into direct current at a voltage that is compatible for paralleling with battery strings. When the UPS is supplied by normal input power, the rectifier provides direct current to power the inverter, the batteries are trickle charged, and control logic power is provided. The rectifier/charger is sized to provide the full UPS load plus the battery trickle charge load continuously, as well as short-term battery recharge. UPS designs including rotary, delta conversion, line-interactive, and offline use a smaller rectifier/charger sized only to recharge the battery, flywheel, or other energy storage device. Most major manufacturers of UPSs provide separate power supplies with dual inputs for internal control and power redundancy. These can take power from either the incoming AC source or the protected inverter output. A few UPS designs have a third input and a DC-to-DC converter to take power from the DC bus. The rectifier splits the AC waveform into its positive and negative elements.
It does this by using IGBTs, SCRs, or a diode bridge to switch the alternating positive and negative cycles of the AC waveform into a continuous positive and negative DC output. Typically, the UPS capacity impacts the chosen rectifier design, which in turn determines harmonic content and input filter requirements. The rectifier topologies available are:
1. Full IGBT
2. Six-pulse SCR
3. Twelve-pulse SCR
4. Diode bridge and IGBT bidirectional chopper
All but the largest three-phase UPS rectifiers are typically available with IGBT or six-pulse SCR rectifiers. UPS designs at 400 kW and higher may be available with IGBT or twelve-pulse SCR rectifiers. IGBT rectifier designs have considerable advantages: lower input harmonics, and an active front end that controls the input power factor to near unity. Twelve-pulse SCR rectifiers offer reduced input harmonics with less filtering required. Twelve-pulse rectifiers, where available, require more space than IGBT or six-pulse rectifiers because an input isolation transformer must be added to provide the 30° phase shift for the additional six-pulse bridge circuit.
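The harmonic benefit of the twelve-pulse arrangement follows from the standard characteristic-harmonic relation h = kp ± 1 for an ideal p-pulse rectifier, which can be sketched as:

```python
def characteristic_harmonics(pulse_number: int, k_max: int = 4) -> list:
    """Characteristic harmonic orders h = k*p +/- 1 of an ideal p-pulse rectifier."""
    orders = []
    for k in range(1, k_max + 1):
        orders += [k * pulse_number - 1, k * pulse_number + 1]
    return orders

six_pulse = characteristic_harmonics(6)      # [5, 7, 11, 13, 17, 19, 23, 25]
twelve_pulse = characteristic_harmonics(12)  # [11, 13, 23, 25, 35, 37, 47, 49]
# The 30-degree phase shift cancels the 5th/7th (and 17th/19th) pairs,
# which is why a twelve-pulse rectifier needs less input filtering.
```

Real rectifiers also produce small noncharacteristic harmonics from imbalance and firing-angle error, so measured spectra will not be quite this clean.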
Most UPS manufacturers smooth out the bumps of the AC half-cycles by using a DC output filter, which also prolongs battery life. This filter is made of inductors and capacitors: what is referred to as an LC (inductive/capacitive) filter. Most manufacturers use LC filters on the AC input and output for AC harmonic filtering. Alternate DC bus filtering uses DC capacitors and a DC choke; the choke and DC capacitors serve the same purpose, although this arrangement is not normally called a DC output filter. The topology of the power-conversion bridge circuit impacts the size and design of filter components. Constant losses associated with filter reactor/inductive elements add significantly to overall system losses and, therefore, directly affect overall efficiency. Rectifier/battery chargers are often designed to limit their output to 125% or less of the UPS output continuous rating. Figure 10.5 is a typical rectifier schematic utilizing SCRs as the power control devices. No matter what load is placed on the UPS, the rectifier limits how much continuous power the system will produce; the rectifier capacity plus the battery limits how much power can be produced by the inverter. The inverter is typically sized to carry 100% of the UPS rating continuously, 125% for 10 minutes, 150% for 30 seconds, and 300% (with voltage degradation) for less than one cycle. Rotary UPS designs can typically handle short-term overloads better than static designs. Current limiting is an important feature when considering the amount of instantaneous load that could be placed on a UPS system if the output suddenly experienced an electrical fault. Current limiting also prevents overloading of the upstream power distribution system. Input current limits restrict the amount of power that the UPS will consume from the input power distribution system.
There is a separate circuit for limiting the output current of the UPS, and it will tolerate massive short-term overloads for fault clearing and PDU transformer inrush. Best practices include always switching UPS systems to bypass prior to energizing PDUs and other transformers. Since input current is limited, an overload is supplied by the rectifier plus the battery, up to 300% of rated load for a short period. The bypass static switch is typically rated up to 1000% of rated load for 40 milliseconds before transferring the load to mechanical bypass with a contactor or circuit breaker.
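The typical overload ratings quoted above can be expressed as a simple lookup. The figures are the ones cited in the text; actual ratings vary by manufacturer and model:

```python
def inverter_overload_limit_seconds(load_pct: float) -> float:
    """Maximum time the inverter can carry a given load, using the
    typical ratings cited in the text (actual values vary by model)."""
    if load_pct <= 100:
        return float("inf")   # continuous rating
    if load_pct <= 125:
        return 600.0          # 10 minutes
    if load_pct <= 150:
        return 30.0           # 30 seconds
    if load_pct <= 300:
        return 1.0 / 60.0     # under one 60 Hz cycle, with voltage degradation
    return 0.0                # beyond inverter capability: static bypass takes over
```

Anything beyond the last tier is exactly the case the 1000%/40 ms static bypass rating exists to cover.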
Figure 10.5. Typical three-phase rectifier schematic utilizing SCRs as the power control devices.
Another typical design feature is to supply no more than 25% of the available current to the battery as charging current, which ensures that battery recharge will not overload the AC distribution system following a battery discharge. This recharge voltage is also adjustable to match the utility service capability. The recharge time for a battery is an important factor when there are multiple battery discharges. A rule of thumb is to allow 10 times the discharge time to recharge the battery or spin a flywheel back up to speed. Therefore, a 15-minute discharge would require about 150 minutes of recharge time to restore the battery to 95% of capacity. The remaining 5% of charge can take an additional 24-48 hours.

Input Filter. The process of chopping an AC wave into its positive and negative components is much like throwing a stone into a pool of water. The initial wave created by the stone will encounter the walls of the pool and reflect back to the center; the reflected waves combine with and distort the initial wave. In the electronic world, the distortion is referred to as harmonic distortion. In this case, the concern is that the UPS current draw is distorted, and the result is a reflection of that distortion in the source voltage. The amount of reflected harmonics in the voltage source is a function of source impedance. A utility source typically has much lower impedance than a standby generator; therefore, voltage harmonics from UPS rectifiers are typically higher when measured with the UPS fed from a generator than from a utility. The voltage harmonics on the source are not typically a concern for the offending UPS system or the UPS load, but they can be a concern for the source and/or other loads fed from the source. Several UPS systems and/or variable speed drives on the same source can be a concern for additive, combined, or resonating harmonics.
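The 10x recharge rule of thumb given earlier can be sketched as follows; the 24-48 hour figure for the final 5% is the one quoted in the text:

```python
def recharge_estimate_minutes(discharge_minutes: float) -> float:
    """Rule of thumb from the text: allow roughly 10x the discharge time
    to restore a battery (or respin a flywheel) to ~95% capacity.
    The final 5% takes an additional 24-48 hours of float charging."""
    return 10.0 * discharge_minutes

# A 15 minute discharge implies roughly 150 minutes to reach ~95% capacity.
minutes_to_95_pct = recharge_estimate_minutes(15)
```

The practical consequence is that back-to-back outages within the recharge window leave the battery plant only partially restored.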
To minimize the switching effect a UPS rectifier has on the building power distribution system, a large three-phase UPS is often specified with maximum permissible input harmonic distortion levels. If the rectifier design cannot provide harmonic levels within those specified, input harmonic filters are required. The input filter is designed to trap most of the unwanted harmonics through a tuned inductive-capacitive (LC) circuit. In brief, the inductor or coil impedes most of the frequencies above a designed value from being fed back into the upstream distribution system, while the capacitive element shunts the frequencies below that value. Typical UPS specifications require input current with total harmonic distortion (THD) held to 10% or less at full load. At lighter loads, and especially for redundant UPS applications in which the load is often in the 20-40% range, the THD becomes much higher. However, since the higher THD is a percentage of the actual reduced load, the amount of harmonics at light loads is typically of less concern than at higher loads. Input UPS passive harmonic filters can pose a risk to the upstream electrical system. The concern is the leading power factor from the filter capacitors when the UPS is at light load or initiating its walk-in function during transition to generator feed. Risks include nuisance tripping of circuit breakers and unstable operation of standby generators due to resonance or leading power factor. When a UPS system that includes a large bank of input filter capacitors is energized, the capacitors draw a large current to charge up. The resulting instantaneous high current can nuisance trip a circuit breaker.
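The tuned LC trap mentioned above resonates at f = 1/(2π√(LC)). The component values below are assumed, illustrative numbers for a trap near the 5th harmonic of 60 Hz (300 Hz), not figures from the text:

```python
import math

def lc_resonant_hz(inductance_h: float, capacitance_f: float) -> float:
    """Resonant (trap) frequency of a series LC harmonic filter."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

# Assumed component values: 2.8 mH and 100 uF resonate at about 301 Hz,
# close to the 5th harmonic of a 60 Hz system.
f_trap = lc_resonant_hz(2.8e-3, 100e-6)
```

The aging effect noted later in this section matters here: as filter capacitors drift in value, the trap detunes away from the harmonic it was designed to absorb.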
The leading power factor (current peak ahead of voltage peak in time) can cause instability in generator voltage regulators. The regulators are typically designed for loads with power factor at unity or lagging (current peak follows voltage peak in time). That said, generators can operate under leading power factor conditions; however, it is important to study the specific generator reactive capability curve to determine whether generator operation will become unstable, and therefore less reliable, near its leading power factor load limit. Resonance between the mechanical characteristics of a generator and the electrical characteristics of input filter circuits also needs to be considered to prevent unstable generator operation. Proactive studies can usually be undertaken by UPS manufacturers to ensure that any generator compatibility problems are highlighted and addressed prior to on-site testing and commissioning. Designers, specifiers, and purchasers of UPS systems must be careful not to overdesign or overspecify UPS input harmonic current reduction. Trying too hard to solve a harmonics problem that may not exist can cause leading power factor problems. UPS manufacturers in recent years have recognized the problem of excessive input harmonic filtering and are trending toward IGBT rectifier designs, which actively cancel most harmonics and manage the input power factor to near unity over the full load range. The SCR rectifier has been a primary cause of harmonic and power factor problems. In recent years, this has become more acute with the increase in redundant UPS applications, with UPS systems typically operating well under 50% of capacity. A typical midsize UPS with a six-pulse rectifier and a harmonic filter sized for 7% THD at full load has an input power factor at unity at around 40% load.
Below 40% load, the input power factor becomes increasingly leading and, if not cancelled out by lagging power factor loads such as mechanical (HVAC) loads that are always operating anytime the UPS is operating, the overall leading power factor can cause problems. One method of dealing with this is to add an automatic contactor that takes the capacitive element of the UPS input filter out of the circuit at loads below 40%. The contactors, with their differential relays and set-point adjustments, have also become a reliability problem in many cases. Additionally, the cause and effect of negative-sequence current on generator life are often neglected. Instead of removing the capacitive element with an automatic contactor, another method is to incorporate a system shunt reactor that offsets the leading power factor caused by the input filter capacitors. When considering parallel UPS systems with multiple UPS modules, and data centers utilizing large UPS quantities, energizing one large reactor rather than dropping out multiple input-filter capacitive elements with multiple contactors becomes advantageous. Over time, the capacitors in the input harmonic filter can age and alter the filter's value. The input distortion should be checked annually to assure that the filter is functioning within specification and according to benchmark recordings from startup commissioning and annual maintenance testing.

Inverter. Inverters utilizing a DC source can be configured to generate 6- or 12-step waveforms, pulse-width-modulated waveforms, or a combination of both, to synthesize the beginnings of an AC output. The inverter converts DC into AC utilizing IGBTs or an equivalent technology. By pulsing on the DC current in a timed manner, the resulting square wave can be filtered to resemble a sinusoidal alternating current waveform. Consideration of inverter section control methods is also of interest. Utilizing the IGBT device is important to benefit from the switching advantages that the device brings, but it is how the device is controlled that is key to obtaining optimum performance from UPS systems. The inverter must produce a sine wave at a frequency that is matched to the utility bypass source while regulating the output voltage to within 1%. The inverter must accommodate load capacities ranging from 0% to 100% of full load or more and have the response to handle instantaneous load steps with minimal voltage deviation to the critical load.

Output Filter. A voltage square wave is the sum of all odd voltage harmonics added together. By generating an optimized square wave at a frequency above 60 Hz for any given load condition, lower-order harmonics are minimized. Pulse-width-modulated (PWM) inverters are particularly good at this function. An output filter is used to remove the remaining odd-ordered harmonics; the result is a sine wave, which is then used to power the critical load. In other words, the inverter generates a square wave, and by filtering the inverter output, a clean sinusoidal alternating voltage is produced. A typical specification for an output filter is less than 5% THD, with no single harmonic greater than 3%.

Static Bypass. In the case of an internal rectifier or inverter fault, or should the inverter overload capacity be exceeded, the static transfer switch will instantaneously transfer the load from the inverter to the bypass source with no interruption in power to the critical AC load.

Efficiency. Double-conversion UPS systems have a typical efficiency of 92 to 94% at full load, and this typically falls off to about 85% at 25% load. Due to this loss of efficiency, manufacturers of double-conversion systems have developed what is known as eco-mode.
In eco-mode, the UPS runs in bypass, allowing pure utility power to flow to the load, then, when it senses a power anomaly, it switches to normal operation.
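The output filter's task can be quantified from the Fourier series of a square wave, which (as noted above) contains only odd harmonics, with amplitude proportional to 1/n. The sketch below computes the theoretical THD the filter must remove:

```python
import math

def square_wave_thd(max_harmonic: int = 49) -> float:
    """THD (in percent) of an ideal square wave, summing the odd harmonics
    3..max_harmonic, each with amplitude 1/n relative to the fundamental.
    This is the distortion the output filter must remove to leave a sine."""
    distortion = sum((1.0 / n) ** 2 for n in range(3, max_harmonic + 1, 2))
    return 100.0 * math.sqrt(distortion)

# ~47.3% counting harmonics up to the 49th; the full infinite series
# converges to sqrt(pi^2/8 - 1), about 48.3%.
print(f"{square_wave_thd():.1f}% THD before filtering")
```

Against this, the typical output specification quoted above (under 5% THD, no single harmonic above 3%) shows how much attenuation the LC output stage provides.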
10.5 ONLINE LINE INTERACTIVE UPS SYSTEMS
Delta conversion online UPSs are a variation of the line-interactive UPS and the online double-conversion UPS. The delta conversion design consists of a delta converter and a main inverter (Figure 10.6). The delta converter functions as a variable current source; it controls the input power by regulating the DC bus voltage. The main converter is a fixed voltage source; it regulates the voltage to the load. The delta converter controls the UPS input impedance, much as a conventional rectifier in a double-conversion UPS controls power across the DC bus and load. Through PWM, the main converter recreates the AC voltage, providing constant output voltage.
Figure 10.6. Conventional delta conversion online UPS. (Courtesy of APC.)
The delta conversion online feature consists of a connection between the two converters (from AC to AC). This connection is referred to as the pure power path, a term that reflects the fact that the delta converter controls both the waveshape of the current and the amount of current that flows; the pure power path is the path of pure sine wave current to the load. The delta and main converters can change AC power to DC power and, simultaneously, DC power to AC power. The delta converter has the primary function of regulating the input current and input power by acting as a variable current source in the secondary circuit of the transformer; that is, it behaves as a load without power dissipation, except for its switching losses. Working together, the two converters form a very good power control system. The main inverter is a voltage-controlled IGBT PWM inverter. Its primary function is to regulate the voltage at the power balance point (PBP). The delta converter has a unique approach to loss compensation: flyback diodes simultaneously rectify the excess power from the PBP and feed it to the DC bus to account for all the system losses. The power-balance control system in a delta conversion UPS uses digital signal processing (DSP), which gives a very high degree of accuracy.
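Because the delta converter works through the 1:5 transformer described earlier in this chapter, it needs to process only about one-fifth of the load power. A sketch of that sizing follows; it is an idealized figure that neglects losses and reactive power handling:

```python
def delta_converter_kw(load_kw: float, turns_ratio: float = 5.0) -> float:
    """Idealized delta converter sizing: with a 1:5 transformer, secondary
    current is reflected 5x into the primary, so the converter processes
    only about load/turns_ratio of the power (losses neglected)."""
    return load_kw / turns_ratio

# For an assumed 500 kW load, the delta converter handles about 100 kW,
# consistent with the "one-fifth sized rectifier" described earlier.
```

Processing only a fraction of the throughput power is the main source of the high efficiency claimed for this topology.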
10.6 OFFLINE (STANDBY)
Offline or standby UPSs consist of a basic battery/power conversion circuit and an internal power transfer switch that senses irregularities in the electric utility. The computer is usually connected directly to the utility, through the sensing power switch that serves as the primary power source. Power protection is available only when line voltage dips to the point of creating an outage. A standby power supply is sometimes termed an offline UPS since normal utility power is used to power the equipment until
a disturbance is detected and the power switch transfers the load to the battery-backed inverter. The transfer time is typically from a few milliseconds to a few seconds, depending on the required performance of the connected load. Some offline UPSs include surge suppression circuits, and some offer optional built-in power line conditioners to increase the level of protection.

In the case of power surges, an offline UPS passes the surge voltage through to the protected load until it reaches a predetermined level, around 115% of the input voltage; at that limit, the unit transfers to battery. For high-voltage spikes and switching transients, offline UPSs give reasonably good coverage but not the total isolation needed for complete input protection. For power sags, electrical line noise, and brownouts, offline UPSs protect only while the battery is delivering power to the protected system. A similar limitation exists for frequency variation: an offline UPS protects only when the inverter is operating on battery. If the input frequency varies outside the device's range, the unit is forced onto battery to regulate the output to the computer loads. Under very unstable utility conditions, this may drain the battery, leaving it unavailable during a blackout or zero-voltage condition.

Standby UPSs are cost-effective solutions, particularly suitable for single-user PCs and peripheral equipment that require basic power protection. Small, critical electric loads, like personal computers, may be able to use standby power supply (SPS) systems. These systems have the same components as online UPS systems but supply power only when the normal source is lost. The SPS senses the loss and switches the load to the battery inverter supply within one cycle (about 16 ms). Some computer equipment may not tolerate a 16 ms transfer time (in most cases, 4 to 8 ms is required), potentially resulting in data loss.
In addition, SPS systems do not correct other power quality problems. They are, however, less expensive than online UPS systems.
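The transfer-time comparison above can be expressed as a simple screening check. The sketch below is illustrative only (it is not from the original text); the function name is an invention, and the only figures taken from the text are the one-cycle (about 16 ms at 60 Hz) SPS transfer time and the 4 to 8 ms tolerance of some computer equipment.

```python
# Screen an offline/standby UPS transfer time against a load's
# ride-through (hold-up) tolerance. Illustrative sketch only.

def transfer_is_acceptable(transfer_ms: float, load_holdup_ms: float) -> bool:
    """True if the UPS switches to battery before the load's
    hold-up energy is exhausted."""
    return transfer_ms <= load_holdup_ms

ONE_CYCLE_60HZ_MS = 1000 / 60   # one 60 Hz cycle, about 16.7 ms

# A one-cycle SPS vs. a PC power supply needing 8 ms or better:
print(transfer_is_acceptable(ONE_CYCLE_60HZ_MS, 8))   # False - risk of data loss
print(transfer_is_acceptable(4, 8))                    # True - a 4 ms transfer is fine
```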
10.7 THE EVOLUTION OF STATIC UPS TECHNOLOGY

10.7.1 Emergence of the IGBT
Trends in semiconductor power devices, together with the need for higher switching frequencies in UPS systems and other power electronic applications, have driven the emergence of the insulated gate bipolar transistor (IGBT) as an advanced power device offering superior UPS performance and reliability. The benefits of utilizing IGBTs in the rectifier and inverter sections of UPS systems have already been addressed; in addition, the evolution to full IGBT designs, and the ability to control and switch the devices at higher frequencies, has made it possible to meet today's energy-efficiency ("green") requirements. Full IGBT UPS systems that are transformerless, with switching losses reduced through advances in converter topology, have led to highly capable, reliable, and high-quality online UPS systems. These IGBT systems integrate seamlessly into the overall power infrastructure while being compact and lightweight and offering high operating efficiency across the full load spectrum. Transformers do provide benefits, however. When transformerless UPS systems are being considered, external transformer(s) may be needed for such purposes as input/output
galvanic isolation, independent AC input and bypass sources, DC link isolation, voltage changes, and local neutral and ground point establishment. The system designer will need to take such issues into consideration.
10.7.2 Two- and Three-Level Rectifier/Inverter Topology

With PWM, the rectifier/inverter semiconductor power devices act as an array of switches that are turned on and off thousands of times during each half-cycle of voltage. Considering the inverter section of a UPS system, the output voltage is therefore controlled by varying the width of the pulses and the duration for which each device conducts. The control system utilizes a DSP that handles voltage regulation; the switching pattern and commutation of the IGBT devices thus synthesize the inverter output voltage waveform. A conventional two-level inverter circuit configuration and its variable-width pulse train are shown in Figure 10.7.

The pulse train represents the reconstructed inverter and system output voltage and, as discussed previously, large output filter circuits are required to suppress high-frequency components and obtain a smooth sinusoidal output voltage waveform (i.e., low harmonic distortion) from it. Passive components such as filters are a substantial contributor to the weight, cost, and losses of power converters. The low-voltage power-conversion market (up to the 600 V class), including UPS systems, is served almost exclusively by the conventional two-level DC voltage link topology. With each new IGBT generation, switching and conduction losses have improved, but in the two-level topology (Figure 10.7) the switching losses at lower load levels and the constant losses of the larger filter elements limit how far system efficiency can be improved, as described previously.

The low- and medium-voltage power conversion market offers some diversity that includes three-level topology structures. A three-level inverter topology and its variable-width pulse train are shown in Figure 10.8. The output pulse train of a three-level circuit can be seen to track the sinusoidal waveform much better than that of a conventional two-level topology.
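A crude numerical illustration of why more output levels track a sine wave more closely is sketched below. This is not from the original text: it quantizes the waveform to the nearest available level rather than modulating pulse widths, so it is only a proxy for pre-filter distortion, but the intuition (three levels produce a smaller error than two) carries over.

```python
import math

def rms_error(levels, n=1000):
    """RMS error between a unit sine wave and its nearest-level
    approximation; a rough proxy for pre-filter waveform distortion."""
    err = 0.0
    for k in range(n):
        s = math.sin(2 * math.pi * k / n)
        nearest = min(levels, key=lambda v: abs(v - s))
        err += (s - nearest) ** 2
    return math.sqrt(err / n)

two_level = rms_error([-1.0, 1.0])          # inverter can output only +/-Vdc
three_level = rms_error([-1.0, 0.0, 1.0])   # adds a zero (neutral-point) level
print(two_level > three_level)              # True: three levels track the sine better
```

A smaller pre-filter error is what permits the smaller filter circuit described above.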
Figure 10.7. Conventional two-level inverter topology and variable width pulse train. (Courtesy of Mitsubishi Electric Power Products.)
Figure 10.8. Three-level inverter topology and variable width pulse train. (Courtesy of Mitsubishi Electric Power Products.)

This improvement in waveform synthesis means that significant noise reduction can be achieved and that a smaller filter circuit can be used to obtain the clean sinusoidal output voltage waveform required of a UPS system. In addition, even though three-level circuits use more IGBT devices, the associated switching losses are lower, due to the reduced voltage stress across each device. These combined factors make three-level topology increasingly attractive for applications in the lower voltage power-conversion market that require highly efficient solutions. Some UPS manufacturers are already adopting three-level topology, realizing that a transformerless design concept combined with three-level topology reduces filter reactor losses and overall semiconductor losses. This not only provides the customer with significant UPS system operating savings, but also provides savings associated with reduced cooling needs. (Higher efficiency means less heat emission from the UPS modules; thus, less cooling equipment and energy is required.)
10.8 ROTARY UPS SYSTEMS
Rotary UPS systems can be substituted for any static UPS technology; however, even though they require fewer parts, they will be much larger, heavier, and less efficient. The rotary power converter uses an electromechanical motor/generator combination in lieu of the rectifier/inverter used in static systems. The simplest rotary system is nothing more than an electric motor driving a generator, commonly called an M-G set. This device electrically isolates the critical load from the input power source. Used primarily as a power conditioner, the M-G relies on the inertia of the rotating components to bridge momentary power outages or improve power quality. Unlike the static inverter, which requires high-frequency switching to replicate a sine wave and provide filtering, the AC generator portion of the M-G has a pure sine wave output. By using a DC motor or a motor drive that is fed from a combination of the local utility and a rectifier/inverter, this system can be converted into a true online UPS.
Should the utility source become unavailable, the bypass path to the motor is isolated and the motor is then fed from the DC battery system, directly or through the rectifier/inverter path. In normal operation, the critical loads are supplied with regulated, computer-grade power from the generator. Another advantage of the rotary UPS is that the generator will provide the additional current necessary to clear electrical faults in the downstream distribution without having to transfer to the bypass source, as is sometimes required by static UPS systems. The rotary UPS can also sustain the output waveform through short-duration power-quality problems without transferring to battery because of its motor/generator design (essentially a flywheel, continuing to rotate for some period of time on inertia alone, based on the mass of the flywheel). This feature can shield the battery system from short-duration power quality events, thus reducing downtime and maintenance costs and increasing battery life.

Rotary UPS systems have a high initial cost but lower maintenance costs; reliability is excellent, and fewer electronic components mean fewer pieces that can fail. Motor/generators are durable and can withstand higher room temperatures (so less cooling is required). Bearing maintenance is reduced, with longer-life, sealed bearings allowing up to ten years between scheduled bearing replacements, and routine vibration analysis can reliably predict the need for an overhaul or bearing change-out. Rotary systems also offer the possibility of medium-voltage inputs and outputs, an application for which static UPSs are not economically suitable. Another potential benefit of a rotary UPS is that an IC (internal combustion) engine can be incorporated into the rotary UPS by coupling the shafts together through an appropriate controllable coupling (see Section 10.8.1).

There are three major concerns when considering rotary UPS systems:

1. Initial cost. The capital costs of rotary UPS systems are normally 20% to 40% higher than static UPS systems, and higher still with an integrated IC engine.

2. Equipment weight. If these systems are installed on a slab on grade, the added weight is not a structural concern. However, if the equipment is being installed in a high-rise building, additional reinforcement is required and a lighter static UPS becomes a more attractive alternative.

3. Service technician availability. As with any system, good service is imperative and should be local, with a response time of 4 hours or less.
10.8.1 UPSs Using Diesel
Based on the components described in Section 10.7, diesel UPSs operate in parallel with the utility supply, delivering clean and continuous power without the need for batteries. When the utility source is available, the rotary components of the diesel UPS filter out small disturbances and other power quality problems. The system includes a flywheel that stores enough energy to bridge interruptions lasting a few seconds. When the utility fails, the diesel engine will start within about 10 seconds and take over as the primary energy source supplying power to the load. These systems may have a 50%
smaller footprint than traditional static UPS systems using batteries, and although the generator requires maintenance, maintenance on the flywheel is minimal and there are no batteries to be concerned with.

Figure 10.9. Diesel UPS. (Courtesy of Hitec Power Solutions Inc.)

Dual Output Concept. The dual output system is an extension of the diesel UPS system, designed to maintain the quality and continuity of electrical power required by both critical and noncritical loads. The advantage of this system is that the properties of a UPS and of a generator set are combined, creating a continuous supply of power to critical loads as well as protection for noncritical loads.

Description of Operation. Under normal conditions, while the utility is supplying power, the choke and generator provide clean power to the critical loads while the noncritical loads are fed directly from the utility. In an emergency, when the utility fails, the input circuit breaker is opened and the diesel engine is started. At the same time, kinetic energy stored in the induction coupling is used to maintain the critical load. When the speed of the engine equals the speed of the induction coupling, the clutch engages and the engine becomes the energy source supporting the critical loads. A few seconds after the diesel has started, the noncritical loads are reconnected.
10.8.2 Hybrid UPS Systems
Any UPS configuration that incorporates both static and rotary equipment is considered to be a hybrid system. In a system that uses the grid and several sources of onsite
generation as the input to the power conditioning equipment, flywheels provide the energy storage and motor-generators provide the power conditioning.
10.9 REDUNDANCY, CONFIGURATIONS, AND TOPOLOGY
Multimodule UPS systems (static or rotary systems) are arranged in various ways to achieve greater total capacity and/or improved fault tolerance and availability. Multimodule UPS systems allow for redundancy and enable module maintenance by isolating one module at a time. A single-module UPS system must transfer to bypass (commonly referred to as static bypass) in the event of a module overload or failure. Owners must decide whether they are comfortable with the UPS transferring to utility as the backup power source or configure parallel redundant modules to continue carrying the critical load on conditioned UPS power. The following sections will show the different types of redundancy, configurations, and topology techniques.
10.9.1 N Configuration
When classifying UPS installations, the variable N is commonly used to express the minimum capacity required to support the critical load. Whether a 300 kW load is supported by a single 300 kW UPS or by two parallel 150 kW UPS systems, the configuration is described as an N configuration because there is no redundancy: the total UPS capacity closely matches the size of the load. Figure 10.10 shows an example N setup in which the load is supported by a single UPS module. The advantages of an N system include improved UPS efficiency (if the module is run near capacity), and for some facilities it may be a cost-effective solution. However, if the rectifier/inverter pathway fails, the UPS will expose the load to utility power via the internal static bypass path, and if the internal STS also fails (a rare occurrence) the load will lose power.
Figure 10.10. N UPS configuration.
246
UPS SYSTEMS
10.9.2 N + 1 Configuration

An N + 1 system is defined as a configuration that contains the minimum UPS capacity required to support the load, plus a single redundant module. N + 1 systems are also referred to as parallel redundant; an example is shown in Figure 10.11. This configuration allows either module to fully support the load in the event of a UPS failure or when one module must be taken offline for maintenance. It is important to remember that the load on each UPS module cannot exceed 50% of its rated capacity, so that if one module fails, the remaining module can carry the total load at up to 100% of its capacity. For example, if the total load is 200 kW, each of the two UPS modules must be rated for at least 200 kW, for a total capacity of 400 kW. A parallel redundant configuration is costlier than a simple N system but yields a higher level of availability because a single UPS module is no longer a single point of failure.
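The sizing rule above can be checked in a few lines. This sketch is not from the original text; the function name is an invention, and the only figures taken from the text are the 200 kW example load and module ratings.

```python
# N + 1 sizing check: with parallel modules, losing any one module
# must still leave enough rated capacity to carry the whole load.

def n_plus_1_ok(total_load_kw: float, module_rating_kw: float,
                modules: int = 2) -> bool:
    """True if (modules - 1) surviving modules can carry the full load."""
    return (modules - 1) * module_rating_kw >= total_load_kw

# The text's example: a 200 kW load on two 200 kW modules (400 kW total).
print(n_plus_1_ok(200, 200))   # True - either module alone carries the load
print(n_plus_1_ok(300, 200))   # False - 300 kW on two 200 kW modules has no redundancy
```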
10.9.3 Isolated Redundant Configuration
Another example of an N + 1 system is an isolated redundant configuration, which features a primary UPS that supports the full load. If the primary UPS fails, the load is caught by the secondary UPS system through the static bypass of the failed primary UPS. Under normal operating conditions, the catcher UPS sits idle, with no load on its output unless the primary UPS fails. The only disadvantage of the isolated redundant (or catcher bus) configuration compared with the parallel redundant system is that the load will lose power if the primary module and its associated static bypass switch fail at the same time, leaving the catcher UPS no way of supporting the load through the primary bypass.
10.9.4 N + 2 Configuration

An N + 2 UPS configuration incorporates the N + 1 system discussed earlier plus a second redundant UPS module. This allows a facility to suffer a UPS failure
MMS = Multimodule system
Figure 10.11. N + 1 UPS configuration.
Figure 10.12. Isolated redundant UPS configuration.
MMS = Multimodule system
Figure 10.13. N + 2 UPS configuration.
or have a UPS brought offline for maintenance and still retain N + 1 redundancy. This configuration can be expanded to any number of modules—N + 3, N + 4, and so on—depending on design uptime requirements and costs.
10.9.5 2N Configuration
A 2N system includes two wholly separate UPS hierarchies supporting a common critical load. A 2N system may be combined with any of the above UPS configurations to further improve availability; however, the cost increases at twice the rate of a similarly equipped single-hierarchy UPS system. 2N configurations are generally the most reliable and are best suited to companies for which high availability outweighs the high costs of installing and maintaining such a system. In Figure 10.14, each UPS module is fed from a different utility or substation and a different generator set (genset). Generally, each utility or substation feed enters from a different location around the building property, so UPS1 is fed from a utility feed on the north side of the building, whereas UPS2 is fed from a utility feed on the south side. This increases availability if one of the utility feeds goes down due to power line or other equipment failures. The same is true of the emergency power feeds, as one bank of generators feeds UPS1 and another bank feeds UPS2. Again, in the example in Figure 10.14, the load on each UPS cannot exceed 50% of its rated capacity; if one of the UPS modules fails, the remaining module must carry the full load.
10.9.6 2(N + 1) Configuration
The 2N configuration can be further expanded to a 2(N + 1) system for additional redundancy and flexibility, as shown in Figure 10.15. A 2(N + 1) system consists of two separate power distribution hierarchies, each including a secondary redundant UPS, such that any individual UPS may be brought offline for maintenance while still maintaining at least a 2N level of availability. This type of configuration adds further complexity and maintenance costs to an already costly design and installation investment, and is feasible for industries such as the financial sector, where an hour of downtime far outweighs the system costs.

Figure 10.14. 2N UPS configuration.

Figure 10.15. 2(N + 1) UPS configuration.
10.9.7 Distributed Redundant/Catcher UPS
Both distributed redundant and catcher systems use external STSs (see Chapter 8 for more on static transfer switches, which are not the same as static bypass switches). In a catcher system, one module feeds the alternate side of each STS while other modules feed the primary sides. Similar to an isolated redundant system, the catcher module is unloaded most of the time, and all modules can be serviced. In a distributed redundant system, each module is connected to the primary side of one STS and the alternate side of another (Figure 10.16). In this way, there are no unloaded modules, and a very high level of availability is achieved. The distributed redundant configuration is commonly deployed in environments with high reliability demands; it yields high availability without the cost of a full 2N system.
10.9.8 Eco Mode for Static UPS
Figure 10.16. Distributed redundant UPS configuration.

Recently, UPS manufacturers have started offering double-conversion UPS modules equipped with an eco mode to improve efficiency. Some vendors claim 97-98% efficiency while operating their UPSs in eco mode. This is achieved by connecting the critical load directly to utility power via the bypass circuit while keeping the UPS inverter energized. Should the UPS detect a power anomaly, the module can near-instantaneously transfer the load to battery backup. There is some concern about long-term reliability when exposing critical loads to utility-grade power for extended periods, so running a UPS in eco mode should not be seen as appropriate for every application. Essentially, eco mode allows a double-conversion UPS to operate analogously to an offline UPS when commanded to do so.
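The efficiency difference translates directly into energy savings, which can be estimated as below. This sketch is illustrative only: the 500 kW load, the 94% double-conversion efficiency, and the $0.10/kWh rate are assumptions, not figures from the text; only the 98% eco-mode figure reflects the vendor claims cited above.

```python
# Estimate annual UPS losses at a given efficiency, then compare
# double-conversion mode against eco mode. Illustrative numbers only.

def annual_loss_kwh(load_kw: float, efficiency: float) -> float:
    """UPS losses over a year of continuous (8760 h) operation."""
    input_kw = load_kw / efficiency
    return (input_kw - load_kw) * 8760

load = 500.0   # kW critical load (assumed)
saved_kwh = annual_loss_kwh(load, 0.94) - annual_loss_kwh(load, 0.98)
print(round(saved_kwh), "kWh saved per year")
print("$" + str(round(saved_kwh * 0.10)), "at an assumed $0.10/kWh")
```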
10.9.9 Availability Calculations

Availability = (System MTBF) ÷ [(System MTBF) + (System MTTR)]

Typical system configuration design goals are:

For single-bus UPS systems: power availability at the IT load input > 0.99999 (5-6 nines)
For dual/multibus UPS systems: power availability at the IT load inputs > 0.9999999 (7-9 nines)

MTBF = field-observed calculated mean time between failures.

MTTR = field-observed calculated mean time to repair, in hours, including the service technician's travel time to the site.

Field-observed MTBF = (UPS system installed-base total operating hours) ÷ (total observed UPS system output failures + 1).
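The availability formula above can be sketched as follows. The MTBF and MTTR figures in the example are illustrative assumptions, not field data; only the formula and the "nines" convention come from the text.

```python
import math

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), per the formula above."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def nines(a: float) -> float:
    """Express availability as a number of nines, e.g. 0.99999 -> 5."""
    return -math.log10(1.0 - a)

# Illustrative values (assumed): 500,000 h MTBF, 4 h MTTR.
a = availability(500_000, 4)
print(f"{a:.7f}", "->", round(nines(a), 1), "nines")
```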
10.10 ENERGY STORAGE DEVICES

10.10.1 Battery
Lead-acid batteries are the most commonly used energy storage devices for supplying the UPS inverter when the normal AC power supply is lost. Battery systems can be designed to provide enough capacity for a predetermined duration of normal power loss, typically 15 minutes or less, although some applications may require hours of run time on battery. The battery capacity will vary with the design load. Some systems are designed to ride through only momentary power interruptions; applications such as these typically provide a few minutes of reserve time on battery.

Traditionally, data center UPS systems are designed to provide 15 minutes of backup power via batteries, allowing enough time for several attempts to start a genset or for an orderly shutdown of the major computer hardware. However, with the advent of today's high-density data centers it is almost impossible to shut down a data center of significant size in 15 minutes. For example, a 10,000 square foot data center can take a minimum of 2 hours to shut down, and today there are data centers of over 100,000 square feet. Why, then, is 15 minutes still the industry standard for larger data centers, whereas applications such as telephone switches or broadcasting systems may be designed with enough capacity for many hours of uninterrupted service?

Among the common batteries in use today are lead-calcium batteries, which have a float voltage between 2.21 and 2.25 volts per cell (vpc). An equalizing charge of 2.30-2.35 vpc is used for a boost charge or battery conditioning; extended operation at this voltage would gas the electrolyte, accelerate dry-out, and shorten battery life through heating. Depending on the application, over 200 cells may be required for a 480 volt UPS system. The footprint and floor loading of large battery systems are major considerations when designing a UPS system.
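The cell-count arithmetic above can be sketched as follows. This is an illustrative check, not from the original text; it simply divides a nominal DC bus voltage by the float voltage per cell, using the 2.21-2.25 vpc range and the 480 V figure quoted above, and ignores real sizing factors such as end-of-discharge voltage and inverter input window.

```python
import math

def cells_required(dc_bus_volts: float, float_vpc: float) -> int:
    """Series cells needed so the string float voltage reaches the bus voltage."""
    return math.ceil(dc_bus_volts / float_vpc)

for vpc in (2.21, 2.25):
    print(vpc, "vpc ->", cells_required(480, vpc), "cells")
# Both results exceed 200 cells, consistent with the text.
```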
Battery cells are connected in series (known as a battery string) to provide enough DC voltage to supply the inverter of the UPS system. Some applications run parallel strings to provide more minutes of storage for larger systems, possibly adding one additional string where redundancy is required.

Many managers make the mistake of buying a battery system and neglecting its maintenance, often as a means of cutting O&M costs. It is important to understand that one faulty battery cell can render an entire UPS system useless if it only has
a single string of batteries. Although battery life is not entirely predictable, the vast majority of defective batteries can be detected through a good maintenance and load-testing program. The addition of monitoring adds a layer of security through trending and predictive failure analysis.

Batteries supply power to the inverter when input AC power has been lost. A battery string is a group of single cells connected in series and/or parallel. Reliable and properly maintained batteries are critical for UPS system reliability: batteries must be able to ride through short-term outages and also provide enough time to start up generators in the event of long-term outages. Two types of batteries are used in most UPS applications: flooded (wet) lead-acid and valve-regulated lead-acid.

Flooded or Wet Lead-Acid. Flooded lead-acid batteries have been in use for more than a century. The flooded or wet design refers to batteries that require periodic replenishment of water or acid, which makes maintenance costs higher. The battery must be filled with distilled water and vented to keep the hydrogen gas emitted during charging from accumulating. Flooded lead-acid batteries must be maintained and inspected to ensure proper function and longevity; with a proper maintenance program, the expected life of such a battery can be up to 20 years. The requirements of flooded batteries include large floor space, ventilation systems, racks/stands, acid containment, regular maintenance, and safety equipment such as eye wash stations. An example of a flooded lead-acid battery may be seen in Figure 10.17.

The design of flooded batteries is quite simple. Negative plates, made of lead or a lead alloy, are sandwiched between positive plates of lead or lead alloy with added calcium. Sheets of a nonconductive material separate the plates. All positive and negative plates are welded together and attached to a terminal post. The assemblies are placed
Figure 10.17. Typical wet cell battery. (Courtesy of GNB Industrial Power.)
in a container and filled with a dilute sulfuric acid solution whose specific gravity can be adjusted to increase capacity. An example of a flooded-cell battery room is shown in Figure 10.18.

Valve-Regulated Lead-Acid (VRLA). VRLA batteries (Figure 10.19) rely on gas recombination to keep the batteries saturated with acid. The positive plate generates oxygen while the negative plate generates hydrogen; the two gases recombine on the surface of the negative plate. In most areas there should be no need for special ventilation or acid containment. However, although the acid is held in an absorbent material and there is no free electrolyte, these batteries can and do still leak or vent hydrogen, which is why the term "sealed" is no longer used. The relief valves are designed to emit small bursts of gas when pressure reaches predetermined levels. This battery needs no watering during its life; however, its most common failure mode is dry-out. Charging voltage and temperature must be maintained per the manufacturer's specifications. Lead-acid performance is optimal at 77°F; operating a battery 15°F above 77°F will cut its life in half, and even higher temperatures will hasten dry-out and eventually cause the battery to fail catastrophically.

VRLA batteries include an absorbent glass mat (AGM) that holds the needed electrolyte (water-acid mixture) between the plates. The thicker the mat, the longer the battery will last without drying out. 10-year VRLA batteries have only half the life of flooded lead-acid batteries, although recent data suggests that 20-year VRLA batteries have begun to equal the service life of many flooded lead-acid batteries, while still having lower reliability and a higher chance of failing open.
Although the flooded battery requires more maintenance, the VRLA actually requires more care since we cannot see the condition of the plates and must rely on voltage and internal measurements to provide the data necessary to determine its relative health. Careful monitoring is a must and can extend that useful life; however, it is not unusual to start replacing some VRLA batteries after four years.
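The temperature rule of thumb above (life halved for every 15°F above the 77°F optimum) can be sketched as a derating estimate. This is an illustrative simplification, not from the original text; the function name is an invention, and real battery life also depends on cycling, float voltage, and other factors.

```python
# Rule-of-thumb derating: lead-acid battery life halves for every
# 15 F of sustained operation above the 77 F optimum.

def expected_life_years(rated_years: float, ambient_f: float,
                        optimal_f: float = 77.0, halving_f: float = 15.0) -> float:
    if ambient_f <= optimal_f:
        return rated_years   # simplification: no derating at or below optimum
    return rated_years / (2 ** ((ambient_f - optimal_f) / halving_f))

print(expected_life_years(10, 77))   # 10.0 - full rated life at 77 F
print(expected_life_years(10, 92))   # 5.0 - 15 F hotter halves the life
```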
Figure 10.18. Flooded-cell battery room. (Courtesy of GNB Industrial Power.)
Figure 10.19. Typical VRLA battery. (Courtesy of East Penn Manufacturing Co., Inc.)
Other Considerations. Batteries can be sized for any specified discharge time. The discharge limit can range between 5 minutes and 1 hour, but the most common duration is around 15 minutes. The longer the duration, the greater the number of batteries, and thus the larger the installation space requirements. With today's generators reaching synchronization in far less time than gensets of the past, it is not unusual to size batteries for 5 minutes or less. Again, if the generator does not start, the extra time in the battery does nothing to help provide an orderly shutdown, or any other benefit for that matter.

Safe operation is the primary consideration when determining the size and space needed for batteries. Natural disasters such as earthquakes, ventilation (depending on the type of battery), and operating temperature (the preferred battery room temperature is 25°C/77°F) are important considerations when designing large UPS systems that require a significant number of batteries. Maintaining proper room temperature and cell voltages will maximize the life of a battery. Various federal, state, and local codes and standards define some of the requirements for battery installations. Battery monitoring systems help manage other parameters that contribute to a long and reliable battery string life.

Failure Modes. The most significant difference between flooded and VRLA batteries is how they fail. The flooded battery tends to fail in capacity, which generally means that with proper maintenance and monitoring there is plenty of time to react to a degrading battery. The VRLA, on the other hand, tends to fail catastrophically and can do so almost without notice; there are many stories of VRLA battery strings that worked fine the last time they were needed. This makes proper maintenance and monitoring much more critical to ensuring reliability.
10.10.2 Flywheel Energy
Essentially, a flywheel (Figures 10.20 and 10.21) in UPS applications is a device for storing energy in a rotating mass, used where high power is required only intermittently and briefly. Flywheels can be high-mass and low-speed (as low as a few hundred RPM) or low-mass and high-speed (greater than 50,000 RPM). The stored kinetic energy grows with the rotating mass and the square of the speed (proportional to mass × RPM²). In this application, energy can be expressed as kW × s, or kilowatt-seconds: power delivered over a period of time equals energy.

Compared to batteries, flywheels generally have a higher initial capital expense but lower installation costs, lower maintenance costs, and an extended working lifetime. Flywheels are "green" technologies that do not require the cooling or ventilation of conventional battery systems, and do not pose the same environmental hazards as batteries when disposed of.

Flywheels may be used with motor-generators to increase the ride-through time of the set. An alternative application is storing DC energy: flywheels may be operated in parallel with batteries to improve the overall reliability and life of a battery string by shielding it from momentary outages and other power quality issues. A third application uses a flywheel in place of a battery with a UPS, to bridge the few seconds needed to start a standby genset. Flywheels, while currently able to offer only from a few seconds up to just over 90 seconds of power, are generally more
Figure 10.20. Cutaway view of aflywheel.(Courtesy of Pentadyne Energy Corp.)
256
UPS SYSTEMS
Figure 10.21. An integrated 300 kVA flywheel UPS. (Courtesy of Active Power.) reliable than batteries, and when coupled with a redundant and well-maintained generator system they can provide extremely high reliability for mission critical applications without the need for batteries and the associated battery maintenance costs. See Table 10.1 for advantages and disadvantages of flywheels. Some companies have designed integrated flywheel/UPS systems. These typically combine a flywheel with single-conversion UPS electronics, all in a single cabinet. These integrated UPS designs are surprisingly compact for their rated power output, requiring less than half the floor space of a double-conversion UPS plus batteries. The footprint advantages of flywheels are especially helpful to system designers coping with the power density of some modern data centers. By replacing battery racks with flywheel cabinets, a system designer can reclaim enough floor space to add another complete UPS-plus-flywheel system.
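The kW × s arithmetic above can be made concrete with the standard kinetic-energy relation E = ½Iω². The sketch below estimates ride-through time for a hypothetical flywheel; the inertia, speeds, and load are illustrative numbers only, not specifications of any product mentioned in this chapter.

```python
# Illustrative flywheel ride-through estimate.
# Kinetic energy of a rotating mass: E = 1/2 * I * omega^2, where I is
# the moment of inertia (kg*m^2) and omega the angular speed (rad/s).
import math

def flywheel_energy_joules(inertia_kg_m2, rpm):
    """Kinetic energy (J) of a flywheel spinning at `rpm`."""
    omega = rpm * 2 * math.pi / 60.0  # convert RPM to rad/s
    return 0.5 * inertia_kg_m2 * omega ** 2

def ride_through_seconds(inertia_kg_m2, rpm_full, rpm_min, load_kw):
    """Seconds of ride-through between full speed and the minimum usable
    speed, assuming a constant load and ideal energy conversion."""
    usable_j = (flywheel_energy_joules(inertia_kg_m2, rpm_full)
                - flywheel_energy_joules(inertia_kg_m2, rpm_min))
    return usable_j / (load_kw * 1000.0)

# Invented example: a high-speed, low-mass rotor carrying a 250 kW load.
t = ride_through_seconds(inertia_kg_m2=6.0, rpm_full=50000,
                         rpm_min=30000, load_kw=250)
```

Because energy grows with the square of speed, most of the usable energy lives in the top of the speed range, which is why high-speed designs can be physically small.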
10.11 UPS MAINTENANCE AND TESTING
Today's UPS systems are well designed and robust; they are nothing less than the Sherman tank of the power electronics world. If we count the number of failures of modern UPS systems, they are statistically insignificant when compared with the millions of successful run hours. Quite often, we hear terms like "greater than six sigma of availability" or "upward of 1,000,000 hours mean time between failures" attached to the success of electronic engineering in providing computer-grade power. Nevertheless, when you fall into the ranks of the statistically insignificant, your perspective on reliability changes forever. Suddenly, that last micro percentage point is more important than the previous year of online operation. Such is the case when managers find that after ten years of successful uninterrupted operation a system fails, stranding a data center without power for only minutes but leading to a recovery process that could take days at minimum. In the postmortem search for a smoking gun, a technician may find buckets of failed components: diodes, capacitors, printed circuit boards, SCRs, fuses, and even essential logic power supplies. The failure leaves its mark, and we come to realize that there are thousands of boxes out there continuing to run, but waiting for that last component to fail before the system comes unglued. Consequently, our on-site testing procedures need to include comprehensively testing the UPS equipment along with the associated batteries.

Because of the complexity of a UPS system, managers tend to leave UPS maintenance in the hands of a vendor service provider. For ten years, the technicians may provide good service, but often they are not allowed to functionally test the equipment because of business constraints. Additionally, the quality of technicians' service varies with their experience and training. In order to assure functionality of the UPS equipment, a test procedure or script must be instituted, one that is centered on accurately recording and verifying real-life conditions. The test is based primarily on standard factory acceptance testing and is broken down into three phases, as outlined below.

Table 10.1. Advantages and disadvantages of a flywheel

Advantages:
• Long life (15-20 yr), unaffected by number of charge/discharge cycles
• Lower lifecycle cost
• High efficiency—can be considered a green product
• High charge/discharge rates and no trickle charge required
• Inherent bus regulation and power shunt capability
• Reliable over a wide range of ambient temperatures
• High power density compared to batteries

Disadvantages:
• Short reserve time
• Higher initial cost
• Batteryless operation requires use with a genset
• Expensive bearings (some manufacturers)
10.11.1 Physical Preventive Maintenance (PM)
Testing a UPS system as described is a complex and dangerous task if not performed correctly. The skills needed to run a thorough UPS functional test are beyond the experience of most facility managers. It is important to contract with a reputable test engineering group that has strong, continuous experience in UPS maintenance and the testing of critical systems to oversee the procedure, analyze the test results, and make recommendations. The physical PM should include infrared scanning prior to shutting down the unit. Check for loose connections on power supplies, diodes, capacitors, inverter gate, and drive boards, and check for swollen capacitor cans. In general, a swollen or leaking capacitor is an indication that the capacitor has failed or is beginning to fail.
10.11.2 Protection Settings, Calibration, and Guidelines
The next step is checking protection settings and calibration. By simulating an alarm condition, the technician can verify that all alarms and protective settings are operational. Using independent monitoring, the system metering can be checked to verify its operation. If the UPS system metering is out of calibration, it may affect the safety and operation of the UPS system. For instance, if a meter that monitors DC bus voltage is out of calibration, it may affect the DC overvoltage or undervoltage alarm and protection. If the UPS were to transfer to battery, the ensuing battery discharge might be allowed to continue too long. As the battery voltage drops, the current must increase to continue supporting the same power. When the current becomes too high, the temperature in the conduction path increases to the point of destruction. The following is a sampling of typical UPS alarms:

• AC Overvoltage
• AC Undervoltage
• DC Overvoltage
• DC Undervoltage

10.11.3 Functional Load Testing
Functional load testing can be broken down into four distinct operations: steady-state load test, transient response test, filter integrity test, and battery rundown test. For systems with multiple modules, a module fault and restoration test should be performed in addition. If you think of a UPS system as a black box, the best way we can determine the health of that box is by monitoring the inputs into the box and the outputs from the box. In most UPS systems, power is drawn from the utility to create a DC power supply that may be paralleled with the battery. That power is drawn equally across all phases, with all modules sharing in the transmission of power. On the output side, the power distribution is dependent upon the actual load that is drawn from the UPS box. In many instances, the three phases will not be equally loaded; consequently, current draw will not be equal across all three phases. On the other hand, output phase voltage should be equally balanced across all phases and across all modules (for multimodule systems).
10.11.4 Steady-State Load Test
Under a steady-state test, all input and output conditions are checked at 0%, 50%, and 100% of load. The test provides a controlled environment to verify key performance indicators such as input currents and output voltage regulation. The steady-state test checks that input currents are well matched across all phases of a module and that all modules are equally sharing the load. Output currents are load dependent, that is, there may be a heavier load on an individual phase due to load distribution. The input currents, however, should be closely matched as each module is sharing in the generation of a DC power source from which the load will draw. Voltage readings are also taken during the steady-state conditions to check for good, evenly matched voltage regulation on the output side. Since power is equal to voltage times current, a degraded voltage from a single module or phase means that other modules or phases must produce more current to accommodate the voltage drop. Multiple modules are intended to run closely matched in output voltage and frequency.
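A minimal sketch of how such steady-state readings might be screened is shown below. The tolerance bands and readings are hypothetical; real acceptance limits come from the factory acceptance test specification for the unit under test.

```python
# Illustrative pass/fail screen for steady-state readings: input
# currents should be closely matched across phases/modules, and output
# voltages should be evenly regulated.
def within_band(readings, tolerance_frac):
    """True if every reading is within +/- tolerance_frac of the mean."""
    mean = sum(readings) / len(readings)
    return all(abs(r - mean) <= tolerance_frac * mean for r in readings)

# Invented per-phase readings from a single module:
input_amps = [118.0, 119.5, 117.8]    # input currents (A)
output_volts = [277.0, 276.2, 277.5]  # output voltages (V)

input_ok = within_band(input_amps, 0.05)     # within 5% of the mean
voltage_ok = within_band(output_volts, 0.02) # within 2% of the mean
```

Output currents are deliberately not screened this way, since, as noted above, they legitimately vary with load distribution across phases.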
10.11.5 Steady-State Load Test at 0%, 50%, and 100% of Load
The following tests are performed:

• Input voltage
• Output voltage
• Input current
• Output current
• Output frequency
• Input current balance
• Output voltage regulation
10.11.6 Harmonic Analysis and Testing
The input and output of the UPS should be monitored for harmonic content during the steady-state load testing. Observing the harmonic content at 0%, 50%, and 100% of load allows the engineer to determine the effectiveness of the input and output filters. It is important to note that most UPS filtering systems are designed for greatest efficiency at full load. Consequently, the harmonic distortion is greatest when the system is least loaded and smooths out as the load is increased. As long as the distortion is consistent across phases and modules, there is no cause for alarm. Total harmonic distortion (THD) is essentially a measure of inefficiency. A unit operating with high THD uses more energy to sustain a load than a unit with low THD; the additional energy is dissipated as heat. Consequently, a UPS operating with high THD at low load, though inefficient, is in no danger of damaging conduction components. A UPS operating with high THD at high load is extremely taxed, since the unit is producing energy to sustain the load along with additional heating due to poor efficiency.
Another way of explaining this condition is that a system rated for 1000 amps operating at 100 amps of load with 20% THD is producing 120 amps, which is easily accommodated on a 1000 amp rated system. That same system operating at the 1000 amp full load with 20% THD is producing 1200 amps, an overloaded condition. Harmonic testing consists of:

• Input voltage testing
• Output voltage testing
• Input current testing
• Output current testing
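The 1000-amp example can be sketched as follows. The additive rule (load × (1 + THD)) mirrors the chapter's simplified illustration; a formal RMS combination of fundamental and harmonic currents, shown for contrast, gives a somewhat smaller figure. Function names and numbers are illustrative.

```python
# Extra current carried by a unit due to harmonic distortion.
import math

def total_current_simple(load_amps, thd):
    """Chapter's rule of thumb: distortion adds thd * load on top."""
    return load_amps * (1.0 + thd)

def total_current_rms(load_amps, thd):
    """RMS combination of the fundamental and harmonic currents."""
    return load_amps * math.sqrt(1.0 + thd ** 2)

# 1000 A rated system with 20% THD:
light = total_current_simple(100, 0.20)   # lightly loaded case
full = total_current_simple(1000, 0.20)   # full-load case
```

Either way the point stands: the same distortion percentage that is harmless at light load can push a fully loaded unit past its rating.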
10.11.7 Filter Integrity and Testing
Most UPS systems contain three filters: an input filter, a rectifier filter, and an output filter. The filters commonly use an arrangement of inductors (coils) and capacitors to trap or filter unwanted waveform components. Coils are relatively stable and seldom break down; they are quite simply wound metal (copper or aluminum) conductors. Indications of a coil failure are typically only picked up on IR scans under loaded conditions. Capacitors are more prone to failure under stress. Like batteries, capacitors are subject to expansion and leakage or to the electrolyte drying up. AC capacitors are subject to rapid expansion or popping when overstressed; the result is acidic electrolyte sprayed onto sensitive electronic components. DC capacitors have a definite life span and should be changed every 5-7 years. The electrolyte in DC capacitors is prone to drying out over time, which is difficult to detect short of a capacitance test. DC capacitors are subject to swelling as well, but are normally equipped with a relief indicator that can be readily spotted. In large, three-phase UPS systems, there can be hundreds of capacitors wired in series and arranged in banks. Identifying a single failed capacitor can be difficult and time-consuming. A simplified means of checking filter integrity is by performing a relative phase-current balance. By checking the phase current through each leg of the filter, a quantitative evaluation may be made. A marked difference in phase currents drawn through a filter assembly is an indicator that one or more capacitors have degraded. Further analysis may be directed to a filter assembly drawing abnormal current relative to corresponding phases. Filter testing consists of:

1. Input filter capacitance
   • Phase current balance
   • Individual capacitance test
2. Rectifier filter capacitance
   • Phase current balance
   • Individual capacitance test
3. Output filter capacitance
   • Phase current balance
   • Individual capacitance test
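The relative phase-current balance check described above can be sketched in a few lines. The 10% flag threshold and the sample readings are hypothetical; actual limits depend on the filter design and the manufacturer's guidance.

```python
# Illustrative phase-current balance check for a filter assembly.
# A marked difference between phase currents suggests one or more
# degraded capacitors in that leg.
def phase_imbalance(currents_amps):
    """Worst-case deviation from the mean phase current, as a
    fraction of the mean (0.0 means perfectly balanced)."""
    mean = sum(currents_amps) / len(currents_amps)
    return max(abs(i - mean) for i in currents_amps) / mean

IMBALANCE_LIMIT = 0.10  # invented flag threshold

readings = [42.1, 41.8, 35.0]  # amps through filter legs A, B, C
suspect = phase_imbalance(readings) > IMBALANCE_LIMIT
```

A flagged assembly would then be followed up with individual capacitance tests, as the filter testing list above prescribes.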
10.11.8 Transient Response Load Test

The transient response test simulates the performance of a UPS given large instantaneous swings in load. The UPS is designed to accommodate full load swings without distortion of output voltage or frequency. Periodically testing the responses to load swings and comparing them with the original specifications can ascertain the health of the UPS. In order to observe the response to rapid load swings, a recording oscillograph and load banks are connected to the UPS output. The oscillograph monitors three phases of voltage along with a single phase of current. (The current trace acts as a telltale as to when load is applied and removed.) As the load is applied and removed in 50% increments, the effect on the voltage waveform is observed. The voltage waveform should not sag, swell, or deform more than 8% with the application and removal of load. Similar tests are performed simulating the loss of AC input power, and again on AC input restoration. Module output transient response (recording oscillograph) load step testing consists of:

• 0%-50%-0% test
• 25%-75%-25% test
• 50%-100%-50% test
• Loss of AC input power test
• Return of AC input power test
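The 8% pass/fail criterion above can be expressed as a simple check over the oscillograph samples captured during a load step. The sample values are invented for illustration.

```python
# Illustrative transient-response check: during a load step, the output
# voltage should not sag, swell, or deform more than 8% from nominal.
def max_deviation_frac(samples, nominal):
    """Largest deviation from nominal, as a fraction of nominal."""
    return max(abs(s - nominal) for s in samples) / nominal

LIMIT = 0.08  # the chapter's 8% figure

# Invented voltage samples recorded across a 0%-50%-0% load step:
step_samples = [277, 274, 262, 270, 276, 278]
passed = max_deviation_frac(step_samples, 277) <= LIMIT
```

In practice the same check is repeated for each step in the list above, and for the loss and return of AC input power.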
10.11.9 Module Fault Test
Multimodule systems should be functionally tested to verify that the system will continue to maintain the critical load in the event of an inadvertent module failure. With full system-rated load applied (via load banks), a single-module failure should be simulated. The system should continue to maintain the load without significant deviation of voltage or frequency, as verified by a recording oscillograph. Each module should be tested in this manner.
10.11.10 Battery Rundown Test
The final test of a system should be a battery rundown. Quite often, the battery system may be subject to failures that go undetected. The most meaningful test of a battery is to observe the temperature, voltage, and current under load conditions. Disparate voltages from cell to cell or string to string provide the best indication that a battery or string is degrading. The load test will depend on the type of battery connected to the UPS system. Wet-cell batteries are more robust and are capable of undergoing a full battery rundown test once a year. Valve-regulated batteries are often overstressed by a full battery rundown and should be tested with a shorter-duration test. Regardless of the battery test design or duration, a battery should undergo a discharge that records the knee of the curve, that is, the point at which battery voltage begins to decay at an exponential rate with respect to time. While undergoing the discharge test, battery cell voltages should be continuously checked, either with a cell monitor or manually. Batteries that exhibit lower-than-average cell voltages should be marked for potential replacement. When the battery string is no longer able to provide the designed discharge time (e.g., 15 minutes), it is time to consider full battery replacement. Battery temperatures should be continuously monitored during the discharge test, particularly at the plates and terminal posts, to determine if abnormally high temperatures exist. High temperature is an indication that the battery is working too hard to provide the designed output.
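Finding the knee of the curve in recorded rundown data can be approximated with a crude rate-of-change test, as sketched below. The voltage samples and the drop threshold are hypothetical; a cell monitor would apply a more refined criterion.

```python
# Illustrative knee-of-the-curve detection on string-voltage samples
# taken at fixed intervals during a discharge test.
def knee_index(voltages, step_drop=3.0):
    """Index of the first sample where the voltage drop per interval
    exceeds `step_drop` volts, a crude marker for the knee.
    Returns None if no such point is seen."""
    for i in range(1, len(voltages)):
        if voltages[i - 1] - voltages[i] > step_drop:
            return i
    return None

# Invented string-voltage readings (V), one per minute of discharge:
string_v = [540, 538, 537, 536, 534, 530, 520, 505]
k = knee_index(string_v)  # index where decay turns sharply downward
```

Once the knee arrives earlier than the designed discharge time, the string is a candidate for replacement, per the guidance above.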
10.12 STATIC UPS SYSTEMS AND MAINTENANCE
It is difficult to differentiate among the providers of static UPSs in terms of reliability. Most providers offer a high degree of sophistication, ranging from simple power conversion to sophisticated synchronization and filtering arrangements. A UPS that is more than ten years old will begin to show signs of age via component failure. Capacitors, SCRs, and diodes have, in most instances, been in service continuously and will begin to fatigue and fail, so maintenance awareness must be increased. With a system of this age, the standard maintenance routine of tightening screws and dusting off the circuit boards is not adequate to prevent failures. It is worthwhile to contract with a reputable power-engineering firm to evaluate your current maintenance program. It is difficult to hold a maintenance vendor or manufacturer responsible without a fair degree of internal technical knowledge. It is imperative that you obtain reliable service followed by a full service report and follow-up recommendations. Skipping routine maintenance service will inevitably lead to disaster and downtime. Examples of semiannual checks and services for UPS systems are shown in Table 10.2.
Table 10.2. Procedures and details used for semiannual checks and services for UPS systems

• Procedure: Perform a temperature check (infrared scan preferable); check all breakers, connections, and associated controls. Details: Report and repair all high-temperature areas. If an area is more than 10°C over the expected temperature, corrective measures are required as soon as time can be arranged or during the next shutdown window, as long as it is within months, not years.
• Procedure: Perform a complete visual inspection of the equipment. Details: Include subassemblies, wiring harnesses, contacts, cables, and major components.
• Procedure: Check module(s) completely. Details: Check rectifier and inverter snubber boards for overheating. Check power capacitors for swelling or oil leaks. Check for DC capacitor vent caps that have extruded more than 1/8 in. Record all voltage and current meter readings on the module cabinet or the system control cabinet.
• Procedure: Check current sharing and balance on capacitor assemblies. Details: Measure and record 5th, 7th, and total harmonic trap filter currents. Check the grease or oil on all connections (if applicable).
• Procedure: Visually inspect the battery room. Details: Check battery jars for proper liquid levels (if there are flooded cells). Check for corrosion on all terminals and cables. Check battery jars for electrolyte seepage and leaks. Examine the cleanliness of the battery room and jars. Check for ventilation where applicable. Be sure all safety equipment is on-site and available.

10.13 UPS MANAGEMENT

The UPS may be combined with a software package to automatically shut down a server or workstation when a power failure occurs. The server/workstation programs are typically shut down in a predetermined sequence to save the data before the UPS battery is out of power. This package is operated from the power-protected workstation or computer and functions as if it were in a stand-alone environment, even when the computer is networked. A UPS connected in a power management function usually provides UPS monitoring and computer shutdown functions. In addition, the manager can also control the operation of the UPS from a remote location, including turning the UPS on and off (not recommended for mission critical IT operations), configuring communications capability, and making performance adjustments. This control is typically executed either within an SNMP (Simple Network Management Protocol) environment or through software supplied by the UPS vendor. This software operates directly with the network and typically is an attached module in the network management software.
10.14 CONCLUSION
When downtime costs millions of dollars an hour and can permanently damage the reputation of your business or endanger the life safety of people, no expense should be spared to guarantee uninterrupted operations. A UPS can help maintain the high level of power uptime demanded in today's marketplace and, with the proper ancillary components, do it in a green fashion. Careful consideration should be applied when selecting the appropriate UPS and when determining the maintenance and operation procedures. Although the system may appear to be running perfectly, it is not until an outage occurs and the UPS fails that it is truly tested, unless a rigorous preventive maintenance program is devised. Whether you are powering sensitive mission critical data, storage, or life safety equipment, a properly maintained UPS is a must-have piece of equipment for many businesses. The next chapter discusses data center cooling and the challenges associated with maintaining the correct temperature in a data center environment.
11 DATA CENTER COOLING SYSTEMS
11.1 INTRODUCTION
A reliable and high-quality supply of electric power is a top priority for data centers, but the ability to provide effective and efficient cooling equipment is equally important. Since there is some commonality between general building air-conditioning systems and data center cooling systems, one would expect more collaboration between the IT and facilities industries. However, different sectors within the datacom industry may have a unique focus and, consequently, may use different terminology and training methods, which hampers collaboration. Although there is no reason to believe that the IT and facilities industries are intentionally limiting their interaction, it is no secret that a concerted effort is needed to reverse this trend. ASHRAE (the American Society of Heating, Refrigerating and Air-Conditioning Engineers) is an engineering society that specializes in heating, ventilation, and air-conditioning (HVAC) design, and is best known in the facilities/buildings sector. But since ASHRAE writes standards, guidelines, and model codes, and given the overlap between the general HVAC and data center cooling industries, it has formed Technical Committee TC9.9, which focuses on the data processing and communication industries and includes strong representation from the IT sector, including IBM, HP, Intel, and Cisco. Inherent to the mission of TC9.9 is a working mechanism that is increasing cooperation between IT and HVAC professionals.
Although the two industries share some common ground, there is no doubt that data centers pose some unique challenges. Higher equipment densities and increasing loads are creating significant cooling issues that require increased collaboration and a holistic approach to cooling. This holistic approach includes considering the interdependence of space geometry, airflow obstructions, IT equipment design, and the implications of moves, adds, and changes. The bottom line is a constant priority for any business, which translates into a never-ending quest for increased efficiency and lower operating costs. Furthermore, the increasing concern for our environment has initiated efforts to green the data center and reduce its carbon footprint. Fortunately, both goals can be achieved with a cost-effective approach.
11.2 BACKGROUND INFORMATION
There is a wide range of options available when designing a building cooling system, and various types of cooling equipment, some common to all systems and some specific to particular ones. This chapter starts with some background information on the basic components of a cooling system, such as chillers, cooling towers, condensers, dry coolers, heat exchangers, air handlers, fan-coil units, and computer-room air-conditioning (CRAC) units. This equipment can be classified in a number of ways, but the terminology used in this chapter is that commonly used by cooling equipment manufacturers as well as engineers and contractors. The material is presented in such a way that those who do not have a professional background in cooling engineering or facilities management (e.g., datacom equipment engineers) will be able to take full advantage of the content. Parts of this chapter refer to guidelines and standards that are good sources of information for understanding the cooling of computer equipment, such as ASHRAE's Thermal Guidelines for Data Processing Environments.
11.3 COOLING WITHIN DATACOM ROOMS
The fundamental principle behind a cooling system is heat transfer. In the broadest sense, heat that is generated by computer equipment must be transported from inside the room, where the equipment resides, to outside the room, where it can be released to the atmosphere. Along the way, one or more methods of heat transfer may be utilized. It is easy to understand that if a heat-generating piece of equipment (e.g., computer equipment) is operating inside a closed room and there is no means of regulating the room temperature or removing the heat, the room temperature will increase steadily until it exceeds the maximum operating temperature of the equipment, causing it to ultimately fail. In order to compensate for the heat gained by the room as a result of the heat produced by the equipment, we need to introduce a source of cooling. However, the exact quantity of cooling that needs to be supplied is still in question. If we provide less cooling than the quantity of heat being produced, the room temperature will still steadily increase over time, just at a slower rate than it would without cooling. Therefore, to establish an equilibrium or steady-state temperature within the room, the amount of cooling we provide should exactly equal the amount of heat being produced. The actual design temperature (and other attributes, such as humidity) of the room must be within the equipment manufacturer's specified operating range. If the equipment requirements are unknown, or if you have multiple equipment types with different operating requirements, you can use the environmental classifications and specifications listed by ASHRAE in their publication, Thermal Guidelines for Data Processing Environments.
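The heat balance described above can be sketched numerically. The example below uses invented numbers and deliberately ignores walls, equipment thermal mass, and infiltration, so it overstates how fast a real room heats up; it simply illustrates that any unmet load drives the temperature upward.

```python
# Minimal sketch of the datacom-room heat balance: in steady state,
# cooling supplied must equal heat produced, or temperature drifts.
def net_heat_kw(it_load_kw, cooling_kw):
    """Positive: room heats up; zero: steady state."""
    return it_load_kw - cooling_kw

def temp_rise_rate_c_per_hr(net_kw, room_air_kg, cp_kj_per_kg_c=1.005):
    """Approximate air-temperature rise rate for an unmet load,
    considering only the heat capacity of the room air."""
    return net_kw * 3600.0 / (room_air_kg * cp_kj_per_kg_c)

# 100 kW of IT load met by only 90 kW of cooling, in roughly
# 1000 m^3 of air (about 1200 kg):
rate = temp_rise_rate_c_per_hr(net_heat_kw(100.0, 90.0), room_air_kg=1200.0)
```

Even a 10% shortfall produces a rapid climb, which is why cooling capacity must match the heat load rather than merely approximate it.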
11.4 COOLING SYSTEMS
11.4.1 Air Side The most common method of cooling a data center is to condition and recirculate the air within the datacom room. The type of system and equipment used varies greatly depending on the size and design of the datacom room (e.g., is there a raised access floor or return-air ceiling?). The various types of equipment that can be used to condition the air might include an air handling unit (AHU), a packaged rooftop unit (RTU), a fan-coil unit (FCU), or a packaged computer-room air-conditioning (CRAC) unit. These types of equipment all work in a similar manner, with the key components being a fan to move the air and a coil that extracts heat from the airstream. The fan section of these units induces air movement over the coil. The coil is made up of a serpentine configuration of finned tubes that carry a fluid that can absorb heat. Heat is exchanged by passing the warm room air over the coil containing cool fluid, resulting in a drop in air temperature and an increase in fluid temperature. The cool air is distributed back to the room being cooled and the warm fluid is transported away from the coil and outside the room. Figure 11.1 shows this heat transfer in a simple graphical format.
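The coil heat exchange just described can be quantified with the familiar sensible-heat relation Q = ṁ × cp × ΔT. This relation is standard HVAC practice rather than something stated explicitly in the text, and the airflow and temperature figures below are illustrative.

```python
# Back-of-the-envelope airside heat transfer across a cooling coil.
def sensible_cooling_kw(airflow_m3_s, delta_t_c,
                        air_density=1.2, cp_kj=1.005):
    """Sensible heat removed from an airstream (kW), using air density
    ~1.2 kg/m^3 and specific heat ~1.005 kJ/(kg*C) at room conditions."""
    return airflow_m3_s * air_density * cp_kj * delta_t_c

# A unit moving 5 m^3/s of air with a 10 C drop across the coil:
q = sensible_cooling_kw(5.0, 10.0)
```

The same relation works in reverse for sizing: for a given heat load, a designer trades off airflow against the temperature difference across the coil.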
11.4.2 Cooling-Medium Side
The two products resulting from the heat transfer depicted in Figure 11.1 are cold air and warm fluid that leaves the coil. Some examples of commonly used fluids that are circulated through the coil include refrigerant, water, or a glycol-water mixture. When a refrigerant is used, the coil serves as the evaporator. These kinds of systems are typically called direct expansion or DX systems. With these types of systems, the refrigerant compressor and condenser can be located inside or outside the datacom room. When water or a glycol-water mixture is used, the systems are typically called chilled-water systems. With these systems, the component that cools the chilled water is referred to as a chiller, which is typically located outside the datacom room. The warm fluid is transported via piping to a piece of equipment that accomplishes the second transfer of heat to either another fluid or air. In the case of air, the piece of cooling equipment essentially performs the reverse of the airside cooling that was described in the first heat transfer, except with warm fluid and cooler air entering the equipment, and cool fluid and warmer air leaving the equipment.
Figure 11.1. Airside cooling system heat transfer for datacom rooms. (Courtesy of DLB Associates.)

The cool fluid goes back to the airside equipment described earlier, and the warmer air is released to the atmosphere. Therefore, the type of cooling equipment that performs this sort of heat transfer is sometimes referred to as air-cooled, since the fluid is cooled by air, and may also be classified as heat-rejection equipment, since the heat is rejected to the atmosphere. Given its operation, air-cooled heat-rejection equipment is located outside the building envelope, either on the roof or at ground level. The kinds of equipment that fall into this category include air-cooled chillers, air-cooled condensing units, and dry coolers. Figure 11.2 is a graphical representation of how heat transfer occurs with this type of equipment.

The second type of cooling-medium-side equipment falls under the category of water-cooled. There are two facets to water-cooled equipment. The first involves a piece of equipment that transfers heat from one fluid stream to another, such as the water-cooled chiller. The warm fluid from the air-handling equipment shown in Figure 11.1 enters the water-cooled chiller, where it is mechanically cooled and returned, forming a closed loop. The heat from the entering warm fluid is transported away from the chiller using a secondary fluid loop. The fluid between the airside cooling equipment and the chiller is typically referred to as the chilled water. The fluid that is used to transport the heat away from the chiller is referred to as condenser water. The condenser water leaves the chiller warmer than it enters, and is returned to the chiller after being cooled by water-cooled heat-rejection equipment, such as a cooling tower, which will be discussed in greater detail later.
Figure 11.3 shows the operation of a water-cooled chiller, and Figure 11.4 continues the condenser-water loop by showing the heat rejection that occurs at a cooling tower.
Figure 11.2. Heat transfer for air-cooled heat-rejection equipment. (Courtesy of DLB Associates.)
11.5 COMPONENTS OUTSIDE THE DATACOM ROOM

11.5.1 Refrigeration Equipment—Chillers
In the building cooling industry, a chiller is the term used to describe mechanical refrigeration equipment that produces chilled water as a cooling output or end product (Figure 11.5). The term chiller is not normally used in the building cooling industry to describe a unit that produces chilled air (e.g., a CRAC unit); in fact, chilled air is not a common term in the building cooling industry, where the normal term is conditioned air or supply air. It is extremely rare for the chiller itself to be located in the same room as the computer equipment, for a variety of reasons including room cleanliness, maintenance, spatial requirements, and acoustics. Depending on the type of chiller, it is typically located in a dedicated mechanical room or outside the building envelope, at ground level or on a roof.

The chilled water produced by a chiller for building cooling is commonly supplied at 42 to 45°F, but in some applications the temperatures can be higher or lower than this range. The chilled fluid can be 100% water or a mixture of water and glycol (to prevent the water from freezing if piping is run in an unconditioned or exterior area) plus additives such as corrosion inhibitors. The cooling capacity of the chiller is derated when glycol or corrosion-inhibiting additives are included.

In a typical chilled-water configuration, the chilled water is piped in a loop between the chiller and the equipment (e.g., a CRAC unit) that supplies conditioned air to the electronic equipment room or other conditioned space. The heat that the chilled water absorbs from the hot air in the electronic equipment room is transferred from the chilled-water loop to the condenser loop by the chiller. The method by which the chilled-water heat rejection is accomplished represents the first delineation of how chillers are classified: chillers are described as being either air-cooled or water-cooled.

Figure 11.3. Heat transfer for a water-cooled chiller. (Courtesy of DLB Associates.)

Figure 11.4. Heat transfer for a cooling tower. (Courtesy of DLB Associates.)

Figure 11.5. Generic chiller diagram. (Courtesy of DLB Associates.)

Air-Cooled Chillers. For air-cooled chillers, the chilled-water loop rejects heat to a refrigerant loop, which in turn rejects the heat to the atmosphere using fan-driven outdoor air (Figure 11.6). Air-cooled chillers are often described as packaged equipment, meaning that the heat-rejection component is an integral part of the chiller package, which visually looks like a single assembly (Figure 11.7). The majority of air-cooled chillers are located outside of the building envelope to facilitate heat rejection to the atmosphere. However, due to spatial or other constraints, the air-cooled chiller components can be split, with the heat-rejection component (typically an air-cooled condenser) located remotely from the chiller.

Water-Cooled Chillers. Water-cooled chillers employ a second liquid loop to transport heat removed from the chilled water to the heat-rejection equipment. The transfer of heat from one loop to the other is accomplished using a refrigeration cycle within the chiller. The chilled-water medium described above leaves the chiller colder than it arrived. The other liquid loop, called the condenser-water loop, leaves the chiller warmer than it arrived (Figure 11.8). A separate piece of heat-rejection equipment, typically a cooling tower, is used to cool the chiller's condenser-water loop. The heat-rejection equipment is typically located outside of the building envelope, whereas the water-cooled chiller itself (Figure 11.9) is located inside the building, with the condenser-water loop connecting the two. How the condenser water is cooled can vary greatly, depending on geographic region (climate), availability of water, quality of water, noise ordinances, environmental ordinances, and so on.
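The two-loop arrangement just described obeys a simple energy balance: the condenser-water loop must carry away both the heat absorbed from the chilled-water loop (the room load) and the compressor work added by the refrigeration cycle. A minimal sketch, where the COP value is an assumption for illustration rather than a figure from the text:

```python
# Energy-balance sketch for the water-cooled chiller loops described above.
# The condenser-water loop must reject both the heat absorbed from the
# chilled-water loop and the compressor work added by the refrigeration
# cycle: Q_condenser = Q_evaporator + W_compressor.
def condenser_heat_kw(cooling_load_kw: float, cop: float) -> float:
    """Heat rejected at the cooling tower for a given load and chiller COP."""
    compressor_kw = cooling_load_kw / cop  # electrical work input
    return cooling_load_kw + compressor_kw

# 1000 kW of datacom cooling load on a chiller with an assumed COP of 5:
print(condenser_heat_kw(1000, 5.0))  # 1200.0 kW rejected at the tower
```

This is why condenser-water flow and cooling-tower capacity are always sized larger than the chilled-water load they serve.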
Water-cooled chillers are typically more efficient than air-cooled chillers. Based on electrical consumption per quantity of cooling produced, and depending on the configuration used, water-cooled chillers can be as much as 25% to 75% more efficient than their air-cooled equivalents.

Figure 11.6. Air-cooled chiller diagram. (Courtesy of DLB Associates.)

Figure 11.7. Typical packaged air-cooled chiller. (Courtesy of Trane.)

Figure 11.8. Water-cooled chiller diagram. (Courtesy of DLB Associates.)

Figure 11.9. Typical water-cooled chiller. (Courtesy of Trane.)

Chiller Classification and Typical Sizes. Another way to classify chillers is based on the type of compressor they utilize. Compressor types include reciprocating, screw, scroll, and centrifugal. The type of compressor selected is commonly based on the cooling capacity needed (e.g., centrifugal compressors are a common choice for large capacities, whereas reciprocating chillers are a common choice for small capacities). Another parameter influencing chiller selection is efficiency and operating cost at part-load versus full-load conditions. In the building cooling industry, common commercial-size chillers range from 20 tons to 2000 tons (or 60 to 6000 kW of electronic equipment power cooling capacity), with chiller sizes outside of this general range available but not as common. Air-cooled chillers are typically only available up to 400 tons (1200 kW). Chiller configurations also often utilize additional units for redundancy and/or maintenance strategies; this can play an integral part in determining the quantity and capacities of chiller units for a given facility.
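The ton and kilowatt figures quoted above can be cross-checked with the standard conversion (1 refrigeration ton = 12,000 BTU/h ≈ 3.517 kW of heat removal); the chapter's rounded 3-kW-per-ton pairing reflects the electronic-equipment load served rather than the raw thermodynamic capacity. A small sketch:

```python
# Rough capacity conversions for chiller sizing (illustrative sketch).
# Assumption: 1 refrigeration ton = 12,000 BTU/h = 3.517 kW of heat removal.
KW_PER_TON = 3.517  # thermodynamic conversion factor

def tons_to_kw(tons: float) -> float:
    """Convert refrigeration tons to kW of heat-removal capacity."""
    return tons * KW_PER_TON

def kw_to_tons(kw: float) -> float:
    """Convert a heat load in kW to the refrigeration tons needed."""
    return kw / KW_PER_TON

# A 400-ton air-cooled chiller (the typical upper limit cited above):
print(f"{tons_to_kw(400):.0f} kW of heat removal")  # ~1407 kW
# A 500 kW heat load:
print(f"{kw_to_tons(500):.0f} tons minimum")        # ~142 tons
```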
11.5.2 Heat-Rejection Equipment
Cooling Towers

INTRODUCTION. A cooling tower is a heat-rejection device that rejects waste heat to the atmosphere by cooling a water stream to a lower temperature. The type of cooling that occurs in a cooling tower is termed evaporative cooling. Evaporative cooling relies on the principle that flowing water, when in contact with moving air, will evaporate. In a cooling tower, only a small portion of the water that flows through the tower evaporates. The heat required for the evaporation of this water is given up by the main body of water as it flows through the tower; in other words, the vaporization of this small portion of water provides the cooling for the remaining, unevaporated water. The heat transferred from the water stream to the airstream raises the air's temperature and its relative humidity to 100%, and this air is discharged to the atmosphere. Figure 11.10 provides a simple schematic overview of a generic cooling tower flow diagram.
Figure 11.10. Schematic overview of a generic cooling tower flow. (Courtesy of DLB Associates.)
Evaporative heat-rejection devices such as cooling towers are commonly used to provide significantly lower water temperatures than are achievable with air-cooled or dry heat-rejection devices. Air-cooled or dry heat-transfer processes rely strictly on the sensible transfer of heat from the fluid to the cooling air across some type of system barrier and do not depend on the evaporation of water for cooling. Generally, in air-cooled or dry heat-transfer processes, the fluid is separated from the cooling air by a system barrier and is not in direct contact with the cooling air; a radiator in a car is a good example of this type of system and heat-transfer process. Sensible cooling is limited by the temperature of the air flowing across the system barrier: the fluid can never be cooled below the temperature of that air. Evaporative cooling, on the other hand, can cool the fluid to a temperature less than that of the air flowing through the cooling tower, thereby achieving more cost-effective and energy-efficient operation of systems in need of cooling.

Common applications for cooling towers are to provide cooled water for air conditioning, manufacturing, and electric power generation. The smallest cooling towers are designed to handle water streams of only a few gallons per minute, supplied in small (e.g., 2-inch) pipes, whereas the largest cool hundreds of thousands of gallons per minute, supplied in pipes as much as 15 feet (about 5 m) in diameter in a large power plant. Cooling towers come in a variety of shapes, sizes, configurations, and cooling capacities. For a given capacity, it is common to install multiple smaller cooling tower modules, referred to as cells, rather than one single large cooling tower. Since cooling towers require an ambient airflow path in and out, they are located outside, typically on a roof or elevated platform.
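The different temperature limits of dry (sensible) and evaporative heat rejection can be sketched as follows; the 7°F approach and the example temperatures are illustrative assumptions, not equipment ratings:

```python
# Illustrative comparison of the cooling limits described above (a sketch,
# not a selection calculation). A dry, sensible-only device can never cool
# the fluid below the ambient dry-bulb temperature; an evaporative cooling
# tower is limited instead by the ambient wet-bulb temperature, which is
# lower whenever the air is not saturated. "Approach" is the practical
# margin above the theoretical limit that real equipment achieves.
def dry_cooler_min_fluid_temp(dry_bulb_f: float, approach_f: float = 7.0) -> float:
    return dry_bulb_f + approach_f

def cooling_tower_min_water_temp(wet_bulb_f: float, approach_f: float = 7.0) -> float:
    return wet_bulb_f + approach_f

# A 90°F day with a 74°F wet bulb:
print(dry_cooler_min_fluid_temp(90))     # 97.0 -> sensible (dry) limit
print(cooling_tower_min_water_temp(74))  # 81.0 -> evaporative limit
```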
The elevation of the cooling tower relative to the remainder of the cooling system needs to be considered when designing
the plant, because cooling tower operation and connectivity rely on the flow of fluid by gravity.

CLASSIFICATION. The generic term "cooling tower" is used to describe both open-circuit (direct-contact) and closed-circuit (indirect-contact) heat-rejection equipment. Although most people think of a cooling tower as an open, direct-contact heat-rejection device, the indirect cooling tower, sometimes referred to as a closed-circuit fluid cooler, is nonetheless also a cooling tower.

A direct or open-circuit cooling tower (Figure 11.11) is an enclosed structure with internal means to distribute the warm water fed to it over a labyrinth of packing or fill. The fill may consist of multiple, mainly vertical, wetted surfaces upon which a thin film of water spreads (film fill), or several levels of horizontal splash elements that create a cascade of many small droplets (splash fill). The purpose of the fill is to provide a vastly expanded air-water interface for the evaporation process to take place. The water is cooled as it descends through the fill by gravity while in direct contact with the air that passes over it. The cooled water is then collected in a cold-water basin or sump located below the fill, from which it is pumped back through the process equipment to absorb more heat. The moisture-laden air leaving the fill is discharged to the atmosphere at a point remote enough from the air inlets to prevent it from being drawn back into the cooling tower (Figure 11.12).

An indirect or closed-circuit cooling tower (Figure 11.13) involves no direct contact between the air and the fluid being cooled. A type of heat exchanger defines the boundaries of the closed circuit and separates the cooling air from the process fluid. The process fluid being cooled is contained in a closed circuit and is not directly exposed to the atmosphere or the recirculated external water. The process fluid that flows through the closed circuit can be water, a water-glycol mixture, or a refrigerant. Unlike the open-circuit cooling tower, the indirect cooling tower has two separate fluids flowing through it: the process fluid that flows through the heat exchanger and is being cooled, and the cooling water that evaporates as it flows over the other side of the heat exchanger. Again, system cooling is provided by a portion of the cooling water being evaporated as air flows over the entire volume of cooling water.

Figure 11.11. Direct cooling towers on an elevated platform. (Photo courtesy of DLB Associates.)

Figure 11.12. Direct-contact, open-circuit cooling tower schematic flow diagram. (Courtesy of DLB Associates.)
Figure 11.13. Indirect, closed-circuit cooling tower schematic flow diagram. (Courtesy of DLB Associates.)
In addition to direct and indirect cooling, another method to characterize cooling towers is by the direction of the ambient air drawn through them. In a counterflow cooling tower, air travels upward through the fill or tube bundles, directly opposite to the downward motion of the water. In a cross-flow cooling tower, air moves horizontally through the fill as the water moves downward. Cooling towers are also characterized by the means by which air is moved. Mechanical-draft cooling towers rely on power-driven fans to draw or force the air through the tower. Natural-draft cooling towers use the buoyancy of the exhaust air rising in a tall chimney to provide the airflow. A fan-assisted, natural-draft cooling tower employs mechanical draft to augment the buoyancy effect. Many early cooling towers relied only on prevailing wind to generate the draft of air.

MAKEUP WATER. For the recirculated water that is used by the cooling tower, some water must be added to replace, or make up, the portion of the flow that evaporates. Because only pure water will evaporate as a result of the evaporative cooling process, the concentration of dissolved minerals and other solids in the remaining circulating water that does not evaporate will tend to increase. This buildup of impurities in the cooling water can be controlled by dilution (blow-down) or filtration. Some water is also lost by droplets being carried out with the exhaust air (a phenomenon known as drift or carry-over), but this is typically reduced to a very small amount in modern cooling towers by the installation of baffle-like devices, appropriately called drift eliminators, to collect and consolidate the droplets. To maintain a steady water level in the cooling tower sump, the amount of makeup water required by the system must equal the total lost to evaporation, blow-down, drift, and other water losses, such as wind blowout and leakage.
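The makeup-water balance described above can be sketched numerically. The evaporation and drift rules of thumb below are illustrative assumptions, not manufacturer data; the blow-down relation B = E/(C - 1), where C is the cycles of concentration, is the standard dilution balance:

```python
# Sketch of the makeup-water balance described above. Makeup must replace
# evaporation, blow-down (intentional dilution to control dissolved solids),
# drift, and miscellaneous losses so the sump level stays constant.
# Rules of thumb only (assumptions): evaporation of roughly 1% of
# recirculated flow per 10°F of cooling range, and drift on the order of
# 0.005% of flow with modern drift eliminators.
def makeup_water_gpm(recirc_gpm, range_f, cycles_of_concentration,
                     drift_fraction=0.00005):
    evaporation = recirc_gpm * 0.01 * (range_f / 10.0)
    blow_down = evaporation / (cycles_of_concentration - 1)  # B = E/(C-1)
    drift = recirc_gpm * drift_fraction
    return evaporation + blow_down + drift

# 1000 gpm tower cooling water through a 10°F range at 4 cycles:
print(round(makeup_water_gpm(1000, 10, 4), 2))  # 13.38 gpm
```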
Typically, the selection of cooling towers is an iterative process, since there are a number of variables, resulting in opportunities for trade-offs and optimization. Some of those variables include size and clearance constraints, climate, required operating conditions, acoustics, drift, water usage, energy usage, and part-load performance.

CHILLED WATER AND ICE STORAGE. For data centers or other mission critical facilities, chilled water or ice storage (Figure 11.14) is a strategy for providing backup cooling capability when there is a disruption in chiller operation. For large, high-density data centers where it is not practical to put the mechanical equipment on a UPS, some backup cooling capacity can be provided with chilled water or ice storage that will carry the cooling load until the backup generators can be started and the mechanical plant can be brought back online.
Condensers and Dry Coolers. Heat-rejection equipment is the term commonly used by the building cooling industry for the equipment that ultimately discharges the heat removed from the conditioned space to the atmosphere. Cooling towers, also called open or wet towers, are part of a water-cooled system. They are evaporative cooling devices that use the cooling effect produced by the phase change of water from liquid to vapor. This equipment is predominantly a wet-bulb-centric device.
Figure 11.14. Left: typical ice storage system. Right: typical chilled water storage system.

The heat captured by chillers can be rejected to either air or water media. Water-cooled chillers utilize wet-bulb-centric heat-rejection devices (i.e., cooling towers). Air-cooled chillers, on the other hand, utilize a different type of heat-rejection equipment: they have either an integral or remotely located air-cooled condenser section. This section contains condenser fans that induce airflow over a coil containing refrigerant. As the refrigerant inside this coil cools and changes phase from a gas to a liquid, it releases heat that is transferred to the ambient air passing over the condenser coil. Air-cooled condensers are basically dry-bulb-centric devices that use the movement of ambient air passing through a coil to transfer heat to the atmosphere while condensing the refrigerant vapor in the coil to a liquid state (see Figure 11.15).
Figure 11.15. Schematic overview of a generic air-cooled condenser. (Courtesy of DLB Associates.)
TYPES OF DRY-BULB-CENTRIC HEAT-REJECTION DEVICES. There are a number of different types of dry-bulb-centric heat-rejection devices. Like the wet-bulb-centric devices (i.e., cooling towers), these devices are all located outdoors, since they all utilize ambient air as the means of heat rejection. Depending on the specific type of heat-rejection device and the application, their location varies from on grade to on a roof. Other typical design considerations include:
• Noise ordinances and acoustics
• Clearances for both air inlets and discharges, to provide adequate airflow to the coil
• A configuration that avoids short-circuiting of the hot discharge air into the air inlets

Too frequently, proper clearances are not maintained. This is especially true when multiple units are installed in the same area. It is also important to consider the source and impact of flying debris, dirt, vegetation, and so on.

The three main types of dry-bulb-centric heat-rejection devices discussed here are:

1. Air-cooled condensers
2. Self-contained condensing units
3. Dry coolers

AIR-COOLED CONDENSERS. The air-cooled condenser section of an air-cooled chiller is, in some cases, a completely separate unit that can be located remotely from the compressor section of the chiller or other compressorized piece of equipment, depending on the product and site-specific situations. This stand-alone piece of equipment is simply called an air-cooled condenser. The two main components of an air-cooled condenser are a fan, to induce air movement through the unit, and a coil (typically finned), to provide the heat-transfer surface area required for the refrigerant to condense to a liquid state. Together, the fan and coil provide the means for the refrigeration cycle to reject its heat to the air. Air-cooled condensers come in various shapes, sizes, and orientations. Figures 11.16 and 11.17 show examples of air-cooled condensers with different air inlet and outlet (discharge) configurations. Figure 11.16 shows an air-cooled condenser that is typically paired with a small wall- or ceiling-mounted air-handling unit used in many smaller computer server rooms in office applications (i.e., one or two racks). Figure 11.17 shows an air-cooled condenser more typically used in larger server rooms in conjunction with air-cooled CRAC units.
The multiple condenser fans and larger footprint give it a greater heat-rejection capacity than the air-cooled condenser shown in Figure 11.16.

SELF-CONTAINED CONDENSING UNITS. In some cases, the compressor and air-cooled condenser sections are combined into a single piece of equipment called a self-contained condensing unit, or simply a condensing unit. A condensing unit is the type of equipment most commonly used in residential applications; it looks like a cube with a round propeller fan mounted horizontally on top, discharging air upward (Figure 11.18). It is also used in larger systems where the noise generated by the compressors must be kept outside the occupied space.

DRY COOLERS. The air-cooled condenser shown in Figure 11.17 looks very similar to another type of dry-bulb-centric heat-rejection device, the dry cooler. Whereas an air-cooled condenser cools a refrigerant, a dry cooler uses water as its cooling medium (typically with glycol added to the water to prevent freezing). Dry coolers are used in conjunction with glycol-cooled CRAC units. One of the biggest advantages of using dry coolers is that the mechanical cooling provided by the compressors can be eliminated during periods of low ambient temperature, providing energy savings to the owner. This is accomplished by installing a supplementary cooling coil inside the CRAC unit. This economizer coil can provide either all or a portion of the cooling required by the space.

Figure 11.16. Side inlet and side outlet, air-cooled condenser. (Courtesy of Sanyo.)

Figure 11.17. Bottom inlet and top outlet, air-cooled condenser. (Courtesy of Liebert.)

Figure 11.18. Typical self-contained condensing unit. (Courtesy of Carrier.)

PERFORMANCE. As is typical of all coil selections, a heat-rejection device's capacity is affected by coil-specific parameters such as:
• Coil or tube size
• Number of rows of coils
• Fin density (number of fins per inch)
• Materials of construction
• Thickness and shape of the tubes and fins
In the building cooling industry, there is a standardized way to rate the capacity of a heat-rejection device. The Air-Conditioning and Refrigeration Institute (ARI) sets fixed dry- and wet-bulb conditions to which all manufacturers test, and their published capacity data is based on testing performed under these conditions. This allows for easier comparison of different products. The performance of all outdoor heat-rejection devices depends largely on climate; therefore, the maximum output of a given device will differ depending on the climate of the region in which it is situated. Furthermore, its energy usage will vary along with the weather. Depending on the requirements, a heat-rejection device can have one or more fans, and the controls can vary from a single control per condenser to staged control of a multifan, multisection condenser. Variable-speed fan control is also possible.

Geothermal Heat-Rejection Systems. In a geothermal or earth-coupled system, coils are installed horizontally or vertically in the ground to take advantage of the
earth's relatively constant underground temperature. Since its heat rejection does not utilize ambient air, it is not considered to be either dry-bulb-centric or wet-bulb-centric. An earth-coupled heat-rejection system has the advantage that it does not require fans to reject heat. Not having a fan is an energy saver that also eliminates a source of noise. A disadvantage of an earth-coupled system is that it typically has a much higher installation cost. In some cases, utility rebates are available to make the TCO (total cost of ownership) more attractive. Its main advantage, however, is that it is generally a very efficient system.
11.5.3 Energy-Recovery Equipment
Economizers

INTRODUCTION. The term economizer can describe a piece of equipment (or a component of a larger piece of equipment), or it can refer to an energy-saving process or cycle. Building cooling systems use energy provided by the utility company (gas or electric) to drive mechanical components that reject heat from a transport medium to produce a cooling effect. If the heat transferred by this equipment is ultimately rejected to the atmosphere, then the performance of the equipment depends on the ambient weather conditions. Weather conditions, or climate, are also the main driver in determining the economic feasibility of using an economizer cycle, or free cooling. An economizer basically takes advantage of favorable outdoor conditions to reduce or eliminate the need for mechanical refrigeration to accomplish cooling.

DESIGN WEATHER CONDITIONS. The design parameters used to select and size heat-rejection equipment are typically based on ambient weather conditions that occur with some expected frequency in the project location. Depending on the form of heat rejection utilized, either the dry-bulb temperature or a value that relates to the amount of moisture in the air (wet bulb or dew point) will be used. Sometimes a mean coincident value will be combined with either a dry-bulb or wet-bulb temperature to more accurately state the maximum actual condition (or total energy of the ambient air) that will most likely be experienced. A common source for this information is the ASHRAE Fundamentals Handbook, which lists weather data for over 750 locations in the United States (and another 3600 locations worldwide). The weather data is listed in columns headed 0.4%, 1%, and 2%. The values provided under these percentages represent the amount of time in one calendar year that the conditions listed for a particular location will be exceeded.
For example, in Boston, Massachusetts, the dry-bulb temperature stated in the 0.4% dry-bulb/mean coincident wet-bulb bin is 91°F. This means that the temperature in Boston will exceed 91°F for only 0.4% of the year (about 35 hours). The 2% value (about 175 hours) under the same heading for Boston is 84°F.

Although cooling equipment is typically selected based on the data found in one of these categories (the design conditions for the project), the vast majority of the time the actual ambient conditions are below these design conditions. (The specific data used to select the equipment depends on the particular emphasis of the application and the client's preferences.) Generally, the design conditions are used to select the maximum capacity of the cooling equipment. As a result, the full capacity of the cooling equipment is not needed for the majority of the year.

ECONOMIZER PROCESS. Fundamentally, the economizer process involves utilizing favorable ambient weather conditions to reduce the energy consumed by the building cooling systems. Most often, this is accomplished by limiting the amount of energy used by the mechanical cooling (refrigeration) equipment. Since the use of an economizer mode (or sequence) reduces energy consumption while maintaining the design conditions inside the space, another term used in the building cooling industry for economizer mode operation is free cooling.

ASHRAE Standard 90.1 (Energy Code for Commercial and High-Rise Residential Buildings) is a standard, or model code, that describes the minimum energy efficiency standards to be used for all new commercial buildings and covers aspects such as the building envelope and cooling equipment. This standard is often adopted by an authority having jurisdiction (AHJ) as the code for a particular locale; sometimes it is adopted in its entirety, while in other instances it is adopted with specific revisions to more closely meet the needs of the AHJ. Within the standard, tables state when an economizer must be incorporated into the design of a cooling system, depending on design weather conditions and cooling capacity. In simplified terms, ASHRAE 90.1 mandates that unless your building is in an extremely humid environment or has a cooling capacity equivalent to that of a single-family home, an economizer is required as a part of any cooling system for your building.
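The percentage headings used for design weather data (and for judging how many hours an economizer can run) translate directly into annual hours, given 8760 hours in a calendar year. A quick sketch:

```python
# Quick check of the design-weather percentages discussed above: the 0.4%,
# 1%, and 2% column headings in the ASHRAE tables are fractions of the
# 8760 hours in a calendar year during which the listed condition is exceeded.
HOURS_PER_YEAR = 8760

def exceedance_hours(percent: float) -> float:
    """Hours per year the tabulated temperature is expected to be exceeded."""
    return HOURS_PER_YEAR * percent / 100.0

for pct in (0.4, 1.0, 2.0):
    print(f"{pct}% -> {exceedance_hours(pct):.0f} h/yr")
# 0.4% -> 35 h/yr (Boston exceeds its 91°F design dry bulb ~35 h/yr)
# 1.0% -> 88 h/yr
# 2.0% -> 175 h/yr
```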
Economizers can utilize favorable ambient air directly to accomplish the cooling needs of a building or process; this is known as an airside economizer. If a water system is used to indirectly transfer heat from the building or process to the atmosphere, as in a condenser-water/cooling-tower system, this is known as a waterside economizer.

AIRSIDE ECONOMIZER. In order to simplify the following discussion, the moisture content of the air streams will be considered to be of no consequence; this is not, however, a practical approach when building systems are actually designed. During normal cooling operation on a hot summer day, in order to maintain a comfortable temperature inside the building in the mid-70s (°F), a cooling system for an office space might typically supply air at around 55°F. The temperature of the air returned (recirculated) to the cooling system from the space approximates the temperature maintained inside the space before any ventilation (outdoor) air is introduced into the air stream. It therefore requires less energy for the cooling system to cool and recirculate the air inside the building than it would to bring in 100% outdoor air (i.e., cooling the air from 75°F down to 55°F uses less energy than going from, say, 90°F to 55°F). However, building codes mandate that a minimum quantity of ventilation (outdoor) air be introduced into a building to maintain some level of indoor air quality. This minimum quantity of ventilation air is often based on the expected occupancy load (i.e., the number of people per 1000 square feet) of a space multiplied by a prescribed quantity (in CFM) of outdoor air per person. This formula can yield a very high percentage of outdoor air (as high as 75%) in certain densely populated situations. The percentage of outdoor air is the amount of outdoor air (in CFM) divided by the amount of supply air (in CFM). For purposes of this discussion, let us assume that the outdoor air percentage is 25% of the total amount of air supplied by the cooling system. This minimum quantity of outdoor air is mixed with the recirculated (return) air, producing a mixed-air temperature closer to 75°F than to 90°F before the mixture passes through the cooling coil in an air handling unit. This scenario is graphically depicted in Figure 11.19.

On days in the spring or fall, the outdoor air temperature may drop below 75°F (the temperature of the return air). On those days, it takes less energy to use 100% outdoor air in the cooling air stream, for the same reasons highlighted above. Figure 11.20 shows the economizer mode operation of an air handling unit. In an airside economizer setup, the air handling equipment can vary the amount of outdoor air introduced into the building through the use of a mixing box and dampers that are configured as part of the air handling equipment itself. Outdoor air temperature sensors and motorized controllers are used to optimize the energy consumption for any weather condition. Keep in mind that, in addition to temperature, the humidity and contaminants contained in the outdoor air are also a concern and should influence the operation of an airside economizer.

For datacom (data processing and communications) facilities, the use of airside economizers has been a controversial topic. Some reasons behind the controversy are included in Table 11.1.

WATERSIDE ECONOMIZER.
A more common economizer application in datacom facilities is the use of a waterside economizer. In order to understand the operation of a waterside economizer, we should recap some of the information presented earlier in this chapter. Previously, we introduced two liquid-media loops that are used to transport heat:
Figure 11.19. Hot summer day cooling operation. (Courtesy of DLB Associates.)
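The hot-summer-day mixing shown in Figure 11.19 is just a flow-weighted average of the outdoor-air and return-air temperatures (moisture ignored, as in the text). A sketch using the 25% outdoor-air assumption from the discussion:

```python
# Mixed-air temperature sketch for the airside-economizer discussion above.
# The mixed-air temperature is the flow-weighted average of the outdoor-air
# and return-air temperatures (humidity effects ignored, as in the text).
def mixed_air_temp_f(oa_fraction: float, oa_temp_f: float, ra_temp_f: float) -> float:
    """oa_fraction: outdoor air CFM / supply air CFM (0..1)."""
    return oa_fraction * oa_temp_f + (1.0 - oa_fraction) * ra_temp_f

# Hot summer day: 25% outdoor air at 90°F mixed with 75°F return air.
print(mixed_air_temp_f(0.25, 90, 75))  # 78.75 -> closer to 75°F than 90°F
# Mild spring day: 100% outdoor air at 60°F needs far less mechanical cooling.
print(mixed_air_temp_f(1.0, 60, 75))   # 60.0
```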
Figure 11.20. Spring/fall day cooling operation—economizer mode. (Courtesy of DLB Associates.)
Table 11.1. Issues surrounding the use of airside economizers

Issue: Number of hours airside economizer can be used
Description: Lack of consensus on this number; geography based.

Issue: Effects on interior environment
Description: Facilities must maintain strict tolerances on environmental conditions. Interior limits must be within the service level agreement (SLA). This limits the number of hours suitable for economizer operation.

Issue: Energy savings that can be realized
Description: Areas with a greater quantity of contaminants in the outdoor air require filtering processes. Filtering processes require increased fan power usage, so savings from the elimination of mechanical cooling must be offset against energy increases from large fan operation. Coastal areas have salt in the air, which is very costly to remove. The frequency and cost of filter replacement also affect overall energy savings.

Issue: Humidity control
Description: Facilities have accepted ranges of relative humidity within which equipment should be operated. The cost of removing or adding moisture involves substantial operating expenditures.

Issue: First cost concerns
Description: Requires a greater first cost of equipment, and more sophisticated controls must be installed; facility owners are reluctant to take on these costs. Payback studies, however, have demonstrated a lower cost of ownership.

Issue: Effectiveness of airside economizer
Description: Dependent on several factors, from geographical climate to owner preferences. Each facility is unique and must be individually evaluated to determine whether an airside economizer would be valuable.
• Chilled- or Process-Water Loop. Transports heat from the cooling coil within the air handling equipment (e.g., CRAC unit) to the mechanical cooling equipment (e.g., chiller).
• Condenser-Water (or Glycol) Loop. Transports heat from the mechanical cooling equipment (e.g., chiller) to the heat-rejection equipment (e.g., cooling tower, dry cooler, etc.).

The mechanical cooling process involves an input of energy to realize a transfer of heat between these two liquid loops. The net result is that the temperature of the condenser water increases and the temperature of the chilled or process water decreases as each passes through the mechanical cooling equipment. In some air handling or CRAC unit configurations, it is possible for the mechanical cooling process to be internal to that piece of equipment (also called self-contained). In that configuration, the condenser-water loop runs from the air handling or CRAC unit directly to the exterior heat-rejection equipment (typically a dry cooler). Again, the dry cooler is typically selected and sized for the hot-summer-day design conditions; there is therefore an opportunity to use more favorable ambient weather conditions to save energy, this time through the use of a waterside economizer.

Similar to the process described for airside economizers, a lower ambient air temperature can result in lower temperature condenser water being produced, and the temperatures may be such that the energy required for the mechanical cooling of the process-water loop can be reduced or, in some cases, eliminated entirely.

In order to utilize the condenser water as part of a waterside economizer process, the necessary components need to be included as part of the mechanical design. If the cooling equipment utilizes a refrigerant (as opposed to water) to transfer heat from the cooling coil to the heat-rejection equipment (dry cooler or cooling tower), then a second cooling coil will be required in the air handling equipment.
When the ambient conditions are favorable and cold condenser water can be produced by the heat-rejection equipment, the condenser water can be pumped through this coil, providing all or part of the required space cooling. If water, instead of refrigerant, is used for the cooling medium, then the designer has two options available. If the condenser-water system has been provided with above-average water treatment, then the cold condenser water can be pumped directly through the chilled/process-water piping system. The second option would be for the designer to include a heat exchanger in the system design to allow for the transfer of heat between the systems while keeping the two water streams separate. Please keep in mind that producing cold water with heat-rejection equipment can require a considerable amount of fan energy (in both the heat-rejection and air-handling equipment) and pumping energy, so the free cooling may not be as free as one might expect. Additional components required include motorized control valves and condenser-water temperature sensors in order to control the routing of the condenser water dependent on its temperature.

ENTHALPY WHEELS. A relatively new technology for data centers is under development in Europe. To solve the problems associated with outside-air cooling, it employs
11.5 COMPONENTS OUTSIDE THE DATACOM ROOM
a heat wheel (Figure 11.21) to isolate the data center from contamination and humidity issues. In cooler climates that rarely exceed 72°F, this has the potential to significantly improve data center efficiency. The solution requires total physical isolation of heated and cool air in the data center.
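The waterside-economizer routing described above — motorized valves directing condenser water based on its measured temperature — can be sketched as a simple mode selector. The temperature thresholds and the 3°F heat-exchanger approach below are assumed illustration values, not figures from the text:

```python
def economizer_mode(condenser_water_f: float, chw_setpoint_f: float,
                    approach_f: float = 3.0) -> str:
    """Decide how to route condenser water in a waterside economizer.

    If the heat-rejection equipment can produce water colder than the
    chilled-water setpoint (less a heat-exchanger approach), the chiller
    can be bypassed entirely; in between, the economizer pre-cools and
    the chiller trims the rest. Thresholds are hypothetical.
    """
    if condenser_water_f <= chw_setpoint_f - approach_f:
        return "free-cooling"   # route around the chiller entirely
    elif condenser_water_f <= chw_setpoint_f + approach_f:
        return "partial"        # economizer pre-cools, chiller trims
    return "mechanical"         # chiller carries the full load

print(economizer_mode(40.0, 45.0))  # free-cooling
print(economizer_mode(60.0, 45.0))  # mechanical
```

In a real plant this decision would also be debounced over time to avoid short-cycling the chiller as ambient conditions hover near a threshold.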
11.5.4 Heat Exchangers
Introduction. Building cooling systems rely on thermal transportation in order to condition the air within an enclosed space. This thermal transportation typically involves multiple media loops between various pieces of building cooling equipment. Heat exchangers are supplementary to the cooling system; they are installed either to recover some energy from the thermal transportation media flows or to reduce the amount of energy required to perform the building cooling by providing a nonmechanical cooling process when ambient conditions allow it (similar to a waterside economizer but on a larger scale). As its name suggests, a heat exchanger is a device that relies on the thermal transfer from one input fluid to another. One of the fluids entering the heat exchanger is cooler than the other. The cooler input fluid leaves the heat exchanger warmer than it entered, and the warmer input fluid leaves cooler than it entered. Figure 11.22 shows a simplistic representation of the process.

Heat Exchanger Types. Heat exchangers can be classified in a number of ways, ranging from the media streams involved in the transfer process to the direction
Figure 11.21. Heat wheel for data center application. (Courtesy Uptime Technology BV.)
of the flow of the media within them to their physical configuration. We will discuss the types of heat exchangers in turn and then classify them using the criteria listed above.

Figure 11.22. Simple overview of the heat exchanger process. (Courtesy of DLB Associates.)

Shell-and-Tube Heat Exchanger. A shell-and-tube heat exchanger is a water-to-water type of heat exchanger where one liquid fills a larger cylindrical vessel (i.e., the shell) and the other liquid is transported through the filled vessel in multiple small tubes. The tubes are typically made from a thermally conductive material such as copper, whereas the shell material is more durable and is made from stainless steel or brass. The configurations of shell-and-tube heat exchangers can be expressed in terms of the number of passes, with each pass indicating the transportation of the liquid in the tubes across the length of the shell. For example, a one-pass configuration would have the tube liquid input at one end of the shell and the output at the other end. A two-pass configuration would have the tubes double back, so the input and output of the tube liquid would be at the same end. Four-pass configurations are also available; additionally, the longer the passage for the tube liquid in the submerged shell, the greater the heat transfer that occurs (depending on the relative temperatures of the liquids involved). Shell-and-tube heat exchangers are typically used for applications that involve hot-water systems. Figure 11.23 shows a typical shell-and-tube heat exchanger.

Plate-and-Frame Heat Exchanger. A plate-and-frame heat exchanger is a water-to-water type of heat exchanger that consists of a series of flat hollow plates (typically made from stainless steel) sandwiched together between two heavy cover plates. The two liquids pass through adjacent hollow plates such that one liquid in a given plate is sandwiched between two plates containing the other liquid. In addition,
the direction of the liquid flow is opposed for the two liquids and, hence, from one plate to the next (Figure 11.24). All of the liquid connections are on the same side, but one liquid travels from the bottom of the plate to the top and the other from the top of the adjacent plates to the bottom, thus maximizing the heat transfer rate by keeping the difference in liquid temperatures as large as possible throughout the process. Figure 11.25 shows a typical installation of plate-and-frame heat exchangers. One of the main advantages of plate-and-frame heat exchangers over shell-and-tube heat exchangers is that the overall heat transfer coefficients are three to four times higher, and this type of heat exchanger is more common in water-cooled, chilled-water applications.

Figure 11.23. Shell-and-tube heat exchanger. (Courtesy of DLB Associates.)

Figure 11.24. Exploded isometric diagram showing plate-and-frame heat exchanger operation. (Courtesy of DLB Associates.)

Figure 11.25. Installed plate-and-frame heat exchangers. (Courtesy of DLB Associates.)

Other Types of Heat Exchangers. There are many other types of heat exchangers for more specialized applications that are beyond the scope of this chapter and will not be covered in this book. The two types mentioned above are the most common types used for the majority of applications.

Heat Exchanger Operation for Datacom Facilities. For datacom (data processing and telecom) facility operations that use water-cooled, chilled-water systems, heat exchangers are typically used when outdoor conditions can produce condenser water that is cold enough that just a heat exchanger can be used to cool the chilled water without the operation of a mechanical chiller. The two liquid streams normally connected to the chiller are rerouted to one or more heat exchangers that are installed in parallel to the chillers. The switchover from the heat exchangers to the chillers can be as simple as a set day of the year, or it can be dynamically controlled depending on the ambient conditions.
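The counterflow behavior described above can be illustrated with a minimal effectiveness-style energy balance for a balanced heat exchanger (equal flow rates and specific heats on both sides). The 0.85 effectiveness and the temperatures are assumed example values, not figures from the text:

```python
def counterflow_outlets(t_hot_in_f: float, t_cold_in_f: float,
                        effectiveness: float = 0.85):
    """Outlet temperatures of a balanced counterflow heat exchanger.

    With equal capacity rates on both sides, each stream changes by
    effectiveness * (hot inlet - cold inlet). The effectiveness value
    is a hypothetical, plate-and-frame-like figure for illustration.
    """
    transfer = effectiveness * (t_hot_in_f - t_cold_in_f)
    return t_hot_in_f - transfer, t_cold_in_f + transfer

# Hypothetical free-cooling case: 55°F chilled-water return cooled by
# 45°F condenser water produced by the heat-rejection equipment.
hot_out, cold_out = counterflow_outlets(55.0, 45.0)
print(hot_out, cold_out)  # 46.5 53.5
```

Note how the counterflow arrangement lets the cold stream leave warmer (53.5°F) than the hot stream's outlet (46.5°F) — impossible in a parallel-flow arrangement, which is why counterflow maximizes heat transfer.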
11.6 COMPONENTS INSIDE THE DATACOM ROOM

11.6.1 CRAC Units
Introduction. The final component of building cooling equipment we will introduce is the computer room air-conditioning unit or, as it is more commonly known, the CRAC unit. CRAC units are typically used to cool large-scale (20 kW or more) installations of electronic equipment. They are sometimes called by other names such as CAHU, CACU, CRAH, and so on. Although these units all look very similar from the outside (Figure 11.26), the components within the CRAC unit can be configured to integrate with a wide variety of building cooling systems, including chilled-water, direct-expansion (DX), glycol, air-cooled, water-cooled, and so on. The units themselves are available in capacities ranging from 6 to 62 nominal tons, or 20 to 149 kW of electronic equipment cooling.

For the majority of electronic equipment installations, the heat generated by the equipment is discharged into the computer room environment in the form of hot air through small fans that are integral to the electronic equipment packaging. This hot air from the electronic equipment is known as return air (i.e., air that is being returned to the cooling equipment to be cooled). The return air is transported to the CRAC units, where it is cooled and redelivered to the room environment adjacent to the air inlets of the electronic equipment. The cold air delivered to the electronic equipment room environment from the CRAC unit is known as supply air or conditioned air. In the electronic equipment industry, this air is sometimes referred to as chilled air, but the term chilled air is not used in the building cooling industry.

Location. More often than not, the CRAC units are located in the same room as the electronic equipment. The CRAC unit has a fairly large physical footprint (the larger CRAC units are 8 to 10 feet wide by 3 feet deep) and also requires unobstructed maintenance clearance at the front and the two sides of the unit, making its overall impact on the floor area of an electronic equipment room significant. Where spatial constraints inside the electronic equipment room cannot accommodate the floor space required for CRAC units, they may be located in adjacent
Figure 11.26. Typical computer room air conditioning unit. (Courtesy of Liebert.)
rooms, with ducted connections to supply and return air from the electronic equipment room (Figure 11.27). However, the CRAC units cannot be located too far from the computer equipment they are meant to cool, due to limitations of the CRAC unit fans and the pressure drop associated with long duct runs.

Airflow Paths. The airflow paths to and from a CRAC unit can be configured in a number of different ways. The cold supply air from the unit can be discharged through the bottom of the unit, through the front of the unit, or through the top of the unit. The warmer return air can also be configured to enter the CRAC unit from the front, the rear, or the top. The most common airflow path used by CRAC units is referred to as downflow (Figure 11.28). A downflow CRAC unit has the combination of return air entering the top of the unit and supply air discharged from the bottom of the unit, typically into a raised-floor plenum below the electronic equipment racks. The supply air pressurizes the raised-floor plenum and is reintroduced into the electronic equipment room proper through perforated floor tiles that are located adjacent to the air inlets of the rack-mounted electronic equipment.

Upflow CRAC units have return air entering the CRAC unit in the front of the unit, but toward the bottom. The supply air is either directed out of the top or out of the front but near the top of the unit. A frontal supply-air discharge is accomplished through the use of a plenum box, which is basically a metal box with no bottom panel that is placed over and entirely covers the top of the unit. The plenum box has a large grille opening in its front face to divert the air out in that direction (Figure 11.29). Instead of using a plenum box, an alternate airflow distribution method for an upflow CRAC unit is to connect the top discharge to conventional ductwork.
The ductwork can then be extended to deliver supply air to one or more specific locations within the electronic equipment room (Figure 11.30). Upflow CRAC units are typically used either in areas that do not have raised floors or in areas that have shallow raised floors used only for power/data cable distribution.
Figure 11.27. CRAC unit located outside the electronic equipment room. (Courtesy of Liebert.)
Figure 11.28. Downflow CRAC unit airflow path. (Courtesy of Liebert.)
Figure 11.29. Upflow CRAC unit airflow path. (Courtesy of Liebert.)
Figure 11.30. Ducted upflow CRAC unit airflow path. (Courtesy of Liebert.)
The latest thinking suggests that hot air isolation can provide the greatest efficiency improvements. The hot aisle/cold aisle strategy of arranging rows of cabinets is an attempt to improve the separation of hot and cold air streams. Return-air ceilings are also coming back into favor (Figure 11.31). By extending the return-air intake of the CRAC unit up to the ceiling, and installing ceiling grilles above hot aisles, or ducting the exhaust of upflow cabinets directly into the ceiling, equipment exhaust heat can be almost completely isolated from cold supply air.

Configurations. The common components within all cooling configurations of CRAC units are a fan and a coil. Other optional components available to all CRAC unit configurations include a humidifier and a heating element, the combination of which helps control the humidity of the electronic equipment room environment. The specific type of building cooling system being utilized determines the additional components located within the unit as well as the type of liquid medium that is circulated in the coil.

Classification. In addition to the upflow and downflow approach to classifying CRAC units, the units can also be classified by whether or not the refrigeration process occurs within their physical enclosure. These fall into two main categories: compressorized (e.g., air-cooled, water-cooled, glycol, etc.) and chilled-water systems. A CRAC unit is said to be compressorized or self-contained if it contains the compressors within its housing. Compressorized CRAC units require a remote piece of equipment to perform the heat rejection, and a liquid (condenser water, glycol, etc.) is piped between the CRAC unit and the heat-rejection device. In a chilled-water CRAC unit, the mechanical refrigeration takes place outside of the physical extents of the CRAC unit enclosure in a remotely located chiller.
No compressors are required in the CRAC unit, and only the chilled-water piping from the chiller is required to provide chilled water to the coil. Chilled-water CRAC units have a greater capacity (up to 60 nominal tons or 210 kW) than compressorized systems (up to 30 nominal tons or 105 kW).
Figure 11.31. Return-Air Ceiling. (Courtesy of Liebert.)
11.7 CONCLUSION
Data center cooling is composed of a wide variety of components and, depending on the design, it can be very cost-efficient or a burden on the operating budget of your facility. Ultimately, it is critical that the right combination of these systems is implemented to effectively cool IT equipment so that uptime and reliability are maintained. Recognizing and understanding the different systems and designs that assist in accomplishing this task is valuable in visualizing how your overall system works and how it can be improved.
12 DATA CENTER COOLING EFFICIENCY: CONCEPTS AND TECHNOLOGIES
12.1
INTRODUCTION
Now that you have been introduced to the equipment and components most commonly encountered in data center environments, this chapter will discuss more advanced techniques for improving data center cooling efficiency. Today's data centers consume approximately 1.5% of U.S. electrical production. Estimates from the Environmental Protection Agency (EPA) predict this could double in a few short years without efficiency improvements. Multimegawatt sites operated by collocation companies and larger businesses are prevalent today, with many more planned. Even typical smaller corporate sites can consume half a megawatt. For many businesses, data centers are their largest energy consumers. It is commonly thought that the IT equipment inside these data centers is the main culprit in this major energy consumption. In fact, the cooling of the IT equipment typically consumes 40 to 50% of this energy, and in some sites this percentage can be as much as 65%. It is obvious that improving cooling system efficiency should be of prime importance to the industry. Strategies outlined in this section can reduce cooling energy consumption for existing sites by typically 20 to 30%, and by as much as 40% in more inefficient sites. New site construction using the latest techniques and equipment, discussed here, can reduce cooling costs from the typical 40-50% to less than 20% of the total energy consumed in the data center.

The basics of an efficiently cooled data center are actually quite simple. Separate cooling air from heated air. Provide just enough cooling to the equipment racks to maintain effective operating temperatures. Then ensure that the return temperature of the cooling units (CRACs) is as high as possible to increase ΔT (see Appendix C for a more comprehensive discussion, since this is not always true as stated). Figures 12.1 and 12.2 illustrate these concepts concisely. The solid line in the temperature distribution of Figure 12.1 shows the results of proper cooling and airflow distribution to the equipment racks without needing to overcool. The return air temperature versus efficiency graph of Figure 12.2 illustrates the efficiency gains of increased return air temperature at the CRACs. As will be discussed later, data center design and airflow paths are the keys to attaining both these goals simultaneously.

Why is raising CRAC return temperature of such importance? As illustrated in Figure 12.2, raising return air temperature improves cooling efficiency in the range of 4% per each 1°F increase. Looking back to the comparison in Figure 12.1, if the efficient site realizes an 8°F rise in return air temperature versus the inefficient site, the cooling efficiency is improved by 32%. By maintaining a higher return temperature, extended use of free cooling, which will be discussed later, is possible in most climates.

The key concepts of efficient data center cooling, as described above, are simple. Effective implementation and reliable operation of such an efficient data center is a far more complex topic. A full understanding of the operation and interactions between heat production, heat rejection, and cooling distribution components in a specific data center is required in order to understand the remaining text in this chapter.
By the end of this chapter, the reader will understand where to look within his or her data center to find energy saving opportunities.
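The 4%-per-°F rule of thumb above is easy to apply directly. A minimal sketch, treating the relationship as linear (which Figure 12.2 shows is only an approximation):

```python
def cooling_efficiency_gain(delta_return_f: float,
                            pct_per_degree: float = 4.0) -> float:
    """Approximate cooling efficiency improvement (%) from raising the
    CRAC return air temperature, using the ~4%/°F rule of thumb.

    This is a linear approximation of the curves in Figure 12.2; real
    gains depend on CRAC type and operating point.
    """
    return delta_return_f * pct_per_degree

# The 8°F rise discussed in the text:
print(cooling_efficiency_gain(8.0))  # 32.0
```

The function reproduces the chapter's example: an 8°F rise in return temperature corresponds to roughly a 32% cooling efficiency improvement.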
Figure 12.1. Comparison of IT rack inlet temperatures for an inefficiently cooled data center versus an efficiently cooled one. (Courtesy of AdaptivCool.)
Figure 12.2. Typical CRAC cooling efficiency increase (normalized at 75°F/23.9°C) at different return air temperatures (chiller and CW pumping not included for the CW CRAC). (Courtesy of Emerson Network Power.)
12.1.1 Data Center Efficiency Measurement
An accepted measure of data center efficiency developed by members of the Green Grid (www.thegreengrid.org) is known as power usage effectiveness (PUE). It is defined as

PUE = Total power usage of data center / Power usage of IT equipment
Therefore, to calculate PUE, the total energy used by the data center cooling infrastructure, UPS input, and ancillary equipment such as lighting and safety equipment must be collected. Once this information is known, it is divided by the IT equipment load, which will be very close to the UPS output. (There is some loss in transmission from the UPS to the IT equipment, typically less than 2%.) In the typical data center today, only about 50% of the energy is used by the IT equipment; thus, the PUE in this instance is 2.0. A data center running at a higher efficiency will have a lower PUE. Conversely, a less efficient site will have a higher PUE. For some, having a lower number represent higher efficiency is considered counterintuitive. A derivative of PUE called data center infrastructure efficiency (DCiE) has also been developed and adopted by the Green Grid. This is simply calculated as

DCiE = Power usage of IT equipment / Total power usage of data center

This results in a higher number for higher efficiency. A data center with a PUE of 2.0
will have a DCiE of 0.5 or 50%. This means the site is 50% efficient. If PUE were 1.6, a reasonable goal for many data centers, the DCiE would be 0.625 or 62.5% of the total energy going to IT equipment. As of this writing, the average PUE for most reporting data centers is close to 1.9 (see Table 12.1). However, accurate numbers are hard to quantify since the metric is so new to the industry. The EPA has set a goal to reduce average PUE numbers dramatically depending on the type of infrastructure equipment. Today, there are a small number of data centers reporting PUE numbers better than 1.4. The best reported PUEs in North America as of this writing are between 1.07 and 1.09. There are a number of cautions when looking at PUE numbers published by others or collected for a specific data center. PUE is a very dynamic number; it will improve as the weather gets cooler, such as evenings or cooler months. Some data centers have very dynamic heat loads that change as computing demand changes. PUE will often improve as computing loads go up and cooling systems run closer to capacity. The most reliable PUE numbers are those that are an average of many months; a full year is best, though data center IT equipment profiles can change dramatically in a whole year. Another issue when comparing PUE numbers is accuracy. Proper submetering is required to isolate the data center load from any other entities on-site. Collocation and IT-intensive sites, by their nature, are motivated to improve efficiency since it affects their bottom line. They are also the most likely to publish their PUE number, particularly if it is good. Corporate sites often do not know their real PUE, due to metering issues, and are more likely to have a worse PUE, simply because data centers are not their main corporate focus or core business. This situation is rapidly improving due to corporate sustainability awareness.
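The PUE and DCiE definitions above reduce to two one-line calculations. A minimal sketch with hypothetical meter readings (the kW values are illustrative, not from the text):

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT power."""
    return total_facility_kw / it_kw

def dcie(total_facility_kw: float, it_kw: float) -> float:
    """Data center infrastructure efficiency: IT power / total power."""
    return it_kw / total_facility_kw

# Hypothetical submetered readings: cooling + UPS input + ancillary
# loads total 1000 kW, with an IT load (≈ UPS output) of 500 kW.
total_kw = 1000.0
it_kw = 500.0
print(pue(total_kw, it_kw))   # 2.0
print(dcie(total_kw, it_kw))  # 0.5
```

As the text notes, a single reading is of limited value: PUE varies with weather and computing load, so a trustworthy figure averages these calculations over many months of submetered data.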
12.2 HEAT TRANSFER INSIDE DATA CENTERS
As mentioned previously, air cooling relies on fans and air movement to cool heat-producing equipment. Liquid cooling relies on pumps and a heat transfer fluid to move heat away from this equipment. Liquid cooling represents a very small but growing percentage of today's installations. Air cooling is adequate for power densities up to about 10-12 kW per rack. Using specially constructed racks and a well-defined hot-air
Table 12.1. PUE and DCiE efficiency levels

PUE    DCiE    Level of efficiency
3.0    33%     Very inefficient
2.5    40%     Inefficient
2.0    50%     Average
1.5    67%     Efficient
1.2    83%     Very efficient

Source: Courtesy of the Green Grid.
return path, in some cases air cooling can support densities of up to 20 kW per rack. Beyond 20 kW per rack, some form of liquid cooling, brought close to or inside the electronics racks, is deemed necessary. Since liquid cooling is a relatively new technology, this chapter will focus primarily on air cooling. Some key liquid cooling topics and terms will be covered; however, since liquid cooling technology is rapidly evolving, it is suggested that the reader contact several vendors to obtain the latest information.
12.2.1 Heat Generation
The path of heat flow originates from the IT equipment, as shown in Figure 12.3. Cooling air is drawn into electronic equipment by internal fans. The cool air extracts heat from internal heat-producing components; the air heats up on its path to the equipment exhausts. CPUs are by far the highest heat producers, with power supplies (conversion efficiency) second and disk drives third. Most IT equipment is designed to have enough fan-induced airflow to limit the temperature rise (ΔT) from the intake to the exhaust to 20°F. Newer equipment, such as blade servers, raises this ΔT to 26°F and sometimes as high as 30°F. Higher-ΔT equipment is used in an effort to reduce the amount of power consumed by the internal fans (which can be significant) and to help facilitate higher CRAC return temperatures, which are beneficial to efficiency. The volume of air consumed by a specific piece of IT equipment depends on its designed ΔT and electrical load. Equipment designed for a 20°F ΔT consumes ~160 CFM per kilowatt, whereas a 30°F ΔT design consumes approximately 120 CFM per kilowatt.

It should be noted that IT equipment power consumption and cooling requirements are becoming more dynamic. Legacy equipment consumes nearly the same power whether idling or fully loaded. Current-generation equipment can reduce power when idling to 60% of full load. Next-generation equipment will throttle back power consumption to 30% or less of full load. This has implications for data center cooling efficiency that can be addressed by adaptive cooling methods.

Figure 12.3. Air cooling of a data center equipment rack. (Courtesy of AdaptivCool.)
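The relationship between design ΔT and airflow per kilowatt follows from the standard sensible-heat relation for air at sea level, Q(BTU/hr) ≈ 1.08 × CFM × ΔT(°F). A quick sketch (note this idealized relation gives ≈158 CFM/kW at 20°F and ≈105 CFM/kW at 30°F; the quoted field values of 160 and 120 reflect additional equipment-specific margins and air-density assumptions):

```python
def airflow_cfm_per_kw(delta_t_f: float) -> float:
    """Airflow needed per kW of IT load for a given intake-to-exhaust
    temperature rise.

    Uses the standard-air sensible heat relation
    Q(BTU/hr) = 1.08 * CFM * dT, with 1 kW = 3412 BTU/hr.
    """
    btu_per_hr_per_kw = 3412.0
    return btu_per_hr_per_kw / (1.08 * delta_t_f)

print(round(airflow_cfm_per_kw(20)))  # 158
print(round(airflow_cfm_per_kw(30)))  # 105
```

The inverse relationship is the key point: raising the equipment's design ΔT by half again cuts the required airflow (and therefore fan power) by roughly a third.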
12.2.2 Heat Return
In an ideal situation, there is a direct low-resistance path for the hot exhaust air to flow from the IT equipment to the return air intakes of the CRACs. In practice, this path is rarely ideal. There are often obstructions to efficient exhaust-air flow, sometimes beginning as close as the fan outlets. Cables and cable arms at the back of the servers can impede heat escaping the rack. Once the heated air does escape the rack, there can still be other impediments to this air returning efficiently to the CRACs. Poor rack placement, poor CRAC placement, cooling leakage in a hot aisle, and a host of other issues can cause heated air to mix with cool air on its path back to the CRAC. Figure 12.4 illustrates the effect of poor rack layout that forces mixing to take place.
12.2.3 Cooling Air
Air heated by IT equipment in the data center is cooled by terminal cooling equipment commonly called computer room air conditioners (CRACs). In raised-floor data centers, these CRACs are configured as downflow units. Heated air is drawn into the top of these units, cooled, and then forced by high-capacity blowers into an underfloor plenum. The air in this underfloor plenum is pressurized to a typical static pressure of 0.02 to 0.06 in. w.g. to allow it to be forced through perforated tiles. These perforated tiles are typically 2' x 2' in size and have from 25% to 56% open area. Perforated tiles are placed in locations that will deliver cooling air to equipment-rack intakes. Due to obstructions such as pipes, cables, and support structures, as well as velocity vectors, the static pressure and resulting flow through the perforated tiles is uneven and often unpredictable.
Figure 12.4. Poor equipment layout causes mixing within the data center. (Courtesy of AdaptivCool.)
Slab-floor data centers use similar cooling units with airflow direction changed from downflow to upflow. Heated air is drawn into the bottom or front of these units, cooled, and then forced by high-capacity blowers through the top of the unit. In some cases, there is ducting to deliver this cool air to the vicinity of the rack intakes. In other cases, the cool air is simply ejected out the top front of the CRAC unit in an effort to supply cooling air throughout the data center.
12.3 COOLING AND OTHER AIRFLOW TOPICS

12.3.1 Leakage
In raised-floor data centers, passageways or cutouts are needed in the floor tiles under the racks. These cutouts allow rack power and data cables to pass between the underfloor cableways and the IT equipment in the racks. Most often, these cutouts are much larger than needed and allow cool air to leak out. Usually, these cutouts are at the back of the rack, on the exhaust side; thus, the leaking cool air simply mixes with heated air and lowers CRAC return temperatures. Another downside to these cutouts is reduced static pressure, which then affects how much cooling air is delivered from the perforated tiles. Leakage can occur in other areas of the data center, such as under power distribution units (PDUs), around CRACs, and even through breaches under the floor to other spaces in the building. Leakage should be controlled using various methods discussed in Chapter 13. Leakage is sometimes referred to as bypass air.
12.3.2 Mixing and Its Relationship to Efficiency
Unwanted mixing of cool and heated air within the data center is the prime cause of poor cooling efficiency. Mixing occurs in three places. First, cool air from the CRACs is mixed with heated air before making its way to the rack intakes. This causes IT equipment to run hotter than it needs to. Second, heated air from the IT equipment is mixed with cool air before it can return to the CRAC intakes. This reduces return-air temperature and can cause CRACs to provide less cooling than required. Third, CRACs that are seeing low heat loads (possibly due to mixing) and therefore doing little cooling inject large quantities of relatively warm air into the underfloor plenum, causing temperatures from the perforated tiles to vary.
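The penalty of mixing can be quantified with a flow-weighted average of the merging air streams (constant air density and specific heat assumed). The flow rates and temperatures below are hypothetical illustration values:

```python
def mixed_air_temp_f(flows_cfm, temps_f):
    """Temperature of an air mixture as the flow-weighted average of
    the merging streams (constant density and specific heat assumed)."""
    total = sum(flows_cfm)
    return sum(f * t for f, t in zip(flows_cfm, temps_f)) / total

# Hypothetical case: 8000 CFM of 95°F rack exhaust is diluted by
# 2000 CFM of leaked 60°F supply air on its way to the CRAC return.
print(mixed_air_temp_f([8000, 2000], [95.0, 60.0]))  # 88.0
```

In this example the leakage knocks 7°F off the return temperature; by the ~4%/°F rule discussed earlier, that single leak path costs on the order of a quarter of the achievable cooling efficiency.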
12.3.3 Recirculation
Air, like water, follows the path of least resistance. Enough volume of cool air to support the heat load in a rack needs to be supplied to the area in front of the rack intake. Failure to supply enough cool air will cause heated exhaust air from the back of the rack to recirculate to the intakes. This recirculation can take place inside, over the top, or around the sides of the rack. It becomes particularly difficult to supply enough cool air to equipment toward the top of racks. If possible, install high-kilowatt equipment
low in the racks. Blanking panels are one method of controlling recirculation within racks.
12.3.4 Venturi Effect

Perforated floor tiles placed within 6 feet of a downflow CRAC unit will experience a Venturi effect caused by high underfloor air velocities, typically as high as 1500 ft/min near the CRAC unit discharge. This effect will actually cause room air to be pulled down through the perforated tiles. Because of this effect, static pressure cannot build until airflow velocity has dropped below 800 ft/min. This phenomenon can be easily demonstrated by lifting the tile directly in front of a CRAC; in most cases, very little if any air will be forced out of the opening. Do the same at the fourth, fifth, or sixth tile from the CRAC; the results will be noticeably different from the first tile. For this reason, it is advisable not to crowd racks so close to CRAC units that perforated tiles must be installed within 6 feet of the unit, since those tiles would deliver a minimal amount of airflow to the racks.
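The underlying physics is the trade between velocity (dynamic) pressure and static pressure. A sketch using the common HVAC standard-air relation VP = (V/4005)², where V is in ft/min and VP is in inches of water gauge:

```python
def velocity_pressure_in_wg(velocity_fpm: float) -> float:
    """Velocity (dynamic) pressure of standard air, in inches w.g.,
    from the common HVAC approximation VP = (V / 4005)**2."""
    return (velocity_fpm / 4005.0) ** 2

# Near the CRAC discharge, most of the total pressure is tied up as
# velocity pressure, leaving little static pressure to push air up
# through a perforated tile; it recovers as the stream slows.
print(round(velocity_pressure_in_wg(1500), 3))  # ~0.14
print(round(velocity_pressure_in_wg(800), 3))   # ~0.04
```

At 1500 ft/min, roughly 0.14 in. w.g. is tied up as velocity pressure, several times the typical underfloor static pressure, which is why tiles close to the unit can see reverse flow; by 800 ft/min the penalty has dropped to about 0.04 in. w.g. and static pressure can build.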
12.3.5 Vortex Effect

Hurricanes and tornadoes are the most well-known examples of vortices. Pressure in the "eye" of the storm is extremely low. Data centers also experience underfloor vortices due to high-velocity airflow emanating from CRACs. If these vortices occur near CRACs, where there is typically no equipment, they cause no harm. However, if CRACs are placed at right angles or opposite each other but offset by as little as 1 foot, severe vortices can and do occur, with detrimental effects on airflow.
12.3.6 CRAC/CRAH Types
As mentioned in Chapter 11, datacom room cooling is implemented either as a DX style, with compressors to cool evaporator coils, or as a chilled-water style, with cooling coils fed by chilled water. Of the two types, the chilled-water style is typically more efficient due to the evaporative cooling used in most chiller plants. Chilled-water units (CRAHs) require a chiller and are often used at larger sites. Chilled water is pumped at 40-50°F and utilized directly in the cooling coils. A chilled-water valve regulates the amount of cooling supplied. Chilled-water temperatures in the high 40s are desired for data centers to minimize dehumidification (latent cooling). DX units (CRACs) are typically found in small to mid-size data centers. Air-cooled versions pump refrigerant to outdoor condensers; because the distance from the CRAC to the condenser is limited, these are rarely seen in multifloor or large buildings. More common are glycol units that have an internal heat exchanger to transfer heat to a glycol loop and outdoor dry cooler (radiator). DX units can be ordered with an economizer coil, which utilizes cold glycol directly during winter months. In this mode, operation is very similar to that of a chilled-water unit; compressors may need to run very little during colder weather.
12.3 COOLING AND OTHER AIRFLOW TOPICS
12.3.7 Potential CRAC Operation Issues
Due to mixing of heated air with cooled air, when heated air finally does reach the CRAC returns, it is at a lower temperature than it would be without mixing. Lower CRAC return air temperatures can cause two problems. One, already discussed, is lower efficiency; the other is that the CRAC can be fooled into thinking there is less IT load to cool, so it runs at reduced capacity and allows overheating. Figure 12.5 helps to illustrate how common CRAC units are controlled and behave under various loading conditions. The return air entering the top of the CRAC is sensed for temperature as it makes its way to the cooling coils. The CRAC provides enough cooling capacity to hold the return temperature within the set point's dead band (typically ±2°F). The temperature of the supply air exiting the CRAC becomes cooler as load increases. Typical DX CRACs have discrete cooling steps of 0%, 25%, 50%, 75%, and 100%. Chilled-water CRACs and newer digital scroll CRACs have a more linear capacity control. DX and digital scroll CRACs have a maximum ΔT of 20°F, after which temperatures will drift upward as capacity is exhausted, as can be seen to the far right of Figure 12.5. Chilled-water CRACs can run at a higher ΔT if the chiller loop and pumps have enough capacity to support the heat load.
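The stepped control just described can be sketched in a few lines. This is a minimal illustration, not vendor firmware: the 72°F setpoint and the step-up/step-down logic are assumptions; only the ±2°F dead band and the discrete 0/25/50/75/100% steps come from the text.

```python
# Illustrative sketch of a DX CRAC selecting among discrete cooling steps
# from return-air temperature. Setpoint and logic are assumed; the dead
# band and step values follow the text.

SETPOINT_F = 72.0                       # assumed return-air setpoint
DEAD_BAND_F = 2.0                       # +/-2 F dead band, per the text
STEPS = [0.0, 0.25, 0.50, 0.75, 1.0]    # discrete DX cooling steps

def next_step(return_temp_f: float, current_step: float) -> float:
    """Step capacity up or down only when return air leaves the dead band."""
    idx = STEPS.index(current_step)
    if return_temp_f > SETPOINT_F + DEAD_BAND_F and idx < len(STEPS) - 1:
        idx += 1            # too warm: add a stage of cooling
    elif return_temp_f < SETPOINT_F - DEAD_BAND_F and idx > 0:
        idx -= 1            # too cool: shed a stage
    return STEPS[idx]

print(next_step(75.0, 0.50))  # warm return air: steps up to 0.75
print(next_step(72.5, 0.50))  # inside the dead band: holds at 0.5
```

Note how a CRAC fooled by mixed (cooler) return air would sit at a low step even while rack intakes overheat, which is exactly the failure mode described above.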
12.3.8 Sensible Versus Latent Cooling
In HVAC terminology, the terms sensible cooling, latent cooling, and total cooling define different types of heat transfer. Sensible cooling does not mean that someone very smart designed the cooling system, so let us discuss latent cooling first. To understand latent cooling, it is very helpful to know what a dew point is. For HVAC engineers, dew point is found by studying the psychrometric chart (Figure 12.6) and plotting temperature against relative humidity to arrive at a dew point temperature. It is an interesting exercise the first time through, since it is a very complex chart. For non-HVAC engineers, it is easiest to relate dew point to everyday experiences. Cars on cool early mornings are often covered with a film of water droplets, or condensation.
Figure 12.5. CRAC cooling behavior for various heat loads. (Courtesy of AdaptivCool.)
Since their surfaces have become colder than the dew point, water vapor condenses from the air onto their surface. Now think back to two different days on which the temperature was mild but on one day the morning air was dry whereas on the other day it was very humid. What happened? On the dry day there was little or no condensation on the cars, whereas on the humid day there was a thick layer, even though the temperatures were similar. This is a demonstration of dew point, the point at which water vapor will condense. Calculating a dew point requires both a relative humidity and a temperature. Now that the concept of dew point is understood, latent cooling is easy to understand:

1. Latent cooling is dehumidification.
2. Latent cooling can only occur when the cooling coils are at or below the dew point for the given temperature and relative humidity of the air passing across them. Water vapor condenses out of the air onto the cold coils and is carried away down a drain.
3. Energy is needed to condense this water since it requires a phase change. Phase changes such as liquid to ice, liquid to vapor, or vapor to liquid are very energy intensive. This energy comes from the cooling system.
4. Latent cooling does not affect dry-bulb temperature, only humidity.
5. Latent cooling in a data center is a waste unless it is needed to control humidity.

Sensible cooling is cooling that lowers dry-bulb (an ordinary thermometer's) temperature and does not change the moisture content of the air. The ratio of sensible to total cooling (the sensible heat ratio, or SHR) for a data center should be well above 90% to be efficient, and this requires that the cooling coils be at or above the dew point. Conversely, for human comfort the SHR is in the 70% range so the air can be dehumidified. Understanding the concepts above is one of the keys to data center cooling efficiency, and explains why overcooling is so wasteful. Overcooling puts the cooling coils in the latent cooling regime (dehumidification) more than needed. And any water vapor that is removed by dehumidification needs to be replenished, more often than not with another energy-intensive phase change of liquid to vapor. Total cooling, the last part of the equation, is the sum of both sensible and latent cooling. For CRACs, it is important to look at both the total cooling and sensible cooling specifications at the desired return-air temperature.

Figure 12.6. Example of a psychrometric chart.
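For readers who prefer a formula to the chart, dew point can be approximated numerically. The sketch below uses the Magnus approximation, a standard psychrometric formula that is not taken from this book; the example conditions (a 72°F room at 40% RH) are chosen to match the data center ranges discussed in this chapter.

```python
import math

# Dew point from dry-bulb temperature and relative humidity using the
# Magnus approximation (standard psychrometrics, adequate for typical
# data center conditions; coefficients b and c are the common values).

def dew_point_c(t_c: float, rh_pct: float) -> float:
    b, c = 17.62, 243.12
    gamma = math.log(rh_pct / 100.0) + (b * t_c) / (c + t_c)
    return c * gamma / (b - gamma)

# A 72 F (22.2 C) room at 40% RH: any coil colder than this dew point
# performs latent cooling (dehumidification).
td = dew_point_c(22.2, 40.0)
print(round(td, 1))   # about 8 C, i.e. roughly 46 F
```

This agrees with the chapter's point that coils running in the low-to-mid 40s °F sit at or below the dew point of typical room air and therefore condense moisture.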
12.3.9 Humidity Control
The subject of humidity control can be very complex. There are too many factors in building design, climate, and infrastructure equipment to make broad recommendations that apply to all situations. To simplify the subject for data center efficiency purposes, humidity should be controlled in the 35-40% range, if possible. Going below 30% puts the IT equipment at risk for potential ESD (electrostatic discharge) damage. On the other hand, humidity levels above 45% are inefficient due to the presence of excess water vapor and more latent cooling (dehumidification), as discussed earlier. Controlling to this ideal 35-40% range is made easier if there is a good moisture barrier in the data center envelope. Special paints, sealants, and ceiling tiles are employed in data center construction to prevent or slow the intrusion of moisture. In sites where the intrusion of moisture is too great, it is sometimes better to have higher summer and lower winter humidity settings rather than try to battle humidity down to 40% in the summer. Some sites are even reported to run well with little or no humidity control, most likely through a combination of a tight building envelope, the right amount of cooling, and favorable outdoor moisture content. Since it is so important and not well documented in the literature, some of what was discussed in the sensible and latent cooling section is repeated here. Moving cooling air across cooling coils that are below the dew point (coils typically run at 40-45°F in a data center) causes what is called latent cooling, or dehumidification. This latent cooling condenses water vapor in the air to liquid on the cooling coils, and the liquid is carried out of the data center in condensate pipes. A steady stream of water from the condensate lines indicates an inefficient humidity control situation.
Water carried out of the condensate lines needs to be reintroduced to the data center as water vapor (rehumidification) to keep humidity levels from becoming too low and risking ESD damage. The process of rehumidification can be very energy intensive if done by common electric humidifiers found in many CRACs. Other methods of adding humidity such as ultrasonic or steam can be much more energy efficient. Running CRACs at higher return-air temperatures and avoiding overcooling also helps in humidity control, since there is less inherent latent cooling (dehumidification) and less required rehumidification.
Recently released guidelines from ASHRAE Technical Committee 9.9 (http://tc99.ashraetcs.org/) advocate the use of dew point control rather than relative humidity to achieve the lowest amount of latent cooling and higher potential efficiency. Since most common CRAC/CRAH equipment uses humidity sensors that read relative humidity, this dew point must eventually be recalculated to relative humidity for entry into the CRAC unit control. It is expected that newer CRAC controls will incorporate dew point control settings in the future. Some CRAC units already have an option for absolute humidity control. If this option is available, it should generally be used.
12.3.10 CRAC Fighting—Too Many CRACs
A situation known as CRAC fighting is frequently observed in data centers. It is a very inefficient and wasteful condition, defined as one or more CRACs actively dehumidifying the air (as indicated on the front panel) while one or more other CRACs simultaneously rehumidify it. Often, these are adjacent CRACs. There is usually one of two causes. In one, the humidity setpoints are simply different and need to be reset. The more common cause is that the IT heat load across the CRACs is imbalanced. The highly loaded CRACs will sense a lower relative humidity (RH) from warmer air, while the lightly loaded CRACs will sense higher RH from cooler air. A stopgap measure is to adjust the humidity setpoints to eliminate the fighting. However, cooling capacity is still being lost from the lightly loaded CRACs. A potentially more serious issue is that underfloor temperatures are raised when completely unloaded CRACs are allowed to run: no cooling is taking place, and warm air is being pumped under the floor to mix with cooling air. Future IT loads placed in the room should be targeted toward lightly loaded CRACs rather than the more highly loaded CRACs. Active airflow can also be used to redistribute warmer air to lightly loaded CRACs, improving overall efficiency. In many cases, there are too many CRACs running in a data center. With a large number of CRACs running at low capacity, both energy efficiency and cooling are compromised. Poor airflow masks the overcooling, so temperatures can be relatively high and not reflect the fact that the site is running at many times the cooling needed. By running the correct number of CRACs at higher loads, more cold air is produced. Data center temperatures actually go down when a severe overcooling situation is resolved. Control systems that can put excess CRACs into a hot-standby condition, ready for use when needed, protect the cooling redundancy of the site.
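Because CRAC fighting is visible directly on unit front panels, it is also easy to flag automatically from building-management telemetry. The sketch below is hypothetical: the field names and the sample readings are invented for illustration; a real BMS would supply equivalent status data.

```python
# Hypothetical sketch: flag "CRAC fighting" when some units report they
# are humidifying while others simultaneously report dehumidifying.
# Data and field names are invented for illustration.

cracs = [
    {"name": "CRAC-1", "mode": "dehumidify"},
    {"name": "CRAC-2", "mode": "humidify"},
    {"name": "CRAC-3", "mode": "cooling"},
]

modes = {unit["mode"] for unit in cracs}
offenders = []
if {"humidify", "dehumidify"} <= modes:          # both present at once
    offenders = [u["name"] for u in cracs
                 if u["mode"] in ("humidify", "dehumidify")]
    print("CRAC fighting suspected:", offenders)
```

As the text notes, the fix is not just matching setpoints: the underlying load imbalance across the flagged units should be investigated as well.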
12.4 DESIGN APPROACHES FOR DATA CENTER COOLING

12.4.1 Hot Aisle/Cold Aisle
The traditional approach to data center cooling is known as hot aisle/cold aisle organization, by which server cabinets are oriented so that each row of cabinets shares a cold-intake-air aisle with the row on one side and a hot-exhaust-air aisle with the row
on the other side. With the inclusion of blanking panels to fill any gaps in the server cabinets, this structure works to keep the cool air and warm air from mixing, and attempts to deliver higher efficiencies by increasing the difference in temperature (ΔT) between CRAC intake and output air.
12.4.2 Cold-Aisle Containment
As mentioned in Chapter 11, any technique for maintaining separation of heated and cooling air and reducing mixing will improve cooling efficiency. Containment strategies that physically separate heated and cooling air with overhead partitions, walls, and doors can be very effective. There is a choice between hot-aisle containment and cold-aisle containment. The general consensus is that cold-aisle containment is easier for existing raised-floor data centers that have a well-laid-out hot aisle/cold aisle arrangement. In a cold-aisle containment layout, barriers are erected to physically prevent the cool air in the cold aisles from mixing with the warmer room air before being passed across the servers. There are several important considerations for containment. The first is that sufficient cooling capacity and airflow must be available; without enough cooling air supply, the racks and IT equipment, especially at the top of the racks, will run hotter with containment than without. The second is compatibility with the fire suppression system: preaction and sprinkler coverage needs to be provisioned in the contained area, and the design and implementation must be reviewed and approved by the local fire safety authorities. The third is ingress/egress for racks and servers if doors and walls are used. The last is ensuring adequate lighting if overhead curtains are erected.
12.4.3 In-Row Cooling with Hot-Aisle Containment
There are hot-aisle containment systems that can be constructed on a slab-floor data center. These can be a good choice if no raised floor exists. A hot-aisle containment scheme that includes in-row cooling consists of cooling units being placed between adjacent server cabinets, set up to maintain the traditional hot aisle/cold aisle organization. The cooling units are placed facing one another to promote the buildup of static pressure in the cold aisles. A solid boundary is used to hold the warm air in the hot aisles, where it is conditioned by the cooling units and moved to the cold aisles. In-row cooling serves to move the focus of data center cooling from the datacom room to the individual aisles.
12.4.4 Overhead Supplemental Cooling
In certain instances of high-server-power densities or tight space allowances, overhead cooling may be installed to improve data center cooling efficiency. An increase in power efficiency is realized by placing the overhead cooling units closer to the servers than in traditional arrangements; additionally, their placement above the server cabinets allows more cooling to be achieved without sacrificing more floor space to CRAC units.
12.4.5 Chimney or Ducted Returns
Analogous to a raised access floor, a dropped ceiling or ductwork may be used to transport warm air from server cabinets to CRAC and CRAH units. This approach adds an additional expense to data center construction, but may improve efficiency by decreasing the incidence of mixing in the data center.
12.4.6 Advanced Active Airflow Management for Server Cabinets
Instead of concentrating on keeping warm and cool air from mixing in a traditional hot aisle/cold aisle data center, an alternative technique, using digitally controlled variable-speed fans, concentrates on ensuring a more stable overall temperature gradient in a data center environment. This is done by locating servers in cabinets with integrated, digitally controlled variable-speed fans and allowing the control system to regulate how the cool air is distributed within the cabinet. These products eliminate the need for perforated tiles in the cold aisles of a raised access floor, instead drawing cool air in through the bottom of the cabinet and expelling warm air out the top (Figures 12.7 and 12.8). More information about this is included in Appendix C.
12.5 ADDITIONAL CONSIDERATIONS

12.5.1 Active Air Movement

Air is lazy and, like electricity and water, will always follow the path of least resistance. This is the cause of mixing, recirculation, underfloor pressure variation, and other detrimental cooling conditions. Oftentimes, an entire data center is overcooled due to a small number of problem locations. By utilizing fans in strategic locations, spot or area cooling problems can be eliminated. Floor-mounted fans are sometimes employed, but in many cases these types of fans cause more widespread mixing. An alternative solution is fans designed specifically for data center applications.

Figure 12.7. Active airflow management. (Courtesy of AFCO Systems.)

Figure 12.8. Underfloor air mover designed for data center application. (Courtesy of AdaptivCool.)
12.5.2 Adaptive Capacity

With IT loads now becoming more variable, several methods for adaptive cooling capacity have been developed. One method combines new variable-capacity compressors, known as digital scroll compressors, with blowers driven by variable-frequency drives (VFDs) in DX CRAC units. This allows more flexibility in the amount of cooling delivered. Another method is to install VFDs on chilled-water CRAH blower units to allow more flexibility. (Note: VFDs cannot be installed on older fixed-compressor DX CRACs.) Yet another method monitors the thermal load and can put entire CRAC units into hot standby when not needed and immediately bring them back online when required.
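A supplementary note on why VFDs pay off so well (this reasoning is standard fan practice, not taken from the text): under the fan affinity laws, fan power scales roughly with the cube of speed, so even a modest speed reduction yields a large power saving.

```python
# Fan affinity law sketch: power varies approximately with the cube of
# fan speed. This is a textbook approximation for illustration only;
# real blowers deviate somewhat from the ideal curve.

def fan_power_fraction(speed_fraction: float) -> float:
    """Approximate fan power as a fraction of full-speed power."""
    return speed_fraction ** 3

# Slowing a blower to 80% of full speed cuts fan power to about half.
print(round(fan_power_fraction(0.8), 2))  # 0.51
```

This cubic relationship is the main reason variable-speed blowers on CRAH units can deliver "more flexibility" and substantial energy savings at partial load.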
12.5.3 Liquid Cooling
Liquid can transport thousands of times the amount of heat energy as air; thus, there are a number of advantages to liquid cooling for high-density data centers. Because this is a fairly new technology in this application, a number of competing liquid cooling solutions are vying for market dominance. The technologies from various vendors are very different, but they do have similarities. All move cooling coils closer to the IT equipment and heat source, raising ΔT and reducing the fan power required for cooling; this is called close-coupled cooling. Most require a chilled-water supply. All need humidity control external to the liquid cooling system. All need to keep the temperatures of their cooling coils above the dew point to prevent condensation near the IT equipment. Due to these higher coil temperatures, all can increase the potential free cooling hours to save energy. The differences among these technologies are that some use water or glycol whereas others use refrigerant for the cooling coils. Some use the main chilled-water loop directly. Others use a heat exchanger to isolate the main loop from the cooling loop. Refrigerant-based systems always use a heat exchanger to transfer heat to a chilled-water loop. Cooling coil locations vary greatly; they can be found inside racks, on the back of racks, on top of racks, and in specially made racks that are actually tall, thin CRAC units. Properly designed and provisioned, liquid cooling can be the most energy-efficient solution for newly designed high-density data centers. This is primarily due to the lower fan power needed and the higher coil temperatures, which provide near 100% sensible cooling and the potential of more hours of free cooling. As mentioned earlier, redundancy is a challenge in liquid cooling and needs to be accounted for in the design. Rack placement flexibility will be less than with traditional air-cooled designs; once racks are deployed, they seldom can be moved easily, so more advance planning is needed. Initial cost for liquid cooling will be much higher than for traditional CRACs, but energy savings can often offset the capital cost over the full life of the equipment. Liquid cooling is typically implemented in smaller, separate, high-density zones within a larger data center where air cooling cannot effectively remove the large quantities of heat.
With the rapidly changing landscape for liquid cooling, it is advisable to contact multiple vendors to obtain information on their latest offerings. With wider adoption, equipment costs are expected to come down over time.
12.5.4 Cold Storage
Although it is not strictly an energy-saving technology, cold storage can be very effective at peak shaving, which saves electricity expense for customers who have time-of-day metering. This technology also helps the environment by allowing data center chillers to shut down during peak-demand hours, reducing loads on typically dirtier power plants. During off-peak hours, water or a water/glycol mix is cooled and stored as cold water or slush in large tanks. During peak hours, this cold mixture is used to keep the chilled-water loop cool.
12.6 HARDWARE AND ASSOCIATED EFFICIENCIES

12.6.1 Server Efficiency
Since each server requires power for both electronics and cooling, reducing server energy consumption pays a double dividend in energy efficiency.
12.6.2 Server Virtualization
Studies have shown that typical server utilization for single-application servers is less than 10%. Virtualization addresses this issue by combining several applications on one server. By making more efficient use of computer power, virtualization can dramatically reduce the number of servers needed in the data center. Typical compression ratios for virtualization are 4:1 to 7:1. This means that one server can do the work that four to seven servers used to do. Virtualized servers will tend to run at higher loads, using up to 30% more power and cooling than nonvirtualized servers; however, the energy savings are significant. Some business processes are more compatible with virtualization than others.
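The consolidation arithmetic is worth making concrete. The sketch below works through the ratios quoted above with invented numbers: a fleet of 100 servers, a 5:1 ratio (within the cited 4:1 to 7:1 range), and an assumed 400 W average draw, with the surviving hosts running up to 30% hotter as the text warns.

```python
import math

# Worked example of the consolidation ratios in the text.
# servers_before and power_per_server_w are hypothetical values.

servers_before = 100
ratio = 5                      # within the 4:1 to 7:1 range cited
power_per_server_w = 400       # assumed average draw per server

servers_after = math.ceil(servers_before / ratio)
power_before = servers_before * power_per_server_w
power_after = servers_after * power_per_server_w * 1.30  # 30% higher draw

print(servers_after)                           # 20 hosts remain
print(power_before, "->", round(power_after))  # 40000 -> 10400 W
```

Even with the 30% per-host penalty, total IT power (and the cooling that tracks it) falls to roughly a quarter of the original, which is why virtualization features so prominently in efficiency programs.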
12.6.3 Multicore Processors
Processor makers realized several years ago that they could no longer ignore power consumption in the race for faster processors. Since power consumption is closely related to clock speeds, the solution was to move to multiple processor cores, each running at lower clock speeds.
12.6.4 Blade Servers
By combining many processor/memory modules into a single chassis, efficiency is gained by sharing power supplies and disk drives. Further efficiency in these models is realized by using lower-capacity fans and raising ΔT. It should be noted that the actual energy efficiency of blade servers is limited since they use the same processors, disk drives, and other components as other servers. One major benefit of blade servers is less cabling, which can reduce airflow restrictions within racks and in the underfloor.
12.6.5 Energy-Efficient Servers
Green servers are now available. These servers use more efficient components such as high-conversion-efficiency power supplies, more efficient multicore processors, and more efficient disk drives and memory.
12.6.6 Power Managed Servers
Legacy servers consume nearly the same power at idle as they do at 100% utilization. Current-generation servers have taken a page from laptop computers by using power management techniques to reduce power at idle. As of this writing, idling servers consume roughly 60% of maximum power. Next-generation servers will bring this number to 30%.
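A back-of-envelope calculation shows what the quoted idle figures mean in practice. The 500 W maximum draw below is an assumed value; the 60% and 30% idle fractions come from the text.

```python
# Annual energy for one server left idling, at today's ~60% of maximum
# draw versus the next generation's ~30%. The 500 W maximum is assumed.

max_w = 500.0
hours_per_year = 8760
idle_now, idle_next = 0.60, 0.30

kwh_now = max_w * idle_now * hours_per_year / 1000
kwh_next = max_w * idle_next * hours_per_year / 1000

print(kwh_now)              # 2628.0 kWh per idle server-year today
print(kwh_now - kwh_next)   # 1314.0 kWh saved per idle server-year
```

Multiplied across hundreds of servers, and doubled again for the cooling energy that tracks IT load, the generational improvement is substantial.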
12.6.7 Effect of Dynamic Server Loads on Cooling
Data center cooling has not traditionally had to deal with dynamic heat loads, but this is rapidly changing. Adaptive cooling methodology will need to be employed to address these new dynamic loads in an efficient manner.
12.7 BEST PRACTICES
In many disciplines, there is an accepted set of best practices that should be employed; data center cooling is no different. Table 12.2 serves as a guide to obtaining the most efficient cooling possible in both existing and new data centers.
12.8 EFFICIENCY PROBLEM SOLVING
Efficient data center cooling solutions are very site specific. Three basic scenarios with possible problems and solutions are discussed here. They can be used as a guide to problem-solving techniques.
Table 12.2. Data Center Cooling Best Practices

• Hot aisle/cold aisle layout: Cold aisles need to be 2 full tiles wide to allow proper tuning of perforated tiles. Hot aisles should be 1½ tiles minimum to allow proper flow of heated air.
• CRACs perpendicular to rows: This allows unimpeded flow of heated air back to CRAC units.
• Lower density racks at ends of rows: This minimizes wraparound of heated air to the rack intakes.
• Long rows with no space between racks: Rows of up to 15-20 racks allow hot aisle/cold aisle configurations to work best. Rows longer than this may lead to the problem of CRAC units being unable to throw air that far down the underfloor.
• No perforated tiles in hot aisles: Hot aisles should be as hot as possible to raise CRAC return temperatures.
• Minimize or close off cable cutouts: This minimizes the mixing of cooling and heated air.
• Load servers from bottom to top with no spaces between: Air is cooler at lower elevations. Spaces between servers allow recirculation.
• Use blanking panels: Required only where heated air can be felt recirculating to the server intakes.
• Dress out rack cabling: Allows server exhaust to exit the racks properly.
• Use lower humidity: Minimizes latent cooling.
• Raise chilled-water temperature: To minimize latent cooling (dehumidification) losses.
• Prevent infiltration of unconditioned air: Outside air can be difficult to maintain at proper humidity once it enters the data center.
Table 12.3. Scenario 1—Data center is overheating but has adequate cooling capacity

• Humidity control. Is humidity greater than 45%? If so, reduce it.
• Rack temperature. Is there a wide distribution? Can it be reduced through better perforated-tile distribution or other airflow improvements? Load racks from bottom to top.
• Cable cutout situation. Is there a large amount of cooling being wasted? Close the largest cable cutouts not providing cool air to servers.
• Blanking panel situation. Is air recirculating from the back of racks? Install blanking panels where required. (Note: An adequate supply of cooling air is still required.)
• CRAC return temperatures. Is there a wide distribution? IT load may be imbalanced in the room; redistribute airflow or IT load to even out CRAC loads.
Table 12.4. Scenario 2—Data center is overcooled but still has hot spots

• Poor airflow through perforated tiles: Install tiles with a greater open area, install underfloor fans, and close unnecessary cutouts in the floor.
• Recirculation of warm air through servers: Install blanking panels.
• Server exhaust impeded: Dress cables in hot aisles.
• Supply air not cool enough: Change the number of CRACs in use.
• Hot air not returning to CRACs: Reduce restrictions in the room; install ducts or fans to move warm air.
Scenario 3—New Data Center Build

The options for designing a new, efficient data center have never been more varied or complex. The pace of innovation in both IT and facilities makes choosing the right design to last for the next 10, 20, or 30 years a formidable task. Rather than try to predict which technologies and vendors will provide the best solutions, basic efficiency and design tenets are listed here:

1. Design to use as many hours of free cooling as possible.
2. If possible, choose a site in a cooler or dryer climate.
3. For sites that will start out life partially loaded, phase in cooling as needed.
4. Design for high ΔT.
5. Use ultrasonic humidification.
6. For air cooling, design for minimum mixing/highest separation and use a deep raised floor.
7. For liquid cooling, design with a redundancy plan.
12.9 CONCLUSION
With so many site-specific issues, there is no one-size-fits-all solution for data center efficiency. The key to a successful program is to evaluate the specific site, look for ways to measure progress, then choose the best method. With each success identified and documented, obtaining resources for the next phase of improvements becomes easier.
12.10 CONVERSIONS, FORMULAS, AND GUIDELINES
Thermal
1 BTU/hr ≈ 0.293 W
1 W ≈ 3.412 BTU/hr
1 ton = 12,000 BTU/hr
1 ton ≈ 3.52 kW

Airflow
CFM per ton of cooling:
CRAC (compressor based) ~ 500 CFM
CRAH (chilled-water based) ~ 450 CFM
CFM per kW of IT:
Legacy ~ 160 CFM
Blades ~ 120 CFM

Cooling System Design
Typical excess CFM supply ~ 30%
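The conversions and rules of thumb above can be expressed as small helpers. Note that the thermal figures are rates (1 W = 3.412 BTU/hr, 1 ton = 12,000 BTU/hr); the 10 kW example load is arbitrary.

```python
# Helper functions for the conversions and sizing guidelines above.

def watts_to_btuh(w: float) -> float:
    """Watts to BTU/hr (1 W = 3.412 BTU/hr)."""
    return w * 3.412

def tons_to_kw(tons: float) -> float:
    """Refrigeration tons to kW (12,000 BTU/hr per ton)."""
    return tons * 12_000 / 3_412   # about 3.52 kW per ton

def cfm_for_it_load(kw_it: float, cfm_per_kw: float = 160,
                    margin: float = 0.30) -> float:
    """Supply CFM for an IT load, with the typical ~30% excess supply.
    Default 160 CFM/kW is the legacy-server figure; use 120 for blades."""
    return kw_it * cfm_per_kw * (1 + margin)

print(round(watts_to_btuh(1000)))   # 3412 BTU/hr per kW of load
print(round(tons_to_kw(1), 2))      # 3.52
print(cfm_for_it_load(10))          # 2080.0 CFM for a 10 kW legacy load
```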
13 RAISED ACCESS FLOORS
13.1 INTRODUCTION

13.1.1 What is an Access Floor?
An access floor is literally an elevated or raised floor system installed on top of another floor (typically the concrete structural slab in a building). The raised floor shown in Figure 13.1 consists of 2-ft x 2-ft panels, generally comprised of a welded sheet metal sandwich filled with lightweight concrete to form a strong, quiet, fire-resistant, accessible floor structure that is supported by adjustable-height pedestal supports. The raised access floor system (floor panels and support pedestals) creates a space between the underfloor slab and the access floor above. This space is where you can cost-effectively run any of your building services, which typically include electric power, data, telecom/voice, environmental control and air conditioning, fire detection and suppression, security, and so on. The access floor panels or tiles are designed to be removable from their support structure so that access to these underfloor services is fast and easy. The plenum created by the raised floor is also typically used as part of the data center air-conditioning system. Cooled air that is pushed under the raised floor is distributed by strategically located perforated tiles that deliver cooling to the heat-generating IT equipment. Cutouts are also made in the floor tiles to bring power and data wiring to this equipment. Hinged-cover termination boxes and pass-through brush grommets are typically used to minimize cold-air leakage into equipment or areas where it is not required.

Figure 13.1. Typical raised floor. (Courtesy of Tate Access Floors.)

Raised access floors are an essential part of mission critical facilities due to the need for efficient and convenient cold air, power, and data distribution. Access floors were in fact developed at the beginning of the computer age in the 1950s to provide the increased airflow required to keep equipment and control rooms at safe operating temperatures. Whereas increased airflow is still a major benefit, as data centers have evolved, so have access floors. There are now a variety of panel grades and types to accommodate practically every situation. There have also been advancements in electrical and voice/data distribution that increase flexibility, adding to the benefits of an access floor. Overall, the use of an access floor will help you maximize the use of your services. Whether you are designing a raised access floor for a new facility or looking to improve an existing floor, there are certain industry best practices that bear consideration. Openings in the floor used to pass wires through should be sealed to restrict airflow, and cables running out of the servers should be dressed and organized. Computational fluid dynamics (CFD) software may be used to determine the most efficient placement of perforated tiles to minimize the mixing of hot and cool air.
13.1.2 What are the Typical Applications for Access Floors?

Specific requirements of the space will determine the potential areas in which raised access flooring can be considered. Areas in which raised access flooring can be used are:
• General office areas
• Auditoriums
• Conference rooms
• Training rooms
• Executive offices
• Computer rooms
• Telecom closets
• Equipment rooms
• Web farms
• Corridors
• Cafeterias/break rooms
• Bathrooms/restrooms
• Storage areas
• Lobby/reception areas
• Print room
13.1.3 Why Use an Access Floor? A raised access floor is used because it is a more cost-effective service distribution solution from a first-cost and operating-cost perspective, providing easy-to-move service terminals (power, voice, and data) and air at any point on the floor. Furthermore, as technology advances, it is critical that your facility can be easily and economically modified with minimal interruption. An access floor gives you that flexibility. The interchangeability of solid and airflow panels allows you to concentrate your airflow wherever needed to cool sensitive equipment, which can produce an enormous amount of heat. To summarize, an access floor:
• Is less expensive to build
• Speeds up the construction cycle
• Reduces slab-to-slab height by up to 1 ft per story
• Greatly reduces HVAC ductwork
• Is less expensive to operate
• Provides ongoing cost savings per HVAC zone addition
• Allows efficient equipment relocation or addition with minimum interruption

13.2 DESIGN CONSIDERATIONS
When selecting the appropriate access floor system for your facility, you must take several factors into consideration: structural performance, finished floor height, understructure support design type, and the surface finish of the floor panels.
13.2.1 Determine the Structural Performance Required
It is important at an early stage in the consideration of a raised access floor that a detailed assessment is made of the likely loadings that will be imposed on the floor surface. These loadings need to be assessed in terms of the following.
Design Load. A design load (Figure 13.2) is the load expected to be applied to a small area of the panel surface during normal use. These loads are typically imposed by stationary furniture and equipment that rest on legs. In the laboratory test, a specified design load is applied to the access floor system on a one-inch-square area while deflection is measured under the load to determine the yield point. The permanent set (rebound) is measured after the load is removed. The rated design load capacity indicates the load that can be safely placed on a single floor panel without it yielding or becoming permanently deformed. Design loads are also referred to as working loads. Design load tests are performed with panels on the actual understructure, rather than on rigid blocks, so that the panels and understructure are tested as a system, which is more representative of an actual installation.
Figure 13.2. A design or working load is a single load applied to a small area of the panel surface. (Courtesy of Tate Access Floors.)
Rolling Load. Rolling loads (Figure 13.3) are applied by wheeled vehicles carrying loads across the floor. Pallet jacks, electronic mail carts, and lifts are some examples. The rated rolling load capacity indicates the load allowed on a single wheel rolling across one panel. Rolling loads are defined by the amount of weight on the wheels and the number of passes across the panel. The 10-pass specification is used to rate a panel for rolling loads that will occur infrequently over the life of the floor (usually referred to as move-in loads). The 10,000-pass specification is used to rate panels for rolling loads that may occur many times over the life of the floor.
Figure 13.3. Rolling loads are applied by wheeled vehicles carrying loads across the floor. (Courtesy of Tate Access Floors.)
Uniform Loads. Uniform loads (Figure 13.4) are applied uniformly over the entire surface of the panel and are typically imposed by items such as file cabinets, pallets, and furniture or equipment without legs. The rated uniform load capacity indicates the panel's load capacity stated in pounds per square foot.
Figure 13.4. Uniform loads are applied uniformly over the entire surface of the panel, typically imposed by file cabinets, pallets, and furniture or equipment without legs. (Courtesy of Tate Access Floors.)
Ultimate Loads. The ultimate load capacity of a floor panel is reached when it has structurally failed and cannot accept any additional load (Figure 13.5). The ultimate load capacity is not an allowable load number; it merely indicates the extent of the panel's overloading resistance. A panel loaded to this degree will bend considerably and will be destroyed. You should never intentionally load a floor panel anywhere near its ultimate load rating. It is recommended that the panel's ultimate load or safety factor be rated at least two times its design or working load. The ultimate load test is done on actual understructure in conjunction with the design load test. Impact Loads. Impact loads (Figure 13.6) occur when objects are accidentally dropped on the floor. In the laboratory test, a load is dropped 36 inches onto a steel plate that is resting on top of a one-inch-square steel indentor placed on the panel. Because the test load is concentrated onto a small area, it generates a severe shock, which makes the test a safe indication of the panel's impact capacity. Heavy impact loads, which most often occur during construction, move-in, and equipment/furniture relocations, can cause permanent panel damage.
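The recommended two-times relationship between a panel's ultimate load and its design (working) load can be expressed as a one-line check. This is a minimal sketch; the function name and example values are illustrative, not from the text:

```python
def meets_recommended_safety_factor(design_load_lbs, ultimate_load_lbs,
                                    factor=2.0):
    """Recommendation from the text: a panel's ultimate load should be
    rated at least two times its design (working) load."""
    return ultimate_load_lbs >= factor * design_load_lbs

# A panel rated 1250 lbs design / 2500 lbs ultimate meets the 2x guideline.
ok = meets_recommended_safety_factor(1250, 2500)
```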
13.2.2 Determine the Required Finished Floor Height
Finished floor height is defined as the distance from the concrete slab subfloor to the top surface of the access floor panel. Finished floor heights can range from as low as 2.5 inches to as high as 6 feet and above with special understructure supports, but typical mission critical facilities will most often fall in the 13- to 36-inch range, depending on the services required under the floor. When selecting the appropriate finished floor height, take the following into consideration.
Figure 13.5. The ultimate load capacity of a floor panel is reached when it has structurally failed. (Courtesy of Tate Access Floors.)
Figure 13.6. Impact loads occur when objects are accidentally dropped on the floor. (Courtesy of Tate Access Floors.)
Consider the Service Requirements. These services will typically include electrical power, data, telecom/voice, environmental/air conditioning, fire detection/suppression, security, and so on. An evaluation of each of these underfloor services will give an indication of the space required for each service run. Due regard needs to be taken for any multiple service runs, crossover of services, separation requirements, and so on. This information can then be used to determine the cavity depth required under the raised floor and, hence, the finished floor height, which will then be used in specifying the raised access floor system. For access floors using underfloor electrical distribution, a 13-inch finished floor height would be typical. For an access floor utilizing underfloor air, 18- to 24-inch finished floor heights are more common. Although these finished floor heights are typical, each facility will need to be evaluated based on the amount of cable and air handling equipment to determine the appropriate finished floor height.
Minimum Floor-to-Ceiling Requirements. Within any building design, the minimum dimension from the top of the floor to the underside of the ceiling is governed by building regulations and is therefore outside the designer's area of influence. However, the use of a raised access flooring system in conjunction with underfloor modular cable management and/or underfloor HVAC will allow this dimension to be optimized.
13.2.3 Determine the Understructure Support Design Type Required
Various understructure design options are available to accommodate structural performance requirements, floor height, and accessibility needs. It is important at an early
stage in the consideration of a raised access floor that a detailed assessment be made of the likely loads that will be imposed on the understructure. These loads need to be assessed in terms of the following:
• Overturning Moment. Overturning moment is a lateral load applied to the pedestal due to rolling load traffic, underfloor work such as wire pulls, or seismic activity. It is determined by multiplying the lateral load applied to the top of the pedestal by the pedestal's height. A pedestal's ability to resist an overturning moment is a function of its attachment method to the structural slab, the size and thickness of its base plate, and the pedestal column size.
• Axial Load. Axial load is a vertical load applied to the center of the pedestal due to design, rolling, uniform, and other loads applied to the surface of the access floor panel.
• Seismic Load. Seismic load is a combination of vertical (both up and down) motion on the surface of the access floor as well as lateral loads.
• Understructure Design. Ninety percent of mission critical data centers utilize a bolted stringer understructure for lateral stability and convenient access to the underfloor plenum (Figure 13.7). Bolted stringers can be 2 ft, 4 ft, or a combination of both. The 4 ft basket-weave system provides the most lateral support, since it ties multiple pedestals together with a single stringer. It also makes aligning the grid easier at installation and helps keep the grid aligned during panel removal and reinstallation. The 2 ft system gives you flexibility, as fewer panels need to be removed if stringers need to be removed to perform service or run additional cabling under the floor. Cables can be laid directly on the slab, or can be cradled in cable trays positioned between the slab and the bottom of the access floor panels. The combination 2 ft/4 ft system will give you some benefits of both of the previous options.
Figure 13.7. Typical bolted stringer understructure and cable tray. (Courtesy of Tate Access Floors.)
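Since the overturning moment is simply the lateral load at the top of the pedestal multiplied by the pedestal's height, the check can be sketched in a few lines of Python. The function name and the example numbers below are illustrative, not from the text:

```python
def overturning_moment(lateral_load_lbs: float, pedestal_height_in: float) -> float:
    """Overturning moment on a pedestal, in inch-pounds: the lateral load
    applied at the top of the pedestal multiplied by the pedestal's height."""
    return lateral_load_lbs * pedestal_height_in

# Hypothetical example: a 50 lb lateral pull (e.g., from a wire pull)
# on an 18-inch pedestal produces a 900 in.-lb overturning moment.
moment = overturning_moment(50, 18)
```

A taller floor therefore demands proportionally stronger pedestal base attachment for the same lateral load.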
13.2.4 Determine the Appropriate Floor Finish
There are many options for factory-laminated floor finishes, but most are not appropriate for mission critical data centers. You will most likely want to select a finish with some degree of static control or low static generation. Floor tile for rooms requiring control of static electricity is often specified incorrectly because the tile names and definitions are easily misconstrued. There are two basic types of static-control floor tile: low-static-generating tile and static-conductive tile. Static-conductive tile has a close cousin named static-dissipative tile.
Low-Static-Generation Floor Tile. High-pressure laminate tile (HPL) is a low-static-generating tile informally referred to as antistatic tile. It typically provides the necessary static protection for computer rooms, data centers, and server rooms. Its purpose is to protect equipment by limiting the amount of static charge generated by people walking or sliding objects on the floor. It offers no protection from static charges generated in other ways. The static dissipation to ground is extremely slow with HPL because its surface-to-ground electrical resistance is very high. In addition to using tile to control static generation and discharge, generation should also be controlled by maintaining the proper humidity level, as well as by the possible use of other static-control products. The standard language for specification of HPL electrical resistance is the following: Surface to Ground Resistance of Standard High Pressure Laminate Covering: Average test values shall be within the range of 1,000,000 ohms (1.0 × 10^6 ohms) to 20,000 megaohms (2.0 × 10^10 ohms), as determined by testing in accordance with the test method for conductive flooring specified in Chapter 3 of NFPA 99, but modified to place one electrode on the floor surface and to attach one electrode to the understructure.
Resistance shall be tested at 500 volts.*
*High-pressure laminate—design specification. (Courtesy of Tate Access Floors.)
Static-conductive floor tiles are typically not used in computer rooms for two noteworthy reasons. First, when conductive tiles are used on a floor, occupants must wear conductive footwear or grounding foot straps in order for static charges to dissipate through their shoes into the tile; rubber, crepe, or thick composition soles insulate the wearer from the floor and render conductive tile ineffective. Second, conductive floor tile can present an electrical shock hazard, since its electrical resistance is not high enough to safely limit current to the floor surface from an understructure system that becomes accidentally charged.
Conductive and Static-Dissipative Floor Tile for Sensitive Equipment Areas. Static-conductive floor tile is typically used in ultrasensitive environments such as electronics manufacturing facilities, assembly factories, clean rooms, and MRI rooms. Its surface-to-ground electrical resistance is very low; therefore, it provides rapid static dissipation to ground. The standard language for specification of electrical resistance of conductive tile is the following: Surface to Ground Resistance of Conductive Floor Covering: Average test values shall be within the range of 25,000 ohms (2.5 × 10^4 ohms) to 1,000,000 ohms (1.0 × 10^6 ohms). Conductive tiles may be tested in accordance with either NFPA 99 or ESD S7.1.*
There is also a more resistive tile with a resistance range of 1,000,000 ohms (1.0 × 10^6 ohms) to 100,000,000 ohms (1.0 × 10^8 ohms). This tile is usually referred to as static-dissipative tile, rather than static-conductive tile. (Both tiles dissipate static electricity to ground, but conductive tile does it much faster.)
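The three resistance ranges quoted above can be collected into a small lookup. This is a minimal sketch for comparing a measured surface-to-ground resistance against the specification ranges; the function and dictionary names are illustrative, and the ohm values are taken directly from the specification language in the text:

```python
# Surface-to-ground resistance ranges (ohms) from the specification text.
# Note the HPL and static-dissipative ranges overlap at the low end.
TILE_RANGES = {
    "static-conductive": (2.5e4, 1.0e6),
    "static-dissipative": (1.0e6, 1.0e8),
    "antistatic HPL": (1.0e6, 2.0e10),
}

def tiles_meeting(resistance_ohms):
    """Return the tile classes whose specified resistance range covers a
    measured value (ranges overlap, so more than one class may match)."""
    return [name for name, (lo, hi) in TILE_RANGES.items()
            if lo <= resistance_ohms <= hi]
```

For example, a reading of 10 megohms falls inside both the static-dissipative and antistatic HPL ranges, which is exactly why these tile types are so often confused in specifications.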
13.2.5 Airflow Requirements
Selecting the Appropriate Airflow Method. Airflow can be achieved in many ways, including grilles and diffusers, but mission critical data centers typically require a large volume of air to cool the equipment, so perforated panels or grates are usually selected. Both are interchangeable with solid panels, so it is relatively easy to modify airflow as needs change.
Perforated Panels. Perforated panels are generally of an all-steel design with a series of holes that create a 25% open area in the panel surface for air circulation. The airflow can be regulated by means of a damper located on the underside of the panel if desired. Perforated panels can be laminated with the same factory finish as solid panels. Tate Access Floors' Perf 1000 with its airflow chart is shown in Figure 13.8.
Figure 13.8. Tate Access Floors' Perf 1000 with airflow chart. (Courtesy of Tate Access Floors.)
Grate Panels. Grate panels are generally made from aluminum, with a series of honeycomb slots that create a 56% open area for air circulation. The airflow can be regulated by means of a damper located on the underside of the panel if desired. Grate panels cannot be laminated with a tile, but they can be coated with conductive or nonconductive epoxy or metallic coatings, or can simply be installed bare. Tate Access Floors' GrateAire™ with its airflow chart is shown in Figure 13.9.
Air Leakage. For areas without perforated or grate panels, the floor must remain properly sealed to control air leakage and maintain proper air volume to the areas requiring airflow. Excessive leakage can reduce temperature control and cause poor air distribution performance. Be aware that air can leak through construction seams where walls, columns, and other obstructions extend through the access floor, and through access floor, wall, and slab penetrations created for ducts, conduits, and pipes. Leakage between the panel seams is typically not an issue because of the stringer, but, if necessary, gaskets can be applied to the stringer to reduce leakage.
*Surface-to-ground resistance design specification. (Courtesy of Tate Access Floors.)
Figure 13.9. Tate Access Floors' GrateAire™ with airflow chart. (Courtesy of Tate Access Floors.)
New construction seams and utility penetrations created after the floor installation must be sealed. The access floor perimeter should remain sealed with snug-fitting perimeter panels and with the wall base firmly pressed against the floor. Perimeter seams where panels do not consistently fit tightly (within approximately 1/16 inch of the wall), due to rough or irregular-shaped vertical surfaces, should be caulked, gasketed, taped, or otherwise sealed to maintain air tightness. For areas where cables pass through to the plenum beneath the access floor, be sure to use grommets or cutout trim and seal designed to restrict air leakage (Figure 13.10). These can be obtained from the access floor manufacturer or directly through a grommet manufacturer such as Koldlok®.
13.3 SAFETY CONCERNS
13.3.1 Removal and Reinstallation of Panels
Proper panel removal and reinstallation is important so that injuries to people and damage to the floor system are prevented. Stringer systems require particular care to maintain squareness of the grid.
Panel Lifters. Suction cup lifters (Figure 13.11) are used for lifting bare panels and panels laminated with vinyl, rubber, HPL, VCT, or other hard surface coverings. Perf panel lifters (Figure 13.12) are used for lifting perforated airflow panels and can also be used to lift grates.
13.3.2 Removing Panels
When underfloor access is required, remove only those panels in the area where you need access and reinstall them as your work progresses from one area to another. (The more panels you remove at one time, the more likely some pedestal height settings will be inadvertently changed.) The first panel must be removed with a panel lifter. Adjacent panels can be removed with a lifter or by reaching under the floor, pushing them upward, and grabbing them with your other hand.
Figure 13.10. For areas where cables pass through to the plenum, grommets or cutout trim and seal restrict air leakage. (Courtesy of Tate Access Floors.)
Figure 13.11. Suction cup lifter. (Courtesy of Tate Access Floors.)
Lifting a Solid Panel with a Suction Cup Lifter. Attach the lifter near one edge of the panel and lift up that side only (as if opening a hinged door). Once the panel's edge is far enough above the plane of the adjacent panels, remove it by hand. You may find that kneeling on the floor while lifting panels reduces back strain. Take these precautions:
• Never attempt to carry a panel by the lifter. The suction could break and allow the panel to fall from your hand.
Figure 13.12. Perf panel lifter. (Courtesy of Tate Access Floors.)
• Do not use screwdrivers, pliers, or other tools to pry or lift panels.
• Do not change the height adjustment of the pedestals when removing panels.
To lift a perforated panel without a lifter, remove an adjacent solid panel so that you can reach under it and push it out of the floor by hand. If you use a perf panel lifter, hook the lifter through a perforation hole on one side of the panel and lift up until you can place your other hand under it; then remove it by hand.
Reinstalling Panels. All but the last panel to be reinstalled can be placed on the understructure by hand or with a lifter. When using a lifter, attach the lifter near one edge of the panel, set the opposite edge of the panel on the understructure, and lower the side with the lifter in your hand as if closing a door. Kneeling on the floor while replacing panels may avoid back strain. Take these precautions:
• Do not overtorque Posilock or stringer screws. Your torque setting should be no more than 35 in.-lbs.
• Do not attempt to reinstall panels by kicking them into place.
• Do not change the height adjustment of the pedestals when replacing panels.
• Make a final check to verify that panels are correctly in place and level.
13.3.3 Stringer Systems
A stringer system can become out of square when people remove large blocks of panels from the floor and unintentionally twist the grid while performing underfloor work. Correcting an out-of-square grid condition is not an in-house, do-it-yourself task; it must be performed by a professional installer. To prevent grid twisting from occurring in the first place, adhere to the instructions below. Be aware that two-foot stringer systems are more susceptible to grid twisting than four-foot (basketweave) systems.
Precautions to Prevent Grid Twisting • When you need to work under the floor, remove only as many floor panels as necessary to do the work. Keeping surrounding panels in the floor helps to maintain grid squareness: the more panels you remove at one time, the greater the possibility that the grid will be accidentally twisted. Replace panels as your work progresses from one area to another. • Never pull electrical cables against the pedestals and stringers. • When removing and reinstalling cut panels at the floor perimeter and around columns, reinstall each section exactly where it came from. Panels are custom cut for each perimeter location and are not interchangeable. Walls are often wavy or out of square: interchanging differently sized perimeter panels can cause floors to become out of square, and can also cause panels to become too tight or too loose in the floor.
• Keep foot traffic and rolling loads on the access floor away from the area where the panels and stringers are to be removed. Failure to do this can easily cause panels to shift and stringers to twist on the heads.
• When removing stringers for complete access, remove only those stringers necessary for access and replace them as your work progresses from one area to another.
• Do not walk on stringers or put lateral loads on stringers when panels are removed from the floor.
• When replacing stringers, make sure that the adjacent pedestal heads attached to each stringer are square to one another before final tightening of the stringer screws.
• Do not overtorque the stringers with your driver when reattaching the stringers to the pedestal heads. Set the torque on your screw gun between 30 and 35 inch-lbs.
• Do not try to resquare the stringer grid by hitting stringers with a hammer. This can result in the formation of bumps on top of the stringers that will cause panels to rock.
13.3.4 Protection of the Floor from Heavy Loads
The following types of heavy loading activity may merit the use of floor protection: construction activity, equipment move-in, facility reconfiguration, use of lift equipment or pallet jacks on the floor, and frequent transportation of heavy loads in areas where delivery routes exist. A look at the allowable loads for your panel type will tell you whether the loads you intend to move are within the capacity of the floor. The allowable rolling load indicates the load allowed on a single wheel rolling across a panel. Determining if a Rolling Load is Within the Floor's Capacity. First determine the type(s) of floor panels in the load path. Be mindful of the possibility that there may be different types of floor panels installed between delivery routes and other areas of the floor. For demonstration purposes, we will refer to the panel loads in Table 13.1. For one-time movement of a heavy load, refer to the column headed Rolling loads: 10 passes. ("10 passes" means that tests demonstrate that the panel can withstand the allowable load rolled across its surface 10 times without incurring excess deformation.) To determine if a load is within the panel's allowable capacity, ascertain the wheel spread of the vehicle. When there is at least 24 inches between wheels (front to back and side to side), there will be only one wheel on any floor panel at a time (provided that the load is not moved diagonally across individual panels). If the payload is evenly distributed, divide the combined weight of the vehicle and its payload by four to determine the load on each wheel and compare it to the allowable load in the table. If the payload is unevenly distributed, estimate the heaviest wheel load and compare it to the table. Caution: Avoid moving heavy loads diagonally across floor panels as this could allow two wheels to simultaneously roll over opposite corners of each panel in the path. Failure to avoid this could result in excess panel deformation.
Table 13.1. Panel loads (allowable loads)

Panel type       Design loads (lbs)   Rolling loads: 10 passes (lbs)   Rolling loads: 10,000 passes (lbs)   Uniform loads (lbs/ft2)
ConCore 1000     1000                 800                              600                                  250
ConCore 1250     1250                 1000                             800                                  300
ConCore 1500     1500                 1250                             1000                                 375
ConCore 2000     2000                 1500                             1250                                 500
ConCore 2500     2500                 1500                             2000                                 625
All steel 1000   1000                 400                              400                                  250
All steel 1250   1250                 500                              500                                  300
All steel 1500   1500                 600                              600                                  375
Floor Protection Methods for Heavy Rolling Loads. There are several methods of protecting the floor from heavy rolling loads:
1. Place load-spreading plates on the floor along the load path.
2. Reinforce the floor with additional support pedestals along the load path.
3. Combine methods 1 and 2.
4. Install heavy-grade floor panels along the load path.
5. Use load-moving equipment equipped with air bearings (also called air casters or air cushions).
6. Use a moving vehicle with six or more wheels that are spread more than 24 inches apart.
Load-Spreading Plates. The load path can be covered with one of the following types of load-spreading plates:
• One layer of 3/4-inch thick plywood sheet
• Two layers of 1/8-inch thick steel plate
• One layer of 1/4-inch thick aluminum plate
When moving very heavy loads, you should place either one layer of 1/4-inch aluminum plates or 1/8-inch steel plates on top of 3/4-inch sheets of plywood. Overlap the ends of the metal plates to a position near the center of the wood sheets to keep the load from becoming too severe when the wheels reach the ends of the wood sheets. Caution: When using 1/4-inch spreader plates or any thicker combination of plates, you need to create a step-down at the end of the plate so that the load will not drop to the access floor. You can accomplish this by using any rigid material that is approximately half the thickness of the load spreader.
Floor Reinforcement. The floor panels in a heavy load path can be reinforced by having the access floor installer install additional support pedestals beneath them.
Five-point supplemental support for each panel is achieved by positioning one additional pedestal under each panel center and one under the midpoint of each panel edge where two panels abut. Each pedestal will be glued to the subfloor with approved pedestal adhesive and height-adjusted so that each head supports the undersides of the panels. (Because edge pedestals are shared between adjoining panels, the number of additional pedestals required amounts to three per panel.) This method of floor reinforcement is often used in delivery corridors where heavy loads are moved on a daily basis and in areas where lift equipment is used to perform overhead maintenance work.
Load-Spreading Plates and Panel Reinforcement. On occasions when a one-time rolling load may slightly exceed floor capacity, reinforcement pedestals should be installed along the load path and load-spreading plates should be placed on top of the floor.
Heavy-Grade Floor Panels. Floor panels with different load ratings can be installed alongside one another to create a more supportive path for heavy loads. All panel types are typically interchangeable, but keep in mind that heavier-grade panels may require heavier-duty understructure than standard-grade panels. Installation of floor panels and pedestal heads should be performed by the access floor installer.
Load-Moving Equipment Equipped with Air Bearings. Load-moving equipment equipped with air bearings has been successfully used to move equipment on access floors. The equipment uses compressed air to float loads across the floor on a low-friction air film. The bearing area of the air casters is considerably larger than that of wheeled casters, so the use of load-spreading plates can typically be avoided. Perforated floors require only a thin polystyrene sheet to create an air seal.
Additional Equipment-Movement Precautions
• Before rolling a load on the floor, make sure that all floor panels in and around the load path are in place to prevent the floor from shifting.
• When moving heavy loads on a floor without spreader plates, select a vehicle with the largest wheels available. The larger and wider the wheels, the less potential there is for localized panel indentation. • When moving motorized vehicles on an access floor, avoid sudden stops and rapid accelerations. Hard starting and stopping can damage the understructure supporting the floor. • Do not allow heavily loaded vehicles to drop to the floor when stepping down from a higher level. Precautions for Raised Floor Ramps. Ramps and landings should be protected with 3/4-inch plywood before moving very heavy loads on them. There should be plywood extending to the edge of the raised floor where it meets the top of the ramp. The plywood at the top of the ramp should be connected to the plywood at the edge of the floor by a section of sheet metal attached to each piece of wood with sheet metal
screws. A metal plate may be needed at the bottom of the ramp to step up to the plywood (the plate can be held in place by gravity).
Using Lift Equipment on the Floor. Whenever possible, overhead work that requires the use of heavy lift equipment should be performed prior to installation of the access floor. Where lifts will be occasionally used on an access floor for overhead maintenance, the system should be selected accordingly: use heavy-grade panels installed on a four-foot bolted stringer understructure system. A lift with a side-to-side wheel spread greater than 24 inches is recommended so that the wheels will be on separate floor panels. To determine the compatibility of a particular lift with the floor, add the weight of the expected platform load to the weight of the unit and distribute the combined load to each wheel. Compare it to the allowable load in Table 13.1, in the column headed Rolling loads: 10,000 passes. Two protective measures can be used to fortify floor areas that will require overhead maintenance throughout the life of the building. The first is to reinforce the floor with five-point supplemental pedestal support per panel, as previously mentioned. The second is to place load-spreading plates on the floor whenever maintenance with a lift is required. These two steps should be used simultaneously if the lift equipment has a wheel spread that is not fully 24 in. between wheels. Installation of additional support pedestals should be performed by the access floor installer.
Using Electronic Pallet Jacks on the Floor. The floor protection guidelines also apply to electronic pallet jacks, which can be very heavy when loaded. A pallet jack's weight is unevenly distributed to its wheels; therefore, the load per wheel needs to be calculated according to Table 13.2. The wheel loads can then be compared to the allowable load in the load table. An empty electronic jack typically bears 80% of its weight on the drive wheel and 10% on each fork wheel.
To calculate the load per wheel of a loaded jack, enter the appropriate weights into Table 13.2.

Table 13.2. Load calculation table for electronic pallet jack

Jack and payload       Drive wheel load              Fork wheel load
Empty jack weight:     80% of jack weight:           10% of jack weight:
Payload weight:        40% of payload:               30% of payload:
Total weight:          Total load on drive wheel:    Total load on fork wheel:

Caution: Verify the distance between the fork wheels of your truck; many are built with less than 24 inches between them. If two fork wheels will simultaneously ride across a single column of panels, you need to combine the weight on the two wheels (or a portion of the weight on the two wheels) to accurately determine the load on the panels. Using a pallet jack with both fork wheels simultaneously riding across a single column of panels over an extended period of time will cause panel deformation if the load exceeds the panel's rating. To accommodate a jack under this condition, you may need to have the floor reinforced with additional pedestals and/or a heavier grade of floor panel. Installation of floor panels and additional pedestal heads should be performed by the access floor installer. You should also avoid moving the jack diagonally across floor panels, as this could allow two wheels to simultaneously roll over opposite corners of each panel in the path.
Using Manual Pallet Jacks on the Floor. To calculate the load on each wheel of a manual pallet jack, add the weight of the payload to the weight of the jack, then distribute 20% of the total weight to the steering wheel and 40% of the total weight to each fork wheel. The wheel loads can then be compared to the allowable load in the load table.
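The arithmetic for both jack types can be sketched as follows. The weights are hypothetical; the percentage splits are the ones given above (Table 13.2 for electronic jacks, the 20%/40% rule for manual jacks).

```python
def electric_jack_wheel_loads(jack_wt_lbs, payload_wt_lbs):
    """Per Table 13.2: drive wheel carries 80% of jack weight + 40% of payload;
    each fork wheel carries 10% of jack weight + 30% of payload."""
    drive = 0.80 * jack_wt_lbs + 0.40 * payload_wt_lbs
    fork = 0.10 * jack_wt_lbs + 0.30 * payload_wt_lbs
    return drive, fork

def manual_jack_wheel_loads(jack_wt_lbs, payload_wt_lbs):
    """Manual jack: 20% of total weight to the steering wheel,
    40% of total weight to each fork wheel."""
    total = jack_wt_lbs + payload_wt_lbs
    return 0.20 * total, 0.40 * total

print(electric_jack_wheel_loads(1000, 2000))  # (1600.0, 700.0)
print(manual_jack_wheel_loads(200, 1800))     # (400.0, 800.0)
```

Each resulting wheel load is then compared to the allowable load in Table 13.1, remembering to combine the two fork-wheel loads when the fork wheels ride a single column of panels.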
Protecting the Floor's Finish from Heavy Move-In Loads. Laminated floor panels and painted-finish floor panels are susceptible to scratching, minor indentations, and marring by the wheels of heavily loaded vehicles. Vinyl tiles are especially susceptible because wheels can leave permanent impressions in them. To avoid this type of damage, cover the floor with sheets of fiberboard (such as Masonite), plywood, heavy cardboard, or rigid plastic (such as Plexiglas). The sheets should be taped end to end with duct tape to ensure a continuous path.
Floor Protection Methods for Heavy Stationary Equipment. Heavy stationary loads are most often design loads consisting of equipment with legs or feet. For these loads, refer to the Design load column in Table 13.1. A few types of design loads may slightly exceed panel capacity and require additional panel support. The most common items are uninterruptible power supply (UPS) units and safes. There are generally three ways to support these heavy design loads:
1. Reinforce the floor by installing additional support pedestals under the equipment legs.
2. Permanently place a load-spreading plate (such as an aluminum sheet) on the access floor under the equipment.
3. Substitute heavy-grade panels in the areas supporting the heavy equipment.
Determining if a Static Load is Within the Floor's Capacity. To determine whether a load is within the floor's capacity, first determine the leg footprint of the equipment. If the legs are more than 24 inches apart (and the equipment is situated square with the floor panels), the load will automatically be distributed to separate panels. If this is the case, divide the load among the legs (disproportionately, if the load is uneven) and compare the load on each panel to the allowable load in the load table. If a piece of equipment will have two legs on a single panel, combine the loads on the two legs.
Caution: If there will be separate pieces of equipment within close proximity,
determine whether two legs will reside on a single panel. Combine the loads for comparison to Table 13.1.
Reinforcing the Floor with Additional Pedestals. The floor is reinforced by installing an additional support pedestal under (or near) each leg of the equipment and adhering it to the subfloor with approved pedestal adhesive. (The additional pedestals consist of standard pedestal bases with flat pedestal heads.) Each pedestal's height is adjusted so that its head supports the underside of the panel.
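The leg-footprint check can be sketched as follows. The coordinates and loads are hypothetical; legs are grouped by the 24-inch panel each rests on, and the worst panel total is compared against the Design load column of Table 13.1.

```python
from collections import defaultdict

def panel_loads(legs, panel_size_in=24.0):
    """Sum leg loads per floor panel.
    legs: list of (x_in, y_in, load_lbs), measured from a panel corner,
    assuming the equipment sits square with the panel grid."""
    loads = defaultdict(float)
    for x, y, load in legs:
        loads[(int(x // panel_size_in), int(y // panel_size_in))] += load
    return dict(loads)

# Hypothetical four-legged unit: the two left legs share one panel,
# so their loads combine; the two right legs land on separate panels.
loads = panel_loads([(6, 6, 400), (6, 18, 400), (36, 6, 500), (36, 30, 500)])
print(max(loads.values()))  # 800.0 -> compare to the Design load rating
```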
13.3.5
Grounding the Access Floor
The access floor understructure must be grounded to the building. The type of understructure used determines the number of ground-wire connections required. The final determination of the number, type, and installation method of building ground wires should be made by an electrician, and the ultimate design should be approved by the project electrical engineer. A typical recommendation is to use #6 AWG copper wire. The floor can be bonded using components available from electrical supply companies. We recommend attaching the wire connector to the top portion of the pedestal assembly (to the stud or the top of the pedestal tube) or to the pedestal head itself (Figure 13.13). For stringer systems, attach a ground connector to a minimum of one pedestal for every 3000 square feet of access flooring.
Using a Consolidation Point. One grounding design possibility is to locate a copper bus bar centrally under the access floor so that the facility has a common consolidation point at which to terminate all bonding conductors (Figure 13.14). A single bonding conductor can then be taken back to the electrical panel supplying power to the equipment in the room. This helps to eliminate any stray ground currents that could be floating in the understructure.
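The stringer-system rule of thumb works out as a one-liner. This is a sketch only; the electrician and the project electrical engineer determine the actual number and placement of connections.

```python
import math

def min_ground_connections(floor_area_sqft, sqft_per_connection=3000):
    """Stringer systems: ground at least one pedestal per 3000 sq ft of floor."""
    return max(1, math.ceil(floor_area_sqft / sqft_per_connection))

print(min_ground_connections(8000))  # 3
```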
Figure 13.13. Typical grounding method at the pedestal head assembly. (Courtesy of Tate Access Floors.)
Figure 13.14. Using a consolidation point: bonding conductors terminate at a bus bar, with a single conductor run back to the service panel. (Courtesy of Tate Access Floors.)
13.3.6
Fire Protection
From an access floor point of view, the only applicable code is NFPA 75, which simply states that access floors must be Class A for surface burning and smoke-developing characteristics, and that the understructure support must be noncombustible. The best way to guard against fire hazards is to select a system that is either of all-steel welded construction, or all-steel welded construction with a cementitious core. Avoid products that are made of wood or polymeric materials unless they are adequately encased to avoid exposure.
13.3.7
Zinc Whiskers
Some form of corrosion protection is necessary on steel access floor components, but the type of coating selected is important to avoid future problems. Zinc whiskers (also called zinc needles) are tiny hair-like crystalline structures of zinc that have been found to sometimes grow on the surface of electroplated steel. Electroplating is a commonly used method of galvanizing steel and has been used on a variety of steel products now present in data centers and other computer controlled environments. In recent years, whiskers have been found growing on electroplated components of computer hardware, cabinets, and racks, as well as on some wood-core access floor panels. In some cases, these electrically conductive whiskers are dislodged from electroplated structures and infiltrate circuit boards and other electronic components via the room's pressurized air cooling system, causing short circuits, voltage
variances, and other signal disturbances. These events, of course, can cause equipment service interruptions or failures. An alternative to electroplating, which is applied through a high-energy electrical current, is hot dip galvanizing, which is applied through a molten bath. The hot dip process does not produce the zinc whisker effect and, therefore, should be specified.
13.4
PANEL CUTTING
For all-steel panels or cement-filled panels that do not contain an aggregate, cutouts for accessories or cable passageways and panel cuts for system modifications can often be provided by the access floor manufacturer. If you opt to cut floor panels yourself, always use the proper equipment and follow the recommended safety requirements below.
13.4.1
Safety Requirements for Cutting Panels
When using a hand-held, heavy-duty (industrial) reciprocating saw, follow these guidelines:
• Use a bench or worktable with clamps to cut the panels.
• Work in a well-lighted area.
• Be sure tools are properly grounded and dry.
• Cut outside the data center.
Use personal protection gear:
• Safety glasses (or full-face shield) and ear protection
• Long-sleeve shirt or sleeve protectors
• Lightweight work gloves to protect from sharp metal edges and hot sawdust
• Steel-toe safety shoes
13.4.2
Guidelines for Cutting Panels
Cut the panels in an area where cement dust can easily be cleaned up. After cutting a cement-filled panel, wipe the cut edge with a damp cloth before returning it to the floor. We do not recommend sealing cut panels because the cement-fill mixture contains a binder that prevents cement dust emission after the cut edges have been wiped clean.
13.4.3 Cutout Locations in Panels—Supplemental Support for Cut Panels
To maintain structural integrity in a panel with a cutout, locate the cutout at least 3 inches from the panel's edge. To maintain the original deflection condition in a panel
with a cutout, or to have a cutout extending to the edge of a panel, you need to reinforce the panel near the cut with additional support pedestals (Figure 13.15). Each additional pedestal should be glued to the subfloor with approved pedestal adhesive and be height-adjusted so that the heads are supporting the underside of the panel. Reinforcement may not be required for all cutout sizes and locations. Consult with the manufacturer for specific requirements.
13.4.4
Saws and Blades for Panel Cutting
A cutout extending to the edge of a panel can be cut with a heavy-duty handheld reciprocating saw or a band saw. An interior cutout can only be cut with a heavy-duty handheld reciprocating saw. Use bimetal saw blades with approximately 14 teeth per inch.
13.4.5
Interior Cutout Procedure
When performing interior cutting, follow this procedure:
1. Lay out the cutout on the panel.
2. Drill entry holes in two opposite corners of the layout, large enough for the saw blade to pass through without binding, as shown in Figure 13.16.
3. Cut the hole.
Debur the perimeter of cutouts made for air terminals, grills, or electrical boxes where no protective trim will be installed. Deburring can be accomplished with a metal file or coated abrasive sandpaper. After making cable cutouts, install protective trim around exposed edges, as shown in Figure 13.16.
13.4.6
Round Cutout Procedure
Round cutouts up to 6 inches in diameter can be made with a bimetal hole saw. A drill press is best suited for this operation. Operate the drill press at a very slow speed. If a
Figure 13.15. Additional support pedestal locations. (Courtesy of Tate Access Floors.)
hand-held drill will be used, predrill a hole at the center of the cutout location to stabilize the hole saw. For round cutouts larger than 6 inches in diameter, lay out the circle on the panel and cut the panel with a reciprocating saw. Drill an entry hole for the saw blade along the edge of the circle just inside of the layout line and make the cut. Debur the cut.
Figure 13.16. Interior cutout procedure (pilot holes drilled at opposite corners). (Courtesy of Tate Access Floors.)
13.4.7
Installing Protective Trim Around Cut Edges
Rectangular cutouts used as a passageway for cables must have protective trim along the cut edges. Protective trim components typically include universal cutout trim, molded corners, and screws. A lay-in foam plenum seal is available to seal the opening (a plenum seal is typically a sheet of Armaflex insulation).
13.4.8
Cutting and Installing the Trim
To determine how long to cut the straight pieces, take note of how the molded corners will hold them in place (Figure 13.17). Cut the sections straight at each end so that they will fit under the corners. Place the sections along the cutout and secure each section with a molded corner by attaching the corners to the panel with the screws. You will need to predrill holes in the panel for the screws with a 9/64-inch metal bit. If the cutout extends to the edge of the panel, attach the trim near the edge (without using the corners) by fastening the trim directly to the panel with a screw or a pop rivet. If a screw is used, countersink the screw into the trim piece.
13.5 ACCESS FLOOR MAINTENANCE
For preservation of all tiled floor panels, the most important thing to remember is that applying abundant amounts of water to the floor when mopping will degrade the glue bond between the tiles and the panels.
Figure 13.17. Cutout protection, showing the cutout trim, molded corner, ledge for the plenum seal, and #8 Phillips flat head screws (3/4 inch long, black oxide finish). (Courtesy of Tate Access Floors.)
13.5.1 Best Practices for Standard High-Pressure Laminate Floor Tile (HPL) and for Vinyl Conductive and Static-Dissipative Tile
• Keep the floor clean by light damp-mopping with a mild, multipurpose ammoniated floor cleaner, mild detergent, or neutral cleaner.
• Use a nonflammable organic cleaner such as Rise Access Floor cleaner on difficult-to-clean spots (this can also be used for regular maintenance of high-pressure laminates).
• Provide protection from sand and chemicals tracked in on shoes by providing walk-off mats at entrances.
• Rotate panels in high-use areas to areas of low traffic to spread years of wear over the entire system.
• Do not flood the floor or use anything other than a damp mop. Abundant quantities of water can attack the adhesive bond and cause delamination of the tiles from the floor panels. Do not permit the cleaner to get into the seams between the floor panels.
• Do not wax or seal the tile; it is never necessary. If a floor has been mistakenly waxed, the wax should be removed as soon as possible. Some commercial wax removers may cause damage to the floor, so select a product such as Owaxo, which has been specially designed to remove wax from high-pressure laminate surfaces. The use of wax will build an insulating film on the floor and reduce its effectiveness.
• Do not use strong abrasives, steel wool, nylon pads, or scrapers to remove stains.
• Use a diluted commercial stripping agent on particularly bad spots.
• Use warning or caution signs if any traffic is possible while the floor is wet. Vinyl flooring will become slippery when wet.
13.5.2 Damp-Mopping Procedure for HPL and Conductive and Static-Dissipative Vinyl Tile
When light soiling is widespread and spot cleaning is impractical, use this damp-mopping procedure:
• Sweep or vacuum your floor thoroughly.
• Damp-mop with warm water and a commercial neutral floor cleaner.
• Dip a sponge mop into a bucket of warm water, wring it out well, and push the sponge across the floor, pressing hard enough to loosen the surface dirt.
• Damp-mop a small area at a time, wringing the sponge out frequently to ensure that the dirt is lifted and not simply redistributed.
• When damp-mopping a large floor, change the water several times so the dirt does not get redeposited on the floor.
• A good sponge mop for cleaning is one with a nylon scrubbing pad attached to the front edge. This pad is similar to those recommended for use on Teflon pans.
• Rinsing is the most important step. Detergent directions often state that rinsing is not necessary. Although this is true on some surfaces, any detergent film left on a floor will hold tracked-in dirt.
• Ideally, use one sponge mop and bucket to clean the floor and another mop and bucket solely for rinsing.
13.5.3
Cleaning the Floor Cavity
The necessity of cleaning the floor cavity will depend on environmental conditions. The only way to tell if cavity cleaning is necessary is by occasional inspection. Normal accumulations of dust, dirt, and other debris can be removed with a shop vacuum. Perforated panels in computer rooms may need to be cleaned a couple of times a year by running a wide shop vacuum nozzle across the underside of the panels.
Removing Liquid from the Floor Cavity. If a significant amount of liquid enters the access floor cavity through spillage or pipe leakage, vacuum extraction is the most effective way of removing it. The drying process can be accelerated with dehumidifiers, fans, and/or heaters. Extraction and drying should occur as soon as possible, and no more than 48 hours after the event. For areas subject to excessive moisture or potential water leakage, water detection systems such as Dorlen's Water Alert® may be installed in the underfloor cavity. These systems monitor the subfloor in places where water leakage may otherwise go undetected.
13.6
TROUBLESHOOTING
From time to time you may be required to fix minor problems that are typically caused by unintentional mistreatment of floor systems. Repairs that are not addressed here should be performed by a professional access floor installer.
13.6.1
Making Pedestal Height Adjustments
You may need to make pedestal height adjustments to correct vertical misalignment conditions. Adjust the pedestal's height by lifting the head assembly out of the base tube enough so that you can turn the leveling nut on the stud. Turn the nut upward to decrease the pedestal height and turn it downward to increase the height. Once the head is reinserted in the tube, make sure that the antirotation mechanism on the bottom of the nut is aligned with the flat portion of the tube. Caution: A minimum of 1 inch of the stud must remain inside the tube to keep the head assembly safely secured to the base assembly. When making a height correction in a stringer system, it is not usually necessary to remove the stringers to make the adjustment. With the four panels surrounding the pedestal removed and the stringers attached, you can lift the head just enough to turn the nut on the head assembly.
13.6.2
Rocking Panel Condition
If a panel rocks on the understructure, it can be due to one of the following conditions:
• There is dirt or other debris on the stringers that needs to be cleaned.
• There is dirt or other debris stuck to the underside of the panel that needs to be cleaned.
• One or more pedestal heads are vertically misaligned. Correct this by adjusting the height of the misaligned pedestals (see Section 13.6.1). Remove three of the panels surrounding the misaligned pedestal but keep one rocking panel in place to gauge the levelness. Raise or lower the pedestal head until the remaining panel appears level and no longer rocks. (Placing a level on the panel will help to make this adjustment.) Replace each of the panels and check for rocking before replacing the next one.
13.6.3
Panel Lipping Condition (Panel Sitting High)
If the lip of one panel is sitting higher than the lip of an adjacent panel, it can be due to one of the following conditions:
• There may be dirt or other debris on the stringers that needs to be cleaned.
• There may be dirt or other debris stuck to the underside of the higher panel that needs to be cleaned.
• A pedestal may have been knocked loose from the subfloor and be tilting in one direction, as shown in Figure 13.18. Check this by removing two panels where the lipping exists. If any pedestal base plate is not flat on the subfloor and appears loose, you will need to remove the old adhesive and reglue the pedestal to the subfloor. To do this, remove all stringers and panels surrounding the loose pedestal, cover the entire pedestal base plate with pedestal adhesive, and reposition it on the subfloor (pedestals are spaced 24 inches on center). While the adhesive is wet, reattach the stringers and replace just two panels to serve as a guide for precise pedestal positioning. Move the pedestal as necessary until it is perpendicular to the access floor and the stringers are level. Raise or lower the pedestal head until the panels are level and do not rock. (Placing a level on the panel will help to make this adjustment.) Replace all panels and verify that they do not rock.
13.6.4
Out-of-Square Stringer Grid (Twisted Grid)
This condition is created when people remove a large block of panels from the floor and then unintentionally twist the grid while performing underfloor work. Correcting an out-of-square grid condition is not an in-house, do-it-yourself task; it must be performed by a professional installer. To prevent grid twisting from occurring in the first place, adhere to the instructions below. Be aware that 2-foot stringer systems are more susceptible to grid twisting than 4-foot (basketweave) systems.
To prevent grid twisting:
• When you need to work under the floor, remove only as many floor panels as necessary to do the work. Keeping surrounding panels in the floor helps to maintain grid squareness; the more panels you remove at one time, the greater the possibility that the grid will be accidentally twisted. Replace panels as your work progresses from one area to another.
Figure 13.18. Tilted pedestal. (Courtesy of Tate Access Floors.)
• Never pull electrical cables against the pedestals and stringers.
• When removing and reinstalling cut panels at the floor perimeter and around columns, reinstall each section exactly where it came from. Panels are custom-cut for each perimeter location and are not interchangeable. Walls are often wavy or out of square. Interchanging differently sized perimeter panels can cause floors to become out of square and can also cause panels to become too tight or too loose in the floor.
• Keep foot traffic and rolling loads on the access floor away from the area where the panels and stringers are to be removed. Failure to do this can cause panels to shift and stringers to twist on the heads.
• When removing stringers for complete access, remove only those stringers necessary for access and replace them as your work progresses from one area to another.
• Do not walk on stringers or put lateral loads on stringers when panels are removed from the floor.
• When replacing stringers, make sure that the adjacent pedestal heads attached to each stringer are square to one another before final tightening of the stringer screws.
• Do not overtorque the stringers with your driver when reattaching the stringers to the pedestal heads. Set the torque on your screw gun to between 30 and 35 inch-lbs.
• Do not try to resquare the stringer grid by hitting stringers with a hammer; this can result in the formation of bumps on top of the stringers that will cause panels to rock.
13.6.5
Tipping at Perimeter Panels
Tipping of perimeter panels toward a wall generally occurs when the perimeter panel is improperly supported at the wall. Adjust the leveling nut on the perimeter pedestal head, raising or lowering the head until the panel sits level.
13.6.6 Tight Floor or Loose Floor—Floor Systems Laminated with HPL Tile
If floor panels laminated with HPL have become too tight (or loose) after installation, it is typically due to one of the following causes:
• Temperature or Humidity Change. A significant increase of temperature or humidity causes the HPL on the panels to expand, making the panels difficult to remove and replace. A decrease of temperature or humidity can cause the floor panels to become loose on the understructure. To avoid these conditions, keep the room temperature and humidity levels constant. If an environmental change does occur and the floor becomes tight or loose, it should return to its original condition after the room conditions return to the original settings.
• Application of Excessive Amounts of Cleaning Water. Putting excessive amounts of water on the floor while mopping causes the water to migrate between the floor panels, which causes the tile to expand. This can be prevented by using the proper mopping technique: wring excess water out of the mop before putting it on the floor and do not leave puddles of water on the floor. As stated in Section 13.5.1, abundant quantities of water can also attack the adhesive bond and cause delamination of the tiles from the floor panels.
• Interchanging of Perimeter Panels. Floor panels that have been cut to fit at the room perimeter and around columns must be returned to their original locations. During installation, panels are custom-cut for each perimeter location and are therefore not interchangeable. Walls are often wavy or out of square, so interchanging of perimeter panels can cause some areas of the floor to become loose or too tight. When removing and reinstalling cut panels at the floor perimeter and around columns, reinstall each cut panel section exactly where it came from.
13.7 ADDITIONAL DESIGN CONSIDERATIONS
13.7.1 LEED Certification
The U.S. Green Building Council's (USGBC) Leadership in Energy and Environmental Design (LEED) program is broadly recognized as the leading authority for certifying green buildings. Evaluating the manufacturing location and recycled content of the raised access flooring system can contribute toward the certification of your data center. Using a raised floor system with more than 20% recycled content (postconsumer plus one-half preconsumer) that is manufactured within 500 miles of the data center location can contribute several points toward LEED certification.
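The recycled-content arithmetic works out as follows; the data-sheet percentages are hypothetical.

```python
def leed_recycled_content_pct(postconsumer_pct, preconsumer_pct):
    """LEED counts post-consumer content in full and pre-consumer at half value."""
    return postconsumer_pct + 0.5 * preconsumer_pct

# Hypothetical product data-sheet values:
content = leed_recycled_content_pct(postconsumer_pct=15, preconsumer_pct=12)
print(content, content > 20)  # 21.0 True
```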
13.7.2
Energy Efficiency—Hot and Cold Air Containment
Ensuring energy efficiency in a data center requires special attention to the placement of the equipment racks in relation to the airflow panels. Most server equipment is designed to draw cool air in at the front of the rack and exhaust it out the back. It is best to align the fronts of the racks facing each other in the aisles where airflow panels are to be placed. Likewise, the backs of the racks should face each other in the next aisle, where solid access floor panels are placed. The return air diffusers in the ceiling should also be placed above this aisle to help the warm exhaust air exit the building more efficiently. This arrangement is called hot aisle/cold aisle design and is intended to segregate the cool supply air from the warm exhaust air. Other methods for keeping the exhaust air from mixing with the supply air are widely used and recommended. These techniques include the use of blanking panels in the empty slots of a server rack. Blanking panels prevent the exhaust air from recirculating to the front of a server when the rack is not completely full. The use of baffles or partitions above the server rack also helps to keep the air from flowing over the top of the rack and contaminating the other side. A third option for keeping the air segregated is known as full containment, in which either the hot aisle or the cold aisle is fully enclosed so that the contrasting air on the other side of the rack cannot short-circuit. Figure 13.19 illustrates a system partitioned into hot and cold aisles, with partitions segregating the adjacent aisles and an overhead hot air return plenum installed.
13.7.3 Airflow Distribution and CFD Analysis
Whether you are designing a new data center or analyzing an existing facility, software packages utilizing computational fluid dynamics (CFD) should be considered to determine the effectiveness of your raised access floor and CRAC arrangement. CFD technology has been used for years in industries such as aerospace, automotive, and IT manufacturing to analyze and produce optimal designs where fluid flow or airflow is involved. Rapid what-if simulations can be performed before any expensive and potentially inefficient designs are implemented in a live data center, allowing several configurations of perforated tiles, grates, and other items to be tested virtually. Useful CFD analysis requires some proficiency with the software model and a thorough, accurate audit of the elements within the data center. This data is entered into the model, an accurate baseline is verified, results and opportunities are analyzed, and, finally, iterations are run to obtain an optimal result (within the constraints of what physical changes can be made). CFD analysis is a powerful tool and may be used to illustrate the advantages of raised access floors. To help illustrate the elements of efficient data center cooling, a sample CFD efficiency study will be performed and its results analyzed. For this efficiency study, a prototype data center will be created and manipulated to yield the highest-efficiency design possible.
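The what-if workflow can be sketched as a parameter sweep. Here `run_cfd_model` is a stand-in for a call into a real CFD package, and its formula is purely illustrative — a real study would replace it with an actual solve against the audited room model.

```python
from itertools import product

def run_cfd_model(num_grate_tiles, crac_return_f):
    """Stand-in for a real CFD solve (a commercial package would go here).
    Returns a hypothetical worst-case rack inlet temperature in deg F."""
    return 90.0 - 1.5 * num_grate_tiles - 0.1 * (80 - crac_return_f)

# Sweep candidate configurations virtually before touching the live floor,
# keeping only those that hold every rack inlet at or below the limit.
INLET_LIMIT_F = 80.5
candidates = product(range(4, 13, 2), (72, 74, 76, 80))
feasible = [(g, t) for g, t in candidates if run_cfd_model(g, t) <= INLET_LIMIT_F]
best = max(feasible, key=lambda gt: gt[1])  # prefer the highest CRAC return temp
print(best)
```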
Figure 13.19. Typical energy efficient data center design.
High-efficiency design goals:
1. Tightly controlled cooling delivery to racks
2. Low mixing
3. Avoid overcooling
4. High CRAC return-air temperatures
A prototype data center might have:
1. 1380 square feet
2. Two 20-ton CRACs (≥2N redundancy)
3. Twenty-two 2 kW racks—44 kW total (for the liquid cooling case, four 11 kW racks)
4. Slab and raised floor
Below are five different designs, from least efficient to most efficient.
First, we will model a data center with a slab floor. Figure 13.20 illustrates the configuration of the data center, and Figure 13.21 shows the direction and intensity of airflow and the resulting temperatures. Such a data center may be cooled to an acceptable level; however, the level of mixing is quite high. The CRAC return temperature must be set to 72°F, which provides low efficiency. In Figure 13.21, you can see from the arrows that slab floor data centers commonly have a large amount of mixing unless directed supply and return-air ducting is provided.
For our second model data center, the orientation of servers and CRACs was preserved, but a raised access floor with perforated or grate tiles was added. In Figure 13.23, cool air is being forced through the subfloor to the server cabinets and warm air is then expelled into the room. Much of the cool air, however, is short-circuiting directly back to the CRACs. This design may be cooled to a reasonable level; however,
Figure 13.20. Slab floor data center, random rack orientation with upflow CRACs. (Courtesy of AdaptivCool.)
Figure 13.21. Slab floor data center—rack temperatures and airflow at 6 feet. (Courtesy of AdaptivCool.)
the level of mixing is still high. The CRAC return temperature must be set to 74°F, which provides about an 8% improvement in efficiency over our first model. Random rack layout promotes poor cool air containment.
Our third model data center, shown in Figures 13.24 and 13.25, has been organized into a hot aisle/cold aisle setup, and is beginning to show the true possibilities of tightly controlled cooling distribution, low mixing, and higher CRAC return temperatures.
Figure 13.22. Raised-floor data center—random rack orientation with downflow CRACs. Shaded tiles are perforated or grate tiles. (Courtesy of AdaptivCool.)
Figure 13.23. Raised-floor data center—rack temperatures and airflow at 6 feet. (Courtesy of AdaptivCool.)
As you can see in Figure 13.25, the air is directed from the CRAC units through the subfloor to the grated tiles located in the center of the rack setup. The cold air from the CRACs cools the servers, and hot air is then expelled through the back of the cabinets, as indicated by the arrows, returning to the CRAC units to once again become cooled air. CRAC return temperatures can be increased to 76°F, further improving efficiency.
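The efficiency gains from higher return temperatures come down to the sensible-heat airflow relation Q(BTU/hr) ≈ 1.08 × CFM × ΔT(°F). A rough sketch using the prototype room's numbers follows; the constants (1 ton = 12,000 BTU/hr, 1 kW ≈ 3412 BTU/hr, standard-air factor 1.08) are textbook values, and a real design should be checked against the CRAC manufacturer's performance data.

```python
TON_BTU_HR = 12_000   # 1 ton of refrigeration
KW_BTU_HR = 3412      # 1 kW of heat, approximately

def required_cfm(heat_kw, delta_t_f):
    """Airflow needed to remove sensible heat: Q[BTU/hr] = 1.08 * CFM * dT[F]."""
    return heat_kw * KW_BTU_HR / (1.08 * delta_t_f)

# Prototype room: 44 kW of IT load, two 20-ton CRACs.
crac_capacity_kw = 20 * TON_BTU_HR / KW_BTU_HR
print(round(crac_capacity_kw))  # ~70 kW: one unit alone carries the 44 kW load

# Widening the supply/return dT reduces the airflow (and fan power) required.
print(round(required_cfm(44, 20)), round(required_cfm(44, 30)))
```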
Figure 13.24. Raised-floor data center—hot aisle/cold aisle arrangement with downflow CRACs. (Courtesy of AdaptivCool.)
Figure 13.25. Raised-floor data center—rack temperatures and airflow at 6 feet. (Courtesy of AdaptivCool.)
Our fourth model data center adds a ducted ceiling plenum, which acts as a sort of raised access ceiling, reducing mixing and short-circuiting to very low levels. This model is illustrated in Figures 13.26, 13.27, and 13.28. With the added ducted ceiling plenum, this model's airflow behaves much like that of the third model. The only difference is that the hot air expelled from the server racks is now channeled into the ducted ceiling, so the CRAC intake air consists almost exclusively of heated air. CRAC return temperatures can be set to 80°F while still providing adequate cooling air to the rack intakes, allowing an approximately 64% increase in efficiency over our first model.
Air cooling for data centers has been on the market for over 40 years. Its attractive features are its low cost, mature technology, ease of adding redundancy, and wide industry support. More recently, liquid cooling has been making inroads, particularly where high-density racks are planned. As is typical with new technology, there are many competing liquid cooling designs vying for market dominance. Several common denominators of these liquid cooling solutions are:
1. They are considered close-coupled systems (Figure 13.29). This means the cooling coils are brought much closer to the IT equipment than is possible with traditional CRACs. This has the benefit of raising ΔT across the cooling coils, which, as we saw with the traditional CRAC examples, is beneficial. The higher allowable coolant temperatures allow greater use of free cooling. Free cooling is not really free, but it is more efficient.
RAISED ACCESS FLOORS
Figure 13.26. Hot aisle/cold aisle arrangement with ducted ceiling plenum. (Courtesy of AdaptivCool.)
2. Required fan power is usually less than that required for traditional CRACs placed throughout the room.
3. Rack densities typically need to be higher than 6-8 kW, and sometimes as high as 10 kW, for liquid cooling to be practical.
4. Liquid cooling concepts will be discussed in greater detail later; however, given the significant differences, it is useful to add liquid cooling to the CFD matrix for comparison.
A liquid cooling solution using close-coupled cooling with cooling coils at the top of the racks is shown for comparison in Figure 13.30. No statement is made or should be inferred that this is the preferred liquid cooling solution. As can be seen in Figure 13.30, far more heated air is allowed into the room, partially due to the high-density racks. This heated air is cooled locally and delivered to the rack faces using a very short airflow path; thus, ΔT is higher in the cooling path and lower fan power is required, both of which are beneficial for efficiency. Most liquid cooling solutions still need some reduced amount of traditional CRAC capacity to control humidity, hence the single smaller unit. One drawback to liquid cooling is the expense and/or difficulty of providing redundancy. As you can see in Figure 13.30, if one of the liquid cooling units were to fail or need maintenance, there would be a serious deficit of cooling for some of the racks. Traditional CRAC units are more flexible in providing redundancy to multiple racks simultaneously: an additional unit in an area of the data center can provide redundancy (albeit at lower density) to a large group of racks.
Figure 13.27. Hot aisle/cold aisle with ducted ceiling plenum (6 feet). (Courtesy of AdaptivCool.)
Figure 13.28. Hot aisle/cold aisle with ducted ceiling plenum. Temperature and airflow in return plenum (9 feet). (Courtesy of AdaptivCool.)
Figure 13.29. Close-coupled liquid cooling for four 11 kW racks. (Courtesy of AdaptivCool.)
Summary of CFD Study. The results of the CFD study illustrate the benefit of reduced mixing and several common methods of obtaining it. Of note is the impact of including a raised access floor. Figure 13.31 shows a comparison of the four air-cooling methods.
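The efficiency gains in the CFD comparison come largely from raising the air-side ΔT. A rough way to see why is the standard sensible-heat relation for air near sea level, Q (BTU/hr) ≈ 1.08 × CFM × ΔT (°F). The sketch below applies it to the return-air setpoints of the four models; the fixed 60°F supply temperature is an illustrative assumption, not a figure from this study:

```python
# Sketch: why a higher CRAC return temperature (larger coil delta-T) reduces
# the airflow -- and hence fan power -- needed to remove a fixed heat load.
# Uses the standard sensible-heat relation for air near sea level:
#   Q [BTU/hr] ~= 1.08 * CFM * delta_T [degF]
# The 60 degF supply temperature is an illustrative assumption.

BTU_PER_HR_PER_KW = 3412  # 1 kW of IT load rejects ~3412 BTU/hr of heat

def cfm_required(load_kw: float, return_f: float, supply_f: float = 60.0) -> float:
    """Airflow (CFM) needed to remove load_kw at the given coil delta-T."""
    delta_t = return_f - supply_f
    return load_kw * BTU_PER_HR_PER_KW / (1.08 * delta_t)

# Return setpoints from the four CFD models in the text
for setpoint in (72, 74, 76, 80):
    print(f"{setpoint} degF return: {cfm_required(100, setpoint):,.0f} CFM per 100 kW")
```

At an 80°F return the required airflow per 100 kW drops by roughly 40% relative to a 72°F return, which is the mechanism behind the fan-power savings discussed above.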
13.8 CONCLUSION
Raised access floors, while not a requirement for a high-reliability facility, are the industry standard. Many best-practice procedures depend on the modular nature of a
Figure 13.30. Close-coupled liquid cooling airflow. (Courtesy of AdaptivCool.)
[Figure 13.31 chart: rack maximum intake temperature (°F, horizontal axis, 64-86) plotted for four cooling solutions and their return-air setpoints: slab floor with upflow CRACs and no ducting (72°F); raised floor with random rack orientation (74°F); raised floor with hot/cold aisle (76°F); hot/cold aisle with ceiling return plenum (80°F).]
Figure 13.31. Comparison of rack intake temperature and return-air setpoints for a variety of cooling solutions. (Courtesy of AdaptivCool.)
raised access floor and the cold-air plenum created below it. If properly installed and maintained, a raised access floor can help ensure greater efficiency and flexibility for your present-day and future mission critical facility needs. In the next chapter, we continue with an overview of fire suppression systems for mission critical facilities. Such a system is a requirement for life safety, and the system most appropriate to promoting business continuity for your firm in the event of a disaster must be carefully selected.
14 FIRE PROTECTION IN MISSION CRITICAL INFRASTRUCTURES
14.1 INTRODUCTION
Fire protection in facilities housing mission critical operations is a long-debated subject. Opinions, often based on misleading or false information, are as numerous as one can imagine. Lacking a firm understanding of the problems and potential solutions can lead to lowest common denominator or silver bullet approaches. In reality, lowest common denominator solutions rarely ever meet the needs of operators, the end result being a waste of money. In general, silver bullets usually miss the target. The purpose of this chapter is to provide an understanding of how fires behave, an objective accounting of the various fire protection technologies available, the expected performance of these technologies, and the applicable codes. Regardless of what anyone says, fires do occur in data centers, computer and server rooms, telecommunications sites, and other mission critical environments. However, there is enormous pressure to suppress and conceal this information because it represents a competitive disadvantage for the owners, engineers, and equipment makers. One manager reported that after two fires in an eight-year period, the manufacturer of the fire-prone hardware requested that the events be kept quiet. What customer can rely on an operator that has fires in their mission critical facility? Although fires are less likely in these environments, they usually are the result of an overheated or failed component and occur in circuit boards, cable bundles, terminal
strips, power supplies, transformers, electric motors (in HVAC units), and so on; basically, anything that generates heat and is made in whole, or in part, of combustible materials. Unless the fire has enormous consequences, such as the infamous Hinsdale fire, it will never be reported to the general public. Makers of equipment will defend their products with statements such as "It's made of noncombustible (or fire-rated) materials. It won't burn!" Nothing could be further from the truth. Fire-rated or limited-combustible materials are simply certified to standards for resistance to burning, but they still burn. A common example is a two-hour fire-rated door. The fire rating only confirms that the door resists a complete breach by fire under a controlled condition for a period of two hours. Actual results in a real fire under uncontrolled conditions may vary. Mission critical facilities are filled with plastics in the form of circuit boards, data cable jackets, power cord jackets, enclosures, and so on. All such plastics will burn. Like the power and cooling systems of a mission critical facility, the design of fire protection systems should be performance-driven rather than code-driven. Although fire codes pertaining to mission critical facilities do exist, they require few fire protection measures beyond those required for general commercial facilities. Most fire protection codes are developed using a consensus process and have the primary objective of providing life safety in buildings. This concept is not well understood by many engineers and owners. Most codes are written to ensure that, in the event of a fire, a facility will not collapse until the last occupant has been given an opportunity to safely escape. Generally, fire codes are not meant to protect property, operational continuity, or recovery, except to the extent necessary for life safety. Fire protection standards do govern how we implement fire protection technology.
Even when elective, any installation of fire protection technology is required to comply with national, state, and local standards in effect at the time of the installation, modification, or upgrade. What this means is that the same set of fire protection technologies installed in a given facility in one locality may be subject to different regulations than in another locality. Additionally, jurisdictions that must adopt each new edition of a code or standard through the legislative process might discover that different editions of the same code with revised requirements may be simultaneously in effect in any given city. It is not safe to assume that the most recent edition of a particular code is in effect since it may be the case that a jurisdiction is working from a code that is several editions old.
14.2 PHILOSOPHY
In order to determine what technology is appropriate for an application, an extensive review of the facility, its mission, building fire codes in effect, and risk parameters should be conducted. Location, construction, purpose, contents, occupants, operational parameters, and emergency response procedures should all be included in the review. Where is the facility located? Is it close to a fire station? Is the facility occupied 24/7, or does it maintain a nine-to-five weekday schedule (lights out by 5 pm)? What is the cost of lost assets or lost business? Who is going to respond to a fire alarm?
Only after this sort of review has been completed can you begin, with confidence, the exercise of determining how you intend to protect your facility from fire. The basic elements of a fire protection system are control, detection, alarm, and extinguishment/suppression.
14.3 ALARM AND NOTIFICATION
There are two basic types of fire alarm systems: conventional and addressable. Conventional control systems (Figure 14.1) are microprocessor-based and process inputs on a high or low basis. Signaling and control functions for peripheral equipment, such as HVAC shutdowns or EPO (emergency power off) triggers, are accomplished with relays. Addressable systems are software-based and programmable, and utilize sophisticated signaling line circuits (SLCs) to transmit signals as data messages to and from addressable devices. Devices on the SLC are programmable and each is assigned a unique identifier (address). Addressable devices include smoke and heat detectors, input and output modules, relays, remote displays, and annunciators. The SLC carries signals, information, and commands between the main processor (control panel) and the addressable devices located throughout the protected facility. With the exception of small, simple applications, the proper choice for most mission critical applications is the addressable-type fire alarm control system. The main reason is cost: the initial cost of installation and the cost of owning, operating, and maintaining
Figure 14.1. Conventional fire alarm control panel. (Photo courtesy of Fenwal Protection Systems.)
a system. Conventional systems normally have a lower total cost of ownership when protecting facilities under a few thousand square feet. When multiple zones, hazards, and alarm signaling requirements are part of the required operational characteristics, these added capabilities must be bolted on to conventional systems at incremental cost. A small, simple application with only a few smoke detectors and pull stations on one initiating device circuit (IDC), and two or three horn/strobes on one notification appliance circuit (NAC), is a perfect fit for a conventional system. Once the application requires twenty to thirty input and output devices, you have probably reached the limit of a conventional fire alarm system's cost-effectiveness. When a facility requires several hundred detectors, controls, and alarm devices, a conventional control panel would be required to support perhaps dozens of wired circuits and a large number of special-function modules. Addressable panels, by contrast, are capable of supporting hundreds of devices in multiple control schemes right out of the box; most can accommodate 100 to 250 programmable detection, input, or output devices on a single pair of wires. The operation and maintenance costs of an addressable system can be lower if it is properly set up. Most addressable systems have diagnostics that can provide information in significant detail, contributing to better decisions and optimizing the maintenance of the system, thereby lowering costs. For example, a dirty smoke detector monitored by a conventional panel may initiate an alarm signal. To discover this, a technician must visit the site to inspect and test the device. This may be of little consequence if the alarm happens during the workday and a technician is available; however, if it happens at 3 AM on a Saturday night, there may be significant expense involved.
Considering that 75% of the total hours in a week fall outside of "normal" business hours, this is a very realistic scenario. Now contrast this with the ability to remotely access an addressable panel over the Internet, discover that the false alarm is due to a dirty detector, and defer the response to a less costly time. With an addressable system, changing and expanding the system becomes less expensive as well. Many operational characteristics can be changed with a few minutes of programming rather than by changing modules and pulling new wires, as is frequently required with conventional systems. Determining how a fire alarm signal will be communicated is another step in the process, and may be the most overlooked issue in mission critical fire protection. Alarms can be initiated upon smoke, heat, or flame detection, gaseous agent discharge, sprinkler water flow, or actuation of a manual pull station. Different notification appliances may be operated to notify personnel of a variety of events, such as incipient fire detection or fire detection in different areas or rooms. A bell may be used to indicate fire in a power room, whereas a high-low tone from a horn may be used to indicate fire in a server room or subfloor. Voice commands may be used in some scenarios. Most gaseous suppression systems require a second, or confirming, fire detection before an automatic release will occur. This sequence is typically followed by three distinct alarms:
1. Fire alarm
2. Predischarge alarm
3. Discharge alarm
Interviews with employees working in facilities protected by these gaseous suppression systems reveal that very few occupants understand the meaning of any given alarm and frequently react inappropriately in an emergency situation. Visitors, contractors, and new employees are typically completely uninformed. It is a good idea to develop an alarm notification strategy that includes audible and visual appliances as well as signage and training. In grade school, everyone knew what to do when a particular alarm sounded because of training that included fire drills, tornado drills, and the like. Only notification appliances that meet the appropriate fire system standards and are listed (by Underwriters Laboratories, etc.) for such use may be installed, but there are more than enough options to accomplish just about any notification strategy. The key to making any strategy work is to keep it simple enough to make sense, provide periodic training, and include signage that can be read from a distance. It is important to note that changing the mode of operation of occupant notification and evacuation systems is also more easily accomplished with an addressable system. Many changes to the operational characteristics of audible and visual signals can be accomplished through simple software programming changes. Some fire alarm control equipment manufacturers have addressable notification appliance circuits and devices offering even greater flexibility to the fire protection system designer.
14.4 EARLY DETECTION
The NFPA (National Fire Protection Association) Standard for the Protection of Information Technology Equipment (NFPA 75) requires smoke detection, and the Standard for the Fire Protection of Telecommunications Facilities (NFPA 76) requires early warning fire detection (EWFD) and/or very early warning fire detection (VEWFD) in sites carrying telecommunications. Many different technologies are available for the detection of a fire, including a variety of smoke, heat, and flame detection devices. Detection of other environmental conditions should also be considered for use in a mission critical facility, including liquid (water, etc.) leak detection and combustible or toxic gas detection. The wide variety of applications in a mission critical facility may warrant the use of any or all of the available technologies. For example, a data center may include a server room with a raised floor, power room, battery room, emergency generator room, fuel storage area, loading dock, staging/transfer area, offices and conference rooms, restrooms, break/kitchen area, transmission room, and storage areas, among others. Each area represents a unique fire hazard, and the characteristics of a potential fire in each of these areas will be different. Fires in server rooms are typically electrical/plastics fires that may smolder for several hours undetected by personnel and/or traditional detection systems because of the slow growth of the fire and the high airflow conditions of the room's cooling system. Fires in generator rooms and fuel storage areas will ignite and grow rapidly, producing extreme heat and smoke within seconds or minutes. Battery room fires can generate huge amounts of smoke rapidly, with little increase in temperature. In each case, the optimum solution is different.
The surroundings of the mission critical facility must also be considered. Many data centers are located in multitenant or high-rise buildings where other hazards are factors. Obviously, an emergency backup diesel generator installed in a suburban, single-story, single-tenant building presents a different fire protection scenario than the same generator installed on the 20th floor of a downtown office tower.
14.5 FIRE SUPPRESSION
Automatic fire suppression as a standard in commercial buildings did not gain traction until the mid-to-late 1980s. Fires in high-rises and densely occupied facilities in the 1970s and 1980s resulting in large losses of life, such as the Beverly Hills Supper Club fire and the MGM Grand Hotel fire, drove building agencies to require sprinklers throughout many types of facilities. This push was strictly life-safety related. Prior to this era, computer rooms and many telephone buildings were protected with Halon 1301 or carbon dioxide (CO2) fire extinguishing systems. These were the only suitable options, Halon 1301 being much preferred due to the life-safety issues associated with CO2 systems. The primary driver for installing Halon 1301 systems was to protect against the loss of business continuity from a fire. In occupancies requiring sprinklers, owners were given the option of installing Halon 1301 systems as an alternative to sprinklers for the protection of the mission critical areas of the building. Mission critical facilities were rarely protected with automatic sprinklers until the publication of the Montreal Protocol in 1987 and the ensuing Copenhagen Resolution calling for an end to the production of chlorofluorocarbons (CFCs), because these substances were found to contribute to the depletion of the ozone layer in the Earth's stratosphere. Halon belongs to the family of chemicals affected by these treaties. In the United States, production of Halon 1301 ceased on December 31, 1993. The final rule on Halon 1301 was promulgated by the U.S. Environmental Protection Agency on April 6, 1998. Although several alternatives to Halon 1301 became available in the early 1990s, many owners and engineers began turning to sprinklers for protection of mission critical facilities, particularly the preaction type of system. Many continue to utilize this fire protection strategy even today.
However, the emergence of MIC (microbiologically induced corrosion) in fire sprinkler systems is rapidly causing owners to rethink fire suppression strategies. When developing a fire protection strategy, owners of mission critical facilities must address the goals of minimizing (or avoiding) loss of business continuity and loss of valuable assets in addition to the goal of life safety. Fires in these facilities have the potential to impact production and/or service to customers, resulting in lost customers, lost future business opportunities, and lost customer confidence. As to future business, customers will leave if they feel let down. Think about what a customer would do if their mobile phone did not provide the coverage expected. Someone in every organization knows what these risks are, and access to that information is required in order to develop a proper fire protection strategy. There are a number of water-based and gaseous suppression systems suitable for use in mission-critical facilities. Sprinklers are mandated in most cases, and although
they are very effective in extinguishing fires in most ordinary hazard spaces, sprinklers may not be effective for extinguishing fires in most mission critical facilities due to the nature of the contents. Rather, sprinklers will suppress or control fires in these types of areas. In order for a sprinkler system to extinguish a fire, the water must make contact with the burning material. Mission critical facilities contain rack enclosures and cabinets with solid perimeters that can shield burning material from the water spray of a traditional sprinkler system. A sprinkler system can be relied upon to effectively control (suppress) such fires and prevent the fire from spreading to other areas of the facility. Again, the objective of installing fire sprinklers in a building is to provide life safety for the occupants; it does not address the issue of business continuity. That is why the collateral damage resulting from a sprinkler release must be considered. Case histories abound in which the sprinkler system did the job of extinguishing a fire and yet caused tens or hundreds of thousands of dollars in water damage. Although only one sprinkler head discharges in most cases, hundreds of gallons of water may be dispersed over a 10 to 15 ft radius, dousing not only the fire but also anything else within the spray and splash zone. Sprinkler water has the potential to fill a subfloor and run off to other areas and/or lower levels of the building. Typically, the cost of restoration from the water damage caused by a sprinkler release is second only to that of smoke damage restoration. Water mist fire extinguishing systems, once applicable only to generator and fuel-storage applications, have been developed for use in electrical equipment areas. By creating very small water droplets, Class 1 water mist systems in particular produce a discharge that behaves more like a gaseous fire extinguishing system than a sprinkler system.
In addition, the mist produced in some systems has been proven to be electrically nonconductive, reducing the likelihood of the damaging electrical shorts that occur in sprinkler discharges. The relatively small amount of water compared to a sprinkler discharge means that collateral damage is low. A typical closed-head water mist system will discharge only 50-100 gallons of water in a 30-minute period. Class 2 and 3 water mist systems (with larger water droplet sizes) are suitable only for use in generator and fuel-storage areas and should be avoided in electronic equipment areas. Gaseous fire extinguishing systems are designed to fill a protected space with gas to a given concentration and not only extinguish a fire, but prohibit its reignition. Gaseous systems (hereafter referred to as clean agents) are clean (leaving no residue), electrically nonconductive, noncorrosive, and safe for human exposure at the prescribed concentration. Several types of systems are available that utilize either a chemical agent or an inert gas agent. The primary extinguishing mechanism of chemical clean agents is to physically cool the fire at the molecular level. Inert gas clean agents primarily work by diluting the oxygen concentration in the protected space (unlike CO2 systems, they do so while maintaining a safe breathable atmosphere for humans). The clean agents currently available in the U.S. market include the following:
• HFC-227ea, also known as FM-200® and FE-227®
• IG-541, also known as Inergen®
• FK-5-1-12, also known as Novec 1230® and Sapphire®
• HFC-125, also known as FE-25® and ECARO-25®
• HFC-23, also known as FE-13®
• IG-55, also known as Argonite® and ProInert®
It is frequently and falsely reported that gaseous agents react with oxygen to extinguish fire. You may have heard it said that Halon 1301 "sucks the oxygen" out of the air. This is completely untrue, and competing interests propagate the myth in a very irresponsible way. Other types of suppression systems that may be found in mission critical areas include foam-water systems and dry-chemical systems. These systems are limited to applications in diesel-generator, fuel-storage, or kitchen areas.
14.6 SYSTEMS DESIGN
14.6.1 Stages of a Fire
All fires grow and progress through the same stages, regardless of the type of fire or material that is burning. The only difference in how a fire grows is the speed at which the fire progresses through the four stages and the amount of combustibles available. Fires involving materials with greater flammability will grow faster (e.g., gasoline burns faster than coal). The four stages of a fire are (1) incipient, (2) smoking, (3) flaming, and (4) intense heat. A fire that starts in a waste can full of paper will move from the incipient stage to intense heat in a matter of seconds or minutes. A fire that begins when a pressurized fuel line on a hot diesel generator bursts will move from ignition to the fourth stage (intense heat) almost instantaneously. A fire involving an overheated circuit board in a server cabinet or a cable bundle will grow slowly, remaining in the incipient stage for possibly several hours or even a few days, producing so little heat and smoke that it may be imperceptible to human senses, especially in high-airflow environments. Fires that occur in battery rooms, HVAC units, and power supplies typically stay in the incipient stage for minutes, moving to the flaming and intense heat stages in minutes or hours. Fires produce smoke, heat, and flames, all of which may be detected by automated systems. Smoke may consist of a number of toxic and corrosive compounds, including hydrochloric acid vapors. When uninhibited by high airflow, smoke rises to the ceiling with the heated air from the fire. However, mission critical facilities often require massive amounts of cooling involving enormous volumes of air movement, which change the dynamics of smoke and heat. Cooling systems designed to keep processing equipment running efficiently also absorb the heat from a fire and move smoke around in mostly unpredictable ways. The movement of smoke is unpredictable because environmental conditions are constantly in flux.
14.6.2 Fire and Building Codes
Building codes set forth the minimum required fire protection measures (systems) for a building based on its occupancy type. Fire protection system standards provide the minimum requirements for the design, installation, and maintenance of systems. Compliance with building codes and fire system standards is essential. If the owner of a building chooses to install a fire system that the building code does not require, following the appropriate fire system standard is still mandatory. Cities, counties, and states each have their own variety of building codes and adopt model codes such as the International Building Code (IBC) and national standards such as those published by the National Fire Protection Association (NFPA). Other standards that may be required include those of Underwriters Laboratories (UL), FM Global (Factory Mutual), the American National Standards Institute (ANSI), and others. For the purpose of this discussion, reference will be primarily to the NFPA codes and standards (Table 14.1). During the conceptual design phase of a project, it is wise to meet with the local authorities having jurisdiction (building department, fire department, inspection bureau, etc.) to discuss any fire protection design issues that are beyond code requirements. What may be acceptable to the authority having jurisdiction (AHJ) in Washington, DC may not be acceptable to those in New York, NY. In addition, features and functions marketed by manufacturers of listed products may be unacceptable to the local authorities. For example, a manufacturer of smoke detection equipment may offer an optional capability for coordinating several smoke detectors for confirmation before sounding an alarm, thereby eliminating false alarms due to a single detector. It is a great idea, but the local AHJ can, and may, require that each detector initiate an alarm signal without such coordination.
Successive revisions of the various codes, which occur every 3 to 4 years, may have dramatically different requirements. For example, the 1999 edition of the National Fire Alarm Code (NFPA 72) requires a spot-type smoke detector inside each beam pocket for ceilings with beams deeper than 18 inches and spaced greater than 8 feet apart. The
Table 14.1. NFPA standards that govern applications of fire protection technologies

Standard or code                                                    Number   Current edition
Fire Code (previously Uniform Fire Code)                               1        2009
Standard for Portable Fire Extinguishers                              10        2010
Standard for the Installation of Sprinkler Systems                    13        2010
National Fire Alarm and Signaling Code                                72        2010
Standard for the Protection of Information Technology Equipment       75        2009
Standard for the Fire Protection of Telecommunications Facilities     76        2009
Life Safety Code                                                     101        2009
Standard on Water Mist Fire Protection Systems                       750        2010
Standard on Clean Agent Fire Extinguishing Systems                  2001        2008
2010 edition changed the requirements for spacing of detectors on beamed ceilings: it requires smoke detectors in every beam pocket only when the beam depth is greater than 10% of the ceiling height and the beams are spaced farther apart than 40% of the ceiling height. In facilities that are strictly IT or telecommunications facilities, the requirements of NFPA 75 and 76, respectively, apply throughout the facility. For IT and telecommunications operations within multiuse facilities, such as hospitals, manufacturing plants, and high-rise buildings, the requirements of NFPA 75 and 76 apply only to the specific area of the mission critical function. NFPA 75, Standard for the Protection of Information Technology Equipment, requires smoke detection in equipment areas. Fire extinguishing systems are also required based on a few rules:
1. If the equipment area is in a sprinklered building, the equipment area must be sprinklered.
2. If the equipment area is in an unsprinklered building, the area must be protected with a sprinkler system or a clean agent system (it is permissible to install both systems).
3. The area under a raised floor is to be protected. If the room is not already protected by a clean agent system, then the raised floor area must be protected with sprinklers, a carbon dioxide system, or an inert gas clean agent system. If the room is protected by a clean agent system, the raised floor must be included in the area being protected.
Also, any combustible storage media device in excess of 27 ft³ in capacity (3 ft × 3 ft × 3 ft) is required to be fitted with an automatic sprinkler or a clean agent system with extended discharge. NFPA 76, Standard for the Fire Protection of Telecommunications Facilities, requires the installation of early-warning smoke detection in all equipment areas, and very-early-warning smoke detection in all signal processing areas 2500 ft² or larger in floor space.
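The NFPA 75 rules above can be read as a small decision procedure. The following sketch is a simplified illustration of those rules only, not a substitute for the standard or for review by the local AHJ:

```python
# Simplified sketch of the NFPA 75 extinguishing-system rules listed above.
# Illustrative only -- consult the current edition of NFPA 75 and the local AHJ.

def it_area_protection(building_sprinklered: bool, clean_agent: bool) -> list[str]:
    """Minimum extinguishing systems for an IT equipment area."""
    if building_sprinklered:
        systems = ["sprinklers"]          # rule 1: sprinklered building
        if clean_agent:
            systems.append("clean agent")  # clean agent may be added electively
    else:
        # rule 2: either sprinklers or a clean agent system (both permissible)
        systems = ["clean agent"] if clean_agent else ["sprinklers"]
    return systems

def raised_floor_protection(room_has_clean_agent: bool) -> str:
    """Rule 3: protection required under a raised floor."""
    if room_has_clean_agent:
        return "include raised floor in the clean agent protected volume"
    return "sprinklers, carbon dioxide, or inert gas clean agent under the floor"

def storage_media_needs_protection(volume_ft3: float) -> bool:
    """Combustible storage media devices in excess of 27 cubic feet need protection."""
    return volume_ft3 > 27

print(it_area_protection(True, False))
print(raised_floor_protection(True))
print(storage_media_needs_protection(3 * 3 * 3))
```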
14.7 FIRE DETECTION
A variety of fire detection devices are available in the marketplace and each is meant for specific types of applications, although manufacturers and installers may have reasons other than applicability for recommending a given technology. Types of fire detectors include smoke, heat, and flame. Other detectors that can indicate the increased potential for fire include corrosive and combustible gas detection and liquid-leak detection. Smoke detection is the most common and applicable type of fire detection for use in mission critical facilities. Smoke detectors measure smoke concentrations in terms of percent of visual obscuration per foot, in other words, how much visible light is obscured. Smoke detectors can measure obscuration rates from the invisible range
(0-0.2%/ft), which corresponds to the incipient fire stage, to the visible range (0.2-6.0%/ft), which corresponds to a fire's smoking stage. Some smoke detectors are passive devices, such as spot-type detectors and beam detectors. Others are active devices, such as air-sampling smoke detectors (ASSD) and video image smoke detectors (VISD). Spot-type detectors are mounted on ceilings, in cabinets, and under raised floors. Beam detectors are mounted on walls and project a beam across an open space. Air-sampling smoke detectors draw continuous flows of air from multiple points anywhere in a three-dimensional volume to an analyzer. Video image smoke detection processes live digital video images from standard surveillance-type cameras through a sophisticated image processing algorithm to sense smoke movement in an open area. In short, active detectors bring smoke to the detector; passive detectors wait for smoke to find them. Smoke detectors are to be installed according to requirements found in the applicable codes and standards. For area coverage, spot detectors are to be located in a matrix that ranges from 11 ft to 30 ft on center (125 to 900 ft² per detector), depending on the amount of airflow in the volume of the protected area. The minimum standard for normally ventilated spaces is for spot smoke detectors to be spaced at 30 ft on center. So for a 900 ft² room that is 30 ft wide by 30 ft long, one smoke detector located in the center of the room is sufficient. However, in a 900 ft² hallway that is 10 ft wide and 90 ft long, three smoke detectors are required, at 15, 45, and 75 ft along the centerline of the hallway. In mission critical spaces, detector density is increased (spacing is decreased) due to high airflow movement rates, beginning at the equivalent of 8 air changes per hour (Figure 14.2). The density increases to 250 ft² per detector at 30 air changes per hour, and reaches the maximum of 125 ft² per detector at 60 air changes per hour.
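As a rough illustration of these density rules, the following sketch interpolates linearly between the three points quoted above (900 ft² per detector up to 8 air changes per hour, 250 ft² at 30, and 125 ft² at 60). The function names and the linear interpolation are illustrative assumptions only; an actual design must follow the NFPA 72 curve shown in Figure 14.2.

```python
import math

# Points quoted in the text (air changes per hour -> max ft^2 per spot detector).
# Read off Figure 14.2; the actual NFPA 72 curve governs real designs.
_POINTS = [(8, 900.0), (30, 250.0), (60, 125.0)]

def max_area_per_detector(ach: float) -> float:
    """Approximate allowable coverage per spot detector (linear interpolation)."""
    if ach <= _POINTS[0][0]:
        return _POINTS[0][1]          # normally ventilated: 30 ft spacing, 900 ft^2
    if ach >= _POINTS[-1][0]:
        return _POINTS[-1][1]         # curve bottoms out at 125 ft^2
    for (x0, y0), (x1, y1) in zip(_POINTS, _POINTS[1:]):
        if x0 <= ach <= x1:
            return y0 + (y1 - y0) * (ach - x0) / (x1 - x0)

def detectors_required(room_area_ft2: float, ach: float) -> int:
    """Count of spot detectors needed for a room at a given air-change rate."""
    return math.ceil(room_area_ft2 / max_area_per_detector(ach))

# A 900 ft^2 room needs 1 detector when normally ventilated,
# but 8 detectors at 60 air changes per hour (900 / 125 = 7.2 -> 8).
print(detectors_required(900, 8))    # 1
print(detectors_required(900, 60))   # 8
```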
Because of high power densities in equipment rooms today, the maximum density (one detector per 125 ft²) is typical. For the purpose of location, the ports of air-sampling smoke detectors are considered the equivalent of spot-type detectors and are subject to the same spacing rules. Many mission critical facilities are being constructed with open ceilings, and the spacing of detectors must comply with beam or joist requirements. Solid joists are any solid members that protrude 4 in or more below a ceiling and are spaced at 3 ft or less, center to center. Beams are members that protrude more than 4 in below the ceiling and are spaced more than 3 ft center to center. For smoke detectors on ceilings with beam depths less than 10% of the ceiling height, detectors can be installed on the ceiling or on the bottom of a beam using standard smooth-ceiling spacing guidelines. If beam depths are greater than 10% of the ceiling height and the beams are farther apart than 40% of the ceiling height, then smoke detectors must be installed in every beam pocket in the dimension perpendicular to the beams. If beam depths are greater than 10% of the ceiling height and the beams are closer together than 40% of the ceiling height, then smoke detector spacing must be reduced 50% in the dimension perpendicular to the beams. In corridors (less than 15 ft wide) with beams on the ceiling, standard smooth-ceiling spacing may be followed regardless of beam depth or ceiling height. As stated before, these rules changed in the 2010 edition of the National Fire Alarm and Signaling Code (NFPA 72).
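The beam-pocket decision logic described above can be sketched as follows. The function name and return labels are hypothetical; this is a simplified reading of the rules as stated in the text, not a design tool.

```python
def beam_ceiling_rule(ceiling_ht_ft, beam_depth_ft, beam_spacing_ft,
                      corridor_width_ft=None):
    """Classify smoke-detector layout per the beam rules described above.

    Returns one of: 'smooth', 'every_pocket', 'half_spacing_perpendicular'.
    Simplified reading of the pre-2010 NFPA 72 rules -- not a design tool.
    """
    # Corridors under 15 ft wide use smooth-ceiling spacing regardless of beams.
    if corridor_width_ft is not None and corridor_width_ft < 15:
        return "smooth"
    # Shallow beams (depth < 10% of ceiling height): smooth-ceiling spacing.
    if beam_depth_ft < 0.10 * ceiling_ht_ft:
        return "smooth"
    # Deep, widely spaced beams: a detector in every beam pocket.
    if beam_spacing_ft > 0.40 * ceiling_ht_ft:
        return "every_pocket"
    # Deep, closely spaced beams: 50% spacing perpendicular to the beams.
    return "half_spacing_perpendicular"

# 12 ft ceiling, 2 ft deep beams 6 ft on center:
# depth (2) > 10% of 12 (1.2), spacing (6) > 40% of 12 (4.8) -> every pocket.
print(beam_ceiling_rule(12, 2, 6))  # every_pocket
```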
Figure 14.2. Detector density increases (spacing is decreased) along with higher airflow movement rates. (Source: National Fire Protection Association, NFPA 72, National Fire Alarm Code, Chapter 5.7.5.3, 2007 ed.)

Stratification is also a problem in areas with high ceilings and substantial cooling. In slow-growing fires without sufficient heat energy to propel smoke to the ceiling where smoke detectors are located, a fire can burn for an extended period of time without detection. The hot gases that entrain smoke are cooled, and the smoke finds a thermal equilibrium and pools at a level well below the ceiling. The fire alarm code allows high/low layouts of smoke detectors. The layout should be in an alternating pattern, with the high level at the ceiling and the low level a minimum of 3 ft below the ceiling (Figure 14.3). Smoke detectors in subfloors are also subject to the same spacing rules as those located at the ceiling level, but their spacing is not subject to the high-airflow reduction rules. It has been traditional practice, however, for smoke detectors below a raised floor to be matched to the locations of the ceiling smoke detectors. Historically, two passive smoke detection technologies—photoelectric and ionization—have been widely used. Photoelectric smoke detectors measure light scattering (ostensibly from the presence of smoke) and compare it to a threshold to establish an alarm condition. This threshold is also known as a detector's sensitivity. Conventional photoelectric detectors are nominally set at 3%/ft obscuration, and ionization detectors are nominally set at 1%/ft. These settings correspond to complete visual obscuration at distances of 33 ft and 100 ft, respectively.
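The sensitivity-to-distance arithmetic works out as follows, using the simple linear reading implied by the text (the function name is illustrative):

```python
def full_obscuration_distance_ft(sensitivity_pct_per_ft: float) -> float:
    """Distance at which sight is fully obscured, using the simple linear
    reading in the text: 100% divided by the %/ft sensitivity setting."""
    return 100.0 / sensitivity_pct_per_ft

print(round(full_obscuration_distance_ft(3.0)))  # ~33 ft (photoelectric, 3%/ft)
print(round(full_obscuration_distance_ft(1.0)))  # 100 ft (ionization, 1%/ft)
```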
Figure 14.3. High/low layouts of smoke detectors include some detectors 3 ft below the ceiling to overcome stratification. A spot detector (high) and air-sampling port (low) are shown here. (Photo courtesy of Orr Protection Systems, Inc.)

Prior to the advent of addressable/programmable fire detection systems, smoke and heat detectors were built to comply with published standards, so much so that the performance characteristics were essentially the same regardless of brand; only minor differences could be noted. Today, manufacturers develop technologies that are unique, and the performance characteristics and detection capabilities can differ considerably from brand to brand, even though they are listed to the same standard. To confuse matters, many smoke detector manufacturers private-label their technologies for a number of system manufacturers, which may use different model names for the same device. Modern smoke detectors can be programmed to different thresholds, down to and including levels in the invisible range, and many have early-warning thresholds that can provide advance notification of a potential problem for the purpose of emergency warning. Air-sampling smoke detectors may have three or four programmable alarm thresholds that can be used for sequential interlocks such as early-warning alarm, equipment shutdown, fire alarm evacuation notice, gaseous agent release, sprinkler release, and so on. NFPA 72, 75, and 76 dictate the layout and spacing of spot-type smoke detectors. Although the International Building Code (IBC) does not require smoke detection in all buildings that might contain mission critical areas, NFPA 75 and 76 do.
NFPA 75 requires smoke detection on the ceilings of electronic equipment areas, below raised floors where cables exist, and above ceilings where that space is used to recirculate air for other parts of a building. NFPA 75 references NFPA 72 requirements for the spacing of smoke detectors in IT areas. NFPA 76 requires very-early-warning fire detection (VEWFD) for signal processing areas over 2500 ft² in size, and early-warning fire detection (EWFD) in smaller signal processing areas as well as all other equipment areas. In telecommunications facilities, a maximum of 200 ft² per VEWFD detector is permitted in signal processing areas, and 400 ft² per EWFD detector in support areas such as battery, AC power, and DC power rooms. VEWFD is defined as detection with a maximum sensitivity of 0.2%/ft, and EWFD is defined as detection with a maximum sensitivity of 1.6%/ft. NFPA 76 also requires VEWFD at the air returns of the signal processing areas at the rate of one detector (or sample port) for each 4 ft². When facing the challenge of detecting smoke in electronic equipment rooms with high-airflow cooling systems, the best technology for discovering a fire before significant damage occurs is air-sampling smoke detection (Figure 14.4). These systems draw a continuous sample from throughout the area, enabling the system to detect minute amounts of heavily diluted and cooled smoke. Air-sampling smoke detection (ASSD) systems can accurately detect slow-growing fires anywhere from a few hours to a few days in advance of traditional spot-type smoke detectors. By actively exhausting the air samples, the systems can easily ignore spurious events and false-alarm conditions. However, ASSD systems must be designed, set up, programmed, and commissioned under watchful, experienced eyes, or they can become a persistent nuisance. The preferred construction material for air-sampling systems is plenum-rated CPVC pipe made from Blazemaster® plastic. The pipe is orange in color and easily identifiable.
It is manufactured by several different companies and is available from fire sprinkler distributors and air-sampling smoke detector manufacturers.

Figure 14.4. Air-sampling smoke detection. (Image courtesy of Vision Systems.)

When PVC-based plastics begin to overheat, the heated material off-gases hydrochloric acid (HCl) vapors. Detectors are made that will detect these small amounts of HCl vapor. There are no guidelines for the spacing of these detectors; they can be placed selectively anywhere an overheating situation is probable or expected. Locations might include a high-power-density rack or the inside of an enclosure. In areas such as battery rooms and power rooms, lower airflow reduces the requirements for smoke detection. Twenty to thirty air changes per hour are common, and 250 to 400 ft² per smoke detector is typical. Spot-type smoke detectors or air-sampling smoke detectors provide an excellent level of protection for these areas. Although not prohibited, beam-type smoke detectors and video image smoke detection systems are designed for other types of environments and typically do not perform well in mission critical facilities. Both of these smoke detection options are designed to work in large open spaces such as lobbies, atriums, and arenas. In generator and fuel-storage rooms, heat detectors (Figure 14.5) are the primary choice for fire detection. Heat detectors are spaced in a matrix format like smoke detectors, but the spacing is dictated by the particular model's listed spacing (usually 25-50 ft). In the same way that smoke detector spacing is reduced for increased airflow, heat detector spacing is reduced for ceiling height (Table 14.2). The reduction begins at 10 ft and reaches a maximum at 30 ft above the floor. It is common for the ceiling height of a generator room to be anywhere from 12 to 25 ft. In areas with solid beam or joist ceilings, the spacing of heat detectors must be reduced. The rules are similar to those for smoke detectors, but not identical, and one should take time to understand the differences. On solid-joist ceilings, the spacing of a heat detector must be reduced by 50%. On ceilings with beams that are greater than 4 in in depth, the spacing perpendicular to the beams must be reduced by two-thirds.
So for an area with 14 ft-high ceilings and 12 in beam construction, a heat detector with a listed spacing of 50 ft would be reduced by two-thirds to 16.5 ft (50 × 0.33) running perpendicular to the beams, and then by one-third to about 34 ft (50 × 0.67) running parallel to the beams. On ceilings where beams are greater than 18 in in depth and are more than 8 ft on center, each pocket created by the beams is considered a separate area requiring at least one heat detector.
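The worked example above can be reproduced in a short sketch. The function name and the 0.33/0.67 multipliers follow the text's example; treating the solid-joist 50% reduction as applying in both directions is an assumption made here for illustration.

```python
def heat_detector_spacing(listed_ft, beam_depth_in=0.0, solid_joists=False):
    """Heat-detector spacing (perpendicular, parallel to members), following
    the reductions described in the text. Simplified illustration only."""
    if solid_joists:
        # Solid-joist ceilings: spacing reduced by 50% (applied in both
        # directions here as a conservative reading).
        return listed_ft * 0.5, listed_ft * 0.5
    if beam_depth_in > 4:
        # Beams deeper than 4 in: perpendicular spacing reduced by two-thirds;
        # the text's worked example also trims the parallel run by one-third.
        return listed_ft * 0.33, listed_ft * 0.67
    return listed_ft, listed_ft

# The text's example: 50 ft listed spacing with 12 in beams.
perp, para = heat_detector_spacing(50, beam_depth_in=12)
print(round(perp, 1), round(para, 1))  # 16.5 ft perpendicular, 33.5 (~34) ft parallel
```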
Figure 14.5. Typical heat detectors. (Photo courtesy of Fenwal Protection Systems.)
Table 14.2. Heat detector spacing reduction based on ceiling height

Ceiling height above       Up to and including
m        ft                m        ft        Multiply listed spacing by
0        0                 3.05     10        1.00
3.05     10                3.66     12        0.91
3.66     12                4.27     14        0.84
4.27     14                4.88     16        0.77
4.88     16                5.49     18        0.71
5.49     18                6.10     20        0.64
6.10     20                6.71     22        0.58
6.71     22                7.32     24        0.52
7.32     24                7.93     26        0.46
7.93     26                8.54     28        0.40
8.54     28                9.14     30        0.34
Source: National Fire Protection Association, NFPA 72, National Fire Alarm Code, Table 5.6.5.5.1, 2007 ed.
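Table 14.2 lends itself to a simple lookup. A sketch using the ft columns of the table (the function name is my own; heights above 30 ft fall outside the table's range):

```python
import bisect

# Table 14.2: ceiling height (ft, "up to and including") -> spacing multiplier.
_HEIGHTS = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
_FACTORS = [1.00, 0.91, 0.84, 0.77, 0.71, 0.64, 0.58, 0.52, 0.46, 0.40, 0.34]

def height_multiplier(ceiling_ft: float) -> float:
    """Multiplier applied to a heat detector's listed spacing for ceiling height."""
    i = bisect.bisect_left(_HEIGHTS, ceiling_ft)  # first band that includes height
    if i >= len(_HEIGHTS):
        raise ValueError("ceilings above 30 ft are outside Table 14.2")
    return _FACTORS[i]

# A detector listed at 50 ft spacing under a 14 ft generator-room ceiling:
print(round(50 * height_multiplier(14), 1))  # 42.0 ft
```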
Heat detectors must be mounted on the bottom of solid joists. They may be mounted on the bottom of beams that are less than 12 in in depth and less than 8 ft on center; otherwise, they must be mounted at the ceiling. The types of heat detectors available are many and varied: fixed-temperature, rate-of-rise, rate-compensated, programmable, averaging, and so on. There are spot-type and linear (or cable-type) detectors. Technologies employed include fusible elements, bimetallic springs, thermistor sensors, and so on. In most indoor applications, spot-type heat detectors are suitable. The temperature threshold should be set at 80 to 100°F above the maximum ambient temperature for the area. In electronic equipment areas, a temperature threshold of 50-60°F above ambient is more appropriate. It must be noted that heat detection in electronic equipment spaces is typically designed to safeguard against an accidental preaction sprinkler release; heat detection is not a substitute for smoke detection, which remains the best practice for protection of mission critical areas. For conventional systems, rate-compensated detectors will provide the fastest response to a given temperature setting. They are not prone to false alarms due to temperature swings, are restorable, and can be tested. Fixed-temperature and rate-of-rise detectors are less responsive and less expensive than rate-compensated detectors, which cost 25-30% more on an installed basis. Fusible-element heat detectors are the least expensive, but they are also the slowest reacting and cannot be tested without replacement. Linear (or cable-type) heat detectors are meant for industrial equipment protection and, therefore, are suitable for fire detection in diesel generator and fuel-storage applications. Linear detectors are subject to spacing rules like spot detectors, but they can be distributed and located anywhere within the three-dimensional space and can be mounted directly on the equipment.
This is extremely effective in areas with very high ceilings
or excessive ventilation. Linear detectors can also be mounted directly on switchgear, battery strings, or cable bundles, or located in cable vaults or utility tunnels. Linear heat detection is available in fixed-temperature and averaging-type detectors. The averaging-type detector can be programmed to respond to fast-growing fires that cause high temperatures in specific locations, or to slow-growing fires that cause a general increase in the temperature of an area. Flame detectors utilize optical detection technologies and are useful in generator and fuel-storage areas. If the quickest response to a diesel fuel fire is desired, a flame detector will respond almost instantaneously to flames, whereas heat detectors may take several minutes, depending on the location of the detector relative to the fire. Flame detectors can be placed anywhere, but they must be pointed in the general direction of the fire in order to view it and detect it. There are two primary types of flame detectors: ultraviolet (UV) and infrared (IR) (Figure 14.6). Both analyze the light spectrum, and both are subject to false alarms from normal light sources, including sunlight. Combination detectors such as UV/IR, dual-spectrum IR, and triple IR are available and are much more reliable and stable than single-spectrum detectors. Flame detectors are expensive, costing 30 to 40 times as much as a single heat detector. However, in large open areas, flame detectors will not only provide superior detection performance, but will also be cost-effective compared to heat detectors. If the quickest detection of a fire is required, flame detectors are the best choice. Combustible-gas and liquid-leak detection is also a consideration in mission critical facilities. Although not fire detectors, these types of detectors can discover the presence of substances that indicate a breach that heightens the probability of fire.
Guidelines for required spacing or locations do not exist, but common sense dictates where and when these technologies are used. In wet-cell battery rooms, hydrogen off-gas can be a problem. If not properly vented, it can pool and be ignited by static electricity or arcing contactors. Combustible-gas detectors can alert personnel to the presence of hydrogen due to a failed exhaust fan or a cracked battery case. They can also sense natural-gas leaks or fuel vapors in generator and fuel-storage areas.
Figure 14.6. Typical flame detectors. Left: Triple IR; Right: UV/IR combination. (Photos courtesy of Dettronics Corporation.)
Liquid-leak detectors can be used to detect fuel-tank and fuel-line leaks. Linear or cable-type detectors can be located within containment areas or attached directly to fuel lines. Some detection systems are designed so that they are sensitive only to hydrocarbons such as diesel fuel and will not false-alarm when made wet by mopping or water spray.
14.8 FIRE SUPPRESSION SYSTEMS
Today, there are a number of types of automatic fire suppression systems available to an owner, and choosing which technology or combination of technologies to use deserves careful consideration. The suppression technologies include sprinklers, water mist, carbon dioxide, Halon 1301, and clean agents. For mission critical facilities, the 2009 edition of NFPA 75, Standard for the Protection of Information Technology Equipment, requires the following:

8.1 Automatic Sprinkler Systems.
8.1.1 Information technology equipment rooms and information technology equipment areas located in a sprinklered building shall be provided with an automatic sprinkler system.
8.1.1.1 Information technology equipment rooms and information technology equipment areas located in a nonsprinklered building shall be provided with an automatic sprinkler system or a gaseous clean agent extinguishing system or both (see Section 8.4).
8.1.1.2 Either an automatic sprinkler system, carbon dioxide extinguishing system, or inert agent fire extinguishing system for the protection of the area below the raised floor in an information technology equipment room or information technology equipment area shall be provided.
8.1.2 Automatic sprinkler systems protecting information technology equipment rooms or information technology equipment areas shall be installed in accordance with NFPA 13, Standard for the Installation of Sprinkler Systems.
8.1.3 Sprinkler systems protecting information technology equipment areas shall be valved separately from other sprinkler systems.
8.1.4 Automated Information Storage System (AISS) units containing combustible media with an aggregate storage capacity of more than 0.76 m³ (27 ft³) shall be protected within each unit by an automatic sprinkler system or a gaseous agent extinguishing system with extended discharge.
The information technology spaces addressed by NFPA 75 are often found inside buildings that require a fire sprinkler system according to NFPA 101, Life Safety Code; NFPA 1, Fire Code; or the International Building Code (IBC). These types of facilities include hospitals, airports, high-rise buildings, industrial occupancies, and large business occupancies. Gaseous extinguishing systems are neither required nor prohibited in these occupancies. Other times, a large data center will be in a stand-alone building or low-rise business occupancy that does not require protection by a fire sprinkler system. In these facilities, however, NFPA 75 requires either sprinklers or clean agent fire suppression for the "information technology equipment rooms and areas."
NFPA 75 also requires sprinkler, CO2, or inert-gas clean agent (IG-541, IG-55) suppression below a raised floor (this requirement is waived if the equipment and underfloor spaces are protected with a total-flood clean agent system). In addition, the standard requires fire suppression in automated media/information storage units. For facilities processing telecommunications, fire suppression systems are required by NFPA 76, Standard for Fire Protection in Telecommunications Facilities, only if the telecommunications facility is located in a multitenant building that is not constructed to specific noncombustibility ratings. Sprinklers save lives and property and are designed in accordance with NFPA 13, Standard for the Installation of Sprinkler Systems. For most mission critical facilities, the primary application is considered "light hazard," and this would include all electrical and electronic equipment areas. Areas containing standby diesel generators are considered "ordinary hazard" applications. The basic types of sprinkler systems are wet, dry, and preaction; combination dry and preaction systems are commonly known as double-interlock systems. Most building sprinkler systems are wet type, meaning the pipes are normally filled with water. The sprinkler heads are closed, and they open one at a time as heat either bursts a glycerin-filled glass bulb or melts an element. Most activations of wet sprinklers involve only one sprinkler head. Dry systems utilize a valve held shut by compressed air within the piping. When a sprinkler head is activated, the pressure within the piping is relieved and the valve opens, allowing water to flow to the open head. Dry systems are typically found in applications where freezing is a problem, such as cold-storage facilities and loading docks. Preaction systems can have closed-head or deluge (open-head) distribution systems.
Preaction systems utilize an independent detection system to provide a releasing signal that opens a valve, allowing water to flow to an activated head or, in the case of deluge systems, to all the heads served by the valve. Preaction closed-head systems are typically found in mission critical applications; preaction deluge systems are found mostly in industrial fire protection applications. Preaction closed-head systems require both a releasing signal and a head opening before delivering water to the protected space. Preaction/dry-pipe or double-interlock systems are the best technology choice when installing a sprinkler system in a mission critical area (Figure 14.7). The pipes are dry and filled with compressed air. Any leaks in the piping are identified by low-pressure switches and excessive operation of the air compressor. The valve opens only when it has received a releasing signal and the pressure within the piping has been relieved. Unlike a standard preaction system, the valve does not open and water does not enter the piping until a sprinkler head opens. There are both advantages and disadvantages to sprinkler systems (Table 14.3).

Figure 14.7. Preaction/dry-pipe or double-interlock systems are the best technology choice for conventional sprinklers in mission critical areas. (Image courtesy of Viking Sprinkler Corporation.)

Table 14.3. Advantages and disadvantages of sprinkler systems

Advantages:
- Most effective for life safety
- Inexpensive suppression system
- Inexpensive restoration of system after suppression
- Low-tech life cycle

Disadvantages:
- May not extinguish a fire in a mission critical environment
- Collateral damage to building, equipment, and business operations
- Extensive cleanup required

Sprinkler systems were anathema to mission critical facilities infrastructure until the late 1980s and early 1990s, when it became well understood that Halon 1301 was not going to be produced for much longer, and many feared it would be banned at some point in the future. In the early 1990s, preaction and double-interlock systems became commonplace. However, by the mid-1990s a problem with corrosion began to show up. The sprinkler code was changed, and preaction and double-interlock systems were required to use galvanized pipe, the thinking being that galvanized pipe would better resist corrosion. It has now been discovered that the problem is even worse with galvanized pipe (Figure 14.8). Once thought to be rusting, the damage is microbiologically induced corrosion (MIC): a corrosive attack on the piping by the wastes generated by microbes living in the dark, wet, oxygen-rich atmosphere inside "dry" sprinkler pipes. When preaction sprinklers are tested, some of the water is trapped in the pipe because of low points, seams, or internal dams on grooved pipe. Microbes then set to work gobbling up oxygen and water and leaving highly acidic wastes that attach to the rough surface of the pipe. Galvanized piping has an even rougher surface profile than standard steel pipe; hence, the corrosive action is more aggressive. The speed at which this action occurs depends on the amount of water left standing in the piping. It is often only a matter of time before this problem surfaces in preaction or double-interlock sprinkler systems. Treating the piping with microbiocides, along with thoroughly draining and mechanically drying it, is the best way to prevent this problem. Still, several other corrosion processes can also be at work in sprinkler pipes.

Figure 14.8. Galvanized pipe damaged by corrosion. (Photo courtesy of Orr Protection Systems, Inc.)

In mission critical areas where preaction/dry-pipe sprinkler systems are used, they must be separately valved from other areas of the building in which they are located. This means that if a server room is located in the middle of an office high-rise, it will require tapping into the main fire-water riser or branch piping for the water supply and piping to a double-interlock valve located close to the room. Each valve is limited to serving 52,000 ft² at light- and ordinary-hazard spacing, which is sufficient for most information technology equipment rooms and telecommunications signal processing rooms, with the exception of extremely large facilities. Each double-interlock preaction sprinkler system is outfitted with several devices: a releasing solenoid valve, tamper switch, pressure switch, flow switch, and air compressor. The releasing solenoid valve is operated by a control system and opens to bleed pressure from the double-interlock valve, allowing it to open. The tamper switch is a supervisory device that monitors the shutoff valve (which can be integral to the
double-interlock valve). If the shutoff valve is closed, the fire alarm system will indicate, as a supervisory signal, that the sprinkler system is disabled. The flow switch monitors the outgoing piping for water flow. It, in turn, is monitored by the fire alarm system and will activate the alarm signal if water is flowing through the pipe. The pressure switch monitors air pressure in the branch piping going out above the protected space to the sprinkler heads. A low air pressure indication means that either a sprinkler head has opened or there is a leak in the piping that needs to be repaired. The fire alarm panel will indicate this situation as a supervisory signal as well, indicating a problem with the sprinkler system that needs attention. The air compressor supplies compressed air to the branch piping. Each system typically has its own compressor, which may be attached directly to the piping; however, several systems can be served by one common compressor. Sprinkler heads are spaced in a matrix at the ceiling according to design calculations, roughly on 10 ft to 15 ft centers, depending on the specific sprinkler-head flow characteristics and the water pressure in the system. Sprinkler heads are designed to open at standardized temperature settings. For mission critical facilities, these temperatures are 135°F, 155°F, 175°F, and 200°F. Typically, heads with a 135°F threshold would be used in electronic equipment areas, 155°F or 175°F heads in power and storage areas, and 200°F heads in unconditioned standby-generator areas. Keep in mind that, because of thermal lag and the speed of growth of a fire, the temperature surrounding a sprinkler head may reach several hundred degrees above the threshold before the head opens. Sprinkler heads must also be located below obstructions. In data centers with overhead cooling units or ductwork more than 3 ft wide, sprinkler heads must be located below the equipment.
A common strategy for releasing a double-interlock system from a fire alarm control system is to require cross-zoned or confirming detection, meaning that two devices must be activated, or two thresholds met, before a releasing signal is sent to the double-interlock valve. This strategy has been used for many years, beginning with the installation of Halon 1301 systems, and is meant to reduce the likelihood of a premature or accidental release. "Any-two" strategies are easy to program but may not provide the level of false-alarm rejection required to eliminate accidental releases. Many cases exist in which two smoke detectors in a subfloor close to a CRAC unit have gone into alarm, causing a release of the suppression system, because the reheat coils in the CRAC kicked on for the first time in a few months, burning off accumulated dust. If the confirming strategy in those cases had been "one in the subfloor, one on the ceiling," the pain and cost of the accidental release would have been avoided. The normal operating characteristics of the environment, plus the capabilities of the detection and control systems, should be taken into account when determining which confirming strategies to use. In facilities with gaseous total-flood suppression systems, the sprinkler release should be initiated only after the discharge of the gaseous agent. The sprinkler system should have means to manually release the valve. A manual pull station, properly identified with signage and located at each exit from a protected space, should be provided. A backup manual mechanical release should be located at the double-interlock valve.
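The difference between "any-two" and cross-zoned confirmation can be sketched as simple set logic. Everything here (function name, zone labels, detector IDs) is hypothetical; actual releasing logic must reside in a listed releasing control panel.

```python
def release_permitted(active_detectors, strategy="any_two"):
    """Decide whether a releasing signal may be sent, given active detectors
    as (zone, detector_id) tuples. Illustrative sketch only."""
    if strategy == "any_two":
        # Any two activated devices confirm -- simple, but two subfloor
        # detectors near one CRAC unit can trip it spuriously.
        return len(active_detectors) >= 2
    if strategy == "cross_zone":
        # Require confirmation from two different zones, e.g. one subfloor
        # detector AND one ceiling detector.
        zones = {zone for zone, _ in active_detectors}
        return len(zones) >= 2
    raise ValueError(strategy)

# The CRAC reheat-coil scenario: two subfloor detectors alarm on burned dust.
burned_dust = [("subfloor", "SD-12"), ("subfloor", "SD-13")]
print(release_permitted(burned_dust, "any_two"))     # True  (accidental release)
print(release_permitted(burned_dust, "cross_zone"))  # False (release avoided)
```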
The codes do not require that your mission critical equipment be shut down, but they encourage an automatic emergency power-off upon a sprinkler release. Remember that sprinklers are meant and designed for life safety, not business or asset protection. Any water that contacts energized electronics will cause permanent damage. Manufacturers of computers have said their equipment can be "thrown into a swimming pool, then dried out, and it will run." Maybe so, but if the equipment is plugged in and powered up when thrown in, it will be destroyed. One final consideration for sprinkler systems is the purity of the water being discharged. Sprinkler water is not clean: it is dirty and full of microbes, rust, lubricants, and other debris incidental to the construction of sprinkler systems. Each sprinkler head in a light-hazard area will discharge approximately 20 gallons per minute (a garden hose delivers about 3 to 5 gallons a minute) until the fire department shuts off the valve. The fire department will arrive within a few minutes after receipt of the alarm, but will not shut off the sprinkler valve until an investigation is complete, which may take up to half an hour. Expect the cleanup after a sprinkler discharge to take several days. Ceiling tiles will be soaked and may collapse onto the equipment. Equipment affected by the discharge should be disassembled and inspected, dried with hair dryers or heat guns, and cleaned with compressed air.
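The water volumes involved are easy to estimate from the figures quoted above (roughly 20 gpm per light-hazard head, up to a half hour before shutoff). The function name is illustrative:

```python
def discharge_gallons(open_heads, minutes, gpm_per_head=20):
    """Rough water volume released before the valve is shut off.
    20 gpm per light-hazard head is the figure quoted in the text."""
    return open_heads * gpm_per_head * minutes

# One open head flowing for the half hour an investigation may take:
print(discharge_gallons(1, 30))  # 600 gallons onto the equipment below
```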
14.8.1
Water Mist Systems
Water mist systems atomize water to form a very fine mist that becomes airborne like a gas. This airborne mist extinguishes fires with four mechanisms: cooling, radiant heat absorption, oxygen deprivation, and vapor suppression. The primary fire extinguishing mechanism of water is cooling, and the rate of cooling that water provides is proportional to the surface area of the water droplets that are distributed to the fire. The smaller the water droplet, the greater the surface area for a given quantity of water, increasing the rate of cooling. Cooling rates for water mist systems can be several hundred times that of conventional sprinklers.

The airborne mist creates a screen that absorbs radiant heat (when entering burning structures, firefighters use "water fog" nozzles that produce a fog screen to absorb the radiant heat), and because the water droplets are so small they turn to steam quickly when they impinge on the hot surface of the burning material. The conversion to steam expands the water by a factor of 1760 (one cup of water becomes 1760 cups of steam), and this expansion pushes air away from the fire, depriving it of oxygen where it needs it. In water mist environments, any airborne combustibles such as fuel or solvents are rendered inert.

Water mist systems were developed originally for applications aboard ships in an attempt to use less water and use it more effectively. In the early 1990s, when researchers in the United States were scrambling for alternatives to Halon 1301, much effort was put into the development of water mist systems for mission critical applications. Most of that effort ended up with the development of Class II water mist systems, which are only applicable to protect diesel standby generators and other similar machinery. However, in recent years, Class I water mist systems have been developed and are listed for use in fire suppression and smoke scrubbing in electronic equipment areas.
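The 1760:1 expansion factor quoted above can be sanity-checked with the ideal gas law. The sketch below is the author's illustration, not from the text; the exact ratio depends on the liquid density and pressure assumed, but it lands in the same range:

```python
# Sanity check of the liquid-water-to-steam expansion factor using the
# ideal gas law. Values are standard physical constants; the precise
# ratio depends on the liquid density and pressure assumed.

R = 0.082057          # L*atm/(mol*K), ideal gas constant
T_BOIL = 373.15       # K, boiling point of water at 1 atm
MOLAR_MASS = 18.015   # g/mol, water
RHO_LIQUID = 958.0    # g/L, liquid water density at 100 degF... no: at 100 degC

steam_molar_volume = R * T_BOIL / 1.0           # L/mol of steam at 1 atm
liquid_molar_volume = MOLAR_MASS / RHO_LIQUID   # L/mol of liquid water

expansion = steam_molar_volume / liquid_molar_volume
print(round(expansion))  # ~1600, the same order as the 1760 quoted above
```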
FIRE PROTECTION IN MISSION CRITICAL INFRASTRUCTURES

The key to effective water mist fire suppression for protection of electronics areas has been the extremely small water droplet sizes and the use of deionized water. Water mist systems (Figure 14.9) operate much like gaseous systems in that they are stand-alone and require no outside power to operate. The primary components of a basic water mist system include a water storage vessel, a compressed gas agent and vessel (usually nitrogen or compressed air), a releasing valve, piping, and nozzles. They operate very simply: after receiving a releasing signal, the compressed gas pushes the water out through the pipes at elevated pressure, and the water is atomized at the nozzles. The supply is designed to run for 10 minutes in the case of a typical total-flood system. The systems are preengineered and typically utilize four to eight nozzles.

Most systems available in the United States are of the total-flooding type, in compliance with the NFPA 750 standard and Factory Mutual testing/approval for use in a maximum volume of 9160 ft3 (916 ft2 with a 10-ft ceiling). However, Class I water mist systems are available as larger total-flooding systems, local-application systems, and even double-interlock, preaction systems. Total-flood water mist systems are applicable in standby generator areas. Most systems will deliver 70-100 gallons of water mist over a 10-minute period. The water is delivered as a mist, will create an environment very similar to a foggy morning, and will reduce visibility to 5-10 ft. The systems stand by at atmospheric pressure and, when released, create 300-400 psig in the piping.

There are a number of benefits to a total-flood water mist system in generator areas, including:

• Little or no collateral damage
• Fast suppression of Class B (fuels) and Class A (solids) fires
Figure 14.9. Typical water mist system. (Image courtesy of Fike Corporation.)
• Will not thermally shock and damage a hot diesel engine
• Safe for human exposure
• No runoff
• Inexpensive recharge costs
Class I water mist systems (Figure 14.10) are also available as local-application systems that can be targeted to a piece of equipment, such as a transformer or diesel engine, rather than as total-flooding systems, which require distribution throughout an entire volume.

Class I water mist systems may also be designed for use in data centers, primarily as an alternative to double-interlock sprinkler systems. The systems use deionized water and stainless steel piping to deliver approximately 2 gallons per minute of water mist, which is electrically nonconductive. Like double-interlock sprinkler systems, the piping is filled with compressed air, and the sprinkler heads are closed until heat-activated. The nozzles are located at the ceiling like standard sprinkler heads and have the same temperature ratings. The spacing of the nozzles depends on the size of the nozzle, but in a 2-gallon-per-minute model they would be spaced approximately 12 ft apart. A single water supply and compressor can serve several different areas in several modes (double-interlock, total-flood, or local application), so a single system can protect all the spaces of an entire mission critical facility.

Water mist double-interlock systems have both advantages and disadvantages (Table 14.4). Unlike double-interlock sprinkler systems, MIC has not been a problem for double-interlock water mist systems (stainless steel piping is not conducive to the problem). These systems are very expandable and can serve applications of several hundred thousand square feet, including high-rise buildings.
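As an illustration of the approximately 12-ft spacing mentioned above, the nozzle count for a rectangular room can be estimated with a simple square-grid layout. This is the author's sketch only; actual nozzle placement must follow the specific nozzle listing and NFPA 750:

```python
import math

# Estimate nozzle count for a rectangular room on a simple square grid.
# The ~12 ft spacing for a 2-gpm Class I water mist nozzle comes from
# the text; actual spacing depends on the nozzle's listing.

def nozzle_count(length_ft: float, width_ft: float, spacing_ft: float = 12.0) -> int:
    rows = math.ceil(length_ft / spacing_ft)
    cols = math.ceil(width_ft / spacing_ft)
    return rows * cols

print(nozzle_count(30, 25))  # 3 x 3 grid -> 9 nozzles
```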
Figure 14.10. Class I water mist system. (Image courtesy of Marioff Corporation.)
Table 14.4. Advantages and disadvantages of double-interlock, preaction water mist systems

Advantages:
• Safe for human exposure
• Low water usage
• Electrically nonconductive agent
• No collateral damage
• No thermal shock
• Low restoration costs

Disadvantages:
• Expensive to install
• Requires additional maintenance skill set
A double-interlock water mist system will have a water supply, a compressed gas supply and/or compressor, a solenoid releasing valve, supervisory devices, zone valves, piping, and nozzles. The systems operate much like a standard double-interlock sprinkler system, requiring both a releasing signal and one activated nozzle before delivering water mist to the protected space. The system supervises the piping at 100 psig and the valve/compressed gas manifold at 360 psig. When released, the system will generate approximately 1400 psi in the discharge piping. The system is typically designed to operate for 30 minutes.

The nozzles of the double-interlock water mist system deliver the atomized water in a high-speed, cone-shaped discharge. Liquid water will pour out on any surface that is directly in the path of the discharge, wetting it. However, outside the cone of the discharge, the water mist forms a nonconductive cloud that will act like a gas and penetrate cabinets and enclosures, enabling the mist to extinguish a fire. The mist will also "scrub" the smoke from the air and drop it to the floor, minimizing any potential damage to electronic equipment from the acids present in the smoke.

Class I water mist systems are also available and listed for use as smoke scrubbing systems (only in subfloors in the United States) and can be installed to supplement the fire sprinkler and/or gaseous agent systems. The smoke scrubbing systems are similar to the fire suppression systems except in the delivery of the mist. The mist is distributed through a 4-in-diameter plastic pipe and vacuumed through the same pipe on the opposite side of the space. The mist, carrying smoke and its constituents, is collected in a small reservoir. These systems are modular, based on a 30-ft by 15-ft area, and utilize about 4 gallons of water. Approval testing has demonstrated dramatic reductions of HCl vapors in fires involving jacketed cables and circuit boards.
14.8.2
Carbon Dioxide Systems

Currently, the Code for Carbon Dioxide Fire Suppression Systems (NFPA 12) prohibits the use of total-flood systems in normally occupied spaces. Under the definitions of the NFPA code, mission critical facilities such as data centers, telecommunications sites, and so on are considered to be normally occupied. This prohibition was made in response to demands by the U.S. Environmental Protection Agency to make changes to protect human life. There are one or two deaths in the United States each year as a result of exposure to carbon dioxide in total-flooding fire suppression applications. At this time, it is not known whether the EPA will accept these changes or continue to push for more. The prohibition does not affect applications in subfloors that are not occupiable, those that are 12 in or less in depth.

Carbon dioxide (CO2) extinguishes fire by pushing air away from a fire. Systems are designed to either envelop a piece of equipment (local application) or fill a volume (total flood). Total-flood applications in dry electrical hazards require a volume of 50% CO2 in the protected space, pushing air out of the space and lowering the oxygen content from the normal 21% to 11% within one minute. Exposed to this environment, humans will pass out and cease breathing, hence the life-safety problem. Human brain function is impaired at oxygen levels of 16%, and serious permanent brain damage will occur at 12%.

CO2 systems in mission critical applications are released like any other gaseous system, from a fire alarm and control system, automatically after the confirmed detection of a fire or upon activation of a pull station. They distribute the CO2 from high-pressure gas cylinders through valves into a manifold and out through piping to open nozzles at the ceiling of the protected space or in the subfloor. Pipe sizes for CO2 systems in mission critical applications are typically 2 in or less. New and existing systems must be outfitted with a pneumatic time delay of at least 20 seconds (in addition to any prerelease time delays), a supervised lock-out valve to prohibit accidental release and avoid exposure to humans, a pressure switch to be supervised by the alarm system, and a pressure-operated siren (in addition to any electrically operated alarm devices in the space).
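The oxygen figures above follow from simple dilution: a 50% CO2 concentration scales every component of the remaining air by one-half. A minimal check (author's sketch; the function name is illustrative):

```python
# Oxygen concentration after a total-flood CO2 discharge, by simple
# dilution: the agent displaces room air uniformly, so every component
# of the remaining air is scaled by (1 - CO2 fraction).

NORMAL_O2_PCT = 21.0

def o2_after_flood(co2_pct: float) -> float:
    return NORMAL_O2_PCT * (1.0 - co2_pct / 100.0)

print(o2_after_flood(50.0))  # 10.5 -> the ~11% figure cited in the text
```

At 10.5% oxygen the space is well below the 16% impairment and 12% brain-damage thresholds noted above, which is the basis of the occupancy prohibition.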
Subfloors protected with CO2 will require (a) 100 lbs of CO2 per 1000 ft3 of volume for protected spaces of less than 2000 ft3, and (b) 83 lbs of CO2 per 1000 ft3 of volume for protected spaces of more than 2000 ft3 (200 lb minimum). So, for a subfloor that is 50 ft x 50 ft and 12 in deep, 208 lbs (50 x 50 x 1 ft3 x 0.083 lb/ft3) of CO2 is required. CO2 cylinders are available in a variety of sizes up to 100 lbs. This application requires a minimum of three cylinders, so three 75 lb cylinders would be used for a total of 225 lbs. The cylinders are about 10 in in diameter and 5 ft tall. One-hundred-pound cylinders are about 12 in in diameter and 5 ft tall.

If the subfloor is used as a plenum for air conditioning, upon activation of a CO2 system the air-conditioning units must be shut down in order to keep the gas in the subfloor. Also, any ductwork or cable chases below the floor leading to other areas will require consideration to avoid leakage of the gas. Ductwork should be fitted with motorized dampers to seal off and contain the gas. Cable chases can be filled with fire-stopping compounds or "pillows."

If a CO2 system is in place, a series of warning signs must be installed throughout the space, at each entrance, and in adjoining areas. The signs must comply with ANSI standards, which call for specific, standardized wording and symbols. In addition, self-contained breathing apparatus should be located at entrances to spaces protected by CO2 so that emergency response personnel can enter and remove any incapacitated personnel or perform any necessary work required during a CO2 discharge.

The advantage of a CO2 system is that the agent is inexpensive and plentiful. Food-grade CO2 is used and can be obtained from many fire protection vendors as well as commercial beverage vendors. The primary disadvantage of CO2 systems is the life-safety issue.
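The two-tier subfloor quantity rule above can be sketched as a small function. This is only a restatement of the rule as quoted in the text; actual designs must follow NFPA 12, and the function name is the author's:

```python
import math

# CO2 quantity for subfloor protection, per the rule quoted in the text:
#   < 2000 ft3:  100 lb per 1000 ft3
#   >= 2000 ft3:  83 lb per 1000 ft3, with a 200 lb minimum

def co2_subfloor_lbs(volume_ft3: float) -> int:
    if volume_ft3 < 2000.0:
        lbs = volume_ft3 / 1000.0 * 100.0
    else:
        lbs = max(200.0, volume_ft3 / 1000.0 * 83.0)
    return math.ceil(lbs)

# 50 ft x 50 ft subfloor, 12 in deep -> 2500 ft3:
print(co2_subfloor_lbs(50 * 50 * 1))  # 208 lb, matching the worked example
```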
14.8.3
Clean Agent Systems
Clean agents are used in total-flood applications in mission critical facilities. They are stand-alone suppression systems and require no external power supply to operate. Like other systems, they are released by an electrical signal from a fire alarm and control system. The two basic types of clean agents extinguish fires in very different ways. Inert gas systems, like CO2 systems, extinguish fires by displacing air and reducing the available oxygen content to a level that will not support combustion. Chemical clean agents extinguish fires primarily by cooling, and some also contribute a chemical reaction.

Clean agents are safe for limited exposure; however, under the current clean agent code, all agents have been assigned a maximum human exposure limit of 5 minutes. Prior to the 2004 edition of the NFPA 2001 standard, only inert gas clean agents were subject to a limit for these types of applications. The limit was established because exposure testing and computer modeling indicated the need for limiting exposure to a 5-minute duration for all clean agents.

Beginning with the 2008 edition of the NFPA 2001 standard, there is now a minimum required hold time for clean agents. The requirement is to maintain at least 85% of the design concentration of gas up to the maximum protected height for at least 10 minutes. The NFPA 2001 standard requires the installer to verify this retention time using an approved room-integrity test procedure.
14.8.4
Inert Gas Agents
The inert gas agents include IG-541 and IG-55. They are introduced to a typical mission critical application within 1 minute at concentrations of 35% and 38% by volume, respectively. IG-541 is a blend of 52% nitrogen, 40% argon, and 8% CO2 and is commonly called Inergen®. It is blended to create a low-oxygen environment with a slightly elevated CO2 concentration. This blend not only will extinguish a fire but also will cause human metabolism to utilize oxygen more efficiently. For this reason, humans can remain in IG-541 discharges for extended periods without suffering the effects of oxygen deprivation (see Table 14.5).

Pressure relief venting must be provided for all spaces protected with inert gas agents. The systems are designed to displace approximately 40% of the volume inside the room within one minute. This is like adding 40% more air to a balloon that is already full and cannot expand. Because modern mission critical facilities are so tightly sealed, the space must be vented during discharge or severe structural damage to walls and doors may occur. The venting must open during discharge and close immediately thereafter to provide a seal. Venting can be accomplished using motorized dampers or, in some cases, nonmotorized back-draft dampers, but in all cases should utilize low-leakage-type fire/smoke-rated dampers.
14.8.5
IG-541
IG-541 systems (Figure 14.11) are comprised of high-pressure gas cylinders, a releasing valve, and piping to open nozzles. IG-541 supplies are calculated in cubic feet, not pounds. For a mission critical space at 70°F, the total-flooding factor is roughly 0.44 ft3 of IG-541 per cubic foot of protected volume. So for a computer room that is 30 ft long by 25 ft wide with a 10-ft ceiling height and a 24-in raised floor (9000 ft3 total), the amount of IG-541 required would be approximately 3960 ft3. Cylinders are available in 200, 250, 350, 425, and 435 ft3 capacities; therefore, this application would require ten 425 ft3 cylinders. One nozzle on the ceiling and one under the raised floor should be sufficient.
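The IG-541 numbers above reduce to one multiplication plus a cylinder count. A sketch follows, using the ~0.44 ft3/ft3 flooding factor quoted in the text (valid only for the 70°F case stated there; function names are the author's):

```python
import math

# IG-541 supply estimate: protected volume times the ~0.44 ft3/ft3
# total-flooding factor quoted in the text for a 70 degF space.

FLOODING_FACTOR = 0.44  # ft3 of IG-541 per ft3 of protected volume

def ig541_supply_ft3(length_ft, width_ft, ceiling_ft, raised_floor_ft=0.0):
    volume = length_ft * width_ft * (ceiling_ft + raised_floor_ft)
    return volume * FLOODING_FACTOR

# 30 ft x 25 ft room, 10-ft ceiling plus 24-in raised floor -> 9000 ft3:
supply = ig541_supply_ft3(30, 25, 10, raised_floor_ft=2.0)
print(round(supply))                # 3960 ft3 of agent
print(math.ceil(supply / 425))      # 10 cylinders at 425 ft3 each
```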
14.8.6
IG-55
IG-55 is 50% nitrogen and 50% argon. It is a little different from IG-541 in that it has no provisions to help humans survive in the low-oxygen atmosphere; although humans are safe during exposure to the agent, they must leave an area where a discharge has occurred or risk suffering from oxygen deprivation (see Table 14.6). IG-55 systems are comprised of high-pressure gas cylinders, a releasing valve, and piping to open nozzles. IG-55 supplies are calculated in kilograms. For a mission critical space at 70°F, the total-flooding factor is roughly 0.02 kg of IG-55 per cubic foot of protected volume. So for a computer room that is 30 ft long by 25 ft wide with a 10-ft ceiling height and a 24-in raised floor, the amount of IG-55 required would be approximately 180 kg. Cylinders are available in effective capacities of 4.3, 18.2, and 21.8 kg; therefore, this application would require nine 21.8 kg cylinders or ten 18.2 kg cylinders. One nozzle on the ceiling and one under the raised floor should be sufficient.
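Choosing among the available cylinder sizes is a small optimization: fewest cylinders first, then the least surplus agent. A sketch using the IG-55 capacities quoted above (the selection rule and function name are the author's assumptions, not from the text):

```python
import math

# Pick an IG-55 cylinder size: minimize cylinder count, break ties on
# the least surplus agent. Capacities (kg) are those quoted in the text.

CYLINDER_SIZES_KG = [4.3, 18.2, 21.8]

def pick_cylinders(required_kg: float):
    options = []
    for size in CYLINDER_SIZES_KG:
        count = math.ceil(required_kg / size)
        surplus = count * size - required_kg
        options.append((count, surplus, size))
    count, _, size = min(options)   # fewest cylinders, then least surplus
    return count, size

print(pick_cylinders(180.0))  # (9, 21.8): nine 21.8 kg cylinders
```

Ten 18.2 kg cylinders also work, as the text notes; this rule simply prefers the nine-cylinder option.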
Figure 14.11. Typical IG-541 system. (Image courtesy of Tyco Fire & Security, Inc.)
Table 14.5. Advantages and disadvantages of IG-541

Advantages:
• Recharge costs are less than chemical clean agents.
• Retention time is better than that of the chemical clean agents.
• Extended exposure at the specified concentration is not hazardous.

Disadvantages:
• There is only one manufacturer of the hardware.
• Few certified recharge stations exist; many major cities do not have one.
14.8.7
Chemical Clean Agents
The chemical clean agents available for use today include HFC-227ea, HFC-125, FK-5-1-12, and HFC-23. All of these agents extinguish fire primarily by cooling. Most are fluorinated compounds that are very similar to the refrigerants used in air-conditioning and refrigeration. For HFC-227ea and HFC-125, only about two-thirds of the extinguishing process is cooling; about one-third is a chemical reaction wherein the gas molecule breaks up in the presence of high temperatures (above 1200°F) and the fluorine radical combines with free hydrogen, forming HF vapor and shutting down the combustion process. This production of HF is minor and contributes very little to the toxic nature of smoke from fires.

HFC-227ea. Since its introduction, HFC-227ea has been the leading alternative to Halon 1301, making up 90-95% of clean agent systems installed over the past decade (see Table 14.7). The agent has proved to be extremely safe and is actually used (under a different name) as a propellant in metered-dose inhalers for asthmatics. The agent is produced by only one manufacturer, but more than a half-dozen manufacturers produce systems hardware. Systems are typically designed to provide between a 6.25% and 7% by volume concentration in protected spaces within 10 seconds of activation. A typical HFC-227ea system (Figure 14.12) is made up of an agent-storage container (or tank), releasing device, piping, and nozzles. HFC-227ea supplies are calculated in pounds. For a mission critical space at 70°F, the total-flooding factor is roughly 0.0341 pounds of HFC-227ea per cubic foot of protected volume. So for a computer room that is 30 ft long by 25 ft wide with a 10-ft ceiling height and a 24-in raised floor, the amount of HFC-227ea required would be approximately 307 pounds. Agent containers are available in capacities from 10 to 1200 pounds, depending on the manufacturer, and range in physical size from 6 in in diameter and 2 ft tall to 2 ft in diameter and 6 ft tall. This application would require one 350 lb container. One nozzle on the ceiling and one under the raised floor should be sufficient.

Table 14.6. Advantages and disadvantages of IG-55

Advantages:
• Recharge costs are less than chemical clean agents.
• Any compressed gas vendor can refill the cylinders.
• It has better retention than chemical clean agents.

Disadvantages:
• Extended exposure at the specified concentrations is hazardous to humans.

Table 14.7. Advantages and disadvantages of HFC-227ea

Advantages:
• Extremely safe for human exposure.
• Less expensive than inert gas agents to install.
• Relatively small amount of storage compared to inert gas agents.
• Pressure relief venting is not typically required for most protected rooms.

Disadvantages:
• More costly recharge than inert gas agents.

HFC-125. HFC-125 became an approved clean agent in 2002 and is considered to be the closest of all agents to being a drop-in replacement for existing Halon 1301 systems. HFC-125 matches Halon 1301 in terms of physical properties such as flow characteristics and vapor pressure (see Table 14.8). The pressure traces, vaporization, and spray patterns for this agent nearly mirror those of Halon 1301. When converting a properly designed Halon 1301 system to utilize the HFC-125 agent, often only minor piping changes are required in addition to the replacement of agent-storage containers and nozzles. In some cases, the cost of conversion can be much lower when using this clean agent. Systems are typically designed to provide an 8% by volume concentration in protected spaces within 10 seconds of activation.
Figure 14.12. Typical HFC-227ea system. (Image courtesy of Fike Corporation.)
Table 14.8. Advantages and disadvantages of HFC-125

Advantages:
• Least expensive to install of all the clean agents.
• Relatively small amount of storage compared to inert gas agents.
• Pressure relief venting is not typically required for most protected rooms.

Disadvantages:
• More costly recharge than inert gas agents.
• Only one manufacturer of agent and hardware.
A typical HFC-125 system is made up of an agent container (or tank), releasing device, piping, and nozzles. HFC-125 supplies are calculated in pounds. For a mission critical space at 70°F, the total-flooding factor is 0.0274 pounds of HFC-125 per cubic foot of protected volume. So for a computer room that is 30 ft long by 25 ft wide with a 10-ft ceiling height and a 24-in raised floor, the amount of HFC-125 required would be approximately 247 pounds. Agent containers are available in capacities from 10 to 1000 pounds and are of the same construction as the containers for HFC-227ea. This application would require one 350 lb container. One nozzle on the ceiling and one under the raised floor should be sufficient.

FK-5-1-12. FK-5-1-12 was introduced in 2003 for use as a total-flooding clean agent in normally occupied spaces. FK-5-1-12 was developed as an environmentally responsible extinguishing agent (see Table 14.9) and is the first chemical clean agent to address not only the concerns of stratospheric ozone depletion but also those related to climate change (i.e., global warming). All of the other chemical clean agents have a global warming potential (GWP) of 3000 or more, meaning that 1 kg of those compounds has the same climate effect as three tons or more of CO2. Due to the exceptionally short atmospheric lifetime of FK-5-1-12, its GWP is extremely low: only 1, equivalent to that of CO2.

FK-5-1-12 is actually a fluid at atmospheric pressure. It has an extremely low heat of vaporization, an extremely high vapor pressure, and a boiling point of 120°F. Superpressurized with nitrogen, it is pushed through the nozzles and vaporizes quickly. Systems are designed to provide a 4.2% by volume concentration in protected spaces within 10 seconds of activation. A typical FK-5-1-12 system is made up of an agent container (or tank), releasing device, piping, and nozzles. FK-5-1-12 supplies are calculated in pounds. For a mission critical space at 70°F, the total-flooding factor is 0.0379 pounds of FK-5-1-12 per cubic foot of protected volume. So for a computer room that is 30 ft long by 25 ft wide with a 10-ft ceiling height and a 24-in raised floor, the amount of FK-5-1-12 required would be approximately 341 pounds. Agent containers are available in capacities from 10 to 800 pounds and are of the same construction as the containers for HFC-227ea. This application would require one 350 lb container. One nozzle on the ceiling and one under the raised floor should be sufficient.

Table 14.9. Advantages and disadvantages of FK-5-1-12

Advantages:
• Very high safety margin (140%).
• Relatively small amount of storage compared to inert gas agents.
• Pressure relief venting is not typically required for most protected rooms.
• Lowest environmental impact of all the chemical clean agents.

Disadvantages:
• Most costly clean agent system on the market.
• More expensive recharge than inert gas agents.

HFC-23. HFC-23 has been available on the market for several years but has not been used much in mission critical applications; it is mostly used in low-temperature industrial applications (see Table 14.10). Systems are designed to provide an 18% by volume concentration in protected spaces within 10 seconds of activation. Because of the high volumetric concentration and quick discharge time, rooms protected with HFC-23 must be vented in a fashion similar to that required for the inert gas agents. A typical HFC-23 system is made up of high-pressure cylinders, releasing device, piping, and nozzles. HFC-23 supplies are calculated in pounds. For a mission critical space at 70°F, the total-flooding factor is 0.04 pounds of HFC-23 per cubic foot of protected volume. So for a computer room that is 30 ft long by 25 ft wide with a 10-ft ceiling height and a 24-in raised floor, the amount of HFC-23 required would be approximately 360 pounds. Agent cylinders are available in 74 and 115 lb capacities. This application would require four 115 lb cylinders. One nozzle on the ceiling and one under the raised floor should be sufficient.

Table 14.10. Advantages and disadvantages of HFC-23

Advantages:
• Very high safety margin (67%).
• Relatively small amount of storage compared to inert gas agents.

Disadvantages:
• Pressure relief venting of the protected room is required.

Halon 1301. Although not manufactured in North America since 1993, Halon 1301 is still available in abundant quantities. Banking and recycling practices have created a situation in which Halon 1301 remains an affordable option even today. The biggest problem facing owners of Halon 1301 systems is the availability of spare parts and technical support. Most manufacturers of Halon 1301 equipment quit manufacturing the system equipment many years ago, and replacement part inventories are dwindling fast. So although the Halon 1301 agent has not gone away, the equipment has. If parts can be found for repair or modification of a system, there is little likelihood that the hydraulic design software program can be obtained from the system manufacturer. Also, most technical support personnel today were not even out of technical school in 1993.

The prevailing wisdom with regard to Halon 1301 systems is to take steps to conserve any Halon 1301 that is still in service and to prevent accidental release. This translates to upgrading control systems and devices and improving fire detection systems that might malfunction or false-alarm, causing an accidental discharge. Beyond that, it has become necessary to make plans for the eventual replacement and decommissioning of any remaining Halon 1301 system. If the replacement is not budgeted, and parts cannot be found to restore the system, it may be difficult to obtain the needed funds to install a well-designed replacement.

To put a few myths to rest:

• Halon 1301 is not banned in the United States. There is no federal or state legislation at this time that requires removal of Halon systems.
• You can only buy Halon 1301 in the United States if it was manufactured before January 1, 1994. Cheap Halon 1301 from overseas is illegal for purchase in the United States.
• Halon 1301 does not react with oxygen to put out a fire. It cools the fire and chemically attacks the combustion reaction.
• Halon 1301 is safe for human exposure. There has never been a case of a human death due to exposure to a properly designed Halon 1301 system.
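For the reference computer room used throughout this section (30 ft x 25 ft, 10-ft ceiling, 24-in raised floor, 9000 ft3), the per-agent flooding factors quoted above can be collected in one place. This sketch is just the text's arithmetic restated; the dictionary name is the author's:

```python
# Agent quantity for the reference room (9000 ft3) using the per-agent
# total-flooding factors quoted in this section (lb per ft3, at 70 degF).

FLOODING_FACTORS_LB_PER_FT3 = {
    "HFC-227ea": 0.0341,
    "HFC-125":   0.0274,
    "FK-5-1-12": 0.0379,
    "HFC-23":    0.04,
}

ROOM_VOLUME_FT3 = 30 * 25 * (10 + 2)   # 9000 ft3 including the raised floor

for agent, factor in FLOODING_FACTORS_LB_PER_FT3.items():
    pounds = round(ROOM_VOLUME_FT3 * factor)
    print(agent, pounds, "lb")
# HFC-227ea 307 lb
# HFC-125 247 lb
# FK-5-1-12 341 lb
# HFC-23 360 lb
```

The results match the worked examples in each agent's paragraph, which is why the same 350 lb container suffices for the first three chemical agents while HFC-23 needs four cylinders.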
14.8.8
Portable Fire Extinguishers
Portable fire extinguishers for Class A fires (solid combustibles such as paper and plastics) are required in all commercial facilities. However, the extinguishers rated for Class A fires are typically of the dry-chemical (powder) type and, though perfect for offices and warehouses, are not appropriate for use in electronic equipment areas. The NFPA 75 Standard for the Protection of Information Technology Equipment requires CO2 or halogenated hand portable extinguishers with at least a 2-A rating and specifically prohibits dry-chemical extinguishers in information technology equipment areas. The NFPA 76 Standard for Fire Protection in Telecommunications Facilities also prohibits dry-chemical extinguishers in signal processing, frame, and power areas.
14.8.9
Clean Agents and the Environment
On April 17, 2009, the United States Environmental Protection Agency (EPA) ruled that greenhouse gases (GHGs) are harmful to human health, thus opening the door to creating regulations to control GHGs. Legislation has been introduced in the U.S. Congress specifically designed to regulate this environmental problem. If passed, this sort of global warming legislation would place limits on the emissions of heat-trapping gases, including HFCs. Some of the chemical clean agents are in fact HFCs, including
HFC-227ea, HFC-125, and HFC-23, and these could be impacted by future environmental regulations. The legislation would create a special program to reduce emissions of HFCs by adding HFCs to the list of similar substances that the EPA currently regulates under the Clean Air Act. Under this program, the EPA would be directed to phase down the total production and consumption of HFCs. If adopted, limitations and reductions would begin in 2012 and decline by 3% per year from the 2000 baseline.

Many states are also addressing the problem of greenhouse gas emissions by adopting legislation that regulates HFCs. For example, the state of California has passed a bill (AB32) that delegates to its Air Resources Board (CARB) the task of reducing state greenhouse gas emissions to 1990 levels by 2020. It gives open-ended authority to CARB to "achieve the maximum technologically feasible and cost-effective greenhouse gas emission reductions from sources or categories of sources." California has put the HFC clean agents into a category of high-global-warming gases, which may be impacted by future regulations.

The major impacts of the proposed HFC regulations are unpredictable today. Some observers believe that the eventual result will be significant price increases for the HFC clean agents and possible shortages. Others believe that there may not be shortages at all and that the price increases will be nominal. The differences in opinion are due to the unknowns that will result from the possible implementation of the legislation under consideration. One thing is certain: the chemical clean agent FK-5-1-12 and all inert gas agents will not be impacted, because of their extremely small environmental footprint.
14.9
CONCLUSION
A fire can be a major catastrophe for business operations. The damage caused by heat, smoke, and water is detrimental to IT equipment, and repairing that damage can be very expensive. In addition to the financial consequences, a company's reputation may never recover from such an event. It is essential that facility managers understand the fire protection technologies applicable to mission critical environments so they can implement appropriate solutions that meet business reliability goals. In addition, understanding the fire codes that must be adhered to will help in the selection of these technologies.
APPENDIX A

POLICIES AND REGULATIONS
A.1
INTRODUCTION
Globalization, telecommunications, and advances in technology have dramatically improved today's business operations in terms of speed and sophistication. However, these advancements have also added complexity and vulnerability, especially relating to IT infrastructure and the other interdependent physical infrastructures that are crucial to the mission critical industry. This appendix discusses business continuity and how current regulations and policies impact the way organizations must operate today with regard to data integrity, availability, and disaster recovery. Firms and organizations must be compliant with current policies and regulations in order to operate efficiently and ensure that critical infrastructure and facilities are constantly guarded against potential threats and punitive fines from government.

The USA PATRIOT Act, the Basel II Accord, and the Sarbanes-Oxley Act are laws that have significantly impacted the mission critical industry, especially the financial industry and government agencies, particularly the Securities and Exchange Commission. In some cases, the rules build on existing policies of business operations, and in others they have tightened the practices. The main objectives of these policies are to:

1. Assure public safety, confidence, and services
2. Establish responsibility and accountability
3. Encourage and facilitate partnering among all levels of government and between government and industry
4. Encourage market solutions whenever possible
5. Facilitate meaningful information sharing
6. Foster international security cooperation
7. Develop technologies and expertise to combat terrorist threats
8. Safeguard privacy and constitutional freedoms
9. Raise the confidence level of investors and shareholders in public reporting
10. Create an array of harsh new punishments pertaining to fraud and other such violations

Guidelines for attaining the above objectives that will be discussed here include:

• The U.S. National Strategy for the Physical Protection of Critical Infrastructures and Key Assets, detailing the roles of government departments and agencies in the protection of the nation's assets
• Sound Practices to Strengthen the Resilience of the U.S. Financial System, planning for business continuity at critical financial institutions
• NFPA 1600, Standard on Disaster/Emergency Management and Business Continuity Programs

Unfortunately, most of these documents can only provide general requirements for compliance, without specific details on implementation and technical controls. The Federal Financial Institutions Examination Council (FFIEC) guidelines do provide a road map for constructing recovery plans, but following them does not guarantee compliance. Regardless, the cost of noncompliance can be severe, with personal liabilities attached. Essentially, each organization must decide which specific strategies to use based on its individual risk assessment and management objectives. It is the role of each organization to develop a plan that incorporates these strategies to respond effectively both to the policies and regulations instituted and to emergency situations. Business continuity is not, for the most part, crisis-specific; it is organization-specific.
Every business needs to develop business continuity and disaster recovery plans for its unique IT environment as it functions within the specific business context of the organization.—Michael Cray*

*http://www.forsythe.com/na/aboutus/news/articles/2009%20Articles/bcplanninginfaceofpandemic

Business continuity and disaster recovery are important components of a company's risk-management strategy. In today's regulatory environment, inadequate planning or shortfalls can be costly in terms of business operations, and noncompliance can mean stiff penalties. On the positive side, more stringent standards, along with new technologies that address security and continuity issues, can significantly improve business processes and productivity. Business decisions often depend on the accuracy of financial reports generated in real time, so plans must be in place to ensure that information is stored and transmitted accurately and remains secure should an unexpected event occur. Preparedness is therefore key to maintaining any business operation, along with having educated workers who understand how to implement standard operating procedures, emergency action procedures, and alarm response procedures. These are essential components of responding to and remedying unexpected events. Events do not have to be major disasters to have a significant and damaging impact on an organization; even short power outages can pose a large risk. Business continuity must be woven into business processes throughout the company. This is especially necessary within the mission critical industry.
A.2 INDUSTRY POLICIES AND REGULATIONS
Industry policies and regulations have been created to protect business and government entities by identifying areas that should be protected against improper usage. Most compliance issues deal with protecting data from illegal storage, access, processing, or termination. For example, all financial data must be protected from loss because the U.S. Office of the Comptroller of the Currency (OCC) will prosecute under the Foreign Corrupt Practices Act if financial records are missing. The OCC has four basic requirements that corporations must adhere to:

1. OCC-177 requires that a recovery plan be in place.
2. OCC-187 requires that financial records be defined and safeguarded.
3. OCC-229 requires that access controls be in place to protect against unauthorized access to financial data.
4. OCC-226 requires that end-user computing locations adhere to the same standards governing local information technology locations.

These requirements tie directly to the mission critical industry, and specifically to data centers, because data centers are a hub for transferring financial information and for financial transactions. Data centers are relied upon to provide optimal levels of availability in order to safeguard information, particularly, in this case, financial data. Industry policies and regulations discussed in this appendix include:

Section A.2.1—The USA PATRIOT Act
Section A.2.2—Sarbanes-Oxley Act (SOX)
Section A.2.3—Comprehensive Environmental Response, Compensation, and Liability Act of 1980
Section A.2.4—Executive Order 13423: Strengthening Federal Environmental, Energy, and Transportation Management
Section A.2.5—ISO 27000—Information Security Management System (ISMS)
Section A.2.6—The National Strategy for the Physical Protection of Critical Infrastructures and Key Assets
Section A.2.7—2009 National Infrastructure Protection Plan
Section A.2.8—North American Electric Reliability Corporation (NERC), Critical Infrastructure Protection Program
Section A.2.9—U.S. Securities and Exchange Commission (SEC)
Section A.2.10—Sound Practices to Strengthen the Resilience of the U.S. Financial System
Section A.2.11—C4I—Command, Control, Communications, Computers, and Intelligence
Section A.2.12—Basel II Accord
Section A.2.13—NIST—National Institute of Standards and Technology
Section A.2.14—Business Continuity Management Agencies and Regulating Organizations
Section A.2.15—FFIEC
Section A.2.16—National Fire Protection Association 1600 Standard on Disaster/Emergency Management and Business Continuity Programs
Section A.2.17—Private Sector Preparedness Act
A.2.1 USA PATRIOT Act
Shortly after the September 11, 2001 terrorist attacks, Congress passed the USA PATRIOT Act (Uniting and Strengthening America by Providing Appropriate Tools Required to Intercept and Obstruct Terrorism Act of 2001) in order to strengthen national security. The Act contains many provisions to support activities related to counterterrorism, threat assessment, and risk mitigation for critical infrastructure protection and continuity, to enhance law enforcement investigative tools, and to promote partnerships between government and industry. Critical infrastructure can be defined as systems and assets, whether physical or virtual, so vital to the United States that the destruction of such systems would have a debilitating impact not only on security, but on national economic security and national public health and safety. In today's information age, a national effort is necessary to protect cyberspace as well as our physical infrastructure, which is relied upon for national defense, continuity of government, economic prosperity, and the quality of life in the United States for citizens, private business, and government. National security depends on an interdependent network of critical physical and information infrastructures, including the telecommunications, energy, financial services, water, and transportation sectors. The policy requires that critical infrastructure disruptions be rare, brief, geographically limited in effect, manageable, and minimally detrimental to the economy, human and government services, and national security. Through public-private cooperation, comprehensive and effective programs ensure continuity of businesses and essential U.S. government functions. Although specific
methods are not provided in the Act, each institution must assess and determine its own particular risks in developing a program. Programs should include security policies, procedures, and practices; reporting requirements; lines of responsibility; training; and emergency response. An essential element of compliance under the USA PATRIOT Act is identification of customers and their activities. For both financial and nonfinancial institutions, compliance means that you need to know your customers and acquire knowledge of their business operations. The federal government requires customer identification programs (CIPs), which can be cross-referenced with the Treasury Department's List of Specially Designated Nationals and Blocked Persons (SDN) to ensure that financial transactions are not conducted with terrorists. These programs usually consist of identity checks, securing and monitoring personal data, and reporting of suspicious activities. Another provision of the USA PATRIOT Act is Title III, the International Money Laundering Abatement and Financial Anti-Terrorism Act of 2001, which requires that a financial institution take reasonable steps to safeguard against money laundering activities. The purpose of these requirements is to thwart any possible financial transactions or money laundering activities by terrorists. Steps required for compliance include: (1) designating a compliance officer; (2) implementing an ongoing training program; (3) adopting an independent audit function; and (4) developing internal policies, procedures, and controls. Two federal entities, both housed under the Treasury Department, support these programs:

1. The Financial Crimes Enforcement Network (FinCEN), which supports domestic and international law enforcement efforts to combat money laundering and other financial crimes. It also monitors CIP compliance.

2. The Office of Foreign Assets Control (OFAC), which maintains the List of Specially Designated Nationals and Blocked Persons (SDN) and monitors compliance with the prohibition on U.S. entities entering into business transactions with SDN entities (Executive Order 13224). OFAC has established industry compliance guidelines specifically for insurance, import/export businesses, tourism, securities, and banks. Each institution must determine and evaluate its own particular risks in developing a program, which is likely to include an analysis of its customer base, industry, and location.
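The cross-referencing step described above, checking a customer identifier against the SDN list before a transaction proceeds, can be sketched in a few lines. This is only an illustration of the concept: the list entries, function names, and exact-match rule are hypothetical placeholders, not real OFAC data or an actual compliance algorithm (real screening uses fuzzy matching, aliases, and other identifying fields).

```python
# Illustrative sketch of the kind of name screening a customer
# identification program (CIP) might perform against the SDN list.
# Entries and matching logic are hypothetical, for demonstration only.

def normalize(name: str) -> str:
    """Uppercase and collapse whitespace so trivial formatting differences don't defeat a match."""
    return " ".join(name.upper().split())

def screen_customer(name: str, sdn_list: list[str]) -> bool:
    """Return True if the name matches an SDN entry, meaning the transaction should be held for review."""
    target = normalize(name)
    return any(normalize(entry) == target for entry in sdn_list)

if __name__ == "__main__":
    sample_sdn = ["ACME FRONT TRADING CO", "JOHN Q. LAUNDERER"]  # fictitious entries
    print(screen_customer("john q.  launderer", sample_sdn))  # matched -> hold for review
    print(screen_customer("Jane Honest", sample_sdn))         # no match -> may proceed
```

In practice, a hit would trigger the suspicious-activity documentation and reporting obligations described above rather than a silent rejection.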
A.2.2 Sarbanes-Oxley Act (SOX)

Passed by Congress in 2002, the Sarbanes-Oxley Act (see Table A.1) was enacted in response to major incidents of corporate fraud, abuse of power, and mismanagement, such as the Enron scandal and the WorldCom (MCI) debacle. The Act, which defines new industry best practices, mainly targets publicly traded companies, although some parts are legally binding for nonpublicly listed and not-for-profit companies. The SEC, NASDAQ, and the New York Stock Exchange have prepared the regulations and rulings. Its mandates include the following:
Table A.1. Key points of the Sarbanes-Oxley Act (SOX)

• Requires companies to perform quarterly self-assessments of risks to business processes that affect financial reporting and to attest to the findings on an annual basis (CFO, CEO, and possibly CIO).
• Section 302 requires a "Signing Officer" to design reports for compliance submission.
• Section 404 requires that technology personnel develop and implement means for protecting critical financial data (data security, backup and recovery, business continuity planning, and disaster recovery), because loss of data is not acceptable.
• Section 409 requires "real-time reporting" of financial data, creating the need for new standards and procedures and perhaps reengineering of functions to better comply with the law.
• Companies must devise "checks and balances" to guarantee that persons creating functions (such as programmers) are not the same persons responsible for validating the functions' operation (a separate checker must validate each function).
• Checks and balances prohibit the Big 4 accounting firms from performing risk assessments because they are the ones performing audits (a conflict of interest).
• Penalties can include fines as high as $5 million, and imprisonment can be for as long as 20 years for deliberate violation.
• Increased personal liability for senior-level managers, boards, and audit committees
• New reporting and accounting requirements
• A code of ethics and culture, and a new definition of conflict of interest
• Restrictions on services that auditors can provide
• A Public Company Accounting Oversight Board to oversee the activities of the auditing profession

SOX requires major changes in security laws and reviews of policies and procedures, with attention to data standards and IT processes. A key part of the review process is how vital corporate data is stored, managed, and protected. From a power protection and mission critical standpoint, the focus is on availability, integrity, and accountability. The Sarbanes-Oxley Act was designed to ensure that corporations have current financial information available for review by auditors and that financial irregularities are addressed before they cause catastrophic problems. Its initial objective (Section 302) is to define reporting requirements; the next (Section 404) is to gather all financial information into a common repository for reporting and to eliminate security exposures. The final goal of the Sarbanes-Oxley Act (Section 409) is to have an automated financial reporting system available for instant auditing, if necessary, and to improve financial controls. Section 404 of SOX contains an extensive description of the role of IT infrastructure and applications in today's financial reporting and accounting processes. Although SOX itself does not specify how the technical controls should be implemented, technical solutions are available to help public companies comply with its objectives. Such a system incorporates confidentiality, integrity, and authentication as security services within the network, preventing illegitimate modification of financial data and ensuring that authorization decisions are based on true user identities. The financial controls aimed at preventing corporate fraud have produced mixed feelings, with some CIOs indicating that compliance has proved costly. With an optimized and flexible approach, and cooperation between the SEC and corporations, those costs are likely to decrease.
A.2.3 Comprehensive Environmental Response, Compensation, and Liability Act of 1980

This Act, also known as the EPA Superfund Act, was designed to help identify and clean up toxic sites. Essentially, the EPA identifies dump sites and the parties accountable for contamination of the area. The responsible party is required to clean up the waste site; if no responsible party can be located, the EPA is authorized to execute the cleanup using a special trust fund. Over the years, the Act has evolved to address the problem of disposal of electronic equipment such as computers, cell phones, and batteries. This problem has worldwide implications because of the materials contained in computers, which range from gold to an array of toxic substances. Some computers are discarded and sent overseas for breakdown and material processing, and unfortunately the methods used to dispose of them have resulted in poisoning and pollution on a large and dangerous scale. Many companies have developed a total-cost-of-ownership concept for purchasing, deploying, and retiring their computer equipment. The vendors are responsible for supplying and disposing of computer equipment in a manner that adheres to EPA Superfund guidelines. This also includes erasing any data from the memory of disposed or redeployed equipment, thereby eliminating the chance that personal or business data could be accessed by unauthorized personnel.
A.2.4 Executive Order 13423—Strengthening Federal Environmental, Energy, and Transportation Management

President Bush signed this executive order on January 24, 2007 in order to strengthen the environmental initiatives of all federal agencies. The goals set by this executive order are more demanding and aggressive than those of the Energy Policy Act of 2005 and previous executive orders. In general, it calls for improved energy efficiency and reduced greenhouse gas emissions at federal agencies through increased use of renewable energy, reduction in overall water usage, reduction in the toxic and hazardous chemicals and materials used, and an increase in sustainable environmental practices, among other measures. It is also the responsibility of these agencies to take measurements and perform analysis to ensure that the desired objectives are being met on the timeline set out.
A few of the main goals that federal agencies must try to meet by 2015 are:

• Reduce energy consumption of federal agencies by 30%
• Reduce water consumption by 16%
• If an agency has a fleet of over 20 motor vehicles, reduce its total consumption of petroleum products by 2% annually until 2015, while increasing annual consumption of nonpetroleum-based fuel by 10%

A.2.4.1 Federal Real Property Council (FRPC). This is an interagency council created under the Federal Real Property Asset Management Executive Order 13327 and is responsible for the development of a single, descriptive database of federal real properties. The executive order requires improved real property asset management and complete accountability that integrates with an organization's financial and mission objectives. It is consistent with other federal financial policies, such as SOX, and the new approach toward business operations. The implication is that facilities and infrastructure will need to be optimized in order to ensure that disclosure of financial data is complete, verifiable, and authenticated, allowing data to be shared across functional business areas.
A.2.5 ISO 27000, Information Security Management System (ISMS)

The Information Security Management System's primary responsibility is to respond to data sensitivity requirements and ensure that there is no unauthorized access to sensitive data. It is also responsible for monitoring and reporting on access to IT assets so that security violations are identified and the supporting documentation needed to pursue legal action or repair access rules is obtained. The control of entitlements that allow personnel access to assets must be maintained so that access controls are updated when personnel leave the firm or change positions. Other responsibilities of an IT security department include:

• Data owner definitions
• Data sensitivity
• Data SAGE guidelines
• Data access controls
• Violation capturing
• Violation reporting
• Required forms
• Procedures for completing forms
• Forms submission and processing
Vital records management procedures associated with library management and vaulting are based on the data sensitivity associated with the data, job, or application involved.
Storage management procedures are implemented in support of library management requirements and are responsible for real-time and incremental backup of data and for controlling data encryption to safeguard critical documentation from unauthorized access, even when it is contained on a tape or cartridge that is lost in transport. The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) created a joint technical committee (ISO/IEC JTC 1) to develop and publish the ISO 27000 Information Security Management System guidelines. Both domestic and international standards are met by following the ISO 27000 guidelines (see Figure A.1):

• ISO 27000 (2009)—Overview and Vocabulary
• ISO 27001 (2005)—Requirements Definition
• ISO 27002 (2005)—Code of Practice for Information Security Management
• ISO 27003—ISMS Implementation Guidelines
• ISO 27004—Information Security Management Measurement Guidelines
• ISO 27005 (2008)—Information Security Risk Management
• ISO 27006 (2007)—Requirements for Audit and Certification for ISMS
• ISO 27007—Guidelines for ISMS Auditing
Figure A.1. ISO 27000 Information Security Management System guidelines.
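The storage-management requirement above, performing incremental backups so only data changed since the last cycle is copied, can be sketched simply. The file names and timestamps below are hypothetical illustrations; a real implementation would also encrypt the resulting media, as the text notes, so that a tape lost in transport remains unreadable.

```python
# Minimal sketch of incremental-backup selection: only items modified
# after the previous backup are queued for the next cycle.
# File names and timestamps are hypothetical examples.

def incremental_candidates(mtimes: dict[str, float], last_backup: float) -> list[str]:
    """Return the files modified after the previous backup, sorted for a stable job order."""
    return sorted(path for path, mtime in mtimes.items() if mtime > last_backup)

if __name__ == "__main__":
    files = {
        "ledger.db": 1700000500.0,   # changed since last backup
        "policy.pdf": 1699990000.0,  # unchanged
        "audit.log": 1700000900.0,   # changed since last backup
    }
    print(incremental_candidates(files, last_backup=1700000000.0))
    # -> ['audit.log', 'ledger.db']
```

The same selection logic underlies "real-time" protection as well; the difference is simply how often the cycle runs and how quickly changed data reaches the (encrypted) backup media.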
• ISO 27011—ISMS Guidelines for Telecommunications Organizations Based on ISO 27002
• ISO 27799 (2008)—Health Informatics Governing ISMS Based on ISO 27002

The ISO 27000 sections are provided in an orderly sequence that first lays the foundation of understanding and then proceeds through the steps needed to build, implement, monitor, and maintain the ISMS designed to safeguard against security breaches and cyber crimes. ISO 27000 includes standards that:

• Define requirements for an ISMS and for those certifying the ISMS
• Provide direct support, detailed guidance, and/or interpretation for the overall plan-do-check-act (PDCA) process and requirements
• Address sector-specific guidelines (health information, telecommunications companies, etc.)
• Address conformity assessments for ISMSs
• Provide terms and definitions for ISMSs
• Cover commonly used terms and definitions in the ISMS standards
• Do not cover all terms and definitions applied within the ISMS
• Do not limit the ISMS family of standards in defining terms for their own use

Organizations of all types and sizes:

• Collect, process, store, and transmit large amounts of data
• Recognize that information and related processes, systems, networks, and people are important assets for achieving organizational objectives
• Face a range of risks that may affect the functioning of those assets
• Modify risks by implementing an ISMS with information security controls

As information security risks and the effectiveness of controls change with shifting circumstances, organizations need to monitor and evaluate the effectiveness of implemented controls and procedures, identify emerging risks to be treated, and select, implement, and improve appropriate controls as needed.
An ISMS provides a model for establishing, implementing, operating, monitoring, reviewing, maintaining, and improving the protection of information assets to achieve business objectives, based upon a risk assessment and the organization's risk acceptance levels designed to effectively treat and manage risks. Analyzing requirements for the protection of information assets and applying appropriate controls to ensure the protection of these assets, as required, contributes to the successful implementation of an ISMS. The following fundamental principles also contribute to the successful implementation of an ISMS:

• Awareness of the need for information security
• Assignment of responsibility for information security
• Incorporating management commitment and the interests of stakeholders
• Enhancing societal values
• Risk assessments determining appropriate controls to reach acceptable levels of risk
• Security incorporated as an essential element of information networks and systems
• Active prevention and detection of information security incidents
• Ensuring a comprehensive approach to information security management
• Continual reassessment of information security and making modifications as appropriate

ISO 27002 provides guidelines for implementing an information security system that protects against security breaches and cyber crimes. This section is also used as a guideline for telecommunications companies and health information. It is therefore recommended that you review the ISO 27000 guidelines before implementing an ISMS.
A.2.6 The National Strategy for the Physical Protection of Critical Infrastructures and Key Assets

This document, released in 2003, detailed a major part of President Bush's overall homeland security strategy. It defined "critical infrastructure sectors" and "key assets" as follows:

Critical Infrastructure—This includes agriculture, food, water, public health, emergency services, government, the defense industrial base, information and telecommunications, energy, transportation, banking and finance, the chemical industry, postal services, and shipping. The national strategy also discusses "cyber infrastructure" as closely connected to, but distinct from, physical infrastructure.

Key Assets—These targets are not considered vital, but an attack on them could create a local disaster or damage our nation's morale and confidence. These assets include:

• National symbols, monuments, centers of government or commerce, icons, facilities and structures that represent national economic power and technology, and, perhaps, high-profile events
• Prominent commercial centers, office buildings, and sports stadiums where large numbers of people congregate

National security, business, and the quality of life in the United States rely on the continuous, reliable operation of a complex set of interdependent infrastructures. With the telecommunications and technology available today, infrastructure systems have become more linked, automated, and interdependent. Any disruption to one system could easily cascade into multiple failures of the remaining linked systems, thereby
leaving both physical and cyber systems more vulnerable. Ensuring the protection of these assets requires strong collaboration among government agencies and between the public and private sectors. As S. G. Varnado told the U.S. House of Representatives Committee on Energy and Commerce in 2002, "Currently, there are no tools that allow understanding of the operation of this complex, interdependent system. This makes it difficult to identify critical nodes, determine the consequences of outages, and develop optimized mitigation strategies." Clearly, communication of accurate information is an important link among these interdependent infrastructures. Surveillance, critical capabilities (facilities and services), and data systems must be protected. The Department of Homeland Security is working with officials in each sector to ensure adequate redundant communication networks and to improve communications availability and reliability, especially during major disruptions.
A.2.7 2009 National Infrastructure Protection Plan
The National Infrastructure Protection Plan (NIPP) is focused on the protection of the critical infrastructure and key resources (CIKR) that are essential to the security and way of life of the United States. According to the Department of Homeland Security, the NIPP "provides the unifying structure for integrating a wide range of efforts for the protection of CIKR into a single national program." The NIPP's framework (Figure A.2) clearly defines the roles and responsibilities of all agencies and private sector partners involved in its implementation, which meets the requirements of HSPD-7 (Homeland Security Presidential Directive 7): Critical Infrastructure Identification, Prioritization, and Protection. Among the main objectives of this risk management framework are opening communications between CIKR partners about threats and sharing information about protection programs that can be implemented. The structure is designed for continuous improvement in reducing risks to CIKR through setting goals and objectives, identifying assets, identifying vulnerabilities, establishing priorities and protective programs, and measuring the effectiveness of these programs and strategies.
Figure A.2. NIPP risk management framework. (Courtesy of U.S. Dept. of Homeland Security.)
A.2.8 North American Electric Reliability Corporation (NERC) Critical Infrastructure Protection Program

NERC is designed to ensure that the North American electrical system is reliable and secure by maintaining reliability standards. The Critical Infrastructure Protection (CIP) Cyber Security Standards safeguard the critical cyber assets that affect the reliability and resilience of North America's electrical system. These cyber security standards are broken up into a set of requirements that critical infrastructure should meet in order to ensure that it is safe from the many cyber security threats we face today, such as hacking and viruses. NERC requires utilities to self-report any known violations of the CIP standards. CIP-002 through CIP-009 provide a cyber security framework (Table A.2).
A.2.9 U.S. Securities and Exchange Commission (SEC)
The primary mission of the U.S. Securities and Exchange Commission (SEC) is to protect investors and maintain the integrity of the securities markets. The SEC works closely with Congress, federal departments and agencies, the stock exchanges, state securities regulators, and the private sector to oversee the securities markets. It has the power to enforce securities laws to prevent insider trading, accounting fraud, and the provision of false or misleading information to investors about the securities and the companies that issue them. The latest addition to the list of laws that govern this industry is the Sarbanes-Oxley Act of 2002, which has led to a major shakeup of the industry and has had a huge impact on the way "normal" business operations are conducted today. This is discussed in greater detail in Section A.2.2. It is also the responsibility of companies to disclose the necessary information regarding investment objectives, funding, and details of company structure and operations. The term "business operations" encompasses the critical infrastructure that ensures information security of clients and transactions. As a result, it is essential that the critical infrastructure be maintained in order to be in compliance with many of the policies and regulations set forth by the SEC.
A.2.10 Sound Practices to Strengthen the Resilience of the U.S. Financial System

This interagency paper was developed in 2003 under an agreement among the Federal Reserve Board, the Office of the Comptroller of the Currency (OCC), and the Securities and Exchange Commission (SEC). It provides guidance on business continuity planning for critical financial institutions, particularly important in the post-9/11 environment, and contains four sound practices to ensure the resilience of the U.S. financial system:

1. Identify clearing and settlement activities in support of critical financial markets.
2. Determine appropriate recovery and resumption objectives for clearing activities in support of critical markets.
Table A.2. CIP-002 through CIP-009 Cyber Security Standards

CIP-001—Sabotage Reporting: Any unusual occurrences that are suspected or determined to be caused by sabotage must be reported to the appropriate authorities, governmental agencies, and regulatory bodies.

CIP-002—Critical Cyber Asset Identification: Requires the identification of critical assets and critical cyber assets by the responsible entity, which must use a risk-based assessment methodology in identifying these assets.

CIP-003—Security Management Controls: Requires the development and implementation of security management controls to protect the critical assets identified under CIP-002.

CIP-004—Personnel and Training: Employee training and identification are a must when dealing with critical assets and cyber security risks. CIP-004 requires that all personnel with access to critical cyber security assets have proper identification, verification, and a criminal background check. It also addresses training and employee education.

CIP-005—Electronic Security Perimeters: A security perimeter must also be implemented when protecting critical assets. This requires identifying all access points and securing them. The electronic security perimeter must protect and include all the critical cyber assets.

CIP-006—Physical Security Program: The physical security program ensures that all the critical assets contained within the electronic security perimeter are physically protected as well.

CIP-007—System Security Management: A clear definition of the procedures, methods, and processes for securing the critical cyber security systems.

CIP-008—Incident Response and Reporting: A method of incident reporting is a must for securing critical assets. The responsible entity must identify any suspected or proven security threats and make sure they are documented, classified, and responded to.

CIP-009—Recovery Plans for Critical Cyber Assets: Clear recovery plans are critical to ensuring business continuity in the event of natural disasters, security breaches, or failing systems. Well-established techniques and practices must be instilled and kept up to date.
3. Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives.
4. Routinely use or test recovery and resumption arrangements.

Business continuity objectives as stated in the Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System include:
• Rapid recovery/timely resumption of critical operations following a wide-scale disruption.
• Rapid recovery/timely resumption of critical operations following a wide-scale disruption or the loss or inaccessibility of staff in at least one major operating location.
• High level of confidence, through ongoing use or robust testing, that critical internal and external continuity arrangements are effective and compatible.

Sound practices to consider are:

• Focus on establishing robust backup facilities.
• Definition of wide-scale regional disruption and parameters of probable events (e.g., power disruptions, natural disasters).
• Duration of outage and achievable recovery times.
• Impact on type of data storage technology and telecommunications infrastructure.
• "Diversification of risk" and establishing backup sites for operations and data centers that do not rely on the same infrastructure and other risk elements as primary sites.
• Geographic distance between primary and backup facilities.
• Critical staffing needs sufficient to recover from wide-scale disruption.
• Routine use and testing of backup facilities.
• Costs versus risks.

The sound practices provide only general guidelines, with no specific details and no recommendations with respect to recovery times, prescribed technology, backup site locations, testing schedules, or the level of resilience across firms. Each firm must determine its own risk profile based on its existing infrastructure, planned enhancements, resources, and risk management strategies. In the end, it is the role of the Board of Directors and senior management to ensure that their business continuity plans reflect the firm's overall business objectives. It is the role of the regulatory agencies to monitor the implementation of those plans.
A.2.11 C4I—Command, Control, Communications, Computers, and Intelligence

Systems that provide C4I capabilities offer the means for executing special operations, including nuclear, military, and homeland security missions. Command and control is the decision making or exercising of authority by a designated commander who is responsible for heading the forces needed to accomplish a mission. These missions and functions are performed through an arrangement of personnel, communications, facilities, equipment, and procedures employed by the commander who is in charge of planning and coordinating the mission. An example of command and control equipment is the Navstar Global Positioning System (GPS), a space-based positioning system that provides three-dimensional position, velocity, and timing information to a multitude of U.S. and allied military users. Communications and computers serve to support command and control capabilities, enabling operational and functional staff to access, share, and exchange information worldwide with minimal knowledge of computing technologies. Intelligence is the assortment of information and data resulting from the collection, processing, integration, and analysis of available information. Most importantly, C4I systems provide situational awareness for commanders. With this data and a combination of relevant knowledge and judgment, commanders can make superior decisions in a time of crisis.
A.2.12 Basel II Accord
With globalization of financial services firms and sophisticated information technology, banking has become more diverse and complex. The Basel Capital Accord was first introduced in 1988 by the Bank for International Settlements (BIS) and was then updated in 2004 as the Basel II Accord. This provides a regulatory framework that requires all internationally active banks to adopt similar or consistent risk-management practices for tracking and publicly reporting their operational, credit, and market risks (Table A.3). Basel II impacts information technology because of the need for banks to collect, store, and process data. Financial organizations must also minimize operational risk by having operational controls such as power protection strategies to ensure reliability, availability, and security of their data and business systems. Basel II is a platform for better business management whereby financial institutions systematically incorporate risk as a key driver in business decisions.
A.2.13 National Institute of Standards and Technology (NIST)
The National Institute of Standards and Technology was established in 1901 as a nonregulatory federal agency within the U.S. Department of Commerce. The primary mission of NIST is to advance economic security and improve quality of life through the encouragement of U.S. industrial competitiveness and innovation. As part of the Federal Information Security Management Act (FISMA) established in 2002, federal agencies need to establish an agency-wide program to protect information and information systems that support their daily operations. NIST has been a
Table A.3. Basel II compliance benefits

Improve risk-management practices by integrating, protecting, and auditing data.
Rapidly and effectively track exposure to operational, credit, and market risks.
Reduce capital charges and other costs resulting from noncompliance.
Control data from the source to end reporting for improved data integrity.
Provide secure and reliable access to applications and relational databases 24/7.
Maintain a historical audit trail to rapidly detect data discrepancies or irregularities.
key component in the development of standards and guidelines, which has helped these agencies to implement the goals of FISMA. Currently, NIST is assisting in establishing a credentialing program for public and private organizations to provide security assessment services. The aim of the credentialing program is to create competence in implementing NIST security standards and guidelines, and confidence within agencies requiring these security assessment services. The phases of the FISMA Implementation Project as stated by NIST are as follows.*

Phase I: Standards and Guidelines Development (2003-2008)

• FIPS Publication 199, Standards for Security Categorization of Federal Information and Information Systems (Completed)
• FIPS Publication 200, Minimum Security Requirements for Federal Information and Information Systems (Completed)
• NIST Special Publication 800-30, Risk Management Guide for Information Technology Systems (Completed)
• NIST Special Publication 800-37, Guide for the Security Certification and Accreditation of Federal Information Systems (Completed)
• NIST Special Publication 800-37 Revision 1 (initial public draft), Guide for Security Authorization of Federal Information Systems: A Security Lifecycle Approach (Completion October 2009)
• NIST Special Publication 800-39 (second public draft), NIST Risk Management Framework (Completion December 2009)
• NIST Special Publication 800-53 Revision 2, Recommended Security Controls for Federal Information Systems (Completed)
• NIST Special Publication 800-53 Revision 3, Recommended Security Controls for Federal Information Systems (Completed)
• NIST Special Publication 800-53A, Guide for Assessing the Security Controls in Federal Information Systems (Completed)
• NIST Special Publication 800-59, Guide for Identifying an Information System as a National Security System (Completed)
• NIST Special Publication 800-60 Revision 1, Guide for Mapping Types of Information and Information Systems to Security
Categories (Completed)

Phase II: Organizational Credentialing Program (2007-2012). The second phase of the FISMA implementation project will primarily focus on the development of a program for public/private sector organizations to provide security assessment services for federal agencies. The second phase will include the following initiatives:

• Training initiative. Establishing a common understanding of standards and guidelines by providing training courses.

*All information provided courtesy of http://csrc.nist.gov/groups/SMA/fisma/overview.html.
• Product and service assurance assessment initiative. Providing a definition of criteria and guidelines for the evaluation of services and products used in the implementation of SP 800-53-based security controls.
• Support tools initiative. Identifying or developing common protocols, programs, reference materials, checklists, and technical guides supporting implementation and assessment of SP 800-53-based security controls in information systems.
• Harmonization initiative. Identifying common relationships and mappings of FISMA standards, guidelines, and requirements with (1) the ISO 27000 series information security management standards and (2) the ISO 9000 and 17000 series quality management and laboratory testing and accreditation standards. This harmonization is important for minimizing duplication of effort for organizations that must demonstrate compliance with both FISMA and ISO requirements.
A.2.14 Business Continuity Management Agencies and Regulating Organizations

Over the years, business continuity management (BCM) principles and practices have evolved in response to changing requirements, new tools, and increased responsibilities. Because of the complexity associated with BCM, organizations have been formed to develop and codify BCM practices and to administer certification tests for BCM personnel. The Business Continuity Institute (BCI) and the Disaster Recovery Institute International (DRII) are the two primary organizations regulating business continuity management practices and certifying BCM personnel. Initially, the BCI was more involved with international recovery standards, whereas DRII concentrated its efforts domestically. Recently, however, the two organizations have coordinated many of their efforts, and each now takes a broader approach to recovery planning. Through their programs and training sessions, a company can achieve a very high degree of protection, both domestically and internationally. These firms provide onsite, classroom, and online training, thereby making it easier for a company to develop a recovery staff that is aware of best practices for recovery planning and the tools available to support recovery operations. It is recommended that your recovery staff attend a training session from one of these organizations and become certified recovery professionals.

The following ten-step process for implementing recovery plans was developed by DRII and is taught in its Certified Business Continuity Professional (CBCP) classes. This process is a proven method to achieve recovery operations that will protect personnel, clients, suppliers, and business operations.

1. Project Initiation and Management
• Establish the need for a business continuity plan (BCP).
• Obtain management support and guidelines.
• Distribute a BCP initiation memo outlining the scope and requesting resources.
• Create a business continuity organization.
• Manage the project to completion within agreed-upon time and budget limits.

2. Risk Evaluation and Control
• Determine the events and environmental surroundings that can adversely affect the organization and its facilities following a disruption or disaster event, and calculate the damage such events can cause (risk identification).
• Establish the controls needed to prevent or minimize the effects of potential loss (problem management and control).
• Provide cost-benefit analysis to justify investment in controls to mitigate risks (risk management).

3. Business Impact Analysis (BIA)
• Identify the impacts resulting from disruptions and disaster scenarios that can affect the organization and techniques that can be used to quantify and qualify such impacts.
• Establish critical functions, their recovery priorities, and interdependencies so that recovery time objectives (RTO) and recovery point objectives (RPO) can be set.
• Measure the recovery time capability (RTC) to determine whether recovery objectives can be met in the current environment, or whether improvements need to be implemented so that recovery time objectives can be met.

4. Developing Business Continuity Strategies
• Determine and guide the selection of alternative business recovery operating strategies for recovery of business and information technologies within the recovery time objective, while maintaining the organization's critical functions.
• Test and deliver solutions that are relevant to the changing business requirements.

5. Emergency Response and Operations
• Develop and implement procedures for responding to and stabilizing the situation following an incident or event.
• Establish and manage an emergency operations center (EOC) to be used as a command center during the emergency.
• Demonstrate practical experience in handling an emergency.

6. Designing and Implementing Business Continuity Plans
• Design, develop, and implement the business continuity plan that provides recovery within the recovery time objective.

7. Awareness and Training Programs
• Prepare a program to create corporate awareness and enhance the skills required to develop, implement, maintain, and execute the business continuity plan.
8. Maintaining and Exercising Business Continuity Plans
• Preplan and coordinate plan exercises, evaluate and document plan exercise results, and make improvements to plans as necessary.
• Develop processes to maintain the currency of continuity capabilities and the plan document in accordance with the organization's strategic direction.
• Verify that the plan will prove effective by comparison with a suitable standard, and report results in a clear and concise manner.
• Integrate recovery operations with the everyday functions performed by personnel and within the systems development life cycle, so that recovery plans are kept current at all times.

9. Public Relations and Crisis Communication
• Develop, coordinate, evaluate, and exercise plans to handle the media during crisis situations.
• Develop, coordinate, evaluate, and exercise plans to communicate with key customers, critical suppliers, owners/stockholders, and corporate management during a crisis.
• Ensure that all stakeholders are kept informed on an as-needed and continuing basis.
• Provide trauma counseling for employees and their families.

10. Coordination with Public Authorities
• Establish applicable procedures and policies for coordinating response, continuity, and restoration activities with local authorities while ensuring compliance with applicable statutes or regulations.
• Demonstrate practical experience in dealing with local authorities.
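The business impact analysis step above turns on a simple comparison: a critical function's measured recovery time capability (RTC) must not exceed its recovery time objective (RTO), or remediation is needed. The following Python sketch illustrates that check; the class and function names are illustrative, not drawn from any standard.

```python
from dataclasses import dataclass

@dataclass
class CriticalFunction:
    name: str
    rto_hours: float   # recovery time objective set by the BIA
    rtc_hours: float   # measured recovery time capability

def gap_report(functions):
    # Flag any function whose measured capability misses its objective.
    return [f.name for f in functions if f.rtc_hours > f.rto_hours]

funcs = [
    CriticalFunction("wire transfers", rto_hours=4, rtc_hours=6),
    CriticalFunction("email", rto_hours=24, rtc_hours=12),
]
print(gap_report(funcs))  # prints ['wire transfers']
```

Functions returned by such a report would be candidates for the improvements mentioned in step 3, so that recovery time objectives can be met.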
A.2.15 FFIEC—Federal Financial Institutions Examination Council

The FFIEC is the most robust and recognized organization for implementing recovery operations within a firm. Its guidance has been used domestically and internationally and has a wide audience of followers, especially within the financial community. The standard can be used to create recovery plans that meet or exceed industry compliance requirements, and sections are updated as needs change.
A.2.16 National Fire Protection Association 1600 Standard on Disaster/Emergency Management and Business Continuity Programs

NFPA 1600 gives businesses and other organizations a foundation document to protect their employees and customers in the event of a terrorist attack. A sweeping overhaul of the U.S. intelligence community included provisions to make NFPA 1600 disaster/emergency management and business continuity programs the national preparedness standard. The act was approved by the Senate by an 89 to 2 vote, and in the
House by a vote of 336 to 75. President Bush signed the bill into law on December 17, 2004. The law implements many of the recommendations made by the National Commission on Terrorist Attacks Upon the United States, better known as the 9/11 Commission. The provision of the law highlighting NFPA 1600 is Section 7305, Private Sector Preparedness (B), Sense of Congress on Private Sector Preparedness, which states, "It is the sense of Congress that the secretary of Homeland Security should promote, where appropriate, the adoption of voluntary national preparedness standards such as the private sector preparedness standard developed by the American National Standards Institute and based on the National Fire Protection Association 1600 Standard on Disaster/Emergency Management and Business Continuity Programs." The Commission recommended that NFPA 1600 be recognized as the National Preparedness Standard. The Commission reviewed the need for private-sector preparedness during their public hearings and noted that "the private sector controls 85 percent of the critical infrastructure in the nation," and "the 'first' first responders will almost certainly be civilians." The Commission acknowledged the lack of a standard as a principal contributing factor to the lack of private-sector preparedness and asked the American National Standards Institute (ANSI) to develop a consensus on a standard for the private sector. ANSI convened its Homeland Security Standards Panel (HSSP), which held a series of workshops attended by representatives of many industries and associations in the private sector as well as public-sector officials. After reviewing available documents and studying NFPA 1600, ANSI's HSSP endorsed NFPA 1600. NFPA 1600 has become well known in the public sector since the Emergency Management Accreditation Program (EMAP) began using NFPA 1600 as criteria for accreditation of public-sector emergency-management programs. 
EMAP is a voluntary, national accreditation process for public emergency-management programs at the state, territorial, and local levels. Surprisingly, however, the standard has not received as much attention in the private sector. The recommendation of the Commission has shined a bright spotlight on NFPA 1600, and now the document is receiving much more attention from business and industry. NFPA 1600 is not a handbook or how-to guide with instructions on building a comprehensive program. Instead, it outlines the management and elements that organizations should use to develop a program for mitigation, preparedness, response, and recovery. The Technical Committee on Emergency Management and Business Continuity has declined to prescribe a program development process, leaving it up to each entity to follow a method that meets the specific needs of the entity. NFPA 1600 requires development of a documented program that defines the entity's policy, goals, objectives, plans, and procedures. The standard also requires appointment of a program coordinator to work in conjunction with an advisory committee to develop the program. Specific program elements include: hazard identification, risk assessment, and impact analysis; hazard mitigation; resource management; mutual aid; planning; direction, control, and coordination; communications and warning; operations and procedures; logistics and facilities; training, exercises, evaluations, and corrective actions; crisis communications and public information; and finance and administration. NFPA 1600 also requires compliance with applicable regulatory requirements.

NFPA 1600 requires an "all hazards" approach. There has been significant emphasis in the past on preparedness for terrorist attacks, and much work remains to be done. However, hurricanes, tornadoes, flooding, and other natural and manmade disasters continue to take a significant toll in lives lost, property damage, and business interruption. NFPA 1600 requires assessment of all hazards that might impact people, property, operations, and the environment. Risk assessments should quantify the probability of occurrence and the severity of the consequences. A business impact analysis (BIA) should quantify the impact a disaster will have on the organization's mission, as well as direct and indirect financial consequences. The business impact analysis enables an organization to evaluate the cost-effectiveness of mitigation efforts and determine how much to invest in preparedness, response, and recovery plans.
A.2.17 Private Sector Preparedness Act
H.R. 4830, the Private Sector Preparedness Act of 2004, is a bill introduced in the U.S. House of Representatives to amend the Homeland Security Act of 2002 and direct the Secretary of Homeland Security to develop and apply a plan to improve private sector preparedness for disasters and emergencies. It stipulates that in carrying out the program the Secretary shall develop guidance and identify best practices to assist or foster action by the private sector, including:

1. Identifying hazards and assessing risks and impacts.
2. Mitigating the impacts of a wide variety of hazards, including weapons of mass destruction.
3. Organizing necessary disaster preparedness and response resources.
4. Developing common aid agreements.
5. Developing and maintaining emergency preparedness and response plans, as well as associated operational procedures.
6. Developing and maintaining communications and warning systems.
7. Developing and conducting training and exercises to support and evaluate emergency preparedness and response plans and operational procedures.
8. Developing and conducting training programs for security guards to implement emergency preparedness and response plans and operational procedures.
9. Developing procedures to respond to external requests for information from the media and the public.
A.3 DATA PROTECTION
An exposure that all companies face is the transportation of media and data between locations. Over the past 20 years, we have seen a significant transition from paper
records to digital records. As a result, the transportation of information includes not only physical transport but electronic transport as well. With today's emphasis on preventing intellectual capital theft and protecting critical documentation from unauthorized access, it is important to implement safeguards to protect media from being stolen, lost, or unintentionally discarded. For example, in 2010 CBS News conducted an investigative report on digital copiers, which revealed that since 2002 nearly all of these copy machines have contained hard drives that store thousands of office documents. Surprisingly, many companies do not invest in encryption software or other software that erases the hard drive before digital copiers are released from their lease or resold. As a result, a company's security and privacy can be considerably compromised. In response to this need, firms that are responsible for critical infrastructure and facilities should commission a risk assessment study to identify gaps and exposures in how media transportation is performed and to guarantee that the firm is complying with all laws and regulations. A method for performing this risk assessment is as follows:

• Review laws and regulations affecting tape transport to/from data centers and remote locations.
• Review the tape transportation process between the data center and remote locations (i.e., vaults, customers, credit bureaus, other).
• Evaluate vendors included in the media transportation process, including those used for purchase and disposal of media.
• Perform an audit of the local and remote vaults.
• Research existing insurance over loss of media.
• Review procedures and the response plan for lost media or a data breach.
• Investigate the use of encryption to protect data from misuse.
• Identify exposures and gaps; define impact, draw conclusions, and make recommendations to mitigate and remedy identified problems.
• Prepare a final report with findings and recommendations.
As a result of the risk assessment, you will be able to determine if your company is in compliance with the laws and regulations related to data protection. In the mission critical industry, it is the responsibility of facility managers, or any person managing critical assets, to ensure that their IT department has proper procedures and protocols in place to comply with federal, state, and local policies and regulations. In many companies, the levels of workers span a wide range of responsibilities, and this affects their right to view privileged information. As a result, degrees of access to proprietary information have to be delegated appropriately, and intellectual capital has to be safeguarded against viruses and computer hackers. Company information that is illegally obtained is categorized as a security breach. The items that help define a security breach are listed below; these are the only ones requiring immediate notification. They consist of sensitive client information or a company's critical documentation, including:
• Name, address, and telephone number
• Social security number, account number
• Credit/debit card number and associated PIN
• User names and passwords
• Facility infrastructure documentation (electrical, mechanical, HVAC, plumbing)
• Any combination of components that would allow access to a company's confidential or proprietary data
Policies, regulations, and company procedures are important steps that allow critical assets to be controlled, monitored, and logged, but there are additional technologies that are very useful and effective in protecting and securing data, such as encryption. Encryption is a method for protecting data under all conditions, including everyday usage and transport between locations. Encrypted data cannot be read by unauthorized parties should it be lost or stolen, thereby preventing the loss of intellectual property.
A.4 ENCRYPTION
Although encryption is not a policy or regulation, it is a key element in securing critical data. In the past, encryption was not seriously applied to protect data because of the overhead associated with its use. Today, encryption can be performed electronically via channel-connected devices that eliminate the potential 50% overhead previously associated with encryption. This is due to how encryption occurs: the original data is processed via a key code that scrambles the data into an unreadable format, and a matching key is used at the receiving end to decrypt the data into its original format. The types of files that are candidates for encryption include long-term files, short-term sensitive files, and financial, compliance, and other critical files. Considerations when defining and researching the types of encryption available include:

• Encryption key methodology (128 bit, etc.)
• Encryption key escrow and usage procedures
• Duration associated with encryption jobs and their impact on production schedules
• Encryption selection and usage criteria

Encryption protects data by scrambling data through encryption formulas identified as keys (Figure A.3). Public and private keys can be used to scramble and unscramble data as needed. Encryption can now be accomplished in hardware devices with little or no overhead; because of this, its use is growing. Encrypted information contained on tapes and cartridges that are being stored at a remote vault or being transported to another site will be protected against identity theft or a security breach, which will eliminate the possibility of fines should media be lost
during transport. This is because the data is encrypted and, therefore, safeguarded from unlawful access.

Figure A.3. Safeguarding data through encryption.
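The key-based scrambling and unscrambling described above can be sketched in a few lines of Python. This is an illustrative toy only: the function names and the SHA-256-derived keystream are our own construction, not a production cipher, and real deployments should use a vetted algorithm such as AES, typically in dedicated hardware as the text notes.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Derive a deterministic keystream by repeatedly hashing the key
    # with a counter (illustrative only -- not a vetted cipher).
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_crypt(data: bytes, key: bytes) -> bytes:
    # XOR-ing with the keystream scrambles the data; XOR-ing again
    # with the same key restores it, so one function serves both ends.
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

plaintext = b"Backup tape manifest: vault 7, slot 42"
key = b"shared secret key"

ciphertext = xor_crypt(plaintext, key)   # unreadable without the key
recovered = xor_crypt(ciphertext, key)   # matching key restores the data

assert ciphertext != plaintext
assert recovered == plaintext
```

The two assertions capture the property the text relies on: without the matching key the scrambled media is unreadable, and with it the original data is recovered exactly.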
A.4.1 Protecting Critical Data through Security and Vaulting
Data sensitivity is used to define the relative importance of information and the access controls that should be implemented to protect the information from a security breach or media loss. Vital records management is responsible for selecting the placement of information during processing and for protecting the information through backup and recovery processes (i.e., local vault, remote vault, outside locations, etc.). An overview of data sensitivity and vital records management is shown in Figure A.4.
Figure A.4. Overview of data sensitivity and vital records management.

A.5 BUSINESS CONTINUITY PLAN (BCP)

A business continuity plan allows an organization to quickly recover from a disaster (i.e., natural disaster, terrorism, cyber-terrorism) and to reestablish normal operations given the scope and severity of the business disruption. The purpose is to safeguard employees and property, make financial and operational assessments, protect records, and resume business operations as quickly as possible. In financial institutions, a business continuity plan addresses the following: data backup and recovery; all mission critical systems; financial and operational assessments; alternative communications with customers, employees, and regulators; alternate physical location of employees; critical supplier, contractor, bank, and counterparty impact; regulatory reporting; and assuring customers prompt access to their funds and securities.

For federal departments and bureaus, business continuity is developed under an enterprise architecture framework, which addresses how a department, agency, or bureau should maintain operations following a major disaster, incident, or disruption. It includes government standards as well as commercial best practices. Every federal agency is responsible for determining which processes and systems are mission critical. In addition, agencies must prioritize business processes to decide the order in which they must be restored should a disaster occur. For federal facilities, a business continuity plan consists of the following hierarchy of elements (from higher to lower priority):

• Continuity of operations
• Continuity of government
• Critical infrastructure protection (computer and physical infrastructure)
• Essential processes/systems
• Nonessential processes/systems
Figure A.5. Business continuity management disciplines and integration.
There are many business continuity management disciplines that can work together to develop, implement, and maintain recovery plans that meet current needs and compliance requirements. Figures A.5 and A.6 show that establishing interfaces with key departments will allow for the inclusion of corporate-wide recovery procedures in department-specific recovery plans.
A.6 CONCLUSION
Figure A.6. Contingency recovery planning.

Policies and regulations for the mission critical industry serve as fundamental elements in protecting our critical assets and safeguarding our country. They provide the checks and balances needed to protect client interests by ensuring that businesses are conducting themselves ethically, while also requiring companies to have proper procedures established to deal with critical situations. Many policies and regulations, however, do not have prescribed methods outlined for compliance. Therefore, companies must develop business continuity plans that incorporate objectives to proactively meet policies and regulations. These objectives and methods ought to be customized according to the unique characteristics of the company's industry as well as the specific operations and procedures of its infrastructure and facility. Noncompliance is not an option. The policies and regulations outlined in this appendix are meant to serve as a tool and guide to aid companies in the development of methods that will not only lead to compliance but will also strengthen and protect the overall resiliency of the organization.
APPENDIX B

CONSOLIDATED LIST OF KEY QUESTIONS
This appendix is a consolidation of the questions that appear throughout the book. The questions are the starting point for a rigorous examination of a mission critical environment. The answers will help mission critical facilities achieve the necessary levels of diversity, recoverability, redundancy, and resiliency. The questions are presented in a worksheet format, providing space so that mission critical facilities can indicate whether the question is applicable and can record comments germane to the question.
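For organizations that prefer to track the worksheet electronically, one row of it can be modeled as a simple record with the applicability flag and comment field the format describes. The sketch below is a minimal, hypothetical model — the class and field names are ours, not the book's:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorksheetItem:
    """One row of the key-questions worksheet: the question text, whether
    it applies to this facility (Y/N, unset until decided), and comments."""
    number: int
    question: str
    applicable: Optional[bool] = None  # None until the facility answers Y/N
    comments: str = ""

# Example rows drawn from the start of the worksheet:
worksheet = [
    WorksheetItem(1, "Do you have a working and ongoing relationship "
                     "with your electric power utility?"),
    WorksheetItem(2, "Do you know who in your financial institution "
                     "currently has a relationship with your electric "
                     "power utility?"),
]

# Record an answer and a comment against the first question:
worksheet[0].applicable = True
worksheet[0].comments = "Quarterly review meetings with utility account rep."

# List the question numbers still awaiting a Y/N decision:
pending = [item.number for item in worksheet if item.applicable is None]
print(pending)  # [2]
```

A structure like this makes it easy to audit which of the 225 questions remain unanswered for a given facility.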
Question | Applicable? (Y/N) | Comments

Questions 1 through 28 Apply to Power Utilities

General

1. Do you have a working and ongoing relationship with your electric power utility?
2. Do you know who in your financial institution currently has a relationship with your electric power utility; i.e., facilities management or accounts payable?
Maintaining Mission Critical Systems in a 24/7 Environment, Second Edition. Peter M. Curtis. © 2011 the Institute of Electrical and Electronics Engineers, Inc. Published 2011 by John Wiley & Sons, Inc.
3. Do you understand your electric power utility's electric service priority (ESP) protocols?
4. Do you understand your electric power utility's restoration plan?
5. Are you involved with your electric power utility's crisis management/disaster recovery tests?
6. Have you identified regulatory guidelines or business continuity requirements that necessitate planning with your electric power utility?

Specifically for an Electric Power Utility

7. What is the relationship between the regional source power grid and the local distribution systems?
8. What are the redundancies and the related recovery capacity for both the source grid and local distribution networks?
9. What is the process of restoration for source grid outages?
10. What is the process of restoration for local network distribution outages?
11. How many network areas are there in (specify city)?
12. What are the interrelationships between each network segment and the source feeds?
13. Does your infrastructure meet basic standard contingency requirements for route grid design?
14. What are the recovery time objectives for restoring impacted operations in any given area?
15. What are the recovery time objectives for restoring impacted operations in any given network?
16. What are the restoration priorities to customers, both business and residential?
17. What are the criteria for rating in terms of service restoration?
18. Where does the financial services industry rank in the priority restoration scheme?
19. How do you currently inform clients of a service interruption and the estimated time for restoration?
20. What are the types of service disruptions, planned or unplanned, that (specify city) could possibly experience?
21. Could you provide a list of outages, type of outage, and length of disruption that have affected (specify city) during the last 12 months?
22. What are the reliability indices and who uses them?
23. During an outage, would you be willing to pass along information regarding the scope of interruptions to a central industry source, e.g., a financial services industry business continuity command center?
24. Are the local and regional power utilities cooperating in terms of providing emergency service? If so, in what way? If not, what are the concerns surrounding the lack of cooperation?
25. Would you be willing to provide schematics to select individuals and/or organizations on a nondisclosure basis?
26. Could you share your lessons learned from the events of 9/11 and the regional outage of 8/14/03?
27. Are you familiar with the "Critical Infrastructure Assurance Guidelines for Municipal Governments" document written by the Washington Military Department Emergency Management Division? If so, would you describe where (specify city) stands in regard to the guidelines set forth in that document?
28. Independent of the utility's capability to restore power to its customers, can you summarize your internal business continuity plans, including preparedness for natural and manmade disasters (including but not limited to weather-related events, pandemics, and terrorism)?

Questions 29 through 55 apply to Needs Analysis/Risk Assessment

29. How much does each minute, hour, or day of operational downtime cost your company if a specific facility is lost?
30. Have you determined your recovery time objectives for each of your business processes?
31. Does your financial institution conduct comprehensive business impact analyses (BIA) and risk assessments?
32. Have you considered disruption scenarios and the likelihood of disruption affecting information services, technology, personnel, facilities, and service providers in your risk assessments?
33. Have your disruption scenarios included both internal and external sources, such as natural events (e.g., fires, floods, severe weather), technical events (e.g., communication failure, power outages, equipment and software failure), and malicious activity (e.g., network security attacks, fraud, terrorism)?
34. Does this BIA identify and prioritize business functions and state the maximum allowable downtime for critical business functions?
35. Does the BIA estimate data loss and transaction backlog that may result from critical business function downtime?
36. Have you prepared a list of "critical facilities" to include any location where a critical operation is performed, including all work-area environments such as branch backroom operations facilities, headquarters, or data centers?
37. Have you classified each critical facility using a critical facility ranking/rating system such as the Tier I, II, III, IV rating categories?
38. Has a condition assessment been performed on each critical facility?
39. Has a facility risk assessment been conducted for each of your key critical facilities?
40. Do you know the critical, essential, and discretionary loads in each critical facility?
41. Must you comply with the regulatory requirements and guidelines discussed in Appendix A?
42. Are any internal corporate risk and compliance policies applicable?
43. Have you identified business continuity requirements and expectations?
44. Has a gap analysis been performed between the capabilities of each company facility and the corresponding business process recovery time objectives for that facility?
45. Based on the gap analysis, have you determined the infrastructure needs for your critical facilities?
46. Have you considered fault tolerance and maintainability in your facility infrastructure requirements?
47. Given your new design requirements, have you applied reliability modeling to optimize a cost-effective solution?
48. Have you planned for rapid recovery and timely resumption of critical operations following a wide-scale disruption?
49. Following the loss of accessibility of staff in at least one major operating location, how will you recover and resume critical operations in a timely fashion?
50. Are you highly confident, through ongoing use or robust testing, that critical internal and external continuity arrangements are effective and compatible?
51. Have you identified clearing and settlement activities in support of critical financial markets?
52. Do you employ and maintain sufficient geographically dispersed resources to meet recovery and resumption activities?
53. Is your organization sure that there is diversity in the labor pool of the primary and backup sites, such that a wide-scale event would not simultaneously affect the labor pool of both sites?
54. Do you routinely use or test recovery and resumption arrangements?
55. Are you familiar with National Fire Protection Association (NFPA) 1600, Standard on Disaster/Emergency Management and Business Continuity Programs, which provides a standardized basis for disaster/emergency management planning and business continuity programs in private and public sectors by providing common program elements, techniques, and processes?

Questions 56 through 126 apply to Installation

Design

56. Has the owner, working with an engineering professional, developed a design intent document to clearly identify quantifiable requirements?
57. Have you prepared a basis of design document that memorializes, in a narrative form, the project intent, future expansion options, types of infrastructure systems to be utilized, applicable codes and standards to be followed, design assumptions, and project team decisions and understandings?
58. Will you provide the opportunity to update the basis of design to reflect changes made during the construction and commissioning process?
59. Are the criteria for testing all systems and outlines of the commissioning process identified and incorporated into the design documents?
60. Have you identified a qualified engineering and design firm to conduct a peer project design review?
61. Have you considered directly hiring the commissioning agent to provide true independence?
62. Have you discussed and agreed to a division of responsibilities between the construction manager and the commissioning agent?
63. Do you plan to hire the ultimate operating staff ahead of the actual turnover to operations so they will benefit from participation in the design, construction, and commissioning of the facility?
64. Have you made a decision on the commissioning agent early enough in the process to allow participation and input on commissioning issues and design review by the selected agent?
65. Is the proper level of fire protection in place?
66. Is the equipment (UPS or generator) being placed in a location prone to flooding or other water damage?
67. Do the generator day tanks or underground fuel tanks meet local environmental rules?
68. Does the battery room have proper ventilation?
69. Has adequate cooling or heating been specified for the UPS, switchgear, or generator room?
70. Are the heating and cooling for the mechanical rooms on the power protection system?
71. Have local noise ordinances been reviewed, and does all the equipment comply with the ordinances?
72. Are the posting and enforcement of no-smoking bans adequate, specifying, for example, no smoking within 100 feet?
73. Are water detection devices used to alert building management of flooding issues?

Procurement

74. Is there a benefit to using an existing vendor or supplier for standardization of process, common spare parts, or confidence in service response?
75. Have the commissioning, factory, and site-testing requirement specifications been included in the bid documentation?
76. If the project is bid, have you conducted a technical compliance review that identifies exceptions, alternatives, substitutions, or noncompliance to the specifications?
77. Are the procurement team members versed in the technical nuances and terminology of the job?
78. If delivery time is critical to the project, have you considered adding late-penalty clauses to the installation or equipment contracts?
79. Have you included a bonus for early completion of the project?
80. Have you obtained unit rates for potential change orders?
81. Have you obtained a GMP (guaranteed maximum price) from contractors?
82. Have you discussed preferential pricing discounts that may be available if your institution or your engineer and contractors have other similar large purchases occurring?

Construction

83. Do you intend to create and maintain a list of observations and concerns that will serve as a checklist during the acceptance process to ensure that these items are not overlooked?
84. Will members of the design, construction, commissioning agent, and operations teams attend the factory acceptance tests for major components and systems such as UPS, generators, batteries, switchgear, and chillers?
85. During the construction phase, do you expect to develop and circulate for comment the start-up plans, documentation formats, and prefunctional checklists that will be used during start-up and acceptance testing?
86. Since interaction between the construction manager and the commissioning agent is key, will you encourage attendance at the weekly construction status meetings by the commissioning team?
87. Will an independent commissioning and acceptance meeting be run by the commissioning agent, ensuring that everything needed for that process is on target?
88. Will you encourage the construction, commissioning, and operations staffs to walk the job site regularly to identify access and maintainability issues?
89. If the job site is an operating critical site, do you have a risk assessment and change control mechanism in place to ensure reliability?
90. Have you established a process to have independent verification that labeling on equipment and power circuits is correct?

Commissioning and Acceptance

91. Do testing data result sheets identify expected acceptable result ranges?
92. Are control sequences, checklists, and procedures written in plain language, not technical jargon that is easily misunderstood?
93. Have all instrumentation, test equipment, actuators, and sensing devices been checked and calibrated?
94. Is system acceptance testing scheduled after balancing of mechanical systems and electrical cable/breaker testing are complete?
95. Have you listed the systems and components to be commissioned?
96. Has a detailed script sequencing all activities been developed?
97. Are all participants aware of their responsibilities and the protocols to be followed?
98. Does a team directory with all contact information exist, and is it available to all involved parties?
99. Have you planned an "all hands on deck" meeting to walk through and finalize the commissioning schedule and scripted activities?
100. Have the format and content of the final report been determined in advance to ensure that all needed data is recorded and activities are scheduled?
101. Have you arranged for the future facility operations staff to witness and participate in the commissioning and testing efforts?
102. Who is responsible for ensuring that all appropriate safety methods and procedures are deployed during the testing process?
103. Is there a process in place that ensures training records are maintained and updated?
104. Who is coordinating training and ensuring that all prescribed training takes place?
105. Will you videotape training sessions to capture key points for use as refresher training?
106. Does the training you provide include both general systems training and training specifically targeted to the types of infrastructure within the facility?
107. Have all vendors performed component-level verification and completed prefunctional checklists prior to system-level testing?
108. Has all system-level acceptance testing been completed prior to commencing the full system integration testing and "pull the plug" power-failure scenario?
109. Is a process developed to capture all changes made and to ensure that these changes are captured on the appropriate as-built drawings, procedures, and design documents?
110. Do you plan to reperform acceptance testing if a failure or anomaly occurs during commissioning and testing?
111. Who will maintain the running punch list of incomplete items and track resolution status?

Transition to Operations

112. Have you established specific operations planning meetings to discuss logistics of transferring newly constructed systems to the facility operations staff?
113. Is all as-built documentation, such as drawings, specifications, and technical manuals, complete, and has it been turned over to operations staff?
114. Have position descriptions been prepared that clearly define roles and responsibilities of the facility staff?
115. Are standard operating procedures (SOPs), emergency action procedures (EAPs), updated policies, and change control processes in place to govern the newly installed systems?
116. Has the facility operations staff been provided with warranty, maintenance, repair, and supplier contact information?
117. Have spare parts lists, setpoint schedules (after commissioning is complete), the TAB report, and recommissioning manuals been given to operations staff?
118. Are the warranty start and expiration dates identified?
119. Have maintenance and repair contracts been executed and put into place for the equipment?
120. Have minimum response times for service, distance to travel, and emergency 24/7 spare-stock locations been identified?

Security Considerations

121. Have you addressed physical security concerns?
122. Have all infrastructures been evaluated for the type of security protection needed (e.g., card control, camera recording, key control)?
123. Are the diesel oil tank and oil fill pipe in a secure location?
124. If remote dial-in or Internet access is provided to any infrastructure system, have you safeguarded against hacking, or do you permit read-only functionality?
125. How frequently do you review and update access-permission authorization lists?
126. Are critical locations included in security inspection rounds?

Questions 127 through 164 Deal with Maintenance and Testing

Strategic

127. Is there a documented maintenance and testing program based on your business risk assessment model?
128. Is an audit process in place to ensure that this maintenance and testing program is being followed rigorously?
129. Does the program ensure that maintenance test results are benchmarked and used to update and improve the maintenance program?
130. Is there a program in place that ensures periodic evaluation of possible equipment replacement?
131. Is there a process in place that ensures that the spare parts inventory is updated when new equipment is installed or other changes are made to the facility?
132. Have you evaluated the impact of loss of power in your institution and other institutions because of interdependencies?
133. Has your facility developed standard operating procedures (SOPs), emergency action procedures (EAPs), and alarm response procedures (ARPs)?
134. Are the SOPs, EAPs, and ARPs readily available and current?
135. Is your staff familiar with the SOPs, EAPs, and ARPs?
Planning

136. Does the system design provide redundancy so all critical equipment can be maintained without a shutdown if required?
137. Are there adequate work control procedures, and is there a change management process to prevent mistakes when work is done on critical systems and equipment?
138. Are short circuit and coordination studies up to date?
139. Do you have a service level agreement (SLA) with your facilities service providers and contractors?
140. Is there a change management process that communicates maintenance, testing, and repair activities to both end users and business lines?
141. Do you have standard operating procedures to govern routine facilities functions?
142. Do you have emergency response and action plans developed for expected failure scenarios?
143. Have you prepared an emergency telephone contact list that includes key service providers and suppliers?

Safety

144. Is there a formal and active program for updating the safety manual?
145. Are electrical work procedures included in the safety manual?
146. Has an arc-flash study been performed?
147. Are specific PPE (personal protective equipment) requirements posted at each panel, switchgear, etc.?
148. Is there a program in place to ensure studies and PPE requirements are updated when system or utility supply changes are made?
149. Are workers trained regarding safety manual procedures?
150. Are hazardous areas identified on drawings?
151. Are hazardous areas physically identified in the facility?

Testing

152. Have protective devices been tested or checked to verify performance?
153. Are a site acceptance test (SAT) and a factory acceptance test (FAT) performed for major new equipment such as UPS systems and standby generators?
154. Is an annual "pull the plug" test performed to simulate a utility outage and ensure that the infrastructure performs as designed?
155. Is an annual performance and recertification test conducted on key infrastructure systems?
156. Is there a process in place that ensures that personnel have the proper instrumentation and that it is periodically calibrated?

Maintenance

157. Does the program identify all critical electrical equipment and components?
158. Is there a procedure in place that updates the program based on changes to plant equipment or processes?
159. Does a comprehensive plan exist for thermal infrared (IR) heat scans of critical components? Is an IR scanning test conducted before a scheduled shutdown?
160. Are adequate spare parts on hand for immediate repair and replacement?
161. Is your maintenance and testing program based on accepted industry guidelines such as NFPA 70B and on equipment supplier recommendations?
162. Do you incorporate reliability-centered maintenance (RCM) philosophy in your approach to maintenance and testing?
163. When maintenance and testing are performed, do you require preparation of and adherence to detailed work statements and methods of procedure (MOPs)?
164. Do you employ predictive maintenance techniques and programs such as vibration and oil analysis?

Questions 165 through 225 apply to Training and Documentation

Documentation

165. What emergency plans, if any, exist for the facility?
166. Where are emergency plans documented (including the relevant internal and external contacts for taking action)?
167. How are contacts reached in the event of an emergency?
168. How are plans audited and changed over time?
169. Do you have complete drawings, documentation, and technical specifications of your mission critical infrastructure, including: electrical utility, in-facility electrical systems (including power distribution and ATS), gas and steam utility, UPS/battery/generator, HVAC, security, and fire suppression?
170. What documentation, if any, exists to describe the layout, design, and equipment used in these systems?
171. How many forms does this documentation require?
172. How is the documentation stored?
173. Who has access to this documentation, and how do you control access?
174. How many people have access to this documentation?
175. How often does the infrastructure change?
176. Who is responsible for documenting change?
177. How is the information audited?
178. Can usage of facility documentation be audited?
179. Do you keep a historical record of changes to documentation?
180. Is a formal technical training program in place?
181. Is there a process in place that ensures that personnel have proper instrumentation and that the instrumentation is periodically calibrated?
182. Are accidents and near-miss incidents documented?
183. Is there a process in place that ensures action will be taken to update procedures following accidents or near-miss events?
184. How much space does your physical documentation occupy today?
185. How quickly can you access the existing documentation?
186. How do you control access to the documentation?
187. Can responsibility for changes to documentation be audited and tracked?
188. If a consultant is used to make changes to documentation, how are consultant deliverables tracked?
189. Is your organization able to prove what content was live at any given point in time, in the event such information is required for legal purposes?
190. Is your organization able to quickly provide information to legal authorities, including emergency response staff (e.g., fire, police)?
191. How are designs or other configuration changes to infrastructure approved or disapproved?
192. How are these approvals communicated to responsible staff?
193. Does workflow documentation exist for answering staff questions about what to do at each stage of documentation development?
194. In the case of multiple facilities, how is documentation from one facility transferred or made available to another?
195. What kind of reporting on facility infrastructure is required for management?
196. What kind of financial reporting is required in terms of facility infrastructure assets?
197. How are costs tracked for facility infrastructure assets?
198. Is facility infrastructure documentation duplicated in multiple locations for restoration in the event of loss?
199. How much time would it take to replace the documentation in the event of loss?
200. How do you track space utilization (including cable management) within the facility?
201. Do you use any change management methodology (e.g., ITIL) in the day-to-day configuration management of the facility?

Staff and Training

202. How many operations and maintenance staff do you have within the building?
203. How many of these staff do you consider to be the facilities subject matter experts?
204. How many staff members manage the operations of the building?
205. Are specific staff members responsible for specific portions of the building infrastructure?
206. What percentage of your building operations staff turns over annually?
207. How long has each of your operations and maintenance staff, on average, been in his or her position?
208. What kind of ongoing training, if any, do you provide for your operations and maintenance staff?
209. Do training records exist?
210. Is there a process in place to ensure that training records are maintained and updated?
211. Is there a process in place that identifies an arrangement for training?
212. Is there a process in place that ensures that the training program is periodically reviewed and identifies changes required?
213. Is the training you provide general training, or is it specific to an area of infrastructure within the facility?
214. How do you design changes to your facility systems?
215. Do you handle documentation management with separate staff, or do you consider it to be the responsibility of the staff making the change?

Network and Access

216. Do you have a secured network between your facility IT installations?
217. Is this network used for communications between your facility management staff?
218. Do you have an individual on your IT staff responsible for managing the security infrastructure for your data?
219. Do you have an online file repository?
220. If so, how is use of the repository monitored, logged, and audited?
221. How is data retrieved from the repository kept secure once it leaves the repository?
222. Is your file repository available through the public Internet?
223. Is your facilities documentation cataloged in a standard format to facilitate location of specific information?
224. What search capabilities, if any, are available on the documentation storage platform?
225. Does your facility documentation reference facility standards (e.g., electrical codes)? If so, how is this information kept up to date?
APPENDIX C

AIRFLOW MANAGEMENT: A SYSTEMS APPROACH
C.1 INTRODUCTION
In Chapters 11 and 12, we presented the foundations of data center cooling equipment and energy efficiency. That was a basic introduction, focused on components and efficiency. In this appendix, we analyze cooling efficiency from an integrated, system-level approach. As we will see, the system-level perspective will sometimes contradict some of the statements made in earlier sections of this book. The purpose of this appendix is to provide a different viewpoint on efficiency when considered at the system level, as well as to highlight some of the cooling technologies discussed earlier to achieve better airflow management.

In the 1980s, the power requirements of data centers were low enough that cooling was not considered a critical problem. The heat sources were not as concentrated as they are today, so airflow management was not a necessity. Over the years, server technology has made tremendous advances to keep up with increasing IT demands, but data center cooling strategies have been largely ignored. As power densities grew, however, so did the cooling problems, to the point where the overriding problem in today's data centers is effective thermal management.

The first attempt to manage airflow in a data center was introduced by Robert Sullivan at IBM in 1992. Sullivan proposed organizing the cabinets in a hot-aisle/cold-aisle (HACA) configuration, whereby cabinets would be placed front to front and rear to rear, with the front of
one cabinet never faces the rear of another cabinet. In this way, alternating rows of cold supply air and hot return air are created. When this strategy was introduced, cabinets supported loads of 250 watts/ft² or less (based upon the footprint of the cabinet, not the total square footage of the data center), and it proved an effective solution. Today's cabinets, however, are being asked to support loads of 4000 watts/ft² or more, 16 times the density of just 15 years ago.

In 2007, the U.S. Environmental Protection Agency (EPA) released a report to Congress highlighting the enormous overall energy consumption of our nation's data centers. In this report, the EPA warned that data center power consumption would double within the next five years and urged the nation to become more efficient in the design and operation of data centers; otherwise, it would be difficult for power companies to keep pace with demand. The study identified areas in which energy inefficiencies existed and presented areas where improvements could provide the largest impact. Airflow management was cited as one of the most critical components in achieving an energy-efficient data center, one that alone could provide up to 30% in energy savings, provided data center designers and operators understood how to manage it properly. Since the publication of this report, data centers have seen exponential growth in power density requirements to meet the information technology needs of the twenty-first century. This trend shows no signs of abating and, in fact, has outpaced some earlier projections. For many data center professionals, the HACA strategy along with some simple housekeeping initiatives is a good start.
In fact, most best-practice lists (see Chapter 12, Table 12.2) include using the HACA cabinet placement, closing all gaps between servers with blanking panels, removing all perforated tiles from the hot aisles and placing them more strategically in the room, and blocking any leakage of cold supply air into the room that is not meant to directly cool equipment. Surprisingly, though, there are still data centers today that do not use even this basic level of airflow management. Yet even with these measures, many of today's data centers can still have cooling problems, because they have surpassed the limits of the basic HACA design strategy. The real question then becomes: what technology or strategy best suits the needs of your data center?

There is no debate about the main goal of cooling in a data center: to provide server inlet temperatures within the range specified by the manufacturers, at the lowest possible cost. We need to be able to effectively control the server inlet temperatures. How to accomplish this task safely, robustly, and efficiently is a subject of active and passionate debate, and in what follows we outline a system perspective. We focus on airflow management and control, and provide information and insight into how best to achieve it for your data center.
C.2 CONTROL IS THE KEY
Ever since Juran and Deming transformed the industrial landscape with their groundbreaking work on quality control [1], the importance of control has spread across a
wider and wider cross section of industries. From management to factory-floor optimization to effective product design and development (six sigma and lean manufacturing), and now to the efficient design and operation of data centers, control is, as it should be, a key and indispensable principle. It is a long-overdue strategy for data centers. The data center industry is beginning to express the tacit and, more frequently, the explicit belief that temperature control is now an essential part of data center design, as illustrated by the following comments (emphasis added):

A data center with ideal airflow will deliver cold air at the same temperature to every piece of equipment. [2]

Airflow direction in the room has a profound effect on the cooling of computer rooms, where a major requirement is control of the air temperature at the computer inlets. [3]

Mechanical designers face the challenge of handling heat dissipation efficiently in a data center. ... The relationship between rack inlet temperature ... is analyzed, and used to demonstrate the effectiveness of a local control algorithm through a control simulation. [4]

In the high computer density data centers, populated with rack-mounted slim servers, one can pack 15 kW of power in a standard rack. The data center is now akin to a symphony hall with 150 people per seat. For such power density, energy balance alone in sizing air conditioning capacity in a data center does not suffice. One has to model the airflow and temperature distribution to assure inlet air temperature to systems. The data center is the computer. [5]

We modeled airflow ... used barriers and custom cabinets to control airflow and increase the air conditioning airflow efficiency. [6]

Cooling users are not confident that their current and planned systems will be able to cool high-density racks. This is primarily a problem relating to air distribution and mixing.
Data center cooling systems must provide the capability for greater control of airflow at the rack. [7]

So then, what is control and how do we achieve it? For data center applications, we define control as the ability to maintain differences in equipment inlet temperatures, as well as differences in AC return-air temperatures, to within a few degrees. For example, we may wish to have all server inlets at 68°F ±2°F and all AC return-air temperatures at 80°F ±3°F.

Another desirable feature is stable control. Stable control, or what we will also refer to as robust control, means that you can make modest changes to your data center and not lose control. For example, removing some blanking panels from cabinets, adding a new high-power server to a cabinet or a few high-density cabinets around the room, or performing routine maintenance on equipment should not cause the previously defined temperature differences to change significantly. If any of these changes impacts the server inlet temperatures significantly, then your control strategy is not stable. Unlike major modifications to a data center, which will always require adjustments to regain control, routine modifications should not.

To achieve control, today's data center designers and operators have to follow a different design and management philosophy than they did 10 or 15 years ago. Designers
and operators now have to approach cooling in a data center from a system perspective, where the primary goal is system-level control. They can no longer treat cooling as a series of distinct technologies that, when deployed per the respective manufacturers' specifications, will effectively and efficiently cool their data centers. Designers and operators have to become familiar with the interactions of the various technologies to create an optimal system-level approach to how their data center delivers cold air to the servers and then cools the return air. The main goal of the cooling system is to control the temperature of the air at the server inlets. This may not mean that every piece of cooling equipment operates at its peak efficiency level, but rather that the system as a whole operates at its best overall efficiency while delivering on its primary objective: controlling server inlet air temperature. The more control you can achieve, the more effectively you can run your data center and reap substantial energy savings and benefits.

How control is achieved depends upon all aspects of the cooling system. That is why the entire cooling system design, from the AC units down to the cabinets, servers, and the entire infrastructure, is critical. It is the whole cooling system that defines the level of control and, hence, the true cooling capabilities of the room. Later in this appendix, we will discuss ways of achieving control, but for now let us consider the benefits of control.
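The definition of control given above, every server inlet held within a band such as 68°F ±2°F and every AC return within 80°F ±3°F, can be expressed as a simple compliance check. This is an illustrative sketch, not a procedure from the book; the targets and tolerances are the example values quoted in the text.

```python
def in_control(readings_f, target_f, tolerance_f):
    """Return True if every temperature reading lies within
    target ± tolerance (all values in °F)."""
    return all(abs(t - target_f) <= tolerance_f for t in readings_f)

# Server inlet readings checked against 68°F ±2°F:
inlets = [67.1, 68.4, 69.5, 66.8]
inlets_ok = in_control(inlets, target_f=68.0, tolerance_f=2.0)   # True

# AC return-air readings checked against 80°F ±3°F; the 84.1°F
# unit is out of band, so the room has lost control:
returns = [78.2, 80.9, 84.1]
returns_ok = in_control(returns, target_f=80.0, tolerance_f=3.0)  # False
```

A check like this, run continuously against sensor data, is the kind of monitoring the control philosophy implies: control is defined by the spread of measured temperatures, not by the nameplate capacity of the cooling plant.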
C.2.1 Benefits of Control
Beyond the obvious benefit of providing a stable temperature environment, control provides numerous other benefits:

• Greater system efficiency (operational cost savings and better server performance)
• Greater server reliability and life
• Increased power density potential and scalability
• Design robustness and flexibility

Greater Efficiency. Temperature control at the equipment inlet level allows a data center to run more efficiently. This capability gives designers two different options. For example, in a controlled data center it is possible to run the CRAHs using 55°F chilled water rather than 40°F or 45°F, which allows the cooling system to run more efficiently. Alternatively, or in addition to the raised chilled-water temperature, the CRAHs do not have to run at full capacity to achieve server inlet temperatures within manufacturer specifications. Energy cost savings can be achieved by running at half the CRAH tonnage. This further reduces the total cost of ownership (TCO) of the data center.

Greater Reliability. A thermally balanced and controlled data center means that the server inlet temperatures are all within design specifications, so server life will be extended. Overheated servers can slow throughput and affect system reliability:
In fact, the failure rate in the top third of servers (in the cabinet) is typically three times that of the bottom two-thirds. [8]

The impact of high heat is severe: for every 1-degree rise in temperature, server failure rates increase 2-3 percent. [9]

Increased Power Density and Scalability Potential. Control also enables data center designers and operators to fill their cabinets with blade servers without impacting performance and reliability. Of course, they must have enough cooling capacity to dissipate the heat to begin with, or at least the ability to add more capacity. This can have a significant impact on data centers where either real estate or construction costs are high.

Design Robustness and Flexibility. A thermally balanced and controlled data center allows designers to downsize the AC units, increase the chilled-water temperature if applicable, or increase the server density, and still maintain appropriate server inlet temperatures. Furthermore, data center operators can make minor modifications to the configuration of the data center without fear of driving servers into thermal failure.
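To give a feel for the efficiency benefit of raising the chilled-water setpoint (from 45°F to the 55°F that a controlled room permits), the sketch below applies a common industry rule of thumb of roughly 2% chiller energy savings per °F. That figure is an assumption for illustration, not a number from this book; actual savings depend on the specific chiller plant.

```python
def chiller_energy_savings(base_setpoint_f, raised_setpoint_f,
                           savings_per_degree=0.02):
    """Rough fractional chiller energy savings from raising the
    chilled-water setpoint, using an assumed rule of thumb of
    ~2% per °F.  Illustrative only; verify against your plant."""
    delta_f = raised_setpoint_f - base_setpoint_f
    return delta_f * savings_per_degree

# Raising chilled water from 45°F to 55°F:
savings = chiller_energy_savings(45, 55)   # ~0.20, i.e. roughly 20%
```

Even under a conservative per-degree figure, the cumulative effect of a 10°F setpoint increase is substantial, which is why control (and the thermal headroom it creates) translates directly into operating cost.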
C.3 OBTAINING CONTROL
To achieve control, we start with one basic premise: the closer the control is to the heat source, the better the control provided by the cooling system. This way, system-wide changes can be accommodated locally and have less of an impact throughout the rest of the room. For example, the loss of a CRAH or an imbalance in plenum floor pressure can be compensated for if local control features are available. Also, many servers already have built-in control in the form of variable-speed fans, but this can be underutilized or even wasted if the rest of the system does not provide the same level of control.

A direct corollary of this premise is that cooling solutions that cause large temperature variations within a room, as is typical in the HACA design, are naturally more susceptible to unstable temperature equilibrium conditions in the dynamic data center environment, and are consequently less capable of providing control. Cooling solutions that create a more uniform room temperature distribution, on the other hand, are far less vulnerable to dramatic, localized temperature changes, since they are already close to thermal equilibrium. The better the thermal balance, the better the control. This has been validated by a recent study at Stony Brook University, and will be discussed in detail in Section C.4.
C.3.1 Lower Return Air ΔT Versus Higher ΔT
The means for achieving thermal balance can cause a heated debate. Oftentimes, the discussion leads to whether or not you should mix cold supply air with hot return air.
To mix or not to mix: that is the question. Earlier in this appendix, we stated that mixing the return air with the supply air could lead to inefficiencies as well as potential problems. In this section, we show that this may not always be the case. For many professionals, however, mixing cold supply air with hot return air is a settled question, and in their opinion it should be avoided at all costs. In fact, many have argued that not mixing the supply air with the return air is a best-practice approach to managing airflow in a data center. For data centers that do not employ any control strategies, this can be an effective first step. Since many data centers operate with a supply-to-return air differential (ΔT) of around 10°F, increasing this to 20°F can have a significant impact on the cooling efficiency of your AC units. However, we intend to show that the question is not as easy or as straightforward as some would have you believe.

The physics tells you how much cooling you will need to cool a room to a particular average temperature. How you distribute that cooling is what determines the temperature spread (ΔT) in the room. From the vantage point of physics, the only constraint is that the average room temperature is a fixed value. This can be achieved by an infinite number of overall room temperature distributions, with either large or small room temperature variations; that is, the ΔT of the room can be as large or as small as you want, and you can still satisfy the physical constraint of a fixed average temperature.

As we have stated previously, the goal of many designers is to try to maximize the room ΔT. They do this so that the temperature in the region in front of the server inlets is as cold as possible, while the hot exhaust region behind the cabinets is as hot as possible.
The extremes in temperature, however, create an unstable thermal environment in which small changes in the operation or configuration of the data center can lead to large temperature swings in locations of the room where consistent temperatures are required. This frequently leads to unacceptably high server inlet temperatures.

The designer further tries to maximize the input temperature (return air) to the AC units, since this leads to greater AC efficiency. Although it is true that maximizing the return-air temperature leads to greater AC efficiency, it is only true up to the point at which the capacity of the unit is reached. If you exceed this limit, the AC unit will not be able to cool the return air down to the desired supply-air temperature. Furthermore, the risk of raising the return-air temperature too high when you are operating at the cooling capacity of your data center is that you lose your redundancy. Since all the AC units are operating at capacity, if one unit fails or must be shut down for service, the entire system will fail. Without thermal balance, a sudden loss of an AC unit can have a major impact on the equipment in the area serviced by that unit. In addition, if the AC units were already operating at or near capacity and there is no control in the area in which the temperature will rise, there will be significant problems with equipment inlet temperatures.

If, on the other hand, you raise only some AC return temperatures and not others (that is, not uniformly), then the AC units pushed beyond their capacity will start emitting warmer supply air, since they do not have the cooling capacity to sustain the lower supply temperature. These AC units are carrying the entire thermal
load, while others are hardly utilized. If the return-air temperature to an AC unit is too low, these underutilized units will not cool the air; they simply return it at room temperature right back into the supply-air channel. This in turn causes problems with the supply air to other parts of the room. So, whereas the philosophy of a high ΔT may make sense at the AC component level, it can create tremendous problems at the system (data center) level.

The high-ΔT design philosophy leads the data center designer toward, at best, an unstable temperature solution. The design will probably work within the original design parameters, but slight changes or later modifications to the data center can lead to dramatic changes in localized temperature values, not to mention the impact of overstressing certain AC units and reducing their efficiency and life. The control approach, on the other hand, is to target a return-air temperature and, just as importantly, to provide a temperature band around this target, so that all the AC units are cooling approximately the same load. The primary goal is to achieve thermal balance in the room. This is why we suggest that, contrary to popular belief, controlling ΔT is more critical than simply maximizing ΔT. A controlled ΔT ensures that small perturbations in airflow, due to slight changes in server power density or the removal or reconfiguration of floor tiles, do not cause large unpredictable changes in local room temperatures. In a thermally balanced (thermally controlled) room, small perturbations in airflow or plenum floor pressure do not lead to catastrophic problems such as recirculation, hot spots, or backflow. Thus, the more control you have over the variations in the entire room temperature, the more effective the design and the better the room performs.
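The coupling between ΔT and airflow can be made concrete with the sensible-heat relation used later in this appendix (BTU/hr = 1.08 × CFM × Δt). The sketch below is illustrative; the 100 kW load is a hypothetical example, and 3412 BTU/hr per kW is the standard unit conversion.

```python
def required_cfm(load_kw, delta_t_f):
    """Airflow (CFM) needed to remove a sensible heat load at a
    given supply-to-return temperature difference, from the
    relation BTU/hr = 1.08 x CFM x dT.  (1 kW = 3412 BTU/hr.)"""
    btu_per_hr = load_kw * 3412
    return btu_per_hr / (1.08 * delta_t_f)

# A hypothetical 100 kW room: doubling dT from 10°F to 20°F
# halves the airflow the AC units must move, but only if every
# unit can actually sustain the higher return temperature.
cfm_at_10 = required_cfm(100, 10)   # ~31,593 CFM
cfm_at_20 = required_cfm(100, 20)   # ~15,796 CFM
```

This is the quantitative heart of the high-ΔT argument: larger ΔT means less air to move. The discussion above shows why chasing that benefit without controlling ΔT uniformly across all units leads to instability.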
This is why in some cases the HACA design approach can lead only to temporarily acceptable but unstable room temperature distributions, which can easily go out of balance, yielding undesirable results. If you design a system with no mixing between the supply air and return air, then your system must exactly balance the CFM from the supply to the return. If the servers ramp up and introduce more CFM of warm exhaust air into the return channel, or ramp down when underutilized and provide less CFM to the room, then the AC units must be able to adjust their CFM to match either change instantaneously. If not, you create an airflow imbalance, and the system will decide for you how to obtain the needed air or reject the extra air. This can cause significant and unexpected problems. The underlying problem is that the heat loads are separated from the cooling loads. The two are often loosely coupled, and to make up for this, the system frequently has to overcool some equipment to ensure that all the equipment meets recommended manufacturer specifications.

Another issue is that the supply air must not only be cooled but must also satisfy a humidity condition. The relative humidity of the supply air should be between 40% and 60%. If the humidity is too low, you will have problems with static charge buildup. If it is too high, you will have condensation problems that could cause a whole host of water-related problems.

Thus, we recommend the managed mixing of the cold supply air with the hot exhaust air to achieve a more uniform temperature distribution in the data center. This contradicts what has been said earlier in this book, as well as what you see throughout
the literature. However, managed mixing actually contributes to control, and while it does slightly reduce the efficiency of the individual AC units, the control it provides allows you to increase the overall efficiency of your entire data center. This is the difference between a system-level strategy and a component-level strategy. The reason people argue against mixing is that the kind of mixing they usually mean is unmanaged mixing: mixing that takes place by accident. Mixing does reduce the efficiency of the individual AC unit, which makes many people opposed to it. Rather than thinking at the system level, they are thinking at the component level.

Thus, the best advice for any data center manager is to first know your data center. Not all data centers are created equal, and not every data center should deploy the same airflow management strategy. The goal of any airflow management approach is the same: to effectively control server inlet temperatures. The air management system you use to accomplish this can vary depending upon your needs and constraints. However, a thermally balanced and controlled room will go a long way toward providing the most robust and efficient air management solution.

So when you hear that you should make the return-air temperature as high as possible, first understand your data center. If your AC units are seeing return-air temperatures just at or near their set points, then you can indeed raise the return-air temperature. If some of your AC units are cooling return air above their set points while others are below, then you need to thermally balance the room. Once you have a thermally balanced room, you can uniformly raise your AC set points to raise the return-air temperature, but in a controlled way. This way, you can obtain higher AC efficiency without compromising the temperature control in the room.

Put another way, the problem is fairly straightforward.
For every kilowatt-hour of heat energy produced by the equipment, you need roughly a kilowatt-hour of cooling energy to remove the load. If you provide too much cooling energy, the supply temperatures will be unnecessarily cold and you will waste energy. If you supply too little, the supply temperatures will be too warm and you will not effectively cool the equipment, which could lead to either premature failure or reduced processing speeds. Neither possibility is acceptable.

You also need to be aware that it takes more than 1 kW of cooling capacity to cool a 1 kW load, because you have to cool the heat load and also control the humidity of the air. Thus, 1 kWh of heat energy will require more than 1 kWh of cooling energy, since the cooling energy required comes in two different forms. The first is sensible cooling: the cooling required to lower the ambient air temperature. The other is latent cooling: the cooling energy used to change the relative humidity of the air. The exhaust air from the equipment typically has lower humidity than the supply air, so energy is needed to replace the water lost. If this does not occur, the relative humidity of the data center will continue to drop and static charge buildup will eventually become a significant problem. Most precision cooling systems strive to utilize 95% of their capacity for sensible cooling, but some can use as little as 70%. You need to check your cooling equipment specifications and actually test the environment to verify the results. You need to be very knowledgeable about all aspects of your data center.
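The point that 1 kW of load needs more than 1 kW of nameplate capacity can be quantified with the sensible fraction of a unit's capacity. The sketch below is illustrative, using the 95% and 70% sensible fractions quoted above; the 100 kW IT load is a hypothetical example.

```python
def nominal_capacity_needed(it_load_kw, sensible_fraction):
    """Total (nameplate) cooling capacity needed so that the
    sensible portion alone covers the IT heat load."""
    return it_load_kw / sensible_fraction

# Covering a hypothetical 100 kW IT load:
high_shr = nominal_capacity_needed(100, 0.95)  # ~105.3 kW nameplate
low_shr = nominal_capacity_needed(100, 0.70)   # ~142.9 kW nameplate
```

At a 70% sensible fraction, nearly 43% more nameplate capacity is needed than the IT load itself, which is why knowing your actual sensible-to-latent split matters so much.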
Every data center operator should know:

• The heat load of the room and the cooling capacity of the cooling equipment
• The percentage breakdown of sensible to latent cooling in the room
• The temperature and humidity set point of each AC unit
• The return-air temperature at each AC unit
• The underfloor pressure distribution in the entire data center. The pressure is not constant throughout the data center, and you need to assess it to see where you may have problem areas with insufficient pressure to furnish adequate supply air to your equipment.
• The power density distribution throughout the data center. Identify high-density regions.
• The inlet and exhaust temperatures of your equipment

Every data center operator should also know the total cooling capacity of the data center. This requires knowledge of your sensible cooling capacity (the cooling energy required to lower the temperature of the room) and your latent cooling capacity (the cooling energy required to rehumidify the air). The total capacity is the sum of the two:

Total cooling capacity = Sensible cooling + Latent cooling

In BTUs per hour, the total capacity is given by the formula

BTU/hr = 4.5 × CFM × Δh

where CFM is the cubic feet of air per minute being generated by the AC units and Δh is the change in heat content (enthalpy) in BTU per pound of dry air. The factor 4.5 is obtained for total heat as follows:

4.5 lb·min/(hr·ft³) = 60 min/hr × 0.075 lb/ft³

The sensible (temperature) cooling capacity is given by the formula

BTU/hr = 1.08 × CFM × Δt

where Δt is the change in temperature (supply versus return) in °F. The factor 1.08 is obtained for sensible heat as follows:

1.08 BTU·min/(hr·ft³·°F) = 60 min/hr × 0.075 lb/ft³ × 0.24 BTU/(lb·°F)

The latent (humidity) cooling capacity is given by the formula

BTU/hr = 4840 × CFM × Δw

where Δw is the change in humidity ratio (pounds of moisture per pound of dry air). The factor 4840 is obtained for latent heat as follows:

4840 BTU·min/(hr·ft³) ≈ 60 min/hr × 0.075 lb/ft³ × 1076 BTU/lb
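As a rough worked example, the three capacity formulas can be collected into a small calculator. This is an illustrative sketch; the CFM, temperature, and humidity-ratio values below are hypothetical, not measurements from any particular facility.

```python
def total_cooling_btu_hr(cfm, delta_h):
    """Total capacity: BTU/hr = 4.5 x CFM x dh,
    where dh is the enthalpy change in BTU per lb of dry air."""
    return 4.5 * cfm * delta_h

def sensible_cooling_btu_hr(cfm, delta_t):
    """Sensible capacity: BTU/hr = 1.08 x CFM x dt, dt in °F."""
    return 1.08 * cfm * delta_t

def latent_cooling_btu_hr(cfm, delta_w):
    """Latent capacity: BTU/hr = 4840 x CFM x dw, where dw is
    the humidity-ratio change (lb moisture per lb dry air)."""
    return 4840 * cfm * delta_w

# A hypothetical CRAH moving 10,000 CFM with a 15°F drop and a
# small humidity-ratio change across the coil:
sensible = sensible_cooling_btu_hr(10_000, 15)    # ~162,000 BTU/hr
latent = latent_cooling_btu_hr(10_000, 0.0005)    # ~24,200 BTU/hr
shr = sensible / (sensible + latent)              # ~0.87 sensible ratio
```

With a few temperature and humidity measurements, the same three functions let you estimate the sensible-to-latent split in your own room, which is exactly the assessment the following paragraphs call for.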
The sensible cooling capacity will tell you the maximum load your room can support. Typically, AC units specify the tonnage or kilowatts of total cooling capacity. You also need to know what percentage of this total is used for sensible cooling and what percentage for latent cooling in your data center. In a residential setting, about 30% of the cooling load is used for latent cooling, whereas in a data center this can be as little as 5%. Where does your data center stand? Using the formulas above, with a few simple temperature and humidity measurements and basic calculations, you can estimate the distribution of sensible and latent cooling in your data center. This tells you the true amount of cooling capacity you have available for the equipment.

If you have deployed a cooling strategy that requires a raised floor to carry the supply air to your equipment, you need to know the temperature and pressure distributions beneath the floor. These will tell you whether you are capable of providing the needed supply air to the heated equipment. You also need to know the specifications of your air delivery system. Is it a computer room air handling (CRAH) system with a separate water chiller plant, or does it use computer room air conditioning (CRAC) units, in-row, or overhead air conditioners? What are the specifications of each? What are the set points? What is the CFM produced by the distribution fan? What is the supply temperature? What is the return temperature? What are their cooling capacities in tons or kW? Once this is known, you can begin to effectively and efficiently manage the cooling in your data center.

The above parameters are needed to assess the control level of your data center. You know you have achieved control if:

• The inlet temperature of your equipment is "tightly" grouped around a fixed value that is within the manufacturers' specifications.
Typically, this is within the ASHRAE standard range of 60°F to 81.4°F, although, if you let your equipment rise to the upper ASHRAE limit, you may be negatively impacting the performance of the equipment, since performance (speed) is reduced at higher temperatures. (Verify the design specifications of your equipment with your manufacturer.)
• Your AC units' return-air temperatures are also "tightly" grouped around a fixed value. This guarantees that all units are cooling the same load and your data center is balanced. Having high-density areas in the data center without treating the exhaust-air temperature can lead to certain AC units operating at or near capacity. Although this is efficient for that particular AC unit, it puts your data center at risk if that unit should fail: you have no redundancy in this region, even if you have redundancy in another part of the room.
• Slight modifications in the configuration of the room do not have a major impact on either the equipment inlet temperatures or the AC return-air temperatures.

To obtain control, a variety of new technologies have been introduced. In the next section we discuss these technologies along with their control capabilities.
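The balance criterion, tightly grouped AC return-air temperatures, can be sketched as a simple spread check. The 3°F threshold below is an illustrative choice, not a published standard; pick a value appropriate to your own control targets.

```python
def thermally_balanced(return_temps_f, max_spread_f=3.0):
    """A room is 'balanced' when all AC return-air temperatures
    are tightly grouped: here, when the spread between the
    hottest and coldest return stays within max_spread_f (°F)."""
    return (max(return_temps_f) - min(return_temps_f)) <= max_spread_f

# Balanced room: every unit is cooling roughly the same load.
thermally_balanced([79.5, 80.2, 81.0, 80.6])    # True

# Unbalanced room: one unit carries far more of the load, so it
# is at risk of running out of capacity (and of redundancy).
thermally_balanced([76.0, 80.2, 88.5, 80.6])    # False
```

A check like this flags the situation described above, where one AC unit near a high-density area runs at capacity while others are hardly utilized.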
C.4 AIR MANAGEMENT TECHNOLOGIES
Cooling a data center involves several key components. First are the servers, which contain their own fans to cool their internal electronics. The servers need supply air at around 68°F, and the AC system must be able to deliver sufficient supply air, on demand and as needed by the servers. To accomplish this, air is sometimes transported over great distances to deliver cooled and humidified air to the equipment. The heated, drier air must then be transported back to the AC unit so that it can be cooled, humidified, and sent back out as fresh supply air.

To attain the level of control needed to efficiently cool your data center, a number of new cooling technologies have been developed, each providing the potential for a different level of control. Each technology, however, has pros and cons associated with it; depending upon your data center, one may be better suited for you than another. In what follows, we consider the most widely used cooling strategies (see Chapter 12) along with their advantages and disadvantages. We discuss these technologies from the viewpoint of deploying them in an existing data center, but everything applies to designing a new data center as well. At the outset, however, we would like to point out that the task of effectively and efficiently cooling a data center is quite complicated, as there are many interactions and unexpected outcomes you have to be aware of and account for.

The newer cooling technologies can be placed into three broad categories: air containment strategies, localized AC strategies, and active air management approaches, as well as hybrids of these. We will look at the three main groups and analyze their performance with respect to system-level control capabilities. All of the technologies will be compared against the HACA strategy, which we use as a baseline.

APC introduced in-row AC units to provide localized cooling to high-density cabinets.
They have also combined this with both hot- and cold-aisle containment. Liebert
also offers a local cooling strategy based on strategically placed overhead AC units (Liebert-XDOs) to provide sufficient cool air to high-density cabinets. Knurr developed CoolFlex®, which essentially involves constructing a room around the cold aisle to segregate the cold supply air from the hot return air. AFCO developed Kool-IT® as an active-air solution to effectively manage the airflow in the room.
C.4.1 In-Row Cooling
The benefit of in-row cooling is that you can serve high-density areas in the data center locally and not overstress AC units located in those areas. The problem with this technology, in terms of control, is that the control is not robust: if your server deployment strategy changes, your in-row cooling deployment strategy must also change. To mitigate this problem, APC has combined the approach with either hot- or cold-aisle containment. This significantly reduces the problem associated with large per-cabinet variations in loading. It does, however, create a new problem. In any containment strategy, you are closely coupling your servers to the AC units. Ordinarily, from a system-level control perspective, this would be advantageous, but if the coupling is too tight, it can cause problems. The main difficulty is that the volume of air moved by the servers must now precisely equal the volume of air moved by the AC units. Otherwise, an airflow imbalance will occur and either force the fans to work harder, draw less air, or redistribute warm air to unwanted places.
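The airflow-matching requirement in a containment strategy can be sketched as a simple balance check; the CFM figures below are hypothetical, chosen to show what happens when server fans ramp up while the AC airflow stays fixed.

```python
def airflow_imbalance(server_cfm_total, ac_cfm_total):
    """Signed airflow imbalance in a contained aisle: positive
    means the servers demand more air than the AC units supply
    (risking recirculation of hot air); negative means the AC
    units are oversupplying."""
    return server_cfm_total - ac_cfm_total

# Server fans ramp from 12,000 to 14,000 CFM under load while
# the AC units stay fixed at 12,500 CFM:
before = airflow_imbalance(12_000, 12_500)   # -500: slight oversupply
after = airflow_imbalance(14_000, 12_500)    # 1500: undersupply
```

In a contained aisle, that 1500 CFM shortfall has to come from somewhere; unless the AC fans can track the change, the system will "decide for you" where the air comes from, exactly the failure mode described above.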
C.4.2
Overhead Cooling
Similar to in-row cooling, overhead cooling provides cold air to local areas. It tries to achieve control by moving the supply source closer to the equipment. This does help to reduce the occurrence of local hot spots, but if the server usage changes, even just a little, the AC units cannot effectively adapt to the changes unless they are deployed throughout the data center and controlled as a system, and this could be cost-prohibitive. Also, local units blowing high volumes of air can actually cause recirculation problems that may go unnoticed. Both types of localized cooling strategies do not enable an effective system-level control strategy. They simply provide a short-term fix to cooling high-density areas, without effectively managing the airflow in the data center.
C.4.3
Containment Strategies
Containment strategies, both hot and cold, are very effective control strategies. These strategies try to link the control of the AC units directly to the control in the equipment by eliminating the control gap between the two. They can be very effective at increasing the efficiency of the AC units, since the supply and return air are segregated, as well as providing the necessary cool air to the server inlets. The problem is that the control is not robust and is highly dependent upon the flow rates of the servers and the AC units. Directly coupling the servers to the AC units can cause problems if the control mechanisms of both are not synchronized. For example, as some variable speed
server fans begin to ramp up due to increased demand, the AC units would have to adjust their airflow by the same amount to accommodate these changes. If this does not occur, the server fans may slow down due to pressure imbalances, or the airflow imbalance created will decide for you how the system either obtains the needed air or rejects the extra air. This can cause significant and unexpected problems. Furthermore, if blanking panels are removed or containment doors are not properly closed, unwanted air can make its way into the server inlets. Thus, routine maintenance has to be performed with care, and superior housekeeping practices need to be established and enforced. If the control of the AC fans could be linked to the server and equipment fans in a synchronized way, this would provide an excellent system-level control strategy. However, this is not possible with currently available products.
C.4.4
Active-Air Management
Active-air technology acts on the philosophy that to achieve control you must coordinate the several levels of control already designed into the cooling components of the data center. The control already designed into the servers should be complemented by cabinet control. The cabinets should provide a cold inlet temperature and moderate the hot exhaust air to reduce the potential for any hot spots to form in the room. The cabinet controls then need to work well with the AC units and their control features.

AFCO Systems introduced active-airflow management technology in 1997 as a fully engineered control solution. Active-airflow management technology bridges the control gap between the servers and the AC units and provides a well-integrated control solution. It utilizes the control features built in by the server manufacturers, as well as those built in by the air-conditioning manufacturers, and creates a balanced, controlled room.

Active-airflow management accomplishes this by providing some cooling directly to the exhaust air that leaves the servers, to reduce the thermal gradient at the source. In addition, it directs the cooler air directly to the server inlet ports to provide effective cooling at the server inlet. To account for localized temperature imbalances due to server load imbalances, the cabinets have adjustable fan speeds to automatically balance the cooling locally. This also provides more exhaust-air cooling at the higher loads, which again keeps the thermal gradients low and, consequently, allows the server inlet temperatures to be controlled. The partly cooled server exhaust air is emitted from the top of the cabinet and then moves to the AC units, which now see a more uniform ΔT. This allows them to function more effectively and efficiently, since they can all equally share the load and their inlet temperatures are controlled. This also reduces short-cycling of their fans, which adds to efficiency and improved life.
The benefits of active-air solutions are that you have created a robust system-level control that allows you to effectively manage your cooling in the data center. The overriding belief of the proponents of this technology is that to effectively and consistently meet the manufacturers' server-inlet temperature requirements, you must be able to control the temperature in the entire room, and how you deploy the cooling in a data center is critical in establishing this control. In addition, a well-balanced and controlled
room, with cabinets acting as an extension of the AC units, will enable the latent, cooler thermal mass in the plenum floor to continue cooling the equipment even after power to the ACs has failed, since the cabinet fans are connected to UPS power. This buys the data center operator valuable minutes in getting the cooling system back online before any permanent damage to the servers can occur. The main problem with this technology is the introduction of additional fans in the cabinets. The fans increase energy usage (up to approximately 200 W per cabinet when operating at full capacity), increase the heat being generated in the room, and they become another possible point of failure. The extra energy used for the cabinet fans is offset by the energy saved by providing a high level of control. Thus, the system can be more efficient, even though the new fans add a level of inefficiency. To mitigate the point-of-failure risk, the fans are high-performance fans and if individual fans fail they will not shut down the system. Also, should the entire fan tray fail, the cabinet sends audible alarms and still utilizes the plenum and room air to cool the servers; it just does not provide controlled mixing of the exhaust air.
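The energy penalty of the cabinet fans is easy to bound. The sketch below uses the approximately 200 W per-cabinet figure from the text; the fleet size and electricity rate are illustrative assumptions:

```python
def annual_fan_cost(cabinets, watts_per_cabinet=200, rate_per_kwh=0.12):
    """Estimate yearly energy use (kWh) and cost of cabinet fans
    running at full capacity around the clock.

    watts_per_cabinet follows the ~200 W figure cited in the text;
    rate_per_kwh is an assumed electricity price.
    """
    kwh = cabinets * watts_per_cabinet * 8760 / 1000   # 8760 hours/year
    return kwh, kwh * rate_per_kwh

# Hypothetical 50-cabinet room, worst case (fans at full speed all year)
kwh, cost = annual_fan_cost(cabinets=50)
print(f"{kwh:,.0f} kWh/yr, about ${cost:,.0f}/yr")   # 87,600 kWh/yr, about $10,512/yr
```

This worst-case overhead is what the control-driven savings (higher chilled-water temperatures, reduced AC fan short-cycling) must offset for the net efficiency gain the text claims.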
C.4.5
A Benchmark Study for Comparison
A study performed at Stony Brook University looked into the various technologies that are being used in data centers, with the following results [10]. Table C.1 shows the average maximum server inlet temperature for all the servers in the room at various load densities. This table suggests that active-airflow management, in-row cooling with hot-aisle containment, cold-aisle containment, and overhead cooling all meet the ASHRAE requirements across all loading densities. What you do not see from this data is the spread of the temperatures about this average value. This is what really tells you how much control you have. Thus, the study also looked at the spread of temperatures about the average or mean value and obtained the results shown in Table C.2. The standard deviation is a statistical measure of the spread about the mean. The larger the standard deviation, the greater the spread; the greater the spread, the lower the control. If the temperatures are distributed normally about the mean, then plus or minus three standard deviations would capture 99.7% of the servers, that is, 99.7% of
Table C.1. Average maximum inlet temperature per load (average per-cabinet load)

                                               Average maximum inlet temperature
Technology                                     4 kW        8 kW        16 kW
Hot aisle/cold aisle                           70.45°F     85.60°F     99.22°F
Active-airflow management                      69.50°F     69.52°F     58.26°F
In-row cooling with hot-aisle containment      66.27°F     64.28°F     58.33°F
Cold-aisle containment                         63.89°F     63.41°F     54.65°F
Overhead cooling                               68.01°F     66.16°F     59.88°F
Cabinets ducted to hot-air plenum              77.87°F     77.28°F     85.72°F
Table C.2. Temperature standard deviation about the mean value per load (average per-cabinet load)

                                               Temperature standard deviation
Technology                                     4 kW        8 kW        16 kW
Hot aisle/cold aisle                           6.35°F      4.37°F      5.93°F
Active-airflow management                      0.66°F      2.04°F      1.21°F
In-row cooling with hot-aisle containment      6.39°F      4.49°F      5.05°F
Cold-aisle containment                         7.01°F      4.93°F      1.83°F
Overhead cooling                               5.13°F      4.17°F      4.71°F
Cabinets ducted to hot-air plenum              4.60°F      5.81°F      11.91°F
the servers would be between the mean minus three standard deviations and the mean plus three standard deviations.

If we look at the standard deviation table for the technologies considered, we see that the active-air technology has the smallest standard deviation across all the load densities. This means that this equipment has the most control of the technologies tested. At an average of 16 kW per cabinet, 99.7% (actually 100% in the study) of the servers in the active-airflow management cabinet had their inlet temperatures between 58.26°F - 3(1.21°F) and 58.26°F + 3(1.21°F), or 54.63°F and 61.89°F. This was the best result among the technologies studied, but all of them should be considered for specific data center needs. It should be noted that the software used to carry out the study could not easily duplicate active-airflow management's variable-speed fans; thus, the level of control shown here would actually improve if variable-speed fans could be modeled. This may account for the sudden jump in the standard deviation at the 8 kW level in the study.

A summary of these ranges (spread in temperatures based on the standard deviation) for all the technologies is presented in Table C.3. Again, the smaller the spread, the better the control. The tight control offered by active-airflow management will enable a more effective air-management strategy to be deployed compared to the other technologies. This control was achieved by managed mixing of the cold supply air with the hot server exhaust air to control the temperature
Table C.3. Temperature spread based on standard deviation

Technology                                     Inlet temperature spread (±3 standard deviations)
Hot aisle/cold aisle                           35.58°F
Active-airflow management                      7.26°F
In-row cooling with hot-aisle containment      30.30°F
Cold-aisle containment                         10.98°F
Overhead cooling                               28.26°F
Cabinets ducted to hot-air plenum              71.46°F
ΔT. With a controlled ΔT, you could raise the water temperature of the chilled-water plant or the supply temperature of the AC unit and not impact the integrity of the equipment. Thus, you could realize significant energy savings.

From Table C.3, we see that the cold-aisle containment (CAC) approach also yields good control, with a temperature spread of only 10.98°F. The study did find a difference, though, between the active-airflow management approach and the CAC approach. The active-airflow management technology maintained its tight control even with changes to the data center environment, whereas the CAC was more susceptible to changes in the data center. In particular, if blanking panels were removed, the active-airflow management technology experienced less than a one-degree change in inlet temperatures, whereas the CAC change was closer to four degrees.
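The arithmetic behind these tables is straightforward: the ±3-sigma band is the mean plus or minus three standard deviations, and the spread in Table C.3 is six standard deviations, taken from the 16 kW column of Table C.2. A quick sketch in Python (the function and dictionary names are ours, not from the study):

```python
from statistics import NormalDist

def inlet_band(mean_f, sigma_f, k=3):
    """Return the mean ± k-sigma inlet-temperature band (°F)."""
    return mean_f - k * sigma_f, mean_f + k * sigma_f

# Active-airflow management at 16 kW per cabinet (Tables C.1 and C.2)
low, high = inlet_band(58.26, 1.21)
print(f"{low:.2f}-{high:.2f} F")                              # 54.63-61.89 F

# Fraction of a normal population falling within ±3 sigma
print(round(NormalDist().cdf(3) - NormalDist().cdf(-3), 4))   # 0.9973

# Table C.3 spreads are six standard deviations (±3 sigma) at 16 kW
sigma_16kw = {
    "Hot aisle/cold aisle": 5.93,
    "Active-airflow management": 1.21,
    "In-row cooling with hot-aisle containment": 5.05,
    "Cold-aisle containment": 1.83,
    "Overhead cooling": 4.71,
    "Cabinets ducted to hot-air plenum": 11.91,
}
spread = {tech: round(6 * s, 2) for tech, s in sigma_16kw.items()}
print(spread["Hot aisle/cold aisle"])        # 35.58
print(spread["Active-airflow management"])   # 7.26
```

Reproducing the table values this way makes it easy to recompute the spread as new per-load measurements come in from your own monitoring.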
C.5
CONCLUSION
The main point to take away is that a data center is a complex system with many interacting variables. You first need to know your data center and then strive for system efficiency, not component efficiency. You should do this by choosing the technology that works best with your system. This means understanding your cooling system and AC units: their supply temperatures, their return temperatures, their temperature set points, their relative humidity, their fan CFM, their capacity in tonnage or BTUs (both latent and sensible), whether or not you have a plenum, and the under-floor pressure distribution. On the IT side, you need to know the equipment: its rated and actual wattages, internal temperature/fan CFM settings, and load distributions throughout the room and throughout the day. Once all of this information is known and monitored, you can then work toward controlling your room and using this control to increase efficiency without jeopardizing your equipment. The airflow management approach you choose will determine how to modify your data center to meet the control criteria specified in Section C.4:

• The inlet temperature of your equipment is tightly grouped around a fixed value that is within the manufacturers' specifications.

• Your AC units' return air temperatures are also tightly grouped around a fixed value.

• Slight modifications in the configuration of the room do not have a major impact on the server inlet temperatures and the AC return air temperatures.

If you choose a technology that does not mix supply and exhaust air, you have to make sure that the thermal imbalance does not adversely impact your room. This can be tricky if you are not capable of continuously monitoring all the equipment inlet temperatures.
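The control criteria above can be expressed as a simple monitoring check. This is a minimal sketch: the spec limit and sigma threshold are illustrative assumptions, not values from the text, and should come from your equipment documentation and chosen airflow strategy.

```python
from statistics import mean, stdev

def meets_control_criteria(inlet_temps_f, spec_max_f=80.6, max_sigma_f=2.0):
    """Check whether measured server inlet temperatures are tightly
    grouped (low sigma) around a value whose ±3-sigma band stays
    within the manufacturer's spec limit.

    spec_max_f and max_sigma_f are illustrative thresholds only.
    """
    mu, sigma = mean(inlet_temps_f), stdev(inlet_temps_f)
    return mu + 3 * sigma <= spec_max_f and sigma <= max_sigma_f

# Tightly grouped inlets: passes
print(meets_control_criteria([58.1, 59.0, 57.8, 58.6, 59.3]))  # True
# Scattered inlets (hot spots): fails even though the mean is modest
print(meets_control_criteria([58.1, 72.4, 57.8, 81.0, 59.3]))  # False
```

Run against a continuous feed of inlet readings, a check like this flags loss of control (a rising spread) before any single server exceeds its spec.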
If you choose an air-containment technology, you have to be sure that the supply air, the return air, and the equipment fans are all synchronized to provide the same CFM, otherwise airflow imbalances will occur and degrade your control capabilities. If you choose the active-air approach, and the active-air system was designed for
control, such as the active-airflow management technology, the control features are already built into the system. Furthermore, the system is designed for robust control, so that day-to-day variations due to housekeeping, servicing, or equipment load variations will not impact your control and will allow for efficiency improvements with little or no risk to your equipment.
REFERENCES

1. Juran, J. M. Quality Control Handbook. New York: McGraw-Hill, 1951.
2. Ducted Exhaust Cabinet—Managing Exhaust Airflow Beyond Hot Aisle/Cold Aisle. White Paper MKT-60020319-NB. Chatsworth Products Inc., 2006. http://www.thermalnews.com/images/knowledgecenter.
3. Schmidt, R. R., E. E. Cruz, and M. K. Iyengar. Challenges of Data Center Thermal Management. POWER5 and Packaging, 49, 2005.
4. Chen, Keke, David M. Auslander, Cullen E. Bash, and Chandrakant D. Patel. Local Temperature Control in Data Center Cooling: Part 1, Correlation Matrix. HP Technical Report HPL-2006-42. http://www.hpl.hp.com/techreports/2006/HPL-2006-42.html. http://www.hpl.hp.com/research/dca/smart_cooling.
5. Garday, Doug, and Daniel Costello. Air Cooled High Performance Data Centers: Case Studies and Best Methods. Intel Corporation White Paper, 2006.
6. Essential Cooling System Requirements for Next Generation Data Centers. APC White Paper #5, 2003.
7. http://www.dell.com/downloads/global/services/datacenter_cooling.pdf.
8. Greenemeier, Larry, and Jennifer Zaino. Concerns Heat Up Over Keeping Blades Cool—Blade Servers. InformationWeek, March 10, 2001. http://www.informationweek.com/news/hardware/windows_servers/showArticle.jhtml?articleID=8700285.
9. Noland, Chris, and Vipha Kanakakorn. Efficiencies through Increased Temperatures. In Proceedings of the 2009 Data Center Energy Efficiency Conference, Sunnyvale, CA.
10. Wu, Kwok. A Comparative Study of Various High Density Data Center Cooling Technologies. Thesis, Stony Brook University, 2008.
GLOSSARY
Absolute Humidity. The mass of water in a specified volume of air (example: grams/cubic meter).

Active-Air Movement. The practice of using active fans to direct air that is either cooled or heated to its proper place in a data center.

Addressable Control System. Programmable computer-based fire alarm systems that communicate with programmable devices in the field through a digital bus or signaling line circuit (SLC).

Air-Cooled Condenser. Dry-bulb-centric device that uses the movement of ambient air passing through a coil to reject heat to the atmosphere by condensing the refrigerant vapor in the coil to a liquid state.

Airflow Paths. The direction of airflow from cool air to warm created by a CRAC unit.

Air-Sampling Smoke Detector. Operates as a very early warning detection system. Air is pulled into the detector periodically and is checked for traces of particles that are created in the very early stages of a fire.

Alternating Current (AC). Flow of electric current in the form of a sinusoidal wave that periodically reverses direction, usually several times per second.

Arc Flash. An electrical explosion associated with the release of energy caused by an electric arc from an energized piece of equipment. Such an energy release can result in permanent injury or death to anyone exposed to the flash without proper protection.

Maintaining Mission Critical Systems in a 24/7 Environment, Second Edition. Peter M. Curtis. © 2011 the Institute of Electrical and Electronics Engineers, Inc. Published 2011 by John Wiley & Sons, Inc.
Automatic Transfer Switch (ATS). Provides an electrical connection between two sources of power, one serving as the general source and the other as a backup component. The normal source of electricity is usually the electric utility, and an emergency source of electricity is usually the standby generator. If the normal source is not available, the ATS will automatically transfer to the emergency source of power to supply the critical load.

Autotransformer. A transformer in which the primary (input) and the secondary (output) are electrically connected to each other. Because the two windings share turns, there is no isolation.

Availability. The probability that a piece of equipment will be operational at a randomly selected instant of time in the future. [MTBF/(MTBF + MTTR)]

Axial Load. A vertical load applied to the center of the raised floor pedestal due to concentrated, rolling, uniform, and other loads applied to the surface of the access floor panel.

Battery. Supplies power to an inverter when input power has been lost. A battery is a group of single cells connected in series and/or parallel that convert chemical energy into electrical energy. Batteries are used as either primary or backup sources of direct current power.

Benchmarking. In order to maintain a written log of the trends of how a piece of equipment is operating, equipment measurements (benchmarks) must be recorded upon initial use and thereafter. This gives facility engineers references to know if anything is wrong with a system.

Blackout. Typically described as a zero-volt condition lasting longer than a half-cycle. Blackouts can be caused by utility equipment failure, accidents, Mother Nature, fuses and breakers, and many more events. The result for unprotected systems is crashes and hardware damage.

Blanking Panels. Solid panels made to close off empty rack (U) spaces. Their purpose is to prevent recirculation of heated air within a rack.

Bonding. The permanent joining of metallic parts to form an electrically conductive path that will ensure electrical continuity and the capacity to safely conduct any current likely to be imposed. Not to be confused with grounding.

Bonding Jumper. A reliable conductor used to ensure electrical conductivity between metal parts.

British Thermal Unit (BTU). A unit of energy approximately equal to the amount of energy needed to heat one pound of water by a single degree Fahrenheit.

Brownout. A reduction in voltage and/or power when demand for electricity exceeds generating capacity. Manifested by dimming or flickering lights.

Bypass Air. Cooling air, delivered usually through cable cutouts under and behind racks, that is not taken in by electronic equipment. This cooling air generally mixes with heated air and promotes more mixing.

Carbon Dioxide System. Displaces air away from fire by flooding a room with carbon dioxide, thus depriving it of oxygen and starving the flames.

Catcher System. See Isolated Redundant.

CFD Modeling. Computational fluid dynamics is a software package that allows you
to virtually model the heat and airflow in a data center environment. This approach is useful because designs can be optimized prior to the construction phase of a facility. If a data center is already constructed, CFD modeling can assist in finding out where hotspots are occurring and identify inefficient airflow paths.

Chemical Clean Agents. Chemical agents extinguish fire primarily by cooling. They are fluorinated compounds that are very similar to refrigerants used in air conditioning.

Chillers. Mechanical refrigeration equipment that produces chilled water as a cooling output or end product. Heat from the chiller may be rejected to a second liquid loop (water cooled) or to the atmosphere by using fans (air cooled).

Circuit Breaker. Circuit breakers are resettable devices that automatically interrupt power and protect a circuit whenever the electrical load exceeds the manufacturer's specifications. A circuit breaker can also be used as an on-off switch for manual control of a circuit. Circuit breakers fall under the following classifications: magnetic, thermal, and a combination of both.

Clean Agent System. Stand-alone suppression system that requires no external power supply to operate, and extinguishes fires by either starving the flame or by cooling it with a chemical reaction.

Close-Coupled Cooling. Cooling equipment with cooling coils close to IT equipment. A typical characteristic of liquid cooling technology.

Closed-Transition Transfer. Closed-transition transfer momentarily parallels the two power sources during transfer from either direction. It closes the path to one source before opening another. In this application, other devices, such as an in-phase monitor or an active synchronizer, must protect the transfer.

Computer Room Air-Conditioning Units (CRAC Units). CRAC units are typically used to cool large-scale (20 kW or more) installations of electronic equipment.

Concentrated Load. A single load applied on a small area of the raised floor panel surface. These loads are typically imposed by stationary furniture and equipment that rest on legs.

Conduction. The transfer of heat through a substance by the vibration of molecules and energy carried by subatomic particles.

Convection. The transfer of heat due to the motion of fluids.

Conventional Control System. Microprocessor-based fire alarm system that processes inputs on an on-or-off basis.

Cooling Tower. A heat rejection device that dispels waste heat to the atmosphere through the cooling of a water stream to a lower temperature.

Customer Average Interruption Duration Index (CAIDI). The average length (in minutes) of outages experienced by each electric utility customer who suffers a sustained outage.

Cutout Closure. An intentional physical blocking of cable cutouts, usually under or behind racks, to prevent loss of cooling air needed to cool electronic equipment.

Datacom Room/Data Center. A datacom (data processing and telecom) room houses IT and communications equipment. A data center is an entire facility, including infrastructure. The terms are often used interchangeably.

DCiE. Measures of data center efficiency developed by members of the Green Grid
(www.thegreengrid.org). DCiE (Data Center Infrastructure Efficiency) is calculated as (1/PUE). Higher DCiE numbers are better.

Delta. Three-phase power that does not include a neutral wire.

Delta T (ΔT). For any heating or cooling process, the difference in temperature from the inlet to the outlet. In general, a higher ΔT will result in higher efficiency.

Dewpoint Temperature. The temperature at which the air is saturated with moisture; the temperature at which moisture will condense.

Direct Current (DC). A constant electrical current that flows consistently in one direction and does not reverse polarity.

Disconnect Switches. A switch used to isolate equipment.

Distributed Generation. On-site sources of electricity operated by nonutilities; may be any size and often incorporate renewable technology.

Dry Bulb Temperature. The ambient temperature of the air without taking moisture into account. This may be measured using a standard thermometer placed in the area being observed.

Dry Cooler. Dry-bulb-centric heat rejection device that utilizes water as its cooling medium (typically with a mixture of glycol in the water solution to prevent freezing).

Dust Ignition Proof. Enclosed in a manner that will exclude dust and will not permit arcs, sparks, or heat from the interior to cause ignition of specified dust on or in the vicinity of the enclosure.

Economizer. A piece of equipment that takes advantage of favorable outdoor conditions to reduce or eliminate the need for mechanical refrigeration to accomplish cooling.

Electrical Line Noise. Electrical line noise is defined as radio-frequency interference (RFI) and electromagnetic interference (EMI). It causes undesirable effects in the circuits of computer systems. Sources of line noise include electric motors, relays, motor control devices, broadcast transmission, microwave radiation, and distant electrical storms. RFI, EMI, and other frequency problems can cause data error, data loss, storage loss, and keyboard and system lockup.

Electricity. Electricity is a fundamental agent of physics caused by the presence of charge and the movement of charge, which manifests as force, light, sound, and heat. A source of power derived from the flow of electric charge. Often the "prime mover" for modern industry.

Electromagnetic Pulse (EMP). A burst of electromagnetic energy/radiation that has the potential to damage electronic systems. Possible causes can include a nuclear explosion or an irregular magnetic field.

Electrostatic Discharge (ESD). The dissipation of electricity (in layman's terms, a shock). ESD can easily destroy semiconductor products, even when the discharge is too small to be felt.

Emergency Systems. Emergency systems are generally installed in places of assembly where artificial illumination is required for safe exiting and for panic control in buildings subject to occupancy by large numbers of persons, such as hotels, theaters, sports arenas, health care facilities, and similar institutions. Emergency systems may also provide power for such functions as ventilation that are essential to maintain life, fire detection and alarm systems, elevators, fire pumps, public safety communications systems, industrial processes in which current interruption would produce serious life safety or health hazards, and similar functions.

Emergency Action Procedures (EAPs). Prepared step-by-step actions to be performed in case of an emergency.

Energy Portfolio. A combination of multiple sources of electricity used to promote security and stabilize pricing.

Energy Security. Tied to continuing access to affordable electricity. Relates to protection from Web-based attacks. Areas affected include hacking into power plant internal and external systems or other deliberate and nondeliberate (weather) attacks. Affected by geography, politics, technology, and warfare, among others.

Enthalpy. The sum of the internal energy of the system plus the product of the pressure of the gas in the system and its volume.

Exercise Clock. An exercise clock is a programmable time switch. It automatically initiates starting and stopping of the generator set for scheduled exercise periods. The transfer of load from the normal to alternate source does not need to take place but is recommended in most cases.

Failure Rate. The percent chance of failure for a piece of equipment.

Fault Tolerance. The ability of a system to cope with internal hardware failures (e.g., a CRAC unit or UPS failure) and still continue to operate with minimal impact, such as by bringing a backup system online.

Federal Energy Regulatory Commission (FERC). A U.S. agency that regulates and supervises the transmission of electricity, oil, and natural gas. FERC also provides oversight for natural gas and hydroelectric projects.

Flywheel. A device for storing energy in the momentum of a rotating mass, where high power is sustained only intermittently and briefly.

Free Workspace. The area immediately in the vicinity of electrical equipment that requires examination, adjustment, servicing, or maintenance.

Frequency Variations. Frequency variation involves a change in frequency from the normally stable utility frequency of 50 Hz or 60 Hz, depending on the geographic location. This may be caused by erratic operation of emergency generators or unstable frequency power sources. For sensitive electronic equipment, the result can be data corruption, hard-drive crash, keyboard lockup, or program failure.

Fuel Cell. Fuel cells produce electricity through an electrochemical process combining hydrogen ions (drawn from a hydrogen-rich fuel such as natural gas) and oxygen. Though the process is similar to a battery, the fuel is continuously supplied from an outside source.

Fuel Maintenance Systems. A system for diesel fuel storage that pumps fuel through a series of particulate filters and water separators, and then returns the fuel to the tank at the opposite end from which it was drawn.

Ground Fault. An unintentional conductive path to ground from the hot or live parts to the ground or grounded portions of an electrical system. Ground faults may result from inadvertently energized equipment casings.

Ground Fault Circuit Interrupter (GFCI). A device intended for the protection of personnel. It deenergizes a circuit, or portion thereof, within an established period of time when a current to ground exceeds some predetermined value.
Grounded (Effectively). A device is effectively grounded when it is intentionally connected to earth through a ground connection or connections of sufficiently low impedance (resistance) and having sufficient current-carrying capacity to prevent the buildup of voltages that may result in undue hazards to connected equipment or to persons. (NEC)

Grounding Electrode Conductor. A grounding electrode conductor is a conductor used to connect the grounding electrode to the equipment bonding conductor, grounded conductor, or both, at the service equipment.

Harmonics. Harmonics are integral multiples of a periodic wave. Harmonic frequency is always a multiple of the fundamental frequency (example: first harmonic is 60, second is 120, third is 180). Harmonics display a characteristic sine wave distortion that is visible on a recorded electrical sine wave. Harmonics are generated in many electromagnetic devices and may lead to overheating and a variety of power disruptions.

Heat Exchanger. A device that relies on the thermal transfer from one input fluid to another input fluid. One of the fluids that enters the heat exchanger is cooler than the other entering fluid. The cooler input leaves the heat exchanger warmer than it entered, and the warmer input fluid leaves cooler than it entered.

Hermetic Seal. An airtight seal, impermeable to air and its contents.

Hot Aisle/Cold Aisle. In a data center, this is one configuration of equipment racks with intakes in one aisle and exhausts in an alternate aisle. CRAC units should be placed at the ends of the aisles, preferably the hot aisle. This configuration allows an efficient cooling supply and heat return path that promotes separation of heated and cooling air.

HVAC. A system that provides heating, ventilating, and air conditioning within a building.

Impact Load. Occurs when objects are accidentally dropped on the raised access floor.

Impedance. A combination of resistive (DC) and reactive (AC) opposition to the flow of electric current. Impedance may vary with frequency, except in the case of purely resistive systems.

Infrared Scanning or Thermographic Survey. Thermographic testing is a procedure performed by qualified technicians in which an infrared scanning instrument is used to identify elevated temperatures as compared to ambient surroundings. This is a low-cost method for identifying potentially crippling electrical problems in electrical loads.

In-Phase Monitor. This device monitors the relative voltage and phase angle between the normal and alternate power sources. It permits a transfer from normal to alternate and from alternate to normal only when acceptable values of voltage and phase angle difference are achieved. The advantage of the in-phase transfer is that connected motors continue to operate with little disturbance to the electrical system or to the process.

Input and Output Filters. An input filter processes an AC current into its positive and negative components. An output filter is used to remove remaining odd harmonics, resulting in a sine wave, which is then used to power the critical load.
GLOSSARY
465
Insulated Gate Bipolar Transistor (IGBT). An IGBT functions similarly to the SCR but more efficiently: it can turn off rapidly when its gate signal is removed, making it better suited for DC applications.
Insulation. Material that separates a conductor from conducting bodies by means of nonconductors so as to prevent the transfer of electricity and heat.
Intrinsically Safe. A condition in which any spark or thermal effect is incapable of igniting flammable or combustible material in air.
Inverter. Converts direct current into alternating current using IGBT or equivalent technology.
Islanding. Providing electric power for a location, independent of the electric grid, through the use of distributed generation sources.
Isolated Ground Receptacle (IGR). A receptacle in which the grounding circuit terminal (third prong) is isolated from the metal mounting yoke and metal cover or wall plate, establishing a separate, dedicated equipment-grounding path. Providing two grounding paths for the installation reduces electrical noise in the equipment.
Isolated Redundant. An arrangement in which the primary UPS unit serves the load from the utility and a second UPS unit feeds its output into the bypass circuit of the first in case the primary fails. Also known as a catcher system.
Isolation Transformer. A transformer whose primary (input) and secondary (output) windings are completely separated; there is no direct path between input and output.
Latent Heat. Energy absorbed or released by a substance undergoing a phase change (e.g., water vaporizing).
Legally Required Standby Systems. Typically installed to serve loads, such as heating and refrigeration systems, communication systems, ventilation and smoke-removal systems, sewage disposal, lighting systems, and industrial processes, that, when stopped during any interruption of the normal electrical supply, could create hazards or hamper rescue or firefighting operations.
Line Interactive UPS. Also known as active standby because it is controlled by a microprocessor that continuously monitors the input line power quality and reacts to variations in the input. Line filtering attenuates most disturbances. Upon detection of a power failure, the UPS transfers the load to batteries in order to stabilize the power.
Lockout/Tagout. The specific practices and procedures implemented during service or maintenance activities to protect employees from unexpected energization or startup of machinery and equipment, or the release of hazardous energy.
Manual Transfer Switch. Provides the same function as an ATS except that an operator is required to transfer the load to an alternate power source.
Mean Time Between Failures (MTBF). The average time equipment can be used before it can be expected to fail.
Mean Time to Repair (MTTR). The average time required for equipment to be repaired.
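MTBF and MTTR are commonly combined into a steady-state availability figure, A = MTBF / (MTBF + MTTR). That formula is standard reliability-engineering practice rather than something stated in the glossary itself; a sketch under that assumption:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time equipment is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# One failure per year (MTBF = 8760 h) with an 8-hour average repair time:
print(round(availability(8760, 8), 5))  # about 0.99909
```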
Method of Procedure (MOP). Step-by-step procedures and conditions for performing a specific task, including identifying the personnel who should be present and responsible, tracking documentation, and phasing the work.
Mission Critical. Any facility that must be operational 7 days a week, 24 hours per day, or be available 99.999-99.9999 percent of the time.
Mixing. Undesirable cross contamination of cool air and warm air; the most common cause of inefficiency in data centers.
Momentary Average Interruption Frequency Index (MAIFI). Measures the average number of momentary interruptions experienced by a utility's customers. A momentary interruption is an outage lasting less than 2 or 5 minutes (the threshold varies by region).
Motor Control Center (MCC). A single enclosure designed with a main disconnect switch. The enclosure houses various disconnect switches, protective devices, variable-speed drives (VFDs), starters, and other protective electrical control components.
National Electric Code (NEC). The standard of the National Board of Fire Underwriters for electric wiring and apparatus, as recommended by the National Fire Protection Association and approved by the American Standards Association.
National Electrical Manufacturers Association (NEMA). A nonprofit organization of electrical equipment manufacturers that establishes uniform standards and nomenclature throughout the industry.
National Fire Protection Association (NFPA). An organization charged with creating and maintaining minimum standards and requirements for fire prevention and suppression activities, training, and equipment, as well as other life-safety codes and standards. The NFPA publishes the National Electrical Code.
Noise. Often generated by normal computer operation, noise triggers exasperating problems: incorrect data transfer, printing errors, keyboard/mouse/monitor lockups, program crashes, data corruption, and even damage to computer power supplies.
Nonincendive. Any arc or thermal effect produced by shorting, opening, or grounding of field wiring that is not capable of igniting a flammable gas, vapor, or dust-air mixture.
Nonlinear Electric Devices. Devices that contain switching electrical loads, such as computers and VFDs.
North American Electric Reliability Corporation (NERC). A nongovernmental organization responsible for regulating the production and use of electricity.
Offline UPS or Standby UPS. Offline or standby power supply UPSs (sometimes referred to as SPSs) use an inverter that is parallel-mounted to the load supply line and backs up the critical load. This configuration usually applies to small units sized less than 3 kVA. Transfer times can take up to 10 ms, and protection is not included for inrush currents. These systems have the same components as online UPS systems but supply power only when the normal source is lost.
Ohm's Law. The current through a conductor between two points is directly proportional to the potential difference, or voltage, across the two points, and inversely proportional to the resistance between them (E = IR).
Online UPS. A UPS system that continually powers the load to provide both conditioned electric power and line-disturbance/outage protection.
Open Transition Switch. Momentarily interrupts the load during bypass operation. Load-break-type switches bypass either the normal or emergency source to the load through a break-before-make operation. The load sees a momentary power interruption, typically 3 to 5 cycles, upon operation of the bypass.
Overturning Moment. A lateral load applied to the pedestal of the raised access floor due to rolling-load traffic, underfloor work on wire pulls, or seismic activity.
Passive Smoke Detector. Passive detectors are spot or beam types that can be mounted on ceilings, in cabinets, and under raised floors. They are excellent for battery rooms and power rooms, where lower airflow reduces the demands on smoke detection.
Perforated Panels. All-steel panels with a series of holes that create a 25% open area in the panel surface for air circulation.
Personal Protective Equipment (PPE). Equipment designed to protect workers against heat and electric shock (such as an arc flash).
Phase Angle. A measurement of the difference in displacement between multiple sine waves. Applies only to alternating current.
Power Conditioning. A power conditioner is an electrical device that protects a load from more than one power quality problem. Power conditioners generally do not protect against power failure.
Power Distribution Unit (PDU). A freestanding electrical distribution center located within a data center. PDUs contain panelboards for branch circuits and are equipped with a transformer if required to step down the voltage from 480 V to 120/208 V. A power monitoring system can also be included as an option.
Power Factor. The ratio of real power to apparent power flowing in a circuit. A load with a high power factor (near 1) draws less current than a load with a low power factor.
Power Surge. A power surge takes place when the voltage is 110% above normal for fewer than three microseconds. The most common cause is heavy electrical equipment being cycled on and off. Under this condition, computer systems may experience memory loss, data errors, flickering lights, and equipment shutoff. When the condition lasts longer than one minute, it is considered an overvoltage.
Predictive Maintenance. Uses instrumentation to detect the condition of equipment and identify pending failures.
Premium System (N + 2). Meets the criteria of an N + 2 design by providing an essential component plus two components for backup.
Preventive Maintenance (PM). For continued reliability and integrity, all systems depend on an established program of routine maintenance and operational testing. The program should be based on the manufacturer's recommendations, operations manuals, and the advice of expert consultants experienced with the equipment.
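The Power Factor entry above can be made concrete: for a given real power and voltage, the current drawn is I = P / (V × PF), so a lower power factor means more current for the same real power. A sketch under that standard single-phase relation (the numbers are illustrative):

```python
def current_draw(real_power_w, voltage_v, power_factor):
    """Single-phase current draw for a load of given real power."""
    return real_power_w / (voltage_v * power_factor)

# A 10 kW load at 480 V: unity power factor versus 0.8 power factor.
print(round(current_draw(10_000, 480, 1.0), 1))  # about 20.8 A
print(round(current_draw(10_000, 480, 0.8), 1))  # about 26.0 A
```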
Probabilistic Risk Assessment (PRA). Looks at the probability of failure of each type of electrical power equipment.
Program Delayed Transition. Extends the operation time of the transfer switch mechanism. A field-adjustable setting can add an appropriate time delay that momentarily stops the switch in a neutral position before transferring to the energized source. This allows motors to coast down and transformer fields to collapse before connection to the oncoming source.
PUE. Power usage effectiveness, the ratio of total facility power to IT equipment power; lower PUE numbers are better. PUE is a measure of data center efficiency developed by members of the Green Grid (www.thegreengrid.org).
Raised Floor. A flooring system, typically 12" to 36" above the main slab, consisting of pedestals and removable panels, usually 2' × 2'. Its function is to provide a supply-air plenum for downflow CRACs and space for piping and cableways. Some of the removable panels are perforated, typically with 25% to 56% open area, to allow preferential supply of cooling air to IT equipment racks.
Reactive Load. A load having inductive or capacitive reactance.
Rectifier/Battery Charger. Converts alternating current supplied from a utility or generator into direct current at a voltage compatible for paralleling with battery strings. The rectifier provides direct current to power the inverter, trickle-charges the batteries, and powers the control logic.
Redundancy. The provision of additional assemblies, circuits, and/or equipment for the purpose of continued operation.
Regulation. An act or law enacted by federal, state, or local government. Regulations may mandate compliance with codes. Building regulations are adopted to specify the criteria necessary for constructing, modifying, and maintaining facilities.
Relative Humidity (RH). The percentage of water vapor in the air relative to full saturation at a specific temperature. The moisture content of air helps dissipate static electricity; as a result, humidity in data centers must be kept high enough to dissipate static electricity yet low enough to prevent corrosion.
Reliability. Classically defined as the probability that an item will perform satisfactorily for a specified period of time under a stated set of conditions.
Reliability-Centered Maintenance (RCM). The analytical approach to optimizing reliability and maintenance tasks with respect to the operational requirements of the business.
Resistive Load. A load whose total reactance is zero, meaning that the current is in phase with the terminal voltage. Also known as a nonreactive load.
Resource Conservation and Recovery Act (RCRA). Legislation passed in 1976 aimed at protecting the environment, including waterways, from solid-waste contamination, either directly through spills or indirectly through groundwater contamination.
Return Plenum. The space between a drop ceiling and the floor slab above, serving as a duct to carry heated air back to the CRAC returns. Heated air from equipment is drawn through return grilles placed in the drop ceiling. Its function is to keep heated air separate from cooling air.
Rolling Load. Load applied by wheeled vehicles carrying loads across the raised access floor. Pallet jacks, electronic mail carts, and lifts are some examples.
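The PUE entry above is a straightforward ratio; a sketch with illustrative numbers:

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

# A facility drawing 1500 kW in total to support a 1000 kW IT load:
print(pue(1500, 1000))  # 1.5 -- lower values indicate higher efficiency
```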
Rotary UPS. Uses an electromechanical motor-generator combination in lieu of the inverter used in static systems. Used primarily as a power conditioner; relies on the inertia of rotating components (e.g., a flywheel) to bridge momentary power outages or improve power quality.
Seismic Load. A combination of vertical (both up and down) and lateral loads on the surface of the access floor.
Sensible Heat. Heat added to or removed from a system that does not affect the moisture content. It is represented as a rise or fall in temperature.
Sensible Heat Ratio. The sensible heat load divided by the total heat load.
Short Circuit. The condition that occurs when a circuit path is created between the positive and negative poles of a battery, power supply, or circuit. A short circuit bypasses any resistance in a circuit and prevents it from operating.
Short Cycling. Rapid cycling of CRAC compressors or chilled-water valves; an undesirable condition caused by air mixing, poor airflow management, and short circuiting. Occurs when a compressor or furnace is turned on immediately after it has been turned off.
Signal-Loop Circuit (SLC). The bus that carries information and commands between the main processor and the programmable devices in an addressable control system.
Silicon-Controlled Rectifier (SCR). A solid-state electronic switching device that allows a relatively small current signal to gate (turn on and off) a device capable of handling hundreds of amperes of current.
Slab Floor/Flat Floor. A data center built without a raised floor.
Smart Grid. A confluence of the current electrical grid and digital technology, envisioned to lead to increased efficiency, security, and load control.
Specific Gravity (SG). In a lead-acid cell, the electrolyte is a dilute solution of water and sulfuric acid. Specific gravity is a measure of the weight of acid in the electrolyte as compared to an equivalent volume of water. For example, an electrolyte with a specific gravity of 1.215 is 1.215 times heavier than an equivalent volume of water, which has a specific gravity of 1.000.
Standard. A document prepared by a nationally recognized organization, defining industry practice. Standards are generally accepted practices that, while not required by law, are used as guidelines for installing electrical systems. Standards are typically in a suitable form to be referenced by another standard or a code.
Standard Operating Procedures (SOPs). A best-practice approach or set of step-by-step instructions for the day-to-day operation of equipment.
Standard System (N + 1). A standard system establishes N + 1 reliability by supplying an essential component, such as a UPS, plus one additional component for backup.
Standby Generators. A standby generator is an independent reserve source of electrical energy. When the normal electric supply is lost, this independent reserve source supplies electricity to the user's facility.
Star Grounding. A system in which the grounds join together at a single, isolated ground point in earth, independent of the building's electrical system. The signal ground may consist only of radial connections, in the form of a star pattern, and does not form any loops. Star grounding serves to reduce radio-frequency noise.
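The Sensible Heat Ratio entry above divides the sensible load by the total (sensible plus latent) load; a sketch with illustrative numbers:

```python
def sensible_heat_ratio(sensible_kw, latent_kw):
    """SHR = sensible heat load / total heat load (sensible + latent)."""
    return sensible_kw / (sensible_kw + latent_kw)

# 80 kW of sensible load and 20 kW of latent load:
print(sensible_heat_ratio(80, 20))  # 0.8
```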
Static Bypass. In the case of a UPS system internal rectifier or inverter fault, or should the inverter overload capacity be exceeded, the static transfer switch instantaneously transfers the load from the inverter to the bypass source with no interruption in power to the critical AC load.
Static Transfer Switch (STS). Selects between two or more sources of power and provides the best available power to the critical load. The STS is a solid-state device based on silicon-controlled rectifier (SCR) technology. The transfer from the preferred to the alternate source occurs in less than one-quarter of a cycle to avoid interrupting the load.
Surge Suppressor. An electrical device that provides protection from high-level transients. Surge suppressors should be installed at service entrance panels, distribution panels, and individual critical loads.
System Average Interruption Frequency Index (SAIFI). Measures the average number of sustained interruptions experienced by a utility's customers.
System of Nines. Availability is sometimes described in units of nines, as in "five nines," or 99.999%. A computer system with 99.999% availability is highly available, delivering its service to the user 99.999% of the time it is needed.
Test Switch. Regular testing is necessary both to maintain the backup power system and to verify its reliability; regular testing of a standby generator can uncover problems that could cause a malfunction during an actual power failure. A test switch is designed to simulate failure of the normal power source. Activating the test switch starts the transfer of the load to the alternate power source and then retransfers it either (1) when the test switch is returned to normal or (2) after a selected time delay.
Thermal Architecture Path. The transportation of heat away from the source and its eventual release into the atmosphere.
Three-Phase Power. Electric power that contains three phases, 120 degrees apart, to provide near-constant power.
Tie Systems. A tie is made up of two parallel redundant systems with a tie between them. Each system has its own input power feeder and a tiebreaker.
Total Heat. See Enthalpy.
Traditional Linear Electric Loads. Traditional loads such as lighting, motors, and so on.
Transformer. A device that can step voltage up and/or down, depending on the turns ratio. Operates on the principle of magnetic induction.
Transient Response. The response of a circuit to a sudden change in an input or output quantity. In power supplies, this is the excursion of the output voltage and the time it takes to recover from a step change in the output load or the input voltage.
Transients. High-voltage spikes superimposed on the mains power. They can damage sensitive electronic equipment.
Total Harmonic Distortion. The alteration of a waveshape by the presence of multiples of the fundamental frequency of a signal.
Ultimate Load. The maximum load a floor can support without structural damage.
Uniform Load. Loads applied uniformly over the entire surface of the panel, typically imposed by items such as file cabinets, pallets, and furniture or equipment without legs.
Uninterruptible Power Supply (UPS). Supplies sinusoidal AC power free of electrical disturbances and within strict amplitude and frequency specifications.
United States Environmental Protection Agency (USEPA). The federal agency charged with setting policy and guidelines, and carrying out legal mandates, for the protection of national interests in environmental resources.
Very Early Warning Fire Detection (VEWFD). Standards of systems and technology that provide very early warning of a fire hazard.
Voltage Sags. Momentary (typically a few milliseconds to a few seconds in duration) undervoltage conditions that can be caused by a large load starting up (such as an air conditioning compressor or large motor load) or by operation of utility protection equipment. Sags often manifest as flickering lights and can cause equipment shutdown; a sag of just a few milliseconds can mean a complete blackout to some sensitive equipment.
Voltage- and Frequency-Sensing Controls. Voltage and frequency sensors, controls, and meters add useful features to an ATS. The voltage- and frequency-sensing controls usually have adjustable settings for under and over conditions. The meters offer a precise indication of voltage and frequency, and the sensing controls provide flexibility in various applications.
Voltage Regulation. A specification that states the difference between maximum and minimum steady-state voltage as a percentage of nominal voltage.
Voltage Regulator. A device that maintains the voltage output of a generator near its nominal value in response to changing load conditions.
Voltage Spikes. A sharp rise and decline of voltage within a very short time span.
Voltage Swells. Momentary (typically a few milliseconds to a few seconds in duration) overvoltage conditions that can be caused by such things as a sudden decrease in electrical load or a short circuit on electrical conductors. Voltage swells can affect the performance of sensitive electronic equipment, cause data errors, produce equipment shutdowns, and may cause equipment damage.
Watermist System. A fire suppression system that atomizes water to form a mist that cools, absorbs radiant heat, deprives the fire of oxygen, and suppresses the fire with water vapor.
Waveform Distortion. The most common waveform distortion is a harmonic, which is a natural multiple of the standard power wave. Although harmonics can be triggered by equipment inside the network, utility power may contain harmonics generated hundreds of miles away. Caused by motor speed controllers or even computers themselves, these distortions can lead to communications errors and hardware damage.
Wet Bulb Temperature. The temperature read from a moistened thermometer exposed to an airflow. This will always be lower than the dry bulb temperature, due to the evaporative cooling effect, except at 100% relative humidity.
Wye System. A three-phase power system that includes a fourth, neutral wire.
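The System of Nines entry can be translated into allowable downtime: an availability of A implies (1 − A) of each year spent down. A sketch (non-leap year assumed):

```python
def downtime_minutes_per_year(avail):
    """Annual downtime implied by an availability figure (non-leap year)."""
    return (1 - avail) * 365 * 24 * 60

# Five nines (99.999%) allows roughly 5.26 minutes of downtime per year.
print(round(downtime_minutes_per_year(0.99999), 2))
```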
BIBLIOGRAPHY
ACP International—Home—Dedicated to the Evolution of Business Continuity, 2007; http://www.acp-international.com, 22 July 2010.
Active Power, Inc. Active Power Clean Source UPS: High Efficiency UPS Systems for a Power Hungry World, White Paper 114, 2009.
American Power Conversion. How and Why Mission Critical Cooling Systems Differ From Air Conditioners, White Paper 56, 2003; http://www.ptsdcs.com/whitepapers/45.pdf, 30 July 2010.
ANSI Homeland Security Standards Panel, Standardization for Enterprise Power Security and Continuity Final Workshop Report, April 2006.
APC. Essential Cooling System Requirements for Next Generation Data Centers, White Paper #5, http://www.dell.com/downloads/global/services/datacenter_cooling.pdf, 2003.
Auletta, Ken. Googled: The End of the World as We Know It. New York: Penguin, 2009.
Axelrod, C. Warren, Jennifer L. Bayuk, and Daniel Schutzer. Enterprise Information Security and Privacy. Boston: Artech House, 2009.
Barczak, Sara, and Ronald Carroll. Climate Change Implications for Georgia's Water Resources and Energy Future, http://www.cleanenergy.org/documents/BarczakCarrollWaterPaper8March07.pdf, Apr. 2008.
Bean, Thomas L., Timothy W. Butcher, and Timothy Lawrence. OSHA's Lockout/Tagout Standard. The Ohio State University, 2008; http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=STANDARDS&p_id=9787, 1 Jan. 2010.
BRE Global, Ltd. BREEAM Data Centres 2010 Launched, 7 May 2010; http://www.breeam.org/newsdetails.jsp?id=672, 8 July 2010.
BRE Global, Ltd. BREEAM Data Centres, 2009; http://www.breeam.org/page.jsp?id=157, 7 July 2010.
Bridging the Schism Between IT and Facilities. Data Center Dynamics, Vol. 3, No. 5, Aug/Sept. 2009.
Brock, J. L., Critical Infrastructure Protection: Challenges to Building a Comprehensive Strategy for Information Sharing and Coordination, 26 July 2006, pp. 9-10.
Brown, Lester Russell. Plan B 3.0: Mobilizing to Save Civilization. New York: W. W. Norton, 2008.
CALMAC Manufactures, http://calmac.com, Apr. 2008.
Carr, Nicholas. The Big Switch. New York: W. W. Norton & Company, 2008.
Casey, S. M. The Atomic Chef and Other True Tales of Design, Technology, and Human Error. Santa Barbara, CA: Aegean, 2006.
Chen, Keke, David M. Auslander, Cullen E. Bash, and Chandrakant D. Patel. Local Temperature Control in Data Center Cooling: Part 1, Correlation Matrix, HP Technical Report HPL-2006-42, http://www.hpl.hp.com/techreports/2006/HPL-2006-42.html.
Chiles, James R. Inviting Disaster: Lessons from the Edge of Technology: An Inside Look at Catastrophes and Why They Happen. New York: HarperBusiness, 2001.
Coal Plants Cancelled in 2007, http://www.sourcewatch.org/index.php?title=Coal_plants_cancelled_in_2007#List_of_Cancelled_or_Shelved_Plants, Apr. 2008.
Continuity Compliance, Business Continuity, Business Compliance—Your Business Continuity Lifeline, http://www.continuitycompliance.org, 20 July 2010.
Crash and Burn: Data Center Outages, Data Center Dynamics, Vol. 3, No. 5, Aug/Sept. 2009.
Cronin, Dennis. The Containerization of the Data Center, Mission Critical Magazine, Jan.-Feb. 2009; http://www.missioncriticalmagazine.com/Articles/Features/BNP_GUID_9-5-2006_A.10000000000000535271, 5 July 2010.
Croy, Michael. How Prepared Are We? Business Continuity Planning in the Face of Pandemic (and Other Threats), http://www.forsythe.com/na/aboutus/news/articles/2009%20Articles/bcplanninginfaceofpandemic, 7 July 2010.
Curriculum Development for Data Center Optimization Education, Report, Polytechnic Institute of NYU, 10 April 2009.
Curtis, P. M. The Next Edge in Business Resiliency, Mission Critical Magazine, Winter, pp. 12-14, 2008.
Curtis, Peter M. A Phased Approach to Smart Grid Technology, Mission Critical Magazine, pp. 8-10, Sept.-Oct. 2009.
Curtis, Peter M., and T. Lindsey. Bits Guide to Business-Critical Power, Report, Washington, D.C.: Bits Financial Services Roundtable, 2006.
Curtis, Peter M. Documentation and Its Relation to Information and Energy Security. Mission Critical Magazine, Summer, pp. 24-26, 2008.
Curtis, Peter M. LEEDing the Way, Mission Critical Magazine, May-June, p. 10, 2009.
Curtis, Peter M. Managing in Uncertain and Volatile Times, Mission Critical Magazine, Jan.-Feb., p. 17, 2009.
Curtis, Peter M. Preparing the Nintendo Generation for the Mission Critical Industry, Mission Critical Magazine, July-Aug., p. 8, 2009.
Curtis, Peter M. Rolling the Dice: When the Human Element Is Considered, Mission Critical Magazine, Mar.-Apr., pp. 18-19, 2009.
Curtis, Peter M. The Demand for Power, Mission Critical Magazine, Fall, pp. 15-16, 2007.
Curtis, Peter M. The Direct Current Data Center, Mission Critical Magazine, Mar.-Apr., pp. 24-26, 2010.
Curtis, Peter M. What Emergencies Teach Us, Mission Critical Magazine, Spring, pp. 12-14, 2008.
Davidson, Paul. New Battery Packs Powerful Punch, http://www.usatoday.com/tech/products/environment/2007-07-04-sodium-battery_N.htm, Apr. 2008.
Diamond, John. Modularized Facility Infrastructure Can Control Scaleup Costs, Mission Critical Magazine, May-June, pp. 28-30, 2010.
Chatsworth Products Inc., Ducted Exhaust Cabinet—Managing Exhaust Airflow Beyond Hot Aisle/Cold Aisle, White Paper, MKT-60020-319-NB, http://www.thermalnews.com/images/knowledgecenter, 2006.
EER: A Work in Progress, Data Center Dynamics, Vol. 3, No. 3, p. 25, April 2009.
Esty, Daniel C., and Andrew S. Winston. Green to Gold, Hamilton: New Zealand Genetics, 2007.
Fama, J. Electric Reliability: Now Comes the Hard Part, IEEE Power & Energy Magazine, Nov.-Dec. 2004.
FedCenter—Home, http://www.fedcenter.gov, 29 July 2010.
Fink, Donald G., and H. Wayne Beaty. Standard Handbook for Electrical Engineers, 13th ed., New York: McGraw-Hill, 1993.
Friedman, Thomas L. Hot, Flat, and Crowded, New York: Picador/Farrar, Straus and Giroux, 2008.
Frost & Sullivan, North American Automatic Transfer Switch Market, 4 Aug. 2003.
Garday, Doug, and Daniel Costello. Air Cooled High Performance Data Centers: Case Studies and Best Methods, Intel Corporation, White Paper, 2006.
Google's Chiller-less Data Center, Data Center Knowledge, 15 July 2009; http://www.datacenterknowledge.com/archives/2009/07/15/googles-chiller-less-data-center, 12 July 2010.
Gore, Albert. Earth in the Balance: Ecology and the Human Spirit, New York: Rodale, 2006.
Gotfried, S., and J. McDonald. APS Announces New Solar Power Plant, Among World's Largest, Business Wire, Apr. 2008, http://www.businesswire.com/portal/site/google/index.jsp?ndmViewId=news_view&newsId=20080221005237&newsLang=en.
Green Data Center Cooling Options: Begin with Air and Water, Data Center Dynamics, Vol. 3, No. 5, pp. 32-34, Aug/Sept. 2009.
Greenemeier, Larry, and Jennifer Zaino. Concerns Heat Up Over Keeping Blades Cool—Blade Servers, Information Week, http://www.informationweek.com/news/hardware/windows_servers/showArticle.jhtml?articleID=8700285, 10 Mar. 2001.
Group C Communications. Unlocking the Value of Your Facilities: Do You Have a Firm Grasp on How Your Capital and Assets Are Distributed throughout Your Facilities?, http://www.facilitycity.com/busfac/bf_04_09_services.asp, Sept. 2004.
Harris, Shane. China's Cyber Militia, National Journal Magazine, 31 May 2008; http://www.nationaljournal.com/njmagazine/cs_20080531_6948.php.
Hoffman, Jane S., and Michael B. Hoffman. Green: Your Place in the New Energy Revolution, New York: Palgrave Macmillan, 2008.
Holt, Mike. NEC Chapter 7 Review, http://www.mikeholt.com/mojonewsarchive/NECHTML/HTML/NEC-Chapter-7-Review~20040414.php.
HP Data Center Uses Free-Cooling All But 20 Hours Per Year, Environmental Leader, http://www.environmentalleader.com/2010/02/12/hp-data-center-uses-free-cooling-all-but-20-hours-per-year, 11 July 2010.
Is Sustainability Worth the Premium? Data Center Dynamics, Vol. 3, No. 8, p. 19, Feb/Mar 2010.
It's Time for a Refresh, Data Center Dynamics, Vol. 3, No. 3, p. 54, April 2009.
Juran, J. M. Quality Control Handbook, New York: McGraw-Hill, 1951.
Keteyian, Armen. Digital Photocopiers Loaded With Secrets, CBS News, http://www.cbsnews.com/stories/2010/04/19/eveningnews/main6412439.shtml?tag=currentVideoInfo;videoMetaInfo, 15 Apr. 2010.
Krupp, Fred, and Miriam Horn. Earth, the Sequel: The Race to Reinvent Energy and Stop Global Warming, New York: W. W. Norton, 2008.
Lewis, Janet. Electrical Code Issues and Answers, Electrical Circuits, Vol. 2, pp. 1-2, May 1999.
Lovins, A. B., and L. H. Lovins. Brittle Power, Andover, Massachusetts: Brick House, 1982.
Microsoft's Chiller-Less Data Center, Data Center Knowledge, 24 Sept. 2009; http://www.datacenterknowledge.com/archives/2009/09/24/microsofts-chiller-less-data-center, 12 July 2010.
Miller, Richard. Special Report: Data Centers and Renewable Energy, Data Center Knowledge, 13 July 2010; http://www.datacenterknowledge.com/special-report-data-centers-renewable-energy, 14 July 2010.
Mills, M. P., and P. Huber. Critical Power, Report, Washington, D.C.: Digital Power Group, 2003.
Mufson, S. Changing the Current; State Environmental Laws Drive Power Producers to Renewable Resources, http://www.washingtonpost.com/wp-dyn/content/article/2008/04/21/AR2008042103004.html, Apr. 2008.
National Commission on Energy Policy. Ending the Energy Stalemate: A Bipartisan Strategy to Meet America's Energy Challenges, Report, Washington, D.C.: National Commission on Energy Policy, 2004.
National Energy Policy Development Group. Reliable, Affordable, and Environmentally Sound Energy for America's Future (National Energy Policy, May 2001), Report, Washington, D.C.: U.S. Government Printing Office, 2001.
National Fire Protection Association. NFPA 1: Fire Code, http://www.nfpa.org/aboutthecodes/AboutTheCodes.asp?DocNum=1, 26 July 2010.
National Fire Protection Association. NFPA 101: Life Safety Code®, http://www.nfpa.org/aboutthecodes/AboutTheCodes.asp?DocNum=101, 26 July 2010.
National Fire Protection Association. NFPA 12: Standard on Carbon Dioxide Extinguishing Systems, http://www.nfpa.org/aboutthecodes/AboutTheCodes.asp?DocNum=12, 26 July 2010.
National Fire Protection Association. NFPA 13: Standard for the Installation of Sprinkler Systems, http://www.nfpa.org/aboutthecodes/AboutTheCodes.asp?DocNum=13, 26 July 2010.
BIBLIOGRAPHY
National Fire Protection Association. NFPA 75: Standard for the Protection of Information Technology Equipment, http://www.nfpa.org/aboutthecodes/AboutTheCodes.asp?DocNum=75, 26 July 2010.
National Fire Protection Association. NFPA 76: Standard for the Fire Protection of Telecommunications Facilities, http://www.nfpa.org/aboutthecodes/AboutTheCodes.asp?DocNum=76, 26 July 2010.
National Fire Protection Association. NFPA 99: Standard for Health Care Facilities, http://www.nfpa.org/aboutthecodes/AboutTheCodes.asp?DocNum=99, 26 July 2010.
National Institute of Standards and Technology. Federal Information Security Management Act, http://csrc.nist.gov/groups/SMA/fisma/overview.html, 9 Feb. 2010.
NERC. Critical Infrastructure Protection, http://www.nerc.com, 12 Feb. 2010.
Noland, Chris, and Vipha Kanakakorn. Efficiencies through Increased Temperatures, in Proceedings of 2009 Data Center Energy Efficiency Conference, CA.
OSHA. Occupational Safety and Health Standards, 29 CFR Pt. 1910.137, http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=STANDARDS&p_id=9787.
Pahl, Greg. The Citizens Power and Energy Handbook: Community Solutions to a Global Crisis, White River Junction, VT: Chelsea Green, 2007.
Pataki, George E., Thomas J. Vilsack, Michael A. Levi, and David G. Victor. Confronting Climate Change—A Strategy for U.S. Foreign Policy: Report of an Independent Task Force, New York: Council on Foreign Relations, 2008.
Perrow, Charles. Normal Accidents: Living with High-Risk Technologies, Princeton, NJ: Princeton UP, 1999.
Reducing the Data Centre Running Costs with "Free Air Cooling," Data Center Dynamics, Vol. 3, No. 8, p. 46, Feb/Mar 2010.
RIMS—Home, http://www.rims.org/Pages/Default.aspx, 20 July 2010.
Russell, Peter. Waking Up in Time: Finding Inner Peace in Times of Accelerating Change, Novato, CA: Origin, 1998.
Saville-King, Paul. How Maintenance Human Factors Impact Operational Risks, Data Center Dynamics, Vol. 1, p. 68, Dec. 2008.
Schmidt, R. R., E. E. Cruz, and M. K. Iyengar. Challenges of Data Center Thermal Management, POWER5 and Packaging, Vol. 49, 2005.
Schnurr, Avi. Global Single Point Failure: The EMP Threat, EMPACT Publication.
Sheffi, Yossi. The Resilient Enterprise: Overcoming Vulnerability for Competitive Advantage, Cambridge, MA: MIT Press, 2005.
Smick, David M. The World Is Curved: Hidden Dangers to the Global Economy, New York: Portfolio, 2008.
Souder, Elizabeth. Plans for 8 Texas Coal Plants Formally Canceled, Dallas News, http://www.dallasnews.com/sharedcontent/dws/bus/stories/101607dnbustxucoalplants.172c72818.html, Apr. 2008.
Stauffer, H. Brooke. Article 708: Critical Operations Power Systems, in User's Guide to the National Electrical Code, Sudbury, MA: Jones and Bartlett, 2009.
Stobaugh, R., and Yergin, D. Conclusion: Toward a Balanced Energy Program, in Energy Future, pp. 291-308, R. Stobaugh and D. Yergin (Eds.), New York: Vintage Books, 1979.
Tapscott, Don. Growing up Digital: The Rise of the Net Generation, New York: McGraw-Hill, 1998.
Toffler, Alvin. The Third Wave, New York: Bantam, 1990.
UL 1008.5: Transfer Switch Equipment, UL StandardsInfoNet, http://ulstandardsinfonet.ul.com/scopes/scopes.asp?fn=1008.html.
U.S. Department of Energy. Executive Order 13423, 12 Dec. 2009, http://www1.eere.energy.gov/femp/regulations/eo13423.html, 10 Feb. 2010.
U.S. Department of Homeland Security. The 2009 National Infrastructure Protection Plan, 28 June 2010; http://www.dhs.gov/files/programs/editorial_0827.shtm, 9 Feb. 2010.
U.S. General Services Administration. Executive Order 13423, http://www.gsa.gov/Portal/gsa/ep/contentView.do?contentType=GSA_BASIC&contentId=22395, 24 Mar. 2010.
Vanderbilt, Tom. Data Center Overload, New York Times, 14 June 2009, http://www.nytimes.com/2009/06/14/magazine/14search-t.html, 9 Sept. 2009.
Vaughan, Diane. The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA, Chicago: Univ. of Chicago Press, 2007.
Waun, Thomas. Finding an Opening, Data Center Dynamics, Vol. 1, pp. 54-55, Dec. 2008; http://www.hpl.hp.com/research/dca/smart_cooling.
The White House. The National Strategy for The Physical Protection of Critical Infrastructures and Key Assets, pp. 6, 9-10, 21, 28, 58, February 2003.
Wu, Kwok. A Comparative Study of Various High Density Data Center Cooling Technologies, Thesis, Stony Brook University, 2008.
Zakaria, Fareed. The Post-American World, New York: W. W. Norton, 2008.
INDEX
Absorption chiller, 69 AC power, 65 Acceptance phase, 12 Acceptance testing, 12 Addressing risks, 34 Airflow, 303 CRAC/CRAH types, 304 leakage, 303 mixing and its relationship to efficiency, 303 potential CRAC operation issues, 305 recirculation, 303 sensible versus latent cooling, 305 venturi effect, 304 vortex effect, 304 Airflow management, 441 active-air management, 453 benchmark study, 454 benefits of, 444 containment strategies, 452 in-row cooling, 452 lower return air ΔT versus higher ΔT, 445 obtaining control, 445
overhead cooling, 452 technologies, 451 Arc flash, 90 boundaries, 94 ASHRAE Guidelines, 11 Automatic static transfer switches (ASTS), 104 Automatic transfer switch (ATS), 103, 149 Basel II Accord, 2, 408 Basis of design (BOD), 11 Batteries, 251 failure modes, 254 flooded, 252 valve-regulated lead-acid (VRLA), 253 wet lead-acid, 252 Blackouts, 206 Brownouts, 206 Boardroom, responsibilities of, 53 Building deficiencies and analysis, 85 Building Research Establishment Environmental Assessment Method (BREEAM), 57
Maintaining Mission Critical Systems in a 24/7 Environment, Second Edition. Peter M. Curtis. © 2011 the Institute of Electrical and Electronics Engineers, Inc. Published 2011 by John Wiley & Sons, Inc.
Business continuity, 7 Business continuity management agencies and regulating organizations, 410 Business continuity plan (BCP), 417 C4I—Command, Control, Communications, Computers, and Intelligence, 407 Capital costs versus operation costs, 6 Change management, 89 Commissioning (Cx), 10 Commissioning authority (CxA), 12 Commissioning process, 59 Comprehensive Environmental Response, Compensation, and Liability Act of 1980, 399 Computer hackers, 26 Computer-room air-conditioning units (CRAC units), 266, 290 Consortium for Electric Infrastructure to Support a Digital Society (CEIDS), 37 Power Delivery Reliability Initiative, 37 Construction phase, 12 Containerized systems, 72 Cooling air, 302 Cooling efficiency, 297 measurement, 299 Cooling systems, 265 air side, 267 chillers, 269 air-cooled, 271 classification and typical sizes, 272 water-cooled, 271 cooling-medium side, 267 CRAC units, 290 airflow paths, 292 classification, 294 configurations, 294 energy-recovery equipment, 282 economizers, 282 heat exchangers, 287 for datacom facilities, 290 plate-and-frame, 288 shell-and-tube, 288 heat-rejection equipment, 273 condensers, 277 cooling towers, 273 dry coolers, 277 geothermal, 281 outside the datacom room, 269
refrigeration equipment, 269 within datacom rooms, 266 Coordination study, 63 Cost of downtime, 49 CRAC fighting, 308 Critical area program (CAP), 5 Critical areas, 8 Critical cyber assets (CCA), 31 Critical environment (CE), 8 Critical environment workflow, 8 Critical industries, 7 Critical Infrastructure Protection Cyber Security Standards, 30 Critical personnel, 16 Critical systems, 9 Customer Average Interruption Duration Index (CAIDI), 35 Cyber attacks, 26 Data center cooling, 308 active air movement, 310 adaptive capacity, 311 advanced active airflow management for server cabinets, 310 blade servers, 313 chimney or ducted returns, 310 cold storage, 312 cold-aisle containment, 309 dynamic server loads, 313 efficiency, 314 energy-efficient servers, 313 hot aisle/cold aisle, 308 in-row cooling with hot-aisle containment, 309 liquid cooling, 311 multicore processors, 313 overhead supplemental cooling, 309 power managed servers, 313 server efficiency, 312 server virtualization, 313 Data protection, 414 Deregulation, 217 Design phase, 12 Diesel engines, 126 Direct current (DC), 65 advantages, 65 combined cooling, heat, and power, 68 education and training, 71 future, 71
lighting, 67 maintenance, 71 monitoring equipment, 71 safety issues, 70 state of the art, 68 storage options, 67 Disaster recovery, 7 Distributed generation (DG), 37 Distributed resources (DR), 38 Documentation, 14 Education and training, 18 Electric Power Research Institute (EPRI), 37 Electric reliability, 34 Electric system supply redundancy, 34 Electrical grid, 26 Electrical maintenance, 89 standards and regulations, 89 Electricity basics, 195 basic circuit, 196 life cycle of electricity, 198 power factor, 196 single-phase power, 199 three-phase power, 200 transmission of power, 197 unreliable power versus reliable power, 201 Electromagnetic pulse attack (EMP), 31 Emergency action procedures (EAP), 18 Employee certification, 20 Encryption, 416 Energy security, 25, 29 Energy sources, 26 Energy storage devices, 251 Escalation procedures, 10 Executive Order 13423—Strengthening Federal Environmental, Energy, and Transportation Management, 399 Facilities manager (FM), 16 Factory acceptance testing (FAT), 10 Failure rate, 55 Federal Energy Regulatory Commission (FERC), 30 Federal Financial Institutions Examination Council (FFIEC), 412 Federal Real Property Council (FRPC), 400 Fire protection, 357
alarm and notification, 359 carbon dioxide systems, 382 chemical clean agents, 386 FK-5-1-12, 388 Halon 1301, 389 HFC-125, 387 HFC-227ea, 386 HFC-23, 389 clean agent systems, 384 clean agents and the environment, 390 early detection, 361 fire and building codes, 365 fire detection, 366 fire suppression, 362 fire suppression systems, 374 inert gas agents, 384 IG-541, 385 IG-55, 385 portable fire extinguishers, 390 stages of a fire, 364 systems design, 364 water mist systems, 379 Flywheels, 255 Focused review, 12 Fuel cell, 69 Fuel sources for electricity generation, 28 Fuel systems, 125 aboveground tanks, 127 bulk storage tank, 126 codes and standards, 128 day tank control system, 135 diesel fuel, 139 distribution system configuration, 133 fuel quality assurance program, 139 analysis of new fuel prior to transfer to on-site storage, 144 fuel needs, 141 monthly fuel system maintenance, 145 new fuel shipment prereceipt inspection, 141 procurement guidelines, 141 quarterly or semiannual monitoring of on-site bulk fuel, 146 remediation, 146 piping systems, 128 recommended practices, 129 underground tanks, 128 Green technologies, 47, 223
Heat generation, 301 Heat return, 302 Heat transfer inside data centers, 300 Hidden costs of operations, 15 Humidity control, 307 Industry policies and regulations, 395 Information security, 40 enhancements, 41 network and access, 41 recommendations for executive action, 42 security questions, 41 techniques for addressing, 41 Integrated testing, 13 InterNational Electrical Testing Association (NETA), 82 70B Recommended Practice for Electrical Equipment Maintenance, 82 maintenance testing specifications, 82 ISO 27000, Information Security Management System (ISMS), 400 Law of nines, 48 Leadership in Energy and Environmental Design (LEED) Certification, 57 LEED Bookshelf, 57 Leadership in energy-efficient design (LEED), 10 Levels of risk, 6 Load classifications, 51 critical, 51 discretionary, 51 essential, 51 Load interruption, 31 Lockout/tagout, 98 Maintainability, 52 Maintenance and safety, 81 Maintenance of typical electrical distribution equipment, 99 Maintenance programs, 14 Maintenance supervisor, 83 Managing during critical events, 16 Mean time between failures (MTBF), 55 Mean time to repair (MTTR), 55 Method of procedure (MOP), 9 Microturbine CCHP system, 70 Military networks, 26 Mission critical access, 17
Mission critical data center, 56 certification, 57 evaluating systems, 86 maintenance approach, 87 Mission critical engineering, 47 Mission critical facilities, 50 Mission critical facilities engineer, 83 Mission critical facilities manager, 53 Mission critical facility design, 58 Momentary average interruption frequency index (MAIFI), 35 Motor control centers, 103 n + 1 design, 34 n + 2 design, 34 National Electric Code, 90 National Fire Protection Association 1600 Standard on Disaster/Emergency Management and Business Continuity Programs, 412 National Fire Protection Association (NFPA), 82 National Infrastructure Protection Plan, 404 National Strategy for the Physical Protection of Critical Infrastructures and Key Assets, 403 NFPA 70E, 90 North American Electric Reliability Corporation (NERC) Critical Infrastructure Protection Program, 405 Northeast blackout of 2003, 36 One-line diagrams, 89 Online training program, 19 Operation and maintenance, 19 Operations documents, 13 OSHA Regulation 29 CFR Subpart S.1910.333, 90 Owner's project requirements (OPRs), 11 Panel boards, 103 Personal protective equipment (PPE), 92 classification of hazard/risk category, 92 description, 92 frequency of PPE testing, 95 glove and boot classification, 95 inspection, 94 minimum recommended, 95
suggested codes and standards, 98 types, 93 working near generators, 94 working with batteries, 96 working with fuel oil, 96 Policies and regulations, 393 Power distribution units, 105 Power monitoring, 215 Power outages, 32 Power problems, 202 blackouts, 206 brownouts, 206 causes and effects of, 203 causes of power line disturbances, 207 electrostatic discharge, 208 EMI and RFI, 212 harmonics, 208 lightning, 207 utility outages, 212 overvoltage, 206 power line disturbance levels, 212 power quality transients, 202 RMS variations, 204 tolerances of computer equipment, 212 CBEMA curve, 214 ITIC curve, 215 purpose of curves, 215 undervoltage, 205 voltage fluctuations, 206 voltage imbalance, 206 voltage notches, 207 voltage sags, 204 voltage swells, 205 Power quality, 193 Power transfer switch, 149 automatic, 150, 153 break-before-make, 153 breaker pair, 163 closed transition, 156 closed with bypass isolation, 157 delayed, 159 open transition, 153 open transition with bypass isolation, 155 soft-load, 160 carry full rated current continuously, 167 close against high in-rush currents, 166 control devices, 163 design features, 166
drawings and manuals, 173 exercise clock, 165 general recommendations, 176 in-phase monitor, 164 installation and commissioning, 168 component verification, 169 individual system operation verification, 170 integrated system operation verification (site acceptance test), 170 shipment options, 169 system construction verification, 170 interrupt current, 167 maintenance and safety, 170 maintenance tasks, 173 manual, 152 NEMA classification, 167 seismic requirement, 168 sizing, 168 system voltage ratings, 168 technology, 151 test switches, 165 testing and training, 173 time delays, 163 engine cool-down, 164 engine warm-up, 164 return to utility, 164 start, 163 transfer, 163 voltage and frequency sensing controls, 166 withstand and closing rating (WCR), 167 Predictive maintenance, 87 President's Critical Infrastructure Protection Board, 26 Preventive maintenance (PM), 52, 87 annual, 88 Private Sector Preparedness Act, 414 Probabilistic risk assessment (PRA), 54 Protecting critical data through security and vaulting, 417 Quantifying reliability and availability, 54 Raised access floors, 317 access floor maintenance, 340 airflow distribution and CFD analysis, 347 airflow method, 326 airflow requirements, 326
Raised access floors (continued) cleaning, 342 damp-mopping, 342 design considerations, 319 energy efficiency, 346 fire protection, 337 floor finish, 325 air leakage, 326 conductive and static-dissipative floor tile, 326 grate panels, 326 low-static-generation floor tile, 325 perforated panels, 326 grounding, 336 hot and cold air containment, 346 LEED certification, 346 out-of-square stringer grid, 344 panel cutting, 338 cutout locations, 338 cutting and installing the trim, 340 guidelines for, 338 installing protective trim around cut edges, 340 interior cutout procedure, 339 round cutout procedure, 339 safety requirements, 338 saws and blades, 339 panel lipping condition, 343 pedestal height adjustments, 343 protection of the floor from heavy loads, 331 removal and reinstallation of panels, 328 required finished floor height, 322 rocking panel condition, 343 safety concerns, 328 stringer systems, 330 structural performance required, 320 impact load, 322 rolling load, 320 ultimate load, 322 uniform load, 321 tight floor or loose floor, 345 tipping at perimeter panels, 345 troubleshooting, 343 twisted grid, 344 typical applications, 318 understructure support design, 323 zinc whiskers, 337 Redundancy, 50
Reliability, 48, 52 analysis, 54 design team, 54 IEEE Gold Book, 54 IEEE Standard 493-1997, Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems, 54 modeling, 54 predictions, 54 terminology, 55 Reliability centered maintenance (RCM), 82, 87 Renewable energy, 68 Risk analysis and improvement, 22, 45 Risk assessment, 4 Risk management assessment (RMA), 5 Risk tolerance, 48 Risks related to information security, 29 Sarbanes-Oxley Act (SOX), 2 Short-circuit study (SCS), 60 cables, 62 electric utility, 61 in-service motors, 61 on-site generation, 61 reactors, 62 transformers, 61 UPS systems, 61 Smart grid, 30, 42 Solar power, 68 Sound practices to strengthen the resilience of the U.S. financial system, 405 Standard operating procedures (SOP), 13, 18 Standards and benchmarking, 21 Standby generators, 109 automatic and manual switchgear, 117 cold start and load acceptance, 120 control system, 117 coolant system, 117 documentation and forms, 118 emergency procedures, 119 emergency, legally required, and optional systems, 111 engine, 116 generator control cabinet, 120 generator mechanics, 117 load bank testing, 118 lockout/tagout, 114
maintenance procedures, 115 management commitment, 113 necessity for, 110 nonlinear load problems, 121 automatic transfer switch, 123 frequency fluctuation, 122 line notches and harmonic current, 121 step loading, 121 synchronizing to bypass, 122 voltage rise, 122 optional, 113 record keeping, 119 record keeping and data trending, 116 training, 115 understanding your power requirements, 113 Static transfer switch, 179 downstream device monitoring, 185 eliminating human errors, 187 fault tolerance and abnormal operation, 189 human engineering, 187 location and type, 184 major components, 180 circuit breakers, 180 silicon-controlled rectifiers, 181 monitoring, data logging, and data management, 184 one-line diagram, 181 bypass operation, 182 normal operation, 181 primary system, 184 reliability and availability, 187 remote communication, 185 repairability and maintainability, 189 secondary system, 184 security, 186 technology and application, 183 testing, 190 transformer configurations, 183 Sustained average interruption frequency index (SAIFI), 35 Testing and commissioning, 10 acceptance phase, 11 construction phase, 11 design phase, 11 occupancy and operations phase, 11 predesign phase, 11
Thermal monitoring, 100 Thermal scanning, 100 Thévenin equivalent circuit, 61 Time current curve (TCC), 63 Training, 13 Transformers, 105 24/7 facilities, 48 Uninterruptible Power Systems (UPS), 105 electrical tests, 106 test values, 106 visual and mechanical inspection, 106 UPS systems, 223 availability calculations, 250 battery rundown test, 261 components of, 232 delta conversion, 227 description of, 228 designs, 226 diesel, 243 distributed redundant/catcher, 249 double conversion, 230 double-conversion UPS power path, 231 dual output concept, 244 eco mode, 249 emergence of the IGBT, 240 evolution of, 240 filter integrity and testing, 260 functional load testing, 258 harmonic analysis and testing, 259 hybrid, 244 isolated redundant configuration, 246 maintenance and testing, 256, 262 management, 262 module fault test, 261 N + 1 configuration, 246 N+2 configuration, 246 N configuration, 245 offline (standby), 226, 239 online, 230 online line interactive, 238 physical preventive maintenance (PM), 257 power control devices, 232 efficiency, 238 input filter, 236 insulated gate bipolar transistors (IGBT), 233 inverter, 237
UPS systems (continued) output filter, 238 rectifier/battery charger, 234 silicon-controlled rectifier (SCR), 232 static bypass, 238 protection settings, calibration, and guidelines, 258 purpose of, 225 redundancy, 245 rotary, 242 single conversion, 227 static, 229 steady-state load test, 259 steady-state load test at 0%, 50%, and 100% of load, 259
topology, 245 transient response load test, 261 2(N + 1) configuration, 248 2N configuration, 248 two- and three-level rectifier/inverter topology, 241 Uptime tiers, 51 U.S. PATRIOT Act, 2, 396 U.S. Securities and Exchange Commission (SEC), 5, 405 Web-based information management systems, 36 Zinc whiskers, 337
IEEE Press Series on Power Engineering
1. Principles of Electric Machines with Power Electronic Applications, Second Edition, M. E. El-Hawary
2. Pulse Width Modulation for Power Converters: Principles and Practice, D. Grahame Holmes and Thomas Lipo
3. Analysis of Electric Machinery and Drive Systems, Second Edition, Paul C. Krause, Oleg Wasynczuk, and Scott D. Sudhoff
4. Risk Assessment of Power Systems: Models, Methods, and Applications, Wenyuan Li
5. Optimization Principles: Practical Applications to the Operations of Markets of the Electric Power Industry, Narayan S. Rau
6. Electricity Economics: Regulation and Deregulation, Geoffrey Rothwell and Tomas Gomez
7. Electric Power Systems: Analysis and Control, Fabio Saccomanno
8. Electrical Insulation for Rotating Machines: Design, Evaluation, Aging, Testing, and Repair, Greg Stone, Edward A. Boulter, Ian Culbert, and Hussein Dhirani
9. Signal Processing of Power Quality Disturbances, Math H. J. Bollen and Irene Y. H. Gu
10. Instantaneous Power Theory and Applications to Power Conditioning, Hirofumi Akagi, Edson H. Watanabe, and Mauricio Aredes
11. Maintaining Mission Critical Systems in a 24/7 Environment, Second Edition, Peter M. Curtis
12. Elements of Tidal-Electric Engineering, Robert H. Clark
13. Handbook of Large Turbo-Generator Operation and Maintenance, Second Edition, Geoff Klempner and Isidor Kerszenbaum
14. Introduction to Electrical Power Systems, Mohamed E. El-Hawary
15. Modeling and Control of Fuel Cells: Distributed Generation Applications, M. Hashem Nehrir and Caisheng Wang
16. Power Distribution System Reliability: Practical Methods and Applications, Ali A. Chowdhury and Don O. Koval
17. Introduction to FACTS Controllers: Theory, Modeling, and Applications, Kalyan K. Sen and Mey Ling Sen
18. Economic Market Design and Planning for Electric Power Systems, James Momoh and Lamine Mili
19. Operation and Control of Electric Energy Processing Systems, James Momoh and Lamine Mili
20. Restructured Electric Power Systems: Analysis of Electricity Markets with Equilibrium Models, Xiao-Ping Zhang
21. An Introduction to Wavelet Modulated Inverters, S. A. Saleh and M. Azizur Rahman
22. Probabilistic Transmission System Planning, Wenyuan Li
23. Control of Electric Machine Drive Systems, Seung-Ki Sul
24. High Voltage and Electrical Insulation Engineering, Ravindra Arora and Wolfgang Mosch
25. Practical Lighting Design with LEDs, Ron Lenk and Carol Lenk
26. Electricity Power Generation: The Changing Dimensions, Digambar M. Tagare
27. Electric Distribution Systems, Abdelhay A. Sallam and Om P. Malik
28. Maintaining Mission Critical Systems in a 24/7 Environment, Second Edition, Peter M. Curtis
29. Power Conversion and Control of Wind Energy Systems, Bin Wu, Yongqiang Lang, Navid Zargari, and Samir Kouro

Forthcoming Titles
Doubly Fed Induction Machine: Modeling and Control for Wind Energy Generation Applications, Gonzalo Abad, Jesus Lopez, Miguel Rodriguez, Luis Marroyo, and Grzegorz Iwanski