IFIP Advances in Information and Communication Technology
367
Editor-in-Chief A. Joe Turner, Seneca, SC, USA
Editorial Board
Foundations of Computer Science: Mike Hinchey, Lero, Limerick, Ireland
Software: Theory and Practice: Bertrand Meyer, ETH Zurich, Switzerland
Education: Arthur Tatnall, Victoria University, Melbourne, Australia
Information Technology Applications: Ronald Waxman, EDA Standards Consulting, Beachwood, OH, USA
Communication Systems: Guy Leduc, Université de Liège, Belgium
System Modeling and Optimization: Jacques Henry, Université de Bordeaux, France
Information Systems: Jan Pries-Heje, Roskilde University, Denmark
Relationship between Computers and Society: Jackie Phahlamohlaka, CSIR, Pretoria, South Africa
Computer Systems Technology: Paolo Prinetto, Politecnico di Torino, Italy
Security and Privacy Protection in Information Processing Systems: Kai Rannenberg, Goethe University Frankfurt, Germany
Artificial Intelligence: Tharam Dillon, Curtin University, Bentley, Australia
Human-Computer Interaction: Annelise Mark Pejtersen, Center of Cognitive Systems Engineering, Denmark
Entertainment Computing: Ryohei Nakatsu, National University of Singapore
IFIP – The International Federation for Information Processing

IFIP was founded in 1960 under the auspices of UNESCO, following the First World Computer Congress held in Paris the previous year. An umbrella organization for societies working in information processing, IFIP's aim is two-fold: to support information processing within its member countries and to encourage technology transfer to developing nations. As its mission statement clearly states, IFIP's mission is to be the leading, truly international, apolitical organization which encourages and assists in the development, exploitation and application of information technology for the benefit of all people.

IFIP is a non-profitmaking organization, run almost solely by 2500 volunteers. It operates through a number of technical committees, which organize events and publications. IFIP's events range from an international congress to local seminars, but the most important are:

• The IFIP World Computer Congress, held every second year;
• Open conferences;
• Working conferences.

The flagship event is the IFIP World Computer Congress, at which both invited and contributed papers are presented. Contributed papers are rigorously refereed and the rejection rate is high. As with the Congress, participation in the open conferences is open to all and papers may be invited or submitted. Again, submitted papers are stringently refereed.

The working conferences are structured differently. They are usually run by a working group and attendance is small and by invitation only. Their purpose is to create an atmosphere conducive to innovation and development. Refereeing is less rigorous and papers are subjected to extensive group discussion.

Publications arising from IFIP events vary. The papers presented at the IFIP World Computer Congress and at open conferences are published as conference proceedings, while the results of the working conferences are often published as collections of selected and edited papers.

Any national society whose primary activity is in information may apply to become a full member of IFIP, although full membership is restricted to one society per country. Full members are entitled to vote at the annual General Assembly. National societies preferring a less committed involvement may apply for associate or corresponding membership. Associate members enjoy the same benefits as full members, but without voting rights. Corresponding members are not represented in IFIP bodies. Affiliated membership is open to non-national societies, and individual and honorary membership schemes are also offered.
Jonathan Butts Sujeet Shenoi (Eds.)
Critical Infrastructure Protection V 5th IFIP WG 11.10 International Conference on Critical Infrastructure Protection, ICCIP 2011 Hanover, NH, USA, March 23-25, 2011 Revised Selected Papers
Volume Editors

Jonathan Butts, Air Force Institute of Technology, Wright-Patterson Air Force Base, Dayton, OH 45433-7765, USA. E-mail: [email protected]

Sujeet Shenoi, University of Tulsa, Department of Computer Science, Tulsa, OK 74104-3189, USA. E-mail: [email protected]
ISSN 1868-4238 e-ISSN 1868-422X ISBN 978-3-642-24863-4 e-ISBN 978-3-642-24864-1 DOI 10.1007/978-3-642-24864-1 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011938839 CR Subject Classification (1998): D.4.6, K.6.5, E.3, C.2, H.4, H.3, I.6
Contents

PART I  THEMES AND ISSUES

1 Using Deception to Shield Cyberspace Sensors
  Mason Rice, Daniel Guernsey and Sujeet Shenoi

2 Botnets as an Instrument of Warfare
  Eric Koziel and David Robinson

PART II  CONTROL SYSTEMS SECURITY

3 Lightweight Intrusion Detection for Resource-Constrained Embedded Control Systems
  Jason Reeves, Ashwin Ramaswamy, Michael Locasto, Sergey Bratus and Sean Smith

4 A Plant-Wide Industrial Process Control Security Problem
  Thomas McEvoy and Stephen Wolthusen

5 Identifying Vulnerabilities in SCADA Systems via Fuzz-Testing
  Rebecca Shapiro, Sergey Bratus, Edmond Rogers and Sean Smith

6 Security Analysis of VPN Configurations in Industrial Control Environments
  Sanaz Rahimi and Mehdi Zargham

PART III  INFRASTRUCTURE SECURITY

7 Implementing Novel Defense Functionality in MPLS Networks Using Hyperspeed Signaling
  Daniel Guernsey, Mason Rice and Sujeet Shenoi

8 Creating a Cyber Moving Target for Critical Infrastructure Applications
  Hamed Okhravi, Adam Comella, Eric Robinson, Stephen Yannalfo, Peter Michaleas and Joshua Haines

9 An Evidence-Based Trust Assessment Framework for Critical Infrastructure Decision Making
  Yujue Wang and Carl Hauser

10 Enhancing the Usability of the Commercial Mobile Alert System
  Paul Ngo and Duminda Wijesekera

11 Real-Time Detection of Covert Channels in Highly Virtualized Environments
  Anyi Liu, Jim Chen and Li Yang

PART IV  INFRASTRUCTURE MODELING AND SIMULATION

12 Analyzing Cyber-Physical Attacks on Networked Industrial Control Systems
  Bela Genge, Igor Nai Fovino, Christos Siaterlis and Marcelo Masera

13 Using an Emulation Testbed for Operational Cyber Security Exercises
  Christos Siaterlis, Andres Perez-Garcia and Marcelo Masera

14 Analyzing Intelligence on WMD Attacks Using Threaded Event-Based Simulation
  Qi Fang, Peng Liu, John Yen, Jonathan Morgan, Donald Shemanski and Frank Ritter
Contributing Authors
Sergey Bratus is a Research Assistant Professor of Computer Science at Dartmouth College, Hanover, New Hampshire. His research interests include Linux kernel security, wireless network security and security-related visualization tools.

Jim Chen is a Professor of Computer Science at George Mason University, Fairfax, Virginia. His research interests include computer graphics, networking and visualization.

Adam Comella is an undergraduate student in Computer Science at Rensselaer Polytechnic Institute, Troy, New York. His research interests include secure systems, open source software applications and operating systems.

Qi Fang is an M.S. student in Information Sciences and Technology at Pennsylvania State University, University Park, Pennsylvania. Her research interests are in the area of network science.

Bela Genge is a Post-Doctoral Researcher at the Institute for the Protection and Security of the Citizen, Joint Research Centre of the European Commission, Ispra, Italy. His research interests include critical infrastructure protection, design methods and composition of security protocols.

Daniel Guernsey received his Ph.D. degree in Computer Science from the University of Tulsa, Tulsa, Oklahoma. His research interests include information assurance, and network and telecommunications systems security.

Joshua Haines is an Assistant Leader of the Cyber Systems and Technology Group at MIT Lincoln Laboratory, Lexington, Massachusetts. His research interests include system analysis, secure and robust architectures, network-centric cyber systems and automated network vulnerability analysis.
Carl Hauser is an Associate Professor of Computer Science at Washington State University, Pullman, Washington. His research interests include concurrent and distributed systems, especially as applied to secure wide-area control systems.

Eric Koziel is an M.S. student in Computer Science at the Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio. His research interests include offensive and defensive cyber security analysis.

Anyi Liu is a Ph.D. student in Information Technology at George Mason University, Fairfax, Virginia. His research interests include information assurance, and intrusion detection and correlation.

Peng Liu is a Professor of Information Sciences and Technology and Director of the Center for Cyber Security, Information Privacy and Trust at Pennsylvania State University, University Park, Pennsylvania. His research interests include computer security and network security.

Michael Locasto is an Assistant Professor of Computer Science at the University of Calgary, Alberta, Canada. His research interests include machine intelligence and trustworthy systems.

Marcelo Masera is the Head of the Energy Security Unit at the Institute for Energy, Joint Research Centre, Petten, The Netherlands. His research interests include securing networked systems and systems of systems, risk governance and control systems security.

Thomas McEvoy is a Ph.D. student in Mathematics at Royal Holloway, University of London, London, United Kingdom; and a Technical Manager at HP Information Security, Bracknell, United Kingdom. His research interests include the modeling and simulation of critical infrastructures and hybrid systems in relation to security properties.

Peter Michaleas is a Systems Engineer in the Embedded and High Performance Computing Group at MIT Lincoln Laboratory, Lexington, Massachusetts. His research interests include kernel development and high performance computing.

Jonathan Morgan is a Research Assistant and Manager of the Applied Cognitive Science Laboratory at Pennsylvania State University, University Park, Pennsylvania. His research interests include modeling small-team dynamics, the effects of social moderators and organizational learning.
Igor Nai Fovino is the Head of the Research Division of the Global Cyber Security Center, Rome, Italy. His research interests include critical infrastructure protection, intrusion detection, secure communication protocols and industrial informatics.

Paul Ngo is a Ph.D. student in Computer Science at George Mason University, Fairfax, Virginia; and the Next Generation Network (NGN) Security Lead at the National Communications System in Arlington, Virginia. His research interests are in the area of emergency communications systems.

Hamed Okhravi is a Technical Staff Member in the Cyber Systems and Technology Group at MIT Lincoln Laboratory, Lexington, Massachusetts. His research interests include cyber security, cyber trust, high assurance systems, virtualization and operating systems.

Andres Perez-Garcia is a Network Security Specialist at the Institute for the Protection and Security of the Citizen, Joint Research Centre of the European Commission, Ispra, Italy. His research interests include inter-domain routing protocols and critical information infrastructure protection.

Sanaz Rahimi is a Ph.D. candidate in Computer Science at Southern Illinois University, Carbondale, Illinois. Her research interests include cyber security, software reliability and cyber trust.

Ashwin Ramaswamy is an Analyst at Bloomberg, New York. His research interests include operating system security and patch deployment systems.

Jason Reeves is an M.S. student in Computer Science at Dartmouth College, Hanover, New Hampshire. His research interests include system security and human-computer interaction.

Mason Rice received his Ph.D. degree in Computer Science from the University of Tulsa, Tulsa, Oklahoma. His research interests include network and telecommunications systems security, and cyberspace deterrence strategies.

Frank Ritter is a Professor of Information Sciences and Technology, Computer Science and Engineering, and Psychology at Pennsylvania State University, University Park, Pennsylvania. His research interests include models of cognition and cognitive architectures.
David Robinson is an Assistant Professor of Computer Engineering at the Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio. His research interests include cyber and cyber physical systems security.

Eric Robinson is a Technical Staff Member in the Embedded and High Performance Computing Group at MIT Lincoln Laboratory, Lexington, Massachusetts. His research interests include high performance computing, distributed systems and compilers.

Edmond Rogers is a Smart Grid Cyber Security Engineer at the University of Illinois Information Trust Institute, Urbana, Illinois. His research interests include critical infrastructure vulnerability assessment, penetration testing of SCADA systems and persistent attack detection.

Rebecca Shapiro is a Ph.D. student in Computer Science at Dartmouth College, Hanover, New Hampshire. Her research interests are in the area of systems security.

Donald Shemanski is a Professor of Practice of Information Sciences and Technology at Pennsylvania State University, University Park, Pennsylvania. His research interests include information law and policy, privacy law, system science and global prescience.

Sujeet Shenoi is the F.P. Walter Professor of Computer Science at the University of Tulsa, Tulsa, Oklahoma. His research interests include information assurance, digital forensics, critical infrastructure protection, reverse engineering and intelligent control.

Christos Siaterlis is a Scientific Officer at the Institute for the Protection and Security of the Citizen, Joint Research Centre of the European Commission, Ispra, Italy. His research interests include the security, stability and resilience of computer networks.

Sean Smith is a Professor of Computer Science at Dartmouth College, Hanover, New Hampshire. His research interests include trusted computing and usable security.

Yujue Wang is a Ph.D. student in Computer Science at Washington State University, Pullman, Washington. His research interests include trust assessment, network security and distributed systems.
Duminda Wijesekera is an Associate Professor of Information and Software Engineering at George Mason University, Fairfax, Virginia. His research interests include information, network, telecommunications and control systems security.

Stephen Wolthusen is a Professor of Information Security at the Norwegian Information Security Laboratory, Gjovik University College, Gjovik, Norway; and a Reader in Mathematics at Royal Holloway, University of London, London, United Kingdom. His research interests include critical infrastructure modeling and simulation, and network and distributed systems security.

Li Yang is an Associate Professor of Computer Science at the University of Tennessee at Chattanooga, Chattanooga, Tennessee. Her research interests include computer security, software design and engineering, and database management.

Stephen Yannalfo is a Subcontractor in the Cyber Systems and Technology Group at MIT Lincoln Laboratory, Lexington, Massachusetts. His research interests include software engineering and virtualization.

John Yen is a University Professor and Director of Strategic Research Initiatives for the College of Information Sciences and Technology at Pennsylvania State University, University Park, Pennsylvania. His research interests include cognitive agents, social network analysis and artificial intelligence.

Mehdi Zargham is a Professor and Chair of Computer Science at Southern Illinois University, Carbondale, Illinois. His research interests include mobile learning, pattern recognition and data mining.
Preface
The information infrastructure – comprising computers, embedded devices, networks and software systems – is vital to operations in every sector: information technology, telecommunications, energy, banking and finance, transportation systems, chemicals, agriculture and food, defense industrial base, public health and health care, national monuments and icons, drinking water and water treatment systems, commercial facilities, dams, emergency services, commercial nuclear reactors, materials and waste, postal and shipping, and government facilities. Global business and industry, governments, indeed society itself, cannot function if major components of the critical information infrastructure are degraded, disabled or destroyed.

This book, Critical Infrastructure Protection V, is the fifth volume in the annual series produced by IFIP Working Group 11.10 on Critical Infrastructure Protection, an active international community of scientists, engineers, practitioners and policy makers dedicated to advancing research, development and implementation efforts related to critical infrastructure protection. The book presents original research results and innovative applications in the area of infrastructure protection. Also, it highlights the importance of weaving science, technology and policy in crafting sophisticated, yet practical, solutions that will help secure information, computer and network assets in the various critical infrastructure sectors.

This volume contains fourteen edited papers from the Fifth Annual IFIP Working Group 11.10 International Conference on Critical Infrastructure Protection, held at Dartmouth College, Hanover, New Hampshire, March 23–25, 2011. The papers were refereed by members of IFIP Working Group 11.10 and other internationally-recognized experts in critical infrastructure protection. The chapters are organized into four sections: themes and issues, control systems security, infrastructure security, and infrastructure modeling and simulation. The coverage of topics showcases the richness and vitality of the discipline, and offers promising avenues for future research in critical infrastructure protection.

This book is the result of the combined efforts of several individuals and organizations. In particular, we thank Daniel Guernsey, Heather Drinan and Nicole Hall Hewett for their tireless work on behalf of IFIP Working Group 11.10. We gratefully acknowledge the Institute for Information Infrastructure
Protection (I3P), managed by Dartmouth College, for supporting IFIP Working Group 11.10. We also thank the Department of Homeland Security and the National Security Agency for their support of IFIP Working Group 11.10 and its activities. Finally, we wish to note that all opinions, findings, conclusions and recommendations in the chapters of this book are those of the authors and do not necessarily reflect the views of their employers or funding agencies.

JONATHAN BUTTS AND SUJEET SHENOI
Chapter 1

USING DECEPTION TO SHIELD CYBERSPACE SENSORS

Mason Rice, Daniel Guernsey and Sujeet Shenoi

Abstract
The U.S. President's Comprehensive National Cybersecurity Initiative calls for the deployment of sensors to help protect federal enterprise networks. Because of the reported cyber intrusions into America's electric power grid and other utilities, there is the possibility that sensors could also be positioned in key privately-owned infrastructure assets and the associated cyberspace. Sensors provide situational awareness of adversary operations, but acting directly on the collected information can reveal key sensor attributes such as modality, location, range, sensitivity and credibility. The challenge is to preserve the secrecy of sensors and their attributes while providing defenders with the freedom to respond to the adversary's operations. This paper presents a framework for using deception to shield cyberspace sensors. The purpose of deception is to degrade the accuracy of the adversary's beliefs regarding the sensors, give the adversary a false sense of completeness, and/or cause the adversary to question the available information. The paper describes several sensor shielding tactics, plays and enabling methods, along with the potential pitfalls. Well-executed and nuanced deception with regard to the deployment and use of sensors can help a defender gain tactical and strategic superiority in cyberspace.
1. Introduction

At 6:00 a.m., just before power consumption reaches its peak, a computer security expert at an electric power utility receives the text message, "Fireball Express," indicating that a cyber operation is being executed on the utility's assets. The expert is a covert government agent, who is embedded in the power utility to monitor cybersecurity breaches. Only the CEO of the company is aware of her status as a government agent.
Months earlier, the embedded agent created a honeynet at the utility to draw cyber operations conducted by adversaries. The honeynet presents an intruder with a carbon copy of the utility's SCADA systems. Meanwhile, to enhance situational awareness, U.S. intelligence has secretly implanted sensors in core Internet routers across America. The "Fireball Express" alert was triggered by correlating information gathered from the honeynet and the Internet sensors. The analysis indicates that the operations are being conducted by a nation state adversary.

U.S. officials must act quickly. Directly confronting the nation state adversary about the intrusion at the utility could reveal the existence of the honeynet and, possibly, the presence of the embedded agent. How can the U.S. maintain the secrecy of its sensors while responding strongly to the intrusion?

This paper presents a framework for using deception to shield cyberspace sensors from an adversary. In particular, it categorizes cyberspace sensors and their attributes, outlines sensor shielding tactics, plays and enabling methods, and discusses the potential pitfalls. Well-executed deception can shape the beliefs of the adversary to the advantage of the defender, enabling some or all of the sensor attributes to be shielded while providing an opportunity for the defender to confront the adversary about its cyber operations. The paper discusses several examples of deception and presents options for shielding sensors, including the sensors in the fictional Fireball Express scenario at the electric power utility.
2. Sensors and Deception
Sensors provide information about the state of an environment of interest and the activities of entities in the environment. Sensors are characterized by their modality, location, range, sensitivity and credibility. The modality of a sensor refers to its detection mechanism (e.g., electronic, thermal, magnetic, radiant and chemical) [9]. The location and range of a sensor specify the location and space in which the sensor can operate effectively. Sensitivity refers to the ability of a sensor to detect stimuli and signals; cyberspace sensors may be tuned to detect specific viruses and worms, rootkits and network probes. The credibility of a sensor is a function of its reliability and durability. Reliability refers to the ability of a sensor to correctly measure the parameter of interest while durability refers to the ruggedness of the sensor and its tamper resistance. The attributes of a sensor determine its secrecy. In general, if one attribute of a sensor is classified, the existence and/or use of the sensor may be classified [4]. However, the existence of a sensor may be public knowledge, but its attributes could be classified. For example, the location, basic sensitivity, credibility and one of the modalities (magnetic) of the U.S. underwater sound surveillance system (SOSUS) may be known, but its true sensitivity and other modalities are closely guarded secrets [11]. The importance of maintaining the secrecy of sensors cannot be overstated. Scholars believe that the shroud of secrecy surrounding U.S. and Soviet satellite reconnaissance capabilities may have led to the Strategic Arms Limitation Talks
(SALT) I and II in the 1960s and 1970s. Shortly after one of the talks, the U.S. publicly acknowledged its use of satellite reconnaissance without providing details about the specific modalities (e.g., optical and electrical) and sensitivity. Although the Soviets released little information about their capabilities, it was widely believed that they had the ability to monitor U.S. compliance of the arms limitation agreements. As a result, the SALT documents used the ambiguous phrase “national technical means of verification” [8]. Sensor secrecy and the resulting uncertainty in the monitoring capabilities of the two countries likely facilitated the SALT agreements during the height of the Cold War. When using any instrument of national power – diplomacy, information, military and economics – it is often necessary to manipulate the response to sensor signals in order to mask one or more sensor attributes. Reacting in an obvious, unnuanced manner to sensor data about an adversary can compromise the sensor. For example, Al Qaeda was quick to recognize after attacks by U.S. forces in Afghanistan that the U.S. could track targets based on cell phone signals and other electronic transmissions. As a result, Osama bin Laden and other terrorists resorted to sending messages via courier [12]. Historically, deception has been used very effectively when exerting instruments of national power [5, 13]. Deception increases the freedom of action to carry out tasks by diverting the adversary’s attention. Deception can persuade an adversary to adopt a course of action that potentially undermines its position. Also, deception can help gain surprise and conserve resources. This paper discusses how deception can be used to obscure one or more attributes of cyberspace sensors.
3. Deception Framework
A deception strategy should deliberately present misleading information that degrades the accuracy of the adversary’s beliefs, give the adversary a false sense of completeness, and/or cause the adversary to misjudge the available information and misallocate operational or intelligence resources. With regard to preserving sensor secrecy, the goal of deception is, very simply, to cause the adversary to have incorrect or inaccurate impressions about the modality, location, range, sensitivity and/or credibility of the sensor. Figure 1 illustrates the goal of a deception strategy that seeks to preserve sensor secrecy. The white squares at the bottom of the figure represent the true sensor attributes that are known to the defender. The black squares at the top of the figure denote a combination of true, assumed or false sensor attributes that the defender wants the adversary to believe. To accomplish this goal, the defender creates a “deception play,” represented by the black circles in the middle of the figure. The deception play provides false information about the modality and location of the sensor, no information about the sensor range, and true information about the sensitivity and credibility of the sensor. Note that the adversary may already hold certain beliefs about the sensor attributes prior to the execution of the deception play by the defender.
[Figure 1. Deceiving the adversary. A deception play (false modality, false location, omitted range, true sensitivity, true credibility) maps the sensor's true attributes to the beliefs the defender wants the adversary to hold: false modality, false location, guessed range, true sensitivity, true credibility.]
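The mapping in Figure 1 from true attributes, through a play, to the view presented to the adversary can be written down as a small data model. The following sketch is our own illustration in Python (the paper contains no code); the class names, attribute values and cover story are hypothetical.

    from dataclasses import dataclass
    from enum import Enum

    class Tactic(Enum):
        TRUE = "true"        # reveal the real value
        FALSE = "false"      # present a fabricated value
        OMITTED = "omitted"  # reveal nothing; the adversary must guess

    ATTRIBUTES = ("modality", "location", "range", "sensitivity", "credibility")

    @dataclass
    class DeceptionPlay:
        tactics: dict       # attribute -> Tactic applied to it
        cover_story: dict   # fabricated values for attributes tagged FALSE

        def presented_view(self, true_attributes: dict) -> dict:
            """Derive the view of the sensor that the defender exposes."""
            view = {}
            for attr in ATTRIBUTES:
                tactic = self.tactics[attr]
                if tactic is Tactic.TRUE:
                    view[attr] = true_attributes[attr]
                elif tactic is Tactic.FALSE:
                    view[attr] = self.cover_story[attr]
                else:
                    view[attr] = None  # omitted
            return view

    # The play of Figure 1: falsify modality and location, omit range,
    # reveal sensitivity and credibility truthfully.
    sensor = {"modality": "COMINT", "location": "core router",
              "range": "national", "sensitivity": "high", "credibility": "high"}
    play = DeceptionPlay(
        tactics={"modality": Tactic.FALSE, "location": Tactic.FALSE,
                 "range": Tactic.OMITTED, "sensitivity": Tactic.TRUE,
                 "credibility": Tactic.TRUE},
        cover_story={"modality": "radar", "location": "border site"})
    print(play.presented_view(sensor))

Running the example yields a view with a fabricated modality and location, no range information, and the true sensitivity and credibility, matching the middle row of Figure 1.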
A deception play typically targets the adversary’s intelligence, surveillance and reconnaissance capabilities to shape the adversary’s beliefs [15]. The U.S. Department of Defense has adopted the “See-Think-Do” deception methodology [15]. The methodology focuses on the adversary’s cognitive processes: (i) See – what portions of the defender’s environment or activities does the adversary observe? (ii) Think – what conclusions does the adversary draw from the observations? and (iii) Do – what actions may the adversary take upon analyzing the observations?
Table 1. Passive deception techniques.

Concealment: Concealment uses natural cover, obstacles or distance to hide something from the adversary. Concealment is the earliest form of military deception. An example in the cyberspace domain is the embedding of sensors in networking gear.

Camouflage: Camouflage uses artificial means to hide something from the adversary. Note that covering military equipment with vegetation is an example of camouflage rather than concealment. An example in the cyberspace domain is the generation of artificial network traffic by a honeynet to camouflage cyber operations such as intelligence gathering.
An example of the See-Think-Do methodology is Operation Bodyguard, the deception plan instituted in advance of the D-Day invasion [15]. The Allies conducted air raids, sent fake messages and even created a phantom army to convince the German High Command that the landing point would be Pas de Calais. The German High Command saw the deceptive operations (see), believed that Calais would be the target of the assault (think), and redirected forces that could have been placed in Normandy to defend Calais instead (do). The scope of a deception play is limited by the time and resources available for its planning and execution, the adversary’s susceptibility to the deception, and the defender’s ability to measure the effectiveness of the deception. Additionally, the lack of accurate intelligence and cultural awareness can hinder a deception play. The best outcome for a deception play is for the adversary to fall for the deception. Note, however, that the defender may have a satisfactory outcome even if the play drives the adversary to believe something other than the truth.
4. Deception Constructs
This section discusses the principal deception constructs. These include the classes of deception, deception plays, deception principles and the types of information collected by the adversary.
4.1 Classes of Deception
Deception involves two basic premises, hiding something real and showing something false [5]. This gives rise to two classes of deception: passive and active.

Passive Deception: Passive deception focuses on hiding. It tends to be "safer" than active deception because it does not seek to instigate action on the part of the adversary [5]. Techniques for hiding include concealment and camouflage (Table 1).
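To make the camouflage example from Table 1 concrete, the following toy sketch (Python; our own illustration under assumed parameters, not a technique from the paper) pads a channel with decoy transmissions at random times, so that genuine sensor reports do not stand out to a traffic analyst.

    import random

    def decoy_schedule(real_times, horizon, rate):
        """Blend genuine sensor reports with Poisson-spaced decoy sends.

        real_times: seconds at which genuine reports must be sent.
        horizon:    length of the scheduling window in seconds.
        rate:       mean decoy transmissions per second.
        The decoys raise the background level so that an observer
        watching the wire cannot tell which transmissions carry reports.
        """
        events = [(t, "report") for t in real_times]
        t = 0.0
        while True:
            t += random.expovariate(rate)  # exponential inter-arrival times
            if t >= horizon:
                break
            events.append((t, "decoy"))
        return sorted(events)

    for when, kind in decoy_schedule([12.5, 47.0], horizon=60.0, rate=0.2):
        print(f"{when:6.2f}s  send {'padding' if kind == 'decoy' else 'report'}")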
Table 2. Active deception techniques.

Planting False Information: The adversary obtains information that results in an incorrect or inaccurate belief. An adversary can be fed false information, for example, via a newspaper article or an Internet posting.

Ruse: The defender impersonates the actions or capabilities of another entity to cause the adversary to have an incorrect or inaccurate belief. An example is the delivery of fake orders and status reports in the enemy's language. A cyberspace example involves spoofing the return IP addresses of packets.

Display: The defender makes the adversary see or believe something that is not there. An example is the positioning of fake artillery pieces and dummy aircraft. A cyberspace example is the generation of fake Internet traffic to create the illusion that a system has more or less capabilities than it actually has.

Demonstration: The defender conducts an operation that conveys an incorrect or inaccurate belief to the adversary. A cleverly orchestrated demonstration can lead the adversary to make a tactical or strategic error. During the year prior to the 1973 Arab-Israeli war, Egypt repeatedly moved its troops to the Israeli border, only to recall them. The Israelis were conditioned by the demonstrations, and were thoroughly surprised when the Egyptians invaded. A cyberspace example involves the defender performing repeated probes of the adversary's network before escalating its activities and corrupting a key asset.

Lying: The defender tells a lie, which causes the adversary to have an incorrect or inaccurate belief.
Active Deception: Active deception focuses on showing something (e.g., knowledge and capabilities) that is not real [5]. It tends to be more “risky” than passive deception because it seeks to instigate action on the part of the adversary. Active deception techniques include planting information, ruses, displays, demonstrations and lying (Table 2).
4.2 Deception Plays
Insight into the thought process of the adversary enables the defender to outthink the adversary [5, 6]. An example of engaging insight is the use of absolute truth in a deception play. Absolute truth involves telling the truth in a situation where the adversary is unlikely to believe the truth – perhaps because of a strong prior belief. Another example is omission, which involves the exclusion of some information. Omission is common in politics, especially during an election campaign when a partial revelation of an opponent’s voting record can gain votes. Omission also can be used to hide contrary evidence
and create ambiguity, especially when the adversary is predisposed to certain beliefs.

Active and passive techniques can be used individually or in combination to create plays that are intended to deceive an adversary. Masking, misleading, mimicking and confusing are four of many possible plays that can hide the real and show the false [2, 7]. Masking may involve camouflage and concealment, while misleading may involve planting information, ruses, displays, demonstrations and lying. Misleading could be as simple as transmitting a clear, unambiguous false signal or as complex as planting information for an adversary to find, lying to a third party who will pass the information on to the adversary and conducting ruses under the guise of the adversary. Mimicking involves copying some object or behavior (i.e., ruse). Techniques for confusing an adversary include raising the noise level associated with a specific act or acts to create uncertainty and/or paralyze decision making, or purposely departing from an established pattern of activity by inserting random actions in a well-known activity.

Figure 2 presents four historical deception plays. Each deception play is expressed – as in Figure 1 – in terms of the actual sensor attributes, the deception play and the desired adversary beliefs.

[Figure 2. Example deception plays. Each panel lists the desired adversary belief over the executed deception play for the attributes Modality, Location, Range, Sensitivity and Credibility: (a) Metox: beliefs False, False, Assumed, True, True; play False, False, Omitted, True, True. (b) Melody: beliefs Assumed, Assumed, True, True, True; play Omitted, Omitted, True, True, True. (c) Weapons seizure: beliefs False, Assumed, True, True, True; play False, Omitted, True, True, True. (d) Osama bin Laden: beliefs True, Assumed, True, True, True; play True, Omitted, True, True, True.]
Metox During World War II, the British could approximate the location of German U-boats using highly sensitive communications intelligence (COMINT) and then pinpoint their exact locations using radar [6]. The Germans installed Metox radar detectors on their U-boats to enable them to evade British attack vessels. In response, the British changed the frequency of their tracking radar, and used deception to protect their COMINT sources. The British also arranged for German agents to acquire two pieces of spurious intelligence. One was that the Royal Navy had abandoned radar in favor of infrared detectors; the other was that Metox produced a signal that RAF planes could target. The Germans acted on the spurious intelligence. They developed a paint that reduced the infrared signatures of U-boats, and worked to suppress the Metox emissions. Eventually, the Germans realized that the British had merely changed their radar frequency, and they attributed the U-boat sinkings exclusively to the British radar systems. The deception play enabled the British to preserve the secrecy of their COMINT sources. Figure 2 shows that the deception story provided inaccurate information about the modality and location of the sensor, omitted range information, and revealed information about sensor sensitivity and credibility.

Melody The 1972 Anti-Ballistic Missile (ABM) treaty between the United States and the Soviet Union prohibited the development and testing of ABM systems. Soon after the treaty was ratified, the U.S. detected Soviet cheating via a highly classified feature of Project Melody that intercepted Soviet missile tracking radar signals [10]. During subsequent negotiations in Geneva, then Secretary of State Henry Kissinger confronted his Soviet counterpart with the dates and times that the Soviets had cheated on the treaty. The cheating stopped and the Soviets began a "mole hunt" for the spy who gave the information to the United States. America got its way without compromising its Melody sensors. Figure 2 shows the components of Kissinger's deception play. Note that the play omitted the modality and location of the sensors, but it was effective because the Soviets were paranoid about spies in their midst.
Weapons Seizure Deception was likely used in 2005 when the Bush administration disclosed that it worked with other nations to intercept weapons systems bound for Iran, North Korea and Syria [14]. In particular, senior Bush administration officials stated that Pakistan had “helped” track parts of the global nuclear network. By naming Pakistan as the source of the information, the U.S. hid the true modality of its sensor, omitted the sensor location and revealed its range, sensitivity and credibility (Figure 2).
Osama bin Laden The final example, involving the decision of Osama bin Laden and other terrorists to send messages by courier instead of via electronic means, cannot be characterized as deception because the U.S. had no intention of hiding its COMINT capabilities [1]. However, the example shows
how a defender can use a deception play (Figure 2) that exaggerates its sensor capabilities, bluffing an adversary into using another mode of communications that it may have already compromised.
4.3 Deception Principles
Fowler and Nesbitt [6] identify six general principles for effective tactical deception in warfare: (i) deception should reinforce the adversary's expectations; (ii) deception should have realistic timing and duration; (iii) deception should be integrated with operations; (iv) deception should be coordinated with the concealment of true intentions; (v) deception realism should be tailored to the setting; and (vi) deception should be imaginative and creative. These principles were developed for tactical deception in warfare [13], but they are clearly applicable to shielding cyberspace sensors.

Several other deception principles have been developed over time. Three of the more pertinent principles that are part of the U.S. military doctrine are:

Magruder's Principle: It is generally easier to reinforce an adversary's pre-existing belief than to deceive the adversary into changing a belief. The German Army applied this principle in the Wacht am Rhein (Watch on the Rhine) Operation during the winter of 1944. The code name led U.S. forces to believe it was a defensive operation, when in fact it was offensive in nature.

Exploiting Human Information Processing: Two limitations of human information processing can be exploited in deception plays. The first is that humans tend to draw conclusions based on small data sets, although there is no statistical justification for doing so. The second is that humans are often unable to detect small changes in a measured parameter (e.g., size of opposing forces), even though the cumulative change over time can be large.

Jones' Dilemma: Deception generally becomes more difficult as the number of sources that an adversary can use to confirm the real situation increases. However, the greater the number of sources that are manipulated, the greater the chance that the adversary will fall for the deception.

Interested readers are referred to [15] for additional details about these and other deception principles.
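The second limitation under human information processing invites a quick worked example. The numbers below are our own illustration, not the paper's:

    # Illustrative arithmetic only: a 2% day-over-day change is hard to
    # detect, yet compounds to more than a threefold change in two months.
    daily_growth = 1.02
    days = 60
    print(f"daily change: {daily_growth - 1:.0%}")            # 2%
    print(f"after {days} days: x{daily_growth ** days:.2f}")  # ~x3.28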
4.4 Adversary Information Gathering
A clever adversary is always collecting information about the defender. The information collected by the adversary can be categorized as: (i) known facts, (ii) secrets, (iii) disinformation, and (iv) mysteries [3].

Known Facts: A known fact is information that is publicly available or easily confirmed. In the past, the U.S. intelligence community would rarely release known facts. Typically, the State Department would serve as a conduit for the release of intelligence, such as Khrushchev's "secret speech" of 1956 that denounced Stalin. In contrast, the intelligence community now routinely releases information for public consumption, such as the World Factbook on the CIA's website. The defender could use known facts to bolster its deception play with elements of truth.

Secrets: A secret is information that is not intended to be known to the adversary. Examples include economic data and sensor attributes. Secret information collected by the adversary invariably contains gaps and ambiguities. It may be beneficial for the defender to design a deception play that leads the adversary to believe that a secret collected by the adversary is disinformation.

Disinformation: Disinformation can be expected to be discarded by the adversary once it is identified as disinformation. Therefore, it is imperative that the deception play be as consistent as possible to convince the adversary of the authenticity of the information. Disinformation can distort the adversary's confidence in its intelligence channels [3]. This, in turn, may affect the credibility of other adversary assessments. Paradoxically, the damage usually occurs when disinformation is successfully exposed. For example, in the late 1950s, the Soviets deliberately exaggerated their ballistic missile numbers. The deception was revealed when the first U.S. reconnaissance satellites showed that the Soviets had only deployed a few SS-6 missiles. The discovery of the deception caused U.S. analysts to doubt the credibility of other (most likely true) information they had gathered about Soviet military strength.

Mysteries: A mystery cannot be resolved by any amount of secret information collection or analysis [3]. This can occur, for example, when multiple outcomes are probable, and the number of outcomes cannot be reduced by any means available to the adversary.
5. Cyberspace Sensors
Cyberspace sensors may be used for a variety of purposes, including system monitoring, fault detection and data collection. Our focus is on sensors that detect cyber operations – the attack, defense and exploitation of electronic data, knowledge and communications. In the context of cyber operations, sensors may be placed in assets belonging to the defender, adversary and/or third parties. The sensors may be located in communications channels and networking devices such as routers, switches and access points. Sensors may also be placed in computing platforms: servers (platforms that provide services); hosts and edge devices (clients and mobile devices); and SCADA devices (e.g., programmable logic controllers and remote terminal units).
It is important to recognize that sensors may be positioned external to computing and communications assets. Examples include human beings located at control centers, and mechanical devices and physical systems that are connected to computing and communications assets. Sensors may also integrate and correlate data received from other embedded sensors.

Several types of sensors can be defined based on the adversary's knowledge and beliefs about the values of the sensor attributes:

Open Sensor: All the attributes of an open sensor are known to the adversary.

Covert Sensor: None of the attributes of a covert sensor are known to the adversary. The very existence of the sensor is hidden from the adversary.

Phantom Sensor: A phantom sensor does not exist. However, the adversary believes that the sensor exists and knows some or all of its attributes. In other words, the adversary believes it to be a non-covert sensor.

Obfuscated Sensor: An obfuscated sensor is a non-covert sensor for which the adversary has incorrect or incomplete information about at least one attribute.
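This taxonomy can be read as a function of how the adversary's beliefs line up with reality. A minimal sketch of that reading, in Python with hypothetical names (our formalization, not the authors'):

    from enum import Enum

    class SensorType(Enum):
        OPEN = "open"
        COVERT = "covert"
        PHANTOM = "phantom"
        OBFUSCATED = "obfuscated"

    def classify(exists, believed_to_exist, true_attrs, believed_attrs):
        """Classify a sensor by comparing reality against adversary belief.

        believed_attrs maps attribute name -> the adversary's belief,
        with None meaning the adversary holds no belief about it.
        """
        if exists and not believed_to_exist:
            return SensorType.COVERT      # even its existence is hidden
        if not exists and believed_to_exist:
            return SensorType.PHANTOM     # believed in, but not real
        if not exists:
            raise ValueError("no sensor and no belief: nothing to classify")
        if all(believed_attrs.get(a) == v for a, v in true_attrs.items()):
            return SensorType.OPEN        # every attribute known correctly
        return SensorType.OBFUSCATED      # known, but beliefs are wrong or partial

    attrs = {"modality": "network tap", "location": "core router"}
    print(classify(True, True, attrs,
                   {"modality": "network tap", "location": None}))
    # -> SensorType.OBFUSCATED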
6. Shielding Cyberspace Sensors
This section discusses several tactics, plays and enabling methods for shielding cyberspace sensors.
6.1 Shielding Tactics
A shielding tactic involves a single action on the part of the defender. The tactics are categorized according to the actions and their relation to reality. Active and passive deception techniques are employed to hide and/or reveal certain sensor attributes.

Revealing Tactic: A revealing tactic exposes one or more sensor attributes to the adversary.

Masking Tactic: A masking tactic uses a passive deception technique (e.g., camouflage or concealment) to hide one or more sensor attributes.

Misleading Tactic: A misleading tactic uses an active deception technique (e.g., planting false information, implementing a ruse, display or demonstration, or lying) to falsify one or more sensor attributes.

Distraction Tactic: A distraction tactic distracts or redirects the adversary's activities. This tactic should not reveal any of the sensor attributes.
6.2 Shielding Plays
Shielding plays implement one or more shielding tactics. A shielding play is categorized according to the sensor attribute values that are believed by the adversary to be true after the play is executed by the defender. The four plays described below are in conformance with the See-Think-Do methodology.

Open Sensor Play: An open sensor play reveals the correct values of all the sensor attributes to the adversary. Complete knowledge about a sensor serves as a deterrent because the adversary knows that the defender can detect an unfriendly act and may retaliate. Of course, complete knowledge about a sensor enables the adversary to take countermeasures.

Covert Sensor Play: A covert sensor play hides the existence of a sensor, including all its attribute values. Such a sensor is similar to the "gatekeeper" submarine that was secretly positioned near a Soviet port to collect data about Soviet nuclear submarines. A covert sensor has limited use (on its own) because it is often the case that the adversary needs to know that some type of sensor exists to detect an unfriendly act on the part of the adversary.

Phantom Sensor Play: A phantom sensor play is designed to convince the adversary that the defender has a sensor that, in reality, does not exist. A phantom sensor play could implement a misleading tactic that involves the defender being told about the adversary's activities by a third party, but revealing to the adversary that the activities were detected by a sophisticated sensor.

Sensor Obfuscation Play: A sensor obfuscation play releases some (correct or incorrect) information about the sensor to the adversary but hides enough information so that the adversary cannot subvert detection by the sensor. An example involves the defender's sensors detecting Trojans placed by the adversary on several computing assets, some owned by the defender and some owned by third parties. However, the defender confronts the adversary with the Trojans discovered on its assets, but does not mention the Trojans placed on the third party assets. This play shields the sensors on the third party assets by not revealing information about their location and range.
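A defender composing such a play can check mechanically that no shielded attribute is ever revealed truthfully. A small consistency-check sketch (Python; our own illustration, with an assumed secrecy policy):

    PROTECTED = {"modality", "location"}  # attributes this play must shield

    def secrecy_violations(tactics, protected=PROTECTED):
        """Return the protected attributes that a play would truthfully reveal.

        tactics maps attribute -> "true", "false" or "omitted", i.e., the
        revealing, misleading and masking tactics of Section 6.1.
        """
        return [a for a in protected if tactics.get(a) == "true"]

    # A sensor obfuscation play: reveal some attributes, hide enough that
    # the adversary cannot subvert detection by the sensor.
    play = {"modality": "false", "location": "omitted",
            "range": "true", "sensitivity": "true", "credibility": "true"}
    leaks = secrecy_violations(play)
    print("play is safe" if not leaks else f"play leaks: {leaks}")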
6.3 Enabling Methods
Sensors are shielded by executing plays based on the deception framework and constructs described above. Numerous variations of the plays exist, giving the defender considerable leeway to demonstrate to the adversary that the defender knows about some asset or activity by revealing incorrect or no information about one or more of the sensor attributes.
Two enabling methods, shepherding and distraction, are especially useful in situations involving multiple sensors.

Shepherding: Shepherding involves convincing the adversary and/or other parties to adjust their activities to the advantage of the defender. Shepherding has at least three variants. One is to convince the adversary to shift its activities so that they can be detected by an open sensor. Another is to move a non-covert sensor to where the adversary is conducting activities. A third is to shepherd a third party sensor to where the adversary is conducting activities. A honeynet can be used as a shepherding tool. Note that the defender can use the honeynet to implement an open sensor play on one sensor and other plays on the other sensors.

Distraction: Distraction is designed to progressively divert the adversary's attention from secret sensor attributes. This method can be used to create confusion (possibly panic) inside the adversary's network. Consider a situation where the adversary releases a worm that tunnels into the defender's network. In response, the defender conducts a display (or ruse) that releases the same worm in the adversary's network – intending for the adversary to believe that the worm was erroneously released in its own network. To reinforce this belief, the defender plants information in the media that the adversary's experiments with cyber capabilities infected its own network with a worm.
7. Shielding Play Pitfalls
The efficacy of a shielding play is limited by the amount of time and resources available for its planning and execution, and the adversary's susceptibility to deception [15]. Despite the best efforts of the defender, a shielding play can fail for many reasons. The adversary may not see all the components of the play, may not believe one or more components, may be unable to act, or may decide not to act or may act in an unforeseen way even if all the components of the play are believed; the adversary may also simply discover the deception [15].

The failure or exposure of a shielding play can significantly affect the adversary's operations. For this reason, the defender should understand the risk associated with an action that is based on the assumed success of a shielding play. In general, there are two broad categories of deception failures: the defender does not design or implement the shielding play correctly, or the adversary detects the deception.

Even if a shielding play is successful, it is possible for the adversary to compromise the defender's feedback channels [15]. Another problem is that unintended third parties may receive and act on the deceptive information intended for the adversary. The risks associated with these eventualities must be weighed carefully against the perceived benefits of the shielding play.

A shielding play can be discovered by the adversary via direct observation, investigation or indirect observation [16, 17].
Direct Observation: Direct observation involves sensing and recognition. The adversary relies on one or more sensors (e.g., a network port scanner or packet sniffer) to discover the shielding play. Any attempt to defeat the adversary's discovery process must consider how, when and where the adversary receives information. The defender must then target the adversary's detection capabilities and/or information gathering processes. For example, the installation of a firewall can prevent the adversary from conducting a port scan. Alternatively, the deployment of a honeypot can compromise the port scanning process by providing incorrect information.

Investigation: Investigation involves the application of analytic processes to the collected evidence rather than direct observation. An investigation helps discover something that existed in the past, or something that exists but cannot be observed directly. Note that an investigation relies on the analysis of evidence; it cannot be used for predictive purposes because evidence of future events does not exist. An investigation can be thwarted by compromising the adversary's evidence collection and/or analysis processes. Actions can be taken to alter the available evidence, or to diminish or misdirect the adversary's analytic capabilities. These actions are simplified if the adversary has a bias or predisposition that aligns with the shielding play.

Indirect Observation: Indirect observation involves a third party (human or machine) that has discovered the deception either by direct observation or by investigation. Indirect observation is defeated by compromising the third party's ability to make a direct observation and to conduct an investigation. Alternatively, the defender could target the communication channel between the third party and the adversary.
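The honeypot counter to direct observation mentioned above can be as simple as a listener that answers probes with a fabricated service banner while logging the prober. A minimal sketch using only the Python standard library; the port and banner are arbitrary choices of ours, and this is an illustrative toy, not a production honeypot:

    import socket

    def banner_honeypot(port=2222, banner=b"SSH-2.0-OpenSSH_5.1\r\n"):
        """Accept connections, present a false service banner, log the source.

        Feeding the scanner a fabricated banner corrupts its discovery
        process (a misleading tactic); the connection log is sensor data.
        """
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("0.0.0.0", port))
            srv.listen()
            while True:
                conn, (addr, src_port) = srv.accept()
                with conn:
                    print(f"probe from {addr}:{src_port}")  # log the observer
                    conn.sendall(banner)  # show the false version

    if __name__ == "__main__":
        banner_honeypot()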
8. Fireball Express Reprise
The Fireball Express dilemma involves three (initially) covert sensors: the embedded agent, honeynet and Internet sensors. If the U.S. decides that it must respond to the adversary’s cyber operation, it must acknowledge that something was detected by one or more of its sensors. Three possibilities (of many) are: (i) open the honeynet sensor; (ii) obscure the honeynet and embedded agent sensors; and (iii) obscure the embedded agent and honeynet sensors, and create a phantom sensor. The first option involves conducting an open sensor play on the honeynet sensor. The play could involve one or more revealing tactics. One revealing tactic could be the public announcement by the U.S. that it caught the adversary “red-handed” accessing the honeynet, which was installed as a defensive measure to secure the critical infrastructure. This play would reveal the existence of the honeynet and its corresponding sensor attributes to the adversary.
The second option, obscuring the honeynet and embedded agent sensors, involves using a sensor obfuscation play coupled with shepherding. The sensor obfuscation play may be accomplished by employing a revealing tactic and a misleading tactic. The revealing tactic discloses the sensitivity, range and location of the honeynet and the embedded employee. One approach is for U.S. authorities to publicly announce that "anomalous activity" was discovered at the utility and request blue team assistance. The blue team is a shepherded open sensor that assumes the credit for detecting the adversary's activities via the misleading tactic.

The third option, obscuring the embedded agent and honeynet, and creating a phantom sensor, involves an obfuscation play, a phantom sensor play and a distraction method. The obfuscation play uses a revealing tactic that blocks the adversary's entry into the honeynet by implementing strong access controls. The play reveals the sensitivity, location, range and credibility of the embedded agent and honeynet sensors, but it does not reveal their modalities. The adversary is deceived via a distraction tactic and a misleading tactic. The distraction tactic is a brief denial-of-service (DoS) implemented by ARP poisoning the adversary's network. The misleading tactic plants information that indicates the U.S. has placed sensors in the adversary's network. The planted information is designed to make the adversary believe that the DoS attack was a side-effect of the sensor placement. The distraction and misleading tactics are designed to make the adversary believe that a phantom sensor exists in its core network. This phantom sensor could have the effect of deterring the adversary from conducting cyber operations until the sensor is detected.

The Internet sensors are intended to remain covert in the three U.S. response options. Thus, each option corresponds to a covert play conducted on behalf of the Internet sensors. Note that many other combinations of tactics, plays and enabling methods can be used to achieve the same outcome.
9. Conclusions
The global reach of the Internet and the difficulty of detecting and attributing attacks make sensors invaluable in defensive operations. Maintaining the secrecy of key sensors and their attributes is vital for several reasons. Adversaries can bypass or develop countermeasures for known sensors. Secret sensors with exaggerated capabilities can create confusion, even fear, on the part of the adversary. Deception can be used very effectively to shield cyberspace sensors. The deception-based shielding tactics and plays presented in this paper provide the defender with broad situational awareness and the flexibility to respond to adversary operations. Moreover, the tactics and plays enable the defender to shape the adversary’s beliefs about the sensors, helping the defender gain tactical and strategic superiority in cyberspace. Note that the views expressed in this paper are those of the authors and do not reflect the official policy or position of the U.S. Department of Defense or the U.S. Government.
Chapter 2
BOTNETS AS AN INSTRUMENT OF WARFARE
Eric Koziel and David Robinson
Abstract
The use of botnets for malicious activities has grown significantly in recent years. Criminals leverage the flexibility and anonymity associated with botnets to harvest personal data, generate spam, distribute malware and launch distributed denial-of-service attacks. These same attributes readily translate to applications that can support operations in warfare. In 2007, distributed denial-of-service attacks launched by botnets targeted IT assets belonging to Estonian banks, newspapers and parliament. This paper explores the use of botnets as instruments of warfare. Seven scenarios are used to demonstrate how traditional applications of botnets such as spam, theft of resources and distributed denial-of-service attacks can have implications across the spectrum of warfare. Additionally, the paper discusses the ethical and political concerns associated with the use of botnets by nation states.
Keywords: National security, cyber warfare, botnets
1. Introduction
Cyberspace, through its pervasiveness in all aspects of society, has significantly changed the nature of international and domestic conflict. Nation states find themselves at risk of attack and disruption through electronic means. Corporations are constantly defending against adversaries who seek to steal personal and proprietary information. Individual citizens are bombarded with unsolicited advertisements and malware on a daily basis. Although these threats manifest themselves in myriad ways, botnets have become the de facto tool of choice for hackers, organized crime groups and nation states [1, 3].

Interest in botnets has grown significantly. Although criminal activities receive the majority of attention, nation states have recognized the potential military applications. A real-world example is the distributed denial-of-service (DDoS) attacks launched against the nation of Estonia in 2007. Although the attacks were not directly attributed to a nation state, they underscore the impact that botnets can have on a nation's security. Indeed, botnets afford appealing attributes for the warfighting environment, including ease of setup, inherent command and control functionality, disruptive potential, high degree of anonymity and the ability to remain undetected [1, 4, 14].

Nuclear weapons are certainly not comparable to botnets in their scale and destructive potential, but they offer an interesting parallel. As instruments of warfare, nuclear weapons have a wide range of operational and strategic implications. We explore a similar notion by considering botnets as instruments of warfare. Specifically, we examine how traditional applications of botnets (e.g., spam, resource theft and DDoS attacks) can be leveraged to achieve operational and strategic objectives with respect to nation state conflicts. Example scenarios are provided that demonstrate how botnets can be used in conflicts between nation states. Also, ethical and political concerns associated with the use of botnets in conflict are discussed.
2. Background
A botnet consists of an "army" of host computers (bots) that have been infected by malware, typically unbeknownst to their owners. The malware installs a backdoor or communication channel that enables the bot to receive commands from an authoritative controller (botmaster). Bots typically attempt to compromise other machines until the botmaster issues a command to stop [1]. As a result, botnets can propagate rapidly and grow to include thousands, even millions of machines [3].

Botmasters use a command and control channel to issue commands to their bots. While various mechanisms exist, the two main types of command and control channels are peer-to-peer (P2P) protocols and Internet Relay Chat (IRC). In a P2P botnet, a bot behaves as a client and as a server. The architecture enables bots to directly relay commands among one another. To issue directives, an attacker selects a bot to serve as the botmaster and issues commands that propagate throughout the botnet. This structure is particularly difficult to stop and track because there is no fixed botmaster source [14]. IRC botnets leverage the scalability and flexibility of the IRC protocol to issue commands. In an IRC botnet, bots are directed to connect to specified botmaster servers periodically to receive commands, updates and other directives [1]. IRC botnets are easier to set up and maintain, but their command channels are centralized to specific servers, which makes them easier to disable once detected. While IRC-controlled botnets are more common, P2P botnets and new variants are on the rise.

Regardless of the command and control structure, botnets offer a high degree of anonymity and the ability to mask the underlying architecture. Indeed, without inspecting bot traffic, it is difficult to discern if individual bots are associated with a given botnet [1, 4]. The primary goal when establishing a botnet is to amass a large number of infected hosts without much consideration of the types of hosts. As a result, bots cannot be blocked or disabled without affecting the unknowing users of the compromised hosts.
Additionally, preventing the compromise of host computers is extremely difficult. Even if users follow sound security practices, a large number of hosts are invariably exposed to infection.

Historically, botnets have been used to send spam and unsolicited advertisements [13]. Botmasters distribute large volumes of tailored advertisements on a fee-for-service basis using their subordinate bots to send email. Bots also have the ability to serve as data collection devices to obtain personal information for identity theft and other malicious activities [7].

From a warfare perspective, the best known tactic is to use a botnet to launch DDoS attacks. DDoS attacks seek to prevent legitimate users from accessing services by saturating the resources of the targeted machines. The large number of hosts and the anonymity associated with a botnet render it ideal for this type of attack. For example, a botmaster may direct subordinate bots to repeatedly connect to and synchronize with a networked computing resource. The attacks then generate massive amounts of traffic that limit the bandwidth available for legitimate users and overwhelm the target [1, 2, 8]. This tactic was demonstrated successfully in Estonia in April–May 2007 against targets that included government portals, banking sites and ATMs, emergency response services, root domain name servers and media portals. The botnet that launched the attacks apparently incorporated more than one million nodes across several countries, including the United States. Estonia was forced to block inbound international traffic in order to mitigate the attacks [2].
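To make the command and control structure concrete, the following schematic sketch shows the dispatch loop at the heart of a bot. It is illustrative only – the command vocabulary is invented for this example, standard input stands in for an IRC or P2P socket, and the handlers are deliberately empty stubs.

    #include <stdio.h>
    #include <string.h>

    /* Stub handlers; a real bot would implement these actions. */
    static void handle_spam(void)  { /* send advertisement email  */ }
    static void handle_ddos(void)  { /* flood a designated target */ }
    static void handle_sleep(void) { /* go dormant until recalled */ }

    /* Map a newline-delimited directive from the botmaster to an action. */
    static void dispatch(const char *cmd)
    {
        if (strncmp(cmd, "SPAM", 4) == 0)
            handle_spam();
        else if (strncmp(cmd, "DDOS", 4) == 0)
            handle_ddos();
        else if (strncmp(cmd, "SLEEP", 5) == 0)
            handle_sleep();
        /* Unknown commands are ignored to avoid revealing the bot. */
    }

    int main(void)
    {
        char line[256];
        /* In an IRC botnet this stream would be a socket connected to a
           botmaster server; stdin stands in for it in this sketch. */
        while (fgets(line, sizeof(line), stdin) != NULL)
            dispatch(line);
        return 0;
    }

The same skeleton applies to P2P variants; only the transport and the absence of a fixed server change.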
3. Botnet Warfare Scenarios
This section presents seven scenarios that leverage botnets as instruments of warfare. The scenarios are generic in nature and are not based on current or past events. The goal is to illustrate how various botnet capabilities might be used in support of strategic and operational objectives.

All the scenarios involve two fictional nation states, Atlantis and Lemuria. We assume that both nation states have a cyber infrastructure that is similar to that existing in industrialized countries. We also assume that the two countries are not bound by international restrictions such as those imposed by the Geneva Conventions, nor are they concerned with the political impact of their decisions. These assumptions permit the analysis to focus on botnet capabilities in the context of worst-case scenarios. The ethical and political issues related to botnet warfare are discussed in the next section.

Each botnet warfare scenario provides the overall objective, details of the tactical deployment of a botnet and the consequences of the attack. For clarity, the attacker is always Atlantis and the victim is always Lemuria. We also assume that Atlantis controls a botnet of significant scale that comprises a large number of bots within Lemuria.
3.1 Propaganda
The purpose of a propaganda attack is to influence the attitude of a group of individuals towards some cause or position.
Propaganda typically has an essence of truth, but often uses selective facts to incite the desired emotional response. While typical delivery mechanisms include radio, television and print media, the widespread use of the Internet makes it attractive to disseminate propaganda using botnets.

Attack: Atlantis directs the bots in Lemuria to download various forms of Atlantean propaganda from advertisement servers and display them to users. Computer configurations are altered so that website fetch requests are redirected to sites hosting Atlantean propaganda. Also, bots are used to send spam containing propaganda to large numbers of Lemurian users.

Effect: The psychological effect of this type of attack is difficult to assess. However, the impact of a message delivered directly to individual users should not be underestimated. Consider the recent events in Egypt. Wael Ghonim, a Google marketing manager in Cairo, utilized Facebook to organize massive protests [12]. His ability to motivate and disseminate a coherent message to a large populace is credited with helping force President Mubarak to step down. Indeed, with the Internet overtaking newspapers and approaching television as the main source of national and international news [10], the ability to leverage this outlet affords an opportunity to influence and control the views and perceptions of a large populace. Also, the fact that the Lemurian government has been unable to stop this attack may undermine its credibility. Lemuria could, of course, analyze Internet traffic to identify the primary servers that distribute the propaganda. However, stopping the attack completely would likely result in self-imposed denial of service.
3.2 Disinformation
In the intelligence domain, disinformation is the deliberate spreading of false information to mislead an adversary. Unlike propaganda, which is designed to incite an emotional response, disinformation attempts to manipulate the audience by discrediting information or supporting false conclusions. As with propaganda, the widespread use of the Internet offers the ability to push disinformation to a massive population. Indeed, the ability to modify web pages or redirect users to sites without their knowledge offers the adversary a powerful means to manipulate individuals.

Attack: Atlantis bots redirect their infected machines to connect to spoofed Lemurian media pages that provide false information on economic, political and health issues. Additionally, mass emails from spoofed Lemurian addresses provide information that supports the false web sites and discredits legitimate Lemurian media sources.

Effect: As with the propaganda attack, the psychological toll of this scenario is difficult to gauge. However, the attack introduces a level of mistrust in the general population.
While there is no guarantee that all Lemurians will be affected, enough of the populace could be provided with false information to cause confusion and unrest. The legitimacy of Lemurian government policies, guidance and direction is likely to be questioned. Lemuria might respond to this attack by directing its citizens to rely on "trusted" media sources (e.g., television, newspaper and radio). However, it is likely that the attack would have political and psychological ramifications.
3.3 Conflict Instigation
This scenario creates a conflict between nation states for political, economic or military purposes. Instead of directly attacking its adversary, a nation state can use deception to provoke a third nation state to enter into a conflict with the adversary. In this manner, the first nation state can achieve its ends without the perception of direct involvement.

Attack: Atlantis directs its bots in Lemuria to begin DDoS attacks on systems that are critical to the government of Mu, a third nation state. Mu perceives the cyber attack as being instigated by Lemuria and threatens a response. Without diplomatic or international intervention, the situation may escalate.

Effect: It is difficult to attribute the attack because of the anonymity associated with botnet channels. Indeed, Lemuria would most likely have to prove to Mu that it did not instigate the attack. If Lemuria cannot prove conclusively that the DDoS attacks were initiated by another actor, a tense stand-off or escalation is likely to occur.
3.4 Revenue Generation
The sale and lease of botnets for sending spam or harvesting data has become standard practice [11]. A small nation state can garner significant revenue from the use of its botnets. Indeed, terrorist organizations (although they are not classified as nation states) have already demonstrated the ability to use botnets to gather information and generate revenue to sustain their causes [15].

Attack: Atlantis uses bots to disseminate "sponsored" adware and deploy information-gathering malware (e.g., keylogging software). Atlantis receives payment from commercial entities to distribute advertisements and sells the data obtained via keylogging software on the black market. The generated revenue is discreetly added to Atlantis' treasury.

Effect: Even if Lemuria becomes aware of the operation, the options for mitigation are quite limited. This is even more problematic if the operation is launched from multiple countries. Lemuria can appeal to the international community for assistance, but even then, the options are limited because of the difficulty of attributing botnet attacks.
3.5 Service Disruption
The effects of a service disruption attack range from intermittent degradation of service to complete denial of service. A subtle attack may degrade a service by slowing it down periodically so that it cannot be trusted. The targets can include control systems, telecommunications systems and banking systems. Although botnets primarily impact networks and services, botmasters can instruct their subordinate bots to disrupt or disable (e.g., reformat) their host machines.

Attack: Atlantis launches DDoS attacks against government utilities, banking websites and other high-traffic Internet portals in Lemuria. The initial wave of attacks constitutes a "show of force" to demonstrate Atlantis' capabilities to the Lemurian people. The intensity and scope of attacks are gradually increased until Lemuria is forced to concede to Atlantis' demands.

Effect: The effect of this type of attack may range from annoyance to widespread fear and confusion. Initial attacks against specific resources (e.g., popular web pages or media outlets) may serve as a mechanism to anger and frustrate the populace. As the conflict wears on, the attacks may escalate to disrupt critical infrastructure assets. Service disruption attacks may also be used as a diversionary tactic while offensive actions are performed in other areas. Few options are available for dealing with widespread DDoS attacks. Blocking inbound international traffic (as Estonia did in 2007) may not help if a large number of bots with the ability to autonomously launch DDoS attacks are deployed within Lemuria.
3.6 Intelligence Exfiltration
Gaining intelligence on the enemy is paramount in any conflict; relevant and timely information can be the difference between the success and failure of a military operation. Military operations have become highly dependent on technology and the Internet. This reliance makes them susceptible to the same types of attacks that criminal organizations currently use against individuals. For example, bots often function as data collection devices that harvest personal information. Similarly, bots injected into a military network can serve as a large, distributed data collection tool to gain intelligence and situational awareness about current and future military operations.

Attack: Atlantis deploys bots in Lemurian military and military-related commercial networks. The bots remain dormant until commanded to support contingency operations, at which time they monitor and search for files containing sensitive information (e.g., about public officials, state activities and military plans). These files are transmitted to Atlantean servers for analysis.
Effect: If Lemuria detects the exfiltration, it can leverage the attack by feeding false information to Atlantis. This is effective only to the extent that Lemuria can detect and control the exfiltration. However, Lemuria may not be able to trust the integrity of its networks and may have to take them down so that they can be reconfigured. Not detecting the exfiltration could result in serious consequences for Lemuria.
3.7 Chaos Instigation
A coordinated campaign involving different types of botnet attacks can cause widespread chaos. Indeed, considerable damage could be wrought without deploying a single military unit.

Attack: Atlantis initiates a disinformation campaign focused on political and economic targets in Lemuria. Simultaneously, a propaganda initiative is launched that highlights the lack of control that the Lemurian leadership has over its assets. Atlantis warns the Lemurian populace of dire consequences if its demands are not met. Atlantis then launches massive DDoS attacks against Lemurian critical infrastructure assets and instructs its Lemurian-based bots to disable their host machines.

Effect: Lemuria must deal with the fear that the attacks generate among its populace and mitigate the effects of the attacks on its critical infrastructure assets. Because the attacks are launched from within and outside its borders, there is little that Lemuria can do aside from disconnecting its key assets from the Internet. This may actually exacerbate the problem and amplify the effects of the attacks. The attacks may become so debilitating that Lemuria may consider kinetic retaliatory strikes. Absent overwhelming proof – which is difficult to obtain because of the attribution problem – Lemuria may be hard-pressed to retaliate, especially if Atlantis emphatically denies a hand in the attacks.
4. Ethical and Political Issues
The scenarios presented in the previous section ignore ethical and political concerns that may impose significant barriers to launching botnet attacks. This section examines the major ethical and political consequences associated with the use of botnets as an instrument of warfare.

The first major issue concerns the Geneva Conventions and their implications. Compromising a computer and installing botnet malware is equivalent to unauthorized seizure. If the compromised computer belongs to a civilian entity, the action could potentially be deemed an attack on non-combatants. An attack on civilian-owned property is strictly prohibited under Protocol I, Article 52 of the Geneva Conventions [5]. Although the term "attack" may not withstand international scrutiny, a computer compromise could be deemed an attack if it impacts critical infrastructure assets and, therefore, endangers civilian lives.
Attacks on resources that are not identified as key military objectives and disrupt civilian life are proscribed by Protocol I, Article 54 [5]. A nation state that uses its own citizens' computers to launch botnet attacks on another country could be deemed to be using "human shields" – an action that is prohibited under Protocol I, Article 51 of the Geneva Conventions [5]. Furthermore, any computers that are used in an offensive manner can be considered to be weapons of war and, as such, the operators of these computers can be labeled as combatants. However, because of the attribution problem, the controlling computers and their operators could be in doubt; this could potentially draw unwitting civilians into the conflict.

Attribution is a paramount issue. Botnets are complex with shadowy command and control structures, making the identification of a botmaster extremely difficult. Identifying the real perpetrator of an attack is even more complicated when botnet resources are "outsourced" to third parties.

Few legal cases address the use of botnets. Microsoft recently won a legal battle against the Waledac spam botnet via an ex parte judicial procedure [9]. The botmaster was never identified or located; however, the primary web domains used in Waledac's command infrastructure were identified. The ex parte procedure enabled the court to forcefully transfer control of these domains to Microsoft, effectively shutting down the ability of the botmaster to relay commands to the bots. While this exact situation may not be applicable to all botnets, it presents a means to defend against botnets within the scope of the law instead of using purely technical approaches. Another recent incident involved the U.S. Department of Justice and the FBI dismantling the Coreflood botnet [6]. In this incident, commands were sent to the infected hosts to force them to stop communicating with the botmaster. This case is unprecedented in that government officials sent commands that altered the behavior of computer systems without their owners' knowledge or consent. It would be interesting to see if this approach would withstand legal scrutiny.

At the heart of many of these issues are the lexicon and classification relating to the use of botnets in warfare. International provisions and agreements that specifically cover network attacks would be a significant help. It is necessary to clarify the status of machines and the owners of the machines that are used to perpetrate attacks. Also, classifying attacks according to capabilities would help define the appropriate responses. For example, is a botnet attack that disrupts the power grid an "armed" attack? If so, how does the victim respond?

A vital issue that must be addressed pertains to attacks by non-nation-state actors. While political constructs and international law may prevent many nation states from launching botnet attacks, history has shown that terrorist organizations and other radical groups have no such restrictions. It is critical that nations reach agreement on the protocols for dealing with attacks by non-nation-state actors before such scenarios actually play out.
5. Conclusions
Botnets can be used as instruments of warfare to achieve strategic and operational objectives. With few direct defensive measures available, botnets can disrupt operations in government and industry, and impact the populace by targeting critical infrastructure assets. The ethical and political implications of botnet use are significant. Currently, the attacks are too indiscriminate for botnets to be considered as legitimate weapons under international law and conventions. Nevertheless, the role that botnets play in conflict can be expected to increase. Nation states must assess the retaliatory options and be prepared to respond if and when botnets are used against them. The attack scenarios demonstrate the depth and breadth of the offensive capabilities that botnets afford in a wartime environment. Additional research is required to develop viable legal, policy and technical solutions for detecting, preventing and responding to botnet attacks. Until holistic defensive strategies are in place, nations will be ill-prepared to deal with the full impact of botnet attacks. Note that the views expressed in this paper are those of the authors and do not reflect the official policy or position of the U.S. Air Force, U.S. Department of Defense or the U.S. Government.
References

[1] M. Bailey, E. Cooke, F. Jahanian, Y. Xu and M. Karir, A survey of botnet technology and defenses, Proceedings of the Cybersecurity Applications and Technology Conference for Homeland Security, pp. 299–304, 2009.
[2] L. Brooks, Botnets: A Threat to National Security, M.S. Thesis, Department of Computer Science, Florida State University, Tallahassee, Florida, 2008.
[3] A. Cole, M. Mellor and D. Noyes, Botnets: The rise of the machines, Proceedings of the Sixth Annual Security Conference, 2007.
[4] M. Feily, A. Shahrestani and S. Ramadass, A survey of botnet and botnet detection, Proceedings of the Third International Conference on Emerging Security Information, Systems and Technologies, pp. 268–273, 2009.
[5] International Committee of the Red Cross, Protocol Additional to the Geneva Conventions of 12 August 1949, and relating to the Protection of Victims of International Armed Conflicts (Protocol I), International Humanitarian Law – Treaties and Documents, Geneva, Switzerland (www.icrc.org/ihl.nsf/full/470?opendocument), June 8, 1977.
[6] D. Kaplan, Coreflood-style takedowns may lead to trouble, SC Magazine, April 15, 2011.
[7] J. Leonard, S. Xu and R. Sandhu, A framework for understanding botnets, Proceedings of the Fourth International Conference on Availability, Reliability and Security, pp. 917–922, 2009.
[8] S. Liu, Surviving distributed denial-of-service attacks, IT Professional, vol. 11(5), pp. 51–53, 2009.
[9] E. Mills, Microsoft legal punch may change botnet battles forever, CNET News (news.cnet.com/8301-27080_3-20015912-245.html), September 9, 2010.
[10] Pew Research Center for the People and the Press, More young people cite Internet than TV – Internet gains on television as public's main news source, Washington, DC (people-press.org/reports/pdf/689.pdf), January 4, 2011.
[11] B. Prince, Botnet for sale business going strong, security researchers say, eWeek.com (www.eweek.com/c/a/Security/Botnet-for-Sale-Business-Going-Strong-Security-Researchers-Say-848696), October 25, 2010.
[12] C. Smith, Egypt's Facebook revolution: Wael Ghonim thanks the social network, Huffington Post, February 11, 2011.
[13] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel and G. Vigna, Your botnet is my botnet: Analysis of a botnet takeover, Proceedings of the Sixteenth ACM Conference on Computer and Communications Security, pp. 635–647, 2009.
[14] P. Wang, L. Wu, B. Aslam and C. Zou, A systematic study of peer-to-peer botnets, Proceedings of the Eighteenth International Conference on Computer Communications and Networks, 2009.
[15] C. Wilson, Botnets, Cybercrime and Cyberterrorism: Vulnerabilities and Policy Issues for Congress, CRS Report for Congress, RL32114, Congressional Research Service, Washington, DC, 2008.
Chapter 3
LIGHTWEIGHT INTRUSION DETECTION FOR RESOURCE-CONSTRAINED EMBEDDED CONTROL SYSTEMS
Jason Reeves, Ashwin Ramaswamy, Michael Locasto, Sergey Bratus and Sean Smith
Abstract
Securing embedded control systems presents a unique challenge. In addition to the resource restrictions inherent to embedded devices, embedded control systems must accommodate strict, non-negotiable timing requirements, and their massive scale greatly increases other costs such as power consumption. These constraints render conventional host-based intrusion detection – using a hypervisor to create a safe environment under which a monitoring entity can operate – costly and impractical. This paper describes the design and implementation of Autoscopy, an experimental host-based intrusion detection system that operates from within the kernel and leverages its built-in tracing framework to identify control flow anomalies that are often caused by rootkits hijacking kernel hooks. Experimental tests demonstrate that Autoscopy can detect representative control flow hijacking techniques while maintaining a low performance overhead.
Keywords: Embedded control systems, intrusion detection
1. Introduction
The critical infrastructure has become strongly reliant on embedded control systems. The electric power grid is not immune to this trend: one study predicts that the number of smart meters deployed worldwide, and by extension the embedded control systems inside these meters, will increase from 76 million in 2009 to roughly 212 million by 2014 [38]. The need to secure software that expresses complex process logic is well understood, and this need is particularly important for SCADA devices, where the logic applies to the control of potentially hazardous physical processes.
Therefore, as embedded control devices continue to permeate the critical infrastructure, it is essential that steps are taken to ensure the integrity of these devices. Failing to do so could have dangerous consequences. Stuxnet [4], which targeted workstations used to configure programmable logic controllers and successfully modified the controller code, is an example of malware that caused widespread damage to a physical installation by infecting a SCADA system.

SCADA systems impose stringent requirements on protection mechanisms in order to be viable and effective. For one, the additional costs associated with security computations do not scale in SCADA environments. LeMay and Gunter [11] note that, in a planned rollout of 5.3 million electric meters, incorporating a trusted platform module in each device would incur an additional power cost of more than 490,000 kWh per year, even if the trusted platform modules sat idle at all times. Embedded control systems in the power grid must also deal with strict application timing requirements, some of which require a message delivery time of no more than 2 ms for proper operation [7].

Several researchers [8, 13, 21, 23, 29, 39] address the issue of malware by using virtualization – creating a trusted zone in which a monitoring program can operate and relying on a hypervisor to moderate between the host system and the monitor. These proposals, however, fail to consider the inherent resource constraints of embedded control systems. For example, the space and storage constraints of embedded devices may render the use of a separate hypervisor impractical. Petroni and Hicks [23] observe that simply running the Xen hypervisor on their test platform (a laptop with a 2 GHz dual-core processor and 1.5 GB RAM) imposed an overhead of nearly 40%. This finding indicates that virtualization may not be a feasible option for embedded SCADA devices, and that other approaches to intrusion detection should be considered.

In contrast, kernel hardening approaches, exemplified by grsecurity/PaX [20] and OpenWall [19], are very effective at reducing a kernel's attack surface without resorting to a separate implementation of a formal reference monitor. This is accomplished by implementing security mechanisms in the Linux kernel code, leveraging the MMU hardware and ELF binary format features of x86 and other architectures. Indeed, the PaX approach empirically demonstrates the possibility of providing practical security guarantees by embedding protection mechanisms in the kernel instead of relying on a separate operating layer below the kernel. It also shows that increased assurance and better performance can coexist in practice.

We note that, whereas many hypervisor-based approaches may appear attractive, their collective price in terms of maintenance, patching, energy, etc. [2] precludes their use in embedded process control environments. In contrast, PaX demonstrates the suitability of implementing protection using already-deployed mechanisms in the hardware and operating system kernel stack. While dispensing with a separate reference monitor might appear to be a losing proposition from a security perspective, in practice, it requires extensive and creative machinations on the part of an attacker to overcome the protection provided by a hardened kernel.
Notably, Linux kernel attacks assume that one or more of the PaX-like protective features are disabled or absent. Little published work exists on the exploitation of grsecurity/PaX kernels; even leveraging high-impact "arbitrary write" kernel code vulnerabilities to exploit PaX kernels is very difficult [16]. Proof-of-concept attacks on PaX underscore the complexity of the task, with the PaX team's rapid elimination of the generic attack vectors serving as further evidence of the viability of the defensive approach. This pattern suggests that a "same-layer" protection mechanism is practical.

This paper describes Autoscopy, an in-kernel, control-flow intrusion detection solution for embedded control systems, which is intended to complement kernel hardening measures. Autoscopy does not rely on a hypervisor; instead, it operates within the operating system, leveraging mechanisms built into the kernel (specifically, Kprobes [14]) to minimize the overhead imposed on the host. Autoscopy looks for control flow anomalies caused by the hijacking of function pointers in the kernel, a hallmark of rootkits seeking to inject their functionality into the operating system. In tests run on a standard laptop system, Autoscopy was able to detect control flow hooking techniques while imposing an overhead of no more than 5% with respect to several performance benchmarks. These results indicate that, unlike virtualized intrusion detection solutions, Autoscopy is well-suited to the task of protecting embedded control devices used in the critical infrastructure.
2. Background
This section describes the standard methods for intrusion detection and explains why they are difficult to use in embedded control system environments. The section also discusses the virtualization and self-protection approaches to intrusion detection, and highlights the tracing framework used in our intrusion detection solution.
2.1 Embedded Control Systems
The electrical power grid contains a variety of intelligent electronic devices, including transformers, relays and remote terminal units. The capabilities of these devices can vary widely. For example, the ACE3600 RTU [18] sports a 200 MHz PowerPC-based processor and runs a VX-based real-time operating system. On the other hand, the SEL-3354 computing platform [31] has an option for a 1.6 GHz processor based on the x86 architecture and can support the Windows XP and Linux operating systems.

In addition to the resource restrictions, embedded control systems used in the power grid are often subject to strict timing requirements. For example, intelligent electronic devices in a substation require a message delivery time of less than 2 ms to stream transformer analog sampled data, and must exchange event notification information for protection within 10 ms [7].
Given these timing windows, introducing even a small amount of overhead could prevent a device from meeting its message latency requirements, prohibiting it from doing its job – an outcome that may well be worse than a malware infection. Great care must be taken to limit the amount of overhead because device availability usually takes precedence over security.
2.2 Intrusion Detection Methods
Intrusion detection systems can be classified according to the device or medium they protect and the method they use to detect intrusions. An intrusion detection system can be host-based or network-based. A host-based system resides on a single platform and monitors running processes and user actions; a network-based system analyzes packets flowing through a network to detect malicious traffic. The two most common types of intrusion detection methods are misuse-based and anomaly-based. A misuse-based method looks for predefined bad behavior; an anomaly-based method looks for deviations from predefined good behavior. Note that other groupings, such as specification-based methods and behavioral detection methods [27], are also used in the literature.

The key to the success of an intrusion detection system is its ability to mediate the host it protects. Specifically, it must capture any actions that could change the state of the host system and determine whether or not the actions could move the system into an untrustworthy state. Conversely, an attack is successful when it evades such mediation.

In the ideal case, an intrusion detection system possesses two important characteristics. The first is that the intrusion detection system is separated in some manner from the rest of the system, enabling it to monitor the system while shielding it from host exploits (i.e., isolation). The second characteristic is that the intrusion detection system can monitor every action in the system (i.e., complete mediation). While these characteristics are attractive, they are expensive or impractical to implement in practice, especially in the light of the resource constraints imposed on an embedded control system. In contrast, Autoscopy engages less expensive methods of system mediation – its in-kernel approach permits the adjustment of the mediation scope.
2.3 Virtualization vs. Self Defense
Virtualization most often means the simulation of a specific hardware environment so that it functions as if it were an actual system. Typically, one or more of these simulations or virtual machines (VMs) are run, where each VM is isolated from the actual system and other VMs. A virtual machine monitor (VMM) is used to moderate the access of each VM to the underlying hardware.

Virtualization has become a common security measure, since in theory a compromised program remains trapped inside the VM that contains it, and thus cannot affect the underlying system on which it executes. Several recent intrusion detection proposals (see, e.g., [8, 13, 23]) leverage this feature to separate the detection program from the system being monitored, which achieves the isolation goal.
However, such a configuration is computationally expensive – a hypervisor can introduce a 40% overhead [23] – and an embedded control system may not have adequate resources to support the configuration.

To avoid the overhead of a virtualized or other external solution, we propose an internal approach to intrusion detection, one that allows the kernel to monitor itself for malicious behavior. The idea of giving the kernel a view of its own intrusion status dates back to at least 1996, when Forrest and colleagues [5] proposed the creation of a system-specific view of "normal" behavior that could be used for comparisons with future process behavior. The approach employed in Autoscopy can be viewed through the same lens: it endows the kernel with a module that allows it to perform intrusion detection using its own structures and to determine whether or not an action is trustworthy.
2.4 Kprobes
Several operating systems have introduced tracing frameworks to give authorized users standard and easy access to system internals at the granularity level of kernel symbols. Examples include DTrace [3] for Solaris and Kprobes [14] for Linux. A Kprobe can be inserted at any arbitrary address in the kernel text, unless the address is explicitly blocked from probing. Once a probe is inserted, a breakpoint is placed at the specified address, causing the kernel to trap upon reaching the address and to pass control to the Kprobe notifier mechanism [14]. The instruction at the specified address is single-stepped, and user-defined handler functions execute just before and just after the instruction, permitting the state of the system to be monitored and/or modified at that point.
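As a concrete illustration, the following minimal kernel module registers a single Kprobe with pre- and post-handlers. It is a sketch only: the probed symbol is an arbitrary choice for demonstration (any kernel text symbol not on the Kprobes blacklist can be used), and the handlers merely log the event.

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/kprobes.h>

    /* The probed symbol is arbitrary; any non-blacklisted kernel
       function can be specified here. */
    static struct kprobe kp = {
        .symbol_name = "do_fork",
    };

    /* Executes just before the probed instruction is single-stepped. */
    static int handler_pre(struct kprobe *p, struct pt_regs *regs)
    {
        printk(KERN_INFO "kprobe hit at %p (%s)\n", p->addr, p->symbol_name);
        return 0;
    }

    /* Executes just after the probed instruction completes. */
    static void handler_post(struct kprobe *p, struct pt_regs *regs,
                             unsigned long flags)
    {
        /* Inspect or record post-instruction state here. */
    }

    static int __init probe_init(void)
    {
        kp.pre_handler = handler_pre;
        kp.post_handler = handler_post;
        return register_kprobe(&kp);   /* negative value on failure */
    }

    static void __exit probe_exit(void)
    {
        unregister_kprobe(&kp);
    }

    module_init(probe_init);
    module_exit(probe_exit);
    MODULE_LICENSE("GPL");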
3. Related Work
Much of the research related to kernel rootkit techniques is described in hacker publications such as Phrack and public forums such as the Bugtraq mailing list. The discussion of system call hijacking and countermeasures can be traced back to at least 1997 (see, e.g., [25]). A full survey of this research is beyond the scope of this paper; however, interested readers are referred to Phrack issue no. 50 [24] and subsequent issues.

Considerable research related to intrusion detection is based on the availability of a hypervisor or some other virtualization primitive. Petroni and Hicks's SBCFI system [23] uses VMs to create a separate, secure space for their control flow monitoring program, from which they validate the kernel text and control flow transfers in the monitored operating system. Patagonix [13] and VMWatcher [8] use hypervisors to protect their monitoring programs, but they take different approaches to bridging the semantic gap between the hypervisor and the operating system. Patagonix relies on the behavior of the hardware to verify the code being executed, while VMWatcher simply reconstructs the internal semantics of the monitored system for use by an intrusion detection system within the secured VM. NICKLE [29] and HookSafe [39] use trusted shadow copies of data to protect against rootkits.
NICKLE creates a copy of VM memory space containing authenticated kernel instructions to ensure that unauthenticated code cannot run in kernel space, while HookSafe copies kernel hooks into a page-aligned memory area, where it can take advantage of page-level protection in the hardware to moderate access.

Several malware detection approaches that do not involve the use of a hypervisor have been proposed, but they suffer from other drawbacks that affect their utility in an embedded control system environment. For example, Kolbitsch and colleagues [9] create behavior graphs of individual malware samples using system calls invoked by the malware, and then attempt to match unknown programs to the graphs. However, much like traditional antivirus systems, this approach requires prior analysis of malware samples. Moreover, deploying updates to embedded devices, which may be remotely deployed in areas with questionable network coverage, remains a challenge. Other researchers attempt to integrate security policies into programs, but considerable effort is required to adapt this to new systems. For example, the approach of Hicks, et al. [6], which brings together a security-typed language with the operating system services that handle mandatory access control, would most likely require the rewriting of many legacy applications.

Kprobes have been used for a number of tasks, most often related to debugging kernels and analyzing kernel performance (see, e.g., [26]). Other more novel applications of Kprobes include packet capturing [10] and monitoring the energy use of systems [32]. However, to the best of our knowledge, Autoscopy is the first tool to leverage Kprobes for system protection.
4. Autoscopy
This section describes the Autoscopy system and explains how it is uniquely suited to secure embedded control devices. Interested readers are referred to [28] for additional details about Autoscopy.
4.1 Overview
Autoscopy does not search for specific instances of malware on its host. Instead, the program looks for a specific type of control flow alteration that is commonly associated with malicious programs. The control flow of a program is defined as the sequence of code instructions that are executed by the host system when the program is executed. Diverting the control flow in a system has been a favored tactic of malware authors for some time, and using control flow constraints as a security mechanism is a well-explored area of research (see, e.g., [1]).

Autoscopy is designed to look for a certain type of pointer hijacking, where a malicious function interposes itself between a function pointer and the original function pointed to by the pointer. The malicious function invokes the original target function somewhere within its body, preserving the illusion of normalcy by giving the user the expected output while allowing the malicious function to perform its actions (e.g., scrubbing the output to hide itself and its activities).
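The following fragment sketches this hijacking pattern in schematic form; the ops table and readdir-style handler are invented stand-ins for a real kernel structure, not an actual kernel API.

    /* A function pointer type and an ops table standing in for a real
       kernel structure (e.g., a VFS operations table). */
    typedef int (*readdir_fn)(void *dir, void *entries, int count);

    struct ops_table {
        readdir_fn readdir;
    };

    static readdir_fn orig_readdir;   /* saved pointer to the original */

    /* Malicious wrapper: calls the original to preserve normal output,
       then scrubs the results to hide the rootkit's presence. */
    static int evil_readdir(void *dir, void *entries, int count)
    {
        int n = orig_readdir(dir, entries, count);
        /* ... remove entries that would reveal the rootkit ... */
        return n;
    }

    /* The hijack: overwrite the function pointer in kernel data. This
       is the control flow change that Autoscopy's probes look for. */
    static void hijack(struct ops_table *ops)
    {
        orig_readdir = ops->readdir;
        ops->readdir = evil_readdir;
    }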
Autoscopy has two phases of operation:

Learning Phase: In this phase, Autoscopy scans the kernel for function pointers to protect and collects information about normal system behavior. First, Autoscopy scans kernel memory for function pointers by dereferencing every address it finds, looking for values that could point to another location in the kernel (a sketch of this scan appears below). This list can be verified against the System.map file in the kernel, if desired. Next, the system places a Kprobe on every potential function pointer that is found. It then silently monitors the probes as the system operates, collecting the control flow information required for detection. Multiple rounds of probing may be necessary in some cases, and probes that are not activated are removed from consideration. The result is a list of all of the functions that are called by a function pointer, along with the necessary detection information. To obtain a more complete picture of trusted behavior, the Linux Test Project [35] is used to exercise as much of the kernel as possible, attempting to bring rarely-used functions under the protection scope and reduce false positives due to frequently-used functions. Note, however, that this method may leave out some task-specific behavior. Therefore, real use cases should be employed in the learning phase over and above any test suites.

Detection Phase: In this phase, Autoscopy inserts Kprobes in the functions tagged during the learning phase. However, instead of collecting information about system behavior, it verifies the information against the normal behavior that was compiled earlier. Anomalous control flows are reported immediately or are logged at the administrator's discretion.
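A minimal sketch of the learning-phase pointer scan follows; the text-segment bounds (_stext, _etext) are standard kernel symbols, but the choice of scan region and the absence of address-validity checks are simplifications made for this example.

    extern char _stext[], _etext[];   /* bounds of the kernel text segment */

    /* A word is a candidate function pointer if its value falls inside
       the kernel text segment. */
    static int points_into_kernel_text(unsigned long val)
    {
        return val >= (unsigned long)_stext && val < (unsigned long)_etext;
    }

    /* Walk a kernel data region word by word, dereferencing each word
       and recording candidate function pointers for later probing. */
    static void scan_for_pointers(unsigned long *start, unsigned long *end)
    {
        unsigned long *p;

        for (p = start; p < end; p++) {
            if (points_into_kernel_text(*p)) {
                /* record p (the hook location) and *p (its target);
                   a Kprobe is later placed on the target */
            }
        }
    }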
4.2 Detection Methods
Autoscopy initially incorporated the argument similarity detection method, but currently implements trusted location lists.

Argument Similarity: The argument similarity between two functions is defined as the number of equivalent arguments (in terms of position and value) that the functions share. The register values or "contexts" of pointer addresses are collected during the learning phase, and the current and future directions of the control flow of each probed address are examined during the detection phase. The current control flow state is examined by looking at the call stack; the future direction is checked by placing probes in functions called by the currently-probed function. Suspicious behavior is flagged when more than half of the arguments of the currently-probed function and a function discovered above or below it in the current control flow are similar. This threshold was chosen based on a manual analysis of rootkit control hijacking techniques.
Trusted Location Lists: This method uses the return address specified upon entering a probed function to verify whether or not the control flow has been modified. Location-based verification is not a new concept [12, 33], but it helps make simple decisions about the trustworthiness of the current control flow. The return addresses encountered at each probe during the learning phase are collected and used to build trusted location lists that are verified against during the detection phase. Return addresses that were not encountered during the learning phase are logged for analysis.
Moving from argument similarity to trusted location lists increases the flexibility of Autoscopy. However, it places more restrictions on the malware detection capabilities.
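The per-probe check reduces to a whitelist membership test on the return address. The sketch below assumes a fixed-size list per probe and a per-probe learning/detection flag; both are illustrative simplifications of the data structures described above.

    #define MAX_SITES 64

    struct probe_ctx {
        unsigned long trusted[MAX_SITES];  /* call sites seen in learning */
        int           n;
        int           learning;            /* 1 = learning, 0 = detection */
    };

    static int check_return_address(struct probe_ctx *ctx, unsigned long ret)
    {
        int i;

        for (i = 0; i < ctx->n; i++)
            if (ctx->trusted[i] == ret)
                return 1;                      /* known-good call site */

        if (ctx->learning && ctx->n < MAX_SITES) {
            ctx->trusted[ctx->n++] = ret;      /* whitelist the new site */
            return 1;
        }
        return 0;   /* detection phase: unknown caller, log as anomaly */
    }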
4.3 Advantages and Disadvantages
Autoscopy offers several advantages, especially with respect to embedded control systems. The most important advantage is lower space and processing requirements. Unlike most intrusion detection solutions, Autoscopy eliminates the overhead of a hypervisor or some other virtualization mechanism. Additionally, it leverages the built-in Kprobes framework of the Linux kernel, which reduces the amount of non-native code required.

Another key advantage is flexibility across multiple architectures. Indeed, this benefit was the main motivation for using trusted location lists. The argument similarity implementation [28] disassembles entire functions to locate the hooks in question. With trusted location lists, however, only one instruction (i.e., function call) is disassembled per probe. This change limits the amount of knowledge required about the architecture and instruction set, which, in turn, limits the amount of code to be changed when porting the program to a host with a different underlying architecture.

Autoscopy also permits legitimate pointer hijacking. If desired, Autoscopy can be used in conjunction with other programs that alter the control flow for security or other reasons (see, e.g., [21]). Autoscopy simply tags this program behavior as trusted during the learning phase. However, as discussed below, indiscriminate tagging can be a drawback.

Finally, the design provides a simple way to adjust the scope of mediation. While the question of what to monitor and what not to monitor may require deeper analysis, changing the number of locations to probe is as simple as adding or removing them from the list of kernel hooks generated during the learning phase.

For all the advantages that Autoscopy offers, several shortcomings exist. First and foremost, the program itself is a target for malware. By operating within the kernel, Autoscopy is open to compromise just like the host system. While additional measures can be taken to protect the integrity of the program and kernel, e.g., by using W⊕X/NX [17] or Copilot [22], these measures may run up against the resource constraints imposed on embedded control systems.
Another drawback is that Autoscopy requires a trusted base state. Because argument similarity is checked above and below a probed function, it is possible to detect malware that has been installed both before and after the deployment of Autoscopy. However, since the trusted lists are constructed by simply whitelisting every return address seen in a probed function, any malware installed before the learning phase would be classified as trusted behavior. Therefore, the system that hosts Autoscopy must be placed in a trusted base state before the learning phase to ensure that malicious behavior is classified properly.

Autoscopy also has to be tuned to the host on which it resides, which can be tricky given the different types of embedded control systems that exist. The following issues must be addressed:

Kernel Differences: The kernel must be configured properly to support Autoscopy. This ranges from simple compilation configuration choices (e.g., enabling Kprobes) to differences in the kernel text across operating system versions (e.g., kernel functions used by Autoscopy must be exported for module use).

Architecture Differences: Autoscopy must be properly adapted to the host architecture. For example, it is necessary to know which register or memory location holds the return address of a function, and how it is accessed.

Tool Availability: External tools and libraries used by Autoscopy must be available across multiple platforms. For example, Autoscopy originally used udis86 [37], an x86-specific disassembler library, which means that a similar tool must be used with other architectures. This issue is made less important by the use of trusted lists because less disassembly is required.

Fortunately, although the task of configuring Autoscopy to run on different platforms is non-trivial, it is a one-time cost that is only incurred before installation.
4.4 Threats
At this point, it is important to consider the potential threats to Autoscopy. The principal threat is data modification. An attacker with the ability to read and write to arbitrary system locations could defeat Autoscopy's defenses by modifying the underlying data structures. For example, an attacker could modify a Kprobe or change a trusted location list to include the addresses of malicious functions.

Another threat is program circumvention. Autoscopy detects malware by checking for the invocation of kernel functions from illegitimate locations. However, an attacker who writes code that duplicates the functionality of a kernel function could avoid any probed functions and bypass Autoscopy entirely.

While these threats are a concern, the design raises the bar for a malicious program to subvert the system by forcing it to increase its footprint on the host in terms of processor cycles (more computations are required to locate the appropriate data structures) and/or code size (to accommodate the extra functions needed to duplicate kernel behavior). These requirements, in turn, increase the chances of malware being detected on the host system.

Other approaches can be used to protect Autoscopy's data. One approach is to store the trusted lists in read-only memory. However, the constraints imposed by embedded systems could render this approach infeasible.
5. Experimental Results
This section describes the results of testing Autoscopy on a standard laptop system running Ubuntu 7.04 with Linux kernel version 2.6.19.7. The experiments evaluated the ability of Autoscopy to detect common control flow altering techniques, and the amount of overhead imposed on the host in terms of time and bandwidth.
5.1 Detection of Hook Hijacking
We tested Autoscopy against several control flow altering rootkits that employ kernel hook hijacking techniques [28]. Most of the rootkits tested are prototypes that demonstrate hooking techniques rather than malware from the wild. Nevertheless, they were written to showcase a broad range of control flow altering techniques and the corresponding control flow behaviors.

Table 1 lists several techniques used by malware to subvert an operating system, examples of rootkits that demonstrate these techniques, and whether or not Autoscopy was able to detect these techniques:

    Rootkit Example(s)              Detected
    superkit                        Yes
    kbdv3, Rial, Synapsys v0.4      Yes
    enyelkm v1.0                    Yes
    DR v0.1                         Yes
    DR v0.1, Adore-ng 2.6           Yes
    Adore-ng 2.6                    Yes
    Phantasmagoria                  No

Note that Autoscopy was able to detect every one of the hooking behaviors listed. Interested readers are referred to [28] for the complete list of rootkits that were tested.
5.2 Performance Overhead
We measured the performance overhead imposed by Autoscopy using five benchmarks: two standard benchmark suites (SPEC CPU2000 [36] and lmbench [15]), two large compilation projects (compiling versions of the Apache web server and the Linux kernel), and one test involving the creation of a large file. In the vast majority of these tests, Autoscopy imposed an additional time cost of no more than 5%. In fact, some of the tests indicated that the system ran faster with Autoscopy installed, which we interpreted to mean that Autoscopy had no noticeable impact on the system. Only one test (bandwidth measurement during the reading of a file) showed a large discrepancy between the results obtained with and without Autoscopy. We believe that this is due to the kernel preempting the I/O path or interfering with disk caching when it is probed.

Table 2 lists the results obtained in the five benchmark tests, which included compiling Apache httpd 2.2.10, creating a random 256 MB file and compiling Linux kernel 2.6.19.7; the measured time differences across the individual tests ranged from –0.149% to +21.139%, with most below +2%. Note that in the case of the lmbench bandwidth measurements, lower values indicate more overhead. The experimental results demonstrate that the overhead imposed by Autoscopy did not heavily inconvenience the system.
5.3 False Positives and False Negatives
Autoscopy combats false positives – where non-existent rootkits are “detected” – using a type-checking mechanism that classifies hooks based on the structures in which they are enclosed and the offsets of the hooks within their enclosing structures. This classification prevents the flagging of a control flow containing two similar, but not equivalent, indirect calls. False negatives – where existing rootkits are not detected – present an interesting challenge for Autoscopy. This is because locating potential hook hijacking locations depends on the definition of normal system behavior. For example, if a function is called indirectly from a pointer in the kernel, but is never called in this manner during the learning phase, then Autoscopy will not probe this location, leaving an opening for the hook to be hijacked silently. Therefore, it is important to use a comprehensive test suite during the learning phase to avoid these kinds of events.
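The classification idea can be sketched in a few lines (a hypothetical illustration; the structure names, offsets and addresses are invented, not taken from Autoscopy):

    # Hypothetical sketch of type-checking hooks by (enclosing structure, offset).
    # Two similar indirect calls that live in different structures, or at
    # different offsets within the same structure, are classified separately
    # and never conflated.
    trusted_hooks = {
        ("file_operations", 8): {0xc0173000},   # e.g., a trusted read handler
        ("file_operations", 16): {0xc0173400},  # a different slot, tracked separately
    }

    def is_trusted(struct_name: str, offset: int, target_addr: int) -> bool:
        return target_addr in trusted_hooks.get((struct_name, offset), set())

    # A call through ("file_operations", 16) is not validated against the
    # addresses trusted for ("file_operations", 8), avoiding a false positive.
    print(is_trusted("file_operations", 8, 0xc0173000))   # True
    print(is_trusted("file_operations", 16, 0xc0173000))  # False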
5.4 Shortcomings
Some issues that could impact Autoscopy's performance were discovered during the transition to the new trusted location list approach. For example, each probe in the learning phase only reserves enough space for a single function call (which is overwritten every time the probe is hit), and indirect function calls are checked only after probing is completed. Thus, if a function is called both indirectly and directly, it could be overlooked during the learning phase if it was last called directly before being checked. Furthermore, if a function is called indirectly from multiple locations, all but one of these locations could be tagged as false positives. This issue and others like it will be identified and corrected in future versions of Autoscopy.
6. Future Work
Our ultimate goal is to demonstrate the feasibility of using Autoscopy to protect production systems in the power grid without impacting the ability of embedded devices to perform their required tasks. To accomplish this, we plan to port Autoscopy to embedded control devices that are currently used in the power grid and evaluate Autoscopy's performance on real equipment. Currently, we are collaborating with Schweitzer Engineering Laboratories [30] to analyze how an Autoscopy-enabled power device would perform in simulated use cases compared with using a virtual machine and hypervisor. We are considering two systems in our analysis: an x86-based general computing platform and a less powerful PowerPC-based device. The differences between the two systems, in terms of architecture and resource availability, will provide a good test of Autoscopy's flexibility and lightweight design.
We also plan to test a basic virtualized configuration on both power devices, placing the kernel inside a VM monitored by a hypervisor and running the same tests as performed on Autoscopy-enabled devices. This will provide a benchmark to show how Autoscopy performs in relation to a hypervisor-based solution. Our plan is to evaluate Autoscopy and the hypervisor alternative in terms of the overhead they impose on power systems, and to determine whether or not an in-kernel approach can offer better performance with less interference.
7. Conclusions
Autoscopy takes a practical approach to intrusion detection that operates within the operating system kernel and leverages its built-in tracing framework to minimize the performance overhead on the host system. Our tests demonstrate the effectiveness of Autoscopy in a non-embedded environment. However, Autoscopy also holds promise as a means for protecting embedded control systems in the electrical power grid. Given the critical, time-sensitive nature of the tasks performed by embedded devices in the power grid, Autoscopy offers the flexibility to balance detection functionality with the overhead imposed on the system. Since it is situated in the kernel, Autoscopy requires some hardware (e.g., memory immutability) or software (e.g., kernel hardening) protection measures. However, these protective measures would cost less than full-blown reference monitor isolation via hardware virtualization that underlies hypervisor-based solutions. Note that the views and opinions in this paper are those of the authors and do not necessarily reflect those of the United States Government or any agency thereof.
Acknowledgements

This research was supported by the Department of Energy under Award No. DE-OE0000097. The authors also wish to thank David Whitehead and Dennis Gammel (Schweitzer Engineering Laboratories) and Tim Yardley (University of Illinois at Urbana-Champaign) for their advice and assistance with the Autoscopy test plan.
References

[1] M. Abadi, M. Budiu, U. Erlingsson and J. Ligatti, Control flow integrity: Principles, implementations and applications, ACM Transactions on Information and System Security, vol. 13(1), pp. 4:1–40, 2009.
[2] S. Bratus, M. Locasto, A. Ramaswamy and S. Smith, VM-based security overkill: A lament for applied systems security research, Proceedings of the New Security Paradigms Workshop, pp. 51–60, 2010.
[3] B. Cantrill, M. Shapiro and A. Leventhal, Dynamic instrumentation of production systems, Proceedings of the USENIX Annual Technical Conference, pp. 15–28, 2004.
[4] N. Falliere, L. O’Murchu and E. Chien, W32.Stuxnet Dossier, Symantec, Mountain View, California (www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/w32_stuxnet_dossier.pdf), 2011.
[5] S. Forrest, S. Hofmeyr, A. Somayaji and T. Longstaff, A sense of self for Unix processes, Proceedings of the IEEE Symposium on Security and Privacy, pp. 120–128, 1996.
[6] B. Hicks, S. Rueda, T. Jaeger and P. McDaniel, From trusted to secure: Building and executing applications that enforce system security, Proceedings of the USENIX Annual Technical Conference, 2007.
[7] Institute of Electrical and Electronics Engineers, IEEE 1646-2004 Standard: Communication Delivery Time Performance Requirements for Electric Power Substation Automation, Piscataway, New Jersey, 2004.
[8] X. Jiang, X. Wang and D. Xu, Stealthy malware detection through VMM-based “out-of-the-box” semantic view reconstruction, Proceedings of the Fourteenth ACM Conference on Computer and Communications Security, pp. 128–138, 2007.
[9] C. Kolbitsch, P. Comparetti, C. Kruegel, E. Kirda, X. Zhou and X. Wang, Effective and efficient malware detection at the end host, Proceedings of the Eighteenth USENIX Security Symposium, pp. 351–366, 2009.
[10] B. Lee, S. Moon and Y. Lee, Application-specific packet capturing using kernel probes, Proceedings of the Eleventh IFIP/IEEE International Symposium on Integrated Network Management, pp. 303–306, 2009.
[11] M. LeMay and C. Gunter, Cumulative attestation kernels for embedded systems, Proceedings of the Fourteenth European Symposium on Research in Computer Security, pp. 655–670, 2009.
[12] J. Levine, J. Grizzard and H. Owen, A methodology to detect and characterize kernel level rootkit exploits involving redirection of the system call table, Proceedings of the Second IEEE International Information Assurance Workshop, pp. 107–125, 2004.
[13] L. Litty, H. Lagar-Cavilla and D. Lie, Hypervisor support for identifying covertly executing binaries, Proceedings of the Seventeenth USENIX Security Symposium, pp. 243–258, 2008.
[14] A. Mavinakayanahalli, P. Panchamukhi, J. Keniston, A. Keshavamurthy and M. Hiramatsu, Probing the guts of Kprobes, Proceedings of the Linux Symposium, vol. 2, pp. 109–124, 2006.
[15] L. McVoy and C. Staelin, lmbench: Portable tools for performance analysis, Proceedings of the USENIX Annual Technical Conference, 1996.
[16] T. Mittner, Exploiting grsecurity/PaX with Dan Rosenberg and Jon Oberheide (resources.infosecinstitute.com/exploiting-gresecuritypax), May 18, 2011.
[17] I. Molnar, NX (No eXecute) support for x86, 2.6.7-rc2-bk2, Linux Kernel Mailing List (lkml.org/lkml/2004/6/2/228), June 2, 2004.
[18] Motorola Solutions, ACE3600 Specifications Sheet, Schaumburg, Illinois (www.motorola.com/web/Business/Products/SCADA%20Products/ACE3600/%5FDocuments/Static%20Files/ACE3600%20Specifications%20Sheet.pdf?pLibItem=1), 2009.
[19] Openwall, Linux kernel patch from the Openwall Project (www.openwall.com/linux).
[20] PaX Team, Homepage (pax.grsecurity.net).
[21] B. Payne, M. Carbone, M. Sharif and W. Lee, Lares: An architecture for secure active monitoring using virtualization, Proceedings of the IEEE Symposium on Security and Privacy, pp. 233–247, 2008.
[22] N. Petroni, T. Fraser, J. Molina and W. Arbaugh, Copilot – A coprocessor-based kernel runtime integrity monitor, Proceedings of the Thirteenth USENIX Security Symposium, pp. 179–194, 2004.
[23] N. Petroni and M. Hicks, Automated detection of persistent kernel control flow attacks, Proceedings of the Fourteenth ACM Conference on Computer and Communications Security, pp. 103–115, 2007.
[24] phrack.org, Phrack, no. 50 (www.phrack.org/issues.html?issue=50), April 9, 2007.
[25] pragmatic/THC, (Nearly) complete Linux loadable kernel modules (dl.packetstormsecurity.net/docs/hack/LKM_HACKING.html), 1999.
[26] V. Prasad, W. Cohen, F. Eigler, M. Hunt, J. Keniston and B. Chen, Locating system problems using dynamic instrumentation, Proceedings of the Linux Symposium, pp. 49–64, 2005.
[27] P. Proctor, The Practical Intrusion Detection Handbook, Prentice-Hall, Upper Saddle River, New Jersey, 2001.
[28] A. Ramaswamy, Autoscopy: Detecting Pattern-Searching Rootkits via Control Flow Tracing, Master’s Thesis, Department of Computer Science, Dartmouth College, Hanover, New Hampshire, 2009.
[29] R. Riley, X. Jiang and D. Xu, Guest-transparent prevention of kernel rootkits with VMM-based memory shadowing, Proceedings of the Eleventh International Symposium on Recent Advances in Intrusion Detection, pp. 1–20, 2008.
[30] Schweitzer Engineering Laboratories, Home, Pullman, Washington (www.selinc.com).
[31] Schweitzer Engineering Laboratories, SEL-3354 Embedded Automation Computing Platform Data Sheet, Pullman, Washington (www.selinc.com/WorkArea/DownloadAsset.aspx?id=6196), 2011.
[32] D. Singh and W. Kaiser, The Atom LEAP Platform for Energy-Efficient Embedded Computing, Technical Report, Center for Embedded Network Sensing, University of California at Los Angeles, Los Angeles, California, 2010.
[33] s0ftpr0ject Team, Tools and Projects (www.s0ftpj.org/en/tools.html).
[34] R. Sommer and V. Paxson, Outside the closed world: On using machine learning for network intrusion detection, Proceedings of the IEEE Symposium on Security and Privacy, pp. 305–316, 2010.
[35] SourceForge.net, Linux Test Project (ltp.sourceforge.net).
[36] Standard Performance Evaluation Corporation, SPEC CPU2000 Benchmark Suite, Gainesville, Florida (www.spec.org/cpu2000), 2007.
[37] V. Thampi, udis86 Disassembler Library for x86 and x86-64 (udis86.sf.net), 2009.
[38] Transmission and Distribution World, About 212 million “smart” electric meters in 2014, says ABI Research (tdworld.com/smart_grid_automation/abi-research-smart-meters-0210), February 3, 2010.
[39] Z. Wang, X. Jiang, W. Cui and P. Ning, Countering kernel rootkits with lightweight hook protection, Proceedings of the Sixteenth ACM Conference on Computer and Communications Security, pp. 545–554, 2009.
Chapter 4

A PLANT-WIDE INDUSTRIAL PROCESS CONTROL SECURITY PROBLEM

Thomas McEvoy and Stephen Wolthusen

Abstract
Industrial control systems are a vital part of the critical infrastructure. The potentially large impact of a failure makes them attractive targets for adversaries. Unfortunately, simplistic approaches to intrusion detection using protocol analysis or naïve statistical estimation techniques are inadequate in the face of skilled adversaries who can hide their presence with the appearance of legitimate actions. This paper describes an approach for identifying malicious activity that involves the use of a path authentication mechanism in combination with state estimation for anomaly detection. The approach provides the ability to reason conjointly over computational structures and operations, and physical states. The well-known Tennessee Eastman reference problem is used to illustrate the efficacy of the approach.
Keywords: Industrial control systems, subversion detection
1. Introduction
In industrial control systems, detection and prevention extend beyond the computational model into the physical realm. While protocol analysis may signal anomalies as proposed by Coutinho, et al. [2], a skilled adversary can issue apparently authentic commands [18] using legitimate protocols. Analysis may be extended using state estimation techniques, but these should not be applied naïvely [10, 16], especially in non-linear environments such as those encountered in the biochemical industry [6]. This paper describes an approach that utilizes state estimation in intrusion detection in combination with path authentication techniques. The approach assumes the existence of an adversary who can subvert channels and system functions [9]. Hence, it is necessary to verify the reliability and independence of channels and functions for message transmission. This is achieved by combining state estimation techniques using proxy measurements [10] with algebraic proofs over structures and operations. The Tennessee Eastman reference problem is employed as a case study to demonstrate the application of the approach to non-linear systems.
2. Related Work
Industrial control systems are a vital part of the critical infrastructure and are attractive targets for adversaries. Security in such systems is generally weak [3]. Recent research has focused on anomaly detection at the protocol level, since traffic in control networks is well-characterized and, hence, particularly amenable to such techniques [2]. Approaches using physical state estimation techniques have also been researched [15], but these are largely limited to linear systems. However, many industrial systems, including biological and chemical processes, exhibit non-linear behavior or require non-linear control laws, resulting in less well-defined models and limited accuracy [6]. Real-time detection is also an important requirement for these industrial systems [17]. It has been argued that, in the presence of channel compromise, adversaries may use protocols correctly and present syntactically and semantically correct messages, resulting in the failure of conventional detection techniques to signal anomalies [9, 18]. These attacks may also be concealed in noisy processes that are not amenable to elementary statistical analysis [16]. In particular, this is true for non-linear systems [10]. The Tennessee Eastman reference problem [5] is commonly considered in control systems research and pedagogy (see, e.g., [1, 7, 8, 12]). It provides a well-defined problem space for using different control laws. Furthermore, a number of simulation models are available for this problem. The process calculus used to construct the control overlay model in this paper was defined in [9], where an adversary capability model for industrial control systems was also proposed. This paper uses the process calculus model to analyze computational structures and operations using techniques related to probabilistic packet marking and path authentication [4].
3. Control Problem
An attack on an industrial control system is usually accompanied by the use of concealment techniques. Protocol analysis by itself may not detect an attack that uses legitimate protocols. State estimation techniques rely on the integrity of the signals; they can deal with missing data and noisy signals, but not with deceptive or misleading signals from subverted channels. Hence, conjoint reasoning over both channels and signals is required to uncover malicious activity and to separate false signals from true ones.
4. Solution Approach
We define a computational overlay for an industrial control system using an applied π-calculus [13]. In the context of the Tennessee Eastman challenge problem [5], we demonstrate the existence of proxy measurements of plant activity that can be used to detect anomalies. However, this requires the ability
to reason about channel integrity. This is accomplished using path authentication methods that can be proven within the algebraic framework. An explicit model of human intervention is not presented; rather, we consider operational capability in terms of detection.
5. Process Calculus

The capabilities of our π-calculus variant are specified by (reconstructed from the original typesetting):

\pi ::= \bar{x}\langle y \rangle_{p,r} \;|\; x(z)_{p,r} \;|\; \tau \;|\; \lambda \;|\; f(\tilde{z}) \rightarrow \bar{x}\langle w \rangle, w \;|\; [x = y]\pi

A simplified version of the process calculus was presented in [9], where it was used to represent adversary capabilities. Here, we expand on its functionality to permit proofs over structures and operations. The capabilities of the process calculus are: (i) sending a name with priority and routing; (ii) receiving a name with priority and routing; (iii) performing an unobserved action (with the special meaning of decision-making); (iv) performing an observable inaction (process failure); (v) name generating function; (vi) replication capability; and (vii) conditional capability. \tilde{z} is used to denote a vector of names. Names are typed as channels, variables or constants. The operations of the π-calculus are retained and augmented as follows:

P ::= M \;|\; P \mid P \;|\; \nu z\, P \;|\; !P
M ::= 0 \;|\; \pi.P \;|\; M + M \;|\; M \oplus M

where P is a process that may be a summation, concurrent, a new process with (restricted) names, or replication. M is a summation that may be null or termination, a capability guarding a process and – adding a variant summation – a soft choice between retained alternatives and a hard choice between mutually exclusive alternatives (see Sangiorgi and Walker [13] for additional details). Hence, a process may partially order its messaging and the exercising of its capabilities in a manner that is not mutually exclusive. For example, the process may send a set of messages in some order. However, it cannot be subverted as an agent of the adversary and also resist such subversion because these outcomes are mutually exclusive.

The name generating function takes a set of parameters and returns a name. In general, it provides a parametric interface to the physical processes or control functions that may be defined by a state space equation or its transform. The function can also be used for other purposes, for example, to simulate automated decision-making or as a cryptographic primitive.

Routing captures the ability of the system to send a message to a process by means of another process, provided the name of the process exists in the intervening process. Routing information may be explicitly coded in the summation or understood implicitly from the process structure. For example,

\bar{x}\langle m \rangle_y.0 \mid x(u).\bar{r}\langle u \rangle_r.0 + x(u).\bar{s}\langle u \rangle_{[y]}.0 \mid s(u).\bar{y}\langle u \rangle.0 \mid r(u).0 \mid y(u).0

sends m to x and forwards it to y, but not to r. Prioritization can be captured by a simple ranking system [9]. Special types of functions are defined using a finite set of labels λ (e.g., delay and message loss). The actions of these properties can be described as necessary. However, they are essentially means for naming invisible process actions that would otherwise be regarded as degenerate terminations. The following equation illustrates one use of labels:

((\bar{x}\langle u \rangle + x(u)).0 + Loss + Delay) \equiv ((\bar{x}\langle u \rangle + x(u)).0 + 0 + 0)

Figure 1. Tennessee Eastman problem under base control [11].
6. Model Creation
This section describes how a suitable state estimation algorithm may be used along with proxy measurements or estimators in combination with path authentication techniques to uncover reliable channels and to maintain system operations in the presence of malicious activity. The Tennessee Eastman challenge problem is used to illustrate the application of the approach to non-linear estimation problems for industrial control systems.
6.1 Tennessee Eastman Problem
The Tennessee Eastman plant is a non-contrived, albeit modified, model of a real chemical process (Figure 1). It consists of a reactor-separator-recycler
arrangement involving two simultaneous irreversible gas-liquid exothermic reactions and two byproduct reactions given by:

A(g) + C(g) + D(g) → G(l)   (Product 1)
A(g) + C(g) + E(g) → H(l)   (Product 2)
A(g) + E(g) → F(l)          (Byproduct)
3D(g) → 2F(l)               (Byproduct)
The plant is open-loop unstable and highly non-linear. Various approaches to its control have been described [8], which can result in distinct channel architectures, rendering it a suitable candidate for testing a variety of models and techniques. The gaseous reactants (g) form liquid products (l). Note that the products are not specifically identified and that the process was modified from the real industrial process by the original authors [5]. The gas phase reactions are catalyzed by a non-volatile substance dissolved in the liquid phase in the reactor. The reactor is pressurized and agitated, and uses an internal cooling bundle to remove the heat produced by the reactions. The products leave the reactor in the vapor phase along with the unreacted feeds, while the catalyst remains in the reactor.

The reactor product stream passes through a cooler that condenses the products, and from there to a vapor-liquid separator. Non-condensed components cycle back to the reactor feed via a centrifugal compressor. Condensed components are sent to a product stripping column that removes the remaining reactants. Products G and H exit the stripper base and are separated in a downstream refining section, which is not included in the problem statement. The byproducts and inerts are purged from the system in the vapor phase using a vapor-liquid separator.

The system may be operated in six distinct modes to produce different product mass outputs. The plant has twelve valves for manipulation, and a total of 41 measurements are involved in monitoring and control. Note that one of the valves is not shown in Figure 1, which only provides closed control loops; the valve is used for higher order control purposes. Following the base control strategy outlined by McAvoy and Ye [11], most of the variables may be removed from consideration to leave the twelve controlled variables and twelve manipulated variables shown in Table 1. Hence, for state estimation purposes, depending on the control law used, not all the variables need to be considered. This implies that a set of alternative measurements may be available as proxies for the main control variables; in other words, for state estimation purposes, a number of measurements beyond the main ones in the model can be used for estimation by proxy [10].
Table 1. Manipulated and controlled variables.

Manipulated Variable                     Controlled Variable
A-feed set point                         Reactor level
D-feed set point                         Separator level
E-feed set point                         Stripper bottom level
C-feed set point                         Reactor pressure
Purge set point                          Reactor feed flow
Product set point                        Reactor temperature
Stripper stream flow set point           Compressor power
Separator bottom flow set point          Compressor exit flow
Reactor cooling water set point          Separator pressure
Condenser cooling water set point        Separator temperature
Compressor recycle valve                 Stripper pressure
Stirrer speed                            Stripper temperature

6.2 Tennessee Eastman Overlay
Using our process calculus, we can define a system architecture that satisfies the control purposes. To do so, we define the entities, messengers and agents of the system. By τ, entities make decisions. Messengers pass decisions as names. By f() →, agents are processes that act on decisions. For example, an operator that is an entity is defined by the equation:

Operator := \bar{x}\langle u \rangle.0 \oplus x(u).0 \oplus \tau.0 \mid\; !Operator

where the set Operator = \{Operator, Adversary\} and τ is the decision-making capability. A (simple) controller may be defined by (reconstructed from the original typesetting):

Controller := \nu i\,((z(e)_1.\bar{z}\langle e \rangle.f(p, k, e, i) \rightarrow \bar{z}\langle i \rangle_1
  + y(k')_2.Controller\langle p, k', e \rangle.\bar{y}\langle k \rangle_2
  + y(p')_2.Controller\langle p', k, e \rangle.\bar{y}\langle p \rangle_2).0
  + (y(m).Controller'\langle p, k, e \rangle.0 \oplus Resist) \mid\; !Controller)

where the controller may be changed to an agent of the adversary by a malicious message m that represents a successful attack, Resist is the ability to resist such an attack, and the set Agent := \{Agent, PlantProcess\} represents the agent state. Other examples of control system structures are provided in [9]. They can be used to create the complete system infrastructure.
6.3 State Estimation
State estimation is the problem of accounting for the state of a system in the presence of process disturbances and measurement noise. A general non-linear system can be described as:
x_{k+1} = f(x_k, u_k) + w_k      (system equation)
y_{k+1} = h(x_k) + v_k           (output equation)
where x is the state variable vector, u represents the inputs under control, w represents process noise, y is the measured output and v is the measurement noise. Note that x is not known directly, but is estimated based on y; this accounts statistically for both process and measurement noise. We assume that process and measurement noise can be represented as Gaussian white noise with a mean of zero (μ = 0) and a suitable variance (σ²). Several state estimation algorithms are available for this purpose. An example is the extended Kalman filter [14]. Note, however, that state estimation techniques in general are defined recursively and hence have a “memory” of the previous states of a system. This distinguishes them from pure correlation techniques, where the memory of previous system behavior is lost. In the case of most industrial systems, it is possible to derive multiple sets of measurements that are functionally independent of each other in control terms. Thus, alternative means exist for testing the reliability of measurements and for substituting one set of measurements for another for control and channel authentication purposes. For example, in the Tennessee Eastman system, influx A in Figure 1 can be measured directly by its flow meter, estimated by the initial flow analyzer and pressure controller, and also inversely estimated based on D and E, C, G and H. Both estimation techniques can be used and their results compared to identify inconsistencies and determine the integrity of channels and functions.
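As a toy illustration of comparing a direct measurement with an independent proxy estimate (the values, noise level and 3σ threshold below are assumptions for illustration, not parameters from the paper):

    # Sketch of estimation by proxy: a direct flow measurement for influx A is
    # compared against an independent proxy estimate; a residual well beyond
    # the noise level marks the channel as suspect.
    import random

    def consistent(direct, proxy, sigma=0.5, k=3):
        # Flag when the residual exceeds k standard deviations of the noise.
        return abs(direct - proxy) <= k * sigma

    true_flow = 42.0
    direct = true_flow + random.gauss(0, 0.5)    # flow meter reading
    proxy = true_flow + random.gauss(0, 0.5)     # estimate derived from other loops
    print("channel consistent:", consistent(direct, proxy))

    tampered = true_flow + 10.0                  # adversary-manipulated signal
    print("channel consistent:", consistent(tampered, proxy))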
7. Model Application
We assume the existence of an adversary who can subvert channels and functions to act on his behalf. This means that encryption techniques cannot be used to guarantee the freshness or authenticity of messages since the message originators may be compromised by the adversary. In particular, the adversary (or rather his agents) can perfectly forge messages with respect to the protocol formulation and/or directly manipulate physical measurements. We assume that a set of robust estimators E exists for a system such as the Tennessee Eastman problem, which we can use to detect inconsistent measurements. (The estimators are derived by simulation.) The goal is to clearly mark channels and sensors (controls) as reliable or unreliable to avoid an unnecessary system shutdown. To do this, it is necessary to prove that a set n = |E(·)| of independent channels exists for each estimator. In the case of untainted channels, the associated estimators can be used. However, if all the channels for an estimator are tainted, then a contingent estimator can be used provided that its channels are untainted. Clearly, a complete set of fully separated channels provides a trivial (but not minimal) solution. Non-trivial solutions are required
because channels are generally shared by messages due to the convergence of channels onto an operator and resilience characteristics. Channel independence may be demonstrated by variations on packet marking for path authentication [4]. Several such techniques may be investigated for applicability, considering parameters such as topological constraints. We illustrate one technique by constructing a set of channels that use a “salt” to mark the route selected by a message. The salt is a shared secret between the channel and the operator. We assume that a set of known routes exists, over which we define “normal routes” and “deviations.” For each deviation, a salt is added to provide a trace of the path followed by a signal package. Let \{P1, P2, P3, P4\} be the controllers and Op be the operator as previously defined. We assume that each controller hashes the message identifiably. We define a set of channels such that each channel may re-route a message to an adjacent channel on message failure (Loss). Before doing so, it rehashes the message hash with its salt and attaches its name in order. The channels are defined by the following equations (reconstructed from the original typesetting):

C_n := \nu s\,(\bar{x}_{D_n}\langle u \rangle_{[Op]} + x_{C_n}(u)_{[Op]} + Loss.Hash(u, s) \rightarrow \bar{w}_{C(n+1)}\langle z \rangle_{[Op]} + w_{C(n-1)}(z)_{[Op]} + \bar{x}_{D_n}\langle z \rangle_{[Op]}).0 \mid\; !C_n

D_n := \nu s\,(\bar{x}_{D}\langle u \rangle_{[Op]} + x_{C_n}(u)_{[Op]} + Loss.Hash(u, s) \rightarrow \bar{w}_{D(n+1)}\langle z \rangle_{[Op]} + w_{D(n-1)}(z)_{[Op]} + \bar{x}_{D_n}\langle z \rangle_{[Op]}).0 \mid\; !D_n

E_n := \nu s\,(\bar{x}_{Op}\langle u \rangle_{[Op]} + x_{E_n}(u)_{[Op]} + Loss.Hash(u, s) \rightarrow \bar{w}_{E(n+1)}\langle z \rangle_{[Op]} + w_{E(n-1)}(z)_{[Op]} + \bar{x}_{Op}\langle z \rangle_{[Op]}).0 \mid\; !E_n

The overall structure is given by the equations:

\bar{x}_{P1}\langle z_x \rangle_{[Op],1}.P1 \mid C1 \mid D1 \mid E1 \mid
\bar{x}_{P2}\langle z_x \rangle_{[Op],1}.P2 \mid C2 \mid D2 \mid E2 \mid Op \mid
\bar{x}_{P3}\langle z_x \rangle_{[Op],1}.P3 \mid C3 \mid D3 \mid E3 \mid
\bar{x}_{P4}\langle z_x \rangle_{[Op],1}.P4 \mid C4 \mid D4 \mid E4

Note that the topology is deliberately constrained, a characteristic of industrial control systems. We claim that each message follows a route that is uniquely identified by its origin and its membership in the set of deviations. Let K_{m,n} be a message with n salts and m = n + 1 names. The name order must be consistent with the deviations permitted by the topology and must match the salt order. We subtract a name and a salt from K, obtaining K_{m-1,n-1}. We treat this as an α move in a game. If the move K_{m,n} → K_{m-1,n-1} is not permitted, where α is the trace that is the set of channels between the two marked channels, say K_P and K_Q, then the routing is invalid. If the routing is valid, then the operation can be repeated until K_{1,0} is reached, which should be the expected origin of the message. Thus, the route taken by each message can be identified. Since each
message follows a uniquely identifiable route, an inconsistent message marks a potentially subverted route. Using set elimination over the routes between an origin and a destination, σ_{i,j} = α_i − α_j, the subverted channels can be identified in a probabilistic manner. Hence, if a message is sent independently of an unreliable channel, it may be regarded as reliable; otherwise, it is not. Observing the independence of channels permits the detection of the adversary's actions and the operation of the plant, even where manipulated signals share routes with reliable signals. To complete the approach, the set of estimators should also be independent sources of information about the process. A cyclic dependency between estimators must be avoided. For example, if estimator A1 is used to estimate B2, B2 to estimate C4, and C4 to estimate A1, then the results become meaningless. Undermining this approach requires the adversary to capture all the salts, which we regard as infeasible. In essence, we assume the adversary can only gain partial control of the system.
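The salted marking scheme can be illustrated with a short sketch (the use of SHA-256 and the salt table below are illustrative choices; the paper does not fix a particular hash function or key distribution):

    # Sketch of salted path marking: each channel that re-routes a message on
    # failure rehashes the running hash with its secret salt and appends its
    # name; the operator, knowing all salts, replays the claimed route and
    # rejects messages whose hash does not match.
    import hashlib

    SALTS = {"C1": b"s-c1", "C2": b"s-c2", "D1": b"s-d1"}  # shared secrets (illustrative)

    def rehash(h: bytes, salt: bytes) -> bytes:
        return hashlib.sha256(h + salt).digest()

    def mark(message: bytes, route: list) -> bytes:
        h = hashlib.sha256(message).digest()
        for channel in route:              # each deviation adds its salt
            h = rehash(h, SALTS[channel])
        return h

    def verify(message: bytes, h: bytes, claimed_route: list) -> bool:
        expected = hashlib.sha256(message).digest()
        for channel in claimed_route:
            expected = rehash(expected, SALTS[channel])
        return expected == h

    h = mark(b"valve=closed", ["C1", "D1"])
    print(verify(b"valve=closed", h, ["C1", "D1"]))  # True: route matches
    print(verify(b"valve=closed", h, ["C1", "C2"]))  # False: inconsistent route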
8. Conclusions
Research in the area of control systems security has shown that attackers can forge protocols or directly manipulate physical signals to mask their activities. In earlier work [10], we demonstrated that proxy measurements can detect such inconsistencies. However, to minimize re-engineering effort, it is desirable to use measurements that are already present. Combining path authentication with state estimation techniques is an effective means for identifying subverted channels and processes and, as such, promises to be a rich area of research in control systems security. Our future research will focus on refining the path authentication technique and selecting robust estimators for state estimation by proxy.
References

[1] L. Bie and X. Wang, Fault detection and diagnosis of a continuous process based on multiblock principal component analysis, Proceedings of the International Conference on Computer Engineering and Technology, pp. 200–204, 2009.
[2] M. Coutinho, G. Lambert-Torres, L. da Silva, J. da Silva, J. Neto, E. da Costa Bortoni and H. Lazarek, Attack and fault identification in electric power control systems: An approach to improve security, Proceedings of the Power Tech Conference, pp. 103–107, 2007.
[3] A. Creery and E. Byres, Industrial cybersecurity for power systems and SCADA networks, Proceedings of the Fifty-Second Annual Petroleum and Chemical Industry Conference, pp. 303–309, 2005.
[4] X. Dang, E. Albright and A. Abonamah, Performance analysis of probabilistic packet marking in IPv6, Computer Communications, vol. 30(16), pp. 3193–3202, 2007.
[5] J. Downs and E. Vogel, A plant-wide industrial process control problem, Computers and Chemical Engineering, vol. 17(3), pp. 245–255, 1993.
[6] D. Gamez, S. Nadjm-Tehrani, J. Bigham, C. Balducelli, K. Burbeck and T. Chyssler, Safeguarding critical infrastructures, in Dependable Computing Systems: Paradigms, Performance Issues and Applications, H. Diab and A. Zomaya (Eds.), John Wiley, Hoboken, New Jersey, pp. 479–499, 2005.
[7] T. Kraus, P. Kuhl, L. Wirsching, H. Bock and M. Diehl, A moving horizon state estimation algorithm applied to the Tennessee Eastman benchmark process, Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 377–382, 2006.
[8] T. Larsson and S. Skogestad, Plant-wide control – A review and a new design procedure, Modeling, Identification and Control, vol. 21(4), pp. 209–240, 2000.
[9] T. McEvoy and S. Wolthusen, A formal adversary capability model for SCADA environments, presented at the Fifth International Workshop on Critical Information Infrastructure Security, 2010.
[10] T. McEvoy and S. Wolthusen, Detecting sensor signal manipulations in non-linear chemical processes, in Critical Infrastructure Protection IV, T. Moore and S. Shenoi (Eds.), Springer, Heidelberg, Germany, pp. 81–94, 2010.
[11] T. McAvoy and N. Ye, Base control for the Tennessee Eastman problem, Computers and Chemical Engineering, vol. 18(5), pp. 383–413, 1994.
[12] N. Ricker, Decentralized control of the Tennessee Eastman challenge process, Journal of Process Control, vol. 6(4), pp. 205–221, 1996.
[13] D. Sangiorgi and D. Walker, π-Calculus: A Theory of Mobile Processes, Cambridge University Press, Cambridge, United Kingdom, 2001.
[14] D. Simon, Optimal State Estimation: Kalman, H∞ and Nonlinear Approaches, John Wiley, Hoboken, New Jersey, 2006.
[15] S. Su, X. Duan, X. Zeng, W. Chan and K. Li, Context information-based cyber security defense of protection system, IEEE Transactions on Power Delivery, vol. 22(3), pp. 1477–1481, 2007.
[16] N. Svendsen and S. Wolthusen, Using physical models for anomaly detection in control systems, in Critical Infrastructure Protection III, C. Palmer and S. Shenoi (Eds.), Springer, Heidelberg, Germany, pp. 139–149, 2009.
[17] C. Ten, G. Manimaran and C. Liu, Cybersecurity for critical infrastructures: Attack and defense modeling, IEEE Transactions on Systems, Man and Cybernetics (Part A: Systems and Humans), vol. 40(4), pp. 853–865, 2010.
[18] J. Verba and M. Milvich, Idaho National Laboratory Supervisory Control and Data Acquisition Intrusion Detection System (SCADA IDS), Proceedings of the IEEE Conference on Technologies for Homeland Security, pp. 469–473, 2008.
Chapter 5

IDENTIFYING VULNERABILITIES IN SCADA SYSTEMS VIA FUZZ-TESTING

Rebecca Shapiro, Sergey Bratus, Edmond Rogers and Sean Smith

Abstract
Security vulnerabilities typically arise from bugs in input validation and in the application logic. Fuzz-testing is a popular security evaluation technique in which hostile inputs are crafted and passed to the target software in order to reveal bugs. However, in the case of SCADA systems, the use of proprietary protocols makes it difficult to apply existing fuzz-testing techniques as they work best when the protocol semantics are known, targets can be instrumented and large network traces are available. This paper describes a fuzz-testing solution involving LZFuzz, an inline tool that provides a domain expert with the ability to effectively fuzz SCADA devices.
1. Introduction

Critical infrastructure assets such as the power grid are monitored and controlled by supervisory control and data acquisition (SCADA) systems. The proper functioning of these systems is necessary to ensure the safe and reliable operation of the critical infrastructure – something as simple as an input validation bug in SCADA software can leave an infrastructure asset vulnerable to attack. While large software development companies may have the resources to thoroughly test their software, our experience has shown that the same cannot be said for SCADA equipment manufacturers. Proell from Siemens [19] notes that random streams of bytes are often enough to crash SCADA devices. Securing SCADA devices requires extensive testing for vulnerabilities. However, software vulnerabilities are often not well understood by SCADA developers and infrastructure experts, who may themselves not have the complete protocol documentation. Meanwhile, external security experts lack the SCADA knowledge, resources and access to run thorough tests. This is a Catch-22 situation.
Fuzz-testing is a form of security testing in which bad inputs are chosen in an attempt to crash the software. As such, it is widely used to test for security bugs in input validation as well as in application logic. However, applying fuzz-testing methodologies to secure SCADA devices is difficult. SCADA systems often rely on poorly understood proprietary protocols, which complicates test development. The time-sensitive, session-oriented nature of many SCADA environments makes it impossible to prime a fuzzer with a large capture. (Session data is only valid for a short time and is often rejected out of hand by the target thereafter.) Furthermore, many modern fuzzers require users to attach a debugger to the target, which is not always possible in a SCADA environment. What is needed is a fuzzer that works inline. This paper describes LZFuzz, an inline fuzzing tool that enables infrastructure asset owners and operators to effectively fuzz their own equipment without needing to modify the target system being tested, and without having to expose their assets or pass proprietary information to external security evaluators.
2. Fuzzing Overview
Barton Miller, the father of fuzz-testing, observed during a thunderstorm that the lightning-induced noise on his network connection caused programs to crash [15]. The addition of randomness to inputs triggered bugs that were not identified during software testing. Upon further investigation, Miller discovered that the types of bugs triggered by fuzzing included race conditions, buffer overflows, failures to check return code and printf/format string problems. These bugs are often sources of software security vulnerabilities [14]. Most modern software undergoes aggressive input checking and should handle random streams of bytes without crashing. Consequently, modern fuzz-testing tools have become more selective in how they fuzz inputs. Whether or not data has been fuzzed, there usually are multiple layers of processing that the data has to undergo before it reaches the target software’s application logic. Application logic is the soft underbelly of software – penetrating it greatly increases the likelihood of compromising the software. Fuzzed inputs trigger bugs only if they are not rejected by one of the processing layers before they get to the application logic. Therefore, a fuzzer must generate inputs that are clean enough to pass all the processing layer checks, but that are sufficiently malformed to trigger bugs in the application logic. The most successful fuzzers create fuzzed inputs based on complete knowledge of the layout and contents of the inputs. If a fuzzer is given information on how a specific byte will be interpreted, it can manipulate the byte in ways that are more likely to compromise the target. For example, if a particular sequence of bytes has information about the length of a string that is contained in the next sequence of bytes, a fuzzer can try to increase, decrease or set the length value to a negative number. The target software may not check one of these cases and pass the malformed input to the application logic, resulting in a potentially exploitable memory corruption [14].
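For instance, under an assumed packet layout with a two-byte big-endian length prefix, a fuzzer might emit the following boundary cases (the layout is hypothetical, chosen only to illustrate the length-field manipulation described above):

    # Sketch of length-field mutation: the two-byte length prefix of a string
    # field is set to boundary values that a weak input validator may mishandle.
    import struct

    def make_packet(payload: bytes, length: int) -> bytes:
        return struct.pack(">H", length & 0xFFFF) + payload

    payload = b"HELLO"
    cases = [len(payload),   # valid baseline
             0, 1,           # shorter than the actual string
             0x8000,         # becomes negative if parsed as a signed 16-bit value
             0xFFFF]         # far larger than any reasonable buffer
    fuzzed = [make_packet(payload, n) for n in cases]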
2.1 Fuzzing Techniques
There are two methods for creating fuzzed inputs: generation-based fuzzing and mutation fuzzing. To simplify the presentation, we focus on fuzzing packets sent to networked software. The techniques, however, apply generally to fuzz-testing (e.g., of files and file systems).

Generation-Based Fuzzing: This method constructs fuzzed inputs based on generation rules related to valid input structures and protocol states. The simplest generation-based fuzzers generate fuzzed inputs corresponding to random-length strings containing random bytes [15]. State-of-the-art generation-based fuzzers such as Sulley [3] and Peach [11] are typically block-based fuzzers. Block-based fuzzers require a complete description of the input structure in order to generate inputs, and often accept a protocol description as well. SPIKE [1] was the first block-based fuzzer to be distributed. Newer generation-based fuzzers such as EXE [7] instrument code to automatically generate test cases that have a high probability of success.

Mutation Fuzzing: This method modifies good inputs by inserting bad bytes and/or swapping bytes to create fuzzed inputs. Some modern mutation fuzzers base their fuzzing decisions on a description of the input layout (e.g., the mutation aspect of Peach [11]). Other mutation fuzzers such as the General Purpose Fuzzer (GPF) [22] do not require any knowledge of the input layout or protocol; they use simple heuristics to guess field boundaries and accordingly mutate the input. Kaminsky’s experimental CFG9000 fuzzer [13] occupies the middle ground by using an adaptation of the Sequitur algorithm [18] to derive an approximation (context-free grammar) of the generative model of a protocol from a sufficiently large traffic capture, and then uses the model to generate mutated inputs. Most mutation fuzzers use previously-recorded network traffic as the basis for mutation, although there are some inline fuzzers that read live traffic.

One of the most influential academic works on fuzzing is PROTOS [21], which analyzes a protocol, creates a model and generates fuzzing tests based on the model. A fuzzing test is typically deemed to be successful when it reveals a bug that harbors a vulnerability. However, in the case of critical infrastructure assets, a broader definition of success is appropriate – discovering a bug that creates any sort of disruption. This is important because any disruption – whether or not it is a security vulnerability – can severely impact the critical infrastructure.
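The mutation approach can be sketched in a few lines (a simplification of the GPF-style heuristics described above, not GPF's actual code):

    # Sketch of heuristic mutation fuzzing: guess field boundaries with common
    # string delimiters, then flip a byte in one randomly chosen field.
    import random
    import re

    def tokenize(packet: bytes) -> list:
        return re.split(rb"([ \n])", packet)      # keep delimiters as tokens

    def mutate(tokens: list) -> bytes:
        idx = random.randrange(len(tokens))
        t = bytearray(tokens[idx])
        if t:
            t[random.randrange(len(t))] ^= 0xFF   # corrupt one byte in one field
        return b"".join(tokens[:idx] + [bytes(t)] + tokens[idx + 1:])

    print(mutate(tokenize(b"GET /index.html HTTP/1.0\n")))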
2.2 Inline Fuzzing
In general, most block-based and mutation packet fuzzers work on servers, not clients. This is because these fuzzers are designed to generate packets and send them to a particular IP address and port. Since clients do not accept
traffic that they are not expecting, only fuzzers that operate on live traffic are capable of fuzzing clients. Similarly, protocols that operate in short or time-sensitive sessions are relatively immune to fuzzing that requires a large sample packet dump. For these reasons, inline fuzzing is typically required to fuzz clients. Fuzzers that are capable of inline fuzzing (e.g., QueMod [12]) either transmit random data or make random mutations. To our knowledge, LZFuzz, which is described in this paper, is the first inline fuzzer that goes beyond random strings and mutations.
2.3 Network-Based Fuzzing
Most modern fuzzers integrate with debuggers to instrument and monitor their targets for crashes. However, using a debugger requires intimate access to the target. Such access is unlikely to be available in the case of most SCADA systems used in the critical infrastructure. Inline fuzzers like LZFuzz must recognize when the target crashes or becomes unresponsive without direct instrumentation. With some targets, this recognition must trigger a way to (externally) reset the target; other targets may be restarted by hardware or software watchdogs. Note that generation-based fuzzers, which for various reasons cannot leverage target instrumentation, encounter similar challenges. For example, 802.11 Link Layer fuzzers that target kernel drivers [6] have had to work around their successes that caused kernel panics on targets. In either case, stopping and then restarting the fuzzing iteration over the input space is necessary so that fuzzing payloads are not wasted on an unresponsive target. It is also important for a fuzzer to adapt to its target, especially when dealing with proprietary protocols.
2.4 Fuzzing Proprietary Protocols
It is generally believed that if a fuzzer can understand and adapt to its target, it will be more successful than a fuzzer that does not. Therefore, it is important for a fuzzer to leverage all the available knowledge about the target. When no protocol specifications are available, an attempt can be made to reverse engineer the protocol manually or with the help of debuggers. In practice, this can be extremely time-consuming. Furthermore, it is not always possible to install debuggers on some equipment, which makes reverse engineering even more difficult. Consequently, it is important to build a fuzzing tool that can work efficiently on proprietary devices and software without any knowledge of the protocol it is fuzzing. Although a mutation fuzzer does not require knowledge of the protocol, it is useful to build a more efficient mutation fuzzer by incorporating field parsing (and other) heuristics that would enable it to respond to protocol state changes on the fly without protocol knowledge. Because instrumenting a target is difficult or impossible in a SCADA environment, the only option is
to employ inline adaptive fuzzing. We refer to this approach as live adaptive mutation fuzzing.
2.5 Fuzzing in Industrial Settings
Proprietary protocols used by SCADA equipment, such as Harris-5000 and Conitel-2020, are often not well-understood. Understandably, domain experts neither have the time nor the skills to reverse engineer the protocols. Fuzzing experts can perform this task, but infrastructure asset owners and operators may be reluctant to grant access to outsiders. In our own experience with power industry partners, it was extremely difficult to gain approval to work with their equipment. Moreover, asset owners and operators are understandably disinclined to share information about proprietary protocols and equipment, making it difficult for outside security experts to perform tests. Clearly, critical infrastructure asset owners and operators would benefit from an effective fuzzing tool that they could use on their own equipment. Our work with LZFuzz seeks to make this possible.
2.6 Modern Fuzzers
This section briefly describes examples of advanced fuzzers that are popular in the fuzz-testing community. Also, it highlights some tools that are available for fuzzing SCADA protocols. General Network-Based Fuzzing Tools: Sulley [3] is a block-based generation fuzzing tool for network protocols. It provides mechanisms for tracking the fuzzing process and performing post mortem analysis. It does so by running code that monitors network traffic and the status of the target (via a debugger). Sulley requires a description of the block layout of a packet in order to generate fuzzed inputs. It also requires a protocol description, which it uses to iterate through different protocol states during the fuzzing process. The General Purpose Fuzzer (GPF) [22] is a popular network protocol mutation fuzzer that requires little to no knowledge of a protocol. Although GPF is no longer being maintained, it is one of the few open source modern mutation fuzzers that is commonly available. GPF reads network captures and heuristically parses packets into tokens. Its heuristics can be extended to improve the accuracy with which it handles a protocol, but, by default, GPF attempts to tokenize packets using common string delimiters such as “ ” and “\n.” GPF also provides an interface to load user defined functions that perform operations on packets post-fuzzing. Peach is a general fuzzing platform that performs mutation and blockbased generation fuzzing [11]. Like Sulley, it requires a description of the fields and protocol. When performing mutation fuzzing, Peach reads in network captures and uses the field descriptions to parse and analyze packets for fuzzing as well as to adjust packet checksums before trans-
mitting the fuzzed packets. Like Sulley, Peach also uses debuggers and monitors to determine success and facilitate post mortem analysis.

SCADA Fuzzing Tools: Some tools are available for fuzzing non-proprietary SCADA protocols. In 2007, ICCP, Modbus and DNP3 fuzzing modules were released for Sulley by Devarajan [9]. SecuriTeam includes DNP3 support with its beSTORM fuzzer [4]. Digital Bond created ICCPSic [10], a commercial suite of ICCP testing tools (unfortunately, this suite is no longer publicly available). Also, Mu Dynamics offers the Mu Test Suite [17], which supports modules for fuzzing SCADA protocols such as IEC61850, Modbus and DNP3.
3. LZFuzz Tool
LZFuzz employs a simple tokenizing technique adapted from the Lempel-Ziv compression algorithm [23] to estimate the recurring structural units of packets; interested readers are referred to [5] for an analysis of the accuracy of the tokenizing method. Effective inputs for fuzzing can be generated by combining this simple tokenizing technique with a mutation fuzzer. The need to understand and model protocol behavior can be avoided by adapting to and mutating live traffic. In our experience, SCADA protocols used in power control systems perform elaborate initial handshakes and send continuous keep-alive messages. If a target process crashes, it will often automatically restart itself and initiate a new handshake. This behavior is unusual for other non-SCADA classes of targets, which need to be specifically instrumented to ensure liveliness and be restarted remotely. Such restarting/session-renegotiation behavior assists the construction of successful fuzz sessions. Based on this observation, we propose the novel approach of “adaptive live mutation fuzzing.” The resulting fuzzer can adapt its fuzzing method based on the traffic it receives, automatically backing off when it thinks it is successful.
3.1 Design
The LZFuzz tool is inserted into a live stream of traffic, capturing packets sent to and from a source and target. A packet read into LZFuzz gets processed in several steps before it is sent to the target (Figure 1). When LZFuzz receives traffic destined for the target, it first tags the traffic with its type. Then, it applies a set of rules to see if it can declare success. Next, it looks up the LZ string table for the traffic type it is processing, updates the table and parses the packet accordingly. Next, it sends one or more tokens to a mutation fuzzer. Finally, it reassembles the packet, fixing any fields as needed in the packet finishing module. As LZFuzz receives traffic destined for the source, it checks for success and fixes any fields as required before sending the packet to the source.
Figure 1. LZFuzz packet processing.
Intercepting Packets. Although it may be possible to configure the source and target to communicate directly with the machine running LZFuzz, it may not always be practical to do so. Consequently, LZFuzz uses a technique known as ARP spoofing or ARP poisoning to transparently insert itself between two communicating parties. This method works when the systems are communicating over Ethernet and IP and at least one of them is on the same LAN switch as the machine running LZFuzz. (In the case of only one target host being local and the remote host located beyond the local LAN, the LAN’s gateway must be “poisoned.”) The ability to perform ARP spoofing means that fuzzing can be performed without the need to make any direct changes to the source or target configurations. LZFuzz uses the arp-sk tool [20] to perform ARP spoofing. Note that, although various Link Layer security measures exist against ARP poisoning and similar LAN-based attacks can be deployed either at the LAN switches or on the respective hosts or gateways (see, e.g., [8]), such measures are not typically used in control networks, because of the configuration overhead. This overhead can be especially costly in emergency scenarios where faulty or compromised equipment must be quickly replaced, because it is desirable in such situations that the replacement work “out of the box.”
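The insertion step can be sketched with Scapy (LZFuzz itself uses arp-sk; the addresses and interface below are placeholders):

    # Sketch of ARP poisoning: both parties are told that the fuzzing host's
    # MAC address owns the other party's IP, so their traffic transits the
    # fuzzer. Poisoning must be refreshed periodically in practice.
    from scapy.all import ARP, Ether, sendp

    def poison(victim_ip, victim_mac, impersonated_ip, our_mac, iface):
        pkt = Ether(src=our_mac, dst=victim_mac) / ARP(
            op=2,                   # unsolicited "is-at" reply
            psrc=impersonated_ip,   # claim this IP ...
            hwsrc=our_mac,          # ... maps to our MAC
            pdst=victim_ip,
            hwdst=victim_mac)
        sendp(pkt, iface=iface, verbose=False)

    # poison("10.0.0.2", "aa:bb:cc:dd:ee:02", "10.0.0.1", "aa:bb:cc:dd:ee:99", "eth0")
    # poison("10.0.0.1", "aa:bb:cc:dd:ee:01", "10.0.0.2", "aa:bb:cc:dd:ee:99", "eth0")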
Estimating Packet Structure. As LZFuzz reads in valid packets, it builds a string table as if it were performing LZ compression [23]. The LZ table keeps track of the longest unique subsequences of bytes found in the stream of packets. LZFuzz updates its LZ tables for each packet it reads. A packet is then tokenized based on strings found in the LZ table, and each token is treated as if it were a field in the packet. One or more tokens are then passed to GPF, which guesses the token types and mutates the tokens. The number of tokens
passed to GPF is dependent on whether or not the windowing mode is enabled. When the mode is enabled, LZFuzz fuzzes one token at a time, periodically changing the token it fuzzes (LZFuzz may fuzz multiple tokens at a time in the windowing mode to ensure that there are enough bytes available to mutate effectively). When the windowing mode is disabled, all the tokens are passed to GPF. Figure 2 provides a high-level view of the tokenizing process.

Figure 2. Tokenizing and mutating packets (adapted from [5]).
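A simplified sketch of the table-building and tokenizing steps follows (an LZW-style approximation of the approach described above; the sample packets are invented):

    # Sketch of LZ-based tokenizing: an LZW-style string table is grown over
    # the stream of valid packets, and each new packet is then split into the
    # longest strings already present in the table. The resulting tokens
    # approximate field boundaries and are handed to the mutator.
    def update_table(table: set, packet: bytes) -> None:
        prefix = b""
        for ch in packet:
            candidate = prefix + bytes([ch])
            if candidate in table:
                prefix = candidate           # keep extending a known string
            else:
                table.add(candidate)         # record a new longest-unique string
                prefix = bytes([ch])

    def tokenize(table: set, packet: bytes) -> list:
        tokens, prefix = [], b""
        for ch in packet:
            candidate = prefix + bytes([ch])
            if candidate in table:
                prefix = candidate
            else:
                tokens.append(prefix)        # longest match becomes one token
                prefix = bytes([ch])
        if prefix:
            tokens.append(prefix)
        return tokens

    table = set(bytes([b]) for b in range(256))   # seed with single bytes
    for pkt in [b"SET A=1\n", b"SET B=2\n", b"GET A\n"]:
        update_table(table, pkt)
    print(tokenize(table, b"SET C=3\n"))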
"
Figure 3.
!
Comparison of live inline mutation with traditional mutation.
Responding to Protocol State Changes. Unlike traditional mutation fuzzers, LZFuzz’s mutation fuzzer performs live mutation fuzzing. This means that, instead of mutating previously recorded packets, packets are mutated while they are in transit from the source to the target. Figure 3 shows how live mutation differs from traditional mutation. In particular, live inline mutation enables the fuzzing of short or time-sensitive sessions on real systems in both directions. Traditional mutation fuzzers mutate uncorrupted input
from a network dump whereas LZFuzz mutates packets freshly from a source as it communicates with the target.
Recognizing Target Crashes. Modern network protocol fuzzers tend to require the attachment of a debugger to the target to determine when crashes occur. However, such access is typically not available in SCADA environments. Since live communications are captured as they enter and leave the fuzzing target, our novel approach can make fuzzing decisions based on the types of messages (or lack thereof) sent by the target or source. SCADA protocols tend to have continuous liveliness checks. If a piece of equipment revives itself after being perceived as dead, an elaborate handshake is typically performed as it reintroduces itself. LZFuzz possesses the ability to recognize such behavior throughout a fuzzing session. Even if a protocol does not have these keep-alive/handshake properties, other methods can be used to deduce success from network traffic. If a protocol is running over TCP, the occurrence of an RST flag may signify that the target process has crashed. This flag is set when a host receives traffic when it has no socket listening for the traffic. Our experience with LZFuzz has shown that TCP RST flags are a reasonable success metric although they produce some false positives.
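The RST heuristic can be sketched with Scapy (illustrative only; LZFuzz's own monitoring is built into its packet path, and the target address below is a placeholder):

    # Watch traffic from the target and treat a TCP RST as a possible crash of
    # the probed process. As noted above, this heuristic yields some false
    # positives.
    from scapy.all import sniff, IP, TCP

    TARGET_IP = "10.0.0.2"   # placeholder

    def on_packet(pkt):
        if pkt.haslayer(IP) and pkt.haslayer(TCP) and int(pkt[TCP].flags) & 0x04:
            print("possible target crash: RST from", pkt[IP].src)

    # sniff(filter="src host %s and tcp" % TARGET_IP, prn=on_packet, store=False)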
Mutation. LZFuzz can work with a variety of fuzzers to mangle the input it fetches. Also, LZFuzz can be easily modified to wrap itself around new mutation fuzzers. Currently, LZFuzz passes packet tokens to the GPF mutation fuzzer for fuzzing before it reassembles the packet and fixes any fields such as checksums.
3.2 Extending LZFuzz
LZFuzz provides an API that allows users to encode knowledge of the protocol being fuzzed. The API can be used to tag different types of packets using regular expressions; a separate LZ string table is automatically generated for each packet type. The API also allows users to provide information on how to fix packets before they are sent, so that length and checksum fields can be set appropriately. Finally, the API allows users to custom-define success conditions. For example, if a user knows that the source will attempt a handshake with the target when the target dies, then the user can use the API to tag the handshake packets separately from the data and control packets and to instruct LZFuzz to presume success upon receiving the handshake packets.
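The published description does not give the API's actual identifiers, so the following sketch merely illustrates the three kinds of hooks with invented names (tag_packet, fix_packet, success_condition), made-up packet markers and a toy checksum.

```python
# Hypothetical illustration of the extension hooks described above; none
# of these names are LZFuzz's real identifiers.
import re
import struct

PROTOCOL_TAGS = {
    "handshake": re.compile(rb"^\x01\x10"),   # invented handshake marker
    "data":      re.compile(rb"^\x02"),
}

def tag_packet(payload):
    """Route each packet type to its own LZ string table."""
    for tag, pattern in PROTOCOL_TAGS.items():
        if pattern.match(payload):
            return tag
    return "default"

def fix_packet(payload):
    """Repair trailing checksum field after mutation (toy 16-bit sum)."""
    body = payload[:-2]
    checksum = sum(body) & 0xFFFF
    return body + struct.pack(">H", checksum)

def success_condition(packet_tag):
    """Presume the target died if the source re-initiates its handshake."""
    return packet_tag == "handshake"
```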
4. Experimental Results
An earlier version of LZFuzz was tested on several non-SCADA network protocols, including the iTunes music sharing protocol (DAAP). LZFuzz was able to consistently hang the iTunes version 2.6 client by fuzzing DAAP. It
was also able to crash an older version of the Gaim client by intercepting and fuzzing AOL Instant Messenger traffic. We selected these protocols because we wanted to test the fuzzer on examples of relatively complex (and popular) client-server protocols that are used for frequent, recurring transactions with an authentication phase separate from the normal data communication phase. Also, we sought protocols that supported some notion of timed and timed-out sessions. Of course, it was desirable that the target software be widely used so that most of the easy-to-find bugs would presumably have been fixed. More importantly, however, LZFuzz was able to consistently crash SCADA equipment used by an electric power company.

Beyond listing successes, it is not obvious how the effectiveness of a fuzzer can be quantitatively evaluated or compared. In practice, a fuzzer is useful if it can crash targets in a reasonable amount of time. But how does one encode such a goal in a metric that can be evaluated? The best method would be to test the ability of a fuzzer to trigger all the bugs in a target. However, such a metric is flawed because it requires a priori knowledge of all the bugs that exist in the target. A more reasonable metric is code coverage – the portion of code in a target that is executed in response to fuzzed inputs. This metric also has its flaws, but it is something that can be measured (given access to the source code of the target), and it still provides insight into the ability of the fuzzer to reach hidden vulnerabilities. Indeed, in 2007, Miller and Peterson [16] used code coverage to compare generational fuzzing against mutation fuzzing. Also, the usefulness of coverage instrumentation has long been recognized by the reverse engineering and exploit development communities. For example, Amini’s PaiMei fuzzing and reverse engineering framework [2] provides the means to evaluate the code coverage of a process down to basic block granularity; the framework also includes tools for visualizing coverage.

Unfortunately, the code coverage metric glosses over differences between a fuzzer constrained to canned packet traces and one that can operate in a live inline mode. Nevertheless, to provide a means for comparing LZFuzz with other methods of fuzzing proprietary protocols, we set up experiments to compare the code coverage of LZFuzz, GPF and random mutation fuzzing (with random strings of random lengths).
4.1 Experimental Setup
We tested GPF, LZFuzz, random mutation fuzzing and no fuzzing on two targets: mt-daapd and the Net-SNMP snmpd server. We chose these two targets because mt-daapd is an open source server that uses a (reverse engineered) proprietary protocol and Net-SNMP uses the open SNMP protocol used in SCADA systems. The experiments were conducted on a machine with a 3.2 GHz i5 dual-core processor and 8 GB RAM running Linux kernel 2.6.35-23. Each fuzzing session was run separately and sequentially. The code coverage of the target was measured using gcov. The targets were executed within a monitoring environment that would immediately restart the target when a crash was detected (to simulate the automatic reset of common power SCADA applications). Eight separate tests were conducted on each target; specifically, the fuzzer was run for 1, 2, 4, 8, 16, 32, 64 and 128 minutes. After each test run, the code coverage was computed before resetting the code coverage count for the subsequent run. No fuzzer was provided any protocol information beyond the IP address of the target, the transport layer protocol and the port used by the target. Because GPF uses a network capture as a mutation source, it was supplied with a packet capture of about 1,600 packets as produced by the source/target setup when no fuzzer was active.

Figure 4. Code coverage for mt-daapd (left) and Net-SNMP (right).
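The restart-and-measure loop can be approximated as follows; the target command, source file name and run length are illustrative rather than the actual scripts used in the experiments.

```python
# Sketch of the monitoring environment: restart the target as soon as it
# dies and read line coverage with gcov after the session ends.
import subprocess
import time

def run_session(target_cmd, minutes):
    deadline = time.time() + 60 * minutes
    proc = subprocess.Popen(target_cmd)
    while time.time() < deadline:
        if proc.poll() is not None:          # target crashed or exited
            print("target exited -- restarting immediately")
            proc = subprocess.Popen(target_cmd)
        time.sleep(0.1)
    proc.terminate()

def coverage_percent(source_file):
    out = subprocess.run(["gcov", source_file],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():            # "Lines executed:42.13% of 1543"
        if line.startswith("Lines executed:"):
            return float(line.split(":")[1].split("%")[0])

run_session(["./snmpd", "-f"], minutes=1)    # -f: run in the foreground
print(coverage_percent("snmpd.c"))
```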
4.2 Fuzzing mt-daapd
mt-daapd is an open source music server that uses the proprietary iTunes DAAP protocol to stream music. This protocol was reverse engineered by several developers who intended to build open source daapd servers and clients. We chose mt-daapd because we wanted to test a proprietary protocol but required source code in order to calculate code coverage. The tests used mt-daapd version 0.2.4.2. The mt-daapd daemon was run on the same machine as the client and fuzzer. The server was configured to prevent stray users from connecting to it. The Banshee media player was used as the client and traffic source. To maintain consistency between tests, a set of xmacro scripts was used to control Banshee and cause it to send requests to the daapd server.

In general, we discovered that, with respect to code coverage, LZFuzz does as well as or better than running the test environment without any fuzzer (left-hand side of Figure 4). Furthermore, we found that LZFuzz triggered the largest amount of code in the target compared with the other fuzzers we tested. This means that LZFuzz was able to reach branches of code that none of the
other fuzzers reached. Interestingly, the random fuzzer consistently produced the same low code coverage on every test run regardless of the length of the run. Other than LZFuzz, no fuzzer achieved higher code coverage than that of a non-fuzzed run of Banshee and mt-daapd.
4.3 Fuzzing snmpd
Net-SNMP is one of the few open source projects that use SCADA protocols. Our experiments used snmpd, the daemon that responds to SNMP requests in Net-SNMP version 5.6.1, as a fuzzing target. Like mt-daapd, the daemon was run on the same system as the client. We scripted snmpwalk, provided by Net-SNMP, to continuously send queries to the server. For the purpose of code coverage testing, snmpwalk was used to query the status of several parameters, including the system uptime and location, and information about open TCP connections on the system. Because we were unable to obtain consistent code coverage measurements between runs of the same fuzzer and run length, we ran each fuzzer and run length combination five times. The averages are displayed in Figure 4 (right-hand side) along with error bars for runs with noticeable variation (standard deviation greater than 0.025%).

GPF outperformed LZFuzz when GPF was running at full strength. However, we were also interested in the relative performance of LZFuzz and GPF when GPF had a rate-adjusted flow, so that GPF would send about the same number of packets as LZFuzz for a given run length. This adjustment provided insight into how much influence a GPF-mutated packet had on the target compared with an LZFuzz-mutated packet. We also observed that LZFuzz induced greater code coverage in snmpd when the mutation rate that controls the aggressiveness of the mutation engine was set to medium (instead of high or low). The mutation rate governs how much the GPF mutation engine mutates a packet. Although this feature is not documented, the mutation rate must be explicitly set during a fuzzing session. Line 143 of the GPF source file misc.c offers the options “high,” “med” or “low” without any documentation; we chose the “med” option for snmpd and “high” for mt-daapd. Because GPF uses the same mutation engine, we set GPF to run at the medium mutation level as well. Note that, in the case of snmpd, a 1% difference in code coverage corresponds to about 65 lines of code.

Figure 4 (right-hand side) shows the code coverage of GPF (with a rate-adjusted flow and a medium mutation rate) compared with the code coverage of LZFuzz (with medium mutation), random fuzzing and no fuzzing. With a rate-adjusted flow, LZFuzz induces higher code coverage than GPF. LZFuzz also clearly outperforms random fuzzing. Although LZFuzz and GPF share a common heuristic mutation engine, they belong to different classes of fuzzers and each has its own strengths and weaknesses. LZFuzz can fuzz both servers and clients; GPF can only fuzz targets that are listening for incoming traffic on a port known to GPF before the fuzzing session. LZFuzz is restricted to fuzzing only the packets sent by the source; GPF can send many packets in rapid succession.
GPF requires the user to prepare a representative packet capture and, thus, implicitly assumes that representative captures exist for the target scenario. Note that the time taken to prepare the network capture was not considered in our results. The packet capture given to GPF potentially provided it with a wealth of information about the protocol from the very beginning. On the other hand, LZFuzz had to develop most of its knowledge about the protocol on the fly. Also, the mutation engine of GPF was built and tuned specifically for what GPF does – fuzzing streams of packets. LZFuzz uses the same mutation engine, but only had one packet in each stream. Although the GPF mutation engine was not designed to be used in this manner, we believe that the effectiveness of LZFuzz could be improved if the mutation engine were tuned accordingly. When GPF and LZFuzz were used at full strength against mt-daapd, LZFuzz outperformed GPF in terms of code coverage. However, this was not the case when both fuzzers were tested against snmpd – GPF achieved 1–2% more code coverage than LZFuzz in comparable runs. It could be argued that GPF is the more effective fuzzer for snmpd. However, the clear advantage of LZFuzz over GPF and other similar fuzzers is that it can also fuzz SNMP clients (e.g., snmpwalk), whereas GPF cannot do this without session-tracking modifications.

Figure 5. Code coverage for the Net-SNMP server with and without tokenizing.
4.4 LZFuzz Tokenizing
The final issue to examine is whether or not the LZFuzz tokenizing method improves the overall effectiveness of the tool. If tokenizing is disabled in LZFuzz during a run and the entire payload is passed directly to GPF, then GPF attempts to apply its own heuristics to parse the packet. Figure 5 shows how LZFuzz with tokenizing compares with LZFuzz without tokenizing when run against snmpd in the same experimental environment as described above. These results suggest that the LZ tokenizing does indeed improve the effectiveness of inline fuzzing with the GPF mutation engine.
5. Conclusions
The LZFuzz tool enables control systems personnel with limited fuzzing expertise to effectively fuzz proprietary protocol implementations, including the SCADA protocols used in the electric power grid. LZFuzz’s adaptive live mutation fuzzing approach can fuzz the proprietary DAAP protocol more efficiently than other methods. LZFuzz is also more effective than applying a random fuzzer to an SNMP server. The GPF mutation fuzzer appears to be more effective at fuzzing an SNMP server than LZFuzz; however, unlike LZFuzz, GPF is unable to fuzz SNMP clients.

Additional work remains to be done on LZFuzz to ensure its wider application in the critical infrastructure. The user interface must be refined so that the aggressiveness of fuzzing can be changed, or fuzzing temporarily disabled, without having to restart LZFuzz. Another refinement is to identify checksums by intercepting traffic to the target and passively searching for bytes that appear to have high entropy. Also, the tool could be augmented to test for authentication and connection setup traffic by inspecting traffic at the beginning of a run and traffic from the target after blocking replies from the client, and vice versa. This information can be used to specify traffic rules that would make LZFuzz more effective.

Note that the views and opinions in this paper are those of the authors and do not necessarily reflect those of the United States Government or any agency thereof.
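As a rough illustration of the checksum-identification refinement proposed above, per-offset byte entropy over a window of captured packets could be computed as follows. This is a sketch under the assumption that checksum bytes look near-uniform across packets while protocol constants do not; the packet list and threshold are placeholders.

```python
# Sketch: flag packet offsets whose byte values have high Shannon entropy
# across many captured packets -- candidate checksum locations.
import math
from collections import Counter

def offset_entropy(packets, offset):
    """Shannon entropy (bits) of the byte value at a fixed offset."""
    counts = Counter(p[offset] for p in packets if len(p) > offset)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def likely_checksum_offsets(packets, threshold=6.0):
    max_len = max(len(p) for p in packets)
    return [off for off in range(max_len)
            if offset_entropy(packets, off) >= threshold]
```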
Acknowledgements

This research was supported by the Department of Energy under Award No. DE-OE0000097. The authors wish to thank Axel Hansen and Anna Shubina for their assistance in developing the initial prototype of LZFuzz. The authors also wish to thank the power industry personnel who supported the testing of LZFuzz in an isolated environment at their facility.
References

[1] D. Aitel, An introduction to SPIKE, the fuzzer creation kit, presented at the BlackHat USA Conference (www.blackhat.com/presentations/bh-usa-02/bh-us-02-aitel-spike.ppt), 2002.
[2] P. Amini, PaiMei and the five finger exploding palm RE techniques, presented at REcon (www.recon.cx/en/s/pamini.html), 2006.
[3] P. Amini, Sulley: Pure Python fully automated and unattended fuzzing framework (code.google.com/p/sulley), 2010.
[4] Beyond Security, Black box software testing, McLean, Virginia (www.beyondsecurity.com/black-box-testing.html).
[5] S. Bratus, A. Hansen and A. Shubina, LZFuzz: A Fast Compression-Based Fuzzer for Poorly Documented Protocols, Technical Report TR2008-634, Department of Computer Science, Dartmouth College, Hanover, New Hampshire (www.cs.dartmouth.edu/reports/TR2008-634.pdf), 2008.
[6] J. Cache, H. Moore and M. Miller, Exploiting 802.11 wireless driver vulnerabilities on Windows, Uninformed, vol. 6 (uninformed.org/index.cgi?v=6), January 2007.
[7] C. Cadar, V. Ganesh, P. Pawlowski, D. Dill and D. Engler, EXE: Automatically generating inputs of death, ACM Transactions on Information and System Security, vol. 12(2), pp. 10:1–38, 2008.
[8] S. Convery, Hacking Layer 2: Fun with Ethernet switches, presented at the BlackHat USA Conference (www.blackhat.com/presentations/bh-usa-02/bh-us-02-convery-switches.pdf), 2002.
[9] G. Devarajan, Unraveling SCADA protocols: Using Sulley fuzzer, presented at the DefCon 15 Hacking Conference, 2007.
[10] Digital Bond, ICCPSic assessment tool set released, Sunrise, Florida (www.digitalbond.com/2007/08/28/iccpsic-assessment-tool-set-released), 2007.
[11] M. Eddington, Peach Fuzzing Platform (peachfuzzer.com), 2010.
[12] GitHub, QueMod, San Francisco (github.com/struct/QueMod), 2010.
[13] D. Kaminsky, Black ops: Pattern recognition, presented at the BlackHat USA Conference (www.slideshare.net/dakami/dmk-blackops2006), 2006.
[14] H. Meer, Memory corruption attacks: The (almost) complete history, presented at the BlackHat USA Conference (media.blackhat.com/bh-us-10/whitepapers/Meer/BlackHat-USA-2010-Meer-History-of-Memory-Corruption-Attacks-wp.pdf), 2010.
[15] B. Miller, L. Fredriksen and B. So, An empirical study of the reliability of UNIX utilities, Communications of the ACM, vol. 33(12), pp. 32–44, 1990.
[16] C. Miller and Z. Peterson, Analysis of Mutation and Generation-Based Fuzzing, White Paper, Independent Security Evaluators, Baltimore, Maryland (securityevaluators.com/files/papers/analysisfuzzing.pdf), 2007.
[17] Mu Dynamics, Mu Test Suite, Sunnyvale, California (www.mudynamics.com/products/mu-test-suite.html).
[18] C. Nevill-Manning and I. Witten, Identifying hierarchical structure in sequences: A linear-time algorithm, Journal of Artificial Intelligence Research, vol. 7, pp. 67–82, 1997.
[19] T. Proell, Fuzzing proprietary protocols: A practical approach, presented at the Security Education Conference Toronto (www.sector.ca/presentations10/ThomasProell.pdf), 2010.
[20] F. Raynal, E. Detoisien and C. Blancher, arp-sk: A Swiss knife tool for ARP (sid.rstack.org/arp-sk), 2004.
[21] J. Roning, M. Laakso, A. Takanen and R. Kaksonen, PROTOS: Systematic approach to eliminate software vulnerabilities, presented at Microsoft Research, Seattle, Washington (www.ee.oulu.fi/research/ouspg/PROTOS_MSR2002-protos), 2002.
[22] VDA Labs, General Purpose Fuzzer, Rockford, Michigan (www.vdalabs.com/tools/efs_gpf.html), 2007.
[23] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, vol. 23(3), pp. 337–343, 1977.
Chapter 6

SECURITY ANALYSIS OF VPN CONFIGURATIONS IN INDUSTRIAL CONTROL ENVIRONMENTS

Sanaz Rahimi and Mehdi Zargham

Abstract
Virtual private networks (VPNs) are widely recommended to protect otherwise insecure industrial control protocols. VPNs provide confidentiality, integrity and availability, and are often considered to be secure. However, implementation vulnerabilities and protocol flaws expose VPN weaknesses in many deployments. This paper uses a probabilistic model to evaluate and quantify the security of VPN configurations. Simulations of the VPN model are conducted to investigate the trade-offs and parameter dependence in various VPN configurations. The experimental results provide recommendations for securing VPN deployments in industrial control environments.
Keywords: Control systems, virtual private networks, security analysis
1. Introduction
Virtual private networks (VPNs) are widely used to provide secure communications over insecure public networks. VPNs provide security services such as confidentiality, integrity and availability by creating encrypted tunnels between the communicating parties. VPNs are recommended in the literature and by many critical infrastructure protection standards to secure process control, SCADA and automation protocol communications [14–16, 21, 22]. Although these protocols are generally very reliable, they were not designed to resist malicious attacks. As a result, it is recommended to wrap industrial protocols such as DNP3 [18], 61850 [13] and Modbus [19] within VPN tunnels to protect them from unauthorized access. These configurations supposedly offer confidentiality, integrity and availability [22], but little work has focused on the secure configuration of VPN tunnels and the maintenance required for their secure operation [22].
VPNs are attractive targets for attackers. Because VPNs carry sensitive information over public networks, successfully breaking into a VPN tunnel enables an attacker to alter sensitive data and commands without physical access to the industrial facility. If other protection mechanisms such as strong access control are not deployed properly, the attacker can gain access to the internal SCADA systems through a VPN tunnel. Also, as industrial systems implement more security mechanisms, VPNs can become the weakest link in the security chain. VPNs have several vulnerabilities. According to Hills [12], most VPN implementations suffer from serious security flaws that can be easily exploited by attackers to fabricate, intercept, modify and interrupt traffic. Some of these vulnerabilities are implementation specific; they are the result of flaws in a specific protocol implementation due to bad coding, incomplete implementation or poor implementation choices for conditions that are not specified in the standard. However, vulnerabilities in the underlying protocols cannot be addressed by good implementation [12]. Finally, as recent incidents have shown, sophisticated malware [6] can stealthily modify the configurations of a control system (including a VPN) and seriously impact its operation. The solutions to these security problems are proper configuration, continual configuration validation and regular maintenance, all of which are effective only if system administrators fully understand the internal details of the protocols. This paper models VPNs using stochastic activity networks [25], which are an extended form of Petri nets [3], and analyzes the probability of a successful breach against various VPN configurations. VPN security is quantified for different choices of parameters such as key length, mode of operation, number of users and maintenance frequency. The results provide recommendations for securely deploying VPNs in industrial control environments.
2. VPNs and VPN Vulnerabilities
VPNs are categorized with respect to their layer, i.e., transport layer (SSL), network layer (IPSec) and link layer (L2TP). This paper focuses on IPSec VPNs. As an underlying VPN protocol, IPSec [23] provides confidentiality and integrity services in the network layer (i.e., on a per packet basis) using two main sub-protocols (AH and ESP) in two different modes (transport and tunnel). The detailed description of IPSec is beyond the scope of this paper. We only describe the features that are relevant to our discussion. Interested readers are referred to [23] for additional information about IPSec.
2.1 IPSec
IPSec provides security services via the Authentication Header (AH) and Encapsulation Security Payload (ESP) protocols. AH provides integrity by adding the HMAC [5] of the entire packet (payload and full IP header); however, it does not provide confidentiality because it leaves the packet in plaintext. ESP encrypts the packet payload and some fields of the IP header, and adds the
ESP header and trailer to the IP packet, providing confidentiality and limited integrity.

IPSec can operate in the transport or tunnel modes. The transport mode is used when end-to-end security is desired, and both end nodes support IPSec. In the transport mode, the original IP header is preserved for routing purposes and the AH/ESP headers are added under the IP header. The tunnel mode is used when the end machines do not support IPSec or when the identities of the communicating entities have to stay hidden. In the tunnel mode, the entire IP packet is encrypted and a new IP header is added to the packet. The gateway on the border of each organization provides the security services by adding and removing IPSec headers.

Security Association (SA) is the concept used in IPSec for connection management between two communicating entities. An SA comprises a secure communication channel and its parameters, including the encryption algorithms, keys and lifetimes. Each SA is unidirectional and can provide one security service (AH or ESP). Two SAs are required for bidirectional communications.

IPSec uses the Internet Key Exchange (IKE) protocol to manage and exchange encryption keys and algorithms. IKE is a hybrid of three sub-protocols: Internet Security Association and Key Management Protocol (ISAKMP), Versatile Secure Key Exchange Mechanism for Internet (SKEME) and Oakley. ISAKMP provides the framework for authentication and SA management, but does not define the specific algorithms and keys. IKE uses the SKEME and Oakley protocols for key exchange and agreement with the acceptable cryptographic algorithms. Because IKE is commonly used to establish VPN channels, many VPN vulnerabilities are in one way or another related to it. For a better understanding of these vulnerabilities, we provide an overview of IKE and its modes of operation.

IKE has three important modes: main mode, aggressive mode and quick mode. The main mode is used for authentication and key exchange in the initial phase of IKE. This phase assumes that no SA is present and that the two parties wish to establish SAs for the first time. It involves three pairs of messages. The first pair negotiates the security policy and encryption algorithms to be used. The second pair establishes the keys using the Diffie-Hellman key exchange protocol. The third pair authenticates peers using signatures or certificates. Note that the identities of the peers in the main mode are often their IP addresses.

The aggressive mode is also used for the initial key exchange, but it is faster and more compact than the main mode. This mode involves a total of three messages that contain the main mode parameters, but in a more compact form. Key and policy exchange are performed by the first two messages while the third message authenticates the initiator to the responder. Note that the identity of the responder (sent in the second message) is not protected, which is a vulnerability in the aggressive mode of operation.
The quick mode is used for the negotiations in the secondary phase of IKE. This mode assumes that the peers have already established SAs and that the exchange can update the parameters or renew the keys. The quick mode messages are analogous to those in the aggressive mode, but the payloads are encrypted. If the quick mode operates with the perfect-forward-secrecy option, the shared secrets are renewed with a fresh Diffie-Hellman exchange.

IKE authenticates peers using a pre-shared key (PSK), public key encryption or digital signature. In the PSK method, which corresponds to the traditional username/password authentication, the peers share a secret through a back channel and exchange the hash value of the secret for authentication. Unfortunately, although this method has known vulnerabilities, it is the only mandatory authentication method according to the RFC [11]. Public key encryption is another method of authentication in which the peers digitally sign their identities; however, the keys are required to be provided beforehand by some other means. Digital certificates may also be used for authentication in IKE; in this mode, the peers exchange certificates to mutually authenticate each other.
2.2 VPN Vulnerabilities
VPNs have several vulnerabilities. The common username enumeration vulnerability refers to an implementation flaw in which the username/password authentication mechanism responds differently to invalid usernames and passwords. By exploiting this vulnerability, an attacker can identify valid usernames that can be used later in password discovery. When IKE is used in the aggressive mode with a pre-shared key (PSK), the client sends an IKE packet to the server containing, among other things, the identity (username) of the client. The server then responds with another packet that contains the hash of the identity and the PSK (password). When an incorrect username is received, many VPN implementations send an error message, send a packet with a NULL PSK or do not respond at all. This enables an attacker to infer whether or not a username exists by sending a single packet (i.e., to enumerate usernames). Upon discovering the first username, the attacker can generate likely usernames with the same pattern (e.g., first letter of the first name concatenated with the last name). When a VPN is used in the main mode, the identity is an IP address, not a username.

Hills [12] proposes that a secure VPN implementation return the hash of a randomly-generated password each time it receives an invalid username. However, this does not solve the problem because an attacker can still send two different packets with the same username; if two different hashes are received, then the attacker knows that the username does not exist, and vice versa. Furthermore, the attacker can separate these two probes with packets for other usernames in order to flush any buffer that the server may employ to track such an attack. The solution is for the server to encrypt the username with a secret key (generated and kept on the server only for this purpose) and to return the hash
of this value. The server then always responds to a given username with the same hash value, which foils the attack.

When the attacker discovers a valid username, he/she can obtain the hash of the username’s password from the server (using PSK in the aggressive mode). The attacker can then apply an offline algorithm to crack the hash value and obtain the password. Offline cracking can be very fast because the hashing scheme used for VPN passwords is not hidden. This poses a serious threat to short passwords. The vulnerability exists even if IKE operates in the main mode with PSK; it can be exploited via a man-in-the-middle attack (e.g., using DNS spoofing [20]) to gain the Diffie-Hellman shared secrets. The only difference is that, in the main mode, the identity of each peer is its IP address.

When a username/password pair is successfully found, the first phase of IKE is breached. If the VPN configuration does not require extra authentication, the breach is enough to set up a VPN channel with the server. In some other cases, the configuration requires an extra XAUTH step to complete the second phase of IKE, but this phase is vulnerable to a man-in-the-middle attack as mentioned in the standard [24]. The reason for the vulnerability is that XAUTH must be used in conjunction with the first phase of IKE; if this phase is not performed properly, XAUTH cannot provide any security guarantees. Therefore, an attacker who performs a man-in-the-middle attack would be able to authenticate successfully.

Other VPN implementation vulnerabilities include VPN fingerprinting (inferring information about a device from its behavior); insecure password storage (e.g., in the registry, as plaintext in memory, etc.); lack of account lockout; poor default configurations; and unauthorized modification of configurations. We do not consider these vulnerabilities in this paper because they are implementation specific and may require other flaws to be exploited in a successful attack (e.g., insecure password storage would be exploited by the installation of malware).
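The argument can be made concrete with a small sketch: the random-decoy responder is distinguishable with two probes, while a responder that derives its decoy deterministically from the username is not (a keyed MAC stands in here for the encrypt-then-hash construction described above). The account data, hash choice and key handling are hypothetical.

```python
# Two-probe enumeration test against a random-decoy responder versus a
# deterministic-decoy responder.
import hashlib
import hmac
import os

USERS = {"alice": b"secret-psk"}      # hypothetical account database
SERVER_KEY = os.urandom(32)           # generated and kept only on the server

def flawed_response(username):
    """Hills' proposal: hash a *random* password for invalid usernames."""
    psk = USERS.get(username) or os.urandom(16)
    return hashlib.sha256(psk).hexdigest()

def fixed_response(username):
    """Derive the decoy PSK deterministically from the username."""
    psk = USERS.get(username)
    if psk is None:
        psk = hmac.new(SERVER_KEY, username.encode(), "sha256").digest()
    return hashlib.sha256(psk).hexdigest()

# The attacker asks twice and compares the answers:
print(flawed_response("bob") == flawed_response("bob"))  # False -> "bob" invalid
print(fixed_response("bob") == fixed_response("bob"))    # True  -> no signal
```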
3. VPN Security Model
This section describes the probabilistic model used to analyze VPN security. The model helps quantify the security of a protocol and provides suggestions for its secure implementation, configuration and operation. This is important because, as Hills [12] notes, VPN tunnels are misconfigured approximately 90% of the time.

VPN security is modeled using a stochastic activity network (SAN) [25], which is an extension of a Petri net [3]. The Möbius tool [7] is used to specify the model and to perform numerical simulations. This section explains the details of the SAN model and its parameters; all times in the model are expressed in minutes. The model comprises two sub-models (atomic models): one models the implementation and configuration of a VPN tunnel, and the other models its environment and operational details. The two sub-models are joined into a composed VPN model using the Rep/Join representation [7].
Figure 1. Probabilistic model of a VPN.
The first atomic model (ike) models the weaknesses of the protocol (Figure 1). A global variable identifies whether the VPN is operating in the main mode or in the aggressive mode. If the VPN is configured in the aggressive mode, an activity models the username enumeration attack. We consider usernames that consist of alphabetic characters and have a length of at most six characters (approximately 309 million possibilities). If the roundtrip time between the scanner and server is on the order of tens of milliseconds [17] and a window of ten packets is used, then, on the average, it takes 1 ms to check each username; so we assume that 1,000 usernames can be checked per second. If the roundtrip time is larger, an appropriate window can be chosen to achieve this rate. With this rate, it is possible to exhaustively scan the username space in approximately 3.5 days. A sophisticated attacker can do better using a fast connection to the server and/or an intelligent guessing algorithm after a username is found. However, in this paper, we consider an unsophisticated attacker in order to obtain an upper bound on VPN security. Note that, because username scanning does not typically cause account lockout, this process is not stopped by the server.

The rate of username enumeration is proportional to the number of system users (more users result in a faster enumeration using exhaustive search). This is modeled by multiplying the base rate (1 per ≈3.5 days, i.e., 1.94E-04 per minute) by the marking of (the number of tokens in) the usernames place.
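The enumeration arithmetic can be checked directly; the quoted 309 million corresponds to exactly six alphabetic characters (26^6), and the resulting per-minute base rate matches the 1.94E-04 used in the model.

```python
# Reproducing the enumeration arithmetic above: 26**6 usernames checked
# at ~1,000 guesses per second.
space = 26 ** 6                      # 308,915,776 ~= 3.09e8 possibilities
minutes = space / 1000 / 60          # ~5,149 minutes ~= 3.6 days
print(f"{space:,} usernames, {minutes / 1440:.2f} days, "
      f"base rate {1 / minutes:.2e} per minute")   # -> 1.94e-04 per minute
```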
Table 1. Password complexity and password space size.

Password Complexity            Space Size
6 characters, a-z              3.1E+08
6 characters, a-z, A-Z, 0-9    5.7E+10
8 characters, a-z              2.1E+11
8 characters, a-z, A-Z, 0-9    2.1E+14
Whenever a username is found, it is moved from the pool of unknown usernames to usernames found using the output gate username found. Note that usernames found is a place that holds the number of usernames enumerated at a given point in time, whereas username found is the output gate that transfers the discovered username from the unknown usernames place (usernames) to the usernames found place.

When a username is found, the attacker can start an offline attack to obtain the password. To crack the password, the attacker has to hash different values exhaustively. The cracking speed for MD5 hashes using an AMD Athlon XP 2800+ is around 315,000 attempts per second (∼1.9E+07 attempts per minute) [12]. Since the cracking speed depends heavily on password complexity, the model is run using different password space sizes. Table 1 shows the size of the password space for different types of passwords. The rate of successful attempts is proportional to the number of usernames enumerated, so the rate of the brute force activity is multiplied by the marking of the place usernames found. If a username/password pair is found, the VPN is breached. Other transactions (e.g., to set up a VPN tunnel after the breach) require negligible time compared with brute force or username enumeration. As a result, after a username/password pair is found, a token is placed in vpn breached.

The other possible mode of operation is the main mode. As mentioned before, in the main mode, the identities are the IP addresses of the peers. The space of 32-bit IP addresses is approximately fourteen times larger than the space of six-character usernames; thus, the find IP activity that randomly searches the IP address space has a base rate that is fourteen times slower than find username. However, upon finding a valid IP address, the attacker can perform a subnet-based search, which makes it much faster to find other IP addresses (assuming, of course, that most of the clients are in the same subnet). Note that for the main mode attack to proceed, at least one IP address must be found by random search. Upon finding a valid IP address, the attacker exhaustively searches the space of PSKs as in the aggressive mode, placing a token in vpn breached whenever an IP/PSK pair is found.
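Although the analysis in the paper was performed with Möbius, the aggressive-mode sub-model described above can be approximated by a small Monte Carlo simulation with exponentially distributed activity times. The rates below follow the text (six-character alphabetic passwords, ten users, no maintenance or malware); the program is a sketch, not the actual model.

```python
# Toy Monte Carlo rendition of the aggressive-mode sub-model: activity
# rates scale with the markings of the usernames / usernames found places.
import random

ENUM_RATE = 1.94e-4         # usernames found per minute, per unknown username
CRACK_RATE = 1.9e7 / 3.1e8  # PSKs cracked per minute, per enumerated username

def breach_time(users=10):
    """Minutes until the first username/password pair is found."""
    t, unknown, found = 0.0, users, 0
    while True:
        rate_enum = ENUM_RATE * unknown      # find username activity
        rate_crack = CRACK_RATE * found      # brute force activity
        total = rate_enum + rate_crack
        t += random.expovariate(total)       # time to the next event
        if random.random() < rate_enum / total:
            unknown, found = unknown - 1, found + 1
        else:
            return t                         # token placed in vpn breached

times = [breach_time() for _ in range(10_000)]
for horizon in (60, 360, 1440, 5760):        # 1 h, 6 h, 1 day, 4 days
    secure = sum(t > horizon for t in times) / len(times)
    print(f"P(not breached) after {horizon:>4} min ~= {secure:.2f}")
```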
Figure 2. VPN malware infection and maintenance models.
The second atomic SAN model captures the behavior of the VPN environment and its operational maintenance (Figure 2). VPNs are vulnerable to malware attacks [8]. In particular, malware can modify the configuration of a VPN tunnel in a stealthy manner. Since the VPN tunnel remains operational after the modification, it is difficult for the system administrator to detect such an attack. We model two environments, one in which malware attacks exist and one in which they do not. Malware can maliciously modify a VPN to send packets in plaintext; therefore, the installation of malware is synonymous with a VPN breach. The malware infection rate is hard to quantify in industrial systems. For the sake of argument, however, we choose an infection rate of once a month when malware is present and later show that this rate does not have a significant impact on VPN security. The activity malware arrival models malware infections. Although malware can also retrieve unsecured passwords, we do not consider this to be part of the model because it is implementation specific.

VPN maintenance by the system administrator is a preventive and/or corrective action that can improve security. Maintenance involves changing passwords and checking for bad configurations. If a VPN configuration is modified by malware, a maintenance operation can secure the VPN by correcting the configuration and installing a patch that deals with the malware. Also, regularly changing passwords can mitigate exhaustive search attacks, helping secure the VPN. On the other hand, password changes do not affect username/IP enumeration, so this activity does not flush the usernames found and IPs found places in the model.
4. Experimental Results
This section presents the results obtained from simulation experiments using the SAN model. The primary goal was to investigate the probability that the VPN is not in the breached state. In SAN terminology, the reward variable (security probability) is defined as the probability that the marking of the place vpn breached is zero. The value of this probability for each configuration was studied for different time periods: one hour, three hours, twelve hours, one day, three days, one week, one month, three months and one year. VPN security was also studied for different IKE modes (aggressive vs. main mode); password complexity (Table 1); numbers of users/machines (1, 10, 100 and 250); environments (with or without malware); and maintenance rates (once every week, month, three months, one year and no maintenance).
Figure 3. VPN security in the aggressive mode versus the main mode.
Given the number of factors of interest, a large number of experiments are possible, but not all of them are meaningful. Therefore, we only performed experiments in which a few parameters of interest were changed at a time.

The first experiment compared VPN security in the aggressive mode versus the main mode. The main mode is generally more secure because, in order to perform offline password cracking, the attacker has to conduct a man-in-the-middle attack. Even if this attack is successful, the space of 32-bit IP addresses is larger than the space of usernames. The experiment assumed that the attacker can perform a man-in-the-middle attack; otherwise, the main mode is not vulnerable to the attack (i.e., the security probability is one at all times). The passwords (or PSKs) for both modes were selected from the space of six alphabetic characters (3.1E+08); the system was assumed to have ten users (usernames or IP addresses); security maintenance was not performed; and malware was not present. The results are shown in Figure 3. The security of a VPN tunnel diminishes over time. However, the security declines faster for the aggressive mode than for the main mode. The aggressive mode is less than 50% secure after six hours whereas the main mode reaches this level after about four days. Note also the short lifetime of a six-character alphabetic password for a VPN tunnel.

The second experiment studied the effect of password complexity on the overall security of a VPN (aggressive mode). To observe the effect of password complexity alone, maintenance and malware were switched off in this experiment. The system had ten different users. The results are shown in Figure 4. As expected, the overall security of a VPN increases with password complexity. Note that eight-character alphanumeric passwords are secure for a much longer period of time, but even this type of password is less than 65% secure after one year. On the other hand, six-character alphanumeric passwords are less than 60% secure after just one day.

The third experiment studied the effect of maintenance frequency on VPN security.
Figure 4. Impact of password complexity on VPN security.
Figure 5. Impact of maintenance frequency on VPN security.
This experiment assumed that IKE was used in the aggressive mode, that the system had ten users and that six-character alphanumeric passwords were used. The results in Figure 5 show that frequent maintenance can mitigate the effect of weak configurations. Note that the security probability decreases until it reaches a minimum before any maintenance starts; after this, the probability increases with frequent maintenance. Since the rate of maintenance is higher than the rate of password cracking in each case, the security probability reaches one at steady state. This does not mean that it is impossible to break the VPN tunnel as time passes; rather, it implies that the portion of time that the VPN tunnel is breached diminishes over longer time periods. Moreover, the declining security trend during the first few days can be repeated after each successful maintenance. The effect is not shown in Figure 5 because the security probability represents the steady state measure of security at any time.

The fourth experiment studied the effect of malware attacks versus weak VPN passwords. Two values of password complexity (six- and eight-character alphanumeric passwords) are plotted in Figure 6 with and without frequent malware attacks (once a month). In the experiment, IKE used the aggressive mode and no maintenance was performed.
Figure 6. Impact of malware on VPN security.
A counterintuitive result from this experiment is that malware infections have little impact on the security of a weakly-configured VPN because the dominant effect in this mode is the ability of an attacker to crack a six-character alphanumeric password. On the other hand, in the case of a strong password, frequent malware infections considerably weaken VPN security. Therefore, we conclude that the impact of a malware infection depends on the configuration of the VPN. If the rate of password cracking is higher than the rate of infection, malware has little impact on the system. As a result, priority must be given to securing the VPN configuration. Note that this study only considered the effect of malware on the security of the VPN tunnel. Malware infections have other negative security impacts that were not modeled in the experiment.
Figure 7. Impact of the number of users on VPN security.
Next, we studied the effect of the number of users on overall VPN security. As shown in Figure 7, systems with large user populations are much less secure than systems with few users because an attacker has a higher chance of finding valid usernames/passwords (or IPs/PSKs). This experiment assumed that IKE was used in the aggressive mode, that six-character alphanumeric passwords were used, that security maintenance was not performed and that malware was not present.
Figure 8. Password complexity vs. frequent maintenance trade-off.
The sixth experiment was designed to answer an important question: is it better to choose more secure passwords or to perform maintenance more frequently? The experiment considered two systems, one with six-character alphabetic passwords and once-a-week maintenance, and the other with eight-character alphanumeric passwords and maintenance every three months. The results are shown in Figure 8. Weak passwords with frequent maintenance are less secure in the short term, but after a while (about one year) the complex passwords start to fall to exhaustive search and the overall security of the second system decreases. Note also that changing passwords every week can be a huge administrative burden.
Figure 9. Impact of secure parameters with no maintenance.
The seventh experiment focused on a single configuration: the most secure configuration with no maintenance, complex (eight-character alphanumeric) passwords, ten system users and no malware. Figure 9 presents the results of the experiment. The importance of regular maintenance is clear – even a relatively secure configuration becomes less than 65% secure after one year without proper maintenance.
5. Security Recommendations
The simulation results provide valuable insight into securing VPNs, especially in industrial control environments where a tunnel is the only means of establishing communications security and the tunnel may, therefore, last for a long period of time. Based on the experimental results, the following recommendations can be made regarding VPN security:

• The aggressive mode for IPSec VPNs provides fast tunnel establishment and lower overhead, which render the mode an attractive option for industrial environments where timing is critical. However, the mode suffers from serious protocol flaws that can result in security breaches in a relatively short time. This mode should not be used in critical applications. Secure configurations using the main mode and certificate-based authentication provide stronger VPN tunnels at the expense of higher overhead and slower connection establishment.

• Long alphanumeric passwords or PSKs should be used to achieve acceptable security. Even with complex passwords, frequent maintenance must be performed to lower the risk of a successful attack, especially when the adversary has significant computational resources. Note that personal “supercomputers” and botnets can significantly reduce the password cracking time.

• A weak configuration can have a dominant effect even when malware infections are frequent. Securely configuring a VPN is the first step to countering attacks.

• Less populated VPNs are more secure. When a VPN has a large number of users, other parameters must be stronger (e.g., longer passwords and more frequent maintenance). In the case of VPN tunnels for industrial control applications, it is advisable to keep the number of users as low as possible.

• Usernames (and IP addresses) used in a VPN must be changed or rotated periodically to reduce the risk of username enumeration attacks.
6. Related Work
Although probabilistic analysis has been widely used to investigate system reliability, its application to security has not attracted much attention until recently. Wang, et al. [28] have proposed the use of probabilistic models for analyzing system security. They have shown that modeling security using Markov chains can be quite informative and can facilitate the design of secure systems. Singh, et al. [27] have used probabilistic models to study the security of intrusion-tolerant replication systems.
Many previous efforts and standards recommend that VPNs be used to secure industrial control protocols. IEC 62351 [14–16] recommends the deployment of VPNs for protocols such as DNP3 and 61850. Okabe, et al. [21] propose the use of IPSec and KINK to secure non-IP-based control networks. Gungor and Lambert [9] discuss the use of MPLS and IPSec VPNs to provide security for electrical power system applications. Sempere, et al. [26] demonstrate the performance and benefits of using VPNs over IP (among other technologies) for supervisory control systems. Alsiherov and Kim [2] propose the use of IPSec VPNs to ensure integrity, authenticity and confidentiality in SCADA networks; however, they suggest that IPSec be configured in the PSK mode for efficient management. Patel, et al. [22] discuss the use of TLS and IPSec VPNs to wrap SCADA protocols. Alsiherov and Kim [1] suggest using IPSec between SCADA sites to provide security when IEC 62351 is not implemented. Hills [12] has identified several VPN security flaws and has analyzed the presence of secure configurations in VPN deployments. Hamed, et al. [10] have developed a scheme for modeling and verifying IPSec and VPN security policies. Finally, Baukari and Aljane [4] have specified an auditing architecture for monitoring the security of VPNs.
7. Conclusions
A stochastic model of a VPN and its environment provides a powerful framework for investigating the impact of various configurations and operational modes on VPN security in industrial control environments. Simulations of the model assist in quantifying the security of control protocols and evaluating security trade-offs, thereby providing a basis for the secure deployment of VPNs. The results also provide valuable recommendations for securely configuring VPNs in industrial control environments. Our future research will study other VPN protocols (e.g., TLS and L2TP) and quantify their security properties. Also, we plan to incorporate detailed models of malware infections and man-in-the-middle attacks to study their impact more meticulously. Our research will also model other industrial control protocols using SANs with the goal of evaluating their benefits and limitations.
References

[1] F. Alsiherov and T. Kim, Research trend on secure SCADA network technology and methods, WSEAS Transactions on Systems and Control, vol. 8(5), pp. 635–645, 2010.
[2] F. Alsiherov and T. Kim, Secure SCADA network technology and methods, Proceedings of the Twelfth WSEAS International Conference on Automatic Control, Modeling and Simulation, pp. 434–438, 2010.
[3] G. Balbo, Introduction to stochastic Petri nets, in Lectures on Formal Methods and Performance Analysis (LNCS 2090), E. Brinksma, H. Hermanns and J.-P. Katoen (Eds.), Springer-Verlag, Berlin-Heidelberg, Germany, pp. 84–155, 2001.
[4] N. Baukari and A. Aljane, Security and auditing of VPN, Proceedings of the Third International Workshop on Services in Distributed and Networked Environments, pp. 132–138, 1996.
[5] M. Bellare, R. Canetti and H. Krawczyk, Keying hash functions for message authentication, Proceedings of the Sixteenth International Cryptology Conference, pp. 1–15, 1996.
[6] R. Brown, Stuxnet worm causes industry concern for security firms, Mass High Tech, Boston, Massachusetts (www.masshightech.com/stories/2010/10/18/daily19-Stuxnet-worm-causes-industry-concern-for-security-firms.html), October 19, 2010.
[7] D. Deavours, G. Clark, T. Courtney, D. Daly, S. Derisavi, J. Doyle, W. Sanders and P. Webster, The Möbius framework and its implementation, IEEE Transactions on Software Engineering, vol. 28(10), pp. 956–969, 2002.
[8] S. Dispensa, How to reduce malware-induced security breaches, eWeek.com, March 31, 2010.
[9] V. Gungor and F. Lambert, A survey on communication networks for electric system automation, Computer Networks, vol. 50(7), pp. 877–897, 2006.
[10] H. Hamed, E. Al-Shaer and W. Marrero, Modeling and verification of IPSec and VPN security policies, Proceedings of the Thirteenth IEEE International Conference on Network Protocols, pp. 259–278, 2005.
[11] D. Harkins and D. Carrel, The Internet Key Exchange (IKE), RFC 2409, 1998.
[12] R. Hills, Common VPN Security Flaws, White Paper, NTA Monitor, Rochester, United Kingdom (www.nta-monitor.com/posts/2005/01/VPNFlaws-Whitepaper.pdf), 2005.
[13] International Electrotechnical Commission, IEC 61850 Standard, Technical Specification IEC TS 61850, Geneva, Switzerland, 2003.
[14] International Electrotechnical Commission, Communication Network and System Security – Profiles including TCP/IP, Technical Specification IEC TS 62351-3, Geneva, Switzerland, 2007.
[15] International Electrotechnical Commission, Security for IEC 61850, Technical Specification IEC TS 62351-6, Geneva, Switzerland, 2007.
[16] International Electrotechnical Commission, Security for IEC 60870-5 and Derivatives, Technical Specification IEC TS 62351-5, Geneva, Switzerland, 2009.
[17] P. Li, W. Zhou and Y. Wang, Getting the real-time precise roundtrip time for stepping stone detection, Proceedings of the Fourth International Conference on Network and System Security, pp. 377–382, 2010.
[18] M. Majdalawieh, Security Framework for DNP3 and SCADA, VDM Verlag, Saarbrücken, Germany, 2008.
[19] Modbus-IDA, Modbus Application Protocol Specification V.1.1b, Hopkinton, Massachusetts (www.modbus.org/docs/Modbus_Application_Protocol_V1_1b.pdf), 2006.
[20] N. Nayak and S. Ghosh, Different flavors of man-in-the-middle attack: Consequences and feasible solutions, Proceedings of the Third IEEE International Conference on Computer Science and Information Technology, pp. 491–495, 2010.
[21] N. Okabe, S. Sakane, K. Miyazawa, K. Kamada, A. Inoue and M. Ishiyama, Security architecture for control networks using IPSec and KINK, Proceedings of the Symposium on Applications and the Internet, pp. 414–420, 2005.
[22] S. Patel, G. Bhatt and J. Graham, Improving the cyber security of SCADA communication networks, Communications of the ACM, vol. 52(7), pp. 139–142, 2009.
[23] K. Paterson, A cryptographic tour of the IPSec standards, Information Security Technical Report, vol. 11(2), pp. 72–81, 2006.
[24] R. Pereira and S. Beaulieu, Extended Authentication within ISAKMP/Oakley (XAUTH), Internet Draft, 1999.
[25] W. Sanders and J. Meyer, Stochastic activity networks: Formal definitions and concepts, in Lectures on Formal Methods and Performance Analysis (LNCS 2090), E. Brinksma, H. Hermanns and J.-P. Katoen (Eds.), Springer-Verlag, Berlin-Heidelberg, Germany, pp. 315–343, 2001.
[26] V. Sempere, T. Albero and J. Silvestre, Analysis of communication alternatives in a heterogeneous network for a supervision and control system, Computer Communications, vol. 29(8), pp. 1133–1145, 2006.
[27] S. Singh, M. Cukier and W. Sanders, Probabilistic validation of an intrusion-tolerant replication system, Proceedings of the International Conference on Dependable Systems and Networks, pp. 615–624, 2003.
[28] D. Wang, B. Madan and K. Trivedi, Security analysis of SITAR intrusion tolerance system, Proceedings of the ACM Workshop on Survivable and Self-Regenerative Systems, pp. 23–32, 2003.
Chapter 7

IMPLEMENTING NOVEL DEFENSE FUNCTIONALITY IN MPLS NETWORKS USING HYPERSPEED SIGNALING

Daniel Guernsey, Mason Rice and Sujeet Shenoi

Abstract
Imagine if a network administrator had powers like the superhero Flash – perceived invisibility, omnipresence and superior surveillance and reconnaissance abilities – that would enable the administrator to send early warnings of threats and trigger mitigation efforts before malicious traffic reaches its target. This paper describes the hyperspeed signaling paradigm, which can endow a network administrator with Flash-like superpowers. Hyperspeed signaling uses optimal (hyperspeed) paths to transmit high priority traffic while other traffic is sent along suboptimal (slower) paths. Slowing the traffic ever so slightly enables the faster command and control messages to implement sophisticated network defense mechanisms. The defense techniques enabled by hyperspeed signaling include distributed filtering, teleporting packets, quarantining network devices, tagging and tracking suspicious packets, projecting holographic network topologies and transfiguring networks. The paper also discusses the principal challenges involved in implementing hyperspeed signaling in MPLS networks.
The midnight ride of Paul Revere on April 18, 1775 alerted the Revolutionary Forces about the movements of the British military before the Battles of Lexington and Concord. The ability to deploy Paul-Revere-like sentinel messages within a computer network could help improve defensive postures. These sentinel messages could outrun malicious traffic, provide early warnings of threats and trigger mitigation efforts. Electrons cannot be made to move faster than the laws of physics permit, but “suspicious” traffic can be slowed down ever so J. Butts and S. Shenoi (Eds.): Critical Infrastructure Protection V, IFIP AICT 367, pp. 91–106, 2011. c IFIP International Federation for Information Processing 2011
slightly to enable sentinel messages to accomplish their task. To use an optical analogy, it is not possible to travel faster than light, but “hyperspeed signaling paths” can be created by slowing light along all other paths by increasing the refractive index of the transmission media. The concept of offering different priorities – or speeds – for communications is not new. The U.S. Postal Service has numerous classes of mail services ranging from ground delivery to Express Mail that guarantees overnight delivery. The U.S. military’s Defense Switched Network (DSN) [9] designed during the Cold War had four levels of urgency for telephone calls, where a call at a higher level could preempt a call at a lower level; the highest level was FLASH, which also incorporated a special FLASH OVERRIDE feature for the President, Secretary of Defense and other key leaders during defensive emergencies. Modern MPLS networks used by major service providers offer a variety of high-speed and low-speed paths for customer traffic based on service level agreements. This paper proposes the use of optimal (hyperspeed) paths for command and control (and other high priority) traffic and suboptimal (slower) paths for all other traffic in order to implement sophisticated network defense techniques. The basic idea is to offer a guaranteed reaction time window so that packets sent along hyperspeed paths can arrive sufficiently in advance of malicious traffic in order to alert network devices and initiate defensive actions. Separate channels have been used for command and control signals. Signaling System 7 (SS7) telephone networks provide a back-end private network for call control and traffic management, which physically separates the control and data planes [13]. MPLS networks logically separate the internal IP control network from external IP networks that connect with the data plane [7]. This paper describes the hyperspeed signaling paradigm, including its core capabilities and implementation requirements for MPLS networks. Novel defense techniques enabled by hyperspeed signaling, ranging from distributed filtering and teleportation to quarantining and network holography, are highlighted. The paper also discusses the principal challenges involved in implementing hyperspeed signaling, which include network deployment, traffic burden and net neutrality.
2. Hyperspeed Signaling
Hyperspeed signaling uses optimal (hyperspeed) paths to transmit high priority traffic; other traffic is sent along suboptimal (slower) paths. The difference in the time taken by a packet to traverse a hyperspeed path compared with a slower path creates a reaction time window that can be leveraged for network defense and other applications. Indeed, a hyperspeed signaling path between two network nodes essentially induces a “quantum entanglement” of the two nodes, allowing them to interact with each other seemingly instantaneously. In general, there would be one or more hyperspeed (optimal) paths and multiple slower (suboptimal) paths between two nodes. Thus, different reaction time windows would be available for a hyperspeed path compared with (different) slower paths. These different windows would provide varying amounts
of time to implement defensive actions. Depending on its nature and priority, traffic could be sent along different suboptimal paths. For example, traffic deemed to be “suspicious” could be sent along the slowest paths. Note that hyperspeed paths need not be reserved only for command and control traffic. Certain time-critical traffic, such as interactive voice and video communications, could also be sent along faster, and possibly, hyperspeed paths. Of course, using faster paths for all traffic would reduce the reaction time windows, and, consequently, decrease the time available to implement defensive actions. Clearly, a service provider or network administrator would prefer not to reduce traffic speed drastically. Consequently, a suboptimal path should incorporate the smallest delay necessary to obtain the desired reaction time window.
3. Core Capabilities
Hyperspeed signaling provides the network administrator with “powers” like the superhero Flash. The reaction time window corresponds to the speed advantage that Flash has over a slower villain. The ability to send signals between two network nodes faster than other traffic provides superpowers such as perceived invisibility, omnipresence and superior intelligence, surveillance and reconnaissance abilities. This section describes the core capabilities offered by hyperspeed signaling. These capabilities are described in terms of the “See-Think-Do” metaphor [15].
3.1 Omnipresence
Omnipresence in the hyperspeed signaling paradigm does not imply that the network administrator is everywhere in the network at every moment in time. Instead, omnipresence is defined with respect to a target packet – the network administrator can send a hyperspeed signal to any node in the network before the target packet arrives at the node. Omnipresence with respect to multiple packets has two versions, one stronger and the other weaker. The stronger version corresponds to a situation where there is one Flash, and this Flash can arrive before all the packets under consideration arrive at their destinations. The weaker version corresponds to a situation where there are multiple Flashes, one for each packet under consideration. Note that the stronger version of omnipresence endows the network administrator with the ability to track multiple packets and to correlate information about all the packets regardless of their locations in the network.
3.2 Intelligence, Surveillance, Reconnaissance
Intelligence, surveillance and reconnaissance (ISR) are essential elements of U.S. defensive operations [4]. ISR capabilities are implemented in a wide variety of systems for acquiring and processing information needed by national security decision makers and military commanders.
Intelligence is strategic in nature; it involves the integration of time-sensitive information from all sources into concise, accurate and objective reports related to the threat situation. Reconnaissance, which is tactical in nature, refers to an effort or a mission to acquire information about a target, possibly a onetime endeavor. Surveillance lies between intelligence and reconnaissance. It refers to the systematic observation of a targeted area or group, usually over an extended time. Obviously, hyperspeed signaling would significantly advance ISR capabilities in cyberspace. The scope and speed of ISR activities would depend on the degree of connectedness of network nodes via hyperspeed paths and the reaction time windows offered by the paths.
3.3 Defensive Actions
Hyperspeed signaling can help implement several sophisticated network defense techniques. The techniques resemble the “tricks” used in stage magic. In particular, the advance warning feature provided by hyperspeed signaling enables a network to seemingly employ “precognition” and react to an attack before it reaches the target. As described in Section 6, hyperspeed signaling enables distributed filtering, teleporting packets, quarantining network devices, tagging and tracking suspicious packets, projecting holographic network topologies, and transfiguring networks. Distributed filtering permits detection mechanisms to be “outsourced” to various locations and/or organizations. Teleportation enables packets to be transported by “secret passageways” across a network without being detected. Quarantining enables a network device, segment or path to “vanish” before it can be affected by an attack. Tagging facilitates the tracking of suspicious traffic and routing other traffic accordingly. Network holography employs “smoke and mirrors” to conceal a real network and project an illusory topology. Transfiguration enables network topologies to be dynamically manipulated (i.e., “shape shifted”) to adapt to the environment and context.
4. Multiprotocol Label Switching Networks
Circuit switching and packet switching are the two main paradigms for transporting traffic across large networks [10]. ATM and Frame Relay (OSI Layer 2) are examples of circuit-switched (i.e., connection-oriented) technologies that provide low latency and high quality of service (QoS). IP (OSI Layer 3) is a packet-switched (i.e., connectionless) protocol that unifies heterogeneous network technologies to support numerous Internet applications. An important goal of service providers is to design networks with the flexibility of IP and the speed of circuit switching without sacrificing efficiency [8]. In traditional implementations, an overlay model is used, for example, to create an ATM virtual circuit between each pair of IP routers. Operating independently, the two technologies create a relatively inefficient solution. Since IP routers are unaware of the ATM infrastructure and ATM switches are unaware of IP
routing, an ATM network must present a virtual topology such as a complete mesh (which is expensive) or a hub with spokes (which lacks redundancy) that connects each IP router. IP may then be used to route traffic based on the virtual, rather than physical, topology. On the other hand, the routing control paradigm used in IP networks is closely tied to the forwarding mechanism. Since a classless inter-domain routing (CIDR) IP address consists of a network prefix followed by a host identifier, IP networks have a hierarchical model. IP nodes forward packets according to the most specific (“longest match”) route entry identified by the destination address. Consequently, IP networks are only compatible with control paradigms that create hierarchical routes. The need to enhance QoS and integrate IP with connection-oriented technologies like ATM has prompted the development of a more general forwarding scheme for MPLS – one that does not limit the control paradigm [5, 7]. This forwarding mechanism, called “label switching,” is similar to the technique used by circuit-switched technologies. Thus, MPLS enables connection-oriented nodes to peer directly with connectionless technologies by transforming ATM switches into IP routers. ATM switches participate directly in IP routing protocols (e.g., RIP and OSPF) to construct label switched paths (LSPs). LSPs are implemented in ATM switches as virtual circuits, enabling existing ATM technology to support the MPLS forwarding mechanism. Conversely, MPLS enables connectionless technologies, e.g., Ethernet, to behave in a connection-oriented manner by augmenting IP addresses and routing protocols with relatively short, fixed-length labels. MPLS employs a single adaptable forwarding algorithm that supports multiple control components. MPLS labels are conceptually similar to the bar codes on U.S. mail that encode ZIP+4 information; these bar codes are used by the U.S. Postal Service to automatically sort, prioritize, route and track nearly 750 million pieces of mail a day. Within the MPLS core, label switching relies only on the packet label to select an LSP. Thus, any algorithm that can construct LSPs and specify labels can be used to control an MPLS network core. Some additional components are required at the edge where the MPLS core connects to other types of networks (e.g., an inter-office VPN running traditional IP). The MPLS edge routers interpret external routing information, place labels on ingress packets and remove labels from egress packets. The following sections describe label switching and label distribution, which underlie packet transport in MPLS networks.
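To make the forwarding rule concrete, the following minimal sketch implements the “longest match” lookup described above. The routing table contents, next-hop names and addresses are illustrative only, not drawn from the paper.

    #include <stdio.h>
    #include <stdint.h>

    /* A route entry: network prefix, prefix length and next hop. */
    struct route {
        uint32_t prefix;       /* network prefix in host byte order */
        int      len;          /* prefix length in bits (0-32) */
        const char *next_hop;
    };

    /* Return the next hop of the most specific ("longest match") route. */
    static const char *lookup(const struct route *tbl, int n, uint32_t dst)
    {
        const char *best = "default";
        int best_len = -1;
        for (int i = 0; i < n; i++) {
            uint32_t mask = tbl[i].len ? 0xFFFFFFFFu << (32 - tbl[i].len) : 0;
            if ((dst & mask) == tbl[i].prefix && tbl[i].len > best_len) {
                best = tbl[i].next_hop;
                best_len = tbl[i].len;
            }
        }
        return best;
    }

    int main(void)
    {
        /* Hypothetical routes: 10.0.0.0/8 and the more specific 10.0.2.0/24. */
        struct route tbl[] = {
            { 0x0A000000u,  8, "router B" },
            { 0x0A000200u, 24, "router C" },
        };
        uint32_t dst = 0x0A000201u;  /* 10.0.2.1 */
        printf("next hop: %s\n", lookup(tbl, 2, dst));  /* prints "router C" */
        return 0;
    }

Both entries match the destination, but the /24 entry is more specific and therefore wins, which is the hierarchical behavior that constrains IP control paradigms.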
4.1 Label Switching
MPLS packet forwarding resembles the mechanism used in circuit-switched technologies; in fact, it is compatible with ATM and Frame Relay networks [5, 7]. Each MPLS label is a 32-bit fixed-length tag that is inserted in the Layer 2 header (e.g., for ATM VCI and Frame Relay DLCI) or in a separate “shim” between Layers 2 and 3 [12]. A label works much like an IP address in that it dictates the path used by a router to forward the packet. Unlike an IP
address, however, an MPLS label only has local significance. When a router receives a labeled packet, the label informs the router (and only that router) about the operations to be performed on the packet. Typically, a router pops the label on an incoming packet and pushes a new label for the router at the next hop in the MPLS network; the network address in Layer 3 is unchanged.

MPLS networks carry traffic between other connected networks. As such, most user traffic travels from ingress to egress (i.e., the traffic is neither destined for nor originating from internal MPLS hosts). At the ingress, a label is placed in the packet between the OSI Layer 2 and 3 headers [12]. The label informs the next hop about the path, destination and relative “importance” of the packet. At each hop, the label is examined to determine the next hop and outgoing label for the packet. The packet is then relabeled and forwarded. This process continues until the packet reaches the egress where the label is removed. If the MPLS network is composed mainly of ATM switches, the ATM hardware can naturally implement the MPLS forwarding algorithm using the ATM header with little or no hardware modification.

Figure 1. MPLS packet forwarding.

Figure 1 shows a typical MPLS architecture that interconnects two customer VPN sites. Routers A through F in the MPLS network are called label switching routers (LSRs). Customer edge routers, CE1 and CE2, reside at the edge of the customer network and provide MPLS core connectivity. Consider the label switched path (LSP) from VPN Site 1 to VPN Site 2 (Routers A, B, C and F). Router A is designated as the ingress and Router F is designated as the egress for the path. The ingress and egress nodes are called label edge routers (LERs) [12]. When an IP packet reaches the ingress of the MPLS network, LER A consults its forwarding information base (FIB) and assigns the packet to a forwarding equivalence class (FEC). The FEC maps to a designated label that specifies QoS and class of service (CoS) requirements based on Layer 3 parameters in the packet (e.g., source IP, destination IP and application ports). In this example, LER A pushes Label L1 onto the packet and forwards it to LSR B. LSR B reads the label and consults its incoming label map (ILM) to identify the FEC of the packet. It then pops the previous label, pushes a new label (L2) and forwards the packet to its next hop LSR C. LSR C behaves similarly, forwarding the packet to LER F. LER F then pops
L3, examines the destination IP address and forwards the packet to VPN Site 2, where traditional IP forwarding resumes.
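The swap performed at each LSR can be sketched in a few lines of C. The 32-bit shim layout (20-bit label, 3-bit EXP field, 1-bit bottom-of-stack flag, 8-bit TTL) follows RFC 3032; the ILM contents and the label values are illustrative assumptions, not values taken from the paper.

    #include <stdio.h>
    #include <stdint.h>

    /* MPLS shim header fields (RFC 3032): 20-bit label, 3-bit EXP (CoS),
     * 1-bit bottom-of-stack flag and 8-bit TTL packed into 32 bits. */
    static uint32_t shim_pack(uint32_t label, uint32_t exp, uint32_t s, uint32_t ttl)
    {
        return (label << 12) | (exp << 9) | (s << 8) | ttl;
    }
    static uint32_t shim_label(uint32_t shim) { return shim >> 12; }

    /* One incoming label map (ILM) entry, as consulted by an LSR:
     * incoming label -> outgoing label and next hop. Values are illustrative. */
    struct ilm_entry { uint32_t in_label, out_label; const char *next_hop; };

    int main(void)
    {
        struct ilm_entry ilm[] = { { 100, 200, "LSR C" } };   /* L1 -> L2 */
        uint32_t shim = shim_pack(100, 0, 1, 64);             /* pushed at LER A */

        /* At LSR B: look up the incoming label, pop it, push the new one. */
        for (unsigned i = 0; i < sizeof ilm / sizeof ilm[0]; i++) {
            if (shim_label(shim) == ilm[i].in_label) {
                shim = shim_pack(ilm[i].out_label, 0, 1, 63); /* TTL decremented */
                printf("swapped to label %u, forward to %s\n",
                       shim_label(shim), ilm[i].next_hop);
            }
        }
        return 0;
    }

The lookup keys only on the label, never on the Layer 3 address, which is why any control component that can populate the ILM can steer traffic through the MPLS core.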
4.2 Label Distribution
A forwarding algorithm alone is not enough to implement an MPLS network. The individual nodes need to know the network topology in order to make informed forwarding decisions. The MPLS Working Group [1] defines a forwarding mechanism and control components to emulate IP routes using MPLS labels and paths. In IP, routing protocols such as RIP and OSPF populate IP forwarding tables [10]. Similarly, MPLS requires label distribution protocols to build end-to-end LSPs by populating the FIB and ILM of each node. Because MPLS is not tied to a particular paradigm, any routing protocol capable of carrying MPLS labels can be used to build MPLS LSPs. Such protocols include:

Label Distribution Protocol (LDP): This protocol is designed to build aggregate LSPs based on IP routing information gathered by a traditional IP routing protocol such as RIP [1].

Resource Reservation Protocol – Traffic Engineering (RSVP-TE): This protocol incorporates extensions to RSVP in order to construct LSP tunnels along requested paths with varying QoS. RSVP-TE is commonly used for traffic engineering (TE) in MPLS networks [2].

Multiprotocol Extension to Border Gateway Protocol 4 (MP-BGP): This protocol extends BGP to generalize the gateway addresses it distributes and to carry labels. It is commonly used to build VPNs [3, 11].

The three protocols listed above are commonly employed in IP-based networks. This demonstrates that MPLS seamlessly supports the IP routing paradigm and enables IP to be efficiently deployed in ATM and Frame Relay networks without the need for convoluted virtual topologies.
5. MPLS Implementation Requirements
Two requirements must be met to implement hyperspeed signaling. First, the network must be able to recognize and distinguish hyperspeed signals from non-hyperspeed signals. Second, the network must be able to provide appreciable differences in delivery delays, so that the reaction time windows are satisfied by hyperspeed signals. The network environment and the delay techniques that are applied govern the degree of flexibility with respect to the maximum possible reaction time window. MPLS is an ideal technology for implementing hyperspeed signaling because it has built-in identification and service differentiation technologies. Labels in MPLS act like circuit identifiers in ATM to designate the paths taken by packets in the network core.
Figure 2. Egress filtering.
Hyperspeed signaling in MPLS would use the packet label to distinguish hyperspeed packets from non-hyperspeed packets. MPLS-capable routers are typically equipped with many QoS and traffic shaping features. LSRs can be configured to give hyperspeed packets the highest priority based on the packet label. Likewise, LSRs can be configured to delay non-hyperspeed packets in forwarding queues. Because the label dictates the QoS and the path, non-hyperspeed packets could be forced to take circuitous routes by constructing the corresponding LSPs using non-hyperspeed labels. The labels corresponding to optimal routes are reserved for hyperspeed packets.
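A minimal sketch of this label-based differentiation follows. The reserved label values and the added queueing delay are assumptions chosen for illustration; the paper does not prescribe specific values.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdint.h>

    /* Labels reserved for hyperspeed (optimal) LSPs; values are illustrative. */
    static const uint32_t hyperspeed_labels[] = { 16, 17 };

    static bool is_hyperspeed(uint32_t label)
    {
        for (unsigned i = 0;
             i < sizeof hyperspeed_labels / sizeof hyperspeed_labels[0]; i++)
            if (hyperspeed_labels[i] == label)
                return true;
        return false;
    }

    /* Classify a packet by its label: hyperspeed traffic is forwarded
     * immediately at the highest priority; all other traffic is held in a
     * delay queue long enough to preserve the desired reaction time window
     * (5 ms here, purely illustrative). */
    static void classify(uint32_t label)
    {
        if (is_hyperspeed(label))
            printf("label %u: forward on optimal LSP, highest priority\n", label);
        else
            printf("label %u: enqueue on suboptimal LSP, add >= 5 ms delay\n", label);
    }

    int main(void)
    {
        classify(16);   /* sentinel / command-and-control traffic */
        classify(300);  /* ordinary customer traffic */
        return 0;
    }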
6. Novel Defense Techniques
Hyperspeed signaling can help implement several sophisticated network defense techniques. These include distributed filtering, teleporting packets, quarantining network devices, tagging and tracking suspicious packets, projecting holographic network topologies and transfiguring networks.
6.1 Distributed Filtering
Hyperspeed signaling supports a variety of distributed filtering configurations. The simplest configuration is “egress filtering” that can be used by service provider networks and other entities that transport traffic between networks. As shown in Figure 2, when a malicious packet is identified, a hyperspeed sentinel message is sent to the egress filter to intercept the packet. If the reaction time window is sufficiently large, the sentinel message arrives at the egress filter in advance of the malicious packet to permit the threat to be neutralized. The sentinel message must contain enough information to identify the malicious packet. Note that the malicious traffic is dropped at the egress filter, and the downstream network is unaware of the attempted attack. Hyperspeed signaling enhances flexibility and efficiency by distributing detection and filtration functionality. Also, it enables service provider networks and other networks (e.g., enterprise networks) that employ multiple detection modalities to maintain low latency. The traditional ingress filtering approach is shown in Figure 3(a). This approach deploys detector-filters in series, where
each detector-filter (e.g., firewall) contributes to the overall delay. On the other hand, the distributed filtering approach shown in Figure 3(b) deploys detectors in parallel at ingress and a filter at egress. Thus, the overall delay is the delay introduced by the single slowest detector plus the delay required for egress filtering.

Figure 3. Traditional and distributed filtering configurations: (a) traditional (serial); (b) distributed (parallel).
Figure 4. Advance warning.
Figure 4 shows an advance warning configuration where a hyperspeed signal (sentinel message) is sent to the customer (or peer) ingress instead of the provider egress. In this configuration, the service provider (or peer) network detects malicious packets, but only alerts the customer (or peer) network about the incoming packets. Since the other party has advance warning, it can observe, analyze and/or block the packets or take any other actions it sees fit. The advance warning configuration enables networks to outsource detection. Copies of suspicious packets could be forwarded to a third party that has sophisticated detection capabilities (e.g., security service providers or government agencies). If the third party detects malicious activity, it can send a hyperspeed signal to trigger filtering. The third party could correlate packets observed from multiple client networks and provide sophisticated detection services to its clients without compromising its intellectual property or national security.
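The paper does not define a sentinel message format. The following hypothetical structure illustrates the kind of information, a flow identifier, a digest of the offending packet and an action, that would let an egress filter recognize and neutralize the flagged traffic; every field choice here is an assumption.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical sentinel message: enough information for an egress
     * filter to identify the malicious packet it precedes. */
    struct sentinel_msg {
        uint32_t src_ip, dst_ip;      /* flow identifiers */
        uint16_t src_port, dst_port;
        uint8_t  protocol;
        uint8_t  action;              /* e.g., 0 = drop, 1 = divert for analysis */
        uint8_t  digest[20];          /* hash of the offending packet's payload */
        uint32_t expiry_ms;           /* stop filtering after this interval */
    };

    int main(void)
    {
        struct sentinel_msg m = { .src_ip = 0x0A000001u, .dst_ip = 0x0A000201u,
                                  .src_port = 4444, .dst_port = 80,
                                  .protocol = 6, .action = 0,
                                  .expiry_ms = 10000 };
        printf("drop matching flow to port %u for %u ms\n",
               m.dst_port, m.expiry_ms);
        return 0;
    }

Such a message would itself be carried on a hyperspeed LSP so that it reaches the filter within the reaction time window.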
Figure 5. Simple teleportation.

6.2 Teleportation
Hyperspeed routes can be used to teleport packets. Simple teleportation is shown in Figure 5. An operator located at A sends a packet along path ABFG where the hop from B to F involves teleportation. To teleport the packet from B to F, the packet could be encrypted and encapsulated in a labeled ICMP ping packet and sent to F along a hyperspeed path, where it would be converted to its original form and forwarded to G along a normal path. If the teleportation mechanism is to be further concealed, then the packet could be fragmented and the fragments sent along different hyperspeed paths to F (assuming that multiple hyperspeed paths exist from B to F).

Another teleportation approach takes after stage magic. Magicians often use identical twins to create the illusion of teleportation. To set up the act, the magician positions one twin at the source while the other is hidden at the destination. During the act, the magician directs the first twin to enter a box and then secretly signals the other twin to reveal himself at the destination. The same approach can be used to create the illusion of packet teleportation.

The staged teleportation approach is shown in Figure 6. An operator at A uses simple teleportation to secretly send a packet from A to F (Step 1). Next, the operator sends an identical packet from A to B along a normal path; this packet is dropped upon reaching B (Step 2). The operator then sends a hyperspeed signal from A to F (Step 3), which causes the staged packet to move from F to G along a normal path (Step 4). A casual observer would see the packet travel from A to B and the same packet subsequently travel from F to G, but would not see the packet travel from B to F (because no such transmission took place). Depending on the time-sensitivity of the operation, the stage can be set (Step 1) long before the act (Steps 2, 3 and 4) takes place.

A variation of the teleportation act involves a modification of Step 1. An operator located at F sends a copy of a packet to A along a covert hyperspeed
path. As in the previous scenario, a casual observer would see the packet travel from A to B and then from F to G, but not from B to F. This staged teleportation approach can help conceal the real origins of network messages.

Figure 6. Staged teleportation.
Figure 7. Quarantining network devices.

6.3 Quarantining
Quarantining enables a network device, segment or path to disappear before it can be compromised by an attack. As shown in Figure 7, a detector located upstream of a targeted device identifies a threat. The detector then sends hyperspeed signals to the appropriate network nodes to prevent malicious traffic from reaching the device. This essentially quarantines the targeted device. Note that if the attack reaches the targeted device before it is quarantined, the device is isolated before it can affect other parts of the network; the device is reconnected only after it is verified to be secure. Of course, the fact that
quarantine messages travel along hyperspeed paths increases the likelihood that the attack will be thwarted before it impacts the targeted device. The same technique can be used to quarantine network segments or deny the use of certain network paths.

Figure 8. Tagging.

6.4 Tagging
One application of tagging is similar to the use of pheromone trails by animals. In this application, a network essentially tracks the paths taken by suspicious traffic. A network administrator sends diagnostic packets via hyperspeed paths to nodes along the path taken by a suspicious packet in order to observe its behavior. If, as shown in Figure 8, the suspicious packet causes anomalous behavior at one of the nodes, the diagnostic packet reports the anomaly via a hyperspeed signal and the compromised device can be quarantined as described above. In extreme cases, all the nodes on the path taken by the suspicious packet could be quarantined until the situation is resolved. Tagging can also be used to mitigate the effects of attacks that originate from multiple sources, including distributed denial-of-service attacks (DDoS) and other novel attacks. Consider a sophisticated attack that works like the “five finger death punch” in the movie Kill Bill Vol. 2. The attack, which is fragmented into five benign packets, is executed only when all five packets are assembled in sequence. Since a single stateful firewall with knowledge about the fragmented attack could detect and block one or more packets, implementing a successful attack would require the packets to be sent from different locations. The tagging mechanism can counter the fragmented attack by quarantining the target as soon as anomalous behavior is detected. The packets constituting the attack could then be traced back to their origins at the network perimeter, and security devices configured appropriately to detect the attack.
Figure 9. Network holography.

6.5 Network Holography
Networks can hide their internal structure, for example, by using private IP addresses. Hyperspeed signaling enables networks to project illusory internal structures or “holograms.” Conventional holograms are created using lasers and special optics to record scenes. In some cases, especially when a cylindrical piece of glass is used, a scene is recorded from many angles. Once recorded, the original scene may be removed, but the hologram still projects the recorded scene according to the viewing angle. If enough angles are recorded, the hologram creates the illusion that the original scene is still in place. Similarly, an illusory network topology can be created in memory and distributed to edge nodes in a real network (Figure 9). The presence of multiple hyperspeed paths between pairs of edge routers can help simulate the illusory topology. Other nodes may be included, but the edge nodes at the very minimum must be included. When probes (e.g., ping and traceroute) hit the real network, the edge nodes respond to the probes as if the network has the illusory topology. It is important that the same topology is simulated from all angles (i.e., no matter where the probe enters the network) because any inconsistency would shatter the illusion.
6.6 Transfiguration
Transfiguration enables networks to cooperate, much like utilities in the electric power grid [16], to provide services during times of crisis. Network administrators can manipulate their internal network topologies or modify the topologies along the perimeters of cooperating networks to lend or lease additional resources as required. Additionally, administrators may need to modify the topologies at the perimeter boundaries near an attack. This method is analogous to moving the frontline forward or backward during a battle.
Links and nodes may need to be strategically quarantined, disabled or re-enabled based on the circumstances. As resources are lost and gained, the roles of devices, especially at the perimeter, may change. Hyperspeed signaling permits topology changes to occur seemingly instantaneously and enables devices with special roles to operate by proxy where necessary at the perimeter. As resources become available (either regained after being compromised or leased from other sources), the window for hyperspeed signaling can be adjusted to provide additional reaction time.
7. Implementation Challenges
This section discusses the principal challenges involved in implementing hyperspeed signaling in MPLS networks. The challenges include network deployment, traffic burden and net neutrality.
7.1 Network Deployment
Deploying a hyperspeed signaling protocol in a network involves two principal tasks. The first is programming the hardware devices to satisfy the hyperspeed signaling requirements for the specific network. The second is installing, configuring and calibrating the hardware devices for efficient hyperspeed signaling. Ideally, vendors would program the algorithms/protocols in the hardware devices. The installation, configuration and calibration of the devices would be performed by network engineers and administrators. This task would be simplified and rendered less expensive if vendors were to offer software/firmware updates for deploying hyperspeed signaling without the need to replace existing network devices.
7.2 Traffic Burden
Sending traffic along suboptimal paths essentially increases network “capacitance” – keeping more packets in the network at any given time. Because the additional time that a non-hyperspeed packet spends in the network is specified by the reaction time window, the amount of additional traffic flowing in the network is approximately equal to the product of the reaction time window and the average link bandwidth.

Another metric for the burden imposed by hyperspeed signaling is the non-hyperspeed delay divided by the hyperspeed delay (the non-hyperspeed delay is equal to the hyperspeed delay plus the reaction time window). This metric only applies to pairs of communicating end points. MPLS networks may need additional bandwidth depending on their capacity and the presence of alternate links. The traffic burden due to hyperspeed signaling can be reduced by strategically partitioning a network into multiple signaling domains to prevent critical links from being flooded.

A traffic burden is also induced in a distributed filtering scenario where malicious traffic is allowed to flow through the network and screened later
Guernsey, Rice & Shenoi
105
(e.g., at an interior or egress node). However, this is not an issue because most service provider networks simply transport traffic, leaving the task of filtering to customers.
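The two burden metrics defined above can be illustrated with a short computation; the window, bandwidth and delay values below are hypothetical.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative values: a 10 ms reaction time window on links
         * averaging 1 Gb/s, with a 40 ms hyperspeed end-to-end delay. */
        double window_s      = 0.010;  /* reaction time window, seconds */
        double avg_bandwidth = 1e9;    /* average link bandwidth, bits/s */
        double hyperspeed_s  = 0.040;  /* hyperspeed path delay, seconds */

        /* Additional in-flight traffic ~ window x average bandwidth. */
        double extra_bits = window_s * avg_bandwidth;

        /* Non-hyperspeed delay = hyperspeed delay + reaction time window. */
        double ratio = (hyperspeed_s + window_s) / hyperspeed_s;

        printf("extra in-flight traffic: %.0f Mb\n", extra_bits / 1e6); /* 10 Mb */
        printf("delay ratio: %.2f\n", ratio);                           /* 1.25 */
        return 0;
    }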
7.3 Net Neutrality
Issues regarding net neutrality must be considered because the implementation of hyperspeed signaling requires command and control traffic to be treated differently from other traffic. In particular, non-hyperspeed traffic is intentionally slowed to accommodate the desired reaction time windows. At this time, there is no consensus on the definition of net neutrality [14]. Does net neutrality mean that all traffic should be treated the same? Or does it mean that only traffic associated with a particular application type should be treated the same? Regardless of its definition, net neutrality is not a major concern for VPN service providers, who can give preferential treatment to traffic based on the applicable service level agreements. In the case of Internet service providers, net neutrality would not be violated as long as all customer traffic is slowed by the same amount. Currently, no laws have been enacted to enforce net neutrality, although there has been considerable discussion regarding proposed legislation. Many of the proposals permit exceptions in the case of traffic management, public safety and national security. Since hyperspeed signaling, as discussed in this paper, focuses on network defense, it is reasonable to conclude that it would fall under one or more of the three exemptions. Interestingly, the distributed filtering technique provided by hyperspeed signaling actually enables service providers to treat different types of traffic in a “more neutral” manner than otherwise. Consider a situation where a service provider employs a firewall that performs deep packet inspection. Certain types of traffic (e.g., suspicious packets) would require more inspection time by the firewall, contributing to a larger delay than for other traffic. But this is not the case when all traffic (including suspicious traffic) is allowed to enter the network while copies are simultaneously sent to a distributed detector. Malicious packets are filtered at egress or elsewhere in the network using hyperspeed signaling. Non-malicious packets in the same suspicious traffic pass through the network just like normal traffic.
8. Conclusions
As attacks on computer and telecommunications networks become more prolific and more insidious, it will be increasingly important to deploy novel strategies that give the advantage to network defenders. Hyperspeed signaling is a promising defensive technology that could combat current and future threats. The paradigm is motivated by Arthur C. Clarke’s third law of prediction: “Any sufficiently advanced technology is indistinguishable from magic” [6]. Hyperspeed signaling does not require electrons to move faster than the laws of physics
permit; instead, malicious traffic is slowed down ever so slightly to endow defensive capabilities that are seemingly magical. The hallmark of good engineering is making the right trade-off. Intentionally slowing down network traffic may appear to be counterintuitive, but the defensive advantages gained by hyperspeed signaling may well outweigh the costs. Note that the views expressed in this paper are those of the authors and do not reflect the official policy or position of the U.S. Department of Defense or the U.S. Government.
References

[1] L. Andersson, P. Doolan, N. Feldman, A. Fredette and B. Thomas, LDP Specification, RFC 3036, 2001.
[2] D. Awduche, L. Berger, D. Gan, T. Li, V. Srinivasan and G. Swallow, RSVP-TE: Extensions to RSVP for LSP Tunnels, RFC 3209, 2001.
[3] T. Bates, Y. Rekhter, R. Chandra and D. Katz, Multiprotocol Extensions for BGP-4, RFC 2858, 2000.
[4] R. Best, Intelligence, Surveillance and Reconnaissance (ISR) Programs: Issues for Congress, CRS Report for Congress, RL32508, Congressional Research Service, Washington, DC, 2005.
[5] U. Black, MPLS and Label Switching Networks, Prentice Hall, Upper Saddle River, New Jersey, 2002.
[6] A. Clarke, Profiles of the Future: An Inquiry into the Limits of the Possible, Harper and Row, New York, 1999.
[7] B. Davie and Y. Rekhter, MPLS: Technology and Applications, Morgan Kaufmann, San Francisco, California, 2000.
[8] E. Gray, MPLS: Implementing the Technology, Addison-Wesley, Reading, Massachusetts, 2001.
[9] B. Nicolls, Airman’s Guide, Stackpole Books, Mechanicsburg, Pennsylvania, 2007.
[10] L. Peterson and B. Davie, Computer Networks: A Systems Approach, Morgan Kaufmann, San Francisco, California, 2003.
[11] E. Rosen and Y. Rekhter, BGP/MPLS IP Virtual Private Networks (VPNs), RFC 4364, 2006.
[12] E. Rosen, A. Viswanathan and R. Callon, Multiprotocol Label Switching Architecture, RFC 3031, 2001.
[13] T. Russell, Signaling System #7, McGraw-Hill, New York, 1998.
[14] H. Travis, The FCC’s new theory of the First Amendment, Santa Clara Law Review, vol. 51(2), pp. 417–513, 2011.
[15] United States Department of Defense, Military Deception, Joint Publication 3-13.4, Washington, DC, 2006.
[16] M. Wald, Hurdles (not financial ones) await electric grid update, New York Times, February 6, 2009.
Chapter 8

CREATING A CYBER MOVING TARGET FOR CRITICAL INFRASTRUCTURE APPLICATIONS

Hamed Okhravi, Adam Comella, Eric Robinson, Stephen Yannalfo, Peter Michaleas and Joshua Haines

Abstract
Despite the significant amount of effort that often goes into securing critical infrastructure assets, many systems remain vulnerable to advanced, targeted cyber attacks. This paper describes the design and implementation of the Trusted Dynamic Logical Heterogeneity System (TALENT), a framework for live-migrating critical infrastructure applications across heterogeneous platforms. TALENT permits a running critical application to change its hardware platform and operating system, thus providing cyber survivability through platform diversity. TALENT uses containers (operating-system-level virtualization) and a portable checkpoint compiler to create a virtual execution environment and to migrate a running application across different platforms while preserving the state of the application (execution state, open files and network connections). TALENT is designed to support general applications written in the C programming language. By changing the platform on-the-fly, TALENT creates a cyber moving target and significantly raises the bar for a successful attack against a critical application. Experiments demonstrate that a complete migration can be performed in about one second.
1. Introduction

Critical infrastructure systems are an integral part of the national cyber infrastructure. The power grid, oil and gas pipelines, utilities, communications systems, transportation systems, and banking and financial systems are examples of critical infrastructure systems. Despite the significant amount of effort and resources used to secure these systems, many remain vulnerable to advanced, targeted cyber attacks.
The complexity of these systems and their use of commercial-off-the-shelf components often exacerbate the problem. Although protecting critical infrastructure systems is a priority, recent cyber incidents [4, 14] have shown that it is imprudent to rely completely on the hardening of individual components. As a result, attention is now focusing on game-changing technologies that achieve mission continuity during cyber attacks. In fact, the U.S. Air Force Chief Scientist’s report on technology horizons [27] mentions the need for “a fundamental shift in emphasis from ‘cyber protection’ to ‘maintaining mission effectiveness’ in the presence of cyber threats” as a way to build cyber systems that are inherently intrusion resilient. Moreover, the White House National Security Council’s progress report [19] mentions a “moving target” – a system that moves in multiple dimensions to foil the attacker and increase resilience – as one of the Administration’s three key themes for its cyber security research and development strategy.

This paper describes the design and implementation of the Trusted Dynamic Logical Heterogeneity System (TALENT), a framework for live-migrating critical applications across heterogeneous platforms. In mission-critical systems, the mission itself is the top priority, not individual instances of the component. TALENT can help thwart cyber attacks by live-migrating the mission-critical application from one platform to another. Also, by dynamically changing the platform at randomly-chosen time intervals, TALENT creates a cyber moving target that places the attacker at a disadvantage and increases resilience. This means that the information collected by the attacker about the platform during the reconnaissance phase becomes ineffective at the time of attack.

TALENT has several design goals:

Heterogeneity at the instruction set architecture level, meaning that applications should run on processors with different instruction sets.

Heterogeneity at the operating system level.

Preservation of the state of the application, including the execution state, open files and sockets. This is an important property in mission-critical systems because simply restarting the application from scratch on a different platform may have undesirable consequences.

Working with a general-purpose system language such as C. Much of TALENT’s functionality is straightforward to implement using a platform-independent language like Java because the Java Virtual Machine provides a sandbox for applications. However, many commodity and commercial-off-the-shelf software systems are developed in C. Restricting TALENT to a Java-like language would limit its use.

TALENT must provide operating system and hardware heterogeneity while preserving the state and environment despite the incompatibility of binaries between different architectures. TALENT addresses these challenges using: (i) operating-system-level virtualization (container-based operating system) to
sandbox the application and migrate the environment including the filesystem, open files and network connections; and (ii) portable checkpoint compilation to compile the application for different architectures and migrate the execution state across different platforms.

TALENT is novel in several respects. It is a heterogeneous platform system that dynamically changes the instruction set and operating system. It supports the seamless migration of critical applications across platforms while preserving their states. Neither application developers nor operators require prior knowledge about TALENT; TALENT is also application agnostic. Other closely-related solutions either lose the internal state of applications or are limited to specific types of applications (e.g., web servers). The TALENT implementation is optimized to reduce the migration time – the current prototype completes the migration of state and environment in about one second. To the best of our knowledge, TALENT is the first realization of a cyber moving target through platform heterogeneity.
2. Threat Model
The TALENT threat model assumes there is an external adversary who is attempting to exploit a vulnerability in the operating system or in the application binary in order to disrupt the normal operation of a mission-critical application. For simplicity and on-the-fly platform generation, a hypervisor (hardware-level virtualization) is used. The threat model assumes that the hypervisor and the system hardware are trusted. We assume that the authenticity of the hypervisor is checked using hardware-based cryptographic verification (e.g., TPM) and that the hypervisor implementation is free of bugs. We also assume that the operating-system-level virtualization logic is trusted. However, the rest of the system (including the operating system and applications) is not trusted and may contain vulnerabilities and malicious logic.

We also assume that, although an attack may be feasible against a number of different platforms (operating system/architecture combinations), there exists a platform that is not susceptible to the attack. This means that not all the platforms are vulnerable. Our primary goal is to migrate a mission-critical application to a different platform at random time intervals when a new vulnerability is discovered or when an attack is detected. Attacks can be detected using various techniques that are described in the literature (e.g., [5]).

Heterogeneity at different levels can mitigate attacks. Application-level heterogeneity protects against binary- and architecture-specific exploits, and untrusted compilers. Operating-system-level heterogeneity mitigates kernel-specific attacks, operating-system-specific malware and persistent operating system attacks (rootkits). Finally, hardware heterogeneity can thwart supply chain attacks, malicious and faulty hardware, and architecture-specific attacks. It is important to note that TALENT is by no means a complete defense against all these attacks. Instead, it is designed to enhance survivability in the presence of platform-specific attacks using dynamic heterogeneity.
Figure 1. OS-level and hardware-level virtualization approaches.

3. Design
TALENT uses two key concepts, operating-system-level virtualization and portable checkpoint compilation, to address the challenges involved in using heterogeneous platforms, including binary incompatibility and the loss of state and environment.
3.1 Virtualization and Environment Migration
Preserving the environment of a critical infrastructure application is an important goal. The environment includes the filesystem, configuration files, open files, network connections and open devices. Note that many of the environment parameters can be preserved using virtual machine migration. However, virtual machine migration can only be accomplished using a homogeneous operating system and hardware. Indeed, virtual machine migration is not applicable because it is necessary to change the operating system and hardware while migrating a live application. TALENT uses operating-system-level virtualization to sandbox an application and migrate the environment.
OS-Level Virtualization In operating-system-level virtualization, the kernel allows for multiple isolated user-level instances. Each instance is called a container (jail or virtual environment). The method was originally designed to support fair resource sharing, load balancing and cluster computing applications. This type of virtualization can be thought of as an extended chroot in which all resources (devices, filesystem, memory, sockets, etc.) are virtualized. Note that the major difference between operating-system-level virtualization and hardware-level virtualization (e.g., Xen and KVM) is the semantic level at which the entities are virtualized (Figure 1). Hardware-level hypervisors virtualize disk blocks, memory pages, hardware devices and CPU cycles, whereas operating-system-level virtualization works at the level of filesystems, memory regions, sockets and kernel objects (e.g., IPC memory segments and network buffers). Hence, the semantic information that is often lost in hardware virtualization is readily available in operating-system-level virtualization. This makes operating-system-level virtualization a good choice for applications where semantic information is needed, for example, when monitoring or sandboxing at the application level.
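OpenVZ implements containers inside the kernel; purely as an illustration of the “extended chroot” idea, the following sketch isolates a process using standard Linux primitives (chroot plus namespace unshare). It is not OpenVZ’s mechanism, and the container root path is hypothetical.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Isolate the filesystem view (chroot) and unshare kernel namespaces so
     * the process gets its own mount table, hostname and network stack.
     * Requires root (CAP_SYS_ADMIN). */
    int main(void)
    {
        if (unshare(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET) != 0) {
            perror("unshare");
            return 1;
        }
        if (chroot("/srv/container1") != 0 || chdir("/") != 0) { /* hypothetical root */
            perror("chroot");
            return 1;
        }
        /* The process now sees an isolated filesystem and an empty, private
         * network namespace (initially containing only a loopback device). */
        execl("/bin/sh", "sh", (char *)NULL);
        perror("execl");
        return 1;
    }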
Environment Migration As discussed above, TALENT uses operating-system-level virtualization to migrate the environment of a critical application. When migration is requested (as a result of a malicious activity or a periodic migration), TALENT migrates the container of the application from the source machine to the destination machine. This is done by synchronizing the filesystem of the destination container with the source container. Since the operating system keeps track of open files, the same files are opened in the destination. Because this information is not available at the hardware virtualization level (due to the semantic gap between files and disk blocks), additional techniques must be implemented to recover the information (e.g., virtual machine introspection). On the other hand, this information is readily available in TALENT.

Network connections can be virtualized in three ways: second layer, third layer and socket virtualization. These terms come from the OpenVZ documentation [16]. Virtualizing a network at the second layer means that each container has its own IP address, routing table and loopback interface. Third layer virtualization implies that each container can access any IP address/port and that sockets are isolated using the namespace. Socket virtualization means that each container can access any IP address/port and that sockets are isolated using filtration.

Figure 2. Network virtualization approaches.

Figure 2 shows the different network virtualization approaches for two containers. In socket virtualization, the port numbers are divided between the containers, whereas in third layer virtualization, the entire port range is available to every container. TALENT uses second layer virtualization in order to be able to migrate the IP address of a container. To preserve network connections during migration, the IP address of the container’s virtual network interface is migrated to the new container. Then, the state of each TCP socket (the kernel’s sk_buff structures) is transferred to the destination. The network migration is seamless to the application, and the
application can continue sending and receiving packets on its sockets. In fact, our evaluation example shows that the state of an SSH connection is preserved during the migration.

Many operating-system-level virtualization frameworks also support IPC and signal migration. In each case, the states of IPC and signals are extracted from the kernel data structures and migrated to the destination. These features are supported in TALENT through the underlying virtualization layer.

Figure 3. Portable checkpoint compilation.
3.2 Checkpointing and Process Migration
Migrating the environment is only one step in backing up the system because the state of running programs must also be migrated. To do this, a method to checkpoint running applications must be implemented. After all the checkpointed program states are saved in checkpoint files, the state is migrated by simply mirroring the filesystem.
Requirements Checkpointing in TALENT must meet certain requirements.

Portability: Checkpointed programs should be able to move back and forth between different architectures and operating systems in a heterogeneous computing environment.

Transparency: Heavy modification of existing program code should not be required in order to introduce proper checkpointing.

Scalability: Checkpointed programs may be complex and may handle large amounts of data. Checkpointing should be able to handle such programs without affecting system performance.

A portable checkpoint compiler (PCC) can help meet the portability requirement. Figure 3 illustrates the portable checkpoint compilation process, which allows compilation to occur independently on various operating system/architecture pairs. The resulting executable program, including the inserted checkpointing code, functions properly on each platform on which it was compiled.

Transparency is obtained by performing automatic code analysis and checkpoint insertion. This prevents the end user from having to modify their code
to indicate where checkpointing should be performed and what data should be checkpointed.

Scalability is obtained in two ways. First, the frequency of checkpointing, a potential bottleneck in the checkpointing process, is controlled. Second, through the use of compressed checkpoint file formats, the checkpoints themselves remain as small as possible even as the amount of data processed by the program increases.
Variable Level Checkpointing There are two possible approaches for checkpointing a live process: data segment level checkpointing (DSLC) and variable level checkpointing (VLC). Note that DSLC and VLC are different types of portable checkpoint compilers. In DSLC [26], the entire state of the process including the stack and heap are dumped into a checkpoint file. DSLC preserves the entire state of a process, but since the checkpoint file contains platform specific data such as the stack and heap, this approach suffers from a lack of portability. VLC [3], on the other hand, stores the values of restart-relevant variables in the checkpoint file. Since the checkpoint file only contains portable data, VLC is a good candidate for migration across heterogeneous platforms. In order to construct the entire state of the process, VLC must re-execute the non-portable portions of the code. The non-portable portions refer to the platform-dependent values stored in the stack or heap, not the variables. To perform VLC, the code is precompiled to find restart-relevant variables. These variables and their memory locations are then registered in the checkpointing tool. When checkpointing, the process is paused and the values of the memory locations are dumped into a file. The checkpointing operation must occur at safe points in the code to generate a consistent view. At restart, the memory of the destination process is populated with the desired variable values from the checkpoint file. Some portions of the code are re-executed in order to construct the entire state. A simple example involving a factorial computation is presented to illustrate VLC operation. Of course, TALENT is capable of handling much more complicated code bases. The factorial code is shown in Figure 4. For simplicity, the code incorporates a main function with no inputs. Figure 5 illustrates the VLC markup of the factorial program. All calls to the checkpointing tool are shown with pseudo-function calls with VLC prefixes. First, the checkpointer is initialized. Then, the variables to be tracked and checkpointed are registered with the tool. In the example, the variables fact, curr and i have to be registered. The actual checkpointing must occur inside the loop after each iteration. When the loop is done, there is no need to track the variables any longer, so they are unregistered. Finally, the environment is torn down before the return. Note that for transparency and scalability, the code markup has been done automatically and prior to compilation.
    int main(int argc, char **argv) {
        int fact;
        double curr;
        int i;

        fact = 20;
        curr = 1;
        for (i = 1; i <= fact; i++) {
            curr = curr * i;
        }
        printf("%d factorial is %f", fact, curr);
        return 0;
    }

Figure 4. The factorial program.

Figure 5. Variable level checkpointing of the factorial program.
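The markup that Figure 5 depicts can be sketched from the description above: initialize the checkpointer, register the restart-relevant variables, checkpoint at a safe point inside the loop, unregister the variables and tear down the environment. The VLC_* calls below follow the paper’s pseudo-function naming convention; their exact names, arguments and the VLC_INT/VLC_DOUBLE constants are assumptions for illustration, not CPPC’s actual API.

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int fact;
        double curr;
        int i;

        VLC_Init(argc, argv);            /* initialize the checkpointer */
        VLC_Register(&fact, VLC_INT);    /* register restart-relevant variables */
        VLC_Register(&curr, VLC_DOUBLE);
        VLC_Register(&i, VLC_INT);

        fact = 20;
        curr = 1;
        for (i = 1; i <= fact; i++) {
            curr = curr * i;
            VLC_Checkpoint();            /* safe point: dump registered values */
        }

        VLC_Unregister(&fact);           /* variables no longer tracked */
        VLC_Unregister(&curr);
        VLC_Unregister(&i);

        printf("%d factorial is %f", fact, curr);

        VLC_Shutdown();                  /* tear down the environment */
        return 0;
    }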
Checkpoint Portability The checkpoint file itself must have a portable format to achieve portability across heterogeneous platforms. Storing the checkpoint in a simple binary file can result in incompatibility if the destination platform has different “bitness” (32 vs. 64 bits) or endianness (little vs. big). Thus, the checkpoint file format has to be portable.
Figure 6. TALENT migration process.
TALENT uses the HDF5 format [9] through the precompiler checkpointing tool. HDF5 is an open, versatile data model that can represent complex data objects. It is also portable in that it can represent various types of bitness and endianness. Like XML, HDF5 is self-describing. Unlike XML, HDF5 uses a binary format that allows for the efficient parsing of data. Figure 6 illustrates the complete migration process. First, the environment of the application is migrated using container migration. Then, the application itself is checkpointed and resumed on the destination platform. Heterogeneous platforms are illustrated using different colors in the figure. The application box on the destination platform shows a different binary of the same application that is compiled for the platform.
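As an illustration of why HDF5 makes the checkpoint portable, the following minimal sketch stores one restart-relevant variable using the standard HDF5 C API; the file name and dataset layout are assumptions, not CPPC’s actual checkpoint format. Writing with the machine’s native type and reading back with the destination’s native type lets the library perform the endianness and bitness conversion.

    #include <hdf5.h>

    int main(void)
    {
        double curr = 2432902008176640000.0;  /* 20!, as in the example above */
        hsize_t dims[1] = { 1 };

        hid_t file  = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/curr", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* The library records the on-disk representation in the file's
         * self-describing metadata, so a reader on a platform with different
         * endianness or bitness can still recover the value. */
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, &curr);

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }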
4. Implementation
TALENT is implemented using the OpenVZ container-based operating system and the CPPC portable checkpoint compiler.
4.1 Environment Migration
Several operating-system-level virtualization implementations are available, including OpenVZ [16] and LXC [18] for Linux, Virtuozzo [20] for Windows, and Jail [23] for FreeBSD. We have chosen UNIX-like operating systems as our platform. We have also chosen OpenVZ as the container because of its ease of use, stable code and support for second layer network virtualization. In particular, we have used OpenVZ version 2.6.27 and have patched it into different kernels; KVM [8] has been used as the underlying hypervisor. We have implemented and tested TALENT on Intel Xeon 32-bit, Intel Core 2 Quad 64-bit and AMD Opteron 64-bit processors. Also, we have tested TALENT on the Gentoo, Fedora (9, 10, 11 and 12), CentOS (4 and 5), Debian
(4 and 5), Ubuntu (8 and 9) and SUSE (10 and 11) operating systems. In total, we have tested 37 combinations. For environment migration, we do not use the OpenVZ live migration feature because it migrates the processes within the container and causes binary incompatibility. Instead, we migrate the environment by freezing the container, synchronizing the filesystem, migrating the virtual network interface and transferring buffers, IPC and signals. We then substitute the binary of the application built for the destination processor. Note that TALENT can also be implemented across more diverse operating systems such as Windows using the Virtuozzo [20] container. In this case, migrating complex environment features such as signals and IPC requires more effort because they have to be mapped correctly to the destination platform.
4.2 Process Migration
Given the desired requirements enumerated in Section 3.2, TALENT employs the Controller/Precompiler for Portable Checkpointing (CPPC) [22] to save the state of a running program. CPPC is a VLC precompiler implementation. It is capable of storing the program state of a running program in a format that is operating system and hardware independent (HDF5), and then correctly restarting the program on a different platform using the previously-stored state. CPPC is a compiler-assisted checkpointing program that involves four execution phases:

Compiling the Code: The code is compiled on each platform independently.

Configuring the Run: The preferences for checkpointing during a run are configured.

Checkpointing: The run is started and checkpointing of the state occurs automatically.

Restarting the Run: The checkpoint (after being migrated) is resumed on a new platform.
Compiling the Code

CPPC can compile traditional C and Fortran 77 code. It compiles unmodified source code and the programmer does not need to have any knowledge of checkpointing. CPPC automatically determines how and where to checkpoint a program. Section 4.3 shows an example of how this is done. CPPC interfaces with the user code as a precompiler. It uses the Cetus compiler infrastructure [17] to determine the semantic behavior of a program in order to decide where to place checkpointing directives. Once this is determined, the code is re-factored with checkpointing function calls. The re-factored code can then be compiled using a traditional compiler such as cc or gcc.
Configuring the Run

CPPC requires a configuration file in order to run with checkpointing. This file specifies the parameters used for checkpointing, including the frequency of checkpointing, the number of checkpoints to be stored and their storage locations. Although a default file is provided, a user may wish to configure the file based on the expected behavior of the program. For example, the frequency of checkpointing can be increased for critical applications that change frequently so as to capture the most recent state; or the frequency can be decreased for slowly-changing programs to avoid bottlenecks when writing files. Run parameters can be changed directly in the configuration file by modifying the appropriate values or by using command line options when starting a run. Typically, a program will have a suitable configuration specified in the configuration file. However, a user may override the configuration to obtain different behavior by entering a new value for a parameter via the command line. The configuration file can be stored in text or XML formats.

Checkpointing

Checkpoints are stored in a file using the HDF5 format [9]. Since this format is deployed on many platforms, checkpoint files can be stored in a manner that is compatible across a range of architectures and operating systems. Additionally, a CRC-32-based algorithm is supported to verify the integrity of checkpoint files. As stated above, checkpointing is done automatically. The user may change the rate at which checkpointing is performed via the configuration file or the command line. Compiler options also allow programmers to manually specify where checkpointing should occur by adding #pragma directives to the source code. Directives also exist for other CPPC functionality such as indicating code that should be run upon restart for re-initializing data not stored in memory, or for other initialization tasks such as restarting the message passing interface.
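To make the portability of HDF5 storage concrete, the following minimal sketch writes two checkpoint variables to an HDF5 file using the public HDF5 C API. The dataset names and single-value layout are assumptions for illustration, not CPPC's actual checkpoint schema:

    #include "hdf5.h"

    /* Write two checkpoint variables to a portable HDF5 file. HDF5 stores
       the datatype with the data, so bitness and endianness are converted
       transparently when the file is read on a different platform. */
    int write_checkpoint(const char *path, int fact, double curr)
    {
        hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hsize_t dims[1] = {1};
        hid_t space = H5Screate_simple(1, dims, NULL);

        hid_t d1 = H5Dcreate2(file, "fact", H5T_NATIVE_INT, space,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(d1, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, &fact);
        H5Dclose(d1);

        hid_t d2 = H5Dcreate2(file, "curr", H5T_NATIVE_DOUBLE, space,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(d2, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, &curr);
        H5Dclose(d2);

        H5Sclose(space);
        return H5Fclose(file) < 0 ? -1 : 0;
    }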
Restarting from a Checkpoint

After a run has been started and a checkpoint has been recorded, it is possible to restart the run from the last recorded checkpoint. This can be done on the same platform or on a different platform. “Jump” statements are added in the original code at the locations of the checkpoints. Based on the checkpoint file, the jump locations are known upon restart. These jump statements are ignored during the initial run so that the program is executed as if no changes were introduced. In addition to jumping to the appropriate starting location, the checkpoint file contains information about variable values within the program. These are loaded upon restart to ensure that the program resumes in the same state it was in upon checkpointing.
4.3 Code Example
We revisit the factorial code in Figure 4 to illustrate the operation of the checkpointer. First, the code is automatically converted to a markup code using #pragma directives to specify where special CPPC content should be inserted. Figure 7 shows the markup for the factorial code.
    int main(int argc, char **argv)
    {
        int fact;
        double curr;
        int i;

    #pragma cppc init
        fact = 20;
        curr = 1;
    #pragma cppc register ( fact, curr, i )
        for (i = 1; i <= fact; i++) {
    #pragma cppc checkpoint
            curr = curr * i;
        }
    #pragma cppc unregister ( fact, curr, i )
        printf("%d factorial is %f", fact, curr);
    #pragma cppc shutdown
        return 0;
    }
Figure 7. CPPC markup of the factorial program.
Note that the variables to be checkpointed are registered with CPPC. Next, CPPC uses the markup in Figure 7 to create a final version of the code that the C compiler can understand. The final code is not shown here due to its length. However, the concepts involved in generating the code are straightforward. For each checkpoint, line labels are inserted to mark the locations of the checkpoints. The labels are tracked using an array that is populated when CPPC is initialized. Each label is assigned a unique ID based on its location in the array. When a call to checkpoint is made, the appropriate ID is also stored in the checkpoint file. When the program is restarted, the call to initialize re-populates the registered values that were in memory from the previous run. The code then jumps to the appropriate checkpoint label. This is achieved by using a “goto” command to jump to the line in the line label array referenced by the ID stored in the checkpoint file. From here, the program proceeds as normal, continuing to checkpoint at the indicated locations in the program.
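A conceptual sketch conveys the idea of the generated restart logic; this is illustrative C in the spirit of the description above, not CPPC's actual output:

    /* On restart, the (omitted) init call restores the registered
       variables and recovers the label ID recorded in the checkpoint
       file; control then jumps to the corresponding label. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int fact; double curr; int i;
        int restarting = 0;    /* set by the init call on restart        */
        int resume_id = -1;    /* label ID read from the checkpoint file */

        if (restarting && resume_id == 0)
            goto cppc_label_0; /* one goto per entry in the label array */

        fact = 20;
        curr = 1;
        for (i = 1; i <= fact; i++) {
    cppc_label_0:
            if (!restarting) {
                /* checkpoint call: dump registered variables plus ID 0 */
            }
            restarting = 0;    /* resume normal execution after the jump */
            curr = curr * i;
        }
        printf("%d factorial is %f", fact, curr);
        return 0;
    }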
5. Evaluation
We have developed a test application to evaluate the performance of TALENT. The application contains 2,000 lines of C code and a GUI developed using wxWidgets [24].
Figure 8. Optimized filesystem synchronization model.
The graphical output of the application is sent to a remote machine via an SSH connection. Upon receiving a migration request, the application and its GUI are migrated from a Gentoo/Intel Xeon 32-bit machine to an Ubuntu 10.04.1/AMD Opteron 64-bit machine using environment migration and checkpointing.

The original migrations took a long time (about a minute), so we timed the individual elements of migration. After breaking down the delays, we discovered that synchronizing the filesystem took 98.7% of the migration time. This is not surprising because, during a migration, the entire filesystem available to the container must be copied to the destination. As a result, we decided to focus on optimizing the filesystem synchronization.

In the optimized version, the filesystem is synchronized with the destination once before the migration occurs. The synchronization is subsequently performed at periodic intervals by sending the differences to the destination. Figure 8 presents the optimized synchronization model. We chose 30 seconds as the synchronization interval. When a migration is requested, only the differences are sent to the destination. This simple optimization reduces the environment migration time to about one second.

Figure 9 shows the performance of TALENT with and without optimization. Note that quota and configuration refer to checking the resource quotas (CPU, disk, memory, etc.) assigned to each container and verifying the platform configurations, respectively. If the optimization is enabled, then network traffic to the destination platform has to be strictly limited to filesystem updates to prevent the attacker from performing reconnaissance of the destination. During the migration, the graphical output at the remote machine disappears for about two seconds. When the migration is completed, the graphical output reappears on the remote terminal (now running on the second platform) without any user intervention because the state of the SSH connection is preserved.
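A hypothetical sketch of the periodic synchronization loop appears below. The use of rsync, the container path and the destination host name are illustrative assumptions, not TALENT's documented tooling:

    #include <stdlib.h>
    #include <unistd.h>

    /* One full copy up front, then only the differences every 30 seconds,
       so a migration request ships a small delta instead of the whole
       container filesystem. */
    int main(void)
    {
        /* initial full synchronization before any migration request */
        system("rsync -a --delete /vz/private/101/ dest:/vz/private/101/");
        for (;;) {
            sleep(30);  /* the 30-second interval chosen in the evaluation */
            system("rsync -a --delete /vz/private/101/ dest:/vz/private/101/");
        }
    }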
6. Related Work
Several data segment level [6, 26] and variable level process migration techniques [3] have been proposed in the literature. These methods are often used in high-performance and cluster computing systems for load balancing and fault tolerance.
Figure 9. TALENT’s performance with and without optimization.
Virtual machine migration [7] has also been proposed as a cluster administration technique for load balancing, online maintenance and power management. However, it requires a homogeneous architecture and operating system in order to preserve state.

The Self-Cleansing Intrusion Tolerance (SCIT) Project [25] is closely related to TALENT. It migrates an application across different virtual machines to reduce the exposure time. Our work differs from SCIT in a number of ways. First, TALENT preserves the state of the application. SCIT-web server [1] and SCIT-DNS [11] preserve the session information and DNS master file and keys, respectively, but not the internal state of the application. Second, TALENT uses heterogeneous platforms for migration. The designers of SCIT mention the use of diverse operating systems and memory images to further confuse an attacker, but the current implementation uses the same operating system with the same configuration [1]. Finally, TALENT is designed to support general critical infrastructure applications and is not limited to a specific server.

The Resilient Web Service (RWS) Project [12] uses a virtualization-based web server system that detects intrusions and periodically restores the servers to a pristine state. Its successor, RWS-Moving Target (RWS-MT) [13], plans to use diversity to create a moving target, but it only focuses on web servers as the critical application. In addition, both systems lose the state of the web server.
Certain forms of server rotation have been proposed by Blackmon and Nguyen [2] and by Rabbat et al. [21] in an attempt to achieve high-availability servers.
7. Conclusions
The TALENT system provides dynamic, heterogeneous platforms for critical infrastructure applications. It creates a cyber moving target that offers resilience in the face of platform-specific cyber attacks. To the best of our knowledge, TALENT is the first heterogeneous platform solution that preserves the internal state of a general application.

The current TALENT prototype is focused on providing high availability. There is no guarantee that the migrated state (persistent or ephemeral) is not already corrupted. In future work, we plan to extend TALENT by adding sanitization and recovery capabilities. This would provide integrity guarantees for an application under attack. We also plan to augment TALENT with an attack detection engine that can trigger migration. Finally, we plan to integrate TALENT with an assessment framework based on attack graphs [15] so that the destination platform can be selected based on formal vulnerability and reachability analysis.

Note that the opinions, interpretations, conclusions and recommendations in this paper are those of the authors and are not necessarily endorsed by the U.S. Government.
Acknowledgements

This work was sponsored by the U.S. Department of Defense under Air Force Contract FA8721-05-C-0002.
References

[1] A. Bangalore and A. Sood, Securing web servers using self cleansing intrusion tolerance (SCIT), Proceedings of the Second International Conference on Dependability, pp. 60–65, 2009.

[2] S. Blackmon and J. Nguyen, Storage: High-availability file server with heartbeat, System Administration, vol. 10(9), pp. 24–32, 2001.

[3] G. Bronevetsky, D. Marques, K. Pingali and P. Stodghill, Automated application-level checkpointing of MPI programs, ACM SIGPLAN Notices, vol. 38(10), pp. 84–94, 2003.

[4] R. Brown, Stuxnet worm causes industry concern for security firms, Mass High Tech, Boston, Massachusetts (www.masshightech.com/stories/2010/10/18/daily19-Stuxnet-worm-causes-industry-concern-for-security-firms.html), October 19, 2010.

[5] G. Carl, G. Kesidis, R. Brooks and S. Rai, Denial-of-service attack detection techniques, IEEE Internet Computing, vol. 10(1), pp. 82–89, 2006.
[6] Y. Chen, K. Li and J. Plank, CLIP: A checkpointing tool for message passing parallel programs, Proceedings of the ACM/IEEE Conference on Supercomputing, p. 33, 1997.

[7] C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt and A. Warfield, Live migration of virtual machines, Proceedings of the Second Conference on Symposium on Networked Systems Design and Implementation, vol. 2, pp. 273–286, 2005.

[8] I. Habib, Virtualization with KVM, Linux Journal (www.linuxjournal.com/article/9764), February 1, 2008.

[9] HDF Group, HDF4 Reference Manual, Champaign, Illinois (ftp.hdfgroup.org/HDF/Documentation/HDF4.2.5/HDF425_RefMan.pdf), 2010.

[10] Y. Huang, D. Arsenault and A. Sood, Closing cluster attack windows through server redundancy and rotations, Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, p. 21, 2006.

[11] Y. Huang, D. Arsenault and A. Sood, Incorruptible self cleansing intrusion tolerance and its application to DNS security, Journal of Networks, vol. 1(5), pp. 21–30, 2006.

[12] Y. Huang and A. Ghosh, Automating intrusion response via virtualization for realizing uninterruptible web services, Proceedings of the Eighth IEEE International Symposium on Network Computing and Applications, pp. 114–117, 2009.

[13] Y. Huang, A. Ghosh, T. Bracewell and B. Mastropietro, A security evaluation of a novel resilient web serving architecture: Lessons learned through industry/academia collaboration, Proceedings of the International Conference on Dependable Systems and Networks Workshops, pp. 188–193, 2010.

[14] Industrial Control Systems Cyber Emergency Response Team (ICS-CERT), ICS-ALERT-10-301-01 – Control System Internet Accessibility, Department of Homeland Security, Washington, DC (www.us-cert.gov/control_systems/pdf/ICS-Alert-10-301-01.pdf), October 28, 2010.

[15] K. Ingols, M. Chu, R. Lippmann, S. Webster and S. Boyer, Modeling modern network attacks and countermeasures using attack graphs, Proceedings of the Annual Computer Security Applications Conference, pp. 117–126, 2009.

[16] K. Kolyshkin, Virtualization in Linux, OpenVZ (ftp.openvz.org/doc/openvz-intro.pdf), 2006.

[17] S. Lee, T. Johnson and R. Eigenmann, Cetus – An extensible compiler infrastructure for source-to-source transformation, Proceedings of the Sixteenth International Workshop on Languages and Compilers for Parallel Computing, pp. 539–553, 2003.

[18] lxc Linux Containers, lxc man pages (lxc.sourceforge.net/index.php/about/man).

[19] National Security Council, Cybersecurity Progress after President Obama’s Address, The White House, Washington, DC, July 14, 2010.
[20] Parallels, Clustering in Parallels Virtuozzo-Based Systems, White Paper, Renton, Washington, 2009.

[21] R. Rabbat, T. McNeal and T. Burke, A high-availability clustering architecture with data integrity guarantees, Proceedings of the IEEE International Conference on Cluster Computing, pp. 178–182, 2001.

[22] G. Rodriguez, M. Martin, P. Gonzalez, J. Tourino and R. Doallo, CPPC: A compiler-assisted tool for portable checkpointing of message-passing applications, Concurrency and Computation: Practice and Experience, vol. 22(6), pp. 749–766, 2010.

[23] E. Sarmiento, Securing FreeBSD using Jail, System Administration, vol. 10(5), pp. 31–37, 2001.

[24] J. Smart, K. Hock and S. Csomor, Cross-Platform GUI Programming with wxWidgets, Prentice Hall, Upper Saddle River, New Jersey, 2005.

[25] A. Sood, Intrusion tolerance to mitigate attacks that persist, presented at the Secure and Resilient Cyber Architectures Conference, 2010.

[26] G. Stellner, CoCheck: Checkpointing and process migration for MPI, Proceedings of the Tenth International Parallel Processing Symposium, pp. 526–531, 1996.

[27] U.S. Air Force Chief Scientist, Report on Technology Horizons: A Vision for Air Force Science and Technology During 2010–2030, Volume 1, Technical Report AF/ST-TR-10-01-PR, Department of the Air Force, Washington, DC, 2010.
Chapter 9

AN EVIDENCE-BASED TRUST ASSESSMENT FRAMEWORK FOR CRITICAL INFRASTRUCTURE DECISION MAKING

Yujue Wang and Carl Hauser

Abstract
The availability and reliability of large critical infrastructures depend on decisions made by hundreds or thousands of interdependent entities and, by extension, on the information that the entities exchange with each other. In the electric power grid, the widespread deployment of devices that measure and report readings many times per second offers new opportunities for automated control, which is accompanied by the need to automatically assess the trustworthiness of the received information. This paper proposes a Bayesian estimation model for calculating the trustworthiness of entities in the electric power grid and making trust-relevant decisions. The model quantifies uncertainties and also helps minimize the risk in decision making.
Keywords: Trust assessment, Bayesian estimation, electric power grid
1. Introduction
The U.S. electric power grid is on the cusp of a tremendous expansion in the amount of sensor data available to support its operations. For decades, the power grid has been operated using supervisory control and data acquisition (SCADA) systems that poll each sensor once every two or four seconds – a situation that some in the industry have characterized as “flying blind.” The widespread deployment of sensing systems called phasor measurement units (PMUs) that provide accurately time-stamped data 30, 60 or more times per second is near at hand. By the end of 2013, utilities, with the assistance of the American Recovery and Reinvestment Act of 2009 (ARRA), will have increased the number of these devices on the grid to nearly 1,000, roughly an order of magnitude increase over what exists today. Data from PMUs and other
high-rate sensing devices will support new control schemes for the reliable and efficient operation of the power grid as larger fractions of electric power demand are met by intermittent sources such as wind and solar, and as controllable loads such as electric vehicle rechargers increase. As power grid operations come to increasingly rely on these new control schemes, the availability and integrity of the data, but to some degree confidentiality as well, are of great concern. Good security practices and technologies such as those required by the NERC CIP standards [15] will be more essential to reliable grid operations than they are today.

However, uncertainty is inherent to large-scale systems such as the power grid due to measurement errors (e.g., sensor reading errors) and the stochastic nature of physical processes (e.g., weather conditions). Uncertainties associated with PMUs arise from several factors: (i) PMUs are deployed under the control of various entities, throughout the transmission and distribution systems, that have different management policies and configurations; (ii) a number of cyber attacks on PMUs can impact electric power systems; and (iii) with vastly more data available, it becomes possible to use a subset of data from the most trustworthy sources. For example, when PMU data authentication is performed using a public-key infrastructure, the reliability of the authentication is ultimately limited by the uncertainty of the binding between a particular public key and the authenticated entity. While one might wish that there were no uncertainty, it is, in fact, quite likely that some of the bindings in a large-scale system are incorrectly known at least some of the time by some entities, whether due to error or malicious manipulation.

If uncertainty is indeed unavoidable, the reliability of the system comes down to blind faith – we know that security is uncertain but we have to trust in it because it is all we have – or to decision processes that explicitly and appropriately take into account the uncertainties associated with security. This paper focuses on the latter viewpoint. Since the power grid must be controlled in real time in an ever-changing security threat environment, we are interested in decision models that can be fully automated rather than models that rely on human insight. Because Bayesian decision theory fits well with our desire for a computational solution, our approach uses a Bayesian perspective [19].

The word “trust” is introduced here for its connotations of one party’s (trustor’s) reliance on and belief in the performance of another party (trustee). An example is the trust in a public-key infrastructure (PKI) certifier to correctly bind a public key to some other entity. The reliance or belief often must occur without certainty, or it may take the form of a prediction about the future (itself a source of uncertainty). Trust, though uncertain, need not be blind: trustors can use evidence, in the form of past experience with a trustee, reputation information, or contracts and laws that impose penalties for nonperformance, to form their trust judgments. We believe that if critical infrastructures are to be resilient against attacks, then operational decision making processes must appropriately take into account evidence about the trustworthiness
Figure 1. Credit reporting system.
of their input data. As we will show, using evidence appropriately means that it is considered in the light of the particular decision being made: there is no single approach to judging trust that is universally appropriate. This paper establishes the need for a systematic method of dealing with uncertainty related to trust in the context of control systems. It presents a theoretical framework based on Bayesian decision theory that addresses this need by incorporating trust-related evidence.
2. Motivating Example
The consumer credit reporting and scoring system provides an interesting analogy for evidence-based decision making in the presence of risk. As shown in Figure 1, credit bureaus collect information from various sources and provide credit reports that detail individual consumers’ past borrowing and bill paying behaviors. Some companies further analyze the information in credit reports from multiple sources to produce a single, numerical credit score based on the statistical analysis of an individual’s credit reports. The credit score is claimed to statistically represent the creditworthiness of an individual. Now consider the decisions that lenders make in analyzing a loan application. They have to decide whether or not to make the loan and the terms under which the loan should be made. If the loan is made, then the lender stands to make a profit if the borrower pays it back, or the lender makes a loss if the borrower defaults on the payment. A loss function describes the lender’s payback for various future behaviors of the borrower. While the loss function is known, the future behavior of the borrower is, of course, uncertain at the time the loan is made. The lender thus seeks to make a decision that minimizes the expected
loss (maximizes the expected return) by assessing the probabilities of different future borrower behaviors. This is done by considering the credit report or credit score as well as information about employment, income and stability of residence that is contained in the loan application. There are several important points to note in this analogy. First, different lenders have different loss functions, and a single lender may have different loss functions for different kinds of loans: trust decisions are situational. In the power grid domain, a decision to turn off electric car charging at a time when the power supply is stressed carries different loss implications than a decision to shed load by turning off power to an entire region. Second, different lenders may assess the probability of various borrower behaviors differently based on the same credit report facts: trust decisions are subjective. Third, the analogy is imperfect. In the case of lending, risk pooling allows businesses to balance losses from some loans with profits from others, so decisions take into account not only an individual loan but a portfolio of loans. The consequences of decisions related to power grid operations cannot be easily aggregated, so the decision processes generally emphasize the analysis of individual decisions. Thus, there are similarities and differences between the two domains. However, the structure is basically the same: the trustor collects evidence about a trustee and uses it to probabilistically predict the behavior of the trustee according to a model. The trustor may make decisions that, in hindsight, seem wrong, but the decisions are, nevertheless, the best that could be made at the time based on the available information.
3. Preliminaries
The distributed control system for a large-scale critical infrastructure (such as the electric power grid) can be described abstractly as consisting of a collection of controllers, a collection of data sources and a collection of actuators. Controllers and actuators share the essential characteristics related to the uncertainty of security, so we focus primarily on controllers and information sources. In the power grid, for example, controllers include protective relays, automatic generator controls, remedial action schemes, etc. Data sources include sensors, human operators and controller outputs. Communication channels link data sources to controllers. The essential property of controllers is that they receive inputs from data sources and repeatedly make decisions based on the data, with the decisions ultimately being reflected in an action that changes the physical state of the grid. Because of the noise in sensor outputs, system inputs are assumed to be probabilistically related to the actual state of the sensed world by considering that each measurement corresponds to the actual state plus a normally-distributed
noise term. System failures can lead to bad inputs (highly improbable in the model with normally-distributed noise), which can often be detected and excluded by bad-data detection algorithms that exploit redundancy present in the inputs. Several researchers have studied how input data streams might be intentionally attacked in a manner that is invisible to the bad-data detectors that are in use today (see, e.g., [14]). The approach described in this paper is, at a high level, aimed at providing controllers with the ability to evaluate evidence from a variety of sources regarding the correctness of data received from sensors and the ability of actuators to carry out the commanded actions. The uncertainties associated with these aspects and the outcomes are modeled probabilistically, albeit with much greater flexibility than afforded by current approaches that assume normallydistributed noise, and with the explicit incorporation of uncertain results in the form of loss functions.
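In the conventional state estimation formulation (standard background, not a contribution of this paper), this measurement model is commonly written as

$$ z = h(x) + e, \qquad e \sim \mathcal{N}(0, R), $$

where $z$ is the vector of sensor readings, $h$ the (possibly nonlinear) measurement function, $x$ the true system state and $R$ the noise covariance; bad-data detectors flag measurements whose residuals $z - h(\hat{x})$ are improbable under this model.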
4. Bayesian Decision Model
Decision theory studies the values and uncertainties related to making rational and optimal decisions [11]. Statistical theory has been widely applied to decision making problems [17]. Our method is based on the Bayesian statistical paradigm, which quantifies the uncertainty of decisions using personal probability [16]. Interested readers are referred to [19] for a systematic introduction to Bayesian decision theory.

As previously noted, uncertainty is inherent in complex systems. Thus, risk, which is a state of uncertainty where some of the possibilities involve a loss, catastrophe or other undesirable outcome, is unavoidable. In order to reduce risk, every entity in a system should have the ability to incorporate evidence about the trustworthiness of other entities and be inclined to rely on entities that are (more) trustworthy.

To formalize this view, we assume that there are a number of trust-related attributes $E = (E_1, E_2, \ldots, E_p)$ concerning each entity in the system, which together form the trust evidence. By focusing on a single entity $A$ at a certain point in time, it is possible to collect the current evidence about an entity $B$, which is denoted as $x_i = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p) \in \mathbb{R}^p$. Over a period of time, a number of $x_i$s, denoted by $x = (x_1, x_2, \cdots, x_n)$, would be collected. Based on $x$, $A$ makes a decision $d \in D$ (where $D$ is the decision space) on $B$ in the light of $A$'s estimate of the value of $\theta$ ($0 \le \theta \le 1$) from the parameter space $\Theta$, which is the trustworthiness that is placed on $B$. Essentially, $\theta$ is the probability that $B$ is trustworthy.

In the proposed model, the decision making process is considered as a choice of action made by the decision maker from among a set of alternatives according to their possible consequences. In the electric power grid, these decisions are made under uncertainty, i.e., the decision maker can neither know the exact consequence of a chosen decision before it occurs nor obtain accurate values of the evidence due to system complexity and uncertainty. Probabilistic modeling is a natural choice for interpreting the evidence $E$ and evaluating the consequences.
The model should not only incorporate the available information in $E$, but also the uncertainty of this information. In the probabilistic model, $x_i$ ($1 \le i \le n$) follows a probability distribution $f_i$,

$$ x_i \sim f_i(x_i \mid \theta, x_1, \cdots, x_{i-1}) $$

on $\mathbb{R}^p$, where $f_i$ is known but $\theta$ is unknown. If $x$ is collected over a short enough period of time, it is reasonable to assume that $x_1, x_2, \cdots, x_n$ are independent repeated trials from identical distributions and that the distribution can be denoted as:

$$ x \sim f(x \mid \theta). $$

The likelihood function $l$ is defined as:

$$ l(\theta \mid x) = f(x \mid \theta). $$

The likelihood function $l$ is equal to $f$, but it emphasizes that $\theta$ is conditional on $x$ and manifests that $\theta$ can be inferred from $x$. According to our assumptions and the likelihood principle [3], all the information available to make an inference about $\theta$ is contained in the likelihood function $l(\theta \mid x)$. Decisions can then be made based on the inferred value of $\theta$. To combine these processes, when the likelihood function $l(\theta \mid x)$ is fixed, a function from $\mathcal{X}$ to $D$ can be obtained as $\delta(x)$, which is called the decision rule as it relates to trust.

Note that trustworthiness assessment is only one aspect of the overall decision process. Decisions are made according to the inferred trustworthiness value, but trustworthiness evaluation is not the end goal. Next, we describe the elements involved in a Bayesian determination of the decision rule $\delta(x)$, namely prior distributions and loss functions, and proceed to specify the derived rule.
4.1 Modeling Prior Information
As previously noted, trust decisions are subjective: based on the very same evidence, different trustors may make different decisions. In the Bayesian model, the uncertainty of a trustor about the trustworthiness value $\theta$ of a trustee before receiving evidence is modeled using a prior probability distribution $\pi(\theta)$ on $\Theta$. Subjectivity of trust is naturally modeled by different prior distributions.
4.2 Loss Function
While it is easy to talk about making “good” decisions, the model requires a precise formalization of the notion of goodness. All the possible choices in a decision should be ordered or quantified. Decision theory uses a loss function for this purpose. The loss function is a function $L \ge 0$ from $\Theta \times D$ to $\mathbb{R}_+$, which represents the penalty $L(\theta, d)$ associated with the decision $d$ when the parameter takes the value $\theta$. In our case, the penalty $L(\theta, d)$ is the quantified consequence at the time the decision is made when the trustee's trustworthiness value is $\theta$ and the trustor chooses decision $d$. However, it is very hard to measure
the trustworthiness value of a trustee in a complex system due to the dynamic and fuzzy nature of trust [6]. Therefore, it is important that the model can reflect this uncertainty. A simple way to obtain the loss is to integrate over all possible values of $\theta$. Moreover, instead of focusing on one decision, our goal is to assess a decision rule $\delta(x)$, which is the allocation of a decision to each outcome $x \sim f(x \mid \theta)$. Thus, the loss function $L(\theta, \delta(x))$ should also be integrated over $\mathcal{X}$, the entire space of $x$. Given the prior distribution $\pi(\theta)$ and the distribution $f(x \mid \theta)$ of $x$, $\theta$ should be integrated in proportion to $\pi(\theta)$ and $x$ in proportion to $f(x \mid \theta)$. Thus, the loss function can be written as:

$$ r(\pi, \delta) = E^{\pi}[R(\theta, \delta)] = \int_{\Theta} \int_{\mathcal{X}} L(\theta, \delta(x))\, f(x \mid \theta)\, dx\, \pi(\theta)\, d\theta $$

where $r(\pi, \delta)$ is called the risk function of $\delta$.
4.3 Bayesian Estimation
The goal of the decision making model is to derive an “optimal” decision rule that provides trustors with rational decisions about trustees based on observations (evidence) $x$. Optimality is implemented by minimizing the risk function $r(\pi, \delta)$. The decision maker follows the decision rule that gives the smallest risk. However, the trustworthiness value $\theta$ is often unknown, so a problem arises regarding the situation under which the risk function is minimized. A common choice is the minimax rule, which chooses the $\tilde{\delta}$ that satisfies the equation:

$$ \sup_{\theta} r(\theta, \tilde{\delta}) = \inf_{\delta}\, \sup_{\theta} r(\theta, \delta). $$
The minimax rule also fits our original intention to make decisions that reduce the risk of trustors under uncertainty. As an implementation of the likelihood principle, the Bayesian paradigm satisfies the decision-related requirements for trust assessment. It not only quantifies the uncertainties and minimizes the decision making risk, which is crucial when making rational decisions, but it also smoothly incorporates the trustors’ prior information about the trustees’ trustworthiness. This is essential when the decision process is viewed in the context of long-term system operation: trustors continuously acquire new evidence that must be combined with their prior information when making new decisions.
5. Illustrative Example
This section illustrates the application of the decision making model. It examines the simplified decision making case involving the inference of the trustworthiness value of a trustee based on the observation $x$, for which $D = \Theta$. The evidence aggregator of the trustor collects values of the related attributes $E = (E_1, E_2, \ldots, E_p)$ and stores the values in the corresponding vector $x_i = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p)$.
Figure 2. Cumulative distribution functions of three prior distributions.
Within a short time $T$, the evidence aggregator collects the $n$ vectors forming $x = (x_1, x_2, \cdots, x_n)$. Since the time $T$ is short, we assume that $x_1, x_2, \cdots, x_n$ are independent repeated trials from identical distributions $f$. According to the probabilistic model, the values of the attributes are conditional on the trustworthiness value $\theta$, so the distribution can be denoted as $f(x \mid \theta)$.

As mentioned before, trust is subjective. For example, risk-averse trustors may tend to make negative decisions and risk-seeking trustors may tend to make positive decisions. The differences among trustors could be attributed to many factors. For instance, the difference might be due to the previous experience of trustors. A positive experience on the part of a trustor means that the trustor has made many correct decisions regarding trustworthy entities and makes the trustor more risk-seeking. Conversely, negative experience, which means that a trustor has made wrong decisions and trusted the wrong entities, makes the trustor more cautious. For one-dimensional evidence, this type of subjectivity can be modeled using a Beta distribution with parameters $\alpha$ and $\beta$ as the prior distribution of trustors. Let $\alpha$ be the number of past negative experiences and $\beta$ be the number of past positive experiences. The prior information of trustors can be modeled as:
$$ \pi(\theta) = \mathrm{Beta}(\alpha, \beta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{\int_0^1 t^{\alpha-1} (1-t)^{\beta-1}\, dt} $$

where $\pi(\theta)$ is the probability that the trustor decides to trust the trustee. Increasing the value of $\alpha$ makes the trustor more risk-averse and increasing $\beta$ makes the trustor more risk-seeking.

As shown in Figure 2, a decision maker with $\alpha = 8$ and $\beta = 2$ (top line) tends to make negative trust decisions since the probability is high that trustworthiness
values under 0.5 are allocated. On the other hand, a decision maker with $\alpha = 2$ and $\beta = 8$ (bottom line) is more likely to make positive decisions.

In this simplified example, since it is only necessary to estimate the value of $\theta$, we select a commonly-used simple loss function, the quadratic loss function:

$$ L(\theta, \delta) = (\theta - \delta)^2. $$

The risk function is:

$$ r(\pi, \delta) = \int_{\Theta} \int_{\mathcal{X}} (\theta - \delta)^2\, f(x \mid \theta)\, dx\, \pi(\theta)\, d\theta. $$

The corresponding computed estimator is:

$$ \delta(x) = \frac{\int_{\Theta} \theta\, f(x \mid \theta)\, \pi(\theta)\, d\theta}{\int_{\Theta} f(x \mid \theta)\, \pi(\theta)\, d\theta}. $$
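As a minimal worked instance (assuming one-dimensional binary evidence with a Bernoulli likelihood, an assumption beyond the model as stated): if $x$ records $s$ positive outcomes in $n$ independent interactions, each positive with probability $\theta$, and the prior has the Beta form above with generic parameters $a$ and $b$, then the integrals reduce to Beta integrals and the estimator is the posterior mean:

$$ \delta(x) = \frac{\int_0^1 \theta \cdot \theta^{s+a-1} (1-\theta)^{n-s+b-1}\, d\theta}{\int_0^1 \theta^{s+a-1} (1-\theta)^{n-s+b-1}\, d\theta} = \frac{a+s}{a+b+n}. $$

Each new interaction thus moves the estimate smoothly between the prior mean $a/(a+b)$ and the empirical frequency $s/n$.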
6. Related Work
The issue of trust is drawing increasing attention in the information security community. In 1996, Rasmusson and Jansson [18] examined the relationship between security and social control, and classified security mechanisms as: “soft security” such as trust and reputation systems, and “hard security” such as authentication and access control. Most security mechanisms include some aspects of trust, but they make implicit “trust assumptions” [7]. In order to overcome the drawbacks of current security mechanisms such as the inadequacy of authentication [5], a more general concept of trustworthiness should be engaged [1].

Trust management is largely associated with inference and decision making. Related evidence should be collected first and delivered to the trust management system as an input to the decision making model. Several trust management systems, such as PolicyMaker [5], KeyNote [4] and REFEREE [8], have been designed to collect security credentials and test the compliance of the credentials with security policies. Also, some trustworthiness computing models (e.g., [10]) collect trustors’ prior experience as evidence and make predictions based on the experience. Some models collect evidence from other entities – these are essentially reputation systems [12, 13]. Generally, however, trust management systems and trustworthiness computing models [2, 9] attempt to determine a numerical trustworthiness value for a trustee or make a binary decision about whether or not a trustee is trustworthy. Our approach goes beyond this view by focusing on trust decision making as coupled with succeeding decision processes.
7. Conclusions
The Bayesian paradigm provides an elegant framework for incorporating trust into decision making processes associated with the control of large-scale
critical infrastructure systems. The risk function, prior distribution and the distribution of evidence are three key components of the Bayesian paradigm. A prior distribution, which models the subjectivity of trustors, is combined with newly-acquired evidence and the derived Bayes risk function to obtain a decision rule by minimizing the risk function.

Although the mathematical structure of the framework is straightforward, its applicability depends on gaining experience with the kinds of data that are available in critical infrastructure systems and what the data say about trustworthiness. While it may not be clear, for example, what a particular ratio of good/bad past experience means for a particular decision, the framework shows what should be done with such data when it is collected.

Note that the views and opinions in this paper are those of the authors and do not necessarily reflect those of the United States Government or any agency thereof.
Acknowledgements

This research was partially funded by Department of Energy Award No. DE-OE0000097 (TCIPG) and by matching funds provided by Washington State University.
References

[1] A. Abdul-Rahman and S. Hailes, A distributed trust model, Proceedings of the New Security Paradigms Workshop, pp. 48–60, 1997.

[2] A. Abdul-Rahman and S. Hailes, Supporting trust in virtual communities, Proceedings of the Thirty-Third Annual Hawaii International Conference on System Sciences, vol. 6, 2000.

[3] J. Berger, Statistical Decision Theory and Bayesian Analysis, Springer, New York, 1985.

[4] M. Blaze, J. Feigenbaum and A. Keromytis, KeyNote: Trust management for public-key infrastructures, Proceedings of the Sixth International Workshop on Security Protocols, pp. 59–63, 1998.

[5] M. Blaze, J. Feigenbaum and A. Keromytis, The role of trust management in distributed systems security, in Secure Internet Programming (LNCS 1603), J. Vitek and C. Jensen (Eds.), Springer, Berlin, Germany, pp. 185–210, 1999.

[6] E. Chang, P. Thomson, T. Dillon and F. Hussain, The fuzzy and dynamic nature of trust, Proceedings of the Second International Conference on Trust, Privacy and Security in Digital Business, pp. 161–174, 2005.

[7] B. Christianson and W. Harbison, Why isn’t trust transitive? Proceedings of the International Workshop on Security Protocols, pp. 171–176, 1997.

[8] Y. Chu, J. Feigenbaum, B. LaMacchia, P. Resnick and M. Strauss, REFEREE: Trust management for web applications, Computer Networks and ISDN Systems, vol. 29(8–13), pp. 953–964, 1997.
[9] W. Conner, A. Iyengar, T. Mikalsen, I. Rouvellou and K. Nahrstedt, A trust management framework for service-oriented environments, Proceedings of the Eighteenth International Conference on the World Wide Web, pp. 891–900, 2009.

[10] M. Denko, T. Sun and I. Woungang, Trust management in ubiquitous computing: A Bayesian approach, Computer Communications, vol. 34(3), pp. 398–406, 2011.

[11] S. French, Decision Theory: An Introduction to the Mathematics of Rationality, Ellis Horwood, New York, 1986.

[12] A. Josang, R. Ismail and C. Boyd, A survey of trust and reputation systems for online service provision, Decision Support Systems, vol. 43(2), pp. 618–644, 2007.

[13] S. Kamvar, M. Schlosser and H. Garcia-Molina, The EigenTrust algorithm for reputation management in P2P networks, Proceedings of the Twelfth International Conference on the World Wide Web, pp. 640–651, 2003.

[14] Y. Liu, P. Ning and M. Reiter, False data injection attacks against state estimation in electric power grids, Proceedings of the Sixteenth ACM Conference on Computer and Communications Security, pp. 21–32, 2009.

[15] North American Electric Reliability Corporation, Cyber Security Standards CIP-002-4 through CIP-009-4, Washington, DC (www.nerc.com/page.php?cid=2—20), 2011.

[16] A. O’Hagan, Bayesian statistics: Principles and benefits, in Bayesian Statistics and Quality Modeling in the Agro-Food Production Chain, Volume 3, M. van Boekel, A. Stein and A. van Bruggen (Eds.), Springer, Berlin, Germany, pp. 31–45, 2004.

[17] J. Pratt, H. Raiffa and R. Schlaifer, Introduction to Statistical Decision Theory, MIT Press, Cambridge, Massachusetts, 1995.

[18] L. Rasmusson and S. Jansson, Simulated social control for secure Internet commerce, Proceedings of the New Security Paradigms Workshop, pp. 18–25, 1996.

[19] C. Robert, The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, Springer, New York, 2007.
Chapter 10

ENHANCING THE USABILITY OF THE COMMERCIAL MOBILE ALERT SYSTEM

Paul Ngo and Duminda Wijesekera

Abstract
The U.S. Department of Homeland Security initiated the Commercial Mobile Alert System (CMAS) to inform the general public of emergencies. CMAS utilizes the commercial telecommunications infrastructure to broadcast emergency alert text messages to mobile users in an area affected by an emergency. Because CMAS uses cell broadcast service, the smallest area that CMAS can broadcast messages is a cell site, which is usually quite large for local emergencies. This paper proposes an enhancement that uses CMAS as a transport protocol to distribute local emergency alerts to areas smaller than a cell site. The paper also conducts an investigation of the Common Alerting Protocol (CAP), the current emergency protocol standard, and suggests an enhancement to the CAP message structure for CMAS emergency alerts. The viability of the approach is demonstrated using a prototype implementation, which simulates broadcasts of emergency alerts to confined areas such as a city block or an apartment complex.
1. Introduction

Emergencies that require the services of specially-trained emergency personnel range from trivial injuries to major disasters such as flash floods, hurricanes, earthquakes, forest fires, tsunamis and terrorist attacks. Due to the unique characteristics of emergency situations, preparing for, responding to and recovering from them is always a challenge. In the aftermath of the terrorist attacks of September 11, 2001, communications was identified as a major bottleneck for emergency activities. The telecommunications infrastructure experienced an extremely high overload of calls into and out of the targeted areas, which caused congestion at access points and in the core networks. While many calls were blocked and rejected, mobile
users were still able to text message each other [21]. However, text messaging did not see much use at the time because it was still rather expensive.

In 2006, the U.S. Federal Government enacted the Warning, Alert and Response Network (WARN) Act, which supported research and development related to the Commercial Mobile Alert System (CMAS). CMAS is designed to use the existing commercial telecommunications infrastructure to broadcast emergency alerts and warnings to specified geographic areas. However, the smallest granularity of a CMAS broadcast is a cell, which is too large for small-scale emergencies whose effects are confined, such as a major car accident or a burning building.

While CMAS is still in the design and development phases, we have identified two shortfalls in the CMAS cell broadcast service specification [18]. One is that CMAS cannot send alerts to confined areas. The other is that the protocol standard does not provide enough information to facilitate effective emergency communications. We address the first limitation by constructing an Android mobile application that filters CMAS alerts using global positioning system (GPS) data and only displays the alerts to users who are inside the affected area. The second limitation is addressed by enhancing the message structure of the Common Alerting Protocol (CAP) version 1.2 [19] by adding XML tags to facilitate emergency communications.
2. SMS for Emergency Handling
Informing the general public about emergencies has been a problem for many years. Traditionally, television and radio network broadcasts have been used to alert the public to emergencies. However, television and radio only reach a small percentage of the population at any given time. Americans aged fifteen years and older spend only 6.12% of their time watching television [5]. Americans aged twelve years and older spend 11.2% and 7% of their time listening to commercial radio [3] and National Public Radio (NPR) [20], respectively.

On the other hand, more than 91% of the U.S. population subscribes to a wireless service [8], which is an extremely effective means of communicating information to the general public. Users carry their wireless devices almost everywhere they go. With the proliferation of Internet services, emergency information can be published on news websites and delivered right to a user’s wireless device after a simple registration process. Moreover, various alert settings such as the vibration mode and special ringtones can attract attention almost immediately.

A survey conducted after the 9/11 terrorist attack in Washington, DC revealed that, although telephone service into and out of the area was highly congested, people were still able to use text messaging to communicate with each other [21]. Ten years later, Short Message Service (SMS) is an extremely popular mode of communication, especially among the younger generation [4]. SMS used to be expensive in the past – as much as 20 cents per text message sent or received. However, at this time, many service providers offer SMS as a low-cost or free service to attract new customers.
3. CMAS Overview
Recognizing the importance of SMS [6], the U.S. Department of Homeland Security initiated the Commercial Mobile Alert System (CMAS) [15] to broadcast emergency alert text messages to the general public. Service providers and equipment vendors are actively involved in defining CMAS standards and implementations. Unlike sender-to-receiver SMS, CMAS uses a dedicated primary broadcast control channel (BCCH) to broadcast text messages, which can reach millions of wireless subscribers in minutes. CMAS is still in the design and development phase. However, it inherits some weaknesses due to its reliance on cellular broadcast service and the existing emergency communications protocol:

CMAS alert messages cannot be broadcast to an area smaller than a cell site, which is defined in the Federal Information Processing Standard (FIPS) code [14]. The area covered by a cell site varies according to population density, but it is too large for broadcasting alerts about small-scale emergencies.

According to the CMAS specification [2], the Common Alerting Protocol (CAP 1.2) is to be used to communicate CMAS emergency alerts. However, CAP 1.2 was designed for emergency communications between government entities of various levels (federal, state and local departments and agencies). Consequently, much of the CAP 1.2 message structure is not relevant to emergency mobile broadcasting. Also, the information carried in the CAP 1.2 message structure does not fully address the essence of local emergencies.

CMAS is designed to disseminate three types of alerts: Presidential alerts, imminent threat alerts and America’s Missing: Broadcast Emergency Response (AMBER) alerts [23]. CMAS is not designed to broadcast alerts about local (i.e., small-scale) emergencies.
4. Cell Broadcast Service for Local Emergencies
This section describes the CMAS enhancement that enables the delivery of alert messages to areas smaller than a cell site or FIPS code equivalent.
4.1 Enhancement Considerations
In 2003, OASIS sponsored the Common Alerting Protocol (CAP) initiative with the objective of providing fundamental messaging protocols to facilitate interagency emergency communications. The OASIS Technical Committee on Emergency Management has developed a set of standards [17, 18] that involve a set of XML tags for exchanging information needed to handle emergencies. However, CAP is intended for communications involving emergency responders, operators and other officials; it was not designed to broadcast emergency alerts
to a large population. For example, the Sender ID value in CAP is not relevant to members of the public who would receive broadcast alert messages. Our investigation of the technical specifications of the GSM and UMTS cell broadcast service [1] revealed that there are no features or options that could be configured to support the broadcast of alert messages to an area smaller than a cell site. However, given the computing power and functionality provided by modern mobile devices, we opted to build an emergency response application (ERApp) to intercept CMAS alert messages and filter them based on the proximity of users to the location of an emergency. To accomplish this, it was necessary to enhance the CAP message structure to include additional tags. Interested readers are referred to [16] for our preliminary work on enhancing the CAP message structure.
4.2 CMAS Alert Message Structure
A CMAS alert message can be sent in one of two ways. The first is to send the message as raw text, where the values are separated by delimiters. However, this method does not provide enough space to squeeze the necessary information into the message. The second method uses the CAP 1.2 emergency messaging standard [19]. However, a CAP message is in XML format, which requires additional space to store an XML alert message. Figure 1 shows the CAP 1.2 alert message structure.

As mentioned above, CAP 1.2 was designed for emergency communications between officials. For example, when a tornado strikes Louisville, Kentucky, the City of Louisville sends an alert message to the State of Kentucky requesting resources for rescue operations. Officials at the state’s emergency operations center then verify that the message is not a hoax. The verification is performed using data such as the message id, sender id, date/time sent, scope, etc. On the other hand, a mobile user receiving a CMAS alert message would not need any data for message verification. Instead, the user would require information about the emergency such as its nature, location and the precautions to be taken. Much of the information in the CAP 1.2 schema is not relevant to CMAS alert messages.

To address this limitation, we have enhanced the CAP 1.2 message structure for CMAS alert messages. Figure 2 shows the enhanced message structure. We focused on extracting relevant information from the existing emergency messaging standard CAP 1.2 that would be beneficial to mobile users. Also, we created three tags: affected area, spreadable and location. The affected area tag holds the radius of the affected area measured in meters from the emergency location. The value would depend on the specific emergency and would be entered by the emergency coordinator. All mobile users within the affected area would receive CMAS alert messages. The spreadable tag is important in environmental emergencies such as pollution alerts, toxic gas releases, and biological or radiological attacks. The alert might urge users to seek shelter and use gas masks. The location tag holds the latitude and longitude of the emergency, which is easily shown on a map.
Figure 1. CAP 1.2 alert message structure.
Figure 3 shows a sample CMAS mobile message corresponding to a tornado alert.
4.3 Enhancement for Small-Scale Emergencies
Enhancing the cell broadcast service itself is a complex and time-consuming task. However, given the computing power and functionality provided by modern mobile devices, it is simpler, far less expensive and just as effective to create an emergency response application (ERApp) that would assist mobile users in times of emergency. ERApp is a mobile application (app) written in Java for the Google Android API 2.2 platform. The current version of ERApp intercepts CMAS alert messages and filters them based on the proximity of users to the location of an emergency. It is designed to work in conjunction with ERAlert, a CMAS prototype implementation that simulates the broadcasting of CMAS alert messages to mobile phones using the email texting feature.
Figure 2. Enhanced CMAS CAP 1.2 alert message structure.
    <identifier>CMAS-01</identifier>
    <status>Actual</status>
    <category>Met</category>
    <urgency>Expected</urgency>
    <severity>Severe</severity>
    <certainty>Observed</certainty>
    <expires>2010-10-02T17:00:00-0500</expires>
    <description>Multiple tornados are expected around 2PM
      in the Washington, DC area.</description>
    <affectedArea>1000</affectedArea>
    <spreadable>No</spreadable>
    <event>Tornado</event>
    <location>38.882334,-77.171091</location>

Figure 3. CMAS mobile alert message (tornado example).
When a mobile user launches ERApp, it: (i) updates the local repository; and (ii) retrieves and displays the user’s current GPS location on a map on the mobile device. The GPS location of the user is updated periodically based on time and distance, which are set to default values of 60 seconds and 50 meters, respectively.
Figure 4. Typical emergency scenario.
The user can adjust these parameters to reduce the frequency of GPS location queries in order to conserve power. Under the default settings, the location update is triggered when the 60-second time period has expired or the user has moved 50 meters from the last recorded location. GPS data is used to estimate the movement of the user, assuming that the user carries the device all the time. Thus, a user headed towards the emergency location could be advised to leave the area immediately.

ERApp receives an alert message from the ERAlert system in a byte-encoded XML format. ERApp decodes the message into its original XML form. It then locates and displays the emergency on the map using the location tag information. In addition, ERApp displays the user’s current position with respect to the location of the emergency.

If the mobile device is within the affected area, ERApp alerts the user according to the device alert settings (e.g., vibration mode or ringtone) and displays the alert message. The user must acknowledge the alert within a configurable time period or ERApp will continue to alert the user about the emergency. If the user is outside the affected area and ERApp detects that the user is moving towards the affected area, it alerts the user and displays the alert message. Upon acknowledging the alert, the user can view his/her current position on the map along with the location of the emergency and the affected area (Figure 4).
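A sketch of the proximity test that such filtering requires is shown below. The haversine formula is a standard way to compute the great-circle distance between two latitude/longitude points; the function and constant names are illustrative, not ERApp's actual code:

    #include <math.h>

    #define EARTH_RADIUS_M 6371000.0

    static double deg2rad(double d) { return d * M_PI / 180.0; }

    /* Great-circle (haversine) distance in meters between two points. */
    double distance_m(double lat1, double lon1, double lat2, double lon2)
    {
        double dlat = deg2rad(lat2 - lat1), dlon = deg2rad(lon2 - lon1);
        double a = sin(dlat / 2) * sin(dlat / 2) +
                   cos(deg2rad(lat1)) * cos(deg2rad(lat2)) *
                   sin(dlon / 2) * sin(dlon / 2);
        return 2.0 * EARTH_RADIUS_M * atan2(sqrt(a), sqrt(1.0 - a));
    }

    /* Returns 1 when the user is inside the alert's affected area, i.e.,
       within affected_area_m meters of the emergency location. */
    int inside_affected_area(double user_lat, double user_lon,
                             double event_lat, double event_lon,
                             double affected_area_m)
    {
        return distance_m(user_lat, user_lon, event_lat, event_lon)
               <= affected_area_m;
    }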
5. Prototype Implementation
Our prototype system is implemented in Java. The client GUI was developed as a Java Applet. The service engine was developed in Java Web Services running on a JBOSS Community Server 5.1.0 GA with MySQL Server 5.1 as the backend database. The entire ERAlert project was developed using NetBeans IDE 6.9.1. All the Java archive (JAR) files are self-signed with our own certificate to ensure that the files come from a trusted source. Cell broadcast service has been a standard for many years, but it has not yet been implemented by U.S. carrier networks. Therefore, in order to test our
prototype, the only option was to simulate the cell broadcast service functionality. To simplify the testing process, we used the email texting feature offered by service providers. In the following sections, we describe the various phases involved in the operation of our system. The emergency scenario considered is a tornado touchdown in Arlington, Virginia.

Figure 5. ERAlert login screen.

Figure 6. ERAlert map screen.
5.1 User Login
The user is registered with the ERAlert system by an emergency coordinator. The registration process involves recording the user's information and saving the PKI keys. The PKI keys are used to encrypt and decrypt messages between the web interface and the backend server to mitigate security threats such as man-in-the-middle attacks and password hijacking attacks. The PKI certificate is saved on the local system in our prototype merely to demonstrate that two-factor authentication is implemented. However, in a deployed system, the PKI certificate would be stored on a common access card (CAC) or on a secured universal serial bus (USB) thumb drive. By default, passwords are required to have two lowercase alpha characters, two uppercase alpha characters, two numeric characters and two special characters. The password restrictions can be strengthened or weakened depending on the requirements of the deployed system. Figure 5 shows the login screen for a registered user. The user enters his user id and password and clicks the Login button. The system then prompts the user to locate his PKI-key file. Once the user locates the file, he clicks the Open button. Upon successful login, the ERAlert map screen is displayed as shown in Figure 6.
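The default password policy lends itself to a simple check. The following Java sketch enforces the character-class minimums described above; the class and method names are hypothetical and do not come from the ERAlert code.

public final class PasswordPolicy {

    /** Requires at least two lowercase, two uppercase, two numeric and two special characters. */
    static boolean satisfiesDefaultPolicy(String password) {
        int lower = 0, upper = 0, digit = 0, special = 0;
        for (char c : password.toCharArray()) {
            if (Character.isLowerCase(c)) lower++;
            else if (Character.isUpperCase(c)) upper++;
            else if (Character.isDigit(c)) digit++;
            else special++;                  // anything else counts as special
        }
        return lower >= 2 && upper >= 2 && digit >= 2 && special >= 2;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesDefaultPolicy("aB3$xY7!"));   // true
        System.out.println(satisfiesDefaultPolicy("password1"));  // false
    }
}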
Figure 7. ERAlert CMAS alert message.

Figure 8. Login error message.

5.2 Local Alert Generation
During an emergency, the emergency coordinator enters the event location and clicks the Go button, which pinpoints the emergency location on the map. The coordinator then enters all the necessary information about the emergency such as alert type, event type, message, spreadable nature, affected area, expiration time, category, status, urgency level, severity and certainty. As soon as the radius of the affected area is entered, a red transparent circle appears around the emergency location on the map (Figure 7). Finally, the emergency coordinator clicks the Send button to broadcast the alert message to all the mobile users in the affected area.
5.3 ERApp Alert Handling
After the emergency alert is broadcasted to a particular cell site, all the mobile phones in the cell site receive the alert [2]. At this point, ERApp, which listens to the broadcast control channel, decodes the alert to extract the emergency location, affected area, event type, etc., and proceeds to display the location of the emergency, the user’s current location and the affected area. ERApp calculates the distance between the user’s current location and the location of the emergency. If the user is outside the affected area, ERApp stores the alert on the mobile device until the alert expires (left-hand side of Figure 9). If ERApp detects that the user is moving into the affected area and the alert is still in effect, it sounds the alert until the user acknowledges it. If the user is inside the affected area, ERApp sounds the alert until the user acknowledges it (right-hand side of Figure 9).
5.4 AMBER Alert Generation
The prototype also handles other CMAS alert types such as Presidential alerts, imminent threat alerts and AMBER alerts. Due to the amount of information included in an AMBER alert, sending one broadcast message is not enough. Consequently, the prototype breaks up the AMBER alert information and inserts it into smaller CMAS AMBER alerts, which are broadcasted. Upon receiving the smaller CMAS AMBER alerts, ERApp reassembles them to recreate the original CMAS AMBER alert and displays it as shown in Figure 10.

Figure 9. Alerts outside and inside the emergency affected area.
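The fragmentation and reassembly step can be sketched as follows. This Java illustration assumes a simple "id:index/total:payload" fragment header of our own devising; the prototype's actual wire format is not specified in the paper.

import java.util.ArrayList;
import java.util.List;

public final class AmberFragmenter {

    /** Splits a long alert payload into numbered fragments of at most maxLen characters. */
    static List<String> fragment(String alertId, String payload, int maxLen) {
        List<String> fragments = new ArrayList<>();
        int total = (payload.length() + maxLen - 1) / maxLen;
        for (int i = 0; i < total; i++) {
            String part = payload.substring(i * maxLen,
                    Math.min((i + 1) * maxLen, payload.length()));
            // The "alertId:index/total:" header lets the receiver reorder
            // fragments and detect missing ones.
            fragments.add(alertId + ":" + (i + 1) + "/" + total + ":" + part);
        }
        return fragments;
    }

    /** Reassembles fragments that share the same alert id; assumes none are lost. */
    static String reassemble(List<String> fragments) {
        String[] parts = new String[fragments.size()];
        for (String f : fragments) {
            String[] fields = f.split(":", 3);           // id, index/total, payload
            int index = Integer.parseInt(fields[1].split("/")[0]);
            parts[index - 1] = fields[2];
        }
        return String.join("", parts);
    }
}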
6. Related Work
In April 2007, the Integrated Public Alert and Warning System (IPAWS) [7] was established by the Federal Emergency Management Agency under Executive Order 13407 signed by President George W. Bush on June 26, 2006. IPAWS seeks to improve public safety at all levels of government by providing integrated and interoperable services that communicate timely alerts and warnings to help save lives and protect property. IPAWS uses two communication modes, radio broadcast and mobile broadcast. IPAWS also promotes a standard protocol to communicate alerts and warnings across all government emergency systems. In addition, IPAWS focuses on modernizing and expanding the legacy Emergency Alert System (EAS) to take advantage of cutting-edge technologies. CMAS, which focuses on mobile devices, is one of the major IPAWS projects. Several universities such as George Mason University [9] and Louisiana State University [13] have implemented alert mechanisms that require cell phone and/or email registration by users. These systems leverage the existing telecommunications and Internet service provider infrastructures to deliver emergency
alerts to users. During an emergency, however, these systems potentially place extreme loads on the associated networks that can lead to service disruptions [6]. Consequently, these alert messages are not guaranteed to be delivered to user devices in a timely manner. In a recent publication [16], we proposed an enhancement to the Emergency Data Exchange Language (EDXL) [17] and to CAP by adding tags to ensure the availability of responders during an emergency. The tags capture the roles and tasks of emergency personnel. Each emergency responder is associated with his/her role and a list of relevant tasks, all of which can be searched in a flexible manner. The search results include a default contact, a list of alternate contacts and even a contact for registering complaints. The search results may also be accessed by a private branch exchange (PBX) to automate the process of contacting emergency responders.

Figure 10. CMAS AMBER alert.
7. Conclusions
CMAS suffers from two major limitations. The first is that it cannot send alerts to confined areas. The second is that the protocol standard does not provide enough information to facilitate effective emergency communications. We address the first limitation by constructing an Android mobile application that filters CMAS alerts based on GPS data and only displays the alerts to users who are inside the affected area or are moving towards the affected area. We address the second limitation by adding XML tags to the CAP 1.2 message structure to hold the additional information that must be passed to the general public during small-scale emergencies. The viability of our solution is demonstrated by a prototype implementation that simulates broadcasts of emergency alerts to confined areas such as a city block or an apartment complex. Our future research will attempt to leverage advanced mobile device capabilities to advise users during emergencies. In particular, we will extend ERApp to provide intelligent assistance to users.
References

[1] Alliance for Telecommunications Industry Solutions, Implementation Guidelines and Best Practices for GSM/UMTS Cell Broadcast Service, ATIS-0700007, Washington, DC, 2009.
[2] Alliance for Telecommunications Industry Solutions, Commercial Mobile Alert Service (CMAS) via GSM/UMTS Cell Broadcast Service Specification, ATIS-0700006, Washington, DC, 2010.
[3] Arbitron, Public Radio Today: How America Listens to Public Radio, New York (internet.arbitron.com/downloads/PublicRadioToday07.pdf), 2007.
[4] BBC News, Young prefer texting to calls, London, United Kingdom (news.bbc.co.uk/2/hi/business/2985072.stm), June 13, 2003.
[5] Bureau of Labor Statistics, American Time Use Survey – 2008 Results, USDL 09-0704, Washington, DC (www.bls.gov/news.release/archives/atus_06242009.pdf), June 24, 2009.
[6] W. Enck, P. Traynor, P. McDaniel and T. La Porta, Exploiting open functionality in SMS-capable cellular networks, Proceedings of the Twelfth ACM Conference on Computer and Communications Security, pp. 393–404, 2005.
[7] Federal Emergency Management Agency, Integrated Public Alert and Warning System (IPAWS), Washington, DC (www.fema.gov/emergency/ipaws).
[8] C. Foresman, Wireless survey: 91% of Americans use cell phones, Ars Technica (arstechnica.com/telecom/news/2010/03/wireless-survey-91-of-americans-have-cell-phones.ars), March 24, 2010.
[9] George Mason University, Mason Alert: An emergency messaging system, Fairfax, Virginia (alert.gmu.edu/index.php?CCheck=1).
[10] Google, Industry leaders announce open platform for mobile devices, Mountain View, California (www.google.com/intl/en/press/pressrel/20071105_mobile_open.html), November 5, 2007.
[11] Google, Android timeline from November 5th, 2007 to October 21st, 2008, Mountain View, California (www.android.com/about/timeline.html).
[12] T. Hansen, J. Eklund, J. Sprinkle, R. Bajcsy and S. Sastry, Using smart sensors and a camera phone to detect and verify the fall of elderly persons, Proceedings of the European Medicine, Biology and Engineering Conference, 2005.
[13] Louisiana State University, PAWS: Emergency text message system (subscribe), Baton Rouge, Louisiana (grok.lsu.edu/mobile/article.aspx?articleid=4884).
[14] National Institute of Standards and Technology, Federal Information Processing Standards Publications, Gaithersburg, Maryland (www.itl.nist.gov/fipspubs/index.htm).
[15] National Public Safety Telecommunications Council, Commercial Mobile Alert Service Architecture and Requirements, Version 0.6, Littleton, Colorado (www.npstc.org/download.jsp?tableId=37&column=217&id=703&file=PMG-0035_Final_Recommendations_v0_6.pdf), 2007.
[16] P. Ngo and D. Wijesekera, Using Ontological Information to Enhance Responder Availability in Emergency Response, Technical Report GMU-CS-TR-2010-13, Department of Computer Science, George Mason University, Fairfax, Virginia, 2010.
[17] Organization for the Advancement of Structured Information Standards, Emergency Data Exchange Language (EDXL) Distribution Element, v1.0, OASIS Standard EDXL-DE v1.0, Burlington, Massachusetts (docs.oasis-open.org/emergency/edxl-de/v1.0/EDXL-DE_Spec_v1.0.pdf), 2006.
[18] Organization for the Advancement of Structured Information Standards, Common Alerting Protocol Version 1.1 (Approved Errata), Burlington, Massachusetts (docs.oasis-open.org/emergency/cap/v1.1/errata/CAP-v1.1-errata.html), 2007.
[19] Organization for the Advancement of Structured Information Standards, Common Alerting Protocol Version 1.2, OASIS Standard, Burlington, Massachusetts (docs.oasis-open.org/emergency/cap/v1.2/CAP-v1.2-os.html), 2010.
[20] Project for Excellence in Journalism, The state of the news media: An annual report on American journalism, Washington, DC (www.stateofthemedia.org/2009), 2009.
[21] C. Stout, C. Heppner and K. Brick, Emergency Preparedness and Emergency Communication Access: Lessons Learned Since 9/11 and Recommendations, Northern Virginia Resource Center for Deaf and Hard of Hearing Persons, Fairfax, Virginia, 2004.
[22] P. Traynor, W. Enck, P. McDaniel and T. La Porta, Mitigating attacks on open functionality in SMS-capable cellular networks, Proceedings of the Twelfth International Conference on Mobile Computing and Networking, pp. 182–193, 2006.
[23] U.S. Department of Justice, AMBER Alert, Washington, DC (www.amberalert.gov).
Chapter 11

REAL-TIME DETECTION OF COVERT CHANNELS IN HIGHLY VIRTUALIZED ENVIRONMENTS

Anyi Liu, Jim Chen and Li Yang

Abstract
Despite extensive research, covert channels are a principal threat to information security. Covert channels employ specially-crafted content or timing characteristics to transmit internal information to external attackers. Most techniques for detecting covert channels model legitimate network traffic. However, such an approach may not be applicable in dynamic virtualized environments because traffic for modeling normal activities may not be available. This paper describes Observer, a real-time covert channel detection system. The system runs a secure virtual machine that mimics the vulnerable virtual machine so that any differences between the two virtual machines can be identified in real time. Unlike other detection systems, Observer does not require historic data to construct a model. Experimental tests demonstrate that Observer can detect covert channels with a high success rate and low latency and overhead.
1. Introduction

The widespread deployment of firewalls and other perimeter defenses to protect enterprise information systems has raised the bar for malicious external attackers. However, these defensive mechanisms are ineffective against insiders, who can access sensitive data and send it to external entities using secret communication channels. According to the Computer Security Institute [13], the percentage of insider attacks rose to 59% in 2007. Insider attacks have overtaken viruses and worms as the most common type of attack [3]. A covert channel is a communication channel that can be exploited by a process to transfer information in a manner that violates system security policies. A successful covert channel leaks information to external entities in a manner
that is often difficult to detect. Researchers have proposed a variety of approaches to detect and prevent covert channels. Most covert channel detection approaches [4, 6, 7, 11, 26] construct models based on clean traffic and detect covert channels by searching for deviations in actual traffic. Upon detecting a covert channel, a variety of countermeasures [10, 16, 17, 30] can be applied to manipulate traffic and prevent information leaks. While the approaches are effective against most covert channels, they require a sufficient amount of clean traffic. However, in networked virtual environments, such as those encountered in cloud computing, creating suitable models from clean historic traffic is problematic, mainly because the traffic associated with most virtual machine services is highly dynamic in nature. For example, virtual machines may migrate arbitrarily across virtual networks, revert to the snapshot of a saved state, or may run multi-booting systems. In such cases, clean historic traffic is either unavailable or the available traffic does not reflect the characteristics of clean traffic. To address these challenges, we have designed and implemented the Outbound Service Validator (Observer), a real-time covert channel detection system. Observer leverages a secure virtual machine to mimic the behavior of a vulnerable virtual machine. It redirects all inbound traffic destined to the vulnerable virtual machine to the secure virtual machine, and differentiates between the outbound traffic of the two virtual machines to detect covert channels. Unlike existing approaches, Observer operates in real time and does not require historic traffic for modeling normal behavior. It can be dynamically incorporated in a cloud infrastructure when a vulnerable virtual machine has been identified. Moreover, the implementation is transparent to external attackers, which minimizes the risk that the detection system itself is the target of subversion. Experimental tests demonstrate that Observer can detect covert channels with a high success rate and low overhead. In particular, it induces an average 0.05 ms latency in the inter-packet delay and an average CPU usage increase of about 35% in a virtual network with 100 Mbps throughput.
2. Related Work
A number of differential analysis approaches have been developed for intrusion detection. Our approach is closely related to that of NetSpy [31], which compares outgoing packets from a clean system with those from an infected system. However, our approach represents an advancement over NetSpy's approach. First, NetSpy detects spyware that leaks private information as plaintext in HTTP responses; in contrast, our approach detects stealthy attacks, such as covert channels, that involve information leaks using encrypted traffic. Second, in order to generate signatures, NetSpy assumes that spyware generates additional network traffic from the infected system; this assumption fails when information is transmitted through a passive covert channel that does not generate extra traffic. Third, NetSpy correlates inbound packets with the corresponding outbound packets that are triggered; this approach fails to detect sophisticated covert channels that postpone outbound packets to ensure that inbound and outbound packets cannot be correlated. Privacy Oracle [15] employs an approach similar to ours to discover information leaks. It uses perturbed user input to identify the fields of network traces by aligning pairs of network traces. Siren [5] uses crafted user input along with a description of legitimate user activities to thwart mimicry attacks. However, both Privacy Oracle and Siren are unable to differentiate anomalous output in order to detect covert channels. The approach of Mesnier, et al. [20] predicts the performance and resource usage of a device using a mirror device. However, this approach is designed to predict the workload characteristics of I/O devices. Our approach, on the other hand, deals with the more difficult problem of detecting covert channels. A number of methodologies have been proposed for creating covert channels [6, 12, 18, 27] and for detecting covert channels [1, 11, 19, 23]. The accuracy of these techniques depends on the availability of a good model and a substantial quantity of clean historic traffic. Our approach is superior in that it works online and requires neither modeling nor clean historic traffic. Moreover, it can be deployed dynamically or migrated across a networked virtual infrastructure, which renders it an attractive solution for highly virtualized environments. Several researchers have applied covert channel design schemes to trace suspicious traffic. For example, Wang and Reeves [34] employ well-designed inter-packet delays to trace suspicious VoIP traffic [32, 33]. In contrast, our work only focuses on detecting covert channels, although covert channel design schemes nicely complement our detection methodology. Research efforts related to cross-virtual-machine covert/side channels [22, 25, 35] and their countermeasures [14] are relevant to our work. However, they deal with covert channels that leak information between virtual machines that share the same virtual machine monitor or hardware. Our work goes beyond inter-virtual-machine covert/side channels – it focuses on detecting aggressive covert channels between insiders and external entities.
3. Threat Model
Covert channels can be roughly categorized into two types: covert storage channels that manipulate the contents of storage locations (e.g., disk, memory, packet headers, etc.) and covert timing channels that manipulate the timing or ordering of events (e.g., disk accesses, memory accesses, inter-packet delays, etc.). This paper focuses on the detection of covert storage channels. Despite the complex nature of networked virtual environments, we desire to handle covert channel threats in as general a manner as possible. First, we assume that a vulnerable virtual machine can be compromised by many exploits. These include exploits that target vulnerable services, zero-day attacks and internal subversion exploits; however, we exclude attacks that change virtual machine behavior via a virtual machine monitor or hypervisor. Second, we assume that, after a virtual machine has been compromised, its user-space applications and kernel-space device drivers can be fully controlled by the attacker. Since covert channels leak internal information to external attackers, they can be detected by examining outbound network traffic. Figure 1 illustrates the covert channel threat.

Figure 1. Covert channel threat.

The security foundation is based on two assumptions. First, the virtual machine monitor, which is under the control of the current virtual computing environment, is trusted and cannot be breached. Second, there exist a number of secure virtual machines that are also under the control of the current virtual computing environment. The secure virtual machines may be created along with the vulnerable virtual machine by cloning them from a clean state, or they may be created from a virtual machine prototype such as Amazon's Elastic Compute Cloud (EC2) [2] or Microsoft Azure Services [21]. The secure virtual machines are protected by the virtual computing environment and external attackers cannot compromise them. Note that, although this approach requires at least one secure virtual machine to synchronize with a vulnerable virtual machine, the number of secure virtual machines is bounded by the number of vulnerable servers being monitored. Therefore, the Observer system can be applied to monitor as many vulnerable servers as desired when computational and storage resources become available. We do not require the virtual machine monitor of the virtual environment to know the software (i.e., operating system and applications) installed on the virtual machines, although this information is useful to determine the integrity of the virtual machines.
4. System Architecture
Figure 2 presents the architecture of the Observer system. The system has two main components: the security mediator and the virtual machine repository. The security mediator comprises: (i) the traffic filter, which monitors inbound packets from the Internet and filters traffic of interest; (ii) the traffic distributor, which examines the networking protocol information in the intercepted packets, replicates the packets and sends them to the vulnerable virtual machine and secure virtual machine; and (iii) the output analyzer, which singles out outbound traffic from the two virtual machines and detects anomalous patterns. The virtual machine repository maintains a set of virtual machines, and activates, deactivates, clones and updates the virtual machine snapshots.

Figure 2. Observer system architecture.
4.1 Traffic Filter
To protect services, the traffic filter maintains several rules that determine whether or not packets are to be intercepted. If an incoming packet satisfies a rule, it is subjected to further processing. After a new susceptible service is launched, new rules corresponding to the service are added to the rule list.

Figure 3. Configuration file of the traffic filter:

INTERCEPT OUT TCP 123.123.123.123 ANY 129.174.2.123 80 NA
...
DROP IN TCP ANY ANY 129.174.2.123 80 TCPFLAG=rst
Figure 3 lists some rules, which are similar to general purpose firewall rules. The rules specify the packet header fields, such as Direction, Protocol, Src IP, Src Port, Dst IP, Dst Port and Packet Header Flags. The first rule specifies
that TCP packets from an external host (with address 123.123.123.123 and any port number) to an internal HTTP service (with address 129.174.2.123 and port number 80) are to be intercepted. The second rule drops all TCP packets in a reply to an external host whose packet header contains a rst flag. Rules can be added and deleted at runtime to ensure that the system cannot be penetrated during rule updates. Note also that the rules are based on a priori knowledge and reported vulnerabilities. The time taken by the traffic filter is O(N), where N is the total number of packets (in terms of the number of bytes) that satisfy the rule set, provided that the focus is restricted to services that are easily compromised. Since the packet filter does not buffer processed packets, the storage requirement is bounded by the maximum size of a network packet.
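For illustration, the following sketch shows how the rule list of Figure 3 could be matched against inbound packets. It is written in Java (the mediator itself is written in C), and the Packet and Rule types, like all other names, are ours.

final class Packet {
    String direction, protocol, srcIp, dstIp;
    int srcPort, dstPort;
}

final class Rule {
    String action, direction, protocol;  // e.g., INTERCEPT, OUT, TCP
    String srcIp, dstIp;                 // "ANY" acts as a wildcard
    int srcPort, dstPort;                // -1 encodes ANY

    boolean matches(Packet p) {
        return direction.equals(p.direction)
            && protocol.equals(p.protocol)
            && (srcIp.equals("ANY") || srcIp.equals(p.srcIp))
            && (srcPort == -1 || srcPort == p.srcPort)
            && (dstIp.equals("ANY") || dstIp.equals(p.dstIp))
            && (dstPort == -1 || dstPort == p.dstPort);
    }
}

final class TrafficFilter {
    // CopyOnWriteArrayList allows rules to be added and deleted at runtime
    // without locking out the packet path, as the text requires.
    private final java.util.List<Rule> rules =
            new java.util.concurrent.CopyOnWriteArrayList<>();

    void addRule(Rule r) { rules.add(r); }

    /** Returns the action of the first matching rule, or null (pass through). */
    String decide(Packet p) {
        for (Rule r : rules)
            if (r.matches(p)) return r.action;  // e.g., INTERCEPT or DROP
        return null;
    }
}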
4.2 Traffic Distributor
The primary task of the traffic distributor is to forward packets destined to the vulnerable virtual machine to the secure virtual machine. Two steps are involved. First, when the traffic distributor receives a packet e from the traffic filter, it constructs a new packet e′. The new packet e′ keeps some of the fields of e (e.g., Src IP and Src Port), while other fields (e.g., Dst IP, Dst Port, sequence numbers and checksum) are modified (Dst IP and Dst Port correspond to the IP address and port number of the secure virtual machine, respectively). The two packets e and e′ are then dispatched simultaneously. Second, when the traffic distributor receives reply packets from both virtual machines, it only sends the reply packet corresponding to e; thus, an external attacker has no knowledge of the secure virtual machine. Figure 4 presents a simplified state machine corresponding to the traffic distributor. M1 and M2 represent the vulnerable virtual machine and the secure virtual machine, respectively. For example, when the traffic distributor receives a SYN packet, it constructs a new SYN packet, namely SYN′, and sends SYN and SYN′ to M1 and M2, respectively. Similarly, when it receives the ACK and ACK′ packets from the two virtual machines, it only sends the ACK packet. To synchronize the outputs of M1 and M2, the traffic distributor must maintain all the previous communication states in a queue in the event of packet loss or fragmentation. The overhead involved in constructing a new packet e′ is essentially constant. The storage requirement is dictated by the total number of packets forwarded from the traffic filter. Although Observer collects live traffic, which increases without bound, the storage requirement is still bounded by 2n, where n is the size of the queue.

Figure 4. State machine for the traffic distributor.
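A minimal sketch of the packet duplication step is shown below. It is Java for illustration (the security mediator is implemented in C), the PacketHeader type and all names are ours, and checksum recomputation is reduced to a comment.

final class PacketHeader {
    String srcIp, dstIp;
    int srcPort, dstPort;
    long seq;

    PacketHeader(String srcIp, int srcPort, String dstIp, int dstPort, long seq) {
        this.srcIp = srcIp; this.srcPort = srcPort;
        this.dstIp = dstIp; this.dstPort = dstPort;
        this.seq = seq;
    }
}

final class TrafficDistributor {
    private final String svmIp;  // address of the secure virtual machine
    private final int svmPort;
    private long seqOffset = 0;  // remaps sequence numbers to the SVM connection

    TrafficDistributor(String svmIp, int svmPort) {
        this.svmIp = svmIp;
        this.svmPort = svmPort;
    }

    /** Builds e' from e: source fields are kept, the destination is rewritten to the SVM. */
    PacketHeader duplicateForSvm(PacketHeader e) {
        // The TCP/IP checksums would also be recomputed before e' is dispatched.
        return new PacketHeader(e.srcIp, e.srcPort, svmIp, svmPort,
                                e.seq + seqOffset);
    }
}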
5. Implementation
Observer is implemented on a VMware ESX Server 4.1 [29]. The traffic filter and traffic distributor are implemented as part of a transparent bridge, which uses a customized ipfw application to intercept packets and a divert socket to manipulate packets. The security mediator comprises around 1,000 lines of C code. To minimize the latency after packets leave the security mediator, the vulnerable virtual machine and secure virtual machine are cloned from the same virtual machine image with the same state, and the two virtual machines are configured one hop away from Observer. The vulnerable virtual machine and secure virtual machine must be closely synchronized; the sequence number field of TCP packets is used to synchronize outbound traffic from the two virtual machines. To ensure that Observer maintains accurate time information, each virtual machine is configured to have affinity to one CPU at a time. The output analyzer uses Ethereal to collect traffic and separate the timing information. The ntop application is used to generate network traffic statistics at runtime. The output analyzer module is written in C, Perl, Dataplot and MATLAB.
6. Evaluation
This section analyzes the ability of Observer to detect covert channels. It also examines the performance overhead involved in covert channel detection.
6.1 Covert Channel Construction and Detection
A scheme similar to that used by Ramsbrock, et al. [24] was used to construct covert channels. Specifically, to encode an $i$-bit sequence $S = s_0, \ldots, s_{i-1}$, we used $2i$ randomly chosen packets grouped into packet pairs $\langle P_{r_k}, P_{e_k} \rangle$ ($k = 0, \ldots, i-1$), where $r_k \le e_k$, and $P_{r_k}$ and $P_{e_k}$ correspond to reference packets and encoding packets, respectively. A covert bit $s_k$ ($0 \le k \le i-1$) is encoded into the packet pair $\langle P_{r_k}, P_{e_k} \rangle$ using the equation:

$$e(L_r, L_e, L, s_k) = L_e + \left[ (0.5 + s_k)\,L - (L_e - L_r) \right] \bmod 2L$$

where $L_r$ and $L_e$ are the values of the encoded field in $P_{r_k}$ and $P_{e_k}$, respectively, and $L$ is the quantization step.
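The encoding equation translates directly into code. The following minimal Java sketch evaluates the formula under the definitions above; the class and method names are ours, and the modulo is adjusted because Java's % operator can return negative values.

public final class CovertEncoder {

    /** Returns the new value of the encoding packet's field carrying bit sk (0 or 1). */
    static double encode(double lr, double le, double L, int sk) {
        double shift = ((0.5 + sk) * L - (le - lr)) % (2 * L);
        if (shift < 0) shift += 2 * L;   // force the result into [0, 2L)
        return le + shift;
    }

    public static void main(String[] args) {
        // Encode bit 1 with step L = 4 into a pair whose field values are 10 and 13.
        System.out.println(encode(10, 13, 4, 1));   // prints 16.0
    }
}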
Table 1. Detection window size for various covert channels.

       a      BTWC      PFC
       1       160       60
       5       181      460
      10     7,150      390
      20    24,060      730
The original covert channel design scheme was extended to test the effectiveness of Observer in detecting slow covert channels. Specifically, instead of using $2i$ packets, we used $2ai$ ($a \ge 1$) packets to encode $S$, where $a$ is a constant or pseudorandom number. $P_{r_k}$ and $P_{e_k}$ were chosen from each group of $2a$ packets. The term $a$ serves as an "amplifier," where a larger value indicates a slower covert channel. Covert channels were detected using shape tests corresponding to first-order statistics such as means and variances [23]. The shapes of the traffic patterns were tested using a Chi-Square test [8] and a two-sample Kolmogorov-Smirnov (K-S) test [9]. The Chi-Square test was used to verify whether or not two discrete sample data sets come from the same discrete distribution; the K-S test was used to verify whether or not two continuous sample data sets come from the same continuous distribution. The two tests were chosen because they are distribution free, i.e., they do not depend on the actual cumulative distribution function (CDF) being tested.
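For reference, the two-sample K-S statistic is the maximum gap between the two empirical CDFs and can be computed with a simple merge walk. The Java sketch below is only an illustration (the authors ran their tests with MATLAB and Dataplot); it advances through tied values in both samples, and all names are ours.

import java.util.Arrays;

public final class KolmogorovSmirnov {

    /** Returns D = max |F_x(t) - F_y(t)| over all t, the two-sample K-S statistic. */
    static double statistic(double[] x, double[] y) {
        double[] a = x.clone(), b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0;
        double d = 0.0;
        while (i < a.length && j < b.length) {
            double v = Math.min(a[i], b[j]);
            while (i < a.length && a[i] == v) i++;   // consume ties in sample x
            while (j < b.length && b[j] == v) j++;   // consume ties in sample y
            double gap = Math.abs((double) i / a.length - (double) j / b.length);
            d = Math.max(d, gap);
        }
        return d;   // compare against a critical value (e.g., the 1% cutoff)
    }
}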
6.2 Effectiveness
The evaluation focused on the detection of two types of covert channels: IP/TCP packet field channels (PFCs) and botnet traceback watermark channels (BTWCs). A PFC operates by modifying the urgent field of TCP packets to transmit information: a 1 bit is transmitted by increasing the urgent field value by an integer modulo w, while a 0 bit is transmitted by increasing the value by an integer modulo w2. A BTWC operates by modifying the length of the encoding packet $P_{e_k}$ by padding characters to achieve a specific length that is different from its corresponding reference packet $P_{r_k}$. The padded characters may be visible or invisible (e.g., whitespace), and can be inserted at random locations in the payload. The latency of Observer was measured using the detection window size, which is the minimum number of packets needed to detect a covert channel after it commences transmission. A larger window size indicates greater latency and less sensitivity to covert channels in real time. Table 1 shows the detection window size required by Observer to obtain a 100% true positive rate (note that a larger value of a indicates a slower covert channel). The results demonstrate that a slower covert channel requires a larger detection window. Table 1 shows that a window size of 160 is required to detect the most aggressive BTWC, which sends one bit per packet. However, a slower BTWC, which transmits one bit every 20 packets, requires a window size of 24,060! The detection window sizes, which are much smaller for PFCs, vary from 60 to 730. The results can be explained by comparing the cumulative distribution functions (CDFs) corresponding to BTWCs and PFCs for different values of a. Figure 5 shows that the distributions of BTWCs are quite similar for different values of a, but the same is not true for PFCs. Therefore, BTWCs are more difficult to detect than PFCs.

Figure 5. Cumulative distributions for BTWCs and PFCs for various values of a: (a) packet length distributions; (b) urgent field distributions.
6.3 Detection Rate
Live traffic was collected from the vulnerable virtual machine and secure virtual machine in order to determine the false positive rates for Observer. A BTWC and a PFC were created with a = 1; the false positive rates for both channels were zero. Tables 2 and 3 present the results of the statistical tests for the two channels. As expected, the slower covert channels (with larger values of a) have statistics that are closer to those of legitimate traffic. A theoretical analysis of false positives was conducted by setting the targeted false positive rate to 1%. To achieve this false positive rate, we used a "cutoff point," which was set at the 99th percentile of the legitimate samples, to determine if samples are benign or malicious. Figure 6 shows the true positive rates for PFC and BTWC detection for various values of a. The results show that the effectiveness of detection depends on the number of observed packets. For example, in the case of a PFC with a = 10, the true positive rate fluctuates when Observer collects 5,000 packets. A similar situation also occurs for a BTWC. We are currently investigating various approaches to improve the effectiveness of detection.
Table 2. PFC test scores.

                      Mean       SD      Chi-Square   Chi-Square    Chi-Square
                                         Test         (CDF value)   (1% cutoff)
  Legitimate HTTP    20.143     3.925       0             0          >= 15.087
  PFC (a = 1)        36.926    16.485    6789.002         1          >= 11.345
  PFC (a = 5)        23.480    10.576      82.136         1          >= 15.086
  PFC (a = 10)       23.343    10.316      75.657         1          >= 13.277
  PFC (a = 20)       20.878     6.137       3.902       0.581        >= 13.277
  PFC (a = 50)       20.478     5.088       0.796       0.061        >= 13.277

Table 3. BTWC test scores.

                      Mean       SD       K-S Test   K-S (p value)   K-S (1% cutoff)
  Legitimate HTTP   283.601   453.532      0              1            >= 0.1923
  BTWC (a = 1)      284.785   452.005      0.248          0            >= 0.1923
  BTWC (a = 5)      283.824   453.215      0.049       2.94e-15        >= 0.1923
  BTWC (a = 10)     283.721   453.357      0.025       2.53e-04        >= 0.1923
  BTWC (a = 20)     283.660   453.441      0.012       0.212           >= 0.1929
  BTWC (a = 50)     283.622   453.503      0.004       0.951           >= 0.1929

6.4 Performance
Two experiments were conducted to evaluate the performance of Observer. The first experiment evaluated the throughput of Observer under the best and worst case scenarios. In the best case scenario, no packets were intercepted by the traffic filter. In the worst case scenario, the majority of the inbound packets sent to the vulnerable virtual machine were intercepted by the traffic filter because they were assumed to be sent via a covert channel. Packets in both scenarios were collected while an attacker visited the web server of the vulnerable virtual machine for 160 minutes. Table 4 presents the detailed statistics for the two scenarios. Note that, even in the worst case, the average time to process a packet is only about 56.2 microseconds (17,783 packets per second).
Figure 6. True positive rates of channel detection for various values of a: (a) PFC true positive rates; (b) BTWC true positive rates.

Table 4. Comparison of throughput.

               Best Scenario        Worst Scenario     Ratio
  Packets      1,802,176,469        170,712,403        0.094
  Bytes        1,214,518,973,379    128,003,547,066    0.105
  Packets/s    187,727              17,783             0.098
  Bytes/s      126,512,393          13,333,703         0.110
  Duration     160 min              160 min            1.000
The second experiment measured the average latency introduced by Observer to the inter-packet delay. In the experiment, 1,000,000 packets were collected with and without Observer installed. The average latency added to the inter-packet delay is 0.05 ms, which is almost unnoticeable compared with the reported average inter-packet delay of 42.67 ms for North America [28]. This latency is the result of the queuing delay that Observer imposes on inbound packets when directing them to both virtual machines at the same time.
7. Conclusions
Observer detects covert channels in a networked virtual environment by running a secure virtual machine that mimics a vulnerable virtual machine so that differences between the two virtual machines can be identified. Unlike most covert channel detection systems, Observer does not rely on historic data to create models of normal behavior. Experimental tests demonstrate that Observer can detect covert channels with a high success rate and low latency and overhead. The accuracy of covert channel detection relies on the fact that the virtual computing environment provides at least one secure virtual machine for every vulnerable virtual machine. This limits the scalability of our approach. To address this limitation, our future research will investigate the dynamic allocation and management of secure virtual machines. Another limitation is that it is necessary to maintain a secure version of a vulnerable virtual machine over its entire lifecycle. Even during the detection phase, it is difficult to ensure that inbound traffic does not contain an exploit that could compromise the secure virtual machine at runtime. Potential solutions that address this limitation will be examined in our future research.
References

[1] D. Agrawal, S. Baktir, D. Karakoyunlu, P. Rohatgi and B. Sunar, Trojan detection using IC fingerprinting, Proceedings of the IEEE Symposium on Security and Privacy, pp. 296–310, 2007.
[2] Amazon, Amazon Elastic Compute Cloud (Amazon EC2), Seattle, Washington (aws.amazon.com/ec2).
[3] M. Ben Salem, S. Hershkop and S. Stolfo, A survey of insider attack detection research, in Insider Attack and Cyber Security: Beyond the Hacker, S. Stolfo, S. Bellovin, S. Hershkop, A. Keromytis, S. Sinclair and S. Smith (Eds.), Springer, New York, pp. 69–90, 2008.
[4] V. Berk, A. Giani and G. Cybenko, Covert channel detection using process query systems, Proceedings of the Second Annual Workshop on Flow Analysis, 2005.
[5] K. Borders, X. Zhao and A. Prakash, Siren: Catching evasive malware, Proceedings of the IEEE Symposium on Security and Privacy, pp. 78–85, 2006.
[6] S. Cabuk, Network Covert Channels: Design, Analysis, Detection and Elimination, Ph.D. Thesis, Department of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, 2006.
[7] S. Cabuk, C. Brodley and C. Shields, IP covert timing channels: Design and detection, Proceedings of the Eleventh ACM Conference on Computer and Communications Security, pp. 178–187, 2004.
[8] G. Corder and D. Foreman, Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, John Wiley, Hoboken, New Jersey, 2009.
[9] R. Duda, P. Hart and D. Stork, Pattern Classification, John Wiley, New York, 2001.
[10] G. Fisk, M. Fisk, C. Papadopoulos and J. Neil, Eliminating steganography in Internet traffic with active wardens, Proceedings of the Fifth International Workshop on Information Hiding, pp. 18–35, 2002.
[11] S. Gianvecchio and H. Wang, Detecting covert timing channels: An entropy-based approach, Proceedings of the Fourteenth ACM Conference on Computer and Communications Security, pp. 307–316, 2007.
[12] S. Gianvecchio, H. Wang, D. Wijesekera and S. Jajodia, Model-based covert timing channels: Automated modeling and evasion, Proceedings of the Eleventh International Symposium on Recent Advances in Intrusion Detection, pp. 211–230, 2008.
[13] L. Gordon, M. Loeb, W. Lucyshyn and R. Richardson, 2006 CSI/FBI Computer Crime and Security Survey, Computer Security Institute, San Francisco, California, 2006.
[14] T. Jaeger, R. Sailer and Y. Sreenivasan, Managing the risk of covert information flows in virtual machine systems, Proceedings of the Twelfth ACM Symposium on Access Control Models and Technologies, pp. 81–90, 2007.
[15] J. Jung, A. Sheth, B. Greenstein, D. Wetherall, G. Maganis and T. Kohno, Privacy Oracle: A system for finding application leaks with black box differential testing, Proceedings of the Fifteenth ACM Conference on Computer and Communications Security, pp. 279–288, 2008.
[16] M. Kang, I. Moskowitz and S. Chincheck, The pump: A decade of covert fun, Proceedings of the Twenty-First Annual Computer Security Applications Conference, pp. 352–360, 2005.
[17] M. Kang, I. Moskowitz and D. Lee, A network version of the pump, Proceedings of the IEEE Symposium on Security and Privacy, pp. 144–154, 1995.
[18] B. Kopf and D. Basin, An information-theoretic model for adaptive side-channel attacks, Proceedings of the Fourteenth ACM Conference on Computer and Communications Security, pp. 286–296, 2007.
[19] Y. Liu, C. Corbett, K. Chiang, R. Archibald, B. Mukherjee and D. Ghosal, SIDD: A framework for detecting sensitive data exfiltration by an insider attack, Proceedings of the Forty-Second Hawaii International Conference on System Sciences, 2009.
[20] M. Mesnier, M. Wachs, B. Salmon and G. Ganger, Relative fitness models for storage, ACM SIGMETRICS Performance Evaluation Review, vol. 33(4), pp. 23–28, 2006.
[21] Microsoft, Microsoft Azure Services Platform, Redmond, Washington (www.microsoft.com/azure/default.mspx).
[22] K. Okamura and Y. Oyama, Load-based covert channels between Xen virtual machines, Proceedings of the Twenty-Fifth Symposium on Applied Computing, pp. 173–180, 2010.
[23] P. Peng, P. Ning and D. Reeves, On the secrecy of timing-based active watermarking trace-back techniques, Proceedings of the IEEE Symposium on Security and Privacy, pp. 334–349, 2006.
[24] D. Ramsbrock, X. Wang and X. Jiang, A first step towards live botmaster traceback, Proceedings of the Eleventh International Symposium on Recent Advances in Intrusion Detection, pp. 59–77, 2008.
[25] T. Ristenpart, E. Tromer, H. Shacham and S. Savage, Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds, Proceedings of the Sixteenth ACM Conference on Computer and Communications Security, pp. 199–212, 2009.
[26] G. Shah, A. Molina and M. Blaze, Keyboards and covert channels, Proceedings of the Fifteenth USENIX Security Symposium, pp. 59–75, 2006.
[27] R. Smith and G. Scott Knight, Predictable design of network-based covert communication systems, Proceedings of the IEEE Symposium on Security and Privacy, pp. 311–321, 2008.
[28] Verizon, IP latency statistics, New York (www.verizonbusiness.com/about/network/latency), 2010.
[29] VMware, VMware ESXi and ESX Info Center, Palo Alto, California (www.vmware.com/products/vsphere/esxi-and-esx/index.html).
[30] M. Vutukuru, H. Balakrishnan and V. Paxson, Efficient and robust TCP stream normalization, Proceedings of the IEEE Symposium on Security and Privacy, pp. 96–110, 2008.
[31] H. Wang, S. Jha and V. Ganapathy, NetSpy: Automatic generation of spyware signatures for NIDS, Proceedings of the Twenty-Second Annual Computer Security Applications Conference, pp. 99–108, 2006.
[32] X. Wang, S. Chen and S. Jajodia, Tracking anonymous peer-to-peer VoIP calls on the Internet, Proceedings of the Twelfth ACM Conference on Computer and Communications Security, pp. 81–91, 2005.
[33] X. Wang, S. Chen and S. Jajodia, Network flow watermarking attack on low-latency anonymous communication systems, Proceedings of the IEEE Symposium on Security and Privacy, pp. 116–130, 2007.
[34] X. Wang and D. Reeves, Robust correlation of encrypted attack traffic through stepping stones by manipulation of interpacket delays, Proceedings of the Tenth ACM Conference on Computer and Communications Security, pp. 20–29, 2003.
[35] Z. Wang and R. Lee, Covert and side channels due to processor architecture, Proceedings of the Twenty-Second Annual Computer Security Applications Conference, pp. 473–482, 2006.
Chapter 12

ANALYZING CYBER-PHYSICAL ATTACKS ON NETWORKED INDUSTRIAL CONTROL SYSTEMS

Bela Genge, Igor Nai Fovino, Christos Siaterlis and Marcelo Masera

Abstract
Considerable research has focused on securing SCADA systems and protocols, but an efficient approach for conducting experiments that measure the impact of attacks on the cyber and physical components of the critical infrastructure is not yet available. This paper attempts to address the issue by presenting an innovative experimental framework that incorporates cyber and physical systems. An emulation testbed based on Emulab is used to model cyber components while a soft real-time simulator based on Simulink is used to model physical processes. The feasibility and performance of the prototype are evaluated through a series of experiments. The prototype supports experimentation with networked industrial control systems and helps understand and measure the consequences of cyber attacks on physical processes.
Keywords: Industrial control systems, cyber attacks, simulation, testbed
1. Introduction
Modern critical infrastructures such as power plants and water supply systems rely on information and communications technologies, which contribute to reduced costs as well as greater efficiency, flexibility and interoperability. However, these technologies, which underlie networked industrial control systems, are exposed to significant cyber threats [7, 14]. The recent Stuxnet worm [8] is the first malware that was specifically designed to attack networked industrial control systems. Stuxnet’s ability to reprogram the logic of control hardware and alter physical processes demonstrates the danger of modern cyber threats. Stuxnet has served as a wakeup call for the international security community, and has raised many questions. Above all, Stuxnet reminds us that an efficient approach for conducting experiments that measure the impact of attacks
on the cyber and physical components of the critical infrastructure is not yet available. The study of complex systems, whether cyber or physical, can be carried out by experimenting with real systems, software simulators or emulators. Experimentation with real production systems is hindered by the inability to control the environments adequately to obtain reproducible results. Furthermore, any study that attempts to analyze security or resilience raises serious concerns about the potential faults and disruptions to mission-critical systems. The alternative, a dedicated experimental infrastructure with real components, is costly and the experiments can pose safety risks. Software-based simulation is generally considered to be an efficient approach to study physical systems, mainly because of its lower cost coupled with fast and accurate analysis. However, it has limited applicability to cyber security because of the complexity and diversity of information and communications technologies. Moreover, while software simulators may effectively model normal operations, they fail to capture the manner in which computer systems fail. For these reasons, we have chosen to adopt a hybrid approach between the two extremes of experimentation with real components and pure software simulation. Our proposed framework uses simulation for the physical components and an emulation testbed based on Emulab [9, 22] to recreate the cyber components of networked industrial control systems such as SCADA servers and corporate networks. The models of the physical systems are developed using Matlab Simulink, from which the corresponding C code is generated using Matlab Real Time Workshop. The generated code is executed in real time and can interact with the real components in the emulation testbed. The primary advantage of the framework is that it provides an experimentation environment for understanding and measuring the consequences of cyber attacks to physical processes while using real cyber components and malware in a safe manner. Furthermore, experimental evaluations of the framework demonstrate that it can scale and accurately recreate large networked industrial control systems with up to 100 programmable logic controllers (PLCs).
2. Related Work
Analyzing the behavior of networked industrial control systems is challenging because they incorporate components that interact in the physical and cyber domains. This section briefly describes the most relevant work on the subject. Wang, et al. [21] employ an OPC (OLE for Process Control) server, the ns-2 network simulator, and real PLCs and field devices to analyze networked industrial control systems. ns-2 is used to simulate the enterprise network of a SCADA system. Calls from ns-2 are dispatched via software agents to the OPC server, which sends Modbus messages to the physical PLCs. In this approach, the only simulated component is the enterprise network; all the other components (servers, PLCs, etc.) are real. Because almost every component is real, such a testbed can provide reliable experimental data, but it cannot support tests on large infrastructures such as chemical plants and gas pipelines.
Nai Fovino, et al. [15] have also pursued a similar approach by developing a protected environment for studying cyber vulnerabilities in power plant control systems. The core of their environment is a real industrial system that reproduces the physical dynamics of a power plant. However, the high fidelity of this testing environment is counterbalanced by its poor flexibility with respect to handling new systems and its high maintenance costs. Hiyama and Ueno [10] have used Simulink to model physical systems and the Matlab Real Time Workshop to run the model in real time. A similar approach has been used by Queiroz, et al. [18] to analyze the security of SCADA systems. In their case, only the sensors and actuators are real physical devices; the remaining components (e.g., PLCs) and the communication protocols are implemented as OMNeT++ modules. Other researchers focus on simulating both SCADA and field devices. For example, Chabukswar, et al. [4] use the Command and Control WindTunnel (C2WindTunnel) [16] multi-model simulation environment, based on the High-Level Architecture (HLA) IEEE Standard 1.3 [3], to enable interactions between multiple simulation engines. They use OMNeT++ to simulate the network and Matlab Simulink to build and run the physical plant model. C2WindTunnel provides the global clock for OMNeT++ and Matlab Simulink. Nevertheless, analyzing the cyber-physical effects of malware is a challenging task because it requires a detailed description of all the cyber components and detailed knowledge of the dynamics of the malware, which is rarely available. Davis, et al. [6] use PowerWorld [17], a high-voltage power system simulation and analysis package, to model an entire power grid and run it in real time. The PowerWorld server is connected to a proxy that implements the Modbus protocol and transmits Modbus messages to client applications. Client applications interact with the PowerWorld server via a visual interface that allows them to introduce disturbances into the network and observe the effects. Our approach also uses simulation for the physical layer. However, unlike Davis, et al. [6], we also emulate typical components such as PLCs and master units.
3. Proposed Framework
This section presents the proposed experimentation framework that supports the analysis of the physical impact of cyber threats against networked industrial control systems. Following a brief description of a typical networked industrial control system architecture, we present the proposed framework and its prototype implementation.
3.1 Control System Architecture
Modern industrial process control network architectures have two control layers: (i) the physical layer, which comprises actuators, sensors and hardware devices that physically perform the actions on the system (e.g., open valves, measure voltages, etc.); and (ii) the cyber layer, which comprises all the information and communications devices and software that acquire data, elaborate
low-level process strategies and deliver commands to the physical layer. The cyber layer typically uses SCADA protocols to control and manage an industrial installation. The entire architecture can be viewed as a "distributed control system" spread over two networks: the control network and the process network. The process network usually hosts the SCADA servers (also known as SCADA masters) and human-machine interfaces (HMIs). The control network hosts all the devices that on one side control the actuators and sensors of the physical layer and on the other side provide the control interface to the process network. A typical control network is composed of a mesh of PLCs, as shown in Figure 1. From the operational point of view, PLCs receive data from the physical layer, elaborate a local actuation strategy based on the data, and send commands to the actuators. When requested, the PLCs also provide the data received from the physical layer to the SCADA servers (masters) in the process network and eventually execute the commands that they receive. In modern SCADA architectures, communications between a master and PLCs are usually implemented in two ways: (i) through an OPC layer that helps map the PLC devices; and/or (ii) through a direct memory mapping that uses SCADA communication protocols such as Modbus, DNP3 and Profibus.

Figure 1. Industrial plant network.
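To make the second communication mode concrete, the sketch below assembles a Modbus/TCP "read holding registers" request of the kind a SCADA master sends to a PLC. The framing follows the public Modbus/TCP specification; the Java class and method names are ours.

import java.nio.ByteBuffer;

public final class ModbusRequest {

    /** Builds a Modbus/TCP request for function 0x03 (read holding registers). */
    static byte[] readHoldingRegisters(int transactionId, int unitId,
                                       int startAddress, int quantity) {
        ByteBuffer buf = ByteBuffer.allocate(12);   // 7-byte MBAP header + 5-byte PDU
        buf.putShort((short) transactionId);        // transaction identifier
        buf.putShort((short) 0);                    // protocol identifier (0 = Modbus)
        buf.putShort((short) 6);                    // remaining length: unit id + PDU
        buf.put((byte) unitId);                     // unit (slave) identifier
        buf.put((byte) 0x03);                       // function: read holding registers
        buf.putShort((short) startAddress);         // first register address
        buf.putShort((short) quantity);             // number of registers to read
        return buf.array();                         // big-endian, as Modbus requires
    }
}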
3.2 Overview of the Approach
The proposed framework engages a hybrid approach, where the Emulab-based testbed recreates the control and process network (including the SCADA
servers and PLCs), and a software simulation reproduces the physical processes. Figure 2 shows a high-level view of the proposed framework.

Figure 2. Overview of the framework.

The principal argument for emulating the cyber components is that any study of the security and resilience of a computer network requires the simulation of all the failure-related functions, behaviors and states, most of which are unknown. On the other hand, software simulation is a very reasonable approach for the physical layer because of its low cost, the existence of accurate models and the ability to conduct experiments safely. The architecture presented in Figure 2 has three layers: the cyber layer, the link layer and the physical layer. The cyber layer incorporates the information and communications devices used in SCADA systems, while the physical layer provides the simulation of physical devices. The link layer serves as the glue for the two layers through the use of a shared memory region. The cyber layer is recreated by an emulation testbed that uses the Emulab architecture and software [22] to automatically and dynamically map physical components (e.g., servers and switches) to a virtual topology. In other words, the Emulab software configures the physical topology so that it emulates the virtual topology as accurately as possible. Interested readers are referred to [9] for additional details. Aside from the process network, the cyber layer also includes the control logic code, which is implemented by PLCs in the real world. In our approach, the control code can be made to run sequentially or in parallel with the physical
model. In the sequential case, we use tightly-coupled code, i.e., code that runs in the same memory space as the model. In the parallel case, we use loosely-coupled code, i.e., code that runs in another address space, possibly on another host. The main advantage of tightly-coupled code is that it does not miss values generated by the model between executions. On the other hand, loosely-coupled code supports the execution of PLC code remotely, the injection of malicious code without stopping the execution of the model, and the operation of more complex PLC emulators. The cyber-physical layer incorporates the PLC memory (usually a set of registers) and the communications interfaces that glue the other two layers. Memory registers provide the links to the inputs (e.g., valve positions) and outputs (e.g., sensor values) of the physical model. The physical layer provides the real-time execution of the physical model. The execution time of the model is strongly coupled to the timing service provided by the operating system on which the model executes. Because a multi-tasking operating system is used, achieving hard real-time execution is a difficult task without using kernel drivers. However, soft real-time execution can be achieved by allowing some deviations from the operating system clock. We further elaborate on this topic in the following sections. The selection of a time step, i.e., the time between two executions of the model, for a given setting is not a trivial task. Choosing a small value (on the order of microseconds) increases the deviation of the execution from the system clock and the number of missed values by loosely-coupled code. On the other hand, choosing a larger value (on the order of seconds) may cause certain effects and attacks to be missed. At the same time, the selected time step cannot be less than the execution time of the physical model. We use the term "resolution" to denote the minimal value of the time step. When a system includes tightly-coupled code, because this code executes sequentially with the model, the resolution cannot be less than the cumulative execution time of the PLC code and the physical model.
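The shared-register glue can be pictured as a small synchronized memory. The Java sketch below is only an illustration (the framework itself is written in C#); the register layout and all names are ours.

public final class PlcMemory {
    private final boolean[] coils;        // 1-bit outputs (e.g., valve commands)
    private final int[] inputRegisters;   // values written by the physical model (sensors)
    private final int[] holdingRegisters; // 16-bit writable values (e.g., setpoints)

    public PlcMemory(int nCoils, int nInput, int nHolding) {
        coils = new boolean[nCoils];
        inputRegisters = new int[nInput];
        holdingRegisters = new int[nHolding];
    }

    // Synchronized access protects the shared memory from simultaneous use
    // by the simulation core and the remote PLC units.
    public synchronized void setInputRegister(int addr, int value) {
        inputRegisters[addr] = value;     // written by the physical model
    }

    public synchronized int getInputRegister(int addr) {
        return inputRegisters[addr];      // read by PLC code and master units
    }

    public synchronized void setCoil(int addr, boolean on) { coils[addr] = on; }
    public synchronized boolean getCoil(int addr) { return coils[addr]; }
    public synchronized void setHoldingRegister(int addr, int v) { holdingRegisters[addr] = v; }
    public synchronized int getHoldingRegister(int addr) { return holdingRegisters[addr]; }
}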
3.3 Detailed Architectural Description
Figure 3 presents the modular structure of the framework. The framework has three main units: the simulation core (SC), remote PLC (R-PLC) and SCADA master.

Simulation Core Unit: The main role of the simulation core unit is to provide soft real-time execution of tightly-coupled code and the physical model, synchronized with the operating system clock, while providing the glue between the cyber and physical layers. The most important modules of the simulation core unit are: the local PLC (L-PLC), remoting handler and core. Communications between the simulation core and remote units are handled by .NET's binary implementation of RPC over TCP (called "remoting"). The local PLC module incorporates the PLC memory (e.g., coils, digital input registers, input registers and
holding registers), which is used as the glue between the cyber and physical layers and the code runner module. The remoting handler module handles the communications between the local PLC modules and the local RPC system. The core module ensures the exchange of data between modules and the execution of the core timer. With the help of this timer, the simulation core unit provides soft real-time execution of the physical model.

Figure 3. Modular architecture.

Remote PLC Unit: The main role of the remote PLC unit is to run loosely-coupled code and to provide an interface for master units to access the model. The main modules include: the remoting handler module, which implements communications with the simulation core; the code runner module, which runs the loosely-coupled code; and the Modbus handler module, which implements the Modbus protocol.

Master Unit: The main role of the master unit is to implement a global decision based on the sensor values received from the remote PLC units. It includes a Modbus handler module for communicating with the remote PLC units and the decision algorithm module.
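As a rough sketch of how the remote PLC unit's modules might fit together (reusing the illustrative PlcMemory and IPlcLogic types from the earlier sketch; the interfaces below are our assumptions, not the framework's actual types), the code runner periodically reads the shared PLC memory through the remoting handler, executes the loosely-coupled logic and writes the results back:

// Illustrative sketch of a loosely-coupled remote PLC main loop.
using System.Threading;

public interface IPlcMemoryProxy {            // remoting handler's view of PLC memory
    PlcMemory Read();
    void Write(PlcMemory memory);
}

public class RemotePlcUnit {
    private readonly IPlcMemoryProxy proxy;   // remoting handler (RPC/TCP to the simulation core)
    private readonly IPlcLogic logic;         // code runner: the loosely-coupled PLC code
    // A Modbus handler (not shown) would serve master units from the same memory image.

    public RemotePlcUnit(IPlcMemoryProxy proxy, IPlcLogic logic) {
        this.proxy = proxy;
        this.logic = logic;
    }

    public void Run(int pollIntervalMs) {
        while (true) {
            PlcMemory snapshot = proxy.Read(); // one remote read per cycle; model steps
            logic.Execute(snapshot);           // between reads are the "missed" values
            proxy.Write(snapshot);
            Thread.Sleep(pollIntervalMs);
        }
    }
}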
3.4 Implementation Details
The framework code was written in C#. The Mono platform was used to port the framework to a Unix system. Tightly-coupled code can be provided as C# source files or as binary DLLs, both of which are dynamically loaded at run time. C# source files are dynamically loaded, compiled and executed at run time using .NET support for dynamic code execution. Although C# source files have longer execution times, they provide the ability to implement PLC code without a development environment. At this time, loosely-coupled code is written in C# and must be compiled with the rest of the unit.
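As an illustration of how such run-time loading might be done (a sketch under the assumption that .NET's CodeDom compiler services are used, which are also available under Mono; the file name PlcLogic.cs and the PlcLogic.Execute convention are hypothetical):

// Sketch: compile a C# PLC source file at run time and obtain its entry point.
using System;
using System.CodeDom.Compiler;
using System.Reflection;
using Microsoft.CSharp;

public static class PlcCodeLoader {
    public static MethodInfo Load(string sourceFile) {
        var provider = new CSharpCodeProvider();
        var options = new CompilerParameters { GenerateInMemory = true };
        options.ReferencedAssemblies.Add("System.dll");

        CompilerResults results = provider.CompileAssemblyFromFile(options, sourceFile);
        if (results.Errors.HasErrors)
            throw new InvalidOperationException(results.Errors[0].ToString());

        // Hypothetical convention: the PLC code lives in a static PlcLogic.Execute(...) method.
        Type type = results.CompiledAssembly.GetType("PlcLogic");
        return type.GetMethod("Execute");
    }
}

The returned MethodInfo could then be invoked by the code runner module once per time step; a precompiled DLL would skip the compilation step and be loaded directly, e.g., with Assembly.LoadFrom.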
Matlab Simulink was used to simulate the physical layer because a wide variety of plants (e.g., power plants, water purification plants and gas plants) have to be covered. Matlab Simulink is a general design and simulation environment for dynamic and embedded systems. It provides several toolboxes that contain pre-defined components for domains such as power systems, mechanics, hydraulics, electronics, etc. These toolboxes are enriched with every new release, providing powerful support for designers and effectively reducing the design time. C code corresponding to Simulink models is generated using the Matlab Real Time Workshop. The generated code is then integrated into the framework and its execution time is synchronized with the operating system clock.

As mentioned above, communications between the simulation core and remote units are handled by .NET's remoting feature. .NET remoting ensures minimal overhead and the use of a well-established implementation. Currently, we use Modbus over TCP for communications between remote PLCs and master units. However, new protocols are easily added by substituting the Modbus handler module.

The synchronization of the model execution time with the system clock is implemented within the simulation core unit using a synchronization algorithm. At first glance, such an algorithm seems to be trivial; however, an experimental study indicated the existence of several pitfalls. The main concern is that the PLC memory is a resource shared between the simulation core unit and the remote PLC units. This means that the PLC memory has to be protected from simultaneous access, which introduces the problem of critical sections from the field of concurrent programming. Intuitively, a synchronization algorithm would have to run the process model only once for each time step. However, in a multi-tasking environment such an approach introduces accumulated deviations because the operating system can stop and resume threads without any intervention from the user space. Based on this observation, the implemented synchronization algorithm includes a loop that runs the process model multiple times in each time step in order to reduce the deviation.
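A minimal C# sketch of such a synchronization loop is shown below (our reconstruction of the idea, not the framework's actual code): the loop computes how many time steps should have elapsed according to the operating system clock and, if the thread was suspended by the scheduler, runs the model several times to catch up.

// Sketch of soft real-time synchronization with catch-up (illustrative names).
using System;
using System.Diagnostics;
using System.Threading;

public class SimulationLoop {
    private readonly TimeSpan timeStep;                 // chosen time step (>= system resolution)
    private readonly object memoryLock = new object();  // PLC memory is a shared resource

    public SimulationLoop(TimeSpan timeStep) { this.timeStep = timeStep; }

    public void Run(Action stepModel, Action runTightlyCoupledPlcs) {
        var clock = Stopwatch.StartNew();
        long stepsDone = 0;
        while (true) {
            // Steps that *should* have been executed according to the OS clock.
            long stepsDue = clock.Elapsed.Ticks / timeStep.Ticks;

            // Catch-up loop: run the model multiple times per wake-up to keep
            // the accumulated deviation from the system clock small.
            while (stepsDone < stepsDue) {
                lock (memoryLock) {           // critical section around PLC memory
                    stepModel();              // generated Simulink/RTW model code
                    runTightlyCoupledPlcs();  // sequential tightly-coupled PLC code
                }
                stepsDone++;
            }
            Thread.Sleep(0);                  // yield: soft, not hard, real time
        }
    }
}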
4. Performance Evaluation
This section focuses on the evaluation of the performance of the framework with respect to scalability, resolution and deviation from the operating system clock. The results show that the framework can support as many as 100 PLCs with code sizes ranging from a few if-instructions to 1,000 if-instructions.
4.1 Experimental Setup
The experiments were conducted on an Emulab testbed running the FreeBSD operating system. Note, however, that the framework was also tested with the Windows 7, Fedora Core 8 and Ubuntu 10.10 operating systems. In all, eight hosts were used, one for running the simulation core unit, one for running the
master unit, and six for running up to 100 remote PLC units. Figure 4 presents the experimental setup and the plant models that were used. Three plant models were constructed in Simulink, from which the corresponding C code was generated using the Matlab Real Time Workshop. The first model (Model 1 in Figure 4) corresponds to a simplified version of a water purification plant with two water tanks. Model 2 corresponds to a 160 MW oil-fired electric power plant based on the Sydsvenska Kraft AB plant in Malmo, Sweden [2]; the model includes a boiler and turbine. Several power plant models are available in the literature [5, 12, 13, 19]; however, this particular model was selected because it includes estimated parameters from a real power plant and has been used by other researchers [1, 20] to validate their proposals. Model 3 extends the second model by incorporating a condenser; the equations for the condenser are based on [11].
4.2 Plant Model Execution Time
We measured the execution times of the Simulink models for the three plants mentioned above. The execution time of the first model was 19.2 μs. The second model has four additional equations, including the equations for $s_q$ and $e_r$, which yielded an added execution time of 4 μs and a total execution time of 23.2 μs. The third model also has four additional equations, which yielded an added execution time of 4.7 μs and a total execution time of 27.9 μs. Based on these measurements, the time step chosen for a system cannot be less than the execution time of the model.
4.3 PLC Code Execution Time
In the current implementation, tightly-coupled PLC code can be provided as a C# source file, a C# DLL or a native dynamic library.

Figure 5. Execution time of tightly-coupled PLC code (1 to 1,000 if-instructions; C# source, C# DLL and native DLL implementations).
Since the loosely-coupled PLC code was included as a module in the remote PLC unit, it had to be written in C# and compiled with the rest of the unit. As shown in Figure 5, the execution time of PLC code varies with the type of implementation (e.g., C# source file, C# DLL or native dynamic library) and the number of if-instructions (for each if-instruction, we also considered a PLC memory write instruction). The C# source file had the longest execution time because the code was compiled at runtime, while the native dynamic library had the shortest execution time because it was a binary that was compiled for the target platform. However, the key advantage of using C# source files is that they do not require the presence of development libraries, changes can be made rapidly and experiments can be resumed quickly.
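For reference, the benchmark PLC code has roughly the following shape (our reconstruction from the description, reusing the illustrative PlcMemory and IPlcLogic types sketched earlier; register indices and thresholds are arbitrary): each if-instruction is paired with a PLC memory write, and the pair is repeated up to the if-instruction count under test.

// Illustrative shape of the benchmark PLC code (two of up to 1,000 pairs).
public class BenchmarkPlcLogic : IPlcLogic {
    public void Execute(PlcMemory memory) {
        if (memory.InputRegisters[0] > 10) memory.HoldingRegisters[0] = 1;
        if (memory.InputRegisters[1] > 10) memory.HoldingRegisters[1] = 1;
        // ... repeated up to the if-instruction count under test
    }
}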
4.4 Measuring System Resolution
For tightly-coupled code, the resolution is a function of the execution time of the model, the execution time of the tightly-coupled code and the added overhead of handling PLC code:

$$RES_1 = t_{model} + \sum_i \left( t_{plc}^{i} + t_{oh} \right)$$
For this setting, the miss rate is zero because the PLC code is run sequentially with the model, i.e., $Miss_1 = 0$. For loosely-coupled code, the resolution is only a function of the model execution time: $RES_2 = t_{model}$.
However, the number of missed values is no longer zero because it depends on the chosen time step and the number of PLCs, i.e., $Miss_2 = \alpha_{st}^{N}$, where $st$ is the chosen time step and $N$ is the number of loosely-coupled PLCs. The value of $\alpha_{st}^{N}$ is determined empirically and can only be approximated because a multi-tasking operating system is used. For a mixed system, which includes loosely-coupled and tightly-coupled code, the resolution equals that for the tightly-coupled setting, i.e., $RES_3 = RES_1$, and the missed count equals that for the loosely-coupled setting, i.e., $Miss_3 = Miss_2$. Thus, the minimal value of the time step that can be chosen is equal to the system resolution.

For tightly-coupled code, we measured the resolution for up to 100 PLCs and code sizes ranging from 1 to 1,000 if-instruction sets. The results shown in Figure 6 correspond to Model 2 with an execution time of 23.3 μs and native DLL-based PLC code. Of course, the resolution would automatically increase when using other models with longer execution times or other PLC code implementations. The resolution can also be increased by increasing the number of PLCs because more PLC code must be executed sequentially with the model. For instance, in the case of a single PLC and PLC code with 100 if-instructions, the resolution is 0.029 ms and increases up to 0.204 ms for 100 PLCs. However, the values generated by the model were not missed and the PLCs were able to react to all model changes.

For loosely-coupled code, the resolution equals the model execution time. Naturally, for mixed systems the resolution equals the model execution time plus the tightly-coupled code execution time (Figure 6). Unlike the tightly-coupled code scenario, it is necessary to consider that PLCs miss values
generated by the model because the execution time is not synchronized between units, a multi-tasking operating system is used, and network communications introduce additional delays. Thus, providing a low miss rate at resolutions of 0.1 ms is difficult because of the multi-tasking operating system and the added overhead of network communications.

We chose nine time steps ranging from 0.1 ms to 1,000 ms and measured the average number of missed reads for each time step. In the experiment, each PLC read the remote memory, ran 100 if-instructions and then wrote the results back to the remote memory. The number of missed reads is influenced by the number of PLCs and the read frequency. We considered up to 100 PLCs and two read frequencies (1 read/time step and 3 reads/time step). Figures 7(a) and 7(b) show the average (for 1 to 100 PLCs) measured percentages of missed reads for 1 read/time step and 3 reads/time step, respectively. For both settings, the average miss rate exceeds 50% for time steps smaller than 1 ms, mainly due to network delays, for which we measured a minimum value of 0.25 ms. For time steps larger than 10 ms, the percentage of missed reads became zero (even for 100 PLCs). Comparing the average values for the two settings (Figure 7(c)) shows a slight decrease in the value of the miss rate for a frequency of 1 read/time step. The reason is that reducing the number of reads decreases the number of simultaneous accesses to the synchronized PLC memory; thus, more PLCs can access the remote memory during each time step.

In summary, for the extreme case of 100 PLCs executing tightly-coupled code with each PLC running 1,000 if-instructions, the proposed framework provides a resolution of 0.638 ms with a miss rate of zero. On the other hand, achieving a miss rate of zero for the loosely-coupled case is only possible with time steps greater than 100 ms.
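As a rough consistency check (our arithmetic, assuming a uniform per-PLC cost), the extreme case reported above can be substituted into the expression for $RES_1$:

$$0.638\,\text{ms} \approx t_{model} + 100\,(t_{plc} + t_{oh}) = 23.3\,\mu\text{s} + 100\,(t_{plc} + t_{oh}),$$

which implies an average cost of $t_{plc} + t_{oh} \approx 6.1\,\mu\text{s}$ per PLC for 1,000 if-instructions.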
4.5 Measuring System Deviation
The proposed framework maintains the execution time synchronized with the operating system clock. Because of the multi-tasking environment used to run the framework, synchronization is affected by the number of PLCs and the chosen code coupling. The measured deviation for tightly-coupled code is shown in Figure 8. For a time step of 0.1 ms, the evolution of the deviation is shown as a dashed line. Note that when more than 20 PLCs are used, the deviation increases gradually and exceeds 1 ms for 30 PLCs. This is because the cumulative PLC code execution time exceeds the 0.1 ms time step. On the other hand, the deviation is decreased by increasing the time step – for time steps of 10 ms, 100 ms, 500 ms and 1 s, the deviation is maintained around 2.2 μs even for 100 PLCs.

Figures 9(a) and 9(b) show the average measured deviations for loosely-coupled code with 1 read/time step and 3 reads/time step, respectively. For both settings, the deviation increases starting with a 0.1 ms time step as more PLCs access the remote PLC memory. Starting with a 100 ms time step, the average deviation is maintained around 0.002 ms for 1 read/time step and
0.1 ms for 3 reads/time step. Comparing the average values for the two settings (Figure 9(c)) shows that a larger read frequency introduces larger deviations, an expected result. More specifically, the value of the deviation increases up to 50 times for a time step of 100 ms.

Figure 7. Average missed PLC read percentages: (a) 1 read/time step; (b) 3 reads/time step; (c) 1 read/time step vs. 3 reads/time step.

Figure 8. Deviation for tightly-coupled PLC code.

Figure 9. Average deviation for loosely-coupled PLC code: (a) 1 read/time step; (b) 3 reads/time step; (c) 1 read/time step vs. 3 reads/time step.
5. Conclusions
The framework for the security analysis of networked industrial control systems represents an advancement over existing approaches that either use real physical components combined with simulated/emulated components or a completely simulated system. The hybrid architecture of the framework uses emulation for protocols and components such as SCADA servers and PLCs, and simulation for the physical layer. This approach captures the complexity of information and communications devices and efficiently handles the complexity of the physical layer. Another novel feature of the framework is that it supports both tightly-coupled and loosely-coupled code. Tightly-coupled code is used when PLCs cannot afford to miss any events; loosely-coupled code permits the injection of new (malicious) code without stopping execution. Tightly-coupled code ensures lower resolution with a zero miss rate; on the other hand, loosely-coupled code permits the integration of complex PLC emulators without affecting the architectures of the other units.

The models considered range from simple water purification plants to complex boiling water power plants. Experimental results demonstrate that the framework is capable of running complex models within tens of microseconds. With regard to parameters such as resolution, miss rate and deviation, tightly-
coupled code provides higher resolution values, but with lower miss rates and deviations. On the other hand, loosely-coupled code yields lower resolution values with higher miss rates and deviations. Our future research will use the framework to study the propagation of perturbations in cyber-physical environments, analyze the behavior of physical plants and develop countermeasures. It will also analyze the physical impact of attacks using more complex models that include descriptions of the physical components.
References

[1] A. Abdennour and K. Lee, An autonomous control system for boiler-turbine units, IEEE Transactions on Energy Conversion, vol. 11(2), pp. 401–406, 1996.
[2] R. Bell and K. Astrom, Dynamic Models for Boiler-Turbine Alternator Units: Data Logs and Parameter Estimation for a 160 MW Unit, Technical Report TFRT-3192, Department of Automatic Control, Lund Institute of Technology, Lund, Sweden, 1987.
[3] J. Calvin and R. Weatherly, An introduction to the high level architecture (HLA) runtime infrastructure (RTI), Proceedings of the Fourteenth Workshop on Standards for the Interoperability of Defense Simulations, pp. 705–715, 1996.
[4] R. Chabukswar, B. Sinopoli, G. Karsai, A. Giani, H. Neema and A. Davis, Simulation of network attacks on SCADA systems, presented at the First Workshop on Secure Control Systems, 2010.
[5] P. Chawdhry and B. Hogg, Identification of boiler models, IEE Proceedings on Control Theory and Applications, vol. 136(5), pp. 261–271, 1989.
[6] C. Davis, J. Tate, H. Okhravi, C. Grier, T. Overbye and D. Nicol, SCADA cyber security testbed development, Proceedings of the Thirty-Eighth North American Power Symposium, pp. 483–488, 2006.
[7] S. East, J. Butts, M. Papa and S. Shenoi, A taxonomy of attacks on the DNP3 protocol, in Critical Infrastructure Protection III, C. Palmer and S. Shenoi (Eds.), Springer, Heidelberg, Germany, pp. 67–81, 2009.
[8] N. Falliere, L. O'Murchu and E. Chien, W32.Stuxnet Dossier, Symantec, Mountain View, California (www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/w32_stuxnet_dossier.pdf), 2011.
[9] M. Guglielmi, I. Nai Fovino, A. Perez-Garcia and C. Siaterlis, A preliminary study of a wireless process control network using emulation testbeds, Proceedings of the Second International Conference on Mobile Lightweight Wireless Systems, pp. 268–279, 2010.
[10] T. Hiyama and A. Ueno, Development of a real time power system simulator in Matlab/Simulink environment, Proceedings of the IEEE Power Engineering Society Summer Meeting, vol. 4, pp. 2096–2100, 2000.
[11] Y. Kim, M. Chung, J. Park and M. Chun, An experimental investigation of direct condensation of steam jet in subcooled water, Journal of the Korean Nuclear Society, vol. 29(1), pp. 45–57, 1997.
[12] A. Kumar, K. Sandhu, S. Jain and P. Kumar, Modeling and control of a micro-turbine-based distributed generation system, International Journal of Circuits, Systems and Signal Processing, vol. 3(2), pp. 65–72, 2009.
[13] J. McDonald and H. Kwatny, Design and analysis of boiler-turbine-generator controls using optimal linear regulator theory, IEEE Transactions on Automatic Control, vol. 18(3), pp. 202–209, 1973.
[14] I. Nai Fovino, A. Carcano, M. Masera and A. Trombetta, An experimental investigation of malware attacks on SCADA systems, International Journal of Critical Infrastructure Protection, vol. 2(4), pp. 139–145, 2009.
[15] I. Nai Fovino, M. Masera, L. Guidi and G. Carpi, An experimental platform for assessing SCADA vulnerabilities and countermeasures in power plants, Proceedings of the Third Conference on Human System Interaction, pp. 679–686, 2010.
[16] S. Neema, T. Bapty, X. Koutsoukos, H. Neema, J. Sztipanovits and G. Karsai, Model-based integration and experimentation of information fusion and C2 systems, Proceedings of the Twelfth International Conference on Information Fusion, pp. 1958–1965, 2009.
[17] PowerWorld Corporation, Champaign, Illinois (www.powerworld.com).
[18] C. Queiroz, A. Mahmood, J. Hu, Z. Tari and X. Yu, Building a SCADA security testbed, Proceedings of the Third International Conference on Network and System Security, pp. 357–364, 2009.
[19] H. Seifi and A. Seifi, An intelligent tutoring system for a power plant simulator, Electric Power Systems Research, vol. 62(3), pp. 161–171, 2002.
[20] W. Tan, H. Marquez, T. Chen and J. Liu, Analysis and control of a nonlinear boiler-turbine unit, Journal of Process Control, vol. 15(8), pp. 883–891, 2005.
[21] C. Wang, L. Fang and Y. Dai, A simulation environment for SCADA security analysis and assessment, Proceedings of the International Conference on Measuring Technology and Mechatronics Automation, vol. 1, pp. 342–347, 2010.
[22] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb and A. Joglekar, An integrated experimental environment for distributed systems and networks, Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, pp. 255–270, 2002.
Chapter 13

USING AN EMULATION TESTBED FOR OPERATIONAL CYBER SECURITY EXERCISES

Christos Siaterlis, Andres Perez-Garcia and Marcelo Masera

Abstract
The detection, coordination and response capabilities of critical infrastructure operators ultimately determine the economic and societal impact of infrastructure disruptions. Operational cyber security exercises are an important element of preparedness activities. Emulation testbeds are a promising approach for conducting multi-party operational cyber exercises. This paper demonstrates how an Emulab-based testbed can be adapted to meet the requirements of operational exercises and human-in-the-loop testing. Three key aspects are considered: (i) enabling secure and remote access by multiple participants; (ii) supporting voice communications during exercises by simulating a public switched telephone network; and (iii) providing exercise moderators with a feature-rich monitoring interface. An exercise scenario involving a man-in-the-middle attack on the Border Gateway Protocol (BGP) is presented to demonstrate the utility of the emulation testbed.
1. Introduction

The increasing dependence of critical infrastructures on information and communications technologies is a growing area of concern. Contingencies that involve abnormal events and disruptions – deliberate (e.g., cyber attacks) or unintentional (e.g., fiber cable cuts) – can result in dire consequences if critical infrastructure operators fail to react promptly, appropriately and effectively. Therefore, in the context of incident preparedness, it is important that the procedures performed during contingencies are carefully planned and tested in advance at the conceptual and technical levels. These activities can reveal vital details that could negatively affect incident detection, coordination and response capabilities.
The execution of cyber security exercises has been identified as a priority at the national [18] and international levels [2]. The U.S. Homeland Security Exercise and Evaluation Program (HSEEP) [3] identifies two main types of exercises: discussion-based exercises and operations-based exercises. Operations-based exercises provide valuable information about the behavior of operators during security incidents, including response times and levels of coordination. These exercises often engage the red team/blue team paradigm (i.e., the use of an attacking team and a defending team) to ascertain the security of a system or network [1, 12, 15]. Our approach diverges from this paradigm in that it focuses on cyber security exercises involving multiple stakeholders from different administrative domains with the primary objective of examining their coordination capabilities. Such exercises are of particular interest due to the distributed, global and privately-owned characteristics of the Internet infrastructure. In the case of multinational exercises, private infrastructure owners (e.g., network service providers (NSPs)) are the principal actors, but are typically competitors; moreover, the notion of a governing entity (e.g., public sector) is hard to define.

Conducting exercises using production systems raises concerns about the potential side-effects to mission-critical systems and services. Software-based simulation has been proposed as a solution [9, 10], but it is limited by the fact that operator behavior can be altered significantly if the exercise platform lacks realism. The third option, hardware-based emulation, is a good candidate because it combines realism and flexibility.

This paper discusses how an emulation testbed, specifically one based on the Emulab software [5], can serve as a platform for executing multi-party, operational cyber security exercises. An Emulab testbed can recreate a wide range of experimentation environments that support the development, debugging and evaluation of complex systems [4, 11]. In the context of cyber security exercises, the DETER testbed [13] has been used in the well-known Cyber Storm exercise to provide visual inputs to participants and help them understand the effects of attacks. We adopt a similar approach and investigate how an Emulab testbed can be adapted for multi-party cyber security exercises. In particular, we identify the missing elements and functionality needed to conduct robust cyber security exercises. The effectiveness of the approach is demonstrated using an exercise with multiple network operators involving a man-in-the-middle attack on the Border Gateway Protocol (BGP).
2. Operational Exercise Components
Cyber security exercises seek to raise the level of preparedness by confronting participants with artificial events and studying their reactions. Six main elements must be considered when designing an exercise:

Participants (Who): Participants come from the government, private sector, media, etc. They have diverse roles such as players, observers, etc.

Location (Where): Participants may be in the same physical location or may participate remotely.

Time and Duration (When): Exercises may last from a few hours to several days or weeks.

Objectives (Why): Exercise objectives can vary widely, e.g., testing recovery procedures and coordination capabilities.

Type (How): Exercises can be grouped into two main categories: discussion-based exercises and operations-based exercises.

Scenario (What): A scenario typically consists of a storyline of events, a master event list and contextual information. A scenario usually incorporates the assets, vulnerabilities and asset topologies, hazards and potential adversaries, threats leading to attacks, attacks and their consequences (physical, psychological, etc.), and countermeasures and response actions.

Figure 1. Design options for cyber security exercises.

Figure 1 presents the various exercise design options. Note that our focus is on operational exercises with remote participants. The exercise elements are not independent of each other. In Figure 1, local operational exercises, both long and short, on production systems are marked as not meaningful. The first option would require participants to be away from their offices for long periods of time. The second option is not feasible in most cases because production systems are located in specific facilities and are not easily accessible.

As mentioned above, this paper focuses on operational exercises that involve multiple remote participants. The objective is to involve key critical infrastructure stakeholders (mainly owners and operators of private infrastructures) at the operational and practical levels in order to assess the communication and coordination of operators during contingencies. Four roughly sequential phases are involved in an exercise: design, setup, execution and analysis. Each phase presents its own challenges, which will be discussed individually after introducing the main concepts underlying an emulation testbed.
Figure 2. Recreating a virtual network configuration: (1) the user provides an experiment description; (2) Emulab reserves the needed resources and configures the physical topology; (3) the desired virtual topology is recreated, including monitoring nodes.

3. Emulab Overview
Using an emulation testbed is one of the most promising approaches for experimenting with large and complex systems. Pure software simulation is often too simplistic to recreate complex environments. Using an ad hoc testbed is not recommended because it can be time-consuming and error-prone to set up, maintain and modify. Consequently, emulation testbeds like Emulab [20] are becoming more popular. They are an attractive option for conducting cyber security exercises because they can support human-in-the-loop experimentation.

An Emulab testbed typically consists of two servers that run the Emulab software (named boss and ops) and a pool of physical resources (e.g., generic personal computers and network devices) used as experimental nodes. The Emulab software permits the automatic and dynamic mapping of physical components (e.g., servers and switches) to a virtual topology. In other words, it configures the physical components so that they emulate the virtual topology as transparently as possible. Thus, significant advantages are provided in terms of the repeatability, scalability and controllability of experiments.

Figure 2 shows the main steps involved in recreating a virtual network configuration in an Emulab testbed. The following steps are involved:

A detailed description of the virtual network configuration is created using Emulab's experiment script, which is based on the TCL language with ns-2 and testbed-specific extensions. In the description, similar components are designated as different instances of the same component type. Consequently, templates of common components (e.g., a Linux DNS server) can be easily reused and automatically deployed and configured.
An experiment is instantiated using the Emulab software. The Emulab server automatically reserves and allocates the required physical resources from the pool of available components. This procedure is called "swap-in," in contrast with the termination of the experiment, which is called "swap-out."

The software configures the network switches to recreate the virtual topology by connecting experimental nodes using multiple VLANs. In the final step before experimentation, the software configures packet captures at predefined links for monitoring purposes.
4. Configuring Emulab
An Emulab testbed is easily configured to support operational exercises. This section describes the steps involved in the design, setup, execution and analysis phases.
4.1 Design Phase
After determining the participants and developing a detailed scenario, the following tasks are performed:

The network topology is described using an experiment script. The components (e.g., servers and routers) to be controlled by the participants are differentiated from the components that simulate the rest of the world (i.e., the context). For example, participants would not have direct access to nodes that generate background traffic (e.g., by replaying real traffic dumps [19]). All the components should be based on reusable templates, which reduces the costs involved in organizing exercises.

The exercise scenario is described using the experiment script (e.g., scenario injects are represented as scheduled or dynamic events). For example, a fiber cable cut could be scheduled by introducing a "link-down" event. Emulab's event scheduling mechanism supports the execution of the scenario in real time instead of simulated time. However, it is important to consider the need to pause exercise execution because swapping-out the experiment can cause scheduled events to be replayed.

The exercise monitoring infrastructure is described and the data collection mechanisms are configured.
4.2 Setup Phase
In the setup phase, the predefined systems are instantiated and configured as in any Emulab experiment. Exercise participants are given access to individual experimental nodes. Access control mechanisms are used to ensure that participants may only access the nodes “owned” by them in the scenario. However, exercise moderators are permitted to access all resources.
As an example, consider the case of a participant representing a network service provider. This participant would be given access to two logical nodes. The first is a router that implements the service provider's routing policy. The second node is a companion management host, on which the participant can install custom tools and scripts used in daily operations. Obviously, preparing the exercise platform in such a manner is important for conducting realistic operational exercises.
4.3 Execution Phase
During the exercise execution phase, the participants interact with the systems and among themselves. Their actions are monitored for further analysis after the end of the exercise (analysis phase). It is important that exercise moderators know how the exercise is evolving so that they can intervene (e.g., by injecting dynamic events) if necessary.
4.4 Analysis Phase
In the analysis phase, the emulation testbed is used to gather recorded data. The data is used to evaluate the response times, durations of actions, levels of coordination, etc. The data collected depends on the scope of the exercise.
5. Challenges
Based on the analysis above, three principal challenges must be addressed in order to use an Emulab testbed for operational exercises:

Multiple Remote Users: The Emulab architecture supports multiple remote users, but does not provide secure access. To address this challenge, we propose the use of VPN connections for secure remote access.

Realistic Environment: Using an emulation testbed addresses the need to recreate IP networks. However, it is necessary to integrate a simulated and monitored public switched telephone network (PSTN) that could be used by participants to communicate and coordinate their activities. This means that telephone calls must be supported as in the real world.

Flexible and Automated Monitoring: Emulab offers limited functionality beyond link-tracing (i.e., packet capture of network traffic). Also, it lacks support for measuring individual node metrics (e.g., CPU utilization) and does not provide a user-friendly monitoring GUI. Therefore, we have integrated Zabbix, a powerful open-source network monitoring application, with the Emulab architecture, enabling the automated monitoring of experimental nodes.

These challenges are discussed in more detail in the following sections.
Figure 3. Emulab testbed with secure remote access.

5.1 Secure Remote Access Architecture
The organization of multi-party exercises is simplified by allowing geographically distributed participants to remotely access the exercise platform. Our secure remote access architecture essentially isolates the testbed by allowing remote access only through VPN connections (Figure 3). An OpenVPN server enables remote users to connect securely through a public network such as the Internet. The confidentiality and integrity of transmitted information are ensured by tunneling protocols and encryption algorithms. Non-interference between participants is implemented via the "no client-to-client" configuration of the OpenVPN server.

All remote users authenticate themselves with the VPN server, which is protected by a firewall. Once a user is connected to the "users network," an internal firewall guarantees that access is available only to the required resources (typically www to boss and ssh to nodes). Also, a user cannot use the new connection (an IP address within the remote users network) to reach the Internet. This architecture provides remote users with access to the platform, but not to the Internet. The only access to the Internet from the testbed is via an authenticated proxy (GW in Figure 3), which is restricted to platform administrators. Having two layers of firewalls provides a high level of security and facilitates the specification of access policies from the remote user network to internal resources. Of course, this architecture is by no means optimal or unique, but it is presented as a reference implementation for secure remote access to the testbed. Naturally, the architecture complements any security enhancements provided by the Emulab architecture [8].
5.2 Support for Voice and Data Networks
An operational exercise platform should provide a realistic environment without any artificial features that could alter participant behavior. Unlike a network simulator [10], an emulation testbed addresses this issue because participants can interact with a realistic IP network that could be based on real routers or on software routers like Quagga. Although software routers are not a replacement for real routers, they permit the testbed to scale to much larger topologies while maintaining a certain degree of fidelity. An important detail is that, by having the exercise platform isolated from the Internet, real-world IP addresses, autonomous systems (ASs), etc. can be safely reused in the experimental network, which increases the realism perceived by the participants.

The exercise platform is extended to simulate a public switched telephone network as in the real world [17]. The extension uses a separate logical network consisting of a central VoIP server and several clients (e.g., soft phones), which are handed to the participants. Our testbed uses an Asterisk server with FreeBSD 8, which is automatically configured by launching scripts from Emulab. However, for the purposes of an exercise, this functionality is augmented with call recording so that communications between participants can be logged for subsequent analysis. A VoIP server like Asterisk also facilitates monitoring because it supports the capture of two-party calls in separate files (after multiplexing both voice streams).
5.3 Improving Network Monitoring Support
Permitting exercise moderators to monitor the execution of a complex cyber security exercise in real time enables them to intervene when needed and to properly simulate non-participating entities. Although some work on extending Emulab's monitoring capabilities has been performed (e.g., SEER [16]), the integration of Emulab with general purpose network monitoring software can guarantee more frequent updates, more functionality and support by a wider community. For these reasons, we have integrated Zabbix, which offers advanced monitoring, alerting and visualization features in a scalable and automated manner.

First, we created a template operating system image of a Zabbix server. The experimental nodes can either have a pre-installed Zabbix agent or have the agent installed at runtime, e.g., using the tb-set-node-tarfiles command. The Zabbix server runs on a separate node and communicates with the agents using the control network to avoid interference with the experiment and to ensure communications with agents despite potential disruptions in the experimental plane (part of the exercise scenario).

The challenge is to automatically configure the server to monitor all the agents because the IP addresses of the nodes allocated to an experiment are not known a priori. We address this issue by including custom emulabmon code in Emulab's experimental script, which specifies the nodes
that are to be monitored and the Zabbix templates to which they should be attached (Figure 4). At swapping-in time, the code calls shell scripts on all the monitored nodes to configure the agents, and invokes a Python script that configures the Zabbix server using its built-in XML-RPC API. This API is in its infancy and, therefore, Zabbix server version 1.8 is the minimum version that can be used.

Figure 4. Zabbix auto-configuration processes.
Figure 5. Zabbix web interface for experiment monitoring.
This process automatically configures the powerful, user-friendly web interface presented in Figure 5. The interface can be used for general purpose
experiment monitoring such as presenting graphs of traffic loads, CPU and memory usage of individual nodes, as well as a network weathermap [7]. Exercise moderators can also use the interface as a central exercise monitoring screen.

Figure 6. Exercise topology.
6. BGP Attack Response Scenario
This section describes an operational exercise intended to assess the communications and coordination of network service providers during a BGP man-in-the-middle attack as used in the infamous YouTube hijacking incident [14]. The use case demonstrates how participants could use an emulation testbed, how a scenario could be played and how information for studying response strategies and communication patterns can be captured. The exercise involved 21 participants: eighteen national network service providers and three global network service providers (R1, R10 and R11) that simulate the Internet core (Figure 6). In this simplified Internet model, each network service provider communicated directly with its neighbors (if they shared a link); otherwise they communicated through the Internet core. An
eBGP session between network service providers that shared a link was used to exchange their prefixes. The exercise scenario assumed that a network service provider was compromised by an internal attack that hijacked the IP address spaces of two network service providers, NSP12 and NSP16. In addition to hijacking the IP address space, the attack also performed a man-in-the-middle exploit by forwarding traffic to the destination. This was accomplished by announcing more specific prefixes of the victims and applying "AS-path prepend" of the intended network service providers in the path [6]. Thus, the attack was able to copy and manipulate traffic between NSP12 and NSP16, thereby compromising the integrity and confidentiality of communications. The scenario assumed that the operators of the compromised network service providers were not reachable and were unable to mitigate the internal attacks promptly.

This scenario was recreated in our Emulab-based testbed using 21 router nodes running Quagga software with the appropriate BGP configuration. Links between network service providers had 10 Mbps of bandwidth and 10 ms of delay, while links in the core had 100 Mbps of bandwidth and 0 ms of delay. Each network service provider announced a /16 prefix that was configured on a loopback interface. An isolated node with Zabbix software connected to the routers via the control network was used to collect traffic statistics. Another experiment with Asterisk software was used to simulate the public switched telephone network and support voice communications between the exercise participants. Every process was automated during experiment start-up using a different script. The scripts enabled Quagga to load the right configuration file on each router. Also, they enabled the Zabbix server to configure the agents and itself with the required hosts and graphs, and the network weathermap to visualize traffic load. Such automation is very important from the point of view of scalability.

After the experiments were instantiated, the events corresponding to the attack were launched. Initially, sensitive traffic between NSP12 and NSP16 followed the path NSP12-R10-R1-R11-NSP16 and vice versa (left-hand side of Figure 7). However, after the attack changed the BGP configuration to hijack the IP address spaces of NSP12 and NSP16 by announcing more specific prefixes, sensitive traffic followed the path NSP12-R10-NSP1-R11-NSP16 and vice versa (right-hand side of Figure 7). Note that Figure 7 presents the visualization of sensitive traffic as seen from the monitoring server.

Figure 7. Sensitive traffic between NSP12 and NSP16.

Since the attack forwarded traffic to the intended destination, the source and destination network service providers were unaware of the route modification and the adversary was able to capture and eventually modify the packets. Although the attack was transparent from the point of view of communications, the users experienced higher delays and the operators could see that traffic was diverted to other interfaces in the routers.

Additional details can be obtained by examining the BGP tables before and after the attack. Before the attack, the path between NSP12 and NSP16 (in
terms of ASs) was: 12 – 110 – 101 – 111 – 16, corresponding to the ASs of NSP12, R10, R1, R11 and NSP16, respectively:

NSP12# show ip bgp
   Network          Next Hop     Metric LocPrf Weight Path
*> 10.16.0.0/16     10.1.20.2                       0 110 101 111 16 i

NSP16# show ip bgp
   Network          Next Hop     Metric LocPrf Weight Path
*> 10.12.0.0/16     10.1.25.2                       0 111 101 110 12 i
After the attack, the BGP tables were manipulated and included more specific prefixes that followed a different path through NSP1 (AS1). The use of "AS-path prepend" by the attack made the new entries seem legitimate, just as if they were announced by the destination network service provider:

NSP12# show ip bgp
   Network          Next Hop     Metric LocPrf Weight Path
*> 10.16.0.0/16     10.1.20.2                       0 110 101 111 16 i
*> 10.16.0.0/24     10.1.20.11                      0 110 1 111 16 i

NSP16# show ip bgp
   Network          Next Hop     Metric LocPrf Weight Path
*> 10.12.0.0/16     10.1.25.2                       0 111 101 110 12 i
*> 10.12.0.0/24     10.1.25.11                      0 111 1 110 12 i
During the exercise, the participants could react by choosing one of, or a combination of, the following two response strategies:
Filtering Strategy: The victims of the attack contact the peering network service providers and ask them to take action. In the exercise, the core routers R10 and R11 receive harmful announcements from NSP1 and filtering must be applied to block the announcements.

More Specific Prefix Strategy: The victims combat the attack by announcing even more specific prefixes. The victims act and coordinate their activities as in the filtering strategy, but they do not need to contact other network service providers.

Although the strategy of announcing more specific prefixes seems less complex and requires less coordination with other network service providers, it may not be the best technical and long-term strategy for several reasons. For example, providers that are upstream of the victims might deploy prefix filters that do not allow the use of more specific prefixes, or the victims could compete with the attacker in announcing prefixes of increasing specificity. A detailed discussion of these issues is outside the scope of this paper.

In the case of the hijacking attack, and assuming that the filtering strategy is applied by the network service providers that are directly connected to the attacker, the total time $T_t$ that the victims would spend on the telephone to ask all the network service providers to filter the attack is given by:

$$T_t \propto (N_p \times N_v) \times T_c$$

where $N_v$ is the number of victims, $N_p$ is the number of network service providers that peer with the attacker and $T_c$ is the time required for two participants to coordinate their actions. This represents the "cost" of mesh communications between uncoordinated victims and network service providers that are directly connected to the single man-in-the-middle attacker. This time is different from the actual time required to mitigate the attack because the latter depends on factors such as the availability of concurrent communications, the time needed to apply filtering by network service providers due to internal procedures, and even BGP convergence times. Furthermore, the formula assumes a constant time $T_c$ for each communication but, in reality, $T_c$ depends on operator experience, contact networks, operators' language skills, etc.

The value of a real operational exercise based on this use case goes beyond theoretic constructs to a deeper understanding of the operational reality where decisions and reaction measures follow administrative procedures. This often translates to a series of communications, possibly involving third parties (e.g., a regional Internet registry (RIR)) to confirm information related to the announced routes. Therefore, in the context of preparedness, the execution of an operational exercise on top of an emulation testbed (enhanced with voice communications) with multiple participants from different network service providers would not only support training, but also provide input to researchers about network service provider coordination in terms of procedures followed, typical values of $T_c$ and the need to automate administrative procedures. Given
the complexity of the Internet, coordination between organizations, institutions and stakeholders is a key factor in any response to a contingency.
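As a purely hypothetical illustration of the $T_t$ formula above, consider the exercise topology with $N_v = 2$ victims (NSP12 and NSP16), $N_p = 2$ providers peering with the attacker (R10 and R11), and an assumed coordination time of $T_c = 5$ minutes per call:

$$T_t \propto (N_p \times N_v) \times T_c = (2 \times 2) \times 5\,\text{min} = 20\,\text{min}.$$

The assumed $T_c$ value is not from the exercise; measuring realistic values of $T_c$ is precisely one of the stated outputs of running such an exercise.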
7. Conclusions
Organizing multi-party operational cyber security exercises using an emulation testbed offers several advantages. Exercises can be conducted without interfering with production networks while offering a realistic environment that supports voice and data. Remote access to the testbed supports real-time exercises of long duration that actively involve large numbers of participants. Exercises can include architectures, technologies and policies that are not yet deployed; and monitoring and data collection can be very detailed with limited privacy concerns. Finally, investing in reusable components simplifies the task of organizing future cyber exercises while reducing costs. Our future work will analyze the effectiveness of the paradigm in real exercises. Other areas of focus include enhancing the fidelity of the platform, and developing and conducting exercises that cover multiple critical infrastructure sectors.
References

[1] W. Adams, E. Gavas, T. Lacey and S. Leblanc, Collective views of the NSA/CSS cyber defense exercise on curricula and learning objectives, Proceedings of the Second Conference on Cyber Security Experimentation and Test, p. 2, 2009.
[2] European Commission, Protecting Europe from Large Scale Cyber-Attacks and Disruptions: Enhancing Preparedness, Security and Resilience, COM(2009) 149, Brussels, Belgium (ec.europa.eu/information_society/policy/nis/docs/comm_ciip/comm_en.pdf), 2009.
[3] Federal Emergency Management Agency, Homeland Security Exercise and Evaluation Program (HSEEP), Washington, DC (hseep.dhs.gov).
[4] Flux Research Group, Emulab bibliography, School of Computing, University of Utah, Salt Lake City, Utah (www.emulab.net/expubs.php).
[5] Flux Research Group, Emulab – Network Emulation Testbed, School of Computing, University of Utah, Salt Lake City, Utah (www.emulab.net).
[6] C. Hepner and E. Zmijewski, Defending against BGP man-in-the-middle attacks, presented at the Black Hat DC Conference, 2009.
[7] H. Jones, Network Weathermap (www.network-weathermap.com).
[8] K. Lahey, R. Braden and K. Sklower, Experiment isolation in a secure cluster testbed, Proceedings of the Conference on Cyber Security Experimentation and Test, 2008.
[9] Y. Li, M. Liljenstam and J. Liu, Real-time security exercises on a realistic interdomain routing experiment platform, Proceedings of the Twenty-Third Workshop on Principles of Advanced and Distributed Simulation, pp. 54–63, 2009.
[10] M. Liljenstam, J. Liu, D. Nicol, Y. Yuan, G. Yan and C. Grier, RINSE: The real-time immersive network simulation environment for network security exercises (extended version), Simulation, vol. 82(1), pp. 43–59, 2006.
[11] J. Mirkovic, A. Hussain, S. Fahmy, P. Reiher and R. Thomas, Accurately measuring denial of service in simulation and testbed experiments, IEEE Transactions on Dependable and Secure Computing, vol. 6(2), pp. 81–95, 2009.
[12] J. Mirkovic, P. Reiher, C. Papadopoulos, A. Hussain, M. Shepard, M. Berg and R. Jung, Testing a collaborative DDoS defense in a red team/blue team exercise, IEEE Transactions on Computers, vol. 57(8), pp. 1098–1112, 2008.
[13] R. Ostrenga and P. Walczak, Application of DETER in large-scale cyber security exercises, Proceedings of the DETER Community Workshop, 2006.
[14] RIPE Network Coordination Center, YouTube hijacking: A RIPE NCC RIS case study, Amsterdam, The Netherlands (www.ripe.net/news/study-youtube-hijacking.html), 2008.
[15] B. Sangster, T. O'Connor, T. Cook, R. Fanelli, E. Dean, W. Adams, C. Morrell and G. Conti, Toward instrumenting network warfare competitions to generate labeled datasets, Proceedings of the Second Conference on Cyber Security Experimentation and Test, p. 9, 2009.
[16] S. Schwab, B. Wilson, C. Ko and A. Hussain, SEER: A security experimentation environment for DETER, Proceedings of the DETER Community Workshop, p. 2, 2007.
[17] R. Stapleton-Gray, Inter-network operations center dial-by-ASN (INOC-DBA): A resource for the network operator community, Proceedings of the Cybersecurity Applications and Technology Conference for Homeland Security, pp. 181–185, 2009.
[18] The White House, The National Strategy to Secure Cyberspace, Washington, DC (www.dhs.gov/xlibrary/assets/National_Cyberspace_Strategy.pdf), 2003.
[19] A. Turner, Tcpreplay (tcpreplay.synfin.net).
[20] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb and A. Joglekar, An integrated experimental environment for distributed systems and networks, Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, pp. 255–270, 2002.
Chapter 14

ANALYZING INTELLIGENCE ON WMD ATTACKS USING THREADED EVENT-BASED SIMULATION

Qi Fang, Peng Liu, John Yen, Jonathan Morgan, Donald Shemanski and Frank Ritter

Abstract
Data available for intelligence analysis is often incomplete, ambiguous and voluminous. Also, the data may be unorganized, the details overwhelming, and considerable analysis may be required to uncover adversarial activities. This paper describes a simulation-based approach that helps analysts understand data and use it to predict future events and possible scenarios. In particular, the approach enables intelligence analysts to find, display and understand data relationships by connecting the dots of data to create a network of information. The approach also generates alternative storylines, allowing analysts to view other possible outcomes. It facilitates the automation of reasoning and the detection of inconsistent data, which provides more reliable information for analysis. A case study using data from the TV series 24 demonstrates the feasibility of the approach and its application to the intelligence analysis of WMD attacks against the critical infrastructure.
1. Introduction

Terrorists often target critical infrastructures. The U.S. State Department defines terrorism as “premeditated, politically motivated violence perpetrated against noncombatant targets by sub-national groups or clandestine agents, usually intended to influence the audience” [17]. This paper focuses on the analysis of intelligence related to weapons of mass destruction (WMD) attacks against critical infrastructures. WMD attacks could be chemical, biological, radiological, nuclear or combinations thereof [5]. The general characteristics of terrorists and other clandestine groups who seek to acquire WMD include cause, commitment, camaraderie, charismatic leadership, cash and resources, and cells [19].
Figure 1. Intelligence analysis workflow.
Organizational characteristics include command, control and communications, recruitment, weapons procurement, logistics, surveillance, operations and finance. Organizational complexity, characterized by the division of responsibility within the group with respect to the various tasks outlined above, generally contributes to the likelihood of a successful, high-yield WMD event, while also generating a greater amount of traceable data.

In order to combat terrorism in a timely and effective manner, intelligence analysts need to continuously analyze incoming information related to key actors, organizations and events; identify patterns, anomalies, relationships and causal influences; and provide alternative explanations and possible outcomes for decision making. When given an assignment, intelligence analysts search for information, assemble and organize the information in a manner designed to facilitate retrieval and analysis, analyze the information to make an estimative judgment, and write a report [8]. Figure 1 shows the workflow and key decision points in the intelligence gathering and analysis cycle [8].
There are four broad challenges in intelligence analysis: data collection, synthesis, validation and interpretation. Also, simulation tools that facilitate intelligence analysis must operate within the applicable time constraints even when information is abundant or incomplete. To address these challenges, we describe a threaded event-based simulation approach for intelligence analysis. The simulation approach offers intelligence analysts a means for identifying causal relationships and patterns in large data sets, detecting missing data, performing counter-validation and mapping multiple alternative storylines to support emerging analysis priorities.
2. Related Work
Several techniques have been developed to address the challenges of organizing information in order to identify recurring patterns and causal relationships, distinguish relevant information from noise, and infer activities of interest from incomplete data. However, these techniques generally place limited, if any, emphasis on counter-validation.

Computer-aided analysis enhances the ability of intelligence analysts to reason about complex problems when the available information is incomplete or ambiguous, as is typically the case in intelligence analysis [8]. Modeling and simulation can provide valuable knowledge, understanding and preparation to combat future attacks [20]. Simulation approaches can be broadly grouped into two categories: agent-based and event-based simulations.

Several researchers have used agent-based approaches to model and infer the effects of decisions and actions in social systems [6, 12, 15, 17]. In this paradigm, agent interactions are characterized in different ways: as forms of information diffusion [4]; as mechanisms that leverage social influence [11]; as trades, contracts or negotiations [2]; and as the consequences of some activity or strategy [3]. Agent-based approaches also offer a powerful means for representing agent-level capabilities using constraints such as geospatial effects [13], psychological limitations [18] and socio-cognitive effects [14].

Agent-based approaches vary in their portability (i.e., their ability to integrate with simulation environments) and modularity (i.e., the ability of the user/analyst to change aspects of the architecture). Agent architectures also differ in their capabilities and in their theoretical entanglements. Powerful, complex agents are generally computationally expensive, difficult to integrate into simulation environments and entail significant theoretical commitments, which makes them difficult to modify. Lighter agents, on the other hand, are often modular and easier to integrate into existing simulations, but are relatively brittle: they generally model a small set of behaviors very well, but require significant modification to model other behaviors of interest. Consequently, agent-based approaches, when used exclusively, are unlikely to support a wide range of evolving analysis priorities or fundamental changes in understanding without external assistance.
Figure 2. Sparse event network.
Event-based simulations represent behavior by modeling the causal and temporal preconditions, participants, effects, times and locations that characterize an event. Unlike the internal state of an agent, these characteristics are immediately observable and verifiable. Analysts can use hypergraphs or meta-network representations to express causal, temporal, spatial and social relationships as ties between events. Event-based simulations also support deductive and inferential reasoning. By defining the causal and temporal preconditions associated with an event, a set of potential consequences (or storylines) can be deduced. These storylines can be compared with reports gathered from human intelligence sources, providing a form of counter-validation. Alternatively, analysts can identify potential group associations or behaviors by highlighting points of co-occurrence or performing other analyses of the network topology. Note that the notion of events has also been used to model the growth of social networks [16].

Figure 2 shows a sparse event network based on data from [4]. The network has different types of nodes and ties. Red nodes represent agents, orange nodes represent locations, light blue nodes represent resources and dark blue nodes represent tasks/events. The ties between nodes represent several types of relations, such as social relations between agents, spatial relationships between agents and locations, ownership relationships between agents and resources, actor relations between agents and tasks, and distance relations between locations. A sparse network can help visualize nodes with high centrality and cliques, providing early indicators of the nodes of interest.

Event-based simulations are a pragmatic approach to dealing with analysis problems.
Figure 3. Intelligence analysis workflow with simulator.
Many of the basic causal and temporal preconditions associated with events can be handled using established techniques [1, 10]. The relative theoretical flexibility of event-based simulations also makes them responsive to user demands. Basic deductions require only running the simulation again with a new set of inputs; implementing new causal or temporal rules is facilitated through a GUI. On the other hand, event-based simulations generally possess no mechanisms for bringing external knowledge to bear. Also, the scope of inferences and deductions depends on the completeness of the available data and the expert knowledge encoded in the simulations.
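For concreteness, the following minimal sketch (in Python) encodes a toy event network of the kind shown in Figure 2 as typed nodes and typed ties. The class, node and relation names are illustrative and are not taken from the simulator described in this chapter.

    NODE_TYPES = {"agent", "location", "resource", "task"}

    class EventNetwork:
        """Typed nodes and typed ties, as in the sparse event network of Figure 2."""

        def __init__(self):
            self.nodes = {}   # node id -> node type
            self.ties = []    # (source, relation, target) triples

        def add_node(self, node_id, node_type):
            assert node_type in NODE_TYPES, "unknown node type"
            self.nodes[node_id] = node_type

        def add_tie(self, source, relation, target):
            self.ties.append((source, relation, target))

        def degree(self, node_id):
            # A crude centrality indicator; highly connected nodes are
            # early candidates for nodes of interest.
            return sum(node_id in (s, t) for s, _, t in self.ties)

    net = EventNetwork()
    net.add_node("Mamud", "agent")
    net.add_node("Los Angeles", "location")
    net.add_node("Nuclear bomb", "resource")
    net.add_tie("Mamud", "located-at", "Los Angeles")
    net.add_tie("Mamud", "owns", "Nuclear bomb")
    print(net.degree("Mamud"))   # -> 2

A simple degree count stands in here for the centrality and clique analyses mentioned above; a full meta-network toolkit would offer richer topological measures.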
3. Simulation Approach
Figure 3 presents an analysis-driven workflow using our simulator, which supports the second, third and fourth decision points in Figure 1. The simulator provides assisted analysis via the organization of information, the simulation of event possibilities represented as storylines, support for evolving analysis priorities by enabling intelligence analysts to examine the effects of different decision points, the discovery of causal relationships, the detection of missing data and, thus, an important aspect of counter-validation.

The simulator decomposes the original data into a set of networks or storylines by defining three types of nodes (events, actors and objects) along with their associated attributes. The simulator can identify and generate multiple divergent storylines as well as alternative storylines from a large dataset. Multiple storylines occur when a decision point generates multiple coexisting possibilities; alternative storylines occur when a decision point generates two mutually exclusive possibilities. Figure 4 shows an example of multiple storylines generated by the simulator along with the decision nodes used in a simulation. Each storyline is indicated by a different color in the figure.

In previous approaches, temporal and causal dependencies between events are specified in advance by a knowledge engineer; if the preconditions are not satisfied, the associated event cannot proceed. In our approach, temporal and causal dependencies between events are discovered; when the preconditions of an event are not satisfied, the gathering of missing information is triggered. The associated event proceeds after the missing information becomes available.
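One simple way to realize the distinction between multiple (coexisting) and alternative (mutually exclusive) storylines is sketched below. The representation of a decision point as a list of options with an exclusivity flag is an assumption made for illustration, not the chapter's data model.

    from itertools import product

    def enumerate_storylines(prefix, decision_points):
        # Each decision point contributes a tuple of options to every storyline.
        option_sets = []
        for dp in decision_points:
            if dp["exclusive"]:
                # Alternative storylines: exactly one option per storyline.
                option_sets.append([(opt,) for opt in dp["options"]])
            else:
                # Coexisting possibilities: single options and the full
                # combination each spawn a storyline.
                option_sets.append([(opt,) for opt in dp["options"]]
                                   + [tuple(dp["options"])])
        for combo in product(*option_sets):
            yield prefix + [opt for group in combo for opt in group]

    storylines = list(enumerate_storylines(
        ["bomb located"],
        [{"options": ["defuse bomb", "fly bomb out"], "exclusive": True}]))
    # -> [['bomb located', 'defuse bomb'], ['bomb located', 'fly bomb out']]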
Figure 4. Threaded event-based simulation.
To demonstrate the performance of the simulator, we present a case study using data from the second season of the popular TV show 24. The entire season was parsed into a set of discrete events along with their attributes, from which the simulator generated multiple storylines with decision nodes and different options. The simulator also discovered various causal relationships and detected the presence of missing information. The test involving missing information was conducted by deliberately deleting some events from the original data and running the simulation.
4. Simulator Design
As mentioned above, the simulator is designed to reduce the workload of intelligence analysts. Specifically, the simulator incorporates mechanisms for connecting unorganized data into an information network, generating multiple storylines that allow analysts to view different outcomes, and automatically detecting causal relationships and missing information. This section clarifies the principal concepts used in the simulation model. It also describes the software architecture and the algorithms used by the simulator.
4.1 Key Concepts
An event is something of interest that has occurred (e.g., CTU agents have found a bomb). Actors are participants in events. Objects are target infrastructures or tools. Events and their relationships generate state changes. The “world” is defined by events, actors, objects and their relationships. Events, actors and objects have various attributes that capture their properties (Tables 1, 2 and 3, respectively).
Table 1. Event attributes.

Name           Implication                        Example
Event ID       Identifier of event                1
Content        Description of event               “Make death allusion”
Start Time     Time of event                      –20:00
Location       Place where event occurs           WestBank@USA
Actors         Participants involved in event     Mamud Faheen
Relationships  Relationships shown in event
Effects        Effects caused by event            Mamud Faheen.status = 2
Preconditions  Requirements for event to occur    Mamud Faheen.type = 1

Table 2. Actor attributes.

Name         Implication                                            Example
Actor ID     Identifier of actor                                    1
Name         Name of actor                                          Mamud
Sex          1: Male; 2: Female; 0: Unknown                         1
Age          Positive integer                                       49
Type         1: Terrorist; 2: Anti-terrorist agent; 3: Neutral      1
Status       0: Dead; 1: Alive; 2: Arrested; 3: Under surveillance  2
Level        Status level of actor in a task                        10
Affiliation  Organization to which actor belongs                    Second Wave
Location     Location of actor                                      Los Angeles

Table 3. Object attributes.

Name             Implication           Example
Object ID        Identifier of object  1
Object Name      Name of object        Nuclear bomb
Object Status    Status of object      Ready
Object Location  Location of object    Los Angeles
Preconditions and effects are two important attributes of events. Preconditions describe the state of the world before a change that is caused by an event. Effects describe the state of the world (actors and objects) after a change. Events can trigger new events. The simulator treats the preconditions of an event as the qualified states of the actors, objects and relationships under which the event can occur. An event cannot occur in a state that does not satisfy its preconditions. Information gathering, currently in the form of a query to the user, is triggered when preconditions are not satisfied. A causal relationship is present when the effects of one event cause the preconditions of another event to be satisfied. Hidden relationships are discovered by searching the dataset for causal relationships.
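The sketch below illustrates these concepts with data structures that mirror the attribute sets of Tables 1 and 2. The field names and the encoding of preconditions and effects as qualified attribute/value pairs are illustrative assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class Actor:
        actor_id: int
        name: str
        type: int       # 1: terrorist; 2: anti-terrorist agent; 3: neutral
        status: int     # 0: dead; 1: alive; 2: arrested; 3: under surveillance
        location: str = ""

    @dataclass
    class Event:
        event_id: int
        content: str
        start_time: int
        location: str
        actors: list = field(default_factory=list)
        preconditions: dict = field(default_factory=dict)  # "Mamud.type" -> 1
        effects: dict = field(default_factory=dict)        # "Mamud.status" -> 2
        executed: bool = False

    def preconditions_satisfied(event, world):
        # world maps qualified attribute names (e.g., "Mamud.status") to
        # values; unmet preconditions are returned so that missing-information
        # gathering can be triggered for exactly those items.
        unmet = {k: v for k, v in event.preconditions.items()
                 if world.get(k) != v}
        return len(unmet) == 0, unmet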
Figure 5. Software architecture of the simulator.

4.2 Software Architecture
Figure 5 shows the software architecture of the simulator, including the components that support the detection of causal relationships and the identification of missing data. The figure also illustrates the selection of events by the simulator. The selection is guided by rules, constraints and guidelines that can be obtained from and/or altered by experienced analysts. The parser extracts information about events and their attributes. The format detector verifies that the parsed data can be executed by the simulator.

The simulator has six main components: (i) data storage, which saves all the data during a simulation process and includes the available event storage that holds the events whose preconditions are satisfied; (ii) the rules/constraints/guidelines component, which saves the logic rules used by the simulator; (iii) the unexecuted event detector, which selects unexecuted events; (iv) the event availability detector, which selects the events whose preconditions are satisfied from the output of the unexecuted event detector; (v) the event selector, which selects the event with the earliest start time from the output of the event availability detector; and (vi) the event executor, which executes the selected event, changes the attributes accordingly and marks the event as “executed.” After completing a simulation, the simulator produces a chronological sequence of discrete events for each storyline according to the execution order of the events.

The reasoning engine is designed to discover causal relationships between events and to detect missing information based on the output of the simulator. For each event, the reasoning engine matches its preconditions with the effects of all preceding events to check if any relationship exists between the events. The reasoning engine detects missing information when one or more preconditions do not match.
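A simplified sketch of the event-selection loop and the reasoning engine, reusing the Event objects and world conventions from the earlier sketch, is shown below. It is an illustration of the components just described, not the authors' implementation.

    def run_simulation(events, world):
        # Simplified event loop: unexecuted event detector, event availability
        # detector, event selector and event executor rolled into one function.
        timeline = []
        while True:
            ready = [e for e in events
                     if not e.executed
                     and all(world.get(k) == v
                             for k, v in e.preconditions.items())]
            if not ready:
                break
            event = min(ready, key=lambda e: e.start_time)  # event selector
            world.update(event.effects)                     # event executor
            event.executed = True
            timeline.append(event)
        return timeline  # chronological sequence of discrete events

    def discover_causal_links(timeline):
        # Reasoning engine: match each event's preconditions against the
        # effects of all preceding events; a precondition that no earlier
        # event provides indicates missing information.
        links, missing = [], []
        for i, event in enumerate(timeline):
            for key, value in event.preconditions.items():
                providers = [p for p in timeline[:i]
                             if p.effects.get(key) == value]
                if providers:
                    links += [(p.event_id, event.event_id) for p in providers]
                else:
                    missing.append((event.event_id, key))
        return links, missing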
4.3 Algorithms
Figure 6 presents the simulator workflow.
Figure 6. Simulator workflow.
The workflow involves the following steps:

Step 1: Parse the input data into the appropriate XML format.

Step 2: Search for unexecuted options; terminate the process if none exist.

Step 3: Select the option with the earliest start time whose conditions are satisfied.

Step 4: Search for subsequent events; query for additional information if necessary.

Step 5: Generate a network of events based on the causal and temporal dependencies.

Algorithms are implemented for information organization, simulation, causal relationship discovery, missing data detection and multiple storyline generation.

Information Organization: The input data for the simulator is a set of events with their associated attributes in XML format. Each storyline has three input files corresponding to events, actors and objects, along with their associated attributes. Figure 7 presents the input data format. The parser processes all the input files, extracts the attributes in each record and saves the objects in the event, actor and object storage.

Simulation: After parsing the data, the simulator goes through the event storage and checks if the preconditions of each event are satisfied. If an event is unexecuted and its preconditions are satisfied, the simulator marks the event as “ready.” From the ready events, the simulator picks the event with the earliest start time to execute. After the event
Figure 7. Input data format.

<Event>
  <EventID>1</EventID>
  <Content>Mamud makes death allusion</Content>
  <StartTime>-2000</StartTime>
  <Location>WestBank@USA</Location>
  <Actors>Mamud</Actors>
  <Effects>
    <Effect>Mamud.status="dead"</Effect>
  </Effects>
</Event>
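The listing above is reconstructed from a damaged extraction: the EventID, StartTime, Effects and Effect tags appear in the source, while the Content, Location and Actors tags are inferred from the event attributes in Table 1 and should be treated as assumptions. Under those assumptions, a parser for this format can be sketched with the Python standard library:

    import xml.etree.ElementTree as ET

    def parse_event(xml_text):
        # Extract one event record from the Figure 7 XML format.
        node = ET.fromstring(xml_text)
        return {
            "event_id": node.findtext("EventID"),
            "content": node.findtext("Content"),
            "start_time": node.findtext("StartTime"),
            "location": node.findtext("Location"),
            "actors": (node.findtext("Actors") or "").split(","),
            "effects": [e.text for e in node.findall("./Effects/Effect")],
        }

    record = parse_event("""<Event>
      <EventID>1</EventID>
      <Content>Mamud makes death allusion</Content>
      <StartTime>-2000</StartTime>
      <Location>WestBank@USA</Location>
      <Actors>Mamud</Actors>
      <Effects><Effect>Mamud.status="dead"</Effect></Effects>
    </Event>""")
    print(record["content"])   # -> Mamud makes death allusion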