Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison - Lancaster University, UK
Takeo Kanade - Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler - University of Surrey, Guildford, UK
Jon M. Kleinberg - Cornell University, Ithaca, NY, USA
Alfred Kobsa - University of California, Irvine, CA, USA
Friedemann Mattern - ETH Zurich, Switzerland
John C. Mitchell - Stanford University, CA, USA
Moni Naor - Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz - University of Bern, Switzerland
C. Pandu Rangan - Indian Institute of Technology, Madras, India
Bernhard Steffen - TU Dortmund University, Germany
Madhu Sudan - Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos - University of California, Los Angeles, CA, USA
Doug Tygar - University of California, Berkeley, CA, USA
Gerhard Weikum - Max Planck Institute for Informatics, Saarbruecken, Germany
6723
Pascal Felber Romain Rouvoy (Eds.)
Distributed Applications and Interoperable Systems 11th IFIP WG 6.1 International Conference DAIS 2011 Reykjavik, Iceland, June 6-9, 2011 Proceedings
Volume Editors Pascal Felber Université de Neuchâtel, Institut d’Informatique Rue Emile-Argand 11, B-114, 2000 Neuchâtel, Switzerland E-mail:
[email protected] Romain Rouvoy University of Lille 1, LIFL Campus Scientifique, Bâtiment Extension M3 59655 Villeneuve d’Ascq Cedex, France E-mail:
[email protected]
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-21386-1 e-ISBN 978-3-642-21387-8 DOI 10.1007/978-3-642-21387-8 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011928247 CR Subject Classification (1998): C.2, D.2, H.4, H.5, H.3, C.4 LNCS Sublibrary: SL 5 – Computer Communication Networks and Telecommunications © IFIP International Federation for Information Processing 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Foreword
In 2011 the 6th International Federated Conference on Distributed Computing Techniques (DisCoTec) took place in Reykjavik, Iceland, during June 6-9. It was hosted and organized by Reykjavik University. The DisCoTec series of federated conferences, one of the major events sponsored by the International Federation for Information Processing (IFIP), included three conferences: Coordination, DAIS, and FMOODS/FORTE. DisCoTec conferences jointly cover the complete spectrum of distributed computing subjects, ranging from theoretical foundations to formal specification techniques to practical considerations. The 13th International Conference on Coordination Models and Languages (Coordination) focused on the design and implementation of models that allow compositional construction of large-scale concurrent and distributed systems, including both practical and foundational models, run-time systems, and related verification and analysis techniques. The 11th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS) elicited contributions on architectures, models, technologies and platforms for large-scale and complex distributed applications and services that are related to the latest trends in bridging the physical/virtual worlds based on flexible and versatile service architectures and platforms. The 13th Formal Methods for Open Object-Based Distributed Systems and 31st Formal Techniques for Networked and Distributed Systems (FMOODS/FORTE) together emphasized distributed computing models and formal specification, testing and verification methods.
Each of the three days of the federated event began with a plenary speaker nominated by one of the conferences. On the first day, Giuseppe Castagna (CNRS, Paris 7 University, France) gave a keynote titled “On Global Types and Multi-Party Sessions.” On the second day, Paulo Verissimo (University of Lisbon FCUL, Portugal) gave a keynote talk on “Resisting Intrusions Means More than Byzantine Fault Tolerance.” On the third and final day, Pascal Costanza (ExaScience Lab, Intel, Belgium) presented a talk that discussed “Extreme Coordination—Challenges and Opportunities from Exascale Computing.” In addition, there was a poster session, and a session of invited talks from representatives of Icelandic industries including Ossur, CCP Games, Marorka, and GreenQloud.
There were five satellite events:
1. The 4th DisCoTec workshop on Context-Aware Adaptation Mechanisms for Pervasive and Ubiquitous Services (CAMPUS)
2. The Second International Workshop on Interactions Between Computer Science and Biology (CS2BIO) with keynote lectures by Jasmin Fisher (Microsoft Research - Cambridge, UK) and Gordon Plotkin (Laboratory for Foundations of Computer Science - University of Edinburgh, UK)
3. The 4th Workshop on Interaction and Concurrency Experience (ICE) with keynote lectures by Prakash Panangaden (McGill University, Canada), Rocco de Nicola (University of Florence, Italy), and Simon Gay (University of Glasgow, UK)
4. The First Workshop on Process Algebra and Coordination (PACO) with keynote lectures by Jos Baeten (Eindhoven University of Technology, The Netherlands), Dave Clarke (Katholieke Universiteit Leuven, Belgium), Rocco De Nicola (University of Florence, Italy), and Gianluigi Zavattaro (University of Bologna, Italy)
5. The 7th International Workshop on Automated Specification and Verification of Web Systems (WWV) with a keynote lecture by Elie Najm (Telecom Paris, France)
I believe that this rich program offered each participant an interesting and stimulating event. I would like to thank the Program Committee Chairs of each conference and workshop for their effort. Moreover, organizing DisCoTec 2011 was only possible thanks to the dedicated work of the Publicity Chair Gwen Salaün (Grenoble INP - INRIA, France), the Workshop Chairs Marcello Bonsangue (University of Leiden, The Netherlands) and Immo Grabe (CWI, The Netherlands), the Poster Chair Martin Steffen (University of Oslo, Norway), the Industry Track Chairs Björn Jónsson (Reykjavik University, Iceland) and Oddur Kjartansson (Reykjavik University, Iceland), and the members of the Organizing Committee from Reykjavik University: Árni Hermann Reynisson, Steinar Hugi Sigurðarson, Georgiana Caltais Goriac, Eugen-Ioan Goriac and Ute Schiffel. To conclude, I want to thank the International Federation for Information Processing (IFIP), Reykjavik University, and CCP Games Iceland for their sponsorship.
June 2011
Marjan Sirjani
Preface
This volume contains the proceedings of DAIS 2011, the IFIP International Working Conference on Distributed Applications and Interoperable Systems. The conference was held in Reykjavik, Iceland, during June 6–9, 2011 as part of the DisCoTec (Distributed Computing Techniques) federated conference, together with the International Conference on Formal Techniques for Distributed Systems (FMOODS and FORTE) and the International Conference on Coordination Models and Languages (COORDINATION). The DAIS 2011 conference was sponsored by IFIP (International Federation for Information Processing) in cooperation with ACM SIGSOFT and ACM SIGAPP, and it was the eleventh conference in the DAIS series of events organized by IFIP Working Group 6.1.
The conference program presented the state of the art in research on distributed and interoperable systems. Distributed application technology has become a foundation of the information society. New computing and communication technologies have brought up a multitude of challenging application areas, including mobile computing, inter-enterprise collaborations, ubiquitous services, service-oriented architectures, autonomous and self-adapting systems, and peer-to-peer systems, just to name a few. New challenges include the need for novel abstractions supporting the development, deployment, management, and interoperability of evolutionary and complex applications and services, such as those bridging the physical/virtual worlds. Therefore, the linkage between applications, platforms and users through multidisciplinary user requirements (e.g., security, privacy, usability, efficiency, safety, semantic and pragmatic interoperability of data and services, dependability, trust and self-adaptivity) becomes of special interest. It is envisaged that future complex applications will far exceed those of today in terms of these requirements.
The main part of the conference program comprised presentations of the accepted papers. This year, the technical program of DAIS drew from 55 submitted papers. All papers were reviewed by at least three reviewers. After initial reviews were posted, a set of candidate papers was selected and subjected to discussion among the reviewers and Program Committee Chairs to resolve differing viewpoints. As a result of this process, 18 full papers were selected for inclusion in the proceedings, and 6 short papers were included additionally. The papers presented at DAIS 2011 address key challenges of modern distributed services and applications, including pervasiveness and peer-to-peer environments, and tackle issues related to adaptation, interoperability, availability and performance, as well as dependability and security.
Finally, we would like to take this opportunity to thank the numerous people whose work made this conference possible. We wish to express our deepest gratitude to the authors of submitted papers, to all Program Committee members for their active participation in the paper review process, to all external reviewers
for their help in evaluating submissions, to Reykjavik University for hosting the event, to the Publicity Chairs, to the DAIS Steering Committee for their advice, and to Marjan Sirjani for acting as a General Chair of the joint event. June 2011
Pascal Felber Romain Rouvoy
Organization
Program Committee
Umesh Bellur - Indian Institute of Technology Bombay, India
Yolande Berbers - Katholieke Universiteit Leuven, Belgium
Antoine Beugnard - TELECOM Bretagne, France
Gordon Blair - Lancaster University, UK
António Casimiro - University of Lisbon, Portugal
Emmanuel Cecchet - University of Massachusetts, USA
Anwitaman Datta - NTU, Singapore
Ada Diaconescu - TELECOM ParisTech, France
Jim Dowling - Swedish Institute of Computer Science, Sweden
Frank Eliassen - University of Oslo, Norway
Pascal Felber - Université de Neuchâtel, Switzerland
Kurt Geihs - University of Kassel, Germany
Karl Göschka - Vienna University of Technology, Austria
Svein Hallsteinsen - SINTEF, Norway
Peter Herrmann - NTNU Trondheim, Norway
Jadwiga Indulska - University of Queensland, Australia
Hans-Arno Jacobsen - University of Toronto, Canada
Rüdiger Kapitza - University of Erlangen-Nürnberg, Germany
Reinhold Kroeger - University of Applied Sciences, Wiesbaden, Germany
Lea Kutvonen - University of Helsinki, Finland
Winfried Lamersdorf - University of Hamburg, Germany
Peter Linington - University of Kent, UK
Raimundo Macêdo - Federal University of Bahia, Brazil
René Meier - Trinity College Dublin, Ireland
Elie Najm - TELECOM ParisTech, France
Nitya Narasimhan - Motorola Labs, USA
José Pereira - University of Minho, Portugal
Guillaume Pierre - Vrije Universiteit Amsterdam, The Netherlands
Peter Pietzuch - Imperial College London, UK
Frantisek Plasil - Charles University, Czech Republic
Etienne Riviere - Université de Neuchâtel, Switzerland
Romain Rouvoy - University Lille 1, France
Douglas Schmidt - Carnegie Mellon University, USA
Francois Taiani - Lancaster University, UK
Sotirios Terzis - University of Strathclyde, UK
Gaël Thomas - LIP6, France
Additional Reviewers
Babka, Vlastimil
Campos, Filipe
Comes, Diana
Decky, Martin
Dinh Tien Tuan, Anh
Distler, Tobias
Dixit, Monica
Evers, Christoph
Hamann, Kristof
Hnetynka, Petr
Jander, Kai
Jiang, Shanshan
Kirchner, Dominik
Kraemer, Frank Alexander
Lie, Arne
Liu, Xin
Maniymaran, Balasubramaneyam
Marques, Luis
Mathisen, Bjørn Magnus
Matos, Miguel
Michaux, Jonathan
Mokhtarian, Kianoosh
Payberah, Amir H.
Phung-Khac, An
Poch, Tomas
Schaefer, Jan
Schiavoni, Valerio
Segarra, Maria-Teresa
Skubch, Hendrik
Slåtten, Vidar
Stengel, Klaus
Thoss, Marcus
Ventresque, Anthony
Vilenica, Ante
Von Der Weth, Christian
Wagner, Michael
Ye, Chunyang
Zapf, Michael
Zaplata, Sonja
Table of Contents
Gozar: NAT-Friendly Peer Sampling with One-Hop Distributed NAT Traversal ..... 1
   Amir H. Payberah, Jim Dowling, and Seif Haridi
Modeling the Performance of Ring Based DHTs in the Presence of Network Address Translators ..... 15
   John Ardelius and Boris Mejías
Usurp: Distributed NAT Traversal for Overlay Networks ..... 29
   Salman Niazi and Jim Dowling
Kalimucho: Contextual Deployment for QoS Management ..... 43
   Christine Louberry, Philippe Roose, and Marc Dalmau
Providing Context-Aware Adaptations Based on a Semantic Model ..... 57
   Guido Söldner, Rüdiger Kapitza, and René Meier
Towards QoC-Aware Location-Based Services ..... 71
   Sophie Chabridon, Cao-Cuong Ngo, Zied Abid, Denis Conan, Chantal Taconet, and Alain Ozanne
Session-Based Role Programming for the Design of Advanced Telephony Applications ..... 77
   Gilles Vanwormhoudt and Areski Flissi
Architecturing Conflict Handling of Pervasive Computing Resources ..... 92
   Henner Jakob, Charles Consel, and Nicolas Loriant
Passive Network-Awareness for Dynamic Resource-Constrained Networks ..... 106
   Agoston Petz, Taesoo Jun, Nirmalya Roy, Chien-Liang Fok, and Christine Julien
Utility Driven Elastic Services ..... 122
   Pablo Chacin and Leandro Navarro
Improving the Scalability of Cloud-Based Resilient Database Servers ..... 136
   Luís Soares and José Pereira
An Extensible Framework for Dynamic Market-Based Service Selection and Business Process Execution ..... 150
   Ante Vilenica, Kristof Hamann, Winfried Lamersdorf, Jan Sudeikat, and Wolfgang Renz
Beddernet: Application-Level Platform-Agnostic MANETs ..... 165
   Rasmus Sidorovs Gohs, Sigurður Rafn Gunnarsson, and Arne John Glenstrup
The Role of Ontologies in Enabling Dynamic Interoperability ..... 179
   Vatsala Nundloll, Paul Grace, and Gordon S. Blair
A Step towards Making Local and Remote Desktop Applications Interoperable with High-Resolution Tiled Display Walls ..... 194
   Tor-Magne Stien Hagen, Daniel Stødle, John Markus Bjørndalen, and Otto Anshus
Replica Placement in Peer-Assisted Clouds: An Economic Approach ..... 208
   Ahmed Ali-Eldin and Sameh El-Ansary
A Correlation-Aware Data Placement Strategy for Key-Value Stores ..... 214
   Ricardo Vilaça, Rui Oliveira, and José Pereira
Experience Report: Trading Dependability, Performance, and Security through Temporal Decoupling ..... 228
   Lorenz Froihofer, Guenther Starnberger, and Karl M. Goeschka
Cooperative Repair of Wireless Broadcasts ..... 243
   Aaron Harwood, Spyros Voulgaris, and Maarten van Steen
ScoreTree: A Decentralised Framework for Credibility Management of User-Generated Content ..... 249
   Yang Liao, Aaron Harwood, and Kotagiri Ramamohanarao
Worldwide Consensus ..... 257
   Francisco Maia, Miguel Matos, José Pereira, and Rui Oliveira
Transparent Scalability with Clustering for Java e-Science Applications ..... 270
   Pedro Sampaio, Paulo Ferreira, and Luís Veiga
CassMail: A Scalable, Highly-Available, and Rapidly-Prototyped E-Mail Service ..... 278
   Lazaros Koromilas and Kostas Magoutis
Transparent Adaptation of e-Science Applications for Parallel and Cycle-Sharing Infrastructures ..... 292
   João Morais, João Nuno Silva, Paulo Ferreira, and Luís Veiga
Author Index ..... 301
Gozar: NAT-Friendly Peer Sampling with One-Hop Distributed NAT Traversal
Amir H. Payberah¹,², Jim Dowling¹, and Seif Haridi¹,²
¹ Swedish Institute of Computer Science (SICS)
² KTH - Royal Institute of Technology
Abstract. Gossip-based peer sampling protocols have been widely used as a building block for many large-scale distributed applications. However, Network Address Translation gateways (NATs) cause most existing gossiping protocols to break down, as nodes cannot establish direct connections to nodes behind NATs (private nodes). In addition, most of the existing NAT traversal algorithms for establishing connectivity to private nodes rely on third party servers running at well-known, public IP addresses. In this paper, we present Gozar, a gossip-based peer sampling service that: (i) provides uniform random samples in the presence of NATs, and (ii) enables direct connectivity to sampled nodes using a fully distributed NAT traversal service, where connection messages require only a single hop to connect to private nodes. We show in simulation that Gozar preserves the randomness properties of a gossip-based peer sampling service. We show the robustness of Gozar when a large fraction of nodes reside behind NATs and also in catastrophic failure scenarios. For example, if 80% of nodes are behind NATs, and 80% of the nodes fail, more than 92% of the remaining nodes stay connected. In addition, we compare Gozar with existing NAT-friendly gossip-based peer sampling services, Nylon and ARRG. We show that Gozar is the only system that supports one-hop NAT traversal, and its overhead is roughly half of Nylon’s.
1 Introduction

Peer sampling services have been widely used in large-scale distributed applications, such as information dissemination [7], aggregation [17], and overlay topology management [14,28]. A peer sampling service (PSS) periodically provides a node with a uniform random sample of live nodes from all nodes in the system, where the sample size is typically much smaller than the system size [15]. The sampled nodes are stored in a partial view that consists of a set of node descriptors, which are updated periodically by the PSS. Gossiping algorithms are the most common approach to implementing a PSS [29,9,16]. Gossip-based PSS’ can ensure that node descriptors are distributed uniformly at random over all partial views [18]. However, in the Internet, where a high percentage of nodes are behind NATs, these traditional gossip-based PSS’ become biased. Nodes cannot establish direct connections to nodes behind NATs (private nodes), and private nodes become under-represented in partial views, while nodes that do support direct connectivity, public nodes, become over-represented in partial views [19].
The ability to establish direct connectivity with private nodes, using NAT traversal algorithms, has traditionally not been considered by gossip-based PSS’. However, as
nodes are typically sampled from a PSS in order to connect to them, there are natural benefits to including NAT traversal as part of a PSS. Nylon [19] was the first system to present a distributed solution to NAT traversal that uses existing nodes in the PSS to help in NAT traversal. Nylon uses nodes that have successfully established a connection to a private node as partners who will both route messages to the private node (through its NAT) and coordinate NAT hole punching algorithms [8,19]. As node descriptors spread in the system through gossiping, this creates routing table entries for paths that forward packets to private nodes. However, long routing paths increase both network traffic at intermediary nodes and the routing latency to private nodes. Also, routing paths become fragile when nodes frequently join and leave the system (churn). Finally, hole punching is slow and can take up to a few seconds over the Internet [27]. This paper introduces Gozar, a gossip-based peer sampling service that (i) provides uniform random samples in the presence of NATs, and (ii) enables direct connectivity to sampled nodes by providing a distributed NAT traversal service that requires only a single intermediary hop to connect to a private node. Gozar uses public nodes as both relay servers [13] (to forward messages to private nodes) and rendezvous servers [8] (to establish direct connections with private nodes using hole punching algorithms). Relaying and hole punching is enabled by private nodes finding public nodes who will act as both relay and rendezvous partners for them. For load balancing and fairness, public nodes accept only a small bounded number of private nodes as partners. When references to private nodes are gossiped in the PSS or sampled using the PSS, they include the addresses of their partner nodes. A node, then, can use these partners to either (i) gossip with a private node by relaying or (ii) establish a direct connection with the private node by using the partner for hole punching. We favour relaying over hole punching when gossiping with private nodes due to the low connection setup time compared to hole punching and also because the messages involved are small and introduce negligible overhead to public nodes. However, the hole punching service can be used by clients of the PSS to establish a direct connection with a sampled private node. NAT hole punching is typically required by applications such as video-on-demand [2] and live streaming [22,23], where relaying would introduce too much overhead on public nodes. A private node may have several redundant partners. Although redundancy introduces some extra overhead on public nodes, it also reduces latency when performing NAT traversal, as parallel connection requests can be sent to several partners, with the end-to-end connection latency being the fastest of the partners to complete NAT traversal. In this way, a more reliable NAT traversal service can be built over more unreliable connection latencies, such as those widely seen on the Internet. We evaluate Gozar in simulation and show how its PSS maintains its randomness property even in networks containing large fractions of NATs. We validate its behaviour through comparison with the widely used Cyclon protocol [29] (which does not support networks containing NATs). 
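The redundant-partner mechanism described above amounts to racing several NAT traversal attempts and keeping whichever completes first. The following minimal Python sketch illustrates the idea; it is not taken from the Gozar implementation, and the hole_punch_via callable is a hypothetical stand-in for the application's rendezvous/hole-punching routine.

    import concurrent.futures

    def connect_via_fastest_partner(private_node, partners, hole_punch_via, timeout=5.0):
        # hole_punch_via(partner, private_node) is a hypothetical callable that
        # returns an open connection or raises on failure.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(partners))
        futures = [pool.submit(hole_punch_via, p, private_node) for p in partners]
        try:
            for fut in concurrent.futures.as_completed(futures, timeout=timeout):
                try:
                    return fut.result()      # fastest successful traversal wins
                except Exception:
                    continue                 # this partner failed; wait for another one
            raise ConnectionError("NAT traversal failed via all partners")
        finally:
            pool.shutdown(wait=False)        # do not block on the slower attempts

Racing the partners trades a small amount of extra signalling for an end-to-end connection latency equal to that of the fastest partner, which is why redundancy improves reliability over links with highly variable latencies.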
We also compare the performance of Gozar with the only other NAT-friendly PSS’ we found in the literature, Nylon [19] and ARRG [4], and show how Gozar has less protocol overhead compared to Nylon and ARRG, and is the only NAT-friendly peer sampling system that supports one hop NAT traversal.
2 Related Work

Dan Kegel explored STUN [26] as a UDP hole punching solution for NAT traversal, and Guha et al. extended it to TCP by introducing STUNT [10]. However, studies [8,10] show that NAT hole punching fails 10-15% of the time for UDP and 30-40% of the time for TCP traffic. TURN [13] was an alternative solution for NAT traversal using relay nodes that works for all nodes that can establish an outbound connection. Interactive connectivity establishment (ICE) [25] has been introduced as a more general technique for NAT traversal for media streams that makes use of both STUN [26] and TURN [13]. All these techniques rely on third party servers running at well-known addresses.
Kermarrec et al. introduce in Nylon [19] a distributed NAT traversal technique that uses all existing nodes in the system (both private and public nodes) as rendezvous servers (RVPs). In Nylon, two nodes become the RVP of each other whenever they exchange their views. Later, if a node selects a private node for gossip exchange, it opens a direct connection to the private node using a chain of RVPs for hole punching. The chains of RVPs in Nylon are unbounded in length, making Nylon fragile in dynamic networks and increasing traffic at intermediary nodes. ARRG [4] supports gossip-based peer sampling in the presence of NATs without an explicit solution for traversing NATs. In ARRG, each node maintains an open list of nodes with whom it has had a successful gossip exchange in the past. When a view exchange fails, a node selects a different node from this open list. The open list, however, biases the PSS, since the nodes in the open list are selected more frequently for gossiping. Van Renesse et al. [20] present an approach to fairly distribute relay traffic over public nodes in a NAT-friendly gossiping system. In their system, which is not a PSS, each node accepts exchange requests as much as it initiates view exchanges. Similar to Nylon, they use chains of nodes as relay servers.
In [5], D'Acunto et al. introduce an analytical model to show the impact of NATs on P2P swarming systems, and in [21] Liu and Pan analyse the performance of BitTorrent-like systems in private networks. They show how the fraction of private nodes affects the download speed and download time of a P2P file-sharing system. Moreover, the authors of [6] and [27] study the characteristics of existing NAT devices on the Internet, and show the success rate, on the Internet, of NAT traversal algorithms for different NAT types. In addition, the distribution of NAT rule timeouts for NAT devices on the Internet is described in [6], and in [24] an algorithm is presented, based on binary search, to adapt the time required to refresh NAT rules to prevent timeouts.
3 Background

In gossip-based PSS’, protocol execution at each node is divided into periodic cycles [18]. In each cycle, every node selects a node from its partial view to exchange a subset of its partial view with the selected node. Both nodes subsequently update their partial views using the received node descriptors. Implementations vary based on a number of different policies in node selection (rand, tail), view exchange (push, push-pull) and view selection (blind, healer, swapper) [18].
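To make the cycle structure concrete, the sketch below shows one shuffle round using the tail (node selection), push-pull (view exchange), and swapper (view selection) policies that Gozar adopts later. It is a simplified, illustrative example and not the paper's implementation: descriptors are plain dictionaries, and send_and_receive stands in for the network exchange.

    import random

    VIEW_SIZE, SHUFFLE_SIZE = 10, 5      # sizes used later in the evaluation (Sect. 6.1)

    def gossip_round(view, self_descriptor, send_and_receive):
        # view: list of descriptor dicts with at least 'addr' and 'age' keys.
        # send_and_receive(peer, subset): stand-in for the push-pull network exchange,
        # returning the subset sent back by the selected peer.
        for d in view:
            d['age'] += 1
        peer = max(view, key=lambda d: d['age'])       # tail policy: pick the oldest descriptor
        view.remove(peer)
        sent = random.sample(view, min(SHUFFLE_SIZE, len(view)))
        sent = sent + [dict(self_descriptor, age=0)]   # always advertise ourselves
        received = send_and_receive(peer, sent)        # push-pull exchange
        for d in received:                             # swapper policy: replace what we sent
            if any(e['addr'] == d['addr'] for e in view):
                continue
            if len(view) < VIEW_SIZE:
                view.append(d)
            else:
                victim = next((e for e in sent if e in view), None)
                if victim is not None:
                    view.remove(victim)
                    sent.remove(victim)
                    view.append(d)
        return view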
In a PSS, the sampled nodes should follow a uniform random distribution. To ensure randomness of a partial view in an overlay network, the overlay constructed by a peer sampling protocol should ensure that the indegree distribution, average shortest path length, and clustering coefficient are close to those of a random network [18,29].
Kermarrec et al. evaluated the impact of NATs on traditional gossip-based PSS’ in [19]. They showed that the network becomes partitioned when the number of private nodes exceeds a certain threshold. The larger the view size is, the higher the threshold for partitioning is. However, increasing the nodes’ view size increases the number of stale node descriptors in views, which, in turn, biases the peer sampling.
There are two general techniques that are used to communicate with private nodes: (i) hole punching [8,12] can be used to establish direct connections that traverse the private node’s NAT, and (ii) relaying [13] can be used to send a message to a private node via a third party relay node that already has an established connection with the private node. In general, hole punching is preferable when large amounts of traffic will be sent between the two nodes and when slow connection setup times are not a problem. Relaying is preferable when the connection setup time should be short (typically less than one second) and small amounts of data will be sent over the connection.
In principle, existing PSS’ could be adapted to work over NATs. This can be done by having all nodes run a protocol to identify their NAT type, such as STUN [26]. Then, nodes identified as private keep open a connection to a third party rendezvous server. When a node wishes to gossip with a private node, it can request a connection to the private node via the rendezvous server. The rendezvous server then executes a hole punching technique to establish a direct connection between the two nodes. Aside from the inherently centralized nature of this approach, other problems are that the success rate of NAT hole punching for UDP is only 85-90% [8,10], and that the time taken to establish a direct connection using hole punching protocols is high and has high variance (averaging between 700ms and 1100ms on the open Internet for the company Peerialism within Sweden [27]). This high and unpredictable NAT traversal time of hole punching is the main reason why Gozar uses relaying when gossiping.
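For reference, the graph-theoretic randomness metrics mentioned at the start of this section (indegree distribution, average path length, clustering coefficient) can be computed directly from the union of all partial views. The sketch below uses the networkx library, which is an assumption of this example rather than anything used in the paper.

    import networkx as nx

    def randomness_metrics(partial_views):
        # partial_views: dict mapping each node id to the list of node ids in its view.
        g = nx.DiGraph()
        for u, nbrs in partial_views.items():
            g.add_edges_from((u, v) for v in nbrs)
        indegrees = sorted(d for _, d in g.in_degree())
        avg_path_length = nx.average_shortest_path_length(g)   # requires a (strongly) connected overlay
        clustering = nx.average_clustering(g.to_undirected())
        return indegrees, avg_path_length, clustering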
4 Problem Description

The problem Gozar addresses is how to design a gossip-based NAT-friendly PSS that also supports distributed NAT traversal using a system composed of both public and private nodes. The challenge with gossiping is that it assumes a node can communicate with any node selected from its partial view. To communicate with a private node, there are three existing options:
1. Relay communications to the private node using a public relay node,
2. Use a NAT hole-punching algorithm to establish a direct connection to the private node using a public rendezvous node,
3. Route the request to the private node using chains of existing open connections.
For the first two options, we assume that private nodes are assigned to different public nodes that act as relay or rendezvous servers. This leads to the problem of discovering which public nodes act as partners for the private nodes. A similar problem arises for
the third option: if we are to route a request to a private node along a chain of open connections, how do we maintain routing tables with entries for all reachable private nodes?
When designing a gossiping system, we have to decide on which option(s) to support for communicating with private nodes. There are several factors to consider. How much data will be sent over the connection? How long-lived will the connection be? How sensitive is the system to high and variable latencies in establishing connections? How fairly should the gossiping load be distributed over public versus private nodes? For large amounts of data traffic, the second option of hole-punching is the only really viable option, if one is to preserve fairness. However, if a system is sensitive to long connection establishment times, then hole-punching may not be suitable. If the amount of data being sent is small, and fast connection setup times are important, then relaying is considered an acceptable solution. If it is important to distribute load as fairly as possible between public and private nodes, then option 3 is attractive. In existing systems, it appears that Skype supports both options 1 and 2, and it can be considered to have a solution to the fairness problem that, by virtue of its widespread adoption, can be considered acceptable to its user community [3].
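The considerations above can be read as a simple decision policy. The sketch below encodes it with purely illustrative thresholds; the paper itself does not prescribe any specific cut-off values.

    def choose_strategy(bytes_to_send, needs_fast_setup, fairness_critical):
        # Illustrative policy only: the 4 KB threshold and the priority order are
        # assumptions made for this sketch, not values prescribed by the paper.
        if fairness_critical:
            return "route-over-open-connections"   # option 3
        if needs_fast_setup and bytes_to_send < 4 * 1024:
            return "relay"                         # option 1: small, latency-sensitive traffic
        return "hole-punch"                        # option 2: bulk transfers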
5 The Gozar Protocol

Gozar is a NAT-friendly gossip-based peer sampling protocol with support for distributed NAT traversal. Our implementation of Gozar is based on the tail, push-pull and swapper policies for node selection, view exchange and view selection, respectively [18] (although we also run experiments, omitted here for brevity, showing that Gozar also works with the different policies introduced in [18]). In Gozar, node descriptors are augmented with the node’s NAT type (private or public) and the mapping, assignment and filtering policies determined for the NAT [27]. A STUN-like protocol is run on a bootstrap server when a node joins the system to determine its NAT type and policies. We consider running STUN once at bootstrap time acceptable, as, although some corporate NAT devices can change their NAT policies dynamically, the vast majority of consumer NAT devices have a fixed NAT type and fixed policies.
In Gozar, each private node connects to one or more public nodes, called partners. Private nodes discover potential partners using the PSS, that is, private nodes select public nodes from their partial view and send partnering requests to them. When a private node successfully partners with a public node, it adds its partner address to its own node descriptor. As node descriptors spread in the system through gossiping, a node that subsequently selects the private node from its partial view communicates with the private node using one of its partners as a relay server. Relaying enables faster connection establishment than hole punching, allowing for shorter periodic cycles for gossiping. Short gossiping cycles are necessary in dynamic networks, as they improve convergence time, helping keep partial views updated in a timely manner. However, for distributed applications that use a PSS, such as online gaming, video streaming, and P2P file sharing, relaying is not acceptable due to the extra load on public nodes. To support these applications, the private nodes’ partners also provide a
rendezvous service to enable applications that sample nodes using the PSS to connect to them using a hole punching algorithm (if hole punching is possible).

5.1 Partnering

Whenever a new node joins the system, it contacts the bootstrap server and asks for a list of nodes from the system, and also runs the modified STUN protocol to determine its NAT type and policies. If the node is public, it can immediately add the returned nodes to its partial view and start gossiping with them. If the node is private, it needs to find a partner before it can start gossiping. It selects m public nodes from the returned nodes and sends each of them a partnering request. Public nodes only partner a bounded number of private nodes to ensure the partnering load is balanced over the public nodes. Therefore, if a public node cannot act as a partner, it returns a NACK. The private node continues sending partnering requests to public nodes until it finds a partner, at which point it can start gossiping.
Private nodes proactively keep their connections to their partners open by sending ping messages to them periodically. The authors of [6] showed that unused NAT mapping rules remain valid for more than 120 seconds for 70% of connections. In our implementation, the private nodes send the ping messages every 50 seconds to refresh a higher percentage of mapping rules. Moreover, private nodes use the ping replies to detect the failure of their partners. If a private node detects a failed partner, it restarts the partner discovery process.

5.2 Peer Sampling Service

Each node in Gozar maintains a partial view of the nodes in the system. A node descriptor, stored in a partial view, contains the address of the node, its NAT type, and the addresses of the node’s partners, which are initially empty. When a node descriptor is gossiped or sampled, other nodes learn about the node’s NAT type and any partners. Later on, a node can gossip with a private node by relaying messages through the private node’s partners.
Each node p periodically executes algorithm 1 to exchange and update its view. The algorithm shows that in each iteration, p first updates the age of all nodes in its view, and then chooses a node to exchange its view with. After selecting a node q, p removes that node from its view. Node p then selects a subset of random nodes from its view, and appends to the subset its own node descriptor (the node, its NAT type, and its partners). If the selected node q is a public node, then p sends the shuffle request message directly to q, otherwise it sends the shuffle request as a relay message to one of q’s partners, selected uniformly at random. Algorithm 2 shows how a node p selects another node to exchange its view with. Node p selects the oldest node in its view (the tail policy), which is either a public node, or a private node that has at least one partner.
Algorithm 3 is triggered whenever a node receives a shuffle request message. Once node q receives the shuffle request, it selects a random subset of node descriptors from its view and sends the subset back to the requester node p. If p is a public node, q sends the shuffle response back directly to it, otherwise it uses one of p’s partners to relay
Algorithm 1. Shuffle view
procedure ShuffleView(this.view)
    this.view.updateAge()
    q ← SelectANodeToShuffleWith(this.view)            (see Algorithm 2)
    this.view.remove(q)
    pView ← this.view.subset()                         (a random subset from p’s view)
    pView.add(p, p.natType, p.partners)
    if q.natType is public then
        Send ShuffleRequest(pView, p) to q
    else
        qPartner ← random partner from q.partners
        Send Relay(shuffleRequest, pView, q) to qPartner
    end if
end procedure

Algorithm 2. Select a node to shuffle with
procedure SelectANodeToShuffleWith(this.view)
    for all node_i in this.view do
        if node_i.natType = public OR (node_i.natType = private AND node_i.partners ≠ Ø) then
            candidates ← node_i
        end if
    end for
    q ← oldest node from candidates
    return q
end procedure

Algorithm 3. Handling the shuffle request
upon event ShuffleRequest | pView, p from m            (m can be p or q.partner)
    qView ← this.view.subset()                         (a random subset from q’s view)
    if p.natType is public then
        Send ShuffleResponse(qView, q) to p
    else
        pPartner ← random partner from p.partners
        Send Relay(shuffleResponse, qView, p) to pPartner
    end if
    UpdateView(qView, pView)
end event

Algorithm 4. Handling the shuffle response
upon event ShuffleResponse | qView, q from n           (n can be q or p.partner)
    UpdateView(pView, qView)
end event

Algorithm 5. Updating the view
procedure UpdateView(sentView, receivedView)
    for all node_i in receivedView do
        if this.view.contains(node_i) then
            this.view.updateAge(node_i)
        else if this.view has free entries then
            this.view.add(node_i)
        else
            node_j ← sentView.poll()
            this.view.remove(node_j)
            this.view.add(node_i)
        end if
    end for
end procedure

Algorithm 6. Handling the relay message
upon event Relay | msgType, view, y from x
    if msgType is shuffleRequest then
        Send ShuffleRequest(view, x) to y
    else
        Send ShuffleResponse(view, x) to y
    end if
end event

Algorithm 7. NAT traversal to private nodes
procedure SendData(q, data)
    if q.natType is public then
        Send data to q
    else
        RVP ← random partner from q.partners
        hp ← hpAlgorithm(p.natType, q.natType)         (determine the hole punching algorithm for the combination of NAT types)
        holePunch(hp, p, q, RVP)                       (start hole punching at RVP using the hole punching algorithm hp)
        Send data to q
    end if
end procedure
the response. Again, node q selects p’s relaying node uniformly at random from the list of p’s partners. Finally, node q updates its view. A node updates its view whenever it receives a shuffle response (algorithm 4). Algorithm 5 shows how a node updates its view using the received list of node descriptors. Node p merges the node descriptors received from q with its current view by iterating through the received list, and adding the descriptors to its own view. If its view is not full, it adds the node, and if a node descriptor to be merged already exists in p’s view, p updates its age (if more recent). If the view is full, p replaces one of the nodes it had sent to q with the node in the received list (the swapper policy).
Algorithm 6 is triggered whenever a partner node receives a relay message from another node. The node extracts the embedded message, which can be a shuffle request or a shuffle response, and forwards it to the destination private node.
If a client of the PSS, node p, wants to establish a direct connection to a node q, it uses algorithm 7, which implements the hole punching service. Algorithm 7 shows that if q is a public node, then p sends data directly to q. Otherwise, p selects uniformly at random one of q’s partners as a rendezvous node (RVP), and determines the hole punching algorithm (hp) using the combination of its own NAT type and q’s NAT type [27]. Then, p starts the hole punching process through the RVP [27]. After successfully establishing a direct connection, node p sends data directly to q.
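Putting Sections 5.1 and 5.2 together, a node descriptor carries the node's NAT type and partner addresses, and a sender picks direct delivery or relaying accordingly. The following Python sketch mirrors the else-branch of Algorithm 1; the transport object and field names are illustrative assumptions, not part of the Gozar implementation.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class NodeDescriptor:
        addr: str
        nat_type: str                                   # "public" or "private"
        partners: list = field(default_factory=list)    # partner addresses; empty for public nodes
        age: int = 0

    def send_shuffle_request(dest, subset, transport):
        # Deliver a shuffle request directly (public destination) or through a randomly
        # chosen partner acting as a relay (private destination).
        # transport.send(addr, msg) stands in for the real messaging layer.
        if dest.nat_type == "public":
            transport.send(dest.addr, ("SHUFFLE_REQUEST", subset))
        else:
            relay = random.choice(dest.partners)
            transport.send(relay, ("RELAY", "shuffleRequest", subset, dest.addr))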
6 Evaluation

In this section, we compare in simulation the behavior of Gozar with Nylon [19] and ARRG [4], the only two other NAT-friendly gossip-based PSS’ we found in the literature. In our experiments, we use Cyclon as a baseline for comparison, where the Cyclon experiments are executed using only public nodes. Cyclon has been shown in simulation to pass classical tests for randomness [29].
6.1 Experiment Setup

We implemented Gozar, Cyclon, Nylon and ARRG on the Kompics platform [1]. Kompics provides a framework for building P2P protocols and a discrete event simulator for simulating them using different bandwidth, latency and churn models. Our implementations of Cyclon, Nylon and ARRG are based on the system descriptions in [29], [19] and [4], respectively. Nylon differs from Gozar in its node selection and view merging policies: Gozar uses the tail and swapper policies, while Nylon uses the rand and healer policies [19]. For a cleaner comparison with the NAT-friendly features of Nylon, we use the tail and swapper policies in our implementation of Nylon.
In our experimental setup, for all four systems, the size of a node’s partial view is 10, and the size of the subset of the partial view sent in each view exchange is 5. The iteration period for view exchange is set to one second. Latencies between nodes are modelled on Internet latencies, using a latency map based on the King data-set [11]. In all simulations, 1000 nodes join the system following a Poisson distribution with an inter-arrival time of 10 milliseconds, and unless stated otherwise, 80% of nodes are behind NATs. In Gozar, each private node has 3 public nodes as partners, and private nodes keep a connection to their partners open by sending ping messages every 50 seconds. The experiment scenarios presented here are: a comparison of the randomness of Gozar with Cyclon, Nylon and ARRG; a comparison of the protocol overhead of Gozar and Nylon for different percentages of private nodes; and finally an evaluation of the behaviour of Gozar in dynamic networks.

6.2 Randomness

Here, we compare the randomness of the PSS’ of Gozar with Nylon and ARRG. Cyclon is used as a baseline for true randomness. In the first experiment, we measure the local randomness property [18] of these systems. Local randomness shows the number of times that each node in the system is returned by the PSS for each node in the system. For a truly random PSS, we expect that the returned nodes follow a uniform random distribution. In figure 1(a), we measure the local randomness of all nodes in the system after 250 cycles. For a uniform random distribution, the expected number of selections for each node is 25. As we can see, Cyclon has an almost uniform random distribution, while Nylon’s distribution is slightly closer to uniform random than Gozar’s distribution. ARRG, on the other hand, has a long-tailed distribution, where there are a few nodes that are sampled many times (the public nodes stored in private nodes’ caches [4]). For Gozar, we can see two spikes: one, roughly four times higher, representing the private nodes, and the other the public nodes. This slight skew in the distribution results from the fact that public nodes are more likely to be selected during the first few cycles, when private nodes have no partners.
In addition to the local randomness property, we use the global randomness metrics, defined in [18], to capture important global correlations of the system as a whole. The global randomness metrics are based on graph theoretical properties of the system, including the indegree distribution, average path length and clustering coefficient. Figure 1(b) shows the indegree distribution of nodes after 250 cycles (the out-degree of all nodes is 10). In a uniformly random system, we expect that the indegree is distributed uniformly among all nodes. Cyclon shows this behaviour as the node indegree
Fig. 1. Randomness properties: (a) local randomness (number of nodes vs. number of selections), (b) indegree distribution, (c) average path length vs. number of cycles, (d) clustering coefficient vs. number of cycles (curves for Cyclon, Gozar, Nylon, and ARRG).
is almost distributed uniformly among nodes. We can see the same distribution in Gozar and Nylon - their indegree distributions are very close to Cyclon’s. Again, due to the high number of unsuccessful view exchanges in ARRG, we see that the node indegree is highly skewed.
In figure 1(c), we compare the average path length of the three systems, with Cyclon as a baseline. The path length for two nodes is measured as the minimum number of hops between the two nodes, and the average path length is the average of all path lengths between all nodes in the system. Figure 1(c) also shows the average path length for the system in different cycles. Here, we can see the average path length of Gozar and Nylon track Cyclon very closely, but ARRG has a higher average path length. As we can see, in the first few cycles, the path length of Gozar is high, but after passing 50 cycles (50 seconds), the path length decreases. That is because of the time that private nodes need to find their partners and add them to their node descriptors.
Finally, we compare the clustering coefficient of the systems. The clustering coefficient of a node is the number of links between the neighbors of the node divided by all possible links. Figure 1(d) shows the evolution of the clustering coefficient of the overlay constructed by each system. We can see that Gozar and Nylon have almost the same clustering coefficient as Cyclon, while the value for ARRG is higher.

6.3 Protocol Overhead

In this section, we compare the protocol overhead of Gozar and Nylon in different settings, where the protocol overhead traffic is the extra messages required to route
Fig. 2. Protocol overhead: (a) protocol overhead of Gozar vs. Nylon over time (total and per public/private node), (b) overhead traffic of Gozar vs. Nylon for varying percentages of private nodes (Gozar with 1 and 3 partners).
messages through NATs. Protocol overhead traffic in Gozar consists of relay traffic and partner management, while in Nylon it consists of routing traffic. Figure 2(a) shows the protocol overhead when 80% of nodes are behind NATs. The Y1-axis shows the total overhead, and the Y2-axis shows the average overhead of each public and private node. In this experiment, each private node in Gozar has three public nodes as partners, but only one partner is used to relay a message to a private node. Nylon, however, routes messages through more than two intermediate nodes on average (see [19] for comparable results). Figure 2(a) shows that after 250 cycles the relay traffic and partner management overhead in Gozar is 20000KB, while the routing traffic overhead in Nylon is roughly 37000KB.
Now, we compare the protocol overhead for Gozar and Nylon for different percentages of private nodes. To show the overhead of adding more partners, we consider two settings for Gozar: private nodes have one partner, and private nodes have three partners. In figure 2(b), we can see that when 80% of nodes are behind NATs, the protocol overhead for all nodes in Nylon is around 150KB after 250 cycles. The corresponding overhead in Gozar, when the private nodes have three and one partners, is around 70KB and 40KB, respectively. The main difference between the protocol overhead in the two partner settings is that shuffle request and shuffle response messages become larger for more partners, as all partner addresses are included in private nodes’ descriptors. The increase in traffic is a function of the percentage of private nodes (as only their descriptors include partner addresses), but is independent of the size of the partial view.

6.4 Fairness and Connectivity after Catastrophic Failure

We evaluate the behaviour of Gozar if high numbers of nodes leave the system or crash. Our experiment models a catastrophic failure scenario: 20 cycles after 1000 nodes have joined, 50% of nodes fail following a Poisson distribution with an inter-arrival time of 10 milliseconds. Our first failure experiment shows the level of fairness between public and private nodes after the catastrophic failure. In figure 3(a), the Y1-axis shows the average traffic
Fig. 3. Behaviour of the system after catastrophic failure: (a) fairness after catastrophic failure: overhead for public and private nodes for varying numbers of partners, (b) biggest cluster size after catastrophic failures.
on each public node and private node for different numbers of partners, and the Y2-axis shows the average number of unsuccessful view exchanges for each node. Here, 80% of nodes are private nodes, and we capture the results 80 cycles after 50% of the nodes fail. As we can see in figure 3(a), the higher the number of partners the private nodes have, the more overhead traffic is generated, again due to the increasing size of the messages exchanged among nodes. The Y2-axis shows that when the private nodes have only one partner, the average number of unsuccessful view exchanges is higher than when the private nodes have more than one partner. If a private node has more than one partner, then in case of failure of any of them, there are still other partners that can be used to communicate with the private node. An interesting observation here is that we cannot see a big decrease in the number of unsuccessful view exchanges when the private nodes have more than two partners. This observation, however, is dependent on our catastrophic failure model, and high churn rates might benefit more from more than two partners.
Finally, we measure the size of the biggest cluster after a catastrophic failure. Here, we assume that each private node has three partners. Figure 3(b) shows the size of the biggest cluster for varying percentages of private nodes, when varying numbers of nodes fail. We can see that Gozar is resilient to node failure. For example, in the case of 80% private nodes, when 80% of the nodes fail, the biggest cluster still covers more than 92% of the nodes.
7 Conclusion

In this paper, we presented Gozar, a NAT-friendly gossip-based peer sampling service that also provides a distributed NAT traversal service to clients of the PSS. Public nodes are leveraged to provide both the relaying and hole punching services. Relaying is only used for gossiping to private nodes, and is preferred to hole punching or routing through existing open connections (as done in Nylon), as relaying has lower connection latency, enabling a faster gossiping cycle, and the messages relayed are small, thus adding only low overhead to public nodes. Relaying and hole punching services provided by public
nodes are enabled by every private node partnering with a small number of (redundant) public nodes and keeping a connection open to them. We extended node descriptors for private nodes to include the addresses of their partners, so when a node wishes to send a message to a private node (through relaying) or establish a direct connection with the private node through hole punching, it sends a relay or connection message to one (or more) of the private node’s partners.
We showed in simulation that Gozar preserves the randomness properties of a gossip-based peer sampling service. We also showed that the protocol overhead in our system is less than that of Nylon in different network settings and for different percentages of private nodes. We also showed that the extra overhead incurred by public nodes is acceptable. Finally, we showed that if 80% of the nodes are private, and 50% of the nodes suddenly fail, more than 92% of the nodes stay connected. In future work, we will integrate our existing P2P applications, such as our work on video streaming [22,23], with Gozar, and evaluate their behaviour on the open Internet.
References
1. Arad, C., Dowling, J., Haridi, S.: Developing, simulating, and deploying peer-to-peer systems using the Kompics component model. In: COMSWARE 2009: Proceedings of the Fourth International ICST Conference on COMmunication System softWAre and middlewaRE, pp. 1–9. ACM, New York (2009)
2. Berthou, G., Dowling, J.: P2P VoD using the self-organizing gradient overlay network. In: SOAR 2010: Proceedings of the Second International Workshop on Self-Organizing Architectures, pp. 29–34. ACM, New York (2010)
3. Bonfiglio, D., Mellia, M., Meo, M., Rossi, D., Tofanelli, P.: Revealing Skype traffic: when randomness plays with you. SIGCOMM Comput. Commun. Rev. 37(4), 37–48 (2007)
4. Drost, N., Ogston, E., van Nieuwpoort, R.V., Bal, H.E.: ARRG: real-world gossiping. In: HPDC 2007: Proceedings of the 16th International Symposium on High Performance Distributed Computing, pp. 147–158. ACM, New York (2007)
5. D'Acunto, L., Meulpolder, M., Rahman, R., Pouwelse, J.A., Sips, H.J.: Modeling and analyzing the effects of firewalls and NATs in P2P swarming systems. In: Proceedings IPDPS 2010 (HotP2P 2010). IEEE, Los Alamitos (April 2010)
6. D'Acunto, L., Pouwelse, J.A., Sips, H.J.: A measurement of NAT and firewall characteristics in peer-to-peer systems. In: Wolters, L., Gevers, T., Bos, H. (eds.) Proc. 15th ASCI Conference, pp. 1–5. Advanced School for Computing and Imaging (ASCI), Delft, The Netherlands (2009)
7. Eugster, P.T., Guerraoui, R., Handurukande, S.B., Kouznetsov, P., Kermarrec, A.-M.: Lightweight probabilistic broadcast. In: DSN 2001: Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS), pp. 443–452. IEEE Computer Society, Washington, DC, USA (2001)
8. Ford, B., Srisuresh, P., Kegel, D.: Peer-to-peer communication across network address translators. CoRR, abs/cs/0603074 (2006)
9. Ganesh, A.J., Kermarrec, A.-M., Massoulie, L.: Peer-to-peer membership management for gossip-based protocols. IEEE Transactions on Computers 52 (2003)
10. Guha, S., Francis, P.: Characterization and measurement of TCP traversal through NATs and firewalls. In: IMC 2005: Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, p. 18. USENIX Association, Berkeley (2005)
11. Gummadi, K.P., Saroiu, S., Gribble, S.D.: King: Estimating latency between arbitrary internet end hosts. In: SIGCOMM Internet Measurement Workshop (2002)
12. Hunt, R., Phuoc, H.C., McKenzie, A.: NAT traversal techniques in peer-to-peer networks (2008)
13. Mahy, R., Rosenberg, J., Huitema, C.: TURN - traversal using relay NAT (September 2005), http://tools.ietf.org/id/draft-rosenberg-midcom-turn-08.txt
14. Jelasity, M., Montresor, A., Babaoglu, O.: T-Man: Gossip-based fast overlay topology construction. Computer Networks 53(13), 2321–2339 (2009)
15. Jelasity, M., Liu, H., Kermarrec, A.-M., van Steen, M.: The peer sampling service: Experimental evaluation of unstructured gossip-based implementations. In: Jacobsen, H.-A. (ed.) Middleware 2004. LNCS, vol. 3231, pp. 79–98. Springer, Heidelberg (2004)
16. Jelasity, M., Montresor, A.: Epidemic-style proactive aggregation in large overlay networks. In: ICDCS 2004: Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS 2004), pp. 102–109. IEEE Computer Society, Washington, DC, USA (2004)
17. Jelasity, M., Montresor, A., Babaoglu, O.: Gossip-based aggregation in large dynamic networks. ACM Trans. Comput. Syst. 23(3), 219–252 (2005)
18. Jelasity, M., Voulgaris, S., Guerraoui, R., Kermarrec, A.-M., van Steen, M.: Gossip-based peer sampling. ACM Trans. Comput. Syst. 25(3), 8 (2007)
19. Kermarrec, A.-M., Pace, A., Quema, V., Schiavoni, V.: NAT-resilient gossip peer sampling. In: ICDCS 2009: Proceedings of the 29th IEEE International Conference on Distributed Computing Systems, pp. 360–367. IEEE Computer Society, Washington, DC, USA (2009)
20. Leitão, J., van Renesse, R., Rodrigues, L.: Balancing gossip exchanges in networks with firewalls. In: Proceedings of the 9th International Workshop on Peer-to-Peer Systems (IPTPS 2010), San Jose, CA, USA (2010) (to appear)
21. Liu, Y., Pan, J.: The impact of NAT on BitTorrent-like P2P systems. In: IEEE Ninth International Conference on Peer-to-Peer Computing, P2P 2009, pp. 242–251 (2009)
22. Payberah, A.H., Dowling, J., Rahimian, F., Haridi, S.: gradienTv: Market-based P2P live media streaming on the Gradient overlay. In: Eliassen, F., Kapitza, R. (eds.) DAIS 2010. LNCS, vol. 6115, pp. 212–225. Springer, Heidelberg (2010)
23. Payberah, A.H., Dowling, J., Rahimian, F., Haridi, S.: Sepidar: Incentivized market-based P2P live-streaming on the gradient overlay network. In: International Symposium on Multimedia, pp. 1–8 (2010)
24. Price, R., Tino, P.: Adapting to NAT timeout values in P2P overlay networks. In: 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1–6. IEEE, Los Alamitos (2010)
25. Rosenberg, J.: Interactive connectivity establishment (ICE): A methodology for network address translator (NAT) traversal for offer/answer protocols (January 2007), http://tools.ietf.org/html/draft-ietf-mmusic-ice-13
26. Rosenberg, J., Mahy, R., Mathews, P., Wing, D.: RFC 5389: Session traversal utilities for NAT (STUN) (2008)
27. Roverso, R., El-Ansary, S., Haridi, S.: NATCracker: NAT combinations matter. In: ICCCN 2009: Proceedings of the 18th International Conference on Computer Communications and Networks, pp. 1–7. IEEE Computer Society, Washington, DC, USA (2009)
28. Sacha, J., Dowling, J., Cunningham, R., Meier, R.: Discovery of stable peers in a self-organising peer-to-peer gradient topology. In: Eliassen, F., Montresor, A. (eds.) DAIS 2006. LNCS, vol. 4025, pp. 70–83. Springer, Heidelberg (2006)
29. Voulgaris, S., Gavidia, D., van Steen, M.: Cyclon: Inexpensive membership management for unstructured P2P overlays. Journal of Network and Systems Management 13 (2005)
Modeling the Performance of Ring Based DHTs in the Presence of Network Address Translators
John Ardelius¹ and Boris Mejías²
¹ Swedish Institute of Computer Science
[email protected]
² Université catholique de Louvain
[email protected]
Abstract. Dealing with Network Address Translators (NATs) is a central problem in many peer-to-peer applications on the Internet today. However, most analytical models of overlay networks assume the underlying network to be a complete graph, an assumption that might hold in evaluation environments such as PlanetLab but turns out to be simplistic in practice. In this work we introduce an analytical network model where a fraction of the communication links are unavailable due to NATs. We investigate how the topology induced by the model affects the performance of ring-based DHTs. We quantify two main performance issues induced by NATs, namely large lookup inconsistencies and increased break-up probability, and suggest how these issues can be addressed. The model is evaluated using discrete-event simulation for a wide range of parameters.
1 Introduction
Peer-to-peer systems are widely regarded as being more scalable and robust than systems with a classical centralised client-server architecture. They provide no single point of failure or obvious bottlenecks, and since peers are given the responsibility to maintain and recover the system in case of departure or failure, they are in the best case also self-stabilising. However, many of these properties can only be guaranteed under certain strong assumptions, such as moderate node churn, transitive communication links, accurate failure detection and NAT transparency, among others. When these assumptions are not met, system performance and behaviour might become unstable. In this work we investigate the behaviour of a peer-to-peer system when we relax the assumption of a transitive underlying network. In general, a set of connections is said to be non-transitive if the fact that a node A can talk to node B, and B can talk to C, does not imply that node A can talk to C. Non-transitivity directly influences a system by introducing false suspicions in failure detectors, since a node cannot determine a priori whether another node has departed or is merely unable to communicate due to link failure. In practice, this study is motivated by the increasing presence of Network Address Translators (NATs) on today's Internet [1]. As many peer-to-peer protocols are designed
with open networks in mind, NAT-traversal techniques are becoming a common tool in system design [13]. One of the most well-studied peer-to-peer overlays, at least from a theoretical point of view, is the distributed hash table (DHT) Chord [15]. Chord and other ring-based DHTs are especially sensitive to non-transitive networks since they rely on the fact that each participating node needs to communicate with the node succeeding it on the ring in order to perform maintenance. Even without considering NATs, the authors of Chord [15], Kademlia [10], and OpenDHT [12] experienced the problems of non-transitive connectivity when running their networks on PlanetLab¹, where all participating peers have public IP addresses. Several patches to these problems have been proposed [3], but they only work if the system has a very small amount of non-transitive links, as in PlanetLab, where every node has a public IP address. In this work, we construct an analytic model of the ring-based DHT Chord, running on top of a non-transitive network under node churn. Our aim is to quantify the impact of a non-transitive underlay network on the performance of the overlay application. We evaluate the system's working range and examine the underlying mechanisms that cause its failure in terms of churn rate and presence of NATs. Our results indicate that it is possible to patch the Chord protocol to be robust and provide consistent lookups even in the absence of NAT-traversal protocols. Our main contributions are:
– Introduction of a new inconsistency measure, Q, for ring-based DHTs. Using this metric we can quantify the amount of potential lookup inconsistency for a given parameter setting.
– Quantification of the load imbalance. We show that in the presence of NATs, nodes with open IP addresses receive disproportionate amounts of traffic and maintenance workload.
– A novel investigation of an inherent limitation on the number of nodes that can join the DHT ring. Since each node needs to be able to communicate with its successor, the key range in which a node behind a NAT can join is limited.
– Evaluation of two modifications to the original Chord protocol, namely predecessor list routing and the introduction of a recovery list containing only peers with open IP addresses.
Throughout the paper we will use the words node and peer interchangeably. We will also use the term NATed, or expressions such as being behind a NAT, for a node behind a non-traversable NAT. It simply means that two nodes with this property are unable to communicate directly. Section 2 discusses related work, which justifies the design decisions for the evaluation model discussed in Section 3. We present our analysis of lookup consistency and resilience in Sections 4 and 5, respectively. The paper concludes by discussing some limitations on the behaviour of Chord and ring-based DHTs in general.
¹ PlanetLab: An open platform for developing, deploying, and accessing planetary-scale services. http://www.planet-lab.org
2 Related Work
Understanding how peer-to-peer systems behave on the Internet has received a lot of attention in recent years. The increase of NAT devices has posed a big challenge to system developers as well as to those designing and simulating overlay networks. Existing studies are mostly related to systems providing file-sharing, voice over IP, video streaming and video-on-demand. Such systems use overlay topologies different from a Chord-like ring, or at most they integrate the ring as one of the components used to provide a DHT. Therefore, they do not provide any insight regarding the influence of NATs on ring-based DHTs. A deep study of Coolstreaming [8], a large peer-to-peer system for video streaming, shows that at least 45% of its peers sit behind NAT devices. They are able to run the system despite NATs by relying on permanent servers that log successful communication to NATed peers, to be reused in new communication. In general, their architecture relies on servers outside the peer-to-peer network to keep the service running. A similar system, PPLive, provides video-on-demand in addition to streaming. Their measurements from May 2008 [4] indicate 80% of peers behind NATs. The system also uses servers as loggers for NAT-traversal techniques, and the DHT is used only as a component to help trackers with file distribution. With respect to file-sharing, a study on the impact of NAT devices on BitTorrent [9] shows that NATed peers get an unfair participation. They have to contribute more to the system than what they get from it, mainly because they cannot connect to other peers behind NATs. It is the opposite for peers with public IP addresses, because they can connect to many more nodes. It is shown that the more peers behind NATs, the more unfair the system is. According to [5], another result related to BitTorrent is that NAT devices are responsible for the poor performance of DHTs as "DNS" for torrents. This conclusion is shared by apt-p2p [2], where peers behind NATs are not allowed to join the DHT. Apt-p2p is a real application for software distribution used by a small community of Debian/Ubuntu users. It uses a Kademlia-based DHT to locate peers hosting software packages [10]. Peers behind NATs, around 50% according to their measurements, can download and upload software, but they are not part of the DHT, because they break it. To appreciate the impact NAT devices are having on the Internet, apart from the more system-specific measurements referenced above, we refer to the more complete quantitative measurements done in [1]. Taking geography into account, it is shown that one of the worst scenarios is France, where 93% of nodes are behind NATs. One of the best cases is Italy, with 77%. Another important measurement indicates that 62% of nodes have a communication time-out of 2 minutes, which is too long for ring-based DHT protocols. Although several NAT-traversal techniques exist, the problem is still far from solved. We identify Nylon [6] as a promising recent attempt to incorporate NATs in the system design. Nylon uses a reactive hole punching protocol to create paths of relay peers to set up communication. In their work a combination of four kinds of NATs is considered; they are able to traverse all of them in simulations and run the system with 90% of peers behind NATs. However, their approach does not consider a complete set of NAT types.
The NATCracker [13] makes a classification of 27 types of NATs, where there is a certain number of combinations that cannot be traversed, even with the techniques of Nylon.
3 Evaluation Model
3.1 Chord Model
Chord [15] is a distributed hash table (DHT) which provides a key-value mapping and a distributed way to retrieve the value for a specific key, a lookup. The keys belong to the range [0, 2^K) (K = 20 in our case). Each participating node is responsible for a subset of this range and stores the values that those keys map to. It is important that this responsibility is strictly divided among the participating peers to avoid inconsistent lookups. In order to achieve this, each node contains a pointer to the node responsible for the range succeeding its own, its successor. Since the successor might leave the system at any point, a set of succeeding nodes is stored in a successor list of some predefined length. Lookups are performed by relaying the message clockwise along the key range. When the lookup reaches the node preceding the key, it will return the identifier of its successor as the owner of the key. In order to speed the process up, some shortcuts known as fingers are created. The influence of NATs on the fingers is limited and is only discussed briefly in this work. Each node performs maintenance at asynchronous regular intervals. During maintenance the node pings the nodes on its successor list and removes those who have left the system. In order to update the list, the node queries its successor for its list and appends the successor's id.
3.2 NAT Model
In order to study the effect of NATs we construct a model that reflects the connectivity quality of the network. We consider two types of peer nodes: open peers and NATed peers. An open peer is a node with a public IP address, or sitting behind a traversable NAT, meaning that it can establish a direct link to any other node. A NATed peer is a node behind a NAT that cannot be traversed from another NATed peer, or that is so costly to traverse that it is not suitable for peer-to-peer protocols. Open peers can talk to NATed peers, but NATed peers cannot talk with each other. In the model, when joining the system, each node has a fixed probability p of being a NATed peer. The connectivity quality q of a network, defined as the fraction of available links, will then be:

q = 1 − p² = 1 − c    (1)

where c is the fraction of unavailable links in the system. The proof of Equation (1) is straightforward and can be found in the appendix². We assume, without loss of generality, that all unavailable communication links in the system are due to NATs. In practice we are well aware of the existence of several more or less reliable NAT-traversal protocols. [13] provides a very good overview and concludes that some NAT types, however, are still considered non-traversable (for instance random port-dependent ones), and the time delays they introduce might in many cases be unreasonable for structured overlay networks.
² http://www.info.ucl.ac.be/~bmc/apx-proof.pdf
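To make the model concrete, the following minimal Python sketch (ours, not part of the paper) computes the closed-form connectivity quality of Equation (1) and checks it against a Monte Carlo estimate in which each node is NATed independently with probability p and a link is unavailable only when both endpoints are NATed.

import random

def connectivity_quality(p: float) -> float:
    """Closed-form connectivity quality q = 1 - p^2 from Equation (1)."""
    return 1.0 - p * p

def sampled_quality(p: float, num_nodes: int = 2000, trials: int = 100_000) -> float:
    """Monte Carlo estimate: a link is unavailable only if both endpoints are NATed."""
    natted = [random.random() < p for _ in range(num_nodes)]
    unavailable = 0
    for _ in range(trials):
        a, b = random.sample(range(num_nodes), 2)
        if natted[a] and natted[b]:
            unavailable += 1
    return 1.0 - unavailable / trials

if __name__ == "__main__":
    for p in (0.5, 0.75, 0.8):
        print(p, connectivity_quality(p), round(sampled_quality(p), 3))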
3.3 Churn Model
We use an analytic model of Chord similar to the one analysed in [7] to study the influence of churn on the ring. Namely, we model the Chord system as an M/M/∞ queue containing N (2^12 in our case) nodes, with independent Poisson-distributed join, failure and stabilisation events with rates λ_j, λ_f and λ_s respectively. The failure event covers both graceful leaves and abrupt failures, and stabilisation is done continuously by all nodes in the system. The amount of churn in the system is quantified by the average number of stabilisation rounds that a node issues in its lifetime, r = λ_s/λ_f. Low r means less stabilisation and therefore higher churn. The effect of churn on Chord (studied in [7]) is mainly that the entire successor list becomes outdated quickly for large churn rates, resulting in degraded performance and inconsistent lookups. Under churn or in the presence of NATs, the chain of successor pointers may not form a perfect ring. In this paper we will use the term core ring to denote the periodic chain of successor pointers inherent in the Chord protocol. Due to NATs, some nodes are not part of the core ring and are said to sit on a branch. Figure 1(a) depicts such a configuration.
Fig. 1. The creation of branches. Dotted lines indicate that the identifier of the peer is known but communication is impossible due to an intercepting NAT. 1(a) Peers p and q cannot communicate, which leaves peers p and r in a branch rooted at peer s. 1(b) In order to route messages to peers q and r, s has to maintain pointers to both of them in its predecessor list since the path s → r → q is not available.
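The churn model can be mimicked with a small Gillespie-style sketch (an illustration under our own assumptions; the paper uses an analytical M/M/∞ treatment and discrete-event simulation, not this code). Joins, failures and stabilisation rounds are drawn with rates proportional to the current population, and the ratio of observed stabilisations to failures should come out close to r = λ_s/λ_f.

import random

def simulate_churn(n=4096, lam_fail=1.0, r=50.0, steps=200_000, seed=1):
    """Gillespie-style sketch of the churn model: at population n, the total rates are
    n*lam_fail for failures, n*lam_fail for joins (keeping E[n] roughly constant) and
    n*lam_stab for stabilisation rounds, with lam_stab = r * lam_fail."""
    rng = random.Random(seed)
    lam_stab = r * lam_fail
    fails = stabs = 0
    for _ in range(steps):
        rate_fail = n * lam_fail
        rate_join = n * lam_fail
        rate_stab = n * lam_stab
        u = rng.random() * (rate_fail + rate_join + rate_stab)
        if u < rate_fail:
            n -= 1
            fails += 1
        elif u < rate_fail + rate_join:
            n += 1
        else:
            stabs += 1
    return stabs / max(fails, 1)   # average stabilisation rounds per departed node, ~ r

if __name__ == "__main__":
    print(simulate_churn())   # close to 50 with the defaults above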
4 Lookup Consistency
One of the most important properties of any lookup mechanism is consistency. It is however a well-known fact [3,14] that lookup consistency in Chord is compromised when the ring is not perfectly connected. If two nodes query the system for the same key they should receive identical values, but due to node churn a node's pointers might become outdated or wrong before the node has had time to correct them in a stabilisation round. In any case where a peer wrongly assigns its successor pointer to a node not succeeding it, a potential inconsistency is created. The configuration in Figure 1(a) can lead to a potential lookup inconsistency in the range ]q, r]. Upon a lookup request for a key in that range, peer q will think r is responsible for the key whereas peer p thinks peer s is. These lookup inconsistencies due to incorrect successor pointers can be caused both by NATs (a node cannot communicate with its successor) and by churn (a node is not aware of its successor since it did not maintain the pointer).
20
J. Ardelius and B. Mej´ıas
A solution proposed by Chord's authors in [3] is to forward the lookup request to a candidate responsible node. The candidate will verify its local responsibility by checking its predecessor pointer. If its predecessor is a better candidate, the lookup is sent backwards until reaching the node truly responsible for the key. In order to quantify how reliable the result of a lookup really is, it is important to estimate the amount of inconsistency in the system as a function of churn and other system parameters. Measuring lookup inconsistency has been considered in a related but not equivalent way in [14]. We define the responsibility range, Q, as the range of keys any node can be held responsible for. By the nature of the Chord lookup protocol, a node is responsible for a key if its preceding peer says so. From a global perspective it is then possible to ask each node who it thinks succeeds it on the ring and sum up the distance between them. However, as shown in Figure 2, the mere fact that two nodes think they have the same successor does not lead to inconsistent lookups.

Fig. 2. A lookup for a key owned by s will not give inconsistent results due to the fact that r is not a successor of another node

In order to have an inconsistent lookup, two nodes need to think they are the immediate predecessor of the key and have different successor pointers. In order to quantify the amount of inconsistency in the system we then need to find, for each node, the largest separating distance between it and any node that has it as its successor. This value is the maximum range the node can be held responsible for. The sum of each such range divided by the number of keys in the system will indicate the deviation from the ideal case where Q = 1. The procedure is outlined in Listing 1.1.

Listing 1.1. Inconsistency calculation
for n in peers do
  // First alive and reachable succeeding node
  m = n.getSucc()
  d = dist(n, m)
  // Store the largest range
  m.range = max(d, m.range)
end for
globalRange = sum(m.range)   // summed over all peers m
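A runnable rendering of Listing 1.1 might look as follows (our own sketch; the peer representation and helper names are invented for illustration). Every node reports the clockwise distance to the first alive and reachable successor it points to, and each node keeps the largest such distance as the range it may be held responsible for; a perfect ring yields Q = 1.

KEY_SPACE = 2 ** 20   # K = 20, as in the paper's model

def dist(a: int, b: int) -> int:
    """Clockwise distance from key a to key b on the ring."""
    return (b - a) % KEY_SPACE

def responsibility_range(peers, succ_of):
    """peers: list of node keys; succ_of: dict mapping each key to the key of the first
    alive and reachable successor that node currently points to.
    Returns Q = (sum of largest ranges) / KEY_SPACE."""
    largest = {p: 0 for p in peers}
    for n in peers:
        m = succ_of[n]
        largest[m] = max(largest[m], dist(n, m))
    return sum(largest.values()) / KEY_SPACE

if __name__ == "__main__":
    ring = [0, 262144, 524288, 786432]                      # 4 evenly spaced nodes
    perfect = {ring[i]: ring[(i + 1) % 4] for i in range(4)}
    print(responsibility_range(ring, perfect))              # prints 1.0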
The responsibility range is measured for various fractions of NATs as a function of churn, and the results are plotted in Figure 3. We see that even for low churn rates the responsibility range after introducing the NATs is greater than 1. This indicates that on average more than one node thinks it is responsible for each key, which causes a lot of inconsistent lookups. It is also important to note that even without NATs (c = 0) the churn-induced inconsistency for low r is much higher than 1.
Fig. 3. Responsibility range as a function of stabilisation rate for various values of c. Even for low churn, NATs introduce a large amount of lookup inconsistencies, as indicated by the range being greater than 1.
4.1 Predecessor List (PL) Routing
The result indicates that merely assigning the responsibility to your successor does not provide robust lookups in the presence of NATs or under high churn. Using local responsibility instead of the successor pointer to answer lookup requests, and routing through the predecessor, ensures that only the correct owner answers the query. Because more peers become reachable, there are fewer lookup inconsistencies, and more peers can use their local responsibility to share the load of the address space. Again, taking the configuration in Figure 1(a) as an example, we can see that peer q is unreachable in traditional Chord, and therefore the range ]p, q] is unavailable. By using the predecessor pointer for routing, a lookup for key k can be successfully handled following the path p → s → r → q. However, by only keeping one predecessor pointer, lookups will still be inconsistent in configurations such as the one depicted in Figure 1(b). To be able to route to any node in a branch, a peer needs to maintain a predecessor list. This is not the backward equivalent of the successor list, which is used for resilience. The predecessor list is used for routing, and it contains all nodes pointing to a given peer as their successor. If peers would use the predecessor list in the depicted examples, the predecessor list of s in Figure 1(a) would be {p, r}. The list in Figure 1(b) would be {p, q, r}. In a perfect ring, the predecessor list of each node contains only the preceding node. This means that the size of the routing table is not affected when the quality of the connectivity is very good. In the presence of NATs or churn, however, it is necessary to keep a predecessor list in order to enable routing on node branches. From a local perspective, a peer knows that it is the root of a branch if the size of its predecessor list is greater than one. Such predecessor list (PL) routing is used for instance in the relaxed-ring system [11] to handle non-transitive underlay networks. In order to evaluate the performance of PL routing we define another responsibility measure, the branch range, Q_b. Since the PL routing protocol lets the root of a branch decide who is responsible (or whom to relay the lookup to), it is important that the
root correctly knows the accumulated responsibility range of all nodes in its branch. The only way a PL lookup can be inconsistent is if two distinct root nodes at the same distance from the core ring think they should relay the lookup to one of their predecessors. The problem is depicted in Figure 4.

Fig. 4. Peer q misses its successor s. Peer s routes lookup(k) to r because it does not know q.

The branch range is calculated in a similar way as the responsibility range above. Each node on the core ring queries all its predecessors except for the last one (the one furthest away) for their range. The predecessors iteratively query their predecessors, and so forth, until the end of the branch is reached. Each node then returns, among all the answers from its predecessors, the end point furthest away from the root. The root, in turn, compares the returned value with the key of its last predecessor. If the returned key lies behind the last predecessor, the range can give rise to an inconsistent lookup. The procedure is outlined in Listing 1.2.
Listing 1.2. Branch inconsistency calculation
// Peers in the core ring are first
// marked by cycle detection.
for n in corePeers do
  n.range = 0
  for m in n.preds
    // Don't add pred from core ring
    if (m != n.preds.last)
      n.range += m.getBranchEnd(m.key)
  end for
  n.range += n.preds.last
end for

function getBranchEnd(p)
  for m in n.preds
    p = min(p, m.getBranchEnd(m.key))
  end for
  return p
end function

globalBranchRange = sum(m.range)
The branch responsibility range, Q_b, includes the former range, Q, found without PL routing and adds any extra inconsistency caused by overlapping branches. Figure 5 shows the additional responsibility range induced by overlapping branches. We see that the responsibility ranges resulting from overlapping trees are orders of magnitude smaller than the ranges due to problems with successor-based routing. Even in the worst case the range does not exceed 1.5, which means that at least 50% of the lookups will be reliable.
Fig. 5. Branch responsibility range as a function of stabilisation rate. As the system breaks down due to high churn, the trees start to overlap, resulting in additional lookup inconsistencies.
It is also interesting to note that the churn-based inconsistency (c = 0) does not exceed 0.2 in any case, which means the lookups are reliable to at least 80%.
4.2 Skewed Key Range
In Chord, it is mandatory for the joining peer to be able to communicate with its successor in order to perform maintenance. When joining the system, a node is provided a successor on the ring by a bootstrap service. If the joining node is unable to communicate with the assigned successor due to a NAT, all that remains to do is for the node to re-join the system. As more and more NATed peers join the network, the fraction of potential successors decreases, creating a skewed distribution of the key range available to NATed nodes trying to join the ring. This leaves joining peers behind a NAT able to join only closer and closer to the root. As the fraction of NATed peers in the system increases, additional join attempts are needed in order to find a key succeeded by an open peer. The number of re-join attempts as a function of c is shown in Figure 6(a) and the variance in Figure 6(b). Note that both the average and the variance of the number of join attempts start to grow super-exponentially at some critical value c ≈ 0.8, which indicates a behavioural transition between a functional and a congested system state.
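The re-join mechanism can be illustrated with the simplified sketch below (ours; it ignores churn and the branch dynamics that drive the super-exponential growth in Figure 6, so it only shows the baseline effect). Nodes are placed at random keys, a fraction of them is NATed, and a joining NATed peer redraws its key until the successor of the drawn key is an open peer.

import bisect
import random

def expected_rejoins(num_nodes=4096, p_natted=0.9, trials=2000, seed=1):
    """Simplified illustration of the re-join mechanism: a joining NATed peer keeps
    drawing random keys until the successor of the drawn key is an open peer.
    Returns the average number of extra attempts, roughly p/(1-p) here."""
    rng = random.Random(seed)
    keys = sorted(rng.sample(range(2 ** 20), num_nodes))
    natted = [rng.random() < p_natted for _ in keys]

    def successor_is_open(k: int) -> bool:
        i = bisect.bisect_right(keys, k) % num_nodes   # first node clockwise from k
        return not natted[i]

    total = 0
    for _ in range(trials):
        attempts = 0
        while not successor_is_open(rng.randrange(2 ** 20)):
            attempts += 1
        total += attempts
    return total / trials

if __name__ == "__main__":
    for p in (0.5, 0.8, 0.9):
        print(p, round(expected_rejoins(p_natted=p), 2))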
5 Resilience
5.1 Load Balancing
Even for the c range where nodes can join in a reasonable amount of time, the open nodes in the network will be successors for an increasing number of NATed peers. Being the successor of a node implies relaying its lookups and sending it successor and recovery lists while stabilising.
[Figure 6 plots: re-join attempts per join versus c, on a logarithmic scale, with curves for N = 100, 400 and 1600.]
Fig. 6. Distribution of the number of re-join attempts as a function of c. 6(a) Average number of re-join attempts before a node is able to join the ring, for different values of c. Vertical lines indicate the point where some node in the system needs to re-try more than 10,000·N times. 6(b) Variance of the number of re-join attempts.
Increasing the number of predecessors therefore increases the workload of open peers. In the case of predecessor list (PL) relays, open nodes acting as branch roots will be responsible for relaying lookups not only to themselves and their predecessors but to all nodes in their branch. Figure 7(a) shows the distribution of the number of predecessors per node and Figure 7(b) shows the size distribution of existing branches. For large c values we note that some nodes have almost 1% of the participating nodes as predecessors and are the root of branches of size 0.2·N. When such an overloaded node fails, a large re-arrangement is necessary at high cost.
[Figure 7 plots: (a) fraction of nodes versus number of predecessors; (b) fraction of branches versus branch length; curves for c = 0, 0.25, 0.5, 0.75, 0.85.]
Fig. 7. The load on the open peers increases with c as they receive more predecessors and act as relay nodes for the branches. The system contains 2^12 nodes. 7(a) Fraction of nodes with a given number of predecessors. For large values of c, the number for the open peers becomes orders of magnitude higher than for the NATed peers. 7(b) Size of branches in the system for various values of c. For low values, trees are rare and small. As the fraction of NATs grows, the length of the branches grows too.
5.2 Sparse Successor Lists
The size of the successor list, typically log(N), is the resilience factor of the network. The ring is broken if, for one node, all peers in its successor list are dead. In the presence of NATed peers the resilience factor is reduced to log(N) − n for nodes behind a NAT, where n is the average number of NATed peers in the successor list. This is because NATed peers cannot use other NATed peers for failure recovery. The decrease in resilience caused by the fraction of NATed nodes is possible to cope with for low fractions c: since there is still a high probability of finding another alive peer which does not sit behind a NAT, the ring holds together. In fact the NATs can be considered as additional churn, and the effective churn rate becomes r_eff = r(1 − c). For larger values of c, however, the situation becomes intractable. The effective churn rate quickly becomes very high and breaks the ring.
5.3 Recovery List
As we previously mentioned, the resilience factor of the network decreases to log(N) − n. To remedy this, a first idea to improve resilience is to filter out NATed peers from the successor list. However, the successor list is propagated backwards, and therefore the predecessor might need some of the peers filtered out by its successor in order to maintain consistent key ranges. We propose to use a second list looking ahead in the ring, denoted the recovery list. The idea is that the successor list is used for propagation of accurate information about peer order, while the recovery list is used for failure recovery and only contains peers that all nodes can communicate with, that is, open peers.

Fig. 8. Two lists looking ahead: the successor and the recovery list

The recovery list is initially constructed by filtering out NATed peers from the successor list. If the size after filtering is less than log(N), the peer requests the successor list of the last peer on the list in order to keep on constructing the recovery list. Ideally, both lists would be of size log(N). Both lists are propagated backwards as the preceding nodes perform maintenance. Figure 8 shows the construction of the recovery list at a NATed peer. Because both lists are propagated backwards, we have observed that even for open peers it is best to filter out NATed peers from their recovery lists, even when they can establish connections to them. The reason is that if an open peer keeps such references on its recovery list, those values will be propagated backwards to a NATed peer who will not be able to use them for recovery, reducing its resilience factor, increasing its cost of rebuilding a valid recovery list, and therefore decreasing the performance of the whole ring. If no NATed peers are used in any recovery list, the ring is able to survive a much higher degree of churn in comparison to rings only using the successor list for failure recovery. The recovery lists are populated in a reactive manner until a node finds a complete set of open peers or can only communicate with peers with overlapping sets of recovery nodes.
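A sketch of the recovery-list construction described above might look as follows (our own illustration; the Peer class and its interface are invented stand-ins, not the paper's implementation): start from the successor list, drop NATed peers, and keep extending via the last peer seen until log2(N) open peers have been collected or no further progress is made.

import math
from dataclasses import dataclass, field

@dataclass
class Peer:
    """Minimal stand-in for a ring node; succs is its current successor list."""
    key: int
    is_open: bool
    succs: list = field(default_factory=list)

    def successor_list(self):
        return self.succs

def build_recovery_list(node: Peer, n_total: int, max_hops: int = 32):
    """Filter NATed peers out of the successor list and, while fewer than log2(N) open
    peers have been found, ask the last peer seen for its successor list and keep
    filtering (the reactive population step described in the text)."""
    target = max(1, int(math.log2(n_total)))
    recovery, seen = [], set()

    def add_open(peers):
        for p in peers:
            if p.is_open and p.key not in seen:
                seen.add(p.key)
                recovery.append(p)

    peers = node.successor_list()
    add_open(peers)
    hops = 0
    while len(recovery) < target and peers and hops < max_hops:
        peers = peers[-1].successor_list()   # reactive step: query the last known peer
        add_open(peers)
        hops += 1
    return recovery[:target]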
Fig. 9. 9(a) Average number of backup messages sent in each stabilisation round by a node for different values of c, as a function of stabilisation rate. 9(b) Size of the recovery list for various c values as a function of stabilisation rate. Before break-up there is a rapid decrease in the number of available pointers.
The number of messages sent during stabilisation of the recovery list is shown in Figure 9(a). For large stabilisation rates and high c values, more messages are generated as the nodes need to make more queries in order to fill their recovery lists. Figure 9(b) shows how the average size of the recovery list varies with both churn r and the fraction of NATs c. In the absence of NATs (c = 0), the recovery list maintains about the same size over the whole r-range. In the presence of NATs, on the other hand, the size of the list abruptly decreases to only a fraction of its maximum. The reason for this effect is that for high enough churn, large branches tend to grow. Large parts of the ring are transferred to branches, and as new nodes join, the branches grow while the core ring shrinks. The small size of the recovery list reflects the fact that there are only a few open peers left on the core ring for high c and churn. Since the updates are done backwards and most open peers are situated in branches, nodes outside of a branch will not receive any information about their existence. The system can still function in this large-branch state but becomes very sensitive to failures of open peers on the core ring.
6 Working Range Limits
Summarising, we can categorise the system behaviour in the following ranges of the parameter c:

0 ≤ c < 0.05. In this range the influence of NATs is minor and can be seen as an effectively increased churn rate. If a NATed node finds itself behind another NATed node it can simply re-join the system without too much relative overhead.

0.05 ≤ c < 0.5. In the intermediate range the open peers are still in the majority. Letting NATed nodes re-join when they obtain a NATed successor will however cause a lot of overhead and high churn. This implies that successor stabilisation between NATed peers needs to use an open relay node. Multiple predecessor pointers are needed in order to avoid lookup inconsistencies, and peers will experience a slightly higher workload.
0.5 ≤ c < 0.8. When the majority of the nodes are behind NATs, the main problem becomes the inability of NATed peers to re-join. In effect, the open peers in the network will have relatively small responsibility ranges but will in turn relay the majority of all requests to NATed peers. The number of reachable nodes in the NATed nodes' successor lists decreases rapidly with churn. A separate recovery list with only open peers is needed in addition to avoid system breakdown.

0.8 ≤ c < 1. In the high range the only viable solution is to let only open peers participate and have the NATed nodes directly connected to the open peers as clients. The open peers can then split workload and manage resources among their clients, but are solely responsible for the key range between themselves and the first preceding open peer.

To make the system function for even higher churn rates and NAT ratios, our conclusion is that one should only let open peers join the network and then attach the NATed peers evenly to them as clients. Since NATed peers do more harm than good, if there are too many of them we see no other option than to leave them out. Since the open peers get most (or all) of the workload in any case, it is better to spread it evenly.
7 Conclusions
In this work we have studied a model of Network Address Translators (NATs) and how they impact the performance of ring-based DHTs, namely Chord. We examine the performance gains of using predecessor-based routing and of introducing a recovery list with open peers. We show that adding these elements to Chord makes the system run and function in highly dynamic and NAT-constrained networks. We quantify how the necessary adjustments needed to perform reliable lookups vary with the fraction of nodes behind non-traversable NATs. We also note that having NATed nodes per se does not dramatically increase the probability of system failure due to break-up of the ring, as long as nodes behind non-traversable NATs can use some communication relay. The main reason why the ring eventually fails under large churn is that the branches become larger than the length of the successor and recovery lists. Information about new peers in the ring then cannot reach nodes at the end of a branch, whose pointers quickly become outdated. At the same time, as branches grow large, new nodes will have a high probability of joining a branch instead of the actual ring, which worsens the situation further in a feedback loop that eventually breaks the system. Our conclusion is that it is indeed possible to adjust Chord, and related ring-based DHT protocols, to function in the presence of NATs without the need for traversal techniques. This further shows that it is possible to construct robust overlay applications without the assumption of an open underlying network.
Acknowledgements. The authors would like to thank Jim Dowling for valuable comments and discussion. The project is supported by the SICS Center for Networked Systems (CNS) and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 214898, the Mancoosi project.
References
1. D'Acunto, L., Pouwelse, J., Sips, H.: A measurement of NAT and firewall characteristics in peer-to-peer systems. In: Gevers, T., Bos, H., Wolters, L. (eds.) Proc. 15th ASCI Conference, pp. 1–5. Advanced School for Computing and Imaging (ASCI), Delft, The Netherlands (June 2009)
2. Dale, C., Liu, J.: apt-p2p: A peer-to-peer distribution system for software package releases and updates. In: IEEE INFOCOM, Rio de Janeiro, Brazil (April 2009)
3. Freedman, M.J., Lakshminarayanan, K., Rhea, S., Stoica, I.: Non-transitive connectivity and DHTs. In: WORLDS 2005: Proceedings of the 2nd Conference on Real, Large Distributed Systems, pp. 55–60. USENIX Association, Berkeley (2005)
4. Huang, Y., Fu, T.Z., Chiu, D.M., Lui, J.C., Huang, C.: Challenges, design and analysis of a large-scale P2P-VoD system. SIGCOMM Comput. Commun. Rev. 38(4), 375–388 (2008)
5. Jimenez, R., Osmani, F., Knutsson, B.: Connectivity properties of mainline BitTorrent DHT nodes. In: IEEE Ninth International Conference on Peer-to-Peer Computing, P2P 2009, pp. 262–270 (2009), http://dx.doi.org/10.1109/P2P.2009.5284530
6. Kermarrec, A.-M., Pace, A., Quema, V., Schiavoni, V.: NAT-resilient gossip peer sampling. In: International Conference on Distributed Computing Systems, pp. 360–367 (2009)
7. Krishnamurthy, S., El-Ansary, S., Aurell, E.A., Haridi, S.: An analytical study of a structured overlay in the presence of dynamic membership. IEEE/ACM Transactions on Networking 16, 814–825 (2008)
8. Li, B., Qu, Y., Keung, Y., Xie, S., Lin, C., Liu, J., Zhang, X.: Inside the New Coolstreaming: Principles, Measurements and Performance Implications. In: The 27th Conference on Computer Communications, INFOCOM 2008. IEEE, Los Alamitos (2008)
9. Liu, Y., Pan, J.: The impact of NAT on BitTorrent-like P2P systems. In: IEEE Ninth International Conference on Peer-to-Peer Computing, P2P 2009, pp. 242–251 (2009), http://dx.doi.org/10.1109/P2P.2009.5284521
10. Maymounkov, P., Mazieres, D.: Kademlia: A peer-to-peer information system based on the XOR metric (2002)
11. Mejías, B., Van Roy, P.: The relaxed-ring: a fault-tolerant topology for structured overlay networks. Parallel Processing Letters 18(3), 411–432 (2008)
12. Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., Stoica, I., Yu, H.: OpenDHT: A public DHT service and its uses (2005), citeseer.ist.psu.edu/rhea05opendht.html
13. Roverso, R., El-Ansary, S., Haridi, S.: NATCracker: NAT combinations matter. In: International Conference on Computer Communications and Networks, pp. 1–7 (2009)
14. Shafaat, T.M., Moser, M., Schütt, T., Reinefeld, A., Ghodsi, A., Haridi, S.: Key-based consistency and availability in structured overlay networks. In: Proceedings of the 3rd International ICST Conference on Scalable Information Systems (Infoscale 2008). ACM, New York (June 2008)
15. Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup service for internet applications. In: Proceedings of the 2001 ACM SIGCOMM Conference, pp. 149–160 (2001)
Usurp: Distributed NAT Traversal for Overlay Networks
Salman Niazi and Jim Dowling
Swedish Institute of Computer Science (SICS)
Abstract. Many existing overlay networks are not practical on the open Internet because of the presence of Network Address Translation (NAT) devices and firewalls. In this paper, we introduce Usurp, a message routing infrastructure that enables communication between private nodes (behind NATs or firewalls) either by direct connectivity or relaying messages via public nodes (nodes that support direct connectivity). Usurp provides fully distributed NAT-type identification and NAT traversal services using a structured overlay network (SON) built using the public nodes in the system. Private nodes do not join the SON, instead, each private node is assigned a key in the SON’s address space and the public node(s) responsible for its key acts as both a rendezvous and relay server to the private node. Usurp is designed as a middleware that existing overlay networks can be built over, enabling them to function correctly in the presence of NATs. We evaluate Usurp using a gossip-based peer sampling service (PSS). Our results show that the PSS running over Usurp preserves its randomness properties and remains connected even in scenarios with high churn rates and where 80% of the nodes are behind NATs. We also show that Usurp only adds a low and manageable overhead to public nodes.
1 Introduction
Many elegant distributed algorithms for constructing overlay networks are not practical over the open Internet because of the presence of ugly Network Address Translation (NAT) devices and firewalls. For example, gossiping, a widely used technique for building overlay networks, assumes that any pair of nodes can communicate directly with each other, whereas, in reality, private nodes behind NATs do not support direct connectivity with nodes outside their private network. This results in an uneven participation of nodes in gossiping, where public nodes (with open IP addresses) have a significantly higher network traffic burden [17,18]. Systems studies have shown that in existing peer-to-peer (P2P) systems only between 20-40% of nodes are public nodes [13,22]. NAT traversal protocols are required to communicate with private nodes, except in the case where the source node resides behind the same NAT. Centralized NAT traversal services are commonly used in existing P2P systems [23]. These include STUN (Session Traversal Utilities for NAT) [19,20] that identifies a node's NAT type, and relay and rendezvous services that, respectively, forward
packets to the private node and enable direct connectivity to the private node in a process commonly known as hole punching. Protocols for hole punching do not work for all combinations of NAT types. Depending on the distribution of NAT types in the system, hole-punching for UDP works for 80%-95% of NATs [22,6], and around 52% for TCP [10]. In this paper, we present the first fully distributed NAT identification and traversal protocols. We use these protocols to build Usurp, a NAT-friendly overlay network, that enables any two nodes on the open Internet to communicate, regardless of whether they are public or private. In Usurp, all public nodes join a structured overlay network (SON). Each private node is assigned a unique address in the SON’s address space and the public node responsible for that SON address acts as a relay and rendezvous server for the private node. Relay and rendezvous services enable indirect and direct connectivity with private nodes, respectively. All public nodes also provide a NAT-type identification service that enables newly joined nodes to determine whether they reside behind a NAT or not, and what the type of that NAT is. To reduce connection latency using the SON, we introduce a caching mechanism that preserves useful information for future session establishment and reduces the need for lookups on the SON. Usurp is implemented as a middleware that can be layered below existing overlay network protocols. We introduce an address structure for connecting to both public and private nodes that includes a key in the SON address space, the node’s NAT type, and a set of IP addresses (the node’s own IP address for public nodes and the address of its parent(s) for private nodes). A parent address is an IP address of a public node on the SON responsible for a private node. When a node attempts to connect to a private node, it can first attempt to connect via its parents (in parallel), if it fails then it falls back to the SON to find an active parent. This significantly reduces the need to perform lookups on the SON, and is particularly effective where either public nodes are long-lived or where addresses are quickly expired from the system. Usurp also enables the construction of NAT-aware applications, enabling nodes to send private nodes either small messages with lower latency using relaying (e.g., control packets) or larger messages via a direct connection, but incurring higher latency due to the overhead of hole punching. We have validated and evaluated Usurp by constructing a gossip-based peer sampling service (PSS) on top of Usurp. Our results show that Usurp enables the PSS to preserve its randomness properties and connectivity even in scenarios with churn rates of 80% and where up to 80/90% of the nodes are behind NATs. For the PSS, we show that Usurp adds only a low and manageable overhead to public nodes.
2 NAT Classification and Traversal
The type of NAT a private node resides behind is important in determining what NAT traversal mechanism to use when communicating with that private node. The original Session Traversal Utilities for NAT (STUN) protocol [20] provides a
limited categorization of NATs into one of four types: full-cone, address-restricted cone, port-restricted cone, and symmetric. We adopt a richer classification of NAT types introduced by Roverso in [22], based on the BEHAVE RFC [2] and [14], that classifies a NAT by its port mapping, port allocation and port filtering policies. The port mapping policy defines whether, when a NAT receives an outgoing packet from a private node, it allocates a new port or uses an existing port on its external public interface. The port allocation policy defines which port should be allocated on the NAT for an outgoing packet when a new mapping is created on the NAT. Finally, the port filtering policy determines whether the NAT forwards an incoming packet to a private node or not, depending on the existing mappings in the NAT and the source IP address and port of the incoming packet. Classical STUN can only accurately determine the filtering policy. We use a modified version of the STUN protocol, based on [30] and [22], to determine all three policies. Another difference with STUN is that classical STUN servers require two different public IP addresses. However, most nodes in P2P systems do not have two different public IPs. As such, we use pairs of public nodes to implement a distributed STUN service (DSTUN). Each public node maintains a list of partner STUN nodes, sampled from the SON and ordered by round-trip time (RTT), so whenever a DSTUN server has to send a reply from a different IP address, it simply requests its lowest RTT partner to send the reply. Note that DSTUN does not consider dynamic and multi-layer NATs, more commonly found in corporate networks [6]. We do, however, support UPnP port mapping for NATs [27]. Usurp supports NAT traversal by establishing direct connections using hole-punching for UDP, and where not possible, relaying messages to private nodes using public nodes. We do not support hole-punching using TCP [8] due to its significantly lower success ratio. We support a suite of hole-punching algorithms, and the NAT types of both the source and destination nodes are used to determine the traversal technique required to establish a connection between two nodes. When hole-punching is not supported for the combination of the two NAT types we revert to relaying. The hole-punching techniques we support include simple hole punching, port prediction using preservation, and port prediction using contiguity. All of these techniques use a public node acting as a rendezvous server to coordinate the protocol, and vary in how they generate a NAT mapping that will allow traffic to pass through the NAT, and, thus, establish a direct connection. More details on these algorithms can be found in [22].
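As a rough illustration of how the (mapping, allocation, filtering) classification drives traversal, the sketch below picks a connection strategy for a destination node. The decision rules here are simplified caricatures chosen by us for illustration only; the actual rules, including port prediction by contiguity and the full combination matrix, are those of [22].

from dataclasses import dataclass

@dataclass
class NatProfile:
    """Simplified NAT profile: whether the node is public, plus caricatured
    mapping, filtering and allocation policies."""
    public: bool = False
    mapping_endpoint_independent: bool = True     # port mapping policy
    filtering_endpoint_independent: bool = False  # port filtering policy
    port_preserving: bool = True                  # port allocation policy

def choose_traversal(src: NatProfile, dst: NatProfile) -> str:
    """Toy traversal selection a rendezvous server might perform (illustrative only)."""
    if dst.public:
        return "direct"
    if src.mapping_endpoint_independent and dst.mapping_endpoint_independent:
        return "simple hole punching"
    if src.port_preserving or dst.port_preserving:
        return "hole punching with port prediction (preservation)"
    return "relay"

if __name__ == "__main__":
    print(choose_traversal(NatProfile(public=True), NatProfile()))
    print(choose_traversal(NatProfile(mapping_endpoint_independent=False,
                                      port_preserving=False),
                           NatProfile(mapping_endpoint_independent=False,
                                      port_preserving=False)))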
3 Usurp SON
On joining Usurp, a node discovers a number of random public nodes using a bootstrap service. The node then pings these public nodes and runs our NAT-type identification protocol against the node with the lowest RTT. On discovering its NAT-type, the node will either join a SON if it is public, or put a value in the SON if it is a private node. For public nodes in the SON, we generate an initial node-Id by hashing the node's public IP address, and then we replace the least significant 16 bits of
Fig. 1. Usurp’s structured overlay network. Filled circles are public nodes, members of the SON. Empty circles are keys representing private nodes. Every private node keeps a NAT mapping open to the public node responsible for its key, so that the public node can handle relay and hole-punching requests for the private node.
[Figure 2 layout: SON Key (20 bytes) | NAT Type (1 byte) | parent addresses ((6 bytes) × n) | TS (8 bytes, optional) | Payload]
Fig. 2. Usurp node descriptor
the node-Ids with the port number. This limits a single public node's ability to mount a Sybil attack, as nodes it produces from behind one IP address will most likely be contiguous on the overlay. We use iterative routing, as it has a lower hop count compared to recursive routing, and low latency is crucial for connection establishment. For private nodes, we generate a key by hashing the NAT's public IP address, and then we replace the least significant 16 bits with the last 16 bits of the private IP address. The private node then puts the key with its node descriptor into the SON and then performs k lookups on the SON using the k replication hash keys. The lookup responses return the k public nodes responsible for the keys. The node then registers as a child of these parents and keeps the NAT mappings to the parents alive using heartbeats. When a public node leaves the SON, its children become children of the new public node responsible for the key-space. The heartbeat period is determined by the NAT mapping timeout, as measured by the NAT-type identification service. As it can take minutes to determine the NAT mapping timeout, the default heartbeat period is initially set to 30 seconds, the shortest NAT mapping timeout for UDP observed by [12], and later updated when the NAT mapping timeout is determined. Our SON is based on Chord, and Usurp's architecture is illustrated in Figure 1. Although a lot of extensions have been proposed for Chord, such as biasing Id assignment to load balance data over nodes [25] and network-awareness to reduce latencies [31], we consider these issues to be outside the scope of this paper. However, one extension we do provide is an address caching mechanism to preserve connection information for future session establishment.
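The identifier scheme lends itself to a direct sketch (our own code; SHA-1 is an assumption on our part, suggested only by the 20-byte key field in Figure 2, and may differ from the implementation):

import hashlib

def public_node_id(public_ip: str, port: int) -> int:
    """Initial node-Id for a public node: hash of the public IP address with the
    least significant 16 bits replaced by the port number."""
    h = int.from_bytes(hashlib.sha1(public_ip.encode()).digest(), "big")
    return (h & ~0xFFFF) | (port & 0xFFFF)

def private_node_key(nat_public_ip: str, private_ip: str) -> int:
    """Key for a private node: hash of the NAT's public IP address with the least
    significant 16 bits replaced by the last 16 bits of the private IP address."""
    h = int.from_bytes(hashlib.sha1(nat_public_ip.encode()).digest(), "big")
    a, b = (int(x) for x in private_ip.split(".")[-2:])
    return (h & ~0xFFFF) | ((a << 8) | b)

if __name__ == "__main__":
    print(hex(public_node_id("198.51.100.7", 5000)))
    print(hex(private_node_key("203.0.113.9", "192.168.1.42")))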
33
Usurp Hole-Punching Client
DSTUN Client
Rendezvous Server
DSTUN Server
SON
Relay Server
!"
#! $ !"% #! $ !"% % %
UDP/IP
(a) Modular view of Usurp
&$'
(b) Hole-punching using the SON
Fig. 3. Usurp middleware and hole-punching using the SON
Node descriptors for private nodes include references to their parent addresses, see Figure 2. When a node wishes to relay a message or directly connect to a private node, it sends a message to the parents listed in the node descriptor, with fallback to the SON to look up the active parent only when all parents listed in the node descriptor are unreachable (because the node's parents have changed since the node descriptor was published).
4 Connection Establishment in Usurp
Usurp is implemented as a middleware and appears as a black box to higher-level overlay network protocols. Usurp takes messages from the upper overlay network layer, see Figure 3a. Usurp does not require any change to overlay network protocols, apart from using the addressing scheme from Figure 2. The only case where overlay protocols may have to be modified is if they are sensitive to connection setup times of up to a few seconds, as hole-punching may take that long to complete [22]. Figure 3a shows the modular view of our Usurp layer. It consists of DSTUN, hole-punching, relay and SON modules. Public nodes provide DSTUN, relay, hole punching and SON services, while both public and private nodes provide the DSTUN and hole-punching clients of these services. When a node attempts to connect to a private node, both mechanisms for establishing a connection, hole-punching and message relaying, require establishing a connection to one of the private node's responsible public nodes, a rendezvous server (RVP). The private node must also have a valid NAT mapping for the same rendezvous server. In Figure 3b, we can see how private node A first looks up the public node RS_B responsible for private node B. A sends a connect message to RS_B, and RS_B selects the appropriate NAT traversal policy, which is then sent to both private nodes A and B. If hole-punching is supported, A and B execute the hole punching algorithm in parallel, sending possibly many packets to ports on B_nat
and A_nat, respectively, with the goal of inserting a mapping rule in either B_nat or A_nat that will allow a direct connection to be established between A and B. The complete Usurp protocol is defined in Algorithm 1. The first step nodes take when joining the system is to request a set of random public nodes from the bootstrap server. The client then pings these public nodes and runs the DSTUN protocol against the available node with the lowest RTT to identify its NAT-type, lines 7–14. UPnP-enabled nodes can also act as public nodes. Instead of publishing their private address, they publish a mapped port and the public address of their NAT. If the node is public or supports UPnP port mapping, then it also initializes the DSTUN server and hole punching server modules, and sends a Join request to the SON module, lines 16–22. For UPnP nodes, we need to map ports on the NAT, lines 17–18. Public nodes register with their own hole punching server module, line 22. If the client is behind a NAT then it must register with a public node as its RVP. It performs a lookup for its id on the SON and registers with the public node returned, lines 24–25. Nodes may join or leave the system, causing the RVP responsible for a private node-id to change. Private nodes start a periodic timer to continuously look for any change in their RVP, line 26. If the periodic timer detects any change in a child's RVP node, the client unregisters with the old RVP and registers with the new RVP, lines 28–34. The event handler from line 35 is triggered every time the upper overlay network layer sends a message over the network. Here, dst is the descriptor of the destination node. When the Usurp layer receives a message from the upper layer, it checks the NAT type of the destination node. If the destination node is a public node then the message is sent directly to it, lines 36–37. Hole punching is tried if the destination node is a private node. In order to start hole punching, we first need to find an RVP with which the destination node is registered. Each destination node descriptor also contains a list of parent nodes responsible for the private node. An RVP is selected from the list of parents in the node descriptor; if all parent addresses are invalid, then a lookup is sent to the SON for the destination node's id (key). The SON returns the RVP responsible for the destination node and hole punching is tried using this RVP. If hole punching succeeds then the message is sent to the port defined in the newly created mapping on the destination's NAT. Both nodes participating in the hole punching process know about the newly created mappings in the NATs if the hole punching process succeeds. The message is relayed using the RVP node if hole punching between the two nodes is not possible or hole punching fails, lines 48–49. When the Usurp layer receives a message from the lower network layer it simply delivers it to the upper overlay network layer, lines 51–53. We also use a cache that contains open holes, line 39.
5 Experimental Evaluation
Our validation of Usurp involved layering a well-known overlay network, Cyclon [28], on top of Usurp and evaluating the performance of Cyclon/Usurp in the
Algorithm 1. Usurp protocol

 1: internal data
 2:   id                                  ▷ node's unique identifier
 3:   nat_type                            ▷ NAT policies
 4:   rs                                  ▷ hole punching server, a.k.a. rendezvous server (RVP)
 5: end

 6: upon event init | node_id do
 7:   stun_client.init()
 8:   hp_client.init()
 9:   SON.init()
10:   id ← node_id
11:   sServers ← bootstrap.getRandomPublicNodes()
12:   sServer ← lowest_rtt(sServers)
13:   nat_type ← run NAT-type identification with sServer
16:   if nat_type = PUBLIC or nat_type = UPNP_ENABLED_NAT then
17:     if nat_type = UPNP_ENABLED_NAT then
18:       map_UPnP_ports()
19:     stun_server.init()
20:     hp_server.init()
21:     SON.join(id)
22:     hp_server.register(id, nat_type)  ▷ the RVP of a public node is the node itself
23:   else
24:     rs ← SON.lookup(id)
25:     rs.register(id, nat_type)         ▷ establish out-of-band connection
26:     run RVP periodic check timer
27: end event

28: every T do                            ▷ private nodes check for an RVP change
29:   rs' ← SON.lookup(id)
30:   if rs' != rs then
31:     rs.unregister()
32:     rs ← rs'
33:     rs.register(id, nat_type)
34: end

35: upon event Message | dst do           ▷ message from the upper overlay layer
36:   if dst.nat_type = PUBLIC or dst.nat_type = UPNP_ENABLED_NAT then
37:     send Message to dst.address       ▷ direct communication
38:   else                                ▷ destination is a private node
39:     if hp_client.holeExists(id, dst.id) then    ▷ pre-existing hole
40:       dstHole ← hp_client.getDestinationHole(dst.id)
41:       send Message to dstHole
42:     else                              ▷ do hole punching
43:       rs' ← valid parent from destination node descriptor dst, or SON.lookup(dst.id)
44:       hp_resp ← hp_client.doHolePunching(dst.id, rs')
45:       if hp_resp = SUCCESS then
46:         dstHole ← hp_client.getDestinationHole(dst.id)
47:         send Message to dstHole
48:       else if hp_resp = HP_NOT_POSSIBLE or hp_resp = HP_FAILED then
49:         rs'.relay(Message, dst.id)
50: end event

51: upon Receive | Message do             ▷ received from the lower network layer
52:   trigger Deliver | Message
53: end
presence of NATs, compared to classical Cyclon in a NAT-free network. Cyclon is a gossip-based peer sampling protocol that is widely used to build and maintain more complex overlay networks. Cyclon creates a random-graph overlay network that has a small diameter and a low clustering coefficient, and that is highly resilient to churn. Each Cyclon node maintains a view that contains a bounded number of addresses of other nodes in the system. After a number of gossiping rounds, the view converges to a random subset of the nodes in the system. Cyclon, and gossiping in general, assumes that any node can directly communicate with any other node in the system. In summary, our results show that (i) Cyclon/Usurp preserves the randomness properties of Cyclon, i.e., a low clustering coefficient, short paths between nodes, a small diameter, uniform random sampling and high resilience to churn; and (ii) public nodes incur an acceptable level of overhead and nodes participate evenly in gossiping: a node's amount of gossiping is not affected by the presence of NATs.
5.1 Experimental Setup
We implemented Usurp as a message-level simulator using the Kompics platform [1]. Kompics provides a framework for building P2P protocols, with simulation support based on a discrete-event simulator. We developed a NAT emulator that emulates all the mapping, port allocation and filtering policies. All messages sent by the network layer pass through the NAT emulator. In all experiments, the rule binding expiration time for every NAT was randomly chosen from the set {30, 60, 90, 120, 150, 180} seconds. When any message leaves or enters the NAT, it updates the corresponding rule expiration timestamp. In our experiments, there is only one node behind each NAT, but in real life there may be multiple nodes behind a single NAT. Multiple nodes affect the success ratio of hole-punching protocols by continuously allocating ports on the NAT that would otherwise be used by the port prediction algorithms that are part of NAT traversal protocols. We emulate the behaviour of multiple nodes behind each NAT by attaching a component to the NAT emulator: every second, it opens a new port on the NAT emulator by sending a dummy message outside the network. The destination IP and port information in the dummy message is set in such a way that the message always opens a new port on the NAT and never reuses an existing mapping. The network size is set to 1024 and the latencies between pairs of nodes are modeled using the King data set [11]. Each experiment was run 30 times using different seeds, and the results reported here are the averages of the results obtained. Instead of initializing all nodes at once, we consider a growing-network scenario where nodes gradually join the overlay. The time between two consecutive joins is set to a constant 500 ms. We use a centralized bootstrap server that returns 20 random public nodes in the system. We use a Chord SON; in all experiments, the successor stabilization timeout for Chord is set to 2 seconds and the finger stabilization timeout is set to 3 seconds. Due to space limitations, no replication is used in our experiments; every private node has only one RVP associated with
it. The main parameters to set for Cyclon are the cycle period, which we set to 10 seconds, the view size, set to 15, and the shuffle length, set to 5.
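The NAT emulator described above can be pictured roughly as follows. This is a simplified, hypothetical sketch (not the Kompics-based implementation used in our experiments); it only shows the binding-expiration bookkeeping and the component that opens a fresh port every second to emulate other hosts behind the same NAT.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Simplified NAT emulator sketch; names and structure are illustrative only. */
public class NatEmulator {
    private static final int[] EXPIRATION_CHOICES_SEC = {30, 60, 90, 120, 150, 180};

    private final long bindingExpirationMs;                       // fixed per emulated NAT
    private final Map<String, Long> bindings = new HashMap<>();   // mapping key -> expiration timestamp
    private int nextPort = 50000;

    public NatEmulator(Random random) {
        this.bindingExpirationMs =
            EXPIRATION_CHOICES_SEC[random.nextInt(EXPIRATION_CHOICES_SEC.length)] * 1000L;
    }

    /** Called for every message crossing the NAT: refreshes the matching binding's expiration. */
    public synchronized void onMessage(String mappingKey) {
        bindings.put(mappingKey, System.currentTimeMillis() + bindingExpirationMs);
    }

    /** Allocates a fresh external port, emulating a new outbound mapping that never reuses an old one. */
    public synchronized int openNewPort() {
        int port = nextPort++;
        bindings.put("churn-" + port, System.currentTimeMillis() + bindingExpirationMs);
        return port;
    }

    /** Emulates other hosts behind the same NAT by opening a new port every second. */
    public void startPortChurn() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::openNewPort, 1, 1, TimeUnit.SECONDS);
    }
}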
5.2 Correctness of the Overlay Network Layer
To check the correctness of the overlay network layer, we have tried to make the scenarios as realistic as possible. The ratio of open to private nodes is set to 1:4, similar to [7], and the percentages of the different types of NAT are taken from [22]. The statistics in [22] correspond to data collected by Peerialism, Sweden, for a video streaming application. We set 5% of the NATs to support port mapping using the UPnP Internet Gateway Device protocol. In all graphs, vertical lines represent the end of the growth of the overlay. The join process for all nodes completes around the 70th cycle. As can be seen in figures 4a, 4b and 4c, Usurp produces results that are very close to classical Cyclon run using only public nodes: the clustering coefficient, average path length and average in-degree metrics converge very rapidly after all nodes have joined the overlay. We can also see that if no NAT traversal strategies are used, Cyclon performs badly in the presence of private nodes. There are few available links between nodes, i.e., only the links among public nodes and the links from private to public nodes. This results in a very low average in-degree and a high clustering coefficient. The average path length is smaller because the presence of the NATs caused nodes to fail to join the overlay network. On average, only 75% of the nodes successfully joined the overlay.
5.3 Usurp Overhead
We used the same experimental setup to measure the bandwidth consumption of Usurp as a function of time. On average, the public nodes use five times more bandwidth than the private nodes. This is because the public nodes have dual responsibilities, i.e., they provide the SON and RVP services to the remaining 80% of the nodes in the system. The bandwidth consumed by private nodes remains steady at 0.52 KB/s, and for a network of fixed size the bandwidth consumption of public nodes does not grow over time, as can be seen in figure 5a. To measure bandwidth consumption as a function of the percentage of private nodes, we use only one type of NAT: the mapping policy is set to Endpoint Independent, the filtering policy is set to Endpoint Independent, and the port allocation policy is set to Port Preservation. The bandwidth consumed by public nodes grows as the percentage of private nodes in the system increases. Up to 80% private nodes, every 10% increase in the number of private nodes results on average in a 7.72% increase in the bandwidth used by public nodes. However, this linear increase breaks down above 80% private nodes, and we observe a 30% increase in bandwidth consumption for public nodes from 80% to 90% private nodes.
Fig. 4. Randomness properties of the Cyclon/Usurp overlay network: (a) Cyclon clustering coefficient, (b) Cyclon average path length, and (c) Cyclon average in-degree, each plotted against the number of gossip cycles for Cyclon using the Usurp middleware, Cyclon without the Usurp middleware, and the Cyclon baseline with all public nodes.
5.4 Churn Resilience
We have tested our solution under high churn and failure rates. We define churn as a certain fraction of the nodes joining and leaving the overlay in one gossip cycle; failure is defined as the fraction of the nodes leaving the overlay in one gossip cycle. For the massive failure analysis, we again use only one type of NAT, as described above. We remove a fraction of the nodes after every node has completed at least 50 cycles. Public and private nodes are randomly removed from the system. Figure 6a shows the size of the biggest cluster 50 cycles after the failure process has completed. We observe that our solution is highly resilient to massive failures and can easily tolerate the failure of 80% of the nodes. The overlay only starts to partition when the failure rate reaches 90%. For the churn analysis, we use the same scenario described in the first experiment. A fraction of nodes join and leave the system after every node in the system has
Fig. 5. Usurp protocol overhead: (a) Usurp overhead (data rate in KB/s) over time (gossip cycles) for public and private nodes; (b) Usurp overhead for an increasing percentage of private nodes, comparing the all-public-nodes Cyclon baseline with public and private nodes using the Usurp middleware.
Fig. 6. Behaviour of Usurp/Cyclon under churn and massive failures: (a) biggest cluster size (%) against the failure percentage for 50%, 60%, 70% and 80% private peers; (b) join failures (%) and biggest cluster size (%) against churn (%) for finger/successor stabilization periods of 1000, 2500, 5000, 7500 and 10000 ms.
completed 50 cycles, and data is collected 50 cycles after the churn process has completed. For the churn analysis, it is crucial to observe the effect of different finger and successor stabilization rates. In this experiment, the finger and successor stabilization rates are set to the same values. We observe that under high churn many nodes fail to join the overlay; this is because during the initialization process the bootstrap server returns dead public nodes or the SON ring has not stabilized. The bootstrap server evicts a public node if it does not receive a ping from the node. In all our experiments, the node eviction period was set to 20 seconds. We observe few join failures and high clustering with short finger and successor
stabilization rates. Increasing the finger and successor stabilization rates directly affects the performance of the system, as can be seen in figure 6b.
6 Related Work
There are proprietary systems, such as Skype [9] and Hamachi, that support distributed NAT connectivity using public nodes, although details of their architecture are not public knowledge. Most existing P2P systems either use centralized servers to provide NAT connectivity [23] or do not support NAT connectivity at all [18]. The idea of connecting public nodes using a SON and having private nodes as clients originated with the Internet Indirection Infrastructure [26], although it did not address NAT traversal. The system most similar to Usurp is the Maidsafe SON, a commercial implementation of Kademlia [15], where public nodes act as rendezvous servers. However, private nodes pick a rendezvous parent using bootstrap nodes from their own routing table dump on start-up, so there is no guarantee that a node can discover the rendezvous server responsible for a private node: false negatives are possible. Also, they do not separate NAT-type identification from NAT traversal, so, similar to Interactive Connectivity Establishment (ICE) [21], as nodes do not know each other's NAT type, a connection request results in a node trying to connect using several mechanisms in parallel: direct connection, connection reversal, and hole punching. Usurp's node descriptor is similar to that used in Teredo [27], where an address contains the private address and a public address (although, for Teredo, the public address is an address on the NAT). Usurp's architecture has similarities to P2PSIP, whose goal is to implement SIP using Chord [4], although Usurp provides a more general connectivity layer. In [29], Wolinsky et al. showed how to bootstrap a P2P system using BruNet [3] and XMPP [24]. Similar to Usurp, they used a SON to implement relaying from public nodes in the SON to private nodes connected to those public nodes. There has also been work on peer sampling protocols that work in the presence of NATs, similar to the Cyclon/Usurp combination from our evaluation [17,5,16]. Leitão et al. address the problem of balancing load among public and private nodes [17], while Actualized Robust Random Gossiping (ARRG) [5] uses a Fallback Cache containing public nodes to handle partitioning problems. Nylon is a peer sampling protocol that allows any node, whether open or natted, to act as a rendezvous server. However, they only consider the four classical types of NATs and do not take into account the success rates of different hole-punching protocols for different NAT combinations.
7 Conclusions
We have presented Usurp, a distributed NAT Traversal solution that supports node connectivity for overlay network protocols. The layered architecture of our solution allows the reuse of the Usurp layer with other protocols. We demonstrated that our solution does not require any changes to an existing overlay
network protocol, Cyclon, and produces results comparable with Cyclon run in a network with only public nodes. We showed that Cyclon/Usurp is resilient to high failure and churn rates with up to 80% of nodes behind NATs, and it has reasonable overhead while preserving the randomness properties of the peer sampling service.
References 1. Arad, C., Dowling, J., Haridi, S.: Developing, simulating, and deploying peer-topeer systems using the kompics component model. In: COMSWARE 2009: Proceedings of the Fourth International ICST Conference on COMmunication System softWAre and middlewaRE, pp. 1–9. ACM, New York (2009) 2. Audet, F., Jennings, C.: Network address translation (nat) behavioral requirements for unicast udp (2007) 3. Boykin, P.O., Bridgewater, J.S.A., Kong, J.S., Lozev, K.M., Rezaei, B.A., Roychowdhury, V.P.: A symphony conducted by brunet. CoRR abs/0709.4048 (2007) 4. Broadbent, T., Bryan, D.A.: P2psip, http://www.p2psip.org/index.php 5. Drost, N., Ogston, E., van Nieuwpoort, R.V., Bal, H.E.: Arrg: real-world gossiping. In: HPDC 2007: Proceedings of the 16th International Symposium on High Performance Distributed Computing, pp. 147–158. ACM, New York (2007) 6. Ford, B., Srisuresh, P., Kegel, D.: Peer-to-peer communication across network address translators. In: ATEC 2005: Proceedings of the Annual Conference on USENIX Annual Technical Conference, p. 13. USENIX Association, Berkeley (2005) 7. Ganjam, A., Zhang, H.: Connectivity restrictions in overlay multicast. In: NOSSDAV 2004: Proceedings of the 14th International Workshop on Network and Operating Systems Support for Digital Audio and Video, pp. 54–59. ACM, New York (2004) 8. Guha, S., Biswas, K., Ford, B., Sivakumar, S., Srisuresh, P.: RFC 5382: NAT Behavioral Requirements for TCP (October 2008) 9. Guha, S., Daswani, N., Jain, R.: An Experimental Study of the Skype Peer-to-Peer VoIP System. In: IPTPS 2006: The 5th International Workshop on Peer-to-Peer Systems. Microsoft Research (2006), http://saikat.guha.cc/pub/iptps06-skype.pdf 10. Guha, S., Francis, P.: Characterization and measurement of tcp traversal through nats and firewalls. In: Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, IMC 2005, p. 18. USENIX Association, Berkeley (2005), http://portal.acm.org/citation.cfm?id=1251086.1251104 11. Gummadi, K.P., Saroiu, S., Gribble, S.D.: King: Estimating latency between arbitrary internet end hosts. In: SIGCOMM Internet Measurement Workshop (2002) 12. Hatonen, S., Nyrhinen, A., Eggert, L., Strowes, S., Sarolahti, P., Kojo, M.: An experimental study of home gateway characteristics. In: ACM SIGCOMM Internet Measurement Conference (IMC) (2010) 13. Huang, Y., Fu, T.Z.J., Chiu, D.M., Lui, J.C.S., Huang, C.: Challenges, design and analysis of a large-scale p2p-vod system. SIGCOMM Comput. Commun. Rev. 38(4), 375–388 (2008), http://dx.doi.org/10.1145/1402946.1403001 14. Huitema, C.: Teredo: Tunneling ipv6 over udp through network address translations (nats) (2006)
15. Hutchison, F.: Nat traversal in maidsafe dht (2010), http://code.google.com/p/maidsafe-dht/wiki/NATTraversal (accessed November 2010) 16. Kermarrec, A.M., Pace, A., Quema, V., Schiavoni, V.: Nat-resilient gossip peer sampling. In: ICDCS 2009: Proceedings of the 2009 29th IEEE International Conference on Distributed Computing Systems, pp. 360–367. IEEE Computer Society, Washington, DC, USA (2009), http://dx.doi.org/10.1109/ICDCS.2009.44 17. Leitão, J., van Renesse, R., Rodrigues, L.: Balancing gossip exchanges in networks with firewalls. In: International Workshop (IPTPS 2010), San Jose, CA (April 2010) 18. Lu, Y., Fallica, B., Kuipers, F.A., Kooij, R.E., Mieghem, P.V.: Assessing the quality of experience of sopcast. Int. J. Internet Protoc. Technol. 4(1), 11–23 (2009) 19. MacDonald, D., Lowekamp, B.: Skype: Nat behavior discovery using session traversal utilities for nat (stun). IETF RFC 5780 (May 2010) 20. Rosenberg, J., Weinberger, J., Huitema, C., Mahy, R.: Stun - simple traversal of user datagram protocol (udp) through network address translators (nats) (2003) 21. Rosenburg, J.: Interactive connectivity establishment (ice). In: IETF Internet Draft (October 2007), http://tools.ietf.org/html/draft-ietf-mmusic-ice-19.txt 22. Roverso, R., Ansary, S.E., Haridi, S.: Natcracker: Nat combinations matter. In: International Conference on Computer Communications and Networks, vol. 0, pp. 1–7 (2009), http://dx.doi.org/10.1109/ICCCN.2009.5235278 23. Roverso, R., Naiem, A., Reda, M., El-Beltagy, M., El-Ansary, S., Franzen, N., Haridi, S.: On the feasibility of centrally-coordinated peer-to-peer live streaming. In: Consumer Communications and Networking Conference (2011) 24. Saint-Andre, P., Smith, K., Tronçon, R.: XMPP: The Definitive Guide: Building Real-Time Applications with Jabber Technologies. O’Reilly Media, Inc., Sebastopol (May 2009) 25. Schutt, T., Schintke, F., Reinefeld, A.: Structured overlay without consistent hashing: Empirical results. In: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, CCGRID 2006, p. 8. IEEE Computer Society, Washington, DC, USA (2006), http://portal.acm.org/citation.cfm?id=1134822.1134923 26. Stoica, I., Adkins, D., Zhuang, S., Shenker, S., Surana, S.: Internet indirection infrastructure. In: SIGCOMM, pp. 73–86 (2002) 27. Thaler, D.: Teredo extensions (2011) 28. Voulgaris, S., Gavidia, D., Steen, M.V.: Cyclon: Inexpensive membership management for unstructured p2p overlays. Journal of Network and Systems Management 13, 2005 (2005) 29. Wolinsky, D.I., St. Juste, P., Boykin, P.O., Figueiredo, R.J.O.: Addressing the p2p bootstrap problem for small overlay networks. In: Peer-to-Peer Computing, pp. 1–10. IEEE, Los Alamitos (2010), http://dx.doi.org/10.1109/P2P.2010.5569960 30. Takeda, Y.: Symmetric nat traversal using stun (June 2010), http://tools.ietf.org/id/draft-takeda-symmetric-nat-traversal-00.txt 31. Zhu, Y., Hu, Y.: Efficient, proximity-aware load balancing for dht-based p2p systems. IEEE Trans. Parallel Distrib. Syst. 16, 349–361 (2005), http://dx.doi.org/10.1109/TPDS.2005.46
Kalimucho: Contextual Deployment for QoS Management
Christine Louberry 1, Philippe Roose 2, and Marc Dalmau 2
1 ASCOLA-LINA, Ecole des Mines de Nantes, 4 rue A. Kastler, 44300 Nantes, France, [email protected]
2 T2I-LIUPPA, IUT de Bayonne, 2 allée Parc Montaury, 64600 Anglet, France, {philippe.roose,marc.dalmau}@iutbayonne.univ-pau.fr
Abstract. The increasing use of mobile technologies raises new challenges for satisfying users. New systems must deal with three main characteristics: context changes, mobility and the limited resources of mobile devices. In this article we address these requirements using QoS-driven dynamic adaptation of application deployment. We are particularly interested in managing the QoS of distributed applications in the face of hardware limitations, device mobility, user requirements and usage constraints. We propose a service-based reconfiguration platform named Kalimucho. It implements a contextual-deployment heuristic to find a configuration matching the context and the QoS requirements. Kalimucho was tested with the Osagaia/Korrontea component model on several devices; the results confirm that Kalimucho adapts applications within a satisfactory execution time. Keywords: QoS management, context-awareness, contextual deployment, dynamic adaptation, software component.
1 Introduction
The increasing use of mobile technologies raises new challenges for satisfying people who use mobile devices. As they can now take their devices anywhere, people wish to use their favorite applications just as they do at home. Moreover, they expect applications to be automatically customized according to their location, light, weather or other environmental context. However, we have to deal with three main characteristics of such systems: context changes, mobility and limited resources. The aim of this article is to address user needs and changes of the environment using QoS-driven dynamic adaptation of application deployment. We are particularly interested in the QoS management of distributed applications in the face of hardware limitations, device mobility, user requirements and usage constraints. In this article, we illustrate our work with the sample use case of a tourism-centered application: the visit of a museum. Three visitors use the application running on a mobile device (a smart phone for example). The museum provides a server hosting a video information service (fig. 1). We consider that the best QoS is to
broadcast a color video. Visitor #3 moves around the museum. The platform is notified of this change and consequently estimates the quality of the video service provided to user #3. First, the bandwidth is low but the device can still reach the server. To ensure the continuity of the service, one solution is to reduce the amount of transmitted data: the server transforms the color video into a black-and-white one. Second, the device cannot reach the server at all. The platform then has to look for a new route to reach device #3 and ensure the continuity of the service. For example, the device of visitor #2, which also uses the video service, can act as a relay for device #3.
Fig. 1. Visit of a museum application
Discovering a new route to broadcast information is a classic problem, and many protocols such as AODV [5] can be used. Nevertheless, such protocols only manage physical routing constraints. To propose more interesting solutions that are not based only on technical feasibility criteria, we use a supervision platform distributed on each device, which gives us complete and global knowledge of the application. Hence, because the color video is already broadcast to users #1 and #2, it is possible to use one of these two users as a relay for user #3, who cannot reach the video service anymore. However, if the relay device does not have enough energy available, broadcasting a color video is not possible. We can then propose to install a black-and-white conversion component on the relay device in order to transmit a video better suited to the energy available on that device. Since energy consumption depends on the quantity of data, the conversion to black and white reduces the quantity of data to transmit to device #3 and thus preserves its energy. Furthermore, distributing services allows network load balancing. Obviously, such a solution requires a good knowledge of the whole application in terms of services in order to choose the best deployment. Our approach, based on dynamic reconfiguration and redeployment, proposes solutions that are reliable from the point of view of both the infrastructure and the QoS. In the remainder of this paper, we first discuss related work (Section 2). Section 3 presents the definitions of context and QoS we use in this paper. In Section 4, we present our QoS model, which addresses the utility of an application and its durability.
Then, in Section 5, we describe Kalimucho, our reconfiguration platform that provides contextual deployment of applications in order to meet QoS requirements. Finally, Section 6 presents an experimental evaluation of Kalimucho on the SunSPOT platform, and we conclude and discuss directions for future work in Section 7.
2 Related Work
Routing protocols are the most common solution for dealing with QoS in distributed or mobile environments [6][11]. Routing protocols allow discovering reliable routes, ensuring the continuity of service and the best bandwidth. However, such protocols are not sufficient to handle heterogeneity, context changes and high-level routing decisions. Following this observation, several works have proposed software architectures to manage QoS in mobile and resource-limited systems. Such approaches address the global problem of context-awareness to manage QoS and adapt applications. The Music project [13] proposes a context-aware middleware to adapt applications in mobile systems. "Planning-based adaptation refers to the capability of adapting an application to changing operating conditions by exploiting knowledge about its composition and Quality of Service (QoS) meta-data associated to the application components" [8]. Music considers that applications are developed with a QoS model such as utility functions. Applications are described in several variations where components are associated with QoS meta-data. The planning process chooses variations in order to maximize utility. Music is a device-oriented approach: for each adaptation, an adaptation domain is defined around the device requesting adaptation, within which the adaptation planner can act. Moreover, distribution plans are associated with each variation, which limits the range of solutions. QuAMobile [15] is a generic QoS-aware software architecture for distributed multimedia applications. It is based on two concepts: QoS-oriented components monitoring QoS, and a service planner. Like Music, this service planner allows dynamically composing a configuration according to QoS requirements. This approach shows how important it is to model QoS and to describe components and devices in order to provide a suitable configuration. However, none of these works tackles the problem of distributed deployment. Some works, such as Carisma and AxSeL [3], provide contextual deployment solutions. As well as optimizing resource consumption, they can drive deployment according to physical and/or logical dependencies. Nevertheless, none of them includes the network communication cost when adapting applications.
3 Context and QoS
A unique definition of what context is does not exist; the context is application-domain dependent. This article addresses QoS management through context adaptation in mobile applications. Hence, the context has to include the particularities of such applications: mobility, limited devices, and customization according to location and environment. We refer to the definition of Schilit and Theimer [14] and that of David and Ledoux [7]. Such definitions point out that context and QoS are linked: context changes can be seen as a QoS evolution. However, not all context changes
have the same consequences. Thus, we define three categories of context (fig. 2): user, usage and execution. The user context refers to user preferences, i.e., which service the user wants to use. The usage context refers to application constraints: the functional specifications that define what the application must and must not do. The execution context refers to hardware (CPU, memory, energy) and network capabilities. These entities are traditionally monitored in context-aware systems. The context is a measure of the QoS of applications. QoS is usually used in networks to measure the performance of transmissions according to quantitative criteria such as delay, jitter or error rate, but it cannot be based only on network and hardware criteria [9]. Taking the users' point of view into account is necessary but not sufficient for QoS evaluation when dealing with constrained devices. Indeed, using limited mobile devices implies optimizing the energy consumption and the way applications are provided, so that the offered service fits within the environment constraints and can be used for a long time. To achieve this goal, we wish to act at three levels (fig. 2). At the infrastructure level, we have to guarantee the continuity of service, whatever the evolutions of the infrastructure and despite hardware or network failures. At the application level, we have to guarantee the durability of the application; indeed, the use of limited devices raises the problem of their lifetime. Finally, at the user level, we have to guarantee that user needs are respected in order to provide a useful application (utility).
Fig. 2. QoS and context interactions
Continuity of service. Considering application QoS, the main objective is to guarantee the continuity of service despite hardware, software and network failures. Furthermore, we have to cope with the heterogeneity arising from the use of several types of devices, as well as with hardware limitations such as the battery level of mobile devices.
Application durability. We wish to guarantee the continuity of service of applications running on limited devices. One solution consists in maximizing the lifetime of the application. A device without energy causes the disconnection of all the services running on it and consequently may compromise the continuity of service. [1] points out that network exchanges consume more energy than computation. Therefore, solutions based on service mobility can minimize transmissions and maximize lifetime.
Application utility. We defined the usage constraints as the functional specifications of the application that the system has to respect. For example, in the museum visit application, the designer can express constraints as follows:
− When the visitor enters a room where a conference is taking place, avoid providing the information service with sound, so as not to disturb the visitor.
− When it is closing time, activate the guide service to lead visitors to the exit.
4 A Two-Dimensional QoS Model
In order to decide whether to reconfigure, we have to measure the application QoS. We aim at providing a useful application that runs as long as possible. In mobile systems we have to meet user needs and to maximize the lifetime of devices. To minimize energy consumption, most approaches try to choose the most suitable components according to their CPU and energy consumption. Yet network communications on mobile devices consume 90% more energy than data processing [1]; energy consumption is therefore closely linked to network distribution. Our approach does not restrict lifetime management to optimizing resource consumption: we also try to optimize network load balancing. We propose a two-dimensional QoS model for optimizing the utility and durability of an application.
4.1 Utility
Utility is represented as a classification of configurations. Each configuration has a mark, first determined by the designer. This classification changes according to the context. Usage constraints correspond to rules that change the utility of a set of configurations. For example, when the sound level is higher than 70 dB, we avoid providing sound features so that the user does not suffer from noise disturbance. This kind of constraint is translated into an Event-Condition-Action rule:
If (sound > 70) throws beginConference
Event: [ beginConference, sound, « - », 0.2 ]
We associate an event with a feature (sound, video, etc.), an operator to increase or decrease the utility mark, and a coefficient. In our example, we decrease the utility of any configuration providing sound when the beginConference event occurs. Such a rule can change the classification, placing the affected configurations at the bottom.
4.2 Durability
As utility functions can minimize the impact of several factors in a system [2], we use two functions to evaluate the durability of an application. The first one aims at minimizing the impact of resource consumption when deploying components on devices; the second one aims at minimizing the network load of a deployment.
Durability depending on resource consumption. Each component and device is represented by a 3-tuple: its consumption (for a component) or availability (for a device) of CPU (C), memory (M) and
energy (E). Each value is expressed as a percentage between 0 and 1: a component consumes no energy (or CPU, or memory), 0, or 100% of a device's energy, 1. When a configuration is deployed, we calculate the influence of each component supported by a device (component C on device H), Eq. (1):

(C on H) = (C_H − C_C, M_H − M_C, E_H − E_C) .   (1)

However, we have to distinguish components that seem equivalent. Consider, for example, A(0.5, 0.2, 0.2), B(0.2, 0.5, 0.2) and C(0.2, 0.1, 0.6). If we calculate the average of the three values, we obtain 0.3 for all three components. If we deploy A on a device P, A will consume much energy and P will rapidly be out of order. B consumes much memory and P will not be able to support any other component. C consumes much CPU, which can slow down computation. In short, the discriminating factor is the deployment impact of a component on a device, and we use the minimum of the three values as the durability of deploying C. We name this equation QoS_RC:

QoS_RC(C on H) = min(C_H − C_C, M_H − M_C, E_H − E_C) .   (2)

If we deploy several components on a device, we sum the values of all the components it supports, and we finally calculate the durability of a configuration as the minimum over its devices:

QoS_RC(configuration) = min over all devices H of min(C_H − Σ_i C_Ci, M_H − Σ_i M_Ci, E_H − Σ_i E_Ci) ,   (3)

where the sums range over the components Ci deployed on device H.
This method of calculating QoS is not optimal, because we choose a configuration that consumes a little energy on two devices instead of a configuration that consumes much energy on one device: it exhausts devices little by little instead of completely exhausting devices one by one. Nevertheless, our goal is to maximize the lifetime of devices, so we want to avoid wearing out any single device.
Durability depending on network consumption. We calculate the durability depending on network consumption (QoS_NC) with a method similar to QoS_RC. For a particular deployment, we know which device supports each component and we know all the network links between components. In a previous work, we proposed a design method describing each component and device in an ID card [12]. In these ID cards, a device knows the theoretical bandwidth (BWTh) of every network it can reach (Wi-Fi, Bluetooth, Zigbee, etc.). In our applications, software components and data flows are encapsulated into containers (called Osagaia for software components and Korrontea for data flow containers). They are composed of three entities, an Input Unit, an Output Unit and a Control Unit, which are the main information sources for the platform. When a unit detects an evolution of its context, it raises an event informing the platform, and the platform can query information from the containers (Section 5). Using the Korrontea container, we can monitor the bandwidth of connectors during execution and calculate the average bandwidth of a connection between two devices. Then we can calculate the available bandwidth between two devices H1 and H2 as the theoretical bandwidth minus the average bandwidth already used:

BW_available(H1, H2) = BWTh(H1, H2) − BW_used(H1, H2) .   (4)
Finally, the ID cards of components indicate the output bandwidth that a component produces. A simple method to know the output bandwidth of a device is therefore to sum the output bandwidth of every component it supports:

BW_out(H) = Σ over components C on H of BW_out(C) .   (5)

4.3 QoS Evaluation
We can represent applications as configuration graphs [10], where one application can be realized by one or several configurations with different QoS. For each application, configurations are classified according to their decreasing utility. The utility QoS is first determined by the designer depending on the application domain. Then the usage context defines some rules modifying the configurations' QoS.
Fig. 3. QoS evaluation
For example, when the sound level is higher than 70 dB, we propose to avoid configurations providing sound. Hence, we apply a coefficient that reduces the QoS of such configurations and, consequently, the classification is modified. This classification is the basis for QoS evaluation. We represent the QoS of a configuration (and its deployments) in a two-dimensional diagram where the X-axis represents the utility and the Y-axis represents the durability. We place utility and durability thresholds, which define a set of reliable configurations (fig. 3). Such thresholds can be modified during execution in order to enlarge this set and achieve a deployment. We examine each configuration of the classification from top to bottom and calculate the durability of its deployments. While the QoS of a deployment is outside the boundaries, we test another deployment or configuration until a deployment meets the QoS requirements. If there are two reliable deployments, as in fig. 3, we take the one with the best QoS.
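To make the durability side of this model concrete, the following minimal Java sketch implements our reading of Eqs. (1)–(3). The class, record and method names are ours, and the aggregation over devices reflects our interpretation rather than the actual Kalimucho implementation.

import java.util.List;
import java.util.Map;

/** Sketch of the QoS_RC durability computation (our reading of Eqs. (1)-(3), not the Kalimucho code). */
public class DurabilityModel {
    /** Consumption (for a component) or availability (for a device) of CPU, memory and energy in [0,1]. */
    public record Resources(double cpu, double memory, double energy) {}

    /** Eqs. (1)/(2): resources left on a device after placing its components, reduced to their minimum. */
    public static double qosRc(Resources device, List<Resources> componentsOnDevice) {
        double cpu = device.cpu(), mem = device.memory(), energy = device.energy();
        for (Resources c : componentsOnDevice) {
            cpu -= c.cpu();
            mem -= c.memory();
            energy -= c.energy();
        }
        return Math.min(cpu, Math.min(mem, energy));
    }

    /** Eq. (3), as we interpret it: the durability of a deployment is bounded by its weakest device. */
    public static double qosRc(Map<Resources, List<Resources>> deployment) {
        double worst = 1.0;
        for (Map.Entry<Resources, List<Resources>> e : deployment.entrySet()) {
            worst = Math.min(worst, qosRc(e.getKey(), e.getValue()));
        }
        return worst;
    }
}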
5 Kalimucho
In this article, we address QoS management in mobile and constrained systems through the dynamic reconfiguration of applications. Hence, we use component-based applications, which provide flexibility and modularity. They are based on a component model, Osagaia, and a connector model, Korrontea [4]. These models encapsulate each component and connector in a suitable container, which allows monitoring the context (activity, QoS, delay, etc.) and controlling the component/connector life cycle (start, stop, connection, disconnection, migration). These containers can be controlled through a control unit that acts as an interface with the platform, so we can act directly on components and connectors to reconfigure applications. To provide such quality in mobile applications, we propose a distributed QoS-aware platform: Kalimucho (https://kalimucho.dev.java.net/). Kalimucho consists of five collaborating services, distributed to all devices supporting the application in order to have a global knowledge of the system. The platform is able to detect context changes and to modify the structure and the deployment of the application using five basic actions: add, remove, connect, disconnect and migrate a component/connector. This set of five services allows Kalimucho to ensure the following four objectives. (1) It must first be able to capture the context. Events can come from the application or from the platform itself; the platform must have mechanisms to capture these events, interpret them and take the appropriate decision to adapt the application. These mechanisms correspond to application monitoring and are provided by the Supervisor service. (2) When the reconfiguration decision is taken, the platform must be able to propose a reliable deployment. It must therefore know all the software components and devices available and test whether a deployment meets the QoS requirements. This is carried out by the Reconfiguration Builder service. (3) A reconfiguration involves moving, adding and removing components. To maintain an effective application, we must ensure the reliability of the network connections between the devices supporting the application. Kalimucho therefore needs a service to maintain the network topology of the application: this is the Routing service. (4) Finally, our applications can be used on any device, and not all devices have the same hardware and software resources. We must manage this heterogeneity between devices, and components must be able to run on any device. We use a component model where each component/connector is encapsulated in a container suited to the device, which frees it of all non-functional properties. The lifetime of a container is limited to the use of its component: containers are created when the component is installed and destroyed when the component is removed. The container must thus be adapted to the device, and the platform needs a service able to create specific containers depending on the component and the device. These are the Container Factory and the Connector Factory.
Depending on events, Kalimucho provides two reconfiguration processes. First, it can migrate a service: it tries to deploy the components of the current service differently. Second, it can deploy a new configuration of a service: it evaluates a set of configurations respecting utility and tries to find a deployment respecting durability. To find such a deployment, it implements a heuristic for contextual deployment.
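For illustration, the five basic actions can be pictured as a small command interface. The signatures below are purely hypothetical and do not correspond to the actual Kalimucho API; they merely restate the actions listed above.

/** Hypothetical sketch of the five basic reconfiguration actions (not the actual Kalimucho API). */
public interface ReconfigurationActions {
    /** Creates a component inside an Osagaia container on the given device. */
    String addComponent(String componentType, String deviceId);

    /** Stops and removes a component together with its container. */
    void removeComponent(String componentId);

    /** Creates a Korrontea connector between a component output and a component input. */
    String connect(String sourceComponentId, String targetComponentId);

    /** Disconnects and removes a connector. */
    void disconnect(String connectorId);

    /** Moves a running component to another device. */
    void migrateComponent(String componentId, String targetDeviceId);
}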
5.1 Heuristic for Contextual Deployment
This heuristic aims at finding a configuration, and a deployment of it, that meet the context constraints and the QoS criteria: utility and durability. The heuristic is composed of two parts: the first one evaluates the utility criterion, whereas the second one searches for a deployment optimizing durability. It plays the following scenario: for each configuration whose utility is between 1 and the utility threshold, the heuristic searches for a deployment; this deployment has to ensure a durability between 1 and the durability threshold.
To limit the energy consumption, we propose to minimize the number of network links. Since some components have location dependencies, we try to group components on the imposed devices. When these devices can no longer support additional components, we try to place components on devices located on paths between the imposed devices (fig. 5).
Fig. 4. (a) Graph of the "Text" configuration. (b) Network graph.
This solution is based on a ranking of devices (Table 1), which is itself based on two concepts: component weight and device weight. We can represent a configuration as an oriented graph (fig. 4(a)). We then define paths between the components that are fixed to particular devices (location dependence). We define the weight of a component as its minimal rank in such a path, i.e., the minimal distance from the component to a fixed component. Fixed components are always set to rank 0. We apply the same definition to define the weight of a device, but the network is represented as a non-oriented graph (bidirectional paths). From this ranking, we try to deploy each configuration until one meets the QoS requirements. In order to find a deployment rapidly, we refine the initial ranking of devices with other criteria: type of device, energy, CPU and memory:
− Type of device: this ranking aims at using a maximum of non-limited devices. We distinguish three types of devices: Fixed, CDC1 and CLDC, according to Sun Microsystems' J2ME standard classification. CLDC devices, such as mobile phones and wireless sensors, are the most limited.
− Energy: the energy available on a device is an important criterion because it directly affects the durability of the device and consequently the durability of the application. It may cause more reconfigurations than changes in CPU or memory use.
1 According to Sun Microsystems' device type description.
− CPU: CPU workload is a less important criterion than energy. However, a high workload can slow down the computation of components.
− Memory: Memory availability has no impact on the durability of the application at this moment; it will matter when we add further components to a device.
Table 1. Ranking of devices
(The table lists, for each device H1–H3 and A–F, its impact, its type (Fixed, CDC or CLDC) and its available energy, CPU and memory. The devices H1, H2 and H3, which host location-dependent components, have impact 0; devices A, B, C, D and F have impact 1; E has impact 2.)
Fig. 5. Result of deploying the "Text" configuration
According to this ranking, the heuristic uses a recursive approach to calculate a deployment for each component:
1. We rank the components in a list CL according to their impact.
2. We select the first component whose impact == 1.
   a. If no such component exists, it means that all components have been placed.
3. We rank the devices in a list DL according to the criteria of Table 1.
4. We calculate the impact of placing this component on the first device of DL.
   a. If (QoS_RC >= QoS_RC threshold), we place the component on this device and invoke the recursive heuristic again from step 1. If the placement concerns the last component of CL, we can calculate the QoS_NC of the deployment.
   b. If (QoS_RC < QoS_RC threshold), we go back to step 4 with the next device in DL, if there is one. Otherwise, this recursive call fails and implies a new computation to place the previous component again.
In case of failure, we can update the QoS thresholds in order to select new configurations, which may provide a deployment meeting the QoS requirements. If the
Kalimucho: Contextual Deployment for QoS Management
53
QoS threshold update cannot provide such a deployment and the heuristic has tested all the configurations, it means that the application cannot be adapted. As an example, we try to deploy the "Text" configuration represented in figure 4(a) on the network represented in figure 4(b). The "Text" configuration is composed of 8 components, C1 to C8. Location dependences have been defined: C1 is placed on H1, C5 and C6 are placed on H2, and C8 is placed on H3. The result of the heuristic, shown in figure 5, confirms that the components are grouped on the devices already in use, in order to limit the number of network links and the energy consumption.
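The backtracking structure of this recursive placement can be sketched as follows. The ranking and QoS_RC functions are simplified placeholders with hypothetical names, so this is an illustration of the approach under our assumptions rather than the platform's implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Backtracking sketch of the contextual-deployment heuristic (simplified, not the Kalimucho code). */
public class PlacementHeuristic {
    public interface Ranking {
        List<String> rankComponents(List<String> unplaced);          // step 1: by component impact
        List<String> rankDevices(String component);                  // step 3: by type, energy, CPU, memory
        double qosRcIfPlaced(String component, String device, Map<String, String> partial);
    }

    private final Ranking ranking;
    private final double qosRcThreshold;

    public PlacementHeuristic(Ranking ranking, double qosRcThreshold) {
        this.ranking = ranking;
        this.qosRcThreshold = qosRcThreshold;
    }

    /** Returns a component-to-device assignment, or null if no placement meets the QoS_RC threshold. */
    public Map<String, String> place(List<String> components) {
        return place(new ArrayList<>(components), new HashMap<>());
    }

    private Map<String, String> place(List<String> unplaced, Map<String, String> partial) {
        if (unplaced.isEmpty()) {
            return partial;                                          // step 2a: all components placed
        }
        String component = ranking.rankComponents(unplaced).get(0);  // steps 1-2: next component by impact
        for (String device : ranking.rankDevices(component)) {       // steps 3-4: try devices in rank order
            if (ranking.qosRcIfPlaced(component, device, partial) >= qosRcThreshold) {
                partial.put(component, device);                      // step 4a: tentative placement
                List<String> rest = new ArrayList<>(unplaced);
                rest.remove(component);
                Map<String, String> result = place(rest, partial);
                if (result != null) {
                    return result;
                }
                partial.remove(component);                           // step 4b: backtrack
            }
        }
        return null;                                                 // no device works for this component
    }
}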
6 Experimentations
We developed a prototype of Kalimucho running on a netbook, a PDA and two SunSPOT platforms. This implementation of Kalimucho can deploy and reconfigure component-based applications using batch files. It can also capture the contextual elements necessary for decision-making. The Supervisor service reports, as alarms, the context changes and the user requests that call for adapting the application. To do so, the Supervisor service is able to monitor the state of a device (memory, CPU, battery), to monitor the state of a component or a connector (QoS, activity, connections, etc.) and to relay statements to the other distributed platforms. The Reconfiguration Builder service is able to create or remove components/connectors, migrate components, connect or disconnect a component input, delete or duplicate a component output, and send commands to other platforms. Kalimucho is distributed on all devices supporting the application. We implemented specific versions according to the type of device. Our prototype thus includes:
− Kalimucho for fixed devices such as PCs and laptops (about 260 KB);
− CDC Kalimucho for devices such as smart phones (about 356 KB);
− Kalimucho for CLDC devices such as SunSPOT sensors (about 169 KB).
We also implemented an Android version. Finally, the devices do not all use the same network, so we have to provide tools to relay information across the different networks. In our prototype, the PC and the Android phone use Wi-Fi, while the SunSPOTs use Zigbee. To enable the SunSPOTs to communicate with other devices, a base station plugged into the PC acts as a gateway. We propose a scenario in which we can use the five actions provided by the platform and measure the execution time of these actions. It is an application displaying the angle captured by a SunSPOT. This application is initially composed of 3 components: capture, processing and display (Figure 6). The Supervisor service of Kalimucho captures the activity of components and connectors, the QoS and the resources of the devices (CPU, energy, memory). First, we replace the processing component to reverse the display (1). Then, we add a new component on the PC to choose the display color (2). We migrate the color component to the smart phone (3). We have to take care of the network links: the smart phone has to share information with a SunSPOT through the PC (gateway). Finally, we reconfigure the application to display on the smart phone too. We duplicate the output of the color component to
Fig. 6. Display application
send information to the display components on the SunSPOT and the smart phone (4). Table 2 summarizes the execution times of all the commands we used in this scenario. We can notice that most execution times are on the order of a millisecond, which is acceptable for such limited devices.
Table 2. Execution times of Kalimucho commands on a SunSpot sensor and a Nexus One mobile phone
Create a connector
Delete a connector
Disconnection or reconnection of an input Duplication of an output Read QoS of a container Read state of a container Read state of a device Migrate a component
Execution time in ms SunSpot 70 to 170 ms 20 ms minimum Depends on time to end the component Internal: 70 to 110 ms Distributed: 100 to 190 ms on device receiving the command, 30 to 120 ms on the other Internal: 60 to 80 ms Distributed: 100 to 260 ms on device receiving the command, 30 to 120 ms on the other 20 to 60 ms 20 to 80 ms 80 ms 70 to 80 ms 70 to 90 ms 90 to 230 ms
Android 450 to 750 ms (upload of a 2kB byte code) 15 ms. Depends on component activity Internal: 3 to 15 ms Distributed: 10 to 100 ms
Internal: 3 to 15 ms Distributed: 3 to 25 ms
2 to 7 ms 2 to 7 ms
650 to 750 ms (upload of a 2kB byte code)
7 Conclusion and Future Work
Pervasive computing is becoming a reality. Nowadays, people want to use applications anywhere with their mobile devices, and because of this mobility they want applications to be adapted to context changes. This brings new challenges compared to traditional applications. As noted in [16], applications should be context-aware because of the limited resources of the devices and the variability of the execution context. Most approaches deal with energy consumption by providing planning-based adaptation or contextual deployment. However, these approaches only consider CPU and energy consumption; none considers the network communication cost. We therefore propose a QoS model that guarantees the utility of an application and maximizes its lifetime. Durability is a fundamental notion with mobile devices, because a high-quality application is useless if it only runs for a short time. Utility measures the adequacy of the provided application to user needs and application specifications; durability measures the lifetime of the application according to resource and network consumption. We then propose Kalimucho, a contextual deployment platform. It implements the QoS model through a recursive heuristic. We tested Kalimucho on several platforms such as SunSPOT and Android. The execution times of Kalimucho commands, on the order of milliseconds, show that the response time is acceptable for limited devices. However, there is still an obvious limit to our approach. Although QoS can be adapted dynamically, it is based on static measures of the resource consumption of components. Moreover, the heuristic that selects a deployment only computes the QoS of the service where the reconfiguration event occurred, not of the whole application. When reconfiguring, we offer the possibility to get the best QoS for one service; the reconfiguration of one service does not imply the reconfiguration of another. But the modification of the application (its deployment) has consequences on the execution context, because it modifies the load of devices and the network traffic. A reconfiguration of one service may thus induce events that will trigger new reconfigurations. Future work focuses on the design and test of this configuration-choice heuristic. We must specifically work on the following questions:
− Does the platform have to manage priorities on events in order to react more quickly to some of them?
− When an event is handled, do we have to handle those still waiting or ignore them? A reconfiguration modifies the context, and consequently some events produced before it may be obsolete.
− The QoS model manages utility and durability. The relative importance of these two criteria depends on the application. For example, a video surveillance application needs to give priority to durability, whereas the museum visit application presented here gives priority to utility in order to produce good-quality information corresponding to users' demands.
Acknowledgment. This work was funded by the National Research Agency under the MOANO project.
References 1. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Computer Networks 38(4), 393–422 (2002) 2. Alia, M., Eide, V.S.W., Paspallis, N., Eliassen, F., Hallsteinsen, S.O., Papadopoulos, G.A.: A utility-based adaptivity model for mobile applications. In: AINA Workshops (2), pp. 556–563. IEEE Computer Society, Los Alamitos (2007) 3. Hamida, B., Le Mouel, F., Ahmed, B.: A graph-based approach for contextual service loading in pervasive environments. In: Meersman, R., Tari, Z. (eds.) DOA 2008, Part I. LNCS, vol. 5331, pp. 589–606. Springer, Heidelberg (2008) 4. Bouix, E., Roose, P., Dalmau, M.: The korrontea data modeling. In: Ambi-Sys 2008: Proceedings of the 1st International Conference on Ambient Media and Systems. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), pp. 1–10. ICST, Brussels (2008) 5. Chakeres, I.D., Belding-Royer, E.M.: AODV Routing Protocol Implementation Design. In: Proceedings of the International Workshop on Wireless Ad Hoc Networking, WWAN 2004, Tokyo, Japan. Springer, Heidelberg (March 2004) 6. Chen, T.W., Tsai, J.T., Gerla, Y.: Qos routing performance in multihop, multimedia, wireless networks. In: Proceedings of IEEE International Conference on Universal Personal Communications (ICUPC 1997), pp. 557–561 (1997) 7. David, P.C., Ledoux, T.: Wildcat: a generic framework for context-aware applications. In: Terzis, S., Donsez, D. (eds.) MPAC. ACM International Conference Proceeding Series, vol. 115, pp. 1–7. ACM, New York (2005) 8. Floch, J., Hallsteinsen, S., Stav, E., Eliassen, F., Lund, K., Gjørven, E.: Using architecture models for runtime adaptability. IEEE Software 23(2), 62–70 (2006) 9. Franken, L.J.N., Haverkort, B.R.: Quality of service management using generic modelling and monitoring techniques. Distributed Systems Engineering 4(1), 28–37 (1997) 10. Laplace, S., Dalmau, M., Roose, P.: Kalinahia: Considering quality of service to design and execute distributed multimedia applications. In: NOMS, pp. 951–954. IEEE, Los Alamitos (2008) 11. Lin, C.R., Liu, J.S.: Qos routing in ad hoc wireless networks. IEEE Journal On Selected Areas in Communications 17(8), 1426–1438 (1999) 12. Louberry, C., Roose, P., Dalmau, M.: QoS Based Design Process for Pervasive Computing Applications. In: ACM Mobility 2009, Nice, France (2009) 13. Rouvoy, R., Barone, P., Ding, Y., Eliassen, F., Hallsteinsen, S., Lorenzo, J., Mamelli, A., Scholz, U.: MUSIC: Middleware support for self-adaptation in ubiquitous and serviceoriented environments. In: Cheng, B.H.C., de Lemos, R., Giese, H., Inverardi, P., Magee, J. (eds.) Software Engineering for Self-Adaptive Systems. LNCS, vol. 5525, pp. 164–182. Springer, Heidelberg (2009) 14. Schilit, B.N., Theimer, M.M.: Disseminating active map information to mobile hosts. IEEE Network 8(5), 22–32 (1994) 15. Amundsen, S.L., Lund, K., Griwodz, C., Halvorsen, P.: Qos-aware mobile middleware for video streaming. In: EUROMICRO-SEAA, pp. 54–61. IEEE Computer Society, Los Alamitos (2005) 16. Zheng, D., Wang, J., Jia, Y., Han, W., Zou, P.: Deployment of context-aware componentbased applications based on middleware. In: Indulska, J., Ma, J., Yang, L.T., Ungerer, T., Cao, J. (eds.) UIC 2007. LNCS, vol. 4611, pp. 908–918. Springer, Heidelberg (2007)
Providing Context-Aware Adaptations Based on a Semantic Model Guido Söldner1, Rüdiger Kapitza1, and René Meier2 1
Friedrich–Alexander University Erlangen–Nuremberg {soeldner,rrkapitz}@cs.fau.de 2 Trinity College Dublin [email protected]
Abstract. Smartphones and tablet PCs are on the verge of revolutionizing the information society by offering high quality applications and almost permanent connectivity to the Internet in a mobile world. They naturally support new applications that take advantage of context information like location, time and other environmental conditions. However, developing these novel context-aware applications is challenging, as it is difficult to anticipate a priori their execution context and the adaptations that might be necessary to use new context information. This issue is reinforced by the semantic gap between the low-level technical realization of adaptation mechanisms and the demand to describe adaptations in abstract and comprehensible business terms. This paper presents programming support for context-aware adaptations based on a semantic model that builds on the AOCI framework. Using such a model, applications and adaptations can be described by means of easy-to-comprehend business terms. Thereby the model enables the AOCI framework to store and publish both context and domain-specific run-time information and provides a basis for high-level and tailored programming support. This makes it possible to transparently select adaptations based on various criteria and to integrate them into applications at run-time. At the level of adaptation mechanisms, our approach supports integration of permanent changes using Aspect-Oriented Programming and, more importantly, spontaneous and short-time integration of web services by means of interceptors.
1 Introduction and Goals Smartphones, netbooks and tablet PCs are revolutionizing the information society as we know it today. Due to their increasing computational power and almost permanent connectivity to the Internet they provide an ideal platform for supporting high quality applications in a mobile world. These applications need to act in a context-aware manner and therefore have to deal with temporal, spatial, personal or social information provided by the environment. This demands a change of paradigm in application development towards adapting behavior to adjust to the current situation [9]. This is hard to achieve as capturing all possible use cases during design-time is difficult.
Part of this work is funded by a research fellowship granted by the German Academic Exchange Service (DAAD).
As a result, developing adaptive applications remains an arduous task. Most existing approaches provide predefined adaptation mechanisms while adding further adaptation features at run-time is not possible. The reasons for these restrictions [6] are the tight coupling of the application business logic and the adaptation features. In consequence, most approaches omit or only have a limited separate management layer for reasoning and adapting which also impacts expressiveness and flexibility of application-specific adaptation features. In this paper, we propose programming support for context-aware adaptations that is built upon a semantic model. Our solution is integrated within our AspectOriented Component Infrastructure (AOCI) framework [20] that so far was limited to handle basic annotations using a semantic layer to make AOCI enhanced applications adaptable. We extend this basic support by explicit modelling of the context as well as application-specific domains inside this layer using ontologies of different granularity of abstraction. This allows us to store and to publish context information and provides access to the domain-specific run-time information of the applications. Adaptation developers can now use this information via a simple but yet powerful programming interface, which eases the definition of adaptations. To enable fast dynamic adaptation using external resources, we provide an interceptor-based dynamic adapter for the integration of web services. When executing interceptors, our middleware framework dynamically checks the current context and is able to select and invoke suitable services. This mechanism allows to replace methods at run-time. Hence, using this interceptorbased mechanism, current applications can easily be leveraged and adapted with other services. Our approach has several advantages compared to other adaptation frameworks. First, due to this work, using the extended semantic layer of AOCI significantly eases the development of adaptive applications. This is due to the fact, that the semantic layer provides access to all the context and domain-specific information using a comprehensive programming interface. Second, based on the semantic layer, the AOCI framework offers different adaptation mechanisms that allow a context-dependent transparent adaptation. These mechanisms can be controlled both by the programmer as well as through configuration options. We support the use of aspect-oriented programming as provided by AOCI for long-term adaptation and newly dynamic and short-term integration of web services on an interceptor-based mechanism. We outline two practical use case scenarios of our approach by extending a Personal Information Manager (PIM) application and show how our framework supports dynamic adaptation in a mobile environment. The paper is structured as follows: Section 2 discusses related work. Section 3 provides background information and introduces our semantic model. Section 4 describes our approach including a PIM proof-of-concept scenario. In Section 5 we outline the semantic layer and, Section 6 describes the adaptation layer. Section 7 provides an evaluation and finally we conclude this paper.
2 Related Work In mobile computing scenarios, applications often face the need for adaptation as a result of spontaneous changes in their context. However, providing appropriate
adaptation support is still challenging. In the following discussion, we focus on different approaches for dynamic adaptation in mobile environments. To enable or enhance transparent adaptation, software is often extended by semantic information which can be queried by a framework to reason about adaptation. Kiczales et al. [14] propose the use of annotations to define extension points, hence introducing a level of abstraction. However, these extension points are self-defined and do not comply with a semantic model. Consequently the level of abstraction remains low. The use of so called Crosscut Programming Interfaces (XPI) [22] is similar. XPIs represent explicit and abstract interfaces. Source code and adaptations can be separated by means of design rules. To fulfil this, there are pre-conditions for the source code and post-conditions for the execution of adaptations. XPIs allow to extract semantic information from the implementation, however an explicit meta model is not provided. Kellens et al. [13] introduce model-based pointcuts, which allow to transfer the matching of adapatations to a more conceptual level. As a semantic language for reasoning they extended the CARMA aspect language combined with the formalism of so called ”intensional views”. CARMA uses SOUL, a language akin to Prolog, to reason about the application of advice. However, compared to our framework, CARMA does neither support dynamic adaptation nor context-awareness. Research has also been done to provide a common representation format for semantic modelling. For this purpose, ontologies can be used. Both SOUPA [8] and SOCAM [12] provide an ontology-based two-layer hierarchical approach. One layer is used to describe general context information and another one for domain-specific ontologies to provide additional vocabularies for supporting specific types of services. To address domain-specific issues during design-time, feature models are widely used. Such feature models are compatible with ontologies as they can be used as subset of the later. Based on such semantic layers, frameworks can cope with context-awareness. The Context Broker Architecture [7] is based on SOUPA and provides a programming interface to access the context. However, in contrast to AOCI, the focus is mainly on a common semantic model and consequently automatic-adaptation is not supported. Gaia [17], Aura [21] and DoAmI [2] also support context-aware services and can cope with adaptation, but they neither support fine-grained nor transparent adaptation of services. Unanticipated dynamic adaptation of mobile applications is also addressed in the MUSIC [18] project, which recently addresses aspect-oriented adaptations. Using this framework, planning-based self-adaptation is possible. To accomplish this, the authors use a meta-model and a sophisticated reasoning engine, however this model is not ontology-based. Instead they concentrate on system-level issues like quality-of-service. In [23], the authors propose an aspect-based variability model for representing crossorganizational features. Based on a feature ontology, they can cope with the adaptation requirements at run-time and customize existing features to the client needs.
3 Background: The AOCI Framework This section introduces the AOCI framework that provides the basis for our semantic model, its programming support and the dynamic integration of web services. The framework enables dynamic adaptation of component-based applications in a
distributed environment, e.g., to account for context changes. In order to allow transparent adaptation, component developers have to introduce particular statements, so called ontology-based annotations, which adhere to an ontology-based model. This information indicates to the middleware, where adaptations can be applied and which kind of adaptation is suitable. Thus, our system supports a greybox component [5] approach, while preserving encapsulation and the concepts of a black box to a certain degree. Conceptually, the AOCI framework can be split into two layers: A semantic layer, which is responsible for context-aware selection of adaptations and a lowlevel adaptation layer, which performs the necessary steps to adapt the application accordingly to the selection. The ontology-based semantic layer enables to reason about adaptation. In general, the use of ontologies provides a means to gain a shared understanding of a set of entities, relations and functions. Ontologies are widely used in knowledge management and due to their expressiveness, the possibility for context representation and extensibility they are well suited to model application domains and context information. In AOCI, ontologies are used, first, to attach semantic information to both applications and adaptations, and second, to provide basic reasoning capabilities. The adaptation itself is performed by the adaptation layer. It takes all the necessary steps to adapt the application. The adaptation process uses the semantic information attached to the implementation in order to identify the code which needs to be adapted. So far AOCI uses aspect-oriented programming techniques and allows an invasive change of applications typically needed for adaptations with a long-term perspective. Our implementation specifically targets the adaptation of OSGi-based applications and therefore is implemented using Equinox [10]. As an adaptation mechanism it utilizes Equinox Aspects which supports basic aspect-oriented programming for OSGi.
4 Adapting Service Ecosystems Next, we motivate how to enable the adaptation of current applications with contextdependent web services in a dynamic fashion. We start by introducing our developer role model that eases the provision of context-aware adaptations. Then, we present use case scenarios for our approach. 4.1 Developer Role Model To ease writing adaptive applications, our approach provides a developer role model and distinguishes three different roles, which are depicted in Figure 1. Service developers attach additional semantic information to their source code. This information has to adhere to a semantic model. The semantic model itself is modelled by the model authors, which are experts in different domain of services. They build a set of entities which represent the structure of a particular domain and its activities. Adaptation developers use the model and its semantic API to write adaptations and determine in terms of the model where and when the adaptation takes place. 4.2 Use Case Scenarios In our scenario, a student is using a smartphone (e.g. MeeGo-based [19]) with a PIM application running on it. She can use this application to write emails, schedule
Fig. 1. Developer roles
appointments and manage contacts. Moreover, her university is providing a variety of web services with attached semantic information, among others, a web service to book rooms for learning groups. However, this web service can only be used if a) the student’s location is within the university campus and b) the student is member of a tutor group because there is only a small pool of rooms that can be booked on short notice. If these conditions are met, appointments can result in a booking of a room. The corresponding call to the appointment-method will be intercepted, and in case of a room booking, the method call will be transparently replaced by the invocation of the web service. The university also offers a translation web service along with a semantic description, which can be registered in the semantic layer. Using this translation-service, the application can leverage the PIM application with automatic translation capabilities when needed. These use cases can also be expressed more formally. For each use case, we first define the conditions in the context that have to be fulfilled, followed by the adaptation that should be used. – RoomBooking: Appointment.To = ’RoomX’ and UserGroup = ’Tutor’ → Invoke RoomBooking – Translate: Appointment.To.Language != Me.Language → Invoke Translation Service To realize these use cases with our framework, the different developers have to provide semantic information. The service developer attaches the PIM application with semantic annotations created by the model author, and the adaptation developer defines the web service to be invoked and sets the conditions, which have to be fulfilled to invoke the web service. Before adaptations can be applied they have to be deployed to a target system (e.g., the student’s smartphone) in form of conditions and associated web services references. This can for example be performed by accessing a well-known university website. However, other discovery approaches can be imagined. From a conceptual point of view, the two use cases are not anticipated during the design of an off-the-shelf PIM application. However, by means of our middleware, these use cases can be realized at run-time by using our platform. The use cases depict the main goals of our enhancements:
– To provide a high-level semantic model in order to describe applications, adaptation conditions and the adaptation itself. – A transparent context-dependent adaptation mechanism.
5 Semantic Layer In this section, we describe the architecture of our extended semantic layer in detail. First, we show how the semantic model is realized, followed by a description of how developers can use this model to attach semantic information. 5.1 Semantic Model As stated, adaptation usually happens due to changes in the context, hence, it is a key requirement to model the different aspects within the environment like location (e.g. position, orientation), time, identity (preferences, profile) or activities (e.g. sleeping, walking, etc.) or computing resources (device, network, bandwidth, etc.)
[Figure 2 (diagram omitted): the SOUPA-based upper ontology (AOCI Entity with classes such as CompEntity, Location, Application, Person, Device, Network and Time) extended by domain-specific lower ontologies (e.g., Work, Home, Campus, Colleagues); the legend distinguishes owl:Class, rdfs:subClassOf and owl:Property elements.]
Fig. 2. Upper and domain-specific ontology
For context modelling, we use the Standard Ontology for Ubiquitous and Pervasive Applications (SOUPA) [8]. SOUPA provides a core set of entities for the environment and is highly extensible. Our ontology is based on the Resource Description Framework (RDF) [15] and the Web Ontology Language (OWL) [3]. RDF was originally specified as a data model to describe meta data and is now used as a general method for conceptual description. OWL is based on RDF and represents a knowledge representation language for authoring and reasoning about ontologies. Furthermore, context ontologies are used to describe and define the conditions when adaptation can
occur. To describe the actual adaptations and applications, our approach uses domain ontologies. Domain models are integrated together with context models in the semantic model: The SOUPA ontology serves as upper ontology for the context and is extended with ontologies for domain-specific services as lower ontology (see Figure 2). Hence, we model the structure and its provided activities of different application domains by means of these lower-ontologies. Consequently, we introduce several OWL entities; we distinguish so called StructureNodes and ActivityNodes. To put these elements into relationships, our model provides generalization / specialization-, association- and aggregation constructs. Both the StructureNodes and the ActivityNodes span up a separate tree, which can be connected by means of an association if a structure provides a certain functionality. A subset of an ontology for the PIM is depicted in Figure 3. As stated, there are two different subtrees. The upper subtree describes the structure of the PIM domain including modules like email or calendar, the lower subtree models the different functionalities being applicable in that domain. For example, the message node supports sending functionality; consequently these two nodes are connected to each other. The sendMessage() text within the rectangle indicates that there is an instance of a reference to an adaptation stored within the ontology which adheres to a sending semantics.
Fig. 3. Graphical representation of a PIM ontology
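To make the two-tree structure described above more tangible, the sketch below models StructureNodes and ActivityNodes as plain Java classes with aggregation and association links. The class and field names are assumptions made for this example; the actual AOCI model is expressed in OWL rather than in Java.

import java.util.ArrayList;
import java.util.List;

// Structure of a domain (e.g. PIM, Email, Calendar).
class StructureNode {
    final String name;
    final List<StructureNode> children = new ArrayList<StructureNode>(); // aggregation
    final List<ActivityNode> provides = new ArrayList<ActivityNode>();   // association
    StructureNode(String name) { this.name = name; }
}

// Functionality applicable in the domain (e.g. Send), organized in its own tree.
class ActivityNode {
    final String name;
    final List<ActivityNode> specializations = new ArrayList<ActivityNode>();
    ActivityNode(String name) { this.name = name; }
}

class PimOntologySketch {
    public static void main(String[] args) {
        StructureNode pim = new StructureNode("PIM");
        StructureNode message = new StructureNode("Message");
        ActivityNode send = new ActivityNode("Send");
        pim.children.add(message);
        message.provides.add(send); // "the message node supports sending functionality"
        System.out.println(message.name + " provides " + send.name);
    }
}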
5.2 Semantic Programming To implement context-aware applications, we provide an API for managing semantic information. Application developers use annotations to attach semantics to their implementation. Figure 4 shows how to do this according to our second use case. As the annotations adhere to the semantic model, the information can be extracted at run-time by means of our API, which extracts values based on the ontology and the Java reflection API. This API is used by the dynamic adapter of the adaptation layer, but can also be used in aspect-oriented adaptation code. Figure 5 depicts how to use this API. In our API, the AOCIContext is used as an entry point to the ontology and to access the actual execution context, in this case, the sendAppointment()-operation. The
@AOCI.PIM.Appointment
public class Appointment {
    @AOCI.PIM.Calendar.Appointment.To
    User toUser;

    @AOCI.SendActivity.SendMessage
    public void sendAppointment() { ... }
}
Fig. 4. Appointment class annotated with PIM annotations
Appointment-class further contains a property referring to the recipient. The first statement extracts the current location from the upper context ontology. The second statement uses the domain ontology to navigate to the recipient user by selecting the corresponding class property and extracts his language.
String loc = AOCIContext.Location.toString();
String curLang = AOCIContext.currentNode.getStructureNode()
                 .getProperty("").getLanguage().toString();
Fig. 5. Querying semantic information at run-time
Finally, adaptations can be enabled and disabled at run-time by means of policy files. They can reference concepts within the ontology and define values in the context or the domain, which have to be met to enable or disable the adaptation. This provides flexibility for controlling adaptations as the adaptation process can be defined by users or administrators.
<advice>
  <jp>Activity.Parents='PIM.Calendar.Appointment'</jp>
  <condition>PIM.Calendar.Appointment.To.Language != User.Language</condition>
  <condition>Location='University'</condition>
</advice>
Fig. 6. Context condition for the appointment adaptation
A sample scenario is depicted in Figure 6. First, it defines to use adaptations in classes that are attached with appointments semantics. The next lines specify that the adaptation is only enabled if sender and recipient have different languages and the application is used within the university. If all conditions are fulfilled, the framework will query the ontology for suitable adaptations.
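A minimal sketch of how such a policy could be evaluated at run-time is given below. It assumes a hypothetical policy object holding simple key/value context conditions; the real AOCI reasoning is performed over the OWL ontology and is considerably richer.

// Hypothetical, simplified policy evaluation: enable an adaptation only if
// every context condition of the policy holds in the current context.
class PolicyCondition {
    final String key;      // e.g. "Location"
    final String expected; // e.g. "University"
    PolicyCondition(String key, String expected) { this.key = key; this.expected = expected; }
}

class PolicyEvaluator {
    boolean adaptationEnabled(java.util.List<PolicyCondition> conditions,
                              java.util.Map<String, String> context) {
        for (PolicyCondition c : conditions) {
            if (!c.expected.equals(context.get(c.key))) {
                return false;  // one unmet condition disables the adaptation
            }
        }
        return true;           // all conditions met: query the ontology for adaptations
    }
}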
6 Adaptation Layer Based on the outcome of the semantic reasoning, the adaptation layer transparently chooses the adaptation method and takes the appropriate steps to perform the adaptation. In this section, we initially discuss the differences between the two supported adaptation techniques from a conceptual point of view, next we detail the adaptation process using our interceptor-based approach enabling dynamic integration and invocation of web services. 6.1 Comparison of Adaptation Techniques We support adaptation via aspect-oriented programming and enhanced our framework with interceptors. While the former can be understood as a generalization of the latter, we used different implementations for AOP and interceptors and due to its properties we combined the interceptors-based solution with the integration of web services. The aspect-oriented programming support is based on the Equinox Aspect framework which features load-time adaptation of components. Adaptation code can be loaded from remote peers. While this approach provides great flexibility, it might require integration of code from an untrusted provider. Furthermore, many adaptations are locationdependent and therefore often of short-term nature, hence from an implementation point of view using load-time weaving seems unreasonable as modified application parts need not only to be reloaded for the integration of an adaptation but also for the removal of an adaptation. Of course, one could deactivate adaptations that are no longer needed but this would lead to an ever-growing number of unused adaptations thereby wasting resources. The code size of adaptations is also an important factor as the code has to be loaded via the network which can take considerable time if an adaptation provides a complex functionality that is attached with a large state (e.g., a vocabulary file in case of the translation service). Therefore, we enhance our framework with method-level interceptors, representing an approach to easily integrate adaptations into current applications. Such interceptors can be executed before or after or even instead of a selected method. We chose to integrate generic interceptors at application startup as described below which avoids reloading or restarting actions at runtime and supports the dynamic integration and invocation of web services. Thus, a more spontaneous adaptation using services is permitted that has lower security risks compared to the aforementioned approach as loading of remote code is avoided. Depending on the context and the required adaptation, our framework dynamically selects the appropriate mechanism at run-time if it is available. Table 1 summarizes the main differences of the two approaches. 6.2 Adaptation Process Our interceptor-based adaptation is based on a dynamic adapter. Our middleware uses the ASM [4] framework to automatically create hooks at the beginning (respectively at the end) of annotated method calls. ASM is an all-purpose Java bytecode manipulation and analysis framework used to modify existing classes or dynamically generate classes, directly in binary form. Within these hooks, the semantic layer is queried to reason
Table 1. Comparison of adaptation approaches

                          Aspect-weaving          Interceptors
Application               Load-time               Run-time
Spontaneous interaction   Restart required        Ad-hoc
Performance               Very good               Medium
Granularity               Class, method, field    Method
Distributed adaptation    Yes                     No
Availability              Permanent               Only in context
Security risks            Local, network          Network
Coupling                  Tight                   Loose
about possible adaptations. In case of an adaptation, the method can be intercepted or redirected to a web service. Internally, the process consists of two subsequent phases, the transformation phase and the invocation phase. Transformation Phase. In the transformation phase, hooks are integrated within the application to enable the use of interceptors. This phase happens before loading classes into the Java virtual machine. The framework uses the hook-mechanism of the Equinox framework to intercept the classloading. The hook-mechanism allows to register a callback function which is responsible for adapting the classes. The full process is described in Figure 7. First, the processClass-method of the AOCIExtensionHook is called. Within the body, the configuration for the bundle where the class resides is requested. Upon receipt of the information, a TransformerService is requested. This service is used to transform the actual class. We use ASM to inject a call to a dynamic adapter function. Within this function, the current context is evaluated, conditions are checked and web services are invoked dynamically. The transformed class is then stored along with the original class in the AOCIBundleRegistry. The dynamic adapter supports three strategies: pre-invoke, post-invoke and replacement. Pre-invoke is usually used before the method execution with the goal to change the values of the input parameters, hence changing the semantics of the method or in order to validate the ingoing parameters. Post-invoke is executed after the annotated method and can change the result of the function. The replacement directive provides the possibility to skip the execution of the method. This allows for a transparent way to interchange the behavior of services. Invocation Phase. The invocation phase is performed at run-time and represents the execution of the interceptors. In the first phase of the invocation phase, context conditions have to be evaluated, and based on the result, a decision whether to perform adaptations or not is taken. If the conditions are met, the dynamic adapter reasons by means of the OntologyService. This service encapsulates the ontology and provides reasoning capabilities to determine what kind of services should be integrated. The ontology provides a model for describing the structure and behavior of services and applications. This information is used at run-time to dynamically invoke web services. The ontology also contains references to these web services. For the dynamic invocation
Fig. 7. Transformation phase
of the web services we use the DAIOS [16] framework. Compared to other dynamic invocation frameworks, DAIOS is easy to use, performs well and supports a wide variety of web service techniques. DAIOS uses a generic technique based on similarities of input parameters for the mapping between WSDL and DAIOS types and supports both simple data types as well as complex data types. The next step is to determine the calling semantics of the web service communication. We support two different styles: blocking and non-blocking communication. The standard way to invoke web services is synchronous blocking. However, as web services are often deployed within a Service Oriented Architecture (SOA), some services may take some time to process a request. Consequently, the standard request/response style is not suitable for such long-running method invocations. Therefore we provide the support for asynchronous (non-blocking) communication. In the last step, the actual dynamic invocation is being performed. DAIOS supports the core SOA principles: It uses dynamic services invocation, hence no static component like client-side stubs is needed, and instead arbitrary web services can be invoked using a single interface with generic data structures. Furthermore, our invocation engine is protocol-independent and supports a message-driven approach, hence leading to looser coupling of services; both SOAP-based as well REST-based [11] services are supported. The process is shown in Figure 8: First, within the application’s dynamic adapter, a reference to the InvocationService is fetched. Second, the invoke method is called, which in turn checks the conditions, fetches a suitable web service reference and makes a dynamic call to the web service.
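The following sketch illustrates the kind of dispatch a dynamic adapter hook could perform at run-time, with the three interception styles described above. All class and method names are assumptions made for this example; in particular, the actual web-service call is hidden behind a generic invoker interface rather than the DAIOS API.

// Hypothetical run-time dispatch executed inside an injected hook.
enum Strategy { PRE_INVOKE, POST_INVOKE, REPLACE }

interface ServiceInvoker {
    Object invoke(String serviceRef, Object[] args); // placeholder for dynamic WS invocation
}

class DynamicAdapter {
    private final ServiceInvoker invoker;
    DynamicAdapter(ServiceInvoker invoker) { this.invoker = invoker; }

    // Returns null when the original method body should still run, or the
    // replacement result when the REPLACE strategy applies.
    Object onMethodEntry(String serviceRef, Strategy strategy,
                         boolean conditionsMet, Object[] args) {
        if (!conditionsMet) {
            return null;                              // context conditions not met
        }
        switch (strategy) {
            case PRE_INVOKE:
                invoker.invoke(serviceRef, args);     // e.g. validate or rewrite parameters
                return null;
            case REPLACE:
                return invoker.invoke(serviceRef, args); // skip the original method
            default:
                return null;                          // POST_INVOKE is handled on method exit
        }
    }
}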
Fig. 8. Invocation phase
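For readers unfamiliar with bytecode-level hooks, the fragment below sketches how a call to such a dynamic adapter could be injected with ASM during the transformation phase of Section 6.2. It is a simplified illustration under assumed names (DynamicAdapter and its preInvoke method); the real AOCI transformer only instruments methods carrying the appropriate annotations and passes richer context information.

import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Injects "DynamicAdapter.preInvoke(<methodName>)" at the start of every visited method.
class HookInjectingVisitor extends ClassVisitor {
    HookInjectingVisitor(ClassVisitor next) {
        super(Opcodes.ASM4, next);
    }

    @Override
    public MethodVisitor visitMethod(int access, final String name, String desc,
                                     String signature, String[] exceptions) {
        MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
        return new MethodVisitor(Opcodes.ASM4, mv) {
            @Override
            public void visitCode() {
                super.visitCode();
                visitLdcInsn(name);                   // push the method name
                visitMethodInsn(Opcodes.INVOKESTATIC, // call the (assumed) adapter
                        "aoci/DynamicAdapter", "preInvoke",
                        "(Ljava/lang/String;)V");
            }
        };
    }
}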
7 Evaluation For the evaluation, we conducted experiments to adapt our application in a short-term style by means of an interceptor-based invocation of a web service and compared the results to our previous aspect-oriented adaptation support. We attached the appointment module of the Nomad PIM [1] (an OSGi-based open-source PIM) with semantic information as described in Section 5 and provided an adaptation to translate the text body of emails by means of a translation web service. For the interceptor-based variant the evaluation includes the transformation during the start of the PIM application, the semantic meta data extraction at run-time and the dynamic invocation. Additionally, we measured the aspect-oriented weaving that is performed during class loading, requiring a bundle restart of the affected OSGi components. All experiments were performed 50 times for each type of measure and the web service was co-located with the application on the same physical machine.

Table 2. Benchmark results for interceptors

Type of measure                                     Min. [ms]  Max. [ms]  Avg. [ms]
Interceptor integration during application startup    987        1731       1403
Semantic meta data extraction at run-time               0.49        1.28       0.67
Ontological reasoning at run-time                       3.7         4.6        4.1
Dynamic invocation at run-time                          7.21        7.87       7.39
Aspect-oriented adaptation                           1803        2006       2357
Table 2 shows that our interceptor-based approach executes with low overhead; however, due to the transformation phase, the startup of the application takes 1400 ms longer. Subsequent reasoning and invocations only need a few milliseconds. Aspect-oriented adaptation, in contrast, needs a restart of the affected bundles if adaptations are
integrated or removed. In both cases this results in an application downtime of 2000 ms at run-time and bears significant startup costs. Compared to the interceptor approach, subsequent invocations do not need additional reasoning, hence the approach provides low invocation costs. In a second step, we measured the different variants of how dynamic invocation of web services can be applied. We evaluated the pre-invoke, the post-invoke and the replace style. Furthermore, we conducted measures for asynchronous and synchronous calls (see Table 3). The replace strategy performs slightly better than the other styles as fewer reflection-based method calls are needed. Furthermore, due to the described rule-based adaptation selection process of the scenario, the ontological reasoning time is quite low.

Table 3. Benchmark results for dynamic invocation

              Average asynchronous [ms]  Average synchronous [ms]
Pre-invoke               7.21                      7.39
Post-invoke              7.65                      8.12
Replace                  6.31                      6.88
In summary, an interceptor-based approach can support various scenarios where an external and code-wise unknown service needs to be integrated on demand. Furthermore, the approach is faster with respect to the integration of adaptations than our previous solution using AOP, because the components affected by the adaptations do not need to be restarted.
8 Conclusion To cope with the requirements of dynamic applications, the ability to adapt due to changes in the context is essential. Existing approaches for adaptive software lack a semantic model to define and perform adaptations in a business-oriented way. In this paper we filled this gap by integrating context and domain modelling in one combined ontology. Based on this model, our application can ease the questions of “when”, “what” and “how” to adapt. In combination with a flexible API to access the semantic model and the support for dynamic and possibly short-time integration of web services we provide a powerful tool set for dynamic adaptation of applications in a mobile world.
References 1. Nomad PIM, http://nomadpim.sourceforge.net/ 2. Anastasopoulos, M., Klus, H., Koch, J., Niebuhr, D., Werkman, E.: DoAmI-a middleware platform facilitating (re-) configuration in ubiquitous systems. In: System Support for Ubiquitous Computing Workshop, At the 8th Annual Conference on Ubiquitous Computing (Ubicomp 2006) (2006) 3. Bechhofer, S., Van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D., Patel-Schneider, P., Stein, L., et al.: OWL web ontology language reference. W3C recommendation 10, 2006–01 (2004)
4. Bruneton, E., Lenglet, R., Coupaye, T.: ASM: a code manipulation tool to implement adaptable systems. Adaptable and extensible component systems (2002) 5. B¨uchi, M., Weck, W.: A plea for Grey-Box components. Tech. Rep. 122, Turku Centre for Computer Science (1997) 6. Charfi, A., Dinkelaker, T., Mezini, M.: A plug-in architecture for self-adaptive web service compositions. In: IEEE International Conference on Web Services, ICWS 2009, pp. 35–42. IEEE, Los Alamitos (2009) 7. Chen, H., Finin, T., Joshi, A.: Semantic Web in the Context Broker Architecture. In: Proceedings of the Second IEEE International Conference on Pervasive Computing and Communications (PerCom 2004), p. 277. IEEE, Los Alamitos (2004) 8. Chen, H., Perich, F., Finin, T., Joshi, A.: SOUPA: Standard ontology for ubiquitous and pervasive applications. In: Mobile and Ubiquitous Systems: Networking and Services, pp. 258–267. IEEE, Los Alamitos (2004) 9. Coutaz, J., Crowley, J., Dobson, S., Garlan, D.: Context is key. Communications of the ACM 48(3), 49–53 (2005) 10. Eclipse Foundation: Equinox OSGi framework (2008), http://www.eclipse.org/equinox 11. Fielding, R.: Architectural styles and the design of network-based software architectures. Ph.D. thesis (2000) 12. Gu, T., Pung, H., Zhang, D.: A service-oriented middleware for building context-aware services. Journal of Network and Computer Applications 28(1), 1–18 (2005) 13. Kellens, A., Mens, K., Brichau, J., Gybels, K.: Managing the evolution of aspect-oriented software with model-based pointcuts. In: Hu, Q. (ed.) ECOOP 2006. LNCS, vol. 4067, pp. 501–525. Springer, Heidelberg (2006) 14. Kiczales, G., Mezini, M.: Separation of concerns with procedures, annotations, advice and pointcuts. In: Gao, X.-X. (ed.) ECOOP 2005. LNCS, vol. 3586, pp. 195–213. Springer, Heidelberg (2005) 15. Klyne, G., Carroll, J., McBride, B.: Resource description framework (RDF): Concepts and abstract syntax. Changes (2004) 16. Leitner, P., Rosenberg, F., Dustdar, S.: Daios: Efficient Dynamic Web Service Invocation. IEEE Internet Computing 13(3), 72–80 (2009) 17. Rom´an, M., Hess, C., Cerqueira, R., Ranganathan, A., Campbell, R., Nahrstedt, K.: Gaia: a middleware platform for active spaces. ACM SIGMOBILE Mobile Computing and Communications Review 6(4), 65–67 (2002) 18. Rouvoy, R., Eliassen, F., Beauvois, M.: Dynamic planning and weaving of dependability concerns for self-adaptive ubiquitous services. In: Proceedings of the 2009 ACM Symposium on Applied Computing, pp. 1021–1028. ACM, New York (2009) 19. Schroeder, S.: Introduction to MeeGo. IEEE Pervasive Computing 9(4), 4–7 (2010) 20. S¨oldner, G., Schober, S., Schr¨oder-Preikschat, W., Kapitza, R.: AOCI: Weaving Components in a Distributed Environment. In: Chung, S. (ed.) OTM 2008, Part I. LNCS, vol. 5331, pp. 535–552. Springer, Heidelberg (2008) 21. Sousa, J., Garlan, D., et al.: Aura: an architectural framework for user mobility in ubiquitous computing environments. In: Proceedings of the 3rd Working IEEE/IFIP Conference on Software Architecture, pp. 29–43 (2002) 22. Sullivan, K., Griswold, W., Song, Y., Cai, Y., Shonle, M., Tewari, N., Rajan, H.: Information hiding interfaces for aspect-oriented design. ACM SIGSOFT Software Engineering Notes 30(5), 166–175 (2005) 23. Walraven, S., Lagaisse, B., Truyen, E., Joosen, W.: Aspect-based variability model for crossorganizational features in service networks (status: published)
Towards QoC-Aware Location-Based Services Sophie Chabridon, Cao-Cuong Ngo, Zied Abid, Denis Conan, Chantal Taconet, and Alain Ozanne Institut TELECOM, TELECOM SudParis, CNRS UMR Samovar, 9 rue Charles Fourier, 91011 Évry cedex, France [email protected]
Abstract. As location-based services on mobile devices are entering more and more everyday life, we are concerned in this paper with finding ways to master the level of quality of location information in order to take relevant decisions. Location being a typical example of context information, we manipulate it using the COSMOS framework that we develop for the management of context data and their associated quality meta-data or quality of context (QoC). We consider several QoC parameters that are important for location and determine how the QoC can help a location aggregator component to identify the current region where a user is located. The mechanisms we propose support a pragmatic approach in which application designers or deployers survey an area to demarcate regions surrounding locations, and application users are localized into these regions and are presented with the quality of the estimate. We report on the experimentation we performed on the campus of our institute collecting information from Wi-Fi, 3G networks and GPS signals, and show the accuracy we obtain at no additional infrastructure cost. Keywords: Context, quality of context, location, uncertainty.
1 Introduction
Even though context information has long been identified as a corner stone for mobile, ubiquitous or pervasive applications [6,5], only a few systems pay attention to the Quality of the Context information (QoC). Location is an example of context information that we propose to manipulate using the COSMOS context management framework that we develop (http://picolibre.int-evry.fr/projects/cosmos). COSMOS makes it possible to take into account the QoC associated with context data and to integrate it in the decision process. We show in this paper how introducing the QoC in the inference process can help a location aggregator component to derive the most accurate symbolic location with respect to the real user position from a set of input location information originating from different sources. Nowadays, off-the-shelf devices commonly offer GPS reception in addition to Wi-Fi and 3G cellular network communication. This naturally leads to the idea
of an abstract location interface [10] to support deriving locations from different positioning technologies. The ability to take multiple position sources into account also provides the means to remove any frontier between outdoor and indoor positioning, building the location information from the currently available sensed data. The approach we follow is depicted in Figure 1. Let us take the scenario of the preparation of a geo-localized game and of a game session. Game designers survey the area where the game is going to take place. They use a location survey tool installed on their mobile phone that senses the position with a GPS sensor, a Wi-Fi sensor and a GSM sensor. At some positions, they take fingerprints of the sensors and tag the positions with meaningful names so that the positions are marked as locations, that is, distinguished positions. Periodically, the tool collects the fingerprints of position sensors of the same types as the ones used during the location survey. These context data are intersected with the location data of the game to obtain estimated locations, one per position sensor type. In our work, location data are complemented with QoC data and the estimated locations can be aggregated to choose the best QoC-based location, which is itself complemented with region data in order to be graphically displayed on the map of the game application.
Fig. 1. Location model
The organization of this paper is the following: Section 2 describes the role of QoC in the inference process. We show in Section 3 some evaluation results obtained with a prototype we developed. In Section 4, we discuss related work and Section 5 concludes this paper.
2 QoC-Aware Location
The multiplicity of positioning sources calls for the need to determine the quality of the derived location information. The location aggregator of Figure 1 relies on three QoC criteria, which are accuracy, freshness and trustworthiness; our COSMOS context management framework makes it easy to extend this set of criteria if necessary. Positional accuracy, or accuracy, represents the degree to which the reported location matches the true location in the real world. It can be
determined statistically from a set of experiments comparing the estimated location and the real one. As location is a very dynamic notion, an evaluation of its freshness (or up-to-dateness) appears essential. We compute the freshness as a function of the age of a position measure, represented by the time elapsed since the measure was taken, and of its lifetime [7]: freshness = 1 − (tc − tm)/lt, where tc is the current time, tm is the measurement time and lt corresponds to the lifetime. Trustworthiness has been considered in several works [2,12] as a QoC criterion allowing the context sources to be rated, indicating how much trust can be put in the input data. With regard to positioning, we define the trustworthiness as the probability that the derived location information matches the real location of the user. Its computation depends on the relevancy of the available information, linked to the source of the information, and also on the technology used. We compute the trustworthiness T_WiFi of a Wi-Fi-based location as a function of Or and Ss, where Or is the overlapping ratio of the received signals with respect to the survey phase and Ss is the total difference of the signal strengths. For GSM signals, our experiments show a high instability in the strengths of the received signals. We therefore derive the trustworthiness of a GSM-based location as T_GSM = Or. The trustworthiness T_GPS of a GPS-based location is defined as a function of Dr, drp and Dmax, where Dr is the diameter of a predefined region, as registered during the survey phase, drp is the distance between a given predefined region and the current position, and Dmax is the maximal diameter that we consider for a region. Based on the location information and its quality meta-data provided by the various intersecters, a Location Aggregator component performs a fusion process driven by the knowledge of the quality of the location information. Locations are sorted according to their trustworthiness, freshness and accuracy, in this order. The location that is ranked first is chosen. When several intersecters provide the same location as a result, the QoC criteria of the aggregated region are computed as the maximum values of the input QoC.
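As an illustration of this fusion step, the sketch below ranks candidate locations by the lexicographic order (trustworthiness, then freshness, then accuracy) described above. The Candidate class and its fields are assumptions made for this example and do not correspond to the actual COSMOS components; merging identical locations with the maximum QoC values would be performed before this ranking.

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class Candidate {
    String location;          // symbolic location, e.g. a tagged region name
    double trustworthiness;   // technology-specific trust estimate
    double freshness;         // 1 - (tc - tm)/lt, clamped to [0, 1]
    double accuracy;          // statistically determined hit rate

    Candidate(String loc, double t, double f, double a) {
        location = loc; trustworthiness = t; freshness = f; accuracy = a;
    }
}

class LocationAggregatorSketch {
    // Pick the candidate ranked first by trustworthiness, then freshness, then accuracy.
    static Candidate aggregate(List<Candidate> candidates) {
        Collections.sort(candidates, new Comparator<Candidate>() {
            public int compare(Candidate x, Candidate y) {
                int c = Double.compare(y.trustworthiness, x.trustworthiness);
                if (c == 0) c = Double.compare(y.freshness, x.freshness);
                if (c == 0) c = Double.compare(y.accuracy, x.accuracy);
                return c;
            }
        });
        return candidates.get(0); // assumes at least one intersecter produced a location
    }
}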
3 Evaluation Results
We have conducted a series of experiments on the campus of our institute using a prototype we developed based on the COSMOS process-oriented context manager [4], which supports QoC processing [1]. This experimentation involves two phases. The first phase is the location survey, during which we register the Wi-Fi, GSM and GPS signatures of several locations that we tag with a symbolic name. The second phase consists in testing the behavior of the location detection application by going to a registered location and obtaining the location derived by the system. For each kind of positioning technology, we have performed some specific measures to determine the most appropriate calibration of the location detection application. For the Wi-Fi Intersecter, as shown on Figure 2, we have determined that an overlapping ratio of 50% reached the best accuracy. We see on Figure 3 that a signal strength threshold of −90 dBm gave the best accuracy; it represents the threshold for the strength of the received signal below which a signal is ignored. With this parameter setting, we obtained an accuracy of 0.72, meaning
[Figures 2-5 (accuracy plots omitted): Fig. 2, accuracy vs. Wi-Fi overlapping ratio; Fig. 3, accuracy vs. minimum Wi-Fi signal strength (dB); Fig. 4, accuracy vs. average distance between Wi-Fi regions (m); Fig. 5, accuracy vs. GPS radius (m).]
that the location detection application indicates the correct location in 72% of the experiments. In Figure 4, we analyze the impact of the actual distance between the tested locations. As can be expected, the sensed information must be different enough for the system to be able to distinguish between regions. When regions are separated by more than 20 m, 30 m or 40 m, we obtained the correct location in 78%, 87% or 89% of the experiments, respectively. For the GPS Intersecter, a default value of 10 m has been chosen for the maximum radius of a GPS region as shown on Figure 5, reaching an accuracy of 30%. The best accuracy of 70% was obtained with a radius of 40 m, but this does not sufficiently differentiate the different places at the scale of the campus.
4 Related Work
We review in this section some related work on positioning middleware that consider the quality dimension of location and also consider more general approaches for dealing with the uncertainty of context information.
Middlewhere [11] relies on three metrics for determining the quality of location information: resolution, confidence and freshness. It also proposes an uncertainty model based on a predicate representation of contexts. However, the resulting quality of location information is not exposed to the applications and the models cannot easily be extended by application developers. [8] makes use of the accuracy as given by the context sources and expressed by a distance and the freshness of the measure. However, this work does not consider additional quality aspects such as trustworthiness as we propose. Nexus [9] considers three quality aspects through degradation, consistency and trust. This model is very powerful but requires applications to specify probabilities in order to perform position queries. We propose a more user-friendly solution where the framework informs the user of the obtained context quality rather than requiring the user to restrict the research domain. The LOC8 framework [13] is a recent effort to provide application developers with easy access to location information. It defines a quality matrix consisting of granularity, frequency, coverage and a list of accuracy and precision pairs. LOC8 also relies on a sensor fusion method, with a default implementation based on fuzzy logic integrating the confidence on location data. While our work results from a similar effort to manipulate different sensor data and to expose the knowledge of its quality, we promote a fusion process that considers a larger set of quality criteria, and not only confidence. Exploiting meta-data expressing the quality of context information can help to deal with its inherent uncertainty and to resolve the potential inconsistencies that can result from it. For instance, the fusion process we propose for location aggregation can benefit to the context-correlation model of [3]. [14] considers uncertainty regions around the position of mobile objects. Introducing the freshness quality criterion we propose in the proximity evaluation algorithms of [14] would help to filter out old context information.
5 Conclusion
This paper presents our approach for building QoC-aware location-based services. As it is central to the development of a large number of mobile distributed applications, we consider that location information requires specific care to deal with its inherent uncertainty and that applications need to have the knowledge of this uncertainty level. We identify accuracy, freshness and trustworthiness, as being the quality criteria that are particularly relevant for location information, but this list can be extended if required as additional QoC parameters are provided by our COSMOS context management framework. During an experimentation on our campus, our location detection application provided the correct location in 72% of the experiments. This should be compared to a probability of 33.3% for choosing one location out of the three locations provided by the intersecters according to a uniform law, as would be the case without considering QoC in the location aggregation process. This result can even be improved when regions are separated by more than 30m reaching an accuracy of 87%. More experiments are planned in making the prototype applications available in a social network
on the campus so that students can register their own measures and comment on the obtained results. Acknowledgements. This work was partially funded by the French FUI (Unique Interministerial Fund), under the CAPPUCINO project, by the Inter Carnot-Fraunhofer Program TOTEM project, and by a Télécom SudParis Fellowship.
References 1. Abid, Z., Chabridon, S., Conan, D.: A Framework for Quality of Context Management. In: Rothermel, K., Fritsch, D., Blochinger, W., D¨ urr, F. (eds.) QuaCon 2009. LNCS, vol. 5786, pp. 120–131. Springer, Heidelberg (2009) 2. Buchholz, T., Kupper, A., Schiers, M.: Quality of Context Information: What it is and Why we Need it. In: 10th Int. Workshop of the HP OpenView University Association (HPOVUA), Geneva, Switzerland (July 2003) 3. Chen, C., Ye, C., Jacobsen, H.-A.: Hybrid Context Inconsistency Resolution for Context-Aware Services. In: Proc. 9th IEEE Conf. on Pervasive Computing and Communications (PerCom 2011), pp. 10–19 (March 2011) 4. Conan, D., Rouvoy, R., Seinturier, L.: Scalable Processing of Context Information with COSMOS. In: Indulska, J., Raymond, K. (eds.) DAIS 2007. LNCS, vol. 4531, pp. 210–224. Springer, Heidelberg (2007) 5. Coutaz, J., Crowley, J.L., Dobson, S., Garlan, D.: Context is key. Communications of the ACM 48(3), 53 (2005) 6. Dey, A.K., Abowd, G.D.: Towards a Better Understanding of Context and Contextawareness. In: CHI 2000, Workshop on the What, Who, Where, When, and How of Context-awareness, pp. 304–307 (2000) 7. Manzoor, A., Truong, H.L.: On the Evaluation of Quality of Context. In: Roggen, D., Lombriser, C., Tr¨ oster, G., Kortuem, G., Havinga, P. (eds.) EuroSSC 2008. LNCS, vol. 5279, pp. 140–153. Springer, Heidelberg (2008) 8. Myllymaki, J., Edlund, S.: Location Aggregation from Multiple Sources. In: Proc. 3rd Int. Conf. on Mobile Data Management (MDM 2002), Singapore, January 8-11, pp. 131–138 (2002) 9. Nexus Team. Reference Model for the Quality of Context Information. Stuttgart Universit¨ at (February 2010) 10. Oppermann, L.: Facilitating the Development of Location-Based Experiences. PhD thesis, The University of Nottingham, UK (April 2009) 11. Ranganathan, A., Al-Muhtadi, J., Chetan, S.K., Campbell, R., Mickunas, M.D.: MiddleWhere: A Middleware for Location Awareness in Ubiquitous Computing Applications. In: Jacobsen, H.-A. (ed.) Middleware 2004. LNCS, vol. 3231, pp. 397–416. Springer, Heidelberg (2004) 12. Sheikh, K., Wegdam, M., van Sinderen, M.: Middleware Support for Quality of Context in Pervasive Context-Aware Systems. In: Proc. 5th IEEE Int. Conf. on Pervasive Computing and Communications, Workshops, PerCom Workshops 2007, March 19-23, pp. 461–466 (2007) 13. Stevenson, G., Ye, J., Dobson, S., Nixon, P.: LOC8: A Location Model and Extensible Framework for Programming with Location. IEEE Pervasive Computing 9, 28–37 (2010) 14. Xu, Z., Jacobsen, H.-A.: Evaluating Proximity Relations Under Uncertainty. In: Proc. IEEE 23rd Int. Conf. on Data Engineering (ICDE), pp. 876 –885 (2007)
Session-Based Role Programming for the Design of Advanced Telephony Applications Gilles Vanwormhoudt1,2 and Areski Flissi2 1
Institut TELECOM LIFL/CNRS - University of Lille 1 (UMR 8022) 59655 Villeneuve d’Ascq cedex - France {Gilles.Vanwormhoudt,Areski.Flissi}@lifl.fr
2
Abstract. Stimulated by new protocols like SIP, telephony applications are rapidly evolving to offer and combine a variety of communication forms including presence status, instant messaging and videoconferencing. This situation significantly changes and complicates the programming of telephony applications, which now consist of distributed entities involved in multiple heterogeneous, stateful and long-running interactions. This paper proposes an approach to support the development of SIP-based telephony applications based on a general-purpose programming language. Our approach combines the concepts of Actor, Session and Role. A Role is the part an actor takes in a session, and we consider a session as a collaboration between roles. By using these concepts, we are able to break down the complexity of programming SIP entities and provide flexibility for defining new ones. Our approach is implemented as a coding framework on top of JAIN-SIP.
1 Introduction
In recent years, telephony services have undergone significant changes by integrating a variety of communication forms including video, text and presence while managing aspects like mobility, security, etc. This evolution has created a need for telephony applications to involve an increasingly wide range of distributed entities with capacities for participating in multiple, heterogeneous, stateful and long-running interactions. This requirement, compounded with the intricacies of the underlying communications, makes the programming of new telephony applications a daunting task. By supporting a rich range of communication forms, the 'Session Initiation Protocol' (SIP) has contributed a lot to this evolution, and many advanced telephony applications are now SIP-based. For programming the entities involved in these applications, two main categories of approaches have been proposed over the years. In the first category, we find domain-specific languages (DSL) like LESS [12], SPL [8], ECharts [11] and StratoSIP [9] to program specific kinds of SIP entities such as routing servers, end user agents or back-to-back user agents. All these DSLs provide high-level concepts to hide the intricacies of the underlying SIP technologies, but they are usually limited to coarse-grained and dedicated operations and therefore prevent the implementation of arbitrary telephony services. The second category of approaches is based on general-purpose programming
languages and the provision of large, powerful and generic APIs or frameworks such as JAIN-SIP, SIP-Servlet and JAIN-SLEE for the Java language. However, although they enable the programming of unrestricted SIP applications, these approaches provide little support to layer the design of entities that are involved in multiple sessions or participate in sessions with complex message flows, two requirements often met in advanced telephony applications. In this paper, our goal is to facilitate the development of SIP-based applications programmed with general-purpose languages. To do so, we provide the developer with a programming model that raises the abstraction level with Actor, Session and Role as key concepts. Our notion of role encapsulates one fragment of the behavior played by an entity, similarly to other existing role-based programming approaches [10,4], but in our model this notion is specifically related to the notion of a session of interactions to provide session-based role programming. The roles played by a SIP entity, which is represented by an Actor, depend automatically on the sessions it participates in at runtime. The advantages provided by this model are to simplify reasoning about entity descriptions, to achieve a better modularization of the session-dependent parts of entities and to improve the capacity for constructing SIP entities from reusable components. In addition to this model, we propose a lightweight implementation over JAIN-SIP that supports the definition of the proposed notions through a set of Java annotations. The rest of this paper is organized as follows. After pointing out some issues underlying the design of advanced SIP applications in the next section, Section 3 presents our programming model. Section 4 illustrates the model and discusses its benefits. Section 5 describes a lightweight implementation of the model in Java. Section 6 presents related work prior to concluding with Section 7.
2 Issues in SIP-Based Applications Design
For supporting applications ranging from simple VoIP routing or instant messaging to sophisticated multimedia sessions involving multiple parties with presence management, SIP provides a rich range of communication forms. Communications can be stateless for simple message exchange, session-based to exchange messages over a period of time, or event-based to propagate information like the state change of an entity. One main benefit of SIP is that it enables mixing these communication forms to design advanced telephony applications. As an example of such an application, which will serve in the following, we consider an application that manages presence-based redirection2 of invitations to a dialog combining voice and text. In this application, a server manages the incoming invitations for registered users depending on their status: when the user is available, the server replies with the current user's address so that he can be invited directly; otherwise the invitation is rejected. Figure 1 shows the architecture of this application and a use case of presence-based redirection leading to a successful dialog. In this figure, Alice updates her status because she leaves her office for a meeting (1,2).
1 http://java.sun.com/products/jain/
2 In SIP, redirection consists in directing the client to contact an alternate address.
[Figure 1: message sequence between the Bob User Agent, the Presence-based Redirect Server and the Alice User Agent: (1) Publish [meeting], (2) 200 OK, (3) Invite alice, (4) 486 Busy Here, (5) Publish [available], (6) 200 OK, (7) Invite alice, (8) 302 Moved @alice, then a direct dialog between Bob and Alice consisting of session setup (9) Invite alice, (10) 180 Ringing, (11) 200 OK, (12) Ack, user message exchange (13) Message, (14) 200 OK, (15) Message, (16) 200 OK, and session ending (17) Bye, (18) 200 OK.]
Fig. 1. Example of Presence-based Redirection
Then, Bob calls Alice during her meeting and receives a BUSY response (3,4). When Alice becomes available (5,6), a new invitation from Bob results in a redirect response including Alice's address to initiate a dialog (7,8). Thanks to the returned address, Bob can directly contact Alice to establish a successful dialog (9 to 18). Despite SIP features for supporting multiple communication forms, developing advanced applications like the previous one remains a complex task. In the following, we give three of the typical issues that complicate the programming of SIP entities.
1) Complex message flow within a session: Within a session, SIP entities exchange and handle messages at each end. The handling of these messages generally consists in checking that the received message is valid with respect to the entity's current state, performing actions and then changing to the next state. There are several difficulties related to the handling of messages in a SIP application.
– A first difficulty is that messages can be received at any time. This entails that SIP entities must be prepared to react to any received message, including while others are being processed. In general, this is achieved by adopting an event-driven approach for message handling.
– Another difficulty related to message handling is that the interpretation of messages in the flow is generally state-dependent. In the example, OK responses have a meaning that depends upon the current state or the previous request. Distinguishing between these interpretations generally requires that the application maintain a session state according to the exchanged messages.
– Sometimes, the message flow of a session may contain messages that are related to distinct concerns. For each concern, the related messages are not necessarily in separate sequences but may be interleaved with those of other concerns. In the example, this is illustrated in the lower part of Figure 1 where
we can identify three groups of messages corresponding to the session setup, user message exchange and session ending concerns.
– A last difficulty is that the message flow can be inverted during the session. Illustrations in the example are given by the MESSAGE and BYE requests. This implies that entities must be able to provide behaviours for handling the session flow in both directions and to act both as requestor and responder.
Because of the difficulties described previously, the behaviour of entities handling a part of the session flow like the previous one is usually not easy to design, and there is a need to provide the developer with appropriate abstractions and decomposition mechanisms to facilitate the handling of messages related to a particular concern or a particular state inside the code.
2) Multi-branch sessions: A SIP session is not always restricted to a peer-to-peer conversation. There are some classical calling or presence scenarios which involve more than two peers in the same session. In these scenarios, one or several SIP entities generally act as facilitators between several participants of the session. For such an entity, this entails handling more than one conversation within the same session and coordinating all the conversations in order to serve the communication partners properly. In our example, we initially assume a redirect behaviour for the server, but we can easily imagine changing this behaviour to a proxy-like one in order to extend the status of a user according to its participation in an existing dialog3. Compared to the previous behaviour, adopting this change requires that the server manage and coordinate two communication paths within a session: one to the callee and one to the caller, switching back and forth between being a client and a server at the same time. When a SIP entity is involved in more than one conversation within a session, the complexity of describing its behaviour is inherently increased. To help the developer design behaviour for multiple interwoven conversations, mechanisms should be provided. Such mechanisms should enable expressing the state and behaviour related to each conversation separately, but also simplify their coordination.
3) Multi-session management: The need to handle multiple sessions is another situation that makes the development of SIP applications and their entities complicated. Two cases may be distinguished for multi-session management. The first case is the one where the sessions managed by a SIP entity have the same type and typically occur concurrently. We generally encounter this case when designing SIP servers. In our example, the redirect server is an illustration of this case as it must be able to manage calling sessions coming from multiple requestors. Here, the difficulty is that multiple session states must be maintained separately and concurrently. The second case is the one where each session has a different type. This case is found particularly in the development of rich user agents and rich servers. In our example, Alice's user agent illustrates this case as it must be able to manage several sessions related to registration, incoming invitations and status modification. For this case, an additional difficulty besides maintaining multiple states is that each session may require the handling
3 Proxy behaviour in SIP consists in relaying each request and response between the user agents of a session.
of distinct sets of messages, resulting in a significant number of messages to handle. Note that a combination of the two cases may sometimes be required for the design of a SIP entity, as illustrated by the redirect server. For SIP entities, supporting multiple sessions makes their design more complicated because they must include state and behaviour for many sessions. Therefore, some facilities are needed to limit this complexity. Such facilities should make it possible to describe and manage the part of the entity related to each session separately from the others, while simplifying the selection of the appropriate parts during interaction. Regarding the previous issues, it appears that existing SIP APIs and frameworks do not provide abstractions to solve them. Because of this lack of abstractions, it is common for application developers using SIP frameworks to express the multiple states and behaviours related to parts like concerns, conversations or sessions in a monolithic way and to manage them manually with intricate and scattered if-statements. This greatly complicates the application development task and results in cluttered code that is harder to understand, reuse and maintain. To cope with these issues and fill the gap, we present our role-based programming model in the next section.
3 Programming Model with Actor, Session and Role
As explained previously, the development of advanced SIP applications is complex, leading to intricate behaviour structures for SIP entities. In this section, we propose a programming model based on the Actor, Session and Role concepts to help this development. By relying on these concepts, we are able to break down the complexity of actor behaviour and to define this behaviour in a flexible way. In our model, an actor is a top-level component which represents a distributed SIP entity. During its lifetime, an actor is involved in one or several SIP sessions that may have the same nature or be different, and that may exist concurrently. To handle sessions programmatically, we define two related concepts: session and session part. The session concept is orthogonal to actors: it groups the definition of the roles that characterize a specific session. The session part concept establishes the link between an actor and a specific session: it defines one or several roles played by an actor when it participates in a particular session. The role concept is the basic unit of our programming model. It is used to describe one of the behaviours involved in a session. Within a particular session, an actor communicates with other actors that play dual or consistent roles. Figure 2 summarizes the idea of the actor, session and role concepts and shows the capabilities of the model. Related actors A1 and A2 participate in a SIP session S1. Within this session, actor A1 plays a single role R1 and communicates with A2, which plays the dual roles R2 and R2'. Figure 2 also shows a second SIP session S2, independent from S1, that involves actors A3 and A4 but also A1, already engaged in S1. For this session, we may observe that actor A1 plays a role R1' distinct from the role R1 played in S1. Actor A3 plays two distinct roles to communicate with the A1 and A4 actors. The combined use of actor, session and role proposed by our model allows us to deal with the issues identified above. The first issue can be managed by
Fig. 2. Illustration of Actors, Sessions and Roles
defining multiple roles for a particular session (e.g., A2), each role dealing with one session concern. The capacity for an actor to play multiple roles for distinct communication paths of a session (e.g., A3) may be exploited to solve the second issue. Concerning the last issue, the solution is provided by the ability of an actor to participate in multiple sessions with separate roles (e.g., A2 and A4). In the following subsections, we present the structure of each concept and how they relate. We also explain how session parts and their related roles are created and activated during the actor life-cycle.
3.1 Session and Role
In our programming model, sessions are defined independently of actors. A session definition is a kind of module construct containing the declarations of the interacting role types that are intended to be played by participating actors. There are two ways to declare role types inside a session definition: by nesting a role type definition or by importing a role type. The second way enables reusing role types existing in libraries. Role types of a particular session are declared independently of other sessions. Indeed, it is possible for two sessions to rely on the same imported role type if they need the same slice of behaviour. Code 1 gives, with a Java-like syntax, the definition of CallingSession, a session from our example that defines role types for dialog establishment and user message exchange. We can see that this session definition declares three role types, the first two (Callee and Caller) by nesting and the third (MessageHandler) by importing thanks to the 'includerole' keyword. A role type describes one behaviour involved in a session. Such a behaviour consists in handling incoming and outgoing SIP messages from and to the other roles while realizing the relevant logic. We provide three fixed kinds of role type to specify their capacities in terms of received and sent messages and to help the analysis and correlation of role types:
– Client role: an asymmetric role type that sends requests and receives the related responses.
– Server role: an asymmetric role type that receives requests and sends the related responses.
– Client Server role: a symmetric role type that sends and receives both requests and responses.
// CallingSession
session CallingSession {
  clientserver role Callee {
    onInvite(Request r) { ... }
    onBye(Request r) { ... }
    sendBye() { ... }
    ...
  }
  clientserver role Caller {
    sendBye() { ... }
    sendInvite() { ... }
    onOk(Response r) { ... }
    onRinging(Response r) { ... }
    onBye(Request r) { ... }
    ...
  }
  includerole library.MessageHandler
}

// RegisterSession
session RegisterSession {
  server role RegAcceptor {
    Timer timer;
    String contactName;
    InetAddress currentAddress;
    onRegister(Request r) { ... }
    ...
  }
  client role RegRequestor {
    Timer timer;
    InetAddress currentAddress;
    sendRegister() { ... }
    onOk(Response r) { ... }
    ...
  }
}
Code 1. Two examples of session definition
The definition of a role type is quite similar to a class in the sense that it may include attributes and operations and may inherit from another role type. Besides operations, a role type may also contain message handlers, which are special operations designed to handle an incoming SIP request or response. These operations have a specific signature that matches the type of the SIP message. In the example above, the RegRequestor role type of the RegisterSession session is a 'client' role that registers the current user's address by sending the REGISTER request and handles the related OK responses. It owns a sendRegister operation and an onOk response handler for that purpose. The RegAcceptor role type is a dual 'server' role that has only a message handler to respond to REGISTER requests. From sessions, it is possible to define actors that interact by playing the related roles.
3.2 Actor and Session Part
Actors are described using actor types. An actor type usually includes one or several session parts. Like a class, an actor type may also contain declarations of attributes and methods that can serve to share common data or operations between session parts. Furthermore, to specify which session parts are created on the basis of a particular request, an actor type may also have a special block containing rules, each of which couples a condition with a reference to a session part. The meaning of a rule is that an instance of the corresponding session part will be created if the condition is verified for the current state of the actor and the received request. Code 2 illustrates the actor type for presence-enabled user agents (Alice's) discussed in Section 2. This actor type includes attributes for managing the history of callers. It also contains the declaration of session parts for the three sessions this actor type takes part in: RegisterSession, PublishSession, and CallingSession. Preceding the session parts, we find a 'sessionControl' block containing a rule
/* PresenceEnabledUserAgent actor type */
actor PresenceEnabledUserAgent {
  String contactName;
  List<SipAddress> callerHistory;
  void clearHistory() ...

  sessionControl {
    when (isINVITE(req) && !hasSession(CallingSession)) activate SPI
    default activate Error
  }

  sessionPart SPR:RegisterSession {
    play (RegRequestor)
  }

  sessionPart SPP:PublishSession {
    play (PresencePublisher)
  }

  sessionPart SPI:CallingSession {
    String callerName;
    String getCallerName() { ... }
    play (Callee, MessageHandler)

    extension Callee {
      onInvite (Request req) {
        callerHistory.add(req.getFromAddr());
        callerName = req.getCallerName();
        getRole(PublishSession, PresencePublisher).setNewState(State.available);
        super(req);
      }
      onBye (Request req) { ... }
    }

    roleControl {
      when (isINVITE(req)) activate Callee
      when (isMESSAGE(req)) activate MessageHandler
    }
  }
  ...
} /* End of actor type */
Code 2. Example of an actor type
to specify that the SPI:CallingSession session part should be created when an INVITE request is received. The other session parts are intended to be created explicitly as they only contain client roles. A session part defines the participation of an actor type in a specific session in terms of played roles. In a session part, this participation is specified by referencing the targeted session and declaring which roles defined in the session are played by actors of this type. If we take a look at the actor type given above, we can see that the session part named SPI is connected to the CallingSession defined previously. For this session part, two roles are specified using the 'play' keyword: Callee and MessageHandler. As illustrated by this example, the introduction of attributes (callerName) and methods to share common data and operations between the played roles is also possible. Similarly to the actor type, each session part may include a special block containing rules for determining the roles to create from a particular request and the current actor state. Rule conditions can use predefined boolean functions to query the roles of the session. In Code 2, we have an example of this block, introduced by the 'roleControl' keyword. This block contains two rules stating that the Callee (resp. MessageHandler) role should be played on reception of an INVITE (resp. MESSAGE) request. In addition to the previous elements, a session part may also introduce extensions of roles defined in the referenced session and played by the actors. A role extension enables refining the behaviour of a role. Such an extension may be needed to add interactions between sessions inside a role or between the roles of a session. This capacity is used in the SPI session part to extend the behaviour defined in CallingSession for the Callee role. Here, this extension, introduced by the 'extension' keyword, is required to update the status of the user when he enters or leaves a
dialog. It is achieved by redefining the handlers attached to the INVITE and BYE messages with operations that interact with the PresencePublisher role.
3.3 From Concept to Runtime Entities
In our programming model, actor types, session parts and role types are instantiated to form the state and behaviour of an actor. At runtime, a session part must be considered as the reification of a session from an actor's point of view. Actors are created from actor types and represent SIP entities at runtime. During the lifetime of an actor, session parts are instantiated to reflect its participation in real SIP sessions. This instantiation can be done explicitly, on demand, by means of a new-like construct, or it occurs implicitly on the basis of a received request. For a received request, the decision to create a session part instance depends on whether there already exists a session part matching the session-id included in the request. If none exists, the rules attached to the actor for session parts are evaluated to create a matching session part. A session part may be instantiated more than once per actor. This corresponds to the situation where an actor participates in several sessions of the same kind during its lifetime. At runtime, each session part instance representing a real session aggregates the roles that are played by the actor. The creation of the roles attached to a session can be done in two ways, like session part instantiation: either explicitly by means of a new-like construct, or implicitly on the basis of a received request. In the latter case, the rules attached to the session part are evaluated to determine whether a corresponding role must be created. Message handlers provided by roles are automatically executed for incoming messages through a forwarding process from the actor. When an actor is requested to handle an incoming SIP message, the forwarding process first selects the session part instance matching the session-id of the request4. The processing then continues by choosing a role attached to the selected session part or by creating a new one if necessary. Finally, the message is forwarded to the role by triggering its corresponding message handler.
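To make this forwarding process more concrete, the following Java sketch outlines the selection steps just described; all names (SipMessage, SessionPart, Role, the rule-evaluation methods) are hypothetical and only illustrate the principle, not the actual runtime code of our implementation.

import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the runtime counterparts of the concepts.
interface SipMessage { String getSessionId(); }
interface Role { void handle(SipMessage msg); }
interface SessionPart {
    Role findRole(SipMessage msg);                 // role already attached to this session part
    Role evaluateRoleControlRules(SipMessage msg); // may create a role from the roleControl rules
}

public abstract class ActorRuntimeSketch {
    private final Map<String, SessionPart> sessionParts = new HashMap<>();

    // Evaluates the actor-level sessionControl rules; may create a session part.
    protected abstract SessionPart evaluateSessionControlRules(SipMessage msg);

    // Forwards an incoming SIP message to the appropriate session part and role.
    public void dispatch(SipMessage msg) {
        // 1. Select the session part instance matching the session-id of the message,
        //    or create one by evaluating the sessionControl rules.
        SessionPart part = sessionParts.computeIfAbsent(
                msg.getSessionId(), id -> evaluateSessionControlRules(msg));
        // 2. Select a role attached to the selected session part,
        //    or create one by evaluating the roleControl rules.
        Role role = part.findRole(msg);
        if (role == null) {
            role = part.evaluateRoleControlRules(msg);
        }
        // 3. Trigger the corresponding message handler of the role.
        role.handle(msg);
    }
}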
4 Revisiting the Example
To illustrate our approach, we revisit our example in terms of actors, sessions, session parts and roles. Figure 3 shows the resulting architecture, which is composed of three actor types: PresenceEnabledUA, RedirectServer and UserAgent. These actors are involved in four sessions, which are represented by horizontal boxes crossing the actors. The first session, named RegisterSession, takes place when a PresenceEnabledUA actor registers its current address with the server. PublishSession is the session used to update the user status. LocateSession is initiated when an external UserAgent actor communicates with the RedirectServer to locate a registered user agent (PresenceEnabledUserAgent). Finally,
4 As explained before, the incoming message may entail the creation of a session instance.
[Figure 3 depicts the three actor types and their roles per session: PresenceEnabledUA (attributes contactName, callerHistory) plays RegRequestor, PresencePublisher, Callee and MessageHandler; RedirectServer (attributes registeredUsers, availableUsers) plays RegAcceptor, PresenceHandler and Locator; UserAgent plays Lookup, Caller and MessageHandler, across the Register, Publish, Locate and Calling sessions.]
Fig. 3. Architecture of our example with Actors, Sessions and Role Types
CallingSession is the session for handling the dialog between the PresenceEnabledUA and UserAgent actor types. For the above sessions, we can see that the involved actors play separate roles, which are represented by class-like boxes (session parts are not shown but are indicated by dotted lines). Figure 4 shows the relationships between the flow of SIP messages and instances of session parts and roles for a dialog setup between Alice and Bob. After receiving the response containing Alice's address from the server, the Lookup role instance played by Bob's UserAgent actor explicitly creates a new CallingSession-related session part and an associated Caller role. Next, this role is invoked to send the INVITE message establishing the real session. When Alice's PresenceEnabledUA actor receives the message, it detects that there is no matching session, so it evaluates its rules to determine the session part to create and activate. In the current case, it is a CallingSession-related session part which is instantiated. Then, this session part instance evaluates its own rules to determine which role must be played by the actor for the current message and this kind of session. The result is the creation of a Callee role, which is finally invoked to process the message. When playing the Callee role, the PresenceEnabledUA actor returns an OK response to the UserAgent actor for the same session. Because the UserAgent actor is already active in this session and already has a Caller role to handle OK responses, no new role is created and this role is selected to process the responses and confirm the dialog establishment with an ACK request. After dialog establishment is complete, both actors can play the MessageHandler role that deals with the sending and reception of user messages inside the session. Through this example, we have illustrated how our model supports the decomposition of behaviour into multiple session parts with their respective roles and copes with the issues and requirements identified in Section 2. This model also offers the following advantages:
– It raises the level of abstraction, as the developer can think about the behaviour of its actor at a higher level thanks to session parts and roles. This
Fig. 4. Relationships between SIP messages and actor components
contrasts with conventional SIP frameworks, which require examining implementation code in detail to get a similar view.
– The encapsulation of behaviour and state into multiple session part and role components with a delimited scope increases the modularity of actors. As a result of this enhanced modularity, the maintenance and evolution of SIP entities and services are made easier. This modularity allows, for instance, changing the server to a proxy behaviour by just replacing its Locator role with two coordinated forwarder roles supporting multi-branch communications.
– The automatic selection of sessions and roles, as well as the automatic forwarding of messages to roles, reduces the coding effort since the number of checks and the extra state needed to ensure the execution of the appropriate behaviour are minimized compared to SIP frameworks.
– Finally, our approach gives the ability to capture recurrent behaviour into reusable role types and to reuse them by inclusion into sessions, as sketched below. This is a main advantage over existing DSLs for SIP, where the question of reusability is generally overlooked, and over SIP frameworks, where reusability is limited by the hardwiring of session-related parts into methods.
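As a rough illustration of this last point, using the annotation form detailed in Section 5, the sketch below shows a role type kept in a library and included in two different sessions; the ConferenceSession class is an invented example and only serves to show the reuse.

// Hypothetical sketch of role reuse; ConferenceSession is invented for illustration,
// while CallingSession and MessageHandler correspond to the running example.
@clientserverRole
public class MessageHandler {
    public void sendMessage(String text) { /* build and send a MESSAGE request */ }
    public void onMessage(Request req, ServerTransaction tx) { /* handle a user message */ }
}

@session
@includeRole(type = MessageHandler.class)   // reused in the calling session of the example
public abstract class CallingSession { /* Callee and Caller roles as in Code 1 */ }

@session
@includeRole(type = MessageHandler.class)   // the same role type reused in another session
public abstract class ConferenceSession { /* hypothetical conferencing roles */ }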
5 A Coding Framework above JAIN-SIP
We have implemented our approach in Java through a coding framework, which is presented in Figure 5. Actor-annotated Java code is transformed to produce executable actors, that is to say, Java classes that specialize an actor framework. The main classes of the actor framework are SipActor, SipSessionPart, SipClientRole, SipServerRole and SipClientServerRole. These classes implement the proposed concepts and a runtime engine above JAIN-SIP. To facilitate the building of SIP applications using our framework, a set of annotations, presented in Table 1, has been defined. A class with the @actor annotation
Fig. 5. Coding framework architecture
Table 1. Annotations for programming SIP Actors in Java

Annotation          Attributes                                        Description
@actor              -                                                 Declare an Actor class
@session            -                                                 Declare an abstract Session class
@sessionPart        -                                                 Declare a Session part class
@clientserverRole   -                                                 Declare a symmetric role class
@clientRole         -                                                 Declare an asymmetric role class
@serverRole         -                                                 Declare an asymmetric role class
@useSessionPart     type::Class, method::String, condition::String   Specify an activation clause for a session part
@useRole            type::Class, method::String, condition::String   Specify an activation clause for a role
@includeRole        type::Class                                       Import an existing role in a session class
will inherit from the SipActor class of the framework, whereas a class annotated with @sessionPart will inherit from the SipSessionPart class. The @session annotation is used with an abstract class to define a session. Such a class can declare inner role classes to define the related role types or use the @includeRole(type) annotation to import existing role classes from a library. The three kinds of role types provided by the model are mapped to corresponding annotations for classes: @clientRole, @serverRole, @clientserverRole. An actor declares its session parts with the @useSessionPart(type, method, condition) annotation on its class. The mandatory type attribute determines the session part class that has to be imported and activated. The mandatory method attribute provides the type of SIP request method (e.g., an INVITE message) that triggers the activation of the session part. The optional condition attribute is a string that refers to the name of a boolean method describing additional conditions for the activation of the session part. The principle is similar for declaring the roles of a session part, thanks to @useRole annotations, except that they are used on a session part class. Code 3 gives the annotated version of the CallingSession session and the PresenceEnableUA actor type. A preliminary performance study was conducted using the SIPp traffic generator and JAIN-SIP and framework versions of the same user agent server. Results were produced for 1000 calling sessions at a rate of 5 per second. These results show that the framework overhead is about 10 percent compared to JAIN-SIP, which is relatively low given that the framework is not yet optimized.
@session // Declaration of session
public abstract class CallingSession {
  @clientserverRole // Declaration of role class
  public class Callee {
    public void onInvite (Request req, ServerTransaction tx) { ... }
    public void sendBye () { ... }
  }
  ...
}

@sessionPart // Declaration of session part
@useRole(type=ExtCallee.class, method="INVITE")
@useRole(type=MessageHandler.class, method="MESSAGE")
public class CallingSessionPartPUA {
  @extension(Callee.class) // Extension of role class
  public class ExtCallee {
    public void onInvite (Request req, ServerTransaction tx) { ... }
    ...
  }
}

@actor // Declaration of actor class
@useSessionPart(type=RegisteringSessionPartPUA.class)
@useSessionPart(type=PublishingSessionPartPUA.class)
@useSessionPart(type=CallingSessionPartPUA.class, method="INVITE", condition="hasNoCallingSsn")
public class PresenceEnableUA {
  public boolean hasNoCallingSsn() { ... }
}
Code 3. Example of annotation use to code the presence server example
6 Related Work
In earlier role-based programming approaches, roles are meant to capture the dynamic and temporal aspects of real-world objects, and the view adopted for roles is object-centric: roles are defined as being independent from the interaction. From an object-centric view, our role model has similarities with some earlier approaches and their variants [10,4]: it enables multiple role instances for a player and it links roles to their player using aggregation relationships. However, some features like the distinction between client and server roles, the event-based triggering of roles and the capacity to have roles running in parallel threads are unique to our role model. A few works combine roles with a notion of context to express the fact that the interaction possibilities change according to the properties of the interacting objects. In powerJava [1], roles represent the possibilities offered by an object to interact with it. For a client, the interaction with such an object is made by acquiring one of its offered roles. Similar ideas have been proposed in ActorFrame [2], a Java-based framework for the design of distributed services with actors and roles. The work described in [6] presents a programming model with context-dependent roles for actors. Roles represent adaptations of an actor that are automatically selected for each message based on the context of the message sender and receiver. In our approach, contexts for roles are provided by sessions. This makes our approach different on two main points: the selection of a role is not controlled by an interacting entity but only by its owner (the actor), and our notion of context is long-lived, i.e., persistently activated between messages.
The notion of role is also related to the concept of collaboration, which aims to describe the interactions among different objects in a given context. ObjectTeams/Java [5] and EpsilonJ [7] are extensions of Java that group interacting roles into collaboration modules. Roles inside a collaboration module are played by objects of base classes, either explicitly or implicitly, enabling them to interact. In our approach, sessions and session parts provide similar capacities, with the main difference that they are supported in a distributed context. Finally, the authors of [3] have proposed the notion of session type as a language construct to support the description and type-checking of protocols between parallel threads. A session type is implemented in dual operations of interacting components using a correlated sequence of receive and send instructions. Compared to the use of one or more roles to describe the behaviour of a component related to a session, an operation implementing a session type is more fine-grained and offers a concise and clearer view of the interaction structure, but it also provides fewer capacities for reuse and flexible composition of behaviour.
7 Conclusion
To tackle some issues arising from the design of advanced telephony applications based on SIP, we have proposed an approach that enables the involved entities to be constructed as actors playing roles in multiple sessions. The proposed approach and its implementation have been experimented with through the development of various SIP entities such as user agents, third-party agents, as well as presence and proxy servers. Currently, we are working on elaborating a library of reusable roles from these experiments. In future work, we plan to enhance the programming model with inheritance for sessions and actor types and with event-based mechanisms to support coordination between roles and between session parts. We also plan to explore the extensibility of SIP for typing exchanged messages with sessions and roles to enable dynamic role alignment and synchronization between actors. A last perspective is to better integrate our concepts with the host programming language by using the capabilities of some languages for embedding DSLs.
References
[1] Baldoni, M., Boella, G., van der Torre, L.: Interaction among objects via roles: sessions and affordances in Java. In: 4th Int. Symp. on Principles of Programming in Java (2006)
[2] Bræk, R., Melby, G.: Model-Driven Service Engineering. In: Model-Driven Software Development. Springer, Heidelberg (2005)
[3] Dezani-Ciancaglini, M., Mostrous, D., Yoshida, N., Gairing, M.: Session Types for Object-Oriented Languages. In: Hu, Q. (ed.) ECOOP 2006. LNCS, vol. 4067, pp. 328–352. Springer, Heidelberg (2006)
[4] Graversen, K.B.: The nature of roles. A taxonomic analysis of roles as language constructs. PhD thesis, IT University of Copenhagen (2006)
[5] Herrmann, S.: A Precise Model for Contextual Roles: The Programming Language ObjectTeams/Java. In: Applied Ontology, vol. 2. IOS Press, Amsterdam (2007)
[6] Vallejos, J., Ebraert, P., Desmet, B., Van Cutsem, T., Mostinckx, S., Costanza, P.: The Context-Dependent Role Model. In: Indulska, J., Raymond, K. (eds.) DAIS 2007. LNCS, vol. 4531, pp. 1–16. Springer, Heidelberg (2007)
[7] Monpratarnchai, S., Tetsuo, T.: The Implementation and Execution Framework of a Role Model Based Language, EpsilonJ. In: Proceedings of SNPD 2008 (2008)
[8] Palix, N., Consel, C., Reveillere, L., Lawall, J.: A stepwise approach to developing languages for SIP telephony service creation. In: Proceedings of IPTComm 2007 (2007)
[9] Zave, P., Cheung, E., Bond, G., Smith, T.: Abstractions for Programming SIP Back-to-Back User Agents. In: Proceedings of IPTComm 2009 (2009)
[10] Steimann, F.: On the representation of roles in object-oriented and conceptual modelling. Data & Knowledge Engineering 35 (2000)
[11] Smith, T., Gregory, G., Bond, W.: ECharts for SIP Servlets: a state-machine programming environment for VoIP applications. In: Proceedings of IPTComm 2007 (2007)
[12] Wu, X., Schulzrinne, H.: Handling feature interactions in the Language for End System Services. In: Feature Interactions in Telecommunications and Software Systems VIII (2005)
Architecturing Conflict Handling of Pervasive Computing Resources
Henner Jakob 1, Charles Consel 1, and Nicolas Loriant 2
1 INRIA Sud-Ouest, Bordeaux, France, {henner.jakob,charles.consel}@inria.fr
2 Imperial College, London, UK, [email protected]
Abstract. Pervasive computing environments are created to support human activities in different domains (e.g., home automation and healthcare). To do so, applications orchestrate deployed services and devices. In a realistic setting, applications are bound to conflict in their usage of shared resources, e.g., controlling doors for security and fire evacuation purposes. These conflicts can have critical effects on the physical world, putting people and assets at risk. This paper presents a domain-specific approach to architecturing conflict handling of pervasive computing resources. This approach covers the software development lifecycle and consists of enriching the description of a pervasive computing system with declarations for resource handling. These declarations are used to automate conflict detection, manage the states of a pervasive computing system, and orchestrate resource accesses accordingly at runtime. In effect, our approach separates the application logic from resource conflict handling. Our approach has been implemented and validated on various building automation applications.
1 Introduction
The advances in telecommunication technologies and the proliferation of embedded networked devices are allowing the seamless integration of computing systems into our everyday lives. Nowadays, pervasive computing systems, as envisioned by Weiser [16], are being deployed in an increasing number of areas, including building automation and assisted living. Typically, a pervasive computing environment consists of multiple applications that gather data from sensing devices, compute decisions from sensed data, and carry out these decisions by orchestrating actuating devices. For example, in building automation, motion and temperature sensors are used to automate lighting and regulate heating. The rapid development of new devices (i.e., resources), and development tools opened to third parties, have paved the way to an increasing number of applications being deployed in pervasive computing environments. These applications access resources without any coordination between them because a pervasive computing platform needs to evolve as requirements change. In this situation, it is very common for a resource to be accessed by multiple applications, potentially leading to conflicts. For example, in a building management system, a security application that grants access inside the building can conflict with another application dealing with emergency situations like fires,
preventing the building from being evacuated. In fact, conflicts do not only occur across applications but also within an application. For example, different modules of an application may be developed independently of each other, creating a risk of conflicting orders being issued to devices. Detecting, resolving and preventing intra- and inter-application conflicts is critical to making a pervasive computing system reliable. To do so, a systematic and rigorous approach to handling conflicts throughout the development lifecycle is required.
Detecting conflicts is a daunting task. Pervasive computing systems are complex and involve numerous applications that may conflict on one or multiple resources. Scaling up conflict handling for real-size pervasive computing systems requires distinguishing potential conflicts from safe resource sharing. This may depend on the type of a resource; for example, a conflict may occur on a resource providing mutually exclusive operations (e.g., locking and unlocking a door). This may also depend on the applications being deployed in a pervasive computing environment (e.g., two applications may access a device inconsistently), precluding application developers from anticipating potential conflicts. Without any support, detecting potential conflicts requires examining the code of all the applications to identify each resource usage and determine whether it may conflict. After potential conflicts are pinpointed, it is necessary to resolve each of them. This requires intimate knowledge about the code of the corresponding applications to resolve the conflicts by making code changes. Because of the lack of high-level programming support, writing system-wide conflict-handling strategies is often overlooked. This situation results in polluting the logic of applications with ad hoc code, compromising the system's maintainability. The situation is exacerbated by the fact that pervasive computing environments are prone to changes: applications as well as resources emerge, evolve, and may disappear over time. These changes directly impact conflict management. This problem is well known in the telecommunications domain, where it was observed that the number of potential conflicts grows exponentially as new applications are added to an existing system [10]. Manually handling conflicts thus becomes impractical.
Our Approach. Managing conflicts is often decomposed into three stages: detection, resolution and prevention [10]. In practice, these stages crosscut the development lifecycle of applications and pervasive computing systems. We introduce an approach to conflict management that covers the lifecycle of a pervasive computing system. It consists of a design method for applications, supported by declarations and tools, separating conflict management tasks. This approach facilitates the work of architects, developers and administrators: requirements for conflict management are propagated throughout the development stages. We propose to declare a pervasive computing system and its applications using a domain-specific architecture description language (ADL), named DiaSpec [5], developed in our research group. This ADL serves two purposes: (1) it allows domain experts to describe the available resources in the pervasive computing environment, and (2) it is used by software architects to design applications with respect to the declared
resources. We extended DiaSpec with conflict-handling declarations that allow domain experts to characterize resources from a conflict-management viewpoint. This information, in combination with the architecture descriptions, makes it possible to automatically pinpoint the places where conflicts can occur. To resolve the detected conflicts, we propose to raise the level of abstraction beyond the code level by providing declarative support for conflict resolution. Within an application, the developer uses declarations to specify states for a pervasive computing system and to order them with respect to their critical nature (e.g., fire is more critical than intrusion). These states are enabled and disabled depending on runtime conditions over the pervasive computing system (e.g., fire detection). State changes are used to update access rights to conflict-sensitive resources (e.g., in case of fire, the fire module takes precedence over the intrusion module). Our approach is incremental in that states and priorities can be added as a pervasive computing system is enriched with new applications. Its declarative nature prevents conflict-handling logic from polluting the application logic. Conflict-extended architecture descriptions are used to generate customized programming frameworks. These frameworks guide and support the implementation of the conflict-handling logic. Generating the underlying framework from the architecture description guarantees that the architecture implementation can only access the required resources. Additionally, runtime support ensures that accesses to resources are granted in conformance with the conflict-handling declarations. Our contributions can be summarized as follows.
– Extended development cycle – We have identified the requirements at different development stages to detect, resolve, and prevent conflicts. We have seamlessly integrated conflict-management activities into a software development lifecycle.
– Conflict-handling declarations – We have extended a domain-specific ADL to declare conflict resolution at an architectural level. A declarative approach is introduced to define the states of a pervasive computing system and their critical nature. Such declarations form the basis for defining the conflict-handling logic of a pervasive computing system.
– Programming support – Conflict-handling declarations are used to augment the generated programming framework with code dedicated to conflict handling. This code (1) guides the implementation of the conflict-handling logic within and across applications, and (2) generates code that manages resource accesses to prevent runtime conflicts.
The rest of this paper is organized as follows. Section 2 identifies the key requirements to manage resource conflicts. Section 3 presents how to integrate conflict management into the development cycle. Section 4 outlines our implementation. Section 5 evaluates our approach. Related works are discussed in Section 6, and concluding remarks are given in Section 7.
2 Background and Requirements
In this section, we first present a domain-specific architecture description language, named DiaSpec [5]. The underlying development process is illustrated with a working
example of building management. Second, we examine the requirements for managing resource conflicts when developing pervasive computing applications.
2.1 Background
The DiaSpec language enforces an architectural pattern, named sense-compute-control, commonly used in the pervasive computing domain [6]. This pattern distinguishes three types of components, as depicted in Figure 1: (1) resources, which provide sensing and actuating capabilities on a pervasive computing environment1, (2) contexts, which aggregate and process sensed data, and (3) controllers, which receive information from contexts and invoke actuators. This architectural pattern goes beyond the pervasive computing domain and enables high-level programming support and a range of verifications [3,4,7].
Fig. 1. DiaSpec architectural pattern
Fig. 2. DiaSpec development cycle
Figure 2 shows how a DiaSpec description drives a five-stage development process. (1) A domain expert declares a taxonomy of the resources that can be found in the pervasive computing environment. (2) An architect describes the interactions between resources, contexts and controllers. Given a taxonomy and an architecture description, a compiler, named DiaGen, generates a customized programming framework in Java. (3) The generated framework is used by the developer to implement the application. (4) The application code can be tested as is, prior to deployment, using a simulator for a pervasive computing environment, named DiaSim [1]. (5) A system administrator can deploy the application in a real pervasive computing environment. The suite of tools supporting our development process is called DiaSuite2. We now focus on the first three steps of our development process with an application that treats different types of emergencies in a building.
Describing the Environment. First, the domain expert declares the available resources of a pervasive computing environment, as is done using an interface description language (e.g., WSDL) to declare external resources. In DiaSpec, this process is supported by a
1 Resources are devices (e.g., a motion detector) or software components (e.g., an address book).
2 DiaSuite is freely available at http://diasuite.inria.fr and is open source.
1  device LocDevice {
2    attribute location as Location;
3  }
4  device SmokeSensor extends LocDevice {
5    source smoke as Float;
6  }
7  device TempSensor extends LocDevice {
8    source temperature as Float;
9  }
10 device Door extends LocDevice {
11   source status as LockedStatus;
12   action LockUnlock;
13 }
14 device Alarm extends LocDevice {
15   action OnOff;
16 }
17 device Sprinkler extends LocDevice {
18   action OnOff;
19 }
20 device Logger { action Log; }
21
22 action LockUnlock {
23   lock();
24   unlock();
25 }
26 action Log {
27   logEvent(event as String);
28 }
29 action OnOff { on(); off(); }
Fig. 3. Extract of the emergency management taxonomy
1  context AvgTemp as Float
2    indexed by location as Location {
3    source temperature from TempSensor;
4  }
5
6  context SmokeDetected as Boolean
7    indexed by location as Location {
8    source smoke from SmokeSensor;
9  }
10
11 context Fire as Boolean
12   indexed by location as Location {
13   context AvgTemp;
14   context SmokeDetected;
15 }
16
17 context DoorStatus as Boolean
18   indexed by location as Location {
19   source status from Door;
20 }
21
22 controller FireCtrl {
23   context Fire;
24   context DoorStatus;
25   action LockUnlock on Door;
26   action OnOff on Alarm, Sprinkler;
27   action Log on Logger;
28 }
Fig. 4. Extract of the architectural description of the fire module
language layer dedicated to describing classes of entities that are relevant to a given application area. An entity declaration models sensing capabilities that produce data, and actuating capabilities that provide actions. Specifically, a declaration includes a data source for each one of its sensing capabilities. An actuating capability corresponds to a set of method declarations. Additionally, attributes are included in an entity declaration to characterize properties of instances (e.g., their location). Entity declarations are organized hierarchically, allowing entity classes to inherit attributes, sources, and actions.
Figure 3 shows an excerpt of the taxonomy for the emergency application. Specifically, to detect a fire, the application uses temperature and smoke sensors deployed in the building. Upon fire detection, the doors are unlocked to ensure the safe evacuation of all the building occupants. Additionally, the alarms of the building are turned on, as well as the sprinklers near the fire. All actions are logged for later analyses. The domain expert introduces the resource classes with the device keyword. In Figure 3, a root device with a location attribute is declared (see lines 1 to 3). Attributes mainly serve as filters for resource discovery in the pervasive computing environment. The source and action keywords define the capabilities of a resource. For example, line 5 declares that the smoke sensor produces a float value, indicating the current smoke intensity. The Door device provides the LockUnlock action (line 12), which is further detailed in lines 22 to 25.
Describing the Architecture. To support application design, the DiaSpec language offers an ADL layer, based on the architectural pattern depicted in Figure 1, comprising resource, context and controller components.
[Figure 5 depicts the emergency application: its fire and intrusion modules combine the AvgTemp, SmokeDetected, DoorStatus, Occupancy, Fire and Intrusion contexts, fed by TempSensor, SmokeSensor, Door, MotionSensor, BreakDetector and Calendar sources; the FireCtrl and IntrusionCtrl controllers act on the Sprinkler, Alarm, Door and Logger resources.]
Fig. 5. Architecture of the emergency application
To illustrate the ADL layer, let us examine the emergency application. Figure 4 presents an excerpt of the corresponding DiaSpec declarations, describing the fire module. Figure 5 shows a graphical view of the emergency application, including the intrusion module. The arrows indicate the flow of information. The resources at the bottom of the diagram provide information to the context components; the resources at the top provide the controller components with actions on the environment.
The temperature sensors of a room send their values to the AvgTemp component. Figure 4, lines 1 through 4, introduces this component using the context keyword. It includes a source declaration, defining the input of this component. The as keyword, line 1, is followed by the type of the output value (Float). The value is indexed by a location: the room where the average temperature is measured. Another context component, SmokeDetected, gathers information from smoke detectors. Both contexts, the average temperature and the smoke information, are used by the Fire component to determine whether there is a fire in the building and its location. Eventually, if there is a fire, the FireCtrl component is invoked. It is declared with the controller keyword (line 22). This component declares two input sources using the context keyword, referring to Fire and DoorStatus (lines 23 to 24). The action keyword defines the actuator operations that can be invoked by a controller component. In our example, the FireCtrl component can lock/unlock doors, turn on/off alarms and sprinklers, and log events (lines 25 to 27).
Implementing an Application. The customized programming framework produced by DiaGen consists of an abstract class for each DiaSpec component (resource, context, and controller). The abstract class includes methods that implement the programming support (e.g., resource discovery and component communication mechanisms). The application logic to be provided by the developer is declared as abstract methods. Implementing the application logic is done by subclassing a generated abstract class.
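As a rough illustration of this step, the following Java sketch shows how a developer might subclass a generated controller class for FireCtrl; the names of the generated class, of the abstract callback and of the actuator accessors are hypothetical and only illustrate the principle, not the exact API produced by DiaGen.

import java.util.List;

// All names below are illustrative; the API generated by DiaGen may differ.
interface Location { }
interface Door { void lock(); void unlock(); }
interface Alarm { void on(); void off(); }
interface Sprinkler { void on(); void off(); }
interface Logger { void logEvent(String event); }

// Stand-in for the abstract class generated from the FireCtrl declaration:
// it would provide discovery of actuators and declare the context callback.
abstract class AbstractFireCtrl {
    protected abstract void onFire(boolean fire, Location location);
    protected List<Door> getAllDoors() { /* generated discovery code */ return List.of(); }
    protected List<Alarm> getAllAlarms() { /* generated discovery code */ return List.of(); }
    protected List<Sprinkler> getSprinklers(Location l) { /* generated discovery code */ return List.of(); }
    protected Logger getLogger() { /* generated discovery code */ return event -> { }; }
}

// Developer-supplied application logic.
public class FireCtrl extends AbstractFireCtrl {
    @Override
    protected void onFire(boolean fire, Location location) {
        if (!fire) return;
        for (Door d : getAllDoors()) d.unlock();             // allow safe evacuation
        for (Alarm a : getAllAlarms()) a.on();               // warn occupants
        for (Sprinkler s : getSprinklers(location)) s.on();  // fight the fire locally
        getLogger().logEvent("Fire detected in " + location);
    }
}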
2.2 Requirements
Let us now define our notion of resource conflict and examine the issues to be resolved within the DiaSpec development approach.
Intra-application Resource Conflicts. Sensors and actuators need to be distinguished when it comes to resource conflicts. Indeed, sensors can sustain many consumers, requesting values either directly (e.g., by remote procedure call) or via some runtime support (e.g., a notification server). The situation would be comparable for actuators, if only they did not have side effects on the environment. This is illustrated in Figure 5, where the FireCtrl and IntrusionCtrl controllers share resources. These controllers can, for example, have conflicting effects on the door resource, depending on whether the current state of the pervasive computing environment requires anti-intrusion or firefighting measures. What this example illustrates is that resolving resource conflicts relies on some notion of state that determines which consumer should acquire the resource. A pervasive computing environment can be in different states depending on a variety of conditions. Expressing these conditions is key to providing a practical approach to conflict resolution. To separate this concern from the application logic, the approach should target the architecture level. In the door example, we would need to introduce states, enabled by conditions over relevant sensed data (e.g., smoke intensity, motion detection). Based on the enabled states, the attempts of the controllers to acquire the doors would be prioritized. Note that some actuators can be insensitive to conflicts. An example is the log action (lines 26 to 28): it can record data in any order, assuming each invocation has the necessary contextual information (e.g., a time stamp).
Inter-application Resource Conflicts. The emergency application is only a part of the building management system. The system administrator also deploys a security application to manage access to the building. Figure 6 shows a graphical representation of the two applications: emergency and security. Both applications operate the same types of resources, in this case the door and the logger. As can be noted, resource conflicts occur at different levels and must be managed globally. Even though conflicting usages of resources can be resolved with respect to a given state, there needs to be a global, system-wide approach to combining unitary strategies in a transparent and predictable way.
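To give a feel for the kind of state-based arbitration this calls for, the Java sketch below shows a guard that grants an exclusive actuator only to consumers associated with a state at least as critical as the currently enabled one; all names and the priority encoding are hypothetical, the actual mechanism being the declarative one introduced in the next section.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: states are totally ordered by priority (higher = more critical).
enum SystemState {
    NORMAL(0), SECURITY(1), EMERGENCY(2);
    final int priority;
    SystemState(int priority) { this.priority = priority; }
}

class ExclusiveResourceGuard {
    // State associated with each resource consumer (e.g., "FireCtrl" -> EMERGENCY).
    private final Map<String, SystemState> consumerState = new ConcurrentHashMap<>();
    private volatile SystemState enabledState = SystemState.NORMAL;

    void associate(String consumer, SystemState state) { consumerState.put(consumer, state); }
    void enable(SystemState state) { enabledState = state; }

    // An exclusive action (e.g., locking/unlocking a door) is granted only to
    // consumers whose state is at least as critical as the enabled state.
    boolean grant(String consumer) {
        SystemState s = consumerState.getOrDefault(consumer, SystemState.NORMAL);
        return s.priority >= enabledState.priority;
    }
}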
3 Conflict Management

This section presents our approach to conflict management. It addresses the requirements discussed previously and illustrates the approach with the building management system.

3.1 Detecting Potential Conflicts
Fig. 6. Potential resource conflicts between multiple controller components
Our approach to conflict management revolves around the DiaSpec description of an application. Such a description exposes the interactions with actuators, allowing resource conflicts to be detected within an application, for the application developer, and between applications, for the system administrator. Let us examine how the intra-application conflicts between the fire and the intrusion modules are solved (Figure 6). The process is the same for inter-application conflicts. In DiaSpec, conflicts may occur when a resource is used by more than one controller component. Information about the resource usage can be extracted from the DiaSpec description of an application. This information needs to be refined to account for actions that are insensitive to resource conflicts (e.g., the log action).

Categorizing Actions in the Taxonomy. We extended the taxonomy language of DiaSpec with effect declarations for resource actions. An effect declaration applies to an action (i.e., an interface and its associated operations), which is part of a device declaration. In practice, we have identified three main effects that need to be expressed. First, a device includes an action with operations that are mutually exclusive in their effects. For example, a door is either locked or unlocked. Such an action is declared with the exclusive keyword. Second, a device combines operations that interfere with each other. For example, a multimedia device could include two actions: an audio player and a video player; if both players run simultaneously, they interfere with each other. The list of interfering actions of a device is declared with the interfering keyword (interfering actions do not occur in our building management example). Lastly, when an action is conflict insensitive, it is declared without effect keywords. In our example, the domain expert has to enrich the declaration of the Door, Alarm and Sprinkler devices with the exclusive keyword, as shown in Figure 7. The Log action is left unchanged because it is conflict insensitive.

Analyzing the Architecture Description. Given the taxonomy declarations enriched with conflict-handling information, the application developer and, later in the process, the system administrator investigate potential resource conflicts. A resource usage raises a potential conflict when two or more controllers may access it.
device Door extends LocDevice {
  source status as LockedStatus;
  exclusive action LockUnlock;
}
device Alarm extends LocDevice {
  exclusive action OnOff;
}
device Sprinkler extends LocDevice {
  exclusive action OnOff;
}
Fig. 7. The taxonomy contains three devices with exclusive actions
These controllers may be defined within an application or across applications. In our approach, potential resource conflicts are automatically detected from a DiaSpec description. Conflict resolution is expressed with declarations, leaving the application logic unchanged.

3.2 Declaring Conflict Resolution

To resolve conflicts, we partition resource users with respect to a set of states in which a pervasive computing environment can be. These states are totally ordered with respect to their assigned priority level; they are associated with resource users (i.e., controller components). For example, our building can be in one of the following states, listed in order of increasing priority: normal, security, or emergency. In doing so, applications and controllers within an application can be assigned different states, resolving their access to conflicting resources. To complete our approach, we need to enable and disable states depending on evolving conditions of the pervasive computing environment. This is done by introducing state components, leveraging the DiaSpec notion of context component. Recall that such a component receives information about the pervasive computing environment (e.g., smoke, fire). A state component uses this information to determine whether the conditions for a given state hold, producing a boolean value. Let us illustrate our approach with inter- and intra-application conflict resolution. Consider Figure 8, where two state components are defined (lines 1 to 9): SecuritySt and EmergencySt. These components are declared with the system keyword to indicate that they apply system-wide, allowing the system administrator to resolve inter-application conflicts.
system state SecuritySt priority 5 to Security {
  source date from Calendar;
}
system state EmergencySt priority 10 to Emergency {
  application state FireASt;
  application state IntrusionASt;
}

Fig. 8. System state-component declarations (inter-application conflicts)

application state FireASt priority 15 to FireCtrl {
  source temperature from TempSensor;
  source smoke from SmokeSensor;
}
application state IntrusionASt priority 10 to IntrusionCtrl {
  context Intrusion;
}

Fig. 9. Application state-component declarations (intra-application conflicts)
With the priority keyword, they are assigned priority values of 5 and 10, respectively, indicating that SecuritySt is less critical than EmergencySt. Following the to keyword are the applications to which the declared state applies. The conditions under which a state holds are parameterized by information sources, as is declared for the SecuritySt state with the Calendar source. Likewise, the conditions may be parameterized by other states, as is defined for the EmergencySt state with FireASt and IntrusionASt. In fact, these two states are used to resolve intra-application conflicts, promoting state-component reuse; the states are defined in Figure 9 (lines 1 to 9). Application state components are declared with the application keyword by the application developer and apply to controller components declared within an application. For example, the FireASt state applies to FireCtrl and the IntrusionASt state to IntrusionCtrl. Both controllers, and their associated states, are local to the Emergency application. This local nature also applies to the priorities defined by application states. That is, these priorities resolve conflicts within an application. In our example, these declarations prioritize FireCtrl over IntrusionCtrl. In doing so, intra-application conflicts for resources such as doors can be resolved.

3.3 Implementing Conflict Resolution

Declarations of conflict handling are enforced by additional code produced by DiaGen, shielding the application developer and system administrator from low-level implementation details. Let us illustrate the implementation of the conflict-handling logic by considering the declaration of the FireASt state in Figure 9. This state component relies on two information sources, temperature and smoke, to determine whether the building is on fire. Figure 10 shows an implementation of this state component. In lines 7 and 8, the component subscribes to all the required sensors. To keep track of the building situation, the component stores temperature and smoke values from all the locations within the building. Specifically, the component implementation updates the value (temperature or smoke) for each location (lines 12 to 21). After refreshing the value, it checks whether the condition for a fire holds by calling the checkFire method (line 17). This method determines whether or not a fire is occurring by publishing a boolean value (line 35), which in turn will enable or disable the corresponding state of the pervasive computing system (i.e., FireASt).
4 Implementation

To achieve our conflict management approach, we have extended DiaSpec, DiaGen, and the DiaSpec runtime. The extended DiaSpec runtime is illustrated by our building example in Figure 11. We introduce a ConflictCtrl component that subscribes to state components to gather information about states that get enabled/disabled at runtime. It combines this information with state priorities to compute access rights and to update the enforcer components associated with each resource class (e.g., Door). The enforcer components intercept resource accesses and decide whether or not to
 1  public class FireASt extends AbstractFireASt {
 2    private Map<Location, Map<String, Value>> status;
 3
 4    public void initialize() {
 5      status =
 6        new HashMap<Location, Map<String, Value>>();
 7      allTempSensors().subscribeTemperature();
 8      allSmokeDetectors().subscribeSmoke();
 9    }
10
11    @Override
12    public void onNewTemperature(Location loc,
13        Temperature temperature) {
14      Map<String, Value> values = getValues(loc);
15      values.put("temperature", temperature.value());
16      status.put(loc, values);
17      checkFire();
18    }
19
20    @Override
21    public void onNewSmoke(Location loc, Smoke smoke) {...}
22
23    private Map<String, Value> getValues(Location loc) {...}
24
25    private void checkFire() {
26      boolean fireDetected = false;
27      for (Location loc : status.keySet()) {
28        Map<String, Value> values = status.get(loc);
29        if (values.get("temperature").equals(Temperature.HIGH)
30            && values.get("smoke").equals(Smoke.HIGH)) {
31          fireDetected = true;
32          break;
33        }
34      }
35      setFireASt(fireDetected);
36    }
37  }
Fig. 10. An implementation of the FireASt state component
grant them. Specifically, an enforcer component intercepts a method call and creates a request of the form (controller, action, resource). Such a request is matched against an access control list (ACL) attached to each resource class; this ACL comprises rules of the following form:

(controller, action, resource, [true|false])
Fig. 11. Extended DiaSpec runtime system
When the request matches an ACL entry, the access to the resource is granted depending on the boolean value of the corresponding rule. Globally, the conflict management process is performed in two stages. Statically, potential conflicts are detected based on the taxonomy and architecture declarations. For each detected conflict on a resource, the DiaSpec description is searched to identify state components dedicated to resolving it. This information is used to parameterize the ConflictCtrl component. Dynamically, this component will update the ACL of the enforcer component of all the resources impacted by a state change. The updated ACL is calculated based on the enabled/disabled states and their priorities.
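For illustration, the following minimal sketch (not the actual generated code; all class and method names are invented) shows one way the bookkeeping described above could be organized: a ConflictCtrl-like component recomputes rules from the enabled states and their priorities, and an enforcer matches intercepted requests against those rules.

// Hypothetical sketch of the runtime conflict-management logic described
// above; names (AclEntry, Enforcer, ConflictCtrl) are illustrative only.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AclEntry {
  final String controller, action, resource;
  final boolean granted;
  AclEntry(String c, String a, String r, boolean g) {
    controller = c; action = a; resource = r; granted = g;
  }
}

class Enforcer {
  private final List<AclEntry> acl = new ArrayList<>();

  void updateACL(List<AclEntry> newRules) {   // invoked by ConflictCtrl
    acl.clear();
    acl.addAll(newRules);
  }

  // Intercept a method call: grant it only if a matching rule says so.
  // Denying when no rule matches is an assumption of this sketch.
  boolean grant(String controller, String action, String resource) {
    for (AclEntry e : acl)
      if (e.controller.equals(controller) && e.action.equals(action)
          && e.resource.equals(resource))
        return e.granted;
    return false;
  }
}

class ConflictCtrl {
  // Recompute access rights when a state is enabled or disabled: among the
  // controllers competing for a resource, only the controller associated
  // with the highest-priority enabled state is granted access.
  List<AclEntry> computeRules(Map<String, Integer> enabledStatePriorities,
                              Map<String, String> controllerToState,
                              String action, String resource) {
    String winner = null;
    int best = Integer.MIN_VALUE;
    for (Map.Entry<String, String> e : controllerToState.entrySet()) {
      Integer prio = enabledStatePriorities.get(e.getValue());
      if (prio != null && prio > best) { best = prio; winner = e.getKey(); }
    }
    List<AclEntry> rules = new ArrayList<>();
    for (String controller : controllerToState.keySet())
      rules.add(new AclEntry(controller, action, resource,
                             controller.equals(winner)));
    return rules;
  }
}

In this sketch, the static stage corresponds to building the controllerToState map from the DiaSpec declarations, and the dynamic stage corresponds to calling computeRules and updateACL whenever a state component changes value.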
5 Evaluation

To assess the usability of our approach, we applied it to the building management system. This case study was particularly interesting because it had been specified in DiaSpec and implemented prior to the development of our approach. As a result, it could serve as a reference implementation and as a basis to be extended with our conflict-handling approach. We focus on the comprehensibility and reusability of conflict-managing code, and on the ability to detect conflicts. To test the correct behavior of both implementations, original and extended, we used our pervasive computing simulator, DiaSim [1]. The original building management system was developed by members of our group who have expert knowledge in DiaSpec. They acted as architect, developer, and administrator, and used their expertise to solve the foreseeable conflicts. The lack of proper support made them resort to ad hoc strategies to resolve resource conflicts. For example, to prevent three different controllers from conflicting in accessing doors, they had to introduce, in the taxonomy, a dedicated action on the door resource for each kind of controller. This action would essentially mimic our conflict resolution strategy, taking a state as a parameter and determining whether to grant access to the door. In contrast to our approach, this ad hoc technique requires structuring the taxonomy with respect to conflict-handling concerns and polluting the application code with conflict-handling logic. With our approach, adding a new application to an existing system requires declaring and implementing an additional system state component, if a new state is needed. In this case, the new system state component is independent of the other components, apart from the new priority level to be introduced.
6 Related Work

Conflicts are a major problem in a variety of domains. For example, in telecommunications, Keck and Kühn show that feature interaction is an exponential problem that appears when new services are added to an existing system [10]. This problem maps directly to pervasive computing: their services and features correspond to our applications and their actions on resources. Calder and Miller [2] use the Spin model checker to analyze telecommunication systems. To do so, a system (services and features) is modeled in Promela and checked against temporal properties. Our approach circumvents the feature interaction
problem by relying on existing system specifications and conflict-handling declarations provided by the domain expert and the application developer. There exist different strategies to resolve conflicts in pervasive computing environments. The idea of proactively changing access control on resources is also used by Gupta et al. [8]. They present a criticality-aware access control approach that is only studied as a conceptual model. In contrast, we cover conflict management throughout the development lifecycle: from design to programming to runtime. Haya et al. assign a priority to every operation [9]. The priority is calculated by a central component using information about the current state, the caller, and the type of operation. In comparison, our approach incurs little overhead for resource invocations because the enforcer component is coupled with the resource, preventing any central component from becoming a bottleneck. The work closest to ours is that of Retkowitz and Kulle [11]. They use the notion of dependency management for handling resource conflicts. It is exemplified in the context of smart homes, where it allows fine-grained configuration of a conflict-aware middleware. It is designed so that a user can interact with the system and set priorities for different applications. In comparison, our approach is not limited to the home automation domain and addresses conflict handling throughout the development lifecycle. Tuttlies et al. have a different approach to resolving conflicts. They propose to describe the side effects of an application on the physical environment [15]. Additionally, each application states what it considers a conflict. As a result, they can detect conflicts between interfering applications. Devising and applying a suitable strategy is left to the application developer. In contrast, we aim for system-wide conflict management to allow system-wide reasoning.
7 Conclusion

In this paper we have presented a domain-specific approach to architecturing conflict handling of resources. This approach covers the development lifecycle of a pervasive computing system. Our approach includes the detection of potential conflicts, their resolution, and their prevention at runtime. We extended an ADL to add the information that is required for these three stages of conflict management. This information is used to generate code that guides and supports the implementation of conflict management. We have introduced new tasks dedicated to conflict management in the development process of a pervasive computing system. In the resulting process, application and conflict-handling code are cleanly separated. Furthermore, our approach to conflict management is incremental and modular, preserving the independence between applications. This facilitates the reuse of applications and makes conflict management easier to understand and verify. To prevent conflicts, our implementation enforces an ACL for each pervasive computing resource. These ACLs are proactively updated based on the current system state. Currently, our approach treats conflicts for classes of resources. This strategy applies to situations where applications act on all instances of a class (e.g., the emergency application unlocks all doors). We plan to extend this approach by introducing a perimeter (e.g., a room, a floor) in the conflict management declaration of an application.
We also plan to expand our model to include the access rights for users. Access control is a major problem in pervasive computing environments, since it must handle physical and virtual objects at the same time [12,14]. A related research direction is to integrate user preferences into our model to resolve certain types of conflicts, as proposed by Shin et al. [13].
References

1. Bruneau, J., Jouve, W., Consel, C.: DiaSim, a parameterized simulator for pervasive computing applications. In: International Conference on Mobile and Ubiquitous Systems (2009)
2. Calder, M., Miller, A.: Feature interaction detection by pairwise analysis of LTL properties - A case study. Formal Methods in System Design 28(3), 213-261 (2006)
3. Cassou, D., Balland, E., Consel, C., Lawall, J.: Architecture-driven programming for sense/compute/control applications. In: International Conference on Systems, Programming, Languages, and Applications: Software for Humanity (2010)
4. Cassou, D., Balland, E., Consel, C., Lawall, J.: Leveraging software architectures to guide and verify the development of sense/compute/control applications. In: International Conference on Software Engineering (2011)
5. Cassou, D., Bertran, B., Loriant, N., Consel, C.: A generative programming approach to developing pervasive computing systems. In: International Conference on Generative Programming and Component Engineering (2009)
6. Dey, A.K., Abowd, G.D., Salber, D.: A conceptual framework and a toolkit for supporting the rapid prototyping of context-aware applications. Human-Computer Interaction 16(2), 97-166 (2001)
7. Gatti, S., Balland, E., Consel, C.: A step-wise approach for integrating QoS throughout software development. In: Giannakopoulou, D., Orejas, F. (eds.) FASE 2011. LNCS, vol. 6603, pp. 217-231. Springer, Heidelberg (2011)
8. Gupta, S.K.S., Mukherjee, T., Venkatasubramanian, K.: Criticality aware access control model for pervasive applications. In: International Conference on Pervasive Computing and Communications (2006)
9. Haya, P.A., Montoro, G., Esquivel, A., García-Herranz, M., Alamán, X.: A mechanism for solving conflicts in ambient intelligent environments. J. UCS 12(3), 284-296 (2006)
10. Keck, D.O., Kuehn, P.J.: The feature and service interaction problem in telecommunications systems: A survey. IEEE Transactions on Software Engineering 24, 779-796 (1998)
11. Retkowitz, D., Kulle, S.: Dependency management in smart homes. In: Senivongse, T., Oliveira, R. (eds.) DAIS 2009. LNCS, vol. 5523, pp. 143-156. Springer, Heidelberg (2009)
12. Sampemane, G.: Access Control For Active Spaces. PhD thesis, University of Illinois (2005)
13. Shin, C., Dey, A.K., Woo, W.: Mixed-initiative conflict resolution for context-aware applications. In: International Conference on Ubiquitous Computing, pp. 262-271 (2008)
14. Tolone, W., Ahn, G.-J., Pai, T., Hong, S.-P.: Access control in collaborative systems. ACM Computing Surveys 37, 29-41 (2005)
15. Tuttlies, V., Schiele, G., Becker, C.: Comity - conflict avoidance in pervasive computing environments. In: International Workshop on Pervasive Systems (2007)
16. Weiser, M.: The computer for the twenty-first century. Scientific American 265, 94-104 (1991)
Passive Network-Awareness for Dynamic Resource-Constrained Networks

Agoston Petz1, Taesoo Jun2, Nirmalya Roy3, Chien-Liang Fok1, and Christine Julien1

1 The University of Texas at Austin, Austin, Texas, USA
{agoston,liangfok,c.julien}@mail.utexas.edu
2 Samsung, Korea
[email protected]
3 Institute for Infocomm Research, Singapore
[email protected]
Abstract. As computing becomes increasingly mobile, the demand for information about the environment, or context, grows in importance. Applications must adapt to the changing environment; however, acquiring the necessary context information can be very expensive because collecting it usually requires communication among devices. We explore collecting reasonably accurate context information passively: we define passively sensed context metrics based on network overhearing, which adds no communication cost. We use this framework to build a small suite of commonly used context metrics and evaluate, using simulation, the quality with which they can reflect ground truth. We also provide an implementation of this passive sensing framework for Linux, which we evaluate on a real mobile ad hoc network using mobile autonomous robots with commodity 802.11b/g wireless cards.
1 Introduction
The increasing ubiquity of small, mobile computing devices has introduced applications that find themselves in constantly changing environments, requiring adaptation. Research in context-aware computing has created applications that adapt to location (e.g., in tour guides [1]), time (e.g., in reminder applications [6]) and even weather conditions (e.g., in automated field-note taking [17]). Several toolkits provide abstractions for accessing such context information [7, 11]. While adaptation to physical characteristics is the most obvious use of context-awareness, the ability to respond to the condition of the network is just as crucial. Network-awareness is especially important for protocol adaptation as it allows communication protocols to change their behavior in response to the immediate network conditions or the available network resources. Network context can also be used directly by applications, for example to change the fidelity of the data transmitted when the available bandwidth changes. Traditional means of measuring context are active in that they generate extra control messages or require nodes to exchange meta-information. Metrics
that report message latency require nodes to exchange ping messages, measuring the latency these messages experience. Traditional measures for determining the degree of mobility in a mobile network require nodes to periodically exchange location and velocity information. The extra network traffic these mechanisms generate places an increased burden on the already taxed network, making it difficult to justify the use of context-awareness in the common case. If the overhead of sensing context information can be reduced, the benefit of the availability of the information is increased. We define a framework for defining passively sensed context metrics based on network eavesdropping (Section 3). Our approach focuses on sensing context with zero additional communication overhead. Our context metrics do not provide the exact measure of context that their active counterparts may provide, but we demonstrate that their fidelity matches that of traditional measures of context. We use this framework to create instantiations of three common network context measures (Section 4). For each of these metrics, we evaluate the specificity of the passively sensed context metric with respect to the ground truth (Section 5). Our work shows that passive sensing of network context can inexpensively provide information about the state of the world and can, especially when these metrics are correlated with each other, enable adaptive applications in environments where traditional active context sensing is cumbersome.
2 Related Work
The demand for adaptive mobile applications indicates the need for efficient context-awareness. Much work has focused on supporting software engineering needs through frameworks and middleware that provide programming abstractions for acquiring and responding to context. For example, Hydrogen [10] defines a completely decentralized architecture for managing and serving context information. Hydrogen’s abstractions are unconcerned with how context is sensed; clearly, performing context acquisition efficiently is important to the success of such a framework. Many other projects have also looked at reducing the cost of context sensing. Several of these take an application-oriented perspective, identifying what high-level information the application desires and only acquiring information necessary to support an application’s desired fidelity [26]. SeeMon [13] reduces the cost of context by only reporting changes in context; other time- and event-based approaches also limit overhead this way [8]. Many existing projects provide network context-awareness through dedicated software that sends and receives control messages [4], for example by separating characteristics sensed about the wireless portion of a mobile network from those sensed about the wired portions. The approach does not apply to infrastructureless networks and incurs communication overhead to sense context. There is also a need for network-awareness in mobile agent systems [3]. When supporting mobile agents, however, network-awareness concerns are different due to the fact that an agent’s notion of “connectivity” does not necessarily match the network’s provision of physical connections. Our work focuses on applications that require awareness of local network conditions.
Active network monitoring has been explicitly separated from passive network monitoring. Komodo [22] defines passive context sensing as any mechanism that does not add network overhead. Komodo requires knowledge of the entire network (even, and especially, network links not currently in use), so the project implements an active sensing approach. Given that we focus on mobile networks based on wireless communication, we promote an approach that takes advantage of the inherent broadcast nature of communication, passively gathering information about links that may not be present at the application level. Passive measurement of network properties has been explored in a scheme that uses perceived signal strength to adapt a routing protocol [2]. This approach requires that nodes are able to easily discern the signal strength of incoming packets and requires nodes to send periodic “hello” messages to monitor their neighbor set, which adds network overhead. A different approach monitors packet traffic to provide routing protocols information about packets dropped at the TCP layer [27]. This information allows protocols to more quickly respond to route failures. We undertake a similar approach in this work but focus on gathering a local measure of network properties instead of boosting performance on a particular end-to-end flow. These related projects lay the foundation for our work in developing a comprehensive framework for passively sensing network context information. These previous projects have demonstrated 1) a need for context information to enable adaptive communication protocols and applications; 2) a requirement for the acquisition of context to be extensible and easy to incorporate into applications; and 3) a desire to accomplish both of the above with low network communication overhead.
3 Defining Passive Context through Eavesdropping
In this section, we introduce a framework for adding passive context sensing into mobile computing architectures. A schematic of the architecture is shown in Fig. 1. The physical and MAC implementations handle packet reception and transmission. Our framework inserts itself in two places: first between the MAC layer and the routing layer, and second above the routing layer before the application. The former point serves as an interceptor that allows eavesdropping on existing communication. The information overheard through this interceptor will be used to infer various context metrics as described below. The portion of the framework inserted between the routing and application layers exposes the passively sensed context information to the application, enabling it to adapt to the current context.

Fig. 1. Architecture for Passive Context Sensing
Existing Network Communication in MANETs. Passive sensing can benefit from information exchanged by existing MANET routing protocols, so it is useful here to provide a brief explanation of their functionality. Unicast routing in MANETs requires every node to serve as a router and can be either table-driven or on-demand. In Destination Sequenced Distance Vector Routing (DSDV) [19], a table-driven protocol, each node maintains a table containing the best known distance to each destination and the next hop to use. These tables are updated by periodically exchanging information among neighbors, generating a fairly constant overhead that is independent of the amount of useful communication. On-demand algorithms like Ad hoc On-demand Distance Vector Routing (AODV) [20] and Dynamic Source Routing (DSR) [12] determine routes only when a packet needs to be sent. These protocols broadcast a route request that propagates to the destination. A reply is returned along the same path. AODV stores routing information in tables on each node, and the tables are updated via periodic exchanges among neighbors. In DSR, the packet carries the routing information, and no beaconing is required. Each approach has advantages and disadvantages [23]; details are omitted as they are not the focus of this paper. Protocols also exist that add hierarchy [18] or use location to assist routing [14]. In general, the routing protocols share several characteristics: they all generate control messages to discover and maintain routes, and they all transport data packets across established routes. A common control message is generated when a node detects that a path has been broken due to a failed link; the detecting node commonly sends a route error packet to its active predecessors for that destination. In the passive metrics we devise, we will use the discovery, data, and route error messages generated by MANET routing protocols to infer various types of network context information.

Passive Metrics: Some Examples. The following three metrics each measure a dynamic condition of the physical or network environment. In all three cases, the sensed information can be useful to communication protocols that adapt their transmission rates or patterns, and to applications that adapt high-level behaviors.

Network load. The simplest metric in our passive metric suite provides a direct measure of the local traffic on the network. Adapting to this information, applications can prioritize network operations, throttling communication of low importance when the network traffic is high. Communication protocols can also change routing or discovery heuristics in response to changing amounts of network traffic to avoid collisions.

Network density. In dynamic networks, a node's one-hop neighbors can constantly change, and applications can adapt their behavior in response. When the number of neighbors is high, common behaviors can increase collisions and therefore communication delay, while when the number of one-hop neighbors is low, conservative communication can lead to dropped packets and loss of perceived connectivity. To most easily measure the local network density, nodes exchange periodic hello messages with one-hop neighbors. While some protocols already incur this expense, adding proactive behavior to completely reactive
protocols can be expensive. We devise a metric for passively sensing network density regardless of the behavior of the underlying protocol(s).

Network Dynamics. Our final example passive metric measures the mobility of a node relative to its neighbors. Traditional measures of relative mobility require nodes to periodically exchange velocity information. We approximate this notion of relative mobility by eavesdropping on communication packets to discern information about links that break. We show how this simple and efficient metric can correlate well with the degree of physical mobility in dynamic mobile ad hoc networks.

The Specificity of Passive Metrics. A major hurdle in passively sensing context information is ensuring that the quality of the measurement sensed passively (or the context specificity) closely approximates the value that could have been sensed actively at increased cost. This may differ from the actual value of the context metric, since even active metrics may not exactly reflect the state of the environment. For each of the passive metrics we define, we generate its context specificity by comparing its performance to a reasonable corresponding active metric (if one exists). This not only allows us to determine whether the particular passive metric is or will be successful, but it also helps us tune our approaches to achieve better specificity.

Adaptation Based on Passive Metrics. One of the most important components of our framework is its ability to make passively sensed context information available to applications and network protocols. As shown in Fig. 1, we provide an interface that delivers passively sensed context directly from the sensing framework. We provide a simple event-driven approach and allow applications to request that a fidelity level be associated with context reports to indicate how confident the sensing framework is in the passively sensed context measure in question.
4 Building Common Context Metrics
To acquire context information at no network cost and little computation and storage cost, we created a passive network suite in C++. Our implementation takes network packets received at a node, "intercepts" them and examines their details, all without altering the packets or their processing by the nodes. Our implementation also provides an event-based interface through which applications can receive information about passively sensed context. We describe the concrete architecture and implementation of our passive metric suite and look in detail at the specifics of our three sample metrics.
4.1 Implementing Passive Metrics
Fig. 2 depicts our implemented passive context sensing framework. Solid arrows represent the movement of packets. Specifically, packets no longer pass directly from the radio to the MAC layer or from the MAC layer to the network layer;
instead they first pass through the passive context sensing framework. Dashed arrows indicate potential uses of the passively sensed context in the form of event registrations and callbacks. In our passive sensing suite, the interceptor (passive sensing in the
figure) eavesdrops on every received packet. For each of the passively sensed metrics, the framework generates an estimate of the metric's value based on the information from the data packets at a specified time interval, ν. This time interval can be different for each passively sensed metric depending on its sensitivity in a particular environment. To define a passive metric, a new handler for the metric must be provided that can parse a received packet. The handler defines its own data structures to manage the necessary storage between estimation events. When any packet is intercepted, a copy is passed to the handler for each instantiated metric, and the handler updates its data structures. Each new metric must also define an estimator that operates on the context information stored in the metric's data structure and generates a new estimate. When the passive framework is instantiated, each metric is provided a time interval for estimation (ν). The framework then calls the metric's estimator every ν time steps to generate a new metric estimate. Larger intervals result in lowered sensing overhead (in terms of computation) but may result in lower quality results (as discussed in Section 5).

Fig. 2. Concrete Architecture for Passive Context Sensing
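A schematic rendering of this handler/estimator contract is sketched below; it is written in Java purely for readability (the actual suite is implemented in C++), and all type and method names are invented for illustration.

import java.util.ArrayList;
import java.util.List;

// Each metric supplies a handler (onPacket) and an estimator (estimate),
// mirroring the structure described above; names are illustrative only.
interface PassiveMetric {
  void onPacket(byte[] packet);  // invoked with a copy of every overheard packet
  double estimate();             // invoked every nu seconds by the framework
}

class PassiveSensingFramework {
  private final List<PassiveMetric> metrics = new ArrayList<>();

  void register(PassiveMetric m) { metrics.add(m); }

  // Called by the interceptor for every packet overheard at the MAC layer.
  void intercept(byte[] packet) {
    for (PassiveMetric m : metrics) m.onPacket(packet);
  }

  // Called on a timer every nu seconds; the resulting estimates could be
  // pushed to registered application callbacks at this point.
  void estimationTick() {
    for (PassiveMetric m : metrics) {
      double value = m.estimate();
      // notify registered listeners with 'value' (omitted in this sketch)
    }
  }
}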
4.2 The Passive Metrics
For each metric, our interceptor takes as input the sensed context value at time t and the estimated value at time t − ν and creates an estimate of the next value of the time series. For each metric, this results in a moving average in which previous values are discounted based on a weight factor γ provided for each metric. When γ is 0, a new estimate for time t is based solely on information sensed in the interval [t − ν, t].

Network Load. Network load can be sensed directly by measuring the amount of traffic the node generates and forwards. The network load metric's handler eavesdrops on every received packet, logging the packet's size in a buffer. To generate an estimate, the metric's estimator function simply totals the number of bytes seen in the interval ν and adjusts the moving average accordingly. Specifically, the network load metric nl_i of a node i is defined as the total of the sizes of the packets that the node has seen within a given time window [t − ν, t]:

nl_i(t) = γ nl_i(t − ν) + (1 − γ) nl_i^m(t − ν)
where nl_i^m(t − ν) denotes the total size of packets seen by the node in the time interval [t − ν, t] (i.e., the measured value).

Network Density. Our second metric measures a node's network density, or its number of neighbors. This metric's handler examines each packet and logs the MAC address of the sender. When the estimator is invoked at time t, it tallies the number of unique MAC addresses logged during [t − ν, t]. The network density of a node i is estimated by calculating the number of distinct neighbors of the node:

nd_i(t) = γ nd_i(t − ν) + (1 − γ) nd_i^m(t − ν)

where nd_i^m(t − ν) calculates the number of distinct neighbors observed in the previous time window. Node i was isolated during [t − ν, t] when nd_i^m(t − ν) = 0.

Network Dynamics. Our third metric captures the relative dynamics surrounding a particular node. This metric is, to some degree, a measure of how reliable the surrounding network is. In our previous work, we have shown that we can approximate this notion by eavesdropping on communication packets to discern link quality [24]. A node can do this by observing the quality of the received packets directly or by looking at the semantics of packets that indicate link failures. In the former case, a node observes packets transmitted by neighboring nodes to determine the link quality lq_i^j, which is a normalized representation ∈ [0, 1] of the quality of the link from node j to node i:

lq_i^j(t) = γ lq_i^j(t − ν) + (1 − γ) lq_i^{j,avg}(t − ν)

where lq_i^{j,avg}(t − ν) calculates the average of the link quality values of the packets received from node j in the current window. In our implementation in the next section, instead of directly measuring link quality, we rely on the presence of route error packets in the communication protocol to indicate faulty links. The metric's handler eavesdrops on every packet, counting those indicating route errors. When the context estimator is invoked, it returns the number of route error packets seen per second in the time interval [t − ν, t]:

lq_i^j(t) = γ lq_i^j(t − ν) + (1 − γ) nre_i^{j,m}(t − ν)

where nre_i^{j,m}(t − ν) is the number of route error packets from j in [t − ν, t].
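To make these definitions concrete, the following sketch implements the network load and network density estimators exactly as the formulas above prescribe. It is an illustrative Java rendering only (the authors' suite is written in C++), and the class and method names are invented.

import java.util.HashSet;
import java.util.Set;

// Illustrative estimators: gamma-weighted moving averages over a window of
// nu seconds, as defined by the formulas above. Names are invented.
class NetworkLoadMetric {
  private final double gamma;
  private double estimate = 0.0;    // nl_i(t)
  private long bytesInWindow = 0;   // nl_i^m, measured over [t - nu, t]

  NetworkLoadMetric(double gamma) { this.gamma = gamma; }

  void onPacket(int packetSizeBytes) { bytesInWindow += packetSizeBytes; }

  double estimateAndReset() {       // called every nu seconds
    estimate = gamma * estimate + (1 - gamma) * bytesInWindow;
    bytesInWindow = 0;
    return estimate;
  }
}

class NetworkDensityMetric {
  private final double gamma;
  private double estimate = 0.0;                                // nd_i(t)
  private final Set<String> sendersInWindow = new HashSet<>();  // nd_i^m

  NetworkDensityMetric(double gamma) { this.gamma = gamma; }

  void onPacket(String senderMacAddress) { sendersInWindow.add(senderMacAddress); }

  double estimateAndReset() {
    estimate = gamma * estimate + (1 - gamma) * sendersInWindow.size();
    sendersInWindow.clear();
    return estimate;
  }
}

The network dynamics estimator would follow the same pattern, counting overheard route error packets per window instead of bytes or sender addresses.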
5 Evaluating Passive Context Sensing
We evaluated our passive context sensing by integrating it with the OMNeT++ network simulator [25] and the INET framework. We describe the nature of this evaluation and present some results. In the next section, we translate this simulation prototype into a concrete implementation on real devices.

Evaluation Settings. We simulated two different environments, both consisting of 50 nodes distributed in 1000m × 1000m. In the first situation, every node
attempted to ping a sink (node 0) every 10 seconds (with a ping packet size of 56 bytes). This creates relatively symmetric traffic over the field. In the second situation, one randomly selected node opened a UDP stream to one other randomly selected node, sending five 512-byte packets every second. In this situation, the traffic is asymmetric; nodes in the path of traffic tend to have more packets to overhear and therefore better information about the passively sensed metrics. We used both AODV and DSR to provide routing support; we did not find statistically significant differences between the two approaches, so for consistency's sake we report results using AODV. We ran five different node mobility settings: in the first case, all nodes were stationary; in the others, nodes moved at speeds evenly distributed around 2 m/s, 4 m/s, 8 m/s, and 16 m/s. We used the random waypoint mobility model for node movement with a pause time of 0 seconds. In these initial evaluations, we aimed to see, in the simplest cases, whether passively sensing context was a viable alternative to active sensing. For this reason, we did not perform any smoothing of the results over time (i.e., all of the weighting factors (γ) were set to 0). As a result, only measurements made in the time interval [t − ν, t] were used to estimate the passive metric at time t. We experimented with five different values of ν: 10, 25, 50, 75, and 100 seconds. For all graphs, we calculated 95% confidence intervals; in most cases, they are too small to see.
Sensing Network Load. In our first metric, we intercepted every received packet and added its size to calculate the average load over the time window [t − ν, t]. The specificity of this metric is exact; we are measuring the metric instead of estimating it. Fig. 3 shows the load measurements our passive framework generated for the PING application for the five different values of ν, plotted as a function of increasing node speed. This figure simply serves to demonstrate the load information we were able to observe. The load increases as speed increases due to the overhead involved in repairing broken routes. It does level out at higher speeds; the reason is discussed below when we examine the network dynamics more carefully.

Fig. 3. Network Load Passive Metric
Sensing Network Density. Our node density metric estimates the number of neighbors of a node at time t, based on the observed packet senders over [t − ν, t]. This metric's specificity relates to its ability to correctly identify the number of unique neighbors a node has at time t. Therefore, we calculated the neighbor error rate as:

ner_i(t) = |nd_i(t) − nn_i(t)| / nn_i(t)
Fig. 4. Network Density Metric
Fig. 5. Network Dynamics Metric
where nn_i(t) is the actual number of neighbors i had at time t, retrieved from an oracle. Fig. 4 plots the neighbor error rate for the PING scenario for the five different values of ν. This metric was the most sensitive to the size of ν. Especially at higher speeds, a wider sensing interval led to very poor estimates of network density. Even with our smallest tested value of ν, the estimation error at node speeds of 16 m/s was almost 17%. However, as discussed below in correlating our passive metrics, this was not always the case; correlating this metric with the network load can lead to a better understanding of the reliability of the estimate. The results for estimating the neighbor density in the UDP application scenario were, on average, significantly worse. Many nodes in the network saw very little network traffic, so they had very little information to base their neighbor density estimates on. Again, correlating these estimates with the amount of traffic a node observes can provide better reliability, as discussed below.

Sensing Network Dynamics. Our final metric relied on overhearing route error packets from either DSR or AODV to estimate the rate of link breaks, which we used to estimate the relative mobility of a node and its neighbors. The ground truth we compare with is the actual average speed of the nodes as set in the simulation settings. Fig. 5 plots the rate of error packets against the average speed of the nodes for the five different values of ν. The expected relationship holds for the lower sets of node speeds (up to 4 m/s). However, at the higher node speeds, the relationship degrades. We conjecture that this may be a result of border effects in our simulation environment in conjunction with the long lag between sending PING packets; these two together may cause the fast-moving nodes to disconnect but reconnect without a link break being detected by our passive metric. We look at accounting for these errors in the next section, when we correlate passively sensed metrics with each other. While this appears to be a disappointing result, Fig. 5 compares this metric with an oracle. A corresponding active metric in which neighboring nodes periodically exchange speed information would add to the network traffic. As shown in Fig. 3, the network load in these highly dynamic situations is already high, and this would lead to a similar degradation in the estimated context value as well.
Fig. 6. Correlating Passive Metrics: (a) Neighbor Error Rate and Load; (b) Neighbor Error Rate and Route Error Rate
However, from the perspective of applying a passive metric to these dynamic environments for sensing network dynamics, we argue that the measure we provide in our passive metric (i.e., the rate at which nodes experience errors in delivering their data packets) is itself a useful measure of the network quality. Therefore, while this passive metric for network dynamics does not show complete specificity with its corresponding oracle, the metric does provide useful information about the quality with which the network can support communication. Correlating Passive Metrics. Fig. 6(a) shows how one metric can be used to assess the quality of another. This figure shows results for the UDP experiment with ν = 50 for both load and network density. The chart plots the network load observed by nodes based on the node’s neighbor error rate. The nodes that more correctly estimated their neighbor density (to the left of the figure) were more likely to have seen more network traffic than nodes that were more incorrect in their neighbor estimates. Similarly, Fig. 6(b) shows the correlation between the neighbor error rate and the route error rate for the same experiment. The results are fairly intuitive; the nodes that experienced fewer route errors were more likely to be correct about their estimate of the number of neighbors. These results motivate applications to use multiple passively sensed metrics in conjunction since information from one metric can provide an indication as to how reliable estimates from another metric are.
6 Implementation of Passive Context Sensing
We implemented the passive sensing metrics using the Click Modular Router [15], a flexible, modular, and stable framework for developing routers and network protocols, and we evaluated our implementation on autonomous robots from the Pharos testbed [9]. The following sections describe our implementation, the Pharos testbed, and our experimental setup and results.
Implementation in Click. The Click framework is written in C/C++, runs on Linux and BSD, and includes components to manipulate the network stack from hardware drivers to the transport layer. A Click implementation of a network stack consists of a number of separate elements (modular components that each operate on network packets) connected in a way that provides the desired functionality. We implemented three such elements, PCS Load, PCS Density, and PCS Dynamics, which implement the three passive sensing metrics described in Section 4.2. Each element also has an external handler to allow other elements or processes to retrieve the computed context value. We have made our implementation available for download at http://mpc.ece.utexas.edu/passivesensing. Fig. 7 shows the configuration we used in our experiments. The three passive sensing elements are connected such that all inbound packets are copied and processed by all three elements; the copy of the packets is then discarded. Although it is possible to configure Click to run as a kernel module so it can process the original packets instead of copying them to userspace, this was an unnecessary optimization for our experiments.

Fig. 7. Click Passive Sensing Implementation

The Pharos Testbed. To fully evaluate our passive sensing implementation, we used the Pharos testbed [9], a highly capable mobile ad hoc network testbed at the University of Texas at Austin that consists of autonomous mobile robots called Proteus nodes [21]. We used eight of the Proteus nodes shown in Fig. 8. Each robot runs Linux v2.6 and is equipped with an x86 motherboard and an Atheros 802.11b/g wireless radio. The robots navigate autonomously using their onboard GPS and a digital compass.

Experimental setup. In addition to the passive context sensing suite, each node was running the AODV routing protocol [16] implementation from Uppsala University and sent UDP beacons to every other node at 1s or 10s intervals (depending on the run). This beaconing was independent of the passive sensing suite and simply provided network load. We used two mobility patterns: a short pattern (shown in black in Fig. 9), which took about 5 minutes to complete, and a long pattern (shown in yellow), which took about 10 minutes (waypoints were generated using http://www.gpsvisualizer.com). Each pattern had a series of longer jumps punctuated by two series of tight winding curves. The robots were started 30 seconds apart and drove at 2 m/s (though this varied based on course corrections and imperfect odometer calculations), and the winding curves were designed to trap several robots in the same area to ensure the formation of a dynamic ad hoc network.
Fig. 8. The Proteus Node
Fig. 9. Waypoints for Experiments
To ensure occasional link-layer disconnections in our 150m × 200m space, we turned the transmit power on the radios down to 1 dBm (using the MadWiFi stack, http://madwifi.org/).

Results. We ran several experiments; a comprehensive analysis of all of the experiments is not possible in this paper (the raw data as well as videos of the individual experiments can be found at http://pharos.ece.utexas.edu/wiki/index.php/Missions). Fig. 10 shows values of the passively sensed metrics for one robot navigating the longer mobility pattern with 1s beacon intervals, the weight factor (γ) set to 0, and the time interval ([t − ν, t]) set to 10 seconds to better show the instantaneous context values. Fig. 11 shows a different run with seven robots, the beacon interval set to 10s (instead of 1s), and with each robot initiating a 1 MB file transfer to one randomly chosen destination every 10 seconds. Seventy-eight total file transfers were attempted, of which 43 succeeded and 35 eventually timed out or were interrupted. Although the raw data is not extremely meaningful in isolation, it does show the degree of variation of context observed by a single node even in a small experiment. There are obvious correlations between node density and load, and between node density and network dynamics, that were evident in our real-world tests; some of this can be seen in the figures as well.

Comparing real-world results to simulation. To compare the real-world experiments to the simulated results, we took the recorded GPS trace of the Proteus nodes' exact location/time and created trace files that were compatible with OMNeT++. In this way, we could simulate the exact mobility pattern executed by the robots, including variation from the intended waypoints due to GPS and compass error, steering misalignment, and speed corrections. Figs. 12 and 13 show the simulated results for the same node as Fig. 10. We used the same simulation setup as in the previous section, but we set the simulated transmit power to 0.001 mW in order to simulate the same number of neighbors on average for each node; this value of 0.001 mW was empirically determined by comparing simulations with the observed number of neighbors from the real-world experiments.
Fig. 10. 8 nodes, 1s beacons, no file-tx
Fig. 11. 7 nodes, 10s beacons, file-tx
Fig. 12. Sim. vs. real world density
Fig. 13. Sim. vs. real world route errors
We were able to correlate the node density between simulation and the real world well on average, but the number of route error packets seen by the nodes differs significantly. We assume this is due to inaccuracies in the wireless model used in the OMNeT++ simulator. We have demonstrated that passive context sensing can be implemented in the real world, and we have designed and built a Click-based passive context suite and made our implementation available to the research community. We have also compared the real-world tests with simulations using GPS data from the Proteus nodes to recreate the same mobility traces executed during the experiments.
7 Discussion and Future Work
In this section, we briefly discuss some lessons learned and future directions for our passive context sensing suite. Passive Sensing Sensitivity. We found that the different metrics could be sensitive to different parameters to different degrees. For example, the network density metric was highly sensitive to the value of ν. For our existing metrics and any newly introduced metrics, determining the best values for these parameters will be required and may be the most difficult challenge of passive sensing. However, our work with correlating passively sensed metrics with each other offers
promise. Specifically, we showed that correlating the network density with the load can indicate to an application when to “trust” the network density metric; i.e., when the node experiences a higher load, its neighbor density estimate is more likely to be correct. Similarly, we expect that using passively sensed metrics to adjust other metrics’ sensitivities may be useful. For example, when a passively sensed network dynamics metric indicates a high degree of dynamics, the node’s passively sensed network density metric should likely use a smaller value of ν to achieve better results.
Extending the Passive Context Suite. Our passive context suite is straightforward to extend. Defining a metric requires determining the signature of the packets to be overheard, the information required from those packets, and how the metric should be defined over those overheard values. As an example, consider adding a link stability metric based on MAC layer information. This new metric utilizes properties of MAC packets to estimate the reliability of a communication link. Upper-layer protocols and applications can use this information to adapt, for example, by decreasing traffic on low-reliability links to reduce the overall probability of collisions. This new passively sensed metric assumes the IEEE 802.11 MAC protocol and its Distributed Coordination Function (DCF), in which, before sending a data packet to another node, the sender performs Request to Send (RTS)/Clear to Send (CTS) handshaking, thereby reserving the shared medium. After receiving a valid CTS packet, the sender and receiver exchange data and acknowledgment packets. In a mobile environment, a node can encounter failures in either of these exchanges for several reasons. To recover from these failures, a node simply retransmits the data following the same procedures until it reaches a retry limit. The retry statistics for a node’s outgoing links are reported in the MAC Management Information Base (MIB). A new interceptor in our passive sensing framework can access the MIB at the end of every ν interval to acquire information about the stability of the node’s links. The MIB contains two pertinent values in this case: dot11FailedCount, which tallies the number of discarded packets, and dot11TransmittedFrameCount, which tallies the number of successfully delivered packets. We can easily define a passive link stability metric that estimates the probability of a successful packet delivery over one hop from node i, p_i^s:

    p_i^s(t) = γ · p_i^s(t − ν) + (1 − γ) · N_s^MIB / (N_s^MIB + N_f^MIB)

where N_f^MIB and N_s^MIB are dot11FailedCount and dot11TransmittedFrameCount, respectively.
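As a rough sketch of how such an interceptor could maintain this estimate, the fragment below (C++, with hypothetical names; the actual accessor to the MIB counters depends on the 802.11 MAC implementation in use) applies the exponentially weighted update above once per ν interval, using the counter increments observed during the last interval as Ns and Nf:

    #include <cstdint>

    // Hypothetical snapshot of the two MIB counters, read at the end of each nu interval.
    struct MibSample {
        uint64_t failedCount;            // dot11FailedCount (discarded packets)
        uint64_t transmittedFrameCount;  // dot11TransmittedFrameCount (delivered packets)
    };

    // Link stability estimate for one node, updated once per nu interval.
    class LinkStabilityMetric {
    public:
        explicit LinkStabilityMetric(double gamma) : gamma_(gamma), ps_(1.0) {}

        // Apply p_s(t) = gamma * p_s(t - nu) + (1 - gamma) * Ns / (Ns + Nf),
        // where Ns and Nf are the counter increments observed since the previous sample.
        void update(const MibSample& prev, const MibSample& cur) {
            const double nf = static_cast<double>(cur.failedCount - prev.failedCount);
            const double ns = static_cast<double>(cur.transmittedFrameCount - prev.transmittedFrameCount);
            if (nf + ns > 0.0)  // keep the previous estimate if no frames were sent
                ps_ = gamma_ * ps_ + (1.0 - gamma_) * (ns / (ns + nf));
        }

        double value() const { return ps_; }

    private:
        double gamma_;  // weight of the history term
        double ps_;     // current estimate of one-hop delivery probability
    };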
This new passive metric is easily inserted into our existing OMNeT++ framework simply by examining the MIB storage located in the 802.11 MAC implementation. The implementation of the passive context suite in Click is similarly easy to extend.
Adapting to Passively Sensed Context. We have made our passively sensed context metrics available through an event-based interface. Upper-layer protocols and applications can register to receive notifications of changes in passively
sensed context metrics and adapt in response. We have already begun integrating passively sensed context into a pervasive computing routing protocol, Cross-Layer Discovery and Routing (CDR) [5], in which we use this passively sensed context to adjust the proactiveness of the communication protocol in response to sensed network dynamics. In highly dynamic situations, the protocol avoids proactiveness due to the overhead incurred in communicating information that rapidly becomes outdated. However, in less dynamic environments, some degree of proactiveness makes sense to bootstrap on-demand communication.
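As an illustration of this register-and-adapt pattern, the sketch below uses hypothetical interface names (ContextEvents, onNetworkDynamicsChanged) and illustrative thresholds; it is not the actual API of our suite or of CDR:

    #include <functional>

    // Hypothetical event interface of the passive context suite: a callback fires
    // whenever the passively sensed network-dynamics metric changes.
    struct ContextEvents {
        std::function<void(double)> onNetworkDynamicsChanged;
    };

    // Illustrative adapter: a routing protocol disables proactive beaconing when the
    // network is highly dynamic and beacons more often when it is stable.
    class ProactivenessAdapter {
    public:
        explicit ProactivenessAdapter(ContextEvents& events) {
            events.onNetworkDynamicsChanged = [this](double dynamics) {
                proactiveBeacons_  = dynamics < 0.7;                // illustrative threshold
                beaconIntervalSec_ = dynamics < 0.3 ? 2.0 : 10.0;   // beacon faster when stable
            };
        }
        bool   proactiveBeacons() const { return proactiveBeacons_; }
        double beaconIntervalSec() const { return beaconIntervalSec_; }

    private:
        bool   proactiveBeacons_  = true;
        double beaconIntervalSec_ = 2.0;
    };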
8 Conclusions
Mobile and pervasive computing applications must integrate with and respond to the environment and the network. Previous work has demonstrated 1) a need for context information to enable this expressive adaptation; 2) the ability to acquire context information with little cost; 3) the ability to easily integrate new context metrics as they emerge; and 4) software frameworks that ease the integration of context information into applications and protocols. In this paper, we have described a framework that achieves all of these goals by enabling the passive sensing of network context. Our approach allows context metrics to eavesdrop on communication in the network to estimate network context with no additional overhead. We have shown that our framework can be easily extended to incorporate new metrics and that the metrics we have already included show good specificity for their target active metrics in both simulation and a real network deployment. We have provided implementations of the passive sensing metrics for both the OMNeT++ simulator and for Linux nodes using the Click Modular Router. Additionally, we have shown that applications can even adapt the context sensing framework by correlating the results of multiple passively sensed context metrics. This information enables adaptive applications and protocols in environments where active approaches are infeasible or undesirable due to the extra network traffic they generate.
References
1. Abowd, G., Atkeson, C., Hong, J., Long, S., Cooper, R., Pinkerton, M.: Cyberguide: A mobile context-aware tour guide. ACM Wireless Netw. 3, 412–433 (1997)
2. Basu, P., Khan, N., Little, T.: A mobility based metric for clustering in mobile ad hoc networks. In: Proc. of ICDCS Workshops, pp. 413–418 (2001)
3. Caripe, W., Cybenko, G., Moizumi, K., Gray, R.: Network awareness and mobile agent systems. IEEE Comm. Magazine 36(7), 44–49 (1998)
4. Cheng, L., Marsic, I.: Piecewise network awareness service for wireless/mobile pervasive computing. Mobile Netw. and App. 7(4), 269–278 (2004)
5. Dalton, A., Julien, C.: Towards adaptive resource-driven routing. In: Proc. of Percom (2009)
6. Dey, A., Abowd, G.: CybreMinder: A context-aware system for supporting reminders. In: Thomas, P., Gellersen, H.-W. (eds.) HUC 2000. LNCS, vol. 1927, pp. 172–186. Springer, Heidelberg (2000)
7. Dey, A., Salber, D., Abowd, G.: A conceptual framework and a toolkit for supporting the rapid prototyping of context-aware applications. Human Computer Interaction 16(2-4), 97–166 (2001)
8. Ferscha, A., Vogl, S., Beer, W.: Ubiquitous context sensing in wireless environments. In: Distributed and Parallel Systems: Cluster and Grid Computing. The Springer Int’l Series in Eng. and Comp. Science, vol. 706, pp. 98–106 (2002)
9. Fok, C.-L., Petz, A., Stovall, D., Paine, N., Julien, C., Vishwanath, S.: Pharos: A testbed for mobile cyber-physical systems. Technical Report TR-ARiSE-2011-001, AriSE, University of Texas at Austin (2011)
10. Hofer, T., Schwinger, W., Pichler, M., Leonhartsberger, G., Altmann, J., Regschitzegger, W.: Context-awareness on mobile devices: the hydrogen approach. In: Proc. of HICSS (2003)
11. Hong, J., Landay, J.: An infrastructure approach to context-aware computing. Human Computer Interaction 16(2-4) (2001)
12. Johnson, D., Maltz, D., Broch, J.: DSR: The dynamic source routing protocol for multihop wireless ad hoc networks. In: Ad Hoc Netw., pp. 139–172 (2001)
13. Kang, S., Lee, J., Jang, H., Lee, H., Lee, Y., Park, S., Park, T., Song, J.: SeeMon: Scalable and energy-efficient context monitoring framework for sensor-rich mobile environments. In: Proc. of Mobisys, pp. 267–280 (2008)
14. Karp, B., Kung, H.: GPSR: Greedy perimeter stateless routing for wireless networks. In: Proc. of Mobicom, pp. 243–254 (2000)
15. Morris, R., Kohler, E., Jannotti, J., Kaashoek, M.F.: The click modular router. SIGOPS Oper. Syst. Rev. 33(5), 217–231 (1999)
16. Nordstrom, E., Lundgren, H.: AODV-UU implementation from Uppsala University
17. Pascoe, J.: Adding generic contextual capabilities to wearable computers. In: Proc. of ISWC, pp. 92–99 (1998)
18. Pei, G., Gerla, M., Hong, X., Chiang, C.-C.: A wireless hierarchical routing protocol with group mobility. In: Proc. of WCNC, pp. 1538–1542 (1999)
19. Perkins, C., Bhagwat, P.: Highly dynamic destination-sequenced distance-vector routing (DSDV) for mobile computers. In: Proc. of SIGCOMM, pp. 234–244 (1994)
20. Perkins, C., Royer, E.: Ad-hoc on-demand distance vector routing. In: Proc. of WMCSA, pp. 90–110 (1999)
21. http://proteus.ece.utexas.edu
22. Ranganathan, M., Acharya, A., Sharma, S., Saltz, J.: Network-aware mobile programs. In: Milojicic, D., Douglis, F., Wheeler, R. (eds.) Mobility: Processes, Computers, and Agents, pp. 567–581 (1999)
23. Royer, E., Toh, C.: A review of current routing protocols for ad hoc mobile wireless networks. IEEE Personal Comm. Magazine 6(2), 46–55 (1999)
24. Srinivasan, R., Julien, C.: Passive network awareness for adaptive mobile applications. In: Proc. of MUCS, pp. 22–31 (May 2006)
25. Vargas, A.: The OMNeT++ discrete event simulation system. In: Proc. of ESM, pp. 319–324 (2001)
26. Wang, Y., Lin, J., Annavaram, M., Jacobson, Q.A., Hong, J., Krishnamachari, B., Sadeh, N.: A framework of energy efficient mobile sensing for automatic user state recognition. In: Proc. of Mobisys, pp. 179–192 (2009)
27. Yu, X.: Improving TCP performance over mobile ad hoc networks. In: Proc. of Mobicom, pp. 231–344 (2004)
Utility Driven Elastic Services
Pablo Chacin and Leandro Navarro
Departament d’Arquitectura dels Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
[email protected]
(This work has been partly supported by Spanish MEC grant TIN2010-20140-C03-01.)
Abstract. To address the requirements of scalability it has become a common practice to deploy large-scale services over infrastructures of non-dedicated servers, multiplexing instances of multiple services at a fine-grained level. This tendency has recently been popularized thanks to the utilization of virtualization technologies. As these infrastructures become more complex, large, heterogeneous and distributed, a manual allocation of resources becomes infeasible and some form of self-management is required. However, traditional closed-loop control mechanisms seem unsuitable for these platforms. The main contribution of this paper is the proposal of an Elastic Utility Driven Overlay Network (eUDON) for dynamically scaling the number of instances of a service to ensure a target QoS objective in highly dynamic large-scale infrastructures of non-dedicated servers. This overlay combines an application-provided utility function to express the service’s QoS, with an epidemic protocol for state information dissemination, and simple local decisions on each instance to adapt to changes in the execution conditions. These elements give the overlay robustness, flexibility, scalability and a low overhead. We show, by means of simulation experiments, that the proposed mechanisms can adapt to a diverse range of situations like flash crowds and massive failures, while maintaining the QoS objectives of the service.
Keywords: Web Services, QoS, Epidemic Algorithm, Overlay, Self-adaptive.

1 Introduction
Modern large scale service-oriented applications frequently address unexpected situations that demand a rapid adaptation of the allocated resources, like flash crowds –that require a quick allocation of additional resources– or massive hardware failures –that require the re-allocation of failed resources. At the same time, applications are expected to maintain certain QoS objectives in terms of attributes like response time and execution cost [14]. To address these requirements, it has become a common practice to deploy services over large scale non-dedicated infrastructures – e.g. shared clusters –
on which servers are dynamically provisioned/decommissioned to services in response to workload variations. In comparison, in a traditional enterprise infrastructure the scale-up process takes a long time and requires manual intervention, and therefore over-provisioning services to handle such situations is the common practice. Chandra et al. [7] demonstrated that fine-grained multiplexing at short timescales – in the order of seconds to a few minutes – combined with fractional server allocation leads to substantial multiplexing gains over coarse-grained reallocations. To accomplish this fine-grained multiplexing, it is necessary to have mechanisms to allocate/deallocate servers efficiently and then be able to manage those servers in a very dynamic environment with a high turn-over of servers. As these infrastructures become more complex, large, distributed and heterogeneous, some sort of self-adaptive [18] (also known as autonomic [12]) capabilities are needed to stabilize their performance within acceptable limits despite variations in the load and resources, recover from failures, and optimize them according to business objectives. However, as noted in [15], traditional closed-loop self-adaptation approaches are of limited applicability in the scenarios described above, as they make a set of restrictive assumptions: a) the entire state of the application and the resources is known/visible to the management component; b) the adaptation ordered by the management component is carried out in full and in a synchronized way; and c) the management component gets full feedback of the results of changes made on the entire system. In contrast, in a large-scale, wide-area system getting global system knowledge is infeasible and coordinating adaptation actions is costly. Additionally, each server may belong to different management domains – different sites in an organization, external providers – with different optimization objectives. Two additional problems arise for the self-management of non-dedicated servers: the complexity of eliciting a model to predict the effect of the adaptation decisions and drive the adaptation process, and the need for an isolation mechanism to prevent servers from interfering with each other’s performance. The main contribution of this paper is the proposal of an Elastic Utility Driven Overlay Network (eUDON) for dynamically scaling the number of service instances used to support a service, ensuring a target QoS objective in highly dynamic large-scale shared infrastructures. Based on UDON [5], eUDON combines a) an application-provided utility function to express the service’s QoS in a compact way; b) an epidemic protocol for scalable, resilient and low-overhead state information dissemination; c) a model-less adaptive admission policy on each service instance to ensure a QoS objective; and d) mechanisms for the elastic assignment of instances to adapt to fluctuations in the load and recover from failures. All these mechanisms act autonomously on each instance based on local information, making the system highly scalable and resilient.
We show by means of simulation experiments how eUDON adapts to diverse conditions like peaks in the workload and massive failures, maintaining its QoS and using the available resources efficiently. The rest of the paper is organized as follows. Section 2 presents the general model for eUDON and describes in detail the two main mechanisms used to achieve elasticity. Section 3 describes the simulation-based experimental evaluation that explores its behavior under diverse scenarios. Section 4 presents relevant work in the field to put the proposed work into context. Finally, Section 5 presents the conclusions and outlines future work.
2 eUDON: An Elastic Utility Driven Overlay Network
The model for eUDON is shown in Fig. 1.
Fig. 1. Elastic service overlay model
Requests coming from users are processed through a set of entry points, which correspond to segments of users with similar QoS requirements, and must be routed to service instances that offer an adequate QoS. Requests are evenly distributed over the multiple entry points for the same user segment using traditional DNS or L4-level load balancing techniques [4]. It is important to notice that in our work we concentrate on the web application layer and assume that data access, including consistency requirements, is handled by a separate data layer, as proposed in modern highly scalable web architectures [13].
Each service has a utility function that maps the attributes and execution state of a service instance (e.g. response time, available resources, trustworthiness, reliability, execution cost) to a scalar value that represents the QoS it provides. Utility functions allow a compact representation of the QoS requirements for services and facilitate the comparison of the QoS that different instances provide, even if they run on very heterogeneous nodes [11]. The QoS required by a request is defined as the minimum acceptable utility that a service instance must provide to process it. The QoS offered by an instance may vary over time due to, for example, fluctuations in the load or in the available resources of the non-dedicated server it runs on (for example, modern virtualization technologies make it feasible to change resource assignments of execution environments on the fly).
There is a large pool of servers available for diverse services. At any given time, on a subset of those servers there are instances activated to process requests for a service. The active instances are organized in a Service Search Overlay, whose objective is to facilitate finding an instance offering adequate QoS. Among the active instances, a subset capable of processing the current workload is promoted – as described later in section 2.2 – to join the Service Routing Overlay. This overlay is used by the entry points to route requests preserving QoS and achieving load balancing. Instances which are underutilized are demoted and leave the routing overlay but remain in the search overlay. Eventually, instances can be deactivated from the search overlay.
The number of active instances in the search overlay depends on the expected service demand and the level of replication required to ensure resilience to failures as well as to handle short-term increases in the demand. To estimate this number, approaches like those proposed in [17,9] can be applied. In eUDON, this problem is part of our ongoing research and diverse options are being explored. In the rest of this paper we assume the number of active instances is fixed, even though the proposed approach can accommodate variations in this number. The main problem we address is how to keep the number of instances in the routing overlay to a minimum, while keeping the number of hops needed to process requests low and still offering an adequate QoS. It is important to notice that in shared clusters, the active instances which are not promoted for processing requests add little overhead to the cluster. In a cloud scenario, it makes sense to have instances activated for some time even if idle, because of the activation overhead and because computing resources are paid by the hour – and therefore a 15-minute activation costs the same as a full-hour activation.
The routing and search overlays use a push-style epidemic algorithm to maintain their topologies, find new (activated, promoted) instances, remove unavailable (failed, demoted, deactivated) instances, and spread the current state of instances. Periodically, each instance selects a subset of its neighbors (the exchange set) and sends a message with its local view (the neighbor set) and its own current state. When an instance receives this message, it merges it with its current neighbor set and selects the subset with the most recently updated entries as the new neighbor set. In this way, each instance keeps a local view with the most up-to-date information among all the information received from neighbors.
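The following sketch illustrates one possible implementation of this exchange-and-merge step (data structures and names are ours, not eUDON's); each entry carries the advertised state and a freshness timestamp, and the local view is truncated to the most recently updated entries:

    #include <algorithm>
    #include <cstdint>
    #include <random>
    #include <vector>

    struct NeighborEntry {
        int      nodeId;
        double   utility;     // last known state advertised by the node
        uint64_t updateTime;  // logical timestamp; larger means fresher
    };

    constexpr std::size_t kNeighborSetSize = 32;  // local view size
    constexpr std::size_t kExchangeSetSize = 2;   // neighbors contacted per cycle

    // Pick the exchange set: a small random subset of the current view.
    std::vector<NeighborEntry> pickExchangeSet(std::vector<NeighborEntry> view, std::mt19937& rng) {
        std::shuffle(view.begin(), view.end(), rng);
        if (view.size() > kExchangeSetSize) view.resize(kExchangeSetSize);
        return view;
    }

    // Merge a received view with the local one, keeping the freshest entry per node
    // and truncating to the kNeighborSetSize most recently updated entries.
    void mergeView(std::vector<NeighborEntry>& local, const std::vector<NeighborEntry>& received) {
        for (const auto& in : received) {
            auto it = std::find_if(local.begin(), local.end(),
                                   [&](const NeighborEntry& e) { return e.nodeId == in.nodeId; });
            if (it == local.end()) local.push_back(in);
            else if (in.updateTime > it->updateTime) *it = in;
        }
        std::sort(local.begin(), local.end(),
                  [](const NeighborEntry& a, const NeighborEntry& b) { return a.updateTime > b.updateTime; });
        if (local.size() > kNeighborSetSize) local.resize(kNeighborSetSize);
    }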
The next sections describe in detail the two main elements of the model outlined above that provide eUDON with its self-adaptiveness and elasticity.
2.1 Routing
The objective of the routing mechanism is to deliver each request to a service instance that satisfies its QoS requirements, with high probability and a minimal number of routing hops. In eUDON, requests are routed using the unicast algorithm shown in Fig. 2a. On the reception of a request, the routing algorithm uses an admission function to determine if the instance can accept the request for processing. If the request is not accepted, then a ranking function is used to rank the neighbors and select the most promising one as the next hop in the routing process. eUDON uses an adaptive admission function – inspired by the one proposed in the Quorum system [2] – summarized in Fig. 2b. The utility of the service instance is periodically monitored and compared with a target QoS objective, and the size of the admission window is increased or decreased as needed to correct deviations. The only assumption made by this process is that the utility is non-increasing with respect to the load; that is, increasing/decreasing the load lowers/raises the utility, given that the rest of the utility-related attributes remain equal. One significant advantage of this adaptive admission function is that it does not require any model to estimate the future performance of the service. Moreover, it works even when the resources allocated to a service cannot be reserved and therefore the available capacity fluctuates.
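A minimal sketch of such an adaptive admission function is shown below; the additive-increase/multiplicative-decrease policy and the step sizes are illustrative choices, not necessarily those of eUDON or Quorum. A request is admitted only while the number of requests in execution is below the current window:

    #include <algorithm>

    class AdaptiveAdmission {
    public:
        AdaptiveAdmission(double targetUtility, int initialWindow)
            : targetUtility_(targetUtility), window_(initialWindow) {}

        // Called periodically with the instance's currently measured utility.
        void onUtilitySample(double utility) {
            if (utility < targetUtility_)
                window_ = std::max(1, window_ / 2);  // shed load quickly when below target
            else
                window_ += 1;                        // probe for extra capacity slowly
        }

        // Called on request arrival; inFlight is the number of requests in execution.
        bool admit(int inFlight) const { return inFlight < window_; }

    private:
        double targetUtility_;
        int    window_;  // maximum number of concurrent requests accepted
    };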
Fig. 2. Adaptive routing: (a) routing process; (b) adaptive admission
Every overlay uses a different ranking function. The routing overlay uses a round-robin ranking, which has been shown to offer acceptable performance in this context – see [6] for details. When a request is routed beyond a predefined number of hops, it is routed using the search overlay, looking for an active (but not promoted) instance capable of serving it. This search uses a greedy ranking
based on the (last known) utility of the nodes, which was shown in [5] to be highly effective in a wide range of conditions and scenarios.
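For concreteness, the two ranking functions can be as simple as the following sketch (names are ours): a greedy choice of the neighbor with the highest last-known utility for the search overlay, and a cursor-based round-robin for the routing overlay:

    #include <algorithm>
    #include <vector>

    struct Neighbor { int nodeId; double lastKnownUtility; };

    // Greedy ranking used on the search overlay: forward to the neighbor
    // with the highest last-known utility (returns -1 if the view is empty).
    int greedyNextHop(const std::vector<Neighbor>& view) {
        auto it = std::max_element(view.begin(), view.end(),
            [](const Neighbor& a, const Neighbor& b) { return a.lastKnownUtility < b.lastKnownUtility; });
        return it == view.end() ? -1 : it->nodeId;
    }

    // Round-robin ranking used on the routing overlay.
    int roundRobinNextHop(const std::vector<Neighbor>& view, std::size_t& cursor) {
        if (view.empty()) return -1;
        cursor = (cursor + 1) % view.size();
        return view[cursor].nodeId;
    }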
2.2 Promotion and Demotion
The decision to join/leave the routing overlay is taken by each instance autonomously, based both on its local information, like the rate of requests being processed or the server’s utilization, and on aggregate non-local (potentially global) information, like the total workload of the system or the average service rate of other instances. eUDON uses a probabilistic adaptation mechanism implemented by two rules based on the service rate (the number of service requests processed per time interval). An instance promotes itself if its service rate is close to the average of the system and demotes itself if it is offering a service rate below the 25th percentile of all instances. These rules were chosen for their simplicity and because they are easily traceable to the system’s status, rather than for their optimality. However, they exhibit a very acceptable behavior, as shown in the experimental results. The probabilistic nature of the rules leads to a progressive adaptation, preventing many instances from simultaneously taking the same decision, overreacting to a situation and leading to oscillations in the system. An estimate of the global service rate can be obtained by an epidemic aggregation process embedded into the overlay maintenance algorithms [10,8]. In the simulation described in section 3.1, each instance gets an estimate of these values perturbed by an error factor to simulate the estimation error of the distributed aggregation algorithms. The probability for promoting/demoting an instance is given by:

    P(ΔS) = 1 / (1 + e^(k·ΔS))    (1)

where ΔS = (S̄ − S)/S̄ is the deviation of the node’s service rate S from the target service rate S̄, and k is a parameter that adjusts the sensitivity of the probability to the service rate. When calculating the probability for demoting, k > 0, and for promoting, k < 0. Fig. 3 shows this probability for various values of k, with S̄ = 50 for promotions and S̄ = 20 for demotions. As the promotion and demotion rules behave independently of each other, we have added an additional condition to prevent an instance from being continuously promoted/demoted without having the chance to stabilize: instances do not run these rules again for the 3 cycles following a promotion/demotion. This number was empirically obtained by testing multiple options and found to work well in different situations.
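The fragment below sketches how an instance could apply these rules on each adaptation cycle; the global service-rate estimates are assumed to be provided by the aggregation process mentioned above, and all names are ours:

    #include <cmath>
    #include <random>

    // P(dS) = 1 / (1 + exp(k * dS)), with dS = (target - observed) / target.
    double transitionProbability(double observedRate, double targetRate, double k) {
        const double dS = (targetRate - observedRate) / targetRate;
        return 1.0 / (1.0 + std::exp(k * dS));
    }

    struct InstanceState { bool promoted; int cooldownCycles; };

    // One adaptation cycle; promotion uses k < 0 and demotion k > 0, as in the paper.
    void adaptationCycle(InstanceState& s, double myRate, double promoteTarget,
                         double demoteTarget, std::mt19937& rng) {
        if (s.cooldownCycles > 0) { --s.cooldownCycles; return; }  // let the last decision stabilize
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (!s.promoted && coin(rng) < transitionProbability(myRate, promoteTarget, -3.0)) {
            s.promoted = true;  s.cooldownCycles = 3;   // join the routing overlay
        } else if (s.promoted && coin(rng) < transitionProbability(myRate, demoteTarget, 3.0)) {
            s.promoted = false; s.cooldownCycles = 3;   // fall back to the search overlay
        }
    }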
3 Experimental Evaluation
In this section we describe the simulation model and the metrics used for the evaluation, and summarize the results of different experiments developed to analyze how the system adapts to different scenarios.
Fig. 3. Promotion/demotion probability function for diverse values of k: (a) promotion probability; (b) demotion probability (probability vs. arrival rate)
The results presented correspond, unless explicitly indicated otherwise, to the average over 10 simulation runs. Each run simulates 200 seconds (300 for the peak load scenario).
3.1 Simulation Model
We implemented a discrete event simulator for the detailed simulation of the processing of requests, allowing us to capture very detailed measurements of the system’s behavior. Table 1 summarizes the more relevant simulation parameters.
Overlay. We have simulated an idealized network that mimics a large cluster, on which nodes can communicate with a uniform delay. The base experimental setup was 128 overlay nodes, with 8 entry points and 120 service instances (a 1:15 ratio). There is ongoing work – with promising preliminary results for up to 2048 nodes – to experiment with several thousand instances. However, as the instances work exclusively with local information, we expect that the results will hold for larger scales, as was previously shown in [5]. All service instances are initially members of the search overlay, but only a fraction initially join the routing overlay, according to a join probability parameter. The adaptation process dynamically adjusts this fraction according to the conditions (e.g. load). Each instance maintains a neighbor set of 32 nodes and contacts 2 of them periodically to exchange information. These values correspond to the optimal trade-off between information freshness and communication costs, as discussed in [5].
Service Instances. Each service instance dispatches requests using a processor-sharing discipline. This model fits well for web servers like Apache, a well-known and widely used multi-threaded web server, and is amenable to analytical evaluation using an M/G/1/k*PS queuing system [3]. This facilitates the comparison of simulation results with analytical estimates.
Table 1. Simulation parameters
  Servers (128 ... 2048): number of instances
  Entries ratio (1:15): ratio of entry points with respect to the number of instances
  Neighbor set (32): number of neighbors maintained per node in the overlay
  Update cycle (1): frequency of information dissemination (in seconds)
  Exchange set (2): number of neighbors contacted per node on each update cycle
  Adaptation cycle (3): frequency of the adaptation process (in seconds)
  Join probability (0.60): probability of a service instance joining the routing overlay at initialization
  Load maximum (0.5): maximum fraction of a server's capacity used by background load
  Load variability (0.10): maximum variation of background load per second
  QoS (0.7): target utility for requests
  k (−3 promotion, 3 demotion): adaptation probability adjustment constant
  S̄ (50 promotion, 20 demotion): target service rate for promotion/demotion
For instance, the maximum arrival rate that can be processed by the system while maintaining a target response time – as explained below – was estimated using this model.
Arrivals. The service requests arrive following a Poisson distribution and are evenly distributed among the entry points. The arrival rate is calculated using the analytical model for service instances, considering the average background load of servers, to ensure that the allocation of the workload is feasible but demands all the available capacity. Therefore, the maximum theoretical allocated demand is 1.0 and the expected utilization is around 0.95. All requests generated in the tests have the same expected QoS.
Utility Function. In our experiments we use a utility function that relates the utility to the deviation of the response time RT from a target maximum response time RT0:

    U(RT) = ((RT0 − RT) / RT0)^α    (2)

As shown in Fig. 4, the coefficient α controls how quickly the utility decreases as the response time approaches the maximum. This function was selected because it can easily be related to metrics obtained from both the analytical and
simulation models, making it straightforward to predict and measure the impact of the adaptation decisions on the resulting utility. The evaluation of more complex utility functions is the subject of future work. However, it is important to stress that the adaptation process considers only the utility function and makes no explicit reference to the underlying response time. This allows us to generalize the results to other utility functions, given that they satisfy the basic assumption of being non-increasing with the load of the system, as discussed in section 2.1.
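For concreteness, a direct transcription of this utility function is shown below; clamping the result to zero for response times above RT0 is our own assumption, as the behavior beyond that point is not defined above:

    #include <algorithm>
    #include <cmath>

    // U(RT) = ((RT0 - RT) / RT0)^alpha, clamped to [0, 1] so that responses
    // slower than RT0 yield zero utility (assumption).
    double utility(double rt, double rt0, double alpha) {
        const double x = std::max(0.0, (rt0 - rt) / rt0);
        return std::pow(x, alpha);
    }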
Fig. 4. Utility function for α = 0.3 and α = 0.5 (utility vs. response time, up to RT0)
Background Workload. One important aspect in our experiments is the evaluation of the impact of background load on non-dedicated servers, which affects the utility that an instance can provide. This load (defined as a fraction of the node’s computing power) is modeled as a random variable whose value varies over time following a random walk with a given variability. This model is consistent with the observed behavior of the load on shared servers [16,20].
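A possible sketch of such a bounded random walk is shown below (bounds and step distribution are illustrative, not necessarily those of our simulator):

    #include <algorithm>
    #include <random>

    // Background load of a non-dedicated server, as a fraction of its capacity,
    // evolving as a bounded random walk.
    class BackgroundLoad {
    public:
        BackgroundLoad(double initial, double maxLoad, double variability)
            : load_(initial), maxLoad_(maxLoad), step_(-variability, variability) {}

        // Advance one second of simulated time and return the new load.
        double tick(std::mt19937& rng) {
            load_ = std::clamp(load_ + step_(rng), 0.0, maxLoad_);
            return load_;
        }

    private:
        double load_;
        double maxLoad_;                               // e.g. 0.5, as in Table 1
        std::uniform_real_distribution<double> step_;  // e.g. +/- 0.10 per second
    };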
3.2 Metrics
In the evaluation of the proposed mechanisms we have considered the following metrics, which are related to its main objectives of efficiently delivering requests to an appropriate node to maintain an adequate QoS:
Allocated Demand: measures the fraction of the demand that is actually allocated to a server, rather than dropped due to the expiration of its TTL (set to 8 hops in our experiments). This metric measures how effective the system is in allocating requests.
QoS Ratio: the ratio between the actual QoS received by a service request and the target QoS it expects. A ratio below 1.0 means the target was not met, while a ratio over 1.0 means the target was exceeded (which is not necessarily desirable, as it may indicate the server is underutilized).
Utilization: measures the percentage of the node capacity being used, considering both the background load and the load produced by the service requests. This metric is relevant as it measures how efficiently resources are used.
Routing hops: measures the number of hops (or retries) needed to allocate a request to a server with an adequate utility. It measures how efficient the mechanism is in allocating requests.
In the graphics of experimental results in section 3.3, the 25th, 50th and 75th percentiles of the QoS Ratio, Utilization and Routing Hops are presented to show the variability of these metrics. Percentiles 25 and 75 are drawn as a filled curve and percentile 50 as a continuous line.
3.3 Results
In this section we describe the different experiments we made to test the behavior of the system under diverse conditions and usage scenarios.
Base scenario. In the base scenario, a steady workload is submitted to the system that demands all the available capacity to achieve the QoS objective.

Fig. 5. Behavior for base scenario: (a) evolution of an instance; (b) utilization and QoS ratio
Fig. 5a shows how the utility-driven adaptation occurs for an instance. As the background load fluctuates, so does the capacity of the instance – following the adaptive admission described in section 2.1 – to compensate for the change in the available CPU and maintain the QoS ratio close to 1.0. With respect to the overall behavior of the system, as shown in Fig. 5b, it quickly converges to a utility ratio of 1.0, with a small variability. The adaptation process also achieves a high level of system efficiency, with a total utilization of system capacity around 0.9. Additionally, 90% of the maximum theoretical workload is allocated – this figure is maintained in all scenarios – and 75% of requests need at most 3 hops to be allocated (graphics for these results not shown for brevity). These results show that the system is effective in achieving the QoS goals, efficient in the utilization of resources, and imposes a low overhead in terms of the number of hops needed for allocation.
Peak load scenario. In this scenario, the system is initially submitted to a steady workload that demands 70% of the available capacity, but at time 100s an additional load is injected. As can be seen in Fig. 6, the system quickly reacts to the load by promoting more instances. The overall utilization of the system also increases – the 25th percentile of the utilization rises significantly – but the QoS ratio is maintained during this adaptation process. At time 200s, the additional load is removed and the system returns to the previous state, demoting the instances no longer needed.

Fig. 6. Behavior for peak load scenario: (a) injected load and number of instances; (b) utilization and utility ratio
Failure scenario. In this scenario, the system is submitted to a steady workload that demands a fraction of its capacity. At time 100s, 20% of the promoted instances fail – a correlated failure, as expected in clusters. Fig. 7 shows how the system reacts, incorporating more instances until the system stabilizes. As can be seen, the utility ratio is maintained throughout this process – except for a short period just after the failure – as requests are routed to nodes in the search overlay; as a consequence, routing hops increase until all the required nodes are promoted to the routing overlay.
4 Related Work
The elasticity in the allocation of resources to web applications has attracted significant attention from different perspectives. In [9] the dynamic placement of the instances of multiple applications on a set of server machines is formulated as a two-dimensional packing problem, and several heuristics are proposed to optimize the solution by minimizing the number of placement changes while maximizing the balancing of the load across nodes. However, solving this problem has a high computational complexity, severely limiting its scalability.
Fig. 7. Behavior for failure scenario: (a) utilization and utility ratio; (b) number of instances and number of hops
VioCluster [17] uses both machine and network virtualization techniques, which allow a domain in a shared cluster to dynamically grow/shrink based on resource demand by negotiating the borrowing/lending of virtual machines with other domains. This approach has, however, a significant overhead for both the negotiation process and the need to create and start machines. In [19] a utility-related performance metric is used by a request scheduler to order the processing of requests from multiple service classes, so that the resulting aggregate utility is maximized. The main drawback of the proposed scheme is its dependency on cluster-level centralized load balancing, making it impractical at the scales of our systems of interest. It also requires the on-line elicitation of the resource consumption profile of each service – using application-supplied metrics – to adjust the resource allocation, whereas our approach uses model-less adaptation. Closer to our work, in [1] nodes are self-organized and sliced according to an application-defined metric, and the group that represents the ”top” slice is selected to form the application’s overlay. Its main drawback is that, as the nodes’ attributes may change continuously, the slicing must also be continuously updated, an operation that requires the execution of protocols that run over ”epochs” of several update cycles, in the order of several seconds. This requirement makes the approach unsuitable for the scenarios of interest. In our approach, we integrate this process of updating the set of active nodes into the routing process, making it more responsive to changes. Moreover, there is no empirical evidence of the actual performance of the proposed model.
5 Conclusions
We have presented eUDON, an overlay for dynamically scaling services on large-scale infrastructures of non-dedicated servers. The evaluation of the different scenarios shows the ability of eUDON to adapt to changing conditions using only local information and local decisions, while maintaining the QoS objectives and a high utilization level. More importantly, the system is highly scalable and resilient to failures, two characteristics that are critical for systems based on commodity hardware clusters.
A salient feature of eUDON is its model-less adaptation approach, which can be used in scenarios where there is no model to predict the QoS of a service, or where applying such a model is not feasible due to the dynamism of the environment. Moreover, the adaptation does not require any isolation between competing services. One additional advantage of the proposed model is that it unifies different events, like service failure, saturation or demotion, under a single set of simple adaptation mechanisms, simplifying the system design. The results presented are part of a work in progress and there are still diverse aspects to develop. As already mentioned, the continuous adaptation of the number of active instances (the activation/deactivation mechanism) is still an open issue which is being actively researched. We envision using a mechanism similar to the one used for promotion/demotion but triggered at the servers to decide which services to activate/deactivate. Additionally, we use only instantaneous measurements of an instance’s utility for the various adaptation decisions and, in particular, for the admission acceptance window. As a result, even when the average utility ratio is around 1.0 and shows little variability, no guarantees of the type ”95% of requests will have a certain QoS” are currently possible. We are exploring the utilization of a form of summarization of the recent history to offer such guarantees.
References
1. Babaoglu, O., Jelasity, M., Kermarrec, A.M., Montresor, A., van Steen, M.: Managing clouds: a case for a fresh look at large unreliable dynamic networks. ACM SIGOPS Operating Systems Review 40, 3 (2006)
2. Blanquer, J.M., Batchelli, A., Schauser, K., Wolski, R.: Quorum: Flexible quality of service for internet services. In: 2nd Symposium on Networked Systems Design and Implementation (NSDI 2005) (May 2-4, 2005)
3. Cao, J., Andersson, M., Nyberg, C., Kihl, M.: Web server performance modeling using an M/G/1/K*PS queue. In: 10th International Conference on Telecommunications (2003)
4. Cardellini, V., Casalicchio, E., Colajanni, M., Yu, P.S.: The state of the art in locally distributed web-server systems. ACM Computing Surveys 34(2), 263–311 (2002)
5. Chacin, P., Navarro, L., Garcia Lopez, P.: Utility driven service routing over large scale infrastructures. In: Di Nitto, E., Yahyapour, R. (eds.) ServiceWave 2010. LNCS, vol. 6481, pp. 88–99. Springer, Heidelberg (2010)
6. Chacin, P., Navarro, L., Lopez, P.G.: Load balancing on large-scale service infrastructures. Technical Report UPC-DAC-RR-XCSD-2011-1, Polytechnic University of Catalonia, Computer Architecture Department, Computer Networks and Distributed Systems Group (2011)
7. Chandra, A., Goyal, P., Shenoy, P.: Quantifying the benefits of resource multiplexing in on-demand data centers. In: First ACM Workshop on Algorithms and Architectures for Self-Managing Systems (2003)
8. Jelasity, M., Montresor, A., Babaoglu, O.: Gossip-based aggregation in large dynamic networks. ACM Transactions on Computer Systems 23(3), 219–259 (2005)
9. Karve, A., Kimbrel, T., Pacifici, G., Spreitzer, M., Steinder, M., Sviridenko, M., Tantawi, A.: Dynamic placement for clustered web applications. In: Proceedings of the 15th International Conference on World Wide Web, pp. 595–604. ACM, New York (2006)
10. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregate information. In: 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2003) (October 11-14, 2003)
11. Kephart, J.O., Das, R.: Achieving self-management via utility functions. IEEE Internet Computing 11(1), 40–48 (2007)
12. Kephart, J., Chess, M.: The vision of autonomic computing. Computer 31(1), 41–50 (2003)
13. Kossmann, D., Kraska, T., Loesing, S.: An evaluation of alternative architectures for transaction processing in the cloud. In: Proceedings of the 2010 International Conference on Management of Data, SIGMOD 2010, pp. 579–590 (2010)
14. Menasce, D.A.: QoS issues in web services. IEEE Internet Computing 6(6), 72–75 (2002)
15. Nallur, V., Bahsoon, R., Yao, X.: Self-optimizing architecture for ensuring quality attributes in the cloud. In: Joint Working IEEE/IFIP Conference on Software Architecture 2009 & European Conference on Software Architecture, WICSA/ECSA 2009 (2009)
16. Oppenheimer, D., Chun, B., Patterson, D., Snoeren, A.C., Vahdat, A.: Service placement in a shared wide-area platform. In: USENIX Annual Technical Conference, pp. 273–288 (2006)
17. Ruth, P., McGachey, P., Xu, D.: VioCluster: Virtualization for dynamic computational domains. In: IEEE International Conference on Cluster Computing (2005)
18. Salehie, M., Tahvildari, L.: Self-adaptive software: Landscape and research challenges. ACM Transactions on Autonomous and Adaptive Systems 4(2), 42 (2009)
19. Shen, K., Tang, H., Yang, T., Chu, L.: Integrated resource management for cluster-based internet. In: 5th Symposium on Operating Systems Design and Implementation (2002)
20. Yang, L., Foster, I., Schopf, J.: Homeostatic and tendency-based CPU load predictions. In: Guo, M. (ed.) ISPA 2003. LNCS, vol. 2745, p. 9. Springer, Heidelberg (2003)
Improving the Scalability of Cloud-Based Resilient Database Servers
Luís Soares and José Pereira
University of Minho
{los,jop}@di.uminho.pt
(Partially funded by project ReD (PDTC/EIA-EIA/109044/2008) and FCT PhD scholarship (SFRH/BD/31114/2006).)
Abstract. Many now rely on public cloud infrastructure-as-a-service for database servers, mainly by pushing the limits of existing pooling and replication software to operate large shared-nothing virtual server clusters. Yet, it is unclear whether this is still the best architectural choice, namely when cloud infrastructure provides seamless virtual shared storage and bills clients on actual disk usage. This paper addresses this challenge with Resilient Asynchronous Commit (RAsC), an improvement to a well-known shared-nothing design based on the assumption that a much larger number of servers is required for scale than for resilience. We then compare this proposal to other database server architectures using an analytical model focused on peak throughput and conclude that it provides the best performance/cost trade-off while at the same time addressing a wide range of fault scenarios.
Keywords: Database servers, cloud computing, scalability, resilience.
1 Introduction
There is a growing number of organizations taking advantage of infrastructure-as-a-service offered by public cloud vendors. In fact, multi-tiered applications make it easy to scale out upper layers across multiple virtual servers, as they are mostly “embarrassingly parallel” and stateless. The scalability, availability, and integrity bottleneck is still the database management system (DBMS) that holds all non-volatile state. Although there has recently been a call to rethink databases from scratch, leading to NoSQL databases such as Amazon SimpleDB, this challenge is still being addressed by pushing existing SQL database server clustering to the limit. The simplest approach, since it doesn’t require explicit support from the DBMS and addresses only availability, is failover using a virtual volume provided by the cloud infrastructure, such as Amazon Elastic Block Storage (EBS). A more sophisticated approach is provided by Oracle Real Application Cluster (RAC) [1], also backed by a shared volume, but leveraging multiple hosts for parallel processing. An alternative is a shared-nothing cluster with a middleware controller
built also with any off-the-shelf DBMS and allowing parallel processing with many common workloads. This is the approach of C-JDBC [2]. In addition, a certification-based protocol, such as Postgres-R [3], can further improve performance by allowing execution of update transactions by a single server. The trade-off between scalability and resilience implicit in each of these architectures is however much less clear. Namely, different options on how update transactions are executed lead to potentially different peak throughput for a specific resource configuration, in terms of available CPUs and storage bandwidth, with a write-intensive load. Moreover, state corruption, namely upon a software bug, can have different impacts in different architectures. Avoiding the severe impact on availability when the only resort is to recover from backups is very relevant. Answering these questions requires considering how each architecture handles updates and to what extent different components of the database management system are replicated independently, thus leading to logical and/or physical replicas of data. The contribution of this paper is therefore twofold. First, we propose Resilient Asynchronous Commit (RAsC), an improvement to a certification-based database replication protocol that decouples effective storage bandwidth from the number of servers, allowing the cluster to scale in terms of peak throughput. We then reevaluate different architectural aspects for clustering with an analytical model that relates each of the described architectures with resilience and scalability metrics. The rest of this paper is structured as follows: In Section 2 we provide the background on clustering architectures. Section 3 proposes an optimization on one of the clustering architectures. Section 4 introduces the model. Section 5 compares different architectures within this model, justifying the relevance of the proposed contribution. Finally, Section 6 concludes the paper.
2 Background
Transaction processing in a relational database management system is usually regarded as a layered process [4]. At the top, SQL is parsed. The resulting syntax tree is then fed to the optimizer, which uses a number of heuristics and statistical information to select the best strategy for each relational operator. The resulting plan is then executed by calling into the logical storage layer. In this paper, we use a simplified view of transaction processing as a two-layer process, as depicted in Figure 1(a): we consider parsing, optimization, and planning as the Processing Engine (PE) and logical and physical storage management as the Storage Engine (SE). This maps, for instance, to MySQL’s two-layer architecture with pluggable storage engines. Note that assertive faults at the PE and SE levels, which lead to erroneous results, have very different impacts. At the SE level, they may invalidate basic assumptions of the physical layout and of the transactional recovery mechanisms and lead to invalid data. This can only be recovered by taking the server off-line and, possibly, only by restoring from backup copies. Faults at the PE level will still be contained within transaction boundaries and can be recovered by undoing affected transactions, manually or from undo logs.

Fig. 1. Standalone and clustered servers: (a) standalone; (b) Shared Disk Failover (SDF); (c) Shared Disk Parallel (SDP); (d) Shared Nothing Active (SNA); (e) Shared Nothing Certification-Based (SNCB)
The key defining architectural decision of database server clustering architectures is the amount of sharing that occurs, which defines at which level coordination happens and what layers (PE and/or SE) are replicated or shared. This determines not only the resulting scalability and resilience trade-off, but also the applicability of each architecture to an off-the-shelf database server. We now examine four representative architectures.
Shared Disk Failover (SDF). Cluster management software ensures that the DBMS server is running in only one of the nodes attached to a shared disk, often using a Storage Area Network (SAN). If the currently active node crashes, it is forcibly unplugged and the server is started on a different node. The standard log recovery procedure ensures the consistency of on-disk data, thus it is applicable to any DBMS. A variation of this approach can be built without a physically shared disk by using a volume replicator such as DRBD [5]. Otherwise, disk redundancy is ensured by a RAID configuration. This architecture is thus targeted exclusively at tolerating server crashes and is often deployed in a simple two-server configuration. As depicted in Figure 1(b), coordination exists only outside the DBMS, ensuring that the shared volume is
mounted by exactly one server. It is impossible to use the standby nodes even for distributing read-only load, as cache coherence issues would arise if the volume was mounted by multiple nodes. Since replication is performed at the raw disk level, neither the PE nor the SE is replicated in updates, and no tolerance to corruption is provided.
Shared Disk Parallel (SDP). Allowing multiple nodes to concurrently access the same shared storage requires that caches are kept consistent. In detail, the ownership of each block changes through time, in particular whenever a write operation is issued. A distributed concurrency control mechanism is thus responsible for handing over the page to the issuing instance, and no I/O is required in this process, even if the page is dirty. Reads are shared by having the owner clone the page whenever a read request is issued. Flushing blocks back to disk is performed by only one replica at a time. As shown in Figure 1(c), coordination is thus performed within the storage engine layer. An example of this architecture is Oracle Real Application Cluster (RAC), which is based on the Oracle Parallel Server (OPS) and Cache Fusion technology [6]. This architecture is thus targeted mainly at scaling the server both in terms of available CPU and memory bandwidth, although it provides the same degree of fault tolerance as SDF, since most of the server stack is still not replicated in update transactions.
Shared Nothing Active (SNA). By completely isolating back-end servers, a middleware layer intercepts all client requests and forwards them to the independent replicas. Scalability is achieved as read-only requests are balanced across available nodes. Only update transactions need to be actively replicated on all replicas. The controller thus acts as a wrapper: it exposes the same client interface as the original server, for which it acts as a client. There is no direct communication between cluster nodes, as coordination is performed outside the servers, as shown in Figure 1(d). A popular implementation is provided by Sequoia, formerly C-JDBC [2], which intercepts JDBC and is portable to multiple back-end servers. The major scalability drawback is that update statements must be fully deterministic and have to be carefully scheduled to avoid conflicts that translate into non-deterministic outcomes of the execution and thus inconsistency. In practice, this usually means not allowing concurrent update transactions at all. This architecture is thus targeted at scaling the server in the face of a mostly read-only workload. By completely isolating back-end servers, it replicates all layers in update transactions and thus tolerates all outlined assertive fault scenarios in both PE and SE. In fact, a portable implementation such as Sequoia even supports DBMS diversity. In principle, it could even support voting to mask erroneous replies based on corrupt state [7].
Shared Nothing Certification-Based (SNCB). Active replication of update transactions in shared-nothing clusters can be avoided by using a certification-based protocol. Each transaction is thus executed in the replica that is directly contacted by the client, without any a priori coordination.
Thence, transactions get locally synchronized, according to the local concurrency control mechanism, and only just before commit is a coordination procedure initiated. At this time, the initiating replica multicasts updates using a totally ordered group communication primitive [8]. This causes all nodes to deliver the exact same sequence of updates, which are then certified by testing for possible conflicts. This leads to the exact same sequence of transaction outcomes, which is then committed independently by each node. Although no commercial products based on this approach exist, there have been a number of related research proposals and prototypes [3,9,10,11]. Since coordination happens between the Processing and Storage Engines (Figure 1(e)), it is capable of fine-grained synchronization, scaling with an update-intensive workload. As a consequence of shared execution, this approach does not tolerate logical corruption; however, it tolerates physical corruption at the storage engine and disk layers. This is a very interesting trade-off, since such logical corruption can be corrected by undoing changes even while the system is on-line.

Fig. 2. Variations of the SNCB architecture: (a) regular; (b) safe; (c) RAsC
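The certification test at the core of this scheme can be sketched as follows (our own simplification: a delivered transaction aborts if its write set intersects the write set of any transaction that committed after the delivered transaction's snapshot):

    #include <cstdint>
    #include <set>
    #include <string>
    #include <vector>

    struct CertifiedTxn {
        uint64_t commitVersion;         // global version at which it committed
        std::set<std::string> writeSet; // identifiers of the items it wrote
    };

    // Returns true if a transaction that read snapshot version 'snapshot' and wrote
    // 'writeSet' can commit, i.e. no concurrent committed transaction wrote the same items.
    bool certify(uint64_t snapshot, const std::set<std::string>& writeSet,
                 const std::vector<CertifiedTxn>& certified) {
        for (const auto& t : certified) {
            if (t.commitVersion <= snapshot) continue;   // not concurrent
            for (const auto& item : t.writeSet)
                if (writeSet.count(item)) return false;  // write-write conflict
        }
        return true;
    }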
3 Resilient Asynchronous Commit
The SNCB architecture thus offers a very interesting trade-off: since assertive faults at the PE can be corrected by undoing changes even while the system is on-line, and it naturally copes with assertive faults at the SE level, it provides many of the advantages of the SNA architecture with a potentially better peak throughput. Traditionally, certification-based protocols use asynchronous group communication primitives for handling message passing between the replicas in the cluster, as shown in Figure 2(a). Thus there is a chance that updates are lost if the originating server’s disk is lost. To improve resilience one can resort to a uniform reliable or safe multicast primitive [8], which gathers a number of acknowledgments from at least fc + 1 nodes prior to delivery (Figure 2(b)), where fc is the upper bound on process faults. This ensures that a number of other nodes have stored the update in memory and, unless the cluster fails catastrophically, it will eventually be committed to all disks [12]. Nonetheless, even if acknowledging transaction commit to the client awaits only for local commit to disk, existing proposals do not distinguish commits that are out of this critical path and will still force updates to disk.
Global site variables
  local = nsynchs = originator = []
  certified = toCommit = ()
  gts = 0
  committing = None

Events at the initiator
  upon onExecuting(tid)
    local[tid] = gts
    continueExecuting(tid)

  upon onCommitting(tid, rs, ws, wv)
    nsynchs[tid] = ()
    tocast(tid, local[tid], rs, ws, wv, myReplicaId)

  upon onAborting(tid)
    continueAborting(tid)

Delivery of updates
  upon tocastDeliver(tid, ts, ws, wv, originatorId)
    foreach (ctid, cts, cws, cwv) in certified do
      if cts ≥ ts and !certification(cws, rs, ws) then
        if local[tid] then dbAbort(tid)
        return
    originator[tid] = originatorId
    add (ctid, cts, cws, cwv) to certified
    isSynch = isSynch(tid)
    enqueue (tid, ws, wv, isSynch) to toCommit
    commitNext()

Transaction commit
  upon onCommitted(tid, isSynch)
    gts = gts + 1
    if !local[tid] then
      if isSynch then
        rsend(tid, myReplicaId, originator[tid])
      continueCommitted(tid)
      committing = None
      commitNext()
    else
      deliverSynchAck(tid, myReplicaId)

  upon deliverSynchAck(tid, replicaId)
    nsynchs[tid] += (replicaId)
    if local[tid] and size(nsynchs[tid]) = fd + 1 then
      delete(local[tid]); delete(nsynchs[tid]); delete(originator[tid])
      continueCommitted(tid)
      committing = None
      commitNext()

  procedure commitNext()
    if committing != None then
      return
    else
      (tid, ws, wv, isSynch) = dequeue(toCommit)
      committing = tid
      if local[tid] then
        continueCommitting(tid, isSynch)
      else
        commitRemote(tid, ws, wv, isSynch)

Fig. 3. Resilient Asynchronous Commit Protocol (RAsC)
This poses an upper bound on database server scale-out, as the storage bandwidth consumed by each replica grows linearly with the size of the workload being handled. Our proposal thus stems from the observation that the substantial storage bandwidth economy resulting from asynchronous commit [13] can also be obtained by sharing the burden of synchronous commit across a large number of replicas. Moreover, the same mechanism should allow waiting for multiple disk commits, such that scenarios with catastrophic failures can be handled. In detail, this means performing an asynchronous commit on n − (fd + 1) nodes (where fd is the number of tolerated disk faults and n the number of replicas), and a synchronous commit elsewhere. Then we defer acknowledgment until the synchronous commits conclude. The resulting protocol (RAsC) is shown in Figure 2(c), in which commit waits for a rotating subset of replicas to commit to disk. Figure 3 details in pseudo-code the proposed Resilient Asynchronous Commit protocol in combination with SNCB. The initiator is the site at which a transaction has been submitted. Handlers, or hooks, are assumed to exist and are called by the DBMS Transaction Manager. A set of interfaces targeting this behavior has been proposed and several prototypes exist [14]. Nevertheless, these hooks are further explained in the next few lines. Before a transaction tid executes its first operation, the onExecuting handler is invoked. The version of the database seen by tid is required for the certification procedure. Since we are considering snapshot isolation, this is equal to the number of committed transactions when
tid begins execution. If the transaction aborts locally at any time, onAborting() is invoked and the transaction is simply forgotten. After a successful local execution, the onCommitting hook is called, causing the updates to be atomically multicast to the group of replicas. This ensures atomic and ordered delivery of transaction updates to all replicas, which happens on the tocastDeliver hook. After delivery, the certification procedure is performed and the fate of the transaction is deterministically and independently decided at every replica. Within this hook, the isSynch() function determines whether a synchronous commit is meant to happen at this replica, i.e., whether it belongs to the rotating quorum for this transaction. The last operation in the hook is a call to the scheduler that issues execution/commit of the next certified transaction (commitNext). Whenever a transaction commit finishes, which happens every time the onCommitted hook is called, the version counter is incremented. For remote transactions, the hook checks if a synchronous commit has been performed and, if so, an acknowledgment is sent back to the initiator replica using a reliable send communication primitive (rsend). Execution resumes by letting the Transaction Manager know that it may proceed (continueCommitted hook) and by scheduling the next certified transaction to commit (commitNext). For local transactions, a call to deliverSynchAck is performed. The deliverSynchAck hook is called every time the initiator receives a synchronous commit acknowledgment from a replica, or once the initiator's own commit finishes (in which case the initiator acknowledges its own synchronous commit). Once all the required synchronous commits have been performed, the continueCommitted hook is called and local execution may resume, which ultimately results in notifying the client that the commit succeeded. A final note concerns myReplicaId, replicaId and originatorId: these identifiers are used for message passing and may be IP addresses, should the replicas reside on different machines, or any other identifiers that uniquely address replica processes.
4  Analytical Model
To select the best architecture for different fault and workload scenarios, and to determine to what extent the Resilient Asynchronous Commit protocol improves the SNCB architecture, we model the amount of computing and storage resources (i.e., CPUs and disks) required to handle a given load while tolerating a number of failures. Depending on the architecture chosen, there are additional parameters; for instance, in a shared-nothing architecture we have n independent nodes. In general, the system cost directly depends on the following parameters: 1. aggregate computing bandwidth (C); 2. aggregate disk bandwidth (D). An architecture is preferable if it allows us to dimension the system tightly, such that there is neither excess C nor excess D, and if it allows the system to be reconfigured in order to accommodate changing requirements separately.
Assumptions. The following assumptions hold in our model. They are backed by assumptions already made in previous work (Gray et al. [15]).
– Each transaction t is made of a number of read (nr) and write (nw) operations (no = nr + nw), and we consider read-only (no = nr) and update (no = nw) transactions;
– Read operations never block because they operate on their own snapshot version of the database [16], hence only updates conflict;
– Read and write operations are equal in disk and CPU bandwidth consumption (dw = dr = do and cw = cr = co), take the same time to complete (to), and each transaction completes execution in a given time tt (tt = to · no);
– The system load (tps) is composed of a mix of update transactions (wtps) and read-only transactions (rtps). These are correlated by a factor wf (wf = wtps / tps). The number of concurrent transactions in the system (nt) is derived from the workload (nt = ntw + ntr = (wtps · nw · to) + (rtps · nr · to));
– The size of the database (s) is the number of objects stored, and item accesses are uniformly distributed;
– Failures exist (f), but they never result in the failure of the entire system;
– n is the number of replicas in the system.

In contrast to previous proposals that model distributed database systems [17,18], we focus on the availability of shared storage resources (space and bandwidth) offered by a cloud infrastructure instead of assuming that storage is proportional to the number of allocated servers.

Resource Bandwidth. We start by modeling the baseline (NONE), which is a centralized database monitor with no disk redundancy. Bounds on system parameters are established by the workload. In a centralized and contention-free system, the disk and CPU bandwidth used by a transaction t are generically expressed by Equations 1 and 2.

dt = drt + dwt = (nr + nw) · do    (1)
ct = crt + cwt = (nr + nw) · co    (2)
An improvement over the baseline system (NONE-R), in terms of disk fault resilience and read performance, is achieved using a RAID storage system (m disks providing redundancy and parallelism). The trade-off lies in the extra disk bandwidth (m − 1) required to replicate blocks of the same data.

dt−none−r = (nr + m · nw) · do    (3)
A different approach altogether would be to use DRBD. This solution replicates data at the block level and provides multi-master replication by delegating conflict detection and handling to a top-layer software component (a clustered file system such as OCFS2 or GFS). Concurrent writes are thus handled, but they are not meant to happen regularly at the DRBD level. Furthermore, when a database is deployed
on top of DRBD, replication is performed in a master-slave (hot standby) fashion. This imposes a limit on resilience, as the number of replicas cannot be higher than two (n = 2). Due to its current resilience limitations we find this architecture rather uninteresting and will not consider it further.

dt−sdf = drt + n · dwt = (nr + 2 · nw) · do    (4)
ct−sdf = crt + n · cwt = (nr + 2 · nw) · co    (5)
SDF limitations may be easily mitigated by architecting a system based on a distributed middleware approach, much like SNA. In this architecture, a middleware controller acts as the load balancer for reads and as the coordinator for writes. Database back-ends are registered at the controller. Reads are performed at only one replica, while writes happen everywhere. This approach is very similar to a RAID-based disk mirroring strategy, but instead of handling raw blocks, logical data representations (e.g., SQL statements) are synchronized and executed at each registered database instance. Equations 6 and 7 model the resource consumption in this setup. Unfortunately, this approach has limited scalability when dealing with write-intensive workloads (or write peaks) and non-deterministic operations.

dt−sna = (nr + n · nw) · do    (6)
ct−sna = (nr + n · nw) · co    (7)
SNCB mitigates the issues exhibited by SNA. We assume independent servers, acting as a replicated state machine on write requests and with perfectly balanced read requests. This is the case for a certification-based approach to replicated databases (e.g., the Database State Machine, DBSM). Given that in a DBSM setting each replica writes the same data to its local storage, disk usage is described by Equation 8 (we assume that the database working set fits in main memory, so we disregard disk usage for read operations). On the other hand, CPU consumption does not increase by a factor of n. In fact, optimistic execution guarantees that a given transaction t executes completely at a single replica, while the others only apply t's changes. Consequently, remote updates consume only a fraction of the original CPU execution for the write operations. This is captured by the correlation factor kapply, which expresses the cost of applying the updates versus executing the original update operations.

dt−sncb = n · nw · do    (8)
ct−sncb = (nr + (1 + kapply · (n − 1)) · nw) · co    (9)
An alternative cluster architecture uses shared storage. We assume that such storage is a RAID unit of m disks. Since we are not accounting for message delays or network bandwidth consumption, the disk and CPU bandwidth are the same as in NONE-R and NONE, respectively. Finally, for all of the above-mentioned architectures, the aggregate C and D bandwidth consumption is calculated as a function of the incoming transaction rate (tps), as shown in Equations 10 and 11, respectively.
C = ct · tps    (10)
D = dt · wf · tps    (11)
Resource Contention. The aggregate CPU and storage bandwidth consumption is driven by the workload (tps). Note that the contention rate of the system also depends on the workload; the aggregate consumption must therefore be calculated taking contention into account. In [15], and under the same assumptions presented here, the generic expression for the system contention rate (number of transactions waiting per second) is given by Equation 12.

tpswait = (1 − (1 − (ntw · nw) / (2 · s))^nw) · tps · wf    (12)

Except for the SNCB and SNA architectures, this equation models transaction blocking perfectly. In SNCB, transactions tend to stay in the system a bit longer than their normal execution time, due to the process of applying remote updates. As Equation 13 shows, ntw increases, and the system becomes more susceptible to conflicts.

ntw−sncb = wf · tps · nw · to · (1 + kapply)    (13)

In SNA, on the other hand, transactions are set to execute sequentially by the controller, which becomes a major bottleneck. Since only update transactions conflict and transaction execution is sequential, the number of transactions waiting in the system is given by Equation 14.

tpswait−sna = (wf · tps)² / ((nw · to) · ((nw · to) − (wf · tps)))    (14)
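The contention model can likewise be transcribed as follows; this is only our reading of Equations 12–14, with names chosen for readability.

# Sketch transcribing Equations 12-14 (contention), using the same symbols as the text.
def tps_wait_generic(tps, wf, nw, to, s, k_apply=0.0):
    # Eq. 13 adjusts the number of concurrent update transactions for SNCB;
    # with k_apply = 0 this is the plain workload assumption.
    ntw = wf * tps * nw * to * (1 + k_apply)
    # Eq. 12: rate of transactions that block waiting on a conflict.
    return (1 - (1 - (ntw * nw) / (2 * s)) ** nw) * tps * wf

def tps_wait_sna(tps, wf, nw, to):
    # Eq. 14: sequential execution of update transactions at the SNA controller.
    return (wf * tps) ** 2 / ((nw * to) * ((nw * to) - (wf * tps)))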
Finally, contention has a negative impact on system performance, which means that the number of committed transactions per second is lower than the input transaction rate. Subtracting the waiting rate from the incoming rate, we obtain the overall system throughput (Equation 15).

tpso = wf · tps − tpswait + (1 − wf) · tps    (15)

5  Evaluation
Strictly on the basis of resilience, one would probably choose the SNA architecture, after making the necessary changes to remove the single point of failure introduced by the controller. In this section, we evaluate the cost of this option in terms of conflict scalability, i.e., how it tackles peak write-intensive loads, and resource scalability, i.e., how it takes advantage of existing resources for performance.

Conflict Scalability. We start by applying the contention model to determine how each architecture scales with workloads that differ in the amount of update transactions and in the number of items updated by each transaction. We do this by fixing an arbitrary offered load and then varying the ratio of update transactions from 0 to 1.
Fig. 4. Impact of item conflict probability in throughput (useful throughput vs. share of update transactions for NONE, SDP, SNA and SNCB; panels: (a) p = 0.025, (b) p = 0.0125, (c) p = 0.00625)

Fig. 5. Scalability of throughput with number of nodes, for different workload mixes (speedup vs. number of nodes N for None, SDP, SNA and SNCB; panels: (a) wf = 0.01, (b) wf = 0.2, (c) wf = 0.8)
Figure 4 shows the results for three different probabilities of single-item conflicts (the p parameter). Previous experiments [19] indicate that TPC-C produces results comparable to Figure 4(c) and agree with the proposed model in terms of the resulting useful throughput in both shared-nothing scenarios. The most interesting conclusion from Figure 4 is that the SNA approach exhibits a sudden saturation point with an increasing number of update transactions, regardless of the likelihood of actual conflicts. This precludes this architecture as a choice when there are concerns about possible write-intensive workload peaks leading to safety issues. On the other hand, one observes that SNCB can approximate the performance of SDP. Neither exhibits the sudden tip-over point, and both should thus be able to withstand write-intensive peak loads. As a final remark, the NONE and SDP lines match perfectly.

Node Scalability. The next step is to apply the bandwidth model to determine how each architecture allows the required resources to scale linearly with increasing throughput. We assume that computing bandwidth is provided in discrete units: to add a unit of CPU bandwidth one has to add one more node to the cluster. This has an impact on shared-nothing architectures, since each additional node requires an independent copy of the data. Figure 5 shows the speedup that can be expected when adding nodes to the cluster. As expected, the SDP architecture should exhibit a perfect speedup if the conflict probability is fixed, as happens for instance with TPC-C scaling rules.
Fig. 6. Required storage bandwidth with number of nodes, for different workload mixes (aggregate storage bandwidth vs. number of nodes N for None, SDP, SNA, SNCB and SNCB-RAsC; panels: (a) wf = 0.01, (b) wf = 0.2, (c) wf = 0.8)
On the other hand, with anything other than the almost read-only load of Figure 5(a), the SNA architecture scales very poorly. This is due to the fact that update transactions have to be actively executed by all nodes, regardless of the contention effect described in the previous section. Finally, Figure 5 shows that the SNCB architecture scales for low values of wf × kapply. This means that efficiency when applying updates, for instance by using a dedicated low-level interface, can offset the scalability obstacle presented by write-intensive loads. Simple testing with the TPC-C workload and PostgreSQL 8.1, without using any dedicated interface, shows that kapply = 0.3, which is the value used henceforth. Dedicated interfaces for applying updates have been implemented a number of times for DBMS replication; for instance, Oracle Streams and Postgres-R provide such interfaces for Oracle and PostgreSQL.

5.1  Disk Scalability
Figure 6 shows the aggregate storage bandwidth required to achieve the maximum theoretical scale-up of Figure 5 while, if possible, tolerating f = 1 faults. Namely, SDP tolerates only disk faults, regardless of the number of nodes in the cluster, while SNA with n > f + 1 tolerates f logical or physical corruption faults. We now consider the following dilemma. Assume that one has 10 nodes in the cluster and 20 disks, each of which provides sufficient bandwidth for 1× throughput. If one chooses the SDP architecture, it is possible to configure the storage subsystem with RAID 1+0 such that the exact bandwidth is achieved (i.e., 10 stripes, 2 copies). This allows 10× the throughput. If one chooses SNCB, one has to opt for at most 2 stripes in each of the 10 copies. This is sufficient, however, for at most 5 nodes (from Figure 6(b)), which results in as little as 4× the throughput (from Figure 5(b)), a 60% performance penalty. Furthermore, one would be tempted to say that SNCB also tolerates f physical corruption faults with n > f + 1. However, certification-based protocols use asynchronous group communication primitives, which jeopardizes that goal. Nevertheless, by using Resilient Asynchronous Commit (RAsC), the available storage bandwidth is enough for the 10 nodes and thus for as much as 6× the throughput, which is 50% more than the standard SNCB configuration. By executing synchronously
only (f + 1)/n of the updates, each of the nodes needs only (f + 1)/n of the previously required bandwidth, which allows scaling up to as much as 25× with typical update-intensive loads. This is shown in Figure 6(c).
6  Conclusion
In this paper we reconsider database server clustering architectures when used with a larger number of servers, in which cost-effectiveness depends on decoupling CPU and disk resources when scaling out. In contrast to previous approaches [3,9,10,11], the Resilient Asynchronous Commit protocol (RAsC) improves write scalability by making better use of resources with a large number of servers on a shared-storage cloud infrastructure, without changes to the DBMS server, while at the same time allowing configurable resilience in terms of the number of durable copies that precede acknowledgment to clients. We then use a simple analytical model to explore the scalability boundaries of the different architectures and how shared resources in a cloud infrastructure can best be allocated. The first conclusion is that the currently very popular SNA architecture, although promising in terms of resilience, should be considered very risky for scenarios exhibiting write-intensive peak loads. The second conclusion is that SNCB approximates SDP in terms of linear scalability with moderately write-intensive loads and does not exhibit SNA's risky sudden drop in performance with heavily write-intensive loads. The critical issue is the parameter kapply in our model: the ratio of CPU bandwidth consumed when applying already executed updates. Finally, together with the proposed RAsC protocol, SNCB also scales in terms of storage bandwidth, especially when a relatively low number of assertive faults is considered.
References

1. Ault, M., Tumma, M.: Oracle Real Application Clusters Configuration and Internals. Rampant Techpress (2003)
2. Cecchet, E., Marguerite, J., Zwaenepoel, W.: C-JDBC: Flexible database clustering middleware. In: USENIX Annual Technical Conference (2004)
3. Kemme, B., Alonso, G.: Don't be lazy, be consistent: Postgres-R, a new way to implement database replication. In: VLDB 2000: Proceedings of the 26th International Conference on Very Large Data Bases, pp. 134–143. Morgan Kaufmann Publishers Inc., San Francisco (2000)
4. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book. Prentice Hall, Englewood Cliffs (2002)
5. Ellenberg, L.: DRBD 8.0.x and beyond: Shared-disk semantics on a shared-nothing cluster. In: LinuxConf Europe (2007)
6. Lahiri, T., Srihari, V., Chan, W., MacNaughton, N., Chandrasekaran, S.: Cache fusion: Extending shared-disk clusters with shared caches. In: Apers, P.M.G., Atzeni, P., Ceri, S., Paraboschi, S., Ramamohanarao, K., Snodgrass, R.T. (eds.) Very Large Data Bases (VLDB) Conference, pp. 683–686. Morgan Kaufmann, San Francisco (2001)
7. Gashi, I., Popov, P., Strigini, L.: Fault diversity among off-the-shelf SQL database servers. In: International Conference on Dependable Systems and Networks, June 28-July 1, pp. 389–398 (2004)
8. Chockler, G.V., Keidar, I., Vitenberg, R.: Group communication specifications: A comprehensive study. ACM Computing Surveys 33(4), 427–469 (2001)
9. Pedone, F., Guerraoui, R., Schiper, A.: The database state machine approach. Distributed and Parallel Databases 14(1), 71–98 (2003)
10. Wu, S., Kemme, B.: Postgres-R(SI): Combining replica control with concurrency control based on snapshot isolation. In: ICDE 2005: Proceedings of the 21st International Conference on Data Engineering, pp. 422–433. IEEE Computer Society, Washington, DC, USA (2005)
11. Elnikety, S., Dropsho, S., Pedone, F.: Tashkent: Uniting durability with transaction ordering for high-performance scalable database replication. In: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems, EuroSys 2006, pp. 117–130. ACM, New York (2006)
12. Grov, J., Soares, L., Correia Jr., A., Pereira, J., Oliveira, R., Pedone, F.: A pragmatic protocol for database replication in interconnected clusters. In: 12th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2006), Riverside, USA (2006)
13. Kathuria, V., Dhamankar, R., Kodavalla, H.: Transaction isolation and lazy commit. In: IEEE 23rd International Conference on Data Engineering, ICDE 2007, pp. 1204–1211 (2007)
14. Correia, A., Pereira, J., Rodrigues, L., Carvalho, N., Vilaca, R., Oliveira, R., Guedes, S.: GORDA: An open architecture for database replication. In: Sixth IEEE International Symposium on Network Computing and Applications, NCA 2007, July 12-14, pp. 287–290 (2007)
15. Gray, J., Helland, P., O'Neil, P., Shasha, D.: The dangers of replication and a solution. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM SIGMOD Record 25(2), pp. 173–182. ACM Press, New York (June 1996)
16. Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., O'Neil, P.: A critique of ANSI SQL isolation levels. In: ACM SIGMOD International Conference on Management of Data, pp. 1–10. ACM, New York (1995)
17. Bernabé-Gisbert, J.M., Zuikeviciute, V., Muñoz-Escoí, F.D., Pedone, F.: A probabilistic analysis of snapshot isolation with partial replication. In: Proceedings of the 2008 Symposium on Reliable Distributed Systems, pp. 249–258. IEEE Computer Society, Washington, DC, USA (2008)
18. Elnikety, S., Dropsho, S., Cecchet, E., Zwaenepoel, W.: Predicting replicated database scalability from standalone database profiling. In: Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys 2009, pp. 303–316. ACM, New York (2009)
19. Correia Jr., A., Sousa, A., Soares, L., Pereira, J., Moura, F., Oliveira, R.: Group-based replication of on-line transaction processing servers. In: Maziero, C.A., Gabriel Silva, J., Andrade, A.M.S., Assis Silva, F.M.d. (eds.) LADC 2005. LNCS, vol. 3747, pp. 245–260. Springer, Heidelberg (2005)
An Extensible Framework for Dynamic Market-Based Service Selection and Business Process Execution

Ante Vilenica¹, Kristof Hamann¹, Winfried Lamersdorf¹, Jan Sudeikat², and Wolfgang Renz²

¹ Distributed Systems and Information Systems, Department of Informatics, University of Hamburg, http://vsis-www.informatik.uni-hamburg.de/
² Multimedia Systems Laboratory, Hamburg University of Applied Sciences, http://cms-server.ti-mmlab.haw-hamburg.de/
Abstract. Business applications in open and dynamic service markets offer great opportunities for consumers as well as for providers of services. However, the management of related business processes in such environments requires considerable (often still manual) effort. Specific additional challenges arise in highly dynamic environments, which may lead, e.g., to service failures or even to the complete disappearance of partners and, consequently, a need to reconfigure related processes. This work aims at generic software support for addressing such challenges, mainly by providing an extensible negotiation framework which is capable of performing the tasks of service selection and service execution automatically. Its technical basis consists of augmented, reusable and highly autonomous service components that can be tailored towards the specific needs of each business process. In addition, the implementation of the negotiation framework includes a simulation component which offers convenient means to study the outcome of different settings of the business environment a priori. Keywords: Service-Oriented Computing, Autonomous Components, Negotiation Framework.
1  Introduction
State-of-the-art business information systems rely on a high number of different services from various sources as well as on open and flexible business procedures making use of them. Software support for the development of such (distributed) applications profits substantially from a service-oriented software architecture which provides appropriate architectural patterns for the sophisticated and flexible development of information systems that can easily interoperate, change components and therefore cope with the high complexity, dynamicity and demands of modern business environments. In such scenarios, services may typically appear and disappear at any time and may, furthermore, dynamically
change their configuration, e.g. their commitment towards a certain application or their respective costs. This kind of behaviour can be found in particular in domains that use market-based pricing for negotiating contracts between consumers and providers of the respective services. These business domains require open market systems with standardized interfaces and various vendors in order to work properly and to have a competitive market. At the same time, the inherent dynamics of such market-based systems complicate the runtime management of applications composed of different services and thus require solutions capable of coping with these challenges as autonomously as possible. For example, think of a computer manufacturer who regularly needs to acquire various well-defined and standardized sub-components like hard-disks, keyboards and RAM for its production line. This scenario reveals a clear demand for an (autonomous) management of this supply process that takes into account different functional and non-functional aspects and that (dynamically) selects the most appropriate service provider at any time in a constantly changing environment. Furthermore, not only the selection but also the execution of the selected service should be automatically monitored and managed, such that, for instance in the case of a service breakdown, appropriate actions, e.g. the instantiation of a new selection process, are initiated automatically. Aiming at such scenarios, the work reported in this paper addresses these challenges by proposing a supportive software framework capable of facilitating the autonomous and dynamic selection and execution of service-based applications. It leverages different market-based negotiation protocols in order to determine contract partners as well as various utility functions for specifying preferences among (potentially conflicting) objectives. The approach proposed here therefore equips existing as well as newly composed services with the capability to participate in such automatic and autonomous service selection and process execution. The technical basis for the implementation of such dynamic and adaptive service composition is an environment that integrates workflow execution with autonomous software agents [12]. In this approach, software agents represent services which participate in dynamically adaptive business processes. The necessary adaptivity is realised by an underlying management middleware that controls the coordination of all participating software agents. This, in turn, leads to adaptive and autonomous properties of the workflow management itself, e.g., by autonomously deciding which individual agents or services are responsible for realising specific complex activities, without affecting the overall workflow goals. This middleware for the integration of decentralized, self-organizing processes among software agents was already presented in [14]. The remainder of this paper is structured as follows: the next section gives an overview of related work; Section 3 describes the proposed framework for realising automatic service selection and execution in market-based business applications; Section 4 presents parts of a proof-of-concept implementation of this framework and reports on a case study to show the applicability of the approach, before Section 5 concludes the paper with a brief discussion of future work.
2  Related Work
Service-Oriented Architectures (SOA) provide a paradigm to leverage business integration and to implement loosely-coupled distributed applications. In such an approach, the dynamic binding of services is perceived as an integral part of an SOA, since it facilitates a loose coupling between services and hence fosters reusability and dynamic adaptation to changing environments. Service binding comprises two different aspects, service discovery and service selection [11]. Service discovery describes the procedure of locating available services matching given functional demands. In contrast, service selection deals with the problem of choosing one service from a set of suitable services. Often, service selection is done by incorporating non-functional requirements defined by the service requestor [20,13]. However, since most approaches require the service provider to declare the offered non-functional properties, the consumer has to rely on these propositions. Solutions to this problem incorporate trust models in order to rate the reputation of a provider. Vu et al. [17], e.g., propose a probabilistic framework for decentralized management of trust and quality that incorporates the credibility of observation data and the multi-dimensionality and subjectivity of quality and trust. Advanced approaches for service selection use a utility function in order to enable service consumers to flexibly differentiate between important and less significant non-functional properties. Approaches such as the work of Hang and Singh [5] provide frameworks which are able to optimize these utility functions. Regarding dynamic pricing in markets, there has also been work on service selection based on market-based negotiation protocols such as auctions. Wellman et al. give an overview [18] of the several bidding strategies used in the international Trading Agent Competition, a scenario where agents contend for flights, hotel rooms and entertainment. Similar to the assumptions in our work, prices are set dynamically by the market participants. For every product type, there is a different auction type with different properties, such as auction setting (e.g. combinatorial or multi-auction), simultaneity, price predictability and auction length. Hence, the approaches of the bidding agents differ enormously. Lamparter and Schnizler [7] propose a market-based architecture for trading Web Services. However, they focus on the semantic description of services and bids with ontologies. Service offers and service requests are converted by a preprocessor in order to facilitate the syntactic matching of bids. This allows for the use of existing implementations of the demanded multi-attribute combinatorial double auction. He and Liu [6] suggest using software agents in order to realize a market-based service selection framework. However, the resulting framework is rather inflexible, since it is limited to the Contract Net protocol and only a few non-functional properties are used. Borissov et al. [2] propose an automated bidding system for the allocation of grid-based services. The market platform and the bid framework are strictly separated and make use of agent technologies, e.g. for negotiation and communication. The BidGenerator automatically performs bids at the market, which allows the usage of two negotiation types.
Fig. 1. Architecture of the Negotiation Framework and the utilized paradigms (SOA/BPM: a business process description executed by a process engine on the service consumer side; agents: the Service Market Agent (SMA) and the Service Agents (SA) on the service provider side, which publish to and perceive events from the negotiation media of the event-based negotiation middleware)
Therefore, the BidGenerator implements several bidding strategies, which can be used by the customer in order to obtain a resource in the grid. In summary, the agent paradigm is well suited for developing autonomous software components that participate in market-based scenarios. Well-established technologies used in agent frameworks, such as negotiation and self-organisation [14], can facilitate the development of such applications. However, there are comparatively few frameworks for market-based service selection that make use of these technologies, and the agent-based approaches introduced in this section do not provide support for advanced service-selection mechanisms such as trust models and business process integration. Therefore, the next section proposes a flexible agent-based service-selection framework that facilitates the implementation of market places using different auction types. It supports the integration of service providers and service consumers into the market with respect to their own bidding strategies, utility functions and trust models, as well as the overall integration into a business process management system.
3  The Service-Selection Framework
This section presents the proposed framework for the automatic handling of service selection and service execution in market-based business domains. The basic idea of this framework is to provide a blueprint that contains the necessary components and defines their structure and interaction to achieve the autonomous management functions mentioned in Section 1, rather than to propose and develop new negotiation protocols, utility functions, etc. Figure 1 depicts the components of the framework and their structure on an abstract level. It shows that this approach basically uses SOA to build (distributed) applications and takes the Agent paradigm [19] to enrich the management of these applications with proactive and autonomous capabilities.
At the beginning of the service selection phase, the framework expects a business process description which contains the required services and describes the logical-temporal dependencies among them. This description can be provided using, e.g., the XML Process Definition Language (XPDL). It is then processed by a process engine which identifies all needed services and sends a request for each of them to the Service Market Agent (SMA). Such a request contains not only the service type to be found but also some optional properties like a utility function that expresses concerns regarding time deadlines, fees, quality-of-service parameters and so forth. The request may also specify a certain negotiation (bidding) protocol to be used. For each request, the SMA sends out a service negotiation request message using a negotiation middleware that contains implementations of various negotiation protocols. How an appropriate service provider is then selected depends on the type of negotiation protocol. In order to perform the negotiation task, the negotiation middleware additionally has to process service bid messages sent by Service Agents (SA). These agents act on behalf of service providers and try to find contractors that fit best. In summary, this approach consists of two agent types, i.e. SMA and SA, that try to find appropriate service partners using a middleware with different negotiation protocols. Details of these components are presented in the following subsections.
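Purely as an illustration of the information flowing from the process engine to the SMA (the field names are our assumptions, not the framework's published API), such a request could be captured as follows.

# Illustrative sketch of a service negotiation request for one task of the
# business process; all field names and values are assumptions.
from dataclasses import dataclass, field

@dataclass
class ServiceNegotiationRequest:
    service_type: str                                      # required service, e.g. "hard-disk supply"
    utility_weights: dict = field(default_factory=dict)    # preferences over fees, deadlines, QoS, ...
    deadline: float = 0.0                                   # latest completion time of the negotiation
    protocol: str = "contract-net"                          # requested negotiation (bidding) protocol
    originator: str = ""                                    # identifier of the requesting SMA

request = ServiceNegotiationRequest(
    service_type="hard-disk supply",
    utility_weights={"fee": 0.3, "duration": 0.1, "trust": 0.6},
    deadline=60.0,
    originator="sma-computer-manufacturer",
)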
3.1  Negotiation Middleware
The aim of the negotiation middleware is to provide an infrastructure component that facilitates the reusability and modularity of negotiation protocols and additionally enables the parallel execution of different negotiation protocols at runtime. The negotiation middleware utilizes the approach of coordination spaces [16], which facilitates a clear separation between computation and coordination. Whereas computation denotes the core functions of a component, coordination can be seen as "managing dependencies between activities" [9, p. 90]. In consequence, coordination spaces make it easy to change the way dependencies are managed among components. This is achieved by a layered approach that contains interchangeable coordination media; different coordination media provide different ways of interdependency management. From this perspective, negotiation can be seen as a special type of coordination that can be realized using coordination media providing implementations of negotiation protocols, like the pure Contract Net or extensions of this protocol [10] as well as Dutch or Vickrey auctions. In order to achieve a loose coupling between the SMA, the SA and the negotiation media, the negotiation middleware uses an event-based architecture. Each of the aforementioned components implements a generic publish/subscribe interface which enables the components to publish and perceive events of interest. The negotiation middleware therefore uses asynchronous communication, which is especially suited to distributed systems, to ensure a reasonable performance level.
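A minimal sketch of such a generic publish/perceive coupling, given only to illustrate the idea (it is not the actual middleware interface), could look as follows.

# Minimal sketch of the event-based coupling between proxies and negotiation
# media; names are illustrative, not the framework's actual API.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # A real middleware would dispatch asynchronously; we call handlers directly.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
bus.subscribe("service-bid", lambda bid: print("medium perceives bid:", bid))
bus.publish("service-bid", {"service_type": "keyboard supply", "fee": 950})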
Fig. 2. Exemplary communication patterns: (a) between SMA and negotiation middleware (service negotiation request, service contract proposal, accepted/rejected service contract proposal, final/rejected service contract); (b) between SA and negotiation middleware (service registration, optional service offer request and service bid, service contract proposal, accepted/rejected service contract proposal, final/rejected service contract)
Another advantage of this approach lies in its flexibility regarding the communication patterns between the single components. Figure 2 depicts proposals for patterns that have been implemented (cf. Section 4) for evaluation purposes. The first one (cf. Figure 2(a)) shows the messages sent between an SMA and a negotiation medium. It starts with a message that initiates the negotiation process and contains all information the negotiation medium needs. As soon as the medium has found an appropriate service provider, it sends a contract proposal to the SMA. Now, the SMA can decide to accept or reject this contract; it may reject the proposal, e.g., if it has already received an appropriate offer from another negotiation medium. Depending on the answer of the SMA, the negotiation medium terminates the selection process for this particular service type with a final message that states whether the contract was closed or cancelled. The proposed communication pattern between an SA and a negotiation medium is quite similar to the one described above, but has one important difference, which relates to where the negotiation strategy of the SA is placed. The negotiation strategy expresses the behaviour of an SA and, in particular, determines the way service bids are computed. One possibility is to place the strategy in the initial service registration message that is sent at the beginning. Then, the negotiation medium is the only place where service providers and service consumers interact according to a negotiation protocol like the Contract Net Protocol. This approach has the benefit of high efficiency, since the negotiation process is handled inside the medium and does not require further communication with SAs. On the other hand, it has the drawback that SAs have to disclose their bidding strategy to a third party (the negotiation medium). For certain service providers this might not be acceptable. Therefore, Figure 2(b) depicts an optional part of the communication pattern that targets
this issue. When the negotiation medium receives a service negotiation request sent by an SMA requesting a certain service type, it searches for all SAs that have registered at the negotiation medium for this particular service type. The negotiation medium then sends out a service offer request to these SAs. Now, the SAs can submit an offer without the need to disclose their strategy; proposals are sent back to the negotiation medium using a service bid message. Depending on the type of negotiation protocol, this optional protocol phase may be iterated several times; iteration is most often needed if the proposals do not meet the requirements of the SMA. The last part of the communication protocol is equal to the protocol between an SMA and the negotiation medium. It gives the selected SA, i.e. the SA with the most appropriate bid, the chance to close or reject the service contract (with the service consumer). As mentioned at the beginning, the framework can implement arbitrary communication patterns; the previous paragraphs have presented one possible pattern to illustrate the potential of the framework. The next subsection presents more details regarding the SMA and the SA, explains their functionality, and shows how they can be tailored to certain requirements.
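As a simplified illustration of the SMA-side decision described for Figure 2(a) (our simplification; the class and method names are assumptions), the proposal handling can be sketched like this.

# Sketch of the SMA-side decision on incoming contract proposals: a proposal
# is rejected if an offer from another medium was already accepted for the
# same service type. Names are illustrative only.
class ProposalTracker:
    def __init__(self):
        self.accepted = {}                      # service type -> accepted proposal

    def on_contract_proposal(self, service_type, proposal):
        if service_type in self.accepted:
            return "reject"                     # already covered by another medium
        self.accepted[service_type] = proposal
        return "accept"

tracker = ProposalTracker()
print(tracker.on_contract_proposal("keyboard supply", {"fee": 950}))   # accept
print(tracker.on_contract_proposal("keyboard supply", {"fee": 980}))   # reject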
3.2  Two Negotiation Proxies: SMA and SA
In order to provide reusable components for the management of automated negotiations in service-based systems, the two agent types SMA and SA are proposed. This approach promotes a clear separation between the core concepts of SOA and the ability to dynamically negotiate contract partners in the domain of market-based pricing. It enables services as well as processes to participate in negotiations by specifying their requirements and preferences in a declarative manner, without the need to implement these functions themselves. In order to encapsulate the required negotiation functions, two different components have been designed: the SMA and the SA. Whereas the former is in charge of service consumers, the latter deals with service providers. As they act on behalf of another component, they are also called negotiation proxies. The two negotiation proxy types themselves consist of reusable subcomponents, i.e. capabilities [3], which can be divided into mandatory and optional ones. An SMA needs a Service Offer Capability and an Execution Caller Capability: the first one is in charge of handling the selection phase, whereas the latter one takes care of the execution phase. Complementarily, an SA has a Service Supplier Capability and an Execution Service Capability, which correspond to the aforementioned functions of an SMA. Therefore, the proposed framework can be used not only for the automated management of the service selection phase but also for the dynamic management of the service execution phase. This aspect is achieved by a collaboration between the SMA and the process engine. Once the process engine has started the execution of the process, the SMA is contacted each time a service needs to be executed. The SMA is then responsible for requesting execution of the selected service as well as for dealing with failures. If the selected service is not available, the SMA initiates a new service selection process and tries to find an appropriate
substitution for the service. In order to detect service breakdowns happening at runtime, the SMA uses a heartbeat protocol that determines if a service is alive. Again, if a failure is detected, a new service has to be selected. Additionally, it might happen that a committed service provider, i.e. a service that has signed a contract with a process, decides to break the contract and to participate in another process, e.g. because it gets a better reward. In this case, the SMA acts proactively and tries to replace this service in order to avoid a failure when the service has to be executed. Besides these basic functionalities, the two proxy types can be extended to deal with optional aspects of negotiations like ensuring confidentiality and anonymity among contract partners or using implementations of trust models [4]. This paper does not consider trust from a computer security perspective. Rather, it uses trust to define and measure the reliability and reputation of (possible) contract partners. Reliability thus quantifies the confidence that a stakeholder A has that another stakeholder B will deliver a service according to a signed contract. Trust can therefore be seen as an additional important criterion which has to be evaluated in a market-based environment. The interplay of service selections, based on past experiences, and the subsequent adjustment of reputation values leads to an independent dynamic process (cf. [15] for a systemic evaluation). Whereas direct service selection criteria, such as fees or time deadlines, can be easily evaluated, this does not hold for trust, which is an indirect criterion and therefore more difficult to measure. The actual configuration of the proxies for a certain application is performed at design time using an XML configuration file; for both proxy types a separate XML Schema has been developed. For the SMA, this schema requires the following aspects to be specified: an ordered preference list of negotiation media to use, a utility function that is used to evaluate proposals (with respect to potentially conflicting goals), and a deadline for the negotiation process. Optional aspects can be specified as name-value pairs. The schema for the SA specifies the following aspects: a list of negotiation media to use, a service offer containing the fees and execution period, and a negotiation strategy or a reference to the SA in order to avoid disclosing the strategy. These configuration options allow a flexible reuse of the negotiation proxies. In the following section, the practical realization is introduced and the overall approach is evaluated.
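Since the XML schemas themselves are not reproduced here, the following sketch merely mirrors the configured aspects listed above as plain data structures; all names are our assumptions, not the schemas' element names.

# Sketch of the configuration aspects described above; the prototype uses XML
# configuration files, so these structures are only an illustration.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SMAConfig:
    negotiation_media: list            # ordered preference list of media to use
    utility_weights: dict              # utility function over (conflicting) goals
    deadline: float                    # deadline for the negotiation process
    options: dict = field(default_factory=dict)   # optional name/value pairs

@dataclass
class SAConfig:
    negotiation_media: list            # media the provider registers with
    offered_fee: float                 # service offer: fee ...
    execution_period: float            # ... and execution period
    strategy: Optional[str] = None     # bidding strategy, or None to keep it private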
4  Implementation and Evaluation
This section describes the prototype implementation of the proposed negotiation framework. Furthermore, a case study on an imaginary "computer manufacturer" is briefly presented in order to demonstrate the applicability of the approach. The framework implementation utilizes the Jadex Agent Framework [12]. Besides the core functionalities required for the execution of Multi-Agent Systems (MAS), it has the capability to perform (automated) simulation experiments as well as to model and execute processes, and it can handle process descriptions using the Business Process Modeling Notation (BPMN). Hence, Jadex is well suited as a foundation for realizing the negotiation framework.
4.1  Implemented Components
The framework prototype provides reference implementations of the two agent types SMA and SA. Each agent type utilizes two newly provided capabilities for handling the service selection phase as well as the execution phase (cf. Section 3.2). Furthermore, the framework prototype includes a negotiation medium that implements a Contract Net Protocol. The design of this protocol is inspired by "the way that companies organize the process of putting contracts out to tender" [19, p. 156], but it has also been used in MAS for distributed problem solving. Finally, the prototype utilizes the Jadex BPMN Engine and therefore contains reference implementations for all mandatory components. Besides these mandatory components, the prototype offers some optional extensions which enrich its functionality. One targets the chronological order of the phases of a business process. Usually, the selection phase is followed by the execution phase, but this might not be appropriate for all types of business processes. Especially processes containing many tasks with a long duration require a different handling, since they may face two problems. On the one hand, the selection phase may take too long, since there are many tasks which require a negotiation in order to find an appropriate service provider. On the other hand, the dynamic environment of the business process leads to a situation where contract partners may (dis-)appear at any time. Therefore, it is questionable whether it is always reasonable to negotiate contract partners a long time before their service is executed; rather, the selection of service providers should happen close to their execution in the business process. One possibility offered by the framework in order to cope with these challenges is to relax the strict separation between the selection and execution phases. Tasks of the business process can be annotated with a statement which denotes when a selection process should be initiated for the respective task. Then, the process engine knows which tasks are required in order to start the execution of the business process and which tasks can be selected later. This leads to an approach that is more appropriate for dynamic environments. Another provided extension targets the issue of trust between contract partners. The extension consists of a trust capability for service providers, a trust capability for service consumers and a trust medium. The first component observes and logs the execution of service providers in order to detect whether a service request was executed successfully or not. The result is published via the event-based trust medium. In this prototype, the trust medium has the function of transmitting messages between service providers and service consumers participating in a business domain that incorporates the usage of trust models. Service consumers use the transmitted logs to compute the trust of possible contract partners. They may apply different ways to compute the trust, since the received messages represent events without dependence on a particular trust model. For example, service consumers may store the logs in a history and use a simple aggregation function to compute the trust level. The modeling of the aggregation function may be inspired by the process of forgetting in the human mind and lead to an exponential function [1]. Regardless of which evaluation function is chosen, the computed level of trust may be part of the service negotiation request message that an SMA sends to the negotiation medium.
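One possible aggregation with exponential forgetting is sketched below; the decay factor and the exact formula are our assumptions, not the trust model evaluated later.

# Sketch of a trust aggregation with exponential forgetting: recent execution
# outcomes (1 = success, 0 = failure) weigh more than older ones.
def trust_level(outcomes, decay=0.8):
    """outcomes: list of logged results, ordered from oldest to newest."""
    if not outcomes:
        return 0.5                      # neutral prior for unknown providers (assumption)
    weights = [decay ** age for age in range(len(outcomes) - 1, -1, -1)]
    return sum(w * o for w, o in zip(weights, outcomes)) / sum(weights)

print(trust_level([1, 1, 0, 0]))        # recent failures dominate -> low trust
print(trust_level([0, 0, 1, 1]))        # recent successes dominate -> high trust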
Framework for Dynamic Market-Based Service Selection and Execution
159
typeofserviceprovider
characteristicoffees
characteristicofduration
characteristicoftrust
cheapandunrealiable normalandstable expensiveandreliable
0.4 0.5 0.7
0.5 0.5 0.5
0.999 0.4 0.1
Fig. 3. Characteristics for different agent types
Trust can then be used as a criterion for the ranking of bids. This approach therefore offers a convenient way to realize business markets that use trust, and it may lead to a situation where service consumers only take into account bids from providers that incorporate trust. Finally, the framework has an additional component which allows the automatic execution of simulation experiments on the implemented application. Besides logging and visualization functions, this component allows an easy evaluation of the effects of different bidding strategies. The integrated simulation and evaluation component therefore offers valuable support for the development of new components and strategies, and it has been used to conduct the simulation experiments presented in Section 4.3.

4.2  Configuration of the Case Study
In order to validate the proposed concept and the implemented prototype, a case study has been conducted that exemplarily shows the utilization of the dynamic market-based framework. The framework allows developers to focus on core aspects of the application domain by relieving them from dealing with low-level aspects related to service selection and business process execution in a highly dynamic environment. The conducted case study consists of an acquisition process which can be seen as a sub-process of a supply-chain process of a fictitious computer manufacturer. This manufacturer defines the logical-temporal order of the tasks of the buying process and expects the negotiation framework to manage the execution of the process automatically, including dealing with service failures. For ease of evaluation, the buying process consists of three sequentially ordered tasks that target the acquisition of hard-disks, keyboards and RAM. In order to enable an automatic management of the business process, the service consumer (the computer manufacturer) needs to express its preferences w.r.t. conflicting objectives like fees, service duration and trust. One common approach is to apply a cost-effectiveness analysis [8] and, as part of it, to define a utility function that values each objective. Mathematically, such a utility function can be defined as U(x) = Σ_{i=1}^{n} xi · wi, where wi expresses the weight of an objective xi. Such a utility function is then used to configure the SMA which acts on behalf of the computer manufacturer; it enables the SMA to rank different bids and to choose the most appropriate one. In the same way as service consumers, service providers need to configure their proxy (SA). The case study assumes that the service providers use a static
bidding strategy and will therefore disclose their behaviour to the negotiation medium. Furthermore, the existence of three different types of service providers is assumed. These three classes of agent types have specific characteristics w.r.t. the following attributes: fees, duration and trust (cf. Figure 3). In turn, these characteristics influence the bids of service providers. For example, the fees characteristic is used to calculate the fees part of an SA's bid using the following function: f(SA) = average fees · (0.5 + SA fees characteristic). Assuming an average fee of €1000, this leads to the following bids regarding the fees attribute: €900 (cheap service type), €1000 (normal service type) and €1200 (expensive service type). The same type of strategy is also used to compute the duration part of a bid. Whereas the semantics of the attributes fee and duration are self-explanatory, the semantics of the attribute trust requires an explanation. For a service provider, the trust attribute denotes the probability of a service blackout. A service blackout has two consequences. First, the service provider will not be able to participate in the bidding process (selection phase). Second, the execution of the services that the service provider offers will fail at the moment a blackout appears (execution phase). The blackout itself is modeled using a function with an exponential distribution. This function computes the time between two blackouts of a service provider, taking into account the trust characteristic of the service provider: the higher the trust attribute, the shorter the time between two blackouts. All the attributes described above (fee, duration, trust) and settings (bidding strategy, blackout function) are then used to configure different SAs to act on behalf of service providers. In addition to the modelling of the SMA and the SA, the case study includes some configuration aspects that are needed to evaluate the outcome of different business strategy scenarios. In spite of all possible conflicting objectives of the computer manufacturer, profit is still the most important objective that determines the success of a business process. Hence, the profit of a business process depends on the costs of its execution. These costs consist of the charges of the service providers as well as of two additional fees: first, the service consumer has to pay a fixed amount of money to the negotiation medium for each requested negotiation process; second, the service consumer may need to pay a surcharge if the business process is not finished on time. This charge handles the situation where a process B depends on a process A and A is not accomplished on time. For example, this may happen if A contracts too many service providers with a low trust level for the execution of its tasks. These providers have a higher blackout probability, and tasks will not be executed on time.
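A compact sketch of these case-study ingredients, based on our reading of the formulas above (the normalisation of bid attributes into utility scores and the blackout rate parameter are assumptions), is given below.

# Sketch of the case-study ingredients: the weighted utility function, the
# static fee bidding rule f(SA) = average_fees * (0.5 + fee_characteristic),
# and exponentially distributed times between blackouts.
import random

def utility(scores, weights):
    """U(x) = sum_i x_i * w_i over normalised attribute scores."""
    return sum(weights[k] * scores[k] for k in weights)

def fee_bid(average_fees, fee_characteristic):
    return average_fees * (0.5 + fee_characteristic)

def time_to_next_blackout(trust_characteristic, rng=random):
    # Higher trust characteristic = higher blackout probability = shorter
    # expected time between blackouts (using it as the rate is an assumption).
    return rng.expovariate(trust_characteristic)

print(fee_bid(1000, 0.4), fee_bid(1000, 0.5), fee_bid(1000, 0.7))  # reproduces the 900/1000/1200 bids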
4.3  Simulation Results and Evaluation
Based on the described setting of the case study, the negotiation framework is used to evaluate the relation between the configuration of the computer manufacturer’s utility function, with a specific focus on the configuration of the trust attribute, and its profit. Here, the sum of all weights of the utility function with the attributes fee, duration and trust is always 1.0 and the value of the weight for the trust attribute is altered from 0.0 to 1.0 with a step size of 0.2.
Fig. 4. Profit in dependency of the weight of the trust characteristic: (a) collected profit of iterated business process execution after 60 time units; (b) profit per business process execution
Figure 4(a) depicts the collected results (average values) of the simulations. Since the focus is on the influence of trust, the execution of the business process is iterated many times using a trust model that incorporates an exponential function to compute the trust level of potential service partners. In this way the level of trust with respect to service providers can change according to successful or failed service executions. More specifically, Figure 4(a) shows the average results computed from 80 simulation experiments per setting. Each experiment iterated the execution of the business process for a period of 60 time units. The simulation assumes that the service consumer gains a revenue of €2200 for a single execution of the business process. On the other hand, it has to pay between €900 and €1200 for the execution of the tasks to the service providers, depending on the fee characteristic, as well as €700 for each negotiation process. The simulation results reveal a profit maximum at a trust weight of 0.6, whereas weights between 0.0 and 0.2 as well as a weight of 1.0 yield the least profit. Figure 4(b) uses the same collected simulation results and depicts the average profit per business process execution.
The results of this case study show that trust has a significant impact on profit. Taking trust into account, e.g. with a weight of 0.6, pays off with respect to settings without trust, e.g. a weight of 0.0 in the utility function, and increases the profit by up to 22%. Furthermore, it can also be seen that configurations with a very high trust weight have almost the same effect on profit as configurations with no or only limited trust. These results can be explained as follows. Configurations with a trust weight between 0.4 and 0.8 offer the best compromise between reliability, i.e. successful service executions, and costs, and therefore achieve the highest profit. Other configurations are imbalanced and either incur high costs for re-negotiations due to many service failures or most often choose expensive services, which limits the profit. Additionally, Figure 4(b) reveals that the chosen strategies and configurations of the simulation settings reach about 66% of the theoretical maximum of €600 per execution. Of course, this maximum is only achievable without service failures, but it can still be seen as a benchmark to compare the outcome of different management strategies within the negotiation framework.
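The per-execution profit model implied by these figures can be sketched as follows; the assumption that a single execution involves one paid task and one negotiation is made here only because it reproduces the stated theoretical maximum of €600.

/** Sketch of the per-execution profit model implied by the simulation figures. */
public class ProfitModel {
    static final double REVENUE = 2200.0;          // revenue per business process execution
    static final double NEGOTIATION_FEE = 700.0;   // fixed fee per negotiation process

    /** profit = revenue - provider fees - negotiation fees (delay surcharges ignored). */
    static double profit(double providerFees, int negotiations) {
        return REVENUE - providerFees - negotiations * NEGOTIATION_FEE;
    }

    public static void main(String[] args) {
        // Best case: cheapest provider (EUR 900), one negotiation, no failures.
        System.out.println("theoretical maximum: " + profit(900.0, 1));  // 600.0
        // Illustration of why failures hurt: a re-negotiation and a pricier provider.
        System.out.println("after one failure:   " + profit(1200.0, 2)); // -400.0
    }
}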
In conclusion, the case study has demonstrated the applicability of both the proposed negotiation framework and the corresponding prototype. The implementation offers the possibility to select and execute business processes in a market-based environment using dynamic and autonomous management components. Also, the evaluation reveals the need to simulate different management configurations in order to find optimal settings for service consumers as well as for service providers. Knowledge about the outcome of different settings is therefore the prerequisite for performing management actions autonomously, as proposed in this negotiation framework.
5 Conclusion and Future Work
This work targets dynamic market-based business environments that consist of various stakeholders mostly acting in a selfish way. More specifically, this paper addresses software support for dynamic markets with distributed participants that may (dis-)appear at any time. In order to enable a reliable and adaptable selection as well as execution of services in such an environment, this work proposes a framework which incorporates a service-based system at the core, equipped with autonomous and proactive management components (agents). These components act on behalf of their clients, i.e. service consumers/providers, and are therefore also referred to as negotiation proxies. This enables a clear separation of core application tasks carried out by services and of negotiation tasks carried out by agents. Furthermore, an event-based negotiation middleware is introduced which mediates between the demands of the negotiation proxies. The applicability of the framework is demonstrated by a prototype implementation. Furthermore, this implementation shows the capability of the framework to deal with different types of service failures while ensuring the execution of business processes. The negotiation framework is therefore able to automatically process the tasks of service selection and execution for business processes in a highly dynamic environment. In consequence, this allows for largely autonomous management of complex service compositions, as is typical for advanced flexible and dynamic business processes.
Future work shall, on the one hand, strive towards providing an autonomic and customizable strategy adaptation component for service providers. The aim of this strategy adaptation component is to enhance the success of service providers, i.e. to increase their profit. Therefore, the component observes the outcome of negotiation processes and tries to improve the bidding strategy if it is not satisfactory. For example, if an SA has a high rate of winning contracts at a certain price, the adaptation component may increase the price; conversely, the component may decrease the price if the SA does not close many contracts with service consumers. On the other hand, it is envisioned to provide additional negotiation media implementations in order to take advantage of the potential of the negotiation framework to execute media in parallel. Then, the strategy adaptation component may also autonomously manage which negotiation media the SA participates in.
Acknowledgments. The research leading to these results has received funding from Deutsche Forschungsgemeinschaft and from the European Community’s Seventh Framework Programme under grant agreement 215483 (S-Cube).
Beddernet: Application-Level Platform-Agnostic MANETs Rasmus Sidorovs Gohs, Sigurður Rafn Gunnarsson, and Arne John Glenstrup {rasmus.gohs,sigurdur.rafn,arne.glenstrup}@gmail.com
Abstract. This paper introduces Beddernet, a platform-agnostic mobile ad-hoc network framework. The Beddernet architecture is designed to work with different networking protocols - the version detailed here supports Bluetooth ad-hoc networks or scatternets. Although considerable work has gone into researching and designing scatternets, no standard has been agreed upon and no scatternet protocol can be found in Bluetooth specifications. Beddernet fills this gap and can become a useful tool both for research and real-world applications. The standard is open and free to use, and is detailed in a separate Beddernet Specification Document. Beddernet middleware has been tested on Java and Android devices with good results. The reference design of Beddernet is based on the Android Operating System and is available under an open source license. Keywords: MANET, peer-to-peer, mesh, networking, DSDV, Android, multicast, Bluetooth, mobile.
1 Introduction
Mobile devices like handheld gaming devices and mobile phones are becoming quite accommodating; the latest mobile phones have several connectivity features and powerful application processors. These devices rely mostly on some infrastructure such as a WLAN or a mobile phone network to communicate with each other and the world. This is not always feasible or desirable. One solution is to have the devices themselves interconnect and create mobile ad-hoc networks, MANETs. Such networks enable devices to share data and resources for, e.g., collaborative work, file sharing and gaming without any infrastructure or central control. MANETs do require some processing power and ideally an advanced operating system to run on. As powerful mobile devices with sophisticated operating systems are now commonplace, these requirements are no longer an obstacle, and MANETs can have an important place in the world by augmenting infrastructure in places where it is weak, expensive or non-existent. For this to be possible, devices need a standard to connect and communicate. This paper proposes a solution to this problem in Beddernet, an advanced application-level MANET protocol with self-organising and self-healing capabilities. The typical usage scenario would be MANETs consisting of 2 to 20 devices. The next chapter discusses some work related to this project. Chapters 3 and 4 briefly introduce the technologies and concepts Beddernet relies on. Chapter 5
details the design of Beddernet, its protocols and structure. Several experiments were performed to test Beddernet’s performance and functionality. Those experiments are discussed in Chapter 6, Evaluation. Conclusions and perspectives for the project’s future are then discussed in the final chapter.
2 Related Work
In an ad-hoc network, individual nodes cooperate to create and maintain the network and to route data. A scatternet is such a network where the nodes use low-power Bluetooth communication for connections. BEDnet, the predecessor of Beddernet, is a real-world scatternet application based on the Java Platform, Micro Edition (JME). Due to some limits of Bluetooth on JME, BEDnet eschews complicated scatternet formation algorithms and has devices connect to each other using a simple mesh-creating algorithm. BEDnet showed good results, formed scatternets reliably, routed data accurately, and proved useful in applications such as turn-based gaming and text messaging, and moderately successful in media sharing applications [1]. Performance was below the theoretical maximum transfer speeds of Bluetooth, but it was believed this could be managed using better hardware and possibly some optimisations in code. The Beddernet project builds on the success of BEDnet and addresses its shortcomings.
Although Bluetooth is the only widely spread protocol that supports device-to-device connections, little work seems to have been done designing Bluetooth scatternet standards or software for mobile devices. Scattercom [2] is written for the Symbian OS and is based on a proactive routing protocol, but does not offer APIs for third-party applications. A project by Ibraheem [3] is implemented in JME and uses a reactive routing protocol. The project seems to target transferring a 4 kB file in a maximum of 10 seconds on a two-hop scatternet, so it does not seem to be intended to support applications such as interactive real-time gaming and media streaming. Finally, Donegan et al. [4] present another JME project, originally designed to facilitate parallel computations over Bluetooth scatternets. Although claimed to be general enough for further deployments, it has not been codified as such.
3 Technologies
Bluetooth communication, scatternet formation/routing, and the Android OS, the basic technologies the Beddernet prototype builds upon, are described in the following.
3.1 Bluetooth
Bluetooth is a wireless standard for low-powered, short-range data exchange. It is implemented in e.g. computers, mobile phones, and video game consoles [5]. Bluetooth devices are uniquely identified by their address and are arranged in star networks called piconets [6], each consisting of up to 8 active devices, one of which is designated as master. Devices in a piconet communicate using a shared medium. The master assigns specific time intervals, time slots, to each connected device to transmit data to or from the master, cf. Fig. 1. More than 7 devices can be registered with the
master but are then put into park mode [6], where they are considered a part of the piconet but are not assigned time slots. Parking and un-parking of devices has a negative impact on performance [7]. For a device to be able to join a piconet it needs to identify the address and clock of the piconet master [7]. This is done in two phases by the master: inquiry for discovering new devices and paging for establishing a connection. Each phase consists of two modes: listening (scanning) and transmitting. For two devices to exchange address and clock information they must be in opposite modes. In the first phase, inquiry, the master discovers the slave address and clock. Next, in the paging phase, the master sends its address and clock to the slave and the devices are connected. To avoid interference, devices hop to a different radio frequency at each time slot. Each piconet uses a specific hopping pattern identified by the master's address and clock. When both identities and clocks have been interchanged, the frequency hopping sequence can be synchronized and data exchange can begin. The device initially in inquiry transmitting mode becomes the master of the connection.
Fig. 1. Bluetooth switching
RFCOMM
Being the only connection protocol available in both Java and Android, the RFCOMM Bluetooth protocol is used by Beddernet (Fig. 2). This stream-oriented protocol relies on the automatic retransmission and in-order sequencing provided by the lower base-band layer for reliability in transmissions between connected devices.
3.2 Scatternet Formation
Thanks to frequency hopping, several piconets can overlap geographically without interference. A node in one piconet can join another, thereby connecting them.
Fig. 2. Bluetooth radio stack
It is possible to connect several piconets using this method; the resulting network is known as a Bluetooth scatternet, cf. Fig. 3. The fundamental problem of forming a self-organizing scatternet of Bluetooth devices is non-trivial and is an active area of research [7]. Ensuring connectivity requires nodes to agree on a scatternet formation algorithm (SFA), specifying how they interconnect. Several different algorithms have been proposed with different characteristics [7]. A general problem with implementing many of the algorithms is that they make assumptions that impede usage in platform-agnostic standards. Some e.g. assume all devices are in range of each other, or that devices have access to such information as the link management in the Bluetooth stack, location of devices or battery levels [1].
Fig. 3. Scatternet - three piconets form a scatternet via bridge nodes (blue)
3.3 Routing
MANETs such as Bluetooth scatternets are more volatile than normal computer networks; devices can appear spontaneously, move around, and then disappear again. In the face of such network churn, special routing protocols have been designed, broadly speaking in two classes: proactive and reactive. Proactive protocols attempt to maintain a recent list of all nodes and/or routes on a network by regularly exchanging routing information updates. Reactive protocols like Ad hoc On-Demand Distance Vector (AODV) find routes on demand, usually by flooding request packets. Simulations suggest that AODV is better suited than e.g. the proactive Destination-Sequenced
Distance Vector (DSDV) algorithm for highly volatile ad-hoc networks [8], but actual experiments have shown that in some cases the route lookup takes an inconveniently long time, outlasting even the actual Bluetooth transmission time [9]. AODV also requires more processing per packet than DSDV [1]. These properties were factored in when designing Beddernet.
4 Mobile Programming Frameworks
Beddernet is designed to be simple and platform-agnostic. A reference implementation has been created on the Android mobile platform [12]. Applications on Android can run as background services and can communicate with other applications on a device, making it very suitable for a Beddernet reference implementation.
5 Framework Design
Beddernet adheres to a 3-layer architecture having a data-link, a routing, and an application layer. All communication between Beddernet devices, both for maintaining the scatternet and for transmitting data, is via discrete Beddernet messages. The first byte (or bytes in special cases) of each message is a control byte. It denotes what type of message follows. Different message types are used for maintaining routing information, carrying data, etc. The following sections describe the function of each layer.
5.1 Datalink Layer
The Datalink layer contains the functionality that concerns the actual connection medium, Bluetooth in the case of Beddernet. This layer holds all connections to neighbour devices and sends and receives Beddernet messages from the routing layer.
Scatternet Formation
A reliable scatternet framework must make sure connected scatternets are created, but must also maintain the scatternet as nodes appear, move around and disappear. Beddernet attempts to accomplish this with a two-phased algorithm that first creates a mesh-based scatternet and then enters an active maintenance phase. The Beddernet framework is designed to be a general framework and does not assume information like battery status or location is available. Therefore the simple but functional mesh algorithm described below is used.
Phase 1 - Mesh creation
As a node starts Beddernet, it tries to establish a connection with other devices in range. It randomly alternates between listening and transmitting modes until a connection can be established. This random factor prevents devices from being constantly in the same mode and ensures that a device eventually connects with other devices if they are in range. When a connection has been established, knowledge of other devices is exchanged, thereby quickly establishing a fully connected scatternet.
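The alternation between the two modes in phase 1 can be sketched as follows; the BluetoothLink interface and the timeout value are hypothetical placeholders rather than part of the Beddernet specification.

import java.util.Random;

/** Minimal sketch of the phase 1 mesh-creation loop (hypothetical API). */
public class MeshCreation {

    /** Abstraction over the platform's Bluetooth facilities (placeholder). */
    interface BluetoothLink {
        boolean acceptIncomingConnection(long timeoutMillis); // listening/scanning mode
        boolean discoverAndConnect(long timeoutMillis);       // transmitting/inquiry mode
        void exchangeKnownDevices();                          // share knowledge of other devices
    }

    private static final Random RANDOM = new Random();

    static void formMesh(BluetoothLink link) {
        boolean connected = false;
        while (!connected) {
            // Randomly alternate between listening and transmitting so that two
            // nearby devices do not stay locked in the same mode forever.
            if (RANDOM.nextBoolean()) {
                connected = link.acceptIncomingConnection(5_000);
            } else {
                connected = link.discoverAndConnect(5_000);
            }
        }
        // Once connected, knowledge of other devices is exchanged, which quickly
        // yields a fully connected scatternet.
        link.exchangeKnownDevices();
    }
}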
Phase 2 – Maintenance
A Beddernet device that connects to another device stops scanning as frequently and enters a maintenance phase. In this phase it spends most of the time being discoverable, allowing for incoming connections, but only performing device discoveries intermittently. As device discovery is generally a power intensive procedure that interrupts communication [13], it should be done as rarely as possible. To achieve this, Beddernet uses a dynamic maintenance algorithm that slows scanning frequency linearly with the number of connected Bluetooth neighbours. The time T between device discoveries is thus regulated by the following formula:
T = T0 · (1 + X) / (1 - N/7),  if N < 7;   T = ∞,  if N ≥ 7.    (1)
where N is the number of connected neighbours, T0 is some constant time interval and X is a random number between 0 and 1. The maintenance protocol runs continuously, regularly scanning for new devices. This enables two or more established scatternets to merge automatically (cf. Fig. 3).
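A small sketch of the resulting discovery schedule follows; the exact form of Eq. (1), in particular where the random factor X enters, follows the piecewise reading given above and is therefore an assumption rather than a verbatim quote of the Beddernet specification.

import java.util.Random;

/** Sketch of the dynamic maintenance schedule of Eq. (1) (exact form assumed, see text). */
public class MaintenanceSchedule {
    private static final Random RANDOM = new Random();

    /**
     * Time until the next device discovery for a node with n connected neighbours.
     * Scanning frequency decreases linearly with n and stops entirely once the
     * Bluetooth limit of 7 neighbours is reached.
     */
    static double nextDiscoveryDelay(int n, double t0) {
        if (n >= 7) {
            return Double.POSITIVE_INFINITY;   // piconet full: no further discovery
        }
        double x = RANDOM.nextDouble();        // random factor to de-synchronise nodes
        return t0 * (1.0 + x) / (1.0 - n / 7.0);
    }
}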
5.2 Routing Layer
As discussed earlier, reactive protocols tend to scale better than proactive ones. As a Beddernet usage scenario was presumed to be typically 2-20 devices, this was not seen to justify the added complexity of such reactive protocols. Therefore, Beddernet uses DSDV. This also makes the implementation of advanced features such as multicasting [10] and service discovery simpler than if using AODV [11].
Multicast
Multicasting can save bandwidth and increase throughput in some scenarios (Fig. 4) and is included in the Beddernet protocol [15], using a stateless explicit multicast algorithm because of its simplicity and efficiency [10]. The special Beddernet multicast message header can contain multiple Bluetooth addresses. The number of addresses is indicated by a control byte that precedes the address list, supporting up to 255 destination addresses within a single multicast message. The protocol could be extended to support reverse multicast, by having each intermediary device aggregate replies before returning them towards the multicast source, but this is not a part of the specification.
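The multicast header described above might be assembled as in the following sketch; only the address-count control byte and the 6-byte Bluetooth addresses are stated explicitly, so the message-type constant and the overall field order are assumptions.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

/** Sketch of a Beddernet multicast message header (field order assumed). */
public class MulticastHeader {
    static final byte MSG_TYPE_MULTICAST = 0x05;   // hypothetical control byte value

    /** addresses: 6-byte Bluetooth addresses of the destinations (at most 255). */
    static byte[] build(List<byte[]> addresses, byte[] payload) throws IOException {
        if (addresses.size() > 255) {
            throw new IllegalArgumentException("at most 255 destination addresses");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MSG_TYPE_MULTICAST);          // message-type control byte
        out.write(addresses.size());            // control byte preceding the address list
        for (byte[] address : addresses) {
            out.write(address);                 // 6 bytes per destination address
        }
        out.write(payload);                     // application data follows the header
        return out.toByteArray();
    }
}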
Fig g. 4. Unicast vs. multicast file transfers
5.3 Application Layer
Beddernet is designed to work with several concurrent applications running on different platforms and devices without interference. The following discusses briefly how this is done in Beddernet.
Unique Application Identifier
Applications in Beddernet are given a 64-bit Unique Application Identifier (UAI). It is obtained by hashing the application's human-readable name into a 64-bit sequence. This identifier is then used to route messages to the correct application on the destination device, making it possible to run several applications concurrently. Although this method does not guarantee collision-free application routing, it makes the risk of collisions very improbable [16]. If two applications do get the same UAI, application designers can modify the name they provide to Beddernet. Information about active applications on a device is propagated proactively in Beddernet, embedded in the DSDV routing messages, cf. Table 1 and Table 2. This proactive approach entails an overhead of 8 bytes per control message.
Table 1. Route Broadcast Message
Field: Type | Senders address | Recipients address | Is route down? | Number of RTE | Routing Table Entries
Size:  1 byte | 6 bytes | 6 bytes | 1 boolean | 1 int | 1-* RTE
Table 2. Routing Table Entry (RTE)
Field: Type | Destination Address | Number of Hops | Sequence Number | Number of UAIH | UAIH
Size:  1 byte | 6 bytes | 1 int | 1 int | 1 byte | 1-255 longs
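To illustrate the Unique Application Identifier and its place in the routing messages of Tables 1 and 2, the following sketch derives a 64-bit UAI by hashing the application name; the choice of SHA-256 truncated to 8 bytes is an assumption, as the paper does not name the concrete hash function.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of a 64-bit Unique Application Identifier (hash function assumed). */
public class UniqueApplicationIdentifier {

    /** Hash the human-readable application name down to 64 bits (the UAIH of Table 2). */
    static long uai(String applicationName) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(applicationName.getBytes(StandardCharsets.UTF_8));
        return ByteBuffer.wrap(hash, 0, 8).getLong();   // first 8 bytes as the UAI
    }

    public static void main(String[] args) throws Exception {
        // Two applications with different names are extremely unlikely to collide;
        // if they do, the developer can simply adjust the name passed to Beddernet.
        System.out.println(Long.toHexString(uai("beddernet.demo.chat")));
    }
}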
6 Evaluation
To test the practical performance of Beddernet, a series of tests were run on the Android reference implementation and on a JavaSE implementation created for this purpose. Tests on the JavaSE version were carried out on several homogeneous and stationary Windows XP SP3 workstations with identical unbranded and generic class 2, version 2.0 + EDR Bluetooth hardware.
Fig. 5. Default test setup
6.1 Performance
To measure performance and explore the cost of routing a message through intermediary nodes, bandwidth and latency were measured in a linear scatternet where up to six devices were connected in a chain, making up a scatternet of five piconets (cf. Fig. 6). RTT and average throughput were measured between the first and last devices. The last device in the chain was then disconnected, performance measured again, etc., until only two devices were left.
Fig. 6. Multi-hop bandwidth and latency test
6.2 Latency
As expected, latency increases linearly with the number of hops in a route (cf. Fig. 7), although some tests showed that congestion can be a factor in overloaded scatternets.
Fig. 7. Multihop RTT
This effect shows clearly that latency-dependent applications are strongly affected by the number of hops between devices.
6.3 Bandwidth
Bandwidth between two connected devices is around 600 kbit/s under the default lab conditions (cf. Fig. 8), while a two-hop file transfer is half as fast. This is expected, as the total bandwidth available has to be split in two; the intermediary node reads from one device and then writes to the next.
Fig. 8. Multihop bandwidth
As another hop is added into the route, sending data through two intermediaries, bandwidth suffers another drop in speed, 44% from the last bandwidth measurement. This drop seems high, as the bandwidth available between device 3 and 4 is logically similar to that between device 1 and 3. Additional penalties are then incurred as more hops are added, although much smaller ones.
Table 3. Multi-hop test results
Hops:                          1 (base) | 2 | 3 | 4 | 5
Bandwidth:                     697 kbit/s | 279 kbit/s | 156 kbit/s | 127 kbit/s | 113 kbit/s
RTT:                           35 ms | 101 ms | 153 ms | 225 ms | 297 ms
Percentage of base bandwidth:  100% | 40% | 22% | 18% | 16%
Percentage of base RTT:        100% | 288% | 437% | 643% | 848%
One possible reason for this performance drop may be that the increase in the number of piconets leads to some inefficiencies in exchanging data between the piconets. To test this, an experiment was carried out. Two different 3-hop scatternets were created (cf. Fig. 9), one containing 3 piconets, the other with only 2. Bandwidth was 13% higher in the 2-piconet setup, suggesting that some bandwidth is lost when nodes hop between piconets.
Fig. 9. Different three hop scatternet configurations
6.4 Message Size
The design of Beddernet allows for arbitrarily sized messages; some experiments were carried out to assess whether a set maximum/minimum size in the specifications would be advisable. Larger message sizes were shown to increase transmission speed in a simple bandwidth test with different message sizes. Profiling [14] shows that Beddernet has negligible CPU overhead and that protocol overhead is a small percentage of the total data sent, so most of the gains of using larger message sizes were presumed to be due to the costs of initiating RFCOMM transfers [14]. Although large message sizes improve bandwidth, very large messages sent across a scatternet were speculated to cause problems for latency-dependent applications because of possible congestive effects. A test designed to explore this showed that large messages can completely occupy a connection for several seconds, leading to a negative impact on latency for competing transmissions [14]. A message size of 50000 bytes gave a good balance between responsiveness and bandwidth in tests and has been designated as the maximum and default message size in Beddernet.
6.5 Topology
Previous performance tests focused on the number of hops in a linear scatternet. To explore what effect topology may have, another test was conducted. Bidirectional bandwidth was measured between two devices. Then, another device was added to the piconet and the test repeated between the original two devices (Fig. 10). This resulted in a 32% drop in throughput. Adding more devices led to additional performance drops. The results from this experiment seem to indicate that the master device divides the available total bandwidth equally between all connected devices rather than assigning active devices more slots. Conversely, changing the setup so that a single slave was connected to multiple masters led to only a slight decrease in performance compared to the previous, single-piconet test.
Fig. 10. Piconet bandwidth test, multiple slaves vs. multiple masters
This almost constant throughput is speculated to be because of a node's ability to go into sniff mode. In sniff mode a device can be absent from one piconet for a longer period of time while being engaged in another without losing connectivity [6]. It would seem that devices only use the sniff mode to negotiate between different piconets and not to increase bandwidth within a piconet. If this effect is common in Bluetooth hardware implementations, it may have a considerable effect on the performance of SFAs in real-world settings. Designing an SFA that leverages this factor and takes other experimental results into account could show some real improvements over older designs. The algorithm would minimise hops while preventing masters from having many slaves. The topology produced could e.g. resemble an inverted Bluestar [17]. Scatternet proposals with a very high number of masters could raise questions of interference issues though, due to the larger number of piconets. Calculations from [11] indicate that this is not a critical concern, as e.g. R = 4 concurrent piconets would only experience an interference-related drop of I = 4% in a simplified worst-case scenario, ignoring error correction etc., using the formula:
I = 1 - (1 - 1/79)^(R-1).    (2)
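Under the reconstruction of Eq. (2) above, which assumes a simple collision model over Bluetooth's 79 frequency-hopping channels, the worst-case interference drop can be computed as follows.

/** Sketch of the worst-case interference estimate of Eq. (2), assuming 79 hop channels. */
public class InterferenceEstimate {

    /** I = 1 - (1 - 1/79)^(R-1) for R concurrent, unsynchronised piconets. */
    static double interference(int concurrentPiconets) {
        return 1.0 - Math.pow(1.0 - 1.0 / 79.0, concurrentPiconets - 1);
    }

    public static void main(String[] args) {
        System.out.printf("R = 4: I = %.1f%%%n", 100 * interference(4));   // about 4%
    }
}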
At this point, tests have shown that both routing through intermediary nodes and having extra nodes in a piconet cause considerable performance drops, cf. Table 3. To give a better picture of scatternet performance in real-world usage, a new test was designed, combining these two factors.
Fig. 11. Multiple hops with extra nodes, the two inactive nodes added are white.
A new 3-hop scatternet was set up and bandwidth measured. Then, 2 inactive nodes were added to the scatternet, connecting to the two intermediary nodes as shown in Fig. 11. Bandwidth was measured again, revealing a 39% drop in throughput. These results are somewhat surprising. The devices are already performing far under their available bandwidth capacity but still incur bandwidth penalties as devices are added, even if these new devices are inactive.
Table 4. Multiple hops with extra nodes
Hops:                      2 | 3 | 4 | 5
Simple chain:              303 kbit/s | 198 kbit/s | 165 kbit/s | 151 kbit/s
With two inactive nodes:   198 kbit/s | 142 kbit/s | 119 kbit/s | 107 kbit/s
6.6 Multicast Performance
The multicast feature of Beddernet was tested by setting up a scatternet as in Fig. 4. Transfers using multicast were 53% faster than using unicast. This isn't surprising, as each message only needs to be sent through three individual connections and not five as with unicast. This experiment shows the promise of using multicast in scatternets for applications such as streaming media to multiple nodes.
7 Conclusion and Future Work
Despite the possible utility of mobile ad-hoc networking, such networks are not yet a standard feature of mobile devices. The Beddernet project was started to provide a free and open standard to enable multi-platform scatternets, both for research and real-world projects. Implementations of the simple Beddernet protocol have been shown to work on different platforms with good results. Performance has been tested and, although highly dependent on scatternet topology, shown to be sufficient to enable different useful applications. Performance is only expected to improve as mobile processors and Bluetooth adapters become faster. The Beddernet project is considered to have reached its technological goal. The real success of Beddernet, however, depends on its usefulness to research and in real-world deployments. To encourage adoption and development the source code is open source and can be downloaded from the project home page [18]. As experiments indicate that setting up RFCOMM connections is costly, implementing the L2CAP protocol might reveal some performance gains, but as of this writing, the Android SDK has no support for L2CAP.
Beddernet currently supports the DSDV routing algorithm, but the loosely coupled design allows for easy implementation of different routing algorithms. B.A.T.M.A.N. has been identified as a promising routing protocol [20], and it would be interesting to make a larger real-world comparison, not only measuring overhead and bandwidth, but also the practical use of such an algorithm for features such as service discovery and multicasting. Lastly, Beddernet's usefulness could be increased by adding more transmission protocols, e.g. Wi-Fi, to the datalink layer. The standard (802.11a/b/g/n) is very widely deployed and is getting more common in mobile devices, providing long communication range and high transfer speeds [19].
References
1. Nielsen, M., Glenstrup, A.J., Skytte, F., Guðnason, A.: Real-world Bluetooth MANET Java Middleware. Technical report TR-2009-120, IT-University of Copenhagen (2008)
2. Scattercom, http://sourceforge.net/projects/scattercom/
3. Ibraheem: Development of Routing Algorithm Based on Bluetooth Technology. Thesis, University of Technology, Iraq (December 2006)
4. Donegan, B., Doolan, D., Tabirca, S.: Mobile Message Passing using a Scatternet Framework. International Journal of Communications & Control 3(1) (2008)
5. Bluetooth. Wikipedia, http://www.en.wikipedia.org/wiki/Bluetooth
6. Bluetooth specifications: Core Specification v2.0 + EDR, Bluetooth SIG (1994)
7. Whitaker, R.M., Hodge, L., Chlamtac, I.: Bluetooth scatternet formation: A survey. Ad Hoc Networks 3 (2005)
8. Boukerche, A.: Performance Evaluation of Routing Protocols for Ad Hoc Wireless Networks. Mobile Networks and Applications 9(4) (2004)
9. Nielsen, M., Glenstrup, A.J., Skytte, F., Guðnason, A.: Bluetooth Enabled Device ad-hoc NETwork (2009)
10. Ji, L., Scott Corson, M.: Explicit Multicasting for Mobile Ad Hoc Networks. Mobile Networks and Applications 8(5) (2003)
11. Haartsen, J.C., Mattisson, S.: Bluetooth - a new low-power radio interface providing short-range connectivity. Proceedings of the IEEE 88 (2000)
12. Android guide, http://www.developer.android.com/guide/basics/what-is-android.html
13. Android documentation: Bluetooth, http://www.developer.android.com/guide/topics/wireless/Bluetooth.html
14. Gohs, R., Gunnarsson, S.R.: Bluetooth Scatternet Framework For Mobile Devices (Beddernet), IT-University of Copenhagen (2010)
15. Gohs, R., Gunnarsson, S.R.: Beddernet Protocol Specifications 0.1, IT-University of Copenhagen (2010)
16. Birthday problem, http://www.en.wikipedia.org/wiki/Birthday_problem
17. Dubhashi, et al.: Blue pleiades, a new solution for device discovery and scatternet formation in multi-hop Bluetooth networks, May 8. Kluwer Academic Publishers, Dordrecht (2006)
18. The Beddernet project homepage, http://www.code.google.com/p/beddernet/
19. Wi-Fi Alliance: Wi-Fi Direct, http://www.wi-fi.org/Wi-Fi_Direct.php
20. Annese, S., Casetti, C., Chiasserini, C., Cipollone, P., Ghittino, A., Reineri, M.: Assessing Mobility Support in Mesh Networks. In: WiNTECH 2009, September 21 (2009)
The Role of Ontologies in Enabling Dynamic Interoperability Vatsala Nundloll, Paul Grace, and Gordon S. Blair School of Computing and Communications, Lancaster University, UK {nundloll,gracep,gordon}@comp.lancs.ac.uk
Abstract. Advances in the middleware paradigm have enabled applications to be integrated together, thus enabling more reliable distributed systems. Although every middleware tries to solve interoperability issues among a given set of applications, there still remain interoperability challenges across different middlewares. Interoperability enables diverse systems to work in concert and extends the scope of services that are provided by individual systems. During an interoperability process, it is imperative to interpret the information exchanged in a correct and accurate manner in order to maintain coherence of data. Hence, the aim of this paper is to tackle this issue of semantic interoperability through an experimental approach using the domain of vehicular ad-hoc networked systems.
Keywords: Interoperability, Ontology, Vehicular Ad-Hoc Networks.
1 Introduction
Middleware technologies have proven a successful solution in ensuring interoperability within distributed systems. However, due to the heterogeneity of applications and environments, a number of different middleware systems have emerged. Due to conflicting standards, communication styles and protocols, these heterogeneous middleware systems are unable to interoperate with one another. Solutions to this problem of middleware heterogeneity include bridging [14, 15, 16] and interoperability frameworks [17, 19, 20]. However, these are insufficient where interoperability is required ‘on-the-fly’, i.e., where two heterogeneous systems spontaneously encounter one another at runtime, there is a need to automatically learn the middleware solutions employed and then generate a dynamic bridge between the two systems. CONNECT (http://www.connect-forever.eu) is a software framework that dynamically generates these middleware mediators and hence provides emergent middleware solutions, which can encounter different networked systems dynamically, enable them to connect together, understand one another and be able to exchange data. Here, a fundamental requirement is the ability to discover and understand the meaning and behaviour of middleware technologies and standards in order to learn and generate the required software. This entails matching or comparison between different middleware in order
to identify how they are different, and based upon this to synthesize an appropriate adaptive mapping mechanism to underpin the interoperability solution. To achieve such understanding, we advocate the novel use of ontologies that crosscut middleware solutions; this allows us to obtain semantic knowledge of the middleware in order to allow two different systems to effectively interoperate with each other. In this paper, we present a dynamic interoperability framework that leverages ontologies in order to provide the following two important capabilities:
⎯ Classifying Protocols. This involves discovering and defining the type of messaging protocols from each system.
⎯ Matching Protocols. We can observe where the fields of two messages are the same and different to provide the information required to build a bridge.
We use a case-study based evaluation that shows we can understand and match communication protocols from the Vehicular Ad-hoc Networks (VANETs) domain. The paper is structured as follows: Section 2 briefly presents a background on ontologies. Section 3 explains the dynamic framework, and Section 4 details the case study. Section 5 then presents an experiment that validates and evaluates the case study results. Finally, Section 6 discusses related work and then Section 7 concludes the paper and briefly outlines our future work.
2 Background on Ontologies
An ontology is a formal descriptive notation given to concepts that constitute a particular domain. It is a simple but powerful notion used to classify concepts of a domain in the form of a superclass-subclass model and also to define the relationships that exist among these different objects. In addition to making domain assumptions explicit, ontologies also permit reuse and analysis of this domain knowledge. Enabling the classified information regarding a domain to be shared as a common vocabulary across people and applications, ontologies consequently pave the way towards building a supportive infrastructure for information exchange and discovery. Any kind of domain can be modelled using ontologies, ranging from concrete, such as defining a type of bread, to abstract, such as defining an organization. Fig. 1 depicts the components that make up an ontology. The Primitive Concepts, for instance, denote the different constituents of a domain. The Defined Concepts, on the other hand, set the criteria to determine whether an object is a member of a certain class. The Axioms define the restrictions that are laid on the domain, for example, that a particular object O cannot have more than x subordinates. These assets can define a domain by structuring its different constituents and defining the relationships that exist among them. Furthermore, ontologies can also classify this information in order to infer additional meaning about the domain. This is possible through the use of reasoners such as Hermit [1], RacerPro [2] and Fact++ [3]. As an example, referring to the food ontology presented in [23], let us suppose we need to classify the foods into categories such as Healthy Foods and Non-Healthy Foods. By means of defined concepts these two classes can thus be created and stored in the ontology. Moreover, through the usage of relations, as explained in Fig. 1, the different foods can be defined in terms of their amount of fat, food energy, cholesterol, weight and saturated fat. Then, through the use of a reasoner, these
various classes of food can be classified as either Healthy Foods or Non-Healthy Foods. The power of the reasoner is that it infers all meanings and facts based upon the semantic meaning provided by the ontology about the concepts of a domain.
Fig. 1. Ontology Concepts
The OWL language (Web Ontology Language) is the language designed by the W3C in order to build ontologies and is mainly devised for Semantic Web applications. The language formulates the domain information in terms of instances of OWL classes and enables the use of axioms to interpret and manipulate this information. The software most widely employed to develop ontologies is Protégé, a free, open-source ontology editor [4]. Active development is being carried out to improve the software, and its two latest versions are Protégé Version 3.x and Version 4.x. P3.x is in a more stable state for using inference rules within the ontology and hence supports the SWRL rule language and the SQWRL query language. SWRL stands for Semantic Web Rule Language and can add more expressivity to the OWL language through the creation of rules. These rules are expressed in terms of OWL concepts (classes, properties, instances). A rule-engine bridge mechanism is provided to embed a rule engine into Protégé-OWL in order to execute the SWRL rules. One such bridge is the Jess rule engine, which can be embedded in P3.x to execute rules and add more expressivity to the OWL language. In addition, the SWRL language has been extended to a query language called SQWRL (Semantic Query Web Rule Language)
in order to enable extraction of information from the ontology. The SQWRL library is packaged with SQL-like built-ins that are used within SWRL rules in order to execute SQL-like queries to pull out the required information. The library further provides new sets of operators, classified as Core and Collection operators, which enable basic as well as advanced operations such as select, counting, difference and data aggregation to be executed in the rules. An example of the use of a SQWRL query is given below, where a query retrieves all Breads having a price less than £2 from a given ontology:
Bread(?b) ^ hasPrice(?b, ?p) ^ swrlb:lessThan(?p, 2) → sqwrl:select(?b, ?p)
On the other hand, P4.x supports the latest version of the OWL language (OWL 2.0) and is tailored to handle large and complex ontologies. It can produce very expressive ontologies, but it does not yet provide full support for the creation and execution of inference rules through SWRL and SQWRL. Since the features of P3.x better suit the requirements of our experiment, in view of performing queries against our vehicular ontology and enabling matching of different concepts, we have used Protégé Version 3.4.4 for the purpose of our experiment.
3 Framework for Dynamic Interoperability
Our framework for dynamic interoperability provides mechanisms to achieve emergent middleware (Fig. 2); in this the crosscutting role of ontologies is depicted as central to achieving the objectives. The framework offers three phases of behaviour:
⎯ In the first phase, the discovery and learning phase, the ontology is used to give a semantic meaning to the different concepts that are involved in a system. Based on Fig. 1, a system can be defined using the primitive and defined concepts together with the axioms available from the ontology. Learning this system involves classifying these defined concepts within the ontology through the use of a reasoner, hence identifying related concepts.
⎯ The second phase, which involves enabling matching between any two systems, is achieved through the use of semantic rules defined within an ontology. These rules compare the definition of any two concepts classified by the ontology and generate the difference that emanates from the given definitions. This step is crucial, as it also shows the degree of similarity/difference that exists between two systems, thus determining the possibility of mapping from one system to another.
⎯ The third step involves the dynamic synthesis of a mapping mechanism in order to enable a system A to operate as another system B. The ontology is helpful here to list the requirements missing in A for it to perform as B and vice versa. Once this information is available, the mapping determines how to provide A with the adequate and absent requirements so that it can adapt itself to perform as B.
The role of ontologies spans all the phases required to enable interoperability. We hence advocate and emphasize the importance of using ontologies to define the role or behaviour of a system; these define the types of protocols being deployed by the system and can help bridge the gap between any two different systems trying to interoperate with each other.
Fig. 2. Dynamic Interoperability Framework
4 Case Study on Vehicular Ontology To enable interoperability between any two systems means dealing with the low-level message exchange between them. Since different systems deal with different message formats, it is imperative to interpret these message formats in a way so that a solution can be devised regarding message exchange between them. The case study that we present in this section is based on the framework explained in section 3 and aims at tackling the interoperability problem at the level of message formats. The main hindrance in exchanging data packets stems from the difference in the packet formats themselves. In this respect, our case study is based on the role played by ontologies in facilitating some level of dynamic semantic interoperability among different packet formats. It shows how we can use ontologies to interpret and enable some level of comparison between message formats from different protocols. Motivation of the application of Ontologies to VANETs: We chose VANETs as a case-study for our framework, because it is a domain of protocols with heterogeneity of message formats and routing strategies as shown in Fig. 3. Each particular VANET can only interpret the packet formats it has defined for itself. Hence, if we intend to make two different VANETs interoperate with each other, we need a way to be able to interpret the format of incoming packets to a VANET system. To enable this, we define a vehicular ontology to create a vocabulary of the various routing strategies defining their set of requirements. The main idea is to use this ontology to classify unknown incoming packets under the appropriate routing scheme and deduce how to enable the packet to interoperate with the current VANET.
Fig. 3. Routing Strategies in VANETs
In Fig. 4 we show the application of our interoperability framework to VANETs. An important element is the Domain-Component based Model for VANETs; this is a dynamic middleware for sending, receiving and routing VANET packets, where each distinct component serves one specific function within a VANET protocol. This is leveraged to create the emergent middleware between two VANET protocols. We now in turn describe the phases of the interoperability framework.
Phase 1, Discover & Learn: Defining the VANET domain in the ontology. The first step is to define the VANET domain within the ontology, which is part of the first phase regarding discovery and learning. This ontology contains all of the meanings of the different routing strategies applicable to VANETs, together with the definition of known packet formats. As can be seen in Fig. 5, which shows part of the vehicular ontology, existing packet formats are defined and stored within the ontology. In this case, they are stored as subclasses of a class called NamedPackets. For instance, one of these is BBRPacket, which is a protocol performing Broadcast and is derived from the protocol BBR [5]. Referring to Fig. 4, let us assume that our VANET system sends packets P1, say a broadcast-based packet, the format of which is already defined by the ontology. Upon receiving packet P2 with a new unknown format, say a trajectory-based packet, the system enables this new format to be defined and stored within the ontology repository.
Phase 1, Discover & Learn: Identifying a VANET protocol. The presence of a reasoner engine embedded within the ontology tool makes it possible to infer the meaning of a packet. As a result, the packet is classified under the most appropriate routing strategy. This classification is an important step, as it helps to establish a ground for comparison between packets belonging to different routing categories. Part of the inferred ontology is shown in Fig. 6, where the class BBRPacket has been properly
Fig. 4. Applying the dynamic interoperability framework to VANETs
Fig. 5. The VANET Ontology
classified as an IdentifiedPacket and also as an MFRBroadcastPacket. The requirements for MFRBroadcastPacket are the fields CommonNeighbourNo and NeighbourList. These fields form part of the format of a BBRPacket and hence the reasoner is able to classify the latter as being of type MFRBroadcast packet (Most Forwarding Broadcast). On the other hand, the class IdentifiedPacket denotes that the incoming packets contain fields that are known. It is possible that an incoming packet
does not correspond to any of the routing strategies defined within the ontology, yet contains fields that have already been defined by the ontology. In this case, the packet is an IdentifiedPacket and this classification is enough to show that information can be extracted from the packet using SQWRL mechanisms (as shown later in the section).
Phase 2, Match: Dynamic Bridging between P1 and P2. Once the classification process is done, the packet P2 (Fig. 4) can be compared to the existing packet P1 through an intuitive mechanism which makes use of SWRL rules and SQWRL query rules within the ontology itself. These rules make it possible to deduce the difference between the packet formats P1 and P2. For instance, let us assume P1 to be a BBR packet [5] (designed for performing broadcast-based routing) and P2 to be a Broadcomm packet [7] (designed for performing cluster-based routing). At this stage, both packets P1 and P2 have already been classified under the appropriate routing scheme by the ontology. As can be seen in Fig. 7, which details the packet formats of both BBR (P1) and Broadcomm (P2), there is no direct mapping possible.
Fig. 6. Inferred Vehicular Ontology
Fig. 7. Mapping BBR and Broadcomm packet formats
Phase 2, Match: Role of SQWRL. In order to enable some kind of comparison between them, a rule-based mechanism needs to be deployed within the ontology to provide the reasoning needed for matching. We make use of SQWRL-like queries to retrieve the required information from the ontology. The following SQWRL rule formulates a comparison between BBR and Broadcomm in order to find out which fields are different between them:
The SQWRL query states that if b is a BBR packet and has fields represented by f, a set of all these fields called bag is created. Similarly, if p is a Broadcomm packet and has fields denoted by pf, a set of such fields called bagt is created. The difference between these two bags is then computed, and those fields that are found to be in the Broadcomm packet p but not in the BBR packet b are selected. The result of this query is the set of fields missing from BBR for it to function as a Broadcomm packet. The OWL language enhanced with the use of SWRL and SQWRL results in an expressive vehicular ontology, which determines the nature of a packet given the field descriptions. Furthermore, it enables a comparison of any two particular packets and thus provides the difference between them in terms of the fields that are missing. Once the matching of the packets has been achieved, this leads to the next step, which is to perform the mapping between these two packets.
Phase 3, Synthesize Mapping: Mapping P2 onto P1. Once the differences in the two packet formats P1 and P2 have been provided via the ontology, this final step entails engineering an adaptive mapping mechanism to enable P1 to function as P2. For example, if we need to enable BBR to function as Broadcomm, this step will determine how to provide the missing fields in the BBR packet, which are Location Coordinates, Cluster Head and Target Route. These are among the set of fields required for Broadcomm to perform cluster-based routing and are lacking from BBR. Detailing how to enable this mapping mechanism is not within the scope of this paper, as it is part of our future work regarding the interoperability process.
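To make the matching step of Phase 2 concrete outside the ontology machinery, the same set-difference operation can be expressed directly over field-name sets, as in the sketch below; the field lists are abbreviated to the fields named in the text and Fig. 7, so the full packet formats shown here are illustrative only.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the field matching performed by the SQWRL rule, as a plain set difference. */
public class PacketMatcher {

    /** Fields present in the target format but missing from the source format. */
    static Set<String> missingFields(Set<String> sourceFields, Set<String> targetFields) {
        Set<String> missing = new LinkedHashSet<>(targetFields);
        missing.removeAll(sourceFields);
        return missing;
    }

    public static void main(String[] args) {
        // Field lists abbreviated to those mentioned in the text (Fig. 7).
        Set<String> bbr = new LinkedHashSet<>(List.of(
                "CommonNeighbourNo", "NeighbourList", "DestinationIP"));
        Set<String> broadcomm = new LinkedHashSet<>(List.of(
                "DestinationIP", "LocationCoordinates", "TargetRoute", "ClusterHead"));
        // Prints the fields BBR lacks in order to function as Broadcomm:
        // [LocationCoordinates, TargetRoute, ClusterHead]
        System.out.println(missingFields(bbr, broadcomm));
    }
}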
5 Dynamic Interoperability Experiments 5.1 Methodology The case study above explains how interoperability can be tackled between two specific VANETs, which are cluster-based (Broadcomm) and broadcast-based (BBR). In order to validate this case study, we have conducted an experiment to enable the same interoperability procedures (i.e. discovery/learning and matching) to execute at run-time. The experiment consists of tackling interoperability between other systems and our VANET system at run time through the use of our vehicular ontology. In order to enable the experiment at run time, we have made use of java-based programs and Protege-owl API [10] in order to manipulate the ontology at run time. The version of the Protege ontology tool that we have used is Protege3.4.4 [4] which provides full support to apply SWRL rules and SQWRL-based queries. In order to interpret incoming packets, we read those incoming packets and extract their field labels from their format at run time. These field labels are then stored in a text file. Another java program loads and manipulates the vehicular ontology at run time. The field names, stored in the text file, are then fed as input to the ontology which creates a new packet based on these values. We have used the reasoner Pellet [11] in order to classify the packets defined within the ontology. If the new packet contains fields which are identified by the ontology, then it is classified under a class
called IdentifiedPacket, as shown in Fig. 8. Otherwise, the packet is ranked under UnIdentifiedPacket. Moreover, if the packet corresponds to the requirements of a given routing scheme, the reasoner classifies the packet under the appropriate routing class. However, if the packet partly corresponds to these requirements, it is classified as partially fulfilling the role of that routing scheme by the reasoner. Part of the resulting inferred version of the ontology is displayed in Fig. 6. We carried out test runs with different packet formats and these are displayed in Table 1. For each new incoming packet, the java program creates a new packet class in the ontology at runtime, displayed in the ObjectName column in Table 1. All the new packet objects are initially created as subclasses of the class UnNamedPackets, pictured in Fig. 8. The reasoner then classifies them as identified or non-identified packets and also ranks them under the appropriate routing scheme. Table 1. Test Cases
Fig. 8. Inferred Vehicular Ontology with the generated Test cases
5.2 Experiment Results Fig. 8 portrays the resulting ontology after the generation and classification of these test cases by the reasoner at run time. The first test run, UnIdentifiedPacketRecv0, consists of only the fields CommonNeighbourNo and DestinationIP, whereby CommonNeighbourNo is among the set of fields required for performing the MFRBroadcast routing. Therefore, the reasoner classifies the packet UnIdentifiedPacketRecv0 under the class PartialMFRBroadcast, which implies that this packet can partially provide requirements for performing MFRBroadcast routing. It is also classified as an IdentifiedPacket since it contains known fields. In the second test case, UnIdentifiedPacketRecv1, the fields Longitude, Latitude and TargetRoute are required for performing a Position-based routing. On the other hand, TargetRoute is also required among other set of fields to perform a Cluster-based routing. Therefore, the packet UnIdentifiedPacketRecv1 is classified under ClusterBasedPacket which, in turn, is a subclass of PositionBasedPacket and is also ranked as an IdentifiedPacket. In test case 3, although the packet, UnIdentifiedPacketRecv2, contains known fields such as BroadcastMeter, Distance and DestinationIP, the packet is not classified under any routing strategy since these fields are not defined as critical in the running of any routing strategy. However, because the packet has been identified as containing existing fields, it has been categorized under the IdentifiedPacket class. In addition to containing the same fields as in test case 1, the fourth test case, UnIdentifiedPacketRecv3, also contains 2 unknown fields. Consequently, it is classified both under PartialMFRBroadcast and UnIdentifiedPacket. Finally, the fifth test case, UnIdentifiedPacketRecv4, contains all unknown fields and hence, is classified as UnIdentifiedPacket. 5.3 Evaluation When using BBR, which requires fields such as CommonNeighbourNo, NeighbourList, and DestinationIP address to perform broadcast-based routing, then test case 0 will require NeighbourList to operate as BBR and the system will then be able to route the packet. The matching mechanism through the use of SQWRL queries as explained earlier indicates that the latter field is required to enable the existing VANET to route the packet. The mapping mechanism will eventually determine how to enable test case 0 to function as BBR. On the other hand, if test case 0 is compared against Broadcomm, there are more missing fields since Broadcomm requires more fields to operate. Furthermore, if we take test case 3, although a few of the fields have been identified to enable this packet to be partially classified under MFRBroadcast routing scheme, the lack of information about the unidentified fields acts as a hindrance to properly identify the format. We may need additional mechanisms to interpret the fields that are unknown, which we consider as part of our future work. In this experiment, we have tried to deal with the problem of interoperability in the domain of VANETs and have been able to show that this problem can be tackled to a certain degree. The results of this experiment demonstrate that the use of ontology combined with SWRL and SQWRL can help perform a matching between any two packets. This forms the basis of comparing any two concepts, which is the starting point for handling interoperability between them. If we try to expand this idea in a much broader context where different networked systems are trying to interoperate
with one another, we would need to create an ontology for every such system in order to capture the meaning of the concepts present within the domain. Thus, the deployment of ontologies creates yet another challenge, namely the differences arising among the different ontologies of the same general domain, which make their manipulation even more complex and difficult. To deal with such different application ontologies, a new type of ontology is surfacing, called the reference ontology [24], which aims at providing links between heterogeneous ontologies. The authors of [24] argue that if ontologies extend a particular reference ontology in a coherent way, then matching their different concepts becomes easier. Providing an initial matching between distinct ontologies of a general domain through a reference ontology is thus indispensable. If we are moving towards the inception of an emergent middleware to tackle dynamic interoperability, then the reference ontology can provide a benchmark against which to compare related ontologies and hence facilitate matching the different concepts through the application of SWRL and SQWRL-like rules.
6 Related Work Universal interoperability is a long-standing objective of distributed systems research. The traditional approach to resolving interoperability problems is to agree on a standard, i.e., everyone uses the same protocols and interface description languages; CORBA [12], DCOM [13], and Web Services are good examples of this approach. For situations where systems can agree on a common standard, these approaches are highly effective. However, for both long-lived and universal interoperability these solutions have demonstrably failed; indeed, future attempts at such global standards are destined to fail too. Such one-size-fits-all standards and middleware platforms cannot cope with the extreme heterogeneity of distributed systems, e.g., from sensor applications through to large-scale Internet applications, and a single communication paradigm, e.g., RPC, cannot meet all application requirements. Moreover, new distributed systems and applications emerge fast, while standards development is a slow, incremental process; it is likely that new technologies will appear that make a pre-existing standard obsolete. Finally, new standards do not typically embrace an existing legacy standard, which leads to immediate interoperability problems. Software bridges have been proposed to enable communication between different middleware environments. The bridge acts as a one-to-one mapping between domains; it takes messages from a client in one format, marshals them into the format of the server middleware, and maps the response back to the original message format. Many bridging solutions have been produced between established commercial platforms. The OMG created the DCOM/CORBA Inter-working specification [14]. OrbixCOMet [15] is an implementation of the DCOM-CORBA bridge, while SOAP2CORBA [16] bridges SOAP and CORBA middleware. Further, Model Driven Architecture advocates the generation of such bridges to underpin deployed interoperable solutions. However, developing bridges is a resource-intensive, time-consuming task, which for universal interoperability would be required for every protocol pair; further, a future protocol requires a mapping to every existing protocol. Finally, software bridges must normally be deployed and available in the network; for many environments (particularly resource-constrained ones) this is not possible.
Intermediary-based solutions take the ideas of software bridges further; rather than a one-to-one mapping, the protocol or data is translated to an intermediary representation at the source and then translated to the legacy format at the destination (and vice versa for a response). Enterprise Service Buses (ESB), INDISS [17], uMiddle [18] and SeDIM [19] are examples that follow this philosophy. However, this approach suffers from the greatest-common-divisor problem, i.e., the intermediary covers only the behaviour that two protocols have in common, and they cannot interoperate beyond this defined subset. As the number of protocols grows this common divisor becomes smaller, such that only limited interoperability is possible. Substitution solutions (e.g., ReMMoC [20] and WSIF [21]) embrace the philosophy of speaking the peer's language. That is, they substitute the communication middleware to be the same as that of the peer or server they wish to use. A local abstraction maps the behaviour onto the substituted middleware. As with software bridges, this is particularly resource-consuming; every potential (and future) middleware must be developed such that it can be substituted. Further, it is generally limited to client-side interoperability with heterogeneous servers. Semantic middleware solutions, e.g., S-ARIADNE [22], employ efficient, semantic, service-oriented techniques to achieve interoperability; these are applied at the discovery and matching stage to ensure that only services that semantically match attempt to interoperate with one another. Hence, they concern themselves only with differences in application data and function, not with the heterogeneity of the middleware (indeed, a common middleware platform is required). There is a distinct disconnection between the mainstream middleware work and the work on semantic interoperability. Our solution embraces and integrates the ideas of both (as far as we are aware, it is the first to employ ontologies to resolve communication protocol interoperability); because of this, we argue that it is better placed to achieve long-lived, universal interoperability. By employing ontologies to classify, match and map communication protocols, we have the ability to automatically generate a bridge between two legacy communication protocols. The nature of the solution means that the problems of standards, exhaustive resource requirements, and minimal matches can be overcome.
7 Conclusions This paper puts forward the approach of using ontologies to handle the semantic differences that arise between different systems, in order to enable them to interoperate. We have elicited three major steps required for performing interoperability: discovery/learning, matching and synthesis of mapping. We have also elicited the role the ontology plays in handling these steps, and we advocate that the ontology has a crosscutting role in the steps involved in the interoperability process. All three processes are intertwined, and each step provides the necessary input to perform the next. For instance, the result of the matching process provides a sound notion of the type of mapping that needs to be performed. We have been able to validate the discovery/learning and matching phases through a case study on VANETs; however, the synthesis of mapping between two different systems remains part of our future work.
Furthermore, we intend to extend our experiments on VANETs by investigating user-defined SWRL built-ins in order to compare the data types of the fields in a message, thereby achieving richer interoperability between protocols through the handling of data heterogeneity. We also plan to explore a wider range of middleware protocols, including traditional technologies where bridging has been attempted, e.g., RPC protocols, message-based platforms and service discovery.
References
1. http://hermit-reasoner.com/
2. http://www.racer-systems.com/products/racerpro/
3. http://owl.man.ac.uk/factplusplus/
4. http://protege.stanford.edu/
5. Zhang, M., Wolf, R.: Border Node Based Routing Protocol for VANETs in Sparse and Rural Areas. In: IEEE Globecom Autonet Workshop, Washington DC, pp. 1–7 (November 2007)
6. Liu, G., Lee, B., Seet, B., et al.: A Routing Strategy for Metropolis Vehicular Communications. In: Kahng, H.-K., Goto, S. (eds.) ICOIN 2004. LNCS, vol. 3090, pp. 134–143. Springer, Heidelberg (2004)
7. Durresi, M., Durresi, A., Barolli, L.: Emergency Broadcast Protocol for Inter-Vehicle Communications. In: Proc. 11th International ICPADS Conference Workshops (2005)
8. Santos, R., Edwards, A., Edwards, R., Seed, N.: Performance evaluation of routing protocols in vehicular ad-hoc networks. Int. J. Ad Hoc Ubiquitous Comput. 1(1/2), 80–91 (2005)
9. Zhao, J., Cao, G.: VADD: Vehicle-assisted data delivery in vehicular ad hoc networks. In: Proc. 25th IEEE International Conference on Computer Communications (2006)
10. http://protege.stanford.edu/plugins/owl/api/
11. http://clarkparsia.com/pellet/
12. Object Management Group: The common object request broker: Architecture and specification, Version 2.0. Technical Report (1995)
13. Booth, D., et al.: W3C Working Group Note (February 2004), http://www.w3.org/TR/ws-arch/
14. Object Management Group: COM/CORBA Interworking Specification Part A & B (1997)
15. Iona Tech.: OrbixCOMet (1999), http://www.iona.com/support/whitepapers/ocomet-wp.pdf
16. Brueckne, L.: SOAP2CORBA (January 2010), http://soap2corba.sourceforge.net
17. Bromberg, Y., Issarny, V.: INDISS: Interoperable Discovery System for Networked Services. In: Alonso, G. (ed.) Middleware 2005. LNCS, vol. 3790, pp. 164–183. Springer, Heidelberg (2005)
18. Nakazawa, J., Tokuda, H., Edwards, W., Ramachandran, U.: A Bridging Framework for Universal Interoperability in Pervasive Systems. In: Proc. of 26th IEEE International Conference on Distributed Computing Systems (ICDCS 2006), Lisbon, Portugal (2006)
19. Flores, C., Blair, G., Grace, P.: An Adaptive Middleware to Overcome Service Discovery Heterogeneity in Mobile Ad Hoc Environments. IEEE Dist. Sys. Online (July 2007)
20. Grace, P., Blair, G., Samuel, S.: A Reflective Framework for Discovery and Interaction in Heterogeneous Mobile Environments. ACM SIGMOBILE Review (January 2005)
21. Duftler, M., Mukhi, N., Slominski, S., Weerawarana, S.: Web Services Invocation Framework (WSIF). In: Proc. OOPSLA 2001 Workshop on OO Web Services, Florida (2001)
22. Ben Mokhtar, S., Kaul, A., Georgantas, N., Issarny, V.: Efficient Semantic Service Discovery in Pervasive Computing Environments. In: van Steen, M., Henning, M. (eds.) Middleware 2006. LNCS, vol. 4290, pp. 240–259. Springer, Heidelberg (2006)
23. Cantais, J., Dominguez, D., Gigante, V., Laera, L., Tamma, V.: An example of food ontology for diabetes control. In: Proc. of the ISWC 2005 Workshop on Ontology Patterns for the Semantic Web, Galway, Ireland (November 2005)
24. Wang, C., He, K., He, Y.: MFI4Onto: Towards Ontology Registration on the Semantic Web. In: 6th IEEE Int. Conference on Computer and Information Technology (CIT 2006) (2006)
A Step towards Making Local and Remote Desktop Applications Interoperable with High-Resolution Tiled Display Walls Tor-Magne Stien Hagen, Daniel Stødle, John Markus Bjørndalen, and Otto Anshus Department of Computer Science, Faculty of Science and Technology, University of Tromsø {tormsh,daniels,jmb,otto}@cs.uit.no
Abstract. The visual output from a personal desktop application is limited to the resolution of the local desktop and display. This prevents the desktop application from utilizing the resolution provided by high-resolution tiled display walls. Additionally, most desktop applications are not designed for the distributed and parallel architecture of display walls, limiting the availability of such applications in these kinds of environments. This paper proposes the Network Accessible Compute (NAC) model, transforming personal computers into compute services for a set of display-side visualization clients. The clients request output from the compute services, which in turn start the relevant personal desktop applications and use them to produce output that can be transferred into display-side compatible formats by the NAC service. NAC services are available to the visualization clients through a live data set, which receives requests from visualization nodes, translates these to compute messages and forwards them to available compute services. Compute services return output to visualization nodes for rendering. Experiments conducted on a 28-node, 22-megapixel, display wall show that the time used to rasterize a 350-page PDF document into 550 megapixels of image tiles and display these image tiles on the display wall is 74.7 seconds (PNG) and 20.7 seconds (JPG) using a single computer with a quad-core CPU as a NAC service. When increasing this into 28 quad-core CPU computers, this time is reduced to 4.2 seconds (PNG) and 2.4 seconds (JPG). This shows that the application output from personal desktop computers can be made interoperable with high-resolution tiled display walls, with good performance and independent of the resolution of the local desktop and display.
1 Introduction A display wall is a wall-sized high-resolution tiled display. It provides orders of magnitude higher resolution than regular desktop displays and can provide insight into problems not possible to visualize on such displays. The large size of display walls enables several users to work on the same display surface, either to compare visualizations or to collaborate on the same visualization. The combination of resolution and size enables users to get overviews of the visualizations, while at the same time being able to walk up close to look at details. Visualization domains that benefit from the resolution offered by display walls include gigapixel images and planetary-scale data sets. These types of domains provide
content in the order of tens and thousands of megapixels. In addition, more "standard" visualization domains such as spreadsheet, word-processing, and presentation-style applications can benefit from higher resolution displays, enabling them to display much more content than a normal sized display would allow for. However, applications are tied to both the resolution of the local desktop and display, and the operating system environment installed on the local computer. In addition, display walls often comprise a parallel and distributed architecture, which often requires parallelizing applications to run them with good performance. This makes porting open-source software time-consuming and proprietary software solutions close to impossible. For example, showing a Microsoft Word document on a display wall is difficult, since Word is designed for a single computer system, and therefore cannot simply be "run" on the display wall. A modified remote desktop system can be used to bring the content of a computer display to a display wall. However, the resolution of the remotely displayed content usually matches the resolution of the local computer's display. Although some remote desktop systems support a higher virtual resolution (such as the Windows Remote Desktop Protocol (RDP) [17], which supports a maximum resolution of 4096x2048), they still do not utilize the full resolution of typical display walls, and such systems have performance problems with increasing numbers of pixels [15]. In addition, some desktop applications have a predefined layout for the graphical user interface. For example, to the authors' knowledge there is no PDF viewer that can show more than a couple of PDF pages in width. For regular resolution displays this might be enough, but for high-resolution tiled display walls, which often have orders of magnitude higher resolution, it is not.

To address these problems, this paper presents the Network Accessible Compute (NAC) model, transforming compute resources into compute services for a set of visualization clients (figure 1). The NAC model defines two classes of compute resources: static, such as clusters, grids and supercomputers, and dynamic, such as laptops and desktop computers. Static compute resources are accessed according to their security policies and access protocols. Dynamic compute resources are customized, on-the-fly, to become compute services in the system. The dynamic compute resources are the main focus of this paper. A live data set [8] separates the compute-side from the display-side, thus enabling both compute services and visualization clients to be added or removed from the system without affecting their underlying implementation or communication protocols. This situation is different from a traditional client-server model. Firstly, compute services communicate with the visualization nodes through a data space architecture, allowing both visualization and compute nodes to be added transparently to each other. Secondly, for dynamic compute resources, users have their own software environment installed on the compute service, which enables the compute-side to produce customized data for the display-side based on the users' custom software installations. Visualization systems can therefore visualize this data without understanding the original data format, as long as a transformation function exists that can represent the data in a format familiar and customized to the visualization system.
Using personal desktop computers as compute services for a display wall tracks the current trend in computer hardware architectures. Today, modern computers have become both multi- and many-core. The increase in transistor density combined with the
Fig. 1. Illustration of multiple desktop computers used as a NAC resource to provide processed data for a visualization system running on a high-resolution tiled display wall
memory-, ILP- and frequency-walls [16] has forced processor vendors into devoting transistors to CPU cores, on-chip caches and memory- and communication-systems, rather than extracting instruction-level parallelism or increasing the frequency of single cores [24]. Contemporary processor chips contain up to 100 cores per chip [27]. Current state-of-the-art GPUs contain up to 480 cores per chip [19]. Following Moore's law, there is no indication that this trend will not continue for the foreseeable future. Users own more and more computers, some of which may have spare processing, memory, storage and network capacity. The NAC model improves on the following: (i) It enables desktop computers to produce data for a display wall without modifying or porting the applications on the desktop computer; (ii) it enables remote compute resources to produce data for a display wall without requiring custom software running on the remote site; (iii) it allows for visualization of data from desktop computers without being limited to the resolution of the local display; (iv) it enables cross-platform visualization of data located on a desktop computer without the need for the application to be executed on the visualization node; and (v) desktop computers are customized by a live data set and therefore do not need to install or keep any software updated to be able to communicate with the display-side. The novelty of the system is the usage of locally installed desktop applications in a display wall context, by decoupling the resolution of the local computer from the display, thus enabling existing desktop applications to utilize the resolution of high-resolution tiled display walls. This paper makes three contributions: (i) the Network Accessible Compute (NAC) model; (ii) WallScope, a system realizing the NAC model in a personal computing environment; and (iii) a performance evaluation of the WallScope system.
2 Related Work The NAC model has common characteristics with public (global) computing, where the idea is to use the world’s computational power and disk space to create virtual supercomputers capable of solving problems and conduct research previously infeasible. There are a number of projects focusing on public computing, among others SETI@home [3], Predictor@home [21], Folding@home [20] and Climateprediction .net [25]. These projects use the BOINC (Berkeley Open Infrastructure for Network Computing) [2] platform. The overall goal of BOINC is to make it easy for scientists to create and operate public-resource computing projects. A user wanting to participate in the BOINC project downloads and installs a BOINC client which is used to communicate with the server-side. While there are similarities between the NAC model and BOINC there are some important differences. In BOINC the focus is to make it easy to utilize available computational resources. For the NAC model, the focus is to utilize desktop applications for domain specific computation of data for a set of visualization clients. NAC gives users complete control over what data is shared, and enables users to choose this data from their personal computer on a per data-element basis, for example only page 1 and 3 of a 10-page document. This also includes complete control over the output format such as pixels, PDF, original source, etc. In addition, the live data set used as part of the realization of the NAC model supports local and remote compute resources like clusters and supercomputers, which are not supported by BOINC focusing exclusively on public computing. There are other system sharing characteristics with NAC such as Condor [14], Minimum Intrusion Grid (MIG) [28], XtremWeb [6] and XtremWeb-CH [1] (comprising the two versions XWCH-sMs, and XWCH-p2p). However, their main difference to the NAC model is the same as for BOINC. In contrast to these systems focusing on utilizing available computational resources, the NAC model focuses on using desktop applications for domain specific computation of data for distributed visualization systems. In addition to the research projects focusing on global computing, there are other projects sharing characteristics with NAC. These include the Scalable Adaptive Graphics Environment (SAGE) [9] based on TeraVision [22] and TeraScope [31], OptiStore [30], Active Data Repository [11], Active Semantic Caching [4], DataCutter [5], ParVox [12], The Remote Interactive Visualization and Analysis System (RIVA) [13], OptiPuter [23], Digital Light Table [10] and Scalable Parallel Visual Networking [7]. However, these systems do not support remote compute resources nor the ability to customize personal computers on-the-fly to become compute nodes in the system.
3 Architecture The network accessible compute model is realized using a data space architecture, where visualization nodes communicate with compute nodes through a live data set (figure 2). For distributed visualization systems, a separate state server gets user input and provides all visualization clients with the global view state of the visualization through a separate event server. Compute nodes in the system produce data customized to the particular visualization domain of the visualization clients. The network accessible compute resources can be categorized into two classes; static and dynamic. A static
Fig. 2. Architecture
compute resource is a compute resource that the live data set has been pre-configured to communicate with. This category ranges from clusters of computers to supercomputers, which have strict underlying security and access policies (software running on a supercomputer is often prohibited from making outgoing connections). A dynamic compute resource is a compute resource that is customized by the live data set to produce data for the system. A computer can become a dynamic compute node in the system by registering with the live data set to become customized, and then providing information about the type of requests that it can process and what data it will share with the system.
The display-side of the system can query the live data set to get information about all the data that it comprises. This data is a combination of all the data the compute resources can process on behalf of the display-side. From the visualization nodes' point of view, the live data set contains all the data pre-processed. However, the live data set will only actually contain data that has been processed, and all requests for data that has not been processed will be sent to a compute node that can produce this data. This is done transparently to the display-side, and thus hides all computation from the visualization clients.
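The cache-or-forward behaviour described above can be summarized in a few lines. The sketch below is an illustration only: the class and method names, the string-keyed registry and the synchronous call are assumptions made for readability, not the system's actual interfaces.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal cache-or-forward behaviour of the live data set (illustrative).
    class LiveDataSetSketch {
        interface ComputeNode { byte[] compute(String key); }   // a customized desktop machine

        private final Map<String, byte[]> cache = new HashMap<>();          // previously produced data
        private final Map<String, ComputeNode> producers = new HashMap<>(); // data type -> compute node

        void register(String dataType, ComputeNode node) { producers.put(dataType, node); }

        byte[] handle(String dataType, String key) {
            byte[] hit = cache.get(key);
            if (hit != null) return hit;                  // already processed: serve from cache
            ComputeNode node = producers.get(dataType);   // otherwise pick a node that advertised this type
            byte[] produced = node.compute(key);          // forward as a compute message; result returned
            cache.put(key, produced);                     // avoid re-computation for later requests
            return produced;
        }
    }

In the deployed system the cache role is played by Squid and the requests arrive over HTTP (Section 5), but the control flow the sketch illustrates is the one described above.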
4 Design The display-side of the system comprises a set of visualization clients that request data from the live data set, which they then use as part of the rendering. Each visualization client combines the view state received from the view state server (via the event server) with its location in the display grid to determine what data to request from the live data set. The live data set contains data customized for the particular visualization domain of the visualization clients, for example maps rendered into image tiles or files converted to a format that the visualization clients understand. The customization of the dynamic compute nodes is done by the live data set. A computer that wishes to become a compute node in the system initiates contact with the live data set. The live data set responds with a piece of code that is downloaded to the compute node. This code is responsible for performing the initial setup with the live data set. When the compute node has downloaded and started the code, a plugin validation phase is started. The downloaded code contains a set of plugins that can be used to compute processed data. These plugins might be available for different operating systems and installed software in general. Plugins are run in separate address spaces to utilize multiple CPU cores and support non-thread-safe APIs. Based on the type of plugins the compute node supports, a list of supported data types is generated and sent to the live data set, which then stores information about the compute node and its associated data types for future requests from visualization clients. Visualization clients can browse the live data set to determine which processed data it contains. For example, if a visualization client contains functionality for processing images but lacks the functionality for reading PDF documents, it can request tiled images of the document from the live data set. The live data set sends compute messages to compute nodes that hold the document with an associated plugin. The compute nodes produce image tiles from the document, which are returned to the live data set and in turn back to the visualization clients. All data is cached in the live data set to avoid re-computation of data.
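As an illustration of the request logic at the start of this section, the sketch below shows one way a visualization client could turn the global view state and its position in the display grid into tile requests. The 512x512 tile size (taken from the experiments in Section 6), the coordinate conventions and the request URL format are assumptions for the example, not the system's actual protocol.

    // Illustrative only: which fixed-size tiles does this display node need for the current view?
    public class TileRequestPlanner {
        static final int TILE = 512;   // tile edge in pixels (matches the experiments; assumed here)

        // viewX/viewY: top-left of the global view in dataset pixels at the current zoom level;
        // col/row: this node's position in the display grid; w/h: pixels covered by one display tile.
        static java.util.List<String> tilesFor(double viewX, double viewY, int zoom,
                                               int col, int row, int w, int h) {
            double x0 = viewX + col * w, y0 = viewY + row * h;
            int c0 = (int) Math.floor(x0 / TILE), c1 = (int) Math.floor((x0 + w - 1) / TILE);
            int r0 = (int) Math.floor(y0 / TILE), r1 = (int) Math.floor((y0 + h - 1) / TILE);
            java.util.List<String> requests = new java.util.ArrayList<>();
            for (int r = r0; r <= r1; r++)
                for (int c = c0; c <= c1; c++)
                    requests.add("/tiles?doc=example.pdf&z=" + zoom + "&c=" + c + "&r=" + r);
            return requests;
        }
    }

Because neighbouring display nodes cover adjacent regions, tiles on their shared border appear in both nodes' request lists; this overlap is why the experiments issue 2432 requests for the 2100 distinct tiles of the 350-page document (Section 6.1).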
5 Implementation Currently, one visualization system has been implemented as part of WallScope. The system is implemented in C++ using OpenGL for rendering. C++ was chosen for allowing optimization of performance critical parts of the rendering engine. OpenGL was chosen for cross-platform utilization of available graphics hardware. The visualization
system supports gigapixel images, virtual globes, 3D models of various formats and regular images. The visualization system queries the live data set at regular intervals to get an updated list of the processed data that it contains. Most of the data in the live data set can be rendered using different levels of detail. Therefore the visualization system is built to show a visualization of all the data that the live data set contains, enabling users to get an overview, at the same time being able to zoom in at the finest level of detail for each of the data set elements. A user can navigate in the visualization using a touch-free interface constructed from a set of floor mounted cameras. The touch-free interface supports common gestures found in regular touch displays such as panning and zooming. Additionally, the touch-free interface supports 3D touch input allowing users to easily navigate in 3D visualizations. The live data set is implemented in C++ using Squid [29] as the front-end for caching. The visualization clients request data from the live data set using HTTP. Although HTTP is a relatively heavyweight protocol, especially for usage in high performance distributed and parallel systems, it was chosen for the large number of existing compatible systems (such as the aformentioned Squid cache system) and other applications and utilities (such as web browsers etc.) that can be used to debug and solve problems with the system. Previously performed requests are handled by the Squid cache. If the data for a request is not cached or has expired, the live data set inspects the request, performs a lookup in the compute node list to find a compute node that can process the request, and then sends a message to this compute resource to get processed data. The live data set has a Java JAR file that contains the code needed for a dynamic compute resource to communicate with the live data set, as well as all the plugins developed for the system. A compute node downloads and executes this JAR file using the Java Network Launch Protocol (JNLP) [26]. The user initiates the customization of the compute node by clicking on a link in a browser. Once downloaded and started, the Java code will validate the plugins to find the compatible plugins for the installed software environment. The plugins are executable files created for a specific software platform. Some of the plugins implemented are plugins to compute processed data from DOC, DOCX, XLS, XLSX, PPT, PPTX, PDF and various 3D formats. These plugins utilize the desktop applications already installed on the computer, for example using Microsoft Component Object Model (COM) [18] to orchestrate a document conversion to a format that can be processed by the NAC service. The processed data ranges from image tiles and PDFs to 3D models that the visualization clients can load. Since the dynamic resources initiates contact with the live data set, the compute resources are available to the system even though they might be behind Network Address Translation (NAT) or a firewall.
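The plugin contract implied by this description can be pictured roughly as follows. This is a sketch only: the interface name, method signatures and the tile-oriented output are assumptions chosen to illustrate the idea of format-specific plugins wrapping locally installed applications, not the system's real API.

    // Hypothetical plugin contract for a dynamic compute node (names are assumptions).
    public interface NacPlugin {
        // Source formats this plugin can convert, e.g. "pdf", "docx", "pptx".
        java.util.List<String> supportedTypes();

        // True if the locally installed application this plugin drives is actually present.
        boolean isAvailable();

        // Produce one display-side compatible result (e.g. an encoded 512x512 image tile)
        // for the given shared object; page/tile coordinates and encoding ("png", "jpg")
        // are illustrative parameters.
        byte[] compute(String objectHash, int page, int tileCol, int tileRow, String encoding)
                throws java.io.IOException;
    }

A compute node would report the union of supportedTypes() over its available plugins to the live data set during the validation phase, which is one way the per-node list of supported data types described in Section 4 could be assembled.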
6 Experiments To evaluate the system, four experiments were conducted with the purpose of documenting the performance of the system, and to find potential bottlenecks. In all experiments a 350-page PDF document was used, and the time to rasterize (compute-side) and display (display-side) the document was measured. In experiment one and two, the number of compute nodes was varied between 1 and 28, and the image tiles produced
were encoded using PNG and JPG, respectively. In experiment three and four the produced image tiles (PNG and JPG) were loaded from the live data set’s cache and from a local cache on each node. 6.1 Methodology The hardware used in the experiments was: (i) a 28-node display cluster (Intel P4 EM64T 3.2 GHz, 2GB RAM, Hyper-Threading, NVidia Quadro FX 3400 w/256 MB VRAM) interconnected using switched gigabit Ethernet and running the 32-bit version of the Rocks Linux cluster distribution 4.0; (ii) A computer running the live data set, the event server and the state server (same specifications as the display cluster nodes); and (iii) 28 compute nodes (Intel Xeon Processor E5520 8M Cache 2.26 GHz, 2.5 GB RAM, 4 cores, Hyper-Threading and running the 32-bit version of CentOS release 5.5). Compute nodes were group-wise connected to gigabit switches (6 compute nodes per group). These switches were connected to a gigabit switch connected to a router providing the link to the display nodes. The shared theoretical bandwidth between compute nodes and display nodes was 1 gigabit per second. For all experiments, the time used to compute and render a 350-page document was measured, with the purpose of identifying the speedup when adding compute nodes to the system, as well as documenting potential bottlenecks. The PDF document was rasterized into image tiles on the compute-side. Each tile had a size of 512x512 pixels and every page of the document comprised six such tiles. This yields a total resolution of 550 megapixels for the 350 pages. In experiment one, PNG was used as the image tile format. In experiment two, JPG was used. For both experiments, 1, 2, 4, 8, 16, and 28 compute nodes were used to compute the result. Each of these nodes had 4 compute processes running to utilize all the cores (not including Hyper-Threading). Every display node was configured to perform 4 simultaneous requests to the live data set. These requests were load-balanced on the available compute nodes by the live data set. In experiment three and four, image tiles were loaded from the live data set’s cache and from the local cache on each display node, with the purpose of documenting potential bottlenecks in the cache system and the network bandwidth between the live data set and the display nodes. For these experiments, the same image tiles requested in the previous experiments were used. The number of requests generated for each experiment was 2432. This number is larger than the number of tiles that comprised the document, and is caused by some of the image tiles overlapping between displays and thus are requested at least 2 times. (The image tiles overlapping between display corners are requested by 4 display nodes). 6.2 Results Figures 3 and 4 show the time and speedup factor for experiment one and two. This includes the rasterization of the document into image tiles on the compute-side including the time to encode the images to PNG or JPG, the transfer of these image tiles from the compute-side, through the live data set, to the display-side, and the loading and rendering of the image tiles on the display nodes.
Fig. 3. Time to request and simultaneously display 2432 JPG or PNG encoded image tiles computed from a 350-page PDF document residing at the compute-side. (Compute nodes are increased from 1 to 28.) [Line chart: load time in seconds versus number of compute nodes (4 cores / connections per node), one curve each for PNG and JPG.]
Fig. 4. Speedup factor when requesting and simultaneously displaying 2432 JPG or PNG encoded image tiles computed from a 350-page PDF document when going from 1 to 28 compute nodes. [Line chart: speedup versus number of compute nodes (4 cores / connections per node), one curve each for PNG and JPG.]
Fig. 5. Compute core utilization when rasterizing the 350-page PDF document to PNG images. [Line chart: per-core and average core utilization versus number of compute nodes (4 cores / connections per node).]
Fig. 6. Compute core utilization when rasterizing the 350-page PDF document to JPG images. [Line chart: per-core and average core utilization versus number of compute nodes (4 cores / connections per node).]
Figures 5 and 6 show the per-core and average core utilization for experiment one and two.
Table 1 shows the average latency for one request in the system when using all 28 compute nodes. Table 2 shows the result of experiment three and four. The load time for the LDS cache includes the time used to request data from the cache, the transfer of the images over the network and the local time used to decode and render the images. The load time for the local cache includes the time used to request the tiles from the local cache, including the time to decode the images and render them.

Table 1. Average latency for a request to complete when using 28 compute nodes

  Image Type   Display-Side   LDS          Compute-Side
  PNG          0.1521 sec     0.1456 sec   0.1445 sec
  JPG          0.0865 sec     0.0574 sec   0.0533 sec
Table 2. Time to request and simultaneously display 2432 PNG or JPG encoded image tiles requested from the live data set's cache or from the local cache on each visualization node

  Image Type   Load Time LDS Cache   Load Time Local Cache
  PNG          1.694 sec             0.908 sec
  JPG          1.305 sec             0.923 sec
6.3 Discussion As can be seen from figures 3 and 4, the system benefits from an increased number of compute nodes. When using PNG as the image format, the total load time is 74.66 seconds using 1 compute node. When using all nodes this time is reduced to 4.20 seconds, which translates to a speedup of 17.77. When using JPG as the image format the time to load the entire document using one compute node is 20.66 seconds. This time is reduced to 2.4 seconds using all compute nodes, translating to a speedup of 8.59. However, as both figures show, the load time and speedup factor do not scale linearly with the number of additional compute nodes. In addition, for JPG the speedup is approximately half that of PNG when using all nodes, and it only increases by 0.43 when going from 16 to 28 nodes. This indicates a bottleneck in the system. When the produced image tiles are located in the live data set's cache, the time used to load and display the document on the display-side is 1.694 seconds for PNG and 1.305 seconds for JPG (table 2). The reason that the compute system cannot produce data at this rate is a combination of the latency introduced by computing the image tiles on the compute-side and the number of connections that are established from every node on the display-side. During the experiments every display node had 4 request connections. For PNG the average latency per request is 0.1521 seconds. When using 4 connections this translates to a total average of 3.3 seconds per display node (((2432 / 28) / 4) x 0.1521). However, the compute nodes are idle some of this time. For JPG this is even worse. The latency per request is 0.0865 seconds, giving a total latency of 1.88 seconds. However, the compute nodes use only 0.0533 seconds per compute request, giving a larger idle time. The result of this can be seen from figures 5 and 6. The core utilization decreases as the number of nodes increases. When using PNG as the image format, the
CPU core utilization is 3.95 using 4 cores on one node. This value is reduced to 3.21 using all nodes. For JPG, the CPU core utilization using 1 node is 3.89, which is reduced to 2.03 using all nodes. To solve this situation the display-side could be configured to use more than 4 connections to the live data set. However, there is a tradeoff between the number of connections established from the display-side and the performance of the rendering engine. Request threads are responsible for decoding the data for the rendering engine. Decoding of images is CPU bound, and request threads will therefore compete with the rendering engine for CPU cycles on a single-core computer. This will in turn affect the framerate of the visualization. However, this problem can be solved in several ways (separate request functionality from decoding functionality, pipeline requests, or use multi-core computers on the display-side with a dedicated core for the rendering engine). In addition to the display-side modifications, the connections between the live data set and the compute nodes should allow for pipelining of requests to increase the core utilization and mask latency. The presented suggestions are all part of future work and are currently being investigated.
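For reference, the per-display-node time estimates quoted in the discussion follow directly from the request parameters (a restatement of the numbers above, not a new measurement): each of the 28 display nodes issues roughly 2432/28 ≈ 87 requests over 4 connections, so

    t_{PNG} \approx \frac{2432/28}{4} \times 0.1521\,\mathrm{s} \approx 3.3\,\mathrm{s}, \qquad t_{JPG} \approx \frac{2432/28}{4} \times 0.0865\,\mathrm{s} \approx 1.9\,\mathrm{s}.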
7 Conclusion This paper has presented the Network Accessible Compute (NAC) model and a system, WallScope, adhering to the model. The NAC model is realized using a live data set architecture, which separates compute nodes from visualization nodes using a data set containing data customized for the particular visualization domain. Visualization clients request data from the live data set, which forwards these requests to available network accessible compute resources. Network accessible compute resources start the relevant personal desktop applications and use them to produce output that can be transferred into display-side compatible formats by the NAC service. The results are returned to the visualization clients for rendering. Experiments conducted show that the compute resources in the system can be utilized in parallel to increase the overall performance of the system, improving the load time of a PDF document from 74.7 to 4.2 seconds (PNG) and 20.7 to 2.4 seconds (JPG) when going from 1 to 28 compute nodes. This shows that the application output from personal desktop computers can be made interoperable with high-resolution tiled display walls, with good performance and without being limited to the resolution of the local desktop and display. The main bottleneck in the system is the compute-side combined with the synchronous communication mechanism used throughout the system. Currently, work is being done to improve on this. The experiments conducted have shown promising possibilities for displaying static content such as documents and images. Future research will focus on more dynamic content such as collaboratively edited documents and videos, in addition to integrating more applications with the compute-side software.
Acknowledgements The authors would like to thank Lars Ailo Bongo, Bård Fjukstad, Phuong Hoai Ha and Tore Larsen for discussions. In addition, the authors would like to thank the technical staff at the Computer Science Department, University of Tromsø, especially Jon
Ivar Kristiansen for providing great support on the compute nodes used in the experiments. This work has been supported by the Norwegian Research Council, projects No. 159936/V30, SHARE - A Distributed Shared Virtual Desktop for Simple, Scalable and Robust Resource Sharing across Computers, Storage and Display Devices, and No. 155550/420 - Display Wall with Compute Cluster.
References 1. Abdennadher, N., Boesch, R.: Towards a peer-to-peer platform for high performance computing. In: HPCASIA 2005: Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region, p. 354. IEEE Computer Society, Washington, DC, USA (2005) 2. Anderson, D.P.: Boinc: A system for public-resource computing and storage. In: 5th IEEE/ACM International Workshop on Grid Computing, pp. 4–10 (2004) 3. Anderson, D.P., Cobb, J., Korpela, E., Lebofsky, M., Werthimer, D.: Seti@home: an experiment in public-resource computing. Commun. ACM 45(11), 56–61 (2002) 4. Andrade, H., Kurc, T., Sussman, A., Saltz, J.: Active semantic caching to optimize multidimensional data analysis in parallel and distributed environments. Parallel Comput. 33(7-8), 497–520 (2007) 5. Beynon, M.D., Kurc, T., C ¸ ataly¨urek, U., Chang, C., Sussman, A., Saltz, J.: Distributed processing of very large datasets with datacutter. Clusters and Computational Grids for Scientific Computing 27(11), 1457–1478 (2001) 6. Cecile, G.F., Fedak, G., Germain, C., Neri, V.: Xtremweb: A generic global computing system. In: Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2001), pp. 582–587 (2001) 7. Correa, W.T., Klosowski, J.T., Morris, C.J., Jackmann, T.M.: SPVN: a new application framework for interactive visualization of large datasets. In: SIGGRAPH 2007: ACM SIGGRAPH 2007 courses, page 6 (2007) 8. Hagen, T.-M.S., Stødle, D., Anshus, O.: On-demand high-performance visualization of spatial data on high-resolution tiled display walls. In: Proceedings of the International Conference on Information Visualization Theory and Applications, pp. 112–119 (2010) 9. Jeong, B., Renambot, L., Jagodic, R., Singh, R., Aguilera, J., Johnson, A., Leigh, J.: Highperformance dynamic graphics streaming for scalable adaptive graphics environment. In: SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 108 (2006) 10. Katz, D., Bergou, A., Berriman, G., Block, G., Collier, J., Curkendall, D., Good, J., Husman, L., Jacob, J., Laity, A., Li, P., Miller, C., Prince, T., Siegel, H., Williams, R.: Accessing and visualizing scientific spatiotemporal data. In: Proceedings of 16th International Conference on Scientific and Statistical Database Management, 2004, pp. 107–110 (June 2004) 11. Kurc, T., C ¸ ataly¨urek, U., Chang, C., Sussman, A., Saltz, J.: Visualization of large data sets with the active data repository. IEEE Comput. Graph. Appl. 21(4), 24–33 (2001) 12. Li, P.: Supercomputing visualization for earth science datasets. In: Proceedings of 2002 NASA Earth Science Technology Conference (2002) 13. Li, P., Duquette, W.H., Curkendall, D.W.: Riva: A versatile parallel rendering system for interactive scientific visualization. IEEE Transactions on Visualization and Computer Graphics 2(3), 186–201 (1996) 14. Litzkow, M., Livny, M., Mutka, M.: Condor - a hunter of idle workstations. In: Proceedings of the 8th International Conference of Distributed Computing Systems (June 1988) 15. Liu, Y., Anshus, O.J.: Improving the performance of vnc for high-resolution display walls. In: Proceedings of the 2009 International Symposium on Collaborative Technologies and Systems, pp. 376–383. IEEE Computer Society, Washington, DC, USA (2009)
16. Manferdelli, J.L., Govindaraju, N.K., Crall, C.: Challenges and opportunities in many-core computing. Proceedings of the IEEE 96(5), 808–815 (2008) 17. Microsoft, http://www.msdn.microsoft.com/en-us/library/aa383015(VS.85). aspx 18. Microsoft, http://www.microsoft.com/com/default.mspx 19. NVIDIA, http://www.nvidia.com/object/product_geforce_gtx_480_us.html 20. Pande, V.S., Baker, I., Chapman, J., Elmer, S.P., Khaliq, S., Larson, S.M., Rhee, Y.M., Shirts, M.R., Snow, C.D., Sorin, E.J., Zagrovic, B.: Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing. Biopolymers 68(1), 91–109 (2003) 21. Predictor@home, http://predictor.scripps.edu 22. Singh, R., Jeong, B., Renambot, L., Johnson, A., Leigh, J.: Teravision: a distributed, scalable, high resolution graphics streaming system. In: CLUSTER 2004: Proceedings of the 2004 IEEE International Conference on Cluster Computing, pp. 391–400 (2004) 23. Smarr, L.L., Chien, A.A., DeFanti, T., Leigh, J., Papadopoulos, P.M.: The optiputer. Commun. ACM 46(11), 58–67 (2003) 24. Sodan, A.C., Machina, J., Deshmeh, A., Macnaughton, K., Esbaugh, B.: Parallelism via multithreaded and multicore cpus. Computer 43, 24–32 (2010) 25. Stainforth, D., Kettleborough, J., Martin, A., Simpson, A., Gillis, R., Akkas, A., Gault, R., Collins, M., Gavaghan, D., Allen, M.: Climateprediction.net: Design principles for publicresource modeling research. In: 14th IASTED International Conference Parallel and Distributed Computing and Systems, pp. 32–38 (2002) 26. Sun Microsystems, http://www.jcp.org/aboutJava/communityprocess/ first/jsr056/jnlp-1.0-proposed-final-draft.pdf 27. Tilera, http://www.tilera.com/pdf/PB025_TILE-Gx_Processor_A_v3.pdf 28. Vinter, B.: The architecture of the minimum intrusion grid (mig). In: Broenink, J.F., Roebbers, H.W., Sunter, J.P.E., Welch, P.H., Wood, D.C. (eds.) CPA. Concurrent Systems Engineering Series, vol. 63, pp. 189–201. IOS Press, Amsterdam (2005) 29. Wessels, D., Claffy, K., Braun, H.-W.: NLANR prototype Web caching system (1995), http://ircache.nlaur.net/ 30. Zhang, C.: OptiStore: An On-Demand Data processing Middleware for Very Large Scale Interactive Visualization. PhD thesis, Computer Science, Graduate College of the University of Illinois, Chicago (2008) 31. Zhang, C., Leigh, J., DeFanti, T.A., Mazzucco, M., Grossman, R.: Terascope: distributed visual data mining of terascale data sets over photonic networks. Future Gener. Comput. Syst. 19(6), 935–943 (2003)
Replica Placement in Peer-Assisted Clouds: An Economic Approach Ahmed Ali-Eldin and Sameh El-Ansary Center of Informatics science, Nile University [email protected],[email protected]
Abstract. We introduce NileStore, a replica placement algorithm based on an economic model for use in peer-assisted cloud storage. The algorithm uses the storage and bandwidth resources of peers to offload the cloud provider's resources. We formulate the placement problem as a linear task assignment problem where the aim is to minimize the time needed for file replicas to reach a certain desired threshold. Using simulation, we reduce the probability of a file being served from the provider's servers by more than 97.5% under realistic network conditions. Keywords: Peer to peer computing, cloud storage.
1 Introduction
Cloud storage systems are online storage systems where data is stored on groups of virtual servers rather than dedicated servers. The storage provider assigns resources to a user according to the current requirements of the customer. The cost of bandwidth is the strongest challenge facing cloud storage [1]. One of the answers to this challenge is to build a peer-assisted cloud where peers' resources are used to offload the storage servers. The provider distributes replicas of the data through the network to reduce the cost of operation, provide fault tolerance and provide a reliable service at reduced costs. Providing guarantees on availability and durability in a system depending on volatile peers is hard. Availability is the ability of a peer in the network to retrieve a data object at any time. Durability of a data object represents the time the object is not permanently lost. A P2P storage system gives probabilistic guarantees on the durability of 97% of the data [2]. In a peer-assisted approach, data will always be stored on the servers of the storage service provider, so there will also be guarantees on availability. In this paper we focus on the problem of replica placement in peer-assisted cloud storage. Replica placement addresses the problem of where to place the replicas created by the system to maintain the highest levels of durability and availability of the data stored. The main contributions in this work are: 1. We introduce an economic formulation of the replica placement problem. 2. We introduce NileStore, a peer-assisted cloud storage network protocol that offloads the service provider's servers. We show that using NileStore can result in at least 75% improvement over using a random placement algorithm.
2 Related Work
In [3], a peer-assisted cloud storage system deployed in China is introduced. The authors describe the system design and show some measurements. Our work can be considered an extension to their system, as FS2you uses random placement for the replicas. Toka et al. [1] prove that peer-assisted clouds can provide a performance comparable to that of a centralized cloud at a fraction of the cost. For placement, they cluster peers depending on their online behavior. Our system takes into account the contributed storage, bandwidth and the scarcity of the data when doing data placement, and aims at offloading the servers of the service provider. OceanStore [2] and Farsite [4] are examples of P2P storage systems. In [5], the authors prove that when redundancy, data scale, and dynamics are all high, the needed cross-system bandwidth is unreasonable. A similar conclusion was presented in [4]. These results make peer-assisted cloud storage systems a more attractive approach, as the storage nodes in the cloud are less dynamic compared to P2P nodes.
3 Replica Placement and Economics
The problem of replica placement in the context of peer-assisted cloud storage can be defined as follows: given a group of peers in a cloud storage network, where each peer has some data for replication, free storage and unused bandwidth, make r replicas of the data of each peer using the contributed space of the other peers in a way that increases the amount of data retrieved from the peers compared to that retrieved from the servers. The economic problem is the problem of allocating scarce resources among different alternatives or competing ends [6]. Replica placement is similar to the economic problem, as there are scarce resources (bandwidth and storage) that can be allocated among the different peers, i.e., the different alternatives. Replica management economies are systems where a peer acquires real or virtual money for hosting replicas of others; the machine then uses this money to buy storage space for its own replicas [7]. We consider the problem of replica placement in peer-assisted storage clouds as an economic problem and try to solve it using a mixture of two types of auctions: first-price sealed-bid auctions and double auctions [8]. In a first-price sealed-bid auction, each buyer submits one bid for a resource with no knowledge of the bids of the other buyers. The highest bidder wins and gets the resource for the price specified in the bid. In a double auction, bidders submit their bids while sellers submit the items (resources) they offer, at any time during the trading period. If at any time a bid matches one of the offered items (quantity of offered resources), the trade is executed immediately. In NileStore we consider three main players: the cloud provider acts as the auctioneer, while peers play a dual role as sellers, who contribute resources to earn money, and as bidders, who use it to buy backup space. All peers send a sealed bid
to the cloud provider containing the amount of contributed resources and the amount of needed resources. If fairness is to be imposed, a peer is not allowed to send a bid in which the amount of resources it contributes is less than the amount of resources it buys. A peer can place a bid in which its contributed resources exceed the amount of resources it plans to buy, in order to ensure that it has a higher bid than the others. The provider, receiving these bids during a trading period, matches the different bidders with the different sellers in a way that maximizes the utility.
4 Players Design
In NileStore, the allocation server holds an auction every τ time units. The peers send sealed bids specifying the data blocks to be uploaded, any data blocks already hosted, the amount of storage contributed to the system, the peer's upload and download bandwidths, a list of hashes of the data objects, and the amount of contributed resources. A data block is replicated r times to provide higher availability. The list of hashes identifies the replica count available for each data object, and deduplication is achieved using these replica counts. After τ time units, the server does not accept more bids for the current trading period; any late bids are stored for evaluation during the next bidding round. The server converts the replica placement problem into a task assignment problem [9]: we calculate the profit of allocating the resources of a seller peer to every available buyer.
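As an illustration only, the following sketch shows one possible shape of such a sealed bid message; the class, field names and the fairness check are our own assumptions for the example, not the actual NileStore message format.

```java
import java.util.List;

/** Hypothetical sealed bid a peer sends to the allocation server each trading period. */
public final class SealedBid {
    public final String peerId;                  // identity of the bidding peer
    public final List<String> blocksToReplicate; // hashes of data blocks the peer wants replicated
    public final List<String> hostedBlocks;      // hashes of blocks the peer already hosts for others
    public final long contributedStorageBytes;   // free storage offered to the system
    public final long uploadBandwidthBps;        // peer's upload bandwidth
    public final long downloadBandwidthBps;      // peer's download bandwidth

    public SealedBid(String peerId, List<String> blocksToReplicate, List<String> hostedBlocks,
                     long contributedStorageBytes, long uploadBandwidthBps, long downloadBandwidthBps) {
        this.peerId = peerId;
        this.blocksToReplicate = blocksToReplicate;
        this.hostedBlocks = hostedBlocks;
        this.contributedStorageBytes = contributedStorageBytes;
        this.uploadBandwidthBps = uploadBandwidthBps;
        this.downloadBandwidthBps = downloadBandwidthBps;
    }

    /** Fairness rule from Sect. 3: a peer may not buy more resources than it contributes
     *  (Sect. 4.1 strengthens this to r times the data size when fairness is important). */
    public boolean isFair(long bytesRequested) {
        return contributedStorageBytes >= bytesRequested;
    }
}
```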
4.1 The Profit Function
The profit function between a buyer peer pi and a seller peer pj consists of three multiplied terms:
1. The feasibility of storage Sij captures the feasibility of storing the data of pi, of size |B̂i|, in the free space Fj of pj, where B̂i is the list of blocks that the buyer wants to replicate and |B̂i| is its size. When coupling two peers based on storage we want to reduce fragmentation, so we try to keep the blocks owned by a peer spatially on the same machines. This allows a peer to contact a minimal number of peers to retrieve all the data it owns; the system should therefore look for a best-fit allocation between the peers. In a system where fairness is important, a peer contributes at least r times the size of the data it initially wants to replicate. If the peer chooses to increase its storage contribution, it will host more blocks and eventually obtain a higher utility. The feasibility of storage is calculated as

    S_{ij} = \frac{\min(|\hat{B}_i|, F_j)}{\max(|\hat{B}_i|, F_j)}    (1)
2. The feasibility of transfer Tij adds bandwidth considerations to the utility calculation. We aim to couple peers such that their bandwidths are maximally utilized, in order to reduce the transfer time. Fairness is imposed by enforcing a ratio between the upload bandwidth ui of pi and the download bandwidth dj of the other peer. The term is

    T_{ij} = \frac{\min(u_i, d_j)}{\max(u_i, d_j)}    (2)
3. The average scarcity of the blocks of a buyer, Hun(pi), represents the scarcity of the data blocks of a peer pi. We define the scarcity of a peer as the average number of replicas that its blocks still need, that is,

    H_{un}(p_i) = \frac{\sum_{b_{ik} \in \hat{B}_i} \left( r - R(b_{ik}) \right)}{|\hat{B}_i|}    (3)

where B̂i is the set of blocks owned by pi that need replication, bik is a single block on pi, and R(bik) is the number of replicas available for block bik in the network. These three terms are multiplied to form the local profit and are fed into an assignment engine which tries to find a suboptimal allocation policy that maximizes the profit for the system while satisfying the needs of every buyer. The goal is to reduce the bandwidth consumption of all peers while reducing the load on the storage servers. We started experimenting with the Hungarian algorithm [9], which proved unsuitable because of its O(n^4) complexity; we therefore designed a greedy suboptimal task assignment engine with a lower complexity of O(n log n), where n is the number of peers.
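A minimal sketch of how the three terms of Eqs. (1)-(3) could be combined into the local profit of matching a buyer pi with a seller pj; the method names and the representation of blocks as a map of replica counts are our own assumptions, not taken from the paper.

```java
import java.util.Map;

/** Illustrative computation of the local profit terms (Eqs. 1-3). */
public final class Profit {

    /** Eq. (1): feasibility of storing the buyer's blocks (total size sizeI) in the seller's free space freeJ. */
    static double storageFeasibility(double sizeI, double freeJ) {
        return Math.min(sizeI, freeJ) / Math.max(sizeI, freeJ);
    }

    /** Eq. (2): feasibility of transfer given buyer upload ui and seller download dj. */
    static double transferFeasibility(double ui, double dj) {
        return Math.min(ui, dj) / Math.max(ui, dj);
    }

    /** Eq. (3): average scarcity of the buyer's blocks; replicaCount maps block id -> replicas in the network. */
    static double averageScarcity(Map<String, Integer> replicaCount, int r) {
        double sum = 0.0;
        for (int replicas : replicaCount.values()) {
            sum += r - replicas;               // replicas still missing for this block
        }
        return sum / replicaCount.size();
    }

    /** Local profit of matching buyer i with seller j: the product of the three terms. */
    static double localProfit(double sizeI, double freeJ, double ui, double dj,
                              Map<String, Integer> replicaCount, int r) {
        return storageFeasibility(sizeI, freeJ)
             * transferFeasibility(ui, dj)
             * averageScarcity(replicaCount, r);
    }
}
```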
4.2 Solving the Task Assignment Problem
The algorithm shuffles the list of workers randomly. The first worker at the top of the list is picked and all of the jobs that it can perform are sorted by decreasing profit. The job with the highest profit is assigned to the selected worker and the job's name is added to a list containing the names of all assigned jobs. The second worker on the list is then picked, the jobs it can do are sorted, and it is assigned to the unassigned job with the highest profit. This is repeated for every worker in the workers list, such that no job is assigned to two workers.
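A sketch of this greedy assignment, under the assumption that the pairwise profits are already available in a lookup table; it is our reading of the description above, not the authors' code, and it replaces the per-worker sort with an equivalent linear scan for the single best job.

```java
import java.util.*;

/** Greedy suboptimal task assignment: each shuffled worker takes its most profitable unassigned job. */
public final class GreedyAssignment {

    /**
     * @param profit profit[w][j] is the profit of assigning job j to worker w (NaN if w cannot do j)
     * @return map from worker index to assigned job index
     */
    static Map<Integer, Integer> assign(double[][] profit, Random rnd) {
        List<Integer> order = new ArrayList<>();
        for (int w = 0; w < profit.length; w++) order.add(w);
        Collections.shuffle(order, rnd);              // shuffle the list of workers randomly

        Set<Integer> assignedJobs = new HashSet<>();
        Map<Integer, Integer> assignment = new HashMap<>();
        for (int w : order) {
            int bestJob = -1;
            double bestProfit = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < profit[w].length; j++) {
                if (assignedJobs.contains(j) || Double.isNaN(profit[w][j])) continue;
                if (profit[w][j] > bestProfit) {      // feasible job with the highest profit so far
                    bestProfit = profit[w][j];
                    bestJob = j;
                }
            }
            if (bestJob >= 0) {
                assignment.put(w, bestJob);
                assignedJobs.add(bestJob);            // ensures no job is assigned to two workers
            }
        }
        return assignment;
    }
}
```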
5 Simulation and Results
We built a discrete event simulator that simulates peer-assisted cloud storage networks. We used a group of P2P workload studies to derive a workload model. Peer bandwidths were obtained from [10]. We generate a number of unique objects for each peer. The contributed free storage for a peer is set randomly
[Fig. 1. NileStore performance. (a) Effect of changing the trading period τ on the performance of NileStore versus random placement; (b) effect of increasing the number of replicas r on performance. Both panels plot the ratio of files fetched from the storage servers.]
between r and 2r times the size of its data. We used [11] and [12] to quantify the peer join/leave rates. We conducted experiments to evaluate NileStore versus random placement. To the best of our knowledge, random placement is the only approach for replica placement used in the peer-assisted storage literature. In random placement, a peer replicates its data on a peer chosen randomly from the peers available in the network. Figure 1(a) shows the ratio of data blocks unavailable after 20,000 bidding rounds, with trading periods varying between one minute and eight minutes. The figure shows that using NileStore improves the system performance by 70 to 90%. If τ is small, many of the peers will not be able to send their bids and the assignment algorithm will have fewer options. If τ is chosen to be very large, NileStore will react to failures slowly, risking the loss of objects that need replication. It can be seen from the figure that choosing τ = 5 reduces contact with the servers by almost 95%. Figure 1(b) shows the effect of increasing the threshold r for the number of replicas made for each data block in the system. Our simulation results conform with previous results [12] on the number of replicas needed. Figure 2 shows that 8 seconds are needed for making the allocation when there are 1000 peers in the system.
[Fig. 2. The time of allocation in NileStore and using a random placement approach. (a) Time required for the replica placement computation using NileStore for 1000 peers; (b) time required for the replica placement computation using random placement for 1000 peers. Both panels plot the time required for assignment (seconds) over simulated time (minutes).]
6 Conclusion and Future Work
In this work we introduced NileStore, a peer-assisted cloud storage protocol that offloads the resources of the cloud storage servers using the unused resources of the peers subscribed to the storage service. A single copy of each data object is stored on the storage servers to ensure durability and availability, and duplicates are distributed across the network. Peers always try to retrieve data from the network before contacting the servers. In the future we plan to distribute the allocation process, to deploy the system in a real-life setting, and to consider the trust levels of the peers.
References
1. Toka, L., Dell'Amico, M., Michiardi, P.: Online Data Backup: A Peer-Assisted Approach. In: 2010 IEEE Tenth International Conference on Peer-to-Peer Computing (P2P), pp. 1–10. IEEE, Los Alamitos (2010)
2. Kubiatowicz, J.: Extracting guarantees from chaos. Communications of the ACM 46(2), 33–38 (2003)
3. Sun, Y., Liu, F., Li, B., Li, B., Zhang, X.: FS2You: Peer-assisted semi-persistent online storage at a large scale. In: IEEE INFOCOM 2009, pp. 873–881. IEEE, Los Alamitos (2009)
4. Bolosky, W., Douceur, J., Howell, J.: The Farsite project: a retrospective. ACM SIGOPS Operating Systems Review 41(2), 17–26 (2007)
5. Blake, C., Rodrigues, R.: High availability, scalable storage, dynamic peer networks: Pick two. In: Proceedings of the 9th Conference on Hot Topics in Operating Systems, vol. 9, p. 1. USENIX Association (2003)
6. Buchanan, J.: What should economists do? Southern Economic Journal 30(3), 213–222 (1964)
7. Geels, D., Kubiatowicz, J.: Replica management should be a game. In: Proc. of the 10th European SIGOPS Workshop. ACM, New York (2002)
8. Buyya, R., Abramson, D., Giddy, J., Stockinger, H.: Economic models for resource management and scheduling in grid computing. Concurrency and Computation: Practice and Experience 14(13-15), 1507–1542 (2002)
9. Burkard, R., Dell'Amico, M., Martello, S.: Assignment Problems. Society for Industrial and Applied Mathematics, Philadelphia (2009)
10. Correa, D.K.: Assessing Broadband in America: OECD and ITIF Broadband Rankings. SSRN eLibrary (2007)
11. Steiner, M., En-Najjary, T., Biersack, E.W.: Analyzing peer behavior in KAD. Institut Eurecom, Tech. Rep. (October 2007)
12. Chun, B.-G., Dabek, F., Haeberlen, A., Sit, E., Weatherspoon, H., Kaashoek, M.F., Kubiatowicz, J., Morris, R.: Efficient replica maintenance for distributed storage systems. In: NSDI 2006: Proceedings of the 3rd Conference on Networked Systems Design & Implementation, p. 4. USENIX Association, Berkeley (2006)
A Correlation-Aware Data Placement Strategy for Key-Value Stores
Ricardo Vilaça, Rui Oliveira, and José Pereira
High-Assurance Software Laboratory, University of Minho, Braga, Portugal
{rmvilaca,rco,jop}@di.uminho.pt
Abstract. Key-value stores hold the unprecedented bulk of the data produced by applications such as social networks. Their scalability and availability requirements often outweigh sacrificing richer data and processing models, and even elementary data consistency. Moreover, existing key-value stores offer only random or order-based placement strategies. In this paper we exploit arbitrary data relations, easily expressed by the application, to foster data locality and improve the performance of complex queries common in social network read-intensive workloads. We present a novel data placement strategy, supporting dynamic tags, based on multidimensional locality-preserving mappings. We compare our data placement strategy with the ones used in existing key-value stores under the workload of a typical social network application and show that the proposed correlation-aware data placement strategy offers a major improvement in the system's overall response time and network requirements. Keywords: Peer-to-Peer, DHT, Cloud Computing, Dependability.
1 Introduction
Highly distributed and elastic key-value stores are at the core of the management of the sheer volumes of data handled by very large scale Internet services. Major examples such as Google, Facebook and Twitter rely on key-value stores to handle the bulk of their data where traditional relational database management systems fail to scale or become economically unacceptable. To this end, distributed key-value stores invariably offer very weak consistency guarantees and eschew transactional guarantees. These first generation distributed key-value stores are built by major Internet players, like Google [4], Amazon [8], Facebook [17] and Yahoo [6], by embracing the Cloud Computing model. While common applications leverage, or even depend on, general multi-item operations that read or write whole sets of items, current key-value stores only
Partially funded by the Portuguese Science Foundation (FCT) under project Stratus – A Layered Approach to Data Management in the Cloud (PTDC/EIACCO/115570/2009) and grant SFRH/BD/38529/2007.
offer simple single-item operations or, at most, range queries based on the primary key of the items [23]. These systems require that more general and complex multi-item queries are done outside of the system using some implementation of the MapReduce [7] programming model: Yahoo's PigLatin, Google's Sawzall, Microsoft's LINQ. However, if the API does not provide enough operations to efficiently retrieve multiple items, such general multi-item queries will have a high cost in performance. These queries will mostly access a set of correlated items. Zhong et al. have shown that the probability of a pair of items being requested together in a query is not uniform but often highly skewed [27]. They have also shown that correlation is mostly stable over time for real applications. Furthermore, when involving multiple items in a request to a distributed key-value store, it is desirable to restrict the number of nodes that actually participate in the request. It is therefore beneficial to couple related items tightly, and unrelated items loosely, so that the items most commonly queried together by a request are close to each other. Leveraging the items' correlation and the biased access patterns requires the ability to reflect that correlation in the item placement strategy [26]. However, the data placement strategies in existing key-value stores [4,6,8,17] only support single-item or range queries. If the data placement strategy places correlated items on the same node, the communication overhead for multi-item operations is reduced. The challenge here is to achieve such placement in a decentralized fashion, without resorting to a global directory, while at the same time ensuring that the storage and query load on each node remains balanced. We address this challenge with a novel correlation-aware data placement strategy that allows the use of dynamic and arbitrary tags on data items and combines the usage of a Space Filling Curve (SFC) with random partitioning to store and retrieve correlated items. This strategy was built into DataDroplets, an ongoing effort to build an elastic data store supporting conflict-free strongly consistent data storage. Multi-item operations leverage disclosed data relations to manipulate dynamic sets of comparable or arbitrarily related elements. DataDroplets extends the data model of existing key-value stores with tags, allowing applications to establish arbitrary relations among items. It is suitable to handle multi-tenant data and is meant to be run in a Cloud Computing environment. DataDroplets also supports the usual random and order-based strategies. This allows it to adapt and be optimized to different workloads and, in the specific context of Cloud Computing, to suit the multi-tenant architecture. Moreover, as some data placement strategies may be non-uniform, with impact on the overall system performance and fault tolerance, we implemented a load balancing mechanism to enforce uniformity of data distribution among nodes. We have evaluated our proposal in a realistic environment and with a workload that mimics the Twitter social network. The remainder of the paper is organized as follows. Section 2 presents DataDroplets and Section 3 describes the correlation-aware placement strategy.
Section 4 presents a thorough evaluation of the three placement strategies. Section 5 discusses related work and Section 6 concludes the paper.
2 DataDroplets Key-Value Store
DataDroplets is a key-value store targeted at supporting very large volumes of data, leveraging the individual processing and storage capabilities of a large number of well connected computers. It offers a low-level storage service with a simple application interface providing the atomic manipulation of key-value items and the flexible establishment of arbitrary relations among items. In [23] we introduced DataDroplets and presented a detailed comparison against existing systems regarding their data model, architecture and trade-offs.
2.1 Data Modeling
DataDroplets assumes a very simple data model. Data is organized into disjoint collections of items identified by a string. Each item is a triple consisting of a unique key drawn from a partially ordered set, a value that is opaque to DataDroplets, and a set of free-form string tags. DataDroplets uses tags to allow applications to dynamically establish arbitrary relations among items. The major advantage of tags is that they are free-form strings, and thus applications may use them in different manners. Applications migrated from relational databases, with relationships between rows of different tables and frequent queries over these relationships, may use tags as foreign keys and will therefore efficiently retrieve correlated rows. Social applications may use as tags the user's ID and the IDs of the user's social connections, so that most operations are restricted to a small set of nodes. Also, tags can be used to correlate messages on the same topic.
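As an illustration, a minimal sketch of this data model; the type and field names are our own assumptions and not DataDroplets' actual API.

```java
import java.util.Set;

/** Hypothetical representation of a DataDroplets item: key, opaque value, and free-form tags. */
public final class Item<K extends Comparable<K>> {
    public final K key;            // unique key drawn from a partially ordered set
    public final byte[] value;     // opaque to the store
    public final Set<String> tags; // free-form strings relating this item to others

    public Item(K key, byte[] value, Set<String> tags) {
        this.key = key;
        this.value = value;
        this.tags = tags;
    }
}

// Example: tagging a tweet with its author and topic so correlated items can be co-located.
// new Item<>(tweetId, payload, Set.of("user:alice", "topic:dais2011"));
```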
2.2 Overlay Management
DataDroplets builds on the Chord [22] structured overlay network. Physical nodes are kept organized on a logical ring overlay where each node maintains complete information about the overlay membership, as in [12,18]. This fits our informal assumptions about the size and dynamics of the target environments (tens to hundreds of nodes with a reasonably stable membership) and allows efficient one-hop routing of requests [12]. On membership changes (due to nodes that join or leave the overlay) the system adapts to its new composition, updating the routing information at each node and readjusting the data stored at each node according to the redistribution of the mapping interval. In DataDroplets this procedure closely follows the one described in [12].¹
¹ To the reviewer: since in this paper we do not assess the impact of dynamic membership changes and because the algorithm has been described in [12], we omit most of the details of the procedure.
Besides the automatic load redistribution on membership changes, and because some workloads may impair the uniform data distribution even with a random data placement strategy, the system implements dynamic load balancing as proposed in [15]. Roughly, the algorithm is as follows: periodically, a randomly chosen node contacts its successor in the ring to carry out a pairwise adjustment of load. DataDroplets uses synchronous replication to provide fault tolerance and automatic fail-over on node crashes [23].
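A minimal sketch of such a pairwise adjustment between a node and its ring successor, assuming load is measured as the number of stored items; the interface and method names are illustrative, not DataDroplets' actual code.

```java
/** Illustrative pairwise load balancing step between a node and its ring successor. */
final class LoadBalancer {

    interface Node {
        int load();                                    // e.g., number of items currently stored
        void transferItemsTo(Node target, int count);  // hand over boundary items, adjusting key ranges
    }

    /** Periodically invoked on a randomly chosen node: even out load with its successor. */
    static void balanceWithSuccessor(Node node, Node successor) {
        int diff = node.load() - successor.load();
        if (Math.abs(diff) <= 1) {
            return;                                    // already balanced
        }
        if (diff > 0) {
            node.transferItemsTo(successor, diff / 2); // shed half of the surplus
        } else {
            successor.transferItemsTo(node, -diff / 2);
        }
    }
}
```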
2.3 Data Placement Strategies
Nodes in the DataDroplets overlay have unique identifiers uniformly picked from the [0, 1] interval and ordered along the ring. Each node is responsible for the storage of buckets of a distributed hash table (DHT) also mapped into the same [0, 1] interval. The data placement strategy is defined on a collection basis. In the following we describe the commonly used data placement strategies for DataDroplets. The first is the random placement, the basic load-balancing strategy present in most DHTs [22,12] and also in most key-value stores [8,6,17]. This strategy is based on a consistent hash function [14]. When using consistent hashing each item has a numerical ID (between 0 and MAXID) obtained, for example, by hashing the item's key. The output of the hash function is treated as a circular space in which the largest value wraps around the smallest value. This is particularly interesting when made to overlap the overlay ring. Furthermore, it guarantees that the addition or removal of a bucket (the corresponding node) incurs only a small change in the mapping of keys to buckets. The other is the ordered placement, which takes into account order relationships among the items' primary keys, favoring the response to range-oriented reads, and is present in some key-value stores [6,4,17]. This order needs to be disclosed by the application and can be per application, per workload or even per request. We use an order-preserving hash function [11] to generate the identifiers. Compared to a standard hash function, for a given ordering relation among the items, an order-preserving hash function hashorder() has the extra guarantee that if o1 < o2, then hashorder(o1) < hashorder(o2). The major drawback of the random placement is that items that are commonly accessed by the same operation may be distributed across multiple nodes; a single operation may need to retrieve items from many different nodes, leading to a performance penalty. Regarding the ordered placement, in order to make the order-preserving hash function uniform as well we need some knowledge of the distribution of the items' keys [11]. For a uniform and efficient distribution we need to know the domain of the item's key, i.e., the minimum and maximum values. This yields a tradeoff between uniformity and reconfiguration: while a pessimistic prediction of the domain will avoid further reconfiguration, it may break uniformity. In the current implementation of DataDroplets the hash function is not made uniform but, as described later, we use a more general approach to balance load.
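A sketch of how these two strategies could map an item key to the [0, 1] identifier space; the concrete hash choices (SHA-1 for random placement, a simple linear order-preserving transform over a known key domain) are our own assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Illustrative key-to-identifier mappings for the random and ordered placement strategies. */
final class Placement {

    /** Random placement: consistent hashing of the key into [0, 1). */
    static double randomPosition(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(key.getBytes(StandardCharsets.UTF_8));
        long bits = 0;
        for (int i = 0; i < 7; i++) {               // take 56 bits of the digest
            bits = (bits << 8) | (digest[i] & 0xFF);
        }
        return bits / (double) (1L << 56);
    }

    /** Ordered placement: order-preserving hash assuming the key domain [min, max] is known. */
    static double orderedPosition(long key, long min, long max) {
        return (key - min) / (double) (max - min);  // if k1 < k2 then position(k1) < position(k2)
    }
}
```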
[Fig. 1. Tagged placement strategy. Panels: (a) Hilbert mapping; (b) hybrid-n placement strategy; (c) query example.]
3 Correlation-Aware Strategy
A key aspect of DataDroplets is multi-item access, which enables the efficient storage and retrieval of large sets of related data at once. Multi-item operations leverage disclosed data relations to manipulate sets of comparable or arbitrarily related elements. The performance of multi-item operations depends heavily on the way correlated data is physically distributed. The balanced placement of data is particularly challenging in the presence of dynamic and multidimensional relations. This aspect is the main contribution of the current work, which describes a novel data placement strategy based on multidimensional locality-preserving mappings. Correlation is derived from disclosed tags dynamically attached to items.
3.1 Tagged Placement
The placement strategy, called hereafter tagged, realizes the data distribution according to the set of tags defined per item. A relevant aspect of our approach is that these sets can be dynamic. This allows us to efficiently retrieve correlated items that were previously tagged by the application. The strategy uses a dimension-reducing and locality-preserving indexing scheme that effectively maps the multidimensional information space to the identifier space [0, 1]. Tags are free-form strings and form a multidimensional space where the tags are the coordinates and the data items are points in the space. Two data items are collocated if they have equally sized sets of lexicographically close tags, or if one set is a subset of the other. This mapping is derived from a locality-preserving mapping called a Space Filling Curve (SFC) [19]. An SFC is a continuous mapping from a d-dimensional space to a unidimensional space (f : N^d → N). The d-dimensional space is viewed as a d-dimensional cube partitioned into sub-cubes, which is mapped onto a line such that the line passes once through each point (sub-cube) in the volume of the cube, entering and exiting the cube only once. Using this mapping,
a point in the cube can be described by its spatial coordinates, or by the length along the line measured from one of its ends. SFCs are used to generate the one-dimensional index space from the multidimensional tag space. Applying the Hilbert mapping [3] to this multidimensional space, each data element can be mapped to a point on the SFC. Figure 1(a) shows the mapping for the set of tags {a, b}. Any range query or query composed of tags can be mapped into a set of regions in the tag space, corresponding to line segments in the resulting one-dimensional Hilbert curve. These line segments are then mapped to the proper nodes. An example for querying tag {a} is shown in Figure 1(c), where the query is mapped into two line segments. An update to an item without knowledge of its previous tags must first find which node holds the item and then update it; if the update also changes the tags, the item is moved from the old node, defined by the old tags, to the new node, defined by the new tags. As this strategy only takes tags into account, all items with the same set of tags would have the same position in the identifier space and would therefore be allocated to the same node. To prevent this we adopt a hybrid-n strategy: we divide the set of nodes into n partitions and the item's tags, instead of defining the complete identifier in the identifier space, define only the partition, as shown in Figure 1(b). The position inside the partition is defined by a random strategy; locality is therefore only preserved at the granularity of partitions.
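A sketch of the two-dimensional Hilbert index that such a tagged strategy could use to turn tag coordinates into a position on the ring; this is the textbook xy-to-distance conversion, shown only to make the mapping concrete, and the hashing of tags into grid coordinates and the normalization to [0, 1] are our own assumptions.

```java
/** Textbook 2-D Hilbert curve index: maps (x, y) in a side-n grid (n a power of two) to a curve distance. */
final class HilbertIndex {

    static long xy2d(long n, long x, long y) {
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) > 0 ? 1 : 0;
            long ry = (y & s) > 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            // rotate the quadrant so the curve stays continuous
            if (ry == 0) {
                if (rx == 1) {
                    x = s - 1 - x;
                    y = s - 1 - y;
                }
                long t = x; x = y; y = t;
            }
        }
        return d;
    }

    /** Hypothetical use: hash two tag dimensions into grid coordinates and normalize to [0, 1]. */
    static double position(String tagA, String tagB, long n) {
        long x = Math.floorMod(tagA.hashCode(), n);   // illustrative coordinate per tag dimension
        long y = Math.floorMod(tagB.hashCode(), n);
        return xy2d(n, x, y) / (double) (n * n);      // curve distance normalized to the identifier space
    }
}
```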
3.2 Request Handling
The system supports common single-item operations such as put, get and delete, multi-item put (multiPut) and get (multiGet) operations, and set operations to retrieve ranges (getByRange) and equally tagged items (getByTags). The details of the DataDroplets operations are presented in [23]. Any node in the overlay can handle client requests. When handling a request the node may need to split the request, contact a set of nodes, and compose the client's reply from the replies it gets from the contacted nodes. This is particularly so with multi-item and set operations; when the collection's placement is done by tags, this also happens for single-item operations. Indeed, most request processing is tightly dependent on the collection's placement strategy. For put and multiPut this is obvious, as the target nodes result from the chosen placement strategy. For operations that explicitly identify the item by key (get, multiGet and delete), the node responsible for the data can be directly identified when the collection is distributed at random or ordered; when the data is distributed by tags, all nodes need to be searched for the requested key. For getByRange and getByTags requests the right set of nodes can be directly identified if the collection is distributed with the ordered and tagged strategies, respectively. Otherwise, all nodes need to be contacted and need to process the request.
4 Experimental Evaluation
We ran a series of experiments to evaluate the performance of the system, in particular the suitability of the different data placement strategies, under a workload representative of applications currently exploiting the scalability of emerging key-value stores. In the following we present performance results for the three data placement strategies previously described, both in simulated and real settings.
4.1 Test Workload
For the evaluation of DataDroplets we have defined a workload that mimics the usage of the Twitter social network. Twitter is an online social network application offering a simple micro-blogging service consisting of small user posts, the tweets. A user gets access to other users' tweets by explicitly stating a follow relationship, building a social graph. Our workload definition has been shaped by the results of recent studies on Twitter [13,16,2] and biased towards a read-intensive workload, based on discussions that took place during Twitter's Chirp conference (the official Twitter developers conference). In particular, we consider just the subset of the seven most used operations from the Twitter API (Search and REST API as of March 2010, http://apiwiki.twitter.com/Twitter-API-Documentation): statuses_user_timeline, statuses_friends_timeline, statuses_mentions, search_contains_hashtag, statuses_update, friendships_create and friendships_destroy. Twitter's network belongs to a class of scale-free networks and exhibits a small-world phenomenon [13]. The generation of tweets, both for the initialization phase and for the workload, follows observations over Twitter traces [16,2]. First, the number of tweets per user is proportional to the user's number of followers [16]. Of all tweets, 36% mention some user and 5% refer to a topic [2]. Mentions in tweets are created by randomly choosing a user from the set of friends. Topics are chosen using a power-law distribution [13]. Each run of the workload consists of a specified number of operations. The next operation is randomly chosen and, after it has finished, the system waits some pre-configured time, the think-time, before sending the next operation. The probabilities of occurrence of each operation and a more detailed description of the workload can be found in [24]. The defined workload may be used with both key-value stores and relational databases (the workload is available at https://github.com/rmpvilaca/UBlog-Benchmark).
4.2 Experimental Setting
We evaluate our implementation of DataDroplets using the ProtoPeer toolkit [9]. ProtoPeer is a toolkit for rapid distributed systems prototyping that allows
switching between event-driven simulation and live network deployment without changing the application code. For all experiments presented next the performance metric has been the average request latency as perceived by the clients. The system was populated with 64 topics for tweets and an initial tweet factor of 1000. An initial tweet factor of n means that a user with f followers will have n × f initial tweets. For each run 500,000 operations were executed. Different request loads have been achieved by varying the clients' think-time between operations. Throughout the experiments no failures were injected.

Simulated setting. From ProtoPeer we have used the network simulation model and extended it with simulation models for CPU as per [25]. The network model was configured to simulate a LAN with latency uniformly distributed between 1 ms and 2 ms. For the CPU simulation we have used a hybrid simulation approach as described in [21]. All data has been stored in memory; persistent storage was not considered. Briefly, the execution of an event is timed with a profiling timer and the result is used to mark the simulated CPU busy during the corresponding period, thus preventing other events from being attributed simultaneously to the same CPU. A simulation event is then scheduled with the execution delay to free the CPU, and further pending events are then considered. Therefore, only the network latency is simulated, while the other execution times are profiled from real execution. Each node was configured and calibrated to simulate one dual-core AMD Opteron processor running at 2.53 GHz. The system was populated with 10,000 users and the same number of concurrent users was simulated (uniformly distributed over the number of configured nodes).

Real setting. We used a machine with 24 AMD Opteron processor cores running at 2.1 GHz, 128 GB of RAM and a dedicated SATA hard disk. We ran 20 instances of the Java Virtual Machine (1.6.0) running ProtoPeer. ProtoPeer uses Apache MINA (http://mina.apache.org/) for communication in real settings; we have used Apache MINA 1.1.3. All data has been stored persistently using Berkeley DB Java Edition 4.0 (http://www.oracle.com/technetwork/database/berkeleydb/overview/index-093405.html). The system was populated with 2,500 concurrent users and the same number of concurrent users was run (uniformly distributed over the number of configured instances). During all the experiments I/O was not the bottleneck.
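A minimal sketch of this hybrid approach (simulated network latency, profiled CPU time), assuming a simple single-CPU event queue; the class and method names are illustrative and not ProtoPeer's actual API.

```java
import java.util.PriorityQueue;

/** Illustrative hybrid simulation core: network latency is simulated, CPU time is profiled from real execution. */
final class HybridSimulator {

    static final class Event implements Comparable<Event> {
        final double time;          // simulated time at which the event fires
        final Runnable handler;     // real application code to execute
        Event(double time, Runnable handler) { this.time = time; this.handler = handler; }
        public int compareTo(Event o) { return Double.compare(time, o.time); }
    }

    private final PriorityQueue<Event> queue = new PriorityQueue<>();
    private double cpuFreeAt = 0.0;  // simulated instant at which the (single) simulated CPU becomes idle

    void schedule(double time, Runnable handler) { queue.add(new Event(time, handler)); }

    void run() {
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            // an event may only start once the simulated CPU is free again
            double start = Math.max(e.time, cpuFreeAt);
            long t0 = System.nanoTime();
            e.handler.run();                         // execute the real code and profile it
            double elapsedSec = (System.nanoTime() - t0) / 1e9;
            cpuFreeAt = start + elapsedSec;          // mark the simulated CPU busy for the measured duration
        }
    }
}
```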
4.3 Evaluation of Data Placement Strategies
Simulated setting. The graphs in Figure 2 depict the performance of the system when using the different placement strategies available, in the simulated setting. The workload has been first configured to only use the random strategy (the most common in existing key-value stores), then configured to use the
[Fig. 2. System's response time with 100 simulated nodes. Panels: (a) statuses_update op; (b) friendships_destroy op; (c) friendships_create op; (d) statuses_user_timeline op; (e) statuses_mentions op; (f) search_contains_hashtag op; (g) statuses_friends_timeline op; (h) overall workload. Each panel plots latency (ms) against throughput (ops/sec) for the random, ordered and tagged strategies.]
ordered placement for both the tweets and timelines collections (for users the placement has been kept at random), and finally configured to exploit the tagged placement for tweets (timelines were kept ordered and users at random). The lines random, ordered and tagged in Figure 2 match these configurations. We present the measurements for each of the seven workload operations (Figure 2(a) through 2(g)) and for the overall workload (Figure 2(h)). All runs were carried out with 100 nodes. We can start by seeing that for write operations (statuses_update and friendships_destroy) the system's response time is very similar for all scenarios (Figures 2(a) and 2(b)). Both operations read one user record and subsequently add or update one of the tables; the cost of these operations is basically the same in all placement strategies. The third writing operation, friendships_create, has a different impact, though (Figure 2(c)). This operation also has a strong read component: when creating a follow relationship the operation performs a statuses_user_timeline which, as can be seen in Figure 2(d), is clearly favored when tweets are stored in order. Regarding read-only operations, the adopted data placement strategy may have a high impact on latency, see Figures 2(d) through 2(g). The statuses_user_timeline operation (Figure 2(d)) consists mainly of a range query (which retrieves a set of the more recent tweets of the user) and is therefore best served when tweets are (chronologically) ordered, thus minimizing the number of nodes contacted. Taking advantage of the SFC's locality-preserving property, grouping by tags still considerably outperforms the random strategy before saturation. Operations statuses_mentions and search_contains_hashtag are essentially correlated searches over tweets, by user and by topic, respectively. Therefore, as expected, they perform particularly well when the placement of tweets uses the tagged strategy. For statuses_mentions the tagged strategy is twice as fast as the others, and for search_contains_hashtag it keeps a steady response time up to ten thousand ops/sec while with the other strategies the system struggles right from the beginning. Operation statuses_friends_timeline accesses the tweets collection directly by key and sparsely: to construct the user's timeline the operation gets the user's tweet list entry from timelines and for each tweet id reads it from tweets. These end up being direct and ungrouped (i.e., single-item) requests and, as depicted in Figure 2(g), are best served by the random and ordered placements. Figure 2(h) depicts the response time for the combined workload. Overall, the new SFC-based data placement strategy consistently outperforms the others, with responses 40% faster. Finally, it is worth noting the substantial reduction in the number of exchanged messages attained by using the tagged strategy. Figure 3(a) compares the total number of messages exchanged when using the random and tagged strategies. This reduction is due to the restricted number of nodes contacted by the tagged strategy in multi-item operations.
[Fig. 3. Additional evaluation results. (a) Total number of messages exchanged with system size, for the random and tagged strategies; (b) system's response time in the real setting: latency (ms) against throughput (ops/sec) for the random, ordered and tagged strategies.]
Real setting. Figure 3(b) depicts the response time for the combined workload in the real setting. The results in the real setting confirm the previous results from the simulated setting: overall, the new SFC-based data placement strategy consistently outperforms the others. The additional response time in the real setting, compared with the simulated setting, is due to the use of persistent storage.
5 Related Work
There are several emerging decentralized key-value stores developed by major companies like Google, Yahoo, Facebook and Amazon to tackle internal data management problems and support their current and future Cloud services. Google's BigTable [4], Yahoo's PNUTS [6], Amazon's Dynamo [8] and Facebook's Cassandra [17] provide a similar service: a simple key-value store interface that allows applications to insert, retrieve, and remove individual items. BigTable, Cassandra and PNUTS additionally support range access, in which clients can iterate over a subset of the data. DataDroplets extends these systems' data models with tags, allowing applications to run more general operations by tagging and querying correlated items. These systems define one or two data placement strategies. While Cassandra and Dynamo use a DHT for data placement and lookup, PNUTS and BigTable have special nodes to define data placement and lookup. Dynamo just implements a random placement strategy based on consistent hashing. Cassandra supports both random and ordered data placement strategies per application, but only allows range queries when using ordered data placement. In PNUTS, special nodes called routers maintain an interval mapping that divides the overall space into intervals and defines the nodes responsible for each interval. It also supports random and ordered strategies, and the interval mapping is done by partitioning the hash space and the primary key's domain, respectively. BigTable only supports an ordered data placement. The items' key range is dynamically
partitioned into tablets that are the unit of distribution and load balancing. With only random and ordered data placement strategies, existing decentralized data stores can only efficiently support single-item operations or range operations. However, some applications, like social networks, frequently need to retrieve multiple correlated items. Our novel data placement strategy, which allows items to be dynamically correlated, is based on Space Filling Curves (SFCs). SFCs have been used to process multidimensional queries in P2P systems: PHT [5] applies SFC indexing over generic multi-hop DHTs, while SCRAP [10] and Squid [20] apply SFC indexing to Skip graphs [1] and the Chord [22] overlay, respectively. While in all these other systems the multidimensional queries are based on predefined keywords and keyword sets, DataDroplets is more flexible, allowing free-form tags and tag sets. Therefore, tags are used by applications to dynamically define item correlations. Additionally, in DataDroplets the SFC-based strategy is combined with a random placement strategy and with a generic load balancing mechanism to improve uniformity even when the distribution of tags is highly skewed. Moreover, while all other systems using SFCs in P2P systems run over multi-hop DHTs, our tagged data placement strategy runs over a one-hop DHT. Therefore, while in those systems query processing is done recursively over several nodes in the routing path, with increasing latency, our strategy allows the node that receives the query to locally know the nodes that need to be contacted to answer the query.
6 Conclusion
Cloud Computing and unprecedented large scale applications, most strikingly social networks such as Twitter, challenge tried and tested data management solutions and call for a novel approach. In this paper, we introduce DataDroplets, a key-value store whose main contribution is a novel data placement strategy based on multidimensional locality-preserving mappings. This fits access patterns found in many current applications, which arbitrarily relate and search data by means of free-form tags, and provides a substantial improvement in overall query performance. Additionally, we show the usefulness of having multiple simultaneous placement strategies in a multi-tenant system, by also supporting ordered placement, for range queries, and the usual random placement. Finally, our results are grounded on the proposal of a simple but realistic benchmark for elastic key-value stores based on Twitter and currently known statistical data about its usage. We advocate that consensus on benchmarking standards for emerging key-value stores is a strong requirement for repeatable and comparable experiments and thus for the maturity of this area. This proposal is therefore a first step in that direction.
References
1. Aspnes, J., Shah, G.: Skip graphs. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2003, pp. 384–393. Society for Industrial and Applied Mathematics, Philadelphia (2003), http://portal.acm.org/citation.cfm?id=644108.644170
2. Boyd, D., Golder, S., Lotan, G.: Tweet tweet retweet: Conversational aspects of retweeting on twitter. In: Society, I.C. (ed.) Proceedings of HICSS-43 (January 2010)
3. Butz, A.R.: Alternative algorithm for Hilbert's space-filling curve. IEEE Trans. Comput. 20(4), 424–426 (1971)
4. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. In: OSDI 2006: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pp. 205–218. USENIX Association, Berkeley (2006)
5. Chawathe, Y., Ramabhadran, S., Ratnasamy, S., LaMarca, A., Shenker, S., Hellerstein, J.: A case study in building layered DHT applications. In: SIGCOMM 2005: Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 97–108. ACM, New York (2005)
6. Cooper, B.F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H.A., Puz, N., Weaver, D., Yerneni, R.: PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow. 1(2), 1277–1288 (2008)
7. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: OSDI 2004: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (December 2004)
8. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: amazon's highly available key-value store. In: SOSP 2007: Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, pp. 205–220. ACM, New York (2007)
9. Galuba, W., Aberer, K., Despotovic, Z., Kellerer, W.: Protopeer: From simulation to live deployment in one step. In: Eighth International Conference on Peer-to-Peer Computing, P2P 2008, pp. 191–192 (September 2008)
10. Ganesan, P., Yang, B., Garcia-Molina, H.: One torus to rule them all: multidimensional queries in p2p systems. In: WebDB 2004: Proceedings of the 7th International Workshop on the Web and Databases, pp. 19–24. ACM, New York (2004)
11. Garg, A.K., Gotlieb, C.C.: Order-preserving key transformations. ACM Trans. Database Syst. 11(2), 213–234 (1986)
12. Gupta, A., Liskov, B., Rodrigues, R.: Efficient routing for peer-to-peer overlays. In: First Symposium on Networked Systems Design and Implementation (NSDI), San Francisco, CA (March 2004)
13. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging usage and communities. In: WebKDD/SNA-KDD 2007: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, pp. 56–65. ACM, New York (2007)
14. Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: STOC 1997: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pp. 654–663. ACM, New York (1997)
15. Karger, D.R., Ruhl, M.: Simple efficient load balancing algorithms for peer-to-peer systems. In: SPAA 2004: Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 36–43. ACM, New York (2004)
16. Krishnamurthy, B., Gill, P., Arlitt, M.: A few chirps about twitter. In: WOSP 2008: Proceedings of the First Workshop on Online Social Networks, pp. 19–24. ACM, New York (2008)
17. Lakshman, A., Malik, P.: Cassandra - A Decentralized Structured Storage System. In: SOSP Workshop on Large Scale Distributed Systems and Middleware (LADIS), Big Sky, MT (October 2009)
18. Risson, J., Harwood, A., Moors, T.: Stable high-capacity one-hop distributed hash tables. In: ISCC 2006: Proceedings of the 11th IEEE Symposium on Computers and Communications, pp. 687–694. IEEE Computer Society, Washington, DC, USA (2006)
19. Sagan, H.: Space-Filling Curves. Springer, New York (1994)
20. Schmidt, C., Parashar, M.: Flexible information discovery in decentralized distributed systems. In: HPDC 2003: Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing, p. 226. IEEE Computer Society, Washington, DC, USA (2003)
21. Sousa, A., Pereira, J., Soares, L., Correia Jr., A., Rocha, L., Oliveira, R., Moura, F.: Testing the Dependability and Performance of Group Communication Based Database Replication Protocols. In: International Conference on Dependable Systems and Networks (DSN 2005) (June 2005)
22. Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A scalable Peer-To-Peer lookup service for internet applications. In: Proceedings of the 2001 ACM SIGCOMM Conference, pp. 149–160 (2001)
23. Vilaça, R., Cruz, F., Oliveira, R.: On the expressiveness and trade-offs of large scale tuple stores. In: Meersman, R., Dillon, T., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6427, pp. 727–744. Springer, Heidelberg (2010)
24. Vilaça, R., Oliveira, R., Pereira, J.: A correlation-aware data placement strategy for key-value stores. Tech. Rep. DI-CCTC-10-08, CCTC Research Centre, Universidade do Minho (2010), http://gsd.di.uminho.pt/members/rmvilaca/papers/ddtr.pdf
25. Xiongpai, Q., Wei, C., Shan, W.: Simulation of main memory database parallel recovery. In: SpringSim 2009: Proceedings of the 2009 Spring Simulation Multiconference, pp. 1–8. Society for Computer Simulation International, San Diego (2009)
26. Yu, H., Gibbons, P.B., Nath, S.: Availability of multi-object operations. In: NSDI 2006: Proceedings of the 3rd Symposium on Networked Systems Design & Implementation, p. 16. USENIX Association, Berkeley (2006)
27. Zhong, M., Shen, K., Seiferas, J.: Correlation-aware object placement for multi-object operations. In: ICDCS 2008: Proceedings of the 28th International Conference on Distributed Computing Systems, pp. 512–521. IEEE Computer Society, Washington, DC, USA (2008)
Experience Report: Trading Dependability, Performance, and Security through Temporal Decoupling
Lorenz Froihofer, Guenther Starnberger, and Karl M. Goeschka
Vienna University of Technology, Institute of Information Systems, Distributed Systems Group
Argentinierstrasse 8/184-1, 1040 Vienna, Austria
{lorenz.froihofer,guenther.starnberger,karl.goeschka}@tuwien.ac.at
Abstract. While it is widely recognized that security can be traded for performance and dependability, this trade-off lacks concrete and quantitative evidence. In this experience report we discuss (i) a concrete approach (temporal decoupling) to control the trade-off between those properties, and (ii) a quantitative and qualitative evaluation of the benefits based on an online auction system. Our results show that trading only a small amount of security does not pay off in terms of performance or dependability. Trading security even more first improves performance and later improves dependability. Keywords: Temporal decoupling, Dependability, Security, Performance.
1 Introduction
While it is widely recognized that security can be traded for performance and dependability [8, 18, 33], this trade-off lacks concrete and quantitative evidence. In this experience report we examine the implementation of a prototype that allows trading these properties in first-price sealed-bid auctions as used for governmental bonds and CO2 certificates. The main motivation for examining this auction type is its high dependability and performance requirements, as the high monetary value of the traded goods can lead to significant financial losses when auctions need to be canceled or rescheduled. In addition, such auctions typically exhibit a high peak load shortly before the auction's deadline, as a late submission of bids allows bidders to better optimize their bids due to continuously changing financial market conditions. Moreover, cloud computing is not an option due to data ownership issues. The core idea of our approach is to mitigate performance issues (high peak loads) and dependability problems (fault tolerance of the auctioneer's infrastructure as well as the network infrastructure between client and auctioneer) by shifting them into the security domain and by subsequently solving the new security challenges [26]. In first-price sealed-bid auctions this is possible by decoupling bid submission from bid transmission: unlike eBay-style auctions, bidders
do not learn about other bids before the auction's deadline. Therefore, we can locally timestamp bids when they are placed on the client and transfer them to the server later. This allows us to increase performance, as we can spread the peak load in the temporal domain, and dependability, as we can reschedule the transmission of bids in case of errors. In addition to server-side failures we can also mitigate client-side failures, which cannot be solved with traditional techniques such as redundant server-side infrastructure. This increase of performance and dependability introduces new security challenges, as it allows adversaries to attack the locally applied timestamps. Consequently, we introduced a smartcard-based secure timestamping protocol that solves the new security challenges [28]. In addition to our temporal decoupling approach, this paper contributes a quantitative and qualitative evaluation of temporal decoupling as a means to trade security for performance and dependability. In this section we first examine the architecture of our prototype, followed by a discussion of the implementation.
1.1 Architecture
This section discusses the architecture of our prototype implementation to make the results presented in the following sections easier to follow. The server side architecture of the prototype (Figure 1) leverages EJB (Enterprise JavaBean) components and the JBoss Seam framework and is deployed to the JBoss application server. In order to facilitate temporal decoupling we not only leverage a Web browser at the client side, but also additional components: the first prerequisite for temporal decoupling is a secure smart card running the security-critical parts of the application, such as time synchronization and time stamping of bid submissions [25, 28]. In order to enable the Web application to talk to the smart card, we introduced the smart card proxy [27], a generic approach to enable secure HTTP-based (Hypertext Transfer Protocol) access to smart cards that do not offer an HTTP communication interface. The second prerequisite is a client side storage facility, provided by Google Gears, to store bids once they are timestamped. This prevents loss of valid bids in case of client crashes, so that bids can be sent to the server after client recovery.
1.2 Implementation
The prototype has been developed in two iterations: the first iteration used a Java applet to fulfil the bid submission related tasks. Based on the drawbacks we observed for the Java applet solution, we performed a feasibility study of a Web-only solution (no Java applet required) during the second iteration of prototype development and hence replaced the bid submission tasks with a GWT (Google Web Toolkit) implementation, which has two core responsibilities with respect to our decoupling approach:
– The client side GWT application communicates via the proxy with the smart card in order to perform the time synchronization and time stamping tasks.
230
L. Froihofer, G. Starnberger, and K.M. Goeschka Network boundary HTTP
Web browser (Firefox) HTTP Smart card proxy APDU Smart card (.NET Card)
Plugin (Google Gears)
Network boundary
Load balancer (Apache HTTPD)
HTTP
Web server (Tomcat)
SQL
EJB container (JBoss)
Storage mechanism (SQLite)
Client side infrastructure
Network boundary
Application server Applicationserver server Application (JBoss AS) (JBossAS) AS) (JBoss JDBC Database (MySQL)
Application server cluster
Communication providers (Internet)
Server side infrastructure
Fig. 1. Prototype architecture
– After a bid has been placed, the client side GWT application persistently stores the timestamped bid for later submission to the server. For persistent storage, the Google Gears plug-in is required. Finally, the client side GWT application submits bids to the server according to the implemented bid submission strategy [26]. The smart card proxy is implemented as an HttpServlet according to the Java Servlet specification and is executed in an Apache Tomcat Web application container. It gets requests from the client side GWT application running in the Web browser and performs the necessary translation between the HTTP communication with the GWT application and the APDU (Application Protocol Data Unit) communication with the smart card. The Servlet-based approach has been taken for the proof-of-concept prototype implementation, but could be replaced with any implementation converting from HTTP to the APDU protocol.
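A minimal sketch of such an HTTP-to-APDU translation servlet using the standard javax.smartcardio API; the request format (a hex-encoded APDU in the request body) and the class name are our own assumptions, not the prototype's actual code.

```java
import java.io.IOException;
import javax.servlet.http.*;
import javax.smartcardio.*;
import javax.xml.bind.DatatypeConverter;

/** Illustrative smart card proxy: forwards a hex-encoded APDU from an HTTP POST to the first card terminal. */
public class SmartCardProxyServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try {
            // read the command APDU sent by the browser-side application (hex string, an assumed format)
            String hex = req.getReader().readLine();
            byte[] command = DatatypeConverter.parseHexBinary(hex.trim());

            // connect to the first available terminal holding the user's smart card
            CardTerminal terminal = TerminalFactory.getDefault().terminals().list().get(0);
            Card card = terminal.connect("*");
            ResponseAPDU response = card.getBasicChannel().transmit(new CommandAPDU(command));
            card.disconnect(false);

            // return the response APDU, again hex encoded
            resp.setContentType("text/plain");
            resp.getWriter().write(DatatypeConverter.printHexBinary(response.getBytes()));
        } catch (CardException e) {
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, e.getMessage());
        }
    }
}
```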
2 Evaluation of Temporal Decoupling
The previous section introduced the architecture and implementation of our decoupling approach from a technological perspective. This section goes beyond the technological aspects and evaluates the benefits and drawbacks of the temporal decoupling approach. Our original idea was to delay all bid submissions for a random period of time, which turned out to be a sub-optimal solution because it also delays bid submission at times when the server(s) are not fully loaded. Therefore, we devised and evaluated more sophisticated bid submission strategies [26] in order to better utilize the server infrastructure. The interval-based submission strategy limits clients to the submission of only a single bid within a certain time interval. The group-based submission strategy partitions the set of clients into disjoint groups, and only the clients within a specific group may send bids to the server at a specific point in time. In addition, each strategy can be used in two different
modes: With bid-queuing, all bids submitted at the client are eventually delivered to the server, while with bid-overwriting a later bid overwrites any earlier bids still queued at the client. We discuss the performance improvements through temporal decoupling in this section and compare the two strategies based on the following parameters: – A bidder issues a bid every 30–40 seconds. This corresponds to real-world data of governmental bond auctions just before the auction deadline. – 1 500 bidders can be supported by a single server without temporal decoupling. This is the result of the performance measurements of our prototype and is specific to our hardware environment. For the analysis and comparison of the improvement potential of temporal decoupling, we introduce the decoupling factor as the decoupling period divided by the original peak load duration (parameters illustrated in Figure 2). The original peak load duration is the time between the start of the peak load as it would be observed without our temporal decoupling approach, e.g., determined by exceeding a pre-defined threshold such as average submitted bids per second, and the end of that peak load period, typically close to the deadline for bids. Generally, we assumed five minutes of peak load duration according to our application scenario. The deadline for bids is the latest point in time at which a bid has to be submitted by a bidder, while the deadline for messages is the latest point in time by which all bids have to be transmitted to the server. The time span between these two deadlines is called the decoupling period. If we have 5 minutes of original peak load duration and a 10-minute decoupling period, for example, the decoupling factor would be 2. Figure 3 shows the performance improvement possible through temporal decoupling in terms of supportable bidders in relation to the decoupling period (shown before the colon on the X-axis) and the decoupling factor (shown after the colon on the X-axis). In this figure, we assume an original peak load duration of five minutes.
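Restated as a formula (this merely rewrites the definition from the text above, with the worked example already given):

\[
\text{decoupling factor} \;=\; \frac{\text{decoupling period}}{\text{original peak load duration}},
\qquad \text{e.g.}\quad \frac{10~\text{min}}{5~\text{min}} = 2 .
\]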
Fig. 2. Temporal decoupling approach
Fig. 3. Performance improvement through temporal decoupling. [Chart: supportable bidders on a single server (0–45 000) against the decoupling period in minutes and the decoupling factor (0:0 to 15:3), for the strategies interval with bid queuing, interval with bid overwriting, random delay with bid overwriting, collection after deadline, and collection after deadline (measured).]
The values of Figure 3 are calculated for the different submission strategies based on the single-server bid submission performance without temporal decoupling. Moreover, we measured the best-case scenario in order to verify the upper bound of the performance improvement achievable through temporal decoupling. The results of these investigations are discussed in the following sections.
2.1 Worst Case Scenario
The worst case with respect to performance improvements through temporal decoupling is to use a uniform random delay between 0 and the decoupling period together with bid queuing. For a decoupling factor ≤ 1, this only delays the original peak load curve by half of the decoupling period, but does not
Fig. 4. Impact of decoupling parameters: (a) 5 min peak load with 5 min random delay, decoupling factor = 1; (b) 5 min peak load with 10 min random delay, decoupling factor = 2. [Both panels plot bids per second (% of peak load) over time (s) for the delayed (queuing) and original load curves.]
lead to an efficient load reduction. This behavior is illustrated in Figure 4(a). Figure 4(b) shows that the random delay does reduce the load, but a remaining peak of about 80% of the original peak load for a decoupling factor of two is clearly sub-optimal.
2.2 Best Case Scenario
The best-case scenario with respect to performance improvements through temporal decoupling is a completely different submission strategy—the collection of only the latest bid per bidder after the deadline for bids. In this case, no server-side resources are wasted on bids that are later overwritten by subsequent bids, and the limit with respect to supportable bidders only depends on the duration of the decoupling period as well as the hardware requirements imposed by a specific implementation of this strategy. Figure 5 illustrates this approach and shows how the load curve for collecting the bids (“peak load with bid collection after deadline”) is fully decoupled from the original “peak load without decoupling”. Up until the deadline for bids, clients do not submit any bids to the auction servers. After the deadline for bids, the auction servers start to collect the latest bid per client by “asking” the clients to submit their bid. This is performed in a controlled way, so that server overload is avoided.
Fig. 5. Collection of all bids after the deadline
A minor optimization of this method would be to start bid collection already shortly before the deadline for bids, in a time period in which only a single bid can be expected from a single bidder. In our scenario, this would be about 30 seconds before the deadline for bids. However, this would also require allowing bids to be overwritten in the event that a bidder sends more than one bid within the last 30 seconds before the deadline for bids. Based on the results of our analysis, this approach can already support about 2 600 bidders for a decoupling period of 1 minute, and the number of supportable bidders increases linearly with the decoupling period: within five minutes, about 13 000 bidders can be supported, 26 000 bidders within ten minutes, and 39 000 bidders within a decoupling period of 15 minutes on a single
server. In practice, hardware requirements and the maximum allowable decoupling period limit the total number of supportable bidders. To verify the calculated performance values, we implemented a version where the clients first issue a request to the server, the server blocks the request until it wants to receive the bid of the specific client, and then notifies the client, through the response to the blocked request, to submit the bid. Mapped to Web client technologies, this corresponds to an AJAX (Asynchronous JavaScript and XML) push long-polling approach (see the sketch at the end of this section). Due to the potentially high number of simultaneously open connections to the server, this requires support for asynchronous processing of client requests at the server side, as introduced, for example, with Java Servlets 3.0 in the Java Enterprise Edition (JEE) 6. In our case, we used the JBoss-specific asynchronous Servlet API (Application Programming Interface). The results gained from the prototype measurements correspond quite well to the values expected from the calculations and even show better performance. Based on our measurements, the numbers of supportable bidders are about 3 200 for a decoupling period of 1 minute, 6 300 for 2 minutes, and 8 700 for 3 minutes. About 8 000 bidders is the reasonable limit for a Pentium 4 at 2.8 GHz with 2 GB of RAM. Measurements with about 10 000 bidders were possible, but for this number nearly all of the RAM was used by the JBoss instance, and server-side bid processing, at about 38 processed bid submissions per second, was already much slower than in the other measurements with fewer bidders, which reached about 50 bid submissions per second. Based on additional measurements, increasing the RAM increases the number of supportable bidders by about 4 000 per GB of RAM, which is consumed by the high number of concurrently open client connections. Increasing the number of supportable bidders without increasing the hardware resources would require a different implementation to control the bid submission of the clients. Two possible options would be as follows: – The clients could be partitioned into ordered groups of a specific size, and each group gets a specific timeslot for bid submission during the decoupling period. We illustrate this with an example. Let us assume a group size of 8 000. From our measurements we know that we need less than 3 minutes to collect the bids of these 8 000 bidders. When the session for a specific client is initiated, the client is assigned to a group. We start with group number 1 and, after we have assigned 8 000 clients to group number 1, we start assigning clients to group number 2, and so on. After the deadline for bids, we collect the bids of the clients in group 1. Three minutes after the deadline for bids, we collect the bids of the clients in group 2. Generally, we collect the bids of the clients in group i, (i − 1) · 3 minutes after the deadline. – An alternative solution would be to use the distributed feedback channel as detailed in [26]. Unfortunately, firewalls and network address translation (NAT) are a major hindrance to the practical applicability of this solution. While bid collection only after the deadline for bids delivers the best solution with respect to performance, it also gives an attacker the most time between bid
creation and bid transmission. This trade-off between performance and security is further elaborated in Section 2.4.
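As a rough illustration of the long-polling mechanism used for the measurements above, the following sketch uses the standard Servlet 3.0 AsyncContext API rather than the JBoss-specific asynchronous Servlet API of the prototype; the servlet path, the timeout value, and the BidCollector helper are illustrative assumptions.

```java
// Sketch of the "collection after deadline" endpoint with standard Servlet 3.0
// asynchronous processing. BidCollector and its methods are hypothetical names.
import java.io.IOException;
import javax.servlet.AsyncContext;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.*;

@WebServlet(urlPatterns = "/collect", asyncSupported = true)
public class BidCollectionServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Suspend the request without blocking a worker thread; the open
        // connection is what consumes memory per connected bidder.
        AsyncContext ctx = req.startAsync();
        ctx.setTimeout(15 * 60 * 1000);           // at most the decoupling period
        BidCollector.register(req.getParameter("clientId"), ctx);
    }
}

// Elsewhere, after the deadline for bids, the server releases the suspended
// requests in a controlled way, e.g. group by group.
class BidCollector {
    static void register(String clientId, AsyncContext ctx) { /* store ctx */ }

    static void release(AsyncContext ctx) throws IOException {
        ctx.getResponse().getWriter().write("SUBMIT_NOW"); // tell client to send its bid
        ctx.complete();                                    // finish the long-poll
    }
}
```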
2.3 Intermediate Approaches
Between the worst-case and the best-case scenario, we illustrated three other strategies in Figure 3: interval-based bid queuing, interval-based bid overwriting, and random delay with bid overwriting. Using the interval-based bid queuing strategy, the number of supportable bidders increases linearly with the decoupling factor and can generally be approximated with the following formula: bidders with decoupling = bidders without decoupling · (1 + decoupling factor). With this approach, a decoupling factor of two (10 minutes decoupling period for 5 minutes original peak load duration) already allows supporting 4 500 bidders, and a decoupling factor of 3 allows for 6 000 bidders instead of 1 500 bidders without temporal decoupling in a single-server scenario. Obviously, these numbers have to be multiplied by the number of servers used in a server cluster. Therefore, we would be able to support 18 000 bidders with a decoupling factor of 3 on a three-node server cluster, compared to 4 500 bidders without temporal decoupling. For interval-based bid overwriting, Figure 3 shows that this strategy only increases performance for a decoupling factor ≥ 1, compared to the interval-based bid queuing approach. The reason is that only in this case are bids delayed long enough for bid overwriting to actually take place at the client side. For the same reason, the interval-based bid overwriting approach does not necessarily have an advantage over the random delay bid overwriting approach. In the interval-based approach, the delay of a bid, and hence the probability of its being overwritten, depends on the decoupling factor (a relative factor), while in the random delay approach the delay of a bid depends on the decoupling period (an absolute factor). This can also be observed in Figure 3, where the random delay bid overwriting approach generally performs better than interval-based bid overwriting. However, this disadvantage could be reduced by requiring the interval-based approach to use time intervals larger than the average time span between two bid submissions of a bidder, so that bid overwriting becomes generally effective.
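Written out as an equation (this only restates the approximation given at the beginning of this section, with the numbers from our single-server scenario):

\[
\text{bidders}_{\text{with decoupling}} \;\approx\; \text{bidders}_{\text{without decoupling}} \cdot \bigl(1 + \text{decoupling factor}\bigr),
\qquad \text{e.g.}\quad 1\,500 \cdot (1 + 2) = 4\,500 .
\]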
2.4 Interpretation
In this section we provide an interpretation of the results presented in the previous sections as well as the insights gained from the analysis of the prototype implementation. In particular, we discuss the following trade-offs: performance vs. security, dependability vs. security, and dependability vs. performance. Performance vs. Security. As illustrated in Section 2, the “collection of bids after the deadline” variant shows the best performance, but also gives a malicious adversary the most time for attacking the system, when considering the attacks
discussed in our trust model [28]. Other approaches allow specifying a maximum delay after which a bid must have been received at the server. For the random delay approach, the maximum time after which a bid must be received at the server is the duration of the decoupling period. With an interval-based approach, a bid has to be transmitted immediately, if no other bids are queued at the client, or within the first interval for which no bid is already scheduled (which is the immediately following interval in the case of bid overwriting). As we have seen, these approaches are only able to achieve sub-optimal performance, as they also have to process bids that will be overwritten by subsequent bid submissions. In conclusion, we have a trade-off between optimal performance and optimal security. For optimal performance we would use the bid collection after deadline strategy, giving a potential attacker the most time to attack the system between bid creation and bid submission. For optimal security, we would send bids immediately, facing the performance problems associated with the original peak load curve. In order to target a solution in the middle, we can control this trade-off by using different submission strategies to delay bid submission together with client-side bid overwriting. Which solution approach to take will depend on the security and performance requirements of a specific customer, balanced against the costs of the corresponding hardware requirements. Dependability vs. Security. If only dependability and security are of concern, but not performance, then bids should be transferred to the server immediately in a healthy system. This prevents attacks on the timestamps applied at the client side. However, if node or link failures occur, then bids should be queued at the client using the bid-overwriting strategy and submitted to the server after the failures are repaired. If temporal decoupling with bid queuing is applied, the dependability vs. security trade-off is influenced by two further aspects: – The estimated or proven security of the used smart card, as well as of the software running on it, limits the maximum time span between bid creation and bid transmission due to the requirement to guarantee fair auction conditions to all bidders. If it is possible (i) to crack the smart card or to find out the contained cryptographic keys and (ii) to reverse engineer the software running on the smart card or to find out the used signature mechanism even when no auctions are running, then the temporal decoupling approach must not be applied. – The accuracy of the smart card’s clock determines the maximum offline period during which a bidder can still submit bids at the client without time-synchronizing the smart card’s clock. This is a major issue, as it has to be ensured that the smart card’s clock cannot be influenced from the outside, since this would considerably decrease the security of the system. Dependability vs. Performance. The balancing of dependability and performance in our case is a concern for server-side clustering, especially with regard to the performance implications of session replication for transparent fail-over in
Fig. 6. Influence of decoupling period on different system properties (unquantified). [Qualitative plot of security, performance, and dependability against the decoupling period duration, divided into three zones: Zone 1 – security decreases; Zone 2 (decoupling period long enough for efficient bid collection after the deadline) – security decreases, performance increases; Zone 3 (decoupling period long enough for intermediate repair) – security decreases/increases, performance increases, dependability increases.]
case of server faults. During our evaluation we measured different configurations and compared our prototype’s performance with and without session replication, as well as to an alternative implementation based on JavaServer Faces (JSF) and Terracotta for session replication. While the performance drawback of session replication for the prototype described in this paper was about 15%, session replication in the case of the alternative implementation already reduced performance as soon as a second node was introduced. With respect to the temporal decoupling approach, however, both dependability and performance benefit from a larger allowable time span between bid creation and bid transmission (more time to collect bids or to perform intermediate repair) and suffer from a shorter time span or the requirement for immediate transmission. Balancing Conclusion. Summarizing the previous sections, Figure 6 qualitatively visualizes the different trade-offs, showing that with an increase of the decoupling period, security decreases while dependability and performance increase. The curves in Figure 6 are not quantified in a specific measurement unit, as the different properties depend on the specific requirements and threat scenarios of a specific application, while we aim at a generic visualization. Based on our investigations and the measurements performed on the prototype, we conclude with the following observations: 1. Security decreases as soon as the temporal decoupling approach is applied (Zone 1). How much security decreases depends on the security of the smart card, including the security and accuracy of its clock, as well as on an attacker’s potential gain from a successful attack, along with the reputation loss for the system provider, i.e., the auctioneer in the case of our application scenario. 2. A larger decoupling period (Zone 2) improves performance, as collection of bids after the deadline becomes efficiently possible. However, there is only a small increase in dependability, as the decoupling period does not yet allow for intermediate repair, e.g., of the auctioneer’s Internet connectivity.
3. As soon as the decoupling period is long enough to allow for intermediate repair (Zone 3), dependability increases significantly. Additionally, security may increase through its availability attribute if the decoupling period is long enough to submit bids after an intermediate denial-of-service attack. Based on our measurements and these observations, we recommend generally applying the temporal decoupling approach for scenarios with a reasonably long decoupling period, longer than about 4 minutes in our case, as the first benefits only become observable beyond this minimum time span.
3 Related Work
This section presents related work discussing the following interrelations: (i) dependability/security, (ii) dependability/performance, and (iii) security/performance.
3.1 Dependability/Security Interrelation
Research on viewing dependability and security as an integrated concept seems to have started with a focus on intrusion tolerance [10], where research can be traced back to 1985 [11]. One of the first works to establish a common view and terminology on the dependability and security interrelation was published by the IFIP (International Federation for Information Processing) Working Group on Dependable Computing and Fault Tolerance [17]. This publication defines the four dependability attributes availability, reliability, safety, and security, while a later publication in 2004 [3] regards security and dependability as being at the same level and integrates security through the traditional security attributes of confidentiality, integrity, and availability. Summarizing related work in this area, we observe two main directions for an integrative view on dependability and security: – Adding dependability means to traditional security mechanisms such as firewalls or cryptographic algorithms, e.g., through redundant/diverse implementations [6, 15, 16, 18]. – Investigation of security issues (intrusions) from a dependability perspective, viewing malicious attacks as faults within a system—with the prominent research area of intrusion tolerance [10, 22, 30, 31]. In contrast to these works, we solve a dependability concern (high availability) through temporal decoupling, thereby shifting it into the security domain (attacks against the client-side timestamps), and subsequently mitigate the resulting security challenges.
3.2 Dependability/Performance Interrelation
With respect to the dependability/performance interrelation, degradable systems introduce a grey zone for differentiating between an available and unavailable system state (dependability concern), with different notions of a “slow”
system in between (performance concern). Investigation of the interrelation between dependability and performance reveals the following core directions: – Performability as an integrative concept for degradable systems with different performability levels, based on which the system’s worth is calculated with a worth function [7, 20, 21]. – User-perceived availability, which takes user sessions and user think time into account when calculating system/service availability. Consequently, only failures during user requests affect availability [13, 14, 19, 23, 32, 34]. While our work allows for a degradable system, i.e., in cases of node or link failures bids might be transferred to the server later, we do not aim at explicit performability levels or user-perceived availability models. The goal in our case is to keep the system parts necessary for secure bid submission available, while tolerating faults in the global infrastructure. Therefore, the system might degrade as bids have to be queued at the client for later delivery, for example, leading to lower user-perceived availability and hence lower performability.
3.3 Security/Performance Interrelation
The interrelation of security with performance [8, 33] is an obvious relationship that has already been researched in different areas: – Competition for computational resources introduces a trade-off between security, e.g., because of encryption, and performance, e.g., measured in terms of throughput [1, 4, 9, 12]. – Research in the area of packet routing, where more complex routing protocols or lower bandwidth result in increased security at the cost of performance [2, 29]. – The security/performance trade-off with respect to traffic analysis, where higher bandwidth requirements increase traffic flow confidentiality [5, 24]. In contrast to related work, we do not address the security/performance interrelation on a protocol or network level, but trade security for performance with respect to throughput based on architectural trade-offs. For example, we give an attacker more time for an attack in order to increase the number of bidders processable on given hardware. Concluding, the interrelations of the different properties have been addressed in several works. However, most related work explicitly trades only two of the three properties dependability, security, and performance, while we contribute an integrated view on all three for a certain class of application scenarios.
4 Conclusions
In this experience report we showed how the system attributes dependability, security, and performance can be traded in practice by means of temporal decoupling. Based on online auctions as our specific application scenario,
we showed how different system requirements demand different trade-offs, addressed through different bid submission strategies in our case. For example, we can achieve higher performance at the price of additional attack possibilities if bids are collected only after the auction deadline. A considerable number of research publications already addresses trade-offs and interdependencies between dependability, security, and performance. However, most of these works either focus on the conceptual level or are targeted at a specific trade-off for a certain system component, such as security vs. performance in cryptographic algorithms. Although we consider all three properties in a system approach, we focused our investigation of security concerns on the new challenges introduced through temporal decoupling. As our results already show significant potential in explicitly balancing these trade-offs, we see even more potential in future research on these trade-offs and their – design-time and run-time – balancing from a holistic system perspective. Acknowledgments. This work has been partially funded by the Austrian Federal Ministry of Transport, Innovation and Technology under the FIT-IT project TRADE (Trustworthy Adaptive Quality Balancing through Temporal Decoupling, contract 816143, http://www.dedisys.org/trade/).
References 1. Agi, I., Gong, L.: An empirical study of secure MPEG video transmissions. In: Proceedings of the 1996 Symposium on Network and Distributed System Security (SNDSS 1996), pp. 137–144. IEEE Computer Society, Washington, DC, USA (1996) 2. Andersen, D.G.: Mayday: Distributed filtering for Internet services. In: 4th USENIX Symposium on Internet Technologies and Systems (2003) 3. Avižienis, A., Laprie, J.C., Randell, B., Landwehr, C.E.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Sec. Comput. 1(1), 11–33 (2004) 4. Barka, E., Boulmalf, M.: On the impact of security on the performance of WLANs. JCM 2(4), 10–17 (2007) 5. Bauer, K., McCoy, D., Grunwald, D., Kohno, T., Sicker, D.: Low-resource routing attacks against Tor. In: WPES 2007: Proceedings of the 2007 ACM Workshop on Privacy in Electronic Society, pp. 11–20. ACM, New York (2007) 6. Chen, Y., He, Z.: Simulating highly dependable applications in a distributed computing environment. In: ANSS 2003: Proceedings of the 36th Annual Symposium on Simulation, p. 101. IEEE Computer Society, Washington, DC, USA (2003) 7. Cho, B., Youn, H., Lee, E.: Performability analysis method from reliability and availability. In: Lee, G., Howard, D., Kang, J.J., Slezak, D., Ahn, T.N., Yang, C.H. (eds.) ICHIT. ACM International Conference Proceeding Series, vol. 321, pp. 401–407. ACM, New York (2009) 8. Cortellessa, V., Trubiani, C., Mostarda, L., Dulay, N.: An architectural framework for analyzing tradeoffs between software security and performance. In: Giese, H. (ed.) ISARCS 2010. LNCS, vol. 6150, pp. 1–18. Springer, Heidelberg (2010)
9. Cowan, C., Pu, C., Maier, D., Hintony, H., Walpole, J., Bakke, P., Beattie, S., Grier, A., Wagle, P., Zhang, Q.: StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks. In: SSYM 1998: Proceedings of the 7th USENIX Security Symposium, p. 5. USENIX Association, Berkeley (1998) 10. Deswarte, Y., Blain, L., Fabre, J.C.: Intrusion tolerance in distributed computing systems. In: IEEE Symposium on Security and Privacy, pp. 110–121 (1991) 11. Fraga, J., Powell, D.: A fault- and intrusion-tolerant file system. In: Proceedings of the 3rd Intl. Conf. on Computer Security, pp. 203–218 (1985) 12. Haleem, M.A., Mathur, C.N., Chandramouli, R., Subbalakshmi, K.P.: Opportunistic encryption: A trade-off between security and throughput in wireless networks. IEEE Trans. Dependable Secur. Comput. 4(4), 313–324 (2007) 13. Hariri, S., Mutlu, H.: Hierarchical modeling of availability in distributed systems. IEEE Trans. Softw. Eng. 21(1), 50–58 (1995) 14. Kaaniche, M., Kanoun, K., Rabah, M.: A framework for modeling availability of e-business systems. In: Proceedings of Tenth Intl. Conf. on Computer Communications and Networks, 2001, pp. 40–45 (2001) 15. Komari, I.E., Kharchenko, V., Lysenko, I., Babeshko, E., Romanovsky, A.: Diversity and security of computing systems: Points of interconnection. Part 2: Methodology and case study. MASAUM Journal of Open Problems in Science and Engineering 1(2), 33–41 (2009) 16. Komari, I.E., Kharchenko, V., Romanovsky, A., Babeshko, E.: Diversity and security of computing systems: Points of interconnection. Part 1: Introduction to methodology. MASAUM Journal of Open Problems in Science and Engineering 1(2), 28–32 (2009) 17. Laprie, J. (ed.): Dependability: Basic Concepts and Terminology. Springer, Heidelberg (1992) 18. Littlewood, B., Strigini, L.: Redundancy and diversity in security. In: Samarati, P., Ryan, P.Y.A., Gollmann, D., Molva, R. (eds.) ESORICS 2004. LNCS, vol. 3193, pp. 423–438. Springer, Heidelberg (2004) 19. Mainkar, V.: Availability analysis of transaction processing systems based on user-perceived performance. In: SRDS 1997: Proceedings of the 16th Symposium on Reliable Distributed Systems, p. 10. IEEE Computer Society, Los Alamitos (1997) 20. Meyer, J.F.: On evaluating the performability of degradable computing systems. IEEE Transactions on Computers 29(8), 720–731 (1980) 21. Meyer, J.F.: Performability: a retrospective and some pointers to the future. Performance Evaluation 14(3-4), 139–156 (1992); Performability Modelling of Computer and Communication Systems 22. Powell, D., Stroud, R. (eds.): Conceptual model and architecture of MAFTIA. Tech. Rep. D21, MAFTIA EU Project (2003) 23. Shao, L., Zhao, J., Xie, T., Zhang, L., Xie, B., Mei, H.: User-perceived service availability: A metric and an estimation approach. In: ICWS, pp. 647–654. IEEE, Los Alamitos (2009) 24. Snader, R., Borisov, N.: A tune-up for Tor: Improving security and performance in the Tor network. In: NDSS. The Internet Society, San Diego (2008) 25. Starnberger, G., Froihofer, L., Goeschka, K.M.: Distributed timestamping with smart cards using efficient overlay routing. In: Fifth Intl. Conf. for Internet Technology and Secured Transactions (ICITST 2010) (November 2010) 26. Starnberger, G., Froihofer, L., Goeschka, K.M.: Adaptive run-time performance optimization through scalable client request rate control. In: Proc. 2nd Joint WOSP/SIPEW Intl. Conf. on Performance Engineering (WOSP/SIPEW 2011). ACM, New York (March 2011) (to appear)
27. Starnberger, G., Froihofer, L., Goeschka, K.M.: A generic proxy for secure smart card-enabled web applications. In: Benatallah, B., Casati, F., Kappel, G., Rossi, G. (eds.) ICWE 2010. LNCS, vol. 6189, pp. 370–384. Springer, Heidelberg (2010) 28. Starnberger, G., Froihofer, L., Goeschka, K.M.: Using smart cards for tamper-proof timestamps on untrusted clients. In: ARES 2010, Fifth Intl. Conf. on Availability, Reliability and Security, Kraków, Poland, February 15-18, pp. 96–103. IEEE Computer Society, Los Alamitos (2010) 29. Timmerman, B.: A security model for dynamic adaptive traffic masking. In: NSPW 1997: Proceedings of the 1997 Workshop on New Security Paradigms, pp. 107–116. ACM, New York (1997) 30. Veríssimo, P., Neves, N.F., Cachin, C., Poritz, J.A., Powell, D., Deswarte, Y., Stroud, R.J., Welch, I.: Intrusion-tolerant middleware: the road to automatic security. IEEE Security & Privacy 4(4), 54–62 (2006) 31. Veríssimo, P., Neves, N.F., Correia, M.: Intrusion-tolerant architectures: Concepts and design. In: de Lemos, R., Gacek, C., Romanovsky, A.B. (eds.) Architecting Dependable Systems. LNCS, vol. 2677, pp. 3–36. Springer, Heidelberg (2003) 32. Wang, D., Trivedi, K.S.: Modeling user-perceived service availability. In: Malek, M., Nett, E., Suri, N. (eds.) ISAS 2005. LNCS, vol. 3694, pp. 107–122. Springer, Heidelberg (2005) 33. Wolter, K., Reinecke, P.: Performance and security tradeoff. In: Aldini, A., Bernardo, M., Pierro, A.D., Wiklicky, H. (eds.) SFM 2010. LNCS, vol. 6154, pp. 135–167. Springer, Heidelberg (2010) 34. Xie, W., Sun, H., Cao, Y., Trivedi, K.: Modeling of user-perceived webserver availability. In: IEEE Intl. Conf. on Communications, ICC 2003, vol. 3, pp. 1796–1800 (May 2003)
Cooperative Repair of Wireless Broadcasts Aaron Harwood1, , Spyros Voulgaris2, and Maarten van Steen2 1 National ICT Australia, Victoria Laboratory Department of Computer Science and Software Engineering The University of Melbourne, Australia [email protected] 2 Department of Computer Science Vrije Universiteit Amsterdam, The Netherlands {spyros,steen}@cs.vu.nl
Abstract. Wireless broadcasting systems, such as Digital Video Broadcasting (DVB), are subject to signal degradation, which affects end users’ reception quality. Clearly, reception quality can be improved by increasing signal strength. This, however, comes at significantly increased energy use, with adverse environmental and financial consequences, notably in either sparsely populated rural regions or densely built, difficult-to-penetrate urban areas. This paper discusses our ongoing work on an alternative approach to improving reception quality, based on the collaborative repair of lossy packet streams among the community of DVB viewers. We present our main idea, the crucial design decisions, the algorithm, as well as preliminary results demonstrating the feasibility and efficiency of this approach. Keywords: Peer-to-Peer, Cooperative Repair, Video Broadcasting.
1 Background
Wireless broadcasting systems, such as Digital Video Broadcasting [3] (DVB), are subject to interference from the environment, which can result in loss of information and an associated loss of quality for the user. Forward error correction (FEC) and interleaving-based schemes involve high overhead on the essentially limited DVB bandwidth. In contrast, a cooperative peer-to-peer repair (CPR) mechanism relies on a primary channel being repaired using a CPR protocol on a secondary channel. We consider the primary channel to be a video stream and have chosen the ISO/IEC 13818-1 International Standard, the standard used for DVB, as a case study. We propose to use UDP over the Internet as the secondary channel. The ISO/IEC 13818-1 standard describes how to packetize streaming data, such as video and audio streams, for transmission and/or storage. For the case where the transmission system is unreliable, the standard prescribes the use of Transport Stream (TS) packets of 188 bytes in length. The video and audio
Research undertaken while on sabbatical at Vrije Universiteit Amsterdam, The Netherlands, Aug. 2010 – Jan. 2011.
MPEG streams (frames) are first encapsulated into a stream of Packetized Elementary Stream (PES) packets, and each PES packet is, in turn, encapsulated into multiple TS packets. A typical DVB transmitter may be rated at 50kW and supply “adequate” service at up to an 80km radius. In Australia there are over 1300 DVB transmitters totalling over 55.6MW per hour of transmission; or 285k metric tons of CO2 per year. The use of CPR may help to reduce the total number of transmitters and/or the required power, thereby helping the environment.
2 Design Decisions and Challenges
2.1 Scope of Repairing
The cornerstone decision in our work has been the selection of the level at which repair should be applied. Specifically, there are three possibilities: repairing frames, repairing PES packets, or repairing TS packets. Due to diverse pros and cons, no option constitutes a win-win tradeoff. We decided to repair TS packets. While it is tempting to consider working at the PES or frame level because these levels provide more semantic information, there are some drawbacks that we identified: less total reliable information is available at the higher levels, semantic information tends to be optional, more processing is required at the higher levels, and TS packets, in contrast, conveniently fit into a UDP packet. Furthermore, Fig. 1 (a) and (b) shows the distribution of frame and PES packet sizes from a DVB video sample (MPEG2 video at 704x576, which prescales to 1024x576). Even at this relatively low resolution (HD television usually allows up to 1920x1080), frame sizes exceed a single UDP packet payload. The variability of frame/PES sizes and their general inability to fit within a single UDP packet would add further complexity to our overall system, which we mitigate by working at the TS level. On the down side, TS packets do not contain unique identifiers, which would be very beneficial. The lack of unique identifiers at the TS level can be overcome by the use of some convenient, mandatory, semantic information contained at the TS level (explained next).
2.2 Transport Stream Packets
A DVB receiver, when tuned and locked to a carrier frequency, produces a stream of TS packets – called the raw TS stream. Among other things, every TS packet in the raw stream contains the following key fields: a 13-bit Program Identifier (PID) that maps the TS packet to an elementary stream, such as a video or audio stream for a given program; a 4-bit Continuity Counter (CC) that is incremented by 1 modulo 16 for each subsequent TS packet in the elementary stream (certain flagged conditions may arise where the CC is not incremented); and a 1-bit Transport Error Indicator (TEI) that is set true by the receiver if the TS packet is erroneous. The fields listed above are insufficient to synchronize two streams for the sake of cooperative repair. There is no uniquely identifying information in each TS
Fig. 1. a–b: Size distribution of frames and PES packets. c–f: CC gap distribution for three sample streams. [Panels: (a) frame size distribution and (b) PES packet size distribution, both plotting size (KB) against items sorted by size (%); (c) observed CC gap counts against the observed length of the Continuity Counter gap for the three streams; (d) good: infrequent errors, (e) worse: frequent errors, (f) worst: unviewable.]
packet, and indeed duplicate TS packets may arise in the stream. However, the standard allows a TS packet to contain additional information in an optional Adaptation Field, and this optional information contains a 33-bit Program Clock Reference (PCR). The standard requires this optional information to appear at least every 100 milliseconds. PCR values are not sequential. They are, however, monotonically increasing and therefore unique within a stream. The presence of the PCR provides semi-regular, unique stamps on selected TS packets. Thus, we make use of the PCR to synchronize two peers so that cooperative repair can take place.
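For illustration, the fields that the repair scheme relies on can be decoded from the fixed 4-byte TS header and the optional adaptation field as sketched below. The offsets follow the ISO/IEC 13818-1 layout; the class itself is our illustration, not code from the prototype.

```java
// Minimal decoder for the TS packet fields used here (TEI, PID, CC) plus the
// optional 33-bit PCR base in the adaptation field. Illustrative sketch only.
final class TsPacket {
    static final int SIZE = 188;
    static final int SYNC_BYTE = 0x47;

    final boolean transportError;  // TEI
    final int pid;                 // 13-bit Program Identifier
    final int cc;                  // 4-bit Continuity Counter
    final Long pcr;                // 33-bit PCR base, or null if absent

    TsPacket(byte[] p) {
        if (p.length != SIZE || (p[0] & 0xFF) != SYNC_BYTE)
            throw new IllegalArgumentException("not a TS packet");
        transportError = (p[1] & 0x80) != 0;
        pid = ((p[1] & 0x1F) << 8) | (p[2] & 0xFF);
        cc  = p[3] & 0x0F;

        // Adaptation field present if bit 5 of byte 3 is set; the PCR flag is
        // bit 4 of the first adaptation-field flag byte.
        boolean hasAdaptation = (p[3] & 0x20) != 0;
        if (hasAdaptation && (p[4] & 0xFF) > 0 && (p[5] & 0x10) != 0) {
            long base = ((long) (p[6] & 0xFF) << 25) | ((p[7] & 0xFF) << 17)
                      | ((p[8] & 0xFF) << 9)  | ((p[9] & 0xFF) << 1)
                      | ((p[10] & 0x80) >>> 7);
            pcr = base;   // 33-bit PCR base; the 9-bit extension is ignored here
        } else {
            pcr = null;
        }
    }
}
```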
3 The Cooperative Repair Algorithm
A repair algorithm relies on two main components: a mechanism to detect missing information, and a naming scheme to uniquely identify this information in requests to external sources.
3.1 Detecting Missing Packets
We use the CC field, described in Section 2.2, to detect missing TS packets. This has a clear limitation, as any sequence of k consecutive missing packets maps to a CC gap of k modulo 16. For instance, it is impossible to distinguish between missing a single packet and missing 17, 33, or generally 16i+1 consecutive packets. Even worse, missing sequences of exactly a multiple of 16 packets will go undetected. To assess the frequencies at which different gaps appear in a realistic setting, we analyzed a number of DVB streams recorded at locations with diverse reception qualities. Fig. 1(c) shows the count of CC gaps for three recordings of the
same duration (circa 1 minute each), yet of very different viewing qualities: infrequent errors, frequent errors, and unviewable. The respective snapshots characteristically illustrate the quality of each stream. As expected, barely tuned streams experience the highest packet loss and the most CC gaps. Still, observations of gaps of length 1 (1, 17, 33, etc., consecutive missed packets) are two orders of magnitude more frequent than observations of gaps of length 15 (15, 31, 47 missed packets, etc.). This distribution strongly suggests that the limited length of 4 bits used for the CC is, in principle, inadequate for detecting missing packets; that inadequacy, however, occurs only for a negligible fraction of the total losses. Our experiments confirm this observation.
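In code, the gap computation is a single modulo-16 subtraction; the following one-liner is our illustration, not the authors' implementation.

```java
// Number of TS packets presumed missing between two consecutively received
// packets of the same PID, given their 4-bit continuity counters. A result of
// 0 may also hide a loss of exactly 16i packets, as discussed above.
static int ccGap(int previousCc, int currentCc) {
    return (currentCc - previousCc - 1) & 0x0F;   // modulo-16 difference
}
```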
3.2 Naming Scheme
In order to be able to name specific units for repair, we introduce the notion of blocks. A block is a sequence of TS packets, consisting of all packets between two consecutive PCRs. The starting and ending PCR values of a block are called its boundaries and uniquely identify this block across all nodes. Packets within a block are identified relative to the starting PCR boundary, based on the CC. When a node receives a PCR packet, it marks the beginning of a new block. By observing the CC in subsequent TS packets, it keeps track of which packets it received and which it missed. In doing so, it follows an optimistic approach: a CC gap of k ∈ [1, 15] is interpreted as a loss of exactly k packets, rather than 16i + k. A zero gap is interpreted as no lost packet, rather than as 16i lost packets. With high probability, the assessment of what has been received is accurate, unless 16 or more consecutive packets were lost at some point. The next PCR marks the end of this block and the beginning of a new one. The record of which TS packets were received and which are missing constitutes the block’s map. In our system it is represented as a bitstring; 1 stands for a received and 0 for a missing packet. Note that missing PCRs lead to concatenated blocks that may later be repaired and split into smaller blocks.
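A minimal sketch of this per-block bookkeeping is given below; the class and field names are ours, and the optimistic gap interpretation is applied as each packet is appended.

```java
// Per-block record: boundaries are the starting and ending PCR values, the map
// marks which TS packets (relative to the starting PCR) were received.
// Illustrative sketch only, not the authors' implementation.
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class Block {
    final long startPcr;
    long endPcr = -1;                       // set when the next PCR arrives
    final List<byte[]> packets = new ArrayList<>();
    final BitSet map = new BitSet();        // 1 = received, 0 = missing
    private int lastCc = -1;
    private int index = 0;                  // position relative to startPcr

    Block(long startPcr) { this.startPcr = startPcr; }

    void onPacket(int cc, byte[] tsPacket) {
        if (lastCc >= 0) {
            int gap = (cc - lastCc - 1) & 0x0F;   // optimistic: exactly 'gap' lost
            index += gap;                          // leave holes in the map
            for (int i = 0; i < gap; i++) packets.add(null);
        }
        map.set(index);
        packets.add(tsPacket);
        index++;
        lastCc = cc;
    }

    boolean complete() { return map.cardinality() == packets.size(); }
}
```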
3.3 Regular Operation
Upon completion of a block, the node stores that block in memory, indexed by its boundaries. Then the node checks whether all packets are believed to have been received. If not, it invokes a Pull operation on a random other node that is currently tuned to the same TV channel, requesting the missing TS packets. It sends the block boundaries and the block map. A node receiving a Pull request searches its memory for a block with the specified boundaries. It may have it complete, or it may be missing some TS packets too. In either case, it performs a bitwise operation on the two block maps to figure out which of the requested packets it has. It then sends a Push response to the requester, piggybacking zero or more of the requested packets. Upon receiving a Push response, a node adds the received TS packets to the block in question, updates the block’s map, and checks if the block is now complete. If it is still missing packets, it issues a new Pull request on another
random node. This is repeated until either the block is complete or a time threshold called ViewerTimeout has been reached. At that point, the TS packets of the block are handed to the higher layers for decoding and viewing. A second timer, the PullTimeout, is associated with each Pull request. If no response is received within that time, a new Pull request is sent to another random node.
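For illustration, the bitwise comparison performed by a node answering a Pull request could look as follows; the Block record is the sketch from Section 3.2 above, and the BlockStore lookup interface is an assumption of ours.

```java
// Handling a Pull request: compare the requester's block map with the local one
// and push back whichever of the requested (missing) packets we have.
// Illustrative sketch only.
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class RepairPeer {

    interface BlockStore { Block lookup(long startPcr, long endPcr); }

    private final BlockStore store;
    RepairPeer(BlockStore store) { this.store = store; }

    /** Returns the TS packets to piggyback on the Push response (possibly empty). */
    List<byte[]> onPull(long startPcr, long endPcr, BitSet requesterMap) {
        List<byte[]> payload = new ArrayList<>();
        Block local = store.lookup(startPcr, endPcr);
        if (local == null) return payload;              // block unknown here

        // Packets the requester is missing AND we have: (NOT theirMap) AND ourMap.
        BitSet canRepair = (BitSet) local.map.clone();
        BitSet missing = (BitSet) requesterMap.clone();
        missing.flip(0, local.packets.size());
        canRepair.and(missing);

        for (int i = canRepair.nextSetBit(0); i >= 0; i = canRepair.nextSetBit(i + 1))
            payload.add(local.packets.get(i));
        return payload;
    }
}
```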
3.4 Special Cases and Supporting Mechanisms
A number of special cases may arise, which make the algorithm nontrivial. A node receiving a request may realize that its own copy of the block in question has more or fewer TS packets than the map received. A node receiving a Pull request may be unable to find the requested block in its local memory. A node receiving a Pull request may find that it has both the starting and the ending PCR value, but that they do not belong to a single block. We have solutions to these special cases that handle most (although not all) error scenarios, which we omit for space considerations. Like many epidemic protocols, the cooperative repair algorithm relies on communication between peers selected uniformly at random. To that end, we rely on the family of Peer Sampling Service protocols, and specifically on Cyclon [5], which provides each node with a regularly refreshed list of links to random other peers, in a fully decentralized manner and at negligible bandwidth cost. We omit a discussion of this aspect due to space considerations.
4 Evaluation
Our basic results are obtained through simulations using PeerSim, based on a uniform error model. All nodes experience an identical constant error probability for each TS packet. Fig. 2 shows, as a function of the TS packet error rate, the percentage of blocks that were received correctly straight away through broadcasting (darkest), blocks that were initially incomplete but were fully repaired (dark), blocks that were still incomplete when passed on to the higher layer – probably partially repaired (light), and blocks which the node never became aware of, due to some lost PCR (lightest). The four cases represent a ViewerTimeout of 500 ms, 1000 ms, 2000 ms, and 5000 ms, while the PullTimeout was fixed to 600 ms for all of them. Clearly, as the ViewerTimeout increases, the ability to know and repair more blocks increases. In fact, ViewerTimeout has a dominant effect in this respect, more so than PullTimeout. E.g., for a mere delay of 5 seconds, users who would otherwise hardly be able to watch a program can now enjoy a crystal-clear stream. The value of PullTimeout was chosen to be 600 ms in this example because this value keeps the upload bandwidth of the nodes below about 1 Mbps, in line with residential ADSL2+ limits.
5 Conclusion and Related Work
The work in [6] is closely related to ours – the authors repair satellite broadcasts; however, few details are provided in the paper. In [4,1,2], some other P2P
Fig. 2. Repair effectiveness vs. node error rates, for different ViewerTimeout values. From bottom up — Darkest: fraction of blocks received correctly; Dark: completed through repairing; Light: known (but incomplete); Lightest: not known blocks. [Four panels for ViewerTimeout = 500 ms, 1000 ms, 2000 ms, and 5000 ms, each plotting blocks (%) against prob[TS packet error] from 0 to 0.3.]
approaches use a secondary channel such as 802.11 or Bluetooth, with a focus on mobile devices. We omit a detailed description of related work for space considerations. Our results thus far provide significant motivation for us to continue research in this direction.
References 1. Li, S., Chan, S.H.G.: Bopper: Wireless video broadcasting with peer-to-peer error recovery. In: Proc. IEEE Int. Conf. on Multimedia and Expo, pp. 392–395. IEEE, Los Alamitos (2007) 2. Raza, S., Li, D., Chuah, C.N., Cheung, G.: Cooperative peer-to-peer repair for wireless multimedia broadcast. In: Proc. IEEE Int. Conf. on Multimedia and Expo, pp. 1075–1078. IEEE, Los Alamitos (2007) 3. Reimers, U.: DVB - the family of international standards for digital video broadcasting. Proc. IEEE 94(1), 173–182 (2006) 4. Sanigepalli, P., Kalva, H., Furht, B.: Using P2P networks for error recovery in MBMS applications. In: Proc. IEEE Int. Conf. on Multimedia and Expo, pp. 1685–1688. IEEE Computer Society, Los Alamitos (2006) 5. Voulgaris, S., Gavidia, D., van Steen, M.: Cyclon: Inexpensive membership management for unstructured P2P overlays. Journal of Network and Systems Management 13(2), 197–217 (2005), http://dx.doi.org/10.1007/s10922-005-4441-x 6. Weigle, E., Hiltunen, M., Schlichting, R., Vaishampayan, V.A., Chien, A.A.: Peer-to-peer error recovery for hybrid satellite-terrestrial networks. In: Proc. IEEE Int. Conf. on Peer-to-Peer Computing, pp. 153–160. IEEE Computer Society, Los Alamitos (2006)
ScoreTree: A Decentralised Framework for Credibility Management of User-Generated Content Yang Liao, Aaron Harwood, and Kotagiri Ramamohanarao The Department of Computer Science and Software Engineering The University of Melbourne, Australia {liaoy,aaron,rao}@csse.unimelb.edu.au
Abstract. Peer-to-peer applications are used in sharing User-Generated Content (UGC) on the Internet, and there is a significant need for UGC to be analysed for credibility/quality. A number of schemes have been proposed for deriving the credibility of content items by analysing users’ feedback, mostly using centralised computations and/or semi-decentralised approaches. In this paper, we propose our P2P scheme, ScoreTree, which decentralises a relatively complex credibility management algorithm by aggregating distributed evaluations and delivering an estimate of credibility for each content item of interest. Our experiments show that our scheme compares favourably with existing decentralised approaches, including a gossip-message-based implementation of ScoreFinder and a widely adopted P2P application called Vuze.
1 Introduction
User-Generated Content (UGC) is an increasingly important information source on the Internet. UGC applications process individual data streams from a large number of Internet users and make this information available globally, e.g., Social Networking, Collaborative Content Publishing, File Sharing, Virtual Worlds and other collaborative activities. The value or utility of the information from these applications depends on the information’s credibility – users need to be able to ascertain the credibility/quality of the information in the UGC. Because it is impossible for any single party to manually rank the credibility of large collections of shared content items, a number of UGC applications allow the users themselves to provide feedback, or to score the content items that other users have provided. Recent advances such as [8] have been made towards more sophisticated methods for aggregating the users’ feedback, e.g., by eliminating bias and other anomalous (undesirable) user behaviour that can be identified in a set of scores. Decentralised or Peer-to-Peer (P2P) approaches have been widely used in recent years for sharing content contributed and/or generated by users. The long-term continuation of P2P content sharing applications raises the need for a decentralised credibility management scheme. Addressing this need, we propose our decentralised scheme, ScoreTree, in this paper. Experimental results show that our method converges fast and is more robust against churn and network conditions, in comparison to other current proposals [8,9,1].
2 Background and Related Work
2.1 Trust Management and Collaborative Filtering
The quality and authenticity of shared items may be inferred by Trust and Reputation Management systems, like P2PRep [2], which is designed for Gnutella-style unstructured peer-to-peer networks and collects reputation votes by flooding requests, and EigenTrust [6], where a peer maintains a number of trusted peers based on manual ratings or accumulated interaction experiences, and the goal is to infer the global rank of trustworthiness for each peer. The Trust/Reputation Management Model is intended for managing the trustworthiness of individuals rather than the quality of shared items; hence we need to extend this model such that items from the same user may be discriminated by their quality. A field very close to our research is Collaborative Filtering [13,11], which predicts the score that a user may give to a new item by aggregating opinions from other users. This objective is very similar to Web Link Analysis; nevertheless, no globally agreed rank for each node is maintained; instead, different predictions are given to different users according to their profiles. By combining the methods of Collaborative Filtering and Trust Management, we have proposed our Annotator-Article Model.
2.2 The Annotator-Article Model
The Annotator-Article model was proposed in our earlier work [8] for credibility management applications, where two types of entities are considered: Articles that are available for annotation and Annotators who annotate the articles, i.e., score them. The evaluation of an article can consist of nominal ranks or numeric scores. We have proposed an iterative algorithm, called ScoreFinder, which offsets the bias of each annotator and adaptively selects scores from credible users. The pseudo-code of this algorithm is shown in Algorithm 1. The algorithm iteratively updates the expertise level and the bias level of each user, and then calculates the weighted average of the scores for each item using the most recent expertise and bias levels. The input parameter for this algorithm is the score matrix S = (s_ij), a two-mode proximity matrix that denotes the scores that each user i gives to each article j. The scores in S are linearly normalised into the range between 0 and 1; a higher s_ij denotes a better rating. A discriminant function, δ_ij, is also defined to determine whether a score exists between user i and article j. The value of δ_ij equals 1 if the score exists, and 0 otherwise. The output of this algorithm is the vector r = (r_j), where each r_j, between 0 and 1, denotes the consensus evaluation of the j-th article. A higher r_j indicates a better evaluation of article j. The expertise levels and the bias-removed scores of the users in every iteration are the intermediate results of the algorithm, denoted by the vector e^τ and the matrix S^τ respectively, where τ denotes the current iteration number. The value of γ that is used to control the influence of the expertise levels is a trained constant, and the convergence criterion, ε, is the difference in the Sum of Squared Errors (SSE) between the results of two consecutive iterations. More details about selection of τ are available in [8].
Algorithm 1. The algorithm of centralised ScoreFinder
τ ← 0
r^τ ← AverageScores(S)
repeat
  τ ← τ + 1
  S^τ ← BiasRemoval(S, r^{τ−1})
  e^τ ← ExpertnessEstimation(S^τ, r^{τ−1})
  r^τ ← WeightedAverage(S^τ, e^τ)
until SSE(r^τ, r^{τ−1}) < ε
r ← r^τ
The formulas corresponding to the procedures BiasRemoval, ExpertnessEstimation and WeightedAverage are
\[
s^{\tau}_{ij} = s_{ij} - \frac{\sum_j \delta_{ij}\,\bigl(s_{ij} - r_j^{\tau-1}\bigr)}{\sum_j \delta_{ij}}, \qquad
e^{\tau}_{i} = \left(1 - \frac{\sum_{j=1}^{n} \delta_{ij}\,\bigl|r_j^{\tau-1} - s^{\tau}_{ij}\bigr|}{\sum_{j=1}^{n} \delta_{ij}}\right)^{\!\gamma}, \qquad
r^{\tau}_{j} = \frac{\sum_i \delta_{ij}\, e^{\tau}_i\, s^{\tau}_{ij}}{\sum_i \delta_{ij}\, e^{\tau}_i}.
\]
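For concreteness, one iteration of the centralised algorithm can be written down directly from these formulas. The following sketch is our illustration, not the authors' implementation; it uses a dense score matrix in which Double.NaN marks absent scores (δ_ij = 0).

```java
// One ScoreFinder pass over a dense score matrix s[i][j]; Double.NaN means
// "user i did not score article j". Follows the three formulas above.
static double[] scoreFinderIteration(double[][] s, double[] rPrev, double gamma) {
    int users = s.length, articles = rPrev.length;
    double[][] sTau = new double[users][articles];
    double[] e = new double[users];

    for (int i = 0; i < users; i++) {
        // BiasRemoval: subtract user i's average deviation from the consensus.
        double dev = 0; int n = 0;
        for (int j = 0; j < articles; j++)
            if (!Double.isNaN(s[i][j])) { dev += s[i][j] - rPrev[j]; n++; }
        double bias = n > 0 ? dev / n : 0;

        // ExpertnessEstimation: 1 minus the mean absolute error, raised to gamma.
        double err = 0;
        for (int j = 0; j < articles; j++) {
            if (Double.isNaN(s[i][j])) { sTau[i][j] = Double.NaN; continue; }
            sTau[i][j] = s[i][j] - bias;
            err += Math.abs(rPrev[j] - sTau[i][j]);
        }
        e[i] = n > 0 ? Math.pow(1 - err / n, gamma) : 0;
    }

    // WeightedAverage: expertise-weighted mean of the bias-removed scores.
    double[] r = new double[articles];
    for (int j = 0; j < articles; j++) {
        double num = 0, den = 0;
        for (int i = 0; i < users; i++)
            if (!Double.isNaN(sTau[i][j])) { num += e[i] * sTau[i][j]; den += e[i]; }
        r[j] = den > 0 ? num / den : rPrev[j];
    }
    return r;
}
```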
3 Decentralised Credibility Management
3.1 Calculating Weighted Average in a Peer-to-Peer Network
A straightforward approach to calculating a weighted average in a P2P network is to nominate a peer that is in charge of collecting source values from all other peers and propagating the result. The peer is usually nominated by searching for the unique identity of the file using a Distributed Hash Table (DHT), as in the method used by the credibility management component in Vuze1. In an Internet-scale application, a peer that is in charge of managing the credibility of very popular items has to face a very large number of peers sending and receiving scores, and becomes a bottleneck of the system. A comparison of message numbers between our scheme and Vuze’s scheme is shown later in the experiment section. Gossip messages are an efficient way to calculate a weighted average without introducing bottlenecks. As discussed in [7,10], each peer initialises its local, temporary result with its local source value, and continuously exchanges a portion of its local temporary result with each known neighbour. After a sufficient time period, the temporary results at all peers converge to a consistent value, which is the accurate average of all source values. There are two disadvantages of concern when using gossiping. First, the loss of messages, e.g. due to packet loss, may lead to numerical inaccuracies and requires additional messaging to overcome. Second, the convergence is highly dependent on the network topology; a low-connectivity network can take a long time to converge because of the limited propagation speed between partitions of the network. The Prefix Hash Tree (PHT), proposed in [12], is an approach to building a tree structure on the peers, which are organised in a DHT, for hierarchically aggregating and searching data that is distributed over the peers. A peer in a structured network usually has a unique identity, like its IP address, by which it can be addressed. Assuming that the hash value is represented by a string (a1, a2, a3, ..., al), where l is the length of the string, a sequence can be built by selecting the first k characters: Key_k = (a1, a2, ..., ak) for all 0 ≤ k ≤ l. Since Key_{k−1} is a function of Key_k, and all peers have Key_0 = “”, a tree structure can be built on these keys involving all real peers as leaf nodes, as in the example shown in Figure 1(a). We use the term logical node for a tree node that corresponds to such a key. Because there is a predictable and unique path between every pair of nodes in a tree, each peer can expect the edge through which data from another given peer comes, and
¹ Vuze is a widely used peer-to-peer file sharing application based on an open-source project. The application can be downloaded from http://www.vuze.org/. An extension component [1] is provided for managing the quality of shared items by analysing scores given by users.
The two disadvantages of the gossip-message approach can be overcome by exchanging data along these paths, as we demonstrate in the next section. Therefore, we build our ScoreTree schema by mapping logical nodes to real peers, as shown in Figure 1(b). A peer with Key_k sends a message through the DHT to Key_{k-1} to find its parent peer. Such a tree can contain cycles, because a logical node and its ancestors may be mapped to the same peer, as in the example shown in Figure 1(b). We use three rules to remove cycles from the tree: (1) a peer is represented by the hosted PHT node on the highest level; (2) if a peer hosts multiple PHT nodes on the same level of the PHT, the PHT node with the minimum key value is selected as the representative; and (3) only the edges between the representative PHT nodes are kept, and the other PHT edges are disregarded. Figure 1(c) shows an example of selecting tree edges by applying these three rules.
Fig. 1. Building Prefix-Hash Trees: (a) an example of a PHT among logical nodes; (b) using the DHT to map logical nodes onto real peers; (c) from a node tree to a peer tree
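As an illustration of the key construction just described, the fragment below sketches how the prefix keys Key_0, ..., Key_l and a node's parent key could be derived from a peer's hashed identity; the hash string is a made-up example and the actual DHT lookup is omitted.

-- Illustrative sketch: deriving the PHT prefix keys from a peer's hashed identity.
-- The hash value is assumed to be an l-character string, e.g. "1011".
local function prefix_keys(hash)
  local keys = { "" }                          -- Key_0 is always the empty string (the root)
  for k = 1, #hash do
    keys[#keys + 1] = string.sub(hash, 1, k)   -- Key_k = (a_1, ..., a_k)
  end
  return keys
end

local function parent_key(key)
  -- Key_{k-1} is obtained by dropping the last character of Key_k; a peer
  -- would look this key up in the DHT to find the peer hosting its parent node.
  if #key == 0 then return nil end             -- the root has no parent
  return string.sub(key, 1, #key - 1)
end

local keys = prefix_keys("1011")               -- {"", "1", "10", "101", "1011"}
print(parent_key(keys[#keys]))                 -- prints "101"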
3.2 Tree-Based Average Calculation

The key challenges in implementing the ScoreFinder algorithm are to calculate the weighted average of scores and to update the estimates of the bias levels and the expertise levels in a P2P network. We depict the principle of our weighted average algorithm in Figure 2, where each peer in the P2P network is deemed to be a node in a tree. Regardless of the size of the peer-to-peer network, a tree node only sees a handful of neighbouring nodes – its parent node and its child nodes. Note that a peer that does not contribute s_k and e_k can still join the tree structure by letting s_k = 0. Our approach considers each edge from a peer to another peer as the delegate for all nodes that are reached via that edge. Because of the uniqueness of the path between each pair of nodes, there is no overlap between the node sets delegated by the different edges of peer k, and every node in the tree (except peer k itself) must be delegated by some edge of peer k. Furthermore, if all its edges have been the delegates of the nodes
Fig. 2. An example showing the process of calculating the weighted average value on peer k by exchanging data with its parent node and child nodes. σ_{x,y} and μ_{x,y} denote the weighted sum value and the sum of weights sent from node x to node y. The article number j is omitted from s_ij and r_j because only one shared item is considered in this example.
behind them to peer k, these edges are also the delegates of the nodes behind them to all of k's neighbours. Every peer, like peer k, periodically sends to each neighbour x the sum of the σ and μ values received over all other edges, together with its local weight and weighted score:

\sigma_{k,x} = \sum_{y \neq x} \sigma_{y,k} + e_k s_k, \qquad \mu_{k,x} = \sum_{y \neq x} \mu_{y,k} + e_k,    (1)

where σ_{k,x} and μ_{k,x} denote the sum of weighted scores and the sum of weights sent from peer k to peer x. Since peer k receives the weighted sum of all source values and the sum of all weights (except e_k and s_k) from its neighbours, it can evaluate the above expressions and calculate the weighted average of all source values in the tree:

r = \frac{\sigma}{\mu} = \frac{e_k s_k + \sum_{x \neq k} \sigma_{x,k}}{e_k + \sum_{x \neq k} \mu_{x,k}} = \frac{\sum_x e_x s_x}{\sum_x e_x}.

Note that σ and μ are calculated in the same way on all peers, so that all peers reach the same r in 2 × l steps, where l denotes the depth of the tree. To reduce the load on the single root node, a dedicated PHT is built for every article being scored, and a peer accordingly joins a number of trees, one for each article it has scored; the article identity is used to differentiate the mapping between keys and peers in the trees for different articles, i.e., Peer_x = DHT(Article_i + Key_x).

Our PHT-guided approach overcomes the two disadvantages of the gossip-message based approach: the tree structure provides convergence in a number of rounds equal to a constant times the tree depth, and lost messages can be detected because each peer explicitly knows from which other peers it expects messages. For example, if a message carrying σ_{k,x} and μ_{k,x} is lost, node x explicitly knows that data from the subtree delegated by node k is unavailable in this round of the iteration, so it may use the σ_{k,x} and μ_{k,x} received in the previous round or ask node k to retransmit the message.

DHT schemes, like Pastry [15] and Bamboo [14], usually provide replication of data items to peers, such that data is not lost under churn. Our ScoreTree schema packages all the scores from a user into a single data item and uses the DHT to store this data package under the key of its owner; the data package is replicated by the DHT to a number of peers. When the original peer goes off-line, one replica is activated to join the computation on behalf of the departed peer until it returns.
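To illustrate the aggregation step defined by Equation (1), the following sketch (our own illustrative Lua code, with hypothetical data structures) shows how a peer could combine the σ and μ values received from its neighbours; the message exchange itself is abstracted away.

-- Illustrative sketch of one peer's update step from Equation (1).
-- received[y] = {sigma = sigma_{y,k}, mu = mu_{y,k}} holds the latest values
-- received from each neighbour y; e_k and s_k are the peer's own weight and score.
local function outgoing(received, e_k, s_k, x)
  local sigma, mu = e_k * s_k, e_k
  for y, v in pairs(received) do
    if y ~= x then                         -- sum over all edges except the one towards x
      sigma, mu = sigma + v.sigma, mu + v.mu
    end
  end
  return sigma, mu                         -- sigma_{k,x} and mu_{k,x} to be sent to peer x
end

local function local_estimate(received, e_k, s_k)
  local sigma, mu = e_k * s_k, e_k
  for _, v in pairs(received) do
    sigma, mu = sigma + v.sigma, mu + v.mu
  end
  return sigma / mu                        -- converges to the global weighted average r
end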
4 Experiments

4.1 Baselines, Datasets and the Experimental Environment

We compared our ScoreTree schema with three schemes in the experiments:
– the centralised ScoreFinder algorithm introduced in [8];
– the schema introduced in [9], which decentralises the ScoreFinder algorithm by exchanging gossip messages between randomly assigned neighbours; and
– the decentralised schema used in Vuze, which stores the raw scores at a peer selected by the DHT.

The average scores are the baseline, and all new schemes should achieve a better accuracy than the average scores. Because our ScoreTree schema, the first schema and the second schema implement the same algorithm, they are expected to converge to the same results.

We use the MovieLens data set [3] to test our schema. The data set contains 10 million ratings for 10681 movies from 71567 volunteers; we randomly select a small subset of the annotators and movies from the data set. As "oracle" scores, we take the arithmetic average of all scores given to each movie as its true score. We use the Mean Squared Error (MSE) to evaluate the accuracy of the results of all tested schemes, and show the improvement over the baseline in the charts below.

We built a simulator that emulates a P2P network with at most two thousand peers, and ran this simulator on a cluster computer consisting of 20 nodes. The simulator was configured to examine a number of situations, including different packet loss rates and different peer availability schemes. Our schema was implemented in this simulator, as were all the other decentralised schemes we tested. A widely adopted DHT schema, Pastry, was used to organise the computation that relies on the DHT.

4.2 Experimental Results

Figures 3(a) and 3(b) show how the accuracy changes with the scale of the selected data set. ScoreTree achieved an accuracy similar to that of ScoreFinder at all scales, as did the gossip-based approach. Because ScoreTree is a decentralised version of ScoreFinder without any change to its hypotheses and operations, ScoreTree is expected to achieve a similar accuracy to ScoreFinder.

A message may be lost on the direct route between any two peers, and we simulate random message loss using a constant error probability over all packets sent. Figure 3(c) shows that the accuracy of ScoreTree is better than that of the gossip-message method when less than 25% of the messages are lost, because no information is lost along with the lost messages; however, when the proportion of lost messages exceeds 25%, the accuracy of ScoreTree decreases rapidly because the tree structure breaks.

Churn is simulated by turning off a random selection of peers, and the results are shown in Figure 3(d). Our replication management significantly improved the robustness of the ScoreTree schema. Without replication management, ScoreTree achieved a better performance than the baseline when 80% of the peers were on-line, whereas, by hosting replicas of the off-line peers, approximately 50% of the peers suffice to achieve the same accuracy.
Fig. 3. Comparisons for: (a) number of annotators (number of articles held constant at 400), (b) number of articles (number of annotators held constant at 400), (c) lost messages and (d) proportion of on-line peers. Note that all Y-axes are logarithmically scaled.
Fig. 4. Comparison of ScoreTree to the rating management module of Vuze in terms of message overhead
We also compare the message overheads of the rating management module of Vuze [1] and our ScoreTree schema in Figure 4, where all messages for maintaining the DHT/PHT and for the computation are counted. Vuze always has fewer messages in total, but as more peers are introduced into the network, the maximum number of messages sent from a single Vuze peer increases more rapidly than in our schema. Both Vuze and ScoreTree generate fewer messages per peer on average as the network grows, giving both schemes good scalability.
5 Conclusions and Future Work

This paper introduced ScoreTree, a new decentralised schema addressing the problem of credibility management in P2P networks. The experimental results show that ScoreTree scales better than the other schemes and, in most situations, is more robust against adverse network conditions and peer churn. Our schema incurs overheads for building the DHT between peers and for building the PHT on top of the DHT, but the maximum workload of a peer is better controlled by ScoreTree, and the convergence speed is guaranteed. We note that there are schemes, like [5], for building trees between peers without an underlying structure, in which the depth and width of the tree are controllable.
This gives us an opportunity to reduce the cost of maintaining the tree structure. Integrity is also a concern when our schema is adopted in practical applications. In our schema the influence of every node is propagated to all nodes in the tree; a malicious peer can arbitrarily change the final result by manipulating the σ and μ values it sends to its neighbours. We note that PeerReview [4] may be useful for holding every peer accountable for its behaviour and for identifying peers that break the protocol.
References
1. Chalouhi, O., and Tux Paper: Azureus - rating v1.3.1 (July 2006), http://azureus.sourceforge.net/plugin_details.php?plugin=azrating
2. Damiani, E., De Capitani di Vimercati, S., Paraboschi, S., Samarati, P.: Managing and sharing servants' reputations in P2P systems. IEEE Transactions on Knowledge and Data Engineering 15(4), 840–854 (2003)
3. GroupLens: MovieLens data sets (October 2006), http://www.grouplens.org/node/73
4. Haeberlen, A., Kouznetsov, P., Druschel, P.: PeerReview: practical accountability for distributed systems. SIGOPS Oper. Syst. Rev. 41(6), 175–188 (2007)
5. Jagadish, H.V., Ooi, B.C., Vu, Q.H.: BATON: a balanced tree structure for peer-to-peer networks. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB 2005, pp. 661–672. VLDB Endowment (2005)
6. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: The EigenTrust algorithm for reputation management in P2P networks. In: WWW 2003: Proceedings of the 12th International Conference on World Wide Web, pp. 640–651. ACM, New York (2003)
7. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregate information. In: Annual Symposium on Foundations of Computer Science, vol. 44, pp. 482–491. Citeseer (2003)
8. Liao, Y., Harwood, A., Kotagiri, R.: ScoreFinder: A method for collaborative quality inference on user-generated content. Accepted by ICDE 2010 (2010)
9. Liao, Y., Harwood, A., Ramamohanarao, K.: Decentralisation of ScoreFinder: A framework for credibility management on user-generated contents. In: Zaki, M., Yu, J., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 272–282. Springer, Heidelberg (2010)
10. Mehyar, M., Spanos, D., Pongsajapan, J., Low, S.H., Murray, R.M.: Distributed averaging on asynchronous communication networks. In: 44th IEEE Conference on Decision and Control, 2005 European Control Conference, CDC-ECC 2005, pp. 7446–7451 (December 2005)
11. O'Donovan, J., Smyth, B.: Trust in recommender systems. In: IUI 2005: Proceedings of the 10th International Conference on Intelligent User Interfaces, pp. 167–174. ACM, New York (2005)
12. Ramabhadran, S., Ratnasamy, S., Hellerstein, J.M., Shenker, S.: Prefix hash tree: An indexing data structure over distributed hash tables. In: Proceedings of the 23rd ACM Symposium on Principles of Distributed Computing. Citeseer (2004)
13. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: an open architecture for collaborative filtering of netnews. In: CSCW 1994: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175–186. ACM, New York (1994)
14. Rhea, S., Geels, D., Roscoe, T., Kubiatowicz, J.: Handling churn in a DHT. In: Proceedings of the USENIX Annual Technical Conference, pp. 127–140 (2004)
15. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Liu, H. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
Worldwide Consensus

Francisco Maia, Miguel Matos, José Pereira, and Rui Oliveira

High-Assurance Software Laboratory, University of Minho, Braga, Portugal
{fmaia,miguelmatos,jop,rco}@di.uminho.pt
Abstract. Consensus is an abstraction of a variety of important challenges in dependable distributed systems. Thus, a large body of theoretical knowledge is focused on modeling and solving consensus within different system assumptions. However, moving from theory to practice imposes compromises and design decisions that may impact the elegance, trade-offs and correctness of theoretically appealing consensus protocols. In this paper we present the implementation and detailed analysis, in a real environment with a large number of nodes, of mutable consensus, a theoretically appealing protocol able to offer a wide range of trade-offs (called mutations) between decision latency and message complexity. The analysis sheds light on the fundamental behavior of the mutations and leads to the identification of problems related to the real environment. Such problems are addressed without ever affecting the correctness of the theoretical proposal.
1 Introduction
The problem of fault-tolerant consensus in distributed systems has received much attention throughout the years as a powerful abstraction at the core of several practical problems, namely atomic commitment, atomic broadcast and view synchrony. Furthermore, the variety of models in which consensus can be solved led to the appearance of several protocols targeted at system models with different assumptions on the synchrony of processes and communications channels, on the admissible failure patterns, and on failure detection. For the numerous consensus protocols present in the literature, a generic differentiator, with major relevance in practical terms, is the network-level communication pattern that emerges from each particular design. For instance, in Chandra and Toueg’s centralized protocol [4], a rotating coordinator process is in the center of all communication: the coordinator sends its proposal to all other participants, then collects votes from everyone and finally broadcasts the decision. A different approach is taken in Schiper’s Early Consensus protocol [11] where all participants always broadcast their messages. Different protocols thus
Partially funded by the Portuguese Science Foundation (FCT) under project Stratus - A Layered Approach to Data Management in the Cloud (PTDC/EIACCO/115570/2009) and grants SFRH/BD/62380/2009 and SFRH/BD/71476/2010.
present different communications patterns with distinct message complexity and communication steps that establish several trade-offs on the decision latency, network usage, resilience to message loss and processor load.
2 Mutable Consensus
The Mutable Consensus protocol [10] solves the consensus problem [6] while tolerating the crash of a minority of the processes. It assumes an asynchronous distributed system model augmented with an eventually strong failure detector, ♦S [4]. Processes are considered to be fully connected through fair-lossy communication channels. A fair-lossy channel closely models existing network links while requiring the weakest reliability properties that are still useful: any message that is sent has a non-zero probability of being delivered. Over these channels, the mutable protocol leverages a simple yet powerful abstraction given by Stubborn communication [7]. In the following we recall the definitions of consensus and Stubborn Channels and provide an overview of the Mutable Consensus protocol.

2.1 The Consensus Problem
The consensus problem abstracts agreement in fault-tolerant distributed systems, in which a set of processes must agree on a common value despite starting with different opinions. All processes are expected to start the protocol with an undecided value for the decision and to propose some value through function Consensus. Each correct process, that is a process that does not crash, is expected to finish the protocol as soon as it decides on a value, such that the following properties hold [4]:

Validity: If a process decides v, then v was proposed by some process.
Agreement: No two processes decide differently.
Termination: Every correct process eventually decides some value.

2.2 Stubborn Communication Channels
A Stubborn Channel [7] connecting two processes p_i and p_j is an unreliable communication channel defined by a pair of primitives sSend_{i,j}(m) and sReceive_{j,i}(m) that satisfy the following two properties:

No-Creation: If p_i receives a message m from p_j, then p_j has previously sent m to p_i.
Stubborn: Let p_i and p_j be correct. If p_i sends a message m to p_j and p_i indefinitely delays sending any further message to p_j, then p_j eventually receives m.

Intuitively, a stubborn channel adds to the reliability of a fair-lossy channel by strengthening the delivery guarantees of the last message sent: as soon as a new message is sent, it makes the previous one obsolete. Stubborn channels were initially proposed as a way to reduce the buffer footprint required by reliable communication [7], but in the Mutable Consensus protocol they are the key to the algorithm's mutations of the network-level usage patterns. The stubborn property allows messages to be lost, and Mutable Consensus takes advantage of this to have only a subset of the messages effectively sent over the network. To implement a stubborn channel over a fair-lossy channel it suffices to buffer the last message sent and retransmit it periodically.
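As a rough illustration of this idea (not the paper's implementation, which is presented later in Section 3.2 and Listing 2), a stubborn channel layered over a lossy send primitive might look as follows; lossy_send and sleep are assumed to be provided by the environment.

-- Illustrative sketch of a stubborn channel over a fair-lossy channel: only the
-- last message per destination is buffered and periodically retransmitted.
-- lossy_send(dest, m) and sleep(t) are assumed primitives of the underlying system.
local buffer = {}

local function sSend(dest, m)
  buffer[dest] = m            -- a newer message makes the previous one obsolete
  lossy_send(dest, m)
end

local function retransmit_loop(period)
  while true do
    sleep(period)
    for dest, m in pairs(buffer) do
      lossy_send(dest, m)     -- eventually delivered over the fair-lossy channel
    end
  end
end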
2.3 Protocol Description
Like most agreement protocols based on the asynchronous distributed system model, the Mutable Consensus protocol uses the rotating coordinator paradigm and proceeds in asynchronous rounds. Each round has two phases. In the first phase of a round, processes try to agree on the coordinator's proposal for the decision. If the coordinator is suspected to have failed, the second phase starts and processes try to agree to retract the current coordinator and proceed to the next round. In both phases agreement is reached as soon as a majority of processes share the same opinion. For each round, each process keeps a list of processes that currently have a similar opinion, which can be either supporting the coordinator on the currently proposed value (phase 1) or retracting the coordinator when suspecting it has failed (phase 2). In the protocol, communication is used to broadcast these lists among the participating processes. Stubborn communication channels handle the communication in such a way that distinct message patterns emerge at the network level, as if the consensus protocol itself mutated.

A first look at the Mutable Consensus protocol hints at a message exchange pattern similar to that of the Early Consensus protocol [11], which is not attractive due to the probable redundancy of the messages' contents and the quadratic complexity of the exchange pattern (all processes send their lists to all). However, from the protocol specification and the stubborn property of the communication channels, it is possible that only a subset of the messages is actually transmitted over the network. Firstly, of the messages sSent by the protocol, only messages with new information are actually broadcast. Then, at the stubborn channel level, not all of those messages are readily sent; they are judiciously delayed in such a way that, in good runs, they become obsolete and end up not being transmitted at all. As introduced in [10] and detailed in the next section, a sensible implementation of the stubborn channels can match the subset of transmitted messages with the minimum set of messages needed to reach consensus. This is achieved by configuring different send delays, which allows the message exchange pattern to be radically altered without ever impacting the protocol's correctness. These configurations are called protocol mutations.

Four mutations have been proposed: early, centralized, ring and gossip. The early mutation assigns a zero delay to each message. This enforces an actual broadcast of each message and the protocol behaves as expected at the higher level. For the other mutations, some messages are sent immediately while the others are delayed for a period of time e, an estimate of the time consensus will take, so that sending those messages is expected to be avoided. Following this idea, in the centralized mutation only messages to and from the coordinator process are immediately sent while the others are delayed. In the ring mutation only messages addressed to the next process (in a logical ring) are immediately sent. Finally, in the gossip mutation, each process has a permutation of the list of all processes and sends each message immediately to f processes (the gossip fanout) while delaying it to the others. Parameter f is configurable and this set of processes changes for each broadcast.
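To make the gossip rule concrete, the fragment below is our own illustrative sketch of how the immediate destinations could be selected (the permutation u, the counter c and the fanout f follow the description above); it is not the Splay code, whose actual mutations are shown later in Listing 3.

-- Illustrative sketch of the gossip mutation's selection rule: for each broadcast,
-- messages to the next f processes in the permutation u are sent immediately;
-- all other messages are delayed by the stubborn channel.
local u = {3, 1, 4, 2, 5}     -- hypothetical permutation of the n = 5 process identifiers
local n, f, c = #u, 2, 0      -- fanout f and a counter c advanced at every broadcast

local function send_immediately(k)
  for l = 0, f - 1 do
    if u[((c + l) % n) + 1] == k then return true end
  end
  return false
end

local function next_broadcast()
  c = c + f                   -- rotate the window of immediate destinations
end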
It is important to notice that delayed messages are never discarded. In fact, if, after e elapses, the subset of messages transmitted was not enough to reach consensus, the Stubborn Channel will send those messages, allowing the protocol to make progress. Moreover, in the original proposal this actually means that all the mutations may degenerate into the early mutation. This observation is discussed in Section 3.2.
3 Mutable Consensus Made Live
A complete implementation of the Mutable Consensus protocol, capable of running in a real large-scale environment, uncovered some challenges not previously considered at the theoretical level. These challenges not only raised practical issues but also led us to propose some changes in the algorithm itself, namely in the definition of the various mutations. The implementation was done using the Splay [8] platform and the Lua programming language. Splay enables the specification of distributed algorithms in a very concise way using Lua, and enables deployment in a number of different testbeds including PlanetLab [2]. The ability to deploy on PlanetLab gives access to a number of nodes not available to us at the laboratory. Moreover, the real environment helps to test the application against different unpredictable network and node failures. After a careful study of the original algorithm, three main challenges arose, namely in the implementation of the core of the protocol, in the implementation of the stubborn channels, and in the achievement of quiescence.

3.1 Mutable Core
Splay is an event-driven framework where processes communicate through remote procedure calls (RPC). To avoid blocking on RPC calls, and because Mutable Consensus is message based, threads are used to parallelize such invocations. This improves the performance of the algorithm and matches the original definition of the protocol. The event loop is started by events.run(f), which invokes function f and waits for incoming events received by means of RPCs. Processes terminate by calling events.exit(). Each process has a list plist containing the identifiers of all n participants and is identified by its position on that list, given by pId. The implementation of Mutable Consensus is presented in Listing 1. It closely resembles that of [10]. Initially, the consensus() function is called, which begins the event loop (lines 3∼11) and calls start. In function start (lines 13∼23), the coordinator, given by ((ri mod n) + 1), initiates the protocol by calling sSend, for all nodes, with the following parameters: its identifier pId, the round ri and phase 1, the list of supporters voters (initially containing only itself) and the estimate est (line 20). sSend is presented later in Listing 2. The protocol then proceeds by exchanging messages, which correspond to sReceive calls. Upon the reception of a message, a process proceeds as follows:
– If its list of supporters, voters, does not yet contain a majority of votes, then it evaluates the received message as follows: if the message comes from a larger round rj, the process jumps to that round and resets its list of supporters (lines 27∼32); if the message belongs to the current round but to a larger phase, then the coordinator has been suspected, and thus the process changes phase and starts collecting a majority of detractors (lines 33∼36); otherwise, if the message belongs to the current round and contains new votes, or it is from phase 1 and already contains a majority of votes endorsing the coordinator's estimate, then the process updates its set of supporters voters and broadcasts it (lines 37∼45).
– If its list of supporters, voters, already contains a majority of votes (lines 47∼56), then, if the phase is 1 (endorsing the coordinator), the process decides and exits; otherwise (phase 2, detracting the coordinator), the process moves to the next round.

Finally, should the coordinator become suspected (lines 59∼67), the process immediately changes to phase 2 and broadcasts its suspicion to force a change of round. Function suspected is invoked by the failure detector module, which is not detailed in this paper.

3.2 Stubborn Channels
From its definition, a Stubborn Channel implementation should be fairly straightforward, but some subtleties nonetheless arose. This section identifies those issues and describes the proposed solutions.

Implementation. A Stubborn Channel requires the primitives sSend(k, m) and sReceive(m), where k is the destination process and m is the message. sReceive (Listing 1) has no special semantics and is given by Splay's RPC mechanism. sSend is presented in Listing 2 (lines 1∼6). The actual send of the message is done in line 6 by remotely invoking, through Splay, the destination's sReceive function. sSend requires two auxiliary functions: delta0/delta, which are responsible for the protocol's mutations, and retransmission, which handles the periodic retransmission of messages to overcome message loss. When a message is sSent, it is buffered in bstate (line 3) and, if delta0 determines so, it is sent immediately to the network (lines 4∼6). Otherwise, it will wait to be handled by the retransmission function. Unlike the original algorithm, which handles the retransmission of messages per channel separately, our implementation, for the sake of scalability, deals with all open channels in batch through a single thread that animates the function retransmission. Periodically, function retransmission() (lines 10∼24) determines, for each destination k, the need to send bstate[k] by means of delta(k). Variables tries and maxtries will be discussed in Section 3.2. The functions delta0(k, m) and delta(k) determine whether a message is to be sent immediately or delayed. delta0(k, m) is used the first time a message is sent
consensus decision = nil
2 3 4 5 6 7 8 9 10 11
function c o n s e n s u s ( v a l u e ) −−G l o b a l S t a t e v o t e r s = Set . new{ } e s t = { v a l=v a l u e , p r o c=p Id } ri = 1 phi = 1 −− s t a r t s e v e n t l o o p e v e n t s . run ( function ( ) s t a r t ( ) end ) end
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
function s t a r t ( ) i f pId == ( ( r i mod n ) + 1 ) then −− t h e c o o r d i n a t o r i n i t i a t e s t h e p r o t o c o l by −− b r o a d c a s t i n g i t s e s t i m a t e v o t e r s = Set . new{ p Id } e s t . p r o c = pId f o r k i n p l i s t do sSend ( k , { pId , r i , p h i , v o t e r s , e s t } ) end end end function s R e c e i v e ( s , { r j , p h j , v o t e r s j , e s t j } ) i f Set . l e n ( p ) <= n/2 then i f r i n / 2 ) i f ( newVotes or m a j o r i t y ) then v o t e r s = Set . u n i o n ( v o t e r s , Set . u n i o n ( v o t e r s j , Set . new{ p Id } ) ) i f e s t j . p r o c == ( ( r i mod n ) + 1 ) then e s t = e s t j end f o r k i n p l i s t do sSend ( k , { pId , r i , p h i , v o t e r s , e s t } ) end end end i f Set . l e n ( v o t e r s ) > n /2 then i f p h i == 1 then c on se n su s d e c i sion = est . val events . e xit () else ri = ri + 1 phi = 1 v o t e r s = Set . new{ } end end end
58 59 60 61 62 63 64 65 66 67
function s u s p e c t e d ( j ) i f j == ( ( r i mod n ) + 1 ) and p h i == 1 then phi = 2 v o t e r s = Set . new{ p Id } for k in p l i s t sSend ( k , { pId , r i , p h i , v o t e r s , e s t } ) end end end
Listing 1. Mutable consensus implementation in Lua
and delta(k) in message retransmission. It is important to notice that delaying a message does not compromise in any way the guarantees given by the Stubborn Channel. By delaying certain messages and immediately sending others, the message exchange pattern is altered. Different implementations of delta0(k, m) and delta(k) yield different mutations. The implementations of the four original mutations are presented in Listing 3.

In the early mutation, Listing 3(a), delta0(k, m) always returns true for useful messages. Useful messages either contain a majority of votes (maj(m)) or are new (fresh(m1, m2)). In this mutation, delta(k) always returns true. This enforces an actual broadcast by having all messages transmitted over the network. The centralized mutation, Listing 3(b), requires a slight change: only messages to and from the coordinator are immediately sent. This enforces a centralized message exchange pattern where processes send all their votes to the coordinator; when the coordinator gathers a majority, it broadcasts the decision. The ring mutation, Listing 3(c), is similar to the previous one, except that only messages to the next process, in a logical ring, are immediately sent. The protocol then acts as if the nodes were physically connected in a ring topology. The gossip mutation, presented in Listing 3(d), intends to offer the high scalability properties of gossip-based protocols, which should allow Mutable Consensus to scale to a large number of nodes. Each process keeps a permutation u of the list of processes. Each time a message is sent, it is immediately transmitted to the next f (fanout) processes in the list. Variable c (Listing 3(d)) is used to vary the list of f destination processes; it is incremented each time a broadcast is invoked. Real runs of the algorithm obtained by configuring the Mutable Consensus protocol with the various mutations are depicted in Figure 1. Each horizontal line represents a process, arrows represent messages, and black dots the decisions.

Mutation Degeneration. After running the protocol several times we observed that the ring and centralized mutations may degenerate into the early mutation. Degeneration means that retransmissions end up being made immediately, without regard to the mutation. In each sSend, messages are divided into two groups: those immediately sent and those that will be delayed. The first ones follow the pattern defined by the mutation, while the others are expected never to be sent, as consensus is reached before defaultDelta (the estimate of the time consensus will take to finish, e in Section 2.3) expires. However, if something goes wrong in the first round, such as node failures or message loss, all delayed messages will be sent at once as a result of the delta(k) implementation, which always returns true for these mutations. This implies degenerating into the early mutation, as depicted in Figure 2(a) for the ring mutation. Similar behavior is observed for the centralized mutation. Essentially, the degeneration happens because after the defaultDelta period all messages are treated equally. Therefore, we modified the delta function to take this issue into account, and to selectively send some messages while further delaying the others. The delta function now becomes similar to delta0 without the
function sSend(k, m)
  t = delta0(k, m)
  bstate[k] = m          -- holds the messages to all destinations
  if t then
    -- invoke sReceive in a separate thread
    events.thread(function() rpc.acall(destination[k], {"sReceive", m}) end)
  end
end

function retransmission()
  return events.thread(function()
    tries = 0
    while true do
      events.sleep(defaultDelta)
      for k in p do
        if delta(k) or tries > maxtries then
          rpc.acall(destination[k], {"sReceive", bstate[k]})
        end
        tries = tries + 1
      end
    end
  end)
end

Listing 2. Stubborn Channels implementation in Lua
1 1 2 3
function d e l t a 0 ( k ,m) r e t u r n ( f r e s h ( b s t a t e [ k ] ,m) or maj (m) ) end
4 5 6 7
2 3 4 5
function d e l t a ( k ,m) r e t u r n true end
(a) Early
6 7 8 9
2 3 4
2 3 4
function d e l t a 0 ( k ,m) r e t u r n ( k == ( ( pId % n ) + 1 ) ) and ( f r e s h ( b s t a t e [ k ] ,m) or maj (m) ) end
5 6 7 8
function d e l t a ( k ,m) r e t u r n true end
(c) Ring
function d e l t a ( k ,m) r e t u r n true end
(b) Centralized 1
1
function d e l t a 0 ( k ,m) c o o r d = ( r i mod n ) + 1 r e t u r n ( k==c o o r d or pId==c o o r d ) and ( f r e s h ( b s t a t e [ k ] ,m) or maj (m) ) end
u = perm ( n ) ; c = 1 ; t u r n = 0 ; f = 4 function d e l t a 0 ( k ,m) return delta (k) end
5 6 7 8 9 10 11 12 13 14 15 16 17
function d e l t a ( k ) turn = turn + 1 i f t u r n == n then c = c+f ; t u r n = 0 ; end local l = 0 while ( l
(d) Gossip
Listing 3. Mutations implementation in Lua
Fig. 1. Prefixes of typical executions: (a) early, (b) centralized, (c) ring, (d) gossip (F=5)
Fig. 2. Mutation degeneration correction: (a) degenerating mutation, (b) mutation preserved
fresh check, since messages being retransmitted are not new, and without the majority check, since the message needs to be retransmitted even if it does not hold a majority. The new delta function for the ring mutation is presented in Listing 4. The changes for the centralized mutation are similar and thus omitted. The process is repeated a finite number of times, after which the mutation must degenerate into the early mutation. In fact, if that were not the case, correctness could be compromised, as some messages would never be sent. To overcome this, we additionally check whether the number of allowed retransmissions (stored in variable maxtries) has been reached (Listing 2, line 17).

On top of these observations, the ring mutation revealed another interesting problem when deployed in real settings. Maintaining the same ring order across rounds is not resilient to message loss: for instance, if the link between two nodes is prone to heavy message loss, consensus would only be reached when the ring mutation degenerates into the early mutation. This can be overcome by changing the ring on every round, simply computing the next process in the ring as ((myposition + ri) % n) + 1, where n is the number of processes and ri the round number.

Quiescence. When a process decides and terminates (Listing 1, lines 48 and 49), the last message it has broadcast corresponds to phase 1, contains a majority of votes in p and the decision value in est. To execute line 48, phi needs to be 1 and |p| > n/2.
function delta(k)
  return (k == ((myposition % n) + 1))
end
Listing 4. Improved function delta for the ring mutation
This can only happen if the process executed line 39 and gathered a majority of votes in p, since any of the other previous conditions in lines 26 or 32 results in p = {}. By the stubborn property of the communication channels, all processes that do not crash will eventually receive this message and will, in turn, decide and terminate if they have not done so yet. Therefore the termination property of consensus is satisfied. However, any process that decides needs to keep retransmitting its last message to ensure it is delivered. This means that, although the consensus instance terminates, the process does not become quiescent. Given the recurrence of the algorithm, this can become a problem, as the buffers of the stubborn channels cannot be discarded and retransmission would go on endlessly. Achieving quiescent reliable communication with common failure detectors would require us to assume an eventually perfect failure detector (♦P) [4], which is stronger than needed to solve consensus and whose properties are much more difficult to attain in practice. To work around this problem, Aguilera et al. [1] have proposed the heartbeat failure detector HB. Roughly, an HB failure detector provides each process with non-decreasing heartbeat counters for all the other processes and ensures that the heartbeat of a correct process is unbounded while that of a crashed process is bounded. Its implementation, in our model, is quite simple: each process periodically sends a heartbeat message to all its neighbors; upon the receipt of such a message from process q, p increases the heartbeat value of q. By combining the output of the HB failure detector with a simple positive acknowledgement protocol between the sSend and sReceive primitives, we made the stubborn communication quiescent.
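A minimal sketch of such a heartbeat failure detector is shown below; it is our own illustrative code written against Splay-like primitives (events, rpc, destination, plist and pId, as used in the paper's listings), not the implementation evaluated in the paper.

-- Illustrative sketch of the HB failure detector described above: each process
-- periodically sends a heartbeat to all other processes and counts the heartbeats
-- it receives; the counter of a crashed process eventually stops growing.
heartbeats = {}                          -- heartbeats[q] = number of beats received from q

function heartbeat(q)                    -- invoked remotely, via RPC, by process q
  heartbeats[q] = (heartbeats[q] or 0) + 1
end

function hb_loop(period)
  return events.thread(function()
    while true do
      events.sleep(period)
      for _, k in ipairs(plist) do
        if k ~= pId then
          rpc.acall(destination[k], {"heartbeat", pId})
        end
      end
    end
  end)
end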
4 Evaluation
We evaluate the implementation of Mutable Consensus in a PlanetLab [2] environment by means of the Splay platform [8]. Splay was chosen because it allows the specification of algorithms in a very concise way using Lua, and enables deployment in several testbeds including PlanetLab. The user simply specifies the number of nodes and Splay deploys the protocol on those nodes. The deployment for a run with 300 nodes is presented in Figure 5. The geographic dispersion and heterogeneous nature of PlanetLab help to test the application against different unpredictable network and node failures. The presented results are the average of 5 runs, where each run represents a new Splay deployment. A centralized logger gathers information about events. Due to asymmetries in nodes and links, events reaching the logger may deviate from the actual run; however, as the results focus on comparisons among runs, the conclusions remain valid. The evaluation focuses on two perspectives: consensus latency and message complexity.

Consensus latency is the time taken for processes to decide. We define two different metrics: Coordinator Latency, which is the time it takes for the coordinator to decide, and Majority Latency, which is the time it takes for a majority of nodes to decide. With respect to latency, the results obtained are depicted in Figure 3. The first remark has to do with the fact that the ring mutation did not produce
Fig. 3. Consensus Latency: (a) Coordinator Latency, (b) Majority Latency

Fig. 4. Number of messages exchanged: (a) number of messages at the coordinator, (b) total number of messages
results in runs with more than 50 nodes. In fact, in such runs the probability of message loss is very high and the ring easily breaks. To allow a fair comparison, results are obtained from good rounds (before retransmissions), where the difference between mutations is clear. From the perspective of the coordinator, the centralized mutation exhibits higher latency than the early one. At first this seems unsettling. In fact, for the coordinator, in both mutations there are only two communication steps before decision: the coordinator's broadcast and then the collection of the processes' votes. In spite of this similar behavior, in the centralized mutation the coordinator has to sequentially handle n/2 messages in all situations, while in the early mutation each node can aggregate votes and propagate them to the coordinator. Considering PlanetLab's link asymmetry, it is likely that faster intermediate nodes propagate messages with a group of votes already gathered, which lowers the coordinator's decision latency. From the perspective of a majority of nodes, the most significant change is the gap between the centralized and early mutations. This is expected, as the centralized mutation needs an extra communication step for the majority of nodes to decide.
Fig. 5. Node geographic distribution
It is important to notice that, both from the perspective of the coordinator and from that of a majority of nodes, the centralized and early mutations show an abrupt increase in latency after runs with 200 nodes. This indicates a system saturation and an impediment to the scalability of the protocol when configured with these mutations. On the other hand, the gossip mutation is virtually unaffected by the increase in the number of nodes. More interestingly, the gossip mutation is able to offer small latencies when compared to the other mutations. This stems from the inherently small number of hops each message needs to take to reach all nodes in an epidemic setting [5], and from the message frugality analysed next.

Message complexity measures the network load each mutation implies. We define two metrics: Coordinator Messages, which is the number of messages sent and received by the coordinator, and Others Messages, which is the total number of messages sent and received by the other nodes, on average. The results are depicted in Figure 4. The ring mutation clearly exchanges fewer messages in both metrics, at the cost of higher latency. From the perspective of the coordinator, the early and centralized mutations have a similar behavior. This is expected, as the coordinator has to receive and send messages from and to all the participants. However, globally, the centralized mutation exchanges a considerably smaller number of messages, at the cost of the extra communication step needed for the coordinator to broadcast its decision. The interesting result is the gossip mutation. This mutation exchanges a more stable number of messages independently of the number of nodes and of the metric. This is actually the key characteristic that enables this mutation to scale to a large number of nodes. As the message exchange is balanced across all the nodes, the overall load is smaller. These results support the claim that the Mutable Consensus protocol is able to adapt to different environments and to scale to a large number of nodes when configured with the gossip mutation.
5 Discussion
This paper described the implementation of the Mutable Consensus protocol in a real environment. The gap between theory and practice became evident as several non-trivial problems emerged. As in [3], those stem mainly from several
simplifications that remained hidden both in the theoretical models and in the simulation tools used. Those issues have been addressed from a practical point of view, but without ever compromising correctness. As relevant additions to the algorithm we point out the avoidance of mutation degenerations and the quiescence of the stubborn communication channels. To the best of our knowledge, this is the first work to analyze the behavior of a consensus protocol in a large-scale hostile environment such as PlanetLab. With the gossip mutation, we have shown that the Mutable Consensus protocol can scale up to 300 nodes without compromising decision latency. This contrasts with the common belief that uniform consensus does not scale. Mutable Consensus is adaptable by design, simply by using different protocol mutations, which makes it an attractive tool for solving consensus in a wide range of environments. Despite being adaptable, Mutable Consensus is not adaptive. However, as it possesses the properties required to build a self-tuning system [9], offering Mutable Consensus as a generic self-contained software package is an important pursuit as future work. This could allow developers to use a modular and generic consensus service, averting the mix of different code components [3].
References
1. Aguilera, M., Chen, W., Toueg, S.: On quiescent reliable communication. SIAM Journal on Computing 29 (2000)
2. Bavier, A., Bowman, M., Chun, B., Culler, D., Karlin, S., Muir, S., Peterson, L., Roscoe, T., Spalink, T., Wawrzoniak, M.: Operating system support for planetary-scale network services. In: Symposium on Networked Systems Design and Implementation, p. 19 (2004)
3. Chandra, T., Griesemer, R., Redstone, J.: Paxos made live: an engineering perspective. In: Symposium on Principles of Distributed Computing, pp. 398–407 (2007)
4. Chandra, T., Toueg, S.: Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43, 225–267 (1996)
5. Eugster, P., Guerraoui, R., Kermarrec, A.-M., Massoulié, L.: From Epidemics to Distributed Computing. IEEE Computer 37(5), 60–67 (2004)
6. Fischer, M., Lynch, N., Paterson, M.: Impossibility of distributed consensus with one faulty process. J. ACM 32(2), 374–382 (1985)
7. Guerraoui, R., Oliveira, R., Schiper, A.: Stubborn Communication Channels. Technical report, EPFL (1998)
8. Leonini, L., Riviere, E., Felber, P.: SPLAY: Distributed Systems Evaluation Made Simple. In: Symposium on Networked Systems Design and Implementation, pp. 185–198 (2009)
9. Matos, M., Pereira, J., Oliveira, R.: Self Tuning with Self Confidence. In: Fast Abstract, International Conference on Dependable Systems and Networks (2008)
10. Pereira, J., Oliveira, R.: The mutable consensus protocol. In: Symposium on Reliable Distributed Systems, pp. 218–227 (2004)
11. Schiper, A.: Early consensus in an asynchronous system with a weak failure detector. Distributed Computing 10, 149–157 (1997)
Transparent Scalability with Clustering for Java e-Science Applications

Pedro Sampaio, Paulo Ferreira, and Luís Veiga

INESC-ID/IST, Technical University of Lisbon, Portugal
[email protected], {paulo.ferreira,luis.veiga}@inesc-id.pt
Abstract. The two-decade long history of events relating object-oriented programming, the development of persistence and transactional support, and the aggregation of multiple nodes in a single-system image cluster, appears to convey the following conclusion: programmers ideally would develop and deploy applications against a single shared global memory space (heap of objects) of mostly unbounded capacity, with implicit support for persistence and concurrency, transparently backed by a possibly large number of clustered physical machines. In this paper, we propose a new approach to the design of OODB systems for Java applications: (O3)2 (pronounced ozone squared). It aims at providing developers with a single-system image of a virtually unbounded object space/heap, with support for object persistence, object querying, transactions and concurrency enforcement, backed by a cluster of multi-core machines with Java VMs that is kept transparent to the user/developer. It is based on an existing persistence framework (ozone-db), and the feasibility and performance of our approach have been validated resorting to the OO7 benchmark.
1 Introduction
A trend has been taking place with the rediscovery of the notion of a single-system image provided by the transparent clustering of distributed OO storage systems (e.g., from Thor [6], with caching and transactions, ca. 1992, to present distributed VM systems such as Terracotta). They allow systems to scale out and overcome the limitations and bottlenecks w.r.t. CPU, memory, bandwidth, availability, scalability, and affordability of employing a single, even if powerful, machine, while attempting to maintain the same abstractions and transparency to the programmers. The two-decade long history of events relating object-oriented programming, the development of persistence and transactional support, and the aggregation of multiple nodes in a single-system image cluster [8], appears to convey the following conclusion: programmers ideally would develop and deploy applications
This work was supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds.
against a single shared global memory space (heap of objects) of mostly unbounded capacity, with implicit support for persistence and concurrency, transparently backed by a possibly large number of clustered physical machines. In fact, today more and more applications are developed resorting to OO languages and execution environments, encompassing common desktop and web applications, commercial business applications on application servers, applications for science and engineering (e.g., architecture, engineering, electronic system design, network analysis, molecular modeling), and even games and virtual simulation environments. This is due to the universality of the programming model and the performance offered by present JIT¹ technology. Such applications essentially maintain, navigate and update object graphs with increasingly larger (main) memory requirements, more than a single machine has available or can manage efficiently. For storage, reliability and sharing purposes, these object graphs also need to be made persistent in a repository.

In this paper, we propose a new approach to the design of OODB systems for Java applications: (O3)2 (pronounced ozone squared). It aims at providing developers with a single-system image of a virtually unbounded object space/heap, with support for object persistence, object querying, transactions and concurrency enforcement, backed by a cluster of multi-core machines with Java VMs that is kept transparent to the user/developer. While embodying some of the principal goals of the original OODB systems (orthogonal persistence, transparency to developers, transactional support), it reprises them in the context of contemporary computing infrastructures (such as cluster, grid and cloud computing), execution environments (namely the Java VM), and application development models, described next. It is based on an existing persistence framework (ozone-db [5]).

The rest of the paper is organized as follows. In the next section, we address the relevant related work in some areas intersecting with our goals. In Section 3, we describe the architecture of (O3)2. Section 4 describes the main implementation details and the performance results obtained with a benchmark from the literature. Section 5 closes the paper with some conclusions and future work.
2 Related Work
OODB systems traditionally designate systems that are simultaneously databases and object-based systems. They provide support for orthogonal (transparent) persistence of object graphs, querying of the object store (usually a single server machine), and frequently object caching. This is achieved without requiring an extra mapping step to a relational database. They also enable navigation through object graphs, type inheritance, polymorphism, etc. Earlier examples include GemStone [3]. Examples of recent work include ozone-db [5] and db4o [7]. They provide transparency and object querying. The main limitation of past and current OODB systems is that they do not offer true single-system image semantics. A repository must fit in its entirety on a single machine; other machines may
¹ Just-in-time compilation.
only be used as backup replicas for fault-tolerance purposes, but the object heap cannot be increased by aggregating the memory of several machines. Akin to DSM systems, distributed object systems were able to aggregate the memory (heaps) of several machines across the network in order to offer applications a shared object space with uniform referencing across process boundaries, together with some runtime services (e.g., micro-transactions, long-running transactions, possibly while disconnected, using cached and replicated objects, and distributed garbage collection). Examples include work on Thor [6], OBIWAN [9], and Sinfonia [1]. These systems provide object persistence and transparency to developers w.r.t. the programming model. However, support for single-system image semantics is not fully provided, since distribution is made known to application developers, who must know where special (root) objects are located in the network. No object querying is supported, only root object look-up. The same approach can be applied to the notion of a VM for an object-oriented language. A distributed VM aggregates the resources of the machines in a cluster and is able to provide, e.g., a Java VM with a larger heap encompassing part (or all) of the individual machines' object heaps. This provides a single-system image with a shared global object space [4], with virtually unbounded memory available to applications. Examples include Jessica [10] and Terracotta (which, despite its success, holds the entire object graph in a coordinator machine and employs the others solely for caching). Persistence is not offered at all or is limited to support for object swapping. Furthermore, no support for object querying is provided.
3 Architecture
In this section, we describe the architecture of (O3)2. It is an extension of an existing middleware, ozone-db [5], chosen because it is open-source and we can leverage some of its properties: persistence in object storage, transparency to developers, who just have to code Java applications, and support for traversal of object graphs using both a programmatic and a declarative, query-based approach (using XML, W3C-DOM, and allowing XPath/XQuery usage). However, ozone-db lacks support for single-system image semantics, i.e., currently an object store must reside fully in a single server machine, and objects cannot be cached outside this central server.

(O3)2 provides single-system image semantics by employing a cluster of machines executing middleware that: i) aggregates the memory of all machines into a global, uniformly addressed object heap; ii) modifies how object references are handled in order to maintain transparency to developers, regardless of where objects are located across the cluster; and iii) manages object allocation and placement in the cluster globally, with support for the inclusion of more specific policies (e.g., caching objects in client machines for disconnection support). We first describe the fundamental aspects of the original ozone-db architecture and then describe the architecture of (O3)2 and the referred mechanisms.

Ozone-db is an open-source object-oriented database project, totally written in Java and aimed at allowing the execution of Java applications that manipulate graphs of persistent objects in a transactional environment (including
optimistic long-running transactions). Ozone-db has a sizable user base of application developers, and numerous e-Science applications have been ported to make use of persistent objects (e.g., [2]). The middleware is completely implemented in Java, portable, and executes on virtually all implementations of the Java VM. An ozone-db database (or object repository) is in essence a server machine that manages and maintains the object repository. With the ozone-db architecture, it is possible to instantiate the server and the client applications on the same machine or on different ones, depending on the computing resources available to the user, the size of the object repository, and the number of applications and application instances. Access to objects stored in the server is mediated by proxy objects, a common approach in most related systems.

The current architecture of ozone-db offers a number of interesting properties but still suffers from important limitations. Mainly, its deployment is limited to a single server machine, which may become a bottleneck in terms of memory, CPU, and I/O bandwidth. A medium-range server machine may have 4 or 8 GB of main memory (with some operating system configurations and architectures, only half of that is available to applications and, for that matter, to the Java VM object heap), one or two quad-core CPUs (with technology such as hyper-threading, the number of concurrent hardware threads can double the number of cores), and several large-capacity hard disks. While for small and medium-size applications such resources may be enough, they quickly become scarce when applications manipulate larger object graphs and/or several applications are executing concurrently. Therefore, it would be advantageous to be able to aggregate the available memory of several server machines for increased scalability, and their extended CPU capability for increased performance. This requires that all interventions be made within the scope of the (O3)2 middleware, without imposing customized Java VMs or modifications to Java application code. This last option might even be infeasible, as applications may be distributed in bytecode format only.

Figure 1 describes a typical scenario of application execution in (O3)2. We highlight the following differences: i) the object graph is distributed in main memory and in storage, partitioned among a group of servers (for simplicity, only three are shown), this being completely transparent to applications, which need not know the server group membership; and ii) a set of heavily accessed objects can reside in local caches at the clients, for improved performance and bandwidth savings (and, additionally, some support for disconnection). In Figure 1, the application, while connected to Server 1, has accessed objects A, B, C and D of the graph with relevant frequency; therefore, these objects are cached at the client in order to improve performance.

The extensions to ozone-db required by the (O3)2 architecture are performed at the following levels, described in the following paragraphs: i) transport, ii) server, and iii) storage, leaving the application interface unchanged for transparency w.r.t. developers. The (O3)2 middleware running at the servers is designed in the following manner. Each server now holds in its main memory only a fraction of the objects
[Figure 1 diagram: a client connected over a LAN to three servers (Server 1, Server 2, Server 3), each with transport and storage layers; objects A–F are partitioned across the servers, and objects A–D are also cached at the client.]
Fig. 1. Typical application in the (O3)2 architecture, with a larger graph of objects at the servers and a subset of objects cached locally
currently in use. The graph of objects is thus scattered across all servers to improve scalability w.r.t. available memory capacity, and to improve performance by employing extra CPUs to perform object invocation. The servers are launched in sequence and join a group before the cluster becomes available for client access. Regardless of the object placement strategy, once a client gets a reference to an object, its proxy directly targets the server where the object is loaded. Two strategies may be adopted for object management and placement:
Coordinated: One of the servers acts as coordinator and holds a primary copy of the metadata in memory, registering object location (indexed by objectID) and locking information (clients can nonetheless be connected to any server, e.g., with some server-side redirection scheme). This information is lazily replicated to the other servers in the cluster. Modifications to this information (namely for locking) are only performed by the primary. The coordinator may trigger migration of subsets of objects among servers; it may decide to keep the memory occupation of all servers leveled or, alternatively, only start to allocate objects in a server when the heaps of the servers currently in use reach certain thresholds.
Decentralized: No server needs to act as coordinator for the metadata. When an object is about to be loaded from the persistent store, its objectID is fed to a hash function that determines the server where it must be placed, and where its metadata will reside. This is a deterministic operation that all servers in the cluster can perform independently. A simple round-robin approach would be correct but utterly inefficient, as it would not leverage any locality of reference. Instead, a tunable parameter in the hashing function decides broadly how many objects created in sequence (i.e., a subset of objects with a very high probability of having references among them) are placed at a server before allocation is performed at another server. When objects are invoked later, this locality will be preserved.
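As an illustration of the decentralized strategy, the following sketch shows one possible way to map an objectID to a server while preserving allocation locality. It is only a sketch in Python: the (O3)2 middleware itself is Java, and the group-size parameter, helper name, and assumption of sequential numeric objectIDs are ours.

    import hashlib

    LOCALITY_GROUP = 1024  # hypothetical: how many sequentially created objects stay together

    def placement_server(object_id, servers):
        # Collapse runs of consecutive objectIDs into one group so that objects
        # likely to reference each other land on the same server.
        group = object_id // LOCALITY_GROUP
        digest = hashlib.md5(str(group).encode()).digest()
        return servers[int.from_bytes(digest[:8], "big") % len(servers)]

    # Objects 0..1023 map to one server; 1024..2047 may map to another.
    cluster = ["server1", "server2", "server3"]
    print(placement_server(5, cluster), placement_server(5000, cluster))

Any node can evaluate such a function independently, which is what makes the strategy coordinator-free.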
4 Implementation Issues and Results
The application interface of ozone-db is unchanged; therefore, applications need not be modified, nor even recompiled. The major aspects addressed are: i) server group management, and ii) object referencing. The (O3)2 middleware running at each server in the cluster includes the new classes OzoneServer and OzoneCluster, which allow each server to reference and communicate with other servers, and to maintain information about the identity and number of servers cooperating in the (O3)2 cluster. Presently, cluster management and fault tolerance operate with the following approach: a designated cluster manager (designated just for these purposes, but which may double as coordinator as described in Section 3) keeps OzoneCluster data updated and forwards notifications to the other servers.
Object Referencing: Object referencing allows servers to redirect accesses to objects loaded in other servers. To avoid performing this redirection repeatedly, after the appropriate server for an object is determined (via the coordinated or decentralized strategies), an object proxy is set up to reference that server directly, without further indirection. In the (O3)2 implementation, an extra step is inserted that triggers the determination of the server where the object is located, according to the specified strategy (others may be developed by extending this behavior).
Evaluation: The evaluation of (O3)2 was performed by executing a known benchmark for OODBs (OO7), with the dimension of objects and the number of references and connections per object increased in order to make execution times longer (topping at roughly 200 seconds). Both the original ozone-db and the (O3)2 architecture were used to execute the benchmark tests in two scenarios: i) single server, and ii) three-node cluster (when testing ozone-db, only one of the machines is actually used as server, the other as a client). The machines used are Intel Core2 Quad with 8 GB RAM and 1 TB HD each, running Linux Ubuntu Server Edition for an extended address space for applications. The tests' purpose is to show that the (O3)2 clustered architecture, while improving scalability and memory capacity, does not introduce significant overhead in application execution, and that it reduces memory usage in the servers. The tests evaluate memory usage at each server and execution time for three OO7 benchmark tests: i) consecutive object creation, ii) complete traversal of an object graph, and iii) traversal of the object graph searching for an object (matching). The test database of OO7 consists of several linked objects in a tree structure. The tree structure has three levels, 2000 or 4000 child objects for the first two levels, and either 40000 or 200000 references among those objects to simulate different object graph densities.
The results in Figure 2 show that total memory usage is similar across the configurations for the create and traversal tests. These tests occupy the most memory, and (O3)2 does not introduce relevant overhead. Note that memory occupation is reduced as servers are added because, with the 3-node (O3)2 cluster, the memory effectively used by each server is roughly a third of the total shown. With ozone-db, all objects are loaded at one of the machines, the other being used only to offload the client application (hence the slightly reduced memory usage). This shows that for
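To make this redirection step concrete, the sketch below shows the general shape of a proxy that resolves its owning server once, through whichever strategy is configured, and then invokes it directly. The class and method names are illustrative only (the actual middleware is Java and builds on ozone-db's own proxy classes).

    class Server:
        def __init__(self, name):
            self.name = name
        def call(self, object_id, method):
            return "%s on object %s executed at %s" % (method, object_id, self.name)

    class ObjectProxy:
        # Illustrative client-side proxy: resolve the owning server lazily
        # (coordinated lookup or decentralized hash), then invoke it directly.
        def __init__(self, object_id, resolver):
            self.object_id = object_id
            self.resolver = resolver
            self.server = None
        def invoke(self, method):
            if self.server is None:              # extra resolution step, done only once
                self.server = self.resolver(self.object_id)
            return self.server.call(self.object_id, method)

    servers = [Server("server1"), Server("server2"), Server("server3")]
    proxy = ObjectProxy(42, lambda oid: servers[oid % len(servers)])
    print(proxy.invoke("traverse"))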
[Figure 2 bar chart: total memory usage in MBytes (0–200) for the create, traversal, and match tests, comparing ozone-db single server and ozone-squared 3-node cluster configurations with 2000 objs./40000 refs., 2000 objs./200000 refs., and 4000 objs./200000 refs.]
Fig. 2. Memory usage tests (total memory used by single node and whole cluster)
[Figure 3 bar chart: execution time in ms (log scale, 1–1,000,000) for the same tests and configurations.]
Fig. 3. Execution time tests
the most memory-intensive tests with (O3)2, the global memory available to applications can indeed be multiplied without any significant overhead at each server instance. In the case of the original ozone-db, when there are 4000 objects and 200000 references it is not possible to execute the application, because the server does not have enough memory. (O3)2 scales, making it possible to execute this test with the increased memory load without the application crashing. The results in Figure 3 show that total execution times for the benchmark tests remain similar across configurations. This demonstrates that (O3)2 management of several servers and distribution/partitioning of object graphs does not introduce any noticeable overhead to application execution times. However, we must bear in mind that the OO7 benchmark is a single-threaded application, so
no speed-up was to be expected. If there are multiple threads in execution and/or multiple applications accessing the database, the extra CPU capability leveraged by (O3)2 will keep processors' load low and increase system throughput, if not reduce individual application execution times.
5 Conclusion
In this paper, we propose a new approach to the design of OODB systems for Java applications: (O3)2 (pronounced ozone squared), which addresses the limitations of previous work in the literature. It provides developers with a single-system image of a virtually unbounded object space/heap with support for object persistence, object querying, transactions, and concurrency enforcement, backed by a cluster of multi-core machines with Java VMs. Transparency regarding developers and their interface with the OODB system is preserved: applications need not be modified nor recompiled. Our approach has been validated by employing a benchmark (OO7) relevant in the literature. Future work includes more refined strategies for object placement (namely based on traces of previous runs of the same application) and addressing the incompleteness and unsoundness of the memory management of persistent stores in ozone-db (based on explicit delete operations).
References
1. Aguilera, M., Merchant, A., Shah, M., Veitch, A., Karamanolis, C.: Sinfonia: a new paradigm for building scalable distributed systems. In: 21st ACM SOSP (2007)
2. Baldwin, R.T.: Views, objects, and persistence for accessing a high volume global data set. In: MSS 2003: Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, p. 77. IEEE Computer Society, Washington, DC, USA (2003)
3. Butterworth, P., Otis, A., Stein, J.: The GemStone object database management system. Communications of the ACM 34(10), 64–77 (1991)
4. Buyya, R., Cortes, T., Jin, H.: Single system image. Int. J. High Perform. Comput. Appl. 15(2) (2001)
5. Braeutigam, F., Mueller, G., Nyfelt, P., Mekenkamp, L.: The ozone-db Object Database System (2002), http://www.ozone-db.org
6. Liskov, B., Day, M., Shrira, L.: Distributed object management in Thor. In: International Workshop on Distributed Object Management, pp. 79–91 (1992)
7. Paterson, J., Edlich, S., Hörning, H., Hörning, R.: The Definitive Guide to db4o (2006)
8. Pfister, G.F.: The varieties of single system image. In: Proceedings of the IEEE Workshop on Advances in Parallel and Distributed Systems, pp. 59–63 (1993)
9. Veiga, L., Ferreira, P.: Incremental replication for mobility support in OBIWAN. In: Proceedings of the 22nd International Conference on Distributed Computing Systems, pp. 249–256 (2002)
10. Zhu, W., Wang, C.-L., Lau, F.C.M.: JESSICA2: a distributed Java virtual machine with transparent thread migration support. In: IEEE Fourth International Conference on Cluster Computing, Chicago, USA (September 2002)
CassMail: A Scalable, Highly-Available, and Rapidly-Prototyped E-Mail Service
Lazaros Koromilas and Kostas Magoutis
Institute of Computer Science (ICS), Foundation for Research and Technology Hellas (FORTH), Heraklion, GR-70013, Greece
{koromil,magoutis}@ics.forth.gr
Abstract. In this paper we present the design and implementation of a scalable e-mail service over the Cassandra eventually-consistent storage system. Our system provides a working implementation of the SMTP and POP3 protocols, and our evaluation shows that it exhibits scalable performance and high availability, and is easily manageable under write-intensive e-mail workloads. The design and implementation of our system are centered around a synthesis of interoperable components for rapid prototyping and deployment. Besides offering a proof of concept of such an approach to prototyping distributed applications, we make two key contributions in this paper: First, we provide a detailed evaluation of the configuration and tuning of the underlying storage engine necessary to achieve scalable application performance. Second, we show that the availability of scalable storage systems such as Cassandra simplifies the design and implementation of higher-level scalable services, especially when compared to the effort expended in projects with similar goals in the past (e.g., Porcupine). We believe that the existence of infrastructural services such as Cassandra brings us closer to the vision of a universal toolbox for rapidly prototyping arbitrary scalable services.
1 Introduction
The large-scale adoption of the Internet as a means of personal communication, societal interaction, and a successful venue for conducting business has raised the need for applications and services that can support Internet-scale communities of users. Cloud computing [1] has recently offered the infrastructural support necessary for small, medium, and large enterprises alike to deploy such services. However, both application developers and Cloud computing providers are in need of the infrastructural services and platforms that can support the scalability requirements of distributed applications. Compounded with the need for scalability is the need for rapid prototyping. Today's planet-scale social networks such as Facebook, and application marketplaces such as AppleStore, have brought applications (and the "garage innovator" behind them [5]) closer to large communities of users. This trend has fueled a race among developers to bring new ideas to market as soon as possible without sacrificing scalability and availability along
the way. The combination of the above needs (scalability and rapid prototyping of application-specific designs) was certainly a challenging undertaking a decade ago. As the core theme of this paper indicates, however, recent advances in scalable infrastructure services have improved on the state of the art.
In this paper we substantiate the above observations by focusing on a specific case study: scalable e-mail services. E-mail is an important application offered as an Internet-accessible service by companies such as Google (Gmail), Yahoo (Yahoo! Mail), and Microsoft (Hotmail), among others. Constructing a scalable, highly-available e-mail service has in the past been performed in a variety of ways, by either statically partitioning users and their data in specific machines [3] or using a general-purpose distributed file system as an underlying scalable store [14,6], or by specifically designing and constructing an entire system from first principles for the targeted application [12]. Whereas the first approach results in simpler systems, experience shows that it suffers from scalability issues, specifically in the areas of load balancing (due to static partitioning) and availability (due to strong consistency built into general-purpose file systems). The latter approach has in the past been shown to address the above issues, however at the cost of significant system engineering to support (i) fine-grain, balanced data partitioning and (ii) a weaker consistency model that matches the semantics of e-mail protocols. The system presented in this paper combines for the first time the best of both approaches: a synthesis of interoperable components resulting in a simpler system, capitalizing on standardized support for (i) and (ii) above.
A key motivation for this paper is the observation that in recent years there has been significant interest in developing, and in many cases open-sourcing, scalable infrastructural services. This activity has culminated in systems that offer storage/file and data APIs with strong [6,13,2] as well as weak [8,9,15] consistency semantics. A prime example of the latter class of storage systems, and the one used in this paper, is Apache Cassandra [9], a scalable, highly-available, eventually-consistent key-value store originally developed by Facebook. The existence of Cassandra prompted us to revisit the question of how one would design and build a scalable e-mail service over an eventually-consistent replicated storage system today. More specifically, we considered the following questions:
1. Does Cassandra significantly reduce the development effort compared to the effort expended in a project with similar goals in the past [12]?
2. Does the resulting system exhibit the scalability and availability expected of a robust scalable e-mail service?
More generally, our work puts novel infrastructural services such as Cassandra into context with past efforts to explore the feasibility and utility of providing high-level abstractions or data structures as the fundamental storage infrastructure. The Boxwood project [10] and the scalable distributed data structure (SDDS) approach of Gribble et al. [7] are two examples of such efforts. We believe that a key missing piece in past proposals is primitives that explore the space of data consistency semantics. This paper shows that Cassandra is such a missing
piece that brings us closer to the vision of a toolbox of universal abstractions to support arbitrary scalable application development. Our contributions in this paper are:
– The design and implementation of a fully functioning, scalable, highly-available e-mail service based on a synthesis of interoperable components (extensible high-level development interfaces, the Cassandra storage system).
– A demonstration of the speed of prototyping that such a software engineering approach allows. Specifically, it took us a few tens of lines of Python code to implement a working prototype, compared to the 41,000 lines of C++ code it took for a system with similar design principles a decade ago [12].
– An evaluation of the configuration and tuning of the underlying storage engine for the targeted application, exhibiting the scalability, availability, and manageability properties of our rapidly-prototyped system.
The rest of the paper is organised as follows. We refer to related work in Section 2. In Section 3 we describe the system architecture and in Section 4 we provide the details of our implementation. In Section 5 we describe the evaluation of our system. In Section 6 we discuss possible optimizations and finally, in Section 7, we conclude.
2 Related Work
The penetration of e-mail into our daily life is such that it is nowadays a mandatory offering by all Internet-scale service providers to their subscribers [4]. Large-scale e-mail services have in the past been implemented in a number of ways. Early distributed e-mail services partitioned e-mail state across a number of storage nodes, but did so with a static partitioning scheme over distributed file systems [3]. Such a scheme is hard to manage (in particular, it is hard to rebalance e-mail across storage nodes in case of failure, or to correct an initially unbalanced partitioning). Cluster-based e-mail services [16] attempted to achieve scalability and availability via database failover schemes, with limited success. Finally, application-specific designs [12] achieved better scalability via the use of hash-based partitioning of user e-mail to storage nodes, optimistic replication [15], and dynamic membership algorithms. Such approaches, however, were complicated and thus met with little practical success in terms of real-world deployment and use.
CassMail shares the basic premise behind systems such as Porcupine [12], namely that the semantics of e-mail protocols are naturally relaxed and users are accustomed to e-mail being occasionally delayed, reordered, reappearing after being deleted, or even lost. Thus it is based on a storage system (Cassandra [9]) utilizing optimistic replication to achieve scalability and high availability. Cassandra is an eventually-consistent storage system, meaning that replicas can temporarily diverge (and clients are allowed to see inconsistent intermediate state) but are guaranteed to eventually converge. Based on a general-purpose eventually-consistent key-value store rather than its own implementation, CassMail leverages a robust
and tested scalable software component and at the same time radically simplifies the overall architecture, focusing on the core logic of the application.
The idea of using foundational storage abstractions to support application-specific services is not new. Boxwood [10] explored the idea that high-level abstractions can facilitate the creation of scalable and fault-tolerant applications. Boxwood proposed a set of components (replicated logical devices, a chunk data store, and a scalable B-tree, along with a set of infrastructure services such as global state management, a lock service, and a transactional service) as a comprehensive toolbox for distributed system design. At a smaller scale, scalable distributed data structures [7] proposed a key-value hash table as another foundational abstraction for the support of scalable applications. A common theme in the above proposals was the assumption of strong consistency semantics (single-copy serializability [11]), which in some cases limits system availability and may constrain applications, especially when their semantics do not strictly require it. Components such as Cassandra extend and enrich the above proposals.
3 Design
We will first give a brief overview of the Cassandra data model and the schema that we designed for CassMail. Cassandra's basic data unit is the column, or block as we will refer to it next. A block consists of a key and a value. Sequences of blocks (of arbitrary number) collectively form a row. Blocks in a row can be ordered in a user-specified manner depending on the key type (for example, in timestamp order). Each row is identified by a separate key. A row is individually placed on a Cassandra storage node based on a consistent hashing scheme [9] described later in this section. Rows are grouped into column families, which are entities akin to relational database tables. Column families are grouped into keyspaces.
Figure 1 displays the schema in which Cassandra stores user and mailbox information. There are two tables, Mailboxes and Users, within a keyspace called Mail. Users is used to validate a user (the origin or destination of an e-mail message) and to find the names of his mailboxes. Each row in Users is keyed by user name. The blocks stored in a row have as their keys the names of the user's mailboxes. Concatenating a username with a mailbox name forms a natural key to a row in the Mailboxes table. The row contains blocks that hold the actual e-mail messages in that user's mailbox. The key for each block is a time-based universally unique identifier (UUID) stamped by the SMTP daemon when the message arrives and is stored. The value of a block is the e-mail message itself. We chose not to fragment a user's mailbox across multiple rows (as, for example, Porcupine does [12]) for two reasons: First, we believe that spreading the load of retrieving a mailbox to several nodes can be achieved by reading different block ranges from different replicas of the row rather than different fragments of the mailbox. Second, we want to avoid hard-to-adjust magic constants (such as the soft limit used by Porcupine [12]) that restrict how far a mailbox spreads across storage nodes.
[Figure 1 schematic of the Mail keyspace:]
Mailboxes:
  "pete:inbox"    timeuuid -> emailmsg0, timeuuid -> emailmsg1, ...
  "pete:outbox"   timeuuid -> emailmsg0, timeuuid -> emailmsg1, ...
  "anne:inbox"    timeuuid -> emailmsg0, timeuuid -> emailmsg1, ...
  "anne:outbox"   timeuuid -> emailmsg0, timeuuid -> emailmsg1, ...
  "anne:project"  timeuuid -> emailmsg0, timeuuid -> emailmsg1, ...
Users:
  "pete"  "inbox", "outbox", ...
  "anne"  "inbox", "outbox", "project", ...
Fig. 1. The Cassandra schema designed for CassMail
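A minimal sketch of how the SMTP delivery path might store a message under this schema is shown below, assuming the pycassa client API used by the prototype (Section 4); the pool parameters, helper names, and error handling are ours, not taken from the paper.

    import uuid
    import pycassa

    pool = pycassa.ConnectionPool('Mail', ['localhost:9160'])
    mailboxes = pycassa.ColumnFamily(pool, 'Mailboxes')
    users = pycassa.ColumnFamily(pool, 'Users')

    def deliver(sender, recipient, message):
        # Append the message to the recipient's inbox and the sender's outbox;
        # the time-based UUID becomes the block key, so blocks sort by arrival time.
        stamp = uuid.uuid1()
        mailboxes.insert('%s:inbox' % recipient, {stamp: message})
        mailboxes.insert('%s:outbox' % sender, {stamp: message})

    def mailbox_names(user):
        # The block keys of a Users row are the names of that user's mailboxes.
        return list(users.get(user).keys())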
[Figure 2 diagram: clients pete and anne connect to a cluster of n functionally identical nodes, each running smtpd, pop3d, and cassandra, arranged on a hash-function ring.]
Fig. 2. System design
Cassandra runs on a cluster of n storage nodes as shown in Figure 2. Each node maps to a specific position on the ring through a hash function. Similarly, each row maps to a position on the ring by hashing its key using the same hash function. Each node is in charge of storing all rows whose keys hash between this node's position and the position of the previous node on the ring. Cassandra members communicate with each other to exchange membership and liveness information through RPC calls. They also communicate when looking for the node in charge of the client's requested data. In our design, each Cassandra node also runs an SMTP and a POP3 server. In this fashion, the cluster consists of functionally identical nodes that can perform any and all functions. This symmetrical configuration underlies the system's scalability and availability properties.
An example of e-mail delivery and retrieval is depicted in Figure 2. In this example anne connects through her mail-submission agent (MSA) to the SMTP server on node 1 to send an e-mail message to pete. (Normally an e-mail goes through the submission, relaying, and delivery steps; without loss of generality, in this discussion we omit relaying and think of the users connecting directly to a system node to submit and then deliver the message.) The SMTP server inspects the message and saves it to pete's inbox and anne's outbox on Cassandra. Now suppose that pete wants to check his messages by accessing node 2. He connects to the POP3 server on node 2 with his mail-retrieval agent (MRA), which first receives the number of e-mails (equal to the number of columns in the pete:inbox row), then asks for message headers, and finally retrieves a number of e-mail messages. POP3 fetches pete's mailbox from Cassandra in fixed-size batches. Each batch would normally go to a different replica of the row to ensure that the read load is balanced across the Cassandra cluster. Eventually pete receives the new message from anne.
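The fixed-size batch retrieval just described could look roughly like the following with pycassa; the batch size, paging logic, and function name are illustrative, not the prototype's actual code.

    def fetch_mailbox(mailboxes, user, mailbox='inbox', batch=50):
        # Yield (timeuuid, message) pairs from one mailbox row, paging by block ranges.
        # Assumes the row exists; pycassa raises NotFoundException otherwise.
        row_key = '%s:%s' % (user, mailbox)
        start = ''                                # empty start means 'from the first block'
        while True:
            columns = mailboxes.get(row_key, column_start=start, column_count=batch)
            items = list(columns.items())
            if start:
                items = items[1:]                 # drop the block repeated from the last page
            if not items:
                break
            for key, value in items:
                yield key, value
            start = items[-1][0]                  # resume after the last block seen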
4 Implementation
In this section we describe our implementations of the SMTP and POP3 servers and the configuration and tuning of the Cassandra system to support our write-intensive e-mail workloads. The set of commands implemented by the SMTP and POP3 servers is shown in Table 1. This is the minimum set that enables Mail User Agents (MUAs) to properly receive and send e-mail.

Table 1. Protocol commands supported by the SMTP and POP3 servers
Server  Supported commands
SMTP    HELO MAIL RCPT DATA RSET NOOP QUIT
POP3    USER PASS STAT LIST UIDL TOP RETR DELE RSET NOOP QUIT

We initially implemented the SMTP and POP3 servers in Ruby using the generic GServer class (http://www.ruby-doc.org/stdlib/libdoc/gserver/rdoc/). We found that GServer handles many low-level management tasks, allowing the developer to focus on the specifics of the SMTP and POP3 protocols. Additionally, the Ruby client for Cassandra (https://github.com/fauna/cassandra) provides a clean, high-level interface that is easy to work with. However, our preliminary experiments showed that the underlying implementation of GServer was not robust enough to successfully pass our stress tests. Given this early experience, we decided to switch to Python's smtpd module and the pycassa client library for Cassandra (https://github.com/pycassa/pycassa), which proved to be a more robust and performant implementation. Our working implementation of the SMTP and POP3 servers consists of a few tens of lines of code that are easy to reason about and extend.
In addition to implementing the POP3 protocol, we extended the implementation to support multiple mailboxes per user (a feature not directly supported by POP3). We achieved this by appending a delimiter to the username, followed by the specific mailbox name, as for example in POPUSER="anne|outbox". Upon receiving such a name, the POP3 server extracts the username and mailbox (using inbox as the default) and uses them to interact with Cassandra. The SMTP and POP3 servers access Cassandra through the Thrift interface (http://incubator.apache.org/thrift/). Thrift transparently handles Cassandra node availability issues, such as failing over to another Cassandra node when the current one appears to have failed. In our implementation we collocate SMTP, POP3, and Cassandra servers on each system node (thus exposing Thrift on the localhost interface). This was a design choice we took to arrive at a homogeneous system in which any node can perform any task. However, it would be a trivial modification to enable the SMTP/POP3 servers to access Cassandra over the network, thus decoupling them into two separate tiers.
Properly configuring Cassandra is key to tuning the system towards specific workloads and environments. Our schema described in Section 3 can be embodied in a very concise description that comprises the name of the keyspace, the column families included, and the information on how to sort the blocks inside them, as shown in the following configuration excerpt:

keyspaces:
    - name: Mail
      replication_factor: 3
      column_families:
        - name: Mailboxes
          compare_with: TimeUUIDType
        - name: Users
          compare_with: BytesType
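Returning briefly to the POPUSER mailbox-selection convention described above, the login parsing on the POP3 side could be as simple as the following; the helper name is ours, while the delimiter and the inbox default follow the text.

    def parse_pop_user(login, delimiter='|'):
        # Split 'anne|outbox' into ('anne', 'outbox'); plain 'pete' defaults to the inbox.
        if delimiter in login:
            username, mailbox = login.split(delimiter, 1)
        else:
            username, mailbox = login, 'inbox'
        return username, mailbox

    # parse_pop_user('anne|outbox') -> ('anne', 'outbox')
    # parse_pop_user('pete')        -> ('pete', 'inbox')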
Another configuration decision is how to partition the logical ring between Cassandra nodes. Our experience with automatic/random node placement on the ring is that it can lead to hot spots. We thus opted for precomputing initial tokens for our nodes, ensuring that each node gets roughly equal key ranges. We used Cassandra's RandomPartitioner (see http://wiki.apache.org/cassandra/StorageConfiguration) to achieve a balanced distribution of row keys on the ring by hashing them with the MD5 hash function. In case of a node failure, surviving nodes detect the failed node and exclude it from their membership list. Cassandra does not automatically repartition the ring to the surviving nodes; this requires manual intervention with the execution of shell commands by an operator. However, the system
(depending on settings described below) can remain available while operating under failure, as our evaluation in Section 5 shows. Two other key Cassandra parameters are the degree of replication (or replication factor) for row data and the level of consistency chosen for reads and writes. Describing the full set of options offered by Cassandra is outside the scope of this paper (see [8,9] for a complete discussion). We will, however, describe the options we exercised during our evaluation to highlight the key tradeoffs involved. In terms of consistency levels, we used the following conditions for acknowledging a write:
ONE. The write must be stored in the memory table and commit log of at least one replica.
ALL. The write must be stored in the memory tables and commit logs of all replicas.
Our implementation uses Cassandra version 0.7.0 (http://cassandra.apache.org/) running on Java 1.6.0_22. We used Ruby version 1.9.2 and Python version 2.6.6.
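In pycassa, these per-write guarantees would typically be selected when the column-family handle is created, roughly as sketched below; the exact parameter spellings are our assumption about the client API of that era, not something stated in the paper.

    import pycassa

    pool = pycassa.ConnectionPool('Mail', ['localhost:9160'])

    # Acknowledge a write once a single replica has it (fast, eventually consistent)...
    mailboxes_one = pycassa.ColumnFamily(
        pool, 'Mailboxes',
        write_consistency_level=pycassa.ConsistencyLevel.ONE)

    # ...or only after every replica has it (stricter, slower under imbalanced nodes).
    mailboxes_all = pycassa.ColumnFamily(
        pool, 'Mailboxes',
        write_consistency_level=pycassa.ConsistencyLevel.ALL)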
5 Evaluation
In this section we report on our experimental results. Our experimental setup consists of a 10-node cluster of dual-CPU AMD Opteron 244 servers with 2GB DRAM, running Linux 2.6.18 and connected through a 1Gbps Ethernet switch using Jumbo (9000-byte) frames. Each node was provisioned with a dedicated logical volume comprising four 80GB SATA disks in a RAID-0 configuration. We used the xfs filesystem on this volume on all nodes. The benchmark used in this study was Postal (http://doc.coker.com.au/projects/postal/), a widely-used benchmark for SMTP servers. Postal operates by repeatedly and randomly selecting and e-mailing a user from a population of users. We created a realistic population of users by using the usernames from our departmental mail server (about 700 users). The Postal client is a multithreaded process that connects to a specific SMTP server. In our experiments we configured Cassandra with different replication factors (1, 2, 3) and consistency levels (ONE and ALL, described in Section 4). We used message sizes drawn uniformly at random from a range of sizes. We experimented with two ranges:
– 200KB–2MB (typical of large attachments)
– 50KB–500KB (typical of small attachments)
These ranges are chosen to reflect the increase in average e-mail size, compared to a related study performed a decade ago [12], due to the widespread use of attachments in everyday communications. Each experiment consists of an e-mail-sending session blasting the CassMail cluster to saturation for about ten minutes. Our measurements are per-minute Postal reports of the sum of e-mail data sent during the previous minute (that is,
e-mail payload, excluding other control/header information). In order to avoid taking into account any bootstrapping overhead, we only consider the last five minutes in our measurements. In all of our graphs we report the aggregate average throughput and standard deviation (as error bars) of our measurements. Each node ran an instance of the SMTP server (Python code) and an instance of the Cassandra server (Java code), with each server consuming one of the two CPUs. In all cases performance is limited by the servers' CPUs. There was also swapping and garbage collection activity taking place during runs. We consider such activities unavoidable (especially when running software in high-level, scripted, and garbage-collected languages such as Python and Java) and a legitimate part of a node's load. We used two dedicated client machines with similar specifications to our servers to drive all experiments. The client machines hosted Postal processes in a setup that balanced load-generation work across the two machines.
[Figure 3 bar chart: aggregate throughput in KB/s (0–60,000) vs. number of nodes (1, 2, 4, 8), for replication factors 1, 2, and 3.]
Fig. 3. Throughput for different replication factors
Our first experiment measures the aggregate write throughput over increasing cluster sizes for messages in the range 200KB–2MB. Our results are depicted in Figure 3. Lighter bars correspond to higher replication factors (1–3). The consistency level is set to ONE in all cases. Increasing the cluster size results in higher aggregate throughput across all replication factors. The performance increase is smaller going from 1 to 2 nodes due to the introduction of Cassandra server-to-server traffic to forward keys to the proper coordinator (since the e-mail clients are unaware of the mapping between user mailboxes and Cassandra nodes). Increasing the replication factor results in decreased throughput by about 5–10% for each extra replica at all cluster sizes, due to the additional traffic necessary to update replicas. We expect this drop to be steeper for stricter consistency levels such as ALL. Note that a replication factor of 2 and 3 does not make sense for cluster sizes of 1 and 2 respectively, explaining the missing bars in Figure 3.
[Figure 4 bar chart: aggregate throughput in KB/s (0–60,000) vs. number of nodes (2, 4, 8), for replication factors 2 and 3 at consistency levels ONE and ALL.]
Fig. 4. Throughput for different consistency levels
[Figure 5 bar chart: aggregate throughput in KB/s (0–60,000) vs. number of nodes (1, 2, 4, 8), for message size ranges 200KB–2MB and 50KB–500KB.]
Fig. 5. Throughput for different message size ranges
Figure 4 depicts the impact of the consistency level on aggregate write throughput with increasing cluster sizes (2–8) and replication factors (2–3). The key observation is that stronger consistency requirements (ALL instead of ONE) degrade performance in all cases. The degradation is more pronounced at larger cluster sizes and is about 35% in the case of 2 replicas (dropping from 50MB/s to 30MB/s) and about 45–50% in the case of 3 replicas. A key factor responsible for this degradation is the large imbalance in the nodes' performance due to background tasks such as Java garbage collection or swapping activity. These
288
L. Koromilas and K. Magoutis
imbalances are largely masked at consistency level ONE but exposed to the clients at consistency level ALL. This observation highlights a key advantage of eventually-consistent storage systems compared to strongly-consistent ones under write-intensive workloads. Previous work [7] has pointed out the adverse impact of garbage collection activity in strongly-consistent storage systems written in Java, namely stalling write operations when one out of a group of replicas freezes while undergoing some background activity. Eventually-consistent systems can hide that stall time by allowing operations to progress at the speed of the fastest replica. We next focus on the impact of message size on aggregate throughput. Figure 5 depicts system performance with increasing cluster size at consistency level ONE and replication factor 1. We observe a performance drop when moving from larger (200KB–2MB) to smaller (50KB–500KB) e-mail messages, which is about constant in absolute value (≈5MB/s) but decreases in relative terms with increasing cluster size. This drop is caused by the higher impact of peroperation overheads (connection setup/teardown, header information generated and processed for each e-mail message, etc.). We next explore the availability of CassMail service when experiencing a node failure under replication factor 2 at different consistency levels (ONE, ALL). Figure 6 depicts aggregate per-minute throughput before (0 –3 ) and after (4 – 10 ) failure for consistency level ONE (squares) and ALL (dots). For this experiment we used an 8-node CassMail cluster in which all nodes run Cassandra servers and only four out of them also run SMTP servers. At minute 3 we inject a crash failure on one of the Cassandra-only nodes. In the case of relaxed consistency (ONE, squares) the node failure has no apparent effect on performance since the surviving replica takes the update (and thus all writes completing successfully without delay) while failover mechanisms (such as hinted handoff [9,8]) are initiated in the background. In the case of consistency-level ALL (dots) we observe a measurable degradation of about 25% in the following minute and immediate recovery of service after that. This happens because some writes cannot get acknowledgments from the failed node and thus temporarily block until the failover mechanism has been activated. In all cases, CassMail can rely on Cassandra to gracefully handle the node failure with minimal or no availability loss, and without operator intervention. Comparing CassMail experimentally to systems with equivalent functionality is hard since to the best of our knowledge no such systems are available in open source. The closest alternative —lacking several of CassMail’s properties— would be a system relying on static partitioning of users over conventional SMTP servers. For the purpose of comparing to such a system we configured its building block (an SMTP server based on Postfix 2.7.1) and used it as a reference point for comparing CassMail’s single-node performance with a mature tighly-configured software system. We did not focus on larger-scale experiments since with static partitioning, client awareness of data location, and no replication one can trivially achieve linear scalability up to the limits of the network. We used the Postal benchmark configured as described earlier and created a mailbox file for each
[Figure 6 line chart: aggregate per-minute throughput in KB/s (0–40,000) over minutes 0–9, for replication factor 2 at consistency levels ONE and ALL.]
Fig. 6. Throughput over time before and after the occurrence of a failure event
user in the server’s file system. Our results show that the Postfix-based server achieves average write throughput of 40MB/s and 25MB/s with large (200KB– 2MB) and small (50KB–500KB) messages respectively, limited by CPU in both cases. This contrasts to CassMail’s single-node write performance of 15MB/s and 10MB/s for large and small messages respectively. The performance difference can be attributed to implementation characteristics: CassMail is written in high-level programming languages and libraries (Python, Java) and combines e-mail protocol processing with significant storage system processing at each server node. Postfix on the other hand, is a mature performance-optimized software system written in C using a lightweight storage stack. We believe that CassMail’s scalability properties can make up for the impact in single-node performance.
6 Discussion and Future Work
We are exploring deployment of CassMail in a Cloud infrastructure [1,5] offering virtual machines (VMs) and local or remotely-mounted storage volumes. In a straightforward deployment scheme, each Cassandra server maps to a VM and each disk to (possibly RAID setups of) local or remotely-mounted storage volumes. Assumptions about failure independence require VMs and storage volumes not to share any single point of failure (such as a physical server). Current Cloud providers hide this level of information from the user, raising a challenge to effective deployment. Our experimental results suggest the use of VMs with considerable CPU (number of cores) and physical memory allocations. In addition, the higher performance, reliability, and predictability of local storage make it a better alternative to remotely-mounted storage for storing Cassandra data.
290
L. Koromilas and K. Magoutis
Our system has proven to be quite robust under intensive experiments but currently lacks some features that are needed for real-world deployment. First, it does not deal with user authentication or data encryption of the messages being transferred. Also, the SMTP server currently receives e-mail but does not relay messages to other mail servers. We believe that these features are straightforward additions to our prototype and we plan to implement them in the near future.
7 Conclusions
Eventually-consistent storage systems have been shown to be key enablers of scalable, highly available, and manageable application services that do not require strict consistency semantics. In particular, Porcupine [12] demonstrated a scalable e-mail service that was based on an eventually-consistent storage system built from scratch. A drawback of such an approach is the complexity and long development effort it requires (Porcupine consists of 14 major components written in C++ with a total of about 41,000 lines of code [12]). The emergence of general-purpose scalable storage services that offer APIs with eventual-consistency semantics, such as Cassandra, raises the opportunity of realizing application services similar to Porcupine at a lower development cost. In this paper we describe the results of a project to build an SMTP/POP service over Cassandra and show that such a service can be simple (consisting of a few tens of lines of Python code focused on the application logic) and thus rapidly implemented. We also show that the implementation exhibits good scalability properties: its throughput increases from 15MB/s to 55MB/s when the cluster size grows from 1 to 8 nodes, and a crash failure of a single node results in minimal to no availability lapse, depending on the level of consistency. These properties also indicate an easy-to-manage system (no need for human intervention in mapping users to storage nodes or for restoring availability at the time of failure), which is a critical characteristic of a system meant to operate at a large scale. A cost for the simplicity of our design is the additional overhead (evidenced by the high CPU usage of the Python SMTP/POP server and the Java Cassandra client/servers) as well as the background activity inherent in scripted and garbage-collected programming environments. However, the combination of technology trends pointing to more cycles in future multi-core CPUs (which can to some extent absorb the higher overheads of high-level language runtimes) and the strength of the eventually-consistent storage model in hiding the effect of slow replicas on write performance paints a positive conclusion: we believe that the synthesis of interoperable (application and storage) components is a viable path to rapidly prototyping robust, scalable systems. In this context, storage systems with general-purpose APIs that explore alternative consistency semantics are important foundational abstractions for building scalable applications.
Acknowledgments
We thankfully acknowledge the support of the European ICT-FP7 program through the SCALEWORKS (MC IEF 237677) and CUMULONIMBO (STREP 257993) projects.
References
1. Armbrust, M., et al.: Above the Clouds: A Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28, UC Berkeley (February 2009)
2. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems (TOCS) 26(2), 1–26 (2008)
3. Christenson, N., Bosserman, T., Beckemeyer, D.: A Highly Scalable Electronic Mail Service using Open Systems. In: Proc. of the USENIX Symposium on Internet Technologies and Systems, Monterey, CA (1997)
4. Ducheneaut, N., Bellotti, V.: E-mail as habitat: an exploration of embedded personal information management. ACM Interactions 8(5), 30–38 (2001)
5. Elson, J., Howell, J.: Handling Flash Crowds from your Garage. In: USENIX 2008 Annual Technical Conference, Boston, MA (2008)
6. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google File System. ACM SIGOPS Operating Systems Review 37(5), 29–43 (2003)
7. Gribble, S., Brewer, E., Hellerstein, J., Culler, D.: Scalable, Distributed Data Structures for Internet Service Construction. In: Proc. of the 4th Conference on Operating System Design & Implementation, San Diego, CA (2000)
8. Hastorun, D., et al.: Dynamo: Amazon's Highly Available Key-Value Store. In: Proc. of the Symposium on Operating Systems Principles, Stevenson, WA (2007)
9. Lakshman, A., Malik, P.: Cassandra: a Decentralized Structured Storage System. ACM SIGOPS Operating Systems Review 44(2), 35–40 (2010)
10. MacCormick, J., Murphy, N., Najork, M., Thekkath, C.A., Zhou, L.: Boxwood: Abstractions as the Foundation for Storage Infrastructure. In: Proc. of the Conference on Operating Systems Design & Implementation, San Francisco, CA (2004)
11. Papadimitriou, C.H.: The serializability of concurrent database updates. J. ACM 26, 631–653 (1979)
12. Saito, Y., Bershad, B.N., Levy, H.M.: Manageability, Availability, and Performance in Porcupine: A Highly Scalable, Cluster-based Mail Service. ACM Transactions on Computer Systems (TOCS) 18(3), 298 (2000)
13. Shvachko, K., et al.: The Hadoop Distributed File System. In: Proc. of the IEEE Conf. on Mass Storage Systems and Technologies, Lake Tahoe, NV (2010)
14. Thekkath, C., Mann, T., Lee, E.: Frangipani: a Scalable Distributed File System. In: Proc. of the 16th ACM Symposium on Operating Systems Principles, Saint Malo, France (1997)
15. Vogels, W.: Eventually Consistent. ACM Queue Magazine (December 2008)
16. Vogels, W., Dumitriu, D., Agrawal, A., Chia, T., Guo, K.: Scalability of the Microsoft Cluster Service. In: Proc. of the 2nd USENIX Windows NT Symposium, Seattle, WA (1998)
Transparent Adaptation of e-Science Applications for Parallel and Cycle-Sharing Infrastructures
João Morais, João Nuno Silva, Paulo Ferreira, and Luís Veiga
INESC-ID / Technical University of Lisbon, Portugal
[email protected], {joao.n.silva,paulo.ferreira,luis.veiga}@inesc-id.pt
Abstract. Grid computing is a concept usually associated with institution-driven networks assembled with a clear purpose, namely to address complex calculation problems or when heterogeneity and users' geographical dispersion are key factors. However, regular home users willing to take advantage of distributed processing cannot regard this as a viable option. Even if Grid access were open to the general public, a home user would not be able to express task decomposition without clearly understanding the program internals. In this work, distributed computation, and cycle-sharing in particular, are addressed in a different manner. Users share idle resources with other users provided that such resources (namely, CPU cycles) are mostly employed to execute already installed applications (e.g., popular commodity applications targeting video compression/transcoding, image processing, ray tracing). Users need not modify an application they already use and trust. Instead, they require only access to an available format description of the application's input/output, in order to allow transparent and automatic decomposition of a job into smaller tasks that may be distributed and executed on cycle-sharing machines.
1 Introduction
In this day and age, a quick observation that can be made about personal computers is that they are either inactive or active with a very small load; this means that most CPU cycles (and the energy spent on them) are wasted without any relevant operation being done. Along with the growth of the Internet, several applications have appeared in recent years that try to leverage these wasted cycles for useful processing of CPU-intensive applications (the best known being SETI@Home). However, most of these projects have a limited scope in how the Grid resources can be used. In most cycle-sharing projects users can only share their resources and are not allowed to run their own applications. In such cases, users are motivated to give their CPU time when there is monetary compensation involved (Plura) or, still more frequently, when the project's purpose is humanity's greater good (e.g., Folding@Home, SETI@Home). An increasing number of home users makes use of commodity (or off-the-shelf) applications that are CPU-intensive and whose performance could be improved if executed in a parallelized manner (namely, for distributed execution employing other users'
This work has been supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds.
idle CPU cycles). Moreover, they tend to form communities around portals, exchanging applications and techniques, and sometimes data (e.g., 3D models). Examples include applications such as those to render photo-realistic images and animations using ray tracing, video transcoding for format conversion, and batch picture processing for photo enhancement, face detection and identification, among others (many of them free or shareware, and widely available and deployed). While most home users will not have access to Grid or cluster infrastructures, they may be able to access some P2P cycle-sharing or utility computing infrastructure (e.g., Amazon EC2). Independently of the access to parallel execution infrastructures, most users are unable to parallelize applications due to one or more of the following reasons: i) applications are distributed in binary form designed for local execution only, ii) unwillingness to execute applications parallelized (i.e., modified) by others than the publishers, and iii) even when source code is available and free, most users lack the necessary coding skills. Thus, to most users, the only available option for parallel execution in distributed scenarios, while employing the unmodified applications users are acquainted with, is through transparent application adaptation.
This work acts upon the aforementioned restrictions: the users either i) do not have access to the application source code, or ii) do not have the knowledge or time to study it in order to parallelize it, or iii) do not trust third parties other than the publishers to modify the application. Since the application will remain unmodified, its adaptation can be performed in only two ways: i) binary code rewriting, which actually modifies application code during run-time and is complex to do for every application (in essence, injecting previously designed parallelized code), or ii) transparent partition of the input data, adaptation of application parameters and input, and finally regrouping and adaptation of the results. The fraction(s) of input to provide each task with, as well as how the partial outputs are aggregated, is determined by instructing the middleware with XML-based format and application descriptors that indicate how to process, partition, and aggregate the files. These descriptors can be written by a user or application provider, or derived from the format syntax; once made available, they can be reused by anyone else. A great advantage of working with smaller inputs occurs when the original file is large and the tasks are deployed to other nodes connected through lower-bandwidth links (as is the case with some home users in P2P scenarios).
The rest of this paper is organized as follows. In the next section, we address work related to ours along with a brief comparative analysis. In Section 3 we describe the middleware architecture to perform application adaptation for execution in cycle-sharing infrastructures, and its main implementation details and results in Section 4. In Section 5, we close the paper with some conclusions and future work.
2 Related Work
In the cluster and grid arena, schedulers (such as Condor [1]) are employed to handle the deployment of jobs on the available hosts with different quality of service, by minimizing total execution time or taking advantage of processors' idle time. Ourgrid [2] aims to allow any user to run Bag-of-Tasks applications over a Grid of federated clusters. Applications are executed over a sandbox and fair resource usage is handled with
a network of favors that rewards users according to their contribution to the Grid. InteGrade [3] has similar goals but works on a client-server model where each job must be routed through the cluster server to other peers. It also allows for MPI application execution over a checkpoint system for handling incomplete jobs. Both these projects have failed to reach major acceptance, since no useful application has yet been ported to work on top of these grid infrastructures. To ease the interaction with grid schedulers, there have been specialized developments targeting user interfaces for task creation (e.g., Ganga [4]). These tools allow the easy creation of parameter-sweep jobs, where each task processes part of an input domain. The partition of parameter intervals, as well as the assignment of different files to different tasks, are possible, but the splitting of large data files into smaller data units is impossible (let alone ensuring the semantic coherency of the partitions). Another approach, APST-DV [5], describes a case study regarding MPEG4 encoding on clusters which has also explored input division techniques, resorting to external programs to do the file splitting/merging. Developing such programs for other data formats and applications would require additional, somewhat redundant, programming effort that most users will not be able to provide. BOINC is a middleware and infrastructure for developing distributed computing applications, allowing the execution of lengthy Bag-of-Tasks problems over the Internet on donated spare computing cycles. However, users are not allowed to execute their own applications, being restricted solely to the role of cycle donors. In distributed computing environments, the data partition is usually a separate and custom-built step. In BOINC, when tasks are being created, the input data must already be partitioned. The project owner is responsible for the development of the code to create the tasks, and for the development of the data-splitting code. In nuBOINC [6], users are allowed to submit new tasks that are executed with commodity applications. Users are also able to interactively define the parameters passed to each task. There have been a number of proposed P2P cycle-sharing infrastructures. Most of them address the execution of complete jobs at a single node and do not attempt to perform automatic input data splitting. In CCOF [7], the application load is distributed by the Wave Scheduler, which organizes the resources by geographic regions in order to take advantage of night periods, which usually have less activity; although a generic P2P infrastructure like ours is suggested, no programming model is available yet. The earlier work described in [8] introduces the notion of a gridlet as a semantics-aware unit of workload division and computational offload. In this paper, the authors describe a complete architecture able to parallelize lengthy jobs that execute commodity applications, along with its implementation and performance evaluation.
3 Middleware Architecture
In this section we describe the main aspects of the proposed middleware architecture, named Ginger. We address: i) the general network and system architecture, ii) application adaptation with file partition and aggregation, and iii) task creation.
Network and System Architecture. Figure 1 depicts a global view of the network and system architecture of Ginger, using a layered approach, featuring applications, gridlets (i.e., tasks, inputs and results), and the networked cycle-sharing infrastructure
Fig. 1. Ginger network and system architecture
providing computational resources. At the top, there are unmodified commodity CPU-intensive applications executed (usually locally) by users. The pictured examples portray image and video rendering and processing, as well as applications usually employed in Grid scenarios to perform calculus, simulation, and modeling related to chemistry, biology, economics, and statistics. The two LEGO cars represent the input and output of a hypothetical CPU-intensive application that processes data in a complex format (e.g., MPEG4 transcoding). In the middle layer, the LEGO blocks colored yellow represent parallel tasks, each with its associated partition of the input data. These tasks are automatically created by the middleware after determining and analyzing the application being invoked (e.g., by its command line), its parameters, and the input file(s) provided. Each task is carried out by executing the intended application at one of the nodes of the cycle-sharing infrastructure. This way, parallelism is extracted not by analyzing application code (assumed to be binary and opaque) but rather by identifying sections or blocks of the input file(s) that can be processed independently, and therefore in a parallel manner. Nonetheless, since the applications are not modified, the partition of the input data fed to each task must also be adapted by the middleware in order to appear as a properly formatted input file (this involves header analysis and structure manipulation). The lower layer illustrates an example cycle-sharing infrastructure where nodes communicate among themselves to discover available resources and installed applications. Each node receiving a gridlet executes the intended unmodified application locally (possibly over a virtual machine or embedded in a virtual appliance) with the transformed input partition assigned to its task, and also with possibly adapted parameters. The middle layer also depicts LEGO blocks colored red that represent the results of the execution of the parallel tasks. These results need to be aggregated by the middleware into a complete output file according to the format description and application semantics (this involves header reconstruction and structure manipulation) and provided to the user. The result file should have no relevant differences (w.r.t. application semantics) from one that would have been produced by a local, serial execution. LEGO blocks representing gridlets have different sizes in order to depict tasks with different costs (CPU, bandwidth, memory) associated with their execution (both estimated and determined after execution). Such cost estimates and measurements can be used to drive resource discovery and a possible reputation/accountancy mechanism.
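To make the gridlet notion more concrete, a gridlet record might carry roughly the following fields. This is an illustrative Python sketch only: the field names and values are ours, not Ginger's actual classes, and the cost vector follows the definition given later in this section.

    from dataclasses import dataclass, field

    @dataclass
    class Gridlet:
        # Illustrative unit of work: an input partition plus everything a remote
        # node needs to run the unmodified application on it.
        gridlet_id: str
        application: str            # e.g., "ffmpeg"
        arguments: list             # adapted command-line parameters
        input_partition: bytes      # re-headered slice of the original input file
        descriptor: str             # XML application/format descriptor shipped along
        cost: dict = field(default_factory=lambda: {
            "cpu": 0.8, "io": 0.2,  # cpu + io = 1, as in the cost vector
            "up_kb": 512, "down_kb": 2048,
        })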
Application Adaptation. At the moment, Ginger must be explicitly invoked through the command line before the application (e.g., ffmpeg) that is being adapted (in practice, this can easily be circumvented by using a customized command shell). Next, the application (its name) is analyzed in order to discover its inputs and outputs, its arguments, and whether other requirements are met. These properties are read from an XML application descriptor (described in [9]; omitted here due to space limitations). This avoids rewriting custom tools for each application, improving middleware extensibility. The next step consists of parsing the file and constructing an auxiliary tree for subsequent transformation. The parsing is done according to an XML format descriptor, which includes a grammar that accepts any file of this format, as well as the operations required to properly split (partition) and merge (aggregate) files of this format. This includes all the necessary header analysis and reconstruction, patching, and structural modifications (e.g., moving blocks across the file).

Tree Manipulation. During input partitioning, the tree being manipulated is generated from the input file, while at result aggregation the tree is generated starting from one of the, possibly many, output files. The transformation operations are provided along with the format descriptor and consist of sequences of CRUD operations (create, insert, update, and delete tree nodes before, after, or between specified elements or tokens). After being transformed, the tree is serialized to a file and encapsulated inside the respective gridlet, along with its ID, the application descriptor, and other required files, and finally sent to the peer that has been allocated to handle this task. After all the replies have been received, the process is reversed: each output file is converted to a tree representation and sent to a transformer, to be aggregated into the final output file. Again, the merging (aggregation) operations are taken from the format descriptor, which also tells the transformer what the final document should look like before the merging starts. The available options are: i) start from an empty output file, ii) start with the result created from the response to the first request, or iii) start with the result created from the response that arrived first. For example, in the case of an AVI file containing an MPEG movie, since the important headers should be the same in every response, we choose to start the final result tree from the first response that arrives. With the final result tree completely built, it is simply a matter of serializing/reassembling the tree back into file form, and the file manipulation process is complete. This is done incrementally on disk by the middleware: there is no need to hold all the output files in memory simultaneously in order to aggregate them; only the result tree is, preferably, kept in memory for better performance.
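As a rough illustration of the descriptor-driven transformation just described, the sketch below applies a sequence of split operations (taken from a format descriptor) to a tree built from the input file, producing one coherent sub-tree per partition. All class and method names here are hypothetical; the real format descriptors, grammar handling, and header patching are considerably richer than this outline.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of descriptor-driven splitting; not Ginger's actual API.
class SplitSketch {

    // A node of the auxiliary tree built by parsing the input file.
    static class Node {
        String name;
        byte[] data;
        List<Node> children = new ArrayList<>();
        Node(String name, byte[] data) { this.name = name; this.data = data; }
    }

    // One transformation step taken from the format descriptor
    // (create/insert/update/delete a node relative to a target element).
    interface Operation {
        void apply(Node root, int currentBlock, int nextBlock);
    }

    // Split the input tree into one coherent sub-tree per task: the tree is
    // copied, then the descriptor's operations are applied with the current
    // and next block indices so headers can be patched for that partition.
    static List<Node> split(Node inputTree, List<Operation> ops, int tasks) {
        List<Node> partitions = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            Node copy = deepCopy(inputTree);
            for (Operation op : ops) {
                op.apply(copy, i, i + 1);
            }
            partitions.add(copy);   // later serialized into a gridlet input file
        }
        return partitions;
    }

    static Node deepCopy(Node n) {
        Node c = new Node(n.name, n.data == null ? null : n.data.clone());
        for (Node child : n.children) c.children.add(deepCopy(child));
        return c;
    }
}
```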
Task Creation. Before executing the splitting operations, the transformer asks the Gridlet Manager (GM) for a list of empty gridlets that will contain the smaller (computation-wise) tasks, including in this request, if required, the maximum number of tasks that can be created from this particular job. The GM first consults the application descriptor in order to rate the total cost of each task. The cost is a vector ⟨CPU, I/O, UpBW, DownBW⟩, with CPU + I/O = 1 and UpBW and DownBW expressed in KB. Knowing this cost, the GM assumes a best-case scenario, choosing the peers that minimize the processing time according to this task cost vector, and tries to allocate a time slot for the task on each of them. Whether the peer allows or denies the allocation, it returns the maximum time that it is willing to concede to this client.
This way, if any peer denies the allocation, the GM can either partition the whole task again or partition only the job that failed the allocation. The lower-level resource discovery mechanism, which informs the GM of suitable peers with available resources and the desired applications/virtual appliances set up, is outside the scope of this paper. Once the allocation is done, the GM creates N empty gridlets, where N is the number of tasks created to handle this job, specifying in each one its offset and size relative to the overall job. Then, for each gridlet, the input tree is copied and the transformer runs the splitting operations over this copy, using the current- and next-block variables so that the splitter knows which nodes to add, remove, or change. In binary formats there are usually variable-sized blocks, with previous blocks holding the actual size of the next block. To avoid having to write updates every time an addition or removal is done on a variable block, we allow a data node to listen for changes on other data nodes or elements in the tree; whenever there is an update on a watched node, the value of the listening node is automatically updated according to an XPath expression.
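The sketch below illustrates how the cost vector and the best-case allocation step described above could be expressed. The scoring function is purely illustrative (the paper does not prescribe one), and the peer interface and class names are assumptions rather than Ginger's actual API.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative only: a task cost vector and a naive best-case peer choice.
class AllocationSketch {

    // <CPU, I/O, UpBW, DownBW>, with cpu + io == 1; bandwidths in KB.
    static class Cost {
        final double cpu, io, upKB, downKB;
        Cost(double cpu, double io, double upKB, double downKB) {
            assert Math.abs(cpu + io - 1.0) < 1e-9;
            this.cpu = cpu; this.io = io; this.upKB = upKB; this.downKB = downKB;
        }
    }

    // Hypothetical view of a peer as seen by the Gridlet Manager.
    interface Peer {
        double cpuSpeed();      // relative CPU speed
        double ioSpeed();       // relative I/O speed
        double upKBps();        // upload bandwidth towards this peer
        double downKBps();      // download bandwidth from this peer
        long requestSlot(double estimatedSeconds);  // returns max time conceded
    }

    // One plausible (assumed) estimate of a task's processing time on a peer.
    static double estimateSeconds(Cost c, Peer p, double workUnits) {
        double compute = workUnits * (c.cpu / p.cpuSpeed() + c.io / p.ioSpeed());
        double transfer = c.upKB / p.upKBps() + c.downKB / p.downKBps();
        return compute + transfer;
    }

    // Best-case scenario: pick the peer minimizing the estimated time,
    // then try to allocate a slot on it (repartitioning if the peer denies).
    static Peer choosePeer(List<Peer> peers, Cost c, double workUnits) {
        return peers.stream()
                .min(Comparator.comparingDouble(p -> estimateSeconds(c, p, workUnits)))
                .orElseThrow(() -> new IllegalStateException("no peers available"));
    }
}
```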
4 Implementation Issues and Evaluation

Figure 2 illustrates the implementation of the main components of Ginger (implemented in Java). The Saxon XPath engine was used to read and process the XML configuration files. Communication between remote components uses both Java RMI (for gridlet exchange over a LAN or to virtual appliances running in utility computing infrastructures) and FreePastry (for gridlet exchange over the peer-to-peer cycle-sharing overlay). The Ginger Client interacts with the user, allowing him to specify the file to be processed, the application to be used, and its parameters. Ginger Workers execute on the remote computers and, after receiving all the information about the tasks to be executed (the file to be processed and the task parameters), execute the intended application. The Receiver and Sender modules are responsible for communication with the Gridlet Manager. These modules interact with their corresponding Worker Handler. The Job Handler Thread Pool guarantees that, depending on the number of available CPUs/cores, the optimal number of concurrent applications is executed and that all pending gridlets are eventually executed.
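A minimal sketch of what a worker-side job handler could look like is shown below: a thread pool sized to the number of available cores runs one instance of the (unmodified) application per pending gridlet as an operating-system process. The command lines, file names, and class names are assumptions for illustration; they are not taken from the Ginger code base.

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative worker-side job handling: one OS process per gridlet,
// with at most 'cores' applications running concurrently.
class WorkerSketch {

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Hypothetical pending gridlets, each already holding its input
        // partition file and the command line of the application to execute.
        List<List<String>> pendingCommands = List.of(
                List.of("ffmpeg", "-i", "partition-0.avi", "out-0.avi"),
                List.of("ffmpeg", "-i", "partition-1.avi", "out-1.avi"));

        for (List<String> command : pendingCommands) {
            pool.submit(() -> {
                ProcessBuilder pb = new ProcessBuilder(command);
                pb.redirectErrorStream(true);
                pb.redirectOutput(new File("worker.log"));
                Process p = pb.start();
                return p.waitFor();   // exit code reported back with the result
            });
        }
        pool.shutdown();
    }
}
```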
Fig. 2. Gridlet Manager
Fig. 3. Execution time of video compression (44.4MB file on the left, 7MB file on the right)
The Gridlet Manager receives the information about the jobs to be executed and creates the necessary gridlets, storing the information about each job in the corresponding GingerHandler. The partitioning is performed by the Partition module, taking into account the input files and the descriptors in an XML file. This module creates a tree representing the file and, taking into account the corresponding XML file-partition descriptor, splits the original tree into several coherent branches. For each produced tree, a gridlet input file is created. After receiving a result, the Worker Handler feeds the resulting file to the Aggregation module to create the complete result file.

Evaluation. As a preliminary evaluation (further results in [9]), we address a video transcoding job (using the ffmpeg application), which is not only CPU-bound but also relatively I/O-intensive, usually accepting and producing large inputs and outputs. We use the AVI format (a complex format with an intricate internal structure) to contain our video, and the partitioning is done on keyframes, which are independent frames (i.e., not represented as differences from previous or successor frames), so we can safely assume that they are good splitting points. We performed the following two tests: i) transcoding of a 44.4MB xvid video to the h263+ codec, and ii) transcoding of a 7MB h264 video to the h263+ codec. To exemplify the computers that would serve as peers providing resources to execute jobs, both in a cycle-sharing infrastructure and in a Grid scenario, we executed the applications on machines with the following characteristics: I) Laptop: Intel Core 2 Duo 2.0GHz, 2GB RAM, 7200RPM disk (main user machine); II) Desktop: AMD 2.2GHz single-processor, 1GB RAM (local network); III) VM allocated in the Department cluster (Sigma): dual Opteron 2.4GHz, AFS (Internet). Regarding the qualitative evaluation of the Ginger middleware architecture, the main result is that the validity and generality of the approach have been confirmed. In fact, users are able to execute unmodified applications as they are used to, and these are transparently adapted by the middleware, which is able to decompose the submitted job into several gridlets to be executed on remote nodes. The applications executed are oblivious to the fact that they are being adapted in order to process only a fraction of the work that carries out the submitted job (see [9]). To obtain representative results, these tests were run over five different configurations, with the gridlet load distribution (Sigma, Desktop, Laptop) varying in each of them as follows: I) (1/3, 1/3, 1/3); II) (2/3, 1/3, 0); III) (1/2, 1/2, 0); IV) (1/3, 2/3, 0); and V) (0, 0, 1). The results are shown in Figure 3. Note that in the results for configurations I-IV, the overall processing and response times are taken from the gridlet that takes
the longest time to produce its result, which may also depend on network latency (many of the results would already be available before that moment, and the file could have been previewed in a media player). Our transformer is not yet optimized: partitioning and tree creation introduce delays. Additional latency is caused by gridlet transfer to remote peers and by accessing files over AFS instead of a local disk. Nonetheless, users are able to transparently leverage available remote cycles to execute several jobs in parallel.
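Since the reported response time is determined by the slowest gridlet, it can be summarized as the maximum, over all gridlets, of transfer plus queuing plus processing time. The tiny sketch below merely states that relationship in code; the field names are hypothetical and no measured values from the tests above are implied.

```java
import java.util.List;

// Response time of a job = completion time of its slowest gridlet.
class ResponseTimeSketch {
    static class GridletTiming {
        long transferMs, queueMs, processMs;   // hypothetical per-gridlet measurements
        GridletTiming(long t, long q, long p) { transferMs = t; queueMs = q; processMs = p; }
    }

    static long responseTimeMs(List<GridletTiming> gridlets) {
        return gridlets.stream()
                .mapToLong(g -> g.transferMs + g.queueMs + g.processMs)
                .max()
                .orElse(0L);
    }
}
```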
5 Conclusions

Open grids are not a new concept, but adapting existing applications without modifying them, and instead exploring the opportunity to perform adaptation on their inputs and outputs, is indeed a novel approach. In this paper we propose a new middleware architecture (Ginger), able to parallelize the execution of unmodified commodity CPU-intensive applications (e.g., video compression/transcoding, image processing, ray tracing) in distributed cycle-sharing scenarios. Thus, users need not modify an application they already use and trust. Applications are transparently adapted by the middleware, driven by format descriptions of the application input/output. Adaptation deals solely with input and output data formats, as well as application parameters. Thus, popular commodity applications such as ffmpeg can be transparently adapted to execute several tasks in parallel over a distributed environment, allowing users to execute jobs remotely, in a parallel fashion, without the need to modify the applications' source or binary code.
References

1. Thain, D., Tannenbaum, T., Livny, M.: Condor and the Grid. In: Berman, F., Fox, G., Hey, T. (eds.) Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons Inc., Chichester (December 2002)
2. Andrade, N., Cirne, W., Brasileiro, F., Roisenberg, P.: OurGrid: An approach to easily assemble grids with equitable resource sharing. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 61–86. Springer, Heidelberg (2003)
3. De Camargo, R.Y., Kon, F.: Design and implementation of a middleware for data storage in opportunistic grids. In: CCGRID 2007: Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, pp. 23–30. IEEE Computer Society, Washington, DC, USA (2007)
4. Egede, U., Harrison, K., Jones, R., Maier, A., Moscicki, J., Patrick, G., Soroko, A., Tan, C.: Ganga user interface for job definition and management. In: Proc. Fourth International Workshop on Frontier Science: New Frontiers in Subnuclear Physics, Laboratori Nazionali di Frascati, Italy (September 2005)
5. van der Raadt, K., Yang, Y., Casanova, H.: Practical divisible load scheduling on Grid platforms with APST-DV. In: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, p. 29b (2005)
6. Silva, J., Veiga, L., Ferreira, P.: nuBOINC: BOINC extensions for community cycle sharing. In: Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshops, SASOW 2008, pp. 248–253 (October 2008)
7. Zhou, D., Lo, V.: Cluster computing on the fly: Resource discovery in a cycle sharing peer-to-peer system. In: IEEE International Symposium on Cluster Computing and the Grid (2004)
8. Veiga, L., Rodrigues, R., Ferreira, P.: GiGi: An ocean of gridlets on a "grid-for-the-masses". In: CCGRID 2007: Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, pp. 783–788. IEEE Computer Society, Washington, DC, USA (2007)
9. Morais, J., Silva, J., Ferreira, P., Veiga, L.: Transparent adaptation of e-science applications for parallel and cycle-sharing infrastructures. INESC-ID Tech. Report 15/2011 (February 2011)
Author Index
Abid, Zied 71
Ali-Eldin, Ahmed 208
Anshus, Otto 194
Ardelius, John 15
Bjørndalen, John Markus 194
Blair, Gordon S. 179
Chabridon, Sophie 71
Chacin, Pablo 122
Conan, Denis 71
Consel, Charles 92
Dalmau, Marc 43
Dowling, Jim 1, 29
El-Ansary, Sameh 208
Ferreira, Paulo 270, 292
Flissi, Areski 77
Fok, Chien-Liang 106
Froihofer, Lorenz 228
Glenstrup, Arne John 165
Goeschka, Karl M. 228
Gohs, Rasmus Sidorovs 165
Grace, Paul 179
Gunnarsson, Sigurður Rafn 165
Hagen, Tor-Magne Stien 194
Hamann, Kristof 150
Haridi, Seif 1
Harwood, Aaron 243, 249
Jakob, Henner 92
Julien, Christine 106
Jun, Taesoo 106
Kapitza, Rüdiger 57
Koromilas, Lazaros 278
Lamersdorf, Winfried 150
Liao, Yang 249
Loriant, Nicolas 92
Louberry, Christine 43
Magoutis, Kostas 278
Maia, Francisco 257
Matos, Miguel 257
Meier, René 57
Mejías, Boris 15
Morais, João 292
Navarro, Leandro 122
Ngo, Cao-Cuong 71
Niazi, Salman 29
Nundloll, Vatsala 179
Oliveira, Rui 214, 257
Ozanne, Alain 71
Payberah, Amir H. 1
Pereira, José 136, 214, 257
Petz, Agoston 106
Ramamohanarao, Kotagiri 249
Renz, Wolfgang 150
Roose, Philippe 43
Roy, Nirmalya 106
Sampaio, Pedro 270
Silva, João Nuno 292
Soares, Luís 136
Söldner, Guido 57
Starnberger, Guenther 228
Stødle, Daniel 194
Sudeikat, Jan 150
Taconet, Chantal 71
van Steen, Maarten 243
Vanwormhoudt, Gilles 77
Veiga, Luís 270, 292
Vilaça, Ricardo 214
Vilenica, Ante 150
Voulgaris, Spyros 243