Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Editorial Board
Ozgur Akan, Middle East Technical University, Ankara, Turkey
Paolo Bellavista, University of Bologna, Italy
Jiannong Cao, Hong Kong Polytechnic University, Hong Kong
Falko Dressler, University of Erlangen, Germany
Domenico Ferrari, Università Cattolica Piacenza, Italy
Mario Gerla, UCLA, USA
Hisashi Kobayashi, Princeton University, USA
Sergio Palazzo, University of Catania, Italy
Sartaj Sahni, University of Florida, USA
Xuemin (Sherman) Shen, University of Waterloo, Canada
Mircea Stan, University of Virginia, USA
Jia Xiaohua, City University of Hong Kong, Hong Kong
Albert Zomaya, University of Sydney, Australia
Geoffrey Coulson, Lancaster University, UK
Pascale Vicat-Blanc Primet Tomohiro Kudoh Joe Mambretti (Eds.)
Networks for Grid Applications Second International Conference, GridNets 2008 Beijing, China, October 8-10, 2008 Revised Selected Papers
Volume Editors Pascale Vicat-Blanc Primet Ecole Normale Supérieure LIP, UMR CNRS - INRIA - ENS, UCBL No. 5668 69364 LYON Cedex 07, France E-mail: [email protected] Tomohiro Kudoh Information Technology Research Institute National Institute of Advanced Science and Technology Tsukuba-Central 2 1-1-1 Umezono Tsukuba Ibaraki 305-8568 Japan E-mail: [email protected] Joe Mambretti Northwestern University International Center for Advanced Internet Research 750 North Lake Shore Drive, Suite 600 Chicago, IL 60611, USA E-mail: [email protected]
Library of Congress Control Number: Applied for
CR Subject Classification (1998): C.2, D.1.3, H.2.4, H.3.4, G.2.2
ISSN 1867-8211
ISBN-10 3-642-02079-8 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-02079-7 Springer Berlin Heidelberg New York
Beijing, China October 8–10, 2008 The Second International Conference on Networks for Grid Applications In cooperation with ACM SIGARCH Sponsored by ICST Technically co-sponsored by Create-Net and EU-IST
The GridNets conference series is an annual international meeting that provides a focused and highly interactive forum where researchers and technologists can present and discuss leading research, developments, and future directions in grid networking. The objective of this event is to serve as both the premier conference presenting the best grid networking research and a forum where new concepts can be introduced and explored. After the great success of last year's GridNets in Lyon, France, which was the first "conference" event, we decided to move GridNets to Beijing, China in 2008. We received 37 papers and accepted 19. For this single-track conference, there were 2 invited keynote speakers, 19 reviewed paper presentations, and 4 invited presentations. The program was supplemented by a workshop on Service-Aware Optical Grid Networks and a workshop on Wireless Grids; both workshops took place on the first day of the conference. Next year's event is already being planned, and it will take place in Athens, Greece. We hope to see you there!
Tomohiro Kudoh Pascale Vicat-Blanc Primet Joe Mambretti
Organization
Steering Committee
Imrich Chlamtac (Chair), CREATE-NET
Pascale Vicat-Blanc Primet, INRIA, ENS-Lyon (France)
Michael Welzl, University of Innsbruck (Austria)

General Co-chairs
Chris Edwards, Lancaster University (UK)
Michael Welzl, University of Innsbruck (Austria)
Junsheng Yu, Beijing University of Posts and Telecommunications (China)

Local Co-chairs
Yuan'an Liu, Beijing University of Posts and Telecommunications (China)
Tongxu Zhang, China Mobile Group Design Institute Co. Ltd. (China)
Shaohua Liu, Beijing University of Posts and Telecommunications (China)

Program Committee Co-chairs
Pascale Vicat-Blanc Primet, INRIA, ENS-Lyon (France)
Joe Mambretti, Northwestern University (USA)
Tomohiro Kudoh, AIST (Japan)

Publicity Co-chairs
Serafim Kotrotsos (for Europe), EXIS IT (Greece)
Xingyao Wu (for Asia), China Mobile Group Design Institute Co. Ltd. (China)
Sumit Naiksatam (for USA), Cisco Systems

Exhibits and Sponsorship Co-chairs
Peng Gao, China Mobile Group Design Institute Co. Ltd. (China)
Junsheng Yu, Beijing University of Posts and Telecommunications (China)

Workshops Chair
Ioannis Tomkos, Athens Information Technology (Greece)

Industry Track Chair
Tibor Kovács, ICST

Panels Chair
Chris Edwards, Lancaster University (UK)

Conference Organization Chair
Gergely Nagy, ICST

Finance Chair
Karen Decker, ICST

Webmaster
Yehia Elkhatib, Lancaster University (UK)

Technical Program Co-chairs
Pascale Vicat-Blanc Primet, INRIA, ENS-Lyon (France)
Joe Mambretti, Northwestern University (USA)
Tomohiro Kudoh, AIST (Japan)

Technical Program Members
Micah Beck, University of Tennessee (USA)
Augusto Casaca, INESC-ID (Portugal)
Piero Castoldi, SSSUP (Italy)
Cees De Laat, University of Amsterdam (Holland)
Chris Develder, Ghent University - IBBT (Belgium)
Silvia Figueira, Santa Clara University (USA)
Gabriele Garzoglio, Fermi National Accelerator Laboratory (USA)
Olivier Glück, INRIA, ENS-Lyon (France)
Paola Grosso, University of Amsterdam (Holland)
Yunhong Gu, University of Illinois (USA)
Wei Guo, Shanghai Jiao Tong University (China)
David Hausheer, University of Zurich (Switzerland)
Guili He, China Telecommunication Technology Lab (China)
Doan Hoang, University of Technology, Sydney (Australia)
Tao Huang, Institute of Software, Chinese Academy of Sciences (China)
Yusheng Ji, NII (Japan)
Raj Kettimuthu, Argonne National Laboratory (USA)
Oh-kyoung Kwon, KISTI (Korea)
Dieter Kranzlmueller, GUP-Linz (Austria)
Francis Lee Bu Sung, Nanyang Technological University (Singapore)
Shaohua Liu, Beijing University of Posts and Telecommunications (China)
Tao Liu, China Mobile Group Design Institute Co. Ltd. (China)
Tiejun Ma, University of Oxford (UK)
Olivier Martin, ICT Consulting (Switzerland)
Katsuichi Nakamura, Kyushu Institute of Technology (Japan)
Marcelo Pasin, INRIA, ENS-Lyon (France)
Nicholas Race, Lancaster University (UK)
Depei Qian, Beijing University of Aeronautics and Astronautics (China)
Dimitra Simeonidou, University of Essex (UK)
Burkhard Stiller, University of Zurich (Switzerland)
Zhili Sun, University of Surrey (UK)
Martin Swany, University of Delaware (USA)
Osamu Tatebe, Tsukuba University (Japan)
Dominique Verchère, Alcatel-Lucent (France)
Jun Wei, Institute of Software, Chinese Academy of Sciences (China)
Chan-Hyu Youn, ICU (Korea)
Wolfgang Ziegler, Fraunhofer-Gesellschaft (Germany)
Wireless Grids Workshop Organizing Committee Frank H.P. Fitzek Marcos D. Katz Geng-Sheng Kuo Jianwei Niu Qi Zhang
A High Performance SOAP Engine for Grid Computing
    Ning Wang, Michael Welzl, and Liang Zhang
UDTv4: Improvements in Performance and Usability
    Yunhong Gu and Robert Grossman
The Measurement and Modeling of a P2P Streaming Video Service
    Peng Gao, Tao Liu, Yanming Chen, Xingyao Wu, Yehia El-khatib, and Christopher Edwards
SCE: Grid Environment for Scientific Computing
    Haili Xiao, Hong Wu, and Xuebin Chi
Automatic Network Services Aligned with Grid Application Requirements in CARRIOCAS Project (Invited Paper)
    D. Verchere, O. Audouin, B. Berde, A. Chiosi, R. Douville, H. Pouyllau, P. Primet, M. Pasin, S. Soudan, T. Marcot, V. Piperaud, R. Theillaud, D. Hong, D. Barth, C. Cadéré, V. Reinhart, and J. Tomasik
Communication Contention Reduction in Joint Scheduling for Optical Grid Computing (Invited Paper)
    Yaohui Jin, Yan Wang, Wei Guo, Weiqiang Sun, and Weisheng Hu
Experimental Demonstration of a Self-organized Architecture for Emerging Grid Computing Applications on OBS Testbed
    Lei Liu, Xiaobin Hong, Jian Wu, and Jintong Lin
Joint Scheduling of Tasks and Communication in WDM Optical Networks for Supporting Grid Computing
    Xubin Luo and Bin Wang
A High Performance SOAP Engine for Grid Computing
Ning Wang¹, Michael Welzl², and Liang Zhang¹
¹ Institute of Software, Chinese Academy of Sciences, Beijing, China [email protected]
² Institute of Computer Science, University of Innsbruck, Austria [email protected]
Abstract. Web Service technology still has many defects that make its usage for Grid computing problematic, most notably the low performance of the SOAP engine. In this paper, we develop a novel SOAP engine called SOAPExpress, which adopts two key techniques for improving processing performance: SCTP data transport and dynamic early binding based data mapping. Experimental results show a significant and consistent performance improvement of SOAPExpress over Apache Axis. Keywords: SOAP, SCTP, Web Service.
1 Introduction
The rapid development of Web Service technology in recent years has attracted much attention in the Grid computing community. The recently proposed Open Grid Service Architecture (OGSA) represents an evolution towards a Grid system architecture based on Web Service concepts and technologies. The new WS-Resource Framework (WSRF) proposed by Globus, IBM and HP provides a set of core Web Service specifications for OGSA. Taken together, and combined with WS-Notification (WSN), these specifications describe how to implement OGSA capabilities using Web Services. However, the low performance of the Web Service engine (SOAP engine) is problematic in the Grid computing context. In this paper, we propose two techniques for improving Web Service processing performance and develop a novel SOAP engine called SOAPExpress. The two techniques are the use of SCTP as a transport protocol, and dynamic early binding based data mapping. We conduct experiments comparing the new SOAP engine with Apache Axis by using the standard WS Test suite. The experimental results show that, for every type of Web Service call tested, SOAPExpress is more efficient than Apache Axis. When handling an echoList Web Service call, SOAPExpress achieves a 56% reduction of the processing time. After a review of related work, we provide an overview of SOAPExpress in section 3, with more details about the underlying key techniques following in section 4. We present a performance evaluation in section 5 and conclude in section 6.
P. Primet et al. (Eds.): GridNets 2008, LNICST 2, pp. 1–8, 2009. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2009
2 Related Work
There have been several studies [3, 4, 5] on the performance of SOAP processing. These studies all agree that the XML-based SOAP protocol incurs a substantial performance penalty in comparison with binary protocols. Davis and Parashar conduct an experimental evaluation of the latency of various SOAP implementations, compared with other protocols such as Java RMI and CORBA [4]. They conclude that two factors may cause the inefficiency of SOAP: the multiple system calls needed to send one logical message, and XML parsing and formatting. A similar conclusion is drawn in [5] by comparing SOAP with CORBA. Chiu et al. point out in [3] that the most critical bottleneck in using SOAP for scientific computing is the conversion between floating point numbers and their ASCII representations.
Recently, various mechanisms have been used to optimize the deserialization and serialization between XML data and Java data. In [1], rather than reserializing each message from scratch, a serialized XML message copy is cached in the sender's stub and reused as a template for the next message of the same type. The approach in [8] reuses the matching regions from previously deserialized application objects, and only performs deserialization for a new region that has not been processed before; however, for large SOAP messages, especially for SOAP messages whose data always changes, the performance improvement of [8] decreases. In addition, [8] uses Java reflection to set and get new values. For large Java objects, especially deeply nested objects, this negatively affects the performance.
The transport protocol is also a factor that can degrade SOAP performance. The traditional HTTP and TCP communication protocols exhibit many defects when used for Web Services, including “Head-Of-Line blocking (HOL)” delay (to be explained in section 4.1), the three-way handshake, ordered data delivery and half-open connections [2].
Some of these problems can be alleviated by using the SCTP protocol [7] instead of TCP. While this benefit was previously shown for similar applications, most notably MPI (see chapter 5 of [6] for a literature overview), to the best of our knowledge, using SCTP within a SOAP engine as presented in this paper is novel.
3 SOAPExpress Overview
As a lightweight Web Service container, SOAPExpress provides an integrated platform for developing, deploying, operating and managing Web Services, and fully reflects the important characteristics of the next generation SOAP engine including QoS and diverse message exchange patterns. SOAPExpress not only supports the core Web Service standards such as SOAP and WSDL, but also inherits the open and flexible design style of Web Service technology because of its architecture: SOAPExpress can easily support different Web Service standards such as WS-Addressing, WS-Security and WS-ReliableMessage. It can also be integrated with the major technology for enterprise applications such as EJB and
JMS to establish a more loosely coupled and flexible computing environment. To enable agile development of Web Services, we also provide development tools as plug-ins for the Eclipse platform.
Fig. 1. SOAPExpress architecture
The architecture of SOAPExpress consists of four parts as shown in Fig. 1:
– Transport protocol adaptor: supports client access to the system through a variety of underlying protocols such as HTTP, TCP and SCTP, and offers the system an abstract SOAP message receiving and sending primitive.
– SOAP message processing module: provides an effective and flexible SOAP message processing mechanism and is able to access the data in the SOAP message in three layers, namely byte stream, XML object and Java object.
– Execution controller: with a dynamic pipeline structure, controls the flow of SOAP message processing such as service identification, message addressing and message exchange pattern management, and supports various QoS modules such as security and reliable messaging.
– Integrated service provider: provides an integrated framework to support different kinds of information sources such as plain Java objects and EJBs, and wraps them into Web Services in a convenient way.
4 Key Techniques
In this section, we present the design details of the key techniques applied in SOAPExpress to improve its performance.
4.1 SCTP Transport
At the transport layer, we use the SCTP protocol [7] to speed up the execution of Web Service calls. This is done by exploiting two of its features: out-of-order delivery and multi-streaming. Out-of-order delivery eliminates the HOL delay of TCP: if, for example, packets 1, 2, 3, 4 are sent from A to B, and packet 1 is lost, packets 2, 3 and 4 arrive at the receiver before the retransmitted (and therefore delayed) packet 1. Then, even if the receiving application could already
use the data in packets 2, 3 and 4, it has no means to access it because the TCP semantics (in-order delivery of a consecutive data stream) prevent the protocol from handing over the content of these packets before the arrival of packet 1. In a Grid, these packets could correspond to function calls, which, depending on the code, might need to be executed in sequence. If they do, HOL blocking delay cannot be avoided; if they do not, the out-of-order delivery feature of SCTP can speed up the transfer in the presence of packet loss. Directly using the out-of-order delivery mechanism may not always be useful, as this would require each function call to be at most as large as one packet, thereby significantly limiting the number and types of parameters that could be embedded. We therefore used the multi-streaming feature, which bundles independent data streams together and allows out-of-order delivery only for packets from different streams. In our example, packets 1 and 3 could be associated with stream A, and packets 2 and 4 could be associated with stream B. The data of stream B could then be delivered in sequence before the arrival of packet 1, thereby speeding up the transfer.
For our implementation, we used a Java SCTP library which is based on the Linux kernel space implementation called “LKSCTP” (see footnote 2). Since our goal was to enable the use of SCTP instead of TCP without requiring the programmer to carry out a major code change, we used Java’s inherent support for the factory pattern as a simple and efficient way to replace the TCP socket with an SCTP socket in existing source code. All that is needed to automatically make all socket calls use SCTP instead of TCP is to call the methods Socket.setSocketImplFactory and ServerSocket.setSocketFactory for the client and server side, respectively.
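The packet example above (packets 1 to 4, with packet 1 lost and retransmitted) can be replayed in a few lines. The following is a minimal sketch in Python, not the SOAPExpress implementation: the function name `delivery_steps` and the stream labels are hypothetical, and the model only tracks which packets may be handed to the application after each arrival under per-stream in-order delivery.

```python
from collections import defaultdict

def delivery_steps(arrivals, stream_of):
    """For each arriving packet, list the packets that become deliverable.

    arrivals:  packet ids in the order they reach the receiver
    stream_of: packet id -> stream label; delivery is in-order per stream
    """
    # Per-stream queue of packet ids in sending order (ascending id).
    pending = defaultdict(list)
    for pkt in sorted(stream_of):
        pending[stream_of[pkt]].append(pkt)

    received = set()
    steps = []
    for pkt in arrivals:
        received.add(pkt)
        queue = pending[stream_of[pkt]]
        delivered = []
        # Hand over the head of this stream as long as it has arrived.
        while queue and queue[0] in received:
            delivered.append(queue.pop(0))
        steps.append(delivered)
    return steps

# Packet 1 is lost and retransmitted, so it arrives last.
arrivals = [2, 3, 4, 1]

# One ordered stream (TCP-like): nothing is usable until packet 1 arrives.
single_stream = delivery_steps(arrivals, {1: "S", 2: "S", 3: "S", 4: "S"})

# Two streams as in the text: packets 1 and 3 on A, packets 2 and 4 on B.
two_streams = delivery_steps(arrivals, {1: "A", 3: "A", 2: "B", 4: "B"})

print(single_stream)  # [[], [], [], [1, 2, 3, 4]]
print(two_streams)    # [[2], [], [4], [1, 3]]
```

With two streams, the data of stream B reaches the application while stream A is still waiting for the retransmission, which is the effect the multi-streaming feature is used for.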
In order to avoid bothering the programmer with the need to determine which SCTP stream a function call should be associated with, we automatically assign socket calls to streams in a round-robin fashion. However, it remains up to the programmer to decide whether function calls can be executed in parallel (in which case they would be associated with multiple streams) or not. To this end, we also provide two methods called StartChunk() and EndChunk(), which can be used to mark parts of data which must be delivered consecutively. All write() calls that are executed between StartChunk() and EndChunk() will cause data to be sent via the same stream.
4.2 Dynamic Early Binding Based Data Mapping
The purpose of the data mapping is to build a bridge between platform-independent SOAP messages and platform-dependent data such as Java objects. The indispensable elements of the data mapping include XML data definitions in an XML schema, data definitions in a specific platform, and the mapping rules between them. Before discussing our data mapping solution, let us first explain two pairs of concepts.
– Early binding and late binding: These differ in when the binding information is obtained and when it is used, as illustrated in Fig. 2. Here the binding information refers to mapping information between XML data and Java data. In early binding, all the binding information is retrieved before performing the binding, while in late binding, the binding is performed as soon as enough binding information is available.
– Dynamic binding and static binding: Here, dynamic binding refers to a binding mechanism which can add new XML-Java mapping pairs at run time. In contrast, static binding refers to a mechanism which can only add new mapping pairs at compilation time.
2 Java SCTP library by I. Skytte Joergensen: http://i1.dk/JavaSCTP/
Fig. 2. Early binding and late binding
According to the above explanation, the existing data binding implementations can be classified into two schemes: dynamic late binding and static early binding. Dynamic late binding obtains the binding information by Java reflection at run time, and then uses it to carry out data binding between XML data and Java data. Dynamic late binding can dynamically add new XML-Java mapping pairs and, by using dynamic features of Java, avoids generating auxiliary code; however, this flexibility is achieved by sacrificing efficiency. Representatives of this scheme are Apache Axis and Castor. For example, Castor uses Java reflection to instantiate a new class added to the XML-Java pairs at run time, and initializes it with the values in the XML through method reflection. Static early binding generates Java template files which record the binding information before running, and then carries out the binding between XML data and Java data at run time. Static early binding (as, e.g., in XMLBeans) improves the performance by avoiding the frequent use of Java reflection. However, new XML-Java mapping pairs cannot be added at run time, which reduces the flexibility.
As illustrated in Fig. 3, we use a dynamic early binding scheme. This scheme establishes the mapping rules between the XML schema for a data type and the Java class for the same data type at compilation time. At run time, a Java template, which we call the Data Mapping Template (DMT), is generated from the XML schema, the Java class and their mapping rules by dynamic code generation techniques. The DMT is used to drive the data mapping procedure. Dynamic early binding avoids Java reflection, so the performance can be distinctly improved. At the same time, the DMT can be generated and managed at run time, which gives dynamic early binding the same flexibility as dynamic late binding. Dynamic early binding thus combines the advantages of static early binding and dynamic late binding.
Fig. 3. Dynamic early binding
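The contrast between reflective mapping and a template generated at run time can be illustrated with a small sketch. It is written in Python rather than Java and is only an analogy under stated assumptions: the `Item` class, the `RULES` table and both function names are hypothetical, `setattr` stands in for Java reflection, and `exec` on generated source stands in for the dynamic code generation that produces the DMT.

```python
from dataclasses import dataclass

@dataclass
class Item:
    id: int = 0
    price: float = 0.0
    name: str = ""

# Hypothetical mapping rules: XML element name -> (attribute name, converter).
RULES = {"id": ("id", int), "price": ("price", float), "name": ("name", str)}

def late_binding_map(cls, xml_fields, rules):
    """Reflective style: look up every rule and attribute for every message."""
    obj = cls()
    for elem, text in xml_fields.items():
        attr, conv = rules[elem]
        setattr(obj, attr, conv(text))
    return obj

def build_template(cls, rules):
    """Early-binding style: generate straight-line mapping code once at run
    time, then reuse the compiled function for every later message."""
    lines = ["def mapper(xml_fields):", "    obj = cls()"]
    for elem, (attr, conv) in rules.items():
        lines.append(f"    obj.{attr} = {conv.__name__}(xml_fields[{elem!r}])")
    lines.append("    return obj")
    namespace = {"cls": cls}
    exec("\n".join(lines), namespace)  # the converters used here are builtins
    return namespace["mapper"]

fields = {"id": "7", "price": "2.5", "name": "widget"}
template = build_template(Item, RULES)   # generated once, like a DMT
print(late_binding_map(Item, fields, RULES))
print(template(fields))
```

Both paths produce the same object; the difference is that the generated function contains no per-field lookups, and a new template can still be built whenever a new mapping pair is registered at run time.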
5 Performance Evaluation
We begin our performance evaluation with a study of the performance improvement from using SCTP with multi-streaming. We executed asynchronous Web Service calls with JAX-WS 2.0 to carry out a simple scalar product calculation, where sequential execution of the individual calls (multiplications) is not necessary. Three PCs were used: a server and a client (identical AMD Athlon 64 X2 Dual-Core 4200 with 2.2 GHz), and a Linux router which interconnected them (an HP Evo W6000 2.4 GHz workstation). At the Linux router, we generated random packet loss with NistNet (see footnote 3). Figure 4 shows the transfer time of these tests with various packet loss ratios. Each result is the average of 100 test runs with the same packet loss setting. As could be expected from tests carried out in a controlled environment, there was no significant divergence between the results of these test runs. For each measurement, we sent 5000 Integer values to the Web Service, which sent 2500 results back to the client. Finally, the client summed the results to yield the scalar product.
Clearly, if SCTP is used as we intended (with unordered delivery between streams and 1000 streams), it outperforms TCP, in particular when the packet loss ratio gets high. SCTP with one stream and ordered behavior is only included in Fig. 4 as a reference value; its performance is not as good as TCP’s because the TCP implementation is probably more efficient (TCP has evolved over many years and, unlike our library, operates at the kernel level). Multiple streams and ordered transmission of packets between streams would theoretically be pointless; surprisingly, the result for this case is better than with only one stream. We believe that this is a peculiarity of the SCTP library that we used.
3 http://snad.ncsl.nist.gov/nistnet/
Fig. 4. Transfer time of TCP and SCTP using the Web Service
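The scalar product workload used in this test is order-independent, which is why it can benefit from unordered delivery between streams. A minimal sketch in Python illustrates the idea; the local `multiply` function is a hypothetical stand-in for one remote Web Service call, and the split of the 5000 values into two vectors of 2500 elements is an assumption inferred from the 2500 returned results.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def multiply(a, b):
    # Stand-in for one asynchronous Web Service call (hypothetical).
    return a * b

def scalar_product(xs, ys):
    """Issue all multiplications independently and sum the results in
    whatever order they complete; the sum does not depend on that order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(multiply, a, b) for a, b in zip(xs, ys)]
        return sum(f.result() for f in as_completed(futures))

# Assumed layout: 5000 integers sent as two vectors of 2500 elements each.
xs = list(range(2500))
ys = list(range(2500, 5000))
result = scalar_product(xs, ys)
print(result)
```

Because addition is commutative, a late result for one pair never has to hold back the results of the other pairs.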
We then evaluated the performance of SOAPExpress as a whole, including SCTP and dynamic early binding based data mapping. We chose the WS Test 1.0 suite to test the time spent on each stage in the SOAP message processing. Several kinds of test cases were carried out, each designed to measure the performance of a different type of Web Service call:
– echoVoid: send/receive a message with empty body.
– echoStruct: send/receive an array of size 20, with each entry being a complex data type composed of an integer, a floating point number and a string.
– echoList: send/receive a linked list of size 20, with each entry being the same complex data type defined in echoStruct.
The experimental settings were: CPU: Pentium-4 2.40 GHz; Memory: 1 GB; OS: Ubuntu Linux 8.04; JVM: Sun JRE 6; Web Container: Apache Tomcat 6.0. The Web Service client performed each Web Service call 10,000 times, and the work load was 5 calls per second. The baseline we chose for comparison is Apache Axis 1.2. Fig. 5 shows the experimental results. For echoStruct and echoList, the XML payload was about 4 KB. The measurement started when a request was received and ended when the response was returned. We observed that for echoVoid, the processing time of the two SOAP engines is very close, since echoVoid has no business logic and just returns the SOAP message with an empty body. For echoStruct, the processing time of SOAPExpress is about 46% of that of Apache Axis, and for echoList, the proportion drops to about 44%. This is a very sound overall performance improvement of SOAPExpress over Apache Axis.
Fig. 5. Performance comparison among different types of Web Service calls
6 Conclusion
In this paper, we presented the SOAPExpress engine for Grid computing. It uses two key techniques for reducing the Web Service processing time: the SCTP transport protocol and dynamic early binding based data mapping. Experiments were conducted to compare its performance with Apache Axis by using the standard WS Test suite. Our experimental results have shown that, no matter what the Web Service call is, SOAPExpress is more efficient than Apache Axis.
Acknowledgments We thank Christoph Sereinig for his contributions. This work is partially supported by the FP6-IST-045256 project EC-GIN (http://www.ec-gin.eu).
References
1. Abu-Ghazaleh, N., Lewis, M.J., Govindaraju, M.: Differential serialization for optimized SOAP performance. In: HPDC 2004: Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, pp. 55–64. IEEE Computer Society Press, Los Alamitos (2004)
2. Bickhart, R.W.: Transparent TCP-to-SCTP translation shim layer. In: Proceedings of the European BSD Conference (2007)
3. Chiu, K., Govindaraju, M., Bramley, R.: Investigating the limits of SOAP performance for scientific computing. In: HPDC 2002: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, p. 246. IEEE Computer Society Press, Los Alamitos (2002)
4. Davis, D., Parashar, M.P.: Latency performance of SOAP implementations. In: CCGRID 2002: Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, Washington, DC, USA, p. 407 (2002)
5. Elfwing, R., Paulsson, U., Lundberg, L.: Performance of SOAP in Web Service environment compared to CORBA. In: APSEC 2002: Proceedings of the Ninth Asia-Pacific Software Engineering Conference, Washington, DC, USA, p. 84. IEEE Computer Society Press, Los Alamitos (2002)
6. Sereinig, C.: Speeding up Java web applications and Web Service calls with SCTP. Master’s thesis, University of Innsbruck (April 2008), http://www.ec-gin.eu
7. Stewart, R.: Stream Control Transmission Protocol. RFC 4960 (September 2007)
8. Suzumura, T., Takase, T., Tatsubori, M.: Optimizing web services performance by differential deserialization. In: ICWS 2005: Proceedings of the IEEE International Conference on Web Services, Washington, DC, USA, pp. 185–192. IEEE Computer Society Press, Los Alamitos (2005)
UDTv4: Improvements in Performance and Usability Yunhong Gu and Robert Grossman National Center for Data Mining, University of Illinois at Chicago
Abstract. This paper presents UDT version 4 (UDTv4), the fourth generation of the UDT high performance data transfer protocol. The focus of the paper is on the new features introduced in version 4 during the past two years to improve the performance and usability of the protocol. UDTv4 introduces a new three-layer protocol architecture (connection-flow-multiplexer) for enhanced congestion control and resource management. The new design allows protocol parameters to be shared by parallel connections and to be reused by future connections. This improves the congestion control and reduces the connection setup time. UDTv4 also provides better usability by supporting a broader variety of network environments and use scenarios.
In fact, UDP has been used in many applications (e.g., Skype), but it is usually customized independently for each application. RTP [15] is a good example, and it has been very successful in supporting multimedia applications. However, there are few general-purpose UDP-based protocols that application developers can use directly or customize easily. UDT, or UDP-based Data Transfer protocol, is an application-level general-purpose transport protocol on top of UDP [8]. UDT addresses a large portion of the requirements of the new applications by seamlessly integrating many modern protocol design and implementation techniques at the application level. The protocol was originally designed for transferring large scientific data sets over high-speed wide area networks, and it has been successful in many research projects. For example, UDT has been used to distribute the 13TB SDSS astronomy data release to astronomers around the world [9]. UDT has been an open source project since 2001, and the first production release was made in 2004. While it was originally designed for big scientific data sets, the UDT library has been used in many other situations, either in its stock form or in a modified form. A great deal of user feedback has been received. The new version (UDTv4), released in 2007, introduces significant changes and supports better performance and usability.
• UDTv4 uses a three-layer architecture to enhance congestion control and reduce connection setup time by sharing control parameters among parallel connections and by using historical data.
• UDTv4 introduces new techniques in both protocol design and implementation to support better scalability; hence it can be used in a larger variety of use scenarios.
This paper describes these new features of UDTv4. Section 2 explains the protocol design. Section 3 describes several key implementation techniques. Section 5 presents the evaluation. Section 6 discusses the related work. Section 7 concludes the paper. Throughout the rest of the paper, we use UDT to refer to the most recent version, UDTv4, unless otherwise explicitly stated.
2 Protocol Design
2.1 Protocol Overview
UDT is a connection-oriented, duplex, and unicast protocol. Its design has 3 logical layers: UDT connection, UDT flow, and UDP multiplexer (Figure 1). A UDT connection is set up between a pair of UDT sockets as a distinct data transfer entity to applications. It can provide either reliable data streaming services or partially reliable messaging services, but not both for the same socket. A UDT flow is a logical data transfer channel between two UDP addresses (IP and port) with a unique congestion control algorithm. That is, a UDT flow is composed of five elements (source IP, source UDP port, destination IP, destination UDP port, and congestion control algorithm). The UDT flow is transparent to applications.
Fig. 1. UDT Connection, Flow, and UDP Multiplexer
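The five-element flow identity can be made concrete with a small lookup table. This is a minimal sketch in Python, not the UDT library's data structure: `FlowKey`, `FlowTable` and the example addresses and algorithm names are hypothetical. It shows connections whose five elements coincide landing on the same flow, while a different congestion control algorithm yields a separate flow.

```python
from typing import NamedTuple

class FlowKey(NamedTuple):
    """The five elements that identify a UDT flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    cc_algorithm: str

class FlowTable:
    """Maps each flow key to the socket ids of the connections it carries."""

    def __init__(self):
        self._flows = {}

    def attach(self, socket_id, key):
        # Reuse the flow if one with the same five elements exists,
        # otherwise create it; each connection joins exactly one flow.
        self._flows.setdefault(key, []).append(socket_id)

    def connections(self, key):
        return list(self._flows.get(key, []))

    def flow_count(self):
        return len(self._flows)

table = FlowTable()
native = FlowKey("10.0.0.1", 9000, "10.0.0.2", 9000, "native")
custom = FlowKey("10.0.0.1", 9000, "10.0.0.2", 9000, "custom")

table.attach(101, native)
table.attach(102, native)   # same five elements: shares the first flow
table.attach(103, custom)   # different congestion control: new flow

print(table.flow_count())         # 2
print(table.connections(native))  # [101, 102]
```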
One or more UDT connections are associated with one UDT flow if the UDT connections share the same five elements described above. Every connection must be associated with one and only one flow. In other words, UDT connections sharing the same five elements are multiplexed over a single UDT flow. A UDT flow provides reliability control as it multiplexes individual packets from UDT connections, while UDT connections provide data semantics (streaming or messaging) management. Different types of UDT connections (streaming or messaging) can be associated with the same UDT flow. Congestion control is also applied to the UDT flow, rather than the connections. Therefore, all connections in one flow share the same congestion control process. Flow control, however, is applied to each connection. Multiple UDT flows can share a single UDP socket/port, and a UDP multiplexer is used to send and dispatch packets for different UDT flows. The UDP multiplexer is also transparent to applications.
2.2 UDP Multiplexing
Multiple UDT flows can bind to a single UDP port, and each packet is differentiated by the destination (UDT) socket ID carried in the packet header. The UDP multiplexing method alleviates the system limitation on the port number space. The number of TCP ports is limited to 65536. In contrast, UDT can support up to 2^32 connections at the same time. UDP multiplexing also helps with firewall traversal: by opening one UDP port, a host can open a virtually unlimited number of UDT connections to the outside.
2.3 Flow Management
UDT multiplexes multiple connections into one single UDT flow if the connections share the same attributes of source IP, source UDP port, destination IP, destination UDP port, and congestion control algorithm. This single flow for multiple connections helps to reduce control traffic, but more importantly, it uses a single congestion control for all connections sharing the same end points.
Fig. 2. UDT Flow and Connection [figure omitted; labels: Connection 1, Connection 2, Buf, Flow Control, Congestion Control, Reliability Control, data, control]

This removes the unfairness caused by using parallel flows and in most situations
it improves throughput because connections in a single flow coordinate with each other rather than compete with each other. As shown in Figure 2, the flow maintains all activities required for a regular data transfer connection, whereas the UDT connection is only responsible for the application interface (connection maintenance and data semantics). At the sender side, the UDT flow reads packets from each associated connection in a round robin manner, assigns each packet a flow sequence number, and sends it out.

2.4 Connection Record Index/Cache

When a new connection is requested, UDT needs to look up whether a flow already exists between the same peers. A connection record index (Figure 3) is used for this purpose. The index is sorted by the peer IP addresses. Each entry records the information between the local host and the peer address, including but not limited to RTT, path MTU, and estimated bandwidth. Each entry may contain multiple sub-entries for different ports, followed by multiple flows differentiated by congestion control (CC).

The connection record index caches the IP information (RTT, MTU, estimated bandwidth, etc.) even if the connection and flow are closed, in which case there is no port associated with the IP entry. This information can be used when a new connection is set up. Its RTT value can be initialized with a previously recorded value; otherwise it would take several ACKs to get an accurate value for the RTT.

Fig. 3. Connection Record Index [figure omitted; labels: IP, Info., Port, Flow/CC]

If
path MTU discovery is used, the MTU information can also be initialized with a historical value. An index entry without an active flow is removed when the maximum length of the index has been reached, and the oldest entry is removed first. Although the cache may be removed very quickly on a busy server (e.g., a web server), the client side may contain the same cache and pass the values to the server. For example, a client that frequently visits a web server may keep the link information between the client and the server, while the server may have already removed it.

2.5 Garbage Collection

When a UDT socket is closed (either by the application or because of a broken connection), it is not removed immediately. Instead, it is tagged as having closed status. A garbage collection thread periodically scans the closed sockets and removes a socket when no API call is accessing it. Without garbage collection, UDT would have needed stronger synchronization protection on its APIs, which increases implementation complexity and adds some slight overhead for the additional synchronization mechanism. In addition, because of the delayed removal, a new socket can reuse a closed socket and the related UDP multiplexer when possible, which improves connection setup efficiency.

Garbage collection also checks the buffer usage and decreases the size of the system allocated buffer if necessary. If less than 50% of the buffer has been used during the last 60 seconds, the buffer is reduced to half its size (a minimum size limit of 32 packets is used so that the buffer is never decreased to a meaninglessly small size).
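The buffer-shrinking rule above can be written down as a small function. This is a minimal sketch, not UDT's actual code: the name shrink_buffer, the packet-based units, and the reading of "less than 50% of the buffer is used" as the peak occupancy seen during the 60-second window are assumptions made for illustration.

```python
MIN_BUFFER_PACKETS = 32  # lower bound stated in the text
USAGE_THRESHOLD = 0.5    # shrink only if usage stayed below 50%


def shrink_buffer(capacity_packets: int, peak_used_packets: int) -> int:
    """Return the buffer capacity after one garbage-collection check.

    peak_used_packets is assumed to be the highest occupancy observed
    during the last observation window (60 seconds in the text).
    """
    if peak_used_packets < USAGE_THRESHOLD * capacity_packets:
        # Halve the buffer, but never go below the minimum size.
        return max(capacity_packets // 2, MIN_BUFFER_PACKETS)
    return capacity_packets
```

Under these assumptions a 1024-packet buffer that never held more than 100 packets shrinks to 512 packets, while a buffer that was more than half full keeps its size.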
3 Implementation

UDT is implemented as an open source project and is available for download from SourceForge.net. The UDT library has been used in both research projects and commercial products. So far 18,000 copies have been downloaded, excluding direct checkout from the CVS and redistribution from other websites.

The UDT implementation is available on both POSIX and Windows systems, and it has been thoroughly tested on Linux 2.4, Linux 2.6, and Windows XP. The code is written in C++, with API wrappers available for other languages. The latest stable version of the UDT library (version 4.2) consists of approximately 11,500 lines of C++ code, including about 4000 semicolons; about 20% of the code is comments.

3.1 Software Architecture

Figure 4 shows the software architecture of the UDT implementation. A global UDT API module dispatches requests from applications to a specific UDT socket. Data transfer for the UDT socket is managed by a UDT flow, while the UDT flow communicates via a UDP multiplexer. One UDP multiplexer can support multiple UDT flows, and one UDT flow can support multiple UDT sockets.

Fig. 4. UDT Software Architecture [figure omitted; modules: UDT API, Global Socket Management, Garbage Collection, Buffer Management, UDT Connection Management (Flow Control), UDT Flow Management (Congestion Control, Reliability Control), UDP Multiplexer (Snd Queue, Rcv Queue), System UDP Socket API]

Fig. 5. Data Flow over a Single UDT Connection [figure omitted; elements: UDT Socket (Flow Ctrl., Snd Buffer, Rcv Buffer), UDT Flow (Cong. Ctrl., Snd Buffer, Snd Loss List, Rcv Loss List), UDP Multiplexer (Snd Queue, Rcv Queue)]

Finally, both the
buffer management module and the garbage collection module work in global space to support resource management. Figure 5 shows the data flow in a single UDT connection. The UDT flow moves data packets from the socket buffer to its own sending buffer and sends the data out via the UDP multiplexer. Control information is exchanged in both directions of the data flow. At the sender side, the UDP multiplexer receives the control information (ACK, NAK, etc.) from the receiver and dispatches the control information to the
corresponding UDT flow or connection. Loss lists are used at both sides to record the lost packets. Loss lists work at the flow level and only record flow sequence numbers. Flow control is applied to a UDT socket, while congestion control and reliability control are applied to the UDT flow.

3.2 UDP Multiplexer and Queue Management

The UDP multiplexer maintains a sending queue and a receiving queue. Each queue manages a set of UDT flows that send or receive packets via the associated UDP port.

The sending queue contains the set of UDT flows that have data to send out. If rate-based control is used, the flows are scheduled according to the next packet sending time; if pure window-based control is used, the flows are scheduled according to a round robin scheme. The sending queue checks the system time and, when it is time to send out the first packet, it removes the first flow from the queue and sends out its packet. If there are more packets to be sent for that flow, the flow is inserted into the queue again according to the next packet sending time determined by rate/congestion/flow control. The sending queue uses a heap structure to maintain the flows. With the heap structure, each send or insert action takes at most log2(n) steps, where n is the total number of flows in the queue. The heap structure guarantees that the sender can find the flow instance with the smallest next scheduled packet sending time; it does not require all the flows to be sorted by the next scheduled time.

The job of the receiving queue is much simpler. It checks the timing events (retransmission timer, keep-alive, timer-based ACK, etc.) for each flow associated with the UDP multiplexer. Every fixed time interval (0.1 second), flows are checked in a round robin manner. However, if a packet arrives for a particular flow, the timers are checked for that flow and the flow is moved to the end of the queue for the next round of checks.
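The rate-based scheduling of the sending queue described above can be sketched with a binary min-heap. This is an illustrative sketch only, not the UDT implementation: the names SendQueue, pop_due, and drain, the use of integer microsecond timestamps, and the fixed per-flow sending interval are all assumptions.

```python
import heapq
import itertools


class SendQueue:
    """Min-heap of flows ordered by next scheduled send time (microseconds)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker for equal send times

    def insert(self, send_time_us, flow):
        # O(log n) insertion, where n is the number of flows in the queue.
        heapq.heappush(self._heap, (send_time_us, next(self._order), flow))

    def pop_due(self, now_us):
        """Remove and return (send_time, flow) for the earliest flow if it is due."""
        if self._heap and self._heap[0][0] <= now_us:
            send_time_us, _, flow = heapq.heappop(self._heap)
            return send_time_us, flow
        return None


def drain(queue, pending, interval_us, now_us):
    """Send every packet that is due by now_us; return flow names in send order."""
    sent = []
    due = queue.pop_due(now_us)
    while due is not None:
        send_time_us, flow = due
        sent.append(flow)  # stands in for handing one packet to the UDP socket
        pending[flow] -= 1
        if pending[flow] > 0:
            # Re-insert the flow at its next scheduled send time.
            queue.insert(send_time_us + interval_us[flow], flow)
        due = queue.pop_due(now_us)
    return sent
```

Only the top of the heap is inspected on each step, which matches the observation in the text that the flows do not need to be fully sorted.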
The receiving queue uses a doubly linked list to store the flows, and each operation takes O(1) time. The receiving side of the UDP multiplexer also maintains a hash table for the associated UDT connections, so that when a packet arrives, the multiplexer can quickly look up the corresponding connection to process the packet. Note that the flow processing handler can be looked up via the socket instance.

3.3 Connection and Flow Management

In the UDT implementation, a flow is a special connection that contains pointers to all connections within the same flow, including itself. The first connection of the flow is set up by the normal 3-way handshake process. Further connections are set up by a simplified 2-way handshake as they join an existing flow. The first connection automatically becomes the flow and manages all the connections. If the current "flow" connection is closed or leaves (because of an IP address change), another connection becomes the flow and the related flow information is moved from the old flow to the new one.

The flow maintains a separate sending buffer in addition to the connections' sending buffers. In an ideal world, the flow should read packets from each connection
in a round robin fashion. However, the flow would then need either to keep track of the source of each packet or to copy the packet into its own buffer, because each ACK or NAK processing needs to locate the original packet. In the current implementation, the socket sending buffer is organized as a linked list of 32-packet blocks. The UDT flow reads one 32-packet block from each connection in round robin fashion, removes the block from the socket's sending buffer, and links the block to its own (flow) sending buffer. Note that there may be fewer than 32 packets in the block if there is not enough data to be sent for a particular connection.

Flow control is enforced at the socket level. The UDT send call is blocked if either the sender buffer limit or the receiver buffer limit is reached. This guarantees that data in the flow sending buffer is not limited by flow control. With this strategy, the flow simply applies ACKs and NAKs to its own buffer and avoids both memory copies between flow and connections and a data structure that maps flow sequence numbers to connection sequence numbers. In the latter case, UDT would also need to check every single packet being acknowledged, because the packets may belong to different connections and may not be contiguous.

At the receiver side, all connections have their own receiver buffer for application data reading. However, only the flow maintains a loss list to recover packet losses.

Rendezvous connection setup. In addition to the regular client/server mode, UDT provides a rendezvous connection setup method. Both peers can connect to each other at (approximately) the same time, provided that they know the peer's address beforehand (e.g., via a third, known server).

3.4 Performance Considerations

Multi-core processing. The UDT implementation uses multiple threads to exploit the multi-core ability of modern processors.
Network bandwidth increases faster than CPU speed, and a single core of today's processors is barely enough to saturate 10Gb/s. One single UDT connection can use 2 cores (sending and receiving) per data traffic direction on each side. Meanwhile, each UDP multiplexer has its own sending thread and receiving thread. Therefore, users can start more UDT threads by binding UDT sockets to different UDP ports: more UDP multiplexers are then started, and each multiplexer starts its own packet processing threads.

New select API. UDT provides a new version of the select API, in which the result socket descriptor set is an independent output, rather than overwriting the input directly. The BSD style select API is inefficient for large numbers of sockets, because the input is modified and applications have to reinitialize the input each time. In addition, UDT provides a way to iterate over the result set; in contrast, with the BSD socket API, applications have to test each socket against the result set.

New sendfile/recvfile API. UDT provides both sendfile and recvfile APIs, which save one memory copy by exchanging data directly between the UDT buffer and the application file. These two APIs also simplify application development in certain cases. It is important to mention that file transfer can operate under both streaming mode and messaging mode. However, messaging mode is more efficient in this case,
because recvfile does not require continuous data block receiving and therefore, in messaging mode, data blocks can be read into files out of order without the "head of line" blocking problem. This is especially useful when the packet loss rate is high.

Buffer auto-sizing. All UDT connections/flows share the same buffer space, which increases when necessary. The UDT socket buffer size is only an upper limit, and UDT does not allocate the buffer until it has to. UDT automatically increases the socket buffer size limit to 2*BDP if the default or user-specified buffer size is less than this value. However, if the default or user-specified value is greater than this value, UDT will not decrease the buffer size. The bandwidth value (B in BDP) is estimated by the maximum packet arrival rate at the receiver side. The garbage collection thread may decrease the system buffers when it detects that less than half of the buffers are used.
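As a worked illustration of the auto-sizing rule, the sketch below computes the buffer limit from a bandwidth estimate and an RTT. It is only a sketch under stated assumptions: the function name auto_size_limit, the units (packets per second and milliseconds), and the rounding up to whole packets are not taken from the UDT source.

```python
def auto_size_limit(configured_packets: int,
                    max_arrival_rate_pps: int,
                    rtt_ms: int) -> int:
    """Return the socket buffer limit in packets.

    BDP (in packets) = arrival rate (packets/s) * RTT (s). The limit is
    raised to 2*BDP when the configured value is smaller, and is never
    lowered by this rule.
    """
    # Integer ceiling of 2 * rate * rtt_ms / 1000, avoiding float rounding.
    two_bdp_packets = (2 * max_arrival_rate_pps * rtt_ms + 999) // 1000
    return max(configured_packets, two_bdp_packets)
```

For example, with an estimated arrival rate of 100,000 packets per second and an RTT of 100 ms, the BDP is 10,000 packets, so a configured limit of 8,192 packets would be raised to 20,000, while a configured limit of 50,000 packets would be left unchanged.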
4 Evaluation

This section evaluates UDT's scalability, performance, and usability. UDT provides superior usability over TCP and, although it is at the application level, its implementation efficiency is comparable to the highly optimized Linux TCP implementation in kernel space. More importantly, UDT effectively addresses many application requirements and fills a gap left by transport layer protocols.

4.1 Performance Characteristics

This section summarizes the performance characteristics of UDT, in particular its scalability.

Packet header size. UDT consumes 24 bytes (16-byte UDT + 8-byte UDP) for data packet headers. In contrast, TCP uses a 20-byte packet header, SCTP uses a 28-byte packet header, and DCCP uses 12 bytes without reliability.

Control traffic per flow. UDT sends one ACK per 0.01 second when there is data traffic. This can be overridden by a user-defined congestion control algorithm if more ACKs are necessary. However, the user-defined ACKs will be lightweight ACKs and consume less bandwidth and CPU [8]. ACK2 packets are generated occasionally, at a decreased frequency (up to 1 ACK2 per second). In contrast, TCP implementations usually send one ACK every one or two segments. In addition, UDT may also send NAKs, message drop requests, or keep-alive packets when necessary, but these packets are much less frequent than ACK and ACK2.

Limit on number of connections. UDT can address up to 2^32 flows and connections, so the maximum number is in practice limited only by system resources.

Multi-threading. UDT starts 2 threads per UDP port, in addition to the application thread. Users can control the number of data processing threads by using a different number of UDP ports.

Summary of data structures. At the UDP multiplexer level, UDT maintains the sending queue and receiving queue. The sending queue costs O(log2 n) time to insert
or remove a flow, where n is the total number of flows. The receiving queue checks timers of each UDT flow every 0.1 second, but it is self-clocked by the arrival of packets. Each check costs O(1) time. Finally, the hash table used by the UDP multiplexer to locate a socket costs O(1) lookup time. The UDT loss list is based on congestion events, and each scan time is proportional to the number of congestion events, rather than the number of lost packets [8].

4.2 Implementation Efficiency

UDT's implementation performance has been extensively tuned. This sub-section lists the CPU usage for one or more data flows between two identical, directly connected Linux servers. Each server runs Debian Linux (kernel 2.6.18) on dual AMD Opteron Dual Core 3.0GHz processors, with 4 GB memory and a 10GE MyriNet NIC. All system parameters are left as default except that the MTU is set to 9000 bytes. No TCP or UDP offload is enabled.

Figure 6 shows the CPU usage of a single TCP, UDP and UDT flow (with or without memory copy avoidance). The total CPU capacity is 400%, because there are 4 cores. Because each flow has a different throughput (varying between 5.4Gb/s for TCP and 7.5Gb/s for UDT with memory copy avoidance), the values listed in Figure 6 are CPU usage per Gb/s of throughput.

According to Figure 6, UDT with memory copy avoidance costs about as much CPU as UDP and less CPU time than TCP. UDT without memory copy avoidance costs approximately double the CPU time of the other three situations. In the case of a single UDT flow without memory copy avoidance, at 7.4Gb/s, the UDT thread and the application thread at the sender side use 99% and 40% CPU, respectively (per thread CPU time is not shown in Figure 6); the UDT thread and the application thread at the receiver side use 90% and 36%, respectively.

Fig. 6. CPU Usage of Single Data Flow [figure omitted; y-axis: CPU usage (percentage per Gb/s), 0 to 20; series: send, recv; groups: TCP, UDP, UDT, UDT w/o mem]
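The per-Gb/s normalization used in Figure 6 can be illustrated with the per-thread figures quoted in the text. The sketch below assumes that the plotted value for one side is the sum of that side's thread CPU percentages divided by the throughput; the paper does not spell out this formula, and the function name cpu_per_gbps is hypothetical.

```python
def cpu_per_gbps(thread_cpu_percent, throughput_gbps):
    """Total CPU percentage of the given threads per Gb/s of throughput."""
    return sum(thread_cpu_percent) / throughput_gbps


# Single UDT flow without memory copy avoidance at 7.4 Gb/s (values from the text).
sender = cpu_per_gbps([99, 40], 7.4)    # UDT thread + application thread
receiver = cpu_per_gbps([90, 36], 7.4)  # UDT thread + application thread
print(round(sender, 1), round(receiver, 1))  # prints "18.8 17.0"
```

Under that assumption the sender uses about 18.8% and the receiver about 17.0% CPU per Gb/s, which falls within the 0 to 20 range of the figure's axis.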
Although memory copy avoidance happens in the application thread, when it is used, it also reduces CPU usage on both the UDT sending and receiving threads because more memory bandwidth is available for the UDT threads and cache hit ratio is also higher. Figure 7 shows the CPU usage (unit value is per Gb/s throughput, the same as in Figure 6) of multiple parallel connections of TCP and UDT. UDT memory copy avoidance is not enabled in these experiments because in the situation of multiple connections, the receiver side memory copy avoidance does not work well (see Section 3.6). The connection concurrency is 10, 100, and 500 respectively for each group (TCP, UDT with all connections sharing a single flow, and UDT with each connection having its own flow).
Fig. 7. CPU Usage of Concurrent Data Flows [figure omitted; y-axis: CPU usage (percentage per Gb/s), 0 to 30; series: send, recv; groups: TCP, UDT single flow, UDT multi flow]
According to Figure 7, the CPU usage of UDT increases slowly as the number of parallel connections increases. The design and implementation scale with connection concurrency, and their cost is comparable to that of the kernel space TCP implementation. Furthermore, the second group (all connections share one flow) costs slightly less CPU than the third group. In the case of multiple flows, the overhead of control packets for UDT increases proportionally to the number of flows, because each flow sends its own control packets.

4.3 Usability

UDT is designed to be a general purpose and versatile transport protocol. The stock form of UDT can be used in regular data transfer. Additionally, the messaging UDT socket can also be used in multimedia applications, RPC, file transfer, web services, etc. UDT is currently used in many real world applications, including data distribution and file transfer (especially of scientific data), P2P applications (both data transfer and system messaging), remote visualization, and so on.
While TCP is mostly used for regular file and data transfer, UDT has the advantage of a richer set of data transfer semantics and congestion control algorithms. Suitable congestion control algorithms can be used in special environments such as wireless networks.

Allowing users to easily modify the congestion control algorithm does not make UDT a source of additional Internet congestion: it has always been trivial to obtain an unfair bandwidth share by using parallel TCP or constant bit rate UDP. In fact, UDT's connection/flow design improves Internet congestion control by removing the unfairness and traffic oscillation caused by applications that start parallel data connections between the same pair of hosts.

The configurable congestion control feature of UDT can actually help network researchers to rapidly implement and experiment with control algorithms. To demonstrate this ability, six new TCP control algorithms (Scalable, HighSpeed, BIC, Westwood, Vegas, and FAST) are implemented in addition to the three predefined algorithms in the UDT release. The implementations of these control algorithms vary between 11 and 73 lines of code [8].

UDT can also be modified to implement other protocols at the application level. An example is to implement forward error correction (FEC) on UDT for low bandwidth, high link error environments.

While UDT is not a completely modularized framework like CTP [4] due to performance considerations, it still provides high configurability (congestion control, user defined packets, user controllable ACK intervals, etc.). It is also much easier to modify UDT than to modify a kernel space TCP implementation. Moreover, there are fewer limitations on deployment and protocol standardization. Finally, UDT also offers better support for firewall traversal (e.g., NAT punching) through UDP multiplexing and rendezvous connection setup.
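To give a sense of why a control algorithm can fit in a few dozen lines, the sketch below shows a callback-style interface with a simple additive-increase/multiplicative-decrease rule. It is purely illustrative: the class names, the on_ack and on_loss callbacks, and the window arithmetic are hypothetical and are not the UDT library's actual API or one of the algorithms listed above.

```python
class CongestionControl:
    """Hypothetical callback interface; NOT the UDT library's real API."""

    def __init__(self):
        self.cwnd = 2.0  # congestion window, in packets

    def on_ack(self, acked_packets: int) -> None:
        """Called when an ACK covering acked_packets packets arrives."""

    def on_loss(self) -> None:
        """Called when a loss report (e.g., a NAK) arrives."""


class Aimd(CongestionControl):
    """Additive increase, multiplicative decrease, as a minimal example."""

    def on_ack(self, acked_packets: int) -> None:
        # Grow by roughly one packet per window's worth of acknowledgements.
        self.cwnd += acked_packets / self.cwnd

    def on_loss(self) -> None:
        # Halve the window on loss, but keep at least one packet in flight.
        self.cwnd = max(self.cwnd / 2, 1.0)
```

A transport would call these handlers from its ACK and loss processing paths; swapping in another algorithm then means overriding the same two methods.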
5 Related Work

While transport protocols have been an active research topic in computer networks for decades, there are actually few general purpose transport protocols running at the application level today. In this sense UDT fills a void left by the transport layer protocols where they cannot perfectly support all applications. However, even without considering its application level advantage, UDT can be broadly compared to several other transport protocols. (In fact, the UDT protocol could actually be implemented on top of IP, but the application level implementation was one of the major objectives in developing this protocol.)

UDT borrows the messaging and partial reliability semantics from SCTP. However, SCTP is specially designed for VoIP and telephony, whereas UDT targets general purpose data transfer. UDT unifies both messaging and streaming semantics in one protocol.

UDT's connection/flow design can also be compared to the multi-streaming feature in SCTP. SCTP creates an association (analogous to a UDT flow) between two addresses, and multiple independent streams (analogous to UDT connections) can be set up over the association. However, in SCTP, applications need to explicitly create
the association and the number of streams is fixed at the beginning, while UDT implicitly joins the connections into the same UDT flow (applications only create independent connections). Furthermore, SCTP applies flow control at the association level. In contrast, UDT applies flow control at the connection level. This layered design (connection/flow in UDT and stream/association in SCTP) can also be found in Structured Stream Transport (SST) [7]. SST creates channels (analogous to UDT flow) between a pair of hosts while starting multiple lightweight streams (analogous to UDT connection) atop the same channels. However, the rationales behind SST and UDT are fundamentally different. In SST, the channel provides a secured (optional) virtual connection to support multiple independent application streams and to reduce stream setup time. This design particularly targets applications that require multiple data channels. In contrast, UDT flow automatically aggregates multiple independent connections to reduce control traffic and to provide better congestion control. Both protocols apply congestion control on the lower layer (channel and flow), but SST channel provides unreliable packet delivery only and the streams have to conduct reliability control independently. In contrast, UDT flow provides reliable packet delivery (unless a UDT connection requests a message drop). Beyond this 2-layer design, SST and UDT differ significantly on details of reliability control (ACK, etc.), congestion control, data transfer semantics, and API semantics. UDT enforces congestion control on the flow level, which carries traffic from multiple UDT connections. This leads to a similar objective as that of congestion manager (CM) [2, 3]. UDT's flow/connection design makes it a natural way to share congestion control among connections between the same address pairs. 
This design is transparent to existing congestion control algorithms, because any congestion control algorithm originally designed for a single connection can still work on the UDT flow without any modification. In contrast, CM introduces its own congestion control that is specially designed for a group of connections. Furthermore, CM enforces congestion control at the system level and does not provide the flexibility for individual connections to have different control algorithms. UDT allows applications to choose a predefined congestion control algorithm for each connection. A similar approach is taken in DCCP. However, UDT goes further by allowing users to redefine the control event handlers and write their own congestion control algorithm. Some of the implementation techniques used in UDT are exchangeable with kernel space TCP implementations. Here are several examples. UDT's buffer management is similar to the Slab cache in Linux TCP [10]. UDT automatically changes socket buffer size to maximize throughput. Windows Vista provides socket buffer autosizing, while SOBAS [5] provides application level TCP buffer auto-tuning. UDT uses a congestion event based loss list that significantly reduces the scan time on the packet loss list. This problem occurred in the Linux SACK implementation (when a SACK packet arrives, Linux used to scan the complete list of in-flight packets, which could be very large for high BDP links) and was fixed later. While there are so many similarities on the implementation issues, UDT's application level implementation is largely different from TCP's kernel space implementation. UDT cannot directly use kernel space thread, kernel timer, hardware
interrupt, processor binding, and so on. It is more challenging to realize a high performance implementation at the application level.
6 Conclusions

This paper has described the design and implementation of Version 4 of the UDT protocol and demonstrated its scalability, performance, and usability in layered protocol design (UDP multiplexer, UDT flow, and UDT connection), data transfer semantics, configurability, and efficient application level implementation.

As the end-to-end principle [14] indicates, the kernel space should provide the simplest possible protocol and the applications should handle application specific operations. From this point of view, the transport layer does provide UDP, while UDT can bridge the gap between the transport layer and applications. This design rationale of UDT does not conflict with the existence of other transport layer protocols, as they provide direct support for large groups of applications with common requirements.

The UDT software is currently of production quality. At the application level, it is much easier to deploy than new TCP variants or new kernel space protocols (e.g., XCP [11]). This also provides a platform for rapidly prototyping and evaluating new ideas in transport protocols. Some of the UDT approaches can be implemented in kernel space if they are proven to be effective in real world settings.

Furthermore, many UDT modifications are expected to be application or domain specific, so they do not need to be compatible with any existing protocols. Without the limitations of deployment and compatibility, more innovative transport protocol technologies can be encouraged and implemented than before, which is another ambitious objective of the UDT project.
References

[1] Allman, M., Paxson, V., Stevens, W.: TCP congestion control. RFC 2581 (April 1999)
[2] Andersen, D.G., Bansal, D., Curtis, D., Seshan, S., Balakrishnan, H.: System Support for Bandwidth Management and Content Adaptation in Internet Applications. In: 4th USENIX OSDI Conf., San Diego, California (October 2000)
[3] Balakrishnan, H., Rahul, H., Seshan, S.: An Integrated Congestion Management Architecture for Internet Hosts. In: Proc. ACM SIGCOMM, Cambridge, MA (September 1999)
[4] Bridges, P.G., Hiltunen, M.A., Schlichting, R.D., Wong, G.T.: A configurable and extensible transport protocol. IEEE/ACM Transactions on Networking 15(6) (December 2007)
[5] Dovrolis, C., Prasad, R., Jain, M.: Socket Buffer Auto-Sizing for High-Performance Data Transfers. Journal of Grid Computing 1(4) (2004)
[6] Duke, M., Braden, R., Eddy, W., Blanton, E.: A Roadmap for Transmission Control Protocol (TCP). RFC 4614, IETF (September 2006)
[7] Ford, B.: Structured Streams: a New Transport Abstraction. In: ACM SIGCOMM 2007, August 27-31, Kyoto, Japan (2007)
[8] Gu, Y., Grossman, R.L.: UDT: UDP-based Data Transfer for High-Speed Wide Area Networks. Computer Networks 51(7) (May 2007)
[9] Gu, Y., Grossman, R.L., Szalay, A., Thakar, A.: Distributing the Sloan Digital Sky Survey Using UDT and Sector. In: Proceedings of e-Science (2006)
[10] Herbert, T.: Linux TCP/IP Networking for Embedded Systems (Networking), 2nd edn., November 17, 2006. Charles River Media (2006)
[11] Katabi, D., Handley, M., Rohrs, C.: Internet Congestion Control for High Bandwidth-Delay Product Networks. In: The ACM Special Interest Group on Data Communications (SIGCOMM 2002), Pittsburgh, PA, pp. 89–102 (2002)
[12] Kohler, E., Handley, M., Floyd, S.: Designing DCCP: Congestion Control Without Reliability. In: Proceedings of SIGCOMM (September 2006)
[13] Rhee, I., Xu, L.: CUBIC: A New TCP-Friendly High-Speed TCP Variant, PFLDnet, Lyon, France (2005)
[14] Saltzer, J.H., Reed, D.P., Clark, D.D.: End-to-end arguments in system design. ACM Transactions on Computer Systems 2(4), 277–288 (1984)
[15] Schulzrinne, H., Casner, S., Frederick, R., Jacobson, V.: RTP: A Transport Protocol for Real-Time Applications. RFC 3550 (July 2003)
[16] Stewart, R. (ed.): Stream Control Transmission Protocol. RFC 4960 (September 2007)
[17] Tan, K., Song, J., Zhang, Q., Sridharan, M.: A Compound TCP Approach for High-Speed and Long Distance Networks. In: Proceedings of INFOCOM 2006. 25th IEEE International Conference on Computer Communications, pp. 1–12. IEEE Computer Society Press, Los Alamitos (2006)
The Measurement and Modeling of a P2P Streaming Video Service*

Peng Gao(1,2), Tao Liu(2), Yanming Chen(2), Xingyao Wu(2), Yehia El-khatib(3), and Christopher Edwards(3)

(1) Graduate University of Chinese Academy of Sciences, Beijing, China
(2) China Mobile Group Design Institute Co., Ltd, Beijing, China
{gaopeng,liutao1,chenyanming,wuxingyao}@cmdi.chinamobile.com
(3) Computing Department, InfoLab21, Lancaster University, Lancaster LA1 4WA, United Kingdom
{yehia,ce}@comp.lancs.ac.uk
Abstract. Most of the work on grid technology in the video area has been restricted to aspects of resource scheduling and replica management. The traffic of such a service has many characteristics in common with that of the traditional video service. However, the architecture and user behavior in grid networks are quite different from those of the traditional Internet. Considering the potential of grid networks and video sharing services, measuring and analyzing P2P IPTV traffic is important and fundamental work in the field of grid networks. This paper investigates the features of PPLive, which is the most popular streaming service in China and is based on P2P technology. Through monitoring and analyzing PPLive traffic streams, the characteristics of the P2P streaming service have been studied. The analyses are carried out with respect to bearing protocols, geographical distribution, and the self-similarity properties of the traffic. A streaming service traffic model has been created and verified by simulation. The simulation results indicate that the proposed streaming service traffic model complies well with the real IPTV streaming service. It can also function as a step towards studying video-sharing services on grids.

Keywords: grid, peer to peer, p2p, video streaming, traffic characteristics, traffic modeling, ns2 simulation.
* This work is partially supported by the European Sixth Framework Program (FP6) under the contract STREP FP6-2006 IST-045256.

1 Introduction

Along with the rapid development of P2P file sharing and IPTV video services, P2P streaming services have become the main multi-user video sharing application on the Internet. The focus of grid technology in the video area is generally on resource scheduling and replica management [1] [2], wherein the service traffic characteristics are still similar to those of the traditional video service. In-depth work has been carried out in the areas of monitoring and modeling video traffic [3]. Therefore,
considering the developing trends of grid systems and video sharing, monitoring and analysis of P2P IPTV traffic are interesting and promising topics of research. They are the main focus of this paper. There have been many research efforts in the area of P2P live streaming. Silverston et al. [4] have analyzed and compared the different mechanisms and traffic patterns of four mainstreams P2P video applications. Their work has focused on IPTV system aspects such as peer survival time. Hei et al. [5] have made a comprehensive investigation of aspects such as the P2P live streaming system architecture, user behavior, user distribution and software realization. The work presented in these two papers is similar in part to the work presented in this paper. However, our objectives and outcome are dissimilar from theirs in two ways. First, the aim of traffic monitoring is to obtain the characteristics of service traffic. The traffic is analyzed in order to come up with models that can be used in simulation. Second, some of our traffic analysis results differ from those found in the two aforementioned papers for the PPLive software update [7]. The rest of this paper is organized as follows: section 2 describes the monitoring setup, and section 3 details the results of analyzing the measurement data. In order to verify the accuracy of the analysis, a traffic model is developed and evaluated through simulation in Section 4.
2 Measurement Setup
The measurement was carried out in the intranet of China Mobile Group Design Institute (CMDI) in Beijing. Fig. 1 shows the measurement environment, where the intranet access bandwidth is 100 Mbps and the Internet access bandwidth is 10 Mbps.
Fig. 1. PPLive Measuring Environment
The monitoring terminals select a representative popular channel at a peak time to collect the service data. The measurement lasts about 2 hours, from the start to the end of the service. The produced data file is 896 MB in size and stored in pcap format. To keep the analysis both simple and detailed, the measurement data is divided into several parts, of which two portions (the service beginning stage and the service steady stage) are the focus of the analysis. The monitoring terminal runs the PPLive software (version 1.8.18 build 1336). The service uses the default configuration: the network type is community broadband, the maximum number of connections per channel is 30, and the maximum number of simultaneous network connections is 20. The monitoring terminals use Ethereal version 0.99.0 [8] to collect the traffic data generated by the PPLive service. The measurement data is processed using Gawk [9] and the plots are drawn using gnuplot [10]. The mathematical analysis tool Easyfit [11] is used for distribution fitting.
3 Measurement Data Analysis
This section presents the results of analyzing different aspects of the measurement data. Since the amount of captured data is relatively large, five time sections of equal length are selected, each ten minutes long, which together cover the service establishment, middle and end periods. The five time sections are 0-600s, 1800-2400s, 3600-4200s, 5400-6000s and 6600-7200s. Section 3.1 uses the data from the 3600-4200s time section, describing the bearing protocols and the sizes of the packets used to transfer both data and control signals. Section 3.2 presents the geographical distribution of the peers during all five time sections. Section 3.3 introduces the characteristics of the upload and download traffic based on the first two time sections. The data from the 5400-6000s time section is used in Section 3.4 to investigate the self-similarity properties of the traffic.
3.1 General Measurement
In order to study the network transport protocols that deliver the video streams, a period with relatively steady traffic has to be considered first. The 3600-4200s time section was selected. Fig. 2 shows the size distribution of TCP packets (green dots) and UDP datagrams (red dots) during this time section. The total number of TCP packets observed during this time section is 462, which is less than 0.46% of the total number of packets (101620). Further, the TCP packets are generally small. The UDP datagrams, on the other hand, clearly fall into two types. One type is datagrams larger than 600 bytes, most of them 1145 bytes long. The other type is datagrams of 100 to 200 bytes. The former type makes up 96% of all the traffic, while the latter makes up only 4%. We conclude from this figure that PPLive uses UDP as the main bearing protocol for video content, with most packets being 1145 bytes long. TCP is used to transmit some control frames and a few video frames.
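The classification used in Section 3.1 can be illustrated with a short sketch. It assumes the capture has already been reduced to (protocol, size) pairs; the function names, the class labels and the toy trace are hypothetical and are not part of the measurement tool chain described above.

```python
from collections import Counter

def classify(protocol: str, size: int) -> str:
    """Label a packet using the size classes reported in Section 3.1."""
    if protocol == "TCP":
        return "tcp"
    if size > 600:
        return "udp_video"    # large datagrams, mostly 1145 bytes
    if 100 <= size <= 200:
        return "udp_control"  # small datagrams
    return "udp_other"

def shares(packets):
    """Return the fraction of packets that falls into each class."""
    counts = Counter(classify(proto, size) for proto, size in packets)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical toy trace of (protocol, size in bytes); not measured data.
trace = [("UDP", 1145)] * 96 + [("UDP", 130)] * 3 + [("TCP", 60)]
result = shares(trace)
print(result)

# TCP share from the packet counts quoted in the text (462 of 101620).
tcp_share_percent = 462 / 101620 * 100
print(f"TCP share: {tcp_share_percent:.3f}%")
```

The last two lines only redo the arithmetic behind the "less than 0.46%" statement.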
3.2 Peer IP Address Distribution
Since the data was collected on the Chinese financial information channel, the viewers mainly come from China's inland areas. Using the IP address allocation information from the China Internet Network Information Center [12], the location of each upload and download peer can be determined from its IP address. Fig. 3 shows the geographical distribution of the IP addresses. The analysis of the peer distribution is based on the data in the five time sections, namely 0-600s, 1800-2400s, 3600-4200s, 5400-6000s and 6600-7200s.
Fig. 2. Packet Size Distribution per Bearing Protocol
Fig. 3. Peer Distribution for Download and Upload Traffic
The download and upload traffic are processed separately. The analysis of the peer distribution focuses on the relationship between the amount of service traffic and the corresponding source or destination area. In each time section, the peer IP addresses are sorted according to the number of video packets. Then, the per-peer traffic is aggregated by the geographic area of the IP address. Fig. 3 shows the ratio of traffic from or to the different areas in the five time sections. The red block represents the first contributing or beneficiary area, the yellow block the second one, the black block the third one, and the blue block the set of all other areas. In the download traffic, the contributors come from about 20 provinces. Fig. 3-a shows that more than 50% of the traffic comes from the first contributing area. In the upload traffic, the beneficiaries come from about 25 provinces. Compared with the download traffic, the upload traffic is more evenly distributed among the beneficiary areas, as shown in Fig. 3-b. Interestingly, the download traffic comes mainly from the network operated by China Netcom (CNC), while the upload traffic goes mainly to the network operated by China Telecom (CT). The traffic ratios are shown in Table 1.
Table 1. Traffic Distribution of the Different Time Sections
Operator         0-600s   1800-2400s   3600-4200s   5400-6000s   6600-7200s
CNC (Download)   86.9%    86.2%        68.78%       84.3%        80.7%
CT (Upload)      70%      80.7%        56.3%        69.2%        85.1%
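The per-area aggregation described in Section 3.2 can be sketched as follows. This is a minimal illustration that assumes a lookup table from IP address to area is already available; the function name, the area labels and the addresses (taken from documentation ranges) are hypothetical.

```python
from collections import defaultdict

def area_shares(packets_per_peer, area_of_ip, top_n=3):
    """Aggregate per-peer video packet counts by area and return the
    shares of the top_n areas plus the combined share of all others."""
    per_area = defaultdict(int)
    for ip, count in packets_per_peer.items():
        per_area[area_of_ip.get(ip, "unknown")] += count
    total = sum(per_area.values())
    ranked = sorted(per_area.items(), key=lambda item: item[1], reverse=True)
    top = [(area, n / total) for area, n in ranked[:top_n]]
    others = sum(n for _, n in ranked[top_n:]) / total
    return top, others

# Hypothetical input: video packets per peer and an IP-to-area table.
packets = {
    "192.0.2.1": 500, "192.0.2.2": 100, "198.51.100.7": 200,
    "203.0.113.9": 150, "203.0.113.10": 50,
}
areas = {
    "192.0.2.1": "area-1", "192.0.2.2": "area-1", "198.51.100.7": "area-2",
    "203.0.113.9": "area-3", "203.0.113.10": "area-4",
}
top, others = area_shares(packets, areas)
print(top, others)
```

The three returned shares correspond to the red, yellow and black blocks of Fig. 3, and the remainder to the blue block.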
3.3 Traffic Characteristics
The upload and download throughput during the first two time sections is shown in Fig. 4. The panels for the different time sections have different time and throughput ranges. The download traffic (green) shows a relatively steady flow rate during the whole monitoring period: the average is about 50 KB/s, and there is no dramatic fluctuation from beginning to end. The upload traffic (red) can be divided into two periods, the service beginning period and the service steady period. During the service beginning period, the traffic clearly increases. The turning point is around the 10th minute; before it, the upload traffic consists mainly of detect packets, and no video packets are transmitted. During the service steady period, data packets are transmitted according to the link condition of the user. Fig. 4 also shows the traffic of the top peer in download and in upload, in blue and purple respectively, which reflects the download and upload policy. It can be noticed that the output traffic to the top beneficiary almost equals the contribution from the top contributor in the download traffic. The statistics are computed at a granularity of one second. The average rate of the top peer in both upload and download is about 12 KB/s.
Fig. 4. Upload and Download Traffic Performance with Each Top One Peer Traffic
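Throughput curves such as those in Fig. 4 are obtained by binning packet sizes per second. The following sketch shows one way to do this, assuming the trace is a list of (timestamp in seconds, size in bytes) pairs; the function name and the sample values are hypothetical.

```python
from collections import defaultdict

def throughput_per_second(packets, start, end):
    """Sum packet sizes into one-second bins over the interval [start, end)."""
    bins = defaultdict(int)
    for timestamp, size in packets:
        if start <= timestamp < end:
            bins[int(timestamp - start)] += size
    return [bins[i] for i in range(int(end - start))]

# Hypothetical four-second trace: (timestamp in s, size in bytes).
trace = [(0.1, 1145), (0.9, 1145), (1.5, 130), (3.2, 1145)]
series = throughput_per_second(trace, start=0, end=4)
average = sum(series) / len(series)
print(series, average)
```

Seconds without any packet appear as zero entries, so the average is taken over the whole interval rather than over the busy seconds only.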
The relationship between the nodes and the throughput is analyzed using the data between 1800s and 2400s; only the upload traffic is analyzed here. The beneficiaries are ranked according to their received traffic. Compared with the download traffic, the upload traffic goes to a larger number of nodes, so each node obtains a comparatively small share. The top node, which obtains the most upload traffic, receives only 5% of the overall upload traffic. The upload traffic decreases markedly after the 60th peer, and after the 90th peer there are essentially only control packets. During the overall monitoring period, only 4% of the peers (the top 100 peers) receive 96% of the data traffic (the number of nodes is determined from the IP addresses). This 96% of the upload traffic is composed of 1145-byte video packets, and the remaining part is composed of 100-200 byte detect packets.
3.4 Self-similarity Analysis
The self-similarity and long-range dependence are analyzed with the Selfis tool [13] by Thomas Karagiannis et al. The data in the 5400-6000s time section is used as the basis for analyzing both download and upload. Fig. 5 shows the R/S estimation and the autocorrelation function (ACF) for the download and upload traffic. For the download traffic, the R/S linear fitting function is 0.1723x+0.01451, and the self-similar Hurst value is 0.1732. As the lag increases, the autocorrelation function shows no notable decline, and it is not clear whether it is non-summable. For the upload traffic, the R/S linear fitting function is 0.6216x+0.0349, and the corresponding Hurst value is 0.6216. The non-summable character of the autocorrelation function is also unclear, but since the Hurst value is greater than 0.5, the ACF of the upload traffic can be regarded as non-summable (H>0.5), i.e. the upload traffic is long-range dependent.
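For readers unfamiliar with R/S estimation, the idea can be sketched in a few lines. This is a generic rescaled-range estimator written for illustration, not the implementation used by the Selfis tool; the block sizes, the base-10 logarithm and the test signal are assumptions.

```python
import math
import random

def rescaled_range(block):
    """R/S of one block: range of the cumulative deviations from the
    mean, divided by the standard deviation of the block."""
    mean = sum(block) / len(block)
    cumulative, low, high = 0.0, 0.0, 0.0
    for value in block:
        cumulative += value - mean
        low, high = min(low, cumulative), max(high, cumulative)
    variance = sum((value - mean) ** 2 for value in block) / len(block)
    return (high - low) / math.sqrt(variance) if variance > 0 else None

def fit_line(xs, ys):
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def hurst_rs(series, min_block=8):
    """Estimate H as the slope of log10(mean R/S) against log10(block size)."""
    xs, ys = [], []
    size = min_block
    while size <= len(series) // 2:
        values = []
        for i in range(0, len(series) - size + 1, size):
            rs = rescaled_range(series[i:i + size])
            if rs is not None:
                values.append(rs)
        if values:
            xs.append(math.log10(size))
            ys.append(math.log10(sum(values) / len(values)))
        size *= 2
    return fit_line(xs, ys)

# Hypothetical input: uncorrelated Gaussian noise, for which H is near 0.5.
random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
h, intercept = hurst_rs(noise)
print(round(h, 3), round(intercept, 3))
```

The slope of the fitted line plays the role of the coefficient of x in the fitting functions quoted above. For short series the estimator is known to be biased, so values for uncorrelated input typically come out somewhat above 0.5.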
According to the analysis of several time sections of upload and download traffic, the Hurst value of the download traffic fluctuates between 0.2 and 0.3, and the Hurst value of the upload traffic fluctuates between 0.6 and 0.7. If the data is analyzed on a longer time scale of more than 50 min, the Hurst values of the download and upload traffic increase noticeably, to 0.35 and 0.75 respectively, and the ACF again shows no clear decline.
3.5 Traffic Statistical Fitting
From the analysis in the above sections, we draw two conclusions about modeling the traffic of PPLive. Firstly, traffic modeling of the PPLive service should be based on the upload traffic. Secondly, for simplicity, the upload traffic should be modeled as a whole, although it is actually composed of traffic to different beneficiaries. For the interval times between video packet groups, the fitted PDFs and their corresponding parameters, ranked by the K-S test, are shown in Table 2 below. The log-normal PDF is chosen as the video frame transmitting function of the traffic model.
Fig. 5. R/S and ACF of Download and Upload Traffic
Table 2. PDF Fitting
Distribution: LogNormal, LogNormal (3P), Weibull (3P), Gamma (3P), Weibull
Fig. 6 shows the Q-Q plot of the log-normal distribution, which ranks best in the K-S test when the video packet inter-arrival times are analyzed. The log-normal distribution matches the actual data well below 0.4 seconds but deviates gradually thereafter. However, since all inter-packet times are less than 3 s and almost 92% of them are less than 0.4 s, this does not influence the traffic model much. The fitting uses the free trial of the Easyfit software [11]. The control packets are analyzed in the same way: the fitted PDFs of the small-packet time intervals and their corresponding parameters are ranked by the K-S test. Again, the log-normal PDF is chosen as the control packet function of the traffic model.
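The fitting step can be sketched without a dedicated tool. The example below fits a log-normal distribution by maximum likelihood and computes the K-S statistic against the fitted distribution; it is an illustrative stand-in and makes no claim about how Easyfit estimates or parameterizes the distribution. The sample data and its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist

def fit_lognormal(samples):
    """Maximum-likelihood fit: mean and standard deviation of log(samples)."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def ks_statistic(samples, mu, sigma):
    """Largest gap between the empirical CDF and the log-normal CDF."""
    normal = NormalDist(mu, sigma)
    ordered = sorted(samples)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        cdf = normal.cdf(math.log(x))
        worst = max(worst, (i + 1) / n - cdf, cdf - i / n)
    return worst

# Hypothetical inter-arrival times in seconds (synthetic, not measured).
random.seed(0)
gaps = [random.lognormvariate(-2.0, 1.0) for _ in range(2000)]
mu, sigma = fit_lognormal(gaps)
d = ks_statistic(gaps, mu, sigma)
print(round(mu, 3), round(sigma, 3), round(d, 4))
```

Ranking candidate distributions by the K-S test, as in Table 2, amounts to computing this statistic for each fitted candidate and preferring the smallest value.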
Fig. 6. Statistic Fitting of Video Packet
4 Modeling Design and Simulation
Real networks are highly heterogeneous and complex. Building and managing such a large-scale distributed system for experiments is very difficult and hardly feasible, so modeling and simulation are a practical way to study its operation. This paper provides an effective method to model IPTV traffic, which can serve as a reference for similar work in the future. The simulation results also verify the accuracy of the analysis and modeling methods used.
4.1 Algorithm
In this model, an algorithm controls the packet generation sequence, as shown in Fig. 7. First, data initialization is performed:
1. Send a video packet when the simulation begins.
2. Compute the next video packet sending time and store it in the variable NextT.
Next, the time needed to send the next packet is computed. To account for the different packet sizes, different parameters are used to calculate the inter-video packet time (variable NextT) and the inter-control packet time (array t_i). The values of t_1 to t_n are summed into the variable SmallT. As long as the value of SmallT is less than NextT, t_i is used as the inter-packet time for sending small packets. Otherwise, a large packet is sent immediately with an inter-packet time of NextT - (SmallT - t_i).
4.2 Simulation and Evaluation
For the simulation, version 2.29 of the widely used NS-2 [14] is used to implement the model. The scripting language Awk, the plotting tool gnuplot and the statistical fitting tool Easyfit are used for the simulation data analysis.
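The packet generation sequence of Section 4.1 can be sketched outside the simulator as follows. This is an illustrative Python rendering, not the NS-2 implementation; the function name, the callables that supply the gaps, and all numeric values are hypothetical. Both gap generators are assumed to return strictly positive values.

```python
import random

def generate_events(duration, video_gap, control_gap):
    """Return (time, kind) events following the sequence of Section 4.1.

    video_gap() draws NextT, control_gap() draws one t_i. A new cycle is
    started as long as the current time is below duration.
    """
    now = 0.0
    events = [(now, "video")]       # a video packet is sent at the start
    while now < duration:
        next_t = video_gap()        # NextT: time until the next video packet
        small_t = 0.0               # SmallT: running sum of the t_i
        while True:
            t_i = control_gap()
            small_t += t_i
            if small_t < next_t:
                now += t_i          # a small packet still fits before NextT
                events.append((now, "control"))
            else:
                # Send the large packet after the remaining part of NextT.
                now += next_t - (small_t - t_i)
                events.append((now, "video"))
                break
    return events

# Deterministic check with hypothetical constant gaps.
events = generate_events(1.5, lambda: 1.0, lambda: 0.4)
print(events)

# Randomized use with hypothetical log-normal parameters (not the fitted ones).
random.seed(0)
sample = generate_events(
    10.0,
    lambda: random.lognormvariate(-1.0, 0.5),
    lambda: random.lognormvariate(-3.0, 0.5),
)
print(len(sample))
```

By construction, consecutive video packets are exactly NextT apart, while the control packets fill the time in between.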
Fig. 7. Packet Generation Sequence Flowchart
Fig. 8. Q-Q Plot of Packets Inter-Arrival Time
The log-normal distribution with parameters α = 1.5514 and β = 3.7143 is used to simulate the inter-video packet time interval, as discussed before. The log-normal distribution with parameters α = 2.5647 and β = -9.1022 simulates the inter-packet time for small packets, as explained in Section 4.1, while the size of control packets is 130 bytes and the size of video packets is 8015 bytes. Fig. 8 shows the Q-Q plot of the inter-video packet times from the simulation, which match the actual data very well during the first 2 seconds and then depart gradually.
In order to further validate the correctness of this traffic model, ten minutes of simulation traffic data has been collected to draw a throughput plot. This is compared with the data from the actual throughput for 3600-4200s by plotting them on the same figure, as shown in Fig. 9. The red points are real traffic data and the blue points are the simulation data.
Fig. 9. Comparison between Original and Simulated Traffic
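The two sets of traffic data are compared in Table 3. The standard deviations are close, while the simulated total traffic is about 24% higher than the measured one. Overall, the comparison indicates that this model can be used to represent traffic generated by IPTV peers on a simulation platform.
Table 3. Comparing data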
Data         Total Traffic (Kbyte)   Average Value (Kbyte/s)   Standard Deviation   Confidence Level (95%)
Actual       30907                   51.59766                  51.0418              4.0958
Simulation   38197                   63.76795                  52.0086              4.1734
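The columns of Table 3 are standard summary statistics of a per-second throughput series. The sketch below shows how such values can be computed, under the assumption that the "Confidence Level (95%)" column is the half-width of the confidence interval of the mean. It uses the normal quantile as an approximation; the function name and the sample series are hypothetical.

```python
import math
from statistics import NormalDist, stdev

def summarize(per_second_kbytes, confidence=0.95):
    """Total, mean, sample standard deviation and confidence half-width."""
    n = len(per_second_kbytes)
    total = sum(per_second_kbytes)
    sd = stdev(per_second_kbytes)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return {
        "total": total,
        "mean": total / n,
        "stdev": sd,
        "half_width": z * sd / math.sqrt(n),
    }

# Hypothetical per-second series in Kbyte/s (not the measured trace).
stats = summarize([40.0, 50.0, 60.0, 50.0])
print(stats)

# Relative difference of the totals reported in Table 3.
relative_difference = (38197 - 30907) / 30907
print(f"{relative_difference:.1%}")
```

The final lines only restate the totals from Table 3 as a relative difference between simulation and measurement.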
5 Conclusion
This paper investigates features of PPLive, a representative P2P streaming application, such as the distribution of peer IP addresses, the number of peers, and various traffic characteristics including self-similarity. Based on further analysis, we designed a model for the streaming video traffic generated by PPLive. We verified the correctness of our model using simulations of the video streaming service. The results indicate that the proposed streaming service traffic model complies with the actual IPTV streaming service. This study of IPTV traffic streams can be used in the future as a foundation for characterizing grid video-sharing services.
Acknowledgment First of all, we thank the Europe-China Grid InterNetworking (EC-GIN) project which provided a wonderful platform for in-depth research on grid networks. We are indebted to the members of the EC-GIN WP2 group for their collaboration and support. Our team members tried their best, and now we can share the success. Finally, we thank all the people whose experience we have benefited from, and those concerned with the improvement of grids!
References
1. Cuilian, L., Yunsheng, M., Junchun, R.: A Grid-based VOD System and Its Simulation. Journal of Fudan University (Natural Science) 43(1), 103–109 (2004)
2. Anfu, W.: Program Scheduling and Implementing in a Grid-VOD System. Engineering Journal of Wuhan University 40(4), 149–152 (2007)
3. Garrett, M.W., Willinger, W.: Analysis, Modeling and Generation of Self-Similar VBR Video Traffic. In: Proceedings of SIGCOMM 1994 (September 1994)
4. Silverston, T., Fourmaux, O.: Measuring P2P IPTV Systems. In: Proceedings of the 17th ACM International Workshop on Network and Operating Systems Support for Digital Audio & Video (NOSSDAV 2007), Urbana, IL, USA, June 2007, pp. 83–88 (2007)
5. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: A Measurement Study of a Large-Scale P2P IPTV System. IEEE Transactions on Multimedia 9(8), 1672–1687 (2007)
6. El-khatib, Y., Edwards, C. (eds.) Damjanovic, D., Heiß, W., Welzl, M., Stiller, B., Gonçalves, P., Loiseau, P., Vicat-Blanc Primet, P., Fan, L., Wu, J., Yang, Y., Zhou, Y., Hu, Y., Li, L., Li, S., Liu, S., Ma, X., Yang, M., Zhang, L., Kun, W., Liu, Z., Chen, Y., Liu, T., Zhang, C., Zhang, L.: Survey of Grid Simulators and a Network-level Analysis of Grid Applications. EC-GIN Deliverable 2.0, University of Innsbruck (April 2008)
7. PPLive, http://www.pplive.com/
8. Ethereal: A Network Protocol Analyzer, http://www.ethereal.com/
9. Gawk, http://www.gnu.org/software/gawk/
10. gnuplot, http://www.gnuplot.info/
11. Easyfit: Distribution Fitting Software, http://www.mathwave.com
12. China Internet Network Information Center, http://www.cnnic.net.cn/en/index/index.htm
13. The SELFIS Tool, http://www.cs.ucr.edu/~tkarag/Selfis/Selfis.html
14. NS-2: Simulation tool, http://nsnam.isi.edu/nsnam/
SCE: Grid Environment for Scientific Computing Haili Xiao, Hong Wu, and Xuebin Chi Supercomputing Center, Chinese Academy of Sciences 100190 P.O.Box 349, Beijing, China {haili,wh,chi}@sccas.cn
Abstract. Over the last few years Grid computing has evolved into an innovative technology and has seen increasing commercial adoption. However, existing Grids do not have enough users for sustainable development in the long term. This paper proposes several suggestions for addressing this problem on the basis of long-term experience and careful analysis. The Scientific Computing Environment (SCE) of the Chinese Academy of Sciences is introduced as a completely new model and a feasible solution to this problem.
Keywords: Grid, scientific computing, PSE.
1 Introduction
This paper begins with this short introduction. The second part gives an overview of several large Grid projects and their applications (and application Grids), including EGEE, TeraGrid, CNGrid and ScGrid, followed by a discussion of the existing problems of those Grids. The third part describes the Scientific Computing Environment (SCE): what it is, why it is important and how it works. It also introduces applications built upon SCE and how scientists benefit from this integrated environment. The paper ends with final conclusions and future work.
2 Grid Computing Overview
2.1 Grid Computing
Grid computing has been defined in a number of different ways, especially as it gains increasing commercial adoption. People from the scientific and research community do not share the same understanding as those from companies like IBM, Oracle, Sun, etc. However, there is a consensus that Grid computing involves the integration of large computing resources, huge storage and expensive instruments, which are generally linked together from geographically diverse sites.
This work is supported by the National High Technology Research and Development Program of China (863 Program) “Research of management in CNGrid” (2006AA01A117), “Environment of supercomputing services facing scientific research” (2006AA01A116), and “Application of computational chemistry” (2006AA01A119).
P. Primet et al. (Eds.): GridNets 2008, LNICST 2, pp. 35–42, 2009. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2009
2.2 Well-Known Grid Projects
Since its beginnings at Argonne National Labs in 1990, Grid computing has evolved rapidly, especially over the last few years. Many national Grids, and even multinational Grids, have been funded to build collaboration environments for scientists and researchers.
EGEE. The Enabling Grids for E-sciencE (EGEE) project brings together scientists and engineers from more than 240 institutions in 45 countries world-wide to provide a seamless Grid infrastructure for e-Science that is available to scientists 24 hours a day. The EGEE project aims to provide researchers in academia and industry with access to major computing resources, independent of their geographic location. The first period of the EGEE project officially ended on 31 March 2006, and EGEE II started on 1 April 2006. The EGEE Grid consists of 41,000 CPUs available to users 24 hours a day, 7 days a week, in addition to about 5 PB (5 million gigabytes) of disk storage plus tape MSS storage, and sustains 100,000 concurrent jobs. Having such resources available changes the way scientific research takes place. The end use depends on the users' needs: large storage capacity, the bandwidth that the infrastructure provides, or the sheer computing power available. The project primarily concentrates on three core areas:
– To build a consistent, robust and secure Grid network that will attract additional computing resources.
– To continuously improve and maintain the middleware in order to deliver a reliable service to users.
– To attract new users from industry as well as science and ensure they receive the high standard of training and support they need.
Expanding from originally two pilot scientific application fields, high energy physics and life sciences, EGEE now integrates applications from many other scientific fields, ranging from geology to computational chemistry.
Generally, the EGEE Grid infrastructure is intended for scientific research, especially where the time and resources needed for running the applications are considered impractical on traditional IT infrastructures.
TeraGrid. TeraGrid is one of the world's largest and most comprehensive distributed cyberinfrastructures for open scientific research. TeraGrid integrates high-performance computers, data resources and tools, and high-end experimental facilities around the USA through high-performance network connections. Currently, TeraGrid resources include more than 750 teraflops of computing capability and more than 30 petabytes of online and archival data storage, with rapid access and retrieval over high-performance networks.
A TeraGrid Science Gateway is a community-developed set of tools, applications, and data collections that are integrated via a portal or a suite of applications. Gateways provide access to a variety of capabilities including workflows, visualization, resource discovery, and job execution services. Science Gateways enable entire communities of users associated with a common scientific goal to use national resources through a common interface. Science Gateways are enabled by a community allocation whose goal is to delegate account management, accounting, certificate management, and user support to the gateway developers. TeraGrid is coordinated through the Grid Infrastructure Group (GIG) at the University of Chicago, working in partnership with the Resource Provider sites.
CNGrid. China National Grid (CNGrid) is a key project launched in May 2002 and supported by the China National High-Tech Research and Development Program (the 863 program). It is a testbed for the new information infrastructure, which aggregates high-performance computing and transaction processing capability. CNGrid promotes national information construction and the development of relevant industries through technical innovation. Ten resource providers (also known as grid nodes) from all around China have joined CNGrid and contribute more than 18 teraflops of computing capability. CNGrid is equipped with domestically developed grid-oriented high performance computers (Lenovo DeepComp 6800 in Beijing and Dawning 4000A in Shanghai). Two 100-teraflops HPCs are being built and will be installed before the end of 2008. HPCs of 1000 teraflops will be built and installed in the near future. During its first five years of construction, CNGrid has effectively supported many scientific research and application domains, such as resources and environment, advanced manufacturing and information services. CNGrid is coordinated through the CNGrid Operation Center at the Supercomputing Center of the Chinese Academy of Sciences.
The CNGrid Operation Center was established in September 2004 and works with all CNGrid nodes. It is responsible for system monitoring, user management, policy management, training, technical support and international cooperation.
ScGrid. In the Chinese Academy of Sciences (CAS), the construction of the supercomputing environment is one of the key supported areas. To build the e-Science infrastructure and make the best use of the supercomputing environment, the CAS has clear, long-term plans for Grid research and the construction of the Scientific Computing Grid (ScGrid). ScGrid provides users with a problem solving environment based on Grid technology, which is a generic Grid computing platform. ScGrid has supported typical applications from bioinformatics, environmental sciences, material sciences, computational chemistry, geophysics, etc.
There are other well-known Grid projects, such as e-Science in the UK, NAREGI in Japan and K*Grid in Korea.
2.3 Analysis of Existing Grids
The emerging Grid technology provides a cyberinfrastructure for application science communities. In the UK and China, the term e-Science is used more often when talking about Grid technology. Both characterizations regard the Grid as the third basic way of doing scientific research; the difference is that 'Cyberinfrastructure' focuses on the method of scientific research, while 'e-Science' emphasizes the prospect and mode of future research. All Grids are application-driven by nature, or built for specific applications. The description above shows that demand from users and applications does exist. Scientific research can be done faster and more efficiently by using Grid technology, as the Grids above have proven. However, Grids, especially production Grids, still need to find more users and arouse more interest among them. Grids and Grid applications need more actively engaged domain scientists. As an example, in the TeraGrid annual report for 2007 [1], the milestone of 300 Science Gateway users was delayed by six months (from month 18 to month 24). In our experience as the China National Grid Operation Center over the past four years, the most common user complaints are:
1. the usability and stability are not sufficient for daily use;
2. the integrated applications do not cover their needs;
3. the Grid portal is not as convenient as their SSH terminal.
There are a number of issues at stake when we talk about the Grid and its value for users, which are also the reasons why users would choose the Grid:
1. Can the Grid reduce the time of computation and/or the time of queuing? This is essential when a user considers choosing an alternative, new method (like the Grid) to finish her work. Unfortunately, the Grid itself is not a technology that makes a computation finish faster.
However, a Grid often consists of more than one computation node; when a job is submitted, a 'best' destination node (a faster or less busy one, depending on the scheduling policy) can be selected to run the job without the user being aware of it. So under many conditions, a computation job can finish earlier in the Grid.
2. Is it free of charge to use the Grid? Users will be very happy if it is. Furthermore, if computation (CPU cycles) is free only when using the Grid, they will certainly love it. This is not good in the long term, however, because no new technology, including the Grid, can last long if it depends entirely on free use.
3. Can the Grid give users a wonderful experience? Some users like new things: a beautiful and well-designed UI, convenient drag-and-drop mouse manipulation. Users need more choices so that they can select the most familiar and comfortable way to access Grid resources, i.e. via a Grid portal or a traditional terminal, from a Linux workstation or a Windows laptop.
4. Apply once, run anywhere in the Grid. Many users have accounts on several HPCs, and they always need to apply for a new account on each HPC. In the Grid, they can apply only once, for a Grid account, to obtain a local account on each Grid node. This can save much time.
From the above analysis, resource sharing and scheduling are still the key value of the Grid for users. Each Grid user must have a local account on every single Grid node, and each submitted job must be delivered to a specific Grid node. This is the basic concept of the Grid, although in reality it is sometimes only an ideal. In this situation, on the one hand the functions of the software need to be improved; on the other hand, operation and management is also a big issue that needs more attention.
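The selection of a 'best' destination node mentioned in item 1 can be illustrated with a small sketch. It is not taken from any Grid middleware: the Node fields, the function name and the policy (fewest queued jobs first, then most free CPUs) are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Node:
    name: str
    free_cpus: int
    queued_jobs: int

def pick_node(nodes: Iterable[Node], cpus_needed: int) -> Optional[Node]:
    """Choose the least busy node that can run the job, or None if no
    node has enough free CPUs (hypothetical policy)."""
    candidates = [n for n in nodes if n.free_cpus >= cpus_needed]
    if not candidates:
        return None
    # Fewest queued jobs first; ties are broken by the most free CPUs.
    return min(candidates, key=lambda n: (n.queued_jobs, -n.free_cpus))

# Hypothetical Grid nodes.
grid = [
    Node("node-a", free_cpus=64, queued_jobs=12),
    Node("node-b", free_cpus=32, queued_jobs=3),
    Node("node-c", free_cpus=8, queued_jobs=0),
]
chosen = pick_node(grid, cpus_needed=16)
print(chosen.name if chosen else "no suitable node")
```

With such a policy the user submits the job once and does not need to know which node finally runs it.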
3 Scientific Computing Environment
In the eleventh five-year plan of the Chinese Academy of Sciences, the construction of the supercomputing environment is designed as a pyramidal structure. The top layer is a centralized high-end computing environment, consisting of a one hundred teraflops computer and a software support platform. The middle layer is distributed across China, with five sub-centers integrating fifty teraflops of computing resources. The bottom layer consists of the institutes of the CAS, which can have their own HPCs of different sizes. Under this pyramidal layout, and based on ScGrid and the work done during the tenth five-year plan, Grid computing technology is the best choice for sharing resources and integrating computing power.
3.1 Structure of SCE
The Supercomputing Center continues the research and construction of the Scientific Computing Grid as the e-Science infrastructure of the Chinese Academy of Sciences advances. To make the best use of the Scientific Computing Environment (SCE), the Supercomputing Center provides users with an easy-to-use problem solving environment based on Grid technology. The basic structure of SCE is illustrated in figure 1. At the bottom, different high performance computers and visualization workstations make up the hardware layer. The Grid middleware is an abstract and transparent layer that hides the differences in geographical location and the heterogeneity of the machines in the hardware layer. It also serves as a Grid service provider for the client tools and portals above it. Users can choose whichever client tools or portals they like to access Grid resources.
3.2 Features of SCE
Integrated and Transparent. In the integrated scientific computing environment, computing jobs are scheduling among all computing nodes. Jobs can be submitted to a specified node or a node selected automatically. Large scale and long-time jobs can be scheduled to the top or middle layer with more powerful computing capability. Other small or short-time jobs will be scheduled to the
H. Xiao, H. Wu, and X. Chi
Fig. 1. Basic structure of SCE
Fig. 2. Submitting a Gaussian Job in Grid portal
middle or bottom layer. As a result, the utilization of the high performance computers is maximized. In addition, users do not need to log on to different nodes: they log in once and can use all the nodes. The whole environment is transparent to users and appears as one larger computer.
Flexible and easy-to-use. Users can choose to use either the web portal or a client tool. In the web portal (see Fig. 2), ordinary jobs can be submitted, job status can be checked, intermediate output can be viewed, and final results can be downloaded. In Fig. 2, the status of a running Gaussian job is listed and the current content of an intermediate output file is shown in the center window. In the client tools, in addition, files can be transferred using multiple threads. Remote file systems on all Grid nodes
SCE: Grid Environment for Scientific Computing
can be mounted locally in Windows Explorer. Directories and files of three remote HPCs are listed concurrently in the left tree view of Windows Explorer. Remote file operations are as easy and familiar as local ones.
Lightweight and scalable. The source code of the SCE middleware is written mainly in C and shell script, and the core code is around ten thousand lines, so it is very easy to deploy and set up. In addition, the modular design makes the middleware easy to extend: new functions can be added easily.
3.3 Applications Based on SCE
Virtual Laboratory of Computational Chemistry. The Virtual Laboratory of Computational Chemistry (VLCC) is a very successful example of combining Grid technology with a specific application. VLCC was established in 2004 and aims to build a platform for computational chemistry that supports research, communication between scientists, training of newcomers, and application software development. So far, more than fifty members from universities and institutes have joined the virtual laboratory. A great deal of software is shared among the members, from commercial packages, e.g. Gaussian 2003, ADF 2004, Molpro and VASP, to free software such as CPMD, GAMESS, and GROMACS. Scientists also share software they have developed themselves, including XIAN-CI by Professor Zhenyuan Wen from North West University of China, CCSD and SCIDVD by Professor Zhigang Shuai from the Institute of Chemistry of CAS, and 3D by Professor Keli Han from the Dalian Institute of Chemical Physics. The success of VLCC has inspired researchers in other application domains, and similar virtual laboratories are planned.
4 Conclusions and Future Work
Grid computing involves the integration of large computing resources, huge storage and expensive instruments. Over the last few years, Grid computing has evolved rapidly. Many national and even multi-national Grids, such as EGEE, TeraGrid and CNGrid, have been built. The Scientific Computing Environment (SCE) in the Chinese Academy of Sciences is a completely new model that puts emphasis on consistent operation and management, and on versatile access methods to meet users' demands. Successful applications in the SCE integrated environment show that it is a feasible solution to this problem. More applications are being integrated into SCE.
Acknowledgement. We would like to thank the scientists and researchers from CAS for their support of our work. We also thank the staff and students of the Grid team at SCCAS. Their team spirit greatly encouraged us and helped us get things done.
References
1. Catlett, C., et al.: TeraGrid Report for CY 2006 Program Plan for 2007 (January 2007)
2. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1998)
3. Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration (January 2002)
4. von Laszewski, G., Pieper, G.W., Wagstrom, P.: Gestalt of the Grid, May 12 (2002)
5. Yin, X.-c., Lu, Z., Chi, X., Zhang, H.: The China ACES iSERVO Grid Node. Computing in Science & Engineering (IEEE) 7(4), 38–42 (2005)
6. Zhu, J., Guo, A., Lu, Z., Wu, Y., Shen, B., Chi, X.: Analysis of the Bioinformatics Grid Technique Application in China. In: Fourth International Workshop on Biomedical Computations on the Grid, May 16-19, 2006. Nanyang Technological University, Singapore (2006)
7. Sun, Y., Shen, B., Lu, Z., Jin, Z., Chi, X.: GridMol: a grid application for molecular modeling and visualization. Journal of Computer-Aided Molecular Design (SCI) 22(2), 119–129 (2008)
8. Wu, H., Chi, X.-b., Xiao, H.-l.: An Example of Application-Specific Portal in Web-based Supercomputing Environment. In: International Conference on Parallel Algorithms and Computing Environments (ICPACE), October 8-11, 2003, pp. 135–138 (2003)
9. Haili, X., Hong, W., Xuebin, C.: An Implementation of Interactive Jobs Submission for Grid Computing Portals. Australian Computer Science Communications 27(7), 67–70 (2005)
10. Xiao, H., Wu, H., Chi, X.: A Novel Approach to Remote File Management in Grid Environments. In: DCABES 2007, vol. II, pp. 641–644 (2007)
11. Dai, Z., Wu, L., Xiao, H.: A Lightweight Grid Middleware Based on OPENSSH SCE. In: Grid and Cooperative Computing, pp. 387–394 (2007)
12. http://www.863.org.cn
13. http://www.cngrid.org
14. http://www.eu-egee.org
15. http://www.teragrid.org
16. http://www.sccas.cn
17. http://www.scgrid.cn
18. http://www.top500.org
QoS Differentiated Adaptive Scheduled Optical Burst Switching for Grid Networks
Oliver Yu and Huan Xu
Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607
Abstract. This paper presents the QoS differentiated Adaptive Scheduled Optical Burst Switching (QAS-OBS) paradigm, which efficiently supports dynamic grid applications. QAS-OBS enables differentiated-services optical burst switching via two-way reservation and adaptive QoS control in distributed wavelength-routed networks. Based on the network loading status, the edge nodes dynamically optimize per-class lightpath reservation control to provide proportional QoS differentiation while maximizing the overall network throughput. Unlike existing QoS control schemes that require a special core network to support burst switching and QoS control, QAS-OBS pushes all the complexity of burst switching and QoS control to the edge nodes, so that the core network design is greatly simplified. Simulation results evaluate the performance of QAS-OBS and verify that the proposed control schemes provide proportional QoS control while maximizing the overall network throughput.
Emerging collaborative grid applications increasingly demand support for diverse bandwidth granularities and multimedia traffic. This paper presents an optical burst switching (OBS) based grid network architecture to support such collaborative grid applications. An open issue for OBS-based grid network architectures is provisioning for grid applications whose multimedia traffic components require differentiated quality-of-service (QoS) performance. OBS schemes [3][4] have been proposed to enable statistical multiplexing of wavelengths for bursty traffic. The one-way reservation of OBS, however, incurs burst contention, and data may be dropped at the core nodes due to wavelength reservation contention. Burst contention recovery controls (including time, wavelength and link deflection) are employed to minimize contention blocking. These recovery controls, however, may incur extra end-to-end data delay. In addition, some burst contention recovery controls require special devices such as Fiber Delay Lines (FDLs) in the core network. Wavelength-routed OBS (WR-OBS) [5] guarantees data delivery through two-way wavelength reservation. WR-OBS is based on a centralized control plane and does not require special devices (e.g. FDLs) at the core nodes. However, centralized control may not be practical for large-scale grid systems. Adaptive Reliable Optical Burst Switching (AR-OBS) [6] minimizes data loss in wavelength-routed optical networks (WRONs) with distributed control, based on wavelength-routed optical burst switching. AR-OBS minimizes the overall data loss by adapting the lightpath reservation robustness (the maximal time allowed for blocking recovery) of each burst to the network loading status.
To optimize the sharing of optical network resources, it is desirable to provide service differentiation in OBS-based grid networks. Service differentiation by assigning different offset delays is proposed in [7][8].
In these schemes, higher-priority bursts are assigned an extra offset to avoid contention with lower-priority bursts. However, the higher-priority bursts always suffer longer end-to-end data delay, and such control schemes are unfair to large bursts [9][10]. Look-ahead Window (LaW) based service differentiation schemes are proposed in [11][12]. Each core node buffers burst headers in a sliding window, and a dropping decision is made for the buffered bursts based on the QoS requirements. The collective view of multiple burst headers in the LaW results in more efficient dropping decisions than simple dropping-based schemes (e.g. the scheme proposed in [9]). The overall end-to-end delay of LaW, however, is sacrificed because of the extra queuing delay in the window. By implementing Weighted Fair Queuing for the burst reservation requests in the central controller [12], WR-OBS can provide prioritized QoS. As mentioned above, however, a centralized control scheme may not be practical in large-scale network systems.
To facilitate a simple pricing model for grid applications, it is preferable to provide proportional service differentiation. By intentionally dropping lower-class bursts at the core network nodes, the QoS control scheme proposed in [9] provides proportional service differentiation in terms of data loss ratio. Simple dropping-based schemes at each core node, however, may cause unnecessary drops and reduce wavelength utilization. The preemption-based QoS control scheme proposed in [14] provides proportional service differentiation by maintaining the number of wavelengths occupied by each class of bursts. To guarantee the preset differentiation ratio, bursts that satisfy their QoS usage profiles preempt those that do not. The preempted bursts, however, waste wavelength resources and lower the wavelength utilization.
An absolute QoS model for OBS networks is proposed in [15] to ensure that the data loss of each QoS class does not exceed a certain value. Loss guarantees are provided via two mechanisms: early drop and wavelength grouping. The optimal wavelength sharing problem for OBS networks with QoS constraints is discussed in [16]. Optimized wavelength sharing policies are derived from a Markov decision process to maximize wavelength utilization while ensuring that the data loss rate of each QoS class does not exceed the given threshold.
All of the aforementioned QoS control schemes require QoS control to be implemented at every core network node, which increases the complexity of the core network design. In addition, some of the schemes demand special optical switch architectures. For example, the differentiated offset delay and LaW based QoS control schemes need FDLs at every core network node to buffer the incoming optical bursts. The QoS control schemes proposed in [15] and [16] require every optical node to be capable of wavelength conversion. These requirements may not be easily satisfied in existing WRONs, and they increase the investment cost of the core network.
To simplify the core network design and push the complexity of burst switching and QoS control to the edge nodes in WRONs, QoS differentiated Adaptive Scheduled Optical Burst Switching (QAS-OBS) is proposed. QAS-OBS provides proportional service differentiation in terms of data loss rate while maximizing the overall network throughput. QAS-OBS employs two-way reservation to guarantee data delivery. The core network controllers are not involved in burst QoS control (e.g. dropping decisions). The core network control can employ distributed reservation protocols with blocking recovery control mechanisms (e.g. S-RFORP [17] or RFORP [18]). The QoS control in QAS-OBS is implemented at the edge nodes, which dynamically adjust the lightpath reservation robustness (i.e. the maximal allowed signaling delay) of each burst to provide service differentiation. To guarantee proportional differentiation under different network loading conditions, the edge controller adjusts the lightpath reservation robustness of each burst based on the network loading status. The Round Trip Signaling Delay (RTSD) of the two-way reservation signaling is employed as an indicator of the network loading status, so that no explicit distribution of network loading information is required. A heuristic-based control algorithm is proposed in this paper. Simulation results show that QAS-OBS achieves proportional QoS provisioning while maximizing the network throughput.
The remainder of the paper is organized as follows. Section 2 discusses the general architecture of QAS-OBS. The core signaling control is illustrated in Section 3. Section 4 presents multi-class QoS control. Section 5 presents the simulation results.
2 Architecture of QAS-OBS
QAS-OBS contains two functional planes: a distributed data plane consisting of the optical core network switches, and a distributed control plane that handles signaling control and switch configuration for each optical switch. The motivation of QAS-OBS is to push the complexity of QoS control from the network core to the edge. Edge nodes control the lightpath setup robustness (i.e. the maximal allowed lightpath signaling time) and burst scheduling. Core network nodes are in charge of lightpath reservation and do not need to participate in burst QoS control, which simplifies the implementation of the core network.
At the edge nodes, incoming data packets are aggregated in different buffers according to their QoS classes and destinations. QAS-OBS employs two-way reservation with lightpath setup acknowledgment to guarantee data delivery. To avoid the long delay incurred by two-way reservation, the lightpath signaling process is initiated during the burst aggregation phase. The signaling delay thus overlaps with the burst aggregation delay, and the end-to-end data delay is minimized. To support service-differentiated control, the robustness of the lightpath reservation for each burst is set according to its QoS class. For higher-class bursts, lower reservation blocking is achieved by assigning larger lightpath reservation robustness to the corresponding resource reservation process. Some optical resource reservation protocols provide adjustable lightpath reservation robustness. For example, the maximal allowed number of blocking recovery retries in S-RFORP [17] or RFORP [18] can be adjusted for each request; a larger number of retries results in lower reservation blocking. In the rest of this paper, we assume that QAS-OBS employs a resource reservation protocol that supports adjustable reservation robustness by assigning different upper bounds on the signaling delay (e.g. a limit on the number of blocking recovery retries in S-RFORP). To meet the QoS requirements under different network loading conditions, the burst reservation robustness in QAS-OBS is adapted to the network loading status. The reservation robustness of each burst is selected to maximize the overall throughput while satisfying the QoS constraints. Delayed reservation is employed in QAS-OBS to maximize wavelength utilization.
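The per-class, per-destination buffering at the edge node can be pictured with a small sketch. This is only an illustration under assumptions, not the authors' implementation: the class name `BurstAggregator` and its methods are hypothetical, and packet sizes are tracked in bits.

```python
from collections import defaultdict


class BurstAggregator:
    """Keeps one assembly buffer per (QoS class, destination) pair.

    Hypothetical helper for illustration; not part of QAS-OBS itself.
    """

    def __init__(self):
        # key: (qos_class, destination) -> list of packet sizes in bits
        self._buffers = defaultdict(list)

    def add_packet(self, qos_class, destination, size_bits):
        """Append a packet to the buffer of its class and destination."""
        self._buffers[(qos_class, destination)].append(size_bits)

    def burst_size(self, qos_class, destination):
        """Size in bits of the burst currently being assembled."""
        return sum(self._buffers.get((qos_class, destination), []))

    def flush(self, qos_class, destination):
        """Release the assembled burst and return its total size in bits."""
        packets = self._buffers.pop((qos_class, destination), [])
        return sum(packets)
```

Keying the buffers by the pair (class, destination) is what lets each burst later receive its own reservation robustness according to its class.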
Fig. 1. Overall Architecture of QAS-OBS (block diagram. Block labels: Edge Node, Edge Node Controller, Core Node Controller, QAS-OBS Network, All-optical Switch, Traffic Class Classifier, Per-class & Per-destination Aggregator, User Traffic Data Rate Estimator, Network RTSD Monitor, Multi-Class QoS Controller, Burst Transmission Controller, Burst Signaling Controller, Route Selection Controller. Signal labels: User Generated Traffic, Network Input Traffic, Data Incoming Rate, Traffic Class, Burst Traffic Class, Estimated Burst Aggregation Delay & Size, Network Loading Status, Burst Offset Delay, Burst Transmission Time & Size, Trigger Burst Transmission, Selected Route, Per-burst Reservation Signaling Protocol.)
Unlike offset-delay-based QoS control in OBS systems, which isolates different classes in the time dimension, QAS-OBS provides service differentiation by controlling the blocking probability of the lightpath reservation for each burst. In QAS-OBS, lower blocking is not traded for longer end-to-end data delay, since the lightpath signaling process is required to complete by the end of the burst aggregation process.
Fig. 1 shows the overall architecture of QAS-OBS with the edge node function blocks. The incoming data is aggregated at the per-class, per-destination aggregator. The User Traffic Rate Estimator monitors the data incoming rate per destination and per QoS class, and predicts the burst sending time so as to avoid buffer overflow or data expiration at the edge node. Based on the estimated burst sending time and the network loading status, the Multi-Class QoS Controller dynamically selects the lightpath reservation robustness for each burst and triggers the lightpath signaling procedure accordingly. We define the offset delay as the time interval between triggering the lightpath reservation and sending out the burst, which is also the time allowed for lightpath signaling (the reservation robustness). To maintain a proportional data loss rate among QoS classes and maximize the overall data throughput, the offset delay is dynamically adjusted for each QoS class according to the current network loading status. In QAS-OBS, the network loading status is estimated from the round-trip signaling delay (RTSD). A longer RTSD means that more blocking recovery retries occurred during lightpath reservation, which indicates that the network is heavily loaded. After receiving from the Multi-Class QoS Controller a lightpath setup request carrying the burst sending time and burst size, the Burst Signaling Controller selects the route and sends a per-burst reservation request to the core network. Routing in QAS-OBS is assumed to be based on shortest-path selection. The wavelength selection algorithm (e.g. first fit or random fit) depends on the signaling protocol employed. If the lightpath setup acknowledgement returns, the Burst Transmission Controller triggers burst transmission, and the data is released from the edge buffer into the optical network as an optical burst at the predicted burst sending time.
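The chain from rate estimation to signaling trigger can be sketched as follows. This is a minimal sketch under stated assumptions: the estimator is taken to be a simple exponentially weighted average with a hypothetical smoothing weight `alpha`, the burst is assumed to be sent when the buffer would fill or the maximal edge delay expires (whichever comes first), and all function and parameter names are illustrative rather than taken from the paper.

```python
class RateEstimator:
    """Exponentially weighted estimate of the data arrival rate (bits per TU)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # assumed smoothing weight, not from the paper
        self.rate = None     # no estimate until the first sample arrives

    def update(self, sample_rate):
        """Fold a new rate sample into the running estimate."""
        if self.rate is None:
            self.rate = sample_rate
        else:
            self.rate = self.alpha * sample_rate + (1 - self.alpha) * self.rate
        return self.rate


def predict_send_time(start_time, rate, buffer_bits, max_edge_delay):
    """Send when the buffer would fill or the data would expire, whichever is first."""
    if rate <= 0:
        return start_time + max_edge_delay
    time_to_fill = buffer_bits / rate
    return start_time + min(time_to_fill, max_edge_delay)


def signaling_trigger_time(send_time, offset_delay):
    """Reservation signaling starts one offset delay before the burst is sent."""
    return send_time - offset_delay
```

For instance, with an estimated rate of 150 bits/TU and a 6000-bit buffer, the burst would be predicted to leave 40 TU after aggregation starts; an offset delay of 25 TU then places the signaling trigger at 15 TU, well inside the aggregation phase.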
3 Burst Signaling Control
Burst signaling control reserves the wavelengths for each burst at the specified burst transmission time. It is supported by both edge and core nodes. Based on the incoming rate of an aggregating burst, the edge node determines when to send the lightpath reservation signaling and the wavelength channel holding time for the corresponding burst. Core nodes interpret the signaling message and configure the corresponding optical switch to support the wavelength reservation. The objectives of burst signaling control are to minimize the end-to-end data delay and to maximize the wavelength utilization. In this section, we focus on minimizing the end-to-end data delay. Wavelength utilization is maximized in QAS-OBS via delayed reservation [19].
In QAS-OBS networks, the burst is sent out after the lightpath setup acknowledgement returns, so data delivery is guaranteed. The offset delay is required to be larger than the round-trip signaling delay. To avoid the larger end-to-end delay caused by two-way reservation, QAS-OBS signals the lightpath before the burst is fully assembled. If the signaling is sent out early enough for the lightpath acknowledgement to return before the burst is fully assembled, the total end-to-end data delay is minimized, since the offset delay is not added to it. To account for the extra delay incurred by blocking recovery, the offset delay in QAS-OBS consists of the error-free signaling delay (EFSD) and the maximal allowed recovery delay (MARD). EFSD is the signaling delay for a given route when no blocking occurs. MARD is the maximal time allowed for blocking recovery control in the signaling process, which depends on the maximal allowed signaling delay. In the signaling procedure, the actual recovery delay (ARD) is limited by MARD to ensure that the acknowledgement returns before the signaled burst transmission time. If the assigned MARD is not large enough to cover the ARD, the lightpath request is dropped. The actual round-trip signaling delay (RTSD) is EFSD plus ARD, which is bounded by the given offset delay.
A signaling scenario is shown in Fig. 2 to illustrate the signaling control procedure. As shown in the figure, a new burst starts aggregating at t1. After some processing time, the signaling request is sent to the core network at t2, before the burst is fully assembled. The signaling message carries the estimated wavelength channel holding time and burst sending time. As shown in Fig. 2, the signaling message traverses three hops from source to destination during the wavelength discovery phase. When the signaling message reaches the destination node, the wavelength reservation procedure is triggered if there is a common available wavelength along the route. The signaling message is then sent back from the destination node to the source node. If there is no wavelength contention in the reservation phase, the signaling message returns to the source node at t3. The time between t2 and t3 is the EFSD. In Fig. 2, blocking recovery is triggered at the blocked link (Link 2) due to wavelength reservation contention. The illustrated blocking recovery control is based on the alternate wavelength selection proposed in [17]. After the contention blocking is recovered at Link 2, the signaling message is passed to the next link. The total blocking recovery time allowed along the route is bounded by MARD, and the actual recovery delay may be shorter than MARD. In the figure, the signaling message returns to the source node at t4. The source node sends out the burst at t5, which is the signaled burst sending time. To maximize wavelength utilization, delayed wavelength reservation is employed, as marked in Fig. 2. The wavelength holding time (from t5 to t6) of each burst is determined by the burst size and the core bandwidth. The serialized signaling procedure shown in this scenario is for illustration only and may not be efficient enough to support burst switching. QAS-OBS employs a fast and robust signaling protocol, namely S-RFORP, proposed in [17]. Interested readers may refer to [17] for details of the signaling protocol.
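The timing relations of this section (offset delay = EFSD + MARD, RTSD = EFSD + ARD, and a request being dropped when ARD exceeds MARD) can be written down compactly. The sketch below only restates those relations; the names `SignalingOutcome`, `evaluate_signaling` and `wavelength_holding_time` are hypothetical, and reporting no RTSD for a dropped request is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SignalingOutcome:
    accepted: bool            # True if the acknowledgement returns in time
    rtsd: Optional[float]     # round-trip signaling delay in TU, None if dropped


def offset_delay(efsd, mard):
    """Offset delay = error-free signaling delay + maximal allowed recovery delay."""
    return efsd + mard


def evaluate_signaling(efsd, mard, ard):
    """A request succeeds only if the actual recovery delay fits within MARD."""
    if ard > mard:
        # Recovery would overrun the signaled sending time: the request is dropped.
        return SignalingOutcome(accepted=False, rtsd=None)
    return SignalingOutcome(accepted=True, rtsd=efsd + ard)


def wavelength_holding_time(burst_bits, core_bandwidth):
    """Holding time is set by the burst size and the core bandwidth (bits per TU)."""
    return burst_bits / core_bandwidth
```

With an EFSD of 30 TU and a MARD of 12 TU, a recovery that takes 5 TU yields an RTSD of 35 TU, inside the 42 TU offset delay, whereas a recovery that would need 20 TU causes the request to be dropped.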
4 Multi-class QoS Controller
In QAS-OBS, MARD is the time reserved for recovering from lightpath reservation blocking. To provide differentiated QoS control in terms of data loss rate, the MARD of each burst is adjusted according to the service class of that burst. A larger MARD gives the signaling protocol more chances to recover from connection blocking. For example, the signaling protocols proposed in [17][18] use the MARD to recover from wavelength contention blocking via localized rerouting or alternative wavelength selection. To guarantee proportional differentiated services in terms of data loss rate, the MARD should be adjusted according to the network loading status. In distributed networks, however, real-time link status information is hard to collect, either because of the excessive information exchange required or for security reasons. The actual additional signaling delay (i.e. ARD) is therefore employed in QAS-OBS as an indicator of the network loading status. Wavelength contention and congestion occur more often as the average link loading increases. If blocking recovery control is employed, the additional signaling delay caused by blocking recovery depends on the probability of contention or congestion blocking, which is determined by the link loading.
The Multi-Class QoS Controller determines the offset delay, the burst transmission starting time and the burst size for each burst. Fig. 3 shows the structure of the Multi-Class QoS Controller. Based on the QoS Performance Requirements, the Optimization Heuristic in the QoS Control Database Configuration module sets up the mapping between the monitored ARD ($\widehat{ARD}$) and the per-class MARD, and stores it in the ARD to Per-Class MARD Mapping Database. The mapping table in the database is the decision pool for the Per-Class MARD Adaptation module, which assigns a MARD to each QoS class based on the monitored ARD. According to the QoS class, the selected route and the estimated burst aggregation time of an aggregating burst, the Per-Burst MARD & Transmission Starting Time Decision Controller selects the offset delay for each aggregated burst. When it is time to send out the signaling, this controller triggers the lightpath reservation by passing the burst transmission time and size to the Burst Signaling Controller. Meanwhile, the burst transmission time is sent to the Burst Transmission Controller.
Fig. 3. Multi-Class QoS Controller (block diagram. Inputs: Monitored RTSD, Selected Route, Burst Traffic Class. Blocks: Per-Class MARD Adaptation; ARD to Per-Class MARD Mapping Database (read/write); QoS Control Database Configuration (Optimization Heuristic); Per-Burst MARD & Transmission Starting Time Decision Controller. Optimization Policy and QoS Performance Requirements: optimize network throughput; proportional data loss rate. Output: Per-Class MARD.)
A larger MARD results in lower lightpath connection blocking; however, it also increases the data loss due to edge buffer overflow. In Sections 2 and 3, we discussed that the burst size and sending time are based on estimation. The estimation processing time (shown in Fig. 2) decreases as the MARD increases, because a larger MARD means the signaling must be sent earlier, and the estimation error depends on the processing time. In QAS-OBS, the estimation error therefore grows with the MARD. Taking into account the tradeoff between data loss due to connection blocking and data loss due to edge buffer overflow, a mapping between MARD and data loss rate can be established. Our control algorithm selects the optimal MARD that minimizes the total data loss for the QoS class that requires the minimal data loss rate. The MARDs of the other QoS classes are selected based on the data loss rate ratio and the mapping between MARD and data loss rate. Such a mapping can be set up by monitoring historical data. An example of the mapping between MARD and data loss rate is illustrated in the simulation section. Among the function modules in Fig. 3, the most important is the ARD to MARD mapping database, which is the decision pool for MARD selection. Since MARD determines the reservation blocking of each burst, there is a mapping function between MARD and burst data loss rate for a given network loading status. The one-to-one mapping between network loading and ARD is shown in the simulation section. Based on this mapping, we can set up an ARD to MARD mapping database that satisfies the QoS requirements.
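One way to read the MARD selection rule is sketched below. The paper does not spell out the heuristic, so this is an interpretation under assumptions: the MARD-to-loss mapping for one loading level is given as sampled points, the class requiring minimal loss takes the loss-minimizing MARD, and the other class takes the sampled MARD (no larger than that one) whose loss is closest to the proportional target. The function name and the sample values are hypothetical.

```python
def select_mards(loss_curve, loss_ratio):
    """Pick per-class MARDs from a sampled MARD-to-loss mapping.

    loss_curve: list of (mard, loss_rate) samples for one loading level.
    loss_ratio: desired loss_rate(lower class) / loss_rate(top class), >= 1.
    Returns (mard_top, mard_lower).
    """
    # Top class: the MARD that minimizes the total data loss
    # (connection blocking plus edge buffer overflow).
    mard_top, loss_top = min(loss_curve, key=lambda point: point[1])

    # Lower class: among MARDs not larger than the top-class MARD, pick the
    # sample whose loss rate is closest to the proportional target.
    target = loss_ratio * loss_top
    candidates = [point for point in loss_curve if point[0] <= mard_top]
    mard_lower, _ = min(candidates, key=lambda point: abs(point[1] - target))
    return mard_top, mard_lower


# Hypothetical samples: loss falls with MARD, then rises slightly again
# once edge buffer overflow starts to dominate.
curve = [(0, 0.20), (5, 0.10), (10, 0.05), (15, 0.04), (20, 0.045)]
print(select_mards(curve, 2.5))
```

Repeating this selection for each loading level, and indexing the results by the ARD observed at that level, would give one row of the ARD to MARD mapping database per level.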
5 Simulation Results
The simulation results consist of two parts. The first part presents the simulation results needed by the proposed control algorithm. The second part shows the performance of QAS-OBS, namely the overall throughput and the proportional data loss rate between two classes.
Fig. 4. 14-node NSF network topology
The network used for the simulation is the 14-node NSF network shown in Fig. 4. Without loss of generality, we assume that every link has an identical propagation delay, normalized to 1 time unit (TU). The wavelength discovery and reservation processing delay at each node is assumed to be 5 TU, and the switching latency of each optical switch is normalized to 3 TU. Shortest-path routing and first-fit wavelength assignment are assumed. The traffic arriving at each edge node is VBR traffic; the packet size follows a Poisson distribution with an average of 10 kb, and the maximal allowable edge delay is 100 TU. An exponentially weighted sliding window estimator is implemented. The simulation runs on a self-designed simulator written in C++. The parameters of the topology model are: every node has full wavelength conversion capability; the number of links is |E| = 42 (bidirectional links assumed); the default number of wavelengths per link is W = 8. The average per-link wavelength loading is defined as

$$\text{Wavelength Loading} = \sum_{i} \frac{\lambda_i \cdot H_i}{\mu_i \cdot |E| \cdot W}$$

where $i$ is the index of a source-destination pair, $\lambda_i$ is the average connection arrival rate on the route of source-destination pair $i$, $1/\mu_i$ is the average wavelength holding time of the bursts transmitted on the route of $i$, and $H_i$ is the number of hops on the route of $i$.
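The loading definition is easy to evaluate numerically. The sketch below assumes that $\mu_i$ denotes a service rate, so that the mean wavelength holding time is $1/\mu_i$ and the loading is dimensionless; the function name and the example traffic values are hypothetical, while |E| = 42 and W = 8 are the simulation settings quoted above.

```python
def average_wavelength_loading(pairs, num_links, wavelengths_per_link):
    """Average per-link wavelength loading.

    pairs: iterable of (arrival_rate, mean_holding_time, hops), one entry per
    source-destination pair; mean_holding_time plays the role of 1/mu_i.
    """
    # Each connection occupies one wavelength on every hop of its route
    # for the duration of its holding time.
    offered = sum(lam * holding * hops for lam, holding, hops in pairs)
    return offered / (num_links * wavelengths_per_link)


# Two hypothetical source-destination pairs on the 42-link, 8-wavelength network.
example_pairs = [(0.5, 4.0, 2), (0.25, 8.0, 3)]
print(average_wavelength_loading(example_pairs, 42, 8))
```

Read this way, the numerator is the wavelength-hop occupancy demanded by all routes and the denominator is the total number of wavelength channels in the network.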
A. Results Needed by the Control Algorithm
Fig. 5 shows the effect of MARD on the burst data loss for a given wavelength loading. As shown in Fig. 5, increasing the MARD reduces the burst reservation blocking, but the benefit diminishes once the MARD is large enough. As the MARD increases, the growth in edge overflow blocking eventually outweighs the reduction in burst connection blocking, and the overall data loss rate increases slightly when the MARD is large enough.
Fig. 6 shows how ARD works as a network loading indicator. In QAS-OBS, the sliding-window average ARD ($\widehat{ARD}$) is employed to indicate the network loading status. The figure shows that $\widehat{ARD}$ is proportional to the wavelength loading. In addition, $\widehat{ARD}$ depends only on the loading status and is largely independent of MARD. This characteristic qualifies $\widehat{ARD}$ as a good loading status indicator. From the results shown in Figs. 5 and 6, the $\widehat{ARD}$ to Per-Class MARD mapping database can be derived by following the steps presented in Section 4. The resulting database is shown in Fig. 7. The per-class MARD value that satisfies the QoS requirements has a one-to-one mapping with the monitored $\widehat{ARD}$.
Fig. 5. Effects of Increment of Offset Delay on data loss rate
Fig. 6. Effect of Network Loading on ARD
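A possible shape for the run-time side of this mechanism is sketched here: a sliding-window average of ARD samples, and a table lookup from that average to a per-class MARD. The window length, the table layout and all names (`ArdMonitor`, `lookup_mard`) are assumptions made for illustration; the numeric table entries are invented and do not come from Fig. 7.

```python
from bisect import bisect_right
from collections import deque


class ArdMonitor:
    """Sliding-window average of the actual recovery delay (ARD)."""

    def __init__(self, window=4):
        self._samples = deque(maxlen=window)   # window length is an assumption

    def record(self, rtsd, efsd):
        """Store one ARD sample, derived as RTSD minus the error-free delay."""
        self._samples.append(max(rtsd - efsd, 0.0))

    def average(self):
        """Mean of the samples in the window, or 0.0 before any sample."""
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)


def lookup_mard(mapping, avg_ard, qos_class):
    """Return the per-class MARD for the monitored average ARD.

    mapping: list of (ard_threshold, {qos_class: mard}) sorted by threshold.
    The row used is the last one whose threshold does not exceed avg_ard.
    """
    thresholds = [row[0] for row in mapping]
    index = max(bisect_right(thresholds, avg_ard) - 1, 0)
    return mapping[index][1][qos_class]
```

Because the edge node already knows the RTSD of its own requests, this lookup needs no link-state information from the core, which is the point of using ARD as the loading indicator.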
Fig. 7. $\widehat{ARD}$ to Per-Class MARD mapping database
Fig. 8. Effects of Average Wavelength Loading on Data Loss Rate
Fig. 9. Effects of Average Wavelength Loading on Overall Throughput (axes: Normalized Throughput versus Average Wavelength Loading; curves: Single Class and three C0:C1 traffic ratios)
B. Performance Results

Fig. 8 shows that one of the QoS constraints, the proportional data loss ratio, is satisfied under different wavelength loadings and traffic ratios. Three combinations with different traffic ratios of C0 to C1 are simulated for a given total traffic amount. For example, C0:C1 = 2:1 means that, for a given total traffic load, 67% of the traffic belongs to class 0 and 33% belongs to class 1. The results show that the data loss rate of each class increases as the average wavelength loading increases. The data loss rate ratio between the two classes is largely fixed and independent of the wavelength loading. Thus, one of the QoS requirements (a fixed data loss ratio between the two classes) is satisfied.

In Fig. 9, the overall throughput of the multi-class case is compared with that of the classless (single class) case. The overall throughput of the single class case is assumed to be the sub-optimal solution, because every burst is scheduled to minimize blocking and maximize the overall throughput. In addition, the overall throughput in the single class case can be considered the upper bound of that in the multi-class case. This is because the QoS constraint in the multi-class case may result in lower overall throughput: some lower class bursts are reserved with low reservation robustness, even though they could be transmitted, in order to keep the data loss ratio fixed. The simulation results show that the data loss rate and overall throughput in the multi-class case are very close to the performance in the single class case. Thus, the other QoS performance requirement, maximizing the overall throughput, is satisfied.

Figs. 8 and 9 also show that different combinations of class 1 and class 0 traffic affect the overall data loss rate. This is because, for a given traffic load, the blocking probability cannot be set arbitrarily low, and lower class traffic may be assigned a low lightpath reservation robustness to keep the data loss ratio proportional. When the lower class traffic becomes dominant, the overall data loss rate decreases. In such a case, adjusting the blocking ratio may be needed.
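As a small worked example of the bookkeeping behind Fig. 8, the following Python sketch splits a total load according to a class ratio such as C0:C1 = 2:1 and checks whether two measured loss rates keep a target proportion. The function names, the tolerance and the sample loss values are illustrative assumptions and are not taken from the simulation.

```python
from fractions import Fraction


def class_shares(ratio):
    """Return each class's share of the total traffic for a ratio such as (2, 1)."""
    total = sum(ratio)
    return [Fraction(part, total) for part in ratio]


def loss_ratio_holds(loss_a, loss_b, target_ratio, tolerance=0.1):
    """Check that loss_b / loss_a stays within a relative tolerance of the target.

    The target ratio and the tolerance are assumed values for illustration.
    """
    if loss_a <= 0:
        raise ValueError("loss_a must be positive")
    measured = loss_b / loss_a
    return abs(measured - target_ratio) <= tolerance * target_ratio


shares = class_shares((2, 1))
print([f"{float(share):.0%}" for share in shares])  # ['67%', '33%']
```

A check such as `loss_ratio_holds(0.01, 0.02, 2.0)` would then be evaluated once per wavelength loading point to confirm that the ratio stays fixed.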
6 Conclusion

In this paper, we present QAS-OBS to provide proportional QoS differentiation for optical burst switching for grid applications in wavelength routed optical networks. Adaptive burst reservation robustness control is implemented to provide service differentiation while maximizing network throughput. QAS-OBS simplifies the core network design by pushing all the QoS control to the edge node, which makes it easy to implement QAS-OBS in existing WRONs. The control heuristic is presented and its performance is evaluated through simulation. The results show that QAS-OBS satisfies the QoS constraints while maximizing the overall network throughput.
References
1. Yu, O., et al.: Multi-domain lambda grid data portal for collaborative grid applications. Elsevier Journal Future Generation Computer Sys. 22(8) (October 2006)
2. Simeonidou, D., et al.: Dynamic Optical-Network Architectures and Technologies for Existing and Emerging Grid Services. J. Lightwave Technology 23(10), 3347–3357 (2005)
3. Qiao, C., Yoo, M.: Optical burst switching (OBS) - A new paradigm for an optical internet. J. High Speed Networks 8(1), 69–84 (1999)
4. Xiong, Y., Vandenhoute, M., Cankaya, H.: Control Architecture in Optical Burst-Switched WDM Networks. IEEE J. Select. Areas Commun. 18(10), 1838–1851 (2000)
5. Duser, M., Bayvel, P.: Analysis of a dynamically wavelength-routed optical burst switched network architecture. J. Lightwave Technology 20(4), 574–585 (2002)
6. Yu, O., Xu, H., Yin, L.: Adaptive reliable optical burst switching. In: 2nd International Conference on Broadband Networks, 2005, October 3-7, 2005, vol. 2, pp. 1419–1427 (2005)
7. Yoo, M., Qiao, C., Dixit, S.: QoS Performance of Optical Burst Switching in IP-over-WDM Networks. IEEE J. Sel. Areas Commun. 18(10), 2062–2071 (2000)
8. Yoo, M., Qiao, C.: Supporting Multiple Classes of Services in IP over WDM Networks. In: Proc. IEEE Globecom 1999, pp. 1023–1027 (1999)
9. Chen, Y., Hamdi, M., Tsang, D.H.K.: Proportional QoS over OBS networks. In: Global Telecommunications Conference, 2001. GLOBECOM 2001, vol. 3, pp. 1510–1514 (2001)
10. Poppe, F., Laevens, K., Michiel, H., Molenaar, S.: Quality-of-service differentiation and fairness in optical burst-switched networks. In: Proc. SPIE OptiComm., Boston, MA, July 2002, vol. 4874, pp. 118–124 (2002)
11. Chen, Y., Turner, J.S., Zhai, Z.: Contour-Based Priority Scheduling in Optical Burst Switched Networks. Journal of Lightwave Technology 25(8), 1949–1960 (2007)
12. Farahmand, F., Jue, J.P.: Supporting QoS with look-ahead window contention resolution in optical burst switched networks. In: Global Telecommunications Conference, 2003. GLOBECOM 2003, December 1-5, 2003, vol. 5, pp. 2699–2703 (2003)
13. Kozlovski, E., Bayvel, P.: QoS performance of WR-OBS network architecture with request scheduling. In: IFIP 6th Working Conf. Optical Network Design Modeling Torino, Italy, February 4–6, 2002, pp. 101–116 (2002)
14. Loi, C.-H., Liao, W., Yang, D.-N.: Service differentiation in optical burst switched networks. In: IEEE GLOBECOM, November 2002, vol. 3, pp. 2313–2317 (2002)
15. Zhang, Q., et al.: Absolute QoS Differentiation in Optical Burst-Switched Networks. IEEE Journal of Selected Areas in Communications 22(9), 1781–1795 (2004)
16. Yang, L., Rouskas, G.N.: Optimal Wavelength Sharing Policies in OBS Networks Subject to QoS Constraints. IEEE Journal of Selected Areas in Communications 22(9), 40–50 (2007)
17. Xu, H., Yu, O., Yin, L., Liao, M.: Segment-Based Robust Fast Optical Reservation Protocol. In: High-Speed Networks Workshop, May 11, 2007, pp. 36–40 (2007)
18. Yu, O.: Intercarrier Interdomain Control Plane for Global Optical Networks. In: Proc. ICC 2004, IEEE International Communications Conference, June 20-24, 2004, vol. 3, pp. 1679–1683 (2004)
19. Yoo, M., Qiao, C.: Just-enough-time (JET): A high speed protocol for bursty traffic in optical networks. In: Proceeding of IEEE/LEOS Conf. on Technologies for a Global Information Infrastructure, August 26-27, 1997, pp. 26–27 (1997)
Principles of Service Oriented Operating Systems
Lutz Schubert and Alexander Kipp
HLRS - University of Stuttgart, Intelligent Service Infrastructures, Allmandring 30, 70569 Stuttgart, Germany
[email protected]
Abstract. Grid middleware support and the Web Services domain have advanced significantly over recent years, reaching a point where resource exposition and usage over the web has not only become feasible, but an actual commercial reality. Nonetheless, commercial uptake is still slow, and certainly not progressing in the same way as other web developments have - this is mostly due to the fact that usage is still complicated and restrictive. This paper discusses a new approach towards tackling grid-like networking across organisational boundaries and across different types of resources by moving main parts of the infrastructure to a lower (OS) level. This will allow more intuitive use of Grid infrastructures for current types of users.

Keywords: Service Oriented Architecture, distributed operating systems, Grid, virtualisation, encapsulation.
1 The “Grid”
Over recent years, the internet has become the most important means of communication and hence implicitly an important means for performing business and collaborative tasks. But the internet offers more potential than this: with ever-increasing bandwidth, it is no longer restricted to pure communication, but is also capable of exchanging huge, complex data sets within a fraction of the time once needed for such tasks. Likewise, it has become possible to send requests to remote services and have them execute complex tasks on the user's behalf, thus giving the user the possibility to outsource resource-intensive tasks.

Recently, web services have found wide interest in both the commercial and the academic domain. They offer the capability to communicate in a standardised fashion across the web in order to invoke specific functionalities exposed by the respective resource(s), so that, from a pure developer's perspective, these capabilities can be treated just like a (local) library extension. In principle, this would allow the creation of complex resource-sharing networks that build a grid-like infrastructure over which distributed processes can be enacted. However, the concept fails for multiple reasons:
– regardless of standardised messaging, every provider exposes its own type of interface, which is hardly ever compatible with another provider's interface, even when both essentially provide the same functions.
– additional capabilities are necessary to support the full scope of business requirements in situations such as joint ventures, in particular with respect to legal restrictions and requirements.
– exposing capabilities over the web and adapting one's own processes to cater for the corresponding requirements is typically very time- and effort-consuming.
– a high degree of technical capability is required from the persons managing services and resources in a Virtual Organisation.

Current Grid-related research projects therefore focus strongly on overcoming these deficiencies by realising more flexible interfaces and more intuitive means of interaction and management. One of the most sophisticated projects in this area is certainly BREIN [1], which tries to enhance typical Grid support with a specific focus on simplifying usage for the human behind the system. It therefore introduces extensions from the Agent, Semantic and AI domains, as discussed in more detail below. Whilst such projects enrich the capabilities of the system, it remains questionable whether this is an efficient solution, given that the underlying infrastructure suffers from the additional messaging overhead and that the actual low-level management issues have not been essentially simplified, but mostly just moved further away from the end-user.

This paper presents an alternative approach to realising a grid-enabled networking infrastructure that significantly reduces the administrative overhead on the individual user. The approach is strongly based on the concept of distributed operating systems, as already envisaged e.g. by Andrew S. Tanenbaum [2]. We will demonstrate how modern advances in virtualisation and encapsulation can be taken to a lower (OS) layer, which will not only allow cross-internet resource usage in a more efficient way, but will also enable new forms of more complex applications running on top of it, exceeding current Grid solutions. We will therefore examine the potential of what is commonly called “the grid” (section 2), what a low-level, OS-embedded Grid system would look like (section 3) and how it could be employed in collaborative environments (section 4). Based on this, we will then derive requirements for future network support for such a system (section 5) and conclude with a summary and outlook (section 6).

P. Primet et al. (Eds.): GridNets 2008, LNICST 2, pp. 56–69, 2009. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2009
2 Low level Grids?
All “Grid” definitions have in common that remote resources are used as, or instead of, local ones, so as to enhance or stabilise the execution of complex tasks, see e.g. the EGEE project [3]. What, though, is the advantage on the operating system layer? Currently, more and more services are offered over the internet which are, in essence, at a low hardware level: storage can be rented on the web that grants access from all over the world; computational platforms can be used to reduce the local CPU usage; printing services expose capabilities to print high-quality documents and photos; remote recording services store data, burn it to a disk and send it by mail; and so on.

Interoperability & Usability. Using these capabilities currently requires the user to write code in a specific manner, integrating public interfaces, making use of APIs and / or using browser interfaces over which to upload and download data. As opposed to this, local resources are provided to the user via the typical OS interfaces, such as Explorer, which are typically more intuitive to use and, in particular, use OS-layer drivers to communicate with the devices, rather than leaving it up to the user to make the resources usable. However, low-layer resources in particular, such as the ones mentioned above, could in principle be treated in the same fashion as local resources, at the cost of communication delays. We will treat the communication problem in more detail below.

Resource Enhancement. The obvious benefit of integrating remote resources consists in the enhancement of local resources, e.g. to extend local storage space, but also to add computing power, as we will discuss below. In particular on the level of printers, disk burners and storage extensions, remote resources are already in use, but with the drawbacks described above.

Stability and Reliability. By using remote resources for backup and replication, the stability of overall process execution and data maintenance of the system can increase significantly. Backup solutions already exist that duplicate disk storage onto a resource accessible over the web, and systems such as the one by Salesforce [4] allow stable, managed execution of code by using local replication techniques (cf. [5], [6]). However, none of them is intuitive to use or allows for interactive integration. Tight integration on the OS layer would hence allow remote resources to be used in a simple fashion, without having to care about interoperability issues.
3 Building up the Future Middleware
Web Service and Grid middlewares typically make use of HTTP communication with the SOAP protocol on top of it, which renders interactions rather slow and thus basically useless on a level that wants to make full use of the resources. Classical Grid applications mostly assume decoupled process execution, where interaction only takes place to receive input data and to return or forward operation results. Even “fast” Grid networks build more on dynamic relationships than on actual interactivity [7], [8], [9]. There have hence been multiple approaches to improving message throughput, such as SOAP over TCP and XML integration into C++ [10], [11], or standard binary encoding of SOAP messages [12], [13]. None of these approaches, however, can overcome the full impact of the high SOAP overhead, thus rendering lower-level resource integration nearly impossible.

3.1 A Model for Service Oriented Operating Systems
As has been noted, the main feature of Grid / Web Service oriented middlewares consists in provider-transparent resource integration for local usage. Notably, higher capabilities are of relevance for distributed process enactment, but from each provider's local perspective the same statement holds true. On the operating system level, resources cannot be treated as transparently as on the application level, as the resources, their location and the means of transaction have to be known in full detail - implicitly taking this burden away from the user. In a classical OS approach, the operating system would just provide the means of communication and leave it up to higher layers to take care of identification and maintenance of any remote resource.
Fig. 1. Schema of a service oriented, distributed operating system
Figure 1 shows a high-level view of a Service Oriented OS and its relationships to resources in its environment: from the user's perspective, the OS spans all resources, independent of their location etc., and represents them as locally integrated. In order to realise this, the main operating instance provides common communication means on the basis of an enhanced I/O management system. We distinguish three types of OS instances, depending on usage and location:

The Main Operating Instance represents the user side. As opposed to all other instances, this one maintains resources in the network and controls their usage. It performs the main OS tasks, such as process maintenance and in particular virtual memory management. Any MicroKernel instance is an extension to the local resources and is treated as such. The system therefore does not promote a flat grid hierarchy where all nodes can be treated as extensions to each other.

The Embedded MicroKernel is similar to the classical extension of a networked OS: it is actually a (network) interface to the device controller. Depending on the type of resource, such control may range from a simple serial / parallel interface, e.g. for printers and storage, up to standalone job and resource management, e.g. if the resource is itself a computer that can host jobs.

The Standalone MicroKernel is an extension of the Embedded MicroKernel and allows the same functions to be run on top of a local operating system, i.e. it acts as an application rather than a kernel. As such, it is in principle identical to the current Grid middleware extensions, which run as normal processes.

Leaving performance aside¹, local and remote resources could in principle be treated in the same fashion, i.e. using common interfaces and I/O modules. We will use this simplification in order to provide a sketch of the individual operating system instances and their main features as they differ from classical OS structures.
Fig. 2. Sketch of the individual kernel models
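The roles of the three instance types described above can be summarised in a minimal Python sketch. The class and attribute names are hypothetical, and the rule that a main instance never attaches another main instance is an assumed reading of the statement that the hierarchy is not flat.

```python
from dataclasses import dataclass, field
from enum import Enum


class InstanceKind(Enum):
    MAIN = "main operating instance"        # user side; controls resource usage
    EMBEDDED = "embedded microkernel"       # network interface to a device controller
    STANDALONE = "standalone microkernel"   # same functions, run as an application


@dataclass
class Instance:
    name: str
    kind: InstanceKind
    # Only a main instance fills this registry; microkernels never manage peers.
    resources: dict = field(default_factory=dict)

    def attach(self, other: "Instance") -> None:
        """Register a microkernel instance as an extension of this main instance."""
        if self.kind is not InstanceKind.MAIN:
            raise PermissionError("only the main operating instance maintains resources")
        if other.kind is InstanceKind.MAIN:
            raise ValueError("main instances are not extensions of each other")
        self.resources[other.name] = other


main = Instance("desk-pc", InstanceKind.MAIN)
main.attach(Instance("printer-1", InstanceKind.EMBEDDED))
main.attach(Instance("render-host", InstanceKind.STANDALONE))
print(sorted(main.resources))  # ['printer-1', 'render-host']
```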
The main issue with respect to building a grid-like Service Oriented OS consists in identifying the required connectivity between building blocks and resources, and thus the candidates for outsourcing and remote usage. It is obvious that actual process management, execution, virtual memory access and all executable code need to be accessible via (a) local cache, so that these capabilities cannot be outsourced without the operating system failing completely as a consequence.

The key to all remoting consists in the capability of virtualising devices and resources on a low, operating-system level. Virtualisation techniques imply an enhanced interface that exposes the functionalities of the respective resource in a common structure [14] understandable to the operating system. Accordingly, such interfaces act as an intermediary between the device controller and the respective communication means and as such hide the controller logic from the main operating instance. Using pre-defined descriptions for common devices, this technique not only increases the “resource space”, but also reduces the negative impact that device drivers typically inflict upon the operating system. Instead, maintenance of the resource is completely separated from operating system and user context.

¹ In fact, this would impact performance in such a way that hardly any reasonable usage of the system would be possible anymore.
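The virtualisation interface described above, which exposes a resource in a common structure while hiding the controller logic, might look roughly like the following Python sketch. `VirtualDevice`, `VirtualPrinter` and the descriptor fields are hypothetical names chosen for illustration; the paper does not define a concrete interface.

```python
from abc import ABC, abstractmethod


class VirtualDevice(ABC):
    """Common structure the main instance sees; controller logic stays hidden."""

    @abstractmethod
    def describe(self) -> dict: ...

    @abstractmethod
    def write(self, data: bytes) -> int: ...


class VirtualPrinter(VirtualDevice):
    """Provider-side wrapper around a (hypothetical) printer controller."""

    def __init__(self):
        self._spool = []  # stands in for the real device controller

    def describe(self) -> dict:
        # Pre-defined description for a common device class (assumed fields).
        return {"class": "printer", "operations": ["write"]}

    def write(self, data: bytes) -> int:
        self._spool.append(data)  # a real wrapper would call the device driver here
        return len(data)


def send(device: VirtualDevice, payload: bytes) -> int:
    """Main-instance side: uses only the common interface, never the driver."""
    return device.write(payload)


print(send(VirtualPrinter(), b"hello"))  # 5
```

Because `send` depends only on the abstract interface, the provider can replace the driver behind `VirtualPrinter` without the main operating instance noticing.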
In the following sections, we will describe in more detail how, in particular, storage and computational resources will be integrated into and used by a Service Oriented Operating System.

3.2 Memory and Storage Resources
We may distinguish three types of storage, depending on speed and availability:
– local RAM: the fastest and most available type of storage for the computer. It is typically limited in size and is reserved for the executable code and its direct execution environment.
– local storage, such as the computer's hard drive, is already used as an extension to the local RAM for less frequent access or as a swapping medium for whole processes. It also serves as pure file storage.
– networked storage typically extends the local file storage and only grants slow access, sometimes with only limited reliability or availability.

All these resources together form the storage space available to the user. Classical operating systems distinguish these types on the application layer, thus leaving selection and usage to the user. With the Service Oriented Operating System, the whole storage is virtualised as one virtual memory, which the OS uses according to the processes' requirements, such as speed and size. In addition, the user will only be aware of the disk space distributed across local and remote resources. This implies that there is no direct distinction between local RAM and remote, networked storage. A process's variables and even parts of the code itself may be stored on remote resources if the right conditions apply (see computational resources below).

By making use of a “double-blind” virtualisation technique, it is ensured that providers can maintain their resources locally at their own discretion (i.e. most likely according to the operating system's requirements) without being impacted by the remote user. In other words, remote access to local resources is treated identically to local storage usage, with the corresponding swapping and fragmenting techniques, and without the remote user's OS (the main instance) having to deal with changes in allocation etc.

In the case of virtual memory access, “double-blindness” works as follows (cf. Figure 3): the calling process maintains its own local memory address range (VMem_proc), which on the global level is mapped to a globally unique identifier (VMem_global). Each actual providing resource must convert this global identifier to a local address space that reflects the actual physical storage in the case of embedded devices, or the reserved storage in the case of OS-driven devices (Mem_local).

Measuring Memory Usage. The Service Oriented Operating System monitors the memory usage of individual processes in two ways: according to the actual usage in the code (the usage context) and according to the access frequency of specific memory parts. Accordingly, performance and memory usage will improve over time, as the process behaviour is continuously analysed.
Fig. 3. Double blind memory virtualisation
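The two translation steps of the double-blind scheme (process-local address to global identifier, global identifier to provider-local storage) can be illustrated with a minimal Python sketch. All class and method names are hypothetical, and plain dictionaries and lists stand in for real address tables and memory.

```python
import itertools


class Provider:
    """Keeps its own mapping from global identifier to local storage (Mem_local)."""

    def __init__(self):
        self._slots = {}    # global identifier -> local slot index
        self._memory = []   # stands in for physical or reserved storage

    def allocate(self, gid):
        self._slots[gid] = len(self._memory)
        self._memory.append(None)

    def release(self, gid):
        del self._slots[gid]

    def compact(self):
        """Provider-side reorganisation; invisible to the remote user."""
        live = sorted(self._slots.items(), key=lambda item: item[1])
        self._memory = [self._memory[slot] for _, slot in live]
        self._slots = {gid: index for index, (gid, _) in enumerate(live)}

    def write(self, gid, value):
        self._memory[self._slots[gid]] = value

    def read(self, gid):
        return self._memory[self._slots[gid]]


class GlobalDirectory:
    """Maps process-local addresses (VMem_proc) to global identifiers (VMem_global)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._by_proc = {}  # (pid, process-local address) -> global identifier
        self._owner = {}    # global identifier -> providing resource

    def bind(self, pid, proc_addr, provider):
        gid = next(self._next_id)
        self._by_proc[(pid, proc_addr)] = gid
        self._owner[gid] = provider
        provider.allocate(gid)
        return gid

    def resolve(self, pid, proc_addr):
        gid = self._by_proc[(pid, proc_addr)]
        return self._owner[gid], gid


directory, remote = GlobalDirectory(), Provider()
first = directory.bind(pid=7, proc_addr=0x1000, provider=remote)
second = directory.bind(pid=7, proc_addr=0x2000, provider=remote)
remote.write(second, b"state")
remote.release(first)
remote.compact()  # the local slot of `second` moves from 1 to 0
provider, gid = directory.resolve(7, 0x2000)
print(provider.read(gid))  # b'state'
```

The process only ever uses its own address (0x2000), and the provider only ever sees the global identifier, so either side can reorganise its own layout without informing the other.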
The drawback of this continuous analysis is sub-optimal behaviour in the beginning; the advantage is that coders and users will not have to consider aspects related to remote resources. Notably, preferences per process may be stated, even though the operating system takes the final decision on resource distribution.

3.3 Computational Resources
Computational resources are in principle computers that allow remote users to deploy and run processes. Depending on the type of resource (cf. above), these processes are under the control of either the Service Oriented Operating System or the hosting OS. Computing devices do not directly extend the pool of available resources in the way storage resources do: as has often been shown by parallelisation efforts and by multi-core CPUs, doubling the number of CPUs does not imply doubling the speed. More precisely, a CPU can typically take over individual processes, but cannot jointly participate in executing a single process.

Parallel execution on current single-core CPUs is realised through process swapping, i.e. through freezing individual processes and switching to other scheduled jobs. The Service Oriented Operating System is capable of swapping across computational resources, thus allowing semi-shared process execution. By building up a shared process memory space, parts of the process can be distributed across multiple computational resources and still be executed in a linear fashion, as the virtual storage space forms a shared memory.

Measuring Process Requirements. The restrictions of storage usage apply in even stronger form to process execution: each process requires a direct hosting environment including the process status, the passing of which is time-critical. Whilst the hosting environment itself is part of the embedded or standalone kernels, the actual process status (cache) is not. Therefore, the Service Oriented Operating System analyses the relationship between process parts and memory usage in order to identify distributable parts.

Typical examples are applications with a high degree of user interaction on the one hand, but a set of computing-intensive operations running in the background upon request on the other, such as CAD and rendering software, which requires high availability and interactivity during modelling, but is decoupled from the actual rendering process.

Fig. 4. Invoking remote code blocks from a local interactive interface

3.4 Other Resources
Other devices, such as printers, DVD writers or complex process enactors, all communicate with the main operating instance via the local OS instance (embedded / standalone) using a common I/O interface. Main control over the resource is retained by the provider, thus allowing for complete provider-side maintenance: not only can local policies be enforced (such as usage restrictions and prioritisation), but the actual low-level resource control, i.e. the device drivers, is also maintained by the provider, thus decoupling it from the user-side maintenance tasks. This obviously comes at the cost of less control from the user side and accordingly fewer means to influence the details of individual resources, as all are treated identically. As an example, a printer able to punch holes into the printouts will not be able to expose this capability to the main operating instance, as it is not a commonly known feature. Notably, the protocol could be extended to cater for device-specific capabilities too; however, it is the strong opinion of the authors that such behaviour is not always desirable.

As with the original Grid, resources should be made available according to need, i.e. printers upon a print-out request, CPUs upon execution of a large number of computing-intensive processes, storage upon shortage etc. Typically, in these cases no special requirements are placed on these resources, so that simple context-driven retrieval and integration is sufficient. We can assume without loss of generality that all additional requirements are specified on the application level.
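The hole-punching printer example can be made concrete with a short Python sketch of how a common I/O description filters device-specific features. The table of commonly known features is an assumption made for illustration; the paper does not list which operations the common interface contains.

```python
# Assumed set of commonly known operations per device class (illustrative only).
COMMON_FEATURES = {
    "printer": {"print", "duplex"},
    "storage": {"read", "write"},
}


def exposed_features(device_class, device_features):
    """Return the features visible to the main operating instance.

    Anything outside the common description of the device class is dropped.
    """
    known = COMMON_FEATURES.get(device_class, set())
    return sorted(known & set(device_features))


print(exposed_features("printer", ["print", "duplex", "hole_punch"]))
# ['duplex', 'print']
```

Extending the protocol for device-specific capabilities would correspond to letting providers add entries to such a table, which is the option the authors consider not always desirable.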
4 Putting Service Oriented Operating Systems to Use
The benefits of such a system may not be obvious straight away, though its applications are manifold: with remote resources being available on a low operating system layer, applications and data can be agnostic to their distribution across the internet (leaving issues of speed aside). This leads to a variety of use cases, which are detailed in this section.

4.1 Mobile Office
With code and data being distributed across the internet in principle, the execution of applications and their usage context becomes mostly independent of the runtime environment of the main system and its resources. Instead, the same application could be hosted and maintained remotely, whilst the actual user accesses the full environment from a local low-end computer with the right network connection, as only direct interfaces and interactive code parts may have to be hosted locally. Accordingly, an employee could in principle use the full office working environment from any location with suitable web access (cf. also section 5) with any device of his or her choice. By using the main machine to maintain the user profile and additional information, such as for identification, the user can also recreate the look and feel of the typical working environment regardless of his or her current location. This will allow better integration of mobile workers in a company, thus realising the “mobile office”.

4.2 Collaborative Workspaces
The IST project CoSpaces [15] is examining means to share applications across the internet so as to allow concurrent access to the same data via such an application, i.e. to work collaboratively on a dataset, such as a CAD design. One of the specific use cases of CoSpaces consists in a mechanical engineer trying to repair a defective aircraft without knowing the full design: through CoSpaces he will not only get access to the design data, but may also invite experts who have better knowledge of the problem and the design than he has. Using a common application, the experts can discuss the problem and its solution together and directly communicate the work to be done in order to fix it. This implies that all people involved in the collaboration can see the same actions, changes etc., but can also work on the data concurrently.

With a Service Oriented Operating System, application sharing becomes far simpler: as the actual code may be distributed across the web, local execution will require only the minimal interfaces for interaction, whilst the actual main code may be hosted remotely. With the right means to resolve conflicting actions upon both code and data, and with network-based replication support (cf. section 5), it will be possible to grant concurrent access to common code and parameter environment parts, thus effectively using the same application across the network in a collaborative fashion.

CoSpaces is currently taking a web service based approach, which does not provide the data throughput necessary for such a scenario, but is vital for elaborating the detailed low-level protocols and the high-level management means which would have to be incorporated into the OS to meet all ends.
4.3 Distributed Intelligent Virtual Organisations
A classical use case for distributed systems consists in so-called “Virtual Organisations” (VO), where an end user brings together different service providers so that they can collectively realise a task none of them would have been able to fulfil individually. Typically, this involves execution of a cross-organisational workflow that distributes and collects data or products from the individual participants, thus generating a larger product or result.

Currently, VO support middlewares and frameworks are still difficult to use and typically require a lot of technical expertise from the respective user, ranging from deployment issues and policy descriptions to specification integration and application / service adaptation. Moreover, the overall business policies and goals of the user are hardly ever respected or even taken into consideration.

The IST project BREIN [1] is addressing this issue from a user-centric point of view, i.e. trying primarily to make the use of VO middleware systems easier for the user by adding self-management and intelligent adaptation features as a kind of additional management layer on top of the base support system. This implies enhancements to the underlying infrastructure layer that is responsible for actual resource provisioning and maintenance.

With a Service Oriented Operating System, remote resources can be linked on the OS level as if locally available. Accordingly, execution of distributed workflows shifts the main issues related to discovery, binding and maintenance down to the lower layer. This allows the user to generate execution processes over resources in the same way as executing simple batch programs: by accessing and using local resources and not caring about their details. User profiles may be exploited to define discovery and integration details, which again leads to the problem of the technical expertise needed to fully exploit the respective capabilities. Future versions of the Service Oriented Operating System (SOS) may also consider more complex services as part of the low-level resource infrastructure. The BREIN project will deliver results with respect to how the corresponding low-level protocols need to be extended and, in particular, how the high-level user interface of the OS needs to be enhanced to allow optimal representation of the user's requirements and goals.
5 Future Network Requirements
Like most web-based applications, the Service Oriented Operating System profits most from an increase in the speed, reliability and bandwidth of the network. The actual detailed requirements are defined more by the respective use case than by the overall system as such. However, there is a set of base requirements and assumptions that recur in recent network research, and which can in fact be approached on different levels and with different technologies. Most prominent among these are obviously virtualisation technologies, which hide the actual context and details of the resources and endpoints from the overlying system (cf. [1], [16], [17]). This not only allows for easier communication management, but in particular for higher dynamicity and mobility, as the full messaging details do not need to be communicated to all endpoints and protocols do not need to be renegotiated every time.

Other issues along the same line involve secure authentication, encryption and in particular unique identification under highly dynamic circumstances, such as in mobile environments. Intermediary servers and routers may be used for caching and / or as resources themselves, but should not be overloaded with additional tasks of name and location resolution, in particular if the latter are subject to change. With the current instability of network servers, the overhead for (re)direction reliability just diminishes the efficiency of the system.

Resources can be treated both as consumers and as providers in an SOS network, meaning that hosting (serving) and consumption of messages need to be treated at the same level, i.e. consumers need to be directly accessible without intermediary services. The IPv6 protocol [18] offers most of the base capabilities for stable direct consumer access under dynamic conditions.

What is more, to increase the performance of the system, context information related to the connectivity and reliability of the respective location needs to be taken into consideration to ensure stable execution on the operating system level. Protocols such as SIP allow communication of location-sensitive information [19], but there is as yet no approach that exploits network-traffic-related quality information in the context of traffic control and automatic protocol adaptation via location-specific data which is heuristically sufficient to indicate average performance issues.
In the following we elaborate the specifics in the context of the individual use cases in some more detail:

Mobile Offices do not put forward requirements with respect to concurrent access (as the other scenarios do) but are particularly sensitive to performance and reliability issues, as the main goal tends towards distributed application execution with little to no performance loss. We furthermore need to distinguish between access to local (intranet) and remote resources. Working towards uniformity in both usage and production, both connectivity types should make use of the same physical network, ideally over a self-adaptive protocol layer. This would reduce the need for dedicated docking stations. As we will see, the security requirements in this scenario are of lower priority than they appear to be: even though the corporate network may grant secure access to the local data structures, authentication and encryption do not require the same dynamicity support as in the other scenarios.

Collaborative Workspaces share access to common applications and need to distinguish between two types of data that are identical from the application point of view: user-specific and shared information. Though access is performed identically, the network may not grant cross-user access to private information, whilst shared data must be subject to atomic transactions to reduce potential conflicts. Note that the degree of transaction freedom may be subject to the privileges of individual roles.
Principles of Service Oriented Operating Systems
Accordingly, the network needs to provide and maintain role- and profile-specific information that can be used for secure authentication, access restriction and in particular for differentiation between individual code-data-user relationships. Further to this, replication can be supported through low-level multiplexing according to context-specific information, thus reducing the management overhead for the operating system. In other words, extended control mechanisms of the higher layer may be exploited to control replication on the network layer. By making use of unique dynamic identifiers as discussed above, this would allow usage-independent access to data even in a replicated state and with corresponding dynamic shifting of this replicated data.

Distributed VO Operations do maintain access restriction according to dynamic end-user profiles, but typically do not share applications. Accordingly, most relevant for this scenario is data maintenance across the network, i.e. conflict management and atomic transactions to reduce the impact of potential conflicts.
6
Summary and Implications
We introduced in this paper a next generation operating system that realises grid on a low OS level, thus allowing for full integration and usage of remote resources in a similar fashion to local ones. With such an operating system, management and monitoring of resources, workflow execution over remote resources etc. is simplified significantly, leading to more efficient grid systems, in particular in the areas of distributed computing and database management. Obviously, such an operating system simplifies resource integration at the cost of flexibility and standardised communication on this low layer. In other words, all low-level control is passed to the OS and influence by the user is restricted. For the average user this is a big advantage, but it increases the effort of kernel maintenance in case of required updates from the standardisation side.

As it stands so far, the Service Oriented Operating System does not cater for reliability issues, i.e. it treats resources according to their current availability. This means that the Operating System will shift jobs and files to remote resources independently of potential reliability issues. Accordingly, code and data blocks may become invisible during execution and even fail, leading to unwanted behaviour and system crashes. Different approaches towards increasing reliability exist, ranging from simple replication to device behaviour analysis. The approach will differ on a case-by-case basis, as it will depend on the requirements put towards the respective processes and/or data. Notably, the system can in principle extend local execution reliability when replicating critical processes. Along that line, it is not yet fully specified how these requirements should be defined and on what level. It is currently assumed that the user or coder specifies these directly per process or file. However, in particular for commonly used devices (printers, storage etc.) it may be sensible to define and use user profiles that represent the most typical user requirements under specific circumstances and could thus serve as the base setup for most resources.
As the system is still in its initial stage, no performance measurements exist as yet that could quantify the discussions above, in particular since reliability issues have not been covered so far. Such measurements would also have to cover different network conditions, i.e. bandwidth, reliability etc., in order to enable a more precise definition of the requirements towards future networks.
Acknowledgments

This work has been supported by the BREIN (http://www.gridsforbusiness.eu) and the CoSpaces (http://www.cospaces.org) projects and has been partly funded by the European Commission’s IST activity of the 6th Framework Programme under contract numbers 034556 and 034245. This paper expresses the opinions of the authors and not necessarily those of the European Commission. The European Commission is not liable for any use that may be made of the information contained in this paper.
References

1. BREIN: Business objective driven reliable and intelligent grid for real business, http://www.eu-brein.com
2. Tanenbaum, A.S.: Modern Operating Systems. Prentice Hall PTR, Upper Saddle River (2001)
3. EGEE: Enabling grids for e-science, http://www.eu-egee.org
4. SalesForce: force.com - platform as a service, http://www.salesforce.com
5. Mitchell, D.: Defining platform-as-a-service, or PaaS (2008), http://bungeeconnect.wordpress.com/2008/02/18/defining-platform-as-a-service-or-paas
6. Hinchcliffe, D.: The next evolution in web apps: Platform-as-a-service, PaaS (2008), http://bungee-media.s3.amazonaws.com/whitepapers/hinchcliffe/hinchcliffe0408.pdf
7. Wesner, S., Schubert, L., Dimitrakos, T.: Dynamic virtual organisations in engineering. In: Computational Science and High Performance Computing II. Notes on Numerical Fluid Mechanics and Multidisciplinary Design, vol. 91, pp. 289–302. Springer, Berlin (2006)
8. Wilson, M., Schubert, L., Arenas, A.: The trustcom framework v4. Technical report (2007), http://epubs.cclrc.ac.uk/work-details?w=37589
9. Jun, T., Ji-chang, S., Hui, W.: Research on the evolution model for the virtual organization of cooperative information storage. In: International Conference on Management Science and Engineering, 2007. ICMSE 2007, pp. 52–57 (2007)
10. van Engelen, R.A., Gallivan, K.: The gsoap toolkit for web services and peer-to-peer computing networks. In: Proceedings of the 2nd IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2002), pp. 128–135 (2002)
11. Head, M.R., Govindaraju, M., Slominski, A., Liu, P., Abu-Ghazaleh, N., van Engelen, R., Chiu, K., Lewis, M.J.: The gsi plug-in for gsoap: Enhanced security, performance, and reliability. In: ITCC conference 2005, pp. 304–309 (2005)
12. Ng, A., Greenfield, P., Chen, S.: A study of the impact of compression and binary encoding on SOAP performance. In: Schneider, J.G. (ed.) The Sixth Australasian Workshop on Software and System Architectures: Proceedings, pp. 46–56. AWSA (2005)
13. Smith, J.: Inside Windows Communication Foundation. Microsoft Press (2007)
14. Nash, A.: Service virtualization – key to managing change in SOA (2006)
15. CoSpaces: Cospaces project website, http://www.cospaces.org
16. 4WARD: The 4ward project, http://www.4ward-project.eu
17. IRMOS: Interactive realtime multimedia applications on service oriented infrastructures, http://www.irmosproject.eu/
18. Hinden, R., Deering, S.: IP version 6 addressing architecture (RFC 2373). Technical report, Network Working Group (1998)
19. Polk, J., Brian Rosenberg, J.: Session initiation protocol location conveyance. Technical report, SIP Working Group (2007)
Preliminary Resource Management for Dynamic Parallel Applications in the Grid

Hao Liu, Amril Nazir, and Søren-Aksel Sørensen

Department of Computer Science, University College London, United Kingdom
{h.liu,a.nazir,s.sorensen}@cs.ucl.ac.uk
Abstract. Dynamic parallel applications such as CFD-OG pose a new problem for distributed processing because of their dynamic resource requirements at run-time. These applications are difficult to accommodate in the current distributed processing model (such as the Grid), because they lack an interface for communicating directly with the runtime system and because resource allocation is delayed. In this paper, we propose a novel mechanism, the Application Agent (AA), embedded between an application and the underlying conventional Grid middleware to support dynamic resource requests on the fly. We introduce AA's dynamic process management functionality and its resource buffer policies, which efficiently store resources in advance to maintain the execution performance of the application. Finally, we describe the implementation of AA.

Keywords: resource management, dynamic parallel application, resource buffer.
conventional Grid middleware for resource management. We also propose two resource buffer policies to maintain the application performance cost effectively while the resource demands change constantly.
2 Dynamic Parallel Applications

Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics that uses numerical methods and algorithms to solve and analyze problems that involve fluid flows. One method to compute a continuous fluid is to discretize the spatial domain into small cells to form a volume mesh or grid, and then apply a suitable algorithm such as the Eulerian methodology [18] to solve the equations of motion. The approach assumes that the values of physical attributes are the same throughout a volume. If scientists need a higher resolution of the result, they have to replace a volume element with a number of smaller ones. Since the physical conditions change very rapidly, high resolution is needed dynamically to represent the flux (amount per time unit per surface unit) of all the physical attributes with sufficient accuracy. Consequently the whole volume grid changes constantly, which may lead to dynamic resource requirements (Fig. 1).
Fig. 1. A series of slides of the flow across a hump. The grid structure changes constantly at run-time to maintain a reasonable resolution. The resource requirements in the last slide are around 120 times those in the first slide [12].
The usual method for introducing high resolution is to replace a volume element with a number of smaller ones. This is a difficult process in traditional CFD because it uses a Cartesian grid of indexed volumes ($V_{ijk}$). Introducing smaller volumes into such a structure would require further hierarchical indexing, leading to complex treatment. An alternative approach, introduced by Sørensen, is the Object Graph (CFD-OG) [12]. A CFD-OG application has two base objects: cells and walls. A cell represents the fluid element and holds the physical values for a volume. It is surrounded by a number of wall objects known to it by their addresses (pointers). A cell simply holds a number of wall pointers. The walls, on the other hand, only know two objects: the cells on either side of the wall. This is a simple structure that is completely unaware of the physical geometry and topology of the model. The advantage of the object graph is that it uses reference by addressing only. It is therefore possible to change the whole grid topology when smaller volume elements are introduced. This approach
also benefits distributed processing, since it is easy to substitute the local addresses (pointers) with global addresses (host, process, pointer) without changing the structure. The objects can therefore be distributed over the networked computer nodes in any manner. With the object graph, a distributed CFD-OG application can be very dynamic and autonomic. The application is composed of a number of processes executing synchronously, and normally one process runs on one node. In this context, each process holds a number of computational objects (cells and walls) that can migrate from one process to another. As the application progresses, a built-in physical manager monitors the local conditions. If it detects that the local resolution is too low, it asks the built-in object manager to introduce smaller volumes. If, on the other hand, it detects superfluous resolution, it asks to replace several smaller cells with fewer larger ones. This may create imbalances in the processing, and the object manager may subsequently attempt to correct them by moving objects onto lightly loaded nodes. Such load balancing is limited. The application may have to demand additional resources (processors) immediately to maintain the required performance, such as a certain execution time progression. Likewise, the application may want to release redundant allocated resources when a low resolution is acceptable. Dynamic resource requirements on the fly are a huge challenge for the current distributed environment, for two reasons. First, there is no well-defined interface for applications to communicate with the Grid to add or release resources at run-time. Second, a significant delay is normally associated with the allocation of a resource in response to a request, due both to resource competition between running jobs and to the time conflict of concurrent requests, and this delay disturbs the smooth execution of the application.
For the first problem, we define a mechanism which has several functions that applications can call to add and release resources during the execution in a way that is independent from resource managers. For the second problem, we propose the resource buffer management to hide the delay of resource allocation.
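The cell and wall structure of the object graph described in this section can be illustrated with a minimal Python sketch. The class names, the `connect` helper and the `neighbours` method are assumptions made for illustration and are not taken from the CFD-OG implementation; the sketch only shows that cells hold wall references, walls hold the two adjacent cells, and no geometric index is needed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Wall:
    # A wall only knows the two cells on either side of it.
    left: Optional["Cell"] = None
    right: Optional["Cell"] = None


@dataclass(eq=False)
class Cell:
    value: float                                  # physical value held for the volume
    walls: List[Wall] = field(default_factory=list)

    def neighbours(self):
        """Return the cells reachable through this cell's walls."""
        result = []
        for wall in self.walls:
            other = wall.right if wall.left is self else wall.left
            if other is not None:
                result.append(other)
        return result


def connect(a, b):
    """Join two cells with a shared wall (hypothetical helper)."""
    wall = Wall(left=a, right=b)
    a.walls.append(wall)
    b.walls.append(wall)
    return wall


# Three cells in a row: the topology lives purely in the references.
coarse = Cell(1.0)
fine_a, fine_b = Cell(1.0), Cell(1.0)
connect(coarse, fine_a)
connect(fine_a, fine_b)
print(len(fine_a.neighbours()))
```

In a distributed setting, each reference in this sketch would be replaced by a global address of the form (host, process, pointer), as the text describes, without changing the structure itself.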
3 Dynamic Resource Support

As introduced, a dynamic parallel application is composed of a number of processes that hold computational objects, which the application itself can migrate from one process to another. Therefore, when the application needs an additional resource, it requests the deployment of a new process into which it can move objects. The resource requirement is therefore interpreted as adding a new process that initially holds no objects. We define a mechanism called Application Agent (AA) that sits between applications and Grid middleware to support such dynamic addition and release of processes. Programmers must build applications on the programming interface provided by AA. Each process's binary file therefore has to be compiled with the AA library. This means that each running process is associated with an AA, which keeps information synchronized across the whole application. Programmers can perform three basic operations inherited from AA: add a process, stop a process, and exchange messages with a process. As soon as the application is running, AA will start and act as an agent
to communicate with the Grid environment on behalf of the application, i.e. to find a new node and deploy a new process, stop a process and return a node, and transport messages during the execution of the application. The process requests are served by a scheduler that is associated with AA. The scheduler can apply different scheduling policies (e.g. FCFS (First Come First Served), priority-based) according to the preference of the application. We do not address the scheduling issue in this paper.

3.1 Adding a Process

AA allows the application programmer to start a named process at any time during the execution. The process request is performed using AddProcess( Name of executable ), which returns an integer ID that is subsequently used to address the process. The function AddProcess() is non-blocking. As soon as it is invoked, AA checks whether its Resource Buffer (RB) holds an additional prepared process (an idle process that has previously been deployed on a new node). If so, it immediately returns the process ID to the application, which can activate this process for migrating objects. This is called a "successful request". If not, it contacts the underlying Grid scheduler such as Condor [3], SGE [1] or GRAM [7] to request an allocation, simultaneously returning the integer 0 to the application. This is called a "failed request". Since in a non-dedicated Grid environment the amount of time it takes for a process to be allocated is not bounded, it is the programmer's responsibility to request a process again if the previous request has failed. Once the new process is deployed, AA stores the ID of the process in the RB and returns it to the application as soon as the next request is performed. One problem with the dynamic addition of processes is that current Grid schedulers do not support dynamic resource co-allocation and deployment.
They treat each process added later as an independent single job and do not deploy it so that it can communicate with the other processes of the parallel application. One approach (approach A) to this problem is that once the process is allocated by a scheduler, the AA associated with this new process broadcasts its address to the existing processes so that they can communicate. This is achieved by registering the process's address in the RB, which is synchronized throughout the application. During the registration, the new process is assigned a unique global ID for further use by the application. This approach can be implemented with general socket programming or on top of a distributed computing framework (e.g. CORBA [10], Jini [11]). An alternative approach (approach B) is to use probes for resource allocation, while the actual process startup is accomplished by AA itself. A probe is a small piece of deployment code that does not perform any computation for the application. The probe runs until AA kills it, in order to hold the node. Once the probe is allocated by the scheduler, it configures the node for AA use and reports its node address to the master AA, which takes charge of process spawning. The master AA then transfers the process binary onto that node and starts the process. This approach can make full use of existing process management systems (e.g. PVM [6], LAM/MPI [2]), which return a process identifier that the application uses to address the process. Once the new process is deployed, the object manager of the application can migrate the target objects into this process for load balancing. The migration procedure is beyond the scope of this paper.
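The non-blocking behaviour of AddProcess() described in this section can be illustrated with a small Python model. This is a sketch under stated assumptions, not the AA library: the class `ApplicationAgent`, the method names and the `on_process_deployed` callback are hypothetical, and the call to the Grid scheduler is replaced by a counter.

```python
from collections import deque


class ApplicationAgent:
    """Toy model of AA's non-blocking AddProcess(); all names are illustrative."""

    def __init__(self):
        self.resource_buffer = deque()   # IDs of prepared (idle) processes
        self.pending = 0                 # allocations requested from the scheduler
        self._next_id = 1

    def add_process(self, executable):
        """Return a process ID on a successful request, 0 on a failed one."""
        if self.resource_buffer:
            return self.resource_buffer.popleft()
        # Stand-in for contacting Condor, SGE or GRAM. How repeated failed
        # requests are de-duplicated is not specified in the text.
        self.pending += 1
        return 0

    def on_process_deployed(self):
        """Hypothetical callback: the scheduler has deployed a new process."""
        if self.pending:
            self.pending -= 1
        self.resource_buffer.append(self._next_id)
        self._next_id += 1


aa = ApplicationAgent()
first = aa.add_process("worker")     # buffer empty: failed request
aa.on_process_deployed()             # the allocation completes later
second = aa.add_process("worker")    # the retry is served from the buffer
print(first, second)
```

The retry in the last step reflects the statement that it is the programmer's responsibility to request a process again after a failed request.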
3.2 Stopping a Process

Programmers may want to stop a process and release its node when the application's resource requirement is low. The application first vacates the process autonomically (by migrating its computational objects) and then invokes the function StopProcess( id ). As a result, the process with the given ID is removed from the application and the related computer node is disassociated. However, AA may still keep this prepared process for possible future requests from the application. The decision is made according to AA's buffer policies (discussed in the next section). If AA does decide to release the node, it contacts the local scheduler to kill the vacated process and finally returns the node to the pool.

3.3 Communication

Technically, AA could either have a new communication mechanism, if approach A is used for adding processes, or use PVM/MPI as the lower communication service, if approach B is used. In the latter case, programmers can still use PVM/MPI routines for communication and use AA's routines for process management.
4 Resource Buffer

In non-dedicated Grid environments the amount of time it takes until a process is allocated is not bounded. In order to satisfy process addition requests on demand at run-time, we propose the Resource Buffer (RB) to hide the delay of process allocation. An RB stores a number of prepared processes that can be returned to the application immediately when it calls AddProcess(). AA manages the RB by requesting the allocation of processes from schedulers in advance, following the RB policies, so that the application's demands are satisfied. In order to measure the RB performance, we propose two simple metrics: Request Satisfaction S and Resource Waste W. Since process release requests do not benefit from the policies, the Request Satisfaction is defined as the fraction of successfully added processes out of the total number of process addition requests. If S approaches 1, the dynamic application can run smoothly, since all of its dynamic resource requests are satisfied on demand. The Resource Waste is defined as the accumulated total time that prepared processes stay in the RB:

$$W = \sum_{t=0}^{T_{exe}} R_t$$

where $T_{exe}$ is the total execution time and $R_t$ is the number of idle processes in the RB at time $t$. If the RB policy is good enough, W should approach 0 while S should approach 1.

In order to investigate the RB policies, we made a few assumptions regarding the dynamic parallel application:
– 1. It is a parallel iterative application.
– 2. Its dynamic behavior is not random. Similar to CFD-OG, the application only requests resources at certain stages (e.g. during iterations 100 ~ 110) when the physical conditions change.
– 3. It only requests to add processes but not to release processes.
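The two metrics defined above can be computed from a recorded run as in the following sketch. The function names and the sample trace are assumptions made for illustration; only the definitions of S and W come from the text.

```python
def request_satisfaction(successful, total_requests):
    """S: fraction of AddProcess() requests that were served successfully."""
    if total_requests == 0:
        # Assumption: with no requests at all, nothing was left unsatisfied.
        return 1.0
    return successful / total_requests


def resource_waste(idle_counts):
    """W: sum over time steps t = 0..T_exe of R_t, the idle processes in the RB."""
    return sum(idle_counts)


# Hypothetical trace: number of idle prepared processes at each time step.
idle_trace = [1, 1, 0, 0, 2, 1]
S = request_satisfaction(successful=9, total_requests=10)
W = resource_waste(idle_trace)
print(S, W)
```

A good policy in the sense of the text keeps S close to 1 while keeping W close to 0.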
4.1 Policies

We consider two corresponding heuristic policies, the Periodic (P) policy and the Periodic Prediction (PP) policy, to manage the RB. The P policy periodically reviews the RB to ensure that the RB is kept to a predetermined process amount $R_L$ at each predefined interval $t_p$. For every time interval, the number of processes $N$ requested is:

$$N = \begin{cases} +(R_L - R), & R \le R_L \\ -(R_L - R), & R > R_L \end{cases}$$

where $R_L$ is the request (threshold) level and $R$ is the current number of prepared processes in the RB at each periodic level. "+" means requesting the allocation of a node and deploying a new process, and "-" means requesting the release of a process and returning the node to the pool. The value of $R_L$ is crucial for the P policy. $R_L$ can be determined by the maximum number of requests that the application could consecutively make. Generally, if the application requests processes frequently, $R_L$ will be higher.
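The rule of the P policy can be written as a short function. This is a minimal sketch: the function name and the (action, count) return convention are assumptions, while the two cases follow the definition of N above.

```python
def periodic_policy(request_level, prepared):
    """One periodic review of the RB under the P policy.

    request_level is R_L and prepared is R, the current number of prepared
    processes. Returns the action to take and how many processes it affects.
    """
    if prepared <= request_level:
        # "+" case: allocate nodes and deploy R_L - R new processes.
        return ("request", request_level - prepared)
    # "-" case: release the R - R_L surplus processes back to the pool.
    return ("release", prepared - request_level)


print(periodic_policy(3, 1))
print(periodic_policy(1, 4))
```

With R_L = 3 and one prepared process, the review asks for two more processes; with R_L = 1 and four prepared processes, it releases three.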
Fig. 2. AA periodically predicts the number of requests the application will make. E.g. when the application is running at iteration 5, AA predicts the period from iteration 25 to 30.
If the time $T_{alloc}$ that the Grid scheduler takes to allocate a process in the cluster is known and relatively stable, we can use the PP policy. The PP policy periodically (at interval $t_p$) predicts the number of requests that the application will make after a time $\theta$ (Figure 2). If AA predicts that the application will make $n$ requests after $\theta$, AA places $n$ requests with the scheduler immediately. We let $\theta = \max T_{alloc}$, which is approximately equal to every $T_{alloc}$ because the allocation time is stable. The application is then able to use a process placed in advance right after it has been allocated. PP uses the application's historical execution behavior for the prediction. Since the application's execution behavior may vary with its execution environment, PP predicts the probability of a request in an iteration range (e.g. iteration 25 ~ iteration 30, Figure 2) rather than in a single iteration. The probability $P_{i \sim i+range}$ for the future iterations $i \sim i+range$ is calculated as

$$P_{i \sim i+range} = \sum_{k=i}^{i+range} r_k / times$$

where $r_k$ is the total number of requests at iteration $k$ during the recorded executions, $times$ is the number of recorded historical executions, and $range$ is the number of iterations the prediction covers. The number of requests $n_{i \sim i+range}$ that the application is predicted to make in iterations $i \sim i+range$ can be calculated as $\lfloor P_{i \sim i+range} \rfloor$.
In order to avoid redundant processes in the RB, PP also has a request (threshold) level RL. When the current number of prepared processes R exceeds RL, AA starts to release processes. The full pseudo-code for the PP policy in AA is shown in Figure 3.
Fig. 3. Pseudo-code for PP policy
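The prediction step of the PP policy, as described in the text, can be sketched in Python as follows. This is not a reproduction of the pseudo-code in Figure 3; it is an illustrative reading of the prose. The data layout of `history`, the function names, and the conversion of θ into a number of iterations (`theta_iters`) are all assumptions.

```python
import math


def predicted_requests(history, start, span):
    """n = floor(P), with P = sum of r_k / times for k = start..start+span.

    history holds one dict per recorded execution, mapping an iteration
    number to the number of requests made at that iteration.
    """
    times = len(history)
    if times == 0:
        return 0   # no recorded executions yet, so nothing can be predicted
    total = sum(run.get(k, 0)
                for run in history
                for k in range(start, start + span + 1))
    return math.floor(total / times)


def pp_step(history, current_iter, theta_iters, span, prepared, request_level):
    """One periodic review: (processes to request now, processes to release)."""
    n = predicted_requests(history, current_iter + theta_iters, span)
    release = max(0, prepared - request_level)   # applies only when R exceeds R_L
    return n, release


# Three hypothetical recorded executions with requests around iterations 25..30.
history = [{26: 1, 27: 1}, {27: 1}, {25: 1, 28: 1}]
print(predicted_requests(history, 25, 5))
print(pp_step(history, current_iter=5, theta_iters=20, span=5,
              prepared=0, request_level=1))
```

In this example the five recorded requests over three executions give P = 5/3, so one request is placed in advance at iteration 5 for the window starting at iteration 25, matching the situation shown in Figure 2.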
4.2 Simulation and Results

Based on the assumptions we made, we use Clown [17], a discrete-event simulator, to simulate a simple iterative application whose behavior is similar to that of a real dynamic parallel application. Initially, this application has three processes, each of which has 10 computational objects. To simulate the dynamic behavior, one of the processes generates 10 objects every 100 iterations. As the number of objects increases, the application detects that the execution speed is dropping and responds by requesting additional resources/processes through AddProcess() to balance the computation and ensure smooth execution. If AA cannot satisfy the requests at some stage, the application runs slower with insufficient resources and requests to add processes again in the next iteration. In order to simulate the execution of the application, we simulate 500 homogeneous computer nodes where the average speed $sp_{node}$ is 11.2 (it takes 11.2 simulation time units to complete 1 object, and 112 to complete 10 objects). The execution speed fluctuates in each run according to a Gaussian distribution with variance equal to 1.0. In the cluster, the time $T_{alloc}$ it takes to allocate a node is Gaussian distributed with average 1000 and variance 100. The simulated execution speed $sp_{exe}$ (the time it takes to run 1 iteration) with S = 1 can be monitored
as ≈ 120. The main purpose of the simulation is to evaluate the effectiveness of the proposed buffer policies in the simulated environment. For each policy, we run the application 100 times. In each execution, the application runs for 3000 iterations. Figure 4 shows the value of S for both policies during the 100 executions. The Request Satisfaction of the P policy is slightly higher than that of the PP policy. For the PP policy, S is very low in the beginning and increases to 90% around the 5th execution. This is because PP is based on historical information, and in the beginning there is not enough information to predict the execution behavior. During the 100 executions, both policies provide a reasonable S (S > 80%). The small fluctuations are caused by the unstable $T_{alloc}$ and $sp_{node}$.
Fig. 4. The Request Satisfaction S of both policies during 100 executions. PP: range = 5, RL = 1, tp = 5 iterations. P: RL = 1, tp = 600.
Fig. 5. The Resource Waste W of both policies during 100 executions. PP: range = 5, RL = 1, tp = 5 iterations. P: RL = 1, tp = 600.
Figure 5 shows the value of W for both policies during the 100 executions. The PP policy is far more efficient than the P policy. The W of PP is relatively high in the first few executions, since historical information is insufficient. W then drops rapidly through the learning process and stays low, with small fluctuations, for the rest of the executions. The results show that both proposed policies have their pros and cons. The P policy provides slightly higher Request Satisfaction but also leads to very high Resource Waste. The PP policy is more advanced: it can perform simple prediction and make requests at the right time. However, the current PP policy is not suitable for clusters where $T_{alloc}$ is random and unpredictable, whereas the P policy is suitable for any environment.
5 Implementation

The current implementation of AA is based on PVM. The process management follows approach B. Requests are served according to the FCFS policy. The test environment is a Condor pool of 50 Linux machines. The cluster has NFS (Network File System) installed and each machine has PVM installed.
All binaries and related files of an AA-enabled application must be placed in an NFS-mounted directory. The application is first started on the Condor submitting machine. The first process started is called the master process and manages the whole application. Its associated AA is called the master AA. When AA decides to add a process for the application (on receiving a request or according to the RB policies), it first asks the master AA to submit a probe program to Condor, specifying the resource requirements of the application (e.g. Arch == "INTEL") in the submit file. Once the probe is allocated, it notifies the master AA by writing the node information into an XML file on the NFS. AA keeps reading this file. Once it finds that a new node has been added, it immediately adds that node to its virtual machine with pvm_addhosts(). It then starts a process on that node with pvm_spawn() and stores the process ID in its RB. Since all the binaries are on the NFS, AA does not need to transfer any files. AA in its current implementation does not support heterogeneous deployment. It does allow heterogeneous processing as long as a suitable executable is present on the target node. When AA releases a process, it first stops the process with pvm_kill(), then excludes the node from its virtual machine with pvm_delhosts(). It finally returns the node to the pool by killing the probe via condor_rm. AA's message passing module is a C++ wrapper of PVM's interface.
6 Related Work

Condor-PVM [3] provides dynamic resource management for PVM applications executing in a Condor cluster. Whenever a PVM program asks for nodes, the request is re-mapped to Condor, which then finds a machine in the Condor pool via the usual mechanisms and adds it to the PVM virtual machine. This system is intended to integrate the resource management (RM) system (Condor) with the parallel programming environment (PVM) [14]. Although it supports runtime resource requests similar to those AA supports, it makes no provision for the performance of the application, e.g. buffer management. Moreover, the request scheduling for the application is managed entirely by Condor, which leaves no room for adding other application-level scheduling policies. Gropp et al. [9] introduce an extension of MPI that lets MPI applications communicate with the job scheduler to add and remove processes during the computation. It proposes a non-blocking call, MPI_IALLOCATE, that reserves processors from the scheduler and returns a set of MPI requests. The actual process startup can then be accomplished with the conventional MPI_START or MPI_STARTALL calls. That paper, however, does not provide detailed implementation information. DUROC [4] implements an interactive transaction strategy and mechanism for resource co-allocation in a multi-cluster environment. It accepts multiple requests, each written as an RSL expression and each specifying a subjob of an application. In order to tolerate resource failures and timeouts, some resources may be specified as "interactive" and "optional". Successful completion of a DUROC co-allocation request results in the creation of a set of application processes that are able to communicate with one another. An important issue in resource co-allocation is that the required resources have to be available at the same time, otherwise the computation cannot
proceed. In our model, by contrast, a dynamic parallel application can continue computing with insufficient resources and request additional resources via AA during the computation to maintain its ideal performance.
7 Conclusion and Future Direction

The contribution of this paper is the proposal of an application agent, AA, that supports the dynamic resource requirements of dynamic parallel applications such as CFD-OG. An AA-enabled application is able to add a new resource (deploy a new process) and release a surplus one (release a process) at run-time. To keep the application running smoothly, a Resource Buffer (RB) service embedded in AA is proposed to reduce the cost of waiting for resources. Two heuristic policies are introduced to examine how the RB can be managed more effectively and efficiently. The current version of AA is implemented with approach B (detailed in Section 3) and has been tested in a Condor cluster. As mentioned, this approach is restricted by the existing systems it builds on. For example, PVM can only join resources that are located in the same network domain, so AA cannot perform wide-area computing based on PVM. Some extensions (e.g. PMVM [13]) enable PVM to create multi-domain virtual machines. Future work will involve implementing and testing AA in a multi-domain environment. We aim to investigate whether the virtual machine architecture applies in this setting or whether it is more appropriate to apply approach A, which loosely links distributed processes across the network. The security problems arising from a multi-domain environment will also be addressed. The RB policies also need more precise investigation. The two policies will be tested further with two real-world dynamic parallel applications, CFD-OG and RUNOUT [16], and extended to react intelligently to changes in the resource environment so that the smooth execution of the application is not affected.
References
1. N1 Grid Engine 6 Administration Guide. Technical report, Sun Microsystems, Inc.
2. Burns, G., Daoud, R., Vaigl, J.: LAM: An Open Cluster Environment for MPI. In: Proceedings of Supercomputing Symposium, pp. 379–386 (1994)
3. Condor: Condor online manual version 6.5, http://www.cs.wisc.edu/condor/manual/v6.5
4. Czajkowski, K., Foster, I.T., Kesselman, C.: Resource co-allocation in computational grids. In: HPDC (1999)
5. Foster, I.: Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley Longman Publishing Co., Inc., Boston (1995)
6. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., Sunderam, V.S.: PVM: Parallel Virtual Machine: A Users’ Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge (1994)
7. Globus Alliance: Globus Toolkit, http://www.globus.org/toolkit/
8. Goux, J.-P., Kulkarni, S., Linderoth, J., Yoder, M.: An enabling framework for master-worker applications on the computational grid. In: HPDC, pp. 43–50 (2000)
9. Gropp, W., Lusk, E.: Dynamic process management in an MPI setting. SPDP, 530 (1995)
10. Idl, N.S.: The common object request broker: Architecture and specification
11. Sun Microsystems: Jini network technology. Technical report, http://www.sun.com/software/jini/
12. Nazir, A., Liu, H., Sørensen, S.-A.: Powerpoint presentation: Steering dynamic behaviour. In: Open Grid Forum 20, Manchester, UK (2007)
13. Petrone, M., Zarrelli, R.: Enabling PVM to build parallel multidomain virtual machines. In: PDP 2006: Proceedings of the 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Washington, DC, USA, pp. 187–194. IEEE Computer Society Press, Los Alamitos (2006)
14. Pruyne, J., Livny, M.: Providing resource management services to parallel applications (1995)
15. Shao, G.: Adaptive scheduling of master/worker applications on distributed computational resources (2001)
16. Sørensen, S.-A., Bauer, B.: On the dynamics of the Köfels sturzstrom. In: Geomorphology (2003)
17. Sørensen, S.-A., Jones, M.G.W.: The clown network simulator. In: 7th UK Computer and Telecommunications Performance Engineering Workshop, London, UK, pp. 123–130. Springer, Heidelberg (1992)
18. Trac, H., Pen, U.-L.: A primer on Eulerian computational fluid dynamics for astrophysics. Publications of the Astronomical Society of the Pacific 115, 303 (2003)
19. Wolski, R., Spring, N.T., Hayes, J.: The network weather service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems 15(5–6), 757–768 (1999)
Performance Evaluation of a SLA Negotiation Control Protocol for Grid Networks

Igor Cergol¹, Vinod Mirchandani¹, and Dominique Verchere²

¹ Faculty of Information Technology, The University of Technology, Sydney, Australia {icergol,vinodm}@it.uts.edu.au
² Bell Labs, Alcatel-Lucent, Route de Villejust, 91620 Nozay, France [email protected]
Abstract. A framework for an autonomous negotiation control protocol for service delivery is crucial to support the heterogeneous service level agreements (SLAs) that will exist in distributed environments. We first give the gist of our augmented service negotiation protocol, which supports distinct service elements. The augmentations also cover the composition of services and simultaneous negotiation with several service providers. Together, these augmentations make it possible to consolidate the service negotiation operations for telecom networks, which are evolving towards Grid networks. Furthermore, our autonomous negotiation protocol is based on a distributed multi-agent framework to create an open market for Grid services. Second, we concisely present key simulation results from our work in progress. The results show the usefulness of our negotiation protocol for realistic scenarios that involve different background traffic loads, message sizes and traffic flow asymmetry between background and negotiation traffic.

Keywords: Grid networks, Simulation, Performance Evaluation, Negotiation protocols.
switching technologies, i.e. L1 services (e.g. fiber or TDM), L2 services (e.g. Carrier Grade Ethernet) or L3 services (e.g. IP/MPLS). Then we present and discuss OMNeT++ based models and related stochastic simulation results that show how different background traffic loads, message sizes and traffic flow asymmetry affect the performance of our negotiation protocol. This paper is organized as follows: Section 2 outlines the distributed Grid market framework in open environments. It also summarizes the negotiation protocol we proposed in [5], first for a single service element between two parties and then for a bundle of heterogeneous service elements from multiple service providers whose offers can complement or compete with each other. Performance evaluation results are also presented and discussed in Section 2. The conclusions that can be drawn from this paper are stated in Section 3.
2 Negotiation Framework Overview

The framework for managing Quality of experience Delivery In New generation telecommunication networks with E-negotiation (QDINE) [6] is shown in Figure 1.
Fig. 1. The QDINE Negotiation Framework
The QDINE approach adopts a distributed, open market approach to service management. It makes services available via a common repository, together with a list of negotiation models that are used to distribute service level agreements (SLAs) for the services. A QDINE market is a multi-agent system in which each agent adopts one or more market roles. Every QDINE market has exactly one agent acting in the market agent (MA) role and multiple service providers (SPs), consumers and billing providers.

2.1 Proposed Service Negotiation Protocol

Our negotiation protocol [5] is a session-oriented, client-initiated protocol that can be used for intra-domain as well as inter-domain negotiation in a Grid environment. A key
distinguishing feature of our protocol is that it enables the negotiation of a single service as well as the composition of several service elements from multiple service providers (SPs) within one session. We consider two scenarios, shown in Figs. 2 and 3, for the negotiation of services between the consumer agent (CA) and the SP.

2.1.1 Negotiation of a Single Service

Fig. 2 shows the sequence diagram for the negotiation of a single service, in which the service provider (SP) and the consumer agent (CA) negotiate directly. The numbers in parentheses that precede the messages give the order of the interactions. Upon receiving the Request message (1), the SP has several options:
• It can immediately provide the service by sending an Agree message (2a) back to the CA. This forms the beginning of the SLA binding process of the service element.
• If the SP cannot deliver the requested service, or chooses to end the negotiation, it may reply with a Refuse message (2b), stating the reason for refusal.
• The SP may choose to continue the negotiation by proposing alternative SLA parameters, in which case it responds with a Propose message (2c) containing the modified SLA and any additional constraints for an acceptable contract agreement.
The consumer may reply to the new proposal with another Request message (3), or accept the proposal (5b). Several Request and Propose messages, (3) and (4), may be exchanged between the CA and the SP. If the consumer chooses to end the negotiation, it can reply with a Reject-proposal message (5a). Alternatively, if the consumer accepts the proposal, it responds with an Accept-proposal message (5b) indicating the accepted SLA. After receiving the Accept-proposal message from the CA, the SP can bind the SLA by sending an Acknowledge message (6).
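The single-service exchange can be read as a small state machine. The sketch below is an illustrative Python encoding of the message order described in the text, not the protocol specification from [5]; the transition table, the lower-case message names and the choice to treat Agree as ending the negotiation phase are assumptions made for the example.

```python
# Allowed next messages, keyed by the last message exchanged.
TRANSITIONS = {
    "start": {"request"},                                  # CA opens with Request (1)
    "request": {"agree", "refuse", "propose"},             # SP replies (2a), (2b), (2c)/(4)
    "propose": {"request", "reject-proposal", "accept-proposal"},  # CA replies (3), (5a), (5b)
    "accept-proposal": {"acknowledge"},                    # SP binds the SLA (6)
}

# Messages after which no further negotiation message is expected in this sketch.
TERMINAL = {"agree", "refuse", "reject-proposal", "acknowledge"}

def is_valid_session(messages):
    """Check that a message sequence follows the table and ends in a terminal message."""
    state = "start"
    for message in messages:
        if message not in TRANSITIONS.get(state, set()):
            return False
        state = message
    return state in TERMINAL

direct_agree = ["request", "agree"]
bargaining = ["request", "propose", "request", "propose",
              "accept-proposal", "acknowledge"]
```

A table of this kind is also a convenient starting point for checking recorded message traces against the intended sequence.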
The SLA under negotiation can be validated at any time by the participants, using the appropriate service specification(s) and current constraints. Invalid proposals may result in a Reject-proposal message (5a) or a Refuse message (2b) sent by the consumer or the provider, respectively.

2.1.2 Negotiation of Multiple Service Elements

We illustrate the protocol sequence diagram in Figure 3 for a multi-service negotiation using the following scenario. A QDINE market consumer needs to perform a highly demanding computational fluid dynamics (CFD) simulation workflow on a specific computer infrastructure and then store the processed results in other remote locations. The consumer therefore needs to negotiate the bundled service from several service providers (SPs): (i) a high performance computing (HPC) SP to execute the workflow of the simulation program hosted on the machine, (ii) a storage SP to store the results produced by the computational application, and (iii) a network (NW) SP to provide the bandwidth required for transmitting the data between the different resource centers within a time window that avoids internal delays in the execution of the global application workflow. In the configuration phase the consumer sends a
Fig. 2. Direct negotiation of a single service
Fig. 3. Negotiation of multiple services
Request message (1) to the MA, to get the information the consumer agent (CA) needs to fill in the proxy message in a template defined by the MA. The MA responds to the request with an Inform message (2), carrying any configuration information updated since consumer registration. In this example scenario, after obtaining the configuration information, the CA sends a Proxy message (3) to the MA, requesting it to select the appropriate SPs and negotiate the three aforementioned services. All of these services are required for a successful negotiation in this example:
• a call-for-proposal (cfp) for a service S1 (HPC) with three preferred providers (SP1, SP2 and SP3), returning the best offer,
• a bilateral bargaining with any appropriate providers for service S2 (storage), returning the best offer, and
• a bilateral bargaining with service provider SP5 for the network service S3 (constraints expressed with bandwidth), returning the final offer.
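The content of the Proxy message in this example can be pictured as a small data structure. The Python sketch below is purely illustrative: the class and field names are hypothetical and do not come from the QDINE template, which is defined by the MA.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServiceRequest:
    service: str                            # service element, e.g. "S1"
    kind: str                               # "HPC", "storage" or "network"
    mechanism: str                          # "cfp" or "bilateral"
    providers: Optional[List[str]] = None   # None stands for "any appropriate" provider
    returns: str = "best offer"             # which offer the MA reports back

@dataclass
class ProxyMessage:
    consumer: str
    requests: List[ServiceRequest] = field(default_factory=list)
    all_required: bool = True               # every element is needed for success

proxy = ProxyMessage(
    consumer="CA",
    requests=[
        ServiceRequest("S1", "HPC", "cfp", ["SP1", "SP2", "SP3"]),
        ServiceRequest("S2", "storage", "bilateral"),
        ServiceRequest("S3", "network", "bilateral", ["SP5"], returns="final offer"),
    ],
)
```

Leaving `providers` unset for S2 mirrors the "any appropriate" provider selection in the scenario, while S1 and S3 name their candidate providers explicitly.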
(The SPs for common services are shown collocated in Figure 3 due to space limitations.) Upon receiving the Proxy message, the MA assesses it and, if it validates the request, replies with an Agree message (4b) to the CA. According to the request for service element S1, the MA issues a cfp (5) to service providers SP1, SP2 and SP3. One of them refuses to provide the service by sending a Refuse message (6a); the other two HPC service providers each submit a proposal using a Propose message (6b). The MA validates the proposals and uses the offer selection function (osf) to pick the most appropriate service offer. The MA requests the chosen SP to wait (7b), as it does not want to bind the agreement for service element S1 until it has collected acceptable proposals for all three service elements. According to the "any appropriate" provider selection function submitted by the CA for service S2, the MA then selects the only appropriate SP (SP4) for the storage service. The MA then sends a Request message (9) to SP4, invoking a bilateral bargaining mechanism. For simplicity, we assume that the chosen SP immediately agrees to provide the requested service by replying with an Agree message (10). In reality, the negotiation sequence could be similar to the one depicted in Figure 2. When an agreement is formed, the MA requests SP4 to wait (11) while the other service negotiations are completed. SP4 replies with an Agree message (12). Steps 13–16 are described in [5]. The service providers bind the agreement by replying with an Acknowledgement (17) for the bundled service (S1, S2, S3).

2.2 Performance Evaluation

An important part of the specification phase of the negotiation protocol is to pre-validate its sequences and evaluate its performance through simulations. A Metro Ethernet Network (MEN) architecture was modelled as the underlying transport network infrastructure in the QDINE service negotiation framework.
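Before running network simulations, one lightweight way to pre-validate the multi-service sequence of Section 2.1.2 is to encode the MA’s rule of holding each selected offer until every service element has an acceptable proposal. The Python sketch below illustrates that rule only; the function names, the offer fields and the lowest-price selection rule are assumptions, since the actual offer selection function (osf) is not specified here, and the prices are made-up illustrative values.

```python
def select_offer(offers, osf):
    """Apply the offer selection function (osf); None if no proposal was received."""
    return osf(offers) if offers else None

def negotiate_bundle(proposals_by_service, osf):
    """Return the held offer per service, or None if any element has no proposal."""
    held = {}
    for service, offers in proposals_by_service.items():
        best = select_offer(offers, osf)
        if best is None:
            return None        # one element failed, so nothing is bound
        held[service] = best   # the chosen provider is asked to wait
    return held                # every element is covered: the bundle can be bound

def cheapest(offers):
    """Hypothetical osf: pick the offer with the lowest price."""
    return min(offers, key=lambda offer: offer["price"])

proposals = {
    "S1": [{"sp": "SP1", "price": 12.0}, {"sp": "SP3", "price": 9.5}],
    "S2": [{"sp": "SP4", "price": 4.0}],
    "S3": [{"sp": "SP5", "price": 7.0}],
}
bundle = negotiate_bundle(proposals, cheapest)
incomplete = negotiate_bundle({"S1": proposals["S1"], "S2": []}, cheapest)
```

Here `bundle` holds one offer per service element, whereas `incomplete` is `None` because no storage proposal is available and the bundle must not be bound.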
This section documents the results of the simulation study of the negotiation protocol’s performance for different realistic parameters such as negotiation message sizes, negotiation frequency and input load. We believe this study is valuable because it considers performance during the negotiation phase over the MEN rather than the end-to-end performance of data traffic transport. The study investigates the following negotiation sequences: (i) the negotiation of a single service element, where the negotiation is direct between the SP and the CA, and (ii) the simultaneous negotiation of multiple service elements from different types of SPs via the MA. To the best of our knowledge, although other studies of negotiation protocols have been conducted, they evaluate performance neither for different negotiation message sizes nor for varying background input load from other stations connected to the network.

Simulation Model. The simulations were performed with the OMNeT++ tool [7]; the INET framework, which supports the simulation of wired networks (LAN, WAN), was used in our model environment. The QDINE framework with the MEN infrastructure was modelled and simulated separately for (i) direct negotiation and (ii) indirect negotiation. The MEN service type that we used is the Extended Virtual Private LAN (EVP-LAN) service, and the background traffic generated was best-effort traffic carried by the EVP-LAN. The actors in the QDINE negotiation framework (CA, SP and MA) were modelled as network stations, i.e. Ethernet hosts. Each Ethernet
host was modelled as a stack that, from top to bottom, comprised: (a) two application models, namely a simple traffic generator (client side) for generating request packets and a receiving entity (server side) that also generates response packets; (b) an LLC module; and (c) a CSMA/CD MAC module.
Fig. 4. Metro Ethernet Network (MEN) model

2.2.1 Results and Discussion – Direct Negotiation for a Single Service

In direct negotiation (refer to Fig. 4), the negotiation messages flow directly between the consumer agent (CA) and the service provider (SP). The input data traffic load on the MEN was generated by the linked nodes (stations) 1…m, other than the negotiating control stations. The input load is expressed as a fraction of the 10 Mbps capacity of the MEN. In the experiments performed, we varied the input load by keeping the packet sending rate constant and increasing the number of stations.

Table 1. Attributes specific for direct negotiation
Parameter
Packet size – background traffic
Packet arrival rate – background traffic
Negotiation frequency
Packet generation
Negotiation message sizes
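The way the input load scales with the number of background stations can be illustrated with a short calculation. The Python sketch below assumes that every station sends fixed-size packets at a constant rate; the packet size and packet rate used in the example are hypothetical placeholders, not the values of Table 1.

```python
MEN_CAPACITY_BPS = 10_000_000  # 10 Mbps, as stated in the text

def offered_load(stations, packets_per_second, packet_bytes,
                 capacity_bps=MEN_CAPACITY_BPS):
    """Aggregate background traffic as a fraction of the network capacity."""
    bits_per_second = stations * packets_per_second * packet_bytes * 8
    return bits_per_second / capacity_bps

# Hypothetical per-station traffic: 50 packets/s of 1000 bytes each (400 kbit/s).
RATE_PPS = 50
PACKET_BYTES = 1000

# Keeping the sending rate constant and adding stations raises the load linearly.
loads = {n: offered_load(n, RATE_PPS, PACKET_BYTES) for n in (5, 10, 15, 20)}
```

With these placeholder values each additional station adds 4% of the 10 Mbps capacity, so the load grows in equal steps as stations are added.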
A) Impact of negotiation frequency and background traffic packet arrival rate: The attributes used for this simulation are given in Table 1. Fig. 5 shows the impact of the negotiation frequency of the control traffic, as well as of the packet arrival rate of the generated background data traffic, on the mean transfer delay of the negotiation packets received at the CA. The negotiation control protocol used was bilateral bargaining. Fig. 5a shows that increasing the negotiation frequency has only a marginal effect on the mean delay up to the input load at which the mean delay increases asymptotically (0.7 in Fig. 5a). This shows that the negotiation protocol is influenced more by the input data traffic load on the network than by the negotiation control frequency, which favours the scalability of negotiation stations on the underlying network.
B) Impact of different negotiation message sizes: We studied the impact on performance at the SP, at the CA and in the overall network of the following negotiation control message size ranges: (i) 100–300 bytes, (ii) 300–900 bytes and (iii) 300–1800 bytes.
Fig. 5. (a) Time Performance at the consumer agent (CA) (b) Performance at the CA – Mean delay vs. Input load
The impact of negotiation message size on the mean transfer delay obtained at the CA is shown in Fig. 5b. The impact was found to be more pronounced at the SP than at the CA; the results at the SP are not shown here due to space limitations. Most of the large messages are received at the SP rather than at the CA, which causes a larger overall delay there. Note: the total end-to-end delay of a control frame transfer is the sum of queuing delay, transmission delay, propagation delay and contention time. Except for the propagation delay (fixed for a medium of length L), all of these delays are higher for larger frames.

2.2.2 Results and Discussion – Indirect Negotiation for Multiple Services

This subsection evaluates our service negotiation control protocol for multiple service elements that can be negotiated between an SP and a CA. The study uses parameter values similar to those given in Table 1. Because the negotiation messages and the background traffic have different packet sizes and asymmetric traffic flows, the performance study is more realistic. The method for the indirect negotiation of multiple services was described in Section 2.1.2. In the modelled operation, the CA and the SPs negotiate multiple services indirectly via the MA. The MA can also be configured to act as a resource broker, which helps to identify the state and the services available at the various SPs. This helps to reduce the overall negotiation time between the CA and the SPs via the MA. The mean delay of the messages received at the MA when it actively negotiates with the SPs on behalf of the CA is presented in Fig. 6a. On receipt of the cfp, the SPs respond in a distributed manner, which causes relatively lower
Fig. 6. (a) Mean delay vs. No. of neg. stations (No. of background stations) at the MA. (b) Mean negotiation time vs. No. of neg. stations (No. of background stations).
contention delays, and hence a lower mean delay, than if all the SPs were to respond simultaneously. As a result, the mean delay of the messages at the MA is decreased. Also, background traffic has more impact on the performance than the number of SPs. In the indirect negotiation cases, the SPs were grouped in blocks of 10, each forming a bundled service. The SPs within each SP block were candidates for providing a distinct service element. The mean negotiation time for this scenario is shown in Fig. 6b. It can be seen that in this case, too, the background traffic influences the delay. The result in Fig. 6b indicates that increasing the number of negotiating stations has less impact than the background traffic. This means that a higher capacity network, such as 100 Mbps, 1 Gbps or 10 Gbps, would be a more suitable infrastructure for limiting the effect of the background traffic.
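To put the capacity remark in perspective, the transmission component of the delay can be computed directly, since it is simply the number of bits divided by the link rate. The Python sketch below is a back-of-the-envelope illustration: it ignores framing overhead and any fragmentation of large messages, and the queuing and contention terms are left as inputs because they depend on the simulated load.

```python
def transmission_delay(message_bytes, rate_bps):
    """Time in seconds needed to put the message bits on the link."""
    return message_bytes * 8 / rate_bps

def total_delay(queuing, transmission, propagation, contention):
    """End-to-end delay as the sum of the four components named in the text."""
    return queuing + transmission + propagation + contention

RATES_BPS = {
    "10 Mbps": 10_000_000,
    "100 Mbps": 100_000_000,
    "1 Gbps": 1_000_000_000,
    "10 Gbps": 10_000_000_000,
}
MESSAGE_BYTES = 1800  # upper end of the largest negotiation message size range

# Transmission delay in milliseconds for the largest message at each capacity.
delays_ms = {name: transmission_delay(MESSAGE_BYTES, rate) * 1000
             for name, rate in RATES_BPS.items()}
```

For an 1800-byte message the transmission delay falls from about 1.44 ms at 10 Mbps to about 0.144 ms at 100 Mbps, and a faster link also drains background traffic sooner, which reduces the queuing and contention terms.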
3 Conclusions

We have proposed a dynamic negotiation protocol and explained it within a generic QoS framework for realistic scenarios involving intra-domain negotiation. Salient performance evaluation results obtained from extensive simulations showed the effectiveness of the protocol for both direct and indirect negotiations. In particular, the negotiation protocol was shown to be feasible and scalable, i.e. the negotiation protocol messages did not limit the scalability.

Acknowledgements. This research was performed as part of an ARC Linkage Grant between Bell Labs (Alcatel-Lucent) and UTS, Sydney. We express our thanks to Les Green and Prof. John Debenham for their contributions to the QDINE framework.
References
1. Nguyen, T.M.T., Boukhatem, N., Doudane, Y.G., Pujolle, G.: COPS-SLS: A Service Level Negotiation Protocol for the Internet. IEEE Communications Magazine (2002)
2. Chen, J.-C., McAuley, A., Sarangan, V., Baba, S., Ohba, Y.: Dynamic Service Negotiation Protocol (DSNP) and Wireless Diffserv. In: Proceedings of ICC, pp. 1033–1038 (2002)
3. Wang, X., Schulzrinne, H.: RNAP: A Resource Negotiation and Pricing Protocol. In: Proceedings of NOSSDAV, NJ, USA, pp. 77–93 (1999)
4. FIPA Contract Net Interaction Protocol Specification (December 2003), http://www.fipa.org
5. Green, L., Mirchandani, V., Cergol, I., Verchere, D.: Design of a Dynamic SLA Negotiation Protocol for Grids. In: Proc. ACM GridNets, Lyon, France (2007)
6. Green, L.: PhD Thesis, Automated, Ubiquitous delivery of Generalised Services in an Open Market (April 2007)
7. Discrete Event Simulation Environment OMNeT++, http://www.omnetpp.org
Performance Assessment Architecture for Grid

Jin Wu and Zhili Sun

Centre for Communication System Research, University of Surrey, Guildford, GU2 7XH, United Kingdom {jin.wu,z.sun}@surrey.ac.uk
Abstract. For the sake of simplicity in resource sharing, Grid services expose only a function access interface to users. However, some Grid users want to know more about the performance of services in the planning phase. A problem emerges because the Grid has no means of determining how well the system will cope with user demands. Current Grid infrastructures do not integrate adequate performance assessment measures to meet this requirement. In this paper, the architecture of the Guided Subsystem Approach for Grid performance assessment is presented. We propose an assessment infrastructure that allows the user to collect information to evaluate the performance of Grid applications. Based on this infrastructure, a user-centric performance assessment method is given. It is expected that this research will lead to an extension of Grid middleware that gives the Grid platform the ability to handle applications with higher reliability requirements.

Keywords: Grid, performance assessment, reliability.
P. Primet et al. (Eds.): GridNets 2008, LNICST 2, pp. 89–97, 2009. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2009

1 Introduction

The lack of performance indications has become an obstacle to the continued promotion of the Grid. Users’ scepticism about service quality significantly holds back efforts to put more applications onto the Grid: it is hard to convince users to move valuable applications onto the Grid infrastructure before they have been clearly informed of the service quality they will receive. The problem arises because Grid middleware hides system details, allowing users to access services through a portal without knowing anything about the resource providers. On the one hand, a standardised interface provides easy access to heterogeneous resources. On the other hand, the system details that users would need in order to judge the performance of Grid services are hidden. This paper studies a performance assessment architecture to solve this problem. It collects the requirements from the user and, after examining the related Grid components, reports how well the system will cope with those requirements. This architecture does not help to improve the performance of individual components, but it helps users to select services more wisely by enabling comparison among the available service providers on a like-for-like basis. Two approaches can be used to obtain the performance assessment of an application: user initiated assessment and infrastructure based assessment. The user initiated assessment approach evaluates the performance of an application by users’
own efforts. The services from all providers are trialled one by one, and their performance is documented for like-for-like comparison. While this approach is easy to use, its utility is generally limited by overhead, accuracy and management concerns. For instance, a large number of users making independent and frequent assessment trials could have a severe impact on the performance of the services themselves. Furthermore, the assessment capabilities at the user end are always limited, which weakens the accuracy of the assessment results. More importantly still, application providers might not agree to open their systems for trials because of economic and security concerns. Ideally, dedicated modules attached to service providers should be in place, and their assessment results could be made available, with low overhead, to all prospective users. This is the basic motivation for infrastructure based performance assessment of Grid services. Our paper extends this research by combining the two ideas: an assessment infrastructure is put in place to assist application selection. The novelties of this research are that: a) the assessment infrastructure is extended to support performance assessment beyond network performance metrics; b) a general architecture that considers both application layer and transmission network performance is proposed; and c) Grid application users are notified of the likely performance before execution. The remainder of the paper is organised as follows. Section 2 reviews work related to this research. Section 3 gives an architecture level study of performance assessment. Section 4 gives the detailed procedures for the performance assessment identified in Section 3. Finally, Section 5 summarises this work.
2 Related Works

Current research places limited emphasis on Grid performance assessment. Previous research has discussed the use of infrastructure to obtain data transmission performance between Internet hosts. A simple protocol for such a service, SONAR, was discussed in the IETF as early as February 1996 [2], and in April 1997 as a more general service called HOPS (Host Proximity Service) [3]. Both of these efforts proposed lightweight client-server query and reply protocols similar to the DNS query/reply protocol. Ref. [4] proposed a global architecture, called “IDMaps”, for Internet host distance estimation and distribution. That work proposes an efficient E2E probing scheme that satisfies the requirements for supporting a large number of client-server pairs. However, the timeliness of such information is anticipated to be on the order of a day, and it therefore does not reflect transient properties. Furthermore, that work can only assess the transmission delay among Internet hosts; this could be an underlying support for performance assessment, but it does not address performance issues as seen by users. The topic of server selection has also been addressed. Ref. [5] proposed passive server selection schemes that collect response times from previous transactions and use these data to direct clients to servers. Refs. [6, 7] proposed active server selection schemes. Under these schemes, measurement components that periodically probe network paths are distributed throughout the network. Based on the probed round trip time, an estimated metric is assigned to the path between each node pair. Ref. [1] studies a scenario where multimedia Content Delivery Networks distribute thousands of servers throughout the Internet. In
such a scenario, the number of tracers and the probe frequency must be limited to the minimum required to accurately report network distance on some time scale.
3 Architectural Level Study to Performance Assessment

An architectural level study of performance assessment is given in this section. Performance assessment is used to find out how likely it is that the performance of a Grid service fulfils the user’s requirements. In many cases, multiple component services from different administration domains are federated into a composite service, which is finally exposed to users. The performance of the component services naturally affects the performance of the composite service that is directly visible to users. Accuracy, efficiency and scalability are the main concerns for the performance assessment of Grid services. Generally, Grids can be large scale heterogeneous systems with frequent simultaneous user accesses. Multiple performance assessment requests must be handled in a timely and overhead-efficient manner. The larger and more complex the system, and the more varied its characteristics, the more difficult it becomes to conduct an effective performance assessment. The Subsystem-level approach is a solution to the scalability problem of performance assessment over large scale systems. As its name suggests, the Subsystem-level approach accomplishes a performance assessment by developing a collection of assessments, subsystem by subsystem. The rationale for such an approach is that the performance meltdown that a user experiences will show up as a risk within one or more of the subsystems. Therefore, doing a good job at the subsystem level will cover all the important areas of the whole system. The following observations identify the limitations of the approach.
• Independent subsystem performance assessment tends to give a static view of performance and assumes that the upcoming invoking request has no meaningful impact on subsystem performance. But a subsystem can behave differently when the additional invoking request is applied. Without a dynamic performance assessment that takes the characteristics of both the invoking request and the subsystem into account, the performance assessment result has a systematic inaccuracy.
• Independent subsystem performance assessment is prone to variance in the way performance metrics are characterised and described. Without a common basis applied across the system, subsequent aggregation of subsystem results is meaningless or impossible to accomplish. A consequence is inevitable misunderstanding and reduced accuracy of the performance assessment.
• Subsystem A’s analysts can only depict the operational status and risk of Subsystem A, but they may not have an accurate understanding of Subsystem A’s criticality to the overall service performance. Viewed from the perspective of the top level performance assessment, the effort at the subsystem level can be wasted on assessing unrelated performance metrics, while a subsystem performance metric that is crucial to the overall performance assessment is not measured.
Given the weaknesses of the conventional subsystem performance assessment approaches, we propose the Guided Subsystem Approach to performance assessment as an enhancement to conventional approaches.
J. Wu and Z. Sun
Fig. 1. Architectural Level Process for Guided Subsystem Assessment
The main idea of the Guided Subsystem Approach is to introduce a user-centric component that interprets the user’s assessment requests and the assessment functions of the subsystems. This component conducts the macro-level pre-analysis. Top-down effort is applied to translate the performance assessment requests and transmit them to the subsystem assessment modules. Dynamically generating the assessment requests sent to the subsystems helps to prevent mismatches between the user’s assessment requirements and the assessment actions applied to the subsystems. The Guided Subsystem Approach combines the advantages of the top-down and subsystem approaches while offering better efficiency, consistency, and systematic accuracy. Because it keeps the characteristics of both, the user-centric interpretation components and the subsystem performance assessment components can be designed and implemented separately.
4 Guided Subsystem Performance Assessment

An assessment infrastructure needs to give performance indications after users express their expectations of the target Grid service. Users can have diverse performance expectations for a Grid service. These expectations should be expressed in a standardised way so that the assessment infrastructure can learn the needs of users. A performance indicator is defined to notify users of the result of the performance assessment. The performance indicator should have pragmatic meaning and allow
users to perform like-with-like comparisons between similar Grid services delivered by different providers. For the purpose of measuring the ECU of a target Grid service, three types of inputs need to be identified:
• Set of Performance Malfunctioning (SPM): a set containing every performance malfunctioning feature that Internet application users might experience. It is a framework of discernment attached to applications.
• Consequence of Performance Malfunctioning (CPM): measures the damage that each particular performance malfunctioning feature could cause. It depicts the users’ performance expectations of the application.
• Probability of Performance Malfunctioning (PPM): measures the probability that a performance malfunctioning appears under a given framework of discernment. This input depicts how the underlying system reacts to the invoking requests.
The ECU of a target Grid service can be given once the above three inputs are collected.

Obtaining the SPM is the first step of ECU measurement. A performance malfunctioning is a feature of an application that precludes it from performing according to its performance specifications; a performance malfunctioning occurs if the actual performance of the system is below the specified values. The SPM is usually defined when a type of Grid service is composed. When a type of Grid service is identified, its standard SPM can be obtained. Different types of Grid services can lead to different SPMs. Basically, the SPM is authored through system analysis and can be updated using users’ feedback. However, the management and standardisation of SPMs is outside the scope of this research. Without loss of generality, we assume that standard SPMs can be provided by a generalised third-party service when a use-case is presented, where the SPM of service u is denoted as SPMu.
For a service u with a standard SPM including n malfunctioning types, there exists $SPM_u = \{s_1, \ldots, s_n\}$, which can also be denoted as an n-dimensional vector $SPM_u = (s_1, \ldots, s_n)$.

Identifying the severity of performance malfunctioning is another important aspect of ECU measurement. CPM is a set of parameters configured by users to represent the severity of the damage each performance malfunctioning feature could cause. Define a function $p_{\langle user_x, SPM_u \rangle}$ representing the CPM configuration process for the user $user_x$ on the framework of discernment of use-case u, $SPM_u$. Denote this CPM as an n-dimensional vector $CPM_u = (c_1, \ldots, c_x, \ldots, c_n)$, where $c_x$ is the cost of the damage when the x-th performance malfunction exists. Then, for $\forall s_x \in SPM_u$, $\exists c_x \in [0, +\infty)$ satisfying

$$p_{\langle user_x, SPM_u \rangle} : s_x \to c_x .$$

It is the users’ responsibility to configure the CPM. In many cases, policy-based automatic configuration is possible, which makes this process friendlier to users.

The third part of performance assessment is to find out the performance of the invoked system, in terms of the probability of under-performing, when an application scenario is applied. The performance of the invoked system relates to the capability of the physical resources that the system contains and to their management. An n-tuple, PPM, is defined to measure the performance malfunctioning probability of a given function under a scenario. Denote $s_x$ as a performance metric satisfying $s_x \in SPM_u$.
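The two user-side inputs, SPM and CPM, can be pictured as plain data. The following is a minimal sketch, assuming that malfunction types are identified by strings and that a policy-based configuration simply supplies a default cost for entries the user leaves unset. The function name `configure_cpm`, the malfunction names and the cost values are all hypothetical and are not taken from the paper.

```python
import math
from typing import Mapping, Sequence, Tuple


def configure_cpm(spm: Sequence[str],
                  user_costs: Mapping[str, float],
                  default_cost: float = 0.0) -> Tuple[float, ...]:
    """Map every malfunction s_x in SPM to a cost c_x in [0, +inf).

    Entries missing from user_costs fall back to default_cost, which stands
    in for a policy-based automatic configuration (an assumption).
    """
    cpm = []
    for s in spm:
        c = float(user_costs.get(s, default_cost))
        if not math.isfinite(c) or c < 0:
            raise ValueError(f"cost for {s!r} must lie in [0, +inf), got {c}")
        cpm.append(c)
    return tuple(cpm)


# Hypothetical SPM for an illustrative file-transfer service.
spm_u = ("transfer_too_slow", "job_start_delayed", "result_late")
cpm_u = configure_cpm(spm_u,
                      {"transfer_too_slow": 5.0, "result_late": 20.0},
                      default_cost=1.0)
```

Keeping the CPM as a tuple in the same order as the SPM mirrors the n-dimensional vector notation used in the text.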
For a use-case u, denote the specified performance values as $SPM_u'$, where $SPM_u' = \{s_1', \ldots, s_n'\}$. The scenario e is executed to apply the use-case u. $p_x$ is defined as the probability of performance malfunctioning exhibited by the x-th performance metric, where $p_x = \Pr(s_x < s_x')$. Then the probability of performance malfunctioning for scenario e is $PPM_e = (p_1, \ldots, p_n)$. The PPM relates to three factors: a) the ability and usability of components; b) the importance of components to overall performance; and c) the invoking frequency of components.

SPM and CPM can be given by the user, while PPM needs to be obtained from the subsystems that compose the target Grid service. We consider that a Grid service can be decomposed into many independent component Grid services, each of which contains software and hardware resources and performs a set of functional activities. A set of metrics is used to measure the performance of a composite service. Let a data structure $\langle x, y \rangle$ denote the y-th performance metric of component x. Let the performance metrics $R_{\langle x,y \rangle}(e)$ and $P_{\langle x,y \rangle}$ measure, respectively, the performance required by scenario e and the delivered performance for $\langle x, y \rangle$. The state value $u_{\langle x,y \rangle}(e) \in \{0, 1\}$ describes the health status of component x with regard to the performance metric $\langle x, y \rangle$, and satisfies

$$u_{\langle x,y \rangle}(e) = \begin{cases} 1, & \text{when } P_{\langle x,y \rangle} < R_{\langle x,y \rangle}(e) \\ 0, & \text{when } P_{\langle x,y \rangle} \ge R_{\langle x,y \rangle}(e) \end{cases}$$
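The state value defined above can be sketched in a few lines. This is only an illustration: it assumes that every metric is expressed so that a larger value means better performance, and the component name, metric indices and numbers are hypothetical.

```python
from typing import Dict, Tuple

Metric = Tuple[str, int]  # <x, y>: component x and its y-th metric


def health_state(delivered: float, required: float) -> int:
    """Return u<x,y>(e): 1 if the delivered performance P is below the
    requirement R(e) of the scenario (under-performing), otherwise 0."""
    return 1 if delivered < required else 0


def component_states(delivered: Dict[Metric, float],
                     required: Dict[Metric, float]) -> Dict[Metric, int]:
    """Evaluate the state value for every metric that the scenario requires."""
    return {k: health_state(delivered[k], required[k]) for k in required}


# Hypothetical figures: throughput in MB/s and available CPU share in percent.
delivered = {("storage", 1): 80.0, ("storage", 2): 120.0}
required = {("storage", 1): 100.0, ("storage", 2): 100.0}
states = component_states(delivered, required)
```

Here the first metric falls short of its requirement and is flagged with 1, while the second meets it and is flagged with 0.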
An n-tuple, $U_x(e)$, is used to depict the performance status of component x under scenario e:

$$U_x(e) = (u_{\langle x,1 \rangle}(e), \ldots, u_{\langle x,n \rangle}(e))$$

where n is the total number of measurable metrics for component x. The under-performance of components affects the performance of the scenario. A performance malfunctioning assignment function is used to represent the degree of influence on the SPM when a particular performance state is exhibited by a particular component. A performance malfunctioning assignment function

$$m : SPM \to [0, 1]$$

is defined when it satisfies the following two conditions:

$$\forall A \in SPM, \quad m_{U_x(e)}(A) \in [0, 1]$$

and

$$\sum_{A \in SPM} m_{U_x(e)}(A) \le 1$$

where $m_{U_x(e)}$ denotes the performance assignment function for the performance state $U_x(e)$. A higher value of $m_{U_x(e)}(A)$ denotes a higher influence of the state $U_x(e)$ on the performance malfunctioning A, and vice versa. The m function can be coauthored through system analysis and simulations. However, how to obtain the m function is not within the scope of this research; we assume that it can be generated by a third-party service.

We then study how to assess the performance of a component. Let $R'_{\langle x,y \rangle}$ denote the predicted performance for $\langle x, y \rangle$. For $\forall x, y$, there exists
$$v_{\langle x,y \rangle}(e) = \Pr(R'_{\langle x,y \rangle} < P_{\langle x,y \rangle}(e)) .$$

Once component x is invoked by scenario e, $v_{\langle x,y \rangle}(e)$ is the degree of likeliness of under-performance measured from $\langle x, y \rangle$. The state of likely under-performance of component x can be represented as an n-tuple,

$$V_x(e) = (v_{\langle x,1 \rangle}(e), \ldots, v_{\langle x,n \rangle}(e)) .$$

A similarity measurement function $sim : (X, Y) \to [0, 1]$ is defined, where $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$ are two n-tuples, and

$$sim(X, Y) = \prod_{k=1}^{n} (1 - |x_k - y_k|) .$$
Then, when $V_x(e)$ can be obtained, the performance assignment function can be given as

$$m_{V_x(e)}(A) = \sum_{U_x(e) \in \Phi_x} sim(U_x(e), V_x(e)) \times m_{U_x(e)}(A) .$$

We can also have

$$m_{V_x(e)}(SPM) = \sum_{U_x(e) \in \Phi_x} sim(U_x(e), V_x(e)) \times m_{U_x(e)}(SPM) .$$
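The similarity function and the resulting assignment function can be illustrated as follows. This is a hedged sketch under two assumptions that the text does not state explicitly: the difference inside the product is taken as an absolute value (which keeps sim within [0, 1]), and $\Phi_x$ is taken to be the set of all binary state tuples of component x. The malfunction name "slow" and all numeric masses are hypothetical.

```python
from math import prod
from typing import Dict, Sequence, Tuple

State = Tuple[int, ...]


def sim(x: Sequence[float], y: Sequence[float]) -> float:
    """Similarity of two n-tuples: product of (1 - |x_k - y_k|)."""
    if len(x) != len(y):
        raise ValueError("tuples must have the same length")
    return prod(1.0 - abs(a - b) for a, b in zip(x, y))


def m_v(v: Sequence[float],
        m_u: Dict[State, Dict[str, float]],
        malfunction: str) -> float:
    """Weight the assignment m_U(A) of every crisp state U by sim(U, V)."""
    return sum(sim(u, v) * masses.get(malfunction, 0.0)
               for u, masses in m_u.items())


# Hypothetical component with two metrics and one malfunction type.
m_u = {
    (0, 0): {"slow": 0.0},
    (0, 1): {"slow": 0.2},
    (1, 0): {"slow": 0.5},
    (1, 1): {"slow": 0.9},
}
v_x = (0.25, 0.5)  # likeliness of under-performance per metric
weights = {u: sim(u, v_x) for u in m_u}
slow_mass = m_v(v_x, m_u, "slow")
```

Under these assumptions the similarity weights over all binary states sum to one, so the result is a weighted average of the crisp-state assignments.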
These formulas give the relationship between the performance of component x and the SPM. Multiple components are involved when a scenario is executed. To account for the probability that a component failure manifests itself as user-end performance degradation, Dynamic Metrics (DM) are used. Dynamic Metrics measure the dynamic behaviour of a system based on the premise that active components are the source of failure [2]. A performance meltdown of a component may not affect the performance of the remote instrumentation scenario if the component is not invoked. It is therefore reasonable to use measurements obtained from executing the system models to perform the performance analysis. If an under-performance event is likely to exist in an active component, there is a higher probability that it will lead the scenario into malfunctioning. A data structure denoting the Component Status Description (CSD) is defined as:

$$CSD_x(e) = \langle i, x, start_x(e), duration_x(e), m_{V(x)}(SPM) \rangle$$
where i and x are the unified identifiers of the invoking request and the component; $start_x(e)$ and $duration_x(e)$ are, respectively, the start time and the expected execution duration of component x. $start_x$ can be obtained by analysing the scenario that invokes the component. We assume that the value of $duration_x$ can be given by an external service. When the CSDs of all components involved in a scenario are given, a mapping from the Time-Component Representation of a scenario to the Time-Discrete Representation can be carried out. A data structure denoting the Scenario Status Description (SSD) is defined as:

$$SSD(e) = \langle j, \{x \mid x \in set(j)\}, start_j(e), duration_j(e), \bigcup_{x \in set(j)} m_{V(x)}(SPM) \rangle$$
where j is the serial number of the time fragment; set(j) is the set of active components in the j-th time fragment; $start_j(e)$ and $duration_j(e)$ are the start time and the length of the j-th time fragment when scenario e is applied. Since all components operate independently, the accumulated performance malfunctioning assignment can be given by the following formula:

$$\bigcup_{x \in set(j)} m_{V(x)}(SPM) = 1 - \prod_{x \in set(j)} (1 - m_{V(x)}(SPM))$$
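The mapping from per-component descriptions to time fragments, together with the accumulation formula, can be sketched as below. The paper does not spell out how time fragments are delimited; this sketch assumes that a new fragment starts whenever any component starts or finishes. The class names, field names and sample values are hypothetical.

```python
from math import prod
from typing import FrozenSet, Iterable, List, NamedTuple


class CSD(NamedTuple):
    component: str
    start: float
    duration: float
    m_spm: float  # m_V(x)(SPM) of the component


class Fragment(NamedTuple):
    start: float
    duration: float
    active: FrozenSet[str]
    m_union: float  # accumulated assignment over the active components


def accumulate(masses: Iterable[float]) -> float:
    """1 - prod(1 - m): at least one independent component malfunctions."""
    return 1.0 - prod(1.0 - m for m in masses)


def to_fragments(csds: List[CSD]) -> List[Fragment]:
    """Cut the timeline at every component start/end and list active sets."""
    points = sorted({c.start for c in csds}
                    | {c.start + c.duration for c in csds})
    fragments = []
    for lo, hi in zip(points, points[1:]):
        active = [c for c in csds
                  if c.start <= lo and c.start + c.duration >= hi]
        if active:  # skip idle gaps in which no component runs
            fragments.append(Fragment(lo, hi - lo,
                                      frozenset(c.component for c in active),
                                      accumulate(c.m_spm for c in active)))
    return fragments


# Two hypothetical overlapping components.
ssd = to_fragments([CSD("A", 0.0, 4.0, 0.1), CSD("B", 2.0, 4.0, 0.2)])
```

With these inputs the timeline splits into three fragments: only A active, both active, and only B active.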
Therefore, the PPM of a scenario e can be given as:

$$PPM(e, SPM_e) = 1 - \prod_{j \in \text{scenario } e} \left(1 - \frac{duration_j(e)}{T(e)} \cdot \bigcup_{x \in set(j)} m_{V(x)}(SPM)\right)$$
where T(e) is the total time required for scenario e. Finally, the performance evaluation of a Grid service can be given as

$$Cost(e) = \sum_{A \in SPM} CPM(e, A) \cdot PPM(e, A)$$

where, $\forall A \in SPM$,

$$PPM(e, A) = 1 - \prod_{j \in \text{scenario } e} \left(1 - \frac{duration_j(e)}{\sum_{j \in \text{scenario } e} duration_j(e)} \cdot \bigcup_{x \in set(j)} m_{V(x)}(A)\right),$$

$$\bigcup_{x \in set(j)} m_{V(x)}(A) = 1 - \prod_{x \in set(j)} (1 - m_{V(x)}(A)),$$

and

$$m_{V_x(e)}(A) = \sum_{U_x(e) \in \Phi_x} sim(U_x(e), V_x(e)) \times m_{U_x(e)}(A) .$$
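The final aggregation can be sketched as follows. The sketch assumes that, for every malfunction type A, the accumulated assignment of each time fragment has already been computed, so each fragment is reduced to a pair (duration, accumulated m for A). The function names, malfunction names, costs and masses are hypothetical.

```python
from math import prod
from typing import Dict, List, Tuple

Fragments = List[Tuple[float, float]]  # (duration_j, accumulated m for A)


def ppm(fragments: Fragments) -> float:
    """PPM(e, A): duration-weighted combination over all time fragments."""
    total = sum(duration for duration, _ in fragments)
    if total <= 0:
        raise ValueError("the scenario must have a positive total duration")
    return 1.0 - prod(1.0 - (duration / total) * m
                      for duration, m in fragments)


def cost(cpm: Dict[str, float], per_malfunction: Dict[str, Fragments]) -> float:
    """Cost(e): sum over A of CPM(e, A) * PPM(e, A)."""
    return sum(cpm[a] * ppm(frags) for a, frags in per_malfunction.items())


# Hypothetical scenario with three equally long fragments.
per_malfunction = {
    "slow": [(2.0, 0.10), (2.0, 0.28), (2.0, 0.20)],
    "lost": [(2.0, 0.00), (2.0, 0.05), (2.0, 0.05)],
}
cpm = {"slow": 5.0, "lost": 20.0}
expected_cost = cost(cpm, per_malfunction)
```

Because each fragment is weighted by its share of the total duration, a short fragment with a high malfunction assignment contributes less than a long fragment with the same assignment.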
5 Summary

This paper describes ongoing research on Grid service performance assessment. It explores assessing Grid service performance by using the Guided Subsystem Approach. The assessment approach is designed to measure composite Grid services that include multiple loosely coupled component services. A top-down approach is used to capture the characteristics of user demands. Performance requirements are automatically translated and transferred to the component services. The subsystem approach analyses the locally available resources and demands, and derives the probability of under-performance for the local module. Dynamic metrics are used to combine the performance assessment results, taking into account the importance of each component. Future work will apply the assessment approach to specific applications to show its advantages.
4. Francis, P., Jamin, S., Jin, C., Raz, D., Shavitt, Y., Zhang, L.: IDMaps: A Global Internet Host Distance Estimation Service. IEEE/ACM Trans. on Networking (October 2001)
5. Seshan, S., Stemm, M., Katz, R.: SPAND: Shared Passive Network Performance Discovery. In: USENIX Symposium on Internet Technologies and Systems (December 1997)
6. Bhattacharjee, S., Fei, Z.: A Novel Server Selection Technique for Improving the Response Time of a Replicated Service. In: Proc. of IEEE Infocom (1998)
7. Crovella, M., Carter, R.: Dynamic Server Selection in the Internet. In: Proc. of the 3rd IEEE Workshop on the Architecture and Implementation of High Performance Communication Subsystems (HPCS 1995) (1995)
Implementation and Evaluation of DSMIPv6 for MIPL*

Mingli Wang¹, Bo Hu¹, Shanzhi Chen², and Qinxue Sun¹

¹ State Key Lab of Switching and Networking Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
² State Key Lab of Wireless Mobile Communication, China Academy of Telecommunication Technology, Beijing 100083, China
[email protected]
Abstract. Mobile IPv6 provides mobility management for mobile nodes within IPv6 networks. Since IPv6 is not yet widely deployed, it is important for Mobile IPv6 to support mobile nodes that move to IPv4 networks while maintaining established communications. The DSMIPv6 specification extends Mobile IPv6 capabilities to allow dual stack mobile nodes to move within IPv4 and IPv6 networks. This paper describes an implementation of this feature based on the open source MIPL under Linux. Experiments on a testbed using the implementation confirm that DSMIPv6 works as expected.

Keywords: Mobile IPv6, DSMIPv6, MIPL.
1 Introduction

Mobile IP allows mobile nodes to move within the Internet while maintaining reachability and ongoing sessions, using a permanent home address. It can serve as the basic mobility management method in IP-based wireless networks. Mobile IPv6 (MIPv6) [1] shares many features with Mobile IPv4 (MIPv4) and offers many other improvements. There are several implementations of MIPv6 available today, such as MIPL [2] and SunLab's Mobile IP [3]. Some work has been done on testbeds for MIPv6, such as [4][5], but these support IPv6 networks only. However, since IPv6 has not been widely deployed, it is unlikely that mobile nodes will use only IPv6 addresses for their connections. It is reasonable to assume that mobile nodes will move to networks that might not support IPv6 and would therefore need the capability to support IPv4 care-of addresses. Since running MIPv4 and MIPv6 simultaneously is problematic (as illustrated in [6]), the Dual Stack MIPv6 (DSMIPv6) [7] specification extends Mobile IPv6 capabilities to allow dual stack mobile nodes to request that their home agent forward IPv4/IPv6 packets, addressed to their home addresses, to their IPv4/IPv6 care-of addresses. Mobile nodes can maintain their communications with other nodes by using DSMIPv6 only, regardless of whether the visited network is IPv4 or IPv6.

There are already some implementations of DSMIPv6, such as UMIP-DSMIP [8] and an implementation for SHISA [9]. Although these two implementations realize the basic features of DSMIPv6, both lack some features, such as support for private IPv4 care-of addresses. In this paper, we mainly describe our implementation of DSMIPv6, which supports mobile nodes that move to IPv4 networks and use private IPv4 care-of addresses. We also report the results of experiments performed with our implementation.

This paper is organized as follows. We give an overview of DSMIPv6 and its existing implementations in Sec. 2. We then present the design of our implementation in Sec. 3. The experiments performed with our implementation and their results are shown in Sec. 4. In Sec. 5, we give the conclusion of this paper.

* Foundation item: This work is supported by the National High-Technology (863) Program of China under Grant No. 2006AA01Z229, 2007AA01Z222, 2008AA01A316 and the Sino-Swedish Strategic Cooperative Programme on Next Generation Networks No. 2008DFA12110.
2 DSMIPv6

2.1 Principle of DSMIPv6

A dual stack node is a node that supports both IPv4 and IPv6 networks. In order to use DSMIPv6, dual stack mobile nodes need to manage IPv4 and IPv6 home or care-of addresses simultaneously and update their home agents' bindings. The basic features of DSMIPv6 are: the possibility to configure an IPv4 care-of address on the mobile node, the possibility to connect the mobile node behind a NAT device, the possibility to connect the home agent behind a NAT device, and the possibility to use an IPv4 home address on the mobile node.

A mobile node that is a dual stack node has both IPv4 and IPv6 home addresses. The home agent is connected to both IPv4 and IPv6 networks. When in an IPv4 network, the mobile node gets an IPv4 care-of address and registers it with the home agent. Both the IPv4 and IPv6 home addresses are bound to this address. IPv4 traffic goes through an IPv4-in-IPv4 tunnel between the mobile node and the home agent, and IPv6 traffic goes through an IPv6-in-IPv4 tunnel. Similarly, if the mobile node moves to an IPv6 network and registers an IPv6 care-of address, traffic goes through an IPv4-in-IPv6 or IPv6-in-IPv6 tunnel. NAT devices between the mobile node and the home agent can be detected.

2.2 Implementations of DSMIPv6

There are already some implementations of DSMIPv6, such as UMIP-DSMIP [8] and an implementation for SHISA [9]. UMIP-DSMIP is an implementation of DSMIPv6 for the UMIP stack. It is developed by Nautilus6 [8] and is shipped as a set of patches for UMIP 0.4 and 2.6.24 kernels. The implementation for SHISA introduced in [9] is based on BSD operating systems. Although both implementations realize some basic features of DSMIPv6, each lacks some features, such as Network Address Translator (NAT) [10] traversal. Because there are not enough public IPv4 addresses, NAT is widely used. When a mobile node moves to a GPRS or CDMA network, it can only get a private IPv4 care-of address. To
use such a care-of address to maintain communication, the NAT traversal feature must be implemented. Our implementation includes this feature. The underlying MIPv6 implementations also differ: UMIP-DSMIP is based on the UMIP stack, and the other is based on SHISA. Our implementation is based on MIPL (Mobile IPv6 for Linux), which is open source under the GNU GPL. MIPL is developed by HUT to run on Linux environments that support IPv6. The latest version of MIPL is 2.0.2, which requires kernel version 2.6.16.
3 Implementations

3.1 Overview

We implemented the basic features of DSMIPv6, including the possibility to configure an IPv4 care-of address on the mobile node and the possibility to connect the mobile node behind a NAT device. With our implementation, a mobile node can move to an IPv4-only network and keep its communications. Private IPv4 care-of addresses are supported. Fig. 1 shows the supported scenarios. Our implementation is based on MIPL version 2.0.2.
Fig. 1. Supported scenes of our implementation
When a mobile node moves to an IPv6 network and gets an IPv6 care-of address, the basic MIPL works. In addition, it should update the new members of the binding update list and binding cache described in Sections 3.3 and 3.4. If the mobile node moves to an IPv4 network and gets an IPv4 care-of address, it needs to send a binding update message to the home agent's IPv4 address. The binding update message contains the mobile node's IPv6 home address in the home address option. However, since the care-of address is an IPv4 address, the mobile node must include its IPv4 care-of address in the IPv6 packet. After accepting the binding update message, the home agent updates the related binding cache entry, or creates a new binding cache entry if no such entry exists. The binding cache entry points to the
mobile node's IPv4 care-of address. All packets addressed to the mobile node's home address will be encapsulated in a tunnel that carries the home agent's IPv4 address in the source address field and the mobile node's IPv4 care-of address in the destination address field. After creating the corresponding binding cache entry, the home agent sends a binding acknowledgement message to the mobile node.

3.2 Tunnel

In MIPL, when the mobile node moves to an IPv6 foreign network, an IPv6-in-IPv6 tunnel named "ip6tnl" is created. All packets sent to the mobile node's home address are encapsulated in this tunnel from the home agent to the mobile node's care-of address. When the mobile node moves to an IPv4 foreign network, we should set up an IPv6-in-IPv4 tunnel to transmit packets between the mobile node and the home agent. However, if the mobile node's IPv4 care-of address is private, the path between the mobile node's care-of address and the home agent will include NAT devices. Since common NAT devices do not support IPv6-in-IPv4 packets (IP protocol type 41), we should tunnel IPv6 packets in UDP and IPv4, i.e. use an IPv6-in-UDP tunnel. Fig. 2 shows the differences between the two tunnels.
Fig. 2. Differences between the two type tunnels
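The difference between the two encapsulations can be sketched in user space as follows. This is only an illustration of where the UDP header sits, not the kernel code of the implementation: the outer IPv4 header is left out (it would be added by the IP stack), the UDP checksum is left at zero, and the port numbers are arbitrary placeholders rather than values taken from the paper.

```python
import struct

IPPROTO_UDP = 17   # outer IPv4 protocol field for the IPv6-in-UDP tunnel
IPPROTO_IPV6 = 41  # outer IPv4 protocol field for plain IPv6-in-IPv4 ("sit")
UDP_HEADER_LEN = 8


def encapsulate_ipv6_in_udp(ipv6_packet: bytes, src_port: int,
                            dst_port: int) -> bytes:
    """Prepend a UDP header to an inner IPv6 packet.

    The checksum is left at 0, which UDP over IPv4 permits.
    """
    length = UDP_HEADER_LEN + len(ipv6_packet)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    return header + ipv6_packet


def decapsulate_ipv6_from_udp(datagram: bytes) -> bytes:
    """Strip the UDP header and return the inner IPv6 packet."""
    if len(datagram) < UDP_HEADER_LEN:
        raise ValueError("datagram shorter than a UDP header")
    _, _, length, _ = struct.unpack("!HHHH", datagram[:UDP_HEADER_LEN])
    if length < UDP_HEADER_LEN or length > len(datagram):
        raise ValueError("inconsistent UDP length field")
    inner = datagram[UDP_HEADER_LEN:length]
    if not inner or inner[0] >> 4 != 6:
        raise ValueError("payload is not an IPv6 packet")
    return inner


# A minimal 40-byte IPv6 header (version nibble 6, everything else zero).
inner_packet = b"\x60" + bytes(39)
datagram = encapsulate_ipv6_in_udp(inner_packet, 50000, 50001)
```

Because the NAT device sees an ordinary UDP datagram, it can translate the address and port as usual, which is the reason for preferring this encapsulation to protocol 41 behind a NAT.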
The Linux kernel implements the IPv6-in-IPv4 tunnel, which is named "sit". Its two end-points need to be configured before it can be used to transmit packets. We modified the code to realize the IPv6-in-UDP tunnel by adding a UDP header between the IPv4 header and the inner IPv6 header before transmitting. After a tunnel packet is received, the UDP header is removed as well as the IPv4 header. There are two reasons for implementing the tunnel in kernel space. Firstly, it is efficient. Secondly, encapsulation and decapsulation are common functions of the mobile node, home agent and access router. Separating this function from the others and modelling the tunnel provides a common interface to upper applications, so that applications do not need to consider the tunnel. This method improves extensibility.

3.3 Mobile Node Modifications

When an IPv4 care-of address is detected, an IPv4 care-of address option is created and included in the Mobility Header that carries the binding update message sent from the mobile node to the home agent. Before sending the binding update message, the mobile node sets up its end of the sit tunnel and signals the home agent to set up the other end-point of the tunnel. After receiving the message confirming that the tunnel was set up in the home agent, the mobile node creates or updates a binding update list entry for its home address. Then, the binding
Fig. 3. Process flow for IPv4 care-of address
update message is encapsulated in the IPv6-in-UDP tunnel and sent to the home agent if everything has been processed successfully. The process flow is shown in Fig. 3. Based on the structure of the binding update list in MIPL, we add two members: one stores the IPv4 care-of address, and the other marks the current type of care-of address, IPv4 or IPv6. By doing this, we use only one entry per home address. After receiving the binding acknowledgement message, the mobile node updates the corresponding binding update list entry to change the binding status. If the binding acknowledgement indicates that the binding failed, the mobile node should delete the tunnel set up before.

3.4 Home Agent Modifications

The home agent processes the received binding update message and creates or updates a binding cache entry. An IPv4 address acknowledgement option is created and included in the Mobility Header that carries the binding acknowledgement message sent from the home agent to the mobile node. This option indicates whether a binding cache entry was created or updated for the mobile node's IPv4 care-of address. If the home registration failed, the home agent should delete the tunnel set up before. The process flow is shown in Fig. 3. As with the binding update list, we add two members to the structure of the binding cache: one stores the IPv4 care-of address, and the other marks the current type of care-of address.
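The two added members can be pictured with a small data structure. The sketch below is a user-space illustration only: the class and field names are hypothetical and do not correspond to the actual MIPL structures, and the addresses are documentation or private-range examples.

```python
from dataclasses import dataclass
from enum import Enum
from ipaddress import IPv4Address, IPv6Address
from typing import Dict, Optional, Union


class CoaType(Enum):
    IPV6 = 6
    IPV4 = 4


@dataclass
class BindingEntry:
    """One entry per home address, shared by IPv4 and IPv6 care-of addresses."""
    home_address: IPv6Address
    coa6: Optional[IPv6Address] = None
    coa4: Optional[IPv4Address] = None  # added member: IPv4 care-of address
    coa_type: CoaType = CoaType.IPV6    # added member: current address type

    def update_care_of(self, addr: Union[IPv4Address, IPv6Address]) -> None:
        """Store the new care-of address and mark which family is current."""
        if isinstance(addr, IPv4Address):
            self.coa4, self.coa_type = addr, CoaType.IPV4
        else:
            self.coa6, self.coa_type = addr, CoaType.IPV6

    def current_care_of(self) -> Union[IPv4Address, IPv6Address, None]:
        return self.coa4 if self.coa_type is CoaType.IPV4 else self.coa6


# Binding cache keyed by home address, so a handover updates the entry in place.
cache: Dict[IPv6Address, BindingEntry] = {}
home = IPv6Address("2001:db8::1")
entry = cache.setdefault(home, BindingEntry(home_address=home))
entry.update_care_of(IPv4Address("192.168.1.10"))  # private IPv4 care-of address
```

Keeping the type flag next to both address fields is what allows a single entry to follow the node as it moves between IPv4 and IPv6 foreign networks.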
4 Evaluations

4.1 Testbed Details

The testbed's network topology is shown in Fig. 4. The columns in the figure are PC routers running on the Linux platform. CN, the correspondent node, is a PC connected
Fig. 4. Network topology of testbed
with the home agent (HA) via an IPv6 network. MN, the mobile node, is a laptop running Linux that moves between the home network and different foreign networks. The protocol is IPv6 in the home network and IPv4 in the foreign networks. The modified MIPL runs on MN and HA. MN has an IPv6 home address and can get both IPv6 and IPv4 care-of addresses. CN has only an IPv6 address and communicates with MN. HA has both IPv6 and IPv4 addresses; its IPv4 address is public and is known by MN. MN can get a private IPv4 address as its care-of address in the WLAN or GPRS network. Initially MN is in the home network and communicates with CN via IPv6. After moving to a foreign network, WLAN or GPRS, it gets an IPv4 care-of address. MN also moves back to the home network and uses its IPv6 address again.

4.2 RTT (Round Trip Time) Test

We use the "ping6" command to test the RTT, repeating the measurement 200 times to obtain the mean values. Table 1 compares the RTT in different networks, for MN in the home network and in the foreign networks.

Table 1. RTT comparisons in different networks

                 MN in home network   MN in WLAN   MN in GPRS
Mean RTT (ms)    2.559                223.6        615.3
The RTT in the home network is the smallest of the three, since MN is directly linked with HA. The RTT in the GPRS network is the largest, because the packets are tunnelled and pass through more hops on the way. The results confirm that the DSMIPv6 implementation works as expected; above all, the connection is maintained through the private IPv4 address and the tunnel in WLAN and GPRS.

4.3 Handover Delay Test

The "crontab" tool of Linux is used to trigger handovers automatically. We collected 100 handovers for analysis. The results are listed in Table 2.
Table 2. Handover delays in different situations

                 Home to WLAN   WLAN to GPRS   GPRS to Home
Mean time (s)    1.942          2.038          2.846
As Table 2 shows, the delay of moving from the home network to a foreign network and of moving between different foreign networks is approximately 2 seconds. The delay of moving from a foreign network to the home network is somewhat larger (about 2.8 seconds), because more operations are performed when MN returns home. These tests also confirm the validity of our implementation. We also ran services, such as ftp and video, to test the effect of handover between networks. The interruption was not very noticeable and did not visibly degrade the user experience.
5 Conclusions

This paper describes the DSMIPv6 implementation on MIPL for Linux systems. The implementation is designed to support mobile nodes moving between IPv4 and IPv6 networks while maintaining established communications. An IPv6-in-UDP tunnel is set up to transmit IPv6 packets via an IPv4 link and to traverse NAT devices. An IPv4 care-of address and an address flag are added to the binding update list and binding cache; they store the IPv4 care-of address and mark the address type, respectively. An IPv4 care-of address option and an IPv4 address acknowledgement option are used to register the IPv4 care-of address with the home agent. The evaluation confirms that the DSMIPv6 implementation works as expected on our testbed, which suggests that the specification is stable enough to work in a real environment.
References

1. Johnson, D.: Mobility Support in IPv6, RFC 3775, The Internet Engineering Task Force (June)
2. Mobile IPv6 for Linux, http://www.mobile-ipv6.org/
3. Networking and security center: Sun Microsystems Laboratories, http://playground.sun.com
4. Busaranun, A., Pongpaibool, P., Supanakoon, P.: Handover Performance of Mobile IPv6 on Linux Testbed. ECTI-CON (2006)
5. Lei, Y.: The Principle of Mobile IPv6 and Deployment of Test Bed. Sciencepaper Online
6. Tsirtsis, G., Soliman, H.: Mobility management for Dual stack mobile nodes, A Problem Statement, RFC 4977, The Internet Engineering Task Force (August 2007)
7. Soliman, H. (ed.): Mobile IPv6 support for dual stack Hosts and Routers (DSMIPv6), Internet Draft draft-ietf-mip6-nemo-v4traversal-02, The Internet Engineering Task Force (June 2006)
8. UMIP-DSMIP, http://www.nautilus6.org/
9. Mitsuya, K., Wakikawa, R., Murai, J.: Implementation and Evaluation of Dual Stack Mobile IPv6, AsiaBSDCon 2007 (2007)
10. Egevang, K.: The IP Network Address Translator (NAT), RFC 1631, The Internet Engineering Task Force (May 1994)
Grid Management: Data Model Definition for Trouble Ticket Normalization

Dimitris Zisiadis¹, Spyros Kopsidas¹, Matina Tsavli¹, Leandros Tassiulas¹, Leonidas Georgiadis¹, Chrysostomos Tziouvaras², and Fotis Karayannis²

¹ Center for Research and Technology Hellas, 6th km Thermi-Thessaloniki, 57001, Hellas
² GRNET, 56 Mesogion Av. 11527, Athens, Hellas
Abstract. Handling multiple sets of trouble tickets (TTs) originating from different participants in today’s GRID interconnected network environments poses a series of challenges for the involved institutions. Each participant follows different procedures for handling trouble incidents in its domain, according to the local technical and linguistic profile. The TT systems of the participants collect, represent and disseminate TT information in different formats. As a result, management of the daily workload by a central Network Operations Centre (NOC) is a challenge in its own right. Normalization of TTs to a common format for presentation and storage at the central NOC is mandatory. In the present work we provide a model for automating the collection and normalization of the TTs received from the multiple networks forming the Grid. Each participant uses its home TT system within its domain for handling trouble incidents, whereas the central NOC gathers the tickets in the normalized format for storage and handling. Our approach uses XML as the common representation language. The model was adopted and used as part of the SA2 activity of the EGEE-II project.

Keywords: Network management, trouble ticket, grid services, grid information systems, problem solving.
GÉANT2 [3] is an example of a Grid. It is the seventh generation of the pan-European research and education network, and the successor to the pan-European multi-gigabit research network GÉANT. The GÉANT2 network connects 34 countries through 30 national research and education networks (NRENs), using multiple 10Gbps wavelengths. GÉANT2 is putting user needs at the forefront of its plans for network services and research. Usually a central Network Operations Centre (NOC) is established at the core of the network to support network and service integration. Ideally, a uniform infrastructure should be put in place, with interoperating network components and systems, in order to provide services to the users of the Grid and to manage the network. In practice, though, this is not the case: different trouble ticket (TT) systems are used by the participating networks. There is a wide variety of TT systems available, with differing functionality and pricing. Examples are Keystone [4], ITSM [5], HEAT [6], SimpleTicket [7] and OTRS [8]. In-house developed systems, as is the case for GRNET [9], are another option. The advantages of this option are that it offers freedom in design and localization and that it meets the required functionality in full. Its disadvantage is that it has to be deployed and maintained locally. Nevertheless, it is adopted by many Internet Service Providers (ISPs), both academic and commercial, because it enables service delivery and monitoring with total control over the network. The current work evolved within the Service Activity 2 (SA2) of the EGEE-II European funded project [10]. A central NOC, called the ENOC [11], is responsible for collecting and handling the TTs received from the TT systems of the participating institutions.
Each of them uses its own TT system and delivers TTs in a different format, while the TT load grows proportionally with the network size and the number of users served. TT normalization, i.e. transformation to a common format that is reasonable for all parties and copes with service demands in a dynamic and effective way, is of crucial importance for successful management of the Grid. In the present work we define a data model for TT normalization for the participating institutions in EGEE-II. The model is designed in accordance with the specific needs of the participants and meets the requirements of the multiple TT systems in use. It is both effective and comprehensive, as it covers the core activities of the NOCs. It is also dynamic, as it allows other options to be included in the future, according to demand. This paper is organized as follows: section 2 outlines related work on TT normalization. In section 3 we present our data model in detail, whereas in section 4 we describe a prototype implementation of the proposed solution. Finally, in section 5 we present the conclusions of this work.
2 Related Work
Whenever multiple organizations and institutions form a Grid, or some other form of cooperative platform for network service deployment, the need arises to define a common understanding over network operations and management issues. Trouble incidents are recorded in case a problem arises, affecting normal network operations or
Grid Management: Data Model Definition for Trouble Ticket Normalization
services. Typical problems are failures in network links or other network elements (e.g. routers, servers), security incidents (e.g. intrusion detection) or any other problem that affects normal service delivery (e.g. service overload). The incidents are represented in specific formats, namely TTs. A TT is issued in order for the network operators to record and handle the incident. RFC 1297 [12], titled NOC Internal Integrated Trouble Ticket System Functional Specification Wishlist, describes general functions of a TT system that could be designed for NOCs, exploring competing uses, architectures, and desirable features of integrated internal trouble ticket systems for Network and other Operations Centres. The network infrastructure available to EGEE is served by a set of National Research and Education Networks (NRENs) via the GÉANT2 network. Reliable network resource provision to the Grid infrastructure depends heavily on coherent collaboration between a large number of different parties on both the NREN/GÉANT2 and EGEE sides, as described in [13]. Common problems and solutions, as well as strategies for investigating problem reports, have been presented in [14] [15]. The concept of the Multi-Domain Monitoring (MDM) service, which describes the transfer of end-to-end monitoring services in order to serve the needs of different user groups, is discussed in [16]. The OSS Trouble Ticket API (OSS/J API) [17] provides interfaces for creating, querying, updating, and deleting trouble tickets (trouble reports). The Trouble Ticket API focuses on the application of the Java 2 Platform, Enterprise Edition (J2EE), and XML technologies to facilitate the development and integration of OSS components with Trouble Ticket Systems. The Incident Object Description Exchange Format (IODEF) [18] is a format for representing computer security information commonly exchanged between Computer Security Incident Response Teams (CSIRTs).
It provides an XML representation for conveying incident information across administrative domains between parties that have an operational responsibility of remediation or a watch-and-warning over a defined constituency. The data model encodes information about hosts, networks, and the services running on these systems; attack methodology and associated forensic evidence; impact of the activity; and limited approaches for documenting workflow. The EGEE project makes heavy use of shared resources spanning more than 45 countries and involving more than 1600 production hosts. To link these resources together, the network infrastructure used by EGEE is mainly served by GÉANT2 and the NRENs. The NRENs provide links to sites within a country, while GÉANT2, the seventh generation of the pan-European research and education network, connects countries. To link the Grid and network worlds, the ENOC [11], the EGEE Network Operation Centre, has been defined in EGEE as the operational interface between the EGEE Grid, GÉANT2 and the NRENs to check the end-to-end connectivity of Grid sites. Through daily relations with all providers of the network infrastructures on top of which EGEE is built, it ensures that the complex nexus of domains involved in linking Grid sites performs efficiently. The ENOC deals with network troubleshooting, notifications from NRENs, network Service Level Agreement (SLA) installation and monitoring, and network usage reporting. The ENOC acts as the network support unit in the Global Grid User Support (GGUS) of EGEE to provide coordinated user support across Grid and network services.
D. Zisiadis et al.
In the next section we describe the Data Model that was adopted by the EGEE parties for TT normalization.
3 Definition of the Data Model
There has been a long discussion on the functionality of the emerging data model. We thoroughly examined the various fields supported by the numerous ticketing systems in use. Considerable effort also went into incorporating all critical fields that could ease network monitoring and management of the Grid. We consolidated all experts' opinions regarding the importance of each field and its effects on the management of both the individual NRENs and the Grid. The goal was to define a comprehensive set of fields that would best fit the network management needs of the EGEE grid. As a result of this procedure, we provide below the definition of the data model that aims to achieve the required functionality for the management of the Grid.
3.1 Terminology
The Trouble Ticket Data Model (TTDM) uses specific keywords to describe the various data elements. These keywords are Defined, Free, Multiple, List, Predefined String, String, Datetime, Solved, Cancelled, Inactive, Superseded, Opened/Closed, Operational, Informational, Administrative and Test, and they are interpreted as described in Sections 3.5 and 3.6.
3.2 Notations
This section provides a Unified Modelling Language (UML) model describing the individual classes and their relationships with each other. The semantics of each class are discussed and their attributes are explained. The terms "class" and "element" will be used interchangeably to reference a given UML class in the data model.
3.3 The TTDM Attributes
The Field Name class has four attributes. Each attribute provides information about a Field Name instance. The attributes that characterize one instance constitute all the information required to form the data model.
− DESCRIPTION: This attribute contains a short description of the field name.
− TYPE: This attribute contains information about the type of the field name. The values that it may contain are: Defined, Free, Multiple, List.
− VALID FORMAT: This attribute contains information about the format of each field. The values that it may contain are: Predefined String, String, Datetime.
− MANDATORY: This attribute indicates whether the information of each field is required or optional. If the information is required, MANDATORY contains the word Yes; if filling in the information is optional, MANDATORY contains the word No.
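As a rough illustration of the four attributes above, a Field Name instance could be modelled as a small record type. This is only a sketch: the names `FieldSpec`, `FieldType` and `ValidFormat` are hypothetical, and the attribute values chosen for the `TT_Priority` example are assumptions made for illustration, not values stated in the model.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    """Allowed values of the TYPE attribute."""
    DEFINED = "Defined"
    FREE = "Free"
    MULTIPLE = "Multiple"
    LIST = "List"


class ValidFormat(Enum):
    """Allowed values of the VALID FORMAT attribute."""
    PREDEFINED_STRING = "Predefined String"
    STRING = "String"
    DATETIME = "Datetime"


@dataclass(frozen=True)
class FieldSpec:
    """One Field Name instance with its four attributes (hypothetical type)."""
    name: str
    description: str
    type: FieldType
    valid_format: ValidFormat
    mandatory: bool

    def mandatory_label(self) -> str:
        # The model stores the words "Yes" / "No" in MANDATORY.
        return "Yes" if self.mandatory else "No"


# Assumed attribute values, for illustration only.
TT_PRIORITY = FieldSpec(
    name="TT_Priority",
    description="Priority assigned to the ticket",
    type=FieldType.MULTIPLE,
    valid_format=ValidFormat.PREDEFINED_STRING,
    mandatory=True,
)
```

Using enumerations for TYPE and VALID FORMAT keeps the keyword sets closed, which matches the fixed vocabularies listed in the text.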
3.4 The TTDM Aggregate Classes
The TTs collected from multiple telecommunications networks are processed and mapped to a normalized TTDM. In this section, the individual components of the TTDM are discussed in detail. The TTDM aggregate class provides a standardized representation for commonly exchanged Field Name data. We provide below the field name values that are defined in our model. For convenience and readability, as most names are self-explanatory, we only provide the values: Partner_ID, Original_ID, TT_ID, TT_Open_Datetime, TT_Close_Datetime, Start_Datetime, Detect_Datetime, Report_Datetime, End_Datetime, TT_Lastupdate_Time, Time_Window_Start, Time_Window_End, Work_Plan_Start_Datetime, Work_Plan_End_Datetime, TT_Title, TT_Short_Description, TT_Long_Description, Type, TT_Type, TT_Impact_Assessment, Related_External_Tickets, Location, Network_Node, Network_Link_Circuit, End_Line_Location_A, End_Line_Location_B, Open_Engineer, Contact_Engineers, Close_Engineer, TT_Priority, TT_Status, Additional_Data, Related_Activity_History, Hash, TT_Source, Affected_Community, Affected_Service.
3.5 Types and Definitions of the TYPE Class
The TYPE Class defines four data types, as follows:
− Defined: The TTDM provides a means to compute this value from the rest of the fields
− Free: The value can be freely chosen
− Multiple: One value among multiple fixed values
− List: Many values among multiple fixed values
3.6 Types and Definitions of the VALID FORMAT Class
The VALID FORMAT Class defines three data types, as follows:
− Predefined String: A predefined value in the data model
− String: A value defined by the user of the model
− Datetime: A date-time string that indicates a particular instant in time
The predefined values are associated with the appropriate Field Name class. The values are strict and should not be altered.
The values defined in our model are the following:
− TT_Type with accepted predefined values: Operational, Informational, Administrative, Test.
− Type with accepted predefined values: Scheduled, Unscheduled.
− TT_Priority with accepted predefined values: Low, Medium, High.
− TT_Short_Description with accepted predefined values: Core Line Fault, Access Line Fault, Degraded Service, Router Hardware Fault, Router Software Fault, Routing Problem, Undefined Problem, Network Congestion, Client upgrade, IPv6, QoS, Other.
− TT_Impact_Assessment with accepted predefined values: No impact, Reduced redundancy, Minor performance impact, Severe performance impact, No connectivity, On backup, At risk, Unknown.
− TT_Status with accepted predefined values: Solved, Cancelled, Inactive, Superseded, Opened/Closed.
− TT_Source with accepted predefined values: Users, Monitoring, Other NOC.
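Because the predefined values are strict, a normalizer can check them mechanically. The following is a minimal sketch of such a check; the function name `invalid_fields` and the dictionary-based ticket representation are assumptions, and "Opened/Closed" is treated here as a single literal value, exactly as it is written in the list above.

```python
# Predefined vocabularies, copied from the list above.
PREDEFINED_VALUES = {
    "TT_Type": {"Operational", "Informational", "Administrative", "Test"},
    "Type": {"Scheduled", "Unscheduled"},
    "TT_Priority": {"Low", "Medium", "High"},
    "TT_Short_Description": {
        "Core Line Fault", "Access Line Fault", "Degraded Service",
        "Router Hardware Fault", "Router Software Fault", "Routing Problem",
        "Undefined Problem", "Network Congestion", "Client upgrade",
        "IPv6", "QoS", "Other",
    },
    "TT_Impact_Assessment": {
        "No impact", "Reduced redundancy", "Minor performance impact",
        "Severe performance impact", "No connectivity", "On backup",
        "At risk", "Unknown",
    },
    "TT_Status": {"Solved", "Cancelled", "Inactive", "Superseded", "Opened/Closed"},
    "TT_Source": {"Users", "Monitoring", "Other NOC"},
}


def invalid_fields(ticket):
    """Return the names of fields whose value is outside its vocabulary.

    Fields without a predefined vocabulary (e.g. TT_Title) are not checked.
    """
    return sorted(
        name
        for name, value in ticket.items()
        if name in PREDEFINED_VALUES and value not in PREDEFINED_VALUES[name]
    )


ticket = {"TT_Type": "Operational", "TT_Priority": "Urgent", "TT_Title": "Link down"}
print(invalid_fields(ticket))
```

A ticket whose list of invalid fields is empty conforms to the predefined vocabularies; free-text and date-time fields would need separate checks.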
4 Implementation
XML [19] was chosen for the implementation schema, due to its powerful mechanisms and its global acceptance. The implemented system operates as depicted in Fig. 1, below.
Fig. 1. The Implemented System
Our system connects to the GRnet ticketing system and uses POP e-mail to download the TTs. It then converts the TTs according to the data model presented, stores them in a database and finally sends them to the ENOC via e-mail to a specified address. More options are available:
− TTs can be sent via the HTTP protocol to a web service or a form.
− TTs can be stored in another (remote) database.
− TTs can be sent via e-mail in XML format (not suggested, since the XML format is not human readable).
An option to send SMS messages to mobile phones is under development, since this proves to be vital for incidents of extreme importance. For this option to work, an SMS server needs to be provided. Linguistic issues are also being addressed, in order to ease understanding of all fields in a TT, e.g. Greek to English translation needs to be performed for some predefined fields, like TT Type. Our system also improves security: most web forms use ASP, PHP or CGI to update the database or perform some other action. This is inherently insecure because the database needs to be accessible from the web server. Our system offers a number of advantages:
− The email can be sent to a completely separate and secure PC.
− Our system can process the email without ever needing access to the web server or being accessible from the Internet.
− If a PC or network connection is down, the emails will sit waiting. Our system will 'catch up' when the PC or network is back up.
Moreover, we offer increased redundancy: when using web forms to update backend databases, the problem always arises of what to do if the database is down. Our system resolves this problem, because the email will always get sent. If the database cannot be updated, our system will wait and process the email later.
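The conversion step described in this section (an e-mailed ticket in, a normalized XML document out) can be sketched with the Python standard library. This is a hypothetical illustration, not the implemented system: the "Key: value" body layout of the incoming ticket, the `FIELD_MAP` labels and the `TroubleTicket` root element name are all assumptions, since the paper does not give the home TT format or the XML schema.

```python
import email
import xml.etree.ElementTree as ET

# Hypothetical mapping from a home TT system's labels to TTDM field names.
FIELD_MAP = {
    "Ticket": "Original_ID",
    "Title": "TT_Title",
    "Priority": "TT_Priority",
    "Status": "TT_Status",
}


def normalize(raw_message, partner_id):
    """Convert one e-mailed ticket into a normalized XML string."""
    msg = email.message_from_string(raw_message)
    root = ET.Element("TroubleTicket")  # assumed root element name
    ET.SubElement(root, "Partner_ID").text = partner_id
    # Assumed body layout: one "Label: value" pair per line.
    for line in msg.get_payload().splitlines():
        label, sep, value = line.partition(":")
        field = FIELD_MAP.get(label.strip())
        if sep and field:
            ET.SubElement(root, field).text = value.strip()
    return ET.tostring(root, encoding="unicode")


RAW = (
    "From: noc@example.org\n"
    "Subject: ticket update\n"
    "\n"
    "Ticket: 4711\n"
    "Title: Core link down\n"
    "Priority: High\n"
    "Status: Solved\n"
)

print(normalize(RAW, "GRNET"))
```

In a fuller version the messages would be fetched over POP (for example with `poplib`) and left on the server until the database update succeeds, which is what gives the store-and-forward behaviour described above.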
5 Conclusions
In the present work, a common format for normalizing trouble tickets from the various NRENs participating in the Grid has been designed and implemented within the framework of the EGEE project. XML was used to represent the common format. The adopted transformation schema is lightweight yet effective and is accepted by all participating partners. The solution has passed beta testing and is already in use for the GRNET TTs. The other NRENs are migrating to the solution gradually.
Acknowledgments
This paper is largely based on work for TT normalization in the SA2 Activity of EGEE. The following groups and individuals contributed substantially to this document and are gratefully acknowledged: − − − − −
References
1. Quality of Service (QoS), http://en.wikipedia.org/wiki/Qualityofservice
2. Service Level Agreement (SLA), http://en.wikipedia.org/wiki/ServiceLevelAgreement
3. GÉANT2, http://www.geant2.net/
4. Keystone, http://www.stonekeep.com/keystone.php/
5. ITSM, http://www.bmc.com/products/products_services_detail/0,0,19052_19426_52560284,00.html
6. HEAT, http://www.frontrange.com/ProductsSolutions/SubCategory.aspx?id=35&ccid=6
7. SimpleTicket, http://www.simpleticket.net/
8. OTRS, http://otrs.org/
9. GRnet, http://www.grnet.gr/
10. Service Activity 2 (SA2), http://www.eu-egee.org/
11. ENOC, http://egee-sa2.web.cern.ch/egee-sa2/ENOC.html
12. RFC 1297, http://www.ietf.org/rfc/rfc1297.txt
13. EGEE-SA2-TEC-503527- NREN Interface-v2-7
14. PERTKB Web, http://pace.geant2.net/cgi-bin/twiki/view/PERTKB/WebHome
15. GÉANT2 Performance Enhancement and Response Team (PERT) User Guide and Best Practice Guide, http://www.geant2.net/upload/pdf/GN2-05-176v4.1.pdf
16. Implementing a Multi-Domain Monitoring Service for the European Research Networks, http://tnc2007.terena.org/core/getfile.php?file_id=148
17. The OSS Trouble Ticket API (OSS/J API), http://jcp.org/aboutJava/communityprocess/first/jsr091/index.html
18. The Incident Object Description Exchange Format (IODEF), http://tools.ietf.org/html/draft-ietf-inch-implement-01
19. XML, http://en.wikipedia.org/wiki/XML
Extension of Resource Management in SIP Franco Callegati and Aldo Campi University of Bologna, Italy {franco.callegati,aldo.campi}@unibo.it
Abstract. In this work we discuss the issue of communication QoS management in a high performance network oriented to carrying GRID applications. Based on a set of previous works that successfully proved the feasibility of the concept, we propose to use sessions to logically identify and manage the communication between the applications. As a consequence, the quality of service of the communication is mapped onto the reservation of network resources to support a given session. Starting from a framework defined to support such a task for VoIP applications, we show here how this framework can be extended to match the needs of GRID computing.
Keywords: Application oriented networking, QoS, SIP, Session layer.
with the application languages. This is also feasible and a fully functional application working on top of an OBS data plane was reported in [6]. The concepts presented in this paper are an extension of this work to support QoS management. They can be easily applied to a general network environment but we will stick to the aforementioned scenario as a reference. RFC 3312 [7] addresses the problem of QoS management in session oriented communications using SIP as the signaling protocol. The RFC defines a framework that is based on the concept of “precondition” at session establishment, as will be explained later in the paper. Now the question is: could this framework be used to support the needs of a QoS oriented reservation of network resources in an application oriented OBS network? We argue in the following that the answer to this question is no, and provide some first insight into a possible solution by analyzing what is missing and proposing a generalization of the framework. The paper is organized as follows. Section 2 reviews the use of the session layer to support application oriented networking. Section 3 presents the problems to be addressed to support the specific quality of service requirements of the applications; the applicability of the existing QoS management framework in SIP is analyzed and its weaknesses are identified. Section 4 proposes an extension for general application oriented networking. Finally, in section 5 we draw some conclusions.
2 SIP Based Application Oriented Networking
A trend we can envisage in new IT services such as GRID computing is that they tend to be “state-full” rather than state-less. More complex communication paradigms require a number of state variables to manage the information flows. Moreover, the Quality of Service issue becomes more and more important as the communication paradigms become more complex. Multimedia streams have rather stringent QoS requirements and communication without QoS guarantees is rather meaningless in this context. We believe that the state-full approach with QoS management capabilities has to be pursued for an “application aware” control plane that must manage the communication “vertically”. This is where the session concept and the SIP protocol come into play:
− sessions are used to handle the communication requests and maintain their state by mapping it into session attributes;
− sessions are mapped into a set of networking resources with QoS guarantees;
− the SIP protocol is used to manage the sessions, since it provides all the primitives for user authentication, session set-up, suspension and retrieval, as well as modification of the service by adding or taking away resources or communication facilities according to the needs.
In [3] and [4] we have already described some different approaches to resource discovery and reservation exploiting the SIP protocol for GRID networks. The resource reservation is part of the establishment of a session that is then used to carry the information flows. We assumed that the SIP user agent (UA) already knows the location of the resources in the network, so direct reservation can be performed. In this scenario a SIP UA coupled with an application opens a dialog with a peer SIP UA. The dialog is the logical entity identifying the service relation between end-points (e.g. Video on
Demand dialog) and is made of one or many sessions. The sessions are the logical entities representing the communication services delivered between end-points, e.g. audio session, video session, chat session etc. Sessions exploit the network resources and may be associated with QoS requests. These concepts are in line with the IP Multimedia Subsystem (IMS) architecture [8]. Generally speaking, the GRID application can be successfully served if both the application resources to execute the required processing and the network resources required to exchange the related data are available. Two models are possible to finalize the network resource reservation:
− end-to-end model, requiring that the peers have the capability to map the media streams of a session into network resource reservations; for example, IMS supports several end-to-end QoS models and terminals may use link-layer resource reservation protocols (PDP Context Activation), the Resource ReSerVation Protocol (RSVP), or Differentiated Services (DiffServ) directly;
− core oriented model, where the transport network already provides QoS oriented service classes (for instance using DiffServ) and the sessions are mapped directly into these classes by the network itself.
In both models the sequence of actions to be performed in the application layer is:
− agree on the QoS characteristics for the communication paths of the session;
− check if the related network resources are available;
− reserve the resources for the desired session;
− set up the application session and consequently start the data streams.
In the network layer the aforementioned actions are mapped into:
− check by means of the network Control Plane whether enough resources are available;
− reserve the network resources providing the QoS required.
The main issue is how the search and reservation of application and network resources are combined in time during the session start-up phase.
This matters because the completion of the session requires alerting the end-user to ask whether the session is acceptable or not (phone ringing in a VoIP call, for instance). Whether this has to be done before, while or after the network resources required for a given QoS communication are reserved is a matter to be discussed and tailored according to the specific service. It is worth mentioning that for communications spanning multiple domains, QoS support is not straightforward. In the remainder of this work we do not consider the multi-domain issue, which will be the subject of further investigation, and focus on QoS guarantees within a single network domain.
3 Session QoS Management with SIP
Given that a session set-up phase is always started with an INVITE message sent by the caller to the callee, several interaction models are possible to guarantee the establishment of a session with network QoS guarantees.
Network reservation during session set up: while the INVITE message goes through the network (e.g. with anycast procedures) a network path is reserved before the
INVITE message reaches the resource destination. This method can be used for very fast provisioning purposes.
Network reservation before application reservation: as soon as the INVITE message arrives at the destination and the application resources are checked, the network reservation starts. After the network resources are reserved, the application resource is reserved and the session is started. This is the case of VoIP calls.
Network reservation before session set up: as soon as the INVITE message arrives at the destination and the application resources are found, the application and network resource reservations are started in parallel. Once both reservations are completed independently, and both the application resource and the network path are available, the session is started. This can be the case of standard GRID sessions.
Network reservation after application reservation: the INVITE message arrives at the destination and the application resources are checked and reserved. Then the network reservation starts in the network and, as soon as the path between GRID user and GRID resource is established, the application session is started. This can be used when there are few application resources available and the probability of finding a free application resource is low.
Network reservation after session set up: the INVITE message arrives at the destination and the application resources are checked and reserved; the application session can start immediately, without any network resource being reserved. Then the network reservation starts in the network and, as soon as the path between GRID user and GRID resource is established, the already started application session can take advantage of it by moving the transmission from best-effort to QoS enabled. This can be used when QoS is not crucial for the application (or at least not in the beginning part of the session).
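The five interaction models differ only in the order in which the same steps occur, which can be made explicit in a small sketch. The step names and the `MODELS` table below are hypothetical encodings of the descriptions above; in particular, the steps that follow the arrival of the INVITE in the first model are an assumption, since the text only states that the path is reserved before the INVITE arrives.

```python
from enum import Enum, auto


class Step(Enum):
    INVITE_ARRIVES = auto()
    APP_CHECK = auto()
    APP_RESERVE = auto()
    NET_RESERVE = auto()
    SESSION_START = auto()


S = Step

# Each model is a list of stages; steps in the same stage run in parallel.
MODELS = {
    "during session set up": [
        {S.NET_RESERVE}, {S.INVITE_ARRIVES}, {S.APP_CHECK},
        {S.APP_RESERVE}, {S.SESSION_START},  # tail of this model is assumed
    ],
    "before application reservation": [
        {S.INVITE_ARRIVES}, {S.APP_CHECK}, {S.NET_RESERVE},
        {S.APP_RESERVE}, {S.SESSION_START},
    ],
    "before session set up": [
        {S.INVITE_ARRIVES}, {S.APP_CHECK},
        {S.APP_RESERVE, S.NET_RESERVE}, {S.SESSION_START},
    ],
    "after application reservation": [
        {S.INVITE_ARRIVES}, {S.APP_CHECK}, {S.APP_RESERVE},
        {S.NET_RESERVE}, {S.SESSION_START},
    ],
    "after session set up": [
        {S.INVITE_ARRIVES}, {S.APP_CHECK}, {S.APP_RESERVE},
        {S.SESSION_START}, {S.NET_RESERVE},
    ],
}


def stage_of(model, step):
    """Index of the stage in which `step` occurs for the given model."""
    for index, stage in enumerate(MODELS[model]):
        if step in stage:
            return index
    raise ValueError(f"{step} not part of model {model!r}")


def starts_with_qos(model):
    """True if the network path is reserved before the session starts."""
    return stage_of(model, S.NET_RESERVE) < stage_of(model, S.SESSION_START)


for name in MODELS:
    print(name, "->", "QoS from start" if starts_with_qos(name) else "best-effort first")
```

Only the last model lets the session begin on a best-effort path; in all the others the network reservation is a prerequisite for starting the session.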
The management of QoS is not part of the standard SIP protocol, but the issue is there and, not surprisingly, a Resource Management framework was defined for establishing SIP sessions with QoS [7]. The proposed solution exploits the concept of pre-condition. The pre-condition is a set of “desiderata” that are negotiated during the session set-up phase between the two SIP UAs involved in the communication (the application terminals). If the pre-conditions can be met by the underlying networking infrastructure then the session is set up, otherwise the set-up phase fails and the session is not established. This scheme is mandatory because the reservation of network resources frequently requires learning the IP address, port, and session parameters of the callee. Moreover, in bidirectional communications the QoS parameters must be agreed upon between caller and callee. Therefore the reservation of network resources cannot be done before an exchange of information between caller and callee has been finalized. This exchange of information is the result of the initial offer/answer message exchange at the beginning of the session start up. The information exchange sets the pre-conditions to the session, which is established if and only if they can be met.
3.1 Weakness of the Current Resource Management Framework
The current framework says that the QoS pre-conditions are included into the SDP [9] message, in two state variables: current status and desired status. Consequently, the
SIP UA treats these variables as all other SDP media attributes. The current and desired status variables are exchanged between the UAs using offers and answers in order to have a shared view of the status of the session. The framework has been proposed for VoIP applications and is tailored to this specific case, but has the following limitations when considering a generalized application aware environment.
1. Both session and quality of service parameters are carried by the same SDP document. Thus, a session established with another session description protocol (e.g. the Job Submission Description Language (JSDL)) is not possible.
2. The pre-condition framework imposes a strict “modus operandi”, since pre-conditions must be met before alerting the user. Therefore network reservation must always be completed before service reservation.
3. Since the network reservation is performed by the UA at the edge of the network, only a mechanism based on end-to-end reservation is applicable.
4. QoS is performed on each single connection, without the ability to group connections resulting from sessions established by others.
5. Since SDP is a line-oriented description protocol, it has a reduced enquiring capability. Moreover, only one set of pre-conditions can be expressed by SDP. Multiple negotiations of pre-conditions are not possible.
6. The SDP semantics are rather limited. It is not possible to specify detailed QoS parameters, since the SDP lines are marked with a “QoS” parameter without specifying any additional detail (e.g. bandwidth, delay, jitter, etc.).
Given these issues, we believe that an extended resource management framework is needed in order to achieve a general framework for QoS for application oriented networks supporting GRID applications.
In the next section we propose an extension to the framework to generalize its applicability to the case of the Network reservation before application reservation scenario, which we believe to be the most likely to be used in a GRID network. Extensions to other scenarios are possible and will be presented in future works.
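To make the current and desired status variables concrete, the sketch below parses the two SDP attribute lines used by the precondition framework and checks whether the mandatory pre-conditions are met. The `a=curr:` and `a=des:` attribute syntax follows the examples in RFC 3312; the helper names and the simplified matching rule (a desired direction is met when the current direction covers it) are assumptions made for illustration.

```python
# SDP offer with QoS pre-conditions, in the style of the RFC 3312 examples.
SDP_OFFER = """\
m=audio 20000 RTP/AVP 0
c=IN IP4 192.0.2.1
a=curr:qos e2e none
a=des:qos mandatory e2e sendrecv
"""

# Direction tags expressed as the set of directions they cover.
DIRECTIONS = {
    "none": set(),
    "send": {"send"},
    "recv": {"recv"},
    "sendrecv": {"send", "recv"},
}


def parse_preconditions(sdp):
    """Extract current and desired status, keyed by (type, status-type)."""
    current, desired = {}, {}
    for line in sdp.splitlines():
        if line.startswith("a=curr:"):
            ptype, status, direction = line[len("a=curr:"):].split()
            current[(ptype, status)] = direction
        elif line.startswith("a=des:"):
            ptype, strength, status, direction = line[len("a=des:"):].split()
            desired[(ptype, status)] = (strength, direction)
    return current, desired


def mandatory_preconditions_met(sdp):
    """Simplified check: every mandatory desired direction is already covered."""
    current, desired = parse_preconditions(sdp)
    for key, (strength, wanted) in desired.items():
        have = DIRECTIONS[current.get(key, "none")]
        if strength == "mandatory" and not DIRECTIONS[wanted] <= have:
            return False
    return True


print(parse_preconditions(SDP_OFFER))
print(mandatory_preconditions_met(SDP_OFFER))
```

Note that the parsed status carries only a type, a strength, a status type and a direction; there is no place for bandwidth, delay or jitter, which is the limitation raised in point 6 above.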
4 The Extended Resource Management Framework
The extension can be implemented exploiting the following ideas:
1. do not limit the framework to the use of the SDP protocol but allow more general protocols in the SIP payload at session start-up;
2. separate the information about the requests and requirements related to the application layer from the information about requests and requirements at the network layer, using two different protocols to carry them;
3. allow as many re-negotiations as possible of the pre-conditions, both at session start-up and while the session is running.
Regarding the protocol to declare the pre-conditions, we propose to use two protocols.
− Application Description Protocol (ADP): used to describe application requirements, for instance JSDL, which is a consolidated language for describing job submission for GRID networks.
− Network Resource Description Protocol (NRDP): used to describe the network requirements and QoS requests.
The use of a completely different document descriptor for QoS allows the use of many ADPs for describing application requirements, and not only SDP. In this work we assume the Resource Description Framework (RDF) as the candidate protocol for this task. The Resource Description Framework [10] is a general method of modeling information through a variety of syntax formats. RDF has the capability to describe both resources and the state of resources. For this reason RDF can be used instead of SDP as a more general description language for the network resources. Furthermore, because RDF is a structured protocol, it is possible to enrich its semantics at will with detailed information about QoS (e.g. bandwidth, jitter, delay, etc.). The main advantage of this approach is that the network does not need the capability to understand the ADP, since the QoS requirements are only described in the NRDP. The main drawback is that a link between ADP and NRDP is needed, and therefore some sort of “interface” has to be implemented in the UA on top of a standard SIP UA. The extension to pre-condition negotiation is implemented exploiting two concepts already present in SIP:
− INVITE multi-part message: an INVITE that carries more than one protocol in the payload (multi-body), in this case for instance RDF and JSDL;
− NOTIFY message [11]: a message that allows the exchange of information related to a “relationship” between two UAs, for instance a session that is starting up or that is already running.
Fig. 1 presents an overview of the call flow in an end-to-end scenario where the network resource reservation is a pre-condition to the session set up. User A sends an initial INVITE multi-part message including both the network QoS requests and an application protocol for the application requests.
The QoS requests are expressed by means of an RDF document, while the application protocol depends on the type of application (e.g. SDP for VoIP, JSDL for GRID computing, etc.). A does not want B to start providing the requested service until the network resources are reserved in both directions and end-to-end. B starts by checking whether the service is available; if so, it agrees to reserve network resources for this session before starting the service. B will handle resource reservation in the B⇒A direction, but needs A to handle the A⇒B direction. To indicate so, B returns a 183 (Session Progress) response with an RDF document describing the quality of service from its point of view. This first phase goes on with the two peers exchanging their view of the desired QoS of the communication, thus setting the “pre-conditions” to the session in terms of communication quality. Then both A and B start the network resource reservation. When a UA finishes reserving resources in one direction, it sends a NOTIFY message to the other UA to notify the current reservation status by means of an RDF document in the message body. The NOTIFY message is used to specify the notification of an event. Only when the network channel meets the pre-conditions does B start the reservation of the desired resources in the application domain, and session establishment may then complete as normal.
Once the service is ready to be provided, B sends back a 200 OK message to A. Data can then be exchanged by the two end-points, since both application and network resources have been reserved. A BYE message closes the dialog.
Fig. 1. Example of Call Flow at session set-up
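The gating behaviour of UA B in the call flow described above can be sketched as a small state holder: B answers the INVITE with a 183, reports its own reservation with a NOTIFY, and sends the 200 OK only once both directions are reserved. The class and method names are hypothetical, and the messages are represented as plain strings rather than real SIP messages.

```python
from dataclasses import dataclass, field


@dataclass
class CalleeUA:
    """Hypothetical model of UA B's side of the extended call flow."""
    reserved: dict = field(
        default_factory=lambda: {"A->B": False, "B->A": False}
    )
    sent: list = field(default_factory=list)

    def on_invite(self):
        # The 183 response carries B's RDF view of the desired QoS.
        self.sent.append("183 Session Progress")

    def on_local_reservation_done(self):
        # B has reserved the B->A direction and tells A via NOTIFY.
        self.reserved["B->A"] = True
        self.sent.append("NOTIFY")
        self._maybe_complete()

    def on_notify(self, direction, is_reserved):
        # A reports the reservation status of a direction it handles.
        self.reserved[direction] = is_reserved
        self._maybe_complete()

    def _maybe_complete(self):
        if all(self.reserved.values()) and "200 OK" not in self.sent:
            # Pre-conditions met: application resources are reserved here,
            # then the session set-up completes.
            self.sent.append("200 OK")


b = CalleeUA()
b.on_invite()
b.on_local_reservation_done()
b.on_notify("A->B", True)
print(b.sent)
```

The order in which the two reservations finish does not matter in this sketch; the 200 OK is emitted by whichever event completes the pre-conditions.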
The main differences with respect to the current approach are:
− Use of a separate RDF document in a multi-part INVITE message. The INVITE message starts a SIP dialog between the parties. The peer RDF is carried by a 183 response.
− Use of NOTIFY messages instead of UPDATE. Since ADP and NRDP are separated (even if carried by the same INVITE message), the UPDATE message can be sent only by UA A and can be used to update both ADP and NRDP. The NOTIFY message can be sent by both UAs and is related to the NRDP only, which guarantees that network resource reservation is addressed independently from application resource reservation. The NOTIFY messages carry the same Dialog-ID and From and To tags as the INVITE message.
− The INVITE message implies the opening of a logical relationship between the peers (a subscription) that ends with the dialog tear-down. Each time the network reservation state changes, a NOTIFY message is used to notify the change for a specific subscription.
The Resource Description Framework is used to generate an offer regarding the desired QoS in the network. The network resource reservation parameters can be many, with different semantics. The purpose of this paper is not to give an extensive and detailed list of reservation and QoS parameters, but to present the general
F. Callegati and A. Campi
solution which, by exploiting the flexibility of RDF, can be adapted by the end users or by the network provider according to their specific needs.
5 Conclusion
In this work we have presented a possible extension of the resource management framework for QoS in SIP. We have addressed the issues of application-oriented networking, presenting an approach based on communication services mapped into sessions and showing that different session requirements call for different interactions between the application and network layers. The current SIP Resource Management framework for QoS can cope only with VoIP applications using a generic precondition mechanism. We have presented the first step towards an extended Resource Management framework, showing a call flow that fits GRID applications. Extensions to other scenarios and details about the interaction between RDF and JSDL will be presented in future work.
Acknowledgments
The work described in this paper was carried out with the support of the BONE project (“Building the Future Optical Network in Europe”), a Network of Excellence funded by the European Commission through the 7th ICT-Framework Programme.
References
1. Simeonidou, D., et al.: Optical Network Infrastructure for Grid. Open Grid Forum document GFD.36 (August 2004)
2. Simeonidou, D., et al.: Dynamic Optical-Network Architectures and Technologies for Existing and Emerging Grid Services. IEEE Journal on Lightwave Technology 23(10), 3347–3357 (2005)
3. Campi, A., et al.: SIP Based OBS networks for Grid Computing. In: Proceedings of ONDM 2007, Athens, Greece, May 29-31 (2007)
4. Zervas, G., et al.: SIP-enabled Optical Burst Switching architectures and protocols for application-aware optical networks. Computer Networks 52(10) (2008)
5. Rosenberg, J., et al.: SIP: Session Initiation Protocol. IETF RFC 3261 (June 2002)
6. Zervas, G., et al.: Demonstration of Application Layer Service Provisioning Integrated on Full-Duplex Optical Burst Switching Network Test-bed. In: OFC 2008, California, USA (June 2008)
7. Camarillo, G., et al.: Integration of Resource Management and Session Initiation Protocol (SIP). IETF RFC 3312 (October 2002)
8. Poikselka, M., et al.: The IMS: IP Multimedia Concepts and Services, 2nd edn. Wiley, Chichester (2006)
9. Rosenberg, J., Schulzrinne, H.: An offer/answer model with session description protocol (SDP). IETF RFC 3264 (June 2002)
10. Beckett, D. (ed.): RDF/XML Syntax Specification (Revised). W3C Recommendation, February 10 (2004)
11. Roach, A.B.: Session Initiation Protocol (SIP)-Specific Event Notification. IETF RFC 3265 (June 2002)
Economic Model for Consistency Management of Replicas in Data Grids with OptorSim Simulator Ghalem Belalem Dept. of Computer Science, Faculty of Sciences, University of Oran – Es Senia, Oran, Algeria [email protected]
Abstract. Data Grids are currently the solutions proposed to meet the needs of large-scale systems. They provide a set of varied, geographically distributed resources whose goal is to ensure fast and effective access to data, to improve availability, and to tolerate failures. In such systems, these advantages are possible only through the use of replication. Nevertheless, several problems can appear with replication techniques; the most significant is maintaining the consistency of modified data. Replication and job scheduling strategies have been tested by simulation, and several grid simulators have been developed. One of the important simulators for Data Grids is the OptorSim tool. In this work, we present an extension of OptorSim with a consistency management service for replicas in Data Grids. This service is based on a hybrid approach that combines pessimistic and optimistic approaches with an economic model, built on a hierarchical model. Keywords: Data Grid, Replication, Consistency, OptorSim, Optimistic approach, Pessimistic approach, Economic models.
replicas. The various preliminary experiments are discussed in Section 4. Finally, we conclude this article and outline some perspectives for future work.
2 Approaches of Consistency Management
Data replication consists of maintaining multiple copies of data on separate computers. It is an important technique that improves availability and performance. However, the various replicas of the same object must be coherent, i.e. appear as a single copy. There are many consistency models; they do not all offer the same performance, nor do they impose the same constraints on application programmers. The consistency management of replicas can be done either synchronously, using algorithms known as pessimistic, or asynchronously, using algorithms known as optimistic [6]. The synchronous model propagates updates to the replicas immediately and blocks access in the meantime, whereas the asynchronous model defers the propagation of updates, which introduces a certain level of divergence between replicas.
1. Pessimistic approach: This forbids any access to a replica unless it is up to date [1, 6], which gives the user the illusion of a single consistent copy. The main advantage of this approach is that all replicas converge at the same time, which guarantees strong consistency and avoids any problem linked to the reconciliation stage. This type of approach is well suited to small and medium-scale systems, but becomes very complex to implement for large-scale systems. We can identify the following major drawbacks of this type of approach: (i) it is very badly suited to uncertain or unstable environments, such as mobile systems or grids with a high rate of change; (ii) it cannot support the cost of updates when the degree of replication is very high.
2. Optimistic approach: This approach authorizes access to any replica at any time. It is therefore possible to access a replica that is not necessarily coherent [1, 6]. This approach thus tolerates a certain divergence between replicas. On the other hand, it requires a phase that detects divergence between replicas, followed by a phase that corrects it by making the replicas converge to a coherent state. Although it does not guarantee strong consistency as in the pessimistic case, it has a number of advantages, which we can summarize as follows [1]: (i) it improves availability, because access to data is never blocked; (ii) it is flexible with regard to network management: the network does not need to be fully connected for the replicas to be accessible, as in mobile environments; (iii) it can support a large number of replicas, because it requires little synchronization between replicas; (iv) its algorithms are well suited to large-scale systems.
3 Service of Consistency Management Based on Economic Models
To improve the quality of service (QoS) of consistency management of replicas in data grids, we extend the work presented in [1] with an economic market model for resolving conflicts between replicas. In real-world markets, there exist various economic models for setting the price of services based on supply and demand and on their value to users. These real-world economic models, such as the commodity market model, market bidding, the auction model and the bargaining model, can also be applied to allocate resources and tasks in distributed computing [2]. In our work, we regard a grid as a distributed collection of computing elements (CEs) and storage elements (SEs). These elements are linked together by a network to form a site. The hierarchical model is composed of two levels (see Figure 1): level 0 is in charge of controlling and managing local consistency within each site, and level 1 ensures the global consistency of the grid.
Fig. 1. Proposed Model
The main stages of the proposed global process of consistency management of replicas are shown in Figure 2.
Fig. 2. Principal stages of consistency management
- Stage 1: Collection of information: examples of the information collected are the number of conflicts and divergences per site, as well as the current time.
- Stage 2: Calculation of metrics (analysis): consists in analyzing the collected information and comparing it with the tolerated thresholds (threshold of conflicts, threshold of divergence, ...).
- Stage 3: Decision: decide on the adequate algorithm for consistency management of replicas.
To study the collaboration between the various entities of the consistency management service, we used the activity diagram shown in Figure 3.
Fig. 3. Diagram of activity of service consistency management
3.1 Local Consistency
Local consistency, also called intra-site coherence, concerns the level of the sites that make up the grid. Each site contains CEs and SEs. The replicated data are stored on the SEs and are accessible by the CEs via read and write operations. Each replica has additional information attached to it, called metadata (timestamps, indices, versions, ...). The control and management of coherence are ensured by representatives chosen by an election mechanism. The goal of these representatives is to handle the read and write requests coming from the external environment, and also to control the degree of divergence within the site by calculating the measurements appearing in Table 1: the rate of the number of conflicts per site (τi), the distance within a site (Dlocal_i), and the dispersion of versions (σi).
Table 1. Measurements calculated in local consistency
Measure | Definition
$n_i$ | Number of replicas of the same object inside Site$_i$
$D_{local\_i}$ | Version$_{max}$ − Version$_{min}$ of the replicas inside Site$_i$
$\tau_i$ | Rate of the number of conflicts inside Site$_i$
$\sigma_i$ | Standard deviation of Site$_i$: $\sigma_i = \sqrt{\frac{1}{n_i}\sum_{t=1}^{n_i}\left(V_{it}-\bar{V}_i\right)^2}$, where $\bar{V}_i$ is the average version inside Site$_i$
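As an illustration of how a representative might evaluate these local measurements and the critical-situation test, here is a minimal sketch. The function names and the threshold parameters are hypothetical, and the conflict rate is assumed to be the number of conflicts per replica, since the text does not state how it is normalized.

```python
from math import sqrt


def local_metrics(versions, conflicts):
    """Return (n_i, D_local_i, tau_i, sigma_i) for one site.

    versions  -- version numbers of the replicas of one object in the site
    conflicts -- number of conflicts observed in the site (assumed input)
    """
    n = len(versions)
    d_local = max(versions) - min(versions)
    # Assumption: the conflict rate is the number of conflicts per replica.
    tau = conflicts / n
    mean = sum(versions) / n
    # Population standard deviation of the version numbers.
    sigma = sqrt(sum((v - mean) ** 2 for v in versions) / n)
    return n, d_local, tau, sigma


def is_critical(d_local, tau, sigma, max_distance, max_tau, max_sigma):
    """A site is critical if any measurement exceeds its tolerated threshold."""
    return tau > max_tau or d_local > max_distance or sigma > max_sigma
```

For example, a site holding replicas at versions 3, 5, 5 and 7 has a local distance of 4, so it is flagged as critical when the tolerated distance is 3 but not when it is 5 (other thresholds permitting).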
We detect a critical situation at a site when one of the following cases occurs:
- τi > tolerated rate of conflicts;
- Dlocal_i > tolerated distance;
- σi > σm, the tolerance rate for dispersion of versions.
3.2 Global Consistency
This level, also called inter-site coherence, is responsible for the global consistency of the data grid. Each site cooperates with the other sites via the representatives, which communicate with each other. The goal of the representatives is to control the degree of divergence and of conflict between sites by calculating the measurements appearing in Table 2.
Table 2. Measurements calculated in global consistency
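Measure | Definition
$D_{global}(i,j)$ | If $n_i < n_j$ then $\frac{1}{n_i}\sum_{t=1}^{n_i}\left(V_{it}-V_{jt}\right)^2$, else $\frac{1}{n_j}\sum_{t=1}^{n_j}\left(V_{it}-V_{jt}\right)^2$
$\tau_{ij}$ | $(\tau_i+\tau_j)/(n_i+n_j)$: rate of conflicts between Site$_i$ and Site$_j$
$Cov(i,j)$ | If $n_i < n_j$ then $\frac{1}{n_i}\sum_{t=1}^{n_i}\left(V_{it}-\bar{V}_i\right)\left(V_{jt}-\bar{V}_j\right)$, else $\frac{1}{n_j}\sum_{t=1}^{n_j}\left(V_{it}-\bar{V}_i\right)\left(V_{jt}-\bar{V}_j\right)$
$\rho_{ij}$ | $Cov(i,j)/(\sigma_i\,\sigma_j)$: coefficient of correlation between Site$_i$ and Site$_j$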
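The pairwise measurements can be sketched in the same way. This is an illustrative reading, not the author's implementation: the function name `global_metrics` is hypothetical, the sums are assumed to run over the first min(n_i, n_j) replicas of each site, and the distance is computed as a mean squared version difference because the extracted table does not show whether a square root is applied.

```python
from math import sqrt


def _mean(xs):
    return sum(xs) / len(xs)


def _std(xs):
    m = _mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def global_metrics(v_i, v_j, tau_i, tau_j):
    """Return (D_global, tau_ij, cov, rho) for the version lists of two sites."""
    n_i, n_j = len(v_i), len(v_j)
    n = min(n_i, n_j)  # sums run over the smaller site
    pairs = list(zip(v_i[:n], v_j[:n]))
    # Assumption: mean squared difference, no square root.
    d_global = sum((a - b) ** 2 for a, b in pairs) / n
    tau_ij = (tau_i + tau_j) / (n_i + n_j)
    m_i, m_j = _mean(v_i), _mean(v_j)
    cov = sum((a - m_i) * (b - m_j) for a, b in pairs) / n
    s_i, s_j = _std(v_i), _std(v_j)
    # rho is undefined when a site shows no dispersion of versions.
    rho = cov / (s_i * s_j) if s_i > 0 and s_j > 0 else None
    return d_global, tau_ij, cov, rho
```

Two sites whose versions move together, such as [1, 2, 3] and [2, 4, 6], give a correlation coefficient close to 1; a value of |ρ| near 0 is what the negotiation treats as a sign of divergence between sites.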
Two situations can be handled. The first situation corresponds to the competitive negotiation, started when a site is in a critical situation, which is indicated by one of the following:
- τij > threshold of the tolerated number of conflicts;
- Dglobal(i,j) > threshold of tolerated divergence;
- |ρij| < ε, where ε << 1.
This negotiation takes place between two groups (a group of stable representatives, which are not in a critical situation, and a group of representatives in crisis) by means of a call for tender: a representative in crisis receives offers from the stable representatives, each providing the measurements described above. The representative in crisis looks for the representative whose measurements are close to its own. Algorithm 1 below describes the mechanism of this negotiation:
Algorithm 1. CALL FOR TENDER
  Calculate τi, Dlocal_i, σi /* measurements of ith representative */
  candidates ← false, Nbr_candidates ← 0
  for all elements of the group of sites in crisis do
    Representative_a ← representative of site in crisis; j ← 1
    repeat
      Representative_b ← representative of the stable site
      if (|τa − τb| < ε) ∨ (|Dlocal_a − Dlocal_b| < ε) ∨ (|σa − σb| < ε) then
        Nbr_candidates ← Nbr_candidates + 1
      end if
      j ← j + 1
    until j ≥ Number of elements of the group of stable sites
    if (Nbr_candidates > 1) then
      candidates ← true; Algorithm Negotiation /* Negotiation of candidates */
    else
      Propagation of the updates of the stable site to the sites in crisis
    end if
  end for
If there are two or more candidate representatives (candidates = true) able to bring the site in crisis back to a stable state, a negotiation takes place between these candidates (see Algorithm 1); the best of them is the one that is most stable, and it proceeds to update the site in crisis. The second situation corresponds to the co-operative negotiation (see Algorithm 2) between the representatives of each site, started when the period defined for global coherence expires.
Algorithm 2. NEGOTIATION
  τi, Dlocal_i, σi /* measurements of ith representative for stable site */
  winner ← first of all stable sites /* candidate supposed winner */
  no_candidate ← false
  for all representatives − {candidate_winner} do
    if (τi < τwinner) ∧ (Dlocal_i < Dlocal_winner) ∧ (σi < σwinner) then
      winner ← representative_i
    end if
  end for
  winner propagates its updates to the sites in crisis
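Algorithms 1 and 2 can be read together as a two-step selection: filter the stable representatives whose measurements are close to those of the site in crisis, then keep the most stable one. The sketch below follows that reading; the `Rep` type and function names are hypothetical, and the behaviour when fewer than two candidates exist (use the single candidate, else the first stable site) is an assumption, since the pseudocode does not say which stable site propagates in that case.

```python
from dataclasses import dataclass


@dataclass
class Rep:
    """Measurements held by a site representative (illustrative type)."""
    name: str
    tau: float      # conflict rate
    d_local: float  # local version distance
    sigma: float    # dispersion of versions


def close_candidates(crisis, stable_reps, eps):
    """Stable representatives within eps of the site in crisis on at least
    one measurement (the OR test of Algorithm 1)."""
    return [s for s in stable_reps
            if abs(crisis.tau - s.tau) < eps
            or abs(crisis.d_local - s.d_local) < eps
            or abs(crisis.sigma - s.sigma) < eps]


def negotiate(candidates):
    """Algorithm 2: a candidate replaces the current winner only if it is
    strictly better on all three measurements."""
    winner = candidates[0]
    for rep in candidates[1:]:
        if (rep.tau < winner.tau and rep.d_local < winner.d_local
                and rep.sigma < winner.sigma):
            winner = rep
    return winner


def call_for_tender(crisis, stable_reps, eps):
    """Return the stable representative that should update the site in
    crisis, or None when no stable site is available."""
    candidates = close_candidates(crisis, stable_reps, eps)
    if len(candidates) > 1:
        return negotiate(candidates)
    # Assumption: with fewer than two candidates, fall back to the single
    # candidate, else to the first stable site.
    if candidates:
        return candidates[0]
    return stable_reps[0] if stable_reps else None
```

Because the winner test in `negotiate` is a conjunction, a candidate that is better on only one or two measurements never displaces the current winner, so the outcome depends on the order in which candidates are examined.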
For that purpose, two algorithms inspired by the economic model can be used to bring the whole set of replicas to a coherent state. The Dutch auction (Algorithm 3) ensures quality of service by aiming to obtain the most recent replica: it consists in finding the representative with the highest version and comparing its measurements with those of the other
representatives, in order to obtain a representative with fewer conflicts and divergences, and then in comparing the version of this representative with the average of all version vectors. If its version is lower than or equal to this average (no_candidate = true, see Algorithm 3), we proceed to the English auction algorithm.
Algorithm 3. DUTCH AUCTION
  representative_max ← representative having the most recent version
  version_reserves /* the average of all vectors of versions */
  τi, Dlocal_i, σi /* measurements of ith representative */
  no_candidate ← false
  for all representatives − {representative_max} do
    if (τmax > τi) ∨ (Dlocal_max > Dlocal_i) ∨ (σmax > σi) then
      if version_i ≤ version_reserves then
        no_candidate ← true
      else
        representative_max ← representative_i; no_candidate ← false
      end if
    end if
  end for
  if (no_candidate = false) then
    representative_max propagates the updates to all the representatives
  else
    Algorithm English Auction
  end if
The English auction (Algorithm 4) follows the same principle as the Dutch auction, except that it is ascending: it starts with the representative having the oldest replica and moves upward according to the measurements calculated within each site.
Algorithm 4. ENGLISH AUCTION
  representative_min ← representative having the oldest version
  version_reserves /* the average of all vectors of versions */
  τi, Dlocal_i, σi /* measurements of ith representative */
  for all representatives − {representative_min} do
    if (τmin > τi) ∨ (Dlocal_min > Dlocal_i) ∨ (σmin > σi) then
      representative_min ← representative_i
    end if
  end for
  representative_min propagates the updates to all the representatives
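The two auctions can be sketched as follows, with the Dutch auction falling back to the English auction when no acceptable candidate is found. The `Rep` type and function names are hypothetical, versions are treated as single numbers rather than version vectors for simplicity, and the loops iterate over all representatives except the initial one, which is one possible reading of the pseudocode.

```python
from dataclasses import dataclass


@dataclass
class Rep:
    """Representative with its version and measurements (illustrative type)."""
    name: str
    version: float
    tau: float
    d_local: float
    sigma: float


def beats(challenger, current):
    """OR test of Algorithms 3 and 4: the challenger is better on at least
    one measurement."""
    return (current.tau > challenger.tau
            or current.d_local > challenger.d_local
            or current.sigma > challenger.sigma)


def english_auction(reps):
    """Algorithm 4: start from the oldest version and move to any
    representative that is better on some measurement."""
    start = min(reps, key=lambda r: r.version)
    best = start
    for rep in reps:
        if rep is not start and beats(rep, best):
            best = rep
    return best


def dutch_auction(reps):
    """Algorithm 3: start from the most recent version; a challenger is
    accepted only if its version exceeds the reserve (average version)."""
    start = max(reps, key=lambda r: r.version)
    reserve = sum(r.version for r in reps) / len(reps)
    best, no_candidate = start, False
    for rep in reps:
        if rep is start:
            continue
        if beats(rep, best):
            if rep.version <= reserve:
                no_candidate = True
            else:
                best, no_candidate = rep, False
    return english_auction(reps) if no_candidate else best
```

The representative returned by `dutch_auction` is the one that would propagate its updates to all the others.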
4 Experimental Study
We have implemented a simulation tool for our approach and its components, whose entire architecture was discussed in the previous sections. The tool is based on the OptorSim Grid simulator [1], extended with our hybrid approach. In order to validate and evaluate our approach to consistency management of replicas against the pessimistic and optimistic approaches, we carried out a series of experiments whose results and interpretations are covered in this section. In order to analyze the results
relating to the experimentation of our approach, we used three metrics: the response time, the number of divergences and the number of conflicts among replicas. These experiments were carried out with the following simulation parameters: 10 sites, 100 nodes, data replicated 10 times in each site according to the multi-master strategy, a bandwidth fixed at 100Mb/ms for a star topology, and a number of requests (write requests) varied from 100 to 500 in steps of 100.
Fig. 4. Average response time / #Requests
Fig. 5. Average response time / #Requests
Figure 4 shows that, whatever the number of requests carried out, the response time of the two pessimistic approaches (ROWA and Quorum) is clearly higher than that of the optimistic approach, the hybrid approach [1] and our approach. In Figure 5, we observe that the optimistic approach, which does not block requests, yields a lower execution time than the two other approaches, whose curves nearly coincide; their higher response time is explained by the fact that local and global consistency processes were started to propagate the updates in these two approaches. Although the two curves are not very different, the curve of our approach remains below that of the hybrid approach. This experiment allows us to evaluate and compare the QoS of our proposed approach and the optimistic one, taking into account the number of divergences and conflicts measured during the execution of write requests on the replicas. The results presented in Figures 6, 7, 8, and 9 show clearly that our approach resolves divergences and conflicts quickly by using economic models.
Fig. 6. Number of divergences / #Requests
Fig. 7. Number of divergences / #Requests
Fig. 8. Number of conflicts / #Requests
Fig. 9. Number of conflicts / #Requests
5 Conclusion and Future Works
In this work, we sought a balance between performance and quality of service. This led us to propose a consistency management service based on a hybrid approach, combining consistency approaches (pessimistic and optimistic) with the economic market model, which ensures at the same time local consistency for each site and global consistency for the entire grid. The results of these experiments are rather encouraging and show that the objective of a compromise between quality of service and performance was achieved. There are a number of directions which we think are interesting. We can mention:
- We propose to integrate our approach in the form of a Web service in the Globus environment by using WSDL technology [3];
- Currently, we place replicas randomly. It is useful to explore the possibility of static or dynamic placement to improve QoS.
References
1. Belalem, G., Slimani, Y.: Hybrid Approach for Consistency Management in OptorSim Simulator. International Journal of Multimedia and Ubiquitous Engineering (IJMUE) 2(2), 103–118 (2007)
2. Buyya, R., Murshed, M.: GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing. Concurrency and Computation: Practice and Experience 14(13-15), 1175–1220 (2002)
3. Foster, I., Kesselman, C.: The Grid 2: Blueprint for a New Computing Infrastructure. Elsevier Series in Grid Computing. Morgan Kaufmann Publishers, San Francisco (2004)
4. Gray, J., Helland, P., O’Neil, P., Shasha, D.: The Dangers of Replication and a Solution. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996, pp. 173–182 (1996)
5. Ranganathan, K., Foster, I.: Identifying Dynamic Replication Strategies for a High-Performance Data Grid. In: Lee, C.A. (ed.) GRID 2001. LNCS, vol. 2242, pp. 75–86. Springer, Heidelberg (2001)
6. Saito, Y., Shapiro, M.: Optimistic Replication. ACM Computing Surveys 37(1), 42–81 (2005)
7. Xu, J., Li, B., Li, D.: Placement Problems for Transparent Data Replication Proxy Services. IEEE Journal on Selected Areas in Communications 20(7), 1383–1398 (2002)
A Video Broadcast Architecture with Server Placement Programming Lei He1, Xiangjie Ma2, Weili Zhang1, Yunfei Guo2, and Wenbo Liu2 China National Digital Switching System Engineering and Technology Research Center (NDSC), P.O. 1001, 450002 Zhengzhou, China {hl.helei,clairezwl}@gmail.com, {mxj,gyf,lwb}@mail.ndsc.com.cn
Abstract. We propose a hybrid architecture MTreeTV to support fast channel switching. MTreeTV combines the use of P2P networks with dedicated streaming servers, and was proposed to build on the advantages of both P2P and CDN paradigms. We study the placement of the servers with constraints on the client to server paths and evaluate the effect of the server parameters. Through analysis and simulation, we show that MTreeTV supports fast channel switching (<4s).∗ Keywords: Internet video broadcast; mesh; fast channel switching; locality; pastry.
1 Introduction
With the widespread penetration of broadband access, multimedia services are becoming increasingly popular among users and contribute a significant amount of today’s Internet traffic. The sparse deployment of IP Multicast [1] has limited broadcast to only a subset of Internet content publishers. Both CDN-based and P2P-based architectures have their advantages and disadvantages, and neither architecture alone provides a cost-effective and scalable solution to streaming media distribution. There has been considerable work in the area of peer-to-peer live-video streaming [3]-[11]. Two key drivers make the approach attractive. First, such technology does not require support from Internet routers and network infrastructure, and consequently is extremely cost-effective and easy to deploy. Second, a participant that tunes into a broadcast is not only downloading a video stream, but also uploading it to other participants watching the program. Consequently, such an approach has the potential to scale with group size. However, the P2P architecture has performance problems and suffers from long setup and source to end delays. For example, CoolStreaming [7], one of the first data-driven systems, has a setup delay longer than 60 seconds, and PPLive [8], a popular commercial peer-to-peer streaming system, has a source to end delay between 20 and 60 seconds.
∗ This work is partially supported by National Basic Research Program of China (973 Program) with No.2007CB307102.
To reduce the delay, we propose a novel hybrid architecture, MTreeTV, that combines the use of P2P networks with dedicated streaming servers to build on the advantages of both paradigms. We emphasize the effect of placing servers in overlay networks. Through analysis and simulation, we show that MTreeTV supports fast channel switching (<4s) and can provide high playback continuity (>98%) with little control overhead (<2%). The remainder of this article is organized as follows. In Section 2, we present an overview of the hybrid architecture MTreeTV. We solve the server placement problem in Section 3. In Section 4, we evaluate the performance of MTreeTV. Finally, Section 5 concludes the paper and offers potential future research directions.
2 Hybrid Architectures
We consider a live video streaming system using overlay multicast. The video stream, sent from the source to multiple tree roots and then to the other nodes, is divided into equal-length segments. The clients that are interested in the video form an overlay mesh by joining multiple trees simultaneously, and each overlay node, except for the source, acts both as a receiver and, if needed, as an application-layer relay that forwards data segments. In addition, these nodes can join and leave the overlay at will, or crash without notification. Our hybrid video broadcast architecture MTreeTV is shown in Fig. 1. The details of the MTreeTV overlay construction and scheduling algorithm are described in ref. [12]. In this architecture, the two streaming technologies complement each other: the video stream is first pushed to the streaming servers; then the clients that are interested in the video form a P2P overlay and pull the streaming data from nearby peers.
Fig. 1. The hybrid video broadcast architecture MTreeTV. Dotted lines show the directions of stream data transmission.
The main entities of MTreeTV are:
• Control Server. The role of the control server is to maintain client and channel information. When a client starts, it first registers with the control server, then retrieves the channel and streaming-server information from the control server; finally, it joins the overlay and starts playing the video.
• Streaming Server. It holds a copy of all channels and is responsible for streaming. It is also responsible for processing the data requests of directly connected clients. We assume zero downtime for the servers.
• Client. The set of user machines participating in the streaming system. The clients essentially construct a mesh out of the overlay nodes, with each node having a small set of neighbors with which to exchange data.
We compared the performance of MTreeTV with CoolStreaming and Tree, where Tree is a single tree of MTreeTV. We used several metrics to evaluate the performance. Setup delay is the time from when the user first receives a segment to when the video becomes visible. Source to end delay is the delay between the content originator and the receiver, also known as playback delay. Playback continuity is the percentage of data packets received before their deadline. Control overhead is the ratio of control message volume to video traffic volume at each node. Failure percent is the ratio of the number of failed nodes to the total number of nodes in a time interval T.
3 The Server Placement Problem
In this section, we study the placement of the streaming servers with constraints on the client to server paths, reflecting the quality of service within the regional access networks. We envision that imposing service quality constraints on server to client paths is essential for newer network services to achieve better service quality in order to attract and retain customers. Given the networks and their estimated service parameters, how many servers are needed, and where should an overlay service provider locate them to ensure a desired service quality to all its clients? To answer this question, we transform the placement problem into the set cover problem and solve it using 0-1 programming, linear programming (LP) relaxation and greedy heuristics.
3.1 Formal Definitions
The set cover problem is: given a base set of elements and a family of sets that are subsets of this base set, find the minimum number of sets such that their union includes all elements in the base set [10]. The server placement maps to the set cover problem as follows: an element corresponds to the network location of a client; the base element set contains all clients; a set represents a potential server placement at one of the router locations, and includes all the network locations that are within the service range of that server location. In Fig. 2, the points are denoted by numbers and the sets of points are denoted by names. The goal is to "cover" all the points by choosing the minimum number of sets, e.g. we can choose S1, S3 and S4.
Fig. 2. A set covering instance
Given a set of networks and their interconnections, as well as a specific routing policy, we can compute an end-to-end path for every pair of nodes in the networks. For each node $n_i$, we compute a set $S_i$ which includes all the nodes reachable from $n_i$ within the server service range C. Let $S_1, S_2, \ldots, S_m$ be all the sets computed. The LP formulation of the set cover problem is:
Objective: $\min \sum_{j=1}^{m} x_j$ (1)
Subject to: $\sum_{j=1}^{m} a_{ij} x_j \ge 1$, for $i = 1 \ldots n$ (2)
$x_j \in \{0,1\}$
where $x_j$ is the selection variable of $S_j$, and $a_{ij}$ is 1 if $n_i \in S_j$ and 0 otherwise. A variation of the problem is to allow one primary and one backup server to cover each node. A backup server can cover twice the distance of the primary server. Let $T_1, T_2, \ldots, T_m$ be all the backup sets, and $b_{ij} = 1$ if $n_i \in T_j$ and 0 otherwise. The objective here is still to minimize the number of selected sets, but with the additional constraints:
$\sum_{j=1}^{m} b_{ij} x_j \ge 2$, for $i = 1 \ldots n$ (3)
Since all nodes in the primary set are also in the backup set centered at the same server, bij = 1 if aij = 1; however, a primary server cannot also serve the same node as its backup server, and the constraint in (3) ensures that a different server is selected as the backup.
3.2 LP Based Algorithms
The above formulation can be solved by 0-1 programming (also called binary integer programming), a special type of integer programming in which each xj is 0 or 1. An efficient method for solving integer programs is the branch-and-bound
method. Matlab provides functions to solve the 0-1 integer programming problem and the linear programming problem. The solution of the 0-1 program naturally provides a lower bound $\sum_j x_j^*$ for the set cover problem, since it is an optimal solution and 0-1 programming is a superset of the set cover problem. We will use this lower bound to evaluate the quality of the solutions produced by other algorithms. The above formulation can also be approximated by first solving the LP relaxation of the problem optimally and then rounding the fractional values to integers. The LP relaxation allows the selection variables $x_j$ to take fractional values in [0, 1]. The LP relaxation can be solved in polynomial time and the rounding can be done in O(n). Reference [13] introduced a rounding algorithm which is a p-approximation algorithm, where $p = \max_i \{\sum_j a_{ij}\}$ is the maximum number of sets covering an element. We refer to the rounding algorithm in [14] as the fixed-rounding (FR) algorithm:
1. Solve the LP relaxation of the problem and let $\{x_j^*\}$ be the optimal solution;
2. Output sets $= \{\, j \mid x_j^* > \frac{1}{p} \,\}$.
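Why thresholding at 1/p yields a feasible cover can be checked directly from constraint (2). The short derivation below is a sketch under one assumption that is not stated in the text: the threshold is taken as non-strict, $x_j^* \ge 1/p$, which is the form the standard argument needs (with a strict inequality, an element whose covering sets all have value exactly 1/p would be missed). OPT denotes the size of an optimal integer cover.

```latex
% For any element i, the relaxed constraint (2) has at most p nonzero terms:
\sum_{j=1}^{m} a_{ij} x_j^* \ge 1
\quad\text{and}\quad
\sum_{j=1}^{m} a_{ij} \le p
\;\Longrightarrow\;
\max_{j:\,a_{ij}=1} x_j^* \ge \frac{1}{p}.
% Hence some set containing i is selected, so the rounded solution is a cover.
% Each selected set satisfies 1 \le p\,x_j^*, which bounds the cost:
|\mathrm{sets}| \;\le\; p \sum_{j=1}^{m} x_j^* \;\le\; p \cdot \mathrm{OPT}.
```

The last inequality holds because the optimum of the LP relaxation can never exceed the optimum of the 0-1 program.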
It is easy to see that the FR algorithm can still have redundant sets in the final solution. To prune these unnecessary extra sets, we use a simple pruning algorithm as the final step to complete the set selection:
1. Sort all selected sets in increasing order of set size;
2. Starting from the smallest set, check if it can be removed without leaving any of its nodes uncovered;
3. Repeat Step 2 until all sets are checked.

3.3 Greedy Heuristics

The basic greedy step of the algorithm is to select, at every step, the set that contains the maximum number of uncovered elements. For the backup problem variant, we extend the algorithm by treating any node that has not satisfied the constraints of (2) and (3) as equally uncovered. At each step, we select the set that has the largest number of remaining uncovered nodes, and we repeat until there are no more uncovered nodes. The greedy heuristic is:
1. Sort all unselected sets in decreasing order of the number of uncovered nodes;
2. Pick the largest set and change the state of its elements to covered;
3. Repeat Steps 1 and 2 until all nodes are covered.
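The pruning pass and the greedy heuristic described above can be sketched as follows. This is a minimal illustration for the basic variant only (each node must be covered once); the backup-server constraints are not modeled. The function names and the toy instance are assumptions made for this example.

```python
def greedy_cover(sets, universe):
    """Repeatedly pick the unselected set covering the most uncovered nodes."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        candidates = [j for j in range(len(sets)) if j not in chosen]
        best = max(candidates, key=lambda j: len(sets[j] & uncovered), default=None)
        if best is None or not (sets[best] & uncovered):
            raise ValueError("the given sets cannot cover every node")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


def prune(sets, universe, chosen):
    """Drop redundant sets, examining the smallest selected sets first."""
    kept = sorted(chosen, key=lambda j: len(sets[j]))
    for j in list(kept):
        others = [k for k in kept if k != j]
        # Set j is redundant if the remaining sets still cover everything.
        if set(universe) <= set().union(*(sets[k] for k in others)):
            kept = others
    return kept


# Toy instance (illustrative only).
universe = {1, 2, 3, 4, 5, 6}
sets = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {5, 6}]

greedy_solution = greedy_cover(sets, universe)
pruned_solution = prune(sets, universe, [0, 1, 2, 3])
```

On this instance the greedy pass first takes the four-element set and then the set {5, 6}. Pruning a selection that contains all four sets removes the smallest set and then the largest one, leaving two sets that still cover every node.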
Using simulation, we show that both the rounding technique and the greedy heuristic approach the lower bound very closely; in fact, they reach the lower bound for many configurations. Moreover, the greedy heuristic provides good performance in all instances with significantly lower computational complexity.
4 Simulation Results

We have performed a series of experiments on our segment-level, event-driven simulator. Each simulation topology has 100 routers and 1000 hosts and is generated by the GT-ITM topology generator using the transit-stub model. Message delays are determined using the resulting distance metric. The 1000 end nodes were randomly assigned to routers in the core of each topology with uniform probability. Each end system was directly attached by a LAN link to its assigned router. The delay of each LAN link was 1ms and the average delay of core links (computed by GT-ITM) was approximately 216ms.
Fig. 3. Performance comparison of the FR, Greedy and 0-1 programming algorithms with variation on server service range.
Fig. 4. Variation on the placement strategy (service range = 500ms). Scenarios: k=1; k=2; k=1 with backup server.
Fig. 5. Setup delay as a function of the bitrate of stream. Axes: bitrate of stream (Kbps), setup delay (s). Series: MTreeTV, Tree, CoolStreaming.
Fig. 6. Source to end delay as a function of server bandwidth and service range. Axes: server bandwidth (Mbps), source to end delay (s). Series: C=200ms, C=500ms, C=*.
We compare the greedy algorithm with the fixed rounding (FR) algorithm. The results are further compared with the lower bound obtained as the optimal solution from the 0-1 programming method. The programming and greedy algorithms were run in Matlab 7.0.1, an advanced computation and simulation package. The simulations reveal the impact of different server and client parameters (service range, server bandwidth, client buffer time) on the system performance. All the nodes join the overlay initially, and the tree roots begin streaming from the time
zero. We used a different random join sequence for each run. All results are averages over 10 runs.

In this section, we compare the fixed rounding (FR) algorithm and the greedy algorithm with the 0-1 programming method. Given the network, we also use the previous metrics to evaluate the effect of the server parameters (e.g., service range, bandwidth). With this routing policy, we can compute a routing table for each node ni and the cost of each routing path c(ni, nj), which is the sum of the hop distances along the path. For each router ri, we compute a set Si which includes all the nodes reachable from ri within a routing cost of C, where C is the service range of the server. That is, if a server is placed at the location of router ri, then all the nodes within this service range can be served by this server.

Fig. 3 shows the number of required servers when varying the service range. In general, both the FR algorithm (with pruning) and the greedy algorithm perform close to the lower bound. With increasing service range, the number of servers needed drops significantly.

Fig. 4 shows the performance for the different server placement strategies. We perform simulations on the following three scenarios: (a) k = 1, with only one primary server required to cover each node; (b) k = 2, with two primary servers required to cover each node; (c) k = 1 and requiring one backup server. The backup server range is twice that of the primary server for both networks. Fig. 4 shows that requiring two primary servers significantly increases the number of servers, whereas the addition of backup servers increases the total number of servers only slightly.

MTreeTV has the shortest source to end delay (Figs. 5 and 6). Fig. 5 shows that a higher stream bitrate increases the setup delay. When the number of nodes is 200, B=10Mbps, buffertime=16s and C=200ms, the setup delay of MTreeTV is about 4s (Fig. 6).
With the optimization of the hybrid architecture, the setup delay of MTreeTV is only 1/4 of CoolStreaming's. In the simulation results, MTreeTV also achieves high playback continuity (>98%) and low control overhead (<2%).
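The construction of the service sets Si used in this section can be sketched in Python. The sketch assumes shortest-path routing (the routing policy itself is not restated here) and a graph given as a symmetric adjacency dictionary of link costs; `routing_costs`, `service_sets` and the toy topology are hypothetical names and data introduced only for illustration.

```python
import heapq


def routing_costs(graph, source):
    """Dijkstra: cheapest path cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def service_sets(graph, routers, hosts, service_range):
    """S_i: hosts whose routing cost from router r_i is at most the range C."""
    result = {}
    for r in routers:
        costs = routing_costs(graph, r)
        result[r] = {h for h in hosts if costs.get(h, float("inf")) <= service_range}
    return result


# Toy topology (costs in ms, illustrative only): two routers, three hosts.
graph = {
    "r1": {"r2": 200, "h1": 1, "h2": 1},
    "r2": {"r1": 200, "h3": 1},
    "h1": {"r1": 1},
    "h2": {"r1": 1},
    "h3": {"r2": 1},
}
routers = ["r1", "r2"]
hosts = {"h1", "h2", "h3"}

narrow = service_sets(graph, routers, hosts, service_range=150)
wide = service_sets(graph, routers, hosts, service_range=500)
```

With the narrow range each router serves only its directly attached hosts, so two servers are needed; with the wide range either router alone covers all hosts. This mirrors the trend reported for Fig. 3, where a larger service range reduces the number of servers required.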
5 Conclusion and Future Work

In the very near future, video may become the dominant type of traffic over the Internet, dwarfing other types of traffic. Among the three video distribution modes (broadcast, on-demand streaming, and file download), broadcast is the most challenging to support due to its strong scalability and performance requirements. In conclusion, MTreeTV as a whole outperforms other systems under various environments. Through analysis and simulation, we show that MTreeTV can support fast channel switching and that it provides high playback continuity with low control overhead. As part of our future work, we plan to further explore this class of application. We will look at extreme peer dynamics or flash crowds, regimes where access bandwidth is scarce, incentives and fairness. We are going to simulate more system behaviors to further improve the system's performance.
Acknowledgment Lei He would like to thank Prof. Julong Lan, Prof. Dongnian Cheng of NDSC for their helpful comments.
References
[1] Deering, S.: Multicast Routing in Internetworks and Extended LANs. In: Proceedings of the ACM SIGCOMM (August 1988)
[2] Danzig, P., et al.: A case for caching file objects inside internetworks. In: Proc. ACM SIGCOMM Conference (SIGCOMM 1993), pp. 239–248 (September 1993)
[3] Banerjee, S., Bhattacharjee, B., Kommareddy, C.: Scalable application-layer multicast. In: Proc. ACM SIGCOMM 2002, Pittsburgh, PA (August 2002)
[4] Castro, M., Druschel, P., Kermarrec, A.-M., Nandi, A., Rowstron, A., Singh, A.: SplitStream: High-bandwidth multicast in cooperative environments. In: Proc. ACM SOSP 2003, New York, USA (October 2003)
[5] Chu, Y.h., et al.: A Case for End System Multicast. In: Proc. 2000 ACM SIGMETRICS Int'l. Conf. Measurement and Modeling of Comp. Sys. (2000)
[6] Rowstron, A., et al.: Scribe: The design of a large-scale event notification infrastructure (June 2001)
[7] Zhang, X.: CoolStreaming/DONet: A Data-driven Overlay Network for Peer-to-Peer Live Media Streaming. In: Proc. INFOCOM 2005, Miami, FL, USA (March 2005)
[8] Hei, X., et al.: A Measurement Study of a Large-Scale P2P IPTV System. Tech. rep., Dept. of Comp. and Info. Sci., Polytechnic University (2006)
[9] Venkataraman, V., Francis, P., Calandrino, J.: ChunkySpread: Multitree unstructured peer-to-peer multicast. In: Proc. The 5th International Workshop on Peer-to-Peer Systems (February 2006)
[10] Shi, S., Turner, J.: Placing Servers in Overlay Networks. In: SPETS (July 2002)
[11] "Bittorrent" [EB/OL], http://www.bittorrent.com
[12] Lei, H., et al.: A Peer-to-Peer Internet Video Broadcast System Utilizing the Locality Properties. In: HPCC, China (September 2008)
[13] Shi, S., Turner, J.: Placing Servers in Overlay Networks. In: SPETS (July 2002)
[14] Hochbaum, D.: Approximation Algorithms for NP-Hard Problems. Brooks/Cole Publishing Co. (1996)
VXDL: Virtual Resources and Interconnection Networks Description Language
Guilherme Piegas Koslovski1,2, Pascale Vicat-Blanc Primet1, and Andrea Schwertner Charão2
1 INRIA - École Normale Supérieure (ENS) Lyon
2 Laboratório de Sistemas de Computação (LSC), Universidade Federal de Santa Maria (UFSM)
Abstract. Data grid applications often require access to infrastructures with high performance data movement facilities coordinated with computational resources. Other applications need interconnections of large scale instruments with HPC platforms. In this context, dynamic provisioning of customized computing and networking infrastructure as well as resource virtualization are appealing technologies. Therefore new models and tools must be studied and developed to allow users to create and handle such on-demand virtual infrastructures within grid platforms or even within the Internet. This work presents VXDL, a language for the specification and modeling of virtual resources and interconnection networks. Besides allowing end resources description, VXDL lets users describe the desirable virtual network topology, including virtual routers and timeline. In this paper we motivate and present the key features of our modeling language. We explore typical examples to demonstrate its expressiveness and pertinence. Then we detail experimental results based on the execution of the NAS benchmark on virtual infrastructures conforming to different VXDL specifications.
Keywords: virtual grids, network virtualization, description language.
P. Primet et al. (Eds.): GridNets 2008, LNICST 2, pp. 138–154, 2009. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2009

1 Introduction

Data grid applications often require access to infrastructures with high performance data movement facilities coordinated with computational resources, while other applications need interconnections of large scale instruments with HPC platforms. Computational grids offer a distributed infrastructure of hardware and software, comprising computers and clusters of computers interconnected and geographically distributed [12]. The resources available in a grid can be shared by multiple users for different computational purposes, such as high-throughput computing, on-demand computing, data intensive computing and distributed supercomputing. In this context, dynamic provisioning of customized infrastructure combined with end resource and network virtualization are appealing technologies. To allow efficient and flexible resource sharing in grids, projects like VGrADS [14], Virtual Workspace [20] and VioCluster [19] are exploring the use of resource virtualization. On the other hand, projects like G-Lambda [22], Phosphorus [3] or UCLP [4] are
exploring end-user level dynamic lambda path or bandwidth provisioning for the creation of customized high speed networking environments. However, modeling and specifying this type of resource interconnection is still an open issue. In such a context users should be able to specify which resources are needed to run their applications. When environments are complex, a language that exposes a set of parameters for describing the resources to be composed is required. Several languages for computer resource description and selection, like ClassAd, vgDL or SWORD [9,17,18], have been proposed. These languages differ in their grammar, the proposed parameters and the implementation specifics. But the interconnection network, which plays a central role in the distributed computing infrastructure, cannot be finely modeled by these languages. On the other hand, in the context of networking capacity provisioning, few languages have been proposed, such as NDL [7], which is based on XML/RDF. But they include neither fine end resource description nor virtualization constraint specification. Standardisation work is also ongoing in groups like OGF NML-WG [21]. To include virtualization constraints, some extra parameters are required in the specification. For example, virtualization makes it possible to split the same physical resource (computational or network resource) and share it among different virtual entities. Thus, the user has to be able to specify the maximum number of virtual resources allowed on the same physical entity. We propose VXDL, a language for virtual resource interconnection network specification, which allows a detailed and extensible definition of all components of a virtual infrastructure. The primary goal of VXDL is to integrate the network interconnection, the virtualization constraints and the timeline with the classical resource description. VXDL has been defined for the HIPCAL [2] and CARRIOCAS [1] projects. This paper is organized as follows.
Section 2 presents an example and the characteristics of virtual infrastructures. In Section 3 the VXDL language is described and its grammar is illustrated on a use case. To evaluate the ease of use, we analyze the available specification properties and compare them with those of other existing languages. We then investigate the improvement obtained by specifying the network topology and parameters such as latency and bandwidth, through the analysis of the total execution time of the NAS benchmark [6]. Section 4 shows the results obtained with the execution of some experiments on the Grid5000 platform. Section 5 reviews related work. The conclusions and perspectives are given in Section 6.
2 Virtual Infrastructure Definition

We define an infrastructure as the structured aggregation of computing resources. Thus, a virtual infrastructure represents the aggregation of virtual resources through an organized interconnection network. The virtualization layer enables an efficient separation between the application specifications and the physical resources. The OS-level virtual machine paradigm is becoming a key feature of Grids as it provides a powerful abstraction. It has the potential to simplify the management of resources and to offer great flexibility in resource usage. Each VM a) provides a confined environment where non-trusted
applications can be run, b) allows limits to be established on hardware resource access and usage, through isolation techniques, c) allows adapting the runtime environment to the application instead of porting the application to the runtime environment (this enhances application portability), d) allows using dedicated or optimized OS mechanisms (scheduler, virtual memory management, network protocol) for each application, and e) allows applications and processes running within a VM to be managed as a whole. The abstraction of the hardware makes it possible to create multiple, isolated and protected virtual clusters on the same set of physical resources by sharing them in time and space. In other words, with the representation as Virtual Machines (VMs), a physical resource (node) can host VMs of different virtual clusters. A virtual cluster [2] is defined as a group of machines (physical or virtual) configured for a common purpose. As Grid computing is about using multiple sites spread over a wide area, the virtual cluster abstraction is not sufficient to solve the network QoS and secure communication channel issues raised in Grids. This inability to enforce network QoS may prevent this approach from being useful in a range of scenarios. A virtual private network is classically provisioned over a public network infrastructure to provide dedicated connectivity to a closed group of users. Resource-guaranteed VPNs can be obtained from a carrier but are generally static and require complex service level agreements (SLAs) and overheads. Tunneled VPNs such as those in use over the Internet are more lightweight but offer low performance and no assured quality of service. The functionalities necessary for automating the dynamic provisioning of lightpath or virtual private network capacities are emerging.
The originality of our approach is to combine OS-level machine virtualization with network virtualization and bandwidth reservation through service overlays, which can be based on router virtualization. Grid applications are characterized by the fact that they may transfer large subsets of data across the network for processing while exchanging very small and urgent messages. In most cases, the grid resource scheduler allocates precious computing and storage resources first, then generates output as bulk data transfer requests. The volume of the dataset is determined from the task specification, and a deadline may be specified to enforce the efficient use of expensive grid resources (CPUs, disks) as well as to shorten the task completion time. Low latency interprocess communications (MPI) sharing the same wide area networking infrastructure also have strong QoS requirements and may benefit from an end-to-end bandwidth reservation and isolation service. Such a service can be implemented with the overlay network and router virtualization approach. An overlay network has an ideal vantage point to monitor and control the underlying physical network and the applications running on the VMs. The overlay, being a virtual network, can run on a network that provides reservations or light-path setup and teardown in an optical network. Router virtualisation enables the customization of packet routing, packet scheduling and traffic engineering for each virtual network crossing it. The substrate to build such dynamic, predictable and large-scale computing environments is under development within the HIPCAL project [2]. We concentrate here on the virtual infrastructure description language defined within this project and demonstrate the importance of interconnection network description and control.
3 VXDL Grammar

VXDL has been defined for specifying an interconnection of virtual resources into a virtual infrastructure. In this context, the process of resource specification and selection has some particular features when compared to conventional grids. It is expected that a language should allow the user to describe:
– individual resources and groups;
– the elementary function assigned to the resource, which can be attributed to a single component or a cluster, for example: request of computing nodes, storage nodes, visualization nodes, etc.;
– the network topology, including virtual representations for routers and characterization of the necessary links, such as node-node, intra-cluster, cluster-node, node-router, etc. in terms of QoS metrics;
– the applications and tools needed for each component (operating systems and programming tools, for example);
– the execution timeline of the application (this parameter should help in resource co-scheduling).

Figure 1 shows a graph representing a typical infrastructure, composed of two virtual routers that interconnect multiple virtual clusters with different configurations. The representation of the network topology defines the shape of the infrastructure. In Figure 6 the grammar is divided into virtual grid, virtual topology and virtual timeline. With the first part, virtual grid, it is possible to describe all necessary components (nodes and clusters) and create groups among them. Here we use the idea of identifying a component by its elementary functions, where the user can describe what the resource is used for, for example: computing, storage, network sensor, visualization, acquisition, router and endpoint. An example of its use can be seen in the
Fig. 1. Graph representing an infrastructure
request of a cluster for computing and storage, or the definition of a node for visualization and network endpoint. A specified virtual infrastructure can be instantiated by the allocation of geographically distributed resources. The option anchor allows the user to specify the physical location where the component (node or cluster) should be reserved. This parameter helps in modeling environments where there is a local dependency for execution, such as specific components for result visualization or a particular input database. The second part of the grammar, virtual topology, allows the user to specify the network topology desired to connect all components of the infrastructure. The specification can detail internal parameters within clusters or between clusters, applying parameters such as latency and bandwidth to all links needed. Through the identification of source and destination, the same component may have different communication channels. The goal of detailing the network topology is to allow users to request a virtual infrastructure for communication-sensitive applications, i.e. applications that require transmitting a high data volume or that benefit from low latency. Using the topology specification, such applications can ask for an environment with suitable network links. It is important to note that VXDL allows the user to express the virtual network components and topology, but does not explicitly express any mechanism for physical provisioning or any information on the pricing scheme. These physical and economical parameters are defined as external parameters and can be joined to the specification. Figure 2 shows two graphs, G(V, E), which represent different infrastructures for execution. In these graphs, the vertices represent the necessary components and the edges represent the communication links between them. Vertices can be nodes (endpoints), clusters or routers.
The graph in Figure 2.a represents a topology where components communicate directly with each other. This figure may represent the communication required between four clusters, for instance. In Figure 2.b, we represent a topology where there are routers (in yellow) between components. In both examples, the user can specify separate parameters for each communication link. The specification of a virtual timeline helps in identifying the moment when the resources are needed. This description requests resources for a certain time, or waits for the execution of one stage before starting another. This type of notation is usual in workflows, which are used to identify and to map the resources needed for executing an application in a grid. Definitions such as start, after and until allow the creation of a timeline for execution, requesting resources for certain time intervals. The approach implemented is the nominal identification of time variables to allow reference in future rules.

Fig. 2. Graphs of the network topologies

3.1 Example

To study and elaborate a specification using VXDL, we select an application named VISUPIPE (High Performance Collaborative Remote Visualisation), which is used to view images in real time. This use case is developed within the CARRIOCAS project. This application generates images of 20 Mpixels x 32 bits x 30 images/s x 2 for stereo. This represents a bandwidth requirement of 38.4 Gb/s when a researcher wants to explore the data and images on the visualisation wall (12). Low latency is also required for interactivity. Figures 10 and 11 show a pipeline and an environment necessary to run a viewing session. The five clusters described perform the storage, initial computing (filters), mapping, image formation and final visualization. In this pipeline the network is a critical part, so there is a clear need for specifying network topology requirements. Figure 7 shows the corresponding resource query using VXDL. In addition to parameters for composition and identification of elementary functions, there is a request for a time to use the resources and for specific cluster interconnections. Using the start and for parameters, the user states the time of execution and the period for which the resources are to be reserved. In the network topology requested, we describe the configuration of the links bandwidth1 and bandwidth2, stating the minimum and maximum bandwidth for each. In addition, this request states the maximum tolerable latency on some connections, intra-cluster and inter-cluster. In the execution timeline for VISUPIPE, we indicate virtual times (t0, t1, t2 and t3) that may be referenced in future rules. The values specified for computing time and data transfer volume come from the application description.
In this timeline, features such as Bandwidth2 and Cluster Visual should be available only after the execution of previous tasks. This kind of description can help the scheduler distribute resources among the existing virtual infrastructures, allowing reservations to be made only when necessary.
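The bandwidth requirement quoted for VISUPIPE in this example follows directly from the image parameters. The short Python sketch below reproduces that arithmetic; the function name `stream_bandwidth_bps` and its parameter names are assumptions introduced for illustration, and the calculation ignores protocol overhead and compression.

```python
def stream_bandwidth_bps(pixels, bits_per_pixel, images_per_second, views=1):
    """Raw bit rate of an uncompressed image stream, in bits per second."""
    return pixels * bits_per_pixel * images_per_second * views


# VISUPIPE figures from the text: 20 Mpixels, 32 bits, 30 images/s, x2 for stereo.
required_bps = stream_bandwidth_bps(20_000_000, 32, 30, views=2)
required_gbps = required_bps / 1e9
```

The product is 38,400,000,000 bit/s, that is, 38.4 Gb/s, which matches the requirement stated for the visualisation wall.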
4 Evaluation

The evaluation we performed aimed at a quantitative analysis of the benefit obtained with a detailed virtual infrastructure specification, more precisely with the network topology parameters provided by VXDL. For this, we selected the NAS benchmark [6] for execution in different virtual infrastructures allocated on the Grid5000 test environment. The Numerical Aerodynamic Simulation (NAS) parallel benchmarks are a set of applications developed by NASA, which have been grouped in order to form a benchmark for parallel architectures. For this analysis, we selected only four applications from the package (cg, is, lu and mg), due to their different characteristics of communication, memory occupation and processing [11]. In the virtual environment, we used Xen version 3.1. The virtualized operating systems were based on the Debian Sid Linux distribution. For benchmark execution, we used MPICH [10]. All results were obtained from the average of 10 executions of each
test. The resource selection in Grid5000 was performed manually: starting from the stated requests, the resources that met each request were selected and allocated for execution.

4.1 Latency Specification

We prepared four queries requesting the creation of a virtual cluster:

Query without network topology, 1 VM per node: This query, represented by Figure 8.a in the annex, asks for a virtual environment composed of at least 16 and no more than 20 nodes, classified as computing nodes. It also asks for a minimum of 1GB of RAM. The resource selection resulted in the allocation of 16 nodes, 8 per physical cluster.

Query with latency (0.100ms), 1 VM per node: In this request, Figure 8.b, we used the same node parameters defined in the description above, adding the network topology. It asks for interconnected resources in a network with a maximum latency of 0.100ms, resulting in the selection of resources with geographic proximity, allocated in the same physical cluster.

Query with latency (0.100ms), 2 VMs per node: In the description presented in Figure 8.c we requested a virtual environment consisting of nodes with 256MB of RAM. This allows the allocation of multiple virtual machines on a single physical resource. The network description indicates that the resource allocation occurs on the same physical cluster. Eight physical computers were selected, with 2 virtual systems allocated on each node.

Query with latency (10ms), 2 VMs per node: In this query (Figure 8.d) the parameters for network topology ask for links with a maximum latency of 10ms. This way, the selection results in the allocation of physical nodes in different clusters. The RAM and CPU configurations allow the allocation of multiple virtual machines on the same physical resource. The selected virtual environment was composed of 8 nodes, distributed between two clusters. On each node, 2 virtual operating systems were allocated.
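To make the effect of the latency constraint concrete, the first two queries can be mimicked with a small Python check. This is a hypothetical sketch only: in the experiments the selection was done manually, the class and function names are invented here, it is not VXDL syntax, and the latency values in `toy_latency` are illustrative placeholders rather than measurements.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Query:
    min_nodes: int
    max_nodes: int
    min_ram_mb: int
    max_latency_ms: Optional[float] = None  # None: no network topology requested


@dataclass(frozen=True)
class VM:
    name: str
    cluster: str
    ram_mb: int


def satisfies(query, vms, latency_ms):
    """Check node count, memory and (if requested) pairwise latency."""
    if not (query.min_nodes <= len(vms) <= query.max_nodes):
        return False
    if any(vm.ram_mb < query.min_ram_mb for vm in vms):
        return False
    if query.max_latency_ms is None:
        return True
    return all(latency_ms(a, b) <= query.max_latency_ms
               for a, b in combinations(vms, 2))


def toy_latency(a, b):
    """Placeholder latencies (ms): low inside a cluster, higher across clusters."""
    return 0.05 if a.cluster == b.cluster else 5.0


no_topology = Query(16, 20, 1024)
low_latency = Query(16, 20, 1024, max_latency_ms=0.100)

# 8 VMs on each of two clusters versus 16 VMs on a single cluster.
spread = [VM(f"a{i}", "A", 1024) for i in range(8)] + \
         [VM(f"b{i}", "B", 1024) for i in range(8)]
local = [VM(f"a{i}", "A", 1024) for i in range(16)]
```

Under these placeholder latencies, the allocation spread over two clusters is acceptable for the query without topology but is rejected by the 0.100ms query, which only the single-cluster allocation satisfies.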
Table 1 and Figure 8 show the results obtained with the execution of NAS applications in virtual infrastructures. The total execution time shows the direct influence of resource location. Descriptions that stated the desired network topology obtained a lower execution time in all applications. Comparing the queries Without network topology and With latency specification, 0.100ms, we observe that in applications such as cg, is and mg, which perform a high number of communications, the total execution time is 6 times, 26 times and 7 times lower, respectively.

Table 1. Results of NAS benchmark with different latency specifications
Figure 8 also shows that the allocation of two virtual machines on a single physical resource results in an overhead (description With latency specification, 0.100ms, 2 VMs per node). It is important to note that allocations of multiple virtual machines on the same physical resource should respect the demands of the queries; that is, the provision of the requested resources (nodes and network) must be guaranteed.
Fig. 3. Results of NAS benchmark with different latency specifications
4.2 Bandwidth Specification

For these tests, a virtual infrastructure was requested, resulting in the selection of 18 nodes located on the same physical cluster (as specified by the anchor parameter). The VXDL specification (Figure 9) describes the execution infrastructure, in which we represented two clusters interconnected by two routers. In the graph in Figure 4, the vertices represent resources (green vertices are the computing resources and yellow vertices are the routers) and the edges represent the network topology. The resource selection resulted in the allocation of one virtual machine on each physical computer. The routers were implemented in software, through a specific virtual machine (as in the specification). For the execution of the tests, we selected three virtual infrastructures with the same specification described, modifying the desired bandwidth between the routers by changing the parameter bandwidth on the link called Router to Router. The values used were 1Gb/s, 100Mb/s and 10Mb/s. To limit the bandwidth between the routers we used the Traffic Control (tc) tool, which configures the kernel's traffic shaping on network interfaces. The cross-traffic was generated using the iperf tool, making full use of the available bandwidth (in the results this scenario is identified by the word "cross-traffic" in the legends). The results of the NAS benchmark execution in the three infrastructures are shown in Figure 5 and in Table 2. Analyzing the execution time, taking as
Fig. 4. Network topology for bandwidth specification
Fig. 5. Results of NAS benchmark with different bandwidth specifications

Table 2. Results of NAS benchmark with different bandwidth specifications

NAS            1Gb/s   100Mb/s   100Mb/s    10Mb/s   10Mb/s
cross-traffic  0       0         900Mb/s    0        990Mb/s
cg             248s    247s      395s       1006s    1476s
is             11s     39s       55s        380s     448s
lu             65s     69s       124s       270s     522s
mg             9s      11s       21s        81s      118s
base the result obtained on the 1Gb/s network, we observe that on the 100Mb/s network (cross-traffic 900Mb/s) the applications CG, IS, LU and MG increased their execution time by 59%, 400%, 90% and 133%, respectively. The comparison between the 100Mb/s and 100Mb/s (with cross-traffic) configurations can represent the impact of a possible channel division between different virtual infrastructures. It is observed that the execution time increases by factors of approximately 1.60, 1.41, 1.79 and 1.90 for the applications CG, IS, LU and MG, respectively. The results of these experiments show that, in the case of virtual channels, which are often shared among several virtual infrastructures, the correct specification of the bandwidth and its enforcement by the management system is essential