into Q1 23: ENDIF
Fig. 4. DBR packet forwarding algorithm
buffer Q2. If the packet is found in Q2, it is dropped as it has been forwarded recently. Otherwise, the node calculates the sending time for the packet based on the current system time and the holding time and inserts the packet into the priority queue Q1. Note that if the packet is already in Q1, the sending time is updated to the earlier time. Later, the packets enqueued in Q1 will be sent out according to their scheduled sending times.
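The queue handling described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `DbrForwarder`, the method names, the return strings, and the `history_size` limit on Q2 are all assumptions made for the example, and the holding time is taken as an input because its computation is defined elsewhere in the paper.

```python
from collections import OrderedDict


class DbrForwarder:
    """Toy model of the Q1/Q2 handling; all names here are hypothetical."""

    def __init__(self, history_size=100):
        self.q1 = {}               # packet id -> scheduled sending time
        self.q2 = OrderedDict()    # ids of recently forwarded packets
        self.history_size = history_size  # assumed bound on Q2

    def on_receive(self, packet_id, now, holding_time):
        """Handle a packet that is eligible for forwarding."""
        if packet_id in self.q2:
            return "dropped"       # forwarded recently, so discard it
        send_time = now + holding_time
        if packet_id in self.q1:
            # Already scheduled: keep the earlier of the two sending times.
            self.q1[packet_id] = min(self.q1[packet_id], send_time)
            return "updated"
        self.q1[packet_id] = send_time
        return "queued"

    def pop_due(self, now):
        """Return the ids whose sending time has arrived, earliest first."""
        due = sorted((t, pid) for pid, t in self.q1.items() if t <= now)
        for _, pid in due:
            del self.q1[pid]
            self.q2[pid] = True    # remember that this packet was forwarded
            if len(self.q2) > self.history_size:
                self.q2.popitem(last=False)  # evict the oldest entry
        return [pid for _, pid in due]
```

A dictionary keyed by packet id is used for Q1 here instead of a heap because it makes the "update to the earlier time" step a one-line operation; a real implementation would more likely pair a heap with an index.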
4 Performance Evaluation
In this section, we evaluate the performance of DBR and compare it with the Vector-Based Forwarding (VBF) protocol [17].
4.1 Simulation Settings
All simulations are performed using the Network Simulator (ns2) [12] with an underwater sensor network simulation package (called Aqua-Sim) extension. In our simulations, sensor nodes are randomly deployed in a 500m × 500m × 500m
DBR: Depth-Based Routing for Underwater Sensor Networks
3-D area. Multiple sinks are randomly deployed at the water surface. While we assume that the sinks are stationary once deployed, the sensor nodes follow the random-walk mobility pattern. Each sensor node randomly selects a direction and moves to the new position at a random speed between the minimal and maximal speeds, which are 1 m/s and 5 m/s respectively unless specified otherwise. Although the source node can be anywhere in the network, to simplify the simulations we place it at a random position in the bottom layer. The data generation rate at the source node is one packet per second, with a packet size of 50 bytes.
The communication parameters are similar to those of a commercial acoustic modem, the LinkQuest UWM1000 [10]: the bit rate is 10 kbps; the maximal transmission range is 100 meters (in all directions); and the power consumption in sending, receiving, and idling modes is 2 W, 0.1 W, and 10 mW, respectively. The same broadcast Media Access Control (MAC) protocol as in [17] is used in our simulations. In this MAC protocol, when a node has a packet to send, it first senses the channel. If the channel is free, it proceeds to broadcast the packet. Otherwise, it backs off. The packet is dropped once the maximal number of backoffs has been reached.
We use the following metrics to evaluate the performance of routing protocols.
– Packet Delivery Ratio is defined as the ratio of the number of distinct packets received successfully at the sinks to the total number of packets generated at the source node. Although a packet may reach the sinks multiple times, these redundant copies are counted as only one distinct packet.
– Average End-to-end Delay represents the average time taken by a packet to travel from the source node to any of the sinks.
– Total Energy Consumption represents the total energy consumed in packet delivery, including the transmitting, receiving, and idling energy consumption of all nodes in the network.
4.2 Impact of Design Parameters
Now we examine how the performance of DBR is affected by algorithm parameters and network settings: parameter δ, depth threshold d_th, node mobility, and number of sinks.
Parameter δ. The parameter δ determines the holding time of packets at each node. With a larger δ, each node has a shorter holding time and hence the average end-to-end delay is reduced; but more nodes forward the same packet, which results in more energy consumption. On the other hand, with a smaller δ, a node has a longer holding time and hence the average end-to-end delay is increased; but fewer nodes forward the same packet, which helps to reduce energy consumption. Fig. 5 shows how the energy consumption and average end-to-end delay change with three different values of δ: R/4, R/2, and R, where R is the maximal transmission range. In this set of simulations, the number of sinks is set to 5 and the depth threshold is set to 0. From Fig. 5(a), we observe that the average
H. Yan, Z.J. Shi, and J.-H. Cui
[Figure: (a) Average end-to-end delay (sec) and (b) Total energy consumption, each versus number of nodes (200 to 800), for δ = R/4, R/2, and R]
Fig. 5. DBR’s performance with different δ values
[Figure: (a) Packet delivery ratio and (b) Total energy consumption, each versus number of nodes (200 to 800), for depth thresholds 0, 20, and 40]
Fig. 6. DBR’s performance with different depth thresholds
end-to-end delay when δ = R/4 is about three times the delay when δ = R. Fig. 5(b) shows that the energy consumption does increase as δ increases, but not significantly (by less than 5%). Hence, we choose δ = R in later simulations.
Depth Threshold. Fig. 6 shows how the depth threshold, d_th, affects the packet delivery ratio and energy consumption. In this set of simulations, the number of sinks is set to 5. From Fig. 6, we see that when the depth threshold increases, both the packet delivery ratio and the total energy consumption decrease. This is because increasing the depth threshold has a similar effect to reducing the number of available nodes in the network. Therefore, the number of forwarding nodes decreases. Consequently, the packet delivery ratio decreases and less energy is consumed.
Node Mobility. To evaluate how node mobility affects the performance of DBR, we simulate DBR with fixed node speeds of 1 m/s, 5 m/s, and 10 m/s. The depth threshold is set to 0 and the total number of sinks deployed is 5. The simulation results are plotted in Fig. 7. From Fig. 7, we observe that the packet delivery ratio, total energy consumption, and average delay do not change much with node speed. The reason is that all routing decisions in DBR are made locally based on a node's depth information. No topology or route information needs to be exchanged among neighboring
[Figure: (a) Packet delivery ratio, (b) Total energy consumption, and (c) Average end-to-end delay (sec), each versus number of nodes (200 to 800), for node speeds of 1 m/s, 5 m/s, and 10 m/s]
Fig. 7. DBR’s performance with different node mobility
nodes. Therefore, DBR can handle dynamic network topologies well. For example, Fig. 7(a) shows that DBR successfully delivers at least 88% of the packets in all cases. When the total number of nodes is larger than 400, almost all packets are successfully delivered.
From Fig. 7 we can also see how node density affects the performance of DBR. The figure shows that the packet delivery ratio and the energy consumption increase as the number of nodes increases. The average end-to-end delay, however, decreases. This is because when the number of nodes increases, the number of forwarding nodes also increases, which leads to a higher packet delivery ratio and higher energy consumption. At the same time, the probability that a forwarding node has a shorter distance to the sinks increases as more nodes participate in packet forwarding. As a result, the average end-to-end delay decreases.
Number of Sinks. Fig. 8(a) shows how the packet delivery ratio changes with different numbers of sinks. DBR with multiple sinks has a better packet delivery ratio than DBR with only one sink. Since DBR is a greedy algorithm that tries to deliver data packets to the water surface, increasing the number of sinks at the water surface increases the chance that a packet is received by a sink. This explains the higher delivery ratio when multiple sinks are deployed. The total energy consumption for different numbers of sinks is shown in Fig. 8(b). We observe that the energy consumption is almost the same for different numbers of sinks. The reason is that in DBR, the number of sinks does not affect the forwarding process. From Fig. 8(c), we observe that DBR with multiple sinks has a slightly better average end-to-end delay than DBR with one sink. This is because in the multiple-sink case, a packet is considered successfully delivered as soon as it reaches any of the sinks.
4.3 Comparison with VBF
Now we compare DBR with VBF [17]. In this set of simulations, we consider two settings for DBR: one sink and multiple sinks. For one-sink DBR, the depth threshold is set to 0. In multiple-sink DBR, 5 sinks are randomly deployed at
[Figure: (a) Packet delivery ratio, (b) Total energy consumption, and (c) Average end-to-end delay (sec), each versus number of nodes (200 to 800), for 1 sink, 3 sinks, 5 sinks, and 10 sinks]
Fig. 8. DBR’s performance with different number of sinks
the water surface and the depth threshold is set to 20 meters. For VBF, we set the routing pipe radius to 100 meters, which is the maximal transmission range. Fig. 9 compares DBR and VBF with respect to the three metrics.
Fig. 9(a) shows that for the one-sink setting, DBR achieves a packet delivery ratio similar to that of VBF. With the multi-sink setting, however, DBR achieves a much better delivery ratio, especially for sparse networks. For example, in a network with 200 nodes, DBR with five sinks has a packet delivery ratio of around 70%, which is more than four times the 15% delivery ratio of VBF.
Fig. 9(b) shows that DBR has better energy efficiency than VBF. In all cases, the total energy consumption of DBR is about half that of VBF. This is mainly due to the redundant packet suppression techniques adopted by DBR. The two-queue mechanism in DBR greatly reduces redundant packet transmissions in the network. It should be noted, however, that this comparison is based on the basic settings of VBF. VBF may achieve better energy efficiency when advanced adaptation algorithms are applied and optimal parameters are chosen.
Fig. 9(c) shows that the best end-to-end delay is achieved by multiple-sink DBR, while VBF has a better end-to-end delay than one-sink DBR. This is because VBF tries to find the shortest path from the source node to the sink along the virtual vector between them. Thus the delay in VBF is shorter than that in one-sink DBR. In multiple-sink DBR, however, packets can be delivered to any sink, instead of a fixed sink as in VBF.
It should be noted that DBR and VBF target different network settings and make quite different network assumptions. For example, VBF is designed for networks with a single sink. Although DBR can work in one-sink networks, it performs better in multiple-sink settings.
4.4 Summary
The simulation results show that the DBR protocol works well for dense networks. However, the delivery ratio in sparse networks is relatively low, because DBR has only a greedy mode and the greedy method alone cannot achieve high delivery ratios in sparse networks. We will investigate recovery algorithms for DBR in the future.
[Figure: (a) Packet delivery ratio, (b) Total energy consumption, and (c) Average end-to-end delay (sec), each versus number of nodes (200 to 800), for one-sink DBR, multiple-sink DBR, and VBF]
Fig. 9. Comparison of DBR and VBF
The DBR protocol requires more memory in sensor nodes to maintain two buffers. Since the underwater sensor nodes normally are equipped with more resources than land-based sensor nodes, the memory overhead is not significant in most systems. Moreover, the applications for underwater sensor networks have relatively low data rate so only small buffers need to be maintained. For example, we observed in our simulations that the average number of packets in the sending queue of each node is less than 10. As a result, the cost of extra memory is affordable for underwater sensor nodes.
5 Conclusions and Future Work
In this paper, we presented Depth-Based Routing (DBR), a routing protocol for underwater sensor networks that is based on the depth information of nodes. DBR uses a greedy approach to deliver packets to the sinks at the water surface. Unlike other geographic routing protocols, which require full-dimensional location information of nodes, DBR needs only the depth information, which can be easily obtained locally at each sensor node. Further, DBR naturally takes advantage of the multiple-sink underwater sensor network architecture without introducing extra cost. Our simulation results have shown that DBR can achieve high packet delivery ratios (at least 95%) for dense networks, with reasonable energy consumption.
To achieve better performance for sparse networks, recovery algorithms need to be explored to avoid the "void" areas where the greedy strategy fails. Advanced recovery algorithms should be developed for higher energy efficiency and delivery ratio. DBR provides a distributed routing protocol based on local information of the sensor nodes. Besides the depth information, other information, such as the residual energy level and the estimated distance to neighboring nodes, could also be useful in making routing decisions that further reduce energy consumption and extend the network's lifetime. For multiple-sink network settings, we consider only some simple cases in which the sinks are randomly and uniformly deployed on the water surface. Given the routing protocol and the node deployment model, we may find better deployment locations for the multiple sinks to achieve better performance.
References
1. Akyildiz, I.F., Pompili, D., Melodia, T.: Challenges for efficient communication in underwater acoustic sensor networks. ACM SIGBED Review 1(1) (July 2004)
2. Basagni, S., Chlamtac, I., Syrotiuk, V.R., Woodward, B.A.: A distance routing effect algorithm for mobility (DREAM). In: MOBICOM 1998 (1998)
3. Braginsky, D., Estrin, D.: Rumor routing algorithm for sensor networks. In: WSNA 2002 (September 2002)
4. Cui, J.H., Kong, J., Gerla, M., Zhou, S.: Challenges: Building scalable mobile underwater wireless sensor networks for aquatic applications. Special Issue of IEEE Network on Wireless Sensor Networking (May 2006)
5. Grossglauser, M., Vetterli, M.: Locating nodes with ease: Mobility diffusion of last encounters in ad hoc networks. In: IEEE INFOCOM 2003 (March 2003)
6. Heidemann, J., Ye, W., Wills, J., Syed, A., Li, Y.: Research challenges and applications for underwater sensor networking. In: IEEE Wireless Communications and Networking Conference (April 2006)
7. Heinzelman, W.R., Kulik, J., Balakrishnan, H.: Adaptive protocols for information dissemination in wireless sensor networks. In: MOBICOM 1999 (August 1999)
8. Intanagonwiwat, C., Govindan, R., Estrin, D.: Directed diffusion: A scalable and robust communication paradigm for sensor networks. In: MOBICOM 2000 (August 2000)
9. Karp, B., Kung, H.T.: GPSR: Greedy perimeter stateless routing for wireless networks. In: Proc. MOBICOM (2000)
10. LinkQuest, http://www.link-quest.com/
11. Nicolaou, N., See, A., Xie, P., Cui, J.-H., Maggiorini, D.: Improving the robustness of location-based routing for underwater sensor networks. In: Proceedings of IEEE OCEANS 2007 (June 2007)
12. The ns manual (2002), http://www.isi.edu/nsnam/ns/doc/index.html
13. Partan, J., Kurose, J., Levine, B.N.: A survey of practical issues in underwater networks. In: Proc. of ACM International Workshop on UnderWater Networks (WUWNet), September 2006, pp. 17–24 (2006)
14. Proakis, J., Sozer, E.M., Rice, J.A., Stojanovic, M.: Shallow water acoustic networks. IEEE Communications Magazine, 114–119 (2001)
15. Seah, W.K., Tan, H.X.: Multipath virtual sink architecture for underwater sensor networks. In: OCEANS (May 2006)
16. Xie, G.G., Gibson, J.: A networking protocol for underwater acoustic networks. Technical Report TR-CS-00-02 (2000)
17. Xie, P., Cui, J.-H., Lao, L.: VBF: Vector-based forwarding protocol for underwater sensor networks. In: Proceedings of IFIP Networking (May 2006)
18. Ye, F., Luo, H., Cheng, J., Lu, S., Zhang, L.: A two-tier data dissemination model for large-scale wireless sensor networks. In: MOBICOM 2002 (September 2002)
19. Ye, F., Zhong, G., Lu, S., Zhang, L.: Gradient broadcast: A robust data delivery protocol for large scale sensor networks. ACM Wireless Networks (WINET) 11(2) (2005)
Channel Allocation for Multiple Channels Multiple Interfaces Communication in Wireless Ad Hoc Networks
Trung-Tuan Luong, Bu-Sung Lee, and C.K. Yeo
School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798
{Y050089,bslee,asckyeo}@ntu.edu.sg
Abstract. A major drawback of multi-hop communication in ad hoc networks is the scarcity of bandwidth along the forwarding path of data packets. The main reason for the lack of bandwidth, or the exponential reduction in bandwidth, is the contention for bandwidth between nodes along the path. In this paper we demonstrate one way to overcome this problem through the use of multiple channels multiple interfaces (MCMI). We investigate different forwarding channel selection schemes for MCMI communication with the Destination Sequenced Distance Vector protocol: same channel, random, and round robin. Our analysis and simulation results show that the MCMI protocol significantly improves the channel capacity while maintaining low end-to-end delay. Forwarding channel selection policies play an important role in determining the performance of MCMI. Among the channel assignment policies under study, the round robin policy provides the best performance. The results provide a baseline for evaluating the bandwidth of multiple channels multiple interfaces networks.
Keywords: Multiple channel communication, channel interference, channel assignment, bandwidth.
1 Introduction
Wireless communication has become an indispensable part of modern day-to-day life. Research in ad hoc networks has focused on single channel networks, in which a common channel is desired for simple routing control in multi-hop networks. A well-known factor affecting the performance of ad hoc networks is the significant throughput degradation along the multi-hop path [1]. When a single channel is used for both incoming and outgoing traffic, throughput is halved because both directions use the same radio channel. That means that when one node is transmitting, its neighboring nodes must all be in listening mode; otherwise a collision will occur. This problem is amplified across the network, and after a few hops the bandwidth is reduced significantly. To overcome the throughput degradation problem, a natural approach is to use multiple channels, so that nodes within an interference area work on different channels.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 87–98, 2008. © IFIP International Federation for Information Processing 2008
T.-T. Luong, B.-S. Lee, and C.K. Yeo
Recent research and commercial efforts in the wireless networking industry have focused on a new class of ad hoc networks: multiple channels multiple interfaces ad hoc networks [1]. These networks are characterized by a set of fixed nodes with multiple wireless interfaces that use multiple orthogonal channels at the same time over a given swath of spectrum, such as IEEE 802.11 a/b/g. Simultaneous operation of multiple channels provides the following advantages:
– Capacity enhancement: A node with two interfaces can send packets on two channels simultaneously. More importantly, forwarding nodes can send and receive at the same time. With appropriate protocols for channel selection and assignment, such a system can provide significantly greater capacity than a single interface system.
– Load sharing: A traffic flow is distributed among the available connections to achieve lower latency and increase the robustness of the network.
– Channel failure recovery: Graceful degradation and robustness against channel errors are possible by employing frequency diversity. Frequency diversity can be achieved by using multiple interfaces and operating each on a different channel frequency.
In this paper, we implement the Destination Sequenced Distance Vector (DSDV) protocol for multiple channels multiple interfaces wireless ad hoc networks (DSDV-MCMI) to reduce the bandwidth degradation problem. In DSDV-MCMI, we investigate three different policies for forwarding channel selection (same channel, random channel, and round robin channel) to evaluate the channel interference along the multi-hop path. Our simulation results show that the DSDV-MCMI protocol significantly improves the network goodput while maintaining low end-to-end delay. Among the three channel selection policies for MCMI, the round robin policy, in which packets are forwarded based on the packet's incoming channel, is the best at preventing channel interference, which results in the shortest end-to-end delay.
The paper is organized as follows. Section 2 briefly discusses related work. In Section 3, we present the multiple channels multiple interfaces protocol with DSDV-MCMI. The simulation and analysis of results are given in Section 4. Finally, Section 5 concludes the paper.
2 Related Work
We can categorize strategies for using multiple channels into two groups depending on the number of interfaces used: single interface and multiple interfaces. The first strategy is to enable a single wireless interface card to access multiple channels [2,3]. For example, So and Vaidya [3] propose a scheme that allows wireless devices to communicate on multiple channels using a single interface card. The scheme requires frequent channel switching, which incurs considerable overhead on the current hardware. The other strategy is to use multiple interface cards; Nasipuri et al. [4,5,6] propose a class of multi-channel carrier sense multiple access
Channel Allocation for MCMI Communication
MAC protocols where all nodes have an interface on each channel. These protocols use different metrics to choose the channel for communication between nodes. The metric may simply be whether the channel is idle [4], the signal power observed at the sender [5], or the received signal power at the receiver [6]. Wu et al. [7] propose a MAC layer solution that requires two interfaces. One interface is assigned to a common channel for control purposes, and the second interface is switched between the remaining available channels for transmitting data. Lee [8] proposed a class of proactive routing protocols for multiple channels, which uses one control channel and N data channels. The nodes exchange control packets on the control channel to negotiate the best channel for the receiver's data channel. Most past research on the use of multiple channels requires either modification of the MAC layer protocol or changes to IEEE 802.11. Compared to previous research, our proposed protocol focuses on exploiting the multiple channels by using multiple interfaces concurrently. The protocol does not require channel synchronization among nodes. Although there is some additional control overhead, the simulation results show that the proposed protocol significantly increases network goodput while decreasing end-to-end delay.
3 Multiple Channels Multiple Interfaces Routing with DSDV
3.1 Model Description
Module based node for ns-2. Wireless nodes are becoming multimodal. They are equipped with multiple interfaces, possibly using different technologies, including wired access, and may use them concurrently. Paquereau [9] designed the module-based wireless node (MW-Node), which provides more flexible and better integrated wireless and mobile support in ns-2 [10]. It is not a new implementation but primarily a reorganization of existing components. The MW-Node enables new features to be supported by the simulator, for instance true support for multiple wireless interfaces on a single node, each of which may have its own routing protocol.
DSDV-MCMI. Based on the MW-Node model, we developed destination sequenced distance vector for multiple channels multiple interfaces, DSDV-MCMI, to demonstrate the multiple channels multiple interfaces routing scheme. Nodes can have more than one network interface, as shown in Fig. 1. These interfaces are assigned to orthogonal channels so that they do not interfere with each other. Every interface has one DSDV [11] routing agent for generating control messages and forwarding data packets. Destination-Sequenced Distance-Vector Routing (DSDV) is a table-driven routing protocol based on the Bellman-Ford algorithm. Each mobile node maintains a route to every other node in a routing table. A route entry contains the destination address, the next hop address, and the number of hops to reach the destination. To prevent routing loops, each entry is marked with a sequence number assigned by the destination node. The sequence numbers enable a node to
Fig. 1. Multi-interface node
distinguish stale routes from new ones. The DSDV routing agent at each interface maintains its own routing table for all available destinations. To maintain the consistency of routing tables in a dynamically varying topology, each node's interface periodically transmits full updates, or immediately transmits incremental updates when significant new information is available, such as a topology change. We design a channel assignment module on top of the network interfaces. This module is in charge of selecting the network interface for forwarding each data packet.
3.2 Channel Assignment Policy
The purpose of channel assignment is to determine the forwarding channel for each packet so as to minimize interference with other nodes. We apply the static approach [12], which assigns a common set of channels to the nodes' interfaces. Every node has N interfaces, which are assigned to the same set of N orthogonal channels. At the source node, a data packet is sent out on a randomly selected interface. When an intermediate node along the path receives a data packet on incoming channel $Ch_{in}$, the node uses one of the following three policies to select the outgoing channel $Ch_{out}$ for relaying the data packet:
– Policy 1, same channel: the node relays the packet on the same channel on which it was received. This approach incurs high channel interference on a multi-hop route, which results in long delays and packet drops that degrade performance significantly. $Ch_{out} = Ch_{in}$
– Policy 2, random channel: the node relays the data packet on a randomly selected channel. The probability that the forwarding channel is the same as the receiving channel is $1/N$, so the more interfaces a node has, the smaller the chance that the two coincide. $Ch_{out} = \mathrm{Random}(Ch_0, Ch_1, \ldots, Ch_{N-1})$
– Policy 3, round robin channel: when a node receives a data packet to be forwarded on the $i$th channel, it forwards that packet on the $(i+1)$th channel. $Ch_{out} = (Ch_{in} + 1) \bmod N$
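The three policies can be written as small functions over channel indices 0..N-1. The following is a minimal sketch for illustration only; the function names and the helper `channels_along_path` are hypothetical and are not part of DSDV-MCMI or ns-2.

```python
import random


def same_channel(ch_in, n):
    """Policy 1: relay on the channel the packet arrived on."""
    return ch_in


def random_channel(ch_in, n, rng=random):
    """Policy 2: relay on a uniformly chosen channel out of n."""
    return rng.randrange(n)


def round_robin_channel(ch_in, n):
    """Policy 3: relay on the next channel, wrapping around after n - 1."""
    return (ch_in + 1) % n


def channels_along_path(policy, first_channel, hops, n):
    """Channel used on each hop; hop 0 is the source's own transmission."""
    channels = [first_channel]
    for _ in range(hops - 1):
        channels.append(policy(channels[-1], n))
    return channels


# With N = 3 channels, round robin cycles through all of them,
# while the same-channel policy never leaves the first one.
print(channels_along_path(round_robin_channel, 0, 5, 3))  # [0, 1, 2, 0, 1]
print(channels_along_path(same_channel, 2, 4, 3))         # [2, 2, 2, 2]
```

Passing the policy as a function keeps the per-hop logic identical across the three cases, which mirrors how the channel assignment module sits on top of the interfaces.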
3.3 Channel Interference
Assume that nodes are ideally placed along the route with hop distance $d$, as in Fig. 2, and that every node has transmission range $T$ and carrier sensing range $C$. The ideal placement of a node satisfies
$d \le T < 2d$ (1)
The transmissions of a node interfere with transmissions of other nodes within its sensing range. Thus the sensing range is also called the interference range. The interference distance, in number of hops along a path, is $M = \lfloor C/d \rfloor$ hops, where $\lfloor C/d \rfloor$ is the floor of $C/d$. This means that a transmission of a node may interfere with transmissions of the next M nodes and of the previous M nodes along the path. We improve the throughput by using multiple channels for transmission; the minimum number of channels required to eliminate channel interference among forwarding nodes along the path is (M+2).
Fig. 2. Multi-hop path
As shown in Fig. 2, nodes $A_1, A_2, \ldots, A_{M+1}$ are within interference range of each other, because the distance between $A_1$ and $A_{M+1}$ is M hops. Therefore the (M+1) data transmissions from $A_i$ to $A_{i+1}$ (i = 1, 2, .., M+1) can interfere with each other. Due to the exposed node problem, the data transmission from node $A_{M+1}$ to node $A_{M+2}$ can also interfere with the CTS/ACK packet sent from node $A_1$ to node S. Thus there are (M+2) data transmissions interfering with each other. Hence we need at least (M+2) channels, so that every data transmission can use a different channel and these communications do not interfere with each other. In Fig. 3, we apply policy 3 for forwarding channel selection among the set of (M+2) channels $\{Ch_0, Ch_1, \ldots, Ch_{M+1}\}$. Two transmissions that use the same channel, such as link $S$-$A_1$ and link $A_{M+2}$-$A_{M+3}$, are (M+1) hops apart along the path, hence they do not interfere with each other. Thus with (M+2) available channels, we can use policy 3 for channel assignment so that channel interference is eliminated.
Fig. 3. Channel assignment for communication link in multi-hop path
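The counting argument of Section 3.3 can be checked numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, links are indexed from 0 at the source with link k using channel k mod N (round robin starting from channel 0), and the ranges C = 550 m and d = 250 m in the usage lines are example values, not figures from the paper.

```python
import math


def interference_hops(sensing_range, hop_distance):
    """M = floor(C / d): the interference distance in hops."""
    return math.floor(sensing_range / hop_distance)


def min_channels(sensing_range, hop_distance):
    """Minimum channel count (M + 2) argued for in the text."""
    return interference_hops(sensing_range, hop_distance) + 2


def round_robin_reuse_gap(num_links, num_channels):
    """Smallest index gap between two links that share a channel when
    link k uses channel k mod num_channels."""
    channels = [k % num_channels for k in range(num_links)]
    gaps = [j - i
            for i in range(num_links)
            for j in range(i + 1, num_links)
            if channels[i] == channels[j]]
    return min(gaps) if gaps else None


# Example values only: C = 550 m, d = 250 m.
m = interference_hops(550, 250)        # 2
n = min_channels(550, 250)             # 4
gap = round_robin_reuse_gap(12, n)     # 4
# The receiver of link i and the sender of link i + gap are gap - 1 hops
# apart, which is M + 1 and therefore outside the interference distance M.
print(m, n, gap, gap - 1 > m)          # 2 4 4 True
```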
3.4 Capacity of a Chain of Multiple Channels Multiple Interfaces Nodes
In this subsection, we evaluate the capacity of a single chain of nodes in which packets originate at the first node and are forwarded along the chain to the last node. We investigate the chain under the saturated condition in which every node along the chain always has a packet to forward. When node A selects a channel to forward a packet, we denote by $P_I(A)$ the probability that the selected channel will be interfered with. Let S be the set of N available channels, B the channel bandwidth, and $B(A)$ the bandwidth experienced by node A. $B_{Chain}$ is the maximum bandwidth that can be achieved for transmission from the source to the destination of the chain. M is the interference distance in number of hops and L is the length of the chain in hops. Along the chain, a middle node such as node $A_{M+1}$ in Fig. 2 is the bottleneck node, because its data transmission may share the same channel with the largest number of other transmissions. The bandwidth of the chain is reduced significantly because of such nodes. Thus we evaluate the bandwidth experienced by the middle node A in the chain.
Policy 1. We have $Ch_{out} = Ch_{in}$, so two consecutive data transmissions along the path always interfere with each other. Thus channel interference always happens and $P_I(A) = 1$. Consider first a chain that is long enough, that is, at least (2M+2) hops long. We examine the middle node $A_{M+1}$ as in Fig. 2; $A_{M+1}$ shares the same channel with the next M nodes $A_{M+2}, \ldots, A_{2M+1}$ and the previous M nodes $A_1, \ldots, A_M$ along the path. In addition, due to the exposed node problem, the transmission from node $A_{M+1}$ to node $A_{M+2}$ also interferes with the CTS/ACK packet sent from node $A_1$ to node S. Therefore node $A_{M+1}$ shares the same channel with (2M+1) other nodes, which makes $B(A_{M+1}) = \frac{B}{2M+2}$. Thus the chain of nodes can achieve a maximum bandwidth of $\frac{B}{2M+2}$.
If hops length L < (2*M+2); the middle node will have carrier sensing range cover the whole chain, thus its’ data transmission will share the same channel with all the other (L-1) transmissions along the path. So its’ experienced bandwidth B will be B L . Therefore the chain can achieve a maximum bandwidth of L . B if L ≥ (2 ∗ M + 2) BChain = 2∗M+2 (2) B if L < (2 ∗ M + 2) L Policy 2. we denote PN I (A) the probability that the selected channel of A will not be interfered. PN I (A) = 1 - PI (A). Let K be number of forwarding nodes
Channel Allocation for MCMI Communication
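along the chain, excluding A, that lie inside the carrier sensing range of A. As discussed for policy 1, $K = 2M+1$ if $L \ge (2M+2)$; otherwise $K = L-1$. $P_{NI}(A)$ equals the probability that node A selects channel $Ch_i$ and the other $K$ nodes select among the remaining $(N-1)$ channels $(S \setminus \{Ch_i\})$, multiplied by $N$:
\[ P_{NI}(A) = \frac{1}{N} \cdot \left(\frac{N-1}{N}\right)^{K} \cdot N = \left(\frac{N-1}{N}\right)^{K}, \qquad P_I(A) = 1 - P_{NI}(A) = 1 - \left(\frac{N-1}{N}\right)^{K} \]
We denote by $P_I(A, J)$ the probability that the selected channel of A is shared with $J$ other nodes:
\[ P_I(A, J) = \left(\frac{1}{N}\right)^{J+1} \cdot \left(\frac{N-1}{N}\right)^{K-J} \cdot N = \left(\frac{1}{N}\right)^{J} \cdot \left(\frac{N-1}{N}\right)^{K-J}, \quad \text{where } J = 0, \ldots, K \]
\[ B(A) = \sum_{J=0}^{K} P_I(A, J) \cdot \frac{B}{J+1} = B \cdot \left(\frac{1}{N}\right)^{K} \sum_{J=0}^{K} \frac{(N-1)^{K-J}}{J+1} \]
\[ B_{Chain} = B \cdot \left(\frac{1}{N}\right)^{K} \sum_{J=0}^{K} \frac{(N-1)^{K-J}}{J+1}, \quad \text{where } K = \begin{cases} 2M+1 & \text{if } L \ge (2M+2) \\ L-1 & \text{if } L < (2M+2) \end{cases} \tag{3} \]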
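As a numerical illustration, the sum in (3) can be evaluated directly. The following Python sketch is not part of the original analysis: the function name `policy2_chain_bandwidth` and the normalization $B = 1$ are assumptions made for the example, and the sum is evaluated exactly as it is written in (3).

```python
from fractions import Fraction


def policy2_chain_bandwidth(B, N, M, L):
    """Chain bandwidth under policy 2 (random channel), following (3).

    B: channel bandwidth (integer or Fraction), N: number of channels,
    M: interference distance in hops, L: chain length in hops.
    """
    # K = number of other forwarding nodes inside the carrier sensing range.
    K = 2 * M + 1 if L >= 2 * M + 2 else L - 1
    total = sum(Fraction((N - 1) ** (K - J), J + 1) for J in range(K + 1))
    return B * total / Fraction(N) ** K


# Example inputs (assumed): N = 5 channels, M = 3 hops, B normalized to 1.
policy2_row = [round(float(policy2_chain_bandwidth(1, 5, 3, L)), 2)
               for L in range(1, 11)]
print(policy2_row)
```

Exact rational arithmetic is used so that rounding happens only once, when the values are printed; the result for chain lengths 1 to 10 can be compared with the policy 2 row of Table 1.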
Policy 3. We first evaluate a long chain, $L \ge (2M+2)$. If $N \ge (M+2)$, policy 3 eliminates channel interference; thus $P_I(A) = 0$ and $B(A) = B$. Otherwise, if $N < (M+2)$, $P_I(A) = 1$ and node A shares the same channel with $\lfloor M/N \rfloor$ of the next nodes and $\lfloor M/N \rfloor$ of the previous nodes along the chain. Node A may also share the same channel with the exposed node, which happens if $\lfloor (M+1)/N \rfloor = (M+1)/N$. Therefore $B(A) = \frac{B}{\lfloor M/N \rfloor + \lfloor (M+1)/N \rfloor + 1}$.
Then we evaluate a short chain, $L < (2M+2)$. If $L \le N$, all the transmissions use different channels, thus $P_I(A) = 0$ and $B(A) = B$. Otherwise, if $L > N$, the data transmission of node A shares the same channel with $\lceil L/N \rceil - 1$ other transmissions. Therefore $P_I(A) = 1$ and $B(A) = \frac{B}{\lceil L/N \rceil}$, where $\lceil L/N \rceil$ is the ceiling of $L/N$.
\[ B_{Chain} = \begin{cases} B & \text{if } L \ge (2M+2) \text{ and } N \ge (M+2) \\ \frac{B}{\lfloor M/N \rfloor + \lfloor (M+1)/N \rfloor + 1} & \text{if } L \ge (2M+2) \text{ and } N < (M+2) \\ \frac{B}{\lceil L/N \rceil} & \text{if } N < L < (2M+2) \\ B & \text{if } L \le N \end{cases} \tag{4} \]
4
Simulation and Results
4.1
Verification of Chain Results
First we simulate MCMI in a chain of nodes, where nodes are ideally placed 150m apart from each other. The transmission range is 250m and the carrier sensing range is 550m. This is the worst-case placement, in which each node is placed at just above half of the transmission range from its neighbor. In this scenario, the interference distance M is 3 hops. Packets originate at the source node at a rate of 2Mbps, which equals the channel bandwidth, so that data transmission of the chain is under saturated condition. We measure the bandwidth achieved when the chain is 1 hop long and how the bandwidth drops as the chain gets longer. We increase the number of interfaces per node and apply different channel selection policies to verify our analysis. We calculate the drop rate of the maximum bandwidth achieved by a chain N hops long compared to a 1-hop chain according to equations (2), (3) and (4). The result is summarized in Table 1.
Table 1. Bandwidth drop rate when chain's length increases
Chain length (hops)          1     2     3     4     5     6     7     8     9     10
1 Interface                  1     1/2   1/3   1/4   1/5   1/6   1/7   1/8   1/8   1/8
2 Interface with Policy 3    1     1     1/2   1/2   1/3   1/3   1/4   1/4   1/4   1/4
3 Interface with Policy 3    1     1     1     1/2   1/2   1/2   1/3   1/3   1/3   1/3
4 Interface with Policy 3    1     1     1     1     1/2   1/2   1/2   1/2   1/2   1/2
5 Interface with Policy 2    1     0.9   0.73  0.59  0.47  0.38  0.30  0.24  0.24  0.24
5 Interface with Policy 3    1     1     1     1     1     1     1     1     1     1
Fig. 4 shows the comparison of our simulation and analysis. Our simulation results are very close to the analysis results shown in Table 1. When the chain gets longer, packets must traverse more hops; their end-to-end delay increases while bandwidth decreases.
Fig. 4. Bandwidth drop rate when number of hops increases
4.2
Simulation of MCMI in Two Dimensions Multihops Network Environment
We evaluated the performance of MCMI in a static ad hoc network with a map size of 1200m x 800m and 80 nodes. This setting ensures that the network is dense enough to provide a path between any pair of nodes, yet sparse enough that paths are long enough to evaluate the effects of the different channel allocation policies. The average hop count in our simulation is 3.5 hops. We simulate 20 CBR connections with a packet size of 512 bytes for 300 seconds. The packet rate increases from 2 packets/s to 12 packets/s to vary the traffic load. A summary of the simulation settings is given in Table 2.
Table 2. Simulation Setting
Parameter                Setting
Traffic model            20 CBR connections
Packet size              512 bytes
Packet sending rate      2 pkt/s - 12 pkt/s
MAC                      802.11 2Mbps
Transmission range       250m
Carrier sensing range    550m
Map size                 1200m x 800m
Node number              80
Simulation time          300s
The following performance metrics are used to evaluate the simulation results:
– Goodput: the number of useful bits per unit of time forwarded by the network from a certain source address to a certain destination, excluding protocol overhead and retransmitted data packets.
– End-to-end delay: the average time it takes for a packet to traverse the network from its source to its destination.
4.3
Result Analysis
Goodput
Experiment 1: Effects of Number of Interfaces, N. Fig. 5 shows the goodput at different traffic loads for different numbers of channels. The goodput of N interfaces is normalized against that of 1 interface. With more interfaces per node, a node can receive a packet on one interface and send a packet on another interface at the same time. Thus with more interfaces, nodes share the traffic load better, which helps to reduce congestion. As expected, the goodput increases when the number of interfaces increases. We do not expect the goodput to increase linearly with the number of interfaces, because communication paths can cross over and interfere with each other. This reduces the bandwidth of all the crossing communication paths.
Fig. 5. Normalized goodput regarding number of interfaces
Experiment 2: Effects of Forwarding Channel Assignment Policy. The goodput of the three channel assignment policies in a three-interface network, normalized against policy 1, is shown in Fig. 6. Policy 3 is the best, as it diversifies channel usage, which results in the greatest goodput. Policy 1 does nothing to prevent channel interference, thus its goodput is much lower than that of policy 3. Policy 2 makes a small effort to prevent channel interference by selecting the forwarding channel randomly, which still leaves some probability of channel interference. Thus policy 2 performs slightly worse than policy 3.
Fig. 6. Normalized goodput regarding forwarding channel selection policy
End-to-End Delay
Experiment 1: Effects of Number of Interfaces, N. Fig. 7 shows the corresponding end-to-end delay for different numbers of network interfaces used for transmission, as the traffic load is increased. A key observation is that, in a congested network, packets are kept in the queue for a very long time waiting for channel availability, which results in either a long end-to-end delay or dropped packets. When a node has more interfaces, it can forward packets through all of them. Packets thus need not wait long in the queue; as a result, the end-to-end delay decreases considerably with more interfaces. For example, at a packet sending rate of 10pkt/s, the end-to-end delay is reduced by factors of 2, 4 and 6 relative to a single interface when nodes have 2, 3 and 4 interfaces respectively.
Fig. 7. End-to-end delay regarding number of interfaces
Experiment 2: Effects of Forwarding Channel Assignment Policy. Fig. 8 presents the end-to-end delay of the three channel assignment policies in a three-interface node configuration with varying traffic load. Forwarding data packets on the same channel as the receiving channel incurs the greatest channel interference, thus policy 1 has the highest end-to-end delay. Policy 3 is the best, as it diversifies channel usage and thereby reduces channel interference, which results in the shortest end-to-end delay. The performance of policy 2 is slightly worse than that of policy 3.
Fig. 8. End-to-end delay regarding forwarding channel selection policy
5
Conclusion
We presented a multiple channels multiple interfaces ad hoc network routing protocol, DSDV-MCMI, which utilizes multiple channels to enable multiple simultaneous transmissions and thereby improve network capacity. The proposed scheme does not require modification of the current IEEE 802.11 MAC protocol. Simulation results show that DSDV-MCMI exploits multiple channels and multiple interfaces to improve network capacity and reduce channel interference. With more network interfaces per node, we can achieve higher goodput and shorter end-to-end delay. We also suggested three policies for forwarding channel selection: same channel, random channel and round robin channel. These policies play an important role in determining the performance of MCMI. However, there remains the issue of crossover interference among communication paths, which needs a deeper understanding before a satisfactory solution can be found.
Resilience to Dropping Nodes in Mobile Ad Hoc Networks with Link-State Routing
Ignacy Gawędzki and Khaldoun Al Agha
Laboratoire de Recherche en Informatique, Université Paris-Sud 11, CNRS, F-91405 Orsay, France
Abstract. Currently emerging standard routing protocols for MANETs do not perform well in the presence of malicious nodes that intentionally drop data traffic but otherwise behave correctly with respect to control messages of the routing protocol. In this paper, we address the problem of coping with the presence of such nodes in networks that use a link-state routing protocol. The solution, based on the verification of the principle of flow conservation, operates purely at the protocol level and does not rely on any specific underlying OSI layer 1 or 2 technical feature. In addition, it is well suited for integration into existing link-state routing protocols for ad hoc networks, since it uses existing periodic control messages and does not require any clock synchronization between the nodes. The method, once implemented in an actual routing protocol, proves to increase the ratio of successfully delivered data packets significantly.
Keywords: ad hoc networks, link-state routing, security, resilience.
1
Introduction
Routing protocols for mobile ad hoc networks (MANETs) have received much attention in the past decade, but currently the most mature solutions still rely on the full cooperation of the participating nodes. The presence of one or more malicious nodes may have negative consequences on the proper operation of the network. In the case of proactive protocols, nodes exchange control messages allowing them to maintain their routing table, and data packet forwarding is performed on a hop-by-hop basis, using a traditional protocol stack (most commonly TCP/IP). This makes these protocols easy to implement as userspace processes while the operating system's kernel remains unchanged. The downside is that securing the protocol itself is not enough to make the network resilient to malicious nodes, since they can always handle control packets correctly while mishandling data packets. The presented solution aims to reduce the impact of malicious nodes as much as possible, allowing others to route data packets around nodes that tend to lose traffic, provided that malicious nodes are not colluding.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 99–111, 2008. © IFIP International Federation for Information Processing 2008
I. Gawędzki and K. Al Agha
After related work in the field is summarized in Sect. 2, the solution is detailed in Sect. 3. Performance evaluation is presented in Sect. 4, followed finally by a conclusion and some perspectives in Sect. 5.
2
Related Work
The issue of Byzantine robustness of network operations has received attention ever since the administration of the Internet became more decentralized and thus more exposed to attacks. The problem reemerged for MANETs, since the direct application of solutions for wired networks appeared to be either impossible or ineffective. In the family of reactive routing protocols, of which DSR [9] and AODV [13] are well-known examples, various solutions have been proposed [2,7,1] to provide Byzantine robustness. Since route discovery and maintenance as well as data packet forwarding are taken care of by the protocol, cryptographic methods can be used to ensure that every packet has been forwarded to the destination without being corrupted, misrouted or forged. As for proactive routing protocols, the OLSR [5] protocol has received attention from Raffo et al. [14], who proposed to secure the content of control messages to prevent nodes from advertising false or expired information. Fourati et al. also proposed a way to secure control messages [6] by requiring neighbors to validate and sign TC messages using threshold cryptography. While these proposals aim to guarantee the correctness of the protocol in the presence of malicious nodes, they do not protect against data packet dropping. Protecting data packets in the same (intrusive) way as proposed for reactive protocols would break the principle of separate table-driven hop-by-hop routing, so it is not desirable in this case. Additional cryptographic tools such as key distribution systems to allow sender authentication should be used, but they still do not enable the prevention or detection of intentional packet dropping. In order to make packet dropping detectable, Marti et al. proposed [12] to exploit the ability of nodes to overhear the transmissions of their neighbors to check whether they have retransmitted the packets as requested.
Unfortunately, such an approach no longer works when directional antennae or multiple network interfaces (or simply different communication channels) are used on nodes. Other approaches, based on active probing of nodes to detect droppers, have also been proposed [10,2,11]. They rely on the inability of nodes to either distinguish probes from plain data packets or identify their originator. It is worth noting that many solutions to encourage cooperation [4,15] have been proposed as well. Although they may be very effective at discouraging selfishness, they do not protect against malicious nodes that drop data traffic even at their own expense. In the context of static wired networks, Bradley et al. proposed [3] that nodes should periodically verify that each of their neighbors satisfies the principle of flow conservation based on their advertised packet counters. Nodes that fail to pass these tests are flagged as bad and excluded from the network. Their
Resilience to Dropping Nodes in MANETs with Link-State Routing
assumptions about the network were later criticized as being too strong and a few possible attacks were exposed by Hughes et al. [8]. However, this solution is also hardly applicable to MANETs, because of mobility and the difficulty of maintaining tightly synchronized clocks between nodes.
3
Proposed Solution
3.1
Motivations and Assumptions
Our experience in implementing MANET routing protocols taught us that the less intrusive a solution is in the existing framework, the easier it is to implement, maintain and port. Besides, relying only on basic features of drivers allows wider hardware support. With a protocol-level approach, we sought a solution that is easy to implement because it requires little or no change in the OS kernel, while the absence of cross-layer requirements eases support for a wide range of network interface cards. The solution assumes that cryptographic mechanisms allow any node, be it the destination host or an intermediate router, to check a packet's integrity and authenticate its source. The network is assumed to be running a link-state routing protocol that exchanges periodic control packets. The bound P on the time separating two successive control packets is assumed to be finite and known.
3.2
Solution Outline
The solution is composed of three distinct phases, which are performed continuously during the operation of the network. First, perform unicast packet accounting, exchange counters periodically and verify the principle of flow conservation. Second, maintain a local degree of distrust in each other node, based on how often it fails the flow conservation tests, and diffuse that value in the network. Third, combine local degrees of distrust from other nodes into a single, uniform metric (i.e. every other node comes up with the same values) and use it for routing table calculation. Note that intentional and unintentional packet droppings are not distinguished, which is desirable since in the third phase intentional as well as unintentional packet droppers are simply avoided. This should have a positive effect on the overall operation of the network.
3.3
Checking Flow Conservation
This section presents the method used to check whether nodes' counter values violate the flow conservation principle.
Definition 1. Let the network be a graph $G = (V, E)$ with $V$ the set of vertices (the nodes) and $E : \mathbb{R}^+ \to V \times V$ a function mapping instants in time $t$ to sets of arcs $E(t)$ (the one-way links) existing at these instants.
Definition 2. Let $I^i_{ij}(t)$ and $I^j_{ij}(t)$ be respectively i's and j's counter of the amount of bytes that flowed on link $(i, j)$ up to instant $t$ in packets whose destination was not $j$. Similarly, let $O^i_{ij}(t)$ and $O^j_{ij}(t)$ be respectively i's and j's counter of what flowed on $(i, j)$ up to $t$ in packets whose source was not $i$.
The flow conservation principle states that
\[ \forall t, \forall i, \quad \sum_{j} \left( I^i_{ji}(t) - O^i_{ij}(t) \right) = 0 . \tag{1} \]
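The check expressed by (1) amounts to comparing, for one node, the forwarded bytes that came in with the forwarded bytes that went out. A minimal Python sketch follows; the function name, the dictionary layout and the node labels are assumptions made for illustration, and (1) is read as a sum over all neighbors of the node.

```python
def flow_is_conserved(node, forwarded_in, forwarded_out):
    """Check (1) for one node from its own absolute counters.

    forwarded_in[(j, node)]: bytes received on link (j, node) in packets
        whose destination was not `node` (the I counters).
    forwarded_out[(node, j)]: bytes sent on link (node, j) in packets
        whose source was not `node` (the O counters).
    """
    total_in = sum(b for (_, dst), b in forwarded_in.items() if dst == node)
    total_out = sum(b for (src, _), b in forwarded_out.items() if src == node)
    return total_in - total_out == 0


# Hypothetical node "i" relaying traffic from "a" and "b" towards "c".
honest = flow_is_conserved("i", {("a", "i"): 700, ("b", "i"): 300},
                           {("i", "c"): 1000})
dropper = flow_is_conserved("i", {("a", "i"): 700, ("b", "i"): 300},
                            {("i", "c"): 600})
print(honest, dropper)
```

Because the check uses only the node's own counters, a node that reports consistent but false values would still pass it, which is why cross-checks against the neighbors' counters are needed.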
If $i$ advertised its counters $I^i_{ji}(t)$ and $O^i_{ij}(t)$ periodically, its neighbors could check (1) for each instant $t$ at which $i$ sent an advertisement. But since that simple check is based on i's counters only, $i$ could just as well have advertised false values. Additional checks are thus needed to ensure that these are correct, namely:
\[ \forall t, \forall i, j, \quad I^i_{ji}(t) = I^j_{ji}(t) \ \text{ and } \ O^i_{ij}(t) = O^j_{ij}(t) \ \text{ and } \ T^i_{ij}(t) = T^j_{ij}(t) , \tag{2} \]
where $T^i_{ij}(t)$ and $T^j_{ij}(t)$ are respectively i's and j's counter of the total amount of bytes that flowed on $(i, j)$ up to $t$. Because of mobility, neighborhoods change, so absolute values of counters are inconvenient. Differential values are advertised instead: a differential value is simply the difference between the current absolute value of a counter and its absolute value at the time of the last advertisement.
Definition 3. Let $S_i(t)$ be the value of i's sequence counter at instant $t$ and let $T_i(n)$ be the instant at which the sequence counter is incremented from $n-1$ to $n$ and i's advertisement set with sequence number $n$ is transmitted.
In the following, $X$ is used to represent either $I$, $O$ or $T$, to avoid repeating three times each formula that holds for each type of counter.
Definition 4. For each counter of the form $X^i_{ij}(t)$ or $X^j_{ij}(t)$, let $\dot{X}^i_{ij}(t)$ and $\dot{X}^j_{ij}(t)$ be their differential values.
\[ \forall t, \forall i, j, \forall k \in \{i, j\}, \quad \dot{X}^k_{ij}(t) = X^k_{ij}(t) - X^k_{ij}\left(T_k(S_k(t))\right) \tag{3} \]
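The differential value of (3) can be kept with two integers per counter: the absolute value and a snapshot taken at the last advertisement. The class below is a hypothetical sketch (its name and methods are not taken from the paper) of that bookkeeping.

```python
class DifferentialCounter:
    """Byte counter that reports the amount seen since the last advertisement."""

    def __init__(self):
        self.absolute = 0                  # X(t): bytes counted so far
        self._at_last_advertisement = 0    # X at the last advertisement instant

    def add(self, nbytes):
        """Account for a packet of `nbytes` bytes."""
        self.absolute += nbytes

    def differential(self):
        """Current differential value, as in (3)."""
        return self.absolute - self._at_last_advertisement

    def advertise(self):
        """Return the value to advertise and start a new interval."""
        value = self.differential()
        self._at_last_advertisement = self.absolute
        return value


counter = DifferentialCounter()
counter.add(100)
counter.add(50)
first = counter.advertise()    # bytes seen before the first advertisement
counter.add(20)
second = counter.advertise()   # bytes seen between the two advertisements
print(first, second)
```

Advertising differences rather than absolute values means a neighbor that has just come into range needs no knowledge of the counter's earlier history.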
Definition 5. Let $T^-_i(n)$ be the instant right before i's $n$th advertisement set is sent. This notation is used only with differential values to avoid ambiguity.
\[ \forall t, \forall i, \quad S_i\left(T^-_i(S_i(t))\right) = S_i(t) - 1 \tag{4} \]
Using periodic messages of the routing protocol, nodes advertise successive differential values of their counters.
Definition 6. An advertised values of counters (AVC) tuple from $i$ for neighbor $j$ with sequence number $n$ has the following form:
\[ V^i_j(n) = \left( \begin{array}{l} \dot{I}^i_{ij}\left(T^-_i(n)\right),\ \dot{O}^i_{ij}\left(T^-_i(n)\right),\ \dot{T}^i_{ij}\left(T^-_i(n)\right), \\ \dot{I}^i_{ji}\left(T^-_i(n)\right),\ \dot{O}^i_{ji}\left(T^-_i(n)\right),\ \dot{T}^i_{ji}\left(T^-_i(n)\right) \end{array} \right) . \tag{5} \]
Fig. 1. Desynchronized advertisements: differential counter values from i and j are not advertised at the same time and thus do not represent flow on the same interval. Vertical arrows indicate instants at which advertisements from i (above) and j (below) are sent. Upper bound on the advertisement interval P is shown as well.
Between two successive transmissions of an AVC tuple for some neighbor $j$, a node $i$ may receive zero or more AVC tuples regarding itself from $j$.
Definition 7. Let $S^i_j(n)$ be the (possibly empty) set of sequence numbers of j's AVC tuples received by $i$ between its $(n-1)$th and $n$th advertisements.
\[ \forall i, j, \forall n, \quad S^i_j(n) = \left\{ m : T_i(n-1) \le T_j(m) < T_i(n) \wedge (j, i) \in E\left(T_j(m)\right) \right\} \tag{6} \]
To allow neighbors of $i$ to verify (2), $i$ has to retransmit, along with each AVC tuple about a neighbor $j$, a set of reverse AVC tuples. In all, values of counters are diffused to the whole 2-hop neighborhood.
Definition 8. Node i's $n$th advertisement set contains the $n$th link advertisements about each of its neighbors $j$.
\[ A^i(n) = \bigcup_{j} \left\{ \left( n, V^i_j(n), \bigcup_{m \in S^i_j(n)} \left\{ \left( m, V^j_i(m) \right) \right\} \right) \right\} \tag{7} \]
Given that advertisements are sent asynchronously, the check of (1) can be performed directly, whereas that of (2) most probably cannot. The question is then how to decide that those equations are not satisfied, based on desynchronized counters, i.e. differential counter values over different intervals of time (see Fig. 1). The idea is to consider what can be called desynchronized link balances.
Definition 9. A desynchronized link balance in terms of counter $X$ for the link $(k, l)$ (either $(i, j)$ or $(j, i)$), computed from advertisement number $n$ from node $i$ about node $j$, has the following form:
\[ \forall i, j, \forall n, \forall (k, l), \quad BX^i_{kl}(n) = \dot{X}^i_{kl}\left(T^-_i(n)\right) - \sum_{m \in S^i_j(n)} \dot{X}^j_{kl}\left(T^-_j(m)\right) . \tag{8} \]
When a node $k$ receives the $n$th advertisement set from $i$, it can compute $BX^i_{ij}(n)$, since $\dot{X}^i_{ij}(T^-_i(n))$ and every $\dot{X}^j_{ij}(T^-_j(m))$ with $m \in S^i_j(n)$ are contained in $A^i(n)$.
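In code, a desynchronized link balance is a single subtraction over the values carried by one advertisement set. The sketch below is illustrative only: the function name and the byte values are assumptions, and each argument stands for one counter type $X$ on one link.

```python
def desynchronized_link_balance(own_differential, reverse_differentials):
    """Balance (8): i's differential value minus the sum of j's reverse values.

    own_differential: the value i advertises for the link in its n-th set.
    reverse_differentials: the values from j's tuples that i received since
        its previous advertisement (possibly none).
    """
    return own_differential - sum(reverse_differentials)


# Hypothetical numbers: i claims 1200 bytes; j's two reverse tuples claim 500 and 400.
with_reverse = desynchronized_link_balance(1200, [500, 400])
without_reverse = desynchronized_link_balance(1200, [])
print(with_reverse, without_reverse)
```

A non-zero balance is not by itself a sign of misbehavior, since the two nodes measure over different intervals of time.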
Obviously, $BX^i_{ij}(n)$ and $BX^i_{ji}(n)$ are seldom zero, unless the counters in $A^i(n)$ are zero themselves. Instead of looking at desynchronized link balances for individual advertisements, let us consider the sum of several desynchronized link balances for successive advertisements. It appears that these accumulated link balances have interesting properties.
Definition 10. Let $\Delta^i_j(n)$ be the difference in time between the instant $A^i(n)$ is sent and the instant j's last advertisement set was received by $i$.
\[ \forall i, j, \forall n, \quad \Delta^i_j(n) = \begin{cases} T_i(n) - \max_{m \in S^i_j(n)} T_j(m) & \text{if } S^i_j(n) \neq \emptyset , \\ T_i(n) - T_i(n-1) + \Delta^i_j(n-1) & \text{otherwise.} \end{cases} \tag{9} \]
Suppose that $k$ receives $M$ successive advertisement sets from $i$, calculates the desynchronized link balances and sums them. That sum is expressed quite simply in terms of absolute counters. The intuitive argument can be seen in Fig. 1: the summed desynchronized link balance (the gray rectangles) from i's advertisements $n+2$ to $n+3$ (i's $(n+2)$th advertisement contains j's $(m-1)$th counters and i's $(n+3)$th advertisement contains j's $m$th and $(m+1)$th counters) is clearly equal to the difference between the amount of bytes that flowed on interval $[T_j(m+1), T_i(n+3))$ and on interval $[T_j(m-2), T_i(n+1))$.
Definition 11. Let $FX^i_{ij}(n)$ (resp. $FX^i_{ji}(n)$) be i's end flow on link $(i, j)$ (resp. $(j, i)$), i.e. the quantity expressed by:
\[ \forall i, j, \forall n, \forall (k, l), \quad FX^i_{kl}(n) = X^i_{kl}\left(T_i(n)\right) - X^i_{kl}\left(T_i(n) - \Delta^i_j(n)\right) . \tag{10} \]
Theorem 1. If $X^i_{kl}(t) = X^j_{kl}(t)$ (for any $i$, $j$, $(k, l) = (i, j)$ or $(k, l) = (j, i)$ and any $t$), the summed values of desynchronized link balances over an interval of sequence numbers $[n, n+M]$ are exactly:
\[ \forall i, j, \forall n, \forall M \ge 0, \forall (k, l), \quad \sum_{p=n}^{n+M} BX^i_{kl}(p) = FX^i_{kl}(n+M) - FX^i_{kl}(n-1) . \tag{11} \]
Proof. By induction on $M \ge 0$. The case of $M = 0$ is trivially shown:
\[ \forall i, j, \forall n, \forall (k, l), \quad BX^i_{kl}(n) = X^i_{kl}\left(T_i(n)\right) - X^i_{kl}\left(T_i(n) - \Delta^i_j(n)\right) - X^j_{kl}\left(T_i(n-1)\right) + X^j_{kl}\left(T_i(n-1) - \Delta^i_j(n-1)\right) . \tag{12} \]
The induction hypothesis holds since we have that $X^i_{kl}(t) = X^j_{kl}(t)$.
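Theorem 1 can also be checked numerically on synthetic data. The Python sketch below is an illustration under stated assumptions, not the authors' code: advertisement instants are random integers, every advertisement from $j$ is received by $i$ instantly, the link exists at all times, and both nodes read the same hypothetical counter function `X`, which is the hypothesis of the theorem.

```python
import bisect
import random

random.seed(2008)
# Hypothetical integer advertisement instants; j's first one precedes all of i's.
T_i = sorted(random.sample(range(1, 1000), 30))
T_j = [0] + sorted(random.sample(range(1, 1000), 30))


def X(t):
    """Assumed absolute counter value at time t, identical at i and j."""
    return 3 * t + (t * t) // 7


def S(n):
    """Indices m with T_i(n-1) <= T_j(m) < T_i(n), as in (6)."""
    return range(bisect.bisect_left(T_j, T_i[n - 1]),
                 bisect.bisect_left(T_j, T_i[n]))


def balance(n):
    """Desynchronized link balance (8) for i's n-th advertisement."""
    own = X(T_i[n]) - X(T_i[n - 1])
    reverse = sum(X(T_j[m]) - X(T_j[m - 1]) for m in S(n))
    return own - reverse


def delta(n):
    """Time elapsed since j's last received advertisement, as in (9)."""
    received = S(n) if n > 0 else range(bisect.bisect_left(T_j, T_i[0]))
    if len(received) > 0:
        return T_i[n] - T_j[received[-1]]
    return T_i[n] - T_i[n - 1] + delta(n - 1)


def end_flow(n):
    """End flow (10)."""
    return X(T_i[n]) - X(T_i[n] - delta(n))


def theorem1_holds(n, M):
    """Compare both sides of (11) for the window [n, n + M]."""
    lhs = sum(balance(p) for p in range(n, n + M + 1))
    return lhs == end_flow(n + M) - end_flow(n - 1)


all_windows_ok = all(theorem1_holds(n, M)
                     for n in range(1, len(T_i))
                     for M in range(len(T_i) - n))
print(all_windows_ok)
```

Integer instants and an integer-valued counter keep every comparison exact, so the equality in (11) is tested without any floating-point tolerance.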
From the expression of (11), bounds on the accumulated desynchronized link balances can be expressed as follows.
Theorem 2. If $X^i_{kl}(t) = X^j_{kl}(t)$ (for any $i$, $j$, $(k, l) = (i, j)$ or $(k, l) = (j, i)$ and any $t$), then the accumulated desynchronized link balances are bounded in the following way:
\[ \forall i, j, \forall n, \forall M \ge 0, \forall (k, l), \quad \max_{q \in [n, n+M]} \left( \sum_{p=n}^{q} BX^i_{kl}(p) - FX^i_{kl}(q) \right) \le \sum_{p=n}^{n+M} BX^i_{kl}(p) \le \min_{q \in [n, n+M]} \left( \sum_{p=n}^{q} BX^i_{kl}(p) \right) + FX^i_{kl}(n+M) . \tag{13} \]
The proof of Theorem 2 is straightforward but omitted for brevity.
Definition 12. Let $\Gamma^i_j(n)$ be the minimum difference in time between the instant $A^i(n)$ is sent and the instant j's first advertisement set was received by $i$ since $A^i(n-1)$ was sent.
\[ \forall i, j, \forall n : S^i_j(n) \neq \emptyset, \quad \Gamma^i_j(n) = T_i(n) - \min_{m \in S^i_j(n)} T_j(m) \tag{14} \]
In $A^i(n)$, the quantities $\dot{X}^i_{ij}(T^-_i(n))$ and $\dot{X}^i_{ji}(T^-_i(n))$ are i's claim about how much has flowed on links $(i, j)$ and $(j, i)$ respectively for each class $X$, during interval $[T_i(n-1), T_i(n))$. If that advertisement also contains at least one reverse AVC tuple for the links (i.e. $S^i_j(n) \neq \emptyset$), then the quantities $\dot{X}^j_{ji}(T^-_i(n) - \Gamma^i_j(n))$ and $\dot{X}^j_{ij}(T^-_i(n) - \Gamma^i_j(n))$ are j's corresponding claim on interval $[T_i(n-1) - \Delta^i_j(n-1), T_i(n) - \Gamma^i_j(n))$.
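Definition 13. Let $L^i_j(n)$ and $N^i_j(n)$ be the sequence numbers defined as follows.
\[ \forall i, j, \forall n, \quad L^i_j(n) = \max \left\{ m : m \le n \wedge S^i_j(m) \neq \emptyset \right\} , \qquad N^i_j(n) = \min \left\{ m : m > n \wedge S^i_j(m) \neq \emptyset \right\} \tag{15} \]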
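It appears that we have the following properties:
\[ \forall i, j, \forall n, \quad \begin{array}{l} \left[ T_i(n) - \Delta^i_j(n), T_i(n) \right) \subseteq \left[ T_i\left(L^i_j(n) - 1\right), T_i(n) \right) \\ \left[ T_i(n) - \Delta^i_j(n), T_i(n) \right) \subseteq \left[ T_i(n) - \Delta^i_j(n), T_i\left(N^i_j(n)\right) - \Gamma^i_j\left(N^i_j(n)\right) \right) . \end{array} \tag{16} \]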
It immediately follows that bounds for the end flow can be expressed in terms of differential values of counters contained in advertisements.
\[ \forall i, j, \forall n, \forall (k, l), \quad FX^i_{kl}(n) \le \min \left( \sum_{p=L^i_j(n)}^{n} \dot{X}^i_{kl}\left(T^-_i(p)\right), \ \dot{X}^j_{kl}\left(T^-_i\left(N^i_j(n)\right) - \Gamma^i_j\left(N^i_j(n)\right)\right) \right) \tag{17} \]
If there is no discrepancy between i and j’s counters about link (k, l) (either (i, j) or (j, i)), then the bounds on the accumulated desynchronized link balances of (13) are verified. Even more so if upper bounds of the end flows from (17)
are used instead of their actual values. In the following we call this verification the link coherence check. When such a check fails, the absolute value of the excess of accumulated desynchronized link balances with respect to the (loose) bounds is itself a lower bound on the amount of actual discrepancy, which is used directly in the second phase of the solution (detailed in Sect. 3.4). Then the corresponding accumulated values have to be reset to zero and the bounds initialized anew upon reception of the next advertisement.
3.4
Distrust System
A failed check is a proof of counter discrepancy, caused by dropped data traffic or false AVCs. Since some data packet loss is usually to be legitimately expected, a check failure should not be a direct reason for the exclusion of nodes. Nodes that drop data traffic should thus be avoided if possible, but not excluded. In the second phase of the solution, tests from the first phase are used to maintain a distrust metric on nodes, which is used for path calculation. Unlike in usual trust or reputation systems, this metric cannot be individually maintained by each node, based on first-hand observations combined with second-hand observations from neighbors. The metric has to be uniform across all the nodes, so that every node tends to know the same value for any given node in the topology, otherwise hop-by-hop routing may not be loop-free. Therefore, the metric is a combination of all the nodes' local observations, diffused in the network. The local observations (failures to pass checks from the first phase) are used to maintain a local degree of distrust. This value is initially zero and each time a check fails, it is incremented as follows:
\[ D_i \leftarrow \max(1, D_i) \cdot \left( 1 + L_i \cdot r^+ \right) , \tag{18} \]
where $D_i$, $L_i$ and $r^+$ are respectively the degree of distrust in $i$, the lower bound on the amount of actual discrepancy and the ratio, a parameter of the system, that sets how fast a degree increases. The value recovers over time in the following way:
\[ D_i \leftarrow \max\left( 0, D_i - \Delta t \cdot r^- \right) , \tag{19} \]
where $D_i$, $\Delta t$ and $r^-$ are respectively the degree of distrust in $i$, the interval of time elapsed since the last update and the ratio, a parameter of the system, that sets how quickly a degree decreases with time. The local degree of distrust is diffused to other nodes, to allow them to compute the distrust metric for each other node.
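The two update rules for the local degree of distrust can be sketched as a pair of small functions. This is a hedged illustration: the function names and the parameter values are hypothetical, and (18) is read as $\max(1, D_i) \cdot (1 + L_i \cdot r^+)$.

```python
def distrust_after_failed_check(D, discrepancy, r_plus):
    """Update (18), read as D <- max(1, D) * (1 + L * r+)."""
    return max(1.0, D) * (1.0 + discrepancy * r_plus)


def distrust_after_recovery(D, elapsed, r_minus):
    """Update (19): linear recovery over time, never below zero."""
    return max(0.0, D - elapsed * r_minus)


# Assumed example values: discrepancy bound 8, r+ = 0.25, r- = 0.5 per time unit.
D = 0.0
D = distrust_after_failed_check(D, discrepancy=8, r_plus=0.25)
after_failure = D
D = distrust_after_recovery(D, elapsed=4, r_minus=0.5)
after_recovery = D
print(after_failure, after_recovery)
```

The `max(1, D)` factor makes the first failure raise the degree from zero, while repeated failures compound multiplicatively; recovery is linear in elapsed time.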
To make the calculation resilient to slander, we propose to combine the different degrees by calculating a kind of median value, which is insensitive to pathological samples that deviate too much. Using the local degrees of distrust directly is too simplistic, so we require nodes to advertise a confidence factor alongside them. That factor is used to enable each node to say to what extent its degree has to be taken into account in the calculation of the median. It is based on the amount of traffic that the observed
Resilience to Dropping Nodes in MANETs with Link-State Routing
node supposedly transmitted, according to its counters. The confidence factor increases with traffic:

$$c_i \leftarrow c_i + T_i \cdot r^{+} \tag{20}$$

(with $c_i$, $T_i$ and $r^{+}$ being respectively the confidence factor for $D_i$, the amount of traffic of $i$ and the ratio used as a parameter of the system set to how quickly the confidence has to grow) and decreases with time passing:

$$c_i \leftarrow \max\left(0, c_i - \Delta t \cdot r^{-}\right) \tag{21}$$

(with $c_i$, $\Delta t$ and $r^{-}$ being respectively the confidence factor of $D_i$, the interval of time elapsed since last update and the ratio used as a parameter of the system set to how quickly the confidence has to decrease with time). This way, the more (alleged) traffic goes through a node, the higher the confidence factor of its neighbors in their degree of distrust in that node. The value of the confidence factor that is actually advertised is required to be in the interval [0, 1):

$$C_i = 1 - \frac{1}{1 + c_i \cdot r_a}, \tag{22}$$
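As a rough sketch of Eqs. (20) to (22), the internal confidence factor and its advertised value could be maintained as follows. The function names and the sample numbers are hypothetical and chosen for illustration only.

```python
def grow_confidence(c, traffic, r_plus):
    """Eq. (20): confidence grows with the traffic the observed node reports."""
    return c + traffic * r_plus


def decay_confidence(c, elapsed, r_minus):
    """Eq. (21): confidence fades over time, floored at zero."""
    return max(0.0, c - elapsed * r_minus)


def advertised_confidence(c, r_a):
    """Eq. (22): squash the internal factor into the interval [0, 1)."""
    return 1.0 - 1.0 / (1.0 + c * r_a)


# Illustrative values only.
c = grow_confidence(0.0, 2_000_000.0, 1.0)
c = decay_confidence(c, 60.0, 0.1)
print(advertised_confidence(c, 1e-6))
```

An internal factor of zero maps to an advertised value of exactly zero, and the advertised value approaches but never reaches 1 as the internal factor grows.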
where $C_i$, $c_i$ and $r_a$ are respectively the advertised confidence factor of $D_i$, the (internal) confidence factor of $D_i$ and the ratio used as a parameter of the system set to how quickly an advertised confidence factor has to come close to 1. The actual calculation of the metric based on individual couples $(D_i^j, C_i^j)$ (distrust and associated confidence in $i$ diffused by $j$) is a weighted median:

$$\forall i,\ \forall \left((D_i^1, C_i^1), \ldots, (D_i^n, C_i^n)\right), \quad \sum_{j : D_i^j < W_i} C_i^j \le \frac{1}{2} \cdot \sum_j C_i^j \quad \text{and} \quad \sum_{j : D_i^j > W_i} C_i^j \le \frac{1}{2} \cdot \sum_j C_i^j, \tag{23}$$

where $W_i$ is the combined metric for node $i$. A simple route calculation using a shortest path algorithm on the global topology graph is to set the weight $w_{ij}$ of each link $(i, j)$ to $1 + W_i$ (instead of 1 for a simple hop-count metric).
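The weighted median of Eq. (23) and the resulting link weight can be illustrated as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, ties are resolved by taking the first value at which the cumulated confidence reaches half of the total, and a node with no usable report gets a metric of zero.

```python
def weighted_median(samples):
    """Return W such that the confidence mass strictly below W and the mass
    strictly above W are each at most half of the total, as in Eq. (23).

    samples: list of (distrust, confidence) couples received for one node.
    """
    total = sum(conf for _, conf in samples)
    if total <= 0.0:
        return 0.0                       # assumption: no usable report
    cumulated = 0.0
    for distrust, conf in sorted(samples):
        cumulated += conf
        if cumulated >= total / 2.0:
            return distrust
    return sorted(samples)[-1][0]        # only reached through rounding


def link_weight(w_i):
    """Weight of a link (i, j): 1 + W_i instead of 1 for plain hop count."""
    return 1.0 + w_i


# Three honest reports and one slanderer advertising a huge distrust
# with a confidence of 1 (illustrative numbers).
reports = [(0.0, 0.5), (1.0, 0.6), (0.5, 0.4), (1e9, 1.0)]
print(weighted_median(reports), link_weight(weighted_median(reports)))
```

In this example the slanderer's value does not move the result to the extreme, because its confidence alone does not exceed half of the total confidence mass.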
4 Performance Evaluation
4.1 Simulation Model
We have implemented our solution on top of a model of the OLSR protocol for the OPNET 12.0 simulator. Mobile nodes move on an area of 1000 m × 1000 m, with constant velocity (chosen uniformly on [0, 1] m/s) and constant direction (chosen uniformly on [0, 2π)) during an interval of time chosen exponentially with rate 1/600 per second. Data packets are generated exponentially with rate 5 per second, source and destination are chosen uniformly and data length is chosen uniformly between 40 and 1500 bytes. Cheaters, also chosen uniformly among the
I. Gawędzki and K. Al Agha
nodes, drop data packets with a parametrized probability. An IEEE 802.11 MAC layer is used on all nodes with 11 Mbps carrier, 50 mW transmission power, flow control (RTS/CTS) and default power thresholds (-85 dBm for carrier sense and -70 dBm for reception, resulting in ranges of respectively 1200 m and 214 m). The method presented in Sect. 3.3 and 3.4 has been implemented, along with a simple retransmission protocol for lost advertisements (omitted here for brevity), with r+ = 1/50, r+ = 1, r− = r− = 1/10 and ra = 10^−6. The MPR selection heuristic has been adapted so as to prefer neighbors with lower local distrust (details omitted here as well). MPRs diffuse local degrees of distrust and confidence factors of their selectors in TC messages. Cheaters slander by always advertising a very high degree of distrust and a confidence factor of 1.
The data packet delivery rate (total received data packets over total generated data packets) has been measured in each run and compared to several reference runs: in the No Cheater run, plain OLSR is used and the drop probability is forced to zero to see what OLSR achieves at best without cheaters; in the Plain run, plain OLSR is faced with non-zero drop probability to see how badly it performs alone; in the Known run, OLSR has an oracle telling which node is a cheater and routing table calculation is performed as suggested at the end of Sect. 3.4 with Wi = 9 for each cheater i. That last reference run is used to see what performance could be achieved at best.
Cheaters that intentionally drop data packets have many possible ways in which they can manipulate their counters. Among possible strategies, we have chosen four different ones: increment counters as though the packets had indeed been forwarded; increment counters on all outgoing links so that the total added sum is the amount of lost data bytes; do not increment counters on outgoing links; increment counters only for every second packet.
4.2 Simulation Results
In the first batch of simulations, there were 100 nodes with 10% of cheaters. We have checked that the system stabilizes quickly: after a few dozen seconds, the delivery ratio of our method is stable (though we checked for up to 10 hours) and much closer to the Known case than to the Plain one. Lowering the drop probability, apart from raising the Plain curve, only lengthened by a small factor the time required for the delivery ratio to come as close to the Known curve (approximately 3 minutes for a drop probability of 0.25). The average delivery ratios of the Plain and Known runs and the average of the four cheating strategies (whose results turned out to be very similar) after one hour of simulation, with 100 nodes, 10% cheaters with 1.0 drop probability, are presented in Fig. 2 (left). It appears clearly that in dense networks, the performance of our method closely tracks the Known reference.
[Figure 2: two plots of delivery ratio. Left: delivery ratio vs. drop probability (0.25 to 1.0) for the Known, Average and Plain runs. Right: delivery ratio vs. total nodes (cheaters), at 25 (3), 50 (5), 75 (8) and 100 (10), for the No Cheater, Known, Average and Plain runs.]
Fig. 2. Delivery ratio: with fixed density of 100 nodes and variable drop probability (left); with variable density and fixed drop probability of 1.0 (right)
To evaluate the performance of our method with respect to varying network density, we have run a batch of simulations with a different total number of nodes but an (almost) constant ratio of 10% of cheaters, each dropping with a probability of 1.0. The results presented in Fig. 2 (right) show that while the delivery ratio degrades with decreasing density, the performance of our method is still very close to the Known reference. The reason why the maximum achievable performance decreases with the density is that there are fewer and fewer alternative paths for a given destination, hence fewer ways to avoid the cheaters.
[Figure 3: delivery ratio (0.2 to 1.0) vs. percentage of cheaters (0 to 60) for the Known, Average and Plain runs.]
Fig. 3. Delivery ratio vs. cheaters: 1.0 drop probability and a total of 100 nodes
A last batch of simulations has been run with a duration of 15 minutes, a total number of 100 nodes but a varying number of cheaters with 1.0 drop probability. Results show (see Fig. 3) that our method improves the delivery ratio significantly, but falls increasingly short of the Known reference as the number of cheaters increases.
4.3 Discussion
Hughes et al. strongly criticized [8] the use of the principle of flow conservation as a way to detect malicious nodes in a network. They presented a number of attack scenarios which make simplistic approaches fail anyway. In our solution, we require all packets to be checked for integrity and their source address to be authenticated by all intermediate nodes, to exclude the possibility of packet modification or forgery going unnoticed. Some attacks, like “ghost routers” or the ones based on the reactivity of link-state routing protocols,
are targeted at the routing protocol itself, which we assume here to be already protected. Note that the “kamikaze” attack, in which a malicious node intentionally advertises false counter values in order to be excluded along with some of its neighbors is simply handled by our solution, in which nodes are never excluded.
5 Conclusion
We have presented a way to augment a link-state routing protocol to make it resilient to the presence of nodes intentionally dropping data traffic. The method consists of three parallel operations: detect flow non-conservation on neighboring nodes, maintain a degree of distrust in all checked nodes, and combine all degrees of distrust into a metric for routing table calculation. The method is minimally intrusive, provided that the necessary cryptographic tools are available and that the routing protocol itself is protected. It is independent of the OSI Data Link and Physical layers, which makes it easier to implement and workable with different hardware setups (unidirectional antennae, multiple interfaces, etc.). The method showed satisfactory performance, very close to an ideal case in which all cheating nodes are known to an oracle. As future work, we intend to explore the possibility of making the method work even in the presence of colluding cheaters. The interaction with an intrusion detection system can also be studied, in order to counter the “premature aging” attack.
References
1. Avramopoulos, I., Kobayashi, H., Wang, R., Krishnamurthy, A.: Highly secure and efficient routing. In: IEEE (ed.) INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, March 2004, vol. 1, pp. 208–220. IEEE, Los Alamitos (2004)
2. Awerbuch, B., Holmer, D., Nita-Rotaru, C., Rubens, H.: An on-demand secure routing protocol resilient to byzantine failures. In: WiSE 2002: Proceedings of the 3rd ACM workshop on Wireless security, pp. 21–30. ACM Press, New York (2002)
3. Bradley, K.A., Cheung, S., Puketza, N., Mukherjee, B., Olsson, R.A.: Detecting disruptive routers: a distributed network monitoring approach. Network, IEEE 12(5), 50–60 (1998)
4. Buttyán, L., Hubaux, J.-P.: Stimulating cooperation in self-organizing mobile ad hoc networks. MONET 8(5), 579–592 (2003)
5. Clausen, T., Jacquet, P.: Optimized link state routing (OLSR) protocol. In: RFC 3626, IETF (October 2003)
6. Fourati, A., Agha, K.A.: A shared secret-based algorithm for securing the OLSR routing protocol. Telecommunication Systems 31(2–3), 213–226 (2006)
7. Hu, Y.-C., Perrig, A., Johnson, D.B.: Ariadne: a secure on-demand routing protocol for ad hoc networks. In: MobiCom 2002: Proceedings of the 8th annual international conference on Mobile computing and networking, pp. 12–23. ACM Press, New York (2002)
8. Hughes, J.R., Aura, T., Bishop, M.: Using conservation of flow as a security mechanism in network protocols. In: IEEE Symposium on Security and Privacy, pp. 132–141 (2000)
9. Johnson, D.B., Maltz, D.A., Hu, Y.-C.: The dynamic source routing protocol (DSR) for mobile ad hoc networks for IPv4. RFC 4728, IETF (February 2007)
10. Just, M., Kranakis, E., Wan, T.: Resisting malicious packet dropping in wireless ad hoc networks. In: Pierre, S., Barbeau, M., Kranakis, E. (eds.) ADHOC-NOW 2003. LNCS, vol. 2865, pp. 151–163. Springer, Heidelberg (2003)
11. Mahajan, R., Rodrig, M., Wetherall, D., Zahorjan, J.: Sustaining cooperation in multi-hop wireless networks. In: 2nd Symposium on Networked System Design and Implementation, Boston, MA, USA (May 2005)
12. Marti, S., Giuli, T.J., Lai, K., Baker, M.: Mitigating routing misbehavior in mobile ad hoc networks. In: MobiCom 2000: Proceedings of the 6th annual international conference on Mobile computing and networking, pp. 255–265. ACM Press, New York (2000)
13. Perkins, C., Belding-Royer, E., Das, S.: Ad hoc on-demand distance vector (AODV) routing. RFC 3561, IETF (July 2003)
14. Raffo, D., Adjih, C., Clausen, T., Mühlethaler, P.: An advanced signature system for OLSR. In: SASN 2004: Proceedings of the 2nd ACM workshop on Security of ad hoc and sensor networks, pp. 10–16. ACM Press, New York (2004)
15. Zhong, S., Chen, J., Yang, Y.R.: Sprite: A simple, cheat-proof, credit-based system for mobile ad-hoc networks. In: INFOCOM (2003)
3-D Localization Schemes of RFID Tags with Static and Mobile Readers
Mathieu Bouet and Guy Pujolle
Université Pierre et Marie Curie (UPMC) – Laboratoire d’Informatique de Paris 6 (LIP6)
104 avenue du Président Kennedy, 75016 Paris, France
{Mathieu.Bouet,Guy.Pujolle}@lip6.fr
Abstract. RFID is an automatic identification technology that enables tracking of people and objects. We propose a scalable and robust 3-D localization method of RFID tags, derived in two schemes, that uses the diversity of pervasive environments to improve its accuracy. RFID readers are placed on the floor and ceiling of a room. In the first scheme, static and mobile tags are localized using only connectivity information obtained through this structure. The second scheme involves in addition mobile RFID readers. They are used here to refine the localization process by adding more inclusive constraints to the calculation. We study the impact of the number of static and mobile readers on the accuracy. We then show that the first scheme constitutes an upper bound for the second one. Furthermore, we observe that the second scheme does not require a lot of mobile readers and that it considerably improves accuracy. Keywords: RFID, localization, mobility, 3 dimensions, tracking.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 112–123, 2008. © IFIP International Federation for Information Processing 2008
1 Introduction
Radio Frequency IDentification (RFID) has recently become widespread for automatic identification and tracking. An RFID system is composed of two types of entities: RFID tags (or transponders) that store an identity, and RFID readers (or detectors) that remotely retrieve these data. Tags can be passive (communications are powered by the reader, generally by backscattering its carrier wave), semi-passive (the tags embed a power source for their internal processes and communicate with readers like passive tags), or active (they embed a battery that powers communications with readers and functionalities such as cryptography or sensing). Active tags are more expensive than passive tags but their communication range is far larger and they can have more advanced uses [1]. RFID is a key technology in future pervasive systems and services. Indeed, it provides essential context-aware functionalities such as automatic identification, tracking, and real-time inventory. By building RFID into networks, new management concepts could be conceived so as to fully integrate the real-time status of items and people. However, all the benefits of this technology could be hugely increased if identification information was linked to positions. It would add another dimension to context-awareness: real-time location and consequently mobility. For example, it could be used in home networking and libraries to retrieve specific objects, control access or monitor events. Matching identities with their locations would also allow the development of new strategies in autonomic systems for mobility control, resource allocation, security, and service discovery.
Classic RFID systems only provide coarse-grained localization. In fact, they are generally composed of readers placed in strategic locations; their purpose is to identify all the tags that pass within their read range. As a result, the precision of the position corresponds to the size of the cell formed by their read range. In this paper, our major contribution is to design a 3-D localization algorithm for mobile and static RFID tags which, contrary to other related works, is based only on connectivity information and which takes into account mobile RFID readers. Our algorithm offers fine-grained position estimation and is conceived so as to be cheap, robust, fast, and thus scalable. Moreover, mobile readers are used to refine and improve the precision. Our model is simple: we consider a room, a warehouse, or a container (modeled by a hexahedron) and we deploy readers on the floor and ceiling. No assumption is made about the equipment, and its quantity is reduced to the minimum in order to have a low-cost system. In the first scheme, an estimated position is computed for each mobile or static tag by using only connectivity information (which readers can detect the tag and which cannot) with a semi-infinitely constrained multivariable nonlinear method. Then, we present a second method, derived from the first one, to increase the localization precision when there are mobile readers. They are used to refine the first method by adding more constraints. This calculation requires more time and can only be applied to static RFID tags. First, we evaluate the impact of the number of readers on the precision of the localization, and we show that the accuracy of our method is very high when they are densely deployed.
Then, we study the impact of the number of mobile readers on the performance of the second method, we underline the existence of an upper bound for the precision, and we point out a quick and impressive reduction of the localization error. The rest of the paper is organized as follows. First we discuss related work in Section 2. Then we explain the basic localization method and the improved localization method with mobile readers in Section 3. Simulation methodology and results are discussed in Section 4. Finally, Section 5 concludes the paper.
2 Related Work
Localization is often considered a widely studied domain. However, few methods have been proposed for RFID. In fact, this area has specific constraints and requirements. Contrary to localization in Mobile Ad-hoc NETworks (MANETs) or in Wireless Sensor Networks (WSNs), for which a lot of algorithms have been conceived, an RFID system is centralized because the computation capabilities of RFID tags are very limited. Besides, such systems are composed of tens of thousands of tags that need to be treated quasi-simultaneously. Moreover, they are usually deployed in indoor environments and their signal is harshly impaired. For example, UHF tags’ radio waves do not bounce off metal and have difficulty penetrating water. The most famous indoor localization sensing method is called LANDMARC (LocAtioN iDentification based on dynaMic Active Rfid Calibration) [2]. It only
localizes active RFID tags. The system is calibrated with the help of reference tags (additional tags that are fixed and whose positions are known by the system): a map including the Received Signal Strength (RSS) of each reader is built. Then, the system retrieves the RSS information from each tag to the readers and checks the map to find its estimated position. Another well-known RFID localization system is SpotON [3]. It is also based on RSS measurements and uses them to estimate the distance between readers and tags. A central server aggregates the data and performs a trilateration to obtain the position of the tags. In [4], the authors also estimate the distance between tags and readers. However, this estimation is done through the Time Difference Of Arrival (TDOA) technique: the system measures the time-of-flight and deduces the distance according to the frequency and the inter-delay. This scheme is designed for fixed passive SAW RFID tags. Finally, in [5] the authors propose a 3-D localization scheme. It is founded on the assumption that readers have numerous power levels and on the deployment of reference tags. The read range of each reader is increased or decreased until the tag is detected. Then, thanks to the reference tags, an upper and a lower bound are obtained. The localization is done by mixing the bounds of all the readers and by applying a time-costly computation on the obtained area. Distance estimation through RSS, TOA or AOA is often very complicated in indoor environments because of signal absorption, indirect paths, and interference. These techniques can lead to large errors and require specific and thus expensive equipment. Moreover, the use of reference tags considerably increases the overall cost of the system and necessitates manual installation.
Our localization method is based on very simple principles in order to be cost-effective, robust in harsh environments, and fast enough to be scalable when thousands of items need to be treated simultaneously. Indeed, we consider readers with basic capabilities, we do not use any additional equipment, and we localize tags using only connectivity information; no distance estimation is performed. General indoor localization systems have also been studied for a long time. The three best known are Cricket [6], Active Bat [7], and Active Badge [8]. In the last system, Active Badges transmit unique infra-red signals every 10 seconds. When they pass close to a reader, they are detected. The position of the badge corresponds to that of the reader that detected it, and the precision is the size of the cell. In Active Bat, ultrasonic receivers (or sensors) are placed on the ceiling. The system measures the time-of-flight from the node to each receiver and applies a classic trilateration. Finally, Cricket is also based on TDOA measurements, but the difference is that the sensors send their coordinates to the node and it is the node that performs the trilateration to estimate its position. Several location methods for geographic routing and security mechanisms have also been proposed for MANETs and WSNs. The vast majority of them measure inter-node distance through the RSS, TDOA, or Angle of Arrival (AOA) technique. The others are based on hop count or inclusive and exclusive constraints. The reader is referred to [9] for more details. As explained earlier, those approaches address different problems from the one treated in this article, but they offer a good understanding and a solid basis for our approach. This also goes for the numerous methods that use reference RFID tags to localize mobile robots.
3 Our Algorithm
3.1 System Setup
First of all, we place RFID readers on the floor and ceiling of a room, a warehouse, or a container. These are modeled by a hexahedron whose length, width, and height are L, W, and H respectively, as shown in Fig. 1. The readers are spaced from each other according to a certain gap S. The exact position of each reader is known by the system. Furthermore, they are supposed to be as basic as possible, that is, they have no different power levels, no directional antenna, no multiple antennas, and no synchronization. They can also have different read ranges. For easier comprehension, we consider that they all have the same capabilities and thus the same read range R, and we assume that the signal transmission between readers and tags forms a sphere. Both active and passive tags can be located with this system.
Fig. 1. Our model
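The setup can be illustrated with a short Python sketch that places readers on the floor and ceiling and derives the connectivity information for a tag. The regular grid starting at the origin, the function names and the small dimensions used here are assumptions made for the example; the text above only specifies the gap S, the read range R and the hexahedron dimensions.

```python
import math


def place_readers(length, width, height, spacing):
    """Readers on a regular grid on the floor (z = 0) and ceiling (z = height)."""
    xs = [i * spacing for i in range(int(length // spacing) + 1)]
    ys = [j * spacing for j in range(int(width // spacing) + 1)]
    return [(x, y, z) for z in (0.0, height) for x in xs for y in ys]


def detecting_readers(tag, readers, read_range):
    """Indices of the readers whose spherical read range contains the tag."""
    return frozenset(
        i for i, reader in enumerate(readers)
        if math.dist(tag, reader) <= read_range
    )


readers = place_readers(10.0, 10.0, 3.0, 5.0)     # small room, S = 5 m
combination = detecting_readers((2.5, 2.5, 0.5), readers, 5.0)
print(len(readers), sorted(combination))
```

The resulting set of reader indices is the only input the localization needs: no signal strength or distance is measured.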
3.2 Basic Position Estimation
This first method can locate both static and mobile tags and is based on the architecture presented in Section 3.1. The localization of tags only requires connectivity information: which readers are able to detect a given tag and which cannot. No distance estimation is performed. A mean position is calculated for each combination of intersection of readers as shown in Fig. 2. These mean positions correspond to the barycenter (also known as the center of gravity or the center of mass) of the different volumes formed by the intersection of the readers’ detection areas. Tags’ positions are estimated as follows. First, we define an error objective function $\delta_i(x, y, z)$ for each combination of readers:

$$\delta_i(x, y, z) = \sum_{j=1}^{n_j} \sqrt{(x - x_j)^2 + (y - y_j)^2 + (z - z_j)^2}, \tag{1}$$

where $n_j$ is the number of readers that form the combination, $(x, y, z)$ are the unknown coordinates of the barycenter (that will be the estimated position), and $(x_j, y_j, z_j)$ are the coordinates of the reader $j$.

Fig. 2. A 2 dimensional intersection of readers
The error objective function is defined as the sum of the distance between the tag whose position is unknown and every reader able to read it. Then, as the number of readers on the floor is equal to the one on the ceiling, we can easily deduce if the tag is close to the floor, in the middle zone, or close to the ceiling. Indeed, if the tag is near the floor, more readers on the floor than on the ceiling will be able to read it. In that case, we can add extra information to the error objective function in equation (2). The same is done if more readers on the ceiling than on the floor are activated in equation (3). Nothing is done if the quantities are equal.
$$\delta_i(x, y, z) = \delta_i(x, y, z) + \sqrt{(z - z_{\mathrm{floor}})^2}, \tag{2}$$

$$\delta_i(x, y, z) = \delta_i(x, y, z) + \sqrt{(z - z_{\mathrm{ceiling}})^2}. \tag{3}$$
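The error objective function of Eq. (1), together with the extra term of Eq. (2) or (3), might be written as follows. This is a hedged sketch: the function name, the way floor and ceiling readers are recognized by their z coordinate, and the default heights are illustrative assumptions.

```python
import math


def error_objective(p, detecting, floor_z=0.0, ceiling_z=3.0):
    """Sum of distances from p to every detecting reader (Eq. 1), plus the
    distance to the floor (Eq. 2) or to the ceiling (Eq. 3) when one side
    has more detecting readers than the other."""
    delta = sum(math.dist(p, reader) for reader in detecting)
    on_floor = sum(1 for reader in detecting if reader[2] == floor_z)
    on_ceiling = sum(1 for reader in detecting if reader[2] == ceiling_z)
    z = p[2]
    if on_floor > on_ceiling:
        delta += abs(z - floor_z)        # sqrt((z - z_floor)^2)
    elif on_ceiling > on_floor:
        delta += abs(z - ceiling_z)      # sqrt((z - z_ceiling)^2)
    return delta


detecting = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 0.0, 3.0)]
print(error_objective((0.0, 0.0, 1.0), detecting))
```

With two floor readers and one ceiling reader detecting the tag, the extra term pulls the minimizer towards the floor; with equal counts the objective is left unchanged.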
Finally, we use a semi-infinitely constrained multivariable nonlinear method to find the minimum of the error objective function presented in equation (1). It takes as parameter non linear functions, linear inequalities (the difference between the maximum read range R and the distance reader-unknown estimated position), and lower and upper bounds (the dimensions of the hexahedron). This well-known algorithm uses cubic and quadratic interpolation techniques to estimate peak values in the semi-infinite constraints. The error objective function has been defined so that its minimum corresponds to the barycenter of the volume: the calculated point will minimize the sum of the distances with all the readers plus the distance with the floor or the ceiling. The problem is stated as follows:
$$\min_{(x, y, z)} \delta_i(x, y, z) \quad \text{subject to} \quad \begin{cases} 0 \le x \le L \\ 0 \le y \le W \\ 0 \le z \le H \\ \sqrt{(x - x_1)^2 + (y - y_1)^2 + (z - z_1)^2} - R \le 0 \\ \ldots \\ \sqrt{(x - x_{n_j})^2 + (y - y_{n_j})^2 + (z - z_{n_j})^2} - R \le 0 \end{cases} \tag{4}$$

where $(x_1, y_1, z_1), \ldots, (x_{n_j}, y_{n_j}, z_{n_j})$ are the coordinates of the readers that formed the volume. If the result for a given combination has already been computed, it does not need to be recomputed. Consequently, this localization method can require a few calculations at the beginning but will be very fast once all the cases have been treated. Actually, a solution would be to calibrate the room at the start, i.e., to compute the corresponding mean position of every combination of readers. The localization would then be very fast: check the built map and retrieve the tag’s position. This process can easily be applied to both static and mobile tags since only connectivity information is used.
3.3 Position Estimation Enhanced by Mobile Readers
In realistic pervasive environments, RFID readers can be mobile. We propose to use this diversity to improve our localization scheme. This method builds on the first one: the room is equipped in the same way and the positions are estimated through the same process if no mobile reader is present. The mobile readers do not need to be complex or expensive since, once again, we deduce connectivity information from their location and their maximum read range. Their positions are known by the system. The interaction between the static reader structure and the mobile readers is shown in Fig. 3.
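The following sketch illustrates the problem of Eq. (4) and the idea of computing each reader combination only once. It does not reproduce the semi-infinitely constrained solver used in the paper: a brute-force grid search is swapped in to keep the example self-contained, and the function names, the grid step and the cache are assumptions. Only the plain objective of Eq. (1) is minimized here.

```python
import itertools
import math


def estimate_position(detecting, box, read_range, step=0.25):
    """Grid-search stand-in for Eq. (4): among grid points of the box that lie
    within read_range of every detecting reader, return the one minimizing
    the sum of distances to those readers."""
    def axis(limit):
        return [k * step for k in range(int(round(limit / step)) + 1)]

    best, best_cost = None, math.inf
    for p in itertools.product(*(axis(limit) for limit in box)):
        if any(math.dist(p, reader) > read_range for reader in detecting):
            continue                                  # violates a constraint
        cost = sum(math.dist(p, reader) for reader in detecting)
        if cost < best_cost:
            best, best_cost = p, cost
    return best


position_cache = {}


def locate(detecting, box, read_range):
    """Look the combination up first; solve only for unseen combinations."""
    key = frozenset(detecting)
    if key not in position_cache:
        position_cache[key] = estimate_position(key, box, read_range)
    return position_cache[key]


square = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (4.0, 4.0, 0.0)]
print(locate(square, (6.0, 6.0, 3.0), 5.0))
```

An exhaustive grid over a 100 m × 100 m × 3 m volume would be slow in pure Python, so a practical variant would restrict the search to the bounding box of the detecting readers or use a proper constrained optimizer.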
Fig. 3. A 2 dimensional intersection of static and mobile readers
Consequently, equation (4) becomes:

$$\min_{(x, y, z)} \delta_i(x, y, z) \quad \text{subject to} \quad \begin{cases} 0 \le x \le L \\ 0 \le y \le W \\ 0 \le z \le H \\ \sqrt{(x - x_1)^2 + (y - y_1)^2 + (z - z_1)^2} - R \le 0 \\ \ldots \\ \sqrt{(x - x_{n_j})^2 + (y - y_{n_j})^2 + (z - z_{n_j})^2} - R \le 0 \\ \sqrt{(x - x_{\mathrm{mobile}_k})^2 + (y - y_{\mathrm{mobile}_k})^2 + (z - z_{\mathrm{mobile}_k})^2} - R \le 0 \end{cases} \tag{5}$$

where $(x_{\mathrm{mobile}_k}, y_{\mathrm{mobile}_k}, z_{\mathrm{mobile}_k})$ are the coordinates of the mobile reader that detected the tag. Each time a mobile reader performs the reading of a tag, an inclusive constraint is added. Obviously, if the tag is static, this scheme can be easily applied and the zone is reduced more or less quickly depending on the number of mobile readers. If the tag is moving slowly, it can also be localized with this variant, but it is necessary to put a timestamp on the constraints defined by the mobile readers and to remove the oldest ones. If the tag is moving quickly, the first localization method has to be applied.
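The bookkeeping of mobile-reader constraints, including the timestamping suggested for slowly moving tags, could look like the sketch below. The class name, the `max_age` parameter and the sample values are hypothetical; the text does not specify how long a constraint should be kept.

```python
import math
from collections import deque


class MobileConstraints:
    """Inclusive constraints contributed by mobile readers for one tag."""

    def __init__(self, max_age):
        self.max_age = max_age           # assumed expiry, in seconds
        self.readings = deque()          # (timestamp, reader position), oldest first

    def add(self, timestamp, reader_position):
        """Record that a mobile reader at reader_position read the tag."""
        self.readings.append((timestamp, reader_position))

    def active(self, now):
        """Drop readings older than max_age and return the remaining positions."""
        while self.readings and now - self.readings[0][0] > self.max_age:
            self.readings.popleft()
        return [position for _, position in self.readings]


def satisfies(p, static_readers, mobile_positions, read_range):
    """True if p respects every inclusive constraint of Eq. (5)."""
    return all(
        math.dist(p, reader) <= read_range
        for reader in list(static_readers) + list(mobile_positions)
    )


constraints = MobileConstraints(max_age=60.0)
constraints.add(0.0, (1.0, 1.0, 1.0))
constraints.add(50.0, (2.0, 2.0, 1.0))
print(satisfies((2.0, 2.0, 0.0), [(0.0, 0.0, 0.0)], constraints.active(70.0), 5.0))
```

For a static tag, `max_age` can be made arbitrarily large so that every reading keeps shrinking the feasible zone.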
4 Simulation
4.1 Simulation Methodology and Settings
We have run extensive simulations in order to study the accuracy of our methods. The semi-infinitely constrained multivariable nonlinear algorithm has been implemented in several mathematical software packages. We chose the one in the Optimization Toolbox available in Matlab [10]. We have defined different error measurements so as to evaluate the accuracy:
• The mean error in 2 dimensions:

$$\varepsilon_{2D} = \sqrt{(x_{\mathrm{real}} - x_{\mathrm{estimated}})^2 + (y_{\mathrm{real}} - y_{\mathrm{estimated}})^2}, \tag{6}$$

and its standard deviation $\sigma_{2D}$.
• The mean error in 3 dimensions:

$$\varepsilon_{3D} = \sqrt{(x_{\mathrm{real}} - x_{\mathrm{estimated}})^2 + (y_{\mathrm{real}} - y_{\mathrm{estimated}})^2 + (z_{\mathrm{real}} - z_{\mathrm{estimated}})^2}, \tag{7}$$

and its standard deviation $\sigma_{3D}$.
• The mean error regarding the estimated height:

$$\varepsilon_{h} = \sqrt{(z_{\mathrm{real}} - z_{\mathrm{estimated}})^2}, \tag{8}$$

and its standard deviation $\sigma_{h}$.
We dimensioned the hexahedron as follows:

$$L = 100\,\mathrm{m}, \quad W = 100\,\mathrm{m}, \quad H = 3\,\mathrm{m}, \tag{9}$$

where L, W, and H are the length, the width, and the height respectively. These values have deliberately been chosen larger than the dimensions of a typical room so as to take into account all the possible shapes. We chose a realistic value for the read range of the static and mobile RFID readers:

$$R = 5\,\mathrm{m}. \tag{10}$$
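The error measures of Eqs. (6) to (8) can be computed as in the sketch below. The function names and sample positions are illustrative, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def localization_errors(real, estimated):
    """Per-tag errors of Eqs. (6)-(8) for one real/estimated position pair."""
    dx, dy, dz = (r - e for r, e in zip(real, estimated))
    return {
        "2d": math.sqrt(dx ** 2 + dy ** 2),
        "3d": math.sqrt(dx ** 2 + dy ** 2 + dz ** 2),
        "h": abs(dz),                    # sqrt((z_real - z_estimated)^2)
    }


def summarize(pairs):
    """Mean and (population) standard deviation of each error over all tags."""
    per_tag = [localization_errors(real, est) for real, est in pairs]
    return {
        key: (
            statistics.mean(e[key] for e in per_tag),
            statistics.pstdev(e[key] for e in per_tag),
        )
        for key in ("2d", "3d", "h")
    }


pairs = [((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)), ((1.0, 1.0, 1.0), (1.0, 1.0, 3.0))]
summary = summarize(pairs)
print(summary)
```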
Finally, we implemented the Random WayPoint Mobility Model (RWP) [11] for the mobile readers. This model is used in many works since it is relevant for the simulation of people’s movements. The mobile readers randomly pick a destination in the hexahedron, go to it at a constant speed, pause when they reach it, and then start the same process again. It would be better if the mobile readers were not all at the same height, but for an easier analysis of the performance, we chose to place them at 1 meter from the floor and to make them move at the speed of someone who is walking (5 kilometers/hour). Since the related works presented in Section 2 do not provide sufficiently precise hypotheses, or use configurations and equipment (such as reference tags) different from our model, we are not able to propose any comparison.
4.2 Results
Fig. 4 shows the mean error in two dimensions (on x and y coordinates) after 300 seconds for different quantities of mobile readers. The accuracy of our basic localization scheme is below 1 meter when the static readers are spaced up to 6 meters apart. A local peak is observed at 4.5 meters. This is due to the shape of the intersections of the readers’ areas. Indeed, this value is just below that of the radius. The readers on the floor and ceiling thus form very irregular zones, hence a higher error. This data constitutes an upper bound on the error of the second localization method. After 300 seconds, with only one mobile reader the 2-D error is reduced by up to 40 centimeters compared to the static results. Over the same period, if 5 mobile readers are used, the estimated error falls between 18 and 57 centimeters. Furthermore, the use of 10 or 20 mobile readers almost cancels the impact of the inter-reader spacing, since after 300 seconds the error is bounded between 17 and 23 centimeters.
[Figure 4: accuracy (m), from 0 to 1.4, vs. inter-reader spacing (m), from 2 to 6.5, for 0, 1, 5, 10 and 20 mobile readers.]
Fig. 4. 2-D accuracy after 300 seconds
[Figure 5: accuracy (m), from 0.4 to 1.6, vs. inter-reader spacing (m), from 2 to 6.5, for 0, 1, 5, 10 and 20 mobile readers.]
Fig. 5. 3-D accuracy after 300 seconds
Fig. 6 shows the mean error on the estimated height. It is limited between 39 and 61 centimeters. When the readers are densely deployed, the shapes of the intersections formed by the readers’ areas are not as regular as the ones when the inter-spacing reader is around 3 or 3.5 meters. The same phenomenon occurs in the cases where they are far one another. Consequently, the accuracy is better around 3 – 3.5 meters. The results for the second localization scheme are roughly the same. However, we can notice an improvement of the accuracy when mobile readers are used with static readers spaced of at least 4 meters. The impact of the mobile readers could be hugely increased if they were not all placed at the same height and if their read range were not all equal. For easier comparison between the different scenarios we chose identical mobile readers. Fig. 5 shows the 3-D mean error. The curbs have almost the same shapes than the ones for the 2-D results. This can be explained by the fact that our second localization method has only a light impact on the improvement of the height accuracy. Therefore,
3-D Localization Schemes of RFID Tags with Static and Mobile Readers
Fig. 6. Accuracy on height after 300 seconds (accuracy in meters versus inter-reader spacing in meters, for 0, 1, 5, 10, and 20 mobile readers)
the evolution of the 3-D accuracy tends to follow that of the 2-D accuracy. The error is bounded between 57 and 159 centimeters for the first localization method. When 20 mobile readers are used, after 300 seconds the error falls to between 48 and 61 centimeters (a reduction of up to 62%). The evolution of the accuracy over time is illustrated in Fig. 7. The impact of the mobile readers is visible from the beginning. After 30 seconds, the precision has greatly improved, and it keeps improving over time. Nevertheless, the rate of improvement decreases, since the difference between the results at 240 seconds and at 300 seconds is only a few centimeters. Finally, in Fig. 8 we compare the impact of the number of mobile readers on the accuracy. Since the first localization scheme uses no mobile reader, its error remains unchanged over time. In addition, two interesting remarks can be made. First, it is not necessary to use many mobile readers. Indeed, the difference between
Fig. 7. Evolution of the 2-D accuracy with 10 mobile readers (accuracy in meters versus inter-reader spacing in meters, at 0, 1, 30, 60, 120, 180, 240, and 300 s)
Fig. 8. Evolution of the 2-D accuracy for an inter-spacing of 5 meters (accuracy in meters versus time in seconds, for 0, 1, 5, 10, and 20 mobile readers)
the results for 5 mobile readers and the ones for 10 and 20 is not as large as expected. Second, as noted for Fig. 7, the mean error in each scenario converges rapidly over time. A residual error remains because of the spacing between the static readers. To sum up, the results for the first localization scheme constitute an upper bound for the mean error of the second scheme. The use of mobile readers improves the accuracy; the improvement depends on the number of mobile readers, the spacing between the static readers, and the execution time. For example, when the inter-reader spacing is 5.5 meters and 10 mobile readers are involved, the 3-D error falls from 92 to 59 centimeters (a reduction of 36%) after 300 seconds. Moreover, the results show that the error reduction converges quickly over time and that it does not require many mobile readers.
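As a quick check of the reductions quoted in this section, the relative reduction can be recomputed from the reported 3-D errors. The symbols e_static and e_mobile are our own notation, and we assume that the 62% figure compares the two upper bounds (159 and 61 centimeters); the text does not state this explicitly.

```latex
\[
\text{reduction} = \frac{e_{\text{static}} - e_{\text{mobile}}}{e_{\text{static}}},
\qquad
\frac{92 - 59}{92} \approx 0.36,
\qquad
\frac{159 - 61}{159} \approx 0.62 .
\]
```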
5 Conclusion

We have proposed a three-dimensional localization method for static and mobile RFID tags, and we have shown how to use the diversity of pervasive environments, that is to say mobile RFID readers, to improve the positioning accuracy. We have chosen very simple properties so as to obtain a robust, fast, cost-efficient, and thus scalable method. Static RFID readers are placed on the floor and ceiling of a room. Tags' positions are estimated by using only connectivity information in a semi-infinitely constrained multivariable nonlinear method. We then extend this scheme: mobile readers are used to refine the inclusive constraints and thus to improve the accuracy of the estimated locations of static tags. Our results show that the precision of the first scheme is very satisfactory compared to the requirements. Furthermore, the use of mobile readers greatly reduces the localization error. The analysis also shows that it is not necessary to involve many mobile readers and that the error reduction converges rapidly, making the second localization scheme not only very quick but also very precise.
References
1. Want, R.: An introduction to RFID technology. IEEE Pervasive Computing 5(1), 25–33 (2006)
2. Ni, L.M., Liu, Y., Lau, Y.C., Patil, A.P.: LANDMARC: indoor location sensing using active RFID. In: First IEEE International Conference on Pervasive Computing and Communications, pp. 407–415 (2003)
3. Hightower, J., Borriello, G., Want, R.: SpotON: An Indoor 3D Location Sensing Technology Based on RF Signal Strength. UW CSE Technical Report (2000)
4. Bechteler, T.F., Yenigun, H.: 2-D localization and identification based on SAW ID-tags at 2.5 GHz. IEEE Transactions on Microwave Theory and Techniques 51(5), 1584–1590 (2003)
5. Wang, C., Wu, H., Tzeng, N.-F.: RFID-Based 3-D Positioning Schemes. In: 26th IEEE International Conference on Computer Communications (INFOCOM), pp. 1235–1243 (2007)
6. Priyantha, N.B., Chakraborty, A., Balakrishnan, H.: The cricket location-support system. In: 6th Annual ACM International Conference on Mobile Computing and Networking (MOBICOM), pp. 32–43 (2000)
7. Ward, A., Jones, A., Hopper, A.: A New Location Technique for the Active Office. IEEE Personal Communications 4(5), 42–47 (1997)
8. Want, R., Hopper, A., Falcao, V., Gibbons, J.: The active badge location system. ACM Transactions on Information Systems 10(1), 91–102 (1992)
9. Gustafsson, F., Gunnarsson, F.: Mobile positioning using wireless networks: possibilities and fundamental limitations based on available wireless network measurements. IEEE Signal Processing Magazine 22(4), 41–53 (2005)
10. Matlab, http://www.mathworks.com/
11. Hyytia, E., Lassila, P., Virtamo, J.: Spatial node distribution of the random waypoint mobility model with applications. IEEE Transactions on Mobile Computing 5(6), 680–694 (2006)
Performance Evaluation and Enhancement of Surface Coverage Relay Protocol
Antoine Gallais and Jean Carle
IRCICA/LIFL, Univ. Lille 1, INRIA Futurs, France
{antoine.gallais,jean.carle}@lifl.fr
http://www.lifl.fr/POPS
Abstract. Area coverage protocols aim at turning off redundant sensor nodes while ensuring full coverage of the area by the remaining active nodes. Connectivity of the subset of active nodes must also be provided so that monitoring reports can reach the sink stations. Existing solutions rarely address these two issues as a unified one, and very few are robust to non-ideal physical conditions. In this paper, we propose an in-depth analysis and some enhancements of a localized algorithm for area coverage, based on Surface Coverage Relays (SCR) and able to build connected sets of active nodes that fully cover the area. We first enhance the critical phase of our protocol (the relay selection) and show that the number of active nodes can be drastically reduced. We then raise the issue of the robustness of the protocol once a realistic physical layer is simulated. Our algorithm proved to be an interesting solution, as it still ensured a high coverage level under realistic physical layer conditions. We also added the possibility to finely tune the overall proportion of active nodes through a new parameter used during the local relay selection phases.
Keywords: Wireless sensor networks, area coverage, configurable localized algorithm, realistic physical layer.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 124–134, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

Wireless sensor networks are made up of hundreds of devices in which a battery, a sensing module, and a wireless communication device are embedded. They are deployed over hostile or remote environments, in which they become single-use since their batteries cannot easily be replaced or recharged. Energy consumption is therefore balanced by taking advantage of the redundancy induced by the random deployment of nodes; some nodes are active while others are in sleep mode, thus using less energy. Such a dynamic topology should not impact the monitoring activity. The ensuing issue is for the active nodes to fully cover the area. Furthermore, as monitoring reports should at least reach the sink stations, the set of active nodes should be connected. This is known as the connected area coverage problem. This work focuses on a localized algorithm, first because the availability of a central deciding entity is a strong assumption, and also because we want to show that such an organization can be obtained in networks without any infrastructure. We proposed in [1] a fully localized solution to the connected sensor area coverage issue. Both a neighbor detection phase and a relay selection phase are required to complete the activity decision process. In this article, we study in depth the relay selection phase, which now relies on a new method described here. This latest solution allows us to strongly reduce the proportion of active nodes while inducing neither communication overhead nor any new information in control messages. Most existing area coverage solutions rely on an ideal communication model (each node has a transmission threshold, beyond which no message may be directly sent). The randomness of the radio channel is rarely considered. We studied the robustness of our SCR protocol under this new assumption. After showing that it was slightly impacted, we introduce a new parameter during the relay selection phase in order to vary the size of the relay set, thus finely tuning the proportion of active nodes.
2 Preliminaries

We assume that randomly deployed sensors are time-synchronized, static, and position-aware. Several solutions have been studied to achieve these goals [2]. The location of a sensor node is used as its unique identifier. We also consider that sensors are able to determine whether an area is covered by a set of sensors or not. Various coverage evaluation schemes have already been used for that purpose [3,4].

2.1 Communication and Sensing Models

A wireless network is modeled by a graph G = (V, E), V being the set of vertices and E ⊆ V² the set of edges that represent the available communications. An edge between two vertices u and v exists if u is physically able to send a message to v. The neighborhood set of u, noted as N(u), is defined as:

$$N(u) = \{ v \in V \mid v \neq u \wedge (u, v) \in E \}. \tag{1}$$

Each sensor has a communication range Rc and a sensing range Rs. We denote by S(u) the area covered by a node u and by S(A) the area covered by a set of nodes A = {a1, a2, ..., an} such that:

$$S(A) = \bigcup_{i=1}^{|A|} S(a_i). \tag{2}$$
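Definitions (1) and (2) can be illustrated with a minimal Python sketch. It assumes that nodes are identified by their (x, y) location, as stated above, and that a sensing area is a disk of radius Rs (the model used later in the simulations); the function names are ours and purely illustrative.

```python
import math


def neighborhood(u, edges):
    """N(u): every v different from u such that (u, v) is an edge."""
    return {v for (a, v) in edges if a == u and v != u}


def covers(node, point, rs):
    """True if `point` lies in S(node), modeled here as a disk of radius rs."""
    return math.dist(node, point) <= rs


def covered_by_set(nodes, point, rs):
    """True if `point` lies in S(A), the union of the areas S(a_i)."""
    return any(covers(a, point, rs) for a in nodes)
```

For instance, `covered_by_set([(0, 0), (5, 0)], (6, 0), 1.5)` holds because the point is within 1.5 of the second node, whereas the point (3, 0) is covered by neither node.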
2.2 Radio Channel Model

Given a graph G = (V, E), a communication range Rc, and the Euclidean distance dist(u, v) between nodes u and v, the unit disk graph model defines the set of edges E as:

$$E = \{ (u, v) \in V^2 \mid u \neq v \wedge \mathrm{dist}(u, v) \leq R_c \}. \tag{3}$$

Yet, in a practical context, transmission quality varies over time and a node u may be able to communicate with node v at time t, but not at time t+1. Each communication
link has a probability to exist which is influenced by a lot of factors such as the emitting power or the distance between the emitter and the receiver. Many models exist and some generators have already been proposed to allow model generation from parameters such as the kind of material used or the localization of the sensors themselves (see [5] for instance). To replace the unit disk graph, we opted for the lognormal shadowing model thus transforming G into a weighted graph, where the weight of each edge (u, v) ∈ E is equal to the probability of correct reception p(dist(u, v)) for the two nodes u and v. An approximated function P(x) is described by Kuruvila et al. [6]:

$$P(x) = \begin{cases} 1 - \dfrac{(x / R_c)^{2\alpha}}{2} & \text{if } 0 < x \leq R_c, \\[2ex] \dfrac{\left( (2 R_c - x) / R_c \right)^{2\alpha}}{2} & \text{if } R_c < x \leq 2 \times R_c, \\[2ex] 0 & \text{otherwise,} \end{cases} \tag{4}$$
α being a factor that highly depends on the environment and x being the considered distance. This function assumes that the probability of correct reception for the range Rc is always equal to P(Rc ) = 0.5. Fig. 1 illustrates this function for α = 2, which is frequently encountered in the literature. Being aware of the approximation that is made here, we are not pretending to use a highly realistic model. Yet, we believe that considering radio channel randomness during protocol analysis and simulation constitutes a first step towards facilitating further developments on real material.
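The approximation in formula (4) is straightforward to evaluate. The following Python sketch is our own transcription, not the authors' simulator code; the function name and the default α = 2 (the value used for Fig. 1) are illustrative choices.

```python
def reception_probability(x, rc, alpha=2.0):
    """Approximate probability of correct reception at distance x, per formula (4)."""
    if 0 < x <= rc:
        return 1.0 - (x / rc) ** (2 * alpha) / 2.0
    if rc < x <= 2 * rc:
        return ((2 * rc - x) / rc) ** (2 * alpha) / 2.0
    # Covers x > 2 * rc and, as formula (4) is written, x <= 0 as well.
    return 0.0
```

With this form, the two branches meet at x = Rc, where both give 0.5, and the probability vanishes at x = 2 × Rc, so a node may occasionally hear a sender located up to twice the nominal communication range away.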
3 Related Work

Many papers have already addressed the problem of connected area coverage in wireless sensor networks where sensors are allowed to self-schedule their activity. Some have assumed a given topology to start the study, such as a grid [7]. In [8], the authors study some deployment patterns in order to ensure both coverage and connectivity once wireless sensors are deployed over a sensing field. However, the deployment might be random, thus forcing sensor devices to make decisions based on unpredictable information. Tian and Georganas [9] have proposed a node scheduling scheme in which time-synchronized and randomly deployed sensor devices regularly detect their neighbors and listen for retreat messages for a given timeout. A node can become passive if the remaining neighbors fully cover its area. To ensure connectivity of the set of active nodes, a strong assumption is made regarding the ratio between sensing and communication radii (Rs ≤ Rc). Such an assumption had already been used and proved to preserve network connectivity as long as area coverage was provided [3,4]. Recently, a randomized algorithm was proposed in [10]. Based on a Markov model, this algorithm allows each node to probabilistically turn off while preserving both coverage and connectivity. Yet, the need for a new analysis when considering non-homogeneous settings might be an obstacle in real deployments, where neighbor information is required to ensure coherent node behavior once the neighborhood is modified for any reason (mobility, failure).
Fig. 1. The two considered physical models (Rc = 1, α = 2): probability of correct reception versus distance, for the unit disk graph and lognormal shadowing models
4 Base of the SCR-CADS Protocol

While coverage might not be a strict requirement for wireless sensor networks, connectivity among the set of monitoring nodes is crucial, as every collected piece of information should be able to reach one of the sink stations. This is why we take a new look at our SCR-CADS protocol, which has been proved to ensure both connectivity and coverage in an original manner [1]. We show how this algorithm can be configured and, in particular, how it withstands new communication models. Our solution consists of three consecutive phases (a neighbor discovery mechanism, a relay selection phase, and a decision-making process), each detailed in this section.

Regular neighbor discovery. A classical neighbor discovery phase is first required. At the beginning of each round, a hello message is sent by each node. It contains information such as the node's geographical position and a priority, assumed to be unique among the set of nodes.

Relay selection. When considering node coverage, Jacquet et al. [11] defined the notion of multipoint relay (MPR). Each node selects a subset of its neighborhood, the set of relays. For any given node u, this set, denoted as MPR(u), enables multi-hop communications with every 2-hop neighbor:

$$\mathrm{MPR} : V \to V, \quad u \mapsto \mathrm{MPR}(u)$$

$$\forall u,\ \exists\, \mathrm{MPR} \mid N(\mathrm{MPR}(u)) \setminus N(u) = N(N(u)) \setminus N(u), \tag{5}$$

V being the whole set of nodes. We can observe that MPR(u) = N(u) satisfies the relation. Yet, making the MPR sets as small as possible reduces the size of the dominating set, denoted as MPR-DS. Each node must send its own relay set to its whole neighborhood so that a simple decision rule can be applied. The authors proved that some rules provide connected MPR-DS.
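Property (5) can be checked mechanically for a candidate relay set. The sketch below is only an illustration under our own assumptions: the network is given as a hypothetical adjacency dictionary `adj` mapping each node to its set of neighbors, and the helper names are not taken from [11].

```python
def neighbors_of_set(nodes, adj):
    """N(A): union of the neighborhoods of every node in A."""
    result = set()
    for v in nodes:
        result |= adj[v]
    return result


def satisfies_mpr_property(u, relays, adj):
    """Check property (5): N(relays) minus N(u) equals N(N(u)) minus N(u)."""
    n_u = adj[u]
    if not relays <= n_u:
        # Relays must be chosen among the 1-hop neighbors of u.
        return False
    return neighbors_of_set(relays, adj) - n_u == neighbors_of_set(n_u, adj) - n_u
```

On a small network where u has neighbors a and b, and only a reaches a third node c, the set {a} satisfies the property while {b} does not, which matches the intuition that a relay set must reach every 2-hop neighbor.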
As we aim at providing area coverage by connected sets, we have extended this solution to the problem of area coverage in wireless sensor networks. The relay selection phase has been modified. Each node computes its own relay set; this set must cover as large an area as the whole set of neighbors. This set is called the Surface Coverage Relay (SCR) set. We defined an SCR function for this purpose:

$$\mathrm{SCR} : V \to V, \quad u \mapsto \mathrm{SCR}(u),$$

SCR(u) standing for the subset of N(u) for which the following property holds:

$$\forall u,\ \exists\, \mathrm{SCR} \mid S(\mathrm{SCR}(u)) \setminus S(u) = S(N(u)) \setminus S(u). \tag{6}$$
Each node must be able to compute its SCR set. Finding the subset of minimal size is equivalent to the minimum set cover problem, which is NP-complete [12]. This is why heuristics have to be designed to compute SCR sets that are as close to optimal as possible in terms of the number of nodes involved. Initially, a heuristic based on centroid computation was used [1]. In this article, we propose a new heuristic to compute SCR sets. Each node starts with an empty SCR set. Then, step by step, a neighbor is added to the SCR set if and only if it covers a portion of area not yet covered by the SCR set. To get interesting subsets, at each step, the neighbor that brings the most new coverage should be added. Ideally, the neighbors should therefore be ordered according to their coverage potential with respect to previously observed nodes. The first heuristic was based on a simple idea: at each step, we try to find the node that should bring the most coverage. To avoid heavy computation, we opted for the one whose location is the furthest from the center of mass of the already selected relays (every relay having the same mass). It turned out that the computed sets were still large. We now propose a new heuristic for selecting the relays, based on a simple intuition: the furthest neighbor is the one that covers the largest portion of area outside the sensing zone of the selecting sensor. Neighbors are therefore ordered according to their distance to the selecting node. The furthest is added to the SCR set, then the second furthest is evaluated, and so on. Note that, for any heuristic, neighbors that are tied with respect to the selection criterion have to be distinguished; which node is considered first is determined with a simple random function. Once SCR sets have been computed, each node has to decide its own activity status. We now explain the decision rule used for this purpose.

Making activity decision. In the MPR-DS protocol [11], nodes apply a simple rule to decide their status (dominant or not). This rule is based on a key, assumed to be unique among the set of nodes.

Definition 1. Every node u whose key is the lowest among its neighborhood or which belongs to the relay set of the neighbor with the lowest key must be active.

As already shown in [11], if relay sets are computed so that property (5) holds, then the induced set of active nodes is connected. We have already proved that using this rule with SCR sets provides a set covering as large an area as the whole set of nodes. It is also clear that SCR sets are MPR sets when Rs = Rc, since 2-hop neighbors are necessarily located within one or several sensing areas of the 1-hop neighbors. This means that the covering set is also connected once every activity decision has been made. Now that we have presented SCR-CADS and a new heuristic to compute SCR sets, we evaluate this protocol under several communication models. Indeed, it has been widely reported that protocols evaluated under ideal communication assumptions (i.e., the unit disk model) are more prone to failures or malfunction once used in practice (see [13] for some instances among area coverage protocols). The remainder of this article therefore focuses on the analysis of the SCR-CADS protocol and its performance with a realistic physical layer for communications.
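The distance-based relay selection and the decision rule of Definition 1 can be sketched as follows. This is a rough illustration, not the authors' implementation: coverage is approximated by sampling grid points inside each sensing disk (the paper relies on a disk-intersection theorem instead), points inside the selecting node's own sensing disk are ignored in line with property (6), and all function and parameter names are hypothetical.

```python
import math
import random


def sample_disk(center, radius, step):
    """Grid points inside a disk; a crude stand-in for exact area computation."""
    cx, cy = center
    n = int(radius / step)
    points = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            p = (cx + i * step, cy + j * step)
            if math.dist(p, center) <= radius:
                points.append(p)
    return points


def select_scr(u, neighbors, rs, step=0.5, rng=None):
    """Distance-based heuristic: evaluate neighbors from furthest to closest."""
    rng = rng or random.Random(0)
    # Furthest neighbor first; ties are broken randomly, as in the text.
    ordered = sorted(neighbors, key=lambda v: (-math.dist(u, v), rng.random()))
    relays = []
    for v in ordered:
        # v is kept only if it senses a sample point outside S(u)
        # that no already selected relay senses.
        brings_new_area = any(
            math.dist(p, u) > rs and all(math.dist(p, r) > rs for r in relays)
            for p in sample_disk(v, rs, step)
        )
        if brings_new_area:
            relays.append(v)
    return relays


def must_be_active(u, keys, neighbors, scr_sets):
    """Definition 1: lowest key, or selected by the neighbor with the lowest key."""
    if not neighbors:
        return True
    lowest = min(neighbors, key=lambda v: keys[v])
    if keys[u] < keys[lowest]:
        return True
    return u in scr_sets[lowest]
```

The grid step trades accuracy for speed: a coarse grid may miss a thin uncovered sliver and therefore reject a neighbor that an exact evaluation would keep.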
5 Performance Evaluation

We have simulated only homogeneous sensor networks. Experimental results were obtained from randomly generated connected networks with a discrete event simulator. Nodes are deployed over a 50 × 50 rectangular area, following a Poisson point process of intensity λ > 0. We define the density d of the network to be the average number of nodes in a given communication area: we thus have d = λ × πRc². Both the communication range (Rc) and the sensing radius (Rs) are fixed at 10. Simulations were launched over densities varying from 20 to 90 nodes per communication zone, with a step of 10. For each density, the number of performed iterations is adjusted so that 95% of the results are in a sufficiently tight confidence interval. Each iteration consists of rounds. Each round starts with the neighbor discovery phase, followed by the computation of the SCR sets and the decision-making process. A sensing area is modeled as a disk while the communication model varies (unit disk model or a realistic physical layer). When nodes have to compute the area covered by a given neighbor, they use a coverage evaluation scheme based on a disk intersection theorem, already applied in [3,4]. A round ends with the sensing period, which involves every node that has decided to be active. We essentially measure the percentage of active nodes and the percentage of the original area whose coverage is preserved.

5.1 Under Ideal Communication Assumptions

We give complementary results about the SCR protocol when simulated with ideal communication assumptions. By ideal, we mean that no message can ever be lost, neither due to message collisions nor to the communication environment itself.

Area coverage and connectivity. As already mentioned, the algorithm has been proved to preserve both area coverage and connectivity of the set of active nodes. We therefore focus on the size of the sets of active nodes in order to evaluate the impact of the heuristic used during the relay selection phase.

Size of SCR sets. We have observed the average size of the SCR sets in order to compare the performance of the original heuristic with that of the latest solution, based on distances. The size of the relay sets is a matter of prime importance first
(a) Diminishing the size of relay sets (average size of SCR sets versus network density).
(b) Reducing the proportion of active nodes (active nodes in % versus network density).
Fig. 2. Main advantages of the distance-based heuristic for relay selection (both panels compare the distance-based and centroid-based heuristics)
because the number of active nodes is strongly linked to these and second because they determine the communication overhead induced by the protocol. Concerning the first point, every node whose key is the lowest among its neighborhood is necessarily active. The number of such nodes does not vary as long as the network density remains identical. Yet, the larger the relay sets, the higher the probability for a node to belong to the SCR set of the neighbor with the lowest key and so to become active. Concerning the second point, as relay sets must be sent to every 1-hop neighbor, the smaller they are, the lower the communication overhead and thus the energy consumption. We observed that the distance-based heuristic actually performs better than the centroid-based one. As shown in Fig. 2(a), the average size of the SCR sets generated by the distance-based heuristic varies from roughly 8 neighbors at density 20 to 11 at density 90, while the centroid-based method leads to 11 neighbors selected as relays at density 20 and nearly 24 at density 90, that is, more than twice the size obtained by our distance-based heuristic. As already mentioned, the size of the relay sets directly affects the number of active nodes. As our new heuristic allows us to considerably reduce this size, we now examine to what extent this improved relay selection affects the average number of active nodes involved at every round.

Active nodes. Figure 2(b) shows the average percentage of active nodes versus the network density. As expected, the percentage of active nodes decreases as the network density increases; the more nodes there are, the smaller the proportion of sensors required to achieve full coverage of the area. Interestingly, we can observe that the centroid-based heuristic leads to 24.7% of active nodes at density 90 while the distance-based method lowers this result to only 13% for the same density. We therefore obtained very interesting results with this new method.

5.2 Under Realistic Physical Layer Assumptions

We now evaluate the SCR-CADS protocol with a realistic communication model. The goal of this investigation is twofold. First, we aimed at testing the robustness
of this algorithm and then some potential enhancements were to be implemented and also evaluated. Introducing a more realistic physical layer leads to more channel randomness, that is, communication links that are probabilistic; two nodes can communicate with a given probability depending on the distance that separates them. Some area coverage protocols have already been studied under this new assumption and we aim at evaluating the SCR protocol as well. In our algorithm, once the relay sets have been computed, they are sent so that every node can make its activity decision using the rule described in Section 4. The set of active nodes computed in this way is connected and covers as large an area as the whole set of nodes when the unit disk model is used. When a realistic physical layer is used, we observed some changes.

Area coverage. Figure 3(a) shows the average percentage of area covered by the sets of active nodes computed by our SCR protocol. We can observe that while area coverage was fully preserved with an ideal communication model, this property no longer holds under our new assumptions. Area coverage goes from a bit more than 95% at density 30 to less than 90% at density 80. Yet, this loss of coverage is not as large as we could have expected, and especially not as dramatic as the ones already observed with other protocols [13]. This shows an inherent robustness of our SCR protocol. Let us detail the reasons why the impact is that low. During the neighbor discovery phase, hello messages from some of the theoretical 1-hop neighbors may not be received by a given node u. Yet, this actually increases the probability for u to evaluate its own key as being the lowest among its neighbors. Thus, the fewer neighbors are discovered, the higher the probability for u to become active, the worst case being no hello message ever received and u having an empty neighbor table, thus necessarily deciding to be active. In such a case, all nodes would be leaves and thus active, which is not the case here as some coverage losses have still been recorded. We may now focus on the relay selection phase. Once relays have been selected, SCR sets are sent to the neighbors. If some of these messages are never received by a given node u, then u is aware of only a subset of the neighbors that actually selected it as a relay (its SCR selectors). If one of these potentially missed selectors is the neighbor with the
(a) Coverage loss (area coverage in % versus network density).
(b) Reduction of active nodes (active nodes in % versus network density).
Fig. 3. SCR-CADS algorithm and a realistic physical layer (LNS, see formula 4, with α = 2); both panels compare the distance-based and centroid-based heuristics
lowest key so far, such a missed reception leads to u not knowing that it has been selected by its neighbor with the lowest key, which would have meant it had to be active. In such a case, u decides to be passive and therefore jeopardizes area coverage. This is why some coverage losses are observed in Fig. 3(a). However, note that if a relay message is received from a yet unknown node whose key is the lowest among those already collected, then the receiving node must consider this message as a hello message. It must also check whether it has been selected as a relay by this node. This demonstrates the inherent robustness of the SCR protocol, since coverage losses may happen only when the relay message of the neighbor with the lowest key (and whose previous hello message was received) is never received. This means that only this message is crucial, while others can only bring new information that could help nodes make the right activity decision.

Active nodes. We also observed to what extent the average percentage of active nodes was impacted by our new physical layer assumptions. Generally, when some coverage losses occur, they are caused either by a decrease in the number of active nodes or by a spatial incoherence (as many active nodes, but concentrated in a given region at the expense of another one which is left uncovered). The latter case can hardly take place thanks to the homogeneity of both our deployment and the random processes used during key computation. Indeed, Fig. 3(b) shows that the percentage of active nodes is largely decreased (from less than 10% at density 30 to only 2% at density 90) when the distance-based heuristic is used. Although this tremendous drop in the percentage of active nodes could lead us to think that large coverage holes have occurred, we have already observed that the impact on area coverage was minor. The SCR-CADS protocol is inherently robust and does not suffer much from the loss of active nodes, since the remaining ones have self-elected through a coherent local process.

5.3 Tending to More Robustness and Customization of SCR Protocol

Although the impact of a realistic physical layer on area coverage is limited, we observed that the number of active nodes was largely decreased. As already mentioned, the percentage of active nodes depends strongly on the average size of the relay sets. We now try to tune the relay selection made by each node in order to tune the global percentage of active nodes, and also the area coverage. We therefore introduced a coverage parameter, denoted as SCR_COV, which stands for the number of relays that should cover any portion of the sensing area of a neighbor currently evaluated for potential selection. Every neighbor is now added to the set of relays if and only if its sensing area is not k-covered by already selected relays, with k being SCR_COV. Several definitions of k-coverage exist in the literature. Our approach is to consider a piece of area as k-covered as soon as k distinct sensor devices are able to sense it. This should help us raise the proportion of active nodes, as relay sets can be made larger. In this way, resistance to some communication anomalies should be obtained thanks to the redundancy induced in the set of monitoring nodes. Note that this solution solely requires a modified coverage evaluation scheme.
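The modified selection test can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: the sensing area of a neighbor is represented by a list of sample points returned by a caller-supplied function `sample` (hypothetical), only already selected relays are counted, as in the sentence above, and the names are ours rather than the authors'.

```python
import math


def coverage_count(point, sensors, rs):
    """Number of distinct sensors able to sense `point` (disks of radius rs)."""
    return sum(1 for s in sensors if math.dist(point, s) <= rs)


def is_k_covered(points, sensors, rs, k):
    """True if every sample point is sensed by at least k distinct sensors."""
    return all(coverage_count(p, sensors, rs) >= k for p in points)


def select_relays_k(u, neighbors, rs, scr_cov, sample):
    """Distance-ordered selection; a neighbor is kept unless already k-covered."""
    ordered = sorted(neighbors, key=lambda v: -math.dist(u, v))
    relays = []
    for v in ordered:
        if not is_k_covered(sample(v), relays, rs, scr_cov):
            relays.append(v)
    return relays
```

With a deliberately crude `sample` that returns only the neighbor's own position, three collinear neighbors at distances 9, 8, and 7 yield one relay for SCR_COV = 1, two for SCR_COV = 2, and three for SCR_COV = 3, which illustrates how a larger value enlarges the relay sets.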
(a) Increasing robustness to radio channel randomness (area coverage in % versus network density).
(b) Configurable proportions of active nodes with enhanced relay selection (active nodes in % versus network density).
Fig. 4. Configurability of SCR-CADS protocol (LNS, see formula 4, with α = 2), for SCR_COV = 2, 3, and 4
Area coverage. The prime goal of our SCR_COV parameter is to better resist message loss. Figure 4(a) shows area coverage versus network density, with SCR_COV ∈ {2, 3, 4}, when the distance-based heuristic is used. As long as SCR_COV > 1, area coverage is at least 95%. This shows that, even in the presence of a realistic radio channel, our algorithm can provide satisfying results in terms of area coverage. Our new parameter may also help us customize the relay selection more finely. Until now, we were only using our two selection heuristics. The problem with such a method is that dynamically modifying the way a relay selection is conducted by a node would require changing the code it actually runs. Given the hardware constraints of the target devices, it would be more convenient to interact with the algorithm by means of a simple parameter.

Active nodes. We observed to what extent the SCR_COV value was able to impact the proportion of active nodes. The average percentage of active nodes is shown in Fig. 4(b) for several SCR_COV values. When SCR_COV = 2 (which is sufficient to ensure more than 95% of covered area), between 4% and 17% of the nodes are active. This shows that our SCR_COV parameter can be used not only to improve area coverage but also to finely tune the proportion of active nodes.
6 Conclusion
We have enhanced our SCR-CADS protocol to obtain a more efficient relay selection that fits today’s requirements for wireless sensor network algorithms. We have shown that our algorithm, originally designed under ideal communication assumptions, works properly with a realistic physical layer and maintains a high coverage level. We then introduced a new parameter in the relay selection, which enables full coverage when the application strictly requires it. This parameter also allows the proportion of active nodes to be finely tuned, as we move toward a more configurable protocol. Future work will focus on characterizing connectivity under more realistic communication assumptions. We also aim at observing the impact of both synchronization and
A. Gallais and J. Carle
localization errors, which could be induced either by the environment or by the algorithm itself.
Acknowledgments This work was partially supported by a grant from CPER Nord-Pas-de-Calais/FEDER TAC COM’DOM, CNRS National platform RECAP.
References
1. Carle, J., Gallais, A., Simplot-Ryl, D.: Preserving area coverage in wireless sensor networks by using surface coverage relay dominating sets. In: Proceedings of IEEE Symposium on Computers and Communications (ISCC), Cartagena, Spain, pp. 347–352 (2005)
2. Römer, K.: Time Synchronization and Localization in Sensor Networks. PhD thesis, ETH Zurich, Switzerland (2005)
3. Zhang, H., Hou, J.C.: Maintaining sensing coverage and connectivity in large sensor networks. Ad Hoc and Sensor Wireless Networks journal (AHSWN) 1, 89–123 (2005)
4. Xing, G., Wang, X., Zhang, Y., Lu, C., Pless, R., Gill, C.: Integrated coverage and connectivity configuration for energy conservation in sensor networks. ACM Transactions on Sensor Networks (TOSN) 1(1), 36–72 (2005)
5. Cerpa, A., Wong, J.L., Kuang, L., Potkonjak, M., Estrin, D.: Statistical model of lossy links in wireless sensor networks. In: Proceedings of International Symposium on Information Processing in Sensor Networks (IPSN) (2005)
6. Kuruvila, J., Nayak, A., Stojmenović, I.: Hop count optimal position based packet routing algorithms for ad hoc wireless networks with a realistic physical layer. In: Proceedings of IEEE Mobile Ad hoc and Sensor Systems (MASS), Fort Lauderdale, FL, USA (2004)
7. Shakkottai, S., Srikant, R., Shroff, N.: Unreliable sensor grids: Coverage, connectivity and diameter. In: Proceedings of IEEE International Conference on Computer Communications (INFOCOM), San Francisco, CA, USA (2003)
8. Liu, C., Xiao, Y.: Random coverage with guaranteed connectivity: Joint scheduling for wireless sensor networks. IEEE Transactions on Parallel and Distributed Systems (TPDS) 17(6) (2006)
9. Tian, D., Georganas, N.D.: Connectivity maintenance and coverage preservation in wireless sensor networks. AdHoc Networks Journal (Elsevier Science), 744–761 (2005)
10. Yener, B., Magdon-Ismail, M., Sivrikaya, F.: Joint problem of power optimal connectivity and coverage in wireless sensor networks. Wireless Networks 13(4), 537–550 (2007)
11. Adjih, C., Jacquet, P., Viennot, L.: Computing connected dominated sets with multipoint relays. Ad Hoc and Sensor Wireless Networks journal (AHSWN) 1(3), 27–39 (2005)
12. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum Press, New York (1972)
13. Gallais, A., Ingelrest, F., Carle, J., Simplot-Ryl, D.: Preserving area coverage in sensor networks with a realistic physical layer. In: Proceedings of IEEE International Conference on Computer Communications (INFOCOM), Anchorage, AK, USA (short paper, 2007)
On Extending Coverage of UMTS Networks Using an Ad-Hoc Network with Weighted Fair Queueing
R. El-Azouzi¹, R. El-Khoury¹, A. Kobbane¹,², and E. Sabir¹,²
¹ LIA/CERI, Université d’Avignon, Agroparc BP 1228, Avignon, France
² LIMIARF, Université Mohammed V Rabat-Agdal, 4 Av. Ibn Battouta B.P. 1014, Maroc
{rachid.elazouzi,ralph.elkhoury,abdellatif.kobbane,essaid.sabir}@univ-avignon.fr
Abstract. In this paper, we are interested in the interconnection between an ad-hoc network and a UMTS system. In the presence of two access technologies, new challenges arise and mobile hosts will have to face multiple base stations or gateways with different utilizations. As a result, a user who needs to establish a voice call prefers to be connected through UMTS, while a user who needs to transmit best-effort traffic at a high rate prefers to be connected through the ad-hoc network. The aim of this combination is to improve the coverage of cellular networks and to increase the services that can be made available to mobiles. This combination also reduces the transmission power of mobile hosts. Our main result is the characterization of the stability condition and of the end-to-end throughput using the rate balance. Numerical results are given and support the results of the analysis. Finally, we perform extensive simulations and verify that the analytical results closely match the results obtained from simulations.
Keywords: Interconnection, cross-layer, routing, ad-hoc, UMTS, stability.
1
Introduction
To meet the growing demand for data and user mobility, considerable effort is being directed toward the convergence of multi-service networks of diverse technologies. Next generation wireless networks (NGWN) will be heterogeneous: radio access technologies such as UMTS, WiMAX, WLAN, etc. will coexist. The NGWN will therefore require joint radio resource management among multiple operators of different radio access technologies [3]. In this paper, we are interested in using ad-hoc networks to improve the coverage of UMTS cellular networks and to make different services available to nodes. Indeed, there exist several types of traffic that require different QoS. In the presence of several access technologies, new challenges arise and mobile hosts will
This work was partially supported by ANR WINEM No. 06TCOM05 and Maroc Telecom R&D No. 10510005458.06 PI.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 135–148, 2008. c IFIP International Federation for Information Processing 2008
face multiple base stations or gateways with different utilizations. This opens the possibility of using the most appropriate access type for each type of call. For example, a network with low bandwidth and highly reliable access, such as UMTS, is better suited to voice calls. As a result, a user prefers to be connected through the UMTS network, which can offer higher reliability for particular services, even at a high cost. A multi-hop wireless ad hoc network is a collection of nodes that communicate with each other without any established infrastructure or centralized control. Many factors, such as the routing protocol and the channel access, interact to make communication possible. Recently, wireless ad-hoc networks have been envisioned for commercial applications such as providing Internet connectivity to nodes that are not in the transmission range of a wireless access point. Cellular networks and ad-hoc networks should be considered as complementary systems. Hence, it is possible to use ad-hoc networks as an extension of cellular networks such as UMTS, and a multihomed node can use different access technologies at different times to ensure permanent connectivity. Potential benefits for the UMTS cellular network include extending the cellular coverage and reducing the transmission power, which in turn reduces intra-cell and extra-cell interference. In the ad-hoc network, new services such as VoIP and streaming video can be made available. In this work we are interested in the interconnection between a MANET and a 3G system. In order to connect to more than one base station (or network), a terminal should have at least two network cards. In the ad-hoc network, we consider a random access mechanism for the wireless channel, in which nodes having packets in their transmit buffers attempt transmissions after delaying them by a random amount of time.
This mechanism helps avoid collisions between transmissions of nearby nodes when nodes cannot sense the channel while transmitting (and hence are not aware of other ongoing transmissions). We assume that time is slotted into fixed-length time frames. In any slot, a node having a packet to transmit to one of its neighboring devices decides with some fixed (possibly node-dependent) probability in favor of a transmission attempt. If no other device whose transmission can interfere with the node under consideration transmits, the transmission is successful. Examples of this mechanism are Aloha-type protocols [14] and the IEEE 802.11 CSMA/CA-based mechanism. With these mechanisms, each node determines its own transmission times [4,25]. In the UMTS network, the nodes communicate using Direct Sequence Code Division Multiple Access (DS-CDMA). A node transmits its packets with fixed rate and fixed power [26]. We consider an automatic repeat request (ARQ) mechanism in which the node keeps retransmitting a packet until it is received at the base station without error. At any instant in time, a device may have two kinds of packets to transmit:
1. Packets generated by the device itself. This can be sensed data if we are considering a sensor network.
2. Packets from other neighboring devices that need to be forwarded.
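The slotted random-access rule for the ad-hoc subsystem described above can be sketched as follows. This is only an illustration under stated assumptions: interference is abstracted as a hypothetical map from each node to the set of nodes whose transmissions would collide with it, and all names are invented for the example.

```python
import random

def draw_attempts(attempt_prob, has_packet, rng):
    """Each node holding a packet attempts independently with its own probability."""
    return {
        i for i, p in attempt_prob.items()
        if has_packet[i] and rng.random() < p
    }

def successful(attempting, interferers):
    """A node succeeds iff it attempts and none of its interferers attempts in the same slot."""
    return {i for i in attempting if not (interferers[i] & attempting)}

# One slot of the protocol for three nodes in a line (1 - 2 - 3).
rng = random.Random(0)  # seeded so the sketch is reproducible
probs = {1: 0.3, 2: 0.3, 3: 0.3}
backlog = {1: True, 2: True, 3: False}
interf = {1: {2}, 2: {1, 3}, 3: {2}}
attempting = draw_attempts(probs, backlog, rng)
winners = successful(attempting, interf)
```

Keeping the random draw separate from the collision rule makes the collision rule itself deterministic and easy to check.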
We consider two separate queues for these two types of packets and apply weighted fair queueing (WFQ) to them. This configuration gives each node the flexibility to manage forwarded packets and its own packets differently. Two queues make it possible to model selfish behavior or, on the contrary, to give higher priority to connections that traverse many hops and could otherwise suffer from large loss rates. The main contribution of this paper is to provide approximate expressions for stability. Our main result concerns the stability of the forwarding queues at the devices. It states that whether or not the forwarding queues can be stabilized (by an appropriate choice of WFQ weights) depends only on the routing and the channel access rates of the devices. The end-to-end throughput achieved by the nodes is independent of the choice of the WFQ weights. The stability setting that we study is new, as it takes into account a limit on the number of transmissions of a packet at each node, after which the packet is dropped.
Related works. Multihoming has traditionally been used by stub networks to improve network availability. Previous studies [5] have shown that "Multihoming Route Control" can significantly improve the Internet communication performance of networks. A comparison of overlay source routing and multihoming is presented in [6]. The goal of that work is to determine how much benefit overlay routing provides over BGP when multihoming and route control are considered. In wireless networks, the usefulness of attaching to the Internet using different access technologies at different times, or even simultaneously, has been described in various works [7, 8]. A work devoted to IEEE 802.11 WLANs is [9], where users can choose between several access points and even split their traffic among several of them.
Network stability has been studied extensively for networks with centralized scheduling [15, 17], for the Aloha protocol [10, 21] and for UMTS systems [19,20]. Among the most studied stability problems are those of scheduling [15,16] and of the Aloha protocol [27, 24]. Tassiulas and Ephremides [15] obtain a scheduling policy for the nodes that maximizes the stability region. Their approach inherently avoids collisions, which makes it possible to maximize the throughput. Radunovic and Le Boudec [11] suggest that the total throughput may not be a good performance objective. Moreover, most of the related studies do not consider the problem of forwarding, and each flow is treated similarly (except for Radunovic and Le Boudec [11] and Tassiulas and Sarkar [23]). The rest of the paper is organized as follows. In Section 2, we formulate the problem. In Section 3, we study the stability of any given node in the system using the rate balance equation. We analyze the uplink case in Section 4, then validate our model and provide numerical examples and concluding remarks in Section 5.
2
Model Formulation
Background. Consider a geographical area which is not totally covered by any 3G system. The system is composed of three geographical classes (See fig.1); Class C1 contains terminals which communicate with each other using an ad-hoc network and some of them could be covered by the UMTS. The class C3 contains
all terminals which are in the direct range of the 3G Node-B. We denote by C2 the set of mobiles covered by both the 3G system and the ad-hoc network, i.e. C2 = C1 ∩ C3. Let N1 = card(C1\C2) and N2 = card(C2). The 3G base station is denoted by B. The idea behind this work is to find a solution that allows non-covered mobiles (C1\C2) to use the 3G services although they are outside the covered area, and allows 3G devices to connect to the Internet through the ad-hoc network. We therefore consider a scheme with a special kind of terminal equipped with two network cards (IEEE 802.11 and DS-CDMA), which can hence be connected to either the ad-hoc or the 3G system according to the coverage criterion. Moreover, such a mobile is assumed to be able to forward packets arriving from ad-hoc mobiles, transmitting them to the 3G Node-B or to another ad-hoc node. In other words, these mobiles can play the role of gateways that give access to the Node-B and therefore to its services.
Fig. 1. A single Cell 3G system combined with ad-hoc network. Access to UMTS and Internet without coverage constraint using an IEEE 802.11/UMTS gateway solution. (Diagram labels: Internet, Ad-hoc subsystem, UMTS subsystem.)
The farther a 3G node is from the base station, the higher its transmit power, which could deteriorate the QoS of the other nodes. To avoid this, far nodes will prefer to use their ad-hoc neighbors to transmit and thereby save energy. Moreover, for a 3G user who needs to connect to the Internet, access via the ad-hoc network is attractive because of its higher rate and lower cost compared with Internet access via UMTS.
Remark. (Case of a node j ∈ C2.) Node j has some neighbors transmitting in the IEEE 802.11 band and others in the UMTS band. When j transmits over the ad-hoc subsystem, it only collides with its neighbors using ad-hoc, and when it transmits using DS-CDMA it is influenced only by users communicating with the Node-B. In other words, the two subsystems are separated and do not interfere with each other.
Network layer. In order to ensure the forwarding of a packet to a given destination (ad-hoc node or the Node-B), we assume that each terminal i possesses two queues Qi and Fi. Fi is selected with probability fi and receives the packets to be forwarded to another destination, whereas Qi carries the own packets of node i and is selected for transmission with probability 1 − fi. In this work we consider the saturated case where node i always has
packets to transmit from $Q_i$. These queues are assumed to have infinite capacity. The network layer handles these queues using a weighted fair queueing (WFQ) protocol. For notation, we denote by $R_{s,d}$ the set of nodes between a source $s$ and a destination $d$ ($s$ and $d$ not included).
MAC layer. The MAC layers of the two subsystems are different. For the ad-hoc subsystem we consider a collision channel where no more than one node can transmit successfully in a slot; denote by $\tau^a$ the duration of one slot. A transmission is successful when there is no collision at reception; in other words, the neighbors of an intermediate node should not transmit while it receives a packet. Each mobile transmitting a packet from either $Q_i$ or $F_i$ accesses the channel with probability $P_i$. We can consider the Aloha protocol, CSMA/CA or any other mechanism to access the channel. For instance, the attempt rate (for node $i$) in IEEE 802.11 DCF (see [25]) is
\[ P_i = \frac{2(1 - 2P_c)}{(1 - 2P_c)(CW_{min} + 1) + P_c\, CW_{min}\,\bigl(1 - (2P_c)^m\bigr)} \qquad (1) \]
where $P_c$ denotes the conditional collision probability given that a transmission attempt is made, $CW$ is the contention window, and $m = \log_2(CW_{max}/CW_{min})$ is the maximum backoff stage. We assume that a packet is retransmitted (if needed) until success or dropping. Let $K_{i,s,d}$ be the maximum number of transmissions allowed by a mobile $i$ per packet. Expression (1), however, is valid only for IEEE 802.11 systems. In our case we have a hybrid network, so it no longer holds and the attempt rate must be recomputed. Let $x_i$ be the total proportion of UMTS cycles for a given node $i$, and let $\tau_i^u$ be the average number of slots needed to send a UMTS packet.
Proposition 1. Consider a heterogeneous network composed of a UMTS and an ad-hoc subsystem. Then we have the following.
1) The proportion of UMTS traffic is
\[ x_i = P_{i,B}(1 - \pi_i f_i) + f_i \sum_{s} \pi_{i,s,B} \qquad (2) \]
2) The attempt rate for any given node $i$ is
\[ \bar{P}_i = \frac{\bar{L}_i (1 - x_i) P_i}{\bar{L}_i (1 - x_i) + \tau_i^u x_i P_i} \qquad (3) \]
Proof. Here we develop a cycle-based method to prove Proposition 1. Let us observe the system (in particular a node $i$) for $C_{i,t}$ cycles. $C^a_{i,t}$ is the number of ad-hoc cycles until the $t$-th slot, and $C^u_{i,t}$ is the corresponding number of cycles for the UMTS traffic. We denote by $T^a_t$ the total number of transmission slots used by ad-hoc connections until the $t$-th slot. The proportion of UMTS cycles is
\[ x_i = \frac{C^u_{i,t}}{C_{i,t}} = \frac{C^{u,Q}_{i,t} + C^{u,F}_{i,t}}{C_{i,t}}, \]
where $C^{u,Q}_{i,t}/C_{i,t}$ is the probability of choosing a UMTS packet from $Q_i$. We have
\[ \frac{C^{u,Q}_{i,t}}{C_{i,t}} = \frac{C^{u,Q}_{i,t}}{C^{Q}_{i,t}} \cdot \frac{C^{Q}_{i,t}}{C_{i,t}} = P_{i,B}(1 - \pi_i f_i), \]
\[ \frac{C^{u,F}_{i,t}}{C_{i,t}} = \frac{C^{F}_{i,t}}{T^a_{i,t}} \cdot \frac{T^a_{i,t}}{C_{i,t}} \cdot \frac{C^{u,F}_{i,t}}{C^{F}_{i,t}} = f_i \pi_i \, \frac{\sum_s \pi_{i,s,B}}{\pi_i} = f_i \sum_s \pi_{i,s,B}, \]
where $T^a_{i,t}$ denotes the number of cycles having ad-hoc packets until slot $t$. The proportion of UMTS cycles becomes $x_i = P_{i,B}(1 - \pi_i f_i) + f_i \sum_s \pi_{i,s,B}$, which completes the proof of part 1.
The long-term attempt rate is
\[ \bar{P}_i = \lim_{t\to\infty} \frac{T^a_t}{t} = \lim_{t\to\infty} \frac{T^a_t}{C^a_{i,t}} \cdot \frac{C^a_{i,t}}{C_{i,t}} \cdot \frac{C_{i,t}}{t}. \]
Here $\lim_{t\to\infty} T^a_t / C^a_{i,t}$ is the average number of transmission slots per ad-hoc cycle, i.e. $\bar{L}_i$, and $\lim_{t\to\infty} C^a_{i,t}/C_{i,t}$ is exactly the proportion of ad-hoc cycles among $C_{i,t}$, i.e. $1 - x_i$. Finally, $t/C_{i,t}$ is the average number of slots per cycle (ad-hoc or UMTS), so that
\[ \lim_{t\to\infty} \frac{C_{i,t}}{t} = \lim_{t\to\infty} \frac{C_{i,t}}{C^a_{i,t} \cdot \frac{\bar{L}_i}{P_i} + C^u_{i,t} \cdot \tau_i^u} = \frac{P_i}{\bar{L}_i (1 - x_i) + \tau_i^u x_i P_i}. \]
The result follows by substituting each term by its established expression.
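Expressions (1) to (3) can be evaluated numerically with a short sketch such as the one below. It is a direct transcription of the formulas under the notation reconstructed here; the function and argument names are hypothetical, and the inputs (collision probability, queue loads, average number of attempts) are assumed to be known, whereas in the model they result from solving the rate balance system.

```python
import math

def dcf_attempt_rate(p_c, cw_min, cw_max):
    """Expression (1): attempt rate of IEEE 802.11 DCF (undefined at p_c = 0.5)."""
    m = math.log2(cw_max / cw_min)  # maximum backoff stage
    num = 2.0 * (1.0 - 2.0 * p_c)
    den = (1.0 - 2.0 * p_c) * (cw_min + 1) + p_c * cw_min * (1.0 - (2.0 * p_c) ** m)
    return num / den

def umts_proportion(p_iB, pi_i, f_i, pi_isB):
    """Expression (2): proportion of UMTS cycles; pi_isB lists pi_{i,s,B} over sources s."""
    return p_iB * (1.0 - pi_i * f_i) + f_i * sum(pi_isB)

def hybrid_attempt_rate(p_i, l_bar, x_i, tau_u):
    """Expression (3): long-term attempt rate of node i in the hybrid network."""
    return l_bar * (1.0 - x_i) * p_i / (l_bar * (1.0 - x_i) + tau_u * x_i * p_i)
```

Two sanity checks follow from (3): with no UMTS traffic (`x_i = 0`) the hybrid attempt rate reduces to `p_i`, and with only UMTS traffic (`x_i = 1`) it drops to zero.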
For the UMTS subsystem, we assume that the base station communicates with 3G nodes using Direct Sequence Code Division Multiple Access (DS-CDMA). Each user $i$ transmits to the base station with a transmission rate $\rho_{i,B}$ (uplink transmission rate), which is the ratio of the bandwidth to the processing gain. It depends on the mobile class, the requested service and the global load. The base station uses a downlink transmission rate $\rho_{B,i}$ when sending to $i$. Another important parameter is the Packet Success Rate (PSR), denoted $\xi(\gamma_i)$, which is a function of the instantaneous signal to interference plus noise ratio (SINR). $\xi(\cdot)$ can be any increasing, continuous, differentiable and S-shaped (sigmoidal) function with $\xi(0) = 0$ and $\xi(+\infty) = 1$. Some properties and examples of S-shaped functions are studied in more detail in [26] and references therein. Define $\mu_{i,B}$ as the uplink service rate, i.e. the rate at which UMTS packets are served at node $i$. We denote by $\mu_{B,i}$ the arrival rate to node $i$ from the Node-B. Let $\tau_{i,B} = 1/\mu_{i,B}$ be the average service time in number of ad-hoc slots. It can be written as
\[ \tau_{i,B} = \frac{M_a}{\rho_{i,B}\,\xi(\gamma_i)\,\tau^a}, \]
where $M_a$ is the length of one ad-hoc packet (in bits) and $M_a / (\rho_{i,B}\,\xi(\gamma_i))$ is the average service time, in seconds, of a packet when an unlimited number of retransmissions of a failed packet is allowed, see [26].
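As a small numerical illustration of the service time above, the following sketch computes the service time in ad-hoc slots and the corresponding service rate. The PSR function used here, (1 − e^(−γ))^M, is only one example of a sigmoidal function with ξ(0) = 0 and ξ(+∞) = 1; it is an assumption made for the example, not a choice stated in the paper, and all names are hypothetical.

```python
import math

def psr(gamma, m_bits=80):
    """Illustrative sigmoidal packet success rate (assumed form, not from the paper)."""
    return (1.0 - math.exp(-gamma)) ** m_bits

def service_time_slots(m_a, rho, gamma, tau_a, xi=psr):
    """tau_{i,B} = M_a / (rho_{i,B} * xi(gamma_i) * tau^a), in ad-hoc slots."""
    return m_a / (rho * xi(gamma) * tau_a)

def service_rate(m_a, rho, gamma, tau_a, xi=psr):
    """mu_{i,B} = 1 / tau_{i,B}, in packets per ad-hoc slot."""
    return 1.0 / service_time_slots(m_a, rho, gamma, tau_a, xi)
```

For instance, with an error-free channel (ξ = 1), an 8000-bit packet, a 2 Mbit/s rate and a 4 ms slot, the service time is exactly one slot; halving the success rate doubles it.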
Cross-layer architecture. Nodes with two network cards are able to communicate with other nodes or with the 3G Node-B. The system performance depends on more than one layer, which motivates us to use a cross-layer architecture (fig. 2). This view of the network is more powerful and flexible thanks to communication and information exchange between different layers.
Fig. 2. Cross-layer architecture including the two subsystems
Remark. It is interesting to note that a mobile is able to transmit on one interface and receive on the other, and vice versa. Moreover, it can simultaneously receive one ad-hoc and one UMTS transmission. However, we do not allow simultaneous transmissions over the two network cards. This is due to the nature of the scheduling mechanism that we consider (selection of one queue at a time). If parallel selections were allowed, two simultaneous transmissions would become possible.
3
Stability of Forwarding Queues
In this section we study the stability properties of the nodes in the system, in particular of the nodes $i \in C_2$, which we call gateways. Classically, the forwarding queue $F_i$ is stable if the departure rate of packets from $F_i$ is at least equal to the arrival rate into it. This is expressed by a rate balance system. For any given nodes $i, s, d \in C_1 \cup C_3$, $i$ denotes an intermediate node, $s$ is the source and $d$ is the destination. Let $j_{i,s,d}$ denote the entry in the set $R_{s,d}$ just after $i$ and $N(i)$ be the set of $i$'s neighbors. The probability that a transmission from node $i$ on the route from node $s$ to node $d$ is successful is
\[ P_{i,s,d} = \prod_{j \in j_{i,s,d} \cup N(j_{i,s,d}) \setminus i} (1 - \bar{P}_j) \qquad (4) \]
The expected number of attempts until success or dropping from $i$ on route $R_{s,d}$ is
\[ L_{i,s,d} = \frac{1 - (1 - P_{i,s,d})^{K_{i,s,d}}}{P_{i,s,d}} \]
Let $\pi_{i,s,d}$ be the probability that the queue $F_i$ has a packet at the first position ready to be forwarded on the path $R_{s,d}$. It follows that the probability that the queue $F_i$ has at least one packet to be forwarded is $\pi_i = \sum_{s,d} \pi_{i,s,d}$. If $P_{i,d}$ denotes the probability that $i$ transmits to $d$, then the average of $L_{i,s,d}$ over all possible paths is
\[ \bar{L}_i = \sum_{s,d:\, i \in R_{s,d}} \pi_{i,s,d} f_i L_{i,s,d} + \sum_{d} (1 - \pi_i f_i) P_{i,d} L_{i,i,d} \qquad (5) \]
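The success probability (4) and the expected number of attempts per packet can be computed as in the sketch below. It assumes the attempt rates of all nodes are given in a dictionary and that the neighbor set of the next hop is known; the names are hypothetical.

```python
def success_probability(attempt_rates, sender, next_hop, next_hop_neighbors):
    """Expression (4): the next hop and all of its neighbors except the sender stay silent."""
    silent = ({next_hop} | set(next_hop_neighbors)) - {sender}
    p = 1.0
    for j in silent:
        p *= 1.0 - attempt_rates[j]
    return p

def expected_attempts(p_success, k_max):
    """Expected number of attempts until success or dropping after k_max transmissions."""
    return (1.0 - (1.0 - p_success) ** k_max) / p_success

# Node 1 sends to node 2, whose neighbors are nodes 1 and 3.
rates = {1: 0.3, 2: 0.2, 3: 0.5}
p_12 = success_probability(rates, sender=1, next_hop=2, next_hop_neighbors={1, 3})
l_12 = expected_attempts(p_12, k_max=4)
```

The expected number of attempts is the truncated geometric sum 1 + (1 − p) + ... + (1 − p)^(K−1), so it equals 1 when K = 1 and approaches 1/p as K grows.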
In the following we classify connections into three categories.
– ad-hoc to ad-hoc connections: This corresponds to the case studied in [1], in which an ad-hoc node transmits to another ad-hoc node, i.e. $s, d \in C_1$.
– ad-hoc to UMTS connections: This is the uplink case, in which nodes transmit to the base station, i.e. $s \in C_1$ and $d = B$. Note that a UMTS packet is transmitted until success from the gateway to the base station.
– UMTS to ad-hoc connections: This is the downlink case, in which the base station transmits to a given node, so $s = B$ and $d \in C_1 \cup C_3$.
For transmissions to the base station, note that an intermediate node $i$ forwards packets to a next hop at rate $\bar{P}_i / \bar{L}_i$, whereas it forwards them to the base station (if it is in the direct range of $B$) at rate $\mu_{i,B}$. For more generality, we define $\mu_{i,B}$ as the service rate for the uplink scheme as follows:
\[ \mu_{i,B} = \begin{cases} \mu_{i,B}, & \text{if } i \in C_3 \\ \bar{P}_i / \bar{L}_i, & \text{if } i \in C_1 \setminus C_2 \end{cases} \qquad (6) \]
The departure rate. In order to compute the departure rate of a given node, we distinguish between ad-hoc and UMTS connections. The geographical criterion is also taken into account through $\bar{P}_i$ and $\mu_{i,B}$. The long-term departure rate from node $i$ for the connection $R_{s,d}$ is
\[ d_{i,s,d} = \begin{cases} \pi_{i,s,d}\, f_i\, \dfrac{\bar{P}_i}{\bar{L}_i}, & s, d \in C_1 \\ \pi_{i,s,B}\, f_i\, \mu_{i,B}, & s \in C_1,\ d = B \\ \pi_{i,B,d}\, f_i\, \dfrac{\bar{P}_i}{\bar{L}_i}, & d \in C_1,\ s = B \end{cases} \qquad (7) \]
The arrival rate. Here, we compute the arrival rate for each node $i$. Let $i, s, d \in C_1 \cup C_3$. Node $s$ sends a packet destined to $d$ with probability $P_{s,d}$. When the UMTS Node-B transmits to a node $d$, we denote by $g$ (for gateway) the first hop on the route $R_{B,d}$. The long-term arrival rate at $i$ is expressed, according to the possible connections, as follows:
\[ a_{i,s,d} = \begin{cases} (1 - \pi_s f_s)\, P_{s,d}\, \dfrac{\bar{P}_s}{\bar{L}_s} \displaystyle\prod_{k \in R_{s,i} \cup s} \bigl[1 - (1 - P_{k,s,d})^{K_{k,s,d}}\bigr], & s, d \in C_1 \\ (1 - \pi_s f_s)\, P_{s,B}\, \dfrac{\bar{P}_s}{\bar{L}_s} \displaystyle\prod_{k \in R_{s,i} \cup s} \bigl[1 - (1 - P_{k,s,B})^{K_{k,s,B}}\bigr], & s \in C_1,\ d = B \\ \mu_{B,g} \displaystyle\prod_{k \in R_{B,d}} \bigl[1 - (1 - P_{k,s,d})^{K_{k,s,d}}\bigr], & d \in C_1,\ s = B \end{cases} \qquad (8) \]
End-to-end throughput of a connection. The end-to-end throughput between a pair of nodes $s$ and $d$ is exactly the arrival rate at the destination $d$, namely $thp_{s,d} = a_{d,s,d}$. Recall that gateways keep transmitting a packet over the UMTS subsystem until success (i.e. until it is correctly received by the base station).
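The ad-hoc cases of the arrival rate (8) share the same structure: the rate at which the source injects its own packets, multiplied by the probability that every hop up to node i delivers the packet within its allowed number of transmissions. A minimal sketch, with hypothetical names and with all quantities assumed to be given:

```python
def delivery_probability(p_success, k_max):
    """Probability that one hop delivers a packet within k_max transmissions."""
    return 1.0 - (1.0 - p_success) ** k_max

def arrival_rate(pi_s, f_s, p_sd, p_bar_s, l_bar_s, hops):
    """First two cases of (8); hops lists (P_{k,s,d}, K_{k,s,d}) for k in R_{s,i} plus s."""
    rate = (1.0 - pi_s * f_s) * p_sd * p_bar_s / l_bar_s
    for p_success, k_max in hops:
        rate *= delivery_probability(p_success, k_max)
    return rate

# Evaluated at the destination, the arrival rate is the end-to-end throughput thp_{s,d}.
thp = arrival_rate(pi_s=0.0, f_s=0.8, p_sd=1.0, p_bar_s=0.3, l_bar_s=1.0,
                   hops=[(0.5, 1), (0.5, 2)])
```

In this example the source has no forwarded traffic (π_s = 0), so the throughput is the source attempt rate scaled by the per-hop delivery probabilities.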
3.1
The Rate Balance System (RBS)
A queue $F_i$ is stable as long as its departure rate is greater than or equal to the arrival rate into it. In this subsection we consider the extreme case where equality holds. We will derive the rate balance equation for every node in the system. When the steady state is reached, if all queues are stable, then for each $i$, $s$ and $d$ such that $i \in R_{s,d}$ we get $d_{i,s,d} = a_{i,s,d}$; this is the rate balance equation on the path $R_{s,d}$. For all $i$, $s$ and $d$, we then obtain a system of equations whose solution gives the $\pi_{i,s,d}$, and hence $\pi_i$ for all $i$, which determines the load of the forwarding queues and the main condition for their stability, namely $\pi_i < 1$. Each node can forward packets for the three types of connections defined previously, so the rate balance equations can be made explicit using three generic equations:
\[ \begin{cases} \pi_{i,s,d}\, f_i = (1 - \pi_s f_s)\, \dfrac{\bar{P}_s \bar{L}_i}{\bar{P}_i \bar{L}_s}\, P_{s,d} \displaystyle\prod_{k \in R_{s,i} \cup s} \bigl[1 - (1 - P_{k,s,d})^{K_{k,s,d}}\bigr], & s, d \in C_1 \\ \pi_{i,s,B}\, f_i = \dfrac{(1 - \pi_s f_s)\, \bar{P}_s}{\mu_{i,B}\, \bar{L}_s}\, P_{s,d} \displaystyle\prod_{k \in R_{s,i} \cup s} \bigl[1 - (1 - P_{k,s,d})^{K_{k,s,d}}\bigr], & s \in C_1,\ d = B \\ \pi_{i,B,d}\, f_i = \dfrac{\bar{L}_i}{\bar{P}_i}\, \mu_{B,g} \displaystyle\prod_{k \in R_{B,d}} \bigl[1 - (1 - P_{k,s,d})^{K_{k,s,d}}\bigr], & d \in C_1,\ s = B \end{cases} \qquad (9) \]
We sum over all sources $s$ and destinations $d$ for all the connections written above and get the global rate balance equation, which is useful to study some special cases.
Theorem 1. In the steady state, if all the queues in the system (both ad-hoc and UMTS) are stable, then for each $i, s, d \in C_1 \cup C_3$ such that $i \in R_{s,d}$ we have
\[ \frac{\bar{P}_i}{\bar{L}_i} \sum_{s,d} \pi_{i,s,d}\, f_i + f_i\, \mu_{i,B} \sum_{s} \pi_{i,s,B} = \sum_{d:\, i \in R_{B,d}} \mu_{B,i} \prod_{k \in R_{B,i}} \bigl[1 - (1 - P_{k,B,d})^{K_{k,B,d}}\bigr] + \sum_{s,d:\, i \in R_{s,d}} (1 - \pi_s f_s)\, \frac{\bar{P}_s P_{s,d}}{\bar{L}_s} \prod_{k \in R_{s,i} \cup s} \bigl[1 - (1 - P_{k,s,d})^{K_{k,s,d}}\bigr] \qquad (10) \]
Let $z_{i,s,d} = \pi_{i,s,d} f_i$, for all $i$, $s$ and $d$, be the unknowns of the rate balance system, which is nonlinear because $\bar{P}_i$ and $\bar{L}_i$ usually depend on $z_{i,s,d}$. As in [1], we remark that the solution is independent of the forwarding probability; consequently the end-to-end throughput, $thp_{s,d} = a_{d,s,d}$, is not affected by the choice of $f_i$ either. This result holds only when the forwarding queues are stable. As a consequence, the forwarding capabilities of gateways do not affect their energy consumption.
Another important remark concerns the interaction between the ad-hoc and the UMTS networks. In practice, the bit rate per node in the ad-hoc network is larger than in UMTS, but the size of the former network and its characteristics (such as channel access and routing) drastically influence its capacity. Connecting two heterogeneous networks therefore normally requires scaling and adapting the rates on each interface; otherwise, stability of the gateway nodes becomes a real issue. From the rate balance system, we obtain $z_{i,s,d}$ as a function of the service rate of the UMTS node $\mu_{i,B}$ and of the base station rate $\mu_{B,i}$. The condition on $z_{i,s,d}$ ($z_{i,s,d} < 1$) thus translates into a condition on the UMTS parameters: for example, we can find a minimum packet rate for UMTS nodes, or a threshold on the SINR $\gamma$, that ensures stability of the gateway nodes. For ad-hoc nodes connected to the base station, the throughput can be written in a simple form, in case of stability, as $thp_{s,B} = z_{g,s,B}\, \mu_{g,B}$, where $g$ is a gateway node. For the particular uplink case, the arrival rate (second equation of (8)) is independent of $\mu_{i,B}$ since $\pi_s f_s$, $\bar{L}_i$ and $\bar{P}_i$ do not depend on it. Hence, when $\mu_{i,B}$ changes, the load $z_{i,s,B}$ changes in the opposite direction, and the throughput stays unchanged.
4
The Uplink Case Analysis
In the following we study in more detail a scheme in which the RBS is linear, namely the uplink case, which is very important. There are indeed many asymmetric services which do not require any request-response pattern (e.g. downloading files). In this case all sources have the same destination and each intermediate node always forwards to the same next hop. It follows that $\bar{L}_i$ is independent of the unknown $\pi_i$:
\[ \bar{L}_s = \sum_{s'} \pi_{s,s',d}\, f_s L_{s,s',d} + (1 - \pi_s f_s) P_{s,d} L_{s,s,d} = \pi_s f_s L_{s,s,d} + (1 - \pi_s f_s) L_{s,s,d} = L_{s,s,d} \]
The RBS reduces to its second equation with $d = B$. Summing over all $s$, we get
\[ \sum_{s} z_{i,s,B} = \sum_{s} \frac{y_s\, \bar{P}_s P_{s,B}}{\mu_{i,B}\, \bar{L}_s} \prod_{k \in R_{s,i} \cup s} \bigl[1 - (1 - P_{k,s,d})^{K_{k,s,d}}\bigr] \]
where $y_i = 1 - \pi_i f_i$; later we will refer to $y_i$ as the stability region of $F_i$. We have
\[ y_i + \sum_{s} y_s\, \omega_{i,s} = 1 \qquad (11) \]
where $\omega_{i,s} = \sum_{d} \dfrac{\bar{P}_s P_{s,d}}{\mu_{i,B}\, \bar{L}_s} \prod_{k \in R_{s,i} \cup s} \bigl[1 - (1 - P_{k,s,d})^{K_{k,s,d}}\bigr]$. Equation (11) can be written in matrix form, which makes it easier to solve:
\[ Y (I + W) = \mathbf{1} \qquad (12) \]
$W$ is the $N \times N$ matrix whose $(s, i)$-th entry is $\omega_{i,s}$, and $Y$ is the unknown $N$-dimensional row vector, i.e. $Y(i) = y_i$. In the following we restrict ourselves to a symmetric ad-hoc network. The transmit and forwarding probabilities are the same for all users, i.e. $P_i = P$ and $f_i = f$. Assume that every ad-hoc node $i$ has the same number of neighbors $n_i = n$ (as in a grid or a mesh); it follows that $P_{i,s,B} = (1 - P)^n$. For simplicity and without loss of generality we assume that $K_{i,s,d} \equiv 1$. Let $d(i, s, B)$ denote the number of intermediate nodes on the route $R_{s,B}$. Since we only consider uplink connections, we set $\mu_{B,g} = 0$.
Proposition 2. A necessary condition for the stability of $F_i$, for any given node in the system, in the uplink scheme is that
\[ f \ge \frac{\bar{\omega}_{i,s,B}}{\bar{\omega}_{i,s,B} + \mu_{i,B}} \quad \text{s.t.} \quad f \le 1 \qquad (13) \]
where $\bar{\omega}_{i,s,B} = \sum_{s} P (1 - P)^{n(d(i,s,B)+1)}$.
Proof. For any given routing, the input rate into the forwarding queue $F_i$ is
\[ a_i = \sum_{s} (1 - f \pi_s)\, P\, P_{s,s,B} \prod_{k \in R_{s,i}} \bigl[1 - (1 - P_{k,s,B})^{K_{k,s,B}}\bigr] \ge (1 - f) \sum_{s} P (1 - P)^{n(d(i,s,B)+1)} = a_i^{min} \]
The departure rate is $d_i = \mu_{i,B}\, f \pi_i \le \mu_{i,B}\, f = d_i^{max}$. Since $F_i$ is stable if and only if $d_i \ge a_i$, stability requires $d_i^{max} \ge a_i^{min}$, and the result follows.
The next corollary provides a condition related to the UMTS subsystem.
Corollary. In the uplink case, a necessary condition that ensures the stability of queue $F_i$ for any given gateway $i$ is that
\[ \mu_{i,B} \ge \frac{1 - f}{f} \cdot \bar{\omega}_{i,s,B} \qquad (14) \]
A more useful form is obtained by converting it into a condition on the instantaneous SINR:
\[ \gamma_i \ge \xi^{-1}\!\left( \frac{1 - f}{f} \cdot \frac{M_a}{\rho_{i,B}\, \tau^a} \cdot \bar{\omega}_{i,s,B} \right) \qquad (15) \]
where $\bar{\omega}_{i,s,B} = \sum_{s} P (1 - P)^{n(d(i,s,B)+1)}$. An interesting consequence of condition (13) is that the UMTS parameters do not affect the stability of non-covered nodes.
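Conditions (13) and (14) are straightforward to evaluate in the symmetric setting. The following sketch assumes the hop counts d(i, s, B) of the sources routed through node i are given as a list; the function names and the sample values are hypothetical.

```python
def omega_bar(p, n, hop_counts):
    """Sum over sources s of P * (1 - P)^(n * (d(i,s,B) + 1))."""
    return sum(p * (1.0 - p) ** (n * (d + 1)) for d in hop_counts)

def min_forwarding_probability(w, mu):
    """Right-hand side of (13): smallest f compatible with stability of F_i."""
    return w / (w + mu)

def min_service_rate(w, f):
    """Right-hand side of (14): smallest uplink service rate mu_{i,B} for a given f."""
    return (1.0 - f) / f * w

w = omega_bar(p=0.5, n=1, hop_counts=[0, 1])   # two sources behind gateway i
f_min = min_forwarding_probability(w, mu=0.125)
mu_min = min_service_rate(w, f_min)
```

The two bounds are two readings of the same inequality: plugging the minimum forwarding probability for a given service rate back into (14) returns that service rate.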
5
Numerical Results and Simulations
Consider in fig. (3) one single UMTS cell, covered by base station B, and extended with an Ad-Hoc network which is itself related to Internet via AP 9.
Fig. 3. Cellular and Ad-Hoc networks. Connection a: 4-3-2-1-B. Connection b: 6-5-1-B. Connection c: 10-1-7-8-AP.
Fig. 4. Queues load for K = 4 and f = 0.8 (π_i versus μ_{i,B}, for nodes 1, 2, 3 and 5)
Fig. 5. Throughput of connections a and b (throughput versus μ_{i,B})
Fig. 6. Queues load π_i from simulation (π_i versus μ_{i,B}, for nodes 1, 2, 3 and 5)
Fig. 7. Throughput from simulation (throughput of connections a and b versus μ_{i,B})
Nodes in both the UMTS cell and the Ad-Hoc network can access the services of the two networks. To this end, gateway nodes play an important relay role by forwarding packets for both technologies. We therefore first show numerical results on the load of these gateway nodes, and then the end-to-end throughput of the connections that traverse them. To validate our analytical results, we use a discrete-time simulator that reflects the behavior of the nodes in the model described in Section 2. In fig. (3), nodes 4 and 6 access the UMTS system through the base station B via connections a and b. Node 10, which is in the UMTS cell, also reaches the access point AP (node 9) via the Ad-Hoc network through connection c.
On Extending Coverage of UMTS Networks Using an Ad-Hoc Network
We fix P3 = P9 = 0.2, P4 = 0.4 and all other Pi to 0.3. The number of transmissions is set to Ki,s,d ≡ K = 4 and the forwarding probability to fi ≡ f = 0.8. Fig. (4) and (5) consider only the uplink case, where Ad-Hoc nodes communicate with the base station. Fig. (4) shows the load as a function of the UMTS service rate μ1,B. Increasing μ1,B clearly improves the stability of node 1: we need μ1,B > 0.19 to get π1 < 1. For an Ad-Hoc node rate of 2 Mbps and an Ad-Hoc packet size of 1000 bytes, the slot size is τ^a = 4 ms, so the UMTS rate must be at least 0.19 / 4 ms = 47.5 packets/s (Ad-Hoc packets). Concerning the throughput, fig. (5) shows that it is insensitive to the UMTS service rate. Note that the throughput figure is meaningful only for π1 < 1; elsewhere the throughput is an increasing function of μ1,B, as the simulation in fig. (7) shows. Figs. (6) and (7) validate the numerical results by simulation.

Conclusion. In this paper we propose extending the coverage of UMTS using an ad-hoc network. Our main result is the characterization of the stability condition and the end-to-end throughput; we also validate the analytical results using a discrete-time simulator. Ongoing work studies the impact of the transmit power over the 3G subsystem on the stability of the forwarding queues. We also consider a joint power and rate control game to seek the equilibrium points under some QoS constraints.
A Performance Analysis of Authentication Using Covert Timing Channels

Reed Newman and Raheem Beyah

Communications Assurance and Performance Group, Computer Science Department, Georgia State University, Atlanta, GA, USA
Abstract. Authentication over a network is an important and difficult problem. Accurately determining the authenticity of a node or user is critical in maintaining the security of a network. Our proposed technique covertly embeds a watermark, or identifying tag, within a data stream. By implementing this model on a LAN and WLAN we show that this method is easily adaptable to a variety of networking technologies, and easily scalable. While our technique increases the time required for data to be transferred, we show that the throughput of the link during the brief authentication window is decreased by no more than 8% in a switched LAN and 11% in a WLAN. During our empirical analysis we were able to detect the watermark with 100% accuracy in both a LAN and WLAN environment.
1 Introduction

With the large number of network applications in existence today, it is becoming increasingly difficult for network administrators to police the traffic on their networks. Additionally, these network applications are increasingly carrying sensitive information; as the sensitivity of the information traversing the network increases, so too must the level of security within the network. A key step in securing a network is the authentication of all nodes¹ on that network. Authentication of nodes gives perimeter devices the ability to determine whether a node's requests should be granted. Reliable node authentication helps to ensure the basic elements of network security: positively identifying a node, allowing that node specific access privileges, and holding that node accountable should it compromise the security or productivity of the network. Leading industry approaches, such as Cisco's AAA Server and Cisco's NAC model [1], rely on user name/password, S/Key, Token Cards or system profiling to authenticate users. While all of these methods provide a reasonable level of security, each is expensive and has been shown to have flaws [2,3]. Additionally, simple access control lists have proven to be ineffective, as spoofing IP and MAC addresses is a trivial task, allowing attackers to easily gain access to systems and networks.

¹ Authentication relating to nodes can easily be extended to users. When referencing nodes, unless otherwise noted, the statement applies to nodes and users.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 149–161, 2008. © IFIP International Federation for Information Processing 2008
R. Newman and R. Beyah
Without strict node authentication, joining a network is a trivial matter. An attacker may obtain a physical connection to a network, enter the network through a wireless access point, circumvent 802.1x via phishing or other known exploits, or access the network via a VPN connection. By requiring each node joining the network to authenticate itself, the overall security of the network is increased, making systems within the network harder to breach. We propose to embed a watermark, or signature specific to a node, within a data flow using inter-packet delay. The watermark is embedded between select packets in a covert manner, such that it is difficult to detect and does not significantly decrease the throughput of the link. Our model can supplement the above methods or serve as a standalone method of authentication. Our technique does not require synchronization between the nodes' clocks, and can easily be adapted for user authentication. We acknowledge that our proposed method, which advocates security by obscurity, is not an impervious solution, but assert that the additional layer of security increases the difficulty for an attacker. We believe that providing authentication in a covert manner adds the same level of benefit one achieves when changing an internal server's (sftp, ssh, etc.) port from a well-known port to a random port. The attacker would not only require a technique by which to break into the server (e.g., a buffer overflow attack), but would also have to find the server, via port scanning, over a range of 64,000 ports without being detected. Further, this work evaluates the performance of using timing channels in general to transmit data. Our specific application of this method uses this channel to communicate authentication information.
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 details the covert timing channel model, Section 4 outlines the experimental setup and procedure, Section 5 provides analysis on the model, Section 6 provides experimental data, and Section 7 contains our conclusions and future work.
2 Related Work

The idea of using inter-packet delay (IPD) for node identification is not necessarily new. There have been several different applications regarding the use of IPD in recent academic research. The majority of the work focuses on the detection of stepping stone connections (i.e., intermediate systems attackers use to launch attacks and insulate themselves from detection), with most solely considering interactive (SSH) traffic. The authors of [4] extend the work done in [5,6] to detect a correlation between stepping stones. In [4], the authors use a binning technique to partition the data stream, giving the encoded watermark a greater level of robustness against timing perturbations and repacketization within the stepping stone links. In [7], the authors extend the work of [8] in an effort to defeat chaff packets (i.e., structurally correct packets inserted within the data flow to obfuscate a pattern) that may be inserted by a node within the stepping stone chain. This is done by decoding watermarks from all possible subsequences of a downstream flow, and choosing the “best” watermark (defined as the watermark with the least Hamming distance from the original watermark).
Peng, et al. [9] investigate the secrecy of active watermarking. The authors develop an attack technique that infers important parameters of the watermarks, and also recovers and duplicates watermarks through the stepping stones. The authors of [10] attempt to detect stepping stone connection correlations of SSH traffic by creating a logical partition between ON and OFF periods of usage. By examining the IPD of connections and determining whether the OFF period transitions to an ON period within a certain threshold, it can be reasonably stated that these connections are related. It is interesting to note that this is an entirely passive technique. In [11], provable upper bounds are set on the number of packets required to confidently detect encrypted stepping stone streams with proven guarantees of a low false positive rate. The methods in [11] also take into consideration the usage of chaff packets, and provide bounds on the amount of chaff needed by the attacker to evade detection. The model proposed by [12] is similar to that of [11], with the addition of wavelets. In [12], the authors attempt to differentiate between the short term behavior of the stepping stone streams, where timing perturbations and chaff packets can mask the correlation between connections, and the long term behavior of stepping stone streams. Wang and Reeves [5] propose a watermark-based scheme that can detect correlation between streams of encrypted traffic. However, the assumption is made that the attacker's timing perturbations are independent and evenly distributed. Should the traffic be disturbed in another fashion, such as with the insertion of chaff packets, their method will show a decrease in accuracy. Zhang, et al. [13] provide an upper bound on the number of packets required to detect attackers in stepping stone networks when chaff packets and timing perturbations exist simultaneously.
The authors of [14] show that a VoIP encoding scheme can easily contain a covert channel by altering the least significant bit. They provide analysis on the bandwidth and the amount of data transferred. Wang, et al. [6] show that encrypted VoIP calls hidden through an anonymizing network can still be traced, using a technique similar to that of Paxson [10]. Only tangentially related to our work, the authors of [15] studied the loss of anonymity in a flow-based wireless mix network under flow marking attacks, in which an adversary embeds a pattern of marks into wireless traffic flows by electromagnetic interference. They asserted that traditional mix technologies are not effective in defeating flow marking attacks in wireless networks, and proposed a new countermeasure based on digital filtering technology. The authors of [16] introduce a covert timing channel model based on the presence or absence of packets arriving during a specific time period. The authors attempt to detect this covert timing channel over IP using a similarity comparison between packet inter-arrival times. The concept of covert channels was first introduced by B. Lampson in 1973 [17]. Covert channels at the network and transport layers in TCP/IP were first investigated by Rowland [18] and Fisk, et al. [19]. Currently, software such as Covert_TCP [19] and Nushu [20] is available to hide data within TCP headers. Project Loki [21] has also shown that ICMP packets are capable of carrying covert information within their headers. By moving the covert channel to the application layer, detection of covert channels becomes even more difficult. It was discussed in RFC 3205 [22] that HTTP be used as a carrier for other protocols; this was obviously meant for the covert encapsulation of the protocols. Several tools are also available to tunnel protocols
through HTTP, like Lars Brinkhoff's httptunnel [23], primarily for the purpose of evading firewalls. The predominant focus of these works, with the exception of [6,14-19,21-22], is on the detection of stepping stones using SSH style traffic. SSH traffic can be classified as high-latency traffic, caused by user pauses in commands being issued. In [14], data in the least significant bit of the VoIP stream is altered to create a covert channel. Wang et al. [6] use a technique similar to that of Paxson [10] to correlate between VoIP flows hidden by an anonymizing network. Additionally, all of the works discussed above, with the exception of [15-19,21-22], focus on wired networks. In [17-18,21-22], the covert channel is actually embedded or encapsulated within another protocol. We propose a general, protocol-independent model for node authentication that is able to take advantage of low-latency traffic and be effective on both wired and wireless networks. We evaluate this model experimentally and address its performance and accuracy. This model does not require a large amount of overhead during the brief authentication window, using no more than 8% of the available bandwidth during testing in a switched LAN and no more than 11% in a WLAN.
3 Covert Timing Channel Model

Through the application of steganography to a network flow, we achieve Covert Timing Channels, which are defined as parasitic communication channels that draw bandwidth from other channels, via the disruption of event timing relative to other events, in order to transmit information [24-25]. Traditionally, Covert Timing Channels are used for the transmission of messages; likewise, to provide authentication, we transfer a watermark or signature specific to a node through this channel. In particular, we disrupt the inter-packet timing within a data flow over a network (specifically using the TCP and UDP protocols). By adding minimal amounts of delay to select packets within the flow, we add a watermark to the flow in such a way that it is unique to a specific node. By delaying packets within a data stream, even minimally, we detract from the bandwidth of the network and thus increase the time required for a given task. We will show that bandwidth degrades in proportion to the amount of delay needed to accurately transfer a watermark, the length of the watermark, and the rate at which the node must be re-authenticated. Disrupting the inter-packet timing within a data flow is inherently volatile, especially when considering outside influences such as the overhead incurred from the use of TCP. Given occasional network volatility, we must assume that sending one watermark may not provide adequate authentication. Additionally, to provide a greater degree of robustness, in certain cases it may be necessary to increase the frequency of watermark transmissions. For instance, a university system may not require the node to re-authenticate itself often, whereas a military institution, which places a higher priority on security, may. Resending the watermark increases the security provided by the model, allowing the host network to ensure that the node it authenticated originally is still that node. In Section 4 we show that the increase per
additional watermark is small, resulting in a minimal decrease of network throughput. Additionally, we provide analysis on the amount of delay required to accurately detect the watermark, versus the percentage error of false negatives. The degree of natural delay varies from one network type to another. For instance, assuming that network delay is the only factor, the time required to transfer a file over the Internet or a VPN connection is greater than the time required to transfer that same file over a wireless network (from one local computer to another); the time required to transfer that file over the wireless network is in turn greater than the time required to transfer it over a 100 Mbps LAN connection. With the addition of network congestion, packet loss, etc., the distinction between network types becomes even clearer. As such, the delay added by our model varies dynamically depending on the type of network in use. Over a TCP connection, our model calculates the minimal delay required to accurately embed a watermark into the data flow from the round-trip time (RTT) of the first ten data packets received per connection. This approach will not work over a UDP flow, as there are no acknowledgment packets being returned to the sender. If UDP is being used, the client pings the server several times and calculates the delay based on the RTT provided.

An overview of our model, graphically represented in Fig. 1, is as follows. Initially a bootstrap phase is entered. During this phase, a kernel module, called the Encoder, is loaded and an initial watermark is created. The Encoder associates itself with an application at the transport layer and acts as a filter allowing for the queuing of packets, so that delay may be added. An additional kernel hook, called the Detector, registers itself to determine the RTT of the first ten data packets sent, which will determine the amount of delay added to the data flow.
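As a rough illustration of how the Detector might turn its ten RTT samples into the two delay values, the sketch below uses the normalizing factors φHigh and φLow that the paper reports in Section 5. The function name, the `variance_threshold` parameter, the use of the sample mean as "the" RTT, and the rule that the higher-variance link is the WLAN are all assumptions made for this example; the paper states only that the variance of the first ten packets separates the two network types.

```python
from statistics import mean, pvariance

# (phi_high, phi_low) normalizing factors reported in Section 5 of the paper.
PHI = {"lan": (46.5, 15.5), "wlan": (1.84, 1.0)}


def select_delays(rtt_samples, variance_threshold):
    """Return (network, lam, omega, spike) from the first ten RTT samples.

    variance_threshold is a hypothetical tuning knob, not a value from
    the paper.
    """
    sample = list(rtt_samples)[:10]
    if len(sample) < 2:
        raise ValueError("need at least two RTT samples")
    # Assumption: the link with the larger RTT variance is the WLAN.
    network = "wlan" if pvariance(sample) > variance_threshold else "lan"
    phi_high, phi_low = PHI[network]
    half_rtt = mean(sample) / 2   # assumption: average the samples into one RTT
    lam = phi_high * half_rtt     # delay that encodes a binary 1
    omega = phi_low * half_rtt    # delay that encodes a binary 0
    spike = network == "wlan"     # on a WLAN the interface is "spiked" first
    return network, lam, omega, spike
```

The returned values use whatever time unit the RTT samples are expressed in; a real Detector would also need the ping-based fallback described above for UDP flows.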
The initial watermark is currently generated from a password, but can be generated from another input instead; for example, using a serial number unique to a node (or other identifying information) in addition to a password would allow for the authentication of the user/machine pair. To generate a watermark, the password is hashed using the SHA-1 hashing algorithm, translated from hex to binary, and truncated to the desired length of the watermark. Two delay values are used within the watermark: a high value represents a binary 1, and a low value represents a binary 0. The true high (λ) and low (ω) values are set by the Detector, with λ > ω and ω greater than the average delay between packets. When delaying a packet that is part of the watermarked sequence, the binary watermark is consulted to determine whether the delay to be added is λ or ω: if the corresponding watermark bit is 0, ω is used; if it is 1, λ is used. As data is sent from the application, it is registered with the Encoder. The Encoder tracks the number of packets being sent, as well as the start time of the data flow. According to a level of security configured by the user, the Encoder embeds the watermark within the data stream after every β packets, or after γ units of time have elapsed. Additionally, the user may configure the watermark to be rehashed, using the current watermark as the key to create a new watermark, after δ packets have been transferred or after ζ units of time have elapsed. From this point, the data is transferred over the network to the receiver, where another module, a kernel hook, decodes the signal; a kernel hook is not required in this instance, as the application itself could provide this service. Abstractly,
Fig. 1. A work flow depicting the Covert Timing Channel Model.
Fig. 2. Depiction of the Experimental Testbed.
Fig. 3. Watermarks in UDP traffic transmitted over a WLAN link. Here λ = 12 ms and ω = 8 ms, with the Watermark = {1 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0}.
we shall refer to this entity as the Decoder. Prior to decoding the watermark, the data flow's inter-packet timing sequence is subjected to a high-pass filter, which filters out the majority of the un-encoded network traffic. Two methods of decoding the watermarks were investigated:

1. The Simple Threshold Method – A cutoff point is determined by taking the mean of the IPDs not set to zero by the high-pass filter. If a packet has been delayed by an amount greater than the cutoff, it is considered to represent a 1; otherwise it is considered to represent a 0.
2. The Multi-Threshold Method – Using a moving window of two, the values are multiplied together and stored in a separate array. Doing this places the number pairs into three distinctly separate categories: 00, 01/10, and 11. It is difficult to determine the order of the pairing in the 01/10 category; as a solution, the original stream is reviewed, and the greater value is determined to be the 1 value.

Due to space limitations, only the Simple Threshold Method was used during testing. Should the flow be found not to contain the watermark after a certain threshold, as determined by the level of security required, the Decoder can be configured to signal a firewall, or other external device, in an effort to alert a user or disallow continued traffic from the source. There are many additional steps that could be taken, each specific to an individual organization's needs, and beyond the scope of this work.
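The encoding and decoding steps of this section can be tied together in a small round-trip sketch: derive watermark bits from a password with SHA-1, map bits to λ/ω delays, and recover them with the Simple Threshold Method. This is an illustration under stated assumptions, not the authors' kernel implementation. The function names and the `noise_floor` parameter (standing in for the high-pass filter cutoff) are hypothetical, and the filter here drops small IPDs rather than zeroing them.

```python
import hashlib
from statistics import mean


def watermark_bits(password, length):
    """SHA-1 the password, expand the hex digest to binary, truncate."""
    if not 0 < length <= 160:
        raise ValueError("a SHA-1 digest provides at most 160 bits")
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest()
    bits = bin(int(digest, 16))[2:].zfill(160)  # keep leading zero bits
    return [int(b) for b in bits[:length]]


def encode_delays(bits, lam, omega):
    """Map each watermark bit to its added delay: 1 -> lam, 0 -> omega."""
    if not lam > omega:
        raise ValueError("lam must be greater than omega")
    return [lam if bit else omega for bit in bits]


def decode_simple_threshold(ipds, noise_floor):
    """Simple Threshold Method on a sequence of inter-packet delays.

    IPDs at or below noise_floor are treated as un-encoded traffic and
    discarded (the high-pass step); the cutoff is the mean of the rest.
    """
    kept = [d for d in ipds if d > noise_floor]
    if not kept:
        return []
    cutoff = mean(kept)
    return [1 if d > cutoff else 0 for d in kept]
```

With the Fig. 3 watermark and λ = 12 ms, ω = 8 ms, the cutoff is the mean of eight 12 ms and twelve 8 ms delays, i.e. 9.6 ms, so every bit falls on the correct side. Note that a watermark consisting of all 0s or all 1s cannot be decoded this way, since every retained IPD then equals the mean.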
4 Experimental Setup and Procedure

To test the Covert Timing Channel Model in a controlled environment, an experimental testbed was constructed. The client sending the watermark was a Lenovo 3000 C100 laptop running Fedora Core 4, kernel ver. 2.6.16-1.2111_FC4, with 512 MB RAM. One server, called Server 1, used in the
testbed was a custom-built desktop computer running Fedora Core 6, kernel ver. 2.6.20, with a 3.0 GHz processor and 1 GB RAM, connected on a local network with the client by both a wired and a wireless connection. An Airlink101 wireless card was used in the Lenovo client, and a D-Link wireless USB adapter was used in Server 1 for the wireless testing. A Netgear 802.11b router was used for the local wired and wireless testing; encryption was left on during the wireless testing to more accurately simulate a real-world environment. The majority of network traffic was generated by the client to the server, with several other computers generating traffic through the router on the local network, but not to the client or server. This is an appropriate setup considering most institutions use a switched network, possibly a wireless network as well, with multiple users active at any given time. In our experiments there were two different scenarios:

1. Switched LAN – The data transfer occurs entirely over a 10/100 Ethernet network.
2. 802.11b WLAN – The data transfer occurs entirely over an 802.11b wireless network.
Several different configurations were used, with multiple trials being run for each configuration. Packets were queued using a Netfilter kernel module [26]. The packets were passed to user space if they were being sent out of a predetermined port, and placed into a separate queue with an appropriate release time set. If the packets were part of a watermark, the release time was set according to the watermark; if the packets were not part of a watermark, their release time was set to the current time. The head of the queue was continually checked to determine whether the packet had reached its release time, and the packet was released if the current time was greater than the release time. All network traffic was monitored using Tcpdump [27]. Traffic was generated by transferring a 2.97 MB text file using a simple sockets program written in C, resulting in the transfer of 2035 packets over the network for Figures 4 and 5. For Figures 6-9, the same sockets program was used to transfer 88,000 packets over the network. Each set of captured data was parsed to a flat file using a pcap parser. The files were then processed in Matlab to detect the watermark embedded within the data flow. Five hundred tests per network type were run without using the watermark to provide a baseline.² For each set of parameters used in Figures 6-9, ten trials were run per configuration, resulting in approximately 400 trials.
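The release-time queue described above can be sketched in user space as follows. This is not the authors' Netfilter code: the class and method names are hypothetical, times are plain numbers supplied by the caller, and the rule for a watermarked packet (previous release time plus the bit's delay) is an assumption, since the text says only that the release time is "set according to the watermark". The sketch also releases a packet when the current time equals its release time, so that unmarked packets leave immediately.

```python
from collections import deque


class ReleaseQueue:
    """FIFO of packets, each tagged with the time it may be released."""

    def __init__(self):
        self._queue = deque()     # entries are (release_time, packet)
        self._last_release = 0    # release time of the most recent entry

    def enqueue(self, packet, now, delay=None):
        """Queue a packet; delay is lam or omega for watermarked packets,
        and None for ordinary traffic."""
        if delay is None:
            release = now                                  # send right away
        else:
            release = max(now, self._last_release) + delay
        # Preserve FIFO order: never release before the packet ahead.
        release = max(release, self._last_release)
        self._last_release = release
        self._queue.append((release, packet))

    def poll(self, now):
        """Return every packet at the head whose release time has arrived."""
        released = []
        while self._queue and now >= self._queue[0][0]:
            released.append(self._queue.popleft()[1])
        return released
```

In use, a sender would call `enqueue` as packets arrive from the application and call `poll` in a tight loop, transmitting whatever it returns; the spacing between transmitted watermarked packets then carries the bit values.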
² UDP traffic was generated as a CBR data flow (35 Mbps).

5 Analysis

The balance of performance and security within large networks is crucial in maintaining the viability of those networks. Corporate, university, government and military networks all require different levels of security and minimum tolerable levels of performance, leading to the need for scalable and adaptable solutions that do not prohibit performance. Using the Covert Channel Model, each network type receives a configurable and robust method to authenticate nodes for a minimal cost to the network. The cost of using these methods varies depending on the frequency of authentication, the size of the watermark, and the amount of delay added. This cost is minimal over switched LAN and WLAN connections. When adding a watermark, the majority of the data flow is uninterrupted (Unwatermarked Delay). Only the portion of the data flow containing the watermark adds delay (Watermarked Delay). Additionally, the number of times the watermark repeats increases the delay (Watermarked Repetition Delay), leading to the delay model (Total Delay):

unwatermarked_transmission_time = (packets_total – packets_watermarked) × (RTT/2)
high_delay = (packets_watermarked @ λ) × λ
low_delay = (packets_watermarked @ ω) × ω
watermarked_delay = high_delay + low_delay
watermarked_repetitions_delay = total_repetitions × watermarked_delay
transmission_time = unwatermarked_transmission_time + watermarked_repetitions_delay
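The delay model above translates almost line for line into code. The sketch below is one reading of it, with two assumptions flagged in the comments: `packets_watermarked` in the first line of the model is taken to count watermarked packets across all repetitions, and the watermark is passed in as a list of bits so that the counts "@ λ" and "@ ω" are simply the numbers of 1s and 0s. The function name and signature are hypothetical.

```python
def transmission_time(packets_total, watermark, lam, omega, rtt, repetitions=1):
    """Total transfer time under the paper's delay model.

    All times (lam, omega, rtt, and the result) share one unit.
    """
    ones = sum(watermark)                 # packets delayed by lam
    zeros = len(watermark) - ones         # packets delayed by omega
    # Assumption: watermarked packets are counted over every repetition.
    packets_watermarked = repetitions * len(watermark)
    if packets_watermarked > packets_total:
        raise ValueError("more watermarked packets than packets sent")

    unwatermarked = (packets_total - packets_watermarked) * (rtt / 2)
    high_delay = ones * lam
    low_delay = zeros * omega
    watermarked_delay = high_delay + low_delay
    watermarked_repetitions_delay = repetitions * watermarked_delay
    return unwatermarked + watermarked_repetitions_delay
```

For example, 100 packets with RTT = 2, a four-bit watermark 1010 sent twice, λ = 6 and ω = 2 give 92 × 1 + 2 × (12 + 4) = 124 time units, against 100 time units with no watermark at all.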
The delay added by the watermark depends on four variables: (1) the value of λ; (2) the value of ω; (3) the length of the watermark; and (4) the number of times the watermark is repeated. Both (1) and (2) are configurable, and minimizing these parameters preserves the performance of the network. Both (3) and (4) are a function of the security required by the network. The Detector attempts to minimize the values in (1) and (2). Using the RTT values of the first ten data packets sent, the Detector determines which type of network is being used and sets the delays accordingly; the distinction between a LAN and a WLAN is made based on the variance within the first ten packets. If a LAN is in use, the Detector sets λ = φHigh × (RTT/2) and ω = φLow × (RTT/2), with φHigh = 46.5 and φLow = 15.5. If a WLAN connection is in use, the Detector sets the Encoder to “spike” the interface. When using a WLAN connection, φHigh = 1.84 and φLow = 1.0; due to the comparatively high bandwidth of the LAN connection, larger normalizing parameters (φHigh, φLow) are required there. During testing, it was initially observed that large delays were required on WLAN networks to ensure that the entire watermark was received. When using limited bandwidth, such as that on a WLAN, packets may be queued within the interface before being transferred to the network. By adding a large delay (“a spike”) prior to sending the watermark, the interface is allowed to clear its queue and thus transmit the watermark with greater accuracy and lower delays. Empirical testing shows that by “spiking” the interface, bandwidth over a WLAN connection is preserved far better than by simply increasing λ and ω. This will be discussed in more detail in Section 6. While testing was done using both TCP and UDP traffic, we will focus our analysis on TCP.
The primary reason for this is that UDP traffic is relatively well behaved (unlike TCP, it is free of outside influences) and easily carries the watermark. This is shown in Fig. 3, where UDP traffic is sent over a WLAN connection (which is more volatile than a LAN connection) at a rate well below the rate determined to be optimal. During this trial, watermarks were detected with 100% accuracy over the entire duration. UDP traffic over a LAN connection showed similar results. Ideally, after minimizing the delay, the watermark is transferred over the network and detected immediately. While this may be true with λ and ω set to a high value, it is not always the case. Network perturbations may cause alterations within the watermark
leading to a false negative (no false positives were observed). In this case, multiple watermarks may be sent to authenticate the node. These incomplete watermarks are examined to determine their proximity to the correct watermark. If they fall within a tolerance level against false negatives (a value predetermined by the user), they are treated as an ideal watermark. Additionally, care is taken that delay values are not set too high (i.e., above the TCP timeout values), since packet retransmissions could also increase the number of false negatives.
Optimal values for encoding differ per network. To maximize throughput, the desired value for ω is slightly above the normal sending rate, and the desired value for λ is slightly but distinguishably above ω. While this varies depending on the network type in use, testing within our testbed has shown that the optimal values are λ = 6ms and ω = 2ms in our LAN and λ = 22ms and ω = 16ms in our WLAN. This was determined empirically and is illustrated in Fig. 4 and Fig. 5. These values lead to a highly detectable solution that is still covert and minimizes the number of watermarks required for accurate detection.
Network performance is not greatly affected if λ and ω are set to relatively low values (but not the lowest possible values) and the number of watermarks contained within the authentication cycle is small. An authentication cycle represents the transmission of one or more watermarks; several watermarks may be required to achieve an acceptable degree of detection accuracy. While better performance can certainly be achieved by setting λ and ω to lower values, doing so decreases the probability that the watermark will be accurately detected, so a larger number of watermarks must be included within an authentication cycle. The computational power required and the speed at which the watermarks can be detected are also important factors.
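The Detector's rule for deriving λ and ω from the measured round-trip time, described earlier in this section, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `delay_parameters` is hypothetical, and the RTT inputs are the median values reported in Section 6.

```python
def delay_parameters(rtt_ms, phi_high, phi_low):
    """Return (lambda_ms, omega_ms) computed as phi * (RTT / 2)."""
    half_rtt = rtt_ms / 2.0
    return phi_high * half_rtt, phi_low * half_rtt


# LAN normalizing parameters from the text; 0.258 ms is the median LAN RTT (Section 6).
lan_lambda, lan_omega = delay_parameters(0.258, phi_high=46.5, phi_low=15.5)

# WLAN normalizing parameters from the text; 23.90 ms is the median WLAN RTT (Section 6).
wlan_lambda, wlan_omega = delay_parameters(23.90, phi_high=1.84, phi_low=1.0)

print(f"LAN:  lambda = {lan_lambda:.2f} ms, omega = {lan_omega:.2f} ms")
print(f"WLAN: lambda = {wlan_lambda:.2f} ms, omega = {wlan_omega:.2f} ms")
```

For the LAN inputs this yields roughly 6 ms and 2 ms, in line with the LAN values chosen for the testbed.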
The Simple Threshold Method is a powerful and easily implementable method to detect the watermark. It requires only enough memory to buffer the current watermark and minimal computation; as such, it is faster than the Multi-Threshold Method. Its drawback is that it has minimal intelligence and can produce a higher false positive rate if network conditions change. The Multi-Threshold Method provides a greater degree of robustness with lower false positive rates. However, it requires that the original time stamps be held in memory along with the current watermark, and slightly more computation. Comparatively, the Multi-Threshold Method provides better protection against changes in network conditions. Due to space limitations, our experimental data only provides analysis using the Simple Threshold Method.
At the monitoring node, each watermark is individually detected and compared to the true watermark to determine its accuracy. Each of these accuracy ratings is stored until enough watermarks have been received (or enough time has elapsed) to complete one authentication cycle. The maximum of the stored accuracy ratings is selected and compared to the configured threshold. If this maximum is greater than the threshold, the user is considered to be authenticated. Otherwise, the maximum is averaged with all previously detected maximum ratings, and this average is considered to be the user’s current authentication value. The authentication value is then compared to the configured threshold to determine whether the user is authenticated.
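The per-cycle decision described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, accuracy is taken to be the fraction of matching bits (the text does not define the measure), and "greater than" is used for both threshold comparisons.

```python
def watermark_accuracy(detected_bits, true_bits):
    """Fraction of bit positions where the detected watermark matches the true one.

    Assumption: a per-bit match ratio stands in for the paper's accuracy rating.
    """
    if len(detected_bits) != len(true_bits):
        raise ValueError("watermarks must have the same length")
    matches = sum(d == t for d, t in zip(detected_bits, true_bits))
    return matches / len(true_bits)


def evaluate_cycle(cycle_accuracies, previous_maxima, threshold):
    """Return (authenticated, authentication_value) for one authentication cycle."""
    cycle_max = max(cycle_accuracies)
    if cycle_max > threshold:
        return True, cycle_max
    # Otherwise average this cycle's maximum with the maxima of earlier cycles.
    history = list(previous_maxima) + [cycle_max]
    auth_value = sum(history) / len(history)
    return auth_value > threshold, auth_value


true_wm = [1, 0, 1, 1, 0, 0, 1, 0]
noisy_wm = [1, 0, 1, 0, 0, 0, 1, 0]  # one bit altered in transit
ratings = [watermark_accuracy(noisy_wm, true_wm), watermark_accuracy(true_wm, true_wm)]
print(evaluate_cycle(ratings, previous_maxima=[], threshold=0.95))
```

Keeping only the running list of per-cycle maxima is enough to compute the fallback average, so the monitor does not need to retain every individual rating.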
R. Newman and R. Beyah
[Figs. 4 to 9: plots not reproduced here.]
Fig. 4 contains the results from trials used to empirically determine the optimal values for λ and ω, as well as the number of watermarks required per authentication cycle for different levels of detection accuracy, in a LAN. Fig. 5 contains the corresponding results for a WLAN. Figs. 6 and 8 show the percentage decrease in network throughput, compared to the baseline throughput, when the number of authentication cycles is increased in a switched LAN and a WLAN respectively. Figs. 7 and 9 show the percentage decrease in network throughput, compared to the baseline throughput, when the length of a single watermark is increased in a switched LAN and a WLAN respectively.
6 Performance Analysis
RTT values differ per network. During testing it was observed that RTT values for a switched LAN had a median of 0.258ms and a mean of 0.247ms. Within a WLAN, the RTT had a median of 23.90ms and a mean of 42.43ms. Using the Detector, the appropriate λ / ω delay times for a LAN and a WLAN are 6ms/2ms and 22ms/16ms respectively. As previously mentioned, these values were determined empirically and are fixed in order to provide the performance analysis.
While certain institutions may require the highest level of security, the needs of other institutions may not be as stringent. The Covert Timing Channel method is designed to be a general method of authentication for institutions requiring various levels of security. As such, performance testing was done for several different levels of accuracy using different watermark lengths and various numbers of watermarks included within a single authentication cycle. By using shorter watermark lengths and/or fewer watermarks per authentication cycle, institutions (such as educational facilities that do not require high levels of security) may authenticate nodes with a
lower degree of accuracy without seriously impacting the performance of the network. Conversely, institutions requiring a greater degree of security may increase the length of the watermark and/or the number of watermarks per authentication cycle. Doing so allows nodes to be authenticated at a higher degree of accuracy while increasing the complexity required for watermarks to be detected at high levels of accuracy.
Testing on the switched LAN differed from testing on the WLAN: with an authentication cycle containing one watermark, the watermark was either fully detectable or not detectable at all, up to 95% accuracy. Beyond that, the authentication cycle had to be increased to three watermarks to achieve 100% accuracy at lower rates. This is the result of extremely low latency between hosts. The “authentication watermarks” were fully detectable until 6ms/2ms, after which no watermarks were detected at all. As long as λ > 5ms and 1ms < ω ≤ λ − 2ms, watermarks were generally 95% detectable with 0% error within the detection. The values of 6ms/2ms were chosen for the LAN because of their consistent ability to provide a great degree of detectability with minimal performance degradation (Fig. 4).
Our primary consideration during testing was the amount of overhead this model requires for accurate detection. Two configurable parameters were tested to provide a performance versus security comparison: (1) the cost to network throughput for different watermark lengths; and (2) the cost to network throughput for additional redundancy using authentication cycles. Increasing the length of the watermark and adding authentication cycles increases the security of the model at a cost to the network throughput. Fig. 6 represents the cost of increasing the number of watermarks contained within one authentication cycle.
This figure provides results for 1 watermark, 20 bits (or packets) long, per authentication cycle, with one to ten authentication cycles being sent over the authentication period of 88,000 packets for a 95% detection rate. To achieve 100% detection, 3 watermarks were used per authentication cycle with the same λ and ω values. The throughput of the watermarked flows is compared to the average of several hundred baseline transmissions in the same environment. Transmission of more authentication cycles caused a decrease in the throughput of the network. In our experiments there was an average 0.54% decrease in throughput per additional authentication cycle transmitted, with a maximum performance penalty of 7.32% over our trials. Fig. 7 represents the cost of increasing the watermark length, tested using 6 watermarks per authentication cycle for 100% detection and 1 watermark per authentication cycle for 95% detection. The cost of increasing the watermark from a length of 5 to 40 is a total decrease in throughput of 7.5%, with each additional bit costing, on average, a 0.036% decrease in throughput.
Performance on a WLAN had a greater degree of variance, primarily due to the nature of the network. Because WLAN traffic is bursty, a proportionally higher delay value was required for accurate detection. Testing was done at a detection accuracy of 90% (λ = 20ms), 95% (λ = 20ms), and 100% (λ = 22ms), with ω = 12ms (the optimal low value for our tested network) for each.
Fig. 8 represents the cost of increasing the number of authentication cycles per authentication period on a WLAN. Each authentication cycle contains 6/3/2 watermarks for 100%/95%/90% detection respectively, each consisting of 20 bits (or packets). In our experiments there was, on average, a 0.63% decrease in network throughput per additional authentication cycle, with a maximum decrease of 7.17%. Fig. 9 represents the cost of increasing the watermark length in a WLAN. This figure contains a large degree of variance due to the nature of WLANs, but shows a general decrease in network throughput as the length of the watermark is increased; with a larger amount of test data, we believe that this trend will smooth out. On average, adding one bit to a watermark on a WLAN caused a 0.0435% decrease in throughput, with a maximum performance penalty of 8.54% when disregarding a single outlying value. All figures, while showing the general trend expected, are based on a limited set of trials. It is our belief that the anomalies within each figure will smooth out over a larger number of trials.
7 Conclusion and Future Work
While our specific application uses the Covert Timing Channel model to communicate authentication information, we investigate the use of the Covert Timing Channel model to transmit data in general. We show that the Covert Timing Channel method provides a secure, reliable, and cost-effective process by which nodes may be authenticated within a network. Our model is highly scalable, easily adaptable, and backwards compatible with existing technology, and it performs well on switched LANs and WLANs. While the performance of this model is readily visible from the LAN and WLAN test results, we believe that it could be extended for use over WANs as well; in our future work, we intend to show that this model will perform well over VPN connections. Additionally, we intend to investigate the effects of chaff packets on our method. Lastly, we shall apply our method to node authentication using real-world traffic.
References
1. Cisco, NAC. http://www.cisco.com/en/US/netsol/ns466/networking_solutions_package.html
2. Thumann, M., Roecher, D.: NAC@ACK: Hacking the Cisco NAC Framework. In: The Proceedings of Black Hat Europe (2007)
3. Cisco Security Response: AAA Command Authorization By-Pass (2006), http://www.cisco.com/warp/public/707/cisco-sr-20060125-aaatcl.pdf
4. Pyun, Y.J., Park, Y.H., Wang, X., Reeves, D.S., Ning, P.: Tracing Traffic through Intermediate Hosts that Repacketize Flows. In: INFOCOM 2007. 26th IEEE International Conference on Computer Communications (2007)
5. Wang, X., Reeves, D.S.: Robust Correlation of Encrypted Attack Traffic through Stepping Stones by Manipulation of Interpacket Delays. In: Proc. of the 10th ACM Conference on Computer and Communications Security (CCS), October 2003, pp. 20–29 (2003)
6. Wang, X., Chen, S., Jajodia, S.: Tracking Anonymous Peer-to-Peer VoIP Calls on the Internet. In: Proc. of the 12th ACM Conference on Computer and Communications Security (CCS), November 2005, pp. 81–91 (2005)
7. Peng, P., Ning, P., Reeves, D.S., Wang, X.: Active Timing-Based Correlation of Perturbed Traffic Flows with Chaff Packets. In: Proc. of the 2nd International Workshop on Security in Distributed Computing Systems (SDCS), June 2005, pp. 107–113 (2005)
8. Wang, X., Reeves, D.S., Ning, P., Feng, F.: Robust Network-Based Attack Attribution through Probabilistic Watermarking of Packet Flows. Technical Report TR-2005-10, Department of Computer Science, NC State Univ. (2005)
9. Peng, P., Ning, P., Reeves, D.S.: On the Secrecy of Timing-Based Active Watermarking Trace-Back Techniques. In: Proc. of the 2006 IEEE Symposium on Security and Privacy (S&P), May 2006, pp. 334–349 (2006)
10. Zhang, Y., Paxson, V.: Detecting Stepping Stones. In: Proc. of the 9th USENIX Security Symposium, August 2000, pp. 171–184 (2000)
11. Blum, A., Song, D.X., Venkataraman, S.: Detection of Interactive Stepping Stones: Algorithms and Confidence Bounds. In: Proc. of the 7th International Symposium on Recent Advances in Intrusion Detection (RAID), October 2004, pp. 258–277 (2004)
12. Donoho, D.L., Flesia, A.G., Shankar, U., Paxson, V., Coit, J., Staniford, S.: Multiscale Stepping-Stone Detection: Detecting Pairs of Jittered Interactive Streams by Exploiting Maximum Tolerable Delay. In: Proc. of the 5th International Symposium on Recent Advances in Intrusion Detection (RAID), October 2002, pp. 17–35 (2002)
13. Zhang, L., Persaud, A., Johnson, A., Guan, Y.: Stepping Stone Attack Attribution in Non-Cooperative IP Networks. Iowa State University, Tech. Rep. TR-2005-02-1 (February 2005)
14. Takahashi, T., Lee, W.: An Assessment of VoIP Covert Channel Threats. In: Proc. of SecureComm 2007, 3rd International Conference on Security and Privacy in Communication Networks (2007)
15. Fu, X., Zhu, Y., Graham, B., Bettati, R., Zhao, W.: On Flow Marking Attacks in Wireless Anonymous Communication Networks. In: Proceedings of the 25th International Conference on Distributed Computing Systems (ICDCS) (2005)
16. Cabuk, S., Brodley, C., Shields, C.: IP Covert Timing Channels: Design and Detection. In: Proceedings of the 11th ACM Conference on Computer and Communications Security (October 2004)
17. Lampson, B.W.: A Note on the Confinement Problem. Communications of the ACM 16, 613–615 (1973)
18. Rowland, C.H.: Covert Channels in the TCP/IP Protocol Suite. First Monday 2.5 (May 1997)
19. Fisk, G., Fisk, M., Papadopoulos, C., Neil, J.: Eliminating Steganography in Internet Traffic with Active Wardens. In: Information Hiding 2002, pp. 18–35. Springer, Heidelberg (2002)
20. Rutkowska, J.: The Implementation of Passive Covert Channels in the Linux Kernel. In: Chaos Communication Congress, Chaos Computer Club e.V. (2004)
21. route, alhambra: Project Loki. Phrack 7(49) (November 1996)
22. Moore, K.: On the Use of HTTP as a Substrate. Tech. Rep., Internet Engineering Task Force, RFC 3205 (February 2002)
23. Brinkhoff, L.G.: httptunnel, http://www.nocrew.org/software/httptunnel.html
24. Covert Channels Definition, http://en.wikipedia.org/wiki/Covert_channel
25. Stenography Definition, http://en.wikipedia.org/wiki/Stenography
26. Netfilter / IPTables, http://www.netfilter.org/
27. Tcpdump, http://www.tcpdump.org/
PassPattern System (PPS): A Pattern-Based User Authentication Scheme
T. Rakesh Kumar and S.V. Raghavan
Network Systems Laboratory, Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-600036, India
{rakeshk,svr}@cs.iitm.ernet.in
Abstract. Authenticating a user online without compromising user comfort is an important issue. The most popular approach to authenticating a user online is Password-based authentication. Studies show that users tend to choose very simple passwords, which are often easy to guess. Randomly generated strings, on the other hand, are difficult to remember, especially if the user has many passwords. In this paper we present a dynamic password scheme based on patterns, called the PassPattern System (PPS), which works using the existing infrastructure. PPS is an Adaptive Authentication System: the strength of the system can be changed depending on the needs of the application without compromising user comfort.
Keywords: User Authentication, Security.
1 Introduction
"Open, Sesame!" is a classical example of using a SecretWord/Password, from the famous tale Ali Baba and the Forty Thieves [1], to open the door of the magical cave. Since then, a SecretWord/Password has been used to authenticate the user before accessing any resource. With the advancement in computers, the Internet, and electronic commerce, authentication has become increasingly important. One of the main problems with the username-password scheme is the selection of the password itself. Studies show that users tend to pick passwords which are short and easy to remember [2]. Often it is very easy to break the password of a user if personal information about him/her is known and, more often than not, it is widely known. With technological advancement, alternative authentication schemes that are more secure than the conventional Password-based scheme have been reported in the literature, such as Biometrics, RSA SecurID® [3], and Graphical passwords [4]. Each mechanism has its own advantages and disadvantages. One of the main reasons why passwords are still widely used is that newer technologies entail significant changes in the infrastructure of the system. Besides, the vulnerabilities of Graphical passwords are still not fully understood and remain an active area of research [4]. It appears that the need of the hour is an authentication scheme that is easy and intuitive to use, strong and robust against all possible attacks, and works using the existing hardware/software infrastructure.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 162–169, 2008. © IFIP International Federation for Information Processing 2008
In this paper we present the PPS, a dynamic password authentication system in which the user proves to the system that he/she knows the secret instead of communicating it directly. PPS can be visualized as a challenge-response system, wherein the response to a challenge is hardly reusable, thereby rendering the replay attack and its variants ineffective.
1.1 Related Work
Several attempts at authentication schemes to replace the traditional Password-based system are reported in the literature. While each attempt succeeds in increasing the strength of the system against some of the known attacks, they are either computationally intensive or require additional hardware/software in the infrastructure. In this section we review the contemporary attempts, identify the gaps, and highlight the motivation for developing PPS.
Password-based authentication. The password system is the oldest and the most popular authentication scheme used in the modern world. But users tend to choose simple passwords, which are easy to remember but also very easy to break [2,5]. One way of preventing this is to provide users with a random password. Such a password is difficult for the user to remember, and with contemporary computing power even a computer-generated random password of 8 characters is not strong enough to withstand a bruteforce attack: with Blue Gene/L (see footnote 1), it would take 1 hour to break a password of 10 characters. According to Microsoft, “A strong password should appear to be a random string of characters to an attacker. It should be 14 characters or longer, (eight characters or longer at a minimum)” [6]. While it is recommended that a user should not use the same password for more than one account, it is quite common for users to have more than one online profile. Besides, users have to maintain separate accounts for e-mail, social networking, e-banking, e-commerce, blogging, multimedia sharing, etc. Remembering a 14-character password for each account is difficult.
The other way of increasing the strength of the password system is to make the password dynamic, i.e. the user has to enter a different password at every login. But existing dynamic password or One-Time Password schemes require either extra hardware or complex mathematical operations.
Biometrics. Biometrics comes under the category of authentication systems which use ‘What the user is’. The main disadvantage of any biometric system is the need for a change in the infrastructure of the entire system. For example, if Yahoo wants to replace its password-based authentication with fingerprint-based authentication, all the users have to install fingerprint readers on their computers and Yahoo has to maintain a registry. This kind of change in infrastructure may not always be feasible. Moreover, biometrics are often not secret [7], as people publicly expose their voice and fingers in various ways on a regular basis, creating the possibility of biometric spoofing. For biometrics like Retina-based authentication and Vein-based authentication, spoofing is not possible but the installation cost is very high.
Graphical passwords. The main idea of Graphical passwords [4] is that users find it easier to remember Graphical passwords than text-based passwords. Graphical
Footnote 1: IBM Blue Gene is currently the fastest supercomputer. [www.research.ibm.com/bluegene/]
passwords are more secure than text-based passwords against various attacks such as dictionary attacks, bruteforce attacks, and spyware. Some of the problems with graphical passwords are that users tend to choose the same sequence of images as their password, making it easy to guess, and that Graphical passwords require much more storage than conventional text-based passwords.
RSA SecurID®. RSA SecurID® is a two-factor authentication scheme. It is based on something you know (a password or PIN) and something you have (an authenticator). RSA SecurID® is one of the dynamic password systems, where the user password changes for every login. Issues that cause concern are the cost and the possible loss of the physical devices.
Virtual keyboard. The virtual keyboard is yet another user authentication system, where the user has to enter his/her password by clicking the characters on a virtual keyboard instead of typing the password on the conventional keyboard. While such a system is good against keyloggers and spyware, it is susceptible to other attacks like shoulder surfing, guessing, and social engineering.
Motivation for PPS. The discussion so far underlines the need for an authentication system that is simple, easy to use, and inexpensive; incorporates the essence of a challenge-response system; changes dynamically; is robust against attacks such as bruteforce, dictionary, shoulder surfing, spyware, social engineering, and Man-In-The-Middle attacks; and, above all, does not demand any special hardware/software.
2 PassPattern System
The PassPattern System is a challenge-response system based on the premise that ‘humans are better at identifying, remembering and recollecting graphical patterns than text patterns’ [8]. The core idea of the PassPattern System is that, ‘Instead of remembering a sequence of characters as the secret, users have to remember a shape (which is stored internally as a sequence of positions in hash form) as the secret’. Whenever the user wants to get authenticated, the PPS displays an NxN matrix of cells, known as the PatternSquare. Each cell in the PatternSquare is an image which represents a character, as shown in Fig. 1(a). The character can be a letter, a number, or a special character. The PatternSquare is the challenge that is sent by the server to the user. The PatternSquare is generated dynamically upon each user request. Hence the characters in the cells of the PatternSquare, their positions, or both may change for every authentication request.
At the time of registration the user is asked to choose a sequence of positions, as shown in Fig. 1(b), by typing the characters in those positions. The sequence of positions then becomes the user’s PassPattern. The cell sequence of the PassPattern so chosen remains the secret between the user and the system. The user has to remember only the shape (the cell sequence in the PatternSquare), as that is the only secret. To get authenticated, the user has to type the characters that are present in the chosen shape (PassPattern) in the PatternSquare. The cells in the PatternSquare are colored (see footnote 2) in such a way that the user can see smaller
Footnote 2: The coloring is used to enhance user convenience; it has no relationship to the security of the PPS.
Fig. 1. (a) A sample PatternSquare. This is the sample of the image that the user will see, whenever he/she wants to login to the system. (b) A sample user choice of a PassPattern. A user can choose any sequence of positions as the PassPattern.
subsections of the PatternSquare (3 x 3 in our example), which makes it easy for him to remember his PassPattern. Whenever the user wants to login to the system, the user has to type the sequence of characters which appear in the user’s PassPattern presented by the system as a challenge (This sequence of characters is referred to as SecretWord). During every presentation of the PatternSquare, the contents change and hence the user has to type a different SecretWord as shown in Fig. 2, during every authentication process such as login.
Fig. 2. The above figure shows PatternSquares that are displayed to the users at different login attempts. SecretWords corresponding to the PassPattern in Fig. 1(b) will be “t?X4”, “@?7-” and “b-97”.
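How a fixed PassPattern yields a different SecretWord for each challenge can be sketched as below. This is an illustrative sketch only: the function name `secret_word`, the 3x3 PatternSquares, and the chosen positions are all made up for the example (the paper's figures are not reproduced here), with the characters picked so that two of the SecretWords quoted in the Fig. 2 caption appear.

```python
def secret_word(pattern_square, pass_pattern):
    """Read off the characters found at the PassPattern's (row, col) positions."""
    return "".join(pattern_square[row][col] for row, col in pass_pattern)


# Hypothetical PassPattern: the secret is this sequence of cell positions.
pass_pattern = [(0, 0), (1, 1), (2, 2), (2, 0)]

# Two hypothetical challenges; each row of the PatternSquare is written as a string.
challenge_1 = ["t9#",
               "a?k",
               "4mX"]
challenge_2 = ["@x1",
               "q?z",
               "-p7"]

print(secret_word(challenge_1, pass_pattern))
print(secret_word(challenge_2, pass_pattern))
```

The user's secret (the positions) never changes, while the string actually typed depends entirely on the challenge shown at that login.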
2.1 Design Issues
PPS allows users to choose any sequence of positions as their PassPattern. The user can select the PassPattern based on some sequence familiar to him/her, for example a knight’s moves on a chess board or some symmetric positions. Because of the coloring of the cells (see footnote 2), remembering the PassPattern is easy. The PPS comprises four parts: the PPS software, the Pseudo-Random number generator, the Image generator, and the PassPattern database. The PPS software controls all activities in the system, the Pseudo-Random number generator and the Image generator are used to create dynamic challenges, and the PassPattern database stores all the PassPatterns of the users. Internally the cells of the PatternSquare are indexed in row-major order (and left to
right), starting from 0 to N² − 1 (where N is the size of the matrix). Hence the PassPattern of a user is stored as a string of numbers. Both PassPatterns and usernames are stored in the database in hashed form, using the MD5 hash function.
Registration phase. Whenever a user sends a request for new registration, the server creates two S-seeds based on the “current clock time” (these seeds are valid only for that session at the server). Using the S-seeds as seed values, the Pseudo-Random generator (randomly) picks two sets of 49 characters (for a 7x7 matrix) out of the 94 printable characters in ASCII. The Image generator takes these characters, creates two sets of 49 images, and sends the images to the PPS software. The PPS software creates two PatternSquares using the images and sends them to the user as challenges. These two challenges are similar to the process of retyping the password at the time of registration in traditional password systems. The S-seed values are preserved as session variables (in physical memory) for the verification phase.
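The challenge-generation and storage steps just described can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the characters are assumed to be drawn without repetition, image rendering is omitted, the PassPattern is assumed to be encoded as comma-separated indices before hashing, and Python's `random` module stands in for the unspecified Pseudo-Random generator.

```python
import hashlib
import random

# The 94 printable ASCII characters, excluding the space (codes 33 to 126).
PRINTABLE = [chr(code) for code in range(33, 127)]


def make_pattern_square(seed, n=7):
    """Build an n x n PatternSquare of characters from a session seed."""
    rng = random.Random(seed)              # the seed plays the role of an S-seed
    chars = rng.sample(PRINTABLE, n * n)   # assumption: no character repeats
    return [chars[row * n:(row + 1) * n] for row in range(n)]


def cell_index(row, col, n):
    """Row-major index of a cell, running from 0 to n*n - 1."""
    return row * n + col


def hash_pass_pattern(indices):
    """MD5 hex digest of the PassPattern (MD5 is the hash named in the text)."""
    encoded = ",".join(str(i) for i in indices)  # assumed encoding
    return hashlib.md5(encoded.encode("ascii")).hexdigest()


square = make_pattern_square(seed=12345)
pattern = [cell_index(0, 0, 7), cell_index(1, 1, 7), cell_index(2, 2, 7)]
print(pattern, hash_pass_pattern(pattern))
```

In a deployment the seed would come from the server clock, as the text describes; a fixed seed is used here only so that the example is repeatable.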
Fig. 3. (a) Registration phase in the PPS. (b) Validation and Creation of the user account.
Once the user gets these two challenges, the user has to type the username and the SecretWords. The SecretWords are based on the user's selection of the PassPattern. The PPS checks the PassPatterns derived from both SecretWords and, if they are the same, calculates the MD5 hash of the username and the PassPattern, stores them in the PassPattern database, and acknowledges the successful registration to the user.
The Authentication phase. Whenever a user makes a login or authentication request, the PPS software creates the PatternSquare (as in the registration phase) and sends it to the user as a challenge. Once the user gets the challenge, the user has to type the username and the SecretWord, which is based on his/her PassPattern and the PatternSquare (Secret and Challenge). The PPS software maps the SecretWord back to a sequence of positions using the PatternSquare, calculates the MD5 hash of the username and the PassPattern, and compares the result with the value stored in the PassPattern database. Based on the comparison, the user is either allowed or rejected.
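The response-validation step can be sketched as follows. This is a hedged illustration rather than the system's actual code: the helper names are hypothetical, the PatternSquare is flattened into a string of distinct characters in row-major order, the PassPattern encoding before hashing is assumed, and username handling is left out for brevity.

```python
import hashlib


def hash_pattern(positions):
    """MD5 hex digest of the position sequence (assumed comma-separated encoding)."""
    encoded = ",".join(str(p) for p in positions)
    return hashlib.md5(encoded.encode("ascii")).hexdigest()


def positions_from_response(flat_square, secret_word):
    """Map each typed character back to its row-major cell index in this challenge."""
    index_of = {ch: i for i, ch in enumerate(flat_square)}
    return [index_of[ch] for ch in secret_word]


def verify(flat_square, secret_word, stored_hash):
    """Accept the response only if it maps to the stored PassPattern."""
    try:
        positions = positions_from_response(flat_square, secret_word)
    except KeyError:  # a typed character does not appear in this PatternSquare
        return False
    return hash_pattern(positions) == stored_hash


stored = hash_pattern([0, 4, 8])   # PassPattern: the main diagonal of a 3x3 square
challenge_1 = "abcdefghi"          # hypothetical flattened 3x3 PatternSquares
challenge_2 = "ihgfedcba"

print(verify(challenge_1, "aei", stored))  # correct response to challenge 1
print(verify(challenge_2, "iea", stored))  # correct response to challenge 2
print(verify(challenge_2, "aei", stored))  # replay of the earlier response
```

The last call illustrates why a captured SecretWord is of little use: replayed against a new challenge, it maps to a different position sequence.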
Fig. 4. (a) Challenge creation phase in the PPS. (b) Response validation phase in the PPS.
Other Design Issues. PPS is an adaptive user-authentication system whose strength can be adapted to various situations by changing two parameters: the size of the PatternSquare and the type of image. The first method is to change the size of the PatternSquare: the larger the PatternSquare, the greater the security. The number of possible patterns of length n in an NxN matrix is $(N^2)^n$. Thus, as the size of the PatternSquare (N) increases, the total number of possible PassPatterns increases, which in turn increases the strength of the system. The second method of increasing the security is to change the type of image in the PatternSquare, which can drastically improve the strength of the system. Each character in the PatternSquare can be represented with different coding, compression, noise, or distortion, and even as a CAPTCHA image (see footnote 3).
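As a worked illustration of how the pattern space grows with the size of the PatternSquare, the count $(N^2)^n$ can be evaluated for a fixed PassPattern length; the choice of n = 4 and the three values of N below are only examples.

```latex
\begin{align*}
N = 5:  &\quad (5^2)^4  = 25^4  = 390{,}625 \\
N = 7:  &\quad (7^2)^4  = 49^4  = 5{,}764{,}801 \\
N = 10: &\quad (10^2)^4 = 100^4 = 100{,}000{,}000
\end{align*}
```

Doubling N from 5 to 10 multiplies the number of possible PassPatterns of length 4 by $4^4 = 256$.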
3 Security Strength of PPS
In general, several attacks are possible on an authentication system. For any authentication system, the hacker can attack in at least three places: the server, the client, and the communication link. The attacks on the server include the Bruteforce attack, the Dictionary attack, attacks on the PassPattern database, and compromising the server as a whole. At the client, the possible attacks are keylogging and shoulder surfing. Finally, on the communication link, the possible attack is the Man-In-The-Middle attack, which can be carried out using packet sniffers. In terms of the data passed from the user to the server and the data stored in the server, PPS is comparable with the classical Password-based authentication system. In both cases, the user sends the username and a SecretWord, and this is compared with the registry in the database. But because of the dynamic nature of the challenge-response system, PPS is more secure than the Password-based scheme against attacks such as Bruteforce, Dictionary attack, Keylogger, and Shoulder surfing. For the rest of the attacks, PPS is as vulnerable as Password-based schemes
Footnote 3: CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a test used in computing to determine whether the user is human. For more information on CAPTCHA, refer to [http://www.captcha.net].
(e.g. Server database compromise attack). The best known solution for such attacks is to use cryptography protocols at the server or on the communication link. In this section we analyze the impact of the four attacks mentioned here on PPS.
Bruteforce attack. The hacker can try two kinds of Bruteforce attacks on this system. The first way of attacking the system is to ignore the PatternSquare and try with some random string. For a user PassPattern of length 4, there will be a unique SecretWord for the given session. If the hacker wants to guess that SecretWord, the probability of success will be $1/94^4 = 1.28 \times 10^{-8}$ (since there are 94 printable characters). If the guess is wrong, the probability of success will remain the same for the next guess, because the SecretWord will change with every attempt. Hence,
\[ \text{Probability of success for every attempt} = \frac{1}{94^n} \tag{1} \]
The other way of doing a brute-force search is to try all combinations of positions. For example, for a 7x7 PatternSquare there are $49^n$ (if the PassPattern may reuse positions) or ${}^{49}P_n$ (without reuse of positions) different patterns of length $n$. In general,

$$\text{Number of possible PassPatterns} = \begin{cases} (N^2)^n & \text{with reuse of positions} \\ \dfrac{N^2!}{(N^2 - n)!} & \text{without reuse of positions} \end{cases} \tag{2}$$
To break the system, the hacker on average has to break $(n \times 49^n)/2$ images (with reuse) or $(n \times {}^{49}P_n)/2$ images (without reuse). In general,

$$\text{Number of images that are to be broken} = \begin{cases} \dfrac{n \times (N^2)^n}{2} & \text{with reuse of positions} \\ \dfrac{n \times N^2!}{2 \times (N^2 - n)!} & \text{without reuse of positions} \end{cases} \tag{3}$$

where $N$ represents the size of the PatternSquare (an $N \times N$ grid) and $n$ represents the length of the PassPattern.

Dictionary attack. The dictionary attack is one of the most commonly used techniques to break a Password-based system. In the case of PPS, commonly used shapes and sequences are possible candidates for a dictionary. However, as the PatternSquare changes randomly on every presentation, the SecretWord approaches the behavior of a one-time pad.

Keyloggers. A keylogger is a program that captures the user's keystrokes and sends this information to the hacker. The natural protection for an authentication system against a keylogger is a one-time password (or dynamic password). PPS, being a dynamic password system, is not vulnerable to keyloggers. Even if the hacker obtains the SecretWord of a PPS user, this SecretWord cannot be reused to log in to the system, because of the dynamic nature of the PatternSquare.

Shoulder surfing. Shoulder surfing is looking over someone's shoulder when they enter a password or a PIN. It is an effective way to get information in crowded places because it is relatively easy to stand next to someone and watch as they fill out a form, enter a PIN at an ATM, or use a calling card at a public pay phone. Shoulder surfing can also be done at a distance with the aid of binoculars or other vision-enhancing devices. It is easy against a password system: the attacker only needs to see the keys that the user types. But to decode the PassPattern in PPS, the hacker has to see both the key sequence and the PatternSquare and map one onto the other before the user submits the page. So shoulder surfing is of little or no use against PPS as compared to a Password-based system.
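The counts in Eqs. (1) to (3) can be checked numerically. The following Python sketch evaluates them for the 7x7 PatternSquare and PassPattern length 4 used in the text; the function names are hypothetical and chosen only for this illustration.

```python
from math import perm

PRINTABLE = 94  # number of printable characters, as stated in the text


def guess_probability(n: int) -> float:
    """Eq. (1): chance that one random string of length n equals the SecretWord."""
    return 1 / PRINTABLE ** n


def pattern_count(N: int, n: int, reuse: bool) -> int:
    """Eq. (2): number of PassPatterns of length n on an N x N PatternSquare."""
    cells = N * N
    return cells ** n if reuse else perm(cells, n)


def images_to_break(N: int, n: int, reuse: bool) -> float:
    """Eq. (3): n images per pattern, and on average half of all patterns are tried."""
    return n * pattern_count(N, n, reuse) / 2


if __name__ == "__main__":
    print(f"{guess_probability(4):.3g}")                       # about 1.28e-08
    print(pattern_count(7, 4, True), pattern_count(7, 4, False))
    print(images_to_break(7, 4, True), images_to_break(7, 4, False))
```

Under these formulas, enlarging the PatternSquare (larger $N$) raises the pattern count without lengthening the PassPattern, which is the tuning knob the text refers to.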
4 Conclusion

In this paper we presented PPS, a user authentication system that is a potential replacement for the classical Password-based authentication system. The strength of the system can be changed by varying the size of the PatternSquare and the type of image associated with each cell of the PatternSquare, without increasing the length of the PassPattern. PPS is easy for the user to use and at the same time more secure than a conventional Password-based system. An existing Password-based system can be migrated to PPS without any change in the infrastructure. Some of the open problems in this area are: What is the ideal length of the PassPattern? What is the ideal size of the PatternSquare? What is the general user behavior while choosing a PassPattern? A prototype of PPS is available at http://netlab.cs.iitm.ernet.in/pps
A Secure and Efficient Three-Pass Authenticated Key Agreement Protocol Based on Elliptic Curves

Meng-Hui Lim¹, Chee-Min Yeoh¹, Sanggon Lee², Hyotaek Lim², and Hoonjae Lee²

¹ Department of Ubiquitous IT, Graduate School of Design & IT, Dongseo University, Busan 617-716, Korea
{menghui.lim,yeohcm}@gmail.com
² Division of Computer and Information Engineering, Dongseo University, Busan 617-716, Korea
{nok60,htlim,hjlee}@dongseo.ac.kr
Abstract. Key agreement protocols are of fundamental importance in providing data confidentiality and integrity between two or more parties over an insecure network. In 2004, Popescu [14] proposed an authenticated key agreement protocol and claimed it to be secure. However, Yoon and Yoo [19] discovered its vulnerabilities two years later and proposed an improved variant of it. In this paper, we highlight the vulnerability of this improved variant under LaMacchia et al.'s extended Canetti-Krawczyk security model [12]. We then propose another enhanced version of Popescu's protocol which offers stronger security features and appears to be significantly more efficient than Yoon-Yoo's scheme. To justify our claims, we present a thorough heuristic security analysis of our scheme and compare its computational cost and security attributes with those of the surveyed schemes.
1 Introduction
An authenticated key agreement protocol is essential for two or more specific entities to communicate securely over an open network (e.g. the Internet), in which the communication can be fully controlled by adversaries. The protocol must be designed so that, at the end of the protocol execution, both communicating parties are assured of each other's identity and agree on a unique shared secret key (also known as the session key) based on information contributed equally by each party. Once the key agreement is successfully carried out, the derived session key can be used to create a confidential channel that preserves data confidentiality and integrity. Therefore, an authenticated key agreement protocol should strictly prohibit any unauthenticated party, other than the intended protocol participants, from gaining access to the shared secret under any circumstances.
This work was supported by MIC of Korea, under ITRC support program supervised by IITA (IITA-2007-C1090-0701-0026).
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 170–182, 2008. © IFIP International Federation for Information Processing 2008
Key agreement protocols can employ symmetric and/or asymmetric cryptography in their construction. Generally, symmetric protocols require a common secret key to be shared beforehand between the communicating parties, whereas asymmetric or public key-based protocols require the communicating entities to learn only specific authenticated public information about their peer (such as identity and public key), which is usually exchanged via certificates. This paper is concerned with two-party authenticated key agreement protocols in the asymmetric setting.

Over the years, there has been overwhelming interest in public key-based cryptography, mainly due to the use of elliptic curve operations in constructing cryptographic key agreement protocols. However, as designing a secure and efficient protocol is far from simple, many protocols have proven to be fallible. In 2005, Strangio [16] proposed an efficient two-party elliptic curve based key agreement protocol, ECKE1, which appears secure at first glance since its security attributes are well discussed in his work. Unfortunately, Wang et al. [18] recently revealed a weakness in Strangio's protocol: they presented a valid key compromise impersonation attack on it. Duraisamy et al. [8] independently pointed out a similar attack. To remove the vulnerability of ECKE1, Wang et al. [18] and Strangio [17] proposed revised versions of ECKE1, namely ECKE1N and ECKE1R respectively, at the expense of a higher computational workload. In 2004, Popescu [14] proposed another notable key agreement scheme, claiming it to be secure and more efficient than the existing ones owing to the low computational cost required at both communicating entities.
However, in 2006, Yoon and Yoo [19] refuted Popescu's claim by presenting a series of attacks comprising a key-compromise impersonation attack, a reflection attack and a replay attack. They then proposed an improved scheme [19] to defeat these vulnerabilities and, to justify its security, gave a detailed heuristic security analysis of it.

In this paper, we point out that Yoon-Yoo's improved scheme [19] does not seem secure when both ephemeral keys of a specific completed session are exposed to a passive adversary who only eavesdropped on the session. This compromise of session-specific secret information is regarded as one of the essential attacks captured by the extended Canetti-Krawczyk security model [12]. Based on this shortcoming, we propose another improved variant of Popescu's protocol [14], and we claim that our protocol achieves stronger security and higher efficiency than Yoon-Yoo's scheme.

The rest of this paper is organized as follows. In Section 2, we present the basic concepts needed to evaluate the security properties of a key agreement protocol and define the elliptic curve as well as several hard cryptographic problems. In Section 3, we revisit Yoon-Yoo's improved scheme and illustrate its insecurity. We propose an efficient three-pass authenticated key agreement protocol and evaluate its security in Sections 4 and 5 respectively. In Section 6, we compare our scheme with the other existing key agreement schemes in terms of computational efficiency and security. Lastly, we conclude this paper in Section 7.
2 Preliminaries

2.1 Security Properties of Key Agreement Protocol
Numerous security attributes have been defined and discussed specifically for the analysis of key agreement protocols. The most fundamental property of a key agreement protocol is passive attack resilience, that is, the protocol should prevent the adversary from obtaining the session key or any other useful information (such as ephemeral keys) by merely eavesdropping on the protocol. On top of that, we also assume that the adversary has full control of the communications on the open network, such that the adversary is capable of intercepting, deleting, injecting, altering and replaying messages in any online instance of the key agreement protocol. These potential attacks have led to the additional security properties: known session key security, perfect forward secrecy, key compromise impersonation resilience, unknown key-share resilience and key control resilience. Please refer to Blake-Wilson and Menezes's definitions of these security attributes [4,3] for an excellent overview.

2.2 Elliptic Curve Cryptography
The notion of elliptic curve cryptography was first introduced independently by Miller [13] and Koblitz [10]. Since then, numerous elliptic curve cryptosystems have been proposed and employed. Their main attraction is that they allow much smaller parameters (e.g. key size) to achieve a level of security equivalent to that of traditional public-key cryptosystems such as RSA and DSA. The security of an elliptic curve cryptosystem rests mainly on the intractability of the Elliptic Curve Discrete Logarithm Problem (ECDLP), for which the known attacks take fully exponential time. This resistance to sub-exponential attacks offers potential reductions in processing power and memory size, which are essential in applications on constrained devices [15].

Let $E(F_q)$ be an elliptic curve defined over a finite field $F_q$ of characteristic $p$. The public elliptic curve domain parameters over $F_q$ are defined as an 8-tuple $(q, FR, S, a, b, P, n, h)$, where $q$ is the prime order of the field, $FR$ (field representation) indicates the representation used for the elements of $F_q$, $S$ is the random seed for elliptic curve generation, the coefficients $a, b \in_R F_q$ define the equation of the elliptic curve $E$ over $F_q$ ($y^2 = x^3 + ax + b$ for $p = q > 3$, where $4a^3 + 27b^2 \neq 0$), $P = (x_P, y_P)$ is the base point in $E(F_q)$, the prime $n$ is the order of $P$ ($n > 2^{160}$), and $h = \#E(F_q)/n$ is the cofactor, where $\#E(F_q)$ denotes the number of $F_q$-rational points on $E$. These parameters should be chosen so that no efficient algorithm can solve the Discrete Logarithm Problem (DLP) or the Computational Diffie-Hellman Problem (CDHP) in the cyclic subgroup $\langle P \rangle$ [16].
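To make the domain parameters concrete, the following Python sketch instantiates them on a toy curve and checks the conditions listed above. The curve $y^2 = x^3 + 2x + 2$ over $F_{17}$ with base point $P = (5, 1)$ is an assumption chosen purely for illustration (it does not come from the text and is far below the $n > 2^{160}$ bound), and the helper names `on_curve`, `add` and `mul` are hypothetical.

```python
# Toy domain parameters (assumed for illustration; far too small for real use).
q, a, b = 17, 2, 2          # field prime q and curve coefficients a, b
P = (5, 1)                  # base point
O = None                    # point at infinity (group identity)


def on_curve(pt):
    """True if pt satisfies y^2 = x^3 + ax + b over F_q (or is the identity)."""
    if pt is O:
        return True
    x, y = pt
    return (y * y - (x ** 3 + a * x + b)) % q == 0


def add(p1, p2):
    """Group law on E(F_q) in affine coordinates."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % q == 0:
        return O                                    # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % q, -1, q) % q   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % q, -1, q) % q        # chord slope
    x3 = (lam * lam - x1 - x2) % q
    return (x3, (lam * (x1 - x3) - y1) % q)


def mul(k, pt):
    """Scalar multiplication k*pt by double-and-add."""
    result = O
    while k > 0:
        if k & 1:
            result = add(result, pt)
        pt = add(pt, pt)
        k >>= 1
    return result


# Non-singularity condition 4a^3 + 27b^2 != 0 (mod q).
assert (4 * a ** 3 + 27 * b ** 2) % q != 0
assert on_curve(P)

# #E(F_q): affine points plus the point at infinity.
num_points = 1 + sum(on_curve((x, y)) for x in range(q) for y in range(q))

# Order n of P (smallest k >= 1 with kP = O) and cofactor h = #E(F_q) / n.
n = next(k for k in range(1, num_points + 1) if mul(k, P) is O)
h = num_points // n

print(num_points, n, h)     # 19 19 1 on this toy curve
```

On this toy curve the group has prime order, so the cofactor is 1 and every point other than the identity generates the whole group.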
Since many cryptographic primitives base their security on the assumption that the DLP and CDHP in some cyclic groups are intractable, our proposed key agreement protocol is no exception. Our protocol also rests upon a few related conjectures before its security can be claimed. We now define several cryptographic problems whose hardness we assume throughout this paper.

Conjecture 1 (ECDLP). Let $E(F_q)$ and $P$ be defined as above. The Elliptic Curve Discrete Logarithm Problem is said to be intractable if, for any probabilistic polynomial-time Turing machine $A$ with knowledge of $Y = xP$, where $Y \in \langle P \rangle$, the probability of success in computing $\log_P Y = x \in_R [1, n-1]$, denoted $\mathrm{Succ}^{\mathrm{ecdlp}}_{P,E(F_q)}(A)$, is negligible:

$$\mathrm{Succ}^{\mathrm{ecdlp}}_{P,E(F_q)}(A) = \Pr\left[\begin{array}{l} x \in_R [1, n-1]; \\ Y = xP; \\ A(n, P, Y) = \log_P Y \end{array}\right] \le \varepsilon,$$

where the probability is taken over the coin tosses of $A$ for any random choice of $x$.

Conjecture 2 (ECCDHP). Let $E(F_q)$ and $P$ be defined as above. The Elliptic Curve Computational Diffie-Hellman Problem is said to be intractable if, for any probabilistic polynomial-time Turing machine $A$ with knowledge of $R = rP$ and $S = sP$, where $R, S \in \langle P \rangle$, the probability of success in computing $rsP$, denoted $\mathrm{Succ}^{\mathrm{eccdhp}}_{P,E(F_q)}(A)$, is negligible:

$$\mathrm{Succ}^{\mathrm{eccdhp}}_{P,E(F_q)}(A) = \Pr\left[\begin{array}{l} r, s \in_R [1, n-1]; \\ R = rP;\ S = sP; \\ A(n, P, R, S) = rsP \end{array}\right] \le \varepsilon,$$

where the probability is taken over the coin tosses of $A$ for any random choices of $r$ and $s$.

Conjecture 3 (ECDDHP). Let $E(F_q)$ and $P$ be defined as above. The Elliptic Curve Decisional Diffie-Hellman Problem is said to be intractable if, for any probabilistic polynomial-time Turing machine $A$ with knowledge of $R = rP$ and $S = sP$, where $R, S \in \langle P \rangle$, the probability of success in distinguishing the two probability distributions $(P, R, S, Q)$ and $(P, R, S, T)$, denoted $\mathrm{Succ}^{\mathrm{ecddhp}}_{P,E(F_q)}(A)$, is negligibly better than guessing:

$$\mathrm{Succ}^{\mathrm{ecddhp}}_{P,E(F_q)}(A) = \Pr\left[\begin{array}{l} r, s, t \in_R [1, n-1]; \\ R = rP;\ S = sP;\ Q = rsP;\ T = tP; \\ b \in_R \{0, 1\}; \\ \text{if } (b = 0),\ U = T,\ \text{else } U = Q; \\ A(n, P, R, S, U) = b \end{array}\right] \le \frac{1}{2} + \varepsilon,$$

where the probability is taken over the coin tosses of $A$ for any random choices of $r$, $s$ and $t$.
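The following Python sketch illustrates why these conjectures only make sense for a large group order $n$: on a toy group, exhaustive search solves ECDLP in at most $n - 1$ point additions, and ECCDHP then follows immediately. The toy curve ($y^2 = x^3 + 2x + 2$ over $F_{17}$, $P = (5, 1)$, $n = 19$) and the function names are assumptions made for illustration, not parameters from the text.

```python
q, a = 17, 2            # toy field prime and coefficient a (assumed parameters)
P, n = (5, 1), 19       # base point and its prime order on y^2 = x^3 + 2x + 2
O = None                # point at infinity


def add(p1, p2):
    """Group law on E(F_q) in affine coordinates."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % q == 0:
        return O
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % q, -1, q) % q
    else:
        lam = (y2 - y1) * pow((x2 - x1) % q, -1, q) % q
    x3 = (lam * lam - x1 - x2) % q
    return (x3, (lam * (x1 - x3) - y1) % q)


def mul(k, pt):
    """Scalar multiplication k*pt by double-and-add."""
    result = O
    while k > 0:
        if k & 1:
            result = add(result, pt)
        pt = add(pt, pt)
        k >>= 1
    return result


def brute_force_dlog(Y):
    """Return x in [1, n-1] with Y = xP by trying every multiple of P."""
    acc = P
    for x in range(1, n):
        if acc == Y:
            return x
        acc = add(acc, P)
    raise ValueError("Y is not a non-identity element of the subgroup generated by P")


def cdh_via_dlog(R, S):
    """Solve ECCDHP on the toy group by first solving ECDLP for R = rP."""
    return mul(brute_force_dlog(R), S)


if __name__ == "__main__":
    x = 13
    assert brute_force_dlog(mul(x, P)) == x
    r, s = 7, 11
    assert cdh_via_dlog(mul(r, P), mul(s, P)) == mul(r * s % n, P)
```

The search cost grows linearly with $n$, so it is hopeless at $n > 2^{160}$; generic square-root methods such as baby-step giant-step would still need on the order of $2^{80}$ group operations at that size.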
3 Revisiting Yoon-Yoo's Key Agreement Protocol
This section reviews an improved variant of Popescu's protocol due to Yoon and Yoo [19]. Figure 1 describes this three-pass protocol. The two communicating parties are A (initiator) and B (responder), with long-term public/private key pairs $(Y_A/s_A)$ and $(Y_B/s_B)$ (where $Y_A = s_A P$ and $Y_B = s_B P$) and chosen ephemeral secret keys $r_A$ and $r_B$ respectively. Assume that the static public keys are exchanged via certificates, which are generally issued by a Certification Authority (CA) and bind each entity's identity ($ID_A$/$ID_B$) to his/her long-term public key ($Y_A$/$Y_B$). Once the message exchange is completed, the session key $K_{AB}$ is computed as $H(ID_A, ID_B, x_{V_A}, x_{V_B}, x_{K_2}, x_{K_1})$, where $K_1 = r_A r_B s_A P$ and $K_2 = r_A r_B s_B P$ are the shared secrets. The conjectured security attributes of this protocol are known key security, perfect forward secrecy, key compromise impersonation resilience, session key security, reflection resilience and replay resilience.
A holds $(s_A, Y_A)$; B holds $(s_B, Y_B)$.

A: choose $r_A \in_R [1, n-1]$; compute $V_A = r_A P$ and $Q_A = r_A Y_A$.
A → B: $V_A, Q_A$
B: choose $r_B \in_R [1, n-1]$; compute
   $K_1 = r_B Q_A = r_A r_B s_A P$,
   $K_2 = r_B s_B V_A = r_B r_A s_B P$,
   $V_B = r_B P$, $Q_B = r_B Y_B$,
   $e_B = H(ID_A, ID_B, x_{V_B}, x_{K_1}, x_{K_2})$,
   $C_B = e_B^{-1}(s_B - x_{V_B} r_B) \bmod q$.
B → A: $V_B, Q_B, C_B$
A: compute
   $K_1 = r_A s_A V_B = r_A r_B s_A P$,
   $K_2 = r_A Q_B = r_A r_B s_B P$,
   $e_B = H(ID_A, ID_B, x_{V_B}, x_{K_1}, x_{K_2})$;
   abort if $V_B \neq x_{V_B}^{-1}(Y_B - e_B C_B P)$; otherwise compute
   $e_A = H(ID_A, ID_B, x_{V_A}, x_{K_2}, x_{K_1})$,
   $C_A = e_A^{-1}(s_A - x_{V_A} r_A) \bmod q$.
A → B: $C_A$
B: compute $e_A = H(ID_A, ID_B, x_{V_A}, x_{K_2}, x_{K_1})$;
   abort if $V_A \neq x_{V_A}^{-1}(Y_A - e_A C_A P)$.
A and B: $K_{AB} = H(ID_A, ID_B, x_{V_A}, x_{V_B}, x_{K_2}, x_{K_1})$

Fig. 1. Yoon-Yoo's Improved Variant of Popescu's Key Agreement Protocol
3.1 The Insecurity of Yoon-Yoo's Scheme in the Extended Canetti-Krawczyk Security Model
Recently, LaMacchia et al. [12] extended the Canetti-Krawczyk security model [6] for authenticated key exchange to achieve a stronger sense of security by capturing all possible attacks resulting from ephemeral and long-term data compromise. Specifically, they view the message exchange and the computation of a common session key as a function of at least four pieces of secret information: a party's own long-term and ephemeral secret keys and the other party's static and ephemeral secret keys. Of these four pieces of information, the adversary in their model may reveal any subset that does not contain both the long-term and the ephemeral secret of the same party, since revealing both would trivially break the scheme.

For passive sessions, where the adversary can only eavesdrop (without active intervention in the session establishment), the adversary may reveal both ephemeral keys, both long-term secret keys, or one of each from the two different parties once the session is established. For example, as defined by Krawczyk [11], weak Perfect Forward Secrecy (wPFS) is concerned with security against revelation of both long-term keys after the passive session is completed. For active sessions, where the adversary may forge the communication of one of the parties during protocol execution, the adversary is allowed to reveal a long-term secret key or an ephemeral secret key of the other party. This is because if the adversary could also reveal the long-term key of the impersonated party, she could trivially compute the session key. As pointed out by Krawczyk [11], this is the full Perfect Forward Secrecy property, which is unattainable for two-pass AKE protocols. Considering only passive sessions in this part of our analysis, we now show that Yoon-Yoo's scheme is insecure under the extended Canetti-Krawczyk security model.
As claimed by Yoon and Yoo, their improved protocol indeed provides weak perfect forward secrecy, as the compromise of both long-term keys of A and B $(s_A, s_B)$ would not expose any of the established session keys to the adversary. It is also easy to see that the session key remains secure if one static key and one ephemeral key, one from each of the two parties, $(s_A, r_B)$ or $(s_B, r_A)$, are exposed after the session completes. In these two cases the adversary still cannot reconstruct both of the shared secrets $K_1$ and $K_2$ of their protocol. However, the analysis does not end here. If both ephemeral secrets $r_A$ and $r_B$ are revealed to the passive adversary, she can recompute the shared secrets $K_1 = r_B Q_A = r_A r_B s_A P$ and $K_2 = r_A Q_B = r_A r_B s_B P$ from the eavesdropped values $Q_A$ and $Q_B$, and reconstruct the session key without learning either of the static private keys $s_A$ or $s_B$. Such leakage can happen in practice, as the ephemeral secret information may be stored in insecure memory or the random-number generator of a party may be corrupted. As a result, the adversary is able to gain access to the ephemeral data and recompute the session key, which renders the specific session insecure.
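The ephemeral-key exposure described above can be replayed on a toy group. The Python sketch below runs only the key-derivation part of Yoon-Yoo's protocol (the authentication values $C_A$ and $C_B$ are omitted because they do not enter the session key) and then lets a passive adversary rebuild the key from the eavesdropped $V_A, Q_A, V_B, Q_B$ and the leaked $r_A, r_B$. The toy curve, the use of SHA-256 as the hash $H$, and all function names are assumptions made for illustration.

```python
import hashlib
import secrets

q, a = 17, 2            # toy field prime and coefficient a (assumed parameters)
P, n = (5, 1), 19       # base point and its prime order on y^2 = x^3 + 2x + 2
O = None                # point at infinity


def add(p1, p2):
    """Group law on E(F_q) in affine coordinates."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % q == 0:
        return O
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % q, -1, q) % q
    else:
        lam = (y2 - y1) * pow((x2 - x1) % q, -1, q) % q
    x3 = (lam * lam - x1 - x2) % q
    return (x3, (lam * (x1 - x3) - y1) % q)


def mul(k, pt):
    """Scalar multiplication k*pt by double-and-add."""
    result = O
    while k > 0:
        if k & 1:
            result = add(result, pt)
        pt = add(pt, pt)
        k >>= 1
    return result


def H(*parts):
    """Stand-in for the hash H: SHA-256 over a joined string (an assumption)."""
    return hashlib.sha256("|".join(map(str, parts)).encode()).hexdigest()


def rand_scalar():
    return secrets.randbelow(n - 1) + 1          # uniform in [1, n-1]


def ee_attack_demo():
    # Long-term and ephemeral keys of A and B.
    sA, sB = rand_scalar(), rand_scalar()
    YA, YB = mul(sA, P), mul(sB, P)
    rA, rB = rand_scalar(), rand_scalar()
    VA, QA = mul(rA, P), mul(rA, YA)             # sent by A
    VB, QB = mul(rB, P), mul(rB, YB)             # sent by B

    # Honest computation of the shared secrets on A's side.
    K1, K2 = mul(rA * sA % n, VB), mul(rA, QB)
    session_key = H("A", "B", VA[0], VB[0], K2[0], K1[0])

    # Passive adversary: knows QA, QB, VA, VB (eavesdropped) and rA, rB (leaked),
    # but neither sA nor sB.
    K1_adv, K2_adv = mul(rB, QA), mul(rA, QB)
    adversary_key = H("A", "B", VA[0], VB[0], K2_adv[0], K1_adv[0])
    return session_key, adversary_key


if __name__ == "__main__":
    honest, recovered = ee_attack_demo()
    assert honest == recovered
```

The adversary's computation uses no static private key at all, which is exactly the EE-type compromise that the extended Canetti-Krawczyk model is meant to capture.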
4 A Secure and Efficient Authenticated Key Agreement Protocol
Now we propose a secure and efficient authenticated key agreement protocol. As usual, the two communicating parties are A and B, with long-term public/private key pairs $(Y_A/s_A)$ and $(Y_B/s_B)$ respectively. In this protocol, we employ two independent one-way collision-free hash functions, namely $H_1 : \{0,1\}^* \to F_q$ and $H_2 : \{0,1\}^* \to \{0,1\}^k$, where $k$ is the security parameter. Assume that the public keys are exchanged via certificates. Our key agreement protocol proceeds as follows:

1. Initially, A selects a random $r_A \in_R [1, n-1]$ and computes

$$T_A = r_A Y_A. \tag{1}$$

In a similar manner, B selects a random $r_B \in_R [1, n-1]$ and computes

$$T_B = r_B Y_B. \tag{2}$$

Then A, in the role of the initiator, initiates the key agreement process by sending $T_A$ to B.

2. Upon receiving A's message, B computes

$$K = h r_B s_B T_A = h r_A r_B s_A s_B P, \tag{3}$$
$$e_B = H_1(ID_A, ID_B, x_{T_A}, x_{T_B}, x_K), \tag{4}$$
$$z_B = r_B + e_B s_B \bmod q, \tag{5}$$

where $x_{T_A}$ denotes the x-coordinate of $T_A$, $x_{T_B}$ denotes the x-coordinate of $T_B$, $x_K$ denotes the x-coordinate of $K$ and $h$ denotes the cofactor. Finally, B, in the role of the responder, sends the pair $(T_B, z_B)$ to A.

3. On reception of B's message, A calculates

$$K = h r_A s_A T_B = h r_A r_B s_A s_B P, \tag{6}$$
$$e_B = H_1(ID_A, ID_B, x_{T_A}, x_{T_B}, x_K), \tag{7}$$

and verifies whether

$$T_B \stackrel{?}{=} z_B P - e_B Y_B \tag{8}$$

holds. If the verification fails, A terminates the execution. Otherwise, A proceeds to compute

$$e_A = H_1(ID_A, ID_B, x_{T_B}, x_{T_A}, x_K), \tag{9}$$
$$z_A = r_A + e_A s_A \bmod q \tag{10}$$

and then sends $z_A$ to B.

4. B computes

$$e_A = H_1(ID_A, ID_B, x_{T_B}, x_{T_A}, x_K) \tag{11}$$

and checks whether

$$T_A \stackrel{?}{=} z_A P - e_A Y_A \tag{12}$$

holds. If the verification fails, B terminates the execution. Otherwise, B can be assured of A's identity.

5. At last, both A and B terminate and agree on the common session key, which is computed as

$$K_{AB} = H_2(ID_A, ID_B, x_{T_A}, x_{T_B}, z_A, z_B, x_K). \tag{13}$$
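The message flow of steps 1 to 5 can be traced with the following Python sketch on a toy group. It is an illustration under several loudly stated assumptions, not a faithful implementation. (i) The toy curve ($y^2 = x^3 + 2x + 2$ over $F_{17}$, $P = (5, 1)$, $n = 19$, $h = 1$) is chosen only for readability. (ii) $H_1$ and $H_2$ are modelled by SHA-256 with a domain-separation tag. (iii) Scalars are reduced modulo the group order $n$ rather than $q$, so that scalar arithmetic and point arithmetic agree. (iv) The response is computed as $z = (r + e)s \bmod n$: taken literally, Eq. (5) gives $z_B P - e_B Y_B = r_B P$, which coincides with $T_B = r_B s_B P$ only when $s_B = 1$, so the sketch substitutes a form for which the checks in Eqs. (8) and (12) close. This substitution is the sketch's own assumption, and no security claim is made for it.

```python
import hashlib
import secrets

q, a = 17, 2                # toy field prime and coefficient a (assumed parameters)
P, n, h = (5, 1), 19, 1     # base point, its prime order, cofactor
O = None                    # point at infinity


def add(p1, p2):
    """Group law on E(F_q) in affine coordinates."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % q == 0:
        return O
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % q, -1, q) % q
    else:
        lam = (y2 - y1) * pow((x2 - x1) % q, -1, q) % q
    x3 = (lam * lam - x1 - x2) % q
    return (x3, (lam * (x1 - x3) - y1) % q)


def neg(pt):
    return O if pt is O else (pt[0], -pt[1] % q)


def mul(k, pt):
    """Scalar multiplication k*pt by double-and-add."""
    result = O
    while k > 0:
        if k & 1:
            result = add(result, pt)
        pt = add(pt, pt)
        k >>= 1
    return result


def H(tag, *parts):
    """Stand-in for H1/H2: SHA-256 with a domain-separation tag (an assumption)."""
    data = "|".join(map(str, (tag, *parts))).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def rand_scalar():
    return secrets.randbelow(n - 1) + 1           # uniform in [1, n-1]


def run_session(idA="A", idB="B"):
    sA, sB = rand_scalar(), rand_scalar()         # long-term private keys
    YA, YB = mul(sA, P), mul(sB, P)               # long-term public keys

    # Step 1: commitments, Eqs. (1) and (2); A sends TA.
    rA, rB = rand_scalar(), rand_scalar()
    TA, TB = mul(rA, YA), mul(rB, YB)

    # Step 2: B derives K (Eq. 3), e_B (Eq. 4) and its response; B sends (TB, zB).
    K_B = mul(h * rB * sB % n, TA)
    eB = H("H1", idA, idB, TA[0], TB[0], K_B[0]) % n
    zB = (rB + eB) * sB % n                       # ASSUMED variant of Eq. (5)

    # Step 3: A derives K (Eq. 6), recomputes e_B (Eq. 7), checks Eq. (8).
    K_A = mul(h * rA * sA % n, TB)
    eB_at_A = H("H1", idA, idB, TA[0], TB[0], K_A[0]) % n
    if add(mul(zB, P), neg(mul(eB_at_A, YB))) != TB:
        raise RuntimeError("A aborts: check of Eq. (8) failed")
    eA = H("H1", idA, idB, TB[0], TA[0], K_A[0]) % n      # Eq. (9)
    zA = (rA + eA) * sA % n                       # ASSUMED variant of Eq. (10)

    # Step 4: B recomputes e_A (Eq. 11) and checks Eq. (12).
    eA_at_B = H("H1", idA, idB, TB[0], TA[0], K_B[0]) % n
    if add(mul(zA, P), neg(mul(eA_at_B, YA))) != TA:
        raise RuntimeError("B aborts: check of Eq. (12) failed")

    # Step 5: both sides derive the session key, Eq. (13).
    key_A = H("H2", idA, idB, TA[0], TB[0], zA, zB, K_A[0])
    key_B = H("H2", idA, idB, TA[0], TB[0], zA, zB, K_B[0])
    return key_A, key_B


if __name__ == "__main__":
    key_A, key_B = run_session()
    assert key_A == key_B
```

Each party performs four scalar multiplications in this flow (its commitment, the shared secret $K$, and two in the verification), which matches the per-party count the text later reports for the protocol without precomputation.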
5 Security Analysis
In this section, we scrutinize our proposed key agreement protocol in detail to ensure that it achieves the desired security attributes of a key agreement protocol and resists the known cryptographic attacks.

Known session key security (KSK-S). As shown in our protocol description, the session key is derived from the ephemeral keys $(r_A, r_B)$ of the specific session and the long-term keys $(s_A, s_B)$ of the protocol entities. This results in a distinct, independent session key in each protocol execution. On top of that, a one-way collision-resistant cryptographic hash function is used to derive the session key. Thus, obtaining any other session keys would not help the adversary mount a successful attack against a protocol run without the information set $(r_A, s_A)$ or $(r_B, s_B)$, which is required in the computation of the shared secret $K$. Therefore, based on the conjectures defined in Section 2, we claim that knowledge of some previous session keys gives the adversary no advantage in deriving any future or other previous session keys.

Weak Perfect Forward Secrecy (wPFS). Suppose that both A's and B's long-term private keys $s_A$ and $s_B$ have been exposed. The adversary, with the eavesdropped information $(T_A, T_B, z_A, z_B)$ of any particular session, would still be unable to recover the corresponding session key, since she does not know the ephemeral private key $r_A$ or $r_B$ needed to compute the shared secret $K$. In addition, the intractability of ECCDHP thwarts the adversary's attempt to compute $K$ from $T_A$ and $T_B$ alone. Hence, we claim that our enhanced protocol enjoys weak perfect forward secrecy.

Key-Compromise Impersonation Resilience (KCI-R).
Suppose that A's long-term private key $s_A$ has been compromised and, instead of directly impersonating A, the adversary now wishes to impersonate B in order to establish a session with A. However, the adversary is unable to compute the shared secret $K$ from the available information $(s_A, r_B, T_A)$, since the required information set is $(r_A, s_A, T_B)$ or $(r_B, s_B, T_A)$. Even if the adversary were able to compute $K$ by some means, and subsequently $e_B$, she would still be unable to compute $z_B$, which is impossible without knowledge of $s_B$. Hence, the adversary is prevented from launching a successful KCI attack against our protocol. Since our key agreement protocol is symmetric, the same holds when the long-term key $s_B$ is compromised (the adversary would impersonate A in this case, and her effort would be foiled in computing $K$ and $z_A$). As a result, we claim that this protocol withstands the KCI attack under all circumstances.

Key Replicating Resilience (KR-R). The key replicating attack was first introduced by Krawczyk [11], whose illustration of it involves oracle queries as described in Bellare and Rogaway's random oracle model [1,2]. This attack, if successfully carried out by the adversary, would force the establishment of a session $S$ (other than the Test session or its matching session) that agrees on the same session key as the Test session, by intercepting and altering the messages from both communicating parties during transmission. Since the Test session and $S$ are non-matching, the adversary may issue a Reveal query to the oracle associated with $S$ and can then distinguish whether the Test session key is real or random. (See [5,11] for more details on this attack.) Notice that the message integrity of $T_A$ and $T_B$ is guaranteed by having each party calculate the hash values $e_A$ and $e_B$, which are bound to $z_A$ and $z_B$ respectively. Since the adversary cannot forge $z_A$ or $z_B$ along with $T_A$ or $T_B$ in such a way that the verification steps in Eqs. (8) or (12) succeed, she is unable to force non-matching sessions to possess a common session key. As a result, if the adversary reveals A's session key, she would not be able to guess B's session key correctly with non-negligible probability, and vice versa.
On top of that, A and B also include the session identifiers $(T_A, T_B, z_A, z_B)$ in the key derivation function in Eq. (13) so as to fully eliminate the possibility of this attack [7]. Therefore, we claim that our protocol is secure against the key replicating attack.

Unknown Key-Share Resilience (UKS-R). An adversary masquerading as another legal entity E cannot deceive A into believing that messages received from E originated from B, even if the adversary could obtain E's certificate. Suppose that in a protocol run, the adversary intercepts the second message flow, erases B's certificate, attaches E's certificate to the message, and sends $(T_B, z_B)$ along with E's certificate to A. After computing

$$e_E = H_1(ID_A, ID_E, x_{T_A}, x_{T_B}, x_K) \tag{14}$$

A would eventually terminate the protocol execution, as she verifies that

$$T_B \stackrel{?}{=} z_B P - e_E Y_E \tag{15}$$

does not hold. As a result, the attack attempt fails. It is easy to see that this attack likewise fails to trick B when E's certificate is attached to A's message. In a nutshell, we speculate that the inclusion of A's and B's identities ($ID_A$ and $ID_B$) in the hash values $e_A$ and $e_B$, along with the use of the sender's public key in the verification steps in Eqs. (8) and (12), prevents the adversary from launching the unknown key-share attack on our proposed protocol.

Replay Resilience (R-R). In any protocol run, an adversary may attempt to deceive a legitimate participant by retransmitting the eavesdropped information of the impersonated entity from a previous protocol execution. Although the adversary might be unable to compute the session key at the end of the protocol run, her attack is still considered successful if she manages to trick the protocol entity into completing a session with her, believing that the adversary is indeed the impersonated party. In this replay analysis, we reasonably assume that the prime order $n$ of point $P$ is large enough that the probability of a protocol entity selecting the same ephemeral key ($r_A, r_B \in [1, n-1]$) in two different sessions is negligible. Consider a situation where the adversary (masquerading as A) replays A's first message ($T_{A(old)}$) from a previous protocol run between A and B. After B has sent her a fresh $(T_B, z_B)$ in the second message flow, the adversary would have to abort, since she cannot produce (by forging or replaying) a $z_A$ corresponding to $T_B$. The same replay prevention works in the reverse situation where B's message is replayed: the adversary would fail to generate a $z_B$ corresponding to the fresh $T_A$. Hence, we claim that message replay in our protocol execution can always be detected by both A and B.

Key Control Resilience (KC-R). In our protocol, no single protocol participant can force the session key to a predetermined or predicted value, since the session key $K_{AB}$ is computed from the shared secret $K$, which depends on both A's and B's long-term and ephemeral keys.
Besides all these security attributes, we also speculate that our protocol overcomes the security flaw of Yoon-Yoo's protocol and achieves a stronger sense of security as specified in LaMacchia et al.'s model. Note that in our protocol, the only way to compute the shared secret $K = r_A r_B s_A s_B P$ (for reconstructing the session key) is to employ both the ephemeral key and the secret key of the same protocol participant, along with the other participant's public information $T_A$ or $T_B$. This prevents a passive adversary from recomputing the established session key by revealing any other subset of the four secret keys $(r_A, s_A, r_B, s_B)$.
6 Comparison of Elliptic Curve Based Key Agreement Schemes
The computational efficiency of elliptic curve based key agreement schemes mainly depends on the number of major operations (elliptic curve scalar point multiplication and point addition) that each protocol entity must perform in a protocol run. In other words, while preserving the desired security features, one needs to minimize the number of these operations to keep a key agreement protocol efficient.
Table 1. Comparison of Online Computational Efficiency

                          Without precomputations     With precomputations
Protocol         Pass     PSM   PA   FM   H           PSM   PA   FM   H
Popescu [14]     2        3     0    0    1           1     0    0    0
ECKE1 [16]       2        3     1    4    3           2     1    2    2
ECKE1N [18]      2        4     2    2    2           3     2    0    1
ECKE1R [17]      3        4     1    3    3           3     1    1    2
Yoon-Yoo [19]    3        6     1    5    3           4     1    2    3
Our Protocol     3        4     1    3    2           3     1    1    2
Table 1 summarizes the comparison of online computational efficiency between the existing elliptic curve based key agreement schemes and our proposed scheme. Note that we do not consider pairing based schemes in our comparison, as one pairing operation costs several times as much as an elliptic curve point scalar multiplication, which would make such schemes clearly less efficient. We record the online computational operations, namely point scalar multiplication (PSM), point addition/point subtraction (PA), field multiplication (FM) and hash function (H), with and without precomputations. For ease of comparison, we treat all the key derivation functions employed in the surveyed protocols as hash functions. Among the surveyed schemes, Popescu [14] and Yoon-Yoo [19] ignored the cofactor in describing their protocols, although it is required to ensure that the point sent is a member of the prime order subgroup. For a fairer comparison, we assume that the shared secret(s) in their schemes are computed as a function of the cofactor $h$: in this efficiency analysis we replace $K = r_A r_B P$ with $K = h r_A r_B P$ in Popescu's protocol and, in Yoon-Yoo's protocol, $K_1 = r_A r_B s_A P$ with $K_1 = h r_A r_B s_A P$ and $K_2 = r_A r_B s_B P$ with $K_2 = h r_A r_B s_B P$.

Table 2 compares the security properties offered by the surveyed schemes with those of our proposed scheme. EE refers to security against the compromise of both parties' ephemeral secrets for a completed passive session, and ES refers to the analogous security when one ephemeral key and one static key, one from each of the two different parties, are exposed. From Table 1, it is apparent that the two-pass key agreement protocols require less computation than the three-pass schemes do.
Despite being more efficient, all of them except Popescu’s scheme exchange unauthenticated messages, which means that private keys are not used in the message construction. As a result, a session key is established without first authenticating the partner’s identity. Moreover, as time synchronization is not always feasible, two-pass protocols can hardly withstand the replay attack even when the transmitted message is signed with the long-term private key (since the signed message can also be replayed), as illustrated in Table 2. These inherent weaknesses of two-pass protocols are overcome by the three-pass schemes, where Yoon-Yoo’s scheme [19] (except for the EE attribute), Strangio’s revised scheme ECKE1R [17] and our scheme offer
A Secure and Efficient Three-Pass Authenticated Key Agreement Protocol
Table 2. Comparison of Conjectured Security Attributes Protocol
ES
Popescu [14]
KSK-S wPFS KCI-R KR-R UKS-R R-R KC-R EE √ √ √ √ √ X X X √ √ √ √ √ √ ECKE1 [16] X X √ √ √ √ √ √ √ ECKE1N [18] X √ √ √ √ √ √ √ √ ECKE1R [17] √ √ √ √ √ √ √ Yoon-Yoo [19] X √ √ √ √ √ √ √ √ Our Protocol
X √ √ √ √ √
more attractive security attributes than the two-pass protocols do. Compared with the existing three-pass schemes without precomputation, our protocol appears to be the most efficient: each participant performs only 4 point scalar multiplications, 1 point addition, 3 field multiplications and 2 hash function evaluations online in order to establish a secure session key, whereas Yoon-Yoo’s scheme requires 6 point scalar multiplications, 1 point addition, 5 field multiplications and 3 hash function evaluations, and ECKE1R requires 4 point scalar multiplications, 1 point addition, 3 field multiplications and 3 hash function evaluations per party in a protocol execution. Even with precomputation, which incurs an extra storage requirement, our protocol achieves the same online computational efficiency as ECKE1R: each party computes only 3 point scalar multiplications, 1 point addition, 1 field multiplication and 2 hash function evaluations in a protocol run.
7 Conclusion
In conclusion, we have revealed a weakness of Yoon-Yoo’s protocol: it fails to provide security when the ephemeral secrets of both communicating parties are disclosed. To eliminate this security flaw, we have proposed another improved variant of Popescu’s authenticated key agreement protocol. We have also evaluated the heuristic security of the protocol to ensure that the analyzed security features are all offered. Furthermore, we have compared our proposed scheme with several existing elliptic curve public-key key agreement schemes in terms of computational efficiency and security attributes. As a result, our protocol turns out to be a more secure and more efficient variant of Popescu’s key agreement scheme than Yoon-Yoo’s protocol.
References
1. Bellare, M., Rogaway, P.: Entity Authentication and Key Distribution. In: Stinson, D.R. (ed.) CRYPTO 1993. LNCS, vol. 773, pp. 110–125. Springer, Heidelberg (1994)
2. Bellare, M., Rogaway, P.: Provably Secure Session Key Distribution: The Three Party Case. In: 27th ACM Symposium on the Theory of Computing - ACM STOC, pp. 57–66 (1995)
M.-H. Lim et al.
3. Blake-Wilson, S., Johnson, D., Menezes, A.: Key Agreement Protocols and their Security Analysis. In: Darnell, M.J. (ed.) Cryptography and Coding 1997. LNCS, vol. 1355, pp. 30–45. Springer, Heidelberg (1997)
4. Blake-Wilson, S., Menezes, A.: Authenticated Diffie-Hellman Key Agreement Protocols. In: Tavares, S., Meijer, H. (eds.) SAC 1998. LNCS, vol. 1556, pp. 339–361. Springer, Heidelberg (1999)
5. Boyd, C., Choo, K.-K.R.: Security of Two-Party Identity-Based Key Agreement. In: Dawson, E., Vaudenay, S. (eds.) Mycrypt 2005. LNCS, vol. 3715, pp. 229–243. Springer, Heidelberg (2005)
6. Canetti, R., Krawczyk, H.: Analysis of Key-Exchange Protocols and Their Use for Building Secure Channels. In: Pfitzmann, B. (ed.) EUROCRYPT 2001. LNCS, vol. 2045, pp. 453–474. Springer, Heidelberg (2001)
7. Choo, K.-K.R., Boyd, C., Hitchcock, Y.: On Session Key Construction in Provably-Secure Key Establishment Protocols. In: Dawson, E., Vaudenay, S. (eds.) Mycrypt 2005. LNCS, vol. 3715, pp. 116–131. Springer, Heidelberg (2005)
8. Duraisamy, R., Salcic, Z., Strangio, M.A., Morales-Sandoval, M.: Supporting Symmetric 128-bit AES in Networked Embedded Systems: An Elliptic Curve Key Establishment Protocol-on-Chip. EURASIP Journal of Embedded Systems 2007(9) (2007)
9. Diffie, W., Hellman, M.: New Directions in Cryptography. IEEE Transactions on Information Theory 22(6), 644–654 (1976)
10. Koblitz, N.: Elliptic Curve Cryptosystems. Mathematics of Computation 48, 203–209 (1987)
11. Krawczyk, H.: HMQV: A High-Performance Secure Diffie-Hellman Protocol. In: Shoup, V. (ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 546–566. Springer, Heidelberg (2005)
12. LaMacchia, B.A., Lauter, K., Mityagin, A.: Stronger Security of Authenticated Key Exchange. In: Susilo, W., Liu, J.K., Mu, Y. (eds.) ProvSec 2007. LNCS, vol. 4784, pp. 1–16. Springer, Heidelberg (2007)
13. Miller, V.S.: Use of Elliptic Curves in Cryptography. In: Williams, H.C. (ed.) CRYPTO 1985. LNCS, vol. 218, pp. 417–426. Springer, Heidelberg (1986)
14. Popescu, C.: A Secure Authenticated Key Agreement Protocol. In: Proceedings of the 12th IEEE Mediterranean (MELECON 2004), vol. 2, pp. 783–786 (2004)
15. Raju, G.V.S., Akbani, R.: Elliptic Curve Cryptosystem and its Applications. In: Proceedings of the 2003 IEEE International Conference on Systems, Man and Cybernetics (IEEE-SMC), vol. 2, pp. 1540–1543 (2003)
16. Strangio, M.A.: Efficient Diffie-Hellmann Two-Party Key Agreement Protocols based on Elliptic Curves. In: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 324–331 (2005)
17. Strangio, M.A.: Revisiting an Efficient Elliptic Curve Key Agreement Protocol. Cryptology ePrint Archive: Report 081 (2007)
18. Wang, S., Cao, Z., Strangio, M.A., Wang, L.: Cryptanalysis of an Efficient Diffie-Hellman Key Agreement Protocol based on Elliptic Curves. Cryptology ePrint Archive: Report 026 (2006)
19. Yoon, E.-J., Yoo, K.-Y.: An Improved Popescu’s Authenticated Key Agreement Protocol. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganá, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3984, pp. 276–283. Springer, Heidelberg (2006)
On the Robustness of Complex Networks by Using the Algebraic Connectivity
A. Jamakovic and P. Van Mieghem
Delft University of Technology, Electrical Engineering, Mathematics and Computer Science, P.O. Box 5031, 2600 GA Delft, The Netherlands
Abstract. The second smallest eigenvalue of the Laplacian matrix, also known as the algebraic connectivity, plays a special role for the robustness of networks since it measures the extent to which it is difficult to cut the network into independent components. In this paper we study the behavior of the algebraic connectivity in a well-known complex network model, the Erdős-Rényi random graph. We estimate analytically the mean and the variance of the algebraic connectivity by approximating it with the minimum nodal degree. The resulting estimate improves a known expression for the asymptotic behavior of the algebraic connectivity [18]. Simulations emphasize the accuracy of the analytical estimation, also for small graph sizes. Furthermore, we study the algebraic connectivity in relation to the graph’s robustness to node and link failures, i.e. the number of nodes and links that have to be removed in order to disconnect a graph. These two measures are called the node and the link connectivity. Extensive simulations show that the node and the link connectivity converge to a distribution identical to that of the minimum nodal degree, already at small graph sizes. This makes the minimum nodal degree a valuable estimate of the number of nodes or links whose deletion results in a disconnected random graph. Moreover, the algebraic connectivity increases with increasing node and link connectivity, which justifies our definition of the algebraic connectivity as a measure of the robustness of complex networks.
1 Introduction
Complex networks describe a wide range of natural and man-made systems, e.g. the Internet, the WWW, networks of food webs, social acquaintances, paper citations, as well as many others [5,11,28]. Although complex systems are extremely different in their function, a proper knowledge of their topology is required to thoroughly understand and predict the overall system performance. For example, in computer networks, performance and scalability of protocols and applications, robustness to different types of perturbations (such as failures and attacks), all depend on the network topology. Consequently, network topology analysis, primarily aiming at non-trivial topological properties, has resulted in the definition of a variety of practically important metrics, capable of quantitatively characterizing certain topological aspects of the studied systems [2,24].
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 183–194, 2008. © IFIP International Federation for Information Processing 2008
In this paper, we rely on a spectral metric, i.e. the second smallest Laplacian eigenvalue, often also referred to as the algebraic connectivity [13]. Fiedler [13] showed that the algebraic connectivity plays a special role: 1) a graph is disconnected if and only if the algebraic connectivity is zero, 2) the multiplicity of zero as a Laplacian eigenvalue of a graph is equal to the number of disconnected components. There is a vast literature on the algebraic connectivity; see e.g. [9,10,19,20,22] for books and surveys and e.g. [21,23] for applications to several difficult problems in graph theory. However, for the purpose of this work, the most important is its application to the robustness of a graph: 1) the larger the algebraic connectivity is, the more difficult it is to cut a graph into independent components, 2) its classical upper bound in terms of the node and the link connectivity provides worst case robustness to node and link failures [13]. As mentioned in [6], the second means that for every node or link connectivity, there are infinitely many graphs for which the algebraic connectivity is not a sharp lower bound. The node and the link connectivity are important for the robustness because they quantify the extent to which a graph can accommodate node and link failures. Hence, it is worth investigating the relationship between those three connectivity metrics. Traditionally, the topology of complex networks has been modeled as Erdős-Rényi random graphs. However, the growing interest in complex networks has prompted many scientists to propose other, more complex models such as "small world" [27] and "scale-free" [4] networks. Despite the fact that various authors have observed that real-world networks have power-law degree distributions, the Erdős-Rényi random graph still has many modeling applications. The modeling of wireless ad-hoc and sensor networks, peer-to-peer networks like Gnutella [8] and, generally, overlay networks provide well-known examples [14]. 
Besides that, for the Erdős-Rényi random graph, most of the interesting properties can be analytically expressed. This is in contrast to most other graphs where computations are hardly possible. Taking the above arguments into consideration, in the first part of this work we study the behavior of the algebraic connectivity in the Erdős-Rényi random graph. By using the basic approximation that the algebraic connectivity equals the minimum nodal degree, we estimate the mean and the variance of the algebraic connectivity. Hereby we improve an already existing theorem concerning its behavior [18]. In the second part, we study the relationship between the algebraic connectivity and the graph’s robustness to node and link failures. Extensive simulations show that the algebraic connectivity increases with increasing node and link connectivity, supporting our definition of the algebraic connectivity as a measure of the robustness of complex networks. The paper is organized as follows. In Section 2, we present the theoretical background on the algebraic connectivity, and the node and the link connectivity. In Section 3, we analytically derive the estimation of the mean and the variance of the algebraic connectivity for the Erdős-Rényi model, which we verify by simulations. Prior to the analytical derivation, we describe in Section 3.1 the common topological properties that are observed in the random graph of Erdős-Rényi, how they are measured and why they are believed to be important in the context of this paper. In Section 4, we present additional simulation results: by exploring the relation between the algebraic connectivity, and the node and the link connectivity, the existing relations for the connected Erdős-Rényi graph are refined. Section 5 summarizes our main results.
2 Background
A graph theoretic approach is used to model the topology of a complex system as a network with a collection of nodes $\mathcal{N}$ and a collection of links $\mathcal{L}$ that connect pairs of nodes. A network is represented as an undirected graph $G = (\mathcal{N}, \mathcal{L})$ consisting of $N = |\mathcal{N}|$ nodes and $L = |\mathcal{L}|$ links, respectively. The Laplacian matrix of a graph $G$ with $N$ nodes is an $N \times N$ matrix $Q = \Delta - A$, where $\Delta = \mathrm{diag}(D_i)$, $D_i$ denotes the nodal degree of the node $i \in \mathcal{N}$ and $A$ is the adjacency matrix of $G$. The eigenvalues of $Q$ are called the Laplacian eigenvalues. The Laplacian eigenvalues $\lambda_N = 0 \le \lambda_{N-1} \le \dots \le \lambda_1$ are all real and nonnegative [21]. The second smallest Laplacian eigenvalue $\lambda_{N-1}$, also known as the algebraic connectivity, was first studied by Fiedler in [13]. Fiedler showed that the algebraic connectivity is very important for the classical connectivity, a basic measure of the robustness of a graph $G$: 1) the algebraic connectivity is only equal to zero if $G$ is disconnected, 2) the multiplicity of zero as an eigenvalue of $Q$ is equal to the number of disconnected components of $G$. In [13], Fiedler also proved the following upper bound on the algebraic connectivity $\lambda_{N-1}$ in terms of the minimum nodal degree $D_{\min}$ of a graph $G$: $0 \le \lambda_{N-1} \le \frac{N}{N-1} D_{\min}$. In addition, we introduce two connectivity characteristics of a graph $G$: the link connectivity, i.e. the minimal number of links whose removal results in losing connectivity, is denoted by $\kappa_L$, and the node connectivity, which is defined analogously (nodes together with adjacent links are removed), is denoted by $\kappa_N$. The following inequality in terms of the node connectivity $\kappa_N$ (and obviously the link connectivity $\kappa_L$) is to be found in [13]: $\lambda_{N-1} \le \kappa_N$. Hence, the minimum nodal degree $D_{\min}$ of an incomplete$^{1}$ graph $G$ is an upper bound on both $\lambda_{N-1}$ as well as $\kappa_N$ and $\kappa_L$. If $\kappa_N = \kappa_L = D_{\min}$, we say that the connectivity of a graph is optimal.
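The definitions above can be tried out on small graphs. The following is a minimal sketch, assuming NumPy is available; the helper names (`laplacian`, `algebraic_connectivity`, `cycle_graph`) and the two example graphs are illustrative choices and do not come from the paper.

```python
import numpy as np

def laplacian(adj):
    """Return Q = Delta - A for a symmetric 0/1 adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj

def algebraic_connectivity(adj):
    """Second smallest Laplacian eigenvalue (lambda_{N-1} in the paper's ordering)."""
    eigenvalues = np.linalg.eigvalsh(laplacian(adj))  # returned in ascending order
    return float(eigenvalues[1])

def cycle_graph(n):
    """Adjacency matrix of a ring on n nodes (hypothetical test graph)."""
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
    return adj

# Connected example: a 6-node ring, where every node has degree 2.
ring = cycle_graph(6)
n = ring.shape[0]
lam_ring = algebraic_connectivity(ring)
d_min_ring = int(ring.sum(axis=1).min())
fiedler_bound = n / (n - 1) * d_min_ring  # upper bound N/(N-1) * D_min

# Disconnected example: two disjoint triangles (block-diagonal adjacency).
triangle = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)
two_triangles = np.kron(np.eye(2, dtype=int), triangle)
lam_split = algebraic_connectivity(two_triangles)
```

For the ring, the Laplacian eigenvalues are $2 - 2\cos(2\pi k/6)$, so the algebraic connectivity is 1, below the bound $\frac{6}{5} \cdot 2$. The two triangles have minimum degree 2 but algebraic connectivity 0, which illustrates that a positive minimum degree alone does not imply connectivity.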
3 Algebraic Connectivity in Random Graph of Erdős-Rényi
In this section we give an analytical estimate of the algebraic connectivity in the Erdős-Rényi random graph. The analytical estimate relies on approximating the algebraic connectivity by the minimum nodal degree. This approximation is verified by a comprehensive set of simulations, presented in Subsection 3.4. Prior to analyzing the minimum nodal degree in Subsection 3.2, we give some details on the Erdős-Rényi random graph and the corresponding theorems.
3.1 Random Graph of Erdős-Rényi
The random graph as proposed by Erdős and Rényi [12] is a well-known model to describe a complex network. The most frequently occurring realization of this model is $G_p(N)$, where $N$ is the number of nodes and $p$ is the probability of having a link between any two nodes (or shortly the link probability). In fact, $G_p(N)$ is the ensemble of all such graphs in which the links are chosen independently and the total number of links is on average equal to $pL_{\max}$, where $L_{\max} = \binom{N}{2}$ is the maximum possible number of links.
1 For a complete graph $K_N$, the algebraic connectivity exceeds the node connectivity: $\lambda_{N-1}(K_N) = N > \kappa_N(K_N) = N - 1$.
Many properties of the random graph can be determined asymptotically, as was shown by Erdős and Rényi in a series of papers in the 1960s and later by Bollobás in [6]. For example, for a random graph to be connected there must hold, for large $N$, that $p \ge \frac{\log N}{N} \equiv p_c$. Moreover, the probability that a random graph is connected equals, for large $N$, $\Pr[G_p(N) \text{ is connected}] \approx e^{-N e^{-p(N-1)}}$ [26]. Furthermore, the probability that the node connectivity $\kappa_N$ equals the link connectivity $\kappa_L$, which in turn equals the minimum nodal degree $D_{\min}$, approaches 1 as $N$ approaches infinity, i.e. $\Pr[\kappa_N = \kappa_L = D_{\min}] \to 1$ as $N \to \infty$; this is also proved in [6] and holds without any restriction on $p$. It was also shown by Bollobás and Thomason in [7]. On the other hand, the asymptotic behavior of the algebraic connectivity in the Erdős-Rényi random graph $G_p(N)$ is proved by Juhász in [18]: for any $\varepsilon > 0$,
$$\lambda_{N-1} = pN + o\left(N^{\frac{1}{2}+\varepsilon}\right) \qquad (1)$$
where the algebraic connectivity converges in probability as $N \to \infty$.
3.2 Minimum Nodal Degree in Random Graph of Erdős-Rényi
In $G_p(N)$ each node $i$ has a degree $D_i$ that is binomially distributed. Before proceeding, we first need to show that the degrees in the sequence $\{D_i\}_{1 \le i \le N}$ are almost independent random variables. In any graph $\sum_{i=1}^{N} D_i = 2L$ holds, thus the degrees in the sequence $\{D_i\}_{1 \le i \le N}$ are not independent. However, if $N$ is large enough, $D_i$ and $D_j$ are almost independent for $i \ne j$ and we can assume that all $D_i$ are almost i.i.d. binomially distributed (see also [6, p. 60]). The following Lemma quantifies this weak dependence:
Lemma 1. The correlation coefficient of the degrees $D_i$ and $D_j$ of two random nodes $i$ and $j$ in $G_p(N)$ for $0 < p < 1$ is
$$\rho(D_i, D_j) = \frac{\mathrm{Cov}[D_i, D_j]}{\sqrt{\mathrm{Var}[D_i]\,\mathrm{Var}[D_j]}} = \frac{1}{N-1}.$$
Proof. See Appendix A of [16].
For large $N$ and constant $p$, independent of $N$, the normalized i.i.d. binomially distributed sequence $\{D_i^*\}_{1 \le i \le N}$ of all degrees in $G_p(N)$ tends to be Gaussian distributed. 
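The value in Lemma 1 can be checked with a short calculation. The following is only a sketch under the standard assumption that the link indicators $a_{ik}$ of $G_p(N)$ are independent Bernoulli($p$) variables; it is not necessarily the argument given in Appendix A of [16]. The degrees $D_i$ and $D_j$ share exactly one indicator, $a_{ij}$, and all other indicators are independent of each other.

```latex
\begin{align*}
D_i &= \sum_{k \ne i} a_{ik}, \qquad D_j = \sum_{k \ne j} a_{jk}, \qquad a_{ij} = a_{ji},\\
\mathrm{Cov}[D_i, D_j] &= \mathrm{Var}[a_{ij}] = p(1-p),\\
\mathrm{Var}[D_i] &= \mathrm{Var}[D_j] = (N-1)\,p(1-p),\\
\rho(D_i, D_j) &= \frac{p(1-p)}{(N-1)\,p(1-p)} = \frac{1}{N-1}.
\end{align*}
```

The correlation vanishes as $N$ grows, which is the sense in which the degrees are almost independent.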
The minimum of the sequence $\{D_i^*\}_{1 \le i \le N}$ possesses the distribution
$$\Pr\left[\min_{1 \le i \le N} D_i^* \le x\right] = 1 - \prod_{i=1}^{N} \Pr[D_i^* > x] = 1 - \left(\Pr[D_i^* > x]\right)^N.$$
After considering the limiting process of the minimum of a set $\{D_i^*\}_{1 \le i \le N}$ when $N \to \infty$, we derive [16] the appropriate solution
$$D_{\min}^* = \frac{-Y - 2\log N + \log\sqrt{2\pi \log\frac{N^2}{2\pi}}}{\sqrt{2\log N}}$$
where $Y$ is a Gumbel random variable [26]. With $D_{\min} = \sigma[D]\, D_{\min}^* + E[D] = \sqrt{(N-1)p(1-p)}\, D_{\min}^* + p(N-1)$, we obtain
$$D_{\min} = p(N-1) - \sqrt{(N-1)p(1-p)} \left( \frac{Y + 2\log N - \log\sqrt{2\pi \log\frac{N^2}{2\pi}}}{\sqrt{2\log N}} \right).$$
Finally, let $D_{\min}(p)$ denote the minimum degree in $G_p(N)$. Since the complement of $G_p(N)$ is $G_{1-p}(N)$, there holds that $D_{\min}(p) = N - 1 - D_{\max}(1-p)$. The law of $D_{\max}$ has been derived by Bollobás [6, Corollary 3.4 (p. 65)] via another method. Using the above relation, Bollobás’ result precisely agrees with ours.
3.3 Analytical Approximation for Algebraic Connectivity in Random Graph of Erdős-Rényi
In Section 2 we saw that $\lambda_{N-1} \le \frac{N}{N-1} D_{\min}$. Our basic approximation is $\lambda_{N-1} \approx D_{\min}$ for large $N$. A comprehensive set of simulation results, presented in Subsection 3.4, supports the quality of this assumption. With this approximation we arrive, for large $N$, at
$$\lambda_{N-1} \approx p(N-1) - \sqrt{2p(1-p)(N-1)\log N} + \sqrt{\frac{(N-1)p(1-p)}{2\log N}}\, \log\sqrt{2\pi \log\frac{N^2}{2\pi}} - \sqrt{\frac{(N-1)p(1-p)}{2\log N}}\, Y. \qquad (2)$$
By taking the expectation on both sides and taking into account that the mean of a Gumbel random variable is $E[Y] = \gamma = 0.5772\ldots$, our estimate of the mean of the algebraic connectivity in $G_p(N)$ becomes, for large $N$ and constant $p$,
$$E[\lambda_{N-1}] \approx p(N-1) - \sqrt{2p(1-p)(N-1)\log N} + \sqrt{\frac{(N-1)p(1-p)}{2\log N}}\, \log\sqrt{2\pi \log\frac{N^2}{2\pi}} - \sqrt{\frac{(N-1)p(1-p)}{2\log N}}\, \gamma. \qquad (3)$$
Similarly, by taking into account that $\mathrm{Var}[Y] = \frac{\pi^2}{6}$, the estimate of the variance of the algebraic connectivity in $G_p(N)$ is
$$\mathrm{Var}[\lambda_{N-1}] = \frac{(N-1)p(1-p)}{2\log N}\, \frac{\pi^2}{6}. \qquad (4)$$
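To get a feel for the magnitudes involved, the estimates of the mean and the variance can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative assumptions, natural logarithms are assumed throughout, and the chosen values N = 400 and α = 4 are just an example point within the simulated range.

```python
import math

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel random variable

def critical_probability(n):
    """Connectivity threshold p_c = log(N) / N."""
    return math.log(n) / n

def mean_estimate(n, p):
    """Estimate of E[lambda_{N-1}] following Eq. (3)."""
    scale = math.sqrt((n - 1) * p * (1 - p) / (2 * math.log(n)))
    log_term = math.log(math.sqrt(2 * math.pi * math.log(n * n / (2 * math.pi))))
    return (p * (n - 1)
            - math.sqrt(2 * p * (1 - p) * (n - 1) * math.log(n))
            + scale * log_term
            - scale * EULER_GAMMA)

def std_estimate(n, p):
    """Square root of the variance estimate in Eq. (4)."""
    variance = (n - 1) * p * (1 - p) / (2 * math.log(n)) * math.pi ** 2 / 6
    return math.sqrt(variance)

def juhasz_leading_term(n, p):
    """Leading term pN of Eq. (1), ignoring the o(.) remainder."""
    return p * n

n_nodes = 400
p_link = 4 * critical_probability(n_nodes)
mean_400 = mean_estimate(n_nodes, p_link)
std_400 = std_estimate(n_nodes, p_link)
leading_400 = juhasz_leading_term(n_nodes, p_link)
```

At this example point the estimated mean lies well below the leading term $pN$ of Eq. (1), and the estimated standard deviation is small relative to the mean, in line with the observations made in the text.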
[Figure 1: $\lambda_{N-1}$ (vertical axis, 0 to 160) versus $\alpha = p/p_c$ (horizontal axis, 4 to 20). Legend: eq. (1) for N = 200, 400 and 800; eq. (3) + eq. (5) for N = 200, 400 and 800.]
Fig. 1. A comparison between the estimation of the mean (3) as well as the standard deviation (5), plotted in lines and error bars, and the theorem of Juhász (1), plotted in markers, for the algebraic connectivity $\lambda_{N-1}$ as a function of $\alpha = p/p_c$ in the Erdős-Rényi random graph $G_p(N)$ with N = 200, 400 and 800 nodes
An interesting observation is that the standard deviation
$$\sigma[\lambda_{N-1}] = \sqrt{\mathrm{Var}[\lambda_{N-1}]} = O\left(\sqrt{\frac{N}{\log N}}\right) \qquad (5)$$
is much smaller than the mean (3). This implies that $\lambda_{N-1}$ tends to the mean rapidly, or that, for large $N$, $\lambda_{N-1}$ behaves almost deterministically and is closely approximated by the first three terms in (2). Hence, the relation (2) is more accurate than (1) (see also Figure 1).
3.4 Verification of Analytical Approximation by Simulations
In all simulations we consider exclusively the Erdős-Rényi random graph $G_p(N)$ with various combinations of the number of nodes $N$ and the link probability $p$. $N$ takes the values 200, 400 and 800. The link probability is $p = \alpha p_c = \alpha \frac{\log N}{N}$, where α varies from 1 to 20. For each combination of $N$ and $p$, we compute the algebraic connectivity $\lambda_{N-1}$ and the minimum nodal degree $D_{\min}$. Then, we classify the simulated graphs according to their value of α, as shown in Figures 2 and 3. Subsequently, from the generated graphs with a given α, we are interested in the extreme values, i.e. $\min \lambda_{N-1}$ and $\max \lambda_{N-1}$, as shown in Figures 4 and 5. In Figures 2 and 3, we have plotted the simulated mean $E[\lambda_{N-1}]$, the corresponding standard deviation $\sigma[\lambda_{N-1}]$, and our estimates for the mean, Eq. (3), and the standard deviation, Eq. (5), of the algebraic connectivity as a function of α. As illustrated in Figures 2 and 3, there is a remarkable correspondence between the simulations and our estimate: the standard deviation is much smaller than the mean, implying that for $N \to \infty$, $\lambda_{N-1}$ will rapidly approach $E[\lambda_{N-1}]$. Moreover, our basic approximation that, for large $N$, $\lambda_{N-1} \approx D_{\min}$ is verified by the simulations shown in Figures 4 and 5.
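A simulation of this kind can be sketched in a few lines. The block below is an illustration only, assuming NumPy; it uses far fewer samples than the paper, a dense symmetric eigensolver via `numpy.linalg.eigvalsh` rather than the authors' setup, and hypothetical helper names (`sample_gnp`, `lambda_and_dmin`).

```python
import numpy as np

def sample_gnp(n, p, rng):
    """Sample the adjacency matrix of G_p(N): each link is present independently with probability p."""
    upper = np.triu(rng.random((n, n)) < p, k=1)  # strict upper triangle, no self-loops
    return (upper | upper.T).astype(float)

def lambda_and_dmin(adj):
    """Return (algebraic connectivity, minimum nodal degree) of one graph."""
    degrees = adj.sum(axis=1)
    eigenvalues = np.linalg.eigvalsh(np.diag(degrees) - adj)  # ascending order
    return float(eigenvalues[1]), int(degrees.min())

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
n, alpha = 200, 4                     # example point: N = 200, p = 4 * p_c
p = alpha * np.log(n) / n
samples = [lambda_and_dmin(sample_gnp(n, p, rng)) for _ in range(20)]

mean_lambda = sum(lam for lam, _ in samples) / len(samples)
mean_dmin = sum(d for _, d in samples) / len(samples)
```

Comparing `mean_lambda` with `mean_dmin` over a range of α gives a rough, small-sample counterpart of the comparison reported in Figures 2 to 5. Every sample also satisfies Fiedler's bound $\lambda_{N-1} \le \frac{N}{N-1} D_{\min}$, which serves as a sanity check on the eigenvalue computation.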
We found that $\min \lambda_{N-1}$ or $\max \lambda_{N-1}$ grows linearly with $D_{\min}$. Note in Figure 4 that, in the probability range around the connectivity threshold $p_c$, the minimum algebraic connectivity is always equal to zero, indicating a non-connected random graph (for details see Section 4). From Figures 4 and 5 it is clear that if the value of the algebraic connectivity is larger than zero, the minimum degree of the random graph is always larger than zero too, in line with $\{\lambda_{N-1} > 0\} \iff \{G_p(N) \text{ is connected}\}$. However, by scrutinizing only degree-related simulation results, we see that the implication $\{D_{\min} \ge 1\} \implies \{G_p(N) \text{ is connected}\}$ is not always true; only for large $N$ and certain $p$, which depends on $N$, is the implication almost surely (a.s.) correct [26]. For example, the percentage of graphs with $D_{\min} \ge 1$ that are connected increases for $G_p(50)$ from 98% for $p = p_c$ to 100% for $p = 2p_c$, while for $G_p(400)$ it increases from 99% for $p = p_c$ to 100% for $p = 2p_c$. Hence, the simulation results confirm that, for large $N$ and rather small $p = \frac{\log N}{N}$, the latter implication is a.s. an equivalence.
[Figure 2: $E[\lambda_{N-1}]$ (vertical axis, 0 to 100) versus $\alpha = p/p_c$ (horizontal axis, 1 to 20). Legend: simulations for N = 200, 400 and 800; eq. (3) for N = 200, 400 and 800.]
Fig. 2. A comparison between the estimation (3), plotted in lines, and the simulation results, plotted in markers, for the mean of the algebraic connectivity $E[\lambda_{N-1}]$ as a function of α. In the upper left corner of the figure, we show the difference between the estimation and the simulation results as a function of α.
[Figure 3: $\sigma[\lambda_{N-1}]$ (vertical axis) versus $\alpha = p/p_c$ (horizontal axis, 1 to 20). Legend: simulations for N = 200, 400 and 800; eq. (5) for N = 200, 400 and 800.]
Fig. 3. A comparison between the estimation (5), plotted in lines, and the simulation results, plotted in markers, for the standard deviation of the algebraic connectivity $\sigma[\lambda_{N-1}]$ as a function of α
Simulations also demonstrate that, for a particular fixed $\alpha = p/p_c$, the mean of the algebraic connectivity increases with the size of the random graph: a higher value of the graph size $N$ implies a higher mean of the algebraic connectivity, which in turn indicates that the probability of having a more robust graph approaches 1 as $N \to \infty$. The theorem given in Subsection 3.1, stating that $\Pr[\kappa_N = \kappa_L = D_{\min}] \to 1$ as $N \to \infty$, clarifies this observation in a slightly different way: as $N$ approaches ∞, the node and the link connectivity become as high as possible, i.e. equal to the minimum nodal degree, and therefore the graph becomes optimally connected.
[Figure 4: panel "N = 400, α = 0.1, 0.2, ..., 2"; $\min \lambda_{N-1}$ and $\max \lambda_{N-1}$ (vertical axis, 0 to 5) versus $D_{\min}$ (horizontal axis, 0 to 6).]
Fig. 4. The relationship between the minimum algebraic connectivity $\min \lambda_{N-1}$ (maximum algebraic connectivity $\max \lambda_{N-1}$) and the minimum nodal degree $D_{\min}$ in the Erdős-Rényi random graph $G_p(N)$. For each combination of $N$ and $p = \alpha p_c$, α = 0.1, 0.2, 0.3, ..., 2, we generate $10^4$ random graphs. Then, from the generated graphs, having a given value of α, we take $\min \lambda_{N-1}$ ($\max \lambda_{N-1}$) and the corresponding $D_{\min}$, resulting in one point for each considered α.
[Figure 5: panel "N = 400, α = 1, 2, 3, ..., 20"; $\min \lambda_{N-1}$ and $\max \lambda_{N-1}$ (vertical axis, 0 to 100) versus $D_{\min}$ (horizontal axis, 0 to 100).]
Fig. 5. The relationship between the minimum algebraic connectivity $\min \lambda_{N-1}$ (maximum algebraic connectivity $\max \lambda_{N-1}$) and the minimum nodal degree $D_{\min}$ in the Erdős-Rényi random graph $G_p(N)$. For each combination of $N$ and $p = \alpha p_c$, α = 1, 2, 3, ..., 20, we generate $10^4$ random graphs. Then, from the generated graphs, having a given value of α, we take $\min \lambda_{N-1}$ ($\max \lambda_{N-1}$) and the corresponding $D_{\min}$, resulting in one point for each considered α.
4 Relationship between Algebraic, Node and Link Connectivity in Random Graph of Erdős-Rényi
In the previous section we analytically estimated the behavior of the algebraic connectivity in the Erdős-Rényi random graph. In this section we analyze the relation among the three connectivity measures: the algebraic connectivity, the node connectivity and the link connectivity. We have used the polynomial time algorithm, explained in [15], to find the node and the link connectivity by solving the maximum-flow problem. The maximum-flow problem can be solved with several algorithms, e.g. Dinic, Edmonds & Karp, Goldberg, etc. If Goldberg’s push-relabel algorithm is utilized, as performed in our simulations, the link connectivity algorithm has $O(N^3\sqrt{L})$-complexity, while the node connectivity algorithm has $O(N^2 L\sqrt{L})$-complexity. We have used the LAPACK implementation of the QR-algorithm for computing all the eigenvalues of the Laplacian matrix. For linear algebra problems involving the computation of a few extreme eigenvalues of large symmetric matrices, algorithms (e.g. Lanczos) are known [3] whose run-time and storage cost is lower than that of algorithms that calculate all eigenvalues (the QR algorithm has $O(n^3)$-complexity). We simulate, for each combination of $N$ and $p$, $10^4$ independent $G_p(N)$ graphs. $N$ is 50, 100, 200 and 400 nodes and the link probability is $p = \alpha p_c$, where α varies from 1 to 10. For each combination of $N$ and $p$, we compute the minimum nodal degree
[Figure 6: $E[\kappa_N] = E[\kappa_L] = E[D_{\min}]$ (markers) and $E[\lambda_{N-1}]$ (lines) on the vertical axis (0 to 40) versus $\alpha = p/p_c$ (horizontal axis, 1 to 10), for N = 50, 100, 200 and 400.]
Fig. 6. Simulated results on the Erdős-Rényi random graph $G_p(N)$ for N = 50, 100, 200, 400 and the link probability $p = \alpha p_c$, showing the mean of the algebraic connectivity $E[\lambda_{N-1}]$, the mean of the node connectivity $E[\kappa_N]$ and the link connectivity $E[\kappa_L]$, and the mean of the minimum nodal degree $E[D_{\min}]$ as a function of α, where α = 1, 2, ..., 10. Note that for α = 1, $E[\kappa_N] = E[\kappa_L]$ but $E[\kappa_N] \ne E[D_{\min}]$.
$D_{\min}$, the algebraic, the node and the link connectivity, denoted respectively by $\lambda_{N-1}$, $\kappa_N$ and $\kappa_L$. Then, we classify graphs according to their value of α. Figure 6 shows the mean value of the algebraic connectivity $E[\lambda_{N-1}]$ as a function of increasing $\alpha = p/p_c$. In addition, Figure 6 shows the mean of the node connectivity $E[\kappa_N]$, the link connectivity $E[\kappa_L]$ and the minimum nodal degree $E[D_{\min}]$. The first conclusion we can draw after analyzing simulation data is that for all generated random graphs from $p = p_c$ to $p = 10p_c$ the convergence to a surely connected random graph, i.e. $\lambda_{N-1} > 0$, is surprisingly rapid. Results concerning connectivity percentages are plotted in Figure 7. For example, the percentage of connected random graphs with 50 nodes increases from about 39% for $p = p_c$ to 98% for $p = 2p_c$ and 99% for $p = 3p_c$, while for $p = 4p_c$ all generated graphs are connected. These results are consistent with the Erdős-Rényi asymptotic expression. For $N \to \infty$, as observable in Figure 7, the simulated data as well as the Erdős-Rényi formula confirm a well known result [17] that the random graph $G_p(N)$ is a.s. disconnected if the link density $p$ is below the connectivity threshold $p_c \sim \frac{\log N}{N}$ and connected for $p > p_c$. The second conclusion is that our results, regarding the distribution range of the algebraic connectivity and the minimum nodal degree, indeed comply with the bounds $0 \le \lambda_{N-1} \le \frac{N}{N-1} D_{\min}$: the distribution of the algebraic connectivity $\lambda_{N-1}$ is contained in the closed interval $[0, N]$, or to be more precise, $\lambda_{N-1}$ is 0 for a disconnected graph and bounded above by $\frac{N}{N-1} D_{\min}$ for all those link probabilities $p$ for which the graph is connected but not complete$^{2}$. Then, obviously $E[\lambda_{N-1}] \le E[D_{\min}]$. The third conclusion is that the distribution range of the algebraic connectivity also complies with the bound $\lambda_{N-1} \le \kappa_N$. 
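The comparison between simulated connectivity percentages and the asymptotic expression $e^{-N e^{-p(N-1)}}$ can be reproduced on a small scale. The following is a minimal sketch with the standard library only; the function names, the depth-first search used to test connectivity, the number of trials and the seed are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

def is_connected_gnp(n, p, rng):
    """Sample one G_p(N) graph and test connectivity with a depth-first search."""
    neighbours = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:       # each link is present independently
                neighbours[i].append(j)
                neighbours[j].append(i)
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in neighbours[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

def asymptotic_connected_probability(n, p):
    """Large-N expression exp(-N * exp(-p (N - 1)))."""
    return math.exp(-n * math.exp(-p * (n - 1)))

def simulated_connected_fraction(n, p, trials, seed=0):
    """Fraction of connected graphs among `trials` independent samples."""
    rng = random.Random(seed)
    return sum(is_connected_gnp(n, p, rng) for _ in range(trials)) / trials

n_nodes = 50
p_c = math.log(n_nodes) / n_nodes
formula_at_pc = asymptotic_connected_probability(n_nodes, p_c)
formula_at_2pc = asymptotic_connected_probability(n_nodes, 2 * p_c)
simulated_at_2pc = simulated_connected_fraction(n_nodes, 2 * p_c, trials=200)
```

For N = 50 the asymptotic expression gives roughly 0.34 at $p = p_c$ and roughly 0.98 at $p = 2p_c$, which can be set against the simulated percentages quoted in the text. With only 200 trials the simulated fraction is a coarse estimate.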
 Moreover, in Figure 6, for $p > p_c$ and all simulated $N$, the distributions of the node connectivity $\kappa_N$ and the link connectivity $\kappa_L$ are equal to the distribution of the minimum nodal degree $D_{\min}$ (recall that in Figure 6 for $p = p_c$, the distributions of $\kappa_N$, $\kappa_L$ and $D_{\min}$ are almost equal but not the same). Convergence here to a graph where $\kappa_N = \kappa_L = D_{\min}$ is surprisingly rapid. For example, from the simulation results plotted in Figure 8 with $p = p_c$ and the size of the random graph ranging from N = 5 to N = 400, we found that with probability approaching 1, the random graph becomes optimally connected at rather small graph sizes. For all other link probabilities, $p > p_c$, the convergence to $\kappa_N = \kappa_L = D_{\min}$ is faster (see Figure 8 for $p = 2p_c$). Hence, the simulation results show that the random graph $G_p(N)$ a.s. is constructed in such a way that deleting all the neighbors (or the links to its neighbors) of a minimum nodal degree node yields the minimum number of nodes (links) whose deletion results in a disconnected random graph. Hence, the minimum nodal degree is a valuable estimate of the number of nodes or links whose deletion results in a disconnected graph.
2 If a graph $G$ is a complete graph $K_N$, then $\lambda_{N-1} = N > D_{\min} = N - 1$.
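The link connectivity used in this section can be obtained from maximum-flow computations. The sketch below is only an illustration of that idea: it uses a simple Edmonds-Karp style augmenting-path search with unit capacities instead of the push-relabel algorithm used in the paper, it covers the link connectivity only (not the node connectivity), and the function names are hypothetical. It relies on the fact that, for a fixed source node, the smallest maximum flow to any other node equals the link connectivity.

```python
from collections import deque

def max_flow_unit(n, edges, source, sink):
    """Maximum flow between two nodes when every undirected link has capacity 1 in both directions."""
    capacity = [dict() for _ in range(n)]
    for u, v in edges:
        capacity[u][v] = 1
        capacity[v][u] = 1
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, residual in capacity[u].items():
                if residual > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Push one unit of flow along the path found.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            capacity[u][v] -= 1
            capacity[v][u] += 1
            v = u
        flow += 1

def link_connectivity(n, edges):
    """Minimum number of links whose removal disconnects a simple undirected graph."""
    if n < 2:
        return 0
    return min(max_flow_unit(n, edges, 0, t) for t in range(1, n))

def min_degree(n, edges):
    """Minimum nodal degree D_min."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return min(degree)

ring_edges = [(i, (i + 1) % 6) for i in range(6)]                 # 6-node ring
complete_edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]  # K_4
kappa_ring = link_connectivity(6, ring_edges)
kappa_complete = link_connectivity(4, complete_edges)
```

On the ring and on the complete graph the link connectivity equals the minimum degree, the situation the text calls optimally connected. Applying `link_connectivity` and `min_degree` to sampled $G_p(N)$ graphs would give a small-scale version of the percentages reported in Figure 8.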
[Figure 7: percentage of graphs for which $G_p(N)$ is connected, plotted against $\alpha = p/p_c$, for $N$ = 50, 100, 200 and 400 (simulation markers and formula lines).]
Fig. 7. Percentage of the connected Erdős-Rényi random graphs $G_p(N)$, i.e. $\lambda_{N-1} > 0$: a comparison between the simulation results, plotted as markers, and Erdős' asymptotic formula $\Pr[G_p(N) \text{ is connected}] \approx e^{-N e^{-p(N-1)}}$, plotted as lines, in the probability range $\alpha = p/p_c = 0.1, 0.2, \ldots, 2$

[Figure 8: percentage of graphs with $\kappa_N = \kappa_L = D_{\min}$, plotted against the graph size $N$, for $\alpha = p/p_c = 1$ and $\alpha = p/p_c = 2$.]
Fig. 8. Percentage of the Erdős-Rényi random graphs $G_p(N)$ with $p = p_c$ and $p = 2p_c$ in which the node connectivity $\kappa_N$, the link connectivity $\kappa_L$ and the minimum nodal degree $D_{\min}$ converge to $\kappa_N = \kappa_L = D_{\min}$ for small graph sizes $N$
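The comparison summarized in Fig. 7 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulation code: the function names (`erdos_connectivity_estimate`, `random_graph`, `is_connected`, `connected_fraction`), the seed and the number of trials are all assumptions made for the example, and connectivity is tested by graph search rather than through $\lambda_{N-1} > 0$.

```python
import math
import random


def erdos_connectivity_estimate(n, p):
    """Asymptotic estimate exp(-N * exp(-p * (N - 1))) of Pr[G_p(N) connected]."""
    return math.exp(-n * math.exp(-p * (n - 1)))


def random_graph(n, p, rng):
    """Sample G_p(N) as adjacency lists: each possible link exists with probability p."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj


def is_connected(adj):
    """Graph search from node 0; the graph is connected iff every node is reached."""
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)


def connected_fraction(n, alpha, trials, seed=0):
    """Fraction of sampled graphs that are connected at p = alpha * log(N) / N."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability only
    p = alpha * math.log(n) / n
    hits = sum(is_connected(random_graph(n, p, rng)) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    n = 50
    for alpha in (1, 2, 3):
        p = alpha * math.log(n) / n
        print(alpha, connected_fraction(n, alpha, 200),
              round(erdos_connectivity_estimate(n, p), 3))
```

For $N = 50$ the formula evaluates to roughly 0.34 at $p = p_c$ and 0.98 at $p = 2p_c$, in line with the percentages quoted in the text; the sampled fractions fluctuate with the seed and the number of trials.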
5 Conclusion

We studied the algebraic connectivity and its relation to the node and the link connectivity in the Erdős-Rényi random graph. The analytical study shows that the variance of the algebraic connectivity is much smaller than its mean, implying that, for large graph size $N$, the distribution of the algebraic connectivity will rapidly approach the mean value. Through extensive simulations, we verified that the algebraic connectivity behaves almost deterministically and is closely approximated by our basic estimate, Eq. (2). Simulations also show that, for large $N$, the distribution of the algebraic connectivity grows linearly with the minimum nodal degree, confirming our basic approximation that $\lambda_{N-1} \approx D_{\min}$.
Moreover, for a given value of $\alpha = p/p_c$, a larger graph size $N$ means a higher value of the algebraic connectivity. This translates into a higher probability of having a more connected, that is, more robust, graph as $N \to \infty$. On the other hand, the larger the graph size, the more the Erdős-Rényi random graph is constructed in such a way that deleting all the neighbors (or the links to the neighbors) of a minimum nodal degree node yields the minimum number of nodes (links) whose deletion disconnects the graph. However, the simulation results show that this optimal connectivity occurs, regardless of the link probability $p$, already at small graph sizes $N$. Hence, the larger the value of the algebraic connectivity, the better the graph's robustness to node and link failures.
References
1. Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions. Dover Publications, Inc, NY (1999)
2. Albert, R., Barabasi, A.-L.: Statistical mechanics of complex networks. Reviews of Modern Physics 74(1) (January 2002)
3. Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., van der Vorst, H.: Templates for the solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia (2000)
4. Barabasi, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999)
5. Barabasi, A.-L.: Linked. The new science of networks. Perseus, Cambridge (2002)
6. Bollobás, B.: Random graphs, 2nd edn. Cambridge University Press, Cambridge (2001)
7. Bollobás, B., Thomason, A.G.: Random graphs of small order. Random graphs 1983. Annals of Discrete Mathematics 28, 47–97 (1985)
8. Castro, M., Costa, M., Rowstron, A.: Should We Build Gnutella on a Structured Overlay. ACM SIGCOMM Computer Communications Review 34(1), 131–136 (2004)
9. Chung, F.R.K.: Spectral Graph Theory. In: Conference Board of the Mathematical Sciences, American Mathematical Society, Providence, RI, vol. 92 (1997)
10. Cvetkovic, D.M., Doob, M., Sachs, H.: Spectra of Graphs, Theory and Applications, 3rd edn. Johann Ambrosius Barth, Heidelberg (1995)
11. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks. In: From Biological Nets to the Internet and WWW, Oxford University Press, Oxford (2003)
12. Erdős, P., Rényi, A.: On random graphs. Publicationes Mathematicae 6, 290–297 (1959)
13. Fiedler, M.: Algebraic connectivity of graphs. Czechoslovak Mathematical Journal 23, 298–305 (1973)
14. Fuhrmann, T.: On the Topology of Overlay Networks. In: Proceedings of 11th IEEE International Conference on Networks (ICON), pp. 271–276 (2003)
15. Gibbons, A.: Algorithmic Graph Theory. Cambridge University Press, Cambridge (1985)
16. Jamakovic, A., Van Mieghem, P.: On the robustness of complex networks by using the algebraic connectivity, Technical Report (2007)
17. Janson, S., Knuth, D.E., Luczak, T., Pittel, B.: The birth of the giant component. Random Structures & Algorithms 4, 233–358 (1993)
18. Juhász, F.: The asymptotic behaviour of Fiedler's algebraic connectivity for random graphs. Discrete Mathematics 96, 59–63 (1991)
19. Merris, R.: Laplacian matrices of graphs: a survey. In: Linear Algebra Applications, vol. 197,198, pp. 143–176 (1994)
20. Merris, R.: A survey of graph Laplacians. In: Linear and Multilinear Algebra, vol. 39, pp. 19–31 (1995)
21. Mohar, B., Alavi, Y., Chartrand, G., Oellermann, O.R., Schwenk, A.J.: The Laplacian spectrum of graphs. Graph Theory, Combinatorics and Applications 2, 871–898 (1991)
22. Mohar, B.: Laplace eigenvalues of graphs: a survey. Discrete Mathematics 109, 198, 171–183 (1992)
23. Mohar, B., Hahn, G., Sabidussi, G.: Some applications of Laplace eigenvalues of graphs. Graph Symmetry: Algebraic Methods and Applications, NATO ASI Ser. C 497, 225–275 (1997)
24. Newman, M.: The structure and function of complex networks. SIAM Review 45, 167–256 (1990)
25. Skiena, S.: Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Addison-Wesley, Reading (1990)
26. Van Mieghem, P.: Performance Analysis of Computer Systems and Networks. Cambridge University Press, Cambridge (2006)
27. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature 393, 440–442 (1999)
28. Watts, D.J.: Small Worlds. The Dynamics of Networks Between Order and Randomness. Princeton University Press, Princeton (1999)
On the Social Cost of Distributed Selfish Content Replication

Gerasimos G. Pollatos¹, Orestis A. Telelis², and Vassilis Zissimopoulos¹

¹ Department of Informatics and Telecommunications, University of Athens, Greece
{gpol,vassilis}@di.uoa.gr
² Department of Computer Science, University of Aarhus, Denmark
[email protected]
Abstract. We study distributed content replication networks formed voluntarily by selfish autonomous users, seeking access to information objects that originate from distant servers. Each user seeks to minimize its individual access cost by replicating locally (up to constrained storage capacity) a subset of objects, and accessing the rest from the nearest possible location. We show existence of stable networks by proving existence of pure strategy Nash equilibria for a game-theoretic formulation of this situation. Social (overall) cost of stable networks is measured by the average or by the maximum access cost experienced by any user. We study socially most and least expensive stable networks by means of tight bounds on the ratios of the Price of Anarchy and Stability respectively. Although in the worst case the ratios may coincide, we identify cases where they differ significantly. We comment on simulations exhibiting occurrence of cost-efficient stable networks on average.
1 Introduction
Distributed network storage constitutes a frequently exploited modern resource for improving the services offered to internet users. A network node can replicate locally content (files or services) that is accessed frequently by local users, so as to lessen the bandwidth consumption incurred by frequent remote access. Non-locally replicated objects can be retrieved from the nearest location where they can be found replicated (neighbors/origin servers). Web caching, content distribution networks and, more recently, applications over P2P networks serve this scheme. Query protocols [1], succinct summaries [2], and distributed hash tables [3] can implement search of object replicas at remote nodes. Such systems are frequently modeled as Distributed Replication Groups [4]: a distant server distributes content represented as a set of information objects to a network of nodes sharing proximity, each having constrained local storage
Author supported by the 03ED694 research project (PENED), co-financed by the E.U. European Social Fund (75%) and the Greek Ministry of Development-GSRT (25%).
Center for Algorithmic Game Theory, funded by the Carlsberg Foundation, Denmark.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 195–206, 2008. © IFIP International Federation for Information Processing 2008
capacity. Each node is interested in a subset of the information objects, expressed as a non-negative demand weight per object. The access cost incurred per non-locally replicated object is defined as the distance from which the object is retrieved times the demand weight. The related Object (or Data) Placement problem requires placement of object replicas in nodes, subject to capacity constraints, so that the total access cost summed over all objects and nodes is minimized [4,5]. We consider networks formed voluntarily by autonomous users for shared exploitation of distributed resources (storage capacity), such as file-sharing networks. We develop a game-theoretic analysis of distributed replication networks, in which users autonomously choose a replication strategy (a subset of objects to replicate locally), each aiming to minimize its individual access cost. Through our study of the Distributed Selfish Replication game we systematically address the questions: “Does the performance of the network ever stabilize?” and “What is the overall performance of a stable network, in the absence of a central optimizing authority?” Game theory predicts that users will eventually end up (converge) in a Nash equilibrium [6], i.e. a set of (randomized in general) replication strategies, under which no user can decrease his/her individual (in general, expected) access cost by unilaterally changing his/her strategy [7]. In view of equilibria as a notion of network stabilization, we are interested in deterministic strategies (pure equilibria), because randomized strategies do not naturally fit the behavior of a rational user [7,8], who may observe inefficiency during the network's lifetime and change the replication strategy. Two natural ways of measuring the social (overall) network cost are the average cost experienced by network users (SUM of individual costs), and the maximum cost over all users (MAX).
The coordination ratio of such systems, widely known as the Price of Anarchy (PoA), was introduced in the seminal work of [9] as the worst-case ratio of the social cost of the most expensive equilibrium over the optimum centrally enforceable social cost. The PoA measures the worst-case performance loss due to the absence of a centralized coordinating authority. More recently, in [10] it was proposed that the worst-case ratio of the social cost of the least expensive equilibrium to the optimum social cost is also of interest, since some kind of coordination may be provided to the users, so that they reach the most efficient stable state. This ratio, initially termed the optimistic PoA [10], was established in [11] as the Price of Stability (PoS), i.e. the performance loss due to the requirement for stability. In this paper we prove existence of pure strategy Nash equilibria for the Distributed Selfish Replication game (section 3) over a slightly more general network model than the one originally proposed in [4], by analyzing a converging best response dynamics algorithm. We prove a tight upper bound for the PoA and matching lower bounds for the PoS (section 4), for both the SUM and MAX social costs: our PoA upper bound is valid for more general network models than the one considered. However, the simple model that we study exhibits tight worst-case lower bounds for the PoS. We identify cases where the PoS can be significantly smaller than the PoA, and comment on simulations that exhibit occurrence of cost-efficient equilibria on average (section 5). Our results generalize
and complement the work of [12], and answer a related open question posed in [8] with respect to considering constrained storage capacity in replication games.

Related Work. The study of data replication networks formed voluntarily by autonomous users with individual incentives was initiated in [8], where each replicated object could be paid for at a fixed amount (as opposed to capacity constraints). It was shown that both PoA and PoS can be unbounded in general, with the exception of special topologies. The authors considered only 0/1 demand weights in their analysis. They posed consideration of capacitated nodes along with demand weights as an open question. Let us note here that the notion of capacity should not be perceived under the narrow view of hard disk storage (which is quite affordable nowadays), but rather as a more general notion of budget, to be spent for maintenance and reliability purposes (particularly when replication concerns services). Existence of equilibria for the simple capacitated model of [4] was shown in [12], for demand weights that constitute a distribution over the set of objects requested by each node. This work however does not concern social efficiency, but rather maximization of the gain of individuals when participating in the network as opposed to staying isolated. Our work is strongly related to the recent studies on network formation games initiated by the seminal work of [13], in an effort to understand the efficiencies and deficiencies induced in a network created by autonomous users that are motivated by conflicting incentives. A wealth of recent results on such network creation models [11,10,14] have grown this direction of research into an exciting field [15] (chapter 19), which has provided useful insights with respect to the manipulation of the users' incentives towards creation of socially efficient networks.
2 Preliminaries
We consider a network involving $k$ servers and a set $N$, $|N| = n$, of client-nodes (simply referred to as nodes), each node $i$ having at its disposal integral local storage capacity $size_i$. A universe $U$ of up-to-date unit-sized information objects originates from the $k$ servers, and each node $i$ requests access to each object $o \in U$ by means of a demand weight $w_{io} \ge 0$. The request set of $i$ is $R_i = \{o \in U \mid w_{io} > 0\}$. Node $i$ may choose to replicate locally any subset of $R_i$ of cardinality at most $size_i$, and access every non-replicated object $o$ from some node (or server) $j \ne i$ at cost $d_{ij} w_{io}$, where $d_{ij}$ is the cost (distance) of accessing $j$ per unit of demand. If $P_i$ is the subset of objects replicated by $i$, a placement over the network is denoted by $X = \{P_1, \ldots, P_n\}$. Given a placement $X$, we use $d_i(o)$ to denote the distance from node $i$ of the closest node (including origin servers: they can be considered as nodes with fixed placements) replicating object $o$. Formally: $d_i(o) = \min_{j \ne i:\, o \in P_j} d_{ij}$. The individual access cost of $i$ under a placement $X$ is defined as:

\[ c_i(X) = \sum_{o \in R_i \setminus P_i} w_{io}\, d_i(o) \qquad (1) \]
Our study involves two versions of social (overall) cost: SUM and MAX, defined as $\sum_i c_i(X)$ and $\max_i c_i(X)$ respectively. We study the strategic Distributed Selfish Replication game (DSR, and DSR(0/1) when demand weights are 0 or 1) defined by the triple $\langle N, \{P_i\}, \{c_i\} \rangle$, in which every node $i$ is a player with strategy space $\{P_i\}$ consisting of all $size_i$-cardinality subsets of objects that $i$ may replicate, and utility expressed by the access cost $c_i$, which $i$ wishes to minimize. A placement $X$ is then a strategy profile for the DSR game. In what follows we use the terms node and player interchangeably. Let $X_{-i} = \{P_1, \ldots, P_{i-1}, P_{i+1}, \ldots, P_n\}$ refer to the strategy profile of all players but $i$. For the DSR game, it is easy to see that, given a strategy profile $X_{-i}$, player $i$ can determine optimally in polynomial time its best response $P_i$ to the other players' strategies, simply by solving a special 0/1 Knapsack problem that amounts to replicating the $size_i$ objects with the greatest value $w_{io} d_i(o)$.

Network Model. We consider a slightly more general network model than the one introduced by Leff, Wolf, and Yu in [4] and studied in [12], involving $k$ origin servers instead of 1. The distance of every node $i$ from server $l$ is $d_l$ for $l = 0, \ldots, k-1$, while two nodes $i$ and $j$ are at distance $d_{ij} = d_k$. We assume that distances form an ultra-metric such that $d_k < d_{k-1} < \cdots < d_0$. We refer to this network model as LWY($k$). The minimum and maximum distances appearing in the network are also referred to as $d_{\min} = d_k$ and $d_{\max} = d_0$.
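The definitions of this section translate directly into code. The following Python sketch is an illustration under stated assumptions, not part of the original paper: the function names and the data layout (`weights[i]` as a dictionary of demand weights, `origin[o]` as the index of the server hosting object `o`, `server_dist` holding $d_0, \ldots, d_{k-1}$ and `node_dist` holding $d_k$) are hypothetical choices made for the example.

```python
def nearest_distance(i, obj, placement, origin, server_dist, node_dist):
    """d_i(o): distance from node i to the closest copy of obj held elsewhere.

    Every other node is at distance node_dist (d_k); the origin server of obj
    is at distance server_dist[origin[obj]].
    """
    held_by_other = any(obj in p for j, p in enumerate(placement) if j != i)
    d_server = server_dist[origin[obj]]
    return min(node_dist, d_server) if held_by_other else d_server


def access_cost(i, placement, weights, origin, server_dist, node_dist):
    """c_i(X): sum of w_io * d_i(o) over requested objects not replicated at i."""
    return sum(
        w * nearest_distance(i, o, placement, origin, server_dist, node_dist)
        for o, w in weights[i].items()
        if w > 0 and o not in placement[i]
    )


def best_response(i, placement, weights, origin, server_dist, node_dist, size):
    """Replicate the size[i] requested objects of greatest value w_io * d_i(o)."""
    def value(o):
        return weights[i][o] * nearest_distance(
            i, o, placement, origin, server_dist, node_dist)

    requested = [o for o, w in weights[i].items() if w > 0]
    # Decreasing value; ties are broken by object name (an index rule).
    ranked = sorted(requested, key=lambda o: (-value(o), o))
    return set(ranked[: size[i]])


def social_costs(placement, weights, origin, server_dist, node_dist):
    """Return the (SUM, MAX) social costs of a placement."""
    costs = [access_cost(i, placement, weights, origin, server_dist, node_dist)
             for i in range(len(placement))]
    return sum(costs), max(costs)


if __name__ == "__main__":
    # Hypothetical LWY(1) instance: one server at distance 10, nodes at distance 1.
    weights = [{"a": 2, "b": 1}, {"a": 2, "b": 1}]
    origin = {"a": 0, "b": 0}
    placement = [{"b"}, {"a"}]
    print(social_costs(placement, weights, origin, [10], 1))
```

In the small instance at the bottom, node 0 fetches `a` from node 1 at cost 2 and node 1 fetches `b` from node 0 at cost 1, so the SUM cost is 3 and the MAX cost is 2.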
3 Existence of Pure Strategy Nash Equilibria
We introduce a simple polynomial-time algorithm that finds a feasible placement constituting a pure strategy Nash equilibrium for the DSR game on LWY($k$) networks. The algorithm is a converging iterative best response dynamics that iterates for $l = 0, \ldots, k$ over the different distance values (in non-increasing order) appearing in LWY($k$): for each distance value $d_l$, all distances at most equal to $d_l$ are redefined to exactly $d_l$, and a best response placement is computed for each node $i = 1, \ldots, n$ in turn. It is important to note that, during computation of the best response of a node $i$, a tie between the values $w_{io} d_i(o)$ and $w_{io'} d_i(o')$ of two objects is resolved on the basis of index (for otherwise we cannot guarantee convergence to equilibrium). The order in which nodes compute their best response is fixed in all iterations. We refer to this algorithm (algorithm 1) as DSR-EQ. We show that DSR-EQ finds an equilibrium placement. In our analysis we use the iteration index $l$ as a superscript on the various quantities involved.

Proposition 1. Consider the time right after computation of placements $P_i^{l-1}$ and $P_i^l$ for a fixed node $i$ in iterations $l-1$ and $l$ of algorithm DSR-EQ. Then for every pair of objects $o, o'$ with $o \in P_i^{l-1} \setminus P_i^l$ and $o' \in P_i^l \setminus P_i^{l-1}$ it is $d_i^l(o) = d_l$ and $d_i^l(o') > d_l$.

The proof is by induction on indices $i, l$ through the following claims:

Claim. (Basic Step) Proposition 1 holds for $l = 1$ (second iteration) and $i = 1$.
Algorithm 1. DSR-EQ
  for all $i, j$ set $d_{ij}^{-1} \leftarrow d_{ij}$;
  for $l = 0, \ldots, k$ do
      for all $i, j$ set $d_{ij}^{l} \leftarrow d_{ij}^{l-1}$;
      for all $i, j$: $d_{ij} \le d_l$ set $d_{ij}^{l} \leftarrow d_l$;
      for $i = 1, \ldots, n$ do
          Compute best response $P_i^l$ of $i$ with respect to $[d_{ij}^l]$
      end
  end
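Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the listing, not the authors' implementation: the names `dsr_eq` and `is_equilibrium`, the argument layout (`levels = [d_0, ..., d_k]`, `origin[o]` as the server index of object `o`) and the use of object names for tie-breaking are assumptions. Redefining every distance of at most $d_l$ to exactly $d_l$ is modeled by taking the maximum of the true distance and $d_l$.

```python
def dsr_eq(weights, size, origin, levels):
    """Sketch of DSR-EQ on an LWY(k) network.

    levels[s] for s < k is the distance of server s (d_0 > ... > d_{k-1});
    levels[k] is the inter-node distance d_k. weights[i][o] is w_io.
    """
    n, k = len(weights), len(levels) - 1
    placement = [set() for _ in range(n)]

    def dist(i, o, floor):
        # Nearest copy held elsewhere, with distances below the current
        # level raised to that level.
        others = any(o in placement[j] for j in range(n) if j != i)
        true_d = min(levels[k], levels[origin[o]]) if others else levels[origin[o]]
        return max(true_d, floor)

    for l in range(k + 1):      # distance values d_0, d_1, ..., d_k
        for i in range(n):      # same node order in every iteration
            requested = [o for o, w in weights[i].items() if w > 0]
            ranked = sorted(
                requested,
                key=lambda o: (-weights[i][o] * dist(i, o, levels[l]), o))
            placement[i] = set(ranked[: size[i]])
    return placement


def is_equilibrium(placement, weights, size, origin, levels):
    """True if no node can lower its access cost by changing its own placement."""
    n, k = len(weights), len(levels) - 1
    for i in range(n):
        def d(o):
            others = any(o in placement[j] for j in range(n) if j != i)
            return min(levels[k], levels[origin[o]]) if others else levels[origin[o]]

        values = sorted(
            (weights[i][o] * d(o) for o, w in weights[i].items() if w > 0),
            reverse=True)
        kept = sum(weights[i][o] * d(o) for o in placement[i])
        # A best response keeps the size[i] most valuable objects locally.
        if kept < sum(values[: size[i]]) - 1e-12:
            return False
    return True
```

On a two-node LWY(1) instance with levels `[10, 1]`, unit capacities and weights `{"a": 2, "b": 1}` at both nodes, both nodes pick `a` in the first iteration; in the second, node 0 switches to `b` and node 1 keeps `a`, and `is_equilibrium` accepts the resulting placement.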
Proof. Right after computation of $P_1^0$ and $P_1^1$ the following hold with respect to objects $o$ and $o'$:

\[ d_1^0(o')\, w_{1o'} \le d_1^0(o)\, w_{1o} \quad \text{and} \quad d_1^1(o')\, w_{1o'} \ge d_1^1(o)\, w_{1o} \qquad (2) \]

Equality in both cases cannot hold, because of the index rule for breaking ties. From the first inequality it must be that $w_{1o'} \le w_{1o}$, because $d_1^0(o') = d_1^0(o) = d_0$. Furthermore, in iteration $l = 1$, by redefinition of distances less than $d_0$ in the original input to $d_1$, it must be either that $d_1^1(o') = d_1^1(o) = d_0$ (both objects are found only at server distance $d_0$), or that $d_1^1(o') = d_1^1(o) = d_1$ (both objects are replicated somewhere at distance $d_1$), or that $d_1^1(o') \ne d_1^1(o)$. The first two imply $w_{1o'} \ge w_{1o}$, which contradicts $w_{1o'} \le w_{1o}$, even in the case of equality (because of the tie-breaking index rule). For the third case, if $d_1^1(o') = d_1$ and $d_1^1(o) = d_0$, we obtain that $d_1^1(o')\, w_{1o'} < d_1^1(o)\, w_{1o}$, a contradiction to the second relation of (2). Thus it must be that $d_1^1(o') = d_0 > d_1$ and $d_1^1(o) = d_1$.

Claim. (Inductive Step) Proposition 1 holds for every iteration $l$.

Proof. Assume that the statement is true for all nodes $\{i+1, \ldots, n, 1, \ldots, i-1\}$ computing their best response right after computation of $P_i^{l-1}$ and right before computation of $P_i^l$. For $o \in P_i^{l-1} \setminus P_i^l$ and $o' \in P_i^l \setminus P_i^{l-1}$ we can write:

\[ d_i^{l-1}(o')\, w_{io'} \le d_i^{l-1}(o)\, w_{io} \quad \text{and} \quad d_i^l(o')\, w_{io'} \ge d_i^l(o)\, w_{io} \qquad (3) \]

Equality in both cases cannot occur, as before. Assume that $d_i^l(o) > d_l$. Then $d_i^l(o) \ge d_{l-1}$ and, if $o$ was replicated at some node in iteration $l-1$ (i.e. $d_i^{l-1}(o) = d_{l-1}$), it must have remained so, by hypothesis. But then it should have been $d_i^l(o) = d_l$. Thus $o$ was found only at some server in iteration $l-1$. Then $d_i^l(o) = d_i^{l-1}(o)$. For $o'$ however, we have that $d_i^l(o') \le d_i^{l-1}(o')$, because of the assumed behavior of the other nodes computing best response after $P_i^{l-1}$ and before $P_i^l$: either $o'$ reduces its distance from $i$ due to redefinition of distances to $d_l$ (in case some node has it replicated) or not. Then it is:

\[ d_i^l(o')\, w_{io'} \le d_i^{l-1}(o')\, w_{io'} \le d_i^{l-1}(o)\, w_{io} = d_i^l(o)\, w_{io} \]

which is a contradiction to (3). Thus it must be $d_i^l(o) = d_l$.
Now assume that $d_i^l(o') = d_l$. Since $d_i^l(o) = d_l$ also, we have that $w_{io'} \ge w_{io}$. Furthermore, in iteration $l-1$ it must have been $d_i^{l-1}(o') \ge d_{l-1}$ and $d_i^{l-1}(o) = d_{l-1}$. Then we obtain $d_i^{l-1}(o')\, w_{io'} \ge d_i^l(o')\, w_{io'} = d_l\, w_{io'} > d_l\, w_{io} = d_i^l(o)\, w_{io} = d_i^{l-1}(o)\, w_{io}$, which is a contradiction to (3). Thus it must be $d_i^l(o') > d_l$.

Using proposition 1 it is possible to show that:

Lemma 1. At the end of iteration $l$ of algorithm 1, $\{P_i^l \mid i = 1, \ldots, n\}$ is a pure strategy Nash equilibrium with respect to the current distance matrix $[d_{ij}^l]$.

Proof. By proposition 1 we deduce that, in each iteration, a node's best response is not invalidated by other nodes' best responses, as follows. In the $l$-th iteration, for every node $j \ne i$ and an object $o \in P_j^{l-1} \setminus P_j^l$, the object remains at distance $d_l$ from $i$ (which was its distance when $i$ was computing $P_i^l$). For an object $o' \in P_j^l \setminus P_j^{l-1}$, $o'$ was at distance $> d_l$ and, therefore, not replicated in $i$ either. Replication in $j$ only decreases the value of $o'$ for $i$ (because its distance is reduced).

Theorem 1. Pure strategy Nash equilibria exist for the DSR game on LWY($k$) networks.

Proof. By lemma 1 the placement produced in the last iteration of algorithm 1 is a pure strategy Nash equilibrium with respect to a distance matrix identical to the input one.
4 Worst-Case Equilibria
We study efficiency of pure equilibria for the DSR game, by studying the worst-case ratios of the PoA and PoS. We assume that all players (nodes) have a marginal interest in joining the network, expressed by $|R_i| \ge size_i$. If this does not hold, then equilibria may be unboundedly inefficient, because of unused storage space in certain nodes. In summary we show that the PoA can be at most $\frac{d_{\max}}{d_{\min}}$ for both the SUM and the MAX social cost functions, and the PoS can be at least as much in the worst case. We compare a worst-case socially most expensive equilibrium placement $X$ against the socially optimum placement $X^*$, and use $d_i(\cdot)$ and $d_i^*(\cdot)$ for distances of objects in each placement. The analysis is performed for the individual cost of a single fixed node $i$, and therefore yields results with respect to both the SUM and the MAX social cost functions.

Lemma 2. If $X$ and $X^*$ are equilibrium and optimum placements respectively, for each node $i$ we have:

\[ c_i(X) = \sum_{o \in R_i \setminus (P_i \cup P_i^*)} w_{io}\, d_i(o) + \sum_{o \in P_i^* \setminus P_i} w_{io}\, d_i(o) \qquad (4) \]

\[ c_i(X^*) = \sum_{o \in R_i \setminus (P_i \cup P_i^*)} w_{io}\, d_i^*(o) + \sum_{o \in P_i \setminus P_i^*} w_{io}\, d_i^*(o) \qquad (5) \]
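Proof. Taking expression (1) for $c_i$ at $X$ and $X^*$ and noticing that:

\[ R_i \setminus P_i = (R_i \setminus (P_i \cup P_i^*)) \cup (P_i^* \setminus P_i) \quad \text{and} \quad (R_i \setminus (P_i \cup P_i^*)) \cap (P_i^* \setminus P_i) = \emptyset \]

\[ R_i \setminus P_i^* = (R_i \setminus (P_i \cup P_i^*)) \cup (P_i \setminus P_i^*) \quad \text{and} \quad (R_i \setminus (P_i \cup P_i^*)) \cap (P_i \setminus P_i^*) = \emptyset \]

yields the result.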
Lemma 3. If $X$ and $X^*$ are equilibrium and optimum placements respectively, for every player $i$ we have that $|P_i \setminus P_i^*| = |P_i^* \setminus P_i|$, and for every pair of objects $o \in P_i \setminus P_i^*$, $o' \in P_i^* \setminus P_i$ it is $w_{io}\, d_i(o) \ge w_{io'}\, d_i(o')$.

Proof. Since $size_i \le |R_i|$ for every player $i$, every node fills its storage in both placements, and we obtain $|P_i \setminus P_i^*| = |P_i^* \setminus P_i|$. For every pair of objects $o \in P_i \setminus P_i^*$, $o' \in P_i^* \setminus P_i$ it must be $w_{io}\, d_i(o) \ge w_{io'}\, d_i(o')$, for otherwise player $i$ would have an incentive to unilaterally deviate in placement $X$, by replacing $o$ with $o'$, thus decreasing its individual access cost.

Theorem 2. The price of anarchy for the DSR game is upper bounded by $\frac{d_{\max}}{d_{\min}}$, with respect to either of the SUM or the MAX social cost functions.
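Proof. Starting from expression (5) and using lemma 3 (each $o' \in P_i^* \setminus P_i$ being paired with a distinct $o \in P_i \setminus P_i^*$) we have:

\begin{align*}
c_i(X^*) &= \sum_{o \in R_i \setminus (P_i \cup P_i^*)} w_{io}\, d_i^*(o) + \sum_{o \in P_i \setminus P_i^*} w_{io}\, d_i^*(o) \\
&\ge \sum_{o \in R_i \setminus (P_i \cup P_i^*)} w_{io}\, d_i(o)\, \frac{d_i^*(o)}{d_i(o)} + \sum_{o' \in P_i^* \setminus P_i} w_{io'}\, \frac{d_i(o')}{d_i(o)}\, d_i^*(o) \\
&\ge \sum_{o \in R_i \setminus (P_i \cup P_i^*)} w_{io}\, d_i(o)\, \frac{d_{\min}}{d_{\max}} + \sum_{o' \in P_i^* \setminus P_i} w_{io'}\, d_i(o')\, \frac{d_{\min}}{d_{\max}}
\end{align*}

By expression (4), $c_i(X^*) \ge \frac{d_{\min}}{d_{\max}}\, c_i(X)$. Summing over all $i$ yields the result for the SUM. For the MAX notice that for some $j$: $\max_i c_i(X) = c_j(X) \le \frac{d_{\max}}{d_{\min}}\, c_j(X^*) \le \frac{d_{\max}}{d_{\min}} \max_i c_i(X^*)$.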
Note that we did not make use of any particular network topology. The bound is valid on any network topology, provided an equilibrium exists. Interestingly, matching lower bounds for the PoS can be shown on a LWY(1) network:

Proposition 2. The Price of Stability for the DSR game has a lower bound arbitrarily close to $d_{\max}/d_{\min}$ in the worst case, with respect to the SUM and MAX social cost functions, even for 1 server and 0/1 demand weights.

Proof. Take $d_{\min}$ as the inter-node distance and $d_{\max}$ as the distance of the single server. For SUM: we show that for every fixed integer $I$ the PoS is lower bounded by $\frac{I}{I+1} \frac{d_{\max}}{d_{\min}}$ asymptotically with $n$. Take $size_i = 1$ and set $R_i = \{a\}$ for $i = 1, \ldots, n-I$ and $R_i = \{b_j \mid j = 1, \ldots, n-1\}$ for $i = n-I+1, \ldots, n$. In every equilibrium
of this instance the $I$ last nodes replicate collectively at most $I$ objects from their (common) request set, while the remaining $n - I$ nodes replicate object $a$. The social cost is $I(I-1)\, d_{\min} + I(n-1-I)\, d_{\max}$. In the socially optimum placement we let nodes $i = 1, \ldots, n-I-1$ replicate $n-I-1$ objects that are not replicated in the $I$ last nodes, but we reserve one node for replicating $a$. The social cost becomes $I(n-2)\, d_{\min} + (n-1-I)\, d_{\min} = [(I+1)n - 3I - 1]\, d_{\min}$. Since $I$ is fixed, the PoS becomes $\frac{I}{I+1} \frac{d_{\max}}{d_{\min}}$ as $n$ grows. For MAX: all nodes have $size_i = 1$; let $R_1 = \{a, b\}$, whereas for $i > 1$ set $R_i = \{c\}$. Take $n \ge 3$ nodes. In equilibrium node 1 pays $d_{\max}$ for either $a$ or $b$. In the social optimum some node $i > 1$ replicates whichever of $a$, $b$ is not replicated by node 1, and pays $d_{\min}$ for $c$, while node 1 also pays $d_{\min}$.
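The arithmetic of the SUM case can be checked numerically. The sketch below simply evaluates the two cost expressions from the proof with exact rational arithmetic; the function names are hypothetical, and the distances are assumed to be integers (or `Fraction` values) so that `Fraction` accepts them.

```python
from fractions import Fraction


def equilibrium_sum_cost(n, I, d_min, d_max):
    """SUM cost of an equilibrium of the instance: I(I-1) d_min + I(n-1-I) d_max."""
    return I * (I - 1) * d_min + I * (n - 1 - I) * d_max


def optimum_sum_cost(n, I, d_min):
    """SUM cost of the optimum placement: I(n-2) d_min + (n-1-I) d_min."""
    return I * (n - 2) * d_min + (n - 1 - I) * d_min


def sum_ratio(n, I, d_min, d_max):
    """Exact ratio of equilibrium cost to optimum cost."""
    return Fraction(equilibrium_sum_cost(n, I, d_min, d_max),
                    optimum_sum_cost(n, I, d_min))


if __name__ == "__main__":
    I, d_min, d_max = 3, 1, 10
    limit = Fraction(I, I + 1) * Fraction(d_max, d_min)
    for n in (10, 100, 10_000, 1_000_000):
        print(n, float(sum_ratio(n, I, d_min, d_max)), float(limit))
```

With $I = 3$, $d_{\min} = 1$ and $d_{\max} = 10$, the ratio approaches $\frac{3}{4} \cdot 10 = 7.5$ from below as $n$ grows; choosing a larger fixed $I$ moves the limit closer to $d_{\max}/d_{\min}$.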
5 Less Expensive Equilibria
Proposition 2 states that the least expensive and the most expensive equilibria can be of the same cost in the worst case. In this section we investigate existence of less expensive equilibria than the ones indicated by the worst-case PoA and PoS. In particular, we elaborate on conditions under which the PoS can be at most $\frac{d_{\max}}{2 d_{\min}}$ for the DSR(0/1) game, and present some indicative experimental results on random instances of the DSR game, showing that the social efficiency of equilibria does not degrade much as the ratio $d_{\max}/d_{\min}$ grows. For the rest of this section's theoretical analysis we turn our attention to the DSR(0/1) game and restrict our discussion to the SUM social cost function. In proving proposition 2 we used “over-demanding” nodes, each requesting $O(n)$ times more objects than the storage capacity they offer to the group. In what follows we will assume modestly demanding nodes. We define such nodes formally through the demand ratio $q = \max_i \frac{|R_i| - size_i}{size_i}$ as being $q = O(1)$. We will also use the assumption that the absolute difference of any pair of distinct distance values in the network is at least $d_{\min}$. This is e.g. the case when the network is modeled as a graph with inter-node distance being one hop, while the nearest server is at least 2 hops away. We will compare the least expensive equilibrium $X = \{P_i\}$ against the socially optimum placement $X^* = \{P_i^*\}$, so as to obtain the maximum possible difference $c(X) - c(X^*)$. We consider objects $a^* \in P_i^* \setminus P_i$, that we call insertions, and study an imaginary procedure that transforms $X$ into $X^*$ by performing such insertions. An insertion is termed significant if it contributes to $c(X) - c(X^*)$.

Lemma 4. If $r$ significant insertions occur in comparing a least expensive equilibrium $X$ to a socially optimum placement $X^*$, and at most $n_0$ nodes benefit by any insertion, then: $c(X) - c(X^*) \le r n_0 (d_{\max} - d_{\min}) - r d_{\min}$.

Proof. Take $a^* \in P_i^* \setminus P_i$ for some node $i$.
There is an object $a \in P_i \setminus P_i^*$ evicted from $i$ due to insertion of $a^*$. We use $d_i(\cdot)$ to denote the current distance of an object from $i$. Our target is to show that every significant insertion contributes a social cost increase of at least $d_{\min}$, while potentially offering a cost decrease of at most $(d_{\max} - d_{\min})$ to some nodes. At first suppose that $w_{ia^*} = 0$: $i$ suffers an access
cost increase of at least $d_{\min}$, because $a$ has to be retrieved remotely at minimum distance $d_{\min}$. Every node $j \ne i$ with $w_{ja^*} = 1$ benefits at most $(d_{\max} - d_{\min})$ though. This is a significant insertion. Now consider the case $w_{ia^*} = 1$. Then $d_i(a) \ge d_i(a^*)$, because $X$ is an equilibrium. We examine the following cases:
1. $d_i(a) = d_i(a^*) = d_{\min}$: then $i$ does not benefit or lose, and the insertion is indifferent to every other node $j \ne i$ with $w_{ja^*} = 1$ (they also do not benefit or lose, because $a^*$ is already replicated somewhere in the network). Such an insertion is not significant.
2. $d_i(a) > d_i(a^*) = d_{\min}$: the cost of $i$ increases by at least $2d_{\min}$ (due to eviction of $a$, since $d_i(a) - d_i(a^*) \ge d_{\min}$) and decreases by at most $d_{\min}$ (due to insertion of $a^*$). Such an insertion is potentially significant.
3. $d_i(a) = d_i(a^*) > d_{\min}$: $i$ does not benefit or lose, but any node $j \ne i$ with $w_{ja} = 1$ and $w_{ja^*} = 0$ loses at least $d_i(a^*) - d_{\min} > d_{\min}$ (as long as $a$ was stored in $i$ it was retrieved by $j$ from distance $d_{\min}$). If no such node exists, then we have a new equilibrium, potentially less expensive than $X$, a contradiction. Such an insertion can be significant.
4. $d_i(a) > d_i(a^*) > d_{\min}$: after its eviction from $i$, $a$ is retrieved from distance $d_i(a)$, which exceeds $d_i(a^*)$ by at least $d_{\min}$. Thus $i$ loses at least $d_{\min}$ in this case. This insertion can be significant.
Notice that the cost increase effect of cases (2), (3) and (4) may be compensated by a re-insertion of the evicted object $a$ in some node $j \ne i$. By the previous arguments, such a re-insertion causes eviction of some object $a'$ from $j$ and therefore induces a new cost increase of at least $d_{\min}$, or a new equilibrium contradiction. Such re-insertions are insignificant, as they may form chains of “movements” of objects around the network's nodes, always carrying a cost increase, but no decrease.
If $n_0$ nodes benefit from any of the significant insertions, then assuming that all of them benefit from all the significant insertions only enlarges the difference $c(X) - c(X^*)$, and the result follows.

Lemma 5. The cost of the socially least expensive equilibrium placement $X$ for the DSR(0/1) game on a LWY($k$) network in the worst case is at least:

\[ c(X) \ge n_0 (n_0 - 1) S d_{\min} + n_0 r d_{\max} \qquad (6) \]
where $r$ is the number of significant insertions, $S$ is the minimum storage capacity, and $n_0$ is the number of nodes that benefit from all insertions.

Proof. For every node $j$ that benefits from an insertion $a^*$ it is $w_{ja^*} = 1$, and for maximum benefit $(d_{\max} - d_{\min})$, this object is paid $d_{\max}$ in the best equilibrium. In the worst case all $n_0$ nodes benefit from all $r$ significant insertions, as argued in lemma 4 (this assumption enlarges the difference $c(X) - c(X^*)$). The second term of inequality (6) is thus justified. For the first term, we notice that since all $n_0$ nodes benefit from an insertion $a^*$, then in the least expensive equilibrium $X$ every object stored in these nodes is of interest to at least $n_0$ nodes of the network. Otherwise, we show that there
may be an equilibrium of less cost: take for example an instance where every one of the $n_0$ nodes is either interested in replicas stored in the other $n_0 - 1$ of them or not, and no other nodes are interested in replicas stored in these $n_0$ nodes. If there is a node $j$ not interested in a replica $a$ stored in some node $i$, then replacing $a$ with $a^*$ in $i$ yields a less expensive equilibrium. We assume that all $n_0$ nodes benefiting from the $r$ insertions are also interested in objects replicated in them in equilibrium. This yields a lower bound of $n_0 (n_0 - 1) S d_{\min}$ cost. Now we can prove the following:

Theorem 3. The Price of Stability of the DSR(0/1) game on LWY($k$) networks with modestly demanding nodes and distance differences at least $d_{\min}$ is upper bounded by $\frac{d_{\max}}{2 d_{\min}}$.

Proof. By lemma 4:

\[ PoS \le \frac{c(X)}{c(X) + r d_{\min} - r n_0 (d_{\max} - d_{\min})} \]

This bound is a monotonically decreasing function of $c(X)$. Therefore we use for $c(X)$ the lower bound given by (6) for worst-case PoS, and obtain:

\[ PoS \le \frac{n_0 (n_0 - 1) S d_{\min} + n_0 r d_{\max}}{(r + r n_0 + n_0 (n_0 - 1) S)\, d_{\min}} \]

By the analysis of lemma 5, the number of insertions occurring is upper bounded as $r \le (q - (n_0 - 1)) S$. Taking the maximum value $r = (q - (n_0 - 1)) S$ increases the ratio for every value of $n_0$:

\[ PoS \le \frac{n_0 (n_0 - 1) S d_{\min} + n_0 (q - (n_0 - 1)) S d_{\max}}{[(q - (n_0 - 1)) S + n_0 (q - (n_0 - 1)) S + n_0 (n_0 - 1) S]\, d_{\min}} \;\Rightarrow \]

\[ PoS \le \frac{n_0 - 1}{q + \frac{q - (n_0 - 1)}{n_0}} + \frac{q - (n_0 - 1)}{q + \frac{q - (n_0 - 1)}{n_0}} \cdot \frac{d_{\max}}{d_{\min}} \]
Now notice that, since every node of the n0 accesses replicas stored in all other n0 − 1 nodes, it must be n0 − 1 ≤ q. Thus the first term can be at most 1, while dmax for q = O(1), the second term is maximized to 2d for n0 = 1. min Simulation on Random Instances. We performed simulations on random instances of the general DSR game, that provide evidence of existence of efficient equilibria on average. Fig. 1 depicts the evolution of emipirically extracted values for PoS, PoA, and an average value of coordination ratio, as dmax /dmin increases. We used an LWY(10) network of 64 nodes, with sizei = 5, and |Ri | = 10. For each node i the request set Ri was constructed by uniform random sampling of 10 objects out of a universe U of 400 objects. Demand weights were taken
On the Social Cost of Distributed Selfish Content Replication
[Figure 1: two panels, SUM Social Cost and MAX Social Cost, each plotting PoA, AVG, and PoS against dmax/dmin (horizontal axis, 0 to 200).]
Fig. 1. Evolution of PoS, PoA, and average coordination ratio (AVG) with dmax/dmin
randomly in (0, 1] for each node. We took dmin = 1 and distributed 10 servers, each hosting 40 objects, uniformly within the range [dmin, dmax] for each different value of dmax. The social optimum for SUM and MAX was obtained by solving the Object Placement Problem ILP formulation described in [5], using GLPK [16]. Each point of the diagrams was extracted using the maximum (for PoA), minimum (for PoS) and average (for AVG) social cost obtained over 1000 executions of algorithm DSR-EQ with randomly chosen node and object orderings per execution. Two main observations can be drawn from these experiments. First, the rate of growth of all coordination ratios degrades significantly as dmax/dmin increases, thus giving the impression of a stabilization of the efficiency of the system. Second, the range of different stable social cost values is quite narrow. These observations motivate a probabilistic analysis of the efficiency of equilibria, which constitutes an aspect of future work.
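The way each point of Fig. 1 is extracted can be illustrated with a minimal Python sketch. It assumes that the social costs of the sampled equilibria and the optimum social cost are already available as plain numbers; the function name `coordination_ratios` and the sample values are hypothetical and are not taken from the original experiments. Because only sampled equilibria are used, the maximum ratio can only underestimate the true PoA and the minimum ratio can only overestimate the true PoS.

```python
from statistics import mean


def coordination_ratios(equilibrium_costs, optimum_cost):
    """Return empirical (PoA, PoS, AVG) from sampled equilibrium social costs.

    equilibrium_costs: social cost of each sampled equilibrium (hypothetical input)
    optimum_cost: social cost of the optimum placement (must be positive)
    """
    if optimum_cost <= 0:
        raise ValueError("optimum_cost must be positive")
    if not equilibrium_costs:
        raise ValueError("need at least one sampled equilibrium")
    poa = max(equilibrium_costs) / optimum_cost   # worst sampled equilibrium
    pos = min(equilibrium_costs) / optimum_cost   # best sampled equilibrium
    avg = mean(equilibrium_costs) / optimum_cost  # average over the samples
    return poa, pos, avg


# Illustrative values only (not data from the paper).
costs = [12.0, 15.0, 18.0]
poa, pos, avg = coordination_ratios(costs, 10.0)
```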
6 Conclusions and Open Questions
We studied the effects of selfish behavior in a distributed replication group with restricted storage capacity per node. We proved existence of pure strategy Nash equilibria for the related strategic game for networks with multiple origin servers, thus generalizing the result of [12]. Although worst-case maximum and minimum social cost of equilibria were shown to be $d_{\max}/d_{\min}$ times the optimum social cost in the worst case, for 0/1 demand weights and certain topologies a better upper bound of $d_{\max}/(2 d_{\min})$ was shown for the Price of Stability. Simulations on random instances have shown that even the Price of Anarchy may grow much less rapidly than the ratio dmax/dmin. We were not able to find a single example without any pure equilibria when arbitrary-sized objects are involved, or disprove existence of pure equilibria for more general topologies. We consider investigation of these situations as an aspect of future work, along with enrichment of the game model with other realistic features, such as bounded node degrees.
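The two terms of the final inequality in the proof of Theorem 3 can be evaluated numerically with a small Python sketch. The function name `pos_bound_terms` and the sample arguments are hypothetical; `ratio` stands for dmax/dmin, and the sketch only evaluates the expression as written, under the constraint n0 − 1 ≤ q used in the proof. For n0 = 1 the first term vanishes and the second equals dmax/(2dmin).

```python
def pos_bound_terms(n0, q, ratio):
    """Evaluate the two terms of the final bound in the proof of Theorem 3.

    n0: number of nodes benefiting from all insertions (1 <= n0 <= q + 1)
    q: parameter q of the bound, with n0 - 1 <= q
    ratio: dmax / dmin
    Returns (first_term, second_term); their sum is the right-hand side.
    """
    if n0 < 1 or n0 - 1 > q:
        raise ValueError("need 1 <= n0 and n0 - 1 <= q")
    denom = q + (q - (n0 - 1)) / n0
    first = (n0 - 1) / denom
    second = (q - (n0 - 1)) / denom * ratio
    return first, second


# Illustrative values only: q = 3, dmax/dmin = 10.
terms_single_node = pos_bound_terms(1, 3, 10.0)   # n0 = 1
terms_all_nodes = pos_bound_terms(4, 3, 10.0)     # n0 - 1 = q
```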
References
1. Wessels, D., Claffy, K.: ICP and the Squid web cache. IEEE Journal on Selected Areas in Communications 16(3) (1998)
2. Fan, L., Cao, P., Almeida, J., Broder, A.Z.: Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking 8(3), 281–293 (2000)
3. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek, F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking 11(1), 17–32 (2003)
4. Leff, A., Wolf, J., Yu, P.S.: Replication Algorithms in a Remote Caching Architecture. IEEE Transactions on Parallel and Distributed Systems 4(11), 1185–1204 (1993)
5. Baev, I., Rajaraman, R.: Approximation algorithms for data placement in arbitrary networks. In: Proceedings of the ACM-SIAM Annual Symposium on Discrete Algorithms (SODA), pp. 661–670 (2001)
6. Nash, J.F.: Non-Cooperative games. Annals of Mathematics 54, 286–295 (1951)
7. Osborne, M.J., Rubinstein, A.: A course in game theory. MIT Press, Cambridge (1994)
8. Chun, B., Chaudhuri, K., Wee, H., Barreno, M., Papadimitriou, C.H., Kubiatowicz, J.: Selfish caching in distributed systems: a game-theoretic analysis. In: Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC) (2004)
9. Koutsoupias, E., Papadimitriou, C.H.: Worst-case Equilibria. In: Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS), pp. 404–413 (1999)
10. Anshelevich, E., Dasgupta, A., Tardos, E., Wexler, T.: Near-optimal network design with selfish agents. In: Proceedings of the ACM Annual Symposium on Theory of Computing (STOC), pp. 511–520 (2003)
11. Anshelevich, E., Dasgupta, A., Kleinberg, J.M., Tardos, E., Wexler, T., Roughgarden, T.: The Price of Stability for Network Design with Fair Cost Allocation. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 295–304 (2004)
12. Laoutaris, N., Telelis, O.A., Zissimopoulos, V., Stavrakakis, I.: Distributed Selfish Replication. IEEE Transactions on Parallel and Distributed Systems 17(12), 1401–1413 (2006)
13. Fabrikant, A., Luthra, A., Maneva, E.N., Papadimitriou, C.H., Shenker, S.: On a network creation game. In: Proceedings of the ACM Symposium on Principle of Distributed Computing (PODC), pp. 347–351 (2003)
14. Chen, H., Roughgarden, T., Valiant, G.: Designing networks with good equilibria. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA) (to appear, 2008)
15. Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.V. (eds.): Algorithmic Game Theory. Cambridge University Press, Cambridge (2007)
16. GNU Linear Programming Kit, http://www.gnu.org/software/glpk/
On Performance Evaluation of Handling Streaming Traffic in IP Networks Using TFRC Protocol

Kacper Kurowski and Halina Tarasiuk

Institute of Telecommunications, Warsaw University of Technology
ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
[email protected], [email protected]
Abstract. This paper deals with the performance evaluation of handling streaming traffic in IP best effort networks using the TFRC protocol. In our studies we check and discuss the influence of video on demand traffic and different network conditions on the TFRC congestion control mechanism. Moreover, we check TFRC performance when it shares a bottleneck with different versions of TCP. We illustrate our studies with simulation results obtained in the ns-2 simulation tool and compare them with those obtained for UDP.

Keywords: IP network, TCP-Friendly, Video on Demand, Best-effort.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 207–214, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

In recent years we have observed dynamic growth of standardization activity on the Internet towards a multi-service network based on Quality of Service (QoS) architectures, e.g. [20]. However, the current version of the Internet, i.e. the best effort network, lacks mechanisms guaranteeing QoS. Therefore, it is not possible to guarantee the target values of throughput, IP packet delay (IPTD), packet losses (IPLR) or packet delay variation (IPDV) that are required by applications [21] (e.g. audio, video on demand). Moreover, applications generate different types of traffic, e.g. streaming or elastic, and have different QoS requirements, e.g. real time applications are delay sensitive while non-real time applications are loss rate sensitive. In order to provide better service for streaming and elastic traffic (to avoid congestion collapse and starvation of elastic flows by streaming transmissions) without building QoS networks, new transport protocols such as TFRC (TCP-Friendly Rate Control [2], [15]) have been designed. TFRC is intended to provide a congestion control mechanism for streaming traffic (e.g. audio, video). The protocol is reasonably fair while competing for bandwidth with TCP flows [1]. It is also responsible for maintaining a smoothly changing sending rate, which is adjusted according to the changes in network conditions. TFRC may be implemented directly in the application or in the transport protocol [7]. The protocol is receiver-based and, to calculate the allowed transmission rate, uses the TCP Reno throughput equation derived in [11] and [10]. The performance of the TFRC congestion control mechanism has been intensively analyzed. However, in this paper we present some aspects that have not been presented before. We analyze TFRC when it carries VoD traffic under different network conditions such as the level of congestion, different RTT (Round Trip Time) values or the TCP version used as background traffic. We compare the obtained results with simulations where only UDP was used. The reason to study TFRC performance for handling VoD streams is that TFRC uses the throughput equation derived for TCP Reno to control the transmission rate and, on the other hand, VoD is a non-real time application. The paper is organized as follows: section 2 provides related works dealing with performance evaluation of TFRC; section 3 contains the studied system and problem statement; section 4 presents simulation results; section 5 concludes the paper.
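The throughput equation mentioned above can be written out as a short Python sketch, in the form given in the TFRC specification (RFC 3448 [1]). The function name `tfrc_rate` is hypothetical; the defaults b = 1 and t_RTO = 4R follow the simplifications suggested in that specification, and the sample arguments (1316-byte packets, 40 ms and 120 ms RTT) are illustrative assumptions rather than settings reported in this paper.

```python
import math


def tfrc_rate(s, rtt, p, b=1, t_rto=None):
    """Allowed sending rate in bytes per second from the TFRC throughput equation.

    s: packet size in bytes
    rtt: round trip time R in seconds
    p: loss event rate, 0 < p <= 1
    b: packets acknowledged by a single TCP acknowledgement
    t_rto: TCP retransmission timeout in seconds (defaults to 4 * rtt)
    """
    if not 0 < p <= 1:
        raise ValueError("loss event rate must be in (0, 1]")
    if t_rto is None:
        t_rto = 4 * rtt
    denom = (rtt * math.sqrt(2 * b * p / 3)
             + t_rto * (3 * math.sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))
    return s / denom


# Illustrative comparison: the allowed rate falls as loss or RTT grows.
rate_low_loss = tfrc_rate(1316, 0.04, 0.01)
rate_high_loss = tfrc_rate(1316, 0.04, 0.1)
rate_long_rtt = tfrc_rate(1316, 0.12, 0.01)
```

The inverse dependence on R in this equation is consistent with the observation in the paper that the behavior of TFRC changes as RTT increases.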
2 Related Works

There are plenty of papers evaluating the effectiveness of TFRC congestion control mechanisms: [4], [5], [16] and [17]. However, the majority of them evaluate TFRC when RED (Random Early Detection) [6] is used for buffer management. It is important to emphasize that the configuration parameters of RED have a strong influence on its performance [19]. Moreover, to the best of our knowledge RED is not widely used in the current Internet. Therefore, in this paper, similarly to [17], we consider a best effort environment where queues are drop-tail with the FIFO (First In First Out) queuing discipline. In [4] the smoothness of TFRC throughput was evaluated and compared with the results obtained for AIMD (Additive Increase Multiplicative Decrease). The simulations have shown that TFRC maintains a slowly changing sending rate and obtains better results in comparison with AIMD. The authors of [17] compared three TCP-Friendly protocols: GAIMD (Generalized Additive Increase Multiplicative Decrease), TEAR (TCP Emulation at Receivers), and TFRC in terms of friendliness, smoothness and quickness in responding to changes in network conditions. The simulations showed that TFRC maintains a smooth sending rate only when the loss rate is low and that, when losses rise above 10^-1, TFRC throughput changes rapidly and is close to zero. However, [17] did not consider the influence of RTT and traffic characteristics on the congestion control mechanism implemented in TFRC. Based on the study of the current state of the art, we can conclude that there are hardly any publications concerning the impact of VoD traffic on TFRC performance. Only [18] considered the transmission of pre-encoded media using TFRC. Its authors analyzed the impact of TFRC on user-perceived media quality. The results of their studies can be used by streaming media applications to assess the movie playback rate, which will be based on the current network conditions. We can conclude that until now researchers have focused on the evaluation and adjustment of the TFRC protocol and the verification of its friendliness with TCP flows. The impact of the type of traffic carried by TFRC has not been deeply analyzed, and simulations were mainly carried out with continuous traffic resembling an FTP application. Taking the above into account, we propose to evaluate the impact of traffic generated by a VoD server on the performance of the congestion control mechanism built into TFRC. We analyze TFRC in the best-effort environment, where the FIFO/DropTail queuing scheme is used.
3 Problem Statement

In this paper we study the impact of traffic characteristics on the TFRC congestion control mechanism and the behavior of TFRC under different network conditions. In our studies we employed popular versions of TCP that exist in the Internet [8] because TFRC is supposed to share bandwidth with them. We performed simulations with TCP Reno [13], for which the TFRC throughput equation was derived. Moreover, we considered TCP NewReno, which improves protocol performance in situations where there are multiple drops in a single congestion window [14], [3]. Furthermore, we used the most aggressive version of TCP, TCP SACK, in which the receiver explicitly informs the sender about lost packets [12]. As the first step of our studies we checked bandwidth sharing between TFRC and each of the mentioned TCP versions. We assumed that both TFRC and TCP are associated with a greedy source. We wanted to verify whether TFRC stays friendly regardless of the version of TCP that it shares bandwidth with. However, due to the space limit we present only the main conclusions of these studies. We observed that TFRC is the friendliest with TCP SACK, but its friendliness decreases as RTT increases. Therefore, we found it interesting to evaluate whether an increase in delay has a similar effect on TFRC carrying VoD traffic. Moreover, we used different versions of TCP as background traffic for VoD transmission. As reference values for the metrics measured for VoD transmission, IPTD and IPLR, we used those proposed in [21]. The reference metrics are mean IPTD = 1 s and IPLR = 10^-3. Let us recall that IPDV is not critical for VoD. For this purpose we studied the following system (see Fig. 1).
Fig. 1. Studied system for VoD transmission
Data packets coming from the source are encapsulated by MPEG-TS (MPEG Transport Stream) ISO/IEC 13818-1 and sent to receiver by UDP with transmission speed controlled by TFRC (there is no transmission rate control in case of standard UDP). Finally, we compared results obtained for TFRC and UDP carrying VoD.
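The reference values quoted above (mean IPTD = 1 s and IPLR = 10^-3) can be turned into a simple acceptance check. The sketch below is only an illustration: the constant and function names are hypothetical, and treating the reference values as upper bounds is our assumption about how they are applied. The two sample measurements are of the kind reported later in Table 1.

```python
# Reference values for VoD quoted in the text (assumed to act as upper bounds).
MAX_MEAN_IPTD_S = 1.0
MAX_IPLR = 1e-3


def meets_vod_targets(mean_iptd_s, iplr):
    """Return True if a measured (mean IPTD, IPLR) pair meets both reference values."""
    return mean_iptd_s <= MAX_MEAN_IPTD_S and iplr <= MAX_IPLR


# Illustrative measurements: one lossless run and one with IPLR = 1.13E-03.
lossless_run_ok = meets_vod_targets(0.122, 0.0)
lossy_run_ok = meets_vod_targets(0.075, 1.13e-3)
```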
4 Simulation Results

As mentioned in the previous sections, we study the impact of different RTT values and traffic characteristics on the performance of TFRC. Moreover, through experiments we verify whether using TFRC in a best effort network can bring any advantage in comparison with the currently used UDP/RTP for handling VoD applications.
We prepared the well-known single bottleneck topology (Fig. 2). It consists of two IP routers R1 and R2 (with FIFO/drop-tail queues), a number of sources (depending on the evaluated scenario) associated with TFRC, TCP or UDP senders, and a number of receivers. In Fig. 2, X denotes the link rate between routers R1 and R2, and (d11, ..., d1N), d2, and (d31, ..., d3N) denote the propagation delays introduced by the links sender – router R1, router R1 – router R2, and router R2 – receiver, respectively. We changed the link rate and delay depending on the tested scenario.
Fig. 2. Simulation topology
For the simulations we used the ns-2 implementation [9] of TFRC with several modifications required for our needs. In particular, because TFRC requires fixed packet sizes, a middle layer, MPEG-TS, was introduced to avoid sending small video packets of different sizes; as a consequence, packets of fixed size (1316 bytes) were generated by the VoD source. Moreover, we modified the TFRC source code in ns-2 to measure the delay introduced by the TFRC sender while buffering packets before transmission. To properly assess the impact of traffic characteristics on the TFRC congestion control mechanism, we used traces [22] of two different movies (“V for Vendetta” and “Cinderella Man”) with different peak to mean rates. The movies were previously captured and stored in the trace files. The first movie had much quicker changes of bit rate in comparison with the second one (on the frame level the peak/mean ratio equals 18.2 and 12.1, respectively).

4.1 Impact of TFRC on VoD Transmission

In this section we evaluate the impact of VoD traffic on the TFRC congestion control mechanism. We performed simulations in the single bottleneck topology depicted in Fig. 2. We began our studies by testing the performance of a single VoD connection in isolation. Therefore, we set the link rate between IP routers to 2 Mbps. The link is faster than the peak value of the movies’ bit rates (mean rate about 640 kbps). To test the transmission under distinct network conditions we changed the d2 delay to 20 ms and 60 ms and the router R1 buffer size to 20 and 100 packets. The simulations showed that using TFRC to carry VoD transmission slightly changes the traffic characteristics. The influence of TFRC increases with delay and is also higher for movies with quick changes of bit rate (“V for Vendetta”). As one could expect, TFRC causes an increase of end-to-end packet delay, which in the extreme case can be about 1.5 seconds higher than for UDP.
The difference between mean IPTD values for TFRC and UDP grows with the increase of the d2 delay value. The reason for greater delays for TFRC is the queuing time in the sender, which is required to shape outgoing traffic in accordance with the allowed value of the transmission rate calculated at the sender.

Table 1. Simulation results for considered movies

Movie            Metric          TFRC, d2 = 20 ms   UDP, d2 = 20 ms   TFRC, d2 = 60 ms   UDP, d2 = 60 ms
Cinderella Man   IPLR            none               1.13E-03          none               1.13E-03
                 mean IPTD [s]   0.027              0.025             0.122              0.075
                 max IPTD [s]    0.19               0.13              0.82               0.18
V for Vendetta   IPLR            8.59E-03           2.28E-02          4.30E-03           2.28E-02
                 mean IPTD [s]   0.079              0.039             0.26               0.079
                 max IPTD [s]    0.73               0.13              1.69               0.18

However, the increase of IPTD (IPTD is the sum of the packet queuing and end-to-end transfer time) leads to the decrease of IPLR for TFRC (see Table 1). We observed that the IPLR value for TFRC and UDP depends on the type of the movie transmitted (value of the peak/mean ratio) and on the delay introduced by the network (in the case of TFRC). When delay increases, IPLR decreases when TFRC is used. This was due to the fact that for greater delay, the TFRC sender waited longer before increasing the sending rate and, therefore, through the introduction of additional delay, reduced the number of lost packets. During the simulations we noticed that the upper limit for the sending rate (twice the receiving rate during the last RTT) had some negative influence on VoD traffic. This effect consisted in the introduction of additional delay even though the sender could transmit packets faster due to the availability of network resources. However, changing the upper bound for the sending rate caused an increase in IPLR. Nevertheless, the results that we obtained suggest that if there are enough resources for the transport of one VoD stream, the TFRC mechanism has a positive impact on the service, especially because of the reduction of the IPLR value in comparison to UDP (see Fig. 3). Having seen that TFRC does not disturb VoD transmission in different network conditions, we verified how it deals with scenarios where there are multiple connections. Therefore, we changed resource usage from about 50% to almost 100% by increasing the number of simultaneous VoD connections. We set the buffer size of the bottleneck (10 Mbps) router R1 to three different values: 0.5BDP, 1BDP, 2BDP (Bandwidth Delay Product, BDP [8]) and compared the results obtained for TFRC and UDP. We noted that in our studies TFRC performed especially well in the scenario where the delay introduced by the bottleneck link was low (20 ms) and buffers had small sizes (18 or 37 packets).
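The relation between the bandwidth delay product and the buffer sizes expressed in packets can be sketched in a few lines of Python. The paper does not state the RTT, the packet size including headers, or the rounding used, so everything below is an assumption: the RTT is taken as 2 · d2 = 40 ms (access link delays ignored), the packet size as the 1316-byte payload, and fractions are truncated. The function name `bdp_packets` is hypothetical. Under these assumptions the sketch yields the buffer sizes 18, 37 and 74 packets that appear in Fig. 3.

```python
def bdp_packets(link_bps, rtt_s, packet_bytes):
    """Bandwidth delay product expressed in whole packets (truncated)."""
    return int(link_bps * rtt_s / (8 * packet_bytes))


# Assumed inputs: 10 Mbps bottleneck, RTT = 2 * d2 = 40 ms, 1316-byte packets.
bdp = bdp_packets(10_000_000, 0.04, 1316)

# Buffer sizes for 0.5 BDP, 1 BDP and 2 BDP, truncated to whole packets.
buffers = [int(factor * bdp) for factor in (0.5, 1, 2)]
```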
Sample results are shown in Fig. 4. The results are presented with the 0.95 confidence interval. With the increase in buffer space or delay, TFRC and UDP obtained almost identical results. TFRC packets always had a higher value of IPTD due to the fact that packets were queued in the sender when the equation-based sending rate did not allow their transmission. We observed that the IPTD value was greater for movies with a greater peak/mean ratio such as “V for Vendetta”. This was due to the fact that quick changes in the sending rate influenced the value of the receiving rate; the receiver reported this and, as a consequence, limited the allowed sending rate. Nevertheless, compared to UDP, TFRC allows transmitting more simultaneous VoD connections which meet the reference values of IPTD and IPLR (see section 3). We obtained better results for movies with quick changes in bit rate (“V for Vendetta”). The gain of additional VoD connections decreases while the d2 delay or buffer size increases. The results that we presented in this section show that in an environment with mainly VoD traffic, the use of TFRC reduces IPLR for the tested connections and allows for simultaneous transmission of more VoD connections which meet QoS requirements. The only drawback of TFRC is the increase of IPTD.

[Figure 3: two panels plotting IPLR (logarithmic scale, 1,0E-06 to 1,0E+00) against the number of VoD connections (9 to 16) for buffer sizes of 18, 37 and 74 packets.]
Fig. 3. IPLR for V for Vendetta movie, d2 = 20 ms. Characteristics differ in buffer size expressed in packets (TFRC – Left, UDP – Right).

[Figure 4: two panels, (a) V for Vendetta and (b) Cinderella Man, plotting mean IPTD [s] (0 to 0,35) against the number of VoD connections (9 to 16) for TFRC and UDP with buffers of 18 and 74 packets.]
Fig. 4. Mean IPTD for d2 = 20 ms. Characteristics differ in buffer size expressed in packets (18, 74).

4.2 VoD Transmission Sharing Bottleneck with TCP Traffic

In the next simulations we used a single VoD source and increased the number of simultaneous FTP sessions (TCP Reno or TCP SACK). This approach is based on the fact that most network traffic is TCP. Our simulations have shown that the use of TFRC reduces the value of IPLR for VoD traffic. However, the use of TFRC allows activating only one (two for TCP Reno) more TCP connections in comparison with UDP. Sample results (IPTD and IPLR as a function of the number of TCP sources) obtained for “Cinderella Man” are presented in Fig. 5. From the graphs we see that, as in all previous simulations, IPTD increases when TFRC is used, but it remains within the boundaries defined for VoD.
[Figure 5: two panels plotting IPLR (logarithmic scale) and mean IPTD [s] against the number of TCP connections (1 to 13), with curves TFRC (RENO), UDP (RENO), TFRC (SACK), and UDP (SACK).]
Fig. 5. IPLR and mean IPTD characteristics for Cinderella Man traffic trace sharing bottleneck with TCP (IPLR – Left, mean IPTD – Right), d2 = 20 ms, X = 15 Mbps.
[Figure 6: two panels plotting IPLR (logarithmic scale) and throughput [bit/s] against the number of TCP connections (6 to 13), with curves TCP Reno (TFRC), TCP Reno (UDP), TCP SACK (TFRC), and TCP SACK (UDP).]
Fig. 6. IPLR and throughput characteristics for selected TCP flow (IPLR – Left, throughput – Right), d2 = 20 ms, X = 15 Mbps.
Because in our experiments VoD traffic coexists with TCP connections, it is important to assess the impact that VoD has on the parameters of TCP packet transfer (Fig. 6). The results presented for TCP were obtained for one selected TCP connection. From Fig. 6 we can observe that TCP connections coexisting with VoD transmitted using TFRC have slightly higher IPLR values and lower throughput. We may observe this effect only in the case where the IPLR of the VoD trace exceeds the reference values defined for VoD (see section 3). This is due to the fact that TFRC tends to obtain a fair share of the bandwidth and its transmission is more aggressive in comparison with UDP. Nevertheless, the differences were within the error margin and can be considered negligible. From the obtained results we noticed that, due to the dynamic shaping of traffic by TFRC, its packets experience higher end-to-end delay. The difference with UDP can be as high as 0.6 seconds in our studies. On the other hand, the use of TFRC for VoD reduces the packet loss rate.

5 Conclusions

In this paper, we considered the transmission of VoD traffic using the TFRC protocol. Our results show that when RTT increases TFRC is able to reduce the IPLR value. However, TFRC performs better than UDP only when the buffer size in router R1 and the end-to-end delay are small. In this case, it allows transmitting more VoD connections at the same time or running more TCP connections while meeting the reference values of IPTD and IPLR. Moreover, when TFRC is used to transmit it, VoD traffic has only a negligible influence on TCP throughput. In the future, it can be interesting to compare the effectiveness of TFRC with a shaped UDP stream and to verify the results in real networks. We would like to consider TFRC for performance evaluation studies on multi-service QoS networks.
References
1. Handley, M., et al.: TCP Friendly Rate Control (TFRC): Protocol Specification. Internet RFC 3448 (January 2003)
2. Handley, M., et al.: TCP Friendly Rate Control (TFRC): Protocol Specification, March 2007. Internet draft (2007)
3. Fall, K., Floyd, S.: Simulation-based Comparisons of Tahoe, Reno, and SACK TCP. SIGCOMM Comput. Commun. Rev. 26(3), 5–21 (1996)
4. Floyd, S., Handley, M., Padhye, J.: A Comparison of Equation-Based and AIMD Congestion Control (February 2000)
5. Floyd, S., et al.: Equation-Based Congestion Control for Unicast Applications. In: ACM SIGCOMM 2000, Stockholm (2000)
6. Floyd, S., Jacobson, V.: Random Early Detection Gateways for Congestion Avoidance. IEEE/ACM Trans. on Netw. 1(4), 397–413 (1993)
7. Kohler, E., Handley, M., Floyd, S.: Datagram Congestion Control Protocol (DCCP). Internet RFC 4340 (March 2006)
8. Lee, H., Lee, S.-h., Choi, Y.: The Influence of the Large Bandwidth-Delay Product on TCP Reno, NewReno, and SACK. In: ICOIN (Japan, January 2001)
9. The Network Simulator (NS-2), http://www.isi.edu/nsnam/ns/
10. Padhye, J., et al.: Modeling TCP Reno Performance: A Simple Model and Its Empirical Validation. IEEE/ACM Trans. on Netw. 8(1), 133–145 (2000)
11. Padhye, J., et al.: Modeling TCP Throughput: A Simple Model and its Empirical Validation. In: SIGCOMM Symp. on Commun. Architectures and Protocols, Vancouver (1998)
12. Mathis, M., Mahdavi, J., Floyd, S., Romanow, A.: TCP Selective Acknowledgement Options. Internet RFC 2018 (October 1996)
13. Allman, M., Paxson, V., Stevens, W.: TCP Congestion Control. Internet RFC 2581 (April 1999)
14. Floyd, S., Henderson, T.: The NewReno Modification to TCP’s Fast Recovery Algorithm. Internet RFC 2582 (April 1999)
15. Floyd, S., Kohler, E.: TCP Friendly Rate Control (TFRC): The Small-Packet (SP) Variant. Internet RFC 4828 (April 2007)
16. Widmer, J.: Equation-based Congestion Control, Master’s thesis, University of Mannheim (February 2000)
17. Yang, Y.R., et al.: Transient Behaviors of TCP-friendly Congestion Control Protocols. In: IEEE INFOCOM 2001, Anchorage Alaska (April 2001)
18. Xu, L., Helzer, J.: Media streaming via TFRC: An analytical study of the impact of TFRC on user-perceived media quality. Computer Networks 51(17), 4744–4764 (2007)
19. Ziegler, T., Brandauer, C., Fdida, S.: A quantitative Model for the Parameter Setting of RED with TCP traffic. In: IWQoS 2001, Germany (June 2001)
20. ITU-T Rec. Y.2001, General Overview of NGN (2004)
21. ITU-T Rec. Y.1541, Network Performance Objectives for IP-based Services (2002)
22. VoD traces used: http://tnt.tele.pw.edu.pl/include/tools/vod_traces.php
Efficient Heuristic for Minimum Cost Trees Construction in Multi-Groups Multicast

Keen-Mun Yong, Gee-Swee Poo, and Tee-Hiang Cheng

Network Technology Research Centre (NTRC)
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798
{yong0010,egspoo,ethcheng}@ntu.edu.sg
Abstract. We propose an efficient and fast heuristic algorithm (MGST) to construct a minimum cost tree for each multicast-group in multi-groups multicast. The idea is to construct just one spanning shared-tree that allows the minimum cost tree of each multicast-group to be computed from this shared-tree. MGST is compared with the low-overhead Core-Based Tree algorithm (CBT) and the approximate optimal cost Minimum Cost Tree heuristic (MCT) through simulations. The results showed that MGST constructs trees with average cost close to the approximate optimal (7% difference). The overheads incurred by MGST are also comparable to CBT and much lower than those of MCT. In addition, MGST has also been shown to construct minimum cost trees much faster than MCT and at a speed comparable to CBT. To further improve the performance, a “trace back” method is introduced to MGST and denoted as MGST-T. MGST-T yields an average cost performance of only 2% difference from the optimal, with almost the same overheads and tree construction time as MGST.

Keywords: Multicast, Multi-groups, Minimum cost tree, Shared-Tree.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 215–226, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

Due to the increasing demand for multicast services, it is not unusual to have applications that provide multicast services to a wide diversity of users. Each group of users is assigned a multicast-group within which multicasting can take place, and each request to deliver traffic within the group is known as a multicast-session. We refer to this as multi-groups multicast. Each user in the group is identified as a member of the group. The tree constructed to deliver traffic for each group must be cost-efficient for economic reasons. Thus, the problem of computing minimum cost trees in multi-groups multicast is referred to as the Multi-Groups Multicast problem (MGM). The problem of constructing a minimum cost tree for a multicast-session can be modeled as a Steiner tree problem, which is known to be NP-complete [1], and heuristics have been developed to approximate the optimal tree cost. The most well-known is the one from [2], which constructs a minimum cost tree for each multicast-session from a graph consisting of minimum cost paths between any two nodes for all the nodes. We henceforth refer to the heuristic from [2] as the Minimum Cost Tree heuristic (MCT). As MCT, like any other source-based algorithm, constructs a tree for each session from the source to its destinations, the overheads incurred and the time to construct a tree will be large if the number of destinations is large. This is not desirable, especially for applications which require the trees to be ready within a short time period. Also, the effective algorithms for dynamic membership multicast [3] proposed in [4], [5] will not work well and may incur a race condition if the rearranged tree is not ready before the arrival of the next member addition or removal event. In this paper, we propose the Multi-Groups Shared Tree algorithm (MGST) for the MGM problem. The objective is to construct minimum cost trees comparable to the optimal cost using minimal overheads and time, to provide reliable and efficient multicasting. We then further improve the cost performance of MGST and denote the result as MGST-T. The rest of the paper is organized as follows. Section 2 states the network model. Section 3 gives a brief review of CBT [6] and MCT in multi-groups multicast. Section 4 presents the proposed algorithms. The performance of the proposed algorithms is evaluated in section 5. Section 6 concludes the paper.
2 Network Model
We model the network as a graph $G = (V, E)$, where $V$ is a set of nodes and $E$ a set of edges. All nodes are assumed to be multicast-capable and to be candidate source and destination nodes. Each edge $e \in E$ connecting nodes $x$ and $y$ is denoted $e_{x,y}$ and has a non-negative cost $C(e)$ associated with it. We assume symmetric link costs. The cost of any directed tree $T = (V_T, E_T)$ in the network is given by $C(T) = \sum_{e \in E_T} C(e)$.
The set of multicast-groups is given by $Q = \{q_1, \ldots, q_x\}$, where each group consists of members given by $q_i = (m_i)$. The set of incoming multicast-session requests is given by $Y = \{y_1, \ldots, y_j\}$, and each request is expressed in the form $y_i = \{X, s_i, D_i\}$, where $s_i$ and $D_i$ are the source and its set of destinations respectively, and $X$ is the ID of the multicast-group from which the session is requested. In order to deliver traffic efficiently, a tree $T_y$ is constructed for each session $y$.
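
The notation above can be made concrete with a small sketch. The four-node network, its costs, and the names `EDGE_COST`, `tree_cost` and `session` are hypothetical illustrations, not taken from the paper; the block only shows one way to hold a symmetric-cost graph, a session request $y = \{X, s, D\}$, and the tree cost $C(T)$.

```python
# Hypothetical 4-node network; node names and costs are illustrative only.
# Each undirected edge is keyed by a frozenset, which matches the
# symmetric link cost assumption C(e_{x,y}) = C(e_{y,x}).
EDGE_COST = {
    frozenset({"a", "b"}): 2,
    frozenset({"b", "c"}): 1,
    frozenset({"a", "c"}): 4,
    frozenset({"c", "d"}): 3,
}


def tree_cost(tree_edges, edge_cost=EDGE_COST):
    """C(T): the sum of C(e) over all edges e in E_T."""
    return sum(edge_cost[frozenset(e)] for e in tree_edges)


# A session request y = {X, s, D}: group ID, source, destination set.
session = {"group": "q1", "source": "a", "destinations": {"c", "d"}}

# One possible delivery tree for that session: a-b, b-c, c-d.
tree = [("a", "b"), ("b", "c"), ("c", "d")]
print(tree_cost(tree))  # 2 + 1 + 3 = 6
```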
3 Review of CBT and MCT in Multi-Groups Multicast
3.1 Core-Based Tree Algorithm (CBT)
In CBT [6], a core node is first selected and each member of a multicast-group sends a join message to this core node to connect itself to the shared-tree constructed so far. This continues until all the members are connected via the shared-tree. Hence, each shared-tree is shared among the members of one group, and the number of core-based trees required is proportional to the number of multicast-groups involved. However, since only a single shared-tree is employed for each group, the overheads incurred and the processing time needed to ready the tree for data delivery each time a session request is made will be low. Thus, we evaluate the performance of our proposed algorithm using CBT as the benchmark in terms of the overheads incurred as well as the processing time to ready the tree for data delivery.
However, because all paths in the tree are required to pass through the core node, the selection of the core node determines the resource and cost efficiency of the tree. This means that a low cost tree cannot be guaranteed, since core node selection is complex and can only be approximated most of the time. Congestion may also occur at the core node, leading to undesirable consequences unless a very efficient queuing model can be applied there. In this paper, to minimize the cost of the tree constructed by CBT for comparison with our proposed algorithm in terms of both overheads and cost, the core node is selected as the node whose average minimum cost to the other members of the group is the smallest among all nodes. Each member of the group is then added to the core-based tree by means of its least minimum cost path until all the members are connected in the shared-tree.
3.2 Minimum Cost Tree Heuristic (MCT)
In MCT [2], the minimum cost paths between every pair of nodes are first computed. These paths make up the links of a minimum cost graph, in which every two nodes are connected by a link that corresponds to the minimum cost path between them. For each incoming multicast-session request, the tree construction starts from the source node. From the source node, the least cost link in the graph to a member node is appended to the tree constructed so far. The next least cost link from an unconnected member node to this tree is then appended, and this continues until all member nodes have been connected to the tree. The links in the obtained tree are then expanded to their respective minimum cost paths. Pruning may be required to remove any unnecessary links. It has been shown in [2] that MCT is able to obtain trees with cost very close to the optimal. Thus, we evaluate the performance of our proposed algorithm using MCT as the benchmark in terms of cost efficiency. However, a minimum cost tree needs to be constructed for each incoming session request, and the least minimum cost path must be searched for and selected to connect each member to the tree. The processing time to ready the tree for data traffic and the overheads incurred will therefore be significantly larger than those of CBT.
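
The MCT steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [2] or of the authors: the function names, the adjacency-dictionary format and the example network `ADJ` are hypothetical, and the pruning step is assumed to mean "remove cycles, then drop leaves that are neither source nor destination".

```python
import heapq


def dijkstra(adj, src):
    """Minimum path costs and predecessors from src; adj is {u: {v: cost}}."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, c in adj[u].items():
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    return dist, prev


def path_edges(prev, src, dst):
    """Edges of the stored minimum cost path from src to dst."""
    edges = []
    while dst != src:
        edges.append(frozenset((prev[dst], dst)))
        dst = prev[dst]
    return edges


def mct_tree(adj, source, dests):
    terminals = {source} | set(dests)
    sp = {t: dijkstra(adj, t) for t in terminals}
    # Step 1: starting at the source, repeatedly append the least cost link
    # of the minimum cost graph that reaches an unconnected member.
    connected, links = {source}, []
    while connected != terminals:
        _, u, v = min((sp[u][0][v], u, v)
                      for u in sorted(connected)
                      for v in sorted(terminals - connected))
        links.append((u, v))
        connected.add(v)
    # Step 2: expand every link to its minimum cost path.
    edges = set()
    for u, v in links:
        edges.update(path_edges(sp[u][1], u, v))
    # Step 3 (assumed pruning): remove cycles with Kruskal's algorithm ...
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            x = root[x]
        return x

    def cost(e):
        x, y = tuple(e)
        return adj[x][y]

    tree = set()
    for e in sorted(edges, key=lambda e: (cost(e), sorted(e))):
        x, y = tuple(e)
        if find(x) != find(y):
            root[find(x)] = find(y)
            tree.add(e)
    # ... then repeatedly drop leaves that are not source or destination.
    while True:
        degree = {}
        for e in tree:
            for n in e:
                degree[n] = degree.get(n, 0) + 1
        dead = {e for e in tree
                if any(degree[n] == 1 and n not in terminals for n in e)}
        if not dead:
            return tree
        tree -= dead


# Hypothetical network: relay x makes both destinations cheaper to reach.
ADJ = {
    "S": {"x": 2, "d1": 4},
    "x": {"S": 2, "d1": 1, "d2": 2},
    "d1": {"S": 4, "x": 1, "d2": 4},
    "d2": {"x": 2, "d1": 4},
}
session_edges = mct_tree(ADJ, "S", {"d1", "d2"})
print(sum(ADJ[a][b] for a, b in map(tuple, session_edges)))  # 2 + 1 + 2 = 5
```

Note that every session repeats the search in step 1, which is the per-session cost the paragraph above contrasts with CBT.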
4 Our Proposed Algorithms (MGST and MGST-T)
4.1 Details
Our proposed Multi-Groups Shared Tree algorithm (MGST) consists of two phases. The first phase constructs a cost-efficient shared-tree spanning all nodes. The second phase computes a minimum cost tree for each multicast-session from the shared-tree obtained in the first phase. We assume that the minimum cost path between any two nodes in the graph has been computed using Dijkstra’s algorithm and stored in the set W. MGST-T is an improvement of MGST in which a “trace back” method is introduced in the second phase.
4.1.1 Phase 1: Construction of Cost-Efficient Spanning Shared-Tree
In each iteration of the cost-efficient shared-tree construction, we consider each neighboring node of the existing shared-tree as a potential node to connect. In order to obtain a cost-efficient shared-tree, the selection function (SF), which selects the neighboring node to add to the tree in each iteration, must be chosen so that the cost to deliver traffic via the node (as a transition node) or from the node (as a source node) is low. A transition node is used to relay data. Hence, in each iteration of the shared-tree construction, the SF of a particular unconnected node j that is a neighbor of the existing shared-tree should consider the average minimum cost to deliver traffic to each node from node j, either as a transition node or as a source node.
Fig. 1. An example of constructing shared-tree in MGST
From fig.1, when the next node is to be added to the shared-tree, the SF of each potential node j needs to compute the average minimum cost to deliver traffic via node j. This consists of the average minimum cost from any of the nodes in the existing shared-tree constructed so far to node j and the average minimum cost from node j to any of the nodes not in the tree, and vice versa. Since symmetric link cost is assumed, only one direction needs to be considered. Since the unconnected nodes cannot yet be reached via the tree, the average minimum cost from node j to any of the nodes not in the tree is approximated by the average minimum path cost from j to the unconnected nodes. Let $H$ and $R$ be the set of unconnected nodes and the set of nodes connected to the existing shared-tree, respectively. Denoting by $e_{k,j}$ the minimum cost link between the existing shared-tree $T_s$ and node $j$, and by $C_m(P_{x,y})$ the cost of the minimum cost path between nodes $x$ and $y$, the selection function of an unconnected node $j$ can be expressed as
$$SF(j) = \left[ \frac{1}{|R|} \left[ C(T_s) + C(e_{k,j}) \right] + \frac{1}{|H|} \sum_{b \in H} C_m(P_{j,b}) \right] \qquad (1)$$
The first term in (1) computes the average minimum cost from any of the nodes in the existing shared-tree constructed so far to node j, and the second term computes the average minimum cost from node j to any of the nodes not in the tree. The node with the least SF value among the unconnected nodes is selected and appended to the existing shared-tree via the minimum cost link between the existing shared-tree and the node, and this continues until all nodes are appended to the tree. In each iteration of the shared-tree construction, since the structure of the existing shared-tree does not change any further, its cost is independent of the cost contributed by future iterations of the construction process. This means that for the SF computation of an unconnected node during each iteration, the existing shared-tree can be viewed as a single entity, and the first term in (1) can be replaced by the average minimum cost between the existing shared-tree and the unconnected node. Thus, (1) is further simplified as
$$SF(j) = \left[ C(e_{k,j}) + \frac{1}{|H|} \sum_{b \in H} C_m(P_{j,b}) \right] \qquad (2)$$
The final tree obtained is the cost-efficient spanning shared-tree. To select the first node that starts the construction process, observe that initially no shared-tree exists, so the first component of (2) is zero and the node with the least average minimum cost to all other nodes is selected as the first node of the shared-tree. Each node in the existing shared-tree is required to know the up-to-date configuration of the existing shared-tree so that the SF of the unconnected nodes can be computed for further tree construction. Hence, whenever a new node is added to the shared-tree, a control packet that stores the new node addition information is broadcast from the nearest on-tree neighboring node of the added node to all other nodes in the existing shared-tree. The overheads incurred by MGST in phase 1 are the total number of control packets required to complete the shared-tree construction.
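
Phase 1 can be sketched in a few lines, under several assumptions that are not taken from the paper. The function names and the example network `NET` are hypothetical. Floyd-Warshall is used in place of Dijkstra's algorithm purely for brevity. The average in (2) is taken over the unconnected nodes other than j, which is one reading of the $1/|H|$ factor. Ties in SF are broken by the cheaper connection link, as the worked example that follows describes.

```python
def all_pairs_min_cost(adj):
    """C_m(P_{x,y}) for every pair (Floyd-Warshall); adj is {u: {v: cost}}."""
    nodes = sorted(adj)
    dist = {u: {v: 0 if u == v else adj[u].get(v, float("inf"))
                for v in nodes} for u in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def build_shared_tree(adj):
    """Grow the spanning shared-tree with the simplified SF of (2)."""
    dist = all_pairs_min_cost(adj)
    unconnected = set(adj)

    def avg_to_unconnected(j):
        # Assumption: average over the unconnected nodes other than j.
        others = unconnected - {j}
        if not others:
            return 0.0
        return sum(dist[j][b] for b in others) / len(others)

    # First node: no tree exists yet, so the link term of (2) is zero.
    first = min(sorted(unconnected), key=avg_to_unconnected)
    tree_nodes, tree_edges = [first], []
    unconnected.remove(first)

    while unconnected:
        best = None
        for j in sorted(unconnected):
            on_tree = [k for k in tree_nodes if k in adj[j]]
            if not on_tree:
                continue  # SF is infinite: j is not a neighbor of the tree
            k = min(on_tree, key=lambda n: adj[j][n])
            link = adj[j][k]
            # Tuple comparison breaks SF ties by the cheaper connection link.
            key = (link + avg_to_unconnected(j), link)
            if best is None or key < best[0]:
                best = (key, k, j)
        if best is None:
            raise ValueError("network is not connected")
        _, k, j = best
        tree_edges.append((k, j))
        tree_nodes.append(j)
        unconnected.remove(j)
    return tree_edges


# Hypothetical network: a chain a-b-c-d plus one expensive link a-d.
NET = {
    "a": {"b": 1, "d": 5},
    "b": {"a": 1, "c": 1},
    "c": {"b": 1, "d": 1},
    "d": {"c": 1, "a": 5},
}
shared_tree = build_shared_tree(NET)
print(shared_tree)  # the three cheap links; the a-d link is never chosen
```

In a distributed setting each addition would also trigger the control-packet broadcast described above; this sketch only computes the resulting tree.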
Fig. 2. 8-nodes random network and its table of minimum cost path between any two nodes using Dijkstra algorithm. The numeric cost of each path is shown beside the path in the table.
Fig.2 shows an 8-node random network and the corresponding minimum cost paths between every pair of nodes, tabulated using Dijkstra’s algorithm. Fig.3 shows the stages of computing the spanning shared-tree using MGST from the network in fig.2. In the first stage, shown in fig.3(a), node G has the least SF value of 2.86. Hence, it is selected as the first node to add to the shared-tree. Since only nodes F, A and H are neighbors of G in the network, their SF values are computed while the rest are labeled infinity. Since node A has the least SF, it is selected next to be connected to the shared-tree, as shown in fig.3(b). If there is a tie in SF value, the node with the least connection link cost is selected to be connected to the shared-tree. This is repeated as shown in fig.3(c) to fig.3(f) until all the nodes are connected in the spanning shared-tree shown in fig.3(h).
Fig. 3. (a) - (g) Seven stages in producing the cost-efficient spanning shared-tree using MGST. (h) The final cost-efficient spanning shared-tree produced by MGST.
4.1.2 Phase 2: Construction of Minimum Cost Tree for Each Session
When an incoming session arrives, the source-members pair information is stored in each data packet as part of the packet’s overheads. Each packet is sent from the source via the shared-tree constructed in section 4.1.1. Upon receiving the packet, each node in the shared-tree checks whether its descendant sub-tree has any member nodes. If it does, the packet is duplicated and forwarded; otherwise no further action is taken. Fig.4 shows the trees constructed by MGST, MCT and CBT for a session with source node C and destination nodes E, G, F, A and B. The arrows in fig.4 indicate that CBT is a core-based tree with its core at G, while MGST and MCT are source-based trees. G is selected as the core for CBT because it has the least average minimum cost to the other destination nodes. It can be observed that both MGST and MCT have a tree cost of 12 units while CBT has a tree cost of 13 units.
Although it is fairly simple to obtain a minimum cost tree through pruning of the shared-tree in MGST, one disadvantage is that a minimum cost tree may not be obtained in all cases. For example, if only a handful of nodes are members, then the session-tree obtained via pruning from the spanning shared-tree may contain a large number of non-leaf nodes that are not necessary (redundant nodes) to connect the destination nodes. This leads to redundant cost in the tree and increases the tree cost instead of minimizing it. Fig.5 shows one such example, where a session has source node C and destination nodes B and F. As nodes B and F are leaf-nodes in the shared-tree obtained from MGST with respect to the source C, as shown in fig.5(a), no edges can be pruned and the total session-tree cost is 12 units. This is in stark contrast to the trees obtained from MCT and CBT, both with a session-tree cost of 5 units.
Fig. 4. An example of constructing minimum cost tree for a session with source node at C and destination nodes at node E, G, F, A, B in network given in fig.2. (a) Session-Tree produced by MGST. (b) Session-Tree produced by MCT. (c) Session-Tree produced by CBT.
Fig. 5. An example of constructing minimum cost tree for a session with source node at C and destination nodes at node F, B in network given in fig.2. (a) Session-Tree produced by MGST. (b) Session-Tree produced by MCT. (c) Session-Tree produced by CBT.
To solve this, we introduce a “trace back” method to phase 2 of MGST to trace back the minimum cost link between each node of the shared-tree and the current active tree constructed. Upon receiving the packet, each node in the shared-tree checks whether a direct minimum cost link to the current active tree exists. If so, the cost of this minimum cost link is compared to the cost of the shared-tree path between the node and the nearest “anchor” node in the direction of the source. This “anchor” node is defined as the ingress node of the shared-tree path that can be replaced by the minimum cost link to the particular node without creating any disjoint trees as a result. Hence, the “anchor” node can be the source node, the ingress of the minimum cost link, a destination node, or a node with out-degree greater than one. The minimum cost link replaces the shared-tree path if it has a smaller cost. Two extra control packets are required: one to prune the replaced path and the other to activate the replacing link. The idea is to remove redundant nodes from the tree.
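
The basic phase 2 rule (forward only towards sub-trees that contain members) can be sketched as a depth-first walk over the shared-tree. This is an illustration only: the function name, the edge-list format and the example tree are hypothetical, and the “trace back” refinement is deliberately not modeled.

```python
def session_tree(shared_edges, source, destinations):
    """Prune the spanning shared-tree down to the edges a session needs."""
    neighbours = {}
    for u, v in shared_edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    kept = []

    def visit(node, parent):
        """Return True if node or its descendant sub-tree holds a member."""
        has_member = node in destinations
        for child in sorted(neighbours.get(node, ())):
            if child != parent and visit(child, node):
                kept.append((node, child))  # duplicate and forward here
                has_member = True
        return has_member

    visit(source, None)
    return kept


# Hypothetical shared-tree a-b-c-d; the session goes from a to member c.
example_tree = [("b", "c"), ("b", "a"), ("c", "d")]
print(session_tree(example_tree, "a", {"c"}))  # the c-d edge is pruned
```

Because the walk can only remove branches without members, a member that sits at a far leaf keeps its whole shared-tree path, which is the redundancy that “trace back” is meant to address.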
Fig. 6. (a) – (d) Four stages to construct minimum cost tree in MGST-T. (d) The final session-tree for session with source at node C to destination node B and F.
Fig.6 shows an example of the “trace back” method in MGST. Starting from the shared-tree obtained in fig.3(h), F is the first node from C that has a minimum cost link connecting to the current active tree, so the cost of the path from C to F in the shared-tree and the cost of the minimum cost link between C and F are compared. Since the minimum cost link from C to F has a cost of only 3 units while the path from C to F in the shared-tree costs 7 units, the path is replaced with the minimum cost link. However, since G has another link, connected to H, besides the one to F, only the link between G and F in the path is replaced with the link between C and F, as shown in fig.6(b). The next node that has a minimum cost link connecting to the current active tree is H. As the minimum cost link between F and H costs less than the path from C to D to H, the path C to D to H is replaced with C to F to H, as shown in fig.6(c). Finally, at B, since the link from C to B has a smaller cost than the path from C to F to H to E to B, the final session-tree is obtained in fig.6(d). This tree has the same low cost of 5 units as the trees of MCT and CBT shown in fig.5(b) and (c). This is a vast improvement over the tree obtained without “trace back” in MGST shown in fig.5(a). For easier reference and distinction of the improvement made to our proposed algorithm, we denote the algorithm with the “trace back” method as MGST-T.
4.2 Complexity of MGST and MGST-T
The worst case complexity to construct the shared-tree is $O(|V|^2)$, as each node is considered iteratively for appending to the shared-tree. No further action is required other than identifying pruned edges and tracing minimum cost links when a session request arrives. By normalizing the complexity of every identification process for which the tree information is already known to $O(1)$, the complexity to set up a session-tree is $O(1)$. Thus, the total complexity of MGST is $O(|V|^2) + O(|Y|)$, where $Y$ is the total set of multicast-session requests. If, in the worst case, each node performs “trace back” in MGST-T, then the total worst case complexity of MGST-T is $O(|V|^2) + O(2|Y|)$, since again all the “trace back” links and the replaced paths are known and do not require additional searching. The worst case complexities for CBT and MCT are $\sum_{q_i \in Q} O(|q_i|^2) + O(|Y|)$ and $\sum_{y_i \in Y} O(|D_i|^2)$, respectively.
5 Performance Evaluations
5.1 Performance Metrics
5.1.1 Cost Competitiveness (CC)
It is a measure of a given algorithm’s average tree cost performance with respect to an optimal algorithm. Finding the optimal cost tree is proven to be NP-complete. As MCT constructs trees with cost close to the optimal, it is used as the benchmark for cost efficiency. Let $R = \{T_1, \ldots, T_x\}$ and $U = \{B_1, \ldots, B_x\}$ be the sets of trees constructed by a given algorithm and by MCT, respectively, for a total of $x = |Y|$ multicast-session requests. The CC of a given algorithm is expressed as
$$CC = \frac{1}{x} \sum_{y \in Y} \left( \frac{C(T_y)}{C(B_y)} \right).$$
A low CC is desired, as it means that the given algorithm is able to construct trees with cost not too far from that of the optimal cost algorithm.
5.1.2 Overheads Incurred
We define overheads incurred as the total number of control packets required to set up the tree ready for multicasting. For CBT, the overheads incurred are the total number of join messages required to construct the trees for all groups. For MCT, the overheads incurred are the total number of control packets sent to each node in the existing tree in each iteration, summed over all iterations of the construction process. For MGST, the overheads consist of the control packets required to construct the spanning shared-tree. Additional overheads are incurred for MGST-T, as mentioned earlier. The overheads incurred should be low so that fewer resources are used for setting up trees.
5.1.3 Tree Construction Processing Time
To measure the complexity of the algorithms, we measure the average processing time required to construct a tree for each multicast-session. This does not include the spanning shared-tree construction time at the initialization stage for MGST, since we are only interested in the time between the moment a session request is made and the moment its corresponding tree is ready for data transmission.
5.2 Simulation Model
The Waxman method [7] was used to generate random networks ranging from 10 to 100 nodes, with an average node outdegree of 3 to 4, to analyze the performance of MGST and MGST-T. Readers are referred to [7] for details of the method. Each link has a cost ranging from 1 to 10 units. 100 multicast-session requests are randomly generated. Each request performs multicasting in an arbitrarily selected multicast-group. The member size of the groups is said to be uniformly distributed when all groups have the same member size, and non-uniformly distributed when the groups have different member sizes. For the non-uniform distribution, we define a member size range K and use 50% of the network size as the mid-point of the range.
This means that if the range is K on a network of 100 nodes, the member size distribution ranges from 50 − K to 50 + K.
5.3 Simulation Results and Discussions
Fig.7(a) shows that CBT has an average CC of 1.2 while MGST and MGST-T have average CCs of 1.07 and 1.02 respectively. This means that MGST and MGST-T are able to construct trees with cost very close to the optimal and perform much better than CBT. MGST-T has an even lower CC than MGST because the “trace back” method removes redundant nodes from the trees to reduce their cost. Fig.7(b) shows that the overheads incurred by MGST and MGST-T are smaller than those of CBT for smaller networks and comparable to those of CBT for larger networks. This is because the spanning shared-tree gets larger as the network grows and thus requires more overheads. It is interesting to note that MGST and MGST-T incur much lower overheads than MCT (the approximately optimal cost heuristic), yet yield similar cost performance, as shown in fig.7(a).
Fig. 7. (a) CC vs. Network size. (b) Overheads Incurred in Log10 scale vs. Network size. (group size of 100 and uniform per group’s member size of 50% of network size).
Fig.8(a) shows that the average time taken for MCT to ready a session-tree increases exponentially with network size, while the increase is much gentler for MGST, MGST-T and CBT. This is due to the use of shared-trees, which allows sessions to derive their delivery trees without having to reconstruct them from scratch each time. From fig.7(a), fig.7(b) and fig.8(a), it can be concluded that MGST and MGST-T are able to construct minimum cost trees using minimal overheads and within a much shorter time. Fig.8(b) shows that as the member size per group increases, the CC of CBT increases while the CC of MGST and MGST-T decreases. This is because MGST and MGST-T construct trees out of a shared-tree that uses the average minimum cost from a node to the rest as the selection criterion to add each node iteratively. This means that we assume all other nodes are members during the shared-tree construction. Thus, as the member size increases, more member nodes contribute to the average minimum cost computation from one node to all others, which increases the accuracy of the selection used to construct a cost-efficient shared-tree. Even at smaller member sizes, the CC of MGST is comparable to that of CBT, and as the member size increases, MGST and MGST-T perform much better, with a much smaller CC than CBT. It can be observed in fig.8(b) that MGST-T has an even lower CC than MGST. This supports our argument that removing redundant nodes via the “trace back” method lowers the tree cost.
Fig. 8. (a) Average tree construction processing time vs. Network size (group size of 100 and uniform per group’s member size of 50% of network size). (b) CC vs. Uniform member size per group (group size of 100 on 100 nodes network).
Fig. 9. (a) CC vs. Non-uniform member size range per group (group size of 100 on 100 nodes network). (b) CC vs. multicast-group size (uniform per group member size of 100 on 100 nodes network).
Fig.9(a) shows that the CC performance of MGST and MGST-T is still much better than that of CBT and is close to the optimal when the member size per group is non-uniform. Fig.9(b) also shows that MGST and MGST-T perform much better than CBT when the multicast-group size varies from 10 to 100 nodes. MGST-T lies very close to CC = 1, indicating MGST-T’s consistency in constructing trees with cost close to the optimal. Again, it can be observed that MGST-T performs better than MGST due to the removal of redundant nodes in the delivery trees. From the above figures, it can be concluded that MGST and MGST-T are able to construct minimum cost trees within a very short time and with overheads comparable to the optimal.
6 Conclusion
In this paper, we proposed efficient algorithms that construct a single spanning shared-tree, which is then used to compute the tree for each incoming multicast-session. We explained the need for an algorithm that constructs cost-efficient multicast trees within a very short processing time and with as few overheads as possible. To lower the tree cost even further, “trace back” is introduced to phase 2 of MGST, and we denote this improved algorithm as MGST-T. Intensive simulations of MGST and MGST-T against an approximately optimal cost heuristic (MCT) and a low-overhead algorithm (CBT) showed that MGST and MGST-T are able to compute trees within a very short processing time, with cost comparable to that of MCT and overheads comparable to those of CBT. In conclusion, our proposed MGST and its improved version MGST-T are able to construct trees with approximately optimal cost in multi-groups multicast much faster than existing algorithms and with much lower overheads. However, it should be noted that MGST and MGST-T are limited to scenarios without delay constraints, where delay requirements are largely relaxed. With more multicast applications having strict delay requirements, it will be worthwhile to extend MGST and MGST-T to delay-constrained environments as our future work.
References
1. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum Press, New York (1972)
2. Kou, L., Markowsky, G., Berman, L.: A fast algorithm for Steiner Trees. Acta Informatica 15, 141–145 (1981)
3. Striegel, A., Manimaran, G.: A survey of QoS Multicasting Issues. IEEE Comm. Magazine, 82–85 (June 2002)
4. Raghavan, S., Manimaran, G., Murthy, C.S.R.: A Rearrangeable Algorithm for the Construction of Delay-Constrained Dynamic Multicast Trees. IEEE/ACM Transactions on Networking 7(4), 514–529 (1999)
5. Bauer, F., Verma, A.: ARIES: A rearrangeable inexpensive edge-based on-line Steiner algorithm. IEEE J. of Selected Areas in Comm. 15, 382–397 (1997)
6. Ballardie, A., Francis, P., Crowcroft, J.: Core Based Trees (CBT) and Architecture for Scalable Inter-domain Multicast Routing. In: ACM SIGCOMM 1993, August 1993, pp. 85–89 (1993)
7. Waxman, B.: Routing of multipoint connections. IEEE Journal of Selected Areas in Communication 6, 1617–1622 (1988)
Delay Bounds and Scalability for Overlay Multicast
György Dán and Viktória Fodor
ACCESS Linnaeus Centre, School of Electrical Engineering
KTH, Royal Institute of Technology
Stockholm, Sweden
{gyuri,vfodor}@ee.kth.se
Abstract. A large number of peer-to-peer streaming systems have been proposed and deployed in recent years. Yet, there is no clear understanding of how these systems scale and how multi-path and multi-hop transmission, properties of all recent systems, affect the quality experienced by the peers. In this paper we present an analytical study that considers the relationship between delay and loss for general overlays: we study the trade-off between the playback delay and the probability of missing a packet, and we derive bounds on the scalability of the systems. We use an exact model of push-based overlays to show that the bounds hold under diverse conditions: in the presence of errors, under node churn, and when using forward error correction and various retransmission schemes.
Keywords: Overlay multicast, Scalability, Delay, Large-deviation theory.
1 Introduction
Overlay multicast is promising for distributing streaming data simultaneously to a large population of users. The architectures proposed for overlay multicast (a.k.a. peer-to-peer streaming) generally fall into one of two categories: push-based or pull-based. Solutions of both categories utilize multi-path transmission. Multi-path transmission offers two advantages. First, disturbances on an overlay path lead to graceful quality degradation in the nodes. Second, the output bandwidth of the peers can be utilized more efficiently.
Push-based overlays follow the traditional approach of IP multicast: nodes are organized into multiple transmission trees and relay the data within the trees. The streaming data is divided into packets, and the packets are transmitted in round-robin order through the transmission trees, which provides path diversity for subsequent packets. The transmission trees are constructed at the beginning of the streaming session and are maintained throughout the session by a centralized or a distributed protocol. Node churn leads to the disconnection of the trees and hence to data loss, which is one of the main deficiencies of push-based overlays.
This work was in part supported by the Swedish Foundation for Strategic Research through the project Winternet.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 227–239, 2008. © IFIP International Federation for Information Processing 2008
Fig. 1. The playback delay and the time needed to connect to the overlay
The pull method (also called swarming) follows the approach of batch peer-to-peer content distribution: nodes know about a subset of all nodes (their neighbors); they both receive data from and forward data to their neighbors. No global structure is maintained, hence the scheduling of data transmissions is determined locally. Pull-based overlays are resilient to node churn because forwarding decisions are taken based on the actual neighborhood information, but their efficiency depends on the scheduling algorithm.
Several works deal with the management of push-based overlays ([1,2] and references therein) and with scheduling algorithms for pull-based overlays ([3,4] and references therein). There are also numerous proposals on how to improve the robustness of the overlays to errors using coding techniques such as forward error correction (FEC), multiple description coding (MDC) and network coding [5]. The evaluation of the proposed solutions is mostly based on simulations and small-scale measurements; the analytical modeling of overlay multicast has not received much attention.
There are a number of commercial deployments of overlay multicast, e.g. [6,7]. Commercial systems often serve hundreds of thousands of peers simultaneously [8], yet little is known about how they would behave if the number of concurrent users increased tenfold. We argue that there is a need for an analytical understanding of the performance of large systems in order to be able to design systems that can provide predictable and controllable quality under a wide range of operating conditions. The most important difference between overlay multicast systems and peer-to-peer content distribution, such as Bittorrent, is the delay aspect: data should be delivered to the nodes before their playout deadline.
The probability that data arrive before their playout deadline depends on the playback delay b: the lag between the time of the generation of a packet at the source and the time of the playback at the peers, as shown in Fig. 1. The necessary playback delay for providing good streaming quality may depend on many factors: the overlay’s architecture and size, which determine the nodes’ distances from the source; the per-hop delay distribution, the packet loss probability between the nodes and the error control solutions used; and the frequency of node departures and the time needed to reconnect to the overlay in the case of push-based overlays.
In this paper we consider two questions related to the playback delay. First, how fast does the probability of missing a packet decrease as a function of the playback delay? Second, how fast should the playback delay be increased to keep the probability of missing a packet unchanged as the overlay’s size increases? We give bounds on the decrease of the packet missing probability and on the necessary increase of the playback delay under general assumptions. Our results facilitate the choice of benchmarking metrics for the performance evaluation of overlay multicast systems.
The rest of the paper is organized as follows. Section 2 gives an overview of the related work. Section 3 presents bounds on the playback delay and the scalability of the overlays based on the foundations of large deviation theory. Section 4 discusses the delay bounds and the performance of the overlays based on an exact mathematical model of overlay multicast systems, and we conclude our work in Section 5.
2 Related Modeling Work
The trade-off between the available resources and the number of nodes that can join the overlay was studied for overlay multicast systems utilizing a single transmission tree in [9]. The first models that describe the data distribution performance of multi-tree-based overlay multicast were proposed in [10,11] and showed that these systems exhibit a phase transition when using FEC. The effect of the forwarding capacity on multi-tree-based overlays was investigated in [12] using a queuing theoretic approach, and in [13] based on a fluid model. The delay characteristics of a pull-based overlay were investigated in [4], and the authors showed an exponential relationship between the playback delay and the packet missing probability. The analytical results presented there are limited to a specific packet forwarding algorithm and to complete graphs. In [14] we presented an analytical model of push-based overlays and used the model to identify the primary sources of delay in overlay multicast. We are, however, not aware of analytical results either on the scalability of overlay multicast architectures in terms of delay or on the effects of the playback delay on the performance.
3
Delay Bounds
We model the overlay as a directed graph G = (V, E) with N = |V | vertices. The set of vertices and edges can change over time due to node churn and due to the overlay management. We chose to omit the time dimension in our notation in order to ease understanding. Let us denote by s the source of the multicast, and by Ti the spanning tree rooted at the source, through which the copies of packet i reach the nodes in V . In a push-based overlay with τ trees the Ti are predetermined by the overlay maintenance entity and ∪Ti = E. In a pull-based overlay the Ti are a result
G. Dán and V. Fodor
of local decisions taken in the nodes, such that all edges $(u, v) \in T_i$ are chosen from $E$. $E$ is maintained by the overlay maintenance entity. Let us denote by the random variable $L_i(v)$ the length of the simple path from $s$ to $v$ in $T_i$, and by the random variable $D_i(v)$ the time it takes for packet $i$ to reach node $v$ from $s$ in $T_i$. We assume that every packet reaches every node after some finite amount of time, i.e., $\lim_{b\to\infty} P(D_i(v) \le b) = 1$. Both push-based overlays with retransmissions and pull-based overlays (e.g., using the policy described in [4]) can fulfill this requirement. $D_i(v)$ is the sum of $L_i(v) \le N - 1$ per-hop delays. We denote the per-hop delays by the non-negative random variables $X_h(v)$, i.e., $D_i(v) = \sum_{h=1}^{L_i(v)} X_h(v)$. For example, in Fig. 1, $L_{400}(v) = 6$ and $L_{600}(v) = 4$, i.e., the packets reach node $v$ on different overlay paths. The distribution of the $X_h(v)$ depends on many factors, e.g., the data distribution model (time spent for coordination between nodes), the probability of losses (due to churn and network congestion), the nodes’ upload capacities, and the distance from the source (many proposed architectures place nodes with large upload capacities close to the source). The probability that node $v$ with playback delay $b$ misses an arbitrary packet $i$ is $P(D_i(v) > b)$.
3.1
Playback Delay in Stationary State
First, we consider an overlay in which $N$ is a stationary process, and consequently $L_i(v)$ is a stationary process as well. $L_i(v)$ has a finite support and the per-hop delays $X_h(v)$ follow distributions with finite moment generating functions (m.g.f.), i.e., light-tailed distributions. We are not aware of measurements that would show heavy-tailed end-to-end delay distributions, but theoretical work has shown that heavy-tailed distributions can arise in the presence of self-similar traffic [15]. We argue that the per-hop delays can be modeled with a distribution with finite m.g.f. even in this case, if the back-off scheme in use is sub-exponential, e.g., polynomial. Large delays trigger retransmission requests that cut the heavy tail of the distribution, and the resulting distribution will be geometric-like, which has a finite m.g.f. We leave the case of exponential back-off schemes, i.e., heavy-tailed per-hop delays, to be subject of future work. Given per-hop delays with finite moment generating functions, the sum of the per-hop delays has finite m.g.f. even if the per-hop delays are positively correlated, as shown by the following lemma.

Lemma 1. Given non-negative random variables $X_h$ ($h = 1 \ldots n$, $n > 0$) with marginal p.d.f. $f_h(x)$ and joint p.d.f. $f(x_1, \ldots, x_n)$ such that $E[e^{\theta X_h}] < \infty$, for $S_n = \sum_{h=1}^{n} X_h$ we have $E[e^{\theta S_n}] < \infty$.

Proof. For independent r.v.s the proof is trivial, $E[e^{\theta S_n}] = \prod_{h=1}^{n} E[e^{\theta X_h}]$. For correlated r.v.s, we prove the lemma for $n = 2$; induction can be used for $n > 2$. Let us order $X_1$ and $X_2$ such that $\int_0^x f_1(t)\,dt \le \int_0^x f_2(t)\,dt$ for $\forall x > x_0$ ($x_0 > 0$). Let us denote by $X_2^{**}$ a random variable that is distributed as $X_2$ but is in perfect positive dependence with $X_1$ (see [16] for a definition), that is $x_2 = g(x_1)$ for
Delay Bounds and Scalability for Overlay Multicast
some function $g$. $E[e^{\theta(X_1+X_2)}]$ is a convex, monotonically increasing function, hence [16]

$$E[e^{\theta S_2}] = E[e^{\theta(X_1+X_2)}] \le E[e^{\theta(X_1+X_2^{**})}]. \qquad (1)$$

Since there exists $\theta' > 0$ such that $E[e^{\theta' X_1}] < \infty$,

$$E[e^{\theta S_2}] = \int_0^\infty \int_0^\infty e^{(x_1+x_2)\theta} f(x_1, x_2)\,dx_1\,dx_2 \le \int_0^\infty e^{(x_1+g(x_1))\theta} f_1(x_1)\,dx_1 \qquad (2)$$

$$\le a(x_0, \theta) + \int_{x_0}^\infty e^{2 x_1 \theta} f_1(x_1)\,dx_1 < \infty, \qquad (3)$$

where (2) holds because of (1), (3) holds because $g(x) \le x$ for $x > x_0$ due to the ordering of $X_1$ and $X_2$, and the integral in (3) is finite for every $0 < \theta \le \theta'/2$.

Based on Lemma 1 and on results from large deviation theory [17] we can prove the following theorem.

Theorem 1. The decrease of the probability that an arbitrary node with playback delay $b$ misses an arbitrary packet in an overlay with $N$ nodes is asymptotically at least exponential in $b$ if the per-hop delays have finite m.g.f.

Proof. We use the law of total probability to express the probability of missing a packet

$$P(D_i(v) \ge b) = \sum_{l=1}^{N-1} P(D_i(v) \ge b \mid L_i(v) = l)\,P(L_i(v) = l), \qquad (4)$$
and in the following we show that the decrease of $P(D_i(v) \ge b \mid L_i(v) = l)$ is asymptotically at least exponential in $b$ for any $l$. According to Chernoff’s bound for the average $A_n$ of $n$ i.i.d. random variables $X$ (note that it is not the sum, but the average of the r.v.s)

$$P(A_n \ge x) \le e^{-nI(x)}, \qquad (5)$$

where $I(x)$ is the rate function given by $I(x) = \max_{\theta>0} [\theta x - \ln M_X(\theta)]$, and $M_X(\theta) = E[e^{X\theta}]$ is the moment generating function of $X$. Fig. 2 shows the rate functions for two distributions with different parameters. Chernoff’s bound holds for any distribution for which $M_X(\theta) < \infty$, and for $x > E[X]$. The rate function $I(x)$ is convex for scalar random variables, is monotonically increasing on $(E[X], \infty)$ and $I(E[X]) = 0$ [17]. For non-negative r.v.s $X$ with $E[X] \ge 0$, the derivative $\partial I(x)/\partial x\,|_{x_0} \ge I(x_0)/x_0$, so that we have

$$I(ax) \ge I(x) + (ax - x)\,I(x)/x = aI(x) \quad \text{for } \forall a > 1 \qquad (6)$$

and hence

$$P(A_n \ge ax) \le e^{-nI(ax)} \le e^{-naI(x)}. \qquad (7)$$
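As a numerical illustration of (5)-(7), the following sketch evaluates the rate function for one concrete choice of distribution. It assumes, purely for illustration, that X is the number of transmission attempts until the first success when each attempt fails with probability p, so that $M_X(\theta) = (1-p)e^{\theta}/(1-pe^{\theta})$ for $\theta < -\ln p$. The function names and the ternary-search maximization are our own choices and are not taken from the paper.

```python
import math


def log_mgf_geometric(theta: float, p: float) -> float:
    """ln M_X(theta) for X = number of attempts until the first success,
    where each attempt fails independently with probability p (assumption).
    Finite only for theta < -ln(p)."""
    return math.log(1.0 - p) + theta - math.log(1.0 - p * math.exp(theta))


def rate_function(x: float, p: float, iters: int = 200) -> float:
    """I(x) = max_{theta > 0} [theta * x - ln M_X(theta)].

    The objective is concave in theta, so a ternary search on
    (0, -ln p) is sufficient for this sketch."""
    lo, hi = 0.0, -math.log(p) - 1e-9

    def objective(theta: float) -> float:
        return theta * x - log_mgf_geometric(theta, p)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return max(0.0, objective(0.5 * (lo + hi)))


def chernoff_bound(n: int, x: float, p: float) -> float:
    """Upper bound e^{-n I(x)} on P(A_n >= x), as in (5)."""
    return math.exp(-n * rate_function(x, p))


if __name__ == "__main__":
    p, x, a = 0.1, 2.0, 2.0
    print("I(x)   =", rate_function(x, p))
    print("I(ax)  =", rate_function(a * x, p))
    print("a I(x) =", a * rate_function(x, p))
    print("bound on P(X >= x):", chernoff_bound(1, x, p))
```

For this distribution the printed values can be compared with inequality (6): I(ax) should be at least a times I(x).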
Fig. 2. Rate function for two distributions (a) X = a+yb where a = 1 and y has discrete uniform distribution on [1, n] and (b) geometric distribution with failure probability p
We apply (7) for the r.v. $D_i(v)$ conditioned on $L_i(v) = l$ with $n = 1$

$$P(D_i(v) \ge ab \mid L_i(v) = l) \le e^{-I_l(ab)} \le e^{-aI_l(b)}, \qquad (8)$$

where $I_l(b)$ is the rate function conditioned on $L_i(v) = l$. Eq. (8) holds for any $l$, which together with (4) proves the theorem.

The result is independent of the distribution $P(L_i(v) = l)$, and holds whenever there is enough forwarding capacity in the overlay. It is also independent of the number of packets in the stream and does not make any assumption on the graph’s connectivity or the distribution scheme; in particular, it does not assume a complete graph. The simulation results presented in [4] support our analytical result for pull-based overlays, and we show results later that support the theorem for push-based overlays.
3.2
Scalability
The question we address here is how the probability of missing a packet changes as the overlay’s size increases. The change of the overlay’s size affects the data distribution through the distribution of the Li (v), hence we focus on the question how the increase of the path length Li (v) affects the probability of missing a packet. This way we decouple the problem of scaling in terms of delay from the problem of overlay maintenance, e.g., neighbor selection: the neighbor selection algorithm, random or optimized according to some metric, influences the scaling of the path lengths Li (v). Given the scaling of the path lengths we can evaluate the scaling in terms of delay. A trivial lower bound on the scaling in terms of delay is given by the increase of E[Di (v)], which is proportional to E[Li (v)]. In the following we show that the upper bound is proportional to Li (v) as well. We consider the case when the Xh (v) are i.i.d. random variables with M (θ) < ∞. We note that there are no asymptotic results available for correlated r.v.s, but we conjecture that the following theorem holds for correlated and for non identically distributed r.v.s as long as E[Xh (v)] is bounded from above, and leave the proof to be subject of future work.
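The relation between path length and delay described above can be explored with a small Monte Carlo sketch. It assumes a hypothetical per-hop delay model that is not taken from the paper: each hop costs a fixed base delay, and every lost transmission (loss probability `loss`) adds one retransmission timeout `retx`. All names and parameter values are illustrative.

```python
import random


def sample_end_to_end_delay(hops, rng, base=0.1, loss=0.1, retx=0.3):
    """Draw one D = X_1 + ... + X_hops under the assumed per-hop model:
    X = base + retx * (number of lost attempts before the first success)."""
    total = 0.0
    for _ in range(hops):
        total += base
        while rng.random() < loss:  # each loss costs one timeout
            total += retx
    return total


def expected_delay(hops, base=0.1, loss=0.1, retx=0.3):
    """E[D] = hops * E[X]; grows linearly with the path length."""
    return hops * (base + retx * loss / (1.0 - loss))


def miss_probability(hops, playback_delay, samples=20000, seed=1):
    """Empirical estimate of P(D > b) for a path of the given length."""
    rng = random.Random(seed)
    misses = sum(
        sample_end_to_end_delay(hops, rng) > playback_delay
        for _ in range(samples)
    )
    return misses / samples


if __name__ == "__main__":
    for hops in (4, 6, 8):
        print(hops, expected_delay(hops), miss_probability(hops, 1.5))
```

Under these assumptions the mean delay is proportional to the number of hops, while the estimated missing probability for a fixed playback delay grows with the path length.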
Theorem 2. If $L_i(v) \sim O(\log N)$ then the increase of the playback delay $b$ needed to maintain the probability of missing an arbitrary packet unchanged is at most asymptotically logarithmic in $N$.

Proof. We prove the theorem by showing that the upper bound of the playback delay increases logarithmically. We look for a $d \ge 0$ such that for $a \ge 0$

$$P(D_i \ge b \mid L_i(v) = l) = P(D_i \ge b + d \mid L_i(v) = l + a). \qquad (9)$$

We use Chernoff’s bound to express the upper bounds of the probabilities

$$e^{-l\,I(b/l)} = e^{-(l+a)\,I((b+d)/(l+a))}. \qquad (10)$$

We omit the base and rearrange the exponents to get

$$l \left( I\!\left(\frac{b}{l}\right) - I\!\left(\frac{b+d}{l+a}\right) \right) = a\,I\!\left(\frac{b+d}{l+a}\right). \qquad (11)$$

The right hand side of (11) is always positive. As the rate function is convex and monotonically increasing on $(E[X_h(v)], \infty)$ we have the condition

$$\frac{b+d}{l+a} \le \frac{b}{l}, \qquad (12)$$

which can be rearranged to

$$d \le \frac{ba}{l}. \qquad (13)$$

If $L_i(v) = O(\log N)$ then as the overlay grows from $N_1$ to $N_2$ we have $a \sim \log(N_2/N_1)$ and this proves the theorem.
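The bound (13) can be checked numerically. The sketch below assumes, only for illustration, exponentially distributed per-hop delays with unit mean, for which the rate function has the closed form $I(x) = x - 1 - \ln x$ for $x > 1$. The function names are hypothetical.

```python
import math


def rate_exponential(x: float, mean: float = 1.0) -> float:
    """Rate function of an exponential r.v. with the given mean
    (assumed per-hop delay distribution): I(x) = r - 1 - ln r, r = x/mean."""
    r = x / mean
    return r - 1.0 - math.log(r) if r > 1.0 else 0.0


def chernoff_miss_bound(b: float, l: int, mean: float = 1.0) -> float:
    """Upper bound e^{-l I(b/l)} on P(D >= b | L = l) for l i.i.d. hops."""
    return math.exp(-l * rate_exponential(b / l, mean))


def sufficient_delay_increase(b: float, l: int, a: int) -> float:
    """Largest d allowed by (13): d <= b * a / l."""
    return b * a / l


if __name__ == "__main__":
    b, l, a = 20.0, 5, 2
    d = sufficient_delay_increase(b, l, a)
    print("d =", d)
    print("bound before growth:        ", chernoff_miss_bound(b, l))
    print("bound after growth, delay b:", chernoff_miss_bound(b, l + a))
    print("bound after growth, b + d:  ", chernoff_miss_bound(b + d, l + a))
```

With d = ba/l the ratio (b+d)/(l+a) equals b/l, so the Chernoff bound for the longer path is no larger than the original one, whereas keeping the playback delay at b lets the bound grow.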
In general, (12) shows that it is sufficient to increase the playback delay at the same pace as the depth of the spanning trees grows in order to maintain the probability of missing a packet unchanged. E.g., if the nodes’ distances from the source grow as a linear function of the overlay’s size, then the playback delay should be increased in direct proportion to the growth of the overlay to keep the packet missing probability constant. Consequently, in order to show that an overlay scales as O(logN ) in terms of playback delay, it is enough to show that Li (v) = O(logN ). Even though this result is in accordance with one’s expectations, it is not straightforward. Unfortunately, the converse of the theorem cannot be proved: there is no upper bound on the increase of the packet missing probability for constant playback delay as the overlay’s size grows, because the increase depends on the shape of the rate function. These asymptotic results indicate that the exponential decrease of the packet missing probability as a function of the playback delay is not a good measure of the efficiency of a scheduling scheme in overlay multicast: all scheduling schemes that manage to distribute the data to all nodes have this property under the assumption that the per-hop-delays have finite m.g.f. The scalability with respect to the overlay’s size is however a good measure: Theorem 2 shows that in a
scalable overlay the playback delay does not have to be increased faster than the logarithm of the increase of the overlay’s size to keep the packet missing probability constant. While the exponential decrease was shown via simulations for a specific scheduling algorithm in [4], we are not aware of any scheduling algorithm for pull-based systems with analytical results on its scalability.
4
Numerical Results
In the following we present numerical results obtained using an exact model of multi-tree-based overlays presented and validated in [14], and show that the bounds derived using the large deviation approach hold in the presence of various retransmission schemes and FEC.
4.1
System Description
We denote the number of trees in the overlay by τ. We assume the existence of a tree maintenance entity (centralized [18] or decentralized [19,20]) that finds suitable predecessors for arriving nodes and for nodes that lose their predecessors due to node churn or preemption [1,21]. We denote by Lm (v) the distance of node v from the source in tree m, and say that node v is in level l of tree m if Lm (v) = l. Packet i is distributed in tree m = (i mod τ) + 1. To simplify the notation, we introduce the notion of stripe, and say that packet i belongs to stripe m if it is distributed in tree m.

We consider two forms of error control: forward error correction and retransmissions. When forward error correction (FEC) is used, the source adds c redundant packets to every k packets, resulting in a block length of n = k + c. We denote this FEC scheme by FEC(n,k). Once a node receives at least k packets of a block of n packets, it may recover the remaining c packets, and forwards the reconstructed packets if necessary. Block based FEC can be used to implement PET and the MDC scheme considered in [18], where different blocks (layers) of data are protected with different FEC codes: the probability of reception in the different layers depends on the strength of the FEC codes protecting them.

We consider three retransmission schemes. A node that detects a packet loss in stripe m requests the retransmission of the packet by one of these three strategies:
– (RF) from its predecessor in tree m. If the loss is due to node churn then the node will have to wait until a new predecessor is found.
– (RB) from another node that is forwarding packets in tree m in the same level as the node’s actual predecessor. This scheme assumes that every node maintains a list of backup nodes, but we do not model the overhead of maintaining such a list.
– (RS) from a predecessor in another tree. A predecessor in another tree is likely to be far away from the source in tree m, hence the retransmission might take longer than using a backup list.
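The stripe assignment and the FEC(n,k) recovery rule described above can be sketched as follows. The recovery probability assumes that the packets of a block are lost independently with probability p, which is a simplification made for this example only and is not the loss model of [14]. The function names are hypothetical.

```python
from math import comb


def stripe_of(packet_index: int, num_trees: int) -> int:
    """Tree (stripe) in which packet i is distributed: m = (i mod tau) + 1."""
    return (packet_index % num_trees) + 1


def block_recoverable_probability(n: int, k: int, p: float) -> float:
    """Probability that at least k of the n packets of an FEC(n,k) block
    arrive, assuming independent losses with probability p (assumption)."""
    return sum(
        comb(n, received) * (1.0 - p) ** received * p ** (n - received)
        for received in range(k, n + 1)
    )


if __name__ == "__main__":
    tau = 4
    print([stripe_of(i, tau) for i in range(8)])
    print(block_recoverable_probability(4, 3, 0.1))
```

With τ = 4 the packets cycle through stripes 1 to 4, and under the independence assumption an FEC(4,3) block can be decoded whenever at most one of its four packets is lost.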
For the RF and the RB strategies, the level of node $v$ in the tree in which packet $i$ should be distributed ($L_{i \bmod \tau}(v)$) is the same as the distance of $v$ in the spanning tree $T_i$ through which the copies of packet $i$ reach the nodes ($L_i(v)$), i.e., $L_i(v) = L_{i \bmod \tau}(v)$. For the RS strategy $L_i(v) \ge L_{i \bmod \tau}(v)$, i.e., due to the retransmissions the spanning tree $T_i$ can be deeper than the trees maintained by the overlay maintenance entity.
4.2
System Parameters
We model the propagation delay Dp by a normal distribution truncated at 180 ms with mean and variance (E[Dp ] = 67 ms, σDp = 21 ms) extracted from a transit-stub network of $10^4$ nodes generated with the GT-ITM topology generator [22]. We consider the streaming of a B = 400 kbps data stream, and the capacity of the source node’s output link is C(s) = 100 Mbps. The outdegree of the source, Os , is set to 50 throughout the paper for easy comparison and to ensure that the overlay is feasible for all considered values of the number of trees [12], though the particular value of Os does not affect the validity of our conclusions. The packet size is 1410 bytes. The distribution of the nodes’ output capacities (Cv ) and outdegrees (Ov ) is as in [4] and is shown in Table 1. Since the effect of the input capacity of the nodes is small on the results [14], we consider 10 Mbps for all nodes.

Table 1. Distribution of node output capacities and outdegrees

Ratio   15%       25%       40%        20%
Cv      10 Mbps   1 Mbps    384 kbps   128 kbps
Ov      2.5τ      2τ        0.75τ      0.25τ
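A few derived quantities follow directly from these parameters; the sketch below computes them. It encodes Table 1 as a list of tuples, treats σ = 21 ms as a standard deviation, and truncates the propagation delay to the interval [0, 180] ms by rejection sampling. The lower truncation at zero, the data layout and the function names are assumptions made for this example.

```python
import random

# (fraction of nodes, output capacity in kbps, outdegree as a multiple of tau)
NODE_CLASSES = [
    (0.15, 10_000, 2.5),
    (0.25, 1_000, 2.0),
    (0.40, 384, 0.75),
    (0.20, 128, 0.25),
]


def mean_outdegree(tau: int) -> float:
    """Average outdegree over the node population of Table 1."""
    return tau * sum(ratio * factor for ratio, _, factor in NODE_CLASSES)


def transmission_time_ms(capacity_kbps: float, packet_bytes: int = 1410) -> float:
    """Time to push one packet onto a link: bits / (kbit/s) gives ms."""
    return packet_bytes * 8 / capacity_kbps


def sample_propagation_delay_ms(rng, mean=67.0, std=21.0, upper=180.0):
    """Normal propagation delay, rejected until it falls in [0, upper]."""
    while True:
        delay = rng.gauss(mean, std)
        if 0.0 <= delay <= upper:
            return delay


if __name__ == "__main__":
    print("mean outdegree for tau = 4:", mean_outdegree(4))
    for _, capacity, _ in NODE_CLASSES:
        print(capacity, "kbps:", transmission_time_ms(capacity), "ms per packet")
    print(sample_propagation_delay_ms(random.Random(0)))
```

For example, a 1410-byte packet occupies a 128 kbps uplink for about 88 ms, which is of the same order as the mean propagation delay.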
The inter-arrival times of nodes joining the overlay are exponentially distributed; this assumption is supported by several measurement studies, e.g., [23]. The session holding times M follow the log-normal distribution with mean holding time E[M ] = 306 s (μ = 4.93, σ = 1.26) [23]. The nodes are prioritized according to their outdegrees as proposed in [21], hence large contributors are closer to the source in the trees in which they forward data and reconnect faster to the trees.
4.3
Packet Losses
First we evaluate the packet missing probability as a function of the playback delay for the case of packet losses. Fig. 3 shows results for the RF scheme with two different packet loss detection times. We denote by tdld = 0 the case when the loss of a packet is detected at the instant when it should have arrived (if it had not been lost), i.e., an ideal loss detection algorithm. We denote by tdld = 1 the case when a retransmission is requested 1 s after the packet or the retransmission request has been sent out. The figure shows results for $N = 10^4$ nodes organized in τ = 4 trees. In the absence of losses (p = 0) the decrease of the packet missing probability is faster than exponential. This is predicted by Fig. 2 (a), as the rate function for the discrete uniform distribution grows faster than linear. The
Fig. 3. Packet missing prob. vs. playback delay for different packet loss probabilities, and packet loss detection times, τ = 4, $N = 10^4$

Fig. 4. Packet missing prob. vs. playback delay for different retransmission schemes, τ = 4, $N = 10^4$
rate function of the geometric distribution is however close to linear (Fig. 2 (b)), hence we expect that in the presence of losses the decrease of the packet missing probability is not much faster than exponential. This is supported by the curves that show results for p > 0. The slope of the curves is related to the slope of the rate function of the per-hop delay distribution: the steeper the rate function, the faster the decrease of the packet missing probability. Though for tdld = 1 the loss detection time is large compared to the per-hop delays and hence the packet missing probability decreases almost in a stepwise manner, we still observe the exponential decay. The curves for different loss probabilities and loss detection times show similar properties and differ only in their slopes, so in the following we show results for p = 0.1 and tdld = 0. Fig. 4 shows results without losses and with losses for the RF and the RS retransmission schemes. The $N = 10^4$ nodes are organized in τ = 4 trees, and FEC(4,3) is used when indicated. We observe that in the presence of losses the exponential decay does not hold when retransmissions are not used. In this case the analysis of the asymptotic behavior presented in [11] with respect to N, p and the FEC code applies to $\lim_{b\to\infty} \pi(b)$: the system converges to the asymptotically stable fixed point of the discrete dynamic system shown in [11], and $\lim_{b\to\infty} \pi(b) < 1$. Consequently, the assumptions of Theorem 1 are not fulfilled, because not all nodes receive all data. When using retransmissions, the decay is exponential, as shown by the results for both the RF and the RS retransmission schemes, with and without FEC.
FEC decreases the necessary playback delay to achieve a certain packet missing probability, but the exponential decay still holds: in the presence of FEC $D_i(v)$ is the minimum of two random variables, the time until the packet would be received through the corresponding parent and the time to FEC recovery (which is the $k$th order statistic of $n - 1$ random variables with finite m.g.f.), and hence $D_i(v)$ has finite m.g.f. Consequently, an alternative to increasing the playback delay in order to achieve a certain packet missing probability is to introduce FEC. Nevertheless, the ratio of FEC redundancy has to be adjusted dynamically based on feedback from the nodes. We observe a small difference between the results obtained with the two retransmission schemes. Using the RS scheme, retransmission of a packet in stripe m becomes possible only
once nodes that do not forward data in tree m receive the packet. These nodes are in the last level of tree m, and hence, we observe a slow decay of the packet missing probability close to the point where the curves with and without retransmissions separate. Surprisingly, the difference between the results obtained with the RF and the RS schemes is small, especially when FEC is used, in which case the decrease of the decay close to the point where the curves with and without retransmissions separate is significantly smaller as well. Fig. 5 shows results for the RF retransmission scheme for different overlay sizes for τ = 4 trees. Both in the presence of losses and in the absence of losses it is enough to increase the playback delay logarithmically in order to maintain the packet missing probability constant. Surprisingly however, the smaller the playback delay needed to achieve a certain packet missing probability for a given overlay size, the more sensitive the overlay is to the increase of the number of nodes. For p = 0 the packet missing probability increases by orders of magnitude if the overlay’s size increases by a factor of ten; for p = 0.1 the increase is significantly smaller. Consequently, even if one could achieve a low packet missing probability with a small playback delay, the playback delay should be overdimensioned to ensure that the packet missing probability does not become too high if the overlay suddenly grows.
4.4
Node Churn
In the following we show analytical results for the case of node churn. For the reconnection times (Ξ) and the disconnection times (Ω) we use values similar to the measured data presented in [1]: E[Ξ] = 5 s and E[Ω] = 200 s in the tree in which a node forwards data, and E[Ξ] = 30 s and E[Ω] = 100 s in the trees in which it does not. (Nodes are disconnected with a higher probability in the trees in which they do not forward data.) Lacking a measured distribution, we model the reconnection time Ξ with a normal distribution $N(E[\Xi], E[\Xi]/3)$. Based on these values the loss probability experienced by a node in a tree in which it forwards data (p = 0.024) and in which it does not forward data (p = 0.1968) can be calculated as described in [14]. The distribution of the retransmission times depends on the retransmission scheme used. For the RB and RS schemes we use the retransmission times as discussed in [14]. For the RF scheme retransmission occurs once the new predecessor is found, hence the retransmission time is the sum of the forward recurrence time of a renewal process with inter-renewal time Ξ (see [24]) and the retransmission time of the RB scheme. For all three schemes, retransmissions are requested from nodes that are present in the overlay and consequently $\lim_{b\to\infty} P(D_i(v) \le b) = 1$. Figure 6 shows results for the case of node churn and the three retransmission schemes. As expected, the RB scheme, which involves tremendous control overhead, performs best. Surprisingly however, the RS scheme performs nearly as well as the RB scheme, both without and with FEC. This is because under churn nodes experience more frequent losses in the trees in which they do not forward data, i.e., far from the source: when these losses occur, data is already
Fig. 5. Packet missing prob. vs. playback delay for different overlay sizes

Fig. 6. Packet missing prob. vs. playback delay for $N = 10^4$, node churn and three retransmission schemes
available in large parts of the overlay, hence the additional delay introduced by the RS scheme is small. The RF scheme, due to the large retransmission delays, performs almost as badly as if there were no retransmissions at all. Nevertheless, we observe the exponential decay with a very slow decay rate. The poor performance of the RF scheme suggests that resilience to node churn in a push-based overlay requires retransmission schemes that abandon the rigid structure of the trees, and converge towards pull-based architectures, e.g., the RS and RB schemes.
5
Conclusion
In this paper we presented analytical results that show that in overlay multicast systems the packet missing probability decreases at least exponentially as a function of the playback delay under very general conditions. Consequently, the exponential decrease of the packet missing probability as a function of the playback delay is not a good measure of the efficiency of a scheduling scheme in overlay multicast: all scheduling schemes that manage to distribute the data to all nodes have this property. The scalability with respect to the overlay’s size is however a good measure: the playback delay should not have to be increased faster than the logarithm of the overlay’s size to keep the packet missing probability constant. We used an exact model of push-based overlays to show that the exponential decrease of the packet missing probability holds using various retransmission schemes and FEC. It will be subject of future work to design a provably scalable pull-based scheduling algorithm based on the analytical results on scalability. To the best of our knowledge, our work is the first to present a continuous time analytical model of the effects of the playback delay on overlay multicast.
References
1. Sung, Y.-W., Bishop, M., Rao, S.: Enabling contribution awareness in an overlay broadcasting system. In: Proc. of ACM SIGCOMM, pp. 411–422 (2006)
2. Liao, X., Jin, H., Liu, Y., Ni, L.M., Deng, D.: Anysee: Scalable live streaming service based on inter-overlay optimization. In: Proc. of IEEE INFOCOM (April 2006)
3. Magharei, N., Rejaie, R.: PRIME: Peer-to-peer Receiver drIven MEsh-based streaming. In: Proc. of IEEE INFOCOM (May 2007)
4. Massoulie, L., Twigg, A., Gkantsidis, C., Rodriguez, P.: Randomized decentralized broadcasting algorithms. In: Proc. of IEEE INFOCOM (2007)
5. Fodor, V., Dán, G.: Resilience in live peer-to-peer streaming. IEEE Communications Magazine 45(6) (June 2007)
6. “PPLive,” (June 2007), http://www.pplive.com/
7. “OctoShape,” (June 2007), http://www.octoshape.com/
8. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: A measurement study of a large-scale P2P IPTV system. IEEE Trans. Multimedia 9(8), 1672–1687 (2007)
9. Small, T., Liang, B., Li, B.: Scaling laws and tradeoffs in peer-to-peer live multimedia streaming. ACM Multimedia (October 2006)
10. Dán, G., Fodor, V., Karlsson, G.: On the stability of end-point-based multimedia streaming. In: Proc. of IFIP Networking, May 2006, pp. 678–690 (2006)
11. Dán, G., Fodor, V., Chatzidrossos, I.: Streaming performance in multiple-tree-based overlays. In: Proc. of IFIP Networking, May 2007, pp. 617–627 (2007)
12. Dán, G., Fodor, V., Chatzidrossos, I.: On the performance of multiple-tree-based peer-to-peer live streaming. In: Proc. of IEEE INFOCOM (May 2007)
13. Kumar, R., Liu, Y., Ross, K.W.: Stochastic fluid theory for P2P streaming systems. In: Proc. of IEEE INFOCOM (May 2007)
14. Dán, G., Fodor, V.: An analytical study of low delay multi-tree-based overlay multicast. In: Proc. of ACM P2P-TV (August 2007)
15. Lelarge, M., Liu, Z., Xia, C.H.: Asymptotic tail distribution of end-to-end delay in networks of queues with self-similar cross traffic. In: Proc. of IEEE INFOCOM (March 2004)
16. Denuit, M., Genest, C., Marceau, É.: Stochastic bounds of sums of dependent risks. Insurance: Mathematics and Economics 25(1), 85–104 (1999)
17. Schwartz, A., Weiss, A.: Large Deviations for Performance Evaluation: Queues, communication and computing. Chapman and Hall, Boca Raton (1995)
18. Padmanabhan, V.N., Wang, H.J., Chou, P.A.: Resilient peer-to-peer streaming. In: Proc. of IEEE ICNP, pp. 16–27 (2003)
19. Castro, M., Druschel, P., Kermarrec, A.-M., Nandi, A., Rowstron, A., Singh, A.: SplitStream: High-bandwidth multicast in a cooperative environment. In: Proc. of ACM SOSP (2003)
20. Setton, E., Noh, J., Girod, B.: Rate-distortion optimized video peer-to-peer multicast streaming. In: Proc. of ACM APPMS, pp. 39–48 (2005)
21. Bishop, M., Rao, S., Sripanidkulchai, K.: Considering priority in overlay multicast protocols under heterogeneous environments. In: Proc. of IEEE INFOCOM (April 2006)
22. Zegura, E.W., Calvert, K., Bhattacharjee, S.: How to model an internetwork. In: Proc. of IEEE INFOCOM, March 1996, pp. 594–602 (1996)
23. Veloso, E., Almeida, V., Meira, W., Bestavros, A., Jin, S.: A hierarchical characterization of a live streaming media workload. In: Proc. of ACM IMC 2002, pp. 117–130 (2002)
24. Feller, W.: An Introduction to Probability Theory and its Applications. John Wiley and Sons, Chichester (1966)
A Distributed Algorithm for Overlay Backbone Multicast Routing in Content Delivery Networks
Jun Guo and Sanjay Jha
School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW 2052, Australia
{jguo,sjha}@cse.unsw.edu.au
Abstract. To support large-scale live Internet broadcasting services efficiently in content delivery networks (CDNs), it is essential to exploit peer-to-peer capabilities among end users. This way, the access bandwidth demand on CDN servers in the overlay backbone can be largely reduced. Such a streaming infrastructure gives rise to a challenging overlay backbone multicast routing problem (OBMRP) to optimize multicast routing among CDN servers in the overlay backbone. In this paper, we take a graph theoretic approach and frame OBMRP as a constrained spanning tree problem which is shown to be NP-hard. We present a lightweight distributed algorithm for OBMRP. Simulation experiments confirm that our proposed algorithm converges to good quality solutions and requires small control overhead.
1
Introduction
Recent experiences of Internet service providers have seen a huge market demand for live Internet broadcasting services in content delivery networks (CDNs) [1]. AOL’s airing of the Live 8 concerts in July 2005 drew about 5 million users within one day and delivered 175,000 concurrent video streams of the concerts at its peak. This record was soon broken by Yahoo!’s broadcasting of NASA’s Shuttle Discovery launch, which is said to have delivered more than 335,000 video streams simultaneously. MSN’s latest broadcasting of the Live Earth concerts in July 2007, using Akamai’s streaming platform [2], has also drawn a global audience of the order of millions of viewers across the Internet. For such an Internet killer application that is both resource-intensive and latency-sensitive, a challenging issue is to design bandwidth-efficient mechanisms and algorithms that can handle large-scale live Internet broadcasting of streaming video efficiently in CDNs. While IP multicast [3] is doubtlessly the ideal solution for supporting large-scale live Internet broadcasting services, enabling IP multicast across the global Internet has not been successful due to its various deployment issues [4]. Until global deployment of IP multicast is eventually realized, Internet service providers have to resort to interim solutions. Among them, current service providers in general favor the overlay multicast approach that emulates the forwarding mechanism of IP multicast at the application layer [5,6,7,8,9,10,11,12,13].
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 240–251, 2008. © IFIP International Federation for Information Processing 2008
In particular, small-scale service providers (such as PPLive [14]) typically use the peer-to-peer (P2P) approach [5,6,7,8]. In the P2P approach, end users wishing to participate in a live broadcasting session self-organize into an overlay multicast tree rooted at the origin server. The P2P approach, however, cannot yield satisfactory performance in large-scale live broadcasting services [14]. This is mainly due to the fact that current broadband access technologies provide end users with very limited upload bandwidth. The fanout capability of end users is thus significantly restricted. As a result, overlay multicast trees due to the P2P approach could be rather tall, so that end users deep in the tree (hence far from the origin server) are likely to suffer considerable lag [14]. On the other hand, large-scale service providers (such as Akamai [2]) generally adopt the infrastructure-based approach [9,10,11,12,13]. This approach relies on a set of geographically distributed and dedicated servers with large processing power and high fanout capability. Such powerful servers are typically placed at co-location facilities with high-speed connection to the Internet, and thus form a backbone service domain, which we call overlay backbone, for the overlay multicast network. The infrastructure-based approach has been widely used by commercial CDNs to support live Internet broadcasting services [2]. Akamai’s streaming platform largely benefits from a CDN that consists of over 20,000 servers distributed in more than 70 countries. Its solution is, however, a pure infrastructure-based approach, where each end user wishing to view the live broadcasting event is directly connected to one CDN server in the overlay backbone to fetch the live video stream [2]. 
The pure infrastructure-based approach is not efficient in handling large-scale live Internet broadcasting services, since it is less effective in restricting excessive and redundant traffic from being injected into the Internet, and also because the tremendous access bandwidth demand on CDN servers would render large-scale Internet broadcasting services prohibitively expensive. One flexible and scalable approach is to exploit P2P capabilities among end users as much as possible, so long as a predefined bound on the end-to-end latency can be met [9]. Such a hybrid approach leads to an effective two-tier overlay multicast architecture [13], in which the access bandwidth demand on CDN servers in the overlay backbone can be largely reduced.

The two-tier overlay multicast architecture gives rise to a challenging overlay multicast problem among CDN servers in the overlay backbone. For a live broadcasting event, each participating CDN server is made aware of the largest delay from it to end users within its service area as well as the number of end users within its service area. The problem is to optimize the overlay multicast tree among CDN servers in the overlay backbone subject to access bandwidth constraints on CDN servers and the fixed bound on the de facto maximal end-to-end latency from the origin server to end users. The optimization criterion is to minimize the weighted average latency from the origin server to the participating CDN servers, which weights each CDN server with the number of end users within its service area. We call this problem the overlay backbone multicast routing problem (OBMRP) in this paper. OBMRP is strongly motivated by
J. Guo and S. Jha
the concern that the larger the user population within the service area of a CDN server, the larger the access bandwidth demand that could be placed on the corresponding server. By reducing the latency as much as possible at such servers, it is more feasible to exploit P2P capabilities among end users and thus reduce the access bandwidth demand on CDN servers.

In this paper, we take a graph theoretic approach to frame OBMRP as a constrained spanning tree problem and show that it is NP-hard. We propose a lightweight distributed algorithm for OBMRP (henceforth called OBMRP-DA for brevity). To validate the efficacy of OBMRP-DA, we modify the distributed algorithm proposed in [11] for the OMNI problem (henceforth called OMNI-DA for brevity) to provide a comparable solution for OBMRP. In OMNI, each CDN server in the overlay backbone is made aware of only the number of end users within its service area. The overlay multicast problem addressed by OMNI-DA is to minimize the weighted average latency from the origin server to the participating CDN servers subject to access bandwidth constraints on CDN servers. OMNI-DA cannot guarantee to bound the de facto maximal end-to-end latency as required in the context of OBMRP. Moreover, OMNI-DA relies on a set of local transformation operations which essentially require probing among all nodes within up to two levels of each other. It was observed that such local transformation operations cannot guarantee that a global minimum will be achieved. OMNI-DA thus relies on a random swap operation to divert the algorithm from the local minimum region. Such a random swap operation inevitably introduces significant oscillations to the tree latency and thus prolongs the converging process.
We shall see in this paper that our proposed OBMRP-DA benefits from the more efficient transformation operations that we have designed, and thus can significantly reduce the control overhead while yielding solutions of comparable qualities to OMNI-DA. The rest of this paper is organized as follows. Section 2 deals with the problem formulation. Section 3 presents the distributed algorithm. Simulation experiments are reported in Sect. 4. Finally, we draw conclusions in Sect. 5.
2 Problem Formulation
We model the overlay backbone in the CDN as a complete directed graph $G = (V, E)$, where $V$ is the set of $N$ nodes and $E = V \times V$ is the set of edges. Each node in $V$ represents a CDN server participating in the live broadcasting session. Let node $r$ be the origin server. All other nodes (in the set $V - \{r\}$) are proxy servers. The directed edge $\langle i, j \rangle$ in $E$ represents the unicast path of latency $l_{i,j}$ from node $i$ to node $j$. By $\{l_{i,j}\}$, we denote the matrix of unicast latency quantities between each pair of nodes in $G$.

An overlay backbone multicast tree can be represented by a directed spanning tree $T$ of $G$ rooted at node $r$. For each sink node $v$ in the set $V - \{r\}$, we define $R_{r,v}$ as the set of directed edges that form the overlay routing path from the root node to node $v$ in the multicast tree. Let $L_{r,v}$ denote the aggregate latency along the overlay routing path from the root node to node $v$. Let $\gamma_v$ be the largest delay from node $v$ to end users within its service area, and $c_v$ be the number of end users within the service area of node $v$. The weight $w_v$ of node $v$ is given by

$$ w_v = \frac{c_v}{\sum_{u \in V - \{r\}} c_u}. \tag{1} $$

Let $\bar{L}$ denote the weighted sum of latencies from the origin server to the proxy servers along $T$. Let $L_{\max}$ denote the de facto maximal end-to-end latency from the origin server to the end users. Let $L^{B}_{\max}$ denote the specified bound on $L_{\max}$. Given the unicast latency matrix $\{l_{i,j}\}$, we readily have

$$ L_{r,v} = \sum_{\langle i,j \rangle \in R_{r,v}} l_{i,j} \tag{2} $$

and

$$ \bar{L} = \sum_{v \in V - \{r\}} w_v L_{r,v} \tag{3} $$

and

$$ L_{\max} = \max_{v \in V - \{r\}} \left( L_{r,v} + \gamma_v \right). \tag{4} $$
Let $d_i$ be the residual degree of node $i$, and $\tilde{d}_i$ be the out-degree of node $i$ counted within the overlay backbone multicast tree. Since a spanning tree with $N$ nodes has exactly $N - 1$ edges, the sum of out-degrees in the overlay backbone multicast tree satisfies $\sum_{i \in V} \tilde{d}_i = N - 1$.

Definition 1. Given the complete directed graph $G = (V, E)$, OBMRP is to find a constrained directed spanning tree $T$ of $G$ rooted at node $r$, such that $\bar{L}$ is minimized, while $T$ satisfies the residual degree constraint, i.e. $\tilde{d}_i \le d_i$, $\forall i \in V$, and the bound on the de facto maximal end-to-end latency, i.e. $L_{\max} \le L^{B}_{\max}$.

Such a constrained spanning tree problem is NP-hard. This is shown by creating a dummy node for each node $v$ in $V - \{r\}$ and forming an edge of weight $\gamma_v$ between node $v$ and its corresponding dummy node. The resulting problem can be reduced to finding a Hamiltonian path within the augmented graph which is known to be NP-complete [15].
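The quantities in Eqs. (1) to (4) and the two constraints of Definition 1 can be evaluated directly for any candidate tree. The following is a minimal sketch, not part of the original paper: it assumes the tree is stored as a hypothetical `parent` map from each proxy server to its parent, and the function names and the three-node example data are illustrative only.

```python
from typing import Dict, Hashable, Tuple

Node = Hashable


def path_latency(parent: Dict[Node, Node], lat, root: Node, v: Node) -> float:
    """L_{r,v} of Eq. (2): sum of l_{i,j} over the tree edges from the root to v."""
    total = 0.0
    while v != root:
        p = parent[v]            # tree edge <p, v>
        total += lat[p][v]
        v = p
    return total


def evaluate_tree(parent, lat, users, gamma, root) -> Tuple[float, float]:
    """Return (L_bar, L_max) for the tree encoded by `parent` (proxy -> parent)."""
    proxies = list(parent)                                    # V - {r}
    total_users = sum(users[v] for v in proxies)
    w = {v: users[v] / total_users for v in proxies}          # Eq. (1)
    L = {v: path_latency(parent, lat, root, v) for v in proxies}
    L_bar = sum(w[v] * L[v] for v in proxies)                 # Eq. (3)
    L_max = max(L[v] + gamma[v] for v in proxies)             # Eq. (4)
    return L_bar, L_max


def is_feasible(parent, degree_bound, L_max, bound) -> bool:
    """Check the degree constraint and the latency bound of Definition 1."""
    out_degree = {}
    for p in parent.values():
        out_degree[p] = out_degree.get(p, 0) + 1
    within_degree = all(n <= degree_bound[i] for i, n in out_degree.items())
    return within_degree and L_max <= bound


# Illustrative data (assumed): root "r" feeds "a", which feeds "b".
lat = {"r": {"a": 10, "b": 50}, "a": {"b": 20}, "b": {"a": 20}}
parent = {"a": "r", "b": "a"}
L_bar, L_max = evaluate_tree(parent, lat, {"a": 100, "b": 300}, {"a": 5, "b": 7}, "r")
feasible = is_feasible(parent, {"r": 1, "a": 1, "b": 1}, L_max, bound=100)
```

A parent map is used here because it makes the root-to-node path of Eq. (2) a simple upward walk; other tree encodings would work equally well.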
3 Distributed Algorithm
In this paper, we define the ancestor nodes of node i as those nodes (including the root node) in the overlay routing path of the multicast tree from the root node to node i, where the parent node of node i is the immediate forwarding node to node i in the overlay routing path. The child nodes of node i are the immediate nodes that relay from node i in the subtree rooted at node i. The descendants of node i are defined as those end users served by proxy servers (including node i) in the subtree rooted at node i. Our proposed distributed algorithm for OBMRP requires node i, ∀i ∈ V , to maintain the following state information during the iterative tree restructuring process, except for those inapplicable to the root node:
– $\hat{i}$: Parent node of node $i$ in the current tree
– $\Phi_i$: Set of all ancestor nodes of node $i$ in the current tree
– $\Omega_i$: Set of all child nodes of node $i$ in the current tree
– $d_i$: Residual degree of node $i$ in the current tree
– $W_i$: Aggregate weight of all nodes in the subtree rooted at node $i$ (including node $i$), which is given by

$$ W_i = \begin{cases} w_i, & \text{if node } i \text{ is a leaf node} \\ w_i + \sum_{j \in \Omega_i} W_j, & \text{elsewhere} \end{cases} \tag{5} $$

– $L_{r,i}$: Latency from the root node to node $i$, which is given by

$$ L_{r,i} = L_{r,\hat{i}} + l_{\hat{i},i} \tag{6} $$

– $\Gamma_i$: Largest delay from node $i$ to end users served by proxy servers (including node $i$) in the subtree rooted at node $i$. This can be iteratively computed by

$$ \Gamma_i = \begin{cases} \gamma_i, & \text{if node } i \text{ is a leaf node} \\ \max\left\{ \max_{j \in \Omega_i} (l_{i,j} + \Gamma_j),\ \gamma_i \right\}, & \text{elsewhere} \end{cases} \tag{7} $$
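The recursions in (5) and (7) can be evaluated bottom-up over the current tree. The sketch below is only an illustration under stated assumptions: the tree is held centrally as a hypothetical `children` map (in the distributed setting each node would instead combine values reported by its children), and the root is assumed to have weight 0 and own delay 0 since it is not a proxy server.

```python
def subtree_state(children, w, gamma, lat, root):
    """Compute W_i (Eq. 5) and Gamma_i (Eq. 7) for every node of the tree.

    children: node -> list of child nodes (Omega_i); leaves may be absent.
    w, gamma: per-proxy weight and largest local delay; missing entries
              (here: the root) default to 0, which is an assumption.
    """
    W, Gamma = {}, {}

    def visit(i):
        W[i] = w.get(i, 0.0)
        Gamma[i] = gamma.get(i, 0.0)
        for j in children.get(i, []):
            visit(j)
            W[i] += W[j]                                   # Eq. (5)
            Gamma[i] = max(Gamma[i], lat[i][j] + Gamma[j])  # Eq. (7)

    visit(root)
    return W, Gamma


# Illustrative data (assumed): chain r -> a -> b.
lat = {"r": {"a": 10}, "a": {"b": 20}}
children = {"r": ["a"], "a": ["b"]}
W, Gamma = subtree_state(children, {"a": 0.25, "b": 0.75}, {"a": 5, "b": 7}, lat, "r")
```

In this example the value stored for the root, `Gamma["r"]`, equals the maximal end-to-end latency of the chain, since it takes the maximum of 10 + 5 and 10 + 20 + 7.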
Clearly, $\Gamma_r$ corresponds to $L_{\max}$ of the current tree.

3.1 Tree Initialization
Starting from the initial tree including the origin server only, we allow proxy servers in $V - \{r\}$ to join the tree in a random order, which is realistic in an actual live broadcasting event. If multiple nodes arrive concurrently, they are added to the tree in the decreasing order according to their node IDs. When node $i$ wishes to join the tree, it directly contacts the root node, since the root node by default maintains the list of all nodes in the current tree. Upon the joining request of node $i$, the root node successively probes the list of nodes in the current tree until node $j$ with sufficient residual degree, i.e. $d_j > 0$, is found. Node $i$ is thus grafted to the tree by registering as a child node of node $j$.

3.2 Handling of Tree Restructuring Requests
Our proposed distributed algorithm for OBMRP works by allowing each node in $T$ to periodically and randomly contact another node in $T$ to make a tree restructuring request. Let $\overrightarrow{\langle i, j \rangle}$ denote such a tree restructuring request from node $i$ to node $j$. Following OBMRP-DA, $\overrightarrow{\langle i, j \rangle}$ will be simply rejected by node $j$ if the current states and positions of node $i$ and node $j$ in $T$ satisfy one of the five exclusive conditions: 1) $i \in \Phi_j$; 2) $j = r$, $j = \hat{i}$; 3) $j = r$, $j \ne \hat{i}$, $d_j = 0$; 4) $j \ne r$, $j = \hat{i}$, $d_{\hat{j}} = 0$; 5) $j \in \Phi_i - \{r\}$, $j \ne \hat{i}$, $d_{\hat{j}} = 0$, $d_j = 0$. Node $j$ considers $\overrightarrow{\langle i, j \rangle}$ only if the current states and positions of node $i$ and node $j$ in $T$ match one of the five distinct cases illustrated in Fig. 1. In particular, (a), (b) and (c) deal with the various cases where we have $j \in \Phi_i$.
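The five rejection conditions can be written as a single predicate evaluated by node j. The sketch below is an interpretation rather than the authors' code: it reads conditions 3) to 5) as involving inequalities (j is not the parent of i in 3 and 5, and j is not the root in 4), which is the reading under which the five conditions are mutually exclusive. The `parent` and `residual` maps and the guard for i = j are assumptions made for illustration.

```python
def ancestors(parent, i):
    """Phi_i: all nodes on the path from the root down to the parent of i."""
    found = set()
    p = parent.get(i)
    while p is not None:
        found.add(p)
        p = parent.get(p)
    return found


def is_rejected(i, j, parent, residual, root):
    """Return True if node j should reject the restructuring request from i."""
    if i == j:                              # assumed guard, not in the paper
        return True
    if i in ancestors(parent, j):           # condition 1: i is an ancestor of j
        return True
    parent_i = parent[i]
    if j == root:
        if parent_i == j:                   # condition 2: i already hangs off the root
            return True
        return residual[j] == 0             # condition 3: root has no spare degree
    parent_j = parent[j]
    if j == parent_i:                       # condition 4: grandparent is full
        return residual[parent_j] == 0
    if j in ancestors(parent, i):           # condition 5: neither j nor its parent has room
        return residual[parent_j] == 0 and residual[j] == 0
    return False                            # siblings or unrelated nodes are considered


# Illustrative tree (assumed): r -> {a, d}, a -> {b, c}; the root is full.
parent = {"r": None, "a": "r", "b": "a", "c": "a", "d": "r"}
residual = {"r": 0, "a": 1, "b": 2, "c": 2, "d": 2}
```

With this data, a request from b to its sibling c or to the unrelated node d is considered, whereas a request from b to the full root is rejected.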
Fig. 1. Snapshot of the five distinct cases where a tree restructuring request from node i will be considered by node j. Both node i and node j are marked by circles. Node j is pointed by the double arrow.
(a) If $j \ne r$, $j = \hat{i}$ and $d_{\hat{j}} > 0$, a Type-A transformation (described in Sect. 3.3) will be attempted by node $j$.
(b) If $j = r$, $j \ne \hat{i}$ and $d_j > 0$, a Type-C transformation (described in Sect. 3.5) will be attempted by node $j$.
(c) If $j \in \Phi_i - \{r\}$, $j \ne \hat{i}$ and $d_{\hat{j}} > 0$, a Type-A transformation will be attempted by node $j$. If $d_{\hat{j}} = 0$ but $d_j > 0$, node $j$ will instead attempt a Type-C transformation.
(d) If $\hat{j} = \hat{i}$, a Type-C transformation will be attempted by node $j$ provided $d_j > 0$. However, if $d_j = 0$, node $j$ will instead attempt a Type-D transformation (described in Sect. 3.6).
(e) In all other situations where neither $i \in \Phi_j$ nor $j \in \Phi_i$, node $j$ will selectively attempt a Type-A, B (described in Sect. 3.4), C or D transformation (in the corresponding order), depending on which of the four conditions, i.e. $d_{\hat{j}} > 0$, $d_{\hat{j}} = 0$, $d_j > 0$, or $d_j = 0$, holds true.

3.3 Type-A Transformation
Since $d_{\hat{j}} > 0$, node $i$ (together with its subtree if any) is switched to be a child node of node $\hat{j}$, given that

$$ \breve{L}_{r,i} = (L_{r,\hat{i}} + l_{\hat{i},i}) - (L_{r,\hat{j}} + l_{\hat{j},i}) > 0 \tag{8} $$

where $\breve{L}_{r,i}$ denotes the amount of variation on $L_{r,i}$. Clearly, such a transformation reduces the end-to-end latency $L_{r,j}$ of each node $j$ in the subtree rooted at node $i$ by $\breve{L}_{r,i}$, and thus reduces $\bar{L}$ by $W_i \cdot \breve{L}_{r,i}$ without increasing $\Gamma_r$ of the current tree. We illustrate an example of the Type-A transformation in Fig. 2(a) for the case presented in Fig. 1(e).

3.4 Type-B Transformation
Since $d_{\hat{j}} = 0$, if (8) holds true, we may switch node $j$ to be either a child node of node $\hat{i}$, or a child node of node $i$ after node $i$ is grafted to node $\hat{j}$. More explicitly, we check if

$$ \breve{L}'_{r,j} = (L_{r,\hat{j}} + l_{\hat{j},j}) - (L_{r,\hat{i}} + l_{\hat{i},j}) > 0 \tag{9} $$

and if

$$ \breve{L}''_{r,j} = (L_{r,\hat{j}} + l_{\hat{j},j}) - (L_{r,\hat{j}} + l_{\hat{j},i} + l_{i,j}) > 0 \tag{10} $$

given that $d_i > 0$. Let $\breve{L}_{r,j} = \max(\breve{L}'_{r,j}, \breve{L}''_{r,j})$. If $\breve{L}_{r,j} > 0$, we perform the transformation by choosing the option that yields $\breve{L}_{r,j}$. On the other hand, if $\breve{L}_{r,j} \le 0$, but we have

$$ W_i \cdot \breve{L}_{r,i} + W_j \cdot \breve{L}_{r,j} > 0, \tag{11} $$

we proceed with the transformation by choosing the option that achieves $\breve{L}_{r,j}$. However, this is done only if the transformation does not increase $\Gamma_r$ of the current tree or violate $L^{B}_{\max}$ if $\Gamma_r$ is already smaller than $L^{B}_{\max}$. The two possible tree updates due to the Type-B transformation of Fig. 1(e) are shown in Fig. 2(b) and Fig. 2(c).

Fig. 2. All possible tree restructuring results for the case depicted in (e) of Fig. 1: (a) due to the Type-A transformation; (b) and (c) due to the Type-B transformation; (d) due to the Type-C transformation; (e) to (h) due to the Type-D transformation.

3.5 Type-C Transformation
This transformation is similar to the Type-A transformation by instead checking if

$$ \breve{L}_{r,i} = (L_{r,\hat{i}} + l_{\hat{i},i}) - (L_{r,j} + l_{j,i}) > 0 \tag{12} $$

If so, node $i$ (together with its subtree if any) is simply grafted to node $j$, since such a transformation certainly reduces $\bar{L}$ by $W_i \cdot \breve{L}_{r,i}$ without increasing $\Gamma_r$. An example of the Type-C transformation of Fig. 1(e) is shown in Fig. 2(d).

3.6 Type-D Transformation
Given that $d_j = 0$ in this case, if (12) holds true, for each child node $c$ of node $j$, i.e. $c \in \Omega_j$, we may switch node $c$ to be either a child node of node $\hat{i}$, or a child node of node $i$ after node $i$ is grafted to node $j$. More explicitly, we check for all $c \in \Omega_j$ if

$$ \breve{L}'_{r,c} = (L_{r,j} + l_{j,c}) - (L_{r,\hat{i}} + l_{\hat{i},c}) > 0 \tag{13} $$

and if

$$ \breve{L}''_{r,c} = (L_{r,j} + l_{j,c}) - (L_{r,j} + l_{j,i} + l_{i,c}) > 0 \tag{14} $$

given that $d_i > 0$. Let $\breve{L}_{r,c} = \max(\breve{L}'_{r,c}, \breve{L}''_{r,c})$, and let $c^{*}$ denote the node in $\Omega_j$ such that

$$ W_{c^{*}} \cdot \breve{L}_{r,c^{*}} = \max_{c \in \Omega_j} W_c \cdot \breve{L}_{r,c}. \tag{15} $$

If $\breve{L}_{r,c^{*}} > 0$, we perform the transformation by choosing the option that yields $\breve{L}_{r,c^{*}}$. On the other hand, if $\breve{L}_{r,c^{*}} \le 0$, but we have

$$ W_i \cdot \breve{L}_{r,i} + W_{c^{*}} \cdot \breve{L}_{r,c^{*}} > 0, \tag{16} $$

we proceed with the transformation by choosing the option that achieves $\breve{L}_{r,c^{*}}$. However, this is done only if the transformation does not increase $\Gamma_r$ of the current tree or violate $L^{B}_{\max}$ if $\Gamma_r$ is already smaller than $L^{B}_{\max}$. All four possible tree updates due to the Type-D transformation of Fig. 1(e) are illustrated in Fig. 2(e) to Fig. 2(h), respectively.

3.7 Control Overhead
The control overhead of OBMRP-DA is small due to the four lightweight transformation operations that we design. For a tree restructuring request $\overrightarrow{\langle i, j \rangle}$, the Type-A transformation only requires probing between node $i$ and node $\hat{j}$ ($i \leftrightarrow \hat{j}$ for short) to acquire the unicast latency quantity $l_{\hat{j},i}$. Similarly, the Type-C transformation only requires probing between $i \leftrightarrow j$ to acquire the unicast latency quantity $l_{j,i}$. The Type-B transformation requires probing between three pairs of nodes, i.e. $i \leftrightarrow \hat{j}$, $\hat{i} \leftrightarrow j$ and $i \leftrightarrow j$, to acquire the corresponding unicast latency quantities. Even the most elaborate Type-D transformation merely requires probing between $i \leftrightarrow j$, $\hat{i} \leftrightarrow c$ and $i \leftrightarrow c$, $\forall c \in \Omega_j$, which is within only one level of node $j$ and thus requires only $O(|\Omega_j|)$ control overhead.

It is also important to note that OBMRP-DA defines exclusive conditions for an appropriate transformation operation to be performed, based on the current states and positions of node $i$ and node $j$ in the multicast tree. Consequently, any successful tree restructuring request $\overrightarrow{\langle i, j \rangle}$ requires no more than one transformation attempt. The time for which the multicast tree is left in a transient state due to the transformation operation is thus minimal.
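The latency tests behind the transformations reduce to a few subtractions once the probed latencies are known. The following sketch is an illustration under stated assumptions, not the authors' implementation: `regraft_gain` covers the checks of the Type-A and Type-C transformations, and `type_b_decision` covers the two options of the Type-B transformation together with the combined-gain test. The dictionary-based state, the function names and the example values are hypothetical, and the additional check on the maximal end-to-end latency is deliberately left out.

```python
def regraft_gain(node, new_parent, L, lat):
    """Latency reduction for `node` if it is grafted under `new_parent`.

    With new_parent = parent of j this is the Type-A test; with
    new_parent = j it is the Type-C test. L maps node -> current L_{r,node}.
    """
    return L[node] - (L[new_parent] + lat[new_parent][node])


def type_b_decision(i, j, parent, L, lat, W, residual):
    """Return (new parent for j, reduction of the weighted latency) or None.

    Assumes the parent of j has no residual degree, so j must give up its
    slot to i. The Gamma_r / latency-bound check is NOT modelled here.
    """
    parent_i, parent_j = parent[i], parent[j]
    gain_i = regraft_gain(i, parent_j, L, lat)
    if gain_i <= 0:
        return None
    # Option 1: j takes the slot that i frees under the parent of i.
    options = {parent_i: L[j] - (L[parent_i] + lat[parent_i][j])}
    # Option 2: j becomes a child of i, which needs spare degree at i.
    if residual[i] > 0:
        options[i] = L[j] - (L[parent_j] + lat[parent_j][i] + lat[i][j])
    new_parent_j = max(options, key=options.get)
    gain_j = options[new_parent_j]
    total = W[i] * gain_i + W[j] * gain_j
    if gain_j > 0 or total > 0:
        return new_parent_j, total
    return None


# Illustrative state (assumed): r -> p -> j and r -> q -> i.
parent = {"p": "r", "j": "p", "q": "r", "i": "q"}
L = {"r": 0, "p": 10, "j": 30, "q": 40, "i": 100}
lat = {"p": {"i": 15, "j": 20}, "q": {"i": 60, "j": 5}, "i": {"j": 10}}
decision = type_b_decision("i", "j", parent, L, lat,
                           W={"i": 0.6, "j": 0.2}, residual={"i": 1})
```

In this example node j loses a little latency by moving under node i, but the weighted gain of node i dominates, so the swap is accepted. Only the latencies between i and the parent of j, between the parent of i and j, and between i and j are needed, which matches the three probes listed for the Type-B transformation.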
4 Simulation Experiments
We have examined the efficacy of our proposed OBMRP-DA through extensive simulation experiments. The network topologies used in our experiments are obtained from the transit-stub graph model of the GT-ITM topology generator [16]. All topologies have 12,000 nodes. CDN servers are placed at a set of N nodes, chosen uniformly at random. Due to the space limitation, here we report
experiments on one small-size graph ($N = 20$) and ten large-size graphs ($N = 300$). Unicast latencies between different pairs of nodes in these overlay topologies vary from 10 to 1000 msec. We set the residual degree as 5 for each node in the small-size graph and 10 in the large-size graphs. The bound $L^{B}_{\max}$ is set to 2000 in all experiments. The number $c_i$ of node $i$ is randomly selected between 100 and 600. The value $\gamma_i$ of node $i$ is randomly generated following a lognormal distribution Log-N($a$, $b$) with $a = c_i$ and $b = c_i$. We thus introduce certain statistical correlation between $c_i$ and $\gamma_i$. This is due to the concern that it is likely that a larger number of end users within the service area of a CDN server can cause a larger delay from the server to its end users. Based on these overlay topologies, we design the following experiments to study the performance of OBMRP-DA.

We have derived an integer linear programming (ILP) formulation for OBMRP (OBMRP-ILP for brevity, provided in the Appendix). Although ILP is known to have an exponential computational complexity and is unlikely to be solved for large-size problem instances, it allows us to find optimal solutions for small-size problem instances of OBMRP. Since OBMRP is NP-hard and the proposed OBMRP-DA uses a random approach for tree initialization, independent runs of OBMRP-DA for the same experiment setting can converge to different solutions. For each experiment setting, we thus solve OBMRP-DA for 200 independent runs, each of which for up to $300 \times N$ successive tree restructuring requests.

Results in Fig. 3(a) for the small-size graph are presented in the form of percentage deviation in $\bar{L}$ between the mean results obtained from OBMRP-DA and the optimal solutions from OBMRP-ILP. Each data point in Fig. 3(a) corresponds to the case where we set a particular node in the graph with the indicated node ID to be the root node. We see in all instances that the solutions obtained from OBMRP-DA are very close to the optimal solutions. The worst-case performance in $\bar{L}$ is within only 6% of OBMRP-ILP.

As we have discussed in Sect. 1, OMNI-DA proposed in [11] can be modified to provide a comparable solution for OBMRP. For the purpose of comparison, we have modified OMNI-DA by adding the same elements that we have developed for OBMRP-DA to control $\Gamma_r$ during the tree restructuring process. While the random swap operation defined in OMNI-DA can have the same effect in improving the latency performance in the context of OBMRP-DA, we did not enable it in our experiments so as to make a fair comparison between the deterministic transformation operations proposed in this paper and those in [11].

For each of the ten large-size graphs, we again set each particular node to be the root node. This creates 3,000 simulation experiments. In each experiment, we compare the mean $\bar{L}$ performance between OBMRP-DA and OMNI-DA in the form of a ratio. We also count the number of probing actions required in the tree restructuring process. This is for us to investigate and compare the control overhead (again in the form of a ratio) between the two different algorithms. The 3,000 data points obtained from the $\bar{L}$ comparison and those from the control overhead comparison are presented in Fig. 3(b) and Fig. 3(c), respectively, both in the form of cumulative percentage.
Fig. 3. Efficacy of OBMRP-DA: (a) $\bar{L}$ performance compared with OBMRP-ILP; (b) $\bar{L}$ performance compared with OMNI-DA; (c) Control overhead compared with OMNI-DA; (d) Distribution of transformation operations
[Plot axes: (a) Deviation from OBMRP-ILP (%) versus Root Node; (b) and (c) Cumulative Percentage versus Ratio to OMNI-DA (%); (d) Proportion (%) versus Type (A, B, C, D), showing Maximum, Average and Minimum.]
These results clearly demonstrate the efficacy of OBMRP-DA. In general, OBMRP-DA converges to better quality solutions than OMNI-DA. In only four out of the 3,000 simulation experiments, the $\bar{L}$ performance of OBMRP-DA is slightly worse than OMNI-DA. On average, OBMRP-DA outperforms OMNI-DA by more than 5%. Moreover, as shown in Fig. 3(c), OBMRP-DA requires on average only 5% of the control overhead required by OMNI-DA. This is due to the fact that OMNI-DA relies on a set of five transformation operations which in general require probing among all nodes within up to two levels of each other. In contrast, even the most elaborate Type-D transformation defined in OBMRP-DA merely requires probing between node $i$ and all nodes within one level of node $j$ for a particular tree restructuring request $\overrightarrow{\langle i, j \rangle}$. It is also due to the fact that, for each particular tree restructuring request, OMNI-DA attempts all five transformation operations until one of them is successful. In contrast, OBMRP-DA performs no more than one transformation, and requires only a modicum of the elaborate Type-D transformation operation, as shown in Fig. 3(d), to achieve the comparable $\bar{L}$ performance.
5 Conclusion and Future Work
In this paper, we have identified and addressed a strongly motivated overlay backbone multicast routing problem to support large-scale live Internet broadcasting services in CDNs more efficiently using the two-tier overlay multicast architecture. We have proposed a lightweight algorithm for solving OBMRP in a distributed iterative way. Simulation experiments have confirmed that our proposed algorithm can yield good quality solutions with small control overhead. Our future work is to design a distributed protocol based on the lightweight algorithm proposed in this paper, and to test the multicast routing performance in real or emulated CDN environments.

Acknowledgments. This project is supported by Australian Research Council (ARC) Discovery Grant DP0557519.
References

1. Sripanidkulchai, K., Maggs, B., Zhang, H.: An analysis of live streaming workloads on the Internet. In: Proc. ACM IMC 2004, October 2004, pp. 41–54 (2004)
2. Dilley, J., Maggs, B., Parikh, J., Prokop, H., Sitaraman, R., Weihl, B.: Globally distributed content delivery. IEEE Internet Comput. 6(5), 50–58 (2002)
3. Deering, S.E., Cheriton, D.R.: Multicast routing in datagram internetworks and extended LANs. ACM Trans. Comput. Syst. 8(2), 85–110 (1990)
4. Diot, C., Levine, B.N., Lyles, B., Kassem, H., Balensiefen, D.: Deployment issues for the IP multicast service and architecture. IEEE Network 14(1), 78–88 (2000)
5. Chu, Y.H., Rao, S.G., Zhang, H.: A case for end system multicast. In: Proc. ACM SIGMETRICS 2000, Santa Clara, CA, USA, June 2000, pp. 1–12 (2000)
6. Kwon, M., Fahmy, S.: Topology-aware overlay networks for group communication. In: Proc. ACM NOSSDAV 2002, Miami, FL, USA, May 2002, pp. 127–136 (2002)
7. Zhang, B., Jamin, S., Zhang, L.: Host multicast: A framework for delivering multicast to end users. In: Proc. IEEE INFOCOM 2002, New York, USA, June 2002, vol. 3, pp. 1366–1375 (2002)
8. Banerjee, S., Bhattacharjee, B., Kommareddy, C.: Scalable application layer multicast. In: Proc. ACM SIGCOMM 2002, Pittsburgh, PA, USA, August 2002, pp. 205–217 (2002)
9. Malouch, N.M., Liu, Z., Rubenstein, D., Sahu, S.: A graph theoretic approach to bounding delay in proxy-assisted, end-system multicast. In: Proc. IEEE IWQoS 2002, May 2002, pp. 106–115 (2002)
10. Shi, S.Y., Turner, J.S.: Routing in overlay multicast networks. In: Proc. IEEE INFOCOM 2002, New York, USA, June 2002, vol. 3, pp. 1200–1208 (2002)
11. Banerjee, S., Kommareddy, C., Kar, K., Bhattacharjee, B., Khuller, S.: Construction of an efficient overlay multicast infrastructure for real-time applications. In: Proc. IEEE INFOCOM 2003, San Francisco, CA, USA, March 2003, pp. 1521–1531 (2003)
12. Lao, L., Cui, J.H., Gerla, M.: TOMA: A viable solution for large-scale multicast service support. In: Boutaba, R., Almeroth, K.C., Puigjaner, R., Shen, S., Black, J.P. (eds.) NETWORKING 2005. LNCS, vol. 3462, pp. 906–917. Springer, Heidelberg (2005)
13. Guo, J., Jha, S.: Host-aware routing in multicast overlay backbone. In: Proc. IEEE/IFIP NOMS 2008 (2008)
14. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: Insights into PPLive: A measurement study of a large-scale P2P IPTV system. In: Proc. Workshop on Internet Protocol TV (IPTV) services over World Wide Web in conjunction with WWW 2006, Edinburgh, Scotland (May 2006)
15. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco (1979)
16. Zegura, E.W., Calvert, K.L., Bhattacharjee, S.: How to model an internetwork. In: Proc. IEEE INFOCOM 1996, San Francisco, CA, USA, March 1996, vol. 2, pp. 594–602 (1996)
Appendix

Here we provide an ILP formulation for OBMRP. Let the 0-1 variables $p^{v}_{i,j}$, $v \in V - \{r\}$, $\langle i, j \rangle \in E$, indicate if the directed edge $\langle i, j \rangle$ is included in the overlay routing path from node $r$ to node $v$. Let the 0-1 variables $t_{i,j}$, $\langle i, j \rangle \in E$, indicate if the directed edge $\langle i, j \rangle$ is included in the overlay backbone multicast tree. OBMRP can be mathematically formulated as:

$$ \text{Minimize} \quad \sum_{v \in V - \{r\}} w_v \sum_{\langle i,j \rangle \in E} p^{v}_{i,j}\, l_{i,j} \tag{17} $$

subject to

$$ \sum_{j : \langle i,j \rangle \in E} p^{v}_{i,j} - \sum_{j : \langle j,i \rangle \in E} p^{v}_{j,i} = \begin{cases} 1, & \text{if } i = r \\ 0, & \text{if } i \in V - \{r, v\} \\ -1, & \text{if } i = v \end{cases}, \quad \forall v \in V - \{r\} \tag{18} $$

$$ \sum_{v \in V - \{r\}} p^{v}_{i,j} \le (N - 1) \cdot t_{i,j}, \quad \forall \langle i,j \rangle \in E \tag{19} $$

$$ \sum_{\langle i,j \rangle \in E} t_{i,j} = N - 1 \tag{20} $$

$$ \sum_{j : \langle i,j \rangle \in E} t_{i,j} \le d_i, \quad \forall i \in V \tag{21} $$

$$ \sum_{\langle i,j \rangle \in E} p^{v}_{i,j}\, l_{i,j} + \gamma_v \le L^{B}_{\max}, \quad \forall v \in V - \{r\} \tag{22} $$

The objective function in (17) is equivalent to (3) by the definition of $p^{v}_{i,j}$. Equations (18) and (19) ensure that the solution is a directed spanning tree rooted at node $r$. More explicitly, they enforce one single overlay routing path for each source-sink pair. Equation (20) restricts the sum of out-degrees to $N - 1$. Equation (21) enforces the residual degree constraint. Equation (22) ensures that the de facto maximal end-to-end latency of the overlay backbone multicast tree is bounded by $L^{B}_{\max}$. All equations jointly ensure that the solution is a directed spanning tree rooted at node $r$ and satisfies all constraints of OBMRP.
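For very small instances, the optimum that such a formulation targets can be cross-checked without an ILP solver by enumerating every parent assignment. The sketch below is not the ILP itself and is not taken from the paper: it is a plain exhaustive search, exponential in the number of proxy servers, that applies the same objective, degree constraint and latency bound. All names and the three-node example are assumptions for illustration.

```python
from itertools import product


def brute_force_obmrp(nodes, root, lat, w, gamma, degree, bound):
    """Return (best weighted latency, parent map), or (inf, None) if infeasible."""
    proxies = [v for v in nodes if v != root]
    best_cost, best_parent = float("inf"), None
    for choice in product(nodes, repeat=len(proxies)):
        parent = dict(zip(proxies, choice))
        if any(v == p for v, p in parent.items()):
            continue                                   # no self-loops
        out_degree = {}
        for p in parent.values():
            out_degree[p] = out_degree.get(p, 0) + 1
        if any(n > degree[p] for p, n in out_degree.items()):
            continue                                   # degree constraint violated
        latency, is_tree = {}, True
        for v in proxies:
            total, cur, hops = 0.0, v, 0
            while cur != root:
                total += lat[parent[cur]][cur]
                cur = parent[cur]
                hops += 1
                if hops > len(proxies):                # walked in a cycle
                    is_tree = False
                    break
            if not is_tree:
                break
            latency[v] = total
        if not is_tree:
            continue
        if max(latency[v] + gamma[v] for v in proxies) > bound:
            continue                                   # latency bound violated
        cost = sum(w[v] * latency[v] for v in proxies) # objective value
        if cost < best_cost:
            best_cost, best_parent = cost, parent
    return best_cost, best_parent


# Illustrative instance (assumed): three servers, at most one child each.
nodes = ["r", "a", "b"]
lat = {"r": {"a": 10, "b": 50}, "a": {"b": 20, "r": 10}, "b": {"a": 20, "r": 50}}
w, gamma = {"a": 0.25, "b": 0.75}, {"a": 5, "b": 7}
degree = {"r": 1, "a": 1, "b": 1}
cost, tree = brute_force_obmrp(nodes, "r", lat, w, gamma, degree, bound=100)
```

The search space grows as $N^{N-1}$, so this is only practical for a handful of nodes; it is meant as a sanity check, not as a substitute for a solver.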
Network Performance Assessment Using Adaptive Traffic Sampling

René Serral-Gracià, Albert Cabellos-Aparicio, and Jordi Domingo-Pascual

Advanced Broadband Communications Centre, Technical University of Catalunya (UPC), Spain
{rserral,acabello,jordid}@ac.upc.edu
Abstract. Multimedia and real-time services are spreading all over the Internet. The delivery quality of such content is closely related to network performance, for example low latency and few packet losses. Both customers and operators want some feedback from the network in order to assess the quality actually provided. There are proposals for distributed infrastructures for the on-line reporting of these quality metrics. The issue all these infrastructures must face is the high resource requirements of accurate and live reporting. This paper proposes an adaptive sampling methodology to control the resources needed for network performance reporting. Moreover, the solution keeps up with accurate live reporting of the main network performance metrics. The solution is tested in a Europe-wide testbed with real traffic, which allows us to assess the accuracy and to study the resource consumption of the system.

Keywords: Traffic sampling, Adaptive sampling, Quality assessment, Network Metrics, Distributed traffic monitoring.
This work was partially funded by IST under contract 6FP-004503 (IST-EuQoS), MCyT (Spanish Ministry of Science and Technology) under contract TSI 2005-07520-C03-02 and the CIRIT (Catalan Research Council) under contract 2005 SGR 00481.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 252–263, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

Since the appearance of multimedia tools such as Skype or Ekiga, the Internet has become a multimedia-aware system. Using these services usually comes with an associated fee. Hence, clients require some guarantees in the Service Level Agreement (SLA). This raises the need to give customers some feedback about the performance delivered by the network, nowadays achieved through the so-called network quality metrics. Therefore on-line traffic monitoring infrastructures for network performance assessment are quickly spreading. The main goal of such systems is to efficiently provide a distributed intra- and inter-domain infrastructure in which to evaluate the network metrics.

Distributed network monitoring infrastructures can be divided into two main groups: high level and low level. On the one hand, high-level systems deliver information about the network's status by focusing on the study of its capacity and availability, while providing means to visually monitor the status of the traffic. On the other hand, low-level approaches focus on direct traffic observation for assessing the network performance. These systems may use a distributed infrastructure in order to gather the network metrics. This requires some control traffic to maintain the structures for the live reporting.

Our solution uses a low-level approach which, instead of providing information about the availability of the network or gathering information for off-line processing, provides real-time information about the following network metrics: One-Way Delay (RFC-2679), Packet Loss Ratio (RFC-2680) and bandwidth usage. These metrics are among the most used, as stated in ITU-T Rec. Y.1540, and help operators to monitor and verify the status of their network. In order to achieve this we proposed the Network Parameter Acquisition System (NPAS) [1], a distributed monitoring infrastructure that collects and analyses the traffic on-line in an intra-domain environment. NPAS collects the traffic at all the different ingress and egress points of the domain under analysis, monitors the desired flows and sends the collected information to an analysis unit. The analysis unit computes the metrics from the collected timestamps and packet sizes.

Gathering this information on a per-packet basis is very expensive in terms of the bandwidth and processing resources needed by the control traffic. In our previous work we proposed mechanisms that used traffic aggregation [1] and static traffic sampling techniques [2] in order to reduce the resource consumption. We plan to further reduce the used resources with an adaptive sampling technique. The novel distributed adaptive sampling proposed here dynamically adapts the sampling rate depending on the traffic conditions, reducing the resources needed by the control traffic down to 4% compared to the case of per-packet reporting. Specifically, our contribution uses the IP Delay Variation in order to estimate the accuracy of the sampled results, adjusting the sampling rate accordingly with minimum loss in accuracy. Moreover, the technique efficiently controls the use of resources, scheduling the sampling rates as required by the system.
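To make the role of timestamps and packet sizes concrete, the following sketch shows how an analysis unit could derive the three metrics from matched ingress and egress records. It is a hypothetical illustration, not the NPAS implementation: the record layout, the function name and the example values are assumptions, and the two capture points are assumed to have synchronized clocks.

```python
def flow_metrics(ingress, egress, interval_s):
    """Compute (mean one-way delay, packet loss ratio, bandwidth in bit/s).

    ingress, egress: dict mapping a packet identifier to a tuple
    (timestamp in seconds, size in bytes) as seen at each capture point.
    interval_s: length of the reporting interval in seconds.
    """
    delivered = [pid for pid in ingress if pid in egress]
    delays = [egress[pid][0] - ingress[pid][0] for pid in delivered]
    mean_owd = sum(delays) / len(delays) if delays else None
    loss_ratio = 1.0 - len(delivered) / len(ingress) if ingress else None
    # Bandwidth is measured on the packets that actually left the domain.
    bandwidth = 8 * sum(egress[pid][1] for pid in delivered) / interval_s
    return mean_owd, loss_ratio, bandwidth


# Illustrative records (assumed): packet 3 is lost inside the domain.
ingress = {1: (0.00, 100), 2: (0.10, 100), 3: (0.20, 100), 4: (0.30, 100)}
egress = {1: (0.05, 100), 2: (0.16, 100), 4: (0.34, 100)}
owd, plr, bw = flow_metrics(ingress, egress, interval_s=1.0)
```

Reporting one record per packet from every capture point is what makes this approach costly, which is the motivation for sampling only a subset of the packets.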
The system is tested in a Europe-wide scenario using 12 different locations with different traffic profiles. The results show that the enhancements permit tight control over the resources used for the reporting and an increase in accuracy compared to the case of static sampling.

The rest of the paper is structured as follows. The next section reviews existing solutions for distributed network performance assessment, centring on those most similar to our work. The paper continues by describing the state of the art in the Network Parameter Acquisition System and the current static sampling solution. After this the paper centres on the main proposal, which is an adaptive sampling solution to schedule the resources used for the live network performance assessment. This approach is validated in Section 5, where tests in a real scenario are performed. The paper finishes with the conclusions and the work left for further study.
2 Related Work Traffic sampling is a broadly deployed technique used to reduce the resources required for traffic analysis [3, 4]. In general the sampling schemes are static [5], where a sampling rate is decided in advance and applied during all the study. Others use adaptive
R. Serral-Gracià, A. Cabellos-Aparicio, and J. Domingo-Pascual
methods [6], which dynamically decide the best sampling rate over time, adapting to the changes experienced by the system. Given the distributed nature of our scenario, the applicable sampling schemes are limited. Duffield et al. developed a sound methodology, namely trajectory sampling [7], based on hash sampling, which was used to infer packet paths simultaneously at different collection points. This scheme has been used several times in other scenarios [2, 4, 8] with static sampling rates. As stated before, we propose a novel adaptive sampling technique based on trajectory sampling, but relying on the IP Delay Variation (IPDV). We apply this technique to a distributed scenario used to monitor specific network performance metrics. Regarding distributed network performance assessment proposals, to the best of our knowledge very few tools have been published. Firstly, perfSONAR [9] is a tool intended to monitor any network metric in a distributed fashion. Specifically, the authors present a methodology that provides meaningful network performance indicators, which can be used to visually monitor the status of the traffic. The main difference between this tool and our proposal is that while perfSONAR limits the study to the link status by active traffic generation, NPAS analyses the network metrics directly by passively collecting the traffic. Secondly, InterMON [10] is based on the interaction and coordination of different kinds of tools with access to a common database. The system's main goal is to gather network metrics for off-line processing, analysis and data mining. In this case, the main difference between InterMON and our proposal is that we focus on providing live values for some network metrics, not just on gathering information for off-line processing.
3 Description of Network Parameter Acquisition System
The Network Parameter Acquisition System, as proposed in [1], is a distributed network performance assessment infrastructure. It computes on-line the network metrics suitable to quantify the network performance over sensitive traffic (e.g. VoIP or videoconferencing). The system collects the traffic at each ingress and egress point of the network and analyses only the traffic selected by the configured filters. It computes the following network metrics: One-Way Delay (OWD), Packet Loss Ratio (PLR) and used bandwidth.
3.1 Basic System
This infrastructure presents a framework which reports the intra-domain traffic metrics to higher layer entities to assess whether the specified performance constraints (policies) are fulfilled. This mechanism can also be used to trigger alarms that give feedback to other system units in case the network is not providing the contracted quality. The management of both alarms and policies is out of the scope of this specification. Figure 1 shows a generic network scenario where the system is deployed. It is composed of the following entities:
Network Performance Assessment Using Adaptive Traffic Sampling
Fig. 1. Example Scenario
The Monitoring Entity (ME) is in charge of the traffic collection via selection filters. The traffic selection policy aggregates the data on a per-flow or per-Class-of-Service (CoS) basis, but it can easily be extended to others. The traffic information collected by the MEs is sent to a per-domain analyser entity (the Processing Entity, PE), which computes the network metrics of all the traffic under analysis. The Processing Entity (PE) is the gathering point within the domain. It configures the ME policies, performs most of the processing, matches the information of the packets coming from the MEs and computes the network metrics on-line. The results are published as a network management service to monitor the status of the traffic under analysis in real time. The more information the MEs send, the more control traffic the monitoring requires. In general this can be a problem for under-provisioned domains without dedicated links for network management traffic; we believe that this is the case in most ISPs. NPAS controls this issue by having three different modes of operation: i) per-packet reporting, where the traffic information is sent on a per-packet basis; ii) time aggregation, where a sliding time window is used to optimise the network usage; iii) traffic sampling, as further discussed in Section 3.2. More detail on the specification of the first two modes and the analysis of the bandwidth usage can be found in [1]. In this paper we focus on the traffic sampling mode of operation.
3.2 Static Distributed Sampling
The complexity of applying sampling in this distributed scenario is that centralised techniques (e.g. the techniques studied in [11]) are not well suited for this task, since they do not guarantee that all the MEs capture exactly the same set of packets. Thus, accurately determining packet losses or one-way delays is not feasible. We addressed this issue in [2] by using a deterministic sampling technique. It is used to match exactly the same packets across the different MEs to compute the network metrics. This permits all the MEs to collect traffic independently; only the selection function has to be shared by the MEs at start-up time.
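As a rough illustration of the matching step performed by the PE, the following Python sketch pairs the reports of two MEs and derives the OWD and PLR. The record layout (a dictionary from a packet identifier to a timestamp in seconds) and the name `compute_metrics` are assumptions made for this example; they are not part of the NPAS specification.

```python
from statistics import mean


def compute_metrics(ingress, egress):
    """Match packets seen at an ingress ME against those seen at an egress ME.

    ingress, egress: dict mapping a packet identifier to a timestamp (seconds).
    Returns (average one-way delay or None, packet loss ratio).
    """
    # A packet is matched when both MEs reported the same identifier.
    delays = [egress[pid] - ts for pid, ts in ingress.items() if pid in egress]
    lost = len(ingress) - len(delays)
    owd = mean(delays) if delays else None
    plr = lost / len(ingress) if ingress else 0.0
    return owd, plr


# Hypothetical reports: packet p3 never reaches the egress ME.
ingress_report = {"p1": 0.000, "p2": 0.020, "p3": 0.040, "p4": 0.060}
egress_report = {"p1": 0.010, "p2": 0.032, "p4": 0.074}
owd, plr = compute_metrics(ingress_report, egress_report)
print(f"OWD = {owd * 1000:.1f} ms, PLR = {plr:.2f}")
```

The sketch only works if both MEs report the same set of packet identifiers, which is why the sampling has to be deterministic.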
This distributed sampling technique computes a hash function over a set of fields of the packet's header. For this to work in our environment, using only one hash table is not sufficient, since the monitoring framework has to guarantee that all the flows under analysis are considered. So, the sampling must be applied within individual flows, not directly to all the collected traffic. Hence, the packets are identified by two different keys, flow and packet, with the corresponding two-level hash table. Thus, a packet is only analysed if it falls within specified positions in the hash table ($A$), of which only the first $r$ packets are considered. Changing $r$ gives different sampling rates ($r/A$). The hash table is sent from each ME to the PE when $A$ packets have arrived or after a timeout $t$. This time interval $t$ is known as a bin and it sets the upper bound for the live reporting interval. This permits efficient control of the resources and of the applied sampling rate.
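The deterministic selection can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the choice of SHA-256, the byte-string keys and the names `bucket` and `sample` are hypothetical, and the text does not specify the hash function or header fields actually used.

```python
import hashlib
from collections import defaultdict


def bucket(key: bytes, size: int) -> int:
    """Map a key to a position in a table of `size` slots (assumed hash: SHA-256)."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % size


def sample(packets, A: int, r: int):
    """Two-level selection: first by flow key, then by packet position.

    packets: iterable of (flow_key, packet_key) byte strings.
    A packet is kept only if its position among A slots is below r,
    which gives a nominal sampling rate of r / A within each flow.
    """
    table = defaultdict(set)  # flow key -> set of selected packet keys
    for flow_key, packet_key in packets:
        if bucket(packet_key, A) < r:
            table[flow_key].add(packet_key)
    return table


# Two MEs that see the same packets (in any order) select the same subset.
packets = [(b"flow1", str(i).encode()) for i in range(1000)]
me_a = sample(packets, A=100, r=10)
me_b = sample(reversed(packets), A=100, r=10)
print(me_a == me_b)
```

Because the decision depends only on the packet content, no coordination between MEs is needed beyond sharing the selection function and the values of `A` and `r`.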
4 Adaptive Distributed Sampling
Using traffic sampling to reduce the required resources is usually a good option, but choosing the optimal sampling rate with minimum loss of accuracy is not straightforward. Moreover, since the control traffic (the reporting information from the MEs to the PE) must be limited on the network, the per-flow sampling rate might have to be reduced further. Hence, selecting the right flows whose sampling rate is reduced leads to better results than just sharing the sampling rate equally among the flows. The methodology presented here permits efficient estimation of the One-Way Delay and Packet Loss Ratio using adaptive sampling.
4.1 Problem Formulation
Consider a domain $D$ with $R = \{r_1, \ldots, r_n\}$ the ingress and egress points of the network, and $F_D = \{f_1, \ldots, f_m\}$ the list of active flows within the domain. Then $R_{f_i}(t)$ ($R_i$ from now on) is the rate of the $i$-th flow at time $t$, $n$ being the number of MEs and $m$ the number of flows. For simplicity we consider throughout the paper that the resources of an ME are the bandwidth required to send information to the PE. The resources required to monitor all the existing flows from ME $i$ to ME $j$ are

$$x_{ij} = S \sum_{k=1}^{m} \delta_{ijk} R_k \tag{1}$$

where $S$ is a constant defining the amount of resources needed to process (and send) a single control packet, for example the number of bytes. As it is a constant, for ease of exposition we will assume $S = 1$. Further, $\delta_{ijk} = 1$ if $f_k$ goes from $r_i$ to $r_j$ and 0 otherwise. From this we derive that the resources required for each ME ($X_i$) are $X_i = \sum_{j=1}^{n} (x_{ij} + x_{ji})$. Hence, reporting on a per-packet basis implies that the total amount of resources required for on-line monitoring in domain $D$ is $X_D = \sum_{i=1}^{n} X_i$. In general, per-packet reporting requires too many resources to be feasible. We ease this requirement by applying adaptive traffic sampling to the solution. We consider
that $X_D$ are the global resources available to perform the collection and performance assessment. We need a fair share of the resources among all the MEs, considering that MEs with more load (e.g. with a larger number of monitored flows) deserve more of the resources. Hence, we need to adapt the per-flow sampling rate ($\rho$) in order to comply with this restriction:

$$X_D \geq X \geq S \sum_{i=1}^{m} \rho_i R_i \tag{2}$$

where $\rho_i$ is the sampling rate ($0 \leq \rho_i \leq 1$) of flow $i$, $R_i$ its rate in packets per second and $X$ the maximum resources reserved for the monitoring. Note that $R_i$ (in packets/sec) is known, since the ME must initially collect all the traffic to decide whether it falls within the specified threshold set in the hash table, as described in Section 3.2. We need a lower bound of the sampling rate, namely $\rho_{i_{min}}$, which guarantees that it is adjusted depending on the rate on a per-flow basis, since low-rate flows at small time scales are very sensitive to sampling rates. This lower bound determines the minimum required resources for the whole monitoring task. Thus,

$$X_{min} = \sum_{i=1}^{m} \rho_{i_{min}} R_i \tag{3}$$

where $\rho_{i_{min}}$ is the minimum $\rho_i$ for flow $i$. This works well if the traffic has constant rates, but since traffic in general is variable these values need to be updated periodically. Moreover, new and expired flows are bound to appear over time. Both the flow management and the update strategy will be discussed later. The minimum applicable sampling rate has to take into account the rate of the flows, limited by the expression $\rho_{i_{min}} = \frac{1}{R_i} \;\; \forall i \mid 1 \leq i \leq m$, where $\rho_{i_{min}}$ is the minimum applicable sampling rate for each $f_i$. This guarantees that at least one packet per flow is captured. If $c_{min}$ is the number of samples per $f_i$, then

$$\rho_{i_{min}} = \begin{cases} 1 & \forall i \mid c_{min} \geq R_i \\ \dfrac{c_{min}}{R_i} & \text{otherwise} \end{cases} \tag{4}$$

when more precision is required or more resources are available, assuring that at least $c_{min}$ packets per flow are considered. In the same way we can define the maximum sampling rate as

$$\rho_{i_{max}} = \begin{cases} 1 & \forall i \mid c_{max} \geq R_i \\ \dfrac{c_{max}}{R_i} & \text{otherwise} \end{cases} \tag{5}$$

where $c_{max}$ determines the maximum number of per-flow packets to be collected. Then $X_{max} = \sum_{i=1}^{m} \rho_{i_{max}} R_i$ and $X_{max} \leq X \leq X_D$. In order to ease the exposition we define $x_{i_{min}}$ and $x_{i_{max}}$, which refer to the minimum and maximum resources for a particular flow respectively.
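The bounds of equations 3 to 5 can be checked numerically with a short Python sketch. The flow rates and the values chosen for $c_{min}$ and $c_{max}$ below are hypothetical, as are the function names; $S = 1$ is assumed as in the text.

```python
def rho_bound(rate: float, c: float) -> float:
    """Sampling-rate bound of equations 4 and 5: 1 if c >= R_i, else c / R_i."""
    return 1.0 if c >= rate else c / rate


def resource_bounds(rates, c_min, c_max):
    """Return (X_min, X_max) as sums of rho_bound * R_i over all flows (S = 1)."""
    x_min = sum(rho_bound(r, c_min) * r for r in rates)
    x_max = sum(rho_bound(r, c_max) * r for r in rates)
    return x_min, x_max


# Hypothetical flow rates and sample limits.
rates = [3, 200, 900]
x_min, x_max = resource_bounds(rates, c_min=2, c_max=50)
print(x_min, x_max)
```

Each flow contributes at most `c_min` packets to `X_min` and at most `c_max` packets to `X_max`; a flow slower than the limit contributes all of its packets.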
4.2 Resource Sharing
With equations 2 and 3 we know that after $X_{min}$ are used, the remaining available resources ($\Delta X$) to share among the flows are

$$\Delta X = X - X_{min} \tag{6}$$

The goal now is to share the $\Delta X$ resources. Let us define $\omega_i$ as the weight assigned to flow $i$, where $\sum_{i=1}^{m} \omega_i = 1$; then the resources assigned to the flow follow equation 7.

$$x_i = \min \{\Delta X \, \omega_i + x_{i_{min}}, \; x_{i_{max}}\} \tag{7}$$
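Equations 6 and 7 translate almost directly into code. The following Python sketch takes the weights as given inputs; the numbers and the name `share_resources` are hypothetical and serve only to show how the per-flow cap takes effect.

```python
def share_resources(X, x_min, x_max, weights):
    """Per-flow allocation following equations 6 and 7.

    X: total resources reserved for monitoring.
    x_min, x_max: per-flow minimum and maximum resources.
    weights: per-flow weights, assumed to sum to 1.
    """
    delta_x = X - sum(x_min)  # equation 6
    return [min(delta_x * w + lo, hi)  # equation 7
            for w, lo, hi in zip(weights, x_min, x_max)]


# Hypothetical example with three flows.
allocation = share_resources(
    X=20,
    x_min=[2, 2, 2],
    x_max=[3, 50, 50],
    weights=[0.5, 0.25, 0.25],
)
print(allocation)
```

In this example the first flow is capped at its maximum, so part of its share stays unused; the total allocation therefore remains below `X`.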
It can be noted that the resources assigned to the flow are bounded by the $X_{max}$ as assigned previously. The weights ($\omega$) may be assigned in different ways; in this work, as a proof of concept, we study IPDV-driven sharing.
IPDV driven. The $\Delta X$ resources need to be shared smartly. Obviously it is much more useful to increase $\rho_i$ for flows with less accuracy in the metric estimation. But the estimation error is unknown in advance, so it is not possible to determine the sampling rate deterministically. The proposed mechanism uses IPDV as a measure of the accuracy of the estimator. Before detailing the methodology, it is important to define IPDV in our context. IPDV is usually defined as $OWD_i - OWD_{i-1}$, excluding lost packets (see RFC 3393), or as $OWD_{99.9th\,prc} - OWD_{min}$ in ITU-T Rec. Y.1541. In our work the first definition is not applicable since the packet sequence is lost during the sampling. Regarding the second option, obtaining the 99.9th percentile for low-rate flows is not possible because there are too few packets in the bin. As an alternative, RFC 3393 states that it is possible to specify a custom selection function to compute IPDV. Hence we use $IPDV_i = |\overline{OWD} - OWD_i|$ for all the $i$ packets within the bin, where the estimator $\overline{OWD} = E[OWD]$ is taken over all the sampled packets. Using the IPDV as an accuracy estimator is based on the following rationale:
1. Constant OWD leads to null IPDV, where both OWD and IPDV can be perfectly estimated using $\rho_{i_{min}}$.
2. High IPDV with low $\rho_i$ might lead to inaccurate OWD, since the subset of packets selected by $\rho_i$ may not be representative enough. Hence it is worth increasing $\rho_i$ for highly variable flows.
3. In some cases, however, with $\rho_i$ we can incorrectly estimate a small IPDV due to the bias of the selected subset. This false negative will be corrected in later bins as IPDV converges.
4. The other case is incorrectly estimating a high IPDV when in reality it is low. This is a false positive, which forces a needless increase of $\rho_i$ for the flow but does not lead to incorrect estimation.
The probability of false positives or false negatives decreases as $\rho_i$ increases. The system then automatically tunes $\rho_i$, reducing these mistakes in the estimation. Therefore, the different weights ($\omega_i$) are distributed using:

$$\omega_i = \frac{IPDV_i}{\sum_{j=1}^{m} IPDV_j} \tag{8}$$

Fairly scheduling the $\Delta X$ resources proportionally to the delay variation of each flow.
4.3 Update Strategy
Flow rates vary over time; ideally, our approach would need the rate of all flows in advance. Since this is not possible in a distributed scenario, our proposal uses a periodic updating mechanism divided into these steps:
1. Warm-up phase: Initially the PE assigns $x_{min}$ resources to the system.
2. Working phase: In each step (bin of size $t$) the PE collects the rates and the estimated network metrics and computes the corresponding $c_i$ for each flow $f_i$ as explained before. This $c_i$ will be effective in the next step.
Since the rate ($R_i$) might vary from one bin to the next, the estimated effective rate considered in the analysis is $\hat{R} = \frac{\sum_{i=1}^{k} R_i}{k}$, where $k$ is the history of the system.
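The IPDV-driven weighting and the rate estimate of the update strategy can be sketched in Python as below. Several points are assumptions of this example rather than statements from the text: the per-flow IPDV is taken as the mean of the per-packet values $|\overline{OWD} - OWD_i|$ in the bin, equal weights are used when every flow has zero IPDV, and the history is kept as a plain list of past rates.

```python
from statistics import mean


def flow_ipdv(owds):
    """Mean absolute deviation of the sampled OWDs of one flow in a bin (assumed aggregation)."""
    avg = mean(owds)
    return mean(abs(avg - d) for d in owds)


def ipdv_weights(ipdvs):
    """Weights of equation 8; falls back to equal weights if all IPDVs are zero (assumption)."""
    total = sum(ipdvs)
    if total == 0:
        return [1 / len(ipdvs)] * len(ipdvs)
    return [v / total for v in ipdvs]


def effective_rate(history, k):
    """Average of the last k observed rates of a flow."""
    recent = history[-k:]
    return sum(recent) / len(recent)


# Hypothetical OWD samples (ms) for three flows in one bin.
ipdvs = [flow_ipdv([10, 10, 10]), flow_ipdv([10, 20]), flow_ipdv([5, 35])]
weights = ipdv_weights(ipdvs)
print(weights, effective_rate([100, 120, 140], k=2))
```

A flow with constant delay receives a zero weight and therefore stays at its minimum allocation, which matches the first point of the rationale.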
4.4 New and Expired Flows
Since users connect to and disconnect from many services, new flows continuously arrive and others end in a domain. In our system the effects of such behaviour are that expired flows release resources for the reporting, while new flows force a rescheduling of the already shared ones. If a flow does not have any active packets in the bin, its resources are not used in the current step. Note that increasing $\rho$ for other flows would lead to incorrect estimation, as the other MEs receiving (or sending) the flow are not aware of the situation. In the next bin, the flow without active packets will be assigned $c_{min}$ resources and will enter the Warm-up phase again. If a flow stays in the Warm-up phase during several consecutive bins it is considered expired, hence it is removed from the resource scheduling. Another condition to be considered is the appearance of new flows in the system. When this occurs, there is no time for the ME to ask the PE for a rescheduling of the resources, so the ME enters local rescheduling mode. We know that each ME has $X_i$ resources dedicated to it. Therefore, to accommodate the new flow ($x_j$) the ME has the following alternatives:
1. There are flows without active packets in the bin: in this situation $c_{min}$ resources from that inactive flow are used for the new one. Equation 7 guarantees that the resources are present.
2. All the flows have active packets: the ME ($r_j$) searches for $\max(x_i) \;\; \forall i \in \{1, \ldots, m\}$ and shares its resources in the following way: $x_{i_{new}} = x_i - c_{min}$ and $x_{j_{new}} = c_{min}$, forcing $x_{j_{new}}$ to enter the Warm-up phase.
The PE can fairly reallocate the resources to fit the new situation in the next step.
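The two local rescheduling alternatives can be illustrated with the following Python sketch. The data layout (a dictionary from flow name to allocated resources and a set of flows that are active in the bin) and the name `admit_new_flow` are assumptions made for this example.

```python
def admit_new_flow(allocation, active, new_flow, c_min):
    """Local rescheduling at an ME when a new flow appears.

    allocation: dict mapping flow name -> resources currently assigned.
    active: set of flows that have packets in the current bin.
    Returns a new allocation in which `new_flow` holds c_min resources.
    """
    updated = dict(allocation)
    idle = [f for f in updated if f not in active]
    if idle:
        # Alternative 1: borrow c_min from a flow without active packets.
        donor = idle[0]
    else:
        # Alternative 2: borrow c_min from the flow with the largest share.
        donor = max(updated, key=updated.get)
    updated[donor] -= c_min
    updated[new_flow] = c_min  # the new flow starts in the Warm-up phase
    return updated


current = {"f1": 10, "f2": 4}
with_idle = admit_new_flow(current, active={"f1"}, new_flow="f3", c_min=2)
all_active = admit_new_flow(current, active={"f1", "f2"}, new_flow="f3", c_min=2)
print(with_idle, all_active)
```

In both cases the total held by the ME is unchanged, so the ME stays within its $X_i$ until the PE reallocates in the next step.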
5 Tests and Results
The adaptive sampling methodology proposed in this paper has been validated by a set of more than 500 experimental tests performed during 2006 using twelve different testbeds across Europe. The testbeds were provided by EuQoS [12] partners, covering a total of 5 countries and 4 different access technologies (LAN, xDSL, UMTS and WiFi) [13] with an overlay architecture on the Géant research network. For this work the testbed is configured to act as a single domain, with one ME deployed on each testbed, amounting to 12 MEs and 1 PE in this scenario.
5.1 Validation Tests
The tests were performed by actively generating synthetic periodic traffic emulating different applications, namely VoIP and videoconferencing, with a broad set of combinations of packet rates and packet sizes, from 16 up to 900 packets per second, and a random duration between 1 and 10 minutes. A wide range of packet loss and delay variation conditions were encountered, given the different cross-traffic found on the network on different days, at different hours and at the physical location of each testbed. The goal of the evaluation is twofold: on the one hand we compute the accuracy of the estimation of OWD and PLR; on the other hand, we analyse the resources used for the task. For each test the full trace was collected. With the traces we simulated a network with the flows entering and leaving the system following a uniform random distribution, up to a maximum of 100 simultaneous flows. This way we were able to evaluate the results for different resource reservations ($X$), taking the complete original trace as reference. In the experiments the values of $X$ used were 7000, 5000, 3000, 2000, 1000, 500, 300 and 200, with the constant $S = 1$ as detailed before. We also consider a bin size of $t = 175$ ms, which provides a good trade-off between the interval for live reporting and the collection time needed to apply the proper sampling rate. We previously used this bin size with success in [1].
In summary, the methodology used to obtain a trace for each resource reservation is:
1. Generate the test traffic and monitor all the traces in the MEs.
2. Simulate the scenario with multiple simultaneous flows using the real traces.
3. Split the results into bins of size $t$.
4. Apply the adaptive sampling off-line for the different resources ($X$).
5. For each bin and $X$ the PLR and the average OWD are computed.
6. Finally the estimated OWD for the sampled bin is compared with the real value.
5.2 Evaluation Results
In order to evaluate the system we compare the accuracy obtained using adaptive sampling with that of a fairly equivalent static sampling. Since the sampling rate ($\rho$) depends directly on $c$ and $R_i$, not all values of $\rho$ within a bin lead to a feasible situation. For example, for low-rate flows (e.g. 16 packets per second), the bin size $t = 175$ ms leads to 3 packets per bin, so $\rho_{i_{min}} = \frac{2}{3}$, since we need at least 2 packets to compute the IPDV. But for flows with 200 packets per bin, $\rho_{i_{min}} = \frac{2}{200}$.
[Figure 2: two CDF plots with Probability on the Y-axis and the error on the X-axis. (a) Adaptive sampling, with curves for X = 200, X = 1000 and X = 7000. (b) Static sampling, with curves for ρ = 3.8%, ρ = 11.3% and ρ = 30.9%.]
Fig. 2. CDF of OWD accuracy comparison between adaptive and static sampling
Required resources: The $\rho_{eff}$ values of Table 1 show the amount of resources effectively needed for each value of $X$. They give the global sampling rate applied, taking into account all the flows and their durations. This value is obtained by aggregating all the considered samples out of the total amount of packets sent by all the tests. As can be noted, for low $X$ values the sampling rate is very small, which means that only 3.8% (in the case of $X = 200$) of the resources needed to monitor $X_D$ are required by our proposal.

Table 1. Effective global sampling and Delay and Loss errors for 95th percentile per X

X       ρeff     εdelay   γloss
200     3.8%     0.025    0.22
300     4.5%     0.018    0.17
500     7.3%     0.013    0.15
700     9.0%     0.012    0.11
1000    11.3%    0.010    0.09
2000    17.6%    0.008    0.07
3000    23.4%    0.007    0.06
5000    27.3%    0.005    0.05
7000    30.9%    0.005    0.05

OWD Estimation: In order to evaluate the accuracy of the OWD estimation we compared the results obtained per $X$ with the real traces by using relative delay error values: $\varepsilon = 1 - \frac{\hat{d}}{D}$, where $D$ is the average OWD per bin taken from the complete trace and $\hat{d}$ stands for the estimate obtained by our solution for different $X$. Figure 2 shows the comparison between adaptive sampling (left) and static sampling (right). The figures show the Cumulative Distribution Function (CDF) for various $X$ in the adaptive case and for the equivalent sampling rates in the static case. The X-axis of the figures holds the relative error ($\varepsilon$) while the Y-axis is the error probability. The results clearly show the gain obtained by our adaptive solution: adaptive sampling outperforms static sampling by around 50% for low values of $X$. For example, in 95% of the tests with $X = 200$, $\varepsilon_A \leq 0.020$ while $\varepsilon_S \leq 0.040$. When the resources are increased the difference is reduced to 5% and below for $X \geq 1000$. Moreover, adaptive sampling gives a more robust and reliable mechanism to control and schedule the used resources. This appreciable gain in accuracy is further enhanced when dealing with packet losses, as shown later.
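The error measure and the 95% figures read from the CDFs can be reproduced on toy data with the Python sketch below. The per-bin values are invented for illustration, taking the magnitude of the error before ranking is an assumption of this example, and the nearest-rank percentile is only one of several common definitions.

```python
def relative_error(d_hat: float, D: float) -> float:
    """Relative delay error of an estimate d_hat against the full-trace average D."""
    return 1 - d_hat / D


def percentile(values, q: int):
    """Nearest-rank percentile; q is an integer percentage between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q * n / 100
    return ordered[rank - 1]


# Hypothetical (D, d_hat) pairs in ms, one per bin.
bins = [(10.0, 9.9), (12.0, 12.3), (8.0, 8.0), (15.0, 14.4)]
errors = [abs(relative_error(d_hat, D)) for D, d_hat in bins]
p95 = percentile(errors, 95)
print(f"95% of the bins have an error of at most {p95:.3f}")
```

Reading a CDF curve at probability 0.95 corresponds to the value returned by `percentile(errors, 95)` here.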
[Figure 3: two CDF plots with Probability on the Y-axis and the absolute PLR error (γ, from 0 to 0.5) on the X-axis. (a) Adaptive sampling, with curves for X = 200, X = 1000 and X = 7000. (b) Static sampling, with curves for ρ = 3.8%, ρ = 11.3% and ρ = 30.9%.]
Fig. 3. CDF of PLR accuracy comparison between adaptive and static sampling
From the accuracy point of view, Figure 2 shows that increasing the resources for the reporting greatly improves the OWD estimation; in fact 95% of the tests had an error smaller than 0.025, as summarised under $\varepsilon_{delay}$ in Table 1. PLR Estimation: Given that many flows did not have any packet loss and the system always estimates this case correctly, we removed such flows from the study so as not to bias the results. In this case the PLR comparison is performed using absolute errors. The results show that the accuracy is lower than for OWD, as shown in Figure 3, but much higher than in the case of static sampling, whose error is up to ∼25% in the best considered case (30.9% sampling) for 95% of the tests, compared to the worst result obtained with adaptive sampling, which is ∼20% for the minimum resources. Even so, for reasonable sampling rates the results show an error lower than 10% for 95% of the cases; the full details are listed under $\gamma_{loss}$ in Table 1. A way to overcome this lower accuracy in the estimation of PLR is to increase the bin size ($t$) in order to gather more information per bin, hence improving the sampling estimation of the number of lost packets, whose error is bounded by $\varepsilon \leq 1.96 \sqrt{\frac{1}{c}}$, where $c$ is the number of lost samples detected. A deep analysis of this is left for further study.
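To give a feeling for how slowly this bound tightens, the short Python sketch below evaluates it for a few values of $c$. It assumes that the bound has the form $1.96\sqrt{1/c}$, that is, a 95% confidence factor applied to a relative standard error of $1/\sqrt{c}$; the function name is hypothetical.

```python
import math


def loss_error_bound(c: int) -> float:
    """Assumed 95% bound on the relative error when c lost samples are detected."""
    return 1.96 * math.sqrt(1 / c)


for c in (4, 25, 100, 400):
    print(c, round(loss_error_bound(c), 3))
```

Under this reading, halving the bound requires four times as many detected losses, which is consistent with the suggestion to enlarge the bin for low-rate flows.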
6 Conclusions and Future Work
In this paper we proposed a new adaptive sampling methodology for distributed network performance assessment. Our contribution is focused on the resource scheduling and the optimisation of the required resources with minimum loss of accuracy in the live reporting. As a proof of concept, we use IPDV as an accuracy estimator to set the sampling distribution. Our adaptive approach outperforms static sampling solutions by smartly scheduling more resources to the flows with poorer estimation in the performance assessment. In order to evaluate the accuracy of the estimation of OWD and PLR, we performed a series of tests in a Europe-wide scenario. The results show a negligible relative error in the OWD estimation (lower than 2.5% for 95% of the cases). In the case of PLR the accuracy is lower but kept within bearable thresholds.
PLR estimation can be improved by enlarging the bin size. Hence we leave the use of dynamic bin sizes as an important part of our future work. This way, low-rate flows would report more detailed information and the sampling rate could be better adjusted. Another detail left as future work is the treatment of $X$ as a distributed resource. This better suits a real scenario where the resources assigned to each ME can vary within the same domain. Finally, the current system design always uses all the available resources regardless of the conditions. A further optimisation would be to minimise the resources used when possible.
References
1. Serral-Gracià, R., Barlet-Ros, P., Domingo-Pascual, J.: Coping with Distributed Monitoring of QoS-enabled Heterogeneous Networks. In: IT-News (QoSIP) - 2008, pp. 142–147 (2008)
2. Serral-Gracià, R., et al.: Distributed sampling for on-line QoS reporting. In: Submitted and pending acceptance (2008)
3. Duffield, N.: Sampling for Passive Internet Measurement: A Review. Statistical Science 19(3), 472–498 (2004)
4. Zseby, T.: Comparison of Sampling Methods for Non-Intrusive SLA Validation. In: E2EMon (2004)
5. Cozzani, I., Giordano, S.: Traffic sampling methods for end-to-end QoS evaluation in large heterogeneous networks. In: Computer Networks and ISDN Systems, vol. 30, pp. 1697–1706 (1998)
6. Ma, W., Huang, C., Yan, J.: Adaptive sampling for network performance measurement under voice traffic. In: IEEE-ICC (2004)
7. Duffield, N.G., Grossglauser, M.: Trajectory sampling for direct traffic observation. IEEE/ACM Trans. Netw. 9(3), 280–292 (2001)
8. Zseby, T.: Stratification Strategies for Sampling-based Non-intrusive Measurements of One-way Delay. In: Passive and Active Measurements (2003)
9. Hanemann, A., Boote, J.W., et al.: PerfSONAR: A Service Oriented Architecture for Multi-Domain Network Monitoring. In: Benatallah, B., Casati, F., Traverso, P. (eds.) ICSOC 2005. LNCS, vol. 3826, pp. 241–254. Springer, Heidelberg (2005)
10. Miloucheva, I., Gutierrez, P.A., et al.: Intermon architecture for complex QoS analysis in inter-domain environment based on discovery of topology and traffic impact. In: Inter-domain Performance and Simulation Workshop, Budapest (March 2004)
11. V.-Nogués, C., C.-Aparicio, A., D.-Pascual, J., S.-Pareta, J.: Verifying IP Meters from Sampled Measurements. In: Testing of Communicating Systems XIV. Application to Internet Technologies and Services. IFIP TC6/WG6.1. TestCom. pp. 39–54. Kluwer Academic Publishers, Dordrecht (2002)
12. [IST] EuQoS - End-to-end Quality of Service support over heterogeneous networks (September 2004), http://www.euqos.eu/
13. D5.1.1: Technical requirements for trials, tasks scheduling - (September 2004), http://www.euqos.eu/documents/md_1216_d-511.pdf
On the Applicability of Knowledge Based NAT-Traversal for Home Networks
Andreas Müller, Andreas Klenk, and Georg Carle
University of Tübingen, Computer Networks and Internet, Sand 13, 72076 Tübingen, Germany
{mueller,klenk,carle}@ri.uni-tuebingen.de
http://net.informatik.uni-tuebingen.de
Abstract. The presence of Network Address Translation (NAT) is a hindrance when accessing services within home networks, because NAT breaks the end-to-end connectivity model of the Internet protocol suite. Communication across NATs is only possible if it is initiated from a host belonging to the internal network. Thus, services expecting a connection established from the outside fail in most situations. Existing approaches for NAT-Traversal do not cover the full range of NAT-Traversal methods and fail in certain situations, or deliver suboptimal results in others. Part of the problem of existing approaches is that they do not differentiate between different types of applications. We argue that the classification of applications into four service categories helps to determine the best matching NAT-Traversal technique. An extensive field test enables us to acquire knowledge about the success rates of promising NAT-Traversal techniques. These results will help us to develop a knowledge driven NAT-Traversal framework that makes its choice based on an understanding of NAT behavior, NAT-Traversal options and the service category of the application.
Keywords: NAT-Traversal, Field Test on NAT Behavior.
1 Introduction
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 264–275, 2008. © IFIP International Federation for Information Processing 2008
Today most home networks connect to the public Internet via a “Designated Border Router”, allowing multiple clients to share one public IP address using Network Address Translation (NAT). We use the term NAT as defined in [1]: “a method by which IP addresses are mapped from one realm to another, in an attempt to provide transparent routing to hosts”. Since NAT breaks the end-to-end connectivity model of the TCP/IP-protocol suite, a communication across a middlebox performing NAT is only possible if it is initiated from a host belonging to the internal network. Therefore, applications behind a Network Address Translator aiming to provide a globally reachable service (e.g. Web Servers, Peer-to-Peer or VoIP applications) suffer from the existence of NATs. This problem is known as the NAT-Traversal problem. The NAT-Traversal problem arises as soon as an external host wants to connect to a peer located in the private network. This is because NAT only passes
inbound packets to an internal host if it has a state listed in its mapping table. The creation of a state, however, depends on a packet traveling in the opposite direction to the incoming packet. In other words, before an internal host is able to receive packets, it first has to send a packet to the remote peer itself. The NAT then tries to handle all incoming packets as a response to the outgoing packet and passes them to the appropriate internal host. There are many different approaches for solving the NAT-Traversal problem, but none can claim to solve the problem in all situations without drawbacks. Some are behavior based and only support UDP (e.g. ICE [2]), while others introduce a significant communication overhead and require the presence of infrastructure nodes (e.g. TURN [3]). An alternative to behavior based approaches is to exercise direct control over the NAT, for instance by using UPnP [4] or MIDCOM [5] in order to establish port forwarding entries. There is a lack of extensible frameworks that can choose when to utilize behavior based or control based techniques. Our aim is to establish persistent knowledge about the available NAT-Traversal options and to make an intelligent choice depending on the application. This paper has four main contributions: 1.) a survey on the behavior of Network Address Translators and techniques for NAT-Traversal; 2.) service categories for applications requiring NAT-Traversal support; 3.) a framework proposal for knowledge based NAT-Traversal; 4.) a field test on the success rate and applicability of NAT-Traversal techniques regarding our four service categories. The remainder of this paper is structured as follows. In Sec. 2 we review NAT behavior. Sec. 3 explains NAT-Traversal techniques. Service categories for applications are introduced in Sec. 4. Sec. 5 introduces the idea of the ANTS framework for NAT-Traversal. In Sec. 6 we present our field test on NAT behavior. Sec. 7 surveys related work. Finally, Sec. 8 summarizes the contributions of this paper.
2 NAT Behavior
Current NAT implementations not only differ from vendor to vendor, but also from model to model. If an application works with one particular Network Address Translator, there is no guarantee that it always works in a NATed environment. This section gives a short introduction to a common classification proposed by [6]. For a more detailed discussion please refer to [7] and [8]. A FullCone NAT uses an endpoint independent binding strategy (the same external endpoint is assigned to two consecutive connections from the same source transport address (the combination of IP address and port) independent from the destination transport address) and an independent filtering strategy. Thus, once a mapping is created, every external host can access it in order to communicate with the appropriate internal client. An Address-Restricted-Cone NAT uses an endpoint independent binding in combination with an address restricted filtering strategy: only packets that originate from the same host the initial packet has been sent to are forwarded. Again, endpoint independent binding is used for
A. Müller, A. Klenk, and G. Carle
a Port-Address-Restricted-Cone NAT. In addition to the destination address, a Port-Address-Restricted-Cone NAT also considers the port of the external host when translating packets between the realms. Unlike the three categories described above, a Symmetric NAT uses an endpoint dependent mapping: the external endpoint it assigns also depends on the destination transport address. Depending on the exact port allocation strategy (random or not), port prediction may not be possible. Symmetric NATs usually behave like Port-Address-Restricted-Cone NATs with respect to filtering.
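The four classes differ in only two decisions: how the external port is bound for an outgoing flow, and which inbound packets pass an existing mapping. The following Python sketch models just these two decisions. The class name `ToyNat`, the port pool starting at 40000, and the method names are illustrative assumptions and are not taken from [6].

```python
from dataclasses import dataclass, field

FULL_CONE = "full-cone"
ADDRESS_RESTRICTED = "address-restricted"
PORT_ADDRESS_RESTRICTED = "port-address-restricted"
SYMMETRIC = "symmetric"


@dataclass
class ToyNat:
    """Toy model of NAT binding and filtering (illustration only)."""
    kind: str
    next_port: int = 40000                         # hypothetical external port pool
    bindings: dict = field(default_factory=dict)   # binding key -> external port
    contacted: dict = field(default_factory=dict)  # external port -> contacted endpoints

    def outbound(self, src, dst):
        """Translate an outgoing packet; src and dst are (ip, port) tuples."""
        # Endpoint independent binding keys on the source only;
        # a Symmetric NAT also keys on the destination.
        key = (src, dst) if self.kind == SYMMETRIC else src
        if key not in self.bindings:
            self.bindings[key] = self.next_port
            self.next_port += 1
        ext_port = self.bindings[key]
        self.contacted.setdefault(ext_port, set()).add(dst)
        return ext_port

    def accepts_inbound(self, ext_port, remote):
        """Return True if a packet from remote to ext_port would be forwarded."""
        peers = self.contacted.get(ext_port)
        if peers is None:
            return False                 # no mapping exists for this port
        if self.kind == FULL_CONE:
            return True                  # endpoint independent filtering
        if self.kind == ADDRESS_RESTRICTED:
            return remote[0] in {ip for ip, _ in peers}
        # Port-Address-Restricted and (usually) Symmetric: exact endpoint match
        return remote in peers
```

With an Address-Restricted NAT in this model, a reply from the contacted host is forwarded even if it comes from a different source port, whereas a packet from any other host is dropped. A Symmetric NAT returns a different external port for each destination, which is why port prediction becomes necessary.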
3
Techniques for NAT-Traversal
This section presents popular and promising methods for NAT-Traversal. One important group of NAT-Traversal techniques assumes direct control over the NAT. Adding a Port-Forwarding entry to the Network Address Translator can be done manually, which is only feasible for experienced users, or dynamically. Controlling the Network Address Translator transparently to the user is the aim of protocols such as Universal Plug and Play (UPnP) [4], MIDCOM [5], the NAT/Firewall NSIS Signaling Layer Protocol [9] or the NAT-Port-Mapping Protocol [10]. However, if the protocol is not explicitly supported by the NAT (e.g. UPnP may be disabled due to security issues), it fails. Furthermore, every external port can only be used once. If, e.g., two hosts in a private network want to provide a service running on the same port, only one can use port preservation. There are also NAT-Traversal methods that do not require support by the Network Address Translator. These techniques try to anticipate NAT behavior in order to establish a connection. One such technique is UDP Hole-Punching. Hole-Punching was first described in [11] and more thoroughly documented in [12]. In order to receive packets behind a Network Address Translator, the initial packet coming from the remote host has to look like a response to the Hole-Punching packet sent by the NATed host. Simple Traversal of User Datagram Protocol through NATs (STUN) [6] allows a host to determine its external transport address and works well for endpoint independent mapping. Today, UDP Hole-Punching is used by many proprietary protocols for Instant-Messaging, Online-Gaming and VoIP applications. Due to the connection-oriented design of TCP, Hole-Punching for TCP is more difficult than for UDP. The TCP NAT-Traversal method STUNT [8] requires a NAT to accept an incoming TCP-SYN packet following an outgoing TCP-SYN packet (the outgoing TCP-SYN being the Hole-Punching packet), a sequence of packets usually not seen and therefore often discarded.
Hole-Punching is not a suitable technique for NAT-Traversal of Symmetric NATs, since port prediction usually fails. The goal of Traversal Using Relays around NAT (TURN) [3] is to provide a relay with a public transport address allowing the exchange of data packets between a TURN-client and a public host. TURN has the advantage of working with almost every NAT because all incoming packets are directly related to outgoing packets. As a drawback, the reliability and the bandwidth of a connection depend on the chosen TURN server, which introduces a Single Point of Failure.
On the Applicability of Knowledge Based NAT-Traversal for Home Networks
4
Service Categories for NAT-Traversal
When investigating existing applications suffering from the presence of NAT, we identified four service categories an application can belong to. Each category addresses different requirements and makes assumptions about the network topology and available components. Our first category, Global Service Provisioning (GSP), assumes that only one host has a NAT-Traversal framework running (no signaling is needed). It helps to make a local service globally accessible by creating and maintaining a NAT-mapping (e.g. by sending keep-alive packets). Services such as a Web-Server belong to GSP. Peers of P2P content distribution networks can establish direct connections even if both peers are behind a NAT, as long as one of them supports GSP. Service Provisioning using Pre-Signaling (SPPS) extends the GSP category by giving NAT-Traversal support to both parties, which allows more sophisticated connection establishment procedures such as signaling. SPPS is the category used by most existing frameworks [2][13][14]. The advantage of SPPS is that the NAT-Traversal service can select from all available NAT-Traversal solutions. Therefore, SPPS should be used whenever possible, because it provides the highest success rate regarding NAT-Traversal. Our third category, Secure Service Provisioning (SSP), is an extension to SPPS and addresses applications that only want to allow authorized hosts to create and use a mapping in the Network Address Translator. Many applications were originally designed to run in a closed environment only, such as a home or company network. Such services do not always realize a high level of security, making them a target for hackers if they are reachable from the public Internet. Unfortunately, none of the NAT-Traversal solutions available today actually offers restricted access to certain services. The last category, ALG Service Provisioning (ALG-SP), focuses on applications having problems with realm-specific IP-Addresses in their payload. 
This applies to protocols using inband-signaling on the application layer. This is also strongly related to Bundled-Session Applications with asymmetric connection establishment establishing separate control- and data-connections. The most popular application belonging to this category is VoIP using SIP. Besides ICE, SIP NAT-Traversal is usually solved by STUN or a SIP-aware Network Address Translator. But STUN only works with an independent mapping strategy, and SIP-aware NATs are not widely used. Therefore, in addition to the three categories described above, a NAT-Traversal service should also be able to handle protocols using inband-signaling without relying on a NAT-ALG.
5
Concept for a NAT-Traversal Framework
This section presents the idea of ANTS, a new framework which aims to provide an Advanced NAT-Traversal Service for legacy applications. We first give an overview by describing a particular scenario for NAT-Traversal in Section 5.1. Section 5.2 then discusses the applicability of existing NAT-Traversal techniques regarding our service categories and explains our main idea of a knowledge based framework.
5.1
Towards the Advanced NAT-Traversal Service
ANTS is deployed directly at the host and is aware of all registered applications in order to provide NAT-Traversal support based on the requested service category. Figure 1 shows a scenario where two hosts, both behind NAT, want to establish a direct connection to each other. First, the ANTS-module gathers knowledge about its environment and determines which NAT-Traversal techniques can be used in order to traverse the Network Address Translator it is behind (see 5.2).
[Figure 1 not reproduced. Labeled elements: Host A, NAT A, NAT B, Host B, ClientApplication, ServerApplication, ANTS, Session Manager, Rendezvous Point, Public Endpoint for NAT-Traversal, Remote Endpoint. Labeled steps: ANTS knowledge gathering, register Server Application, Connection Request, ask for remote endpoint, Signaling, create mapping, remote endpoint received, legacy IP-Connection.]
Fig. 1. Scenario for a new NAT-Traversal framework
If a user decides that a particular application should be reachable from the public Internet, the user registers it at a Session Manager that keeps track of all applications and their service categories. With the Session Manager, ANTS is able to provide Global Service Provisioning directly. Whenever an application is added and associated with GSP, the Session Manager calls the NAT-Traversal logic and asks it to allocate an appropriate mapping in the NAT (for GSP, only the host that wants to offer services must run ANTS). SPPS and SSP, however, require some form of signaling in order to exchange connection specific information in advance. Signaling should result in a working public transport address for NAT-Traversal. In order to avoid a Single Point of Failure, signaling should be done in a decentralized way (e.g. Peer-to-Peer SIP (P2P-SIP) [15]). Furthermore, we also aim towards port preservation, meaning that the legacy applications only use realm-specific ports (e.g. port 80 for a web-server) independent of the global port allocated by the NATs.
5.2
Knowledge Based NAT-Traversal
The concept of ANTS is based on the idea of reusing previously obtained knowledge. This means, instead of querying the network on every new connection request, ANTS determines the properties and capabilities of the NAT periodically, resulting in a very fast connection establishment. The Knowledge and Decision Module not only needs to determine all working NAT-Traversal techniques, it also has to make sure that ANTS only applies NAT-Traversal techniques suitable for the requested service category. For example, UPnP cannot be used with Secure Service Provisioning because it violates the idea of a secure public endpoint. Therefore, every NAT-Traversal technique integrated into ANTS has to be associated with an entry as follows: (Identifier → Service Category → Condition) The last field describes the conditions that have to be met for the technique in order to work with the defined service category. Global Service Provisioning depends on NAT-Traversal techniques that allow unrestricted access to a public endpoint. A Port-Forwarding entry created by UPnP is easy to maintain and works independently from NAT behavior. Hole-Punching for GSP is also possible if the NAT is of type Full-Cone. Finally, an independent relay can be used as the third option. Since SPPS makes no assumptions about the accessibility of the mapping, there are no restrictions to be considered. As long as the technique is supported it may be used. As SPPS uses a signaling protocol between two hosts, more sophisticated techniques such as tunneling TCP within UDP are also possible. SSP is an extension to SPPS and relies on the same signaling infrastructure. Since it only allows authorized hosts to allocate and to use a mapping, we have to choose from the following techniques: Hole-Punching with a restricted filtering strategy or TURN (which behaves like a Restricted Cone-NAT) are possible. 
The applicability of NAT-Traversal techniques for ALG-SP depends on the security considerations made by the user for the particular service. As long as we only want to provide NAT-Traversal based on SPPS, any approach can be used. On the other hand, if we prefer to only allow certain hosts to access the NAT-mapping, the considerations made above when discussing SSP also apply to ALG-SP. Figure 2 shows the components necessary for a knowledge based framework. When making a decision (e.g. which NAT-Traversal technique should be used for the requested application) the Knowledge and Decision Module relies on input from an overlaying Input Module. Besides the rather static knowledge about the applicability of existing NAT-Traversal techniques, the Knowledge and Decision Module should also consider user input, as well as input from the Session Manager as described above. However, the most important component of the Input Module has to be able to test the Network Address Translator regarding its core properties and the supported NAT-Traversal methods. Only if ANTS determines the properties of the NAT correctly can it reuse its knowledge when providing a working public endpoint to an application.
[Figure 2 not reproduced. Labeled elements: Input Module (NAT-Tester, Registered Applications, Policies); Knowledge based NAT-Traversal (working techniques, Applicability, decision); steps numbered (1) to (3).]
Fig. 2. Inputs to be considered in order to make decisions
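The entries of the form (Identifier → Service Category → Condition) described in Section 5.2 can be read as a lookup table that is filtered against the stored knowledge about the NAT. The following Python sketch is one possible reading of the conditions stated in the text; the field names, the technique identifiers, and the ordering of the entries are assumptions made for illustration and do not describe the ANTS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NatKnowledge:
    """Facts a NAT-Tester run might record (field names are assumptions)."""
    nat_type: str            # e.g. "full-cone", "port-address-restricted"
    upnp_ok: bool
    hole_punching_ok: bool


RESTRICTED = {"address-restricted", "port-address-restricted"}

# (identifier, service category, condition) triples following Section 5.2.
ENTRIES = [
    # GSP needs unrestricted access to the public endpoint.
    ("upnp", "GSP", lambda k: k.upnp_ok),
    ("hole-punching", "GSP",
     lambda k: k.hole_punching_ok and k.nat_type == "full-cone"),
    ("relay", "GSP", lambda k: True),
    # SPPS: any supported technique may be used.
    ("upnp", "SPPS", lambda k: k.upnp_ok),
    ("hole-punching", "SPPS", lambda k: k.hole_punching_ok),
    ("relay", "SPPS", lambda k: True),
    # SSP: only techniques that restrict who may use the mapping (no UPnP).
    ("hole-punching", "SSP",
     lambda k: k.hole_punching_ok and k.nat_type in RESTRICTED),
    ("relay", "SSP", lambda k: True),
]


def usable_techniques(category, knowledge):
    """Return identifiers whose condition holds for the requested category."""
    return [ident for ident, cat, cond in ENTRIES
            if cat == category and cond(knowledge)]
```

For a Port-Address-Restricted NAT with working UPnP and Hole-Punching, this table would offer UPnP or a relay for GSP, and Hole-Punching or a relay for SSP. Because the knowledge is stored, the lookup needs no network probing at connection time.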
6
Knowledge Gathering for NAT-Traversal
In Section 5 we described the idea of ANTS, a knowledge based approach for solving the NAT-Traversal problem. The fundamental question is how to discover the behavior of individual NAT-Traversal techniques in order to establish this knowledge. Throughout the paper we have stated that the implementation of NAT is not standardized and differs from model to model. To get an overview of the behavior of current NAT implementations we devised the NAT-Tester, a module that can work both as a stand-alone program and as an integral part of ANTS. We then did a field test investigating the behavior of 104 NATs in the wild, with our NAT-Tester available here: http://gex.cs.unituebingen.de. Similar tests have been done in the past [8] [16], but all of them only tested a few NAT properties. For example, the combination of UPnP and different Hole-Punching techniques has never been considered. Additionally, the results depend on the market share of the NAT manufacturers, which differs from country to country. While all former results represent the North American market, our test mainly reflects the situation in the German market.
6.1
NAT Behavior Discovery
As the first test, the NAT-Tester queried a public STUN-Server and determined the properties of the NAT (Table 1). More than 85% of all NATs were either of type Port-Address Restricted or Full-Cone. Symmetric NATs, which are said to be popular in the business market, were rarely discovered. The second test then queried the network to determine if the NAT supports UPnP. According to Table 1, 51.92% of all NATs had UPnP disabled. No IGD means that the NAT-Tester found a UPnP-capable device on the private network, but it was not an Internet-Gateway-Device (IGD).

Table 1. Results of the Field Test: NAT behavior

NAT Type                   Appearance
Port-Address Restricted    50.96 %
Full-Cone                  36.54 %
Address-Restricted          4.81 %
Symmetric                   4.81 %
No-NAT                      1.92 %
Symmetric Firewall          0.96 %

Behavior                   Appearance
Port Preservation          63.46 %
Hairpinning                51.92 %
no UPnP                    51.92 %
no IGD                      9.62 %
UPnP ok                    38.46 %

6.2
Anticipated Success of NAT-Traversal Techniques
The next tests determined the success rates for different NAT-Traversal techniques. First, the NAT-Tester created a Candidate-List (listing all public endpoints to be tested) for UDP containing a STUN-derived public transport address for Hole-Punching. For the Hole-Punching test, the NAT-Tester sent a UDP packet towards the Test-Server, allocating an appropriate mapping in the NAT. The NAT-Tester then started a listener on the private port and waited for a connection from the Test-Server. If a packet was received within a limited amount of time, the test succeeded. Table 2 shows the success rates for UDP. While 92.31% of our 104 NATs supported UDP-Hole-Punching via STUN, the success rate of UPnP was only 38.46%.

Table 2. NAT-Traversal for UDP

Technique        Appearance
Relay            100 %
Hole-Punching    92.31 %
UPnP             38.46 %

Table 3. NAT-Traversal for TCP

Technique        Appearance
Relay            100 %
HP FTP-ALG       67.44 %
HP Low-TTL       51.92 %
UPnP             38.46 %
HP High-TTL      13.95 %
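The UDP Hole-Punching test described above can be sketched in a few lines of Python. This is a minimal illustration, not the NAT-Tester's code: the payload `b"punch"`, the default timeout, and the choice to send and listen on one socket are assumptions, and the Test-Server is assumed to answer to the source address it observes.

```python
import socket
import time


def udp_hole_punch_test(test_server, local_port=0, timeout=2.0):
    """Return True if a packet from test_server arrives after our punch packet.

    test_server is an (ip, port) tuple. The payload is an arbitrary
    placeholder and does not reflect any real wire format.
    """
    deadline = time.monotonic() + timeout
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", local_port))            # the private port under test
        sock.sendto(b"punch", test_server)     # allocates the mapping in the NAT
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False                   # time limit reached
            sock.settimeout(remaining)
            try:
                _, sender = sock.recvfrom(2048)
            except socket.timeout:
                return False                   # nothing arrived in time
            if sender[0] == test_server[0]:
                return True                    # inbound packet passed the NAT
            # Packets from other hosts are ignored; keep waiting.
```

Only the sender's IP address is compared here, so the sketch does not distinguish between Address-Restricted and Port-Address-Restricted filtering. A tester that wants to tell them apart would have the server reply from a second port as well.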
The UDP test algorithm was adapted to TCP and showed interesting results. In [8], the authors identified Hole-Punching using a Hole-Punching packet with a low TTL to be the most promising method with a success rate of 88% and rely on it for their NAT-Traversal technique STUNT2. In our tests this form of TCP-Hole-Punching only succeeded for 51.92% of all NATs. After discovering that the actual success rate was much lower, we modified the NAT-Tester and added two other types of TCP-Hole-Punching to it (43 NATs were tested with the modified NAT-Tester). HP High-TTL uses a high TTL for the Hole-Punching packet, resulting in an RST packet from the NAT. With a success rate of only 13.95%, it cannot be seen as a general solution for TCP-NAT-Traversal. HP FTP-ALG [17] relies on a working ALG for active FTP. The ALG creates mappings in the NAT according to the FTP signaling messages. But instead of providing a working FTP endpoint within the control messages, an independent transport address is listed. This unconventional technique works with 67.44% of all NATs. For future tests it could be interesting to extend this idea to SIP and to provide a transport address within the SDP message. The results of the individual techniques for TCP show that no general solution for TCP-Hole-Punching exists. Therefore, building a framework that automatically detects and applies a working candidate is the most promising way of providing a NAT-Traversal service to applications. Finally, we assumed for all tests that if the NAT-Tester is able to connect to the Test-Server, it is also able to connect to an external relay. Therefore, 100% of the tested NATs support NAT-Traversal using an appropriate data relay such as TURN.

Table 4. Results of the Field Test: Success rates for the different Service Categories

Category        Condition                              Success Rate
GSP1 (UDP)      Full-Cone and HP-UDP                   35.58 %
GSP3 (TCP)      Full-Cone and HP-TCP                   21.15 %
GSP2 (UDP)      UPnP or (Full-Cone and HP-UDP)         57.66 %
GSP4 (TCP)      UPnP or (Full-Cone and HP-TCP)         50.45 %
SPPS1 (UDP)     HP-UDP                                 92.31 %
SPPS2 (TCP)     HP-TCP                                 66.35 %
SPPS3 (TCP)     HP-TCP or HP-UDP                       96.15 %
SPPS4 (UDP)     UPnP or HP-UDP                         92.31 %
SPPS5 (TCP)     UPnP or HP-TCP                         75.96 %
SPPS6 (TCP)     UPnP or HP-TCP or HP-UDP               96.15 %
SSP1 (UDP)      Restricted-NAT and HP-UDP              49.04 %
SSP2 (TCP)      Restricted-NAT and HP-TCP              35.58 %

To adapt the test results to our work, we evaluated the success rates of the individual techniques with respect to our defined service categories. Table 4 shows the categories and the conditions that have to be met according to Section 5.2. For example, GSP requires the use of UPnP or working Hole-Punching support in combination with a Full-Cone NAT. Therefore, 57.66% of our tested NATs supported a direct connection for UDP and category GSP (50.45% for TCP). In all other cases (the remaining percentages) an external relay has to be used in order to provide GSP. For SPPS, which makes no security assumptions, we divided our results into two categories. First we determined the success rates without considering UPnP. With 92.31% of all NATs we were able to establish a direct connection to the host behind the Network Address Translator (66.35% for TCP). 
This rate increased slightly for TCP (to 75.96%) when UPnP was an option. The highest success rate for TCP-NAT-Traversal (96.15%) was observed when we also allowed the tunneling of TCP packets through UDP. Finally, SSP only allows authorized hosts to create and to use a mapping. Therefore, a suitable technique for SSP is Hole-Punching in combination with a NAT implementing a restricted filtering strategy. According to Table 4, this was supported by 49.04% of the NATs for UDP and 35.58% for TCP.
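The per-category rates in Table 4 are obtained by evaluating a boolean condition for every tested NAT and counting the NATs for which it holds. A minimal Python sketch of this evaluation follows. The record fields and the sample records in the usage note are hypothetical and are not the field-test data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NatResult:
    """Outcome of one NAT-Tester run (field names are assumptions)."""
    nat_type: str
    upnp: bool
    hp_udp: bool
    hp_tcp: bool


RESTRICTED = {"address-restricted", "port-address-restricted"}

# Conditions as listed in Table 4, keyed by a descriptive label.
CONDITIONS = {
    "GSP (UDP)": lambda r: r.upnp or (r.nat_type == "full-cone" and r.hp_udp),
    "GSP (TCP)": lambda r: r.upnp or (r.nat_type == "full-cone" and r.hp_tcp),
    "SPPS (UDP)": lambda r: r.upnp or r.hp_udp,
    "SPPS (TCP, UDP tunneling allowed)":
        lambda r: r.upnp or r.hp_tcp or r.hp_udp,
    "SSP (UDP)": lambda r: r.nat_type in RESTRICTED and r.hp_udp,
    "SSP (TCP)": lambda r: r.nat_type in RESTRICTED and r.hp_tcp,
}


def success_rate(results, condition):
    """Percentage of results for which the condition holds."""
    return 100.0 * sum(1 for r in results if condition(r)) / len(results)
```

As a usage example with four invented records (one Full-Cone NAT with working Hole-Punching, two Port-Address-Restricted NATs with working UDP Hole-Punching of which one also offers UPnP, and one Symmetric NAT where nothing works), `success_rate(sample, CONDITIONS["SPPS (UDP)"])` counts three of the four NATs. In every remaining case a relay would be needed.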
7
Related Work
A number of techniques have been proposed to examine NAT behavior. The IETF has standardized STUN [6], which is widely used by Voice over IP (VoIP) applications and phones. The STUN protocol determines the public transport address assigned for a connection and detects the NAT strategy of the middlebox by using a number of probing messages. The drawback of STUN is that it does not identify NATs allowing direct control (e.g. UPnP). Additionally, STUN does not preserve its gathered topology knowledge for later usage. This is not an issue for VoIP where there is enough time to perform the STUN protocol until the other party answers the telephone, but many other applications may suffer from this delay. There is an informational IETF draft presenting results of a survey that also utilized STUN to classify 43 Network Address Translators [16]. Francis did a thorough analysis of NAT implementations in [18] and also considered important properties for NAT-Traversal. However, the tool used is very time consuming and is therefore not able to provide a knowledge gathering component for ANTS. There are already a number of frameworks for NAT-Traversal that can also incorporate knowledge about the NAT behavior. ICE [2] aims to provide a solution flexible enough to work with all network topologies. Whenever a call is established, each phone gathers all possible transport addresses and exchanges them using SIP. Both clients then set up a local STUN-Server and go through the received candidate list. A client tries to connect to each candidate address using STUN binding-requests and responses, resulting in working candidate pairs. ICE describes a general solution for offer/answer protocols, but was mainly designed to provide a NAT-Traversal solution for VoIP. Therefore, it only supports UDP-based candidates. However, there have been studies on extending ICE to work with TCP as well [19]. ICE also requires both peers to have an ICE-implementation running. 
If one side does not, ICE cannot help. In [18] and [13], the authors propose a new architecture called NUTSS. They describe the integration of a TCP NAT-Traversal method called Simple Traversal of User Datagram Protocol through NATs and TCP too (STUNT). NatTrav, published in [14], exchanges public endpoints and uses TCP-Hole-Punching to actually traverse the NAT. The proprietary Skype VoIP application (http://www.skype.com) has a sophisticated probing technique [20] for NAT-Traversal, which, however, is not available to any other application. NAT-Traversal is done using an unknown variation of the STUN and TURN protocols. If possible, a Skype node uses Hole-Punching to establish a direct connection. If not, a Super Node acts as a TURN-Server relaying traffic back and forth. The analysis of the related work shows that there is no single NAT-Traversal technique which is able to handle all situations without having significant drawbacks in others. This is the strong point of our envisioned architecture, which can be extended with behavior based and control based NAT-Traversal techniques and can profit from persistent knowledge about NAT behavior.
8
Conclusion
The NAT-Traversal problem has become more and more important with the increasing popularity of applications following the Peer-to-Peer communication paradigm. Existing solutions to the NAT-Traversal problem have the drawbacks of only working with certain types of NAT implementations, or of lacking support for legacy applications and multiple transport protocols. When analyzing the NAT-Traversal problem more thoroughly, we discovered that the requirements for legacy applications differ significantly, and identified four service categories relevant for NAT-Traversal. The choice of the NAT-Traversal technique depends on the service category an application belongs to. We made a proposal for a knowledge based extensible NAT-Traversal framework that incorporates the NAT behavior and the available NAT-Traversal options in its decision. As a cornerstone of the framework, we implemented the NAT-Tester for knowledge gathering. We did a field test using the NAT-Tester to get an overview of the behavior of current NAT implementations and of promising NAT-Traversal techniques. TCP Hole-Punching, for instance, yielded worse results than expected. We therefore must integrate alternative methods to support TCP and SCTP. Future work addresses the implementation of the framework and connecting legacy applications to it.
References
1. Srisuresh, P., Holdrege, M.: IP Network Address Translator (NAT) Terminology and Considerations. RFC 2663, Internet Engineering Task Force (August 1999)
2. Rosenberg, J.: Interactive Connectivity Establishment (ICE): A Methodology for Network Address Translator (NAT) Traversal for the Session Initiation Protocol (SIP). Internet Draft - work in progress, Internet Engineering Task Force (October 2007)
3. Rosenberg, J., Mahy, R., Matthews, P.: Traversal Using Relays around NAT (TURN). Internet Draft - work in progress, Internet Engineering Task Force (January 2008)
4. UPnP Forum: Internet gateway device (IGD) standardized device control protocol (November 2001)
5. Srisuresh, P., Kuthan, J., Rosenberg, J., Molitor, A., Rayhan, A.: Middlebox communication architecture and framework. RFC 3303, Internet Engineering Task Force (August 2002)
6. Rosenberg, J., Weinberger, J., Huitema, C., Mahy, R.: STUN: Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs). RFC 3489, Internet Engineering Task Force (March 2003)
7. Audet, F., Jennings, C.: NAT Behavioral Requirements for Unicast UDP. RFC 4787, Internet Engineering Task Force (January 2007)
8. Guha, S., Francis, P.: Characterization and Measurement of TCP Traversal through NATs and Firewalls. In: Proceedings of ACM Internet Measurement Conference (IMC), Berkeley, CA (October 2005)
9. Stiemerling, M., Tschofenig, H., Aoun, C., Davies, E.: NAT/Firewall NSIS Signaling Layer Protocol (NSLP). Internet Draft - work in progress, Internet Engineering Task Force (October 2006)
10. Cheshire, S., Krochmal, M., Sekar, K.: NAT Port Mapping Protocol (NAT-PMP). Internet Draft, Internet Engineering Task Force (September 2006)
11. Holdrege, M., Srisuresh, P.: Protocol Complications with the IP Network Address Translator. RFC 3027, Internet Engineering Task Force (January 2001)
12. Ford, B., Srisuresh, P., Kegel, D.: Peer-to-Peer Communication Across Network Address Translation. Technical report, Massachusetts Institute of Technology (2005)
13. Guha, S., Francis, P.: Towards a Secure Internet Architecture Through Signaling. Technical report, Cornell University (2006)
14. Eppinger, J.: TCP Connections for P2P Applications - A Software Approach to Solving the NAT Problem. Technical report, Carnegie Mellon University, Pittsburgh, PA (2005)
15. P2P-SIP IETF Working Group, http://www.p2psip.org
16. Jennings, C.: NAT Classification Test Results. Internet Draft - work in progress, Internet Engineering Task Force (July 2007)
17. Olsson, M.: Extending the FTP "ALG" vulnerability to any FTP client (March 2000)
18. Francis, P., Guha, S., Takeda, Y.: NUTSS: A SIP-based Approach to UDP and TCP Network Connectivity. Technical report, Cornell University, Panasonic Communications (2004)
19. Rosenberg, J.: TCP Candidates with Interactive Connectivity Establishment. Internet Draft - work in progress, Internet Engineering Task Force (November 2007)
20. Baset, S.A., Schulzrinne, H.: An Analysis of the Skype Peer-to-Peer Internet Telephony Protocol. Technical report, Columbia University, New York (2004)
IP Performance Management Infrastructure for ISP
Atsuo Tachibana, Yuichiro Hei, Tomohiko Ogishi, and Shigehiro Ano
KDDI R&D Laboratories, 2-1-15 Fujimino-shi, Saitama, Japan
{tachi,hei,ogishi,ano}@kddilabs.jp
Abstract. An IP performance management infrastructure that has the twin frameworks of performance measurement and topology monitoring is proposed. By combining these frameworks, the infrastructure locates performance-degraded segments. Since the Internet is still highly prone to performance deterioration due to congestion, router failure, and so on, not only detecting performance deterioration but also monitoring the topology and locating the performance-degraded segments in real time is vital to ensure that Internet Service Providers can mitigate or prevent such performance deterioration. The infrastructure is implemented and evaluated through a real-world experiment, and its considerable potential for practical network operations is demonstrated.
1
Introduction
Today’s Internet serves as a communication infrastructure that supports various social and economic activities. The Internet needs to be managed in a reliable and efficient manner, and should be measurable and tractable in terms of its characteristics and internal states by ISPs (Internet Service Providers). However, it is widely known that transient performance deterioration is actually still likely to occur due to the inherent feature of providing cost-efficient and scalable services based on statistical, loosely controlled shares of network resources, meaning that ISP operators have to detect performance deterioration and take certain action to mitigate it. Although SNMP (Simple Network Management Protocol) may provide high accuracy for a metric such as a link-by-link packet loss, it is not suitable for estimating customers’ traffic performance. Hence, active measurement of packet behaviors passing through the targeted intra-network is an essential and rational approach to the network operations of ISPs. In addition to performance measurement, topology monitoring is important for network operation, because failures of a router or a transit network resulting in topology change can potentially severely degrade the performance of the network. Monitoring LSAs (Link-State Advertisements) that are updates of OSPF, a widely deployed intra-domain routing protocol, is a promising approach for ISP operators to efficiently monitor the topologies on the intra-network [1,2,3]. A variety of infrastructures and tools for active measurement and topology monitoring have been extensively developed for research (e.g., [4,5,6,7,8,9,10,11]).
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 276–288, 2008. © IFIP International Federation for Information Processing 2008
However, because most of them aim to clarify research issues based on a different network model and are not designed for real-time network operation, applying them in their present form to a commercial ISP’s network operation remains difficult. Therefore, in this paper, we propose an IP performance management infrastructure that has the twin frameworks of performance measurement and topology monitoring. In addition, analyzing the combination of these two functionalities may reveal other useful features in the measurement data. Locating performance-degraded segments helps ISP operators reduce the cost and configuration effort required for troubleshooting. Network tomography is a promising analysis technique for inferring the internal status of a network solely by measuring end-to-end packet behaviors on multiple paths traversing the targeted intra-network, instead of directly monitoring each network node such as a router. A variety of network tomographic approaches have been extensively studied (e.g., [12,13,14,15,16]). Along these lines, real-time inference methods have also been proposed as a practical way to locate performance-degraded segments [17,18]. These methods actively and continuously measure the end-to-end packet loss rate or delay variation within each measurement period (e.g., 15 seconds and 5 seconds, respectively) on paths from multiple origins to multiple destinations. The infrastructure includes these methods, targeting ISPs’ network operation. The rest of the paper is organized as follows. Section 2 shows the system requirements. In Section 3, we explain the details of the system architecture. In Section 4, in order to assess the feasibility of the infrastructure, we outline an experiment and its results, which indicate the considerable potential of the proposed infrastructure for practical operational use. Finally, we conclude this work in Section 5.
2
System Requirements
Minimum system requirements for the proposed infrastructure are listed below. 1. Performance measurement First, network probes themselves are unsolicited and may cause harm to the network especially on a large scale, by consuming precious router processing resources. Probing packets have to be sufficiently light even when active measurements are simultaneously performed on multiple paths. To address this, we designed the following system requirements. – Probing packets are sent at a low rate (e.g., several Kbps) on each monitored path. To avoid being synchronized with a particular networkinternal queue behavior, the interval of an adjacent probing packet should be chosen randomly as specified in the IPPM (IP Performance Metrics) Working Group of the IETF. 2. Topology monitoring Although traceroute command is one available approach to making route measurements, since traceroute lists the source addresses of gTime exceededh
A. Tachibana et al.
ICMP messages, and these addresses represent the interfaces that received the traceroute probes, the topology information obtained by traceroute is not necessarily accurate. In addition, to monitor all routers’ interface states in a large-scale network precisely, a large number of traceroute probes would be required. Hence, this approach is not suitable for continuous topology monitoring of an entire large-scale network. Instead, in this paper, we focus on monitoring LSAs. Routers running OSPF advertise their link states in LSAs to the network, and OSPF routers can construct a complete view of the topology of the network from these LSAs. Therefore, by passively monitoring the LSAs flooded throughout the intra-network, we can comprehend the topology of the network.
3. Combination of performance measurement and topology monitoring. From the operator’s viewpoint, not only the IP performance on the measurement paths among measurement nodes but also the routes that the probing packets pass through are important. For example, when performance deterioration is detected, operators investigate the MIB (Management Information Base) of the routers on the monitored paths. Hence, the route of each monitored path needs to be calculated in real time by combining the performance measurement information and the topology.
4. Security. In addition to the harmful aspect of network probing, network measurement data may be abused and used to harm the network, because accurate measurement data may help unscrupulous parties to attack the infrastructure. Furthermore, network measurement data can expose private information on network architectures. Thus, the measurement infrastructure, including the handling of measurement data, has to be securely and robustly managed.
5. Cost. Although a hardware-based measurement infrastructure may provide precise timestamps by using a specialized device with a built-in clock, it is difficult to apply to a large-scale distributed measurement infrastructure due to its high cost and lack of expandability. Hence, measurement nodes should be implemented with light-weight software. Our measurement software works even on a small PC equipped with a 400-MHz RISC CPU, 64 MB of RAM, 16 MB of flash ROM, and two 100-Mbit Ethernet interfaces.
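The randomized, low-rate probing required in item 1 can be illustrated with a minimal sketch. It assumes Poisson sampling (exponentially distributed gaps), which is one reading of the IPPM recommendation; the function name, the 8 Kbps rate, and the 64-byte packet size are hypothetical values chosen for illustration and are not taken from the infrastructure itself.

```python
import random


def poisson_probe_intervals(rate_bps, packet_size_bytes, count, seed=None):
    """Return `count` exponentially distributed gaps (seconds) between probes.

    The mean gap is chosen so that the average probing load equals rate_bps.
    All parameter names here are illustrative assumptions.
    """
    if rate_bps <= 0 or packet_size_bytes <= 0:
        raise ValueError("rate and packet size must be positive")
    mean_gap = packet_size_bytes * 8 / rate_bps  # seconds per packet on average
    rng = random.Random(seed)
    # expovariate() takes the rate (1 / mean) of the exponential distribution
    return [rng.expovariate(1.0 / mean_gap) for _ in range(count)]


# Hypothetical example: 64-byte probes at an average of 8 Kbps
gaps = poisson_probe_intervals(rate_bps=8000, packet_size_bytes=64, count=10000, seed=1)
mean_gap = sum(gaps) / len(gaps)  # close to 0.064 s
```

Because each gap is drawn independently, the probes do not lock onto periodic queue behavior inside the network, while the average load stays at the configured rate.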
3
System Architecture
Our infrastructure consists of three frameworks for performance measurement, topology monitoring (i.e., OSPF monitoring), and analyzing measurement results as illustrated in Fig. 1.
IP Performance Management Infrastructure for ISP
The performance measurement framework continuously measures network performance along multiple paths among distributed measurement nodes. Measurement results such as one-way packet delay and loss rate along each path are periodically (e.g., every 1 minute) collected and stored in the performance database. The OSPF monitoring framework captures LSAs at a measurement node that establishes an adjacent relationship with an OSPF running router. By analyzing LSAs, we build an entire view of the network topology and detect route instability. The topology database stores an entire network topology. The data analysis framework detects performance deterioration by periodically looking up the performance database. Although network performance deterioration itself might be defined in various ways, we simply assume that packet loss rate or packet delay variation experienced on at least one monitored path is distinguishably larger than that found under normal conditions. Furthermore, the infrastructure infers the location of the performance-degraded segments by analyzing the performance database with the topology database along the line of network tomography. Operators interact with all frameworks and databases through the web-based graphical interface that uses HTML4, Javascript, PHP4, and Apache Web server using SSL (Secure Sockets Layer). In the following subsections, we explain the details of each framework.
Fig. 1. System architecture (performance measurement framework, OSPF monitoring framework, and data analysis framework, together with the performance database, the topology database, and the graphical user interface)
3.1
Performance Measurement Framework
The system components of the performance measurement framework are depicted in Fig. 2. The measurement management software is in charge of measurement control, comprising not only the execution of performance measurements but also monitoring the survivability and configuration of measurement nodes. Along multiple paths among measurement nodes, performance is actively measured and the results are uploaded to the performance database.
(1) Measurement node software
The measurement node software is implemented on the GNU/Linux operating system. Measurement nodes are always listening for commands from the measurement management software. The functionality of the measurement node should be reduced to a minimum in order to improve the robustness of the whole platform in case of a single node failure. Accordingly, the node software provides only the following functions.
Fig. 2. The system components of the measurement framework (measurement management software with the graphical user interface and the performance database; measurement nodes exchanging network probes across the monitored network)
– Measuring IP performance by sending probing packets. UDP packets are sent at certain intervals (e.g., Poisson, uniformly distributed, or Pareto-distributed intervals). The probing packet size is set by operators in the range from 64 to 1500 bytes. The TOS (Type of Service) field in the IP packet header can also be specified for measurements over a network operated with policy routing.
– Computing IP performance metrics. The nodes periodically (e.g., every minute) calculate 11 metrics for each period: the maximum, minimum, average, and 95th percentile of the one-way packet delay and of the delay variation, respectively, the number of lost packets, the loss rate, and the number of reordered packets. Here, we define the packet delay variation v_k of the k-th packet along a path as the difference between the absolute one-way delay x_k of the packet and the minimum d of all the observed absolute one-way delays along the path (v_k = x_k − d), which is nearly equal to the sum of the queuing delays experienced by the packet along the path.
– Uploading the measurement results to the data server. The 11 metrics described above are periodically uploaded from each measurement node to the performance database using SSL. The periodic upload is also used for monitoring the survivability of the measurement nodes. After the measurement results have been uploaded, the measurement data at each node are removed automatically within a certain period (i.e., a few days).
– Time synchronization. All measurement nodes synchronize their clocks with the NTP server. To eliminate clock errors, each measurement node locally computes its own time based on the Time Stamp Counter (TSC) register, which counts CPU ticks. The correspondence between the values of the TSC register and the actual time is periodically computed by simultaneously measuring the TSC register value and the reference time (i.e., the time of each NTP server). Currently, all measurement nodes are synchronized with one of the CDMA-based NTP servers (stratum 1) that are deployed in the intra-network.
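The 11 per-period metrics and the delay variation v_k = x_k − d can be sketched as follows. This is an illustration under stated assumptions rather than the node software itself: the 95th percentile uses the nearest-rank definition (the text does not say which definition is used), the function and field names are hypothetical, and at least one packet is assumed to arrive in the period.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed definition; others exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(values):
    """Maximum, minimum, average, and 95th percentile of a non-empty list."""
    return {
        "max": max(values),
        "min": min(values),
        "avg": sum(values) / len(values),
        "p95": percentile_nearest_rank(values, 95),
    }


def period_metrics(one_way_delays_ms, path_min_delay_ms, sent, reordered):
    """Compute the 11 metrics of one measurement period on one path.

    one_way_delays_ms: delays x_k of the packets received in this period
    path_min_delay_ms: minimum delay d observed so far on the path
    """
    variations = [x - path_min_delay_ms for x in one_way_delays_ms]  # v_k = x_k - d
    lost = sent - len(one_way_delays_ms)
    return {
        "delay": summarize(one_way_delays_ms),       # 4 metrics
        "delay_variation": summarize(variations),    # 4 metrics
        "lost_packets": lost,
        "loss_rate": lost / sent,
        "reordered_packets": reordered,
    }


# Hypothetical period: 5 probes sent, 4 received
m = period_metrics([10.0, 12.0, 11.0, 15.0], path_min_delay_ms=10.0, sent=5, reordered=0)
```

In this example the delay variation of the slowest packet is 5 ms and the loss rate is 0.2, since one of the five probes did not arrive.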
(2) Measurement Management Software
The measurement management software is also implemented for the GNU/Linux operating system.
– Measurement execution. The operator chooses the measurement paths by selecting pairs of sender and receiver measurement nodes and defines the measurement parameters of the probing packets. Once a new set of measurement parameters has been defined, the measurement management software securely commands the measurement nodes using SSL. During the measurement execution task, the measurement management software is on standby until the end of the measurement or the registration of another measurement task. It ensures that no measurement is duplicated on the same measurement path. (Multiple measurements can be performed on the same measurement node.)
– Downloading measurement data. The measurement management software periodically receives the measurement results for the 11 metrics explained above from each measurement node. The total size of the data for each measurement period is only about 300 Kbytes, and thus the simultaneous execution of network probing and downloading of measurement data is not expected to interfere with precise performance measurement. In addition to the 11 metrics, detailed (packet-by-packet) measurement results can be downloaded at an operator’s request, because the measurement data for each probing packet convey much more information that may be useful for monitoring network-internal states. The measurement management software specifies the measurement data to be uploaded by period and path. To avoid congestion due to the transmission of detailed data (e.g., the total size of detailed data covering 1000 packets is about 10 Kbytes), a transmission rate-limiting function is implemented. In addition, a selective download function is implemented. For example, the detailed measurement data of certain paths within certain measurement periods are downloaded if each measurement metric exceeds the predefined threshold.
– System configuration. Besides the measurement management tasks described above, maintenance tasks are also performed by the measurement management software. Such tasks have low priority, and they take place when no other task is being performed by the measurement nodes. An example of a maintenance task is configuring a measurement node newly incorporated into the infrastructure. To do this, a firewall to prevent unauthorized access and a basic software configuration are added to the infrastructure. The location information of the node (i.e., which router it connects to) is recorded. Other maintenance tasks are related to keep-alive functions such as connectivity checking. If no messages (including measurement result uploads) are received from a measurement node within a predefined period, the node is assumed to be down and alarm messages are sent to the operators.
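The keep-alive rule at the end of the list above can be sketched in a few lines. The node names, timestamps, and the 120-second timeout are hypothetical; the text only states that a node is assumed to be down when no message arrives within a predefined period.

```python
def find_down_nodes(last_message_time, now, timeout_s):
    """Return the sorted names of nodes whose last message is older than timeout_s.

    last_message_time maps a node name to the time (seconds) of its latest
    message, where any message, including a result upload, counts.
    """
    return sorted(
        node
        for node, seen in last_message_time.items()
        if now - seen > timeout_s
    )


# Hypothetical state of three measurement nodes at time 1200 s
last_seen = {"node1": 1000.0, "node2": 1170.0, "node3": 1050.0}
down = find_down_nodes(last_seen, now=1200.0, timeout_s=120.0)
```

Here node1 and node3 have been silent for more than 120 seconds, so they would trigger alarm messages, while node2 is still considered alive.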
Fig. 3. System components of the OSPF monitoring framework (OSPF monitoring manager with the graphical user interface and the topology database; one OSPF monitor capturing LSAs in each of area 0 (backbone), area 1, and area 2)
3.2
OSPF Monitoring Framework
OSPF is a link-state routing protocol, meaning that each router within the OSPF area discovers and builds an entire view of the network topology. Using the topology graph, each router computes a shortest-path tree with itself as the root, based on the cost associated with each link, and applies the results to build its forwarding table. Hence, a measurement node running OSPF can also build an entire view of the network topology by passively capturing LSAs from a router located in each OSPF area. Note that it makes no difference where a monitoring point is located in an OSPF area because all OSPF routers in the same area store the same LSAs. When OSPF is run on a large-scale network, an OSPF domain is usually divided into multiple areas for scalability. For example, an OSPF domain is often organized into areas containing several dozen routers and a backbone area (area 0). Each area must contain at least one router with an interface in the backbone area. Because LSAs are flooded to all OSPF routers in the same area but are not flooded to other areas, for a multi-area OSPF network the OSPF monitoring framework needs to place at least one OSPF monitoring point, which we refer to as an “OSPF monitor,” in each area to capture LSAs. Figure 3 shows the system components of the OSPF monitoring framework. Each OSPF monitor is managed by central OSPF monitoring management software, and the LSAs captured by each OSPF monitor are uploaded to the topology database.
3.3
Data Analysis Framework
(1) Route Calculation along Measurement Path Once a new performance measurement task is executed, the data analysis framework calculates the route along the measurement path (defined as a set of sender and receiver measurement nodes). Because the performance measurement framework records the location of each measurement node with the information of the
router to which the node connects, the data analysis framework looks up the topology database and finds the shortest route from the router that connects to the sender measurement node to the router that connects to the receiver measurement node according to Dijkstra’s algorithm, as used in OSPF. As mentioned in Section 2, performance data combined with the route information is often useful for troubleshooting when performance deterioration occurs, and it is indispensable for locating performance-degraded segments, as explained below.
(2) Detection of Performance Deterioration
The data analysis framework monitors the packet loss rate and the 95th percentile of delay variation along monitored paths by periodically (e.g., every 1 minute) looking up the performance database. A path is regarded as being degraded within a certain measurement period if either or both of two performance metrics, the packet loss rate and the 95th percentile of delay variation, over that period are larger than pre-defined thresholds (e.g., 0.01 and 10 ms, respectively). In this paper, we refer to a path on which performance deterioration is detected as a “deteriorated path.” If we detect at least one deteriorated path, the following inference methods are applied.
(3) Locating Performance-degraded Segments
– Inference method based on Packet Loss Measurements [17]. If we detect performance deterioration by packet loss, the method classifies the states of all monitored paths into three path states, i.e., bad, medium, and good, for the measurement period, using the high and low thresholds of packet loss rate (e.g., 0.01 and 0.005). By comparing the path states, our method infers a candidate set of performance-degraded segments along the deteriorated paths with the following properties: (i) each good path includes none of the segments in the candidate set; (ii) each bad path includes at least one segment in the candidate set; and (iii) the number of segments in the candidate set is the minimum among all possible performance-degraded segment sets satisfying properties (i) and (ii). The method defines a performance-degraded segment as a segment whose packet loss rate is larger than the low threshold (0.005).
– Inference method based on Clustering the Delay Performance [18]. If we detect at least one deteriorated path by the 95th percentile of delay variation, the monitored paths are clustered based on the delay variations experienced by the probing packets. The method compares the time series of packet delay variation among multiple paths and classifies each path into a cluster by using a hierarchical clustering technique. This technique is designed to effectively identify correlation among delay variations on these paths in conjunction with a network tomographic approach. After clustering, the method extracts all the clusters with at least one deteriorated path. In other words, a set of paths under the following conditions is regarded as a set of degraded paths resulting from a common congestion: the set contains at least one deteriorated path, and each path in the set is subject to similar
variations in packet delay in a synchronized manner. The final step involves locating the performance-degraded segments by combining the clustering results with the topology information. During each measurement period, the segments shared by all paths that belong to the same cluster, including some deteriorated paths, are inferred as being performance degraded, and such segments are regarded as the cause of those deteriorated paths. Because the method clusters multiple monitored paths based on the time series of packet delay variation, we need to download detailed (packet-by-packet) measurement data on multiple paths. However, since our concern is which monitored paths are degraded in a synchronized manner, we do not have to compare all monitored paths. Namely, some of the paths whose delay variations are very small (completely non-degraded paths) may be eliminated without decreasing the accuracy of the inference results. In fact, in a preliminary analysis, we could reduce the number of paths for which data need to be downloaded by excluding the paths whose maximum delay variations in each 60-second period are distinguishably smaller than those of the candidate degraded paths.
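The loss-based inference described in (3) can be sketched as a small search problem. This is an illustrative simplification, not the method of [17]: it assumes that a path is bad above the high threshold and good at or below the low threshold, it finds a minimum candidate set by brute force, and it returns the first minimum set found even when several exist. The path and segment names are hypothetical.

```python
from itertools import combinations


def classify(loss_rate, high=0.01, low=0.005):
    """Three-state classification; the boundary handling is an assumption."""
    if loss_rate > high:
        return "bad"
    if loss_rate <= low:
        return "good"
    return "medium"


def infer_degraded_segments(routes, loss_rates, high=0.01, low=0.005):
    """routes: path name -> list of segments; loss_rates: path name -> loss rate.

    Returns a smallest set of segments that lies on no good path and touches
    every bad path, or None when no such set exists.
    """
    states = {p: classify(loss_rates[p], high, low) for p in routes}
    good_segments = set()
    for p, state in states.items():
        if state == "good":
            good_segments.update(routes[p])
    # Property (i): segments seen on a good path cannot be candidates
    bad_paths = [set(routes[p]) - good_segments
                 for p, state in states.items() if state == "bad"]
    if not bad_paths:
        return set()
    if any(not segs for segs in bad_paths):
        return None  # a bad path has no remaining candidate segment
    candidates = sorted(set().union(*bad_paths))
    # Properties (ii) and (iii): smallest set hitting every bad path
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            chosen = set(combo)
            if all(chosen & segs for segs in bad_paths):
                return chosen
    return None


# Hypothetical routes: two lossy paths share the segment "d-n3"
routes = {
    "p1": ["a-b", "b-d", "d-n3"],
    "p2": ["c-b", "b-d", "d-n3"],
    "p3": ["a-b", "b-c"],
    "p4": ["c-b", "b-d", "d-e"],
}
loss_rates = {"p1": 0.03, "p2": 0.02, "p3": 0.0, "p4": 0.001}
result = infer_degraded_segments(routes, loss_rates)
```

In this example the good paths p3 and p4 clear every segment except "d-n3", so that single segment explains both bad paths. The brute-force search is exponential in the number of candidate segments and is only meant to make the three properties concrete.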
4
Evaluation and Discussion
In this section, we evaluate the three functionalities of our infrastructure: performance measurement, OSPF monitoring, and locating performance-degraded segments. The accuracy of performance measurement is verified by deploying a prototype of our infrastructure on a commercial intra-network. Since it is difficult to actually induce OSPF routing failure and performance deterioration in a real intra-network, these two functionalities are verified on a local experimental network.
4.1
Measurement Accuracy
Measuring IP performance accurately is the most essential part of our management infrastructure. Accuracy-verifying tests were conducted over the commercial intra-network. The core nodes of the network are connected with multiple 10 Gigabit/Gigabit lines. We chose 4 paths and made active measurements along them for 2 days. The lengths of the targeted paths range from about 1000 km to 1500 km. To verify measurement accuracy, we deployed another, hardware-based (more expensive) measurement system, “IQ2000” [19], which actively measures end-to-end IP performance on a targeted path more accurately than a software-based system. Probing packets are sent at an interval of 20 ms on both measurement infrastructures. Table 1 shows the distribution of the difference in the 95th percentile of one-way packet delay on a targeted path within 1 minute as measured by our infrastructure and by IQ2000. In addition, the distribution of the differences in the measured packet loss rates is also indicated. The following results were obtained.
Table 1. Distribution of the difference between the two measurement infrastructures

Difference of packet delay Dd [ms]    Number of 1-minute periods
0 ≤ |Dd| < 0.25                       2588
0.25 ≤ |Dd| < 0.5                     183
0.5 ≤ |Dd| < 0.75                     6
0.75 ≤ |Dd| < 1                       6
1 ≤ |Dd|                              21

Difference of loss rate Dl            Number of 1-minute periods
0 ≤ |Dl| < 0.001                      2603
0.001 ≤ |Dl| < 0.002                  131
0.002 ≤ |Dl| < 0.003                  44
0.003 ≤ |Dl| < 0.004                  17
0.004 ≤ |Dl|                          9
– The average of the difference in the 95th percentile of one-way packet delay was 0.17 ms. In about 99.3% (2783/2804) of 1-minute periods, the difference in the 95th percentile of one-way delay was less than 1 ms.
– In about 99.7% (2795/2804) of 1-minute periods, the difference in packet loss rates was less than 0.004.
– Measurement results along the other paths were roughly the same. The proportion of periods within which the difference in the 95th percentile of one-way packet delay was less than 1 ms ranged from 99.1% to 99.8%. On the other hand, the proportion of periods within which the difference in packet loss rate was less than 0.004 ranged from 99.3% to 99.8%.
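The percentages quoted above follow directly from the counts in Table 1, as the short check below shows. Only the counts come from the table; the dictionary keys are labels chosen for this illustration.

```python
# Counts of 1-minute periods per bin, copied from Table 1
delay_bins = {"<0.25": 2588, "0.25-0.5": 183, "0.5-0.75": 6, "0.75-1": 6, ">=1": 21}
loss_bins = {"<0.001": 2603, "0.001-0.002": 131, "0.002-0.003": 44,
             "0.003-0.004": 17, ">=0.004": 9}

total_periods = sum(delay_bins.values())

# Periods in which the delay difference stayed below 1 ms
delay_within = total_periods - delay_bins[">=1"]
delay_share = round(100 * delay_within / total_periods, 1)

# Periods in which the loss-rate difference stayed below 0.004
loss_within = sum(loss_bins.values()) - loss_bins[">=0.004"]
loss_share = round(100 * loss_within / sum(loss_bins.values()), 1)
```

Both halves of the table sum to the same 2804 periods, which gives 2783/2804 (about 99.3%) for delay and 2795/2804 (about 99.7%) for loss rate.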
4.2
OSPF Monitoring
We established an experimental network consisting of five OSPF-running routers (a), (b), (c), (d), and (e), the OSPF monitor, and four performance measurement nodes, as illustrated in Fig. 4. The OSPF monitor captures LSAs by establishing an adjacency with router (b) and draws the entire topology of the experimental network. Figure 5 shows the topology drawn by our infrastructure with the graphical user interface. As a result, we confirmed that monitoring LSAs allows the real router-level topology to be drawn precisely. In addition, we investigated the behavior when links or routers are removed from the experimental environment. We monitored the topology via the graphical user interface when one of the links “(a)-(b)”, “(b)-(c)”, “(b)-(d)”, and “(d)-(e)”, or one of the routers (a), (c), (d), and (e), is removed. The results are summarized as follows.
– In all cases, we confirmed that OSPF monitoring precisely recognizes the removal of links or routers and redraws the new topology.
– The interval before our system recognizes a routing failure was in the range from 5 to 45 seconds. The interval mainly depends on when each router detects a failure. In this experiment, the hello-interval (the interval between the sending times of OSPF Hello packets) and the dead-interval (a timer used to time out inactive adjacencies) on each router were 10 and 40 seconds (the default values of the router), respectively.
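How a removed link changes the calculated route can be sketched with a small link-state view and Dijkstra's algorithm. This is a simplified illustration: links are given as hypothetical (router, router, cost) tuples rather than real OSPF LSAs, costs are symmetric, equal-cost multipath is ignored, and the router names r1 to r4 do not correspond to the experimental network.

```python
import heapq


def build_graph(links):
    """links: iterable of (router_u, router_v, cost) for bidirectional links."""
    graph = {}
    for u, v, cost in links:
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost
    return graph


def shortest_route(graph, source, target):
    """Dijkstra's algorithm; returns (total_cost, [routers]) or None if unreachable."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, route = heapq.heappop(queue)
        if node == target:
            return cost, route
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, route + [neighbor]))
    return None


# Hypothetical topology with a cheap three-hop route and an expensive direct link
links = [("r1", "r2", 1), ("r2", "r3", 1), ("r3", "r4", 1), ("r1", "r4", 5)]
before = shortest_route(build_graph(links), "r1", "r4")

# The same topology after the link r2-r3 disappears from the link-state view
links_after = [link for link in links if link[:2] != ("r2", "r3")]
after = shortest_route(build_graph(links_after), "r1", "r4")
```

Before the removal the route runs through r2 and r3 at a total cost of 3; afterwards only the direct link remains, at a cost of 5. Recomputing the route from the updated link-state view in this way is what lets performance data be attributed to the segments actually traversed.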
Extension for OSPF Equal-Cost Multipath
Multipath routing, “Equal-Cost Multipath (ECMP),” is implemented with OSPF. ECMP splits the traffic evenly among equal-cost paths to each destination by distributing the packets among them. Since the OSPF monitoring infrastructure monitors only the LSAs, the current version of our infrastructure cannot ascertain which path each packet takes at the ECMP branch points. On the other hand, the hash-based version of ECMP, which maps packets from the same flow to the same path, is often used to avoid packet reordering [20,21]. For this type of ECMP routing, periodically issuing the traceroute command along the measurement paths with the same UDP port number as the network probing packets use, in addition to the OSPF monitoring task, might provide more detailed route information.
4.3
Inference of Performance-Degraded Segments
To verify the functionality of locating performance-degraded segments, we chose a targeted segment from router (d) to measurement node 3 and caused congestion by loading heavy traffic on the segment. In Fig. 4, the traffic generator sent a number of packet trains, each of which consists of several dozen 1500-byte UDP packets. The interval between packet trains is uniformly distributed from 1 to 10 ms. By continuously monitoring the inference results shown on the graphical user interface, as illustrated in Fig. 5, we checked whether the targeted segment was inferred as being degraded or not (the performance-degraded segment is indicated by a bold gray arrow). Here, we define performance deterioration as the condition under which either or both of the following two conditions are satisfied: (i) the packet loss rate is larger than 0.01; (ii) the 95th percentile of packet delay variation is larger than 10 ms.
Fig. 4. Experimental local network (routers (a) to (e) in area 0, measurement nodes 1 to 4, hubs, a traffic generator loading the targeted segment, a web access user, and a host for measurement management, OSPF monitoring and management, and the graphical user interface; some links are Gbit Ethernet and the others are 100 Mbit Ethernet)
Fig. 5. An example of monitoring OSPF (the performance-degraded segment is highlighted)
The results are summarized as follows.
– During heavy traffic loading, there were a total of 131 1-minute periods during which performance deterioration was detected. (46 and 109 1-minute periods were detected by packet loss rate and delay variation, respectively.)
– In 38 (29%) of the 131 1-minute periods, the data analysis framework did not infer performance-degraded segments, because the inference methods stop when any possibility of misinference is recognized, in order to avoid inferring wrong segments. In 86 (92%) of the remaining 93 periods, the targeted segment was correctly inferred as being congested. In all cases, inference was performed in 1 minute or less.
5
Conclusion
In this paper, we proposed an IP performance management infrastructure having the twin frameworks of performance measurement and topology monitoring. By combining these frameworks, the infrastructure locates performance-degraded segments. Through experiments over a commercial intra-network and an experimental local network, we verified that the proposed infrastructure can manage IP performance, topology, and the location of performance-degraded segments precisely and in real time.
References
1. Shaikh, A., Greenberg, A.: OSPF Monitoring: Architecture, Design and Deployment Experience. In: Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2004)
2. Shaikh, A., Goyal, M., Greenberg, A., Rajan, R., Ramakrishnan, K.: An OSPF Topology Server: Design and Evaluation. IEEE J. Selected Areas in Communications 20(4) (2002)
3. Watson, D., Labovitz, C., Jahanian, F.: Experiences with Monitoring OSPF on a Regional Service Provider Network. In: Proc. IEEE International Conference on Distributed Computing Systems (ICDCS) (2003)
4. Claffy, K.C., Monk, T.E., McRobb, D.: Internet tomography. Nature, Web Matter (January 1999)
5. Peterson, L., Anderson, T., Culler, D., Roscoe, T.: A blueprint for introducing disruptive technology into the Internet. In: Proc. ACM Workshop on Hot Topics in Networks (HotNets), Princeton, NJ, October 2002, pp. 59–64 (2002)
6. Andersen, D.G., Balakrishnan, H., Kaashoek, M.F., Morris, R.: Resilient overlay networks. In: Proc. ACM Symposium on Operating Systems Principles (SOSP), Banff, Alberta, Canada, October 2001, pp. 131–145 (2001)
7. Simpson Jr., C.R., Riley, G.F.: NETI@home: A distributed approach to collecting end-to-end network performance measurements. In: Proc. Passive & Active Measurement (PAM), Juan-les-Pins, France (April 2004)
8. Shavitt, Y., Shir, E.: DIMES: Let the internet measure itself. SIGCOMM Computer Communication Review 35(5), 71–74 (2005)
9. Kalidindi, S., Zekauskas, M.J.: Surveyor: An infrastructure for Internet performance measurements. In: Proc. of INET 1999 (June 1999)
10. Active Measurement Project, http://amp.nlanr.net/
11. Paxson, V., Adams, A., Mathis, M.: Experiences with NIMI. In: Proc. Passive & Active Measurement (PAM) (April 2000)
12. Caceres, R., Duffield, N., Horowitz, J., Towsley, D.: Multicast-based inference of network-internal loss characteristics. IEEE Trans. Info. Theory 45(7), 2462–2480 (1999)
13. Duffield, N., Presti, F.L., Paxson, V., Towsley, D.: Inferring link loss using striped unicast probes. In: Proc. IEEE Infocom, Anchorage (April 2001)
14. Tsuru, M., Takine, T., Oie, Y.: Inferring link characteristics from end-to-end path measurements. In: Proc. IEEE ICC, Helsinki, June 2001, pp. 1534–1538 (2001)
15. Bestavros, A., Byers, J., Harfoush, K.: Inference and labeling of metric-induced network topologies. In: Proc. IEEE Infocom, New York (June 2002)
16. Padmanabhan, V.N., Qiu, L., Wang, H.: Server-based inference of internet link lossiness. In: Proc. IEEE Infocom, San Francisco (April 2003)
17. Tachibana, A., Ano, S., Hasegawa, T., Tsuru, M., Oie, Y.: Locating Performance-degraded segments over the Internet Based on Multiple End-to-End Path Measurements. IEICE Transactions on Communications Internet Technology VI (April 2006)
18. Tachibana, A., Ano, S., Hasegawa, T., Tsuru, M., Oie, Y.: Locating Performance-degraded segments on the Internet by Clustering the Delay Performance of Multiple Paths. In: Proc. IEEE ICC, Glasgow (June 2007)
19. IQ, QoS Monitoring system, Yokogawa Electric Corporation (2000), http://www.yokogawa.com/tm/data/tm-data.htm
20. Cisco express forwarding (CEF). Cisco white paper, Cisco Systems (July 2002)
21. Junos 6.3 internet software routing protocols configuration guide, http://www.juniper.net/techppubs/software/junos/junos63/swconfig63-routing/html/
Can Critical Real-Time Services of Public Infrastructures Run over Ethernet and MPLS Networks?
Jaime Lloret1, Francisco Javier Sanchez2, Hugo Coll3, and Fernando Boronat4
1,3,4 Department of Communications, Polytechnic University of Valencia, Camino Vera s/n, 46022, Valencia (Spain)
2 ADIF, Molino de las Fuentes s/n, 46026, Valencia (Spain)
Abstract. With the adoption of Ethernet-IP as the main technology for building end-to-end real-time networks, the requirement to deliver highly available, high-quality, and secure services over Ethernet and MPLS has become strategic. Critical real-time traffic is generally penalized, and the maximum restoration time of 50 msec. is sometimes exceeded because devices hang or because finding an alternative path takes too long. Existing standard networks are not ready to transport real-time critical information. This issue must be solved, especially in critical infrastructures such as railway control systems, because passengers’ safety could be compromised. In this work, we first check under which conditions the devices of the network reply in real time. Then, we explain why existing combined Ethernet and MPLS solutions do not offer recovery times lower than 50 msec., and we propose a new approach that achieves these conditions. Our statements are based on real measurements in real railway environments and on real testbeds.
Keywords: real-time networks, critical systems, Ethernet, MPLS.
1 Introduction Distribution and transport companies (gas, oil, water, railway, etc.) often have their own voice and data communication infrastructure, usually based on SDH/SONET transmission systems over fiber or copper cables. Only few years ago, industrial systems were “islands” for control and system safety, but now they are integrating network technologies, to permit easy growing, and central management. We can distinguish two main environments depending on the type of service: non critical realtime services and critical real-time services. Normally, non critical real-time services are based on Ethernet access to corporate MPLS/VPLS network. The main structure and technology for this network responds basically to multimedia traffic requirements, but the behaviour is completely different when we are connecting critical real time systems to the same network. In fact, to avoid occasional blocking problems, critical real-time services are provisioned through independent local Ethernet networks (most of them in ring topology to provide resiliency), with no A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 289–301, 2008. © IFIP International Federation for Information Processing 2008
290
J. Lloret et al.
interaction with corporate MPLS/VPLS network. These solutions increase complexity and budget, and present many difficulties for integrated management on critical infrastructure companies. In the case of railway systems (where we are going to do our tests and simulations), non critical real-time services are provisioned through direct access to corporate MPLS network, such as railway station auxiliary services: automatic stairs, lifts, evacuation and emergency facilities, IP phones, video, remote alarm and supervision, etc. Station services and equipments are connected to third level nodes (usually MPLS routers) organized by a double ring topology. Both ends of the ring are connected to regional gigabit ring, through physical link between third level node and second level node (this is typically a layer 3 switch). On the other hand, critical real-time services are provisioned through independent Ethernet rings with resiliency mechanisms activated: level 2 methods based on Ethernet common topologies (like Spanning Tree Protocol -STP-, Rapid STP, Multiple STP and link aggregation), or specific ring topologies (RPR/802.17, EAPS/RFC3619) and level 3 methods based on interior routing protocols (EIGRP, ISIS, OSPF) and default-gateway backup protocols (VRRP, HSRP). [1] A real example of a complete railway control system network is shown in figure 1. The Circles represent routers in each station interlocking system, the rectangles represent switches using HSRP protocol and the squares represent interlocking CPUs. Dotted lines are the backup network lines. [2] Station 1
Fig. 1. Network topology of a railway line critical control system (eight stations, labelled Station 1 to Station 8)
In critical real-time networks, any type of failure must be recovered in less than 50 msec. Failing to meet this bound could cost lives, as in critical railway control system networks [3]. It is therefore vital to study the viability of running critical real-time services over standard networks such as Ethernet and MPLS, which would allow companies to use a single network infrastructure. This paper is structured as follows. Section 2 describes recent methods for protection and restoration in Ethernet and MPLS networks. Section 3 presents measurements and analysis of real Ethernet and MPLS scenarios at the Spanish railway administrator (ADIF). Section 4 presents our proposal to improve availability and fault tolerance for critical real-time systems using the integrated network. Finally, section 5 summarizes the conclusions and outlines future work.
Can Critical Real-Time Services of Public Infrastructures Run over Ethernet
2 Failure Detection, Protection and Restoration in Ethernet and MPLS Networks

When there is a failure in the network, there must be a way to detect it so that the recovery operation can start. Failure detection depends on the type of failure and may be done by a node adjacent to the failed one or by a configured point of repair in the network. Failure detection involves mechanisms at multiple layers, such as [4]:

• Hardware: connectivity failures can be detected by line cards.
• Signaling: errors in SDH/SONET are detected using signaling mechanisms.
• Link: specific messages can be sent to poll node state or to notify an error.
• Network: when there is a failure, “hello” messages are missing.
• Application: SNMP traps can be used to notify a failure to a central manager.
• Specific protocols: failure detection can also be based on a specialized protocol such as BFD (Bidirectional Forwarding Detection) [5].
The design and reconfiguration of the network with regard to component availability is fundamental to delivering the required performance and reliability to real-time services. There are complex methods based on two complementary approaches to avoid the impact of failures of network components. The first is to design high-availability working paths, where resilience is provided by selecting a working path with high connection availability. A few protocols include availability factors in their metrics, but data link protocols usually do not include any. The second is the use of backup paths. There are two main methods for resiliency provisioning:

• Protection: a static scheme in which backup paths are pre-computed and pre-provisioned before the failure occurs. Protection can take two forms: dedicated protection and shared protection. In dedicated protection, the resources used for protection are reserved for backup traffic. There are two types of dedicated protection (1+1 and M:N protection). In shared protection, when a backup path is activated, other backup paths that share resources with it have to be rerouted. Shared protection is more complex to maintain than dedicated protection, but it can offer higher network utilization.
• Restoration: a dynamic scheme in which backup paths are discovered dynamically in response to failed connections, so they are created and routed after the failure.

Both cases need one mechanism to detect the failure and notify the rest of the nodes, and another to re-route traffic from the working path to a backup path. Recovery schemes can also be categorized by the scope of the recovery. Global recovery covers link and node failures by calculating a new end-to-end path (this recovery action runs in the ingress node), while with local recovery faults are handled locally by the neighbors. Keeping failures local causes minimum disruption to the traffic, but it requires reserving several paths.
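The effect of a pre-provisioned backup path on connection availability can be illustrated with standard reliability arithmetic. The sketch below is not taken from the paper: it assumes that components fail independently, that the working and backup paths are fully disjoint, and it uses hypothetical availability figures.

```python
from math import prod


def series_availability(component_availabilities):
    """Availability of a path that works only if every component works.

    Assumes independent failures (an assumption of this sketch).
    """
    return prod(component_availabilities)


def protected_availability(working, backup):
    """Availability of a connection with one working and one disjoint backup path.

    The connection is down only when both paths are down at the same time.
    """
    return 1.0 - (1.0 - working) * (1.0 - backup)


# Hypothetical figures: three links at 99.9% on the working path,
# four links at 99.5% on a disjoint backup path.
working = series_availability([0.999, 0.999, 0.999])
backup = series_availability([0.995, 0.995, 0.995, 0.995])
combined = protected_availability(working, backup)
print(f"working={working:.6f} backup={backup:.6f} protected={combined:.6f}")
```

Under these assumptions the protected connection is more available than either path alone, which is the motivation for reserving backup resources despite their cost.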
The topology of the network affects how recovery can be performed [6]. Network recovery management is difficult when the network is viewed as an unstructured collection of backup paths. Topology-based schemes instead build a set of sub-topologies of the network, which serve as a more intuitive abstraction of the recovery paths and can be applied to both local and global recovery. The main methods are:
• Self-healing rings. In these methods each node in the network is connected by two links to its upstream and downstream neighbours. In regular operation, the traffic is sent in one direction around the ring. When a link fails, the traffic is sent over the other link in the reverse direction, so the failed link is avoided. Several mechanisms rely on ring topologies, such as SONET/SDH or Ethernet ring solutions (RPR/802.17 [7] and EAPS/RFC3619 [8]).
• Disjoint paths. These methods are based on a graph property described by Menger [9]. They are applied to MPLS networks because they can be adapted more easily to MPLS signalling. This technique uses pre-provisioned backup paths with local recovery [10].
• Protection cycles. The goal is to bring the fast recovery speed usually offered by rings to mesh networks. To combine the best properties of ring and mesh, a new concept called p-cycles has been introduced. It is based on closed paths formed only in the spare capacity of a mesh restorable network. When a link fails, the traffic is locally switched or routed along the cycle instead of along the original shortest path [11].
• Redundant trees. These methods are applied to standard Ethernet access networks. Ethernet traditionally relies on the Spanning Tree Protocol (STP) defined in IEEE 802.1D, which provides loop-free communication between all nodes. Restoration after a link failure involves the reconstruction of the whole spanning tree. In the worst case, recovery typically requires around 50 seconds, which is not acceptable for real-time critical applications. With Rapid STP the convergence time is reduced, but it is still in the order of a few seconds. Multiple STP defines several spanning tree instances organized in regions and maps each Virtual LAN onto a single spanning tree. MSTP restoration time is still high [12].
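The figure of around 50 seconds for STP can be related to the protocol timers. The sketch below assumes the default IEEE 802.1D timer values (a maximum age of 20 seconds and a forward delay of 15 seconds); the text above does not state which timer settings it has in mind, so this is only one plausible reading.

```python
# Default IEEE 802.1D timer values, in seconds (assumed, not stated in the text).
MAX_AGE = 20        # time before stored configuration information expires
FORWARD_DELAY = 15  # time spent in each of the listening and learning states


def stp_worst_case_recovery(max_age=MAX_AGE, forward_delay=FORWARD_DELAY):
    """Worst case: wait for stale information to age out, then listen, then learn."""
    return max_age + 2 * forward_delay


print(stp_worst_case_recovery())  # 50
```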
3 Measurements and Analysis in Ethernet and MPLS Networks

Our first goal is to quantify the delays introduced in the network by data transmission over the wire and by the devices placed along the path. We will give a law that shows which conditions must hold to keep delays lower than 50 msec. Then, some measurements will demonstrate that existing path restoration systems for Ethernet (such as STP) do not meet real-time requirements. Next, we measure the delays in a real MPLS network carrying regular traffic. Finally, we measure the path restoration time when there is a failure in the network.

3.1 Delays in Network Devices and Their Links

Network delays are produced by the transmission time Ttrans (time to send the information from the network adapter to the transmission medium), the propagation time Tprop (elapsed time in the medium) and the processing time Tproc (time to process the received frame in the station and transmit the next one). The Round Trip Time (RTT) of a ping is the time needed to send the frame from the station to the medium, plus the propagation time until the last bit is received at the destination, plus the time the destination needs to process the frame, plus the time to send the ACK from the destination station to the medium, plus the ACK propagation time, plus the time the source station needs to process the ACK. The expression for RTT is given by equation 1.

$$\mathrm{RTT} = 2\,T_{prop} + 2\,T_{proc} + T_{trans\_frame} + T_{trans\_ACK} \qquad (1)$$
We have considered the same processing time for both stations when they have the same hardware characteristics and the same operating system. The transmission time is the number of bits in the frame divided by the bit rate of the interface card. The propagation time is the distance between stations divided by the propagation velocity in the medium (for unguided transmission through air it is the speed of light, approximately 3·10^8 m/s; for guided transmission through fiber and copper it is approximately 0.67 times the speed of light). To determine the mean elapsed time, we have taken the average time of 100 pings. Each ping carries 548 bytes of data, so the total packet length is 602 bytes (548 bytes of ping data + 8 bytes of ICMP + 20 bytes of IP + 26 bytes of Ethernet). Figure 2 shows the results obtained using UTP category 5 wires. It also shows the equation, fitted to these results, that relates the length of wire in meters to the RTT.
Fig. 2. RTT for several UTP category 5 pigtails (y-axis: RTT in msec; x-axis: wire length, from 1 to 30 meters; fitted line: y = 0.0063x + 0.3394, R² = 0.96)
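Equation 1 can be evaluated term by term. The following sketch is only an illustration: the function names are hypothetical, the ACK (echo reply) is assumed to have the same size as the request, the 100 Mbps link rate is an assumption for a Fast Ethernet interface, and the processing time is a placeholder value, since the text does not report it separately.

```python
LIGHT_SPEED = 3e8       # m/s, approximate speed of light
VELOCITY_FACTOR = 0.67  # fraction of light speed in fiber and copper


def transmission_time(frame_bytes, link_bps):
    """Seconds needed to put a frame of frame_bytes on a link of link_bps."""
    return frame_bytes * 8 / link_bps


def propagation_time(distance_m, velocity_factor=VELOCITY_FACTOR):
    """Seconds a signal needs to travel distance_m in a guided medium."""
    return distance_m / (velocity_factor * LIGHT_SPEED)


def round_trip_time(distance_m, frame_bytes, ack_bytes, link_bps, t_proc):
    """Equation 1: RTT = 2*Tprop + 2*Tproc + Ttrans_frame + Ttrans_ACK."""
    return (2 * propagation_time(distance_m)
            + 2 * t_proc
            + transmission_time(frame_bytes, link_bps)
            + transmission_time(ack_bytes, link_bps))


# Packet length from the text: ping data + ICMP + IP + Ethernet.
FRAME_BYTES = 548 + 8 + 20 + 26  # 602 bytes

# Hypothetical scenario: 20 m of wire, 100 Mbps link, 0.1 msec processing time.
rtt = round_trip_time(20.0, FRAME_BYTES, FRAME_BYTES, 100e6, t_proc=1e-4)
print(f"RTT = {rtt * 1e3:.4f} msec")
```

With these inputs the two transmission terms and the processing terms dominate, while propagation over a few tens of meters contributes well under a microsecond.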
Taking these measurements into account, we can state that the wire, together with the computer processing time and the packet transmission time, introduces a delay of about 1 msec for every 100 meters, of which about 0.34 msec is due to the computer processing time and the packet transmission time. Note that Fast Ethernet does not allow wire lengths over 100 meters. For Gigabit Ethernet, we have calculated a propagation delay of about 0.004 msec for 1 kilometre. Once we have obtained the elapsed time due to the wire and to the computers, we can obtain the delay introduced by the network devices. To do so, we have taken the average time of 100 pings (with the same size as in the previous test) using two wires of 10 meters (one for each computer). Figure 3 shows the delay introduced by some switches from several manufacturers, after subtracting the RTT obtained for 20 meters in the previous test. The delays ranged from 0.113 msec to 1.943 msec. The average value over all measured switches was 0.667 msec. The same methodology was used to measure the processing delay of routers. Figure 4 shows the results obtained for some routers from several manufacturers. The delays ranged from 0.479 msec to 2.229 msec. The average value was 1.191 msec.

Fig. 3. RTT for several Switches (y-axis: processing delay in msec; devices: Cisco Catalyst 2950/12, Cisco Catalyst 2950/24, Allied Telesyn 8124XL, Allied Telesyn 8024, Cisco Catalyst 2923/24, Cisco Catalyst 3550, Cisco Catalyst 1900/12, Sitre Telecom, Nortel 2000/24, 3com 2200/16)

Fig. 4. RTT for several Routers (y-axis: processing delay in msec; devices: Cisco 2621XM, Cisco 2651XM, Allied Telesyn AR410, Cisco 1720, Cisco 2611, Cisco 2514)

We have observed that the processing delay depends heavily on the switch or router
manufacturer and even on its model. Switches and routers with more features introduce higher delays and, in addition, older switches and routers have higher delays than other devices from the same manufacturer. Using the measurements obtained from the real world, we can conclude that the RTT value (in msec) in an Ethernet network approximately follows equation 2.

$$\mathrm{RTT} = 1.1912 \cdot Z + 0.6669 \cdot Y + 0.0063 \cdot X + 0.3394 \qquad (2)$$

where Z is the number of routers, Y is the number of switches and X is the length of wire in meters in the Ethernet network. Assuming that there are at least 2 switches per router in the network and a mean of 30 meters of wire between any two devices or computers, we calculate that there cannot be more than 18 routers between the most remote devices if the network is to meet real-time requirements.

3.2 Convergence Time in Spanning Tree Protocol

To check that STP does not provide real-time responses, we set up a testbed with three switches arranged in a triangle and tested the response time when the link carrying the traffic fails. Figure 5 shows the network topology and the features of the computers in the testbed. To measure the time needed to recover from the failure of the link carrying the traffic, we transmitted 10000 pings at 200 msec intervals from one laptop to the other and then measured how many packets were lost during the convergence time. Table 1 shows our measurements for several manufacturers and models.

Fig. 5. Testbed for STP measurements (Switch 1, IP 192.168.1.1; Switch 2, IP 192.168.1.2; Switch 3, IP 192.168.1.3; switches interconnected through Fast Ethernet ports F0/1, F0/2, F0/3 and F0/5; Laptop 1, IP 192.168.1.10, Intel Core 2 Duo 2.0 GHz, 256 MB RAM, Debian Linux; Laptop 2, IP 192.168.1.11, Intel Core 2 Duo 2.0 GHz, 1 GB RAM, Debian Linux)

Table 1. STP measurements

                          Cisco C2923XL   Cisco C2950-12   HP Procurve
Packets received          7466            7871             9943
Packet loss               25.34%          21.29%           0.57%
Time (in msec)            33954           28862            4458
Minimum value             0.156           0.136            0.107
Average value             0.356           0.305            0.295
Maximum value             1.561           4.731            1.231
Max. deviation            0.073           0.162            0.087
Convergence time (msec)   506800          425800           11400
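Two calculations from this section lend themselves to a short sketch. The first function evaluates equation 2 for a given number of routers, switches and meters of wire. The second reproduces the convergence times in Table 1 under the reading that convergence time equals the number of lost pings multiplied by the 200 msec sending interval; the text does not spell this out, but the table values are consistent with it. Function names are hypothetical.

```python
def estimated_rtt_ms(routers, switches, wire_m):
    """Equation 2: estimated RTT in msec for an Ethernet path."""
    return 1.1912 * routers + 0.6669 * switches + 0.0063 * wire_m + 0.3394


def packet_loss_percent(sent, received):
    """Share of pings that were lost, as a percentage."""
    return 100 * (sent - received) / sent


def convergence_time_ms(sent, received, interval_ms):
    """Assumed reading of Table 1: lost pings multiplied by the sending interval."""
    return (sent - received) * interval_ms


SENT = 10000       # pings transmitted in the STP testbed
INTERVAL_MS = 200  # spacing between pings

# Packets received per switch model, as reported in Table 1.
received_by_switch = {
    "Cisco C2923XL": 7466,
    "Cisco C2950-12": 7871,
    "HP Procurve": 9943,
}

for switch, received in received_by_switch.items():
    loss = packet_loss_percent(SENT, received)
    convergence = convergence_time_ms(SENT, received, INTERVAL_MS)
    print(f"{switch}: loss {loss:.2f}%, convergence {convergence} msec")

# Equation 2 for 100 m of wire and no intermediate devices (about 1 msec).
print(f"{estimated_rtt_ms(0, 0, 100):.4f} msec")
```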
We can now state that the convergence time in a Fast Ethernet network, when a link fails and the path is recovered using STP, always exceeds 50 msec, so such networks do not provide real-time recovery.

3.3 Round Trip Time in the Spanish Railway Network

To test an MPLS network, area 3 (Madrid-Albacete-Játiva-Valencia) of the ADIF corporate network was chosen. This area has 28 routers, as shown in figure 6. The network has two levels. The second level is drawn with black lines in figure 6. Stations are connected to the rings at 2 Mb/s, to deliver service over short distances (15-60 kilometers). Above this topology there are 1 Gb/s fiber optic links that connect the main network nodes. The red lines form the first level, that is, nodes 0, 1, 2 and 3 (Madrid, Albacete, Valencia and Játiva). Each ring has two nodes connected to this fiber network through 100 Mb/s links. Thus, when a station needs to communicate with a station on another ring or with another control node, data goes through the fiber network. Distances between nodes in the same ring do not exceed 15 kilometers. However, first-level node links are about 250 kilometers long. To obtain the real response time between the most remote nodes, we have gathered the RTT from node 9 to node 21. Figure 7 shows the real response time obtained from 100 regular pings between those nodes. The minimum value was 23 msec, the maximum value was 98 msec and the average value was 24.84 msec. The peaks at the 20th and the 94th ping may be due to a sporadic excess of traffic in one of the nodes on the path, so that the queue of that device needed more time to process all packets. Reference [13] gives more information about real measurements and simulations of this test bench.
Fig. 6. ADIF corporate network - Area 3

Fig. 7. Delay time from node 9 to node 21 (y-axis: RTT in msec, 0 to 120; x-axis: ping number, 0 to 100)
3.4 Convergence Time When There Are Node Failures in MPLS Networks

To measure the delay produced in the traffic by a link failure, we have used the testbed shown in figure 8. All routers in the topology were Cisco 2621 XM running MPLS. All links between routers have a bandwidth of 2 Mbps except the links between routers b and c, d and e, and f and g, which are 10 Mbps. At t=0 sec. we started to send 200 UDP packets per second (one packet every 50 msec.), with a size of 64 bytes, from computer A to computer B. Due to the bandwidth of the links in the network, the path followed by the packets was a-b-d-e-g-h. At t=5 sec., the link between routers e and g fails. The time needed to send the traffic through an available path is 5.4 sec. After that, the path followed by the packets is a-b-d-f-g-h. At t=15 sec. we recover the link between routers e and g, and 5.0 sec. are needed to restore the old path. In both cases (path recovery and path restoration), the delays are far higher than 50 msec.

Fig. 8. Testbed used to measure the convergence time (computers A and B; routers a to h; subnets 192.168.0.0 to 192.168.12.0)
4 Proposal

We propose a new model for the integration of critical and non-critical applications over standard Ethernet and MPLS networks. The new scenario is based on a corporate MetroEthernet network integrated with the real-time network (see figure 9).
Fig. 9. Real-time corporate and integrated control network (from edge to edge: Customer Edge (VLAN); QinQ 802.1ad encapsulation; MPLS or Provider Backbone Bridge (MAC-in-MAC 802.1ah); QinQ 802.1ad encapsulation; Customer Edge (VLAN))

Each application or real-time traffic flow is assigned a separate VLAN to establish a quality of service with level 2 or level 3 marking. There will be a mapping between VLANs and VPNs at the backbone. The transport network must be improved to guarantee high-quality communications. The main objectives of these improvements are to guarantee quality of service to vital applications (also during failures), to minimize failure occurrences and to achieve low recovery times (lower than 50 msec, the standard in SDH systems). MetroEthernet technology is not currently used to support critical applications, which require not only compliance with QoS requirements but also high end-to-end reliability and availability. There are many innovations in QoS-aware MPLS/VPLS networks, but there is a real gap in fault tolerance for
supporting critical services. Our main goal is to provide high quality and reliability for the end-to-end path through Ethernet access networks and the MPLS/VPLS core, for both non-critical and critical real-time services.

4.1 Components of the Proposed Model

The general components used in our proposed model are (i) a topology discovery protocol to build the network map, (ii) an algorithm for selecting primary and backup paths, (iii) an encapsulation method and frame format to carry data and explicit routing information, (iv) a protocol to detect and notify failures and, finally, (v) a mechanism to switch from active to backup paths.

A discovery protocol is needed to learn network connectivity. It is better to use a local, non-bridged protocol to avoid traffic overhead. In this case, we use LLDP (Link Layer Discovery Protocol) [14], defined in the IEEE 802.1AB standard. LLDP data units (LLDPDUs) are sent to a destination MAC address defined as the LLDP multicast address. An LLDPDU is not forwarded by MAC bridges or switches that follow the IEEE 802.1D standard [15]. To distribute this information we have chosen SNMP: to retrieve the initial topology, network nodes use SNMP polls to query the topology MIB database, and they use SNMP traps to notify topology changes. LLDP system capabilities information is used to query only neighbours that are intermediate nodes (bridge, AP, etc.) and to register end-stations (station-only).

Paths are pre-calculated with the algorithm proposed in [16]. It is a simple solution to build optimal primary and backup paths (k-paths) based on the SPF algorithm. In our case, however, we include QoS and reliability considerations in a new integrated metric. All the parameters can be obtained through SNMP queries to the nodes in the path. We have not used the STP algorithm because of its limitations (only one working path, no load balancing, a single point of failure, maximum bridge diameter, etc.).
We have designed the metric shown in equation 3 to achieve our goals.

$$\mathrm{Metric} = \left( k_1 \, \frac{\mathrm{delay}}{\mathrm{bandwidth}} \right) + \left( \frac{\mathrm{loss} \cdot \mathrm{length}}{\mathrm{availability}} \right) \qquad (3)$$
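Equation 3 can be evaluated directly once the five path parameters are known. The sketch below is a minimal illustration, not the authors' implementation: the function and field names are hypothetical, the units of delay and bandwidth are not fixed by the text, k1 is taken as 10^3, and ranking candidate paths by ascending metric is an assumption based on the fact that the metric grows with delay and loss and shrinks with bandwidth and availability.

```python
K1 = 1e3  # weighting constant for the QoS term (10^3, as read from the text)


def path_metric(delay, bandwidth, loss, length, availability, k1=K1):
    """Equation 3: QoS term plus reliability term for one candidate path."""
    if bandwidth <= 0 or availability <= 0:
        raise ValueError("bandwidth and availability must be positive")
    qos_term = k1 * delay / bandwidth
    reliability_term = loss * length / availability
    return qos_term + reliability_term


def rank_paths(candidates):
    """Sort candidate paths by ascending metric (assumed: lower is better)."""
    return sorted(candidates, key=lambda path: path_metric(**path["params"]))


# Hypothetical candidate paths; all numbers are illustrative only.
candidates = [
    {"name": "a-b-d-e-g-h",
     "params": dict(delay=10.0, bandwidth=1000.0, loss=0.01,
                    length=5, availability=0.999)},
    {"name": "a-b-d-f-g-h",
     "params": dict(delay=12.0, bandwidth=1000.0, loss=0.02,
                    length=5, availability=0.995)},
]

ranked = rank_paths(candidates)
print([path["name"] for path in ranked])
```

In this illustration the first entry of the ranking would serve as the primary path and the remaining entries as backups.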
The metric has two parts. The first part, on the left, is related to QoS. Bandwidth is given by the sum of the bandwidths of the links along the path, and the delay is the Round Trip Time required to move a packet from the source to the destination. The second part, on the right, is related to reliability. Availability is the fraction of time during which the nodes or links of all paths are fully functional; loss is the packet loss probability, obtained from the SNMP interface counters; and length is the number of hops between the source and the destination. We have set k1 = 10^3 to bring the QoS part into the desired range of values. Only border network nodes build a full network map and select optimal working and backup paths as a function of the availability and reliability based metrics. This algorithm falls under the category of proactive (pre-calculated backup paths) and global (end-to-end) fault management. In practice, the algorithm only has to add route descriptors (node and port identifier pairs) to a vector and detect whether a node identifier is repeated, in which case the path is discarded. The next path computation can start from the last node before the loop and try to find another feasible segment to the destination. Our proposal allows several active paths for load balancing because the
loop-free topology is inherent to explicit routing. Moreover, there is no single point of failure, path recovery is faster and the bridge diameter is increased.

The encapsulation method is based on QinQ, or IEEE 802.1AD [17]. QinQ was originally designed to expand the number of VLANs by adding an additional tag to an 802.1Q frame. With this new tag, the number of VLANs is increased to 4096 × 4096. QinQ tags can carry different information. For example, the inner tag can indicate the user and the outer tag the service. QinQ can be deployed as the extension of a core MPLS VPN into the MetroEthernet to provide end-to-end VPN technology. QinQ encapsulation can be based on port or on traffic (traffic-based encapsulation is more flexible). QinQ encapsulation is also compatible with default-gateway standby protocols such as VRRP. Level 3 resiliency methods can also be added to our method.

Ingress nodes store explicit end-to-end primary and backup paths in memory. If an active path fails, the ingress node switches to the best alternative path. The old primary path is marked inactive and can be deleted from memory if a timeout is exceeded. If the failed node or link recovers before the timeout expires, the old path is marked active again. In this case, it is not necessary to recalculate the path. This switching mechanism is extremely simple, and there is no need to change frame control information to signal an alternative path, because primary and backup paths are stored in the ingress node's memory and intermediate nodes hold no routing information. In MSTP recovery solutions, end-hosts normally need to change the VLAN id in subsequent frames to select an alternative switching path [18]. As our proposal uses an explicit routing/switching mechanism based on a global view of the network, paths are pre-calculated from LLDP information and the ingress node puts the full explicit path into the frame header.
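The switching behaviour at the ingress node can be sketched as a small path table. This is an illustrative model only: the class and method names are hypothetical, time is passed in explicitly as a number of seconds, and selecting the active path with the lowest metric (so that traffic returns to a recovered primary path) is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Path:
    hops: Tuple[str, ...]             # explicit end-to-end route
    metric: float                     # lower is assumed to be better
    active: bool = True
    failed_at: Optional[float] = None


class IngressPathTable:
    """Pre-calculated primary and backup paths held by an ingress node."""

    def __init__(self, paths, timeout):
        self.paths = sorted(paths, key=lambda p: p.metric)
        self.timeout = timeout

    def current(self):
        """Best path that is still marked active, or None if all have failed."""
        return next((p for p in self.paths if p.active), None)

    def report_failure(self, path, now):
        """Mark a path inactive; traffic moves to the next active path."""
        path.active = False
        path.failed_at = now

    def report_recovery(self, path, now):
        """Reactivate a path only if it recovered before the timeout."""
        if path.failed_at is not None and now - path.failed_at < self.timeout:
            path.active = True
            path.failed_at = None

    def expire(self, now):
        """Delete inactive paths whose timeout has been exceeded."""
        self.paths = [p for p in self.paths
                      if p.active or now - p.failed_at < self.timeout]


primary = Path(("a", "b", "d", "e", "g", "h"), metric=1.0)
backup = Path(("a", "b", "d", "f", "g", "h"), metric=2.0)
table = IngressPathTable([backup, primary], timeout=10.0)

table.report_failure(primary, now=5.0)
print(table.current().hops)  # the backup path is used without recalculation
```

Because the alternatives are already stored, switching only changes which stored path is selected; no path computation happens at failure time.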
Since the path is predefined, the frame is forwarded at each node without the need to make routing decisions at every node along the path (as in hop-by-hop destination-based routing). Setting up paths implies that the network keeps state information. This has been avoided in the past because of its potential performance penalty, but it can now be done more efficiently thanks to higher-performance devices. In fact, MPLS networks use this technique to establish traffic engineering tunnels, but Ethernet networks have never used it before. The explicit routing approach is very useful for preventing routing loops, for traffic engineering and for meeting QoS constraints (some constraints, such as the overall delay, cannot really be calculated in a hop-by-hop fashion). It is therefore not necessary to work with the Spanning Tree algorithm to find a unique path with no closed cycle, nor with multiple spanning tree routing solutions such as 802.1AQ Shortest Path Bridging [19]. To guarantee full compatibility, we have employed the source routing transparent Ethernet operation described in IEEE Std 802.1D-2004 Annex C. However, we use the QinQ tagged frame format between the ingress and egress nodes. In this case, when the CFI bit is set (CFI=1), the switch performs source routing. The routing information comprises a two-octet Route Control (RC) field that has the following fields:

• 3-bit Routing Type (RT), fixed to 0XX. If the most significant RT bit is set to 0, the Route Descriptor (RD) field contains a specific route through the network. This is the only combination used in our proposal, because there is no need for routing explorer frames (the network map is built from LLDP information).
• 5-bit Length (LTH). Indicates the length (in octets) of the Routing Information Field (RIF). Length field values will be even values between 2 and 30 inclusive.
• 1-bit Direction Bit (D). If the D bit is 0, the frame traverses the LANs in the order in which they are specified in the RIF (RD1 to RD2 to... to RDn). Conversely, if the D bit is set to 1, the frame traverses the LANs in the reverse order (RDn to RDn-1 to... to RD1).
• 5-bit Largest Frame (LF). For source routing frames, the LF bits are ignored and shall not be altered by the bridge. This field is therefore free for other uses, such as a destination region/ring id.
• 1-bit Non Canonical Format Indicator (NCFI). If it is set, it indicates that all MAC address information that may be present in the MAC Service Data Unit (MSDU) is in canonical format; it is reset otherwise.

Figure 11 shows the modified E-RIF Route Control Field.

Fig. 11. E-RIF Route Control Field (fields RT, LTH, D, LF and NCFI laid out over two octets)
Moreover, more octets of Route Descriptors can be added (up to a maximum of 28 octets, as in IEEE Std 802.1Q-2005). Each Route Descriptor has 16 bits, which contain the next node (7-bit code) and the destination port (9-bit code) to provide routing information. This design is scalable because it has three organization levels: regions (up to 32, coded in a 5-bit field), nodes (up to 128, coded in a 7-bit field) and ports (up to 512, coded in a 9-bit field). The bridge diameter of the network is limited only by the maximum number of octets in the frame. In our implementation, end-stations transmit standard frames with no routing information included. The ingress switch is responsible for inserting the routing information into the frame and the egress switch removes it, so no changes are required at end-stations. Route Descriptors are concatenated, and each one is removed by the corresponding switch. Finally, when there are no Route Descriptors left in the message, the CFI bit is set to zero and the last switch forwards the frame to the local destination port. This mechanism of inserting and removing information from the frame can prevent undesirable loops in the network, because there must never be two equal route descriptors in the routing information field. To provide a procedure and protocol to notify link and node errors and start restoration actions, we can choose between two standardized methods (either of them can be implemented). Bidirectional Forwarding Detection (BFD) is an IETF draft standard that defines a method to detect errors in forwarding paths, and it can be used to trigger a switch or a router to change to alternate interfaces or routes. It can be applied to many technologies and networks. The alternative is to use the IEEE 802.1AG standard [20]. It was designed only for Ethernet networks, but it can be adapted with few modifications. Either can be used, depending on the network scenario.
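The Route Descriptor handling described above can be sketched as follows. The 7-bit node code, the 9-bit port code and the 28-octet limit come from the text; the bit order within the 16-bit descriptor (node code in the most significant bits), the big-endian byte order and all function names are assumptions of this sketch.

```python
NODE_BITS = 7
PORT_BITS = 9
MAX_DESCRIPTORS = 28 // 2  # at most 28 octets of 16-bit Route Descriptors


def pack_descriptor(node, port):
    """Pack a (node, port) pair into one 16-bit Route Descriptor."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node code must fit in 7 bits")
    if not 0 <= port < (1 << PORT_BITS):
        raise ValueError("port code must fit in 9 bits")
    return (node << PORT_BITS) | port


def unpack_descriptor(value):
    """Recover the (node, port) pair from a 16-bit Route Descriptor."""
    return value >> PORT_BITS, value & ((1 << PORT_BITS) - 1)


def build_route(hops):
    """Ingress side: concatenate descriptors, rejecting repeated ones (loops)."""
    if len(hops) > MAX_DESCRIPTORS:
        raise ValueError("route does not fit in 28 octets")
    descriptors = [pack_descriptor(node, port) for node, port in hops]
    if len(set(descriptors)) != len(descriptors):
        raise ValueError("two equal route descriptors would form a loop")
    return b"".join(d.to_bytes(2, "big") for d in descriptors)


def forward(route):
    """At a switch: strip the first descriptor and return (node, port, rest).

    When rest is empty, the last switch would clear the CFI bit and deliver
    the frame to the local destination port.
    """
    node, port = unpack_descriptor(int.from_bytes(route[:2], "big"))
    return node, port, route[2:]


route = build_route([(1, 2), (3, 4)])
while route:
    node, port, route = forward(route)
    print(f"node {node} forwards on port {port}")
```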
5 Conclusions

We have shown the need to integrate critical and non-critical real-time services in a single corporate network, to optimize management and to avoid high costs. We have provided an experimental law that should be applied when a network with real-time services is needed. Measurement results confirm that recovery times in Ethernet and MPLS networks exceed 50 msec. In addition, we have observed that the delay of video streams is higher when the traffic begins to use an available path after a failure than when the failed device is recovered. We have proposed a new system to improve resiliency in critical infrastructure networks that allows path recoveries faster than 50 msec, because the proposal is based on a protection mechanism which guarantees a very small delay when a path has failed. This solution is fully compatible with Ethernet and MPLS equipment and protocols, with no need for substantial hardware or software modifications in nodes or end stations. In fact, only end switches must run this algorithm, and STP is no longer necessary to avoid loops. The key to our method is a new metric that includes not only QoS constraints but also terms related to the reliability of nodes and links. This work opens new research lines in MetroEthernet, cluster and corporate networks, and it can also be an optimal solution for fault tolerance and traffic engineering.
References

1. Sánchez, F.J., Lloret, J., Díaz, J.R., Jiménez, J.M.: Mecanismos de protección y recuperación en redes de tiempo real para el soporte de servicios de explotación ferroviaria. In: XX Simposium Nacional de la URSI, Gandia (Spain) (September 2005)
2. Sánchez, F.J., Lloret, J., Díaz, J.R., Jiménez, J.M.: Standardization and improvement of real time network technology on railway traffic control systems. In: 7th World Congress on Railway Research, Montreal (June 2006)
3. Lloret, J., Sánchez, F.J., Díaz, J.R., Jiménez, J.M.: A fault-tolerant protocol for railway control systems. In: 2nd Conference on Next Generation Internet Design and Engineering, Valencia (Spain) (April 2006)
4. Harrison, E., Farrel, A., Miller, B.: Protection and restoration in MPLS networks. Data Connection White Paper (October 2001)
5. Aggarwal, R., Kompella, K., Nadeau, T., Swallow, G.: BFD For MPLS LSPs (November 2007)
6. Amin, M., Ho, K., Pavlou, G.: MPLS QoS-aware Traffic Engineering for Network Resilience. In: Proceedings of London Communications Symposium, London, UK (September 2004)
7. IEEE Std 802.17B. Resilient packet ring (RPR) access method and physical layer specifications, pp. 1–216 (August 2007)
8. Shah, S., Yip, M.: Ethernet Automatic Protection Switching. IETF RFC 3619 (October 2003)
9. Dirac, G.A.: Extensions of Menger’s Theorem. Journal of the London Mathematical Society s1-38(1), 148–161 (1963)
10. Hansen, A.F., Kvalbein, A., Cicic, T., Gjessing, S., Lysne, O.: Resilient Routing Layers for Recovery in Packet Networks. In: Proceedings of the International Conference on Dependable Systems and Networks, June 28 - July 1, 2005, pp. 238–247 (2005)
11. Stamatelakis, D., Grover, W.D.: IP layer restoration and network planning based on virtual protection cycles. IEEE Journal on Selected Areas in Communications 18(10) (October 2000)
12. Padmaraj, M., Nair, S., Marchetti, M., Chiruvolu, G., Ali, M.: Distributed Fast Failure Recovery Scheme for Metro Ethernet. IEEE ICN, Los Alamitos (April 2006)
13. Coll, H., Lloret, J., Sanchez, F.J.: Does ns2 really simulate MPLS networks? In: The 4th Int. Conf. on Autonomic and Autonomous Systems (March 2008)
14. IEEE Std 802.1AB, Draft 2 d2-2. Station and Media Access Control Connectivity Discovery (December 2007)
15. IEEE Std 802.1D. Media Access Control (MAC) Bridges (June 2004)
16. Eppstein, D.: Finding the k Shortest Paths. SIAM J. Computing 28(2), 652–673 (1998)
17. IEEE Std 802.1AD. Provider Bridge (May 2006)
18. Sharma, S., Gopalan, K., Nanda, S., Chiueh, T.: Viking: a multi-spanning-tree Ethernet Architecture for Metropolitan Area and Cluster Networks. In: IEEE INFOCOM (2004)
19. IEEE 802.1AQ draft 0.3. Shortest Path Bridging (May 2006)
20. IEEE 802.1AG draft 8.1. Connectivity Fault Management (February 2007)
Shim6: Reference Implementation and Optimization
Jun Bi, Ping Hu, and Lizhong Xie
Network Research Center, Tsinghua University, Beijing, 100084, China
[email protected]
Abstract. Shim6 is an important multihoming solution. This paper studies shim6 from several perspectives, including protocol implementation, mechanism optimization, and security enhancement. To provide a shim6 research platform, we implement the shim6 protocol on the Linux 2.6 platform as one of the first reference implementations. Based on this platform, we refine the shim6 address switching mechanism, which greatly reduces the address switching time. In addition, we propose an enhanced shim6 security mechanism that defeats reflection-type DoS/DDoS attacks launched from the multihomed site by preventing source address spoofing within that site.
Keywords: Multihoming, Shim6, IPv6.
1 Introduction
Multihoming to upstream ISPs is a requirement today for most IPv4 networks in the Internet, especially enterprise networks. With the deployment of IPv6, more and more devices can access the Internet, and more and more sites will require multihoming, so IPv6 multihoming is likely to become very common. Currently, the shim6 mechanism [1] is an important multihoming approach. This paper studies shim6 from several perspectives, including shim6 protocol implementation, shim6 mechanism optimization and IPv4/IPv6 transition utilizing shim6. To provide a shim6 research platform, we implement the shim6 protocol on the Linux 2.6 platform, which is one of the first reference implementations in the world. Based on this platform, we refine the shim6 address switching mechanism, which greatly reduces the address switching time. In addition, we propose an enhanced shim6 security mechanism that defeats reflection-type DoS/DDoS attacks launched from the multihomed site by preventing source address spoofing within that site. To explore the use of shim6 in other research fields, we propose a mechanism called MI46 that optimizes IPv4/IPv6 inter-operation using a simplified shim6. The rest of this paper is organized as follows: Section 2 describes our Linux-based shim6 implementation; Sections 3 and 4 present the enhancements to the shim6 protocol, namely the refinement of the address switching mechanism and the enhancement of the security mechanism. Section 5 concludes the paper.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 302–313, 2008. © IFIP International Federation for Information Processing 2008
2 Reference Implementation of Shim6
Multihoming refers to the phenomenon in which a network end node accesses the Internet through multiple network paths, mainly for fault resilience. To access the Internet via multiple paths, the multihomed end node usually holds several addresses. Once the current network path fails, the node can immediately switch to another address and communicate over another path. In the shim6 approach, a new ‘SHIM6’ sub-layer is inserted into the IP stack of end hosts that wish to take advantage of multihoming. The shim6 sub-layer is located within the IP layer, between the IP endpoint sub-layer and the IP routing sub-layer. With shim6, hosts have to deploy multiple provider-assigned IP address prefixes from multiple ISPs. These IP addresses are used by applications, and if a session becomes inoperative, the shim6 sub-layer can switch to a different address pair. The switch is transparent to applications because the shim6 layer rewrites and restores the addresses at the sending and receiving hosts. For transport layer communication survivability, the shim6 approach separates the identity and location functions of IPv6 addresses. In shim6, the identifier uniquely identifies endpoints in the Internet, while the locator is used for routing. There is a one-to-many relationship between identifier and locators. The shim6 layer performs the mapping between identifier and locator consistently at the sender and the receiver. The layers above the shim6 sub-layer use only the unique identifier to identify the communication peer, even if the locator of the peer has changed. Hence, when the multihomed host switches to another locator, the current transport layer communication does not break, since the identifier is unchanged.
Because no shim6 implementation was available, many open questions about shim6 could not be answered clearly. To start our research, we therefore first implemented the shim6 protocol on the Linux 2.6.x platform. The control plane of shim6 mainly consists of the four-way handshake. The first two messages confirm whether the peer supports shim6, and the other two exchange the address lists with the peer. The data plane of shim6 has two main functions: rewriting the source and destination addresses of each packet, and adding (at the sender) or removing (at the receiver) the payload extension header. The control plane is implemented in user space, while the data plane is implemented in kernel space using the Linux netfilter mechanism [2]. The reason is that the four-way handshake is more complicated than address rewriting, and implementing it in the kernel could hurt kernel efficiency. Rewriting the addresses of every packet, however, must be done in kernel space, and it adds little overhead to the kernel because the operation is very simple. The data plane relies on the address mapping table generated by the control plane, so we implement communication modules between kernel and user space using the Linux netlink mechanism [3] to transfer the address mapping table from the control plane to the data plane.
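The data-plane rewriting described above can be illustrated with a minimal Python sketch. It is not the kernel module of the paper: the names `ShimContext` and `MappingTable`, the dictionary layout, and the example addresses are assumptions made purely for illustration. The sketch keeps, per peer, the identifier pair seen by the upper layers and the locator pair currently in use, rewrites addresses on send, and restores them on receive.

```python
from dataclasses import dataclass


@dataclass
class ShimContext:
    """Hypothetical per-peer state: identifiers stay fixed, locators may change."""
    local_id: str
    peer_id: str
    local_locator: str
    peer_locator: str


class MappingTable:
    """Illustrative stand-in for the address mapping table filled by the control plane."""

    def __init__(self):
        self._by_ids = {}        # (local_id, peer_id) -> context
        self._by_locators = {}   # (local_locator, peer_locator) -> context

    def install(self, ctx):
        # Replace any older context for the same identifier pair.
        old = self._by_ids.get((ctx.local_id, ctx.peer_id))
        if old is not None:
            self._by_locators.pop((old.local_locator, old.peer_locator), None)
        self._by_ids[(ctx.local_id, ctx.peer_id)] = ctx
        self._by_locators[(ctx.local_locator, ctx.peer_locator)] = ctx

    def rewrite_outgoing(self, src, dst):
        """Sender side: replace identifiers by the locators currently in use."""
        ctx = self._by_ids.get((src, dst))
        if ctx is None:
            return src, dst      # no shim6 context: leave the packet untouched
        return ctx.local_locator, ctx.peer_locator

    def restore_incoming(self, src, dst):
        """Receiver side: restore the identifiers expected by the upper layers."""
        ctx = self._by_locators.get((dst, src))
        if ctx is None:
            return src, dst
        return ctx.peer_id, ctx.local_id
```

In this sketch the upper layers always see the identifier pair, so switching to another locator pair only requires the control plane to call `install` again with the new locators.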
3 Optimization of Shim6
On our shim6 experimental platform, we studied optimizations of shim6 and obtained some results. The address switching mechanism of shim6 is defined in REAP (REAchability Protocol) [4]. REAP has two parts: a failure detection mechanism, which detects failures between two communicating hosts, and an address-pair exploration mechanism, which locates another operational address-pair if a failure occurs. The failure detection mechanism uses two timers (Keepalive Timer and Send Timer) and the Keepalive message to detect failures between two communicating hosts. The process is as follows:
1. When host A sends data packets to host B, it starts a Send timer. The suggested timeout value is 10s.
2. When host B receives data packets from host A, it starts a Keepalive timer. The suggested timeout value is 3s. If host B sends no packets to host A within the keepalive interval (less than the Keepalive timeout value) after receiving data packets from host A, host B sends a Keepalive message to host A.
3. If host A receives no packets (data packets or Keepalive packets) from host B before the Send timer expires, it concludes that the current address-pair has failed. The address-pair exploration mechanism then starts to locate another operational address-pair.
The address-pair exploration mechanism uses the Probe message to examine the reachability of an address-pair. Specifically, if host A wants to choose another address-pair with host B, host A sends a series of Probe messages to host B until it receives a Probe message from B. Upon receipt of the Probe message from host B, host A can conclude that the new address-pair is reachable from A to B. After receiving the first Probe message from host A, host B applies the same algorithm to confirm that the address-pair is reachable from B to A.
If the test fails, another address-pair is selected for the next exploration. REAP defines suggested default values for the exploration process. The Initial Probe Timeout, which specifies the interval between initial attempts to send probes, is 0.5s. If there is no response, the exploration process applies exponential backoff, doubling the time between probes on each increase. Once the interval reaches the Max Probe Timeout (60 seconds), probes continue to be sent at this rate until a suitable response is received. From the above description, the shim6 address switching time can be expressed as:
Shim6 address switching time = failure detection time + next address-pair exploration time
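As a rough illustration of these defaults, the following Python sketch lists the waiting times between unanswered probes and adds them to the failure detection time. It follows only the simplified description given here (0.5s initial interval, doubling, 60s cap); the function names are hypothetical, and treating the exploration time as the sum of the unanswered probe intervals is an assumption of this sketch.

```python
def probe_intervals(initial=0.5, maximum=60.0, count=10):
    """Waiting times between successive unanswered probes (exponential backoff)."""
    intervals = []
    timeout = initial
    for _ in range(count):
        intervals.append(timeout)
        timeout = min(timeout * 2, maximum)   # double, but never exceed the cap
    return intervals


def switching_time(detection_time, exploration_time):
    """Switching time = failure detection time + exploration time."""
    return detection_time + exploration_time


# Assumed scenario: four probes go unanswered before a working pair is found.
example = switching_time(10.0, sum(probe_intervals(count=4)))
```

With the values above, four unanswered probes add 0.5 + 1 + 2 + 4 = 7.5s to the 10s detection time.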
The failure detection time is the Send timer timeout (10s). The next address-pair exploration time may grow exponentially. If the multihomed hosts have numerous addresses and the address failure rate is high, shim6 address switching becomes very time consuming. For example, if the multihomed hosts A and B have Na and Nb addresses respectively, then in the worst case each host has to probe about Na*Nb address-pairs to find the next operational one. In the original shim6 address switching mechanism, the next address-pair exploration does not start until failure detection has finished; that is, only after 10s do the hosts start to explore the reachability of the next address-pair. In this paper, we borrow from the SCTP protocol [5] to refine the shim6 address switching mechanism. The main idea is that, during failure detection, the hosts simultaneously maintain the reachability of the next address-pair. In this way, once a failure occurs, the hosts can immediately switch to the next operational address-pair. The refined mechanism shortens the shim6 address switching time to exactly 10s (the failure detection time). In the refined mechanism, we use the Heartbeat message to maintain the reachability of the next address-pair. The details are as follows. While hosts A and B communicate using the current address-pair, they periodically send Heartbeat requests to each other. Upon receipt of a Heartbeat request, a host sends back a Heartbeat acknowledgement. Each host maintains a counter of the Heartbeat requests that have been sent but not yet acknowledged. When the counter reaches a certain threshold, the address-pair is marked unreachable, and another address-pair is selected and examined using the same method.
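The Heartbeat bookkeeping can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the paper: the class and function names are hypothetical, the threshold of 3 is an assumed value (the text only says "a certain threshold"), and resetting the counter on every acknowledgement is an assumption as well.

```python
class HeartbeatMonitor:
    """Tracks reachability of one candidate address-pair (names are illustrative)."""

    def __init__(self, threshold=3):
        self.threshold = threshold   # assumed value, not taken from the paper
        self.unacked = 0             # requests sent but not yet acknowledged
        self.reachable = True

    def on_request_sent(self):
        self.unacked += 1
        if self.unacked >= self.threshold:
            self.reachable = False   # mark the address-pair unreachable

    def on_ack_received(self):
        # Assumption: any acknowledgement clears the counter.
        self.unacked = 0
        self.reachable = True


def first_reachable(monitors):
    """Return the first address-pair still marked reachable, or None."""
    for pair, monitor in monitors.items():
        if monitor.reachable:
            return pair
    return None
```

When the Send timer expires, a host using such a table could switch to `first_reachable(...)` directly instead of starting a new exploration.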
We designed several experiments to verify the effectiveness of the refined shim6 address switching method, using the number of addresses of each multihomed host and the address failure probability as variables. The experimental results show that when the multihomed hosts have numerous addresses and these addresses have a relatively high probability of failure, the original shim6 address switching time increases significantly, whereas the refined address switching time remains constant at 10s. Thus, in this situation, the refined mechanism effectively reduces the address switching time.
4 Source Address Spoofing Prevention for Multihomed Site
4.1 Problem Statement
Today's Internet is suffering from reflection-type DoS/DDoS attacks. Current source address spoofing prevention methods have a coarse granularity, and in some cases attackers can still launch reflection-type DoS/DDoS attacks or other attacks that are hard to trace. Filtering-based approaches, such as ingress filtering [6][7], can validate the source address in real time. However, current filtering-based approaches validate the IP source address only against the network prefix. Thus, these approaches cannot prevent IP source address spoofing at the granularity of the edge network, and they leave room for reflection-type DoS/DDoS attacks. For instance, with ingress filtering, the attacker can still forge an IP source address that belongs to the same multihomed site and attack the victim with a reflection-type DoS/DDoS attack. This attack scenario is shown in Figure 1.
Fig. 1. Reflection-type DDoS attack in the same multihomed site
This type of security threat may become more serious with the deployment of IPv6, because an IPv6 multihomed site may be much larger than an IPv4 one. In this situation, ingress filtering is not sufficient. This paper aims to prevent source address spoofing when a host in the multihomed site sends packets outside the site. In addition, replay attacks must be handled: even if packets with a forged IP source address can be identified, a replayed packet can still be sent outside the multihomed site, since its source address is valid.
4.2 The Algorithms
4.2.1 Source Address Validation Algorithm
In our solution, each edge router of the edge network is equipped with a security gateway that carries out the authentication algorithm. Each security gateway is responsible for the addresses assigned by the ISP to which it is attached. For example, in Figure 2, the security gateway A handles only the addresses assigned by ISP A. When host A sends packets with a source address based on prefix A (PrefixA:PrefSite:hostA), the packets may be sent to edge router B, which is connected to ISP B, and the security gateway B would be confused and take the wrong action. To solve this problem, packets originating in the multihomed site are passed through a source routing domain (see Figure 2), to which all edge routers are connected. The routers in this domain choose a route based on both the destination and source addresses of a packet, selecting the appropriate edge router for the given source address.
Fig. 2. The source address spoofing prevention solution for multihoming scenario
Initially, when a host wants to access the Internet, it first carries out access authentication. This process can use an existing access authentication mechanism such as 802.1x [8], NAC [9], NAP [10], or TNC [11]. If the access authentication succeeds, the host generates a session key and sends it to the security gateway via a key exchange mechanism such as IKE or IKEv2. The session key is a random number at least 12 bytes long. The security gateway binds the session key to the host’s IP address. When the host sends packets outside the multihomed site, it must prove its ownership of the source IP address by demonstrating knowledge of the secret session key it shares with the security gateway. It does so by generating a signature for each packet using a hash digest algorithm (MD5, SHA-1, etc.). The approach consists of the following steps. Host A sends a packet M to the security gateway B. M carries a signature H[M||S], which is computed by the hash digest function over the session key S and certain parts of the packet M (source address, timestamp, sequence number, etc.). When the security gateway B receives the packet M, it re-computes the hash value HB from the packet M, since B also knows the session key S. If HB equals the signature H[M||S] carried in the packet M, the security gateway B confirms that the packet has a valid source address; otherwise, B concludes that the packet's source address is forged and drops it.
4.2.2 The Anti-replay Algorithm
After validating the signature, the security gateway also identifies replay packets by checking whether the timestamp of the packet has expired and whether the sequence number of the packet is increasing. The timestamp method works as follows: when host A sends a packet M to the security gateway, the packet is marked with a timestamp Ta, which represents its sending time. Once the security gateway receives the packet M, it reads its local time Tb. If |Tb-Ta| > T, where T is the admission time window, the security gateway concludes that the packet M is a replay and drops it. However, it is hard to synchronize the clocks of the host and the gateway exactly. Moreover, the transmitting time of the packet in the network is uncertain. Therefore, the admission time window T is always larger than the real transmitting time of the packet, which makes the timestamp method unreliable for anti-replay. When |Tb-Ta| < T, the packet is accepted as a non-replay packet. If a replay of that packet is then received within the remaining margin (T-|Tb-Ta|), the security gateway will incorrectly regard it as a normal packet. The main idea behind the sequence number method is that each packet sent by host A to the security gateway carries an incremental sequence number. If the latest packet’s sequence number is greater than the previous one, the packet is normal; otherwise, it is a replay. This method requires the packets to arrive in order. If packets arrive out of order, a sequence number window can be used to deal with them, as in IPsec.
Out-of-order arrival is mainly caused by load balancing policies or queuing in routers and other layer-3 devices. This paper deals with the segment before the first layer-3 device, so we can assume that packets sent from hosts in the edge network to the security gateway always arrive in order. However, even when packets arrive in order, this method may fail to identify some replay packets because the sequence number is used cyclically. For example, if the sequence number is 16 bits long, it returns to 0 after reaching the maximum of 65535 and increases again as in the previous cycle. In this case, if the attacker keeps a packet from the nth cycle and replays it in the (n+1)th cycle, the security gateway cannot identify the replay packet. To overcome the drawbacks of the timestamp and sequence number methods, we combine the two to prevent replay attacks. The timestamp method can use the sequence number mechanism to identify replay packets within the admission time window T, and the sequence number method can avoid confusing normal and replay packets by limiting the period of the sequence number cycle within the admission time window T.
4.2.3 IPv6 Source Address Validation Header
In our approach, we design a new IPv6 extension header to carry the signature, the sequence number, and other useful information. We call this new extension header the “source address validation header”. We define a new extension header rather than reuse the IPv6 AH extension header for authentication between host and security gateway because, according to the IPv6 specification, the AH extension header can be carried only once in an IPv6 packet. If we used it here, the host could no longer use the AH extension header with its communication correspondent, which is not the result we want.
Fig. 3. The format of the source address validation header
The format of the extension header is shown in Fig. 3:
• Next Header: 8 bits. Indicates either the type of the next extension header or the protocol type of the payload (TCP/UDP).
• Payload Len: 8 bits. Length of the source address validation header in 8-octet units, not including the first 8 octets.
• Algorithm: 8 bits. Identifies the hash digest algorithm. For example, MD5 is set to 1.
• Reserved: 8 bits. Reserved for future use. Zero on transmit.
• Timestamp: 32 bits. Used for anti-replay as described above.
• Sequence Number: 32 bits. Used for anti-replay as described above.
• Authentication Data: 128 bits if MD5 is used as the hash digest algorithm. The authentication data is computed by the hash digest algorithm, whose input includes the IPv6 source address, timestamp, sequence number, and session key.
The security gateway needs to remove the source address validation header from the packet. Under partial deployment, if the header is not removed, hosts in other multihomed sites may drop the packet because they do not understand the new extension header. This step is unnecessary once our approach has been globally deployed.
4.3 Experiment Simulation
4.3.1 The Performance Evaluation of Source Address Validation Algorithm
Because every packet must be authenticated when it is sent outside the edge network, the primary design requirement of the source address validation algorithm is high performance. In our approach, this requirement comes down to whether the performance of the hash digest algorithm is high enough. In addition, the security gateway needs to record the (IPv6 address, session key) pair for every host in the edge network, and each authentication procedure needs to look up the session key by IPv6 address. We use a hash table to organize the (IPv6 address, session key) pairs, so the average lookup cost is O(1). We conducted experiments to evaluate the source address validation algorithm. Table 1 shows our experimental results, which were measured on a platform with an Intel P4 2.0G CPU and 512M memory.
Table 1. The performance comparison of two main hash digest algorithms
Hash digest algorithm    Throughput (MB/s)
MD5                      205
SHA-1                    66
The results show that the throughput of MD5 is about 1.64 Gbps (205 MB/s × 8 bits per byte), which meets the requirements in most cases. Note that this result was obtained with a software implementation of MD5; a hardware implementation would achieve higher performance. Therefore, the source address validation algorithm in our approach is feasible.
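The per-packet work whose cost is measured here, a hash table lookup followed by one digest computation, can be sketched in Python. This is a sketch under assumptions, not the gateway implementation of the paper: the names `SecurityGateway`, `authentication_data` and `pack_addr` are hypothetical, and the byte layout of the digest input (16-byte address, then two 32-bit fields, then the key) is assumed for illustration.

```python
import hashlib
import hmac
import ipaddress
import struct


def pack_addr(text):
    """Return the 16-byte binary form of an IPv6 address string."""
    return ipaddress.IPv6Address(text).packed


def authentication_data(src_addr, timestamp, seq, session_key):
    """Digest over source address, timestamp, sequence number and key (H[M || S])."""
    message = src_addr + struct.pack("!II", timestamp, seq)   # two 32-bit fields
    # Mirrors the plain hash construction described in the text; a keyed
    # construction such as HMAC would normally be preferred today.
    return hashlib.md5(message + session_key).digest()        # 128-bit result


class SecurityGateway:
    def __init__(self):
        self.session_keys = {}   # hash table: IPv6 address -> session key

    def bind(self, src_addr, session_key):
        self.session_keys[src_addr] = session_key

    def validate(self, src_addr, timestamp, seq, auth_data):
        key = self.session_keys.get(src_addr)   # O(1) average lookup
        if key is None:
            return False                        # no session bound to this address
        expected = authentication_data(src_addr, timestamp, seq, key)
        return hmac.compare_digest(expected, auth_data)
```

A packet whose source address has no bound session key, or whose digest does not match, is rejected; everything else in the packet is left untouched by this check.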
Fig. 4. The topology of anti-replay algorithm simulation experiments
4.3.2 The Effectiveness Evaluation of Anti-replay Algorithm
To validate the effectiveness of our anti-replay algorithm, which integrates the timestamp and sequence number methods, we conducted several experiments comparing the timestamp method, the sequence number method, and our algorithm. The experimental topology is shown in Fig 4, where host A is the victim and host B sniffs the packets sent from A and replays them. In the simulation experiment for the timestamp method, host A and the security gateway G synchronize their clocks exactly. The packet transmitting time from A to G is about 8~10ms. Host B takes 2ms±50% to sniff a packet and replay it. We set the admission time window T in the security gateway to 10ms, 11ms and 12ms respectively to observe the effectiveness of the timestamp method. Fig 5 shows the result when the attacker B replays 10,000 packets.
Fig.5. The simulation result of the timestamp method
From Fig 5 we can see that the effectiveness of the timestamp method depends heavily on the size of the admission time window T. When T=11ms, that is, 10% above the upper bound of the packet transmitting time, the gateway can identify 68.7% of replay packets. When T=12ms, that is, 20% above the upper bound of the packet transmitting time, the gateway can identify only 31.3% of replay packets. In the simulation experiment, host A and the security gateway G synchronize their clocks exactly, but in a real network it is hard to do so. What is worse, the packet transmitting time in a real network may fluctuate more than in our experiment. So the admission time window T may need a larger margin, and the effectiveness may be much worse in a real network.
In the simulation experiment for the sequence number method, the sequence number is 16 bits long. After recording the packets sent from A in the nth sequence number increasing cycle, attacker B replays these packets persistently in the (n+1)th sequence number increasing cycle. Fig 6 shows the accuracy of replay packet identification in the (n+1)th cycle when attacker B records 10, 20, 50, 100, 1000, 5000, 10000, 15000 and 20000 packets sent from A in the nth cycle. From Fig 6 we can see that, under persistent replay, the accuracy of replay packet identification in the (n+1)th cycle is linear in the number of packets recorded in the nth cycle. The result shows that almost all recorded packets can be replayed successfully under persistent replay.
Fig.6. The simulation result of the sequence number method
We repeated the above two simulation experiments to validate the effectiveness of our anti-replay algorithm, which integrates the timestamp and sequence number methods. The result is shown in Fig 7.
a. Comparison with timestamp
b. Comparison with sequence number
Fig. 7. The results of our anti-replay algorithm integrating timestamp and sequence number
In Fig 7a, even though the admission time window T has a large margin, all replay packets are identified successfully with the help of the sequence number method. In Fig 7b, when attacker B replays in the (n+1)th sequence number increasing cycle the packets recorded in the nth cycle, the security gateway can identify nearly 100% of the replay packets, since the intervals between replay packets and normal packets exceed the admission time window T. From the above experiments we conclude that the anti-replay algorithm proposed in this paper overcomes the shortcomings of both the timestamp and the sequence number methods and forms a more effective, fine-grained anti-replay mechanism.
5 Summary
This paper studies shim6 from several perspectives, including protocol implementation, mechanism optimization, and security enhancement. To provide a shim6 research platform, we implement the shim6 protocol on the Linux 2.6 platform, which is one of the first reference implementations in the world. Based on this platform, we introduce the HEARTBEAT message to refine the shim6 address switching mechanism, which greatly reduces the address switching time. In addition, we propose an enhanced shim6 security mechanism that defeats reflection-type DoS/DDoS attacks launched from the multihomed site by preventing source address spoofing within that site.
References
1. Nordmark, E.: Level 3 multihoming shim protocol, draft-ietf-shim6-proto-08.txt (2007)
2. http://www.netfilter.org
3. Salim, J., Khosravi, H., Kleen, A., Kuznetsov, A.: Linux Netlink as an IP Services Protocol. RFC 3549 (July 2003)
4. Arkko, J., Beijnum, I.: Failure Detection and Locator Pair Exploration Protocol for IPv6 Multihoming. draft-ietf-shim6-failure-detection-07.txt (2006)
5. Stewart, R., et al.: Stream Control Transmission Protocol. IETF RFC 2960 (2000)
6. Ferguson, P., Senie, D.: Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing. RFC 2827 (2000)
7. Baker, F., Savola, P.: Ingress Filtering for Multihomed Networks. RFC 3704 (2004)
8. American National Standards Institute: IEEE-SA Standards Board: IEEE Standard for Local and metropolitan area networks - Port-Based Network Access Control (2001)
9. Cisco Systems: Network Admission Control
10. Microsoft: Network Access Protection
11. TNC: TCG Trusted Network Connect TNC Architecture for Interoperability (2005)
A Measurement Study of Bandwidth Estimation in IEEE 802.11g Wireless LANs Using the DCF
Michael Bredel and Markus Fidler
Multimedia Communications Lab, Technische Universität Darmstadt, Germany
{michael.bredel,markus.fidler}@kom.tu-darmstadt.de
Abstract. In this paper we present results from an extensive measurement study of wireless bandwidth estimation in IEEE 802.11 WLANs using the distributed coordination function. We show that a number of known iterative probing methods, which are based on the assumption of first-come first-serve scheduling, can be expected to report the fair bandwidth share of a new flow rather than the available bandwidth. Our measurement results confirm this view and we conclude that under the current probe gap and probe rate models the fair share can only be loosely related to the available bandwidth. Like a few other studies we report that packet sizes have a tremendous impact on bandwidth estimates. Unlike these studies we can, however, show that minor modifications to known methods for wired networks, such as Pathload, can solve previously indicated limitations of these methods in wireless networks.
1 Introduction
The term available bandwidth denotes the portion of the capacity at a link or a network path that remains unused by present traffic. The idea to estimate the bandwidth of a network path from end-host measurements dates back to TCP congestion control [11] and packet pair probing [17]. Since then the field of available bandwidth estimation has evolved significantly and to date a number of estimation methods exist, e.g. [12,10,27,30,26], which are frequently used e.g. for network management, error diagnostics, overlay routing, and traffic engineering. The theoretical underpinnings of bandwidth estimation have been explored e.g. in [23,24,22,25] and empirical evaluations can be found e.g. in [29,30]. The task of bandwidth estimation in wireless networks, such as IEEE 802.11 Wireless LANs, however, has been understood to a much lesser extent. Bandwidth estimation methods, which perform well in case of wired links, have been reported to yield highly unreliable available bandwidth estimates for wireless links [9,21,19], hinting at a number of specific challenges and open issues in wireless bandwidth estimation. Here, the performance and the quality of service of a link depend largely on the characteristics of the shared physical medium and the multi-access coordination function. These aspects strongly influence quantities like delay, loss, and throughput and may result in a high variability of available or actually accessible resources. This makes measurement-based bandwidth estimation in wireless networks a complex and difficult task.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 314–325, 2008. © IFIP International Federation for Information Processing 2008
A few approaches to bandwidth estimation specifically address the characteristics of wireless networks. Passive methods, which measure existing traffic, can take advantage of the wireless broadcast medium and record idle periods to estimate the resources that would be available in the proximity of a node [20,28]. The approach is, however, unreliable in case of hidden stations. Active probing, on the other hand, takes measurements of specific probing traffic at the ingress and the egress of the network to infer the available bandwidth of a network path. Contrary to wired networks a strong impact of packet sizes on bandwidth estimates has been observed for wireless links [15,16,21,19,6]. Here, the fluid model, which is employed by many estimation methods, is clearly violated. Moreover, it is noted in [19] that the assumption of First-Come First-Serve (FCFS) scheduling, which is the basis of most active probing methods, may not hold in IEEE 802.11 WLANs, e.g. due to the Distributed Coordination Function (DCF).
In this paper we report results from an extensive measurement study of active probing methods to shed light on the issues mentioned above. We conducted measurements of an IEEE 802.11g link in ns-2 simulations [4], Emulab [3,31], and a highly controlled local wireless testbed that is located in a shielded and reflection absorbing measuring room. Based on our measurement results we find that the common FCFS model clearly does not hold under the DCF. The DCF can (within certain limits) rather be viewed as implementing fair scheduling. Based on the fair queuing model we can show that iterative methods that are based on the FCFS assumption can be expected to estimate the fair share of a new flow instead of the available bandwidth. Our measurements using the methods in Tab. 1 support the anticipated results. The gathered data reconfirms the known dependency of bandwidth estimates on the probing packet size.
The remainder of this paper is organized as follows. In Sect. 2 we review the state-of-the-art in available bandwidth estimation. In Sect. 3 we discuss relevant characteristics of wireless links and the fair queuing model. In Sect. 4 we show our measurement results and in Sect. 5 we present our conclusions.
2 Methods for Available Bandwidth Estimation
In this section we discuss the state-of-the-art of available bandwidth estimation in wired and wireless networks. We focus on a set of publicly available measurement tools, see Tab. 1, which are used for measurements in Sect. 4. For related empirical evaluations in wired networks see e.g. [29,30]. The task of available bandwidth estimation is to infer the portion of the capacity of a link or a network path that remains unused by cross-traffic. The available bandwidth of a link with index i can be defined as [12]

$$AB_i(\tau, t) = C_i \, (1 - u_i(\tau, t)) \qquad (1)$$

where Ci is the capacity and ui ∈ [0, 1] is the utilization by cross-traffic in the interval [τ, t). The available bandwidth of a network path is determined by the available bandwidth of the tight link as AB(τ, t) = min_i {ABi(τ, t)} [12].
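The definition in (1) and the tight-link rule can be illustrated with a minimal sketch. The function names and the three-hop example path are hypothetical and only serve to show the arithmetic.

```python
def available_bandwidth(capacity, utilization):
    """Eq. (1): AB_i = C_i * (1 - u_i) for a single link."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    return capacity * (1.0 - utilization)


def path_available_bandwidth(links):
    """Return (AB of the path, index of the tight link).

    links is a list of (capacity, utilization) pairs; the path value is
    the minimum of the per-link available bandwidths.
    """
    values = [available_bandwidth(c, u) for c, u in links]
    tight = min(range(len(values)), key=values.__getitem__)
    return values[tight], tight


# Hypothetical path: (capacity in Mbps, utilization by cross-traffic)
links = [(100.0, 0.30), (54.0, 0.50), (1000.0, 0.95)]
ab, tight = path_available_bandwidth(links)
print(ab, tight)
```

Note that the tight link (here the second one) is not necessarily the link with the smallest capacity or the highest utilization.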
M. Bredel and M. Fidler

Table 1. Bandwidth estimation methods used in this study

Probing method    Probing traffic    Inference technique
Pathload [12]     packet trains      iterative
Pathchirp [27]    packet chirps      iterative
Spruce [30]       packet pairs       direct
IGI [10]          packet trains      direct
PTR [10]          packet trains      iterative
WBest [21]        packet trains      direct
DietTOPP [14]     packet trains      iterative
Active measurement methods inject specific probes into the network and estimate the available bandwidth from measurements of the probing traffic at the ingress and at the egress of the network. The majority of the methods use packet pairs, i.e. two packets sent with a defined spacing in time referred to as gap, or packet trains, i.e. a larger number of packets sent at a defined constant rate. The rate of a packet train can be converted into a certain spacing of the train’s packets, showing a direct relation to the gap model of packet pairs. Packet chirps [27] are specific packet trains that are sent at a geometrically increasing rate, i.e. with a geometrically decreasing gap. Many methods use a simplified network model, where cross-traffic is viewed as constant rate fluid and the network is abstracted as a single tight link. Under these assumptions the available bandwidth of a network path simplifies to AB = C(1 − u). In addition, FCFS multiplexing is usually assumed, where flows share the capacity of a link proportionally to their offered rates. For constant rate probes an expression referred to as rate response curve [26,23] can be derived as

$$\frac{r_i}{r_o} = \max\left(1, \frac{r_i + \lambda}{C}\right) = \begin{cases} 1, & \text{if } r_i \le C - \lambda \\ \dfrac{r_i + \lambda}{C}, & \text{if } r_i > C - \lambda \end{cases} \qquad (2)$$

where ri and ro are the input and output rates of probes respectively and λ is the input rate of cross-traffic. If λ ≤ C the available bandwidth follows as AB = C − λ and otherwise AB = 0. Based on this model the task of available bandwidth estimation is to select the rate of probing traffic such that (2) can be solved for C and λ or C − λ. While (2) is usually used for packet train probes, an equivalent gap response curve can be derived for packet pairs, where the gap g is linked to the rate r by the packet size l resulting in gi = l/ri and go = l/ro [23]. In [13] measurement methods are classified by their inference technique as either direct or iterative probing schemes.
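The FCFS rate response curve (2) and the rate-to-gap conversion can be sketched as follows. The numerical values (C = 28 Mbps, λ = 10 Mbps) and all function names are assumptions chosen for illustration.

```python
def fcfs_rate_response(r_in, capacity, cross_rate):
    """Return r_i / r_o according to the FCFS fluid model, Eq. (2)."""
    return max(1.0, (r_in + cross_rate) / capacity)


def output_rate(r_in, capacity, cross_rate):
    """Output rate r_o of the probes for a given input rate r_i."""
    return r_in / fcfs_rate_response(r_in, capacity, cross_rate)


def gap_from_rate(rate_mbps, packet_bits):
    """g = l / r; with l in bits and r in Mbps the gap is in microseconds."""
    return packet_bits / rate_mbps


C, lam = 28.0, 10.0  # assumed capacity and cross-traffic rate in Mbps
for r_in in (6.0, 18.0, 24.0, 28.0):
    # Below the turning point C - lam = 18 Mbps the output equals the input.
    print(r_in, round(output_rate(r_in, C, lam), 2))
```

Probes sent at up to C − λ pass unchanged, whereas faster probes are slowed down in proportion to their share of the total offered load.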
Direct probing schemes assume that the capacity of the link C is known in advance. In this case (2) can be solved for the rate of the cross-traffic if the probing rate is larger than the available bandwidth. A straightforward choice is to probe with ri = C, in which case the available bandwidth follows from (2) in rate and gap notation, respectively, as [21,30]

$$AB = C \left(2 - \frac{C}{r_o}\right) = C \left(1 - \frac{g_o - g_i}{g_i}\right). \qquad (3)$$
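As a hedged illustration of (3), the sketch below generates the output rate of a probe sent at ri = C from the FCFS model (2) and recovers the available bandwidth in both rate and gap notation. The values of C, λ and the packet size are assumptions, and the helper names are hypothetical.

```python
def direct_probe_rate(capacity, r_out):
    """Rate form of Eq. (3): AB = C * (2 - C / r_o), for probing at r_i = C."""
    return capacity * (2.0 - capacity / r_out)


def direct_probe_gap(capacity, gap_in, gap_out):
    """Gap form of Eq. (3): AB = C * (1 - (g_o - g_i) / g_i)."""
    return capacity * (1.0 - (gap_out - gap_in) / gap_in)


# Assumed scenario: C = 28 Mbps, cross-traffic 10 Mbps, 1500-byte probes.
C, lam, packet_bits = 28.0, 10.0, 1500 * 8

# Output rate predicted by the FCFS model (2) when probing with r_i = C.
r_out = C / max(1.0, (C + lam) / C)

gap_in = packet_bits / C        # microseconds, since rates are in Mbps
gap_out = packet_bits / r_out

ab_rate = direct_probe_rate(C, r_out)
ab_gap = direct_probe_gap(C, gap_in, gap_out)
print(ab_rate, ab_gap)          # both close to C - lam = 18 Mbps
```

Both forms agree because go/gi = C/ro when the probing packets have a fixed size.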
A Measurement Study of Bandwidth Estimation
Spruce, WBest, and IGI are methods that use direct probing. Spruce assumes that the capacity is known a priori and immediately applies the gap version of (3). WBest provides a two-step algorithm using packet pairs to estimate the link capacity and packet trains for available bandwidth estimation based on the rate version of (3). IGI uses probing trains with increasing gaps resulting in a more complex direct probing formula than (3), for details see [10]. Iterative probing methods do not require a priori knowledge of the link capacity. They employ an iterative procedure with multiple probing rates aiming to locate the turning point of the rate response curve (2), i.e. they seek to find the largest probing rate ri such that ri /ro = 1. At this point the probing rate coincides with the available bandwidth. TOPP, DietTOPP, PTR, Pathload, and Pathchirp are iterative probing methods. TOPP [26] uses trains of packet pairs with increasing rate and applies (2) for available bandwidth estimation. It recursively extends the model to the multiple node case and in addition it estimates the capacity from the second linear segment in (2). Closely related is a simplified version called dietTOPP. PTR is a packet train method that uses a gap version of (2). Pathload varies the rate of packet trains using a binary search algorithm to find the largest probing rate that does not cause overload and hence matches the available bandwidth. It uses increasing one-way delays as an indication of overload. Increasing delays indicate that the input rate exceeds the output rate, i.e. ri /ro > 1 which clearly shows the relation to (2) [23]. Pathchirp increases the probing rate within a single packet train, referred to as a chirp, instead of varying the rate of successive packet trains. Like Pathload it detects crossing the turning point of the rate response curve from increasing one-way delays. 
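The idea of locating the turning point of the rate response curve can be sketched with a plain binary search. This is only an illustration in the spirit of the iterative methods: it is not the algorithm of any of the tools, which infer overload from measured delays or rates, and the "measurement" here is simply the FCFS fluid model (2) with assumed parameters.

```python
def find_turning_point(measure_output_rate, low, high, tolerance=0.1):
    """Binary search for the largest probing rate with r_i / r_o = 1.

    measure_output_rate(r_i) must return the observed output rate r_o.
    The result is accurate to within the given tolerance (same unit as rates).
    """
    while high - low > tolerance:
        rate = (low + high) / 2.0
        r_out = measure_output_rate(rate)
        if rate / r_out > 1.0 + 1e-9:   # overload: output lags behind input
            high = rate
        else:                           # no overload: try a higher rate
            low = rate
    return (low + high) / 2.0


def fcfs_link(capacity, cross_rate):
    """Stand-in for a real measurement: the FCFS fluid model of Eq. (2)."""
    return lambda r_in: r_in / max(1.0, (r_in + cross_rate) / capacity)


# Assumed link: C = 28 Mbps with 10 Mbps of cross-traffic.
estimate = find_turning_point(fcfs_link(28.0, 10.0), 0.0, 28.0)
print(round(estimate, 1))
```

Under the FCFS model the search converges to C − λ. If the link behaves differently, the same procedure converges to whatever rate marks the onset of overload for the probing flow.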
Most of the discussed methods have been developed for wired networks, while WBest and DietTOPP have been suggested by their authors for available bandwidth estimation in wireless networks. Furthermore, a method called ProbeGap has been proposed for bandwidth estimation in broadband access networks [19]. The method does not exactly fit into the classification scheme used here. ProbeGap sends out single packets and collects the one-way delays of these probes. The packets that have a delay close to zero are assumed to have found an idle channel, and their fraction is used to estimate the available bandwidth. Besides, passive measurement approaches can take advantage of the wireless broadcast medium [28] or protocol related information [20]. Passive methods are, however, not considered within the scope of this study.
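The idle-fraction idea described for ProbeGap can be sketched as follows. The text only states that the fraction of near-zero delays is used for the estimate; scaling a known capacity by that fraction is an assumption of this sketch, as are the threshold, the function names and the sample delays.

```python
def idle_fraction(queuing_delays_ms, threshold_ms=0.1):
    """Fraction of probes whose delay is close to zero (idle channel)."""
    if not queuing_delays_ms:
        raise ValueError("need at least one probe delay")
    idle = sum(1 for d in queuing_delays_ms if d <= threshold_ms)
    return idle / len(queuing_delays_ms)


def idle_fraction_estimate(capacity_mbps, queuing_delays_ms, threshold_ms=0.1):
    # Assumption of this sketch: scale a known capacity by the idle fraction.
    return capacity_mbps * idle_fraction(queuing_delays_ms, threshold_ms)


# Hypothetical delays in ms (one-way delay minus the minimum observed delay).
delays = [0.0, 0.05, 1.2, 0.02, 2.5, 0.0, 0.9, 0.08, 3.1, 0.01]
estimate = idle_fraction_estimate(28.0, delays)
print(idle_fraction(delays), estimate)
```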
3 Relevant Wireless Link Characteristics
In this section we discuss relevant characteristics of wireless links that are of vital importance for bandwidth estimation. We show how these aspects affect current fluid rate and gap models from Sect. 2 and reason which quantity we expect to be estimated by known methods for bandwidth estimation in wireless
systems. We use the term wireless link meaning a wireless broadcast channel with Medium Access Control (MAC) and Radio Link Control (RLC) protocols.

Fading and interference: As opposed to wired links the characteristics of wireless channels are highly variable due to fading. Other potentially hidden stations, which may even include stations that implement different radio standards using the same frequency band, create interference on the wireless broadcast medium. These effects can cause rapid fluctuations of the signal-to-noise ratio and may lead to high bit error rates. Different modulation and coding schemes combined with rate adaptation may be used for compensation. As a consequence, the capacity and the availability of the channel may vary drastically.

Contention: In case of wireless multi-access channels, stations share the same medium and contend for access to the channel, often in a fully distributed manner. Channel access is controlled by the MAC protocol. Before accessing the medium stations listen to the channel to detect nearby transmissions with the objective of avoiding collisions. This procedure may fail in case of hidden stations, thus requiring additional protocol mechanisms such as RTS/CTS. The resulting behavior of medium access procedures may be largely different if compared to FCFS multiplexing at a point-to-point link.

Retransmissions: Due to frequent packet loss on wireless links, e.g. because of fading, interference, or collisions, many standards include an RLC protocol that implements an automatic repeat request technique such as stop-and-wait ARQ to ensure packet delivery. Link layer retransmissions consume channel capacity and lead to increased and varying one-way delays.

Effects that are due to fading and interference result in a time-varying channel capacity C(t). It is straightforward to adapt the definition of available bandwidth (1) accordingly.
The fluid rate and gap models for bandwidth estimation (2), (3) assume, however, a constant capacity. This assumption is mirrored by the view of cross-traffic as constant rate fluid. Our controlled measurement environment eliminates effects that are due to fading and interference to a large extent. Protocol-related aspects, however, remain to be addressed. In the following subsections we investigate two effects of the IEEE 802.11 protocols that have major impact on bandwidth estimation.

3.1 Protocol Overhead
We compute the impact of the IEEE 802.11g protocol overhead on the achievable throughput. Given packets of size l bits the throughput can be determined as C = l/g where g is the gap, i.e. the time between the beginning of two subsequent packets transmitted at the maximum rate. The gap consists of a number of additive parameters (see footnote 1) including inter frame spacings, the expected backoff time, transmission times of the data packet including preamble and header and the acknowledgement, as well as propagation delays which are neglected here.

Footnote 1. We set t_difs = 28 μs, t_sifs = 16 μs, t_backoff = 139.5 μs, t_data = 26 + l/54 μs, t_ack = 30 μs.
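The throughput model C = l/g can be evaluated with the parameter values from the footnote. The sketch below assumes that the gap is the plain sum of the five listed terms, which the text suggests ("additive parameters") but does not spell out; the function names are hypothetical.

```python
# Timing parameters from the footnote, in microseconds.
T_DIFS, T_SIFS, T_BACKOFF, T_ACK = 28.0, 16.0, 139.5, 30.0


def t_data(packet_bits):
    """Transmission time of the data packet: 26 + l/54 microseconds."""
    return 26.0 + packet_bits / 54.0


def gap_us(packet_bytes):
    # Assumption: the gap is the plain sum of all five terms;
    # propagation delays are neglected as in the text.
    bits = 8 * packet_bytes
    return T_DIFS + T_BACKOFF + t_data(bits) + T_SIFS + T_ACK


def throughput_mbps(packet_bytes):
    """C = l / g; bits per microsecond are numerically equal to Mbps."""
    return 8 * packet_bytes / gap_us(packet_bytes)


for size in (100, 500, 1000, 1500):
    print(size, round(throughput_mbps(size), 1))
```

Under this assumed summation the fixed per-packet overhead dominates for small packets, so the modelled throughput grows with the packet size and reaches roughly 26 Mbps for 1500-byte packets.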
Fig. 1. Wireless testbed setup (wireless test network with stations S1 to S5 and access point AP at 54 Mbps, receiver R connected at 100 Mbps, separate wired control network; probe traffic and contending traffic).

Fig. 2. Achievable throughput [Mbps] versus packet size [Byte] (measurements and simple model).
Fig. 2 shows the achievable throughput for different packet sizes. The outcome of the above model is compared to measurement results from our testbed shown in Fig. 1, for details see Sect. 4. We used a single greedy UDP traffic stream that is generated with the D-ITG traffic generator [1]. The measured throughput is averaged over 60 s. It exhibits a strong dependence on the packet size, as also derived for IEEE 802.11b in [8]. We conclude that the size of probing packets used by bandwidth estimation tools has a large impact on the accuracy of estimates. This aspect is also noted in [16,19,21,6]. The fluid rate response model for bandwidth estimation does not involve a notion of packets. Using the equivalent gap formulation, an extended model that includes the effects of the packet size has been developed in [16]. The conclusion in [16] is nonetheless that the probing packet size needs to be tailored to the application that uses the bandwidth estimates.

3.2 Distributed Coordination Function
In this section we investigate how the capacity of IEEE 802.11g links using the DCF is allocated to contending flows. The issue is discussed controversially in the literature. While [18] showed short-term unfairness of CSMA/CA-based WaveLANs a recent study [7] attributes findings of unfairness to early WaveLAN cards and reports that current IEEE 802.11 DCF implementations actually exhibit good short-term fairness. On this account we performed a number of initial experiments to explore fairness issues in our shielded testbed. Fig. 3 shows the throughput of contending flows at an IEEE 802.11g link. Flows have a constant bit rate and are generated using the Rude/Crude traffic generator [5]. All packets have a constant size of 1500 Bytes. The throughput is averaged over 60 s. In Fig. 3(a) two flows contend for the link. Flow 1 has a rate of 28 Mbps and the rate of flow 2 is increased from 0 to 28 Mbps in steps of 1 Mbps after each experiment. Similarly Fig. 3(b) shows the throughput of four flows, where the rate of flow 4 is increased. The results in Fig. 3 confirm that each flow receives a fair share of the capacity. Flow 2 in Fig. 3(a) achieves its target throughput whereas the throughput of flow 1 is reduced accordingly until flow 2 reaches 14 Mbps. From this point on both
flows get a fair share of 14 Mbps regardless of the rate of flow 2. Fig. 3(b) confirms this result for four heterogeneous flows. Note that we measured fair throughput shares only for flows with homogeneous packet sizes.

Fig. 3. Measured throughput of contending flows. Each flow obtains its fair share. Panels: (a) two contending flows, (b) four contending flows. Axes: throughput [Mbps] versus rate of flow 2 [Mbps] in (a) and rate of flow 4 [Mbps] in (b).

The fair share f at a congested link that is observed in Fig. 3 can be computed as the solution of

$$f: \ \sum_{k=1}^{n} \min\{r_k, f\} = C \qquad (4)$$

where C is the capacity, rk is the rate of flow k, and n is the number of flows. Once f is determined the output rate of flow k follows as min{rk , f }. The rate response curve of a fair queuing system follows immediately as

$$\frac{r_i}{r_o} = \max\left(1, \frac{r_i}{f}\right) = \begin{cases} 1, & \text{if } r_i \le f \\ \dfrac{r_i}{f}, & \text{if } r_i > f \end{cases} \qquad (5)$$
where ri and ro are the input and output rates of the probes respectively. As opposed to the FCFS rate response curve (2), the available bandwidth cannot be derived from (5). Trivially, the available bandwidth AB is upper bounded by the fair share, i.e. 0 ≤ AB ≤ f. Without further assumptions the two extremal values can, however, be easily attained if f ≤ C/2. As an example consider a single contending flow with rate λ = C/2 or with rate λ = C. The fair share of a new greedy flow is f = C/2 in both cases, whereas the available bandwidth becomes AB = f in the first case and AB = 0 in the second. Referring to the classification of bandwidth estimation methods in Sect. 2 we conclude that iterative methods, which use the turning point of the rate response curve as bandwidth estimate, can be expected to report the fair share of a new greedy flow in case of a fair wireless link. For existing direct probing methods that inject probes with rate ri = C such a clear result cannot be established. Inserting ro = f into (3) computes neither the available bandwidth nor the fair share. We note, however, that direct probing with ri = C could easily report the fair share, since ro = f in this case.
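The fair share defined by (4) can be computed with a simple water-filling procedure. The sketch below is one possible way to solve (4), not a procedure taken from the paper; it reproduces the example of a single contending flow with λ = C/2 and λ = C, modelling the new greedy flow by an infinite offered rate. The function name and C = 28 Mbps are assumptions.

```python
import math


def fair_share(rates, capacity):
    """Solve sum_k min(r_k, f) = C for f, Eq. (4), by water-filling.

    Returns math.inf if the offered load does not exceed the capacity,
    i.e. if the link is not congested and no flow is limited.
    """
    if sum(rates) <= capacity:
        return math.inf
    remaining = capacity
    ordered = sorted(rates)
    for index, rate in enumerate(ordered):
        share = remaining / (len(ordered) - index)
        if rate >= share:
            return share      # this flow and all larger ones are capped at f
        remaining -= rate     # a flow below the share is served in full
    return math.inf           # not reached when sum(rates) > capacity


C = 28.0  # assumed capacity in Mbps
for lam in (C / 2, C):
    f = fair_share([lam, math.inf], C)  # math.inf models a new greedy flow
    ab = max(0.0, C - lam)
    print(lam, f, ab)
```

In both cases the new flow obtains f = C/2, while the available bandwidth is C/2 in the first case and 0 in the second, so the fair share alone does not determine the available bandwidth.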
4 Experimental Evaluation of Bandwidth Estimation
Equipped with the results from Sect. 3 we now investigate the performance of the bandwidth estimation tools listed in Tab. 1. If not mentioned otherwise we use the default configuration of the bandwidth estimation tools to perform the experiments. We evaluate the methods using a wireless testbed in a shielded, anechoic room. Hence, we act on the assumption that the physical medium is free of interference from external sources that do not belong to the testbed. We focus on the accuracy of wireless bandwidth estimates and show how these relate to the available bandwidth or to the fair share under different types of contending traffic. We do not report probing overhead, intrusiveness, or run and convergence times. These aspects are elaborated e.g. in [29].

As shown in Fig. 1 the testbed comprises five wireless stations (S1 to S5) that serve as traffic sources. The stations are connected to the access point (AP) using IEEE 802.11g with 54 Mbps (see footnote 2). The access point is connected over fast Ethernet at 100 Mbps to a station (R) that acts as receiver. The distance between the wireless stations and the access point was between 0.5 m and 1.5 m. We switched off RTS/CTS, automatic rate adaptation (which turned out to be unneeded in our scenario), as well as packet fragmentation. We used the DCF for medium access. In parallel to the wireless test network, all nodes are connected to a separate switched Ethernet, which is used as a control network.

4.1 Impact of the Intensity of Contending Traffic
In the first set of experiments we estimate the available bandwidth from S1 to R in the presence of a single contending flow. We increase the rate λ of the contending traffic that flows from S2 to R from 0 Mbps up to 28 Mbps in steps of 1 Mbps. The contending traffic consists of packets of 1500 Bytes and is generated using the D-ITG traffic generator [1]. All probe packets are set to 1500 Bytes. Fig. 4 shows the average of 25 available bandwidth estimates for each of the tools and all rates of the contending traffic as well as corresponding confidence intervals at a confidence level of 0.95. As a reference the available bandwidth AB = C − λ as well as the fair share of a new flow f = max{C − λ, C/2} are plotted, where we use C = 28 Mbps from Fig. 2 for a packet size of 1500 Bytes.

Footnote 2. We used Lenovo ThinkPad R61 notebooks with 1.6 GHz, 2 GB RAM running Ubuntu Linux 7.10 with kernel version 2.6.22. We employed the internal Intel PRO/Wireless 4965 AG IEEE 802.11g WLAN adapters. The access point is a Buffalo Wireless-G 125 series running DD-WRT [2] version 24 RC-4.

Fig. 4. Bandwidth estimates for a wireless link with one contending flow. Panels: (a) Pathload and DietTOPP, (b) Spruce and Pathchirp, (c) IGI/PTR, (d) WBest (available bandwidth and capacity estimates). Axes: bandwidth estimate [Mbps] versus contending traffic [Mbps]; the fair share and the available bandwidth are shown as reference curves.

Iterative probing: From our arguments in Sect. 3 we expect that the iterative probing methods Pathload, DietTOPP, Pathchirp, and PTR report an estimate of the fair share. As indicated in Fig. 4 the fair share and the available bandwidth are identical for contending traffic with rate λ ∈ [0 . . . 14] Mbps, whereas they differ for λ ∈ (14 . . . 28] Mbps. Fig. 4(a) shows that the estimates from Pathload (which reports an upper and a lower bound of the available bandwidth) and DietTOPP clearly confirm the fair queuing model in (5). Both methods closely
track the fair share and the reported estimates deviate noticeably from the available bandwidth as the rate of the contending traffic increases beyond 14 Mbps. The results from PTR in Fig. 4(c) and to a lesser extent from Pathchirp in Fig. 4(b) confirm this view. In case of Pathchirp we used the estimates provided after Pathchirp’s self-adapting phase. In our experimental results plotted in Fig. 4(b) it nevertheless underestimates the fair share and the estimates exhibit a comparably high variance.

Pathload has been reported to provide inaccurate bandwidth estimates for wireless networks in [9,19,21]. This stands in contrast to the experience with Pathload in wired networks. In [21] the probing packet size is mentioned as a possible reason for bandwidth underestimation, and [19] identifies the signature of one-way delays in wireless networks as a source of the problem. Having confirmed the strong impact of the packet size on the throughput in wireless networks, see Fig. 2, we modified Pathload so that we can specify the probing packet size. Using Pathload with a fixed packet size of 1500 Bytes improves bandwidth estimates significantly and leads to quite accurate and stable results as can be seen in Fig. 4(a). Similar problems have been reported for IGI/PTR [21], which can also be mitigated using a packet size of 1500 Bytes rather than the default size. The estimates in Fig. 4(c) are, however, less sensitive to the intensity of contending traffic, as also reported for cross-traffic in wired networks in [30].

Direct probing: Direct probing tools require a priori knowledge of the link capacity. We executed Spruce with a given capacity of C = 28 Mbps, which corresponds to the throughput for packets of 1500 Bytes size in Fig. 2. From our results shown in Fig. 4(b) we cannot detect a clear trend of the estimates towards either the fair share or the available bandwidth once λ exceeds 14 Mbps. WBest uses a two-step algorithm to estimate first the capacity and then the available bandwidth. In our measurements both estimates exhibit a comparably high variance as shown in Fig. 4(d). Moreover, the capacity estimates are sensitive to contending traffic, possibly a result of fair resource allocation, such that bandwidth estimates that are based thereon may be unreliable.

Fig. 5. Bandwidth estimates for a wireless link with several contending flows. Panels: (a) Pathload and DietTOPP, (b) PTR and Pathchirp, each with the fair share as reference. Axes: bandwidth estimate [Mbps] versus number of contending flows.

4.2 Impact of the Number of Contending Flows
In the second set of experiments we investigate the impact of the number of contending flows on bandwidth estimates. We use contending traffic with a total rate of 20 Mbps, which is divided evenly among one to four flows. Hence, the available bandwidth AB = 8 Mbps remains constant in all experiments, whereas the fair share of a new flow is (14, 9.3, 8, 8) Mbps for (1, 2, 3, 4) contending flows each offering a rate of (20, 10, 6.6, 5) Mbps, respectively. Since we only related the estimates of iterative probing methods to the fair share, we restrict the results shown here to iterative methods. Again all contending and probe packets have a fixed size of 1500 Bytes to achieve comparability. As in the experiments presented above, we ran each experiment 25 times and display the average of the available bandwidth estimates and the corresponding confidence intervals at a confidence level of 0.95 in Fig. 5. The estimates of the iterative methods Pathload, DietTOPP, PTR, and Pathchirp are closely related to the fair share and do not match the available bandwidth. As anticipated, these results confirm the model that is developed in Sect. 3. Without presenting results due to limited space, we state that the direct probing tools that were investigated tend to be more inaccurate and report neither the available bandwidth nor the fair share in these experiments.
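The fair share values quoted for this experiment can be checked by hand from (4). The worked arithmetic below assumes C = 28 Mbps and treats the probing flow as a greedy flow that is always limited to f; each line first states whether the contending flows exceed f and then solves (4).

```latex
% Eq. (4) with n contending flows of rate 20/n Mbps plus one greedy probing flow,
% C = 28 Mbps, all rates in Mbps
\begin{align*}
n = 1:&\quad 20 > f   &&\Rightarrow\ 2f = 28                     &&\Rightarrow\ f = 14 \\
n = 2:&\quad 10 > f   &&\Rightarrow\ 3f = 28                     &&\Rightarrow\ f \approx 9.3 \\
n = 3:&\quad 20/3 < f &&\Rightarrow\ 3 \cdot \tfrac{20}{3} + f = 28 &&\Rightarrow\ f = 8 \\
n = 4:&\quad 5 < f    &&\Rightarrow\ 4 \cdot 5 + f = 28          &&\Rightarrow\ f = 8
\end{align*}
```

For three and four contending flows the individual flows stay below the fair share, so the new flow obtains exactly the unused capacity and the fair share coincides with the available bandwidth of 8 Mbps.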
5 Conclusions
We conducted an extensive measurement study of wireless bandwidth estimation in IEEE 802.11g WLAN testbeds. In contrast to wired links bandwidth
estimates for wireless channels depend largely on the choice of packet sizes. We adapted the examined tools accordingly. We found that the FCFS assumption common in bandwidth estimation does not apply in case of wireless channels with contending traffic, where the distributed coordination function seeks to achieve a fair bandwidth allocation. We showed that the estimates of known iterative measurement methods can be related to the fair share of a new flow, which may deviate significantly from the available bandwidth. Our measurement results confirm this relation. A similar result was not established for direct probing. As opposed to previous studies, our measurement results indicate that the methods that have been specifically targeted at wireless channels are not superior to previously known methods that have been developed for wired networks.
Acknowledgements

We would like to thank M. Hollick, R. Steinmetz, and R. Jakoby for providing the equipment and facilities that made this research possible. This work has been funded by an Emmy Noether grant of the German Research Foundation.
References

1. D-ITG: Distributed Internet Traffic Generator, http://www.grid.unina.it/software/ITG
2. dd-wrt: Linux wireless router, http://www.dd-wrt.com
3. Emulab: Network emulation testbed, http://www.emulab.org
4. ns-2: Network simulator, http://www.isi.edu/nsnam/ns
5. Rude/Crude: Real-time UDP Data Emitter/Collector, http://rude.sourceforge.net
6. Amambra, A., Hou, K.M., Chanet, J.-P.: Evaluation of the performance of the SLoPS: Available bandwidth estimation technique in IEEE 802.11b wireless networks. In: Proc. of IFIP NTMS, May 2007, pp. 123–132 (2007)
7. Berger-Sabbatel, G., Duda, A., Heusse, M., Rousseau, F.: Short-term fairness of 802.11 networks with several hosts. In: Proc. of IFIP MWCN, October 2004, pp. 263–274 (2004)
8. Bianchi, G.: Performance analysis of the IEEE 802.11 distributed coordination function. IEEE JSAC 18(3), 535–547 (2000)
9. Botta, A., Pescape, A., Ventre, G.: Improving accuracy in available bandwidth estimation for IEEE 802.11-based ad hoc networks. In: Proc. of IEEE Systems Communications, August 2005, pp. 287–292 (2005)
10. Hu, N., Steenkiste, P.: Evaluation and characterization of available bandwidth probing techniques. IEEE JSAC 21(6), 879–894 (2003)
11. Jacobson, V.: Congestion avoidance and control. In: Proc. of ACM SIGCOMM, August 1988, pp. 273–288 (1988)
12. Jain, M., Dovrolis, C.: End-to-end available bandwidth: measurement methodology, dynamics, and relation with tcp throughput. IEEE/ACM TON 11(4), 537–549 (2003)
13. Jain, M., Dovrolis, C.: Ten fallacies and pitfalls on end-to-end available bandwidth estimation. In: Proc. of ACM IMC, October 2004, pp. 272–277 (2004)
14. Johnsson, A., Melander, B., Björkman, M.: Diettopp: A first implementation and evaluation of a simplified bandwidth measurement method. In: Proc. of SNCNW (November 2004)
15. Johnsson, A., Melander, B., Björkman, M.: Bandwidth measurement in wireless networks. In: Proc. of Med-Hoc-Net (June 2005)
16. Johnsson, A., Melander, B., Björkman, M.: An analysis of active end-to-end bandwidth measurements in wireless networks. In: Proc. of IEEE/IFIP E2EMON, April 2006, pp. 74–81 (2006)
17. Keshav, S.: A control-theoretic approach to flow control. In: Proc. of ACM SIGCOMM, September 1991, pp. 3–15 (1991)
18. Koksal, C.E., Kassab, H., Balakrishnan, H.: An analysis of short-term fairness in wireless media access protocols. In: Proc. of ACM SIGMETRICS, June 2000, pp. 118–119 (2000)
19. Lakshminarayanan, K., Padmanabhan, V.N., Padhye, J.: Bandwidth estimation in broadband access networks. In: Proc. of ACM IMC, October 2004, pp. 314–321 (2004)
20. Lee, H.K., Hall, V., Yum, K.H., Kim, K.I., Kim, E.J.: Bandwidth estimation in wireless LANs for multimedia streaming services. In: Proc. of IEEE ICME, July 2006, pp. 1181–1184 (2006)
21. Li, M., Claypool, M., Kinicki, R.: WBest: A bandwidth estimation tool for multimedia streaming application over IEEE 802.11 wireless networks. Technical Report WPI-CS-TR-06-14, Computer Science Department, Worcester Polytechnic Institute (March 2006)
22. Liebeherr, J., Fidler, M., Valaee, S.: A min-plus system interpretation of bandwidth estimation. In: Proc. of IEEE INFOCOM, May 2007, pp. 1127–1135 (2007)
23. Liu, X., Ravindran, K., Loguinov, D.: A queuing-theoretic foundation of available bandwidth estimation: Single-hop analysis. IEEE/ACM Trans. Networking 15(4), 918–931 (2007)
24. Liu, X., Ravindran, K., Loguinov, D.: A stochastic foundation of available bandwidth estimation: Multi-hop analysis. IEEE/ACM Trans. Networking 16(2) (2008)
25. Machiraju, S., Veitch, D., Baccelli, F., Bolot, J.: Adding definition to active probing. ACM SIGCOMM Computer Communication Review 37(2), 19–28 (2007)
26. Melander, B., Björkman, M., Gunningberg, P.: Regression-based available bandwidth measurements. In: Proc. of SPECTS (July 2002)
27. Ribeiro, V.J., Riedi, R.H., Baraniuk, R.G., Navratil, J., Cottrell, L.: Pathchirp: Efficient available bandwidth estimation for network paths. In: Proc. of PAM (April 2003)
28. Sarr, C., Chaudet, C., Chelius, G., Lassous, I.G.: Improving accuracy in available bandwidth estimation for IEEE 802.11-based ad hoc networks. In: Proc. of IEEE MASS, October 2006, pp. 517–520 (2006)
29. Shriram, A., Kaur, J.: Empirical evaluation of techniques for measuring available bandwidth. In: INFOCOM 2007. 26th IEEE International Conference on Computer Communications, May 2007, pp. 2162–2170. IEEE, Los Alamitos (2007)
30. Strauss, J., Katabi, D., Kaashoek, F.: A measurement study of available bandwidth estimation tools. In: Proc. of ACM IMC, October 2003, pp. 39–41 (2003)
31. White, B., Lepreau, J., Stoller, L., Ricci, R., Guruprasad, S., Newbold, M., Hibler, M., Barb, C., Joglekar, A.: An integrated experimental environment for distributed systems and networks. In: Proc. USENIX OSDI, December 2002, pp. 255–270 (2002)
Hierarchical Logical Topology in WDM Ring Networks with Limited ADMs

Tomoya Kitani (1), Nobuo Funabiki (2), Hirozumi Yamaguchi (3), and Teruo Higashino (3)

(1) Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan
(2) Graduate School of Natural Science and Technology, Okayama University, Okayama, Japan
(3) Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
[email protected], [email protected], {h-yamagu,higashino}@ist.osaka-u.ac.jp
Abstract. This paper presents a method for hierarchically constructing a logical topology in multi-hop WDM networks. Our logical topology can be built over WDM ring networks with only three ADMs (Add-Drop Multiplexers) and transceivers per node. We aim to realize low-cost WDM networks that can be deployed for MANs and LANs. Due to the well-constructed hierarchical topology, both the maximum distance between two nodes and the number of required wavelengths on each optical fiber are reduced to the logarithmic order of the number of nodes. This helps to reduce traffic costs in WDM networks. Furthermore, we propose a node labeling scheme to achieve efficient routing without routing table lookup. From our numerical experiments, we have shown that we could construct a logical topology consisting of 1,000 nodes with a very small diameter and 16 wavelengths in a WDM ring network.

Keywords: WDM ring networks, logical topology design algorithm, hierarchical topology, compact routing algorithm.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 326–337, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

Wavelength Division Multiplexing (WDM) technology enables high-speed and high-volume data transmission over fiber-optic networks by multiplexing several optical carrier signals using different wavelengths. Because of its significance for very high-speed communication infrastructure, the WDM methodology and the related issues have been widely investigated so far. In particular, since multi-hop WDM networks are required to assign lightpaths between two nodes and any data is carried over those lightpaths, the quality of lightpath design obviously has great impact on the network performance such as the volume of traffic that can be accommodated without causing congestion. Therefore, many algorithms for designing logical topologies over multi-hop WDM networks have been considered [1,2]. The existing methodologies on logical topology design over WDM networks have focused on minimizing the number of required wavelengths [3,4,5,6], the number of
required ADMs (Add-Drop Multiplexers) and transceivers [5,6,7], the blocking probability [8], the number of required paths [9], the total amount of relayed traffic [10,11], the number of required wavelength converters [3,12] and so on, for given traffic matrices. It is well known that the number of ports (ADMs and transceivers) per WDM node has a big impact on hardware costs. In order to save network resources, minimizing the number of required wavelengths is also desirable. Although some methods try to reduce the total number of ports for cost-efficient design of WDM networks [13], it is desirable to consider WDM networks with a small number of ports per node since MANs and LANs may employ a large number of low-cost nodes. Thus, traffic grooming methods to minimize the number of ADMs and wavelengths have been investigated so far. In our previous work [14], we have proposed a logical topology called Hierarchical Chordal Ring Network (HCRN). In HCRN, each node has only three ports (ADMs and transceivers) and the network diameter and the number of required wavelengths are proportional to $\sqrt{N}$ (N denotes the number of nodes). Arora et al. have proposed a method to configure a logical topology for WDM ring networks such as a prevalent MAN [15]. This topology has a network diameter and a number of required wavelengths that are proportional to the logarithmic order of N by using lightpaths with O(2N ) lengths (physical hop-counts). However, this method may require each node to use more ports in the logical topology whereas HCRN uses only three ports. In this paper, we propose a method of designing a logical topology where the network diameter and the number of required wavelengths are of the logarithmic order of N on WDM ring networks constituted by nodes with only three ports. We also propose a multi-hop routing method over this logical network. This topology is configured by clustering adjacent nodes in a similar way to HCRN and Arora’s method.
Because of its regularity, this topology can accommodate most traffic patterns and their variation. Although logical topologies may sometimes be reconfigured according to changes in traffic patterns, frequent reconfiguration cannot be applied in practice due to its overhead. Therefore, it is important to construct a topology that adapts to most traffic patterns. Furthermore, exploiting the regularity and hierarchy of the logical topology, we propose a node labeling scheme that achieves effective routing on the logical topology without routing table lookup. The numerical results show that logical topologies of 1,000 nodes can be constructed with a very small diameter and at most 16 wavelengths per fiber-optic cable, and that the method outperforms both the methods of Ref. [16], which generate dynamic topologies, and the Chordal Ring Network topology.
2 Related Work

Reference [16] is one of the earliest works on logical topology design over optical networks, and provides several heuristic algorithms such as HLDA (Heuristic Logical topology Design Algorithm) and TILDA (Traffic Independent Logical topology Design Algorithm). Later we compare our method with them to assess its efficiency. Reference [6] deals with optical ring networks, and gives a lower bound analysis of wavelength and ADM usage in WDM networks. The results
T. Kitani et al.
indicate how close our results come to the optimum; our upper bound analysis shows that we come close to the lower bound in most cases. Reference [5] introduces several logical topologies in dynamic lightpath assignment, minimizing the number of transceivers. References [10,11] also assume SONET ring networks, minimizing the relay traffic on logical topologies by ILP formulation. In particular, [10] considers the upper and lower bounds of the minimized relay traffic for cost-efficient design. The recent work [17] improves existing heuristic algorithms by the "rollout" technique.

We have proposed the logical topology called Hierarchical Chordal Ring Network (HCRN) [14]. HCRN is inspired by the Chordal Ring Network topology (CRN) [18], where each node has only three ports (ADMs and transceivers). HCRN achieves a smaller network diameter and a smaller number of required wavelengths than the original CRN by clustering adjacent nodes and configuring internal and inter-cluster lightpaths. However, in both CRN and HCRN, the network diameter and the number of required wavelengths are of the order of $\sqrt{N}$ ($N$ is the number of nodes). Arora et al. have proposed a method to configure a logical topology for WDM ring networks [15]. In that topology, the network diameter and the number of required wavelengths are of the logarithmic order of N. The concept of configuring lightpaths in the method is similar to the distributed hash table (DHT) algorithm named "Chord" for peer-to-peer networks [19]. Both methods use lightpaths/connections whose lengths (physical hop-counts) are powers of two. However, they may require each node to use more ports in the logical topology, whereas CRN uses only three ports per node.

We note that recent studies have dealt with wavelength assignment for high-end, brand-new architectures such as WDM networks with wavelength converters [3] and ROADM (Reconfigurable Optical Add-Drop Multiplexer) with modulators [20].
Others aim at attaining non-blocking rearrangement [4], achieving fairness in wavelength assignments [12] and so on. In contrast to those studies, we revisit an important topic: logical path design and routing over WDM networks with limited equipment capabilities. Even though several heuristic solutions have been presented over the last decade, we believe that this problem has not been solved completely and that our solution is significant for the efficient usage of low-cost WDM networks.
3 Target WDM Network Model

In this section, we describe the architecture of WDM networks in order to explain the problem dealt with in this paper. An optical network is modeled as a ring graph $G_{phy} = (V, E_{phy})$ where $V = \{0, 1, \ldots, N-1\}$ is a set of nodes and $E_{phy} \subseteq V \times V$ is a set of fiber-optic cables called physical links or simply links. Each node is equipped with OADMs (Optical Add-Drop Multiplexers) and transceivers. For data transmission between two nodes, their transceivers must be tuned to a common wavelength in advance. The resulting logical path is called a lightpath, and the two nodes are called its terminal nodes. Since each lightpath needs a port (transceiver) at both of its terminal nodes, the number of input and output ports at each node is greater than the number of lightpaths terminated at the node. The number of ports is called the degree bound.

We show the architecture of WDM nodes in Fig. 1. For a lightpath, each non-terminal node on the lightpath acts as an optical repeater, where the optical signal is passed through from an ingress port to the corresponding egress port. Routing, in contrast, is done by the terminal nodes through conversion between optical and electrical signals. We assume the same number of ADMs and transceivers for every node (two ADMs and three transceivers per node in Fig. 1), and thus the maximum number of wavelengths on each physical link is identical and is denoted by W. For the same reason, we also assume that the degree bound of all nodes and the capacity of all lightpaths are identical; they are denoted by D and C, respectively.
Fig. 1. Architecture of WDM Node (two optical add-drop multiplexers on the fiber carrying wavelengths λ1, …, λw, three transceivers, and an electric router)
4 Proposed Hierarchical Network Topology

Here, we explain the proposed hierarchical logical topology. We assume a physical ring network $G = (V, E_{phy})$ where $V = \{0, 1, \ldots, N-1\}$ and $E_{phy} = \{(l, l \oplus 1) \mid l \in V\}$ is a set of bi-directional physical links ($i \oplus j \equiv (i + j) \bmod N$).

4.1 Clustering of Nodes

In the given physical ring network, a sequence of consecutive nodes on the ring is called a cluster. We start with a single cluster containing the whole ring, [0, ..., N − 1]. As shown in Fig. 2, for each cluster we establish a lightpath, called a chord lightpath, between the two end nodes of the cluster, and create two same-sized sub-clusters from the nodes of the cluster other than the two end nodes. Then we establish a lightpath between an end node of an i-th level cluster and its physically adjacent node, which is an end node of the (i + 1)-th level sub-cluster. For each sub-cluster, we repeat the same procedure until the cluster size is less than or equal to four. The resulting hierarchical network topology is called HLT (Hierarchical Logical Topology). As shown in Fig. 3(a), all nodes on this HLT can be clustered hierarchically.

Fig. 2. Clustering of nodes (chord lightpaths of the i-th layer and of the (i+1)-th layer)

Fig. 3. Node labeling and routing in a cluster of HLT: (a) Hierarchical Construction of "Chords"; (b) Hierarchical Node Labeling (links on the ring and chord links)

Then, for routing purposes, we give node labels according to the policy below (see Fig. 3(b)). We assign the node labels 0 and 1 to the end nodes of the 1st-level (the highest level) cluster. If the end nodes of an i-th level cluster have the node labels X0 and X1, we assign the node labels X00, X01, X11 and X10 to the end nodes of the (i + 1)-th level clusters. Under this labeling policy, the nodes in clusters of the same level have node labels based on Gray codes, where the labels of any two nodes that are directly connected by a lightpath differ in only one digit (Fig. 3). Also, if a node p of an (i + 1)-th level cluster is connected by a lightpath with a node of an i-th level cluster with node label Y, the node label of p is Y0. These codes are called variable length Gray codes.
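The labeling policy can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `ring_labels` and `hlt_labels` are hypothetical, the ring is assumed to have exactly N = 2^(H+1) − 2 nodes so that every cluster splits evenly, and the stopping rule for small clusters and other values of N is not handled.

```python
def ring_labels(first, last, depth):
    """Labels of one cluster in physical ring order, end nodes included.

    `first` and `last` are the labels X0 and X1 of the cluster's end nodes;
    `depth` is the number of further levels below this cluster (assumption:
    a full hierarchy, so the innermost cluster consists of its two end nodes).
    """
    if depth == 0:
        return [first, last]
    # End nodes of the two sub-clusters are X00, X01 and X11, X10.
    left = ring_labels(first + "0", first + "1", depth - 1)
    right = ring_labels(last + "1", last + "0", depth - 1)
    return [first] + left + right + [last]


def hlt_labels(levels):
    """Labels of all nodes of a full HLT with `levels` hierarchy levels."""
    return ring_labels("0", "1", levels - 1)


def one_step(a, b):
    """True if a and b differ in one digit, or one is the other plus a trailing 0."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    return long_ == short + "0"


if __name__ == "__main__":
    print(hlt_labels(2))  # ['0', '00', '01', '11', '10', '1']
```

In this sketch, physically adjacent nodes on the ring either carry labels that differ in one digit or are a parent Y and its child Y0, which matches the two properties stated above.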
In this hierarchical logical topology HLT, the network diameter (the maximum distance (hop count) between two nodes in the network) is $O(\log N)$. One may think that this is similar to the Chordal ring network topology [18]; however, the network diameter of the latter is $O(\sqrt{N})$. Thus, the proposed hierarchical topology HLT can attain smaller network diameters than the Chordal Ring Network topology in large-scale networks. As shown in Fig. 3, all the chord lightpaths established in the clusters of the same level can use a common wavelength because they never share physical links. Therefore, in total, we only need H wavelengths, where H is the number of hierarchy levels. Since $N \le 2^{H+1} - 2$, the number of required wavelengths is obtained as follows:

$$H = \lceil \log_2 (N + 2) \rceil - 1 \simeq \log_2 N \qquad (1)$$
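As a small worked instance of Eq. (1), the following Python sketch computes the smallest H satisfying N ≤ 2^(H+1) − 2. The function name `hierarchy_levels` is hypothetical, and the integer `bit_length` trick is simply one way to evaluate the ceiling of a base-2 logarithm without floating-point rounding.

```python
def hierarchy_levels(n):
    """Smallest H with n <= 2**(H + 1) - 2, i.e. ceil(log2(n + 2)) - 1."""
    if n < 2:
        raise ValueError("the sketch assumes at least two nodes")
    # For an integer m >= 1, ceil(log2(m)) equals (m - 1).bit_length().
    return (n + 1).bit_length() - 1


if __name__ == "__main__":
    for n in (6, 14, 100, 1000):
        print(n, hierarchy_levels(n))
```

For example, N = 6 gives H = 2 and N = 14 gives H = 3, the sizes of full two-level and three-level hierarchies.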
We note that if we use a physical ring network topology, the wavelength for the 1st level can be the same as that used for the second level, which saves one wavelength. For example, in Fig. 3, the wavelength assigned to the chord lightpath between nodes 0 and 1 can be the same as that used for the two chord lightpaths between 00 and 01 and between 11 and 10, since there is a direct physical link between nodes 0 and 1. As we have seen above, the most significant feature of HLT is that the network diameter and the number of required wavelengths are kept to the logarithmic order of N.

4.2 Routing on Proposed Hierarchical Logical Topology

Here, we consider routing on the proposed network topology. For routing on the hierarchical topology HLT, as shown in Fig. 3(b), the proposed variable length Gray codes can be used for deciding the route from a source node to its destination node. If nodes 001 and 1001 are the source node and the destination node, respectively, the path 001, 000, 00, 0, 1, 10, 100, 1000, 1001 is one of the shortest routes. Basically, each packet is relayed from the source node to a node on an upper level, and then it is forwarded from that node to the destination node. This is the basic routing policy. There is an exception. If nodes 00 and 111 are the source node and the destination node, the path 00, 01, 11, 110, 111 is the shortest route, and the route through 00, 0, 1, 10 is longer. The lightpath from 01 to 11 shortens the route.

For each node S with node label $s_1, s_2, \ldots, s_k$, there are four routing actions:

– UP denotes the action which forwards the received packet to its parent node $S_{UP} = s_1, s_2, \ldots, s_{k-1}$. UP can be used only for $s_k = 0$.
– DW denotes the action which forwards the received packet to its child node $S_{DW} = s_1, s_2, \ldots, s_k, 0$ ($s_{k+1} = 0$).
– BR denotes the action which forwards the received packet to its sibling node $S_{BR} = s_1, s_2, \ldots, s_{k-1}, s_k \oplus 1$ through the chord lightpath (or normal lightpath) between the two nodes. Here, $x \oplus 1 = 1$ if $x = 0$, and $x \oplus 1 = 0$ if $x = 1$.
– CO denotes the action which forwards the received packet to its cousin node $S_{CO} = s_1, s_2, \ldots, s_{k-1} \oplus 1, s_k$ through the lightpath between the two nodes. CO can be used only for $s_k = 1$.

Hereafter, Forward(S, UP), Forward(S, DW), Forward(S, BR) and Forward(S, CO) denote the forwarded nodes $S_{UP}$, $S_{DW}$, $S_{BR}$ and $S_{CO}$, respectively. Let $S = (s_1, \ldots, s_i, s_{i+1}, \ldots, s_k)$ and $D = (s_1, \ldots, s_i, d_{i+1}, \ldots, d_h)$ denote the variable length Gray codes for nodes S and D. Here, we assume that the first i bits of the variable length Gray codes of nodes S and D are the same and that $s_{i+1} \neq d_{i+1}$, that is, i is the length of their longest common prefix. The following algorithm R-HLT(S,D) calculates the routing action of node S (here, "NIL" denotes no action).
[Routing Algorithm R-HLT]:
if (S = D) then "NIL"
else if ((k − i > 2) ∨ (h = i)) then
    if (s_k = 0) then "UP" else "BR"
else if (k − i = 2) then
    case (s_{k−1}, s_k) of
        (0, 0): if (k > h) then "UP" else "BR"
        (0, 1): if (d_{k−1} = 0) then "BR" else "CO"
        (1, 0): if (d_{k−1} = 0) then "CO" else "BR"
        (1, 1): if (k > h) then "UP" else "BR"
    end
else if (k − i = 1) then "BR"
else (* k = i *) "DW"

If node S is a source node or relay node, the above routing algorithm R-HLT(S,D) decides the routing action of node S when S must forward its received packets to node D. If the source node S wants to determine the exact routing path and its length from node S to its destination node D, we can calculate them by repeatedly applying R-HLT: $S_0 = S$, $S_1$ = Forward($S_0$, R-HLT($S_0$, D)), $S_2$ = Forward($S_1$, R-HLT($S_1$, D)), …, $S_q$ = D = Forward($S_{q-1}$, R-HLT($S_{q-1}$, D)), until we reach node D as $S_q$. This sequence is the exact routing path from S to D and its path length is q. This procedure can be used when we use source routing.
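The four forwarding actions and the repeated application of R-HLT can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the names `forward`, `r_hlt` and `route` are hypothetical, labels are plain strings of 0s and 1s, and a full hierarchy is assumed so that every forwarded label exists. One further assumption should be read carefully: because UP is defined only for labels ending in 0 and CO only for labels ending in 1, the sketch exchanges the bodies of the (1, 0) and (1, 1) cases relative to the listing as printed. That exchange is our reading, not something the text states.

```python
def forward(s, action):
    """Return the label reached from s by one routing action."""
    flip = {"0": "1", "1": "0"}
    if action == "UP":   # parent: drop the last digit (only valid if it is 0)
        return s[:-1]
    if action == "DW":   # child: append 0
        return s + "0"
    if action == "BR":   # sibling: flip the last digit
        return s[:-1] + flip[s[-1]]
    if action == "CO":   # cousin: flip the second-to-last digit
        return s[:-2] + flip[s[-2]] + s[-1]
    raise ValueError(action)


def r_hlt(s, d):
    """Routing action at node s for destination d (labels as bit strings)."""
    if s == d:
        return "NIL"
    k, h = len(s), len(d)
    i = 0  # length of the longest common prefix
    while i < min(k, h) and s[i] == d[i]:
        i += 1
    if k - i > 2 or h == i:
        return "UP" if s[-1] == "0" else "BR"
    if k - i == 2:
        tail = s[-2:]
        if tail == "00":
            return "UP" if k > h else "BR"
        if tail == "01":
            return "BR" if d[k - 2] == "0" else "CO"
        if tail == "11":  # assumption: body printed under (1, 0)
            return "CO" if d[k - 2] == "0" else "BR"
        return "UP" if k > h else "BR"  # tail "10"; assumption: body printed under (1, 1)
    if k - i == 1:
        return "BR"
    return "DW"  # k == i: s is a proper prefix of d


def route(s, d, max_hops=1000):
    """Path S0 = s, S1, ..., Sq = d obtained by repeating R-HLT."""
    path = [s]
    while path[-1] != d:
        if len(path) > max_hops:
            raise RuntimeError("no route found within max_hops")
        path.append(forward(path[-1], r_hlt(path[-1], d)))
    return path


if __name__ == "__main__":
    print(route("00", "111"))  # ['00', '01', '11', '110', '111']
    print(route("001", "1001"))
```

Under these assumptions, the route from 00 to 111 reproduces the exception path given in the text. The route from 001 to 1001 has 8 hops, the same length as the route quoted in the text, although the sketch passes through 01 and 11 rather than through 0 and 1.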
5 Performance Evaluation and Numerical Results

5.1 Comparative Models

Here, we explain existing logical topologies which can be applied to WDM networks with node degree 3. A Chordal ring network has a ring topology with N nodes (N must be even) whose node IDs are 0 to N − 1. For each node with node ID k, one chord is connected to either node k + L or node k − L, where L denotes a constant. In Ref. [18], it is shown that if $\sqrt{N} + 3 \le N/2$ holds, the network diameter is minimized when the odd number closest to $\sqrt{N} + 3$ is chosen as the constant L. If $\sqrt{N} + 3 \le N/2$ does not hold, we assign the maximum odd number which is less than or equal to $N/2$ as the constant L. In this situation, the network diameter and the number of required wavelengths of CRN are both proportional to the square root of the number N of nodes. On the other hand, as explained above, the proposed HLT topology only requires a network diameter and a number of wavelengths which are proportional to the logarithm of N. Figure 4 compares the network diameter, the average hop counts and the number of required wavelengths of HLT and CRN networks. Note that, as shown in Fig. 4(b), CRN needs twice the number of wavelengths in order to construct itself when N cannot be divided by the chord length L.

Fig. 4. Comparison between HLT and CRN: (a) Network diameter and average hop count (hop count versus number of nodes, 0 to 2000); (b) Number of required wavelengths (number of wavelengths versus number of nodes, 0 to 2000)

There are also some dynamic topology construction methods. For example, in Ref. [16], HLDA (Heuristic Logical topology Design Algorithm) and TILDA (Traffic Independent Logical topology Design Algorithm) are proposed, and they are often used for performance comparison. In HLDA, in order to minimize the total amount of traffic, node pairs with heavy traffic are given priority in using the available optical wavelengths. In HLDA, the following algorithm is used.

Algorithm HLDA
Step 0. An optical link is established between every pair of physically adjacent nodes so that all nodes can communicate with each other.
Step 1. Estimate the traffic amount $P_{ij}$ for every link (i, j).
Step 2. Find the link (i, j) whose $P_{ij}$ is maximum. If there is a usable wavelength on the link (i, j), we assign one wavelength to link (i, j). Then, we set $P_{ij}$ to 0.
Step 3. If all $P_{ij}$ are 0, go to Step 4. Otherwise, return to Step 2.
Step 4. If we still have usable wavelengths, we assign those wavelengths to possible links randomly. We repeat this process until we cannot assign new wavelengths.

On the other hand, TILDA does not consider traffic amounts, and gives priority to physically closer links when assigning usable wavelengths. In TILDA, the following algorithm is used.

Algorithm TILDA
Step 0. An optical link is established between every pair of physically adjacent nodes so that all nodes can communicate with each other.
Step 1. Set i = 2.
Step 2. If a node has usable ADMs and if there is a usable wavelength on the path from the node to another node at i hop distance, we assign one wavelength to all links on the path. If there is no such path for any two nodes at i hop distance, we set i = i + 1 and repeat the same process.
5.2 Performance Evaluation

We have evaluated the performance of the proposed HLT, HLDA and TILDA by simulation. We have used three types of traffic matrices: the random model, the server-client model and the small-world model. In our analyses, we have changed the number N of nodes from 50 to 1,000 in steps of 50 nodes, and carried out one numerical analysis for each of the three types of traffic matrices. Here, we assume that HLDA can use the same number of wavelengths as the proposed method. Uniform random numbers between 0 and 1 are given as the traffic matrix of the random model. In the server-client model, 5% of the nodes are selected as servers, and we send 10 times as much traffic to those server nodes. In the small-world model, we first construct a traffic matrix based on the server-client model, and then we increase the traffic among neighboring nodes by a factor of 10 [21]. We also make the traffic between two nodes decrease in proportion to the distance between them. For the given traffic matrices, we have evaluated the network diameters, the average number of hops, and the overall traffic volume. The traffic volume, denoted by Q hereafter, is defined as follows:

$$Q = \sum_{u} \sum_{v} T^{uv} \cdot h^{uv} \qquad (2)$$
where $h^{uv}$ and $T^{uv}$ denote the number of lightpaths and the traffic on the route from u to v, respectively. We show our numerical results for the above three models in Fig. 5. Figures 5(a)–(c) show the network diameters and average hop counts. Figure 5(d) shows the overall traffic volumes Q, where all the traffic volumes are normalized so that the traffic volume of HLDA is represented as 1. For all the traffic models, if the number of nodes exceeds 100, the proposed logical topology HLT is better than the other topologies. The reason is that when the number of nodes is small, relatively more wavelengths are available, so HLDA can assign a sufficient number of logical links between nodes with heavy traffic. When the number of nodes becomes large, relatively fewer wavelengths are available, and HLDA might not be able to assign a sufficient number of logical links for that purpose.

As for the network diameters, in the proposed hierarchical topology HLT they are stable regardless of traffic fluctuation and deviation. On the other hand, HLDA gives priority to assigning logical links to node pairs with heavy traffic. This increases the hop counts for node pairs with light traffic. Traffic fluctuation and deviation also have a large influence on the network diameters.

Next, we focus on the average hop counts and total traffic volumes for the random model and the server-client model. In HLDA, the average hop counts are small for the random model since optical links are assigned uniformly. In the server-client model, the total traffic volumes become relatively small since HLDA gives priority to assigning logical links to node pairs with heavy traffic. Since the proposed topology is regular, it offers almost fixed performance regardless of traffic models and volumes. It can also guarantee an upper limit on the network diameter theoretically.
In general, the random model and the server-client model show a similar tendency for the average hop counts.
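The traffic volume Q of Eq. (2) can be computed for a small logical topology with the Python sketch below. It is an illustration only: the names `hop_counts` and `traffic_volume` are hypothetical, the logical topology is given as an adjacency dictionary, and the route from u to v is assumed to be a minimum-hop route found by breadth-first search, which need not coincide with the routing used in the simulations.

```python
from collections import deque


def hop_counts(adj, src):
    """Breadth-first search: number of lightpaths (hops) from src to each node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def traffic_volume(adj, traffic):
    """Q = sum over ordered pairs (u, v) of traffic[u][v] * hops(u, v)."""
    total = 0.0
    for u in adj:
        dist = hop_counts(adj, u)
        for v in adj:
            if u != v:
                total += traffic[u][v] * dist[v]
    return total


if __name__ == "__main__":
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    chorded = {0: [1, 3, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 0]}
    ones = [[1.0] * 4 for _ in range(4)]
    print(traffic_volume(ring, ones), traffic_volume(chorded, ones))  # 16.0 14.0
```

In this toy example, adding one chord between nodes 0 and 2 of a four-node ring lowers Q from 16 to 14 under unit traffic, which is the kind of effect the normalized ratios in Fig. 5(d) summarize.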
Fig. 5. Network diameter, average hop counts and traffic amounts in HLT, HLDA and TILDA: (a) Net. dia. and ave. hop counts on Random model; (b) Net. dia. and ave. hop counts on Server-Client model; (c) Net. dia. and ave. hop counts on "Small-world" model; (d) Ratio of network traffic amounts in HLT, HLDA and TILDA (HLT = 1)
Lastly, we focus on the average hop counts and total traffic volumes for the small-world model. HLDA assigns many optical links to neighboring node pairs since we assume relatively heavy traffic between neighboring node pairs. It then becomes difficult to assign optical links to node pairs that are far apart. On the other hand, the proposed topology can assign hierarchical chords to node pairs that are far apart, while neighboring node pairs can communicate with each other using the links on the original ring topology. For this reason, our proposed topology HLT and HLDA show rather different tendencies. In Fig. 5(d), if the number of nodes is more than 400, the total traffic volume of the proposed topology HLT is less than 60% of that of HLDA. We can observe that our topology saves wavelengths because it distributes lightpaths uniformly over the physical links. As shown in Fig. 5(d), for a logical topology with 1,000 nodes, an HLT topology only requires 16 wavelengths. Meanwhile, in HLDA and TILDA, it is likely that some lightpaths are first established over some specific physical links due to degree limitations. In that case, no further lightpaths can be accepted over such a link once no wavelength is available, and the other lightpaths must detour around the link. The resulting topologies may therefore have long routes that require many hops and make the diameters larger.
6 Conclusion

In this paper, we propose a design method for a hierarchical logical topology in multi-hop WDM networks. Our logical topology can be built over WDM networks with limited ADMs. The method can be applied to both ring-type and mesh-type WDM networks. In the proposed logical topology, both the maximum distance between two nodes and the required number of wavelengths on each physical optical link are proportional to the logarithm of the number N of nodes. From theoretical and numerical points of view, we have compared the proposed logical topology with the general Chordal ring network topology and with network topologies based on HLDA (Heuristic Logical topology Design Algorithm). The proposed logical topology has a smaller average hop count, network diameter and total traffic amount. As the number N of nodes on a given WDM network becomes large, our logical topology becomes more efficient. It is shown that the proposed logical topology has sufficient scalability for large-scale WDM networks. We have also shown that the network performance can be further improved by combining HLDA with our logical topology. Investigating detailed network performance on real WDM networks is part of our future work.
References

1. Dutta, R., Rouskas, G.N.: A survey of virtual topology design algorithms for wavelength routed optical networks. Optical Networks Magazine 1, 73–89 (2000)
2. Zang, H., Jue, J.P., Mukherjee, B.: A Review of Routing and Wavelength Assignment Approaches for Wavelength-Routed Optical WDM Networks. Optical Networks Magazine 1(1), 47–60 (2000)
3. Chen, L.-W., Modiano, E.: Efficient Routing and Wavelength Assignment for Reconfigurable WDM Ring Networks With Wavelength Converters. IEEE/ACM Trans. on Networking 13(1), 173–186 (2005)
4. Saengudomlert, P., Modiano, E.H., Gallager, R.G.: Dynamic Wavelength Assignment for WDM All-Optical Tree Networks. IEEE/ACM Trans. on Networking 13(4), 895–905 (2005)
5. Gerstel, O., Ramaswami, R., Sasaki, G.H.: Cost-Effective Traffic Grooming in WDM Rings. IEEE/ACM Trans. on Networking 8(5), 618–630 (2000)
6. Billah, A.R.B., Wang, B., Awwal, A.A.S.: Effective Traffic Grooming in WDM Rings. In: Proc. of the 45th IEEE Global Telecomm. Conf. (GLOBECOM 2002), vol. 3, pp. 2726–2730 (2002)
7. Janardhanan, S., Mahanti, A., Saha, D., Sadhukhan, S.K.: A Routing and Wavelength Assignment (RWA) Technique to Minimize the Number of SONET ADMs in WDM Rings. In: Proc. of the 39th Annual Hawaii Int'l Conf. on Sys. Sci. (HICSS 2006), vol. 2 (2006)
8. Liu, C., Ruan, L.: Logical Topology Augmentation for Survivable Mapping in IP-over-WDM Networks. In: Proc. of the 48th IEEE Global Telecomm. Conf. (GLOBECOM 2005), vol. 4 (2005)
9. Xin, C., Wang, B., Cao, X., Li, J.: A Heuristic Logical Topology Design Algorithm for Multi-Hop Dynamic Traffic Grooming in WDM Optical Networks. In: Proc. of the 48th IEEE Global Telecomm. Conf. (GLOBECOM 2005), vol. 4 (2005)
10. Lee, K., Siu, K.-Y.: On the Reconfigurability of Single-Hub WDM Ring Networks. IEEE/ACM Trans. on Networking 11(2), 273–284 (2003)
11. Dutta, R., Rouskas, G.N.: On Optimal Traffic Grooming in WDM Rings. IEEE J. on Select. Areas in Commun. 20(1), 110–121 (2002)
12. Mosharaf, K., Lambadaris, I., Talim, J., Shokrani, A.: Fairness Control in Wavelength-Routed WDM Ring Networks. In: Proc. of the 48th IEEE Global Telecomm. Conf. (GLOBECOM 2005), vol. 4 (2005)
13. Schein, B., Modiano, E.: Quantifying the Benefit of Configurability in Circuit-Switched WDM Ring Networks with Limited Ports per Node. J. of Lightwave Tech. 19(6), 821–829 (2001)
14. Kitani, T., Funabiki, N., Higashino, T.: Proposal of Hierarchical Chordal Ring Network Topology for WDM Networks. In: Proc. of the IEEE Int'l Conf. on Networks 2004 (ICON 2004), pp. 605–609 (2004)
15. Arora, A.S., Subramaniam, S., Choi, H.-A.: Logical Topology Design for Linear and Ring Optical Networks. IEEE J. on Select. Areas in Commun. 20(1), 62–74 (2002)
16. Ramaswami, R., Sivarajan, K.N.: Design of logical topologies for wavelength-routed all-optical networks. In: Proc. of the 14th IEEE Int'l Conf. on Comp. Comm. (INFOCOM 1995), pp. 1316–1325 (1995)
17. Lee, K., Shayman, M.A.: Rollout Algorithms for Logical Topology Design and Traffic Grooming in Multihop WDM Networks. In: Proc. of the 48th IEEE Global Telecomm. Conf. (GLOBECOM 2005), vol. 4 (2005)
18. Arden, B.W., Lee, H.: Analysis of Chordal Ring Network. IEEE Trans. on Comp. C-30(4), 291–295 (1981)
19. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek, F., Balakrishnan, H.: Chord: a scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Trans. on Networking 11(1), 17–32 (2003)
20. Turkcu, O., Subramaniam, S.: Blocking in Reconfigurable Optical Networks. In: Proc. of the 26th IEEE Int'l Conf. on Comp. Comm. (INFOCOM 2007), pp. 188–196 (2007)
21. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small world' networks. Nature 393, 440–442 (1998)
A Novel Class-Based Protection Algorithm Providing Fast Service Recovery in IP/WDM Networks∗

Wojciech Molisz and Jacek Rak

Gdansk University of Technology, G. Narutowicza 11/12, Pl-80-952 Gdansk, Poland
{womol,jrak}@eti.pg.gda.pl
Abstract. In this paper we consider a multilayer architecture: IP-MPLS over optical transport networks (OTNs). Nodes have the integrated functionality of optical cross connects (OXCs) in the optical layer and of IP routers in the IP layer. Any two IP routers can be connected together by an all-optical wavelength-division multiplexing (WDM) channel, called a lightpath. Wavelength converters may be installed in each node. We propose a class-based method that provides fast service restoration and reduces the number of service recovery actions in the IP layer. We focus on service restoration time values and show that it is possible to decrease them by reducing the size of the protected area of the active path. This may be achieved at the price of degraded link capacity utilization.

Keywords: IP/WDM network survivability, multilayer routing, fast service recovery.
1 Introduction

We consider a multilayer architecture: IP-MPLS over optical transport networks (OTNs), the main architecture of next generation networks [3]. The OTN nodes have the functionality of both optical cross connects (OXCs) in the optical layer and IP routers in the IP layer. Any two IP routers can be connected together by an all-optical wavelength-division multiplexing (WDM) channel, called a lightpath. A lightpath may use several wavelengths if wavelength converters are installed in nodes. A fundamental characteristic of survivable network systems is their capability to deliver essential services in the face of failure or attack. There are two approaches to providing survivability of IP-over-WDM networks: protection and restoration [11]. Protection uses pre-computed backup paths that are applied in the case of a failure. Restoration dynamically finds a new path once a failure has occurred. In both cases we distinguish either path protection/restoration or link protection/restoration against a single link or node failure. Several intermediate solutions can be applied between these extremes, such as area protection [7] or partial path protection [13].

∗ This work was partially supported by the Ministry of Science and Higher Education, Poland, under grants 3 T11D 001 30 and N517 022 32/3969.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 338–345, 2008. © IFIP International Federation for Information Processing 2008
1.1 Related Works

Recently several papers have appeared on the survivability of IP-over-WDM networks. Colle et al. [3] gave an excellent survey of multilayer survivability concepts in the context of next generation networks. Sahasrabuddhe, Ramamurthy and Mukherjee [11], assuming that each backup path in the WDM layer is arc-disjoint with the respective primary path, analysed protection in the WDM layer and restoration in the IP layer. They proposed four integer linear programming (ILP) models and heuristics to find solutions. Pickavet et al. [8] discussed three approaches to interworking between layers and two efficient coordinated multilayer recovery techniques. Sasaki and Su [12] proposed models to evaluate the cost of survivable IP-over-WDM networks. Provisioning of survivability at the optical layer provides fast service recovery and is thus inherently attractive, but its major drawback is that it cannot recover from failures that occur at the IP-MPLS layer. The IP layer itself reacts very slowly to failures. Qin, Mason and Jia [9], Lei, Liu and Ji [5], and Ratnam, Zhou and Gurusamy [10] addressed the problem of efficient multilayer operational strategies for survivable IP-over-WDM networks and proposed several joint multiple layer restoration schemes with intralayer and interlayer signaling and backup resource sharing. Zhao et al. [16] proposed a similar strategy for IP/GMPLS/OTN networks. Bigos et al. [2], and Liu, Tipper and Vajanapoom [6] described various methods for spare capacity allocation (SCA) to reroute disrupted traffic in MPLS-over-OTN. Doucette, Grover and Giese [4] used the spare capacity of span-protecting p-cycles to provide protection against both optical link and node failures in the IP-MPLS layer.

1.2 Outline

We consider here a survivable IP-MPLS-over-OTN-WDM network protected against a single node failure. Demands for IP flows are given. We assume M service classes, numbered from 0 to M-1.
Class 0 represents the demands for which all the service recovery actions must be performed as fast as possible (i.e. at the WDM layer). For other (lower priority) classes, the IP restoration times may increase. In the first stage of our algorithm, for each service class we find node-disjoint working and backup label switched paths (LSPs). Then we group the IP demands into service classes on IP links. In the next stage we map each class onto protected lightpaths according to the available capacities of optical links. The scope of WDM protection depends on the service class. For the highest class 0, each two adjacent WDM links of the LSP are protected by a backup lightpath. In contrast, for the lowest class (M-1), there is no backup lightpath and all the recovery actions must be performed at the IP layer. We assume a bottom-up restoration strategy. If an active lightpath consists of more than one link, then a failure of a transit node of the lightpath can be restored in the optical layer. Connections which cannot be restored in the WDM layer should be restored in the IP layer. Since all the flow optimization models (in the IP as well as in the WDM layer) are NP-complete, we propose a novel heuristic approach. The rest of the paper is organized as follows. The survivable routing and wavelength assignment optimization problems are sketched in Section 2. Heuristic algorithms are developed to solve the problems. Modeling assumptions are described in Section 3. Results discussed in Section 4 show that it is possible to decrease service restoration duration by reducing the size of the protected area of the active path.
W. Molisz and J. Rak
2 IP-MPLS Survivable Routing of Working/Backup LSPs

The survivable IP-over-WDM routing problem is divided into the following two subproblems: a) survivable IP-MPLS routing, consisting of determining the IP virtual topology and finding the survivable routing of IP layer demands, b) survivable WDM routing (lightpath routing and wavelength assignment). The aim of subproblem (a) is to find the survivable LSPs, where each working LSP is protected by an end-to-end backup LSP that is node-disjoint with the working one. The number of active LSP links depends on the service class m and is determined as:

$$ r_m = \left\lceil \frac{l_c - 1}{M - 1} \times m + 1 \right\rceil \qquad (1) $$

where:
$l_c$ is the number of links of the end-to-end shortest path between the given pair of source s and destination d nodes in the WDM layer,
m is the class of a demand,
M is the number of service classes.
It is clear from formula (1) that a working LSP for the class 0 is established by a direct IP layer link. This implies in turn that no IP layer recovery actions will take place. On the contrary, for the class M-1 each working LSP link will be mapped onto a single-link WDM lightpath, implying frequent recovery actions at the IP layer. Fast service recovery at the WDM layer is achieved here by limiting the backup lightpath protection scope. For that purpose we determine the number of backup lightpaths protecting the given working lightpath as:

$$ b_m = \left\lceil -\frac{l_c - 1}{M - 1} \times m + l_c - 1 \right\rceil \qquad (2) $$
where all the symbols have the same meaning as in (1). From formula (2), one can observe that $b_m$ decreases linearly with the increase of the service class number m. In particular, it means that any backup lightpath for the class 0 protects two adjacent links of the working lightpath, while for class M-1 there is no backup lightpath for a given working lightpath and all the recovery actions must be performed in the IP layer. However, due to the limitations on the number of working LSP links in the IP layer given in Eq. 1 (implying the decrease of the length of the working lightpath with the increase of the service class number), the real scopes of WDM protection (measured in kilometers of fibers) remain at the same small level, independent of the service class number (except for class M-1). This in turn provides fast service recovery in the WDM layer independent of the service class number. The proposed approach is characterized by better performance, compared to the approach of solving all the four subproblems separately, since it divides the original problem of integrated IP/WDM routing into two separate subproblems only. Finding the solution to subproblem (a) is followed here by solving subproblem (b), but the tasks within subproblems (a) and (b) are performed in an integrated manner. Example IP/WDM routing for demands of boundary classes 0 and M-1 is given in Figs. 1 and 2.
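As a quick check of Eqs. (1) and (2), the following minimal Python sketch evaluates $r_m$ and $b_m$ for every service class. The function names are illustrative and do not come from the paper; integer arithmetic is used for the ceiling so that no floating-point rounding can affect the result.

```python
def ceil_div(a: int, b: int) -> int:
    """Return the ceiling of a / b for integers, assuming b > 0."""
    return -(-a // b)


def active_lsp_links(l_c: int, m: int, M: int) -> int:
    """r_m from Eq. (1): number of active (working) LSP links for class m."""
    if M < 2 or not 0 <= m < M or l_c < 1:
        raise ValueError("need M >= 2, 0 <= m < M and l_c >= 1")
    # The integer term +1 can be moved outside the ceiling.
    return ceil_div((l_c - 1) * m, M - 1) + 1


def backup_lightpaths(l_c: int, m: int, M: int) -> int:
    """b_m from Eq. (2): number of backup lightpaths per working lightpath."""
    if M < 2 or not 0 <= m < M or l_c < 1:
        raise ValueError("need M >= 2, 0 <= m < M and l_c >= 1")
    # The integer term l_c - 1 can be moved outside the ceiling.
    return ceil_div(-(l_c - 1) * m, M - 1) + l_c - 1


if __name__ == "__main__":
    l_c, M = 5, 5  # illustrative values only
    for m in range(M):
        print(m, active_lsp_links(l_c, m, M), backup_lightpaths(l_c, m, M))
```

For the illustrative values $l_c = 5$ and $M = 5$, $r_m$ grows from 1 (class 0) to 5 (class 4), while $b_m$ falls from 4 to 0, which matches the boundary cases discussed in the text.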
A Novel Class-Based Protection Algorithm Providing Fast Service Recovery
[Figures 1 and 2: two-layer diagrams, each showing an IP layer (routers R1 to R6) above a WDM layer (optical nodes O1 to O6). Legend: working LSP; backup LSP; working lightpath for working LSP; backup lightpaths for working LSP; backup lightpath for backup LSP.]
Fig. 1. Example IP/WDM routing of survivable connection of the class m = 0
Fig. 2. Example IP/WDM routing of survivable connection of the class m = M-1
In the proposed algorithm, given in Fig. 3, the following notation is used:
s - source node of a demand
d - destination node of a demand
m - class of a demand
Πsd - the demanded capacity
ξij - the cost of the IP link lij, defined as the length of the respective end-to-end lightpath in the WDM layer between nodes i and j

INPUT
Network topology; a set A of demands, each demand given by a quadruple [s, d, Πsd, m].

OUTPUT
Survivable multilayer routing of demands from A.

Step 1. Find the working and backup LSPs using the WDM topology with the link costs ξij set to link lengths or to infinity in case of insufficient spare capacity.
Step 2. Divide the working LSPs into rm regions, as given in Eq. 1, and replace each working LSP region with a direct IP link. Regardless of the service class m, provide each link of the found backup LSP with a one-hop lightpath.
Step 3. Find the lightpaths carrying the IP working flows, using the standard distance metrics and treating the aggregated flows belonging to the same class m between the given pair of IP layer nodes, found in Step 2, as the demands for the WDM layer.
Step 4. Divide each working lightpath into bm regions, as given in Eq. 2, and provide each region of the WDM lightpath with a dedicated backup lightpath.
Step 5. Provide each aggregated IP backup flow with an unprotected WDM lightpath.
Fig. 3. The proposed algorithm
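Step 2 of the algorithm can be illustrated with a short Python sketch. The paper does not state how the boundaries of the rm regions are chosen, so the even split by hop count used here is purely an assumption, and the names `split_into_regions` and `ip_links` are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_into_regions(path: Sequence[str], regions: int) -> List[List[str]]:
    """Split a node path into `regions` consecutive sub-paths.

    Assumption (not from the paper): hops are spread as evenly as possible,
    with any remainder given to the first regions. Adjacent regions share
    their boundary node.
    """
    hops = len(path) - 1
    if not 1 <= regions <= hops:
        raise ValueError("regions must be between 1 and the number of hops")
    base, extra = divmod(hops, regions)
    result, start = [], 0
    for k in range(regions):
        size = base + (1 if k < extra else 0)  # hops in this region
        result.append(list(path[start:start + size + 1]))
        start += size
    return result


def ip_links(path: Sequence[str], regions: int) -> List[Tuple[str, str]]:
    """Replace each region by a direct IP link between its end nodes."""
    return [(r[0], r[-1]) for r in split_into_regions(path, regions)]


if __name__ == "__main__":
    working_path = ["O1", "O2", "O3", "O4", "O5"]  # illustrative WDM path
    print(ip_links(working_path, 1))  # class 0: one direct IP link
    print(ip_links(working_path, 4))  # class M-1: one IP link per WDM hop
```

With one region the whole working LSP collapses into a single direct IP link, and with as many regions as hops every IP link maps onto a single-link lightpath, mirroring the two boundary classes shown in Figs. 1 and 2.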
Steps 1 and 2 are responsible for determining the survivable IP-layer routing, while Steps 3 to 5 are used to find the survivable routing in the WDM layer. Since each of the subproblems considered here is NP-complete, we have used only the heuristic approach in computations. Due to limitations on the size of this paper, we have put the respective ILP formulations of the above problem on the Web page: http://www.pg.gda.pl/~jrak/Networking2008_ILP.pdf.
3 Modeling Assumptions

The modeling was performed for the U.S. Long-Distance Network [15] and the European COST 239 Network [14], shown in Figs. 4 and 5. The simulations were to measure the link capacity utilization ratio, the number of broken connections restored at the IP and WDM layers, and the values of connection restoration time. The time of service restoration in the WDM layer comprised: the time to detect a failure, the link propagation delay, the time to configure backup lightpath transit nodes and the message processing delay at network nodes (including queuing delay). All the WDM layer links were assumed to have 32 channels. The channel capacity unit was considered to be the same for all the network links. Optical nodes were assumed to have a full wavelength conversion capability. When measuring the IP layer service restoration time, we assumed the IP link propagation delay to be equal to the aggregate delay of the respective WDM lightpath realizing this IP link. The message processing delay at the IP layer label-switch routers was assumed to be equal to 30 ms. For each IP layer connection, the following properties were assumed:
− demands from M = 5 service classes with protection against a single node failure,
− the demanded capacity equal to 1/32 of the WDM link channel capacity,
− provisioning 100% of the requested bandwidth after a network element failure,
− a demand to assure unsplittable flows in both the IP and WDM layers,
− the distance metrics and Bhandari's algorithm [1] of finding k node-disjoint paths (here k = 2) in all path computations, except for the unprotected lightpaths for the backup LSP links, found by Dijkstra's shortest path algorithm,
− the three-way handshake protocol of service restoration in the WDM layer (the exchange of LINK FAIL, SETUP and CONFIRM messages).
The algorithm of a single modeling scenario is shown in Fig. 6.
Fig. 4. U.S. Long-Distance Network (28 nodes, 45 bidirectional links)
Fig. 5. European COST 239 Network (19 nodes, 37 bidirectional links)
Repeat r* times the following steps:
Step 1. Randomly choose 30 pairs (s, d) of nodes (6 demands for each service class).
Step 2. Try to establish the survivable connections using the proposed algorithm.
Step 3. Store the ratio of link capacity utilization and the lengths of the backup paths.
Step 4. t* times simulate random failures of single nodes. For each failure state, restore the connections that were broken and store the connection restoration time values.
* in each scenario, r = 100 and t = 100 were assumed
Fig. 6. Research plan
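Step 1 of the research plan can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `generate_demands`, the tuple layout, and the use of integer node identifiers are not from the paper; the class count (5), the number of demands per class (6) and the demanded capacity (1/32 of a channel) follow Section 3. Steps 2 to 4 depend on the routing and failure models and are not covered.

```python
import random
from typing import List, Optional, Sequence, Tuple

Demand = Tuple[int, int, float, int]  # (s, d, demanded capacity, class m)


def generate_demands(nodes: Sequence[int],
                     classes: int = 5,
                     per_class: int = 6,
                     capacity: float = 1 / 32,
                     rng: Optional[random.Random] = None) -> List[Demand]:
    """Draw `per_class` random (s, d) pairs with s != d for each class."""
    rng = rng or random.Random()
    demands = []
    for m in range(classes):
        for _ in range(per_class):
            s, d = rng.sample(list(nodes), 2)  # two distinct nodes
            demands.append((s, d, capacity, m))
    return demands


if __name__ == "__main__":
    us_nodes = range(28)  # the U.S. Long-Distance Network has 28 nodes
    for demand in generate_demands(us_nodes, rng=random.Random(1))[:3]:
        print(demand)
```

Passing a seeded `random.Random` instance makes a scenario reproducible, which is convenient when the same demand set has to be replayed against several failure states.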
4 Modeling Results

4.1 Average Link Capacity Utilization Ratio

Table 1 shows the average ratios of link capacity utilization achieved during simulation. The amount of capacity needed for backup lightpath purposes turned out to be about two times higher, compared with the amount of capacity needed for active lightpaths. This was due to the fact that additional unprotected lightpaths were needed for backup LSPs to provide protection against the IP layer transit router failures.

Table 1. Average ratios of link capacity utilization

Network               Active lightpath capacity   Backup lightpath capacity   Total used capacity
U.S. Long-Distance    3.43%                       8.60%                       12.03%
COST 239              3.15%                       7.08%                       10.22%
4.2 Average Lengths and Numbers of Links of Connection Paths
Fig. 7 shows the average lengths of IP layer active and backup paths as a function of the service class number, while Fig. 8 gives the respective numbers of IP layer path links. Independent of the service class, the lengths of IP layer active and backup paths remain at the same level, implying similar network resource utilization for all the network classes. The average number of IP layer active path links decreases with the decrease of the service class number (Fig. 8). As a result, time-consuming IP layer recovery actions are less frequent for more important service classes. The average number of IP layer backup path links remains at the same level, independent of the service class, because each IP layer backup path link is established as a one-hop lightpath.

[Figures 7 and 8: average length of LSPs [km] and average number of LSP links versus service class number (0 to 4), for active and backup LSPs in the US and COST networks.]
Fig. 7. Average length of paths in the IP layer
Fig. 8. Average number of LSP links
Fig. 9 shows the average lengths of WDM layer lightpaths as a function of the service class number, while Fig. 10 gives the respective numbers of WDM layer path links. The average length of backup lightpaths remains at the same level, independent of the service class number. However, the average length of active lightpaths decreases with the increase of the service class number. This is due to the fact that with the increase of the service class number, the number of IP layer active LSP links increases, and each active LSP link is realized by a shorter WDM lightpath.

[Figures 9 and 10: average length of WDM lightpaths [km] and average number of lightpath links versus service class number (0 to 4), for active and backup lightpaths in the US and COST networks.]
Fig. 9. Average length of WDM lightpaths
Fig. 10. Average number of lightpath links
4.3 Service Recovery Actions
Fig. 11 shows the aggregate numbers of recovery actions for both the IP and WDM layers measured in a single scenario. It shows that with the increase of the service class number, the number of recovery actions in the IP layer increases, while the number of recovery actions at the WDM layer decreases. For the class 0 it implies that WDM layer recovery actions are sufficient to handle all the failure cases and, as a result, provide fast service recovery as given in Fig. 12. For the other service classes, the more IP layer transit nodes the IP active LSP traverses, the more frequent the IP layer service recovery actions are. The average values of service recovery times in the WDM layer remain at the same low level, characteristic of a link protection scheme. This is true independent of the service class number, since similar scopes of protection are achieved for all service classes, due to the linear decrease of the number of backup lightpaths that protect a given WDM active lightpath when increasing the service class number, as given in Eq. 2 (i.e. when decreasing the length of the WDM active lightpath).

[Figures 11 and 12: total numbers of restored connections and average values of service restoration time [ms] versus service class number (0 to 4), for the WDM layer and the IP layer in the US and COST networks.]
Fig. 11. Total number of restored connections
Fig. 12. Average service recovery time values
Fig. 13 shows the aggregate service recovery time values as a function of the service class. Each aggregate value was calculated as the sum of all the values of connection restoration time measured in a single simulation scenario. These results give another proof of the efficiency of our approach. For the U.S. Long-Distance Network the ratio between classes 0 and 4 is of order 1:6, while for the European COST 239 Network it is about 1:6.6. Aggregate restoration times for the highest (0) class are always the shortest ones.

[Figure 13: aggregate values of service restoration time [ms] versus service class number (0 to 4), for the US Long-Distance Network and the COST 239 Network.]
Fig. 13. Aggregate values of service restoration time
5 Conclusions

In this paper we considered the multilayer architecture of the Next Generation Networks: IP-MPLS over OTN-WDM. We assumed nodes having the integrated functionality of optical cross connects (OXCs) and of IP routers in the IP layer. We proposed a class-based method providing fast service restoration and reducing the number of service recovery actions in the IP layer. We focused on service restoration time values and showed that it is possible to decrease them by reducing the size of the active path protected area. To prove our thesis, we modeled two typical networks, the U.S. Long-Distance Network and the European COST 239 Network, and calculated the link capacity utilization ratio, the average lengths and numbers of links of the connection paths, and the values of service restoration time. To the best of our knowledge, all results are original and published for the first time.
References

1. Bhandari, R.: Survivable networks: Algorithms of diverse routing. Kluwer Academic Publishers, Boston (1999)
2. Bigos, W., et al.: Survivable MPLS over Optical Transport Networks: Cost and Resource Usage Analysis. IEEE J. Select. Areas Commun. 25(5), 949–962 (2007)
3. Colle, D., et al.: Data-Centric Optical Networks and Their Survivability. IEEE J. Select. Areas Commun. 20(1), 6–21 (2002)
4. Doucette, J., Grover, W.D., Giese, P.A.: Physical-Layer p-Cycles Adapted for Router Level Node Protection: A Multi-Layer Design and Operation Strategy. IEEE J. Select. Areas Commun. 25(5), 963–973 (2007)
5. Lei, L., Liu, A., Ji, Y.: A joint resilience scheme with interlayer backup resource sharing in IP over WDM networks. IEEE Communications Magazine 42, 78–84 (2004)
6. Liu, Y., Tipper, D., Vajanapoom, K.: Spare Capacity Allocation in Two-Layer Networks. IEEE J. Select. Areas Commun. 25(5), 974–986 (2007)
7. Molisz, W., Rak, J.: Region Protection/Restoration Scheme in Survivable Networks. In: Gorodetsky, V., Kotenko, I., Skormin, V.A. (eds.) MMM-ACNS 2005. LNCS, vol. 3685, pp. 442–447. Springer, Heidelberg (2005)
8. Pickavet, M., et al.: Recovery in Multilayer Optical Networks. IEEE J. of Lightwave Technology 24(1), 122–133 (2006)
9. Qin, Y., Mason, L., Jia, K.: Study on a joint multiple layer restoration scheme for IP over WDM networks. IEEE Network 17(2), 43–48 (2003)
10. Ratnam, K., et al.: Efficient Multi-Layer Operational Strategies for Survivable IP-over-WDM Networks. IEEE J. Select. Areas Commun. 24(8), 16–31 (2006)
11. Sahasrabuddhe, L., et al.: Fault management in IP Over WDM Networks: WDM Protection vs. IP Restoration. IEEE J. Select. Areas Commun. 20(1), 21–33 (2002)
12. Sasaki, G.H., Su, C.-F.: The Interface between IP and WDM and Its Effect on the Cost of Survivability. IEEE Communications Magazine 41, 74–79 (2003)
13. Wang, H., Modiano, E., Médard, M.: Partial Path Protection for WDM Networks: End-to-End Recovery Using Local Failure Information. In: Proc. ISCC 2002, vol. 725, pp. 719–725 (2002)
14. Wauters, N., Demeester, P.: Design of the optical path layer in multiwavelength cross connected networks. IEEE J. Select. Areas Commun. 1(5), 881–892 (1996)
15. Xiong, Y., Mason, L.G.: Restoration strategies and spare capacity requirements in self healing ATM networks. IEEE/ACM Trans. on Netw. 7(1), 98–110 (1999)
16. Zhao, J., et al.: Integrated multilayer survivability strategy with inter-layer signaling. In: Proc. of Int. Conf. on Communication Technology, ICCT 2003, vol. 1, pp. 612–616 (2003)
End-to-End Proportional Loss Differentiation in OBS Networks

Miguel A. González-Ortega, José C. López-Ardao, Pablo Argibay-Losada, Andrés Suárez-González, Cándido López-García, Manuel Fernández-Veiga, and Raúl F. Rodríguez-Rubio

Departmento Enxeñería Telemática, University of Vigo (Spain)
Abstract. Proportional loss differentiation in OBS networks has recently received much attention since it provides a quantitative differentiation between classes, thus facilitating network operations and pricing for providers, but without the complexity involved in absolute loss differentiation. Although edge-to-edge differentiation must be the ultimate aim, most of the methods in the literature are per-hop-based, and it is well-known that guaranteeing per-hop proportional loss does not guarantee end-to-end proportional loss. In this paper we modestly try to fill the gap which exists in this field, and so we propose and analyze a new method to obtain edge-to-edge proportional loss differentiation for WDM-based OBS networks. This method is based on the idea of trunk and wavelength reservation, already used in circuit switched networks and deflection routing in OBS networks. Through extensive simulation experiments and analysis, we compare the performance of our mechanism to another mechanism proposed in the literature, based on using additional offset.
1 Introduction
The rapid advances in optical transmission technology have led to a wide deployment of WDM-based networks. In order to efficiently use the bandwidth in the optical domain, Optical Burst Switching (OBS) is considered as the most promising switching technology [1]. At the edge of OBS networks, incoming packets with the same destination and characteristics (class, quality of service, etc.) are assembled into big data packets, or data bursts. Before sending each burst, a control packet (Burst Header Packet – BHP) is sent over a signalling channel, and electronically processed in each intermediate node in order to reserve the necessary resources inside the core nodes. After a sufficient delay, the burst is sent, traversing the all-optical
This work was supported by the “Ministerio de Educación y Ciencia” through the project TSI2006-12507-C03-02 of the “Plan Nacional de I+D +I” (partially financed with FEDER funds) and through the grant BES-2007-15562.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 346–357, 2008. © IFIP International Federation for Information Processing 2008
path set by the prior reservation. However, given that the control packets are not acknowledged, a burst may be discarded owing to contention with other bursts at the bufferless optical switches.

Quality of Service (QoS) differentiation is still an open research issue in OBS networks. Since transfer delay in OBS is primarily determined by propagation delay, and bandwidth is implicitly provided by supporting loss guarantee, the focus of QoS support in OBS networks is to provide loss differentiation. Different loss differentiation methods have been proposed in the literature. According to these, differentiation can be absolute or relative. Absolute methods can guarantee quantitative requirements (such as an upper bound on the loss probability for each class), but they need efficient admission control and resource provisioning mechanisms. In relative methods, on the other hand, the QoS of one class is defined relative to the other classes, providing differentiation only in a qualitative manner. In return, they are considerably simpler and cheaper.

If we analyze the methods proposed for loss differentiation (both absolute and relative) in the literature, we can see that they essentially use a few basic differentiation mechanisms, each one with its advantages and drawbacks: additional offset for the priority bursts ([2], [3]), intentional dropping of the lower-priority bursts ([4], [5], [6]), priority-based preemption ([7]), segmentation ([8], [9]), class-based bandwidth assignment ([5], [6]), or reordering of burst scheduling at the core nodes (see, for example, [10]).

A special case of relative methods that has recently received much attention is the one based on proportional QoS (see, for example, [4], [9] or [3]), since such methods provide a quantitative differentiation between classes, thus facilitating network operations and pricing for providers. The cost of achieving this quantitative, and not only qualitative, differentiation is a slightly higher complexity.
Although the use of absolute or proportional QoS is usually a compromise between the level of the QoS guarantees and the complexity (or cost), it seems obvious that edge-to-edge differentiation must be the ultimate aim. However, this aim is very difficult to achieve, and so most of the above methods are per-hop-based. But it is well-known that guaranteeing per-hop proportional loss does not guarantee end-to-end proportional loss [11]. Some works ([5]), however, go further and try to get close to a target end-to-end QoS using per-hop-based differentiation. To the best of our knowledge, only the recent work in [3] proposes an edge-to-edge proportional loss differentiation mechanism, referred to as FOTS (Feedback-based Offset Time Selection). This mechanism is based on the use of an additional offset, beyond the time needed for burst scheduling, that is adaptively selected according to end-to-end measurements of the Burst Loss Probability (BLP, from now on) for each flow. Nevertheless, like all the mechanisms for loss differentiation based on additional offset, FOTS only works when JET (Just Enough Time) [1] is used as the resource reservation mechanism, but not with JIT (Just In Time) [12], which is also widely used in the literature.
In this paper we modestly try to fill the gap which exists in this field, and so we propose a new edge-to-edge proportional loss differentiation mechanism for WDM-based OBS networks. Our mechanism is based on a new basic differentiation mechanism, although it is inspired by the idea of trunk reservation in circuit switched networks [13] and wavelength reservation for deflection routing in OBS networks [14]. We will refer to our mechanism as ADEWAR (ADaptive End-to-end WAvelength Reservation). We must note that ADEWAR works with both JET and JIT reservation mechanisms. Moreover, ADEWAR has an interesting advantage: under the Poisson traffic assumption, an analytical solution can be calculated based on the Erlang loss formula. In any case, we have also used other, more complex traffic models to study ADEWAR through extensive simulation experiments and to compare it to FOTS when JET is used. End-to-end proportional loss differentiation is achieved in both cases, but some differences in performance were found. The rest of the paper is organized as follows. Section 2 describes our mechanism ADEWAR. In Section 3 we show the simulation results and compare it to FOTS. The conclusions drawn are summarized in Section 4.
2 The ADEWAR Mechanism
In ADEWAR we use the same idea as trunk reservation in circuit switched networks [13] and wavelength reservation for deflection routing in OBS networks [14], but now we apply this idea to loss differentiation in OBS. So, we reserve a number of wavelengths on each link for the exclusive use of certain top-priority classes. That is, a burst of the Class of Service (CoS) i will be scheduled for a link if the number of free wavelengths in this link is higher than $N - K^i$, and dropped otherwise. N is the total number of wavelengths in that link. In this paper we will refer to a flow as the burst traffic between a node pair belonging to the same CoS. So, let $Q^i_{X-Y}$ be the QoS parameter for the flow corresponding to the CoS i between the node pair X-Y. Its value, ranging from 0 to 1, is directly proportional to the number of wavelengths that the flow is allowed to use, that is, $K^i_{X-Y} = N \cdot Q^i_{X-Y}$. If Q = 1 for the highest priority CoS, $K^1_{X-Y} = N$, that is, top-priority bursts will always be scheduled if there are free wavelengths. The value of Q for the other classes between the nodes X and Y will be adjusted according to its target BLP, $BLP^{i}_{X-Y,TAR}$, which will be proportional to the BLP measured for the highest (or the lowest) priority CoS. It is clear that such a technique can be used with both JET and JIT reservation mechanisms. Due to the significant impact that an additional wavelength has on BLP, if $K^i_{X-Y}$ is not an integer, one additional wavelength will be randomly allowed, with a probability proportional to the fractional part.
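The admission rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the probability of granting the extra wavelength is assumed to be exactly the fractional part of K.

```python
import math
import random


def allowed_wavelengths(n: int, q: float, rng: random.Random) -> int:
    """Return K for one burst: floor(N*Q), plus one extra wavelength granted
    with probability equal to the fractional part of N*Q (an assumption)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("Q must lie in [0, 1]")
    k = n * q
    base = math.floor(k)
    extra = 1 if rng.random() < (k - base) else 0
    return base + extra


def admit_burst(free: int, n: int, k: int) -> bool:
    """Schedule the burst only if more than N - K wavelengths are free."""
    return free > n - k


if __name__ == "__main__":
    rng = random.Random(7)
    n = 50  # wavelengths per link, as in the evaluation scenario
    k = allowed_wavelengths(n, 0.81, rng)  # N*Q = 40.5, so K is 40 or 41
    print(k, admit_burst(free=10, n=n, k=k))
```

With Q = 1 the rule reduces to "admit whenever at least one wavelength is free", which is the top-priority behaviour stated in the text.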
We propose two adaptive adjustment methods: one based on feedback and end-to-end measurements, and another one where each node adaptively makes adjustments based on local measurements. We will refer to the first one as end-to-end adjustment, and to the second one as local adjustment. In end-to-end adjustment, the adjustment and setting of the QoS parameter Q is made at the source node, according to end-to-end measurements of the BLP, and this value will be used in all the nodes traversed by the burst. Specifically, all the bursts belonging to the flow between the node pair X-Y and CoS i will carry the same value $Q^i_{X-Y}$. It is obvious that the core nodes need to know the value of Q for each burst. We propose that this value, set by the source node, travel in the corresponding control packet. For the measurements of BLP, we propose that the source nodes send "collector packets" to the destination nodes. These special packets will be answered with the number of bursts received at the destination since the last "collector packet". This simple method can have drawbacks related to the convergence speed, but traffic is expected to change slowly in OBS networks, which are mainly used in the backbone. With local adjustment, in order to avoid this potential problem, a sequence number is included in the control packets to allow each traversed node to calculate the number of bursts lost since the source. So, by making the corresponding adjustment at each node, the end-to-end BLP converges rapidly to its target value on a per-hop basis. In both cases, the adjustment consists in increasing Q (at the source node, or at each intermediate node) when the last measured value of BLP is higher than the target, and decreasing it otherwise, so that the measured BLP gets close to its target value.
It seems clear that the per-hop adjustment will probably lead to a faster convergence and a better dynamic behaviour than the end-to-end adjustment; however, it is only valid when the flow travels along the same path from source to destination, that is, without using deflection or multipath. The dynamics related to increasing and decreasing Q are out of the scope of this paper and will be studied in a future work.
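The adjustment rule can be sketched as follows. Since the paper leaves the dynamics of increasing and decreasing Q out of scope, the fixed step size used here is purely an assumption, as are the function names; the BLP estimate follows the collector-packet idea of comparing bursts sent with bursts reported as received.

```python
def measured_blp(sent: int, received: int) -> float:
    """Estimate the BLP from the bursts sent and the bursts reported as
    received since the last collector packet."""
    if sent <= 0 or not 0 <= received <= sent:
        raise ValueError("need sent > 0 and 0 <= received <= sent")
    return 1.0 - received / sent


def adjust_q(q: float, blp: float, target: float, step: float = 0.01) -> float:
    """Increase Q when the measured BLP exceeds the target, decrease it
    otherwise, and keep Q within [0, 1]. The fixed `step` is an assumption."""
    q = q + step if blp > target else q - step
    return min(1.0, max(0.0, q))


if __name__ == "__main__":
    q = 0.5
    blp = measured_blp(sent=1000, received=950)  # 5% of the bursts were lost
    q = adjust_q(q, blp, target=0.01)            # loss too high: raise Q
    print(round(q, 2))
```

The same update could run at the source node (end-to-end adjustment) or at each traversed node (local adjustment); only the origin of the BLP measurement differs.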
3 Performance Evaluation
We have studied the performance of ADEWAR in comparison to FOTS through analysis and simulation experiments, also using the NSFNET network as in [3]. All the links have 50 wavelengths of 10 Gbps, and we assume full wavelength-conversion capability in every core node. Although ADEWAR can also be used with JIT, we will use only JET as the resource reservation mechanism, in order to be able to employ FOTS. LAUC (Last Available Unscheduled Channel) [15] was the scheduling algorithm used. Except for some particular experiments, the mean burst size is 320 Kbits, and the processing time for the control packets at the core nodes is fixed and equal to 4 μs. For computational cost reasons,
only a subset of source/destination node pairs has been used, all of them with the same traffic pattern. We consider three CoS and the following target BLPs: $BLP^{2}_{TAR} = 5 \cdot BLP^{1}_{TAR}$ and $BLP^{3}_{TAR} = 25 \cdot BLP^{1}_{TAR}$. In principle, each burst is randomly assigned to a CoS with probabilities 1/13 for the CoS 1, 3/13 for the CoS 2 and 9/13 for the CoS 3. We will also consider other target BLPs and traffic ratios between the CoS. In this scenario, we have conducted exhaustive simulation experiments to analyze the influence of many different factors on the performance of the mechanisms under study, mainly those related to the traffic pattern, both marginal distributions and correlation structure for burst interarrival times and burst sizes. Specifically, we will focus on:
– Correlation structure: Using an M/G/∞ traffic generator [16], we model different correlation structures by means of only two parameters: the Hurst parameter (H) for the LRD, and the one-lag autocorrelation coefficient (R1) for the SRD.
– Marginal distribution: We generate background M/G/∞ processes to capture the desired correlation structure, and then, by means of a transformation that preserves the correlation structure, we change the marginal distribution (Exponential, Lognormal and Uniform), using different mean values (giving rise to different average traffic intensities) and variation coefficients (VC = standard deviation/mean).
As overall conclusions, we have observed that both ADEWAR and FOTS are able to provide end-to-end proportional loss differentiation (they have achieved the target proportional differentiation in all the experiments we have carried out) and that FOTS suffers a slightly lower BLP than ADEWAR. However, this is achieved at the expense of a significant increase in the delay, due to the additional offset needed for differentiation, which may even be prohibitive for some real-time applications. From now on, we will refer to this additional offset as FOTS offset.
We must also note that, although we have used local adjustment for ADEWAR, the performance has been very similar for end-to-end adjustment. In order to observe the performance degradation typically associated with differentiation, we have also studied the case with no differentiation (No Diff). First, we will analyze in detail the simulation results for the global BLP, and then we will study the magnitude of the additional offset needed for differentiation in FOTS.

3.1 Impact on Burst Loss Probability
Figure 1 shows the global BLP obtained as a function of the mean traffic intensity when the burst arrival process for each flow is Poissonian, and the burst length follows an exponential distribution. From the simulation results, we can see that the BLP obtained by FOTS and No Diff are very similar for all loads, and slightly lower than those of ADEWAR. In Figure 1 (right ), the corresponding fitting curves are also represented, showing that the BLP increases almost linearly with the traffic load for all the cases.
[Figure 1: burst loss probability versus average link traffic intensity (14 to 24). Left panel (logarithmic BLP axis): No Diff (Sim.), No Diff (An.), ADEWAR (Sim.), ADEWAR (An.), FOTS (Sim.). Right panel (linear BLP axis) with linear fits: No Diff (Sim.): -0.19685 + 0.01352*x; ADEWAR (Sim.): -0.22967 + 0.01631*x; FOTS (Sim.): -0.20166 + 0.01389*x.]
Fig. 1. Global Mean BLP for different traffic intensity values obtained by simulation, together with the analytical approximations (left) and the fitting curves (right)
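The analytical approximations in Fig. 1 rely on Erlang loss formulas. As a point of reference only, the sketch below computes the classical Erlang B blocking probability for a single link with N wavelengths, full wavelength conversion and Poisson offered load. It does not reproduce the reservation-aware model or the fixed-point procedure used for ADEWAR; the function name and the sample loads are illustrative assumptions.

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Erlang B blocking probability via the standard stable recursion
    B(0) = 1, B(n) = a*B(n-1) / (n + a*B(n-1))."""
    if servers < 0 or offered_load < 0:
        raise ValueError("servers and offered_load must be non-negative")
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


if __name__ == "__main__":
    wavelengths = 50  # wavelengths per link in the evaluation scenario
    for load in (30.0, 40.0, 50.0):  # illustrative loads in Erlangs
        print(load, erlang_b(wavelengths, load))
```

The recursion avoids the large factorials of the closed-form expression, so it stays numerically stable for links with many wavelengths.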
Figure 1 (left) also shows the analytical results for ADEWAR and No Diff. Since we use Poisson traffic patterns, BLP can be analytically estimated using the Erlang loss formulas, as in [13] and [17]. We have also used a fixed-point iterative procedure to calculate the values of $Q^i_{X-Y}$, for each node pair X-Y and CoS i, that allow us to achieve the target values of BLP, $BLP^{i}_{X-Y,TAR}$, proportional to the BLP obtained for the top-priority CoS. We can see that the analytical results are slightly better than those of simulation. The reason is that the analytical model is optimistic, since it does not take into account the offset time between data bursts and BHPs.

We have also studied other parameters, such as the differentiation degree requirements, the BHP processing time at the core nodes and the ratio of traffic for each CoS. Although the results obtained are not included for reasons of space, we will make some comments. We have found that BLP increases slightly with the BHP processing time in all cases. The reason is that the minimum offset time between the BHP and the data bursts also increases with it. So, on the one hand, data bursts arrive at the core nodes with very different offset times, and this makes it difficult to provide good resource provisioning. On the other hand, the LAUC scheduling algorithm does not make use of the free wavelength gaps before the scheduled bursts, due to the offset time and, obviously, such wasted gaps will be larger when the offset time increases. However, we could observe that the differentiation requirements hardly affect the global BLP obtained by FOTS, although there was a slight increase in the BLP of ADEWAR when the target differentiation was very stringent. In the same way, the ratio of traffic for each CoS does not have a big impact on the global BLP.

Impact of the Burst Interarrival Time.
With respect to the marginal distribution, the performance changes in a similar way for the three cases, keeping the relative differences between them unchanged (see Figure 2 (left)). However, we can note that the lighter the tail of the marginal distribution is, the greater the BLP is, since the smallest values of the interarrival time distribution are the
Fig. 2. Global Mean BLP for various marginal distributions: Exponential (E1), Lognormal with VC=1 (L1) and VC=0.25 (L025), Uniform with VC=0.25 (U025), and deterministic (D0) (left) and as a function of VC with Lognormal distribution (right) of the burst interarrival time. [Plots: Burst Loss Probability versus Marginal Distribution (left) and versus Variation Coefficient (VC) (right); curves for No Diff, ADEWAR and FOTS.]
Fig. 3. Global Mean BLP depending on the SRD, represented by R1 (left) and on the LRD, represented by H (right) of the burst interarrival time. L1 marginal distribution was used. [Plots: Burst Loss Probability versus One-lag Autocorrelation Coefficient (R1) (left) and versus Hurst Parameter (H) (right); curves for No Diff, ADEWAR and FOTS.]
most probable to occur, and so overlapping is also more probable. For this reason, the Exponential gives rise to greater values of BLP than the Lognormal. On the other hand, BLP increases with the value of VC (see Figure 2 (right)), but for small values the difference is not significant (for example, the BLPs obtained for D0 and for L025 are similar). In Figure 3 we show the effects of the correlation structure, both SRD and LRD. Again, the influence of R1 is similar for all the cases. An increase of VC or R1 always degrades the network performance, because both cause traffic burstiness. This degradation is more significant for the case of R1. However, we can observe that the effect of the LRD of the burst interarrival time is negligible. This can be explained by the lack of buffers in the optical paths, which introduces a reset effect in the correlations.
Impact of the Burst Sizes. We have also studied the impact of several characteristics of the burst size on the global BLP, using Poisson arrivals. We can see that BLP decreases very slightly with the mean burst size, and this decrease is more significant for ADEWAR (see Figure 4 (left)). That is
Fig. 4. Global Mean BLP as a function of the mean value (left) and the marginal distribution (right) of the burst size. L1 marginal distribution was used. [Plots: Burst Loss Probability versus Mean Burst Size (Kbits) (left) and versus Marginal Distribution (right); curves for No Diff, ADEWAR and FOTS.]
Fig. 5. Global Mean BLP depending on the VC (left) and on the SRD (right) of the burst size. L1 marginal distribution was used. [Plots: Burst Loss Probability versus Variation Coefficient (VC) (left) and versus One-lag Autocorrelation Coefficient (R1) (right); curves for No Diff, ADEWAR and FOTS.]
because for a given load the effect of increasing burst size is similar to that of decreasing the offset time. On the one hand, as we reduce the number of bursts sent per time unit, the impact of the disparity between the order of the burst arrivals and the resource reservations is also reduced. So, it is easier to provide good resource provisioning. On the other hand, the wasted gaps caused by the LAUC algorithm become relatively less significant. With respect to the marginal distribution of the burst size and its VC (see Figures 4 (right) and 5 (left)), we can assert that the influence of both parameters on BLP is negligible. Regarding the correlation structure, we can see that the effect of SRD, as happened with the burst interarrival time, is important since traffic becomes burstier. So, BLP increases considerably and the impact is very similar for the three cases. However, we have observed that, again as in the case of burst interarrival time, LRD has hardly any impact on BLP.
3.2 Impact on FOTS Offset
In this section, we focus our analysis on the magnitude of the FOTS offset, considering the same scenarios as those of the previous section. Burst sizes follow
Fig. 6. FOTS offset as a function of the traffic intensity (left) and of the BHP processing time (right). [Plots: FOTS Offset Time (μsec) versus Average link traffic intensity (left) and versus BHP Processing Time (μsec) (right); curves for C1 Mean, C1 Max., C2 Mean and C2 Max.]
an exponential distribution, and burst arrivals are Poissonian. The average number of hops per path was 2.465, and the BHP processing time at the core nodes was 4 μsec. So, the mean minimum offset time, or basic offset, is 2.465 · 4 μsec = 9.86 μsec. Our objective is to discuss whether the FOTS offset is significant, both in absolute terms and compared with the basic offset, and so whether FOTS could be inadequate for some real-time applications. The figures show the average FOTS offset (Mean) and the maximum over all the flows (Max.) for CoS 1 and 2 (C1 and C2). Figure 6 (left) shows the effect of the traffic intensity. We can observe that as BLP increases with the load, the FOTS offset also increases. A simple case can explain this. Let us consider one single OBS link with N wavelengths, Poissonian burst arrivals, and two priority classes (A > B), with traffic intensities $I_A$ and $I_B$. In this case we define the isolation degree achieved by FOTS, α, as the fraction of class B traffic that does not compete for the resources with class A bursts. Obviously, α increases with the FOTS offset. If we reasonably suppose that the global BLP hardly depends on the FOTS offset, then, by applying the Erlang-B loss formula $E_N(I)$, we obtain
$$I_A\,\mathrm{BLP}_A + I_B\,\mathrm{BLP}_B = (I_A + I_B)\,E_N(I_A + I_B).$$
Moreover, we have
$$\mathrm{BLP}_A = E_N\big(I_A + (1-\alpha) I_B\big), \qquad \mathrm{BLP}_B = \frac{(I_A + I_B)\,E_N(I_A + I_B) - I_A\,\mathrm{BLP}_A}{I_B}.$$
It is easy to see from the above equations that the quotient $\mathrm{BLP}_B / \mathrm{BLP}_A$ decreases with traffic intensity and increases with α, provided that $I_A + I_B$ and the rest of the parameters remain fixed. Therefore, the higher the traffic load is, the larger the FOTS offset must be, in order to increase α and hold a given loss differentiation. We can also see in Figure 6 (left) that for both C1 and C2 the FOTS offset is quite large. The average values are comparable to the basic offset (9.86 μsec), that is, the transmission delay due to the OBS offset time has doubled on average.
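The two-class, single-link argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it evaluates the Erlang-B formula with the standard recurrence and then applies the two relations for BLP_A and BLP_B. The link size (8 wavelengths) and the loads (2 and 4 Erlangs) are hypothetical values chosen only for illustration.

```python
def erlang_b(n_wavelengths: int, load: float) -> float:
    """Erlang-B blocking probability E_N(I), computed with the usual recurrence."""
    blocking = 1.0  # E_0(I) = 1
    for n in range(1, n_wavelengths + 1):
        blocking = load * blocking / (n + load * blocking)
    return blocking


def class_blp(n_wavelengths: int, load_a: float, load_b: float, alpha: float):
    """Return (BLP_A, BLP_B) for an isolation degree alpha in [0, 1]."""
    total = load_a + load_b
    # Class A only competes with the non-isolated fraction (1 - alpha) of class B.
    blp_a = erlang_b(n_wavelengths, load_a + (1.0 - alpha) * load_b)
    # Global loss is assumed unchanged: I_A*BLP_A + I_B*BLP_B = (I_A+I_B)*E_N(I_A+I_B).
    blp_b = (total * erlang_b(n_wavelengths, total) - load_a * blp_a) / load_b
    return blp_a, blp_b


# Hypothetical scenario: 8 wavelengths, I_A = 2 Erlangs, I_B = 4 Erlangs.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    blp_a, blp_b = class_blp(8, 2.0, 4.0, alpha)
    print(f"alpha={alpha:.2f}  BLP_A={blp_a:.5f}  BLP_B={blp_b:.5f}  "
          f"BLP_B/BLP_A={blp_b / blp_a:.2f}")
```

With α = 0 both classes see the same loss, and the quotient BLP_B/BLP_A grows as α increases, which is the behaviour the argument relies on.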
But, if we focus on the maximum values, they are much larger, peaking at over 60 μsec for the highest loads. Therefore, for some flows, FOTS has increased the offset time by more than 7 times, and this additional delay could be too long for real-time applications. In Figure 6 (right) we can observe that the FOTS offset increases with the BHP processing time. As already said, the reason is that the basic offset also increases
Fig. 7. FOTS offset depending on the marginal distribution (left) and on the SRD (right) of the burst interarrival time. [Plots: FOTS Offset Time (μsec) versus Marginal Distribution (left) and versus One-lag Autocorrelation Coefficient (R1) (right); curves for C1 Mean, C1 Max., C2 Mean and C2 Max.]
with it, and so a larger FOTS offset is needed to achieve a certain isolation degree (α). Moreover, the results showed that, as expected, the FOTS offset was larger when differentiation was more demanding, since α had to be greater, too. We could also observe, as pointed out in [3], that the FOTS offset increases with the ratio of the top-priority traffic. In conclusion, we find again that the mean FOTS offset is comparable to the basic offset, but for certain flows it can become several times larger. In Figure 7 we can see again that, as a general rule, the conditions which make BLP increase also make the FOTS offset increase since, using the same isolation degree among CoS, the proportional BLP differentiation decreases when the global BLP increases, as already noted. So, the FOTS offset increases with the SRD degree and the value of VC, and with the use of light-tailed marginal distributions (like Exponential) of the burst interarrival time. We have also studied the influence of the VC and the LRD of the burst interarrival time on the FOTS offset. As expected, we have found that both parameters have a similar impact on the global BLP and on the FOTS offset. So, the FOTS offset slightly increases with VC, but hardly depends on LRD. Figure 8 shows the FOTS offset obtained by varying the mean value and the R1 parameter of the burst sizes. We can note that the FOTS offset notably increases with the burst size. This can be explained with a simple example. Let us assume null basic offset and bursts of constant length L. In this case, a burst whose arrival time at a core node is t could be blocked only by bursts that arrive within the collision interval [t − L, t]. It is easy to see that, for two priority levels, obtaining a certain isolation degree α between them implies changing the collision interval of the highest priority to [t − L, t − αL]. Therefore, the FOTS offset must be αL, and so it increases with the burst size.
Similar explanations can be applied to more complex scenarios. The collision interval depends on the maximum burst size. So, the FOTS offset increases when large values of VC and heavy-tailed distributions are used for the burst sizes. Likewise, and due to the increase of the global BLP, the SRD also contributes to increasing the FOTS offset (see Figure 8 (right)).
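The constant-length example can be written down in a few lines. This is only an illustrative sketch under the same simplifying assumptions (null basic offset, constant burst length L, two priority levels). The Poisson exposure probability and the numerical values are assumptions added here for illustration and are not taken from the simulations.

```python
import math


def collision_interval(t: float, burst_len: float, alpha: float = 0.0):
    """Arrival times of other bursts that can block a burst arriving at time t.

    With isolation degree alpha the interval shrinks from [t - L, t]
    to [t - L, t - alpha * L].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (t - burst_len, t - alpha * burst_len)


def fots_offset(burst_len: float, alpha: float) -> float:
    """Extra offset needed for isolation degree alpha: alpha * L."""
    return alpha * burst_len


def prob_exposed(rate_low: float, burst_len: float, alpha: float) -> float:
    """Probability that at least one lower-priority burst arrives inside the
    collision interval, assuming (hypothetically) Poisson arrivals of rate rate_low."""
    start, end = collision_interval(0.0, burst_len, alpha)
    return 1.0 - math.exp(-rate_low * (end - start))


# Hypothetical numbers: bursts lasting 100 time units, 0.01 arrivals per time unit.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, fots_offset(100.0, alpha), prob_exposed(0.01, 100.0, alpha))
```

Doubling L doubles the offset required for the same α, which is the dependence on burst size observed in Figure 8 (left).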
Fig. 8. FOTS offset depending on the mean burst size (left) and on the one-lag autocorrelation coefficient of the burst sizes (right). [Plots: FOTS Offset Time (μsec) versus Mean Burst Size (Kbits) (left) and versus One-lag Autocorrelation Coefficient (R1) (right); curves for C1 Mean, C1 Max., C2 Mean and C2 Max.]
We have also analyzed the effect of the LRD of the burst sizes, but the results again reveal a negligible influence on the FOTS offset, basically for the same reasons, that is, the reset effect imposed by the lack of buffers in OBS networks.
4 Conclusions
In this paper we propose and analyze a new technique to obtain edge-to-edge proportional differentiation for WDM-based OBS networks. We call this method ADEWAR (ADaptive End-to-end WAvelength Reservation). ADEWAR is based on wavelength reservation, previously used in circuit-switched networks and in deflection routing in OBS networks. Through analysis and extensive simulation experiments, we compare the performance of our mechanism to FOTS (Feedback-based Offset Time Selection), another feedback-based mechanism proposed in the literature that uses additional offset to achieve differentiation. This mechanism has some disadvantages compared to ADEWAR, since it only works with the JET reservation mechanism and it uses an end-to-end feedback-based adjustment, which converges more slowly than our local adjustment. Both ADEWAR and FOTS achieve end-to-end proportional loss differentiation, and although FOTS causes slightly fewer burst losses than our technique, the additional delay it requires for differentiation could be too long for real-time applications, especially if data bursts are large and traffic loads are high.
References
1. Qiao, C., Yoo, M.: Optical burst switching (OBS)-a new paradigm for an optical internet. Journal of High Speed Networks 8(1), 69–84 (1999)
2. Yoo, M., Qiao, C., Dixit, S.: QoS performance of optical burst switching in IP-over-WDM networks. IEEE Journal on Selected Areas in Communications 18(10), 2062–2071 (2000)
3. Tan, S.K., Mohan, G., Chua, K.C.: Feedback-based offset time selection for end-to-end proportional QoS provisioning in WDM optical burst switching networks. Computer Communications 30(4), 904–921 (2007)
4. Chen, Y., Hamdi, M., Tsang, D.H.K.: Proportional QoS over OBS networks. In: Proc. of GLOBECOM, November 2001, vol. 3, pp. 1510–1514. IEEE, Los Alamitos (2001)
5. Zhang, Q., Vokkarane, V., Jue, J., Chen, B.: Absolute QoS differentiation in optical burst-switched networks. IEEE Journal on Selected Areas in Communications 22(9), 1781–1795 (2004)
6. Zhou, B., Bassiouni, M.A.: Supporting differentiated quality of service in optical burst switched networks. Optical Engineering 45(1) (January 2006)
7. Liao, W., Loi, C.-H.: Providing service differentiation for optical-burst-switched networks. Journal of Lightwave Technology 22(7), 1651–1660 (2004)
8. Vokkarane, V.M., Jue, J.P.: Prioritized burst segmentation and composite burst-assembly techniques for QoS support in optical burst-switched networks. IEEE Journal on Selected Areas in Communications 21(7), 1198–1209 (2003)
9. Gurusamy, M., Tan, C., Lui, J.: Achieving proportional loss differentiation using probabilistic preemptive burst segmentation in optical burst switching WDM networks. In: Proc. of GLOBECOM, November-December 2004, vol. 3, pp. 1754–1758. IEEE, Los Alamitos (2004)
10. Yang, M., Zheng, S.Q., Verchere, D.: A QoS supporting scheduling algorithm for optical burst switching DWDM networks. In: Proc. of GLOBECOM, November 2001, vol. 1, pp. 86–91. IEEE, Los Alamitos (2001)
11. Chen, Y., Qiao, C., Hamdi, M., Tsang, D.H.K.: Proportional differentiation: a scalable QoS approach. IEEE Communications Magazine 41(6), 52–58 (2003)
12. Wei, J.Y., McFarland, R.I.: Just-in-time signaling for WDM optical burst switching networks. Journal of Lightwave Technology 18(12), 2019–2037 (2000)
13. Krupp, R.S.: Stabilization of alternate routing networks. In: Proc. of ICC, June 1982, pp. 1–31. IEEE, Los Alamitos (1982)
14. Zalesky, A., Vu, H., Rosberg, Z., Wong, E., Zukerman, M.: Modelling and performance evaluation of optical burst switched networks with deflection routing and wavelength reservation. In: Proc. of INFOCOM, March 2004, IEEE, Los Alamitos (2004)
15. Xiong, Y., Vandenhoute, M., Cankaya, H.C.: Design and analysis of optical burst-switched networks. In: Senior, J.M., Qiao, C., Dixit, S. (eds.) Proc. of SPIE, SPIE, vol. 3843, pp. 112–119 (1999)
16. Suárez-González, A., López-Ardao, J.C., López-García, C., Fernández-Veiga, M., Rodríguez-Rubio, R.F., Sousa-Vieira, M.E.: A new heavy-tailed discrete distribution for LRD M/G/inf sample generation. Performance Evaluation 47(2-3), 197–219 (2002)
17. Argibay-Losada, P., Suárez-González, A., Veiga, M.F., Rodríguez-Rubio, R., López-García, C.: Evaluation of optical burst switching as a multiservice environment. In: Akyildiz, I.F., Sivakumar, R., Ekici, E., de Oliveira, J.C., McNair, J. (eds.) NETWORKING 2007. LNCS, vol. 4479, pp. 970–980. Springer, Heidelberg (2007)
The Effect of Peer Selection with Hopcount or Delay Constraint on Peer-to-Peer Networking
S. Tang, H. Wang, and P. Van Mieghem
Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands
{S.Tang,H.Wang}@ewi.tudelft.nl, [email protected]
Abstract. We revisit the peer selection problem of finding the most nearby peer from an initiating node. The metrics to assess the closeness between peers are hopcount and delay, respectively. Based on a dense graph model with i.i.d. regular link weights, we calculate the probability density function to reach a peer with minimum hopcount and asymptotically analyze the probability to reach a peer with the smallest delay within a group of peers. Both results suggest that a small peer group size is enough to offer an acceptable content distribution service. We also demonstrate the applicability of our model via Internet measurements.
1 Introduction
The idea of peer-to-peer (P2P) networking creates a reciprocal environment where, by sharing storage, bandwidth and computational capacity with each other, mutual benefit between end-users is possible. With the distribution of storage and retrieval functionality to peers in the P2P network, the process of selecting the best peer (in cost, bandwidth, delay, etc.) among a group of peers to start content retrieval becomes a vital procedure. Our model considers the hopcount and the delay as the major criteria for peer selection. The problem is confined as follows: given an underlying network of size N, over which m peers with the desired content are randomly scattered, what is the distribution of the hopcount and of the delay, respectively, to the most nearby peer from a requesting node? A requesting peer refers to the peer who initiates the downloading request. By solving the above problem, we expect to answer the fundamental question of how many peers are needed to store the replicas of a particular file so that the most nearby peer can always be reached within j hops or within delay t. Modeling of the peer selection problem is presented in Section 2. We complement our model by verifying its applicability with a series of experiments in Section 3. In Section 4, we conclude the paper.
This work has been partially supported by the European Union in CONTENT NoE (FP6-IST-038423) and NWO Robunet (643.000.503).
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 358–365, 2008. © IFIP International Federation for Information Processing 2008
2 Problem Description and Modeling
2.1 Modeling Assumptions
We model the number of hops and the latency to the nearest peer among a set of m peers based on three assumptions: (a) a dense graph model¹ for the underlying network, (b) regular link weights around zero, and (c) an i.i.d. link weight distribution on each link. The shortest path (SP) from a source to a destination is computed as the path that minimizes the sum of the link weights² along that path. In [4, Chapter 16.1], it is shown that a regular link weight distribution (regular means a linear function around zero) will dominate the formation of the shortest path tree (SPT), which is the union of all shortest paths from an arbitrary node to all the other destinations. A uniform recursive tree (URT) is asymptotically the SPT in a dense graph with a regular i.i.d. link weight distribution (e.g. exponential link weights) [6]. A URT of size N is a random tree that starts from the root A, and where at each stage a new node is attached uniformly to one of the existing nodes until the total number of nodes reaches N.
2.2 Hopcount Distribution to the Nearest Peer
A) Theoretical analysis
The number of hops from a requesting peer to its most nearby peer, denoted by $h_N(m)$, is the minimum number of hops among the set of shortest paths from the requesting node to the m peers in the network of size N. Let $H_N(m)$ be the hopcount starting from one, excluding the event $h_N(m) = 0$ in the URT. Since $\Pr[h_N(m) = 0] = \frac{m}{N}$, we have
$$\Pr[H_N(m) = j] = \Pr[h_N(m) = j \mid h_N(m) \neq 0] = \frac{1}{1 - \frac{m}{N}} \Pr[h_N(m) = j] \qquad (1)$$
with j = 1, 2, ..., N and $\Pr[h_N(m) = j]$ recursively solved in [4, p. 427]. However, the recursive computation involves a considerable amount of memory and CPU time, which limits its use to relatively small sizes of N ≤ 100. Fig. 1 illustrates $\Pr[H_N(m) = j]$ versus the fraction of peers $\frac{m}{N}$ for different hops j, with the network size varying from N = 20 up to N = 60. The interesting observation from Fig. 1 is that, for separate hops j, the distribution $\Pr[H_N(m) = j]$ rapidly tends to a distinct curve for most of the small networks (N ≤ 100) and that the increase in the network size N only plays a small role. Further, the crosspoint of the curves j = 1 and j ≥ 2 (the bold line) around $\frac{m}{N}$ = 15% indicates that in small networks (i.e. N = 20), the peer fraction should always be larger than 15% to ensure $\Pr[H_N(m) = 1] > \Pr[H_N(m) \geq 2]$.
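Because the exact recursion is expensive, a quick way to get a feel for these curves is simulation. The sketch below is an assumption-laden illustration, not the recursive method of [4]: it grows a URT as defined in Section 2.1, scatters m peers uniformly over all N nodes (so that the root itself is a peer with probability m/N), and estimates Pr[H_N(m) = j] by discarding the runs with zero hops, as in (1). The function names, the number of runs and the seed are all hypothetical.

```python
import random
from collections import Counter


def grow_urt_depths(n_nodes: int, rng: random.Random) -> list:
    """Depth (hopcount from the root) of every node in a URT with n_nodes nodes."""
    depths = [0]  # node 0 is the root, i.e. the requesting node
    for _ in range(1, n_nodes):
        parent = rng.randrange(len(depths))  # attach uniformly to an existing node
        depths.append(depths[parent] + 1)
    return depths


def nearest_peer_hopcount_pmf(n_nodes: int, n_peers: int, runs: int = 20000, seed: int = 1):
    """Monte Carlo estimate of Pr[H_N(m) = j] and of Pr[h_N(m) = 0]."""
    rng = random.Random(seed)
    counts = Counter()
    zero_hits = 0
    for _ in range(runs):
        depths = grow_urt_depths(n_nodes, rng)
        peers = rng.sample(range(n_nodes), n_peers)  # peers uniform over all N nodes
        hops = min(depths[p] for p in peers)
        if hops == 0:
            zero_hits += 1  # the requesting node itself holds the content
        else:
            counts[hops] += 1
    kept = runs - zero_hits  # condition on h_N(m) != 0
    pmf = {j: c / kept for j, c in sorted(counts.items())}
    return pmf, zero_hits / runs


pmf, zero_fraction = nearest_peer_hopcount_pmf(n_nodes=20, n_peers=3)
print("estimated Pr[h = 0]:", zero_fraction)
for j, p in pmf.items():
    print(f"Pr[H = {j}] ~ {p:.3f}")
```

The estimated fraction of zero-hop runs should be close to m/N, which is a simple sanity check on the sampling.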
¹ The dense graph is a heterogeneous graph with average degree $E[D] \geq p_c N \approx O(\log N)$ and a small standard deviation $\sqrt{\mathrm{Var}[D]} \ll E[D]$, where $p_c \sim \frac{\log N}{N}$ is the disconnectivity threshold of the link density [4, Chapter 15.6.3].
² The link weight $w_{ij}$ assigned to a link (i, j) between node i and node j in a network is a real positive number that reflects certain properties of the link, e.g. distance, delay, loss, or bandwidth.
Fig. 1. $\Pr[H_N(m) = j]$ versus the fraction of peers $\frac{m}{N}$ for network sizes N = 20 to N = 60. The bold line is the pdf of $H_N(m) \geq 2$ for N = 20. The inset plots $\Pr[H_N(m) \leq 4]$ as a function of the peer fraction $\frac{m}{N}$ for network sizes N = 20 to 60. [Plot: Pr[H_N(m) = j] versus Fraction of peers (%), curves for j = 1 to 5 and N = 20, 30, 40, 50, 60; the crosspoint is marked at m/N around 15%.]
To avoid the recursive calculation in (1), we compute $\Pr[H_N(m) = j]$ approximately by assuming the independence of the hopcount from the requesting node to the m peers when m is small³. The approximation is expressed as
$$\Pr[H_N(m) \leq j] \approx 1 - \left(\Pr[H_N > j]\right)^m \qquad (2)$$
where the right-hand side is the probability that at least one of the peers is at most j hops away (that is, not all peers are further than j hops away). As explained in footnote 3, we confine the use of (2) to large N and small m, whereas the exact result is applicable for N ≤ 100 with all m. We discuss the usage of (2) and its asymptotic result for very large networks in more detail in [7].
B) Application of $\Pr[H_N(m) = j]$
We apply (1) to estimate the peer group size for a certain content delivery service. For instance, if the operator of a content distribution network (CDN) with 40 routers has uniformly scattered 4 servers (peer fraction around 10%) into the network, he can already claim that in approximately 98% of the cases, any user request will reach a server (the terms server and peer are interchangeable in this case) within 4 hops (j ≤ 4), as seen in the inset of Fig. 1. Placing more servers in the network will not improve the performance significantly.
³ The path overlap from the root to the m peers in the URT causes correlation of the hopcount between peers. When m is small compared to the network size N, the path overlap is expected to be small, and so is the correlation of the hopcount. The larger m is, the more dependent the hopcounts from the root to the m peers become.
2.3 Weight of the Shortest Path to the First Encountered Peer
A) The asymptotic analysis
In [4, p. 349], the shortest path problem between two arbitrary nodes in the dense graph with regular link weight distribution (e.g. exponential link weights) has been rephrased as a Markov discovery process. It evolves as a function of time from the source and stops at the time when the destination node is found. The transition rate in this continuous-time Markov chain from state n, with n already discovered nodes, to the next state n + 1 is $\lambda_{n;n+1} = n(N - n)$. The inter-attachment time $\tau_n$ between the inclusion of the n-th and (n + 1)-th node in the SPT for n = 1, 2, ..., N − 1 has an exponential distribution with parameter n(N − n). The exact probability generating function (pgf) of the weight $W_{N;m}$ of the shortest path from an arbitrary node to the first encountered peer among m peers can be formulated as
$$\varphi_{W_{N;m}}(z) = \sum_{k=1}^{N-m} E\left[e^{-z v_k}\right] \Pr[Y_m(k)],$$
where $\Pr[Y_m(k)]$ represents the probability that the k-th attached node is the first encountered peer among the m peers in the URT, and $v_k = \sum_{n=1}^{k} \tau_n$ denotes the weight of the path to the k-th attached node. The corresponding generating function of $v_k$ is
$$E\left[e^{-z v_k}\right] = \prod_{n=1}^{k} \frac{n(N-n)}{z + n(N-n)}.$$
The formation of the URT with m attached peers indicates $\binom{N-1}{m}$ ways to distribute the m peers over the N − 1 positions (other than the source node). The remaining m − 1 peers should always appear in positions that are larger than the k-th position. Hence, there are $\binom{N-1-k}{m-1}$ ways to distribute the m − 1 peers over the N − 1 − k positions. This analysis leads us to express $\Pr[Y_m(k)]$ as
$$\Pr[Y_m(k)] = \frac{\binom{N-1-k}{m-1}}{\binom{N-1}{m}}.$$
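The combinatorial expression for Pr[Y_m(k)] is easy to evaluate exactly. The sketch below is an illustration under the stated model and not code from the paper: it checks that the probabilities sum to one over k = 1, ..., N − m and combines them with the mean path weight E[v_k] = Σ_{n=1}^{k} 1/(n(N − n)) (the mean of a sum of exponential inter-attachment times) to obtain the mean weight to the first encountered peer. The function names and example sizes are hypothetical.

```python
from fractions import Fraction
from math import comb


def first_peer_position_pmf(n_nodes: int, n_peers: int) -> dict:
    """Pr[Y_m(k)] for k = 1..N-m: the k-th attached node is the first peer."""
    denom = comb(n_nodes - 1, n_peers)
    return {
        k: Fraction(comb(n_nodes - 1 - k, n_peers - 1), denom)
        for k in range(1, n_nodes - n_peers + 1)
    }


def mean_weight_to_first_peer(n_nodes: int, n_peers: int) -> Fraction:
    """E[W_{N;m}] = sum_k Pr[Y_m(k)] * E[v_k], with E[v_k] = sum_{n<=k} 1/(n(N-n))."""
    pmf = first_peer_position_pmf(n_nodes, n_peers)
    mean = Fraction(0)
    path_weight = Fraction(0)  # running value of E[v_k]
    for k in range(1, n_nodes - n_peers + 1):
        path_weight += Fraction(1, k * (n_nodes - k))
        mean += pmf[k] * path_weight
    return mean


# Hypothetical example: N = 40 nodes, increasing peer group sizes.
print("sum of Pr[Y_m(k)]:", sum(first_peer_position_pmf(40, 4).values()))
for m in (1, 2, 4, 8, 16):
    print(f"m = {m:2d}  E[W] = {float(mean_weight_to_first_peer(40, m)):.5f}")
```

Exact rational arithmetic is used so that the normalization check is not blurred by rounding.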
The asymptotic probability of $\varphi_{W_{N;m}}(z)$ with proper scaling is thus derived⁴ as
$$\lim_{N \to \infty} \Pr\left[N W_{N;m} - \ln\frac{N}{m} \leq y\right] = e^{-my}\, m^{m+1}\, e^{m e^{-y}} \int_{m e^{-y}}^{\infty} \frac{e^{-u}}{u^{m+1}}\, du \qquad (3)$$
which converges to the Fermi-Dirac distribution function, as shown in [5, Section 3],
$$\lim_{N \to \infty} \Pr\left[N W_{N;m} - \ln\frac{N}{m} \leq y\right] = \frac{1}{1 + e^{-y}} \qquad (4)$$
for large m, as shown in Fig. 2. It illustrates that a relatively small peer group m ≈ 5 is sufficient to offer a good service quality, because increasing the number of peers can only improve the performance marginally, i.e. logarithmically in m.
B) Application
We use the Fermi-Dirac distribution (4) to estimate the minimum number of peers m needed to satisfy the requirement $\Pr[W_{N;m} \leq y] \geq \eta$, which means that in a fraction η of the cases, the delay to the nearest peer is no larger than y. Rewriting (4) yields $\Pr[W_{N;m} \leq y] \approx \frac{\frac{m}{N} e^{Ny}}{1 + \frac{m}{N} e^{Ny}} \geq \eta$, from which we find
$$\frac{m}{N} \geq \frac{\eta\, e^{-yN}}{1 - \eta} \qquad (5)$$
⁴ The joint probability of the pair $(H_N(m), W_{N;m})$ as calculated in [2] is shown to be asymptotically independent. Hence, for large N, both the hopcount via (1) and the delay via (3) are sufficient to compute $\Pr[h_N(m) = j, W_{N;m} \leq y]$.
Fig. 2. The convergence of the distribution of the scaled random variable $N W_{N;m} - \ln\frac{N}{m}$ towards the Fermi-Dirac distribution (in bold) for increasing m = 1, 2, 3, 4, 5, 10, 15, 20. [Plot: Pr[N W_{N;m} − log(N/m) < y] versus y, for y from −8 to 8.]
Consider, for example, an online P2P music sharing system that takes advantage of a VoIP service. The standard [3] suggests a mouth-to-ear delay < 150 ms. Let N = 80, y = 150 (ms), and η = 99.99%. With (5), we find that in this network, the claim that in 99.99% of the cases the delay to the nearest peer is smaller than 150 ms can always be achieved if there are 5 peers sharing the same music ($\frac{m}{N}$ around 6%).
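Bound (5) is straightforward to evaluate. The following sketch is an illustration and not part of the original analysis: it computes the smallest integer m satisfying (5) and the Fermi-Dirac approximation of Pr[W ≤ y] obtained by rewriting (4). One assumption is made explicit: taking y literally as 150 makes $e^{-yN}$ vanish numerically, whereas the quoted result of 5 peers is reproduced when the delay bound is entered as y = 0.15, i.e. 150 ms expressed in seconds. The function names are hypothetical.

```python
import math


def fermi_dirac_cdf(n_nodes: int, n_peers: int, y: float) -> float:
    """Approximation of Pr[W_{N;m} <= y] from (4), written in a numerically safe form."""
    # (m/N) e^{Ny} / (1 + (m/N) e^{Ny})  ==  1 / (1 + (N/m) e^{-Ny})
    return 1.0 / (1.0 + (n_nodes / n_peers) * math.exp(-n_nodes * y))


def min_peers(n_nodes: int, y: float, eta: float) -> int:
    """Smallest integer m with m/N >= eta * exp(-y N) / (1 - eta), see (5)."""
    bound = n_nodes * eta * math.exp(-y * n_nodes) / (1.0 - eta)
    return max(1, math.ceil(bound))


# Assumption: the 150 ms delay bound is entered as y = 0.15.
m = min_peers(n_nodes=80, y=0.15, eta=0.9999)
print("minimum number of peers:", m, "peer fraction:", m / 80)
print("Pr[W <= y] with that m:", fermi_dirac_cdf(80, m, 0.15))
```

The second function lets one confirm that the returned m meets the target η while m − 1 does not.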
3 Discussion on Modeling Assumptions
In this section, we discuss the two major assumptions made for the URT model: 1) a dense graph to mimic the underlying network and 2) i.i.d. regular link weights. The Internet might be denser than what has been measured, taking into account all sorts of significant sampling bias, such as insufficient sources for traceroutes, as suggested in [1]. We will also give indications on the link weight distribution and the applicability of the URT model in P2P networks in this section.
3.1 Link Weight Structure of Networks
We use the data from the National Road Database provided by the Dutch transport research center to give an indication of the link weight distribution in a transportation network. The link weight of the Dutch road network is evaluated as the physical distance between two road sections. In Fig. 3, we fit the link weight distribution $F_w(x)$ of the Dutch road network with a linear function. A regular (linear) link weight distribution is found within a small range [0, ε], where ε ∼ 0.03, which gives evidence for the assumption of a regular link weight structure around zero. Given that the link weight structure in the Internet is tunable, we claim that the assignment of regular link weights in Section 2.1 is reasonable.
3.2 Applicability of the URT Model
We have carried out a series of experiments by using the traceroute data provided by iPlane⁵ to give further indication of how well the URT model matches the real network. iPlane performs periodic measurements by using PlanetLab nodes to probe a list of targets with global coverage. We use the iPlane measurement data collected on 8th June 2007. We extract the stable traces from 52 PlanetLab nodes that are located in different PlanetLab sites. Assuming the traceroutes represent the shortest paths, we construct an SPT rooted at each PlanetLab node (as a source) to m peers (destinations), resulting in 52 SPTs in total. By using a map with all aliases resolved in iPlane, we obtain the router-level SPTs. The m peers are randomly chosen over the world, and the hopcount ($H_{SPT}$) and degree ($D_{SPT}$) distributions are obtained by making a histogram of the 52 SPTs (each with m destinations).
⁵ http://iplane.cs.washington.edu/
Fig. 3. $F_w(x) = \Pr[w \leq x]$ of the link weight of the Dutch transportation network, with the x axis normalized to [0, 1] ($x \in [a, b] \to x \in [\frac{a}{b}, 1]$). The correlation coefficient ρ = 0.99 suggests a high fitting quality. [Plot: F_w(x) versus x for x from 0 to 0.14; the link weight distribution of the Dutch road network and the fitting curve with a linear function.]
A) Experimental results on node degree distribution
Three sets of experiments with m = 10, 25 and 50 are conducted to examine the degree distribution of the sub-SPT, because the number of peers in a P2P network is not expected to be large⁶. We observed from the experiments that an exponential node degree distribution is, if not better, at least comparable to the power law degree distribution that has been reported in most of the published papers. In Fig. 4, we fitted $\Pr[D_{SPT} = k]$ for m = 10 with a linear function on both log-lin and log-log scales. The linear correlation coefficients used to reflect the fitting quality are $\rho_\alpha$ on the log-lin scale and $\rho_\beta$ on the log-log scale, respectively. The quality of the fitting on the log-log scale ($\rho_\beta$ = 0.99) is only slightly higher than that on the log-lin scale ($\rho_\alpha$ = 0.98), which questions the power law degree distribution of a small subgraph of the Internet topology. A similar phenomenon is also observed for $\Pr[D_{SPT} = k]$ for m = 25 and m = 50. Due to space limitations, we only provide the correlation coefficients for m = 25 and m = 50 in Table 1. Again, the quality of the fitting seems to be comparable on both scales.
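The comparison between the two scales can be sketched as follows. This is a hypothetical illustration, not the authors' fitting procedure: it computes the absolute linear correlation coefficient of ln Pr[D = k] against k (log-lin, exponential law) and against ln k (log-log, power law). The degree histograms below are synthetic and only serve to exercise the function; they are not the PlanetLab data.

```python
import math


def pearson(xs, ys) -> float:
    """Linear (Pearson) correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def scale_fit_quality(degrees, probs):
    """Return (rho_alpha, rho_beta): fit quality on log-lin and log-log scales."""
    log_p = [math.log(p) for p in probs]
    rho_alpha = abs(pearson(degrees, log_p))                         # ln Pr versus k
    rho_beta = abs(pearson([math.log(k) for k in degrees], log_p))   # ln Pr versus ln k
    return rho_alpha, rho_beta


# Synthetic histograms (assumptions): one exactly exponential, one exactly power law.
degrees = list(range(1, 9))
exp_weights = [math.exp(-1.1 * k) for k in degrees]
exp_probs = [w / sum(exp_weights) for w in exp_weights]
pl_weights = [k ** -3.0 for k in degrees]
pl_probs = [w / sum(pl_weights) for w in pl_weights]

print("exponential data:", scale_fit_quality(degrees, exp_probs))
print("power-law data:  ", scale_fit_quality(degrees, pl_probs))
```

Over a short degree range both coefficients can be high for either kind of data, which is why a small difference between them is weak evidence for one law over the other.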
A discrepancy with the first three experiments occurs if we increase the peer size to 500 [7]. For larger subgraphs, a clear power law, rather than an exponential distribution, dominates the node degree. More discussion on the node degree distribution of both experimental and simulation results can be found in [7]. We conclude that the node degree of a subgraph with small m cannot be affirmatively claimed to obey a power law distribution. At least, it is disputable whether the exponential distribution can be equally good as the power law.
⁶ Measurements on PPLive, a popular IPTV application [8], reveal that the number of active peers that a peer can download from is always smaller than 50.
Fig. 4. The histogram of the degree $D_{SPT}$ for 52 SPTs with m = 10 peers based on PlanetLab traceroute data on log-lin scale, and on log-log scale in the inset. $\rho_\alpha$ represents the linear correlation coefficient on log-lin scale and $\rho_\beta$ is the one on log-log scale. [Plot: Pr[D_SPT = k] versus degree k; 52 SPT samples; fit: ln Pr[D_SPT = k] = −1.10 k − 0.93 with ρα = 0.98; fit: ln Pr[D_SPT = k] = −5.24 ln k − 3.29 with ρβ = 0.99.]
Table 1. Correlation coefficient for both log-lin ($\rho_\alpha$) and log-log ($\rho_\beta$) scale of m = 10, 25 and 50
              m = 10   m = 25   m = 50
ρα (log-lin)  0.98     0.95     0.95
ρβ (log-log)  0.99     0.99     0.99
B) Hopcount distribution in the Internet
The probability density function of the hopcount from the root to an arbitrarily chosen node in the URT with large N can be approximated, according to [4, p. 356], as
$$\Pr[H_N = k] \approx \Pr[h_N = k] \sim \frac{(\log N)^k}{N\, k!} \qquad (6)$$
where $H_N$ indicates the event that k ≥ 1. We plotted the pdf of the hopcount with m = 50 (50 traceroute samples for each tree) in Fig. 5 (a), in which we see a reasonably good fit with (6). An even better fitting quality is found in Fig. 5 (b) if we increase the number of traceroute samples by randomly selecting m = 8000 destinations for each tree, because more traceroutes give higher accuracy. We conclude that the hopcount distribution of the Internet can be modeled reasonably well by the pdf of the hopcount (6) in the URT.
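Expression (6) has the form of a Poisson probability with mean log N, so it can be evaluated directly. The sketch below is an illustration only: it computes (6) in log space to avoid overflow of the factorial and uses the fitted value log(N) = 12.76 reported for Fig. 5 (a) as the example parameter. The function name is hypothetical.

```python
import math


def urt_hopcount_pmf(k: int, log_n: float) -> float:
    """(log N)^k / (N k!) for log_n = log N > 0, evaluated in log space."""
    return math.exp(k * math.log(log_n) - log_n - math.lgamma(k + 1))


log_n = 12.76  # fitted value quoted for m = 50
ks = range(0, 200)
total = sum(urt_hopcount_pmf(k, log_n) for k in ks)
mean = sum(k * urt_hopcount_pmf(k, log_n) for k in ks)
mode = max(ks, key=lambda k: urt_hopcount_pmf(k, log_n))
print("total probability:", total)
print("mean hopcount:", mean)
print("most likely hopcount:", mode)
```

The mean of this distribution equals log N, which is consistent with the fitted values of log(N) coinciding with the E[H] annotations in Fig. 5.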
[Figure 5: measured and fitted Pr[H = k] versus hop k. (a) m = 50, E[H] = 12.76. (b) m = 8000, E[H] = 14.97.]
Fig. 5. The histograms of hopcount derived from 52 SPTs for m = 50 (a) and m = 8000 (b) are fitted by the pdf (6) of the hopcount in the URT. The measured data for (a) and (b) are fitted with log(N) = 12.76 and log(N) = 14.97 respectively.
4 Conclusion
We obtain the hopcount and delay distributions to the nearest peer on the URT by assigning regular i.i.d. link weights (e.g., exponential link weights) on a dense graph. Both results suggest that a small peer group is sufficient to offer an acceptable quality of service. Via a series of experiments, we show the applicability of the URT model, based on which the pdfs for hopcount and delay have been derived. To summarize, with a small group of peers (m ≤ 50), the URT seems to be a reasonably good model for a P2P network.
Acknowledgments. We would like to thank Neil Spring for kindly providing us with the iPlane data.
References
1. Clauset, A., Moore, C.: Accuracy and scaling phenomena in Internet mapping. Physical Review Letters 94, 18701 (2005)
2. Hooghiemstra, G., Van Mieghem, P.: The weight and hopcount of the shortest path in the complete graph with exponential weights. CPC (submitted)
3. ITU-T Rec. G.114, One-way transmission time (May 2003)
4. Van Mieghem, P.: Performance Analysis of Communications Networks and Systems. Cambridge University Press, Cambridge (2006)
5. Van Mieghem, P., Tang, S.: Weight of the Shortest Path to the First Encountered Peer in a Peer Group of Size m. Probability in the Engineering and Informational Sciences (PEIS) 22, 37–52 (2008)
6. Van der Hofstad, R., Hooghiemstra, G., Van Mieghem, P.: First Passage Percolation on the Random Graph. Probability in the Engineering and Informational Sciences (PEIS) 15, 225–237 (2001)
7. Tang, S., Wang, H., Van Mieghem, P.: Peer selection with hopcount and delay constraint, Delft University of Technology, report20080222 (2008)
8. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: Insights into PPLive: A measurement study of a large-scale P2P IPTV system. In: Workshop on Internet Protocol TV (IPTV) services over World Wide Web, Edinburgh, Scotland, May 23 (2006)
T2MC: A Peer-to-Peer Mismatch Reduction Technique by Traceroute and 2-Means Classification Algorithm∗
Guangyu Shi1, Youshui Long1, Jian Chen1, Hao Gong1, and Hongli Zhang2
1 Huawei Technologies Co., Ltd, Shenzhen, 518129, China, {shiguangyu,longyoushui,jchen,haogong}@huawei.com
2 Harbin Institute of Technology, Harbin, 150001, China, [email protected]
Abstract. Peer-to-Peer (P2P) technology has many potential advantages, including high scalability and cost-effectiveness. However, the performance of most P2P systems suffers from the mismatch between the overlay topology and the underlying physical network topology, which causes a large volume of redundant traffic in the Internet. Much research has addressed this issue, but most results still have drawbacks. In this paper, we propose a simple but efficient topology matching technique, T2MC, which uses the peers’ Traceroute results to execute 2-Means Classification and thereafter lets peers build efficient “close” clusters. By performing experiments using measured realistic Internet data of China, we show that T2MC outperforms the well-known GNP in both accuracy and maintenance cost.
Keywords: Peer-to-Peer, mismatch, clustering.
∗ The research is partially supported by the National Information Security 242 Program of China under Grant No. 2006A18.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 366–374, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction
Since the emergence of Peer-to-Peer (P2P) file sharing applications, millions of users have started using their home computers for more than browsing the Web and exchanging e-mails. With the development of networks, the P2P model is quickly emerging as a significant computing paradigm of the future Internet. There are currently several P2P systems in operation and many more are under development. According to how they administer peers, they can be broadly classified into four categories: central-directory-based P2P systems, e.g., Napster [1]; decentralized unstructured P2P systems, e.g., Gnutella [2] and KaZaA [3]; decentralized structured P2P systems, e.g., Chord [4], CAN [5], and Pastry [6]; and reinforced P2P systems such as Bamboo [7], Accordion [8], and PROP [9], which improve mismatch handling, churn resilience, and heterogeneity awareness. In a P2P system, an attractive feature is that peers do not need to directly interact with the underlying physical network, providing many new opportunities for user-level
development and applications. Nevertheless, because a peer chooses its logical neighbors randomly, without any knowledge of the physical topology, a serious topology mismatch arises between the P2P overlay network and the physical network. This mismatch is a major factor that delays the lookup response time, which is determined by the product of the number of routing hops and the logical link latency. The mismatch also causes a large volume of redundant inter-domain traffic between ISPs, which is an important reason why ISPs may prohibit P2P applications [10, 11, 12]. Studies in [13] show that about 75 percent of the query response paths suffer from the topology mismatch problem with 8000 logical peers on 27000 physical nodes.

The issue of mismatch in P2P systems has been the focus of intensive research in recent years. The research can be classified into four categories as follows.

The first representative studies to address the mismatch problem are peer coordinate systems such as GNP [14] and Vivaldi [15], which demonstrated the possibility of calculating synthetic coordinates to predict Internet latencies. GNP relies on a small number (18) of “landmark” nodes; other nodes choose coordinates based on RTT measurements to the landmarks. The choice of landmarks significantly affects the accuracy of GNP’s RTT predictions. Requiring that certain nodes be designated as landmarks may be a burden on P2P systems. Vivaldi is a simple, light-weight algorithm that assigns synthetic coordinates to hosts such that the distance between the coordinates of two hosts accurately predicts the communication latency between the hosts. Although Vivaldi ensures that nodes always decrease the prediction error and does not depend on a choice of landmarks, it is so complex that it is hard to deploy in today’s Internet. In addition, the convergence of Vivaldi’s coordinate system is a slow process.
The second approach is the hierarchical location-based node IDs for P2P systems proposed in [16]. Physical locations and network distances are embedded in the node IDs, thereby improving routing locality. However, an important open question is how embedding location prefixes in node IDs affects load balance.

Clustering algorithms such as Coral [17] take a third approach. In the Coral system, a peer discovers a cluster of router IPs by tracerouting to random targets, then chooses the first five routers as its close cluster and joins that cluster. This scheme is coarse-grained and has difficulty distinguishing relatively close nodes. It is also unclear why the first five routers should be the criterion for a close cluster.

Finally, measurement-based mechanisms such as Bamboo [7] have emerged in the last few years. They select proximity neighbors by continuously monitoring daily DHT operation. The disadvantage of this scheme is that it is hard to measure and learn adequate neighbor information during a peer’s limited lifetime.

To address the limitations of the work cited above, we identify several basic characteristics of the current networking paradigm and propose a simple but efficient topology matching technique, called T2MC, which uses the peers’ Traceroute results to execute 2-Means Classification and thereafter lets peers build efficient “close” clusters. Using the T2MC technique, we can optimize the overlay topology by identifying and replacing the mismatched connections.

The rest of this paper is organized as follows. Section 2 describes T2MC in detail. Simulation methodology and performance evaluation are discussed in Section 3. We conclude the work in Section 4.
2 The T2MC Technique
Optimizing inefficient overlay topologies can fundamentally improve P2P search efficiency and decrease redundant traffic. T2MC exploits several basic characteristics of the current Internet and lets peers that belong to the same ISP cluster together, without any centralized control or predefined system parameters. Below we present two key aspects of the design of T2MC.

2.1 The Characteristics of Internet Paradigm
The Internet has two basic characteristics. On one hand, because many terminal peer devices randomly join and leave the network, the topology is dynamic and variable. On the other hand, routers and switches split the real network into different AS domains, and the overall topology of the Internet is strictly hierarchical. Over a long observation period, it is clear that the core routers are more stable than the terminal peers. Based on this characteristic, we make use of the relationship between peers and routers in terms of topological location. Meanwhile, the inter-ISP links constitute a considerably small portion of the total links of the Internet, and such inter-ISP links usually have a much higher delay than intra-ISP links. As in Fig. 1, a mass of inner routers from different AS domains are connected together by a few “star” edge routers. This inspired us to divide the nodes into “near” and “remote” router clusters. We accomplish this with the peers’ Traceroute results. Fig. 2 shows an example of the path information obtained when R1 traceroutes to R8. As seen from Fig. 2, the links between R3 and R4 and between R5 and R6 show large latency leaps compared with the others. This usually means that those hops cross different AS domains or different ISP ranges. Finding these latency leaps is the core of distinguishing the “near” and “remote” router clusters in T2MC.

[Figures 1 and 2: routers R1 to R8 spread over domains AS1, AS2 and AS3, with link latencies labeled 5 ms, 10 ms, 100 ms, 6 ms, 150 ms, 20 ms and 8 ms.]
Fig. 1. The characteristics of Internet
Fig. 2. The characteristics of Internet
2.2 Traceroute Mechanism
As mentioned before, the stable routers are more suitable as reference marks than dynamic terminal peers. We used the routers’ location relationship as the criterion for judging “near” and “remote” routers. T2MC uses the peers’ Traceroute results to execute 2-Means Classification and thereafter lets peers build efficient “close” clusters.

Firstly, when a peer joined the P2P network, it randomly picked an Internet IP address and probed it using the traceroute tool. The peer recorded the results in a vector of tuples, one for every router in the traceroute path, whose attributes include Hops and Latency; Latency means the link delay between this router and its previous hop. In T2MC, the peer chose an IP address whose leading bits differ from its own; if this did not work, it randomly chose another suitable IP address to probe.

Secondly, based on the Latency attributes of the vector tuples, T2MC used the 2-means classification algorithm to classify the Internet routers into “near” and “remote” routers. The 2-means classification algorithm is a special case of the k-means classification algorithm [18] with k = 2. In this way, T2MC clustering avoided any presumed hop threshold such as the one used in Coral [17]. The 2-means classification algorithm in T2MC includes the following four steps.
Step 1. The peer chose the tuples with the minimum and maximum Latency attribute as the centroids of two initial sets, “first” and “second”.
Step 2. The peer took the Latency attribute of each tuple in the vector, calculated its absolute distance to each of the two centroids in turn, and then associated the tuple with the set whose centroid is at the smaller absolute distance.
Step 3. The peer calculated the latency mean and variance of the two sets.
Step 4. If the variance was larger than the threshold, the peer picked the two latency means as the new centroids of the “first” and “second” sets, then looped from Step 2 until the two sets achieved the minimum total intra-set variance.
This produces a separation of the routers into two groups from which the metric to be minimized can be calculated. In the end, the two sets “first” and “second” are fixed. The pseudo code of the 2-means classification algorithm in T2MC is shown in Fig. 3.

structure Routerinfo
    ip ipaddress        ▷ the IP of the router
    double srcdelay     ▷ the delay from this router to the source
    double hopdelay     ▷ the link delay

procedure Get2meansCluster(allrouters[1...routenumber])
    firstcenter ← allrouters[min].hopdelay
    secondcenter ← allrouters[max].hopdelay
    oldE ← -1           ▷ the previous variance
    E ← 0               ▷ variance
    while |E - oldE| > 0.000001
        firstlist.clear()
        secondlist.clear()
        foreach i ← 1 to routenumber
            if |allrouters[i].hopdelay - firstcenter| <= |secondcenter - allrouters[i].hopdelay|
                then firstlist.add(allrouters[i])
                else secondlist.add(allrouters[i])
        end foreach
        firstcenter ← the mean of firstlist.hopdelay
        firstE ← the variance of firstlist.hopdelay
        secondcenter ← the mean of secondlist.hopdelay
        secondE ← the variance of secondlist.hopdelay
        oldE ← E
        E ← firstE + secondE
    end while
    Ret ← new Vector[2]
    Ret[0] ← firstlist      ▷ the “close” router cluster
    Ret[1] ← secondlist     ▷ the “far” router cluster
    return Ret[0…1]

Fig. 3. Pseudo code of 2-means classification

Fig. 4. Measured Internet topology of China
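The 2-means procedure of Fig. 3 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it works on bare per-hop latencies instead of router tuples, uses the absolute change in total variance as the stopping test, and adds an iteration cap; the function name, the parameters `eps` and `max_iter`, and the sample latencies (taken from the labels visible in Figs. 1 and 2, in the order they appear) are assumptions.

```python
from statistics import mean, pvariance

def two_means(latencies, eps=1e-6, max_iter=100):
    """Split 1-D link latencies into a low ("first") and a high ("second") set."""
    # Step 1: start from the minimum and maximum latency as centroids.
    first_c, second_c = min(latencies), max(latencies)
    first, second = list(latencies), []
    old_e = None
    for _ in range(max_iter):
        # Step 2: assign each latency to the nearer centroid (ties go to "first").
        first = [x for x in latencies if abs(x - first_c) <= abs(x - second_c)]
        second = [x for x in latencies if abs(x - first_c) > abs(x - second_c)]
        if not second:  # all latencies identical: nothing to separate
            break
        # Step 3: recompute means and the total intra-set variance.
        first_c, second_c = mean(first), mean(second)
        e = pvariance(first) + pvariance(second)
        # Step 4: stop once the total variance no longer changes.
        if old_e is not None and abs(old_e - e) <= eps:
            break
        old_e = e
    return first, second

hop_latencies_ms = [5, 10, 100, 6, 150, 20, 8]  # illustrative values only
near_latencies, far_latencies = two_means(hop_latencies_ms)
print(near_latencies, far_latencies)
```

On this input the two latency leaps (100 ms and 150 ms) end up in the "second" set and the remaining links in the "first" set, which matches the intuition given for Fig. 2.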
Finally, the peer chose the router with the minimum Hops attribute in the “second” set as the hop threshold. This router and all routers whose Hops attribute is larger than the threshold are classified as the “remote” router cluster. The remaining routers form the “near” router cluster. The peer chose the router with the maximum Hops attribute in the “near” router cluster as its Edge Gateway, and then registered the Edge Gateway and the “near” cluster in the P2P overlay as DHT PUTs. If two peers shared the same Edge Gateway or any member of their “near” router clusters, they would gather together to form a “close” peer cluster. Because the Edge Gateway carries more valuable information than the other members of the “near” router cluster, T2MC was designed to let peers register, search and save Edge Gateway information with higher priority. In T2MC, each peer cluster has a ClusterId generated by consistent hashing when the first two peers decide to form a cluster. The ClusterId and its member peer IDs are PUT into the DHT (or another similar information retrieval means), and a peer can find the other members of the same cluster by a DHT GET with the ClusterId it remembered from its last online session. So, in T2MC, the criterion for judging close neighbor peers is based on the much more stable routers in the underlay. The overhead of this technique is the traceroute operation when a peer joins the system. In contrast, in the GNP technique the “close” relationship of peers is judged by their distance in a Euclidean space, and the overhead is that each peer must probe the landmarks to measure the RTT delay. The simulation results in the next section demonstrate this contrast.
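The threshold and Edge Gateway selection described above can be sketched as follows. This is only an illustration under stated assumptions: the `Hop` tuple with fields `ip`, `hops` and `latency`, the function names, and the router names R2 to R8 with their latencies are all hypothetical, and the "second" set is assumed to be non-empty.

```python
from typing import NamedTuple

class Hop(NamedTuple):
    ip: str         # router identifier (hypothetical names, not real addresses)
    hops: int       # hop count from the probing peer
    latency: float  # link delay to the previous hop, in ms

def split_path(path, second_ips):
    """Derive the "near"/"remote" router clusters and the Edge Gateway.

    `second_ips` holds the routers that 2-means put in the high-latency set;
    it is assumed to be non-empty.
    """
    # The hop threshold is the smallest hop count found in the "second" set.
    threshold = min(h.hops for h in path if h.ip in second_ips)
    near = [h for h in path if h.hops < threshold]
    remote = [h for h in path if h.hops >= threshold]
    # The Edge Gateway is the farthest router that is still "near".
    edge = max(near, key=lambda h: h.hops) if near else None
    return near, remote, edge

def same_cluster(near_a, near_b):
    """Two peers cluster together if their "near" router sets overlap."""
    return bool({h.ip for h in near_a} & {h.ip for h in near_b})

names = ["R2", "R3", "R4", "R5", "R6", "R7", "R8"]
delays = [5, 10, 100, 6, 150, 20, 8]  # illustrative assignment of latencies
path = [Hop(n, i + 1, d) for i, (n, d) in enumerate(zip(names, delays))]
near, remote, edge = split_path(path, second_ips={"R4", "R6"})
print([h.ip for h in near], edge.ip, [h.ip for h in remote])
```

Since the Edge Gateway is itself a member of the "near" cluster, the overlap test in `same_cluster` also covers the case of two peers sharing the same Edge Gateway.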
3 Simulations and Analysis
In this section, we choose GNP as the technique for comparison. By performing experiments based on measured realistic Internet data of China, we compare performance and overhead. Then we discuss some further improvements of T2MC.

3.1 Simulation Performance Metrics
We mainly focus on three performance metrics: the accuracy of clustering “close” and “far” peers, the average delay in clusters, and the average maintenance traffic cost.

The accuracy of clustering is one of the most important performance metrics for topology matching algorithms. Low accuracy increases the total response delay and the redundant traffic on the Internet backbone. Internet measurement [19] reports that roughly 75% to 90% of flows have RTTs of less than 200 ms. Hence, we define the real “close” peer cluster as those peers whose latency from the source is less than 100 ms in the real network. The accuracy can then be measured as the result of an “AND” operation between the “close” peer cluster formed by T2MC and the real “close” peer cluster.

We define the average latency in a cluster as the average link latency in each “close” peer cluster. The lower the average latency, the more efficient the cluster.

Traffic cost is another parameter of great concern to ISPs. Heavy network traffic is the direct reason why many ISPs dislike P2P. We define the traffic cost as the number of network messages used in the clustering process.
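The accuracy metric can be illustrated with a short sketch that treats the “AND” operation as a set intersection. The text does not say how the result is normalized, so dividing by the size of the formed cluster is an assumption, as are the function names and the toy latencies.

```python
def real_close_peers(latency_ms, limit_ms=100.0):
    """Peers whose measured latency from the source is below the limit."""
    return {peer for peer, ms in latency_ms.items() if ms < limit_ms}

def clustering_accuracy(formed, real_close):
    """Share of the formed "close" cluster that is also really close.

    The normalization by the formed cluster size is an assumption.
    """
    formed, real_close = set(formed), set(real_close)
    return len(formed & real_close) / len(formed) if formed else 0.0

# Toy example: latencies from one source peer to four other peers.
latency_ms = {"a": 20.0, "b": 80.0, "c": 140.0, "d": 95.0}
formed_cluster = {"a", "b", "c"}  # hypothetical output of a clustering scheme
real_cluster = real_close_peers(latency_ms)
accuracy = clustering_accuracy(formed_cluster, real_cluster)
wrong_close = formed_cluster - real_cluster  # "wrong close peers"
print(accuracy, wrong_close)
```

The set difference in the last step corresponds to the "wrong close peers" counted in the experiment results.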
3.2 Measurement Methodology
To evaluate the effectiveness of T2MC, we need actual topology data covering the major ISP routers and most city-level routers in China. We used 9 probing sources to collect the data. The sources are located in Anhui, Gansu, Hainan, Heilongjiang, Hebei, Guangxi, Beijing, Shanghai and Guangzhou, nine locations in mainland China. To ensure the effectiveness of probing, we intentionally selected targets from a large set of IP addresses within the China range, choosing 8 IP addresses from each /24 subnet. Each source ran 20 processes that sent ICMP traceroute messages to the targets and calculated the RTT values by monitoring the response messages. If 10 consecutive hops did not respond, the probing process terminated. The RTT difference between the previous hop and the next hop was taken as the link latency. The measurements from all probing sources were conducted at roughly the same time, and the whole measurement lasted three days. As a result, the IP addresses of 76116 routers, 128083 links and their latencies were obtained. They belong to a diverse set of prefixes originating from ASes across the entire Chinese Internet hierarchy. Of the 76116 routers, 33300 are edge routers. The measured routers are spread all over China, as shown in Fig. 4.

3.3 Experiment Parameters
The logical topology represents the overlay topology built on top of the physical topology. We used the PlanetSim simulator to generate the logical topology, with 10000 peers in each round. The peers randomly attached to the edge routers in the physical topology. The J-sim simulator provided an interface to connect the physical topology and the logical topology. For the number of landmarks, we considered that the GNP paper set only 15 landmarks for RTT measurement across the whole world, so for the mainland China topology 8 landmarks should be enough. In our experiments, we therefore set the number of GNP landmarks to 8, and the landmarks were chosen randomly among all nodes of the P2P system. For accuracy, we ran 50 rounds of each experiment, each with 10000 newly generated peers.

3.4 Experiment Result
First of all, we measure the accuracy of clustering “close” and “far” peers. We compare the “close” peer cluster chosen by each topology matching technique with the real “close” peers, whose latency to each other is less than 100 ms in the real network. From Fig. 5(a), we find that the number of wrong close peers in a close cluster is about 65% lower in T2MC than in GNP. This is mainly because in GNP the landmarks used as probing targets are chosen randomly. Every peer calculates its own GNP distance and clusters by probing the landmarks, so the number and location of the landmarks directly determine the performance of GNP schemes. In the T2MC technique, there is no landmark and the probing target is random, which avoids the limitation that landmarks impose on peer clustering. Furthermore, through efficient clustering, T2MC can remove a large volume of inter-domain redundant traffic compared with GNP.
[Figure 5: four panels, each plotting a percentage (%) against the experiment round (times, 1 to 49). (a) the comparison of wrong close peers in clusters: the decrease of T2MC relative to GNP in wrong close peers in clusters (%). (b) the comparison in latency of close peer clusters: the decrease of T2MC relative to GNP in latency of close peer clusters (%). (c) the comparison of inter-domain message overhead: the improvement of T2MC over GNP in inter-domain message overhead (%). (d) the comparison in whole message overhead: the improvement of T2MC over GNP in whole message overhead (%).]
Fig. 5. Performance comparison of T2MC and GNP in measured realistic Internet data (China)
The accuracy of clustering immediately improves the latency. Because the number of “close” peers increases and the accuracy of the “close” peer clusters improves, the average latency in a cluster decreases by approximately 48%, as shown in Fig. 5(b). Lower average latency means quicker P2P interactions. As is clear from Fig. 5(c) and Fig. 5(d), T2MC also does well on another performance metric, the maintenance traffic overhead. T2MC consumes less maintenance bandwidth for cluster construction than GNP, both at the inter-domain level and at the whole-network level. The reason is that in T2MC each peer probes only one target. In contrast, in the GNP scheme a peer must measure the delay to every landmark, so the number of landmarks determines the maintenance overhead of the scheme. As one can see in Fig. 5(c) and Fig. 5(d), T2MC decreases the average maintenance overhead by 40% at the inter-domain level and by 8% overall.

3.5 Further Improvements
The previous sections have shown that T2MC can achieve favorable performance. For practical deployment, some details still need to be mentioned.
(1) In our measurement process, we only used ICMP traceroute as the probing method. The 76116 router IPs obtained probably represent only about 50% of all routers in mainland China. In future measurements, if we choose multiple probing tools such
as TCP traceroute and UDP traceroute, more router IP addresses will be identified, letting T2MC achieve better topology matching performance.
(2) Peers that do not get any traceroute response during probing cannot find their “close” routers to form a cluster. The solution is to use the peer’s IP address and subnet mask to find its gateway router.
(3) To neglect practical probing errors, before the 2-means classification we raise every Latency attribute below 10 ms to 10 ms in each vector tuple, and lower every Latency attribute above 200 ms to 200 ms.
(4) T2MC uses the peers’ traceroute results to execute 2-means classification and lets peers build their “close” and “far” clusters. As a further improvement, T2MC can apply 2-means classification recursively to the previous “close” and “far” clusters until, in the latest round, the far centroid is less than 2 times the close centroid. In our experience, 3 or 4 rounds of recursive classification are quite enough for China network topologies; we suppose that the worldwide Internet may need only 4 or 5 rounds.
(5) In T2MC, during normal peer-to-peer interactions such as DHT lookup or maintenance, if peers belonging to different clusters find that the delay between them is relatively low, the two clusters decide to combine into a new, bigger cluster. The mapping between the original ClusterIds and the newly generated ClusterId is also registered in the DHT so that peers belonging to the original clusters can find and join the new cluster. This tactic alleviates the problem that arises when peers belonging to the same cluster get different Edge Gateways from their traceroute responses (the “stars” in Fig. 1) and thus may not form the ideal cluster.
(6) T2MC is presented as a novel mismatch reduction technique which outperforms existing schemes. However, T2MC can also be easily applied to existing equivalent techniques without loss of their primary characteristics, for example to choose the landmark peers in the GNP scheme, to choose the placement of cache points in IPTV networks, or to help construct the ALM tree.
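Improvements (3) and (4) above can be sketched together in Python. This is one possible reading, not the authors' code: a split is kept only when the far centroid is at least twice the close centroid, and both halves are then split again. The function names, the `max_depth` cap and the sample latencies are assumptions.

```python
from statistics import mean

def clamp(latency_ms, lo=10.0, hi=200.0):
    """Improvement (3): limit a link latency to the range [10 ms, 200 ms]."""
    return float(min(max(latency_ms, lo), hi))

def split_once(values, max_iter=100):
    """One 2-means pass on 1-D values; returns (close, far)."""
    c1, c2 = min(values), max(values)
    close, far = list(values), []
    for _ in range(max_iter):
        close = [v for v in values if abs(v - c1) <= abs(v - c2)]
        far = [v for v in values if abs(v - c1) > abs(v - c2)]
        if not far:
            break
        n1, n2 = mean(close), mean(far)
        if (n1, n2) == (c1, c2):  # centroids stable: converged
            break
        c1, c2 = n1, n2
    return close, far

def recursive_split(values, ratio=2.0, depth=0, max_depth=5):
    """Improvement (4): keep splitting while far centroid >= ratio * close centroid."""
    close, far = split_once(values)
    if not far or depth >= max_depth or mean(far) < ratio * mean(close):
        return [sorted(values)]
    return (recursive_split(close, ratio, depth + 1, max_depth)
            + recursive_split(far, ratio, depth + 1, max_depth))

raw_ms = [5, 10, 100, 6, 150, 18, 8, 400]  # illustrative link latencies
groups = recursive_split([clamp(x) for x in raw_ms])
print(groups)
```

On this input the recursion stops after separating three latency levels; the depth cap plays the role of the "3 or 4 rounds" mentioned in item (4).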
4 Conclusion
In this paper we presented a simple but effective P2P mismatch reduction technique called T2MC. T2MC uses the peers’ Traceroute results to execute 2-Means Classification and thereafter lets peers build efficient “close” clusters. T2MC exploits several basic characteristics of the current Internet, which allows it to efficiently and accurately group the peers that belong to the same ISP into one cluster. Furthermore, T2MC is completely decentralized, scalable, reliable and self-organizing, without any presumed hop threshold parameters. By performing experiments using measured realistic Internet data of China, we show that T2MC outperforms the GNP scheme in both accuracy and maintenance cost.
References
[1] Napster (2007), http://www.napster.com
[2] Gnutella (2007), http://gnutella.wego.com
[3] KaZaA (2007), http://www.kazaa.com
[4] Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In: Proc. of ACM SIGCOMM (2001)
[5] Ratnasamy, S., Francis, P., Shenker, S.: A Scalable Content-Addressable Network. In: Proc. of ACM SIGCOMM (2001)
[6] Rowstron, A., Druschel, P.: Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems. In: Proc. of Int’l Conf. Distributed Systems Platforms (2001)
[7] Rhea, S., Geels, D., Roscoe, T., Kubiatowicz, J.: Handling Churn in a DHT. In: Proc. of USENIX (2004)
[8] Li, J., Stribling, J., Morris, R., Kaashoek, M.F.: Bandwidth-efficient Management of DHT Routing Tables. In: Proc. of NSDI (2005)
[9] Qiu, T., Chen, G., Ye, M., Chan, E., Zhao, B.Y.: Towards Location-aware Topology in both Unstructured and Structured P2P Systems. In: Proc. of IEEE ICPP (2007)
[10] Srivatsa, M., Gedik, B., Liu, L.: Large Scaling Unstructured Peer-to-Peer Networks with Heterogeneity-Aware Topology and Routing. IEEE Transactions on Parallel and Distributed Systems 17(11), 1277–1293 (2006)
[11] Aggarwal, V., Feldmann, A., Scheideler, C.: Can ISPs and P2P Systems Cooperate for Improved Performance? ACM SIGCOMM Computer Communications Review 37(3), 29–40 (2007)
[12] Shen, G., Wang, Y., Xiong, Y., Zhao, B., Zhang, Z.: HPTP: Relieving the Tension between ISPs and P2P. In: IPTPS (2007)
[13] Liu, Y., Xiao, L., Ni, L.M.: Building a Scalable Bipartite P2P Overlay Network. IEEE Transactions on Parallel and Distributed Systems 18(9), 1296–1306 (2007)
[14] Ng, T.S.E., Zhang, H.: Predicting Internet Network Distance with Coordinates-Based Approaches. In: Proc. of IEEE INFOCOM (2002)
[15] Dabek, F., Cox, R., Kaashoek, F., Morris, R.: Vivaldi: A Decentralized Network Coordinate System. In: Proc. of ACM SIGCOMM (2004)
[16] Zhou, S., Ganger, G.R., Steenkiste, P.: Location-based Node IDs: Enabling Explicit Locality in DHTs. Technical Report CMU-CS-03-171 (2003)
[17] Freedman, M.J., Mazieres, D.: Sloppy Hashing and Self-organizing Clusters. In: Proc. of IPTPS (2003)
[18] Faber, V.: Clustering and the Continuous k-Means Algorithm. Los Alamos Science 22, 138–144 (1994)
[19] Jiang, H., Dovrolis, C.: Passive Estimation of TCP Round-Trip Times. ACM Computer Communications Review 32(3), 75–88 (2002)
Towards an ISP-Compliant, Peer-Friendly Design for Peer-to-Peer Networks
Haiyong Xie, Yang Richard Yang, and Avi Silberschatz
Computer Science Department, Yale University, New Haven CT 06511, USA
{haiyong.xie,yang.r.yang,avi.silberschatz}@yale.edu
Abstract. Peer-to-peer (P2P) applications are consuming a significant fraction of the total bandwidth of Internet service providers (ISPs). This has become a financial burden to ISPs and if not well addressed may lead ISPs to block or put strict rate limits on P2P traffic. In this paper, we propose a new framework, PCP, for designing P2P applications to smoothly fit into the global Internet. In our framework, an ISP decides on how much of its bandwidth is to be allocated to P2P clients, and P2P clients inside the network adopt a peer-friendly algorithm to fairly share the bandwidth. Using the widely-used percentile-based charging model and real traffic traces, we show that an ISP can allocate a large amount of bandwidth dedicated to P2P, without increasing its financial cost. We also show that P2P clients can use the algorithm to fairly share the allocated bandwidth.
1 Introduction
There are increasing concerns about Peer-to-Peer (P2P) applications among Internet Service Providers (ISPs). P2P traffic consumes an increasingly significant fraction of the network bandwidth of ISPs. A recent study [7] estimates that the aggregated traffic of all P2P applications contributes about 60-70% of the traffic in the Internet and about 80% of the traffic in the last-mile providers’ networks. Thus P2P traffic increases network utilization, and can result in performance degradation for other applications. More importantly, P2P traffic is becoming a financial burden to the ISPs. Most ISPs connect to the Internet through their providers, and pay for the transit services. However, it may well happen that P2P clients in an ISP’s network upload a substantial amount of data to their peers in the ISP’s provider networks; hence, the ISP may become a content provider and be charged by its network provider unnecessarily. In a recent study [12] on Skype, the authors find that many universities are hosting a large number of Skype super nodes, and are becoming potential providers to the rest of the Internet. In such networks, the outbound P2P traffic can be 5 times more than normal inbound traffic [10]. This can incur substantially higher operational cost to the university network. The world-wide extra costs due to P2P traffic are estimated to be in excess of €500M per annum [7].
Haiyong Xie is supported by NSF grants ANI-0238038 and CNS-0435201. Yang Richard Yang is supported in part by NSF grants ANI-0238038 and CNS-0435201.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 375–384, 2008. © IFIP International Federation for Information Processing 2008
However, ISPs do not have effective methods to handle the increasing P2P traffic. First, an ISP can upgrade its infrastructure. However, this may increase its over-provisioning cost substantially; further, the upgraded bandwidth might quickly be consumed by the constantly increasing P2P traffic. Second, an ISP can use P2P caching devices to reduce P2P bandwidth consumption. However, many ISPs are reluctant to get involved in content distribution by some P2P applications due to legal concerns. Third, an ISP may choose to block P2P traffic. However, this may involve complicated legal and publicity issues. Last, an ISP can use traffic shaping devices (e.g., [6,9]) to enforce a hard limit on P2P bandwidth consumption. However, P2P applications have taken countermeasures, for instance hiding themselves by using existing application-layer protocols, random port numbers and message stream encryption [1]. Such evasion techniques not only substantially increase complexity but also reduce interoperability and impede the development of P2P applications.

In this paper, we propose a new framework referred to as Peer Coordination Protocol¹ (or PCP for short). The framework allows an ISP to decide the bandwidth it allocates to P2P clients according to its policy constraints; for example, an ISP may want to bound the total extra bandwidth and financial costs incurred by P2P applications. A clear advantage of the framework is that it allows an ISP to explicitly express its policies and constraints on P2P bandwidth usage. We also design algorithms for an ISP to determine P2P bandwidth allocation without increasing its cost in the widely-used percentile-based charging model. We show that our dynamic approach can allocate a large amount of bandwidth (40-60% of total bandwidth) to P2P applications without increasing the ISP’s financial cost.
Thus, our framework not only gives ISPs explicit control over bandwidth allocation, but also gives P2P clients incentive to adopt the approach since they can obtain a large amount of bandwidth which may not be feasible otherwise. In addition, we design a distributed algorithm for P2P clients inside an ISP network to be peer-friendly and share the allocated P2P bandwidth. Using real Internet traces, we demonstrate the effectiveness of our algorithm.
2
A Framework for ISP-Compliant, Peer-Friendly Design
Consider the scenario where an ISP network is connected to its provider through a peering link. Our focus is how the ISP should allocate the bandwidth of this link, and how P2P clients in the ISP should share the allocated bandwidth efficiently. Although we only consider a single peering link in the setup, our framework can be easily generalized to deal with multiple peering links.

The first component is the algorithm to compute the total bandwidth allocatable to all P2P clients in the ISP network (referred to as total P2P capacity). The ISP should communicate the total P2P capacity to P2P clients in its network. One implementation strategy is to use one or multiple capacity allocation servers for this purpose. The ISP’s DNS server maintains a unique DNS resource record (e.g., PCP) for these servers. P2P clients should query them through DNS lookups to obtain the total P2P capacity. Common practices such as DNS-based round robin and IP-based query filtering can be used to balance the query load and preserve privacy.

The second component is the peer-friendly algorithm for P2P clients to fairly share the total P2P capacity. This component is necessary since the ISP may not want to be involved in determining how P2P clients should share the allocated capacity. Further, there may exist multiple P2P applications and/or clients in the ISP network, and they may not be aware of each other. Therefore, it is desirable that the capacity sharing algorithm be distributed.

¹ It is worth noting that, besides P2P, our framework is applicable to other applications as well.

Towards an ISP-Compliant, Peer-Friendly Design for Peer-to-Peer Networks
3
Percentile Networks
We now study an ISP-compliant, peer-friendly P2P design for percentile networks, where an ISP’s transit costs are determined by the widely used percentile-based charging model. We study two common policies where the ISP allocates P2P capacity so that (1) the additional P2P traffic does not increase the ISP’s transit cost; and (2) the total traffic on the transit link does not exceed the link capacity. Note that the ISP may impose additional constraints, for example, a threshold on network utilization. Our framework can be extended to handle such constraints.

The percentile-based charging model is a typical usage-based charging scheme currently in use by many ISPs (e.g., [5]). Consider an ISP that obtains transit services from its provider through a transit link with capacity $C$. The ISP’s provider records the traffic volumes on the transit link in 5-minute intervals, denoted by $v_i$ for the $i$-th interval. Let $V$ be the vector consisting of the volumes in a charging period (typically one month) of $I$ intervals. The provider charges the ISP using the $q$-th percentile of the sorted $V$ (referred to as the charging volume).

We denote by $x_i$ and $y_i$ the non-P2P and P2P traffic volumes on the transit link in the $i$-th interval, respectively. Let $X$ and $Y$ be the vectors of non-P2P and P2P traffic volumes. We also denote by $K$ the number of P2P clients in the given ISP’s network, and by $y_i^k$ the capacity that P2P client $k$ uses on the transit link in the $i$-th interval. Thus $Y^k = \{y_i^k \mid 1 \le i \le I\}$ and $Y = \sum_{k=1}^{K} Y^k$. Let $p = \mathrm{qt}(V, q)$ be the charging volume, i.e., the $q$-th percentile of the traffic volume vector $V$. We formulate the two ISP policies above as follows, with (1) for the cost constraint and (2) for the link capacity constraint:

$$\mathrm{qt}(X + Y, q) = \mathrm{qt}(X, q); \qquad (1)$$

$$\forall i:\; x_i + \sum_{k=1}^{K} y_i^k \le C. \qquad (2)$$
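Constraints (1) and (2) can be checked directly on a recorded trace. The following is a minimal sketch, not the authors' implementation: it assumes a nearest-rank definition of the percentile function `qt` (the text does not say which percentile convention the provider uses), and the helper name `satisfies_policies` is hypothetical.

```python
import math


def qt(volumes, q):
    """q-th percentile of `volumes` using the nearest-rank rule (an assumption)."""
    if not volumes:
        raise ValueError("volumes must be non-empty")
    ordered = sorted(volumes)
    # Nearest rank: smallest value with at least q percent of samples at or below it.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def satisfies_policies(x, y, q, capacity):
    """Check constraints (1) and (2) for non-P2P volumes x and aggregate P2P volumes y.

    Returns a pair (cost_ok, link_ok):
      cost_ok -- qt(X + Y, q) == qt(X, q), i.e. the charging volume is unchanged
      link_ok -- x_i + y_i <= capacity in every interval
    """
    if len(x) != len(y):
        raise ValueError("x and y must cover the same intervals")
    total = [xi + yi for xi, yi in zip(x, y)]
    cost_ok = qt(total, q) == qt(x, q)
    link_ok = all(v <= capacity for v in total)
    return cost_ok, link_ok
```

Under this percentile convention, filling each interval only up to the non-P2P charging volume, that is, $y_i = \max(0, p - x_i)$ with $p = \mathrm{qt}(X, q)$, leaves the $q$-th percentile unchanged, because no interval that was at or below $p$ is pushed above it.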
3.1
Algorithm to Determine Total Allocated P2P Bandwidth
We next present an algorithm for the ISP to determine the total P2P capacity for each interval (i.e., $p - v_i$). However, $p$ is unknown until the end of a charging period, and $v_i$ is unknown until the end of interval $i$. We therefore predict $p$ and $v_i$ at the beginning of each interval. To facilitate the prediction, we collect traffic volumes in previous intervals by repeatedly querying the border router through SNMP².

We take a hybrid sliding window approach to predicting $p$. Specifically, for the first $M$ intervals in a charging period, the traffic volumes of the last charging period (up to and including the last interval) are used for predicting the charging volume; for the remaining $I - M$ intervals, the traffic volumes from the first interval to the latest interval in the current charging period are used for prediction. We denote by $\tilde{p}_i$ the predicted charging volume:

$$\tilde{p}_i = \begin{cases} \mathrm{qt}(V[i-I,\, i-1],\, q) & \text{for } s \le i \le M, \\ \mathrm{qt}(V[s,\, i-1],\, q) & \text{for } M < i < s + I, \end{cases} \qquad (3)$$

where $s = \lfloor i/I \rfloor \cdot I + 1$ is the first interval in the current charging period.

We take a similar approach to predicting $v_i$ for each interval $i$. Specifically, a sliding window of the most recent $N$ intervals is used for the prediction. We denote by $\tilde{v}_i$ the predicted traffic volume: $\tilde{v}_i = \frac{1}{N} \sum_{j=i-N}^{i-1} v_j$. Similarly, we can predict $\tilde{x}_i$. We then estimate the predicted P2P capacity as $\tilde{c}_i = \tilde{p}_i - \tilde{x}_i$ for each interval $i$. Note that the size of the sliding window is a parameter of the prediction algorithm. However, it cannot be too large; otherwise, the diurnal traffic patterns will be lost in the prediction.

[Figure: regions labeled C, B, and A along a scale marked with $\tilde{x}_{i-1}$, $(\tilde{p}_{i-1} + \tilde{x}_{i-1})/2$, and $\tilde{p}_{i-1}$; the label $\tilde{x}_i$ also appears]
Fig. 1. Network status

for each time interval i in a charging period
    obtain $\tilde{c}_i$ for the current interval
    obtain last interval's network state
    // update division factor
    if case A
        $\tilde{K}_i = \tilde{K}_{i-1} * 2$
    else if case B
        $\tilde{K}_i = \max\{1, \tilde{K}_{i-1} - 1\}$
    else if case C
        $\tilde{K}_i = \frac{\tilde{c}_i}{\tilde{c}_{i-1}} \times \tilde{K}_{i-1}$
    // compute its share of total P2P capacity
    $y_i = \tilde{c}_i / \tilde{K}_i$
Fig. 2. A capacity sharing algorithm

3.2
A Distributed, Peer-Friendly Sharing Algorithm
We next present an algorithm for P2P clients to share the allocated P2P capacity. At the beginning of an interval, each P2P client first queries the ISP to obtain the current P2P capacity $c_i$ and then estimates its fair share. The estimation relies on an aggregated network status and an internal division factor. The latter is an estimate of the total number of P2P clients in the network, denoted by $\tilde{K}_i$.

² The router’s MIB keeps a counter for the traffic volume $v_i$ in each 5-minute interval $i$; this counter is implemented in most off-the-shelf routers.
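The prediction step of Section 3.1 (Equation (3), the sliding-window average, and $\tilde{c}_i = \tilde{p}_i - \tilde{x}_i$) can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: intervals are 0-based list indices, `qt` uses a nearest-rank percentile, "the first $M$ intervals" is read as $i - s < M$, and all function and parameter names (`period_len` for $I$, `warmup` for $M$, `n` for $N$) are hypothetical.

```python
import math


def qt(volumes, q):
    """q-th percentile using the nearest-rank rule (an assumption)."""
    if not volumes:
        raise ValueError("volumes must be non-empty")
    ordered = sorted(volumes)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def predict_charging_volume(volumes, i, period_len, warmup, q):
    """Hybrid sliding-window prediction of the charging volume for interval i.

    volumes[j] is the observed total volume of interval j (0-based), j < i.
    """
    s = (i // period_len) * period_len  # first interval of the current period
    if i - s < warmup:
        # Early in the period: use the most recent `period_len` intervals,
        # which reach back into the previous charging period.
        window = volumes[max(0, i - period_len):i]
    else:
        # Later: use only the intervals seen so far in the current period.
        window = volumes[s:i]
    return qt(window, q)


def predict_volume(volumes, i, n):
    """Average of the n most recent intervals before interval i."""
    window = volumes[max(0, i - n):i]
    if not window:
        raise ValueError("no history available before interval i")
    return sum(window) / len(window)


def predict_p2p_capacity(total, non_p2p, i, period_len, warmup, n, q):
    """Predicted P2P capacity: predicted charging volume minus predicted non-P2P volume."""
    p_pred = predict_charging_volume(total, i, period_len, warmup, q)
    x_pred = predict_volume(non_p2p, i, n)
    return p_pred - x_pred
```

The result is deliberately left unclipped, so a negative value signals that the predicted non-P2P traffic already exceeds the predicted charging volume; how an ISP should react in that case is not specified in the text.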
Table 1. Traffic traces and rates (Mbps)

AS     Organization                Traffic Rate  Charging Volume  Total P2P Capacity  %
52     UCLA                        47.550        95.614           48.063              50%
55     University of Pennsylvania  39.793        67.227           27.433              41%
111    Boston University           43.314        85.674           42.359              49%
237    Merit Network Inc.          103.772       185.027          81.254              44%
1249   Five Colleges Network       31.189        63.073           31.887              51%
22753  Red Hat Inc.                28.079        64.611           36.531              57%
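The text does not spell out how the columns of Table 1 relate to each other. The short check below tests one plausible reading, stated here as an assumption: the total P2P capacity is the charging volume minus the traffic rate, and the last column is that capacity as a percentage of the charging volume. The function names are hypothetical; the numbers are copied from the table.

```python
# Rows of Table 1: (AS, organization, traffic rate, charging volume,
# total P2P capacity, reported percentage). All rates are in Mbps.
TABLE_1 = [
    (52, "UCLA", 47.550, 95.614, 48.063, 50),
    (55, "University of Pennsylvania", 39.793, 67.227, 27.433, 41),
    (111, "Boston University", 43.314, 85.674, 42.359, 49),
    (237, "Merit Network Inc.", 103.772, 185.027, 81.254, 44),
    (1249, "Five Colleges Network", 31.189, 63.073, 31.887, 51),
    (22753, "Red Hat Inc.", 28.079, 64.611, 36.531, 57),
]


def capacity_gap(traffic_rate, charging_volume, capacity):
    """Absolute difference between (charging volume - traffic rate) and the listed capacity."""
    return abs((charging_volume - traffic_rate) - capacity)


def capacity_share_percent(charging_volume, capacity):
    """Listed capacity as a whole-number percentage of the charging volume."""
    return round(100.0 * capacity / charging_volume)
```

Under this reading, every row agrees with the table to within a few thousandths of a Mbps, and the rounded share matches the reported percentage.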
Network status: We divide the network status into three cases, as shown in Figure 1. We refer to a network in interval $i-1$ as over-utilized if $\tilde{p}_{i-1} < v_{i-1}$, and under-utilized otherwise. Therefore, case A implies that the network is likely to be over-utilized, while cases B and C correspond to potentially low and high under-utilization, respectively. Note that over-utilization may result in violation of ISP-compliant constraints³; therefore, P2P clients should reduce their traffic rates more aggressively in over-utilization states.

Division factor: Each P2P client adjusts its local division factor using the algorithm in Figure 2. Specifically, the algorithm aggressively doubles the division factor when the network is over-utilized, which halves the client’s share of the total P2P capacity. In case B, the division factor is reduced gradually by one until it reaches the lowest value, 1; thus, the P2P node can gradually increase its share of the total capacity. In case C, the division factor is updated in proportion to the ratio of the P2P capacities in the current and latest interval. It is particularly desirable that the network is over-utilized as rarely as possible, since this reduces the chances of violating ISP-compliant constraints. Proposition 1 (we omit its proof due to space limits) shows that the algorithm guarantees that the network is not over-utilized if it is under-utilized in the preceding interval.

Proposition 1. Assume that the network is under-utilized in interval $i-1$ and $x_i \le x_{i-1}$. Assume also that the prediction for the charging volumes is accurate and that the network has a fixed number of P2P nodes. The peer-friendly sharing algorithm guarantees that the network is not over-utilized in the next interval.
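The per-client update of Figure 2 can be sketched as follows. The division-factor rules follow the figure; the thresholds that separate cases A, B, and C are an assumption read off the axis labels of Figure 1 (A above $\tilde{p}_{i-1}$, B between the midpoint $(\tilde{p}_{i-1} + \tilde{x}_{i-1})/2$ and $\tilde{p}_{i-1}$, C below the midpoint), and the function names are hypothetical.

```python
def classify_status(v_prev, x_prev, p_prev):
    """Classify the previous interval as case 'A', 'B', or 'C'.

    v_prev -- observed total volume in interval i-1
    x_prev -- predicted non-P2P volume for interval i-1
    p_prev -- predicted charging volume for interval i-1
    The midpoint threshold is an assumption based on Figure 1.
    """
    if v_prev > p_prev:
        return "A"  # over-utilized
    midpoint = (p_prev + x_prev) / 2.0
    if v_prev >= midpoint:
        return "B"  # low under-utilization
    return "C"      # high under-utilization


def update_division_factor(case, k_prev, c_now, c_prev):
    """Apply the division-factor rule of Figure 2 for the given case."""
    if case == "A":
        return k_prev * 2.0
    if case == "B":
        return max(1.0, k_prev - 1.0)
    if case == "C":
        if c_prev <= 0:
            # Guard added for this sketch only; Figure 2 does not cover it.
            return k_prev
        return c_now / c_prev * k_prev
    raise ValueError("case must be 'A', 'B', or 'C'")


def fair_share(c_now, k_now):
    """A client's share of the current P2P capacity."""
    return c_now / k_now
```

A client would call `classify_status` once per interval, feed the result into `update_division_factor`, and then send at most `fair_share(c_now, k_now)` during that interval. As in Figure 2, case C applies no lower bound to the division factor.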
4
Evaluations
We use real Internet traffic traces in our evaluations. The traces were collected from Abilene and contain 5-minute traffic volumes from Nov. 1, 2003 to Dec. 31, 2003. We take the traffic volumes in these traces as non-P2P traffic. We use the 95th-percentile charging model with a charging period of one month. For brevity, we present only the results obtained using the traces of AS 111, since the other ASes yield similar results.

³ In a $q$-percentile charging model, the charging volume will increase if the total P2P capacity is over-utilized in more than $(100 - q)$ percent of intervals.
[Figure: charging volume (Mbps) versus interval for the Hybrid, Incremental, and Basic algorithms and for non-P2P traffic]
Fig. 3. Charging volume prediction

[Figure: charging volume (Mbps) versus interval for last interval (N=1), average (N=6), and maximum (N=6) prediction and for non-P2P traffic]
Fig. 4. Traffic volume prediction
We consider three performance metrics. The first is the charging volume, from which we can derive the total costs. The second is the robustness of the capacity prediction algorithm to traffic pattern changes. The third is the efficiency of the capacity sharing algorithm. A desirable algorithm should enable multiple P2P clients to efficiently and fairly share the allocated P2P capacity. We evaluate the following algorithms for charging volume prediction: (1) the algorithm based on Equation (3), (2) a basic sliding window algorithm that uses the most recent $I$ intervals to predict the charging volume, and (3) an incremental sliding window algorithm that uses known volumes in the current charging period only. We also evaluate algorithms that predict the traffic volume as the average and as the maximum volume over a varying number $N$ of recent intervals, respectively.
4.1
Total Available P2P Capacity in Percentile Networks
Table 1 summarizes the results of the available P2P capacity in percentile networks. We observe that all six ASes can allocate 40% to 60% of their bandwidth usage to P2P applications without increasing their bandwidth costs. In other words, these networks can accommodate about as much additional P2P traffic as non-P2P traffic at no extra financial cost. This is clear evidence that a dynamic policy may be more desirable, since it is unlikely that an ISP will statically allocate 40% to 60% of its bandwidth to P2P applications.
4.2
Algorithm to Determine the Total P2P Capacity
Figure 3 plots the results of evaluating charging volume prediction algorithms, assuming that traffic volumes are known a priori. In order to show the dynamics of the algorithms, we plot the charging volumes computed in all intervals; only the last point on each curve represents the real charging volume for the complete charging period. We observe that all three algorithms result in higher charging volumes due to prediction errors. In particular, the hybrid/basic and incremental algorithms increase the charging volume by 3% and 6%, respectively. An ISP thus pays only a small penalty by using these algorithms. Figure 4 plots the results of evaluating traffic volume prediction algorithms, where we use the hybrid sliding window algorithm to predict the charging volume.
We observe that the last-interval prediction algorithm ($N = 1$) increases the charging volume by 12%, while the maximum and average algorithms ($N = 6$) increase the charging volume by only 1% and 3%, respectively. The results are consistent when $N$ is larger. The last-interval prediction relies too much on the most recent history of traffic rates, and is thus likely to have higher prediction errors when the traffic rates fluctuate. The maximum algorithm slightly outperforms the average algorithm. However, we find that it allocates only 26.167 Mbps as the P2P capacity, while the average algorithm allocates 33.210 Mbps (27% more). This suggests that the average algorithm is more desirable as a traffic volume prediction algorithm in percentile networks.
4.3
Distributed, Peer-Friendly Sharing Algorithm
Figure 5 plots the results of evaluating the capacity sharing algorithm. We observe that the charging volume increases by 6%. This penalty is due to the prediction error in P2P capacity prediction (as shown in Section 4.2) and underestimation of the division factor. Figure 6 illustrates how efficiently the P2P capacity is shared by P2P clients. We observe that P2P clients coordinate themselves to share the allocated P2P capacity. We define the P2P capacity utilization ratio as the ratio between the aggregated P2P traffic volumes and the aggregated P2P capacity (the sum of the P2P capacity allocated in all intervals). We find that the P2P capacity utilization ratio is 71%; in other words, the two P2P nodes utilize 71% of all available P2P capacity using the capacity sharing algorithm. We also find that the clients generate approximately equal amounts of traffic. Although the evaluation is very preliminary, the result suggests that the distributed, peer-friendly sharing algorithm allows P2P clients to share the allocated P2P capacity efficiently and fairly.
4.4
Robustness to Changing Traffic Patterns
To evaluate the robustness of the proposed algorithms to Internet traffic dynamics, we repeat the experiments of Section 4.2 using the traces of Dec. 2003, as they have changing traffic patterns. Figure 7 plots the results of charging volume prediction algorithms. We observe that all three schemes are robust to the traffic dynamics, with only about a 9% increase in the final charging volume. We also observe that the hybrid algorithm slightly outperforms the incremental algorithm, and that both of them outperform the basic sliding window algorithm. Figure 8 plots the results of the algorithm to determine the total P2P capacity. We observe that the last-interval prediction is not robust, for the reason described in Section 4.2, while the average and maximum algorithms increase the charging volume by 14% and 7%, respectively. This suggests that the latter two algorithms are more desirable for ISPs having non-regular diurnal traffic patterns to predict the total P2P capacity.
[Figure: charging volume (Mbps) versus interval for peer-friendly sharing and for non-P2P traffic]
Fig. 5. Charging volume when using capacity sharing algorithm

[Figure: traffic volume (Mbps) versus interval, showing non-P2P traffic, P2P node 1’s traffic, P2P node 2’s traffic, the charging volume of non-P2P traffic, and the charging volume of total traffic]
Fig. 6. P2P capacity utilization when using capacity sharing algorithm
5
Discussions
We first consider the problem that a P2P client has to make an initial estimate of the division factor when it first becomes active. A straightforward approach is a random initial estimate. However, the number of P2P clients may change significantly, especially during flash crowds; thus, a random estimate is unlikely to approximate it well. Furthermore, an under-estimate will result in capacity over-utilization; consequently, the division factors will be doubled (halving each client’s share), leading to inefficiency in P2P capacity sharing in the following rounds.

We can instead use an ISP-aided initial estimate. Specifically, the ISP provides $\tilde{K}_i = \frac{p - \tilde{x}_i}{p - v_{i-1}}$ as a reference initial estimate. A new P2P client should obtain this reference and share the P2P capacity only when the network is under-utilized. This approach is desirable in that, as Proposition 2 shows, under some mild assumptions a single new P2P client can join the network without causing the network to be over-utilized.

Proposition 2. Suppose that the assumptions in Proposition 1 hold. If a single new P2P client joins a network and uses $\tilde{K}_i = \frac{p - \tilde{x}_i}{p - v_{i-1}}$ as its initial estimate of the division factor, then the network is under-utilized in the next interval.

We next consider the implications for Internet charging models. One potential concern about the proposed ISP-compliant P2P design is that upstream providers may have incentives to change their charging models. We believe that this is unlikely to happen. ISPs’ traffic is still likely to have diurnal patterns, even with a large fraction of P2P traffic (see, e.g., [10]). Consequently, an ISP’s network should still be designed to handle the aggregated traffic peaks, regardless of P2P traffic. Thus, even if the traffic rate is flattened out at the targeted charging volumes (typically much lower than the peak rates) due to ISP-compliant P2P traffic, the ISP should still be able to handle all traffic.
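Returning to the ISP-aided initial estimate discussed earlier in this section, a small numerical sketch shows what the reference value implies for a newcomer. It assumes, as in Proposition 1, that the charging volume prediction is accurate, so that the current P2P capacity is $p - \tilde{x}_i$; the function names are hypothetical.

```python
def initial_division_factor(p, x_pred, v_prev):
    """Reference initial estimate (p - x_pred) / (p - v_prev).

    p      -- charging volume
    x_pred -- predicted non-P2P volume for the current interval
    v_prev -- observed total volume in the previous interval
    """
    if v_prev >= p:
        # The text asks a new client to join only when the network is
        # under-utilized; v_prev == p would also divide by zero.
        raise ValueError("no spare capacity in the previous interval")
    return (p - x_pred) / (p - v_prev)


def initial_share(p, x_pred, v_prev):
    """Share of the P2P capacity (p - x_pred) taken by the new client."""
    k = initial_division_factor(p, x_pred, v_prev)
    return (p - x_pred) / k
```

Algebraically the share reduces to $p - v_{i-1}$: with this estimate the newcomer claims only the capacity that went unused in the previous interval, which is consistent with the claim of Proposition 2.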
6
Related Work
P2P traffic shaping (e.g., [6,9]) and caching (e.g., [3,8]) can be used to reduce the bandwidth consumption of P2P applications. However, P2P applications often differ in control messages and hide themselves using encryption. Many ISPs are reluctant to get involved in P2P content distribution by deploying P2P caching devices. It remains unclear whether, in the long run, they can effectively reduce operational costs.

More recently, researchers have studied P2P network connectivity control to reduce P2P bandwidth consumption. In particular, Xie et al. proposed iTracker as an ISP portal service for disseminating network status and policy [11], which resulted in the formation of a Distributed Computing Industry Association (DCIA) [4] working group, P4PWG, and large-scale pilot studies on P4P. Aggarwal et al. proposed an oracle as an ISP service to rank the peers for P2P clients [2]. However, [2] offers no study of, and leaves unclear, how effectively ISPs can implement their policies by simply ranking the peers. Our framework differs from both approaches in that it imposes finer-grained control and has the potential benefit of implementing ISP policies more effectively.

[Figure: charging volume (Mbps) versus interval; curves: Basic, Incremental, Hybrid, non-P2P]
Fig. 7. Robustness of traffic volume prediction algorithms

[Figure: charging volume (Mbps) versus interval; curves: last interval, average, maximum, non-P2P]
Fig. 8. Robustness of charging volume prediction algorithms
7
Conclusion
In this paper, we present a new framework for ISP-compliant, peer-friendly P2P design. The framework enables an ISP network to determine the capacity it can allocate to P2P applications. We show that in typical percentile networks, an ISP can allocate a large amount of bandwidth to P2P applications without increasing its financial cost. We also show that P2P clients can use a distributed, peer-friendly algorithm to fairly share the allocated P2P bandwidth.
References
1. Message Stream Encryption, http://www.azureuswiki.com
2. Aggarwal, V., Feldmann, A., Scheideler, C.: Can ISPs and P2P systems cooperate for improved performance? In: ACM CCR (July 2007)
3. CacheLogic. Serving cached content, http://www.cachelogic.com/products/cachepliance.php
4. Distributed Computing Industry Association, http://www.dcia.info
5. Odlyzko, A.: Internet pricing and the history of communications. Computer Networks 36, 493–517 (2001)
6. Packeteer. Packeteer PacketShaper, http://www.packeteer.com/products/packetshaper
7. Parker, A.: The true picture of peer-to-peer filesharing (July 2004), http://www.cachelogic.com
8. Saleh, O., Hefeeda, M.: Modeling and caching of peer-to-peer traffic. In: Proceedings of ICNP 2006, Washington, DC (November 2006)
9. Sandvine. Intelligent broadband network management, http://www.sandvine.com
10. Saroiu, S., Gummadi, K.P., Dunn, R.J., Gribble, S.D., Levy, H.M.: An analysis of internet content delivery systems. In: Proceedings of OSDI 2002, Boston, MA (December 2002)
11. Xie, H., Krishnamurthy, A., Yang, Y.R., Silberschatz, A.: P4P: Proactive provider participation for P2P. Tech. Rep. YALEU/DCS/TR-1377, Yale University (March 2007)
12. Xie, H., Yang, Y.R.: A measurement-based study of the Skype peer-to-peer VoIP performance. In: Proceedings of IPTPS 2007, Bellevue, WA (February 2007)
Cache Placement Optimization in Hierarchical Networks: Analysis and Performance Evaluation
Wenzhong Li¹,², Edward Chan², Yilin Wang¹,², Daoxu Chen¹, and Sanglu Lu¹
¹ State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, 210093, China
² Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
[email protected]
Abstract. Caching popular content in the Internet has been recognized as an effective solution to alleviate network congestion and accelerate user information access. Sharing and coordinating cache data placement provide an opportunity to improve system performance. This paper studies cache placement strategies and their performance in hierarchical network environments. A theoretical model is introduced to analyze the access cost of placing a set of object copies in the cache hierarchy, under which the object placement problem is formulated as an optimization problem. The problem is shown to be decomposable into subproblems, and a dynamic programming algorithm is proposed to obtain the optimal solution. The performance of different caching strategies is evaluated using simulations. It is shown that the proposed algorithm outperforms other cache placement strategies in hierarchical caching systems.
Keywords: cooperative caching, hierarchical caching system, cache placement and replacement.
1
Introduction
Data caching is a widely used technique to improve system performance, i.e., to reduce network traffic, alleviate server load, and decrease access latency. Danzig et al. first showed that by providing caches in a hierarchical arrangement, the network bandwidth consumed for file transfer could be significantly reduced [4]. Following these findings, a hierarchical caching system was implemented in the Harvest system [2].

Coordinating object placement is one of the important issues in hierarchical caching systems. Given a set of caches, their network distances, and access frequencies, cache placement and replacement decide which objects should be stored in the caching system and which objects should be removed in order to minimize the total access cost [5].

In this paper, we study the performance of placement strategies for hierarchical caching using both an analytical model and simulations. We first set up a theoretical model to analyze the accumulative access cost of cache placement strategies. Then we formulate the object placement problem as an optimization problem. We further prove that the problem can be divided into subproblems, and thus propose an optimal solution using dynamic programming. We use simulation to evaluate the performance of different hierarchical caching algorithms. Experimental results show that the proposed strategy outperforms most existing hierarchical caching algorithms.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 385–396, 2008. © IFIP International Federation for Information Processing 2008
2
Related Work
Hierarchical web caching was first proposed in the context of the Harvest [2] project, where a series of caches are arranged hierarchically in a tree-like structure. Requests are first received at the leaf caches and are routed upwards until they reach a cache (or the content server) that stores a copy of the requested document. The object is then sent back on the reverse path to the client, and each cache on this path stores a copy of the object. However, caching redundant object copies may not be an efficient approach, and recent work [6, 7, 3] has shown that the scheme can be improved.

Che et al. [3] analyzed an uncooperative two-level hierarchical caching system where the LRU algorithm is run locally at each cache. On the basis of this analysis, a new algorithm referred to as Filter is proposed. A cache can be viewed roughly as a low pass filter with its cutoff frequency equal to the inverse of the characteristic time. Documents with access frequencies lower than this cutoff frequency have a good chance of passing through the cache without a local copy being stored. However, Filter incurs additional complexity, as it needs to estimate the request frequency of each requested document in all clients and disseminate this information to all the caches.

Laoutaris et al. studied several meta algorithms for hierarchical web caching, namely Prob, LCD (Leave Copy Down), and MCD (Move Copy Down) [6]. A comparison of the three algorithms with LCE (Leave Copy Everywhere) and Filter in [7] showed that LCD performs the best under all the studied scenarios. It even appears to perform better than Filter, despite the fact that it is much simpler.

Korupolu et al. investigated the placement problem for hierarchical cooperative caching, that is, how to fill the available cache space with copies of objects in such a way that the average access cost is minimized [5]. Both exact and approximate polynomial-time algorithms were presented. The exact algorithm was based on a reduction to a min-cost flow problem. However, it does not appear to be practical for large problem sizes.

Tang et al. proposed a coordinated en-route web caching scheme [11]. In this scheme, cache status information along the routing path of a request is used in dynamically determining where to cache the requested object and what to replace if there is not enough space. The object placement problem is formulated as an optimization problem, and the optimal locations to cache the object are obtained using a dynamic programming algorithm. A similar solution can be found in [8]. Their work is the basis of our analysis.
[Figure: tree of caches with Internet content servers at the top, several levels of proxies (hierarchical caching) in the middle, and clients at the bottom]
Fig. 1. Hierarchical caching architecture
3
Hierarchical Caching System
A hierarchical caching architecture is illustrated in Figure 1. In this architecture, a set of cache servers (proxies) are arranged in a hierarchical tree-like form, with the leaf nodes corresponding to the lowest level caches closest to the end users and the root nodes corresponding to the highest level caches. The caches in the higher level can be shared by the caching proxy servers in the lower level, forming a highly scalable caching hierarchy. User requests travel from a given leaf node towards the root node until the requested document is found. If the requested document cannot be found even at the root level, the request is redirected to the content server containing the document. When the document is found, it is passed down through the reverse path to the client. Each cache on the reverse path decides whether to cache the document according to a chosen strategy. We address an important issue in designing hierarchical caching systems: how to appropriately deploy cache objects so that the total user access cost is minimized. In the following sections, we present a theoretical model to investigate this problem.
4
Hierarchical Cache Placement Problem
Figure 2 depicts a scenario of data access in hierarchical caching. A leaf proxy $v$ initiates a query for object $O$. The query is forwarded to the higher levels of the hierarchy until it is served at a proxy $u$. Let node 0 be proxy $u$, node $n$ be proxy $v$, and nodes $n-1, n-2, \dots, 1$ denote the proxies on the path from $v$ to $u$. Let $f_i$ be the access frequency of object $O$ observed by node $i$, i.e., the ratio of the number of requests for object $O$ passing through node $i$ to the total number of requests. When object $O$ is passed down from node 0 to node $n$, a copy of $O$ can be dynamically placed in some of the proxies along the path. The question is which nodes should cache a copy of object $O$ so that the total access cost is minimized.

[Figure: path of nodes $0$ (proxy $u$), $1$, $2$, ..., $i$, ..., $n-1$, $n$ (proxy $v$), annotated with access frequencies $f_1, \dots, f_n$ and replacement costs $m_1, \dots, m_n$]
Fig. 2. Data access model for hierarchical caching
4.1
Cache Placement
Cache placement assigns copies of objects to the nodes along the path subject to the cache capacity constraints. The set of nodes chosen to cache the object is called a placement of cache copies. We first give a formal definition of placement for hierarchical caching.

Definition 1 (HiePlacement). Suppose object $O$ is cached in $K$ intermediate nodes $d_1, d_2, \dots, d_K$, where $0 \le K \le n$ and $1 \le d_1 < d_2 < \dots < d_K \le n$. The set $D_K = \{d_1, d_2, \dots, d_K\}$ is a subset of the nodes along the path. We call $D_K$ a HiePlacement (short for hierarchical placement) of the set $\{1, 2, \dots, n\}$.
4.2
Access Cost
The cost for a node to fetch the requested data object from a nearby proxy is called the access cost of the object. The access cost can be evaluated by different metrics, e.g., the network delay to fetch the object, the total bandwidth consumed, or the communication overhead caused. In this paper, we use network latency to measure the access cost of an object. Note that the strategy proposed in the following sections can also be applied to other cost metrics. Assume the access delay is measured by the number of hops that the object travels to serve a query. The access cost of object $i$ depends on the access frequency of the object, denoted by $f_i$, and the distance to fetch the object, denoted by $dist_i$. Formally, the access cost can be expressed as:

$$cost_i = f_i \cdot dist_i. \qquad (1)$$
4.3
Replacement Cost
As a proxy has limited cache space, when a new object is inserted, one or more objects may have to be removed from the cache to make room for the new object. If an object is removed, the access cost to it increases; this increase is referred to as the replacement cost. The replacement cost of inserting an object equals the total access cost of the evicted objects. We employ a greedy cache replacement strategy in our scheme: each object is associated with a replacement cost value; when cache replacement occurs, the objects with the smallest replacement cost are selected until sufficient space is created.
4.4
Accumulative Access Cost
We now derive the accumulative access cost under a HiePlacement. Consider the scenario of Figure 2. Assume $D_K = \{d_1, d_2, \dots, d_K\}$ ($1 \le d_1 < d_2 < \dots < d_K \le n$) is a HiePlacement. Assume each request is satisfied by the closest copy of the requested object. We use a function $dist(i, a, D_K)$ ($a \le i$) to calculate the distance from node i to the closest node holding the requested object under HiePlacement $D_K$. If no such node exists, it equals $i - a$. Formally, the function can be defined as:

$$dist(i, a, D_K) = i - \tau \tag{2}$$

where $\tau = \max(\{a\} \cup \{d_j \in D_K : a \le d_j \le i\})$.

According to equation (1), for node i, the access cost to the object is $f_i \cdot dist(i, a, D_K)$. We now consider the access cost of all nodes in the path. As shown in the data access model in Figure 2, requests for O that go through node i must also pass through nodes $i-1, i-2, \dots, 1$. Let $F_i = f_i - f_{i+1}$, with $f_{n+1} = 0$. The total access cost to object O under HiePlacement $D_K$ is:

$$\sum_{i=1}^{n} F_i \cdot dist(i, 0, D_K)$$

By placing an object in node i, there is a replacement cost due to cache replacement, denoted by $m_i$. Define a penalty function as:

$$penalty(i, D_K) = \begin{cases} m_i & \text{if } i \in D_K \\ 0 & \text{otherwise} \end{cases}$$

The accumulative access cost in Figure 2 under HiePlacement $D_K$ can be calculated as:

$$\sum_{i=1}^{n} \bigl(F_i \cdot dist(i, 0, D_K) + penalty(i, D_K)\bigr) \tag{3}$$
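The accumulative access cost can be evaluated directly for a small path. The following is a minimal sketch, assuming nodes are numbered 1..n with the content source at node 0 and unit-hop distances; the function names and the sample values of f and m are illustrative and not taken from the paper.

```python
def dist(i, a, placement):
    """Eq. (2): hops from node i to the closest copy among nodes a..i."""
    tau = max([a] + [d for d in placement if a <= d <= i])
    return i - tau


def penalty(i, placement, m):
    """Replacement cost m_i if a copy is placed at node i, else 0."""
    return m[i] if i in placement else 0.0


def accumulative_cost(f, m, placement):
    """Eq. (3) for a path of nodes 1..n; f and m map node id -> value."""
    n = len(f)
    total = 0.0
    for i in range(1, n + 1):
        f_next = f[i + 1] if i < n else 0.0  # f_{n+1} = 0
        F_i = f[i] - f_next                  # requests originating at node i
        total += F_i * dist(i, 0, placement) + penalty(i, placement, m)
    return total


# Illustrative (hypothetical) request frequencies and replacement costs.
f = {1: 10.0, 2: 6.0, 3: 3.0, 4: 1.0}
m = {1: 4.0, 2: 5.0, 3: 1.0, 4: 2.0}

print(accumulative_cost(f, m, set()))  # 20.0: no copy, everything served by node 0
print(accumulative_cost(f, m, {2}))    # 13.0: one copy at node 2
```

In this toy instance a single copy at node 2 lowers the cost from 20 to 13, because the hops saved by nodes 2 to 4 outweigh the replacement cost $m_2$.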
W. Li et al.
Our objective is to find an optimal HiePlacement, so that the total access cost in equation (3) is minimized. We provide a more general definition of the placement problem as follows.

Definition 2 ((a,b)-HiePlacement problem). Given two integers a, b ($0 \le a \le b$) and a set of real numbers $f_a, f_{a+1}, \dots, f_{b-1}, f_b$, $m_a, m_{a+1}, \dots, m_{b-1}, m_b$, where $f_a \ge f_{a+1} \ge \dots \ge f_b \ge 0$ and $m_i \ge 0$ ($i = a, a+1, \dots, b$), assume $D_K = \{d_1, d_2, \dots, d_K\}$ ($a < d_1 < d_2 < \dots < d_K \le b$) is a HiePlacement of the set $\{a+1, a+2, \dots, b\}$. Let $F_i = f_i - f_{i+1}$ and $f_{b+1} = 0$. Define an objective function as

$$C(a, b, D_K) = \sum_{i=a+1}^{b} \bigl(F_i \cdot dist(i, a, D_K) + penalty(i, D_K)\bigr). \tag{4}$$

The (a,b)-HiePlacement problem is: find K and $D_K$ such that $C(a, b, D_K)$ is minimized. According to the definition, the object placement problem in Figure 2 is simply a (0,n)-HiePlacement problem.
5 Solution

Before presenting our solution, we first give the following observation on the objective function.

Theorem 1. Assume $D_K = \{d_1, d_2, \dots, d_K\}$ ($a < d_1 < d_2 < \dots < d_K \le b$) is a HiePlacement of the set $\{a+1, a+2, \dots, b\}$, and $C(a, b, D_K)$ is the objective function of an (a,b)-HiePlacement problem. For any $\mu$ ($1 \le \mu \le K$), the following equation holds:

$$C(a, b, D_K) = C(a, d_\mu - 1, \{d_1, \dots, d_{\mu-1}\}) + m_{d_\mu} + C(d_\mu, b, \{d_{\mu+1}, \dots, d_K\}) \tag{5}$$

Proof. For any $\mu$ ($1 \le \mu \le K$), the set $D_K$ can be divided into three subsets: $P = \{d_1, \dots, d_{\mu-1}\}$, $Q = \{d_\mu\}$, $R = \{d_{\mu+1}, \dots, d_K\}$. Then

$$C(a, d_\mu - 1, P) + m_{d_\mu} + C(d_\mu, b, R) = \sum_{i=a+1}^{d_\mu - 1} \bigl(F_i \cdot dist(i, a, P) + penalty(i, P)\bigr) + m_{d_\mu} + \sum_{i=d_\mu + 1}^{b} \bigl(F_i \cdot dist(i, d_\mu, R) + penalty(i, R)\bigr)$$

According to the definition of dist and penalty, we have:
(a) when $a < i < d_\mu$, $dist(i, a, P) = dist(i, a, D_K)$ and $penalty(i, P) = penalty(i, D_K)$;
(b) when $i = d_\mu$, $dist(i, a, D_K) = 0$ and $penalty(d_\mu, D_K) = m_{d_\mu}$;
(c) when $d_\mu < i \le b$, $dist(i, d_\mu, R) = dist(i, a, Q \cup R) = dist(i, a, D_K)$ and $penalty(i, R) = penalty(i, D_K)$.

So,

$$\begin{aligned}
C(a, d_\mu - 1, P) + m_{d_\mu} + C(d_\mu, b, R)
&= \sum_{i=a+1}^{d_\mu - 1} \bigl(F_i \cdot dist(i, a, D_K) + penalty(i, D_K)\bigr) + F_{d_\mu} \cdot 0 + penalty(d_\mu, D_K) \\
&\quad + \sum_{i=d_\mu + 1}^{b} \bigl(F_i \cdot dist(i, a, D_K) + penalty(i, D_K)\bigr) \\
&= \sum_{i=a+1}^{b} \bigl(F_i \cdot dist(i, a, D_K) + penalty(i, D_K)\bigr) \\
&= C(a, b, D_K)
\end{aligned}$$

The following theorem shows that the optimal HiePlacement can be obtained by solving subproblems.

Theorem 2. Assume a, b, K are integers, $0 \le a \le b$ and $0 \le K \le b - a$. Suppose $D_K = \{d_1, d_2, \dots, d_K\}$ ($a < d_1 < d_2 < \dots < d_K \le b$) and $D_\mu = \{q_1, q_2, \dots, q_\mu\}$ ($a < q_1 < \dots < q_\mu < d_K$) are two HiePlacements. If $D_K$ is an optimal placement for the (a,b)-HiePlacement problem, and $D_\mu$ is an optimal placement for the $(a, d_K - 1)$-HiePlacement problem, then $D = D_\mu \cup \{d_K\}$ is also an optimal placement for the (a,b)-HiePlacement problem.

Proof. As $D_\mu$ is an optimal placement for the $(a, d_K - 1)$-HiePlacement problem, we have

$$C(a, d_K - 1, D_\mu) \le C(a, d_K - 1, \{d_1, d_2, \dots, d_{K-1}\})$$

According to Theorem 1, we have

$$\begin{aligned}
C(a, b, D) &= C(a, d_K - 1, D_\mu) + m_{d_K} + C(d_K, b, \emptyset) \\
&\le C(a, d_K - 1, \{d_1, d_2, \dots, d_{K-1}\}) + m_{d_K} + C(d_K, b, \emptyset) \\
&= C(a, b, D_K)
\end{aligned} \tag{6}$$

On the other hand, since $D_K$ is an optimal placement for the (a,b)-HiePlacement problem,

$$C(a, b, D_K) \le C(a, b, D) \tag{7}$$

Combining (6) and (7), we have $C(a, b, D) = C(a, b, D_K)$. Hence, the theorem is proven.
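The decomposition of Theorem 1 can be checked numerically on a small instance. The sketch below assumes that the values $F_i$ are computed once for the whole problem and reused unchanged in the subproblems, as the proof implicitly does; the helper names and the sample values are hypothetical.

```python
def dist(i, a, placement):
    """Eq. (2): hops from node i to the closest copy among nodes a..i."""
    tau = max([a] + [d for d in placement if a <= d <= i])
    return i - tau


def objective(a, b, placement, F, m):
    """Eq. (4), with the per-node request rates F_i supplied directly."""
    return sum(
        F[i] * dist(i, a, placement) + (m[i] if i in placement else 0.0)
        for i in range(a + 1, b + 1)
    )


def decomposition_gap(a, b, placement, F, m):
    """Largest |lhs - rhs| of Eq. (5) over every split point d_mu."""
    ordered = sorted(placement)
    lhs = objective(a, b, set(ordered), F, m)
    gap = 0.0
    for mu, d in enumerate(ordered):
        left, right = set(ordered[:mu]), set(ordered[mu + 1:])
        rhs = (objective(a, d - 1, left, F, m) + m[d]
               + objective(d, b, right, F, m))
        gap = max(gap, abs(lhs - rhs))
    return gap


# Illustrative (hypothetical) values for a path of six nodes.
F = {1: 4.0, 2: 3.0, 3: 2.0, 4: 1.0, 5: 2.0, 6: 1.0}
m = {1: 4.0, 2: 5.0, 3: 1.0, 4: 2.0, 5: 3.0, 6: 2.0}

print(objective(0, 6, {2, 4, 5}, F, m))          # 17.0
print(decomposition_gap(0, 6, {2, 4, 5}, F, m))  # 0.0
```

A gap of zero for every split point is what Eq. (5) predicts; a single instance is of course only a sanity check, not a proof.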
Theorem 2 shows that the problem can be divided into subproblems; thus dynamic programming can be applied to obtain an optimal solution. Let $OPT(a, b, x)$ be the minimum access cost for nodes a through b when x ($x \le a$) is the closest node to a which holds a copy of the requested object. Let $dist(a, x)$ denote the network distance between node a and node x. We can devise a dynamic programming solution to the (a,b)-HiePlacement problem as follows.

$$OPT(a, a, x) = \min\{m_a,\; f_a \cdot dist(a, x)\}$$

$$OPT(a, b, x) = \min\{OPT(a+1, b, a) + m_a,\; OPT(a+1, b, x) + F_a \cdot dist(a, x)\} \qquad (0 \le a \le b \le n) \tag{8}$$

According to Section 4.4, the solution of the object placement problem in Figure 2 is $OPT(1, n, 0)$.
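The recurrence in Eq. (8) translates almost directly into a memoized recursion. The following is a minimal sketch under the same assumptions as above (a path of nodes 1..n, source at node 0, $dist(a, x) = a - x$); since b is always n here, it is dropped from the state, and the base case uses $F_n$, which equals $f_n$ because $f_{n+1} = 0$. The function names and sample values are illustrative, and the brute-force routine is included only as a cross-check.

```python
from functools import lru_cache
from itertools import combinations


def opt_cost(F, m, n):
    """OPT(1, n, 0) from Eq. (8) on a path of nodes 1..n (source at node 0)."""

    @lru_cache(maxsize=None)
    def opt(a, x):
        # a: node being decided; x: closest node before a that holds a copy.
        place = m[a]           # cache a copy at node a
        skip = F[a] * (a - x)  # serve node a's own requests from node x
        if a == n:
            return min(place, skip)
        return min(opt(a + 1, a) + place, opt(a + 1, x) + skip)

    return opt(1, 0)


def brute_force_cost(F, m, n):
    """Exhaustive minimum of Eq. (3) over all placements (cross-check only)."""
    best = float("inf")
    nodes = range(1, n + 1)
    for k in range(n + 1):
        for placement in combinations(nodes, k):
            cost, last = 0.0, 0
            for i in nodes:
                if i in placement:
                    cost, last = cost + m[i], i
                else:
                    cost += F[i] * (i - last)
            best = min(best, cost)
    return best


# Illustrative (hypothetical) values.
F = {1: 4.0, 2: 3.0, 3: 2.0, 4: 1.0}
m = {1: 4.0, 2: 5.0, 3: 1.0, 4: 2.0}

print(opt_cost(F, m, 4), brute_force_cost(F, m, 4))  # 9.0 9.0
```

For this instance the minimum of 9 is reached by placing copies at nodes 1 and 3. The recursion visits O(n^2) states (a, x), whereas the exhaustive search enumerates 2^n placements.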
6 Performance Evaluation

6.1 The Simulation Environment

In our simulation, we employ a hierarchical network model similar to [9]. We model the network topology as a tree with L levels. Each non-leaf node in the tree has a set of children, whose number is uniformly distributed in the range [1, M]. A content server resides outside the hierarchy. Assume all queries can be served at the content server. Each node needs to maintain visit history information. Similar to [10], we estimate the access probability $p_i$ using a "sliding window" of the K most recent reference times as $p_i = \frac{K}{t - t_K}$, where t is the current time and $t_K$ is the time of the K-th most recent reference. The length of the sliding window is set to 1000, and the value of K is set to 3 in our experiments. We use a synthetic workload to evaluate the performance of the proposed cache replacement policies. The synthetic workload simulates a data set with 10000 items, averaging 25 KB per item. Assume the query arrivals follow a Poisson process with a mean arrival rate of λ. A Zipf-like distribution is used to model the probability that a data item is requested [1]. In this model, the access probability $p_i$ of data item $d_i$ is given by $p_i = \frac{C}{i^{\alpha}}$, where $C = \left(\sum_{k=1}^{N} \frac{1}{k^{\alpha}}\right)^{-1}$ and N is the number of data items. The parameter α ($0 \le \alpha \le 1$) is the skewness parameter of the Zipf-like distribution, indicating the degree of concentration of requests. We assume a homogeneous data access pattern in the mobile clients. That is, all client requests follow the same Zipf pattern.

6.2 Performance Analysis
We employ two widely used metrics in our performance evaluation: cache hit ratio, and average access delay. Three cache placement algorithms are included for comparison: LCE, Prob and LCD [7]. The proposed algorithm is referred to as ”OPT” algorithm in our later analysis.
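As a rough illustration of the workload model of Section 6.1, the sketch below computes the Zipf-like access probabilities and the sliding-window frequency estimate. The function names are hypothetical, and returning 0 when fewer than K references have been seen is an assumption made here, not something stated in the paper.

```python
def zipf_probabilities(n_items, alpha):
    """p_i = C / i**alpha for i = 1..n_items, with C normalizing the sum to 1."""
    weights = [1.0 / (k ** alpha) for k in range(1, n_items + 1)]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]


def sliding_window_rate(ref_times, now, k=3):
    """Estimate K / (t - t_K), where t_K is the K-th most recent reference time."""
    if len(ref_times) < k:
        return 0.0  # assumption: too little history yields a zero estimate
    t_k = sorted(ref_times)[-k]
    return k / (now - t_k)


probs = zipf_probabilities(10000, alpha=0.8)
print(sum(probs))            # close to 1.0
print(probs[0] / probs[99])  # item 1 is about 100**0.8 times more popular than item 100

print(sliding_window_rate([1.0, 4.0, 7.0, 9.0], now=10.0, k=3))  # 0.5
```

With α = 0 the distribution becomes uniform, and larger values of α concentrate requests on the first few items, which is the effect studied in the data access pattern experiment.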
Fig. 3. Performance under various cache size. (a) Cache Hit Ratio. (b) Average Access Delay. [Plots of LCE, LCD, Prob and OPT versus Cache Size (% of Data Set Size).]
Impact of the Cache Size: We first compare the performance of the different cache schemes under different cache sizes. Each node in the experiment has a relative cache size, expressed as a percentage of the total size of the data set, which is varied from 0.25 percent to 1.50 percent in our simulation. Figure 3 compares the cache hit ratio as a function of the relative cache size. As shown in Figure 3(a), the cache hit ratio of all the cache schemes steadily improves as the cache size increases. The cache hit ratio of the LCE scheme increases from 0.2 to 0.4 (about 100 percent improvement), and the cache hit ratio of the OPT scheme increases from 0.37 to 0.53 (about 43 percent improvement). It can also be seen that LCE has the lowest cache hit ratio in all conditions. Prob performs better than LCE, but does not fare as well as LCD. However, LCD, which is considered a simple and efficient cache placement algorithm in hierarchical caching, is not the best in our simulation. OPT obtains the highest cache hit ratio, which is about 10 to 25 percent higher than LCD in most conditions. A comparison of average access delay is shown in Figure 3(b). It can be seen that access delay decreases with increasing cache size. For example, the average access delays of LCE and OPT decrease from 5 down to 3.6 and from 4.1 down to 2.8, respectively (Figure 3(b)). LCE has the highest access delay in both workloads. Prob is a little better than LCE. OPT has the lowest access delay, which indicates that data access in the OPT scheme is much faster; an improvement of 20-30 percent is observed.

Impact of the Network Hierarchy Level: The network hierarchy level has a significant impact on the performance of a caching system. The larger the network, the more proxies cooperate in the caching, and the more likely it is that an object will obtain a hit in the cache. On the other hand, if L is large, a query needs to travel more hops before it reaches the content server. We explore the influence of network topology in this section, with the value of L ranging from 2 to 7. As shown in Figure 4(a), the cache hit ratio increases with the number of hierarchy levels, but not as markedly as with increasing cache size. Again, LCE has the lowest cache hit ratio, and OPT outperforms the other cache schemes.
Fig. 4. Performance under different hierarchy levels. (a) Cache Hit Ratio. (b) Average Access Delay. [Plots of LCE, LCD, Prob and OPT versus Hierarchy levels, from 2 to 7.]

Fig. 5. Performance under various Zipf parameters. (a) Cache Hit Ratio. (b) Average Access Delay. [Plots of LCE, LCD, Prob and OPT versus Zipf-parameter, from 0.90 down to 0.65.]
Figure 4(b) presents the experimental results for average access delay. One can see that the access delay also increases as the number of levels in the hierarchy increases. This is because when L becomes larger, a query generated in a leaf node needs to travel more hops to reach the content server; thus the access delay becomes longer if a query misses in all caches. The OPT algorithm still achieves the lowest access delay among the cache schemes.

Impact of the Data Access Pattern: The data access pattern can affect the performance of a caching system [1]. As mentioned before, the Zipf skewness parameter α indicates the degree of concentration of file accesses. By changing the value of α, we generate a number of synthetic workloads with different data access patterns to investigate the performance of the various cache replacement strategies. Figure 5 shows the performance when the Zipf skewness parameter varies from 0.9 to 0.65. We can see from the graph that when the Zipf parameter decreases, the performance decreases accordingly. A larger Zipf parameter value means more queries are focused on a set of "hot" data items. As a result, cache replacement policies can easily identify the frequently accessed data items and
keep them in the cache. This explains the performance degradation when data accesses become less concentrated. Comparing the performance results shown in Figure 5, we can see that OPT still performs best under different query patterns.
7 Conclusion

Coordinating data placement in hierarchical caching systems helps to reduce user access cost. The novelty of our work is that we study coordinated cache placement strategies using an analytical model and evaluate their performance in a hierarchical network environment. The object placement problem is formulated as an optimization problem and the optimal solution is derived using a dynamic programming algorithm. The performance of cache placement strategies is evaluated by simulations, which show that the proposed algorithm performs well in hierarchical caching systems.
Acknowledgment

The work described in this paper is partially supported by the National High-Tech Research and Development Program of China (863) under Grant No. 2006AA01Z199; the National Natural Science Foundation of China under Grant No. 90718031, 60573106, 60721002; and the National Basic Research Program of China (973) under Grant No. 2006CB303000.
References

[1] Breslau, L., Cao, P., Fan, L., Phillips, G., Shenker, S.: Web caching and Zipf-like distributions: Evidence and implications. In: INFOCOM 1999: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, pp. 126–134 (1999)
[2] Chankhunthod, A., Danzig, P.B., Neerdaels, C., Schwartz, M.F., Worrell, K.J.: A hierarchical internet object cache. In: ATEC 1996: Proceedings of the Annual Technical Conference on USENIX 1996 Annual Technical Conference, Berkeley, CA, USA. USENIX Association, pp. 13–13 (1996)
[3] Che, H., Wang, Z., Tung, Y.: Analysis and design of hierarchical web caching systems. In: Proc. of Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2001), pp. 1416–1424. IEEE, Los Alamitos (2001)
[4] Danzig, P.B., Hall, R.S., Schwartz, M.F.: A case for caching file objects inside internetworks. SIGCOMM Comput. Commun. Rev. 23(4), 239–248 (1993)
[5] Korupolu, M.R., Plaxton, C.G., Rajaraman, R.: Placement algorithms for hierarchical cooperative caching. In: SODA 1999: Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms, Philadelphia, PA, USA, pp. 586–595. Society for Industrial and Applied Mathematics (1999)
[6] Laoutaris, N., Che, H., Stavrakakis, I.: The LCD interconnection of LRU caches and its analysis. Perform. Eval. 63(7), 609–634 (2006), doi:10.1016/j.peva.2005.05.003
[7] Laoutaris, N., Syntila, S., Stavrakakis, I.: Meta algorithms for hierarchical web caches. In: International Conference on Performance, Computing, and Communications (PCCC 2004), pp. 445–452. IEEE, Los Alamitos (2004)
[8] Li, K., Shen, H., Chin, F.Y.L., Zheng, S.Q.: Optimal methods for coordinated enroute web caching for tree networks. ACM Transactions on Internet Technology 5(3), 480–507 (2005)
[9] Rodriguez, P., Spanner, C., Biersack, E.W.: Analysis of web caching architectures: hierarchical and distributed caching. IEEE/ACM Transactions on Networking 9(4), 404–418 (2001)
[10] Shim, J., Scheuermann, P., Vingralek, R.: Proxy cache algorithms: Design, implementation, and performance. IEEE Transactions on Knowledge and Data Engineering 11(4), 549–562 (1999)
[11] Tang, X., Chanson, S.T.: Coordinated en-route web caching. IEEE Transactions on Computers 51(6), 595–607 (2002)
Towards a Two-Tier Internet Coordinate System to Mitigate the Impact of Triangle Inequality Violations

Mohamed Ali Kaafar1, Bamba Gueye1, Francois Cantin1, Guy Leduc1, and Laurent Mathy2

1 University of Liege, Belgium, Research Unit in Networking (RUN)
{ma.kaafar,cabgueye,francois.cantin,guy.leduc}@ulg.ac.be
2 University of Lancaster, UK, Computing Department
[email protected]

Abstract. Routing policies or path inflation can give rise to violations of the Triangle Inequality with respect to delay (RTTs) in the Internet. In network coordinate systems, such Triangle Inequality Violations (TIVs) introduce inaccuracy, as the nodes involved cannot be embedded into any metric space. In this paper, we consider these TIVs as an inherent and natural property of the Internet; rather than trying to remove them, we consider characterizing them and mitigating their impact on distributed coordinate systems. In a first step, we study TIVs existing in the Internet, using different metrics in order to quantify various levels of TIV severity. Our results show that path lengths do have an effect on the impact of these TIVs. In particular, the shorter the link between any two nodes, the less severe the TIVs it is involved in. In a second step, we leverage our study to reduce the impact of TIVs on coordinate systems. We focus on the particular case of the Vivaldi coordinate system and we explore how TIVs may impact its accuracy and stability. In particular, we observed a correlation between the (in)stability and high effective error of nodes' coordinates and their involvement in TIV situations. We finally propose a Two-Tier architecture, as opposed to the flat structure of Vivaldi, that mitigates the effect of TIVs on distance predictions.

Keywords: Internet Coordinate Systems, Performance, Experimentation, Triangle Inequality Violations.
B. Gueye is supported by the EU under the ANA FET project (FP6-IST-27489). F. Cantin is a Research Fellow of the Belgian Fund for Research in Industry and Agriculture (FRIA).

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 397–408, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

As innovative ways are being developed to harvest the enormous potential of the Internet infrastructure, a new class of large-scale globally-distributed network services and applications (e.g. [1], [2]) has emerged. To achieve network topology-awareness, most, if not all, of these overlays rely on the notion of proximity, usually defined in terms of network delays or round-trip times (RTTs), for optimal neighbor selection during overlay construction and maintenance. However, proximity measurements, based on repeated pair-wise distance measurements between nodes, can prove to be very onerous in terms of measurement overhead. Indeed, the simultaneous existence of several overlays can result in significant bandwidth consumption by proximity measurements (i.e. ping storms) carried out by individual overlay nodes [3]. Also, measuring and tracking proximity within a rapidly changing group requires high-frequency measurements. To alleviate such problems, Internet coordinate systems [4,5] have been introduced. These systems embed latency measurements amongst samples of a node population into a metric space and associate a network coordinate vector (or coordinate in short) in this metric space with each node, with a view to enabling accurate and cheap distance (i.e. latency) predictions amongst any pair of nodes in the population. However, Internet latencies, due to routing policies or path inflation [6], do sometimes violate the triangle inequalities which must hold in a metric space. Such Triangle Inequality Violations (TIVs) could be a major barrier to the accuracy of Internet coordinate systems. Suppose we have a network with 3 nodes A, B and C, where d(A, B) is 1 ms, d(B, C) is 2 ms, and d(A, C) is 5 ms, with d(X, Y) denoting the measured delay between X and Y. The triangle inequality is violated because d(A, B) + d(B, C) < d(A, C). Such violations make the embedding of network distances into coordinates less accurate. When faced with these TIVs, coordinate systems resolve them by forcing edges to shrink or to stretch in the embedding space; this intuitively results in oscillations of the embedded coordinates, and thus causes large distance prediction errors.
Ultimately, Internet coordinate systems are used to estimate distances between nodes based on their coordinates only, all the more so when these nodes have never exchanged a distance measurement probe. Coordinates should therefore be both reasonably stable and accurate. A few works considered the removal (or at least the exclusion) of the Triangle Inequality violator nodes from the system to decrease the embedding distortion [7,8]. However, we claim that sacrificing even a small fraction of nodes is not justifiable, since TIVs are an inherent and natural property of the Internet; rather than trying to remove them, we consider characterizing them and mitigating their impact on distributed coordinate systems. A recent work [9] also proposes to remove from the set of neighbors those nodes that underestimate their actual distances to others. These nodes are assumed to be involved in TIV situations, and need to be removed. This approach, however, restricts the neighborhood selection to the closest nodes, depriving the coordinate system of a desirable property, namely the hybrid selection of neighbors. In this paper, we first study the distributions of TIVs existing in the Internet, and we characterize their severity using different metrics. One of our findings is that longer edges cause more severe TIVs. That is to say, considering shorter paths as measurement samples in coordinate systems would be less likely to lead to severe TIVs. Based on this insight, we leverage our study to reduce the impact of TIVs on coordinate systems. To illustrate our results, we focus on the Vivaldi coordinate system, as a prominent representative of purely peer-to-peer (i.e. without infrastructure support) coordinate systems. We then study the ways in which TIVs impact the Vivaldi coordinate system.
We show that TIVs seriously impact the embedding accuracy and the stability of coordinates. In fact, we observed that nodes that are more involved in TIV situations are about half as accurate. These nodes' coordinates have also been shown to oscillate with larger amplitudes. We finally propose a Two-Tier architecture, as opposed to the flat structure of Vivaldi, based on the clustering of nodes. Inside these clusters, nodes compute coordinates to predict local distances, and keep predicting distances to nodes outside their clusters based on the original 'flat Vivaldi'. This hierarchical approach mitigates the effect of TIVs on distance predictions, and allows nodes to embed short distances with very low relative errors.
2 Analysis of Triangle Inequalities in the Internet

We used the p2psim data (1740 nodes) [10] and Meridian data (2500 nodes) [11] to model Internet latency based on real-world measurements. These data sets were obtained following the King [12] measurement technique. King is a technique (similar to ping) that estimates the latency between arbitrary end hosts by using recursive DNS queries. Based on these delay matrices, we study the violations of the Triangle Inequality through different metrics, and characterize their severity and distribution according to path lengths.

2.1 Severity Metrics

Previous studies [13,6,14] have reported characteristics of TIVs in the Internet delay space in terms of the triangulation ratio distribution and the fraction of triangles that suffer from TIVs. Let us consider a triangle ABC. By convention, AB is always the longest edge of a triangle. If $d(A, B) > d(A, C) + d(C, B)$, then ABC is called a TIV, because the triangle inequality is violated. Note that it is enough to consider the inequality with respect to the longest edge AB of the triangle. In this paper, we propose two basic characterizations of the severity of TIVs. The first one is the relative severity, defined as follows:

$$G_r = \frac{d(A, B) - (d(A, C) + d(C, B))}{d(A, B)} \tag{1}$$

$G_r$ ranges from 0 (minimum severity) to 1 (maximum severity). Relative severity is an interesting metric, but it may be argued that for small triangles, a high relative severity may not be that critical. Therefore we also define a second metric called the absolute severity, defined as follows:

$$G_{ar} = \frac{d(A, B) - (d(A, C) + d(C, B))}{Diameter} \tag{2}$$

Note that we have normalized this metric with respect to the Diameter of the network, so that it also ranges from 0 (minimum severity) to 1 (maximum severity). Diameter represents the maximal delay between any two points in the network. In the sequel, we will refer to specific severity thresholds and select TIVs whose severities are above them. We can select all TIVs such that $G_r \ge th_r$, or $G_{ar} \ge th_{ar}$, or both.
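The two severity metrics are straightforward to compute for a single triangle. The following minimal sketch uses a hypothetical function name and takes the delays of the three edges, with AB assumed to be the longest edge as in the convention above. It is applied to the three-node example from the introduction (edges of 1 ms, 2 ms and 5 ms, relabeled so that the 5 ms edge is AB) together with the 800 ms diameter reported for the p2psim data.

```python
def tiv_severity(d_ab, d_ac, d_cb, diameter):
    """Return (G_r, G_ar) for a triangle whose longest edge is AB.

    Returns None when the triangle inequality holds, i.e. there is no TIV.
    """
    excess = d_ab - (d_ac + d_cb)
    if excess <= 0:
        return None
    return excess / d_ab, excess / diameter


# Example from the introduction: edges of 1 ms, 2 ms and 5 ms.
print(tiv_severity(5.0, 1.0, 2.0, diameter=800.0))  # (0.4, 0.0025)

# A triangle that satisfies the triangle inequality.
print(tiv_severity(5.0, 3.0, 4.0, diameter=800.0))  # None
```

This small triangle is a severe TIV in relative terms ($G_r = 0.4$) but a negligible one in absolute terms, which is the distinction that motivates having both metrics.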
2.2 Results

We first define some notations. Let K be the total number of triangles in a data set. For the p2psim data set, we found K = 854,773,676, of which 105,329,511 (representing 12%) are TIVs, whereas K = 2,598,842,308 and the percentage of TIVs is equal to 23.5% for the Meridian data set. We divide the whole range of RTTs in our data set (from 0 to Diameter) into 160 (resp. 600) equal bins of 5 ms each for the p2psim data (resp. Meridian). The maximum delays between any two nodes (i.e. Diameter) in the p2psim data and Meridian data are respectively 800 ms and 3000 ms. The large diameter observed in the Meridian data is probably due to the presence of a few outliers within the data set. However, only 0.03% of all pair-wise RTT measurements in the Meridian data are above 1000 ms.
Fig. 1. Proportion of TIVs in each bin for various TIV severity levels for p2psim data set. (a) thar = 0. (b) thar = 0.025. [Plots of Ki'/Ki versus d(A,B) in 160 bins of 5 ms (i=1..160), for Gr > 0, Gr > 0.10 and Gr > 0.20.]
Let $K_i$ be the number of triangles in the i-th bin. By convention, we say that triangle ABC is in bin i if its longest edge AB is in that bin. Let $K'_i$ be the number of TIVs in the i-th bin. We begin our analysis by showing the proportion of TIVs in each bin, namely $K'_i / K_i$, for different severity thresholds. Figure 1(a) (resp. Figure 2(a)) only considers relative TIV severities (as $th_{ar} = 0$). In Figure 1(b) (resp. Figure 2(b)), we filter out TIVs whose absolute severity is below $th_{ar} = 0.025$ (resp. $th_{ar} = 0.005$), which means below 20 ms (resp. 15 ms) with respect to the diameter of 800 ms for the p2psim data (resp. 3000 ms for Meridian). All these curves have basically the same shape. We can clearly see that large triangles (say above 400 ms) are more likely to be (severe) TIVs. Let $P_i$ be the probability for a triangle ABC, chosen at random in the data set, (a) to be in bin i and (b) to be a (severe) TIV, namely:

$$P_i = \frac{K_i}{K} \cdot \frac{K'_i}{K_i} = \frac{K'_i}{K} \tag{3}$$

Obviously, $P_i$ is simply the number of TIVs in bin i divided by the total number of triangles. The distribution of TIVs in the p2psim and Meridian data is depicted in Figure 3 and Figure 4.
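The per-bin quantities $K'_i / K_i$ and $P_i$ can be computed by enumerating all triangles of a delay matrix. The sketch below is illustrative only: the function name and the four-node matrix are hypothetical, bins are indexed from 0 rather than 1, and a triangle counts as a selected TIV when its excess is positive and both severities reach the thresholds.

```python
from collections import Counter
from itertools import combinations


def tiv_histogram(delay, bin_ms=5.0, th_r=0.0, th_ar=0.0):
    """Return ({bin: K'_i / K_i}, {bin: K'_i / K}) for a symmetric delay matrix."""
    n = len(delay)
    diameter = max(delay[i][j] for i, j in combinations(range(n), 2))
    k_i, k_tiv_i = Counter(), Counter()
    for a, b, c in combinations(range(n), 3):
        edges = sorted([delay[a][b], delay[a][c], delay[b][c]])
        longest, others = edges[2], edges[0] + edges[1]
        bin_idx = int(longest // bin_ms)  # the bin of the longest edge
        k_i[bin_idx] += 1
        excess = longest - others
        if (excess > 0 and excess / longest >= th_r
                and excess / diameter >= th_ar):
            k_tiv_i[bin_idx] += 1
    k = sum(k_i.values())
    proportion = {i: k_tiv_i[i] / k_i[i] for i in k_i}
    p = {i: k_tiv_i[i] / k for i in k_i}
    return proportion, p


# Hypothetical 4-node delay matrix (ms).
delay = [
    [0, 1, 5, 12],
    [1, 0, 2, 4],
    [5, 2, 0, 3],
    [12, 4, 3, 0],
]
proportion, p = tiv_histogram(delay)
print(proportion)  # per-bin share of TIVs among the triangles of that bin
print(p)           # per-bin share of TIVs among all triangles
```

Enumerating triangles costs O(n^3), so for matrices the size of the p2psim or Meridian data sets a vectorized or sampled variant would be needed in practice.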
Fig. 2. Proportion of TIVs in each bin for various TIV severity levels for Meridian data set. (a) thar = 0. (b) thar = 0.005. [Plots of Ki'/Ki versus d(A,B) in 200 bins of 5 ms (i=1..200), for Gr > 0, Gr > 0.10 and Gr > 0.20.]

Fig. 3. Distribution of TIVs in p2psim data set for various severity levels. (a) thar = 0. (b) thar = 0.025. [Plots of Ki'/K (x 100) versus d(A,B) in 160 bins of 5 ms (i=1..160), for Gr > 0, Gr > 0.10 and Gr > 0.20.]

Fig. 4. Distribution of TIVs in Meridian data set for various severity levels. (a) thar = 0. (b) thar = 0.005. [Plots of Ki'/K (x 100) versus d(A,B) in 100 bins of 5 ms (i=1..100), for Gr > 0, Gr > 0.10 and Gr > 0.20.]
As expected, Figure 3 is less conclusive than Figure 1, but it still shows that few severe TIVs are found in small triangles, that is below 100 ms. Similar behavior can be observed in Figure 4 where the edges shorter than 60 ms cause slight violations.
Moreover, the TIV severity of edges has an irregular relationship with their lengths. For instance, in Figure 4 the TIV severity has a peak for edges of around 80-100 ms. This motivates our hierarchical approach to building a coordinate system. If we create clusters whose diameters do not greatly exceed 100 ms, we may expect far fewer severe TIVs in each cluster, which is likely to improve the accuracy of intra-cluster coordinate systems. However, so far the impact of TIVs on the coordinate embedding remains hypothetical. In the next section, we quantify this impact on one of the prominent P2P coordinate systems, namely Vivaldi.
3 Impact of TIVs on Vivaldi

Vivaldi [5], the focus of our present study, is based on a simulation of springs, where the position of the nodes that minimizes the potential energy of the springs also minimizes the embedding error. It is described in further detail in the following section.

3.1 Vivaldi Overview

Vivaldi is fully distributed, requiring no fixed network infrastructure and no distinguished nodes. A new node computes its coordinate after collecting latency information from only a few other nodes. Basically, Vivaldi places a spring between each pair of nodes (i, j) with a rest length set to the known RTT(i, j). An identical Vivaldi procedure runs on every node. Each sample provides information that allows a node to update its coordinate. The algorithm handles high-error nodes by computing weights for each received sample. The sample used by each node i is based on a measurement to a node j, its coordinates $x_j$ and the estimated error $e_j$ reported by j. A relative error of this sample is then computed as follows:

$$e_s = \frac{\bigl|\, \|x_j - x_i\| - RTT(i, j)_{measured} \,\bigr|}{RTT(i, j)_{measured}}$$

The node then computes the sample weight balancing local and remote error: $w_i = e_i / (e_i + e_j)$, where $e_i$ is the node's current (local) error, representing node i's confidence in its own coordinate. This sample weight is used to update an adaptive timestep $\delta_i$, defining the fraction of the way the node is allowed to move toward the perfect position for the current sample: $\delta_i = C_c \times w_i$, where $C_c$ is a constant fraction < 1. The node updates its local coordinates as follows:

$$x_i = x_i + \delta_i \cdot \bigl(RTT(i, j)_{measured} - \|x_i - x_j\|\bigr) \cdot u(x_i - x_j)$$

where $u(x_i - x_j)$ is a unit vector giving the direction of i's displacement. Finally, it updates its local error as $e_i = e_s \times w_i + e_i \times (1 - w_i)$. The reader should note that after convergence of a Vivaldi system, the relative local error variation is of the order of a few percent (e.g. ±0.05).
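The update rules above can be put together into a single step executed by node i for one sample. This is a minimal sketch of those formulas only, not the p2psim implementation: the function name is hypothetical, both error estimates are assumed to be positive, and the fixed fallback direction used when the two coordinates coincide is an assumption made here for determinism.

```python
import math


def vivaldi_update(x_i, e_i, x_j, e_j, rtt, c_c=0.25):
    """One Vivaldi step for node i given a sample (x_j, e_j, rtt) from node j.

    Returns the new coordinate and the new local error of node i.
    """
    diff = [a - b for a, b in zip(x_i, x_j)]
    predicted = math.sqrt(sum(c * c for c in diff))  # ||x_i - x_j||
    e_s = abs(predicted - rtt) / rtt                 # relative error of the sample
    w = e_i / (e_i + e_j)                            # weight of local vs remote error
    delta = c_c * w                                  # adaptive timestep
    if predicted > 0:
        unit = [c / predicted for c in diff]         # u(x_i - x_j)
    else:
        # Assumption: coincident nodes are pushed apart along the first axis.
        unit = [1.0] + [0.0] * (len(diff) - 1)
    x_new = [a + delta * (rtt - predicted) * u for a, u in zip(x_i, unit)]
    e_new = e_s * w + e_i * (1 - w)
    return x_new, e_new


# Node i at the origin, node j at (3, 4): predicted 5 ms, measured 10 ms.
x_new, e_new = vivaldi_update([0.0, 0.0], 1.0, [3.0, 4.0], 1.0, rtt=10.0)
print(x_new, e_new)  # node i moves away from j, roughly to (-0.375, -0.5); error 0.75
```

Because the measured RTT exceeds the predicted distance, the spring is compressed and node i is pushed away from j. Under a TIV, samples from different neighbors pull in incompatible directions, which is the source of the oscillations studied in the next section.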
Vivaldi considers a few possible coordinate spaces that might better capture the underlying structure of the Internet. Nodes can compute their coordinates in different geometric spaces, e.g., 2D, 3D or 5D Euclidean spaces, spherical coordinates, etc.
3.2 Results
In this section, we present the results of an extensive simulation study of the Vivaldi system. For the simulation scenarios, we used the p2psim discrete-event simulator [10], which comes with an implementation of the Vivaldi system. Unless otherwise stated, each Vivaldi node has 32 neighbors (i.e. is attached to 32 springs), half of which are chosen among the closest nodes. The constant fraction $C_c$ for the adaptive timestep (see Section 3.1) was set to 0.25. These values are those recommended in [5]. The system was considered stabilized when all relative errors converged to a value varying by at most 0.02 for 10 simulation ticks. We observed that Vivaldi always converged within 1800 simulation ticks, which represents a convergence time of over 8 hours (1 tick is roughly 17 seconds). Our results are obtained for a 2-dimensional Euclidean space. In order to observe the impact of TIVs on Vivaldi nodes, and in particular on the embedding performance, the notion of involvement in TIVs could be defined in different ways. In this paper, we consider a node to be more or less involved in TIVs according to the number of times it belongs to a TIV. The more often node C appears in bad triangles $A_i B_j C$, $i \ne j$, the more C is considered to be involved in TIV situations.
Fig. 5. Impact of TIV severity on the embedding for p2psim and Meridian data set. (a) CDF of Average relative errors. (b) CDF of Oscillations. [CDF plots versus Average relative error and Average oscillations (in ms), for all nodes and for the top 100 of TIVs involved nodes, in p2psim and Meridian.]
In Figure 5(a), for the p2psim and Meridian data sets, we plot the Cumulative Distribution Function (CDF) of the average relative error (ARE) of the top 100 nodes involved in TIVs. We compare this distribution to the CDF of all nodes' relative errors in a flat Vivaldi system. Note that the ARE is computed for each node as the average of the prediction errors (over all the other nodes) yielded by the Vivaldi system at the last tick of our simulations:

$$ARE_i = \frac{\sum_{j \in S,\, j \neq i} \frac{\left| RTT(i,j) - \|x_i - x_j\| \right|}{RTT(i,j)}}{|S| - 1}$$

where S is the set of all nodes in the system. Figure 5(a) shows that, while for the distribution of errors on all the nodes of the system, more than 90% of p2psim nodes (resp. 70% Meridian nodes) have an ARE
M.A. Kaafar et al.
less than 0.3 (resp. 0.5), this percentage falls to only 50% (resp. 20%) when considering the nodes most involved in TIVs. The coordinate computation at these nodes is badly degraded. It is also worth observing the variation of coordinates in the Vivaldi system. In fact, even though the system converges in the sense that the relative error at each node stabilizes, these errors could be so high that a large variation of a node's coordinate barely affects the associated error. We define the oscillation of a coordinate as the distance between two consecutive coordinates of the same node. The average oscillation values are computed as the average of the oscillations during the last 500 ticks of our Vivaldi simulation. Figure 5(b) shows the CDF of these average oscillations, again comparing the distribution over all nodes with that of the top 100 nodes involved in TIVs. We clearly see that the impact of TIVs is very serious: nodes involved in more TIV triangles see a large increase in their average oscillation values. In light of these observations on the serious impact of TIVs on the coordinate embedding, and based on our findings on the distribution of TIVs in the Internet, the main intuition behind our proposal of a hierarchical Vivaldi structure is to mitigate the impact of the most severe TIVs. In this way, nodes would perform a more accurate embedding at least in restricted spaces (i.e. with small distance coverage).
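The two per-node metrics used in this section, the ARE and the average oscillation, can be sketched as follows. This is a minimal illustration rather than the simulator's code: the function names, the `history` list of successive coordinates and the `window` parameter are assumptions introduced here, and coordinates are taken to be Euclidean tuples.

```python
import math


def average_relative_error(i, rtt, coords):
    """ARE of node i: mean of |RTT(i,j) - ||x_i - x_j||| / RTT(i,j) over j != i."""
    errors = [
        abs(rtt[i][j] - math.dist(coords[i], coords[j])) / rtt[i][j]
        for j in range(len(coords))
        if j != i
    ]
    return sum(errors) / len(errors)


def average_oscillation(history, window=500):
    """Mean distance between consecutive coordinates over the last `window` ticks.

    history holds one coordinate per tick for a single node and must contain
    at least two entries.
    """
    recent = history[-(window + 1):]
    steps = [math.dist(p, q) for p, q in zip(recent, recent[1:])]
    return sum(steps) / len(steps)
```

Collecting either value for every node and sorting the results gives the empirical CDFs of the kind plotted in Figure 5.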
4 Two-Tier Vivaldi
This section is divided into three parts. First, we present an overview of our Two-Tier architecture. Second, we define the clustering method we experimented with. Finally, we compare the results obtained with our Two-Tier Vivaldi to those obtained with a flat Vivaldi.
4.1 Overview
Recall from section 2 that small triangles are less often (severe) TIVs. Any 3 edges with small RTTs (as observed in section 2) should be better suited to constructing a metric space without violating the triangle inequality too much. Put simply, shorter paths are more embeddable than longer paths, which tend to create more severe TIVs with high absolute errors. In this section, we exploit this property to deal with TIV severity and its serious impact on network coordinates, by proposing a Two-Tier Vivaldi approach. The main idea is to divide the set of nodes into clusters and to run an independent Vivaldi in each cluster. Clusters are composed of a set of nodes within a given coverage distance. Since the Vivaldi instances running in each cluster are independent, nodes collect latency information from only a few neighbors located within the same cluster. As a consequence, coordinates of nodes belonging to different clusters cannot be used to estimate the RTT between these nodes. We therefore keep running a Vivaldi system at a higher level (i.e. over the whole set of nodes) to ensure that coordinates computed between any two nodes belonging to different clusters 'make sense'. Coordinates computed at the lower (resp. higher) level of clusters are called local coordinates (resp. global coordinates). Next, we describe how we set up the clusters we experimented with to illustrate our results on the Two-Tier Vivaldi structure.
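The choice between local and global coordinates can be sketched as below. This is a hedged illustration of the idea only: the `Node` record, its field names and `predict_rtt` are hypothetical, and the sketch assumes that the coordinates have already been computed by the two independent Vivaldi instances.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    global_coord: Tuple[float, ...]                 # from the higher-level Vivaldi
    cluster: Optional[int] = None                   # None: node is in no cluster
    local_coord: Optional[Tuple[float, ...]] = None  # from the cluster's Vivaldi


def predict_rtt(a: Node, b: Node) -> float:
    """Use local coordinates inside a cluster, global coordinates otherwise."""
    if a.cluster is not None and a.cluster == b.cluster:
        return math.dist(a.local_coord, b.local_coord)
    return math.dist(a.global_coord, b.global_coord)
```

Local coordinates of two different clusters live in unrelated spaces, which is why the sketch never mixes them and falls back to the global pair.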
4.2 Clustering Method
We first based our cluster recognition on the coordinates obtained by running a flat Vivaldi over the p2psim and Meridian data sets. As already shown by Dabek et al. [5], 2-D coordinates lead to five major clusters of nodes for the p2psim data. We took the three most populated as our main clusters. For the Meridian data, 2-D coordinates lead to three major clusters, with a set of nodes sparsely distributed. Nodes that have not been selected in any cluster are considered only in the higher level of the architecture, using only their global coordinates. However, this first step in our cluster selection is based only on the estimated RTTs (as predicted by the flat Vivaldi) and not on the actual RTTs between nodes. In fact, depending on their relative errors, some nodes can be misplaced by the flat Vivaldi, and clusters could have larger diameters than expected. Our second step therefore consisted in cross-checking our preliminary clustering against the delay matrix. We proceed recursively as follows. For any two nodes belonging to the same cluster, the cluster constraint is that the distance between these two nodes should be less than the cluster diameter. We begin by verifying this cluster constraint on all the pairs of each cluster, according to their actual distance provided by the delay matrix. Afterwards, we remove the node that causes the most violations of the constraint (see footnote 1). We then recursively check the cluster constraint until no pair inside the cluster violates it. Following this cluster selection method, we obtained the clusters described in Figure 6 for the p2psim and Meridian data sets.

(a) p2psim data
            Nodes   Diameter
Cluster 1   565     140 ms
Cluster 2   169     100 ms
Cluster 3   93      60 ms

(b) Meridian data
            Nodes   Diameter
Cluster 1   560     80 ms
Cluster 2   563     80 ms
Cluster 3   282     70 ms

Fig. 6. Characteristics of the clusters
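The cross-checking step of the clustering method can be sketched as an iterative pruning loop. This is an illustrative reading of the description, not the authors' implementation: `prune_cluster` and its parameters are hypothetical names, `delay` is assumed to be a symmetric matrix indexed by node, and ties between equally bad nodes are broken by position in the list, which the text does not specify.

```python
def prune_cluster(members, delay, diameter):
    """Remove nodes until every pair in the cluster is closer than `diameter`.

    At each round the node involved in the most violating pairs is removed,
    then the constraint is checked again on the remaining nodes.
    """
    cluster = list(members)
    while True:
        violations = {node: 0 for node in cluster}
        for idx, a in enumerate(cluster):
            for b in cluster[idx + 1:]:
                # The constraint requires a distance strictly below the diameter.
                if delay[a][b] >= diameter:
                    violations[a] += 1
                    violations[b] += 1
        worst = max(cluster, key=lambda node: violations[node], default=None)
        if worst is None or violations[worst] == 0:
            return cluster
        cluster.remove(worst)
```

For example, with four candidate nodes of which one is about 90 ms away from the three others and a 60 ms diameter, the loop drops that single node and keeps the rest.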
Footnote 1: This node is likely to have a high embedding error.
4.3 Performance Evaluation of the Two-Tier Vivaldi
First, we use the relative error as our main performance indicator. Again, we compute the ARE over all nodes to represent the accuracy of the overall system. Figures 7(a) and 7(b) show the CDFs of the relative error of nodes belonging to our three clusters of Figure 6(a). We clearly see that relative errors computed from local coordinates inside the clusters are much lower than errors computed using global coordinates (the curves labeled Flat Vivaldi). In cluster 2, for instance, more than 90% of nodes predicting their distances inside the cluster achieve on average a relative error less than 0.3. When using the flat Vivaldi, over half of the nodes in cluster 2 compute coordinates with an ARE of more than 0.5. Worse cases (for the flat Vivaldi system) are observed for nodes in cluster 3, as depicted in Figure 7(b), where the flat Vivaldi system collapses with very high effective relative errors for more than 70% of the nodes. In the Two-Tier architecture, nodes clearly perform much better. We observed the same trend for the Meridian data. For lack of space, we do not present the Meridian figures here.

[Figure 7: two CDF plots of the average relative error (x-axis: average relative error; y-axis: CDF). Panel (a): Cluster 1 and 2, with curves Cluster1: Two-Tier, Cluster1: Flat Vivaldi, Cluster2: Two-Tier and Cluster2: Flat Vivaldi. Panel (b): Cluster 3, with curves Cluster3: Two-Tier and Cluster3: Flat Vivaldi.]
Fig. 7. Comparison of relative errors for p2psim data set: Flat versus Two-tier Vivaldi

It is worth noticing here that cluster 3 is the smallest cluster in terms of diameter. The embedding relative errors observed in this cluster thus confirm our findings on the effect of edge lengths on TIV severity and hence on their impact on the embedding. More generally, the improvement inside these clusters is explained by the fact that intra-cluster nodes, when computing their local coordinates, select only close-by nodes as their neighbors. This constrains the lengths of the node-to-neighbor edges and thus reduces the likelihood of selecting severe TIVs. When encountering severe TIVs that cause high absolute errors, a node updates its coordinate by jumping back and forth across its actual position. When limited to TIVs of low absolute severity, a node converges 'smoothly' towards an approximation of its correct position, then sticks to this position and oscillates much less. In essence, it gains confidence in its local error faster (see section 3.1) and performs a more accurate embedding. Limiting the neighborhood to the cluster should then limit the high oscillations due to long and severe TIVs. In a second step, we therefore observed the coordinate oscillations of nodes belonging to our three clusters. Again, we take the average oscillation value as the average of the oscillations during the last 500 ticks of our simulations. We can observe that the three curves representing the CDFs of the oscillations of Two-Tier architecture nodes are the highest in Figure 8(a) and Figure 8(b). This clearly shows that local coordinates of nodes inside our clusters oscillate with less amplitude. For instance, in Figure 8(a) more than 80% of the average oscillations are less than 3 ms when nodes compute local coordinates, whereas only 40% of nodes in flat Vivaldi have oscillations below this value. As can be seen, for the Meridian data (Figure 8(b)), nodes oscillate over large ranges. More than 50% of nodes in flat Vivaldi have oscillations above 5 ms.

[Figure 8: two CDF plots of the average oscillations (x-axis: average oscillations, in ms; y-axis: CDF), each with curves Cluster1: Two-Tier, Cluster1: Flat Vivaldi, Cluster2: Two-Tier, Cluster2: Flat Vivaldi, Cluster3: Two-Tier and Cluster3: Flat Vivaldi. Panel (a): p2psim data. Panel (b): Meridian data.]
Fig. 8. CDF of coordinates oscillations for flat and Two-Tier Vivaldi
5 Discussions and Conclusions
We have presented a Two-Tier architecture to mitigate the impact of potential Triangle Inequality Violations, which often occur in the Internet. We have shown that larger triangles are more likely to be (severe) TIVs, with respect to the distribution of RTTs in the Internet. Previous proposals of different coordinate systems focus on the geometric properties of the coordinate space and look for the space that is most convenient to embed RTTs. Our architecture does not rely on such an approach; it is instead based on clustering nodes to mitigate the impact of severe TIVs. Within their cluster, nodes use more accurate local coordinates to predict intra-cluster distances, and keep using global coordinates when predicting longer distances towards nodes belonging to foreign clusters. Since coordinate systems are often used to characterize the set of close-by neighbors in an overlay distribution or to select the closest download server, our approach thus succeeds where methods based on space dimensionality or properties would fail. Although this paper focused on Vivaldi for measurements and experiments, the proposed Two-Tier architecture is independent of the embedding protocol used. It is important to note that deploying our Two-Tier architecture does not impose any change on the coordinate system process we use. Indeed, apart from running two different instantiations, one at the level of each cluster and one at a higher level, our method does not entail any change to the operations of the embedding protocols. Our proposed method would then be general enough to be applied to coordinates computed by Internet coordinate systems other than Vivaldi. Even though this paper does not address the problem of clustering techniques, and rather uses an oracle-based technique where coordinates and the delay matrix are known, we note that different solutions to such issues have been proposed elsewhere (e.g. [15]).
Finally, we have also quantified the impact of TIVs on the embedding performance, considering that a node is involved in a TIV if it is within a bad triangle. However, it could also be argued that, in different embedding protocols, given a set of neighbors, the node most involved in TIVs does not necessarily affect the measurements and coordinates of its non-neighbors. Other definitions of TIV involvement could be considered in order to refine the impact of TIVs on the embedding process. We can, for instance, consider that a node is involved in a TIV situation if it appears in a bad triangle and it considers the two other nodes of this bad triangle as its neighbors in the coordinate computation. Such new knowledge and our findings characterizing TIV severity could be leveraged to manage neighbor selection in coordinate systems, in order to alleviate the TIV severity at the higher level of our Two-Tier architecture.
References
1. Rowstron, A., Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
2. Chu, Y.h., Rao, S.G., Zhang, H.: A case for end system multicast. In: Proc. the ACM SIGMETRICS, Santa Clara (June 2000)
3. Rewaskar, S., Kaur, J.: Testing the scalability of overlay routing infrastructures. In: Proc. the PAM Conference, Sophia Antipolis, France (April 2004)
4. Ng, T.S.E., Zhang, H.: Predicting Internet network distance with coordinates-based approaches. In: Proc. IEEE INFOCOM, New York, NY, USA (June 2002)
5. Dabek, F., Cox, R., Kaashoek, F., Morris, R.: Vivaldi: A decentralized network coordinate system. In: Proc. ACM SIGCOMM, Portland, OR, USA (August 2004)
6. Zheng, H., Lua, E.K., Pias, M., Griffin, T.: Internet Routing Policies and Round-Trip-Times. In: Proc. the PAM Conference, Boston, MA, USA (April 2005)
7. Ledlie, J., Gardner, P., Seltzer, M.I.: Network coordinates in the wild. In: Proc. NSDI, Cambridge (April 2007)
8. Mendel, M., Bartal, Y., Linial, N., Naor, A.: On metric ramsey-type phenomena. In: Proc. the Annual ACM Symposium on Theory of Computing (STOC), San Diego, CA (June 2003)
9. Wang, G., Zhang, B., Ng, T.S.E.: Towards network triangle inequality violation aware distributed systems. In: Proc. the ACM/IMC Conference, San Diego, CA, USA, October 2007, pp. 175–188 (2007)
10. A simulator for peer-to-peer protocols, http://www.pdos.lcs.mit.edu/p2psim/index.html
11. Wong, B., Slivkins, A., Sirer, E.: Meridian: A lightweight network location service without virtual coordinates. In: Proc. the ACM SIGCOMM, August (2005)
12. Gummadi, K.P., Saroiu, S., Gribble, S.D.: King: Estimating latency between arbitrary Internet end hosts. In: Proc. the SIGCOMM Internet Measurement Workshop, Marseille, France (November 2002)
13. Savage, S., Collins, A., Hoffman, E., Snell, J., Anderson, T.: The end-to-end effects of Internet path selection. In: Proc. of ACM SIGCOMM 1999, Cambridge, MA, USA (September 1999)
14. Lee, S., Zhang, Z., Sahu, S., Saha, D.: On suitability of euclidean embedding of internet hosts. SIGMETRICS 34(1), 157–168 (2006)
15. Min, S., Holliday, J., Cho, D.: Optimal super-peer selection for large-scale p2p system. In: Proc. the ICHIT Conference, Washington, DC, USA, November 2006, pp. 588–593 (2006)
Efficient Multi-source Data Dissemination in Peer-to-Peer Networks
Zhenyu Li1,2, Zengyang Zhu1,2, Gaogang Xie1, and Zhongcheng Li1
1 Institute of Computing Technology, Chinese Academy of Sciences
2 Graduate School of Chinese Academy of Sciences
Beijing, China
{zyli, antonial, xie, zcli}@ict.ac.cn
Abstract. More and more emerging Peer-to-Peer applications require support for multi-source data dissemination. These applications always consist of a large number of dynamic nodes which are heterogeneous in terms of capacity and spread over the entire Internet. Therefore, it is a challenging task to design an efficient data dissemination scheme. Existing studies always ignore the node proximity information or have high redundancy or high maintenance overhead. This paper presents EMS, an efficient multi-source data dissemination scheme for P2P networks. By leveraging node heterogeneity, nodes are organized in a two-layer structure: the upper layer is based on a DHT protocol and composed of powerful and stable nodes, while the nodes at the lower layer attach to physically close upper layer nodes. Data objects are replicated and forwarded along implicit trees, which are based on the DHT route table. The average path length for delivering objects to all nodes converges to O(log n) automatically, where n is the number of nodes in the application. We perform extensive simulations to show the efficacy in terms of delivery path length, delay, redundancy and the effect of proximity-awareness.
Key words: Multi-source multicast, Data Dissemination, P2P Networks.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 409–420, 2008. © IFIP International Federation for Information Processing 2008
1 Introduction
A large number of emerging Peer-to-Peer applications, such as distributed multiplayer games, teleconferencing, remote collaboration and multimedia chat groups, require support for multi-source data dissemination. In these applications, each node is a potential data source. When a new data object emerges on a node, it should be propagated to all the interested members as soon as possible. Since these applications always consist of a large number of dynamic nodes which are heterogeneous in terms of capacity and spread over the entire Internet, it is a challenging task to design an efficient data dissemination scheme. In our opinion, an efficient multi-source data dissemination scheme should meet at least the following basic requirements.
– Scalable performance. The scheme should be decentralized and support node churn. As the system size grows, the efficiency should degrade gracefully.
– Complete coverage. A data object from any source should be delivered to all the other interested nodes.
– Proximity aware. In order to reduce the bandwidth consumption and data transfer delay, the scheme should account for node proximity information.
– Heterogeneity aware. The protocol should take the heterogeneous nature of nodes into account so that the load on a node is proportional to its capacity.
– Low overhead. Both the maintenance overhead of the P2P overlay network and the data message redundancy should be low.
Existing studies [3] [4] [5] on multi-source data dissemination do not meet all the requirements. The probabilistic gossip based hybrid push/pull scheme [3] produces many duplicate messages and only guarantees probabilistic convergence. CAM [4] builds capacity-aware multicast on top of a structured P2P network. Consequently, it has no redundancy. However, the maintenance overhead is considerable and it considers proximity information only to a limited extent, due to the rigid structures of structured overlay networks. ACOM [5] implicitly specifies a delivery tree for each source. However, it suffers from a tradeoff between delivery delay and message redundancy. When the average path length is reduced to O(log n), the average number of duplicate messages grows to O(n/ log n), where n is the number of nodes in the overlay network. In this paper, we propose EMS, an Efficient Multi-Source data dissemination scheme. EMS is motivated by two facts. First, although structured overlay networks need considerable maintenance overhead and have relatively rigid structures, their predefined structures (e.g. ring, hypercube) give us useful information to extract data delivery trees. Second, measurement results in [11] [2] have shown that there exist some relatively powerful and stable nodes in P2P applications.
Therefore, we organize nodes in a two-layer structure: the upper layer is based on a DHT protocol and composed of powerful and stable nodes, while the nodes at the lower layer attach to physically close upper layer nodes. An upper layer node and the lower layer nodes attaching to it constitute a cluster. Data objects are replicated and forwarded on the upper layer along k-ary implicit trees, which are constructed using the DHT route table. Thus, no control message is required to build the tree structures. The data objects are also delivered within the clusters using implicit tree structures, and in each step a node tries to transmit the objects to as many new nodes as it can. We analyze the average path length for delivering objects to all the nodes, and the results show that it automatically converges to O(log n), where n is the number of nodes in the P2P network. We also evaluate the performance of our scheme through comprehensive simulations. The results demonstrate that our scheme is more time- and cost-effective than the others. The rest of the paper is organized as follows. Section 2 provides a survey of related work. Section 3 describes the construction of the hierarchical structure in detail. Section 4 details the data dissemination, and in section 5 we evaluate our scheme through simulation experiments. Finally, we conclude our work in section 6.
2 Related Work
Probabilistic gossip based schemes [3] are reliable, scalable and easy to deploy. Their major limitations are the considerable number of duplicate messages and the merely probabilistic convergence guarantees. Data-driven methods [6] are simple and bring no redundancy. However, they suffer from a tradeoff between control overhead and delay [7], and are therefore not suitable for fast multi-source data dissemination applications. Tree-based schemes always build one tree or several disjoint trees [12] to propagate data. They are optimized for a single source and are not directly applicable to multi-source applications. NICE [8] organizes nodes in a hierarchical structure to achieve high scalability. However, it ignores heterogeneity and is fragile due to the lack of redundancy in the overlay network. Narada [9] extracts source-specific distribution trees from the overlay network. It requires each node to maintain information about all other nodes. Thus, it is only suitable for small-scale systems. Several methods have been proposed for multi-source data dissemination in P2P systems. The schemes in [1] and [4] are based on structured P2P networks. [1] assumes that all nodes have the same capacity and is not suitable for heterogeneous systems. [4] takes node capacity into account. A disadvantage of schemes based on structured overlays is the considerable maintenance overhead. Moreover, the overlays are always too strict to optimize due to their predefined structures, e.g. ring or multi-dimensional coordinates. Although our scheme also leverages DHT protocols to organize the upper layer nodes, the maintenance overhead is relatively small because the upper layer only consists of a small portion of the nodes, which are more stable. ACOM [5], on the other hand, organizes nodes in a ring structure to support any-source multicast. Each node has a nearby neighbor and several random neighbors. A data object is first forwarded via random neighbors within several hops. Then it is transmitted down the ring by the aware nodes, until all the nodes become aware of the object. The major contribution of ACOM is that a data object can be broadcast to all the members in $O(\log_c n)$ hops while no explicit tree is maintained, where c is the average node capacity and n is the number of members. However, this comes at the cost of non-negligible redundancy. When the average path length is reduced to $O(\log_c n)$, the average number of duplicate messages grows to $O(n / \log_c n)$. In addition, the constant coefficient in $O(\log_c n)$ is always 2, which indicates that the hop complexity is larger than in other schemes. As to hierarchical architectures, Shen and Xu [10] propose a landmark clustering based hierarchical structure to account for proximity information. However, they directly use the node Hilbert number as the logical node ID in the auxiliary DHT-based structure. Thus, the identifier randomness in the structure no longer exists, although it is a necessary condition in our scheme. In addition, no upper bound is imposed on the number of regular nodes that can be assigned to a supernode. Therefore, the supernodes may be overloaded.
[Figure 1: diagram of the two-layer structure; the legend distinguishes Dnodes and Onodes.]
Fig. 1. Two-layer structure
3 Hierarchical Structure Construction
Fig. 1 captures the hierarchical structure in our scheme. Powerful and stable nodes, labeled as Dnodes, are organized in a Chord ring. Ordinary nodes, labeled as Onodes, attach to physically close Dnodes. Although we use Chord [13] as a representative DHT protocol, the design extends straightforwardly to other DHT protocols. A Dnode and the Onodes attaching to it constitute a cluster, and the Dnode acts as the rendezvous point of the cluster. In order to avoid overloading Dnodes and to relieve the single point of failure problem, an upper bound is imposed on the number of Onodes that can attach to a Dnode. We call this upper bound the cluster capacity and denote it as $m_c$. The default value of $m_c$ is 16. Nodes have different capacities. Node x's capacity is denoted as $c_x$, which specifies the number of nodes to which x is willing to send data objects concurrently. We assume nodes can estimate their capacities by themselves. In practice, the expected uptime of a node should also be taken into consideration, as in the Gnutella protocol [14].
3.1 Node Join
When a node x joins, it first locates nearby Dnodes. If there is a nearby Dnode y which can still accept ordinary nodes, x attaches to y as one of y's Onodes. Otherwise, depending on x's capacity, x either joins as a Dnode at the upper layer or attaches to a randomly chosen Dnode as an Onode. We leverage the landmark clustering scheme [15] to generate proximity information. r nodes are picked randomly from the Internet as landmark nodes. Node x measures the distance (i.e. delay) to these landmark nodes and obtains a landmark vector $\langle d_{x1}, d_{x2}, \cdots, d_{xr} \rangle$, where $d_{xi}$ is the distance from x to the ith landmark node. The intuition is that physically close nodes are likely to have similar landmark vectors, and are thus close to each other in the landmark space. Then the r-dimensional landmark vector is mapped to a 1-dimensional landmark number using a Hilbert curve, which is an example of a space-filling curve [16]. We partition the landmark space into $2^{rb}$ smaller grids of equal size, where b is the order of the Hilbert curve and controls the number of grids used to partition the landmark space. Then we fit a Hilbert curve on the landmark space to number each grid. A node whose landmark vector falls into grid ln has landmark number ln. Since space-filling curves map an m-dimensional point to a 1-dimensional point without loss of proximity, closeness in landmark number indicates physical closeness. We denote the landmark number of node x as $ln_x$. In the current design, the default values of r and b are 15 and 2, respectively. Each Dnode y publishes its information on the upper layer by storing its landmark number $ln_y$ and address on the successor of identifier $ln_y$. As in [13], we assume new nodes can learn the information of an existing Dnode (denoted as $y_0$) in the upper layer by some external mechanism. New node x asks $y_0$ to issue a Chord query message with its landmark number $ln_x$ as key on the DHT-based upper layer to locate the nearby Dnodes of x. Dnodes whose landmark numbers fall into the range $Q = \{q : |q - ln_x| < Q_x\}$ are recognized as the nearby Dnodes of x, where $Q_x = Q_0 \cdot 2^{(c - c_x)/2}$, c is the average node capacity and $Q_0$ is the basic search radius. Thus, the search radius decreases exponentially as the node capacity grows. The intuition is that the bigger a node's capacity is, the higher the probability that it becomes a Dnode. Note that since each Dnode maintains a continuous Chord identifier space and the information of Dnodes with close landmark numbers is stored closely in the Chord ring, locating nearby Dnodes should be fast. Let $R_x$ denote the set of nearby Dnodes located for x. Node x selects a node $z \in R_x$ and sends z an attaching request message. If z is able to accept an Onode, it satisfies x's request. Otherwise, x selects another node and performs the same operation until one of the Dnodes accepts it. We say that z is able to accept an Onode if the number of Onodes attached to it does not exceed the cluster capacity $m_c$.
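The mapping from a landmark vector to a landmark number, and the capacity-dependent search radius, can be sketched as follows. This is a deliberately simplified illustration: it uses only two landmarks (r = 2) so that the well-known 2-D Hilbert index computation applies, whereas the scheme above uses r = 15. The `max_delay` normalization bound and all function names are assumptions made for this sketch.

```python
def hilbert_index_2d(order, x, y):
    """Index of grid cell (x, y) along a 2-D Hilbert curve of the given order."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so that the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def landmark_number(vector, max_delay, order):
    """Quantize a 2-entry landmark vector into grid cells and number the cell.

    max_delay is a hypothetical upper bound used to scale delays onto the grid.
    """
    cells = 1 << order
    gx, gy = (min(int(d / max_delay * cells), cells - 1) for d in vector)
    return hilbert_index_2d(order, gx, gy)


def search_radius(q0, avg_capacity, capacity):
    """Q_x = Q_0 * 2^((c - c_x) / 2): larger capacity gives a smaller radius."""
    return q0 * 2 ** ((avg_capacity - capacity) / 2)


def nearby_dnodes(ln_x, radius, dnode_numbers):
    """Dnodes whose landmark number q satisfies |q - ln_x| < radius."""
    return [d for d, q in dnode_numbers.items() if abs(q - ln_x) < radius]
```

With an average capacity of 4 and a basic radius of 8, for instance, a node of capacity 6 searches within radius 4 while a node of capacity 2 searches within radius 16.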
If none of the Dnodes in $R_x$ can accept x as an Onode, the subsequent operation depends on x's capacity $c_x$: if $c_x$ is big enough (i.e. $c_x$ exceeds a threshold $C_{thd}$), then x joins the upper layer as a Dnode; otherwise, x asks the well-known Dnode $y_0$ to locate a random Dnode u which can still accept an Onode, and attaches to u. Now we describe how to choose the threshold $C_{thd}$. A Dnode maintains the information of the Onodes attaching to it and of several other Dnodes, as the Chord protocol specifies. This maintenance cost should not be ignored. In our current design, each Dnode dedicates one unit of capacity to maintenance. As mentioned before, k-ary implicit trees are used to transfer data objects among Dnodes, so a Dnode needs a capacity of at least k for this purpose. As detailed in the next section, data objects are also delivered within the cluster using an implicit tree structure. This implicit tree is rooted at the Dnode of the cluster. To expedite data dissemination, the Dnode should have at least two child nodes on the implicit tree within the cluster. Summing these three contributions (1 + k + 2), the threshold is set as in Equ. 1.

$$C_{thd} = k + 3 \qquad (1)$$
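The join decision described in this subsection can be sketched as follows. It is a simplified, single-process illustration, not the protocol itself: the `Dnode` record, the `join` function and its return values are hypothetical, "able to accept" is read here as having fewer than $m_c$ attached Onodes, and the case where no Dnode at all has room is not covered by the text, so the sketch simply reports it.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Dnode:
    name: str
    onodes: list = field(default_factory=list)


def can_accept(dnode, cluster_capacity):
    """Assumed reading: a Dnode has room while it holds fewer than m_c Onodes."""
    return len(dnode.onodes) < cluster_capacity


def join(name, capacity, nearby, all_dnodes, k, cluster_capacity=16, rng=random):
    """Place a new node: nearby Dnode first, then Dnode role, then random Dnode."""
    threshold = k + 3  # C_thd from Equ. 1
    for dnode in nearby:
        if can_accept(dnode, cluster_capacity):
            dnode.onodes.append(name)
            return ("onode", dnode.name)
    if capacity > threshold:
        all_dnodes.append(Dnode(name))
        return ("dnode", name)
    candidates = [d for d in all_dnodes if can_accept(d, cluster_capacity)]
    if not candidates:
        return ("unplaced", None)  # not specified in the text; assumption
    chosen = rng.choice(candidates)
    chosen.onodes.append(name)
    return ("onode", chosen.name)
```

In the real system the nearby list would come from the Chord query on the landmark number, and the random Dnode would be located through the well-known Dnode rather than from a global list.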
To increase robustness, each Onode keeps a list of backup Dnodes of length b. Onode x's list consists of Dnodes physically close to x. When the current Dnode fails, x tries to choose a node from the list to attach to. The list is initialized with the nearby Dnodes located during the joining procedure. To ensure that the backup Dnodes are always available ones in the upper layer, x checks the availability of its backup Dnodes periodically. If half of its backup Dnodes are no longer available, x relocates nearby Dnodes by issuing a Chord query message through its Dnode. The query message is similar to the join message in terms of locating nearby Dnodes. Now we analyze the number of control messages used for a join operation. When a new node joins, $O(\log n_d)$ Chord query messages are required to locate the close Dnodes, where $n_d$ is the number of Dnodes. If the new node joins as a Dnode, another $O(\log^2 n_d)$ Chord join messages are required. Thus, when a new node joins, on average,

$$\#MSG_{join} = \log n_d + p_{dn} \cdot \log^2 n_d \qquad (2)$$

messages are required, where $p_{dn}$ is the probability that a new node joins as a Dnode.
3.2 Node Departure and Failure
Each node has a buffer to store the data it received recently. When an Onode x leaves, it simply sends a Leaving Message to its corresponding Dnode. The Dnode then deletes the stored information for x. When a Dnode y leaves, it asks the Onodes attaching to it to switch their corresponding Dnode; it then leaves the upper layer according to the Chord protocol and unsubscribes its information (i.e. landmark number and address) from the Chord ring. Each Onode z previously attached to y selects a live node from its backup Dnode list to attach to. If no node in the backup list can accept z, it rejoins. In any case, z records the sequence number of the last message received and actively fetches the messages it missed during the switch or rejoin process from other nodes' buffers. Failure of an Onode can be detected by its corresponding Dnode through periodic message exchange, and this has little impact. When a Dnode fails, we rely on the Chord protocol to recover the upper layer. Failure of a Dnode can also be detected by its Onodes through periodic message exchange. The Onodes then switch their Dnodes following the departure procedure described above. One way to relieve the impact of the departure or failure of a Dnode is to set a backup Dnode for each cluster. The backup Dnode is selected from the Onodes within the cluster. It should have a capacity of at least $C_{thd}$, and it acts as an ordinary Onode while the current Dnode is alive. After the failure of the Dnode, it takes over the role of the failed Dnode by joining the upper layer and forcing the other Onodes to attach to it. Because the departures or failures of Dnodes have a larger impact on the hierarchical structure, the expected online time of a node should be taken into account when selecting Dnodes.
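The Onode side of the switch procedure can be sketched in a few lines. This is only a rough illustration: `switch_dnode`, `missing_sequence_numbers` and the callbacks `is_alive` and `can_accept` are hypothetical names, and sequence numbers are assumed to be consecutive integers, which the text does not state.

```python
def switch_dnode(backups, is_alive, can_accept):
    """Return the first backup Dnode that is alive and has room, else None.

    A result of None means that the Onode has to rejoin the system.
    """
    for dnode in backups:
        if is_alive(dnode) and can_accept(dnode):
            return dnode
    return None


def missing_sequence_numbers(last_received, latest_known):
    """Sequence numbers to fetch from other nodes' buffers after the switch."""
    return list(range(last_received + 1, latest_known + 1))
```

A caller would first try `switch_dnode`, fall back to the join procedure when it returns None, and then request the missing messages from the buffers of other nodes.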
Efficient Multi-source Data Dissemination in Peer-to-Peer Networks
[Fig. 2 image: a 16-position Chord ring with finger tables (start → successor) for three Dnodes. Dnode 2: 3→5, 4→5, 6→7, 10→10. Dnode 5: 6→7, 7→7, 9→10, 13→13. Dnode 10: 11→12, 12→12, 14→0, 2→2. The implicit tree is annotated with the ring range held by each Dnode: 2 [2,1]; 5 [5,9]; 7 [7,9]; 10 [10,1]; 12 [12,15]; 13 [13,15]; 0 [0,1]; 1 [1,1].]
Fig. 2. An example of upper layer structure based on Chord ring and a 2-ary implicit tree initialized by Dnode 2
4 Data Dissemination
When a new data object emerges on an Onode, the Onode delivers the object to its Dnode. When a Dnode y receives a new data object from one of its Onodes or generates a data object itself, it delivers the object to the other Dnodes through a k-ary implicit tree based on the node finger table, which is defined and detailed in [13]. The finger table at node y contains the information of $O(\log n_d)$ Dnodes, and the ith entry in the table is $1/2^{\log n_d - i}$ of the ring away from y, where $n_d$ is the number of Dnodes and 0 ≤ i < $\log n_d$. Without loss of generality, we assume 2 ≤ k ≤ $\log n_d$.

Initially, y holds the whole range of the Chord ring. We choose the first entry and the last k − 1 entries in y's finger table as the children of y, and y sends the data object to these k nodes concurrently. For ease of expression, we denote the list formed by these k nodes as klist and assume that the relative order of the k nodes in klist is the same as their order in y's finger table. The ring is now partitioned into k parts by these k Dnodes. The jth Dnode in klist is responsible for the ring range that starts from it and ends with the predecessor of the (j + 1)th node, where 1 ≤ j < k. The kth Dnode in klist is responsible for the range that starts from it and ends with the predecessor of y.

Each node z in klist performs the same operations as y does: it partitions the ring range for which it is responsible and forwards the data object. The only difference is that not all the nodes in the finger table at node z fall in the range for which z is responsible. In this case, only the entries falling in the range can be z's children in the implicit tree. We call these entries valid entries. Note that the number of valid entries may be smaller than k, in which case the number of z's children is not k but the number of valid entries. If no entry in the finger table falls into the range, the partition operation stops, because this indicates that there is no other Dnode in this range.
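The partitioning rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses plain Chord fingers (successor of node + 2^i) without the proximity optimization, and the Dnode set {0, 1, 2, 5, 7, 10, 12, 13} on a 16-position ring is an assumption read off the example of Fig. 2. The function names are hypothetical.

```python
def successor(ident, dnodes, ring):
    """Return the first Dnode met when walking clockwise from ident."""
    return min(dnodes, key=lambda d: (d - ident) % ring)


def finger_table(node, dnodes, bits):
    """Distinct Chord fingers of node, in order of increasing clockwise distance."""
    ring = 1 << bits
    fingers = []
    for i in range(bits):
        succ = successor((node + (1 << i)) % ring, dnodes, ring)
        if succ != node and succ not in fingers:
            fingers.append(succ)
    return fingers


def build_tree(node, end, dnodes, bits, k, tree=None):
    """Recursively partition the ring range [node, end] among at most k children."""
    ring = 1 << bits
    if tree is None:
        tree = {}
    span = (end - node) % ring
    # Valid entries: fingers that fall inside the range this node is responsible for.
    valid = [f for f in finger_table(node, dnodes, bits)
             if (f - node) % ring <= span]
    # Children: the first valid entry and the last k - 1 valid entries.
    children = valid if len(valid) <= k else [valid[0]] + valid[-(k - 1):]
    tree[node] = {"range": (node, end), "children": children}
    for j, child in enumerate(children):
        # A child's range ends just before the next child; the last child inherits `end`.
        child_end = (children[j + 1] - 1) % ring if j + 1 < len(children) else end
        build_tree(child, child_end, dnodes, bits, k, tree)
    return tree


DNODES = {0, 1, 2, 5, 7, 10, 12, 13}      # assumed Dnode identifiers
tree = build_tree(node=2, end=1, dnodes=DNODES, bits=4, k=2)
for dnode, info in tree.items():
    print(dnode, info["range"], info["children"])
```

Because the first valid entry is always the immediate successor of the current node, no Dnode inside the range is skipped, so every Dnode ends up in the tree exactly once.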
With the continuous partition of the Chord ring, all the Dnodes receive the data object. Fig. 2 gives an example of the upper layer structure and a 2-ary implicit tree on it.

At the same time that z transfers the data object to its child Dnodes, z also delivers the object to as many Onodes as it can. Let t (0 ≤ t ≤ k) denote the number of z's child Dnodes in the upper layer's implicit tree and $S_z$ denote the set of Onodes attached to z. z randomly selects $t'$ Onodes from $S_z$ to transfer data to, where $2 \le t' = c_z - t - 1$ and $c_z$ is the capacity of z. Let $S'_z \subset S_z$ denote the set of these $t'$ Onodes and $S''_z$ denote the set of Onodes that do not receive the data object directly from z. Obviously, $S''_z = S_z - S'_z$. z divides $S''_z$ into $t'$ parts as evenly as possible and assigns them to the nodes in $S'_z$. After a node x in $S'_z$ receives the data from z, it sends the data to as many of the Onodes assigned by z as it can. The operation is similar to the one performed by z, except that x can send data to $c_x$ nodes. For example, suppose 10 Onodes attach to Dnode z, which has capacity 10 and has 4 child Dnodes in the upper layer's implicit tree (i.e., t = 4). Then $t' = 5$. So z sends the data message to 5 Onodes concurrently and also sends them the address information of the other 5 Onodes, one to each of them. These 5 nodes deliver the message further. This procedure naturally defines an implicit tree structure. Moreover, nodes within a cluster are always physically close to each other, so data delivery is fast.

Theorem 1. A message from any source node can be delivered to all the nodes in O(log n) hops, where n is the number of nodes.

Due to space limitations, the detailed proof is omitted. The intuition is straightforward: the message containing the data object is delivered on the upper layer and within the clusters along implicit trees. More details can be found in [19].

A node x may fail after receiving the data object but before forwarding it to its child nodes on the implicit trees, which would cause the descendants of x to miss the object. To avoid this, each Dnode y periodically exchanges a message summary with its predecessor. If y finds a message that it has not received, it obtains the message from its predecessor. Each Onode also periodically exchanges a message summary with its Dnode. If an Onode finds a message that it missed, it requests the Dnode to send a copy of the message. The Dnode sends the message directly to the Onode or designates another Onode to send it. The message summary is piggybacked on the heartbeat message.

The basic Chord protocol does not account for node proximity. The study in [17] pointed out that, instead of choosing the ith entry in y's finger table to be $(y + 2^i)$, a proximity optimization of Chord allows the ith entry to be any node within the range $[y + 2^i, y + 2^{i+1}]$, which does not affect the complexities. We leverage this optimization in our scheme to improve proximity. When a Dnode y joins the upper layer (which is Chord based), it iteratively checks whether there is a nearby Dnode in $R_y$ that falls in the range $[y + 2^i, y + 2^{i+1}]$. If there is one, it chooses this nearby Dnode as the ith entry. Recall that $R_y$ is the set of nearby Dnodes located for y during y's join process.
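The intra-cluster delivery step described earlier in this section (a Dnode with capacity c_z and t child Dnodes serves c_z − t − 1 Onodes directly and hands the remaining Onodes to them) can be sketched as below. The sketch covers only the first level of the cluster tree; the function name, the round-robin split and the seeded random generator are assumptions made for illustration.

```python
import random


def intra_cluster_plan(onodes, capacity, child_dnodes, rng=None):
    """Map each directly served Onode to the Onodes it must forward to."""
    rng = rng or random.Random(0)            # fixed seed only to make the sketch repeatable
    direct_slots = capacity - child_dnodes - 1
    if onodes and direct_slots < 1:
        raise ValueError("no capacity left for Onodes")
    direct_slots = min(direct_slots, len(onodes))
    direct = rng.sample(onodes, direct_slots)          # Onodes served by the Dnode itself
    chosen = set(direct)
    rest = [o for o in onodes if o not in chosen]      # Onodes reached indirectly
    plan = {o: [] for o in direct}
    for i, o in enumerate(rest):
        # Round-robin assignment keeps the parts as even as possible.
        plan[direct[i % direct_slots]].append(o)
    return plan


onodes = [f"o{i}" for i in range(10)]
plan = intra_cluster_plan(onodes, capacity=10, child_dnodes=4)
for parent, assigned in plan.items():
    print(parent, "->", assigned)
```

With the numbers of the example in the text (10 Onodes, capacity 10, t = 4), the plan has 5 directly served Onodes, each responsible for one further Onode.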
5 Performance Evaluation
We develop an event-driven simulator to evaluate the performance of our scheme. We also compare our scheme with hybrid push/pull [3], CAM [4] and ACOM [5]. The metrics we mainly focus on are as follows.
[Fig. 3 panels: (a) Avg. hops, (b) Avg. delay, (c) Avg. traffic volume. x-axis: Number of Nodes (200 to 12800). y-axes: Avg. Msg. Delivery Hop; Avg. Msg Delivery Delay (ms); Number of forwarded copies. Curves: Hybrid push/pull, CAM-Chord, ACOM, EMS.]
Fig. 3. Comparison of different schemes while varying system size
M1. Average length of delivery paths: the average number of overlay hops it takes a message to reach a node.
M2. Average delivery delay: the average time it takes a message to reach a node. It is determined by the product of the path length and the logical link latency.
M3. Average traffic volume: the average number of copies that are forwarded by all nodes for delivering one message.
M4. Average maintenance overhead: the average number of control messages used for each node join.

Unless otherwise specified, the system consists of 10,000 nodes. When the average node capacity is c, the distribution of node capacity is Zipf-like: the capacity of a node is c/2, c, 2c and 4c with probability 50 percent, 25 percent, 12.5 percent and 6.25 percent, respectively. The default value of c is 8. The default degree of the implicit tree on the upper layer is 6 (k = 6). The basic search radius for locating nearby Dnodes is set to 50 ($Q_0 = 50$).

The underlying physical topology is generated by GT-ITM [18]. This topology has about 2,500 nodes: 4 transit domains each with 4 transit routers, 5 stub domains attached to each transit router, and 30 routers in each stub domain on average. Member nodes are attached to randomly chosen stub routers. Different latencies are assigned to the edges according to edge type: the latency of the edge between a node and the router it is attached to is 1 millisecond; the latency of an intra-domain edge is 2 milliseconds; the latency of an edge between a transit domain and a stub domain is 10 milliseconds; and the latency of an inter-transit edge is 50 milliseconds. In the simulations, nodes join first and then 500 randomly chosen nodes start to send data messages. We run the simulations 5 times and report the average results.

5.1 Comparisons of Different Schemes
In this set of experiments, we comprehensively compare the performance of different schemes.
[Fig. 4: Percentage of Nodes against Msg. Delivery Delay (ms) for Hybrid push/pull, CAM-Chord, ACOM and EMS. Fig. 5: Percentage of Delivered Msgs. against Link Latency (ms) for CAM-Chord, ACOM and EMS. Fig. 6: Number of control Msgs. per Join against Number of Nodes for CAM-Chord, ACOM and EMS.]
Fig. 4. CDF of avg. delivery delay
Fig. 5. CDF of delivered messages
Fig. 6. Control messages per join
Fig. 3 shows the impact of system scale on the performance. The x-axis is logarithmic. In all schemes, the average length of delivery paths grows logarithmically with system scale. Our scheme EMS and CAM-Chord have comparable delivery path lengths, and they outperform ACOM and hybrid push/pull. The average length of delivery paths is so high in the hybrid push/pull scheme because a large part of its messages are duplicates. As to average delivery delay, EMS outperforms CAM-Chord and ACOM outperforms the hybrid push/pull scheme, because both EMS and ACOM take proximity information into account. As to traffic volume, because EMS and CAM-Chord use tree structures to deliver data messages, they generate no duplicate messages. In comparison, the hybrid push/pull scheme and ACOM use about 75 percent and 25 percent more messages, respectively.

To better understand the average delivery delay of the different schemes, we study the cumulative distribution of the delivery delay of one message, which is shown in Fig. 4. Two observations are notable. First, not all nodes can be reached in the hybrid push/pull scheme. Second, EMS outperforms the other schemes. Specifically, within 400 ms, 73.4 percent, 47.7 percent and 33 percent of the nodes can be reached in EMS, CAM-Chord and ACOM, respectively.

Fig. 5 illustrates the cumulative distribution of delivered data messages. Intuitively, delivering more messages over shorter distances indicates lower cost and faster convergence. Fig. 5 shows that EMS delivers about 51 percent of total messages within 26 ms and about 87 percent within 76 ms, while CAM-Chord delivers only about 27 percent within 76 ms, and ACOM delivers about 18 percent of total messages within 26 ms and about 88 percent within 76 ms. The results imply that EMS effectively delivers data messages between physically close nodes.

Fig. 6 compares the maintenance overhead of the different schemes in terms of the number of control messages per join. Since CAM-Chord has many more neighbors ($O(c \log_c n)$) per node than EMS ($O(\log n_d)$) and ACOM ($O(c)$), its overhead is much higher than that of EMS and ACOM. Recall that $n_d$ is the number of Dnodes in EMS. Another observation is that EMS and ACOM have comparable maintenance overhead in terms of the cost per join.
[Fig. 7 and Fig. 8: Avg. Msg. Delivery Hop and Avg. Msg Delivery Delay, respectively, against Number of Nodes (200 to 12800) for 2-ary, 4-ary, 6-ary and 8-ary trees. Fig. 9: Avg. Number of Msg. Forwarded against Node Capacity.]
Fig. 7. Avg. path length while varying k
Fig. 8. Avg. delay while varying k
Fig. 9. Load distribution

5.2 Performance of EMS
In this set of experiments, we evaluate the performance of EMS while varying the degree (k) of the implicit tree on the upper layer, and we show the node load distribution in terms of the average number of messages a node forwards. Fig. 7 and Fig. 8 show the impact of the implicit tree degree (k) of the upper layer on the average path length and delivery delay, respectively. As expected, the path length and delivery delay decrease as the degree grows. However, the decrease is very small when the degree increases from 6 to 8. This is explained by the fact that the Chord ring range at a node x can be partitioned into at most $v_x$ parts, where $v_x$ is the number of distinct nodes in x's finger table and is $O(\log n_d)$ on average. Recall that $n_d$ is the number of Dnodes. In our experiment, $n_d$ is always smaller than 1/10 to 1/5 of the total number of nodes. Therefore, when the degree of the implicit tree is 8 (always bigger than $O(\log n_d)$), the actual degree is about $O(\log n_d)$, not 8.

Fig. 9 shows the load distribution in terms of the average number of messages a node forwards. Since the nodes with capacity 16 or 32 are on the upper layer with high probability, they contribute more than the others. Nodes with capacity 8 contribute as little as nodes with capacity 4, because they are all in the lower layer.
6 Conclusion
In this paper, we propose EMS, an efficient multi-source data dissemination scheme for P2P networks. Nodes are organized into a two-layer structure: the upper layer is DHT-based and consists of powerful and stable nodes, and ordinary nodes attach to physically close nodes in the upper layer. The data objects are delivered through implicit trees, so no control message is required to build the tree structures. The average length of delivery paths is O(log n), where n is the number of nodes in the P2P network. Compared with existing schemes, EMS is proximity-aware and has small maintenance overhead and small traffic volume. As future work, we will evaluate the performance with different capacity profiles.
Acknowledgments This work is supported by National Basic Research Program of China with grant No.2007CB310702, by National Natural Science Foundation of China with grant No.60403031 and 90604015, and by France Telecom R&D.
References
1. El-Ansary, S., Alima, L.O., Brand, P., Haridi, S.: Efficient Broadcast in Structured P2P Networks. In: Proc. of IPTPS 2003 (2003)
2. Saroiu, S., Gummadi, P.K., Gribble, S.D.: A Measurement Study of Peer-to-Peer File Sharing Systems. In: Proc. of MMCN 2002 (2002)
3. Datta, A., Hauswirth, M., Aberer, K.: Updates in Highly Unreliable, Replicated Peer-to-Peer Systems. In: Proc. of ICDCS 2003 (2003)
4. Zhang, Z., Chen, S., Ling, Y., Chow, R.: Resilient Capacity-Aware Multicast Based on Overlay Networks. In: Proc. of ICDCS 2005 (2005)
5. Chen, S., Shi, B., Chen, S.: ACOM: Any-source Capacity-constrained Overlay Multicast in Non-DHT P2P Networks. IEEE Trans. on Parallel and Distributed Systems 18(9), 1188–1201 (2007)
6. Zhang, X., Liu, J., Li, B., Yum, T.-S.P.: DONet/CoolStreaming: A Data-Driven Overlay Network for Live Media Streaming. In: Proc. of INFOCOM 2005 (2005)
7. Venkataraman, V., Yoshida, K., Francis, P.: Chunkyspread: Heterogeneous Unstructured Tree-based Peer to Peer Multicast. In: Proc. of ICNP 2006 (2006)
8. Banerjee, S., Bhattacharjee, B., Kommareddy, C.: Scalable Application Layer Multicast. In: Proc. of SIGCOMM 2002 (2002)
9. Chu, Y.-H., Rao, S.G., Seshan, S., Zhang, H.: A Case for End System Multicast. IEEE Journal on Selected Areas in Communication (JSAC) 20 (2002)
10. Shen, H., Xu, C.: Hash-based Proximity Clustering for Load Balancing in Heterogeneous DHT Networks. In: Proc. of IPDPS 2006 (2006)
11. Wang, F., Xiong, Y., Liu, J.: mTreebone: A Hybrid Tree/Mesh Overlay for Application-Layer Live Video Multicast. In: Proc. of ICDCS 2007 (2007)
12. Castro, M., Druschel, P., Kermarrec, A., Nandi, A., Rowstron, A., Singh, A.: Splitstream: High-bandwidth Multicast in Cooperative Environments. In: Proc. of SOSP 2003 (2003)
13. Stoica, I., Morris, R., Karger, D., Kaashoek, M., Balakrishnan, H.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In: Proc. of SIGCOMM 2001 (2001)
14. Gnutella Protocol Specification v0.6. http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html
15. Xu, Z., Tang, C., Zhang, Z.: Building Topology-Aware Overlays using Global Soft-State. In: Proc. of ICDCS 2003 (2003)
16. Asano, T., Ranjan, D., Roos, T., Welzl, E., Widmayer, P.: Space Filling Curves and Their Use in Geometric Data Structures. Theoretical Computer Science (1997)
17. Gummadi, K.P., Gummadi, R., Gribble, S.D., Ratnasamy, S., Shenker, S., Stoica, I.: The Impact of DHT Routing Geometry on Resilience and Proximity. In: Proc. of SIGCOMM 2003 (2003)
18. Zegura, E.W., Calvert, K.L., Bhattacharjee, S.: How to Model an Internetwork. In: Proc. of INFOCOM 1996 (1996)
19. Li, Z., Xie, G., Li, Z.: Efficient Multi-source Data Dissemination in Peer-to-Peer Networks, Technical Report, available on request (November 2007)
Modeling Priority-Based Incentive Policies for Peer-Assisted Content Delivery Systems∗
Niklas Carlsson and Derek L. Eager
Department of Computer Science, University of Saskatchewan
Saskatoon, SK S7N 5C9, Canada
[email protected], [email protected]
Abstract. Content delivery providers can improve their service scalability and offload their servers by making use of content transfers among their clients. To provide peers with incentive to transfer data to other peers, protocols such as BitTorrent typically employ a tit-for-tat policy in which peers give upload preference to peers that provide the highest upload rate to them. However, the tit-for-tat policy does not provide any incentive for a peer to stay in the system beyond completion of its download. This paper presents a simple fixed-point analytic model of a priority-based incentive mechanism which provides peers with strong incentive to contribute upload bandwidth beyond their own download completion. Priority is obtained based on a peer's prior contribution to the system. Using a two-class model, we show that priority-based policies can significantly improve average download times, and that there exists a significant region of the parameter space in which both high-priority and low-priority peers experience improved performance compared with the pure tit-for-tat approach. Our results are supported using event-based simulations.

Keywords: Modeling, peer-assisted content delivery, priority-based incentive.
∗ This work was supported by the Natural Sciences and Engineering Research Council of Canada.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 421–432, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction
A highly scalable approach to content delivery is to utilize the upload bandwidth of the clients to offload the original content source [1, 2, 3, 4, 5, 6]. Existing peer-to-peer download protocols, such as BitTorrent [1], allow each peer to download content from any peer that has content that it does not have, and do not require any organized delivery structure. With these protocols each file is split into many small pieces, each of which may be downloaded from different peers. To provide peers with an incentive to upload pieces to other peers, these protocols typically employ a tit-for-tat policy in which peers give upload preference to peers that provide the highest upload rate to them. Whereas such policies provide peers with strong incentives to upload at their full upload capacity while downloading a file, they do not provide peers with any incentive to serve additional data after having completed their download. Therefore, unless altruistic peers graciously continue to serve data beyond their download completion, peer-assisted systems with limited server resources are not able to provide download rates much higher than the peer upload rate. When the peer download capacity significantly exceeds the upload capacity, as is commonly the case with home Internet connections (e.g., [7]), much of the download capacity in such systems is therefore underutilized.

To increase the available upload resources some BitTorrent sharing communities enforce a minimum required sharing ratio (defined as the ratio between the total amount of data uploaded and downloaded); other communities rely on social etiquette (e.g., a sharing ratio may be advertised without any enforcement mechanism). While improved performance has been observed in such communities [8, 9], no policies appear to be in place to reward peers contributing more than the advertised minimum.

This paper presents a simple fixed-point model of a two-class, priority-based incentive mechanism which uses differentiated service to provide peers with a strong incentive to upload data past their download completion. In addition to using a tit-for-tat policy to encourage peers to upload as much as possible while downloading, a fraction of the upload bandwidth of each peer/server is used to serve high-priority class peers. Priority is obtained based on a peer's prior contribution to the system. We show that this policy is able to significantly improve average download times and provides peers with strong incentive to contribute past their download completion. Furthermore, there exists a significant region of the parameter space in which the performance of both high-priority and low-priority peers is improved over that with the pure tit-for-tat approach. We also discuss the impact of individual client choices on system performance. Results are supported using event-based simulations.

Previous multi-class models of BitTorrent-like systems assume that the arrival rate of each class of peers is known and each peer allocates a static portion of upload bandwidth to each class of peers [10]. In contrast, we consider scenarios in which the fraction of upload bandwidth used for uploads to peers of each class depends on the proportion of peers of each class, which in turn depends on the extent to which clients decide to gain high-priority status through their contribution of upload resources. Naturally, these decisions depend on client perception of the advantage of being a high-priority peer versus the additional cost. In this work a fixed point is determined such that the performance differential enjoyed by high-priority peers for a particular proportion of peers provides incentive that yields exactly that proportion of peers choosing to earn high-priority status by contributing as a seed. To achieve good accuracy over the considered portion of the system parameter space, we also require a somewhat more detailed mean value analysis than previous fluid models [10, 11, 12, 13]. These differences are further discussed in Section 4.

Other related work has included the use of analytic models [14, 15] or instrumentation [16, 17, 18] to capture the peer interaction, the study of reputation-based (e.g., [19]) and currency/token-based (e.g., [20]) reputation systems, and the use of taxation techniques when rewarding peers [21].

The remainder of the paper is organized as follows. Section 2 provides a brief overview of BitTorrent. A simple priority-based policy is presented in Section 3. Section 4 presents our analytic model. Results and validation are presented in Section 5. Finally, conclusions are presented in Section 6.
2 BitTorrent Overview
We consider here the use of BitTorrent-like protocols to aid in the on-demand delivery of files stored on one or more content provider servers. BitTorrent [1] splits files into pieces, which themselves are split into smaller sub-pieces. Multiple sub-pieces can be downloaded in parallel from different peers. A peer is said to have a piece whenever the entire piece is downloaded, and is interested in all peers that have at least one piece that it currently does not have itself. Peers that have the entire file are called seeds, while peers that only have parts of the file and are still downloading are called leechers.

Each peer establishes persistent connections with a large set of peers; however, at each time instance each peer only uploads to a limited number of peers (these peers are considered unchoked, while all other peers are choked). A rate-based tit-for-tat policy is used to encourage peers to upload pieces and discourage free-riding. With this policy leechers give upload preference to the leechers that provide the highest download rates. To probe for better pairings (or in the case of the seeds, allow new peers to download pieces), periodically each peer uses an optimistic unchoke policy, in which a random peer is unchoked. To ensure high piece diversity a rarest-first piece selection policy is used to determine which pieces to request next.
3 A Priority-Based Incentive Policy
This section presents a priority-based incentive policy with two priority classes. We consider a system from which clients may download many files. New clients that have yet to contribute upload bandwidth as a seed are (by default) given low priority. Each time a peer uploads a (combined) total of $L_H$ data as a seed (over some number of downloaded files) it earns high-priority status for one future file download. In return for their extra upload contributions, high-priority peers receive better service. Service differentiation is achieved by requiring that every peer/server use a fraction $f_H$ of its upload bandwidth to serve high-priority peers. The remaining upload bandwidth is allocated based on tit-for-tat with optimistic unchoking. This policy provides peers with incentives both to upload at a high rate when downloading (from use of tit-for-tat) and to upload additional data as a seed (from use of priority).¹

When selecting new peers to upload to, a peer randomly selects a high-priority peer if at least one such peer is requesting service and less than a fraction $f_H$ of the data it has uploaded has been uploaded using priority uploading; otherwise, tit-for-tat with optimistic unchoking is used. To encourage bidirectional tit-for-tat relationships, in which connections are responsive to the tit-for-tat policy (with optimistic unchoking), the unchoke policy considers the download rates of priority connections to be zero.

To ensure that peers allocate roughly a fraction $f_H$ of their upload bandwidth to priority uploading, when possible, each peer keeps track of the amount of data uploaded using (i) tit-for-tat with optimistic unchoking, and (ii) priority uploading. When the upload of a new piece begins, the amount credited to the corresponding upload category is increased by the size of a piece. This credit may later be reduced, however, if due to parallel downloading the actual amount uploaded by the peer is less than the size of the piece. Upload amounts are credited only during periods when there is at least one high-priority peer to which an upload could occur.

We assume that the system provides all clients with information about the average download times of high-priority and low-priority clients, as well as the amount of data that must be uploaded as a seed to become high priority. This information (provided by a tracker, for example) allows clients to make an informed decision as to whether to make the upload contribution required for high-priority status.

¹ Many policy variations are possible, including variants in which service differentiation is achieved with smoothly varying weights rather than with a binary priority assignment.
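The upload-category decision described in this section can be sketched as follows. This is a simplified illustration only: the class and method names are hypothetical, every upload is credited at full piece size (the later credit reduction for parallel downloading is ignored), and the actual choice of which peer to unchoke within a category is left out.

```python
from dataclasses import dataclass


@dataclass
class UploadAccounting:
    f_high: float            # target fraction f_H of upload reserved for priority uploading
    piece_size: int = 1      # amount credited when the upload of a new piece begins
    priority_amount: int = 0
    tft_amount: int = 0

    def pick_category(self, high_priority_waiting: bool) -> str:
        """Choose how the next piece upload is allocated and credit it."""
        if not high_priority_waiting:
            # No high-priority peer could be served: nothing is credited.
            return "tit-for-tat"
        total = self.priority_amount + self.tft_amount
        share = self.priority_amount / total if total else 0.0
        if share < self.f_high:
            self.priority_amount += self.piece_size
            return "priority"
        self.tft_amount += self.piece_size
        return "tit-for-tat"


acct = UploadAccounting(f_high=0.25)
choices = [acct.pick_category(high_priority_waiting=True) for _ in range(1000)]
print(choices.count("priority") / len(choices))
```

Comparing the running share against the target before each upload is what keeps the long-run fraction of priority uploads close to the configured value while high-priority peers are waiting.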
4 Model Description
This section presents our analytic model. We assume that all servers and peers have implemented the priority-based incentive policy described in Section 3 and are equipped with software (provided by the content provider, for example) which monitors the peers' upload contributions. While such software cannot control the upload bandwidth made available by peers, we assume that the software allows the fraction $f_H$ used for priority uploading to be maintained and peers to be fairly and correctly labeled as either high or low priority.² It is further assumed that the system operates in steady state and clients make many requests to the system. Furthermore, we assume that the probability a peer decides to upload a particular amount of data as a seed (so as to gain credit towards high-priority status), after a file download, is independent of file identity, and that no peer accumulates high-priority download credit that it does not use in later downloads. Thus, in steady state, the total average rate at which seeds upload data for any particular file will equal $L_H$ times the request rate of high-priority peers for that file. We assume that peers do not depart the system before having fully downloaded the file.

Each client is assumed to determine the extent to which it is willing to serve as a seed according to the performance difference (as measured by the server) between the average download times of high-priority and low-priority peers. We analyze here a simple case where each peer has a threshold such that the peer will serve exactly $L_H$ data as a seed following each download, should this performance difference exceed the threshold. Assuming that these thresholds follow some (arbitrary) distribution, our analysis obtains a fixed-point solution such that the performance differential enjoyed by high-priority peers for a particular proportion of peers yields exactly that proportion of peers choosing to earn high-priority status by contributing as a seed.

We model download performance for a single representative file of size L. It is assumed that clients have an average upload bandwidth capacity U and an average download bandwidth capacity D, that they generate requests for the file that arrive according to a Poisson process with rate λ, and that the server(s) have a total available upload bandwidth for the file equal to s times the average peer upload bandwidth U. Subscripts H and L are used to distinguish quantities for the high-priority and low-priority class, respectively.

² A reporting and tracking system could be used to identify misbehaving peers that do not allocate the required fraction of upload bandwidth to priority uploading. Such peers could be punished by black-listing them from receiving uploads by servers and well-behaved peers.
4.1 General Model
Given the above assumptions, a differential equation expressing the relationship between the average achieved download rate $d_c$ and the average number $x_c$ of clients actively downloading the file at any given time t, for each class c, can be derived as

$$\frac{dx_c}{dt} = \lambda_c - \frac{x_c d_c}{L}, \qquad c = L, H, \tag{1}$$

where $d_c/L$ is the rate at which peers of class c complete their downloads and $\lambda_c$ is the rate at which peers of class c begin new downloads. Focusing on the steady state case (i.e., when $dx_c/dt = 0$), this yields

$$x_c = \lambda_c \left( L / d_c \right), \qquad c = L, H. \tag{2}$$

We note that these expressions can easily be obtained directly using Little's law. Similarly, the relationship between the average achieved upload rate $u_y$ of peers that are currently acting as seeds, and the average number y of such peers, is given by

$$y = \lambda_H \left( L_H / u_y \right). \tag{3}$$
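As a quick numerical check of equations (1) and (2), the sketch below integrates equation (1) with a simple Euler step and compares the result with the Little's-law value. It treats the download rate d_c as a fixed constant, which is a simplification made only for this illustration (in the model d_c depends on the system state), and the numeric values and function names are hypothetical.

```python
def steady_state_population(arrival_rate, work, rate):
    """Little's law: mean population = arrival rate * mean time in system (work / rate)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return arrival_rate * work / rate


def integrate_population(arrival_rate, work, rate, x0=0.0, dt=0.01, steps=5000):
    """Euler integration of dx/dt = arrival_rate - x * rate / work (equation (1))."""
    x = x0
    for _ in range(steps):
        x += dt * (arrival_rate - x * rate / work)
    return x


# Hypothetical values: 0.5 arrivals per minute, 100 MB file, 50 MB per minute download rate.
x_little = steady_state_population(0.5, 100.0, 50.0)
x_euler = integrate_population(0.5, 100.0, 50.0)
print(x_little, x_euler)
```

The same helper evaluates equation (3) when called with the high-priority arrival rate, the seed upload amount L_H and the seed upload rate u_y.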
To continue the analysis so as to achieve acceptable accuracy over the considered portion of the parameter space, we require a more detailed mean value analysis than previous fluid models [10, 11, 12]. Our model uses conditional expected values (rather than only the long term averages) and reallocates upload bandwidth intended to be used for high-priority peers, but not allocated owing to insufficient total download capacity of these peers, to low-priority peers. Let xc(Q) and y(Q) denote the expected number of class c leechers, and the expected number of seeds, respectively, conditioned on the set of constraints Q. For example, if Q implies that xc = ac, then xc(Q) = ac; similarly, if Q implies that xc ≥ 1, then xc(Q) = xc/(1 − e^(−xc)). Further, let dc(Q) and N(Q) denote the corresponding conditional expected values for the download rate and the number of peers and servers uploading data, respectively, where the servers are considered as s peers (since their total upload bandwidth is sU). The download and upload rates of each class of clients can then be expressed as follows:
$$d_H = d_H(x_H \ge 1); \quad d_L = d_L(x_L \ge 1); \quad u_y = \frac{x_H\, d_H(x_H \ge 1,\, y \ge 1) + x_L\, d_L(x_L \ge 1,\, y \ge 1)}{N(y \ge 1)}. \tag{4}$$
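The conditional value xc/(1 − e^(−xc)) used above is what one obtains if the number of class c leechers is taken to be Poisson distributed with mean xc. That distributional choice is an assumption of this sketch, since the text does not name the distribution. Under that assumption, the closed form can be compared against a direct summation over the Poisson probabilities:

```python
import math


def conditional_mean_at_least_one(mean):
    """E[X | X >= 1] for X ~ Poisson(mean), in closed form."""
    return mean / (1.0 - math.exp(-mean))


def conditional_mean_by_summation(mean, terms=200):
    """Same quantity, computed by summing k * P(X = k) for k >= 1
    and dividing by P(X >= 1). The series is truncated after `terms`."""
    p_zero = math.exp(-mean)
    p_k = p_zero
    weighted_sum = 0.0
    for k in range(1, terms):
        p_k *= mean / k          # P(X = k) from P(X = k - 1)
        weighted_sum += k * p_k
    return weighted_sum / (1.0 - p_zero)


if __name__ == "__main__":
    for m in (0.5, 2.0, 10.0):
        print(m, conditional_mean_at_least_one(m),
              conditional_mean_by_summation(m))
```

For large xc the correction factor approaches one, so the conditioning matters mainly when a class has few active leechers on average.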
These expressions make use of the fact that the download and upload rate of a peer type is only applicable if there is at least one peer of that particular type in the system. In determining the conditional download rates, we separate upload bandwidth used for priority uploading (shared only among high-priority leechers) and tit-for-tat and optimistic unchoking (shared among all leechers). This separation allows the download rates to be adjusted based on the relative mix of peers of each class. Given a total upload capacity of N(Q)U, the download rate of high-priority peers is given by
$$d_H(Q) = \min\left[D,\; N(Q)\,U \left(\frac{1 - f_H}{x_H(Q) + x_L(Q)} + \frac{f_H}{x_H(Q)}\right)\right]. \tag{5}$$
N. Carlsson and D.L. Eager
With all the remaining bandwidth available to them (after high-priority peers are satisfied) the download rate of the low-priority peers can be expressed as follows:
$$d_L(Q) = \min\left[D,\; \frac{N(Q)\,U - x_H\, d_H(Q)}{x_L(Q)}\right]. \tag{6}$$
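Equations (5) and (6) can be evaluated directly once the conditional quantities are known. The sketch below is illustrative only: the function name and the numeric inputs are hypothetical, both peer counts are assumed to be positive, and the xH appearing in equation (6) is read as the conditional value xH(Q).

```python
def download_rates(n_uploaders, upload_cap, download_cap,
                   priority_fraction, x_high, x_low):
    """Evaluate equations (5) and (6).

    n_uploaders       -- N(Q), peers and servers effectively uploading
    upload_cap        -- U, per-peer upload capacity
    download_cap      -- D, per-peer download capacity
    priority_fraction -- f_H, share of upload reserved for priority uploading
    x_high, x_low     -- x_H(Q), x_L(Q); both assumed > 0 in this sketch
    """
    total_upload = n_uploaders * upload_cap
    shared = (1.0 - priority_fraction) / (x_high + x_low)
    reserved = priority_fraction / x_high
    d_high = min(download_cap, total_upload * (shared + reserved))   # eq. (5)
    d_low = min(download_cap,
                (total_upload - x_high * d_high) / x_low)            # eq. (6)
    return d_high, d_low


if __name__ == "__main__":
    # Hypothetical operating point: N(Q) = 50, U = 1, D = 3, f_H = 0.25.
    print(download_rates(50.0, 1.0, 3.0, 0.25, 20.0, 30.0))
    # Same point with a tighter download cap D = 1.2: bandwidth that the
    # high-priority peers cannot absorb is passed on to low-priority peers.
    print(download_rates(50.0, 1.0, 1.2, 0.25, 20.0, 30.0))
```

In the first call neither rate is capped by D, so the two classes together consume the full upload capacity N(Q)U. In the second call the high-priority rate is capped, and the low-priority rate rises accordingly.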
This expression reduces to N(Q)U(1 − fH)/(xH(Q) + xL(Q)) whenever dH(Q) ≤ D. Finally, the expected number of peers or servers effectively uploading data is given by
$$N(Q) = s + y(Q) + \sum_{c = L, H} x_c(Q) - \Pr(x_H + x_L = 1 \mid Q), \tag{7}$$
where Pr(xH + xL = 1 | Q) is the probability that there is only a single leecher in the system. This expression assumes that every peer can upload data whenever there is more than one client in the system. Assuming the numbers of high-priority and low-priority peers are independent given Q, the above probability can be calculated as
$$\Pr(x_H + x_L = 1 \mid Q) = \Pr(x_H = 1 \mid Q)\,\Pr(x_L = 0 \mid Q) + \Pr(x_H = 0 \mid Q)\,\Pr(x_L = 1 \mid Q), \tag{8}$$
where the conditional probabilities can be calculated using standard theory. Given the portion λH/λ of peers selecting to become high-priority peers, as would depend on the relative performance difference (TL/TH − 1), the average download rates dc and download times Tc = L/dc can be easily obtained by numerically solving the above equation system. As with other fluid models, the complexity of these calculations is independent of the number of peers in the system.
4.2 All High-Priority Case
This section describes how our model can be reduced to a closed form solution for the special case in which (i) all clients select to become high-priority peers, (ii) on average serve LH data as a seed, (iii) the system is demanding (λL >> d), and (iv) the seed contributions are significant (λLH >> uy). For this special case the conditional values are roughly equal to the long term averages and equations (4) and (5) reduce to
$$d_H = \min\left[D,\; U\,\frac{s + \lambda L/d_H + \lambda L_H/u_y}{\lambda L/d_H}\right],$$
where uy = λL/(s + λL/dH + λLH/uy). Here Little’s law has been used to substitute the average number of leechers xH = λL/dH and seeds y = λLH/uy, respectively. Solving for dH allows an expression for the average download time TH = L/dH to be obtained:
$$T_H = \max\left[\frac{L}{D},\; \frac{L - L_H}{U} - \frac{s}{\lambda}\right]. \tag{9}$$
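The closed form can be checked against the reduced equation system it was derived from. The following is a minimal sketch with hypothetical function names. Because dH is the smaller of two candidate rates, the download time TH = L/dH is taken here as the larger of the two corresponding times, L/D and (L − LH)/U − s/λ. The residual function substitutes that value back into the reduced form of equations (4) and (5).

```python
def closed_form_download_time(lam, L, L_H, U, D, s):
    """Download time for the all-high-priority case.

    Returns the larger of L/D (download capacity is the bottleneck) and
    (L - L_H)/U - s/lam (upload capacity is the bottleneck).
    """
    return max(L / D, (L - L_H) / U - s / lam)


def model_residual(lam, L, L_H, U, D, s):
    """Plug the closed form back into d_H = min[D, U * N / x_H]."""
    d_H = L / closed_form_download_time(lam, L, L_H, U, D, s)
    x_H = lam * L / d_H                       # leechers, by Little's law
    u_y = lam * (L - L_H) / (s + x_H)         # solves u_y = lam * L / N
    y = lam * L_H / u_y                       # seeds, by Little's law
    N = s + x_H + y                           # uploading peers and servers
    return d_H - min(D, U * N / x_H)


if __name__ == "__main__":
    # Default settings of the paper with L_H = 0.2 (upload-limited) ...
    print(closed_form_download_time(100.0, 1.0, 0.2, 1.0, 3.0, 1.0))
    # ... and a larger seed contribution L_H = 0.8 (download-limited).
    print(closed_form_download_time(100.0, 1.0, 0.8, 1.0, 3.0, 1.0))
```

A residual of (numerically) zero in both regimes indicates that the closed form satisfies the reduced equations for those parameter values. It is a consistency check, not a proof.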
Note that as long as D ≥ U /(1 − sU /(λL )) , the download time decreases linearly with the amount of data LH uploaded by seeds, and the clients fully utilize their maximum download bandwidth capacity D when seeds upload L*H = L(1 − U / D ) − sU / λ data. While equation (9) also is applicable to a regular tit-for-tat system in which altruistic
Modeling Priority-Based Incentive Policies
peers serve LH data as seeds, we note that such seed contributions are much more likely in the priority system, in which the peers are given a performance incentive.
5 Results
This section validates the model and analyzes the characteristics of the priority-based policy. For validation we modified an existing event-based simulator of BitTorrent-like systems [22]. It is assumed that no connections are choked in the middle of an upload, and peers only request new downloads when they are not fully utilizing their download capacity D. For simulating the rate at which pieces are exchanged, it is assumed that connection bottlenecks are located at the end points (i.e., either by the upload capacity U at the sender or by the download capacity D at the receiver) and the network operates using max-min fair bandwidth sharing (using TCP, for example).
5.1 Validation Using Known User Choices
In this section we validate the analytic model for the case where the proportion of clients of each priority class is known. We consider a system with a single server operating in steady state. The system is simulated for 6400 requests, with the initial 1000 and the last 400 requests removed from the measurements. Each data point represents the average of 5 simulations. The file is split into 512 pieces. The peers and the server can simultaneously upload to at most 4 and 4s peers, respectively. Without loss of generality, the file size L and upload bandwidth capacity U are fixed at one. With these normalized units, the volume of data transferred is measured in units of the file size and the download time is measured in units of the minimum time it takes for a client to fully upload a copy of the file. Fig. 1 presents the average download times for each priority class individually and the overall averages, as predicted by both the analytic model and simulations. For each sub-figure in Fig. 1, one variable is considered at a time. The default case assumes that λ = 100, D/U = 3, LH/L = 20%, s = 1, fH = 25%, and λH/λ = 50%. Fig. 1(a) also presents the results for a probabilistic model extension (see footnote 3). For the other sub-figures the difference between the general model and the probabilistic extension is not noticeable and the results for the extension are therefore omitted. The agreement between simulation and analytical results is excellent for the overall averages. The agreement for the averages for each individual class is good, although the discrimination between the classes is typically slightly smaller for the simulations. This difference is likely due to the fixed-point model not capturing variability in the number of active peers of each class. Note that larger errors are observed in regions with larger variability; for example, in systems with relatively low arrival rate.
For these regions, the probabilistic extension, which takes more state information into account, provides a slightly better approximation. In general, however, we have found that the added complexity of such models is not typically justified.
3 Due to space constraints the description of this extension is omitted. The basic idea is to split each condition Q used in expression (4) into three sub-cases: (i) there is no additional leecher in the system, (ii) there is at least one high-priority leecher in the system, and (iii) there is no high-priority leecher in the system, but at least one low-priority leecher.
[Fig. 1 plots omitted. Each of the six panels plots the average download time of all, high-priority, and low-priority peers, as given by the model and by simulation; the client arrival rate panel also includes the probabilistic extension.]
Fig. 1. Model validation. Default settings: L = 1, U = 1, λ = 100, D/U = 3, LH = 0.2, s = 1, fH = 0.25, λH /λ = 0.5. (a) Client arrival rate λ. (b) Download/upload capacity ratio D/U. (c) Amount LH to be uploaded as a seed. (d) Server/peer upload capacity ratio s. (e) Fraction fH devoted to priority uploading. (f) Portion λH /λ high priority peers.
As suggested by the closed form equations in Section 4.2, the average download times decrease linearly with the amount of content uploaded by seeds (with the slope determined by the proportion of high-priority peers in the system). The discrimination between the priority classes increases linearly with the fraction devoted to priority uploading. We note that the decreased performance of low-priority peers (relative to the high-priority peers) is offset by the overall improved performance due to high-priority peers making additional upload contributions as seeds. As further discussed in Section 5.2, such contributions can in some cases allow the low-priority peers to achieve better download times than peers in non-priority systems.
5.2 Social Welfare Regions
In this section we consider under what circumstances low-priority peers experience download times no worse (or only slightly worse) than they would in a regular tit-for-tat system. We call regions in the modeling space for which both classes of peers see improved performance social welfare regions. We consider a system without altruistic peers and model the performance of a regular tit-for-tat system using the special case in which LH = 0 and fH = 0. Assuming that any additional (unconditional) upload contributions made by altruistic peers are equal for both systems, the results for systems with altruistic peers correspond to scenarios with additional server bandwidth.
Fig. 2. Percentage decrease in download times for low-priority peers when the clients’ individual thresholds follow an exponential distribution. Default settings are: L = 1, U = 1, LH = 0.5, λ = 100, D/U = 3, s = 1, fH = 0.25.
Fig. 2 shows the contour lines for different levels of improvements observed by the low-priority peers. These experiments assume that the clients’ individual thresholds follow an exponential distribution. On the y-axis of each sub-figure the median of this distribution is plotted. For example, with a median of 1, 50% of the clients require the performance difference to be at least 100% before selecting to become a high-priority peer. With our analytic model providing slight overestimates of the download times of the low-priority peers, we expect that in real systems the social welfare regions would likely be somewhat larger than indicated here. Note that the size of the social welfare regions typically increases as clients are more willing to serve as high-priority peers (i.e., for lower thresholds). In fact, for large portions of the parameter space the low-priority peers actually observe significant performance improvements. Only when clients have large thresholds or the fraction of bandwidth devoted to priority uploading exceeds 50% do the low-priority peers observe decreased performance. It should further be noted that with fH = 0.25 and LH = 0.5 the low-priority peers do not have more than a 10% increase in download times (independent of the peer threshold distribution). The corresponding values when LH = 0.25 and LH = 0.75 are 25% and 0%, respectively.
5.3 User Behavior and System Dynamics
To illustrate the impact that individual client choices have on the performance of the system, Fig. 3 shows the performance using different threshold distributions. To cover a wide range of cases we assume that the thresholds follow a Weibull distribution. The shape parameter β determines if the distribution is light-tailed (β ≥ 1), or heavy-tailed (β < 1), with β = 1 corresponding to an exponential distribution and β = 3.4 approximating a normal distribution. Note that the curves become more and more step-like as the shape parameter grows and the distribution becomes more compressed (with, in the limiting case, all peers sharing the same threshold). All curves cross at the point where the portion of high-priority peers is 50%; this is because the distributions are normalized to have the same median and because the system performance is mainly determined by the fraction of peers that have a threshold below some “critical” value (not the shape of the threshold distribution itself). Another observation from these figures is that low-priority peers only perform worse (and only slightly worse) than peers in a regular BitTorrent system when the median threshold difference is relatively large (e.g., 100%).
[Fig. 3 plots omitted. Four panels plot, against the median threshold difference, the probability of being a high-priority peer, the average download time of high-priority peers, the average download time of low-priority peers, and the % increase in download time (low vs. high), for shape parameters 0.25, 0.5, 1, 2, and 4; two of the panels also include a regular BT curve.]
Fig. 3. Performance impact of the shape parameter β and the overall aggressiveness of clients. Default settings are: L = 1, U = 1, LH = 0.5, λ = 100, D/U = 3, s = 1, fH = 0.25.
5.4 Alternative Scenarios and Workload Assumptions
This section considers scenarios and workload assumptions other than those assumed by our analytic model. First, motivated by measurement studies [12], we consider a flash-crowd scenario in which peers arrive at an exponentially decaying rate λ(t) = λ0 e^(−γt), with λ0 = 791.0 and γ = 1. For this case 63.2% of all 500 peer arrivals occur within the first time unit. Second, a steady state scenario is considered with heterogeneous peers: low-bandwidth (UA = 0.2, DA = 0.6) and high-bandwidth (UB = 1, DB = 3) peers. For reference we also include our original default scenario. Fig. 4 shows the relative download times for each scenario. Although the performance improvements are smaller in the flash crowd scenario and for the low-bandwidth clients in the heterogeneous scenario, the increased performance differences between high-priority and low-priority peers when the proportion of high-priority peers is small give a strong incentive to contribute additional resources, thereby allowing the overall performance to be improved. Again, we note that the low-priority peers do not perform much worse than peers in a regular tit-for-tat system and in many cases significantly better.
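The flash-crowd parameters can be related to each other with a short calculation. For an arrival rate that decays exponentially from an initial value λ0 with decay constant γ, the expected number of arrivals up to time t is (λ0/γ)(1 − e^(−γt)). The sketch below evaluates this expression; the function names are hypothetical, and treating arrivals through their expected count only is a simplification made here.

```python
import math


def expected_arrivals(lambda0, gamma, t):
    """Expected number of arrivals in [0, t] for the rate
    lambda0 * exp(-gamma * t), obtained by integrating the rate."""
    return lambda0 / gamma * (1.0 - math.exp(-gamma * t))


def fraction_arrived(gamma, t):
    """Share of the arrivals expected over an unbounded horizon
    that fall within [0, t]."""
    return 1.0 - math.exp(-gamma * t)


if __name__ == "__main__":
    print(expected_arrivals(791.0, 1.0, 1.0))   # arrivals in the first time unit
    print(fraction_arrived(1.0, 1.0))           # 1 - 1/e
```

With λ0 = 791.0 and γ = 1, the first time unit carries an expected count of roughly 500 arrivals, and 1 − 1/e is about 63.2%.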
[Fig. 4 plots omitted. Two panels plot the download time ratio (regular vs. high) and the download time ratio (regular vs. low) against the portion of high-priority peers, for the steady state (simulation and analytic), heterogeneous (high and low bandwidth), and flash crowd scenarios.]
Fig. 4. Example scenarios: (i) steady state (λ = 100), (ii) flash crowd (λ0 e^(−γt), with λ0 = 791.0 and γ = 1), and (iii) heterogeneous scenario (λ = 50, λA/λ = 0.5, UA = 0.2, DA = 0.6, UB = 1, DB = 3). Default settings are: L = 1, U = 1, LH = 0.5, D/U = 3, s = 1, fH = 0.25.
6 Conclusions
This paper presents a simple analytic model of a two-class priority-based incentive mechanism in which priority is obtained based on peers’ prior contributions to the system. It is shown that priority-based policies can provide peers with strong incentive to contribute upload bandwidth beyond their own download completion, can significantly improve the average download times in the system, and that there exists a significant region in the parameter space in which both high-priority and low-priority peers experience improved performance compared to the pure tit-for-tat approach. While this paper has focused on download, we note that these types of priority policies also can be used for peer-assisted streaming [22].
References
1. Cohen, B.: Incentives Build Robustness in BitTorrent. In: Workshop on Economics of Peer-to-Peer Systems, Berkeley, CA (2003)
2. Gkantsidis, C., Rodriguez, P.R.: Network Coding for Large Scale Content Distribution. In: IEEE INFOCOM, Miami, FL (2005)
3. Sherwood, R., Braud, R., Bhattacharjee, B.: Slurpie: A Cooperative Bulk Data Transfer Protocol. In: IEEE INFOCOM, Hong Kong, China (2004)
4. Chu, Y., Rao, S.G., Zhang, H.: A Case for End System Multicast. In: ACM SIGMETRICS, Santa Clara, CA (2000)
5. Kostic, D., Rodriguez, A., Albrecht, J., Vahdat, A.: Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In: ACM SOSP, Bolton Landing, NY (2003)
6. Zhang, X., Liu, J., Li, B., Yum, T.-S.P.: CoolStreaming/DONet: A Data-driven Overlay Network for Peer-to-Peer Live Media Streaming. In: IEEE INFOCOM, Miami, FL (2005)
7. Saroiu, S., Gummadi, K.P., Gribble, S.D.: A Measurement Study of Peer-to-Peer File Sharing Systems. In: IS&T/SPIE MMCN, San Jose, CA (2002)
8. Andrade, N., Mowbray, M., Lima, A., Wagner, G., Ripeanu, M.: Influences on Cooperation in BitTorrent Communities. In: ACM SIGCOMM Workshop on Economics of P2P Systems, Philadelphia, PA (2005)
9. Ripeanu, M., Mowbray, M., Andrade, N., Lima, A.: Gifting Technologies: A BitTorrent Case Study. First Monday 11, 11 (2006)
10. Clévenot-Perronnin, F., Nain, P., Ross, K.W.: Multiclass P2P Networks: Static Resource Allocation for Service Differentiation and Bandwidth Diversity. In: IFIP PERFORMANCE, Juan-les-Pins, France (2005)
11. Qiu, D., Srikant, R.: Modeling and Performance Analysis of BitTorrent-Like Peer-to-Peer Networks. In: ACM SIGCOMM, Portland, OR (2004)
12. Guo, L., Chen, S., Xiao, Z., Tan, E., Ding, X., Zhang, X.: Measurement, Analysis, and Modeling of BitTorrent-like Systems. In: ACM IMC, Berkeley, CA (2005)
13. Kumar, R., Liu, Y., Ross, K.: Stochastic Fluid Theory for P2P Streaming Systems. In: IEEE INFOCOM, Anchorage, AK (2007)
14. Massoulie, L., Vojnovic, M.: Coupon Replication Systems. In: Proc. ACM SIGMETRICS, Banff, Canada (2005)
15. Lin, M., Fan, B., Chiu, D., Lui, J.C.S.: Stochastic Analysis of File Swarming Systems. In: IFIP PERFORMANCE, Cologne, Germany (2007)
16. Legout, A., Urvoy-Keller, G., Michiardi, P.: Rarest First and Choke Algorithms Are Enough. In: ACM IMC, Rio de Janeiro, Brazil (2006)
17. Legout, A., Liogkas, N., Kohler, E., Zhang, L.: Clustering and Sharing Incentives in BitTorrent Systems. In: ACM SIGMETRICS, San Diego, CA (2007)
18. Piatek, M., Isdal, T., Anderson, T., Krishnamurthy, A., Venkataramani, A.: Do Incentives Build Robustness in BitTorrent? In: NSDI, Cambridge, MA (2007)
19. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: The Eigentrust Algorithm for Reputation Management in P2P Networks. In: WWW, Budapest, Hungary (2003)
20. Zhong, S., Chen, J., Yang, Y.R.: Sprite: a Simple, Cheat-Proof, Credit-Based System for Mobile Ad-hoc Networks. In: IEEE INFOCOM, San Francisco, CA (2003)
21. Chu, Y., Chuang, J., Zhang, H.: A Case for Taxation in Peer-to-Peer Streaming Broadcast. In: ACM SIGCOMM Workshop on Practice and Theory of Incentives in Networked Systems, Portland, OR (2004)
22. Carlsson, N., Eager, D.L.: Peer-assisted On-demand Streaming of Stored Media using BitTorrent-like Protocols. In: IFIP/TC6 NETWORKING, Atlanta, GA (2007)
AQCS: Adaptive Queue-Based Chunk Scheduling for P2P Live Streaming
Yang Guo1, Chao Liang2, and Yong Liu2
1 Corporate Research, Thomson, Princeton, NJ, USA 08540 [email protected]
2 ECE Dept., Polytechnic University, Brooklyn, NY, USA 11201 [email protected],[email protected]
Abstract. P2P streaming has been popular and is expected to attract even more users. One major challenge for P2P streaming is to offer users satisfactory Quality of Experience (QoE) in terms of video resolution, startup delay, and playback smoothness, all of which require efficient utilization of bandwidth resources in P2P networks. In this paper, we propose AQCS, adaptive queue-based chunk scheduling, which can support the maximum streaming rate allowed by a P2P streaming system with small signaling overhead and short startup delay. AQCS is a distributed algorithm with minimal requirements on peers. The queue-based design enables peers to adapt to bandwidth variations and peer churn, and to converge automatically to the optimal operating point. A prototype of AQCS is implemented and various implementation issues are examined. Experiments over PlanetLab further demonstrate AQCS’s optimality and its robustness against changing system/network environments.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 433–444, 2008. © IFIP International Federation for Information Processing 2008
1 Introduction
Video-over-IP applications have recently attracted a large number of users on the Internet. YouTube [1] alone hosted some 45 terabytes of videos and attracted 1.73 billion views by the end of August 2006. With the fast deployment of high-speed residential access, such as Fiber-To-The-Home, video traffic is expected to dominate the Internet in the near future. Traditionally, video content can be streamed to end users either directly from video source servers or indirectly from servers in Content Delivery Networks (CDNs). Peer-to-Peer video streaming has emerged as an alternative with low server infrastructure cost. P2P video streaming systems, such as ESM [2], CoolStreaming [3], and PPLive [4], have attracted millions of users to watch live or on-demand video programs on the Internet [5]. The P2P design philosophy seeks to utilize peers’ upload bandwidth to reduce servers’ workload. However, the upload bandwidth utilization might be throttled by the so-called content bottleneck, where a peer may not have any content that can be uploaded to its neighbors even if its link is idle. The mechanism adopted by P2P file sharing applications is to increase the content diversity among peers. For example, the rarest-first policy of BitTorrent [6] encourages peers to retrieve the chunks with the lowest availability among their neighbors. Network coding has also been explored to mitigate the content bottleneck problem [7]. The content bottleneck problem in live streaming is even more severe. Video content in live streaming has strict playback deadlines. Even a temporary decrease in peer bandwidth utilization leads to peer playback quality degradation, such as video playback freezing or skipping. To make things worse, at any given moment, peers are only interested in downloading a small set of chunks falling into the current playback window. This greatly increases the possibility of content bottleneck. One way to address this problem is to compromise user viewing quality. For example, a lower video playback rate would impose a lower peer bandwidth utilization requirement. Allowing a longer playback delay also allows a larger set of chunks to be exchanged among peers. The alternative solution lies in designing more efficient peering strategies and chunk scheduling methods. This paper deals with the second solution. Our focus is on the design of a chunk scheduling method that can support high resolution video streaming service. We assume collaborative P2P systems. Peers help each other and forward received video chunks to other peers. Motivated by the effectiveness of buffer control on switches and Active Queue Management (AQM) on routers, we propose a novel queue-based chunk scheduling algorithm, AQCS, to adaptively eliminate content bottlenecks in P2P streaming. Using queue-based signaling between peers and the content source server, the amount of workload assigned to a peer is proportional to its available upload capacity, which leads to high bandwidth utilization. The queue-based signaling also enables the proposed scheme to adapt to the changing network environment. The data chunks are relayed at most once by intermediate peers. Hence the video content can be disseminated to all peers quickly. The simplicity of the design also reduces the signaling overhead. Our contributions are three-fold:
– We propose a simple queue-based chunk scheduling method achieving high bandwidth utilization in P2P live streaming. We theoretically show that the proposed scheme can support the optimal streaming rate in idealized network environments. A practical algorithm is designed to achieve close-to-optimum performance in realistic network environments.
– A full-feature prototype is developed to test the feasibility and efficiency of the proposed scheduling algorithm. Various design considerations are explored to handle dynamics in realistic network environments, including peer churn, peer bandwidth variations, and congestion inside the network.
– The performance of the prototype system is examined through experiments over PlanetLab [8]. Both the optimality and the adaptiveness of the proposed chunk scheduling method are demonstrated.
The remaining part of this paper is organized as follows. Background and related work are covered in Section 2. The queuing model and scheduling algorithm of queue-based chunk scheduling are described in Section 3. Implementation considerations are explored in Section 4. The experiment results are reported in Section 5. Finally, Section 6 ends the paper with concluding remarks.
2 Background and Related Work
In [9], we propose HCPS, a Hierarchically Clustered P2P Streaming system that can support a streaming rate approaching the optimum upper bound with short delay, yet is simple enough to be implemented in practice. In HCPS, the peers are grouped into small clusters and a hierarchy is formed among clusters to retrieve video data from the source server. By actively balancing the uploading capacities among clusters, and executing the perfect scheduling algorithm [10] within each cluster, the system resources can be efficiently utilized. Fig. 1 depicts a two-level HCPS system. At the lower level, all peers are organized into bandwidth-balanced clusters. Each cluster consists of a cluster head and a small number (e.g., from 20 to 40) of normal peers. Cluster heads of all clusters form a super cluster at the upper level and retrieve video from the video source server in P2P fashion. After obtaining video at the upper level, a cluster head acts as a local video proxy server for normal peers in its cluster at the lower level. Normal peers within the same cluster collaborate according to the perfect scheduling algorithm to retrieve video from their cluster head. In this new architecture, the server only distributes video to cluster heads; a normal peer only maintains connections to its neighbors in the same cluster; and a cluster head connects to both other cluster heads and normal peers in its own cluster. Two-level HCPS already has the ability to support a large number of peers with mild requirements on the number of connections on the server, cluster heads and normal peers.
[Fig. 1 diagram omitted. Labels: source server S, cluster heads, head mapping, top level, base level.]
Fig. 1. Hierarchically Clustered P2P Streaming System
The perfect scheduling employed in HCPS does not work well in practice, though. It requires a central controller that collects all peers’ upload capacity information, and computes the sub-stream rates sent from the server to individual peers. In practice, available upload capacity may not be known and can vary over time. The central coordinator needs to continuously monitor peers’ upload capacity and re-compute the sub-stream rates. The adaptive queue-based chunk scheduling method is a distributed solution. Peers only exchange information with the server and make local decisions. Upload capacity information of peers/source is not required, and the scheme adapts to the changing peer membership and network environment automatically. Hence AQCS is a more suitable peer scheduling method for HCPS at cluster level. There have been ongoing efforts intending to improve resource utilization in P2P live streaming. To improve the resource utilization in mesh-based P2P streaming, [11] proposes a two-phase swarming scheme where the fresh content is quickly diffused to the entire system in the first phase, and peers exchange available content in the second phase. Network coding is also applied to P2P live streaming. [12] performs a reality check by using network coding for P2P live streaming. Nevertheless, neither approach can provably achieve the maximum streaming rate. The authors in [13] design a randomized distributed algorithm, Random Useful Packet Forwarding (RUPF), that can converge to the maximum streaming rate. They also study the delay that users must endure in order to play the stream with a small amount of missing data. The queue-based chunk scheduling method is a deterministic distributed algorithm with small control overhead. No data chunk needs to be skipped to achieve short startup delay. AQCS also incurs less signaling overhead since no bit-maps carrying the data chunk availability information are exchanged between peers. Our initial experiments show that AQCS outperforms the randomized broadcasting scheme.
3 Adaptive Queue-Based Chunk Scheduling
It is desirable to support a high streaming rate in P2P streaming. A higher streaming rate allows the system to broadcast video with better quality. It also provides more cushion to absorb the bandwidth variations (caused by peer churn, network congestion, etc.) when constant-bit-rate (CBR) video is broadcast. The key to achieving a high streaming rate is to better utilize peers’ uploading bandwidth. In this section, we describe the queue-based chunk scheduling algorithm that can achieve close to optimal peer uploading bandwidth utilization in a practical P2P networking environment. In AQCS, data chunks are pulled/pushed from the server to peers, cached in the peers’ queues, and relayed from peers to their neighbors. The availability of upload capacity is inferred from the queue status, such as the queue size or whether the queue is empty. Signals are passed between peers and the server to convey whether a peer’s upload capacity is available. Fig. 2 depicts a P2P streaming system using queue-based chunk scheduling with one source server and three peers. Each peer maintains several queues including a forward queue. Using peer a as an example, the signal and data flow is described next. Pull signals are sent from peer a to the server whenever the queues become empty (or have fallen below a threshold) (step 1 in Fig. 2). The server responds to the pull signal by sending three data chunks back to peer a (step 2). These chunks will be stored in the forward queue (step 3) and be relayed to peer b and peer c (step 4). When the server has responded to all ’pull’ signals on its ’pull’ signal queue, it sends one duplicate data chunk to all peers (step 5). These data chunks will not be stored in the forward queue and will not be relayed further. Next we first describe in detail the queue-based scheduling mechanism at the source server and peers. The optimality of the scheme is shown afterwards.
[Fig. 2 diagram omitted. Legend: pull signal from peer to server; chunks in response to pull signal; chunks with no pull signal.]
Fig. 2. Queue-based chunk scheduling example with four nodes
[Fig. 3 diagrams omitted. Panels: (a) Queue Model of Peers; (b) Queue Model of Source Server.]
Fig. 3. Queue Models of Peers and Source Server
3.1 Peer Side Scheduling and Its Queuing Model
Fig. 3(a) depicts the queuing model for peers in the queue-based scheduling method. A peer maintains a playback buffer that stores all received streaming content from the source server and other peers. The received content from different nodes is assembled in the playback buffer in playback order. The peer’s media player renders/displays the content from this buffer. Meanwhile, the peer maintains a forwarding queue which is used to forward content to all other peers. The source server marks the data content either as F-marked content or NF-marked content before transmitting it. F (forwarding) represents content that should be relayed/forwarded to other peers. NF (non-forwarding) indicates that content is intended for this peer only and no forwarding is required. NF content is filtered out at peers. F content is stored into the forward queue, marked as NF content, and forwarded to other peers. Because the relayed content is always marked as NF at the relaying peer, data content is relayed at most once in AQCS, which reduces the content distribution time and startup delay.
In order to fully utilize a peer’s upload capacity, the peer’s forwarding queue should be kept busy. A signal is sent to the source server to request more content whenever the forwarding queue becomes empty. This is termed a ’pull’ signal. The rules for marking the content at the source server are described next.
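The peer-side rules of Section 3.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `Peer`, the `threshold` parameter, and the `pulls_sent` counter (which stands in for actually sending a 'pull' signal to the server) are all hypothetical names introduced for illustration.

```python
from collections import deque


class Peer:
    """Hypothetical peer with a playback buffer and a forwarding queue."""

    def __init__(self, name, threshold=0):
        self.name = name
        self.playback = {}             # chunk id -> payload
        self.forward_queue = deque()   # F-marked chunks awaiting relay
        self.threshold = threshold     # pull when queue length <= threshold
        self.pulls_sent = 0            # stand-in for signals sent to the server

    def receive(self, chunk_id, payload, mark):
        """Store every chunk for playback; queue only F-marked chunks."""
        self.playback[chunk_id] = payload
        if mark == "F":
            self.forward_queue.append((chunk_id, payload))

    def relay_one(self, others):
        """Relay one queued chunk re-marked as NF, then pull if the queue runs low."""
        if self.forward_queue:
            chunk_id, payload = self.forward_queue.popleft()
            for peer in others:
                peer.receive(chunk_id, payload, "NF")
        if len(self.forward_queue) <= self.threshold:
            self.pulls_sent += 1


a, b, c = Peer("a"), Peer("b"), Peer("c")
a.receive(1, "chunk-1", "F")   # the server answers a's pull with an F-marked chunk
a.relay_one([b, c])            # a relays it once; b and c do not relay it again
```

Because relayed chunks arrive marked NF, peers b and c keep them for playback only, which is the "relayed at most once" property described above.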
Y. Guo, C. Liang, and Y. Liu
3.2 Server Side Scheduling Algorithm and Its Queuing Model
Fig. 3(b) illustrates the server-side queuing model of AQCS. The source server has two queues: a content queue and a signal queue. The signal queue buffers the 'pull' signals issued by peers. The content queue is a multi-server queue with two dispatchers: an F-marked content dispatcher and a forward dispatcher. Which dispatcher is invoked depends on the status of the 'pull' signal queue. Specifically, if there are 'pull' signals in the signal queue, a small chunk of content is taken from the content buffer, marked as F content, and dispatched by the F-marked content dispatcher to the peer that issued the 'pull' signal. The 'pull' signal is then removed from the signal queue. In contrast, if the signal queue is empty, the server takes a small chunk of content from the content buffer and puts it into the forwarding queue. The forwarding dispatcher marks the chunk as NF and sends it to all peers in the system.
3.3 Optimality of Queue-Based Chunk Scheduling
Given a content source server and a set of peers with known upload capacities, the maximum streaming rate, $r_{\max}$, is governed by the following formula [10]:
$$r_{\max} = \min\left\{ u_s,\ \frac{u_s + \sum_{i=1}^{n} u_i}{n} \right\} \qquad (1)$$
where $u_s$ is the content source server's upload capacity, $u_i$ is peer i's upload capacity, and n is the number of peers in the system. The second term on the right-hand side of the equation, $(u_s + \sum_{i=1}^{n} u_i)/n$, is the average upload capacity per peer. The maximum/optimal streaming rate is $r_{\max} = u_s$ if $u_s \le (u_s + \sum_{i=1}^{n} u_i)/n$, and $r_{\max} = (u_s + \sum_{i=1}^{n} u_i)/n$ if $u_s > (u_s + \sum_{i=1}^{n} u_i)/n$. The first case is termed the server resource poor scenario, where the server's upload capacity is the bottleneck. The second case is termed the server resource rich scenario, where the peers' average upload capacity is the bottleneck.
Theorem 1. Assume that the signal propagation delay between a peer and the server is negligible and that the data content can be transmitted in arbitrarily small amounts. Then the queue-based decentralized scheduling algorithm described above achieves the maximum streaming rate possible in the system.
Sketch of proof: In the server resource poor scenario, the source server bandwidth is the bottleneck and cannot handle all 'pull' signals issued by peers. The signal queue at the server side is hence non-empty and the entire server bandwidth is used to transmit F-marked content to peers. In contrast, a peer's forward queue becomes idle while waiting for new data content from the source server. Since each peer has sufficient upload bandwidth to relay the F-marked content (received from the server) to all other peers, the supportable streaming rate is equal to the server's upload capacity. The optimal rate is achieved. In the server resource rich scenario, the server has the bandwidth to service the 'pull' signals. During the time period when the 'pull' signal queue is empty, the server transmits duplicate NF-marked content to all peers. It can be shown that the streaming rate is $(u_s + \sum_{i=1}^{n} u_i)/n$. The optimal rate is again reached. The detailed proof can be found in [14].
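Equation (1) is simple enough to evaluate directly. The following Python sketch is only an illustration; the function name `max_streaming_rate` is hypothetical, and the peer capacities are the 40-peer distribution reported in Section 5 under the assumption that 1 Mbps means 1000 kbps.

```python
def max_streaming_rate(server_upload, peer_uploads):
    """Evaluate Eq. (1): min of u_s and the average upload capacity per peer."""
    n = len(peer_uploads)
    if n == 0:
        raise ValueError("at least one peer is required")
    average = (server_upload + sum(peer_uploads)) / n
    return min(server_upload, average)


# Assumed peer mix (kbps): 20% at 128, 40% at 384, 25% at 1000, 15% at 4000.
peers = [128] * 8 + [384] * 16 + [1000] * 10 + [4000] * 6

poor = max_streaming_rate(560, peers)    # server resource poor: limited by u_s
rich = max_streaming_rate(3200, peers)   # server resource rich: limited by the average

# The two terms of Eq. (1) are equal when (n - 1) * u_s equals the sum of peer uploads.
turning_point = sum(peers) / (len(peers) - 1)
```

Under these assumed capacities the two regimes meet at a server bandwidth of roughly 1.06 Mbps, which is in line with the turning point of about 1.1 Mbps discussed in the experiments.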
AQCS: Adaptive Queue-Based Chunk Scheduling for P2P Live Streaming
4 Implementation Considerations
The architecture of the content source server and peers using queue-based data chunk scheduling is now described with an eye toward practical implementation considerations, including the impact of chunk size, propagation delay, network congestion, and peer churn. In the optimality proof, it was assumed that the chunk size could be arbitrarily small and the propagation delay was negligible. In practice, the chunk size is on the order of kilobytes to avoid excessive transmission overhead caused by protocol headers. The propagation delay is on the order of tens to hundreds of milliseconds. Hence, it is necessary to adjust the timing of issuing 'pull' signals by the peers and to increase the number of F-marked chunks served at the content source server to allow the decentralized scheduling method to achieve close to the optimal live streaming rate.
At the server side, K F-marked chunks are transmitted as a batch in response to a 'pull' signal from a requesting peer (via the F-marked content queue). A larger value of K reduces the 'pull' signal frequency and thus the signaling overhead. This, however, increases the peers' threshold given in Equation (2). Denote by $T_i$ the threshold for peer i to issue a 'pull' signal. A 'pull' signal is sent to the server whenever the number of chunks in the queue is less than or equal to $T_i$. The time to empty the forwarding queue with $T_i$ chunks is $t_i^{\mathrm{empty}} = (n-1) T_i \delta / u_i$. Meanwhile, it takes $t_i^{\mathrm{receive}} = 2 t_{si} + K\delta/u_s + t_q$ for peer i to receive K chunks after it issues a pull signal. Here $t_{si}$ is the propagation delay between the source server and peer i, $K\delta/u_s$ is the time required for the server to transmit K chunks, and $t_q$ is the queuing delay seen by the 'pull' signal at the server's pull signal queue. In order to receive the chunks before the forwarding queue becomes fully drained, we need $t_i^{\mathrm{empty}} \ge t_i^{\mathrm{receive}}$. This leads to:
$$T_i \ge \frac{(2 t_{si} + K\delta/u_s + t_q)\, u_i}{(n-1)\,\delta} \qquad (2)$$
All quantities are known except $t_q$, the queuing delay incurred at the server-side signal queue. In the server resource poor scenario, where the source server is the bottleneck, the selection of $T_i$ does not affect the streaming rate as long as the server is always busy. In the server resource rich scenario, since the service rate of the signal queue is faster than the pull signal rate, $t_q$ is very small, so we set $t_q$ to zero. This leads to the following 'pull' signal threshold formula, which can be used to guide the threshold selection:
$$T_i \ge \frac{(2 t_{si} + K\delta/u_s)\, u_i}{(n-1)\,\delta} \qquad (3)$$
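As a rough illustration of how Equations (2) and (3) could guide threshold selection, the Python sketch below returns the smallest integer threshold that satisfies the bound. The function name `pull_threshold` and all numeric values are hypothetical, and δ is assumed here to be the chunk size, expressed in kbits so that it is consistent with capacities in kbps and delays in seconds.

```python
import math


def pull_threshold(t_si, K, delta, u_s, u_i, n, t_q=0.0):
    """Smallest integer T_i satisfying Eq. (2); with t_q = 0 this is Eq. (3).

    t_si, t_q: delays in seconds; delta: assumed chunk size in kbits;
    u_s, u_i: upload capacities in kbps; n: number of peers; K: batch size.
    """
    bound = (2 * t_si + K * delta / u_s + t_q) * u_i / ((n - 1) * delta)
    return math.ceil(bound)


# Hypothetical setting: 50 ms one-way delay, batches of 3 chunks of 8 kbits,
# a 2.4 Mbps server, a 4 Mbps peer, and 40 peers.
T = pull_threshold(t_si=0.05, K=3, delta=8.0, u_s=2400.0, u_i=4000.0, n=40)

# Sanity check of the underlying inequality t_empty >= t_receive.
t_empty = (40 - 1) * T * 8.0 / 4000.0
t_receive = 2 * 0.05 + 3 * 8.0 / 2400.0
```

The check at the end mirrors the derivation: the time to drain T queued chunks must not be shorter than the time needed to receive the next batch.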
The architectures of source server and peer are described next. Figure 4 illustrates the architecture of the source server. Using the ’select call’ mechanism to monitor the connections with peers, the server maintains a set of input buffers to store received data. There are three types of incoming messages: management message, ’pull’ signal, and missing chunk recovery request. Correspondingly three independent queues are formed for these messages. If the output of handling these messages needs to be transmitted to remote peers, the output is put on the per-peer out-unit.
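[Figure 4: block diagram of the source server, showing the source, the select call, the packet handler facing the Internet, and the MSG, PULL SIG, and RECOV REQ queues.]
Fig. 4. Server Architecture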
There is one out-unit for each destination peer to handle the data transmission process. Each out-unit has four queues for a given peer: a management message queue, an F-marked content queue, an NF-marked content queue, and a missing chunk recovery queue. The management message queue stores responses to management requests. An example of a management request is when a new peer has just joined the P2P system and requests the peer list. The F/NF-marked content queue stores the F/NF-marked content intended for this peer. Finally, the chunk recovery queue stores the missing chunks requested by the peer.
Different queues are used for different types of traffic in order to prioritize the traffic types. Specifically, management messages have the highest priority, followed by F-marked content and then NF-marked content. The priority of recovery chunks can be adjusted based on the design requirements. Management messages have the highest priority because they are important for the system to run smoothly. The content source server replies to each 'pull' signal with F-marked chunks. F-marked chunks are further relayed to other peers by the receiving peer. The content source server sends out an NF-marked chunk to all peers when the 'pull' signal queue is empty. NF-marked chunks are used by the destination peer only and are not relayed further. Therefore, serving F-marked chunks promptly improves the utilization of peers' upload capacity and increases the overall P2P system streaming rate.
Another reason for using separate queues is to deal with bandwidth fluctuation and congestion inside the network. Many P2P researchers assume that the server's or peers' upload capacity is the bottleneck. In our experiments over PlanetLab, we observed that some peers may slow down significantly due to congestion. If all the peers share the same queue, the upload to the slowest peer will block the uploads to the remaining peers. This is similar to the head-of-line blocking problem in input-queued switch design. Separate queues avoid inefficient blocking caused by slow peers.
The peers' architecture is similar to the server's and is omitted here. In addition, we design a missing chunk recovery scheme that enables peers to recover missing chunks and avoid viewing quality degradation. Refer to [14] for more details.
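The per-peer out-unit with prioritized queues described above might be sketched as follows. This is only one possible reading of the text: the class `OutUnit`, its method names, and the `recovery_rank` parameter (which models the adjustable priority of recovery chunks) are hypothetical.

```python
from collections import deque

# Fixed ordering stated in the text: management first, then F-marked, then NF-marked.
BASE_ORDER = ("management", "F", "NF")


class OutUnit:
    """Hypothetical per-peer out-unit holding one queue per traffic type."""

    def __init__(self, recovery_rank=3):
        order = list(BASE_ORDER)
        order.insert(recovery_rank, "recovery")   # rank 3 places recovery last
        self.order = order
        self.queues = {kind: deque() for kind in order}

    def enqueue(self, kind, item):
        self.queues[kind].append(item)

    def next_item(self):
        """Return the next (kind, item) by priority, or None if all queues are empty."""
        for kind in self.order:
            if self.queues[kind]:
                return kind, self.queues[kind].popleft()
        return None


unit = OutUnit()
unit.enqueue("NF", "nf-chunk")
unit.enqueue("F", "f-chunk")
unit.enqueue("management", "peer-list")
sent = [unit.next_item() for _ in range(4)]
```

Keeping one such unit per destination peer means that a slow connection only stalls its own queues, which is the head-of-line blocking argument made in the text.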
5 Experiment Results
In this section, we examine the performance of AQCS via experiments over PlanetLab [8]. 40+ nodes (one content source server, one public sink and 40 users/peers) are
used with most of them located in North America. All connections between nodes are TCP connections. TCP connections avoid network layer data losses, and allow us to use the software package Trickle [15] to set a node's upload capacity. In our experiments, we observe that the obtained upload bandwidth is slightly larger (< 8%) than the value we set using Trickle. To account for this error, we measure the actual upload bandwidth, and use the measured rate for plotting the graphs. The upload capacities of peers are assigned randomly according to the distribution obtained from the measurement study conducted in [16]. The largest uplink speed is reduced from 5000 kbps to 4000 kbps, which ensures that PlanetLab nodes have sufficient bandwidth to support the targeted rate. Specifically, 20% of peers have an upload capacity of 128 kbps, 40% have 384 kbps, 25% have 1 Mbps, and 15% have 4 Mbps.
• Optimality evaluation. All 40 peers join the system at the beginning of the experiment, and stay for the entire duration of the experiment. The content source server's upload capacity is varied from 320 kbps to 5.6 Mbps. For each server upload capacity setting, we run an experiment for 5 minutes. The achieved streaming rate is collected every 10 seconds and the average value is reported at the end of each experiment.
[Figure 5: (a) Achieved Rate vs. Optimal Rate, plotting the achieved rate and the perfect rate (kbps) against server bandwidth (kbps), with the server resource poor and server resource rich regions marked; (b) Distribution of Chunks from Server, showing F-marked, NF-marked, and unique chunks against server bandwidth (kbps).]
Fig. 5. Optimality Evaluation Results
Fig. 5(a)
shows the achieved streaming rate vs. the optimal rate with different server bandwidths. The difference never exceeds 10% of the optimal rate possible in the system. The curves exhibit two segments with a turning point at around 1.1 Mbps. According to Equation (1), the server bandwidth is the bottleneck when it is smaller than 1.1 Mbps (server resource poor scenario). The streaming rate is equal to the source rate. As the server bandwidth becomes greater than 1.1 Mbps, the peers' average upload capacity becomes the bottleneck (server resource rich scenario). The streaming rate still increases linearly, but with a smaller slope. Notice that AQCS performs better in the server resource poor scenario than in the server resource rich scenario. We plot the numbers of F-marked and NF-marked chunks sent out by the source server to explain what causes the difference. As shown in Fig. 5(b), when the server bandwidth is 560 kbps (server resource poor scenario), very few NF-marked chunks are transmitted. In theory, no NF-marked chunks should be sent in this scenario since the signal queue is always non-empty. We do see several NF-marked chunks, which are caused by the bandwidth variation in the network. The
variation occasionally causes the server's pull signal queue to become empty. In contrast, more and more NF-marked chunks are sent by the server as its uplink capacity increases beyond 1.1 Mbps (server resource rich scenario). In the server resource poor scenario, the server sends out F-marked chunks exclusively. As long as the pull signal queue is not empty, the optimal streaming rate can be achieved. In the server resource rich scenario, the server sends out both F-marked and NF-marked chunks. If F-marked chunks are delayed at the server or along the route from the server to peers due to bandwidth variations or peer churn, peers cannot receive F-marked chunks promptly. Peers' forward queues become idle and upload bandwidth is wasted. Nevertheless, AQCS always achieves a streaming rate within 10% of the theoretical optimal rate.
• Adaptiveness to peer churn and bandwidth variations. Peer churn has been identified as one of the major disruptions to the performance of P2P systems. We next study how AQCS performs in the face of peer churn. In this 10-minute experiment, the server bandwidth is set to 2.4 Mbps. Three peers with a bandwidth of 4 Mbps are selected to leave the system at 200 seconds, 250 seconds, and 300 seconds, respectively. Two peers rejoin the system at 400 seconds, and the third one rejoins at 450 seconds.
[Figure 6: (a) Streaming Rate under Peer Churn, plotting the achieved rate and the perfect rate (kbps) against time (s); (b) Marked Chunk Receiving Rate with Background Traffic, plotting the F-marked chunk rate (kbps) and the noise traffic rate (kbps) against time (s).]
Fig. 6. Adaptiveness to Peer Churn and Bandwidth Variations
Fig. 6(a) depicts the achieved rate vs. the optimal rate every
10 seconds. Although the departure and joining of a peer do introduce disruptions to the achieved streaming rate, overall the achieved streaming rate tracks the optimal rate closely. The difference between them never exceeds 12% of the optimal rate.
In addition to peer churn, the network bandwidth varies over time due to cross traffic. To evaluate AQCS's adaptiveness to network bandwidth variations, the following experiment is conducted. We set up a sink on a separate PlanetLab node not participating in P2P streaming. One peer in the streaming system with an upload capacity of 4 Mbps is selected to establish multiple parallel TCP connections to the sink. Each TCP connection sends garbage data to the sink. The noise traffic generated by those TCP connections causes variations in the bandwidth available for the P2P video threads on the selected peer. Fig. 6(b) depicts the rate at which the F-marked chunks are received at the selected peer together with the sending rate of noise traffic. During time periods of (120 sec,
280 sec) and (380 sec, 450 sec), the noise traffic threads are on. The queue-based chunk scheduling method adapts quickly to the decreasing available bandwidth by reducing its pull signal rate. Consequently, the server reduces the rate of F-marked chunks sent to the selected peer. When the noise traffic is turned off, the server sends more F-marked chunks to the selected peer to fully utilize its available upload bandwidth. The self-adaptiveness of the queue-based chunk scheduling method keeps the overall achieved streaming rate close to the optimal rate.
• Optimality comparison. Finally, the performance of AQCS is compared with that of the Random Useful Packet Forwarding (RUPF) scheme [13], which has also been proved to be theoretically optimal. We implemented a prototype of the randomized broadcasting scheme and carefully tuned the configuration parameters so that it performs at its peak. The server bandwidth is chosen to be 3.2 Mbps; however, the conclusion holds for the different scenarios we tested.
[Figure 7: (a) Chunk Missing Ratio, plotting the missing chunk ratio of AQCS and RUPF against streaming rate (kbps); (b) Bandwidth Utilization, plotting the CDF of bandwidth utilization for AQCS and RUPF.]
Fig. 7. Comparison with RUPF
Fig. 7(a) depicts the chunk missing ratio, the fraction of chunks that are lost or miss the playback deadline, at different streaming rates. Both adaptive queue-based chunk scheduling (AQCS) and the randomized scheduling (RUPF) achieve zero loss when the streaming rate is low. However, as the streaming rate increases beyond 1 Mbps, AQCS remains loss-free while RUPF starts to experience missing chunks. In fact, AQCS maintains zero loss until the streaming rate reaches 1.1 Mbps, the maximum streaming rate allowed by the system. The missing ratio of AQCS increases linearly after that due to insufficient bandwidth in the P2P system. Fig. 7(b) explains why AQCS outperforms RUPF consistently. The CDFs of peer bandwidth utilization for both AQCS and RUPF are plotted at a streaming rate of 1.12 Mbps. It is evident that peers in AQCS are able to utilize their upload bandwidth more efficiently than the peers in RUPF.
6 Conclusions
In this paper, we propose a simple queue-based chunk scheduling method that supports a streaming rate close to the maximum rate allowed by a P2P streaming system. A
prototype is implemented and various design considerations are explored to ensure that the algorithm works in a realistic network environment. The experiments over PlanetLab further demonstrate the optimality and the adaptiveness of the proposed queue-based chunk scheduling method. Future work can develop along several avenues. As a first attempt at applying queue management to P2P streaming, we used simple queue control schemes. We will explore the queue control design space to further improve performance. The other direction is to apply this approach to other P2P content distribution applications such as file sharing and video-on-demand. Our work demonstrated the effectiveness of application layer queue management in eliminating content bottlenecks in P2P live streaming. It is a natural extension to explore its applicability in other P2P applications.
References
1. Youtube (Youtube Homepage), http://www.youtube.com
2. Chu, Y.H., Rao, S.G., Zhang, H.: A case for end system multicast. In: Proceedings of ACM SIGMETRICS (2000)
3. Zhang, X., Liu, J., Li, B., Yum, T.S.P.: DONet/CoolStreaming: A data-driven overlay network for live media streaming. In: Proceedings of IEEE INFOCOM (2005)
4. PPLive (PPLive Homepage), http://www.pplive.com
5. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.: A Measurement Study of a Large-Scale P2P IPTV System. IEEE Transactions on Multimedia (2007)
6. BT (BitTorrent Homepage), http://www.bittorrent.com
7. Gkantsidis, C., Rodriguez, P.R.: Network Coding for Large Scale Content Distribution. In: Proceedings of IEEE INFOCOM (2005)
8. PlanetLab (PlanetLab Homepage), http://www.planet-lab.org
9. Liang, C., Guo, Y., Liu, Y.: Hierarchically clustered p2p streaming system. In: Proceedings of GLOBECOM (2007)
10. Kumar, R., Liu, Y., Ross, K.: Stochastic fluid theory for p2p streaming systems. In: Proceedings of IEEE INFOCOM (2007)
11. Magharei, N., Rejaie, R.: PRIME: Peer-to-Peer Receiver-drIven MEsh-based Streaming. In: Proceedings of IEEE INFOCOM (2007)
12. Wang, M., Li, B.: Lava: A reality check of network coding in peer-to-peer live streaming. In: Proceedings of IEEE INFOCOM (2007)
13. Massoulie, L., Twigg, A., Gkantsidis, C., Rodriguez, P.: Randomized decentralized broadcasting algorithms. In: Proceedings of IEEE INFOCOM (2007)
14. Guo, Y., Liang, C., Liu, Y.: Adaptive Queue-based Chunk Scheduling for P2P Live Streaming. Polytechnic U., Tech. Rep (2007)
15. Trickle (Trickle Homepage), http://monkey.org/~marius/pages/?page=trickle
16. Bharambe, A.R., Herley, C., Padmanabhan, V.N.: Analyzing and Improving a BitTorrent Network's Performance Mechanisms. In: Proceedings of IEEE INFOCOM (2006)
E2E Blocking Probability of IPTV and P2PTV
Yue Lu¹, Fernando Kuipers¹, Milena Janic², and Piet Van Mieghem¹
¹ Delft University of Technology, {Y.Lu,F.A.Kuipers,P.F.A.VanMieghem}@tudelft.nl
² TNO Information and Communication Technology, [email protected]
Abstract. Increased Internet speeds together with new possibilities for tailor-made television services have spurred the interest in providing television via the Internet. Several television services are readily available and both IP-layer and application-layer (P2P) technologies are used. When disregarding commercial influences, customers will choose for IPTV or P2PTV based on the Quality of Experience (QoE). In this paper, we investigate one important QoE measure, namely the content blocking probability.
1 Introduction
The demand for digital television via a dedicated network or the Internet is growing fast. A digital television service can be provided in two different ways: either utilizing a mechanism on the IP layer (IPTV) or on the application layer (P2PTV). IPTV is implemented in a dedicated network, which connects the end-users' television set-top boxes through Digital Subscriber Line Access Multiplexers (DSLAMs). The television programs are collected at a data centre and distributed towards the DSLAMs along an IP-layer multicast tree. A DSLAM replicates the received signal and sends it to the end users. P2PTV can either make use of application-layer multicast trees or chunk-based P2P technology. The majority of existing P2PTV applications are chunk-based. They make use of the BitTorrent technology¹ and are currently free of charge. Disregarding economic incentives, telecom operators will continue to deploy and extend IPTV in the dedicated network if its quality surpasses the quality of P2PTV. In contrast, if connecting end-users' set-top boxes in a P2P way can achieve better Quality of Experience (QoE), the operator should consider deploying P2PTV technology instead. Hence, defining, computing and comparing the quality of these two television architectures is deemed essential. In this paper, we focus on the content blocking probability as our QoE measure of interest, and ask the following questions:
This work has been partially supported by the European Union CONTENT NoE (FP6-IST-038423) and the triangular cooperation KPN-TNO-TUDelft.
¹ http://www.bittorrent.com/
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 445–456, 2008. © IFIP International Federation for Information Processing 2008
Y. Lu et al.
– What factors contribute to the blocking of IPTV and P2PTV, and how?
– Which technology incurs the lowest end-to-end blocking?
– In what situation can P2PTV offer users better QoE than IPTV?
In order to answer these questions we have developed blocking models for IPTV and P2PTV, which we subsequently apply to two case studies. Our case study of IPTV is based on measurement data of an existing Dutch IPTV network. Our P2PTV case study is based on measurements of SopCast². The remainder of this paper is organized as follows. Section 2 presents related work. In Section 3 the blocking of IPTV is defined and computed, and applied to our case study. Section 4 adopts the same methodology for P2PTV. Section 5 compares the two architectures. Finally, we conclude in Section 6.
2 Related Work
Work on IPTV mainly focuses on designing protocols (e.g., [1]) and implementations by broadband network operators, router manufacturers, and television providers. For the performance of IPTV, video quality and packet loss were analyzed, but the end-to-end blocking of requests has not been computed from a users’ point of view. Karvo et al. [2] set up a queuing model to calculate the end-to-end multicast blocking, but their work is not aimed at IPTV and is not based on realistic data. In this paper, we will define and compute the IPTV blocking probability based on realistic data. The majority of P2PTV implementations (e.g., SopCast, PPLive [3], and CoolStreaming [4]) use chunk-based BitTorrent technology. Consequently, in this paper, we only consider chunk-based P2PTV. Related work on BitTorrent-like P2P blocking and performance [5], [6], [7], only targets the P2P file sharing systems. Empirical blocking results were provided [3], [4], but a definition of and a formula to compute the end-to-end blocking in P2PTV architectures are still missing. In this paper, we provide a first step in this direction.
3 IPTV Blocking Probability
3.1 Model with Assumptions
Figure 1 illustrates a possible realization of an IPTV architecture. As this figure indicates, end users are connected to DSLAMs which, in turn, are connected to an edge router. We only model the blocking of a single television channel over a single DSLAM. We assume that C represents the capacity of a link from a particular DSLAM to the edge router. The maximum number of channels that can be transmitted simultaneously over a link with capacity C is $m = C/C_o$, where $C_o$ is the capacity of one television channel. We further assume that there are K available television channels that can be viewed. If a channel is being viewed, we say it is “on”.
² http://www.sopcast.org/
[Figure 1: end users connect to DSLAMs within one domain; each DSLAM connects to the edge router over one link with capacity C (the part considered here); the edge router reaches the data center through backbone routing. Two channels of different popularity are shown.]
Fig. 1. A possible realization of an IPTV architecture
Not all the channels are
equally popular. In our model we assume that the popularity distribution $\alpha_i$ is given³. The arrival and departure process of TV users is assumed to be Poissonian. In reality, the request arrival rate λ(t) is non-stationary. If a popular TV program starts, the arrival rate will be high, while once the advertisements come, the average arrival rate decreases. However, we consider the average case over a specific time period. Based on a TV-users market survey⁴, the arrival process is approximately Poissonian in the period of 16:00 to 22:00. According to the measurement study of X. Hei et al. [3, Fig. 13], the CDF of TV viewing time follows an exponential distribution, which justifies our assumption that the TV user departure process is Poissonian.
3.2 Computation of the Blocking Probability
If a user, requesting a particular television channel i, cannot receive its content, we consider channel i to be blocked. Two main causes of blocking in IPTV are:
I. Limited processing capability of a DSLAM. If many users simultaneously desire content from the same DSLAM, blocking might occur. The blocking probability of this case is denoted as $B_{proc}$.
II. Insufficient available capacity from the DSLAM to the edge router. If a maximum number of m channels is transmitted to the DSLAM, a new user requesting a new channel i (not among the transmitted m channels) will find his request blocked. Since the more popular channels have a higher probability to be present among the already transmitted m channels, $B_{link}(i)$ depends on which television channel i is requested.
³ We have used the TV channel popularity distribution in The Netherlands, which is available at http://www.kijkonderzoek.nl/ (in Dutch).
⁴ Market Report of TV customers in The Netherlands, http://www.kijkonderzoek.nl/
Since blocking II can only occur if blocking I did not take place, for IPTV the end-to-end blocking probability B(i) for channel i is
$$B(i) = B_{proc} + (1 - B_{proc})\,B_{link}(i) \qquad (1)$$
$B_{proc}$: Assuming Poisson arrivals and departures, as explained in Section 3.1, we model the DSLAM as an M/M/n/n/s queue, where s represents the number of users accessing the DSLAM and n the number of replications that the DSLAM can handle. In our model $\rho_{DSLAM} = \lambda_{DSLAM}/\mu_{DSLAM}$ is the same for each user, where $\lambda_{DSLAM}$ represents the rate at which a user requests a TV service, and $\mu_{DSLAM}$ the rate at which the user turns his TV off. According to [8, pp. 512], we have
$$B_{proc} = \frac{\dfrac{(s-1)!}{(s-1-n)!\,n!}\,\rho_{DSLAM}^{\,n}}{\displaystyle\sum_{h=0}^{n} \dfrac{(s-1)!}{(s-1-h)!\,h!}\,\rho_{DSLAM}^{\,h}} \qquad (2)$$
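Formula (2) can be evaluated numerically by noting that each factorial ratio is a binomial coefficient. The Python sketch below is an illustration only: the function name `engset_blocking` is hypothetical, and the value used for $\rho_{DSLAM}$ is an arbitrary assumption, since no value for it is given in this part of the text.

```python
from math import comb


def engset_blocking(s, n, rho):
    """Evaluate formula (2) for s users, n replications, and per-user load rho.

    (s-1)! / ((s-1-h)! h!) is the binomial coefficient C(s-1, h).
    Plain floats are used, so very large s may overflow.
    """
    numerator = comb(s - 1, n) * rho ** n
    denominator = sum(comb(s - 1, h) * rho ** h for h in range(n + 1))
    return numerator / denominator


# Small case that can be checked by hand: s = 3, n = 1, rho = 1 gives 2 / (1 + 2).
tiny = engset_blocking(3, 1, 1.0)

# Case-study sizes (s = 171, n = 120) with an assumed, purely illustrative load.
b_proc = engset_blocking(171, 120, 0.5)
```

When n is at least s − 1 the binomial coefficient in the numerator vanishes or the DSLAM can serve every other user, so the function returns a blocking probability of zero or close to it.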
$B_{link}(i)$: To compute $B_{link}(i)$ we introduce two new probability functions P(i) and $B_{Engset}(i)$. P(i) is the probability that channel i is “on” and $B_{Engset}(i)$ is the probability that the link from the DSLAM to the edge router is consumed by m channels other than the requested channel i. Our computation of $B_{Engset}(i)$ is given in the Appendix, and
$$B_{link}(i) = (1 - P(i))\,B_{Engset}(i) \qquad (3)$$
P(i): Disregarding possible blocking⁵, channel i can be modeled as an M/M/∞ queue (in steady state), with infinite positions available in the queue to store all requests for channel i. Hence, the probability that channel i is “on”, under the condition that no requests are blocked, is equal to the probability that the M/M/∞ queue [8, pp. 281] is not empty: $\Pr[N_s > 0]_i = 1 - \exp(-\rho_i)$, where $\rho_i = \lambda_i/u_i = \alpha_i \lambda/u_i$. Here $\lambda_i$ is the users' arrival rate in channel i, and $u_i$ is the number of users leaving channel i per second. λ is the users' arrival rate in the IPTV system, which includes both the rate at which users switch on their television and the channel switching rate. $\alpha_i$ represents the popularity of channel i, where $0 \le \alpha_i \le 1$. Channel i can be either “off” or “on”. We therefore resort to the two-state Markov chain illustrated in Figure 2 to analyze its steady state. In Figure 2, “0” represents the state that channel i is “off” and “1” represents the state that channel i is “on”. The probability to be in the state “off” is 1 − P(i) and consequently the probability to be in the state “on” is P(i). To change from the “off” state to the “on” state, requests should exist for channel i (this event has probability $\Pr[N_s > 0]_i$) and these requests must not be blocked (this event has probability $1 - B_{Engset}(i)$). The process remains in the “off” state if there are no requests in the queue for channel i ($\Pr[N_s = 0]_i$) or if the requests are blocked (this event has probability $\Pr[N_s > 0]_i\, B_{Engset}(i)$). To change from the “on” state to the “off” state, all requests for channel i are served until the queue becomes empty (with probability $\Pr[N_s = 0]_i$). To remain in the “on” state, the queue must not be empty ($\Pr[N_s > 0]_i$). According to the steady state probability of a two-state Markov chain [8, pp. 176], the probability to be in the state “on” is
$$P(i) = \frac{\Pr[N_s > 0]_i\,(1 - B_{Engset}(i))}{\Pr[N_s > 0]_i\,(1 - B_{Engset}(i)) + \Pr[N_s = 0]_i} = 1 - \frac{1}{\exp(\rho_i) - B_{Engset}(i)\exp(\rho_i) + B_{Engset}(i)} \qquad (4)$$
⁵ If a user's request enters when channel i is “on”, the request will be grouped into multicast. However, if a user's request enters when channel i is “off”, it has to open a new channel, which might not be possible in case of insufficient available capacity.
[Figure 2: two states, “0” (off) and “1” (on), with the transition probabilities given in the text.]
Fig. 2. Two-state Markov chain representing the status of channel i
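The chain of formulae (1), (3), and (4) can be checked numerically with a short Python sketch. The function names are hypothetical, and $B_{Engset}(i)$ is treated as a given input because its computation is deferred to the Appendix; the numeric inputs below are illustrative assumptions, not values from the case study.

```python
from math import exp


def prob_channel_on(rho_i, b_engset):
    """First form of formula (4): steady state of the two-state Markov chain."""
    p_busy = 1.0 - exp(-rho_i)      # Pr[N_s > 0]_i for the M/M/inf queue
    p_empty = exp(-rho_i)           # Pr[N_s = 0]_i
    off_to_on = p_busy * (1.0 - b_engset)
    return off_to_on / (off_to_on + p_empty)


def prob_channel_on_closed_form(rho_i, b_engset):
    """Second form of formula (4); exp(rho_i) overflows for rho_i above about 709."""
    return 1.0 - 1.0 / (exp(rho_i) - b_engset * exp(rho_i) + b_engset)


def end_to_end_blocking(b_proc, rho_i, b_engset):
    """Combine formula (3) and formula (1)."""
    b_link = (1.0 - prob_channel_on(rho_i, b_engset)) * b_engset
    return b_proc + (1.0 - b_proc) * b_link


# Both forms of formula (4) should agree for any valid inputs.
p_a = prob_channel_on(0.5, 0.2)
p_b = prob_channel_on_closed_form(0.5, 0.2)

# Illustrative end-to-end value with assumed inputs.
b_example = end_to_end_blocking(b_proc=0.01, rho_i=0.8, b_engset=0.3)
```

With $B_{Engset}(i) = 0$ the link never blocks, so P(i) reduces to $1 - \exp(-\rho_i)$ and B(i) reduces to $B_{proc}$, which gives a quick consistency check on both formulae.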
By substituting formulae (3), (2), (4), and (10) into formula (1), we obtain the end-to-end blocking probability B(i) of IPTV.
3.3 Case Study
We focus on a Dutch IPTV network for which we have obtained the following parameter values. The channel popularity distribution $\alpha_i$ for the channels ($1 \le i \le 23$) is obtained from a market survey (see footnote 6). We assume that the remaining $K - 23$ channels uniformly share a popularity of 5.9%. In reality, the channel popularity distribution changes more smoothly, but here we simply classify channels as popular or unpopular. The link capacity C is typically 155 Mb/s in today's DSL systems, and one MPEG4 coded TV channel consumes 2.5 Mb/s. Hence, approximately m = 60 different TV channels can be transmitted simultaneously to the DSLAM. We further assume that the IPTV system consists of 500,000 subscribers, who are uniformly distributed over 1200 DSLAMs, and that 41% of the subscribers are active in rush hours. This results in an average of s = 171 active users connecting to a single DSLAM. However, only n = 120 channel replications can be handled by the DSLAM simultaneously. We define $Q_{IPTV} = \lambda/\mu_i$. Figure 3 plots the IPTV blocking probability B(i) as a function of the channel index i, with m = 60 and n = 120. The less popular channels (higher channel index) have a considerably higher probability of being blocked than the more popular channels. In addition, we can

Footnote 6: Channel 1 has a popularity of 15.1% and channel 23 of 0.2%, covering in total 94.1%.
[Figure 3. Left panel: blocking probability B(i) versus channel index i, with curves for (K=170, Q=400, s=171), (K=82, Q=400, s=171), (K=170, Q=600, s=171), and (K=170, Q=400, s=205). Right panel: needed capacity C (Mb/s) versus channel index i, with curves for B = 10^-3, B = 10^-5, and B = 10^-7.]
Fig. 3. (Left) IPTV end-to-end blocking B(i) with m = 60, n = 120; (right) the required capacity C to meet different blocking requirements, with K = 170, s = 171, and Q_IPTV = 400
observe that a user's request is less likely to be blocked if there are fewer available TV channels (smaller K), if there are fewer users per DSLAM (smaller s), or if users leave more frequently (smaller $Q_{IPTV}$). Figure 3 also plots the capacity C that is required to ensure that the IPTV blocking B(i), as a function of channel index i, does not exceed a certain level.
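The dimensioning figures quoted in this case study follow from simple arithmetic. The sketch below reproduces them; the variable names are hypothetical, and K = 170 is taken from the Figure 3 caption.

```python
# Parameters quoted in the IPTV case study.
subscribers = 500_000
dslams = 1200
active_fraction = 0.41
link_capacity_mbps = 155.0
channel_rate_mbps = 2.5
K = 170                      # number of channels, from the Fig. 3 caption
popular_channels = 23
unpopular_share = 0.059      # popularity shared by the remaining K - 23 channels

# Average number of active users per DSLAM (the text rounds this to s = 171).
s = subscribers * active_fraction / dslams

# Upper bound on simultaneously transmitted channels (the text uses m = 60).
m_upper = link_capacity_mbps / channel_rate_mbps

# Popularity of each unpopular channel under the uniform-sharing assumption.
alpha_unpopular = unpopular_share / (K - popular_channels)

print(round(s, 1), m_upper, alpha_unpopular)
```

The capacity bound comes out slightly above 60, which is consistent with the text's "approximately m = 60".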
4 P2PTV Blocking Probability

4.1 Model with Assumptions
Figure 4 presents our model of chunk-based P2PTV applications. We focus on the blocking of a single television channel i for a particular user U. User U is viewing television channel i. The playback rate of channel i is v kbits/s. In the worst case, user U downloads the next second of content while displaying the current second of content. One second of television content is divided into R chunks. Hence, one chunk contains v/R kbits. First, based on a peer list that user U obtains, a fixed number of peers (referred to as partners) are randomly chosen to form a partner group P for that channel i. In this partner group, partners exchange information about which chunks are available at which partner. Based on this chunk-availability information, user U chooses M ≤ |P| peers from whom he downloads chunks. Those peers are called parents of user U. A parent supplies at least 1 chunk to user U. Similarly, the peers downloading content from user U will be referred to as children of user U. A parent has Y children at the same time. We denote by $N_i$ the number of available peers in the entire peer list corresponding to channel i and by $bw_{up}$ the upload bandwidth of a parent. While downloading the chunks, user U simultaneously displays the content stored in his buffer (see footnote 7). Here, we consider the

Footnote 7: There is a zero probability of buffer overflow, because the buffer window is sliding and the outdated chunks will be deleted automatically.
[Figure 4. Diagram of channel i with N_i available peers, a partner group of size |P|, and M parents (u1 and u2). Each parent has upload bandwidth bw_up, shared with its other children; parent 1 supplies bw[1, U] to user U, who displays the current 1 second of content while downloading the next R chunks (the next 1 second) at the same time.]
Fig. 4. P2PTV model when user U has 2 parents
worst case of P2PTV blocking where none of these R chunks were previously downloaded and the download of these R chunks has to be finished within the coming second.

4.2 Computation of the Blocking Probability
In IPTV, the limited available bandwidth in an infrastructure network is the main cause of IPTV blocking, while the P2PTV network built on top of the Internet has, in theory, more available capacity contributed by peers. Hence, the P2PTV blocking is mainly caused by content unavailability, peer selection randomness and peer dynamics. Contrary to IPTV, where blocking only occurs when requesting a television channel, blocking in P2PTV can also occur while watching television. We define the blocking probability in P2PTV as the probability that user U cannot successfully download the next R chunks before having displayed the current R chunks. If the required R chunks cannot all be found in the partner group, which happens with probability $b_{chunk}(i)$, blocking will occur. Even when all the chunks can be found at the |P| partners, a chosen parent may upload his chunks too slowly. This occurs with probability $b_{time}(i)$. Finally, if a parent leaves during uploading, blocking will occur. The probability of this occurrence is $b_{dyn}(i)$. As a reasonable approximation, we assume that the different events are independent. The end-to-end blocking b(i) of channel i in P2PTV can be expressed as:

\[
b(i) = b_{chunk}(i) + (1 - b_{chunk}(i))\bigl(b_{time}(i) + (1 - b_{time}(i))\,b_{dyn}(i)\bigr) \tag{5}
\]
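Under the independence assumption, formula (5) is algebraically the same as one minus the product of the three non-blocking probabilities. A minimal Python sketch of this check follows; the function name and the input values are hypothetical and are not taken from the measurements.

```python
def end_to_end_blocking(b_chunk: float, b_time: float, b_dyn: float) -> float:
    """Eq. (5): blocking if chunks are missing, or upload is too slow, or a parent leaves."""
    return b_chunk + (1.0 - b_chunk) * (b_time + (1.0 - b_time) * b_dyn)

# Illustrative inputs only.
b = end_to_end_blocking(0.10, 0.20, 0.05)
product_form = 1.0 - (1.0 - 0.10) * (1.0 - 0.20) * (1.0 - 0.05)
print(b, product_form)
```

The product form makes it easy to see that the largest of the three components dominates b(i) when the other two are small.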
In the sequel, we determine $b_{chunk}(i)$, $b_{time}(i)$ and $b_{dyn}(i)$.

$b_{chunk}(i)$: Choosing parents among partners is based on the chunk-availability information. Blocking arises when user U cannot find all R chunks from the randomly chosen partners. We let $1 - b_{chunk}(i)$ denote the probability that user U can find all R chunks in his partner group:

\[
1 - b_{chunk}(i) = \prod_{r=1}^{R} \bigl(1 - B_i(r)\bigr) \tag{6}
\]
where $B_i(r) = [1 - \pi_i(r)]^{|P|}$ represents the probability that user U cannot find chunk r among |P| randomly chosen (and hence considered independent) partners. We use $\pi_i(r)$ to represent the probability that a peer is storing chunk r.

$b_{time}(i)$: Blocking arises when at least one parent u cannot upload his chunks to user U within 1 second due to insufficient bandwidth. The upload bandwidth that a parent u has available for user U is bw[u, U], which has a distribution function Pr[bw[u, U] ≤ x]. Given that user U requests $X_u$ chunks of v/R kbits from parent u, the required upload bandwidth is $bw_{X_u} = X_u v / R$. If the available bandwidth is smaller than the bandwidth required for uploading the requested chunks, blocking will occur. In the following, we will make use of the theory of partitions [9]. A partition of a positive integer R is a collection of positive integers whose sum is R. If there are M terms in the sum, then R is said to be partitioned into M parts. We let $p_M(R)$ represent the number of partitions of R into parts not exceeding M, which has the generating function [9]:

\[
\prod_{j=1}^{M} \bigl(1 - x^j\bigr)^{-1} = \sum_{R=0}^{\infty} p_M(R)\, x^R
\]

From the generating function the following recursion is obtained: $p_M(R) = p_{M-1}(R) + p_M(R - M)$ with $p_M(0) = p_1(R) = 1$ and $p_M(j) = 0$ for j < 0. In our P2PTV model, we need to find the number of partitions of R chunks over exactly M parents, which equals $p_M(R) - p_{M-1}(R) = p_M(R - M)$. The number of parents M has a probability distribution Pr[M = k]; for partitions with the same number of parents M, however, we assume that each possible partition is equally likely. Blocking occurs only when the available bandwidth bw[u, U] is smaller than the required bandwidth. Assuming independence,

\[
b_{time}(i) = 1 - \sum_{k=1}^{|P|} \frac{\Pr[M = k]}{p_k(R - k)} \sum_{j=1}^{p_k(R-k)} \prod_{u=1}^{k} \Bigl(1 - \Pr\Bigl[bw[u, U] < \frac{X_u(j)\,v}{R}\Bigr]\Bigr) \tag{7}
\]

where the index j refers to one of the $p_M(R - M)$ possible partitions, and $X_u(j)$ refers to how many chunks user U requests from parent u. For instance, we have $p_M(R - M) = 2$ possible partitions of R = 4 chunks over M = 2 parents ($u_1$ and $u_2$), namely partition j = 1 consisting of $X_{u_1}(1) = 1$ and $X_{u_2}(1) = 3$, and partition j = 2 consisting of $X_{u_1}(2) = 2$ and $X_{u_2}(2) = 2$.

$b_{dyn}(i)$: We let $b_{dyn}(i)$ represent the probability that at least one parent leaves during his uploading period, which causes blocking. In our P2PTV model, the peer departure process Z(t) is considered Poissonian, which was explained in Section 3.1. The rate at which peers leave channel i is denoted by $\theta_i$. The departure rate of one peer is $\theta_i / N_i$, where $N_i$ stands for the total number of available peers in channel i.
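The partition counts used in formula (7) can be illustrated with a short Python sketch. The function names `p` and `partitions_exact` are hypothetical; the recursion and the R = 4, M = 2 example are the ones stated in the text.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def p(M: int, R: int) -> int:
    """Number of partitions of R into parts not exceeding M (recursion from the text)."""
    if R < 0:
        return 0
    if R == 0:
        return 1
    if M <= 0:
        return 0
    if M == 1:
        return 1
    return p(M - 1, R) + p(M, R - M)

def partitions_exact(R: int, M: int, min_part: int = 1):
    """Yield the partitions of R into exactly M positive parts, in non-decreasing order."""
    if M == 1:
        if R >= min_part:
            yield (R,)
        return
    for first in range(min_part, R // M + 1):
        for rest in partitions_exact(R - first, M - 1, first):
            yield (first,) + rest

# The example from the text: R = 4 chunks over M = 2 parents.
print(p(2, 4 - 2), list(partitions_exact(4, 2)))
```

Enumerating the partitions explicitly, rather than only counting them, is what the inner sum over j in formula (7) needs, since each partition fixes the chunk counts $X_u(j)$.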
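We denote by $P_u(j)$ the probability that parent u with $X_u(j)$ chunks (in partition j) leaves during uploading:

\[
P_u(j) = \Pr[Z(t + t_0) - Z(t_0) = 1] = \Bigl(\frac{\theta_i}{N_i}\, t\Bigr) \exp\Bigl(-\frac{\theta_i}{N_i}\, t\Bigr), \quad \text{where } t = \min\Bigl\{1, \frac{X_u(j)\,v/R}{bw[u, U]}\Bigr\}.
\]

Hence, we can express $b_{dyn}(i)$ as:

\[
b_{dyn}(i) = 1 - \sum_{k=1}^{|P|} \frac{\Pr[M = k]}{p_k(R - k)} \sum_{j=1}^{p_k(R-k)} \prod_{u=1}^{k} \bigl(1 - P_u(j)\bigr) \tag{8}
\]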
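Formulae (6), (7) and (8) can be evaluated with the following Python sketch. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, every parent is assumed to share the same bandwidth distribution `bw_cdf`, and the departure term uses one fixed bandwidth value `bw` for all parents instead of a distribution.

```python
import math

def partitions_exact(R, M, min_part=1):
    """Yield the partitions of R into exactly M positive parts (non-decreasing)."""
    if M == 1:
        if R >= min_part:
            yield (R,)
        return
    for first in range(min_part, R // M + 1):
        for rest in partitions_exact(R - first, M - 1, first):
            yield (first,) + rest

def b_chunk(pi, partners):
    """Eq. (6): pi[r] is the probability that a peer stores chunk r."""
    found_all = 1.0
    for pi_r in pi:
        found_all *= 1.0 - (1.0 - pi_r) ** partners
    return 1.0 - found_all

def _no_blocking(parent_dist, R, parent_ok):
    """Double sum shared by Eqs. (7) and (8); parent_dist maps k to Pr[M = k]."""
    total = 0.0
    for k, prob_k in parent_dist.items():
        parts = list(partitions_exact(R, k))
        if not parts:
            raise ValueError("R chunks cannot be split over more than R parents")
        inner = sum(math.prod(parent_ok(x) for x in part) for part in parts)
        total += prob_k * inner / len(parts)
    return total

def b_time(parent_dist, R, v, bw_cdf):
    """Eq. (7): bw_cdf(x) plays the role of Pr[bw[u, U] < x] (assumed identical per parent)."""
    return 1.0 - _no_blocking(parent_dist, R, lambda x: 1.0 - bw_cdf(x * v / R))

def b_dyn(parent_dist, R, v, bw, theta_i, n_i):
    """Eq. (8) with a fixed bandwidth bw per parent (a simplifying assumption)."""
    rate = theta_i / n_i
    def stays(x):
        t = min(1.0, (x * v / R) / bw)
        return 1.0 - rate * t * math.exp(-rate * t)
    return 1.0 - _no_blocking(parent_dist, R, stays)

# Illustrative inputs: a uniform bandwidth distribution on [0, 1000] kbit/s.
uniform_cdf = lambda x: min(max(x / 1000.0, 0.0), 1.0)
print(b_chunk([0.91] * 3, 1), b_time({2: 1.0}, 4, 400.0, uniform_cdf))
```

The helper `_no_blocking` reflects that formulae (7) and (8) share the same averaging over parent counts and partitions and differ only in the per-parent success probability.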
Substituting formulae (6), (7) and (8) into formula (5), we obtain the end-to-end blocking b(i) of P2PTV.

4.3 Case Study
In this P2PTV case study, we analyze a chunk-based P2PTV application called SopCast. We have developed scripts and set up a distributed measurement testbed via PlanetLab. We installed SopCast and TcpDump on each of the 80 PlanetLab nodes and performed various measurements, among which we believe three are unique, namely: (1) measurements on the topology, which give the parent distribution Pr[M = k] for popular as well as unpopular channels, (2) measurements on the bandwidth distribution Pr[bw[u, U] ≤ k], and (3) measurements of the quality of experience (e.g., blocking) at the end user. We have used our measurement data as input to our model and then computed the blocking probability according to the formulae above. We also compared our analytical results with our measurement results on blocking probability. Our experiments with SopCast indicated that the upload bandwidth of a parent $bw_{up}$ is almost uniformly shared by his Y children in a 1-second period of time. Hence, for user U, we can define $bw[u, U] = bw_{up}/Y$. Since each parent can have a different upload rate and a different number of children at a given time, the value of bw[u, U] may differ per parent. Later, we will use our empirically obtained bw[u, U] distribution function in our model. To allow for a fair comparison with the IPTV case, we choose similar values for our P2PTV case study where possible. The channel popularity distribution $\alpha_i$ is the same as for IPTV, and the user behavior is the same as for IPTV. We let $N_i = N\alpha_i$, where N is the number of concurrent active peers over all channels. $N_i$ can be $1200\,s\,\alpha_i$ (the same size as for IPTV) or $2{,}200{,}000\,\alpha_i$ (the current peak size of a P2PTV system). For other values, our case study is based on SopCast. Our measurements were run 5 times on 5 different days and we have used the obtained results for our computations.
Based on our measurement data, a minimum of 1 and a maximum of 3 partners are chosen as parents for both popular channels and unpopular channels in 1 second. The distributions of the number of parents Pr[M = k] differ for popular and unpopular channels. Our measurement study has revealed that the channel under study has a playback rate v = 300 kbits/s and that the size of one chunk is 13 kbytes. Hence, R is around 3 chunks/second. In our computation of the blocking probability $b_{chunk}(i)$, we assume that $\pi_i(r)$ equals 91% [10] for each channel. We set v/R = 104 kbits, to
[Figure 5. Blocking probability b(i) versus channel index i, with curves for (K=170, Q=60, N=205200), (K=50, Q=60, N=205200), (K=170, Q=600, N=205200), and (K=170, Q=60, N=2200000).]
Fig. 5. P2PTV end-to-end blocking b(i) with different values for the number of channels K, the number of active users in the system N, and Q_P2PTV = λ/θ_i
retrieve the blocking probability $b_{time}(i)$. We assume that the download bandwidth of an end user is large enough to download R = 3 chunks in 1 second. In order to obtain the blocking probability $b_{dyn}(i)$, we define $Q_{P2PTV} = \lambda/\theta_i$ and further assume (see footnote 8) $\theta_i = 0.00038\,N/Q_{P2PTV}$, with $\theta_i$ fixed for each TV channel. Given the above values we can get $b_{chunk}(i)$, $b_{time}(i)$ and $b_{dyn}(i)$, and subsequently compute the P2PTV blocking b(i) from formula (5). Figure 5 plots the P2PTV blocking probability b(i) as a function of the channel index i. Figure 5 illustrates that the 23 most popular channels have a much smaller probability of being blocked than the remaining unpopular channels. The blocking probability of all channels is quite high, because what we have modeled is the worst case without any positive effects due to buffering. In order to validate our model, we measured the blocking at end users. We monitored the download speed at each user in units of 1 second and compared the download speed with the playback rate. The fraction of time during which the download speed is smaller than the constant playback rate gives the blocking probability defined in our model. When averaging the blocking probability over all channels, we find a blocking probability of 22% when excluding the buffer effect, which fits our mathematical results. The P2PTV end-to-end blocking b(i) shown in Figure 5 is mainly contributed by $b_{time}(i)$. Hence, the large blocking is mainly caused by the limited upload bandwidth that a parent allocates to the end user. This clearly indicates the importance of the parent selection policy. We can also observe that a user will face less blocking if the number of available channels K is smaller, or if users leave infrequently (high $Q_{P2PTV}$), while changing the number of users did not significantly affect the blocking probability.

Footnote 8: As observed in [4, Fig.14], when there are on average N = 1750 active peers, the maximum arrival rate of peers is λ = 0.67 peers/s and λ/N = 0.00038.
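The parameter values quoted in this case study can be cross-checked with a few lines of Python. The variable names are hypothetical, and the conversion of 13 kbytes to 104 kbits assumes 8 bits per byte, which is consistent with the value v/R = 104 kbits used in the text.

```python
v_kbit_per_s = 300.0          # playback rate of the measured channel
chunk_kbit = 13 * 8           # 13 kbytes per chunk, i.e. 104 kbits

# Chunks per second: about 2.9, which the text rounds to R = 3.
chunks_per_second = v_kbit_per_s / chunk_kbit

# Footnote 8: ratio of peak arrival rate to average number of active peers.
lam = 0.67                    # peers/s
n_active = 1750
arrival_ratio = lam / n_active

def theta(N: float, Q: float, ratio: float = 0.00038) -> float:
    """Channel departure rate theta_i = 0.00038 * N / Q_P2PTV, as assumed in the text."""
    return ratio * N / Q

print(chunks_per_second, arrival_ratio, theta(205200, 60))
```

With N = 205200 and Q = 60 (one of the settings in Fig. 5), this gives a departure rate of roughly 1.3 peers per second.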
5 Blocking Comparison
In this section, we compare IPTV and P2PTV based on the computations presented in Sections 3 and 4. We assess which technology (IPTV or P2PTV) incurs the lower end-to-end blocking. Currently, if the operator decides to transmit television content to its residential users using P2PTV instead of IPTV, all channels will likely face more blocking than before. However, we can predict that this will change when the number of end users increases. In a P2PTV system, the number of users has little effect on the blocking, while in an IPTV system the blocking would increase considerably if the processing capability of the equipment (such as the DSLAM) cannot scale accordingly. When the total number of users increases and the number of DSLAMs and their processing capability remain the same, there will be a point at which the P2PTV system starts to outperform the IPTV system for popular channels. In our Dutch case study, this cross-over point is around 297,600 users (s = 248 per DSLAM). Provided the appropriate data is available, our formulas allow for similar computations for different cases.
6 Conclusions
Two digital television technologies, namely IPTV and P2PTV, have been discussed in this paper. We have provided new blocking models for both technologies and have defined and derived formulae to calculate the end-to-end blocking probability based on our blocking models. The different contributions to the blocking have been analyzed. Moreover, the blocking of IPTV and the blocking of P2PTV have been compared under fair and realistic conditions. We found that when the number of users increases, there will be a point at which the blocking in P2PTV will be less than in IPTV, unless the IPTV network is extended accordingly. Finally, our results can be used not only to analyze the blocking in existing systems, but also to dimension them.
References

1. Lim, S.Y., Soek, J.M., Lee, H.K.: A path control architecture for receiving various multimedia contents. In: Proc. of ICACT 2006 (February 2006)
2. Karvo, J., Virtamo, J., Aalto, S., Martikainen, O.: Blocking of dynamic multicast connections in a single link. In: Proc. of IEEE BROADNETS (1998)
3. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: A Measurement Study of a large-Scale P2P IPTV System. IEEE Transactions on Multimedia 9(8), 1672–1687 (2007)
4. Zhang, X., Liu, J., Li, B., Yum, T.P.: CoolStreaming/DONet: A Data-driven Overlay Network for Peer-to-Peer Live Media streaming. In: Proc. of IEEE INFOCOM, March 2005, vol. 3, pp. 2102–2111 (2005)
5. Meo, M., Milan, F.: QoS-aware Content Management in P2P Networks. In: Proc. of HOT-P2P 2004 (2004)
6. Susitaival, R., Aalto, S., Virtamo, J.: Analyzing the dynamics and resource usage of P2P file sharing by a spatio-temporal model. In: Proc. of P2P-HPCS 2006 (2006)
7. Guo, L., Chen, S., Xiao, Z., Tan, E., Ding, X., Zhang, X.: A Performance Study of BitTorrent-like Peer-to-Peer Systems. IEEE J. Selected. Area in Comm. 25 (January 2007)
8. Van Mieghem, P.: Performance Analysis of communications Networks and Systems. Cambridge University Press, Cambridge (2006)
9. Rademacher, H.: Topics in Analytic Number Theory. Springer, Berlin (1973)
10. Vassilakis, C., Laoutaris, N., Stavrakakis, I.: On the Benefits of Synchronized Playout in Peer-to-Peer Streaming. In: Proc. of CoNEXT (September 2006)
A $B_{Engset}(i)$

The IPTV system with K available channels and m admitted channels can be modeled by an M/M/m/m/K queuing system. In this Engset queue with different channel i arrival rate $\lambda_i$ and channel i leaving rate $\mu_i$, each channel request can arrive in random order. Karvo et al. [2] introduced a binomial probability generating function to deduce the probability $\pi_j^{(i)}$ (j positions are occupied by j channels other than channel i in an infinite capacity system) for this system:

\[
\varphi_i(z) = \sum_{j=0}^{\infty} \pi_j^{(i)} z^j = \frac{\prod_{k=1}^{K} (q_k + p_k z)}{q_i + p_i z} \tag{9}
\]

in which $q_i = 1 - p_i$ and $p_i = 1 - e^{-\lambda_i/\mu_i}$. The Engset blocking is $B_{Engset}(i) = \pi_m^{(i)} / \sum_{j=0}^{m} \pi_j^{(i)}$. According to $\pi_j^{(i)} = \frac{1}{j!} \frac{d^j \varphi_i(z)}{dz^j}\Big|_{z=0}$ ([8, pp.18]) and Eq. (9),

\[
B_{Engset}(i) = \frac{\dfrac{1}{m!}\, \dfrac{d^m}{dz^m}\!\left[\dfrac{\prod_{k=1}^{K} (q_k + p_k z)}{q_i + p_i z}\right]\Bigg|_{z=0}}{\displaystyle\sum_{j=0}^{m} \dfrac{1}{j!}\, \dfrac{d^j}{dz^j}\!\left[\dfrac{\prod_{k=1}^{K} (q_k + p_k z)}{q_i + p_i z}\right]\Bigg|_{z=0}} \tag{10}
\]
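Since dividing the product in Eq. (9) by $q_i + p_i z$ simply omits the factor k = i, $\varphi_i(z)$ is a polynomial and the probabilities $\pi_j^{(i)}$ are its coefficients. Formula (10) can therefore be evaluated without symbolic differentiation. The Python sketch below illustrates this; the function name and the 0-based channel index are assumptions made for the example.

```python
import math

def engset_blocking(i, m, lam, mu):
    """Evaluate Eq. (10) from the coefficients of phi_i(z).

    lam[k] and mu[k] are the arrival and leaving rates of channel k
    (0-based index, an assumption of this sketch); m is the number of
    channels that can be admitted simultaneously.
    """
    coeffs = [1.0]  # coefficients of the running product, lowest degree first
    for k in range(len(lam)):
        if k == i:
            continue  # dividing by (q_i + p_i z) removes this factor
        p_k = 1.0 - math.exp(-lam[k] / mu[k])
        q_k = 1.0 - p_k
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += c * q_k       # contribution of the constant term q_k
            nxt[j + 1] += c * p_k   # contribution of the linear term p_k * z
        coeffs = nxt
    if m >= len(coeffs):
        return 0.0  # fewer than m other channels exist, so pi_m is zero
    return coeffs[m] / sum(coeffs[: m + 1])

# Illustrative check: three channels with p_k = 0.5 each (lambda/mu = ln 2).
rates = [math.log(2.0)] * 3
print(engset_blocking(0, 1, rates, [1.0] * 3))
```

In the illustrative case, $\varphi_0(z) = (0.5 + 0.5z)^2$, so the coefficients are 0.25, 0.5 and 0.25, and the blocking for m = 1 is 0.5/0.75.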
Cooperative Replication in Content Networks with Nodes under Churn

Eva Jaho, Ioannis Koukoutsidis, Ioannis Stavrakakis, and Ina Jaho

National & Kapodistrian University of Athens, Dept. Informatics and Telecommunications, Ilissia, 157 84 Athens, Greece
{ejaho,i.koukoutsidis,ioannis,grad0839}@di.uoa.gr
Abstract. In content networks, a replication group refers to a set of nodes that cooperate with each other to retrieve information objects from a distant server. Each node locally replicates a subset of the server objects, and can access objects stored by other nodes at a smaller cost. In a network with autonomous nodes, the problem is to construct efficient distributed algorithms for content replication that decrease the access cost for all nodes. Such a network also has to deal with churn, i.e. random “join” and “leave” events of nodes in the group. Churn induces instability and has a major impact on cooperation efficiency. Given a probability estimate of each node being active that is common knowledge between all nodes, we propose in this paper a distributed churn-aware object placement algorithm. We show that in most cases it has better performance than its churn-unaware counterpart, increases fairness and incites all nodes to cooperate.

Keywords: content networks, replication, cooperation, node churn.
1 Introduction
In content networks, a replication group refers to a set of nodes that cooperate with each other to retrieve information objects (files and other software entities) from a distant server. Nodes in the group usually have a high degree of proximity, and users associated with the nodes usually have common interests. To lower the access cost, each node locally replicates a subset of the server objects, and grants access to these objects to other nodes in the group. The placement of objects in nodes should aim at decreasing the cost for all nodes (or users served by them) to access requested objects. However, in distributed systems where nodes are largely autonomous (e.g. in p2p networks), users may behave selfishly and replicate objects to maximize their own benefit. The problem is to devise a
This work has been supported in part by the following projects funded by the IST FET Program of the European Commission: CASCADAS (Component-ware for Autonomic Situation-aware Communications, and Dynamically Adaptable Services) under contract FP6-027807, BIONETS (BIOlogically-inspired autonomic NETworks and Services) under contract FP6-027748, and NoE CONTENT (IST-384239).
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 457–469, 2008. © IFIP International Federation for Information Processing 2008
distributed object placement algorithm that ensures that individual nodes would benefit through cooperation. As an additional challenge, such an algorithm must be robust to node misbehavior and general uncertainty conditions. In this paper, we address this problem in a game-theoretic context. We consider nodes to be the players in the game. Each player implements a placement strategy which consists of choosing which objects to replicate locally in its limited storage space. The goal of each player is to minimize the total access cost for all its requested objects. A node may choose not to cooperate with any other node (to act in isolation), in which case the optimal strategy is to replicate a number of most wanted objects, equal to its capacity. This is called the greedy local strategy. Naturally, selfish users who are not interested in increasing the social benefit, would want to cooperate with other nodes only if such cooperation could reduce their access cost compared to their incurred cost when all nodes follow the greedy local strategy. If, on the other hand, a node experiences an increase in its access cost, mistreatment occurs as an outcome of cooperation. Ensuring no node mistreatment is key to maintaining a cooperative scheme. The abstraction of a replication group was initially employed in [1]. We consider that, as in [1], requests for objects are handled in the following manner. First, a user’s request is received by the local node the user is associated with. If the requested object is stored locally, it is returned to the requesting user immediately, incurring a minimum access cost. Otherwise, the requested object is searched for, and fetched from other nodes of the group, at a potentially higher access cost. If the object can not be located anywhere else in the group, it is retrieved from an origin server – assumed to be outside the group – incurring a maximum access cost. 
Note that unlike caching, replication refers to storage of objects for a longer term, and no replacement policy (e.g., LRU) is applied for each new object request. In this setting, Laoutaris et al. developed in [2] a distributed algorithm for cooperative object placement that guarantees an access cost reduction for all nodes, when each node knows its local demand pattern and the objects selected for replication by the other nodes in the group. In this algorithm, nodes take turns to replicate objects that minimize their access cost, each node informing all other nodes of its decision. A deficiency of this scheme is that it does not take into account any uncertainty phenomena or node misbehavior. Such phenomena are collectively termed as “node churn” and denote random changes in the set of participating nodes in the group, that may occur due to “join” and “leave” events. For example, in a p2p network, nodes may disconnect themselves from the network as soon as they download their content of interest. It may also occur as a consequence of link failures in networks with wireless links and node mobility. A preliminary study in [3] has confirmed that under such a setting the algorithm in [2] may result in mistreatment for some nodes. To account for node churn, in this paper we consider a probability estimate of each node to be available for object retrieval in the group, that is common knowledge between all nodes in the group. We propose an object replication strategy wherein each node takes into consideration the probability of other
nodes to be available before deciding on its replication strategy. Based on this knowledge, each node makes the number of changes that will give it the greatest access cost reduction. We call this appropriately a greedy churn-aware object placement strategy. The performance of the ensuing distributed algorithm is investigated extensively and is shown to mitigate mistreatment, while in most cases reducing the access cost, compared to the churn-unaware and the greedy local strategies.
2 Game Formulation
Hereafter in the paper, nodes that are available shall be said to be “ON”, and in the opposite case “OFF”. Consider a replication group of N nodes and a set of M objects. Let $r_{ij}$ denote the request rate of node j for object i, where $1 \le j \le N$, $1 \le i \le M$. All objects are assumed to be unit-sized. Node j has a storage capacity of $C_j$ units and is assumed to be ON with a probability $\pi_j$, also called the reliability of a node. Let $P_j$ denote the placement of node j, defined to be the set of objects replicated at this node. We assume without loss of generality that $|P_j| = C_j$, since a node always has to gain by saving objects locally in its storage. Let $P = \{P_1, P_2, \ldots, P_N\}$ denote the global placement for the replication group and let $P_{-j} = P \setminus P_j$ denote the set that contains the placements of all nodes except for node j.

Assume that the cost for accessing an object from a node's local memory is $t_l$, from another remote node in the group $t_r$ and from an origin server $t_s$, with $t_l < t_r < t_s$. These values may denote averages, but are taken the same for all nodes in order to simplify the analysis; in reality, there are differences in these costs for different node pairs, but these are expected to be small, since nodes in a replication group are generally homogeneous devices in geographical proximity. Given a placement P, the mean total access cost per unit time for node j is given by

\[
C_j(P) = \sum_{i \in P_j} r_{ij}\, t_l
       + \sum_{\substack{i \notin P_j \\ i \notin P_{-j}}} r_{ij}\, t_s
       + \sum_{\substack{i \notin P_j \\ i \in P_{-j}}} r_{ij} \Bigl[\, t_r \Bigl(1 - \prod_{\substack{k=1,\, k \ne j \\ k:\, i \in P_k}}^{N} (1 - \pi_k)\Bigr) + t_s \prod_{\substack{k=1,\, k \ne j \\ k:\, i \in P_k}}^{N} (1 - \pi_k) \Bigr].
\]
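The access cost above can be computed directly from a placement. The following Python sketch is only an illustration: the function name, the 0-based indices, and the layout `rates[j][i]` (request rate of node j for object i) are assumptions of the example.

```python
import math

def access_cost(j, placements, rates, pi, t_l, t_r, t_s):
    """Mean total access cost per unit time of node j for a given global placement.

    placements[k] is the set of objects stored at node k, rates[j][i] is the
    request rate of node j for object i, and pi[k] is the ON probability of node k.
    """
    cost = 0.0
    for obj, rate in enumerate(rates[j]):
        if obj in placements[j]:
            cost += rate * t_l  # object is stored locally
            continue
        holders = [k for k, P_k in enumerate(placements) if k != j and obj in P_k]
        if not holders:
            cost += rate * t_s  # object is only available at the origin server
        else:
            all_off = math.prod(1.0 - pi[k] for k in holders)
            cost += rate * (t_r * (1.0 - all_off) + t_s * all_off)
    return cost

# Illustrative values: two nodes, three objects, node 1 is ON half of the time.
example = access_cost(0, [{0}, {1}], [[3.0, 2.0, 1.0], [1.0, 3.0, 2.0]],
                      [1.0, 0.5], t_l=0.0, t_r=1.0, t_s=10.0)
print(example)
```

In the example, node 0 pays nothing for object 0, an expected cost of 5.5 per request for object 1, and the full server cost for object 2.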
The scenario under which nodes can cooperatively decide which objects to replicate is as follows. Initially, each node j makes a list of its $C_j$ most popular objects, which constitutes the potential placement of objects in its memory. All nodes are informed of the other nodes' potential placements. This can be accomplished either by forwarding the information to their neighbors, or in larger networks by uploading these lists to a central database, close to all nodes in the group. In either case, we consider that the cost for disseminating this information
is negligible compared to the cost of actually obtaining an object. Upon learning the other nodes' potential placements, each node can make changes to its list, exploiting the fact that other nodes may hold the same objects in order to replicate other ones. We consider that nodes take (possibly random) turns to make such changes, and inform other nodes of their decision. We model this as a dynamic game, where players do not form coalitions. Such games, in which there is a certain order of players, and each player's moves have an effect on the utilities of players before or after him, are called Stackelberg games, or leaders-followers games (see e.g., [4]). We examine the following replacement strategies.

2.1 Greedy Churn-Unaware Strategy
Under this strategy nodes are assumed to be unaware of the reliability of other nodes; thus they falsely consider other nodes to be always ON when making replacements. It is essentially the strategy proposed in [2] under no node churn. Given the current global placement P, each node considers making replacements so as to minimize its current access cost (hence the term “greedy”). To describe the strategy, we need some further definitions. For an object $e \in P_j$, define the eviction loss $L_{e,j}$ as the increase in access cost if node j would evict the object from its memory. For an object $i \notin P_j$, define the insertion gain $G_{i,j}$ as the decrease in access cost if node j would insert the object.

In minimizing its current access cost, a node may evict an object from its local memory only if it exists at one or more of the other nodes, and may only insert an object that does not exist in any of the other nodes in the group. To prove this property, consider that if a node j evicts an object $e \notin P_{-j}$, the eviction cost would be $r_{ej}(t_s - t_l)$. This must be replaced by an object $i \notin P_j$ for which, in the best case that $i \notin P_{-j}$, it would hold that $r_{ij}(t_s - t_l) > r_{ej}(t_s - t_l)$. But there is no such object i, for if there were it should have been in the initial list of most popular objects. Likewise, if node j inserts an object $i \in P_{-j}$, the insertion gain would be $r_{ij}(t_r - t_l)$. Since from the above a node only evicts an object that exists in one or more of the other nodes, i would replace an object e for which $r_{ij}(t_r - t_l) > r_{ej}(t_r - t_l)$. But this also cannot hold, because if $r_{ij} > r_{ej}$ object i would have been inserted in j at the beginning. With these in mind, for an object e replicated at node j and at one or more of the other nodes in $P_{-j}$, the eviction cost is

\[
L_{e,j} = r_{ej}(t_r - t_l). \tag{1}
\]

For an object i not replicated in any of the nodes in the group, the insertion gain to node j is

\[
G_{i,j} = r_{ij}(t_s - t_l). \tag{2}
\]

An object $e \in P_j \cap P_{-j}$ is called an eviction candidate, whereas an object $i \notin P$ is called an insertion candidate for node j. The set of eviction candidates for node j is denoted as $E_j$, where $|E_j| \le C_j$, and the set of insertion candidates as $I_j$. Index the eviction candidates for node j as $e_{1j}, e_{2j}, \ldots, e_{|E_j|j}$, in order of increasing eviction cost. That is, $L_{e_1,j} \le L_{e_2,j} \le \cdots \le L_{e_{|E_j|},j}$ (the double
subscript j is removed). Accordingly, index the insertion candidates for node j as $i_{1j}, i_{2j}, \ldots, i_{|I_j|j}$ in order of decreasing insertion gain, i.e., such that $G_{i_1,j} \ge G_{i_2,j} \ge \cdots \ge G_{i_{|I_j|},j}$ (the double subscript j is likewise removed). It is evident that to minimize its access cost, node j should make changes in the following way: evict $e_{1j}$ and insert $i_{1j}$, evict $e_{2j}$ and insert $i_{2j}$, and so on until the maximum number $m_j$ such that $G_{i_{m_j},j} > L_{e_{m_j},j}$ ($m_j \le \min(|E_j|, |I_j|)$). For an arbitrary eviction candidate e and insertion candidate i, we call $\Delta G_{(e,i)} \stackrel{\text{def}}{=} G_{i,j} - L_{e,j}$ the replacement gain for the insertion-eviction pair (e, i). A replacement of object e by object i is also denoted as e ← i.

It was shown in [2] that this strategy results in a Nash equilibrium, i.e., no node can unilaterally change its placement strategy and obtain a benefit, when all other nodes follow the greedy churn-unaware placement strategy. This can be explained from the characteristic property of the strategy mentioned above: since a node only evicts an object that exists in one or more of the other nodes to insert an object unrepresented in the group, each node does its best over all possible moves of subsequent nodes. For a formal proof, interested readers are referred to [2]. Besides showing the Nash equilibrium, the proof leads to a stronger result: in the greedy churn-unaware strategy, no node can gain by playing again at any subsequent epoch in the game. However, as is shown in [3] by calculating the actual expected cost under churn, this algorithm may result in mistreatment for some nodes. Thus, the following object replacement strategy is proposed.

2.2 Greedy Churn-Aware Strategy
In this strategy, nodes have information about the reliability of other nodes, in the form of a common (prior and posterior) distribution of their ON probabilities $\pi_1, \pi_2, \ldots, \pi_N$. This information may be derived and forwarded in a distributed manner, by local interactions between nodes. (For a review of distributed algorithms for deriving reputation measures, readers may refer to [5].) It can also be maintained at a central database that nodes inquire into prior to making their decision. Each node uses this information in replacement decisions and makes a number of changes in its placement that minimize its expected access cost. We can follow the same arguments as in the churn-unaware case to show again that a node may only evict an object that is present in other nodes in the group. However, a node may now insert an object that also exists in one or more of the other nodes in the group (with ON probability smaller than 1). For an object e replicated at node j, we define the eviction cost as (see footnote 1):

\[
L_{e,j} = r_{ej} \Bigl[ (t_s - t_l) \prod_{\substack{k=1,\, k \ne j \\ k:\, e \in P_k}}^{N} (1 - \pi_k) + (t_r - t_l) \Bigl(1 - \prod_{\substack{k=1,\, k \ne j \\ k:\, e \in P_k}}^{N} (1 - \pi_k)\Bigr) \Bigr], \tag{3}
\]

i.e., it is the increase in the expected access cost if node j would evict the object.

Footnote 1: We maintain the same notation as in Sect. 2.1.
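For an object i not replicated at node j, we define the insertion gain as:

$$G_{i,j} = \begin{cases} r_{ij}(t_s - t_l), & \text{if } i \notin P_{-j} \quad (4a) \\[4pt] r_{ij}\Bigl[(t_s - t_l)\prod\limits_{\substack{k=1 \\ k \ne j,\; k:\, i \in P_k}}^{N}(1 - \pi_k) + (t_r - t_l)\Bigl(1 - \prod\limits_{\substack{k=1 \\ k \ne j,\; k:\, i \in P_k}}^{N}(1 - \pi_k)\Bigr)\Bigr], & \text{if } i \in P_{-j} \quad (4b) \end{cases}$$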
i.e., it is the decrease in the expected access cost if node j were to insert the object. Notice that the right-hand side of (4b) may be greater than the right-hand side of (3), which substantiates our previous claim that objects that exist in other nodes in the group may also be inserted. Considering eviction candidates ordered in increasing average eviction loss, and insertion candidates ordered in decreasing insertion gain, each node j makes a maximum number of replacements $m_j$ with positive average replacement gain, i.e., we have $\Delta G_{(e_k,i_k)} > 0$ for all k = 1, ..., $m_j$.

In contrast to the churn-unaware strategy, there is no reason why a node may not benefit by playing again at a subsequent epoch in the game. We therefore formulate a version of the game in which all nodes apply the greedy churn-aware strategy repeatedly for many rounds, until stopping. Although this is not an algorithm requirement, we mostly consider in our study that the same order of play is maintained in each round. For this to be applied in practice, a central coordinating entity is required that, aside from maintaining the database, authorizes each node to access the lists of placements of other nodes only in the specified order. The algorithm stops when no node can further improve its placement. In the next section, we show that stopping always occurs.

Remark 1. Note however that we do not study the game as a repeated game. If nodes know a priori that they are going to play for a (possibly random) number of rounds, even the churn-unaware strategy may not arrive at a Nash equilibrium.
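The eviction cost (3), the insertion gain (4), and the pairing rule of this subsection can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the dictionary-based data layout, and the three-node example are assumptions introduced here. It relies on the observation that (4a) is the special case of (4b) in which the product is empty and therefore equals 1.

```python
from math import prod

T_L, T_R, T_S = 1.0, 10.0, 100.0  # local, remote and server access costs (example values)


def all_off_probability(obj, j, placements, on_prob):
    """Probability that every node other than j that holds obj is OFF."""
    return prod(1.0 - on_prob[k]
                for k, held in placements.items()
                if k != j and obj in held)


def cost_change(obj, j, rates, placements, on_prob):
    """Common form of (3) and (4): expected extra cost of fetching obj
    non-locally instead of serving it from node j's own storage."""
    q = all_off_probability(obj, j, placements, on_prob)  # equals 1 if nobody else holds obj
    return rates[j][obj] * ((T_S - T_L) * q + (T_R - T_L) * (1.0 - q))


def eviction_cost(e, j, rates, placements, on_prob):
    return cost_change(e, j, rates, placements, on_prob)   # eq. (3)


def insertion_gain(i, j, rates, placements, on_prob):
    return cost_change(i, j, rates, placements, on_prob)   # eq. (4a)/(4b)


def greedy_replacements(j, rates, placements, on_prob, objects):
    """Pair the cheapest evictions with the most valuable insertions while
    the replacement gain G - L stays strictly positive."""
    held = placements[j]
    evict = sorted(held, key=lambda e: eviction_cost(e, j, rates, placements, on_prob))
    insert = sorted((o for o in objects if o not in held),
                    key=lambda i: insertion_gain(i, j, rates, placements, on_prob),
                    reverse=True)
    pairs = []
    for e, i in zip(evict, insert):
        gain = insertion_gain(i, j, rates, placements, on_prob)
        loss = eviction_cost(e, j, rates, placements, on_prob)
        if gain <= loss:
            break
        pairs.append((e, i))
    return pairs


# Hypothetical example: three nodes with identical request rates.
objects = ["A", "B", "C", "D"]
common_rates = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
rates = {0: common_rates, 1: common_rates, 2: common_rates}
placements = {0: {"A", "B"}, 1: {"A", "B"}, 2: {"A", "B"}}
on_prob = {0: 0.5, 1: 0.5, 2: 0.9}
moves = greedy_replacements(2, rates, placements, on_prob, objects)
```

In this example node 2 replaces B by C but keeps A, because inserting D would gain less than evicting A would cost. The costs are evaluated once against the current placements, which matches the one-shot ordering of candidates described above; re-evaluating them after each replacement would be a different design choice.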
3
Characteristics of the Greedy Churn-Aware Strategy
We have sought standard Bayesian equilibria by studying the game without communication. (Although communication takes place, we do not consider nodes' placements to be private information they can lie about, and so the game is not treated as a game with communication.) Unfortunately, the greedy churn-aware strategy may not arrive at an equilibrium. To see this, consider the subgame after player N − 2 has played. Then, in order to have an equilibrium, player N − 1 should be sequentially rational. In other words, given that the best for the last player N is to follow the greedy churn-aware strategy, would it always be best for N − 1 to also follow this strategy? A simple example shows that this may not be the case. Suppose that node N − 1 evicts an object e, replicated at node N and at one or more of the other nodes 1, ..., N − 2, and replaces it by an object i, such that $G_{i,N-1} > L_{e,N-1}$.
Since one or more of the nodes 1, ..., N − 1 still have object e, node N may also at its turn evict it, by inserting an object i′ (i′ may be equal to i) for which $G_{i',N} > L_{e,N}$. (All of these occurrences are possible since request rates are positive reals.) Since $L_{e,N} > L_{e,N-1}$, we may have that $G_{i,N-1} < L_{e,N}$ and hence it would have been better for player N − 1 not to make the move e ← i. Despite this negative result, the strategy does have some good properties, as shown below. Most of these are shown to hold in a homogeneous case in which all nodes have the same request rate for each object and the same storage capacity. Despite the simplicity of this scenario, it is insightful for a real-life system, since nodes in a replication group are likely to have similar interests for objects.
3.1
Mistreatment
As discussed in the Introduction, a rational node has an incentive to follow the greedy churn-aware strategy if it is not susceptible to mistreatment, i.e., if its incurred access cost when all other nodes follow this strategy is smaller than its incurred cost when all nodes act in isolation. In this subsection, we examine in which cases this constraint can be satisfied for all nodes so that mistreatment is avoided. First, consider that in the case of two nodes, irrespective of their request rates for objects and their ON probabilities, the greedy churn-aware strategy is always better than the greedy local strategy for both nodes. This happens since the second node only evicts an object that belongs to the first node, and such action does not increase the access cost of the first node. For more than two nodes, one can easily construct examples with nodes having different request rates and ON probabilities, in which mistreatment occurs against one or more of the nodes. However, in the homogeneous case, if less reliable nodes play first, the churn-aware strategy is mistreatment-free under an additional condition stated in the analysis below.

Consider the set of nodes N = {1, 2, ..., N}. We assume that $\pi_1 \le \pi_2 \le \cdots \le \pi_N$. Initially, all nodes have the same lists of objects to be placed in their memory. Assume that each node starts making replacements in its list, as dictated by the churn-aware strategy. First notice that because $\pi_j \le \pi_{j'}$, the cost of each node j to evict an object is smaller than the cost of each node j′ > j to evict the same object. Similarly, the gain of each node j to insert an object is greater than the gain of each node j′ > j to insert the same object after j. Suppose that each node j makes $m_j$ replacements when playing. It follows that $m_1 \ge m_2 \ge \cdots \ge m_N$. Consider any subsequence of nodes, indexed without loss of generality as 1, 2, ..., ℓ, where $\pi_1 \le \pi_2 \le \cdots \le \pi_\ell$, and assume that all nodes in this subsequence make a number of replacements k, where $k \le m_\ell$. It follows that subsequent nodes have decreasing gains when making the kth replacement. Denote the objects that node j removed by $e_1, e_2, \ldots, e_{m_j}$, indexed in increasing eviction cost. For its kth replacement, the replacement gain of node j is $G_{i_k,j} - L_{e_k,j}$. Its access cost will be increased if subsequent nodes at their kth replacement remove one or more of the objects $e_1, e_2, \ldots, e_{m_j}$ that node j removed. (Its access cost is not affected if subsequent nodes insert the same objects, and is decreased if they insert some different objects.) We can show
however, that under a certain condition it remains lower than its cost before making this replacement.

We generally denote by $L^{(j)}_{e_k,j+n}$ the average increase in access cost incurred to node j by node j + n evicting object $e_k$. If node j + n is the first node that evicts object $e_k$ after j and if $L_{e_k,j+n}$ is the eviction cost of node j + n, it holds that $L_{e_k,j+n} = L_{e_k,j} + L^{(j)}_{e_k,j+n}$. Let the number of nodes that make a kth replacement after node j has played be $n_k$, and consider the subsequence j + 1, j + 2, ..., j + $n_k$. The object evicted by node j + 1, j + 2, ..., j + $n_k$ at the kth replacement is denoted by $e_{k(j+1)}, e_{k(j+2)}, \ldots, e_{k(j+n_k)}$, respectively. Without loss of generality, we assume that at their kth replacement, all nodes in this subsequence evict objects evicted by node j (otherwise their induced cost to j is zero and we need not include these nodes). Due to decreasing gains of subsequent nodes for making the kth replacement, the objects evicted by a node j + ℓ are a subset of the objects evicted by j, and the index $k(j+\ell) \ge k$ for all ℓ = 1, ..., $n_k$. We now impose the critical condition that the objects evicted by node j + 2 are also evicted by node j + 1, for each pair (j + 2, j + 1) in the subsequence. The replacement gain of node j after these evictions becomes at least (since it may be increased by different inserted objects of other nodes)

$$\begin{aligned}
& G_{i_k,j} - L_{e_k,j} - L^{(j)}_{e_{k(j+1)},j+1} - L^{(j)}_{e_{k(j+2)},j+2} - \cdots - L^{(j)}_{e_{k(j+n_k)},j+n_k} && (5a) \\
\ge{} & G_{i_k,j} - L_{e_{k(j+1)},j} - L^{(j)}_{e_{k(j+1)},j+1} - L^{(j)}_{e_{k(j+2)},j+2} - \cdots - L^{(j)}_{e_{k(j+n_k)},j+n_k} && (5b) \\
={} & G_{i_k,j} - L_{e_{k(j+1)},j+1} - L^{(j+1)}_{e_{k(j+2)},j+2} - \cdots - L^{(j)}_{e_{k(j+n_k)},j+n_k} && (5c) \\
\ge{} & \cdots \ge G_{i_k,j} - L_{e_{k(j+n_k)},j+n_k} > 0 . && (5d)
\end{aligned}$$

(5c) follows from (5b) because it holds that $L_{e_{k(j+1)},j+1} = L_{e_{k(j+1)},j} + L^{(j)}_{e_{k(j+1)},j+1}$ and $L^{(j+1)}_{e_{k(j+2)},j+2} = L^{(j)}_{e_{k(j+2)},j+2}$ (since both j and j + 1 have evicted object $e_{k(j+2)}$, the same cost is induced to both by j + 2 evicting it). Finally, the last implication holds because $L_{e_{k(j+n_k)},j+n_k} < G_{i_{k(j+n_k)},j+n_k} < G_{i_k,j}$. ($i_{k(j+n_k)}$ denotes the object inserted by node j + $n_k$.) Since moreover node j makes a higher number of replacements than its subsequent nodes ($m_j \ge m_{j+1} \ge \cdots \ge m_N$), the total gain of node j by making its $m_j$ moves can never become smaller than zero. Therefore node j incurs a smaller access cost under the final placement of all nodes compared to its greedy local placement. Under the same order of play and the same conditions, this holds for any number of rounds in the game.

Remark 2. An important consequence of the previous analysis is that it reveals a mistreatment-free strategy. Consider the greedy churn-aware strategy, with the additional condition that a node only evicts an object if it was also evicted by all previous nodes, or not evicted by any of the previous nodes. In the homogeneous case where less reliable nodes play first this strategy is mistreatment-free.
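As a sanity check on the step from (5b) to (5c), the relation between the eviction costs of nodes j and j + n and the cost induced on node j can be worked out directly from (3). The derivation below is a sketch under assumptions that are not spelled out in the text: the homogeneous case (a common request rate $r_e$ for object e), no other node inserting or evicting e between the moves of j and j + n, and $\pi_{j+n} < 1$. The symbols Q and Q′ are introduced here only for illustration; they denote the probability that all other holders of e are OFF, as seen by node j when it plays and by node j + n when it plays.

```latex
\begin{align*}
Q &= \prod_{k \ne j,\; k:\, e \in P_k} (1 - \pi_k), \qquad Q' = \frac{Q}{1 - \pi_{j+n}} \\
L_{e,j} &= r_e\bigl[(t_s - t_l)\,Q + (t_r - t_l)(1 - Q)\bigr]
         = r_e\bigl[(t_r - t_l) + (t_s - t_r)\,Q\bigr] \\
L_{e,j+n} &= r_e\bigl[(t_r - t_l) + (t_s - t_r)\,Q'\bigr] \\
L^{(j)}_{e,j+n} &= r_e\bigl[t_s Q' + t_r (1 - Q')\bigr] - r_e\bigl[t_s Q + t_r (1 - Q)\bigr]
                 = r_e (t_s - t_r)(Q' - Q) \\
L_{e,j+n} - L_{e,j} &= r_e (t_s - t_r)(Q' - Q) = L^{(j)}_{e,j+n}
\end{align*}
```

Under these assumptions the eviction cost of node j + n equals the eviction cost of node j plus the cost that the later eviction induces on node j.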
3.2
Potential Gain of a Node by Playing Again
A useful metric in our model is the potential gain of a node by playing again after a number of steps in the game. This gain is defined as a node's reduction in mean access cost if it were to play again after a number of steps in the game. Intuitively, a good strategy yields small such gains to all nodes. The analysis below shows that in the homogeneous case studied in the previous subsection, assuming now that more reliable nodes play first ($\pi_1 \ge \pi_2 \ge \cdots \ge \pi_N$), a node does not have a benefit by playing again at a subsequent epoch, in certain game instances.

A node may have a benefit by playing again by reinserting an object evicted by itself and subsequent nodes, or by evicting an object inserted by subsequent nodes. Consider that at a subsequent epoch after node j has played, a number of players evict an object e that was also evicted by node j and another number of players insert an object i also inserted by j. Say the last node that evicted object e was j + $n_1$, and the last node that inserted object i was j + $n_2$. We can show that if $n_1 = n_2$ for all (e, i), then $L_{i,j} > G_{e,j}$ and node j does not obtain a benefit by playing again. Consider that the insertion gain of node j to reinsert object e equals the eviction cost of the last node after j that evicted the object. So $G_{e,j} = L_{e,j+n_1}$. If the last node that inserted object i was j + $n_2$, since $\pi_j \ge \pi_{j+n_2}$, we also have that $L_{i,j} \ge G_{i,j+n_2}$. If $n_1 = n_2$, then since we necessarily have $G_{i,j+n_1} > L_{e,j+n_1}$ it follows that $L_{i,j} > G_{e,j}$. If $n_1 \gtrless n_2$ then, depending on request rates and the probabilities of nodes to be ON, we may have $L_{i,j} < G_{e,j}$ and node j may obtain a benefit by playing again. This is also the case if some other ordering is applied. However, potential gains are usually small, as confirmed by the results in Sect. 4.
3.3
Finiteness of the Algorithm with Multiple Rounds
The following theorem holds:

Theorem 1. The greedy churn-aware algorithm ends in a finite number of rounds, irrespective of the order of play in each round.

Proof. At each step of the game, each player may evict an object owned by a certain number of other nodes, to insert another object owned by either a) a smaller number of nodes (or none), or b) a larger number of nodes, with smaller probability that at least one of them is ON. Thus, there will come a time when nodes do not have objects in common or no further replacements are possible.

In practice, the algorithm finishes very quickly. In all test cases that follow, the algorithm finished in 1 to 5 rounds.
4
Numerical Evaluation
In this section we present some numerical examples to show how the different object replacement strategies perform in a replication group under node churn.
[Figure 1: four panels plotting access cost (vertical axis, 0 to 60) against node number (horizontal axis, 1 to 10) for the Greedy local, Greedy churn-unaware (1 rnd), Greedy churn-aware (1 rnd), and Greedy churn-aware (multiround) strategies. Panels: (a) LRF (π=[0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1]); (b) MRF (π=[1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1]); (c) Random order (π=[0.1 0.4 0.7 0.3 0.2 1 0.8 0.9 0.6 0.5]); (d) πj=0.5 for all j=1...N.]
Fig. 1. Access cost for different placement strategies with different node orderings
We are interested in analyzing cases where nodes have similar preferences for objects, so that mutual benefits emerge by cooperation. Here we consider the simplest case where nodes have the exact same request rates. Our major conclusions are similar when we consider different request rates, or when each node has a different preference order for objects. Request rates are drawn from a Zipf distribution with exponent s. The probability of the object with rank k, k = 1, ..., M, is

$$f(k; s, M) = \frac{1/k^s}{\sum_{m=1}^{M} 1/m^s} .$$

We consider N = 10 nodes. Each of the N nodes has a capacity of 10 objects in its local storage, and there exists a total of M = 50 objects. We set $t_l = 1$, $t_r = 10$ and $t_s = 100$ cost units. Fittings based on real traces in [6] have shown the value of the exponent s of the Zipf distribution to lie between 0.8 and 0.9. Here we take s = 0.9. We examine the following orderings of players based on their reliability: (1) Least Reliable First (LRF), where nodes play in increasing order of their ON probabilities, (2) More Reliable First (MRF), where nodes play in decreasing order of ON probabilities, and (3) Random order, in which nodes play in a randomly selected order.
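The Zipf request rates used in this setup can be reproduced with a short Python sketch. The function name `zipf_pmf` is an assumption made here for illustration; only the formula and the parameter values s = 0.9 and M = 50 come from the text.

```python
def zipf_pmf(s, M):
    """Return [f(1; s, M), ..., f(M; s, M)] for the Zipf distribution."""
    weights = [1.0 / k ** s for k in range(1, M + 1)]
    total = sum(weights)                      # normalising constant
    return [w / total for w in weights]


# Request rates for the M = 50 objects with exponent s = 0.9.
request_rates = zipf_pmf(0.9, 50)
```

The list is indexed by rank minus one, so `request_rates[0]` is the rate of the most popular object, and the ratio between the rates of ranks 1 and 2 is 2 raised to the power s.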
Table 1. Potential gains of nodes by playing again after 1 round

                           Potential gain
Node                       1     2     3     4     5     6     7     8     9     10
LRF                        1.52  1.61  0.83  1.93  0     0     0     0     0     0
MRF                        0     0     0     0     0     0     0     0     0     0
Random order               2.81  2.73  0     2.94  2.69  0     0     0     0     0
πj = 0.5 ∀j = 1, ..., N    0.35  0.40  0     0     0     0     0     0     0     0
The (mean) access costs of all nodes under the different orderings are shown in Fig. 1. The vector π = [π1, π2, ..., πN] of ON probabilities of nodes for each examined order of play is shown in the title of each sub-figure. We also show a case where nodes have the same probability to be ON, πj = 0.5 for all j = 1, ..., N (Fig. 1(d)). No mistreatment is shown to occur in these graphs; intuitively, this happens here because there exists a number of nodes with relatively high reliability values. (For an example of mistreatment under the churn-unaware strategy, readers are referred to [3].) All group replication strategies perform better than the greedy local strategy. Comparing the churn-unaware and churn-aware greedy strategies, we observe that a significant improvement occurs when the ordering is LRF (Fig. 1(a)), while similar costs are produced for an MRF order (Fig. 1(b)). To understand this, notice that under the LRF order, the churn-unaware strategy results in many high request rate objects being stored at less reliable nodes; more reliable nodes falsely trust high request rate objects to be accessible from the previous nodes, while it would be better to have them stored locally. Such erroneous placements occur less frequently when the order is MRF. The churn-aware strategy can also yield a significant improvement when there is no pre-specified order of play, but nodes take random turns (Fig. 1(c)). By contrast, only a slight improvement is observed when all nodes have the same probability of being ON. An important observation is that there is no significant access cost reduction by repeating the churn-aware algorithm for multiple rounds. This yields extra benefit only to some nodes; depending on the order of play, some nodes may benefit from the extra rounds, which usually occurs at the expense of others.
We also examine the potential gain of each node by playing again after one round of the churn-aware algorithm, under the various orderings (if there can be no improvement, the potential gain is zero). Results are shown in Table 1. In the MRF case, the algorithm stops in the first round and thus all potential gains are zero. The analysis in Sect. 3.2 partially justifies this result. Generally in the other cases, only the first few nodes attain a benefit by playing again, since they are more likely to be farther from an optimal placement. In any case, benefits are small when compared to the access cost values (see Fig. 1), and hence nodes have a small incentive to play again.
Finally, it is worth discussing the fairness aspects of different orderings: if some mediator (e.g., a system administrator) could enforce a certain order, we would like to know the fairest one. It is fair that more reliable nodes obtain a greater benefit by participating in the game than the less reliable ones. We established previously that when all nodes follow the churn-unaware strategy, LRF is an unfair order because it leads more reliable nodes to erroneous placements. Instead, MRF should be followed. Under the churn-aware strategy, such fairness problems are mitigated, as all nodes behave in a rational manner. In the results, both the LRF and MRF orderings tend to give similar benefits.
5
Conclusions
This paper proposed a distributed algorithm for object replication in a content network, in which each node selfishly decides which objects to replicate from a distant server and announces its decision to other nodes, thus allowing all nodes to gain from limited cooperation. The algorithm accounts for selfish behavior of autonomous nodes, as well as churn phenomena, i.e., nodes that are not always available or operational. It needs minimal information in order to be applied, consisting of the placements of other nodes in the replication group and a probability estimate of their reliability. This information can be forwarded in a distributed manner, or maintained in a central database in the network that nodes query prior to making their decision. While it may not arrive at an equilibrium, we have shown that it possesses good properties: in the majority of cases, it decreases the access cost collectively for all nodes, compared to the greedy local and churn-unaware strategies. Although mistreatment problems may still occur, these are alleviated if all nodes follow the common strategy. Furthermore, if nodes having the same request rates for objects play according to an LRF order, the strategy is nearly mistreatment-free and provides for a fair treatment of nodes according to their reliability. A challenging line for future work is the study of conditions under which the strategy arrives at an approximate Nash equilibrium. Based on the analysis in Sect. 3.1, a first conjecture is that an approximate equilibrium exists in the homogeneous case, under an LRF order. A possible extension of this work is to consider communication capabilities of nodes, and the possibility of each one to lie about its placement. Then the game should be studied as a game with communication, and an additional requirement of a good strategy is that nodes do not have an incentive to lie.
Finally, the behavior of the algorithm in the cases where there exist objects of different sizes or where each node has a subjective estimate for the reliability of other nodes should be examined.
References
1. Leff, A., Wolff, J., Yu, P.: Replication algorithms in a remote caching architecture. IEEE Trans. Par. and Distr. Systems 4(11), 1185–1204 (1993)
2. Laoutaris, N., Telelis, O., Zissimopoulos, V., Stavrakakis, I.: Distributed selfish replication. IEEE Trans. Par. and Distr. Systems 17(12), 1401–1413 (2006)
3. Jaho, E., Jaho, I., Stavrakakis, I.: Distributed selfish replication under node churn. In: Proc. Med-Hoc-Net (2007) Poster session
4. Basar, T., Olsder, G.: Dynamic noncooperative game theory, 2nd edn. Society for Industrial and Applied Mathematics (SIAM) (1999)
5. Avrachenkov, K., Nemirovsky, D., Pham, K.S.: A survey on distributed approaches to graph based reputation measures. In: Proc. of International Workshop on Tools for solving Structured Markov Chains (SMCtools) (2007)
6. Breslau, L., Cao, P., Fan, L., Phillips, G., Shenker, S.: Web caching and Zipf-like distributions: Evidence and implications. In: Proc. IEEE INFOCOM (1999)
Congestion Avoiding Mechanism Based on Inter-domain Hierarchy
Marc-Antoine Weisser¹, Joanna Tomasik¹, and Dominique Barth²
¹ Supélec, 3, rue Joliot-Curie, F-91192 Gif-Sur-Yvette, France
{marc-antoine.weisser,joanna.tomasik}@supelec.fr
² PRiSM, University of Versailles, 45, avenue des États Unis, F-78035 Versailles, France
[email protected]
Abstract. Inter-domain routing is ensured by BGP. BGP messages carry no information concerning quality parameters of routes. Our goal is to provide domains with information regarding the congestion state of other domains without any changes in BGP. A domain which is aware of congested domains can choose a bypass instead of a route exhibiting possible QoS problems. We propose a distributed mechanism sending alert messages in order to notify domains about other domains' congestion state. Our solution avoids flooding the Internet with signaling messages. It limits the number of alerts by taking advantage of the hierarchical structure of the Internet set by P2C and P2P relationships. Our algorithm is heuristic because it is a solution to an NP-complete and inapproximable problem. We prove these properties using the Steiner problem in directed acyclic graphs. The simulation runs show that our mechanism significantly diminishes the number of unavailable domains and routes compared to those obtained with "pure" BGP routing and with a theoretical centralised mechanism.
Keywords: inter-domain, QoS, distributed algorithm.
1
Introduction
We consider two paradigms for the introduction of QoS into the Internet: flow-based and connectionless. The first one is based on individual flow management. Generally speaking, the flow-based paradigm is not adapted to the inter-domain level. This paradigm may be applied to a small number of critical requests which have strict QoS requirements. For the other QoS demands, which need no strict guarantee, the second, connectionless paradigm should be used. The principle of the connectionless paradigm is the rejection of flow management. Packets are divided into classes and the priority traffic, i.e., the packets of the most privileged class, is treated with priority. In [1] we proposed an algorithm founded on the flow-based paradigm which finds multi-constraint paths in an inter-domain network. In this paper we focus on the second approach.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 470–481, 2008. © IFIP International Federation for Information Processing 2008
Our proposition: BGP is a protocol which constructs routing tables in an inter-domain network. We take as a hypothesis that a domain using BGP arbitrarily chooses one path from a set of possible paths, i.e., a domain uses the same BGP routing table for all its border routers. If a domain were aware of the congestion state of the other domains, it could choose a path avoiding the congested domains. Yet, such an approach is unrealistic because 1) it would generate too many control messages and 2) domain operators would not want to give details of the configuration and resources of their domains. We propose a mechanism which is external to BGP and which allows us to select paths avoiding congested domains. Our mechanism uses incomplete congestion state information to find bypasses which go around congested domains. Our mechanism copes with the two problems stated above. It could be implemented in a path computation element (PCE) [2] since a PCE has to have a global vision of inter-domain paths. The studies [3, 4, 5, 6] of BGP tables show a hierarchical structure of the inter-domain network. This hierarchy is induced by the types of relationships connecting two domains: P2C (provider to customer) and P2P (peer to peer). In our mechanism, we use this hierarchy to limit the number of control messages. A domain using our mechanism needs to know only about the congested domains found in its neighborhood. Moreover, the commercial contract behind relationships such as P2C legitimates the exchange of information concerning congestion. Providers paid by their customers should warn them about congestion in order to allow them to change their routing and use the networks of other providers. The second problem stated above is solved because only a minimum of information is sent to a small number of domains. In this paper we only focus on the simulation results of our solution. The problem complexity results are presented in [7].
Related works: An investigated approach to introduce QoS into the inter-domain level consists in either exchanging QoS characteristics between domains [8] or guessing the QoS characteristics using probe messages [9].
These approaches need a large amount of information that is difficult to collect and to use. In our approach we propose to use binary information: whether a domain is congested or not. It is exchanged only by connected nodes. The use of hierarchy in QoS issues appears in some works [10, 11]. The hierarchy described in [10] is artificial and has to be constructed. In our proposition, we use the existing hierarchy introduced by P2P and P2C relationships. One paper explores this hierarchy in the routing context [11]. Its authors use it for a hybrid routing mechanism which works both with link-state and path vectors. To the best of our knowledge, the proposed use of alert message diffusion conditioned by the inter-domain hierarchy has never been studied.
Paper organization: We give a description of an inter-domain network and its hierarchy in the next section. We model our problem with graph theory in Section 3 and give the results about its complexity and inapproximability. In Section 4 we describe our mechanism. In Sections 5 and 6 we present the simulations and their results. In order to judge our algorithm, we confront it with two extreme solutions: BGP and an on-line theoretical centralized algorithm
which we define for comparison purposes. Finally, we draw conclusions and outline future work.
2
Problem Modeling
The problem modeling takes advantage of the specific structure of the inter-domain network, which is composed of independent domains administrated by operators. Links connecting domains are characterized by two types of relationships: P2C and P2P. A P2C relationship links providers which sell connectivity to their customers. A P2P relationship exists between two domains which share connectivity. The studies of BGP tables [3, 4, 5] show that these types of relationships introduce a hierarchy in the inter-domain network. The hierarchy is composed of layers. The core, Tier-1, on the top of the hierarchy, is a set of domains which are linked together by P2P relationships. The domains of the core are not customers of any domain. Usually, the domains in the layer Tier-j are providers of domains in the layer Tier-k and customers of domains in the layer Tier-i, i < j < k. The domains in the layer Tier-i and in Tier-j are linked together with P2C relationships. Domains in the same layer can be linked together with P2P or P2C relationships.

The inter-domain hierarchy has an impact on the current inter-domain routing because routes are established according to commercial relationships. The only routes present in an inter-domain network are composed of an uphill and a downhill component. A downhill component is a list of consecutive links labelled with a P2C relationship. An uphill component is a list of consecutive links labelled with a C2P relationship, which is dual to a P2C. The uphill and downhill components are connected either by zero or by one P2P relationship. Such routes are called valley-free by Gao [5]. We apply this term later in the paper.

We use a graph to model our problem. Let G = (V, A, E) be a mixed graph (partially oriented [12]), to which we refer later in this paper as the inter-domain graph. A vertex v ∈ V represents a domain. We enumerate the elements of V with natural numbers 1, ..., n. An ordered pair (p, c) ∈ A represents an existing P2C relationship between a provider p and its customer c. A pair {p1, p2} ∈ E represents a P2P relationship between domains p1 and p2. A provider p which sells connectivity to a customer c cannot be a customer of c. We therefore consider that there is no cycle made of P2C relationships in an inter-domain network [13]. This means that (V, A) is a directed acyclic graph (DAG). The roots of this DAG are the vertices representing the domains of the core.

We define a capacity function $c : V \to \mathbb{R}$. The value $c(v_i)$ represents the amount of traffic which can be transited by $v_i$ without overloading it. Let $T : V^2 \to \mathbb{R}$ be a traffic matrix. Each value $T_{i,j}$ represents the amount of traffic which has to be sent from i to j. The traffic $T_{i,i}$ represents the internal traffic of the domain i. Let $R : V^2 \to \mathbb{N}$ be a routing matrix. Each value $R_{i,j}$ represents the next hop for a packet passing through the domain i which is to be sent to the domain j. We set $R_{i,i} = i$ for internal traffic. We say that a matrix R is valley-free if, and only if, all the routes which it represents are valley-free.
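The valley-free property can be checked mechanically on the mixed graph G = (V, A, E). The Python sketch below is illustrative only: the representation of A as a set of (provider, customer) pairs and of E as a set of frozensets, the function names, and the five-domain example are assumptions made here, not part of the paper.

```python
def hop_type(u, v, p2c, p2p):
    """Classify the link used when going from domain u to domain v."""
    if (v, u) in p2c:
        return "up"      # C2P: u is a customer of v
    if (u, v) in p2c:
        return "down"    # P2C: u is a provider of v
    if frozenset((u, v)) in p2p:
        return "peer"    # P2P
    raise ValueError(f"no link between {u} and {v}")


def is_valley_free(path, p2c, p2p):
    """True if path is an uphill part, at most one P2P link, then a downhill part."""
    climbing = True                      # still in the uphill component
    for u, v in zip(path, path[1:]):
        kind = hop_type(u, v, p2c, p2p)
        if kind == "down":
            climbing = False
        elif not climbing:               # "up" or "peer" after the summit
            return False
        elif kind == "peer":
            climbing = False             # the single allowed P2P link
    return True


# Hypothetical hierarchy: 1 and 2 form the core; 1 -> 3 -> 5 and 2 -> 4 are P2C links.
p2c = {(1, 3), (2, 4), (3, 5)}
p2p = {frozenset((1, 2))}
ok = is_valley_free([5, 3, 1, 2, 4], p2c, p2p)    # up, up, peer, down
bad = is_valley_free([1, 3, 1], p2c, p2p)         # down, then up: a valley
```

A routing matrix R could be declared valley-free by applying the same test to the route it induces for every pair of domains.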
Given a network, a traffic matrix, and a routing matrix, we say that a node is perturbed if the total amount of traffic transiting through it, emitted by it, and sent to it is greater than its capacity. A perturbed path is a path containing at least one perturbed node. Each $T_{i,j}$ is transmitted on a route induced by the routing. Traffic from i to j is perturbed if the route from i to j is perturbed. The volume of perturbed traffic is the sum of the traffic passing on perturbed paths. Given a network and a traffic matrix, our problem is to find a valley-free routing matrix which minimizes the number of perturbed nodes (network approach) and the volume of the traffic passing along perturbed paths (traffic approach). This is a bi-criteria optimization problem. We focus on the network approach only because we assume that minimizing the number of perturbed nodes may be a good heuristic approach to minimize the number of perturbed paths as well as the volume of perturbed traffic (see Section 5). Another reason for choosing the network approach criterion is implied by the operation mode of the proposed algorithm. According to the algorithm, nodes modifying their state send alert messages (see Section 3 for details). Minimizing the number of perturbed nodes limits the number of sent messages. Given a network and a traffic matrix, we refer to the above described problem as the valley-free routing problem. The weight of its solution is the number of perturbed nodes. In [7] we prove that this problem is NP-complete and inapproximable. We use the directed Steiner tree problem for the proofs.
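The two criteria defined above can be evaluated for a given routing matrix with a few lines of Python. This is a sketch under stated assumptions: T and R are stored as dictionaries keyed by (i, j), the function names are hypothetical, and internal traffic $T_{i,i}$ is counted once in the load of domain i, which the text does not specify.

```python
def route(i, j, R):
    """Follow next hops R[(current, j)] from i until j is reached."""
    path = [i]
    while path[-1] != j:
        nxt = R[(path[-1], j)]
        if nxt in path:
            raise ValueError("routing loop")
        path.append(nxt)
    return path


def perturbed_nodes(T, R, capacity):
    """Nodes whose emitted, received and transit traffic exceeds their capacity."""
    load = {v: 0.0 for v in capacity}
    for (i, j), volume in T.items():
        for v in route(i, j, R):          # source, transit nodes and destination
            load[v] += volume
    return {v for v in capacity if load[v] > capacity[v]}


def perturbed_volume(T, R, perturbed):
    """Total traffic whose route crosses at least one perturbed node."""
    return sum(volume for (i, j), volume in T.items()
               if perturbed & set(route(i, j, R)))


# Hypothetical chain 1 - 2 - 3 in which domain 2 is the bottleneck.
R = {(1, 2): 2, (1, 3): 2, (2, 1): 1, (2, 3): 3, (3, 1): 2, (3, 2): 2}
T = {(1, 3): 5.0, (3, 1): 4.0, (2, 2): 1.0}
capacity = {1: 10.0, 2: 8.0, 3: 10.0}
red = perturbed_nodes(T, R, capacity)
volume = perturbed_volume(T, R, red)
```

Here domain 2 carries 10 units against a capacity of 8, so it is the only perturbed node and all 10 units of traffic are perturbed. The sketch only evaluates a candidate routing; it does not search for the optimal valley-free routing matrix, which the text states is NP-complete.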
3
Alert Sending Algorithm
Our distributed algorithm is based upon the principle of sending alert messages which carry information about the domain congestion state. To avoid the risk of flooding the entire network with alert messages, we aim to limit the range of their diffusion. We propose to take advantage of the inter-domain hierarchy to restrict the alert diffusion range. Each node informs its customers and providers when it becomes perturbed in order to allow them to change their routing. Each node also keeps them informed when it returns to an operational state. Each node is provided with a BGP table and a priority table which stores potentially congestion-free routes set aside for priority traffic. The priority table is the same as the BGP table as long as neither the node nor any of its providers is perturbed. The BGP table is altered by classical BGP mechanisms only [14]. The priority table is altered by our alerts and, occasionally, by changes in the BGP table. Each node contains a list of possible next hops towards every destination. These lists are constructed with routes announced by BGP. Their construction guarantees that all next hops satisfy the valley-free property of routes. Each node can be in two distinct states: green or red. A node state is red if at least one of the two following conditions is satisfied: the amount of traffic transiting through the node, sent from it and sent to it is greater than its capacity, or at least one of its next hops which is a provider is in red state. A node cannot become red because of the congestion of its customers and peers. Thanks to this fact, our algorithm avoids the spreading of red nodes through the whole network
M.-A. Weisser, J. Tomasik, and D. Barth
when a single node becomes red. We use the hierarchy to limit the number of nodes in the red state as well as the number of messages sent: a node which has a customer or a peer in the red state as a next hop should not be in the red state itself. A node which is not in the red state is in the green state. The green state is a stable state of a domain. Each node stores a table containing the states of its neighbors. This table is updated by alerts sent by the neighboring providers and customers when their state changes. A node changing its state sends messages to its customers and providers to inform them about its new state. The alerts are sent up and down in the hierarchy, but the peers do not receive the message, so that the spread of messages remains limited. The message is composed of: an ID of the node (domain number), a new state, and a delay d. It enters the scheduler of the node receiving the message, where it is processed after the delay d. The received message replaces any older message which arrived from the same node and is still present in the scheduler. We introduce a notation in order to describe the behavior of a node processing an alert message. Let BRi and PRi be tables stored in the node i, containing the next hop for the BGP and the priority routing, respectively. The values contained in BRi are computed by the classical BGP mechanism. The default values of PRi are those of BRi. Let sti be a table stored in the node i containing the known states of its neighbors. The default values in this table are green. Let LRi be a table containing lists of the next hops which may be used to reach every destination. These lists are constructed using the routes announced by BGP. For a destination j, if there is a path from i to j via a customer of i, LRi[j] contains all next hops which are customers and which are announced by BGP. If a next hop is a peer, LRi[j] contains all the next hops announced by BGP which are peers. Otherwise, LRi[j] contains all next hops which are providers.
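The construction of one list LRi[j] can be sketched as follows. This is a minimal illustration under assumptions of ours: the function name `build_next_hop_list` and the `relation` mapping (neighbor to relationship type) are hypothetical, and the order customer, peer, provider is our reading of the rule stated above.

```python
def build_next_hop_list(announced, relation):
    """Return the next hops kept in LR_i[j], restricted to one relationship class.

    announced: next hops announced by BGP for destination j
    relation:  maps each neighbor of i to "customer", "peer" or "provider"
    """
    # Customers take precedence, then peers, then providers.
    for kind in ("customer", "peer", "provider"):
        hops = [h for h in announced if relation[h] == kind]
        if hops:
            return hops
    return []

# Hypothetical neighbors of node i.
relation = {"c1": "customer", "c2": "customer", "q1": "peer", "p1": "provider"}
via_customers = build_next_hop_list(["p1", "c1", "c2", "q1"], relation)
via_peers = build_next_hop_list(["p1", "q1"], relation)
```

Keeping a single relationship class per destination means that any replacement next hop has the same type as the one it replaces.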
These restrictions preserve the valley-free property. Algorithm 1 details the behavior of a node i treating a message m received from the node j. Steps 3–13 are executed when an alert arrives from a node j which is in the red state. The node i then selects the destinations for which the node j is the designated next hop and looks for alternative next hops for them. An alternative next hop is chosen uniformly among the domains stored in the set LRi[dest] and indicated as being in the green state. If no node can be chosen, we use the default path stored in the BGP table in spite of its state. A valley-free route does not include any cycle. Our mechanism preserves this property because the construction of the LRi table permits us to replace a customer only by a customer and a provider only by a provider. Steps 14–21 are executed when the node j is in the green state. In this situation, the node i replaces the nodes mentioned below by the node j: 1) the next hops whose state is red and which lead to destinations for which j can be a next hop, or 2) the next hops whose state is green towards all destinations for which j is a “natural” next hop according to BGP routing. In dynamical systems such as networks, oscillations can emerge. In our approach, this undesirable phenomenon may be introduced by exchanges of green and red alert messages. We avoid such instability by introducing a delay before the alert message is processed. Each alert message contains a field to store the
Congestion Avoiding Mechanism Based on Inter-domain Hierarchy
Algorithm 1. Node i treating a message m
Input: message m = (j, color, delay); tables BRi, PRi, LRi, sti
Output:
1: delete m from the scheduler
2: sti[j] ⇐ color
3: if color = red then
4:   for all dest ∈ V do
5:     if PRi[dest] = j then
6:       let S = {x ∈ LRi[dest] / sti[x] = green}
7:       if S = ∅ then
8:         PRi[dest] ⇐ BRi[dest]
9:       else
10:        PRi[dest] ⇐ choose_uniformly_in(S)
11:      end if
12:    end if
13:  end for
14: else
15:  for all dest ∈ V do
16:    if sti[PRi[dest]] = red then
17:      if j ∈ LRi[dest] then PRi[dest] ⇐ j
18:    end if
19:    if BRi[dest] = j then PRi[dest] ⇐ j
20:  end for
21: end if
delay before the message processing. The messages are sent immediately but the processing is deferred. The delay is set according to a random distribution in order to avoid synchronization in the network. We use an exponential distribution. The mean of the distribution is small for red alerts and large for green alerts. The delay mechanism protecting the network against instability may unnecessarily introduce the red state in too many nodes. Therefore, the BGP table is used by default when no green next hop can be found, and instability in the network leads our mechanism to work temporarily as BGP. Let us observe that, given an instance of our problem (a network and a traffic matrix), the goal of a distributed algorithm is to find a stabilized routing matrix minimizing the number of perturbed nodes. The problem of finding a routing matrix in a centralized way is NP-complete [7]. Then, a distributed algorithm finding a stabilized routing matrix will need an exponential number of messages.
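The treatment of an alert in Algorithm 1 and the random processing delay can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: tables are plain dictionaries, the scheduler is left out, and the mean delays `mean_red` and `mean_green` are hypothetical values chosen only to respect "small for red, large for green".

```python
import random

def process_alert(j, color, BR, PR, LR, st, rng=random):
    """Update the priority table PR of node i once an alert (j, color) is due."""
    st[j] = color
    if color == "red":
        for dest in PR:
            if PR[dest] == j:
                green = [x for x in LR[dest] if st.get(x, "green") == "green"]
                # Fall back on the BGP next hop when no green alternative exists.
                PR[dest] = rng.choice(green) if green else BR[dest]
    else:
        for dest in PR:
            # Replace a red next hop by j when j is an admissible next hop.
            if st.get(PR[dest], "green") == "red" and j in LR[dest]:
                PR[dest] = j
            # Return to j wherever it is the BGP next hop.
            if BR[dest] == j:
                PR[dest] = j

def alert_delay(color, mean_red=1.0, mean_green=10.0, rng=random):
    """Draw an exponentially distributed processing delay (means are assumptions)."""
    mean = mean_red if color == "red" else mean_green
    return rng.expovariate(1.0 / mean)

# Hypothetical tables of a node i with two providers p1 and p2.
BR = {"d1": "p1", "d2": "p1"}
PR = dict(BR)
LR = {"d1": ["p1", "p2"], "d2": ["p1"]}
st = {}
process_alert("p1", "red", BR, PR, LR, st)
after_red = dict(PR)
process_alert("p1", "green", BR, PR, LR, st)
after_green = dict(PR)
```

After the red alert, destination d1 moves to the only green alternative p2, while d2 keeps the BGP next hop because no alternative exists; the green alert restores the BGP next hop for both destinations.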
4
Simulations
We use simulation to evaluate the performance of our algorithm. Its performance will be compared to the performance of BGP and a theoretical centralized algorithm. The performance measures used to make comparisons are the numbers of perturbed nodes and paths, and the amount of perturbed traffic. We expect that BGP will provide the worst results and we will use the results to obtain an
Algorithm 2. Centralized algorithm
Input: Inter-domain graph G; list L of demands (source, destination, traffic)
Output:
1: let R be a routing matrix containing no path
2: sort demands L in decreasing order of traffic
3: for all qi = (si, di, ti) ∈ L do
4:   let pi = find preferred path for qi
5:   if satisfying qi using pi does not perturb any node then
6:     add pi to matrix R; reduce capacity of nodes in pi by ti; remove qi from L
7:   end if
8: end for
9: while L ≠ ∅ do
10:  let (sj, dj, tj) = find the worst demand in L
11:  pj = find best path for (sj, dj, tj)
12:  add pj to matrix R; reduce capacity of nodes in pj by tj; remove (sj, dj, tj) from L
13: end while
14: return the routing matrix R
upper bound for the number of perturbed elements. We point out that a centralized algorithm, studied here for comparison purposes, cannot be considered as implementable in a network because of its complexity and the huge amount of data it has to process. We use it only to find a lower bound for the number of perturbed network elements. Centralized algorithm: Given an inter-domain graph G and a traffic matrix T, our goal is to find a routing matrix which minimizes the number of perturbed nodes in a network. Thus, this centralized algorithm does not consider the traffic approach, i.e., it does not take into consideration the overload of nodes. This problem is NP-complete and inapproximable. We cannot use any exact algorithm to solve it, even for small instances. Thus, we have to use a heuristic algorithm. We observe that paths rising in the hierarchy are longer than the other paths. Moreover, they pass through nodes which may be used to satisfy many demands. These paths may cause the introduction of many perturbed nodes which degrade the network performance. For a traffic demand (s, d, t), where t = Ts,d, we define a preferred path as a path which: 1) minimizes the number of perturbed nodes; 2) is preferably composed of nodes from lower layers. To find such paths, we use an exhaustive route exploration. Exhaustive exploration runs in exponential time, but we are not interested in the complexity of this algorithm. For most of the instances the complexity is polynomial. In practice, exploration time is quite short. Our centralized algorithm is detailed in Algorithm 2. Steps 1–8 form the first phase of the algorithm, during which we sort the demands in decreasing order of traffic amount and we satisfy as many demands as we can without perturbing any node. The introduction of any demand which is not yet satisfied (steps 9–13) produces perturbed nodes.
While there are still unsatisfied demands, we search for the worst demand and we satisfy it using a preferred path.
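The two phases of the centralized heuristic can be sketched in Python as below. This is a simplified illustration under assumptions of ours: candidate paths are supplied per demand instead of being found by exhaustive exploration, the preference for lower-layer nodes is omitted, and the worst demand is taken to be the one whose preferred path introduces the most new perturbed nodes, with ties broken by the larger traffic amount. All names are hypothetical.

```python
def newly_perturbed(path, volume, residual):
    """Nodes of path not yet overloaded that would be overloaded by volume."""
    return [n for n in path if residual[n] >= 0 and residual[n] - volume < 0]

def preferred_path(paths, volume, residual):
    """Candidate path introducing the fewest new perturbed nodes."""
    return min(paths, key=lambda p: len(newly_perturbed(p, volume, residual)))

def centralized_routing(demands, candidates, capacity):
    residual = dict(capacity)  # remaining capacity per node
    routing = {}
    pending = sorted(demands, key=lambda d: d[2], reverse=True)

    def place(demand, path):
        routing[(demand[0], demand[1])] = path
        for n in path:
            residual[n] -= demand[2]

    # Phase 1: satisfy every demand that perturbs no node.
    for d in list(pending):
        path = preferred_path(candidates[(d[0], d[1])], d[2], residual)
        if not newly_perturbed(path, d[2], residual):
            place(d, path)
            pending.remove(d)

    # Phase 2: repeatedly place the worst remaining demand.
    while pending:
        def badness(d):
            p = preferred_path(candidates[(d[0], d[1])], d[2], residual)
            return (len(newly_perturbed(p, d[2], residual)), d[2])
        worst = max(pending, key=badness)
        paths = candidates[(worst[0], worst[1])]
        place(worst, preferred_path(paths, worst[2], residual))
        pending.remove(worst)

    perturbed = {n for n, r in residual.items() if r < 0}
    return routing, perturbed

# Hypothetical instance: demand (b, d, 8) alone exceeds the capacity of b.
capacity = {"a": 10, "b": 5, "c": 10, "d": 20}
demands = [("a", "d", 4), ("c", "d", 3), ("b", "d", 8)]
candidates = {("a", "d"): [["a", "b", "d"], ["a", "c", "d"]],
              ("c", "d"): [["c", "b", "d"], ["c", "d"]],
              ("b", "d"): [["b", "d"]]}
routing, perturbed = centralized_routing(demands, candidates, capacity)
```

In this instance the first phase routes the two small demands without perturbing any node, and the second phase places the remaining demand, which perturbs node b whatever the routing.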
We define the worst demand as a demand: 1) whose satisfaction using the preferred path introduces the maximum number of new perturbed nodes; 2) whose amount of traffic is maximal. Simulation plan: Our algorithm was tested on inter-domain topologies randomly generated with SHIIP [15], which introduces the domain hierarchy into flat inter-domain topologies generated by BRITE [16]. The chosen topologies have parameters (core size, layer size, node degree) close to the means of a series of topologies of a given size obtained with SHIIP. The results discussed in the next section were obtained for a network of 100 domains. A traffic matrix T : V² → R is initially empty. The matrix filling procedure is iterative. After |V|² iterations we consider the matrix as a full matrix. Starting from this point we vary the network load. We choose an (i, j) pair where both i and j are independently and uniformly distributed among the elements of V. We then set a new traffic value Ti,j using the same normal distribution as before (Ti,i indicates traffic to be carried inside the domain i). In our simulation we use $\frac{\alpha}{2|V|}\,N(\sqrt{5}, 5)$ limited to nonnegative values. The scaling factor α, α ≥ 1, allows us to study series of experiments with a growing traffic load. We do not specify a traffic unit; we speak of Traffic Units (TU). The choice of the time scale is not essential because we are interested in the differences between the performance of our algorithm and that of the two reference algorithms, expressed in terms of the number of perturbed nodes, the number of perturbed paths, and the amount of perturbed traffic. We choose the exponential distribution with a mean equal to one time unit. We also choose the capacities in the generated topologies so as to study networks which are not over-dimensioned. On the Internet, the largest domains with the largest capacities are at the top of the hierarchy. We suppose that the domains in the core are never perturbed. Thus, we choose to set an infinite capacity for all of them.
We arbitrarily fix the capacity for domains in the Tier-2, Tier-3, and Tier-4 to 35, 28 and 14 TU, respectively. The first performance measure is the number of perturbed nodes because it is also the optimisation criterion for our algorithm. The second one is the number of perturbed paths and the third one is the amount of perturbed traffic. Both are strongly influenced by the first one. We are also interested in particular performance measures inherent to our distributed algorithm. We count the number of messages sent and the number of network state changes where a network state is a vector containing current congestion states of all network nodes.
5
Results
First, we compare the number of perturbed nodes obtained for one simulation run with our distributed algorithm and with BGP (Fig. 1). The volume of transiting traffic is heavy enough (α = 2.0) to saturate about 10 percent of the network nodes with the BGP approach. Observe that the number of perturbed nodes increases quickly and reaches eight nodes when the traffic matrix is filled up. The number of perturbed nodes then oscillates around this value. Our distributed
[Figures 1–4: plots not reproduced. Panel titles: "Simulation results for 100 nodes network and a=2.0" and "Behavior of distributed algorithm depending on traffic intensity". Curves: Dis, BGP. Horizontal axes: Time, alpha. Vertical axes: Nb of perturbed nodes, Nb of sent messages, Nb of perturbed paths, Perturbed flow [TU], Average number of perturbed nodes.]
Fig. 1. The number of perturbed nodes for our algorithm (one simulation run, α = 2.0)
Fig. 2. Number of perturbed paths and amount of perturbed traffic for our algorithm (one simulation run, α = 2.0)
Fig. 3. The number of perturbed nodes and paths averaged for a simulation run series, α = 2.0
Fig. 4. The number of perturbed nodes depending on the traffic load averaged for series of simulation runs
algorithm reduces the number of perturbed nodes by a factor of about five. The initial heavy perturbation of six nodes is quickly reduced. Notice also that the shapes of these two diagrams are very similar. The number of alerts sent depends not only on the number of perturbed nodes but also on the layers in which these nodes are located. The appearance of four more perturbed nodes can cause the generation of the same number of alert messages as the appearance of one perturbed node, but only if these nodes are situated far from the core. Two other performance measures, the number of perturbed paths and the quantity of perturbed traffic, taken for the same network during the same simulation run, are depicted in Fig. 2. The results for our algorithm are significantly better than for BGP although our method does not optimize them directly. To verify the functioning of our method and that of BGP we performed a series of simulation runs for the same network and traffic load. The averages of the numbers of perturbed paths and nodes, presented in Fig. 3, show that our algorithm works on average five times better than BGP for the given network. The influence of the traffic load on the performance of our algorithm and BGP is presented in Fig. 4. We used averaged series of simulation runs. Each series had a different load, which varied from a light traffic load (α = 1.5) to a saturating
[Figures 5–6: plots not reproduced. Panel titles: "Cumulative number of sent messages depending on alpha" and "Convergence of the distributed algorithm. alpha = 2.0". Curves in Fig. 5: a = 2.0, a = 1.5, a = 1.0. Curves in Fig. 6: BGP, BGP + Distributed, Centralized, Centralized + Distributed, Minimum bound. Axes: Number of sent messages (logarithmic scale) versus Time; Number of perturbed nodes versus Number of steps.]
Fig. 5. The cumulative number of sent messages for one simulation run depending on the traffic load
Fig. 6. The number of perturbed nodes averaged for series of initial instances, α = 2.0
load (α = 2.5). We observe that the saturating traffic α = 2.5 is critical for the network because almost half of the network nodes are perturbed. In this case the performance of our algorithm approaches that of BGP. The additional cost of our algorithm in relation to BGP is expressed in the number of transmitted alert messages. Their values, for one simulation and for varying traffic loads, are plotted cumulatively in Fig. 5. For any traffic load, the number of messages sent increases rapidly during the period in which the traffic matrix is filled up. Next, the number of sent messages increases at a much slower rate. The increase in the number of sent messages is more significant for heavier traffic loads because more nodes become perturbed. As message transmission causes a change of node states, the number of network state changes follows the same pattern as the number of alert messages. The methodology applied to the comparison of our distributed algorithm and BGP cannot be used for a comparison of our distributed and centralized algorithms. Firstly, the computations performed by the theoretical centralized algorithm for each change in the traffic matrix are unacceptably time-consuming. Secondly, the variation of the values of the elements of the traffic matrix one by one (Subsection VI.B) could be seen as too advantageous for our distributed algorithm. To evaluate our distributed algorithm, while retaining BGP as a practical reference, we start from an instance of the problem. It is selected as a typical network state together with the traffic matrix from the simulation runs which have been discussed before. Starting from this point the traffic matrix is fixed and we activate our distributed mechanism with routing matrices produced with BGP and with our centralized algorithm. We observe the rate at which it reduces the number of perturbed nodes. We compare these results with the number of perturbed nodes when we use BGP and the centralized algorithm alone.
We compute a lower bound which is the number of nodes perturbed by traffic whatever the routing matrix is. For any routing matrix, a node i is perturbed if the sum of traffic sent by i and sent to i is greater than its capacity. We do not consider traffic transiting through it. This is a lower bound for our problem.
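The lower bound described above can be computed as in the following sketch. The function name and the dictionary representation are assumptions made for illustration; intra-domain traffic Ti,i is counted once here, which is also an assumption, since the text does not say how it is accounted for.

```python
def lower_bound(traffic, capacity):
    """Count nodes whose emitted plus received traffic alone exceeds capacity."""
    own = {n: 0.0 for n in capacity}
    for (src, dst), volume in traffic.items():
        own[src] += volume
        if dst != src:  # count intra-domain traffic only once (assumption)
            own[dst] += volume
    return sum(1 for n in capacity if own[n] > capacity[n])

# Hypothetical example: node a sends 6 TU and receives 5 TU with capacity 10.
traffic = {("a", "b"): 6.0, ("c", "a"): 5.0, ("b", "c"): 1.0}
capacity = {"a": 10.0, "b": 10.0, "c": 10.0}
bound = lower_bound(traffic, capacity)
```

Because transit traffic is ignored, no routing matrix can yield fewer perturbed nodes than this count.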
Fig. 6 presents the results of our comparison averaged for a series of initial instances with heavy traffic: α = 2.0. We observe that the distributed algorithm optimizes routing matrices produced with both BGP and our centralized algorithm. The perturbation in the network is caused by the routing. The lower bound indicates that only 0.6% of nodes are perturbed whatever the routing is. This bound is reached by the combination of our centralized algorithm and the distributed one. This means that the bound is not only a lower bound but the optimum. With routing provided by BGP and by our centralized algorithm, 9.6% and 4.3% of nodes are perturbed, respectively. This means that about 9.6 − 0.6 = 9.0% and 4.3 − 0.6 = 3.7% of nodes are perturbed only because of the transit traffic due to the BGP routing and the routing given by the centralized algorithm, respectively. The combination of BGP and our distributed algorithm is very efficient because the average percentage of perturbed nodes stabilizes around one percent. This is a better result than that provided by the centralized algorithm alone and only about twice the optimum. This figure shows that the distributed algorithm efficiently reduces the number of nodes perturbed only by transit traffic. Moreover, this number is close to the optimal number when the initial routing matrix is provided by the centralized algorithm. The centralized algorithm provides solutions which are good initial solutions for the distributed algorithm, but it is not an eligible solution in practice.
6
Summary and Further Studies
In this paper we have proposed a BGP-independent distributed mechanism in inter-domain networks to find non-congested routes, which provides new possibilities missing from state-of-the-art inter-domain traffic engineering. We performed exhaustive simulation experiments to evaluate the efficiency of our distributed algorithm. The obtained results are encouraging and we are convinced that the proposed distributed algorithm is worth further study. First, we would like to verify the performance of our distributed algorithm for an inter-domain network model with the size and the hierarchy of the Internet. An open issue concerning the Internet hierarchy is notably the presence of sibling-to-sibling relationships, which we have not investigated yet. Second, our model assumes that the entire domain can be in only one of two states. An alternative is to consider that a domain could be partially congested. In this case, alert messages could be sent to a subset of client domains instead of all the client domains. This modification of our algorithm would allow us to reduce the number of sent messages and routing table alterations. Third, we started our work from the hypothesis that information concerning congestion in the network allows an operator to route priority traffic on congestion-free paths. Notice that this is not the only way to take advantage of the results of our distributed algorithm. An operator might prefer to place priority traffic on the best routes at his disposal despite their congestion state. In such a case, the operator redirects non-priority traffic onto non-congested paths while saving the capacity of the best routes for privileged traffic. The operator's
decision may depend on route lengths and we would like to estimate them. The simulation studies presented in this paper consider inter-domain networks with traffic entirely seen as priority traffic. Fourth, another issue which we see as a subject of further studies is the performance of our distributed algorithm in the case of its partial deployment in an inter-domain network. Our preliminary opinion is that it is naturally adapted to incremental deployment. Finally, certain problems appear in our distributed algorithm. It would be interesting to see how to choose the lengths of the delays before alert messages are processed depending on the characteristics of a given inter-domain network. We think that a more careful choice of these values may reduce momentary network state flickering.
References
1. Weisser, M.A., Tomasik, J.: A distributed algorithm for inter-domain resources provisioning. In: Proc. of EuroNGI, April 2006, pp. 9–16 (2006)
2. Farrel, A., Vasseur, J.-P., Ash, J.: RFC 4655 - A Path Computation Element (PCE)-Based Architecture. Technical report (August 2006)
3. Ge, Z., Figueiredo, D.R., Jaiswal, S., Gao, L.: The hierarchical structure of the logical Internet graph. In: Proc. of SPIE ITCOM 2001, July 2001, pp. 208–222 (2001)
4. Subramanian, L., Agarwal, S., Rexford, J., Katz, R.H.: Characterizing the Internet hierarchy from multiple vantage points. In: Proc. of IEEE INFOCOM 2002, June 2002, vol. 2, pp. 618–627 (2002)
5. Gao, L.: On inferring autonomous system relationships in the Internet. IEEE/ACM Transactions on Networking 9(6), 733–745 (2001)
6. Battista, G.D., Erlebach, T., Hall, A., Patrignani, M., Pizzonia, M., Schank, T.: Computing the types of the relationships between autonomous systems. IEEE/ACM Transactions on Networking 15(2), 267–280 (2007)
7. Weisser, M., Tomasik, J.: Innapproximation proofs for directed Steiner tree and network problems. Technical report, PRiSM UVSQ (June 2007)
8. Cristallo, G., Jacquenet, C.: An Approach to Inter-domain Traffic Engineering. In: Proc. of WTC 2002 (September 2002)
9. Xiao, L., Lui, K.S., Wang, J., Nahrstedt, K.: QoS extension to BGP. In: Proc. of ICNP 2002, November 2002, pp. 100–109 (2002)
10. Okumu, I., Mantar, H., Hwang, J., Chapin, S.: Inter-domain qos routing on diffserv networks: a region-based approach. Comp. Comm. 28(2), 174–188 (2005)
11. Subramanian, L., Caesar, M., Ee, C.T., Handley, M., Mao, M., Shenker, S., Stoica, I.: HLP: a next generation inter-domain routing protocol. In: Proc. of SIGCOMM 2005, August 2005, pp. 13–24. ACM Press, New York (2005)
12. Raghavachari, B., Veerasamy, J.: Approximation algorithms for mixed postman problem. SIAM J. on Discrete Mathematics 12(4), 425–433 (1999)
13. Gao, L., Griffin, T.G., Rexford, J.: Inherently safe backup routing with BGP. In: Proc. of INFOCOM 2001, vol. 1, pp. 547–556 (2001)
14. Rekhter, Y., Li, T., Hares, S.: RFC 4271 - A Border Gateway Protocol 4 (BGP-4). Technical report (January 2006)
15. Weisser, M.A., Tomasik, J.: Automatic induction of inter-domain hierarchy in randomly generated network topologies. In: Proc. of CNS 2007, pp. 77–84 (2007)
16. Medina, A., Lakhina, A., Matta, I., Byers, J.: BRITE: An approach to universal topology generation. In: Proc. of MASCOT 2001, August 2001, pp. 346–354 (2001)
A Novel Bandwidth Broker Architecture Based on Topology Aggregation in Delay|Bandwidth Sensitive Networks
Walid Htira1, Olivier Dugeon1, and Michel Diaz2
1 Orange Labs, 2, avenue pierre marzin, 22300-France {walid.htira,olivier.dugeon}@orange-ftgroup.com
2 Universite de Toulouse, 7, avenue du Colonel Roche, 31077-France [email protected]
Abstract. Either in flow-based or class-based QoS architectures, controlling the admission of traffic entering the network becomes a crucial task in the new telecom services. The Bandwidth Broker (BB) architecture is one of the efficient admission control solutions. In this paper, we present a novel bandwidth broker architecture for scalable support of guaranteed services based on the concept of topology aggregation. Indeed, the bandwidth broker does not manage all the detailed information about topology and reservation stored in the admission control database. Instead, it uses an aggregated representation in order to decide if a new service request can be accepted or rejected. We demonstrate that the reduction of the amount of information managed by the bandwidth broker enhances the performance of the admission control function, thereby increasing the overall call processing capability.
Keywords: Admission control, bandwidth broker, topology aggregation, service guarantee.
1
Introduction
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 482–493, 2008. © IFIP International Federation for Information Processing 2008
Ensuring minimum quality-of-service (QoS) levels for the new emerging telecom services is an important challenge for future packet networks. Toward this end, a number of Call Admission Control (CAC) approaches have been devised to reserve network resources in order to ensure that users' QoS objectives can be satisfied. The Resource reservation protocol RSVP [1] is one of these approaches. It consists in establishing and maintaining the resource reservation in an IntServ [2] network. Although this signalling protocol is very strong in providing QoS support, it is not scalable, since it is necessary to maintain a flow state in each router along the flow’s path. Contrary to RSVP, in the DPS (Dynamic Packet State) technique [3], the flow state information is inserted into packet headers, which overcomes the need for per-flow signaling and state management. The key problem encountered with this approach is the necessity that all routers
in the flow's path participate in the admission control process, implement the same scheduling mechanism and update packet headers. To reduce the overhead generated by the two aforementioned approaches, other measurement-based solutions have been proposed. One is the egress admission control [4], which is based on passive monitoring of the network. Although this mechanism provides an efficient service model and scalability, it requires specific functionalities in the core and edge routers, like insertion of packet state in the headers, a special scheduler and rate monitoring. To avoid the use of a signaling protocol and special packet processing in core nodes, a call admission control mechanism based on probing [5] was proposed, where a test flow is inserted into the network to measure its congestion level. Instead of having a distributed admission control, a Bandwidth Broker (BB) was proposed [6] within the DiffServ [7] model in order to concentrate the admission control functions in only one element. This entity maintains an admission control database with information about network topology and the QoS state of each path and node. It also maintains a routing table allowing it to determine the path that the admitted flow will traverse towards the receiver. So, when a new flow with specific traffic parameters requests admission, the bandwidth broker calculates the bandwidth availability in each link of the path followed by this new flow to decide if it can be admitted or not. If the flow is admitted, the BB sends a message to the sender with a positive answer to the flow’s request, and updates its database. Thus, using this approach, the call admission control process is simplified. There is no need to store the QoS reservation states in the core routers. In addition, the BB offers a global view of the network utilization, which allows it to perform the admission control function with stronger guarantees.
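The per-link admission test performed by a classical bandwidth broker can be sketched as follows. This is a minimal illustration under assumptions of ours: the class and method names are hypothetical, the database is a dictionary of available bandwidth per directed link, and only bandwidth is checked.

```python
class BandwidthBroker:
    """Toy centralized admission control over a per-link bandwidth database."""

    def __init__(self, link_capacity, routes):
        self.available = dict(link_capacity)  # admission control database
        self.routes = routes                  # routing table: (src, dst) -> nodes

    def admit(self, src, dst, bandwidth):
        """Admit the flow only if every link of its path has enough bandwidth."""
        path = self.routes[(src, dst)]
        links = list(zip(path, path[1:]))
        if any(self.available[link] < bandwidth for link in links):
            return False
        for link in links:
            self.available[link] -= bandwidth  # record the reservation
        return True

# Hypothetical three-router domain.
bb = BandwidthBroker({("A", "B"): 10, ("B", "C"): 4},
                     {("A", "C"): ["A", "B", "C"]})
first = bb.admit("A", "C", 3)
second = bb.admit("A", "C", 2)
```

The second request is rejected because link B to C has only 1 unit left; the cost of each decision grows with the number of links on the path, which is the scalability concern discussed next.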
However, a key difficulty encountered with such an approach is the large amount of information to manage, which leads to scalability issues in large networks. The more detailed the network is, the more time the admission control algorithm needs to compute the admission decision. There are two major solutions to keep this amount of information at a reasonable volume. The first one is to reduce the size of the network by splitting it into PoPs or areas. The major problem with this solution is how to design the coordination between BBs without introducing much overhead in the network. The second solution is based on information aggregation. In this case, the main objective is to aggregate the information sufficiently to reduce the amount of data significantly without losing too much detail, which could lead the admission control algorithm to erroneous decisions. In this paper, we design a novel model of bandwidth broker based on topology aggregation. Admission control decisions are made based on an abstract structure of the network topology formed only by border routers. There is no need to store information about the core routers in the topology database. Our goal is to develop an architecture for admission control that can simultaneously achieve scalability and a strong service guarantee without sacrificing network utilization. The detailed information about the interactions between the CAC server and the network domain routers is described in [8].
The remainder of the paper is organized as follows. Section 2 investigates the most important aggregation models presented in the literature. Then, we propose in Section 3 our topology aggregation based BB model. Simulation results are analyzed in Section 4 and the conclusion follows in Section 5.
2
Topology Aggregation Overview and Network Model
This section starts by introducing the topology aggregation concept and then exposes the network model. 2.1
Topology Aggregation Overview
The topology aggregation concept consists in summarizing the topology information of a network domain into a more compact structure. Numerous solutions were proposed in the literature [9] [10] [11] [12] [13], the most important ones being the full-mesh and the star schemes. The first one [14] uses a logical path between each pair of border nodes to construct a full-mesh aggregated topology. This representation is by far the most accurate one. Unfortunately, the amount of information to be advertised increases as the square of the number of borders. The second scheme [14], which is the star, consists in connecting border nodes via virtual paths to a virtual center called the "nucleus". In order to minimize distortion, it is recommended to define directed paths between borders called "bypasses". The number of logical paths in the star representation should not exceed three times the number of border nodes B. The complexity of this approach is in the order of B². The performance of these two schemes was compared in [15]. It appears that the star is more powerful than the full-mesh, since it significantly reduces the amount of topology information stored in the database. 2.2
Network Model
A large network is a set of domains and links that connect these domains. It is modeled as a directed graph G(V, E), where V is the set of nodes and E is the set of directed links among the nodes in V. In one domain, B is the set of border nodes that have links connected to nodes in other domains. The links in E are marked by a combination of additive and concave parameters. These parameters represent the level of QoS provided by each edge. Given a directed path P = (a, b, c, ..., x, y) with a, b, c, ..., x, y ∈ V, the value of these metrics over this path follows one of the following compositions: Definition 1. A metric mt is additive if: mt(P) = mt(a, b) + mt(b, c) + ... + mt(x, y). Delay, jitter and cost follow the additive composition rule. Definition 2. A metric mt is concave if: mt(P) = min(mt(a, b), mt(b, c), ..., mt(x, y)). The bandwidth parameter follows the concave composition rule.
In the rest of the paper, we study the case of delay|bandwidth sensitive networks. We use dlP to denote the delay of the logical path corresponding to P; by Definition 1, it is equal to the sum of all link delays along P. We also use bwP to denote the bandwidth of the logical path corresponding to P; by Definition 2, it is equal to the minimal link bandwidth along P.
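The two composition rules can be illustrated with a minimal sketch. The function name `path_metrics`, the node labels and the link values below are hypothetical and chosen only for illustration; they are not taken from Figure 1.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]
QoS = Tuple[float, float]  # (delay, bandwidth)


def path_metrics(path: List[str], links: Dict[Link, QoS]) -> QoS:
    """Return (dl_P, bw_P) for a directed path, given per-link (delay, bandwidth)."""
    hops = list(zip(path, path[1:]))
    if not hops:
        raise ValueError("a path needs at least two nodes")
    delay = sum(links[h][0] for h in hops)       # additive rule (Definition 1)
    bandwidth = min(links[h][1] for h in hops)   # concave rule (Definition 2)
    return delay, bandwidth


# Made-up link values, for illustration only.
example_links = {
    ("a", "b"): (5.0, 100.0),
    ("b", "c"): (7.0, 40.0),
    ("c", "d"): (3.0, 60.0),
}
print(path_metrics(["a", "b", "c", "d"], example_links))
```

The delay accumulates over the three hops while the bandwidth is limited by the narrowest link, here (b, c).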
3
The Topology Aggregation Based Bandwidth Broker Architecture
In order to resolve the scalability issue of the classical Bandwidth Broker design, this section presents a novel approach that aggregates the network topology in order to reduce the size of the call admission control database and thereby enhance the performance of the bandwidth broker. We propose two star designs, one basic and one service guarantee oriented. In such designs, admission control is done on a per-domain basis instead of on a hop-by-hop basis. Thus, the complexity of the admission control algorithms may be significantly reduced. For illustration purposes, we consider a simple AS with nine routers and twelve bidirectional and symmetric links. Each link is identified by a (delay, bandwidth) pair. Four nodes are selected as borders: R3, R5, R8 and R9 [Fig.1-a].
Fig. 1. The star aggregation process
3.1
The Basic Star Design
As there is no common way or recommendation to retrieve the star representation, we propose new equations to compute spokes and bypasses. The formation of the star begins by computing all the end to end network state information in the network domain. So, based on routing information, we compute logical paths between border routers referring to definition 1 and 2. Indeed, the QoS metrics of a logical path correspond to the minimal bandwidth and the sum of delays through the physical path. For example, in Figure 1-a, to
set a logical path between border routers R3 and R5, we compute the sum of delays and the minimal bandwidth through the route R3→R4→R6→R5, which gives a logical path with the metric (26,10). Once all the end-to-end logical paths are set, the star representation is established by splitting each logical path into two spokes. The nucleus plays the role of a relay which interconnects all the border routers through spokes. Our design follows three steps to find the QoS information associated with the basic star design: i) find the QoS parameters of the spokes from the border routers to the nucleus, ii) find the QoS parameters of the spokes from the nucleus to the border routers and iii) set the bypasses between border routers in order to reduce the distortion introduced by the aggregation process. The equations used to compute spokes and bypasses are described in detail in [16]. In order to distinguish the QoS parameters associated with end-to-end logical paths from those of spokes, we use the superscripts "t" and "s" respectively. So, given a set of border routers B and a nucleus n, the metrics $M^{s}_{Sn}$ and $M^{s}_{nD}$ denote respectively the QoS parameters of the ingoing spokes and the outgoing spokes.
$$\forall S \in B,\quad M^{s}_{Sn} = \left(\min(dl^{t}_{SD}),\ \max(bw^{t}_{SD})\right),\quad D \in B,\ D \neq S \qquad (1)$$
$$\forall D \in B,\quad M^{s}_{nD} = \left(\max(dl^{t}_{SD} - dl^{s}_{Sn}),\ \max(\min(bw^{t}_{SD},\ bw^{s}_{Sn}))\right),\quad S \in B,\ S \neq D \qquad (2)$$
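Equations (1) and (2) can be sketched as follows, assuming the end-to-end logical path metrics are already available as a dictionary keyed by (source, destination). Only the (26,10) metric between R3 and R5 comes from the text; the other values, the symmetry of the paths and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

QoS = Tuple[float, float]  # (delay, bandwidth)


def ingoing_spokes(t: Dict[Tuple[str, str], QoS], borders: List[str]) -> Dict[str, QoS]:
    """Equation (1): spoke S -> n for every border router S."""
    spokes = {}
    for s in borders:
        others = [d for d in borders if d != s]
        spokes[s] = (min(t[(s, d)][0] for d in others),
                     max(t[(s, d)][1] for d in others))
    return spokes


def outgoing_spokes(t: Dict[Tuple[str, str], QoS], borders: List[str],
                    s_in: Dict[str, QoS]) -> Dict[str, QoS]:
    """Equation (2): spoke n -> D for every border router D."""
    spokes = {}
    for d in borders:
        others = [s for s in borders if s != d]
        spokes[d] = (max(t[(s, d)][0] - s_in[s][0] for s in others),
                     max(min(t[(s, d)][1], s_in[s][1]) for s in others))
    return spokes


borders = ["R3", "R5", "R8"]
# (R3, R5) is the metric quoted in the text; the two other pairs are made up.
pairs = {("R3", "R5"): (26, 10), ("R3", "R8"): (18, 20), ("R5", "R8"): (30, 15)}
t = {}
for (a, b), qos in pairs.items():
    t[(a, b)] = qos
    t[(b, a)] = qos  # assumed symmetric logical paths

s_in = ingoing_spokes(t, borders)
s_out = outgoing_spokes(t, borders, s_in)
print(s_in)
print(s_out)
```

Joining an ingoing spoke with an outgoing spoke and comparing the result with the end-to-end metric in `t` gives the deviation used to decide where a bypass is needed.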
For bypasses, we set B bypasses. The selection of these bypasses is made by comparing the concatenation of the two spokes S→n and n→D with the end-to-end logical path S→D. When the deviation is large, we add a bypass between these two border routers. Figure 2-b shows the resulting basic star scheme. Applying admission control on this scheme shows that there is an over-estimation of the network capabilities [see Section 4]. This results from the fact that the aggregated topology looks like a black box, so there is no way to efficiently manage the bandwidth of the spokes. Figure 2 shows a typical example of path aggregation which leads to an over-estimation of the network capability. A reservation of 5Mb/s on the logical path XY exhausts all the available bandwidth on the physical link XW, while in the aggregated scheme the logical path XZ still appears to have 5Mb/s free, which is not true. Even if the over-estimation is small, it can cause service degradation or user dissatisfaction. For this reason, we consider that an acceptable over-estimation percentage must not be greater than 0.5% of the overall number of connection requests treated during a time interval Δt. In the next subsection, we present a novel way to form the star with a much stronger service guarantee.
Fig. 2. Example of aggregation model accuracy limitation
3.2
The Guarantee-Based Star Design
To be able to manage the bandwidth inside the aggregated topology, we have to fulfill all constraints coming from bandwidth utilization. Our analysis showed that, in order to limit the over-estimation of network resources, it is necessary to identify on each spoke the physical link which is the most exposed to congestion. But, as the basic star model is not defined directly from the real topology, it is not possible to find the set of physical links composing a spoke and so to infer the bottleneck link. So, our approach is to project the virtual nucleus onto the real topology in order to find the router which best matches the nucleus characteristics, i.e. i) to find in the real topology the real router that is the best potential nucleus node, ii) to derive from it the set of links and spokes, and iii) to use the resulting topology to perform the admission control for the given domain. The logic of our approach is the following. For each core router we compute an approximate star scheme where the nucleus corresponds to the chosen core router and the spokes correspond to the partial routes which interconnect the core router to its borders. We use H to denote the set of these partial routes. The QoS parameters of the partial routes are computed according to Definitions 1 and 2 and based on the routing protocol, which indicates the route between the core router and the borders. At the end of this step, we are able to compute for each core router a nucleus deviation nd which reflects the deviation between the basic star and the computed star schemes. Given a core router cr, this deviation is the sum of the deflections between the spokes connecting cr and the border routers in the calculated star scheme and the spokes connecting the virtual nucleus and the border routers in the basic star representation.
Once the nucleus deviation is computed for each core router, we choose the node which has the minimal nucleus deviation to be the representative point of the virtual nucleus [Algorithm.1]. The deflection between two spokes is calculated as the Euclidean distance between two points in the delay|bandwidth plane, since spokes are represented as (delay,bandwidth) pairs.
Algorithm.1: representative nucleus computation
C = V - B /*The set of core routers*/
dev(x,y) /*A function which computes the deviation between spokes x(dl,bw) and y(dl',bw') in the delay|bandwidth plane*/
iter = 1 /*iterations counter*/
For each cr in C
    nd = 0 /*nucleus deviation of cr*/
    For b in B
        nd = nd + dev(cr->b, n->b) + dev(b->n, b->cr)
    End For
    if(iter = 1){
        min_dev = nd
        computed_nucleus = cr
    }else{
        if (min_dev > nd){
            min_dev = nd
            computed_nucleus = cr
        }
    }
    iter ++;
End For
return computed_nucleus
end.
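A minimal runnable sketch of the representative nucleus computation is given below. The dictionary layout (`partial`, `basic_in`, `basic_out`), the router names and all numeric values are hypothetical; resetting the deviation for each candidate and returning the selected router are assumptions about the intended behaviour of Algorithm.1.

```python
import math
from typing import Dict, List, Tuple

QoS = Tuple[float, float]  # (delay, bandwidth)


def dev(x: QoS, y: QoS) -> float:
    """Euclidean distance between two spokes in the delay|bandwidth plane."""
    return math.hypot(x[0] - y[0], x[1] - y[1])


def representative_nucleus(core_routers: List[str], borders: List[str],
                           partial: Dict[Tuple[str, str], QoS],
                           basic_in: Dict[str, QoS],
                           basic_out: Dict[str, QoS]) -> Tuple[str, float]:
    """Return the core router with the minimal nucleus deviation, and that deviation.

    partial[(u, v)] : metric of the routed partial path u -> v
    basic_in[b]     : basic star spoke b -> n
    basic_out[b]    : basic star spoke n -> b
    """
    best, best_nd = None, math.inf
    for cr in core_routers:
        nd = 0.0  # deviation is accumulated per candidate router
        for b in borders:
            nd += dev(partial[(cr, b)], basic_out[b]) + dev(basic_in[b], partial[(b, cr)])
        if nd < best_nd:
            best, best_nd = cr, nd
    return best, best_nd


# Made-up example with two border routers and two candidate core routers.
borders = ["B1", "B2"]
basic_in = {"B1": (10, 20), "B2": (12, 15)}
basic_out = {"B1": (3, 20), "B2": (4, 15)}
partial = {
    ("C1", "B1"): (3, 20), ("B1", "C1"): (10, 20),
    ("C1", "B2"): (4, 18), ("B2", "C1"): (12, 15),
    ("C2", "B1"): (6, 24), ("B1", "C2"): (10, 20),
    ("C2", "B2"): (4, 15), ("B2", "C2"): (12, 15),
}
print(representative_nucleus(["C1", "C2"], borders, partial, basic_in, basic_out))
```

In this example C1 deviates from the basic star on a single spoke only, so it is selected over C2.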
Now, once a router is selected to be the representative of the nucleus, it is possible to find each bottleneck link between the computed nucleus and the border routers. This bottleneck link is defined as follows:
Definition 3. A bottleneck link BL of a partial route rt corresponds to the physical link having the minimal bandwidth value and crossed by the greatest number of partial routes. We denote by $O_{rt}$ the set of partial routes crossing rt in the bottleneck link BL [Algorithm.2].
Algorithm.2: BL computation for a partial route rt
L(rt) /*denotes the set of links in rt*/
Ml(rt) /*denotes the set of links belonging to L(rt) and having the minimal bandwidth*/
For each link l in Ml(rt)
    N(l) = the number of partial routes crossing l
    F(l) = (bandwidth of link l)/N(l)
End For
BL = link l which has the minimal F(l).
end.
Thus, in the admission control process, it is necessary to check an additional condition in order not to over-estimate the network capabilities. This condition guarantees that the sum of the reservations on all the partial routes which cross the BL is less than or equal to the bandwidth capacity of the BL. So, given a network domain Ω, the bandwidth sharing constraint to check for each partial route is the following:
$$\forall rt \in H,\quad \Big(\sum_{rt' \in O_{rt}} bw^{rsv}_{rt'}\Big) + bw^{rsv}_{rt} \leq bw_{BL(rt)} \qquad (3)$$
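The bottleneck link computation and the sharing constraint (3) can be sketched as below. The example mirrors the situation described for Figure 2 (routes XY and XZ sharing the physical link XW), but the intermediate node W, the link capacities other than 5 Mb/s and the function names are assumptions. The set of routes other than rt that cross the BL is used here as O_rt.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def bottleneck_link(rt: str, routes: Dict[str, List[Link]],
                    bandwidth: Dict[Link, float]) -> Link:
    """Algorithm.2: among the minimal-bandwidth links of rt, pick the most shared one."""
    links = routes[rt]                                          # L(rt)
    min_bw = min(bandwidth[l] for l in links)
    candidates = [l for l in links if bandwidth[l] == min_bw]   # Ml(rt)

    def share(l: Link) -> float:
        n = sum(1 for r in routes.values() if l in r)           # N(l)
        return bandwidth[l] / n                                 # F(l)

    return min(candidates, key=share)


def sharing_constraint_holds(rt: str, routes: Dict[str, List[Link]],
                             bandwidth: Dict[Link, float],
                             reserved: Dict[str, float]) -> bool:
    """Constraint (3): reservations crossing BL(rt) must fit in its capacity."""
    bl = bottleneck_link(rt, routes, bandwidth)
    others = [r for r in routes if r != rt and bl in routes[r]]  # O_rt (assumed)
    return sum(reserved[r] for r in others) + reserved[rt] <= bandwidth[bl]


# Two routes sharing the physical link X-W, whose capacity is 5 Mb/s.
routes = {"XY": [("X", "W"), ("W", "Y")], "XZ": [("X", "W"), ("W", "Z")]}
bandwidth = {("X", "W"): 5, ("W", "Y"): 10, ("W", "Z"): 10}

print(bottleneck_link("XY", routes, bandwidth))
print(sharing_constraint_holds("XY", routes, bandwidth, {"XY": 5, "XZ": 0}))
print(sharing_constraint_holds("XZ", routes, bandwidth, {"XY": 5, "XZ": 1}))
```

Once XY has reserved 5 Mb/s, any further reservation on XZ violates the constraint, which is the case the basic star design cannot detect.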
Note that the BL is determined for bypasses in the same manner as for spokes, except that complete routes are used instead of partial routes.
3.3
Admission Control Process
Suppose that the bandwidth broker receives a service request Rq from a router S to reach another router D in the same network domain. In our design, the
BB has to verify whether there is a direct bypass between these two routers in the database. If a bypass is found, three conditions have to be checked: (4) and (5) verify respectively the admissibility of Rq in terms of bandwidth and delay, and (6) checks the bottleneck link constraint.
$$Rq(bw_{S \to D}) + bw^{rsv}_{S \to D} \leq bw_{S \to D} \qquad (4)$$
$$Rq(dl_{S \to D}) > dl^{m}_{S \to D} \qquad (5)$$
$$Rq(bw_{S \to D}) + bw^{rsv}_{S \to D} + \Big(\sum_{rt \in O_{S \to D}} bw^{rsv}_{rt}\Big) \leq bw_{BL(S \to D)} \qquad (6)$$
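In the case where there is no bypass, five conditions have to be checked: (7) and (8) verify the availability of bandwidth in the two spokes, (9) checks the admissibility of Rq in terms of delay, and (10) and (11) verify the BL constraints on each spoke.
$$Rq(bw_{S \to D}) + bw^{rsv}_{S \to n} \leq bw_{S \to n} \qquad (7)$$
$$Rq(bw_{S \to D}) + bw^{rsv}_{n \to D} \leq bw_{n \to D} \qquad (8)$$
$$Rq(dl_{S \to D}) \geq dl_{S \to n} + dl_{n \to D} \qquad (9)$$
$$Rq(bw_{S \to D}) + bw^{rsv}_{S \to n} + \Big(\sum_{rt \in O_{S \to n}} bw^{rsv}_{rt}\Big) \leq bw_{BL(S \to n)} \qquad (10)$$
$$Rq(bw_{S \to D}) + bw^{rsv}_{n \to D} + \Big(\sum_{rt \in O_{n \to D}} bw^{rsv}_{rt}\Big) \leq bw_{BL(n \to D)} \qquad (11)$$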
Note that the admission control process is simpler in the case of the basic star design because there is no need to check the BL constraints (6), (10) and (11). This concept of topology aggregation based CAC can be extended to multiple Classes of Service (CoSs): i) by splitting the delay|bandwidth aggregated topology proportionally to the bandwidth requirements of each CoS, ii) by computing the delay|bandwidth aggregated scheme for each CoS (using the bandwidth allocated to the CoS instead of the physical bandwidth of the links).
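The admission checks of this subsection can be sketched as follows. The `LogicalPath` record, its field names and the numeric values are hypothetical; a `bl_capacity` of `None` stands for the basic star design, where the BL conditions are skipped. The strict inequality on the bypass delay follows condition (5) as printed, and the non-strict one follows condition (9).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogicalPath:
    delay: float
    bandwidth: float
    reserved: float = 0.0                 # bandwidth already reserved on this path
    bl_capacity: Optional[float] = None   # capacity of the BL; None in the basic star
    bl_other_reserved: float = 0.0        # reservations of the other routes crossing the BL


def fits(path: LogicalPath, rq_bw: float) -> bool:
    """Bandwidth conditions (4), (7), (8), plus BL conditions (6), (10), (11) if enabled."""
    if rq_bw + path.reserved > path.bandwidth:
        return False
    if path.bl_capacity is not None:
        if rq_bw + path.reserved + path.bl_other_reserved > path.bl_capacity:
            return False
    return True


def admit(rq_bw: float, rq_delay: float,
          bypass: Optional[LogicalPath] = None,
          spoke_in: Optional[LogicalPath] = None,
          spoke_out: Optional[LogicalPath] = None) -> bool:
    """One pass over the bypass if it exists, otherwise two passes over the spokes."""
    if bypass is not None:
        return rq_delay > bypass.delay and fits(bypass, rq_bw)          # (5)
    if spoke_in is None or spoke_out is None:
        raise ValueError("either a bypass or both spokes are required")
    return (rq_delay >= spoke_in.delay + spoke_out.delay                 # (9)
            and fits(spoke_in, rq_bw) and fits(spoke_out, rq_bw))


guaranteed = LogicalPath(delay=26, bandwidth=10, reserved=4,
                         bl_capacity=10, bl_other_reserved=3)
basic = LogicalPath(delay=26, bandwidth=10, reserved=4)

print(admit(4, 30, bypass=guaranteed))  # rejected by the BL condition
print(admit(4, 30, bypass=basic))       # accepted by the basic star
```

The same request is accepted by the basic star and rejected by the guarantee-based star, which illustrates where the over-estimation of the basic design comes from.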
4
Simulation Results
This section presents simulations that quantify the performance of our topology aggregation based bandwidth broker model compared to a traditional bandwidth broker architecture.
4.1
Network Model
We use the topologies of four well-known commercial backbones (AT&T, Abovenet, Exodus and Sprintlink) for our experiments [17] [18]. Each experiment consists of two phases, the aggregated scheme formation and the admission control phase. In each backbone domain, the admission control phase consists in soliciting the bandwidth broker with 1,000,000 connection requests generated uniformly by the border routers, with bandwidth requirements of 128Kb-5Mb. A shortest path algorithm was used to determine the routes in the network domain.
Table 1. Experimental topologies used for simulation
AS Name             Sprintlink   Exodus   Abovenet   AT&T
Backbone Routers    465          211      244        729
AS border Routers   56           32       39         46
4.2
Performance Measurement
As mentioned earlier, our experiments consist of two phases, namely the aggregated scheme formation phase and the admission control phase. For the first phase, simulations show that the basic star design needs less time to be computed than the guarantee-based star. This is expected, because the basic star does not need to retrieve the representative nucleus or to compute the corresponding BLs. The aggregation time τ of the basic star as well as of the guarantee-based star model is in the order of seconds [Table.2]. We can also remark that τ is in the order of the convergence time of a routing protocol like OSPF. So, even in case of re-routing, the reformation time of the star is still acceptable and does not delay the admission control process [see Table.2].
Table 2. Evaluation of the star formation time within the two star designs
Eval parameter   BB design              Sprintlink   Exodus   Abovenet   AT&T
τ (seconds)      Basic star             52           6        11         36
τ (seconds)      Guarantee-based star   73           9        16         70
The second phase, which concerns the admission control process, shows the strength of our design in reducing the amount of information stored in the database and so speeding up the admission control decision. Compared to a traditional bandwidth broker model, our architecture significantly reduces the size of the database by using a simple summarized topology structure. So, admission control decisions, resource reservation and resource release are made in two passes by checking the state of the spokes, or in one pass by checking the availability of the bypass resources. Simulation results show that the database of the guarantee-based star is slightly bigger than that of the basic star. This results from the fact that the guarantee-based model stores the BL information in addition to the end-to-end topology representation. Table.3 shows the size β of our BB database as a percentage of the total size needed by the traditional BB model to store the information about routing, reservations and topology. Although the database of the guarantee-based star design is slightly bigger than that of the basic star model, our two designs greatly outperform the traditional bandwidth broker architecture. The database is at least 43 times smaller in the case of the basic star design and at least 35 times smaller in the case of the guarantee-based star design. Such a remarkable reduction improves the admission decision time. There is no need to explore the whole database and the routing table in order to find the path followed by the incoming flow. Only an end-to-end check is needed to decide if the incoming call can be accepted or rejected. The admission control decision time γ is the average time taken by the BB to make a decision for a new connection. Compared to the traditional BB architecture, our two designs reduce γ by a factor of at least 3 [see Table.3]. The experiments also show that the decision time in the guarantee-based star model is slightly larger than that of the basic star model; the difference is in the order of tens of microseconds (μs). This results from the additional BL conditions that the guarantee-based model has to check. This speed in making admission control decisions allows our two models to increase their call processing capability and so to enhance the BB availability.
Table 3. The admission control performance evaluation
Eval parameter   BB design              Sprintlink   Exodus   Abovenet   AT&T
β                Basic star             1.70%        2.29%    2.19%      1.68%
β                Guarantee-based star   2.19%        2.85%    2.79%      2.28%
γ (ms)           Traditional BB         3.393        2.664    2.660      3.012
γ (ms)           Basic star             0.390        0.319    0.609      0.465
γ (ms)           Guarantee-based star   0.420        0.450    0.700      0.523
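The reduction factors quoted for Table 3 can be re-derived from the table values with a short sketch. The helper names are hypothetical; the figures are those of Table 3, in the order Sprintlink, Exodus, Abovenet, AT&T.

```python
# Values from Table 3 (Sprintlink, Exodus, Abovenet, AT&T).
beta = {
    "Basic star": [1.70, 2.29, 2.19, 1.68],            # % of traditional BB database size
    "Guarantee-based star": [2.19, 2.85, 2.79, 2.28],
}
gamma_ms = {
    "Traditional BB": [3.393, 2.664, 2.660, 3.012],    # decision time in ms
    "Basic star": [0.390, 0.319, 0.609, 0.465],
    "Guarantee-based star": [0.420, 0.450, 0.700, 0.523],
}


def min_size_reduction(percentages):
    """Worst-case factor by which the database shrinks (100% / largest beta)."""
    return 100.0 / max(percentages)


def min_speedup(baseline, design):
    """Worst-case ratio between the traditional and the aggregated decision time."""
    return min(b / d for b, d in zip(baseline, design))


for name in ("Basic star", "Guarantee-based star"):
    print(name,
          round(min_size_reduction(beta[name]), 1),
          round(min_speedup(gamma_ms["Traditional BB"], gamma_ms[name]), 1))
```

The worst case for both ratios is reached on the topologies with the largest β (Exodus) and the smallest decision time gap (Abovenet).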
These two evaluation parameters show the strength of our aggregated bandwidth broker architecture compared to the traditional design in terms of complexity reduction, but they do not establish its efficiency and accuracy with respect to the real network topology parameters. We have therefore conducted experiments to evaluate the relevance of our BB both when it is based on a basic star and when it is based on a guarantee-based star. For this, we introduce the exactness indicator ε, which expresses the accuracy of our approach in accepting or rejecting new requests in accordance with the real network capabilities. This indicator denotes the percentage of right decisions taken by the aggregated design. ε was computed for the two star designs. It appears that the guarantee-based star model is more accurate than the basic star, as the number of right admission control decisions is higher. In the two star designs, ε is always higher than 88%. By focusing on the erroneous decisions taken by the bandwidth broker, we wanted to analyze the over-estimation introduced by the aggregation process. For this, we defined the over-estimation percentage σ, which represents the percentage of connection requests admitted in excess although the capabilities of the network domain do not allow it. Simulation results show the utility of defining BLs in reducing the over-estimation percentage. Compared to the basic star design, the guarantee-based star has an over-estimation percentage at least 9 times lower. Its σ is at most 0.1238%, which is a very good guarantee performance result and satisfies the 0.5% bound suggested in Section 3.1 [Table.4].
Table 4. The topology based BB accuracy evaluation
Eval parameter   BB design              Sprintlink   Exodus     Abovenet   AT&T
ε                Basic star             88.3905%     96.4774%   93.8795%   92.2024%
ε                Guarantee-based star   88.5064%     96.4998%   94.2668%   92.5165%
σ                Basic star             0.7031%      0.4001%    0.6423%    1.1834%
σ                Guarantee-based star   0.0486%      0.0442%    0.0287%    0.1238%
The aforementioned simulation results lead to several conclusions: i) our bandwidth broker architecture, whether basic or guarantee-based, largely outperforms the traditional BB design in terms of reduction of the admission control algorithm complexity, ii) the guarantee-based star design achieves very good results in terms of accuracy and reduction of the over-estimation percentage, which makes it a promising option for Telecom operators that want to deliver guaranteed and fast services, iii) the basic star design offers very good scalability performance with a slight service degradation.
5
Conclusion
A broad view of admission control approaches shows a tendency to adopt bandwidth broker based solutions, thanks to the service guarantee which they offer. This paper presents a novel bandwidth broker architecture for scalable support of guaranteed services based on the concept of topology aggregation. More specifically, the BB does not check all the detailed topology and reservation information. Instead, it uses a summarized picture of the network capabilities in order to perform admission control. Simulations were conducted with four well-known commercial backbone topologies. The experiments show that our architecture, whether guarantee-based or basic, significantly reduces the size of the admission control database as well as the admission control decision time. Compared to the basic star design, the guarantee-based model achieves very good performance in terms of accuracy and reduction of resource over-estimation. Our future work will focus on how to retrieve the representative nucleus without examining all the core routers, in order to further decrease the formation time of the guarantee-based star.
Acknowledgment This work at Orange Labs, is registered under French patent n◦ 07 07043.
References
1. Zhang, L., et al.: A New Resource ReSerVation Protocol. IEEE Network 7, 8–18 (1993)
2. Braden, R., et al.: Integrated Services in the Internet Architecture. Internet RFC 1633 (1994)
3. Stoica, I., Zhang, H.: Providing Guaranteed Services without per Flow Management. In: Proceedings of ACM SIGCOMM 1999, Cambridge, MA (August 1999)
4. Centinkaya, C., Knightly, E.: Scalable Services via Egress Admission Control. In: Proceedings of IEEE INFOCOM 2000, Tel Aviv, Israel (March 2000)
5. Qiu, J., Knightly, E.: Inter-class Resource Sharing using Statistical Service Envelopes. In: Proceedings of IEEE INFOCOM 1999, New York, NY (March 1999)
6. Zhang, Z., et al.: Decoupling QoS Control from Core Routers: a Novel Bandwidth Broker Architecture for Scalable Support of Guaranteed Services. In: Proceedings of ACM SIGCOMM 2000, Stockholm, Sweden (August 2000)
7. Blake, S., et al.: An Architecture for Differentiated Services. Internet RFC 2475 (1998)
8. Htira, W., Dugeon, O., Diaz, M.: STAMP: Towards A Scalable Topology Announcement and Management Protocol. In: IEEE AINA 2008, GinoWan, Okinawa, Japan (March 2008)
9. Awerbuch, B., Shavitt, Y.: Topology Aggregation for Directed Graphs. In: ISCC 1998, Athens, Greece (June 1998)
10. Iwata, A., Suzuki, H., Izmailow, R., Sengupta, B.: QoS aggregation algorithms in hierarchical ATM networks. In: ICC 1998 (June 1998)
11. Awerbuch, B., Du, Y., Khan, B., Shavitt, Y.: Routing Through teranode networks with topology aggregation. In: ISCC 1998, Athens, Greece (June 1998)
12. Lee, W.C.: Spanning Tree Method for Link State Aggregation in Large Communication Networks. In: INFOCOM 1995, Boston, MA, USA (April 1995)
13. Lee, W.: Minimum Equivalent Subspanner Algorithms for Topology Aggregation in ATM Networks, Colmar, France (June 1999)
14. Bhutani, K.R., Battou, A., Khan, B.: Two Approaches for Aggregation of Peer Group Topology in Hierarchical PNNI Networks. In: The Catholic Uni. of America. Dept. of Mathematics, Washington, DC (June 1998)
15. Guo, L., Matta, I.: On State Aggregation for Scalable QoS Routing. In: ATM Workshop 1998 (May 1998)
16. Htira, W., Dugeon, O., Diaz, M.: An Aggregated Delay|Bandwidth Star Scheme For Admission Control. In: IEEE CAMAD 2007, Athens, Greece (September 2007)
17. CAIDA. Mapnet raw source data files, http://www.caida.org/tools/visualization/mapnet/Data/
18. Liu, J., Liljenstam, M., Nicol, D.: An internet topology for simulation, http://www.crhc.uiuc.edu/jasonliu/projects/topo/
Client-Side Adaptive Search Optimisation for Online Game Server Discovery
Grenville Armitage
Centre for Advanced Internet Architectures
Swinburne University of Technology, Melbourne, Australia
[email protected]
Abstract. This paper describes a client-side, adaptive search technique to reduce both the time taken to discover playable online First Person Shooter (FPS) game servers and the number of network flows created during game server discovery. Online FPS games usually use a client-server model, with thousands of game servers active at any time. Traditional FPS server discovery probes all available servers over multiple minutes in no particular order, creating thousands of short-lived UDP flows. Probing triggers rapid consumption of longer-lived per-flow state memory in NAT devices between a client and the internet. Using server discovery data from Valve’s Counterstrike:Source and idSoftware’s Wolfenstein Enemy Territory this paper demonstrates that pre-probing a subset of game servers can be used to re-order and optimise the overall probe sequence. Game servers are now probed in approximately ascending latency, expediting the location of playable servers. Discovery of playable servers may now take less than 20% of the time and network traffic of conventional game server discovery. The worst case converges to (without exceeding) the behaviour of conventional game server discovery. Keywords: Server discovery, search optimisation, latency estimation.
1
Introduction
Internet-based multiplayer First Person Shooter (FPS) games (such as Wolfenstein Enemy Territory [1], Half-Life 2 [2], Counterstrike:Source [3] and Enemy Territory Quake Wars [4]) have become quite common in the past 6+ years. FPS games typically operate in a client-server mode, with game servers being hosted by Internet service providers (ISPs), dedicated game hosting companies and individual enthusiasts. Although individual FPS game servers typically only host from 4 to around 30+ players, there are usually many thousands of individually operated game servers active on the Internet at any given time [5]. The challenge for game clients is to locate up-to-date information about the game servers available at any given time, such that the player can select a suitable server on which to play. Server discovery is usually triggered manually by the human player, to populate or refresh the list of available servers presented by their client’s on-screen ‘server browser’. Once triggered, a game client first queries a master server preconfigured into the game client software. The master server returns a list of thousands of
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 494–505, 2008. © IFIP International Federation for Information Processing 2008
2
Current State of FPS Server Discovery
In this section we review the current ET and CS:S server discovery processes, consider how they impact on a player’s experience, and reflect on the network layer constraints and consequences for consumer broadband services.
2.1
ET and CS:S Server Discovery
Figure 1 illustrates the server discovery process used by ET. Released in 2003 as an online-only team-play FPS game, ET is based on the earlier Quake III
Fig. 1. An Enemy Territory (ET) client’s discovery and probing of registered ET game servers
Arena (Q3A) game engine, and inherits Q3A’s underlying server discovery mechanism. Public ET game servers automatically register themselves at etmaster.idsoftware.com, the ET master server. When a player requests server discovery an ET client first queries the ET master server (getservers), retrieving a list of
Real World Examples
For this paper representative data was gathered from a number of locations around the planet (listed in Table 1). The Australian host ran actual ET and CS:S clients, whilst the others were nodes on Planetlab [8] using the open-source game server discovery tool qstat [9]. Figures 2 and 3 illustrate the distribution of game server RTTs versus probe time experienced by the standard ET and CS:S clients in Australia in late September 2007. For ET the distribution of RTTs is consistent (and widely spread) across the 70 second discovery period. For CS:S the distribution is not
Table 1. Location of clients used to gather representative RTT samples
Host name and location                               Country and client type(s)
gjagw.space4me.com, Australia (actual clients)       AU-CSS, AU-ET
planetlab1.otemachi.wide.ad.jp, Japan (Planetlab)    JP-ET, JP-CSS
planetlab2.csg.uzh.ch, Switzerland (Planetlab)       CH-ET
edi.tkn.tu-berlin.de, Germany (Planetlab)            DE-CSS
[Figure panels: measured game server RTT (seconds) vs seconds since first probe. AU-ET: 2811 probes at ~42 probes/second. AU-CSS: 25601 probes at ~129 probes/second.]
Fig. 2. Measured RTT vs time: ET server discovery sequence as seen from Australia in September 2007
Fig. 3. Measured RTT vs time: CS:S server discovery sequence as seen from Australia in September 2007
as uniformly spread across the (roughly) 3 minute discovery period. In both cases all servers must be probed before players can presume they’ve covered all possible game servers with ‘playable’ RTT. The large number of probes over 200ms in Figures 2 and 3 reflects the fact that many ET and CS:S game servers tend to be hosted in Europe and the USA. Similar RTT levels are seen from Japan, whilst the clients in Germany and Switzerland saw almost the inverse: most probes returned low RTTs, under 200ms. ET and CS:S clients in the Asia-Pacific region end up probing many servers that are subsequently determined to be unsuitable for competitive play.
2.3
Impact on Consumer Network Connections
Most game clients sit behind consumer broadband connections. The server discovery probe rate is consequently capped to the client’s estimate of available upstream bandwidth. FPS clients usually require players to indicate their Internet connection type (‘dial-up’, ‘ADSL’, ‘Cable modem’, etc). In addition to tweaking the packet rates used during actual game play, this information can allow the client to estimate an appropriate server discovery probe rate. Too many probes per second may congest the player’s upstream bandwidth, potentially inflating the estimated RTT or causing probe packets to be dropped. Too few probes per second increases the time taken to complete server discovery.
Server discovery also increases memory consumption in NAT-enabled routers between the client and the Internet. Although UDP is notionally state-less, NAT devices typically retain mapping state for a number of minutes after seeing UDP traffic head to the Internet from a local network. For example, the trial for Figure 3 caused ~25K unique NAT mapping table entries to be dynamically created over a 3 minute period. Despite each server discovery transaction completing in under two seconds, these NAT table entries remained (wasting memory) for multiple minutes before being released.
2.4
Filtering at the Client and Master Server
Client-side filtering (such as not showing servers that are full or empty, or ranking the servers in order of ascending RTT) has limited impact on the network traffic created during server discovery because it happens during or after the active probing of the available game servers. (Players may certainly choose to exit server discovery early when they see enough servers with tolerable RTT. Nevertheless, as Figures 2 and 3 show, the player cannot safely assume they’re seeing all servers under a particular RTT until all available servers have been probed.) Valve’s Steam master server supports rudimentary server-side filtering. Manually selected by the player, a Steam client’s initial query can request game servers of a certain type (such as “only CS:S game servers”) or game servers believed (by the master server) to be in one of eight broad geographical regions of the planet (such as “US-West”, “Europe”, “Asia”, etc). Filters reduce the number of game servers returned by the master server, and hence reduce the number of probes subsequently emitted by a Steam client. However, a master server cannot know a priori the RTT between any given client and game server. Thus a client must still actively probe the entire (albeit possibly reduced) set of IP address:port pairs handed back by the master server.
3
Proposed Client-Side Optimisation
This paper’s client-side optimisation meets four key goals:
– Presentation of results from closer servers before those of more distant servers
– Optional automatic early termination of search sequence
– Function from behind consumer NAT devices without additional manual intervention or configuration by the player
– No additional processing load on master servers
The first two goals improve a player’s experience and reduce network traffic generated by server discovery. The third goal addresses usability. Players should not be expected to manually configure additional knowledge into the client (such as the public IP address of the player’s home broadband connection). Adapting to being behind a NAT device is crucial because so many game clients will sit on a consumer home broadband connection. The fourth goal minimises the incremental operational cost to a game publisher (as master servers already serve thousands of queries per hour without generating any new revenue).
3.1
Background
The challenge of finding FPS servers with low enough RTT is well recognised [10]. To date research has focused on re-locating clients to optimally placed servers (e.g. [11]), rather than optimising the server discovery process itself. The problem appears contradictory: we wish to probe game servers in order of ascending RTT before we’ve probed them to establish their RTT. In 2006 the author hypothesised that a client might locally re-order the probe sequence so that game servers in countries ‘closer’ to the client would be probed before those ‘further away’ [12]. First the client would map server IP addresses to their country of origin (for example, using MaxMind’s free GeoLite Country database [13]). Then selected servers in each country would be probed, providing an estimate of the RTT to each country relative to the client’s current location. Finally, the countries would be ranked in ascending order of estimated RTT, and all remaining game servers probed in order of their country’s rank. Subsequent implementation revealed that re-ordering solely on the basis of country code was inadequate. This paper presents a functional improvement of the idea in [12].
3.2
Re-Ordering Based on Country and Topology
Algorithm 1 shows the proposed steps - clustering, calibration and re-ordered probing. Clustering involves identifying those game servers that share a common topological relationship within a country. Calibration involves initially probing a small subset of servers in each cluster, thereby creating an estimate of the RTT to every server in the cluster. (The cluster is sub-divided if the sampled RTTs are spread across ‘too wide’ a range.) Re-ordered probing involves ranking clusters by ascending RTT, then probing the remaining game servers according to their cluster’s rank.
The definition of a cluster relies on two facts. First, third-party databases allow geographical location to be inferred from IP addresses, even though an IP address by itself has no geographic semantics. Second, within the context of a given country the higher order IP address bits provide a coarse, but sufficient, key to differentiate servers sharing paths with potentially different latencies.
Calibration takes care of two key requirements. Active sampling ensures cluster RTTs for ranking are relative to the client’s current Internet location (whether or not the client sits behind a NAT device). In addition, a cluster covering too wide a spread of RTT values is subdivided along /16 boundaries, creating multiple smaller clusters for the ranking step. The game servers probed in step 4 are considered ‘done’ and not probed again in step 6.
Modern FPS games typically require processor speeds well over 1 GHz, so Algorithm 1 adds negligible overhead relative to the time required to actually transmit and receive all probes. Thus in the worst case this algorithm’s performance (timeliness in presenting acceptable servers to the player) converges on, without being worse than, the performance of conventional server discovery.
G. Armitage
Algorithm 1. Clustering, calibration and re-ordered probing
1. Query master server for list of game servers
2. Create initial clusters:
   (a) Assign every game server a country code
   (b) Identify every /8 subnet within each country code that contains:
       i. more than 10 game servers, and
       ii. more than 10% of all game servers from that country
   (c) Every server in a /8 subnet identified above is assigned a unique cluster_id from the server’s country code and the /8 subnet number
   (d) Every server not in a /8 subnet identified above is assigned the server’s country code as a unique cluster_id
3. Servers with the same cluster_id are now members of the same cluster
4. For each cluster_id, perform the following steps:
   (a) Set Nsample = √Ncluster, where Ncluster is the number of servers in the cluster
   (b) Randomly select Nsample servers from the cluster, one from each /16 subnet present within the cluster
   (c) Issue a standard server discovery probe to each of the servers selected in the previous step
   (d) The estimated RTT (RTTcluster) for this cluster is the median measured RTT
   (e) If the 20th and 80th percentile measurements making up RTTcluster differ by more than 40ms:
       i. Split the members of this cluster_id into multiple new clusters
       ii. Each new cluster (and associated cluster_id) is made from members of the old cluster who share the most significant 16 bits of their IP addresses
       iii. Issue a standard server discovery probe to one previously un-probed server randomly selected from each new cluster. The probed RTT becomes RTTcluster for the new cluster
5. Rank every cluster in order of ascending RTTcluster
6. Probe all remaining game servers in order of their cluster’s rank. Within a cluster, probe servers in the order they were returned by the master server.
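The initial clustering in step 2 of Algorithm 1 can be illustrated with a minimal Python sketch. It assumes each server is already paired with a country code (the GeoIP lookup is outside the sketch), and the function name `assign_cluster_ids`, the `"CC/octet"` label format and the parameter names are illustrative choices rather than anything defined in the paper.

```python
from collections import Counter


def assign_cluster_ids(servers, min_count=10, min_share=0.10):
    """Sketch of Algorithm 1, step 2 (hypothetical helper).

    servers: list of (ipv4_dotted_quad, country_code) tuples.
    Returns a dict mapping each IP address to its cluster_id.
    """
    # Servers per country, and per (country, /8 subnet).
    per_country = Counter(cc for _, cc in servers)
    per_subnet = Counter((cc, ip.split(".")[0]) for ip, cc in servers)

    cluster_ids = {}
    for ip, cc in servers:
        first_octet = ip.split(".")[0]  # the /8 subnet number
        n = per_subnet[(cc, first_octet)]
        # Step 2(b): more than 10 servers AND more than 10% of the country.
        if n > min_count and n > min_share * per_country[cc]:
            cluster_ids[ip] = f"{cc}/{first_octet}"  # step 2(c)
        else:
            cluster_ids[ip] = cc  # step 2(d)
    return cluster_ids
```

With this sketch, twelve servers in 203.0.0.0/8 and three in 61.0.0.0/8, all mapped to the same country, would yield one /8 cluster and one country-wide cluster.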
3.3 Automatic Termination of Discovery Phase
Individual servers probed in Algorithm 1’s step 6 may not have ascending RTTs, but the RTT averaged over a number of recently probed servers will trend upwards. Consequently it becomes feasible to implement automatic early termination (auto-stop) of probing during step 6 when the RTT trends above a player-specified threshold. This will reduce both the number of probes transmitted and the time a player must wait to be reasonably sure the client has probed the servers that are acceptably close (particularly for clients a long way from many servers). The variability of individual RTTs for game servers within a given cluster means an auto-stop decision should err on the side of continuing rather than terminating the search. Algorithm 2 has been found to be suitably cautious.
Algorithm 2. Auto-stop calculation
– Set RTTstop as the maximum RTT considered playable for a game server (for example, RTTstop = 200ms)
– Decide on a window size Wautostop (for example, Wautostop = 100)
– Wait until at least Wautostop servers are probed in Algorithm 1’s step 6.
– Now, after each successive probe in Algorithm 1’s step 6 re-calculate RTTbottom as the 2nd-percentile RTT over the last Wautostop probes (the RTT below which 2% of RTT samples have fallen)
– Terminate Algorithm 1’s step 6 when RTTbottom > RTTstop
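The auto-stop rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the author's implementation: the function name `autostop_index` is hypothetical, the 2nd percentile is taken as the sorted-window element with 2% of the window strictly before it, and the first check is made as soon as the window is full.

```python
from collections import deque


def autostop_index(rtts_ms, rtt_stop_ms=200.0, window=100, percent=2):
    """Return the 0-based index of the probe that triggers auto-stop,
    or None if the search runs to completion (hypothetical helper)."""
    recent = deque(maxlen=window)  # sliding window of the last `window` RTTs
    for i, rtt in enumerate(rtts_ms):
        recent.append(rtt)
        if len(recent) < window:
            continue  # wait until the window is full
        ordered = sorted(recent)
        # RTTbottom: `percent`% of the window's samples lie before this entry.
        rtt_bottom = ordered[window * percent // 100]
        if rtt_bottom > rtt_stop_ms:
            return i
    return None
```

Because the estimator tracks the low end of the window, a few slow servers do not stop the search; it stops only once almost every recent probe exceeds the threshold, which matches the cautious behaviour described above.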
4 Illustrating the Optimisation with CS:S and ET
Space constraints preclude an exhaustive analysis of section 3’s client-side optimisation. Instead, we present a selection of illustrative examples using real-world RTT datasets gathered from the hosts in Table 1. Each dataset was then used to simulate the consequences of a client applying the steps in sections 3.2 and 3.3. Quantitative results are summarised in Table 2. (Differences exist between the number of servers probed in each case as the datasets were not gathered at precisely the same time.) For each simulated client Algorithm 1 generates traffic in two phases - ‘initial probes’ (step 4) and ‘re-ordered probes’ (step 6). Only re-ordered probes are shown on each graph, along with the variation of Algorithm 2’s RTTbottom over time. The auto-stop RTT threshold is 200ms, and the estimated auto-stop time is indicated by a blue vertical line.
Figures 4, 5 and 6 illustrate the impact on typical ET clients located in Australia, Japan and Switzerland respectively. Compared to the ~70 second probe sequence in Figure 2 the Australian client sees a distinct improvement. It takes ~10 seconds to cluster the active game servers and begin issuing re-ordered probes, with auto-stop being triggered in ~14 seconds. A similar benefit is evident for the client based in Japan. Figure 6 illustrates the neutral impact on clients close to large concentrations of game servers under 200ms. Servers over 200ms are seen towards the end of the probe sequence, but the auto-stop process errs on the side of continuing (rather than stopping) and the client ultimately probes all active servers.
Table 2. Impact of utilising optimised server discovery from different locations
Client   Countries  /8 clusters  Initial Probes  Auto-stop time (seconds)  Auto-stop (% of full probe)
AU-ET    51         77           396             14                        20
JP-ET    52         77           446             24                        36
CH-ET    50         79           492             68                        100
AU-CSS   71         140          2136            24                        12
JP-CSS   69         143          1694            60                        28
DE-CSS   69         143          2159            208                       98
[Figure: Re-ordered probes vs time: AU-ET (396 initial probes, 2415 re-ordered probes, 40 probes/sec). RTT (seconds) against seconds since first probe; legend: re-ordered probes, auto-stop estimator, player RTT tolerance, auto-stop time.]
Fig. 4. Re-ordered ET probe sequence for a client in Australia
[Figure: Re-ordered probes vs time: JP-ET (446 initial probes, 2228 re-ordered probes, 40 probes/sec). RTT (seconds) against seconds since first probe; legend: re-ordered probes, auto-stop estimator, player RTT tolerance, auto-stop time.]
Fig. 5. Re-ordered ET probe sequence for a client in Japan
[Figure: Re-ordered probes vs time: CH-ET (492 initial probes, 2229 re-ordered probes, 40 probes/sec). RTT (seconds) against seconds since first probe; legend: re-ordered probes, auto-stop estimator, player RTT tolerance, auto-stop time.]
Fig. 6. Re-ordered ET probe sequence for a client in Switzerland
[Figure: Re-ordered probes vs time: AU-CSS (2136 initial probes, 23465 re-ordered probes, 130 probes/sec). RTT (seconds) against seconds since first probe; legend: re-ordered probes, auto-stop estimator, player RTT tolerance, auto-stop time.]
Fig. 7. Re-ordered CS:S probe sequence for a client in Australia
Figures 7, 8 and 9 illustrate the impact on typical CS:S clients located in Australia, Japan, and Germany respectively. Compared to the ~200 second probe sequence in Figure 3 the Australian client sees a distinct improvement. It takes just under 20 seconds to cluster the active game servers and begin issuing re-ordered probes, with auto-stop being triggered in ~24 seconds (12% of the non-optimised search time). A similar, albeit less dramatic, benefit is evident for the client based in Japan. A client based in Germany would see only limited benefit from the re-ordered probe sequence - the auto-stop triggers after almost 98% of active servers have been probed. The key message is that clients far away from most game servers can see a significant reduction in the number of probes they emit before concluding that all playable servers have been seen. This reduces the player’s wait time, reduces the number of UDP flows passing through any NAT devices near the client, and reduces the number of bytes sent and received to probe servers unlikely to be suited to competitive play.
[Figure: Re-ordered probes vs time: JP-CSS (1694 initial probes, 25934 re-ordered probes, 130 probes/sec). RTT (seconds) against seconds since first probe; legend: re-ordered probes, auto-stop estimator, player RTT tolerance, auto-stop time.]
Fig. 8. Re-ordered CS:S probe sequence for a client in Japan
[Figure: Re-ordered probes vs time: DE-CSS (2159 initial probes, 25519 re-ordered probes, 130 probes/sec). RTT (seconds) against seconds since first probe; legend: re-ordered probes, auto-stop estimator, player RTT tolerance, auto-stop time.]
Fig. 9. Re-ordered CS:S probe sequence for a client in Germany
5 Limitations, Alternatives and Future Work
5.1 Limitations and Alternatives
It is inherently challenging to consistently optimise the full search sequence based on sampling of the master server’s list of active game servers. For example, Algorithm 1’s Nsample should be kept low so the initial probe sequence in step 4 involves a small fraction of the game servers returned in step 1. This paper’s calculation of Nsample works reasonably well for the datasets used. However, increasing Nsample will improve the estimation of RTTcluster and the reliable sub-division of clusters where necessary, improving the consistency of cluster ranking in step 5. The current scheme relies on a third-party database for mapping IP addresses to geographically significant country codes. Such databases are themselves not perfect (MaxMind’s GeoIP claims to map 97% of all IP address allocations to country codes), and they must be updated regularly as real-world IP address allocations change. Fortunately, the databases are small (relative to today’s game clients) and don’t change much from month to month (GeoLite Country went from 677Kbyte to ~1Mbyte between April and October 2007). Most FPS games include mechanisms for auto-update of game content, which can also handle incremental updates to the country code mapping database. Another issue is that auto-stop relies on an inherently noisy ‘signal’, and may be fooled into premature termination if the cluster ranking is sufficiently mis-ordered or RTTstop is set too low. Further options open up if we allow modification of the master server. Mapping of IP addresses to country codes could be moved to the master server, with tuples of
In principle a master server could also pre-order the list of game servers returned to each client by estimating the client’s distance to servers in other countries (using the client’s public IP address and a country code mapping database held at the master server). However, a master server cannot know the network conditions prevailing between each client and game servers in different countries, so there is no benefit to be gained.
5.2 Future Work
Future work will characterise the probability with which clusters are formed and ranked suboptimally as a function of Nsample, the use of /8 and /16 boundaries, and the spread of RTTs used to trigger new cluster formation in step 4(e). We will also evaluate alternative auto-stop algorithms, such as waiting for a smooth run of consistently high probe RTTs before stopping (rather than simply triggering on the 2nd-percentile RTT). We will also explore alternative indicators that game servers share a common ‘distance’ from a client, such as Autonomous System (AS) numbers. AS numbers are used in inter-domain routing to identify topologically distinct regions of the Internet. Clustering based on AS numbers (rather than country code) may lead to more accurate RTTcluster estimates and greater consistency in ranking of clusters. Master servers are a good place to track the live BGP routing information updates required to maintain IP address to AS number mappings. We will explore the implications of modifying master servers to return
6 Conclusion
Using examples based on Valve’s Counterstrike:Source (CS:S) and idSoftware’s Wolfenstein Enemy Territory (ET) this paper illustrates a self-calibrating client-side method for optimising the FPS game server discovery probe sequence. Game servers are clustered by their countries of origin, and the clusters are then ranked in ascending order of RTT based on initial probing of a small sample of game servers in each cluster. All remaining game servers are then probed in order of their cluster’s rank. Probing may be automatically terminated when RTT exceeds a player-specified threshold. The method works wherever the client is located on the Internet, and can reduce server discovery time and traffic down to less than 20% of the regular case. In the worst case, the method converges on, without exceeding, the time and resources required for regular server discovery.
Acknowledgement I am grateful to Mark Claypool for providing me with access to PlanetLab.
References
1. id Software: Wolfenstein Enemy Territory, under Downloads (September 29th 2007), http://www.enemyterritory.com/main.html
2. Valve Corporation: Half-Life 2 (April 29th 2007), http://half-life2.com/
3. Valve Corporation: CounterStrike: Source (accessed February 8th 2008), http://counter-strike.net/
4. id Software: Enemy Territory Quake Wars (September 29th 2007), http://www.enemyterritory.com/
5. Armitage, G., Claypool, M., Branch, P.: Networking and Online Games - Understanding and Engineering Multiplayer Internet Games, June 2006. John Wiley & Sons, Ltd, United Kingdom (2006)
6. Valve Corporation: Welcome to Steam (September 27th 2007), http://www.steampowered.com/
7. Valve Corporation: Server Queries (February 7th 2008), http://developer.valvesoftware.com/wiki/Server_Queries
8. PlanetLab: PlanetLab - An open platform for developing, deploying, and accessing planetary-scale services (accessed February 8th 2008), https://www.planet-lab.org/
9. QStat (accessed February 8th 2008), http://www.qstat.org/
10. Claypool, M.: Network characteristics for server selection in online games. In: ACM/SPIE Multimedia Computing and Networking (MMCN) (January 2008)
11. Chambers, C., Feng, W.C., Saha, D.: A geographic, redirection service for on-line games. In: ACM Multimedia 2003 (short paper) (November 2003)
12. Armitage, G., Javier, C., Zander, S.: Topological optimisation for online first person shooter game server discovery. In: Proceedings of Australian Telecommunications and Network Application Conference (ATNAC) (December 2006)
13. MaxMind: GeoLite Country (accessed February 8th 2008), http://www.maxmind.com/app/geoip_country
A Model for Endpoint Admission Control Based on Packet Loss
Ignacio Más and Gunnar Karlsson
ACCESS Linneaus Center, School of Electrical Engineering, KTH, Royal Institute of Technology, 10044 Stockholm, Sweden
{nacho,gk}@kth.se
Abstract. Endpoint admission control solutions, based on probing a transmission path, have been proposed to meet quality requirements of audio–visual applications with little support from routers. In this paper we present a mathematical analysis of a probe–based admission control solution, where flows are accepted or rejected based on the packet–loss statistics in the probe stream. The analysis relates both system performance to the design parameters, and the experienced probe packet loss probability to the packet loss probability of accepted flows. The goal is to provide a simple mathematical method to perform network dimensioning for admission control based on end–to–end probing.
1 Introduction
Applications that produce traffic in the Internet can be broadly classified into elastic or inelastic. Elastic flows appear mostly as a result of the transfer of digital documents, like files, web pages or multimedia downloading for local storage. These flows use TCP as transport protocol. Inelastic flows appear nowadays mainly from the streaming of audio and video over UDP without congestion control functionality. The sudden increase of video over IP on the Internet requires a better and more predictable service quality than what is possible with the available best–effort service. Audio–visual applications can handle limited packet loss and delay variation without affecting the perceived quality. Interactive communication, in addition, imposes stringent end–to–end delay requirements. For example, IP telephony requires that a maximum of 150 ms one–way delay should be maintained during the whole call. The DiffServ architecture is an attempt to meet the distinct quality of service requirements of elastic and inelastic traffic, but it still lacks an important function: admission control. Without admission control, inelastic flows can overload the network to a point where both elastic and inelastic flows suffer intolerable packet loss. Admission control has been proposed to support applications with
Corresponding author.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 506–517, 2008. © IFIP International Federation for Information Processing 2008
quality of service (QoS) requirements by limiting the network load. These proposals aim at providing QoS with very little or no support in the routers. They also share a common idea of endpoint admission control: a host sends probe packets before starting a new flow and decides about the flow admission based on statistics of the probe packet loss [1, 2], or delay and delay variation [3]. The admission decision is thus moved to the hosts and is made for the entire path from the source to the destination, rather than on a per–hop basis. Consequently, the service does not require any explicit support from the routers other than one of the various scheduling mechanisms supplied by DiffServ, and the mechanism of dropping or marking packets. In this paper we provide a mathematical analysis of the probe–based admission control (PBAC) scheme proposed in [1, 2]. In this scheme, the admission control is based on the measured packet–loss ratio of the probe stream. The aim of the admission control is to provide a reliable upper bound on the packet loss probability for accepted flows. The end–to–end delay and delay jitter are limited by the use of small buffers inside the network. The goal of the mathematical analysis is to relate performance parameters, such as probe and data packet loss probabilities, flow acceptance probability and network utilization, to system parameters, such as buffer size, probe length and admission threshold. The paper is organized as follows: In Section 2 we provide a short description of the PBAC solution, while Section 3 presents an approximate analytical model to calculate probe and data loss probabilities. Section 4 gives a performance evaluation of the system as well as the validation of the analytical model, while Section 5 shows an example of the service dimensioning for an access operator. Finally, we conclude our work in Section 6.
2 Probe–Based Admission Control
QoS provisioning with probe based admission control is based on the following main ideas: i) Network nodes have short buffers for data packets of accepted flows to limit end–to–end delay and delay jitter; ii) all hosts in the network perform admission control; iii) the admission control is responsible for limiting the packet loss of accepted flows; iv) the admission decision is based on the packet loss ratio in the probe stream, thus flows are accepted if the estimated packet loss probability is below a given acceptance threshold; v) probe packets are transmitted with low priority at the routers to ensure that probe streams do not disturb accepted flows; and vi) best effort traffic utilizes the remaining available capacity of the network link at a lower priority than the probe packets.
To provide priority queuing for probe and data packets at the network nodes we consider a double–queue solution in this paper: One buffer is dedicated for high priority data packets and the other for low priority probe packets. The size of the high priority buffer for the data packets is selected to ensure a low maximum queuing delay and an acceptable packet loss probability, i.e., to provide packet scale buffering [4]. The buffer for the probe packets can accommodate few probe packets at a time, to ensure an over–estimation of the data packet loss. Similar priority queuing can be achieved using a single buffer with a discard threshold for the probe packets. The two solutions are compared in [5]. The threshold–queue is not included in the mathematical analysis of this paper.
The acceptance threshold is fixed for the admission controlled traffic service class and it is the same for all flows. The reason for this is that the QoS experienced by a flow is a function of the load from the flows already accepted in the class. Considering that this load depends on the highest acceptance threshold among all flows, by having different thresholds all flows would be degraded to the QoS required by the one with the least stringent requirements. The class definition has also to state the maximum data rate allowed to limit the size of the flows that can be set up. Each data flow should not represent more than a small fraction of the service class capacity, to ensure that statistical multiplexing works well. A study of the effect of ‘thicker’ flows in the admission control scheme (over 10% of service class capacity) can be found in [5].
[Figure: timing diagram of the probing procedure between two hosts, showing new session, Probe, Measurement, Probe length, ACK (Ploss < Ptarget), NACK (Ploss > Ptarget), Backoff time, Data and Timeout.]
Fig. 1. The probing procedure
In Fig. 1 we can see the phases of the probing procedure. When a host wishes to set up a new flow, it starts sending a constant bit rate probe to the destination host at the maximum rate the flow will require (rpr). The probe packet size (sp) should be small to have a high number of packets in the probing period (Tp) in order to perform the acceptance decision with a sufficient level of confidence. The probe packets contain information about the peak bit rate and length of the probe, as well as a sequence number. With this information the receiving host can perform a quick rejection, based on the expected number of packets that it should receive in order not to surpass the target loss probability. The probe also needs to contain a flow identifier to allow the end host to distinguish probes for different flows, since one sender could transmit more than one flow simultaneously. The IP address in the probes would consequently not be enough to differentiate them.
Table 1. Main parameters of the mathematical analysis
Static parameters:
  $C_l$                             Link capacity [b/s]
  $Q_p$                             Probe queue length [p]
  $Q_d$                             Data queue length [p]
  $\theta$                          Loss threshold
Probe session parameters:
  $\Lambda_p$                       Probe session mean arrival rate [1/s]
  $T_p$                             Probe session length [s]
  $I_p = \Lambda_p T_p$             Probe session traffic intensity
Probe parameters:
  $s_p$                             Packet size [b]
  $r_p$                             Bit rate [b/s]
  $\mu_p = C_l / s_p$               Mean service rate [p/s]
  $\lambda_p = I_p r_p / s_p$       Arrival rate [p/s]
  $\rho_p = \lambda_p / \mu_p$      Traffic intensity
Data parameters:
  $s_d$                             Packet size [b]
  $\gamma_d$                        Mean-to-peak ratio
  $r_d = r_p \gamma_d$              Average bit rate [b/s]
  $T_d$                             Data session length [s]
  $\mu_d = C_l / s_d$               Mean service rate [p/s]
To perform the acceptance decision, the end–host measures the empirical probe loss rate, $p_{me}$, in the probe stream. Assuming a normal distribution of the probe loss [5], the flow is accepted if:
$$p_{me} + z_R \sqrt{\frac{p_{me}\,(1 - p_{me})}{s}} \le \theta, \quad \text{given that } \theta \times s > 10, \tag{1}$$
where $\theta$ is the acceptance threshold, $s$ is the number of probe packets sent, $R$ is the confidence level we want to have and $z_R$ is the $\left(1 - \frac{1-R}{2}\right)$-quantile of the normal distribution. The second condition ensures that we have a sufficient number of samples for the estimation. When the same host or a new one wants to set up a new flow, it repeats the procedure with the maximum rate of the new flow and receives an independent acceptance decision. If the empirical probe packet loss is greater than the upper level of the confidence interval for the admission threshold, the session is rejected and has to back off for a certain time before trying again. More details of the procedure can be obtained from [2].
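The acceptance test of Eq. (1) can be sketched in Python as follows. This is a minimal illustration, assuming $z_R$ is the two-sided standard normal quantile for confidence level $R$; the function name `accept_flow` and the choice to reject when $\theta \times s \le 10$ (too few samples to decide) are assumptions of the sketch, not part of the paper.

```python
from math import sqrt
from statistics import NormalDist


def accept_flow(sent, lost, theta, confidence=0.95):
    """Sketch of the acceptance test in Eq. (1) (hypothetical helper).

    sent: number of probe packets sent (s); lost: probe packets lost;
    theta: acceptance threshold; confidence: confidence level R.
    """
    if theta * sent <= 10:
        # Assumption: with too few samples the flow is not accepted.
        return False
    p_me = lost / sent  # empirical probe loss rate
    # z_R: the (1 - (1 - R) / 2)-quantile of the standard normal distribution.
    z_r = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_me + z_r * sqrt(p_me * (1 - p_me) / sent) <= theta
```

For instance, with 2000 probe packets and θ = 0.01, five lost packets keep the upper confidence bound below the threshold, whereas fifteen lost packets push it above the threshold even though the raw loss ratio (0.0075) is still under θ.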
3 Mathematical Model
The approximate analytical model presented in this section determines the performance of the PBAC system depending on the main system parameters such as the queue size for the high priority data packets ($Q_d$) and for the low priority probe packets ($Q_p$), the acceptance packet loss threshold for new flows ($\theta$), the length of the probe ($T_p$), the probe rate ($r_p$) and the probe packet size ($s_p$). Table 1 summarizes the different parameters used in the analysis, where capital letters refer to session level parameters and lower-case letters to packet level parameters, and subscripts p and d refer to probe and data packets or sessions. Our mathematical analysis uses the probe session mean arrival rate ($\Lambda_p$) and the size of the two queues as the independent variables, while the acceptance probability and data packet loss are the values we wish to obtain. The rest of the parameters are treated as constant in the analysis.
Fig. 2. Birth-death Markov chain for accepted sessions
3.1 Computation of Maximum Non-blocking Probe Load
The probe queue is an n*D/D/1 queue, which can be upper bounded by an M/D/1. The M/D/1 queue in turn is strictly shorter for the same arrival rate than an M/M/1 queue with the same average service time [4], so we can consider the queue as an M/M/1/$Q_p$. This approximation will have an effect on the closeness of our analytical results to the simulation and experimental results, as will be seen in the evaluation section. We are also assuming independence of packet arrivals, which can be justified under intense traffic (i.e. close to the saturation point, where we are interested in obtaining performance values). The packet loss probability is thus given by [6]:
$$P_{loss}(\rho_p, Q_p) = \frac{(1 - \rho_p)\,\rho_p^{Q_p}}{1 - \rho_p^{Q_p + 1}} \tag{2}$$
From this equation we look for the traffic intensity in the probe queue, $\rho_p^*$, such that the loss in the queue is less than or equal to the threshold:
$$\rho_p^* \;\big|\; P_{loss}(\rho_p^*, Q_p) \le \theta \tag{3}$$
This non-blocking load needs to be translated into a number of concurrent probing sessions ($N_p^*$), as follows:
$$N_p^* = \frac{\rho_p^*\, C_p}{r_p}, \tag{4}$$
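A small numerical sketch may help to see how Eqs. (2) to (4) fit together. The Python below evaluates the M/M/1/$Q_p$ loss formula, finds the largest load satisfying Eq. (3) by bisection (the paper does not say how $\rho_p^*$ is computed, so bisection is an assumption), and converts it into a session count using $C_p = C - N_d r_d$. All function names are hypothetical, and rounding $N_p^*$ down to an integer is also an assumption.

```python
def mm1k_loss(rho, q):
    """Loss probability of an M/M/1/q queue at traffic intensity rho, Eq. (2)."""
    if rho == 1.0:
        return 1.0 / (q + 1)  # limit of Eq. (2) as rho -> 1
    return (1 - rho) * rho**q / (1 - rho ** (q + 1))


def max_nonblocking_load(theta, q, tol=1e-9):
    """Largest rho with mm1k_loss(rho, q) <= theta, Eq. (3), by bisection."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while mm1k_loss(hi, q) <= theta:  # the loss grows with rho; bracket theta
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mm1k_loss(mid, q) <= theta:
            lo = mid
        else:
            hi = mid
    return lo


def max_probe_sessions(theta, q, capacity, n_data, r_data, r_probe):
    """Eq. (4): concurrent probing sessions, rounded down (an assumption)."""
    c_p = capacity - n_data * r_data  # capacity left for probe packets
    return int(max_nonblocking_load(theta, q) * c_p / r_probe)
```

As an example under these assumptions, a probe queue of two packets with θ = 0.01 tolerates a probe load of roughly 0.106 of the remaining capacity.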
where $C_p = C - (N_d r_d)$ is the capacity left for the probe packets, $N_d$ being the number of ongoing data sessions. This capacity left for the probe packets is approximate since it does not take into account the packet loss for the ongoing data sessions. We are thus assuming independence of the two queues, which has a small impact on the accuracy of the results, as seen in the evaluation section. For a given number of accepted data sessions, we can then obtain the maximum number of simultaneous probing sessions $N_p^*$, and then we can compute the probability of having up to that number of probing sessions. Assuming that new sessions arrive according to a Poisson process with intensity $\Lambda_p$, the probability of having $n$ concurrently probing sessions is obtained as:
$$p_p(n) = \frac{I_p^{\,n}\, e^{-I_p}}{n!}, \tag{5}$$
where $I_p$ represents the probe session traffic intensity. With this probability, we can already compute the probability of acceptance conditioned on having $N_d$ ongoing data sessions, as:
$$P_{acc}(Q_p, \Lambda_p \,|\, N_d) = \sum_{n=0}^{N_p^*} p_p(n) \tag{6}$$
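Eqs. (5) and (6) amount to a Poisson cumulative probability, which the following Python sketch evaluates. The function name `acceptance_given_nd` and its arguments are illustrative; the value of $N_p^*$ is assumed to have been computed beforehand for the given $N_d$.

```python
from math import exp


def acceptance_given_nd(arrival_rate, probe_length, n_max):
    """Sketch of Eq. (6): probability of at most n_max concurrent probes.

    arrival_rate: Lambda_p [1/s]; probe_length: T_p [s];
    n_max: N_p^* for the given number of ongoing data sessions.
    """
    if n_max < 0:
        return 0.0
    intensity = arrival_rate * probe_length  # I_p = Lambda_p * T_p
    term = exp(-intensity)  # p_p(0) from Eq. (5)
    total = term
    for n in range(1, n_max + 1):
        term *= intensity / n  # p_p(n) obtained from p_p(n - 1)
        total += term
    return total
```

The recurrence avoids computing large factorials directly, which keeps the sum numerically stable for the small session counts considered here.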
This equation gives us the acceptance probability as a function of the probe session arrival rate and the probe packet queue for a given number of ongoing data sessions. To obtain the unconditioned acceptance probability, we need to sum all the conditional probabilities for every $N_d$ multiplied by the probability of having $N_d$ ongoing data sessions in the system. The last stochastic variable can be modelled as a Markov-modulated Poisson process with $c$ states, where the arrival rates are state dependent. Figure 2 shows the birth-death Markov chain. The number of servers ($c$) corresponds to the highest value of $N_d$ that gives $P_{acc}(Q_p, \Lambda_p \,|\, c) > 0$. The M/G/c/c system is insensitive to the service time distribution and satisfies the M/M/c/c birth-death process also for state dependent arrivals. In this way, the steady state probabilities of an M/M/c/c birth-death process can be obtained as [6]:
$$p_d(N_d = i) = p_0\,(T_d)^i\,\frac{\prod_{j=0}^{i-1} P_{acc}(Q_p, \Lambda_p \,|\, N_d = j)}{c!}, \tag{7}$$
with:
$$p_0 = p_d(N_d = 0) = \left(1 + \sum_{i=1}^{c}\left[\left(\frac{\lambda}{\mu_d}\right)^{i} \frac{\prod_{j=0}^{i-1} P_{acc}(Q_p, \Lambda_p \,|\, N_d = j)}{c!}\right]\right)^{-1} \tag{8}$$
With these steady state probabilities we can already compute the unconditioned acceptance probability as:
$$P_{acc}(Q_p, \Lambda_p) = \sum_{\forall N_d \,|\, P_{acc}(Q_p, \Lambda_p | N_d) \neq 0} \bigl(P_{acc}(Q_p, \Lambda_p \,|\, N_d)\; p_d(N_d)\bigr) \tag{9}$$
Figures 3 and 4 show the acceptance probability as a function of the probe queue size and the offered load to the system. The probe queue size varies from 1 to 8 positions and the offered load varies from 0.64 to 3.34 (which corresponds to session arrival rates $\Lambda_p$ from 1.5 to 8 roughly), while the acceptance threshold is 0.01 and 0.05 probe packet loss ratio. The surface lines mark the points where the acceptance probability is 0.8, 0.6, 0.4 and 0.2 respectively.
[Figure: surface plots of acceptance probability against session arrival intensity and probe queue buffer.]
Fig. 3. Acceptance probability for θ=0.01
Fig. 4. Acceptance probability for θ=0.05
To compute the packet loss that ongoing data sessions will experience, we model the data queue as an M/M/1/$Q_d$, where $Q_d$ is the data packets queue length¹. To obtain the traffic to substitute in Eq. (2), we use the obtained steady state probabilities. On average, for a given $Q_p$ and $\Lambda_p$, we will have $E[N_d] = \sum_{i=1}^{c} i\, p_d(N_d = i)$ ongoing sessions transmitting data packets. The mean data packet arrival rate will be $\lambda_d = \frac{E[N_d]\, r_d}{s_d}$, which then gives $\rho_d = \lambda_d T_d$. Substituting the values of $Q_d$ and $\rho_d$ in Eq. (2), we obtain the average data packet loss as:
$$P_{data\ loss}(\rho_d, Q_d) = \frac{(1 - \rho_d)\,\rho_d^{Q_d}}{1 - \rho_d^{Q_d + 1}} \tag{10}$$
Figures 5 and 6 show the data packet loss ratio with the same parameters as the acceptance probabilities figures. The data packet (high priority) queue was 20 packets in both figures. Finally, Figures 7 and 8 illustrate what would be the achieved utilization as a function of Λp and Qp. The utilization increases as the offered load increases up to a point slightly over 0.8 of link capacity for an offered load of Λp ≈ 1.9, from where the utilization starts decreasing, due mainly to the probe thrashing effect studied in [5, 2]. This effect happens when the arrival intensity of probe packets is so high that they collide in the probe queue and are dropped even if the system has available capacity to admit new flows. This behavior then increases the blocking probability of concurrent sessions, potentially rejecting sessions that should be accepted. It’s important to notice that we have not modeled a back-off mechanism in our arrival rate, which is constant even when the acceptance probability decreases. In a real scenario, peaks of offered load over 1 should fade out as the blocking increases and sessions back off for longer periods of time.
¹ A more sophisticated queuing model could be used for the aggregate traffic of the Nd sessions.
Fig. 5. Data packet loss θ=0.01 (axes: session arrival intensity, probe queue buffer, data loss ratio)
Fig. 6. Data packet loss for θ=0.05 (axes: session arrival intensity, probe queue buffer, data loss ratio)
Fig. 7. Achieved utilization for θ=0.01 (axes: session arrival intensity, probe queue buffer, link utilization)
Fig. 8. Achieved utilization for θ=0.05 (axes: session arrival intensity, probe queue buffer, link utilization)
4 Evaluation
We have performed simulations with the network simulator NS-2 to validate our analytical model. The simulations contained sources with exponentially distributed on-off times, and peak rates (rp ) of 100 kb/s over a 10 Mb/s link, or 1 Mb/s over a 100 Mb/s link. The on-off holding times had an average of 20 and 35.5 ms respectively. Packet sizes were 64 bytes for the probe packets and 128 bytes for the data packets, while the probe length was always 2 seconds. We used a confidence interval for the admission decision of 95%. The simulation time was 2000 seconds and the average holding time for accepted flows (1/μd ) was 120 seconds. We have discarded the first 500 seconds of simulation to let the simulation achieve a steady state, since the utilization of the link stabilizes after 150 seconds with fluctuations of ±2 accepted flows. Confidence intervals for each simulation are too small to be plotted in the figures. The probe queue size (Qp ) and the flow arrival rate (Λp ) varied in each simulation to increase the offered load to the system. The queue used was a double queue with a low priority buffer of two packets in the results shown and a high priority buffer of 20 packets. To prove the behavior of our probing scheme we have used a simple one bottleneck link topology. The results for a multilink scenario obtained in [1] show that the highest loaded link dominates the behavior and that the scheme discriminates against flows with longer paths.
Fig. 9. Acceptance probability for 100 kb/s call peak rate, with an acceptance threshold of 5 × 10^-2 (axes: offered load, acceptance probability; curves: analytical model, simulation, experimental)
Fig. 10. Acceptance probability for 1 Mb/s call peak rate, with an acceptance threshold of 10^-2 (axes: offered load, acceptance probability; curves: analytical model, simulation, experimental)
We have also used an experimental prototype to validate the simulations in our lab environment. All the experimental results were generated by using a Spirent SmartBits 6000B traffic generator to generate the background traffic with the same characteristics as the traffic in the simulations. The probing sources have been generated by using a simple software UDP traffic generator (see footnote 2), to which we have added the needed probing functions from the PBAC library. The topology used in the laboratory implements that of the simulations with one bottleneck link. The routers are PCs running Debian Linux with a 2.6.9 kernel providing the special double queue system, with 100 Mb/s Ethernet interfaces. Figure 9 and Figure 10 show the acceptance probabilities of a new call with an acceptance threshold of 0.05 and 0.01, with call peak rates of 1 Mb/s. In both cases the analytical curve intersects the curves obtained by simulation and in the experimental evaluation. The separation between the analytical curve and the simulation and experimental results comes from the assumptions taken in the mathematical model: the queue of the probes has been modeled as an M/M/1/Qp, while it should be an nD/D/1/Qp, since we have constant bit rate probes and fixed probe packet sizes. The nD/D/1/Qp model should offer a closer curve to the acceptance probability of experimental and simulation results, but it requires a numerical procedure to compute the packet loss that was out of the scope of our simple analysis. Figures 11 and 12 finally show the data packet loss experienced by accepted sessions as a function of the offered load. The probe thrashing effect mentioned previously can clearly be seen in these figures. The packet loss ratio increases as the offered load increases up to a value of 2.5 Λp, which corresponds to a link capacity of approximately 1.1 percent. 
After that point, sessions begin to be rejected due to probe thrashing and the blocking probability increases heavily, so as fewer sessions are admitted in the system, the packet loss ratio decreases sharply. It is interesting to observe the slightly higher values of packet loss ratios in the simulation and experimental results, which can be explained by the fact that we have modeled the ongoing sessions as an M/M/1/Qd, while the simulation and experimental sources are on-off sources with exponentially distributed on period. The on-off sources are thus more bursty in nature and that produces the higher packet loss. A better model would be to use an MMPP for the ongoing sources as well.
2 http://rude.sourceforge.net
Fig. 11. Data packet loss for 1 Mb/s call peak rate, with an acceptance threshold of 5 × 10^-2 (axes: offered load, data loss ratio; curves: analytical model, simulation, experimental)
Fig. 12. Data packet loss ratio for 1 Mb/s call peak rate, with an acceptance threshold of 10^-2 (axes: offered load, data loss ratio; curves: analytical model, simulation, experimental)
5 A Dimensioning Example
As a practical example of how to dimension the service in an access operator, we consider a 100 Mb/s Ethernet access network. The access operator is offering a video-conference service to the users in the access network itself. The maximum end-to-end distance between users is four hops, i.e. there will be a maximum of four routers/switches between the communicating users. The access operator wants to have an upper loss threshold of 1% packet loss, since the video-conference media carries FEC that can correct up to that loss ratio. If we consider a maximum distance of 5 km in Ethernet cable between users (in several connected Ethernet links), and a propagation speed of 2 × 10^8 m/s, the end-to-end propagation delay becomes 0.025 ms. The video-conference service uses a packet size of 400 bytes and has a peak rate of 1 Mb/s, while the average rate is 256 kb/s. A 100 Mb/s Ethernet implies a service time of 0.032 ms per packet in every hop. The total end-to-end delay can then be approximately computed as:

$$Tot_{delay} = Prop_{delay} + Queueing_{delay} = 0.025\,\text{ms} + 4 \, Q_d \cdot 0.032\,\text{ms} \qquad (11)$$
To keep an end-to-end delay under 150 ms the data queue size has to be smaller than 1169 packets, which shows that end-to-end queuing delay does not play a big role when the number of hops is small. As an example, if the number of hops increases to 10, then the maximum data queue size decreases to 468 packets. Of course this is a simplification in which we have not considered the processing time in every node, only the queuing delay per node. A 400-byte packet sent at a 1 Mb/s peak rate gives 312 packets/s. Since we are trying to limit the packet loss to under 1%, this ensures that we get enough samples of probe packets within just a few seconds. The operator decides that it does not want to have a longer call setup time than 10 seconds, so we limit the probing phase to 9.5 seconds. With this probing phase, our admission threshold can be computed using Eq. (1), where s is 2964 packets, R is 0.95 and pme is the maximum packet loss we accept, i.e. 0.01. With these values the admission threshold becomes 0.00842, or roughly 0.84% probe packet loss. If we were to reduce the probing phase to only one second, with the same confidence interval and target loss, θ would then become 0.00592 instead. The access operator has a statistical study that predicts that it will be receiving between 0.1 and 1.5 calls a second (Λp) and that every call lasts 5 minutes on average. With the average rate of 256 kb/s that gives an offered load of 7.68% to 115% in the 100 Mb/s links. With these values, Eq. (9) gives us the unconditional acceptance probability for different values of the probe queue (Qp). For a probe queue of 2 packets, the acceptance probability is 1 until the offered load equals 0.52 of link capacity, while it is 0.46 for an offered load of 1.5 calls per second. Increasing Qp to 5 packets would move the 1% acceptance probability level to 0.63 of link capacity, while the acceptance probability for 1.5 calls per second becomes 0.73. The packet loss of accepted sessions for these two extremes of our call session arrival interval can be computed with Eq. (10). With a data packet buffer (Qd) of 20 packets, for example, there is no packet loss for either value of Qp at an offered load of 0.1 calls per second, while for 1.5 calls per second the loss becomes 2 × 10^-6 for Qp = 2 and 6.15 × 10^-3 for Qp = 5. Both values are well under our target of 1% loss ratio, though it is interesting to notice that increasing the probe queue size increases the acceptance probability of incoming sessions as well as the loss ratio operational point for accepted sessions. 
Finally, a look at the utilization of the link capacity gives the following numbers: For Qp = 2 and λp = 0.1 the utilization of the link is roughly the offered load, since acceptance probability is 1. The maximum utilization is achieved for an offered load of 0.82 of link capacity, while the utilization at 1.15 drops to 0.53. If we take a probe buffer size of Qp = 5 instead, then the maximum utilization is 0.85 and occurs for an offered load of 1.1 calls per second.
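The arithmetic behind two of the dimensioning inputs above (the number of probe samples and the offered load range) can be sketched as follows. The constant and function names are hypothetical, and the packet rate is rounded down to whole packets per second as the text does (312 packets/s).

```python
PACKET_BITS = 400 * 8          # 400-byte video-conference packets
PEAK_RATE_BPS = 1_000_000      # 1 Mb/s peak rate
AVG_RATE_BPS = 256_000         # 256 kb/s average rate
LINK_BPS = 100_000_000         # 100 Mb/s Ethernet link
HOLDING_TIME_S = 5 * 60        # calls last 5 minutes on average


def probe_samples(probe_seconds: float) -> int:
    """Number of probe packets sent during the probing phase at peak rate."""
    packets_per_second = PEAK_RATE_BPS // PACKET_BITS  # 312, rounded down
    return int(packets_per_second * probe_seconds)


def offered_load(calls_per_second: float) -> float:
    """Offered load as a fraction of link capacity (arrival rate x holding time x average rate)."""
    return calls_per_second * HOLDING_TIME_S * AVG_RATE_BPS / LINK_BPS


if __name__ == "__main__":
    print("probe samples in 9.5 s:", probe_samples(9.5))
    for rate in (0.1, 1.5):
        print(f"{rate} calls/s -> offered load {offered_load(rate):.2%}")
```

The sample count feeds the admission threshold computation of Eq. (1), which is not reproduced here.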
6 Conclusions
In this paper we have presented an approximate analytical model of the probe-based admission control scheme based on the end-to-end measurements of packet loss probabilities in the probe streams. In this solution the admission control is responsible for limiting the end-to-end packet loss probability of accepted flows, while the end-to-end delay and delay jitter requirements are ensured by the use of small buffers in the routers. The analysis focuses on the flow acceptance probabilities at a given offered load, the link utilization achieved, and on the packet loss probabilities of accepted flows. The analytical results, verified by simulations, prove that the considered probe-based admission control leads to a stable link utilization which has a clear upper bound on the packet loss probability. As the acceptance decision is based on the probe loss probability it is important to see how the loss admission threshold and data loss probabilities relate at different link utilization. The analysis predicts that the data loss is always under half an order of magnitude of the targeted loss threshold, independently of the actual link load, which is confirmed in our experimental results. Consequently, the probe-based admission control based on measurements of the packet loss probabilities in the probe stream provides a reliable and efficient solution for QoS provisioning for delay and loss sensitive applications, without extensive support in the routers.
References
1. Fodor, V. (née Elek), Karlsson, G., Rönngren, R.: Admission control based on end-to-end measurements. In: Proc. of the 19th Infocom (Tel Aviv, Israel), pp. 623–630, IEEE, Los Alamitos (March 2000)
2. Más, I., Karlsson, G.: Probe-based admission control for a differentiated-services internet. Elsevier Computer Networks 51(13), 3902–3918 (2007)
3. Bianchi, G., Borgonovo, F., Capone, A., Petrioli, C.: Endpoint admission control with delay variation measurements for QoS in IP networks. ACM Computer Communication Review 32(2), 61–69 (2002)
4. Roberts, J., Virtamo, J.T., Mocci, U. (eds.): COST-242 1996. LNCS, vol. 1155. Springer, Heidelberg (1996)
5. Más, I., Karlsson, G.: PBAC: Probe-based admission control. In: Smirnov, M., Crowcroft, J., Roberts, J., Boavida, F. (eds.) QofIS 2001. LNCS, vol. 2156, pp. 97–109. Springer, Heidelberg (2001)
6. Jain, R.: The art of computer systems performance analysis. Wiley Professional computing. John Wiley, Chichester (1991)
Convergence of Intra-domain Routing with Centralized Control
Jing Fu1, Peter Sjödin2, and Gunnar Karlsson1
1 ACCESS Linnaeus Center, School of Electrical Engineering
2 School of Information and Communication Technology
KTH, Royal Institute of Technology, SE-100 44, Stockholm, Sweden
{jing,psj,gk}@kth.se
Abstract. The decentralized control scheme for routing in current IP networks has been questioned, and a centralized routing scheme has been proposed as an alternative. In this paper, we compare the convergence of the centralized control scheme with that of decentralized link-state routing protocols. We first identify the components of the convergence time. Thereafter, we study how to achieve fast routing convergence in networks with centralized control. In particular, we analyze how to distribute forwarding information efficiently. Finally, we perform simulation studies on the convergence time for both real and synthetic network topologies, and study the impact of control element location, link weights, and number of failures on the convergence time. The results show that the centralized control scheme can provide faster routing convergence than link-state routing protocols.
1 Introduction
There are architectural designs for current IP networks which could be reconsidered. While the management of IP networks in an autonomous system (AS) is carried out through centralized servers, control functions, such as routing, are performed in a decentralized fashion on individual routers. A centralized routing scheme simplifies the management and the control processes, and reduces the functions required in the routers. As the architectural advantages and challenges of centralized control are well discussed in research papers [1] [2] [3], we focus on studying the intra-domain routing convergence upon link failures. We compare the convergence process in the centralized control scheme with decentralized link-state routing protocols, and identify the components of the convergence time in both cases. Thereafter, we present how to achieve fast routing convergence in networks with centralized control. In particular, we analyze how to distribute FIBs from the control element (CE) to the routers. We provide a thorough analysis of the number of updated FIB entries upon single link failures, and propose an efficient scheme to distribute the forwarding information. Finally, we present simulation results on convergence time for both real and synthetic topologies for varying network sizes. Also, we investigate the impact of control element location, link weights, and number of failures on the convergence time.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 518–529, 2008. © IFIP International Federation for Information Processing 2008
The rest of the paper is organized as follows. Section 2 overviews the related work. Section 3 presents the convergence process in IP networks, for both the decentralized link-state routing protocols and the centralized control scheme. Section 4 studies how to perform FIB updates, including how to distribute forwarding information efficiently. Section 5 shows our simulation setup. Section 6 presents and analyzes the results. Finally Section 7 concludes the paper.
2 Related Work
The importance of network management and control has been well recognized by research communities. In particular, there have been research initiatives focusing on centralized control in IP networks recently. The position papers [1] [2] advocate a 4D architecture, which decomposes the network operation into decision, dissemination, discovery, and data planes. The 4D architecture moves the control functions from individual routers to a centralized server, which enables direct control of IP networks. The evolution of IP networks and distributed router architectures has been reviewed in [3]. In particular, the centralization of control functions has been identified as a logical next step. Eventually, an entire AS may be considered as a distributed router. Based on the vision of the 4D system architecture, a route control platform (RCP) has been developed to control inter-domain routing centrally [4] [5]. RCP computes the BGP routing table for an entire AS in a centralized server, and distributes the BGP routing table to individual routers. However, the intra-domain routing is still performed on individual routers. The convergence of intra-domain routing protocols has been well studied. The measurement results for the components of the convergence time in OSPF have been presented [6]. Also, a variety of studies have been performed to reduce the convergence time in OSPF and IS-IS [7] [8] [9].
3 Convergence in IP Networks
In this section, we first summarize the convergence process in decentralized link-state routing protocols. Thereafter, we present how to achieve fast convergence in networks with centralized control. Finally, we analyze the components of the convergence time.
3.1 Convergence in OSPF
OSPF and IS-IS are the link-state intra-domain protocols used in IP networks today. Since the convergence processes of OSPF and IS-IS are identical, we use the OSPF terminology in the rest of the paper for ease of reading. The convergence process of OSPF after a link failure can be summarized as follows. First, each router exchanges hello messages with its neighboring routers to determine its local topology; a link failure can be detected after loss of three consecutive hello messages. When a link failure is detected, the router generates a link-state advertisement (LSA) describing the topology change. The LSA is flooded reliably to all neighboring routers. Upon reception of an LSA, it is reflooded to all routers except to the routers connected to the interface where the LSA arrived. After reflooding, the router recomputes its shortest path tree (SPT) and updates the FIB by merging the SPT and the BGP routing table. The components of the convergence time are presented in the list below.
– Failure detection time
– LSA generation time
– LSA flooding time
– SPT computation time
– FIB update time
3.2 Convergence in Networks with Centralized Control
In the centralized control scheme, a link failure can be detected in the same way as in OSPF. After that, the router reports the failure to the CE by sending a unicast LSA. When the LSA reaches the CE, it recomputes the SPTs for all routers in the network. Thereafter, the CE updates the FIBs by generating and distributing them to the routers. Each router receiving the updated FIB installs it on the line cards. The components of the convergence time are listed below.
– Failure detection time
– LSA generation time
– LSA reporting time
– SPT computation time
– FIB update time
3.3 Components of Convergence Time
The comparison of the time components in both control schemes shows that the failure detection time and LSA generation time are similar, while there are differences between the LSA flooding time and LSA reporting time. In addition, the SPT computation and FIB updates are performed differently in the two control schemes.
Failure detection time. In OSPF, the default hello timer is 10 seconds, which means it requires 40 seconds to detect a link failure. Recent work suggests that the hello timer can be set to the sub-second range to provide faster failure detection [7] [9].
LSA generation time. After a link failure is detected, an LSA is generated by the router. The generation time upon a single failure is fairly short, between 4 ms and 12 ms [9].
LSA flooding time and reporting time. In OSPF, the flooding time of an LSA depends on the link latencies and LSA processing time. The latency of a link is approximately proportional to the distance between the routers, and can be between 10 μs and 66 ms according to the measurement of a real ISP topology, AS 1239, the Sprint network [10]. The LSA processing time is between 2 and 12 ms as measured in [6] [9]. In addition, a pacing delay of 33 ms is introduced to specify the minimum time interval between sending two consecutive LSAs through the same interface. In the centralized control scheme, the time for an LSA to reach the CE depends on the link latencies along the path and the forwarding delay in each router. The forwarding delay is in the range of tenths of microseconds and is negligible.
SPT computation time. In OSPF, each router needs to determine its SPT, which is a single-source shortest path problem. In the centralized control scheme, the CE needs to determine the SPTs for all routers, which is an all-pairs shortest path problem since it has to find shortest paths between all pairs of nodes. In addition to Dijkstra’s shortest path first (SPF) algorithm [11], incremental SPF (iSPF), which constructs a new SPT based on the previous SPT, is available [12] [13]. Using iSPF in a network with 700 nodes, the average computation times are 3 μs and 2 ms respectively for single-source and all-source problems [13]. Finally, the SPF delay parameter specifies the time between the arrival of a new LSA and the start of the SPF computation, which by default is 5 seconds. The parameter is set to reduce the amount of computation in routers and to limit route oscillation.
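The difference between the single-source problem solved by each OSPF router and the all-source problem solved by the CE can be illustrated with a minimal Dijkstra sketch. This is plain SPF, not the incremental iSPF variant mentioned above; the graph representation (a dict of dicts of link weights) and the function names are assumptions made for this example.

```python
import heapq


def dijkstra(graph, source):
    """Single-source SPF: returns (distance, parent) maps describing the SPT rooted at source."""
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path to u was already found
        for v, weight in graph[u].items():
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))
    return dist, parent


def all_sources_spt(graph):
    """What the CE has to do: one SPT per router in the network."""
    return {router: dijkstra(graph, router) for router in graph}


if __name__ == "__main__":
    # Hypothetical four-router topology with symmetric link weights.
    topology = {
        "A": {"B": 1, "C": 4},
        "B": {"A": 1, "C": 2, "D": 5},
        "C": {"A": 4, "B": 2, "D": 1},
        "D": {"B": 5, "C": 1},
    }
    dist, parent = dijkstra(topology, "A")
    print("SPT from A:", dist, parent)
    print("routers with an SPT at the CE:", sorted(all_sources_spt(topology)))
```

Running the single-source routine once per router is the simplest way to obtain all SPTs; it makes clear why the CE's computation grows with the number of routers, whereas a router in OSPF only ever runs it for itself.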
4 FIB Updates
In OSPF, the FIB updates are performed in the routers and consist of FIB generation and FIB installation. In the centralized control scheme, in addition to FIB generation and FIB installation, FIBs need to be distributed to the routers.
4.1 FIB Generation
The FIB generation process is similar in both OSPF and the centralized control scheme, except that the FIBs are generated in the routers in OSPF, and in the CE in the centralized control scheme. The FIB generation is to merge the computed SPT in OSPF and the BGP routing table. A BGP route entry contains a number of attributes such as prefix, nexthop, and AS-path. When an AS receives a BGP route from another AS, the BGP nexthop attribute is set to the IP-address of the border router which announced the route. All packets matching a prefix are forwarded to the specified BGP nexthop according to the route entry. However, not all routers are linked directly to the border router; therefore, each router needs to determine the shortest path to the border router by consulting the SPT. The FIB generation process is simply to replace the BGP nexthop with a direct nexthop, which routers can reach directly. The FIB generation time is in the range of milliseconds according to our implementation.
4.2 FIB Installation
After a FIB is generated, it is installed on the router’s line cards. The FIB installation time is heavily dependent on the technology used in the line cards. For ternary content addressable memory (TCAM) based line cards, the installation time is linearly dependent on the number of updated entries. Recent results show that a TCAM can perform 1.3 million updates per second [14]. Therefore, updating a FIB entry requires about 0.77 μs. Current BGP tables contain approximately 160000 entries [15]. The number of OSPF entries is approximately equal to the number of links in the AS, which is between tens and thousands of entries depending on the network size. The number of FIB entries is the sum of the number of BGP entries and OSPF entries. Since each entry requires 0.77 μs to update, the total update time is 123 ms for a FIB of 160000 entries.
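The installation-time estimate above is a simple linear calculation, sketched below. The constant and function names are illustrative, and the per-entry time is rounded to 0.77 μs as in the text before multiplying.

```python
TCAM_UPDATES_PER_SECOND = 1_300_000  # update rate reported for TCAM [14]


def entry_update_time_us() -> float:
    """Time to update a single FIB entry, in microseconds."""
    return 1e6 / TCAM_UPDATES_PER_SECOND


def install_time_ms(entries: int, per_entry_us: float = 0.77) -> float:
    """Linear installation time for a number of updated entries, in milliseconds."""
    return entries * per_entry_us / 1000.0


if __name__ == "__main__":
    print(f"per-entry update time: {entry_update_time_us():.2f} us")
    print(f"full FIB of 160000 entries: {install_time_ms(160_000):.1f} ms")
```

Because the cost is linear in the number of updated entries, installing only the entries that actually changed after a failure shortens this component proportionally.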
Finally, the FIB generation and the FIB installation in routers can be performed in parallel. The FIB generation requires CPU usage and the FIB installation time is mostly for accessing the TCAM, and installation of FIB entries on TCAM can be started without waiting for the entire generation process to complete. Since the FIB generation is faster than the FIB installation, the FIB generation time can be neglected.
4.3 FIB Distribution
FIBs are distributed from the CE to the routers in the centralized control scheme. The distribution time is the sum of the transmission delay and the propagation delay. The transmission delay of the generated FIBs depends on the size of the FIBs and the link capacity. A FIB entry requires 6 bytes: 4 bytes for the prefix, 5 bits for the prefix length, and 11 bits to represent 2048 potential nexthops. Therefore, a FIB of 160000 entries is about 940 kB in total. When transmitting the entire FIB on a 1 Gb/s link, the transmission delay is approximately 8 ms. The total time of transmitting all FIBs to the routers is the sum of times for transmitting individual FIBs. However, if there are multiple links connected to the CE, multiple FIB transmissions can be performed in parallel, reducing the total transmission delay by a factor of E, where E is the number of links connected to the CE. In a network of 1000 nodes with 4 links connected to the CE, the total transmission delay is about 2 seconds, which can be a dominant factor in the convergence time.
4.4 Number of Updated FIB Entries
The discussion on FIB distribution in the previous section assumes that the entire FIB is distributed. However, the number of updated FIB entries upon a link failure could vary significantly. In addition, upon a link failure, the number of updated FIB entries varies for each router in the network.
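Counting updated FIB entries per router amounts to comparing a router's FIB before and after a failure. The sketch below models a FIB as a dict from prefix to nexthop and sums the changes over all routers; this representation, the function names, and the choice of denominator (all FIB entries in the network after the failure) are assumptions for illustration only.

```python
def updated_entries(old_fib: dict, new_fib: dict) -> int:
    """Number of prefixes whose nexthop was added, removed, or changed."""
    changed = 0
    for prefix in old_fib.keys() | new_fib.keys():
        if old_fib.get(prefix) != new_fib.get(prefix):
            changed += 1
    return changed


def network_update_fraction(before: dict, after: dict) -> float:
    """Fraction of all FIB entries in the network that must be redistributed."""
    total = sum(len(fib) for fib in after.values())
    changed = sum(updated_entries(before[r], after[r]) for r in after)
    return changed / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical two-router network; only r1 reroutes one prefix.
    before = {
        "r1": {"10.0.0.0/8": "r2", "192.168.0.0/16": "r3"},
        "r2": {"10.0.0.0/8": "r1"},
    }
    after = {
        "r1": {"10.0.0.0/8": "r3", "192.168.0.0/16": "r3"},
        "r2": {"10.0.0.0/8": "r1"},
    }
    print(f"updated fraction: {network_update_fraction(before, after):.1%}")
```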
To have a realistic understanding of how many FIB entries are updated upon a single link failure, we have performed a study on the Swedish University Network (SUNET), which consists of 54 routers [16]. We have generated 20 random link failures; by comparing the FIBs before and after each link failure, we obtain the number of updated FIB entries in each router. The sum of these numbers from all routers is the total number of updated FIB entries in the entire network, which is the total number of FIB entries the CE needs to distribute. Fig. 1 shows the percentage of updated FIB entries in the entire network upon each failure, sorted in increasing order. The figure shows that for SUNET routers, about half of the link failures have no large impact on the FIBs. In addition, at most 11%, and on average 2.2%, of the FIB entries are updated. Therefore, in the real case, significantly fewer FIB entries are updated, and the transmission delay should be significantly shorter than in our worst-case analysis that assumes distribution of full FIBs for all routers.
4.5 An Alternative Approach to FIB Distribution
Although we have shown that at most 11% of the FIB entries need to be distributed upon a single link failure, the transmission delay could still be high for large networks. 11% of the 2 seconds required for distribution of full FIBs in networks with 1000 nodes is still 220 ms; therefore, alternative approaches to distribute FIBs are desired.
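The distribution delays quoted above follow from the FIB entry size, the link rate, and the number of links at the CE. The sketch below reproduces that arithmetic without the intermediate rounding used in the text, so it yields slightly smaller values than the rounded figures of roughly 8 ms, 2 seconds and 220 ms. All names are hypothetical, and propagation delay is ignored.

```python
ENTRY_BYTES = 6  # 4-byte prefix + 5-bit prefix length + 11-bit nexthop index


def fib_bytes(entries: int) -> int:
    """Size of a FIB with the given number of entries, in bytes."""
    return entries * ENTRY_BYTES


def transmission_delay_s(n_bytes: float, link_bps: float) -> float:
    """Time to push n_bytes onto a link of the given bit rate."""
    return n_bytes * 8 / link_bps


def total_distribution_s(routers: int, ce_links: int, entries: int,
                         link_bps: float, fraction: float = 1.0) -> float:
    """Total transmission delay when the CE sends `fraction` of each FIB,
    spreading the transfers evenly over its `ce_links` links."""
    per_fib = transmission_delay_s(fib_bytes(entries) * fraction, link_bps)
    return routers * per_fib / ce_links


if __name__ == "__main__":
    one_fib = transmission_delay_s(fib_bytes(160_000), 1e9)
    full = total_distribution_s(1000, 4, 160_000, 1e9)
    partial = total_distribution_s(1000, 4, 160_000, 1e9, fraction=0.11)
    print(f"one full FIB: {one_fib * 1e3:.2f} ms")
    print(f"1000 routers, 4 CE links, full FIBs: {full:.2f} s")
    print(f"same network, 11% of entries: {partial * 1e3:.0f} ms")
```

The even split over the CE links is an idealization; in practice the benefit depends on how the routers are spread behind those links.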
Fig. 1. Percentage of FIB entries to be updated in the network upon single link failure, 20 random link failure locations (x-axis: simulation, 0 to 20; y-axis: percentage of FIB entries to be updated, 0 to 12%)
The separation of intra-domain and inter-domain routing has been proposed in [4]. The separation simplifies the routing process, and is sound from a network architecture point of view. An alternative approach to distribute FIBs is then to maintain two tables, both being distributed from the CE. The first table is a BGP routing table, mapping BGP prefixes to BGP nexthops. The size of this table is equal to the number of BGP prefixes. However, the BGP prefixes do not change upon a local topology change in the network. The second table is an internal table that maps BGP nexthops to direct nexthops, and it also contains intra-domain routes. The number of BGP nexthops is small, about 130 for the SUNET. Combining it with the intra-domain routes, the number of entries is about 500 in total. Although this number grows for larger networks, not all entries need to be updated upon a single link failure. Therefore, we assume that distributing 500 entries is adequate for single link failures. In this approach, the internal tables are distributed from the CE to the routers upon local topology change. Using the BGP and the internal table, the FIB is constructed in each router. This construction process is similar to the FIB generation process, and is faster than FIB installation on TCAM. Therefore, the construction time can be neglected since construction and installation can be performed in parallel.
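A minimal sketch of the two-table construction described above: the router keeps a large, rarely changing BGP table and a small internal table, and rebuilds its FIB by resolving each BGP nexthop to a direct nexthop. Splitting the internal table into a nexthop map and a set of intra-domain routes, as well as all names and addresses, are assumptions made for this example.

```python
def build_fib(bgp_table: dict, nexthop_map: dict, intra_routes: dict) -> dict:
    """Construct a FIB (prefix -> direct nexthop) from the two tables.

    bgp_table:    BGP prefix -> BGP nexthop (border router address)
    nexthop_map:  BGP nexthop -> direct nexthop (part of the internal table)
    intra_routes: intra-domain prefix -> direct nexthop (rest of the internal table)
    """
    fib = dict(intra_routes)
    for prefix, bgp_nexthop in bgp_table.items():
        direct = nexthop_map.get(bgp_nexthop)
        if direct is not None:  # skip prefixes whose border router is unreachable
            fib[prefix] = direct
    return fib


if __name__ == "__main__":
    bgp_table = {"203.0.113.0/24": "10.0.0.9", "198.51.100.0/24": "10.0.0.9"}
    intra_routes = {"10.0.0.0/24": "10.0.1.1"}

    # Before the failure the border router 10.0.0.9 is reached via 10.0.1.1.
    print(build_fib(bgp_table, {"10.0.0.9": "10.0.1.1"}, intra_routes))

    # After a link failure the CE only redistributes the small internal table.
    print(build_fib(bgp_table, {"10.0.0.9": "10.0.2.1"}, intra_routes))
```

In this sketch a single changed entry in the internal table redirects every BGP prefix that shares the affected nexthop, which is why the amount of data the CE has to distribute after a local topology change stays small.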
5 Simulation Setup
To evaluate the convergence time in various setups, we have developed a Java simulation tool based on the SSFNet simulation framework [17]. The network topologies and the parameters used in the simulations are described and specified in the following subsections.
5.1 Network Topologies
We have created several synthetic router-level topologies using the Waxman [18] and the BA [19] models. The number of nodes is varied between 10 and 1000 to demonstrate the scalability. In both models, nodes are placed in a square, with the length of each side corresponding to a latency of 30 ms. The link weights have either been set proportionally to the link latencies or been set to unit weights for all links. In addition to the synthetically generated topologies, we have used two real ISP topologies, which are measured in Rocketfuel based on the traceroute technique [20]. The first topology we used is from AS 1755, the Ebone network in Europe, which consists of 88 backbone routers and 161 links. The second topology is from AS 1239, the Sprint network in the USA, which consists of 323 routers and 975 links. The link latencies of the real topologies are set based on the end-to-end measurements performed in [10]. The links have either been given unit weights or given the real weights inferred in [10].
5.2 Providing Communication Paths in the Network
One challenge in the centralized control scheme is to provide robust communication paths between the CE and the routers, in particular upon link and router failures, since the routers rely on the network to report the topology changes, and the CE relies on the network to control the routers. There are several classes of solutions to address this problem: source routing, spanning-tree protocols and flooding. A practical solution used in our simulations is as follows: Upon a link failure, at least one of the routers does not rely on the failed link to reach the CE. Upon receipt of an LSA, the CE determines which routers can be reached; for unreachable routers, source routing is used.

Table 1. Simulation parameter settings, default values in bold font

Name                              Value
Network topology                  Waxman, BA, AS 1755, AS 1239
Network size                      10, 88, 100, 323, 1000
Link bandwidth                    1 Gb/s
Link latencies                    1-66 ms
Link weights                      latency-based, real, uniform
LSA generation time               8 ms
LSA processing time               2, 12 ms
All-pairs SPF computation time    0, 3 ms
LSA pacing delay                  33 ms
SPF delay                         5 seconds
Routing entries to distribute     500
FIB entries to update             160k
FIB entry update time             0.8 μs
Number of failures                1, 5
5.3 Parameter Settings Table 1 summarizes the parameters used in the simulations, which include network topology and protocol-related parameters. The topology-related parameters are described in the previous subsection, and the protocol-related parameters are derived from the time components discussed in Sections 3 and 4. In particular, we have used the approach of maintaining two tables on routers as explained in Section 4.5, and 500 entries are distributed upon a link failure. Also, we have excluded the failure detection time from the convergence time since it is the same in both control schemes.
Convergence of Intra-domain Routing with Centralized Control
Parameters with multiple values have been set to explore their impact on the convergence time. The default values for these parameters are shown using bold fonts. For link weight settings, latency-based weight setting is the default for synthetic topologies and real link weight setting is the default for real ISP topologies. For the SPF computation time, we assume it is 3 ms for all-pairs SPF computation for networks of 1000 nodes, while for other setups, we neglect the computation time since it is below 1 ms.
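As a rough illustration of why the number of entries matters, the following sketch multiplies the per-entry FIB update time from Table 1 by the two entry counts listed there. It assumes entries are written strictly one after another, which is a simplification introduced here; it is not the convergence model of Sections 3 and 4, and the function name is hypothetical.

```python
# Values taken from Table 1. The sequential-update model is an assumption
# made for this sketch, not the paper's convergence model.
FIB_ENTRY_UPDATE_TIME_S = 0.8e-6   # 0.8 microseconds per FIB entry
FULL_FIB_ENTRIES = 160_000         # "FIB entries to update" in Table 1
SMALL_TABLE_ENTRIES = 500          # "Routing entries to distribute" in Table 1


def fib_update_time_ms(entries: int,
                       per_entry_s: float = FIB_ENTRY_UPDATE_TIME_S) -> float:
    """Time in milliseconds to write `entries` entries one after another."""
    return entries * per_entry_s * 1e3


full_ms = fib_update_time_ms(FULL_FIB_ENTRIES)
small_ms = fib_update_time_ms(SMALL_TABLE_ENTRIES)
print(f"160k entries: {full_ms:.1f} ms, 500 entries: {small_ms:.1f} ms")
```

Under this assumption the full table costs on the order of a hundred milliseconds, while the 500-entry table costs well under a millisecond.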
6 Results

6.1 Convergence Time with Respect to CE Location

The convergence time after a link failure in the centralized control scheme depends on the location of the CE. The CE location can be manually chosen by the network operator to reduce the convergence time. To find a reasonably good CE location, we have studied ten different CE locations in each network topology, and performed 50 simulations for each CE location to study the convergence time. The candlesticks in Fig. 2 show the min, max, 5th and 95th percentile, and median convergence time for a given CE location. As can be seen in the figure, the convergence time varies depending on the CE location, and for most of the topologies, clearly better and worse CE locations can be identified. The comparison of convergence time across the network topologies shows that larger networks generally lead to higher convergence times. Also, the convergence times of the Waxman topology of 100 nodes and the BA topology of 100 nodes are similar.
[Figure 2 panels: Waxman, 10 nodes; Waxman, 100 nodes; Waxman, 1000 nodes; BA, 100 nodes; AS 1755; AS 1239. X-axis: CE location (0 to 10); y-axis: convergence time (5200 to 5600 ms).]
Fig. 2. Impact of CE location on convergence time for various network topologies. X-axis shows different CE locations, and y-axis shows the convergence time in ms. Each candlestick represents min, max, 5th and 95th percentile, and median convergence time for a given CE location. (5000 ms SPF delay is added as it is imposed by the standard)
J. Fu, P. Sjödin, and G. Karlsson
[Figure 3 panels: (a) Networks with 10 nodes; (b) Networks with 100 nodes; (c) Networks with 1000 nodes; (d) Real networks. X-axis: simulation (0 to 50); y-axis: convergence time (5200 to 5600 ms). Curves in (a) to (c): Waxman and BA, each with OSPF, centralized with uniform weights, and centralized with latency-based weights. Curves in (d): AS1239 and AS1755, each with OSPF, centralized with real weights, and centralized with uniform weights.]
Fig. 3. A comparison of convergence time in OSPF and the centralized control scheme, for latency-based, real, and uniform link weights. Each curve represents 50 independent simulations with convergence time sorted in increasing order.
In the figure, all convergence times are above 5 seconds, due to the SPF delay of 5 seconds imposed by the CE according to the OSPF specification. If a lower SPF delay is set to achieve faster routing convergence, our results remain valid; only the offset on the y-axis of the figure is reduced accordingly. Finally, a smaller SPF delay results in a larger percentage difference between the best and the worst convergence time, which in turn increases the importance of a good CE location.

6.2 Convergence Time Comparison with OSPF

Using the best CE locations identified in the previous experiments, we compare the convergence time in the centralized control scheme with OSPF. In OSPF, the convergence time does not depend on link weights since it is based on flooding. However, in the centralized control scheme, appropriately set link weights reduce the time to report the failure and reduce the propagation delay when distributing FIBs. Fig. 3 shows the convergence time for OSPF and the centralized control scheme. Each curve in the figure represents the convergence time for 50 simulations upon a single link failure at random locations, sorted in increasing order. The results for synthetic topologies in Fig. 3(a, b, c) show that the convergence time in OSPF is shorter than in the centralized control scheme if uniform link weights are used. However, when latency-based link weights are used, the centralized
scheme has a shorter convergence time. This particularly highlights that fast communication paths between the CE and the routers are important for fast convergence. The results for real topologies in Fig. 3(d) show that the convergence time for the centralized control scheme is shorter than for OSPF, especially for real link weights. The difference in convergence time between using uniform and real link weights is about 40 ms. The reasons for the higher convergence time in OSPF compared with the centralized control scheme are as follows. First, in the centralized control scheme, only a small table of 500 entries is distributed upon a single link failure, which reduces the transmission delay significantly. Second, the fast incremental all-pairs shortest path algorithm reduces the computation time in the centralized control scheme. Third, the LSA processing delay of 12 ms results in a higher convergence time in OSPF.

6.3 Convergence Time with Varying LSA Processing Time

The LSA processing time varies for different routers and software applications, and future architectures and technologies may result in shorter processing times. Therefore, we study its impact on the convergence time. Fig. 4 shows the convergence time with varying LSA processing time. As can be seen in the figure, when the LSA processing time is 12 ms, the centralized control scheme performs better than OSPF. However, when the LSA processing time is 2 ms, OSPF has the lower convergence time. During the flooding process in OSPF, each router needs to process the LSA, whereas in the centralized control scheme, the LSA is processed only once, in the CE. Therefore, a shorter LSA processing time has a larger positive impact on the convergence time in OSPF.
[Figure 4 plot: convergence time (5200 to 5600 ms) over 50 simulations. Curves: Centralized, 12 ms LSA processing time; Centralized, 2 ms LSA processing time; OSPF, 12 ms LSA processing time; OSPF, 2 ms LSA processing time.]
Fig. 4. Impact of LSA processing time on the convergence time, Waxman topology of 100 nodes
[Figure 5 plot: convergence time (5200 to 5600 ms) over 50 simulations. Curves: OSPF, 1 failure; Centralized, 1 failure; OSPF, 5 failures; Centralized, 5 failures.]
Fig. 5. Impact of multiple link failures on the convergence time, Waxman topology of 100 nodes
6.4 Convergence Time Upon Multiple Link Failures

Multiple link failures may occur simultaneously in the network. Therefore, we present the convergence time upon multiple link failures for both control schemes in Fig. 5. As can be seen in the figure, for single link failures, the difference in convergence time between OSPF and the centralized control scheme is about 40 ms. However, when there are five simultaneous link failures, the difference in convergence time becomes larger,
to about 100 ms. This indicates that the centralized control scheme is more resilient against multiple link failures. In OSPF, each router detecting the failure floods out an LSA, and a pacing delay of 33 ms is applied in routers when reflooding two consecutive LSAs, which increases the convergence time. In the centralized control case, by contrast, the extra convergence time upon multiple link failures is equal to the difference between the latest LSA arrival time at the CE and the LSA arrival time in the single link failure case.
7 Conclusions

In this work, we have studied the convergence process of intra-domain routing with centralized control, and compared it with the convergence process in OSPF. In particular, we have identified the components of the convergence time. Our analysis shows that in the centralized control scheme, the distribution of full FIBs of 160000 entries to the routers does not scale well. Therefore, we have investigated the total number of updated FIB entries upon single link failures. Our results show that at most 11%, and on average 2.2%, of the FIB entries need to be updated. Therefore, faster FIB distribution can be achieved by distributing only the updated entries. In addition, we proposed maintaining two separate lookup tables for FIB updates. Upon a link failure, only a small table of 500 entries needs to be distributed to the routers, which significantly reduces the FIB distribution time. We have also studied the impact of CE location, link weights, LSA processing time, and number of link failures on the convergence time. The results show that there are clearly better and worse CE locations in the centralized control scheme; therefore, a good CE location should be chosen to minimize the convergence time. The results also show that latency-based link weights reduce the convergence time for the centralized control scheme, since both the reporting of the failure and the distribution of the FIBs rely on fast routing paths. The study on LSA processing time shows that a short processing time favors OSPF, because the LSA is processed in all routers during the flooding process. For multiple link failures, the study shows that the convergence time in OSPF increases significantly more than in the centralized control scheme, due to the pacing delay. Finally, the comparison of the centralized control scheme and OSPF shows that a well-designed centralized control scheme can support networks of 1000 nodes, and may provide faster convergence than OSPF. As the centralized control scheme has architectural advantages, and its purpose is not primarily to reduce the convergence time in routing, our finding of a comparable convergence time makes the centralized control scheme a viable option for future IP networks.
References

1. Rexford, J., Greenberg, A., Hjalmtysson, G., Maltz, D.A., Myers, A., Xie, G., Zhan, J., Zhang, H.: Network-wide decision making: Toward a wafer-thin control plane. In: Proc. of ACM SIGCOMM HotNets Workshop (November 2004)
2. Greenberg, A., Hjalmtysson, G., Maltz, D.A., Myers, A., Rexford, J., Xie, G., Yan, H., Zhan, J., Zhang, H.: Clean slate 4D approach to network control and management. ACM SIGCOMM Computer Communication Review 35(5) (October 2005)
3. Császár, A., Enyedi, G., Hidell, M., Rétvári, G., Sjödin, P.: Converging the evolution of router architectures and IP networks. IEEE Network Magazine 21(4) (2007)
4. Feamster, N., Balakrishnan, H., Rexford, J., Shaikh, A., van der Merwe, J.: The case for separating routing from routers. In: Proc. of ACM SIGCOMM Workshop on Future Directions in Network Architecture (August 2004)
5. Caesar, M., Caldwell, D., Feamster, N., Rexford, J., Shaikh, A., van der Merwe, J.: Design and implementation of a routing control platform. In: Proc. of Networked Systems Design and Implementation (May 2005)
6. Shaikh, A., Greenberg, A.: Experience in black-box OSPF measurement. In: Proc. of the First ACM SIGCOMM Workshop on Internet Measurement (November 2001)
7. Alaettinoglu, C., Jacobson, V., Yu, H.: Towards millisecond IGP convergence. In: The North American Network Operators' Group (NANOG) 20, Washington, DC (October 2000)
8. Basu, A., Riecke, J.: Stability issues in OSPF routing. In: Proc. of ACM SIGCOMM 2001, New York, USA (August 2001)
9. Francois, P., Filsfils, C., Evans, J., Bonaventure, O.: Achieving sub-second IGP convergence in large IP networks. ACM SIGCOMM Comput. Commun. Rev. 35(3), 35–44 (2005)
10. Mahajan, R., Spring, N., Wetherall, D., Anderson, T.: Inferring link weights using end-to-end measurements. In: Proc. of ACM SIGCOMM Internet Measurement Workshop (2002)
11. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to algorithms, 2nd edn. MIT Press, Cambridge, Massachusetts (2001)
12. Narvaez, P., Siu, K.-Y., Tzeng, H.-Y.: New dynamic SPT algorithm based on a ball-and-string model. IEEE/ACM Trans. Networking 9, 706–718 (2001)
13. Demetrescu, C., Italiano, G.F.: A new approach to dynamic all pairs shortest paths. In: Proc. of the 35th Annual ACM Symposium on Theory of Computing (STOC 2003) (2003)
14. Wang, G., Tzeng, N.F.: TCAM-Based forwarding engine with minimum independent prefix set (MIPS) for fast updating. In: Proc. of IEEE ICC 2006 (June 2006)
15. Route Views Project, University of Oregon. http://www.routeviews.org/
16. Swedish University Network (SUNET), Available: http://www.sunet.se, http://www.stats.sunet.se
17. Renesys. SSFNet, Scalable simulation framework for network models. http://www.ssfnet.org/
18. Waxman, B.M.: Routing of multipoint connections. IEEE Journal on Selected Areas in Communications 6(9), 1617–1622 (1988)
19. Barabasi, A., Albert, R.: Emergence of scaling in random networks. Science 286 (October 15, 1999)
20. Spring, N., Mahajan, R., Wetherall, D.: Measuring ISP topologies with Rocketfuel. In: Proc. of ACM SIGCOMM 2002 (August 2002)
Improving the Interaction between Overlay Routing and Traffic Engineering

Gene Moo Lee (1) and Taehwan Choi (2)

(1) Samsung Advanced Institute of Technology, Yongin-si, Gyeonggi-do, 446-712, Korea, [email protected]
(2) Department of Computer Sciences, University of Texas at Austin, TX 78712, U.S.A., [email protected]
Abstract. Overlay routing has been successful as an incremental method to improve Internet routing by allowing its own users to select their logical routing. Meanwhile, traffic engineering (TE) is being used to reduce the overall network cost by adapting physical routing in response to varying traffic patterns. Previous studies [1, 2] have shown that the interaction of the two network components can cause large network cost increases and oscillations. In this paper, we improve the interaction between overlay routing and TE by modifying the objectives of both parties. For the overlay part, we propose TE-awareness, which bounds the selfishness of the overlay so that its actions do not adversely affect TE's optimization process. Then, we suggest COPE [3] as a strong candidate that achieves close-to-optimal performance for predicted traffic matrices and that handles unpredictable overlay traffic efficiently. With extensive simulation results, we show that the proposed methods can significantly improve the interaction, with lower network cost and smaller oscillations.
1 Introduction
Overlay routing has been proposed as an incremental method to enhance the current Internet routing without requiring additional functionality from the IP routers. Overlay techniques have been successful for many applications, including application-layer multicast [4, 5, 6], web content distribution [7], and overlay routing [8, 9]. In an overlay network, overlay nodes form an application-layer logical network on top of an IP-layer network. Overlay networks enable users to make routing decisions at the application layer by relaying traffic among overlay nodes. Overlay routing can achieve better routes than default IP routing because some problematic and slow links can be bypassed. In addition, overlay routing can take advantage of some fast and reliable paths that cannot be used in default IP routing due to business relationships.

By its nature, overlay routing behaves selfishly [1, 10]. In other words, the overlay acts strategically to optimize its own performance. This nature of the overlay has an impact on the related components of the network. We say that overlay routing has a vertical interaction with the IP layer's traffic engineering (TE). Whenever an overlay network changes its logical routing, the underlay routing observes changes in the physical traffic pattern. Network operators then use TE techniques [11, 12, 13] to adapt the routing to the new traffic demands. This new routing, in turn, changes the link latency observed by the overlay network, and the overlay then makes another decision to change its routing. TE cares about the network as a whole, in order to provide better service to all users. However, the main objective of overlay routing is to minimize its own traffic latency. An interesting issue is therefore to understand the interaction between overlay routing and IP routing.

The interaction between overlay routing and TE was first addressed by Qiu et al. [1], where the authors investigate the interaction of overlay routing with OSPF and MPLS TE. Keralapura et al. [14] examine the interaction dynamics between the two layers of control from an ISP's view. Liu et al. [2] formulate the interaction as a two-player game, where the overlay attempts to minimize its delay and TE tries to minimize the network cost. That paper shows that the interaction causes a severe oscillation problem for each player and that both players lose as the interaction proceeds.

In this paper, we propose TE-aware overlay routing, which takes the objective of underlay routing into account instead of blindly optimizing its own performance. Moreover, we argue that both players are better off if the underlay routing is oblivious to the traffic demands. We suggest COPE [3] as a strong candidate for this purpose.

The paper is organized as follows. First, we formally describe the mathematical model in Section 2. Section 3 formulates the interaction of overlay routing and TE as a non-cooperative two-player game. Then various underlay routing schemes are described in Section 4, and TE-aware overlay routing is introduced in Section 5. Section 6 evaluates the proposed methods with extensive simulation results. Finally, Section 7 concludes the paper and gives future directions.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 530–541, 2008. © IFIP International Federation for Information Processing 2008
2 Model
In this section, we describe the mathematical model used throughout the paper. TE and overlay have different viewpoints of the network: network operators know the entire underlying structure of the physical network, whereas the overlay has a logical view of the network. Table 1 summarizes the notations for vertical interaction. First, we use a graph $G = (V, E)$ to denote an underlay network, where $V$ is the set of physical nodes and $E$ is the set of edges between nodes. We use $l$ or $(i, j)$ to denote a link and $cap(l)$ to refer to the capacity of link $l$. For the overlay network, we use a subgraph $G' = (V', E')$ of the underlay graph $G$. In $G'$, we use $i'$ to represent the overlay node built upon physical node $i$ in underlay graph $G$. Overlay node $i'$ is connected to $j'$ by a logical link $(i', j')$, which corresponds to a physical path from $i$ to $j$ in $G$.
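The two views can be sketched with plain dictionaries. The four-node topology, the capacities, and the names `cap`, `logical_to_physical`, and `physical_links` below are hypothetical and chosen only to illustrate how a logical link $(i', j')$ expands into a physical path.

```python
from typing import Dict, List, Tuple

PhysicalLink = Tuple[str, str]
LogicalLink = Tuple[str, str]

# Underlay graph G = (V, E): physical links with capacities cap(l).
# Hypothetical four-node example, not the topology used in the paper.
cap: Dict[PhysicalLink, float] = {
    ("1", "2"): 10.0,
    ("2", "3"): 10.0,
    ("1", "3"): 10.0,
    ("3", "4"): 10.0,
}

# Overlay graph G' = (V', E'): overlay nodes 1', 3', 4' sit on physical
# nodes 1, 3, 4. Each logical link (i', j') maps to a physical path in G.
logical_to_physical: Dict[LogicalLink, List[PhysicalLink]] = {
    ("1'", "3'"): [("1", "2"), ("2", "3")],
    ("3'", "4'"): [("3", "4")],
    ("1'", "4'"): [("1", "3"), ("3", "4")],
}


def physical_links(logical_path: List[LogicalLink]) -> List[PhysicalLink]:
    """Expand a logical path (a list of logical links) into physical links."""
    links: List[PhysicalLink] = []
    for logical_link in logical_path:
        links.extend(logical_to_physical[logical_link])
    return links


# Relaying from 1' to 4' through overlay node 3' uses three physical links.
relayed = physical_links([("1'", "3'"), ("3'", "4'")])
direct = physical_links([("1'", "4'")])
```

The relayed and direct logical paths share the physical link (3, 4), which is how overlay decisions end up changing the load seen by the underlay.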
Table 1. Notations for vertical interaction

Notation                  Meaning
$(i, j)$, $l$             physical link
$(i', j')$                logical link
$cap(l)$                  capacity of a physical link $l$
$v_{st}(l)$               flow of $d_{st}$ on link $l$
$f_{st}(l)$               fraction of $d_{st}$ on link $l$
$t(l)$                    traffic rate at link $l$
$d_{st}$                  total TE demand on physical node pair $(s, t)$
$d_{s't'}$                overlay demand on pair $(s', t')$
$d_{st}^{under}$          TE demand due to underlay traffic
$d_{st}^{overlay}$        TE demand due to overlay flow
$P^{(s't')}$              set of logical paths from $s'$ to $t'$
$\delta_p^{s't'}$         path mapping coefficient
$h_p^{(s't')}$            overlay flow on logical path $p$
Now, we need different notations for overlay and underlay traffic demands: $d_{st}$ indicates the total traffic demand from node $s$ to $t$, including overlay and non-overlay traffic, and $d_{st}$ is the sum of $d_{st}^{under}$ and $d_{st}^{overlay}$. $d_{st}^{under}$ refers to the background traffic from non-overlay demands. Next, it is important to differentiate $d_{st}^{overlay}$ from $d_{s't'}$: $d_{s't'}$ indicates the logical traffic demand from overlay node $s'$ to $t'$, whereas $d_{st}^{overlay}$ is the physical traffic demand on physical node pair $(s, t)$ generated by the overlay network. In other words, $d_{st}^{overlay}$ is computed by the overlay routing based on the current logical demand $\{d_{s't'} \mid \forall (s', t') \in E'\}$. The third group of notations is for the overlay routing. $P^{(s't')}$ is the set of logical paths from $s'$ to $t'$. $\delta_p^{s't'}$ is the path mapping coefficient, whose value is 1 if logical link $(s', t')$ is on logical path $p$, and 0 otherwise. $h_p^{(s't')}$ is the amount of overlay demand $d_{s't'}$ flowing on logical path $p$.

3 Vertical Interaction Game
Based on the formulations in the previous section, TE and overlay routing are coupled through the mapping from logical-level paths to physical-level links. We can formulate the interaction as a non-cooperative two-player game, as described in Fig. 1. The first player is the ISP's TE and the second player is the overlay routing on the user's side. The interaction consists of sequential moves of the two players: the players take turns, and each acts to optimize its own performance. Based on the overlay demand and the flow conditions on the physical links, the overlay calculates the optimal flows on the logical routing. These logical flows and the underlay background traffic are combined to form the total traffic matrix, which is the input for TE. Then TE optimizes its performance by adapting the flows on the physical links, which in turn affects the delays experienced by the overlay. This interaction continues until the two players reach a Nash equilibrium point [15, 16], if one exists.

[Figure 1 diagram: a loop between TE and Overlay. TE passes the underlay routing to Overlay; Overlay passes traffic demands to TE.]
Fig. 1. Vertical interaction game: TE determines the physical routing, which decides the link latency experienced by the overlay. Given the observed latency, the overlay optimizes its logical routing and changes the physical traffic demands, which, in turn, affects the underlay routing.

In game theory, a Nash equilibrium is a kind of optimal collective strategy in a game involving two or more players, where no player has anything to gain by changing only his or her own strategy. If each player has chosen a strategy and no player can benefit by changing his or her strategy while the other players keep theirs unchanged, then the current set of strategy choices and the corresponding payoffs constitute a Nash equilibrium. Liu et al. [2] prove the existence of a Nash equilibrium in a simple interaction game, where the topology consists of three nodes and there is a single demand between two nodes. Even though we can prove the existence of a convergent point, the interaction process does not guarantee that the two players' behaviors converge to the Nash equilibrium. Moreover, if the game gets complicated, it is even harder to anticipate the interaction process. The authors show that in a realistic scenario, both TE and overlay routing experience substantial performance loss due to the oscillation. The main direction of our work is to improve the vertical interaction between overlay routing and traffic engineering. First, we want the interaction game to converge faster, because the oscillation in this game degrades the performance of both players. Next, we try to reduce the performance variation during the transient oscillation process.
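The alternating-move structure, and the fact that it need not converge, can be illustrated with a toy sketch. The cost matrices below are hypothetical two-strategy games invented for illustration; they are not the TE or overlay optimization problems of this paper, and the function names are likewise assumptions.

```python
from typing import List, Tuple

Cost = List[List[float]]  # cost[a][b]: cost when player 1 plays a, player 2 plays b


def best_reply_1(cost1: Cost, b: int) -> int:
    """Strategy of player 1 that minimizes its own cost against b."""
    return min(range(len(cost1)), key=lambda a: cost1[a][b])


def best_reply_2(cost2: Cost, a: int) -> int:
    """Strategy of player 2 that minimizes its own cost against a."""
    return min(range(len(cost2[a])), key=lambda b: cost2[a][b])


def is_nash(cost1: Cost, cost2: Cost, a: int, b: int) -> bool:
    """True if neither player can lower its own cost by deviating alone."""
    return (cost1[a][b] <= min(row[b] for row in cost1)
            and cost2[a][b] <= min(cost2[a]))


def play(cost1: Cost, cost2: Cost, start: Tuple[int, int],
         max_rounds: int = 20) -> Tuple[Tuple[int, int], bool]:
    """Alternate best replies; stop at a Nash equilibrium or after max_rounds."""
    a, b = start
    for _ in range(max_rounds):
        if is_nash(cost1, cost2, a, b):
            return (a, b), True
        a = best_reply_1(cost1, b)  # player 1 (think: TE) moves first
        b = best_reply_2(cost2, a)  # player 2 (think: overlay) responds
    return (a, b), is_nash(cost1, cost2, a, b)


# Hypothetical game with pure Nash equilibria at (0, 0) and (1, 1).
settles = play([[1, 3], [2, 1]], [[1, 2], [3, 1]], start=(0, 1))

# Hypothetical game with conflicting objectives and no pure equilibrium:
# the alternating moves cycle instead of settling.
cycles = play([[0, 1], [1, 0]], [[1, 0], [0, 1]], start=(0, 0))
```

In the first game the players settle after one round of moves; in the second they keep undoing each other's choices, which is the kind of oscillation the interaction game can exhibit.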
4 Traffic Engineering
In the vertical interaction game, we first discuss three TE schemes as the first player: Multi-Protocol Label Switching (MPLS) [12], oblivious routing [17], and Common-case Optimization with Penalty Envelope (COPE) [3]. The output of TE is the IP-layer routing, which specifies how the traffic of each Origin-Destination (OD) pair is routed across the network. Typically, there is path diversity: there are multiple paths for each OD pair, and each path routes a fraction of the traffic.
First, Multi-Protocol Label Switching (MPLS) [12] provides efficient support for explicit routing, which is the basic mechanism for TE. Explicit routing allows a particular packet stream to follow a predetermined path. The combination of MPLS technology and its TE capabilities enables the network operator to load-balance the traffic demands adaptively to optimize the network performance. There are two possible ways to describe the network performance: maximum link utilization (MLU) and total link latency. Network operators worry about overloaded links, because these links can become a bottleneck for the performance of the whole network. A slight change in the traffic pattern may overload highly utilized links. Therefore, we want to minimize the MLU, and we use this metric to measure the TE performance in this paper. Given the traffic demand matrix, the goal of MPLS TE is to choose a physical link flow allocation that minimizes the MLU. MPLS TE optimizes the paths based on the currently observed traffic matrix. Unfortunately, measuring and predicting traffic demands are difficult problems. Flow measurements are rarely available on all links and ingress/egress points. Moreover, demands change over time due to special events such as DoS attacks, flash crowds, and internal or external network failures. Oblivious routing [17] was proposed to resolve this issue. It calculates an optimal routing that performs reasonably well independent of the traffic demands. In other words, this "demand oblivious" routing is designed with little knowledge of the traffic demands, taking only the topology and the link capacities into account. MPLS TE can be regarded as an extreme case of online adaptation. An advantage of this scheme is that it achieves the best performance for the current traffic demand. However, if traffic changes significantly and quickly, such a method can suffer a large transient penalty. Oblivious routing is a way to handle unpredicted traffic spikes. However, a potential drawback of completely oblivious routing is its sub-optimal performance for the normal traffic demand. Common-case Optimization with Penalty Envelope (COPE) [3] was proposed as a hybrid of prediction-based optimal routing and oblivious routing. COPE handles both dynamic traffic and dynamic inter-domain routes and, at the same time, achieves close-to-optimal performance for normal, predicted traffic matrices.
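To make the MLU metric concrete, the following sketch computes link loads $t(l) = \sum_{s,t} f_{st}(l)\, d_{st}$ from a routing and then the MLU. The triangle topology, the demand, and the function names are hypothetical; choosing the fractions optimally (what MPLS TE does) would require a linear program and is not shown.

```python
from typing import Dict, Tuple

Link = Tuple[str, str]
ODPair = Tuple[str, str]


def link_loads(demands: Dict[ODPair, float],
               routing: Dict[ODPair, Dict[Link, float]]) -> Dict[Link, float]:
    """t(l) = sum over OD pairs (s, t) of f_st(l) * d_st."""
    loads: Dict[Link, float] = {}
    for od, demand in demands.items():
        for link, fraction in routing.get(od, {}).items():
            loads[link] = loads.get(link, 0.0) + fraction * demand
    return loads


def max_link_utilization(loads: Dict[Link, float],
                         cap: Dict[Link, float]) -> float:
    """MLU = max over links of t(l) / cap(l); unused links count as load 0."""
    return max(loads.get(link, 0.0) / c for link, c in cap.items())


# Hypothetical triangle A, B, C with one demand from A to C.
cap = {("A", "C"): 10.0, ("A", "B"): 10.0, ("B", "C"): 10.0}
demands = {("A", "C"): 6.0}

# All traffic on the direct link.
single_path = {("A", "C"): {("A", "C"): 1.0}}
# Half on the direct link, half on the two-hop path via B.
split_path = {("A", "C"): {("A", "C"): 0.5, ("A", "B"): 0.5, ("B", "C"): 0.5}}

mlu_single = max_link_utilization(link_loads(demands, single_path), cap)
mlu_split = max_link_utilization(link_loads(demands, split_path), cap)
```

In this example, splitting the demand over two paths halves the MLU, which is the kind of path diversity TE exploits.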
5 Overlay Routing
Now, we formulate the strategy functions for overlay routing in the vertical interaction game. We start with the default overlay routing, which we term selfish overlay routing. Given the current underlay routing and the experienced link latency, the selfish overlay tries to minimize its total latency by changing the load on each logical path in the overlay network. Then, we propose a variation of overlay routing: a new optimization strategy in which the overlay takes the presence of traffic engineering into account. We term this TE-aware overlay routing.
5.1 Selfish Overlay Routing

The overlay routing algorithm determines a logical path flow allocation $\{h_p^{(s't')} \mid \forall s', t' \in V', \forall p \in P^{(s't')}\}$ that minimizes the delay experienced by the overlay traffic. By $h_p^{(s't')}$, we denote the logical overlay demand from $s'$ to $t'$ allocated to path $p$. Individual overlay users may choose their routes independently by probing the underlay network. However, we assume that a centralized entity calculates routes for all overlay users. Given the physical network topology, the underlay routing, and the experienced latency for each link, the optimal overlay routing can be obtained by solving the following non-linear optimization problem:

\[
\min \sum_{l} \frac{t(l)^{overlay}}{cap(l) - t(l)}
\]
subject to
\[
\begin{aligned}
& h_p^{(s't')} \text{ is a logical routing} \\
& \forall \text{ link } l: \quad t(l) = \sum_{s,t} f_{st}(l)\,\bigl(d_{st}^{under} + d_{st}^{overlay}\bigr) \\
& \forall \text{ link } l: \quad t(l)^{overlay} = \sum_{s,t} f_{st}(l)\, d_{st}^{overlay} \\
& \forall s, t \in V: \quad d_{st}^{overlay} = \sum_{s',t',p} \delta_p^{s't'}\, h_p^{(s't')}
\end{aligned}
\]

The first constraint ensures that the logical routing satisfies the logical flow conservation constraints. This can be expressed as follows:

\[
\begin{aligned}
& \forall s', t' \in V': \quad \sum_{p \in P^{(s't')}} h_p^{(s't')} = d_{s't'} \\
& \forall s', t' \in V': \quad h_p^{(s't')} \geq 0.
\end{aligned}
\]

Note that the objective of the problem is non-linear. However, we can linearize the non-linear part of the program [18].

5.2 TE-Aware Overlay Routing
Based on the selfish overlay routing, we can change the optimization strategy to make the overlay TE-aware. By TE-awareness, we mean that the selfishness of the overlay is bounded so that the actions of the overlay do not adversely affect TE's optimization process. The basic idea is as follows: (1) when the current latency is below the average latency, the overlay tries to minimize its own traffic amount, provided that the current latency is preserved (load-balancer); (2) if the latency is above the average, the overlay changes the logical routing to improve the latency but, at the same time, avoids overloading any specific link (limited-optimizer).
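The two-mode rule above can be sketched as a small selector. How the "average latency" is computed is not specified in the text; the sketch assumes it is the mean of the latencies observed in previous runs, and it assigns the unspecified tie case to the limited-optimizer. Both choices, and the function name, are assumptions.

```python
from typing import Sequence


def choose_overlay_mode(current_latency: float,
                        past_latencies: Sequence[float]) -> str:
    """Pick the TE-aware strategy for the next overlay move.

    Assumption: the reference "average latency" is the mean of the
    latencies observed in earlier runs.
    """
    average = sum(past_latencies) / len(past_latencies)
    if current_latency < average:
        # Latency is already good: reduce overlay traffic while keeping it.
        return "load-balancer"
    # Latency is poor (or equal to the average): improve it, with a load cap.
    return "limited-optimizer"


# Hypothetical latency values, for illustration only.
mode_good = choose_overlay_mode(70.0, [80.0, 90.0, 85.0])
mode_bad = choose_overlay_mode(95.0, [80.0, 90.0, 85.0])
```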
The first part, the load-balancer, can be formalized as follows:

\[
\min \sum_{s,t} d_{st}^{overlay}
\]
subject to
\[
\begin{aligned}
& h_p^{(s't')} \text{ is a logical routing} \\
& \forall \text{ link } l: \quad t(l) = \sum_{s,t} f_{st}(l)\,\bigl(d_{st}^{under} + d_{st}^{overlay}\bigr) \\
& \forall \text{ link } l: \quad t(l)^{overlay} = \sum_{s,t} f_{st}(l)\, d_{st}^{overlay} \\
& \forall s, t \in V: \quad d_{st}^{overlay} = \sum_{s',t',p} \delta_p^{s't'}\, h_p^{(s't')} \\
& \sum_{l} \frac{t(l)^{overlay}}{cap(l) - t(l)} \leq \Theta
\end{aligned}
\]

Here, the objective is to minimize the total amount of overlay traffic. The last constraint guarantees that the current latency is preserved ($\Theta$ in the last constraint denotes the current latency). Secondly, limited selfishness can be implemented by adding the following constraint to the default selfish overlay routing:

\[
\begin{aligned}
& \forall \text{ link } l: \quad \sum_{s,t} f_{st}(l)\, d_{st}^{overlay} \leq \theta \\
& \theta = \max\{\, t(l)^{overlay} \mid \forall \text{ link } l \,\}
\end{aligned}
\]

Here, $\theta$ is the maximum link load that the overlay generated in the previous run. The additional constraint limits the selfishness of the overlay and prevents specific links from being overloaded.
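The quantities in these constraints can be evaluated directly for a candidate allocation. The sketch below does not solve either optimization problem; it only computes the latency objective $\sum_l t(l)^{overlay} / (cap(l) - t(l))$, the bound $\theta$ from a previous run, and checks the two added constraints. The two-link example and all function names are hypothetical.

```python
from typing import Dict

Loads = Dict[str, float]


def overlay_latency(t_overlay: Loads, t_total: Loads, cap: Loads) -> float:
    """Objective of the selfish overlay: sum_l t(l)^overlay / (cap(l) - t(l))."""
    total = 0.0
    for link, capacity in cap.items():
        if t_total[link] >= capacity:
            raise ValueError(f"link {link} is saturated")
        total += t_overlay.get(link, 0.0) / (capacity - t_total[link])
    return total


def max_overlay_load(prev_t_overlay: Loads) -> float:
    """theta: the largest per-link overlay load generated in the previous run."""
    return max(prev_t_overlay.values())


def preserves_latency(t_overlay: Loads, t_total: Loads, cap: Loads,
                      big_theta: float) -> bool:
    """Load-balancer constraint: latency must not exceed the current value."""
    return overlay_latency(t_overlay, t_total, cap) <= big_theta


def within_load_bound(t_overlay: Loads, theta: float) -> bool:
    """Limited-optimizer constraint: every link's overlay load is <= theta."""
    return all(load <= theta for load in t_overlay.values())


# Hypothetical two-link example; background load is 4 on "a" and 2 on "b".
cap = {"a": 10.0, "b": 10.0}
prev_overlay = {"a": 4.0, "b": 1.0}
prev_total = {"a": 8.0, "b": 3.0}

big_theta = overlay_latency(prev_overlay, prev_total, cap)
theta = max_overlay_load(prev_overlay)

# Candidate move: spread the same overlay volume evenly over both links.
cand_overlay = {"a": 2.5, "b": 2.5}
cand_total = {"a": 6.5, "b": 4.5}
ok_latency = preserves_latency(cand_overlay, cand_total, cap, big_theta)
ok_load = within_load_bound(cand_overlay, theta)
```

A candidate that instead pushed all five units onto one link would fail the $\theta$ check, which is the behavior the limited-optimizer is meant to rule out.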
6 Simulation
This section describes the simulation results of vertical interactions. We first compare MPLS and COPE as the underlay TE schemes, then we evaluate two overlay schemes: TE-aware overlay and selfish overlay. We use the General Algebraic Modeling System (GAMS) [19] to implement the optimization procedures for the experiments. The interaction between the optimization programs is implemented by connecting the inputs and outputs of the GAMS programs through Perl scripts. Because we run the optimization process for more than a hundred iterations, we rely on Condor [20], a specialized workload management system for compute-intensive jobs.

6.1 Data Set Description
We perform extensive experiments on a 14-node Tier-1 POP topology described in [21]. The underlay network topology is given in Fig. 2. We have done experiments with other topologies and observed qualitatively consistent results [18].
[Figure 2 diagram: graph of 14 POPs (POP 1 to POP 14) with numeric labels on the links; the layout is not recoverable from the extracted text.]
Fig. 2. 14-node Tier-1 backbone topology: each node represents a Point-of-Presence (POP) and each link represents the aggregated connectivity between the routers belonging to a pair of adjacent POPs. Four POPs (3,6,7,11) are used as overlay nodes.
On top of the physical network, we assume a four-node full-mesh overlay network. For the traffic matrix, we generate synthetic traffic demands using the gravity model [21].

6.2 MPLS and COPE with Selfish Overlay
We start with a comparison between MPLS and COPE from the operator's viewpoint. We fix the overlay routing to be selfish and compare the performance of MPLS and COPE. In this experiment, we consider the performance of both TE and overlay. COPE needs a pre-specified penalty envelope value. We first calculate the value (1.9969) by running oblivious routing, and find the optimal routing which minimizes the oblivious ratio. Then, by multiplying the optimal oblivious ratio by 1.1, we set the penalty envelope value. We set 10% of the total traffic demand to be operated by the selfish overlay routing, and set the load scale factor to 0.5, which means that the maximum link utilization is 50% when all the demands use the default underlay routing without any action by the overlay. The experiment results are shown in Fig. 3. We can observe that COPE interacts better with the selfish overlay. MPLS TE suffers from substantially large oscillation throughout the interaction, whereas COPE achieves almost stable performance in its maximum link utilization. Similarly, the dynamics of the overlay latency are quite stable when interacting with COPE. Moreover, the average latency sometimes improves with COPE. We have also conducted more experiments to explore the impact of overlay fraction and link utilization on the vertical interaction [18]. With the extensive simulation, we find COPE to be a strong TE technique that achieves stable performance even when the selfish overlay traffic accounts for a significant portion of the total traffic demand.
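The load scale factor can be read as a rescaling of the traffic matrix. Because link loads are linear in the demands for a fixed routing, multiplying every demand by (target MLU) / (current MLU) yields the target MLU. The sketch below assumes this interpretation; the two-link example, the way the 10% overlay share is taken from the total volume, and the function name are hypothetical.

```python
from typing import Dict, Tuple

ODPair = Tuple[str, str]
Link = Tuple[str, str]


def scale_to_target_mlu(demands: Dict[ODPair, float],
                        loads: Dict[Link, float],
                        cap: Dict[Link, float],
                        target_mlu: float) -> Tuple[Dict[ODPair, float], float]:
    """Scale all demands so the MLU under the given (fixed) routing hits target.

    `loads` are the link loads that `demands` produce under the default
    underlay routing; loads scale linearly with the demands.
    """
    current_mlu = max(loads[link] / cap[link] for link in cap)
    factor = target_mlu / current_mlu
    return {od: d * factor for od, d in demands.items()}, factor


# Hypothetical example: two demands, each routed on its own direct link.
cap = {("A", "B"): 10.0, ("A", "C"): 10.0}
demands = {("A", "B"): 8.0, ("A", "C"): 4.0}
loads = {("A", "B"): 8.0, ("A", "C"): 4.0}

scaled, factor = scale_to_target_mlu(demands, loads, cap, target_mlu=0.5)

# Assumption: the overlay share is taken as 10% of the total scaled volume.
OVERLAY_FRACTION = 0.10
total_volume = sum(scaled.values())
overlay_volume = OVERLAY_FRACTION * total_volume
background_volume = total_volume - overlay_volume
```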
G.M. Lee and T. Choi
[Figure 3 plots. (a) overlay performance: overlay latency vs. run; (b) TE performance: MLU vs. run. Curves in both panels: COPE vs selfish, MPLS vs selfish.]
Fig. 3. MPLS and COPE with selfish overlay. 14-node topology with a 4-node overlay network, overlay fraction = 10%, load scale factor = 0.5. (MLU: Maximum Link Utilization).
6.3
TE-Aware Overlay and Selfish Overlay with MPLS
Now, we compare TE-aware overlay routing and selfish overlay routing to evaluate our proposed scheme. For the underlay routing, we again use MPLS and COPE. We start by comparing the two overlay routing schemes on top of MPLS TE. In Fig. 4, overlay routing carries 10% of the traffic and the load scale factor is 0.5. On the overlay side, TE-aware overlay routing achieves more stable latency, and its average latency is lower than that of the selfish overlay. Overlay routing can thus achieve better and more stable performance by taking the objective of the underlay routing into account. On the TE side, selfish overlay routing places a significant burden on the underlay routing because it generates a substantial amount of additional traffic, which shows up as sudden increases of the maximum link utilization. In contrast, the TE-aware overlay limits its selfishness and tries to avoid overloading any specific link with its own traffic, so the fluctuation of the maximum link utilization is smaller. We have also analyzed the impact of the overlay fraction and link utilization on the interaction [18], including experiments in which the network is substantially congested (90%-120%). Even then, the proposed method interacts better with the underlay than the selfish overlay does. From these extensive experiment results, we conclude that TE-aware overlay routing generally interacts stably with MPLS TE. Selfish overlay routing experiences less predictable latency and drives the maximum link utilization of the network significantly higher. With TE-awareness embedded in overlay routing, however, the overlay latency either converges or follows a regular pattern. Moreover, the overhead imposed on TE is reduced by the proposed overlay routing. Thus, TE-awareness yields a win-win outcome for both players in the presence of MPLS TE.
Improving the Interaction between Overlay Routing and Traffic Engineering
[Figure 4 plots. (a) overlay performance: overlay latency vs. run; (b) TE performance: MLU vs. run. Curves in both panels: MPLS vs selfish, MPLS vs TE-aware.]
Fig. 4. TE-aware overlay and selfish overlay on MPLS. 14-node topology with a 4-node overlay network, overlay fraction = 10%, load scale factor = 0.5 (MLU: Maximum Link Utilization).
6.4
TE-Aware Overlay and Selfish Overlay with COPE
For the last experiment, we compare TE-aware overlay and selfish overlay on top of COPE TE. In the previous experiments comparing COPE and MPLS, we observed that COPE interacts better with selfish overlay routing. The question now is how much we gain by using TE-aware overlay with COPE. Figure 5 shows the experiment results, where 10% of the traffic is routed by overlay routing and the load scale factor is 0.5. As in the previous sections, TE-aware overlay converges faster than the selfish overlay. However, compared to the oscillation in the MPLS experiments, the performance variation is negligible. Similar patterns can be observed for the maximum link utilization.
[Figure 5 plots. (a) overlay performance: overlay latency vs. run; (b) TE performance: MLU vs. run. Curves in both panels: COPE vs selfish, COPE vs TE-aware.]
Fig. 5. TE-aware overlay and selfish overlay on COPE. 14-node topology with a 4-node overlay network, overlay fraction = 10%, load scale factor = 0.5 (MLU: Maximum Link Utilization).
6.5
Simulation Summary
Considering the overlay side, in all experiments, the latency experienced by the TE-aware overlay is better than that of the selfish overlay. The maximum link latency experienced by the selfish overlay is sometimes twice as large as the average latency. In some scenarios, the selfish overlay latency keeps increasing as the interaction with the underlay proceeds. The TE-aware overlay shows a similar pattern but converges at a considerably lower latency. On the TE side, TE-awareness obtains performance similar to or better than selfishness. With COPE, the interaction is good for both the selfish overlay and the TE-aware overlay, but the TE-aware overlay performs slightly better than selfish overlay routing.
7
Conclusion and Future Direction
In this paper, we improve the vertical interaction between overlay routing and traffic engineering by modifying the objectives of both parties. We propose TE-aware overlay routing, which takes the objective of traffic engineering into account in overlay routing decisions. We also suggest COPE as a strong traffic engineering technique that interacts well with unpredictable overlay traffic demands. We show the feasibility of the proposed methods with extensive simulation results. The model used in this paper can be enhanced in several ways. First, we have captured vertical interaction within a single domain. The arguments can be extended to the inter-domain level, where overlay nodes are spread across several autonomous systems and cooperate with each other. The actions of the overlay will then interact with inter-domain routing algorithms. Second, multiple overlays may coexist on top of a shared underlay network. The decision of an overlay depends on probing information about logical paths. If overlays share some links but do not exchange routing information with each other, the currently observed link performance is likely to be changed by the actions of other overlays. This horizontal interaction among overlays is worth studying. Lastly, we use average latency as an indicator in TE-aware overlay routing, but this value may be difficult to obtain in practice. One could instead use a mechanism similar to TCP congestion control, in which the overlay additively increases its selfishness until it causes congestion and then exponentially decreases its selfishness to ease the congestion level.
Acknowledgement The authors would like to thank Yin Zhang for the discussions on this work.
References
1. Qiu, L., Yang, Y.R., Zhang, Y., Shenker, S.: On selfish routing in Internet-like environments. In: Proceedings of ACM SIGCOMM, Karlsruhe, Germany, August 2003, pp. 151–162 (2003)
2. Liu, Y., Zhang, H., Gong, W., Towsley, D.: On the interaction between overlay routing and traffic engineering. In: Proceedings of IEEE INFOCOM (2005)
3. Wang, H., Xie, H., Qiu, L., Yang, Y.R., Zhang, Y., Greenberg, A.: COPE: Traffic engineering in dynamic networks. In: Proceedings of ACM SIGCOMM (2006)
4. Chu, Y., Rao, S.G., Seshan, S., Zhang, H.: Enabling conferencing applications on the Internet using an overlay multicast architecture. In: Proceedings of ACM SIGCOMM, pp. 55–67 (2001)
5. Jannotti, J., Gifford, D.K., Johnson, K.L., Kaashoek, M.F., O’Toole, J.: Overcast: Reliable multicasting with an overlay network. In: Proceedings of Symposium on Operating Systems Design and Implementation (OSDI), pp. 197–212 (2000)
6. Subramanian, L., Stoica, I., Balakrishnan, H., Katz, R.H.: OverQoS: An overlay based architecture for enhancing internet QoS. In: Proceedings of the First Symposium on Networked Systems Design and Implementation (NSDI), pp. 71–84 (2004)
7. Akamai: http://www.akamai.com
8. Andersen, D.G., Balakrishnan, H., Kaashoek, M.F., Morris, R.: Resilient overlay networks. In: Proceedings of Symposium on Operating Systems Principles (SOSP), pp. 131–145 (2001)
9. Savage, S., Anderson, T., Aggarwal, A., Becker, D., Cardwell, N., Collins, A., Hoffman, E., Snell, J., Vahdat, A., Voelker, G., Zahorjan, J.: Detour: a case for informed internet routing and transport. Technical Report TR-98-10-05 (1998)
10. Chun, B.-G., Fonseca, R., Stoica, I., Kubiatowicz, J.: Characterizing selfishly constructed overlay routing networks. In: Proceedings of IEEE INFOCOM (2004)
11. Fortz, B., Thorup, M.: Internet traffic engineering by optimizing OSPF weights. In: Proceedings of IEEE INFOCOM, pp. 519–528 (2000)
12. Ruela, J., Ricardo, M.: MPLS - Multi-Protocol Label Switching. In: The Industrial Information Technology Handbook, pp. 1–9 (2005)
13. Stewart, I.J.W.: BGP4: inter-domain routing in the Internet. Addison-Wesley, Reading (1999)
14. Keralapura, R., Taft, N., Chuah, C.-N., Iannacco, G.: Can ISPs take the heat from overlay networks? In: Proceedings of the Third Workshop on Hot Topics in Networks (November 2004)
15. Dutta, P.K.: Strategies and games: Theory and practice (1999)
16. Nash, J.: Non-cooperative games. The Annals of Mathematics, 286–295 (September 1951)
17. Applegate, D., Cohen, E.: Making intra-domain routing robust to changing and uncertain traffic demands: Understanding fundamental tradeoffs. In: Proceedings of ACM SIGCOMM, pp. 313–324 (2003)
18. Lee, G.M., Choi, T., Zhang, Y.: Improving the interaction between overlay routing and traffic engineering. Technical Report TR-06-61, Department of Computer Sciences, University of Texas at Austin (2007)
19. GAMS: General Algebraic Modeling System: http://www.gams.com
20. Condor: http://www.cs.wisc.edu/condor
21. Medina, A., Taft, N., Salamatian, K., Bhattacharyya, S., Diot, C.: Traffic matrix estimation: existing techniques and new directions. In: Proceedings of ACM SIGCOMM, pp. 161–174 (2002)
Designing Optimal iBGP Route-Reflection Topologies
Marc-Olivier Buob¹, Steve Uhlig², and Mickael Meulle¹
¹ Orange Labs, {marcolivier.buob,michael.meulle}@orange-ftgroup.com
² Delft University of Technology, [email protected]
Abstract. The Border Gateway Protocol (BGP) is used today by all Autonomous Systems (AS) in the Internet. Inside each AS, iBGP sessions distribute the external routes among the routers. In large ASs, relying on a full-mesh of iBGP sessions between routers is not scalable, so route-reflection is commonly used. The scalability of route-reflection compared to an iBGP full-mesh comes at the cost of opacity in the choice of best routes by the routers inside the AS. This opacity induces problems like suboptimal route choices in terms of IGP cost, deflection and forwarding loops. In this work we propose a solution to design iBGP route-reflection topologies which lead to the same routing as with an iBGP full-mesh and have a minimal number of iBGP sessions. Moreover we compute a robust topology even if a single node or link failure occurs. We apply our methodology on the network of a tier-1 ISP. Twice as many iBGP sessions are required to ensure robustness to single IGP failure. The number of required iBGP sessions in our robust topology is however not much larger than in the current iBGP topology used in the tier-1 ISP network.
Keywords: BGP, route-reflection, iBGP topology design, optimization.
1 Introduction
The Internet consists of a collection of more than 25,000 interconnected domains called Autonomous Systems (ASs). Inside a single domain, an Interior Gateway Protocol (IGP) [1] such as IS-IS or OSPF is used to ensure the routing between the routers of the domain. In a domain, each IGP router computes its shortest path (according to the IGP metric) to every other router of the domain. Between neighboring ASs, routers exchange their routing information using BGP [2]. External BGP (eBGP) sessions are established over inter-domain links, i.e. links between two different ASs (BGP peers), while internal BGP (iBGP) sessions are established between the routers inside an AS. Typically, BGP routers do not forward iBGP messages to their iBGP peers, to reduce the amount of routing messages. Thus, each router has to establish an iBGP session with every other BGP router of its AS to diffuse its own routes. Such a topology is called an iBGP full-mesh (fm) and requires n(n − 1)/2 iBGP sessions, where n is the number of BGP routers in the AS. This solution is commonly used in small ASs but does not scale.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 542–553, 2008. © IFIP International Federation for Information Processing 2008
To achieve a scalable iBGP topology in a larger AS, network administrators have to use BGP confederations [3] (which consist in splitting the AS into smaller domains) or route-reflection (in which some BGP routers, namely route-reflectors, are authorized to forward iBGP messages to their iBGP clients). Because route-reflection is the commonly used approach today in large ASs, we focus only on it for the rest of the paper. Despite its wide adoption, route-reflection may suffer from problems in two particular cases:
1. Some routers select their route according to the Multi Exit Discriminator (MED) attribute. These routing problems have already been studied in [4] and can easily be avoided with always-compare-med or set-deterministic-med.
2. The best routes are selected according to the hot-potato routing steps of the BGP decision process: “prefer a route learned by eBGP over a route learned by iBGP”, and “prefer a route with a closest BGP next-hop” (see [5]). In large ASs, the hot-potato steps are of particular importance as they may account for up to 70% of the BGP best routes [6]. Furthermore, an iBGP topology with route-reflectors does not guarantee that each router systematically selects its closest possible egress point in the AS towards each destination. Indeed, route-reflectors reduce routing diversity because they only forward their best BGP route for each destination. Ideally, the network should converge to the same final state as in a full-mesh. Such a topology is said to be fm-optimal [5].
Before presenting how we address the iBGP topology design problem, let us summarize the main advantages and drawbacks of full-mesh iBGP and route-reflection. A full-mesh has optimal routing and is deterministic. Convergence is fast and the network is as robust as possible. It is however not scalable, as many iBGP sessions are necessary, and adding or removing routers implies significant configuration overhead. Furthermore, a change in a best route triggers updates to all other routers. With route-reflection, on the other hand, scalability is improved in terms of configuration overhead, convergence, and size of routing information bases. However, it comes with a loss of route diversity [7], which may induce suboptimal routing and non-deterministic routing [8], routing oscillations [4,9,10], route deflection or forwarding loops [11,12,13]. Furthermore, the behavior of route-reflection under failures [14] and IGP topology changes is unclear.
In this paper, we aim at building iBGP topologies that simultaneously respect several essential requirements:
– Fm-optimality: route-reflection is a scalable alternative to an iBGP full-mesh. We do not compromise the optimal route selection obtained under an iBGP full-mesh while using route-reflectors. As we show in this paper, keeping the best of both worlds, while not trivial, is possible.
– Correctness: checking for correctness is proved to be NP-hard. But thanks to fm-optimality, we can nonetheless ensure that a network is both loop-free (because deflection-free) and deterministic, and therefore correct.
– Reliability: We design an iBGP topology that follows the IGP graph as much as possible [14,15]. We allow multi-hop sessions (iBGP sessions that cross BGP routers other than the two BGP end-points of the session) only if necessary.
– Robustness: We build a topology robust to IGP link failures and router maintenance. Furthermore, even after a single link failure or the removal of a router, the topology remains fm-optimal.
– Scalability: We build a topology having as few iBGP sessions as possible.
As far as we know, this paper is the first attempt to design route-reflection topologies that respect all the requirements above, and in particular fm-optimality that is robust to failures. Using an algorithm based on the Benders' decomposition framework, we manage to solve the problem efficiently on real-world network topologies. The remainder of this paper is structured as follows. Section 2 presents the terminology. Section 3 proposes different approaches to solve the problem. In Section 4 we evaluate our solutions on both real-world transit ISP networks and generated topologies. The related work is presented in Section 5. Finally, Section 6 concludes and discusses further work.
2 Terminology
IGP graph. Let $G_{igp} = (V_{igp}, E_{igp})$ be the physical topology of the network. Each vertex of $V_{igp}$ represents a router and each weighted arc $(u, v)$ of $E_{igp}$ characterizes a physical link and its IGP metric. We denote by $dist : V_{igp} \times V_{igp} \to \mathbb{N}$ the function which returns the weight of the shortest path between two routers.
BGP graph. We denote by $\mathcal{N}$ the set of possible BGP next-hops in routes learned by the AS, and by $\mathcal{R}$ the set of routers running BGP inside the AS. The graph $G_{bgp} = (V_{bgp}, E_{bgp})$ describes the route-reflection topology, with $V_{bgp} = \mathcal{R} \cup \mathcal{N}$. $E_{bgp}$ denotes the set of BGP sessions between routers. When two routers share an iBGP session, we add two edges between the routers, labeled $UP$ from a client to one of its route-reflectors, $DOWN$ from a route-reflector to one of its clients, or $OVER$ between peers (see Figure 1 and section 2). We use the same notations as [8,15]. We assume that the border routers of the AS (ASBR) are the BGP next-hops in the BGP routes (i.e. $\mathcal{N} \subseteq \mathcal{R}$). We denote by $L = \{UP, OVER, DOWN\}$ the types of BGP sessions and by $label : E_{bgp} \to L$ the label of a given link. We also denote by $sym : L \to L$ the function that returns the symmetric label of a given label: $sym(UP) = DOWN$, $sym(DOWN) = UP$, $sym(OVER) = OVER$.
A BGP path in $G_{bgp}$ is a valid path if it is composed of zero or more $UP$ arcs, followed by zero or one $OVER$ arc, followed by zero or more $DOWN$ arcs. Any valid iBGP label sequence matches the regular expression $(UP)^*(OVER)?(DOWN)^*$.
Let $(n, r) \in \mathcal{N} \times \mathcal{R}$ be a given next-hop router pair. When considering this pair $(n, r)$, we assume that:
– there exists a prefix $p$ and several routes (called concurrent routes) to this prefix in the AS that are tie-broken on the IGP cost to the next-hop,
– $n$ is the closest BGP next-hop of $r$ in the IGP graph.
We try to ensure that $r$ will always be able to learn the route advertised by its closest BGP next-hop $n$.
We only have to consider the following concurrent next-hop set: $\mathcal{N}(n, r) = \{n' \in \mathcal{N} \mid dist(r, n') > dist(r, n)\}$.
If there exists a valid iBGP path from $n$ to $r$ such that each router $w$ of that path selects the route advertised by $n$, then $r$ learns the route advertised by $n$. We call white router a router verifying this property. The set of white routers related to a given pair $(n, r)$ is defined by: $\mathcal{W}(n, r) = \{w \in \mathcal{R} \mid \forall n' \in \mathcal{N}(n, r),\ dist(w, n) < dist(w, n')\}$. Note that $n$ and $r$ always belong to $\mathcal{W}(n, r)$. Moreover $\mathcal{N}(n, r) \cap \mathcal{W}(n, r) = \emptyset$. We call white path any iBGP path made only of white routers. If for each $(n, r) \in \mathcal{N} \times \mathcal{R}$ there exists at least one valid white path, then the topology is said to be fm-optimal. Note that fm-optimality is prefix independent. This criterion ensures a good “behaviour” of the network for any set of concurrent BGP routes.
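As an illustration, the two sets can be computed directly from their definitions once all-pairs IGP distances are known. The sketch below is ours and is not the authors' implementation: the four-router graph and its link weights are hypothetical, chosen only so that the resulting sets agree with the annotations of Figure 1, and the helper names are assumptions.

```python
import itertools

def all_pairs_dist(nodes, links):
    """Floyd-Warshall on an undirected weighted IGP graph (hypothetical helper)."""
    inf = float("inf")
    dist = {(u, v): (0 if u == v else inf) for u in nodes for v in nodes}
    for (u, v), w in links.items():
        dist[u, v] = dist[v, u] = min(dist[u, v], w)
    # product(..., repeat=3) varies its first component slowest,
    # so k is the outermost loop, as Floyd-Warshall requires.
    for k, i, j in itertools.product(nodes, repeat=3):
        if dist[i, k] + dist[k, j] < dist[i, j]:
            dist[i, j] = dist[i, k] + dist[k, j]
    return dist

def concurrent_next_hops(n, r, next_hops, dist):
    """N(n, r): next-hops strictly farther from r than n is."""
    return {m for m in next_hops if dist[r, m] > dist[r, n]}

def white_routers(n, r, routers, next_hops, dist):
    """W(n, r): routers strictly closer to n than to every concurrent next-hop."""
    rivals = concurrent_next_hops(n, r, next_hops, dist)
    return {w for w in routers if all(dist[w, n] < dist[w, m] for m in rivals)}

# Hypothetical IGP link weights (not taken from the paper).
routers = ["x", "y", "z", "rr"]
links = {("z", "y"): 3, ("z", "rr"): 2, ("rr", "x"): 2}
next_hops = {"x", "y"}
dist = all_pairs_dist(routers, links)

print(sorted(concurrent_next_hops("y", "z", next_hops, dist)))    # ['x']
print(sorted(white_routers("y", "z", routers, next_hops, dist)))  # ['y', 'z']
```

With these weights, rr is closer to x (distance 2) than to y (distance 5), so it is excluded from the white set of the pair (y, z).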
[Figure 1 annotations: $(n, r) = (y, z)$, $\mathcal{N}(y, z) = \{x\}$, $\mathcal{W}(y, z) = \{y, z\}$]
Fig. 1. An example of suboptimal routing: the traffic sent by z follows the IGP path (z, rr, x) instead of (z, y)
Figure 1 provides a brief example illustrating fm-optimality. In this topology, $(y, rr, z)$ is a valid iBGP path from $y$ to $z$. However, $rr \notin \mathcal{W}(y, z)$, so $(y, rr, z)$ is not a valid white path. In fact, $rr$ may select the route advertised by $x$, and $z$ is thus unable to learn the route announced by its closest BGP next-hop $y$.
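The two conditions used in this example, a valid label sequence and a path made only of white routers, can be checked mechanically. The following is a minimal sketch under stated assumptions: the function names are ours, and we assume that y and z are both clients of rr, so that the path (y, rr, z) carries the labels UP then DOWN. The white set is the one given in Figure 1.

```python
import re

# One letter per label so that the rule (UP)*(OVER)?(DOWN)* becomes a plain regex.
_CODE = {"UP": "U", "OVER": "O", "DOWN": "D"}
_VALID = re.compile(r"U*O?D*")

def is_valid_label_sequence(labels):
    """True if the labels match (UP)*(OVER)?(DOWN)*."""
    return _VALID.fullmatch("".join(_CODE[l] for l in labels)) is not None

def is_white_path(path, labels, white):
    """True if the path is a valid iBGP path made only of white routers."""
    assert len(labels) == len(path) - 1, "one label per hop"
    return is_valid_label_sequence(labels) and all(v in white for v in path)

# Figure 1, assuming y and z are clients of the route-reflector rr.
path, labels = ["y", "rr", "z"], ["UP", "DOWN"]
white_yz = {"y", "z"}  # W(y, z) as annotated in Figure 1

print(is_valid_label_sequence(labels))                # True
print(is_white_path(path, labels, white_yz))          # False: rr is not white
print(is_white_path(["y", "z"], ["OVER"], white_yz))  # True: a direct session
print(is_valid_label_sequence(["DOWN", "UP"]))        # False
```

The third call shows why the full-mesh always remains a feasible solution: a direct session between n and r only involves n and r, which are white by definition.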
3 How to Build Fm-Optimal iBGP Topologies
We now introduce our approach to solve the iBGP design problem. As input, we need the set of BGP next-hops ($\mathcal{N}$), the set of BGP routers ($\mathcal{R}$), and the IGP topology ($G_{igp}$). We use a mixed-integer program (MIP) to model the problem. It is impractical to enumerate the whole set of constraints because the networks using route-reflection are typically very large. That is why we use a Benders' decomposition to dynamically generate a reduced set of constraints.
1. In a first step (section 3.1), we present the approach to solve the iBGP design problem when no failure happens, called the nominal case. For each pair $(n, r) \in \mathcal{N} \times \mathcal{R}$, we build a satellite problem which is satisfied if and only if at least one fm-optimal path exists from $n$ to $r$.
2. In a second step (section 3.2), we detail how to introduce constraints to be robust to failures. We build a satellite for each triple $(n, r, f)$ with a given failure $f$. During a presolve step, we aggregate redundant satellites to reduce the problem size.
We do not consider $OVER$ sessions in the Benders' decomposition to avoid degeneracy of the problem, i.e. having too many equivalent solutions. Each $OVER$ session may be turned into an $UP$ or $DOWN$ session without invalidating any iBGP path.
3.1 Nominal Case
Master problem
Variables. For each candidate iBGP session $(u, v)$ with $u, v \in \mathcal{R}$, $u \neq v$, we define two 0-1 variables: $up(u, v)$ (equal to 1 if $label(u, v) = UP$, 0 otherwise), and $down(u, v)$ (equal to 1 if $label(u, v) = DOWN$, 0 otherwise).
Objective function. We try to design an iBGP topology as close as possible to the IGP topology while minimizing the number of installed iBGP sessions. We denote by $F$ the objective function defined by:
$$F = \min \sum_{u, v \in \mathcal{R}} R(u, v) \cdot \big(up(u, v) + down(u, v)\big)$$
where $R(u, v)$ is the number of IGP hops needed to establish an iBGP session from $u$ to $v$ (this value is equal to the length of the shortest IGP path from $u$ to $v$).
Constraints. The master problem is made of two sets of constraints:
– Domain constraints: Each pair $(u, v) \in \mathcal{R} \times \mathcal{R}$ is connected by 0 or 1 iBGP session, and $label(u, v) = sym(label(v, u))$. This leads to the following linear constraints:
  • $\forall u, v \in \mathcal{R}$, $up(u, v) + down(u, v) \le 1$,
  • $\forall u, v \in \mathcal{R}$, $up(u, v) = down(v, u)$.
– Max-flow min-cut constraints: At the beginning, this set of constraints is empty. These constraints are described in Sections 3.1 and 3.2.
At each iteration $it$, the master problem queries many satellite problems to test the fm-optimality of their corresponding pair $(n, r)$. Each queried unsatisfied satellite problem inserts a new max-flow min-cut constraint into the master problem. The set of max-flow min-cut constraints ensures the route propagation between any router pair $(n, r)$ through the iBGP graph. If all satellite problems are satisfied, the MIP resolution returns an fm-optimal solution which minimizes the objective function $F$.
Satellite problems
Extended graph concepts. To guarantee that only valid iBGP paths can be built, we use the graph transformation introduced in [5]. We transform each vertex of $V_{bgp}$ into a meta-node composed of two nodes (called source-node and target-node) and an arc (called internal arc) according to Figure 2. The way we link two meta-nodes depends on the iBGP relationship between the two related routers. In the extended graph we can only build valid iBGP paths. We call meta-arc each arc that connects two vertices belonging to different meta-nodes. In this graph, each meta-arc is mapped to the iBGP session and the two routers establishing this iBGP session. We denote by $[u, v, rel]$ the meta-arc which is mapped to meta-nodes $u$ and $v$, and to iBGP session $rel$. Each valid path of $G_{bgp}$ from $s \in V_{bgp}$ to $t \in V_{bgp}$ is thus mapped to exactly one path in the extended graph from $s_{src}$ to $t_{dst}$, where $s_{src}$ is the source-node of $s$ and $t_{dst}$ the target-node of $t$.
Fig. 2. An example of iBGP graph and its corresponding extended graph. The valid path $(c_1, rr_1, rr_2, c_2)$ in $G_{bgp}$ is mapped to $(c_{1,src}, rr_{1,src}, rr_{2,src}, rr_{2,dst}, c_{2,dst})$ in $G_{bgp}^{ext}$.
Satellite graphs. To ensure the fm-optimality of a given pair $(n, r)$, we build a satellite problem. For each satellite problem we build a satellite graph $G_w(n, r)$. Each vertex of this graph belongs to $\mathcal{W}(n, r)$ (see Section 2). To reduce the number of candidate iBGP sessions, we only consider the sessions $(u, v)$ which verify the following properties:
1. $u, v \in \mathcal{W}(n, r)$: the BGP route sent by $n$ only goes through routers that never hide this route;
2. $dist(n, u) \le dist(n, v)$ and $dist(v, r) \le dist(u, r)$: a BGP message crossing the arc $(u, v)$ increases its distance to $n$ and decreases its distance to $r$.
The iBGP sessions that do not verify point 1 might prevent the fm-optimal routes from being propagated, so we do not want to use those sessions. Point 2 prevents iBGP messages announced by $n$ from following overly long paths from $n$ to $r$. Note that the iBGP full-mesh remains a possible solution because the direct session between $n$ and $r$ is allowed. Thus, the graph $G_w(n, r) = (\mathcal{W}(n, r), E_w(n, r))$ gathers the candidate iBGP white paths able to satisfy the pair $(n, r)$. If for all $(n, r) \in \mathcal{N} \times \mathcal{R}$ there exists at least one valid path from $n$ to $r$ in $G_w(n, r)$, then the iBGP topology is fm-optimal.
Satellite problems. For each pair $(n, r)$ we build the extended graph $G_w^{ext}(n, r)$ from $G_w(n, r)$ and all the candidate meta-arcs. We denote by $n_{src}$ the source-node of meta-node $n$ and by $r_{dst}$ the target-node of meta-node $r$. We install for each edge $(i, j) \in G_w^{ext}(n, r)$ an arc capacity, as shown in Figure 3.
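The extended-graph idea can be sketched in a few lines. The exact wiring of Figure 2 is not spelled out in the text, so the code below assumes one wiring that is consistent with the mapped path given in the caption: an UP arc joins source-node to source-node, a DOWN arc joins target-node to target-node, an OVER arc joins source-node to target-node, and the internal arc goes from the source-node to the target-node of the same router. The function names and the label of the session between rr1 and rr2 (taken to be UP) are also assumptions.

```python
from collections import deque

# Assumed wiring of meta-arcs between the two layers of the extended graph.
_ENDS = {"UP": ("src", "src"), "DOWN": ("dst", "dst"), "OVER": ("src", "dst")}

def extended_graph(routers, sessions):
    """Adjacency dict of the extended graph; sessions are (u, v, label) arcs."""
    adj = {(r, side): [] for r in routers for side in ("src", "dst")}
    for r in routers:
        adj[r, "src"].append((r, "dst"))   # internal arc of the meta-node
    for u, v, label in sessions:
        a, b = _ENDS[label]
        adj[u, a].append((v, b))           # meta-arc [u, v, label]
    return adj

def find_valid_path(adj, s, t):
    """Breadth-first search from s_src to t_dst; returns the node list or None."""
    start, goal = (s, "src"), (t, "dst")
    parent, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Path of the Figure 2 caption, with rr1 assumed to be a client of rr2.
routers = ["c1", "rr1", "rr2", "c2"]
sessions = [("c1", "rr1", "UP"), ("rr1", "c1", "DOWN"),
            ("rr1", "rr2", "UP"), ("rr2", "rr1", "DOWN"),
            ("rr2", "c2", "DOWN"), ("c2", "rr2", "UP")]
print(find_valid_path(extended_graph(routers, sessions), "c1", "c2"))

# A client c with two reflectors: ra -> c -> rb would be DOWN then UP, hence invalid.
two_rr = [("ra", "c", "DOWN"), ("c", "ra", "UP"),
          ("rb", "c", "DOWN"), ("c", "rb", "UP")]
print(find_valid_path(extended_graph(["ra", "rb", "c"], two_rr), "ra", "rb"))  # None
```

Because every arc either stays in its layer or moves from the source layer to the target layer, any path from a source-node to a target-node spells a label sequence of the form (UP)*(OVER)?(DOWN)*, which is why a plain reachability search suffices.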
Fig. 3. Two candidate iBGP sessions: label(u, v) = U P or label(u, v) = DOW N
Fig. 4. This min-cut max-flow inserts the following constraint into the master problem: up(r1 , r)+down(r1 , r)+ up(n, r) + down(n, r) ≥ 1
– If $i$ and $j$ belong to the same meta-node, we install an infinite capacity on $(i, j)$.
– Otherwise $(i, j)$ is a meta-arc. Let $rel \in \{UP, DOWN\}$ be the iBGP relationship mapped to $(i, j)$, $r_i$ the meta-node mapped to $i$, and $r_j$ the meta-node mapped to $j$. If an iBGP session $rel$ is set from $r_i$ to $r_j$, we install on the arc $(i, j)$ a capacity equal to 1 (0 otherwise). Therefore, there is at most one meta-arc from meta-node $r_i$ to meta-node $r_j$ with a capacity equal to 1.
If the max-flow sent from $n_{src}$ (the source) to $r_{dst}$ (the sink) is greater than or equal to 1, then the pair $(n, r)$ is satisfied. Otherwise, no flow unit can reach the sink. We search for the max-flow min-cut in this graph. We denote by $C(n, r, it)$ the set of meta-arcs that intersect this cut during the current iteration $it$. We insert the following max-flow min-cut constraint into the master problem:
$$\sum_{[r_i, r_j, rel] \in C(n, r, it)} rel(r_i, r_j) \ge 1.$$
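A satellite query can be sketched as follows, using the instance of Figure 4 as input. This is a simplified illustration and not the authors' code. Since the capacities are 0, 1 or infinite and only one unit of flow matters, a reachability search over the arcs of positive capacity is used here in place of a general max-flow routine; that substitution is ours. The layer wiring (up arcs between source-nodes, down arcs between target-nodes) and all names are assumptions.

```python
from collections import deque

def satellite_cut(n, r, candidates, installed):
    """Return None if (n, r) is satisfied, else the variables of a zero-capacity cut.

    candidates: directed candidate sessions (u, v) of the satellite graph.
    installed:  set of (rel, u, v), rel in {"up", "down"}, whose variable equals 1.
    """
    side = {"up": "src", "down": "dst"}   # assumed wiring of the meta-arcs
    meta_arcs = [((u, side[rel]), (v, side[rel]), (rel, u, v))
                 for (u, v) in candidates for rel in ("up", "down")]
    routers = {x for arc in candidates for x in arc} | {n, r}
    adj = {(x, s): [] for x in routers for s in ("src", "dst")}
    for x in routers:
        adj[x, "src"].append((x, "dst"))  # internal arc, infinite capacity
    for a, b, var in meta_arcs:
        if var in installed:              # capacity 1; other meta-arcs have capacity 0
            adj[a].append(b)
    reach, queue = {(n, "src")}, deque([(n, "src")])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in reach:
                reach.add(nxt)
                queue.append(nxt)
    if (r, "dst") in reach:
        return None                       # one unit of flow reaches the sink
    # Candidate meta-arcs leaving the reachable side form a cut of capacity 0.
    return {var for a, b, var in meta_arcs if a in reach and b not in reach}

# Figure 4: W(n, r) = {n, r1, r}, and only label(n, r1) = UP is installed.
candidates = [("n", "r1"), ("r1", "r"), ("n", "r")]
cut = satellite_cut("n", "r", candidates, {("up", "n", "r1")})
print(sorted(cut))  # the four variables of the constraint shown in Figure 4
```

The master problem would then require the sum of the returned variables to be at least 1. Installing any one of them, for instance down(r1, r), makes the same query return None.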
Figure 4 provides an example of max-flow min-cut. In this example $\mathcal{W}(n, r) = \{n, r_1, r\}$ and the previous MIP resolution has led to $label(n, r_1) = UP$, $label(r_1, r) = label(n, r) = NOT$. When the satellite related to $(n, r)$ is queried, the max-flow min-cut inserts the linear constraint $up(r_1, r) + down(r_1, r) + up(n, r) + down(n, r) \ge 1$ into the MIP.
3.2 IGP Failures
We now detail how to take into account IGP failures. A BGP router uses its IGP shortest path to establish the sessions with its BGP neighbors. When an IGP failure occurs, the BGP router updates its IGP shortest path tree and re-establishes the broken BGP sessions according to its new path tree. If the IGP connectivity is not working between the two BGP peers, the BGP session will be down. We denote by $f = (V_{igp}^f, E_{igp}^f)$ an IGP failure, where $V_{igp}^f \subseteq V_{igp}$ stands for the set of involved routers and $E_{igp}^f \subseteq E_{igp}$ for the set of involved IGP links. We denote by $\phi$ the empty failure.
Methodology. Handling an IGP failure $f$ consists in recomputing the IGP cost between each router pair belonging to the same connected component. For each considered input IGP failure, we apply the “nominal case” reasoning in each connected component. Let us consider a pair $(n, r)$ such that $n$ and $r$ belong to the same IGP connected component $C$. We only consider in $G_w(n, r, f)$ the white vertices belonging to $C$. An iBGP session can only be established if both BGP routers sharing the session belong to the same IGP connected component. Thus, we only have to consider IGP failures such that $n$ and $r$ remain in the same IGP connected component and $n \notin V_{igp}^f$, $r \notin V_{igp}^f$.
Satellite aggregation. If we construct a satellite $(n, r, f)$ for each IGP failure (including $\phi$), we notice that many satellites are redundant. For example, if $f$ does not affect a given pair $(n, r)$, it is useless to build the satellite $(n, r, f)$, as the $(n, r, f)$ and $(n, r, \phi)$ satellites will insert the same flow constraints. Let us consider two failures $f$ and $f'$ for a given pair $(n, r)$, and let $G_w(n, r, f)$ and $G_w(n, r, f')$ be the two related flow graphs. If $G_w(n, r, f) \subseteq G_w(n, r, f')$, the constraints induced by $G_w(n, r, f)$ will be more restrictive than those induced by $G_w(n, r, f')$. Thus we can safely omit the $(n, r, f')$ satellite.
Failure case study. In the next part we only consider single IGP node and link failures, which are the most common cases of network failure. To be as generic as possible, we assume that an IGP router failure also provokes the corresponding iBGP router failure. We also remove the iBGP sessions that can no longer be established (if the two iBGP routers are not in the same IGP connected component anymore).
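The two filtering steps described above can be illustrated with a short sketch. It is a simplified reading of the text, not the authors' presolve: the first function keeps a failure for a pair (n, r) only if neither router fails and both stay in the same IGP connected component, and the second drops a satellite whenever another satellite of the same pair has a candidate graph contained in its own. The four-router topology, the failure names and the function names are hypothetical.

```python
def component_of(start, alive, links):
    """Routers reachable from start over the surviving IGP links."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in links:
            for x, y in ((a, b), (b, a)):
                if x == u and y in alive and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen

def pair_survives(n, r, nodes, links, failure):
    """True if the satellite (n, r, f) has to be considered for failure f."""
    failed_nodes, failed_links = failure
    if n in failed_nodes or r in failed_nodes:
        return False
    alive = set(nodes) - set(failed_nodes)
    up_links = [(a, b) for a, b in links
                if a in alive and b in alive
                and (a, b) not in failed_links and (b, a) not in failed_links]
    return r in component_of(n, alive, up_links)

def aggregate(graphs):
    """Keep only satellites whose candidate graph is minimal for inclusion.

    graphs maps a failure name to the frozenset of candidate sessions of
    its satellite graph; of several equal graphs, only the first is kept.
    """
    keep = []
    for name, g in graphs.items():
        dominated = any(other < g or (other == g and other_name in keep)
                        for other_name, other in graphs.items()
                        if other_name != name)
        if not dominated:
            keep.append(name)
    return keep

# Hypothetical IGP topology: a triangle a-b-c with a stub router d behind c.
nodes = ["a", "b", "c", "d"]
links = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(pair_survives("a", "d", nodes, links, ({"b"}, set())))         # True
print(pair_survives("a", "d", nodes, links, (set(), {("c", "d")})))  # False

full = frozenset({("n", "r1"), ("r1", "r"), ("n", "r")})
print(aggregate({"phi": full, "f1": frozenset({("n", "r")}), "f2": full}))  # ['f1']
```

In the last call, the satellite of f1 only allows the direct session, so any topology satisfying it also satisfies the two larger satellites, which can therefore be skipped.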
4 Building Optimal iBGP Topologies
To illustrate how the different approaches perform on different topologies, we present in this section results for real and generated topologies. We start in Section 4.1 with relatively small topologies of less than 30 nodes. Then we tackle the problem on a large tier-1 network in Section 4.2.
4.1 Small Topologies
We first compare our two approaches on five topologies.
– The topology of the GEANT network from 2004 (http://www.geant.net).
– Four topologies generated by the iGen topology generator (http://www.info.ucl.ac.be/~bqu/igen/). iGen generates random points in one or any continent and then connects the nodes using network design heuristics [16]. We generated four small topologies made of 25 nodes: two for Northern America (NA) and two for the whole world (W). Two network design heuristics were used to generate the physical connectivity between nodes: Delaunay triangulation (D) and the Two-Trees algorithm (2T).
Table 1. Solutions found on small topologies

                                     GEANT  NA-D  NA-2T  W-D  W-2T
Input graph      |V_bgp| = |V_igp|      22    25     25   25    25
                 |E_igp|                72   128     96  130    96
                 |E_bgp| in f.m.       462   600    600  600   600
Without failure  |E_bgp|                74    80     72  100    64
With failures    |E_bgp|               172   168    146  194   126
We assume that each router is a border router (N = R = Vigp), i.e. any BGP router may receive external routes towards any arbitrary prefix. This is the most constrained form of the problem. Indeed, the computed iBGP topology remains fm-optimal for all subsets of N. Thus, a smaller next-hop set would lead to an iBGP topology made of fewer iBGP sessions.

Table 1 shows the results of the Benders approach with and without failures (last two rows), for the five topologies described above. The number of iBGP sessions required for an iBGP full-mesh is also indicated in the third row. The real GEANT network is configured with a full-mesh of iBGP sessions, i.e. 462 directed OVER sessions between its 22 routers. According to the results of Benders’ decomposition, GEANT would only need 74 iBGP sessions under route-reflection to ensure fm-optimality, and 170 iBGP sessions to ensure fm-optimality under single failures. Delaunay topologies (NA-D and W-D) are more connected than the Two-Trees ones, which are made of two disjoint spanning trees. Benders finds iBGP topologies with strictly fewer iBGP sessions than IGP links for the four iGen topologies, as those graphs are well-connected. The World-Delaunay iGen topology requires multi-hop sessions to reach fm-optimality, as does GEANT. When IGP failures are considered (last row of Table 1), the number of required iBGP sessions roughly doubles. This is still about 3 times fewer sessions than an iBGP full-mesh, while fm-optimality remains guaranteed.

To illustrate the properties of the topologies computed by the Benders approach, we use three indicators. Figures are shown for the GEANT network only because we observe a similar behaviour for the iGen topologies.
1. Degree distribution: when more iBGP sessions are established by a router (higher degree), it requires more memory because of larger RIB-Ins. In the nominal topology (top left part of Figure 5), each router has iBGP sessions with at most 7 BGP routers, versus up to 15 when failures are considered.
2. White path length distribution: for each pair (n, r), we compute the iBGP path length in terms of iBGP hops. The more iBGP hops are needed for a router to get its BGP route, the slower the convergence inside the AS will be. The top right of Figure 5 shows that in the nominal case most white paths have no more than 1 or 2 iBGP hops, while they tend to be longer when IGP failures are considered.
3. Matching with the IGP topology: for each iBGP session, we look at its length in IGP hops. Ideally, the iBGP topology should be as close as possible to the IGP topology [14,15]. The lower part of Figure 5 shows that most of the iBGP sessions go over a single IGP link in the nominal case. However, a majority of iBGP sessions cross two IGP links when failures are considered.
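The first and third indicators above can be sketched in a few lines of Python. This is only an illustrative sketch: the toy IGP links and iBGP sessions below are assumptions made up for the example (they are not taken from GEANT), sessions are treated as undirected, and the function names are hypothetical.

```python
from collections import Counter, deque

# Toy data, assumed for illustration only (not the GEANT topology).
igp_links = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
ibgp_sessions = [("a", "b"), ("b", "c"), ("a", "c")]


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(sessions):
    """Indicator 1: map each iBGP degree to the number of routers with that degree.

    Routers without any session do not appear in the result.
    """
    peers = build_adjacency(sessions)
    return Counter(len(p) for p in peers.values())


def igp_hops(links, src, dst):
    """Breadth-first search: number of IGP links on a shortest path."""
    adj = build_adjacency(links)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None  # dst is unreachable from src


def session_hop_histogram(links, sessions):
    """Indicator 3: how many iBGP sessions span 1, 2, ... IGP hops."""
    return Counter(igp_hops(links, u, v) for u, v in sessions)
```

On the toy data, every router with a session has degree 2, two sessions follow a single IGP link, and one session crosses two IGP links.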
Designing Optimal iBGP Route-Reflection Topologies
[Figure 5 plots, each comparing the nominal case with the single IGP failure case: #BGP routers vs. iBGP degree (top left), #white iBGP paths vs. iBGP hop count (top right), #iBGP sessions vs. IGP hop count (bottom).]
Fig. 5. Properties of iBGP topologies generated by the Benders’ approach (GEANT network)
4.2 Tier-1 ISP Network
To show how our approach scales to large networks, we rely on a tier-1 provider network topology having hundreds of nodes and iBGP sessions. For confidentiality reasons, results are presented differently than for the small topologies. We denote by $G_{bgp}^{original} = (V_{bgp}, E_{bgp}^{original})$ the original iBGP topology of the tier-1 AS and by $G_{bgp}^{benders} = (V_{bgp}, E_{bgp}^{benders})$ the iBGP topology computed by the Benders approach. Tables 2 and 3 show that many modifications have to be made to migrate from the original topology to the computed topologies. The classical design rules used today (e.g. a 3-level route-reflection hierarchy made of intercontinental, continental, and national levels) are simple, but do not lead to a reliable and efficient iBGP topology. The nominal-case topology is 45% smaller than the original topology and is fm-optimal. Each router has to establish fewer iBGP sessions. However, the average white iBGP path length is a bit longer than in the original input topology, so BGP convergence might be slower.

Table 2. Nominal topology

$E_{bgp}^{original}$      $E_{bgp}^{benders}$
Removed   73 %            Added     52 %
Modified   6 %            Modified  11 %
Kept      20 %            Kept      36 %

Table 3. Topology robust to IGP failures

$E_{bgp}^{original}$      $E_{bgp}^{benders}$
Removed   59 %            Added     67 %
Modified  27 %            Modified  21 %
Kept      12 %            Kept      10 %
The topology when considering IGP failures requires 25% more iBGP sessions than the original topology, but remains close to the original one in terms of iBGP degree distribution and white path length distribution.
5 Related Work
[13] was the first work to notice that forwarding loops can occur under route-reflection. Loop-free forwarding in iBGP was studied in [15], which proved that checking the correctness of an iBGP graph is NP-complete, and showed that two conditions ensure a correct (loop-free) iBGP graph: 1) route-reflectors should prefer client routes to non-client routes, 2) every shortest path should be a valid signaling path. Those two conditions are actually too restrictive: designing correct iBGP topologies does not require the first condition [11]. [14] provides an iBGP design problem formulation. The authors aim to optimize a kind of network robustness defined through two reliability criteria (Expected Lifetime and Expected Session Loss). The computed topology is made of two hierarchical levels. [11] relied on this meshing of the top-level reflectors to design more scalable iBGP topologies that are robust to IGP failures. The approach of [11] relies on a hierarchy of route-reflectors that ensures the correctness of the iBGP propagation. [12] details the common routing problems encountered in networks under route-reflection. This paper provides conditions to avoid route deflection and MED oscillations. The computed topology is a two-level reflection hierarchy and minimizes the IGP cost between two iBGP peers. The last three approaches do not guarantee that the iBGP topology remains valid if some IGP failure occurs. Thus suboptimal routing or forwarding loops may occur.
6 Conclusion
In this work we proposed solutions to design optimal iBGP route-reflection topologies, i.e. route-reflection topologies that lead to the same routing as an iBGP full-mesh (fm-optimal topologies). We showed that it is possible to build fm-optimal route-reflection topologies that use a minimal number of iBGP sessions and are robust to IGP failures. Applying our method to both real-world and generated networks revealed that fm-optimality can be guaranteed with a number of iBGP sessions in the order of the number of physical links, if no IGP failures occur. When IGP failures are considered, the minimal required number of iBGP sessions roughly doubles compared to the situation without failures, but still remains 5 times smaller than an iBGP full-mesh in the case of our tier-1 ISP. iBGP aims at diffusing routing information inside an AS. This diffusion depends on the diffusion graph and on the protocol that drives the diffusion of the routing information. Our approach in this paper was to optimize the iBGP diffusion graph, without changing the rules of route-reflection. Another way to think about iBGP route diffusion is to change the iBGP protocol itself instead of designing an iBGP graph. We plan to explore this second way of designing iBGP in the future.
Acknowledgments. We would like to thank Olivier Klopfenstein and Jean-Luc Lutton for their help on the Benders’ approach.
References
1. Halabi, B., McPherson, D.: Internet Routing Architectures, 2nd edn. Cisco Press (2000)
2. Rekhter, Y., Li, T.: A Border Gateway Protocol 4 (BGP-4). RFC 1771 (March 1995)
3. Traina, P., McPherson, D., Scudder, J.: Autonomous System Confederations for BGP. RFC 3065 (February 2001)
4. Griffin, T., Wilfong, G.T.: Analysis of the MED oscillation problem in BGP. In: ICNP 2002: Proceedings of the 10th IEEE International Conference on Network Protocols, Washington, DC, USA (2002)
5. Buob, M., Meulle, M., Uhlig, S.: Checking for optimal egress points in iBGP routing. In: Proc. of the 6th IEEE International Workshop on the Design of Reliable Communication Networks (DRCN 2007) (October 2007)
6. Teixeira, R., Griffin, T., Voelker, G., Shaikh, A.: Network sensitivity to hot potato disruptions. In: Proc. of ACM SIGCOMM (August 2004)
7. Uhlig, S., Tandel, S.: Quantifying the impact of route-reflection on BGP routes diversity inside a tier-1 network. In: Proc. of IFIP Networking, Coimbra, Portugal (May 2006)
8. Feamster, N., Winick, J., Rexford, J.: A Model of BGP Routing for Network Engineering. In: ACM Sigmetrics - Performance 2004, New York, NY (June 2004)
9. McPherson, D., Gill, V., Walton, D., Retana, A.: BGP persistent route oscillation condition (March 2001)
10. Basu, A., Ong, L., Shepherd, B., Rasala, A., Wilfong, G.: Route oscillations in i-BGP with route reflection. In: ACM SIGCOMM (2002)
11. Vutukuru, M., Valiant, P., Kopparty, S., Balakrishnan, H.: How to construct a correct and scalable iBGP configuration. In: IEEE INFOCOM, Barcelona, Spain (April 2006)
12. Rawat, A., Shayman, M.A.: Preventing persistent oscillations and loops in iBGP configuration with route reflection. Computer Networks, 3642–3665 (December 2006)
13. Dube, R.: A comparison of scaling techniques for BGP. SIGCOMM Comput. Commun. Rev. 29(3), 44–46 (1999)
14. Xiao, L., Wang, J., Nahrstedt, K.: Optimizing iBGP route reflection network. In: IEEE INFOCOM (2003)
15. Griffin, T.G., Wilfong, G.: On the correctness of iBGP configuration. In: Proc. of ACM SIGCOMM (August 2002)
16. Cahn, R.: Wide area network design: concepts and tools for optimization. Morgan Kaufmann Publishers Inc, San Francisco (1998)
A Study of Path Protection in Self-Healing Routing
Qi Li, Mingwei Xu, Lingtao Pan, and Yong Cui
Department of Computer Science, Tsinghua University, Beijing 100084, China
Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China
{liqi,xmw,plt,cy}@csnet1.cs.tsinghua.edu.cn
Abstract. Recent routing failure accidents show that routing systems are not as stable as expected, since most of these accidents caused packet delivery failures in the Internet. To resolve this problem, many fast reroute solutions have been proposed to guarantee that a reroute path is provided after network failures without high packet loss. However, most of these solutions are not deployed in practice, and some key issues still need further consideration, such as route loops and deployment cost. In this paper, we present state machine models to analyze how path protection works in self-healing routing solutions and identify the drawbacks of traditional path protection approaches. We identify several requirements of path protection approaches and, based on our analysis, present the principles of our path protection approach to protect Internet routing. Furthermore, we conduct a detailed study to measure our proposed path protection approach in production networks; the study shows that the availability and stability of routing systems are improved in real production networks.
1 Introduction
Internet routing is a critical element of the Internet infrastructure and plays a key role in packet delivery in the Internet. However, routing systems are not robust enough against network failures and attacks. For instance, Cisco routers caused a major outage in Japan in 2007, and an earthquake in Taiwan caused global network accidents in 2006. The main failure causes, such as route/link damage, buggy software updates and router configuration errors, may be the origins of these Internet accidents. Some studies show that most of these Internet instability accidents result from short term failures [5]. However, current routing systems fail to adapt to these network accidents. For example, to recover from short term failures, RIP requires hundreds of seconds, OSPF requires tens of seconds and BGP requires several minutes or longer. Obviously, these protocols are not stable and available enough to meet packet delivery requirements when network failures happen in the Internet. Several studies have investigated Internet routing convergence and fast reroute to address the routing problems caused by network failures. Fast routing convergence is a classical problem that is well addressed in the literature [1,3]. Although these approaches speed up routing convergence, they are not deployed in real
This work is supported by the Natural Science Foundation of China (No. 90604024), the Key Project of Chinese Ministry of Education (No. 106012), NCET and the Hi-Tech Research and Development Program of China (863) (2007AA01Z2A2).
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 554–561, 2008. © IFIP International Federation for Information Processing 2008
production networks because of adoption complexity, design faults or other reasons. Moreover, these works cannot ensure low packet loss or guarantee successful packet delivery to the destination after network failures. Another line of self-healing routing work implements fast reroute approaches, which are under active development in the IETF [7,2]. However, the proposed fast reroute schemes share some major drawbacks, such as difficult deployment and management and insufficient performance. In this paper, we analyze the problem of Internet routing and identify the effect and efficiency of path protection against network failures, including short term and long term failures. Since our analysis shows that the traditional path protection approaches proposed in the IETF are not stable enough, we propose an improved path protection solution. In our solution, the L2TP technology is used to provide path protection, which overcomes the shortcomings of traditional approaches. As mentioned above, short term failures are the main cause of Internet instability, and our solution can directly recover from these failures. Besides, our path protection approach can mitigate the performance impact of the routing convergence process if long term network failures happen. Therefore, our path protection solution addresses the Internet availability and stability problem well. The paper is organized as follows. Section 2 introduces Internet routing failures and describes several solutions that address path protection issues in routing. State machine models are proposed in Section 3 to analyze the self-healing routing process after a network failure. In Section 4, we identify several requirements of path protection in self-healing routing and propose our path protection solution to defend against network failures. Section 5 evaluates the approach in production networks, and Section 6 concludes the paper.
2 Internet Routing Failures and Path Protection
Several studies have analyzed routing instability and the impact of routing failures in the Internet. Those studies found three important results [5]. First, routing protocols can efficiently handle common link failure events. Second, a small number of links are responsible for a large fraction of the failures in the Internet. This is the common but troublesome problem of flapping links. Third, link failures are usually short-term events, except for some big network accidents, such as the Taiwan earthquake, which caused major routing failure events. We need to consider two properties when measuring Internet routing: availability and stability. Availability refers to the ability of the routing system to deliver packets normally whether or not network failures happen. Stability refers to the routing dynamics of the routing system whether or not network failures happen. However, current routing systems perform poorly in these two aspects. In fact, besides the large routing convergence delay, routing divergence also happens because of configuration errors and other factors. Although some improved routing schemes can improve routing convergence, they cannot guarantee fast routing convergence or eliminate the routing divergence problem. As a result, slow convergence and divergence cause poor availability of routing systems, and flapping routes cause poor stability. To address slow routing convergence or divergence after network failures, many path protection solutions have been proposed. We describe these approaches and identify their drawbacks.
Q. Li et al.
SONET rings can significantly reduce the recovery time by switching around the failure, but they are expensive. Fast reroute (FRR) schemes have been proposed to resolve this problem; in these schemes, alternate paths are set up between the routers detecting failures and some intermediate routers on the paths to the destinations. In an MPLS Traffic Engineering (MPLS-TE) network, FRR pre-calculates shortest paths around individual nodes and links so that if a failure occurs, traffic can be quickly switched to the reroute path. These approaches promise to establish reroute paths within 50 ms. FRR does not take any best-path parameters into consideration, and its reroute paths are intended only as short-term detours around a failure. O(nk) repair paths must be set up in the network for link repair and O(nk^2) repair paths for node repair. Moreover, MPLS-TE requires an MPLS infrastructure. It is feasible to implement a path protection mechanism natively in an IP infrastructure without using MPLS-TE. Shand et al. have for a number of years maintained an IETF Internet draft called IP Fast Reroute Framework, which proposes an FRR solution without MPLS-TE [2,7]. For operators who use MPLS only for the FRR functionality, this solution could be promising for simplifying their networks. Several further Internet drafts propose ways to implement FRR, such as FRR with IP tunnel and FRR using via address. However, it is hard to build pre-computed paths for these solutions and to guarantee that the pre-defined paths are best paths, because they rely heavily on complete topological information, in addition to the drawbacks of the MPLS-TE approach analyzed above. In addition, route loops may happen during routing convergence in these approaches. In summary, these solutions are not scalable enough and may cause route loops and Internet instability even if they improve routing availability. In addition, these approaches may negatively affect traffic flows or routing performance because they may not use the best routing paths.
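The idea of pre-calculating detours around individual links can be illustrated with a small Python sketch. It is not the MPLS-TE or IP FRR procedure itself: it simply assumes a weighted graph given as nested dictionaries and, for every directed link, runs Dijkstra's algorithm with that link removed. The function names and the toy topology are hypothetical.

```python
import heapq


def shortest_path(graph, src, dst, banned_link=None):
    """Dijkstra over graph = {node: {neighbor: cost}}.

    banned_link, if given, is skipped in both directions to model a failed link.
    Returns the list of nodes on a cheapest path, or None if dst is unreachable.
    """
    banned = set()
    if banned_link is not None:
        a, b = banned_link
        banned = {(a, b), (b, a)}
    best = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph[node].items():
            if (node, nxt) in banned:
                continue
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    return None


def precompute_link_detours(graph):
    """For each directed link (u, v), a detour from u to v avoiding that link."""
    return {
        (u, v): shortest_path(graph, u, v, banned_link=(u, v))
        for u in graph
        for v in graph[u]
    }


# Toy topology (an assumption): a square A-B-C-D plus a stub E attached to A.
toy = {
    "A": {"B": 1, "D": 1, "E": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 1},
    "D": {"A": 1, "C": 1},
    "E": {"A": 1},
}
detours = precompute_link_detours(toy)
```

In this toy topology the detour for link A→B is A, D, C, B, while the stub link A→E has no detour at all, which hints at why pre-computed protection depends so heavily on the available topology.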
3 Self-Healing Routing
As discussed above, most network change events are single network failures. For simplicity, we only analyze the network state using the single failure model in this section. The standard routing convergence process, illustrated in Figure 1(a), has five states: Sn denotes the normal state of the network, Sd denotes that the adjacent nodes have detected the network failure, Sp denotes that the adjacent nodes have finished route re-computation and propagate the update messages, Ss denotes that all the nodes have finished route re-computation and the whole network stays in a sub-normal status (footnote 1), and Sr denotes that the network failure has recovered. After a network failure event, the network reaches the state Ss via the states Sd and Sp, so Td + Tp + Ts is the routing convergence time. During the routing convergence process, packets sent through the failed node/link may be lost during Td, and out-of-order packets may appear during Tp and Ts. Moreover, during Tp and Ts, some packets may still be lost because of transient route loops. Most fast routing convergence work aims to reduce Td and optimize Tp and Ts (footnote 2). However, these approaches cannot completely resolve the route divergence problem or avoid packet loss.

1 This status is introduced to illustrate that the network has converged after network failures.
2 Most work does not simply reduce Tp and Ts, since doing so may cause route flaps and route loops.
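The timing argument for the original machine can be written down as a tiny Python sketch. The phase boundaries follow the description of Td, Tp and Ts above; the function names and the coarse labels for what may happen to a packet are assumptions made for illustration only.

```python
# States visited after a single failure in the original machine, Figure 1(a).
FAILURE_SEQUENCE = ["Sn", "Sd", "Sp", "Ss"]


def convergence_time(td, tp, ts):
    """Routing convergence time Td + Tp + Ts, in the same unit as the inputs."""
    return td + tp + ts


def packet_risk(t, td, tp, ts):
    """Coarse label for a packet sent t time units after the failure
    towards the failed node/link (labels are illustrative assumptions)."""
    if t < 0:
        return "normal"
    if t < td:
        return "may be lost (failure not yet detected)"
    if t < convergence_time(td, tp, ts):
        return "may be reordered or looped (network converging)"
    return "sub-normal but converged"
```

For example, with the hypothetical values Td = 1, Tp = 2 and Ts = 3 seconds, the network converges 6 seconds after the failure, and a packet sent at t = 0.5 s falls into the detection window.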
[Figure 1 diagrams: three state machines over the states Sn, Sd, Sp, Ss and Sr, with transitions labeled Td, Tp, Ts, Tt, Tu and Tr: (a) Original Machine, (b) Machine with FRR, (c) Machine with Protection, which adds the state Sp’ ("Under Protection") and the transition Tp’.]
Fig. 1. State Machine of Self-Healing Routing
To address these issues, fast reroute approaches have been proposed. In these approaches, tunnels are used to provide fast reroute (FRR), and the Bidirectional Forwarding Detection (BFD) mechanism is used to realize fast failure detection. Once a network failure is detected, the pre-configured tunnel is activated to guarantee that packets detour around the failed path and are forwarded to the destination. The network state machine with FRR is shown in Figure 1(b). We see that the state machine with FRR is quite simple. The rerouting time is Td + Tt (Tt denotes the tunnel activation time). Since Tt is much smaller than Tp and Ts, packet loss can be avoided (or greatly reduced) after network failures. However, some new problems arise. For instance, it is hard to avoid route loops and to guarantee that rerouting paths are the best available paths. So packet delivery performance may not be good enough, and these works fail to consider these problems. Furthermore, if a short term failure event happens, the tunnel will be activated to deliver the packets, and the network will stay in a sub-normal state and never go back to the normal state without operator involvement. In this section, we present a path protection approach to resolve the problems discussed above. The improved state machine with path protection is illustrated in Figure 1(c). We add a state Sp’ in which a tunnel is activated to hold the failure for the duration of a short term failure. During this period, packet delivery with low packet loss is guaranteed. If the failure persists beyond the hold in Sp’, the adjacent nodes start the routing convergence process. Otherwise, the network directly comes back to the normal state once it recovers from the short term failure. So Sp’ is only a temporary state that guarantees packet delivery to the destination under the failure. That is, tunnels will be deactivated whether or not the failed link/node recovers. From this point of view, path protection is only part of a self-healing routing solution, because routing convergence is still necessary if the network failure is a long term event. In this way, holding network failures can effectively improve routing stability [6], and protection paths can guarantee routing availability once a failure happens.
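One possible reading of the protected machine in Figure 1(c) is sketched below in Python. The added protection state is written "Sp'" here, following the figure label. The `hold_time` parameter and the rule "recover within the hold time, otherwise converge" are assumptions made for illustration; the text does not specify how long a failure is held.

```python
def protection_trace(failure_duration, hold_time):
    """States visited after one failure, under the assumed hold-time rule.

    failure_duration and hold_time use the same (arbitrary) time unit.
    """
    # Failure is detected and the pre-configured tunnel is activated.
    trace = ["Sn", "Sd", "Sp'"]
    if failure_duration <= hold_time:
        # Short term failure: the element recovers while the tunnel holds
        # the traffic, so the network returns directly to the normal state.
        trace.append("Sn")
    else:
        # Long term failure: routing convergence starts after the hold,
        # and the network ends in the converged sub-normal state.
        trace += ["Sp", "Ss"]
    return trace


def tunnel_active(state):
    """The tunnel carries traffic only in the temporary protection state."""
    return state == "Sp'"
```

With a hypothetical hold time of 10 seconds, a 2-second failure yields Sn, Sd, Sp', Sn without any route re-computation, whereas a 60-second failure continues through Sp to Ss. In both cases the tunnel is active only while the network is in Sp'.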
4 Path Protection in Self-Healing Routing In this section, we study path protection to react to network failures in self-healing routing. Before that, we need to identify some key requirements of a path protection solution, and the principles of our path protection solution follow.
4.1 Requirements of Path Protection
Several approaches have been proposed to protect Internet routing, including intra-domain and inter-domain routing. However, few approaches are implemented and deployed in the real Internet. Many reasons contribute to this situation, such as wrong (or insufficient) motivation and design faults. In this section, we identify some key requirements that should be met when designing or deploying path protection solutions, based on our study of path protection approaches and the analysis of self-healing routing in Sections 2 and 3.
– Simplicity. Most fast rerouting approaches and fast routing convergence schemes are not deployed in the Internet because these schemes are too complex and put much complexity in the core network. A complex solution is less likely to be deployed. The MPLS-TE approach has this drawback.
– Easy Deployment and Management. The proposed self-healing routing solution should be easy to deploy and manage. For instance, not all networks support MPLS, so an MPLS-related solution is not feasible because it needs an MPLS infrastructure in the Internet. In addition, since all the backup rerouting paths are pre-computed, it is hard for traditional FRR to react to failures dynamically using the best routing paths.
– Efficiency. Proposed solutions should be deployable with good efficiency. From the point of view of efficiency, there is no need to pre-configure tunnels for every node/link. For instance, if non-backbone network resources can be used to protect the networks that are frequently unstable, better Internet routing stability is achieved and the performance of the whole Internet improves considerably.
– Incremental Deployment Support. This is an important factor when designing a novel routing protocol, because we cannot ensure that it can be deployed everywhere at once. As discussed in the above requirements, we can first protect the 10% of the network where failures happen most often in order to improve routing systems.
– Business Model Support. The designed solution should consider the business model of path protection in production networks. To protect unstable networks and backbone network areas, contracts between different ISPs should be signed to guarantee routing availability in these areas. Path protection solutions should not require protection contracts for every link. Otherwise, ISPs may not consider the solution because it may disclose private routing information.
– Low Cost. The path protection solution should provide routes without many computation processes or additional computation power on routers, and should guarantee packet delivery performance with low packet loss. In addition, the solution should cover protection under both short term and long term network failures.
4.2 Principles of Our Solution
In this section, we briefly describe the key elements of our path protection approach in self-healing routing based on some simple examples. We consider the two ISPs shown in Figure 2 and focus on path protection in self-healing routing from the point of view of stub networks.
[Figure 2: reference network diagrams, panels (a) and (b).]
Fig. 2. Reference Network
As Figure 2 shows, the two types of stub networks are multi-homed when they are attached to provider network(s). Our path protection approach works in both scenarios if the provider ISP has a path to the destinations that differs from the original path of the stub network. To quickly react to a failure of the directed link R2→R4 in Figure 2(a) and 2(b), router R4 must be able to quickly detect the failure, activate the pre-configured inter-domain tunnel and send the following packets to router R1 in the provider network. R1 then takes responsibility for forwarding these packets to their destinations. Similarly, in the intra-domain case, if the link R2→R6 in Figure 2(a) fails, router R2 needs to activate the tunnel between router R2 and router R5, and router R5 helps R2 forward packets from router R2 to their destinations. We identify three key elements needed to implement path protection: fast failure detection, a tunnel technique for path protection, and tunnel deactivation.
– Fast Failure Detection. Failures of routing links are detected by using a trigger from the Bidirectional Forwarding Detection (BFD) mechanism [4]. BFD runs on top of any data protocol being forwarded between two systems, and supports adaptive detection times to balance fast detection and network stability.
– Path Protection Technique. As explained earlier, the self-healing solution is required to allow routers to provide path protection for packets immediately if the adjacent router/link fails. Although two different types of routing need to be considered, intra-domain and inter-domain routing (see the tunnels in Figure 2), there is no need to provide different path protection techniques for different routing instances. In this paper, we choose a label-based tunnel protocol, the Layer 2 Tunneling Protocol (L2TP) [8], as the protection technique in self-healing routing.
– Tunnel Deactivation. Since the paths provided by the path protection scheme may not be the best paths, path protection is only a short term solution to defend against routing failure events, not a substitute for routing convergence if long term failures happen. So, tunnels should be deactivated if the short term failure recovers or the routes converge again after a long term failure. In this situation, a tunnel deactivation mechanism is essential to guarantee Internet availability and stability.
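The interplay of the three elements above can be sketched as a small Python class. This is not an implementation of BFD or L2TP: it only models a BFD-style timeout (the neighbor is declared down when no liveness packet arrives within `interval_ms * detect_mult`) and a boolean tunnel flag. The class name, method names and default values are hypothetical.

```python
class ProtectedLink:
    """Toy model: timeout-based failure detection driving a protection tunnel."""

    def __init__(self, interval_ms=50, detect_mult=3):
        # Detection time in the style of BFD: interval times a multiplier.
        self.detection_time_ms = interval_ms * detect_mult
        self.last_heard_ms = 0
        self.tunnel_active = False

    def on_liveness_packet(self, now_ms):
        """Neighbor is alive: a short term failure has recovered, so stop protecting."""
        self.last_heard_ms = now_ms
        self.tunnel_active = False

    def on_routes_converged(self):
        """Routing has converged after a long term failure, so stop protecting."""
        self.tunnel_active = False

    def poll(self, now_ms):
        """Activate the pre-configured tunnel once the neighbor has timed out."""
        if now_ms - self.last_heard_ms > self.detection_time_ms:
            self.tunnel_active = True
        return self.tunnel_active
```

A usage example with the assumed defaults: after a liveness packet at t = 0 ms, `poll(100)` leaves the tunnel inactive, `poll(151)` activates it, and a liveness packet at t = 200 ms deactivates it again.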
5 Methodology and Evaluation In this section, we study performance of path protection in production networks and measure the basic mechanism provided in our path protection approach. In order to
Fig. 3. BGP Peer Relationship
study performance of the path protection approach in self-healing routing, we deploy the path protection scheme in real production networks of Chinese ISPs. Since we cannot change much of the routing architecture in a real production network without risking Internet stability, for simplicity we deploy an experimental tunnel which covers intra-domain and inter-domain routing paths to study the performance of path protection and to measure path-granularity protection metrics, such as routing availability and stability. As Figure 3 illustrates, we deployed hosts in three different ASes which belong to three different ISPs. Effectively, this allowed us to build an L2TP tunnel between two different ISPs (ASes), and forward the traffic to the third ISP (AS). These ISPs and the ASes at each ISP are shown in Figure 3. The idea behind the experiments was to choose a third ISP/AS to protect routing paths and measure the routing performance under path protection. Figure 3 shows one host in the AS of CNC China acting as a session initiator and another one in the AS of CERNET acting as a session receiver, which form a data-plane path called the normal path. Assuming that the link Xm→D1 fails, the L2TP tunnel between AS 4808 and AS 17964 is activated to protect the assumed failed link; this path is called the protection path. The measurement methodology described above was used to study path distribution in ASes and packet forwarding performance. Figure 4(a) shows that the distribution ratio in the protection path is less than 20% and path protection successfully diversifies
[Figure 4 plots: (a) Path Diversity: path distribution ratio per AS for the normal path and the protection path; (b) Performance Metric: average and variance of RTT (ms) per AS in the normal path and in the protection path.]
Fig. 4. Performance Comparison
the routes which are maintained by intra-domain routing systems in ASes. Path protection can effectively guarantee the availability of the routing system after network failures if we choose suitable protection strategies under contracts with ISPs. For instance, non-backbone networks are used to protect the backbone networks in this experiment. Figure 4(b) shows that the average RTT in AS 4837 is about 16 ms and the average RTT over the whole protection path is about 8 ms, which are smaller than those on the normal routing paths because busy paths are detoured in this experiment. As can be seen, our deployed solution can effectively eliminate path re-computation and improve local routing stability when network failure events happen.
6 Conclusion and Future Work
Internet routing is a critical element of the Internet infrastructure. We have shown that current routing systems are not stable enough and are unable to work well under network failures. In this paper, we propose a state machine model to analyze the impact of path protection in routing, which shows that traditional path protection approaches fail to protect routing effectively. We propose an L2TP approach for path protection, which efficiently defends against network accidents caused by both short term and long term failures. Moreover, we deployed our solution in real production networks to measure the performance of our path protection approach. The results show that the proposed solution provides strong protection for routing and improves Internet stability and availability.
References
1. Afek, Y., Bremler-Barr, A., Schwarz, S.: Bgp-rcn: Improving bgp convergence through root cause notification. IEEE Journal On Selected Areas In Communications 22(10), 1933–1948 (2004)
2. Atlas, A., Zinin, A., Torvi, R., Choudhury, G., Martin, C., Imhoff, B., Fedyk, D.: Basic specification for ip fast-reroute: Loop-free alternates, Internet draft, draft-ietf-rtgwg-ipfrr-spec-base-10.txt (November 2007)
3. Nguyen, N., Chen, J., Massey, D., Pei, D., Azuma, M., Zhang, L.: Bgp-rcn: Improving bgp convergence through root cause notification. Computer Network 48(2), 175–194 (2005)
4. Katz, D., Ward, D.: Bidirectional forwarding detection, March 2007. Internet draft, draft-ietf-bfd-base-06.txt (2007)
5. Markopoulou, A., Iannaccone, G., Bhattacharyya, S., Chuah, C., Diot, C.: Characterization of failures in an ip backbone. In: Proceeding of the IEEE INFOCOM, pp. 2307–2317 (2004)
6. Nelakuditi, S., Lee, S., Yu, Y., Zhang, Z., Chuah, C.: Fast local rerouting for handling transient link failures. IEEE/ACM Transactions on Networking 15(2), 359–372 (2007)
7. Shand, M., Bryant, S.: Ip fast reroute framework, June 2007. Internet draft, draft-ietf-rtgwg-ipfrr-framework-07.txt (2007)
8. Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, G., Palter, B.: Layer two tunneling protocol L2TP. RFC2661 (August 1999)
The CPBT: A Method for Searching the Prefixes Using Coded Prefixes in B-Tree
Mohammad Behdadfar and Hossein Saidi
Department of Electrical and Computer Engineering, Isfahan University of Technology
[email protected], [email protected]
Abstract. Due to the increasing size of IP routing tables and the growing rate of their lookups, many algorithms have been introduced to achieve the required speed in table search and update or to optimize the required memory size. Many of these algorithms are suitable for IPv4 but cannot be used for IPv6. Nevertheless, new efforts are under way to fit the available algorithms and techniques to the 128-bit IPv6 addresses and prefixes. The challenging issues in the design of these prefix search structures are minimizing the worst case memory accesses and the processing and pre-processing time for search and update procedures; e.g. Binary Tries have a worst case performance of O(32) memory accesses for IPv4 and O(128) for IPv6. Other compressed Tries have better worst case lookup performance, but their update performance is degraded. One of the proposed schemes to make the lookup algorithm independent of the IP address length is to use a binary search tree and keep the prefix endpoints within its nodes. Some new methods have been introduced based on this idea, such as "Prefixes in B-Tree" (PIBT). However, they need up to 2n nodes for n prefixes. This paper proposes "Coded Prefixes in B-Tree" (CPBT), in which prefixes are coded to make them scalar and therefore suitable as ordinary B-tree keys, without the additional node complexity which PIBT requires. We finally compare the CPBT method with the recent PIBT method and show that, although both have the same search complexity, our scheme has much better storage requirements and about half of the memory accesses during update procedures.
Keywords: lookup, insert, delete, LMP, LPM, SMV.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 562–573, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction
Because of the exponential growth of routing table sizes in IP routers, an addressing method named classless addressing was proposed in 1993. This scheme abandoned the earlier fixed-length classes of the IP address space and instead used address prefixes of variable length, up to 32 bits. It introduced new design considerations for the routing table search and update procedures. Making routing decisions and forwarding IP packets to their next-hop routers based on the LPM¹ is called CIDR, or "Classless Inter-Domain Routing".

Up to now, several algorithms have been proposed for finding the LMP² of an input address in a database. The most popular methods are PATRICIA [1], binary and compressed tries [2,3], the scheme of Gupta et al. [4], CAM-based algorithms [5], range-based algorithms [6], and derivatives of these. The main problem of trie-based algorithms is their worst-case search and lookup performance of O(W), where W is the IP address length: 32 for IPv4 and 128 for IPv6. For hardware-based online lookup systems, the worst-case lookup time and number of memory accesses are the important factors during a burst of packet arrivals. CAM-based algorithms suffer from high power consumption and from the small number of entries available in CAM modules.

Range-based algorithms build their structure from the range of addresses covered by each prefix. For example, with W=5 and the two prefixes 1* and 100*, the ranges are {16, 31} and {16, 19}. Range-based algorithms store the prefix endpoints in their data structure [6]; e.g. for 100* the endpoints are 16 and 19, and the information of both endpoints must be stored in the database. The range-based algorithm proposed in [6] can use a binary tree for storing endpoints. In the worst case there are 2n endpoints for n prefixes [6], and all 2n endpoints need to be stored in the tree. Because a binary search tree may be unbalanced, its height cannot be controlled in the worst case, which results in poor worst-case memory-access performance for search and update. In [7], Yazdani and Min proposed a scheme that improves the average lookup and update times; however, the worst-case times did not improve.
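The range example above (W=5, prefixes 1* and 100*) can be checked with a short Python sketch. The function name prefix_range and the representation of a prefix as a bit string without the trailing '*' are assumptions made for illustration only; they are not taken from [6].

```python
def prefix_range(prefix: str, width: int) -> tuple[int, int]:
    """Return the (low, high) endpoints of the addresses covered by a prefix.

    `prefix` is a bit string without the trailing '*', e.g. "100" for 100*.
    """
    free_bits = width - len(prefix)              # number of unspecified bits
    low = (int(prefix, 2) << free_bits) if prefix else 0
    high = low | ((1 << free_bits) - 1)          # fill the free bits with ones
    return low, high


W = 5
print(prefix_range("1", W))    # (16, 31)
print(prefix_range("100", W))  # (16, 19)
```

Both endpoints of each prefix are distinct values, which is why a range-based structure may need up to 2n keys for n prefixes.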
To balance the tree and improve the worst-case search operation, in [8] we proposed a coding algorithm that interprets and stores prefixes as numbers in a B-tree and uses scalar comparison during the search. Later, Lu and Sahni [9] proposed PIBT, which also uses a B-tree structure but separates prefixes into endpoints and stores these endpoints together with some vectors that facilitate the search and update procedures. Because the height of a B-tree has a guaranteed maximum, for both [8] and [9] the worst-case number of memory accesses of the search and update procedures is bounded by the height of the tree, in contrast to unbalanced trees. The scheme in [10] also uses a B-tree, but only to store disjoint IPv6 prefixes. Building on our idea in [8], with modifications to its search and update operations, we introduce here the CPBT, "Coded Prefixes in B-Tree", which simplifies the search and update operations and reduces the memory requirement by a factor of 4 in comparison to PIBT.

In what follows, Section 2 reviews PIBT [9]. Section 3 presents the main idea of the proposed scheme as discussed in [8]. Section 4 describes the CPBT algorithm, which modifies the main idea of [8]: it shows that, without storing the actual encoded prefixes, the idea of encoded prefixes can still be used to compare prefixes during the search and update procedures, and it presents enhanced search and update procedures that reduce the update complexity. Finally, Section 5 compares the memory requirements and update complexity of the CPBT with those of PIBT.

¹ Longest Prefix Matching. ² Longest Matching Prefix.
2 PIBT Review
Consider a node of a simple B-tree with t keys. The PIBT structure is built from the prefix endpoints and some additional vectors called the "interval vector" and the "equal vector". For each node with t keys there are therefore three sets of parameters: t key values, (t+1) interval vectors, and (t+1) equal vectors, i.e. t*W + (t+1)*W*2 bits. PIBT also separates each prefix into two endpoints and uses a key for each of them. Therefore, for n prefixes it may use up to 2n keys or 6 sets of parameters. An interval int(x) from the address space is allocated to each node x. For the root node, this interval, int(root), is [0, 2^W − 1]. Assume that X is a B-tree node with t keys and that the information stored in X has the following format:
t, child0, (key1, child1), (key2, child2), …, (keyt, childt)
Here childi is a pointer to X's i-th subtree, and key1 < key2 < … < keyt.
3 Proposed Encoding of Prefixes, the Main Idea
In [8], we proposed a method of storing prefixes in a routing table in which prefixes are treated as numbers rather than ranges. Consider a prefix p with k bits and W-bit addresses. Each bit of p is coded into a 2-bit value as follows: each bit '1' becomes '10', each bit '0' becomes '01', and each blank place becomes '00'. The number of blank places is W−k. For example, if W=5:

0*      → 01 00 00 00 00
010*    → 01 10 01 00 00
001*    → 01 01 10 00 00
0010*   → 01 01 10 01 00
0001*   → 01 01 01 10 00
00001*  → 01 01 01 01 10
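The encoding rule can be sketched in a few lines of Python. The function name encode_prefix and the representation of a prefix as a bit string without the trailing '*' are assumptions made for illustration; they do not come from [8].

```python
def encode_prefix(prefix: str, width: int) -> str:
    """Encode a prefix (bit string without '*') as a 2*width-bit code."""
    pair = {"1": "10", "0": "01"}                 # coding of specified bits
    coded = "".join(pair[b] for b in prefix)
    return coded + "00" * (width - len(prefix))   # blank places become '00'


W = 5
prefixes = ["0", "010", "001", "0010", "0001", "00001"]
for p in prefixes:
    print(f"{p}* -> {encode_prefix(p, W)}")

# Sorting by the numeric value of the code gives a total order on prefixes.
ordered = sorted(prefixes, key=lambda p: int(encode_prefix(p, W), 2))
print(" < ".join(p + "*" for p in ordered))
# 0* < 00001* < 0001* < 001* < 0010* < 010*
```

Because every code has the same length of 2W bits, comparing codes as integers or as equal-length bit strings gives the same order, so an ordinary B-tree key comparison is sufficient.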
With the above definition, the prefixes can be compared like numbers, and we have: 0* < 00001* < 0001* < 001* < 0010* < 010*

3.1 Using the Proposed Encoding of Prefixes to Search for the Prefixes
As an example, consider the B-tree of prefixes shown in figure 1 and assume the objective is to find the prefix "00100*". Start from the root. Since "01*>00100*", follow the pointer to the left of "01*" to C0. The key "00100*" is found in C0; "A" marks the direction of the search. With this procedure we can find any given prefix with a simple search, but the challenge is finding the Longest Matching Prefix (LMP) of an input destination address.
Root: 01*, 0100*
C0 (left of 01*, reached along search direction A): 000*, 00100*
C1 (between 01* and 0100*): 010*
C2 (right of 0100*): 010101*, 10*
Fig. 1. Example B-tree for storing prefixes
Now assume W=6 and an input destination address d=010100. The objective is to find the LMP of d in the tree. Start from the root. Clearly d>01* and d>0100*, and d matches 01*. Therefore we save 01* as the LMP found so far and descend into C2. In C2 no key matches d, so the LMP result for d would be 01*. However, the figure shows another key, 010*, in C1 that matches d and is longer than 01*. So this search algorithm may miss the correct answer. Storing prefixes in this way is only useful when the prefixes are completely disjoint, i.e. no prefix in the database is a prefix of another. Our solution to the problem depends on a lemma, summarized here and proved in [8].

Lemma 1. Given a destination address d with W bits, assume a prefix c with j bits, which is located in a node of the B-tree and is currently being checked. Also assume that another prefix k with ℓ bits exists in the tree and has not yet been found in the search.
a) If
- k is a prefix of both c and d,
- d>c, and
- k and c are not stored in one common node,
then k will not be on the search path and will be missed.
b) A prefix k that is a prefix of d but not a prefix of c will be on the search path at this node.

3.2 A Method for Finding the LMP of a Given Destination Address
As described above, storing raw prefixes in the B-tree causes some of them to fall off the search path. To solve this problem, we introduce an additional vector stored with each key, called the "Shorter Matching Vector" or SMV. This vector is W bits long. Consider a key k in the database and its SMV: if bit i of the SMV is set to 1, a prefix of length i exists in the tree that is a prefix of k. From the definition of the SMV and Lemma 1, it is clear that if the SMV can be updated correctly, it prevents those prefixes from falling off the search path [8].
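The SMV definition can be illustrated with a minimal Python sketch that computes the vector directly from a set of prefixes. The name full_smv, the use of a plain set instead of a B-tree, and the convention that a key counts as a prefix of itself are assumptions made for this illustration; the sketch shows only the definition, not the update procedure of [8].

```python
def full_smv(key: str, table: set[str], width: int) -> str:
    """Return the W-bit SMV of `key` according to the definition above.

    Bit i (1-based, leftmost first) is '1' when the first i bits of `key`
    form a prefix that is present in `table`.
    """
    bits = []
    for i in range(1, width + 1):
        present = i <= len(key) and key[:i] in table
        bits.append("1" if present else "0")
    return "".join(bits)


W = 6
# Prefixes of figure 1, written without the trailing '*'.
table = {"01", "0100", "000", "00100", "010", "010101", "10"}
print(full_smv("0100", table, W))    # 011100
print(full_smv("010101", table, W))  # 011001
```

For the key 0100*, bits 2, 3 and 4 are set because 01*, 010* and 0100* are all present, so a search that reaches 0100* learns about 010* even when 010* itself is not on the search path.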
4 CPBT
Here, the CPBT algorithm is described. CPBT is a modified version of the main idea in [8]. It is shown that, without storing the actual encoded prefixes, the idea of encoded prefixes can still be used to compare prefixes during the search and update procedures. Additionally, enhanced versions of the search and update procedures are presented, which reduce the update complexity of the algorithm. The modifications are detailed in the following subsections.

4.1 Modification of the Main Idea of [8]
Instead of storing the encoded prefixes as described in Section 3, the prefixes can be stored as W-bit vectors. For example, if W=6 and p=110*, the W-bit vector '110000' is stored instead of p. Additionally, a 6-bit SMV is stored with this prefix; its third bit (the prefix length) is set to one and its 4th, 5th and 6th bits are set to zero. When this SMV is read during the search, its rightmost 1 identifies the actual prefix and its length. In this example, if the SMV is "011000", the prefix is 110* (the rightmost 1 is the third bit of 011000). Therefore, the prefix length need not be kept as a separate entity.

Root: 01*(010000,010000), 0100*(010000,011100)
C0: 000*(000000,001000), 00100*(001000,000010)
C1: 010*(010000,001000)
C2: 010101*(010101,011001), 10*(100000,010000)
Fig. 2. Prefixes and their shorter matching vectors in the proposed structure in the form of prefix*(prefix, SMV)
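The way a stored (prefix, SMV) pair identifies the prefix and its length can be sketched as follows. The function name decode_key and the string representation of the two vectors are assumptions made for illustration.

```python
def decode_key(vector: str, smv: str) -> str:
    """Recover 'prefix*' from a W-bit prefix vector and its SMV."""
    length = smv.rfind("1") + 1   # 1-based position of the rightmost 1
    if length == 0:
        raise ValueError("SMV has no bit set, so no prefix length is encoded")
    return vector[:length] + "*"


print(decode_key("110000", "011000"))  # 110*
print(decode_key("010000", "010000"))  # 01*
print(decode_key("010000", "011100"))  # 0100*
```

The last two calls show why the SMV is needed to tell keys apart: 01* and 0100* share the same W-bit vector 010000 and differ only in the position of the rightmost 1 of their SMVs.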
Additionally, although keys are saved as W-bit vectors, two keys are compared using the same notion of encoding prefixes as numbers that was described earlier. With the above modifications, using the pair entry (prefix, SMV) and assuming W=6, figure 2 shows the new tree corresponding to figure 1.

As an example, assume W=6 and that the input destination address d=010100 is given. Starting from the root and checking its left key, we find that 01* exists in the database. Checking the right key of the root, we find that d is greater than the right key, and the SMV of that key shows that the LMP of d found so far is 010*. Since d is greater than the right key, we descend into C2, but C2 contains no further match, so the LMP of d is 010* and the search is complete.

The next objective is to update the SMVs when a prefix is inserted or deleted. The following subsections introduce methods for the update and search procedures. The proofs of correctness of these methods are omitted due to space limitations. First, consider the following notation:
- The length of a prefix p is denoted by len(p).
- The i-th key in a node N is denoted by N(i).
- The length of the IP address is W.
- The SMV of a prefix p is denoted by SMVp.
- For two prefixes p and x, if p is a prefix of x we write p→x, and if p is not a prefix of x we write p╫x.
- If p→x, the bit of SMVx that corresponds to p is denoted by SMVx(len(p)).
- The prefix of x with length k is denoted by lenk(x).
- If the SMV of a prefix x overwrites the SMV of a prefix y, x is called the Master and y is called the Slave.

4.2 Insertion
Insertion of a new prefix is similar to the B-tree insertion of [8] and [11]. However, updating the SMV is simpler than in [8].
In general, it may seem that updating a key in a node requires updating all keys in its children, but this is not true. The insert procedure is as follows:
- Prefixes are compared according to the coding introduced in Section 3. Initially, the prefix to be added is compared with all the keys in the root to find its location among them. This comparison identifies the child and subtree into which the prefix will be inserted.
- On the search path of inserting a prefix p, in the first node that contains any prefix x with p→x, SMVx(len(p)) is set to '1' for all keys x with this property.
- If node splitting is required, the key s that is removed from the child node and added to the parent node is the Slave when setting SMVs. That is, the keys of the parent node and their SMVs are the Master and update the SMV of s, whereas the SMV of s does not update them. In other words, the information in a parent node overrides the information in its child nodes during the search and update procedures.
- Finally, after storing p in a node, say node S, the SMVs are set such that:
  o For each prefix x in node S, if p→x then SMVx(len(p))=1.
  o For each prefix x in node S, if there is a length k such that bit k of its SMV is already set to 1 (SMVx(k)=1) and lenk(x)→p, then SMVp(k)=1.

a. Inserting 01*, 010*, 10*
   Root: 01*(010000,010000), 010*(010000,011000), 10*(100000,010000)
b. Inserting 110*, 11*
   Root: 010*(010000,011000)
   Children: [01*(010000,010000)] [10*(100000,010000), 11*(110000,010000), 110*(110000,011000)]
c. Inserting 1100*
   Root: 010*(010000,011000), 11*(110000,010000)
   Children: [01*(010000,010000)] [10*(100000,010000)] [110*(110000,011000), 1100*(110000,011100)]
d. Inserting 100*, 011*
   Root: 010*(010000,011000), 11*(110000,010000)
   Children: [01*(010000,010000)] [011*(011000,001000), 10*(100000,010000), 100*(100000,011000)] [110*(110000,011000), 1100*(110000,011100)]
e. Inserting 101*
   Root: 010*(010000,011000), 10*(100000,010000), 11*(110000,010000)
   Children: [01*(010000,010000)] [011*(011000,001000)] [100*(100000,011000), 101*(101000,011000)] [110*(110000,011000), 1100*(110000,011100)]
f. Inserting 0110*
   Root: 10*(100000,010000)
   Second level: [010*(010000,011000)] [11*(110000,010000)]
   Leaves: [01*(010000,010000)] [011*(011000,001000), 0110*(011000,001100)] [100*(100000,011000), 101*(101000,011000)] [110*(110000,011000), 1100*(110000,011100)]
Fig. 3. An example of the insertion of prefixes and their shorter matching vectors in the form of prefix*(prefix, SMV)
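The two SMV rules of the final insertion step can be sketched in Python for a single node. This is a simplified illustration under stated assumptions: a node is modelled as a dictionary from prefix (without '*') to SMV string, the names set_bit and store_in_node are hypothetical, node splitting and the Master/Slave handling are left out, and the new prefix starts with only its own-length bit set, as described in Section 4.1.

```python
def set_bit(smv: str, k: int) -> str:
    """Return `smv` with its k-th bit (1-based from the left) set to '1'."""
    return smv[:k - 1] + "1" + smv[k:]


def store_in_node(node: dict[str, str], p: str, width: int) -> None:
    """Store prefix `p` in `node` (prefix -> SMV) and apply the two SMV rules."""
    smv_p = set_bit("0" * width, len(p))           # own-length bit of p
    for x, smv_x in list(node.items()):
        # Rule 1: p -> x, so x records that a prefix of length len(p) exists.
        if x.startswith(p):
            node[x] = set_bit(smv_x, len(p))
        # Rule 2: each prefix already recorded in SMVx that is also a prefix of p.
        for k in range(1, min(len(x), len(p)) + 1):
            if smv_x[k - 1] == "1" and p.startswith(x[:k]):
                smv_p = set_bit(smv_p, k)
    node[p] = smv_p


node: dict[str, str] = {}
for prefix in ["01", "010", "10"]:
    store_in_node(node, prefix, 6)
print(node)  # {'01': '010000', '010': '011000', '10': '010000'}
```

The printed SMVs agree with step a of figure 3, where 01*, 010* and 10* are inserted into a single node.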
As an example of insertion, consider a tree into which the following prefixes are inserted in this order: 01*, 010*, 10*, 110*, 11*, 1100*, 100*, 011*, 101*, 0110*. Figure 3 shows the results of the insertion process. As figure 3 shows, although 01* is a prefix of 011* and 0110*, the SMVs of 011* and 0110* are not updated for 01*. This shows that, in this new version of the algorithm, it is not necessary to update all of the SMVs related to a prefix p. Since an insertion traverses the height of the tree, its memory access complexity is O(logm(n)), where n is the total number of prefixes and m is the order of the B-tree.

4.3 Deletion
The delete procedure introduced in [8] has a complexity issue. If a prefix p is to be removed from the database, all other prefixes of which p is a prefix must be checked and their SMVs updated; that is, the bit related to p in their SMVs must be set to zero. As this method is time consuming, we introduce a new method that improves the delete performance. The delete procedure for a prefix p is similar to the delete procedure of a B-tree [11], with additional updating of the SMVs. The delete procedure and the SMV updates are as follows:
- Using the B-tree delete procedure [11], the location of the prefix to be deleted must be found. Prefixes are compared according to the coding introduced in Section 3. Initially, the prefix p to be deleted is compared with all the keys in the root to find its location among them. This comparison identifies the child through which the path to the location of p passes.
- During the search for p, other prefixes of which p is a prefix may be found. If any such prefix is found on the search path, its SMV is updated immediately. Otherwise, either the database contains no prefix of which p is a prefix, or these prefixes are located after p on the search path. The regular delete operation of p follows the path of the recursive delete [11] to either the predecessor or the successor of p. Therefore, if p is found in an internal node, it is necessary to search the path to the successor of p in the database. If, during the search to the successor, other prefixes are found of which p is a prefix, the bit corresponding to p in their SMVs is set to zero.
- In the merge procedure [11], the key that is removed from the parent node and stored as the median key of the merged node is the Master and modifies the SMVs of the other keys in the new merged node.
- In the borrow-key procedure [11], the key that is borrowed from a child node x and moved to the parent node F is the Slave, and the key that is removed from F and stored in the sibling of x, say y, is the Master.
The delete operation proposed above may not update the SMVs of all the prefixes of which the deleted prefix is a prefix, but it can be proved that, if any such prefixes exist in the tree, at least the one closest to the root will be updated. As an example of the delete procedure, consider the tree of figure 3. If 10* is deleted from this tree, one of the trees shown in figures 4 and 5 results, depending on whether the predecessor or the successor is checked first.

Root: 010*(010000,011000), 0110*(011000,001100), 11*(110000,010000)
Children: [01*(010000,010000)] [011*(011000,001000)] [100*(100000,001000), 101*(101000,001000)] [110*(110000,011000), 1100*(110000,011100)]
Fig. 4. An example of prefix deletion: prefix 10* is deleted from the tree of figure 3, checking the predecessor first
Root: 010*(010000,011000), 100*(100000,001000), 11*(110000,010000)
Children: [01*(010000,010000)] [011*(011000,001000), 0110*(011000,001100)] [101*(101000,011000)] [110*(110000,011000), 1100*(110000,011100)]
Fig. 5. An example of prefix deletion: prefix 10* is deleted from the tree of figure 3, checking the successor first
In the first method, used in figure 4, the predecessor of 10* replaces it. As depicted in the figure, the path to the successor is checked as well, and the SMV bits of 100* and 101* that correspond to 10* (the second bit of each SMV) are set to zero. In the second method, used in figure 5, the successor of 10*, i.e. 100*, replaces it, and its SMV bit corresponding to 10* is set to zero. As figure 5 shows, it is not necessary to modify the corresponding SMV bit of 101*. If a future operation moves 101* and its SMV to its parent's location, its SMV will be updated based on the SMVs in its parent. As with the B-tree delete operation, the memory access complexity of this delete strategy is O(logm(n)), where n is the number of prefixes and m is the order of the B-tree.
4.4 Search of LMP for a Given Address
The CPBT search procedure for finding the LMP of a given address is revised as follows. Again, the procedure is similar to the search procedure of a B-tree, but prefixes are compared according to the coding introduced in Section 3. Consider an input address d. A vector called SMVd is defined to store the information about the prefixes of d. Initially, all of its bits are set to *. The search starts at the root node. For any prefix x stored in the root node, if leni(x) is a prefix of d, bit SMVd(i) is set to "0" or "1" according to SMVx(i), and it does not change again during the search. This procedure is repeated in the subsequent nodes of the search path to update the remaining * bits. After the tree has been traversed, the search is finished and SMVd identifies all prefixes of d available in the tree. At this point, the rightmost "1" of SMVd represents the LMP of d.

As an example, consider W=6 and d=100100. The search for d in the tree of figure 5 first examines the root and finds 100* as an existing prefix. It also finds that 1* and 10* do not exist in the tree. Continuing, the path goes through the third child of the root. There, the SMV bit corresponding to 10* is set to one in the prefix 101*. However, since the bit corresponding to 10* has already been set to zero in the root node, this later result cannot overwrite it. Therefore, the LMP of d is 100*. As with the B-tree search operation, the search time is proportional to the height of the tree, which is at most logm((n+1)/2), where n is the number of prefixes and m is the order of the B-tree.
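The search procedure can be traced with a small Python sketch on the tree of figure 5. Several details are assumptions made for this illustration: the tree is transcribed by hand as nested tuples, the names code and longest_matching_prefix are hypothetical, keys are compared through the Section 3 encoding, and bit i of a key x is only consulted for i up to len(x), since the paper defines leni(x) as a prefix of x.

```python
from bisect import bisect_right

W = 6


def code(bits: str) -> str:
    """Section 3 encoding: '1' -> '10', '0' -> '01', blank place -> '00'."""
    pair = {"1": "10", "0": "01"}
    return "".join(pair[b] for b in bits) + "00" * (W - len(bits))


def leaf(*keys):
    """Build a node without children."""
    return (list(keys), [])


# A node is (keys, children); a key is (prefix without '*', SMV).
TREE = (
    [("010", "011000"), ("100", "001000"), ("11", "010000")],
    [
        leaf(("01", "010000")),
        leaf(("011", "001000"), ("0110", "001100")),
        leaf(("101", "011000")),
        leaf(("110", "011000"), ("1100", "011100")),
    ],
)


def longest_matching_prefix(tree, d: str):
    smv_d = ["*"] * W                       # '*' means "not decided yet"
    node = tree
    while True:
        keys, children = node
        for x, smv_x in keys:
            for i in range(1, len(x) + 1):
                # The first decision for bit i is final (parent overrides child).
                if smv_d[i - 1] == "*" and d.startswith(x[:i]):
                    smv_d[i - 1] = smv_x[i - 1]
        if not children:
            break
        # Equal-length codes compare as strings exactly as they do as numbers.
        idx = bisect_right([code(x) for x, _ in keys], code(d))
        node = children[idx]
    length = "".join(smv_d).rfind("1") + 1  # rightmost 1 gives the LMP length
    return d[:length] + "*" if length else None


print(longest_matching_prefix(TREE, "100100"))  # 100*
print(longest_matching_prefix(TREE, "010100"))  # 010*
```

For d=100100 the root fixes SMVd to 001*** through the key 100*, so the bit for 10* that is still set in the SMV of 101* in the third child is ignored, as in the example above.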
4.5 The Final Theorem
The above insert, delete and search procedures lead to a final theorem.
Theorem. Assume that a B-tree of prefixes is constructed with the insertion and deletion procedures proposed in Sections 4.2 and 4.3. For an arbitrary destination address d, each prefix p with the property p→d will be found in the B-tree by the search algorithm proposed in Section 4.4. The proof of this theorem is omitted due to space limitations.
5 The Comparison of the Proposed Algorithm and PIBT
As explained, almost all range-based algorithms proposed so far need to store two endpoints for each prefix, or at least more than one key. One of the latest range-based algorithms is PIBT, which stores prefix endpoints in a B-tree. PIBT needs to store two keys (endpoints) for each prefix and also requires about two W-bit vectors for each endpoint. In the proposed CPBT scheme, prefix endpoints need not be stored; it is enough to store the prefixes themselves and use the proposed coding. Therefore, the number of keys in the tree is half the number of keys of PIBT. Moreover, we do not need to store two additional vectors for each key; a single SMV per key is sufficient. This reduces the required vector bits to 1/4. That is, CPBT requires 32 bits per prefix for IPv4 and 128 bits for IPv6, compared to 128 and 512 bits for PIBT, which is a considerable advantage for long IPv4 and IPv6 prefixes. The memory required to keep the pointers is ignored here.

In PIBT, the interval vectors and the keys are updated separately. In the proposed architecture, a prefix and the SMVs can be updated simultaneously: the SMVs are updated during the insert or delete procedure of a prefix, which means fewer memory accesses per update. Another advantage of this algorithm over PIBT is that only one prefix has to be inserted or deleted in each update, whereas PIBT may have to insert or delete both endpoints of a prefix, which means traversing the tree twice. As shown, the height h of the tree is the height of a B-tree, h ≤ logm((n+1)/2), where n is the number of prefixes and m is the order of the B-tree. Since the tree is balanced, the proposed algorithm has a guaranteed worst-case search and update time for any prefix length distribution. A further benefit is that it does not need any pre-processing.
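The vector-storage figures quoted in this section follow from a simple count, sketched below. The function name vector_bits_per_prefix is hypothetical, and the count covers only the vectors (one SMV per key for CPBT, about two W-bit vectors for each of the two endpoints for PIBT); keys and child pointers are ignored, as in the text.

```python
def vector_bits_per_prefix(width: int) -> dict[str, int]:
    """Vector bits stored per prefix, ignoring keys and child pointers."""
    cpbt = 1 * 1 * width   # one key per prefix, one SMV per key
    pibt = 2 * 2 * width   # two endpoints per prefix, about two vectors each
    return {"CPBT": cpbt, "PIBT": pibt}


for name, width in (("IPv4", 32), ("IPv6", 128)):
    print(name, vector_bits_per_prefix(width))
# IPv4 {'CPBT': 32, 'PIBT': 128}
# IPv6 {'CPBT': 128, 'PIBT': 512}
```

The ratio is 1/4 for any address length W, so the saving does not depend on whether the table holds IPv4 or IPv6 prefixes.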
6 Conclusions
This paper introduced CPBT, a new algorithm for the prefix matching problem and for finding the Longest Matching Prefix (LMP). It treats prefixes as numbers rather than ranges during the search procedures. This coding of prefixes removes the dependency on the prefix length during search and update, and in memory usage, for both IPv4 and IPv6. Therefore, it is equally suitable for IPv4 and IPv6. It is also possible to apply this scheme to the method in [10] for its extension to IPv6. The scheme was compared with PIBT [9], and it was shown that, although both have the same search complexity, CPBT has much better storage requirements and needs about half of the memory accesses during update procedures.
References
1. Morrison, D.R.: PATRICIA Practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM 15(4), 514–534 (1968)
2. Srinivasan, V., Varghese, G.: Faster IP lookup using controlled prefix expansion. ACM Transactions on Computer Systems 17(1), 1–40 (1999)
3. Nilsson, S., Karlsson, G.: IP address lookup using LC-tries. IEEE Journal on Selected Areas in Communications 17(6), 1083–1092 (1999)
4. Gupta, P., Lin, S., McKeown, N.: Routing lookups in Hardware at Memory Access Speeds. In: Proceedings of IEEE Infocom, vol. 3, pp. 1240–1247 (1998)
5. Shah, D., Gupta, P.: Fast Incremental updates on ternary-CAMs for routing lookups and Packet classification. In: Proceedings of Hot Interconnects VIII (2000), IEEE Micro. (2001)
6. Lampson, B., Srinivasan, V., Varghese, G.: IP lookups using multiway and multicolumn search. In: Proceedings of IEEE Infocom, vol. 3, pp. 1248–1256 (1998)
7. Yazdani, N., Min, P.: Prefix Trees: new Efficient Data Structures for Matching Strings of Different length. In: Ideas 2001, pp. 76–85 (2001)
8. Behdadfar, M.: Review and Improvement of Longest Matching Prefix Problem in IP Network, MSC thesis, Isfahan University of Technology (2002)
9. Lu, H., Sahni, S.: A B-Tree Dynamic Router-Table Design. IEEE Transactions on Computers 54(7), 813–824 (2005)
10. Sun, Q., Zhaho, X., Huang, X., Jiang, W., Ma, Y.: A Scalable Exact Matching in Balance Tree Scheme for IPv6 Lookup. In: ACM SIGCOMM 2007 data communication festival, IPv6 2007 (2007)
11. Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms. Hill Book Company. McGraw-Hill, New York (1999)
On the Effectiveness of Proactive Path-Diversity Based Routing for Robustness to Path Failures
Chansook Lim1, Stephan Bohacek2, João P. Hespanha3, and Katia Obraczka4
1 Dept. Computer Information & Communication, Hongik University, 339-701, Republic of Korea
2 Dept. Electrical & Computer Engineering, University of Delaware, Newark, DE 19716, USA
3 Dept. Electrical & Computer Engineering, University of California, Santa Barbara, CA 93106-9560, USA
4 Computer Engineering Department, University of California, Santa Cruz, CA 95064, USA
Abstract. Path disruptions are frequent occurrences on today's Internet. They may be due to congestion or failures, which in turn may be attributed to unintentional factors (e.g., hardware failures) or caused by malicious activity. Several efforts to date have focused on enhancing robustness from the end-to-end viewpoint by using path diversity. Most of these studies are limited to single- or two-path approaches. This paper is the first to address the question of what degree of path diversity is needed to effectively mitigate the effect of path failures. We seek to answer this question through extensive experiments in PlanetLab. To evaluate the effect of path diversity on routing robustness with regard to a wide spectrum of applications, we introduce a new performance metric that we call outage duration. Experimental results show that proactively forwarding packets using a high degree of path diversity is more effective in overcoming path failures than single-path or two-path approaches. In addition, for applications in which low packet loss probability is as important as uninterrupted connectivity, we suggest a packet forwarding scheme based on link gains and discuss the trade-offs between robustness and packet delivery probability.
Keywords: path failure, robustness, routing, path diversity.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 574–585, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction
Currently, Internet routing adopts a reactive approach for handling path disruptions. In other words, it waits until failures are detected to take any action. While this strategy has been satisfactory thus far, it certainly will not provide the levels of robustness required by more recent as well as emerging (e.g., real-time, interactive, etc.) applications.
Several efforts over the last few years have attempted to mask path failures and improve end-to-end path availability. Early seminal work includes overlay network routing schemes such as Detour and RON [1,2]. These schemes monitor paths and upon detecting a path failure, forward packets indirectly via intermediate nodes. More recently, Gummadi et al. [3] suggested a low-overhead on-demand scheme that uses probes to find a new working path, only when a path failure is detected. A common feature among these schemes is that they are purely reactive. Thus they cannot avoid packet losses at least for tens of seconds, which is the time required to detect a path failure and switch to another path. In the case of RON, each node has information about all available paths, but forwards packets using a single path. Thus, a failure will result in a burst of losses until the path is updated. Depending on the application and the length of time to detect and react to the route failure, the resulting burst of packet losses might not be tolerated. One way to avoid routing disruptions and thus prevent packet loss bursts is through proactive approaches that use path diversity, e.g., multipath routing, which forward packets over multiple paths. Proactive multipath routing reduces bursty packet losses by enabling some of the packets to still be delivered to the destination as long as at least one of the paths from source to destination remains viable. Reliability enhancement through path diversity is not a new idea; indeed, it has been extensively studied [4,5,6,7]. One question that comes up though is what degree of multipath diversity is necessary/sufficient to overcome path failures on the current Internet so that application-specific performance requirements are satisfied. The answer to this question depends, among other things, on how often simultaneous link failures occur. However, there has been little work on characterizing simultaneous path failures on the Internet. 
Although there have been some useful studies on failure characterization within one ISP (e.g., [8]), it is hard to predict how often an arbitrary connection passing through usually more than one AS may experience simultaneous path failures. Nevertheless, most studies to-date try to cope with a single point of failure. A notable example is the work on intra-domain path diversity reported in [5]. This study focuses on proactive routing over two paths assuming that the chance of network components within an AS located physically apart to fail at exactly the same moment is extremely slim. Whether this assumption can hold even beyond intra-domain is not clear because, as pointed out before, there is little statistics on simultaneous path failures on the Internet. This is one of the main reasons why we are interested in improving routing robustness for arbitrary Internet source-destination pairs, whose routes may cross several ASs. To our knowledge, our work is the first attempt to address the question of what degree of multipath is effective in mitigating the impact of path failures on the Internet. Our approach is based on implementing a proactive multipath routing scheme, namely the Game Theoretic Stochastic Routing (GTSR) mechanism [7], as an application-layer overlay routing protocol and conducting extensive experiments
C. Lim et al.
using PlanetLab [9]. To maximize the degree of path diversity, we formed many small PlanetLab groups and performed experiments where packets were sent from source to destination over multiple paths for each group. To evaluate the effect of the degree of path diversity on robustness for a wide spectrum of applications, we introduce a performance metric we call outage duration. Our experimental results show that proactively forwarding packets using a high degree of path diversity is more effective in mitigating the effects of path failures than single-path or two-path approaches. Additionally, we consider the case of applications that require a low end-to-end loss rate and propose a packet distribution scheme using heterogeneous link gains in a max-flow optimization. Under this link gain mechanism, lossy paths may be “penalized” so that more packets are sent over paths with low loss rates. We discuss how to balance the trade-off between end-to-end loss probability and maximum outage duration by adjusting the fraction of packets to be sent on each path. The rest of this paper is organized as follows. In the next section, we present related work as motivation for our work. Our empirical study of the impact of proactive multipath on routing robustness, including our experimental methodology and results, is presented in Section 3. In Section 4, we describe the link gain packet distribution scheme. Lastly, Section 5 concludes the paper and presents some directions for future work.
2
Related Work
Most existing schemes to enhance Internet robustness to path failures use reactive, single-path routing approaches. However, single-path approaches cannot avoid bursty packet losses, which may happen while detecting a failure and trying to switch to an alternate path. In contrast, proactive multipath approaches, which forward packets over multiple paths, can reduce bursty packet losses by enabling some of the packets to still make it through, as long as at least one of the paths from source to destination remains viable. Preventing long outages will enable applications to provide uninterrupted service even while new routes are being computed/discovered. Schemes taking such approaches for the purpose of robustness to path failures include our own GTSR [7] and the work presented in [5]. In the past, an important deterrent to the use of multipath routing was the fact that TCP is known to exhibit significant performance degradation when subject to persistent packet reordering, which will likely happen under multipath routing. A number of TCP variants have recently been proposed to obviate this problem [10,11,12]. The benefit of proactive multipath routing becomes even more evident when considering that, from the end-host’s viewpoint, identifying failures is not easy. Indeed, prior studies use different definitions of failure, some of which are quite arbitrary. For example, in [3], a failure is defined as a sequence of packet losses
On the Effectiveness of Proactive Path-Diversity Based Routing
that would have a significant or noticeable application impact. Considering TCP reaction to losses, the authors consider as failure a loss incident that began with 3 consecutive probe losses and the failure of the initial traceroute. On the other hand, in RON [2], a path outage is defined to be an incident where the observed packet loss rate over a short time interval is larger than some threshold. One drawback with any technique to identify failures is that it is not easy to differentiate between packet losses due to severe congestion and packet losses due to true path failures. Furthermore, the definition of a failure should depend on the application (e.g., non-interactive vs. interactive applications). Another noteworthy problem with identifying failures is that it is not trivial to conduct meaningful evaluation studies of different routing approaches in terms of their robustness. More specifically, existing performance metrics such as the percentage of recovered failures and the time until the failure is recovered focus on reactive routing approaches for recovering relatively long-term failures. However, these metrics are inadequate for evaluating proactive techniques aimed at providing uninterrupted connectivity and reducing bursty packet losses. To fulfill the need for adequate ways of evaluating the robustness of proactive routing mechanisms with regard to a wide spectrum of applications, we propose a new metric named outage duration, which measures the duration of failures experienced by end hosts. This means that routing schemes will be judged based on how frequently longer outages are observed. In other words, if one routing scheme results in failures of longer durations more frequently than another routing scheme, the latter is considered more robust than the former. As previously pointed out, while many studies try to cope with a single point of failure, little is known about how frequently simultaneous path failures occur.
In particular, to the best of our knowledge, what degree of multipath diversity is effective in overcoming failures on arbitrary Internet source-destination pairs has not been studied. We are interested in investigating whether more than 2 paths is more effective in mitigating the effect of path failures. It seems intuitive that bursty packet losses decrease as the number of paths between the source and destination increases. However, if simultaneous multiple path failures hardly occur, forwarding packets through more than two paths would not result in improved robustness. We seek the answer to the above question through extensive measurement experiments using a proactive multipath routing protocol implemented on PlanetLab [9]. In the next section, we describe our implementation of the proactive multipath routing protocol, Game Theoretic Stochastic Routing (GTSR) [7], on PlanetLab.
3
Proactive Multipath for Routing Robustness
Game Theoretic Stochastic Routing (GTSR) uses a game-theoretic approach to multipath routing [7]. It finds all paths between a source-destination pair
and computes next-hop probabilities, i.e., the probabilities that a packet takes a particular next-hop. This contrasts with single-path algorithms that simply determine the next-hop or even with deterministic multipath routing approaches. GTSR determines the next-hop probabilities using a max-flow computation. This computation has a game theoretical interpretation because GTSR’s routing policies are saddle solutions to a zero-sum game, in which we regard routing as one player that attempts to defeat worst-case link/node faults. In this game, one associates to each link $\ell$ a probability $p_{\text{fail},\ell}$ of fault occurrence. It was shown in [13] that optimal saddle routing policies for this game can be computed by solving a max-flow optimization [14] over a graph with link capacities given by $1/p_{\text{fail},\ell}$. When GTSR is utilized to improve robustness, the designer typically selects low values for $p_{\text{fail},\ell}$ in links that are perceived to be more robust. This favors sending more traffic through these links. We thus call $1/p_{\text{fail},\ell}$ the level-of-robustness (LOR) of link $\ell$. As mentioned above, GTSR solves max-flow optimizations with link capacities given by the LOR. Once the flow $x_\ell$ for each link $\ell$ is determined, the next-hop probability $r_\ell$ for link $\ell$ is obtained from

$$r_\ell := \frac{x_\ell}{\sum_{\ell' \in L[\ell]} x_{\ell'}},$$

where the summation is over the set $L[\ell]$ of links that exit from the same node as $\ell$. When one wants to favor shorter paths in terms of the number of hops, this can be done by introducing a link gain. In a generalized max-flow optimization with a link gain, the flow-conservation law states that the incoming flow to a node is equal to the outgoing flow from the same node, possibly amplified by the link gain when the gain is at least 1. Therefore, for a gain greater than 1, longer paths are penalized since the cost incurred increases as the number of hops increases. In fact, as the gain tends to infinity, the potential burden of an extra hop is so large that the optimal solution will only consider paths for which the number of hops is minimal, leading to shortest-path routing, though not necessarily single-path routing.
3.1
Experimental Methodology
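The next-hop rule introduced at the start of Section 3 can be illustrated with a minimal Python sketch. It assumes the max-flow solution is already available as a dictionary of per-link flows; the function name `next_hop_probabilities`, the node labels, and the unit flows are hypothetical and only mirror the five-host overlay used in the experiments.

```python
from collections import defaultdict

def next_hop_probabilities(flows):
    """Normalize per-link flows into next-hop probabilities.

    flows maps a directed link (node, next_node) to the flow x on that link,
    as produced by some max-flow solver (not shown here).
    The result maps each link to r = x / (sum of flows leaving the same node).
    """
    out_total = defaultdict(float)
    for (node, _next_node), x in flows.items():
        out_total[node] += x
    probs = {}
    for (node, next_node), x in flows.items():
        if out_total[node] > 0:
            probs[(node, next_node)] = x / out_total[node]
    return probs

# Assumed example: source "S", destination "D", intermediaries "A", "B", "C".
# With the same level-of-robustness on every link, each of the four paths
# carries the same flow (set to 1.0 here for illustration).
flows = {
    ("S", "D"): 1.0,
    ("S", "A"): 1.0, ("A", "D"): 1.0,
    ("S", "B"): 1.0, ("B", "D"): 1.0,
    ("S", "C"): 1.0, ("C", "D"): 1.0,
}
probs = next_hop_probabilities(flows)
```

In this assumed setting the source picks each of its four next hops with probability 0.25, and every intermediary forwards to the destination with probability 1.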
We implemented GTSR as an application-layer overlay routing protocol in Linux. Although GTSR reacts to path failures by monitoring link status, its main feature emphasized in our experiments is its proactive multipath routing ability. Implemented as a link-state protocol, GTSR computes routing tables by solving max-flow optimizations on the entire network graph. The main objective of our experiment is to investigate the impact of multipath diversity degree on robustness to path failures. The experiment was performed using PlanetLab [9]. To probe as many paths as possible with the assigned number of hosts, we partitioned them into several groups so that each group consists of 5 hosts. A simple overlay network is then configured for each group, with 4 paths connecting the source and destination nodes: one direct path and three indirect overlay paths through three intermediary nodes. By doing this, for each group, we can examine four single-paths, $\binom{4}{2}$ two-paths, $\binom{4}{3}$ three-paths, and one 4-path. Since we associated to each link the same $p_{\text{fail},\ell}$, the routing policy that GTSR generates for this simple topology is simply to forward packets
over the 4 paths with equal probabilities. To diagnose the network easily, we used static routing and forwarded packets over the four paths in a round-robin fashion rather than in a stochastic manner. Every 50ms, a source node sends a packet which includes a sequence number. Forwarding sequentially-numbered packets in a round-robin manner enables us to measure the duration of path outages.

We started the first experiment with 10 groups consisting of 50 hosts. In this experiment conducted from May 14 to May 16, 2006, we selected PlanetLab hosts so that the hosts in the same group were geographically distributed (similarly to [3]). The second experiment was performed using 245 hosts from November 15 to November 26, 2006. In this experiment, we did not explicitly try to choose geographically distributed PlanetLab hosts. That is, we formed each group by randomly selecting hosts; we avoided cases where any group would contain more than one host from the same site. Also in each group, the roles of source, destination, and intermediary nodes were randomly assigned. Out of 49 groups, we could collect data from 28 groups (totaling 140 hosts) successfully.

Introducing a New Performance Metric
To compare robustness over different multipath degrees easily, we compute the distribution of outage durations experienced by the receiver. More specifically, we calculate the fraction of time that the connection is in a failure for duration of at least X seconds. As a simple example, suppose that we measured path outages over a path during 24 hours and the observed outage durations in seconds were {30, 7, 10, 100, 600}. Now let’s calculate the fraction of time that a connection over this path is in failure for duration of at least 60 seconds. Since there were two outages which lasted for at least 60 seconds, the fraction of time is (100 + 600)/(24 × 60 × 60) ≈ 0.0081.
3.2
Experimental Results
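Since the results below are all expressed in the outage-duration metric defined in Section 3.1, a minimal Python sketch of that computation may help. The function name `outage_fraction` is hypothetical; the input values reproduce the 24-hour worked example from the text.

```python
def outage_fraction(outages_s, min_duration_s, observation_s):
    """Fraction of the observation period spent in outages that each
    lasted at least min_duration_s seconds."""
    long_outages = [d for d in outages_s if d >= min_duration_s]
    return sum(long_outages) / observation_s

# Worked example from the text: outages observed on one path during 24 hours.
outages = [30, 7, 10, 100, 600]      # outage durations in seconds
day = 24 * 60 * 60                   # observation period in seconds

# Only the 100 s and 600 s outages last at least 60 s.
fraction = outage_fraction(outages, 60, day)

# Evaluating the same function over several thresholds X gives one curve
# of the kind plotted in the figures (thresholds chosen for illustration).
curve = {x: outage_fraction(outages, x, day) for x in (1, 10, 60, 600)}
```

Because raising the threshold X can only exclude outages, each curve is non-increasing in X.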
Figure 1 shows the resulting distribution of outage durations for one example group. In this group, the source node is in Singapore, the destination node is in Michigan, and the three intermediary nodes are located in Japan and the USA. The graph shows that the single path suffered outage durations of up to 57 seconds and that the fraction of time that the connection is in failure with duration of at least 10 seconds is 0.1%. When one more path is added, the fraction of time that the connection is in failure of duration at least 10 seconds is almost 0.01%, which means that using 2 paths was significantly helpful in overcoming path failures that the single path suffered. For 3-paths, no failure duration of 10 seconds is observed. Also, we can see that the fraction of time that the connection is in failure of duration ≥ 7 seconds is about $2 \times 10^{-4}$ and $5 \times 10^{-5}$ for the 2-paths and 3-paths, respectively. However, for failure duration ≥ 1 second, the curves for 3-paths and 4-paths are very close to each other. This means that adding the 4th path was not very helpful. A possible explanation for this is that the underlying network does not have enough physical path redundancy; for these experiments, the physical paths underlying some pairs of paths are not independent. This dependence may be avoided by choosing “better” intermediary nodes. However, if failures occur at the end hosts, they cannot be avoided.

[Figure 1: log-log plot of the fraction of time that a connection is in a failure of duration ≥ X seconds versus failure duration X (seconds), with curves for path1, path1+3, path1+3+4, and path1+2+3+4.]
Fig. 1. An example result for a group. The Y-axis represents the fraction of time that the connection is in a failure of the duration ≥ X seconds.

In order to summarize results over all groups, we perform the following computation. For each degree of multipath, we collect all possible distributions (curves) regardless of the group. Then, for every failure duration data point, the median value is obtained. As a result, we obtain one curve for each degree of multipath, i.e., four curves in total. If a curve does not have any value for some failure duration value, the value of the curve is assumed to be 0. Figure 2 (a) shows the distribution of the median values for the first experiment. We can see the gaps between the curves corresponding to each degree of multipath. This implies that as the number of paths increases, robustness to path failures is improved. Figure 2 (b) shows the distribution of the median for the second experiment. While the gap between single-path and two-paths is significant, the curves for three-paths and four-paths almost overlap. The gap between two-paths and three-paths implies that there was robustness improvement by raising the degree of multipath from 2 to 3, although it is not as significant as when raising it from 1 to 2. In comparison with the result of the first experiment, we infer that geographically distributing intermediary nodes played a significant role in improving robustness. Nonetheless, the fact that there was improvement simply by selecting intermediary nodes randomly is very encouraging. These results show that the degree of multipath diversity affects robustness to path failures. A study on how to enlarge the gaps between the curves for different multipath degrees is included in our future work.
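The per-degree summary described above (pointwise median over all curves, with missing points treated as 0) can be sketched in Python as follows. The representation of a curve as a dictionary from failure duration to fraction of time, the function name `median_curve`, and the sample values are assumptions made for illustration; they are not measured data.

```python
from statistics import median

def median_curve(curves):
    """Pointwise median over several outage-duration curves.

    Each curve maps a failure duration X (seconds) to the fraction of time
    spent in failures of duration >= X. A curve with no value at some X
    contributes 0 at that X, as stated in the text.
    """
    durations = sorted({x for curve in curves for x in curve})
    return {x: median(curve.get(x, 0.0) for curve in curves) for x in durations}

# Illustrative (made-up) curves for one degree of multipath in three groups.
group_curves = [
    {1: 1e-3, 10: 1e-4},
    {1: 4e-3},                # no failures of >= 10 s observed in this group
    {1: 2e-3, 10: 3e-4},
]
summary = median_curve(group_curves)
```

Running this once per degree of multipath would yield the four summary curves discussed in the text.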
[Figure 2: two log-log plots of the median of the fraction of time that a connection is in a failure of duration ≥ X seconds versus failure duration X (seconds), each with curves for 1 path, 2 paths, 3 paths, and 4 paths. (a) The intermediary nodes are geographically distributed. (b) The intermediary nodes are selected randomly.]
Fig. 2. Distribution of mean outage duration

[Figure 3: (a) topology diagram; (b) plot of end-to-end loss probability versus α.]
Fig. 3. Effect of link gains on end-to-end loss rates: (a) Hosts and topology used in the experiments. The links incident to node 8 usually have high loss rates, because node 8 is connected via DSL. (b) Number of lost packets with respect to 5 different α values.
4
Controlling End-to-End Loss Rate
4.1
Exploiting Link Gains to Reduce End-to-End Loss Rate
Depending on the application, besides reducing the duration of outages, reducing the loss rate may be another main concern. Clearly, if a low end-to-end loss rate is the only concern, the network designer would route all packets along the path(s) with the lowest loss probability. However, if both robustness to path failures and low loss rates are desired, some trade-off must be achieved. To this end, we use link gains whose goal is to penalize lossy paths, so that robustness is traded against random packet drops. The notion of link gain is introduced in Section 3. However, the link gain used here is a bit different in the sense that the link gain for each link depends on the loss probability of the link, and therefore link gains are not uniform across all links. Instead, the link gain of each link $\ell$ is set to

$$\frac{1}{(1 - p_{\text{loss},\ell})^{\alpha}},$$

where $p_{\text{loss},\ell}$ is the loss probability of link $\ell$ and $\alpha$ is a positive number given as a network parameter. With this link gain definition, the flow is amplified by $\frac{1}{(1 - p_{\text{loss},\ell})^{\alpha}}$ when it passes through link $\ell$. The end-to-end link gain along a specific path is $\prod_{\ell \in \text{path}} \frac{1}{(1 - p_{\text{loss},\ell})^{\alpha}}$. Thus, when α = 1, the end-to-end link gain is the reciprocal of the probability of successfully delivering the packet. For α = 0, the link gain for every link is equal to 1 and loss probability has no impact on routing; hence, packets are routed to maximize robustness. At the other extreme, for a very large α value, the link gain becomes very large and packets are forwarded along the path(s) with the lowest loss probability. To show the impact of link gain on loss rate, we performed a simple experiment in PlanetLab. We used the topology shown in Figure 3 (a). This topology has 10 nodes, and the nodes were connected by a 3-connected graph. Node 8 is connected to the Internet via a DSL line, so the loss probabilities on its links to other hosts are typically higher. We selected a source and destination pair so that node 8 is included on one of the paths. On each host, we ran five routing programs, each using one of five α values, namely 0, 20, 40, 60, and 80. At the source node, five sender programs sent packets through distinct routing programs. The results in Figure 3 (b) show that, as expected, the packet loss rate decreases as α increases.
4.2
Trade-Offs between Robustness to Path Failures and Loss Rate
The downside of this packet distribution scheme, i.e., using link gains based on loss probability, is that it may have a negative impact on robustness to path failures, as previously discussed. Favoring forwarding packets through paths with low loss probabilities by setting a large value for α can make the end-hosts experience longer outage duration when the path through which a larger fraction of packets are forwarded fails. The quantitative assessment of how the link gain mechanism affects robustness can be easily performed, particularly in the simple case where the level-of-robustness of each link (i.e., link capacity in max flow computation) is equally set to $c$ and there are $n$ node-disjoint paths between the source and destination. To do this, we define a cycle as the expected time interval between two consecutive packets arriving at the destination through a given path. The cycle of a given path corresponds to the worst-case outage duration that the end-host is supposed to experience when all but one of the paths fail. The cycle is determined by the sending rate, the fraction of packets forwarded over the path, and the loss probability of the path. Note that for a given α value, the fraction of packets allocated for a path depends on the loss probabilities of other paths as well as the loss probability of the path. Let $p_i$ denote the packet delivery probability for path $i$, i.e., $p_i = \prod_{\ell \in \text{path } i} (1 - p_{\text{loss},\ell})$, where $p_{\text{loss},\ell}$ is the loss probability for link $\ell$. When reaching the destination, the flow along path $i$ will have been amplified by

$$g_i = \prod_{\ell \in \text{path } i} \frac{1}{(1 - p_{\text{loss},\ell})^{\alpha}} = \frac{1}{p_i^{\alpha}}.$$

Since the capacity constraint must be satisfied, the flow allocated over path $i$ at the source node will be $c/g_i$, i.e., $c\,p_i^{\alpha}$. (Note that setting α = 0 is equivalent to not using link gain, i.e., the flow will be equally distributed.) Also, since all the capacities are equal, the fraction of the packets forwarded over path $i$ is

$$f_i = \frac{p_i^{\alpha}}{\sum_{j=1}^{n} p_j^{\alpha}}.$$

Given the formula for the fraction of packets over each path, the network designer can find the smallest α to achieve the desired packet delivery ratio (i.e., 1 − loss rate). Figure 4 (a) shows an example where there are two paths between the source and destination. In this example, the loss probability of path 1 is fixed to 0 whereas the loss probability of path 2 varies from 0.05 to 0.2. Figure 4 (b) shows that the way packet delivery ratio varies with respect to α changes quite significantly depending on the loss probability of path 2.

[Figure 4: (a) topology diagram with two paths between source and destination; (b) plot of packet delivery ratio versus α, with curves for Loss Prob(Path2) = 0.05, 0.1, and 0.2.]
Fig. 4. α versus Packet Delivery Ratio. The link gain parameter, α, to achieve a certain packet delivery ratio depends on the loss probabilities of the paths.

To reckon the change of outage duration by the link gain, we consider the expected cycle for a path. Suppose that the sending rate at the source is $x$ (packets/sec) and that $p_1, p_2, \ldots, p_n$ is a sequence of packet delivery probabilities sorted in decreasing order. To obtain the expected cycle for path $i$, we count how many times per second we can see a packet given that we observe the path $x$ times for each second. The number of packets being observed is binomially distributed with parameters $(x, f_i p_i)$ because each observation is independent and the probability of a packet being found in each observation is $f_i$ times $p_i$. The mean of $B(x, f_i p_i)$ is $x f_i p_i$. That is, the expected number of packets passing path $i$ for each second is $x f_i p_i$. As a consequence, the expected interval between consecutive packets passing path $i$, i.e., the cycle for path $i$, is

$$\frac{1}{x f_i p_i} = \frac{1}{x} \cdot \frac{\sum_{j=1}^{n} p_j^{\alpha}}{p_i^{\alpha}} \cdot \frac{1}{p_i}.$$

Particularly, the paths with the lowest packet delivery probability ($p_n$) have the longest cycles, which is

$$\frac{\sum_{j=1}^{n} p_j^{\alpha}}{x\, p_n^{\alpha+1}}.$$

Note that setting α to 0 gives the longest cycle without using the link gain mechanism, which is $\frac{n}{x\, p_n}$. Therefore, when using the link gain, the worst-case outage duration is increased by the ratio of

$$\frac{\sum_{j=1}^{n} p_j^{\alpha}}{n\, p_n^{\alpha}}.$$

If we want to put a limit on the outage duration when only one path survives, we can find the upper limit on α such that

$$\frac{\sum_{j=1}^{n} p_j^{\alpha}}{x\, p_n^{\alpha+1}} \le t,$$

where $t$ is a threshold for the cycle. The computation can be easily made by Newton’s method given that packet delivery probabilities can be obtained through measurement.
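The formulas of this section can be tried out with a short Python sketch. It is only an illustration under stated assumptions: all function names are hypothetical, delivery probabilities are assumed to lie in (0, 1], the sending rate of 20 packets/s corresponds to the 50 ms interval used in Section 3.1, and the upper limit on α is found by bisection rather than the Newton's method mentioned in the text (bisection is valid here because the worst-case cycle never decreases as α grows).

```python
def path_fractions(p, alpha):
    """f_i = p_i^alpha / sum_j p_j^alpha for delivery probabilities p."""
    weights = [p_i ** alpha for p_i in p]
    total = sum(weights)
    return [w / total for w in weights]

def delivery_ratio(p, alpha):
    """Expected end-to-end packet delivery ratio, sum_i f_i * p_i."""
    return sum(f * p_i for f, p_i in zip(path_fractions(p, alpha), p))

def path_cycles(p, alpha, rate):
    """Expected cycle 1 / (x * f_i * p_i) of every path, in seconds."""
    return [1.0 / (rate * f * p_i)
            for f, p_i in zip(path_fractions(p, alpha), p)]

def worst_case_cycle(p, alpha, rate):
    """Longest cycle, sum_j p_j^alpha / (x * p_n^(alpha + 1)), with p_n = min(p).
    Written with ratios p_j / p_n to avoid dividing by a tiny power."""
    p_n = min(p)
    return sum((p_j / p_n) ** alpha for p_j in p) / (rate * p_n)

def max_alpha(p, rate, t, hi=200.0, tol=1e-9):
    """Largest alpha in [0, hi] whose worst-case cycle stays <= t seconds.
    hi is an assumed search bound; returns None if even alpha = 0 exceeds t."""
    if worst_case_cycle(p, 0.0, rate) > t:
        return None
    if worst_case_cycle(p, hi, rate) <= t:
        return hi
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if worst_case_cycle(p, mid, rate) <= t:
            lo = mid
        else:
            hi = mid
    return lo

# Two paths as in the Figure 4 example: path 1 lossless, path 2 with loss 0.2.
p = [1.0, 0.8]
rate = 20.0                            # packets per second (one every 50 ms)
alpha_limit = max_alpha(p, rate, t=1.0)
```

For a given deployment, one could compare `delivery_ratio(p, alpha_limit)` against the application's loss target to see whether both goals can be met at once.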
5
Conclusions
Robustness is one of the key goals of network protocol design. In this paper the impact of proactive routing based on path diversity on end-to-end robustness is examined. A novel metric to measure robustness at many time-scales was introduced. Through experiments on PlanetLab, we found that proactive multipath
routing schemes can significantly improve robustness. However, in our experiments, we observed that little additional robustness is added when more than 3 alternate paths are considered. As future work, we plan to examine why this is the case. Our conjecture is that the main cause is the fact that the underlying network’s physical topology may not have enough redundancy. Finally, in order to balance the sometimes contradictory goals of low loss and high robustness, a technique based on link gains was proposed to achieve a specific loss rate goal.
References
1. Savage, S., Anderson, T., Aggarwal, A., Becker, D., Cardwell, N., Collins, A., Hoffman, E., Snell, J., Vahdat, A., Voelker, G., Zahorjan, J.: Detour: Informed internet routing and transport. IEEE Micro 19(1), 50–59 (1999)
2. Andersen, D., Balakrishnan, H., Kaashoek, F., Morris, R.: Resilient overlay networks. SIGOPS Oper. Syst. Rev. 35(5), 131–145 (2001)
3. Gummadi, K.P., Madhyastha, H.V., Gribble, S.D., Levy, H.M., Wetherall, D.: Improving the reliability of Internet paths with one-hop source routing. In: Proc. 6th USENIX OSDI, San Francisco, CA (December 2004)
4. Apostolopoulos, J.G.: Reliable video communication over lossy packet networks using multiple state encoding and path diversity. In: VCIP (2001)
5. Cha, M., Moon, S., Park, C., Shaikh, A.: Placing relay nodes for intra-domain path diversity. In: Proc. of the IEEE INFOCOM (2006)
6. Li, Y., Zhang, Y., Qiu, L., Lam, S.: Smarttunnel: Achieving reliability in the internet. In: Proc. of the IEEE INFOCOM (2007)
7. Bohacek, S., Hespanha, J.P., Lee, J., Lim, C., Obraczka, K.: Game theoretic stochastic routing for fault tolerance and security in communication networks. IEEE Trans. on Parallel and Distributed Systems (2007)
8. Markopoulou, A., Iannaccone, G., Bhattacharyya, S., Chuah, C.: Characterization of failures in an IP backbone. In: Proc. of the IEEE INFOCOM (2004)
9. PlanetLab, http://www.planet-lab.org
10. Blanton, E., Allman, M.: On making TCP more robust to packet reordering. ACM Computer Communications Review 32 (2002)
11. Zhang, M., Karp, B., Floyd, S., Peterson, L.: RR-TCP: A reordering-robust TCP with DSACK. In: Proceedings of IEEE ICNP (2003)
12. Bohacek, S., Hespanha, J.P., Lee, J., Lim, C., Obraczka, K.: A new TCP for persistent packet reordering. IEEE/ACM Trans. on Networking, 369–382 (2006)
13. Bohacek, S., Hespanha, J., Obraczka, K.: Saddle policies for secure routing in communication networks. In: Proc. of the 41st Conf. on Decision and Contr., pp. 1416–1421 (2002)
14. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT Press, Boston (2001)
On Robustness and Countermeasures of Reliable Server Pooling Systems Against Denial of Service Attacks

Thomas Dreibholz¹, Erwin P. Rathgeb¹, and Xing Zhou²

¹ University of Duisburg-Essen, Institute for Experimental Mathematics, Ellernstraße 29, D-45326 Essen, Germany. Tel.: +49 201 183-7637, Fax: +49 201 183-7673. [email protected]
² Hainan University, College of Information Science and Technology, Renmin Road 58, 570228 Haikou, Hainan, China. Tel.: +86 898 6625-0584, Fax: +86 898 6618-7056. [email protected]
Abstract. The Reliable Server Pooling (RSerPool) architecture is the IETF’s novel approach to standardize a light-weight protocol framework for server redundancy and session failover. It combines ideas from different research areas into a single, resource-efficient and unified architecture. While there have already been a number of contributions on the performance of RSerPool for its main tasks – pool management, load distribution and failover handling – the robustness of the protocol framework has not yet been evaluated against intentional attacks. The first goal of this paper is to provide a robustness analysis. In particular, we would like to outline the attack bandwidth necessary for a significant impact on the service. Furthermore, we present and evaluate our countermeasure approach to significantly reduce the impact of attacks.

Keywords: Reliable Server Pooling, Attacks, Denial of Service, Robustness, Countermeasures.
1
Introduction and Scope
The Reliable Server Pooling (RSerPool) architecture [1] is a generic, application-independent framework for server pool [2] and session management, based on the Stream Control Transmission Protocol (SCTP) [3]. While there have already been a number of publications on the performance of RSerPool for load balancing [4] and server failure handling [5], there has not yet been any research
Parts of this work have been funded by the German Research Foundation (Deutsche Forschungsgemeinschaft).
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 586–598, 2008. © IFIP International Federation for Information Processing 2008
on security and robustness. SCTP already protects against simple flooding attacks [6] and the Internet Drafts [7] of RSerPool mandatorily require using mechanisms like TLS [8] or IPSEC [9] in order to ensure authenticity, integrity and confidentiality. However, this approach is not sufficient: as for every distributed system, there is a chance that an attacker may compromise a legitimate component and obtain the private key. So, how robust are the RSerPool protocols under attack? The goal of this paper is therefore to first analyse the robustness of RSerPool against a denial of service (DoS) attack by compromised components, in order to show the impact of different attack scenarios on the application performance. Using these analyses as a baseline performance level, we will present our countermeasure approach which can significantly reduce the impact of attacks at a reasonable effort.
2
The RSerPool Architecture
Figure 1 illustrates the RSerPool architecture [1,10] which consists of three types of components: servers of a pool are called pool elements (PE), a client is denoted as pool user (PU). The handlespace – which is the set of all pools – is managed by redundant pool registrars (PR). Within the handlespace, each pool is identified by a unique pool handle (PH).
Fig. 1. The RSerPool Architecture
2.1
Components and Protocols
PRs of an operation scope synchronize their view of the handlespace by using the Endpoint haNdlespace Redundancy Protocol (ENRP) [11], transported via SCTP [12] and secured e.g. by TLS [8] or IPSEC [9]. In contrast to GRID computing [13], an operation scope is restricted to a single administrative domain. That is, all of its components are under the control of the same authority
(e.g. a company or an organization). This property results in a small management overhead [14,2], which also allows for RSerPool usage on devices providing only limited memory and CPU resources (e.g. embedded systems like telecommunications equipment or routers). Nevertheless, PEs may be distributed globally for their service to survive localized disasters [15]. PEs choose an arbitrary PR of the operation scope to register into a pool by using the Aggregate Server Access Protocol (ASAP) [16], again transported via SCTP and using TLS or IPSEC. Within its pool, a PE is characterized by its PE ID, which is a randomly chosen 32-bit number. Upon registration at a PR, the chosen PR becomes the Home-PR (PR-H) of the newly registered PE. A PR-H is responsible for monitoring its PEs’ availability by keep-alive messages (to be acknowledged by the PE within a given timeout) and propagates the information about its PEs to the other PRs of the operation scope via ENRP updates. PEs re-register regularly (in an interval denoted as registration lifetime) and for information updates. In order to access the service of a pool given by its PH, a PU requests a PE selection from an arbitrary PR of the operation scope, again using ASAP. The PR selects the requested list of PE identities by applying a pool-specific selection rule, called pool policy. RSerPool supports two classes of load distribution policies: non-adaptive and adaptive algorithms [4]. While adaptive strategies base their assignment decisions on the current status of the processing elements (which of course requires up-to-date states), non-adaptive algorithms do not need such data. A basic set of adaptive and non-adaptive pool policies is defined in [17]. Relevant for this paper are the non-adaptive policies Round Robin (RR) and Random (RAND) as well as the adaptive policies Least Used (LU) and Least Used with Degradation (LUD). LU selects the least-used PE, according to up-to-date application-specific load information. 
Round robin selection is applied among multiple least-loaded PEs. LUD [18] furthermore introduces a load decrement constant which is added to the actual load each time a PE is selected. This mechanism compensates for inaccurate load states due to delayed updates. An update resets the load value to the actual load again. PUs may report unreachable PEs to a PR by using an ASAP Endpoint Unreachable message. A PR locally counts these reports for each PE. If the threshold MaxBadPEReports [5] is reached, the PR may decide to remove the PE from the handlespace. The counter of a PE is reset upon its re-registration.
2.2 Application Scenarios
Although the main motivation to define RSerPool has been the availability of SS7 (Signalling System No. 7 [19]) services over IP networks, it is intended to be a generic framework. There has already been some research on the performance of RSerPool usage for applications like SCTP-based mobility [20], VoIP with SIP [21], web server pools [10], IP Flow Information Export (IPFIX) [22], real-time distributed computing [10, 4] and battlefield networks [23]. A generic application model for RSerPool systems has been introduced by [4], including
On Robustness and Countermeasures of Reliable Server Pooling Systems
performance metrics for the provider side (pool utilization) and user side (request handling speed). Based on this model, the load balancing quality of different pool policies has been evaluated [4, 10].
3 Quantifying a RSerPool System
The service provider side of a RSerPool system consists of a pool of PEs. Each PE has a request handling capacity, which we define in the abstract unit of calculations per second¹. Each request consumes a certain number of calculations; we call this number request size. A PE can handle multiple requests simultaneously – in a processor sharing mode as provided by multitasking operating systems. On the service user side, there is a set of PUs. The number of PUs can be given by the ratio between PUs and PEs (PU:PE ratio), which defines the parallelism of the request handling. Each PU generates a new request in an interval denoted as request interval. The requests are queued and sequentially assigned to PEs. The total delay for handling a request $d_{Handling}$ is defined as the sum of queuing delay $d_{Queuing}$, startup delay $d_{Startup}$ (dequeuing until reception of acceptance acknowledgement) and processing time $d_{Processing}$ (acceptance until finish):
$$d_{Handling} = d_{Queuing} + d_{Startup} + d_{Processing}. \qquad (1)$$
That is, $d_{Handling}$ not only incorporates the time required for processing the request, but also the latencies of queuing, server selection and protocol message transport. The handling speed is defined as:
$$\mathit{handlingSpeed} = \frac{\mathit{requestSize}}{d_{Handling}}.$$
For convenience reasons, the handling speed (in calculations/s) is represented in % of the average PE capacity. Clearly, the user-side performance metric is the handling speed – which should be as fast as possible. Using the definitions above, it is possible to delineate the average system utilization (for a pool of NumPEs servers and a total pool capacity of PoolCapacity) as:
$$\mathit{systemUtilization} = \mathit{NumPEs} \cdot \mathit{puToPERatio} \cdot \frac{\mathit{requestSize} / \mathit{requestInterval}}{\mathit{PoolCapacity}}. \qquad (2)$$
Obviously, the provider-side performance metric is the system utilization, since only utilized servers gain revenue. In practice, a well-designed client/server system is dimensioned for a certain target system utilization, e.g. 80%. That is, by setting any two of the parameters (PU:PE ratio, request interval and request size), the value of the third one can be calculated using equation 2 (see [4, 10] for details).
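As a worked instance of equations 1 and 2, the following minimal sketch solves equation 2 for the request interval. The function names are illustrative only and are not taken from rspsim; the numbers (10 PEs of 10^6 calculations/s each, a PU:PE ratio of 10, a request size of 10^7 calculations and a target utilization of 80%) are the default settings given in section 4.

```python
def system_utilization(num_pes, pu_to_pe_ratio, request_size,
                       request_interval, pool_capacity):
    """Equation 2: offered load divided by the total pool capacity."""
    offered_load = num_pes * pu_to_pe_ratio * request_size / request_interval
    return offered_load / pool_capacity


def request_interval_for(target_utilization, num_pes, pu_to_pe_ratio,
                         request_size, pool_capacity):
    """Equation 2 solved for the request interval (in seconds)."""
    return (num_pes * pu_to_pe_ratio * request_size
            / (target_utilization * pool_capacity))


def handling_speed_percent(request_size, d_queuing, d_startup, d_processing,
                           pe_capacity):
    """Handling speed in % of the PE capacity, using equation 1 for the delay."""
    d_handling = d_queuing + d_startup + d_processing
    return 100.0 * (request_size / d_handling) / pe_capacity


# Default settings of section 4 (assumed here, not computed by the paper's code).
NUM_PES = 10
PE_CAPACITY = 1e6                      # calculations/s per PE
POOL_CAPACITY = NUM_PES * PE_CAPACITY  # calculations/s for the whole pool
REQUEST_SIZE = 1e7                     # calculations per request

interval = request_interval_for(0.8, NUM_PES, 10, REQUEST_SIZE, POOL_CAPACITY)
print(f"request interval for 80% utilization: about {interval:.0f} s")
```

The result of roughly 125 s is consistent with the average handle resolution interval that section 5.3 quotes for legitimate PUs.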
4 The Simulation Setup
For our performance analysis, the RSerPool simulation model rspsim [24, 10, 4] has been used. This model is based on the OMNeT++ [25] simulation environment and contains the protocols ASAP [16] and ENRP [11], a PR module, an attacker module and PE as well as PU modules for the request handling scenario defined in section 3. Network latency is introduced by link delays only. Therefore, only the network delay is significant. The latency of the pool management by PRs is negligible [2]. Unless otherwise specified, the basic simulation setup – which is also presented in figure 2 – uses the following parameter settings:
– The target system utilization is 80%. Request size and request interval are randomized using a negative exponential distribution (in order to provide a generic and application-independent analysis [10]). There are 10 PEs; each one provides a capacity of $10^6$ calculations/s.
– A PU:PE ratio of 10 is used (i.e. a non-critical setting as shown in [4]). The default request size:PE capacity is 10 (i.e. the request size is $10^7$ calculations; being processed exclusively, the processing of an average request takes 10s – see also [4]).
– We use a single PR only, since we do not examine PR failure scenarios here (see [4] for the impact of multiple PRs). PEs re-register every 30s (registration lifetime) and on every load change for the adaptive policies.
– MaxBadPEReports is set to 3 (default value defined in [11]). A PU sends an Endpoint Unreachable if a contacted PE fails to respond within 10s.
– The system is attacked by a single attacker node.
– The simulated real-time is 120 min; each simulation run is repeated at least 24 times with a different seed in order to achieve statistical accuracy.
GNU R has been used for the statistical post-processing of the results. Each resulting plot shows the average values and their 95% confidence intervals.
Fig. 2. The Simulation Setup
¹ An application-specific view of capacity may be mapped to this definition, e.g. CPU cycles or memory usage.
T. Dreibholz, E.P. Rathgeb, and X. Zhou
5 Robustness Analysis and Attack Countermeasures
5.1 Introduction
The attack targets of RSerPool are the PRs, PEs and PUs; the goal of an attack is to affect the services built on top of RSerPool. Due to mandatory connection security by TLS or IPSEC (see subsection 2.1), an attacker has to compromise a component itself. Clearly, an attacker being able to compromise a PR is able to propagate arbitrary misinformation into the handlespace via ENRP. However, since the number of PRs is considered to be quite small [2] (e.g. fewer than 10) in comparison with the number of PEs and PUs (up to many thousands [10, 2]) and due to the restriction of RSerPool to a single administrative domain, it is assumed to be feasible to protect the small number of PRs rather well. Instead, the most probable attack scenario is an attacker being able to compromise a PE or PU. These components are significantly more numerous [10] and may be distributed over multiple, less controllable locations [15]. For that reason, ASAP-based attacks are the focus of our study. For our analysis, we use a single attacker node only. Assuming that there is a certain difficulty in compromising a PE or PU, the number of attackers in a system is typically small. If a powerful attacker is able to compromise a large number of RSerPool components, protocol-internal countermeasures are obviously difficult and the effort should be spent on increasing the system security of the components. However, as we will show, even a single attacker can cause a DoS if no further countermeasures are taken.
5.2 A Compromised Pool Element
The goal of an attacker being able to perform PE registrations is clearly to execute as many fake registrations as possible. That is, each registration request simply has to contain a new PE ID (randomly chosen). The policy parameters may be set appropriately, i.e. a load of 0% (LU, LUD) and a load increment of 0% (LUD), to get the fake PE selected as often as possible. It is important to note that address spoofing is already avoided by the SCTP protocol [6, 3]: each network-layer address under which a PE is registered must be part of the SCTP association between PE and PR. The ASAP protocol [16] requires the addresses to be validated by SCTP. However, maintaining the registration association with the PE and silently dropping all incoming PU requests is already sufficient for an attacker. In order to illustrate the impact that even a single attacker can have on the pool performance, we have performed simulations using the parameters described in section 4. Each PE handles up to 4 requests simultaneously, i.e. the load increment of LUD is 25% for a real PE. Figure 3 presents the results; the left-hand side shows the request handling speed, the right-hand one the number of PU requests ignored by the attacker (requests sent to fake PEs). Obviously, the smaller the attack interval (i.e. the delay between two fake PE registrations), the more fake entries go into the handlespace. This leads to
Fig. 3. The Impact of a Compromised Pool Element
an increased probability for a PU to select a non-existing PE and therefore to a decreased overall request handling speed. In particular, this effect is the strongest for LUD: real PEs get loaded and provide their real load increment (here: 25%), while the fake PEs always claim to be unloaded with a load increment of 0%. It takes only a single registration every 10s (A=10) to cause a complete DoS (i.e. a handling speed of 0%). The other policies are somewhat more robust, since they do not allow an attacker to manipulate the PE selection in such a powerful way. Clearly, using a larger setting of MaxBadPEReports would lead to an even worse performance: the longer it takes to get rid of a fake PE entry, the more attempts PUs make to use these PEs (see also the right-hand side of figure 3). In summary, it is clearly observable that even a single attacker with small attack bandwidth can cause a complete DoS. The key problem of the described threat is the attacker’s power to create a new fake PE with each of its registration messages. That is, only a few messages per second (i.e. even a slow modem connection) are sufficient to cause a severe service degradation. For that reason, an effective countermeasure is clearly to restrict the number of PE registrations that a single PE identity is authorized to create. However, in order to retain the “light-weight” property of RSerPool [2] and to avoid synchronizing such numbers among PRs, we have put forward a new approach and introduce the concept of a registration authorization ticket. Such a ticket consists of
1. a PH,
2. a fixed PE ID,
3. minimum/maximum policy information settings (e.g. a lower limit for the load decrement of LUD) and
4. a signature by a trusted authority (similar to a Kerberos service, see below).
Such a ticket, provided by a PE to its PR-H as part of the ASAP registration, can be easily verified by checking its signature. Then, if it is valid, it is only necessary
Fig. 4. Applying the Countermeasure against the Pool Element Attack
to ensure that the PE’s policy settings are within the valid range specified in the ticket. An attacker stealing the identity of a real PE would only be able to masquerade as this specific PE. A PR only has to verify the authorization ticket, i.e. no change of the protocol behaviour and especially no additional synchronization among the PRs are necessary. Therefore, the approach can be realized easily; the additional runtime required is in O(1). Clearly, the need for a trusted authority (e.g. a Kerberos service) adds an infrastructure requirement. However, since an operation scope belongs to a single authority only (see subsection 2.1), it is feasible at reasonable cost. In order to show the effectiveness of our approach, figure 4 presents the results for applying the same attack as above, but using the new attack countermeasure. The other simulation parameters remain unchanged. Clearly, a significant improvement can be observed. While the handling speed decreases only slightly with a smaller attack interval (down to 0.001, which means 1,000 fake registrations per second), the number of ignored PU requests rises from about 1,000 to 2,200 at an attack interval of A=15s to only about 3,000 (LU, LUD) at A=0.001s. In particular, due to the lower limit for policy information, LUD now even achieves a performance benefit: only for very small attack intervals A, its performance converges to the results of LU. The stateful behaviour [10] of this policy now becomes beneficial: although the attacker registers a fake PE with a load of 0%, the PE entry’s load increment is increased each time it gets selected. Therefore, frequent re-registrations of this fake PE are necessary in order to result in a significant performance degradation. In summary, a registration attack can be significantly diminished by our countermeasure.
5.3 A Compromised Pool User
PUs are the other means of ASAP-based attacks, especially their handle resolution requests and unreachability reports. An obvious attack scenario on handle
Fig. 5. The Impact of a Compromised Pool User
resolutions would be to flood the PR. However, the server selection can be realized very efficiently, as shown by [2]. Nevertheless, it is a fallacy to assume that simply performing some handle resolutions (without actually contacting any selected PE) cannot affect the service quality. In order to show the effects, the left-hand side of figure 5 presents the impact of a handle resolution attack on the pool performance. The right-hand side of figure 5 presents the results varying the probability for an unreachability report for A=0.1 (i.e. 10 handle resolutions per second). Clearly, even without unreachability reports, handle resolutions already result in a performance decay: the handling speed of RR converges to the value of RAND, since each selection leads to a movement of the Round Robin pointer [10]. That is, instead of trying to select PEs in turn, some servers are skipped and a new round starts too early. This leads in fact to a random selection. The performance of LUD decreases due to the load incremented upon each selection. Until the next re-registration (on load change or registration lifetime expiration), the load value in the handlespace grows steadily and leads to a smaller selection probability. LU and RAND, on the other hand, are immune to a simple handle resolution attack: a PE’s state is – unlike RR and LUD – not manipulable by the handle resolution itself. Therefore, a handle resolution attack (without Endpoint Unreachable) has no effect here. Sending ASAP Endpoint Unreachables for the selected PEs (see the right-hand side of figure 5) has a significant negative impact on the pool performance if the attack interval is small enough (here: A=0.1). In this case, the attacker is able to impeach almost all PEs out of the handlespace. A PE comes back upon re-registration (i.e. 30s). Clearly, a larger re-registration interval setting leads to an even lower performance.
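The effect of attacker handle resolutions on the Round Robin pointer can be reproduced with a few lines of Python. This is a minimal sketch and not the rspsim implementation: the class RoundRobinPool and the function legitimate_selections are hypothetical names, and the attacker is assumed to perform a fixed number of handle resolutions between two legitimate ones.

```python
class RoundRobinPool:
    """Toy handlespace with a single shared Round Robin pointer."""

    def __init__(self, num_pes):
        self.num_pes = num_pes
        self.pointer = 0

    def select(self):
        """Return the PE under the pointer and advance the pointer."""
        pe = self.pointer
        self.pointer = (self.pointer + 1) % self.num_pes
        return pe


def legitimate_selections(num_pes, rounds, attacker_resolutions_per_round):
    """PEs seen by a legitimate PU while an attacker also resolves handles."""
    pool = RoundRobinPool(num_pes)
    chosen = []
    for _ in range(rounds):
        chosen.append(pool.select())
        for _ in range(attacker_resolutions_per_round):
            pool.select()  # the attacker discards the result
    return chosen


print(legitimate_selections(4, 8, 0))  # no attack: PEs are used in turn
print(legitimate_selections(4, 8, 1))  # attack: every second PE is skipped
```

With four PEs and one attacker resolution per legitimate request, the legitimate PU only ever reaches PEs 0 and 2. A real attacker is not synchronized with the legitimate requests, so the skipped positions vary over time, which is why the resulting selection resembles a random one.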
The key problem of the handle resolution/failure report attack is that a single PU is able to impeach PEs – even for MaxBadPEReports>1. Therefore, the basic
idea of our countermeasure is to avoid counting multiple reports from the same PU. Like for the PEs, it is necessary to introduce a PU identification which is certified by a trusted authority and can be checked by a PR (see subsection 5.2 for details). After that, a PR simply has to remember (to be explained later) the PEs for which a certain PU has reported unreachability and to ignore multiple reports for the same PE. Note that no synchronization among PRs is necessary, since the unreachability count for each PE is a PR-local variable. That is, sending unreachability reports for the same PE to different PRs does not cause any harm. In order to remember the already-reported PE identities, we have considered the following hash-based approach of a per-PU message blackboard: instead of memorizing each reported PE (an attacker could exploit this by sending a large amount of random IDs), we simply define a function Ψ mapping each PE given by PE ID and PH to a bucket:
$$\Psi(PH, ID_{PE}) = h(PH, ID_{PE}) \bmod \mathit{Buckets}.$$
h denotes an appropriate hash function: an attacker may not easily guess its behaviour. This property is provided by so-called universal hash functions [26], which are – unlike cryptographic hash functions like SHA1 [27] – also efficiently computable. Each bucket contains the time stamps of the latest up to MaxEntries Endpoint Unreachables for the corresponding bucket. Then, the report rate can be calculated as:
$$\mathit{Rate} = \frac{\mathit{NumberOfTimeStamps}}{\mathit{TimeStampLast} - \mathit{TimeStampFirst}}. \qquad (3)$$
Upon reception of an Endpoint Unreachable, the PR simply updates the reported PE’s corresponding bucket entry. If the rate in equation 3 exceeds the configured threshold MaxEURate, the report is silently ignored. The effort for this operation is in O(1), as well as the required per-PU storage space. Analogously, the same hash-based approach can be applied for Handle Resolutions with the threshold MaxHRRate, using only the PH of the requested pool as hash key.
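The blackboard can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the class name ReportBlackboard and its method names are hypothetical, a keyed BLAKE2 digest merely stands in for the universal hash function h, and a report that arrives when only one time stamp is stored is accepted because equation 3 is undefined in that case.

```python
import hashlib
import os
from collections import deque


class ReportBlackboard:
    """Per-PU blackboard: a fixed number of buckets of recent time stamps."""

    def __init__(self, buckets=64, max_entries=16, max_rate=1.0, key=None):
        self.max_rate = max_rate  # MaxEURate, in reports per second
        # Secret key so that an attacker cannot predict bucket numbers.
        # Assumption: keyed BLAKE2 replaces the universal hash h of the text.
        self.key = key if key is not None else os.urandom(16)
        # Each bucket keeps at most max_entries (MaxEntries) time stamps.
        self.buckets = [deque(maxlen=max_entries) for _ in range(buckets)]

    def psi(self, pool_handle, pe_id):
        """Psi(PH, ID_PE) = h(PH, ID_PE) mod Buckets."""
        data = pool_handle.encode() + pe_id.to_bytes(4, "big")  # 32-bit PE ID
        digest = hashlib.blake2b(data, key=self.key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % len(self.buckets)

    def accept(self, pool_handle, pe_id, now):
        """Record a report at time `now`; return False if it is to be ignored."""
        stamps = self.buckets[self.psi(pool_handle, pe_id)]
        stamps.append(now)  # the deque drops the oldest stamp when full
        if len(stamps) < 2:
            return True     # rate undefined for a single time stamp
        span = stamps[-1] - stamps[0]
        if span <= 0:
            return False    # several reports at the same instant
        rate = len(stamps) / span  # equation 3
        return rate <= self.max_rate


board = ReportBlackboard(buckets=64, max_entries=16, max_rate=1.0)
for t in (0.0, 0.1, 0.2):
    print(t, board.accept("ExamplePool", 42, t))
```

A PR would keep one such object per certified PU identity. Storage and work per report stay constant because the number of buckets and the number of time stamps per bucket are both fixed. For handle resolutions, the same structure could be keyed on the PH alone.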
In order to demonstrate the effectiveness of our approach, we have shown the results of two example simulations in figure 6. Both simulations have used 64 buckets with at most 16 entries. Assuming 1,000 PEs in the handlespace, each bucket would represent only about 16 PEs on average. The probability of a bucket collision for two really-failed PEs would therefore be rather small. The left-hand plot presents the application of the handle resolution rate limit (without failure reports) for rate limits of MaxHRRate H=1 (solid lines) and H=100 (dotted lines) handle resolutions per second. An Endpoint Unreachable is sent for each selected PE, since this is the worst case and also affects LU and RAND. The actual average handle resolution interval for legitimate PUs for the workload (see section 4) is 125s (which is a rate of 0.008 operations/s), i.e. the limit still allows sufficient room for additional handle resolutions (e.g. due to PE failures, temporary workload peaks or hash collisions). As expected from the previous results, the handling speed slightly decreases with a smaller attack interval A for LU, LUD and RR, while RAND remains unaffected. However, when the attack
Fig. 6. Applying the Countermeasure against the Pool User Attack
interval exceeds the threshold, all subsequent handle resolutions of the attacker are blocked and the handling speed returns to its full level again. Clearly, the handle resolution threshold should not be set too large (here: H=100), since this gives the attacker too much room to cause service degradation. The results for a failure report rate limit of MaxEURate U=1 (solid lines) and U=100 (dotted lines) are presented on the right-hand side of figure 6 for a handle resolution rate of A=0.1 at a PE registration lifetime of 30s. In order to show the effect of failure reports only, the handle resolution rate has not been limited here. Obviously, as expected from the handle resolution results, a useful limit achieves a significant benefit: as soon as the failure report rate exceeds the limit, subsequent reports are simply ignored and the handling speed goes back to the original value. Setting a too-high limit (here: U=100) again gives the attacker room to degrade the service quality. In summary, the handle resolution and failure report limits should be configured sufficiently above the expected rate (here: about two orders of magnitude) in order to allow for temporary peaks. Too high settings, on the other hand, result in a too-weak attack countermeasure.
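The dimensioning figures quoted in this subsection can be checked with a short calculation. The sketch below only restates numbers from the text (1,000 PEs, 64 buckets, a legitimate handle resolution interval of 125s and the limits 1 and 100 operations/s); the variable names are illustrative.

```python
import math

NUM_PES = 1000            # handlespace size assumed in the text
BUCKETS = 64              # buckets per blackboard
LEGIT_INTERVAL = 125.0    # seconds between legitimate handle resolutions

pes_per_bucket = NUM_PES / BUCKETS   # about 16 PEs share one bucket
legit_rate = 1.0 / LEGIT_INTERVAL    # about 0.008 operations/s

# Headroom of each configured limit over the legitimate rate.
headroom = {limit: limit / legit_rate for limit in (1.0, 100.0)}

print(f"PEs per bucket: {pes_per_bucket:.1f}")
print(f"legitimate rate: {legit_rate:.3f} operations/s")
for limit, factor in headroom.items():
    print(f"limit {limit:g}/s: factor {factor:.0f}, "
          f"about {math.log10(factor):.1f} orders of magnitude")
```

A limit of 1 operation/s is thus roughly two orders of magnitude above the legitimate rate, in line with the recommendation above, whereas a limit of 100 operations/s leaves about four orders of magnitude of room to an attacker.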
6 Conclusions
In this paper, we have identified two critical attack threats on RSerPool systems: (1) PE-based attacks (registration) and (2) PU-based attacks (handle resolution/failure report). In order to reduce the attack impact on registrations, we have suggested limiting the number of PE registrations a single user authorization is allowed to perform by fixed PE IDs and enforcing upper/lower policy information values. This mechanism – denoted as registration authorization ticket – is applicable in time O(1) and in particular requires no additional network overhead or protocol changes. Our solution for reducing the attack impact of handle resolutions/failure reports is a rate limitation based on hash tables. For each PU served by a PR, the rate limit can be realized in O(1) time and space. We have provided simulation results for both attack countermeasures, demonstrating their ability to significantly reduce the impact of attacks on the service performance. As part of our future work, it is also necessary to analyse the robustness of the ENRP protocol. Although the threat on the small number of PRs of an operation scope is significantly smaller, it is useful to obtain knowledge of possible attack scenarios. Furthermore, we intend to perform real-world security experiments in the PlanetLab, using our RSerPool prototype implementation rsplib. Finally, our goal is to also contribute our results to the IETF standardization process.
References
1. Lei, P., Ong, L., Tüxen, M., Dreibholz, T.: An Overview of Reliable Server Pooling Protocols. Internet-Draft Version 04, IETF, RSerPool Working Group, draft-ietf-rserpool-overview-04.txt (work in progress, January 2008)
2. Dreibholz, T., Rathgeb, E.P.: An Evalulation of the Pool Maintenance Overhead in Reliable Server Pooling Systems. In: Proceedings of the IEEE International Conference on Future Generation Communication and Networking (FGCN), Jeju Island/South Korea, vol. 1, pp. 136–143 (2007), ISBN 0-7695-3048-6
3. Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., Paxson, V.: Stream Control Transmission Protocol. Standards Track RFC 2960, IETF (2000)
4. Dreibholz, T., Rathgeb, E.P.: On the Performance of Reliable Server Pooling Systems. In: Proceedings of the IEEE Conference on Local Computer Networks (LCN) 30th Anniversary, Sydney/Australia, November 2005, pp. 200–208 (2005), ISBN 0-7695-2421-4
5. Dreibholz, T., Rathgeb, E.P.: Reliable Server Pooling – A Novel IETF Architecture for Availability-Sensitive Services. In: Proceedings of the 2nd IEEE International Conference on Digital Society (ICDS), Sainte Luce/Martinique, February 2008, pp. 150–156 (2008), ISBN 978-0-7695-3087-1
6. Unurkhaan, E.: Secure End-to-End Transport - A new security extension for SCTP. PhD thesis, University of Duisburg-Essen, Institute for Experimental Mathematics (2005)
7. Stillman, M., Gopal, R., Guttman, E., Holdrege, M., Sengodan, S.: Threats Introduced by RSerPool and Requirements for Security. Internet-Draft Version 07, IETF, RSerPool Working Group, draft-ietf-rserpool-threats-07.txt (2007)
8. Jungmaier, A., Rescorla, E., Tüxen, M.: Transport Layer Security over Stream Control Transmission Protocol. Standards Track RFC 3436, IETF (2002)
9. Bellovin, S., Ioannidis, J., Keromytis, A., Stewart, R.: On the Use of Stream Control Transmission Protocol (SCTP) with IPsec. Standards Track RFC 3554, IETF (2003)
10. Dreibholz, T.: Reliable Server Pooling – Evaluation, Optimization and Extension of a Novel IETF Architecture. PhD thesis, University of Duisburg-Essen, Faculty of Economics, Institute for Computer Science and Business Information Systems (2007)
11. Xie, Q., Stewart, R., Stillman, M., Tüxen, M., Silverton, A.: Endpoint Handlespace Redundancy Protocol (ENRP). Internet-Draft Version 18, IETF, RSerPool Working Group, draft-ietf-rserpool-enrp-18.txt (work in progress, 2007)
12. Jungmaier, A., Rathgeb, E.P., Tüxen, M.: On the Use of SCTP in Failover-Scenarios. In: Proceedings of the State Coverage Initiatives, Mobile/Wireless Computing and Communication Systems II, Orlando, Florida/USA, vol. X (2002), ISBN 980-07-8150-1
13. Foster, I.: What is the Grid? A Three Point Checklist. GRID Today (2002)
14. Dreibholz, T., Rathgeb, E.P.: An Evalulation of the Pool Maintenance Overhead in Reliable Server Pooling Systems. SERSC International Journal on Hybrid Information Technology (IJHIT) 1(2) (2008)
15. Dreibholz, T., Rathgeb, E.P.: On Improving the Performance of Reliable Server Pooling Systems for Distance-Sensitive Distributed Applications. In: Proceedings of the 15. ITG/GI Fachtagung Kommunikation in Verteilten Systemen (KiVS), Bern/Switzerland, February 2007, pp. 39–50 (2007), ISBN 978-3-540-69962-0
16. Stewart, R., Xie, Q., Stillman, M., Tüxen, M.: Aggregate Server Access Protocol (ASAP). Internet-Draft Version 18, IETF, RSerPool Working Group, draft-ietf-rserpool-asap-18.txt (work in progress, 2007)
17. Tüxen, M., Dreibholz, T.: Reliable Server Pooling Policies. Internet-Draft Version 07, IETF, RSerPool Working Group, draft-ietf-rserpool-policies-07.txt (work in progress, 2007)
18. Zhou, X., Dreibholz, T., Rathgeb, E.P.: A New Server Selection Strategy for Reliable Server Pooling in Widely Distributed Environments. In: Proceedings of the 2nd IEEE International Conference on Digital Society (ICDS), Sainte Luce/Martinique, February 2008, pp. 171–177 (2008), ISBN 978-0-7695-3087-1
19. ITU-T: Introduction to CCITT Signalling System No. 7. Technical Report Recommendation Q.700, International Telecommunication Union (1993)
20. Dreibholz, T., Jungmaier, A., Tüxen, M.: A new Scheme for IP-based Internet Mobility. In: Proceedings of the 28th IEEE Local Computer Networks Conference (LCN), Königswinter/Germany, November 2003, pp. 99–108 (2003), ISBN 0-7695-2037-5
21. Conrad, P., Jungmaier, A., Ross, C., Sim, W.-C., Tüxen, M.: Reliable IP Telephony Applications with SIP using RSerPool. In: Proceedings of the State Coverage Initiatives, Mobile/Wireless Computing and Communication Systems II, Orlando, Florida/USA, July 2002, vol. X (2002), ISBN 980-07-8150-1
22. Dreibholz, T., Coene, L., Conrad, P.: Reliable Server Pooling Applicability for IP Flow Information Exchange. Internet-Draft Version 05, IETF, Individual Submission, draft-coene-rserpool-applic-ipfix-05.txt (work in progress, 2008)
23. Uyar, Ü., Zheng, J., Fecko, M.A., Samtani, S., Conrad, P.: Evaluation of Architectures for Reliable Server Pooling in Wired and Wireless Environments. IEEE JSAC Special Issue on Recent Advances in Service Overlay Networks 22, 164–175 (2004)
24. Dreibholz, T., Rathgeb, E.P.: A Powerful Tool-Chain for Setup, Distributed Processing, Analysis and Debugging of OMNeT++ Simulations. In: Proceedings of the 1st OMNeT++ Workshop, Marseille/France (March 2008), ISBN 978-963-9799-20-2
25. Varga, A.: OMNeT++ Discrete Event Simulation System User Manual - Version 3.2, Technical University of Budapest/Hungary (2005)
26. Crosby, S.A., Wallach, D.S.: Denial of service via Algorithmic Complexity Attacks. In: Proceedings of the 12th USENIX Security Symposium, Washington, DC/USA, pp. 29–44 (August 2003)
27. Eastlake, D., Jones, P.: US Secure Hash Algorithm 1 (SHA1). Informational RFC 3174, IETF (2001)
DDoS Mitigation in Non-cooperative Environments
Guanhua Yan and Stephan Eidenbenz
Information Sciences (CCS-3), Los Alamos National Laboratory, Los Alamos, NM 87544, USA
{ghyan,eidenben}@lanl.gov
Abstract. Distributed denial of service (DDoS) attacks have plagued the Internet for many years. We propose a system to defend against DDoS attacks in a non-cooperative environment, where upstream intermediate networks need to be given an economic incentive in order for them to cooperate in the attack mitigation. Lack of such incentives is a root cause for the rare deployment of distributed DDoS mitigation schemes. Our system is based on game-theoretic principles that provably provide incentives to each participating AS (Autonomous System) to report its true defense costs to the victim, which computes and compensates the most cost-efficient (yet still effective) set of defender ASs. We also present simulation results with real AS-level topologies to demonstrate the economic feasibility of our approach.
1 Introduction
The distributed denial of service (DDoS) attack is a formidable problem that has plagued the Internet for many years. The intent of such attacks is to exhaust the resources of a remote host or network such that the service provided to legitimate users is degraded or denied. As a response to the serious threat posed by DDoS attacks, a plethora of target-resident DDoS mitigation techniques have been proposed, including packet filtering, anti-spoofing, anomaly detection, protocol analysis, and rate limiting. Although commercial products providing these solutions are available, it is still a challenging undertaking for the victim to successfully defend against a large-scale DDoS attack all by itself. The reasons are two-fold. First, although the victim is in the best position to detect a DDoS attack, distinguishing traffic with good and ill intentions is not easy. A smart attacker can complicate the defense by making the attack traffic behave similarly to legitimate traffic. Second, even if the victim has the perfect technology to characterize malicious traffic, it may not have the computational resources to execute it at the same speed as the arriving attack traffic. This turns the DDoS mitigation component itself into a target of a DDoS attack.
Los Alamos National Laboratory Publication No. LA-UR-07-3012.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 599–611, 2008. © IFIP International Federation for Information Processing 2008
G. Yan and S. Eidenbenz
Some earlier work has suggested pushing the perimeter of DDoS defense to intermediate networks [16]. None of these techniques, however, is widely deployed, due to the lack of economic and social incentives for the intermediate networks [14]. In many cases, intermediate networks do not suffer directly from a DDoS attack, and are thus reluctant to devote a non-trivial amount of resources to help the victim. Even during some flooding DDoS attacks in which traffic traversing these networks increases significantly, bandwidth overprovisioning prepared for flash crowds helps them absorb such types of attacks. Motivated by this observation, we develop an economically sound framework that mitigates DDoS attacks in non-cooperative environments. We assume that intermediate networks (or ASs) are selfish, and they are willing to defend against the DDoS attack for the victim only when they receive enough economic compensation. Put in a game-theoretic context, the distributed DDoS mitigation problem triggers the following questions: Which ASs should be selected for defending against the attack? How much should the victim pay to each participating AS for the defense? How can we design a protocol that stimulates each AS to truthfully report its real cost of the defense rather than an arbitrarily high value without economic basis? Can the victim impose a limit on how much money it is willing to pay for the defense in the protocol? Our approach to addressing these questions in an economically and game-theoretically sound manner is as follows: Using a combination of established tools to compute DDoS attack graphs, we make ASs on attack paths report defense cost estimates. Based on these estimates, we compute the most cost-efficient set of defender ASs. Using the game-theoretic paradigm of VCG-type payment schemes ensures that each defense participant maximizes its own utility by truthfully reporting its real cost estimate.
In the parlance of game theory, our protocols are provably truthful and also satisfy individual rationality. We evaluate our protocol on a real Internet-scale AS-level topology. Experimental results reveal that under DDoS attacks launched from botnets of typical sizes observed so far, the inevitable economic inefficiency of our protocol is limited within only 20% of the total payment to the participating defenders if the deployment ratio of the distributed defense scheme reaches 60%. Related work. The concept of game theory, especially mechanism design, has been used in routing protocol design for communication networks, such as interdomain routing [9] and Ad hoc-VCG for mobile ad-hoc networks [5]. On the other hand, Mirkovic et al. presented a comprehensive survey on solutions to defending against DDoS attacks in [14]. Huang et al. suggested that currently providing incentives for participating parties in the DDoS defense should be given higher priority than improving the defense technology [11]. Their arguments are in agreement with the motivation behind this paper. The remainder of this paper is organized as follows. Section 2 discusses how the victim constructs the defense graph. Section 3 presents how to select defender ASs. Sections 4 and 5 introduce the payment schemes. Section 6 evaluates the economic inefficiency and Section 7 concludes this paper.
DDoS Mitigation in Non-cooperative Environments
2
Defense Graph Generation
We consider an attack model in which many compromised hosts are used to launch a large-scale DDoS attack against a server providing a network-accessible service, such as online banking, online gaming, or database queries. As the network and computation resources (e.g., bandwidth, CPUs, storage, software licenses) supporting these services are usually limited, an attacker can convene a large army of zombie machines (e.g., using a botnet) to deplete these resources and thus significantly degrade the quality of service (QoS) experienced by legitimate clients. We do not assume that the victim can always easily distinguish malicious traffic from benign traffic: a smart attacker can make the malicious traffic behave similarly to that of a legitimate user. We assume that, to achieve this, the attacker has to use real source IP addresses in his attack traffic.¹ Our work requires that the defense graph be constructed at the victim site. To do this, we leverage existing tools and methods as follows. First, when the target server becomes overloaded, incoming source IP addresses are randomly sampled. For each sampled packet, we use the network-aware clustering technique described in [12] to derive its origin AS. It has been shown that this method is able to cluster 99% of the web clients. For each clustered IP prefix, we can look up its corresponding AS number from a map, which can be constructed offline or obtained from a third party such as Team Cymru [1]. Second, the victim decides $V_{or}$, the set of origin ASs from which traffic should be inspected, based on the traffic rate from each AS and on how much the current traffic rate from an AS exceeds the average traffic rate observed in the past. Third, the victim infers AS-level paths from the ASs in $V_{or}$ to the victim network using tools such as RouteScope [13].
Combining all the AS-level paths inferred from the ASs in $V_{or}$ to the victim network, the victim can build the defense graph $G(V, E)$, in which nodes are denoted by $A_1, A_2, \ldots, A_{|V|}$. This approach to constructing defense graphs differs from existing solutions to attack graph generation [16] that use the probabilistic packet marking (PPM) scheme. PPM, which requires intermediate routers to mark traversing packets probabilistically, has a few drawbacks here. First, it is not immune to edge faking, standard IP stack attacks, and near-birthday collision attacks [19]. Second, intermediate routers can erase marks inscribed by upstream routers to increase their own chance of being selected as defenders. Third, deriving a complete attack graph at the victim site requires all ASs to perform PPM, which is difficult from a deployment point of view. After constructing the defense graph, the victim does the following: (1) contacts the defense proxy (explained later) of each AS in the defense graph, asking for the price at which that proxy would deploy the defense scheme requested by the victim; (2) selects the most cost-efficient set of ASs; (3) calculates the payment to be made to the owner of each AS that is selected for distributed defense; (4) sends the payment to the owner of each selected AS.
¹ Actually, source address spoofing is rarely used these days in DDoS attacks [18].
G. Yan and S. Eidenbenz
Once an AS that is selected for distributed defense receives the payment, it forwards all traffic destined for the target server to its defense proxy. A defense proxy performs the defense actions requested by the victim, such as packet filtering and anti-spoofing. To alleviate the high workload on the server, the victim can even delegate part of its service to the defense proxies. In the following sections, we explain the details of Steps (1) through (4).
3
Defender Selection
Phase I: Requesting Defense Costs. After constructing the defense graph, the victim sends a request to each AS in it, asking for distributed defense against the DDoS attack. We assume that an AS can forward its traffic to its defense proxy, if it has one, at two places: links connected to end hosts in its local administrative domain, and outbound links to other ASs (if the victim does not belong to the AS) or links directly connected to the victim (if the victim belongs to the AS). We use $L_{loc}(k)$ and $L_{out}(k)$ to denote these two sets of links in AS $A_k$, respectively. We also assume that DDoS attack traffic originates only from end hosts. For ease of presentation, we let $v$ denote the address of the target server. If AS $A_k$ forwards traffic with destination $v$ to its defense proxy on links in $L_{out}(k)$, it does not need to do so on links in $L_{loc}(k)$, because any locally generated traffic with destination $v$, if not dropped inside AS $A_k$, must traverse some link in $L_{out}(k)$ before reaching the victim network. Each AS $A_k$ in set $V_{or}$ that receives a request calculates two costs: one for forwarding traffic with destination $v$ from links in $L_{loc}(k)$ to its defense proxy and performing the requested action on the traffic there, and the other for forwarding traffic with destination $v$ from links in $L_{out}(k)$ to its defense proxy and performing the requested action on the traffic there. Let $C_{loc}(k)$ and $C_{out}(k)$ denote these two costs, respectively. If AS $A_k$ is in set $V - V_{or}$, it computes only $C_{out}(k)$, and its $C_{loc}(k)$ is automatically 0. If AS $A_k$ does not have a defense proxy, both its $C_{out}(k)$ and $C_{loc}(k)$ are $+\infty$. We further define $C(k)$ as the 2-tuple $\langle C_{loc}(k), C_{out}(k) \rangle$ and call it the genuine cost vector of AS $A_k$. $C(k)$ is private information owned by AS $A_k$; in the parlance of game theory, it is called the type of AS $A_k$. Here, we do not impose a specific formula on how each AS should compute its genuine cost vector.
In practice, such estimated costs depend on many factors, such as the performance degradation due to the requested defense and the service level agreements (SLAs) between the AS and its customers. Our protocol only assumes the existence of such costs and works independently of how ASs estimate them. Moreover, we also assume that such cost estimation can be done quickly (e.g., performed automatically without human intervention) by each AS. We will explore how to achieve this in our future work. Meanwhile, each AS $A_k$ is requested to report its genuine cost vector to the victim. As an AS may lie about its true costs, we use $R_{loc}(k)$ and $R_{out}(k)$ to denote the two reported costs, respectively. Similarly, we define $R(k)$, the reported cost vector of AS $A_k$, as the 2-tuple $\langle R_{loc}(k), R_{out}(k) \rangle$.
Phase II: Computing Defense Set. After the victim receives all the reported cost vectors from the ASs in the defense graph, it decides which ASs should be selected for defending against the DDoS attack. Recall that an AS $A_k$ can perform the requested action on traffic going through links in either $L_{loc}(k)$ or $L_{out}(k)$. Hence, the victim must also determine on which of these links a selected AS should inspect traffic. Define a defense set $D$ as a set of 2-tuples $\langle A_i, p_i \rangle$, where $A_i \in V$ and $p_i \in \{0, 1\}$. If $p_i = 0$, AS $A_i$ performs the requested action on $L_{loc}(i)$; if $p_i = 1$, AS $A_i$ performs the requested action on $L_{out}(i)$. We further define $\Gamma(D)$, a set of network links, as follows:
$$\Gamma(D) = \{\, l \mid (\exists \langle A_i, 0 \rangle \in D : l \in L_{loc}(i)) \lor (\exists \langle A_i, 1 \rangle \in D : l \in L_{out}(i)) \,\} \qquad (1)$$
In other words, set $\Gamma(D)$ consists of all the links on which traffic is inspected under defense set $D$. We use $P(q)$ to denote the set of links on the path of packet $q$. We define the completeness of defense set $D$ as follows:
Definition 1. Defense set $D$ is complete relative to AS set $S$ if for any packet $q$ that originates from an AS in $S$ and goes to the target server, there exists a link $l$ in $P(q)$ such that $l \in \Gamma(D)$.
From the economic point of view, it is reasonable to assume that the victim wants to find the most cost-efficient defense set that is also complete. Given a defense set $D$, we define its cost $H(D)$ as follows:
$$H(D) = \sum_{\langle A_i, p_i \rangle \in D} \bigl( R_{loc}(i) \cdot (1 - p_i) + R_{out}(i) \cdot p_i \bigr). \qquad (2)$$
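As a concrete illustration of Eq. (1), Definition 1, and Eq. (2), the following Python sketch evaluates a few candidate defense sets on a toy instance. The two-AS chain, the link names, and the cost values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical toy instance: AS 1 -> AS 2 -> victim v. Link names and costs are made up.
L_loc = {1: {"h1-A1"}, 2: {"h2-A2"}}    # L_loc(k): links to local end hosts
L_out = {1: {"A1-A2"}, 2: {"A2-v"}}     # L_out(k): outbound links (or links to the victim)
R = {1: (3.0, 5.0), 2: (4.0, 6.0)}      # reported cost vectors <R_loc(k), R_out(k)>

# P(q) for one representative packet per origin AS.
paths = [
    ["h1-A1", "A1-A2", "A2-v"],         # packet originating in AS 1
    ["h2-A2", "A2-v"],                  # packet originating in AS 2
]

def gamma(D):
    """Eq. (1): all links on which traffic is inspected under defense set D."""
    links = set()
    for k, p in D:
        links |= L_out[k] if p == 1 else L_loc[k]
    return links

def is_complete(D, packet_paths):
    """Definition 1: every packet path crosses at least one inspected link."""
    inspected = gamma(D)
    return all(inspected.intersection(path) for path in packet_paths)

def cost(D):
    """Eq. (2): R_loc(k) when p = 0, R_out(k) when p = 1."""
    return sum(R[k][p] for k, p in D)

candidates = {
    "inspect at A2's outbound links": {(2, 1)},
    "inspect at both local link sets": {(1, 0), (2, 0)},
    "inspect at A1's outbound links": {(1, 1)},
}
for name, D in candidates.items():
    print(name, is_complete(D, paths), cost(D))
```

Under these assumed values, the first candidate is complete at cost 6, the second is complete at cost 7, and the third is not complete, because traffic originating in AS 2 never crosses a link of AS 1.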
We further give the following definition:
Definition 2. A defense set $D$ that is complete relative to AS set $S$ is efficient relative to the reported costs if for any other defense set $D'$ that is also complete relative to AS set $S$, it holds that $H(D') \geq H(D)$.
We now present an algorithm that computes a defense set that is complete relative to a subset of $V_{or}$ and efficient relative to the reported costs. $D_{min}$ is initialized to be empty. The algorithm consists of two steps:
Step 1: generate mutated defense graph $G_m$. We first create a new graph $G_m(V_m, E_m)$, called the mutated defense graph, based on defense graph $G(V, E)$ and the reported costs from each AS in $V$. For each AS node $A_k$ in $V_{or}$, we create three nodes in $G_m$, denoted by $A_k^i$, $A_k^o$, and $A_k^l$, respectively; we also add directed edges $(A_k^i, A_k^o)$ with weight $R_{out}(k)$ and $(A_k^l, A_k^i)$ with weight $R_{loc}(k)$ to $G_m$. For each AS node $A_j$ in $V - V_{or}$, we add only two nodes to $G_m$, namely $A_j^i$ and $A_j^o$, together with a directed edge $(A_j^i, A_j^o)$ with weight $R_{out}(j)$. For each directed edge $(A_i, A_j)$ in the original defense graph, we add a directed edge $(A_i^o, A_j^i)$ with weight $\infty$ to $G_m$. We also put the victim node $v$ into graph $G_m$; for each AS node $A_i$ in $V$ that is directly connected to $v$, we add edge $(A_i^o, v)$ with weight $\infty$ to $G_m$. Finally, we add to $G_m$ a virtual source node $A_{-1}$ and a directed edge with weight $\infty$ from $A_{-1}$ to every node $A_i^l$ with upper index $l$ in $G_m$ whose corresponding AS node $A_i$ is in set $V_{or}$. In $G_m$, we call a node with upper index $i$ an i-type node, a node with upper index $o$ an o-type node, and a node with upper index $l$ an l-type node. Similarly, we call an edge from an i-type node to an o-type node an i-o-type edge, an edge from an l-type node to an i-type node an l-i-type edge, and an edge from an o-type node to an i-type node an o-i-type edge. Fig. 1 illustrates the mutated defense graph generated from a defense graph and the reported costs from each AS.
[Figure 1 omitted. Panel (1): original defense graph; panel (2): mutated defense graph, with l-type, i-type, and o-type nodes. Reported cost vectors shown on the nodes: A1 <8, 12>, A2 <9, 17>, A3 <3, 5>, A4 <4, 5>, A5 <5, 7>.]
Fig. 1. Original defense graph and mutated defense graph. The reported cost vector of each AS is shown on the corresponding node.
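Step 1 can be sketched in a few lines of Python. The function name, the node encoding (a tuple of type letter and AS number, plus the strings "src" for $A_{-1}$ and "v" for the victim), and the two-AS example are all assumptions made for illustration; the paper does not prescribe a data structure.

```python
import math

INF = math.inf

def build_mutated_graph(as_edges, victim_neighbors, v_or, reported):
    """Return {(u, w): capacity} for the mutated defense graph G_m.

    as_edges:         directed AS-level edges (i, j) of the defense graph G
    victim_neighbors: ASs directly connected to the victim
    v_or:             origin ASs; only these get an l-type node
    reported:         {k: (R_loc(k), R_out(k))}
    """
    cap = {}
    for k, (r_loc, r_out) in reported.items():
        cap[(("i", k), ("o", k))] = r_out          # i-o-type edge
        if k in v_or:
            cap[(("l", k), ("i", k))] = r_loc      # l-i-type edge
            cap[("src", ("l", k))] = INF           # edge from virtual source A_{-1}
    for a, b in as_edges:
        cap[(("o", a), ("i", b))] = INF            # o-i-type edge
    for k in victim_neighbors:
        cap[(("o", k), "v")] = INF                 # edge into the victim node
    return cap

# Hypothetical two-AS chain A1 -> A2 -> v; only A1 is an origin AS.
cap = build_mutated_graph(
    as_edges=[(1, 2)], victim_neighbors=[2], v_or={1},
    reported={1: (3.0, 5.0), 2: (0.0, 4.0)},
)
for edge, capacity in cap.items():
    print(edge, capacity)
```

Storing the graph as a dictionary of edge capacities keeps the sketch short and makes the later "disable an AS" operation a matter of overwriting two entries.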
Step 2: compute the min-cut of $G_m$. We treat the mutated defense graph as a flow network with source $A_{-1}$ and sink $v$; the weight of each directed edge is its capacity. It is possible that on some paths from $A_{-1}$ to $v$ all the edges have infinite weights; this occurs when no ASs on these paths deploy defense proxies. We preprocess $G_m$ as follows: starting from sink $v$, we use depth-first search (DFS) to traverse $G_m$ in the reverse direction along edges with infinite weights only; whenever an l-type node $u$ is visited, we add its corresponding AS node $A_k$ to set $U$ and erase the l-i-type edge associated with it from $G_m$. It is not difficult to see that after this preprocessing, $G_m$ must have a min-cut with a finite capacity. The celebrated max-flow min-cut theorem [6] states that the maximum value of a flow is equal to the capacity of a minimum cut. To compute the min-cut of $G_m$, we first derive the maximum flow $f_m^{max}$ in $G_m$. Given the max-flow $f_m^{max}$, we can obtain set $E_{mc}$, which contains all the cutting edges in $G_m$ that it saturates. Then, the min-cut of $G_m$ can be obtained by cutting all the edges in $E_{mc}$. Based on $E_{mc}$, the victim can derive a defense set that is complete relative to AS set $V_{or} - U$ and efficient relative to the reported costs. As only l-i-type and i-o-type edges have finite capacities in $G_m$, $E_{mc}$ contains only such edges. For each l-i-type edge $(A_u^l, A_u^i)$ in $E_{mc}$, we add $\langle A_u, 0 \rangle$ to $D_{min}$; for each i-o-type edge $(A_u^i, A_u^o)$ in $E_{mc}$, we add $\langle A_u, 1 \rangle$ to $D_{min}$. In the example shown in Fig. 1, the capacity of the min-cut is 29 and $E_{mc}$ is $\{(A_1^i, A_1^o), (A_2^i, A_2^o)\}$. The final $D_{min}$ is thus $\{\langle A_1, 1 \rangle, \langle A_2, 1 \rangle\}$. Based on the property of the min-cut, we can easily establish the following theorem (proof omitted):
Theorem 1. The defense set found by the algorithm as described is complete relative to set Vor − U and efficient relative to the reported costs.
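A minimal sketch of Step 2 follows. It assumes that $G_m$ has already been preprocessed so that no all-infinite path remains, uses Edmonds-Karp as the max-flow routine (the paper does not name a specific algorithm), and extracts the cut with the standard residual-reachability argument. The node encoding and the two-AS chain are hypothetical.

```python
import math
from collections import defaultdict, deque

INF = math.inf

def min_cut(cap, s="src", t="v"):
    """Edmonds-Karp max-flow; returns (min-cut capacity, list of cut edges)."""
    res, adj = defaultdict(float), defaultdict(set)
    for (u, w), c in cap.items():
        res[(u, w)] += c
        adj[u].add(w)
        adj[w].add(u)                      # reverse edge of the residual graph
    flow = 0.0
    while True:
        parent, queue = {s: None}, deque([s])
        while queue:                       # BFS for a shortest augmenting path
            u = queue.popleft()
            for w in adj[u]:
                if w not in parent and res[(u, w)] > 0:
                    parent[w] = u
                    queue.append(w)
        if t not in parent:
            break
        path, w = [], t
        while parent[w] is not None:
            path.append((parent[w], w))
            w = parent[w]
        push = min(res[e] for e in path)
        if push == INF:
            raise ValueError("all-infinite path found; preprocess G_m first")
        for u, w in path:
            res[(u, w)] -= push
            res[(w, u)] += push
        flow += push
    # Edges leaving the set of nodes still reachable from the source form the cut.
    cut = [(u, w) for (u, w) in cap if u in parent and w not in parent]
    return flow, cut

def defense_set(cut):
    """Map l-i-type cut edges to <A_u, 0> and i-o-type cut edges to <A_u, 1>."""
    return {(u[1], 0 if u[0] == "l" else 1) for u, _ in cut}

# Hypothetical chain A1 -> A2 -> v, both origin ASs, with costs <3, 5> and <4, 6>.
cap = {
    ("src", ("l", 1)): INF, (("l", 1), ("i", 1)): 3.0, (("i", 1), ("o", 1)): 5.0,
    ("src", ("l", 2)): INF, (("l", 2), ("i", 2)): 4.0, (("i", 2), ("o", 2)): 6.0,
    (("o", 1), ("i", 2)): INF, (("o", 2), "v"): INF,
}
capacity, cut = min_cut(cap)
print(capacity, defense_set(cut))
```

In this toy chain, cutting the single i-o-type edge of AS 2 (capacity 6) is cheaper than cutting both l-i-type edges (3 + 4 = 7), so the resulting defense set asks AS 2 to inspect its outbound links.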
4
Payments without Reserve Price
In this section, we consider the case in which the victim does not impose a constraint on how much money it is willing to pay for the defense. Such a constraint is sometimes called a reserve price [7]. A naive payment scheme is for the victim to pay every AS in the final defense set its reported cost. However, in a non-cooperative computation environment, there is no reason to believe that each AS reports its true costs. To prevent cheating, we borrow the payment scheme from the Vickrey-Clarke-Groves (VCG) mechanisms [5]. We say that we disable AS node $A_k \in V$ from graph $G_m$ by doing the following: if edge $(A_k^l, A_k^i) \in E_m$, we set its capacity to $+\infty$, and if edge $(A_k^i, A_k^o) \in E_m$, we set its capacity to $+\infty$. The new graph after disabling node $A_k$ from $G_m$ is denoted by $G_{m,-A_k}$, and we write $\Upsilon(G)$ for the capacity of the min-cut of a graph $G$. Then, for each item $\langle A_k, p_k \rangle$ in $D_{min}$, the payment to AS $A_k$'s owner $\omega(A_k)$, denoted by $\eta(\omega(A_k))$, is given by
$$\eta(\omega(A_k)) = (1 - p_k) \cdot R_{loc}(k) + p_k \cdot R_{out}(k) + \Upsilon(G_{m,-A_k}) - \Upsilon(G_m). \qquad (3)$$
In other words, AS $A_k$, if selected for defense, is paid the difference between the cost of the most cost-efficient defense set after $A_k$ is disabled and the cost of the most cost-efficient defense set without $A_k$'s own reported cost. From a practical point of view, there are two problems with the above payment scheme.
First, an economic entity (e.g., a big ISP) can own multiple ASs, and as the above payment scheme treats each AS as an agent in the game, multiple agents belonging to the same economic entity can collude to achieve better economic benefits. We call this the collusion attack problem. It is well known that VCG-type payment schemes do not prevent collusion attacks [?]. To solve this problem, we slightly modify the above payment scheme and treat each independent economic entity as an agent. Let $\Phi$ denote the entire set of economic entities, or agents, involved in the distributed defense, and let $\Psi(s)$, where $s \in \Phi$, denote the set of ASs that agent $s$ controls. We also use $\Upsilon(G_{m,-S})$ to denote the capacity of the min-cut after disabling all the AS nodes in set $S$ from graph $G_m$. The total payment made to agent $s$ is given as follows:
$$\eta(s) = \sum_{A_k \in \Psi(s) \,:\, \langle A_k, p_k \rangle \in D_{min}} \bigl( (1 - p_k) \cdot R_{loc}(k) + p_k \cdot R_{out}(k) \bigr) + \Upsilon(G_{m,-\Psi(s)}) - \Upsilon(G_m).$$
The difference $\Upsilon(G_{m,-\Psi(s)}) - \Upsilon(G_m)$ is the overpayment made to agent $s$ (relative to its reported cost); it is also called the premium paid to agent $s$.
Second, after disabling all AS nodes belonging to an agent $s$, the min-cut capacity $\Upsilon(G_{m,-\Psi(s)})$ may become infinite. This means that the victim would have to pay an infinite amount of money to agent $s$, which is obviously not realistic. We call this the monopolistic extortion problem. To circumvent it, we modify the mutated defense graph $G_m$ to satisfy the 1-agent resilience property: on any path from node $A_{-1}$ to victim node $v$, there are at least two edges that have finite weights and whose incident nodes belong to different agents. To achieve this property, we simply remove the first l-i-type edge on every path that does not satisfy this condition and put the AS node associated with that edge into set $U$. Following the example in Fig. 1, suppose there are three agents, $s_1$, $s_2$, and $s_3$, and that $\Psi(s_1)$, $\Psi(s_2)$, and $\Psi(s_3)$ are $\{A_3, A_4\}$, $\{A_1, A_5\}$, and $\{A_2\}$, respectively. To satisfy the 1-agent resilience property, we remove the two edges from $A_1^l$ to $A_1^i$ and from $A_2^l$ to $A_2^i$, and the new defense set $D_{min}$ thus becomes $\{\langle A_3, 0 \rangle, \langle A_4, 0 \rangle, \langle A_5, 0 \rangle\}$. The payments made to $s_1$, $s_2$, and $s_3$ are 12, 17, and 0, respectively. The utility of an agent is defined as the payment it receives from the victim less its true cost. Let $u(s)$ be the utility of agent $s$. Following the same example, we have $u(s_1) = 5$, $u(s_2) = 12$, and $u(s_3) = 0$.
In the parlance of game theory, our protocol is truthful (or strategyproof) only if truth-telling is a dominant strategy for each participating agent; that is to say, if an agent wants to maximize its utility, it must truthfully report the genuine cost vector of each AS that it owns, independently of any other agent's report. The individual rationality of a protocol means that an agent's utility after participating in the game is at least as much as its utility without participation. We then have the following theorem (proof provided in Appendix A):
Theorem 2. If the victim does not have a reserve price and the protocol as described is executed, behaving truthfully is a dominant strategy and individually rational for any agent.
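The agent-level payments of the Fig. 1 example can be checked with a short Python sketch. The reported cost vectors and the agent assignment come from the text; the AS-level topology (A3 and A4 feed A1, A5 feeds A2, and A1 and A2 connect to the victim) is an assumption, since the figure is not reproduced here, chosen because it is consistent with the payments quoted above. Edmonds-Karp is likewise only one possible max-flow routine.

```python
import math
from collections import defaultdict, deque

INF = math.inf

def min_cut(cap, s="src", t="v"):
    """Edmonds-Karp max-flow; returns (min-cut capacity, list of cut edges)."""
    res, adj = defaultdict(float), defaultdict(set)
    for (u, w), c in cap.items():
        res[(u, w)] += c
        adj[u].add(w)
        adj[w].add(u)
    flow = 0.0
    while True:
        parent, queue = {s: None}, deque([s])
        while queue:                                   # BFS in the residual graph
            u = queue.popleft()
            for w in adj[u]:
                if w not in parent and res[(u, w)] > 0:
                    parent[w] = u
                    queue.append(w)
        if t not in parent:
            break
        path, w = [], t
        while parent[w] is not None:
            path.append((parent[w], w))
            w = parent[w]
        push = min(res[e] for e in path)
        if push == INF:
            raise ValueError("all-infinite path: payment would be unbounded")
        for u, w in path:
            res[(u, w)] -= push
            res[(w, u)] += push
        flow += push
    return flow, [(u, w) for (u, w) in cap if u in parent and w not in parent]

R = {1: (8, 12), 2: (9, 17), 3: (3, 5), 4: (4, 5), 5: (5, 7)}   # <R_loc, R_out>, Fig. 1
AS_EDGES = [(3, 1), (4, 1), (5, 2)]      # ASSUMED topology, not given in the text
VICTIM_NEIGHBORS = [1, 2]                # ASSUMED
U = {1, 2}                               # l-i edges removed for 1-agent resilience
AGENTS = {"s1": {3, 4}, "s2": {1, 5}, "s3": {2}}

cap = {}
for k, (r_loc, r_out) in R.items():
    cap[(("i", k), ("o", k))] = r_out
    if k not in U:
        cap[(("l", k), ("i", k))] = r_loc
        cap[("src", ("l", k))] = INF
for a, b in AS_EDGES:
    cap[(("o", a), ("i", b))] = INF
for k in VICTIM_NEIGHBORS:
    cap[(("o", k), "v")] = INF

def disable(cap, as_ids):
    """G_{m,-S}: raise the l-i and i-o edges of every AS in as_ids to +inf."""
    out = dict(cap)
    for u, w in cap:
        if isinstance(u, tuple) and isinstance(w, tuple) and u[1] == w[1] and u[1] in as_ids:
            out[(u, w)] = INF
    return out

base, cut = min_cut(cap)
d_min = {(u[1], 0 if u[0] == "l" else 1) for u, _ in cut}
payments = {}
for s, owned in AGENTS.items():
    own_reported = sum(R[k][p] for k, p in d_min if k in owned)
    payments[s] = own_reported + min_cut(disable(cap, owned))[0] - base
print(sorted(d_min), payments)
```

Under the assumed topology the sketch yields payments of 12, 17, and 0 for $s_1$, $s_2$, and $s_3$; subtracting the true costs of the selected ASs (7, 5, and 0) gives the utilities 5, 12, and 0 stated in the text.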
5
Payments with Reserve Price
In this section we consider a more realistic setting in which the victim specifies its reserve price, the maximum amount of money it is willing to pay for the distributed defense. We use $P_r$ to denote the reserve price of the victim. $P_r$ is the type of the victim, and the utility of the victim is $P_r$ minus the total payments made by the victim. Let $P_d$ be the total payment made to all the agents owning ASs in $D_{min}$. A naive rule for deciding whether a deal should be made between the victim and the agents owning ASs in $D_{min}$ is: if $P_r \geq P_d$, the deal is made; otherwise, the deal fails. Although intuitively reasonable, such a decision rule can lead to cheating behavior by some agents. We use a simple example in Fig. 2 to illustrate this. Each AS belongs to a different agent, and the victim's reserve price is 25. $D_{min}$ is $\{\langle A_3, 0 \rangle, \langle A_4, 0 \rangle, \langle A_5, 0 \rangle\}$. We thus have $\eta(\omega(A_3)) = 12$, $\eta(\omega(A_4)) = 11$, and $\eta(\omega(A_5)) = 7$. Since $P_d = 30 > 25$, the defense deal cannot be made. Now suppose that $A_3$ reports its cost vector as $\langle 11, 12 \rangle$ instead of $\langle 6, 12 \rangle$. Clearly, $D_{min}$ does not change, but now $\eta(\omega(A_3)) = 12$, $\eta(\omega(A_4)) = 6$, and $\eta(\omega(A_5)) = 2$. Hence, $P_d = 20 < 25$, which makes the deal take place. The utility of $A_3$ changes from 0 to 6 after it cheats. To solve this problem, we borrow the idea of global replacement from [7]. Define set $\Phi_{min}$ as follows:
$$\Phi_{min} = \{\, s \mid \exists \langle A_k, p_k \rangle \in D_{min} : A_k \in \Psi(s) \,\}. \qquad (4)$$
[Figure 2 omitted. Panel (1): original defense graph; panel (2): mutated defense graph, with l-type, i-type, and o-type nodes. Reported cost vectors shown on the nodes: A1 <0, 9>, A2 <0, 9>, A3 <6, 12>, A4 <5, 8>, A5 <1, 100>. The victim's reserve price is 25.]
Fig. 2. Untruthfulness with reserve price
That is to say, $\Phi_{min}$ contains every agent that has at least one AS selected into the defense set $D_{min}$. Then, for every agent $s$ in $\Phi_{min}$, we disable all the ASs that it owns from graph $G_m$. Let the new graph be $G_m^r$. We then use the algorithm described in Section 3 to recompute the defense set from $G_m^r$. We denote the new defense set by $D_{min}^r$ and call it the global replacement defense set. We define $P_d^{-mc}$ as the total payment made to all the agents owning ASs in $D_{min}^r$. The decision of the victim depends on $P_r$ and $P_d^{-mc}$: if $P_r \geq P_d^{-mc}$, the deal takes place and the payment to each agent is the same as computed in Section 4; otherwise, the deal does not take place. In the example, if all ASs report their genuine cost vectors truthfully, we have $P_d^{-mc} = 18 < P_r = 25$; therefore, the distributed defense takes place and the payments to $\omega(A_3)$, $\omega(A_4)$, and $\omega(A_5)$ are 12, 11, and 7, respectively. If $\omega(A_3)$ falsely reports the cost vector of AS $A_3$ as $\langle 11, 12 \rangle$, its utility remains unchanged at 6. As the victim is only willing to pay $P_d^{-mc}$, the budget is not balanced if $P_d^{-mc} \neq P_d$. In the example, the victim pays only $P_d^{-mc} = 18$ for the defense, but the total payment made to all the ASs is $P_d = 30 > 18$. To overcome this side effect, we introduce the central bank model [10], in which each AS has an account at a central bank and the bank periodically credits or debits each agent to cover the difference between $P_d^{-mc}$ and $P_d$. Such a central bank can be managed by a trusted authority, such as IANA, which allocates unused AS numbers in the Internet.
Theorem 3 establishes the truthfulness and individual rationality of the protocol when the victim has a reserve price (proof provided in [20]):
Theorem 3. If the victim has a reserve price and the protocol as described is executed, behaving truthfully is a dominant strategy and individually rational for both the victim and any participating agent.
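The difference between the naive rule and the global replacement rule can be made concrete with the numbers quoted for the Fig. 2 example. The function names and the returned dictionary layout below are illustrative assumptions; $P_d$ and $P_d^{-mc}$ are taken as given inputs rather than recomputed from the graph.

```python
def naive_rule(p_r, p_d):
    """Naive rule: the deal is made iff P_r >= P_d (manipulable, as Fig. 2 shows)."""
    return p_r >= p_d

def global_replacement_rule(p_r, p_d, p_d_mc):
    """The deal is decided by P_r versus P_d^{-mc}; agents still receive P_d in total."""
    if p_r < p_d_mc:
        return {"deal": False, "victim_pays": 0, "bank_covers": 0}
    # The victim pays P_d^{-mc}; the central bank covers the rest of P_d.
    return {"deal": True, "victim_pays": p_d_mc, "bank_covers": p_d - p_d_mc}

P_R, P_D_MC = 25, 18                       # values quoted in the text
print(naive_rule(P_R, 30))                 # all agents truthful: P_d = 30
print(naive_rule(P_R, 20))                 # A3 misreports <11, 12>: P_d = 20
print(global_replacement_rule(P_R, 30, P_D_MC))
```

Because $P_d^{-mc}$ is computed after disabling every agent in $\Phi_{min}$, a selected agent cannot move the decision threshold by changing its own report, which is what removes the incentive to cheat in this example.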
6
Experiments
Setup. In our experiments, we use a real Internet topology derived from BGP routing table dump files. From a BGP RIB file downloaded from the Route Views Project [2], we extract 25,193 AS numbers. The size of each AS is roughly estimated by adding up the number of IPv4 addresses covered by all the prefixes it announces. We choose a server that resides in AS 18784 as the target of a DDoS attack. We use the AS path inference service provided by the BGPVista project [3]. To obtain a map from each AS to its corresponding economic entity, we collect the information of each AS from the FixedOrbit website [4]. If two ASs have close names, we assume that they belong to the same economic entity. We identify 23,075 agents, among which 99.77% have only one AS. We also vary the sizes of the botnets used to launch a DDoS attack against the target server among $10^4$, $10^5$, and $10^6$. These values are based on the sizes of today's botnets [15]. For each bot, we let its sending rate be 130 kbps, according to the survey in [17]. In the experiments, we assume that bots are uniformly distributed over the IPv4 addresses owned by all the ASs under consideration. Moreover, when selecting the list of ASs from which traffic should be inspected, we do not consider normal traffic, because this does not keep us from explaining the basic principle of our protocol. Instead, when the DDoS traffic generated from an AS exceeds a certain threshold $\theta$, this AS is added to set $V_{or}$. We set $\theta$ to 10 Mbps. We assume that the genuine cost of an AS $A_k$ for inspecting traffic on links in either $L_{loc}(k)$ or $L_{out}(k)$ is a linear function of the traffic destined to the victim through these links:
$$C_{loc}(k) = a_{loc}(k) \cdot T_{loc}^v(k) + b_{loc}(k), \qquad C_{out}(k) = a_{out}(k) \cdot T_{out}^v(k) + b_{out}(k), \qquad (5)$$
where $T_{loc}^v(k)$ and $T_{out}^v(k)$ denote the total traffic rate destined to the victim $v$ through links in $L_{loc}(k)$ and $L_{out}(k)$, respectively, coefficients $a_{loc}(k)$ and $a_{out}(k)$ are uniformly chosen between $10^{-6}$ and $5 \times 10^{-6}$, and constants $b_{loc}(k)$ and $b_{out}(k)$ are uniformly chosen between 100 and 500. The reserve price of the victim $P_r$ is given by $c_v \cdot T_v$, where $c_v$ is $10^{-5}$ and $T_v$ is the traffic arrival rate at the victim site.
We vary the probability with which an agent deploys defense proxies among 20%, 40%, 60%, 80%, and 100%. If an agent is chosen to deploy defense proxies, every AS it owns has a defense proxy.
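The setup parameters can be put together in a small Python sketch. It assumes that traffic rates are expressed in bits per second (the text does not state the unit used in Eq. (5)), that all bot traffic reaches the victim, and that the AS names and bot counts are hypothetical.

```python
import random

BOT_RATE = 130e3      # 130 kbps per bot, in bit/s (the unit choice is an assumption)
THETA = 10e6          # 10 Mbps threshold for adding an AS to V_or
C_V = 1e-5            # reserve price coefficient c_v

def origin_ases(bots_per_as):
    """ASs whose DDoS traffic exceeds theta are added to V_or."""
    return {k for k, n in bots_per_as.items() if n * BOT_RATE > THETA}

def sample_cost(traffic, rng):
    """One draw of Eq. (5): a * T + b with a ~ U(1e-6, 5e-6) and b ~ U(100, 500)."""
    a = rng.uniform(1e-6, 5e-6)
    b = rng.uniform(100, 500)
    return a * traffic + b

def reserve_price(traffic_at_victim):
    """P_r = c_v * T_v."""
    return C_V * traffic_at_victim

rng = random.Random(0)                     # fixed seed, for reproducibility only
bots = {"AS-A": 100, "AS-B": 50}           # hypothetical bot counts per AS
v_or = origin_ases(bots)
c_loc = sample_cost(bots["AS-A"] * BOT_RATE, rng)
p_r = reserve_price(1e4 * BOT_RATE)        # botnet of 10^4 bots
print(v_or, round(c_loc, 2), round(p_r, 2))
```

With these assumptions, an AS needs at least 77 bots to exceed the 10 Mbps threshold, so only the hypothetical AS-A enters $V_{or}$, and a $10^4$-host botnet corresponds to a reserve price of about 13,000 cost units.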
Results. Due to space limitations, we present only the experimental results with reserve prices here. We refer readers to [20] for results without reserve prices. Fig. 3 depicts the fraction of successful deals between the victim and the participating agents. We note that as more agents deploy defense proxies, it is more likely that the distributed defense takes place. This is because with wider defense proxy deployment, there are more choices for the global replacement defense set, which helps reduce $P_d^{-mc}$. From the economic point of view, it is interesting to know how much the victim needs to overpay the participating agents so that the latter do not cheat. Among the successful deals between the victim and the participating agents, the overpayment ratio is illustrated in Fig. 4. The overpayment ratio is defined as $(P_d - Y)/P_d$, where $P_d$ is the total payment made to all participating defenders and $Y$ is the true total defense cost incurred by them. The general trend of each curve is that as more agents deploy defense proxies, the overpayment ratio decreases. This is because a higher portion of agents with defense proxies provides more options if a participating agent is disabled from the mutated defense graph, which helps reduce the premium paid to that agent.
[Figures 3 to 6 omitted. Each plots its metric against the proxy deployment probability (0.2 to 1) for botnet sizes of 10,000, 100,000, and 1,000,000.]
Fig. 3. Fraction of successful deals on distributed defense
Fig. 4. Overpayment ratio with reserve price (95% conf. interval)
Fig. 5. Overpayment ratio by the victim with reserve price
Fig. 6. Budget balance with reserve price (95% conf. interval)
Although the overpayment can be very high, the victim does not need to pay as much to the participating agents because of its reserve price. Fig. 5 presents the overpayment ratio that is contributed only by the victim: the largest average overpayment ratio by the victim is 0.83, much less than the highest overall overpayment ratio (about 100). The economic model incorporating the victim's reserve price requires a central bank to pay the overpayment that is not contributed by the victim. To balance its own budget, the central bank can impose a tax upon the agents [10]. Fig. 6 depicts the budget balance, defined as $(P_d - P_d^{-mc})/P_d$. The graph suggests that when few agents deploy defense proxies and the DDoS attack is launched from a million-host botnet, the budget balance of the central bank can be as high as 70% of the total payment. However, million-host botnets are still rare so far [15], and for smaller botnets, if at least 60% of the agents deploy defense proxies, the average budget balance is below 20%.
7
Conclusion
Our work in this paper is motivated by the observation that the lack of social or economic incentives is one of the root causes why no distributed defense scheme has been widely deployed against DDoS attacks. Assuming that autonomous systems behave selfishly in defending against DDoS attacks, we propose a protocol that provides incentives for these ASs to participate in cooperative defense. Our protocol is provably truthful and also satisfies individual rationality.
References
1. http://asn.cymru.com/
2. http://www.routeviews.org/
3. http://www.bgpvista.com/
4. http://www.fixedorbit.com/
5. Anderegg, L., Eidenbenz, S.: Ad hoc-VCG: a truthful and cost-efficient routing protocol for mobile ad hoc networks with selfish agents. In: Proceedings of MobiCom 2003 (September 2003)
6. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press and McGraw-Hill, Cambridge (2001)
7. Eidenbenz, S., Resta, G., Santi, P.: COMMIT: A sender-centric truthful and energy-efficient routing protocol for ad hoc networks with selfish nodes. In: Proceedings of IPDPS 2005, April 2005, vol. 13 (2005)
8. Feigenbaum, J., Krishnamurthy, A., Sami, R., Shenker, S.: Approximation and collusion in multicast cost sharing. In: Proceedings of EC 2001 (2001)
9. Feigenbaum, J., Papadimitriou, C., Sami, R., Shenker, S.: A BGP-based mechanism for lowest-cost routing. In: Proc. of PODC 2002 (2002)
10. Green, J., Laffont, J.: Incentives in public decision making. Studies in Public Economies 1, 65–78 (1979)
11. Huang, Y., Geng, X., Whinston, A.B.: Defeating DDoS attacks by fixing the incentive chain. ACM Trans. on Internet Technology (2006)
12. Krishnamurthy, B., Wang, J.: On network-aware clustering of web clients. In: Proceedings of SIGCOMM 2000 (2000)
13. Mao, Z.M., Qiu, L., Wang, J., Zhang, Y.: On AS-level path inference. In: Proceedings of SIGMETRICS 2005, Banff, Alberta, Canada (June 2005)
14. Mirkovic, J., Reiher, P.: A taxonomy of DDoS attack and DDoS defense mechanisms. ACM SIGCOMM Computer Communications Review 34(2) (April 2004)
15. Honeynet Project and Research Alliance: Know your enemy: Tracking botnets (2005), http://www.honeynet.org/papers/bots/
16. Savage, S., Wetherall, D., Karlin, A., Anderson, T.: Practical network support for IP traceback. In: Proceedings of ACM SIGCOMM 2000 (2000)
17. Singh, K.K.: Botnets - an introduction. http://www-static.cc.gatech.edu/classes/AY2006/cs6262_spring/botnets.ppt
18. Summary of the initial meeting of the dos-resistant internet working group (January 2005), http://www.thecii.org/dos-resistant/meeting-1/summary.html
19. Waldvogel, M.: GOSSIB vs. IP traceback rumors. In: Proc. of ACSAC 2002 (2002)
20. Yan, G., Eidenbenz, S.: Distributed DDoS mitigation in non-cooperative environments. Technical Report LAUR-07-3012, Los Alamos National Lab (2007)
Appendix A: Proof of Theorem 2 In Eq. (3), as Υ (Gm,−Ψ (s) ) − Υ (Gm ) cannot be negative, the utility of each agent cannot be negative. Hence, our protocol is individually rational.
Now we show that it is in the best interest of each agent to report truthfully its genuine cost vector for each AS it owns. Let $\Pi_{-s}$ denote an arbitrary set of cost vectors reported by all agents except agent $s$. We use $D_{\min}^{(s)}(\Pi_{-s}, X)$ to denote the defense set found by the algorithm when all agents except $s$ report their cost vectors according to $\Pi_{-s}$ and agent $s$ reports its cost vectors as $X$. We also use $G_m^{(s)}(\Pi_{-s}, X)$ and $G_{m,-\Psi(s)}^{(s)}(\Pi_{-s}, X)$ to denote the graphs $G_m$ and $G_{m,-\Psi(s)}$, respectively, when all agents except $s$ report their cost vectors according to $\Pi_{-s}$ and agent $s$ reports its cost vectors according to $X$. Without introducing any confusion, we drop the input $\Pi_{-s}$ in these notations. Let vector $X^s_{rep}$ contain all the cost vectors reported by agent $s$, and let vector $X^s_{gen}$ contain all the genuine cost vectors of the ASs owned by agent $s$. We also introduce notations $R(D, s)$ and $C(D, s)$ as follows:

\[ R(D, s) = \sum_{\forall A_k, p_k \in D :\, A_k \in \Psi(s)} \bigl( p_k \cdot R_{loc}(k) + (1 - p_k) \cdot R_{out}(k) \bigr) \]

\[ C(D, s) = \sum_{\forall A_k, p_k \in D :\, A_k \in \Psi(s)} \bigl( p_k \cdot C_{loc}(k) + (1 - p_k) \cdot C_{out}(k) \bigr). \]
Let $\Delta u(s)$ be the utility of agent $s$ when it reports $X^s_{rep}$ minus its utility when it reports $X^s_{gen}$. We distinguish the following three cases.

C1: If agent $s$ reports $X^s_{gen}$, some ASs in $\Psi(s)$ appear in $D_{\min}^{(s)}(X^s_{gen})$, but if agent $s$ reports $X^s_{rep}$, no ASs in $\Psi(s)$ appear in $D_{\min}^{(s)}(X^s_{rep})$. Its utility changes from a non-negative value to 0.

C2: If agent $s$ reports $X^s_{gen}$, no ASs in $\Psi(s)$ appear in $D_{\min}^{(s)}(X^s_{gen})$, but if agent $s$ reports $X^s_{rep}$, some ASs in $\Psi(s)$ appear in $D_{\min}^{(s)}(X^s_{rep})$. Then,

\[ \Delta u(s) = \Upsilon\bigl(G_{m,-\Psi(s)}^{(s)}(X^s_{rep})\bigr) - \Bigl( \Upsilon\bigl(G_m^{(s)}(X^s_{rep})\bigr) - R\bigl(D_{\min}^{(s)}(X^s_{rep}), s\bigr) + C\bigl(D_{\min}^{(s)}(X^s_{rep}), s\bigr) \Bigr). \]

The second item on the right side of the equation is the capacity of a cut of $G_m^{(s)}(X^s_{gen})$. As no ASs in $\Psi(s)$ appear in $D_{\min}^{(s)}$, the first item is the capacity of a min-cut of $G_m^{(s)}(X^s_{gen})$, so $\Delta u(s)$ is not positive.

C3: If agent $s$ reports $X^s_{gen}$, some ASs in $\Psi(s)$ appear in $D_{\min}^{(s)}(X^s_{gen})$, and if agent $s$ reports $X^s_{rep}$, some ASs in $\Psi(s)$ also appear in $D_{\min}^{(s)}(X^s_{rep})$. Then,

\[ \Delta u(s) = \Upsilon\bigl(G_m^{(s)}(X^s_{gen})\bigr) - \Bigl( \Upsilon\bigl(G_m^{(s)}(X^s_{rep})\bigr) - R\bigl(D_{\min}^{(s)}(X^s_{rep}), s\bigr) + C\bigl(D_{\min}^{(s)}(X^s_{rep}), s\bigr) \Bigr). \]

The second item on the right side of the equation is the capacity of a cut of $G_m^{(s)}(X^s_{gen})$ and the first item is the capacity of a min-cut of $G_m^{(s)}(X^s_{gen})$, so $\Delta u(s)$ is not positive.

Combining cases C1 to C3, we conclude that our protocol is truthful.
Generalized Self-healing Key Distribution Using Vector Space Access Structure

Ratna Dutta1, Sourav Mukhopadhyay2, Amitabha Das2, and Sabu Emmanuel2

1 Cryptography & Security Department, Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore - 119613
[email protected]
2 School of Computer Engineering, Nanyang Technological University, N4-B2c-06, Nanyang Avenue, Singapore - 639798
{msourav,ASADAS,ASEmmanuel}@ntu.edu.sg
Abstract. We propose and analyze a generalized self-healing key distribution using vector space access structure in order to reach more flexible performance of the scheme. Our self-healing technique enables better performance gain over previous approaches in terms of storage, communication and computation complexity. We provide rigorous treatment of security of our scheme in an appropriate security framework and show it is computationally secure and achieves forward and backward secrecy.

Keywords: key distribution, self-healing, wireless network, access structure, computational security, forward and backward secrecy.
1 Introduction
Self-healing key distribution deals with the problem of distributing session keys for secure communication to a dynamic group of users over an unreliable, lossy network in a manner that is resistant to packet loss and collusion attacks. The main concept of self-healing key distribution schemes is that users in a large and dynamic group communicating over an unreliable network can recover lost session keys on their own, even if they have lost some previous key distribution messages, without requesting additional transmissions from the group manager. This reduces network traffic and the risk of user exposure through traffic analysis, and also decreases the work load on the group manager. The key idea of self-healing key distribution schemes is to broadcast information that is useful only for trusted members. Combined with its pre-distributed secrets, this broadcast information enables a trusted member to reconstruct a shared key. In contrast, a revoked member is unable to infer useful information from the broadcast. The only requirement that a user must satisfy to recover the lost keys through self-healing is its membership in the group both before and after the sessions in which the broadcast packet containing the key is sent. A user who has been off-line for some period is able to recover the lost session keys immediately after coming back on-line. Thus the self-healing approach to key distribution is stateless.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 612–623, 2008. © IFIP International Federation for Information Processing 2008
Our Contribution. We design a computationally secure and efficient generalized self-healing key distribution scheme for large and dynamic groups over insecure wireless networks. We consider a general monotone decreasing access structure for the family of subsets of users that can be revoked, instead of a threshold one. More precisely, we use a vector space access structure, which allows us to obtain a family of self-healing key distribution schemes with more flexible performance. Our self-healing mechanism uses a one-way key chain that is more efficient than the self-healing techniques used in the previous schemes [1, 3, 4, 5, 6, 7]. Our construction is general in the sense that it depends on a particular public mapping $\phi$, and for different choices of $\phi$ we obtain different self-healing key distribution schemes. The broadcast message length also depends on this particular function $\phi$. Additionally, our construction does not require sending the history of revoked subsets of users in order to perform self-healing, yielding a significant reduction in the communication cost. The main attraction of this paper is that our general construction has a significant performance gain in terms of storage, communication and computation overhead. A special case of our family of self-healing key distribution schemes is one that considers Shamir's $(t, n)$-threshold secret sharing. We emphasize that each user in this special self-healing key distribution scheme requires $(m - j + 1) \log q$ memory, and the size of the broadcast message at the $j$-th session is $(t + 1) \log q$, with computation cost $2(t^2 + t)$. Here $m$ is the maximum number of sessions, $j$ is the current session number and $q$ is a prime large enough to accommodate a cryptographic key.
Our key distribution schemes are scalable to very large groups in highly mobile, volatile and hostile wireless networks, since the communication and computation overheads do not depend on the size of the group; instead they depend on the number of compromised group members that may collude. We show in an appropriate security model that our proposed constructions are computationally secure and achieve both forward secrecy and backward secrecy.
2 Preliminaries

2.1 Secret Sharing Schemes
In this section we define secret sharing schemes, which play an important role in distributed cryptography.

Definition 2.1 (Access Structure). Let $\mathcal{U} = \{U_1, \ldots, U_n\}$ be a set of participants. A collection $\Gamma \subseteq 2^{\mathcal{U}}$ is monotone if $B \in \Gamma$ and $B \subseteq C \subseteq \mathcal{U}$ imply $C \in \Gamma$. An access structure is a monotone collection $\Gamma$ of non-empty subsets of $\mathcal{U}$, i.e., $\Gamma \subseteq 2^{\mathcal{U}} \setminus \{\emptyset\}$. The sets in $\Gamma$ are called the authorized sets. A set $B$ is called a minimal set of $\Gamma$ if $B \in \Gamma$, and for every $C \subset B$, $C \neq B$, it holds that $C \notin \Gamma$. The set of minimal authorized subsets of $\Gamma$ is denoted by $\Gamma_0$ and is called the basis of $\Gamma$. Since $\Gamma$ consists of all subsets of $\mathcal{U}$ that are supersets of a subset in the basis $\Gamma_0$, $\Gamma$ is determined uniquely as a function of $\Gamma_0$. More formally, we have $\Gamma = \{C \subseteq \mathcal{U} : B \subseteq C, B \in \Gamma_0\}$. We say that $\Gamma$ is the closure of $\Gamma_0$ and write $\Gamma = cl(\Gamma_0)$. The family of non-authorized subsets $\bar{\Gamma} = 2^{\mathcal{U}} \setminus \Gamma$
is monotone decreasing, that is, if $C \in \bar{\Gamma}$ and $B \subseteq C \subseteq \mathcal{U}$, then $B \in \bar{\Gamma}$. The family of non-authorized subsets $\bar{\Gamma}$ is determined by the collection of maximal non-authorized subsets $\bar{\Gamma}_0$. In the case of a $(t, n)$-threshold access structure, the basis consists of all subsets of (exactly) $t$ participants, i.e. $\Gamma = \{B \subseteq \mathcal{U} : |B| \geq t\}$ and $\Gamma_0 = \{B \subseteq \mathcal{U} : |B| = t\}$.

Definition 2.2 (Secret Sharing). Let $\mathcal{K}$ be a finite set of secrets, where $|\mathcal{K}| \geq 2$. An $n$-party secret sharing scheme $\Pi$ with secret domain $\mathcal{K}$ is a randomized mapping from $\mathcal{K}$ to a set of $n$-tuples $S_1 \times S_2 \times \ldots \times S_n$, where $S_i$ is called the share domain of $U_i \in \mathcal{U}$. A dealer $D \notin \mathcal{U}$ distributes a secret $K \in \mathcal{K}$ according to $\Pi$ by first sampling a vector of shares $(s_1, \ldots, s_n)$ from $\Pi(K)$, and then privately communicating each share $s_i$ to the party $U_i$. We say that $\Pi$ realizes an access structure $\Gamma \subseteq 2^{\mathcal{U}}$ if the following two requirements hold:

Correctness: The secret $K$ can be reconstructed by any authorized subset of parties. That is, for any subset $B \in \Gamma$ (where $B = \{U_{i_1}, \ldots, U_{i_{|B|}}\}$), there exists a reconstruction function $Rec_B : S_{i_1} \times \ldots \times S_{i_{|B|}} \to \mathcal{K}$ such that for every $K \in \mathcal{K}$, $Prob[Rec_B(\Pi(K)_B) = K] = 1$, where $\Pi(K)_B$ denotes the restriction of $\Pi(K)$ to its $B$-entries.

Privacy: No unauthorized subset can learn anything about the secret (in the information theoretic sense) from its shares. Formally, for any subset $C \notin \Gamma$, for every two secrets $K_1, K_2 \in \mathcal{K}$, and for every possible vector of shares $(s_i)_{U_i \in C}$, $Prob[\Pi(K_1)_C = (s_i)_{U_i \in C}] = Prob[\Pi(K_2)_C = (s_i)_{U_i \in C}]$.

The above correctness and privacy requirements capture the strict notion of perfect secret sharing, which is the one most commonly referred to in the secret sharing literature.

Definition 2.3 (Vector Space Access Structure). Suppose $\Gamma$ is an access structure, and let $(Z_q)^l$ denote the vector space of all $l$-tuples over $Z_q$, where $q$ is prime and $l \geq 2$.
Suppose there exists a function $\Phi : \mathcal{U} \cup \{D\} \to (Z_q)^l$ which satisfies the property: $B \in \Gamma$ if and only if the vector $\Phi(D)$ can be expressed as a linear combination of the vectors in the set $\{\Phi(U_i) : U_i \in B\}$. An access structure $\Gamma$ is said to be a vector space access structure if it can be defined in the above way. We now present the vector space secret sharing scheme that was introduced by Brickell [2].

– Initialization: For $1 \leq i \leq n$, $D$ gives the vector $\Phi(U_i) \in (Z_q)^l$ to $U_i$. These vectors are public.
– Share Distribution:
  1. Suppose $D$ wants to share a key $K \in Z_q$. $D$ secretly chooses (independently at random) $l - 1$ elements $a_2, \ldots, a_l$ from $Z_q$.
  2. For $1 \leq i \leq n$, $D$ computes $s_i = v \cdot \Phi(U_i)$, where $v = (K, a_2, \ldots, a_l) \in (Z_q)^l$.
  3. For $1 \leq i \leq n$, $D$ gives the share $s_i$ to $U_i$.
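– Key Recovery: Let $B$ be an authorized subset, $B \in \Gamma$. Then

\[ \Phi(D) = \sum_{\{i : U_i \in B\}} \Lambda_i \Phi(U_i) \]

for some $\Lambda_i \in Z_q$. In order to recover the secret $K$, the participants of $B$ pool their shares and compute

\[ \sum_{\{i : U_i \in B\}} \Lambda_i s_i = v \cdot \Bigl( \sum_{\{i : U_i \in B\}} \Lambda_i \Phi(U_i) \Bigr) = v \cdot \Phi(D) = K \bmod q. \]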
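The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the prime `Q`, the choice of map `phi` (powers of the user identity, with $\Phi(D) = (1, 0, \ldots, 0)$, which gives a Shamir-style threshold structure) and all function names are hypothetical. The coefficients $\Lambda_i$ are found by Gaussian elimination modulo $q$.

```python
import secrets

Q = 2**61 - 1  # a Mersenne prime standing in for q (assumption)

def dot(u, v, q=Q):
    """Inner product modulo q."""
    return sum(a * b for a, b in zip(u, v)) % q

def phi(x, l, q=Q):
    """Hypothetical public map: Phi(U_x) = (1, x, ..., x^(l-1)); x must be nonzero."""
    return [pow(x, e, q) for e in range(l)]

def solve_combination(vectors, target, q=Q):
    """Return c with sum_i c[i] * vectors[i] == target (mod q), or None if impossible."""
    rows, cols = len(target), len(vectors)
    # Column i of the matrix is vectors[i]; the last column is the target.
    m = [[vectors[c][r] % q for c in range(cols)] + [target[r] % q]
         for r in range(rows)]
    pivots, r = [], 0
    for c in range(cols):
        p = next((i for i in range(r, rows) if m[i][c]), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        inv = pow(m[r][c], -1, q)
        m[r] = [x * inv % q for x in m[r]]
        for i in range(rows):
            if i != r and m[i][c]:
                f = m[i][c]
                m[i] = [(x - f * y) % q for x, y in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == rows:
            break
    if any(row[-1] for row in m[r:]):
        return None  # target is not in the span: the subset is unauthorized
    coeffs = [0] * cols  # free variables are set to 0
    for row_idx, c in enumerate(pivots):
        coeffs[c] = m[row_idx][-1]
    return coeffs

def share(K, ids, l, q=Q):
    """Share distribution: s_i = v . Phi(U_i) with v = (K, a_2, ..., a_l)."""
    v = [K % q] + [secrets.randbelow(q) for _ in range(l - 1)]
    return {i: dot(v, phi(i, l, q), q) for i in ids}

def recover(shares, l, q=Q):
    """Key recovery: find Lambda_i with sum Lambda_i Phi(U_i) = Phi(D), then sum Lambda_i s_i."""
    ids = list(shares)
    phi_d = [1] + [0] * (l - 1)  # Phi(D) = (1, 0, ..., 0) in this sketch
    lam = solve_combination([phi(i, l, q) for i in ids], phi_d, q)
    if lam is None:
        return None
    return sum(c * shares[i] for c, i in zip(lam, ids)) % q
```

With `l = 3`, any three participants with distinct nonzero identities can call `recover` on their pooled shares, while two participants get `None` because $\Phi(D)$ is not in the span of their vectors.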
Thus when an authorized subset of participants $B \in \Gamma$ pool their shares, they can determine the value $K$. On the other hand, one can show that if an unauthorized subset $B \notin \Gamma$ pool their shares, they can determine nothing about the value of $K$ (see [2] for proof).

2.2 Our Security Model
We now state the following definitions, which are aimed at computational security for session key distribution and adopt the security model of [5, 7]. Let $\mathcal{U} = \{U_1, \ldots, U_n\}$ be the universe of the network. We assume the availability of an unreliable broadcast channel, and there is a group manager GM who sets up and performs join and revoke operations to maintain a communication group, which is a dynamic subset of users of $\mathcal{U}$. Let $m$ be the maximum number of sessions, and $\mathcal{R} \subset 2^{\mathcal{U}}$ be a monotone decreasing access structure of subsets of users that can be revoked by the group manager GM. Let $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, m\}$ and $G_j \subseteq \mathcal{U}$ be the group established by the group manager GM in session $j$.

Definition 2.4. (Session Key Distribution with privacy [7])
1. $D$ is a session key distribution with privacy if
   (a) for any user $U_i \in G_j$, the session key $SK_j$ is efficiently determined from $B_j$ and $S_i$.
   (b) for any set $R_j \subseteq \mathcal{U}$, where $R_j \in \mathcal{R}$ and $U_i \notin R_j$, it is computationally infeasible for users in $R_j$ to determine the personal key $S_i$.
   (c) what users $U_1, \ldots, U_n$ learn from $B_j$ cannot be determined from broadcasts or personal keys alone, i.e. if we consider separately either the set of $m$ broadcasts $\{B_1, \ldots, B_m\}$ or the set of $n$ personal keys $\{S_1, \ldots, S_n\}$, then it is computationally infeasible to compute the session key $SK_j$ (or other useful information) from either set.
2. $D$ has $\mathcal{R}$-revocation capability if given any $R_j \subseteq \mathcal{U}$, where $R_j \in \mathcal{R}$, the group manager GM can generate a broadcast $B_j$ such that for all $U_i \notin R_j$, $U_i$ can efficiently recover the session key $SK_j$, but the revoked users cannot, i.e. it is computationally infeasible to compute $SK_j$ from $B_j$ and $\{S_l\}_{U_l \in R_j}$.
3. $D$ is self-healing if the following is true for any $j$, $1 \leq j_1 < j < j_2 \leq m$: For any user $U_i$ who is a member in sessions $j_1$ and $j_2$, the key $SK_j$ is efficiently determined by the set $\{Z_{i,j_1}, Z_{i,j_2}\}$. In other words, every user $U_i \in G_{j_1}$ who has not been revoked after session $j_1$ and before session $j_2$ can recover all session keys $SK_j$ for $j = j_1, \ldots, j_2$ from the broadcasts $B_{j_1}$ and $B_{j_2}$, where $1 \leq j_1 < j_2 \leq m$.

Definition 2.5. ($\mathcal{R}$-wise forward and backward secrecy [5])
1. A key distribution scheme $D$ guarantees $\mathcal{R}$-wise forward secrecy if for any set $R_j \subseteq \mathcal{U}$, where $R_j \in \mathcal{R}$ and all $U_s \in R_j$ are revoked before session $j$, it is computationally infeasible for the members in $R_j$ together to get any information about $SK_j$, even with the knowledge of the group keys $SK_1, \ldots, SK_{j-1}$ before session $j$.
2. A session key distribution $D$ guarantees $\mathcal{R}$-wise backward secrecy if for any set $J_j \subseteq \mathcal{U}$, where $J_j \in \mathcal{R}$ and all $U_s \in J_j$ join after session $j$, it is computationally infeasible for the members in $J_j$ together to get any information about $SK_j$, even with the knowledge of the group keys $SK_{j+1}, \ldots, SK_m$ after session $j$.
3 Our General Construction
We consider a setting in which there is a group manager (GM) and $n$ users $\mathcal{U} = \{U_1, \ldots, U_n\}$. All of our operations take place in a finite field, GF($q$), where $q$ is a large prime number ($q > n$). In our setting, we never allow a revoked user to rejoin the group in a later session. Let $H : \mathrm{GF}(q) \to \mathrm{GF}(q)$ be a cryptographically secure one-way function. The life of the system is divided into sessions $j = 1, 2, \ldots, m$. The communication group in session $j$ is denoted by $G_j \subset \mathcal{U}$. We consider a linear secret sharing scheme realizing some access structure $\Gamma$ over the set $\mathcal{U}$. For simplicity, suppose there exists a public function $\Phi : \mathcal{U} \cup \{\mathrm{GM}\} \to \mathrm{GF}(q)^l$ satisfying the property

\[ \Phi(\mathrm{GM}) \in \langle \Phi(U_i) : U_i \in B \rangle \Leftrightarrow B \in \Gamma, \]

where $l$ is a positive integer. In other words, the vector $\Phi(\mathrm{GM})$ can be expressed as a linear combination of the vectors in the set $\{\Phi(U_i) : U_i \in B\}$ if and only if $B$ is an authorized subset. Then $\Phi$ defines $\Gamma$ as a vector space access structure.

– Setup: Let $G_1 \subseteq \mathcal{U}$. The group manager GM chooses independently and uniformly at random $m$ vectors $v_1, v_2, \ldots, v_m \in \mathrm{GF}(q)^l$. The group manager randomly picks two initial key seeds, the forward key seed $S^F \in \mathrm{GF}(q)$ and the backward key seed $S^B \in \mathrm{GF}(q)$. It repeatedly applies (in the pre-processing time) the one-way function $H$ on $S^B$ and computes the one-way backward key chain of length $m$: $K_i^B = H(K_{i-1}^B) = H^{i-1}(S^B)$ for $1 \leq i \leq m$. The $j$-th session key is computed as $SK_j = K_j^F + K_{m-j+1}^B$, where $K_j^F = H^{j-1}(S^F)$. Each user $U_i \in G_1$ receives its personal secret keys corresponding to the $m$ sessions, $S_i = (v_1 \cdot \Phi(U_i), \ldots, v_m \cdot \Phi(U_i)) \in \mathrm{GF}(q)^m$, and the forward key seed $S^F$ from the group manager via the secure communication channel between them. Here the operation "$\cdot$" is the inner product modulo $q$.
– Broadcast: Let $R_j$ be the set of all revoked users for sessions in and before $j$ such that $R_j \notin \Gamma$, and let $G_j$ be the set of all non-revoked users in session $j$. In the $j$-th session the GM first chooses a subset of users $W_j \subset \mathcal{U} \setminus G_j$ with minimal cardinality such that $W_j \cup R_j \in \bar{\Gamma}_0$. The GM then computes $Z_j = K_{m-j+1}^B + v_j \cdot \Phi(\mathrm{GM})$ and broadcasts the message

\[ B_j = \{(U_k, v_j \cdot \Phi(U_k)) : U_k \in W_j \cup R_j\} \cup \{Z_j\}. \]

– Session Key Recovery: When a non-revoked user $U_i$ receives the $j$-th session key distribution message $B_j$, it recovers $v_j \cdot \Phi(\mathrm{GM})$ as follows: Since $W_j \cup R_j \in \bar{\Gamma}_0$ is a maximal non-authorized subset, with $W_j \subset \mathcal{U} \setminus G_j$ of minimum cardinality, the set $B = W_j \cup R_j \cup \{U_i\} \in \Gamma$. Thus $B$ is an authorized subset, and one can write

\[ \Phi(\mathrm{GM}) = \sum_{\{k : U_k \in B\}} \Lambda_k \Phi(U_k) \]

for some $\Lambda_k \in \mathrm{GF}(q)$. Hence $U_i$ knows $\Lambda_k$ and $v_j \cdot \Phi(U_k)$ for all $U_k \in B$ and can compute

\[ \sum_{\{k : U_k \in B\}} \Lambda_k \bigl(v_j \cdot \Phi(U_k)\bigr) = v_j \cdot \Bigl( \sum_{\{k : U_k \in B\}} \Lambda_k \Phi(U_k) \Bigr) = v_j \cdot \Phi(\mathrm{GM}). \]

Then $U_i$ recovers the key $K_{m-j+1}^B$ as $K_{m-j+1}^B = Z_j - v_j \cdot \Phi(\mathrm{GM})$. Finally, $U_i$ computes the $j$-th forward key $K_j^F = H^{j-1}(S^F)$ and evaluates the current session key $SK_j = K_j^F + K_{m-j+1}^B$. A user $U_k$ who either does not know its private information $v_j \cdot \Phi(U_k)$ or who is a revoked user in $R_j$, i.e. $U_k \in W_j \cup R_j$, cannot compute $v_j \cdot \Phi(\mathrm{GM})$, because $U_k$ only knows the values broadcast in the message $B_j$, which correspond to an unauthorized subset of the secret sharing scheme. Consequently, $U_k$ cannot recover the backward key $K_{m-j+1}^B$ and hence the $j$-th session key $SK_j$.

– Add Group Members: When the group manager adds a new group member starting from session $j$, it picks an unused identity $v \in \mathrm{GF}(q)$, computes the personal secret keys corresponding to the current and future sessions, $S_v = (v_j \cdot \Phi(U_v), \ldots, v_m \cdot \Phi(U_v)) \in \mathrm{GF}(q)^{m-j+1}$, and gives $\{v, S_v, K_j^F\}$ to this new group member via the secure communication channel between them.

– Re-initialization: The system fails when all $m$ sessions are exhausted, or when the set of revoked users for sessions in and before the current session becomes an element of $\Gamma$. At this phase, re-initialization is required and a new setup is executed.
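The Setup, Broadcast and Session Key Recovery steps above can be sketched in Python for the Shamir special case. This is a hedged illustration, not the authors' code: it assumes $\Phi(U_k) = (1, k, \ldots, k^t)$ and $\Phi(\mathrm{GM}) = (1, 0, \ldots, 0)$, so that $v_j \cdot \Phi(U_k)$ is a degree-$t$ polynomial value and $v_j \cdot \Phi(\mathrm{GM})$ is its value at 0. The prime `Q`, the SHA-256 based stand-in for $H$, the use of unused identities as the padding set $W_j$, and all class and function names are assumptions made for the example.

```python
import hashlib
import secrets

Q = 2**127 - 1  # a Mersenne prime standing in for q (assumption)

def H(x):
    """Stand-in one-way function on GF(q): SHA-256 reduced mod q (assumption)."""
    return int.from_bytes(hashlib.sha256(x.to_bytes(16, "big")).digest(), "big") % Q

def chain(seed, length):
    """[seed, H(seed), ..., H^(length-1)(seed)], i.e. K_1, ..., K_length."""
    keys = [seed]
    for _ in range(length - 1):
        keys.append(H(keys[-1]))
    return keys

def poly_eval(coeffs, x):
    """v . Phi(U_x) for Phi(U_x) = (1, x, x^2, ...), i.e. a polynomial value."""
    return sum(c * pow(x, e, Q) for e, c in enumerate(coeffs)) % Q

def lagrange_at_zero(points):
    """v . Phi(GM) for Phi(GM) = (1, 0, ..., 0): interpolate and evaluate at 0."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (-xj) % Q
                den = den * (xi - xj) % Q
        total = (total + yi * num * pow(den, -1, Q)) % Q
    return total

class GroupManager:
    def __init__(self, m, t):
        self.m, self.t = m, t
        # m random vectors v_1, ..., v_m in GF(q)^(t+1)
        self.v = [[secrets.randbelow(Q) for _ in range(t + 1)] for _ in range(m)]
        self.s_f = secrets.randbelow(Q)              # forward seed S^F
        self.k_b = chain(secrets.randbelow(Q), m)    # k_b[i-1] = K_i^B

    def personal_key(self, uid):
        """S_i = (v_1.Phi(U_i), ..., v_m.Phi(U_i)) together with S^F."""
        return [poly_eval(vj, uid) for vj in self.v], self.s_f

    def broadcast(self, j, revoked, padding):
        """B_j for session j (1-based); revoked plus padding must have t identities."""
        ids = list(revoked) + list(padding)
        assert len(set(ids)) == self.t
        vj = self.v[j - 1]
        z = (self.k_b[self.m - j] + vj[0]) % Q       # Z_j = K^B_{m-j+1} + v_j.Phi(GM)
        return [(k, poly_eval(vj, k)) for k in ids], z

    def session_key(self, j):
        """SK_j = K^F_j + K^B_{m-j+1}."""
        return (chain(self.s_f, j)[-1] + self.k_b[self.m - j]) % Q

def recover_session_key(uid, personal, s_f, j, bcast):
    """User side: combine the own share with B_j to unmask K^B_{m-j+1}."""
    shares, z = bcast
    if any(k == uid for k, _ in shares):
        return None                                  # revoked: only t points known
    mask = lagrange_at_zero(shares + [(uid, personal[j - 1])])
    k_b = (z - mask) % Q                             # K^B_{m-j+1} = Z_j - v_j.Phi(GM)
    k_f = chain(s_f, j)[-1]                          # K^F_j = H^(j-1)(S^F)
    return (k_f + k_b) % Q
```

For instance, with `GroupManager(m=5, t=2)`, revoking user 2 in session 3 and padding with an unused identity such as 100 gives a broadcast of $t$ shares plus $Z_j$; a non-revoked user recovers the same value as `session_key(3)`, while the revoked user holds no point outside the broadcast. Identities must be distinct and nonzero modulo `Q`.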
3.1 Complexity
Storage overhead: For simplicity, we use vector space secret sharing. Then the storage complexity of the personal key for each user is $m \log q$ bits. The group members that join later need to store less data. For example, the personal key for a user joining at the $j$-th session occupies $(m - j + 1) \log q$ bits of memory space. If we use a more general linear secret sharing scheme in which a participant $U_i$ is associated with $m_i \geq 1$ vectors, then its personal secret key consists of $m_i$ vectors and hence is of size $m \, m_i \log q$ bits.

Communication overhead: The communication bandwidth for key management at the $j$-th session is $(t_j + 1) \log q$ bits, where $t_j = |W_j \cup R_j|$, $R_j \notin \Gamma$ is the set of all revoked users for sessions in and before $j$, and $W_j \subset \mathcal{U} \setminus G_j$ has minimum cardinality such that $W_j \cup R_j \in \bar{\Gamma}_0$. Here we ignore the communication overhead for the broadcast of user identities $U_i$ for $U_i \in W_j \cup R_j$, as
these identities can be picked from a small finite field. In particular, if our scheme is obtained from Shamir's $(t, n)$-threshold secret sharing scheme, which realizes the access structure defined by $\Gamma = \{A \subseteq \mathcal{U} : |A| \geq t\}$ by means of polynomial interpolation, then the communication bandwidth for key management is $(t + 1) \log q$ bits.

Computation overhead: The computation complexity is $2(t_j^2 + t_j)$, where $t_j = |W_j \cup R_j|$, $R_j \notin \Gamma$ is the set of all revoked users for sessions in and before $j$, and $W_j \subset \mathcal{U} \setminus G_j$ has minimum cardinality such that $W_j \cup R_j \in \bar{\Gamma}_0$. This is the number of multiplication operations needed to recover $\Phi(\mathrm{GM})$ by using equation (1). Considering Shamir's $(t, n)$-threshold secret sharing scheme, the computation cost for key management is $2(t^2 + t)$, which is essentially the number of multiplication operations needed to recover a $t$-degree polynomial by using Lagrange's interpolation formula.

3.2 Self-healing
We now explain our self-healing mechanism in the above constructions: Let $U_i$ be a group member that receives session key distribution messages $B_{j_1}$ and $B_{j_2}$ in sessions $j_1$ and $j_2$ respectively, where $1 \leq j_1 < j_2$, but not the session key distribution message $B_j$ for session $j$, where $j_1 < j < j_2$. User $U_i$ can still recover all the lost session keys $SK_j$ for $j_1 < j < j_2$ as follows:

(a) $U_i$ recovers from the broadcast message $B_{j_2}$ in session $j_2$ the backward key $K_{m-j_2+1}^B$, repeatedly applies the one-way function $H$ to it, and computes the backward keys $K_{m-j+1}^B$ for all $j$, $j_1 \leq j < j_2$.
(b) $U_i$ computes the forward keys $K_j^F$ for all $j$, $j_1 \leq j \leq j_2$, by repeatedly applying $H$ on the forward seed $S^F$ or on the forward key $K_{j_1}^F$ of the $j_1$-th session.
(c) $U_i$ then recovers all the session keys $SK_j = K_j^F + K_{m-j+1}^B$, for $j_1 \leq j \leq j_2$.

Note that a user revoked in session $j$ cannot compute the backward keys $K_{m-j_1+1}^B$ for $j_1 > j$, although it can compute the forward keys $K_{j_1}^F$. As a result, revoked users cannot compute the subsequent session keys $SK_{j_1}$ for $j_1 > j$, as desired. Similarly, a user $U_i$ who joined in session $j$ cannot compute the forward keys $K_{j_2}^F$ for $j_2 < j$, as $U_i$ knows only the $j$-th forward key $K_j^F$, not the initial forward seed value $S^F$, although it can compute the backward keys $K_{m-j_2+1}^B$ for $j_2 < j$. This prevents $U_i$ from computing the previous session keys, as desired.
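Steps (a) to (c) amount to walking two hash chains in opposite directions. The sketch below illustrates this under assumptions: the SHA-256 based stand-in for $H$, the prime `Q` and the function names are hypothetical, and `all_session_keys` plays the group manager only so that the recovered keys can be compared against the real ones.

```python
import hashlib

Q = 2**127 - 1  # a Mersenne prime standing in for q (assumption)

def H(x):
    """Stand-in one-way function on GF(q): SHA-256 reduced mod q (assumption)."""
    return int.from_bytes(hashlib.sha256(x.to_bytes(16, "big")).digest(), "big") % Q

def all_session_keys(s_f, s_b, m):
    """Group-manager view: SK_j = H^(j-1)(S^F) + K^B_{m-j+1} for j = 1..m."""
    fwd, bwd = [s_f], [s_b]
    for _ in range(m - 1):
        fwd.append(H(fwd[-1]))
        bwd.append(H(bwd[-1]))
    # fwd[j-1] = K^F_j and bwd[i-1] = K^B_i
    keys = {j: (fwd[j - 1] + bwd[m - j]) % Q for j in range(1, m + 1)}
    return keys, bwd

def heal(k_f_j1, k_b_j2, j1, j2):
    """User view: recover SK_j for j1 <= j <= j2 from K^F_{j1} and K^B_{m-j2+1}."""
    # (b) forward keys: K^F_j = H(K^F_{j-1}), walking up from session j1
    forward = {j1: k_f_j1}
    for j in range(j1 + 1, j2 + 1):
        forward[j] = H(forward[j - 1])
    # (a) backward keys: K^B_{m-j+1} = H(K^B_{m-(j+1)+1}), walking down from session j2
    backward = {j2: k_b_j2}
    for j in range(j2 - 1, j1 - 1, -1):
        backward[j] = H(backward[j + 1])
    # (c) session keys
    return {j: (forward[j] + backward[j]) % Q for j in range(j1, j2 + 1)}
```

A user who holds $K_{j_1}^F$ and unmasks $K_{m-j_2+1}^B$ from $B_{j_2}$ never needs the intermediate broadcasts; the backward chain supplies every key in between.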
4 Security Analysis
Theorem 4.1. Our construction is a secure, self-healing session key distribution scheme with privacy and $\mathcal{R}$-revocation capability with respect to Definition 2.4, and achieves $\mathcal{R}$-wise forward and backward secrecy with respect to Definition 2.5.
Proof: Our goal is security against coalitions of users from $\mathcal{R}$. We will show that our construction is computationally secure with respect to revoked users under the difficulty of inverting a one-way function, i.e. for any session $j$ it is computationally infeasible for any set of users from $\mathcal{R}$ revoked before or in session $j$ to compute with non-negligible probability the session key $SK_j$, given the View consisting of the personal keys of the revoked users, the broadcast messages before, in and after session $j$, and the session keys of the revoked users before session $j$.

Consider a coalition of revoked users from $\mathcal{R}$, say $R_j \in \mathcal{R}$, who are revoked in or before the $j$-th session. The revoked users are not entitled to know the $j$-th session key $SK_j$. We can model this coalition of users from $\mathcal{R}$ as a polynomial-time algorithm $A'$ that takes View as input and outputs its guess for $SK_j$. We say that $A'$ is successful in breaking the construction if it has a non-negligible advantage in determining the session key $SK_j$. Then, using $A'$, we can construct a polynomial-time algorithm $A$ for inverting the one-way function $H$ and have the following claim:

Claim. $A$ inverts the one-way function $H$ with non-negligible probability if $A'$ is successful.

Proof. Given any instance $y = H(x)$ of the one-way function $H$, $A$ first generates an instance View for $A'$ as follows: $A$ randomly selects a forward key seed $S^F \in \mathrm{GF}(q)$ and constructs the following backward key chain by repeatedly applying $H$ on $y$:

\[ K_1^B = y, \; K_2^B = H(y), \; \ldots, \; K_j^B = H^{j-1}(y), \; \ldots, \; K_m^B = H^{m-1}(y). \]

$A$ computes the $j$-th forward key $K_j^F = H^{j-1}(S^F)$ and sets the $j$-th session key $SK_j = K_j^F + K_{m-j+1}^B$. $A$ chooses at random $m$ vectors $v_1, \ldots, v_m \in \mathrm{GF}(q)^l$. Each user $U_i \in \mathcal{U}$ receives its personal secret keys corresponding to the $m$ sessions, $S_i = (v_1 \cdot \Phi(U_i), \ldots, v_m \cdot \Phi(U_i)) \in \mathrm{GF}(q)^m$, and the forward key seed $S^F$ from $A$ via the secure communication channel between them. In this setting, $\Gamma = 2^{\mathcal{U}} \setminus \mathcal{R}$ is a monotone increasing access structure of authorized users over $\mathcal{U}$. $\Gamma$ is determined by the family of minimal qualified subsets, $\Gamma_0$, which is called the basis of $\Gamma$. Now $R_j \in \mathcal{R}$ implies $R_j \notin \Gamma$. Let $G_j$ be the set of all non-revoked users in session $j$. At the $j$-th session, $A$ chooses a subset of users $W_j \subset \mathcal{U} \setminus G_j$ with minimal cardinality such that $W_j \cup R_j \in \bar{\Gamma}_0$. $A$ then computes the broadcast message $B_j$ for $j = 1, \ldots, m$ as

\[ B_j = \{(U_k, v_j \cdot \Phi(U_k)) : U_k \in W_j \cup R_j\} \cup \{Z_j\}, \]

where $Z_j = K_{m-j+1}^B + v_j \cdot \Phi(\mathrm{GM})$. Then $A$ sets View as

\[ \mathrm{View} = \left\{ \begin{array}{l} v_s \cdot \Phi(U_k) \text{ for all } U_k \in R_j \text{ and } s = 1, \ldots, m; \\ B_j \text{ for } j = 1, \ldots, m; \\ S^F; \\ SK_1, \ldots, SK_{j-1} \end{array} \right\} \]
$A$ gives View to $A'$, which in turn selects $X \in \mathrm{GF}(q)$ randomly, sets the $j$-th session key to be $SK'_j = K_j^F + X$ and returns $SK'_j$ to $A$. $A$ checks whether $SK'_j = SK_j$. If not, $A$ chooses a random $x' \in \mathrm{GF}(q)$ and outputs $x'$. $A'$ can compute the $j$-th forward key $K_j^F = H^{j-1}(S^F)$, as it knows $S^F$ from View, for $j = 1, \ldots, m$. Note that from View, $A'$ knows $\{v_j \cdot \Phi(U_k) : U_k \in W_j \cup R_j, 1 \leq j \leq m\} \cup \{v_s \cdot \Phi(U_k) : U_k \in R_j, 1 \leq s \leq m\}$ and at most $j - 1$ session keys $SK_1, \ldots, SK_{j-1}$. Consequently $A'$ has knowledge of at most $j - 1$ backward keys $K_m^B, \ldots, K_{m-j+2}^B$. Observe that $SK'_j = SK_j$ provided $A'$ knows the backward key $K_{m-j+1}^B$. This occurs if either of the following two holds:

(a) $A'$ is able to compute $v_j \cdot \Phi(\mathrm{GM})$ from View and consequently can recover the backward key $K_{m-j+1}^B$ as $K_{m-j+1}^B = Z_j - v_j \cdot \Phi(\mathrm{GM})$. From View, $A'$ knows $\{v_j \cdot \Phi(U_k) : U_k \in W_j \cup R_j, 1 \leq j \leq m\} \cup \{v_s \cdot \Phi(U_k) : U_k \in R_j, 1 \leq s \leq m\}$, where $W_j \subset \mathcal{U} \setminus G_j$ has minimal cardinality with $W_j \cup R_j \in \bar{\Gamma}_0$, and will not be able to compute $v_j \cdot \Phi(\mathrm{GM})$ by the property of $\Phi$. Observe that $v_j \cdot \Phi(\mathrm{GM})$ is a linear combination of $\{v_j \cdot \Phi(U_k) : U_k \in B\}$ if and only if $B \in \Gamma$. Consequently, $A'$ will not be able to recover $K_{m-j+1}^B$ from $B_j$ in this way.
(b) $A'$ is able to choose $X \in \mathrm{GF}(q)$ so that the following relations hold: $K_m^B = H^{j-1}(X)$, $K_{m-1}^B = H^{j-2}(X)$, $\ldots$, $K_{m-j+2}^B = H(X)$. This occurs with a non-negligible probability only if $A'$ is able to invert the one-way function $H$. In that case, $A$ returns $x = H^{-1}(y)$.

The above arguments show that if $A'$ is successful in breaking the security of our construction, then $A$ is able to invert the one-way function. (of claim)

Hence our construction is computationally secure under the hardness of inverting a one-way function. We will now show that our construction satisfies all the conditions required by Definition 2.4.

1) (a) How a non-revoked user $U_i$ efficiently recovers the session key is described in the third step of our construction.
(b) For any set $R_j \subseteq \mathcal{U}$, $R_j \in \mathcal{R}$, and any non-revoked user $U_i \notin R_j$, we show that the coalition $R_j$ knows nothing about the personal secret $S_i = (v_1 \cdot \Phi(U_i), \ldots, v_j \cdot \Phi(U_i), \ldots, v_m \cdot \Phi(U_i))$ of $U_i$. For any session $j$, $U_i$ uses $v_j \cdot \Phi(U_i)$ as its personal secret. Since the coalition $R_j \notin \Gamma$, the values $\{v_s \cdot \Phi(U_k) : U_k \in R_j, 1 \leq s \leq m\}$ are not enough to compute $v_j \cdot \Phi(U_i)$ by the property of $\Phi$. So it is computationally infeasible for the coalition $R_j$ to learn $v_j \cdot \Phi(U_i)$ for $U_i \notin R_j$.
(c) The $j$-th session key is $SK_j = K_j^F + K_{m-j+1}^B$, where $K_j^F = H(K_{j-1}^F) = H^{j-1}(S^F)$, $K_j^B = H(K_{j-1}^B) = H^{j-1}(S^B)$, $S^F$ is the forward seed value given to all initial group members and $S^B$ is the secret backward seed value. Thus $SK_j$ is independent of the personal secrets $S_1, \ldots, S_n$, where $S_i = (v_1 \cdot \Phi(U_i), \ldots, v_j \cdot \Phi(U_i), \ldots, v_m \cdot \Phi(U_i))$ for $i = 1, \ldots, n$. So the personal secret keys alone do not give any information about any session key. Since the initial backward seed $S^B$ is chosen randomly, the backward key
$K_{m-j+1}^B$ and consequently the session key $SK_j$ is random as long as $S^B, K_1^B, K_2^B, \ldots, K_{m-j+2}^B$ are not revealed. This in turn implies that the broadcast messages alone cannot leak any information about the session keys. So it is computationally infeasible to determine $Z_{i,j}$ from only the personal key $S_i$ or the broadcast message $B_j$.

2) ($\mathcal{R}$-revocation property) Let $R_j \subseteq \mathcal{U}$, where $R_j \in \mathcal{R}$, collude in session $j$. It is impossible for the coalition $R_j$ to learn the $j$-th session key $SK_j$, because the knowledge of $SK_j$ implies the knowledge of either the backward key $K_{m-j+1}^B$ or the personal secret $v_j \cdot \Phi(U_i)$ of a user $U_i \notin R_j$. The coalition $R_j$ knows the set $\{v_s \cdot \Phi(U_k) : U_k \in R_j, 1 \leq s \leq m\}$, which is not enough to compute $v_j \cdot \Phi(U_i)$ by the property of $\Phi$. Hence the coalition $R_j$ cannot recover $v_j \cdot \Phi(U_i)$, which in turn makes $K_{m-j+1}^B$ appear random to all users in $R_j$. Therefore, $SK_j$ is completely safe from $R_j$ from a computational point of view.

3) (Self-healing property) From the third step of our construction, any user $U_i$ that is a member in sessions $j_1$ and $j_2$ ($1 \leq j_1 < j_2$) can recover the backward key $K_{m-j_2+1}^B$ and hence can obtain the sequence of backward keys $K_{m-j_1}^B, \ldots, K_{m-j_2+2}^B$ by repeatedly applying $H$ on $K_{m-j_2+1}^B$. User $U_i$ also holds the forward key $K_{j_1}^F = H^{j_1-1}(S^F)$ of the $j_1$-th session and hence can obtain the sequence of forward keys $K_{j_1+1}^F, \ldots, K_{j_2-1}^F$ by repeatedly applying $H$ on $K_{j_1}^F$. Hence, as shown in Section 3.2, user $U_i$ can efficiently recover all missed session keys.

We will now show that our construction satisfies all the conditions required by Definition 2.5.

1) ($\mathcal{R}$-wise forward secrecy) Let $R_j \subseteq \mathcal{U}$, where $R_j \in \mathcal{R}$ and all users $U_s \in R_j$ are revoked before the current session $j$. The coalition $R_j$ cannot get any information about the current session key $SK_j$, even with the knowledge of the group keys before session $j$. This is because, in order to know $SK_j$, any user $U_s \in R_j$ needs to know either $v_j \cdot \Phi(\mathrm{GM})$ or $K_{m-j+1}^B$. Determining $v_j \cdot \Phi(\mathrm{GM})$ requires knowledge of the values $\{v_j \cdot \Phi(U_k) : U_k \in B\}$ for some $B \in \Gamma$. But the coalition $R_j$ knows only the values $\{v_j \cdot \Phi(U_s) : U_s \in R_j\}$, which is insufficient as $R_j \notin \Gamma$. Hence $R_j$ is unable to compute $SK_j$. Besides, because of the one-way property of $H$, it is computationally infeasible to compute $K_{j_1}^B$ from $K_{j_2}^B$ for $j_1 < j_2$. The users in $R_j$ might know the sequence of backward keys $K_m^B, \ldots, K_{m-j+2}^B$, but cannot compute $K_{m-j+1}^B$ and consequently $SK_j$ from this sequence. Hence our construction is $\mathcal{R}$-wise forward secure.

2) ($\mathcal{R}$-wise backward secrecy) Let $J_j \subseteq \mathcal{U}$, where $J_j \in \mathcal{R}$ and all users $U_s \in J_j$ join after the current session $j$. The coalition $J_j$ cannot get any information about any previous session key $SK_{j_1}$ for $j_1 \leq j$, even with the knowledge of the group keys after session $j$. This is because, in order to know $SK_{j_1}$, any user $U_s \in J_j$ requires the knowledge of the $j_1$-th forward key $K_{j_1}^F = H(K_{j_1-1}^F) = H^{j_1-1}(S^F)$. Now when a new member $U_v$ joins the group starting from session $j + 1$, the GM gives the $(j + 1)$-th forward key $K_{j+1}^F$
instead of the initial forward key seed $S^F$, together with the values $S_v = (v_j \cdot \Phi(U_v), \ldots, v_m \cdot \Phi(U_v)) \in \mathrm{GF}(q)^{m-j+1}$. Note that $K_{j+1}^F = H(K_j^F)$. Hence it is computationally infeasible for the newly joined member to trace back the previous forward keys $K_{j_1}^F$ for $j_1 \leq j$ because of the one-way property of the function $H$. Consequently, our protocol is $\mathcal{R}$-wise backward secure. In fact, this backward secrecy is independent of $\mathcal{R}$.
5 Performance Analysis
The existing work dealing with self-healing key distribution using a monotone decreasing family of revoked subsets of users, instead of a monotone decreasing threshold structure, is by Saez [6]. In this section, we compare the storage overhead, communication complexity and computation cost of each user (not the GM) in our construction with [6]. In contrast to the family of self-healing key distribution schemes proposed in [6], our general construction uses a different self-healing approach which is more efficient in terms of computation and communication without any further trade-off in storage, yielding a more flexible self-healing key distribution scheme that can provide better properties. Unlike [6], the length of the broadcast message in our scheme does not depend on the history of revoked subsets of users to perform self-healing. This feature provides a significant reduction in the communication cost, which is one of the main improvements of our scheme over the previous works.

For simplicity, we compare a special case of our construction with the other similar schemes considering Shamir's $(t, n)$-threshold secret sharing. On the one hand, our construction reduces the communication complexity (bandwidth) to $O(t)$, whereas the optimal communication complexity achieved by the previous schemes [1, 6, 7] is $O(tj)$ at the $j$-th session. On the other hand, it also achieves a lower computation cost. For a user $U_i$ at the $j$-th session, the computation cost is incurred by recovering all previous session keys up to the $j$-th session (worst case) by the self-healing mechanism. The backward key used at the $j$-th session in our construction is $K_{m-j+1}^B = Z_j - v_j \cdot \Phi(\mathrm{GM})$. Thus the computation complexity for each user is $2\{(t + 1)^2 - (t + 1)\} = 2(t^2 + t)$, which is the number of multiplication operations needed to recover a $t$-degree polynomial by using the Lagrange formula. After obtaining $K_{m-j+1}^B$, user $U_i$ can easily compute $K_{m-j+2}^B, K_{m-j+3}^B, \ldots, K_{m-1}^B, K_m^B$ by applying the one-way function $H$ each time. Then $U_i$ is able to compute all previous session keys $SK_{j_1} = K_{j_1}^F + K_{m-j_1+1}^B$ for all $1 \leq j_1 \leq j$. Thus the communication complexity and computation cost in this special construction do not increase as the number of sessions grows. These are the most prominent improvements of our scheme over the previous secret sharing based self-healing key distributions [1, 6, 7]. The storage requirement in our scheme comes from the Setup phase and from the data kept after receiving the session key distribution message. The storage overhead of each user for the personal key is $O((m - j + 1) \log q)$, which is the same as that of [1, 6, 7].

If we consider a secret sharing scheme realizing a specific bipartite access structure defined in the set of users, the previous self-healing mechanisms allow
to improve the efficiency of revocations of a small number of users, say fewer than j, for some positive integer j ≤ t − 1, where t is the threshold on the number of revoked users. This is because, in all the previous self-healing key distribution schemes, a part of the broadcast message of every session contains a history of revoked subsets of users in order to perform self-healing. In all those schemes, this part of the broadcast message carries an amount of information proportional to t − 1, even when only two or three users must be revoked. We avoid this overhead on the broadcast message length in our general construction, since our self-healing mechanism does not need to send any such history.
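The recovery steps described in this section can be illustrated with a small sketch. It is not the construction itself: the field size `Q`, the SHA-256 based stand-in for the one-way function H, and all function names are assumptions made for the example, and the interpolation simply recovers the value at zero of a degree-t polynomial, as in Shamir's scheme.

```python
import hashlib

Q = 2**127 - 1  # illustrative prime field size q (assumption, not from the paper)


def H(x):
    """Stand-in one-way function: SHA-256 of x, reduced modulo Q (assumption)."""
    digest = hashlib.sha256(x.to_bytes(16, "big")).digest()
    return int.from_bytes(digest, "big") % Q


def lagrange_at_zero(points, q=Q):
    """Recover f(0) from t+1 points (x_i, y_i) of a degree-t polynomial over F_q."""
    result = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for k, (xk, _) in enumerate(points):
            if k != i:
                num = num * (-xk) % q       # one multiplication per pair (i, k)
                den = den * (xi - xk) % q   # and a second one per pair
        result = (result + yi * num * pow(den, -1, q)) % q
    return result


def backward_keys(first_key, first_index, m):
    """From K^B_{first_index}, derive K^B_{first_index+1}, ..., K^B_m by hashing."""
    keys = {first_index: first_key}
    for index in range(first_index + 1, m + 1):
        keys[index] = H(keys[index - 1])
    return keys


def session_keys(forward, backward, m, j, q=Q):
    """SK_{j1} = K^F_{j1} + K^B_{m-j1+1} (mod q) for 1 <= j1 <= j."""
    return {j1: (forward[j1] + backward[m - j1 + 1]) % q for j1 in range(1, j + 1)}
```

The inner loop of `lagrange_at_zero` performs two multiplications for each of the (t + 1)t ordered pairs of points, which is where a count of 2(t² + t) comes from; the final combination and modular inversions are extra work not included in that count.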
6 Conclusion
This paper presents an efficient, computationally secure, generalized self-healing key distribution scheme with revocation capability, enabling a very large and dynamic group of users to establish a common key for secure communication over an insecure wireless network. Our proposed key distribution mechanism reduces communication and computation costs compared with the previous approaches, without any increase in storage complexity, and is scalable to very large groups in highly mobile, volatile, and hostile wireless networks. Our scheme is analyzed in an appropriate security model to prove that it is computationally secure and achieves both forward secrecy and backward secrecy.
References
[1] Blundo, C., D’Arco, P., Santis, A., Listo, M.: Design of Self-healing Key Distribution Schemes. Designs, Codes and Cryptography 32, 15–44 (2004)
[2] Brickell, E.F.: Some Ideal Secret Sharing Schemes. Journal of Combinatorial Mathematics and Combinatorial Computing 9, 105–113 (1983)
[3] Dutta, R., Mukhopadhyay, S.: Improved Self-Healing Key Distribution with Revocation in Wireless Sensor Network. In: Proceedings of WCNC 2007 - Networking, IEEE Computer Society Press, Los Alamitos (2007)
[4] Hong, D., Kang, J.: An Efficient Key Distribution Scheme with Self-healing Property. IEEE Communication Letters 2005 9, 759–761 (2005)
[5] Liu, D., Ning, P., Sun, K.: Efficient Self-healing Key Distribution with Revocation Capability. In: Proceedings of the 10th ACM CCS 2003, pp. 27–31 (2003)
[6] Saez, G.: On Threshold Self-healing Key Distribution Schemes. In: Smart, N.P. (ed.) Cryptography and Coding 2005. LNCS, vol. 3796, pp. 340–354. Springer, Heidelberg (2005)
[7] Staddon, J., Miner, S., Franklin, M., Balfanz, D., Malkin, M., Dean, D.: Self-healing key distribution with Revocation. In: Proceedings of IEEE Symposium on Security and Privacy 2002, pp. 224–240 (2002)
A Model for Covert Botnet Communication in a Private Subnet
Brandon Shirley and Chad D. Mano
Department of Computer Science, Utah State University, Logan, UT 84322, USA
{brandon.shirley,chad.mano}@usu.edu
Abstract. Recently, botnets utilizing peer-to-peer style communication infrastructures have been discovered, requiring new approaches to detection and monitoring techniques. Current detection methods analyze network communication patterns, identifying systems that may have been recruited into the botnet. This paper presents a localized botnet communication model that enables a portion of compromised systems to hide from such detection techniques unless the defender accepts a potentially significant increase in network monitoring points. By organizing bot systems at the subnet level, the amount of communication with the outside network is greatly reduced, requiring switch-level monitoring to identify infected systems.
1 Introduction
One of the most potent threats in the Internet society is the botnet [15, 4, 9]. A botnet is a collection of compromised computer systems throughout the world which are under the control of a single entity, the “botmaster.” The distributed nature of a botnet may enable an attacker to gather larger amounts of confidential personal information over a shorter period, or execute a denial-of-service attack in a quick and overpowering manner. Botnets have traditionally taken a centralized approach to the management of the compromised bot systems. Recently, a powerful botnet that utilizes peer-to-peer style communication, Storm Worm, has been the cause of various incidents of malicious activity. Tracking the origin and activity of this botnet is much more difficult due to the Peer-to-Peer (P2P) communication infrastructure.
2 Related Work
Traditionally, botnets have relied on a centralized command and control (C&C) communication infrastructure utilizing IRC servers as a means to manage the remote bot systems [12, 7, 3]. A security administrator is able to monitor these systems by identifying the location of the C&C IRC server and logging in posing as a compromised bot system [1, 2]. Depending on the configuration of the server, various characteristics of the botnet can be identified, including population and command instructions. Honeypots are systems deployed by security administrators that act as infection targets and allow administrators to detect botnet activity [4, 10, 18]. Various techniques for evading detection exist, which hide or masquerade botnet communication activity [16, 6, 9]. More recently, botnets utilizing a peer-to-peer (P2P) [17, 13] style communication infrastructure have been proposed [16, 7] and even discovered in the wild [11]. It is very difficult to ascertain the characteristics of a P2P botnet because there is no centralized location, where all infected bots connect, that could be monitored. To the best of the authors’ knowledge, BotHunter [8] is the most effective tool for detecting the existence of a bot within a local network, including bots utilizing P2P style communication. The following section discusses BotHunter in more detail and presents our approach to evading detection by BotHunter.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 624–632, 2008. © IFIP International Federation for Information Processing 2008
3 Evading BotHunter Detection
BotHunter relies on the ability to identify an infection dialog in order to detect the existence of a bot within a managed network. This paper proposes a communication method that negates the effective detection techniques of traffic monitoring systems by minimizing the amount, and controlling the type, of traffic that passes through a network monitor node, while at the same time enabling infected systems to receive all appropriate commands and instructions from the botmaster. BotHunter is centered on the ability to detect five distinct events:
– E1: External to Internal Inbound Scans
– E2: External to Internal Inbound Exploits
– E3: Internal to External Binary Acquisition
– E4: Internal to External C&C Communication
– E5: Internal to External Outbound Infection Scanning
In order to declare a bot infection, an E2 event and at least one E3-E5 event, or two E3-E5 events, must be observed. Thus, to evade detection a bot must be able to actively participate in the global botnet while eliminating events which may lead to its detection. As such, we propose the following alternatives to the previously described events that accomplish the same purpose, but in a different manner.
– A2: Internal to Internal Exploits
– A3: Internal to Internal Binary Acquisition
– A4: Internal to Internal C&C Communication
A key factor in IDS techniques is that event correlation is tied to a single interior host. As the following section shows, infected systems can coordinate with each other to share the burden of communication with the external botnet parties, thus eliminating the bond between a single system and all external communication events.
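From the defender's side, the declaration rule discussed in this section can be written as a small predicate. This is a minimal sketch, assuming the rule is read as "one E2 event together with at least one of E3 to E5, or at least two distinct events among E3 to E5"; the function name and the string labels are illustrative and are not taken from BotHunter itself.

```python
OUTBOUND = {"E3", "E4", "E5"}  # internal-to-external dialog events


def declares_infection(observed):
    """Return True if the event labels observed for one host satisfy the rule.

    `observed` is any iterable of labels such as "E2" or "A3" (hypothetical
    encoding). Labels outside E2-E5, including the internal-only A2-A4
    events, do not contribute to the decision.
    """
    seen = set(observed)
    outbound = seen & OUTBOUND
    return ("E2" in seen and len(outbound) >= 1) or len(outbound) >= 2
```

For example, `declares_infection(["E2", "E4"])` holds, while a host that shows only A2, A3, A4 and a single E4 event does not satisfy the rule, which is why correlation tied to a single interior host is the assumption the rest of the paper targets.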
4 Local Network “Sub-Botnet”
This section describes the process of creating and managing a local area network “sub-botnet.” The creation process results in a coordinated group of bots in a switched portion of a network. The following are important assumptions about the network environment. Note that these assumptions may not be true of all networks, but we believe they are common in typical enterprise networks.
– Communication within a switched network does not penetrate the router, other than defined protocols such as DHCP proxy.
– IDS-type monitors do not process information that does not move from the internal to external interface, or vice-versa, of the router.
– Switches utilized as network monitoring points do not have (or are not used for) IDS capabilities (typically used for monitoring bandwidth).
4.1 Initial Infection
A system can become infected by a malicious bot program in a number of ways, each ending with the same result of the system becoming part of a botnet. Initially, at least one system in the network must become infected and obtain the botnet binary in some way. We assume that this initial infection avoids detection by some means such as proposed in [8]. Once the initial bot has been created, the organization of the sub-botnet can begin. The first order of business for the initial bot is to compromise local network systems, minimizing the chance of external E2 events. First, the initial bot checks its host system for vulnerabilities and assumes other local systems have the same vulnerability. Thus, the initial bot can attempt targeted exploits at other systems, without using noisy port scanning activities. Next, the initial bot monitors network traffic to identify the IP and MAC addresses of systems within the locally switched network. DHCP requests, ARP messages, and other broadcast traffic enable the initial bot to identify many of these systems without using port-scanning activities. After determining probable valid exploits and potential targets, the initial bot can proceed to create a sub-botnet by attempting to compromise other systems, an A2 event. When a system is compromised, vulnerabilities can be patched, preventing future E2 events.
4.2 New E2 Discovery
While A2 infections are preferable to E2 infections, it is not feasible to infect all possible systems with A2 events alone. When an E2 infection occurs, the newly infected system can better protect itself from being detected if it avoids unnecessary E3-E5 events. To do so, the newly infected system must discover whether another system in the local network is part of the botnet. If such a bot exists, the newly infected system can obtain the bot binary and other communication from systems within the local network. The newly infected bot relies on bots already infected to provide the necessary information.
While the initial bot monitors the network to identify potential victim systems, it also seeks to identify newly infected systems. This can be accomplished in a number of ways, including identifying a pre-determined pattern of DHCP Request messages [5]. This is enabled by an error prevention mechanism in the DHCP protocol. When a system issues a DHCP request it can repeat the request approximately four seconds later [5] if a response is not received. As this part of the protocol does not occur commonly, it can act as an “announcement” of a newly infected system. Upon observing this duplicate request sequence, the initial bot identifies the MAC address of the source machine and can direct a message to the newly infected system “welcoming” it to the botnet.
5 Sub-Botnet Management
As mentioned in Section 3, there are certain event dialogs which can be detected as a bot infection by BotHunter. In order to avoid detection, the events that each bot may initiate must be controlled. To illustrate this point, let (A2 ∨ A3 ∨ A4) ≡ α, (E3 ∧ E4) ≡ β, and (E2 ∧ (E3 ∨ E4)) ≡ γ. It follows that α ∧ ¬β ∧ ¬γ ≡ α ∧ ¬(β ∨ γ) ≡ δ, where δ is an undetectable state, and that β ∨ γ ≡ η, where η is a detectable state. Therefore, it is necessary to control bot actions such that only δ occurs within a given timespan, as BotHunter or any IDS Correlator will eventually have to prune old events from each internal host’s infection dialog.
The core goal of the management system is to limit the botnet’s monitorable exposure to a single bot at any given time. This offers the potential to avoid detection by a perimeter-style IDS Correlator by exceeding the correlator’s pruning threshold or by event sharing. Since the IDS Correlator must eventually prune old events, a round-robin type duty passing algorithm has the potential to allow external botnet communication without any of the bots ever being detected. To this end the sub-botnet will make use of a Token Bot (TB), which will perform the internal to external actions, and an internal report peer list to share traceable events amongst all the bots in the subnet.
In the pursuit of detection avoidance, the Sub-Botnet management scheme must ensure that each bot only performs one internal to external (E3-E4) event within a given timespan. Further, it is desirable that each bot performs the same E3 or E4 event each time it initiates said event and that only one bot is engaged in such an event at any given time. In order to ensure this scenario the sub-botnet institutes a token passing scheme, where only the TB may initiate internal to external bot communication. Assuming there are two or more bots present in the sub-botnet, a token passing and resource sharing framework can be used.
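The role that pruning plays in this argument can be seen from the correlator's side. The sketch below is a hypothetical per-host sliding-window correlator and not BotHunter's actual implementation; the class name, the `window` parameter and the event labels are assumptions, and the declaration rule is assumed to be "E2 plus one of E3 to E5, or two distinct events among E3 to E5".

```python
from collections import defaultdict

OUTBOUND = {"E3", "E4", "E5"}


class WindowCorrelator:
    """Keeps, for each host, only the events seen within the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.dialogs = defaultdict(list)  # host -> [(timestamp, label), ...]

    def observe(self, host, timestamp, label):
        """Record one event, prune old ones, and report whether the host is flagged."""
        dialog = self.dialogs[host]
        dialog.append((timestamp, label))
        # Prune events that are older than the window, relative to the newest event.
        dialog[:] = [(t, l) for t, l in dialog if timestamp - t <= self.window]
        labels = {l for _, l in dialog}
        outbound = labels & OUTBOUND
        return ("E2" in labels and bool(outbound)) or len(outbound) >= 2
```

With a 60-second window, an E3 followed 30 seconds later by an E4 on the same host is flagged, whereas the same two events 100 seconds apart, or on two different hosts, are not. A defender can lengthen the window at the cost of more state per host.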
The basic token passing process is as follows: 1) Token Acquisition (TAQ), 2) perform Token Action (TAC) known as an E3 or E4 event, 3) issue Token Report Request Broadcast (TRRB) and compile token, 4) start Token Action Result Propagation (TARP) qualifying as an A3 or A4 event, and 5) the Token Pass. If there are any issues which prevent
the TB from performing all these actions, then a Token Election (TE) will take place to recover the token. The TB will also acknowledge new bots infected via E2 infection and send them the bot binary. As far as A2 events are concerned, the bot that initiated the A2 event will be charged with handling the A3 event for the newly infected bot. Full details of the token passing model are available in [14].
5.1 Token Acquisition (TAQ)
In this phase a bot will have the token passed to it through some variation of what will be referred to as a Token Pass (TP) process. The TP may have occurred explicitly due to TB failure or implicitly as part of the normal TP procedure. The token includes the following:
– Report list - internal only peer list
– Action History - the last two internal to external actions performed (E3 or E4 events)
– Timestamp - indicative of when token was compiled
– Bot Binary Version - denotes the current binary version
The Report list is the primary tool used for token passing and token election should token passing fail. The TAC for a given bot is implied by the information in the Report list. The new TB is listed in the Report list along with the action to be performed. Ideally this will be the same action it performed last time, or it may be a different action if bot action restructuring has occurred. Either way, the action listed should match with the action missing from the two-entry history list.
5.2 Perform Token Action (TAC)
The TB will perform the action designated by the Report list: either a command update request, a peerlist update, or a binary update check. The peerlist update will always be performed, but the other two actions may result in no update. The peerlist update will be largely handled by an external peer on the TB’s peerlist. Assuming the TB finds a responsive peer, the TB will have that peer get an updated peerlist and send it back so the TB can propagate the list to the subnet. The command update request and binary update check will essentially work the same for the TB as they would for an external peer, except that the TB must propagate the result of its action to the subnet.
5.3 Token Report Request Broadcast (TRRB)
Upon completion of the TAC the TB will issue a TRRB to which any viable local bot will reply, as shown in Figure 1. Each bot will reply with its last performed action, a timestamp indicative of when that action was performed, and its bot binary version. The TB will compile and update the action history and report lists, and associate a timestamp with the update.
Fig. 1. Token bot issuing TRRB, and the subsequent responses from the currently active bots in its subnet
5.4 Token Action Result Propagation (TARP)
Assuming a successful TRRB, the TB will have an updated report list which includes all currently active bots in its subnet. The TB will use the Report list as the means for passing the update information to the sub-botnet. The TB will send the Report list, the update, the current Bot Binary version and the action history to a bot in the report list. In essence it is passing a copy of the token, as shown in Figure 2.
5.5 Token Pass
The TB will use the information in the token to decide which bot will receive the token next. It will consult the action history in order to see which action needs to be performed next. At this time the TB will scan the report list looking for bots whose last action matches the next necessary action. Once the TB has a list of these bots it will select the bot with the oldest timestamp and pass the token to that bot.
5.6 Token Election (TE)
The TE request is a catch-all method for handling any token action failure. The TE request will generally occur after some set waiting period has expired without a response from the TB; therefore, it is possible that multiple bots will request a TE within the same timespan. A bot that issues the TE request will wait for responses for a set time period, and if no other TE requests are received it issues a broadcast that the TE was successful. The bot that issues the TE success broadcast will use the history list and the most up-to-date report list that it received to pick a new TB. The method for choosing a new TB will be essentially the same as the method used for passing the token. Once the bot finds a suitable responsive bot it will pass the token to that bot. At this point the new bot will make the TAQ broadcast and the sub-botnet may resume normal activity. Details on handling simultaneous TE requests are presented in [14].
Fig. 2. Token bot passing a “copy” of the token as part of TARP
6 Analysis
While the proposed system is able to effectively evade detection from traditional IDS and dialog-based botnet detection systems, it is not immune from other defense mechanisms. An increased level of monitoring and traffic analysis would enable this system to be detected. As with traditional worm, virus, and botnet threats, our proposed system has defined “signatures” that allow it to be detected. An important difference between this model and others, however, is that the network traffic signatures do not pass through network devices traditionally used for identifying malicious network activity.
A2 Events: A2 infection events target specific hosts and thus do not expose the initial bot the way port scanning activities can. This activity is identical to an E2 event except that it originates internally to the network. Thus, the signature has only changed in the source of the data.
Announcement Message: The announcement message generated by a newly infected system must be a valid broadcast message, but must be a custom message or a relatively uncommon one such that the TB does not respond to non-infected systems regularly. In either case, the broadcast announcement message, once identified, is a signature of the botnet.
Token Passing/Management: In many locally switched networks, such as a university computer lab, it is not common for individual systems to communicate directly with each other. The presence of even small data flows between systems in the switched network may be a sign of questionable activity.
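As an illustration of how the announcement signature might be checked at a monitoring point, the sketch below scans a log of observed DHCP requests for a client that repeats a request after roughly four seconds. The log format, the function name, and the tolerance value are assumptions made for this example; a real monitor would also need to rule out ordinary retransmissions caused by a slow or absent DHCP server.

```python
def find_repeated_requests(log, gap=4.0, tolerance=1.0):
    """Return MAC addresses that sent two DHCP requests about `gap` seconds apart.

    `log` is a list of (timestamp_seconds, mac) tuples in any order
    (hypothetical format). Each address is reported at most once.
    """
    last_seen = {}
    flagged = []
    for timestamp, mac in sorted(log):
        previous = last_seen.get(mac)
        if previous is not None and abs((timestamp - previous) - gap) <= tolerance:
            if mac not in flagged:
                flagged.append(mac)
        last_seen[mac] = timestamp
    return flagged
```

Only consecutive requests from the same address are compared, so a client that renews its lease much later is not reported.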
Detection Requirements
As has been shown, a covert switch-based communication model has a number of unique signatures that can be used by a detection mechanism. Thus, by incorporating a sufficient set of monitoring nodes throughout the network, sub-botnet activity can be detected. In essence, the level of protection is directly correlated to the level of monitoring. In order to be able to identify all of the covert communication methods proposed in this paper, it becomes necessary to implement IDS-capable monitoring at all switch devices in a network. Incorporating this level of monitoring may pose a problem for large networks due to the sheer amount of data that might be generated. However, as monitoring at the switch level seeks to identify very specific types of “special case” attacks, the analysis of the network traffic can likely be kept simple.
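One of the simple switch-level checks suggested by this analysis is to list flows whose two endpoints both lie inside the same switched subnet. The sketch below is only an illustration under assumed inputs: flows are given as (source, destination) address pairs, and `infrastructure` is a hypothetical allow-list of hosts, such as the gateway or a DHCP server, with which internal traffic is expected.

```python
import ipaddress


def internal_peer_flows(flows, subnet, infrastructure=()):
    """Return the flows in which both endpoints are ordinary hosts of `subnet`."""
    network = ipaddress.ip_network(subnet)
    allowed = {ipaddress.ip_address(address) for address in infrastructure}
    suspicious = []
    for src, dst in flows:
        s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
        if s in network and d in network and s not in allowed and d not in allowed:
            suspicious.append((src, dst))
    return suspicious
```

Whether such host-to-host flows are unusual depends on the network; in an environment with file sharing between workstations this check alone would produce many false positives.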
7 Summary and Future Work
This paper presented a potential communication model for covert botnet communication. This model enables a group of systems within a switched subnet to participate in a global botnet infrastructure without generating communication patterns that would allow an external monitoring system to identify the compromised hosts. While the proposed system can be detected, it requires an in-depth monitoring infrastructure above what is typical in many enterprise networks. The future work of this project is aimed at creating an efficient IDS-type switch-based internal botnet communication monitoring system. The system will potentially act as a remote data gathering node for BotHunter to be able to correlate internal traffic patterns with internal/external patterns. It is hypothesized that such a system will minimize the logging and computational requirements on an individual switch monitor and improve the overall accuracy of the system.
References
1. Barford, P., Yegneswaran, V.: An inside look at botnets. In: Special Workshop on Malware Detection, Advances in Information Security. Springer, Heidelberg (2006)
2. Binkley, J.R., Singh, S.: An algorithm for anomaly-based botnet detection. In: Proceedings of USENIX Steps to Reducing Unwanted Traffic on the Internet Workshop (SRUTI), July 2006, pp. 43–48 (2006)
3. Canavan, J.: The evolution of malicious IRC bots. In: Proceeding of Virus Bulletin Conference 2005 (October 2005)
4. Cooke, E., Jahanian, F., McPherson, D.: The zombie roundup: Understanding, detecting, and disrupting botnets. In: Proceedings of USENIX Steps to Reducing Unwanted Traffic on the Internet Workshop (SRUTI), July 2005, pp. 39–44 (2005)
5. Droms, R.: Dynamic Host Configuration Protocol. Internet Engineering Task Force: RFC 2131 (March 1997)
6. Geer, D.: Malicious bots threaten network security. IEEE Computer 38(1), 18–20 (2005)
7. Grizzard, J.B., Sharma, V., Nunnery, C., Kang, B.B., Dagon, D.: Peer-to-peer botnets: Overview and case study. In: First Workshop on Hot Topics in Understanding Botnets (2007)
8. Gu, G., Porras, P., Yegneswaran, V., Fong, M., Lee, W.: Bothunter: Detecting malware infection through ids-driven dialog correlation. In: Proceedings of the 16th USENIX Security Symposium (Security 2007) (August 2007)
9. Ianelli, N., Hackworth, A.: Botnets as a vehicle for online crime, CERT Coordination Center (December 2007)
10. McCarty, B.: Botnets: Big and bigger. IEEE Security and Privacy 1(4), 87–90 (2003)
11. Porras, P., Saidi, H., Yegneswaran, V.: A multi-perspective analysis of the storm (peacomm) worm. Technical report, Computer Science Laboratory, SRI International (October 2007)
12. Saha, B., Gairola, A.: Botnet: An overview, CERT-In White Paper (July 2005)
13. Scarlata, V., Levine, B.N., Shields, C.: Responder Anonymity and Anonymous Peer-to-Peer File Sharing. In: Proceedings of the IEEE International Conference on Network Protocols (ICNP), November 2001, pp. 272–280 (2001)
14. Shirley, B., Mano, C.D.: Hiding botnet communication in a switched network environment. Technical report
15. Strayer, W.T., Walsh, R., Livadas, C., Lapsley, D.: Detecting botnets with tight command and control. In: Proceedings of 2006 31st IEEE Conference on Local Computing Networks, November 2006, pp. 195–202 (2006)
16. Vogt, R., Aycock, J.: Attack of the 50 foot botnet. Technical Report 2006-840-33, Department of Computer Science, University of Calgary (August 2006)
17. Yang, B., Garcia-Molina, H.: Comparing hybrid peer-to-peer systems. The VLDB Journal, 561–570 (September 2001)
18. Zou, C.C., Cunningham, R.: Honeypot-aware advanced botnet construction and maintenance. In: Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 2006, pp. 199–208 (2006)
Traffic Engineering and Routing in IP Networks with Centralized Control
Jing Fu¹, Peter Sjödin², and Gunnar Karlsson¹
¹ ACCESS Linnaeus Center, School of Electrical Engineering
² School of Information and Communication Technology
KTH, Royal Institute of Technology, SE-100 44, Stockholm, Sweden
{jing,psj,gk}@kth.se
Abstract. There have been research initiatives in centralized control recently, which advocate that the control of an autonomous system (AS) should be performed in a centralized fashion. In this paper, we propose an approach to perform traffic engineering and routing in networks with centralized control, named LP-redirect. LP-redirect is based on an efficient formulation of linear programming (LP) that reduces the number of variables and constraints. As LP is not fast enough for runtime routing, LP-redirect uses a fast scheme to recompute routing paths when a network topology changes. The performance evaluation of LP-redirect shows that it is more efficient in both traffic engineering and computation than an approach using optimized link weights. In addition, LP-redirect is suitable for runtime traffic engineering and routing.
Keywords: traffic engineering, routing, centralized control.
1 Introduction
Traffic engineering aims at making more efficient use of network resources in order to provide better and more reliable services to the customers. Current traffic engineering and routing in IP networks are performed in both the management and the control planes. In the management plane, network operators configure weights of the links in a network, which indirectly control the selection of routing paths. These weights can be set inversely proportional to the link capacities, as recommended by Cisco [1], or they may be optimized for traffic engineering objective functions in a network with given traffic demands. Computing optimal link weights is NP-hard; therefore, heuristics have been developed [2] [3]. However, these heuristics are still computationally intensive: for current backbone networks they may require hours of computation time.
Once the link weights have been computed in the management plane, routing is performed in the control plane based on link-state routing protocols such as OSPF and IS-IS. The forwarding path between two routers is the shortest path considering the sum of the link weights along the path. When a topology changes, as when a link fails, the forwarding paths are recomputed in the routers individually using Dijkstra’s algorithm [4], which is efficient and can be done in milliseconds. However, this approach does not provide efficient traffic engineering since the link weights are not re-optimized for the new topology.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 633–641, 2008. © IFIP International Federation for Information Processing 2008
There have been research initiatives in centralized control recently [5] [6] [7], which advocate that the control of an AS should be performed in a centralized fashion with direct control: instead of manipulating link weights, which indirectly influence the forwarding decisions in individual routers, a centralized server controls the routers’ forwarding decisions directly. A centralized control scheme simplifies the decision process and reduces the functions required in the routers. As the architectural advantages of centralized control are well discussed in these papers, we focus our work on traffic engineering and routing. In our approach, instead of tuning link weights, the centralized server computes routers’ forwarding tables directly. Computing optimal routes for traffic engineering purposes is a multi-commodity network flow problem, which is computationally tractable and can be solved efficiently using linear programming (LP) [8]. Solving an LP problem to obtain optimal routes is much more efficient than computing optimized link weights, and it can be performed in seconds to tens of seconds. However, LP is still not sufficiently fast for runtime route computation, since new forwarding tables should be computed in no more than hundreds of milliseconds when a topology changes. Therefore, we present a fast scheme to recompute forwarding tables when a topology changes, such as when a link fails. The basic idea is to quickly redirect the flows on the failed link based on current link utilizations. Then, in a background process, LP can be used to recompute optimal routing paths. After presenting this approach, named LP-redirect, we experimentally study its performance and compare it with an approach using optimized link weights [2] [9]. The rest of the paper is organized as follows. Section 2 overviews the related work. Section 3 presents the static routing model, which describes our LP formulation. Section 4 introduces dynamic route computation, which describes how to redirect traffic when a topology changes. Section 5 shows the experimental setup for our performance evaluation. Section 6 presents the results. Finally, Section 7 concludes the paper.
2 Related Work
There has been a considerable amount of research in Internet traffic engineering recently. One area of research is to tune OSPF link weights in order to achieve desired traffic engineering objectives [2] [3]. Tuning the link weights relies on having a network topology and an estimate of the traffic demands. Approaches to derive a traffic demand matrix from an operational network are described in [10] [11]. The constraints in OSPF, imposed by using the shortest path and equal splitting of traffic over equal cost paths, make it difficult to achieve optimal routing. Therefore, there are studies on how to split traffic arbitrarily, for example using hash functions [12]. In addition, it has been suggested that the traffic should not only be routed on the shortest path, but also be routed on non-shortest paths, with a penalty for longer paths [13]. Finally, optimal routing can be achieved using LP [8]. Although LP is not fast enough for runtime route computation, it is commonly used as a reference to evaluate other traffic engineering approaches.
Fig. 1. Link cost φa as a function of utilization u(a) (horizontal axis: utilization, 0 to 1.2; vertical axis: cost, 0 to 16)
3 Static Routing Model
Our network is a directed and connected graph G composed of a set of nodes V and a set of links E. A link a ∈ E can be represented as (x, y) ∈ V × V, where x and y are the two end nodes of the link. Furthermore, there is a traffic demand matrix D. For each pair of nodes (s,t) ∈ V × V, the demand D(s,t) represents the amount of traffic from s to t. For each link a, we let $f_a^{(s,t)}$ denote the amount of traffic from s to t that is offered to link a. In addition, we let $f_a^t$ denote the amount of traffic from all nodes to t that is offered to link a. The total amount of traffic over link a is denoted $f_a$ and the link capacity is denoted c(a). Finally, the link utilization u(a) is obtained by dividing the total amount of traffic by the link capacity: $u(a) = f_a / c(a)$.
The objective of traffic engineering is to keep the traffic loads within the link capacities, and particularly, the loads on the links should be balanced. A frequently used objective function in network optimization is to minimize the total cost in the network, which is known as minimum cost network flow [4]. The total cost φ is the sum of the cost on each link, $\phi = \sum_{a \in E} \phi_a$. For a link a, the cost function $\phi_a$ can be modelled using a piecewise linear function of the utilization, as shown in Fig. 1. The idea behind $\phi_a$ is that it is cheap to transmit flows over a link when the utilization is low. As the utilization increases, it becomes more expensive. The exact definition of the objective function may vary and is a choice for the network operators; however, the objective function should be piecewise linear, increasing and convex. We have taken the objective function that is used in [2] [3], where $\phi_a(0) = 0$ and its derivative is
\[
\phi_a'(u(a)) =
\begin{cases}
1 & \text{for } 0 \le u(a) \le 1/3 \\
3 & \text{for } 1/3 \le u(a) \le 2/3 \\
10 & \text{for } 2/3 \le u(a) \le 9/10 \\
70 & \text{for } 9/10 \le u(a) \le 1 \\
500 & \text{for } 1 \le u(a) \le 11/10 \\
5000 & \text{for } 11/10 \le u(a) < \infty.
\end{cases}
\]
The definition above specifies that the cost to transmit a unit amount of traffic over a link with a utilization below 1/3 is 1, when the utilization is above 1/3 but below 2/3,
J. Fu, P. Sjödin, and G. Karlsson
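the cost is 3. Using the above definition, the optimal routing problem can be formulated as the following LP:

$$\text{minimize} \quad \phi = \sum_{a \in E} \phi_a$$

subject to

$$\sum_{x:(x,y) \in E} f^t_{(x,y)} - \sum_{z:(y,z) \in E} f^t_{(y,z)} = \begin{cases} -D(y,t) & \text{if } y \ne t \\ \sum_{s \in V} D(s,t) & \text{if } y = t \end{cases} \qquad y, t \in V, \tag{1}$$

$$f_a = \sum_{t \in V} f_a^t \qquad a \in E, \tag{2}$$

$$\phi_a \ge f_a \qquad a \in E, \tag{3}$$

$$\phi_a \ge 3 f_a - \tfrac{2}{3} c(a) \qquad a \in E, \tag{4}$$

$$\phi_a \ge 10 f_a - \tfrac{16}{3} c(a) \qquad a \in E, \tag{5}$$

$$\phi_a \ge 70 f_a - \tfrac{178}{3} c(a) \qquad a \in E, \tag{6}$$

$$\phi_a \ge 500 f_a - \tfrac{1468}{3} c(a) \qquad a \in E, \tag{7}$$

$$\phi_a \ge 5000 f_a - \tfrac{16318}{3} c(a) \qquad a \in E, \tag{8}$$

$$f_a^t \ge 0 \qquad a \in E;\ t \in V. \tag{9}$$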
Constraints (1) are flow conservation constraints that ensure the desired traffic flow is routed to the destination. Constraint (2) defines the load of a link as the sum of all flows over it. Constraints (3) to (8) define the cost of a link depending on its capacity and the amount of traffic over it. Finally, constraint (9) ensures that each flow is non-negative. The above is a complete LP formulation of the general routing problem, and hence it can be solved optimally in polynomial time [14]. Our LP formulation differs from formulations in other papers, where each flow is considered as a source-destination pair [2] [3] [15]. In those approaches the number of commodities is $|V|^2$ and the number of flow variables is $|E| \times |V|^2$: for each link a and each pair of nodes s and t, there is a flow variable $f_a^{(s,t)}$ that denotes the amount of traffic from source s to destination t that goes via link a. As we model the flows as flows to a destination, the number of commodities is $|V|$ and the number of flow variables is only $|E| \times |V|$: for each link a there is one flow variable $f_a^t$ for each destination t. With our formulation, the number of variables and equality constraints in the LP is only $1/|V|$ of that in the traditional LP formulations, hence it can be solved much more efficiently.
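As an illustration of how constraints (3) to (8) encode the piecewise linear cost, the following minimal sketch evaluates the smallest feasible value of the link cost for a given load and capacity, and compares the number of flow variables in the two formulations. The function names `link_cost` and `flow_variable_count` are hypothetical helpers introduced here, not part of the paper.

```python
from fractions import Fraction as F

# (slope, offset) pairs read off constraints (3)-(8):
# phi_a >= slope * f_a - offset * c(a).
SEGMENTS = [
    (1, F(0)),
    (3, F(2, 3)),
    (10, F(16, 3)),
    (70, F(178, 3)),
    (500, F(1468, 3)),
    (5000, F(16318, 3)),
]


def link_cost(load, capacity):
    """Smallest phi_a that satisfies constraints (3)-(8) simultaneously."""
    return max(slope * load - offset * capacity for slope, offset in SEGMENTS)


def flow_variable_count(num_nodes, num_links, per_destination=True):
    """Flow variables: |E|*|V| per destination, |E|*|V|^2 per node pair."""
    commodities = num_nodes if per_destination else num_nodes ** 2
    return num_links * commodities


if __name__ == "__main__":
    capacity = F(1)
    for load in (F(1, 3), F(1, 2), F(1), F(11, 10)):
        print(load, link_cost(load, capacity))
    print(flow_variable_count(88, 161), flow_variable_count(88, 161, False))
```

Because the LP minimizes the sum of the link costs, each cost variable settles on the maximum of the six lines, which is exactly the convex piecewise linear function whose slopes are listed in Section 3.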
4 Dynamic Route Computation

Although LP provides optimal routing, it is not fast enough for runtime route computation. A fast approach to reroute packets upon link failures is therefore required. When a link fails, the first step in LP-redirect is to select a set of failed flows. A failed flow is a flow that is considered to be affected by the link failure, and whose traffic needs to be rerouted (in whole or in part). When the failed flows have been selected, the next
step is to establish the residual traffic demand (rD) for those flows. The residual traffic demand of a failed flow represents the amount of traffic that needs to be rerouted for that flow. After all residual traffic demands have been determined, routing paths for these demands are recomputed. After the packets have been rerouted using the fast dynamic approach, LP can be used again in the background to recompute optimal paths. The recomputed optimal paths can also be used later in routers.

4.1 Finding Residual Traffic Demands

The approaches to find residual traffic demands are best illustrated using an example. Consider the network topology shown in Fig. 2, where we study flows towards e when link (c, d) fails. For each link (x, y) in the figure, the value of $f^e_{(x,y)}$ is shown, representing the amount of traffic to e that goes via that link. The initial traffic demands to e are also shown in the figure. When link (c, d) fails, a simple approach is to set residual traffic demand rD(c, e) = 10. After that, 10 units of flow should be removed from $f^e_{(d,e)}$, since the flow from c to e was transmitted through the link (d, e) before the failure. An alternative approach is to set residual traffic demand rD(a, e) = 10, and remove 10 units of traffic from both $f^e_{(a,c)}$ and $f^e_{(d,e)}$. We could also set residual demands from several sources, for example both rD(a, e) = 5 and rD(b, e) = 5. In this case, we need to remove 5 units of traffic from both $f^e_{(a,c)}$ and $f^e_{(b,c)}$, and remove 10 units of traffic from $f^e_{(d,e)}$.

Fig. 2. A network with traffic demands (nodes a, b, c, d and e; link labels 20, 10, 10, 10 and 15 give the traffic towards e carried by each link; traffic demands: D(a,e) = 10, D(b,e) = 10, D(c,e) = 10, D(d,e) = 5)
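The three ways of choosing residual demands in the example above can be sketched as follows. This is only an illustration: the per-link flow values are an assumption inferred from the labels of Fig. 2 (the figure itself is not reproduced here), and `release_failed_flow` is a hypothetical helper, not a routine from the paper.

```python
def release_failed_flow(flows, failed_link, residual, upstream, downstream):
    """Return the per-link flows after removing the traffic of the failed flows.

    residual   -- {source: amount of traffic to reroute from that source}
    upstream   -- {source: links from the source to the tail of the failed link}
    downstream -- links from the head of the failed link to the destination
    """
    total = sum(residual.values())
    if total != flows[failed_link]:
        raise ValueError("residual demands must cover the failed link's flow")
    updated = dict(flows)
    updated[failed_link] = 0
    for source, amount in residual.items():
        for link in upstream[source]:
            updated[link] -= amount
    for link in downstream:
        updated[link] -= total
    return updated


# Assumed flows towards e, inferred from the link labels of Fig. 2.
FLOWS_TO_E = {("a", "c"): 10, ("b", "c"): 10, ("c", "d"): 10,
              ("c", "e"): 20, ("d", "e"): 15}
UPSTREAM = {"a": [("a", "c")], "b": [("b", "c")], "c": []}
DOWNSTREAM = [("d", "e")]

if __name__ == "__main__":
    for residual in ({"c": 10}, {"a": 10}, {"a": 5, "b": 5}):
        print(residual, release_failed_flow(
            FLOWS_TO_E, ("c", "d"), residual, UPSTREAM, DOWNSTREAM))
```

Choosing residual demands closer to the sources frees capacity on more links, which is the trade-off the text goes on to discuss.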
When there are several flows to choose from, LP-redirect tries to find flows from links with higher utilization. For example, if u(a, c) > u(b, c), residual traffic demand rD(a, e) = 10 will be chosen. This choice leads to a larger decrease in cost φ when removing the traffic, since the cost of a highly utilized link might be higher than that of a lightly utilized link. In addition, LP-redirect prefers to select residual demands for flows whose source and destination are further apart. Using this approach, residual demands rD(a, e) and rD(b, e) are preferred over rD(c, e). The intuition behind this is that as traffic is removed from more links, the decrease in cost will be higher.

4.2 Recomputing Forwarding Paths

When the residual traffic demands have been determined, LP-redirect uses Dijkstra's algorithm to compute the shortest paths for these residual demands. The link weights for this computation are dynamic weights, computed for each individual link from its capacity and current utilization. The weight is defined as w(a) = φa(u(a))/c(a).
Basically, the dynamic link weight is inversely proportional to the link capacity and proportional to the cost function φa(u(a)) specified in Section 3. One consideration in LP-redirect is how often the dynamic link weights should be updated. They can be updated when all new routing paths have been computed after a failure, or whenever flows are added to or removed from a link. In addition, if a residual demand rD(s, t) is large, it might be inappropriate to assign the entire flow to a single shortest path, since it might congest the links on the path. An alternative approach is to break the demand into pieces: assign a part of the demand to the shortest path, update the dynamic link weights, and then recompute the shortest path to assign the remaining demand. The approach that LP-redirect uses to recompute the paths is as follows. For each node s that has a positive residual demand to any destination t, rD(s, t) > 0, the shortest paths from s to all other nodes are computed. If the demand from s to a node t is smaller than a certain percentage p of the lowest link capacity c(a) along the path, rD(s, t) < p · c(a), the entire demand is assigned to the links. Otherwise a part of the demand equal to p · c(a) is assigned to the links. After assigning the demands from the source node s to all destination nodes, the dynamic link weights are recomputed. If the residual demands from s have not been fully assigned, LP-redirect recomputes the shortest-path tree from s, and assigns the remaining demands. After assigning all demands from s, LP-redirect continues with the next source node with positive residual demand. The choice of p matters. When p is too small, demands from a source need to be assigned in many iterations. When p is too large, assigning a large demand to a single path may congest the links.
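The incremental assignment described above can be sketched as follows for a single residual demand rD(s, t). This is a simplified illustration under stated assumptions: it handles one source-destination pair at a time rather than a full shortest-path tree, it recomputes the weights after every chunk, and it uses the slope of the cost function from Section 3 as the numerator of w(a), since the text does not make clear whether the cost or its derivative is meant. All function names are hypothetical.

```python
import heapq

# (upper utilization bound, slope) pairs of the cost function in Section 3.
SLOPES = [(1 / 3, 1), (2 / 3, 3), (9 / 10, 10), (1.0, 70),
          (11 / 10, 500), (float("inf"), 5000)]


def slope(utilization):
    """Slope of the piecewise linear link cost at the given utilization."""
    for bound, value in SLOPES:
        if utilization < bound:
            return value
    return SLOPES[-1][1]


def dynamic_weights(cap, load):
    """Assumed reading of w(a): cost slope at u(a), divided by c(a)."""
    return {a: slope(load[a] / cap[a]) / cap[a] for a in cap}


def shortest_path(links, weight, src, dst):
    """Dijkstra's algorithm; returns the list of links from src to dst."""
    dist, prev, done = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            break
        for (x, y) in links:
            if x == node and y not in done:
                nd = d + weight[(x, y)]
                if nd < dist.get(y, float("inf")):
                    dist[y], prev[y] = nd, (x, y)
                    heapq.heappush(heap, (nd, y))
    if dst != src and dst not in prev:
        return None
    path, node = [], dst
    while node != src:
        path.append(prev[node])
        node = prev[node][0]
    return path[::-1]


def assign_demand(cap, load, src, dst, demand, p=0.01):
    """Assign `demand` in chunks of at most p times the bottleneck capacity."""
    while demand > 1e-12:
        path = shortest_path(cap, dynamic_weights(cap, load), src, dst)
        if path is None:
            raise ValueError("destination unreachable")
        chunk = min(demand, p * min(cap[a] for a in path))
        for a in path:
            load[a] += chunk
        demand -= chunk
    return load
```

With two parallel paths of equal capacity, the loop fills one path until its weight rises and then switches to the other, which is the congestion-avoiding behaviour that motivates a small p.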
Through experimentation we have found that p = 1% provides a suitable trade-off for our performance studies.

4.3 Removing Potential Loops

The forwarding paths computed above may contain loops, since they are based both on the initial LP optimization and on shortest paths derived from dynamic link weights. Therefore, before installing the new routing paths, flow loops need to be removed. A flow loop can be detected and removed by using a breadth-first search [4].
5 Evaluation

In this section, we present a method to evaluate the performance of LP-redirect and of optimized link weights. Thereafter, we present the experimental setups for the performance studies.

5.1 Performance Studies

The performance of a solution is expressed using the cost function specified in Section 3. We define the minimal cost obtained when solving the LP as the optimal solution, $\phi_{opt}$. In addition, the minimal cost obtained using optimized link weights is denoted $\phi_{ospf}$. We further define the optimality gap of an approach as its normalized difference in cost to $\phi_{opt}$. For example, the optimality gap of optimized OSPF link weights is $(\phi_{ospf} - \phi_{opt})/\phi_{opt}$.
In the dynamic case, let G′ be the topology when one or several of the links in G fail. We let $\phi'_{opt}$ denote the optimal solution obtained by solving the LP with topology G′. Furthermore, we let $\phi'_{ospf}$ denote the solution obtained using the original optimized link weights in G, but with the new topology G′. Then the optimality gap for OSPF after link failures is $(\phi'_{ospf} - \phi'_{opt})/\phi'_{opt}$. When using LP-redirect, $\phi_{LP\text{-}redirect}$ is computed after redirection of flows.

5.2 Experimental Setup

We have created a synthetic router-level topology using Waxman's model, with 50 nodes and an average link degree of four [16]. In addition, we have used an ISP topology measured in Rocketfuel [17], the Ebone network (AS 1755), which consists of 88 routers and 161 links. Finally, we assume uniform link capacities in both topologies. The traffic demands we used are based on Newton's gravity model of migration. The model was initially used for the migration of people between different towns. As the size of the towns increases, there will be an increase in movement between them. The further apart two towns are, the less movement there will be between them. The gravity model and its modified versions have been widely used in communication networks [2] [18]. In our demand matrix generation, each node is given two uniformly distributed random variables o and i for outgoing and incoming traffic. The traffic demand from node s to t is specified as follows:

$$D(s,t) = \alpha \, \frac{o_s \, i_t}{d(s,t)^2},$$

where d(s, t) denotes the distance and a constant α is used to scale the demands to appropriate values. Our goal is to set α so that the average link utilizations are about 30% and 70% respectively, in order to evaluate the efficiency of traffic engineering in varying network conditions. Finally, we have used the GNU Linear Programming Kit (GLPK) for LP modelling and solving [19]. To generate near-optimal link weights, we have used IGPWO [9], which is based on the algorithm in [2].
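A minimal sketch of the gravity-model demand generation is given below. It assumes that d(s, t) is the Euclidean distance between node coordinates in the plane, which the text does not state, and `gravity_demands` is a hypothetical helper name. In an experiment the weights o and i would be drawn from a uniform distribution; fixed values are used here so that the result is easy to check by hand.

```python
def gravity_demands(out_weight, in_weight, position, alpha):
    """D(s,t) = alpha * o_s * i_t / d(s,t)^2 for every ordered pair s != t.

    position maps each node to (x, y) coordinates; the squared Euclidean
    distance is used directly, so no square root is needed.
    """
    demands = {}
    for s, (sx, sy) in position.items():
        for t, (tx, ty) in position.items():
            if s == t:
                continue
            dist_sq = (sx - tx) ** 2 + (sy - ty) ** 2
            demands[(s, t)] = alpha * out_weight[s] * in_weight[t] / dist_sq
    return demands


if __name__ == "__main__":
    o = {"a": 2.0, "b": 1.0}
    i = {"a": 1.0, "b": 3.0}
    pos = {"a": (0.0, 0.0), "b": (3.0, 4.0)}
    print(gravity_demands(o, i, pos, alpha=25.0))
```

Scaling α up or down multiplies every demand by the same factor, which is how a target average link utilization such as 30% or 70% could be reached.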
6 Results

Fig. 3 shows the optimality gap for four setups. The optimality gaps are shown when links fail one after another. We study up to ten link failures from the initial network topology, as the time between failures on a single link can be as short as several minutes [20]. As link failures at different locations may have varying impact on the optimality gap, in each setup our results are based on 20 simulations, with the median, 10th and 80th percentile shown. The optimality gap for LP-redirect is smaller than that for optimized weights in all setups. Therefore, LP-redirect provides clearly more efficient traffic engineering. In addition, the optimality gap in the static case (with no link failure) for optimized weights is about 0.1 when the average link utilization is 30%, while it is approximately 0.6 when the average utilization is 70%. This indicates that the performance of optimized link weights decreases when the average link utilization increases.
Fig. 3. A comparison of optimality gap between LP-redirect and optimized weights (four panels: Waxman-50 at 30% and 70% average link utilization, Ebone-88 at 30% and 70% average link utilization; x-axis: number of link failures, 0 to 10; y-axis: optimality gap; curves: optimized weights and LP-redirect)
In the dynamic case, when links fail one after another, the gap for optimized weights increases; in some setups the 80th percentile gap can be up to 5. These gaps suggest that optimized weights may provide very poor traffic engineering for some of the link failures. With LP-redirect, the optimality gap increases slowly as links fail, and the 80th percentile gap is fairly close to the median, suggesting that LP-redirect provides efficient traffic engineering for most of the link failures.
7 Conclusion

In this paper, we have presented an approach to perform traffic engineering and routing in networks with centralized control, named LP-redirect. LP-redirect is based on an LP formulation that reduces the number of variables and constraints, and thus can be solved efficiently, in seconds to tens of seconds. In addition to the LP formulation, LP-redirect relies on a fast scheme to recompute routing paths when the network topology changes, by identifying failed flows and recomputing routes for these flows based on current link utilizations and link capacities. As a comparison, we have evaluated both LP-redirect and an approach using optimized link weights. The results show that LP-redirect provides more efficient traffic engineering and computation than optimized link weights. Finally, traffic in a network varies over time and it can be difficult to estimate the traffic demands in advance. In a network with centralized control, the traffic demands and link utilizations can be reported to the centralized server at runtime, and this information can
be used by LP-redirect to compute routing paths. With optimized link weights, by contrast, it is not possible to re-optimize the link weights within a short time when a change in traffic demands is observed. The unpredictability and abrupt changes of traffic demands make our approach even more suitable for traffic engineering and routing.
References

1. Cisco, Configuring OSPF (2006)
2. Fortz, B., Thorup, M.: Optimizing OSPF/IS-IS weights in a changing world. IEEE Journal on Selected Areas in Communications 20(4), 756–767 (2002)
3. Ericsson, M., Resende, M.G.C., Pardalos, P.M.: A genetic algorithm for the weight setting problem in OSPF routing. Journal of Combinatorial Optimization 6(3) (November 2002)
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to algorithms, 2nd edn. MIT Press, Cambridge, Massachusetts (2001)
5. Rexford, J., Greenberg, A., Hjalmtysson, G., Maltz, D.A., Myers, A., Xie, G., Zhan, J., Zhang, H.: Network-wide decision making: Toward a wafer-thin control plane. In: Proc. of ACM SIGCOMM HotNets Workshop (November 2004)
6. Greenberg, A., Hjalmtysson, G., Maltz, D.A., Myers, A., Rexford, J., Xie, G., Yan, H., Zhan, J., Zhang, H.: Clean slate 4D approach to network control and management. ACM SIGCOMM Computer Communication Review 35(5) (October 2005)
7. Császár, A., Enyedi, G., Hidell, M., Rétvári, G., Sjödin, P.: Converging the evolution of router architectures and IP networks. IEEE Network Magazine 21(4) (2007)
8. Karakostas, G.: Faster approximation schemes for fractional multicommodity flow problems. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms (2002)
9. Interior Gateway Protocol Weight Optimization Tool (IGPWO). http://www.poms.ucl.ac.be/totem/
10. Feldmann, A., Greenberg, A., Lund, C., Reingold, N., Rexford, J., True, F.: Deriving traffic demands for operational IP networks: methodology and experience. IEEE/ACM Transactions on Networking 9(3), 265–280 (2001)
11. Gunnar, A., Johansson, M., Telkamp, T.: Traffic Matrix Estimation on a Large IP Backbone - A Comparison on Real Data. In: Proc. of ACM SIGCOMM Internet Measurement Conference (October 2004)
12. Cao, Z., Wang, Z., Zegura, E.: Performance of hashing based schemes for Internet load balancing. In: Proc. of IEEE INFOCOM (March 2000)
13. Xu, D., Chiang, M., Rexford, J.: DEFT: Distributed exponentially-weighted flow splitting. In: Proc. of IEEE INFOCOM (May 2007)
14. Khachian, L.G.: A polynomial time algorithm for linear programming. Doklady Akademii Nauk SSSR 244, 1093–1096 (1979)
15. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network flows: theory, algorithms and applications. Prentice-Hall, Englewood Cliffs (1993)
16. Waxman, B.M.: Routing of multipoint connections. IEEE Journal on Selected Areas in Communications 6(9), 1617–1622 (1988)
17. Spring, N., Mahajan, R., Wetherall, D.: Measuring ISP topologies with Rocketfuel. In: Proc. of ACM SIGCOMM 2002 (August 2002)
18. Medina, A., Taft, N., Salamatian, K., Bhattacharyya, S., Diot, C.: Traffic matrix estimation: Existing techniques and new directions. In: Proc. of ACM SIGCOMM (August 2002)
19. GNU Linear Programming Kit (GLPK). http://www.gnu.org/software/glpk/
20. Iannaccone, G., Chuah, C., Mortier, R., Bhattacharyya, S., Diot, C.: Analysis of link failures in an IP backbone. In: Proc. of ACM SIGCOMM Internet Measurement Workshop (2002)
Distributed PLR-Based Backup Path Computation in MPLS Networks

Mohand Yazid Saidi¹, Bernard Cousin¹, and Jean-Louis Le Roux²

¹ Université de Rennes I, IRISA/INRIA, 35042 Rennes Cedex, France
² France Télécom, 2 Avenue Pierre Marzin, 22300 Lannion, France
Abstract. In this article, we provide mechanisms enabling the backup path computation to be performed on-line and locally by the Points of Local Repair (PLRs), in the context of MPLS-TE fast reroute. To achieve a high degree of bandwidth sharing, the Backup Path Computation Entities (BPCEs) running on PLRs require the knowledge and maintenance of a great quantity of bandwidth information (non-aggregated link information or per-path information), which is undesirable in distributed environments. To get around this problem, we propose a distributed PLR-based heuristic (DPLRH) which aggregates and noticeably decreases the size of the bandwidth information advertised in the network while keeping the bandwidth sharing high. DPLRH is scalable, easy to deploy and balances computations equitably over the network routers. Simulations show that with the transmission of a small quantity of aggregated information per link, the ratio of rejected backup paths is low and close to the ideal.

Keywords: local protection; bandwidth sharing; path computation.
1
Introduction
The proactive protection of communication becomes increasingly important with the explosion of the number of real-time network applications (voice over IP, network games, video on demand, etc.). Thus, to ensure network service continuity upon a failure, the proactive protection techniques [1,2] precompute and generally pre-establish backup paths capable of receiving and rerouting the traffic of the affected primary paths. Two schemes of protection exist: global (end-to-end) and local. With the global scheme [2], each primary path is protected by one vertex (or link) disjoint backup path interconnecting the primary source and destination nodes. This protection scheme has the disadvantage of increasing the recovery cycle, since the failure information must reach the source before traffic is switched from the primary to the backup path. This drawback is eliminated with the use of local protection, where the recovery is achieved locally, and without any control plane notification, by the node upstream of the failing component.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 642–653, 2008. © IFIP International Federation for Information Processing 2008
With the advent of MPLS [3] in the last decade, local protection can be provided in an efficient manner. In fact, MPLS offers great flexibility for choosing paths (called Label Switched Paths or LSPs) and thus the backup paths can be determined so that bandwidth availability is maximized. Two types of backup LSP are defined for MPLS local protection [4]: Next HOP (NHOP) LSP and Next Next HOP (NNHOP) LSP. A NHOP LSP (resp. NNHOP LSP) is a backup path protecting against link failure (resp. node failure); it is set up between a primary node called Point of Local Repair (PLR) and one primary node downstream of the PLR (resp. of the PLR next-hop) called Merge Point (MP). Such a backup LSP bypasses the link (resp. the node) downstream of the PLR on the primary LSP. When a link failure (resp. node failure) is detected by a node, the latter locally activates all its NHOP and NNHOP (resp. its NNHOP) backup LSPs by switching traffic from the affected primary LSPs to their backup LSPs.

To ensure enough resources (particularly bandwidth) after the recovery from a failure, the backup LSPs must reserve the resources they need beforehand. If each backup path has its own exclusive resources, the network will be overbooked rapidly since the available resources decrease quickly. Instead, with the practical hypothesis of single failures, resource utilization can be improved by sharing the resources between some backup LSPs. To increase the number of LSPs that can be set up in a network, the resource sharing should be taken into account when the backup LSPs are computed. Three functionalities are necessary to perform such computations in a distributed environment: information collection, information distribution and path determination. The first functionality gathers the structures and properties of the backup LSPs set up in the network. In practice, each network node stores the path links, the bandwidth and the risks protected by the backup LSPs traversing it. Such information can be obtained easily and without any additional overhead when the backup LSPs are signaled as in [4]. The second functionality reorganizes and transmits the collected information to the nodes supporting the BPCEs (Backup Path Computation Entities). We note that, for the same capability of bandwidth sharing, the less information is transmitted, the better the distribution functionality is. Finally, the last functionality searches for the backup LSPs providing the desired protection and verifying the bandwidth constraints.

In this article, we focus on the mechanisms allowing an efficient distribution of the bandwidth information and enabling the computation of bandwidth-guaranteed backup LSPs to be performed on-line and locally by the PLRs. Hence, we propose a new distributed PLR-based heuristic (DPLRH) aggregating and significantly reducing the size of the bandwidth information advertised in the network. With our heuristic, the backup LSPs are computed and configured by the same nodes, which correspond to the backup LSP head-end routers (PLRs). This eliminates the communication between the entities computing the backup LSPs (BPCEs) and those configuring the backup LSPs (PLRs). Besides, DPLRH is scalable, shares bandwidth effectively between the backup LSPs and is capable of computing backup LSPs protecting against any type of failure risk.
Fig. 1. Bandwidth allocation on an arc λ
The rest of this article is organized as follows: section 2 describes the three types of failure risks and gives the formulas allowing the computation of the minimal protection bandwidth to be reserved on each unidirectional link. In section 3, we review some works related to the bandwidth sharing. Then, we explain in section 4 the principles of DPLRH. In section 5, we present simulation results and analysis. Finally, section 6 is dedicated to the conclusions.
2
Failure Risks and Bandwidth Sharing
To deal with any single physical failure in a logical (MPLS) layer, three types of (logical) failure risks are defined: link, node and SRLG. The first type of failure risk corresponds to the risk of a logical link failure due to the breakdown of an exclusive physical component of the logical link. The second type of failure risk corresponds to the risk of a logical node failure. Finally, the third type of risk corresponds to the failure risk of a common physical component (e.g. optical fiber) shared by a group of logical links. In order to ensure enough bandwidth upon a failure, minimal quantities of bandwidth must be reserved on links. To determine such quantities, we define the concept of the protection cost of a risk r on an arc λ (noted $\delta_r^\lambda$). The latter corresponds to the cumulative bandwidth of the backup LSPs which will be activated on the arc λ upon a failure of the risk r. For a SRLG risk srlg composed of links $(l_1, l_2, \ldots, l_n)$, $\delta_{srlg}^\lambda$ is determined as follows:

$$\delta_{srlg}^\lambda = \sum_{0 < i \le n} \delta_{l_i}^\lambda \tag{1}$$

To cope with any single failure, a minimal quantity of protection bandwidth $G_\lambda$ must be reserved on the arc λ:

$$G_\lambda = \max_{r \in PFRG(\lambda)} \delta_r^\lambda \tag{2}$$
In order to control and specify the quantity of bandwidth dedicated for protection and to separate the task of primary LSP computation from that of backup LSP computation, the bandwidth capacity Cλ on arc λ can be divided in two pools: primary pool and protection pool (fig. 1). The primary pool has a capacity P Cλ and it is used to allocate bandwidth for primary LSPs. The protection pool has a capacity BCλ and it is used to allocate bandwidth for backup LSPs.
To ensure that the bandwidth constraints are respected upon a failure, the minimal protection bandwidth reserved on each arc λ must verify:

$$G_\lambda \le BC_\lambda \tag{3}$$
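The quantities defined in equations (1) to (3) can be sketched as follows. The helper names and the numeric protection costs are hypothetical, chosen only for illustration.

```python
def srlg_protection_cost(link_costs, srlg_links):
    """Equation (1): the SRLG cost on an arc is the sum of its links' costs."""
    return sum(link_costs[link] for link in srlg_links)


def min_protection_bandwidth(risk_costs):
    """Equation (2): G is the largest protection cost over all risks."""
    return max(risk_costs.values(), default=0)


def respects_protection_pool(risk_costs, backup_capacity):
    """Equation (3): the reserved protection bandwidth must fit in BC."""
    return min_protection_bandwidth(risk_costs) <= backup_capacity


if __name__ == "__main__":
    # Illustrative protection costs of two link risks on one arc.
    link_costs = {"l1": 30, "l2": 20}
    risk_costs = dict(link_costs)
    risk_costs["n1"] = 40  # a node risk
    risk_costs["srlg_a"] = srlg_protection_cost(link_costs, ["l1", "l2"])
    print(min_protection_bandwidth(risk_costs))
    print(respects_protection_pool(risk_costs, backup_capacity=60))
```

Because only one risk is assumed to fail at a time, the arc needs to reserve the maximum of the protection costs rather than their sum, which is where the bandwidth sharing comes from.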
To keep (3) valid after the setup of a backup LSP b of bandwidth bw(b) protecting against the risks in FR(b), only the arcs λ verifying the following inequality can be selected to be in the LSP b:

$$\max_{r \in FR(b)} \delta_r^\lambda \le BC_\lambda - bw(b) \tag{4}$$

3

Related Works
Recently, a great deal of work has addressed path protection in order to find algorithms and mechanisms allowing an on-line computation of optimized backup paths. Several solutions have been proposed, but a large number of them, like [5] and [6], cope only with failure risks of type link or node. As the quality of the distributed techniques computing the backup paths depends closely on the algorithms implementing the functionality of information distribution (cf. Section 1), we focus below on the study of these algorithms; for path computation, various variants of Dijkstra's algorithm and ILP formulations can be applied.

In a first, obvious approach [7], Kini proposes to flood within the IGP-TE protocols the topology information, the primary bandwidth, the capacities and all the protection costs of the risks on the topology arcs in the network. In this way, each node has complete knowledge of the information necessary for the backup LSP computation and, as a result, it can perform this computation in an efficient manner. This distribution technique increases the bandwidth availability, but it overloads the network with large and frequent messages advertising the protection costs. To scale well, [8] proposed the PCE-based MPLS-TE fast reroute technique, in which no control message is necessary to compute the bandwidth-guaranteed backup LSPs. With this technique, a separate PCE (path computation element) is associated with each failure risk in order to compute the backup LSPs which will be activated at the failure of that risk. This computation technique is efficient when there are no SRLGs in the network. Otherwise, the PCE-based MPLS-TE fast reroute technique requires a mechanism distinguishing a node failure from a link failure. This significantly increases the recovery cycle of all the communications.

To offer scalability without increasing the recovery cycle, new computation heuristics which approximate and reduce the bandwidth information (protection costs especially) transmitted in the network have emerged [7,9]. Once the bandwidth information is collected, nodes aggregate it before flooding it in the network. For instance, in [7], Kini proposes to approximate the protection cost $\delta_r^\lambda$ of a risk r on a (unidirectional) link λ by the maximum of the protection costs ($G_\lambda$) on that link. In this way, only one aggregated value per link is advertised in the network. This heuristic has the advantage of being easy to deploy.
Fig. 2. Protection pool of an arc λ
Indeed, this requires only slight modifications to IGP-TE protocols for the advertisement of the minimal quantities of protection bandwidth on links. However, this heuristic does not exploit efficiently the bandwidth sharing. As a result, the number of backup LSPs that can be built with this heuristic is low.
4
Distributed PLR-Based Backup Path Computation Heuristic (DPLRH)
4.1
DPLRH Principles
DPLRH allows an efficient approximation of the protection costs on the links with the advertisement of a small quantity of aggregated protection bandwidth information. It is based on the following two principles:

– An arc λ can be used to establish a new backup LSP b requiring a quantity of bandwidth bw if and only if the protection costs of the risks protected by that LSP (on λ) are lower than or equal to $BC_\lambda - bw$. As a result, knowing only the protection costs (and their corresponding risks) which are higher than $BC_\lambda - bw$ is sufficient to decide without error whether λ can be selected to be in the backup LSP b.
– Some values of protection cost on an arc can be very low. Aggregating these values and approximating them by their maximum can decrease the quantity of protection information to be advertised in the network with little or no deterioration of the bandwidth sharing.

Algorithm 1. End node oλ of an arc λ
  Assign to new_xλ_vector, associated to λ, the xλ highest protection costs and their corresponding risk identifiers
  if the (xλ + 1)th highest protection cost is higher than Tsλ then
    Replace, in new_xλ_vector, the identifier corresponding to the xλ-th highest protection cost by the special identifier generic_risk
  end if
  if new_xλ_vector ≠ old_xλ_vector then
    Advertise new_xλ_vector
  end if
  old_xλ_vector ← new_xλ_vector

Algorithm 2. Each node receiving an xλ vector associated to λ
  if generic_risk ∈ xλ_vector then
    min_cost ← xλ_vector[generic_risk]
  else
    min_cost ← 0
  end if
  for each risk identifier id do
    if id ∈ xλ_vector.index() then
      local_risk_protection_cost[λ][id] ← xλ_vector[id]
    else
      local_risk_protection_cost[λ][id] ← min_cost
    end if
  end for

To show how DPLRH exploits the two principles above, let us consider an example. In fig. 2, the protection pool of an arc λ is shown. For simplicity, we consider here that the capacity $C_\lambda$ of each arc λ is divided into two disjoint pools (protection pool and primary pool). In this manner, the task which computes the backup LSPs can be dissociated from that determining the primary LSPs. As illustrated in fig. 2, the protection capacity $BC_\lambda$ of the arc λ is equal to 100 units. This arc λ is used in the protection of seven risks: node1, node2, link1, link2, link3, link4 and srlg1. The protection costs associated with these risks are as follows: $\delta_{node_1}^\lambda = 100$, $\delta_{node_2}^\lambda = 60$, $\delta_{link_1}^\lambda = 75$, $\delta_{link_2}^\lambda = 40$, $\delta_{link_3}^\lambda = 5$, $\delta_{srlg_1}^\lambda = 80$, $\delta_{link_4}^\lambda = 80$.

When the maximal quantity of bandwidth maxbw (maxbw = 30) that an LSP can request is known, principle 1 of DPLRH allows us to deduce that all the risks whose protection costs are lower than or equal to the threshold $Ts_\lambda$ ($Ts_\lambda = BC_\lambda - maxbw$) can be ignored (approximated by zero) when a new backup LSP is computed. In fact, selecting the arc λ for a new backup LSP of bandwidth bw does not violate the bandwidth constraints upon a failure of a risk r if the protection cost $\delta_r^\lambda$ is lower than or equal to the threshold $Ts_\lambda$. This results from the following inequalities:

$$\begin{cases} \delta_r^\lambda \le Ts_\lambda = BC_\lambda - maxbw \\ bw - maxbw \le 0 \end{cases} \;\Rightarrow\; \delta_r^\lambda + bw \le BC_\lambda$$

Obviously, eliminating the protection costs which are lower than or equal to the threshold $Ts_\lambda$ ($Ts_\lambda = BC_\lambda - maxbw = 70$) from the information to be advertised in the network does not alter the decision of excluding (or including) the arc λ in a subsequent backup LSP computation. Typically, in fig. 2, the outgoing node oλ of the arc λ, which is responsible¹ for the advertisement of the protection costs $\{\delta_r^\lambda\}_r$ on the arc λ, approximates the protection costs of node2, link2 and link3 by zero. As a result, these risks (node2, link2 and link3) and their corresponding protection costs on the arc λ are not advertised in the network.
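The threshold rule of principle 1 can be checked with a short sketch that uses the protection costs of the example above. The function names are hypothetical, and the admission test is the arc selection condition (4) of Section 2, with risks missing from the table counted as having a protection cost of zero.

```python
def risks_to_advertise(costs, backup_capacity, max_bw):
    """Keep only the risks whose protection cost exceeds Ts = BC - maxbw."""
    threshold = backup_capacity - max_bw
    return {risk: cost for risk, cost in costs.items() if cost > threshold}


def arc_is_usable(costs, protected_risks, backup_capacity, bw):
    """Arc selection test: max cost over the protected risks <= BC - bw.

    Risks missing from `costs` are treated as having a protection cost of 0.
    """
    worst = max((costs.get(r, 0) for r in protected_risks), default=0)
    return worst <= backup_capacity - bw


# Protection costs of the example arc (BC = 100, maxbw = 30).
COSTS = {"node1": 100, "node2": 60, "link1": 75, "link2": 40,
         "link3": 5, "srlg1": 80, "link4": 80}

if __name__ == "__main__":
    advertised = risks_to_advertise(COSTS, backup_capacity=100, max_bw=30)
    print(sorted(advertised))
    # The partial table leads to the same decision as the full table
    # for every requested bandwidth up to maxbw.
    same = all(
        arc_is_usable(COSTS, [r], 100, bw) == arc_is_usable(advertised, [r], 100, bw)
        for r in COSTS for bw in range(0, 31)
    )
    print(same)
```

The dropped entries are exactly node2, link2 and link3, and the decision for any request of at most maxbw units is unchanged, as the inequalities above predict.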
When the value maxbw is high (or ignored by the nodes of the network), the quantity of bandwidth information advertised for each arc of the topology can be high and unacceptable. To avoid flooding the network while keeping bandwidth sharing high, DPLRH limits the size of the protection bandwidth information that is advertised for each arc λ to a vector (called xλ vector) composed of xλ elements. Each xλ vector component is a pair made of a protection cost and its associated risk. Besides, the costs conveyed in the xλ vector of an arc λ correspond to the xλ highest values of protection cost. In this manner, each node receiving an xλ vector of an arc λ can deduce the xλ highest protection costs of the risks using λ for protection, and it approximates all the remaining protection costs by the (xλ)th highest protection cost (principle 2 of DPLRH). For instance, if we consider that xλ is equal to 2 in fig. 2, the outgoing node oλ to the arc λ will send the following xλ vector: [(node1, 100), (generic risk, 80)] (where generic risk refers to any risk different from those conveyed in the xλ vector).

¹ The end nodes of each arc λ know all the values of protection cost on this arc λ.

M.Y. Saidi, B. Cousin, and J.-L. Le Roux

4.2 DPLRH Algorithm Description
With the combination of DPLRH’s principles 1 and 2, we obtain the algorithms Alg. 1 and Alg. 2, which specify the steps of the protection cost advertisement and collection. Alg. 1 describes the procedure of protection cost advertisement, which limits the transmitted bandwidth information for each arc λ to an xλ vector. This vector contains, at most, the xλ highest protection costs which are larger than the threshold Tsλ. Alg. 2 specifies the procedure used to approximate the protection costs from the received xλ vector information.

In order to increase the bandwidth sharing and to reduce the size of messages transmitting the xλ vectors, each node running Alg. 1 eliminates from its protection cost table all the entries corresponding to the risks which are included in others. In fig. 2 for instance, the SRLG risk srlg1 includes two link risks: link1 and link3. As a result, these two risks (link1 and link3) and their corresponding protection costs are deleted from all the protection cost tables before any advertisement of the xλ vectors. After this step, the outgoing node oλ to each arc λ deduces the xλ vector containing (at most) the xλ highest protection costs on λ and their corresponding risks. For the arc in fig. 2, the corresponding xλ vector is: [(node1, 100), (srlg1, 80), (link4, 80), (node2, 60), (link2, 40)]. At each change of the xλ vector corresponding to λ, node oλ advertises its new value in the network.

When a node (different from the end nodes of λ) receives an xλ vector corresponding to an arc λ, it runs the routine shown in Alg. 2 to approximate the protection costs on the arc λ. According to the threshold value (Tsλ) and the (xλ + 1)th highest protection cost (denoted by xλ plus 1 cost) on the arc λ, the DPLRH decision of excluding the arc λ in the next backup LSP computation can be “sure and correct” or “possibly wrong”.

Two areas are defined to measure the degree of correctness of the protection cost approximation used by DPLRH: the doubtful zone (xλ plus 1 cost > Tsλ) and the sure zone (xλ plus 1 cost ≤ Tsλ).
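The advertisement and collection steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based cost table and the `GENERIC_RISK` key are assumptions, and the truncation rule (replace the last slot by a generic risk entry carrying the xλ-th highest cost when more than xλ costs exceed the threshold) is inferred from the examples given for fig. 2.

```python
GENERIC_RISK = "generic_risk"  # hypothetical key standing for "any other risk"


def build_x_vector(costs, included_risks, x, threshold):
    """Sketch of the advertisement side (Alg. 1), for x >= 1.

    costs: dict risk -> protection cost on the arc
    included_risks: risks contained in another risk (e.g. links of an SRLG)
    """
    # Drop risks included in others and costs not larger than the threshold.
    candidates = [(risk, cost) for risk, cost in costs.items()
                  if risk not in included_risks and cost > threshold]
    # Highest costs first; the sort is stable, so ties keep insertion order.
    candidates.sort(key=lambda item: -item[1])
    if len(candidates) <= x:
        return candidates
    # Assumed truncation: the last slot aggregates all remaining risks.
    return candidates[:x - 1] + [(GENERIC_RISK, candidates[x - 1][1])]


def approximate_costs(x_vector, all_risks):
    """Sketch of the collection side (Alg. 2)."""
    advertised = dict(x_vector)
    min_cost = advertised.get(GENERIC_RISK, 0)
    return {risk: advertised.get(risk, min_cost) for risk in all_risks}


# Values of fig. 2: BC = 100, maxbw = 30, so the threshold is 70.
fig2_costs = {"node1": 100, "node2": 60, "link1": 75, "link2": 40,
              "link3": 5, "srlg1": 80, "link4": 80}
fig2_included = {"link1", "link3"}  # both belong to srlg1

vector_x2 = build_x_vector(fig2_costs, fig2_included, 2, 70)
vector_x3 = build_x_vector(fig2_costs, fig2_included, 3, 70)
```

With these inputs, `vector_x2` carries a generic risk entry (doubtful zone) while `vector_x3` lists the three costs above the threshold and lets every other risk default to zero (sure zone).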
Distributed PLR-Based Backup Path Computation in MPLS Networks
Doubtful area (xλ plus 1 cost > Tsλ). When the (xλ + 1)th highest protection cost on an arc λ is in the doubtful area, the advertisement of an xλ vector whose size is equal to xλ can be insufficient to decide without mistake whether the arc λ can be selected for a new backup LSP. In fig. 2 for instance, the (xλ + 1)th highest protection cost on the arc λ is in the doubtful area if xλ ≤ 2. Thus, the advertisement of an xλ vector containing at most the two highest protection costs on λ can be insufficient. Typically, with xλ = 2, the outgoing node oλ to the arc λ advertises this xλ vector in the network: [(node1, 100), (generic risk, 80)]. When a node n receives the xλ vector transmitted by oλ, it updates its protection cost table by approximating the protection costs on the arc λ as follows (Alg. 2):

$$\delta^{\lambda}_{node_1} = 100, \qquad \forall r\ (r \text{ is a risk} \wedge r \neq node_1):\ \delta^{\lambda}_{r} = 80$$

As we see here, all the risks whose protection costs on the arc λ are not transmitted within the xλ vector are aggregated and approximated by the (xλ)th highest protection cost on this arc. As this last cost is higher than the threshold, any computation of a new backup LSP that protects against a risk not conveyed in the transmitted xλ vector of λ and that claims a quantity of bandwidth bw (bw > BCλ − xλ plus 1 cost) excludes the arc λ by mistake. However, any other computation will include or exclude the arc λ without mistake. In fig. 2 for instance, after the reception by node n of this xλ vector (xλ vector = [(node1, 100), (generic risk, 80)]), node n excludes the arc λ by mistake in its next computation of a backup LSP if this LSP requires a quantity of bandwidth larger than 20 (20 = BCλ − 80) and protects against failure risks which do not belong to {node1, srlg1, link4}; otherwise, the arc λ is included or excluded without mistake.
Sure area (xλ plus 1 cost ≤ Tsλ). When the (xλ + 1)th highest protection cost on an arc λ is in the sure area, the advertisement of an xλ vector whose size is equal to xλ is sufficient to decide without mistake whether the arc λ can be selected (or not) for a new backup LSP (principle 1 of DPLRH). In fig. 2 for instance, the (xλ + 1)th highest protection cost on the arc λ is in the sure area when xλ > 2. Thus, the advertisement of an xλ vector containing at least the three highest protection costs on λ is sufficient. Typically, when xλ > 2 is chosen, the outgoing node oλ to the arc λ transmits an xλ vector including (at most) the three highest protection costs which are greater than the threshold Tsλ = 70. This xλ vector corresponds to: [(node1, 100), (srlg1, 80), (link4, 80)]. When a node n receives the xλ vector transmitted by oλ, it updates its protection cost table by approximating the protection costs on the arc λ as follows (Alg. 2):

$$\delta^{\lambda}_{node_1} = 100 \;\wedge\; \delta^{\lambda}_{srlg_1} = 80 \;\wedge\; \delta^{\lambda}_{link_4} = 80, \qquad \forall r\ (r \text{ is a risk} \wedge r \neq node_1 \wedge r \neq srlg_1 \wedge r \neq link_4):\ \delta^{\lambda}_{r} = 0$$
Fig. 3. Test network
Contrary to the case xλ plus 1 cost > Tsλ, where the protection costs of the risks which are not conveyed in the advertised xλ vector are approximated by the (xλ)th highest protection cost on the arc λ, in the sure area these protection costs are aggregated and approximated by zero. As explained in the previous section, this approximation does not alter the decision of including or excluding the arc λ in the next computation of a backup LSP b. Indeed, if the new backup LSP b protects against the failure of a risk belonging to {node1, srlg1, link4}, node n can easily deduce the exact value of the highest protection cost on λ of the risks protected by b. This highest protection cost corresponds to 100 units if b protects against the failure of node1, and to 80 units otherwise. As a result, node n decides without mistake whether the arc λ can be selected to be in b or not (formula (4) in section 2). When the new backup LSP b is planned to protect against failure risks which do not belong to {node1, srlg1, link4}, node n selects the arc λ, without risk of mistake, when it computes the backup LSP b (cf. principle 1 of DPLRH in section 4.1).
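The difference between the two areas can be illustrated with a small sketch. It assumes that an arc is selectable when the highest protection cost among the protected risks plus the requested bandwidth does not exceed the protection capacity, which is the condition used in the inequalities of section 4.1; the function name and the dictionaries are hypothetical, and the values are those of fig. 2.

```python
def arc_is_selectable(cost_on_arc, protected_risks, bw, protection_capacity):
    """Return True if the arc can carry a backup LSP of bandwidth bw.

    Assumed test: max cost over the protected risks + bw <= capacity.
    """
    worst = max(cost_on_arc[risk] for risk in protected_risks)
    return worst + bw <= protection_capacity


BC = 100        # protection capacity of the arc in fig. 2
THRESHOLD = 70  # BC - maxbw with maxbw = 30

# Exact costs after removing link1 and link3 (included in srlg1).
exact = {"node1": 100, "node2": 60, "link2": 40, "srlg1": 80, "link4": 80}

# View of a remote node for x = 2: [(node1, 100), (generic risk, 80)].
approx_x2 = {risk: (100 if risk == "node1" else 80) for risk in exact}

# View of a remote node for x = 3: costs above the threshold, others zero.
approx_x3 = {risk: (cost if cost > THRESHOLD else 0)
             for risk, cost in exact.items()}

# A backup LSP of 25 units protecting against node2.
decision_exact = arc_is_selectable(exact, ["node2"], 25, BC)
decision_x2 = arc_is_selectable(approx_x2, ["node2"], 25, BC)
decision_x3 = arc_is_selectable(approx_x3, ["node2"], 25, BC)
```

Here `decision_x2` differs from `decision_exact` (the arc is excluded by mistake in the doubtful area), whereas `decision_x3` agrees with it.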
5 Analysis and Simulation Results

5.1 Simulation Model
We evaluated the performance of our proposed heuristic DPLRH by comparing it to Kini’s flooding-based algorithm (FBA) and heuristic (FBH) described in section 3. Our choice is motivated by two reasons:
– With Kini’s algorithm, all the protection costs are advertised in the network. In this way, the backup LSP computations can be performed efficiently and without any bandwidth waste (due to approximations). Although FBA is not practical (it floods the network), we used it in our simulation to measure the approximation quality of the compared heuristics.
– With Kini’s heuristic, only the highest protection cost on a link is advertised in the network. Contrary to FBA, FBH is practical and it has been used in several backup LSP computation algorithms (as in [5,6]).
For our simulations, we used the network topology depicted in fig. 3, composed of 162 risks: 50 nodes, 87 bidirectional links and 25 SRLGs (represented
by ellipses in fig. 3). The bandwidth capacity of each arc was separated into two pools: a protection pool of capacity equal to 100 units and a primary pool of infinite capacity. The traffic matrix is generated randomly and consists of LSPs arriving one by one and asking for quantities of bandwidth uniformly distributed between 1 and 10. The ingress and egress nodes of each primary LSP are chosen randomly among the network nodes. Both the primary and backup LSP computations were based on Dijkstra’s algorithm.

Four variants of DPLRH were used in the simulations. The first one, DPLRH(∞,90), used a threshold (∀λ : Tsλ = 90) and an infinite xλ vector (∀λ : xλ = ∞) on all the arcs. The second variant, DPLRH(2,0), used an xλ vector of maximal size equal to 2 (∀λ : xλ = 2) but did not employ the threshold (∀λ : Tsλ = 0). The third variant, DPLRH(5,0), used an xλ vector of maximal size equal to 5 (∀λ : xλ = 5) and a null threshold (∀λ : Tsλ = 0). Finally, the last variant, DPLRH(5,90), used a threshold (∀λ : Tsλ = 90) and an xλ vector of maximal size equal to 5 (∀λ : xλ = 5).

Two metrics are used for the comparison: the ratio of rejected backup LSPs (RRL) and the mean number of parameter changes per backup LSP (NPC). The first metric measures the ratio of backup LSP requests that are rejected because of the lack of protection bandwidth. It is computed as the ratio between the number of backup LSP requests that are rejected and the total number of backup LSP requests. The second metric measures the mean number of changes in the bandwidth protection parameters which allow the approximation of the different protection costs. It is computed as the ratio between the number of changes in the protection bandwidth parameters and the number of backup LSPs built in the network. For DPLRH (resp. FBA), each change of the xλ vectors (resp. of any protection cost) increases the NPC, whereas only the changes in the minimal protection bandwidth allocated on arcs ({Gλ}λ) are counted with FBH. At each establishment of 20 primary LSPs, the two metrics RRL and NPC are computed for each algorithm or heuristic. We note that our simulation results correspond to the average values of 1000 runs generated randomly.

5.2 Results and Analysis
Fig. 4 depicts the evolution of RRL as a function of the number of primary LSPs set up in the network. As expected, FBA and DPLRH(∞,90) have RRL values better than those of DPLRH(5,90), DPLRH(5,0) and DPLRH(2,0) which, in turn, have better RRL values than those of FBH. This is due to the complete knowledge of the protection costs, which are necessary for the backup LSP computation, with FBA and DPLRH(∞,90), whereas only partial information consisting of the 5 (resp. 5, 2, 1) highest protection costs is used for the backup LSP computation with DPLRH(5,90) (resp. DPLRH(5,0), DPLRH(2,0), FBH). Due to the localization of the backup LSPs, which tend to traverse arcs close to the protected risks, the probability P(δrλ > Tsλ) is close to zero when the distance between the risk r (its components) and the arc λ is high. As a result, the number of protection costs which are higher than the threshold on the arc λ (i.e. protection costs required for the backup LSP computation in order to
Fig. 4. Ratio of rejected backup LSPs (RRL)
Fig. 5. Mean number of messages sent in the network per backup LSP (NPC)
avoid the rejection by mistake of λ) depends strongly on the number of risks located in the close neighborhood of λ. Typically, in the network of fig. 3 where the mean degree of nodes is 3.5 and where the number of SRLGs is equal to 25, the advertisement of the 5 highest protection costs seems to be sufficient to avoid the blocking by mistake of protection requests when the number of primary LSPs is lower than 700. When the number of primary LSPs is higher than 700, the use of an xλ vector of size equal to 5 results in the rejection by mistake of some protection requests but, globally, the corresponding RRL is very close to the ideal case (the case where all the protection costs are systematically advertised).

Concerning the second metric, fig. 5 shows that FBA has higher NPC values than those of DPLRH and FBH. This is due to the systematic advertisement of protection costs used with FBA. In fact, at each establishment of a new backup LSP, FBA advertises the new values of protection costs (on the backup LSP links), whereas only changes in the xλ highest protection costs which are larger than the threshold (resp. the single highest protection cost) on the backup LSP links require new xλ vector advertisements with DPLRH (resp. require the advertisement of the new highest protection costs with FBH). Concerning the comparison between the NPC values of DPLRH and FBH, we observe in fig. 5 that the NPC values of DPLRH(∞,90) and DPLRH(5,90) are lower than those of FBH, DPLRH(2,0) and DPLRH(5,0). This comes from the use, in DPLRH(∞,90) and DPLRH(5,90), of a high threshold Tsλ (Tsλ/BCλ = 0.9) which eliminates the flooding of a large number of xλ vectors. Thus, when the threshold value is known and high, setting xλ to infinity could be a very promising choice which decreases the blocking probability without flooding the network. The other important observation in fig. 5 is related to the similarity between the NPC values of DPLRH(∞,90) and those of DPLRH(5,90). This means that, for such a network load, the xλ vectors transmitted with DPLRH(∞,90) are nearly the same as those advertised with DPLRH(5,90). This explains the similarity between the RRL values of DPLRH(∞,90) and DPLRH(5,90) in fig. 4.
6 Conclusion
In this article, we proposed a completely distributed heuristic, called DPLRH, for backup LSP computation. Our heuristic allows a high-quality approximation of the protection information necessary for the backup LSP computation with the advertisement of a small, size-limited vector (xλ vector) per network arc λ. This xλ vector is formed of the xλ (xλ > 0) highest protection costs which are higher than a threshold Tsλ (Tsλ ≥ 0), and does not change at each backup LSP establishment. DPLRH has several advantages. Firstly, it is symmetrical and it balances computations equitably among the network nodes. Secondly, DPLRH does not involve communication between the entities computing the backup LSPs and those configuring them, since such tasks are performed by the same nodes (PLRs). Thirdly, DPLRH is scalable and it reaches a high degree of bandwidth sharing with the advertisement of a limited quantity of protection information (xλ vector information). Finally, our heuristic is easy to deploy since it requires only very slight extensions to the IGP-TE protocols. Simulation results show that DPLRH significantly decreases the number of rejected backup LSPs and the frequency of advertisements when the threshold and the maximal sizes of the xλ vectors are well chosen.
References
1. Meyer, P., Van Den Bosch, S., Degrande, N.: High Availability in MPLS-based Networks, Alcatel. Alcatel telecommunication review (4th Quarter 2004)
2. Ramamurthy, S., Mukherjee, B.: Survivable WDM Mesh Networks (Part 1 - Protection). In: IEEE INFOCOM, vol. 2, pp. 744–751 (1999)
3. Rosen, E., Viswanathan, A., Callon, R.: Multiprotocol Label Switching Architecture. RFC 3031 (January 2001)
4. Pan, P., Swallow, G., Atlas, A.: Fast Reroute Extensions to RSVP-TE for LSP Tunnels. RFC 4090 (May 2005)
5. Kodialam, M.S., Lakshman, T.V.: Dynamic Routing of Locally Restorable Bandwidth Guaranteed Tunnels using Aggregated Link Usage Information. In: IEEE INFOCOM, pp. 376–385 (2001)
6. Balon, S., Mélon, L., Leduc, G.: A Scalable and Decentralized Fast-Rerouting Scheme with Efficient Bandwidth Sharing. Computer Networks 50(16), 3043–3063 (2006)
7. Kini, S., Kodialam, K., Lakshman, T.V., Sengupta, S., Villamizar, C.: Shared Backup Label Switched Path Restoration. Internet Draft, IETF (May 2001), draft-kini-restoration-shared-backup-01.txt
8. Vasseur, J.P., Charny, A., Le Faucheur, F., Achirica, J., Le Roux, J.L.: Framework for PCE-based MPLS-TE Fast Reroute Backup Path Computation. Internet Draft, IETF (July 2004), draft-leroux-pce-backup-comp-frwk-00.txt
9. Saidi, M.Y., Cousin, B., Le Roux, J.-L.: A Distributed Bandwidth Sharing Heuristic for Backup LSP Computation. In: IEEE GlobeCom (November 2007)
Adaptive Multi-topology IGP Based Traffic Engineering with Near-Optimal Network Performance

Ning Wang, Kin-Hon Ho, and George Pavlou

Center for Communication Systems Research, University of Surrey
Guildford, United Kingdom GU2 XH
{N.Wang,K.Ho,G.Pavlou}@surrey.ac.uk
Abstract. In this paper we present an intelligent multi-topology IGP (MT-IGP) based intra-domain traffic engineering (TE) scheme that is able to handle unexpected traffic fluctuations with near-optimal network performance. First of all, the network is dimensioned through offline link weight optimization using Multi-Topology IGPs for achieving maximum path diversity across multiple routing topologies. Based on this optimized MT-IGP configuration, an adaptive traffic engineering algorithm performs dynamic traffic splitting adjustment for balancing the load across multiple routing topologies in reaction to the monitored traffic dynamics. Such an approach is able to efficiently minimize the occurrence of network congestion without the necessity of frequently changing IGP link weights, which may cause transient forwarding loops and routing instability. Our experiments based on real network topologies and traffic matrices show that our approach has a high chance of achieving near-optimal network performance with only a small number of routing topologies.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 654–666, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

Intra-domain Traffic Engineering (TE) based on IGPs such as OSPF and IS-IS has recently received considerable attention in the Internet research community [1-5]. In order to achieve near-optimal or even optimal network performance, it is suggested that both IGP link weights and traffic splitting ratios need to be optimized simultaneously [2, 3, 5] based on the traffic matrix (TM) and the network topology as input. However, this is only applicable to offline TE where knowledge of the estimated TM is assumed a priori. Unfortunately, this assumption is usually not valid in real operational networks given the frequent presence of traffic dynamics such as unexpected traffic spikes that are difficult to anticipate [6]. As a result, the absence of accurate traffic matrix estimation may lead the offline TE approaches to perform poorly. The most straightforward approach for handling this is to reassign IGP link weights dynamically in reaction to the monitored dynamics. However, re-assigning link weights on the fly may cause transient forwarding loops during the convergence phase, which often leads to service disruptions and traffic instability. In this paper, we propose AMPLE (Adaptive Multi-toPoLogy traffic Engineering), a novel IGP TE approach that is capable of adaptively handling traffic dynamics in operational IP networks. Instead of re-assigning IGP link weights in response to
traffic fluctuations, we adopt multi-topology IGPs (MT-IGPs) such as MT-OSPF [7] and M-ISIS [8] as the underlying routing platform to enable path diversity, based on which adaptive traffic splitting across multiple routing topologies is performed for dynamic load balancing. AMPLE consists of two distinct phases to achieve our TE objectives. First, the offline phase (e.g., at a weekly or monthly timescale) focuses on the static dimensioning of the underlying network, with MT-IGP link weights computed for maximizing intra-domain path diversity across multiple routing topologies. Since the objective is to obtain diverse IGP paths between each source/destination pair, the computation of MT-IGP link weights is actually agnostic to any traffic matrix. Once the optimized link weights have been deployed in the network, an adaptive TE algorithm performs traffic splitting ratio adjustment for load balancing across diverse IGP paths in multiple routing topologies, according to the up-to-date monitored traffic conditions. This adaptive TE aims to efficiently handle traffic dynamics at short time-scale such as hourly or even in minutes. Given the fact that traffic dynamics are common in operational IP networks, our proposed approach provides a promising and practical solution that allows network operators to efficiently cope with these dynamics that normally cannot be anticipated in advance. The contributions of our work can be summarized as follows. First of all, AMPLE does not require frequent and on-demand re-assignment of IGP link weights, thus minimizing the undesired transient loops and traffic instability. Second, the optimization of the MT-IGP link weights does not rely on the availability of traffic matrix a priori, which plagues existing IGP TE solutions due to inaccuracy of traffic matrix estimations. 
Finally, our experiments based on real network topologies and traffic matrices have shown that AMPLE has a very high chance of achieving near-optimal performance with only a small number of routing topologies.
2 Related Work

Based on the Equal Cost Multi Path (ECMP) capability in IGP routing, Fortz and Thorup [1] first proposed a local search heuristic to determine a set of IGP link weights that is able to achieve 50%-110% higher service capability in comparison to conventional IGP configurations. Some variations of this TE algorithm were also designed to obtain a set of link weights that is robust to traffic demand uncertainty [11] and link failures [12]. Sridharan et al. [3] revealed that near-optimal TE solutions can be achieved by carefully assigning traffic to some selected next-hops over ECMP paths at each eligible router. By modifying the next-hop entry in the forwarding table of a limited number of routers, uneven traffic splitting based on individual routing prefixes can be emulated. Another approach proposed in [4] is to enable arbitrary traffic splitting at the network edge only. The key idea of this approach is to optimally partition traffic demand into multiple sub-sets at ingress routers and route each of them within dedicated IP routing topologies. As far as resilience is concerned, the authors of [10, 13] proposed to use multiple MT-IGP routing topologies for fast IP recovery while balancing the load in the network after failures. However, none of the proposals above have considered dynamic TE based on the IGP. Most recently, D. Xu et al. proposed PEFT, a new IGP that is able to achieve optimal TE performance by applying traffic splitting with exponential penalties [5]. However, this approach
requires changes to the existing intra-domain routing protocols and it is not able to handle traffic dynamics. A more comprehensive TE survey is available in [9].
3 AMPLE Overview

As we have already mentioned, AMPLE encompasses two distinct tasks, namely (1) offline network dimensioning through link weight optimization for achieving maximum intra-domain path diversity across multiple MT-IGP routing topologies; and (2) adaptive traffic splitting ratio adjustment across these routing topologies for achieving dynamic load balancing in case of unexpected traffic dynamics. Regarding the first task, our link weight setting scheme is agnostic to traffic matrices, meaning that the input to the MT-IGP link optimization only includes the physical network topology. Fig. 1 illustrates how MT-IGP link weights can be assigned to provide intra-domain path diversity across three routing topologies between a single source/destination pair. The path shown in solid lines in each topology represents the shortest IGP path from the source node I to the destination node E. It can be seen that disjoint IGP paths are achieved through this link weight setting. Of course, the problem of maximizing IGP path diversity becomes much more complicated if the MT-IGP link weights are computed for all source-destination pairs, for instance, in a Point-of-Presence (PoP) topology where every node may send/receive traffic. Once the optimized MT-IGP link weights have been configured in the network, dynamic traffic control through traffic splitting at source nodes can be performed through the provisioned diverse IGP paths.
[Figure: three panels (Topology 1, Topology 2, Topology 3), each showing nodes I, A, B, C, D and E with link weights of 1 or 2]
Fig. 1. An example of MT-IGP link weight setting for path diversity
Arbitrary traffic splitting is usually required for achieving optimal TE performance [2, 3, 5]. However, in plain OSPF/IS-IS networks, this prerequisite is not possible since these protocols only allow equal traffic splitting onto multiple equal cost paths. Another concern is that changing link weights may cause transient forwarding loops and traffic stability problems. Therefore, this should not be done frequently in reaction to traffic dynamics. By taking these issues into account, a solution that enables unequal traffic splitting while avoiding changing link weights frequently would be attractive. In AMPLE, the MT-IGP configuration produced in the offline phase provides an opportunity to use multiple diverse IGP paths for carrying traffic with arbitrary splitting across multiple routing topologies. More specifically, each source node can adjust the splitting ratio of its local traffic (through remarking the Multi-Topology ID field of the IP packets) according to the monitored traffic and network conditions in order to achieve sustainable optimized network performance (e.g. minimize the Maximum Link Utilization, MLU). It is worth mentioning that the computing of new traffic splitting ratios at each source PoP node is performed by a
central traffic engineering manager. This TE manager has the knowledge about the entire network topology and periodically gathers the overall network status such as the current utilization of each link and traffic matrices based on which the new traffic splitting ratio is computed and thereafter enforced at individual source nodes.
4 Proposed Algorithms

4.1 Offline Link Weight Optimization

The network model for MT-IGP link weight optimization is described as follows. The network topology is represented as a directed graph G = (V, E) [...]

$$DoI_l^{u,v} = \sum_{r \in R} x_l^{u,v}(r) \qquad (1)$$

where $x_l^{u,v}(r)$ indicates whether link l constitutes the shortest IGP path between u and v in routing topology r:

$$x_l^{u,v}(r) = \begin{cases} 1 & \text{if } l \in P_{u,v}(r) \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Our ultimate objective is to minimize the chance that a single link is shared by all routing topologies between each source-destination pair. The aim is to avoid introducing critical links with potential congestion, which the associated source-destination pairs cannot avoid using no matter which routing topology is used. Towards this end, we define the Full Degree of Involvement (FDoI), which indicates whether a critical link l is included in the IGP paths between source-destination pair (u, v) in all routing topologies:

$$FDoI_l^{u,v} = \begin{cases} 1 & \text{if } DoI_l^{u,v} = |R| \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

In summary, the MT-IGP link weight optimization problem is formally described as follows: calculate |R| sets of positive link weights W(r) = {wl(r)} : wl(r) > 0, r ∈ R, in order to minimize:

$$\sum_{u,v \in V} \sum_{l \in E} FDoI_l^{u,v} \qquad (4)$$
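The objective (4) can be evaluated for a candidate set of link weights with a sketch such as the one below. It is only an illustration under stated assumptions: each topology is given as a dictionary of positive weights on directed links, a single shortest path is used per topology (equal-cost ties are broken arbitrarily rather than handled as ECMP), and the function names are hypothetical.

```python
import heapq
from collections import defaultdict


def shortest_path_links(weights, src, dst):
    """Links of one shortest path from src to dst (Dijkstra).

    weights: dict {(u, v): w} of positive directed link weights.
    """
    adjacency = defaultdict(list)
    for (a, b), w in weights.items():
        adjacency[a].append((b, w))
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        if node == dst:
            break
        for nxt, w in adjacency[node]:
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return set()
    links = set()
    node = dst
    while node != src:
        links.add((prev[node], node))
        node = prev[node]
    return links


def fdoi_objective(topologies, pairs):
    """Sum of FDoI over the given (u, v) pairs and all links."""
    total = 0
    for u, v in pairs:
        doi = defaultdict(int)  # degree of involvement of each link
        for weights in topologies:
            for link in shortest_path_links(weights, u, v):
                doi[link] += 1
        total += sum(1 for count in doi.values() if count == len(topologies))
    return total


# Toy example: two parallel two-hop paths from I to E.
topo_via_a = {("I", "A"): 1, ("A", "E"): 1, ("I", "B"): 2, ("B", "E"): 2}
topo_via_b = {("I", "A"): 2, ("A", "E"): 2, ("I", "B"): 1, ("B", "E"): 1}
diverse = fdoi_objective([topo_via_a, topo_via_b], [("I", "E")])
identical = fdoi_objective([topo_via_a, topo_via_a], [("I", "E")])
```

In this toy case `diverse` is 0 because the two topologies use disjoint paths, while `identical` is 2 because both links of the path via A are used in every topology.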
We designed and implemented a Genetic Algorithm (GA) based scheme to compute the MT-IGP link weights for the problem formulated above. The cost function (fitness) is designed as:

$$\frac{\lambda}{\sum_{u,v \in V} \sum_{l \in E} FDoI_l^{u,v}} \qquad (5)$$

where λ is a constant value. In our GA based approach each chromosome C is represented by a link weight vector for |R| routing topologies: C = {W(r) | r ∈ R}. The total number of chromosomes in each generation is set to 100. According to the basic principle of Genetic Algorithms, chromosomes with a better fitness value have a higher probability of being inherited in the next generation. To achieve this, we first rank all the chromosomes in descending order according to their fitness, i.e., the chromosomes with high fitness are placed at the top of the ranking list. Thereafter, we partition this list into two disjoint sets, with the top 50 chromosomes belonging to the upper class (UC) and the bottom 50 chromosomes to the lower class (LC). During the crossover procedure, we select one parent chromosome from UC and the other parent from LC in generation i for creating the child $C_{i+1}$ in generation i + 1. Specifically, we use a crossover probability threshold $K_C \in [0, 0.5)$ to decide which parent's genes are inherited by the child chromosome in the next generation. We also introduce a mutation probability threshold $K_M$ to randomly replace some old genes with new ones. The optimized MT-IGP link weights are pre-configured in the network as the input for adaptive traffic engineering, which will be detailed in the next section.

4.2 Adaptive Traffic Control

In this section, we present an efficient algorithm for adaptive adjustment of the traffic splitting ratio at individual PoP source nodes. In a periodic fashion at a relatively short time interval (e.g., hourly), the central TE manager needs to perform the following three operations:
1. Measure the incoming traffic volume and the network load for the current interval.
2. Compute new traffic splitting ratios for all PoP nodes based on the measured traffic demand and the network load for dynamic load balancing.
3. Instruct individual PoP nodes to enforce the new traffic splitting ratio over their locally originated traffic.
We start by defining the following parameters:
– t(u,v): traffic between PoP nodes u and v.
– φu,v(r): traffic splitting ratio of t(u,v) at u on routing topology r, 0.0 ≤ φu,v(r) ≤ 1.0.
The algorithm consists of the following steps. We define an iteration counter k which is set to zero initially.

Step-1: Identify the most utilized link lmax in the network.

Step-2: For the set of traffic flows that are routed through lmax in at least one but not all the routing topologies (i.e. $\{t(u,v) \mid 0 < DoI_{l_{max}}^{u,v} < |R|\},\ \forall u,v \in V$), consider each at a time and compute its new traffic splitting ratio among the routing topologies until the first feasible one is identified. A feasible traffic flow means that, with the new splitting ratios, the utilization of lmax can be reduced without introducing new hot spots with utilization higher than the original value.

Step-3: If such a feasible traffic flow is found, accept the corresponding new splitting ratio adjustment. Increment the counter k by one and go to Step-1 if the maximum K iterations have not been reached (i.e. k < K). If no feasible traffic flow exists or k = K, the algorithm stops and thereafter the TE manager instructs individual PoPs with the currently computed traffic splitting ratios. The parameter K limits the algorithm to at most K iterations in order to avoid long running time.

In Step-2, the task is to examine the feasibility of reducing the load of the most utilized link by decreasing the splitting ratios of a traffic flow assigned to the routing topologies that use this link, and shifting a proportion of the relevant traffic to alternative paths with lower utilization in other topologies. More specifically, the adjustment works as follows. First of all, a deviation of traffic splitting ratio, denoted by δ where 0.0 < δ ≤ 1.0, is taken out for trial. For the aggregate traffic flow t(u,v) under consideration, let R+ be the set of routing topologies in which the IGP paths from u to v traverse lmax. The main idea is to decrease the sum of traffic splitting ratios on all the routing topologies in R+ by δ and at the same time to increase the sum of the ratios on the other topologies that do not use lmax by δ (we denote this set of topologies by R−, where R− = R \ R+). Specifically, for all the topologies in R+, which share a common link with the same (maximum) utilization, their traffic splitting ratios are evenly decreased. Hence, the new traffic splitting ratio for each routing topology in R+ becomes:
$$\phi_{u,v}(r)' = \phi_{u,v}(r) - \frac{\delta}{|R^{+}|}, \quad \forall r \in R^{+}$$
On the other hand, let μr be the bottleneck link utilization of the IGP path in routing topology r∈R-. The traffic splitting ratio of each routing topology in R- increases in an inverse proportion to its current bottleneck link utilization, i.e.

$$\phi_{u,v}(r)' = \phi_{u,v}(r) + \left( \frac{1 - \mu_r}{\sum_{r' \in R^{-}} (1 - \mu_{r'})} \times \delta \right), \quad \forall r \in R^{-}$$
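As a minimal sketch of the two update rules above for a single trial value of δ, the following Python function shifts a total share δ of one flow's traffic from R+ to R-. The function name, the dictionary-based representation and the example numbers are assumptions made for illustration; they do not come from the paper.

```python
def adjust_split(phi, r_plus, mu, delta):
    """Return new splitting ratios after moving a share `delta` from R+ to R-.

    phi    : dict topology -> current splitting ratio of the flow (sums to 1)
    r_plus : set of topologies whose path for this flow crosses l_max
    mu     : dict topology in R- -> bottleneck utilization of its path
    delta  : total share to move, 0 < delta <= 1
    """
    r_minus = [r for r in phi if r not in r_plus]
    if not r_plus or not r_minus:
        raise ValueError("the flow must use l_max in some but not all topologies")
    headroom = sum(1.0 - mu[r] for r in r_minus)
    if headroom <= 0.0:
        raise ValueError("no spare capacity on the paths in R-")
    new_phi = {}
    for r in phi:
        if r in r_plus:
            # even decrease across the topologies that use l_max
            new_phi[r] = phi[r] - delta / len(r_plus)
        else:
            # increase weighted by the spare capacity 1 - mu_r
            new_phi[r] = phi[r] + (1.0 - mu[r]) / headroom * delta
    return new_phi


# Hypothetical example: topology 1 uses l_max, topologies 2 and 3 do not.
new_phi = adjust_split({1: 0.5, 2: 0.3, 3: 0.2}, {1}, {2: 0.2, 3: 0.6}, 0.3)
print(new_phi)
```

The total of the ratios is preserved by construction. The caller still has to reject a trial in which any ratio in R+ becomes negative, which is the feasibility check used in Fig. 2.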
The lower (higher) the bottleneck link utilization, the more (less) the traffic splitting ratio is increased. An important issue to be considered is the value setting for δ. If not appropriately set, it may lead to either slow convergence or overshoot of the traffic splitting ratio, both of which are undesirable. On the one hand, too large a value of δ may miss the chance to obtain desirable splitting ratios due to the large gap between successive trials. On the other hand, too small (i.e. too conservative) a value of δ may cause the algorithm to perform many iterations before the most appropriate value of δ is found, thus causing slow convergence to the equilibrium. Taking these considerations into account, we apply an algorithm that increases δ exponentially, starting from a sufficiently small value. If this adjustment is able to continuously reduce the utilization of lmax without introducing negative new splitting ratios on R+, the value of δ is increased exponentially for the next trial until no further improvement on the utilization can be
N. Wang, K.-H. Ho, and G. Pavlou
Notation: U(l) is the utilization of link l
Require: A set of MT-IGP topologies R, constants K and Ω

glb_improve = TRUE, k = 0
while (glb_improve & k < K) do
    lmax ← the most utilized link in the network
    Let T' be the set of traffic flows routed over lmax in at least one but not all of the routing topologies
    t(u,v) ← the first traffic flow in T'
    feasible_fnd = FALSE
    while (!feasible_fnd & not all flows in T' are examined) do
        R+ ← the set of routing topologies that use lmax for t(u,v)
        R- ← R \ R+, ω = 0, μmax = U(lmax)
        best_dlt = 0, cont = TRUE
        while (ω ≤ Ω & cont) do
            δ = 1 / 2^(Ω−ω)
            φu,v(r)' = φu,v(r) − δ / |R+|   ∀r ∈ R+
            μr ← the bottleneck link utilization of the path for t(u,v) in topology r ∈ R-
            φu,v(r)' = φu,v(r) + ((1 − μr) / Σ_{r'∈R-} (1 − μr') × δ)   ∀r ∈ R-
            l'max ← the most utilized link among those traversed by t(u,v) in all the routing topologies if φu,v(r)' is to be implemented
            if (U(l'max) < μmax & φu,v(r)' ≥ 0 ∀r ∈ R+) then
                μmax = U(l'max), best_dlt = δ, ω = ω + 1
            else
                cont = FALSE
            end if
        end while
        if μmax < U(lmax) then
            accept the adjusted splitting ratios based on best_dlt
            feasible_fnd = TRUE
            k = k + 1
        else
            t(u,v) ← next traffic flow in T'
        end if
    end while
    l*max ← the current most utilized link in the network
    if U(l*max) ≥ U(lmax) then
        glb_improve = FALSE
    end if
end while
Fig. 2. Pseudo code - Adaptive traffic splitting ratio adjustment algorithm
made or the value of δ reaches 1.0 (i.e. the maximum traffic splitting ratio that can be applied). The exponential increment of δ works as follows.

$$\delta = \frac{1}{2^{\Omega - \omega}}$$
where Ω is a constant that can be set by the network operator, and ω is the iteration counter, which runs from 0 to Ω so that δ doubles at each trial from 2^(−Ω) up to 1.0. The pseudo code for the algorithm is shown in Fig. 2.

Finally, we discuss some implementation issues on the adaptive traffic control functionality. First of all, traffic measurement plays an important role, as it provides the necessary information on the network status to the central TE manager. In AMPLE, we adopt a hop-by-hop based monitoring mechanism that is similar to the proposal of [16]. The basic idea is that each PoP node within the network is responsible for monitoring the performance (e.g., bandwidth utilization) of the associated outgoing inter-PoP links. In a synchronized fashion, the PoP nodes periodically report the link conditions to the central TE manager. Compared to other end-to-end and distributed traffic monitoring mechanisms, this approach has distinct advantages such as the ease of identifying the most utilized link in the network.
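The exponential trial of δ in the inner loop of Fig. 2 can be sketched as follows. This is only an illustration under stated assumptions: `evaluate` is a hypothetical caller-supplied function that returns the utilization of the most utilized link traversed by the flow if δ were applied, or `None` when some ratio in R+ would become negative; neither the function names nor the example values come from the paper.

```python
def best_delta(evaluate, omega_max, mu_start):
    """Try delta = 1 / 2**(omega_max - w) for w = 0, 1, ..., omega_max.

    evaluate  : hypothetical callback, delta -> resulting utilization or None
    omega_max : the operator-set constant (Omega in the text)
    mu_start  : utilization of l_max before any adjustment
    Returns (best delta found, utilization reached); delta is 0.0 if no trial helps.
    """
    best, mu_best = 0.0, mu_start
    for w in range(omega_max + 1):
        delta = 1.0 / 2 ** (omega_max - w)
        util = evaluate(delta)
        if util is None or util >= mu_best:
            break  # infeasible or no further improvement: stop doubling
        best, mu_best = delta, util
    return best, mu_best


# With omega_max = 3 the trial values are 0.125, 0.25, 0.5 and 1.0.
schedule = [1.0 / 2 ** (3 - w) for w in range(4)]
print(schedule)

# Hypothetical utilization curve with its minimum near delta = 0.3.
print(best_delta(lambda d: abs(d - 0.3) + 0.5, 3, 0.9))
```

The search stops at the first trial that fails, so it evaluates at most Ω + 1 candidate values per flow.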
5 Performance Evaluation

5.1 Experiment Setup

In order to evaluate the performance of AMPLE, we use the real topologies and traffic matrices from the GEANT [14] and Abilene [15] networks. The GEANT network topology contains 23 PoP nodes and 74 links, most of which are OC48 (2.5Gbps) and OC192 (10Gbps), but it also has a few links with a low capacity of 155Mbps. The traffic matrices have been derived every 15 minutes for several months. We present results based on a 7-day traffic matrix dataset obtained from the TOTEM Project [17]. The Abilene network topology contains 12 nodes and 30 links, most of which are OC192, but the link between Indianapolis and Atlanta has 2.5Gbps capacity.

5.2 Path Diversity Performance

In this section we present our simulation results for the offline MT-IGP link weight optimization. The performance metric we use to evaluate path diversity is the proportion of source-destination pairs that can successfully avoid any critical link with FDoI (i.e., shared by all routing topologies). Fig. 3(a) shows the path diversity performance (i.e. the proportion of source/destination pairs that can fully avoid critical links) in the GEANT topology with: (1) optimized MT-IGP link weight setting for maximizing intra-domain path diversity (Optimized), and (2) random link weight setting in all routing topologies (Random). From the figure we can see that the optimized link weight setting substantially outperforms the random solution in terms of path diversity. More specifically, our algorithm is able to guarantee 100% avoidance of critical links shared by all topologies with only three routing topologies. In this case, in the event of network congestion, the associated sources are always able to remark their local traffic to enforce alternative IGP path selection to bypass the congested link. Fig. 3(b) shows the path length distribution of the individual schemes, including the actual link weight setting (Actual), link weights proportional to inverse bandwidth capacity (InvCap) and our proposed GA-based scheme (Optimized), all with three routing topologies. We can see that our proposed algorithm leads to some longer paths due to the efforts for maximizing path diversity across multiple routing topologies, which accounts for some increment in the overall network cost.
Fig. 4 shows the corresponding performance in the Abilene network. Again, we can see that a larger number of routing topologies leads to higher path diversity. It should also be noted that the overall path diversity performance is not as good as that of the GEANT topology. This is mainly due to the fact that one of the PoP nodes in Atlanta has a node degree of one, thus it is not possible to provide path diversity for it. If we ignore this node, the corresponding path diversity performance can reach 100% with four routing topologies. The lower path diversity of Abilene is also due to its lower mean node degree (2.5) compared to that of GEANT (3.2).
Fig. 3. MT-IGP link weight setting performance (GEANT): (a) proportion (%) versus number of topologies, for Optimized and Random; (b) distribution (%) versus path length, for Actual, InvCap and Optimized
Fig. 4. MT-IGP link weight setting performance (Abilene): (a) proportion (%) versus number of topologies, for Optimized and Random; (b) distribution (%) versus path length, for Actual, InvCap and Optimized
5.3 Adaptive Traffic Engineering Evaluation

We compare the following approaches in our adaptive TE evaluation:

– Actual: The actual link weight setting in the current operational networks.
– InvCap: Setting link weights proportional to inverse capacity.
– Multi-TM: We use the TOTEM toolbox [17] to compute a set of link weights for multiple traffic matrices. The objective is to make the IGP TE robust to traffic demand uncertainty. Specifically, the link weights are computed at the beginning of each day based on the sampled traffic matrices (one per hour) on the same day of the previous week.
– AMPLE-n: Our proposed adaptive TE algorithm that runs on top of n optimized MT-IGP routing topologies.
– Optimal: As the baseline for our comparisons, we use the GLPK in the TOTEM toolbox to compute the optimal MLU for the given topologies and traffic matrices.
We first present in Fig. 5 the MLU achieved by Actual and Optimal for all the TMs of the GEANT network during a week. As shown in the figure, the performance gap between Actual and Optimal is very large, which reveals that network resources are far from being utilized at maximum efficiency. In order to minimize, or if possible avoid, potential congestion caused by unexpected traffic spikes, it is desirable to keep the network utilization as close to Optimal as possible.
Fig. 5. MLU in GEANT (MLU versus interval; curves for Actual and Optimal)
Fig. 6. Network cost in GEANT (network cost versus interval; curves for InvCap, Actual, Multi-TM and AMPLE-3)
Fig. 7. Ratio of MLU relative to Optimal in GEANT (ratio versus interval; curves for InvCap, Actual, Multi-TM and AMPLE-3)
Fig. 7 plots the ratio of MLU relative to Optimal based on the same traffic matrices used in Fig. 5. The ratio is calculated as the MLU of a specific method divided by that of Optimal. For illustration purposes, we only included AMPLE with three topologies (AMPLE-3). Nevertheless, Table 1 shows additional statistics on both GEANT and Abilene networks (not shown in figures due to space limit) with the number of topologies varying from 2 to 4. Specific MLU performance metrics are defined:
– Average maximum link utilization (AMU): the average value of the MLU across all the traffic matrices during the seven-day period.
– Highest maximum link utilization (HMU): the highest value of the MLU across all the traffic matrices during the period.
– Proportion to near-optimal performance (PNO): the percentage over all the TMs in which AMPLE can achieve near-optimal performance. We define here the meaning of near-optimal to be the MLU that is within 3% gap to the optimality.
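The three metrics above can be computed from a per-interval MLU series as in the following sketch. It assumes that "within 3% gap" means a relative gap, i.e. an MLU of at most 1.03 times the optimal MLU for the same traffic matrix; the function name and the sample values are illustrative and are not taken from the paper's data.

```python
def mlu_metrics(mlu, optimal, gap=0.03):
    """Return (AMU, HMU, PNO in %) for one method.

    mlu     : MLU of the method for each traffic matrix
    optimal : optimal MLU for the same traffic matrices, in the same order
    gap     : relative tolerance that counts as near-optimal (assumed reading)
    """
    if not mlu or len(mlu) != len(optimal):
        raise ValueError("need two non-empty series of equal length")
    amu = sum(mlu) / len(mlu)          # average of the per-interval MLU
    hmu = max(mlu)                     # worst interval
    near = sum(1 for m, o in zip(mlu, optimal) if m <= o * (1.0 + gap))
    pno = 100.0 * near / len(mlu)      # share of near-optimal intervals
    return amu, hmu, pno


# Hypothetical four-interval example.
print(mlu_metrics([0.30, 0.52, 0.41, 0.35], [0.30, 0.50, 0.40, 0.30]))
```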
A common observation from the figure is that, based on the optimized MT-IGP link weights, AMPLE can achieve near-optimal MLU for most of the traffic matrices. On average, the MLU of Actual is nearly twice that of Optimal. This can be explained by the fact that the actual setting of link weights in the GEANT network mainly takes delay into account [18]. Although InvCap and Multi-TM perform better than Actual, their gaps from Optimal are still significant, with the relative ratio being around 1.5.

We further analyze the relevant statistics in Table 1. In the GEANT network, the Actual link weight approach produces an AMU that is 86% higher than the optimal value. Using InvCap achieves an AMU that is 52% higher than the optimal one, whereas with AMPLE the value varies between 0.1% and 43% depending on the number of routing topologies that are used. The larger the number of routing topologies, the closer the performance is to the optimum. For the PNO metric in Table 1, if AMPLE is based on two routing topologies, the value is only 13.1%, but it still significantly outperforms the Actual and InvCap approaches. We can now start to see the practical usefulness of our approach in improving network utilization: when the number of routing topologies increases to three, the PNO rises to 78.3%. With four routing topologies, AMPLE achieves near-optimal performance for 99.6% of all the traffic matrices. These results reveal that, for the GEANT network, AMPLE with four routing topologies has a very high chance of achieving near-optimal TE performance under any traffic matrix scenario. Our experiments based on the Abilene network also show similar results.

We also observed that the Multi-TM approach does not achieve good performance in minimizing the MLU according to Fig. 7. There are two reasons for this. First of all, the ultimate objective of Multi-TM is to minimize the network cost represented by a piece-wise linear function [1] rather than specifically minimizing MLU. Second, even if multiple traffic matrices with different pattern characteristics are considered in the link weight optimization, unexpected traffic spikes may still lead to poor TE performance. This is especially the case in the Abilene scenario (see HMU in Table 1).

Minimizing network cost is another important TE objective. To evaluate this, we adopt the commonly used piece-wise linear function [1] to indicate the actual network cost. By using this cost function, the two objectives of minimizing network bandwidth consumption and load balancing are taken into account simultaneously. Fig. 6 shows the corresponding performance in the GEANT network, whose pattern of traffic dynamics is quite regular on a daily basis. Overall, Multi-TM is the best performer since it optimizes the network cost as its primary objective. Although AMPLE has a higher network cost due to the trade-off with path diversity, the increase is small and acceptable. The network cost performance in Abilene is similar to that in GEANT, but due to the space limit it is not shown in the paper.
Table 1. Probability of achieving near-optimality

Optimization    GEANT (%)                   Abilene (%)
Method          AMU      HMU      PNO       AMU      HMU      PNO
Optimal         30.05    52.82    -         12.2     33.42    -
InvCap          45.72    94.41    1.6       19.58    62       0.15
Actual          55.74    96.91    0         19.59    63.24    1.19
Multi-TM        48.56    104.15   0.44      53.2     230      0.15
AMPLE-2         42.9     94.61    13.08     18.61    60.96    64.14
AMPLE-3         31.95    60.36    78.34     12.36    33.44    88.69
AMPLE-4         30.08    52.88    99.56     12.4     49.6     97.77
6 Summary

In this paper we presented AMPLE, a novel TE approach that enables dynamic load balancing in operational IP networks. Instead of frequently changing IGP link weights, we use multi-topology IGP routing protocols that allow traffic to be split adaptively across multiple routing topologies. Offline link weight optimization is performed in order to enable path diversity, followed by the adaptive control of traffic splitting across individual routing topologies according to the monitored traffic dynamics. Our simulation experiments based on the GEANT and Abilene networks and the respective traffic matrices have shown that AMPLE has a high chance of achieving near-optimal network performance with only a small number of topologies.
Acknowledgement

This work was partially funded by the EU FP6 AGAVE Project (IST-027609).
References

1. Fortz, B., Thorup, M.: Internet Traffic Engineering by Optimizing OSPF Weights. In: Proc. IEEE INFOCOM (2000)
2. Wang, Y., et al.: Internet Traffic Engineering without Full Mesh Overlaying. In: Proc. IEEE INFOCOM (2001)
3. Sridharan, A., et al.: Achieving Near-Optimal Traffic Engineering Solutions for Current OSPF/ISIS Networks. IEEE/ACM Transactions on Networking 13(2) (April 2005)
4. Wang, J., et al.: Edge-based Traffic Engineering for OSPF Networks. Computer Networks 48(4) (July 2005)
5. Xu, D., et al.: Link-state Routing with Hop-by-hop Forwarding Can Achieve Optimal Traffic Engineering. In: Proc. IEEE INFOCOM (2008)
6. Wang, H., et al.: COPE: Traffic Engineering in Dynamic Networks. In: Proc. ACM SIGCOMM (2006)
7. Psenak, P., et al.: Multi-Topology (MT) Routing in OSPF. RFC 4915 (2007)
8. Przygienda, T., et al.: M-ISIS: Multi Topology (MT) Routing in IS-IS. RFC 5120 (2008)
9. Wang, N., et al.: An Overview of Routing Optimization for Internet Traffic Engineering. IEEE Communications Surveys and Tutorials (to appear, 2008)
10. Kvalbein, A., et al.: Post-Failure Routing Performance with Multiple Routing Configurations. In: Proc. IEEE INFOCOM (2007)
11. Zhang, C., et al.: On Optimal Routing with Multiple Traffic Matrices. In: Proc. IEEE INFOCOM (2005)
12. Nucci, A., et al.: IGP Link Weight Assignment for Operational Tier-1 Backbones. IEEE/ACM Transactions on Networking 15(4), 789–802 (2007)
13. Menth, M., Martin, R.: Network Resilience through Multi-topology Routing. In: Proc. International Workshop on Design of Reliable Communication Networks (DRCN) (2005)
14. The GEANT topology: http://www.geant.net/upload/pdf/GEANT_Topology_12-2004.pdf
15. The Abilene topology and traffic matrices dataset: http://www.cs.utexas.edu/~yzhang/research/AbileneTM/
16. Asgari, H., et al.: Scalable Monitoring Support for Resource Management and Service Assurance. IEEE Network Magazine (November 2004)
17. Balon, S., et al.: Traffic Engineering an Operational Network with the TOTEM Toolbox. IEEE Transactions on Network and Service Management 4(1), 51–61 (2007)
18. Uhlig, S., et al.: Providing Public Intradomain Traffic Matrices to the Research Community. ACM SIGCOMM Computer Communication Review 36(1) (January 2006)
Modeling TCP in Small-Buffer Networks

Mark Shifrin and Isaac Keslassy

Department of Electrical Engineering
Technion - Israel Institute of Technology
Haifa 32000, Israel
{logout@tx,isaac@ee}.technion.ac.il
Abstract. Today, the large packet buffers present in backbone routers significantly increase their power consumption and design time. Recent models of networks with large buffers have suggested that these large buffers could be replaced with much smaller ones. Unfortunately, it turns out that these models are not valid anymore in networks with small buffers, and therefore cannot predict how these small-buffer networks will behave. In this paper, we introduce a new model that provides a complete statistical description of small-buffer Internet networks. First, we present novel models of the distributions of several network components, such as the line occupancies of each flow, the instantaneous arrival rates to the bottleneck queues, and the bottleneck queue sizes. Then, we combine all these models in a single fixed-point algorithm that forms the key to the global statistical small-buffer network model. In particular, given some QoS requirements, this new model can be used to precisely size small buffers in backbone router designs.

Keywords: Small-Buffer Network, Network Model, Backbone Routers.
1 Introduction
Current backbone routers typically contain extremely large buffers. These buffers take about half of their board space and a third of their power consumption [1]. They rely on massive amounts of SRAM and DRAM with fast access times, require complex scheduling algorithms to manage these SRAM and DRAM modules, and can take a significant amount of time to design [2].

These large buffer sizes typically result from a widely-followed rule of thumb, stating that router buffer sizes should be equal to the product of the typical (or worst-case) round-trip time and the router capacity [3]. This rule of thumb is derived when considering synchronized TCP flows. For instance, given a standard linecard with 40 Gbps and a 250-ms round-trip time, the rule of thumb dictates a large linecard buffer of 10 Gb, which needs several DRAM modules and cannot be practically implemented in SRAM alone.

Recent papers in the literature suggest that this rule of thumb overprovisions buffers by several orders of magnitude [4,5,6,7,8]. In fact, as the number of TCP flows becomes extremely large, they are much less synchronized, and therefore their sum is smoother than for synchronized flows, hence incurring smaller buffer needs.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 667–678, 2008. © IFIP International Federation for Information Processing 2008

More precisely, these papers argue that the TCP flows can be modeled as independent, and therefore, by the central limit theorem, the total number of TCP packets in the network converges to a Gaussian distribution. As a consequence, in this model, also known as the Stanford model, it is hypothesized that the needed buffer size is smaller by a factor of about √n than the rule of thumb for synchronized flows, where n is the number of TCP flows going through the buffer. For instance, given a million flows, the example above yields a buffer size of about 10 Mb in the Stanford model, which can be implemented in SRAM instead of DRAM.

If true, such a result would obviously entail significant architectural changes in backbone routers: for the same power budget, it would be possible to pack more lines, thus increasing the router capacity. The memory architecture would be much simplified. And the input and output queues might be packed together with the switch fabric in a single chip, hence increasing its modularity as well.

Unfortunately, because it analyzes networks with large buffers, the Stanford model assumes that most of the traffic variability is in the buffers, and not on the lines. However, this assumption no longer holds in small-buffer networks, where the variability in the line occupancy cannot be neglected. Therefore, a new model is needed for small-buffer networks. This is the objective of this paper.

Our work is related to several studies in the literature. First, the Stanford model is developed in [4, 5, 6], but none of these papers provides a complete network model for small-buffer networks. In contrast, [7,8,9] do provide network models, but they all assume Poisson arrivals to the bottleneck queues, and we will later see that these assumptions do not match simulations in small-buffer networks. Other models also consider non-droptail queueing policies [10]. Finally, [6] shows that buffers can be made even smaller, but it assumes that TCP is altered or access lines are made slower.

In this paper, we introduce a new method that provides a complete statistical description of large Internet small-buffer networks with TCP traffic. To do so, we consider each bottleneck queue, and successively build models for the distributions of several network components around this queue. We then connect all these models together in a closed loop, and derive the final network model as a result of a fixed-point equation. In particular, contributions of this paper include: (a) to our knowledge, the first-ever model for the traffic distribution on the links entering the bottleneck queues; (b) a new model for the instantaneous arrival rate to the bottleneck queues, including a decomposition along the input lines and a Gaussian-based model for the total rate; (c) a model for the occupancy distribution and packet loss rate of bottleneck queues that does not make assumptions on the incoming traffic load and does not assume that it is Poisson; and, most importantly, (d) a closed-loop model of small-buffer networks that enables us to determine their loss rate, as well as the distributions of the major network components. We also conclude this paper with a detailed discussion of assumptions and consequences of these results. In particular, these models can be used by router designers to determine the necessary router buffer size given any loss probability target in any large network topology.

The rest of the work is organized as follows. Section 2 presents the notations and the closed-loop model used in this paper. Section 3 contains the models of the different network components. Then, Section 4 presents simulation results. Finally, Section 5 discusses the assumptions used and Section 6 discusses the generality of the presented results.
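The two buffer-sizing figures quoted in the introduction (10 Gb under the rule of thumb, about 10 Mb under the Stanford model) can be reproduced with a few lines of arithmetic. The sketch below only restates those two formulas; the function names are chosen here for illustration.

```python
import math


def rule_of_thumb_bits(capacity_bps, rtt_s):
    """Buffer size as round-trip time times link capacity (synchronized flows)."""
    return capacity_bps * rtt_s


def stanford_bits(capacity_bps, rtt_s, n_flows):
    """Rule-of-thumb size divided by sqrt(n), as hypothesized by the Stanford model."""
    return capacity_bps * rtt_s / math.sqrt(n_flows)


# 40 Gbps linecard, 250 ms round-trip time, one million flows.
print(rule_of_thumb_bits(40e9, 0.25))        # rule of thumb, in bits
print(stanford_bits(40e9, 0.25, 1_000_000))  # Stanford model, in bits
```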
2 Notations and Closed-Loop Model

2.1 Notations
Our objective is to model a large network with small buffers. We will first formally reduce the problem to a simpler dumbbell topology problem, and then introduce the different notations used.

Assumption 1 (Dumbbell Topology). The large modeled network can be decomposed into subnetworks with a single bottleneck buffer in each subnetwork, each subnetwork being modeled using a dumbbell topology around its bottleneck buffer.

This is a classical assumption (see for instance [4, 6, 9, 11, 12]), which we further discuss in Section 5. The intuition behind it is that the behavior of TCP flows mainly derives from the congestion of the buffers on their path, and that in practice, a single buffer on their path typically causes most of the congestion.

As shown in Figure 1(a), the dumbbell topology includes n persistent TCP flows sharing the same bottleneck link of capacity C with buffer size B. For each flow i, we will denote the congestion window size as Wi, the access link occupancy as Li, the access link propagation time as Ti, and the total round-trip propagation time, not including the queueing time, as RTTi. We will also denote the bottleneck queue size as Q, and its loss rate as p. The latencies of the forward and backward access links are assumed to follow some given positive distribution, and their capacities are assumed to be very high so that the bottleneck link is the only one experiencing congestion. Further, the TCP window sizes are assumed to be integer and have positive lower and upper bounds. The queueing policy is assumed to be drop-tail. Finally, all modeled distributions are assumed to be stationary. Note that the simulations in Section 4 relax these assumptions, e.g. by allowing for short TCP flows. Also, Section 5 further discusses the influence of these assumptions on the general results.

2.2 Closed-Loop Model
Our goal is to provide a closed-loop model for the dumbbell network. First, we will establish a general set of inter-related models. Then, by solving a fixed-point problem involving all these inter-related models, we will converge towards a final network model.
Fig. 1. Two topology views: (a) dumbbell topology; (b) closed-loop schematic model
As illustrated in Figure 1(b), the inter-related models will successively provide equations for the distributions of: (1) the access link occupancies {Li}, (2) the instantaneous arrival rates {ΔAi}, (3) the total instantaneous arrival rate ΔA, (4) the bottleneck queue size Q, (5) the value of the loss rate p, and (6) the congestion window sizes {Wi}. The first five models are presented in this paper. The last model for the congestion window distribution can be taken from the many literature references (see for instance [13, 14]). Once the inter-related models are obtained, we can put them together in a loop using the following schematic chain:

$$p \Rightarrow \{W_i\} \Rightarrow \{L_i\} \Rightarrow \{\Delta A_i\} \Rightarrow \Delta A \Rightarrow Q \Rightarrow p, \qquad (1)$$

which can be rewritten as:

$$p = f(p). \qquad (2)$$

We can then simply find p by solving this fixed-point equation, and the solution provides us as well with the distributions of all the network components mentioned above. To solve for p, it is possible to use the gradient descent algorithm for a few iterations until |p − f(p)| < ε for a desired tolerance ε. In other words, we now have the ability to provide a model for all the main characteristics of this small-buffer network when given only the link latencies, bottleneck link capacity, and buffer size.
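To illustrate how a fixed point p = f(p) can be found numerically with the stopping rule |p − f(p)| < ε, the sketch below uses a damped fixed-point iteration, which is a simple stand-in for the gradient descent mentioned in the text and not the authors' procedure. The function `toy_loop` is a hypothetical closed loop made up for this example: it approximates the window by the well-known square-root law sqrt(3/(2p)) and the loss by the excess of offered load over capacity. It does not implement the models of Section 3.

```python
import math


def solve_fixed_point(f, p0=0.01, step=0.05, eps=1e-9, max_iter=10000):
    """Damped iteration p <- (1 - step) * p + step * f(p) until |p - f(p)| < eps."""
    p = p0
    for _ in range(max_iter):
        fp = f(p)
        if abs(p - fp) < eps:
            return p
        p = (1.0 - step) * p + step * fp
    raise RuntimeError("fixed-point iteration did not converge")


def toy_loop(p, n=2000, capacity_pps=125000.0, rtt_s=0.1, w_max=64):
    """Hypothetical stand-in for f: loss rate implied by a given loss rate.

    Window ~ min(w_max, sqrt(1.5 / p)); offered load = n * window / rtt;
    loss = fraction of the offered load that exceeds the link capacity.
    """
    window = w_max if p <= 0.0 else min(w_max, math.sqrt(1.5 / p))
    offered_pps = n * window / rtt_s
    return max(0.0, 1.0 - capacity_pps / offered_pps)


p_star = solve_fixed_point(toy_loop)
print(p_star, toy_loop(p_star))
```

The damping factor `step` is needed here because the toy map is steep around its fixed point; an undamped iteration p ← f(p) would oscillate for these parameters.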
3 Closed Loop for the Packet Loss Rate Derivation

In this section we will successively go through the inter-related models presented above.

3.1 Model of Li
We will now assume that we are given the distributions of the TCP congestion windows Wi , and want to provide a model for the distribution of Li . To do so, we will first make two simplifying assumptions.
Assumption 2 (Independence). The {Wi} are independent and identically distributed. In the remainder, we will denote their common distribution function as fW.

Intuitively, this simplifying assumption relies on the fact that as the number of flows increases, their mutual synchronization decreases, and therefore the congestion windows can increasingly be modeled as independent. Note that this assumption, as well as the next one, is further discussed in Section 5.

TCP flows typically send packets and ACKs (acknowledgements) in a highly bursty manner. Further, we can use their congestion window size to approximate the size of this burst. The following assumption models this high burstiness.

Assumption 3 (Burstiness). Flow i has a total of Wi packets (or ACKs on the reverse path), all present as a single burst on a given link.

Note that this simplifying assumption directly contradicts the common fluid models of TCP, which assume that the window is spread out, and assumes instead that the window is concentrated at a single point. This assumption also uses the fact that we consider small-buffer networks: since the probability of having packets in the buffer is small enough, it can be neglected in this model. We can now derive the distribution of the access link occupancies Li:

Theorem 1 (Access Link Distribution). The number of packets on forward access link i is distributed as:

$$\Pr(L_i = k) = \begin{cases} 1 - \dfrac{T_i}{RTT_i} & \text{if } k = 0, \\ \dfrac{T_i}{RTT_i} \cdot f_W(k) & \text{otherwise.} \end{cases} \qquad (3)$$

Proof. Using Assumption 3, the probability that the burst is not present on the access line is 1 − Ti/RTTi. The burst presence probability is independent of its size, since the propagation times are the same no matter what the burst size is. Therefore, the probability that k > 0 packets are present on the access line is the product of the probability that the burst size is k (fW(k)) and the probability that the burst is present on the access line (Ti/RTTi).
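Equation (3) translates directly into code. The sketch below builds the distribution of Li from a window distribution given as a dictionary; the representation and the example window distribution are assumptions made for illustration only.

```python
def access_link_pmf(f_w, t_i, rtt_i):
    """Distribution of L_i under Theorem 1.

    f_w   : dict window size k (integer >= 1) -> Pr(W_i = k)
    t_i   : propagation time of access link i
    rtt_i : total round-trip propagation time of flow i (same unit as t_i)
    Returns a dict k -> Pr(L_i = k), including k = 0.
    """
    if not 0.0 < t_i <= rtt_i:
        raise ValueError("need 0 < t_i <= rtt_i")
    if any(k < 1 for k in f_w):
        raise ValueError("window sizes must be positive integers")
    present = t_i / rtt_i              # probability that the burst is on the link
    pmf = {0: 1.0 - present}           # burst is elsewhere on the path
    for k, prob in f_w.items():
        pmf[k] = present * prob        # burst is on the link and has size k
    return pmf


# Hypothetical window distribution on {1, 2, 3}, with T_i / RTT_i = 0.1.
print(access_link_pmf({1: 0.2, 2: 0.5, 3: 0.3}, t_i=0.01, rtt_i=0.1))
```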
The simulation results regarding this model are presented in Section 4.

3.2 Arrival Rates of Single Flows and Total Arrival Rate
Denote the number of packets of flow i arriving at the bottleneck queue in Δt seconds as ΔAi. Intuitively, ΔAi/Δt represents the (instantaneous) arrival rate on line i to the bottleneck queue. We are interested in studying the distribution of ΔAi for some small Δt < Ti, and obtain the following model:

Theorem 2 (Flow Arrival Rate Distribution). The number of packets ΔAi of flow i arrived during time Δt is distributed as:

$$\Pr(\Delta A_i = k) = \begin{cases} 1 - \dfrac{\Delta t}{T_i} + \Pr(L_i = 0) \cdot \dfrac{\Delta t}{T_i} & \text{if } k = 0, \\ \Pr(L_i = k) \cdot \dfrac{\Delta t}{T_i} & \text{otherwise.} \end{cases} \qquad (4)$$

Proof. By Assumption 3, packets of flow i move on each line in a single burst of size Wi and at a constant speed. Therefore, the probability that on some link i, k > 0 packets arrive within Δt, is Pr(Li = k) · Δt/Ti. This gives us the probability for a packet arrival of any size larger than zero. The complementary probability, therefore, stands for the no-arrival event.

Incidentally, note that Equation (4) can be rewritten in the same form as Equation (3):

$$\Pr(\Delta A_i = k) = \begin{cases} 1 - \dfrac{\Delta t}{RTT_i} & \text{if } k = 0, \\ \dfrac{\Delta t}{RTT_i} \cdot f_W(k) & \text{otherwise.} \end{cases} \qquad (5)$$

This is Equation (3), replacing the access link propagation time Ti with the propagation time Δt.

We now want to find the total instantaneous arrival rate ΔA = Σ_i ΔAi of all flows. Since the {Wi} (and the {Li}) are statistically independent by Assumption 2, the {ΔAi} are statistically independent as well. Thus, we can use Lindeberg's Central Limit Theorem [15] to show that the total arrival rate ΔA will be Gaussian for a large number of flows. We refer interested readers to [14] for the detailed proof.

Theorem 3 (Total Arrival Rate Distribution). When the number of flows n → ∞, the normalized total arrival rate

$$\frac{\Delta A - \sum_i E(\Delta A_i)}{\sqrt{\sum_i Var(\Delta A_i)}}$$

converges in distribution to the normalized Gaussian distribution N(0, 1).

The simulation results comparing this Gaussian model with a typical Poisson model are presented in Section 4.

3.3 Queue Size Distribution and Packet Loss Rate
Our next objective in the fixed-point model is to find the distribution of the queue size Q and the packet loss rate p given the above model for ΔA. We will decompose time into frames of size Δt, and assume that packets arrive as bursts of size ΔA every Δt seconds. Thus, we get the following queueing model:

Theorem 4 (Queue Size and Loss Rate). The queue size distribution and the packet loss rate are obtained by using a $G_{\Delta t}^{[\Delta A]}/D/1/B$ queueing model, in which every Δt seconds, packets arrive in batches of size ΔA and immediately obtain service for up to CΔt packets.

We implemented this queueing model using both the algorithm developed in [16] and a simple Markov chain, and both yielded similar results. We are now done with our set of network models, and are ready to analyze their correctness using simulation results.
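The chain from Equation (5) through Theorem 3 to Theorem 4 can be sketched with a simple Markov chain as follows. This is an illustration under several assumptions that are not stated in the text: the frame recursion is read as Q' = min(B, max(0, Q + ΔA − CΔt)) with the excess dropped, the Gaussian of Theorem 3 is discretized and truncated to a finite range, all flows share the same RTT, and the stationary distribution is approximated by iterating the chain a fixed number of times rather than by the algorithm of [16]. All names and numbers are hypothetical.

```python
import math


def flow_moments(f_w, dt, rtt):
    """Mean and variance of dA_i under Eq. (5): a burst of W packets arrives
    within dt with probability dt / rtt, otherwise nothing arrives."""
    hit = dt / rtt
    m1 = hit * sum(k * p for k, p in f_w.items())
    m2 = hit * sum(k * k * p for k, p in f_w.items())
    return m1, m2 - m1 * m1


def gaussian_batch_pmf(mean, var, k_max):
    """Gaussian batch size discretized on 0..k_max and renormalized
    (the truncation is an assumption of this sketch)."""
    weights = [math.exp(-(k - mean) ** 2 / (2.0 * var)) for k in range(k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def queue_loss(batch_pmf, service, buf, iters=500):
    """Iterate Q' = min(buf, max(0, Q + A - service)) from an empty queue.

    batch_pmf : list, batch_pmf[a] = Pr(batch size = a)
    service   : packets served per frame (C * dt, as an integer)
    buf       : buffer size B in packets
    Returns (approximate stationary queue pmf, packet loss rate).
    """
    q = [1.0] + [0.0] * buf
    for _ in range(iters):
        nxt = [0.0] * (buf + 1)
        for level, pq in enumerate(q):
            if pq == 0.0:
                continue
            for a, pa in enumerate(batch_pmf):
                nxt[min(buf, max(0, level + a - service))] += pq * pa
        q = nxt
    mean_batch = sum(a * pa for a, pa in enumerate(batch_pmf))
    dropped = sum(pq * pa * max(0, level + a - service - buf)
                  for level, pq in enumerate(q)
                  for a, pa in enumerate(batch_pmf))
    return q, dropped / mean_batch


# Hypothetical example: 200 identical flows, dt = 10 ms, RTT = 100 ms.
m, v = flow_moments({1: 0.2, 2: 0.5, 3: 0.3}, dt=0.01, rtt=0.1)
pmf = gaussian_batch_pmf(200 * m, 200 * v, k_max=100)
queue_pmf, loss = queue_loss(pmf, service=45, buf=10)
print(loss)
```

Because the flows are independent, the mean and variance of ΔA are the sums of the per-flow values, which is why the example multiplies both moments by the number of flows.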
Fig. 2. Distribution of the access link occupancy Li (logarithmic scale)
4 Simulation Results
We will now present the simulation results for the different parts of the closed-loop model. The simulations were done in ns2. Specific settings are detailed below for their respective simulations. In all the simulations below, we ran 2000 simultaneous persistent TCP NewReno flows, unless noted otherwise. The packet size was set to 1000 bytes. The maximum allowed window size was set to $W_{\max}$ = 64 packets. The propagation time of the bottleneck link was fixed to 20 milliseconds, and the propagation times for all the other links were chosen randomly according to a uniform distribution.

4.1 Access Link Occupancy Model vs. Fluid Model
We want to compare our bursty model for $\{L_i\}$ against a fluid model, in which packets are distributed uniformly on all the links (the queueing time is neglected). According to this fluid model, the number of packets present on access link i at time t is thus equal to $L_i(t) = \frac{T_i}{RTT_i} \cdot W_i(t)$. The maximum number of packets on the access link, therefore, is bounded by $\frac{T_i}{RTT_i} \cdot W_{\max}$. Figure 2 represents the probability distribution function (PDF) of the access link occupancy $L_i$ of some random flow i using a logarithmic scale. It was obtained using a simulation involving 500 simultaneous TCP flows. It can be seen that our bursty model is fairly close to the measured results throughout the whole scale. It behaves especially better than the fluid model, for which the PDF is equal to 0 for any occupancy above $\frac{T_i}{RTT_i} \cdot W_{\max} \approx 10$, and thus cannot even be represented on the logarithmic scale. This observation strengthens the intuition that Assumption 3 was reliable. Note that we obtained similar results on many simulations with different parameters.
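As a quick worked check of the fluid bound quoted above, the following lines combine it with $W_{\max} = 64$ packets from the simulation settings. The resulting ratio $T_i/RTT_i \approx 0.16$ is inferred from the two quoted numbers; it is not stated explicitly in the text.

```latex
\[
L_i(t) = \frac{T_i}{RTT_i}\, W_i(t) \;\le\; \frac{T_i}{RTT_i}\, W_{\max} \approx 10,
\qquad
W_{\max} = 64
\;\Longrightarrow\;
\frac{T_i}{RTT_i} \approx \frac{10}{64} \approx 0.16 .
\]
```

Under the fluid model no occupancy above roughly 10 packets can ever be observed on this access link, whereas the bursty model still assigns probability to a whole window of up to 64 packets crossing the link at once.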
Fig. 3. Distribution of the total arrival rate ΔA in time Δt
4.2 Arrival Rate Model vs. Poisson Model
We want to compare our Gaussian-based model for the distribution of the total instantaneous arrival rate ΔA with a typical Poisson model [7, 8, 9]. Figure 3 plots the pdf of ΔA, using Δt = 10 ms, and compares it with simulated data obtained artificially using the two models. Our Gaussian-based model relies on the expected value and variance computed above. On the contrary, for the Poisson model, we force the expected value to equal the measured value. Of course, over this amount of time, the Poisson model yields an approximately Gaussian distribution as well. As shown in the figure, our model approximately yields the correct variance and is close to the simulated results, while the Poisson model yields too small a variance. This can be explained by the burstiness properties of the TCP flows.

4.3 Queue Size Distribution Results and Comparisons
Figure 4 compares the measured queue size distribution with our model. We used a buffer size of 580 packets. Our Markov-chain-based queue model was obtained after the convergence of the entire closed-loop model, and therefore uses our modeled arrivals as well. It can be seen that our queue model is fairly close on most of the simulated range of Q, but cannot exactly reproduce the smooth continuous behavior of the queue at the edges (Q = 0 and Q = B) because of its discrete bursty nature. In fact, in this example, the true loss rate was 0.55%, while our model gives 0.76%.

4.4 Fixed-Point Solution
In the simulations, the fixed-point solution of our network model was found using the gradient descent algorithm, in typically less than 50 iterations. The exact number of needed iterations depends of course on the desired precision
Fig. 4. Queue size distribution
and on the network parameters. Table 1 illustrates the average flow throughput, the average queueing delay (measured and modeled) and the average loss rate (measured and modeled) using the simulation results in 4 settings with quite different parameters:

- Case 1: 500 long-provisioned TCP flows with $RTT_i$ distributed between 80 and 440 msec, B = 232 packets and C = 232.5 Mbps.
- Case 2: 500 long-provisioned TCP flows with $RTT_i$ distributed between 80 and 440 msec, B = 450 packets and C = 116.25 Mbps.
- Case 3: 750 long-provisioned TCP flows and a constant number of 25 short TCP flows, with $RTT_i$ distributed between 70 and 2040 msec, B = 244 packets and C = 69.75 Mbps. (The short flows were omitted in the model.)
- Case 4: 1500 long-provisioned TCP flows and a constant number of 5 short TCP flows, with $RTT_i$ distributed between 220 and 240 msec, B = 828 packets and C = 209.25 Mbps. (The short flows were omitted in the model.)

In all these cases, the modeled queueing delay and packet loss rate were close to, but slightly above, the measured results. Incidentally, using the same simulations without short flows, we also verified that the influence of the short flows on the simulation results was negligible. Note also that all these cases use small buffers, with the last case following the Stanford model, and the first three cases having even smaller buffers.
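One simple sanity check on the four cases is to compare the queueing delays reported in Table 1 with the largest delay a full buffer can cause, B/C. The sketch below does this arithmetic. It assumes that each 1000-byte packet occupies exactly 8000 bits on the bottleneck link (header overhead ignored) and that 1 Mbps means 10^6 bits per second; the function name is illustrative.

```python
PACKET_BITS = 1000 * 8  # 1000-byte packets; header overhead ignored (assumption)

# case -> (B in packets, C in Mbps, measured delay in ms, modeled delay in ms),
# copied from the case descriptions and Table 1.
CASES = {
    1: (232, 232.5, 1.79, 1.9),
    2: (450, 116.25, 18.27, 19.87),
    3: (244, 69.75, 7.84, 8.54),
    4: (828, 209.25, 22.16, 26.15),
}


def max_queueing_delay_ms(buffer_packets, capacity_mbps):
    """Time to drain a full buffer, B / C, in milliseconds."""
    capacity_pps = capacity_mbps * 1e6 / PACKET_BITS
    return 1000.0 * buffer_packets / capacity_pps


if __name__ == "__main__":
    for case, (b, c, measured, modeled) in CASES.items():
        bound = max_queueing_delay_ms(b, c)
        print(f"Case {case}: B/C = {bound:.2f} ms, "
              f"measured = {measured} ms, modeled = {modeled} ms")
```

For Case 1 the bound is about 8 ms, and in all four cases both the measured and the modeled average delays stay below B/C, as they must.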
5 Discussion of Assumptions
Let’s now discuss the correctness and generality of the assumptions. Dumbbell Topology — We assumed in Assumption 1 that any large network can be subdivided into dumbbell topologies. This assumption relies on the observation that in the Internet, few flows practically have more than one bottleneck, and that flows having more than one bottleneck actually mainly depend on the most congested one [4]. Thus, the assumption of a single point of congestion
Table 1. Measured versus modeled results

Case   Average flow throughput   Queueing delay              Loss rate p
                                 Measured      Modeled       Measured   Modeled
1      52 pkts/sec               1.79 msec     1.9 msec      0.79 %     0.90 %
2      29 pkts/sec               18.27 msec    19.87 msec    1.81 %     2.10 %
3      11 pkts/sec               7.84 msec     8.54 msec     1.30 %     1.30 %
4      18 pkts/sec               22.16 msec    26.15 msec    3.30 %     3.50 %
seems realistic enough. However, we also assumed that the congestion only affects packets, not ACKs. This assumption might be too restrictive – even though we found that our Gaussian-based models still held in various simulations using reverse-path ACK congestion.

Statistical Independence of {Wi} — It is obviously not correct that the congestion windows are completely independent, since they interact through the shared bottleneck queue. However, in order to study how far from reality the independence assumption is, we checked how correlated the congestion windows were over time. We obtained the following correlation matrix for the congestion windows of five arbitrary flows, using 70,000 consecutive time samples in a simulation with 500 persistent TCP flows.

\[
C =
\begin{pmatrix}
1 & 0.066 & 0.14 & 0.058 & -0.025 \\
0.066 & 1 & 0.054 & 0.0005 & -0.081 \\
0.14 & 0.054 & 1 & 0.063 & 0.051 \\
0.058 & 0.0005 & 0.063 & 1 & 0.003 \\
-0.025 & -0.081 & 0.051 & 0.003 & 1
\end{pmatrix}
\tag{6}
\]
It can be seen that the correlation coefficients between different flows were indeed quite low and far from the maximum absolute value of 1. Of course, while independent random variables have zero correlation, the converse is not necessarily true. Nevertheless, this low correlation tends to indicate that the flows are indeed desynchronized, and therefore that the simplifying assumption of independence is not too far from reality. In fact, as the number of flows increases, we also found that a heuristic measure of the dependence between their window sizes was decreasing (we took two arbitrary flows and used the symmetric form of the Kullback-Leibler distance between the joint distribution of their window sizes and the product of their respective distributions). For instance, it was 20.03 for 10 flows, 3.37 for 50 flows, and 2.65 for 200 flows.

Identical Distribution of {Wi} — The distributions of the congestion window sizes mainly depend on the loss rate p in the shared bottleneck buffer [13,14]. In simulations, the loss rate was indeed found to be equal for all flows when there is no strong synchronization.

Burstiness — In Section 4, the comparison of the bursty model with the fluid model already strengthened the bursty assumption. More generally, this assumption needs to be used with care in networks without enough space on the links for a packet burst (extremely small link latencies or link capacities). In other cases, while not exactly reflecting reality, this assumption seems close enough [4].
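The Kullback-Leibler-based heuristic mentioned in the discussion of statistical independence above can be sketched as follows. The text does not say how the joint distribution was estimated, so the empirical histogram, the additive pseudo-count that keeps both divergence directions finite, and the function name are all assumptions. The values 20.03, 3.37 and 2.65 reported above depend on those unstated choices, and this sketch does not attempt to reproduce them.

```python
import math
from collections import Counter


def symmetric_kl_dependence(samples, pseudo_count=0.5):
    """Symmetric KL distance between the empirical joint distribution of
    paired window samples and the product of its marginals.

    `samples` is a list of (w1, w2) tuples observed at the same instants.
    The pseudo-count smoothing is an assumption, not taken from the paper.
    """
    xs = sorted({x for x, _ in samples})
    ys = sorted({y for _, y in samples})
    counts = Counter(samples)
    total = len(samples) + pseudo_count * len(xs) * len(ys)
    joint = {
        (x, y): (counts[(x, y)] + pseudo_count) / total
        for x in xs
        for y in ys
    }
    px = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    py = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    distance = 0.0
    for (x, y), p in joint.items():
        q = px[x] * py[y]
        # D(P||Q) + D(Q||P) = sum over cells of (p - q) * log(p / q)
        distance += (p - q) * math.log(p / q)
    return distance


if __name__ == "__main__":
    independent = [(a, b) for a in range(3) for b in range(3)] * 10
    locked = [(a, a) for a in range(3)] * 30
    print("independent windows:", symmetric_kl_dependence(independent))
    print("synchronized windows:", symmetric_kl_dependence(locked))
```

A value near zero indicates that the joint distribution factorizes, while synchronized windows give a clearly positive distance.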
RTT Distribution and Queuing Delay — We assumed that the queueing delay can be neglected compared with the link propagation times. In fact, in the Stanford model, the worst-case queueing delay is $B/C = \overline{RTT}/\sqrt{n}$, where $\overline{RTT}$ is the average round-trip propagation time. Consequently, the assumption seems reasonable when n is large, as long as there is no flow with a round-trip propagation time significantly smaller than $\overline{RTT}$.

Number of Packets — We assumed that the number of packets in the network is close to the total window size, as reflected by the definition of the window size. This assumption is especially justified when the loss rate is reasonably small, as seen in our simulations and in the literature [4].
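To make the worst-case queueing delay argument above concrete, the following worked lines start from the Stanford buffer-sizing rule of [4] and plug in hypothetical numbers (an average round-trip time of 250 ms and 10,000 flows, chosen only for illustration).

```latex
\[
B = \frac{\overline{RTT}\cdot C}{\sqrt{n}}
\;\Longrightarrow\;
\frac{B}{C} = \frac{\overline{RTT}}{\sqrt{n}},
\qquad
\overline{RTT} = 250\ \text{ms},\; n = 10\,000
\;\Longrightarrow\;
\frac{B}{C} = \frac{250\ \text{ms}}{100} = 2.5\ \text{ms}.
\]
```

With these numbers a full buffer adds at most 1% of the average round-trip time, which is why neglecting the queueing delay becomes more accurate as n grows.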
6 Generality of Results
It is obviously impossible to consider every possible topology and every possible set of flows. We ran hundreds of simulations for this paper, and still feel that there is much to research. Nevertheless, we can already discuss the scope of the results and their sensitivity to various topology parameters.

Buffer Size — Our paper is about models that are valid in networks with small buffers. In a network with large buffers, the flows might become synchronized, and our independence assumptions and the ensuing models might not be applicable. Likewise, we would not be able to neglect the queueing delay in our models.

Propagation Times — We chose the link propagation times using a uniform distribution with different parameters so as to reflect the diversity of real-life Internet flows. In simulations, our models were fairly insensitive to these propagation times, as long as there were many flows and there were no flows with near-zero round-trip times.

Protocols — The closed-loop model fits a network with long-provisioned TCP flows. The cases above illustrate how a small portion of short TCP flows had no major influence on the results – in fact, the short flows contribute to the global desynchronization and make the independence-based models even closer to simulated results! Likewise, we believe that a small portion of UDP traffic would not have any influence on the results because of its open-loop nature.

Number of Flows — In simulations, we found that the desynchronization assumption already held in practice for several hundred flows, as previously stated in [4]. We thus believe that with the hundreds of thousands of flows present in a congested backbone router, the assumption will hold even better.
7 Conclusion
In this paper, we provided a complete statistical model for a large network with small buffers. We started with a model for the traffic on a single access line. Then, we modeled the arrival rates to the bottleneck queues. Later, we found
a model for the queue size distribution and the loss rate. Finally, using these inter-related models, we showed how to solve a single fixed-point equation to obtain the full network statistical model. A router designer might directly use this network model for buffer sizing. Indeed, the designer might consider a set of possible benchmark parameters and model the behavior of the resulting network. Then, given target QoS requirements such as the required maximum packet loss rate or maximum expected packet delay, the router designer will be able to design the buffer size that satisfies these constraints.
References

1. McKeown, N.: Sizing router buffers. EE384Y Course Talk, Stanford (2006)
2. Shrimali, G., Keslassy, I., McKeown, N.: Designing packet buffers with statistical guarantees. In: IEEE Hot Interconnects XII, Stanford, CA (2004)
3. Villamizar, C., Song, C.: High performance tcp in ansnet. ACM Computer Communications Review (1994)
4. Appenzeller, G., Keslassy, I., McKeown, N.: Sizing router buffers. In: ACM SIGCOMM, Portland, OR (2004)
5. Appenzeller, G., McKeown, N., Sommers, J., Barford, P.: Recent results on sizing router buffers. In: Network Systems Design Conference, San Jose, CA (2004)
6. Enachescu, M., Ganjali, Y., Goel, A., McKeown, N., Roughgarden, T.: Routers with very small buffers. In: IEEE Infocom, Barcelona, Spain (2006)
7. Raina, G., Towsley, D., Wischik, D.: Part II: control theory for buffer sizing. ACM Computer Communications Review (2005)
8. Raina, G., Wischik, D.: Buffer sizes for large multiplexers: TCP queueing theory and instability analysis. In: EuroNGI (2005)
9. Avrachenkov, K.E., Ayesta, U., Altman, E., Nain, P., Barakat, C.: The effect of router buffer size on the TCP performance. In: LONIIS Workshop, St. Petersburg, Russia, pp. 116–121 (2002)
10. Bu, T., Towsley, D.F.: A fixed point approximation of TCP behavior in a network. In: ACM SIGMETRICS (2001)
11. Hespanha, J.P., Bohacek, S., Obraczka, K., Lee, J.: Hybrid modeling of TCP congestion control. In: Di Benedetto, M.D., Sangiovanni-Vincentelli, A.L. (eds.) HSCC 2001. LNCS, vol. 2034. Springer, Heidelberg (2001)
12. Shorten, R., Wirth, F., Leith, D.: A positive systems model of TCP-like congestion control: Asymptotic results. Tech. Rep. 2004-1, Hamilton Institute (2004)
13. Altman, E., Avrachenkov, K.E., Kherani, A.A., Prabhu, B.J.: Performance analysis and stochastic stability of congestion control protocols. In: IEEE Infocom, Miami, FL (2005)
14. Shifrin, M.: The Gaussian nature of TCP in large networks, M.Sc. Research Thesis, Technion, Israel (August 2007), http://www.graduate.technion.ac.il/Theses/Abstracts.asp?Id=23526
15. Zabell, S.L.: Alan Turing and the central limit theorem. Amer. Math. Monthly 102, 483–494 (1995)
16. Tran-Gia, P., Ahmadi, H.: Analysis of a discrete-time G[X]/D/1−S queueing system with applications in packet-switching systems. In: IEEE Infocom, New Orleans, LA (1988)
Active Window Management: Performance Assessment through an Extensive Comparison with XCP∗

M. Barbera¹, M. Gerla², A. Lombardo¹, C. Panarello¹, M. Sanadidi², and G. Schembra¹

¹ DIIT – University of Catania, V.le A. Doria, 6 – 95125 Catania, Italy
{mbarbera,lombardo,cpana,schembra}@diit.unict.it
² Computer Science Department, Boelter Hall – UCLA, 405 Hilgard Ave., Los Angeles, CA, 90024, USA
[email protected], [email protected]
Abstract. The most efficient approaches defined so far to address performance degradations in end-to-end congestion control exploit the flow control mechanism to improve end-to-end performance. The most authoritative solution in this context seems to be the eXplicit Control Protocol (XCP) which achieves high performance but requires changes in both network routers and hosts which make it difficult to deploy. To this aim we have developed a new mechanism, called Active Window Management (AWM), which is able to maintain the queue length in network routers almost constant providing no loss, while maximizing network utilization. The idea at the basis of AWM is to allow network routers to manipulate the Advertised Window field in TCP ACKs. In this way no modifications to the TCP protocol are required. The target of this paper is to propose an extensive numerical analysis of AWM to compare it with the XCP protocol, chosen as reference case.

Keywords: TCP performance optimization, XCP, Congestion Control, Advertised Window.
1 Introduction

Internet congestion occurs when the available capacity is insufficient to meet the instantaneous aggregate demand. The resulting effects are long delays in data delivery and wasted resources due to lost or dropped packets. For these reasons congestion control mechanisms have been introduced in the Internet from the early beginning. Congestion control is implemented by transport protocol algorithms that dynamically adjust rate or window size in response to implicit or explicit indications of congestion. In the Internet, the most used transport protocol is TCP which, in the absence of explicit feedback from the gateways, usually infers congestion from packet losses or from other congestion measurements (changes in throughput or end-to-end delay) [1].

∗ This work has been partially supported by the Italian PRIN project FAMOUS under grant 2005093971 and by the AIBER project funded by the MIUR.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 679–690, 2008. © IFIP International Federation for Information Processing 2008
Nevertheless, the most effective detection of congestion can occur in the gateways, that is, where the queuing behavior can be monitored over time. For this reason, Active Queue Management (AQM) techniques, such as RED, REM, have been proposed in the last few years to address performance degradations in end-to-end congestion control. All the proposed AQM techniques are certainly an improvement over traditional drop-tail queues; however, they have several shortcomings and have been demonstrated to be inefficient because they miss the challenging target of both minimizing network delay and keeping goodput close to the network capacity [2]. So the important issue on how to control TCP burstiness while maintaining the stability of the “AQM+TCP congestion control” system is still open.

A different approach exploits the flow control mechanism in order to improve end-to-end performance and maintain network queue stability. There are two different classes of solutions that adopt this approach. The first class includes schemes that are implemented both in the hosts and in the routers [3], and need undesirable modifications to the TCP protocol, or even its replacement, in order to accept, understand and use feedback signals from gateways. On the contrary, the second class of solutions includes algorithms which are implemented in the gateways only. The most authoritative solution in the first class is XCP [3], a new transport protocol that relies on the explicit cooperation among XCP routers and XCP senders or receivers. This protocol generalizes the Explicit Congestion Notification (ECN) [4] by allowing XCP routers to inform the XCP senders about the degree of congestion at the bottleneck.
Moreover, XCP uses two different mechanisms to control utilization and fairness: to ensure efficient utilization of network resources, an aggregate feedback is calculated according to the spare bandwidth in the network and the feedback delay; to control fairness, the concept of bandwidth shuffling is applied and a per-packet feedback is obtained by simultaneously allocating and de-allocating bandwidth such that the total traffic rate does not change while the throughput of each individual flow changes gradually to approach the flow’s fair share. To do so, knowledge about flow parameters is needed, but in order to avoid per-flow state in routers and ensure scalability as the number of flows increases, the control state is maintained in the packet header by introducing new fields. The explicit cooperation among routers and senders allows XCP to achieve better performance than TCP in terms of fairness, high utilization, and small queue size, with almost no packet drops. However, even if XCP seems to be the most valid solution to address performance degradations in end-to-end congestion control, it also requires changes in the whole system (both network routers and hosts) and that makes it relatively hard to deploy. The second class of solutions that relies on a distributed flow control mechanism among routers and end-users includes algorithms which are implemented in the gateways only. More specifically, the basic idea is that network routers manipulate the Advertised Window parameter in the TCP ACKs. By so doing, routers can drive the TCP end-to-end flow control mechanism while TCP is not aware of this occurrence. Our proposed mechanism belongs to this second class.
The main challenge of these algorithms is to derive the Advertised Window value to be sent to each TCP source from global congestion information such as the available buffer space, and although total compatibility with all versions of the TCP protocol makes this class of algorithms more attractive than the previous one, solutions proposed so far ([5], [6]) are not able to improve the performance of TCP connections enough to be competitive with XCP.
A previous work by the same authors [7] presents an early proposal of a new mechanism, called Active Window Management (AWM), which is able to keep the queue length in the routers implementing it very close to a given target value while, at the same time, avoiding packet losses. In this paper we extend the work in [7] with an extensive analysis of the AWM performance. Moreover, we compare AWM with XCP and demonstrate that, when the access router implementing AWM is the bottleneck in the network, TCP performs very closely to XCP, providing no loss and very high network utilization. In addition, since AWM requires fewer changes in the whole system (network routers and hosts) than XCP and, unlike XCP, is fully compatible with TCP, it is easier to implement and deploy. By manipulating the Advertised Window parameter in the TCP ACKs, AWM gateways minimize the delay jitter and maximize the goodput of TCP sources. In addition, no per-flow state storage is needed in AWM gateways, ensuring scalability as the number of flows increases. The TCP sender works as usual, but the AWM gateway action makes the Advertised Window usually more constraining than the Congestion Window. We propose to implement the AWM technique in the access network routers, that is, in routers through which both incoming and outgoing packets related to the same TCP connections are forced to pass, whatever the routing strategy used in the network. The remaining gateways in the network may use another AQM technique, if any. The rest of the paper is organized as follows. Section 2 briefly describes the AWM mechanism. Section 3 presents simulation results comparing AWM and XCP in different network scenarios. Finally, we draw our conclusions.
2 The AWM Mechanism

2.1 Mechanism Description

The goal of AWM is to maximize link utilization, and at the same time avoid losses. AWM is implemented on gateway nodes and interacts with TCP sources without requiring any modification to the TCP protocol. An AWM gateway, that is, a gateway node implementing AWM, acts on ACK packets sent to TCP sources. More specifically, it uses the 16-bit header field in a TCP packet called Advertised Window (awnd). Let us recall that, according to the TCP flow control, the TCP receiver uses awnd to inform the sender about the number of bytes available in the receiver buffer. The TCP sender, on the other hand, transmits a number of packets given by the so-called Transmission Window (twnd), corresponding to the minimum value between its Congestion Window (cwnd) and the last received awnd. Note that immediately after the establishment of the connection, cwnd is smaller than awnd, and therefore twnd=cwnd. However, according to the TCP protocol specification, cwnd is increased until a loss is detected. Thus, twnd increases as well until a loss is detected, unless cwnd becomes greater than the awnd value specified by the TCP receiver before losses occur. In this way, if the TCP receiver is able to manage more packets than the network nodes in the end-to-end path, awnd is greater than cwnd, and so losses may occur.
Fig. 1. Buffers considered in the AWM algorithm (diagram labels: TCP sources, AWM gateway with buffers QF and QB, TCP destinations)
With this in mind, the idea at the base of the proposed algorithm is that, before losses occur, the AWM gateway modifies the awnd value in ACK packets in order to interrupt the increase in twnd values, and therefore control the transmission rate of the TCP sources. By so doing, the awnd value received by the TCP sender may be different from that specified by the TCP receiver and the TCP AIMD mechanism is bypassed, providing the network with a traffic shape defined by the AWM gateway. In order to avoid packet losses and maximize the link utilization, the AWM algorithm tries to maintain the average queue length in the gateway close to a target value to be chosen significantly smaller than the buffer size (to have no losses), and at the same time higher than zero (to achieve high link utilization). To this end, based on the state of the queue, the AWM gateway estimates the number of bytes that it should receive from each TCP source. Hereafter we refer to this value as suggested window (swnd). The AWM algorithm works on two different buffers in the same gateway (see Fig. 1): QF is the buffer loaded by data packets coming from TCP sources; QB is the buffer queuing the corresponding ACKs. Therefore, it needs to be implemented in network nodes crossed by both incoming and outgoing packets of the same TCP connections. This is the case of network access routers, which are also very often bottleneck routers. The AWM gateway calculates swnd on the basis of the status of QF. This is done at each packet arrival and departure in the QF buffer, independently of the TCP connection the packet belongs to. In the following the arrival and departure events are called updating events. Every time an ACK packet leaves the buffer QB, if its awnd value is greater than the last updated value of swnd, the AWM gateway overwrites the value of the awnd field with the swnd value and recalculates the checksum. 
In the opposite case the awnd field value remains unchanged in order not to interfere with the original TCP flow control algorithm¹. Let us stress that this mechanism has been defined in such a way that scalability is not weakened. In fact, the AWM gateway maintains one swnd value for the entire set of connections: no per-flow state memory is required, and updates are applied to awnd in forwarded ACKs, independently of the particular TCP connection they belong to. More specifically, the swnd value at the generic updating event k, $\mathit{swnd}_k$, is evaluated on the basis of its previous value $\mathit{swnd}_{k-1}$, by considering two corrective terms $DQ_k$ and $DT_k$:

\[
\mathit{swnd}_k = \max\left(\mathit{swnd}_{k-1} + DQ_k + DT_k,\; MTU\right)
\tag{1}
\]

¹ Clearly the roles of QF and QB are exchanged for TCP traffic in the opposite direction.
The minimum value of the suggested window is set to the Maximum Transfer Unit (MTU) in order to avoid the Silly Window Syndrome [8]. The term $DQ_k$ in (1) is defined as follows:

\[
DQ_k = \frac{1}{N}\left(q_{k-1} - q_k\right)
\tag{2}
\]
where N is an estimation of the number of active TCP flows crossing the AWM gateway. Its evaluation is discussed in Section 2.2. The term DQk has been defined as in (2) in order to give a negative contribution when the instantaneous queue length qk is greater than its previous value and a positive contribution in the opposite case. In other words, the AWM gateway detects the bandwidth availability when the instantaneous queue length decreases, and informs TCP sources by proportionally increasing the suggested window value for each of them. When, on the contrary, the queue length increases, the AWM gateway infers incipient congestion and forces TCP sources to reduce their emission rates. If the N TCP sources reduce their transmission window by the term DQk, the queue length will go back to the qk-1 value. The term DTk, on the other hand, has been introduced in order to stabilize the queue length around a given target value, T. To this end, DTk has to give a positive contribution when the queue length is less than T, and a negative contribution in the opposite case. These considerations motivate our choice of evaluating DTk in the following way:
\[
DT_k = \frac{\beta \cdot \delta_k}{N \cdot T} \cdot \left(T - q_k\right)
\tag{3}
\]
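Equations (1) to (3), together with the ACK rewriting rule described earlier, can be put together in a few lines. The sketch below is only an illustration of those formulas, not the authors' implementation: the class and method names are hypothetical, N is held fixed, and $\delta_k$ (which this section does not define) is passed in by the caller. Treating it as the time elapsed since the previous updating event is an assumption. Checksum recalculation is omitted.

```python
class AwmGatewaySketch:
    """Illustrative suggested-window bookkeeping for one gateway (hypothetical API)."""

    def __init__(self, target, beta, n_flows, mtu, initial_swnd, initial_queue=0.0):
        self.target = target          # T: target queue length (bytes)
        self.beta = beta              # positive gain beta
        self.n_flows = n_flows        # N: estimated number of active flows
        self.mtu = mtu                # lower bound on swnd (bytes)
        self.swnd = initial_swnd      # suggested window (bytes)
        self.prev_queue = initial_queue

    def on_queue_event(self, queue_len, delta):
        """Update swnd at a packet arrival or departure in QF.

        `delta` stands for delta_k in Eq. (3); its meaning is assumed here.
        """
        dq = (self.prev_queue - queue_len) / self.n_flows                  # Eq. (2)
        dt = (self.beta * delta / (self.n_flows * self.target)
              * (self.target - queue_len))                                 # Eq. (3)
        self.swnd = max(self.swnd + dq + dt, self.mtu)                     # Eq. (1)
        self.prev_queue = queue_len
        return self.swnd

    def rewrite_ack(self, awnd):
        """Return the awnd to forward: overwritten only if it exceeds swnd."""
        return min(awnd, int(self.swnd))


if __name__ == "__main__":
    gw = AwmGatewaySketch(target=10_000, beta=1.0, n_flows=10,
                          mtu=1500, initial_swnd=8000, initial_queue=12_000)
    print(gw.on_queue_event(queue_len=14_000, delta=0.01))  # queue grew: swnd shrinks
    print(gw.rewrite_ack(65_535), gw.rewrite_ack(4000))
```

A single swnd value is kept for all connections, which matches the statement above that no per-flow state is needed.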
The positive parameter β has to be chosen in such a way that fast convergence to the target is guaranteed, reducing the oscillations as much as possible. In [7], we have presented an analytical fluid model of the AWM gateway to be used for the choice of the appropriate values of β in order to obtain the desired performance.

2.2 Estimation of N
In order to evaluate both the terms $DQ_k$ and $DT_k$ in (1), the AWM gateway has to know the number N of active TCP flows passing through the AWM gateway. As a first step we assume that they have the same average round-trip time, R(t). We will subsequently relax this assumption in Section 2.3. Let q(t) indicate the instantaneous queue length in the buffer QF of the AWM gateway at the generic time instant t, and W(t) the value of the TCP source transmission window. Thanks to the AWM mechanism, the transmission window is the same for all the sources and equal to the suggested window indicated by the AWM gateway in the ACK packets coming out of the buffer QB at the generic time instant t. The change rate of the queue length is equal to the total arrival rate minus the output rate, that is:

\[
\frac{dq(t)}{dt} = \frac{N \cdot W(t)}{R(t)} - C
\tag{4}
\]
where R(t) is the round-trip time calculated as the sum of the round-trip propagation delay D and the queuing delay, i.e. $R(t) = D + q(t)/C$, C being the link capacity of the buffer QF. Since the β parameter is chosen in order to guarantee the convergence of the queue length to the target, we can conclude that at the steady state the derivative of the queue length is zero. As a consequence, we can rewrite (4) as $N \cdot W(t) = C \cdot R(t)$. Moreover, since the queue length is constant, the average round-trip time R(t) does not vary during the system evolution, so the product $N \cdot W(t)$ is constant as well. This means that if we consider two consecutive updating events at the instants $t_{k-1}$ and $t_k$, we have:

\[
N_{k-1} \cdot \mathit{swnd}_{k-1} = N_k \cdot \mathit{swnd}_k
\tag{5}
\]
where $\mathit{swnd}_{k-1}$ and $\mathit{swnd}_k$ are the values of W(t) calculated at the instants $t_{k-1}$ and $t_k$, respectively. The above relationship states that if the AWM algorithm needs to change the value of the suggested window to be sent to TCP sources according to (1), the reason is that a variation in the number of active sources has occurred. As a consequence, (5) allows us to adaptively evaluate $N_k$ in the following way:

\[
N_k = \frac{N_{k-1} \cdot \mathit{swnd}_{k-1}}{\mathit{swnd}_k}
\tag{6}
\]
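The following sketch illustrates the steady-state relation N·W = C·R and the adaptive estimate in (6). The function names and the numeric values (a 10 Mbps link expressed as 1.25e6 bytes/s, a 100 ms propagation delay, a 10,000-byte queue) are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def steady_state_window(n_flows, capacity, prop_delay, queue_len):
    """Per-flow window at steady state from N * W = C * R, with R = D + q / C."""
    rtt = prop_delay + queue_len / capacity
    return capacity * rtt / n_flows


def update_flow_estimate(n_prev, swnd_prev, swnd_curr):
    """Eq. (6): keep the product N * swnd constant across updating events."""
    if swnd_curr <= 0:
        raise ValueError("swnd must be positive")
    return n_prev * swnd_prev / swnd_curr


if __name__ == "__main__":
    # Hypothetical link: 1.25e6 bytes/s, D = 0.1 s, queue held at 10,000 bytes.
    w10 = steady_state_window(10, 1.25e6, 0.1, 10_000)
    w20 = steady_state_window(20, 1.25e6, 0.1, 10_000)
    print(w10, w20)
    # If the gateway had to halve swnd to hold the queue, Eq. (6) doubles N.
    print(update_flow_estimate(10, w10, w20))
```

In this example the window needed to hold the queue at its target halves when the number of flows doubles, and (6) recovers the new flow count from the two window values alone.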
The number of sources immediately after the k-th updating event estimated as in (6) should be substituted in (1) for evaluating $\mathit{swnd}_{k+1}$. However, when one or more TCP sources start, because of the Slow Start mechanism, they will probably have a congestion window smaller than the suggested window, so they will not react to the AWM gateway indications. Therefore, until their congestion window reaches the suggested window, only the other sources will be responsible for keeping the queue length constant. For this reason, during this phase the number of sources in (1) will be kept constant and equal to the number of sources already active before the start of the new sources. The number of sources will be varied in (1) only when the Slow Start phase is interrupted by a suggested window value smaller than the congestion window value. This event is detected by the AWM gateway when the number of sources estimated by AWM, which is continuously evaluated by (6), decreases for the first time. On this occurrence the value of N to be used in (1) will be calculated by gradually increasing the value of N from that calculated before the start of the new sources, until it reaches the value estimated by (6).

2.3 AWM Model in Presence of Sources with Different Round Trip Time
Let us consider a set of M TCP sources loading the AWM buffer, and divide them in G groups in such a way that in each group there are TCP sources with the same average round-trip time, $R_i$, where i = 1, 2, …, G (see footnote 2). Further let us indicate the number of sources in the generic group i as $N_i$. In this case (4) can be rewritten as follows:

\[
\frac{dq(t)}{dt} = \sum_{i=1}^{G} \frac{N_i \cdot W(t)}{R_i} - C
\tag{7}
\]
Let $R_m$ denote the mean value of the round-trip time of all the groups of sources, and let us define $N_{eq}$ as follows:

\[ N_{eq} = R_m \cdot \sum_{i=1}^{G} \frac{N_i}{R_i} \qquad (8) \]
The term $N_{eq}$ represents the equivalent number of TCP sources that would produce the same effect on the AWM buffer if they all had the same round-trip time, equal to $R_m$. According to this notation, (4) becomes:

\[ \frac{dq(t)}{dt} = \frac{N_{eq} \cdot W(t)}{R_m} - C \qquad (9) \]
Since (4) and (9) have the same form, we can repeat all the considerations about the AWM mechanism by substituting N and R with Neq and Rm, respectively.
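The equivalence between the per-group form (7) and the single-group form (9) can be checked numerically. The following Python sketch is only an illustration: the function names, the group sizes, the round-trip times, and the choice of $R_m$ as the plain average of the group round-trip times are assumptions, not values from the paper.

```python
def equivalent_sources(groups, r_mean):
    """N_eq = R_m * sum(N_i / R_i), with groups given as (N_i, R_i) pairs."""
    return r_mean * sum(n_i / r_i for n_i, r_i in groups)


def queue_derivative_groups(groups, window, capacity):
    """dq/dt in the per-group form: sum(N_i * W / R_i) - C."""
    return sum(n_i * window / r_i for n_i, r_i in groups) - capacity


def queue_derivative_equivalent(n_eq, r_mean, window, capacity):
    """dq/dt in the single-group form: N_eq * W / R_m - C."""
    return n_eq * window / r_mean - capacity


# Hypothetical scenario: 10 sources with a 50 ms RTT and 20 sources with a 100 ms RTT.
groups = [(10, 0.05), (20, 0.1)]
r_mean = 0.075          # assumed here: plain average of the two group RTTs (s)
window = 12.0           # packets
capacity = 4000.0       # packets per second

n_eq = equivalent_sources(groups, r_mean)
print(n_eq)
print(queue_derivative_groups(groups, window, capacity))
print(queue_derivative_equivalent(n_eq, r_mean, window, capacity))
```

Both derivative forms agree for any positive choice of $R_m$, because $N_{eq}/R_m$ equals the sum of $N_i/R_i$ by construction.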
3 Numerical Results

In this section we show that AWM appropriately shapes incoming traffic into the network. More specifically, we will show that, with AWM, TCP performs very closely to a pseudo-constant bit-rate protocol providing no loss, and network utilization becomes very close to one. In addition, we compare performance achieved by using AWM with performance obtained with XCP and when no control is applied in the network. That is, we will consider the following three cases:

• AWM - sources use TCP NewReno unmodified as the transport protocol, and gateways in the network implement the AWM mechanism over a buffer with Drop-Tail dropping policy;
• NC - sources use TCP NewReno unmodified as the transport protocol, but gateways do not implement any particular mechanism; buffer dropping policy is Drop-Tail;
• XCP - sources use XCP as transport protocol, and gateways in the network are XCP as well, with a buffer using Drop-Tail dropping policy.

Let us observe that, although we are considering TCP NewReno, any TCP version can be used achieving the same results for AWM, given that AWM by-passes the TCP control mechanism, which is just the key element making the difference between different versions of TCP.

Footnote 2: As mentioned above, at the steady state the round-trip time does not suffer appreciable variations. Therefore we can remove its temporal dependence, and refer to it with the constant value R.
Comparison is carried out through extensive simulations in different scenarios with the ns-2.29 simulator [9], integrated with a module implementing AWM. For the sake of space we present just a subset of the obtained simulation results. In order to have a fair comparison with XCP, we use the same case study considered in [3], where the XCP authors demonstrated its superiority over other important schemes like RED, REM, AVQ and CSFQ. The network topology used has a single bottleneck shared by a number of sources (see Fig. 2(a)). The bottleneck capacity, the round-trip propagation delay and the number of flows vary according to the specific goal of each experiment. When unspecified, the link capacity is set to 150 Mb/s, the link propagation delay to 80 ms and the number of flows in the forward direction to 50. All flows have the same round-trip propagation delay and sources are long-lived FTP flows. In order to create two-way traffic with the potential for ACK compression, we use the same number of flows in the forward and reverse paths. The MTU length is 1000 bytes. The buffer dropping policy is Drop-Tail. The buffer size is set equal to the bandwidth-delay product, while the target value for AWM is half the buffer size. We present results for link utilization and lost packet percentage. Moreover we introduce a robustness index, defined as the standard deviation of the queue length normalized with respect to the minimum distance between the average queue length and the two buffer limits (zero and buffer size). This definition comes from the consideration that the standard deviation is not able to represent by itself the impact of queue length oscillations on buffer performance.
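The robustness index defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name and the sample queue traces are hypothetical, and the population standard deviation is used because the text does not say which estimator the authors adopted.

```python
from statistics import mean, pstdev


def robustness_index(queue_samples, buffer_size):
    """Queue-length standard deviation normalized by the distance between
    the average queue length and the nearest buffer limit (0 or buffer_size)."""
    avg = mean(queue_samples)
    margin = min(avg, buffer_size - avg)
    if margin <= 0:
        raise ValueError("average queue length must lie strictly inside the buffer")
    return pstdev(queue_samples) / margin


# Two hypothetical traces (in packets) with identical oscillation amplitude:
# one centred in a 1500-packet buffer, one close to the empty-buffer limit.
centered = robustness_index([740, 750, 760, 750], buffer_size=1500)
near_empty = robustness_index([10, 20, 30, 20], buffer_size=1500)
print(centered < near_empty)  # True
```

The two traces have the same standard deviation, yet the index is much larger for the one near the buffer limit, which is the effect the normalization is meant to capture.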
In fact, if the average queue length is close to zero, ample space is available in the buffer to absorb an increase of the queue length without losses, but even a small decrease may empty the buffer and cause underutilization; on the contrary, if the average queue length is close to the buffer size, relatively large oscillations may lead to buffer overflow.

Bottleneck capacity – First we investigated how the bottleneck capacity affects the performance of the three considered schemes. Fig. 2(b) shows performance in terms of loss percentage, bottleneck utilization and normalized queue standard deviation. From that figure we can observe that no losses occur with AWM and XCP. Moreover, only AWM provides 100% link utilization in all the considered cases. This is because the queue length is always close to the target value and therefore greater than zero. In fact the amplitude of the oscillations of the queue length around the average is very small for AWM, which means that the queue is stable around its average value. In addition, since the round-trip time is a function of the bottleneck queuing time, small oscillations around the average queue length make the variance of the round-trip time small as well, implying small values of delay jitter.

Round-Trip Propagation Delay – To study the impact of increasing delays, we vary the propagation delay of each link. The bottleneck capacity is set to 150 Mb/s while all the other parameters keep the values used in the previous analysis. Fig. 3(a) shows that the performance achieved by AWM, as well as by XCP, is not affected by the round-trip propagation delay.

Number of flows – We analyzed the impact of the number of flows on congestion control. Even with a large number of flows, AWM is able to guarantee 100% link utilization and zero loss (see Fig. 3(b)). The amplitude of oscillations around the average queue length, and consequently the delay jitter, grows as the number of flows increases. This behavior is due to the rounding errors introduced when TCP sources convert the number of bytes specified by the suggested window into an equivalent number of packets. As the number of flows increases, the suggested window decreases and the equivalent number of packets decreases as well, making the effects of the rounding errors more evident.

Short Web-Like Flows – Finally we consider the impact of the presence of short web-like sources (mice) on the performance of the three considered schemes. To this aim we assume short flows arrive according to a Poisson process, with Pareto-distributed file sizes [10]. Fig. 4(a) demonstrates that AWM is robust in dynamic environments with a large number of flow arrivals and departures. Once again, the utilization is 100% while no losses occur even when the number of short flows grows significantly;
Fig. 2. (a) Network topology: Single Bottleneck – (b) Performance vs. Link Capacity. Panels in (b): Packet Loss, Bottleneck Utilization and Normalized Queue Standard Deviation vs. Bottleneck Capacity (Mb/s), for AWM, XCP and NC.

Fig. 3. (a) Performance vs. Round-Trip Propagation Delay – (b) Performance vs. Number of FTP Flows. Panels: Packet Loss, Bottleneck Utilization and Normalized Queue Standard Deviation, for AWM, XCP and NC.

Fig. 4. (a) Performance vs. Mice Arrival Rate – (b) Average Response Time Reduction vs. Mice Arrival Rate – (c) Average Throughput Increase vs. Mice Arrival Rate. Horizontal axis in all panels: New Mice per second.
but as soon as the number of new mice per second becomes higher than 750, the number of simultaneously active flows becomes larger than the buffer size measured in packets: there is no space in the buffer to maintain a minimum of one packet from each flow and both AWM and XCP start dropping packets. For short web-like sources, as an additional performance parameter we consider the Response Time, that is, the time interval between the instant at which the user makes the web file request and the instant at which its transfer is completed. Fig. 4(b) shows the reduction in the average Response Time among all web connections in AWM and XCP relative to the NC case. When AWM, as well as XCP, is applied, an average response time reduction of 20-30% is possible. Moreover, it is interesting to investigate whether this reduction of the response time happens because AWM, or XCP, allows short web-like flows to gather more bandwidth than long FTP-like flows. To this end we show the variation of the Average Web Throughput and the Average FTP Throughput obtained when the AWM and XCP schemes are implemented and compare these results with the NC case. Fig. 4(c) shows that AWM is able to substantially increase the throughput of Web sources without sacrificing the throughput of the FTP sources. In fact, let us consider the case of 500 new mice per second as an example. The average web throughput obtained with AWM is approximately 25% more than in the NC case, while the reduction of the average FTP throughput is less than 5%. On the contrary, XCP increases the average web throughput by 5% but the average FTP throughput is approximately 20% less than in the NC case.

Topology with multiple bottleneck links – In order to demonstrate that the results shown so far are independent of the particular topology, we considered a more complex scenario with multiple bottlenecks. Once again, for comparison we choose to use the same topology used by the authors of XCP. The topology is composed of 9 links as shown in Fig. 5(a).
The capacity of link 5 is 50 Mb/s, while all other links have a capacity of 100 Mb/s. The propagation delay of every link is 20 ms. Links are traversed
Fig. 5. (a) Network topology: Multiple Bottlenecks – (b) Performance vs. Link ID. Panels in (b): Packet Loss, Bottleneck Utilization and Normalized Queue Standard Deviation vs. Link ID, for AWM, XCP and NC.
by 50 flows both in the forward and reverse directions, from node 1 to node 10. In addition, in order to have multiple bottleneck links, 50 flows traverse every link from node i to node i + 1, with i = 1, 2, …, 9. In this scenario, XCP provides lower link utilization than the other considered schemes (Fig. 5(b)). As the XCP authors explained, this is because, at links 1, 2, 3 and 4, XCP tries to share the bandwidth between long-distance and short-distance flows. Nevertheless, the throughput of long-distance flows is limited by the same mechanism at the most stringent bottleneck, link 5. To overcome this drawback, XCP could be modified in order to keep the queue length around a target value and improve the link utilization, but the effect of this choice would be a fluctuation in the queue. However, the XCP authors chose to trade “a few percent of utilization for a considerably smaller queue”. The opposite choice (i.e. to trade a greater average queue length for the maximum link utilization) can be made for AWM without affecting the stability of the queue around the target. This is demonstrated in Fig. 5(b), where the normalized standard deviation is very small. Results show that packet losses are zero for both the AWM and XCP schemes.

Fairness – Simulation results are obtained considering 30 flows with a bottleneck link of 30 Mb/s. As the fairness metric, we use the Jain fairness index [11]. Considering N flows, with flow i receiving a fraction $x_i$ of the given link bandwidth, the Jain fairness index is defined as

\[ \frac{\left( \sum_i x_i \right)^2}{N \sum_i x_i^2} \]

Simulation results are presented in Table 1. In case 1, flows have the same round-trip propagation delay, equal to 40 ms, and the performance of the three considered schemes is almost the same. In case 2, different round-trip propagation delays are considered. In particular, the round-trip propagation delay of each flow is chosen as $D_i = D_{i-1} + 10$ ms, where i = 2, 3, …, 30 and $D_1 = 40$ ms. Simulation results demonstrate that AWM is fairer than the non-controlled case but, as expected, thanks to the explicit collaboration between XCP users and XCP network nodes, XCP is fairer than AWM. In fact, an XCP router receives, from each XCP source, information about the estimated round-trip time and
Table 1. Jain’s Fairness Index

Case    AWM       XCP       NC
1       1.0000    1.0000    0.9937
2       0.7007    0.9954    0.4392
on the basis of this value calculates a different feedback for each source. On the contrary, since the goal of AWM is to preserve the compatibility with TCP, AWM does not receive any information regarding the estimated round-trip time and calculates the swnd regardless of which specific TCP connection will receive it.
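The Jain fairness index used in this comparison can be computed as in the following Python sketch. The function name and the example share vectors are hypothetical; they are not the simulation data behind Table 1.

```python
def jain_fairness_index(shares):
    """Jain index: (sum x_i)^2 / (N * sum x_i^2) for per-flow shares x_i."""
    if not shares:
        raise ValueError("at least one flow is required")
    total = sum(shares)
    sum_of_squares = sum(x * x for x in shares)
    if sum_of_squares == 0:
        raise ValueError("at least one flow must receive a non-zero share")
    return total * total / (len(shares) * sum_of_squares)


# 30 flows sharing the link equally: perfectly fair.
print(jain_fairness_index([1.0] * 30))            # 1.0
# One flow takes everything among 4 flows: the index drops to 1/N.
print(jain_fairness_index([1.0, 0.0, 0.0, 0.0]))  # 0.25
```

The index ranges from 1/N (one flow takes the whole link) to 1 (all flows receive equal shares), which is the scale on which the values in Table 1 should be read.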
4 Conclusions

In this paper we have presented an extensive numerical analysis comparing the AWM mechanism, intended to enhance TCP congestion control, with the eXplicit Control Protocol (XCP), chosen as the reference case. Results demonstrate that AWM performs very closely to XCP. It achieves high performance as it stabilizes the queue length in the network access routers by manipulating the Advertised Window parameter in TCP ACKs. Moreover, it scales as the number of flows increases. In addition, AWM is easier to deploy than XCP, since it requires implementation only at access gateways, and no modification at internal routers or at hosts is required.
References

1. Brakmo, L., Peterson, L.: TCP Vegas: End to End Congestion Avoidance on a Global Internet. IEEE Journal on Selected Areas in Communication 13(8), 1465–1480 (1995)
2. Bitorika, A., Robin, M., Huggard, M., Mc Goldrick, C.: A Comparative Study of Active Queue Management Schemes. In: Proc. of ICC 2004 (June 2004)
3. Katabi, D., Handley, M., Rohrs, C.: Congestion Control for High Bandwidth-Delay Product Networks. In: Proc. of ACM Sigcomm (2002)
4. Floyd, S.: TCP and Explicit Congestion Notification. ACM Computer Communication Review 24(5), 10–23 (1994)
5. Kalampoukas, L., Varma, A., Ramakrishnan, K.K.: Explicit window adaptation: A method to enhance TCP performance. IEEE/ACM Transactions on Networking 10(3), 338–350 (2002)
6. Savoric, M.: Fuzzy Explicit Window Adaptation: Using router feedback to improve TCP performance, Technical Report TKN-04-009 (July 2004)
7. Barbera, M., Lombardo, A., Panarello, C., Schembra, G.: Active Window Management: an efficient gateway mechanism for TCP traffic control. In: Proc. of IEEE ICC 2007, Glasgow, Scotland (June 2007)
8. Braden, R.: Requirements for Internet Hosts - Communication Layers, October 1989. RFC 1122 (1989)
9. The network simulator ns-2, http://www.isi.edu/nsnam/ns
10. Barford, P., Crovella, M.: Generating representative Web workloads for network and server performance evaluation. In: Proc. of ACM SIGMETRICS (1999)
11. Jain, R., Chiu, D., Hawe, W.: A quantitative measure of fairness and discrimination for resource allocation in shared computer systems, DEC Research Report TR-301 (1984)
AIRA: Additive Increase Rate Accelerator

Ioannis Psaras and Vassilis Tsaoussidis

Dept. of Electrical and Computer Engineering, Democritus University of Thrace, 12 Vas. Sofias Str., 67100, Xanthi, Greece
{ipsaras,vtsaousi}@ee.duth.gr
http://comnet.ee.duth.gr
Abstract. We propose AIRA, an Additive Increase Rate Accelerator. AIRA extends AIMD functionality towards adaptive increase rates, depending on the level of network contention and bandwidth availability. In this context, acceleration grows when resource availability is detected by goodput/throughput measurements and slows down when increased throughput does not translate into increased goodput as well. Thus, the gap between throughput and goodput determines the behavior of the rate accelerator. We study the properties of the extended model and propose, based on analysis and simulation, appropriate rate decrease and increase rules. Furthermore, we study conditional rules to guarantee operational success even in the presence of symptomatic, extraordinary events. We show that analytical rules can be derived for accelerating, either positively or negatively, the increase rate of AIMD in accordance with network dynamics. Indeed, we find that the “blind”, fixed Additive Increase rule can become an obstacle for the performance of TCP, especially when contention increases. Instead, sophisticated, contention-aware additive increase rates may preserve system stability and reduce retransmission effort, without reducing the goodput performance of TCP.
1 Introduction
We introduce a new paradigm for the responsive behavior of flows when bandwidth availability changes due to varying network contention. We call this paradigm Additive Increase Rate Accelerator (AIRA), to reflect its operational perspective, namely to adjust the transmission rate of flows to current conditions of network contention. More precisely, we propose an algorithm which progressively adjusts the Additive Increase factor of AIMD according to the current level of network contention. Typical systems adjust their rate, inevitably, only when contention leads to congestion; then, timeouts and multiplicative decreases force flows to reduce their windows. However, the transmission policy itself (i.e., increase/decrease rules) is not adjusted according to network contention. That is, although systems are adaptive to network dynamics, this adaptivity is limited: window size can be regulated but the window increase rate (i.e., Additive Increase factor) cannot. This is similar to a car regulating its velocity scale but with fixed acceleration.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 691–702, 2008. © IFIP International Federation for Information Processing 2008
By and large, the impact of contention on network dynamics, as well as the need for adjusting network parameters accordingly, is known to the networking community. Indeed, the conclusion is straightforward if we consider that the aggregate system rate increase is different in a system with 2 flows than in one with 200 flows, when each flow increases with a fixed rate of 1 packet per window for every successfully acknowledged window. Consequently, our sample system with the same fixed increase/decrease parameters exhibits significantly different properties: it may reach stability or fail [15]; it may exploit bandwidth rapidly or waste available resources; it may require major retransmission effort or minimize overhead [13], [16].

In this context, we rely on two major concepts to move beyond the confined perspective of predetermined and inflexible network parameters: (i) effort-based contention estimation and (ii) contention-oriented adaptive Additive Increase transmission. Effort, which is expressed as the ratio of Throughput over Goodput, reflects the efficiency of the transmission strategy; when both throughput and goodput increase, bandwidth availability is clearly indicated; otherwise, as their gap widens, transmission effort is wasted. Consequently, increased contention will be reflected by higher protocol effort. In simple terms, a protocol that monitors transmission effort could approximate the dynamics of contention. That said, responses can be triggered accordingly. We investigate precisely the responsive strategies that correspond to various contention dynamics.

The proposed paradigm requires a number of issues to be addressed prior to deployment. How accurately can we estimate changes of contention? What is practically the gain from a hypothetical transition to the AIRA paradigm? What is the adaptation scheme of choice to varying contention? Are the properties of stability and fairness violated in favor of efficiency?
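The effort metric mentioned above (the ratio of Throughput over Goodput) can be illustrated with a small Python sketch. It assumes, as the paper defines later in Section 2, that Throughput counts all transmitted bytes (including retransmissions and header overhead) and Goodput counts only the original data delivered to the receiving application; the function names and byte counts are hypothetical.

```python
def goodput(original_bytes, connection_time):
    """Original data delivered to the application per unit of time."""
    return original_bytes / connection_time


def throughput(total_bytes, connection_time):
    """All transmitted bytes (retransmissions and headers included) per unit of time."""
    return total_bytes / connection_time


def effort(total_bytes, original_bytes, connection_time):
    """Effort = Throughput / Goodput; a value of 1.0 means no bytes were wasted."""
    return throughput(total_bytes, connection_time) / goodput(original_bytes, connection_time)


# Hypothetical transfer: 1,000,000 original bytes delivered in 10 s,
# while 1,250,000 bytes were actually put on the wire.
print(effort(total_bytes=1_250_000, original_bytes=1_000_000, connection_time=10.0))  # 1.25
```

A growing effort value over successive measurements corresponds to the widening gap between throughput and goodput that the text associates with increased contention.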
We address the aforementioned issues based on analysis and simulation experiments. In particular, we show that (i) contention can be estimated coarsely; coarse estimation suffices to enhance system properties; (ii) system fairness and stability can be preserved if we retain the Multiplicative Decrease response to congestion; and (iii) system efficiency can be increased if we adopt an adaptive Additive Increase response to available bandwidth. Adaptive increase allows for more sophisticated utilization of bandwidth, through more aggressive increase when contention is low and less retransmission effort when contention is high. Furthermore, we go beyond conclusive statements to investigate the particular responsive strategy that corresponds to varying contention. During our investigation, we uncovered a number of dynamics associated with heterogeneous RTTs, as well as with long- and short-lived flows. We show that adaptive increase favors short-lived flows, which is a desirable property if we consider a utilization-oriented (i.e., time-oriented) notion of fairness. Also, flows that experience long propagation delays tend to decrease their rate more slowly than flows that experience short propagation delays, which is another desirable property. Therefore, our proposal demonstrates high potential for deployment.

The remainder of the paper is organized as follows: In Section 2, we justify our research perspective and motives; in Section 3, we define our system model, which includes definitions, observations on the dynamics of Additive Increase with diverse rates and the proposed solution framework. We present AIRA in Section 4 and we propose adjustments to the proposed algorithm, due to practical constraints, in Section 5. In Section 6, we evaluate the performance of the proposed algorithm through simulations and, finally, in Section 7, we conclude the paper.
2 Motivation
Deployment of AIMD is associated with two operational standards: (i) the fixed increase rate and decrease ratio and (ii) the corresponding selection of appropriate values. Recent research has focused on altering the Additive Increase (a) and Multiplicative Decrease (b) values (e.g., [5], [8], [7], [21], [22], [18], [4], [19], [9], [2], [10], [11], [12], [24], [20], [6], [17], [1], [23]) but has not really questioned the validity and efficiency of fixed rates throughout the lifetime of participating flows. In this context, research efforts cannot address questions such as: Why do flows increase their rate by a packets instead of 2a packets, even when half the users of a system leave and bandwidth becomes available? One possible justification for not pursuing that research direction is that the Additive Increase factor of AIMD does not contribute to the long-term Goodput performance of TCP.

In Figure 1, we present the cwnd evolution for two TCP flavors: Figure 1(a), where a = 1 (regular TCP), and Figure 1(b), where a = 0.5. The area underneath the solid cwnd line plot (Area 1 and 2) represents the Goodput (see footnote 1) performance of the protocols. In Figure 1(c), we show that both protocols achieve the same Goodput performance, since A1 = A2 and A3 = A4 (see footnote 2).

However, Additive Increase significantly affects the Retransmission Effort of flows, which impacts overall system behavior as well. For example, TCP with a = 1, in Figure 1, experiences 4 congestion events, while TCP with a = 0.5 experiences only 2. Assuming that each congestion event is associated with a fixed number of lost packets, regular TCP (i.e., a = 1) will retransmit twice as many packets as TCP with a = 0.5, without any gain in Goodput performance. Clearly, there is a tradeoff between Aggressiveness and Retransmission Effort. The degree of Aggressiveness that a transport protocol can achieve is tightly associated with its Retransmission Effort. The higher the Additive Increase factor, the greater the retransmission effort of the transport protocol.

Fig. 1. Different Increase Factors: (a) Increase Factor = 1, (b) Increase Factor = 0.5, (c) Difference. Each panel plots cwnd against Time between t0 and t1, with the congestion events marked.

Footnote 1: We define the system Goodput as Original Data / Connection Time, where Original Data is the number of bytes delivered to the high level protocol at the receiver (i.e., excluding retransmissions and the TCP header overhead) and Connection Time is the amount of time required for the data delivery. Instead, system Throughput includes retransmitted packets and header overhead (i.e., Total Data / Connection Time).
Footnote 2: In Figure 1(c) grey areas are common for both protocols; white areas are equal (A1 is similar to A2 and A3 is similar to A4).
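The tradeoff discussed above can be reproduced with a deliberately idealized simulation. The Python sketch below is not the authors' simulator: it assumes a single flow, a fixed capacity at which a congestion event occurs, one window update per RTT, and halving of the window on congestion. The function name and all parameter values are hypothetical.

```python
def simulate_aimd(a, capacity, rtts):
    """Idealized single-flow AIMD: add `a` per RTT, halve when `capacity` is reached.

    Returns (number of congestion events, average cwnd over the run).
    """
    cwnd = capacity / 2.0
    events = 0
    total = 0.0
    for _ in range(rtts):
        total += cwnd
        cwnd += a                   # additive increase, once per RTT
        if cwnd >= capacity:        # congestion event: multiplicative decrease
            cwnd = capacity / 2.0
            events += 1
    return events, total / rtts


events_fast, avg_fast = simulate_aimd(a=1.0, capacity=100.0, rtts=200)
events_slow, avg_slow = simulate_aimd(a=0.5, capacity=100.0, rtts=200)
print(events_fast, events_slow)  # 4 2
print(avg_fast, avg_slow)        # 74.5 74.75
```

Under these assumptions the flow with a = 1 sees twice as many congestion events as the flow with a = 0.5 over the same interval, while the average window, and hence the area under the cwnd curve, is almost identical.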
3 System Model
The initial study that investigated the operational properties of AIMD is [3]. In that study, the authors assume a feedback model, where all flows become aware of congestion events synchronously. In the current study, we extend this model to reflect more realistic situations. For example, in our model, different flows may become aware of congestion events at different points in time (i.e., congestion feedback is received asynchronously). A synchronous model inherently assumes that flows do not experience queuing delays and hence, the duration of a Round (which is defined here as the interval between two cwnd multiplicative decreases) is invariant for all flows. Instead, we allow for the possibility of queuing delays, which further means that the duration of a Round may differ among flows. Finally, we also allow for the possibility of multiple packet losses at the end of a Round.

3.1 Definitions

We define the following terms:
1. A Round is defined as the interval between two cwnd multiplicative decreases.
2. Round Loss Rate ($p_i$) is the ratio of the lost packets over the total number of sent packets, within Round i. The Round Loss Rate is calculated at the end of each Round.
3. The Throughput Slope within a Round is defined as $a / cwnd$. Obviously, the Throughput Slope is identical to the cwnd Slope (see Figure 2).
4. Assuming a Round Loss Rate $p_i$ and a Round duration $t_i$, the Desired Throughput Slope within a Round is defined as the hypothetical cwnd Slope, which would result in zero packet losses, within $t_i$, but without causing bandwidth underutilization. The Desired Throughput Slope is, therefore, determined by $\frac{a}{cwnd} \cdot (1 - p_i)$ (see Figure 2).
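The definitions above translate directly into a few helper functions. The following Python sketch is illustrative only; the function names and the sample numbers are hypothetical.

```python
def round_loss_rate(lost_packets, sent_packets):
    """p_i: lost packets over total packets sent within one Round."""
    if sent_packets <= 0:
        raise ValueError("a Round must contain at least one sent packet")
    return lost_packets / sent_packets


def throughput_slope(a, cwnd):
    """Throughput (cwnd) Slope within a Round: a / cwnd."""
    return a / cwnd


def desired_throughput_slope(a, cwnd, p_i):
    """Desired Throughput Slope: (a / cwnd) * (1 - p_i)."""
    return throughput_slope(a, cwnd) * (1.0 - p_i)


# Hypothetical Round: a = 1, cwnd = 40 packets, 5 packets lost out of 100 sent.
p_i = round_loss_rate(lost_packets=5, sent_packets=100)
print(p_i)
print(throughput_slope(1.0, 40.0))
print(desired_throughput_slope(1.0, 40.0, p_i))
```

The gap between the two slopes grows with the Round Loss Rate and vanishes when no packets are lost.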
Fig. 2. Throughput / Desired Throughput Slopes at Round i. The figure plots cwnd against Time over Rounds i−1 and i, showing the Throughput slope a/W, the Desired Throughput slope a(1−p_i)/W, and the relation Lost / Sent = p_i.
3.2 Observations on the Dynamics of Additive Increase with Diverse Rates

1. When $p_i > p_{i-1}$, the Throughput Slope exceeds the Desired Throughput Slope for Round i.
– The greater the distance between $p_i$ and $p_{i-1}$, the wider the gap between the Throughput and Desired Throughput Slopes (i.e., the protocol is too aggressive).
2. When $p_i < p_{i-1}$, the Throughput Slope is underneath the Desired Throughput Slope for Round i.
– The greater the distance between $p_i$ and $p_{i-1}$, the wider the gap between the Desired Throughput and Throughput Slopes (i.e., the protocol is too conservative).

Hence, our primary objective is to reduce the gap between the Throughput and Desired Throughput Slopes, in order to avoid extensive retransmission effort, or bandwidth under-utilization, respectively.

3.3 Solution Framework
We provide a solution framework in order to determine the primary requirements of our system model. We require our system to:

1. Converge to Fairness.
– We evaluate the Fairness properties of a protocol using the Fairness Index introduced in [3]:

\[ Fairness = \frac{\left( \sum_{i=1}^{n} Throughput_i \right)^2}{n \sum_{i=1}^{n} Throughput_i^2}, \qquad (1) \]

where $Throughput_i$ is the Throughput performance of the i-th flow and n is the number of participating flows.
– We introduce the ALPHA Fairness Index of multi-rate systems. We define the ALPHA Fairness Index as:

\[ ALPHA\ Fairness = \frac{\left( \sum_{i=1}^{n} a_i \right)^2}{n \sum_{i=1}^{n} a_i^2}, \qquad (2) \]

where $a_i$ is the Additive Increase factor of the i-th flow and n is the number of participating flows. The ALPHA Fairness Index has the following properties: In a homogeneous/synchronous static (contention-wise) system, increased ALPHA Fairness Index leads to increased System Fairness Index as well. In a heterogeneous/asynchronous dynamic system, however, the properties of the ALPHA Fairness Index are not straightforward. For example, in a diverse-RTT system, reduced ALPHA Fairness Index may correspond to either increased or decreased system Fairness (i.e., reduced Additive Increase factor for longer-RTT flows results in reduced Fairness, while increased Additive Increase factor for longer-RTT flows results in increased system Fairness). We explore the above properties through simulations in the following sections.
2. Guarantee Stability. Stability is a quality measure, which we attempt to simplify and furthermore quantify by measuring the frequency of congestion events. Therefore, a protocol is said to achieve higher Stability when it minimizes retransmission overhead.
3. Exploit available resources efficiently, which means:
– faster (i.e., aggressively) when contention is low (i.e., utilize a high percentage of available bandwidth).
– slower (i.e., conservatively) when contention is high (i.e., minimize overhead/transmission effort).
Efficiency is achieved through high resource utilization (i.e., high goodput performance) and minimal retransmission overhead.
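As a worked illustration of the ALPHA Fairness Index in (2), consider a hypothetical system of n = 4 flows whose Additive Increase factors are (1, 1, 0.5, 0.5); these values are chosen only for the example and do not come from the paper's experiments.

```latex
% Hypothetical example: n = 4 flows with a = (1, 1, 0.5, 0.5)
\[
\text{ALPHA Fairness}
  = \frac{(1 + 1 + 0.5 + 0.5)^2}{4 \cdot (1^2 + 1^2 + 0.5^2 + 0.5^2)}
  = \frac{9}{4 \cdot 2.5}
  = 0.9
\]
% With identical factors a_i = a for all n flows, the index reaches its maximum:
\[
\frac{(n a)^2}{n \cdot n a^2} = 1
\]
```

The index therefore equals 1 only when all flows use the same Additive Increase factor and decreases as the factors diverge.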
4 AIRA: Additive Increase Rate Accelerator

We assume that a flow can adjust its rate at the end of each Round. The flow at the end of Round i can determine the rate for Round i + 1 exploiting, recursively, the behavior of its rate during Round i − 1. In particular, the Decrease rule applies when $p_i > p_{i-1}$.

4.1 The Decrease Rule
- Decrease Rule. In order to reduce the gap between the Throughput and Desired Throughput Slopes, the Additive Increase factor should decrease according to the equation:

\[ a_{i+1} = a_i \cdot (1 - p_i). \qquad (3) \]

We present the process followed by the AIRA Decrease Rule in Figure 3. At the end of Round i, the sender calculates the error rate $p_i$ and applies Equation 3 to the cwnd update function. Since $p_i > p_{i-1}$, the system operates in a moderated congestion environment, where the level of contention within the current round has increased compared to the previous one. Therefore, the sending rate at Round i + 1 decreases in order to improve stability and system fairness, according to the Desired Throughput slope indicated by AIRA. Justification regarding the rationale behind the AIRA Decrease Rule setting can be found in [14].

Fig. 3. Decrease Rule: (a) Throughput / Desired Throughput (Dt) Slopes at Round i, (b) cwnd Slope at Round i + 1

4.2 The Increase Rule
According to the Decrease Rule, the Additive Increase factor of TCP (i.e., AIRA) can only take non-increasing values. In that case, however, the system may never reach equilibrium; flows already existing in the system, possibly transmitting with a < 1, will not have the opportunity to compete (fairly) with new, incoming flows. Therefore, we introduce an Increase Rule, which applies when $p_i < p_{i-1}$.

- Increase Rule. In order to reduce the gap between the Desired Throughput and Throughput slopes, the Additive Increase factor should increase according to:

$a_{i+1} = a_i \cdot (1 + p_{i-1} - p_i)$. (4)

AIRA interprets the decrease of the Round error rate (i.e., $p_i < p_{i-1}$) as a sign of contention decrease and therefore increases the Additive Increase factor of TCP in order to exploit available resources. This increase, however, should not be excessive, in order to preserve system stability. Therefore, the Additive Increase factor is increased according to the error rate difference during the last two rounds. Justification regarding the rationale behind the AIRA Increase Rule setting can be found in [14].

Fig. 4. Increase Rule. (a) Throughput/Desired Throughput slopes at Round i; (b) cwnd slope at Round i + 1.
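The two rules can be read as a single per-round update of the Additive Increase factor. The following is a minimal sketch of Equations (3) and (4) only, not the authors' implementation: the function names are hypothetical, the error rate is assumed to be the ratio of lost to sent packets within a round (as the labels of Figure 3 suggest), and the case $p_i = p_{i-1}$, which the text does not treat separately, is assumed to go through Equation (4), where it leaves a unchanged.

```python
def error_rate(packets_lost, packets_sent):
    """Round error rate p_i, read here as lost/sent packets within the round."""
    return packets_lost / packets_sent if packets_sent else 0.0


def next_ai_factor(a, p_curr, p_prev):
    """Return a_{i+1} from a_i and the error rates of Rounds i and i-1."""
    if p_curr > p_prev:
        # Decrease Rule, Equation (3): contention has increased.
        return a * (1.0 - p_curr)
    # Increase Rule, Equation (4): contention has decreased.
    # For p_curr == p_prev this expression leaves a unchanged (assumption).
    return a * (1.0 + p_prev - p_curr)


# Three consecutive rounds: loss rises from 2 % to 10 %, then falls to 4 %.
a = 1.0
a = next_ai_factor(a, p_curr=0.10, p_prev=0.02)  # Decrease Rule applies
a = next_ai_factor(a, p_curr=0.04, p_prev=0.10)  # Increase Rule applies
print(round(a, 4))
```

Note that the increase step depends only on the difference of the two error rates, so a single round of decreased loss does not undo the preceding reduction.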
5 Adjustments due to Practical Constraints
5.1 Lower Bounds
We set lower bounds for AIRA for the following reason: progressive reduction of the Additive Increase factor may result in transmission rate stabilization (i.e., a = 0). This, however, will inevitably cause system instability, protocol inefficiency and flow starvation. We therefore bound a to 0.1, attempting to avoid these undesirable system properties. That said, the greatest possible reduction of the Additive Increase factor is 0.9 (i.e., $a_{\text{greatest reduction}} = 0.9$, when a = 0.1). Furthermore, we complement this absolute bound with another, dynamically adjustable lower bound, which depends on the number of completed Rounds and is called Minimum Additive Increase Limit (MAIL). The reason is twofold:

1. A flow may lose several back-to-back packets, due to sudden traffic bursts or symptomatic events. In order to limit drastic and systematic responses to symptomatic events, lower bounds need to be introduced. Otherwise, recovery from such symptomatic events may require significant effort and time or may even become impossible.
2. When contention reaches extraordinary levels, the AIRA lower bound (i.e., a = 0.1) dominates and the rest of the AIRA functionality is practically suspended. In that case, MAIL will prevent short-flow starvation (e.g., a flow at the second Round operating with a = 0.1), adhering to a time-oriented notion of Fairness. That is, assuming that short flows need to be favored over long, time-insensitive ones, MAIL is designed to provide more opportunities for data transmission to short, probably time-sensitive flows.

Minimum Additive Increase Limit (MAIL). We use the following equation to adjust MAIL:

$\mathrm{MAIL} = a_{\min} = 1 - a_{\text{greatest reduction}} \cdot RN\,\%$, (5)

where RN is the number of completed Rounds and $a_{\text{greatest reduction}} = 0.9$ (see Figure 5). As a secondary effect, MAIL exhibits one desirable property, which was not an initial design goal: it favors long-RTT flows over shorter-RTT ones. That is, short-RTT flows complete a Round faster than long-RTT flows. Therefore, MAIL will reach a lower value (i.e., Additive Increase factor) for a short-RTT flow faster than for a longer-RTT one. Hence, in case of moderate or high contention the short-RTT flows will operate with a lower Additive Increase factor, thereby avoiding starvation of long-RTT flows. We verify the above assumption through simulations in Section 6.
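As a worked illustration of Equation (5), the sketch below computes MAIL for a given number of completed Rounds and applies it as a floor to the Additive Increase factor. Two points are assumptions rather than statements from the text: "RN %" is read as RN/100, and MAIL is clamped at the absolute bound a = 0.1 once RN exceeds 100. The function names are hypothetical.

```python
A_GREATEST_REDUCTION = 0.9  # from Section 5.1
A_ABSOLUTE_MIN = 0.1        # absolute lower bound on a


def mail(rounds_completed):
    """Minimum Additive Increase Limit after the given number of Rounds."""
    # Equation (5), reading "RN %" as RN / 100 (assumption).
    value = 1.0 - A_GREATEST_REDUCTION * rounds_completed / 100.0
    # Never fall below the absolute bound (assumption for RN > 100).
    return max(A_ABSOLUTE_MIN, value)


def bounded_ai_factor(a, rounds_completed):
    """Apply MAIL as a floor to a freshly computed Additive Increase factor."""
    return max(a, mail(rounds_completed))


for rn in (0, 2, 50, 100, 180):
    print(rn, round(mail(rn), 3))
```

Under this reading, a flow in its second Round cannot operate below a = 0.982 regardless of the loss it has seen, whereas a flow that has completed 100 or more Rounds is bounded only by a = 0.1.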
Fig. 5. Minimum Additive Increase Limit (MAIL), plotted against the number of completed Rounds.
5.2 Response to Contention Decrease
Although the AIRA Increase Rule (Section 4) proves to operate efficiently in static as well as in contention-increase scenarios, it fails to exploit extra available resources when contention decreases (i.e., $p_{i-1} - p_i$ is a small value, incapable of accelerating a fast enough, see Equation (4)). In line with our Solution Framework (Section 3.3), we apply a Reset Condition to AIRA, in order to prevent bandwidth wastage in case of contention decrease.

AIRA Reset Condition. Assuming that a flow's cwnd oscillates between W/2 and W before the contention decrease event, the flow will have the opportunity to expand its cwnd to $W + \frac{1}{2}W = \frac{3}{2}W$ after $\frac{1}{3}$ of the participating flows leave the system. When this happens, AIRA resets a to 1. System-wise, we assume that AIRA should exploit bandwidth quickly iff $\frac{n-x}{n} \le \frac{2}{3}$, where n is the total number of participating flows and x is the number of flows that end their task and leave the system. In this case (i.e., when $x \ge \frac{1}{3}n$), AIRA resets a to 1. Clearly, AIRA is slower in exploiting available resources compared to regular TCP (i.e., a = 1), but we consider this delay to be acceptable, according to our solution framework (see Section 3.3). In [14], we present an analysis regarding the convergence properties of AIRA in case of contention decrease.
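The Reset Condition can be sketched from both viewpoints used above. If the n flows share the bottleneck equally, each flow's share grows by a factor n/(n − x) when x flows leave, and n/(n − x) ≥ 3/2 is the same condition as (n − x)/n ≤ 2/3. The code below is an illustration under that equal-share assumption; the function names are hypothetical, and the way a sender tracks its previous peak window W is not specified in the text.

```python
def departure_triggers_reset(n, x):
    """System view: True iff (n - x) / n <= 2/3, i.e., x >= n / 3.

    Integer arithmetic avoids rounding problems at the boundary.
    """
    return 3 * (n - x) <= 2 * n


def maybe_reset(a, cwnd, w_peak):
    """Flow view: reset a to 1 once cwnd reaches 3/2 of the previous peak W."""
    return 1.0 if 2 * cwnd >= 3 * w_peak else a


# With 9 flows, the departure of 3 flows meets the condition, 2 do not.
print(departure_triggers_reset(9, 3), departure_triggers_reset(9, 2))
# A flow that previously peaked at W = 20 resets when cwnd reaches 30.
print(maybe_reset(0.4, cwnd=30, w_peak=20), maybe_reset(0.4, cwnd=29, w_peak=20))
```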
6 Simulation Results
We attempt to verify the above considerations through simulations. Due to space limitations, we cannot present extended performance evaluations for AIRA in the current manuscript. We refer the interested reader to [14] for a variety of simulation scenarios and the corresponding results.

Initially, we simulate 4 flows over the "Diverse-RTT Network Topology" (Figure 6(a)). We use Drop Tail routers, with buffer sizes equal to the Bandwidth-Delay Product of the outgoing links. We note that results are similar in case of Active Queue Management schemes, such as RED. The round-trip propagation delay for flows 1 and 2 is 60 ms, while for flows 3 and 4 it is 220 ms. Simulation time is 300 seconds. Figure 6(b) depicts the Additive Increase factor for each flow. We see that MAIL provides more transmission opportunities to long-RTT flows than to short-RTT flows (see Fairness Index in Table 1).

We extend the above scenario to include more flows. The results are presented in Figure 7. In all cases (i.e., the 4-flow scenario, Table 1, and the extended scenario, Figure 7), we see that the proposed algorithm improves Stability (through far fewer retransmissions) and Efficiency (through fewer retransmissions and slightly increased Goodput). As contention increases, we notice a considerable decrease of the ALPHA Fairness Index, which indicates fairer resource allocation among short- and long-RTT flows when AIRA is used.

Fig. 6. (a) Diverse-RTT Network Topology: Sources 1..n and Sinks 1..n attached to Router 1 and Router 2 over 10 Mbps access links with 10 ms or 50 ms delay; link between the routers: 2 Mbps for 4 flows, 20 Mbps for 20, 40, ..., 120 flows, 10 ms delay. (b) Additive Increase Factor of flows 1 to 4 over time (0 to 300 s).

Table 1. Performance Difference - RTT Un-Fairness

        Goodput      Retransmissions   Fairness   a Fairness
AIMD    200.6 KB/s   2009 pkts         0.6594     1.0
AIRA    209.5 KB/s   943 pkts          0.8768     0.787

Fig. 7. RTT Un-Fairness Scenario. (a) System Goodput (B/s); (b) Retransmission Overhead (retransmitted packets); (c) Fairness Index; (d) ALPHA Fairness Index. Each panel compares AIMD and AIRA for 20 to 120 flows.
7 Conclusions
We have shown that analytical rules can be derived for accelerating, either positively or negatively, the increase rate of AIMD in accordance with network dynamics. Indeed, we found that the "blind" Additive Increase rule can become an obstacle for the performance of TCP, especially when contention increases. Instead, sophisticated, contention-aware additive increase rates may preserve system stability and reduce retransmission effort, without reducing the goodput performance of TCP. Based on specific criteria, namely efficiency, fairness and stability, we proposed and evaluated an adaptive additive increase rate scheme, which we call Additive Increase Rate Accelerator (AIRA). We have shown two major results: (i) Fairness is possible even in the context of varying increase rates within the same system. Furthermore, the problematic balance among short and long flows, as well as long- and short-RTT flows, can be better handled. (ii) Efficiency can be improved. Efficiency is judged not only on the basis of goodput performance but mainly against retransmission effort and overhead.

Our primary future work direction is the theoretical evaluation of the accuracy of AIRA's estimations. For instance, progressive contention is here estimated and is not explicitly communicated by some central network authority. The accuracy and precision of this estimation depend on two main factors: (i) the granularity of measurements and (ii) the accuracy of the monitoring functions. For example, the way throughput is calculated and the frequency of calculating throughput may have some impact. Higher (e.g., per-packet) measurement granularity will probably increase precision, but it will also increase system entropy, leading to reduced system stability. Among others, this observation calls for further investigation.
References

1. Bansal, D., Balakrishnan, H.: Binomial congestion control algorithms. In: INFOCOM, pp. 631–640 (2001)
2. Brakmo, L., O’Malley, S., Peterson, L.: TCP Vegas: New techniques for congestion detection and avoidance. In: Proceedings of SIGCOMM (1994)
3. Chiu, D.-M., Jain, R.: Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks. Computer Networks and ISDN Systems 17(1), 1–14 (1989)
4. McCullagh, G., Leith, D.J., Shorten, R.N.: Experimental Evaluation of Cubic-TCP. In: Proceedings of PFLDnet (2007)
5. Floyd, S.: HighSpeed TCP for Large Congestion Windows, RFC 3649 (December 2003)
6. Floyd, S., Kohler, E.: TCP Friendly Rate Control (TFRC): The Small-Packet (SP) Variant, RFC 4828, Experimental (April 2007)
7. Jin, S., Guo, L., Matta, I., Bestavros, A.: TCP-friendly SIMD congestion control and its convergence behavior. In: Proc. 9th IEEE International Conference on Network Protocols (ICNP 2001), Riverside, CA, November 2001, vol. 5 (2001)
8. Kelly, T.: Scalable TCP: Improving Performance in Highspeed Wide Area Networks. ACM SIGCOMM Computer Communication Review 33(2), 83–91 (2003)
9. King, R., Baraniuk, R., Riedi, R.: TCP-Africa: An adaptive and fair rapid increase rule for scalable TCP. In: Proceedings of INFOCOM 2005, vol. 3, pp. 1838–1848 (2005)
10. Lahanas, A., Tsaoussidis, V.: Exploiting the efficiency and fairness potential of AIMD-based congestion avoidance and control. Computer Networks 43(2), 227–245 (2003)
11. Lahanas, A., Tsaoussidis, V.: τ-AIMD for asynchronous receiver feedback. In: Proceedings of the Eighth IEEE International Symposium on Computers and Communications, ISCC (2003)
12. Marfia, G., Palazzi, C., Pau, G., Gerla, M., Sanadidi, M.Y., Roccetti, M.: TCP Libra: Exploring RTT-Fairness for TCP. UCLA Computer Science Department Technical Report TR050037 (2005)
13. Psaras, I., Tsaoussidis, V.: WB-RTO: A Window-Based Retransmission Timeout for TCP. In: Proc. of the 49th IEEE GLOBECOM (November 2006)
14. Psaras, I., Tsaoussidis, V.: AIRA: Additive increase rate accelerator. Technical Report, TR-DUTH-EE-2007-12 (2007), http://comnet.ee.duth.gr/comnet/files/aira-tech-report.pdf
15. Psaras, I., Tsaoussidis, V.: Why TCP timers (still) don’t work well. Computer Networks 51(8), 2033–2048 (2007)
16. Psaras, I., Tsaoussidis, V., Mamatas, L.: CA-RTO: A Contention-Adaptive Retransmission Timeout. In: Proceedings of ICCCN (October 2005)
17. Rhee, I., Ozdemir, V., Yi, Y.: TEAR: TCP emulation at receivers – flow control for multimedia streaming. NCSU Technical Report (April 2000)
18. Rhee, I., Xu, L.: CUBIC: A New TCP-Friendly High-Speed TCP Variant. In: Proceedings of PFLDnet (2005)
19. Tan, K., Song, J., Zhang, Q., Sridharan, M.: A compound TCP approach for high-speed and long distance networks.
20. Tsaoussidis, V., Zhang, C.: The dynamics of responsiveness and smoothness in heterogeneous networks. IEEE Journal on Selected Areas in Communications (JSAC) 23(6), 1178–1189 (2005)
21. Wei, D.X., Jin, C., Low, S.H., Hegde, S.: FAST TCP: motivation, architecture, algorithms, performance. IEEE/ACM Transactions on Networking (to appear, 2007)
22. Xu, L., Harfoush, K., Rhee, I.: Binary increase congestion control (BIC) for fast long-distance networks. In: Proceedings of INFOCOM 2004, vol. 4, pp. 2514–2524 (March 2004)
23. Yang, Y.R., Lam, S.S.: General AIMD congestion control. In: Proceedings of ICNP 2000 (October 2000)
24. Zhang, C., Tsaoussidis, V.: TCP smoothness and window adjustment strategy. IEEE Transactions on Multimedia 8(3), 600–609 (2006)
Performance Evaluation of Quick-Start TCP with a Linux Kernel Implementation

Michael Scharf and Haiko Strotbek

Institute of Communication Networks and Computer Engineering (IKR), University of Stuttgart, Germany
Abstract. Quick-Start is an experimental extension of the Transmission Control Protocol (TCP) that uses explicit router feedback to speed up best effort data transfers. With Quick-Start, TCP endpoints can request permission from the routers along the path to send at a higher rate than allowed by the default TCP congestion control, which avoids the time-consuming Slow-Start. However, since Quick-Start TCP requires modifications in the protocol stacks of end-systems and routers, realization complexity is a major concern. This paper studies Quick-Start with a new implementation in the Linux protocol stack. We first show that Quick-Start support can be added to a real stack with rather limited effort, without causing much processing overhead. Second, we perform measurements with Web applications and study the impact of important parameters. These experiments with real applications demonstrate that Quick-Start can significantly speed up data transfers, and they confirm the outcome of previous simulation efforts. Our results suggest that Quick-Start is a lightweight mechanism that could be very beneficial for broadband interactive applications in the future Internet.
1 Introduction
Most Internet applications use the Transmission Control Protocol (TCP) for reliable, best effort transport. Since TCP is a pure end-to-end protocol, the connection endpoints must continuously probe the available bandwidth on the path in order to adapt their sending rate. The TCP congestion control is challenged by new broadband technologies with large bandwidth-delay products. In such cases, TCP has difficulties in determining an appropriate data rate, in particular after connection setups, or after long idle periods. Several "clean slate" Internet research efforts address this issue by using explicit feedback from routers. Quick-Start [1] is an experimental TCP extension that applies explicit router feedback to avoid the time-consuming Slow-Start [2], which is often considered to be a problem in high-speed networks. With Quick-Start, hosts can ask for a high initial sending rate, e.g., during the three-way handshake, and the routers along the path can approve, modify, or discard this request. Several simulation studies [3,4] show that these mechanisms can significantly reduce the completion time of high-speed data transfers over paths with a large bandwidth-delay product, such as over broadband wide area networks, satellite or cellular links.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 703–714, 2008. © IFIP International Federation for Information Processing 2008
Fig. 1. Illustration of a Quick-Start (QS) request during the three-way handshake
The new Quick-Start TCP extension requires modifications of the protocol stacks, both in end-systems and in routers. Thus, similar to other related schemes for router-assisted congestion control, implementation and deployment complexity is a major concern. In order to answer the question whether Quick-Start could be useful in the future Internet, implementations and experiments with real applications are required [1]. This paper contributes to this discussion in two ways: First, we present a complete implementation of Quick-Start in the Linux kernel. To the best of our knowledge, it is the first operational Quick-Start implementation in a real-world protocol stack. We also report some of the lessons learned during our development work, extending an initial presentation in [5]. Second, we use our implementation to perform experiments with different applications and network scenarios. The results illustrate the usefulness of Quick-Start in real setups.

The rest of this paper is structured as follows: Section 2 introduces the Quick-Start extension and surveys related work. In Section 3, we present details of our implementation. Section 4 reviews our evaluation methodology. The results of our measurements are discussed in Section 5. Section 6 concludes the paper.
2 Quick-Start TCP Extension
2.1 Overview
The Quick-Start mechanism enables TCP connection endpoints to determine an allowed sending rate in cooperation with the routers on the path, in particular at the beginning of a data transfer. Fig. 1 illustrates a Quick-Start request during the connection establishment: In order to indicate its desired sending rate, the originator adds a Quick-Start request option to the Internet Protocol (IP) header. This target rate is encoded in 15 coarse-grained steps ranging from 80 kbit/s to 1.31 Gbit/s. The routers along the path can approve, modify, or disallow this rate request. If the request arrives at the destination, the granted rate is echoed back as a piggybacked TCP option (Quick-Start response). The originator then determines whether all routers along the path have approved the request. If not, the default congestion control (Slow-Start) is used to ensure backward compatibility. If the Quick-Start request is successful, the originator can immediately increase its congestion window and start to send with the approved rate. After one round-trip time (RTT) the Quick-Start phase is completed and the default TCP congestion control mechanisms are used for subsequent data transfers.

Quick-Start does not guarantee any data rate, i.e., it is a lightweight speedup mechanism for elastic best effort traffic only. Simulations [3] and analytical models [4] reveal that the transfer times of moderate-sized files can be improved by several hundred percent. In addition to avoiding initial Slow-Starts, the Quick-Start mechanism could also be quite useful in the middle of data transfers, e.g., after longer idle periods, or in combination with link layer mobility triggers. However, Quick-Start TCP, as any router-assisted congestion control approach, comes at some cost: It requires support in all routers along a path, which causes processing overhead and which makes a short-term deployment in the Internet unrealistic. There are also interworking issues, e.g., with IP tunnels and firewalls. This is why [1] recommends that the use of Quick-Start should initially be limited to controlled environments such as intranets.

2.2 Related Work
There are numerous proposals that address the performance limitation of TCP’s Slow-Start. A comprehensive survey can be found in [1]. Four principal approaches can be distinguished: (1) schemes that apply bandwidth estimation techniques in order to determine the available bandwidth, (2) additional function blocks that share path capacity information, such as a “congestion manager”, (3) explicit feedback from network elements along a path, and (4) proposals to use an arbitrarily high sending rate at the beginning of data transfers [6]. Quick-Start belongs to the third category. It does not suffer from the inaccuracy of end-to-end bandwidth estimation techniques. Unlike the second approach, Quick-Start can also handle completely new paths. And, unlike the fourth category, there is no danger of severe congestion due to over-aggressiveness. Quick-Start is to be used in transient phases only, when precise information about the path is missing.

Other research approaches suggest replacing the TCP congestion control by fine-grained, per-packet feedback from routers. Two examples are the eXplicit Control Protocol (XCP) and the Rate Control Protocol (RCP). XCP has been implemented in FreeBSD [7] and Linux [8]. In the latter case, the XCP protocol is encoded as a TCP option, and the router functions are realized as a loadable module for the Linux traffic control. A Linux implementation of RCP has recently been reported in [9]. Here, a shim layer has been added in the endpoints between IP and TCP, and the router support is provided by a Linux netfilter plugin. These implementation efforts demonstrate the need for practical experiments with such new congestion control schemes.
3 Quick-Start Implementation in the Linux Kernel
3.1 Required TCP/IP Protocol Stack Extensions
Quick-Start requires modifications in the stacks of end-systems as well as additional processing in every node that forwards IP packets towards potential bottleneck links. For instance, the originator must build and send the IP option, validate the response, and then use the allowed data rate when sending new data. This requires an additional rate pacing mechanism because TCP is a window-based protocol without rate control. The receiver must echo back the requests as a TCP option. Furthermore, the flow control may have to be changed [10].

Quick-Start enabled routers require three new functions: First, routers have to determine the capacity of the outgoing links and keep track of their utilization, unlike today’s routers, which are not necessarily aware of link capacities. This information can be obtained by configuration, by cross-layer information exchange with the sub-IP layer, or by bandwidth estimation techniques. Second, routers must process the new IP options, i.e., they must read and write some of the entries (e.g., the data rate). And third, routers must perform an admission control that decides whether to accept a Quick-Start request, and which data rate to grant. This typically requires a short history of recently approved requests. All three functions can be realized without keeping any per-flow state.

Fig. 2. Linux kernel code modifications by our Quick-Start implementation. The figure shows the typical flow of packets between user space (application, socket interface, sysctl configuration) and the device driver, through TCP functions (do_tcp_setsockopt, tcp_v4_rcv, tcp_send_msg, tcp_write_xmit, tcp_transmit_skb) and IP functions (ip_rcv, ip_local_deliver, ip_forward, ip_forward_finish, ip_queue_xmit, ip_push_pending_frames, ip_build_and_send_packet, ip_finish_output, dev_queue_xmit, net_rx_action), with the added Quick-Start blocks for option handling, state, rate pacing, Quick-Start TTL decrement, traffic metering and admission control.

3.2 Implementation Details
The Linux kernel has a highly optimized TCP/IP networking stack. This stack is thus an excellent basis for studying the implementation effort caused by Quick-Start. We extended the Linux TCP/IP stack so that it completely implements Quick-Start for IP version 4 according to [1], using kernel version 2.6.20 as basis. The implemented features include both the end-system and router functions mentioned in the previous section. Fig. 2 gives a simplified overview of the structure of the IP and TCP layer in the Linux kernel, highlighting a couple of important functions that are involved in packet processing (details are given, e.g., in [11]). The blocks in this figure illustrate our main modifications of the kernel code. Even though the Quick-Start mechanism is rather simple compared to the complexity of TCP as a whole, changes have been required in almost every part of the TCP implementation, and also in several IP functions. In the following, we briefly explain some aspects where implementing the specification [1] has not been straightforward. More details are documented in [12].

A Quick-Start sender must keep additional state concerning a request and the rate pacing. Our Quick-Start state engine is sketched in Fig. 3. Some complexity is caused by the requirement that requests must only be sent in IP packets carrying a TCP segment that gets acknowledged, i.e., with a SYN flag or with payload data. Further states are needed for the rate pacing (see Fig. 4), because the rate pacing only starts when new data is available, and because there are several abort conditions.

Fig. 3. Quick-Start state engine (states: Uninitialized, Enabled, Sent request, Report rate, Done, Not supported, Not used; transitions: new request, SYN or data segment, approved rate > 0, report sent, not approved or invalid response)

Fig. 4. Rate pacing state engine (states: Initialized, Active; entered when the Quick-Start state is "Report rate" and data is available; driven by timeouts; left when the target cwnd is achieved, an ACK for paced data arrives, or a DUPACK or ECN is received)

We have realized the rate pacing by an additional timer. Each time the timer expires, the congestion window snd_cwnd is increased by a given amount of segments, up to the Quick-Start congestion window cwnd_qs. As Linux offers a high timer granularity of 1 ms when running with HZ = 1000 1/s, there is some degree of freedom whether to use few or many timers. As shown in Fig. 5, one can distinguish three different cases: In case 1, the increase of cwnd_qs is small and the timers must be distributed evenly over the rtt_ticks possible timers during one RTT. However, if cwnd_qs is large (case 2), the maximum number of timers rtt_ticks can be reached, and carryover segments might be needed if cwnd_qs is not a multiple of rtt_ticks. The overhead of many timers can be reduced by enforcing a minimum window increase per timer (case 3), i.e., a minimum chunk size M. The number of timers then follows as

$\mathit{timers} = \min(\mathit{cwnd\_qs}/M,\ \mathit{rtt\_ticks})$.

The configuration parameter M allows trading off timer processing overhead against traffic burstiness. By default, we use a minimum chunk size of 3 segments.

The Quick-Start router functions are realized by a new method in the forwarding path of IP packets. This method meters the traffic during intervals of duration D (1 s, by default), and it processes the Quick-Start requests. We implemented the “target algorithm” admission control [1,3]. The recently approved requests are stored in a ring buffer (cf. [4]). The default size is two slots, i.e., the granted bandwidth is remembered for $H = 2D = 2$ s. An alternative solution would have been to use the Linux netfilter hooks. However, a precise traffic metering requires some additional processing of every packet anyway.
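The router-side processing described above can be illustrated with a simplified model. The sketch below is not the kernel code and not the exact "target algorithm" of [1,3]: it only shows the general shape of an admission decision that combines a metered utilization, a two-slot ring buffer of recent approvals, and the coarse rate encoding. The doubling rule of 40 kbit/s × 2^N for codes N = 1..15 is the encoding defined in RFC 4782 and reproduces the range from 80 kbit/s to 1.31 Gbit/s quoted in Section 2.1. The class name, the method names and the threshold of 0.9 are assumptions made for this example.

```python
from collections import deque


def decode_rate(code):
    """Rate in bit/s for a rate request code (0 = no rate, 1..15 = 40 kbit/s * 2**code)."""
    if not 0 <= code <= 15:
        raise ValueError("rate code must be in 0..15")
    return 0 if code == 0 else 40_000 * 2 ** code


def encode_rate(bps):
    """Largest code whose rate does not exceed bps (0 if bps < 80 kbit/s)."""
    code = 0
    while code < 15 and decode_rate(code + 1) <= bps:
        code += 1
    return code


class QuickStartRouter:
    """Toy admission control for one outgoing link (hypothetical interface)."""

    def __init__(self, capacity_bps, threshold=0.9, slots=2):
        self.capacity_bps = capacity_bps
        self.threshold = threshold          # assumed target utilization
        self.utilization_bps = 0            # result of the last metering interval
        # Ring buffer: approved bit/s per metering interval, newest slot last.
        self.history = deque([0] * slots, maxlen=slots)

    def end_interval(self, measured_bps):
        """Call once per interval D with the metered traffic rate."""
        self.utilization_bps = measured_bps
        self.history.append(0)              # the oldest slot drops out

    def process_request(self, code):
        """Return the (possibly reduced) rate code written back into the option."""
        headroom = (self.threshold * self.capacity_bps
                    - self.utilization_bps - sum(self.history))
        granted = min(code, encode_rate(max(0, headroom)))
        self.history[-1] += decode_rate(granted)
        return granted


router = QuickStartRouter(capacity_bps=10_000_000)
print(router.process_request(15))  # reduced to what an idle 10 Mbit/s link allows
print(router.process_request(7))   # reduced further: the first grant is remembered
```

No per-flow state is kept in this model: the only memory is the aggregate bandwidth approved in the last two intervals, which is what prevents several simultaneous requests from each being granted the full spare capacity.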
Fig. 5. Incrementing the congestion window with timers to realize rate pacing (RTT = 20 ticks). Case 1, small window increase: cwnd_qs = 15, minimum chunk M = 1, window values 1, 2, ..., 15. Case 2, large window increase: cwnd_qs = 50, minimum chunk M = 2, window values 3, 6, ..., 30, 32, 34, ..., 50, including "carryover" segments. Case 3, large window increase with bundling: cwnd_qs = 50, minimum chunk M = 4, window values 5, 10, 14, 18, ..., 50.
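The three cases of Fig. 5 can be reproduced with a short calculation. The sketch below derives the number of timers from the formula given in the text and then spreads the window increase over them. How the carryover segments are distributed (one extra segment on each of the first timers) is inferred from the values shown in the figure and is an assumption, as are the function name and the guard for cwnd_qs < M; the placement of the timers on the tick axis is not modeled.

```python
def pacing_schedule(cwnd_qs, rtt_ticks, min_chunk=3):
    """Cumulative window increase after each pacing timer expiry.

    timers = min(cwnd_qs / M, rtt_ticks), using integer division and
    at least one timer (assumption for cwnd_qs < M).
    """
    timers = max(1, min(cwnd_qs // min_chunk, rtt_ticks))
    # Base chunk per timer; leftover ("carryover") segments are added
    # one by one to the first timers (inferred from Fig. 5).
    base, carry = divmod(cwnd_qs, timers)
    schedule = []
    window = 0
    for k in range(timers):
        window += base + (1 if k < carry else 0)
        schedule.append(window)
    return schedule


print(pacing_schedule(15, rtt_ticks=20, min_chunk=1))  # case 1
print(pacing_schedule(50, rtt_ticks=20, min_chunk=2))  # case 2
print(pacing_schedule(50, rtt_ticks=20, min_chunk=4))  # case 3
```

With the default M = 3 and the same RTT of 20 ticks, a Quick-Start window of 50 segments would need 16 timers instead of 20, which illustrates the trade-off between timer overhead and burstiness.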
3.3 Implementation Complexity and Lessons Learned
The experiments in the next sections show that our Quick-Start implementation is fully operational. The total implementation effort has been rather limited: The patch adds or modifies fewer than 2000 lines of kernel code (< 5 % of the TCP and IPv4 code). It requires about 20 additional integer variables per TCP connection in the tcp_sock structure and some new variables per device in the IP layer. It is configured by some additional sysctl variables and ioctl calls.

The addition of Quick-Start results in some ugly code. Since Quick-Start violates the layer separation between IP and TCP, interface extensions are required between the layers. Options have to be processed in a couple of different functions so that many parts of the TCP/IP stack have to be modified. A tricky problem is that the TCP maximum segment size must be reduced for IP packets carrying a Quick-Start IP option in order to avoid packet fragmentation.

A remaining challenge is to automatically determine the capacity of outgoing links. There are certain possibilities to obtain such information from the corresponding Linux device drivers, but there is no “one-fits-all” interface from the network layer to the drivers, even for standard Ethernet network cards. Our current solution is to manually configure the link capacity for each interface.

These changes result in performance costs. The overhead introduced by the Quick-Start functions is rather small in typical usage scenarios. In the TCP layer, a fine-grained rate pacing requires additional timers. Yet, this does not have a significant impact on the system load unless there are many parallel connections. The overhead in the IP layer is also small, as long as only few packets carry Quick-Start options. We performed a kernel profiling analysis to determine the load share of all method calls that handle the packet forwarding and process IP options. Fig. 6 presents the results obtained from the “oprofile” tool, using both an unmodified and a patched kernel. A router (P4 2.8 GHz PC with Gbit/s Ethernet interfaces) was exposed to a small packet rate of 100 packets/s and a high load of 300,000 packets/s. Both workloads use small 50 byte packets. Without any Quick-Start options being present, there is a measurable overhead, but it is rather small. The additional effort is mainly caused by the traffic metering that counts every IP packet. It could be avoided if the link utilization was already available, e.g., from network monitoring components. If 1 % of the packets include a Quick-Start option, which could be a realistic scenario, the option processing can increase the load by a few percent. Still, this effect neither significantly decreases the maximum throughput (about 330,000 packets/s), nor increases the packet delay in the router (ca. 50 μs in this scenario). Only if the share of Quick-Start packets is very high does the Quick-Start processing cause a considerable additional system load, which can mainly be attributed to the recalculation of the random Quick-Start nonces (see [1]). But such a traffic characteristic is unrealistic and should never occur when Quick-Start is used as foreseen.

Fig. 6. Router processing overhead: load caused by IP packet forwarding [%] (logarithmic axis) versus the ratio of packets with Quick-Start IP option (0 %, 1 %, 5 %, 50 %), for the original kernel and the kernel with Quick-Start, at 100 packets/s and 300,000 packets/s.

Fig. 7. Client-server communication: client and server (each P4 2.8 GHz, 2 GB RAM) connected by 10 Mbit/s Ethernet; the SYN is answered by a SYN,ACK carrying the QS request (IP), followed by an ACK with the QS response (TCP) and a QS report (IP); the server response time spans the application request and response.
4 Measurement Methodology
4.1 Testbed Setup
In order to test our Quick-Start implementation and to study the benefit compared to standard TCP, we used a network setup with a client and a server. As illustrated in Fig. 7, Quick-Start is activated for data transfers from the server to the client, with the request being sent in the SYN/ACK segment. Both computers were interconnected by a direct Ethernet link that is forced to 10 Mbit/s line speed. This is a realistic value for long-distance connections over today’s access and wide area networks. Client and server are Ubuntu 7.04 installations running the modified Linux kernel. The Linux “NetEm” network emulation was used to delay packets for a constant duration, which models the latency of wide area networks. As recommended in [10], the socket buffers were increased to a larger value (8 Mbyte) in order to avoid any limitation by the TCP flow control.

4.2 Quick-Start Enabled Applications
A Quick-Start request can be activated by two mechanisms. First, we have implemented some kernel-internal heuristics that can automatically issue a Quick-Start request. For instance, a kernel can be configured to send a request in every TCP connection setup. The request can also automatically be repeated after longer idle times, if the congestion window validation [13] has reduced the window. Second, applications can explicitly trigger a Quick-Start request by setting a new socket option, either before or during the usage of the connection.
Fig. 8. HTTP model used by the “surge” Web traffic generator (base request and base document, followed by embedded requests and embedded files; the object download time is followed by a think time)
Quick-Start is particularly interesting for interactive applications that frequently exchange certain amounts of data and thus often suffer from the Slow-Start. In our experiments, we use two different types of interactive applications. First, we perform tests with simple client and server C programs that communicate as shown in Fig. 7, with variable amounts of data to be transferred. The measured server response time is averaged over five consecutive measurements. Second, in order to study Web applications, we use a “lighttpd” Web server (version 1.4.18) and the “surge” Web traffic generator (version 1.00a) [14], which supports HTTP/1.0 and HTTP/1.1 requests, without or with pipelining. We mainly use the default configuration, i.e., there is a configurable number of users/threads that each send HTTP requests to the Web server. As performance metric we study the object download time, which is the total duration of loading a page consisting of a base document and potentially also embedded files (see Fig. 8). The measurement duration is one hour per sample. Concerning the object characteristics and request patterns, we first use the “surge default” Web model with rather small file sizes. However, recent studies [15] reveal that new Web applications frequently transfer objects of the order of 100 kByte (high-resolution images, complex database queries, 3D structures, ...). In order to reflect the traffic characteristics of such broadband interactive applications, we also use a second, customized “large files” workload model (see Tab. 1).
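To give a feeling for the “large files” workload, the following minimal sketch computes the median and the mean of the lognormal file size distribution listed in Tab. 1. It assumes that μ and σ refer to the natural logarithm of the file size in bytes, which the text does not state explicitly; the function names are illustrative only.

```python
import math


def lognormal_median(mu: float, sigma: float) -> float:
    """Median of a lognormal distribution; it depends on mu only."""
    return math.exp(mu)


def lognormal_mean(mu: float, sigma: float) -> float:
    """Mean of a lognormal distribution with log-mean mu and log-std sigma."""
    return math.exp(mu + sigma ** 2 / 2.0)


# Parameters of the "large files" model (Tab. 1); unit assumed to be bytes.
MU, SIGMA = 11.92, 1.0

median_bytes = lognormal_median(MU, SIGMA)
mean_bytes = lognormal_mean(MU, SIGMA)
print(f"median: {median_bytes / 1e3:.0f} kB, mean: {mean_bytes / 1e3:.0f} kB")
```

Under this assumption the median file size is about 150 kB and the mean about 248 kB, which is consistent with objects of the order of 100 kByte.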
5 Experimental Evaluation
5.1 Quick-Start Validation
Table 1. Used Web traffic model parameters (selection)

| Parameter | “Surge default” model | “Large files” model |
|---|---|---|
| Ratio base/embedded/loners | 0.30 / 0.38 / 0.32 | 0.30 / 0.38 / 0.32 |
| File popularity | Zipf, 2000 files | Uniform, 2000 files |
| File size distribution | Mixed lognormal and Pareto | Lognormal, μ = 11.92, σ = 1.0 |
| No. of embedded files per object | Pareto, α = 1.245, k = 2 | Geometric, 1/p = 2 |
| User think time | Pareto, α = 1.4, k = 2 | Pareto, α = 1.4, k = 2 |

Fig. 9. Data rate trace (RTT = 100 ms)
Fig. 10. Data rate trace (RTT = 200 ms)
(Both figures plot the IP data rate [Mbit/s] over the time since the SYN segment [s] for Slow-Start (SS) and Quick-Start (QS) with Reno and Cubic, without and with ssthresh reduction.)
In the following, we discuss some results of systematic tests with the Quick-Start mechanism. Fig. 9 and 10 depict several traces of the traffic in download direction after the connection setup, comparing Slow-Start (SS) and Quick-Start (QS). The data rates have been obtained from a “tcpdump” trace by averaging the IP packet sizes within intervals of one RTT. We study the Reno congestion control, which implements [2], as well as the frequently used Cubic high-speed TCP variant [16]. In Fig. 9, the emulated minimum RTT is 100 ms. In this case a Slow-Start needs about one second to reach a data rate that completely fills the link, regardless of the congestion control variant. In contrast, the link can almost instantaneously be utilized with a Quick-Start request of q = 5.12 Mbit/s, which is the largest possible request value smaller than the link capacity (see [1]).
The Quick-Start specification [1] leaves open whether to modify the Slow-Start threshold ssthresh [2] after a successful Quick-Start procedure. This variable is an estimate of the minimum available bandwidth, and it is challenging to define a reasonable initial value. Two alternative strategies could be:
1. No adaptation of ssthresh by the Quick-Start functions
2. Adaptation of ssthresh to the Quick-Start window (or some multiple of it)
The rationale behind the latter strategy would be to use the information about the path characteristic obtained from the Quick-Start mechanisms. Fig. 9 shows that a reduction of ssthresh can be disadvantageous compared to the first strategy if the Quick-Start request rate q is smaller than the path capacity. In this case, the TCP sender enters the congestion avoidance phase before the link capacity is reached. Hence, it may require many seconds to send at full speed. This effect would be even more severe if a smaller Quick-Start request rate was used (e.g., q = 2.56 Mbit/s). In the congestion avoidance phase, the Cubic variant outperforms Reno due to its higher aggressiveness. For a higher RTT of 200 ms, the situation is more complicated, as shown in Fig. 10.
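The request values q = 5.12 Mbit/s and q = 2.56 Mbit/s mentioned above follow from the rate encoding of the Quick-Start option: in RFC 4782 [1], a non-zero rate request N (1 to 15) stands for 40,000 · 2^N bit/s. The following minimal sketch, with illustrative function names that are not taken from the kernel implementation, selects the largest encodable request below a given link capacity.

```python
def qs_rate_bps(n: int) -> int:
    """Rate in bit/s encoded by Quick-Start rate request n (RFC 4782)."""
    if not 0 <= n <= 15:
        raise ValueError("rate request must be in 0..15")
    return 0 if n == 0 else 40_000 * 2 ** n


def largest_request_below(capacity_bps: float) -> int:
    """Largest rate request whose rate is strictly below the capacity (0 if none)."""
    candidates = [n for n in range(1, 16) if qs_rate_bps(n) < capacity_bps]
    return max(candidates) if candidates else 0


n = largest_request_below(10_000_000)  # 10 Mbit/s link of the testbed
print(n, qs_rate_bps(n))               # rate request and its rate in bit/s
```

For the 10 Mbit/s link of the testbed, this yields rate request 7, i.e., 5.12 Mbit/s; the next larger value (10.24 Mbit/s) would already exceed the link capacity.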
At an RTT of 200 ms, the bandwidth-delay product of the emulated path corresponds to 167 packets, which is larger than the default initial ssthresh value of the Cubic algorithm in the used 2.6.20 kernel version¹. With Cubic, the sender enters the congestion avoidance state at a sending rate of about 6 Mbit/s, and it takes longer to fully utilize the path than Reno, which uses a very large default initial ssthresh. Again, a threshold reduction is detrimental: in particular, the Reno congestion control then requires a long time to completely utilize the path, and the Slow-Start ends up being faster. It follows that ssthresh should not be decreased after a successful Quick-Start request. This is why we only use the “no adaptation” strategy in the following experiments.
¹ Newer Linux kernels use a larger threshold for Cubic to overcome this problem.
Fig. 11. Rel. performance improvement (speedup of Quick-Start vs. Slow-Start over the transfer size [byte]; measurements for Reno and Cubic at RTTs of 10, 50, 100 and 200 ms, together with the analytical model)
Fig. 12. Benefit for Web traffic (1 user) (mean object download time [s] per HTTP request mechanism, i.e., HTTP/1.0, HTTP/1.1 and HTTP/1.1p, for Slow-Start and Quick-Start with 5.12 Mbit/s, Reno and Cubic, “surge default” and “large files” models)
5.2 Performance Benefit of Quick-Start
The measurements with our test programs can quantify how much the data transfer times could be improved by Quick-Start. Fig. 11 shows the relative speedup of data transfers η = T_SS / T_QS, where T is the server response time (cf. Fig. 7). An analytical expression for η has been derived in [4]. Our measurements confirm that Quick-Start can improve the response time by several hundred percent when the RTT is larger than 50 ms. Quick-Start is of particular benefit for transfer sizes between 10 kB and 1 MB, and the speedup is slightly larger for Reno because of the higher initial ssthresh value. For longer file transfers, the relative improvement is rather small because the Slow-Start is then only a transient effect. Very short transfers cannot be improved either, since the initial congestion window is large enough anyway to send out such data immediately. The comparison of the Linux measurement results with the analytical model from [4] reveals a close match, provided that the model takes into account the so-called “Quick ACKs” [17]. This mechanism of the Linux stack makes the Slow-Start more aggressive than foreseen by the TCP standards [2].
A result of our experiments with the Web server and the “surge” HTTP traffic generator is given in Fig. 12. Here, the minimum RTT is 200 ms, and the 10 Mbit/s link is used by one emulated user only. The mean object download time with the standard Slow-Start is always larger than with Quick-Start, independent of the HTTP protocol variant being used. However, the absolute difference is rather small for the “surge default” scenario. This is to be expected, since the mean size of an object (i.e., a Web page) is here only of the order of 17 kB. For larger file sizes, Quick-Start can significantly reduce the download duration. A speedup of more than 1 s can be achieved for HTTP/1.0 traffic, since here every file has to be transported over a new TCP connection suffering from the Slow-Start. But a significant improvement can also be observed for HTTP/1.1 persistent connections, both without and with request pipelining.
Fig. 13. Performance for increased load (averaged object download time [s] over the number of traffic generators (users), for Slow-Start and Quick-Start with 5.12 Mbit/s, “surge default” and “large files” models)
Fig. 14. Variation of the QS capacity (averaged object download time [s] over the Quick-Start admission threshold [%], for Slow-Start and Quick-Start with 5.12 Mbit/s and 4 users, “surge default” and “large files” models)
5.3 Impact of the Quick-Start Admission Control
At the Web server, the IP layer realizes the Quick-Start router functions, i.e., it monitors the utilization of the 10 Mbit/s link and processes the Quick-Start requests. They are approved if the resulting bandwidth, including the current traffic and recently approved requests, is less than a certain threshold [1,3]. If there are many requests in parallel, only a certain share will be approved. In Fig. 13, the number of users generating HTTP requests is successively increased, resulting in more Quick-Start requests and a higher link utilization. For simplicity we consider here the object download time averaged over six experiments with all combinations of Reno/Cubic and the three HTTP variants. As expected, download durations increase if the load gets larger. The performance benefit of Quick-Start is also smaller, because fewer requests are approved and because the granted bandwidth is smaller. This illustrates that the Quick-Start mechanism mainly targets underutilized links. However, it is not necessary to allow Quick-Start to use the full link capacity: the diagram in Fig. 14 shows that application performance can also be improved if only a certain part of the capacity (e.g., only 50%) is available for Quick-Start requests. This again confirms analytical predictions from [4].
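The admission rule described above can be sketched as follows. This is a simplified illustration and not the kernel code: the class and method names are hypothetical, the decision is binary (a real router may also grant a reduced rate), and the notion of “recently approved” is reduced to a counter that the caller resets at the end of each measurement interval.

```python
class QuickStartAdmission:
    """Toy Quick-Start admission control for a single outgoing link."""

    def __init__(self, capacity_bps: float, threshold: float = 1.0):
        self.capacity_bps = capacity_bps   # link capacity in bit/s
        self.threshold = threshold         # share of capacity usable (0..1)
        self.recently_approved_bps = 0.0   # sum of recently approved requests

    def request(self, rate_bps: float, current_traffic_bps: float) -> bool:
        """Approve the request if the resulting bandwidth stays below the threshold."""
        resulting = current_traffic_bps + self.recently_approved_bps + rate_bps
        if resulting < self.threshold * self.capacity_bps:
            self.recently_approved_bps += rate_bps
            return True
        return False

    def new_interval(self) -> None:
        """Forget earlier approvals (their traffic now shows up as current traffic)."""
        self.recently_approved_bps = 0.0


router = QuickStartAdmission(capacity_bps=10e6, threshold=1.0)
first = router.request(5.12e6, current_traffic_bps=2e6)   # 7.12 Mbit/s < 10 Mbit/s
second = router.request(5.12e6, current_traffic_bps=2e6)  # 12.24 Mbit/s, too much
print(first, second)
```

With two parallel requests of 5.12 Mbit/s on a link that already carries 2 Mbit/s, only the first one is approved, which illustrates why only a certain share of parallel requests succeeds under load.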
6 Conclusion and Future Work
Quick-Start is an experimental TCP extension that allows hosts to utilize links in high-speed networks immediately. This avoids the time-consuming Slow-Start, which is considered to be a limiting factor for broadband interactive applications. Quick-Start requires explicit router feedback, but it is rather lightweight compared to other related mechanisms that are proposed for a future Internet. This paper presents a Quick-Start implementation in the Linux stack and shows that the resulting overhead and performance cost is small. The impact of several parameters is studied by measurements, including both simple validation tests and experiments with HTTP traffic. Our measurements confirm the outcome of simulation studies, and they demonstrate that Quick-Start can significantly improve the performance of interactive applications communicating over long-distance wide area networks. Future work will also consider hardware-based Quick-Start implementations and compare Quick-Start to related congestion control schemes.
References
1. Floyd, S., Allman, M., Jain, A., Sarolahti, P.: Quick-Start for TCP and IP. IETF RFC 4782 (experimental) (January 2007)
2. Allman, M., Paxson, V., Stevens, W.: TCP congestion control. RFC 2581 (proposed standard) (April 1999)
3. Sarolahti, P., Allman, M., Floyd, S.: Determining an appropriate sending rate over an underutilized network path. Computer Networks 51(7), 1815–1832 (2007)
4. Scharf, M.: Performance analysis of the Quick-Start TCP extension. In: Proc. IEEE Broadnets (September 2007)
5. Scharf, M., Strotbek, H.: Experiences with implementing Quick-Start in the Linux kernel. In: IETF 69, Chicago, IL, USA (July 2007)
6. Liu, D., Allman, M., Jin, S., Wang, L.: Congestion control without a startup phase. In: Proc. PFLDnet 2007 (February 2007)
7. Falk, A., Faber, T., Coe, E., Kapoor, A., Braden, B.: Experimental measurements of the eXplicit Control Protocol. In: Proc. PFLDnet 2004 (February 2004)
8. Zhang, Y., Henderson, T.R.: An implementation and experimental study of the Explicit Control Protocol (XCP). In: Proc. IEEE Infocom 2005, March 2005, pp. 1037–1048 (2005)
9. Dukkipati, N., Gibb, G., McKeown, N., Zhu, J.: Building a RCP (Rate Control Protocol) test network. In: Proc. IEEE HOTI (August 2007)
10. Scharf, M., Floyd, S., Sarolahti, P.: Avoiding interactions of Quick-Start TCP and flow control. IETF Internet Draft (work in progress, July 2007)
11. Wehrle, K., Pählke, F., Ritter, H., Müller, D., Bechler, M.: The Linux Networking Architecture, Upper Saddle River, NJ, USA. Prentice-Hall, Englewood Cliffs (2005)
12. Strotbek, H.: Design and implementation of a TCP extension in the Linux kernel. Diploma thesis (in German), University of Stuttgart, IKR (May 2007)
13. Handley, M., Padhye, J., Floyd, S.: TCP congestion window validation. IETF RFC 2861 (experimental) (June 2000)
14. Barford, P., Crovella, M.: Generating representative Web workloads for network and server performance evaluation. ACM SIGMETRICS Perform. Eval. Rev. 26(1), 151–160 (1998)
15. Schneider, F., Agarwal, S., Alpcan, T., Feldmann, A.: The new Web: Characterizing AJAX traffic. In: Proc. 9th International Conference on Passive and Active Network Measurement, April 2008. LNCS, Springer, Heidelberg (2008)
16. Rhee, I., Xu, L.: Cubic: A new TCP-friendly high-speed TCP variant. In: Proc. PFLDnet 2005 (February 2005)
17. Sarolahti, P., Kuznetsov, A.: Congestion control in Linux TCP. In: Proc. USENIX Annual Technical Conference (June 2002)
Lightweight Fairness Solutions for XCP and TCP Cohabitation
Dino M. López Pacheco1,2, Laurent Lefèvre2, and Congduc Pham3
1 CONACyT
2 INRIA RESO / Université de Lyon / ENSL, France
3 LIUPPA Laboratory, Université de Pau, France
{dmlopezp,lefevre.laurent}@ens-lyon.fr, [email protected]
Abstract. XCP is a promising router-assisted protocol due to its high performance and intra-protocol fairness in fully XCP networks. However, XCP is not interoperable with current E2E protocols, which results in poor performance of XCP flows when the resources are shared with TCP-like flows. In this paper, we propose a lightweight fairness solution that benefits the cohabitation between XCP and TCP in networks with a large bandwidth-delay product and in scenarios with long-lived flows, as shown by our simulation results.
Keywords: TCP, End-to-End protocols, XCP, ERN protocols, inter-protocol fairness, inter-operability.
1 Introduction
End-to-End (E2E) protocols are currently the most widely deployed protocols to control congestion in networks and to provide a fair share of resources between users. One of the reasons why E2E protocols are widely deployed is that most of their mechanisms are generally implemented on the sender side only. Thus, E2E protocols are totally independent of the network infrastructure. However, E2E protocols have limited performance. For instance, TCP [14], [3], which is currently the most used congestion control protocol, frequently causes packet drops and is unable to provide fairness when users have different delays [1]. In addition, TCP performs poorly in networks with large capacity and delay [15]. Other proposed E2E protocols especially designed to achieve high performance in large bandwidth-delay product (BDP) networks, like High Speed TCP [15] or Scalable TCP [6], suffer from intra-protocol and inter-protocol unfairness [16]. Since E2E protocols have proven unable to solve the problems of congestion and fairness, a new family of congestion control protocols has been proposed. These new protocols use the assistance of routers to provide the senders with accurate information about the network conditions, which is why they are known as router-assisted protocols or Explicit Rate Notification (ERN) protocols.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 715–726, 2008. © IFIP International Federation for Information Processing 2008
XCP [5], one of the best-known ERN protocols, is very promising since it shows very stable behavior, high performance and intra-protocol fairness. However, XCP (like other ERN protocols) (i) requires the collaboration of all the routers on the data path, which is almost impossible to achieve in an incremental deployment scenario, and (ii) does not have any mechanism to fairly share the resources with E2E protocols (e.g., TCP). Concerning the interoperability problems of XCP with heterogeneous network equipment, we have already proposed a solution in [9] (the XCPi module). In [8], we gave a short description of a solution to ensure XCP-TCP fairness, which we implemented and tested on the XCP protocol but which can easily be applied to any other ERN protocol. There, we mainly used the space to show graphically that our XCP-TCP fairness solution is able to ensure max-min fairness between XCP and TCP in a wide range of scenarios. In this article, our main interest is to present a detailed description of our lightweight XCP-TCP fairness solution as well as an extended discussion of a few simulation results obtained by means of the ns2 simulator [12]. We also take the opportunity to present some arguments that reinforce the “lightweight” character of our XCP-TCP fairness solution. This paper is organized as follows: Section 2 provides a quick description of the XCP protocol and Section 3 describes the XCP-TCP unfairness problem. Section 4 presents a detailed description of our XCP-TCP fairness solution, while in Section 5 we validate our solution by simulation. Section 6 presents an improvement of our XCP-TCP fairness mechanism that makes it lightweight in terms of CPU and memory. Finally, in Section 7 we provide concluding remarks.
2 The XCP Protocol
XCP [5] (eXplicit Control Protocol) uses router assistance to accurately inform the sender of the network conditions. In XCP, data packets carry a congestion header, filled in by the source, that contains the sender’s current congestion window size (H_cwnd), the estimated RTT (H_rtt) and a feedback field (H_feedback). The H_feedback field is the only one that can be modified at every XCP router, based on the values of the two previous fields. Basically, H_feedback represents the amount by which the sender’s congestion window size is increased (positive feedback) or decreased (negative feedback). The core mechanism of XCP resides in the XCP routers. For every incoming packet, an XCP router needs to access the IP header to get the packet size, the packet type, and the congestion control protocol used. When the congestion control protocol used is XCP and the incoming packet is not an ACK, the XCP router must get the H_cwnd, H_rtt and H_feedback values. With the information collected, XCP routers execute two control laws in order to calculate a per-packet feedback:
1. The Efficiency Controller (EC), which maximizes the link utilization while minimizing packet losses.
2. The Fairness Controller (FC), which assigns resources fairly between XCP flows.
To maximize the link utilization, the Efficiency Controller computes a value named feedback, denoted by the symbol φ, that reflects the available bandwidth:
$$\phi = \alpha \cdot rtt \cdot (O - I) - \beta \cdot Q \qquad (1)$$
where α and β are constants with values 0.4 and 0.226, respectively, rtt is the average RTT of all the packets crossing the router, O is the output link capacity, I is the input traffic rate seen by the XCP router during a control interval, and Q is the persistent queue size. Note that in the feedback equation, φ decreases as the input traffic rate increases, i.e., as the available bandwidth decreases. Later on, the Fairness Controller translates the computed general feedback φ into a per-packet feedback that assigns the same amount of bandwidth to each flow. The per-packet feedback computed by the FC replaces the H_feedback value carried in the data packet if the computed feedback is smaller than the one carried in the header. Thus, for every data packet arriving at the XCP receiver, (i) the receiver copies the H_feedback value into the ACK packets to let the sender update its congestion window size, if the standard XCP protocol is used, or (ii) if XCP-r (as described in [10]) is used, the receiver calculates the new congestion window size from H_feedback and sends that size back to the sender. The congestion window, either in the sender or in the receiver, is updated as
$$cwnd = \max(cwnd + H\_feedback,\ packetsize) \qquad (2)$$
3 Fairness Issues between XCP and TCP
In order to maximize link utilization and avoid congestion, the Efficiency Controller computes the available bandwidth S (the spare bandwidth) as S = O − I. However, the spare bandwidth equation does not have any mechanism to differentiate between traffic generated by XCP and non-XCP flows. Hence, the input traffic rate could be generated by any TCP flow, forcing XCP flows to take only the remaining available bandwidth. Thus, XCP routers are not able to decrease the sending rate of non-XCP flows, like TCP flows. Using the topology shown in Figure 1, we analyzed by means of a simulation the behavior of one XCP flow when competing with two TCP flows. The results can be found in Figure 2, where we plotted the throughput of the XCP flow (continuous line) and the throughput of both TCP flows (dashed lines). As we can see in Figure 2, the XCP flow starting at second 0 takes all the available bandwidth as long as it does not share the resources with concurrent TCP flows. However, at second 10, when two TCP flows appear in the scenario, the performance of the XCP flow is strongly degraded. In fact, after second 19, the XCP flow has almost disappeared. In Figure 2, between seconds 15 and 19, it seems that the XCP flow tries to get some bandwidth, but without success. The explanation of this phenomenon is as follows: since the congestion window sizes of both TCP flows are not yet large enough to sustain a continuous transfer of packets, for some milliseconds the XCP router placed at the bottleneck detects a positive difference between the output link capacity and the input traffic rate. This very small remaining bandwidth is assigned to the XCP sender, which still has the opportunity of transferring some data packets.
Fig. 1. Topology: n XCP and m TCP flows sharing the bottleneck
Fig. 2. XCP flow dealing with TCP concurrent flows
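The effect can be reproduced numerically with the aggregate feedback of Equation 1. The sketch below is only an illustration: the function names, the capacity of 12.5 MB/s and the traffic rates are hypothetical values chosen for the example, and rates are assumed to be in bytes per second so that φ is in bytes. Since the input rate I lumps XCP and TCP traffic together, additional TCP traffic directly turns the feedback for XCP senders negative.

```python
ALPHA = 0.4    # XCP constant alpha from Equation 1
BETA = 0.226   # XCP constant beta from Equation 1


def spare_bandwidth(capacity: float, input_rate: float) -> float:
    """S = O - I; the router cannot tell XCP traffic from TCP traffic in I."""
    return capacity - input_rate


def aggregate_feedback(avg_rtt: float, capacity: float,
                       input_rate: float, queue: float) -> float:
    """Efficiency Controller feedback phi of Equation 1."""
    return ALPHA * avg_rtt * spare_bandwidth(capacity, input_rate) - BETA * queue


CAPACITY = 12.5e6   # hypothetical bottleneck capacity in bytes/s
RTT = 0.1           # hypothetical average RTT in seconds

# XCP alone on the link: positive feedback, the XCP flow may grow.
phi_xcp_only = aggregate_feedback(RTT, CAPACITY, input_rate=6.0e6, queue=0.0)

# Same XCP traffic plus 7 MB/s of TCP traffic: negative feedback,
# which only the XCP senders obey.
phi_with_tcp = aggregate_feedback(RTT, CAPACITY, input_rate=13.0e6, queue=0.0)

print(phi_xcp_only, phi_with_tcp)
```

In this example the first value is positive (about +260,000 bytes) and the second is negative (about −20,000 bytes), although the XCP traffic itself is unchanged.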
4 Proposition for an Inter-protocol XCP-TCP Fairness
4.1 Definition of XCP-TCP Fairness and Goals of Our Fairness Mechanism
To consider that TCP is fair with XCP, XCP must get a bandwidth equivalent to the ratio between the output link capacity O and the sum of the numbers of XCP and TCP flows, multiplied by the number of XCP flows. Equation 3 shows this relationship.
$$BW_{XCP} = \frac{O}{N + M} \cdot N \qquad (3)$$
In Equation 3, BW_XCP is the bandwidth needed by XCP when the bottleneck is shared by XCP and TCP flows, N is the number of active XCP flows and M the number of active TCP flows. Thus, we can estimate the limit in terms of bandwidth for TCP, BW_TCP, as a function of BW_XCP:
$$BW_{TCP} = O - BW_{XCP} \qquad (4)$$
4.2 Estimating the Resources Needed by XCP
In order to fairly share the resources between XCP and TCP, we can deduce from Equation 3 that it is necessary to calculate the number of active XCP and TCP flows. Calculating the number of active flows crossing a router is not easy. In this article we describe the two mechanisms for estimating the number of active flows that we consider the most important: the Bloom filter algorithm [2] and SRED’s zombie estimator [13]. The Bloom filter algorithm [2], as implemented by NRED [7], works as follows: when a packet arrives, NRED’s Bloom filter algorithm hashes the source-destination pair of the packet into a bin. A bin is a 1-bit memory holding 0 or 1. The estimator maintains P × L bins. The bins are organized in L levels, each of which contains P bins. The estimator also uses L independent hash functions, each of which is associated with one level of bins and maps a flow into one of the P bins in that level. For every arriving packet, the L hash functions can be executed in parallel. At the beginning of a measurement interval, all bins and a counter N_act are set to zero. When a packet arrives at the router, the L hashed bins for the source-destination IP address pair of the packet are set to 1, and N_act is incremented by one if at least one of the L hashed bins was zero before hashing. However, it can happen that packets belonging to two different flows are hashed into the same bins at all L levels, causing a “misclassification”. The authors claim that the probability of a “misclassification” decreases as the number L of hash functions increases. The problem of the Bloom filter algorithm is that, to avoid “misclassification”, a large number of hash functions is needed, and since every hash function is executed once for every incoming packet, this algorithm could be expensive in terms of CPU requirements.
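The counting procedure described above can be sketched in a few lines. The following code is an illustration only and not the NRED implementation: the class name and parameters are hypothetical, and the L independent hash functions are emulated by salting SHA-256 with the level index, which is an assumption made for the example.

```python
import hashlib


class BloomFlowCounter:
    """Counts active flows with L levels of P one-bit bins."""

    def __init__(self, levels: int = 4, bins_per_level: int = 1024):
        self.levels = levels                  # L
        self.bins_per_level = bins_per_level  # P
        self.reset()

    def reset(self) -> None:
        """Start a new measurement interval: clear all bins and the counter."""
        self.bins = [[0] * self.bins_per_level for _ in range(self.levels)]
        self.n_act = 0

    def _bin_index(self, level: int, flow_id: str) -> int:
        # One "independent" hash function per level (salted SHA-256, assumed).
        digest = hashlib.sha256(f"{level}|{flow_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.bins_per_level

    def observe(self, src: str, dst: str) -> bool:
        """Process one packet; return True if it was counted as a new flow."""
        flow_id = f"{src}->{dst}"
        is_new = False
        for level in range(self.levels):
            idx = self._bin_index(level, flow_id)
            if self.bins[level][idx] == 0:
                is_new = True                 # at least one bin was still zero
                self.bins[level][idx] = 1
        if is_new:
            self.n_act += 1
        return is_new


counter = BloomFlowCounter()
for _ in range(5):                            # five packets per flow
    for src in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
        counter.observe(src, "10.0.1.1")
print(counter.n_act)                          # number of flows counted
```

Every packet costs L hash computations, which is the CPU cost discussed above; a flow is missed only if all of its L bins were already set by other flows.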
In addition, even if the hash functions can be performed in parallel, they must synchronize in order to update N_act, which makes this algorithm expensive in terms of time. On the other hand, the zombie estimator is an algorithm used in SRED [13]. The zombie estimator keeps a table, called the zombie table, that can hold up to 1000 flow IDs (a flow ID could be src addr:src port::dest addr:dest port). The zombie table is initially filled with the flow IDs of the first 1000 incoming packets. Once the zombie table has been filled, every packet arriving at the router is examined. First, the flow ID of the packet is taken and compared with a flow ID taken randomly from the zombie table. If the IDs are equal, a hit event is declared. Otherwise, a miss event is declared. When a miss is detected, the old ID stored in the zombie table is replaced by the new flow ID with a probability of 25%. Later on, in both cases (hit or miss), a variable P(t) that keeps the probability of a hit is updated as follows:
$$P(t) = (1 - \delta)\,P(t - 1) + \delta \cdot Hit(t) \qquad (5)$$
where δ reflects the probability of getting a packet from the zombie table with the same ID as the incoming packet (δ = 0.25 · (1/1000) = 0.00025). It should be noted that α from the original P(t) equation [13] has been replaced by δ to avoid confusion with the α parameter of XCP. Hit(t) is an indicator that equals 1 in case of a hit and 0 otherwise. If the total number of active flows in a router is N, then the probability of a hit is 1/N. Since the probability of a hit is already contained in P(t), 1/P(t) represents the estimate of the number of active flows seen by a router at time t. The operations executed by the zombie estimator do not require much CPU, since for every incoming packet this estimator only needs one comparison, the generation of two random numbers, and in some cases one write operation. However, one weakness of the zombie estimator is that it does not calculate the exact number of active flows, but only tries to estimate this number. We believe that with current technologies it is very difficult to know the exact number of active flows during a given period (this would need faster CPUs and larger memories than currently available in routers). We also believe that an estimate of the number of active flows is enough to ensure max-min fairness between XCP and TCP. For this reason, we have used the zombie estimator to compute the number of active TCP and XCP flows. Note that even though a zombie table with 1000 slots could not accurately estimate a high number of flows (e.g., 10,000 active flows), a zombie table of this size can give us a close idea of the real behavior of the estimation method. In a real environment we can use a larger zombie table to accurately estimate higher numbers of active flows without significantly increasing the number of CPU operations. The way to implement the zombie estimator in the XCP routers is very similar to the way proposed in SRED.
However, in our case we have one variable, P(t)_XCP, that keeps the probability of an XCP hit, and another one, P(t)_TCP, that keeps the probability of a TCP hit. Thus, by applying the described zombie estimator, we can get an idea of the resources needed by the XCP flows, BW_XCP. We propose to estimate the resources needed by XCP and TCP at every control interval of the XCP router.
4.3 Ensuring the XCP-TCP Fairness
In order to ensure fairness between XCP and TCP, it is also necessary to compute the real amount of resources taken by XCP. Thus, by comparing the needed and the real amount of resources taken by XCP, we are able to make a decision to improve the fairness.
Computing the Real Amount of Resources Taken by XCP. As we said in Section 2, for every incoming packet, XCP routers must compute a per-packet feedback, which is based on an input traffic rate (in_traffic_rate) computed at the end of every control interval (ctrl_int). The control interval corresponds to the average RTT signaled in the incoming data packets. The in_traffic_rate variable is calculated by dividing the total amount of data received during a control interval (in_traffic) by the duration of this control interval. in_traffic is calculated by adding up the packet sizes of all incoming (data or ACK) packets. Since the XCP routers already implement a procedure to compute the input traffic rate, estimating the XCP input traffic rate requires only minor changes in the XCP mechanism. All that we need to do is to add a new variable, xcp_in_traffic, that stores the sum of the sizes of all XCP packets, and to divide this variable by the duration of the control interval ctrl_int to get the XCP input traffic rate xcp_in_traffic_rate.
Ensuring Fairness Between XCP and TCP Flows. Once the XCP input traffic rate, xcp_in_traffic_rate, and the needed XCP bandwidth, BW_XCP, have been calculated, we can make decisions to provide fairness between XCP and TCP. In order to limit the TCP throughput and achieve fairness, our XCP-TCP fairness mechanism decides with a probability p_drop whether an incoming TCP packet should be discarded. The probability p_drop, which is initialized with a value of 0.001, is updated at every control interval as follows: if xcp_in_traffic_rate > BW_XCP, it is updated as p_drop = p_drop · D_drop, where 0.99 < D_drop < 1. If xcp_in_traffic_rate < BW_XCP, it is updated as p_drop = p_drop · I_drop, where 1.01 > I_drop > 1. When xcp_in_traffic_rate = BW_XCP, p_drop keeps its last value.
4.4 Specificities about our XCP-TCP Fairness Mechanism
It is important to emphasize that the XCP routers execute the XCP-TCP fairness mechanism only (i) when both XCP and TCP protocols have been detected during a control interval, and (ii) if the total input traffic rate exceeds a threshold γ. We do not execute our XCP-TCP fairness mechanism when the input traffic rate is smaller than γ, since this means that the current router is not the bottleneck. Concerning the γ parameter, we advise setting γ to a value slightly smaller than the output link capacity (e.g., 97% of the output link capacity), since even in a bottleneck router the rate computed during a control interval is not always equal to the output link capacity, due to the bursty nature of the senders. Thus, during a certain control interval we could compute an input traffic rate slightly smaller than the output link capacity and, in the next control interval, see the router dropping packets. Finally, when the total input traffic rate becomes smaller than our γ threshold, the p_drop variable is reinitialized to its original value (0.001).
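The per-interval logic described in Sections 4.3 and 4.4 can be sketched as follows. This is a minimal illustration rather than the authors' ns2 code: the function names, the argument layout, and the cap of p_drop at 1.0 are assumptions made for the example, and the default D_drop and I_drop values are the ones reported for the experiments in Section 5.

```python
P_DROP_INITIAL = 0.001  # initial drop probability stated in the text


def update_p_drop(p_drop, xcp_in_traffic_rate, bw_xcp,
                  d_drop=0.9999, i_drop=1.0001):
    """Multiplicative update of the TCP drop probability."""
    if xcp_in_traffic_rate > bw_xcp:
        # XCP already gets more than it needs: drop fewer TCP packets.
        return p_drop * d_drop
    if xcp_in_traffic_rate < bw_xcp:
        # XCP is starved: drop more TCP packets (cap at 1.0 is an assumption).
        return min(1.0, p_drop * i_drop)
    return p_drop  # rates are equal: keep the last value


def end_of_control_interval(p_drop, xcp_seen, tcp_seen, in_traffic_rate,
                            xcp_in_traffic_rate, bw_xcp, link_capacity,
                            gamma_fraction=0.97):
    """Return the drop probability to use in the next control interval."""
    gamma = gamma_fraction * link_capacity
    if in_traffic_rate < gamma:
        # This router is not the bottleneck: reset the probability.
        return P_DROP_INITIAL
    if in_traffic_rate > gamma and xcp_seen and tcp_seen:
        return update_p_drop(p_drop, xcp_in_traffic_rate, bw_xcp)
    return p_drop  # mechanism not executed in this interval
```

All rates only need to share one unit (for example Mbps), since the logic compares them with each other and with γ.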
5
D.M. López Pacheco, L. Lefèvre, and C. Pham
Validating Our XCP-TCP Fairness Solution
In order to test our XCP-TCP fairness solution, we implemented our proposals in the ns2 XCP modules provided by D. Katabi (http://www.ana.lcs.mit.edu/ dina/XCP/). In our simulations, we focus on two scenarios: a national, small-distance Grid (such as the French Grid5000 [4] infrastructure) with an average RTT of 20 ms, and larger-scale Grids with an RTT of 100 ms. The topology used to test our XCP-TCP fairness solution is the one shown in Figure 1. In our experiments we used D_drop = 0.9999 and I_drop = 1.0001. In this set of experiments (the results are shown in Figure 3), we gradually introduced 3 TCP flows after second 10 (each flow starting 20 seconds after the previous one) among 10 XCP flows (all XCP flows start at second 0), with RTT = 20 ms (case 1) and RTT = 100 ms (case 2). We believe that these scenarios represent well what will happen in a real environment where hundreds or thousands of active flows share the network. At the same time, using a limited number of flows in our simulations (3 TCP and 10 XCP flows) helps us understand the behavior of our XCP-TCP fairness solution, since we can easily observe and compare the evolution of every XCP and TCP flow in a dynamic network. The simulation results of the first case are shown in Figure 3(a) (RTT = 20 ms), where we can see that every time a new TCP flow starts, it gets more bandwidth than needed (seconds 10, 30, 50). However, after Slow Start finishes, our XCP-TCP mechanism succeeds in ensuring fairness between XCP and TCP,
Fig. 3. 3 TCP flows appear among 10 XCP flows: (a) 20 ms RTT; (b) 100 ms RTT
as we can observe between seconds 20 and 30 and between seconds 50 and 110. Figure 3(a) also shows that our mechanism ensures fairness even though TCP flows enter and leave the network asynchronously. When RTT = 20 ms, the worst fairness level is probably found between seconds 30 and 50, where 2 TCP flows share close to 300 Mbps while the remaining 10 XCP flows share around 700 Mbps (BW_TCP ≈ 170.66 Mbps and BW_XCP ≈ 853.34 Mbps). On the other hand, one of the best fairness levels is found between seconds 60 and 70, where 3 TCP flows share approximately 250 Mbps while the remaining 10 XCP flows share around 750 Mbps (BW_TCP ≈ 236.30 Mbps and BW_XCP ≈ 787.70 Mbps). These results are far from those shown in Figure 2.
When RTT = 100 ms, every TCP flow initially takes more resources than needed, which triggers the XCP-TCP fairness mechanism. After the Slow Start phase finishes, our fairness mechanism leaves TCP with less bandwidth than the maximum allowed. Even though our mechanism does not penalize TCP flows when the TCP throughput is too low (since p_drop → 0), the time TCP needs to obtain enough resources is very long (see seconds 60 to 110 in Figure 3(b)). It is important to note that the time needed for TCP to obtain the resources decreases as the RTT decreases and the number of TCP flows increases. When RTT = 100 ms, the worst fairness level is probably found between seconds 60 and 70, where 3 TCP flows share close to 100 Mbps while the remaining 10 XCP flows share around 900 Mbps (BW_TCP ≈ 236.30 Mbps and BW_XCP ≈ 787.70 Mbps). On the other hand, one of the best fairness levels is found between seconds 100 and 110, where 1 TCP flow gets approximately 100 Mbps while the remaining 10 XCP flows share around 900 Mbps (BW_TCP ≈ 93 Mbps and BW_XCP ≈ 931 Mbps). Still, these results are far from those shown in Figure 2.
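The target values quoted in parentheses above appear consistent with splitting the bottleneck capacity in proportion to the number of flows of each type. The sketch below reproduces them under that reading; the 1024 Mbps capacity and the function name are assumptions inferred from the quoted numbers, not values stated in this section.

```python
def fair_shares(n_xcp, n_tcp, capacity_mbps=1024.0):
    """Per-protocol fair shares (BW_XCP, BW_TCP) in Mbps.

    Assumes every flow is entitled to an equal slice of the bottleneck,
    so each protocol gets capacity * (its flow count) / (all flows).
    """
    total_flows = n_xcp + n_tcp
    bw_tcp = capacity_mbps * n_tcp / total_flows
    bw_xcp = capacity_mbps - bw_tcp
    return bw_xcp, bw_tcp


for n_tcp in (1, 2, 3):
    bw_xcp, bw_tcp = fair_shares(10, n_tcp)
    print(f"{n_tcp} TCP flows: BW_TCP = {bw_tcp:.2f} Mbps, BW_XCP = {bw_xcp:.2f} Mbps")
```

Comparing these targets with the throughput read from Figure 3 gives a quick sense of how far each interval is from the fair allocation.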
6
Limitations and Optimizations
Even though the zombie estimator executes only a few operations for every incoming packet, those operations, added to the ones needed by the XCP algorithm, could significantly increase the packet processing time. For this reason, we propose computing the number of XCP and TCP flows from only a percentage of the total incoming packets. This modification may have a significant impact on the routers. For instance, a router processing 833,333 packets/s (10 Gbps if every packet is 1500 B) would, in order to estimate the number of flows, generate approximately 1,666,666 random numbers, access the zombie table 833,333 times, and execute 833,333 comparisons per second, among other operations such as writing the flow ID when a miss is declared. Taking into account the operations listed here, our 10 Gbps router would execute 3,333,332 operations/s. Thus, a reduction of 30% in the number of inspected packets would represent a decrease of approximately 1,000,000 operations/s in the router.
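The arithmetic behind these figures can be checked with a few lines. The per-packet cost model (two random numbers, one table access, one comparison) is taken from the text; the function names are illustrative only.

```python
def packets_per_second(link_bps=10e9, packet_bytes=1500):
    """Whole packets per second on a saturated link."""
    return int(link_bps // (packet_bytes * 8))


def estimator_ops_per_second(pps, inspected_fraction=1.0):
    """Operations per second spent by the flow estimator.

    Per inspected packet: 2 random numbers, 1 zombie-table access,
    1 comparison, as listed in the text.
    """
    inspected = pps * inspected_fraction
    return 2 * inspected + inspected + inspected


pps = packets_per_second()                    # 833333
full = estimator_ops_per_second(pps)          # 3333332.0
reduced = estimator_ops_per_second(pps, 0.7)  # inspect 30% fewer packets
print(pps, full, full - reduced)              # saving is close to 1,000,000
```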
In order to estimate the number of active flows without checking every incoming packet, a few changes to the zombie estimator are necessary. As shown earlier, the hit probability is given by P(t) = (1 − δ) P(t − 1) + δ · Hit(t), where P(t) estimates the probability that a packet drawn from the zombie table has the same ID as the incoming packet, and δ is the weight given to the most recent observation Hit(t). We propose to check only 50% of the total incoming packets. However, verifying only 50% of the total incoming packets while updating the zombie table with a low probability (25%) significantly increases the probability of missing flows in our zombie table. Increasing the update probability of the zombie table, on the other hand, should reduce the problem of missing flows. For these reasons, in our work we have decided to use a probability of 50% to update the zombie table in case of a miss. Therefore δ = 0.00025.
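A sampled zombie-table estimator along these lines might look as follows. This is a sketch in the spirit of SRED [13], not the authors' implementation: the class name, the table size of 1000 entries, and the choice to update P(t) only for inspected packets are assumptions made for illustration.

```python
import random


class SampledZombieEstimator:
    """Hit-probability estimator that inspects only a fraction of packets."""

    def __init__(self, table_size=1000, delta=0.00025,
                 inspect_prob=0.5, update_prob=0.5, seed=None):
        self.table = [None] * table_size  # zombie table of flow IDs
        self.delta = delta                # weight of the newest observation
        self.inspect_prob = inspect_prob  # fraction of packets inspected
        self.update_prob = update_prob    # overwrite probability on a miss
        self.p_hit = 0.0                  # P(t)
        self.rng = random.Random(seed)

    def on_packet(self, flow_id):
        if self.rng.random() >= self.inspect_prob:
            return  # packet skipped: no table access, no comparison
        slot = self.rng.randrange(len(self.table))
        hit = 1 if self.table[slot] == flow_id else 0
        if not hit and self.rng.random() < self.update_prob:
            self.table[slot] = flow_id  # a miss may overwrite the zombie
        # P(t) = (1 - delta) * P(t - 1) + delta * Hit(t)
        self.p_hit = (1 - self.delta) * self.p_hit + self.delta * hit
```

In SRED, the reciprocal of P(t) serves as an estimate of the number of active flows, so a router could read 1 / p_hit once the estimate has settled.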
Fig. 4. 3 TCP flows appear among 10 XCP flows - inspecting 50% of packets: (a) 20 ms RTT; (b) 100 ms RTT
With the modification introduced in our fairness mechanism (which lets routers monitor only 50% of incoming packets to estimate the number of flows), we re-ran the experiment shown in Section 5, where 3 TCP flows are gradually introduced among 10 active XCP flows. Figure 4 shows the results of this new set of experiments. As we can see in Figure 4, our fairness mechanism succeeds in achieving fairness between XCP and TCP flows while inspecting only 50% of the total incoming packets. If we compare (i) Figure 4(a) (RTT = 20 ms and 50% of inspected packets) with Figure 3(a) (RTT = 20 ms and 100% of inspected packets), and (ii) Figure 4(b) (RTT = 100 ms and 50% of inspected packets) with Figure 3(b) (RTT = 100 ms
and 100% of inspected packets), we can see that the results are very similar. Thus, we show that we can significantly reduce the number of operations executed by our XCP-TCP fairness solution while keeping the same fairness level shown in the simulations of Section 5 (where we inspected 100% of packets).
7
Conclusion
In this article we have shown that when XCP and TCP share the bandwidth of the bottleneck link, TCP flows grab as many resources as they need, to the detriment of XCP flows. This unfairness problem affects most ERN protocols, such as JetMax [17] and TCP MaxNet [11]. Therefore, in this article we presented a solution that ensures XCP-TCP fairness, is independent of the XCP protocol, is easy to apply to other ERN protocols, and is lightweight in terms of CPU and memory usage. Our XCP-TCP fairness mechanism relies on two main mechanisms. The first estimates the resources needed by the XCP and non-XCP flows. The second limits the TCP throughput by dropping some of its packets with a probability p_drop. The experiments presented in this article showed that our algorithms successfully "allocate" and "deallocate" bandwidth dynamically to XCP in order to ensure fairness between XCP and TCP. However, the level of fairness provided between XCP and TCP is strongly influenced by factors beyond the control of our XCP-TCP fairness mechanism. For instance, after our XCP-TCP fairness solution drops packets to limit the TCP throughput and achieve fairness, TCP flows could end up with a throughput lower than the one desired. The capacity of TCP to regain the lost bandwidth depends on the RTT and the number of TCP flows. To decrease the unfairness between XCP and TCP due to the RTT, one solution could be to "evaluate" the TCP aggressiveness when TCP flows increase their throughput, so that the drop probability could be updated in a more intelligent way. Finally, we note that our XCP-TCP fairness solution is a new step toward stimulating the deployment of Explicit Rate Notification protocols in heterogeneous long-distance high-speed networks, to improve the performance of long-lived flows.
References
1. Akella, A., Seshan, S., Shenker, S., Stoica, I.: Exploring Congestion Control (2002)
2. Bloom, B.H.: Space/Time Trade-Offs In Hash Coding With Allowable Errors. Communications of the ACM 13(7), 422–426 (1970)
3. Braden, R.: Requirements for Internet Hosts - Communication Layers. RFC 1122 (Standard), Updated by RFCs 1349, 4379 (October 1989)
4. Cappello, F., Caron, E., Dayde, M., Desprez, F., Jeannot, E., Jegou, Y., Lanteri, S., Leduc, J., Melab, N., Mornet, G., Namyst, R., Primet, P., Richard, O.: Grid’5000: a Large Scale, Reconfigurable, Controlable and Monitorable Grid Platform. In: Grid’2005 Workshop, Seattle, USA, November 13-14, IEEE/ACM (2005)
5. Katabi, D., Handley, M., Rohrs, C.: Congestion Control for High Bandwidth-Delay Product Networks. In: ACM SIGCOMM (2002)
6. Kelly, T.: Scalable TCP: Improving Performance in Highspeed Wide Area Networks. SIGCOMM Comput. Commun. Rev. 33(2), 83–91 (2003)
7. Li, J.-S., Su, Y.-S.: Random Early Detection With Flow Number Estimation and Queue Length Feedback Control. Journal of Systems Architecture 52(6), 359–372 (2006)
8. López Pacheco, D.M., Lefevre, L., Pham, C.-D.: Fairness Issues When Transferring Large Volume of Data on High Speed Networks With Router-Assisted Transport Protocols. In: High Speed Networks Workshop 2007, in conjunction with IEEE INFOCOM 2007, Anchorage, Alaska, USA (May 2007)
9. López Pacheco, D.M., Pham, C.-D., Lefevre, L.: XCP-i: eXplicit Control Protocol for Heterogeneous Inter-Networking of High-Speed Networks. In: Globecom 2006, San Francisco, California, USA (November 2006)
10. López-Pacheco, D.M., Pham, C.: Robust Transport Protocol for Dynamic High-Speed Networks: Enhancing the XCP Approach. In: Proceedings of IEEE International Conference on Networks, Kuala Lumpur, Malaysia, November 2005, vol. 1, pp. 404–409 (2005)
11. Martin Suchara, B.W., Witt, R.: TCP MaxNet: Implementation and Experiments on the WAN in Lab. In: IEEE International Conference on Networks (November 2005)
12. ns2. The Network Simulator (2007), http://www.isi.edu/nsnam/ns/index.html
13. Ott, T.J., Lakshman, T.V., Wong, L.H.: SRED: Stabilized RED. In: INFOCOM, pp. 1346–1355 (1999)
14. Postel, J.: Transmission Control Protocol. RFC 793 (Standard), September 1981. Updated by RFC 3168 (1981)
15. Floyd, S.: HighSpeed TCP for Large Congestion Windows. RFC 3649 (Experimental) (December 2003)
16. Xu, L., Harfoush, K., Rhee, I.: Binary Increase Congestion Control (BIC) for Fast Long-Distance Networks. In: INFOCOM (2004)
17. Leonard, D., Zhang, D.L.Y.: JetMax: Scalable Max-Min Congestion Control for High-Speed Heterogeneous Networks. In: INFOCOM (April 2006)
Concurrent Multipath Transfer Using SCTP Multihoming: Introducing the Potentially-Failed Destination State
Preethi Natarajan¹, Nasif Ekiz¹, Paul D. Amer¹, Janardhan R. Iyengar², and Randall Stewart³
¹ Protocol Engineering Lab, CIS Dept, University of Delaware {nataraja,nekiz,amer}@cis.udel.edu
² Dept of Math & Computer Science, Connecticut College
³ Internet Technologies Division, Cisco Systems
Abstract. Previously, we identified the failure-induced receive buffer (rbuf) blocking problem in Concurrent Multipath Transfer using SCTP multihoming (CMT), and proposed CMT with a Potentially-failed destination state (CMT-PF) to alleviate rbuf blocking. In this paper, we complete our evaluation of CMT vs. CMT-PF. Using ns-2 simulations we show that CMT-PF performs on par or better than CMT during more aggressive failure detection thresholds than recommended by RFC4960. We also examine whether the modified sender behavior in CMT-PF degrades performance during non-failure scenarios. Our evaluations consider: (i) realistic loss model with symmetric and asymmetric path loss, (ii) varying path RTTs. We find that CMT-PF performs as well as CMT during non-failure scenarios, and interestingly, outperforms CMT when the paths experience asymmetric rbuf blocking conditions. We recommend that CMT be replaced by CMT-PF in future CMT implementations and RFCs¹.
Keywords: Stream Control Transmission Protocol (SCTP), Concurrent Multipath Transfer (CMT), Path failure, Receive buffer blocking.
¹ Prepared through collaborative participation in the Communication and Networks Consortium sponsored by the US Army Research Lab under Collaborative Tech Alliance Program, Coop Agreement DAAD19-01-2-0011. The US Gov’t is authorized to reproduce and distribute reprints for Gov’t purposes notwithstanding any copyright notation thereon. Supported by the University Research Program, Cisco Systems, Inc.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 727–734, 2008. © IFIP International Federation for Information Processing 2008
1 Introduction
Unlike TCP and UDP, the Stream Control Transmission Protocol (SCTP) [RFC4960] natively supports multihoming at the transport layer. SCTP multihoming allows binding of a transport layer association (SCTP’s term for a connection) to multiple IP addresses at each end of the association. This binding allows transmission of data to different destination addresses of a multihomed receiver. Multiple destination addresses imply the possibility of independent end-to-end paths. Concurrent Multipath Transfer (CMT) [5] leverages SCTP’s multihoming support, and increases an application’s throughput via simultaneous transfer of new data over independent paths between multihomed source and destination hosts.
Reference [4] explores the receive buffer (rbuf) blocking problem in CMT, where transport protocol data unit (TPDU) losses throttle a sender once the transport receiver’s buffer is filled with out-of-order data. Even though the congestion window would allow new data transmission, rbuf blocking (i.e., flow control) stalls the sender, causing throughput degradation. To reduce rbuf blocking effects during congestion, [4] proposes different retransmission policies that use heuristics for faster loss recovery, such as sending retransmissions on the path with the largest congestion window (RTX_CWND policy), or the largest slow-start threshold (RTX_SSTHRESH policy). Both RTX_CWND and RTX_SSTHRESH perform equally well during congestion-induced rbuf blocking, and therefore [4] arbitrarily selected and recommended the RTX_SSTHRESH policy for CMT.
In [9], we show how CMT with RTX_SSTHRESH policy suffers from consecutive instances of failure-induced rbuf blocking. To mitigate rbuf blocking during path failures, we modified CMT’s failure detection process to include a new “potentially-failed” (PF) destination state. CMT with a modified set of retransmission policies that consider the PF state is called CMT-PF. CMT-PF is based on the rationale that loss detected by a timeout implies either severe congestion or failure en route. After a single timeout on a path, a sender is unsure, and marks the corresponding destination as PF. A PF destination is not used for data transmission or retransmission. Only heartbeats are sent to the PF destination. If a heartbeat-ack returns, the PF destination returns to active state.
Note that CMT-PF retransmission policies are applied only for timeout based loss recoveries. One of CMT’s current retransmission policies, such as RTX_SSTHRESH, is applied for TPDU losses detected by RFC4960’s threshold number of missing reports (fast retransmits). Details of CMT-PF can be found in [9].
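The destination state transitions described above can be sketched as a small state machine. The class and method names are hypothetical; the transition to a failed state after more than PMR consecutive timeouts follows RFC 4960's Path.Max.Retrans rule and is included here only to complete the picture.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    PF = "potentially-failed"
    FAILED = "failed"


class Destination:
    """One destination address of a multihomed receiver."""

    def __init__(self, name, pmr=5):
        self.name = name
        self.pmr = pmr                  # failure detection threshold
        self.state = State.ACTIVE
        self.consecutive_timeouts = 0

    def on_timeout(self):
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts > self.pmr:
            self.state = State.FAILED   # PMR + 1 consecutive timeouts
        else:
            self.state = State.PF       # unsure: congestion or failure

    def on_heartbeat_ack(self):
        # The path answered a heartbeat, so it is usable again.
        self.consecutive_timeouts = 0
        self.state = State.ACTIVE


def usable_for_data(destinations):
    """Only active destinations carry data or retransmissions."""
    return [d for d in destinations if d.state is State.ACTIVE]
```

A sender would consult usable_for_data before each transmission and keep sending heartbeats to every destination that it excludes.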
2 Evaluation in Failure Scenarios
We implemented CMT-PF in the University of Delaware’s SCTP/CMT ns-2 module [2]. Our previous evaluation [9] considered only the RTX_SSTHRESH variants of CMT and CMT-PF. Our current evaluations include the RTX_CWND variants as well since RTX_CWND appears to be a better policy than RTX_SSTHRESH during failure (more details follow). The four variations considered for evaluations are:
1) CMT-CWND: CMT with RTX_CWND retransmission policy.
2) CMT-SSTHRESH: CMT with RTX_SSTHRESH retransmission policy.
3) CMT-PF-CWND: CMT-PF with RTX_CWND retransmission policy for fast retransmissions.
4) CMT-PF-SSTHRESH: CMT-PF with RTX_SSTHRESH policy for fast retransmissions.
To achieve faster yet robust failure detection, [1] argues for varying Path.Max.Retransmit (PMR) based on a network’s loss rate, and suggests PMR=3 for
the Internet. Lowering the PMR for CMT flows reduces the number of rbuf blocking episodes, and thus CMT’s throughput degradation during failures. However, a tradeoff exists in choosing the value of PMR: a lower value reduces rbuf blocking but causes spurious failure detection, whereas a higher value increases rbuf blocking but avoids spurious failure detection for a wide range of environments. We first investigate how CMT-PF improvements are affected by the failure detection threshold. We evaluate CMT vs. CMT-PF for PMR values ranging from 0 to 5. The simulation topology is shown in Figure 1. A multihomed sender, A, has two independent paths to a multihomed receiver, B. The edge links between A or B and the routers represent last-hop link characteristics. The end-to-end one-way delay is 45ms on both paths, representing typical U.S. coast-to-coast delays experienced by a significant fraction of today’s Internet flows [7].
Fig. 1. Topology with Bernoulli Loss Model
Each simulation run is a bulk file transfer from A to B. Rbuf=64KB and both paths experience a 1% loss rate under a Bernoulli loss model. Path 2 fails 10 seconds after the file transfer commences. Sender A detects this path failure after PMR+1 consecutive timeouts. Figure 2 plots the different variations’ goodput during failure detection (application data received ÷ failure detection time), with a 5% error margin. Note that the failure detection period decreases with smaller PMR values. As mentioned in Section 1, [4] arbitrarily recommends the RTX_SSTHRESH policy over RTX_CWND during congestion-induced rbuf blocking. Interestingly, the two policies exhibit performance differences during failure-induced rbuf blocking (Figure 2a). Ssthresh is a more historic account of a path’s congestion conditions than cwnd. When path 2’s loss rate swiftly changes from 1% to 100% (failure), ssthresh converges more slowly to path 2’s current conditions than cwnd. For CMT, RTX_CWND appears to be a better policy than RTX_SSTHRESH during failure scenarios. The dashed line in Figure 2b denotes the maximum attainable goodput of an SCTP file transfer (application data received ÷ transfer time) using path 1 alone. When the failure detection threshold is most aggressive (PMR=0), all senders detect path 2 failure after the first timeout. The 4 variations experience similar rbuf blocking during this failure detection and perform almost equally (Figure 2b). As PMR increases, the number of rbuf blocking instances during failure detection increases, resulting in
increasing performance benefits from CMT-PF. As seen in Figure 2b, as PMR increases, CMT-PF’s goodput increases, whereas CMT’s goodput decreases. Starting from PMR=3, CMT-PF’s goodput is comparable or equal to the maximum attainable SCTP goodput. Unlike the two CMT variations, both CMT-PF-SSTHRESH and CMT-PF-CWND perform equally well during failure, and better than CMT for PMR > 0.
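To see why the failure detection period shrinks with smaller PMR values, the following sketch adds up the PMR+1 consecutive timeouts with exponential backoff. It assumes that the path's RTO sits at RFC 4960's default RTO.Min of 1 s when the failure occurs and is capped at RTO.Max of 60 s; the function name is illustrative, and heartbeat scheduling details are ignored.

```python
def failure_detection_time(pmr, rto=1.0, rto_max=60.0):
    """Seconds until PMR + 1 consecutive timeouts have expired.

    Each timeout doubles the RTO (exponential backoff), capped at rto_max.
    """
    total = 0.0
    for _ in range(pmr + 1):
        total += rto
        rto = min(2 * rto, rto_max)
    return total


for pmr in range(6):
    print(pmr, failure_detection_time(pmr))
```

Under these assumptions PMR=0 gives a detection period of 1 s and PMR=5 gives 63 s, which illustrates how many more rbuf blocking instances a sender can go through at higher thresholds.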
Fig. 2. Goodput during Failure Detection for varying PMR values: (a) CMT; CWND vs. SSTHRESH; (b) CMT vs. CMT-PF
3 Evaluation in Non-failure Scenarios
Section 2 confirms that transitioning a destination to the PF state after timeout caused by path failure alleviates rbuf blocking, and improves CMT-PF performance. However, it is necessary to evaluate whether the PF state transition impairs performance during timeouts caused by non-failure scenarios such as congestion. In [9], our congestion experiments assumed a simple loss model with uniformly distributed loss rates. Here, we consider the more realistic topology and loss model shown in Figure 3.
Fig. 3. Bursty Loss Created by Cross-traffic
In this dual-dumbbell topology, each router, R, is attached to five edge nodes. Dual-homed edge nodes A and B are the transport sender and receiver, respectively. The other edge nodes are single-homed, and introduce cross-traffic that instigates bursty periods of congestion and bursty congestion losses at the routers. Their last-hop propagation delays are randomly chosen from a uniform distribution between 5 and 20 ms, resulting in end-to-end one-way propagation delays ranging ~35-65ms [7]. All links (both edge and core) have a buffer size twice the link's bandwidth-delay product, which is a reasonable setting in practice. Each single-homed edge node has eight traffic generators, introducing cross-traffic with a Pareto distribution. The cross-traffic packet sizes are chosen to resemble the distribution found on the Internet: 50% are 44B, 25% are 576B, and 25% are 1500B [3, 8]. The result is a data transfer from A to B, over a network with self-similar cross-traffic, which resembles the observed nature of traffic on data networks [10]. We simulate a 32MB file transfer between sender A and receiver B, under different loss conditions. Rbuf=64KB, PMR=5, and loss rates are controlled by varying the cross-traffic load.
3.1 Symmetric Loss Conditions
The aggregate cross-traffic loads on both paths are similar and vary from 40%-100% of the core link’s bandwidth. Figure 4 plots the average goodput (file size ÷ transfer time) of the 4 variations with 5% error margin. The 4 variations exhibit similar performance during low loss rates (i.e., minimal cross-traffic), since most of the TPDU losses are recovered via fast retransmits as opposed to timeouts. As cross-traffic and hence loss rate increases, the number of timeouts on each path increases. Under such conditions, the probability and time duration that both paths are simultaneously marked “potentially-failed” increases in CMT-PF.
To ensure that CMT-PF performs well even if all destinations get marked PF, CMT-PF transitions the destination with the smallest number of consecutive timeouts back to the active state, allowing data to be sent to that destination [9]. This modification guarantees that CMT-PF performs on par with CMT even when both paths experience high loss rates (Figure 4).
Under symmetric loss conditions, we study how a path’s RTT affects the throughput differences between CMT and CMT-PF. Note that any difference between CMT and CMT-PF transpires only after a timeout on a path. Let us assume that a path experiences a timeout event, and the next TPDU loss on the path takes place after n RTTs. After the timeout, CMT slow starts on the path, and the number of TPDUs transmitted on the path at the end of n RTTs is $1 + 2 + 4 + \dots + 2^n = 2^{n+1} - 1$. CMT-PF uses the first RTT for a heartbeat transmission, and slow starts with initial cwnd=2 after receiving the heartbeat-ack. In CMT-PF, the number of TPDUs transmitted by the end of n RTTs on the path is $0 + 2 + 4 + \dots + 2^n = 2^{n+1} - 2$. Thus, in n RTTs, CMT transmits 1 TPDU more than CMT-PF, and the 1 TPDU difference is unaffected by the path’s RTT. Therefore, when paths experience symmetric RTTs (a.k.a. symmetric RTT conditions), we expect the performance ratio between CMT and CMT-PF to remain unaffected by the RTT value.
We now consider a more interesting scenario when the independent end-to-end paths experience symmetric loss rates, but asymmetric RTT conditions. That is, path 1’s RTT=x sec, and path 2’s RTT=y sec (x ≠ y). How do x and y impact CMT
vs. CMT-PF performance? More importantly, does CMT-PF perform worse when the paths have asymmetric RTTs? We performed the following Bernoulli loss model experiment to gain insight (Bernoulli loss model simulations take much less time than cross-traffic ones, and both loss models resulted in similar trends between CMT and CMT-PF). We used the topology shown in Figure 1 to transfer an 8MB file from sender A to receiver B. Path 1’s one-way propagation delay was fixed at 45ms while path 2’s one-way delay varied as follows: 45ms, 90ms, 180ms, 360ms, and 450ms. Both paths experience identical loss rates ranging from 1%-10%.
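The slow-start counting argument used for the symmetric-RTT case can be checked numerically. The sketch below simply sums the per-RTT bursts described in the text; the function names are illustrative.

```python
def cmt_tpdus(n):
    """TPDUs sent by CMT after a timeout: 1 + 2 + 4 + ... + 2^n."""
    return sum(2 ** k for k in range(n + 1))


def cmt_pf_tpdus(n):
    """TPDUs sent by CMT-PF: the first RTT carries only a heartbeat,
    then slow start resumes with cwnd = 2, giving 0 + 2 + 4 + ... + 2^n."""
    return sum(2 ** k for k in range(1, n + 1))


for n in range(1, 6):
    # The difference is always a single TPDU, whatever the path's RTT.
    print(n, cmt_tpdus(n), cmt_pf_tpdus(n), cmt_tpdus(n) - cmt_pf_tpdus(n))
```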
Fig. 4. Goodput during Symmetric Loss Conditions
Fig. 5. Goodput Ratio during Asymmetric Path RTTs
Figure 5 plots the ratio of CMT-SSTHRESH’s goodput over CMT-PF-SSTHRESH’s (relative performance difference) with 5% error margin. As expected, both CMT and CMT-PF perform equally well during symmetric RTT conditions. As the asymmetry in paths’ RTTs increases, an interesting dynamic dominates and CMT-PF performs slightly better than CMT (goodput ratios < 1). Note that rbuf blocking depends on the frequency of loss events (loss rate) and duration of loss recovery. Loss recovery duration increases with a path’s RTT and RTO. As loss rate increases, the probability that a sender experiences consecutive timeout events on the path increases. After the first timeout, CMT-PF transitions the path to PF, and avoids data transmission on the path (as long as another active path exists) until a heartbeat-ack confirms the path as active. But a CMT sender suffers back-to-back timeouts on data sent on the path, with exponential backoff of the timeout recovery period. Consecutive timeouts increase path 2’s RTO more at higher RTTs, and result in longer periods of rbuf blocking in CMT (shown in [9]). Therefore, as path 2’s RTT increases, CMT’s goodput degrades more than CMT-PF’s, and the goodput ratio decreases (Figure 5). Similar trends are observed between CMT-CWND and CMT-PF-CWND (not shown due to space constraints) [9]. In summary, during symmetric loss conditions, CMT and CMT-PF perform equally well when paths experience symmetric RTT conditions. As the RTT asymmetry increases, CMT-PF demonstrates a slight advantage at high loss rates.
3.2 Asymmetric Loss Conditions
In our asymmetry experiments, paths 1 and 2 experience different cross-traffic loads. The aggregate cross-traffic load on path 1 is set to 70% of the core link bandwidth,
while on path 2 the load varies from 70%-100% of the core link bandwidth. Figure 6 plots the average goodput of the 4 variations with 5% error margin. As discussed in the previous sub-section, as path 2’s cross-traffic load increases, the probability that a sender experiences back-to-back timeouts on path 2 increases. CMT suffers a higher number of consecutive timeouts on data (Table 1), resulting in longer rbuf blocking periods than in CMT-PF. Therefore, as path 2’s cross-traffic load increases, CMT-PF performs better than CMT (Figure 6). This result holds true for both RTX_CWND and RTX_SSTHRESH variants.
Fig. 6. CMT vs. CMT-PF Goodput during Asymmetric Loss Conditions (Path 1 Cross-traffic Load = 70%)
Table 1. Mean Consecutive Data Timeouts (Path 2)
Table 2. Mean Number of Transmissions
In CMT, RTX_CWND and RTX_SSTHRESH are retransmission policies, and are not applied during new data transmissions. In CMT-PF, a path is marked PF after a timeout, and as long as active path(s) exist, CMT-PF avoids retransmissions on the PF path. Once the retransmissions are all sent, CMT-PF’s data transmission strategy is applied to new data, and CMT-PF avoids new data transmissions on the PF path. As shown in Table 2, when compared to CMT, CMT-PF reduces the number of (re)transmissions on the higher loss rate path 2 and (re)transmits more on the lower loss rate path 1. This transmission difference (ratio of transmissions on path 1 over path 2) between CMT-PF and CMT increases as the paths become more asymmetric in their loss conditions. In summary, CMT-PF does not perform worse than CMT during asymmetric path loss conditions. In fact, CMT-PF is a better transmission strategy than CMT, and performs better as the asymmetry in path loss increases.
4 Conclusion and Future Work
Using simulation, we demonstrated that (re)transmission policies using CMT with a “potentially-failed” destination state (CMT-PF) outperform CMT during failures, even under aggressive failure detection thresholds. Investigations during symmetric loss conditions revealed that CMT-PF performs as well as CMT during symmetric path RTTs, and slightly better when the paths experience asymmetric RTT conditions. We conclude from the simulation results that CMT-PF (i) reduces rbuf blocking during failure scenarios, and (ii) performs on par or slightly better than CMT during non-failure scenarios. In light of these findings, we recommend CMT be replaced by CMT-PF in SCTP implementations and RFCs. We have implemented CMT-PF in the FreeBSD SCTP/CMT stack, and are performing emulation experiments comparing CMT vs. CMT-PF. As future work, we will evaluate CMT vs. CMT-PF when end-to-end paths share a bottleneck and therefore are no longer independent.
5 Disclaimer
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.
References
[1] Caro, A.: End-to-End Fault Tolerance using Transport Layer Multihoming. PhD Dissertation, CIS Dept, U of Delaware (August 2005)
[2] Caro, A., Iyengar, J.: ns-2 SCTP Module. Version 3.6 (June 2006) http://pel.cis.udel.edu
[3] CAIDA: Packet Sizes and Sequencing (March 1998), http://www.caida.org
[4] Iyengar, J., Amer, P., Stewart, R.: Performance Implications of Receive Buffer Blocking in Concurrent Multipath Transfer. Computer Communications, 30(4), 818–829 (February 2007)
[5] Iyengar, J., Amer, P., Stewart, R.: Concurrent Multipath Transfer using SCTP Multihoming over Independent End-to-end Paths. IEEE/ACM Trans on Networking, 14(5), 951–964 (October 2006)
[6] Natarajan, P., Iyengar, J., Amer, P., Stewart, R.: CMT using Transport Layer Multihoming: Performance under Network Failures. In: MILCOM 2006, Washington, DC (October 2006)
[7] Shakkottai, S., Srikant, R., Broido, A., Claffy, K.: The RTT distribution of TCP Flows in the Internet and its Impact on TCP-based flow control, TR, CAIDA (February 2004)
[8] Fraleigh, C., Moon, S., Lyles, B., Cotton, C., Khan, M., Moll, D., Rockell, R., Seely, T., Diot, C.: Packet-level Traffic Measurements from the Sprint IP Backbone. IEEE Network, 17(6), 6–16 (November 2003)
[9] Natarajan, P., Ekiz, N., Amer, P., Stewart, R.: CMT using SCTP Multihoming: Transmission Policies using a Potentially-failed Destination State, TR 2007/338, CIS Dept, U of Delaware (February 2007)
[10] Leland, W., Taqqu, M., Willinger, W., Wilson, D.: On the Self-similar Nature of Ethernet Traffic. In: ACM SIGCOMM, San Francisco (September 1993)
Energy-Efficient Mobile Middleware for SIP on Ubiquitous Multimedia Systems Felipe Garcia-Sanchez, Antonio-Javier Garcia-Sanchez, and Joan Garcia-Haro Department of Information and Communication Technologies. Technical University of Cartagena. Campus Muralla del Mar, Plaza del Hospital, 1, 30202. Cartagena, Spain {felipe.garcia,antoniojavier.garcia,joang.haro}@upct.es
Abstract. Currently, different technologies provide solutions that preserve the continuity of service of a mobile host. Employing middleware is a common approach to this issue, and it also supports heterogeneity. In addition to the usual requirements of bandwidth, delay, or CPU load, multimedia applications have special features that are still being studied in mobile scenarios, such as intensive energy consumption. For these reasons, current middleware specifications and services do not directly apply. The Session Initiation Protocol (SIP) is the most frequently employed tool to satisfy mobile multimedia requirements. In this paper, we present and validate the design of a service based on middleware and Publish/Subscribe schemes that minimizes the control data exchange introduced by SIP, reducing the energy consumption in the handoff process. It considers all the relevant factors in the audio/video transmission performed by different middleware entities, and it supports their profiles (provider or consumer). An intermediate element, a modified event server, decouples the communication edges and adds some improvements: asynchronous connection establishment and extension of mobility to all the terminals involved in the connection. The performance of the proposed service is also evaluated, and the results are discussed.

Keywords: SIP, middleware, streaming, handoff, publish/subscribe.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 735–747, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

Current wireless technologies are evolving to allow communication among various devices, supporting the concept ‘everyone, everywhere, and every time’. Personal Digital Assistants (PDAs), mobile phones, and laptops are examples of terminals that may use these technologies to access all types of data. Services based on middleware (software placed between the applications and the operating system) are widely employed in multimedia and audio/video (A/V) applications and in highly distributed and heterogeneous environments. Many interactions among middleware technologies and general-purpose applications may be found in the open literature [1], as well as the usual services provided, e.g. service discovery, QoS parameter negotiation, or security. We consider that higher layer communication protocols must be aware of modifications of the environment as part of the global communication process [2] [3]. Examples of this issue are collaborative environments and
embedded systems whose possible interconnection (wireless and mobile) through mobile ad-hoc networks (MANETs) is not currently standardized or, at least, not clearly addressed. Reference [4] outlines the topic of middleware in mobile networks. All the expected capabilities may be designed through middleware, either traditional approaches such as CORBA (Common Object Request Broker Architecture) or newer ones such as JXTA, UPnP, or OSGi, because of their platform independence, the multiple services supported, the interoperability and programming coherence, and the signaling and connection services included in the middleware layer instead of in the application layer. However, in this context, mobile multimedia and video applications are not developed with their special requirements in mind: throughput, delay, QoS features, or handoff energy saving. This is because the middleware specifications and services that provide A/V connectivity (i.e. the standard A/V Streaming service [5]) are not oriented to aspects such as access integration or the energy awareness of different terminals. For instance, they do not include a mechanism to optimize the handoff scheme of an A/V Streaming application when the service is down or when a device changes its location. In these systems, energy consumption is an important issue for mobile terminals, and it reaches its highest values during the handoff process. Therefore, the heterogeneity and volatility of ubiquitous environments demand new middleware approaches to support mobile multimedia applications, offering generic access. The Session Initiation Protocol (SIP) [6] [7] is the most widespread signaling system to connect and manage services and applications. In this paper, we address how current middleware technologies may support mobile multimedia applications. In particular, we focus on how a multimedia service programmed via middleware may be restored when a terminal changes its place and/or its access to the wired network.
With this aim, we propose a system based on generic middleware that supports the SIP signaling operation. Our contribution consists of developing a communication protocol and a software procedure to guarantee the connection and re-connection of multimedia applications, providing a mechanism to retrieve the signaling information lost in the process and reducing the energy consumption. To implement it, we employ the publish/subscribe paradigm, which helps to solve additional issues, such as scalability or localization, by decoupling the edges, avoiding the limitations of the client-server scheme in SIP. Other partial objectives are to reduce the signaling information, to decrease the connection establishment delay, and to build an architecture allowing the mobility of both providers and consumers. This last goal applies to symmetric services such as direct streaming between end users. We also build an evaluation testbed to measure the connection delay incurred in a re-connection between two terminals, taking into account event propagation in an infrastructure network. The results obtained are discussed and compared with other connection strategies.

The rest of the paper is organized as follows. Section 2 summarizes current mobility services and the protocols and procedures employed for traditional non-middleware applications. Sections 3 and 4 present the general architecture used as workbench and the proposed protocol. Performance evaluation results are presented and discussed in section 5. Finally, section 6 concludes the paper.
2 Related Work

2.1 The Session Initiation Protocol Subsystem

SIP is a specification [6] that defines the procedure and protocol to support audio/video transmissions in mobile environments. The major SIP features for multimedia applications are the following:

• The communication between user A and user B is session-oriented.
• The communication may be in real time or in deferred time.
• The system may support video and multimedia applications that require Quality of Service (QoS).
• The identification of users and services is done through a URI (Uniform Resource Identifier) that facilitates the usability of services. It may be coded in XML (eXtensible Markup Language).
The system consists of two different agent classes: the User Client Agents (UCAs) and the User Server Agents (USAs), which the SIP specification [6] calls user agent clients and user agent servers. Messages follow a request-reply structure and implement the usual services such as registration and addressing through servers, proxies, etc. (see fig. 1) [8]. SIP is implemented through basic operations such as invite, ack, bye, cancel, register, and options. All of them are required to establish point-to-point communication in a mobile environment, using return codes similar to those of HTTP (Hypertext Transfer Protocol). Nevertheless, the computing mechanism is not defined.

[Figure: message sequence among the UCA, the USA, and the registering server. Message labels: register, 200-OK, invite, 302-MOVED, ACK, invite, 180-RING, 200-OK, ACK, data. The figure also shows the following example message:]

REGISTER sip:www.upct.es SIP/2.0
Via: SIP/2.0/UDP 192.168.1.10:3010
To: [email protected]
From: [email protected] ; tag:mno
Call-ID: [email protected]
CSeq: 10 REGISTER
Contact: sip:[email protected]

Fig. 1. Summary of the SIP connection establishment procedure
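The request-reply structure and the HTTP-like return codes described above can be illustrated with a minimal sketch. It is not part of the authors' system: the function names are hypothetical, the addresses in the usage note are placeholders under example.org, and only the start line, header serialization, and status classes follow the general SIP message layout.

```python
def build_request(method, request_uri, headers):
    """Serialize a SIP request: start line, headers, blank line (no body)."""
    lines = [f"{method} {request_uri} SIP/2.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_status_line(line):
    """Split a status line such as 'SIP/2.0 200 OK' into (code, reason)."""
    version, code, reason = line.split(" ", 2)
    if version != "SIP/2.0":
        raise ValueError(f"unexpected version: {version}")
    return int(code), reason


def response_class(code):
    """Map a status code to its class, mirroring HTTP-style return codes."""
    classes = {
        1: "provisional",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
        6: "global failure",
    }
    return classes[code // 100]
```

For instance, build_request("REGISTER", "sip:registrar.example.org", {"CSeq": "10 REGISTER"}) yields a message that starts with the REGISTER line, and response_class(302) classifies the MOVED reply of fig. 1 as a redirection.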
2.2 Previous Middleware Mobility Works

Currently, the major contributions to SIP implementations focus on minimizing the connection time between devices, but they omit the effect of this process on energy consumption. For mobile devices, the most common restrictions are service connectivity and battery lifetime. The most relevant recent works are: (i) the contribution of Zhang et al. [9], which provides a seamless SIP scheme to reduce the connection delay, and (ii) optimizations of protocols such as hierarchical mobile SIP (HMSIP) [10] or RTP (Real-time Transport Protocol). Neither addresses energy consumption, and neither is oriented to multimedia requirements. In the field of mobile middleware technology and ubiquitous networking, several efforts have been made to provide multimedia connectivity. We highlight the recent works of Gehlen et al. [11] and Su et al. [12]. Both provide mechanisms to design applications for mobile devices, but without specifying the signaling mechanism. In the first one, the Publish/Subscribe model [2] is employed to provide the application with context information about the communication and other subscribed services. Moreover, it uses several intermediate middleware services. In the second one, the multimedia system divides users into producers and consumers, and a collaborative scenario is developed through event communication. It handles events written in XML and offers, as an implementation example, a SQL database server with clients acting as middleware agents. The recent work [13] implements SIPHOC (SIP over a MANET), offering performance evaluations of the signaling and connection delay over different routing protocols and towards a wired network. To the best of our knowledge, there is no document or paper on the operation of SIP signaling in a multimedia implementation based on middleware. The closest documents concern publish/subscribe schemes that facilitate mobility for middleware applications. However, SIP and SDP (Session Description Protocol) are the most widely used signaling protocols for mobility in wireless networks. Therefore, our work is aimed at enhancing the middleware operation for wireless environments through XML event propagation, including the SIP operations.
Our system includes the following features:

• A/V Streaming connection establishment through event propagation, using the Publish/Subscribe model.
• Mobility is facilitated in both edges (provider and consumer).
• Optimizing the number of remote invocations, to reduce the messages transmission and, consequently, the energy consumption.
3 Mobile Architecture

The Event Service (ES) [14] specification is the simplest tool to propagate events in a middleware environment, and it is employed as the starting point of the signaling architecture. However, it does not offer the specific features required by a streaming transmission. The modified ES must meet the following requirements:

• It must provide the connection and create the channels for the flow transmission.
• The edges, i.e. the video providers and the consumers, must have the same interface, and their behavior for mobility has to be the same.
• The streaming provider and the consumer may be located anywhere in the network.
• The system must be scalable, so that the architecture generalizes to a global network or to the Internet.

For our implementation, we define these new elements: Bidirectional-Event Services (B-ES), Hybrid interfaces for all the edges, and a federated policy of event transmission (see fig. 2).
Fig. 2. Physical infrastructure network example
The design considers all the terminals as the same type of element: they are hybrids. Any edge may send and receive events and, at the same time, may choose to act as sender of a video/audio flow, as receiver, or as both. This section gives a high-level description of the architecture. In particular, fig. 3 presents the architecture and the interactions among the different elements.

3.1 Hybrid Interfaces

Usually, the different edges of video or multimedia applications are divided into providers (or senders) and consumers (receivers). The Hybrid interface provides the classes and methods required by both providers and consumers, as well as open methods to add future properties. Now, instead of sender and receiver edges, the edges are Hybrids with different tasks but with the same interface. This Hybrid interface facilitates the connection between an A/V Stream application (in general, any application that uses it as interface) and a B-ES. The interface implements the connection to the B-ES through the request method. It sends a packet (in XML format) that includes different aspects such as the identification (URI) of the requested edge, the type of connection, the transport protocol to use for the flow, etc. This information is useful not only for the connection itself, but also for locating the edge. The headers identify different aspects, such as origin machine, port, etc. All this information is saved by the B-ES in a database. The additional functions are focused on managing the events and on receiving the connection responses. They provide the methods to send, receive, and manage events, and to create the A/V flow channels. The A/V channels are created when both edges wanting to intercommunicate are connected to the same B-ES.

3.2 Bidirectional-Event Service

The B-ES is a service presented in reference [14].
It may be developed on different programming paradigms such as JXTA, UPnP, etc., according to the application requirements, supporting XML events. This service propagates the events from any of the Hybrids connected to it to the Hybrids included in the respective Event Channel [14] where the original Hybrid is connected. The structure of the B-ES is organized in the following levels:
Stream Dispatcher or event deliverer. It delivers the events that arrive and selects their destination. It composes the different Event Channels for the event propagation (each service needs at least one Event Channel created inside the B-ES), and it orders the creation of the A/V flow through the respective interfaces. Moreover, it has access to two databases: one dedicated to the usual operation, where the different Hybrids are registered with their features and destination edges, and another devoted to the mobility service, where the mobile Hybrid terminals are registered with their current location.

Hybrid Administrators. They manage the different connections of the Hybrids (edges) to the B-ES, as in the typical specification, with one additional task: ordering the creation of the A/V flows.

ProxyHybrid Interface. It is the tool to connect the different Hybrids or, for federated structures, other B-ES nodes. It should be noted that the traditional Publish/Subscribe policy based on Suppliers and Consumers is replaced by a point-to-point structure based on Hybrids. Moreover, the ProxyHybrid has to receive, send, and propagate events. The definition of the ProxyHybrid allows it to coexist with the traditional supplier and consumer interfaces if necessary.

3.3 Connection Establishment and Static Operation

This section presents the general operation of the B-ES in coordination with the edges that require the service. Fig. 3 represents the usual sequence of connection set-up. This process achieves energy efficiency by reducing the number of invocations and therefore of transmissions, under the assumptions that transmitting a message is far more costly than CPU processing and that reception is less costly than transmission. The edges must execute the request method, which sends a packet with their own identification and that of the desired destination (SIP register).
This message includes the original location of the edge, the communication port, etc. This information is saved in the Location database of the B-ES, while the destination information is saved in the Hybrid database. This information is maintained until a destroy (SIP bye) event happens. When two edges are connected to the same B-ES (or, for federated events, to two interconnected B-ES nodes), the B-ES executes two methods: accept (SIP invite) for one Hybrid and complete for the second Hybrid. Then, both edges know that the event channel is open, and they may send events to the other edge. The messages associated with accept and complete include the references of the destination edge of the A/V Stream flow. In the example of Fig. 3, the packet for Service includes the information of Client, and the packet for Client includes the information of Service. Finally, the edge that receives the accept (Service) executes the active connection. The rest of the connection establishment of the A/V stream flow is independent of the event transmission and propagation through the B-ES. Therefore, from this moment on, the B-ES is limited to delivering the possible events between applications. The event propagation is similar to the one provided by the traditional ES. However, an interesting point is that the event propagation is bidirectional: both Hybrids may send and receive events. This is possible thanks to the ProxyHybrid class, which listens to its connected Hybrid and to the Event Channel created in the B-ES. Moreover, the Hybrid components in the edges must pay attention to the events
Fig. 3. Connection establishment scheme
received from the Event Channel and to the events generated in their own edge. For this reason, a design condition is that only the push model is supported, to guarantee proper attention to all the events. The pull model may be considered as a push event in the opposite direction.

3.4 Federated Bidirectional-Event Service

Although a single B-ES node is enough to provide an end-to-end service, the desired scalability requires a more complex configuration. The B-ES, like the original ES standard, may use a centralized or a federated architecture. In the federated architecture, different B-ES nodes may be subscribed to other B-ES nodes as common Hybrids and propagate connections and events to other edges or B-ES nodes. The familiar structure of a federated ES may be observed in reference [15]. Basically, a B-ES uses the ProxyHybrid interface as a real Hybrid interface, so it is able to propagate and receive events. This is similar to a supplier/consumer architecture where the ProxySupplier connects to the ProxyConsumer of the ES, but with the ProxySupplier role played by a Hybrid and the ProxyConsumer role played by a ProxyHybrid. The connection of two edges through a federated B-ES architecture is as follows:

1. Both edges connect to a different B-ES known by each one of them, and both subscribe for the other edge.
2. A B-ES (or both) sends an event to the rest of the B-ES network. The event is propagated through the network, arriving at the destination B-ES.
3. If both events arrive at the remote B-ES, one operation is discarded when the time-to-live of the event expires (a configurable parameter: the time an event lives in the B-ES without response until it is discarded).

The system offers the traditional features, such as control of event replications or event time-to-live. The request packet of the event gives the information of the original edge for the A/V flow connection (URI), but the information of the message header comes from the subsequent B-ES. This provides the appropriate information to connect the A/V flow between Hybrids, and the information to propagate events through the federated architecture.

4 SIP Using B-ES Nodes
The upgraded design of the SIP protocol for the B-ES architecture is developed through the implementation of B-ES nodes that support the SIP signaling and
operation. The basic idea is to identify the tasks that the SIP registering server performs, in order to adapt the methods required by the protocol to the B-ES working diagram. There are two initial conditions: the B-ES must be located at a fixed position that may be found through a simple call to a DNS (Domain Name System) or similar service, and both edge Hybrids must not disconnect simultaneously. In that case, the B-ES loses the references and the re-connection is not possible. Therefore, the system provides additional structures for the Hybrid interface:

• The void identify (Hybrid peerHybrid) method encapsulates the actions to generate the identification of a Hybrid previously connected to the B-ES (as invite does in SIP). The B-ES saves these identifications in the Location database and compares the URI provided by the identify operation with the ones stored in this database. When the comparison succeeds, the B-ES decides that this Hybrid is the same as the one connected before. Then, the following steps are performed to manage the connection of the new flow and propagate the data events. In fact, identify is similar to request, adding the check and managing the connection of an event channel previously created.
• The null event. This event is exchanged between Hybrid and B-ES (it is not propagated) and is aimed at regularly confirming that the connection between them is alive when many events fail or when no events are generated within a pre-established time or threshold.
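The URI comparison performed by identify can be sketched as follows. This is only an illustration under stated assumptions: the class and field names are hypothetical, the Location database is reduced to an in-memory dictionary keyed by URI, and rejecting unknown URIs is a choice of this sketch rather than behavior stated in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class LocationEntry:
    """Last known network location of a Hybrid (hypothetical record layout)."""
    host: str
    port: int


@dataclass
class BidirectionalEventService:
    """Toy B-ES that keeps only the Location database, keyed by Hybrid URI."""
    location_db: dict = field(default_factory=dict)

    def request(self, uri: str, host: str, port: int) -> None:
        # First connection: register the Hybrid together with its location.
        self.location_db[uri] = LocationEntry(host, port)

    def identify(self, uri: str, host: str, port: int) -> bool:
        """Return True when the URI matches a previously connected Hybrid.

        On a match the stored location is overwritten with the new one, so
        that later steps can re-create the flow towards the new address.
        Unknown URIs are rejected here (an assumption of this sketch).
        """
        if uri not in self.location_db:
            return False
        self.location_db[uri] = LocationEntry(host, port)
        return True
```

A Hybrid that first calls request and later calls identify with the same URI from a new address is recognized, and its entry then points to the new address.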
To perform the SIP operation, the handoff process must be designed. We distinguish two major cases: (1) when the Hybrid in movement receives the flow, and (2) when it transmits. As mentioned in section 2, we take advantage of the sensitivity of the A/V flow rate to movement as an implicit notification of a location change, since rates are reduced (or they stop) when a location change is close. Therefore, the location discovery in case (1) may be obtained by this analysis. In case (2), however, the Hybrid does not receive the flow, and this analysis cannot be performed. Therefore, we propose two different mechanisms to resolve the location discovery at the source edge (using reactive strategies, applied in streaming services):

1. The Hybrid regularly sends null events to the B-ES, receiving ACK messages when the connection succeeds and no response otherwise.
2. The B-ES sends the null events directly, when it does not detect activity from the Hybrid. In this case, the B-ES acts as an active element.
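The first mechanism above, seen from the Hybrid side, can be sketched as a small bookkeeping class. The class name, the idle threshold, and the number of tolerated unanswered null events are assumptions made for illustration; the paper does not give concrete values. Time is passed in explicitly so that the logic does not depend on a real clock.

```python
class NullEventMonitor:
    """Hybrid-side bookkeeping for null events (all thresholds are assumptions)."""

    def __init__(self, idle_threshold_s: float = 1.0, max_unanswered: int = 3):
        self.idle_threshold_s = idle_threshold_s
        self.max_unanswered = max_unanswered
        self.last_event_time = 0.0
        self.unanswered = 0

    def on_event_sent(self, now: float) -> None:
        # Any regular event counts as activity and resets the idle timer.
        self.last_event_time = now

    def should_send_null(self, now: float) -> bool:
        # A null event is due once the idle time reaches the threshold.
        return now - self.last_event_time >= self.idle_threshold_s

    def on_null_sent(self, now: float) -> None:
        self.last_event_time = now
        self.unanswered += 1

    def on_ack(self) -> None:
        # The B-ES answered, so the connection is considered alive.
        self.unanswered = 0

    def must_reconnect(self) -> bool:
        # Too many consecutive null events without an ACK.
        return self.unanswered >= self.max_unanswered
```

With max_unanswered set to 2, two null events in a row without an ACK make must_reconnect() return True, while any ACK in between resets the count.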
In our implementation, we choose a combined solution: first, the sending Hybrid transmits null events when the time since the last event exceeds a threshold, and second, the B-ES is able to send null events when it does not detect activity. Finally, when the Hybrid sends consecutive null events without any response, or when it receives a reply from the B-ES, it starts the re-connection process.

4.1 Re-connection for the Consumer Mobility

The mobile behavior of the Hybrid that receives the flow is referred to as consumer mobility. It may be considered a typical SIP operation with a reactive handoff
[Figure: time diagram of the exchange among the receiving Hybrid, the B-ESs, and the sending Hybrid. Message labels: data, request, ACK-200OK, identify, RING-180OK, complete, accept, events; the SIP operation phase is marked.]
Fig. 4. Time diagram for the re-connection of consumers
algorithm. This is the most common case, and it is highly relevant when the service is located in the wired network. The re-connection process is shown in fig. 4. When the activity on the A/V flow decreases, the Hybrid decides to start the re-connection operation. Then, like every Hybrid that requires a service, it must send the request packet to the B-ES node. The next step is to execute the identify method, sending its URI. At this point, the B-ES compares this URI with the ones stored in the Location database. The B-ES then carries out two tasks: (a) it updates its own database, and (b) it propagates to the moving Hybrid the events queued in the server that were provided by the other edge, which represents the service. While the event propagation is taking place, the edges may execute their end-to-end connection. When the connection establishment and the event delivery are finished, the service is restored.

4.2 Re-connection for the Supplier Mobility

The mobile behavior of the Hybrid that sends the flow is referred to as supplier mobility. This case appears when the service source is hosted on a mobile host. The re-connection process is shown in fig. 5. When the null events do not arrive at the B-ES, the Hybrid decides to start the re-connection operation. Then, it sends the request packet to the B-ES node. While the connection to the B-ES is being established, the Hybrid sends the identify packet. The B-ES compares its URI information with the one stored in the Location database and determines the service identification. Then, it searches for the Hybrids connected to it and, by means of the accept and complete methods, it orders the connection establishment as explained in section 3. Finally, the service may be supplied.
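Task (b) of the consumer re-connection in section 4.1, where the B-ES delivers the events that were queued while the moving Hybrid was unreachable, can be sketched as below. The class is hypothetical and heavily simplified: a single channel, string events, and a list standing in for the consumer's inbox. It only illustrates the queue-then-replay idea, not the authors' implementation.

```python
from collections import deque


class EventChannel:
    """Toy event channel that queues events while the consumer is away."""

    def __init__(self):
        self.connected = True
        self.pending = deque()  # events waiting for the moving Hybrid
        self.delivered = []     # stands in for the consumer's inbox

    def push(self, event: str) -> None:
        # Deliver immediately when connected, otherwise keep the event.
        if self.connected:
            self.delivered.append(event)
        else:
            self.pending.append(event)

    def disconnect(self) -> None:
        self.connected = False

    def reconnect(self) -> int:
        """Replay queued events in arrival order; return how many were sent."""
        self.connected = True
        replayed = 0
        while self.pending:
            self.delivered.append(self.pending.popleft())
            replayed += 1
        return replayed
```

Replaying in arrival order keeps the signaling sequence seen by the consumer consistent with what the service edge emitted during the handoff.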
[Figure: time diagram of the exchange among the receiving Hybrid, the B-ESs, and the sending Hybrid. Message labels: data, Null Event, request, identify, ACK-200OK, RING-180OK, accept, complete; the SIP operation phase is marked.]
Fig. 5. Time diagram for the re-connection of supplier
5 Performance Evaluation

Our implementation is evaluated by simulation. To this end, we developed a tool based on the OMNeT++ framework. It provides an appropriate environment to obtain several metrics of interest, in particular the connection delay and the power consumption. The evaluation is only performed for the consumer mobility, because the supplier mobility procedure does not offer a reliable estimation of the connection delay: in the supplier case, delays depend on the variable time needed to detect the location change. We simulate a scenario where a centralized B-ES node interconnects several Hybrid components. This scenario consists of a fixed B-ES node, which the Hybrids must find inside a wired local area network (i.e. 100 Mbps Ethernet), and radio interfaces that implement an IEEE 802.11b (11 Mbps) network. Moreover, Hybrids are moving hosts with a constant speed v. The evaluation testbed is completed with an implementation of the SIPHOC protocol [13] connected to the wired network (one hop). All values are normalized to the native SIP procedure. An important constraint in wireless communication is that not all the A/V packets arrive in time, and late packets are discarded. Therefore, the condition to disconnect and to initiate a re-connection process is set to 20 consecutive lost packets (frames in the video case), which typically corresponds to 1 s. The simulation reproduces two different cases: first, when the Hybrids change their location from the fixed network to the wireless one; and second, when they move from one wireless network to another, changing the IP address of the mobile host. Additional parameters are the average arrival rate of events λ = 0.5 and an average rate of service lifetime of 0.0003 (mean service duration up to 15 minutes). Services are typical MPEG-4 video at a medium bitrate of 150 Kbps.
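The disconnection condition used in the simulation, 20 consecutive lost packets, amounts to a simple counter that is reset by every packet arriving in time. The sketch below is an illustration with a hypothetical class name; only the threshold of 20 comes from the text.

```python
class LossDetector:
    """Counts consecutive lost A/V packets and flags a disconnect."""

    def __init__(self, max_consecutive_losses: int = 20):
        self.max_consecutive_losses = max_consecutive_losses
        self.consecutive_losses = 0

    def observe(self, arrived_in_time: bool) -> bool:
        """Record one packet; return True when re-connection should start."""
        if arrived_in_time:
            # A timely packet breaks the run of losses.
            self.consecutive_losses = 0
        else:
            self.consecutive_losses += 1
        return self.consecutive_losses >= self.max_consecutive_losses
```

Nineteen losses followed by one timely packet do not trigger the re-connection; the run has to reach 20 without interruption.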
Moreover, to measure the energy consumption, we select the power parameters of the SDIO RW31 card for PDAs: sleep mode 50 µA, reception 80 mA, transmission 138 mA, and power supply 3.6 V. We vary the following parameters: the number of Hybrids in mobile hosts connected to the arriving access point (bandwidth) and the packet (event) generation rate λ. The mobile speed is fixed to 5 meters per second, with a linear course for hosts. For the first scenario, we obtain the average values for the connection delay presented in fig. 6 left. From these values, some considerations arise:

• Both implementations offer better performance than the SIP protocol, as expected.
• The re-connection time in the B-ES case is slightly higher than with the SIPHOC protocol. The packet sizes (request) in the B-ES architecture are bigger, which lengthens the connection time, but the results are smaller than those of other multimedia middleware solutions (e.g. [15]).
• Energy consumption (Fig. 7 left) reveals that this implementation minimizes the energy spent in the handoff process. Notice that when the number of events (signaling or handoff) is higher, the B-ES implementation performs best in comparison with the other options, due to the smallest number of packets delivered.

[Figure: two panels plotting the connection delay against the number of mobile Hybrids (5, 10, 15, 25), with series rate=0,5_BES, rate=0,1_BES, rate=0,5_SIPHOC, and rate=0,1_SIPHOC.]
Fig. 6. Average connection delay referenced to native SIP (95% confidence interval, Batch Means Method) in B-ES systems: left, wired-wireless handoff; right, wireless-wireless handoff
[Figure: two panels plotting the energy consumption against the number of mobile Hybrids (5, 10, 15, 25), with series rate=0,5_BES, rate=0,1_BES, rate=0,5_SIPHOC, and rate=0,1_SIPHOC.]
Fig. 7. Average energy consumption referenced to native SIP (95% confidence interval, 30 min. simulation): left, wired-wireless handoff; right, wireless-wireless handoff
For the second case (from one wireless network to another), the mean values obtained for the connection delay and the energy consumption, shown on the right of figs. 6 and 7, are higher than in the previous case. The average value again increases with the number of Hybrids and with the event generation rate, but not in the same proportion (due to the bandwidth limitation of both wireless networks). The values associated with a higher number of mobile hosts can be explained by bandwidth saturation when managing too many simultaneous streaming flows and too much Hybrid traffic.
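As a back-of-the-envelope illustration of why fewer signaling messages reduce handoff energy, the card parameters quoted in this section (3.6 V supply, 138 mA transmission, 80 mA reception) can be turned into power and per-message energy figures. The sketch assumes the nominal 11 Mbps IEEE 802.11b rate as the airtime basis and uses hypothetical message sizes; it ignores protocol overhead and is not the energy model of the simulator.

```python
SUPPLY_VOLTAGE_V = 3.6
TX_CURRENT_A = 0.138
RX_CURRENT_A = 0.080
LINK_RATE_BPS = 11e6  # nominal IEEE 802.11b rate; real throughput is lower


def power_w(current_a: float) -> float:
    """Electrical power drawn at the given current (P = V * I)."""
    return SUPPLY_VOLTAGE_V * current_a


def airtime_s(message_bytes: int) -> float:
    """Time on air for one message at the nominal link rate."""
    return message_bytes * 8 / LINK_RATE_BPS


def handoff_energy_j(sent_bytes, received_bytes) -> float:
    """Energy spent sending and receiving the given signaling messages."""
    tx = sum(power_w(TX_CURRENT_A) * airtime_s(n) for n in sent_bytes)
    rx = sum(power_w(RX_CURRENT_A) * airtime_s(n) for n in received_bytes)
    return tx + rx
```

Under these assumptions transmission draws about 0.497 W and reception 0.288 W, so a handoff that exchanges half as many messages of the same size spends half the radio energy.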
6 Conclusions and Future Work

In this paper, a middleware architecture providing mobility for multimedia applications is presented. It is based on the SIP protocol and the publish/subscribe paradigm, propagating events. The most relevant contribution is the integration of the SIP protocol with middleware computing, extended by means of event propagation and optimized for energy consumption. This implementation supports the SIP signaling system and manages the mobility of multimedia nodes in a wireless network. It overcomes limitations in the behavior of these services, offering reduced overhead, flexibility, and separation between flow data and signaling. The energy consumption decreases thanks to the connection-oriented operation of the middleware layer and the minimization of the number of connections. The B-ES architecture decouples the communication edges and facilitates the connection among them. Although remote ends are designed in a similar way, mobility tasks are adapted according to the status of the stream flows (sent or received), thus optimizing the SIP implementation. As future work, we propose to extend the study to large-scale networks and mobility prediction, and to extend the event propagation system to support other data signaling, such as that required in collaborative environments.

Acknowledgment. This work has been funded by the Spanish Ministerio de Educacion y Ciencia with the project grant TEC2007-67966-01/TCM and it is also developed in the framework of "Programa de Ayudas a Grupos de Excelencia de la Región de Murcia (Plan Regional de Ciencia y Tecnología 2007/2010)”.
References
1. Chambers, D., et al.: Stream Enhancements for the CORBA Event Service. In: Proceedings of the 9th ACM Multimedia Conference, Ottawa, Canada, pp. 61–69 (2001)
2. Carzaniga, A., et al.: Achieving scalability and expressiveness in an Internet-scale event notification service. In: Proc. of ACM (PODC), Portland, OR, pp. 219–227 (2000)
3. Sivaharan, T., et al.: GREEN: A Configurable and Re-configurable Publish-Subscribe Middleware for Pervasive Computing. In: Meersman, R., Tari, Z. (eds.) OTM 2005. LNCS, vol. 3760, pp. 732–749. Springer, Heidelberg (2005)
4. Mandato, D., et al.: CAMP: A Context-Aware Mobility Portal. IEEE Communications Magazine 40(1), 91–97 (2002)
5. Mungee, S., et al.: The design and performance of a CORBA Audio/Video Streaming Service. In: IEEE Proc. of the Hawaiian Int. Conf. on System Science, pp. 8043–8059 (2001)
6. IETF: Session Initiation Protocol. RFC 3261 (2002)
7. Kalmanek, C., et al.: A Network-Based Architecture for Seamless Mobility Services. IEEE Communication Magazine 44(6), 103–106 (2006)
8. Huang, C.M., et al.: A Novel SIP-Based Route Optimization for Network Mobility. IEEE Journal of Selected Areas in Communication 24(9), 1682–1691 (2006)
9. Zhang, J., et al.: A SIP based Seamless-handoff (S-SIP) Scheme for Heterogeneous Mobile Network. In: IEEE WCNC Proceeding, Hong Kong, pp. 3949–3954 (2007)
10. Chahbour, F., et al.: Fast Handoff for Hierarchical Mobile SIP Networks. In: Proc. World Academy of Science, Engineering and Technology, Czech R, vol. 5, pp. 34–39 (2005)
11. Gehlen, et al. (eds.): A Mobile Context Dissemination Middleware. In: ITNG, pp. 165–170 (2007)
12. Su, X., et al.: Middleware for Multimedia Mobile Collaborative System. In: Proc. of 3rd IEEE Wireless Telecommunication Symposium, USA, pp. 112–117 (2004)
13. Suedi, P., et al.: SIPHOC: Efficient SIP Middleware for Ad Hoc Networks. In: Cerqueira, R., Campbell, R.H. (eds.) Middleware 2007. LNCS, vol. 4834, pp. 60–79. Springer, Heidelberg (2007)
14. Garcia-Sanchez, F., et al.: A CORBA Bidirectional-Event Service for Video and Multimedia Applications. In: Meersman, R., Tari, Z. (eds.) OTM 2005. LNCS, vol. 3760, pp. 715–731. Springer, Heidelberg (2005)
15. Caporuscio, M., et al.: Design and Evaluation of a Support Service Publish/Subscribe Applications. IEEE Trans. on Software Engineering 29(12), 1059–1071 (2003)
A Secure Mechanism for Address Block Allocation and Distribution
Damien Leroy and Olivier Bonaventure
Universite catholique de Louvain (UCL) - Louvain-la-Neuve, Belgium
{Damien.Leroy,Olivier.Bonaventure}@uclouvain.be
http://inl.info.ucl.ac.be
Abstract. All equipment attached to the Internet is configured with one or several IP addresses. Most hosts are able to automatically request (e.g., from a DHCP server) or discover (e.g., by using stateless autoconfiguration) the IP address that they should use. This simplifies the configuration and the management of hosts. Unfortunately, these techniques do not apply to routers, whose IP addresses and subnet prefixes for their directly attached LANs still need to be manually configured. This manual configuration is a frequent source of errors. It is also one of the reasons why IP address renumbering is so difficult with both IPv4 and IPv6. In this paper, we propose a new address block allocation and distribution protocol that has been designed to be both secure and efficient. We first summarize the main requirements of an address block allocation mechanism. We then describe the operation of our proposed mechanism. Finally, we demonstrate the efficiency of our protocol by simulations.
1 Introduction
The growth of the BGP routing tables in the default-free zone is again a concern for many network operators [1]. A key reason for this is the way IP addresses are allocated and used. The pool of available IP addresses is managed by the regional registries (RIPE, APNIC, ...). These registries define two types of addresses: Provider Aggregatable (PA) and Provider Independent (PI). To limit the size of the BGP routing tables, only large ISPs should obtain PI addresses, while customer networks should receive PA addresses from the PI block of their upstream provider. Unfortunately, if a corporate network uses a PA address block, it must renumber its entire network when it changes its upstream provider. For this reason, most corporate networks insist on obtaining PI addresses, even with IPv6 [2]. Combined with the growth of multihoming [3], this explains the growth of the BGP routing tables. This growth could be avoided if corporate networks and smaller ISP networks were able to use PA addresses more easily. Unfortunately, with the current Internet architecture, using PA addresses implies that each corporate network must be renumbered each time it changes provider, and several studies have shown this to be painful with both IPv4 and IPv6 [4].
Supported by a grant from FRIA (Fonds pour la formation à la Recherche dans l’Industrie et dans l’Agriculture, rue d’Egmont 5 - 1000 Bruxelles, Belgium).
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 748–755, 2008. © IFIP International Federation for Information Processing 2008
In the early days, IP addresses were allocated manually to both routers and endsystems. However, this manual allocation was a cause of errors and problems. As a consequence, most endsystems now obtain their IP address automatically, either via DHCP or via auto-configuration. Despite the widespread use of automatic configuration of endsystems, the addresses used by the routers are still manually configured (except in small networks, by using DHCP extensions), and several studies have shown that configuration errors are responsible for a large number of operational problems [5].
In this paper, we propose and evaluate a new distributed mechanism that is able to automatically allocate IPv6 prefixes to routers and LANs in an ISP, corporate or campus network. Our address distribution mechanism is targeted at edge networks as well as ISP networks that need to provide address blocks to their subnets¹. The typical environment in which it should be run is shown in Fig. 1. In this figure, arrows represent the direction of address block allocation. Our mechanism is composed of three main parts. One or several prefixes are obtained from upstream ISPs by border routers (A, B in Fig. 1). Subnets needing an address block ask for it at border routers (D, E, F in Fig. 1). The routers negotiate which parts of the obtained prefixes have to be allocated to subnets. The term “prefix” is used to indicate a prefix allocated by one of our providers (i.e., ISP α or ISP β in Fig. 1), while “address block” is used to indicate a group of addresses allocated by our protocol to subnets.
Fig. 1. The protocol does apply on hierarchical topologies
This paper is organized as follows. We first discuss in Sect. 2 the requirements for such an allocation and discuss a few existing proposals. Sect. 3 provides a detailed presentation of our solution by considering the security aspects, the protocol and the algorithm used to allocate address blocks. Finally, Sect. 4 contains a simulation-based evaluation of our solution. More details about our mechanism can be found in [6].
2 Requirements
In this section, we summarize the main requirements that must be fulfilled by an address allocation mechanism and review the state of the art of such mechanisms.
¹ In this paper, the term subnet is used to define a subnetwork (a LAN or a customer network) that needs to obtain an address block (e.g., customer 1 or LAN in Fig. 1).
Utilization of address space. A first goal is to support a Host-Density ratio (denoted HD) equal to 0.8. The HD is a way to measure address space usage [7]. It is expressed as the ratio of log(number of allocated address blocks) to log(maximum number of allocatable address blocks). Several regional registries defined the value 0.8 as an acceptable IPv6 address utilization for justifying the allocation of additional address space [8].
Compatibility with existing protocols. Since the role of the mechanism is limited to address block allocation, it must be independent of the routing protocol.
Security. The Internet started as an open network mainly used by researchers. Nowadays, IP networks are used to support mission-critical business services, and security matters. We discuss the security threats in more detail in Sect. 3.1.
Roles. In an ISP, enterprise or campus network, network operators usually group the hosts that have similar roles in contiguous address blocks. For example, all loopback addresses of routers usually belong to the same prefix, all servers are grouped in a few prefixes, and the hosts used by students in a campus network are grouped in contiguous prefixes as well. ISPs also group their customers in prefixes depending on their type (e.g., home users, business customers, ...). This makes it possible to define and deploy policies on routers (e.g., QoS, traffic engineering, ...) that are more scalable and easier to maintain. Common examples are the packet filters deployed on most routers and firewalls to filter unwanted packets. Chown et al. have shown that updating packet filters is a difficult problem when renumbering a network [9].
Prefix coloring. Another requirement is that the prefixes received from different providers are not necessarily equal. Due to routing policies, it can be necessary to classify the prefixes received from providers into different categories.
For instance, if a National Research Network has been allocated a prefix from a backbone research network, it should only allocate address blocks from it to those of its customers that are research labs. K12 schools should not be allocated an address block from such a research prefix.
2.1 Existing Address Allocation Mechanisms
Several research groups have studied the problem of automatically configuring the addresses used by routers. First, studies within ad-hoc networks, notably within the autoconf working group of the IETF [10], tackle a very different type of network. The topologies they consider change frequently, while we are looking for a stable configuration. The NAP Protocol (No Administration Protocol) by Chelius et al. [11] is targeted at small networks and unfortunately does not consider the problems faced by campus and enterprise networks, such as the need for roles and the security issues. Other solutions proposed at the IETF, such as DHCP prefix delegation [12] or router renumbering [13], do not meet all requirements. They require manual configuration and cannot easily be adapted to provide even minimal security features. A detailed analysis of these solutions is available in [6].
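As a numeric illustration of the HD ratio used in the requirements of Sect. 2, the following sketch computes the ratio and inverts it. It assumes a 16-bit SID (a /48 provider prefix, the case used later in the evaluation) and that every allocated block has the same size; the function names are illustrative and are not part of the proposed protocol.

```python
import math

def hd_ratio(allocated_blocks: int, max_blocks: int) -> float:
    """HD = log(allocated blocks) / log(maximum allocatable blocks), cf. RFC 3194."""
    if allocated_blocks < 1 or max_blocks < 2:
        raise ValueError("need at least 1 allocated block and 2 allocatable blocks")
    return math.log(allocated_blocks) / math.log(max_blocks)

def blocks_for_hd(target_hd: float, max_blocks: int) -> int:
    """Smallest number of allocated blocks whose HD ratio reaches target_hd."""
    return math.ceil(max_blocks ** target_hd)

# Assumption for illustration: 16 SID bits, all blocks of equal (/64) size.
SID_BITS = 16
MAX_BLOCKS = 2 ** SID_BITS

needed = blocks_for_hd(0.8, MAX_BLOCKS)
print(f"HD = 0.8 is reached with {needed} of {MAX_BLOCKS} blocks")
print(f"linear utilization at that point: {needed / MAX_BLOCKS:.1%}")
```

Because the ratio is logarithmic, an HD of 0.8 corresponds to a much lower linear utilization of the block space, which is why it is considered an acceptable threshold rather than a sign of waste.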
3 Description of Our Solution
This section describes our proposed mechanism in more detail. We first present the security risks and solutions. Next, we describe the address block distribution protocol. The section ends with an explanation of the address block allocation algorithm.
The distributed IPv6 addresses are composed of three parts, as illustrated in Fig. 2, which also introduces the notation used in this paper. The first part is the prefix given by the upstream ISP. These leftmost bits are common to all hosts and routers in our network. The last 64 bits of the address are used for the Interface ID (IID). The bits between these two parts uniquely identify a subnet and are called the Subnet ID (SID). We consider two parts in this SID: the allocated SID (ASID) and the delegated SID (DSID). The ASID is the part that our mechanism allocates and provides as a prefix to subnets. The DSID is the part of the SID that the subnet is free to allocate inside its network. In other words, a subnet obtains a /(prefix_s + ASID_s) prefix².
Fig. 2. Different parts of an IPv6 address
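The size relations implied by this address layout can be written out explicitly. The block below only restates the definitions given in the text, using the subscript-s convention for sizes; the values in the trailing comment (a /48 prefix and a 12-bit ASID) are illustrative and are not taken from the experiments.

```latex
\begin{align}
\mathit{prefix}_s + \mathit{SID}_s + \mathit{IID}_s &= 128, \qquad \mathit{IID}_s = 64,\\
\mathit{SID}_s &= \mathit{ASID}_s + \mathit{DSID}_s,\\
\text{subnet prefix length} &= \mathit{prefix}_s + \mathit{ASID}_s .
\end{align}
% Illustrative values: prefix_s = 48 gives SID_s = 16; with ASID_s = 12 the
% subnet receives a /60 and keeps DSID_s = 4 bits to allocate internally.
```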
The protocol for prefix delegation used with the providers of our network and with our customers is outside the scope of this paper. It could be designed as an extension to BGP, using DHCP with the prefix delegation option [12], or as a new protocol using the X.509 certificates being defined by the IETF SIDR working group. The only requirement we impose is security, as discussed below.
3.1 Security Threat Model and Solutions
An address block allocation mechanism is a key component of an IP network and could be the target of attacks. Indeed, in the network configuration process, this protocol runs prior to the other IP protocols. Therefore, if this protocol were compromised, the entire network or a large part of it could be compromised.
The first risk comes from a malicious router that could be inserted into the network. In addition to the security threats on all the routing and IP protocols, this router could send several kinds of messages in order to confuse the address block distribution protocol. It should be possible for a router to detect whether its neighbor is a legitimate router. For instance, if router D in Fig. 1 were unauthorized, routers A and E should not trust it and thus should not forward its messages.
The second risk is that an attacker obtains similar rights without inserting a router, by intercepting the packets exchanged between two routers and performing a man-in-the-middle attack.
The last two risks are based on abusing a link between border routers and their providers or customers (e.g., between ISP α and router A or between router D and customer 1 in Fig. 1). Invalid prefix announcements can cause global unavailability or traffic hijacking. Abuse can also occur with address block requests if they are automated and not secured.
² A subscript s denotes “size” in this paper. For instance, SID_s denotes the size of the SID.
Solutions. The first idea is that IPv6 prefix delegations and certificate delegations go together. In other words, each network that possesses a prefix has obtained, from its ISP, a certificate proving that it is allowed to use and delegate it. In our mechanism, these are mainly used to authenticate our ISP’s routers. In practice, we rely on the X.509 extensions for IP addresses [14]. To allow a provider to authenticate its customers, our network has its own certification authority and each router is configured with a certificate (as suggested by Greenberg et al. [15]) to confirm that it does belong to the network. An X.509 certificate signed by our local certification authority (CA) can be used by its owner to provide evidence of authorization to nodes of our network. A certificate itself identifies its owner, and we associate attribute certificates [16] with it to provide authorization. These attributes are used by our subnets to specify their role, their provided prefix colors and their address block size. These certificates could be given offline to customers when the contract is signed and are used when the customer connects to one of our border routers. Secondly, attribute certificates are also used to authenticate connections between routers: a neighbor is considered a legitimate router if and only if it can be authenticated. Subsequent packets exchanged between routers are secured with IPSec.
3.2 Address Block Distribution Protocol
Our protocol is distributed: each router is responsible for its address blocks and has to choose and advertise them. In practice, each router chooses one or several address blocks and advertises them through the network. A distributed protocol avoids the single point of failure problem. Chelius et al. also showed that it is more efficient for address block distribution, especially if the network is multihomed [11].
Each router is configured with a 64-bit router ID (derived from a MAC address or cryptographically generated), its X.509 and attribute certificates, and information about the subnets attached to it. The loopback address of each router can be derived from the router ID using a fixed SID for all routers. Routers use discover and hello messages for, respectively, discovering their direct neighbors and initializing connections with them. Since this is very similar to other protocols such as LDP, we do not describe these messages in detail. The prefix and address block advertisement messages are flooded to all routers.
Prefix advertisement message. When a border router learns that a new global prefix has been allocated to the network, it immediately floods the information. Each prefix is associated with a preferred and a valid lifetime and so must be regularly renewed as long as it remains valid. A prefix advertisement message contains: the size of the prefix, the prefix itself, the preferred (deprecation) and valid lifetimes, the prefix color, a sequence number and security parameters. The sequence number is incremented each time a new message about a specific prefix is generated. Until the lifetime expires, an entry per prefix is stored in a prefix table in each router.
Address block advertisement message. When a router chooses new address blocks, it floods an address block advertisement message. This message contains a list of address block entries, one for each block it chooses. When such a message is received, the proposed address blocks are checked and new address blocks are stored with the already allocated ones. The way address blocks are chosen and stored is explained in Sect. 3.3. An address block advertisement message contains the router ID of its issuer and a list of address block entries. An entry contains a role, P_s, ASID_s, the ASID, a timestamp, the preferred and valid lifetimes and a sequence number. As for prefixes, the sequence number is incremented each time an address block entry is changed (ASID changed, preferred time redefined, ...). As the mechanism is distributed, there is a non-zero probability that two routers choose conflicting blocks at nearly the same time; we call this a collision. The issue is solved by using logical clocks and randomness for tie-breaking.
3.3 Address Block Allocation Algorithm
This section explains the way subnet address blocks are chosen in the address space. The following rules apply to any prefix for which the subnets should obtain an address block. The only part of the address block that our mechanism has to determine is the ASID, as shown in Fig. 2. For instance, consider that the network has obtained two /48 prefixes, 2001:4bc7:9377::/48 and 2001:3def:73cb::/48, from its ISPs. If the ASID “1ed6” is allocated to a /64 subnet, this subnet will obtain as prefixes 2001:4bc7:9377:1ed6::/64 and 2001:3def:73cb:1ed6::/64.
In our mechanism, each router chooses a large address block that can contain all the subnets attached to it; we call it the “router block”. Inside this block, each router allocates address blocks for its subnets; we call these the “subnet blocks”. In the topology depicted in Fig. 1, this means that router F allocates two address blocks inside its own router block: one for Customer 2 and one for LAN. The router blocks are chosen by routers using heuristics such as role aggregation and optimization of the address space. Due to space limitations, these cannot be explained in this paper.
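The /48 example of Sect. 3.3 can be reproduced with a few lines of code. The sketch below uses Python's standard ipaddress module to place an ASID directly after each provider prefix; the function name and its argument checks are illustrative choices and do not describe the authors' implementation.

```python
import ipaddress

def subnet_prefix(provider_prefix: str, asid: int, asid_bits: int) -> ipaddress.IPv6Network:
    """Append an ASID of asid_bits bits to a provider prefix.

    The result has length prefix_s + ASID_s; the remaining SID bits (the DSID)
    are left to the subnet, and the last 64 bits stay reserved for the IID.
    """
    net = ipaddress.IPv6Network(provider_prefix)
    new_len = net.prefixlen + asid_bits
    if new_len > 64:
        raise ValueError("prefix and ASID must fit in the first 64 bits")
    if not 0 <= asid < (1 << asid_bits):
        raise ValueError("ASID does not fit in asid_bits")
    shift = 128 - new_len  # position of the ASID's lowest bit in the address
    address = int(net.network_address) | (asid << shift)
    return ipaddress.IPv6Network((address, new_len))

# The same ASID is combined with every prefix obtained from the ISPs.
for prefix in ("2001:4bc7:9377::/48", "2001:3def:73cb::/48"):
    print(subnet_prefix(prefix, 0x1ED6, 16))
```

Since the ASID is independent of the provider prefix, adding or withdrawing a provider prefix changes only the leftmost bits of every subnet prefix, which is what keeps renumbering local.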
The addresses of the router blocks are flooded through the network and are therefore known by all the routers. On the other hand, the subnet block allocations are only known by the router in charge of them. More details can be found in [6].
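Returning to the prefix advertisement messages of Sect. 3.2, the following sketch shows one way a router could maintain its prefix table. The paper only states that a sequence number is incremented for each new message and that an entry is kept until its lifetime expires; the rule that a higher sequence number replaces an older entry, the use of seconds for lifetimes, and all class and field names are assumptions made for illustration. Security parameters are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class PrefixAdvertisement:
    prefix: str                # prefix and its size, e.g. "2001:4bc7:9377::/48"
    preferred_lifetime: float  # seconds (unit is an assumption)
    valid_lifetime: float      # seconds (unit is an assumption)
    color: str
    sequence_number: int

class PrefixTable:
    """Per-router table holding one entry per prefix until it expires."""

    def __init__(self) -> None:
        # prefix -> (latest advertisement, absolute expiry time)
        self._entries: Dict[str, Tuple[PrefixAdvertisement, float]] = {}

    def expire(self, now: float) -> None:
        """Drop entries whose valid lifetime has elapsed."""
        self._entries = {p: e for p, e in self._entries.items() if e[1] > now}

    def receive(self, adv: PrefixAdvertisement, now: float) -> bool:
        """Store adv if it is newer than what is known.

        Returns True when the message carried new information (and would
        therefore be flooded further), False for duplicates or stale copies.
        """
        self.expire(now)
        known = self._entries.get(adv.prefix)
        if known is not None and known[0].sequence_number >= adv.sequence_number:
            return False
        self._entries[adv.prefix] = (adv, now + adv.valid_lifetime)
        return True

    def prefixes(self, now: float) -> List[str]:
        """Prefixes that are still valid at time now."""
        self.expire(now)
        return sorted(self._entries)
```

Ignoring messages whose sequence number is not newer is a common way to stop a flood from circulating indefinitely; whether the actual protocol does exactly this is not specified in the text.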
4 Evaluation
In order to validate our assumptions, we have written a simulator [6] that allows us to evaluate the performance of our mechanism. Some results of our simulations are shown in this section. Simulations are run on a 110-router topology derived from a real ISP network. Subnets with random DSIDs are uniformly associated with routers to obtain the HD-Ratio we want to observe. Since we want to evaluate the protocol on a large number of subnets, the requested blocks are quite small: DSID sizes are between 0 and 2. We consider that we obtain a /48 from our ISPs, so the SID size equals 16. The test consists in starting from this configuration and applying a modification to the subnet topology every minute. Modifications include adding new subnets, removing subnets, and reducing or enlarging the address block they need. These modifications are performed uniformly unless we obtain a larger HD-Ratio than the one we want to maintain. We consider 10,000 changes for each experiment.
The set of globally reserved address blocks is one piece of information that is stored on all routers. Therefore, the number of such allocations should be as low as possible and can be viewed as a way to measure the performance of the allocation algorithm. If there is at least one subnet managed by each router, the minimum value is the number of routers. In our test bed, this value should be as close as possible to 110. Figure 3 shows the evolution of this value as the HD-Ratio changes. The upper part shows the number of global address blocks for each experiment. The lower part represents the mean number of subnets per router. The first vertical line shows the 0.8 value of the HD-Ratio, i.e., the ratio we want to reach without any problem. The second vertical line shows the HD value (0.95) from which our mechanism is not able to place all the subnets. The dotted horizontal line represents the number of routers (110), i.e., the optimum value we could obtain. Note that for the results obtained in these experiments, we did not have to move any already allocated address block.
Fig. 3. Evolution of the number of global address blocks reserved when HD is increased
At the left of Fig. 3, some points lie below the “110 line”. They correspond to experiments in which some routers have no subnet to place. Next, we see an increase followed by a drop. In these experiments, there is a small number of customers allocated in each router address block. Therefore, when modifications are performed (e.g., addition of a new subnet), the router address block can become full and a new address block is requested. When the number of customers increases, these modifications no longer fill the router address blocks entirely. That is why the number of address blocks decreases and stays equal or nearly equal to 110. From an HD-Ratio of 0.95 onwards, some routers do not succeed in allocating address blocks for all their subnets.
5 Conclusion
In this paper, we have proposed a distributed mechanism to allocate and distribute address blocks in ISP, campus and enterprise networks. We have first listed the requirements for such a protocol and discussed the security threats and how to handle them.
Role-based allocation makes it possible to aggregate subnets so that rules, such as firewall rules, become simpler and shorter. Once subnets have obtained address blocks, renumbering and topology changes can be performed without introducing any significant perturbation in the rest of the network. Our simulations show that our protocol sustains an HD-Ratio of 0.8 without any problem. However, many parameters of the mechanism remain open for discussion, above all those concerning allocation, even though some of them are covered in our technical report. We are currently implementing the proposed protocol in the XORP platform.
References
1. Meyer, D., Zhang, L., Fall, K.: Report from the IAB Workshop on Routing and Addressing. RFC 4984, Internet Engineering Task Force (2007)
2. Palet, J.: Provider independent (PI) IPv6 assignments for end user organisations. RIPE Policy Proposal 2006-01, http://www.ripe.net/ripe/policies/proposals/2006-01.html
3. Huston, G.: Analyzing the Internet’s BGP routing table. The Internet Protocol Journal 4 (2001)
4. Baker, F., Lear, E., Droms, R.: Procedures for renumbering an IPv6 network without a flag day. RFC 4192, Internet Engineering Task Force (2005)
5. Mahajan, R., Wetherall, D., Anderson, T.: Understanding BGP misconfiguration. In: SIGCOMM 2002: Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications, pp. 3–16. ACM, New York (2002)
6. Leroy, D., Bonaventure, O.: A secure mechanism for address block allocation and distribution. Technical report, Universite catholique de Louvain (UCL), Belgium (work in progress, 2008), http://inl.info.ucl.ac.be/addr-alloc-distrib
7. Durand, A., Huitema, C.: The Host-Density Ratio for Address Assignment Efficiency: An update on the H ratio. RFC 3194, Internet Engineering Task Force (2001)
8. APNIC, ARIN, RIPE NCC: IPv6 address allocation and assignment policy. ripe-412 (2007), http://www.ripe.net/ripe/docs/ipv6policy.html
9. Chown, T., Ford, A., Venaas, S.: Things to think about when renumbering an IPv6 network. Internet Draft “draft-chown-v6ops-renumber-thinkabout-05”, Internet Engineering Task Force (work in progress, 2006)
10. Baccelli, E., Mase, K., Ruffino, S., Singh, S.: Address autoconfiguration for MANET: Terminology and problem statement. Internet Draft “draft-ietf-autoconf-statement-02”, Internet Engineering Task Force (work in progress, 2007)
11. Chelius, G., Fleury, E., Toutain, L.: No Administration Protocol (NAP) for IPv6 router autoconfiguration. Int. J. Internet Protocol Technology 1 (2005)
12. Troan, O., Droms, R.: IPv6 Prefix Options for Dynamic Host Configuration Protocol (DHCP) version 6. RFC 3633, Internet Engineering Task Force (2003)
13. Crawford, M.: Router renumbering for IPv6. RFC 2894, Internet Engineering Task Force (2000)
14. Lynn, C., Kent, S., Seo, K.: X.509 Extensions for IP Addresses and AS Identifiers. RFC 3779, Internet Engineering Task Force (2004)
15. Greenberg, A., Hjalmtysson, G., Maltz, D.A., Myers, A., Rexford, J., Xie, G., Yan, H., Zhan, J., Zhang, H.: Refactoring Network Control and Management: A Case for the 4D Architecture. Technical Report CMU-CS-05-117, CMU CS (2005)
16. Farrell, S., Housley, R.: An Internet Attribute Certificate Profile for Authorization. RFC 3281, Internet Engineering Task Force (2002)
Asset Localization in Data Centers Using WUSB Radios
N. Udar (1), K. Kant (2), and R. Viswanathan (1)
(1) Southern Illinois University, Carbondale, IL
(2) Intel Corporation, Hillsboro, OR
Abstract. Asset tracking is a critical problem in modern data centers with modular and easily movable servers. This paper studies a wireless asset tracking solution built using wireless USB (WUSB) radios that are expected to be ubiquitous in future servers. This paper builds on our previous work on direct ranging using WUSB radios and studies algorithms for data center wide localization of all rack mounted servers. The results show that it is possible to achieve very low error rates in localization in spite of very stringent constraints (i.e., each server being just 1.8” high) by exploiting the properties of the data center environment.
Keywords: Ultra Wideband (UWB), Wireless USB, data centers, channel model, localization, Maximum Likelihood Estimation.
1 Introduction
Data centers form the backbone of modern commerce and continue to grow both in terms of number of servers and number of servers per unit volume of data center space. Modern data centers sport rows upon rows of “racks”, each with a certain number of standard size “slots” where various “assets” (servers, routers, switches, storage bricks, etc.) can be inserted. Fig. 1 shows a typical row of a data center. Tracking these assets has repeatedly been cited as among the top 5 issues facing IT administrators in large data centers and other IT environments. As assets become easier to move (by virtue of smaller sizes, modular structure, and hot plug-in/plug-out capabilities), they indeed tend to change locations more frequently. There are several reasons for assets to move around: (a) replacement of old/problematic equipment and/or addition of new equipment and the resulting reorganization, (b) manual reorganization of assets to handle evolving needs and applications, (c) reorganization driven by power and thermal issues, which keep becoming more and more severe, (d) removal followed by reinsertion for miscellaneous reasons including SW patching, etc. On the whole, the assets don’t necessarily move very much, but in a large data center, even a limited amount of movement could become very cumbersome to manage manually. In particular, trivial solutions such as those requiring personnel to log each moved asset in a spreadsheet/database have not worked well in the past. Several vendors including Sun and HP have devised new solutions, which points to the importance of the problem.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 756–767, 2008. © IFIP International Federation for Information Processing 2008
Fig. 1. Racks in a data center
Fig. 2. Schematic Rack Layout
From the survey paper [1] on indoor positioning techniques, it can be concluded that techniques based on wireless local area networks (WLAN) are perhaps reasonable for locating racks in the data center, but too crude for locating servers. Techniques based on surface acoustic waves (SAW) can provide good accuracy but would require a network of SAW sensors to be deployed and managed. Ultra wideband (UWB) based solutions can provide good accuracy, but any external infrastructure would add to cost and management issues. In this paper, we exploit a wireless universal serial bus (WUSB) based solution that uses UWB radios integrated into the server and does not rely on any external infrastructure. HP has developed a solution based on an array of passive RFID tags attached to each server [2]. The solution requires one radio frequency identification (RFID) reader per server, which communicates with the rack level data collector. Each RFID reader has a directional antenna mounted on a motorized rack and each rack has a sensor controller aware of its position. Although an accuracy exceeding 98% is claimed for this system, the complexity and cost of the system are expected to be prohibitive.
In this paper we explore an automated asset tracking solution by exploiting WUSB, which is expected to replace wired USB in the near future. The basic assumption is that WUSB radios will be integrated with various assets and could form a sort of mesh fabric which can be exploited for a variety of low-bandwidth applications including asset tracking. There are many reasons why such an approach is attractive:
1. WUSB uses UWB radio as its physical layer, which is known to have excellent localization properties.
2. Given its role as a wired USB replacement, WUSB radio cost should come down rapidly, which makes the solution inexpensive.
3. Since WUSB radios are assumed to be integrated in each server, such a solution does not require any external infrastructure.
Additional infrastructure is usually highly undesirable from the perspective of IT personnel.
In this paper we focus on “plugged-in” assets that have our integrated solution implemented. The assets don’t necessarily have to be “booted up” or even powered, just plugged in. To allow for this, we propose to implement asset localization in a server’s baseboard management controller (BMC), which is supposed to be always operational irrespective of the state of the server. The most challenging aspect of the data center localization problem is the need for very high accuracy localization. The “one meter” accuracies that localization methods can typically achieve with significant effort are about 2 orders of magnitude worse than what we need in this application. Furthermore, we would like to do this without any additional infrastructure.
The outline of the paper is as follows. In section 2, we present the configuration of modern data centers and discuss how a wireless USB based scheme can walk through the entire data center and localize individual plugged-in assets. In section 3 we propose the method of maximum likelihood identification (MLI) for localization of servers and show that the performance of the proposed method far exceeds that of traditional hyperbolic positioning (HBP). The novelty of the MLI solution is the exploitation of geometric properties of the data center environment to obtain high accuracy localization. The performance of MLI is analyzed in section 4. The simulation results show that it is possible to achieve good accuracy in a data center environment using MLI. Section 5 concludes the discussion.
2
Configuration and Assumptions
The geometric aspects of server arrangement in a data center are crucial to our solution; hence they are described briefly here. The racks, as shown in Fig. 1, are 78” high, 23-25” wide and 26-30” deep and are arranged in rows. The rack rows are arranged in pairs so that the servers in successive odd-even row pairs face one another. This creates alternating hot and cold aisles (backs and fronts of servers) and helps make cooling more efficient. The widths of the hot and cold aisles are generally different, and this needs to be accounted for. Fig. 2 shows a simplified top view of this arrangement, where the wider aisles are the cold aisles (fronts of servers). Fig. 2 also establishes the coordinate system that we will use in our analysis. The x-axis denotes the row index and the y-axis denotes the rack position in the row. The z-axis (not shown in the figure) is along the height of the rack. For example, (0, 1, 1) denotes that the server is in row 0, rack 1 and at position 1 in the rack. The positions in the rack are labeled 1, 2, ..., N starting from the top of the rack. In this paper, we focus on the popular “rack mount” servers that go directly into the racks as shown in Fig. 1. The server thickness decides how many servers can fit in a rack. The standard measure of thickness is “U” (about 1.8”). Consequently, a single rack can take up to 42 1U “assets”. The increasingly popular “blade servers”, which go vertically into chassis that in turn fit into racks, will be addressed in future work. It is assumed that each plugged-in asset has an integrated WUSB radio accessible from the management controller in the pre-boot or post-boot environments. We assume that each rack has at least 2 plugged-in servers and that the racks are arranged in a rectangular pattern; localization with geometries other than a rectangle is beyond the scope of the current paper.
Asset Localization in Data Centers Using WUSB Radios
We use a domain dependent medium access control (MAC) protocol for localization based on WiMedia [3], where a MAC domain is referred to as a Piconet and the master of the Piconet is called the Piconet Controller (PNC). The PNC guarantees access to the medium on a TDMA basis; up to 255 slaves are allowed in the protocol, and slaves are allowed to interact with each other in the contention access period. We assume that there are at least 3 servers with known locations. The WUSB radios of these servers are assumed to be manually configured and these servers are prohibited from being moved around. (Even if the BMC of one of these servers fails, it is possible to “elect” a suitable substitute from other already localized servers; however, this is a matter of detail that we omit here.) In particular, two known servers in rack(0,0) and one in rack(0,1) are considered. The known server in rack(0,1) is hard coded to be the PNC. All other radios, when turned on, are in the listening mode. We assume that every WUSB radio can operate in a range of power between a minimum and a maximum value. The PNC starts with minimum power and gradually increases it (up to the maximum power) so that it can reach a few servers around it in the same or an adjacent rack. For a distance estimate between two nodes, time of arrival (ToA), time difference of arrival or received signal strength may be used [4]. The fundamental mechanism that we use for distance estimation is the ToA of the strongest signal. Here, we briefly summarize our previous results on achievable ranging accuracy [5]. The results show that it is possible to achieve accuracies of 0.05-0.65 meter with direct ranging [5]. Such accuracies are far from adequate for our purposes; therefore, we need additional mechanisms to reduce the error substantially. These mechanisms include bias estimation, multiple measurements, exploitation of geometry and effective estimation procedures.
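To give a feel for why direct ToA ranging alone cannot separate adjacent server slots, the following minimal sketch converts a time of arrival into a distance (d = c·τ) and computes the flight time that corresponds to the 1.8” pitch of a 1U slot. The function names and the use of the free-space speed of light are assumptions made for illustration; they are not taken from the measurement setup of [5].

```python
# Illustrative sketch only: names and the free-space propagation speed are assumptions.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0
INCH_TO_M = 0.0254


def distance_from_toa(toa_seconds: float) -> float:
    """Convert a one-way time of arrival into a distance in meters."""
    return SPEED_OF_LIGHT_M_PER_S * toa_seconds


def flight_time_for(distance_m: float) -> float:
    """Time of flight (in seconds) that corresponds to a given distance."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S


slot_pitch_m = 1.8 * INCH_TO_M          # spacing of adjacent 1U positions
needed_s = flight_time_for(slot_pitch_m)
print(f"1U pitch of {slot_pitch_m:.4f} m = {needed_s * 1e9:.3f} ns of flight time")
print(f"1 ns of timing error = {distance_from_toa(1e-9):.3f} m of range error")
```

Under these assumptions one slot corresponds to roughly 0.15 ns of flight time, which is why the additional mechanisms listed above are needed on top of raw ranging.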
In particular, based on a large number of measurements, we found the standard deviation of the ToA-determined distance to be less than 0.32 meters in all cases (i.e., a variance of about 0.1 m² or less). We also found that the variance is largely independent of the distance for the range of values that we are interested in (a few meters). We therefore conservatively assume a worst-case variance of 0.2 (double the maximum observed in our experiments). Consequently, we expect our simulation results to significantly overstate the errors expected from real measurements.
2.1
Localizing Multiple Nodes
In this section we discuss how we can systematically localize all nodes in a data center starting with the known locations of a few servers. There are in fact two distinct phases here: (a) cold start, and (b) steady state. The cold start localization phase occurs initially, when none of the servers other than the known ones have their locations known. It may also be invoked by the operator on major changes or perhaps automatically once per day (or more frequently) to ensure that no changes are missed. The steady state phase involves localization of one (or at most a few) servers when the presence of new radios is detected or some radios are removed or shut down.
The cold-start localization starts with localization of the unknown servers in rack(0,0). For rack(0,0) localization, two servers at known locations (known servers) in rack(0,0) and one known server in rack(0,1) are used as reference nodes. During rack(0,0) localization, all the unknown nodes in rack(0,0), one unknown node in rack(0,1) and 2 unknown nodes in rack(1,1) are localized. These newly localized nodes are then used as reference nodes for the localization of their respective racks. From rack(0,1) onwards, localization uses 2 reference nodes from the current rack and one reference node from the previous rack, and exactly 2 nodes in the next rack and 2 nodes in the opposite rack are localized. In this way the row 0 localization is completed. The odd numbered racks need not localize the servers in the next rack except the ones present in the beginning of the first rack. For example, rack(1,1) uses two reference nodes from its own rack and one from rack(0,0) and localizes one server in rack(2,0). Thus, all the servers in the data center are localized from right to left, one row at a time. Each server maintains a local grid map of its neighbors, i.e., a map of the positions of its neighbors. As the localization is performed, each server updates its map to indicate which neighbors are plugged in. A critical element in cold-start localization is the avoidance of servers in racks that we are currently not interested in localizing. The fundamental assumption in the approach above is that the transmit power is chosen low enough that the signal does not reach across more than one rack in either direction. Obviously, this cannot be guaranteed by power control alone. Instead, we need to examine each distance estimate and determine whether the responder is more than one rack away. That is, the problem requires doing “macro localization” in order to enable the systematic localization discussed above.
This is another place where we take advantage of the regular geometry of the racks. Since the rack dimensions are known and the racks are regularly placed, macro-localization is relatively straightforward. In particular, if we have 3 known nodes and the distances from them to an unknown node, simple geometric calculations can tell us with high probability which rack the unknown node belongs to. In cases where there is ambiguity, multiple measurements can be used for disambiguation, much in the same way as discussed in section 4.2 for correctly locating a new set of reference nodes. For brevity, we omit the details of this procedure. Once the cold start localization is finished, monitoring in steady state is relatively simple because several Piconets operating simultaneously monitor their surrounding nodes in order to track any changes in the locations of the nodes.
3
Localization Methods
In this section, we consider the issue of accurately estimating the position of a single unknown node based on distance measurements from a number of known nodes. We propose the maximum likelihood identification method and compare it with the traditional hyperbolic positioning method.
3.1
Hyperbolic Positioning
The hyperbolic positioning (HBP) technique requires a common time reference only among the reference nodes and does not rely on synchronization between the reference nodes and the target node [6]. The position of the target node is determined from the time difference of arrival at two reference nodes. Let us determine the 2-D position of the target node $T(x_0, y_0)$ using $k$ reference nodes. Each reference node makes a range measurement to the target node based on TOA, denoted $D_{1T}, D_{2T}, \ldots, D_{kT}$. Let us assume that the reference nodes share a common time reference and that the clock at the target node $T$ is delayed by $\delta$. Then the difference of the range measurements of any pair of reference nodes removes the delay. The range difference for any two reference nodes $k$ and $l$ is given as
$$D_{kT} - D_{lT} = c(\tau_{kT} + \delta) - c(\tau_{lT} + \delta) = c(\tau_{kT} - \tau_{lT}) \qquad (1)$$
where $\tau_{kT}$ denotes the TOA measured by reference node $k$ from the target node and $c$ is the propagation speed of the signal. The position of the target node $T$ in 2-D space is determined by the intersection of hyperbolas using 3 reference nodes. However, when range measurements incur errors due to multipath and noise, HBP can show large localization errors [7]. In fact, for our problem ML identification shows much better accuracy than HBP localization, as will be shown in section 4.
3.2
Maximum Likelihood Identification
Maximum likelihood (ML) testing is a well-known technique for deciding which one of several hypothesized models a measurement came from. Since in our problem we have only 42 discrete possible positions for a server in a rack, maximum likelihood identification (MLI) deals with identifying which one of these discrete positions is the correct one. For a more formal description of MLI, consider two nodes at known locations in rack 1 and one in rack 2 (this can easily be extended to $p$ transmitters). Each of the known nodes measures its distance, using TOA, from an unknown node at $(m, l, n)$, where $m \in \{0, 1, \ldots, M\}$ denotes the row number, $l \in \{0, 1, \ldots, L\}$ denotes the rack number, and $n \in \{0, 1, \ldots, N-1\}$ denotes the position of the node in the rack. Let $V$ of the $N-2$ possible unknown positions in rack 1 be filled by plugged-in servers. Each of the $V$ nodes determines its position based on its range estimates from the three known nodes. Let us consider the detection of the location of one of these $V$ nodes, say node $u$. Since the location of node $u$ is unknown, it hypothesizes its location to be each of the possible locations in turn and forms $N-2$ likelihood functions, one for each hypothesis, based on the range measurements $(r_{1u}, r_{2u}, r_{3u})$. For forming the likelihood function, we assume that a range estimate $r_{iu}$ is distributed as Gaussian with zero bias (that is, with mean equal to the true distance $d_{iu}$ between the reference node $i$ and the node $u$) and variance $\sigma^2 = N_0/2$. This model assumes line of sight propagation, which may be reasonable when the transmitters and the receivers are in close proximity of each other. In future work we relax
this assumption and introduce bias in the range estimates. The node $u$ estimates its location based on the maximum likelihood rule, i.e., it decides on the location $(m, l, \hat{n})$ given by
$$(m, l, \hat{n}) = \arg\max_{n} \; p(r_{1u}, r_{2u}, r_{3u} \mid H_n) \qquad (2)$$
where the row index $m$ is assumed to be correct as we localize one row at a time, and the rack index $l$ is decided based on the geometry of the racks and the three range estimates as mentioned in section 2.1. Hence, MLI searches only over the position index $n$ within the rack. Since each plugged-in server is classified as occupying one of the allowed server positions, several types of errors could happen. A server could be misclassified as occupying a wrong position where (i) there actually exists another server or (ii) there exists no server. Another situation occurs when two or more nodes claim the same location. This situation could be resolved by subsequent range measurements. For the proposed location estimation system to be successful, it is imperative that the probability of error in the classification of a node is extremely small, especially because the estimation of nodes occurs in successive stages, relying on previous estimates, one rack at a time and one row at a time. We briefly mention the calculation of the probability of correctly identifying a node location for the maximum likelihood rule. By expanding eq. (2) and discarding the terms that do not depend on $n$, the equivalent rule picks the maximum of $Z_{n,u} = \langle r_u, d_n \rangle - E_n/2$, where $r_u = (r_{1u}, r_{2u}, r_{3u})^T$ and $d_n = (d_{1n}, d_{2n}, d_{3n})^T$. Here, $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors and $E_n = \langle d_n, d_n \rangle$. Given that the true location of the unknown node is, say, $(m, l, u)$, the probability of correct classification is the probability that $Z_{u,u}$ is the largest among all possible $Z_{n,u}$. Finally, calculating the error probabilities is similar to calculating the error probabilities in the detection of M-ary signals in additive white Gaussian noise with an arbitrary signal set [8].
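The decision rule above can be sketched in a few lines. The sketch below is only an illustration under assumed geometry (a 1.8” slot pitch, a 24” offset to the adjacent rack, references at the top and middle of one rack and the middle of the next); none of these names or values come from the authors' implementation. It also checks numerically that maximizing $Z_{n,u}$ selects the same slot as minimizing the Euclidean distance between the measured ranges and the hypothesized distance vector, which follows from $\|r_u - d_n\|^2 = \|r_u\|^2 - 2 Z_{n,u}$.

```python
import math

# Assumed geometry for illustration, all lengths in inches.
SLOT_PITCH = 1.8      # vertical spacing of 1U slots
RACK_OFFSET = 24.0    # horizontal offset of the adjacent rack (assumed)
N_SLOTS = 42


def slot_position(rack, slot):
    """2-D coordinates of a slot; slots are numbered 0..N_SLOTS-1 from the top."""
    return (rack * RACK_OFFSET, -slot * SLOT_PITCH)


# Two known servers in rack 0 (top and middle) and one in the middle of rack 1.
REFERENCES = [slot_position(0, 0), slot_position(0, 21), slot_position(1, 21)]
CANDIDATES = [n for n in range(N_SLOTS) if n not in (0, 21)]


def true_distances(slot):
    """Vector d_n of true distances from the three references to slot n of rack 0."""
    p = slot_position(0, slot)
    return [math.dist(p, ref) for ref in REFERENCES]


def z_statistic(ranges, slot):
    """Z_{n,u} = <r_u, d_n> - E_n / 2 for hypothesis H_n."""
    d = true_distances(slot)
    inner = sum(r * dk for r, dk in zip(ranges, d))
    energy = sum(dk * dk for dk in d)
    return inner - energy / 2.0


def mli_by_z(ranges):
    """Slot that maximizes the decision statistic."""
    return max(CANDIDATES, key=lambda n: z_statistic(ranges, n))


def mli_by_distance(ranges):
    """Slot whose distance vector is closest to the measured ranges."""
    return min(CANDIDATES, key=lambda n: math.dist(ranges, true_distances(n)))


# Slightly perturbed ranges for a server that actually sits in slot 10.
measured = [d + e for d, e in zip(true_distances(10), (0.10, -0.08, 0.05))]
print(mli_by_z(measured), mli_by_distance(measured))
```

Both functions search only over the slot index, mirroring the fact that the row and rack indices are fixed before MLI is applied.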
It is well-known that the exact error probability is difficult to calculate but tight upper and lower bounds can be obtained. Notice that the maximum likelihood rule does not require knowledge of the variance, as long as the variance is the same for every estimated range. If the variances of the range estimates depend on the distance, then the maximum likelihood rule would require the values of these variances. However, for short distances, spanning the height of a rack or an adjacent rack, a single variance assumption may be valid as a first approximation [5]. For assessing the effectiveness of the algorithm we will use MATLAB simulations and the union bound discussed in section 3.3.
3.3
Error Bound
For any countable set of events $A_i$, $i = 1, 2, \ldots$, with corresponding probabilities $P[A_i]$, the probability of the union of these events is no greater than the sum of the probabilities of the individual events [8]. That is, $P\left(\bigcup_i A_i\right) \le \sum_i P[A_i]$. A node $u$ can be uniquely identified by specifying the distances between the node and the reference nodes, $d_u$. Then, the “distance” between two nodes $u$ and $i$, denoted $D_{u,i}$, is characterized by $D_{u,i}^2 = \sum_{k=1}^{3} (d_{ku} - d_{ki})^2$.
From the detection of known signals in Gaussian noise [8], the probability of error in distinguishing between two locations, $u$ and $i$, given that the data came from location $u$, is given by $P(\varepsilon_{ui} \mid d_u) = Q\left(D_{u,i} / \sqrt{2N_0}\right)$, where $Q(x)$ is the complementary cumulative distribution function of the standard Gaussian at point $x$. Since a node $u$ can be misidentified as any one of the other possible nodes $i \ne u$, the probability of error in misidentifying $u$ is
$$P(\varepsilon \mid d_u) = P\left[\bigcup_{i \ne u} \varepsilon_{ui} \,\middle|\, d_u\right] \le \sum_{i \ne u} P(\varepsilon_{ui} \mid d_u) \qquad (3)$$
Since $N-2$ corresponds to the maximum number of possible unknown servers assuming the rack is completely filled, the average probability of error is $P(\varepsilon) = \frac{1}{N-2} \sum_{u} P(\varepsilon \mid d_u)$.
3.4
Comparison of HBP and MLI
To analyze the performance of the HBP and MLI methods, two racks in a data center are considered, and the problem of localizing 1U servers in rack(0,0) is studied. Two servers at known locations in the top and middle of rack(0,0) (known servers) and one known server in the middle of rack(0,1) are considered. We analyze the performance of the HBP and MLI methods by estimating the probability of incorrect identification of the server position, $P_e$. To simulate the range estimate obtained via UWB radio, i.i.d. Gaussian errors are added to the true distance between the two nodes. Given the dimensions of the rack, there are 42 possible 1U server positions in rack(0,0) when the rack is completely filled, of which a maximum of 40 may be unknown. Let the three reference nodes for the rack(0,0) simulation be labeled 1, 2, 3, and let the unknown server be at position $u$. Then the error distance metric for the MLI method is formed as
$$\mathrm{errDist}(i) = \sqrt{(r_{1u} - d_{1i})^2 + (r_{2u} - d_{2i})^2 + (r_{3u} - d_{3i})^2}, \quad i = 1, 2, \ldots, V \qquad (4)$$
The position of the unknown node is localized as that position for which the errDist metric is minimum among the $V$ metrics. Ideally, when the range measurements are exact, the correct decision $i = u$ is made, with the corresponding errDist metric being zero. In the HBP method, the position of each unknown server is estimated using the intersection of hyperbolas. The position of the unknown server is then determined in the nearest neighbor sense, by finding the minimum distance between the estimated position and all the possible server positions. In Fig. 3 the performance of the HBP method is compared with that of the MLI method. Assuming zero bias in the range estimates, the average probability of error in identifying the locations of unknown servers in a rack is plotted as a function of the variance of the ranging errors. In Fig. 3 two graphs are shown for the MLI method, one for the error bound described earlier and the other based on the simulation procedure outlined here.
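The simulation procedure for MLI outlined above can be sketched as a small Monte Carlo loop. This is a stdlib-only illustration, not the authors' MATLAB program: the geometry (1.8” pitch, 24” rack offset, 2-D layout), the function names and the trial counts are assumptions, and lengths are expressed in inches as in the rack dimensions of section 2. The values it prints therefore depend on these assumptions and are not meant to reproduce Fig. 3.

```python
import math
import random

# Assumed geometry, all lengths in inches (illustrative only).
SLOT_PITCH = 1.8
RACK_OFFSET = 24.0
N_SLOTS = 42
REF_SLOTS = [(0, 0), (0, 21), (1, 21)]   # (rack, slot) of the three known servers


def position(rack, slot):
    return (rack * RACK_OFFSET, -slot * SLOT_PITCH)


REFS = [position(r, s) for r, s in REF_SLOTS]
UNKNOWN = [n for n in range(N_SLOTS) if (0, n) not in REF_SLOTS]
# D[i] holds the true distances from the three references to candidate slot i.
D = {i: [math.dist(position(0, i), ref) for ref in REFS] for i in UNKNOWN}


def err_dist(ranges, i):
    """errDist(i) of Eq. (4): distance between measured ranges and hypothesis i."""
    return math.sqrt(sum((r - d) ** 2 for r, d in zip(ranges, D[i])))


def simulate_pe(variance, trials, rng):
    """Fraction of trials in which the minimum-errDist slot is not the true slot."""
    sigma = math.sqrt(variance)
    errors = 0
    for _ in range(trials):
        u = rng.choice(UNKNOWN)
        # Zero-bias i.i.d. Gaussian ranging errors added to the true distances.
        ranges = [d + rng.gauss(0.0, sigma) for d in D[u]]
        estimate = min(UNKNOWN, key=lambda i: err_dist(ranges, i))
        errors += estimate != u
    return errors / trials


rng = random.Random(1)
for var in (0.05, 0.2, 0.8):
    print(f"variance={var}: estimated Pe = {simulate_pe(var, 2000, rng):.4f}")
```

Very small error probabilities need far more trials than used here to be estimated reliably, which is one reason an analytical bound is useful alongside the simulation.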
It can be seen that the MLI method significantly outperforms the HBP method. The MLI method requires the calculation of distances between the three known nodes and all possible hypothesized positions for the unknown node, whereas the HBP method requires finding the node nearest to the estimated position obtained through hyperbolic intersection. Hence, the MLI method requires slightly more computation per server localization. However, since the computations mainly involve calculating Euclidean distances and finding the minimum of a set of numbers, they can be done quickly with a reasonably inexpensive processor. Certainly, the proposed localization algorithm dictates an accuracy that is not attainable with the HBP method. It is observed that, as long as the variance is below 0.8 (corresponding to a probability of error of less than 0.1), the union bound is extremely close to the simulation estimate. In the rest of the paper we analyze the performance of the MLI method only.
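The union bound of section 3.3 can be evaluated numerically in the same spirit. The sketch below is a hedged illustration under an assumed geometry (1.8” pitch, 24” rack offset, lengths in inches); it uses $\sigma^2 = N_0/2$, so that $Q(D_{u,i}/\sqrt{2N_0}) = Q(D_{u,i}/(2\sigma))$, and computes $Q$ from the standard library's complementary error function. The function names are hypothetical.

```python
import math

# Assumed geometry, all lengths in inches (illustrative only).
SLOT_PITCH = 1.8
RACK_OFFSET = 24.0
N_SLOTS = 42
REF_SLOTS = [(0, 0), (0, 21), (1, 21)]


def position(rack, slot):
    return (rack * RACK_OFFSET, -slot * SLOT_PITCH)


REFS = [position(r, s) for r, s in REF_SLOTS]
UNKNOWN = [n for n in range(N_SLOTS) if (0, n) not in REF_SLOTS]
D = {i: [math.dist(position(0, i), ref) for ref in REFS] for i in UNKNOWN}


def q_function(x):
    """Complementary CDF of the standard Gaussian."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def pairwise_error(u, i, variance):
    """P(eps_ui | d_u) = Q(D_ui / sqrt(2 * N0)) with N0 = 2 * variance."""
    d_ui = math.dist(D[u], D[i])
    n0 = 2.0 * variance
    return q_function(d_ui / math.sqrt(2.0 * n0))


def union_bound(variance):
    """Average over u of the sum over i != u of the pairwise error probabilities."""
    total = 0.0
    for u in UNKNOWN:
        total += sum(pairwise_error(u, i, variance) for i in UNKNOWN if i != u)
    return total / len(UNKNOWN)


for var in (0.05, 0.2, 0.8):
    print(f"variance={var}: union bound on Pe = {union_bound(var):.3e}")
```

Because it is a sum of pairwise terms, the bound can exceed 1 for large variances, where it carries no information; it is useful in the low-variance regime where simulation would need very many trials.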
4
MLI Simulation Results
4.1
Single Rack Simulation
In this section, we further analyze the performance of the MLI method for a single rack. We also introduce the threshold rule based on the magnitude of the error distance metric in eq. (4). Fig. 4 shows the error probability of the ML method as a function of node position, for two different variances of the ranging errors. It is observed that the nodes at the top and bottom of a rack experience smaller errors than the nodes at the center of the rack. Each of the two end nodes, the top and the bottom, has only one immediate neighbor, and this could have contributed to the low error rate. The probability of incorrect identification can be lowered by increasing the number of reference nodes or by using multiple measurements. In Fig. 5, the simulations show that by using 2 measurements of the range estimates, $P_e$ is lowered by two orders of magnitude. A similar reduction could also be obtained using an increased number of reference nodes. However, more reference nodes require that we start
Fig. 3. Comparison of HBP and ML Detection Error
Fig. 4. Prob. of Incorrect Identification vs. Server Position
Fig. 5. Performance with More Reference Nodes & Measurements
Fig. 6. Illustration of Error Propagation
out with more servers in fixed and known locations, which is undesirable. A downside of more measurements is an increase in localization time. The distance between two adjacent 1U servers is 1.8 inches. When the magnitude of the minimum of the V errDist metrics exceeds a certain threshold value (which, by simulation study, was fixed at 0.9 for a range estimate variance of 0.2), the probability of incorrect identification increases substantially. Such error events are unacceptable when reference nodes are identified for subsequent localizations. Therefore, during the localization of reference nodes for the next rack, if the magnitude of the minimum of the errDist metrics exceeds the threshold value, then two range measurements to each known reference node are made in order to increase the ranging accuracy and reduce the errDist metric.
4.2
Error Propagation in Large Data Centers
The previous section showed that we can achieve excellent localization accuracy in a single rack. In this section we address the question of how the localization error changes as we proceed from rack to rack. Fig. 6 shows how the localization proceeds across racks in row 0. For simplicity, we only show the localization of reference nodes. We need to choose 2 reference nodes in successive racks, starting with the known nodes in racks 0 and 1 (shown as red squares). Each such localization depends on 3 previously localized servers as shown. The error propagation can then be characterized via a recursive set of equations as discussed below. This simple analysis suggests that we arbitrarily pick two of the localized nodes in a rack as reference nodes. In fact, because of their importance, we pick reference nodes a bit more carefully, by considering the error distance metric and the position within the rack. Nodes with small values of the error distance metric are likely to be located more precisely and hence make a better choice for reference nodes. The position is important since we don’t want
both reference nodes to be close together. These aspects should only make the reference node choice more robust and hence result in even less error propagation. Let $P_{ei}$ denote the probability of error in identifying an unknown node in a rack, given $i$ reference nodes in incorrect positions, where $i \in \{0, 1, 2, 3\}$. Using simulations and the MLI method we estimate the values of $P_{ei}$ as:
$$P_{e0} = 1.920 \times 10^{-5}, \quad P_{e1} = 1.430 \times 10^{-3}, \quad P_{e2} = 7.60 \times 10^{-1}, \quad P_{e3} = 8.20 \times 10^{-1} \qquad (5)$$
Not surprisingly, the probability of error goes up sharply with the number of reference nodes in error. The simulation also examined instances where the estimated server position is off by only ±1 slot. The simulation study shows that, except when all three reference nodes are in error, whenever an error occurs in the identification of a node, the MLI method picks one of the immediately neighboring nodes. Let $P_i(n)$, $i = 1, 2$, denote the probability of error in locating the $i$th reference node in rack $n$. Referring to Fig. 6, we can write the following equations. First, the initial conditions are:
$$P_1(0) = P_2(0) = 0, \quad P_1(1) = P_{e0}, \quad P_2(1) = 0 \qquad (6)$$
Now for $n > 1$, we can write the following recursive equation for $P_1(n)$:
$$\begin{aligned} P_1(n) = {} & [1-P_1(n-1)][1-P_2(n-1)]\,\{[1-P_1(n-2)]P_{e0} + P_1(n-2)P_{e1}\} \\ & + \{P_1(n-1)[1-P_2(n-1)] + [1-P_1(n-1)]P_2(n-1)\}\,\{[1-P_1(n-2)]P_{e1} + P_1(n-2)P_{e2}\} \\ & + P_1(n-1)P_2(n-1)\,\{[1-P_1(n-2)]P_{e2} + P_1(n-2)P_{e3}\} \end{aligned}$$
The equation for $P_2(n)$ is almost identical and is omitted. Since $P_1(\cdot)$ and $P_2(\cdot)$ differ only at rack 1, we simply assume the worst case scenario (which applies to $P_1$) and drop the subscript from the above equation. We now show, via an inductive argument, that $P(n)$ remains “small” (i.e., of the order of $P_{e0}$ or smaller) even as $n$ grows. For this, let us inductively assume that both $P(n-2)$ and $P(n-1)$ are of the order of $P_{e0}$. Certainly, this is true for $P(0)$ and $P(1)$. Therefore,
$$P(n) \approx P_{e0} + P(n-2)P_{e1} + 2P(n-1)\{P_{e1} + P(n-2)P_{e2}\} \approx P_{e0} + 3P_{e1}P(n-1) \qquad (7)$$
Clearly, $P(\infty) = P_{e0}/(1 - 3P_{e1})$. Now since $P_{e1}$ is itself quite small (1.43e-3), it follows that $P(n) \approx P_{e0}$ and there is no appreciable error propagation. More generally, so long as $P_{e1}$ can be kept reasonably small, there is no error propagation of any consequence. In particular, for the values given above, $P(\infty)$ = 1.9283e-5. Once row 0 localization is completed, two reference nodes in (rack 0, row 1) are localized by utilizing range measurements from known nodes in (rack 0, row 0). After this, row 1 localization proceeds exactly as in row 0. In order to verify the analytic calculations above, we simulated localization for all 16 racks in a 4×4 data center. In other words, we carried out successive localizations across racks using MLI along with the threshold rule for identifying the reference nodes. We ran the simulation 1 million times. With a range estimate variance of 0.2, the resulting localization error in the 16th rack was found to be 1.9286e-5, which is very close to the expected value.
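The recursion above is easy to iterate numerically. The following sketch uses the values of Eq. (5) and the worst-case simplification $P_1 = P_2 = P$ described in the text; the function names are hypothetical and the code only checks the analytical argument, it does not replace the rack-by-rack simulation.

```python
PE = (1.920e-5, 1.430e-3, 7.60e-1, 8.20e-1)   # Pe0..Pe3 from Eq. (5)


def next_p1(p1_prev, p2_prev, p1_prev2, pe=PE):
    """One step of the recursion for P1(n), given P1(n-1), P2(n-1) and P1(n-2)."""
    pe0, pe1, pe2, pe3 = pe
    none_wrong = (1 - p1_prev) * (1 - p2_prev)
    one_wrong = p1_prev * (1 - p2_prev) + (1 - p1_prev) * p2_prev
    both_wrong = p1_prev * p2_prev
    return (none_wrong * ((1 - p1_prev2) * pe0 + p1_prev2 * pe1)
            + one_wrong * ((1 - p1_prev2) * pe1 + p1_prev2 * pe2)
            + both_wrong * ((1 - p1_prev2) * pe2 + p1_prev2 * pe3))


def propagate(n_racks, pe=PE):
    """Worst-case P(n) for n = 0..n_racks-1, using P1 = P2 = P as in the text."""
    p = [0.0, pe[0]]                      # P(0) = 0 and P(1) = Pe0, from Eq. (6)
    for n in range(2, n_racks):
        p.append(next_p1(p[n - 1], p[n - 1], p[n - 2], pe))
    return p[:n_racks]


def fixed_point(pe=PE):
    """Limit of the linearized recursion (7): Pe0 / (1 - 3 * Pe1)."""
    return pe[0] / (1 - 3 * pe[1])


for n, value in enumerate(propagate(16)):
    print(f"P({n}) = {value:.4e}")
print(f"P(inf) from Eq. (7) = {fixed_point():.4e}")
```

Iterating the full recursion and evaluating the closed-form limit side by side makes it easy to see how quickly $P(n)$ settles near $P_{e0}$ and how sensitive the limit is to $P_{e1}$.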
5
Conclusions and Future Work
In this paper we proposed a wireless USB based solution for the localization of all plugged-in servers in data center racks. In particular, we showed how we can systematically locate all other nodes starting with a few known nodes, and quantified the associated errors in doing so. We showed how multiple measurements and careful selection of reference nodes can be used to keep the error probabilities low in this process. For estimating the location of individual servers, we considered both Hyperbolic Positioning (HBP) and Maximum Likelihood identification (ML) techniques. Our analysis indicates that (i) the ML method is far superior to HBP in terms of accuracy of localization; (ii) the ML method, though computationally more intensive than HBP, is executable with cheap processors; and (iii) even with the ML method, the overall accuracy critically depends on good range estimates between a pair of UWB radios. Future work on the topic will address the issue of reliably estimating the bias in distance measurements that arises naturally due to non line of sight components. We would also like to obtain bounds on the total localization time as a function of the data center size.
Acknowledgements. We would like to thank Anas Tom for his help with the Matlab program for the union bound.
References
1. Liu, H., Darabi, H., Banerjee, P., Liu, J.: Survey of Wireless Indoor Positioning Techniques and Systems. IEEE Trans on Systems, Man & Cybernetics 37(6), 1067–1080 (2007)
2. Brignone, C., et al.: Real Time Asset Tracking in the Data Center. Distr. and Parallel Databases 21(2-3), 145–165 (2007)
3. Di Benedetto, M.-G., Giancola, G.: Understanding Ultra Wide Band Radio Fundamentals, pp. 474–475. Prentice-Hall, Englewood Cliffs (2004)
4. Patwari, N., et al.: Locating the nodes: cooperative localization in wireless sensor networks. IEEE Signal Process. Mag. 22(4), 54–69 (2005)
5. Udar, N., Kant, K., Viswanathan, R., Cheung, D.: Ultra Wideband channel Characterization and Ranging in Data Center. In: ICUWB 2007, September 2007, pp. 322–327 (2007)
6. Darne, C., Macnaughtan, M., Scott, C.: Positioning GSM telephones. IEEE Communications Magazine 36(4), 46–54, 59 (1998)
7. Chan, Y.-T., Hang, Y.C., Ching, H.P.-C.: Exact and approximate Maximum Likelihood Localization Algorithms. Vehicular Technology, IEEE Transactions on 55(1), 10–16 (2006)
8. Seguin, G.E.: A lower Bound on the Error Probability for the Signals in White Gaussian Noise. IEEE trans on information theory 4(7), 3168–3175 (1998)
Improving the Performance of DCH Timing Adjustment in 3G Networks
Gaspar Pedreño, Juan J. Alcaraz, Fernando Cerdán, and Joan García-Haro
Department of Information Technologies and Communications, Polytechnic University of Cartagena (UPCT), Plaza del Hospital, 1, 30202 Cartagena, Spain
{gaspar.pedreno,juan.alcaraz,fernando.cerdan,joang.haro}@upct.es
Abstract. The main goal of Transport Channel Synchronization is that frames sent by the RNC arrive in time at the Nodes B for transmission over the air interface. The 3GPP specifies a simple Timing Adjustment algorithm that tracks the delay of the Iub link (the interface defined between RNC and Node B) by adding or subtracting a constant step. We show that this scheme reacts too slowly under abrupt delay variations and may become unstable in high delay scenarios. This paper proposes an alternative mechanism, called the Proportional Tracking Algorithm (PTA), which ensures stability in any scenario and improves performance compared to the classic one. The stability analysis of PTA is based on control theory and the performance is measured by means of a UTRAN simulator considering realistic traffic conditions in the Iub.
1
Introduction
In UMTS WCDMA access networks, link-layer communication between base stations (Nodes B) and Radio Network Controllers (RNCs) is governed by the Frame Protocol (FP), specified by the 3GPP in [1,2]. FP is responsible, among other things, for Transport Channel Synchronization, which is applied to dedicated channels and aims at delivering each frame to the Node B in time to be transmitted over the air interface within its corresponding Transmission Time Interval (TTI). This procedure, known as Timing Adjustment, basically involves adjusting the sending time instant of the data frames from the RNC to the Node B in order to make them arrive within a predefined receiving window. This receiving window is defined for each transport channel by two parameters (see Figure 1): Time of Arrival Window Startpoint (TOAWS) and Time of Arrival Window Endpoint (TOAWE). This window should not be too large, in order to minimize the buffering time of the frames at the Node B. Frame queuing should be done in the RNC, where the scheduling and the resource management processes are located. Figure 1 shows the possible arrival instants and temporal references. When a frame arrives outside its window, the Node B responds with a Timing Adjustment (TA) frame containing the Time Of Arrival (TOA) of the received frame. The RNC will use this feedback information to adjust the sending time instant of the following frames, trying to steer the arrival time of the frames into the window. Therefore, for each TA, the RNC modifies the delay estimation (offset) of the corresponding FP flow. If a frame arrives after the Last Time of Arrival (LTOA) instant, the Node B will be unable to process it before the TTI of this frame, and the frame is therefore discarded. For a more detailed explanation of the related signaling the reader is referred to [1].
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 768–779, 2008. © IFIP International Federation for Information Processing 2008
Fig. 1. Instants of frames reception at the Node B
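The arrival cases described above can be summarized in a small classifier. This is a hedged sketch rather than a rendering of the 3GPP procedure: it assumes that TOA is measured backwards from TOAWE (positive when the frame arrives before the window endpoint, negative when it arrives after it), which is consistent with the sign of the correction in Eq. (1), and it introduces a hypothetical parameter `ltoa_margin_ms` for the time between TOAWE and LTOA.

```python
from enum import Enum


class Arrival(Enum):
    EARLY = "early"            # before the window start: a TA frame is sent
    IN_WINDOW = "in window"    # inside the receiving window: no TA frame
    LATE = "late"              # after TOAWE but not after LTOA: a TA frame is sent
    TOO_LATE = "too late"      # after LTOA: the frame is discarded


def classify_arrival(toa_ms: float, toaws_ms: float, ltoa_margin_ms: float) -> Arrival:
    """Classify a frame arrival from its TOA.

    Assumptions of this sketch: toa_ms > 0 means the frame arrived before TOAWE,
    toaws_ms is the window length measured backwards from TOAWE, and
    ltoa_margin_ms (hypothetical name) is the time between TOAWE and LTOA.
    """
    if toa_ms > toaws_ms:
        return Arrival.EARLY
    if toa_ms >= 0:
        return Arrival.IN_WINDOW
    if toa_ms >= -ltoa_margin_ms:
        return Arrival.LATE
    return Arrival.TOO_LATE


for toa in (15.0, 5.0, -1.0, -3.0):
    print(toa, classify_arrival(toa, toaws_ms=10.0, ltoa_margin_ms=2.0).value)
```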
The work in this paper is motivated by the fact that UTRAN frame synchronization has not received much attention in scientific research. The scientific literature so far has mainly considered a simple algorithm consisting of adding or subtracting a constant step to the offset estimated at the RNC according to the TOA value contained in the Timing Adjustment frames (TAs):
$$t_{n+1} = \begin{cases} t_n + t_{TTI} - K & \text{if } TOA < 0 \\ t_n + t_{TTI} + K & \text{if } TOA > 0 \end{cases} \qquad (1)$$
where $t_{TTI}$ is the duration of a TTI and $t_n$ is the transmission time instant of frame $n$. In the rest of this paper, we refer to this scheme as the classic algorithm. In [3], the effect of the window parameters on the Iub link capacity for voice traffic is evaluated, but the authors assume that 3G nodes implement the classic algorithm and no improvement is suggested. The work presented in [4] is, to the best of our knowledge, the only one that presents an alternative timing adjustment method, and the approach suggested follows an additive increase, multiplicative decrease principle. The authors propose a modification of the classic algorithm by which the offset is decremented in the absence of TA frames, and the increment, applied upon a late frame arrival, is higher than the decrement. Therefore, in a scenario of steep delay changes its operation is analogous to that of the classic one. In this paper, a new timing adjustment mechanism is presented, the Proportional Tracking Algorithm (PTA). Our proposal is based on correcting the offset proportionally to the distance between the TOA and the center of the window.
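The classic rule of Eq. (1) can be sketched as follows, together with a toy count of how many TA frames it takes to recover from a late arrival. The handling of the case without a TA frame and the simple model in which each correction moves the next arrival by exactly K are assumptions of this sketch, not statements from the specification.

```python
from typing import Optional


def classic_next_send_time(t_n: float, t_tti: float, step_k: float,
                           toa: Optional[float]) -> float:
    """Eq. (1): shift the next sending instant by a constant step K per TA frame.

    `toa` is the TOA reported in a TA frame, or None when no TA frame was
    received (the None case is an assumption; Eq. (1) only covers TA frames).
    """
    if toa is None or toa == 0:
        return t_n + t_tti
    if toa < 0:
        return t_n + t_tti - step_k   # frame was late: send the next one earlier
    return t_n + t_tti + step_k       # frame was early: send the next one later


def ta_frames_until_on_time(initial_toa: float, step_k: float) -> int:
    """Count TA frames until a late stream (TOA < 0) reaches TOA >= 0.

    Toy model: the link delay stays constant after the jump, so sending K
    earlier makes the next frame arrive K earlier.
    """
    toa, count = initial_toa, 0
    while toa < 0:
        toa += step_k
        count += 1
    return count


print(classic_next_send_time(t_n=100.0, t_tti=10.0, step_k=1.0, toa=-6.0))
print(ta_frames_until_on_time(initial_toa=-6.0, step_k=1.0))
```

In this toy model the number of corrections grows linearly with the size of the delay jump divided by K, which illustrates why a constant step reacts slowly to abrupt delay changes.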
G. Pedreño et al.
PTA is analyzed by means of discrete-time control theory, which lets us adjust the gain (K) of the system, assuring stability for different delay situations. The behavior of the algorithm under different Iub delays and its response to sharp delay changes are studied and compared to the classic algorithm by means of computer simulation. The global performance is evaluated considering realistic traffic conditions in the Iub. The simulation results confirm that PTA appreciably improves performance (lower buffer levels at Nodes B, fewer frame losses and less signaling), adds robustness and simplifies configuration compared to the classic algorithm. It should be noted that our proposal does not require any changes in 3GPP specifications. The rest of the paper is organized as follows. Section 2 describes our proposal theoretically as well as its modeling as a discrete-time control system, including the stability analysis. Section 3 presents the results of the simulation-based study of the parameter configuration of each algorithm. In that section we compare both schemes under realistic traffic conditions, and evaluate their reaction to abrupt delay variations. Finally, Section 4 offers the conclusions of this work.
2 Proposed Algorithm
In this section, we introduce an alternative scheme based on a classical control model to perform Transport Channel Synchronization: the Proportional Tracking Algorithm (PTA). As explained in Section 1, when a data frame arrives out of the receiving window, the Node B sends a signaling frame (TA frame) indicating the Time Of Arrival (TOA) of that data frame. Until now, no algorithm made use of this valuable information. Our proposal makes use of this TOA to schedule the transmission of the following data frames from the RNC. The goal of PTA is to place the arrival time of the frames in the center of the window. This instant is chosen because it minimizes the signaling and reduces excessive buffering time at the Node B. According to this principle, upon a TA frame arrival, the RNC makes the following correction in the sending time of the next frame:

$$t_{n+1} = t_n + t_{TTI} + K \cdot (TOA - TOAWS/2) \qquad (2)$$
That is, the next sending time will be advanced or delayed by K times the difference between the instant of arrival and the center of the window. If the frame arrives after the end of the window ("late arrival"), $TOA - TOAWS/2$ is negative and, therefore, the next frame will be advanced. This advance involves a similar increment in the offset estimation performed at the RNC. On the other hand, if the frame arrives before the beginning of the window ("early arrival"), the following frame will be delayed. As long as the frames arrive inside the window, no signaling is sent and the RNC continues transmitting frames with the same synchronization. Figure 2 illustrates the operation of PTA and the delays involved. We define the Round Trip Time (RTT) of the FP control system as the time elapsed
Improving the Performance of DCH Timing Adjustment in 3G Networks
between the sending time of a frame and the moment when the corresponding TA frame is applied to another frame. The number of TTIs in the RTT is $r = RTT/t_{TTI}$. Defining j and m as the downlink and uplink delay in TTIs respectively, $r = j + m$. Therefore, we can express (2) as:

$$\mathit{offset}_{n+1} = \mathit{offset}_n + K \cdot e_n \qquad (3)$$
where $e_n$ is the error between the offset applied to frame n − r ($\mathit{offset}_{n-r}$) and the delay perceived in time-slot n − m ($\mathit{delay}_{n-m}$), i.e. $e_n = \mathit{delay}_{n-m} - \mathit{offset}_{n-r}$. The delay r has a great impact on the stability and the performance of the system, as we explain later.

Fig. 2. Time diagram of the offset adjustment algorithm (diagram labels: RNC, Node B, frame n − r sent at $t_{n-r}$, frame n sent at $t_n$, TA, window n − j, window n, j TTIs, m TTIs, offset n − 1, offset n, TOA − TOAWS/2, K (TOA − TOAWS/2))
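The two update rules, (1) for the classic algorithm and (2) for PTA, can be sketched side by side. This is a minimal illustration rather than the authors' implementation: the function names, the handling of TOA = 0 in the classic rule, and the numeric values are assumptions made for the example. Times are in milliseconds.

```python
def classic_next_send_time(t_n, t_tti, toa, k_step):
    """Classic rule (1): shift the next sending time by a fixed step K."""
    if toa < 0:         # late arrival: send the next frame earlier
        return t_n + t_tti - k_step
    if toa > 0:         # early arrival: send the next frame later
        return t_n + t_tti + k_step
    return t_n + t_tti  # assumption: TOA == 0 leaves the schedule unchanged


def pta_next_send_time(t_n, t_tti, toa, toaws, gain):
    """PTA rule (2): correct in proportion to the distance from the window center."""
    return t_n + t_tti + gain * (toa - toaws / 2.0)


# Hypothetical late frame: TOA = -4 ms, TOAWS = 10 ms, TTI = 10 ms.
classic_t = classic_next_send_time(0.0, 10.0, -4.0, k_step=1.0)
pta_t = pta_next_send_time(0.0, 10.0, -4.0, toaws=10.0, gain=0.389)
```

With these illustrative numbers the classic rule advances the next frame by 1 ms, whereas PTA advances it by 0.389 × 9 ms, because the frame arrived 9 ms away from the window center.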
2.1 Stability Analysis of the Algorithm
In this section we use a discrete-time control theory approach to study the stability of the proposed algorithm and to adjust gain K according to a desired performance of the step response. The simulation results shown in Section 3 confirm the benefits of our scheme compared to the classic algorithm. Both PTA and the classic algorithm are systems where the signals involved, i.e. the delay, the error and the offset, take one value per frame. Therefore, we can model this algorithm as a discrete-time control system making the following assumptions:
1. The RNC transmits one frame per TTI, i.e. there are no idle times for data traffic in the period under study.
2. The RNC has TOA information in every time slot. This implies that the window size (TOAWS) is w = 0 (see Fig. 2).
3. The inter-departure time between frames in the RNC is constant and equal to $t_{TTI}$. Therefore, the model neglects the variation in the TTI duration caused by the offset correction each time slot.
Figure 3 shows a diagram of the control system for the PTA mechanism, where the sampling frequency is $1/t_{TTI}$ and the blocks $1/z^j$ and $1/z^m$ represent
Fig. 3. Discrete-time control model of the Adaptive Proportional Tracking Algorithm (PTA)
respectively the downlink and uplink delays. The signal u(n) corresponds to $\mathit{delay}_n$ and acts as the control signal, while x(n) is $\mathit{offset}_n$. Expressing (3) in terms of the signals in Figure 3 we have:

$$x(n) = x(n-1) + K\,\bigl(u(n-m) - x(n-r)\bigr) \qquad (4)$$
Taking the z-transform in (4) we get:

$$X(z) = z^{-1} X(z) + K \cdot z^{-m}\, U(z) - K \cdot z^{-r} X(z) \qquad (5)$$
Resulting in the following transfer function:

$$\frac{X(z)}{U(z)} = \frac{K \cdot z^{j}}{z^{r} - z^{r-1} + K} \qquad (6)$$
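The behavior predicted by (6) can be checked by iterating the recursion (4) directly. The sketch below is illustrative only: the function name, the zero initial conditions and the unit step input are assumptions, and the two gains are chosen to fall on either side of the critical value reported later for r = 2.

```python
def simulate_offset(gain, j, m, delay, steps):
    """Iterate x(n) = x(n-1) + K*(u(n-m) - x(n-r)) with r = j + m, starting at rest."""
    r = j + m
    x = []
    for n in range(steps):
        x_prev = x[n - 1] if n >= 1 else 0.0       # x(n-1), zero before the start
        u_del = delay(n - m) if n - m >= 0 else 0.0  # delayed input u(n-m)
        x_del = x[n - r] if n - r >= 0 else 0.0      # delayed output x(n-r)
        x.append(x_prev + gain * (u_del - x_del))
    return x


def unit_step(n):
    """Hypothetical input: the Iub delay jumps to 1 at n = 0."""
    return 1.0


stable = simulate_offset(0.389, j=1, m=1, delay=unit_step, steps=400)
unstable = simulate_offset(1.2, j=1, m=1, delay=unit_step, steps=400)
```

For r = 2 and K = 0.389 the offset settles at the input value after a first peak of roughly 10% above it, while K = 1.2 produces growing oscillations.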
In order to adjust the gain K, we use a typical performance parameter in classic control design [5], the overshoot percentage of the first peak of the response, $M_p$. Considering a step input of the form u(n) = C for n > 0, and an output whose first peak equals $A_{max}$, the definition is:

$$M_p(\%) = \frac{A_{max}}{C} \cdot 100 \qquad (7)$$

The above equation is related to the damping ratio (ζ) by

$$\frac{A_{max}}{C} = e^{-\zeta\pi/\sqrt{1-\zeta^2}} \qquad (8)$$
Using (8) we obtain the ratio (ζ) for a given overshoot. The gain K is obtained solving the characteristic equation 1 + G(z) · H(z) = 0, which can be expressed as F(z) = −1, where F(z) = G(z) · H(z). Identifying G(z) and H(z) in our system (see Figure 3) we obtain:

$$F(z) = \frac{K \cdot z}{z^{r}(z - 1)} = -1 \qquad (9)$$
If ζ < 1, the poles of the system can be written as $z = \exp\left(-2\pi\zeta\omega/\sqrt{1-\zeta^2}\right) e^{j2\pi\omega}$ according to [5], where $\omega = \omega_d/\omega_s$ and $\omega_s$ is the sampling frequency in rad/s. In our case $\omega_s = 2\pi/t_{TTI}$, where $t_{TTI}$ is fixed to 10 ms in the channels under consideration. Given that F(z) is a complex magnitude, (9) can be split into two equations, one for the phase of F(z), equal to π radians, and the other for the modulus, which must be equal to 1. The phase condition yields the equation:

$$(r - 1)\,2\pi\omega + \arg\left(C^{\omega} e^{j2\pi\omega} - 1\right) = \pi \qquad (10)$$

where $C = e^{-2\pi\zeta/\sqrt{1-\zeta^2}}$. From (10) we obtain ω, which is applied to the modulus condition to obtain K:

$$K = C^{\omega(r-1)} \left| C^{\omega} e^{j2\pi\omega} - 1 \right| \qquad (11)$$
For the particular case of ζ = 0, we get the critical K value, i.e. the gain above which the system becomes unstable. These values can be checked with the Jury stability test [5] for discrete time systems. Table 1 summarizes K for different overshoots (in percentage) and delay values (r, in number of TTIs).

Table 1. Gain K for different overshoot percentages

r    2%      5%      10%     15%     20%     Critical (ζ = 0)
2    0.313   0.346   0.389   0.428   0.464   1
3    0.186   0.207   0.233   0.257   0.280   0.618
4    0.133   0.140   0.167   0.184   0.200   0.445
5    0.103   0.114   0.129   0.143   0.156   0.347
6    0.084   0.094   0.106   0.117   0.127   0.285
7    0.071   0.079   0.089   0.099   0.108   0.241
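The gains in Table 1 can be approximated numerically from (8), (10) and (11). The following is a sketch under stated assumptions, not the procedure used by the authors: the function names are illustrative, bisection is simply one convenient way of solving the phase condition, and r is assumed to be at least 2.

```python
import cmath
import math


def damping_from_overshoot(mp):
    """Invert (8): mp = exp(-zeta*pi/sqrt(1 - zeta^2)), with mp given as a fraction."""
    log_mp = math.log(mp)
    return -log_mp / math.sqrt(math.pi ** 2 + log_mp ** 2)


def gain_for(zeta, r, iterations=200):
    """Solve the phase condition (10) by bisection, then apply (11). Needs r >= 2."""
    # With theta = 2*pi*omega, C**omega equals exp(-a * theta) for this a.
    a = zeta / math.sqrt(1.0 - zeta ** 2)

    def pole(theta):
        return cmath.exp(complex(-a * theta, theta))

    def phase_error(theta):
        return (r - 1) * theta + cmath.phase(pole(theta) - 1) - math.pi

    lo, hi = 0.0, math.pi / (r - 1) * (1 - 1e-12)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if phase_error(mid) < 0:
            lo = mid
        else:
            hi = mid
    z = pole(0.5 * (lo + hi))
    return abs(z) ** (r - 1) * abs(z - 1)


critical_r2 = gain_for(0.0, 2)
ten_percent_r2 = gain_for(damping_from_overshoot(0.10), 2)
```

For ζ = 0 the phase condition is linear in ω and the sketch reproduces the critical column of Table 1; for a 10% overshoot and r = 2 it gives a value close to the tabulated 0.389.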
From this theoretical analysis we deduce that the gain K should be adjusted depending on the measured offset in the RNC in order to keep the stability of the algorithm. Therefore, the algorithm has to update K automatically according to the downlink delay estimation performed in the RNC. This operation requires the FP to keep a set of pre-calculated values of K. In our implementation we take the set corresponding to a 10% overshoot. The value of K is automatically selected from this set according to the Iub delay estimated by the RNC (offset ), resulting in a dynamic adjustment.
3 Simulation Results
In this section, performance results obtained by computer simulation are discussed. To test both algorithms over Dedicated Channels (DCHs), we have developed a detailed implementation of the Frame Protocol, according to 3GPP specifications [1,2], using OMNeT++, an event-driven C++ simulation tool [6]. Figure 4 illustrates the simulated network topology, consisting of a point to point link between the RNC and the Node B. The bandwidth of the Iub interface was set to 1.7 Mbit/s, roughly equal to the net rate of an E1 PDH link subtracting 10% for signaling and control traffic.
Fig. 4. Simulated network topology
Both voice and data traffic are enabled in the simulations. Speech sources are based on Adaptive Multi-Rate (AMR) codecs while each data source is governed by a WWW application. For a more detailed description of the data and speech sources used, the reader is referred to [7]. Each of these sources uses a DCH at a constant rate of 384 Kbps.
3.1 Response of Both Algorithms to Abrupt Delay Changes
First, we study the behavior of both algorithms when the Iub delay undergoes a sharp change. In this situation, the simulation results show that PTA reacts faster than the classic algorithm, avoiding excessive frame losses and signaling and, simultaneously, keeping buffer levels low. Besides, the classic algorithm can become unstable depending on the value of the gain K, requiring a cautious configuration. These dramatic changes of the Iub delay characterize realistic situations that may be caused by several reasons:
– A simultaneous connection/disconnection of several sources in a Node B causing an abrupt increase/decrease of traffic toward that Node B.
– An intermediate node in the UTRAN transport network handling a large amount of traffic.
– Traffic engineering or path restoration mechanisms re-routing Iub flows. See [8] and the references therein.
In the examples provided in [9], the one-way delay of different Iub branches ranged from 12 to 31 ms considering voice traffic and ATM transport. This gives an idea of the delay differences within the network connecting 3G access nodes. If data services and IP transport are considered, these delays can be larger. In this paper, we considered delay changes ranging from 10 ms to 60 ms, although our algorithm can adapt to even longer delays. In this section, we evaluate two scenarios where the Iub connection experiences a sharp delay change (an increment and a decrement). For the experiments performed, one active speech source is considered and the parameters TOAWS and TOAWE have been set to typical values [3], 10 ms and 5 ms respectively. Figure 5 illustrates the operation of PTA and the classic algorithm with K = 1 ms and K = 3 ms in the delay increase scenario. The Time Of Arrival (TOA) of the FP flow is also shown. The Iub interface is initially configured with a propagation delay of 10 ms. At simulation time t = 50 ms, the Iub delay switches to
Fig. 5. Reaction of both algorithms to a delay increment
50 ms. This change implies that frames start to arrive after the receiving window, causing the transmission of TA frames (signaling events) and the discarding of the frames arriving too late (loss events), as shown in Figure 5. Consequently, the buffer in the Node B empties until the sending time of new frames is adjusted in the RNC so that they arrive on time to be transmitted over the air interface. These traces show that PTA behaves clearly better and stabilizes much faster than the classic algorithm for any of its configurations. With K = 3 ms (appropriate for an Iub delay of 10 ms) the classic mechanism loses stability after the delay increase and, with K = 1 ms (appropriate for an Iub delay of 50 ms), the response of the classic algorithm is appreciably slow compared to PTA. The results of the delay decrease experiment are depicted in Figure 6. The Iub interface is initially configured with a propagation delay of 50 ms and, at simulation time t = 50 ms, it switches to 10 ms. After this change, frames start to arrive before the receiving window and, therefore, they generate signaling but are not discarded. However, a delay decrease has an undesirable effect: overbuffering in the Node B. For this reason, the timing adjustment algorithm should react as quickly as possible to avoid excessive buffering while, at the same time, ensuring a stable response of the system. As we can see in Figure 6, the classic algorithm with K = 3 ms is the fastest in terms of buffer draining but it becomes unstable. On the other hand, the classic algorithm with K = 1 ms remains stable but its response is too slow. PTA presents the most balanced behavior. It is able to follow the delay variation quickly thanks to the proportional strategy, while the dynamic variation of K (based on Table 1 values for 10% overshoot) assures stability.
Compared to the classic algorithm, PTA shows a better behavior in every measurement given a delay change (lower buffer levels, fewer frame losses and less signaling).
Fig. 6. Reaction of both algorithms to a delay decrement
Fig. 7. Performance of both algorithms under realistic traffic conditions for different Iub delays (panels: data traffic, speech traffic)
3.2 Performance Comparison Under Realistic Traffic Conditions
In this section, both algorithms are studied and evaluated using realistic traffic (30 speech sources and 6 data sources). The performance figures measured are the frame loss ratio, the signaling ratio and the maximum buffer level at Node B. The signaling ratio is the ratio between the number of TA frames sent by the Node B and the total number of data frames received at the Node B. The frame loss ratio is the ratio between the number of frames arriving "too late" and the total number of frames received at the Node B. Both ratios are expressed in percentage while the buffer levels are expressed in number of frames. In the first test we compare both algorithms given different Iub propagation delays, set at the beginning of each simulation and remaining constant during the simulation time. The parameters TOAWS and TOAWE have been set to 20 ms and 5 ms respectively. Performance results are depicted in Figure 7. Each value is obtained by averaging 10 simulation runs and the error bars correspond to a confidence interval of 95% according to a Student's t distribution. From Figure 7, it can be seen that the classic algorithm with high K values performs similarly to PTA for low Iub delays (10 or 20 ms) but becomes unstable with larger Iub delays. On the other hand, the classic algorithm with K = 1 ms remains stable for a wide range of Iub delays but its performance
Fig. 8. Performance of both algorithms in a realistic traffic scenario with sharp Iub delay changes (panels: data traffic, speech traffic)
is lower than that of PTA in every scenario, because of its slow response. Speeding up this response requires a higher K, which causes instability at higher delays. These facts illustrate that the main drawback of the classic algorithm is the trade-off between responsiveness and robustness against oscillations. Next, we study a situation where the Iub delay experiences an abrupt change. The traffic considered is the same as in the previous scenario. Figure 8 illustrates the performance figures for Iub delay increases (positive sign) and decreases (negative sign) of different magnitude, ranging from 10 ms to 40 ms. The TOAWS is set to 10 ms and TOAWE equals 5 ms. Each performance figure is measured during the period of 60 seconds after the delay change. The average values and the confidence intervals (95%) are obtained from 10 simulation runs. The performance figures shown in Figure 8 are in accordance with the results observed in Section 3.1, where only a single FP flow was considered. It is shown that, even under realistic traffic conditions, PTA achieves better results than the classic algorithm given Iub delay increments and decrements. The buffer occupancy figure of PTA is sometimes higher than the classic algorithm because of the lower frame loss ratio of PTA, which implies that the Node B has to store frames that, with the classic algorithm, would have been lost.
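The 95% confidence intervals over 10 runs mentioned above follow the usual Student's t construction. A minimal sketch, assuming the standard two-sided critical value for 9 degrees of freedom (2.262); the function name and the sample values are hypothetical and are not data from the paper.

```python
import math
import statistics

# Two-sided 95% Student's t critical value for 10 runs (9 degrees of freedom).
T_CRIT_95_DF9 = 2.262


def mean_and_ci95(samples):
    """Return (mean, half-width) of the 95% confidence interval for 10 runs."""
    if len(samples) != 10:
        raise ValueError("T_CRIT_95_DF9 only applies to exactly 10 samples")
    mean = statistics.mean(samples)
    half_width = T_CRIT_95_DF9 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width


# Hypothetical per-run metric values, used only to exercise the function.
runs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean, half = mean_and_ci95(runs)
```

The error bar for a plotted point would then span from mean − half to mean + half.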
4 Conclusions
In this paper, we propose a new timing adjustment algorithm for the Transport Channel Synchronization mechanism in 3G access networks. This algorithm, the Proportional Tracking Algorithm (PTA), has been analyzed from a control theory point of view and evaluated by computer simulation under realistic traffic conditions, showing better performance than the classic algorithm in scenarios with Iub delay increments and decrements. The most important conclusions are that PTA has a faster response to abrupt delay changes than the classic algorithm, reducing frame losses, and that its operation is stable regardless of the total Iub delay. In our opinion, QoS management as well as congestion control algorithms in UTRAN transport networks should be revisited with the perspective of a faster and more efficient synchronization procedure such as the one proposed in this paper, especially with the gradual substitution of ATM by IP in the transport infrastructure giving support to the mobile access network. Acknowledgements. This research has been supported by the national project grant TEC2007-67966-01/TCM (CON-PARTE-1) and it is also developed in the framework of "Programa de Ayudas a Grupos de Excelencia de la Región de Murcia". Gaspar Pedreño acknowledges support from Spanish MEC under FPU grant AP2006-01568.
References
1. 3GPP; Technical Specification Group RAN.: Synchronization in UTRAN Stage 2, 7.2.0. 3GPP TS 25.402 (2006)
2. 3GPP; Technical Specification Group RAN.: Iub/Iur Interface User Plane Protocol for DCH Data Streams, 7.3.0. 3GPP TS 25.427 (2006)
3. Abraham, S., et al.: Effect of Timing Adjust Algorithms on Iub Link Capacity for Voice Traffic in W-CDMA Systems. In: Proc. IEEE VTC, vol. 1, pp. 311–315 (2002)
4. Sagfors, M., et al.: Transmission Offset Adaptation in UTRAN. In: Proc. IEEE VTC, vol. 7, pp. 5240–5244 (2004)
5. Ogata, K.: Discrete Time Control Systems. Prentice Hall Inc, Englewood Cliffs (1994)
6. OMNeT++ Discrete Event Simulation System, http://www.omnetpp.org
7. 3GPP; Technical Specification Group RAN.: IP Transport in UTRAN, 5.4.0. 3GPP TS 25.933 (2003)
8. Bu, T., Chan, M.C., Ramjee, R.: Connectivity, Performance, and Resiliency of IP-Based CDMA Radio Access Networks. IEEE Trans. on Mobile Computing 5(8) (2006)
9. 3GPP; Technical Specification Group RAN.: Delay Budget within the Access Stratum, 4.0.0, 3GPP TR 25.853 (2004)
Link Adaptation Algorithm for the IEEE 802.11n MIMO System
Weihua Helen Xi, Alistair Munro, and Michael Barton
Department of Electrical and Electronic Engineering, University of Bristol, Bristol, UK
{Helen.Xi, Alistair.Munro, M.H.Barton}@bristol.ac.uk
Abstract. The employment of Multi-Input Multi-Output (MIMO) technology leads the WLAN to a new stage - high throughput (HT) WLAN, recently proposed in IEEE 802.11 TGn. The nature of multiple operation modes of MIMO has fundamentally changed the requirements for the MAC layer link adaptation (LA) process. The LA algorithms currently used in single input single output (SISO) WLANs are hardly effective for HT WLANs. We propose a cross-layer design for a MIMO LA algorithm which requires the MAC and the PHY to work proactively with each other to take full advantage of MIMO technologies adopted in 802.11n. The algorithm employs the channel state information (CSI) and operates in a 'closed-loop' manner. Simulations taken under various conditions validate our research.
Keywords: Link Adaptation, MIMO, 802.11n.
1 Introduction
The evolution of wireless Local Area Network (WLAN) technologies aims to deliver data faster at the physical layer (PHY) and manage the distributed resource allocation more effectively at the Medium Access Control (MAC) layer. A typical device deployed in the legacy WLAN today follows a Single-Input Single-Output (SISO) model. The high throughput (HT) WLAN, 802.11n [1], employs Multi-Input Multi-Output (MIMO) technology, which implies that a transceiver uses multiple antennas for transmitting/receiving radio signals. The use of MIMO technology increases the PHY layer capacity (e.g. 100 Mbps) and significantly changes the PHY layer's capability on operational modes. MIMO fundamentally enhances the requirements of the MAC for the PHY resource management, which is normally carried out through the link adaptation (LA) process. The existing LA algorithms developed for the legacy WLAN using the SISO model become barely effective or valid for HT WLAN using MIMO technologies.
In the legacy (SISO) WLAN, a station has only one antenna. There is only one PHY link between a transmitting node and a receiving node. The PHY's operation mode (e.g. 802.11a) is only a combination of modulation and coding scheme (MCS). Normally there is a consistent relationship between the MCS and transmission failures: if the transmission is believed to have caused too many failures, the transmitting station will choose the next lower rate in the MCS set. The LA process, undertaken by the MAC, is executed without much involvement of its PHY layer.
In the HT WLAN, a MIMO system, a station has multiple antennas. There are multiple PHY links between a transmitting node and a receiving node. The multi-antenna system enables a station to utilize the multipath propagation of the radio signals over the PHY links for an enhanced link capacity. A subset of the antennas of the MIMO system can work in two operation modes: Space Time Block Codes (STBC, Alamouti scheme [14]) or Spatial Multiplexing (SM, also called V-BLAST in the literature [8]). In the STBC mode, each modulated symbol is transmitted on all the antennas in the subset to improve the reliability of the PHY link. In the SM mode, each antenna in the subset transmits different modulated symbols in parallel to increase the overall throughput. The MCS, the MIMO modes under a given number of antennas, and the variation of the radio frequency (RF) environment, e.g. multipath propagation and fading, should all be considered in the LA process of a MIMO system (called 'MIMO LA' below). Increased cross-layer coordination between the MAC and the PHY becomes a must in order to use MIMO effectively.
In this paper, we propose a MIMO LA algorithm which takes full advantage of MIMO technologies adopted in 802.11n. Firstly, the algorithm employs the Channel State Information (CSI) from the MIMO engine to profile the PHY links. The profile data are used for channel quality and capacity estimation. Secondly, the algorithm operates in a 'closed-loop' manner, which involves both the transmitter's and the receiver's contribution, to maximise the utilisation of the estimated channel quality and capacity. The two elements work harmoniously together to keep the overall system utilisation at its peak. Simulation results validate our proposal. The paper is organised as follows: Section 2 serves as the research motivation. Section 3 presents our MIMO LA algorithm. The mathematical calculation of the throughput is given in Section 4. Section 5 presents the simulation results in different scenarios. Section 6 concludes the paper.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 780–791, 2008. © IFIP International Federation for Information Processing 2008
2 Research Motivation
In legacy WLANs using the SISO system, information data bits are protected by Forward Error Control (FEC) coding, which can be at a rate of 1/2, 2/3, 3/4 or 5/6, before they are modulated into the data symbols in one of the modulation schemes (BPSK, QPSK, 16QAM or 64QAM). The objective of the LA in the SISO WLAN is to choose an MCS combination, indicated in Fig. 1 ('SISO only' in the circle). Normally the MCS has a consistent relationship with the bit error rate (BER): for the same signal-to-noise ratio (SNR), a higher rate in the MCS has a higher BER. The BER can be roughly estimated through the rate of successful ACK reception. Therefore, the LA in SISO WLANs becomes an adaptive modulation process: if the transmission is believed to have caused too many reception failures, the transmitting station will lower the rate in the MCS set for the next transmission.
Fig. 1. Challenge of Link Adaptation in 802.11n (2 × 2)
A widely-used LA algorithm in legacy WLANs is called ARF [2]. After two consecutive missed ACKs, the rate is decreased to the next lower one. After 10 consecutive successful transmissions, the rate is increased to the next higher one. AARF [4], RRAA [6], CARA [7] and [5] are all modifications of ARF. The advantage of the algorithm is its simplicity of implementation. The disadvantage is its inaccuracy when there are collisions. Another LA approach is given in RBAR [3]. It predefines the SNR threshold $\theta_i$, at which the BER is $1 \times 10^{-5}$. The highest rate which keeps the BER no larger than $1 \times 10^{-5}$ will be chosen. The problem with this algorithm is that the threshold values should vary with different environments, e.g. multipath, fading and rms (root mean square) delay spread. An example will be given later in Fig. 2. In the MIMO system, in addition to MCS selection, LA faces another dimension of challenge: MIMO mode selection, as Fig. 1 shows. We use a 2 × 2 system as an example. MIMO can work in either STBC or SM mode. In STBC mode, the transmitter sends the same modulated symbol via different antennas at different times to duplicate the symbol in the space and time domains and, consequently, creates both space and time diversities, which make the detection of the transmitted symbol on the receiver side much more reliable. In SM mode, MIMO sends different symbols over the two antennas and, consequently, doubles the system throughput. Table 1 summarises the possible data rates in a 2 × 2 MIMO system. Data rates might be slightly different from the specification draft owing to the OFDM implementation. Fig. 2 shows the BER performance of some of the transmission modes in Table 1. Most of the lines, except the STBC 6(20ns) and SM 36(5dB) ones, represent a 150ns rms delay spread and Rayleigh environment. In SISO systems, there is a fixed performance (BER vs SNR) relationship between two different transmission modes, e.g.
36Mbps always achieves better BER than 54Mbps under the same SNR in any channel environment. Such a monotonic behavior of the performance curve gives validity to the ARF algorithms. In MIMO systems, however, because of the difference between STBC and SM modes, there is no such fixed performance relationship between an MCS in STBC and an MCS in SM. For example, at 40dB SNR, the SM 36Mbps mode achieves better throughput than STBC 54Mbps,
Table 1. Data Rates in 2 × 2 System

MCS Index   Modulation   Coding Rate   MIMO mode   Data Rate
0           BPSK         1/2           STBC        6Mbps
1           QPSK         1/2           STBC        12Mbps
2           QPSK         3/4           STBC        18Mbps
3           16-QAM       1/2           STBC        24Mbps
4           16-QAM       3/4           STBC        36Mbps
5           64-QAM       2/3           STBC        48Mbps
6           64-QAM       3/4           STBC        54Mbps
7           64-QAM       5/6           STBC        60Mbps
8           BPSK         1/2           SM          12Mbps
9           QPSK         1/2           SM          24Mbps
10          QPSK         3/4           SM          36Mbps
11          16-QAM       1/2           SM          48Mbps
12          16-QAM       3/4           SM          72Mbps
13          64-QAM       2/3           SM          96Mbps
14          64-QAM       3/4           SM          108Mbps
15          64-QAM       5/6           SM          120Mbps
Fig. 2. BER vs SNR curves (vertical axis: BER, logarithmic, from 10^0 down to 10^−6; horizontal axis: SNR in dB, from −10 to 40; curves: SM 54, SM 36, SM 6, STBC 54, STBC 36, STBC 6, STBC 6(20ns), SM 36(5dB))
but when the channel is Rician fading, the SM 36Mbps (line SM 36(5dB)) never outperforms STBC 54Mbps. The selection of transmission modes has to vary according to different environments. The traditional LA algorithms developed for the SISO systems become hardly effective. We were motivated to develop LA algorithms optimized for MIMO systems. There is some published research mentioning the LA for MIMO systems, e.g. [13] and [12]. The former gives a general structure and lacks detail; the latter assumes only one MIMO mode such as SM is available. In our research, because both the MIMO modes are available, the LA has to choose not only the MCS but also the MIMO mode.
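For reference, the ARF rule described earlier in this section (step down after two consecutive missed ACKs, step up after 10 consecutive successes) can be sketched as a small state machine. This is a simplified illustration: the class name, the choice to start at the lowest rate and the rate list (the STBC column of Table 1) are assumptions, and details of the original ARF such as timers and probing are left out.

```python
class ArfRateController:
    """ARF-style controller: step down after 2 missed ACKs, up after 10 successes."""

    def __init__(self, rates, down_after=2, up_after=10):
        self.rates = rates
        self.index = 0          # assumption: start at the lowest rate
        self.down_after = down_after
        self.up_after = up_after
        self.failures = 0       # consecutive missed ACKs
        self.successes = 0      # consecutive received ACKs

    def report(self, ack_received):
        """Update the counters with one transmission outcome and return the rate."""
        if ack_received:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.up_after and self.index < len(self.rates) - 1:
                self.index += 1
                self.successes = 0
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.down_after and self.index > 0:
                self.index -= 1
                self.failures = 0
        return self.rates[self.index]


arf = ArfRateController([6, 12, 18, 24, 36, 48, 54, 60])  # rates in Mbps
rate_after_ten_acks = [arf.report(True) for _ in range(10)][-1]
rate_after_two_losses = [arf.report(False) for _ in range(2)][-1]
```

The controller only sees ACK outcomes, which is why it cannot tell a collision from a channel error and has no notion of the MIMO mode.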
3 Proposed MIMO LA Algorithm
3.1 Generic LA Process - A Collaboration of MAC and PHY
The transmitting and receiving procedures in WLANs determine that the LA is a collaborative process between the PHY and the MAC. Fig.3 (based on 802.11n)
Fig. 3. Packet Receiving Procedure (diagram labels: MAC: MPDU; PHY PLCP: training sequence, SIGNAL field with MCS 7 bit, Length 16 bit, STBC 2 bit, CRC 8 bit, followed by the PSDU; PHY PMD: PLCP preamble at 6Mbps, SIGNAL at 6Mbps, PSDU at the data rate indicated in SIGNAL)
describes the data-bit mapping in the two layers. We highlight the parts related to our proposed MIMO LA. Unrelated parts are omitted or left blank. The PLCP preamble and PLCP header are always transmitted in the most reliable mode (BPSK, coding rate 1/2). The MCS and STBC fields (only a Rate field in 802.11a) in the PLCP header are important signalling elements:
– For the PHY layer, MCS and STBC indicate at which rate to demodulate the Data field when receiving and at which rate to modulate when transmitting.
– Information about these two fields is passed from the MAC through primitives (indicated as arrows). This information must be visible to the MAC layer: firstly, the MAC needs MCS and MIMO mode information to calculate the Network Allocation Vector (NAV) in the MAC header; secondly, the selected rate is normally related to the destination, which is located in the MAC header and invisible to the PHY.
The 802.11n MAC elaborates in much more detail on the potential needs of the LA process than the legacy 802.11. There is an HT control field in the MAC header. The first two bytes are called Link Adaptation Control, where one bit called MRQ (MCS Request) indicates whether the sender wants MCS feedback. The 7-bit MFB (MCS Feedback) contains the recommended MCS for the requesting station.
3.2 The MIMO LA Algorithm
As we have described in Sections 2 and 3.1, an effective LA algorithm for the MIMO system should be distributed at both the transmitter and the receiver, and coordinated by the MAC and the PHY:
1. Identification of the RF channels estimated by the PHY layer at the receiver side;
2. Utilization of SNR information from the PHY layer for BER estimation (furthermore, throughput estimation) under the above channel information.
Fig. 4. Structure of MIMO LA Algorithm (diagram labels: Stations A and B, each with Upper Layer, MAC with LATx and LARx, and RF (MIMO); MPDU and MPDU (MFB) on the Tx paths, ACK/CTS on the Rx path, Mode, CSI, RF Signal)
The proposed LA algorithm consists of LATx and LARx sections, distributed on both sides of the transmitter and the receiver as shown in Fig. 4. The algorithm is described as follows.
Step 1. Station A wants to start a transmission to Station B. Its LATx sets 1 in the MRQ field in the MAC header to request MFB from Station B. The MAC sends the first packet using the most reliable rate (BPSK, 1/2, STBC mode; we refer to this as the 'conservative mode' below).
Step 2. Station B receives the packet. The MRQ bit triggers the LARx section at the receiving side.
Step 3. The LARx requests the CSI Matrices report, which includes both the H matrix and the SNR estimate from the PHY (how this is done is out of the scope of this paper), and identifies the channel model according to pre-defined channel templates (discussed in Section 5.1).
Step 4. The LATx of Station B uses the channel model and the SNR to estimate the throughput and chooses the mode which can provide the maximum throughput. The procedure can be described in the following pseudo language. Equations used in this section will be explained in the next section.
m: total number of transmission modes, e.g. 16 in Table 1.
T_MAX: maximum throughput recorded, set to zero before the loop.
j: selected mode.
For (i = 0; i < m; i++) {
    Step 4.1: Get BER estimation. Given the channel model and SNR value, LATx looks up the BER-SNR lookup tables to obtain the BER estimate b_i;
    Step 4.2: Perform throughput estimation. If i represents an STBC mode, estimate the throughput T_i using Equation 2. If i represents an SM mode, estimate the throughput using Equation 3.
    Step 4.3: Record the maximal throughput.
    if (T_i ≥ T_MAX) { T_MAX = T_i; j = i; }
} // End of loop i
W.H. Xi, A. Munro, and M. Barton
Step 5. The LATx of Station B places the index of mode j in the MFB field of the CTS or ACK frame and sends it to A.
Step 6. When the MAC of Station A receives the packet from B, its LARx parses the mode j given in the MFB and maps j to a transmission mode as in Table 1 for the next transmission. If Station A does not get any feedback, it repeats the transmission using the conservative mode.
Steps 1-6 are repeated the next time the LA procedure is triggered.
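The selection loop of Step 4 and the fallback of Step 6 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `select_mode` and `next_tx_mode`, the choice of index 0 for the conservative mode, and the sample throughput values are assumptions made here, not part of the paper.

```python
from typing import Optional, Sequence, Tuple


def select_mode(throughputs: Sequence[float]) -> Tuple[int, float]:
    """Steps 4.1-4.3: return (j, T_MAX) given per-mode throughput estimates T_i."""
    t_max = 0.0  # T_MAX is set to zero before the loop
    j = 0
    for i, t_i in enumerate(throughputs):
        if t_i >= t_max:  # '>=' as in Step 4.3, so a tie goes to the later mode
            t_max = t_i
            j = i
    return j, t_max


def next_tx_mode(mfb: Optional[int], conservative_mode: int = 0) -> int:
    """Step 6: use the fed-back mode, or fall back when no MFB arrived.

    The index of the conservative mode (0) is a hypothetical choice.
    """
    return conservative_mode if mfb is None else mfb


estimates = [1.0, 3.5, 3.5, 2.0]  # made-up T_i values for four modes
j, t_max = select_mode(estimates)
print(j, t_max, next_tx_mode(None))
```

Because the comparison in Step 4.3 is non-strict, two modes with equal estimated throughput resolve to the one with the higher index.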
4 Throughput Calculations
The throughput calculation used in Step 4 of the LA procedure above is described in this section. Equation 1 gives the basic idea of the throughput calculation.
$$T = \frac{L \times (1-b)^{L+l_m}}{T_{total}} \qquad (1)$$
where $T_{total} = T_{DIFS} + T_{backoff} + T_{header} + T_{data} + T_{SIFS} + T_{ack}$. $T$ represents the throughput, $L$ is the packet length in bits ($L$ can be derived from the ‘Duration’ field in the MAC header), $b$ is the BER, and $l_m$ is the MAC header length in bits. The denominator is the total time spent on transmitting a packet. $T_{data}$ gives the time spent on the MAC frame body ($L$). $T_{header}$ includes the time spent on the PHY preamble, the PLCP header, the MAC header and the padding bits. Since we are interested in the impact of the transmission mode on throughput, the impact of collisions on throughput is not included here. Because the calculation of $T_{total}$ varies according to the MIMO mode, we treat STBC and SM separately.
STBC: The overhead in STBC mode is the same as in SISO mode, and we define
$$\rho = (T_{DIFS} + T_{backoff} + T_{header} + T_{SIFS} + T_{ack}) \times R, \qquad T_{data} = L/R$$
where $R$ is the physical transmission rate. The throughput in STBC mode is given in Equation 2.
$$T = \frac{L \times R}{L + \rho} \times (1-b)^{L+l_m} = \frac{R}{1 + \rho \times L^{-1}} \times (1-b)^{L+l_m} \qquad (2)$$
SM: In the SM mode, the calculation differs from the above because the transmission time of an SM packet (the MPDU part) is halved. The PHY layer divides the MPDU into halves and transmits $(L + l_m)/2$ bits on each antenna. The time spent on the PHY overhead remains the same. The successful reception of the whole packet requires that both parts are received correctly, so the probability of receiving an SM packet successfully is still $(1-b)^{L+l_m}$. The throughput in 2 × 2 SM mode is given by:
$$T_{SM} = 2 \times \frac{\frac{L}{2} \times R}{\frac{L}{2} + \rho'} \times (1-b)^{L+l_m} = \frac{R}{1/2 + \rho' \times L^{-1}} \times (1-b)^{L+l_m} \qquad (3)$$
where $\rho' = \rho - \left(\frac{T_{l_m}}{2} + \frac{T_{l_{ack}}}{2}\right) \times R$.
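As a rough illustration, Equations 1 to 3 translate directly into code. The sketch below assumes times in seconds, the rate R in bit/s and the overheads ρ and ρ' in bits; the function names and the numerical values in the usage lines (a 100 µs overhead, a BER of 10^-5) are placeholders chosen here, not figures from the paper.

```python
def success_probability(b: float, L: int, lm: int) -> float:
    """Probability that all L + lm bits are received correctly."""
    return (1.0 - b) ** (L + lm)


def throughput_eq1(L: int, lm: int, b: float, t_total: float) -> float:
    """Equation 1: T = L * (1 - b)^(L + lm) / T_total."""
    return L * success_probability(b, L, lm) / t_total


def throughput_stbc(L: int, lm: int, b: float, R: float, rho: float) -> float:
    """Equation 2: T = R / (1 + rho / L) * (1 - b)^(L + lm)."""
    return R / (1.0 + rho / L) * success_probability(b, L, lm)


def throughput_sm(L: int, lm: int, b: float, R: float, rho_prime: float) -> float:
    """Equation 3: T_SM = R / (1/2 + rho' / L) * (1 - b)^(L + lm)."""
    return R / (0.5 + rho_prime / L) * success_probability(b, L, lm)


# Placeholder inputs: 1000-byte body, 30-byte MAC header, 54 Mbit/s.
L, lm, b, R = 8000, 240, 1e-5, 54e6
rho = 100e-6 * R                      # hypothetical 100 us of fixed overhead, in bits
t_total = rho / R + L / R             # T_total = overhead time + T_data
print(throughput_stbc(L, lm, b, R, rho), throughput_eq1(L, lm, b, t_total))
```

The two printed values agree up to floating-point rounding, which is a quick check that Equation 2 is Equation 1 with $T_{total} = (\rho + L)/R$ substituted.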
5 Simulations and Performance Analysis
Simulations were developed using OPNET. Apart from the modification of the MAC layer, the BER vs SNR performance curves are replaced by the data provided in [9]. Stations are associated with the ray-tracing data. The receive pipeline is modified to read the ray-tracing characteristics and the BER vs SNR look-up tables. Each station has 2 antennas. Six transmission speeds, 6, 12, 18, 24, 36 and 54 Mbps, are available at each antenna. The antenna gain is 1.5 dB. The white noise is -101 dBm.
5.1 Channel Models
As described in Section 3.2, the classification of the channel model is a necessary part of our proposed MIMO LA algorithm. Channel modeling has been well studied in the past, e.g. [10] originally for HIPERLAN, [11] for 802.11n. In our research, we have used the PHY MIMO modeling results from [9], undertaken by colleagues in the author’s department. In our choice of modeling (Annex D of [9]), K-factor, rms delay spread and angular spread (360°) were chosen as parameters for the classification. The rms delay spread is the root mean square of the multiple signal path delays, weighted proportionally to the energy in their paths.
Table 2. Channel Scenarios in a 2 × 2 System
Channel Scenario   rms delay   K factor
H_20_0             20 ns       Rayleigh
H_20_5             20 ns       5 dB
H_20_10            20 ns       10 dB
H_50_0             50 ns       Rayleigh
H_50_5             50 ns       5 dB
H_50_10            50 ns       10 dB
H_150_0            150 ns      Rayleigh
H_150_5            150 ns      5 dB
H_150_10           150 ns      10 dB
Three values are chosen for the K-factor: 0, 5 dB and 10 dB. The first corresponds to a Rayleigh scenario, which can be observed indoors and outdoors. The other two are Rician channels. The 20 ns rms delay spread corresponds to indoor systems when the two terminals are close. The 50 ns value represents an indoor system with high delay spread or outdoor hot-spots. The 150 ns value represents an outdoor case. Consequently, we have classified the channel model into 9 scenarios, as listed in Table 2. [9] also provides us the BER vs SNR performance curves (such as the
one in Fig.2, more can be found in [15]) of different rates (6, 12, 18, 24, 36 and 54 Mbps, under STBC and SM mode respectively) under each channel scenario. The ray-tracing data used for the simulations is also from [9].
5.2 MIMO LA in an Indoor Random Topology
This section presents the simulation results of the MIMO LA in mesh mode in an indoor environment, a typical office with columns, screens and furniture. We pick 6 nodes, Node A to Node F, for the simulation. The PHY channel of each transmitter-receiver pair is unique according to the ray-tracing data. For example, between Node A and Node C, the LOS signal is fairly strong and the distance is short, so the SNR is large and the chosen rate is very high (54 Mbps). Because of the multipath and the distance between Node A and Node E, the rate chosen for that pair is low (12 Mbps). The rate therefore needs to be set for each pair, which means the transmitting station keeps a set of LA parameters for each possible destination station. Since the rms delay spread of the indoor RF environment is small, the MIMO LA generally favours the STBC mode only. Since only one MIMO mode (STBC) is used in this indoor environment, legacy WLAN LA schemes such as ARF can be applied directly, so we use ARF as a baseline for comparison with the MIMO LA. In the ARF simulation, the MIMO operating mode is fixed at STBC. We made sure that the speed chosen by ARF is destination-oriented, which means the transceiver has a separate ACK counter for each destination. Because of the possibly high level of contention, it is worth applying RTS/CTS to reduce the impact of collisions. In the MIMO LA, whenever a transmission fails, the transceiver switches to the conservative mode. The transceiver does not know whether the failure is due to a collision or to an encoding/decoding error, and changing the transmission rate does not affect the collision probability. Therefore, working in the conservative mode maximises the chance of receiving the MFB. Fig.5 gives the throughput and the rate chosen with three and six traffic flows. Every flow is saturated. The packet size is 1000 bytes.
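To make the baseline concrete, a destination-oriented ARF can be sketched as one small state machine per destination. The sketch below is an assumption-laden illustration rather than the simulated implementation: the step-up threshold of 10 consecutive successes and the 54 Mbps starting rate follow the description of ARF in this section, while the step-down threshold of 2 consecutive failures and the names `ArfState` and `arf_by_destination` are choices made here.

```python
from collections import defaultdict

RATES_MBPS = [6, 12, 18, 24, 36, 54]  # per-antenna speeds used in the simulations
UP_AFTER = 10    # consecutive successes before trying the next higher rate
DOWN_AFTER = 2   # assumed: consecutive failures before stepping down


class ArfState:
    """ACK counters and current rate for a single destination."""

    def __init__(self) -> None:
        self.index = len(RATES_MBPS) - 1  # start from the highest rate
        self.successes = 0
        self.failures = 0

    @property
    def rate(self) -> int:
        return RATES_MBPS[self.index]

    def on_success(self) -> None:
        self.failures = 0
        self.successes += 1
        if self.successes >= UP_AFTER and self.index < len(RATES_MBPS) - 1:
            self.index += 1
            self.successes = 0

    def on_failure(self) -> None:
        # ARF cannot tell a collision from a channel error: both count as failures.
        self.successes = 0
        self.failures += 1
        if self.failures >= DOWN_AFTER and self.index > 0:
            self.index -= 1
            self.failures = 0


# One independent state per destination station.
arf_by_destination = defaultdict(ArfState)
arf_by_destination["D"].on_failure()
arf_by_destination["D"].on_failure()
print(arf_by_destination["D"].rate, arf_by_destination["E"].rate)
```

Since every missing ACK is treated alike, repeated collisions push the rate down even when the channel itself would support a higher one, which is the weakness the comparison with the MIMO LA probes.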
In the scenario with three flows (Node A to Node E, Node C to Node D, and Node F to Node B), the overall throughput achieved using ARF fluctuates wildly, as shown in Fig.5.a. This is because the information used in the decision making is not accurate; the speed selected by ARF is therefore probably not the best rate at that moment. Fig.5.b shows the rate selected (excluding the conservative rate used by RTS for probing) from Node C to Node D in the first 10 seconds. The magenta line, representing the MIMO LA, constantly chooses 24 Mbps, as its decision making is not affected by collisions. The blue line represents ARF: after 10 consecutive successful transmissions at 24 Mbps, it tries 36 Mbps. Because this rate gives a BER of $1.27 \times 10^{-4}$, the packet error rate (PER) is then 65%, and the transmission quickly drops back to a lower rate. From 24 Mbps down to the lower rates, the drop in speed is entirely due to collisions.
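The quoted PER can be checked against the success probability $(1-b)^{L+l_m}$ used in the throughput equations. The short sketch below assumes the 1000-byte packet size stated for these simulations and a 30-byte MAC header; the header length and the function name `packet_error_rate` are assumptions made for illustration.

```python
def packet_error_rate(ber: float, bits: int) -> float:
    """PER = 1 - (1 - b)^n for n independently received bits."""
    return 1.0 - (1.0 - ber) ** bits


payload_bits = 1000 * 8      # packet size of 1000 bytes
mac_header_bits = 30 * 8     # hypothetical MAC header length
per = packet_error_rate(1.27e-4, payload_bits + mac_header_bits)
print(f"PER = {per:.1%}")
```

Under these assumptions the result lands close to the 65% figure, and it stays in the same range when the header is left out, so the quoted PER is consistent with the stated BER.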
Fig. 5. Throughput and rate chosen in the MIMO LA and ARF: (a) throughput with 3 flows; (b) rate chosen (Node C to D); (c) throughput with 6 flows; (d) rate chosen (Node C to D). Panels (a) and (c) plot Throughput (Mbps) against Time (second); panels (b) and (d) plot Transmission Speed Chosen (Mbps) against Time (second). Each panel shows one curve for MIMO LA and one for ARF.
Fig.5.c and Fig.5.d show a case with more intense contention, where Nodes A - F all have packets to send and each destination station is chosen at random with equal probability. The collision probability in the 6-flow case is much larger than in the 3-flow case. The ARF scheme starts from 54 Mbps, but because of the many failures caused by collisions, it quickly drops to the lowest rate and has very little chance of returning to higher ones.
5.3 An Outdoor Environment
In the outdoor environment, the rms delay spread tends to be much larger and the channel is more time-varying. When the rms delay spread is larger than 35 ns and the K factor is less than 2.5 dB, which means that none of the multipath components is significantly stronger than the others and the multipaths are more varied, the SM mode has a chance of being chosen as the best mode. Of course, another condition is that the SNR is large enough (about 35 dB). In the outdoor scenario we use, most of the transmission links are categorised as channel model H_150_0, which is similar to Channel F in [11]. To illustrate the LA in a MIMO system, we pick some receiver locations for which the transmitter selects different rates. The transmitter is fixed at location 0. The receiver is placed in turn at locations 1, 2, ..., 6.
Fig.6 lists the throughput when the receiver is at different locations. The best rates for these locations are listed in the x-axis (reflecting the MCS) and y-axis (reflecting the MIMO mode). When the receiver is at location 1, the channel between them is of H_50_10. The SM does not perform well when there is a strong LOS path. The SNR is more than 20 dB, so the selected rate is STBC 54 Mbps. At location 2, the channel becomes of H_150_0. The BER of SM at 36 Mbps is about $8.8 \times 10^{-6}$, which gives a higher throughput (23.2 Mbps) than STBC at 54 Mbps (23.1 Mbps). The other locations (3, 4, 5 and 6) all choose STBC as the SNR is not large enough for SM.
Fig. 6. MIMO LA selected at different locations and the throughput achieved in an outdoor environment (axes: MCS Chosen (Mbps); MIMO mode, SM or STBC; Throughput (Mbps); one bar per receiver location 1 to 6)
6 Conclusion
The use of MIMO in the PHY of an HT WLAN (802.11n) has fundamentally changed the requirements for the link adaptation process. The traditional LA algorithms developed for legacy WLANs, e.g. 802.11a, become barely effective or valid in a MIMO WLAN. This paper presents an LA algorithm specifically proposed for the MIMO system used in HT WLAN (802.11n). The algorithm takes into account the impact of different RF environments on the MIMO operation modes, utilises the channel state information and operates in a ‘closed-loop’ manner. Analysis and simulations carried out under various conditions show the effectiveness of the algorithm over the traditional LA algorithms.
Acknowledgment
The author would like to thank the OFCOM-sponsored project Antenna Array Technology and MIMO Systems for providing the data for the STBC and SM modes.
References
1. IEEE P802.11n/D2.00, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Amendment: Enhancements for Higher Throughput (February 2007)
2. Kamerman, A., Monteban, L.: WaveLAN-II: A high-performance wireless LAN for the unlicensed band. Bell Labs Technical Journal, 118–133 (Summer 1997)
3. Holland, G., Vaidya, N., Bahl, P.: A Rate-Adaptive MAC Protocol for Multi-Hop Wireless Networks. In: Proc. ACM MobiCom 2001, Rome, Italy (July 2001)
4. Lacage, M., Manshaei, M.H., Turletti, T.: IEEE 802.11 Rate Adaptation: A Practical Approach. In: MSWiM 2004, Venezia, Italy (October 2004)
5. Qiao, D., Choi, S.: Fast-Responsive Link Adaptation for IEEE 802.11 WLANs. In: ICC 2005, Seoul, Korea (May 2005)
6. Wong, S., Yang, H., Lu, S., Bharghavan, V.: Robust Rate Adaptation for 802.11 Wireless Networks. In: MobiCom 2006, Los Angeles, USA (September 2006)
7. Kim, J., Kim, S., Choi, S., Qiao, D.: CARA: Collision-Aware Rate Adaptation for IEEE 802.11 WLANs. In: IEEE INFOCOM 2006, Barcelona, Spain (April 2006)
8. Tse, D., Viswanath, P.: Fundamentals of Wireless Communication. Cambridge University Press, Cambridge (2005)
9. Williams, C. (ed.): Antenna Array Technology and MIMO Systems. doc. 8366CR2, Deliverable 2 for OFCOM (June 2004), http://www.ofcom.org.uk/research/technology/spectrum_efficiency_scheme/ses2003-04/ay4476b/antenna_array_tech.pdf
10. Medbo, J., Schramm, P.: Channel Models for HIPERLAN/2, ETSI EP BRAN, document no. 3ERI085B (1998)
11. Erceg, V., et al.: TGn Channel Models, IEEE 802.11 document 11-03/940r2 (January 2004)
12. Park, M., Choi, S., Nettles, S.: Cross-layer MAC Design for Wireless Networks Using MIMO. In: IEEE Globecom 2005, St. Louis, MO, USA (2005)
13. Abraham, S., Meylan, A., Nanda, S.: 802.11n MAC Design and System Performance. In: IEEE Globecom 2005, St. Louis, MO, USA (2005)
14. Alamouti, S.M.: A simple transmitter diversity scheme for wireless communication. IEEE Journal on Selected Areas in Communication 16, 1451–1458 (1998)
15. Doufexi, A., Tameh, E., Williams, C., Nix, A., Beach, M., Prado, A.: Capacity and coverage enhancements of MIMO WLANs in realistic environments. In: ICC 2006, Istanbul, Turkey (2006)
Thorough Analysis of 802.11e Star Topology Scenarios in the Presence of Hidden Nodes
Katarzyna Kosek (1), Marek Natkaniec (1), and Luca Vollero (2)
(1) AGH University of Science and Technology, Department of Telecommunications, al. Mickiewicza 30, 30-059 Kraków, Poland
(2) Consorzio Interuniversitario Nazionale per l'Informatica, Laboratorio ITeM-CINI, Complesso Universitario Monte S.Angelo-Edificio Centri Comuni, Via Cinthia, 80126 Napoli
{kkosek,natkaniec,vollero}@ieee.org
Abstract. In this paper the authors present a simulation study of two 802.11e network scenarios. The presented analysis is not only novel but, above all, crucial for understanding how a theoretically simple star topology network can be degraded by the presence of hidden nodes. The authors discuss the results obtained during the analysis of two different star topologies where the hidden node problem exists and compare them with the corresponding ones without hidden nodes. Additionally, the usefulness of the four-way handshake mechanism is discussed and brief descriptions of the currently known solutions to the hidden node problem are given. Finally, the authors signal the need for a better MAC protocol and provide a number of important conclusions about the nature of 802.11e.
Key words: hidden nodes, QoS, IEEE 802.11e.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 792–803, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction
Wireless networking is currently one of the fastest-evolving technologies, and its importance grows constantly. Wireless devices appear almost everywhere (at home, in companies, in public places, etc.) and help in everyday life. One of the most interesting access technologies, from the average user's perspective, is the ad-hoc network. Ad-hoc networks have no infrastructure, do not need complicated administration and have one more undoubted advantage, i.e., they greatly facilitate Internet access. Unfortunately, wireless networks were created to deal with data exchanges and not with multimedia services. Therefore, QoS assurance for delay-sensitive and/or bandwidth-consuming services remains an interesting and unresolved issue. The nature of ad-hoc networks (i.e., constantly changing and unpredictable channel conditions, varying load, changeable device performance, different transmission and sensing ranges, mobility, hidden and exposed node problems, etc.) makes it even more difficult. In this article the authors focus on the hidden node problem because it seems the most interesting from their point of view.
The purpose of simulating star topologies is simple. A good example of such a case in a real environment is a gateway (GW) with several nodes that use unidirectional antennas. The nodes need to communicate with the GW every time they access Internet services. At the same time, these nodes stay hidden from each other and can hear only the GW’s transmissions. Therefore, given the popularity of the described topology, the authors found it crucial to check whether 802.11e can provide QoS guarantees in such environments. Preliminary results of the star topology simulations can be found in [10]. However, they are less detailed than the ones presented in this paper: in [10] only one configuration of a five-node star scenario was analyzed, and it was not compared to any other configuration. The analysis presented in this article helps draw new and more thorough conclusions about 802.11e. Additionally, it shows the similarity in performance of the four-node and five-node star topologies, which helps in formulating general conclusions. Among the many consequences of the presence of hidden nodes, the most important seem to be the unavoidable unfairness in granting medium access and the distortion of the throughput levels obtained by traffic streams of different priority. The paper also gives a brief description of the known solutions to the hidden node problem and discusses the usefulness of the most commonly used one, the four-way handshake.
The remainder of this paper is organized as follows. Section 2 contains the state of the art. Section 3 describes the simulation scenarios. Section 4 gives conclusions on the obtained results, which can be found in the Appendix. The concluding remarks can be found in Section 5.
2 State of the Art
The IEEE 802.11e Standard
Quality of Service support in wireless networks is provided by the IEEE 802.11e standard [1]. The standard defines the Hybrid Coordination Function (HCF), which includes the HCF Controlled Channel Access (HCCA) and the HCF contention-based channel access, also known as the Enhanced Distributed Channel Access (EDCA). HCCA and EDCA are interoperable channel access mechanisms. HCCA is based on polling, while EDCA is based on a slotted and highly parametric CSMA/CA protocol. Both mechanisms distribute Transmission Opportunities (TXOPs), in which nodes are allowed to transmit one or more data frames. EDCA defines the concept of Access Category (AC). Each node may use up to four ACs, which represent four priority levels for data transmission. The standard names these levels background (BK), best effort (BE), video (Vi) and voice (Vo). Each AC implements a slotted CSMA/CA algorithm with its own parameters and competes with other ACs in order to obtain TXOPs. When it obtains a TXOP, it can transfer at least the first frame waiting in its transmission queue. Moreover, the AC can transmit more frames if allowed. Specifically, each AC has a maximum channel occupancy time, called TXOPlimit, which is a configurable QoS parameter. If its TXOPlimit is equal to zero, the AC is allowed to transmit only one frame for each TXOP it gains. When the TXOPlimit is greater than zero, the AC is allowed to transmit as long as the total channel occupancy time is less than or equal to the TXOPlimit.
An idle AC starts competing for a TXOP upon the arrival of a new frame in its queue. If the frame arrives and no other ACs are active in the same node, the AC checks whether the wireless medium is idle or busy. If the channel is idle, the AC ensures that it remains idle for a fixed interval of time, i.e., the Arbitration Inter-Frame Space (AIFS[AC]), which is another configurable QoS parameter. As soon as the AIFS has expired, the AC is allowed to transmit. If the transmission is successful, the receiving node transmits back to the sender a special acknowledgement (ACK) frame, acknowledging the successful transmission. The AC that has obtained the TXOP handles the transmission of all the frames waiting in its queue until its TXOPlimit has been consumed. If the transmission is unsuccessful, the AC enters the Backoff process. This process is also used when the channel is sensed busy during the first AIFS, when another AC in the same node is busy or when the last TXOPlimit of the AC is too close in time. As soon as the Backoff process is started, the AC updates an internal variable, called Backoff Timer (BT). When updating, the value of the BT is drawn randomly from the set $\{0, 1, \ldots, \min(CW_{max}[AC],\ 2^{k} \cdot CW_{min}[AC])\}$, where CWmin[AC] and CWmax[AC] are the minimum and the maximum Contention Window and k is the number of collisions experienced by the current frame. CWmin and CWmax are also configurable QoS parameters. An AC with BT equal to zero is allowed to attempt a transmission in the first slot time following an idle AIFS or an empty slot time. The BT is decremented in each slot time following an AIFS or an empty slot time.
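The Backoff Timer draw described above can be illustrated with a short sketch. It follows the set given in this section, using the CWmin, CWmax and AIFSN values of the EDCA parameter set in Table 1 of this paper; the function names and the `rng` parameter are illustrative assumptions, and the sketch covers only the random draw, not the full EDCA state machine.

```python
import random

# AC: (CWmin, CWmax, AIFSN), taken from the EDCA parameter set in Table 1.
EDCA_PARAMS = {
    "VO": (7, 15, 2),
    "VI": (15, 31, 2),
    "BE": (31, 1023, 3),
    "BK": (31, 1023, 7),
}


def backoff_upper_bound(ac: str, collisions: int) -> int:
    """Largest BT value: min(CWmax[AC], 2^k * CWmin[AC]) for k collisions."""
    cw_min, cw_max, _aifsn = EDCA_PARAMS[ac]
    return min(cw_max, (2 ** collisions) * cw_min)


def draw_backoff_timer(ac: str, collisions: int, rng=random) -> int:
    """Draw BT uniformly from {0, 1, ..., upper bound}."""
    return rng.randint(0, backoff_upper_bound(ac, collisions))


for k in range(4):
    print(k, backoff_upper_bound("VO", k), backoff_upper_bound("BK", k))
```

With these parameters the VO bound saturates at CWmax = 15 after two collisions, while the BK bound keeps doubling for several more, which is one way the ACs are differentiated.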
Fig. 1. AC channel access time-sequence
An example of the EDCA access mechanism is shown in Fig. 1 (with TXOPlimit equal to zero). If the queue of the AC is empty after its last successful transmission, the standard mandates the updating of the BT. This Post-Backoff is executed like the normal Backoff process. If the queue is still empty when the BT has expired, the BT remains equal to zero until a new frame arrives. If a new frame arrives before the BT has expired, it is served using the Backoff process, inheriting the current BT value of the AC. Because EDCA is the basic IEEE 802.11e mode implemented in real devices, the authors find it crucial to check whether it can provide QoS guarantees in different star topologies. A TXOPlimit different from zero was not analyzed in the simulations they performed.
The Hidden Node Problem
One of the significant disadvantages of wireless networks is that, even though the possible PHY layer rates are satisfactorily high, the MAC layer is not able to use the whole bandwidth. There are two main reasons for this. First, the current MAC proposals are suboptimal. Second, node starvation is unavoidable as long as hidden nodes are present within wireless networks. Hidden nodes may appear due to the half-duplex nature of wireless devices, i.e., when two nodes are out of range of each other they are unable to hear each other's transmissions. Therefore, if they start their data transmissions simultaneously, a physical collision must happen.
Solutions of the Hidden Node Problem
The best-known solution used to minimize the destructive effects of hidden nodes is the four-way handshake. The mechanism uses four different types of frames, i.e., Request to Send (RTS), Clear to Send (CTS), Data (DATA) and Acknowledgement (ACK), which are exchanged during the process of granting medium access. Unfortunately, this solution has several disadvantages, of which only the most important are given here. Firstly, it is optional and, therefore, not always used. Secondly, it minimizes the hidden node problem only when the network is of a single-hop type. Thirdly, it consumes bandwidth even if no hidden nodes appear within the network. Finally, due to the exchange of the additional RTS/CTS frames, the mechanism is unsuitable for delay-sensitive traffic. An important improvement to the four-way handshake is Multiple Access with Collision Avoidance for Wireless (MACAW, [5]), where five different types of frames are exchanged, i.e., RTS, CTS, Data Sending (DS), DATA and ACK. Additionally, to increase the per-node fairness, MACAW involves a Request to RTS (RRTS) control frame. The biggest weaknesses of MACAW are the unsolved exposed node problem and the increased signaling overhead. Therefore, in contrast to the sender-initiated handshake mechanisms, a family of receiver-initiated mechanisms has been proposed.
In these channel-granting solutions, the receiver must poll its neighbors in order to check whether they have packets destined for it. The best known is Multiple Access Collision Avoidance By Invitation (MACA-BI, [6]), where a three-way handshake, i.e., CTS/DATA/ACK, is invoked for every frame transmission. However, this mechanism is unsuitable for ad-hoc networks, where polling a station that has no packets to send is a waste of time. Due to the weaknesses of the previous mechanisms, hybrid solutions have appeared (e.g., [7]) which take advantage of both receiver-initiated and sender-initiated channel access methods. They assure better fairness and decrease end-to-end delay. However, they cannot guarantee QoS for delay-sensitive traffic and were tested only in 802.11 environments. There also exists a family of protocols involving busy tone signals. The two best-known solutions are Busy Tone Multiple Access (BTMA, [8]) and Dual Busy Tone Multiple Access (DBTMA, [9]). BTMA is dedicated only to networks with infrastructure. DBTMA uses two busy tone signals and two sub-channels (i.e., communication and control), but it does not pay enough attention to possible interference on the control channel and does not involve ACKs, which seems illogical in the case of unreliable wireless links. One other solution is Floor Acquisition Multiple Access with Non-persistent Carrier Sensing (FAMA-NCS, [11]). It uses long CTS frames whose aim is to prevent any contending transmissions within the receiver’s range. Unfortunately, this scheme requires all nodes to hear the interference, which makes the mechanism inefficient for short DATA frames.
As shown above, the literature offers several alternatives to the four-way handshake mechanism; however, none of them is widely used. Usually, it is the RTS/CTS/DATA/ACK exchange which is selected to deal with the hidden node problem. Therefore, only this protocol is validated in the performed tests.
3 Simulation Scenarios
The simulation analysis was performed with the use of an improved version of the TKN EDCA enhancement [3] to the ns2 simulator. The adjustments made affect the RTS/CTS mechanism and were verified with the use of the OPNET [4] modeler. All important simulation parameters are given in Tables 1 and 2.

Table 1. EDCA parameter set
Priority   AC   CWmin[AC]   CWmax[AC]   AIFSN[AC]
P0         VO   7           15          2
P1         VI   15          31          2
P2         BE   31          1023        3
P3         BK   31          1023        7

Table 2. General simulation parameters [1]
Node   Four-node star          Five-node star
       C. 1   C. 2   C. 3      C. 1   C. 2   C. 3
N1     P3     P0     P3        P3     P0     P3
N2     P1     P0     P0        P2     P3     P0
N3     P2     P1     P1        P2     P2     P1
N4     P3     P2     P2        P1     P1     P2
N5     -      -      -         P3     P0     P3

Parameter                    Value
Slot Time                    20 μs
SIFS                         10 μs
PIFS                         30 μs
DIFS                         50 μs
Tx Power                     0.282 W
Tx Range                     250 m
Frame Size                   1000 B
Traffic Type                 CBR/UDP
Node Distance                200 m
Wireless Standard            802.11b w/ 802.11e extension
Carrier Sensing (CS) Range   263 m (network w/ hidden nodes), 550 m (network w/o hidden nodes)
The first scenario consisted of four nodes, three of which were hidden from each other (Fig. 2a). The second scenario consisted of five nodes, four of which were hidden (Fig. 2b). The priorities of the flows were changed during the tests (Table 2) and, therefore, they are not presented in the figures.
Fig. 2. Four-node (a) and five-node (b) star topology network (legend: Transmission Direction, Transmission Range)
The simulation study was performed under the assumption that all nodes send the same traffic, with the sending rate varying from 10 kb/s to 5 Mb/s. IEEE 802.11b with the IEEE 802.11e extension is used for the MAC and PHY layers in the communication protocol. In all simulations the packet generation times of different nodes are not synchronized. The authors assume a frame size of 1000 B for all traffic priorities (even VO). This assumption is made in order to avoid inefficient transmissions of small frames and, primarily, to compare the four EDCA queues under similar conditions. The analysis was performed with basic channel access (DATA/ACK) and with the four-way handshake mechanism (RTS/CTS/DATA/ACK). In the presented figures showing throughput and frame loss, the errors do not exceed ±2% assuming 95% confidence intervals.
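Why the outer nodes are hidden at a carrier sensing range of 263 m but not at 550 m can be checked with simple geometry. The sketch below assumes that the outer nodes are spaced evenly on a circle whose radius equals the 200 m node distance, with the unhidden node at the centre; the exact layout of Fig. 2 is not reproduced here, so this placement and the function names are assumptions.

```python
import math
from typing import List, Tuple


def star_positions(n_outer: int, radius: float) -> List[Tuple[float, float]]:
    """Outer nodes evenly spaced on a circle around the central node (assumed layout)."""
    return [
        (radius * math.cos(2 * math.pi * i / n_outer),
         radius * math.sin(2 * math.pi * i / n_outer))
        for i in range(n_outer)
    ]


def hidden_pairs(n_outer: int, radius: float, cs_range: float) -> List[Tuple[int, int]]:
    """Pairs of outer nodes that are farther apart than the carrier sensing range."""
    pts = star_positions(n_outer, radius)
    return [
        (a, b)
        for a in range(n_outer)
        for b in range(a + 1, n_outer)
        if math.dist(pts[a], pts[b]) > cs_range
    ]


for n_outer in (3, 4):            # four-node and five-node star
    for cs_range in (263.0, 550.0):
        print(n_outer + 1, cs_range, len(hidden_pairs(n_outer, 200.0, cs_range)))
```

Under this layout the closest outer nodes are about 283 m apart in the five-node star and about 346 m apart in the four-node star, so every outer pair is hidden at 263 m and none at 550 m, while each outer node stays within range of the centre at 200 m.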
4 Simulation Results
Three separate configurations were analyzed in order to have enough data for general conclusions. The results obtained for the four-node star were compared to the ones obtained for the five-node star. They turned out to be quite similar, i.e., not only the general conclusions but also the specific ones (e.g., the relations of the throughput levels to the number of DATA collisions (COL), interface queue (IFQ) drops, address resolution protocol (ARP) drops, and MAC retransmission (RET) drops) were the same. Only the curve levels were different (i.e., a smaller overall throughput and a higher number of lost frames were observed for the five-node star). Because of this similarity, and for clarity of presentation, only the five-node configurations are presented in detail. However, the general conclusions from Section 5 also hold for the results obtained in the four-node configurations. Finally, due to the lack of space, only the overall frame loss curves are presented in the paper. They represent the aggregated number of IFQ, RET and ARP drops.
A. First Configuration
First configuration results are shown in Fig. 3 and Fig. 6. In this setup the number of low priority streams (4) is higher than the number of high priority ones (1). Fig. 3-(a) shows that the order of the throughput levels is slightly distorted. For the hidden nodes, VI has the highest throughput while BE and BK have lower throughput. However, the unhidden N1, sending BK traffic, achieves significantly higher throughput than the hidden nodes sending BK and BE. Moreover, the throughput of N1 is almost the same as that of N4 (which sends VI priority traffic), and the throughput levels of the low priority streams (i.e., for N2, N3, and N5) are unacceptably low. Furthermore, when the four-way handshake mechanism is enabled the fairness is slightly better, but it is still not ideal (cf. the dashed lines in Fig. 3). On the basis of these observations the following question can be formulated:
Why is the throughput $th_{N4} > th_{N1} \gg th_{N2} \cong th_{N3} \cong th_{N5}$?
When the four-way handshake is disabled, N4 has almost three times more COLs and two times fewer RET drops than the other hidden nodes, and between 1.2 and 1.4 times fewer IFQ drops than any other node. The number of COLs is about 32 times lower than the number of RET drops and, therefore, it can be concluded that the predominant number of frames are received by the destination without the need for MAC retransmissions. N1 has the smallest number of COLs and RETs; however, it also has more IFQ drops than N4. Therefore, N4 has generally higher throughput than the unhidden N1 because of its traffic priority class. The high number of collisions experienced by N4 is also caused by its traffic priority, because the higher the priority, the higher the probability that the node will try to gain access to the medium. It can collide more often with other nodes but, on the other hand, it can also have fewer frames in its MAC priority queues. N1 experiences hardly any collisions because it is the only unhidden node and can hear other transmissions practically all the time. Moreover, when N1 transmits without colliding, it completely captures the channel, avoiding any interference from other nodes. The overall frame loss shows that the number of lost frames for N4 is lower than the numbers observed for the other nodes, which is in line with the throughput level it achieved. With RTS/CTS enabled the overall frame loss situation is similar to the previous one because basically only the curves’ levels change. The curve representing the number of IFQ drops is a bit higher for N1 and a bit lower for N4. The number of COLs is reduced practically to zero for all nodes.
Also, the number of RET drops changes significantly and, for all hidden nodes, is almost two times lower than with RTS/CTS disabled. Therefore, the throughput levels changed accordingly (cf. the dashed lines in Fig. 3).
Another experiment was performed in which the hidden nodes' carrier sensing range was increased so that they were no longer hidden. The obtained results are presented in Fig. 6. In such a configuration the medium access was fair. However, it is also noticeable that enabling RTS/CTS causes a decrease in the overall throughput (mostly due to the increase in signaling overhead and the low sending rate of RTS/CTS frames, which is 2 Mb/s), which results in worse overall network performance. Therefore, it seems reasonable to disable the four-way handshake in star environments where the problem of hidden nodes does not exist. Nodes with BK priorities have the highest number of lost frames which is lower than the number observed for BE which is in turn lower than the number for VI. With RTS/CTS enabled the number of frames lost for all priorities is higher than with RTS/CTS disabled. The most important impact on frame loss comes from IFQ drops (the number of which is incomparably higher than the number of ARP or MAC retransmission drops). Therefore, practically only the IEEE 802.11e mechanism has an impact on network performance.
B. Second Configuration
The results for the second configuration are presented in Fig. 4 and Fig. 7. The most important change with respect to the previous configuration is that N1 has VO priority instead of BK and, additionally, there are more high priority streams (3) than low priority ones (2).
Important conclusions from the results shown in Fig. 4 are as follows. N1 gains medium access more often than any other node, both with RTS/CTS enabled and with RTS/CTS disabled. When the four-way handshake mechanism is enabled the fairness of granting medium access is better, i.e., hidden nodes sending VI and VO have better throughput than the ones sending BK. The throughput of N5, with RTS/CTS disabled, equals zero. With a smaller offered load (up to 0.4 MB/s), the throughput of N2 and N3 is higher than the throughput of N4, even though they send lower priority traffic. The dominance of N1 over the hidden nodes is higher than it was in the second case; therefore, the unanswered questions are:
Why is $th_{N1} \gg th_{N2} \cong th_{N3} \cong th_{N4} \geq th_{N5}$ with RTS/CTS disabled?
N5 experiences the greatest number of COLs which is, however, practically equal to the number observed for N4. The number of RET drops is also highest for these two nodes. The numbers of COLs for N2 and N3 are practically equal and lower than those observed for N5 and N4. Additionally, they experience fewer RET drops than N5 and N4 but visibly more than N1. In general, the number of frame drops observed for the hidden nodes can be explained by the comparable priorities of the traffic carried by N5 and N4 and by N2 and N3. Higher priority traffic can be sent more often and thus experiences more COLs, as already described for the previous configuration. N1 has almost no COLs and, therefore, its number of RET drops also falls to zero. This performance is caused by the unhidden placement of N1. The number of IFQ drops is highest for N2 and N3, a bit lower for N4 and N5 and lowest for N1, while the overall frame loss is almost equal for all hidden nodes, provided that the offered load is higher than 0.4 Mb/s, and much higher than that observed for N1.
Why is $th_{N1} \gg th_{N5} > th_{N4} > th_{N3} \geq th_{N2}$ with RTS/CTS enabled?
The number of COLs decreased almost to zero for all nodes. 
The number of RET drops was lowest for N1 (equal to 0), a bit higher for N2 and N3 and highest for N4 and N5. The number of IFQ drops was practically equal for N2 and N3, slightly higher than for N4 and N5, and much higher than for N1. These numbers can be explained by the priorities of the traffic carried by the particular hidden nodes and by the placement of the unhidden N1. The overall frame loss was highest for N2 and N3, lower for N4, slightly lower still for N5 and lowest for N1; this determines the obtained throughput.
Fig. 7 shows the ideal case, i.e., the environment in which no hidden nodes are present. Frame loss levels obtained by the competing stations are in the correct order, which is in line with the obtained throughput. Once again, enabling the RTS/CTS exchange causes a significant decrease in the overall network throughput and an increase in the overall frame loss for each priority. In both cases, IFQ drops have the strongest impact on the achieved results.
C. Third Configuration
The results for the third configuration are shown in Fig. 5 and Fig. 8. There are slightly more low priority streams (3) than high priority ones (2). The most important conclusions based on the results presented in Fig. 5 are as follows. N1 had lower traffic priority than in the previous case, which resulted in better fairness in granting medium access when the four-way handshake was enabled. N1 tried to send data less often,
which is a result of the high CWmin and CWmax values associated with BK traffic. Unfortunately, once more with RTS/CTS disabled the nodes with P0 and P1 priorities did not get enough bandwidth when hidden nodes were present in the network. Up to an overall traffic load of 0.3 MB/s, BE and BK streams took advantage over VI and, up to 0.4 MB/s, over VO. Additionally, the throughput of N5 with RTS/CTS disabled once again equals zero. The change in fairness is noticeable mostly when the four-way handshake is enabled, i.e., hidden nodes sending VI and VO traffic have better throughput than the ones sending BE and BK traffic. However, in comparison to configuration 1, the VO traffic levels are severely distorted. The explanation of this performance is given next.
Why is $th_{N1} \gg th_{N3} \cong th_{N4} \cong th_{N5} \geq th_{N2}$ with RTS/CTS disabled?
The number of COLs was highest for N2 and N3, lower for N4 and N5 and lowest for N1 (practically equal to zero). The number of RET drops was highest for N3, lower for N2, slightly lower still for N4 and N5 and lowest for N1 (equal to zero). These numbers can be explained by the traffic priorities and the unhidden position of N1. The number of IFQ drops was highest for N4 and N5, lower for N3 and N2 and lowest for N1. The overall frame loss was smallest for N1 and practically equal among the hidden nodes, which explains the throughput levels obtained.
Why is $th_{N1} > th_{N2} > th_{N3} > th_{N4} \cong th_{N5}$ with RTS/CTS enabled?
The number of COLs dropped practically to zero for all nodes. The number of RET drops was highest for N3, lower for N2, slightly lower still for N4 and N5 and lowest for N1 (equal to zero). The number of IFQ drops was smallest for N1, higher for N2, slightly higher still for N3 and highest for N4 and N5. However, the overall frame loss was highest for N4 and N5, smaller for N3, smaller still for N2 and smallest for N1. 
Therefore, it can be assumed that for the hidden nodes it was the 802.11e mechanism which mostly determined their performance. Obviously, the dominant role of N1 was caused by its unhidden placement. Furthermore, in comparison to configuration 1, the placement of N1 was even more important, as the traffic sent by the hidden nodes was generated and collided more often. This was caused by the fact that one of the hidden nodes was sending P0 instead of P2.
Fig. 8 presents the ideal environment, i.e., the same configuration without hidden nodes. The curves representing throughput have the expected shapes, i.e., higher priority traffic has higher throughput than low priority traffic, and streams of the same priority have the same throughput values. Frame loss curves also have the expected order and affect the throughput obtained by the different nodes. Enabling the RTS/CTS exchange causes a decrease in the overall throughput and an increase in the number of lost frames. With the four-way handshake both enabled and disabled, the main impact on the nodes' throughput comes from IFQ drops.
5 Conclusions and Future Work
The paper analyzed the impact of EDCA parameters in WLAN configurations where hidden nodes are present and compared them with configurations where the problem of hidden nodes did not exist. In the presented scenarios the imperfect channel sensing is a major cause of the asymmetry in resource allocation. Additionally, novel
simulation analysis showed that current EDCA cannot be used to counterbalance this effect alone if hidden nodes are present within a particular wireless network. As shown in the figures with RTS/CTS disabled, the higher the priority of the traffic carried by the hidden stations, the worse the throughput level they achieve. Furthermore, with RTS/CTS enabled, the problem of severe unfairness still exists. In both cases, node starvation is much worse when the unhidden station transmits high priority traffic. The analysis also showed that similar configurations behave similarly, i.e., the order of throughput levels obtained by different nodes for a five-node and a four-node star is exactly the same regardless of the traffic priority the unhidden station transmits. This suggests that common strategies for avoiding the hidden nodes' impact on EDCA may exist. The authors' future work will mainly focus on the definition of such strategies and on embedding them in ad-hoc protocols. Additionally, the authors plan to analyze different scenarios with a varied order of traffic priorities. They will analyze cases with smaller and larger numbers of hidden nodes in order to formulate even more general conclusions which will help in defining a new MAC protocol.
Acknowledgement
This work has been carried out under the EU-funded NoE CONTENT project no. 038423. The authors would like to thank OPNET Technologies, Inc. for supporting this research by providing the OPNET Modeler software.
Appendix
Fig. 3. Configuration 1 – network with hidden nodes: (a) throughput (b) frame loss
[Plots not reproduced. Panel (a): throughput [KB/s] versus total offered load [KB/s]; panel (b): frames lost versus total offered load [KB/s]. Curves: N1 (P3), N2 (P2), N3 (P2), N4 (P1), N5 (P3), each with RTS off and RTS on.]
Fig. 4. Configuration 2 – network with hidden nodes: (a) throughput (b) frame loss
[Plots not reproduced. Panel (a): throughput [KB/s] versus total offered load [KB/s]; panel (b): frames lost versus total offered load [KB/s]. Curves: N1 (P0), N2 (P3), N3 (P2), N4 (P1), N5 (P0), each with RTS off and RTS on.]
Fig. 5. Configuration 3 – network with hidden nodes: (a) throughput (b) frame loss
[Plots not reproduced. Panel (a): throughput [KB/s] versus total offered load [KB/s]; panel (b): frames lost versus total offered load [KB/s]. Curves: N1 (P3), N2 (P0), N3 (P1), N4 (P2), N5 (P3), each with RTS off and RTS on.]
Fig. 6. Configuration 1 – network without hidden nodes: (a) throughput (b) frame loss
[Plots not reproduced. Same axes as Fig. 3. Curves: N1 (P3), N2 (P2), N3 (P2), N4 (P1), N5 (P3), each with RTS off and RTS on.]
Fig. 7. Configuration 2 – network without hidden nodes: (a) throughput (b) frame loss
[Plots not reproduced. Same axes as Fig. 4. Curves: N1 (P0), N2 (P3), N3 (P2), N4 (P1), N5 (P0), each with RTS off and RTS on.]
Fig. 8. Configuration 3 – network without hidden nodes: (a) throughput (b) frame loss
[Plots not reproduced. Same axes as Fig. 5. Curves: N1 (P3), N2 (P0), N3 (P1), N4 (P2), N5 (P3), each with RTS off and RTS on.]
A Mobility-Adaptive TDMA MAC for Real-Time Data in Wireless Networks
Johannes Lessmann and Dirk Held
University of Paderborn, 33102 Paderborn, Germany
{lessmann,madmax}@upb.de
Abstract. In this paper, we present a TDMA MAC protocol for scenarios with real-time traffic like voice streams. While many previous works assume stationary networks, DynaMAC can quickly adapt to changing topologies while not violating delay guarantees for even a single data packet with high probability. By intelligent data aggregation, DynaMAC also accounts for the fact that the large synchronization preambles of high-speed PHYs make transmissions of very small packets like voice samples extremely inefficient. Further, it introduces a novel segmentation concept, which results in a discrete set of potential per-node delay guarantees, which can be exploited by higher layers to balance the end-to-end delay of routes with different lengths. Finally, the slot allocation strategy tries to ensure a contiguous placement of unassigned slots, so that larger best-effort packets can be efficiently transmitted with a minimum of fragmentation. We validate our propositions with expressive simulations.
1
Introduction
In cases of QoS sensitive traffic, TDMA MACs are an alternative to contention-based MAC protocols. The challenge with TDMA is to create an access schedule for the wireless medium. If the assigned slots are non-overlapping within the two-hop neighborhood of each node, all packets can be sent without risk of collisions. TDMA MACs also naturally lend themselves to periodic real-time traffic like voice or video streams for two reasons. First, they can maintain an established quality level (delay, bandwidth) even if the network load increases. Second, the overhead associated with allocating slots before actually transmitting is affordable, since periodic traffic will make use of the same slots for a comparatively long time, therefore making up for the initial cost.
While many existing TDMA protocols are designed for sensor networks and assume that the nodes do not move (much), our proposed protocol, DynaMAC, can efficiently handle changing topologies. With high probability, it can not only recover from mobility-induced packet collisions without dropping a single packet, but even uphold the delay guarantees for collided packets.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 804–811, 2008. © IFIP International Federation for Information Processing 2008
Another issue that we explicitly address is the fact that the PHY preambles can be very large compared to individual real-time data packets. A modern voice codec like G.729 [1], for example, produces 40 voice packets per second with
approximately 200 bits per sample. When transmitted with a high-speed PHY like 802.11g at full speed (54 Mbps), this implies a packet length of approximately 4 μs. Compared to the PHY overhead of 26 μs per packet, this results in a goodput reduction of about 86%. With 802.11b, which has a preamble of 192 μs, the overhead at 11 Mbps would be 98%. We therefore propose to aggregate real-time samples first on a per-sender, then on a per-receiver basis to utilize the channel more efficiently.
We finally introduce a novel segmentation concept that allows DynaMAC to offer higher layers (most importantly the routing layer) not only one, but multiple per-node delay guarantees from which they can choose. That way, QoS routes which require many hops on the routing level can make use of very small per-node delays, while short routes are free to choose a more relaxed delay offer. This greatly helps to achieve common end-to-end delays even among routes of different lengths. Traditionally, end-to-end delays are somewhat proportional to the hop count. With DynaMAC, packets on routes with many hops can be routed “faster”. Common end-to-end delays are particularly desirable for voice conversations, where values of more than 250 ms are not tolerable no matter how many hops the required route might have.
The remainder of this paper is organized as follows. Section 2 discusses previous TDMA MAC protocols and their shortcomings. In Section 3, DynaMAC is described in detail. Section 4 shows DynaMAC’s performance using simulation results. We conclude in Section 5.
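The preamble-overhead argument for 802.11g can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up here, and it models overhead simply as the preamble time divided by the total airtime of one packet, which is an assumption about how the quoted percentage was obtained.

```python
def airtime_us(payload_bits: float, rate_bps: float) -> float:
    """Time in microseconds needed to send the payload at the given PHY rate."""
    return payload_bits / rate_bps * 1e6


def overhead_fraction(payload_bits: float, rate_bps: float, preamble_us: float) -> float:
    """Share of the total airtime taken by the PHY preamble (assumed model)."""
    payload_us = airtime_us(payload_bits, rate_bps)
    return preamble_us / (preamble_us + payload_us)


# Values taken from the text: a 200-bit voice sample sent at 54 Mbps
# behind a 26 us PHY overhead.
sample_us = airtime_us(200, 54e6)
share = overhead_fraction(200, 54e6, 26.0)
print(f"payload airtime: {sample_us:.2f} us, preamble share: {share:.1%}")
```

Under this simple model the payload occupies a little under 4 μs and the preamble accounts for well over 80% of the airtime, which is in the same range as the figure quoted above; the exact percentage depends on rounding and on what is counted as overhead.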
2
Related Work
Many works have been published in the domain of QoS MACs. Here, we confine ourselves to TDMA approaches. In USAP [2], the author presents a slot allocation scheme where time is divided into frames of variable length. The length of a frame depends on the node density in the neighborhood. When there are many neighbors, the frame length must be doubled to create new transmission opportunities. Unfortunately, [2] does not specify how exactly slots are to be allocated and is also vague as to the details of frame doubling. In [3], the authors try to improve on USAP. Time is again divided into frames where the first slot is always unoccupied. When a new node needs a slot and no unassigned slot can be found, it eventually doubles its frame length (number of slots per frame), copies the schedule into what is now the second half and claims the first slot in the second half. Mobility-induced conflicts are also resolved by doubling the frame length. One conflicting node takes the original slot, the other one claims as described above. Effectively, this approach halves the transmission opportunities for the conflicting slot. This makes it unusable for real-time streams with fixed periodic traffic which we consider in this paper. In LMAC [4], each node owns exactly one slot in every frame. In the control part of its messages, a node announces the intended recipient for the current message. All nodes must listen to the control part of their neighbors to know whether they are the recipients or not. This approach does not make sense for
small slot sizes like the ones appropriate for real-time samples, since that would imply constant idle listening of all nodes even when they are rarely involved in any actual communication.
DRAND [5] has a schedule allocation strategy which is somewhat similar to ours (allocation by negotiation, schedule propagation). However, DRAND assumes that this schedule has to be constructed only once and for all; it does not provide any method to change or repair schedules. It is explicitly targeted at stationary networks and is not applicable to networks with mobile nodes. Besides, DRAND does not specify how messages for the initial schedule construction must be exchanged (contention-based or according to some preliminary schedule). Therefore, it is hard to compare its performance with that of DynaMAC, since the initial construction might take any amount of time.
None of the previous TDMA protocols can efficiently meet the requirements imposed by periodic real-time traffic with very small samples in a dynamic network of constantly moving nodes. Additionally, our proposed approach of frame segmentation, which allows us to uphold delay guarantees even in the face of mobility-induced collisions, is completely novel.
3
Architecture of DynaMAC
In DynaMAC, time is divided into cycles (frames). The length of a cycle depends on the rate of the (CBR) real-time traffic that is to be supported. A voice codec like G.729 produces 40 packets per second, i.e. one packet every 25 ms. To support this kind of traffic, the cycle length should therefore be 25 ms. If a mixture of real-time traffic types with different rates is to be routed, the least common multiple must be selected for the cycle length. A cycle is divided into a number $l_{cyc}$ of slots.
On a higher abstraction level, a cycle consists of a random phase and a scheduled phase. The slot count $l_{rnd}$ of the random phase is small compared to that of the scheduled phase. The random phase is accessed using CSMA without exponential backoffs and is used to exchange control data like topology information, slot allocations or, to some extent, best-effort traffic. The scheduled phase is accessed according to the TDMA schedule and accommodates the real-time traffic and all best-effort traffic which could not already be transmitted in the random phase. It is divided into $s_c$ segments of equal size, i.e. each segment has $l_{seg} = \frac{l_{cyc} - l_{rnd}}{s_c}$ slots. At the end of each segment are $l_{buf}$ dedicated buffer slots which are used for collision resolution, as will be described later. Finally, to allow new nodes to easily synchronize with their neighbors, the random phase is started with one synch slot. Each node $u$ sends a synch beacon in this slot with probability $\rho_u = \frac{1}{n_u + 1}$, where $n_u$ is the number of neighbors of $u$. Figure 1 depicts the structure of a DynaMAC cycle.
3.1
Timing Parameters for the Medium Access
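Before looking at the timing details, the cycle structure introduced in Section 3 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, the slot counts are made-up sample values (not those of Fig. 1), and placing the random phase at the start of the cycle with zero-based slot indices is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleLayout:
    """Hypothetical container for the slot counts of one DynaMAC cycle."""
    l_cyc: int  # total number of slots per cycle
    l_rnd: int  # slots in the random phase (slot 0 is the synch slot)
    s_c: int    # number of segments in the scheduled phase
    l_buf: int  # buffer slots at the end of every segment

    @property
    def l_seg(self) -> int:
        """Slots per segment: (l_cyc - l_rnd) / s_c."""
        scheduled = self.l_cyc - self.l_rnd
        if scheduled % self.s_c != 0:
            raise ValueError("scheduled phase must split evenly into segments")
        return scheduled // self.s_c

    def segment_of(self, slot: int) -> int:
        """Index of the segment that contains a scheduled-phase slot."""
        if not self.l_rnd <= slot < self.l_cyc:
            raise ValueError("slot is not in the scheduled phase")
        return (slot - self.l_rnd) // self.l_seg

    def is_buffer_slot(self, slot: int) -> bool:
        """True if the slot is among the last l_buf slots of its segment."""
        self.segment_of(slot)  # validates that the slot is scheduled
        offset = (slot - self.l_rnd) % self.l_seg
        return offset >= self.l_seg - self.l_buf


def beacon_probability(n_neighbors: int) -> float:
    """Probability 1 / (n_u + 1) of sending a synch beacon in the synch slot."""
    return 1.0 / (n_neighbors + 1)


# Made-up sample values for illustration only.
layout = CycleLayout(l_cyc=100, l_rnd=10, s_c=3, l_buf=2)
print(layout.l_seg, layout.segment_of(40), layout.is_buffer_slot(38))
print(beacon_probability(4))
```

With these sample values each segment spans 30 slots, and the beacon probability shrinks as the neighborhood grows, so that on average about one node per neighborhood transmits in the synch slot.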
Random Phase. For the whole random phase, all nodes must switch to listening mode. The random phase is accessed using CSMA without backoff. A node
Fig. 1. Structure of a DynaMAC cycle (frame). The numbers in brackets are sample values which are assumed for explanation purposes in this paper.
that wants to transmit computes a random slot number from the random phase and sends. In case of collisions, packets must be resent. To detect collisions, nodes which could not successfully receive a message in the random phase have to broadcast a NACK message a SIFS after the message. SIFS means “Short Inter-Frame Space” and is adopted from the 802.11 standard. In addition to NACKs, unicast messages are acknowledged by the recipients with an ACK.
Scheduled Phase. In the scheduled phase, nodes sleep whenever they are not scheduled for transmission or reception. Since a particular node u may be scheduled to send in slot s and receive in slot s + 1 or vice versa, a slot must not be occupied completely, but a SIFS must be left free at the end of each slot to allow the node’s radio to switch the radio mode. When mobility-induced collisions occur (cf. Section 3.5), the receiver v must be able to send a NACK to the sender u so that the situation can be resolved. Instead of sending a NACK as a direct reply (which would consume a lot of time in every slot), the NACK is piggybacked on the next message which is sent by v anyway. Since u knows the schedules of all of its one-hop neighbors, it can correctly identify the slot where to expect a potential NACK message and switch to listening mode accordingly. Since space must be reserved for the NACKs statically (regardless of whether they are actually sent or not), we also piggyback ACKs for confirmation of successful receptions. That way, a successful transmission can be distinguished from the case where the recipient has moved out of the sender’s transmission range.
3.2
Data Aggregation
As already indicated in Section 1, real-time stream samples can be very small compared to the synchronization preamble of the 802.11 PHY (which we continuously consider in this paper). Hence, transmitting only one sample per packet as a general rule is unacceptable. To our knowledge, we are the first to consider full data aggregation in the context of multi-hop TDMA MACs with delay-sensitive data. In DynaMAC, we try to combine all samples for which a node u is the sender into one packet. If a packet becomes too large for one slot, u tries to allocate a neighboring slot so that fragmentation can be avoided. This means that a packet will generally have multiple receivers, namely all nodes, for which a sample is
included in the aggregated packet. Hence, all receivers of a packet must listen for u at the particular slots where u transmits. To be able to do so, u must notify its recipients to listen before transmitting by a message in the random phase.
In Section 3.3, we will introduce the segmentation concept of DynaMAC. A DynaMAC cycle is divided into several segments to provide different delay bounds. There, we will see that our data aggregation is actually done independently for each segment. Hence, packets contain only those samples for which a node u is the sender and which are scheduled for the same segment.
3.3
Per-Node-Delay Guarantees
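The per-segment aggregation rule stated at the end of Section 3.2 (one aggregated packet per sender and segment, with possibly several receivers) can be sketched as follows. The tuple layout, the function names and the slot capacity are hypothetical and serve only to illustrate the grouping; the paper does not prescribe a data format.

```python
from collections import defaultdict


def aggregate_by_segment(samples):
    """Group the samples a node has to send into one packet per segment.

    `samples` is a list of (receiver, segment, size_bits) tuples, an
    assumed representation chosen for this sketch.
    """
    packets = defaultdict(lambda: {"receivers": set(), "bits": 0})
    for receiver, segment, size_bits in samples:
        packets[segment]["receivers"].add(receiver)
        packets[segment]["bits"] += size_bits
    return dict(packets)


def slots_needed(packet_bits: int, slot_capacity_bits: int) -> int:
    """Number of (neighboring) slots an aggregated packet would occupy."""
    return -(-packet_bits // slot_capacity_bits)  # ceiling division


# Node u forwards two samples in segment 0 and one in segment 1.
outgoing = [("v", 0, 200), ("w", 0, 200), ("v", 1, 200)]
packets = aggregate_by_segment(outgoing)
for segment, packet in sorted(packets.items()):
    print(segment, sorted(packet["receivers"]), packet["bits"])
print(slots_needed(packets[0]["bits"], 300))
```

In this sketch the packet for segment 0 has two receivers, both of which would have to listen in the slot (or slots) where u transmits.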
In a TDMA MAC, the per-node delay is given by the time difference between the slot when a sample is received by a node u and the slot when it is forwarded by u to the next hop. Consequently, in a TDMA MAC where slots are allocated conflict-free, it is possible in principle to require (and guarantee) per-node delays in a very fine-grained manner. However, our data aggregation strategy makes this kind of fine-grained approach very counterproductive, since aggregation relies on a certain number of streams that can be combined. Therefore, since data aggregation is an absolutely essential part of DynaMAC to improve the goodput (cf. Section 3.2), we propose segmentation as a compromise to remedy the problems discussed above. We divide a DynaMAC cycle into a set of $s_c$ distinct segments. Guarantees are only given in terms of segments. Within segments, data aggregation can be performed. When we guarantee delivery only for the end of the segment, we do not need to make any statement as to the specific slot in the segment which we choose for transmission. The segmentation constitutes a good combination of the advantages of data aggregation, on the one hand, and fine-grained delay guarantees, on the other hand.
3.4
Slot Allocation
Generally, the goal of the slot allocation strategy is to construct and maintain a conflict-free schedule. We start by explaining the smallest case of just two nodes. Eventually, one of them (say u) will hear the synch beacon of the other one (say v) and synchronize its cycle to that of its neighbor. Since u does not know whether v is also new to the network or not, it sends a Schedule Request (SR) packet to v. Since v is unaware of any other node, it replies with a Schedule Update (SU) packet without any allocated slots. Both nodes can then independently allocate their initial slots. The node with the higher ID chooses the first two slots in the first segment, the other one takes the first two slots in the second segment. We call the initial double-slot of a node its default slot. Note that the default slot is always a sending double-slot, because it is used to reliably broadcast some control data like SU packets (see below) or NACKs.
If additional nodes join the network, they will eventually receive a non-collided synch beacon from one of the existing nodes which allows them to synchronize with their neighbors. They can then broadcast an SR packet in the random phase. Since now a schedule is already in place, all nodes which hear the SR
packet reply with an SU packet that contains a bitmask in which each slot of the scheduled phase is represented by one or two bits. The first bit indicates whether the corresponding slot is allocated or not. If it is allocated, there is a second bit for that slot which indicates whether this slot is claimed by the sender of the SU packet itself or by one of its one-hop neighbors. The SU packets are sent in the default slots of each node. This is done to relieve the random phase of as much burden as possible. Since new nodes are always in listening mode until they have established their own default slots, the packet will certainly be received by the new node.
When the new node receives the SU packets from its neighbors, it computes their disjunctive combination to determine the set of unassigned slots in its two-hop neighborhood. Then, it picks the first free slot and broadcasts a Slot Allocation (SA) message with an “unapproved” flag in the next random phase, indicating its slot selection. When no disapproval is sent by any neighbor, the new node sends the same SA message again, this time with an “approved” flag. The neighboring nodes receive the second SA message and in turn propagate it to their one-hop neighbors in their default slots in the next cycle, so that eventually all two-hop neighbors are aware of the new slot allocation and can update their internal schedules. The same procedure applies for established nodes to get additional slots.
When two one-hop neighbors want to allocate the same slot in a random phase, one node will be earlier and, since all nodes are listening, the other node can hear the conflicting allocation and claim another slot. If two two-hop neighbors u and v claim the same slot, there will always be some node w which is a one-hop neighbor to both u and v and therefore hears both nodes’ SA messages. Hence, w sends an SU message to the node with the lower node ID, say u, which includes the new slots allocated by v. 
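The bitmask merge and first-free-slot choice described above can be sketched as follows. This is a simplified illustration under stated assumptions: each SU bitmask is modeled as a list of booleans holding only the first ("allocated") bit per slot, the second bit (own slot versus neighbor's slot) is ignored, and the function names are hypothetical.

```python
def merge_su_bitmasks(su_masks):
    """Disjunctive (OR) combination of the 'allocated' bits of all SU packets."""
    if not su_masks:
        raise ValueError("at least one SU bitmask is required")
    n_slots = len(su_masks[0])
    if any(len(mask) != n_slots for mask in su_masks):
        raise ValueError("all bitmasks must cover the same number of slots")
    return [any(mask[slot] for mask in su_masks) for slot in range(n_slots)]


def first_free_slot(su_masks):
    """Index of the first slot no neighbor reports as allocated, or None."""
    for slot, allocated in enumerate(merge_su_bitmasks(su_masks)):
        if not allocated:
            return slot
    return None


# Two neighbors report their view of a four-slot scheduled phase.
masks = [
    [True, False, False, True],
    [True, True, False, False],
]
print(merge_su_bitmasks(masks), first_free_slot(masks))
```

Because every neighbor reports both its own slots and those of its one-hop neighbors, the OR of all received masks covers the two-hop neighborhood of the joining node, which is what makes the chosen slot a candidate for a conflict-free allocation.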
Node u interprets this as a disapproval and must send a new SA message (unapproved) with different slot allocations. This procedure is repeated until all allocation requests are satisfied. When a node wants to release a slot that it previously allocated, the same procedure as described in this section can be utilized.

3.5 Mobility-Induced Collisions
When nodes are moving around, it is possible that two nodes which were more than two hops apart when the current schedule was negotiated, approach each other and therefore become part of each other’s two-hop neighborhood. When these nodes have conflicting slot allocations (which was not a problem previously), there can be collisions. We resolve this problem as follows. When packets of two nodes u and v collide in slot s, one or both recipients will send a NACK in their next sending slot (i.e. collision-free). Then, one or both of u and v immediately choose a buffer slot located at the end of the current segment and retransmit the packets. If a collision occurs in the buffer slot again (other nodes could also have chosen the same buffer slot due to independent collisions), a new buffer slot is computed using a linear congruential generator with the triple (sender ID, current slot number, original slot number) as its input. Eventually,
J. Lessmann and D. Held
Fig. 2. (a) Successful delivery rate of 802.11 DCF and DynaMAC under increasing load. (b) Successful delivery rate of DynaMAC with and without buffer slots under increasing load.
all conflicting nodes will have sequentialized. Since buffer slots are only used for collision resolution, chances are very high that any (even the first) retransmission in a buffer slot is successful. Since per-node delay guarantees in DynaMAC are only given in terms of segment ends (cf. Section 3.3), a retransmission in a buffer slot at the end of the original segment does not violate the delay guarantee. This is another advantage of our proposed segmentation concept. Generally, the larger the delay between the incoming and outgoing slot for a packet, the more buffer slots lie in between and the more likely it is that a retransmission within the guaranteed delay will succeed.
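The re-selection of a buffer slot after a repeated collision can be sketched as follows. The text only states that a linear congruential generator takes the triple (sender ID, current slot number, original slot number) as input; the way the triple is folded into a seed, the LCG constants, and the function name below are assumptions made for illustration.

```python
# Widely used 32-bit LCG constants; any full-period parameters would do here.
LCG_A = 1664525
LCG_C = 1013904223
LCG_M = 2 ** 32


def next_buffer_slot(sender_id, current_slot, original_slot, num_buffer_slots):
    """Map the input triple to a buffer slot index in [0, num_buffer_slots)."""
    # Hypothetical mixing step: combine the three inputs into a single seed.
    seed = (sender_id * 31 + current_slot) * 31 + original_slot
    # One LCG step: x' = (a * x + c) mod m.
    state = (LCG_A * seed + LCG_C) % LCG_M
    # Use the high-order bits, which are better distributed in an LCG.
    return (state >> 16) % num_buffer_slots


slot = next_buffer_slot(sender_id=17, current_slot=42, original_slot=5,
                        num_buffer_slots=4)
```

Because the result depends only on values each node already knows, every node can compute its new buffer slot locally without further negotiation. Two nodes with different IDs or different original slots will usually, but not always, end up in different buffer slots.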
4 Simulation Results
We simulated DynaMAC using ShoX [7]. For the simulation, we chose the DynaMAC parameters which we also assumed throughout this paper. The simulated network consisted of 100 nodes randomly distributed in an area of 50 × 50 m. The length of a stream is chosen randomly with a maximum of 15 seconds. For signal propagation, we used the unit disk model with a radius of 20 m. To assess the quality of DynaMAC, we compared it to the 802.11g DCF (Distributed Coordination Function) MAC in ad hoc mode. Our desired end-to-end delay is 100 ms. Figure 2(a) shows the delivery rate of the 802.11 DCF MAC and DynaMAC under different network loads in a static network. The network load is given in terms of the number of concurrent streams in the network. As can be expected, the 802.11 DCF MAC delivers fewer packets on time as the load increases. When the number of streams exceeds 30, less than half of the packets are delivered on time. This is because 802.11 performs no data aggregation, so 40 streams imply that 1600 sample packets per second actually compete for the medium. DynaMAC, on the other hand, delivers 100% of the packets regardless of the current load. This is because the routing layer reserves the complete path between source and destination before any sample packets are sent. Collisions due to mobile nodes are covered below.
We measured the impact of our buffer slot concept in order to evaluate and demonstrate its benefits. For that, we induced collisions artificially by having the nodes broadcast interfering sample streams in arbitrary slots. This strategy lets us induce collisions without moving nodes, i.e. without routing-layer issues. Figure 2(b) clearly shows the benefit of our buffer slot strategy. If no buffer slots are used, all collided packets are lost for as long as an interfering “collision stream” lasts. With buffer slots, however, the delivery ratio remains at almost 100%. The few exceptions are due to situations in which multiple collisions of independent streams occur at the same time and different nodes try to switch to the same buffer slots.
5 Conclusion
In this paper, we have presented DynaMAC, a novel TDMA MAC protocol for delay-sensitive data. It is one of the few TDMA MACs which consider mobility of nodes. DynaMAC can even uphold packet delay guarantees in the face of constantly moving nodes. DynaMAC also greatly increases the network goodput when it comes to small real-time samples. We have proposed a novel segmentation concept as a compromise between fine-grained delay guarantees and data aggregation. Our proposition to provide buffer slots at the segments’ ends and to give delay guarantees only in terms of segment ends gives DynaMAC useful flexibility to shift outgoing slots back and forth within the whole delay guarantee period, thus enabling inter-segment aggregation.
References
1. International Telecommunication Union, Recommendation G.729, http://www.itu.int/rec/T-REC-G.729/en
2. Young, D.: USAP Multiple Access: Dynamic Resource Allocation for Mobile Multihop Multichannel Wireless Networking. In: IEEE Military Communications Conference Proceedings (1999)
3. Kanzaki, A., Hara, T., Nishio, S.: An Efficient TDMA Slot Assignment Protocol in Mobile Ad Hoc Networks. In: Proceedings of the 2007 ACM Symposium on Applied Computing (SAC) (2007)
4. van Hoesel, L., Havinga, P.: A Lightweight Medium Access Protocol (LMAC) for Wireless Sensor Networks. In: First International Workshop on Networked Sensing Systems (INSS) (2004)
5. Rhee, I., Warrier, A., Min, J., Xu, L.: DRAND: Distributed Randomized TDMA Scheduling For Wireless Ad-hoc Networks. In: ACM MobiHoc (2006)
6. Rajendran, V., Garcia-Luna-Aceves, J., Obraczka, K.: Energy-Efficient, Application-Aware Medium Access for Sensor Networks. In: 2nd IEEE Conf. on Mobile Ad-hoc and Sensor Systems (MASS) (2005)
7. The ShoX developers, ShoX - scalable ad hoc network simulator, http://shox.sourceforge.net
High Performance Distributed Coordination Function for Wireless LANs
Haithem Al-Mefleh and J. Morris Chang
Dept. of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA
{almehai,morris}@iastate.edu
Abstract. The performance of DCF, the basic MAC scheme in IEEE 802.11, degrades under larger network sizes, and higher loads due to higher contention and so more idle slots and higher collision rates. We propose a new high-performance DCF (HDCF) protocol that achieves a higher and more stable performance while providing fair access among all users. In HDCF, the transmitting stations randomly select the next transmitter and so active stations do not contend for the channel, and an interrupt scheme is used by newly transmitting stations without contending with the existing active stations. Thus, HDCF achieves collision avoidance and fairness without idle slots added by the backoff algorithm used in DCF. For evaluation, we provide Opnet simulation that considers saturated and non-saturated stations. Results show that HDCF outperforms DCF in terms of throughput, and long-term and short-term fairness with gains up to 391.2% of normalized throughput and 26.8% of fairness index.
1 Introduction
Other than being simple and distributed, the IEEE 802.11 DCF (Distributed Coordination Function) [1, 2] is most popular because it assures long-term fairness (each station has the same opportunity to access the channel). In DCF, the Binary Exponential Backoff (BEB) procedure is used to resolve collisions, and a uniform distribution is used to provide the fairness property for users. A station with a packet to transmit will do so if the medium is sensed idle for a period of DIFS. Otherwise, the station sets its backoff counter by randomly choosing a number following a uniform distribution: NumberOfBackoffSlots ∼ U(0, CW), where CW is called the contention window and is initially set to CWmin. The station decrements its backoff counter by one for every time slot the medium is sensed idle, and transmits when this counter reaches zero. The destination responds by sending an acknowledgment (ACK) back. The packets transmitted carry the time needed to complete the transmission of a packet and its acknowledgement.
This work is partially supported by a grant from the Information Infrastructure Institute (iCUBE) of ISU.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 812–823, 2008. © IFIP International Federation for Information Processing 2008
This time is used by all other stations to defer their access to the medium and is called NAV, Network Allocation Vector. Collisions occur when two or more stations are transmitting at the same time. With every collision, the station doubles its CW unless a maximum limit CWmax is reached, and selects a new backoff counter from the new range. The process is repeated until the packet is successfully transmitted or is dropped because a retry limit is reached. Unfortunately, such behavior degrades the performance of the network especially under higher loads due to collisions and idle slots. Even if the network has only one station transmitting, that station still has to back off for a number of slots. In addition, collisions occur more frequently when the number of contending users increases. This results in an unstable behavior of DCF under very high loads.

In this paper, we propose a new contention management scheme named High-performance DCF (HDCF), which addresses the problem of wasted time in contention resolution by classifying stations into active and inactive ones. Our objectives are to coordinate transmissions from different active stations with no collisions or idle slots, and to limit the contention to newly transmitting stations. HDCF is a distributed random access scheme that achieves a higher throughput while providing long-term and short-term fairness among all users. In general, each station maintains a list of active users. The transmitting station randomly chooses the next station to transmit from its own list of active users following a uniform distribution: NextStationToTransmit ∼ U(first, last), where first and last are the first and last entries of the active list. The selected station transmits after a PIFS period following the last transmission, and other active stations will defer their attempts to transmit the same way NAV is used in DCF. Thus, there are no collisions or redundant idle slots due to active transmissions.
On the other hand, a newly transmitting station uses an interrupt scheme. Thereafter, active stations stop their active transmissions and only new stations contend for the channel using DCF. As a result, HDCF reduces the number of contending stations, and so the collision rate and the number of backoff slots. Results show that HDCF outperforms DCF in terms of throughput and fairness index, with gains up to 391.2% and 26.8% respectively. With HDCF, stations transmit in a uniform random order using a single channel with no central control, no time synchronization, no slotted channel, and no period reservations. In addition, HDCF utilizes an interrupt scheme so that active stations (one or more) keep transmitting unless there are new stations willing to transmit, and so that those new stations (one or more) can contend directly, which assures fairness and prevents unbounded delays for new stations. Finally, HDCF works using the 802.11 PHY and attributes (like NAV, retry limits, fragmentation, and others), introduces no additional packets, and works with or without RTS/CTS mode (e.g. used for the hidden-terminal problem). The rest of this paper is organized as follows. Related work is summarized in section 2. In section 3, the details and rules of the HDCF protocol are defined. A simulation study is presented in section 4 to evaluate HDCF and compare it to DCF. Finally, section 5 concludes the paper.
2 Related Work
To enhance DCF, many researchers proposed schemes that mainly attempt to reduce collision rates, adapt CW to congestion levels, or find optimal values of CW. However, collisions and wasted times still exist because some approaches solve one problem and leave another (e.g., [3, 4, 5]), and optimal values are approximate and oscillate with the network conditions that are variable (e.g., [3, 4, 5, 6, 10]). In addition, some schemes require the existence of an access point (AP) or complex computations (e.g., [6, 10]). Instead of providing a history of all such proposals, we will give examples that fall into these categories. SD [3] divides CW by a factor after a successful transmission to improve fairness. FCR [4] achieves a high throughput by having each station reset its CW to a minimal value after a successful transmission, and double the CW exponentially after a collision or losing contention. Thus, FCR requires the use of another mechanism to provide fairness. CONTI [5] attempts to fix the total number of backoff slots to a constant value. Hence, there are always idle slots and collisions may occur. In [6], the authors argued that the backoff value must be set equal to the number of stations to maximize the throughput. This algorithm requires an AP to broadcast the number of stations. Hybrid protocols (e.g. [10, 11]) divide the channel into consecutive reserved contention and contention-free periods. Such protocols require a central controller, reservation, multi-channels, the use of RTS/CTS, slotted channels, and/or time synchronization. Also, new stations first wait for the contention-free periods to end, resulting in unbounded delays and unfairness especially when a new station waits more than one contention-free period. Therefore, most of these schemes limit the number of active users and lengths of different periods.
3 HDCF Details
HDCF utilizes an interrupt scheme and active transmissions to enhance fairness and eliminate, or reduce much of, the costs of contention of DCF (idle slots and collisions) without adding any assumptions or constraints to DCF.

3.1 System Model, and Definitions
The network consists of stations that share a single wireless channel. For simplicity in analysis and discussion, we assume a fully connected network; each station can listen to every other one in the network. Finally, HDCF is overlaid on the 802.11 PHY layer and uses the same frame formats used in the 802.11 MAC. 1) Active Stations, Active-List, and Active Transmissions: active stations are those added to Active-List. Active-List contains a list of stations that have more packets to transmit, hence the name Active stations. Each station maintains its own Active-List with each entry having the format <ID>, where ID is the MAC address of an Active station. Active lists may not be the same in all stations. An Active transmission is started by Next-Station after a PIFS (PIFS = SLOT + SIFS) following the last transmission.
2) Next-Station: the station that is supposed to be the next transmitter and that is selected by the currently transmitting station. 3) Idle Stations, and New Stations: Idle stations are stations that have no data to transmit. New stations are those that were idle and now have packets to transmit. This includes mobile stations that move into the network and have data to transmit.

3.2 Next Station Selection, and Announcement
Next Station Selection. The current transmitting station, the source, will randomly select an entry from its Active-List, and announce that ID as Next-Station by including it in the data frame. To provide fairness, a uniform distribution is used: Next-Station = Uniform(A[0], A[Size − 1]). Here, A[0] is the first entry and A[Size − 1] is the last entry of the station’s Active-List. The announcing station does not have to be active and will make an announcement even if it will not become active. This eliminates the need for active stations to contend to get back into active transmissions. Using the uniform distribution, an active station may choose itself as the next transmitter. This assures the property provided by DCF, that each station has the same opportunity to access the channel. In addition, it prevents a station from wasting any idle slots, since there is no need to go through the backoff stages if there are no other active stations.

Announcement. A station announces its future status by informing its neighbors that it does or does not have more packets to transmit. In addition, a station announces Next-Station, the next station that has the right to access the channel. Using 802.11 packet formats, the "More Data" bit of the Frame Control field can be used to announce that a station is active. In addition, "Address4" of the 802.11 data frame’s header can be used to announce Next-Station. This means an overhead of 6 bytes, the size of the MAC address, which is small compared to the average packet size. When a station receives, or overhears, a packet with the "More Data" bit set to "1", it adds an entry to its Active-List unless that entry already exists. The entry will be <ID>, where ID is the MAC address of the transmitting node. On the other hand, if the "More Data" bit is set to "0", then the entry with the MAC address of the transmitting node, if it exists, is removed from the Active-Lists of all overhearing stations.

3.3 HDCF Rules
When a station transmits, it also announces the Next-Station. All active stations overhearing the announcement know which station has the right to access the channel next. As shown in Fig. 1(b), Next-Station starts transmitting PIFS after the end of the last station’s transmission, and SIFS is used as in DCF (Fig. 1(a)) between packets of the same transmission. Also, DCF NAV is still used; stations will defer to the end of the ongoing transmission. A new station initially follows DCF; it transmits if the channel is idle for a period of DIFS followed by backoff slots determined by Binary Exponential Backoff
Fig. 1. Active transmissions and their components: (a) an active transmission; (b) active transmissions
as shown in Fig. 2. If there are active stations, then a new station will detect at least one active transmission since PIFS is used as the IFS between any two consecutive active transmissions and PIFS is shorter than DIFS. Thus, following DCF rules would block a new station if there are active stations. Therefore, we propose to use an interrupt scheme, Fig. 3, by which a new station uses a jam signal (the jam signal is a special signal used by many wireless MAC protocols, for instance, different jam periods are used by Black Bursts (like [7]) to provide levels of priority) to stop active transmissions. If there is more than one new station interrupting, they will collide resulting in longer time spent contending for the channel. Hence, a new station starts transmitting after the jam only if the medium is idle for a period of one slot followed by backoff slots. The backoff procedure will follow the Binary Exponential Backoff procedure.
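For reference, the Binary Exponential Backoff procedure that new stations follow can be sketched as below. This is a minimal illustration of the behavior described in the introduction (uniform draw from [0, CW], CW doubled after each collision up to CWmax, packet dropped once a retry limit is exceeded). The class name, the method names, and the retry limit of 7 are assumptions, not values taken from the paper.

```python
import random


class BinaryExponentialBackoff:
    """Contention-window bookkeeping for one station (illustrative sketch)."""

    def __init__(self, cw_min, cw_max, retry_limit=7, rng=None):
        self.cw_min = cw_min
        self.cw_max = cw_max
        self.retry_limit = retry_limit  # assumed value, not from the paper
        self.rng = rng or random.Random()
        self.reset()

    def reset(self):
        """Return to the initial window, e.g. after a success or a drop."""
        self.cw = self.cw_min
        self.retries = 0

    def draw_backoff_slots(self):
        """Number of idle slots to wait: uniform over 0..CW inclusive."""
        return self.rng.randint(0, self.cw)

    def on_collision(self):
        """Double CW (capped at CWmax); return False if the packet is dropped."""
        self.retries += 1
        if self.retries > self.retry_limit:
            self.reset()
            return False
        self.cw = min(2 * self.cw, self.cw_max)
        return True


beb = BinaryExponentialBackoff(cw_min=32, cw_max=1023, rng=random.Random(1))
slots = beb.draw_backoff_slots()   # wait this many idle slots before sending
still_trying = beb.on_collision()  # window is now 64
```

The doubling rule follows the wording of the text; an actual 802.11 implementation steps through windows of the form 2^k − 1 instead.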
Fig. 2. Basic operation of DCF
Fig. 3. The interrupt scheme
When active stations, including the Next-Station, detect a busy medium before the end of PIFS, as described in Fig. 3, then there is at least one new station trying to transmit. Therefore, all active stations switch back to DCF to give new stations the chance to transmit. To prevent long delays and for practical issues, active stations follow DCF after the jam signal but with EIFS (EIFS = DIFS + SIFS + T_ACK, with the ACK sent using the lowest PHY rate) instead of DIFS. EIFS is used only one time after the jam signal. This also provides much higher priority for new stations, which use one slot after the jam. Active transmissions are reactivated by the interrupting station since it knows about at least one active station: the last announced Next-Station.

3.4 An Example
Fig. 4 is a simple example that illustrates HDCF operation. In the example, there are three stations that have data to transmit: A with 2 packets, B with 1 packet, and C with 3 packets. Initially, all stations contend for the channel using DCF as they do not overhear any active transmission. Assuming that A wins the contention, A transmits one packet, and adds itself to its Active-List since it has another packet to transmit. The packet transmitted by A will inform all
Fig. 4. An example with three stations joining network at different times
neighbors that A has more packets to transmit, and announces that the next transmitter is A because there is one entry, i.e. A’s MAC address, in the list and so that entry will be selected with a probability of 1. Stations B and C overhear that announcement, and hence each one adds A to its own Active-List. Stations B and C jam for one slot, SIFS after the end of the transmission of A. After jamming, both stations attempt to transmit after waiting for one slot followed by a random number of backoff slots. Assume that C wins the contention. C adds itself to its Active-List, and transmits while announcing A as the next transmitter, assuming that A was selected. The active list is updated at each of the three stations to include both A and C. Station B jams the channel for one slot, SIFS after the transmission of C, then it transmits after a period of one slot and a number of backoff slots. Note that station B announces that it has no more data to transmit. Now the active list at each station includes stations A and C. Moreover, B announces A as the next transmitter, and so A transmits PIFS after the transmission of B as no more stations are interrupting. Node A transmits while announcing that it has no more data and that C is the next transmitter. As a result, Active-Lists at all stations are updated to include only C. Station C transmits PIFS after the transmission of A while announcing itself as the next transmitter, and that it has more data to transmit. Station C transmits its last packet without any interrupt, announcing no more data to transmit. All three stations update their Active-Lists, which become empty.

3.5 Recovery Mechanism
Because of the hidden-terminal problem, channel errors, or the sudden power-off of a station, the selected Next-Station may not be able to start its transmission. However, all other active stations would notice the absence of Next-Station’s transmission just after a PIFS period by SIFS. Here, active stations are required to temporarily contend for the channel using DCF as a recovery mechanism. Fig. 5 shows the timings of switching to DCF (note there is no overhead to original timing in DCF since DIFS = PIFS + SIFS). Also, an active station switches to DCF when it is not able to decode a packet, or the ACK is lost. Once active transmissions are recovered, the active station switches to the active state.
Fig. 5. Recovery from a lost Next-Station announcement
3.6 Maximum Achieved Throughput
The maximum saturation throughputs of DCF and HDCF networks (ignoring the time needed by stations to join the active mode) can be approximated by:

$$S_{DCF} = \frac{E[L]}{DIFS + SIFS + \frac{CW_{min}\,\sigma}{2} + T_{ack} + T_{data}} \qquad (1)$$

$$S_{HDCF} = \frac{E[L]}{PIFS + SIFS + T_{data} + T_{ack}} \qquad (2)$$

Here, $\sigma$ is one slot time, $T_{data}$ is the time needed to send one data packet, $T_{ack}$ is the time needed to send an ACK, and $E[L]$ is the average packet size.

3.7 Summary, and Advantages
An HDCF station operates in one of two modes: active mode and contending mode. In active mode, there is no backoff, no collisions, and no idle slots. On the other hand, contending mode uses legacy DCF but with a much lower collision rate because almost only new stations contend for the channel. The way Next-Station is selected and the interrupt scheme have different advantages: 1) No idle slots are wasted when there are no new stations; i.e. there is no need to stop active transmissions. 2) Fairness to new stations, as they can contend for the channel directly (like in DCF) without long delays because the contention cost is much smaller. 3) Stations transmit in random order without the need for a slotted channel, reserved periods, time synchronization, central control, or knowledge of the number of active users. Finally, just like 802.11 DCF, HDCF stations may adapt their transmissions according to network and channel characteristics using different techniques used for 802.11 such as the RTS threshold, fragmentation, link adaptation, and the use of RTS/CTS for hidden nodes.
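The Active-List bookkeeping and the uniform Next-Station choice of Section 3.2 can be illustrated with a short sketch. It only models the list logic (the "More Data" flag and the announced Next-Station), not timing, jamming, or frame formats. The class and method names are hypothetical.

```python
import random


class HdcfStation:
    """Active-List handling for one station (illustrative names only)."""

    def __init__(self, mac, rng=None):
        self.mac = mac
        self.active_list = []  # MAC addresses of stations known to be active
        self.rng = rng or random.Random()

    def on_frame(self, sender_mac, more_data):
        """Update the Active-List from a received or overheard frame."""
        if more_data:
            if sender_mac not in self.active_list:
                self.active_list.append(sender_mac)
        elif sender_mac in self.active_list:
            self.active_list.remove(sender_mac)

    def choose_next_station(self):
        """Pick Next-Station uniformly at random from the local Active-List."""
        if not self.active_list:
            return None
        return self.rng.choice(self.active_list)

    def transmit(self, more_data):
        """Record this station's own status and return the announcement."""
        self.on_frame(self.mac, more_data)
        return more_data, self.choose_next_station()


# First step of the example in Section 3.4: A sends and still has data.
a, b, c = HdcfStation("A"), HdcfStation("B"), HdcfStation("C")
more, next_station = a.transmit(more_data=True)
for listener in (b, c):
    listener.on_frame(a.mac, more)
```

After this step every list contains only A, so A announces itself as Next-Station with probability 1, as in the example of Section 3.4.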
4 Simulation
This section presents the simulation we used to evaluate the performance of HDCF and compare it to that of the IEEE 802.11 DCF. We utilized the commercial Opnet Modeler 11.5.A [8] to implement HDCF by modifying the 802.11 model. The simulations presented are for networks of 802.11g and 802.11b PHYs without RTS/CTS mode. Table 1 shows the 802.11 network parameters [2, 1]. We consider different scenarios with different sizes of packets, number of stations, and offered loads to study the performance of HDCF and compare it to
that of DCF. First, we assume a fully-connected network with no channel errors; collisions are the only source of errors. Here, we start with a saturated scenario to provide a reference and an understanding of the maximum achievable performance. Then, we study the performance under different loads, or a non-saturated network. Second, we study the performance under different noise levels. To implement noise, we used a jammer node provided by Opnet.

Table 1. Network Parameters

Parameter             Value (1)       Parameter              Value (1)
Slot Time             20, 20 μs       SIFS                   10, 10 μs
PIFS                  30, 30 μs       DIFS                   50, 50 μs
CWmin                 15, 32          CWmax                  1023, 1023
PLCP Overhead         20, 192 μs      DCF MAC Overhead       28, 28 Bytes
MAC ACK Size          14, 14 Bytes    HDCF MAC Overhead      34, 34 Bytes
Data Rate             54, 11 Mbps     Control Rate           24, 1 Mbps
Signal Extension (2)  6 μs            Service/Tail Size (2)  22 Bits

(1) 802.11g, 802.11b values. (2) Only for 802.11g.
For performance analysis, we use the following metrics: 1) Normalized Throughput (Throughput/Rate_MAC), where throughput is the total data transmitted successfully over the simulation period. 2) Fairness Index: we used the Jain Index [9], defined by

$$JF = \frac{\left(\sum_{i=1}^{n} S_i\right)^2}{n \sum_{i=1}^{n} S_i^2},$$

where n is the number of stations and S_i is the throughput of station i. The closer the value of JF is to 1, the better the fairness level. We provide results for different simulation periods to draw better conclusions about long-term and short-term fairness.

4.1 No Channel Noise
Saturated Stations. Here, we compare DCF and HDCF for a network of saturated stations, i.e., stations that have data packets to transmit at all times. Fig. 6 and Fig. 7 compare the normalized throughput of HDCF and DCF for 50 contending stations as a function of the packet size, which changes from 50 bytes to 2304 bytes. The gain goes from 45.7% to 64% for 802.11b, and from 119.8% to 282.5% for 802.11g. The normalized throughput increases with the packet size for both protocols. However, the gain increases as the packet size gets smaller. This is due to the fact that collision’s cost is higher as it takes longer time for larger packets. In addition, the figures show the maximum normalized throughput values estimated by equations (2) for HDCF and (1) for DCF. While HDCF almost achieves the maximum throughput, DCF throughput is always much lower than the maximum possible values. Fig. 8 compares the normalized throughput of HDCF and DCF as a function of the network size, i.e. the number of stations. Here, the packet size is fixed at 1000 bytes. For 50 users, the gains are about 49.8% for 802.11b and 164.7%
Fig. 6. Throughput vs. packet size, 802.11b
Fig. 7. Throughput vs. packet size, 802.11g
for 802.11g. Fig. 8 shows a higher stability of HDCF; the number of stations has little effect on the performance of HDCF. On the other hand, DCF performance degrades as the number of stations gets larger since the probability of collisions increases exponentially when the number of stations increases. Moreover, all stations contend for the channel at all times in DCF. However, HDCF reduces the number of contending stations linearly, and removes any unnecessary idle slots. Hence, collision probability and overheads are much reduced by HDCF, allowing a higher stability. In Fig. 9, we also summarize results from different simulations we conducted. The figure shows the minimum and maximum gains of normalized throughput for different network sizes. Again, packet size is changed from 50 to 2304 bytes for each network size. The least gain is 10.5%, and the greatest is 391.2%. In all cases, the gain increases when the number of stations increases.
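The upper bounds of Eqs. (1) and (2) can be evaluated numerically with a small sketch. The inter-frame spaces, CWmin, overheads and rates below are the 802.11b values of Table 1. The frame timing model (PLCP overhead plus payload and MAC overhead at the data rate, and the ACK at the control rate) is a simplifying assumption of this sketch and may differ from the model behind the plotted curves. All function names are hypothetical.

```python
def frame_time_us(payload_bytes, overhead_bytes, rate_mbps, plcp_us):
    """Airtime of one frame in microseconds under the assumed timing model."""
    return plcp_us + 8 * (payload_bytes + overhead_bytes) / rate_mbps


def max_throughput_dcf(payload_bytes, t_data, t_ack, difs, sifs, cw_min, slot):
    """Eq. (1): bits per microsecond, which equals Mbit/s."""
    cycle = difs + sifs + cw_min * slot / 2 + t_ack + t_data
    return 8 * payload_bytes / cycle


def max_throughput_hdcf(payload_bytes, t_data, t_ack, pifs, sifs):
    """Eq. (2): bits per microsecond, which equals Mbit/s."""
    cycle = pifs + sifs + t_data + t_ack
    return 8 * payload_bytes / cycle


# 802.11b parameters from Table 1 (times in microseconds, rates in Mbit/s).
SLOT, SIFS, PIFS, DIFS, CW_MIN = 20, 10, 30, 50, 32
PLCP, DATA_RATE, CONTROL_RATE = 192, 11, 1
PAYLOAD = 1000  # bytes, the fixed packet size used for Fig. 8

t_ack = frame_time_us(0, 14, CONTROL_RATE, PLCP)
s_dcf = max_throughput_dcf(
    PAYLOAD, frame_time_us(PAYLOAD, 28, DATA_RATE, PLCP), t_ack,
    DIFS, SIFS, CW_MIN, SLOT)
s_hdcf = max_throughput_hdcf(
    PAYLOAD, frame_time_us(PAYLOAD, 34, DATA_RATE, PLCP), t_ack,
    PIFS, SIFS)

normalized_dcf = s_dcf / DATA_RATE
normalized_hdcf = s_hdcf / DATA_RATE
```

Under these assumptions the HDCF bound exceeds the DCF bound because the average backoff term CWmin·σ/2 disappears and DIFS is replaced by the shorter PIFS, which outweighs the 6 extra header bytes.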
Fig. 8. Throughput vs. network size
Fig. 9. Max/Min Gain
Fig. 6, Fig. 7, and Fig. 8 show that the higher the data rate, the lower the normalized throughput of a DCF network. In general, this is due to the higher overhead ratio. On the contrary, HDCF has a slightly higher performance when the rate increases because of the reduction of contention level and unnecessary idle slots which are parts of the overhead. Fig. 10 and Fig. 11 illustrate that HDCF provides a higher short-term and long-term fairness among all stations. Here, we used an average packet size of
1000 bytes. In HDCF, the fairness index is always above 0.84 for the 1-second simulation, and is almost 1 for the 3-second simulation for all sizes from 1 to 100 stations. On the other hand, the fairness index in DCF continues to decrease as the number of stations increases for both scenarios, and reaches values of 0.49 and 0.74, respectively. For 802.11b, the gains are up to about 31.1% for long-term fairness, and 86.7% for short-term fairness. Correspondingly, the gains are up to 10.1% and 26.8% for 802.11g. The smaller gains in 802.11g are because of the shorter time required to transmit a packet using the higher data rate.
Fig. 10. Fairness Index for 1 second
Fig. 11. Fairness Index for 3 seconds
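The Jain fairness index used for these plots is easy to compute from per-station throughputs. The following minimal sketch implements the definition given with the metrics; the function name and the sample inputs are illustrative.

```python
def jain_index(throughputs):
    """Jain fairness index: (sum S_i)^2 / (n * sum S_i^2)."""
    n = len(throughputs)
    sum_of_squares = sum(s * s for s in throughputs)
    if n == 0 or sum_of_squares == 0:
        raise ValueError("need at least one station with non-zero throughput")
    total = sum(throughputs)
    return total * total / (n * sum_of_squares)


equal_share = jain_index([2.0, 2.0, 2.0, 2.0])  # all stations served equally
one_winner = jain_index([8.0, 0.0, 0.0, 0.0])   # a single station gets everything
```

The index is 1 when all stations obtain the same throughput and drops to 1/n when a single station captures the channel, which is why values measured over short windows are a sensitive indicator of short-term unfairness.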
Non-saturated stations. Here, we simulated a network of non-saturated stations to study the performance under different offered loads. We select the 802.11g PHY, fix the number of stations in the network to 50, set the packet size to 1000 bytes, and vary the offered load at every station from 1 to 400 packets per second. Each simulation is run for a 50 second period. Fig. 12 illustrates that while providing better fairness, HDCF could achieve up to 168.3% more throughput than that of DCF. From the previous experiment, we select a load rate of 20 packets per second and vary network size from 1 to 100 stations. Fig. 13 shows that HDCF concurrently improves fairness and provides throughput gains as high as 71.7%. Fig. 12 and Fig. 13 also show that once the network load, or the network size, gets beyond a certain level, DCF performance starts to degrade due to higher contention levels. In contrast, HDCF enhances the performance because it reduces the number of contending stations and unnecessary idle slots. Also, HDCF performs just like DCF when the network load is very light or when the network size is small, where DCF is proved to be highly efficient. Thus, HDCF achieves higher performance and adapts better to different loads and network sizes. Finally, we provide results for different simulation periods to draw better conclusions about short-term fairness. Fig. 14 presents fairness index vs. duration for a network of 80 stations, and a constant packet interarrival time of 0.01 second at each station. The gains can reach values up to 70.6% and 23.6% for 802.11b and 802.11g respectively. Since the gain increases for shorter periods, HDCF improves the short-term fairness of the network. Such enhancement is related to the throughput improvements explained above, and to how the next station is selected as explained in section 3.2.
H. Al-Mefleh and J.M. Chang
Fig. 12. Throughput vs. load (normalized throughput (%) and fairness index (JF) of DCF and HDCF vs. number of packets per node per second)
Fig. 13. Throughput vs. number of users (normalized throughput (%) and fairness index (JF) of DCF and HDCF vs. number of stations)
Fig. 14. Fairness Index vs. time (fairness index of DCF and HDCF, and gain (%), for 802.11b and 802.11g vs. time in seconds)
4.2
Channel Noise
Fig. 15. Throughput vs. noise level (normalized throughput (%) and fairness index (JF) of DCF and HDCF vs. noise bits per second)
The results provided are for a network of 50 stations with CBR traffic of 50 packets per second. Each packet is 1000 bytes, and each simulation is run for 100 seconds. The jammer was configured to produce noise signals with a constant length of 1024 bits/signal at a constant rate varied for different runs. Fig. 15 shows the performance measures vs. the number of noise bits per second for an 802.11g network, and has a log-scaled x-axis. The figure shows that the throughput gain increases up to about 93.7%, and then starts to decrease until there is no gain when the channel errors are very high. Furthermore, HDCF provides higher throughputs and fairness for all tested noise levels.
5
Conclusions
In this paper, we proposed a new high-performance DCF (HDCF) MAC protocol to address the problem of wasted time in contention resolution in the IEEE 802.11 DCF. HDCF eliminates unnecessary contention and idle slots by allowing transmitting stations to select the next user to transmit. To assure fairness, the next station is selected uniformly at random. In addition, new stations utilize an interrupt scheme to contend directly without delays. Thereafter, active stations would stop their active transmissions and only new
stations would compete for the channel using DCF. As a result, HDCF reduces the number of contending stations, and hence the collision rate and the number of backoff slots. Also, stations transmit in a uniform random order without the need for a slotted channel, reserved periods, time synchronization, central control, or knowledge of the number or order of active users. We also presented a simulation study using Opnet, which illustrated that HDCF significantly improves the performance, as it achieves higher throughput and fairness levels for both saturation and non-saturation scenarios. For 802.11g, the gains can be up to 391.2% in throughput and 26.8% in fairness index.
References
1. IEEE Std 802.11b-1999, Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Higher-Speed Physical Layer Extension in the 2.4 GHz Band (1999)
2. IEEE Std 802.11g-2003, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications Amendment 4: Further Higher Data Rate Extension in the 2.4 GHz Band (2003)
3. Ni, Q., Aad, I., Barakat, C., Turletti, T.: Modeling and analysis of slow cw decrease for ieee 802.11 wlan (2003)
4. Kwon, Y., Fang, Y., Latchman, H.: A novel medium access control protocol with fast collision resolution for wireless lans. In: INFOCOM (2003)
5. Abichar, Z.G., Chang, J.M.: Conti: Constant-time contention resolution for wlan access. NETWORKING, 358–369 (2005)
6. Li, C., Lin, T.: Fixed Collision Rate Back-off Algorithm for Wireless Access Networks. In: VTC Fall 2002 (September 2002)
7. Sobrinho, J.L., Krishnakumar, A.S.: Real-time Traffic over the IEEE 802.11 Medium Access Control Layer. Bell Labs Technical Journal, 172–187 (1996)
8. Opnet, Opnet Modeler, http://www.opnet.com
9. Jain, R., Chiu, D., Hawe, W.: A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems. DEC Research Report TR-301 (September 1984)
10. Alonso, L., Ferrus, R., Agusti, R.: Wlan throughput improvement via distributed queuing mac. IEEE Communications Letters 9(4), 310–312 (2005)
11. Chlamtac, I., Farago, A., Myers, A., Syrotiuk, V., Zaruba, G.: Adapt: a dynamically self-adjusting media access control protocol for ad hoc-networks. In: Global Telecommunications Conference. GLOBECOM 1999, vol. 1A(1a), pp. 11–15 (1999)
Turning Hidden Nodes into Helper Nodes in IEEE 802.11 Wireless LAN Networks
Haithem Al-Mefleh and J. Morris Chang
Dept. of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA
{almehai,morris}@iastate.edu
Abstract. To enhance the performance of IEEE 802.11 WLANs in the presence of the hidden terminal problem, we propose a protocol that allows non-hidden stations to help each other retransmit faster whenever possible. Unlike other approaches, the new protocol benefits from the hidden terminal problem to improve the performance of DCF, which is the basic operation of IEEE 802.11. The new protocol is compatible with IEEE 802.11 and works with the same PHY. Using Opnet simulation, results show that the proposed scheme improves throughput, delay, packet drop, retransmissions, and fairness, with a small trade-off regarding fairness depending on the network topology.
1
Introduction
The IEEE 802.11 [1,2,3] wireless networks are widely deployed. Therefore, many challenges of the wireless medium are addressed by research, especially to improve the performance of the 802.11 DCF (Distributed Coordination Function), which is the basic operation of the medium access control (MAC) defined in 802.11. One major challenge is the hidden terminal problem, which significantly degrades the performance of DCF because it results in high collision rates. When a collision occurs, some stations other than the destination may be able to successfully receive one of the collided packets. Reasons include the capture effect and the hidden terminal problem, which arise from the different locations of stations, obstacles such as walls and doors, and interference. Accordingly, and unlike other proposals, we investigate whether non-hidden stations could help each other retransmit faster whenever possible to enhance the performance of 802.11 wireless local area networks (WLANs). In this paper, we propose a new simple protocol that modifies 802.11 DCF, is backward compatible, and works over the 802.11 PHY to achieve this goal. We also evaluate the new scheme using Opnet with and without capture effect for different topologies. Results show gains in retransmissions, throughput, fairness, delay, and packet drops, with a small trade-off regarding fairness in some scenarios. The rest of the paper is organized as follows. In Section 2 we provide background information about 802.11 DCF and the hidden terminal problem, and related work is discussed in Section 3. In Section 4, we provide details of the proposed protocol. Simulation results are given in Section 5. Finally, conclusions are in Section 6.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 824–835, 2008. © IFIP International Federation for Information Processing 2008
2
Background
2.1
IEEE 802.11 DCF
The IEEE 802.11 standard defines two mechanisms for DCF, both based on Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA). In basic operation, a station that has a packet to transmit will do so if the medium is sensed idle for a period of distributed interframe space (DIFS). Otherwise, the station goes into backoff, where the Binary-Exponential-Backoff (BEB) procedure is used. The station chooses a number of time slots to wait before trying to transmit again. This number, the backoff counter, is selected from the range [0, CW], where CW is called the contention window and is initially set to CWmin. The station decrements its backoff counter by one for every slot time the medium is sensed idle. When the backoff counter reaches zero, the station transmits its packet. Upon receiving a data frame, the destination responds by sending back an acknowledgment (ACK) frame after a short interframe space (SIFS) time. The ACK frame has a higher priority because SIFS is the shortest interframe space (IFS) used in DCF. The packets transmitted carry the time needed to complete the transmission of a packet and its acknowledgement. This time is used by all other stations to defer their access to the medium and is called the Network Allocation Vector (NAV). A collision occurs when two or more stations transmit at the same time, and the transmitter assumes a collision when the ACK frame is not received after a timeout period. With every collision, the transmitting station doubles its CW unless it has reached the maximum limit CWmax, and selects a new backoff counter from the new range. The process is repeated until the packet is successfully transmitted or is dropped because a retry limit is reached. In RTS/CTS operation, a station uses control frames to contend for the channel before transmitting data frames, i.e. data frames are free of collision.
When the backoff counter reaches zero, the transmitter starts by sending an RTS frame to the receiver, which then replies with a CTS frame if the RTS frame is received successfully. The duration fields of the RTS and CTS frames are used to reserve the channel long enough to exchange the following data frame and its acknowledgement. Fig. 1 illustrates the RTS/CTS operation in a fully connected WLAN (no hidden nodes).
Fig. 1. RTS/CTS operation without hidden nodes (timeline for S1, the AP, and the other stations: DIFS, backoff slots, RTS, SIFS, CTS, SIFS, Data, SIFS, ACK; the other stations defer access, then resume with DIFS and backoff slots)
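The contention-window rule described above can be made concrete with a short sketch. The Python below is an illustration only, not code from the paper or from the standard: the function names are hypothetical, and it assumes the common convention that CW takes values of the form 2^k − 1, so that "doubling" maps 31 to 63 and so on up to CWmax = 1023 (Table 1 of this paper lists CWmin as 32).

```python
import random

CW_MIN = 31    # assumption: 2**5 - 1 (Table 1 of the paper lists 32)
CW_MAX = 1023  # CWmax as listed in Table 1


def next_cw(cw, cw_max=CW_MAX):
    """Double the contention window after a collision, capped at cw_max."""
    return min(2 * (cw + 1) - 1, cw_max)


def draw_backoff(cw, rng=random):
    """Pick a backoff counter uniformly from the inclusive range [0, cw]."""
    return rng.randint(0, cw)


def cw_schedule(attempts, cw_min=CW_MIN, cw_max=CW_MAX):
    """Contention window used on each successive (failed) attempt."""
    schedule = []
    cw = cw_min
    for _ in range(attempts):
        schedule.append(cw)
        cw = next_cw(cw, cw_max)
    return schedule


if __name__ == "__main__":
    # Windows for 7 attempts (the long retry limit in Table 1).
    print(cw_schedule(7))
    print(draw_backoff(CW_MIN))
```

Under these assumptions the window saturates at CWmax after five doublings, which is why a station that has collided several times tends to wait much longer than a station that has not collided.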
2.2
Hidden Terminal Problem
Using the wireless medium, a station is not able to hear frames transmitted by another station when they are out of range of each other. This phenomenon is referred to as the hidden terminal problem, and significantly degrades the performance of
802.11 DCF because it results in high collision rates. An example is shown in Fig. 2, where S1 and S2 are within range of each other and are hidden from S3. Just as when all stations are within range, collisions occur when different nodes use equal backoff values. However, the hidden terminal problem adds another type of collision, as shown in Fig. 3. Here, S1 and S3 are contending for the channel, and S1’s backoff value is smaller than that of S3. Accordingly, S1 starts to transmit its RTS frame to the AP (access point). Unfortunately, S3 is unaware of S1’s transmission and thus does not freeze its backoff counter. S1’s RTS frame would escape a collision only if S3’s backoff counter reached zero after the start of the response frame, i.e. a CTS frame from the AP. Here, however, S3’s backoff counter reaches zero sometime before the end of S1’s transmission, and thus S3 starts transmitting its RTS frame. As a result, a collision occurs at the AP, and both S1 and S3 time out and then double their contention windows.
Fig. 2. Hidden terminal problem (stations S1, S2, and S3 around the AP)
Fig. 3. Collision due to hidden terminal problem (S1 and S3 each send an RTS frame to the AP after DIFS and backoff slots)
A special situation occurs when S3 starts transmitting its RTS frame at the same time the AP starts transmitting the CTS frame. In this case, S1 would start transmitting a data frame, while S3 would time out and begin the backoff procedure. As a result, S3 may attempt to retransmit while S1 is transmitting the data frame, resulting in a collision of the data frame. Consequently, data frames are not collision-free with RTS/CTS operation when the hidden terminal problem exists.
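The timing condition behind the collision in Fig. 3 can be sketched numerically. The Python below is a simplified illustration, not the authors' model: it assumes both stations begin counting down at the same instant, that the RTS airtime is the PLCP overhead plus 20 bytes at the 1 Mbps control rate (values taken from Table 1 of this paper), and that the hidden station freezes as soon as the AP's CTS begins. The function names are hypothetical, and the special case of an RTS overlapping the CTS is ignored.

```python
SLOT_US = 20.0           # slot time (Table 1)
SIFS_US = 10.0           # SIFS (Table 1)
PLCP_US = 192.0          # PLCP overhead (Table 1)
RTS_BYTES = 20           # MAC RTS size (Table 1)
CONTROL_RATE_MBPS = 1.0  # control rate (Table 1)


def rts_duration_us():
    """Airtime of one RTS frame: PLCP overhead plus payload bits at 1 Mbps."""
    return PLCP_US + RTS_BYTES * 8 / CONTROL_RATE_MBPS  # bits / (Mbit/s) = microseconds


def hidden_rts_collides(backoff_a, backoff_b):
    """True if two mutually hidden stations' RTS frames overlap at the AP.

    The later station cannot hear the earlier RTS, so it keeps counting
    down until the AP's CTS starts (one RTS airtime plus SIFS after the
    earlier station began transmitting).
    """
    gap_us = abs(backoff_a - backoff_b) * SLOT_US
    return gap_us < rts_duration_us() + SIFS_US


if __name__ == "__main__":
    print(rts_duration_us())
    print(hidden_rts_collides(3, 21), hidden_rts_collides(3, 22))
```

With these numbers the RTS lasts 352 µs, so backoff values that differ by 18 slots or fewer still collide. Between non-hidden stations only equal backoff values collide, which illustrates why hidden terminals raise the collision rate so sharply.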
3
Related Work
There are few analytical models for wireless networks with hidden terminals, such as [4, 5, 6]. In [7], the authors analyze the effect of hidden terminals on the throughput of a WLAN AP. They find that the hidden terminal problem reduces the network throughput by 50%, and that the capture effect (receiving one of the collided frames correctly under some conditions [8,9,10,11,12]) can enhance the performance by 10–15%. Capture effect adds to the complexity and cost of wireless devices, and thus is mostly not implemented. In [13], a study of the effect of the hidden terminal problem in a multi-rate 802.11b network for both basic and RTS/CTS methods is provided. The study shows that although the RTS/CTS method does not help against hidden nodes for rates higher than 1 Mbps and 2 Mbps, it is recommended for all rates since it alleviates packet collisions. Different approaches are proposed to reduce the effect and/or the number of hidden nodes. First, in many protocols like 802.11 DCF [1], the RTS/CTS
exchange is used to mitigate the hidden terminal problem. Second, the use of centralized scheduling like 802.11 PCF [1] would help. However, scheduling is not attractive because of its higher complexity, centralized control, and overhead of control packets, which increase latency and reduce throughput. Third, increasing the transmission power or the carrier-sensing range may reduce the number of hidden nodes. In [12], the authors define a set of conditions for removing the hidden terminal problem for 802.11 ad-hoc and infrastructure networks: 1) the use of a sufficiently large carrier-sensing range, and 2) the use of capture effect, which is referred to as the "Restart Mode" of the receiver. The authors show that one of these conditions alone is not sufficient; both conditions are required. Moreover, the work assumes that there are no significant obstructions in the network. In general, such approaches may be undesirable for energy-efficiency reasons, and would worsen the exposed node problem in overlapping WLANs and ad-hoc networks. In addition, they may not be feasible due to limits such as available power levels, obstacles, and regulations. On the contrary, power control schemes [14, 15] could increase the number of hidden nodes. Fourth, multi-channel approaches [16] mitigate the effect of hidden stations. These approaches require more transceivers and channels, and more complex MAC protocols. Fifth, busy tone protocols [17, 18] require a central node or a separate channel to transmit a special signal that informs other nodes of an ongoing data transmission. Finally, other proposals use new MACs and backoff algorithms, adapt the contention parameters, or broadcast helpful information (many of these do not consider the hidden terminal problem). In [19], each station broadcasts its backoff time in data frames to achieve fairness with its neighbors, and a multiplicative increase/linear decrease backoff algorithm is used.
In [20], an impatient backoff algorithm is proposed to enhance the fairness level toward the nodes in the middle of an ad-hoc network. In contrast to all existing approaches, impatient nodes decrease their backoff upon a collision or losing contention, and increase it upon a successful transmission, using an exponential instead of a uniform random backoff. The authors assume a slotted system where synchronization is achieved, and propose to use reset messages to address the issues of small backoff values when there are many collisions and high backoff values when there are no collisions.
4
The Proposed Scheme
4.1
Motivation
With the distributed operation of IEEE 802.11 DCF, stations compete for the channel using a random access scheme. Hence, there are always collisions, and their level increases with the number of contending stations and with the presence of the hidden terminal problem. Different approaches were proposed to enhance DCF by adjusting contention parameters and the backoff procedure. However, as discussed in related work (Section 3), they do not eliminate the hidden terminal problem, and some do not consider it at all.
When a collision occurs because of the hidden terminal problem, some stations other than the destination may be able to successfully receive one of the collided packets. The same scenario may occur if there is a bad channel between the transmitter and the destination, such as noise at the destination, while there is a good channel between the transmitter and some stations other than the destination. In the presence of hidden nodes, we investigate whether non-hidden stations could help each other retransmit collided frames to enhance the performance of infrastructure WLANs. Such cooperative retransmission is expected to be faster, since with DCF a non-collided station mostly transmits earlier than collided stations that double their CW. First, we propose a new simple protocol that modifies 802.11 DCF, is backward compatible, and works over the 802.11 PHY to achieve this goal. Then, we evaluate the proposed protocol via simulation.
4.2
Description of the New Scheme
We distinguish between two types of transmission opportunities (TXOPs), as shown in Fig. 4. First, a normal TXOP (NTXOP) occurs when a station starts to transmit a data frame after the required DIFS, or EIFS, and backoff periods. Second, a compensating TXOP (CTXOP) occurs when a station starts to transmit a SIFS period after the current NTXOP. Also, each station locally maintains a table, called CTABLE, of other stations that may need to be assigned CTXOPs. When a station (say S2) overhears an RTS frame or a data frame from another station (say S1) sent to the AP, it adds an entry (the MAC address) of the frame transmitter (S1) to its CTABLE if no such entry exists. A station (S2) drops an existing entry from its local CTABLE when it overhears an ACK frame sent to another station (S1) whose MAC address is equal to that entry. Note that a station is not required to wait for ACK frames after an RTS or data frame to add/remove an entry to/from its CTABLE. Fig. 4 illustrates the new scheme. Here, only S3 is hidden from both S1 and S2. After DIFS and backoff periods following DCF operation, the transmissions of S1 and S3 overlap, resulting in a collision. Since S2 overheard S1’s data frame, it adds S1’s MAC address to its CTABLE. After backoff, S2 transmits without interference, and at the same time informs the AP that S1 has a collided packet to transmit by including S1’s MAC address in the transmitted data frame. The AP responds by sending back an ACK to S2 while piggybacking the AID of S1 in this ACK frame (CACK frame in Fig. 4). Upon receiving the ACK frame, S2 removes the entry of S1 from its CTABLE, S1 removes the entry of S2 from its CTABLE if it exists, and S1 recognizes that it is assigned a CTXOP. Thereafter, S1 sends a data frame to the AP after a period of SIFS, and the AP replies with an ACK (last frame in Fig. 4). When overhearing the ACK, S2 removes S1’s MAC address from its CTABLE, and all stations continue their contention for the channel.
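The CTABLE bookkeeping described above can be sketched as a small data structure. The Python below is a minimal illustration under stated assumptions, not the authors' Opnet implementation: the class and method names are hypothetical, MAC addresses are plain strings, and the rule for choosing which entry to report to the AP (the oldest one) is an assumption, since the text does not specify it.

```python
class Station:
    """Sketch of a station that tracks neighbours that may need a CTXOP."""

    def __init__(self, mac):
        self.mac = mac
        self.ctable = []  # MAC addresses, kept in the order they were overheard

    def on_overheard_rts_or_data(self, transmitter_mac):
        """Add the transmitter of an overheard RTS/data frame if not present."""
        if transmitter_mac != self.mac and transmitter_mac not in self.ctable:
            self.ctable.append(transmitter_mac)

    def on_overheard_ack(self, receiver_mac):
        """Drop the entry of a station that has just been acknowledged."""
        if receiver_mac in self.ctable:
            self.ctable.remove(receiver_mac)

    def pick_station_to_help(self):
        """Entry to report to the AP in the next data frame (assumption: oldest)."""
        return self.ctable[0] if self.ctable else None


if __name__ == "__main__":
    s2 = Station("S2")
    s2.on_overheard_rts_or_data("S1")  # S1's frame collides at the AP, but S2 hears it
    print(s2.pick_station_to_help())   # S2 names S1 in its own data frame
    s2.on_overheard_ack("S1")          # S1 is acknowledged after its CTXOP
    print(s2.ctable)
```

A list is used here only to keep insertion order; a real implementation would also need to age out stale entries, which the text does not discuss.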
For reasons like power saving (energy is consumed for every bit transmitted or received), an 802.11 station first receives the MAC header of a frame and then receives the payload only if the frame is destined to that station. This behavior is not changed by the new scheme, as only header information is needed. Also, the helping station does not reserve the channel for a CTXOP; the AP does so using the duration value of the CACK frame. Since the duration is not known in advance, it is set to the time required to transmit a frame with the maximum possible length at the lowest rate. If needed (the reserved duration is longer than the CTXOP), the AP sends an ACK+CF-END frame instead of an ACK frame in the CTXOP so that all stations reset their NAV values and start contention. On the other hand, the AP sends a CF-END if the helped station does not start transmitting after PIFS. Finally, when a station gets a CTXOP, it does not reset its CW value and it uses its current backoff counter for the next frame in order to keep adapting to congestion levels.
Fig. 4. The proposed scheme (timeline for S1, S2, S3, and the AP showing the NTXOP and CTXOP periods with DIFS + Backoff, RTS, CTS, Data, SIFS, CACK, and ACK, CF-END)
4.3
Capture Effect
Capture effect [8, 9, 10, 11, 12] allows receiving one of the collided frames correctly under some conditions, and thus would enhance the throughput of the network while decreasing the fairness level. Our scheme is expected to improve the performance of WLANs with or without the hidden terminal problem when capture effect is enabled, since more than one station is involved in a collision. Transmissions that are not captured can still be helped using the proposed scheme, as different stations would capture different frames depending on the distance and environment between each receiver and the different transmitters.
4.4
Implementation Issues
The CACK frame is a new ACK type with the format shown in Fig. 6; it adds only one field, named CAID, to the standard ACK frame shown in Fig. 5. CAID represents the AID of the station that is assigned a CTXOP following the current NTXOP. The 16-bit AID is used because it is smaller than the 48-bit MAC address, which reduces the extra time required. To distinguish between ACK and CACK frames, we use the fact that all bits B8 to B15 except for B12 in the Frame Control field of IEEE 802.11 control frames are always set to 0. In our implementation, we selected B10 to be set to 1 for CACK. Note that a CTS frame can also be used with the same modifications to implement a CACK. The new scheme is fully backward compatible since CACK is of a known type and subtype, and will not be used to acknowledge data frames from stations that do not implement the new scheme. Data frames are not changed. AIDs cannot be used in data frames because a non-AP station maintains only its own AID. Hence, the 48-bit "Address4" field of the IEEE 802.11 data frame’s header can be used by a station to inform the AP about a collided station.
Fig. 5. ACK frame (fields and sizes in octets: Frame Control 2, Duration 2, RA 6, FCS 4)
Fig. 6. CACK frame (fields and sizes in octets: Frame Control 2, Duration 2, RA 6, CAID 2, FCS 4)
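The frame layouts of Fig. 5 and Fig. 6 can be illustrated by packing them into bytes. The Python below is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, fields are packed little-endian, and `zlib.crc32` is used as a stand-in for the FCS without claiming to reproduce the exact on-air bit ordering. The Frame Control value for an ACK (type = control, subtype = ACK) follows the usual 802.11 bit layout, and B10 marks a CACK as described in the text.

```python
import struct
import zlib

# Frame Control: B2-B3 = type (01 = control), B4-B7 = subtype (1101 = ACK).
ACK_FRAME_CONTROL = (1 << 2) | (13 << 4)  # 0x00D4
CACK_FLAG_B10 = 1 << 10                   # bit chosen in the text to mark a CACK
TYPE_SUBTYPE_MASK = 0x00FC                # bits B2..B7


def _with_fcs(body):
    """Append a 4-octet CRC-32 (stand-in for the 802.11 FCS)."""
    return body + struct.pack("<I", zlib.crc32(body))


def build_ack(ra, duration_us):
    """ACK: Frame Control (2) | Duration (2) | RA (6) | FCS (4) = 14 octets."""
    return _with_fcs(struct.pack("<HH6s", ACK_FRAME_CONTROL, duration_us, ra))


def build_cack(ra, duration_us, caid):
    """CACK: Frame Control (2) | Duration (2) | RA (6) | CAID (2) | FCS (4) = 16 octets."""
    frame_control = ACK_FRAME_CONTROL | CACK_FLAG_B10
    return _with_fcs(struct.pack("<HH6sH", frame_control, duration_us, ra, caid))


def is_cack(frame):
    """True if the frame is an ACK subtype with B10 set."""
    (frame_control,) = struct.unpack_from("<H", frame, 0)
    is_ack = (frame_control & TYPE_SUBTYPE_MASK) == ACK_FRAME_CONTROL
    return is_ack and bool(frame_control & CACK_FLAG_B10)


if __name__ == "__main__":
    ra = bytes.fromhex("020000000002")  # hypothetical MAC address of S2
    print(len(build_ack(ra, 0)), len(build_cack(ra, 300, caid=1)))
```

The two extra octets of CAID are the only airtime added to the acknowledgement, which is the reason the text gives for preferring the 16-bit AID over a 48-bit MAC address.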
5
Performance Evaluation
This section presents the simulation we used to evaluate the performance of the proposed scheme and compare it to that of 802.11 DCF. We implemented the new scheme with the commercial Opnet Modeler 11.5.A [21] by modifying the Opnet 802.11 models. We consider an infrastructure network which consists of one AP and a number of stations that share a single wireless channel. Moreover, there are no channel errors; collisions are the only source of errors. For each scenario, the results are the average of 100 different runs with a different seed, which is used for the random generator, for each run. Finally, 802.11b and RTS/CTS operation are used with the parameters shown in Table 1.
Table 1. Network Parameters
Parameter            Value
Slot Time            20 μs
SIFS                 10 μs
DIFS                 50 μs
CWmin                32
CWmax                1023
Control Rate         1 Mbps
Data Rate            11 Mbps
MAC ACK Size         14 Bytes
MAC CTS Size         14 Bytes
MAC RTS Size         20 Bytes
PLCP Overhead        192 μs
DCF MAC Overhead     28 Bytes
Short Retry Limit    4
Long Retry Limit     7
5.1
Performance Metrics
For performance analysis, we use the following metrics:
1. Throughput (S): the total data bits transmitted successfully per the simulation time.
2. Fairness Index (FI): we used the Jain index [22], defined by $FI = \frac{\left(\sum_{i=1}^{n} S_i\right)^2}{n \sum_{i=1}^{n} S_i^2}$, where $n$ is the number of stations and $S_i$ is the throughput of station $i$. The closer the value of FI to 1, the better the fairness provided. We use FI to find how fair a scheme is to different DCF users.
3. Average Delay: the delay of a data packet is measured from the moment it was generated until it is successfully received. Only successfully transmitted packets are considered for finding the average delay.
4. Packet drop: the number of data packets dropped due to buffer overflow or due to reaching a retry limit.
5. Retransmissions: the number of retransmission attempts of each packet before it is successfully transmitted or dropped.
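The Jain index in metric 2 is straightforward to compute from per-station throughputs. The Python below is a small sketch; the function name is hypothetical and the input is assumed to be a non-empty list of non-negative throughputs with at least one positive value.

```python
def jain_index(throughputs):
    """Jain fairness index: (sum S_i)^2 / (n * sum S_i^2)."""
    n = len(throughputs)
    sum_of_squares = sum(s * s for s in throughputs)
    if n == 0 or sum_of_squares == 0:
        raise ValueError("need at least one station with non-zero throughput")
    total = sum(throughputs)
    return (total * total) / (n * sum_of_squares)


if __name__ == "__main__":
    print(jain_index([1.0, 1.0, 1.0, 1.0]))  # equal shares
    print(jain_index([4.0, 0.0, 0.0, 0.0]))  # one station takes everything
```

Equal shares give an index of 1, and a single station taking all the throughput among n stations gives 1/n, so the index is bounded between 1/n and 1.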
5.2
Hidden Groups without Capture Effect
Here, each scenario is an infrastructure network with one AP and a number of stations that are positioned in two groups. Stations of different groups are hidden from each other, and stations of the same group are non-hidden. Each scenario is referred to as n-m, with n stations in the first group and m stations in the second group; n is fixed while m is variable. Then we test with scenarios referred to as n-m-c, where a third group (group c) of 5 stations, which are not hidden from each other, is added to each of the previous n-m networks. The stations of group c are arranged as follows: 1) c1 and c2 are non-hidden from all stations in the network. 2) c3 − {n2}. 3) c4 − {n1, m1, m5, m9, m10}. 4) c5 − {m1, m5, m6, m7, m9, m10}. Here, xi is station i in group x, and xi − {xj} means that xi and xj are hidden from each other. Also, |x| is the number of stations in group x. These scenarios include a general topology of a wireless network.
Results are provided in Fig. 7 to Fig. 14. In all figures, the letter "d" ("e") is used if the new scheme is disabled (enabled). For the n-m scenarios, the different measures follow the same trend for DCF with the new scheme enabled or disabled; we show this for fairness and throughput in Fig. 7 to Fig. 9. This can be explained by the fact that CW resetting and backoff counters are unchanged after a CTXOP. Fig. 7 to Fig. 9 also show a trade-off between throughput and fairness for the n-m scenarios. The fairness gets smaller for some cases when the new scheme is enabled. This is because collided stations may retransmit before being helped due to random backoff values. However, the difference is small: the FI of the new scheme is always above 0.7, and is almost the same as that of DCF for the 1-m and 10-m scenarios. On the other hand, fairness is always enhanced for the n-m-c scenarios (Fig. 10 and Fig. 11), where there is a higher probability of being helped before retransmitting using contention due to more general relations (not just two groups).
Fig. 7 to Fig. 11 illustrate that throughput is always enhanced. The minimum (maximum) gains (%) are about 3.2 (4.3), 4.1 (27.8), 4.3 (52.7), 1.8 (6.4), and 3.3 (9.9) for the 1-m, 5-m, 10-m, 1-m-c, and 2-m-c scenarios, respectively. Also, the throughput is always above 3.4 Mbps when the new scheme is enabled, and may drop to 2.2 Mbps otherwise for the n-m networks. In addition, the throughput of the n-m-c networks is always above 3.6 Mbps with the new scheme but keeps decreasing otherwise. Delay, retransmissions, and drops are also enhanced in all scenarios (Fig. 12 to Fig. 14). Gains come from the fast retransmissions, as shown in Fig. 13. The performance of the new scheme is affected by the number of stations in each group. For the n-m networks, the gain (all measures except FI) increases until a maximum value, and then decreases until it reaches a saturated value. This is explained by the fact that the probability of collisions due to hidden nodes decreases when |n| is small compared to |m| (or |m| is small compared to |n|).
Fig. 7. 1-m (throughput S in Mbps and fairness index FI vs. m, with the new scheme disabled (d) and enabled (e))
Fig. 8. 5-m (throughput S in Mbps and fairness index FI vs. m, with the new scheme disabled (d) and enabled (e))
Fig. 9. 10-m (throughput S in Mbps and fairness index FI vs. m, with the new scheme disabled (d) and enabled (e))
Fig. 10. 1-m-c (throughput S in Mbps and fairness index FI vs. m, with the new scheme disabled (d) and enabled (e))
Fig. 11. 2-m-c (throughput S in Mbps and fairness index FI vs. m, with the new scheme disabled (d) and enabled (e))
Fig. 12. Delay (gain (%) vs. m for the 1-m, 2-m, 3-m, 5-m, 10-m, 1-m-c, and 2-m-c scenarios)
Fig. 13. Retransmissions (gain (%) vs. m for the 1-m, 2-m, 3-m, 5-m, 10-m, 1-m-c, and 2-m-c scenarios)
Fig. 14. Packet drop (gain (%) vs. m for the 1-m, 2-m, 3-m, 5-m, 10-m, 1-m-c, and 2-m-c scenarios)
5.3
Random Scenario with Capture Effect
When considering capture effect for the scenarios in the previous section, the collected results showed similar gains but higher values of the different measures in both schemes. Therefore, we do not show those results. We also randomly generated a network of 30 stations positioned around the AP, which is in the center of an area of 420m×420m. In addition, a signal can be captured if its received power is at least 10 times greater than the received power of any other signal, and the SNR requirement is met according to the model used in Opnet. Also, each station follows an ON/OFF model: each period is Exponential(0.375 seconds), traffic is generated during the ON period with Exponential(r seconds), and a packet is 1024 bytes. Changing r allows for testing the network with different loads. Results are given in Fig. 15. For very small loads, there are almost zero collisions and the number of transmitters, and so helpers, is smaller. Thus no improvement is seen for such loads. However, improvements start at loads of about 14% for throughput and fairness, and at about 5.3% for all other measures. The gains (except FI, which continues to increase) increase with load until a maximum value, and then start to decrease. The decrease occurs because, at higher loads, collisions due to hidden and non-hidden nodes also increase (our proposed algorithm does not change collisions), and more packets are buffered at different stations (more delays and drops due to long waiting).
Fig. 15. Performance gain for random scenario with capture effect (gain (%) in FI, S, delay, retransmissions, and packet drop vs. load (%))
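The capture rule used in this scenario (received power at least 10 times that of any other overlapping signal) can be sketched as a simple check. The Python below is an illustration only: the function name is hypothetical, powers are assumed to be in linear units such as mW, and the additional SNR requirement of the Opnet model is deliberately left out.

```python
CAPTURE_RATIO = 10.0  # from the text: at least 10 times the power of any other signal


def captured_frame(rx_powers_mw):
    """Return the index of the captured frame among overlapping frames, or None.

    rx_powers_mw holds the received power of each overlapping frame at one
    receiver. The SNR condition of the simulator is not modelled here.
    """
    if not rx_powers_mw:
        return None
    if len(rx_powers_mw) == 1:
        return 0
    order = sorted(range(len(rx_powers_mw)),
                   key=lambda i: rx_powers_mw[i], reverse=True)
    strongest, runner_up = order[0], order[1]
    if rx_powers_mw[strongest] >= CAPTURE_RATIO * rx_powers_mw[runner_up]:
        return strongest
    return None


if __name__ == "__main__":
    print(captured_frame([1.0, 0.05]))  # first frame is 20 times stronger
    print(captured_frame([1.0, 0.2]))   # only 5 times stronger: no capture
```

Because each receiver sees different power levels, two stations overhearing the same collision may capture different frames, which is the situation the scheme relies on to find helpers.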
6
Conclusions
We proposed a new protocol for 802.11 WLANs to take advantage of the hidden terminal problem by allowing non-hidden stations to assist each other retransmit faster whenever possible. The new scheme is a modification to DCF, is backward compatible, and works over the 802.11 PHY. We evaluated the proposed scheme via simulation which was conducted using Opnet Modeler for different scenarios. Results showed that the new scheme improves the throughput, delay, packet drop, fairness, and retransmissions. The performance gain comes from cooperative retransmissions that are faster than that used in DCF where a collided station doubles its CW. In addition, results showed a trade-off between throughput and fairness only in some scenarios. Further work includes investigating performance enhancements using different design issues like having the AP decide when not to allow stations to assist each other, and using help information to update backoff counters and CW.
Performance Evaluation of 802.11 Broadcasts for A Single Cell Network With Unsaturated Nodes

Ashwin Rao (1), Arzad Kherani (2), and Anirban Mahanti (1)

(1) Amar Nath and Shashi Khosla School of Information Technology, Indian Institute of Technology Delhi, New Delhi, India
[email protected], [email protected]
(2) General Motors - India Science Lab, International Technology Park, Bangalore, India
[email protected]
Abstract. Broadcast communication plays a critical role in ad hoc networks, especially in vehicular communication, where a large number of applications are envisioned to use the broadcast services of IEEE 802.11p. These safety applications exchange messages at very low rates compared to voice and streaming applications but have very stringent latency requirements. Broadcast packets that undergo collision are not retransmitted because the source cannot detect collisions; hence, obtaining the collision probability is critical for the success of applications relying on broadcast communication. In this paper, we initially assume our nodes to be bufferless and model them using a two-state Markov Chain to obtain the collision probability. We then extend our model to finite buffers to study the system under higher data traffic loads.

Keywords: Model, 802.11, MAC.
1 Introduction
In this paper, we consider an IEEE 802.11 wireless ad hoc network where stations communicate with each other using broadcast messages. The broadcast environment allows receivers to collate information from all transmitting nodes within their neighborhood, making them aware of their immediate surroundings. Currently, “neighborhood awareness” is being leveraged for the design of next-generation vehicle collision warning systems [13,12]. It is envisioned that each vehicle will be equipped with sensors that record its position and dynamics and will broadcast this information to nearby vehicles. Each vehicle, in turn, can process the information it receives along with information regarding its own state to detect potential collisions.

The distributed coordination function (DCF), the primary medium access control (MAC) technique of IEEE 802.11, is a Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) scheme designed to support asynchronous data transport where all users have an equal chance of accessing the network. CSMA/CA does not rely on stations detecting a collision by hearing their own transmissions; hence, error-free reception of a packet results in transmission of a positive acknowledgment from the receiver. The absence of an acknowledgment is treated as a failed reception of the packet, resulting in its retransmission; retransmission continues until a threshold number of attempts has been made or until a positive acknowledgment is received. For broadcast transmissions, however, the 802.11 MAC protocol does not define any retransmission mechanism, since receivers do not acknowledge receipt of packets.

The ability of vehicle collision warning systems to prevent imminent collisions depends on the success of broadcast transmissions. Clearly, in a broadcast environment where there exists no direct mechanism to infer the loss of information owing to collisions in the transmission medium, it is important to determine the probability of packet collisions indirectly and accurately.

In this paper, we propose an analytic model for IEEE 802.11 broadcast-based ad hoc networks. Specifically, a simple finite-state Markov chain model is developed to model the buffer occupancy process for the IEEE 802.11 MAC considering broadcast traffic within a single cell. The steady state transition probabilities are computed as a function of the packet arrival rate at the MAC layer, the probability of transmission, the number of nodes in the cell, and the packet length, in order to obtain the probability of collisions and study the stability of the system.

The remainder of the paper is organized as follows. Section 2 gives an overview of the IEEE 802.11 DCF. The proposed system model is presented in Section 3. Section 4 provides the simulation setup used and the numerical results obtained from simulations. The related work is discussed in Section 5 in light of our contributions. Conclusions and future work directions are presented in Section 6.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 836–847, 2008. © IFIP International Federation for Information Processing 2008
2 The IEEE 802.11 Medium Access Control
The Distributed Coordination Function of the IEEE 802.11 MAC provides an asynchronous, contention-based access protocol based on CSMA/CA [2]. A wireless station with a packet to transmit initially monitors the channel activity for the Distributed Inter-Frame Space (DIFS) time period. If the channel is (or becomes) busy during this observation period, the station begins the backoff process: it chooses a backoff counter in the range (0, w), where w is the contention window. The counter is decremented in each slot in which the channel is sensed idle, frozen when a transmission is detected on the channel, and resumed once the channel has again been sensed idle for more than one DIFS. The station transmits when this counter reaches zero. The IEEE 802.11 CSMA/CA does not perform collision detection by listening on the channel while its transmission is in progress. Thus, a successful unicast transmission is confirmed by a positive acknowledgment from the receiver, the absence of which is treated as a failed reception of the packet at the receiver. The DCF adopts an exponential backoff mechanism in the event of losses. The initial value of w is set to CWmin, the minimum contention window, and is doubled after each unsuccessful transmission up to a maximum value CWmax = 2^m CWmin, where m is the maximum backoff stage.
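The exponential backoff rule described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names, the values CWmin = 16 and m = 6, and the convention of drawing the counter from the integers 0 to w − 1 are all assumptions made for the example.

```python
import random


def contention_window(cw_min: int, m: int, failures: int) -> int:
    """Contention window after `failures` consecutive unsuccessful transmissions.

    The window doubles per failure and is capped at CWmax = 2**m * CWmin.
    """
    stage = min(failures, m)  # the backoff stage never exceeds m
    return (2 ** stage) * cw_min


def draw_backoff(w: int, rng: random.Random) -> int:
    """Draw a backoff counter uniformly from 0..w-1 (assumed convention)."""
    return rng.randrange(w)


if __name__ == "__main__":
    rng = random.Random(0)
    for failures in range(8):
        w = contention_window(cw_min=16, m=6, failures=failures)
        print(failures, w, draw_backoff(w, rng))
```

Since broadcast frames are never acknowledged, a sender has no failure signal to react to, so in this sketch `failures` would stay at 0 and the window would remain at CWmin for broadcast traffic.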
There are several key differences in how broadcast transmissions are handled with respect to their unicast counterparts. First, the RTS/CTS exchange cannot be used in the broadcast environment to reserve the channel for transmission. Second, there are no acknowledgments for packet transmissions. Third, there is no retransmission mechanism in place for packets that experience collision, especially in vehicular networks, due to the stringent latency requirements. Finally, there exists no direct mechanism to detect collisions, and therefore the protocol does not adapt its sending rate in accordance with the perceived load on the channel.
3 System Model: Assumptions and Notations
We consider a single cell of n homogeneous nodes that communicate with each other using the broadcast services of IEEE 802.11. The data payloads exchanged are of the same size and the same data rate is used to broadcast the packets, resulting in L busy slots (L ≥ 1). If the DIFS occupies d slots then, on sensing a transmission, the other nodes that have a packet to transmit restart their backoff process after L′ = L + d slots. The wireless channel is assumed to be noiseless, i.e., errors in packet reception due to fading and other external interference are not considered a serious problem compared to the errors caused by packet collisions. The packet arrivals from the higher layer are assumed to follow a Poisson distribution. The system being slotted, these packets arrive with a probability λ in each slot. In the analysis we initially assume λ to be very small (λ ≪ 1), such that all the nodes have at most one packet ready for transmission in any given slot, resulting in bufferless MAC queues. We later extend this model to higher packet generation rates to model finite buffers at the MAC layer. All the nodes in the network use the same backoff parameters and independently attempt to transmit with a probability β in each slot. The assumptions made are summarized as follows.

1. There are n homogeneous nodes placed in a single cell wireless topology, i.e., the n nodes are able to hear each other.
2. The data exchanged is of a fixed size, resulting in busy periods of a fixed number of slots (L).
3. The nodes that have a packet transmit in each slot with a probability β, i.e., the backoff counter for packet transmission is sampled from a geometric distribution.
4. The arrival rate λ is based on a Poisson distribution.

3.1 Modeling Bufferless MAC
The wireless channel is seen by all the nodes to be either busy or idle, as shown in Fig. 1(a). When the arrival rate is low (λ ≪ 1), at the end of a busy period, a node can either have a backlogged packet ready for transmission or be awaiting the arrival of a packet from the higher layer. Each node at the end of the busy periods can be independently modeled using a two-state Markov Chain, as shown in Fig. 1(b).

Fig. 1. The state of the wireless channel and the state of the nodes. (a) Broadcast of fixed size packets. (b) Two state Markov chain for a node.

State 0 represents the state where the node does not have any packets at the MAC queue, while State 1 represents the state where the node has exactly one packet and is performing the backoff process. At the end of each busy period, the state of each node (either 0 or 1) is independent of the state of the other nodes present in the system. Let P10 denote the steady state transition probability from State 1 to State 0, and let P10(k) denote the probability of a transition from State 1 to State 0 by a node in a cycle when there are k other nodes (0 ≤ k ≤ n − 1) contending for the channel. The value of P10(k) is derived as follows. Consider a marked node that has a packet to transmit at the end of a busy cycle. If, in the first slot just after a busy period, k nodes other than the marked node have a packet to transmit then, based on the bufferless assumption, there can be j nodes that receive a packet from the higher layer (0 ≤ j ≤ n − k − 1). The transmit probability β is assumed to be independent of the state of each of the other n − 1 nodes in the system; hence, if no node transmits in this slot, the next slot can be visualized as the first slot after a busy period in which k + j nodes have a packet to transmit. P10(k) can be given as

\[
P_{10}(k) = \sum_{j=0}^{n-k-1} \binom{n-k-1}{j} \lambda^{j} (1-\lambda)^{n-k-1-j} (1-\beta)^{k+1} P_{10}(k+j) + \beta. \tag{1}
\]
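Equation (1) defines P10(k) in terms of itself (the j = 0 term) and of P10(k + j) for j ≥ 1, so it can be solved by working backwards from k = n − 1. The sketch below shows one way to do this; the function name `p10_table` and the requirement 0 < β ≤ 1 are assumptions of the example, and the paper does not say how its numerical values were actually computed.

```python
from math import comb


def p10_table(n: int, lam: float, beta: float) -> list[float]:
    """Return [P10(0), ..., P10(n-1)] satisfying Eq. (1), assuming 0 < beta <= 1."""
    p10 = [0.0] * n
    for k in range(n - 1, -1, -1):
        idle = n - k - 1                   # other nodes that currently hold no packet
        none_tx = (1.0 - beta) ** (k + 1)  # marked node and the k others all stay silent
        # Terms with j >= 1 only involve P10(k + j), which is already known.
        known = sum(
            comb(idle, j) * lam**j * (1.0 - lam) ** (idle - j) * p10[k + j]
            for j in range(1, idle + 1)
        )
        # Coefficient of P10(k) itself on the right-hand side (the j = 0 term).
        self_coeff = none_tx * (1.0 - lam) ** idle
        p10[k] = (beta + none_tx * known) / (1.0 - self_coeff)
    return p10


if __name__ == "__main__":
    # Illustrative parameter values only.
    print(p10_table(n=5, lam=1 / 5000, beta=1 / 16))
```

For k = n − 1 the sum is empty and the recursion reduces to P10(n − 1) = β / (1 − (1 − β)^n), which serves as the starting value.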
Similarly, let P01 denote the steady state transition probability that a node in State 0 at the beginning of the cycle ends up in State 1 at the beginning of the next cycle, and let P01(k) denote the probability that a node undergoes a transition from State 0 to State 1 when there are k nodes (0 ≤ k ≤ n − 1) contending for the channel. P01(k) can be given as follows.

\[
P_{01}(k) = \left(1-(1-\beta)^{k}\right)\left(1-(1-\lambda)^{L}\right) + \sum_{j=0}^{n-k-1} \left\{ \binom{n-k-1}{j} \lambda^{j} (1-\lambda)^{n-k-1-j} (1-\beta)^{k} \big( (1-\lambda) P_{01}(k+j) + \lambda P_{11}(k+j) \big) \right\} \tag{2}
\]
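Equation (2) can be solved by the same backward recursion over k. The sketch below takes the values P10(k) as an input list and assumes P11(k) = 1 − P10(k), which follows from the chain having only two states but is not written out in the text. The function name `p01_table` is hypothetical, and λ > 0 is assumed so that the divisor stays positive.

```python
from math import comb


def p01_table(n: int, lam: float, beta: float, L: int, p10: list[float]) -> list[float]:
    """Return [P01(0), ..., P01(n-1)] satisfying Eq. (2).

    `p10` holds P10(0..n-1); P11(k) is taken to be 1 - P10(k) (an assumption).
    """
    p01 = [0.0] * n
    for k in range(n - 1, -1, -1):
        idle = n - k - 1
        none_tx = (1.0 - beta) ** k  # none of the k contending nodes transmits
        # Some node transmits and the marked node gets an arrival during the busy period.
        busy = (1.0 - none_tx) * (1.0 - (1.0 - lam) ** L)
        weights = [comb(idle, j) * lam**j * (1.0 - lam) ** (idle - j) for j in range(idle + 1)]
        # Everything on the right-hand side except the P01(k) term itself.
        known = sum(
            w * ((1.0 - lam) * (p01[k + j] if j > 0 else 0.0) + lam * (1.0 - p10[k + j]))
            for j, w in enumerate(weights)
        )
        self_coeff = none_tx * weights[0] * (1.0 - lam)  # coefficient of P01(k)
        p01[k] = (busy + none_tx * known) / (1.0 - self_coeff)
    return p01


if __name__ == "__main__":
    # Hypothetical P10(k) values, for illustration only.
    print(p01_table(n=4, lam=0.02, beta=0.1, L=10, p10=[0.9, 0.6, 0.45, 0.35]))
```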
Let π denote the steady state probability that a node has a packet at the beginning of the cycle. As the nodes are assumed to be homogeneous, the steady state value for π at each node will be the same. The steady state values of P01 and P10 can be computed from P01(k) and P10(k) for a given value of π as follows.

\[
P_{01} = \sum_{k=0}^{n-1} \binom{n-1}{k} \pi^{k} (1-\pi)^{n-1-k} P_{01}(k) \tag{3}
\]

\[
P_{10} = \sum_{k=0}^{n-1} \binom{n-1}{k} \pi^{k} (1-\pi)^{n-1-k} P_{10}(k) \tag{4}
\]
The value of P10(n − 1), numerically computed from (1), can be used to compute P10(k) for n − 1 ≥ k ≥ 0, which can be substituted in (4) to give P10; P01 is obtained in the same way from (2) and (3). As π is the steady state probability that a node has a packet to transmit at the end of the busy period, the value of π can be obtained as follows.

\[
\pi = \frac{P_{01}}{P_{01} + P_{10}} \tag{5}
\]
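Because P01 and P10 in (3) and (4) themselves depend on π, equation (5) is a fixed-point equation in π. One simple way to solve it numerically is successive substitution, sketched below. The paper does not state which numerical method it used; the names `mix` and `solve_pi`, the starting point 0.5, and the tolerance are assumptions, and convergence of plain substitution is not guaranteed in general.

```python
from math import comb


def mix(pi: float, table: list[float]) -> float:
    """Average table[k] over k ~ Binomial(n - 1, pi), as in Eqs. (3) and (4)."""
    n = len(table)
    return sum(
        comb(n - 1, k) * pi**k * (1.0 - pi) ** (n - 1 - k) * table[k]
        for k in range(n)
    )


def solve_pi(p01_k: list[float], p10_k: list[float], pi0: float = 0.5,
             tol: float = 1e-12, max_iter: int = 10_000) -> float:
    """Iterate pi <- P01 / (P01 + P10) until successive values differ by < tol."""
    pi = pi0
    for _ in range(max_iter):
        p01 = mix(pi, p01_k)   # Eq. (3)
        p10 = mix(pi, p10_k)   # Eq. (4)
        new_pi = p01 / (p01 + p10)  # Eq. (5)
        if abs(new_pi - pi) < tol:
            return new_pi
        pi = new_pi
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    # Hypothetical per-k tables for a 3-node cell, for illustration only.
    print(solve_pi([0.05, 0.06, 0.07], [1.0, 0.6, 0.4]))
```

The inputs are the per-k tables P01(k) and P10(k) for k = 0, ..., n − 1, obtained from (2) and (1) respectively.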
Using this value of π, which can be numerically computed using P01 and P10, we shall now obtain the probability of packet collisions. At the end of the busy cycle, a node independently attempts a transmission in each slot with a probability πβ. The packet transmitted is successfully received by all the other nodes only if none of the other nodes attempts a transmission in the same slot. Hence the probability of successful transmission is given by (1 − πβ)^(n−1), and the probability of collisions, as the fraction of attempted transmissions undergoing collisions, is

\[
P_{coll} = 1 - (1 - \pi\beta)^{n-1}. \tag{6}
\]
The system parameters λ, β, L and n are used to obtain the values of π and the probability of collision Pcoll. The real world implications of these parameters are as follows.

1. The packet arrival rate (λ) abstracts the packet generation process at the application layer.
2. The probability of transmission in a given slot (β) represents the contention window of IEEE 802.11.
3. The length of the busy period (L) can be used to abstract the packet length and the data transmission rates.
4. The number of nodes present in the single cell (n) represents the transmission range of the nodes.

3.2 Modeling MAC with Finite Buffers
We now look at the MAC layer with finite buffers, where packets are queued up according to the drop tail queuing principle. Arrivals when a node has a packet to transmit are queued up provided the queue contains fewer than K elements (K ≥ 1). At the end of a busy period, each node can be modeled as a Markov Chain of K + 1 states, as shown in Figure 2. State i represents the state where the node has i (0 ≤ i ≤ K) packets in the MAC queue. At the end of the busy period a given node is in State i with a probability πi. The probability of collision for this system can be given as

\[
P_{coll} = 1 - \left(1 - (1 - \pi_0)\beta\right)^{n-1}. \tag{7}
\]
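Equations (6) and (7) share one form: the probability that a node is backlogged is π in the bufferless case and 1 − π0 with finite buffers. The short sketch below evaluates that form; the function name and the numeric values of π used in the example are hypothetical, not results from the paper.

```python
def collision_probability(p_backlogged: float, beta: float, n: int) -> float:
    """P_coll = 1 - (1 - p_backlogged * beta)**(n - 1).

    Pass pi for the bufferless model (Eq. 6) or 1 - pi_0 for finite buffers (Eq. 7).
    """
    return 1.0 - (1.0 - p_backlogged * beta) ** (n - 1)


if __name__ == "__main__":
    # Hypothetical backlog probabilities, for illustration only.
    for n, pi in [(10, 0.01), (100, 0.05), (200, 0.1)]:
        print(n, pi, collision_probability(pi, beta=1 / 16, n=n))
```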
The transition probabilities from State i to State j (Pi,j), where i, j ∈ [0, K], can be obtained from Pi,j(k), the probability of transition from State i to State j when k (0 ≤ k ≤ n − 1) other nodes have a packet to transmit. From any State i (0 ≤ i ≤ K), a transition to State j (max(0, i − 1) ≤ j ≤ K) is possible at the end of a busy period with a probability Pi,j(k). One such equation of Pi,j(k), for 0 < i < K and j ≥ i, is as follows.

\[
\begin{aligned}
P_{i,j}(k) ={}& (1-\beta)\left(1-(1-\beta)^{k}\right)\binom{L+1}{j-i}\lambda^{j-i}(1-\lambda)^{L+1-j+i} \\
&+ \beta\binom{L+1}{j-i+1}\lambda^{j-i+1}(1-\lambda)^{L-j+i} \\
&+ \sum_{m=0}^{n-k-1}\binom{n-k-1}{m}\lambda^{m}(1-\lambda)^{n-k-1-m}(1-\beta)^{k+1}\big((1-\lambda)P_{i,j}(k+m)+\lambda P_{i+1,j}(k+m)\big)
\end{aligned}
\]

We know from the analysis of the bufferless system that the probability that a node has a packet to be transmitted in the steady state, π, depends on the system parameters n, β, L, K and λ. For the finite buffer analysis we relax the constraint of low packet generation rates and allow all possible λ, i.e., 0 < λ ≤ 1. Recall that L represents the inverse of the transmission bandwidth.

One important problem for such systems is to obtain the number of nodes n the system can support for a given configuration of β, λ and L. The definition of the “can support” qualifier above can be subjective. For example, one may state a value for the probability of MAC layer collision and another value for the packet blocking probability (the probability with which newly arriving packets are lost) and then find the maximum value of n that satisfies both these constraints. We shall attempt to understand this problem from the point of view of system stability. For this, we assume that each node has a storage capacity of K units. The buffer size K will now serve as another system parameter that can be tuned to support a target number of nodes. To do this, we need an understanding of the dependence of the various performance metrics on the system parameters. Given the complexity of the system, we shall attempt an approximate analysis.
Approximate Analysis: Let πK(k), 0 ≤ k ≤ K, be the probability that a marked node has k packets in its buffer in the steady state when the buffer size at the node is K packets. The probability that an arriving packet is blocked is then πK(K). Consider the time instants at which a transmission has just ended in the network. One can write down transition probability equations for the process of buffer occupancy of a marked node (embedded at the ends of successive network transmissions). These equations will be approximate, as we assume that the states of all the nodes in the system are independent of each other at the ends of successive transmissions; thus, a marked node would see k other nodes with at least one packet in their buffers with probability

\[
\binom{n-1}{k} \left(1-\pi_K(0)\right)^{k} \pi_K(0)^{n-1-k}.
\]

Fig. 2. Markov Chain for a node with finite buffer

Stability Analysis: Let us now assume that the nodes have infinite buffering capacity, i.e., K = ∞. We are interested in ensuring stability of the buffer occupancy processes, i.e., that the buffer occupancy process converges over time (in law) to a random variable with finite expected value. A family of probability measures over the positive real line, {P_j(·)}_j, is called tight if for any given η > 0 one can find C(η) < ∞ such that

\[
\int_{C(\eta)}^{\infty} dP_j(y) < \eta
\]

uniformly for all values of j. We state the following theorem (the proof follows standard tightness arguments).

Theorem 1. For a given set of system parameters λ, n, K and β, the buffer occupancy process of any node is stable iff the sequence {P_K(·)}_K forms a tight sequence of probability measures.

Corollary 1. For a given set of system parameters λ, n, K, and β, the buffer occupancy process of any node is stable iff πK(K) → 0 as K → ∞.

Proof: Let the sequence {P_K(·)}_K be tight. For any given η > 0, there exists a C(η) < ∞ such that

\[
\sum_{j=C(\eta)}^{K} \pi_K(j) < \eta
\]

for all K > C(η). This is possible if and only if πK(K) → 0 as K → ∞. □
4 Numerical Results and Discussions
A simple discrete event simulator was written to simulate the state of the nodes and the wireless channel for 10^6 cycles of busy and idle periods. The system parameters used were obtained as follows. The slot duration of IEEE 802.11 is 20 μs [2], while the data rate for vehicular communication is 6 Mbps [13,7]. The number of slots L required to transmit a signed message of 250 bytes [1] is

\[
\frac{250 \times 8}{6 \times 10^{6} \times 20 \times 10^{-6}} \approx 17.
\]

The awareness messages are broadcast once every 100 milliseconds [3,13], i.e., once every

\[
\frac{100 \times 10^{-3}}{20 \times 10^{-6}} = 5000
\]

slots, with a transmission range of about 150 meters [13]. Assuming each vehicle, of length 4 meters, maintains a distance of 2 meters from the vehicle in front of it, we can have 150/6 = 25 vehicles in each lane, giving approximately 150 nodes in a six-lane freeway lying in the transmission range of a given node. Using the above information, the simulation parameters given in Table 1 were used for studying the behavior of the system.

Table 1. Simulation Parameters

Parameter                                            Value
Number of nodes (n)                                  3 to 200
Probability of transmit in a given slot (β)          1/16 and 1/8
Probability of packet arrival in a given slot (λ)    1/2500 and 1/5000
Number of busy slots (L)                             25, 50

Figure 3 and Figure 4 show the relation between π and the system parameters λ, β, n and L for the bufferless nodes. Figure 3(a) shows the relation between β and π when L and λ are kept constant at 25 and 1/5000 respectively. An increase in the value of β from 1/16 to 1/8 produces a decrease in the value of π. Higher values of β mean that a node remains in State 1 for fewer slots while the channel is idle, resulting in lower values of π.
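The slot arithmetic used to derive the simulation parameters can be checked with a few lines of integer arithmetic. The variable names are illustrative, and rounding the busy period up to whole slots is an assumption (the text only states that the value is approximately 17).

```python
import math

SLOT_US = 20             # IEEE 802.11 slot duration in microseconds
RATE_BITS_PER_US = 6     # 6 Mbps expressed as bits per microsecond
MESSAGE_BYTES = 250      # size of a signed safety message
PERIOD_US = 100_000      # awareness messages every 100 ms
RANGE_M = 150            # transmission range in meters
VEHICLE_M, GAP_M = 4, 2  # vehicle length and spacing in meters
LANES = 6

# Busy period: 2000 bits / (6 bits/us * 20 us/slot) = 16.67 slots, rounded up.
busy_slots = math.ceil(MESSAGE_BYTES * 8 / (RATE_BITS_PER_US * SLOT_US))

# One message per 5000 slots corresponds to lambda = 1/5000.
slots_per_message = PERIOD_US // SLOT_US

vehicles_per_lane = RANGE_M // (VEHICLE_M + GAP_M)
nodes_in_range = vehicles_per_lane * LANES

print(busy_slots, slots_per_message, vehicles_per_lane, nodes_in_range)
```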
Fig. 3. Impact of β and λ on π. (a) Relation between π and β (L = 25, λ = 1/5000; β = 1/16 and β = 1/8). (b) Relation between π and λ (L = 25, β = 1/16; λ = 1/5000 and λ = 1/2500). Both panels plot π against the number of nodes and compare simulation with numerical results.
Figure 3(b) shows the relation between λ and π when L and β are kept constant at 25 and 1/16 respectively. An increase in the value of λ from 1/5000 to 1/2500 produces an increase in the value of π. This can be explained as follows. Higher values of λ indicate an increase in the arrival rate, resulting in an increase in the number of P01 transitions. Figure 3(a) and Figure 3(b) show that π is more sensitive to λ than to β; hence, adapting the packet arrival rates from the higher layers has a greater impact on π than adapting the transmit probabilities at the MAC layer.

Figure 4(a) shows the relation between L and π when λ and β are kept constant at 1/5000 and 1/16 respectively. An increase in the value of L from 25 to 50 produces an increase in the value of π. This can be explained as follows. Higher values of L indicate an increase in the number of slots for which the channel is kept busy, resulting in an increase in the number of slots required to complete a cycle, as shown in Figure 1(a). These extra slots account for the increased number of arrivals in the cycle, resulting in higher values of π. Figure 3(b) and Figure 4(a) show that an increase in the value of L has an effect similar to an increase in the value of λ; hence, for fixed packet sizes, increasing the data rate can be used to reduce the values of π and thus packet collisions. Figure 4(b) shows that Pcoll increases as the number of nodes contending for the wireless channel increases.

Fig. 4. Impact of L on π and n on Pcoll. (a) Relation between π and L (β = 1/16, λ = 1/5000; L = 25 and L = 50). (b) Fraction of packets lost due to collisions (β = 1/8, λ = 1/2500, L = 25). Both panels are plotted against the number of nodes and compare simulation with numerical results.
Figure 5 illustrates the behavior of the system with finite buffers. Figure 5(a) shows the πi values (0 ≤ i ≤ K) when the buffer size K is 4 for λ = 1/1000, β = 1/16 and L = 25. The value of πk converges to a fixed value as the number of nodes increases. The value of π0 converges to a value greater than 0.5, indicating that the buffers are empty for a considerable amount of time even when the packet arrival rate is as high as 1/1000. The stability of the system is reflected in Figure 5(b), which shows the relation between πK and K. Even for high arrival rates, i.e., when λ = 1/100 (packets generated once every 2 milliseconds), we see that the system can support about 200 nodes for buffer sizes less than 10.

Fig. 5. System with finite buffers. (a) πk when K = 4 (λ = 1/1000): π0, π1 and π2 from simulations and numerical results, plotted against the number of nodes. (b) Stability of the system with respect to n (λ = 1/100, β = 1/16, L = 25): probability of K packets in queue (πK) against buffer size K for n = 10, 50, 100 and 200.

The absence of retransmissions makes the goodput equal to 1 − Pcoll for the single cell; hence, even if the system is stable, Figure 4(b) shows that the goodput reduces considerably as the number of nodes increases. This behavior is interesting because the absence of retransmissions implies that a large collision probability is good for the stability of the system, which is not the case when there are retransmissions. Large collision probabilities are a result of a significant number of backlogged nodes, resulting in an increase in the waiting time for transmission of the packets. The interplay of these two effects makes the stability analysis of these systems complex, and we are currently working on it. Numerical approximations based on our proposals in this paper help in the stability analysis.
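The paper does not list its simulator. As a rough illustration of the kind of slotted simulation described in this section, the sketch below runs the bufferless model of Section 3.1 cycle by cycle and estimates π and Pcoll, from which the goodput follows as 1 − Pcoll. Everything about its structure is an assumption: the function name, the ordering of arrivals and transmissions within a slot, the lumping of the busy period into a single arrival probability 1 − (1 − λ)^L, and the choice to let a node that has just transmitted accept a new arrival during the busy period. It requires λ > 0 and β > 0 to terminate.

```python
import random


def simulate_bufferless(n: int, lam: float, beta: float, L: int,
                        cycles: int, seed: int = 0) -> tuple[float, float]:
    """Estimate (pi, P_coll) for n bufferless nodes over `cycles` idle/busy cycles."""
    rng = random.Random(seed)
    has_packet = [False] * n
    attempts = 0    # transmissions attempted over the whole run
    collided = 0    # transmissions that overlapped with another one
    backlogged = 0  # nodes holding a packet, summed over ends of busy periods
    p_busy_arrival = 1.0 - (1.0 - lam) ** L
    for _ in range(cycles):
        # Idle phase: advance slot by slot until at least one node transmits.
        while True:
            for i in range(n):
                if not has_packet[i] and rng.random() < lam:
                    has_packet[i] = True
            senders = [i for i in range(n) if has_packet[i] and rng.random() < beta]
            if senders:
                break
        attempts += len(senders)
        if len(senders) > 1:
            collided += len(senders)
        for i in senders:
            has_packet[i] = False  # broadcast packets are never retransmitted
        # Busy phase: a node without a packet may receive one during the L busy slots.
        for i in range(n):
            if not has_packet[i] and rng.random() < p_busy_arrival:
                has_packet[i] = True
        backlogged += sum(has_packet)
    return backlogged / (cycles * n), collided / attempts


if __name__ == "__main__":
    pi_hat, p_coll = simulate_bufferless(n=50, lam=1 / 2500, beta=1 / 8, L=25, cycles=2000)
    print("pi ~", pi_hat, "P_coll ~", p_coll, "goodput ~", 1.0 - p_coll)
```

With small λ the idle phase can last many slots, so a run of 10^6 cycles as in the paper would be slow in pure Python; the example uses far fewer cycles.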
5 Related Work
Modeling wireless networks, especially those based on the IEEE 802.11 standard, has received considerable research attention over the years. Given the space constraints a detailed discussion is not feasible; hence, we limit ourselves to the work that allows us to place our contribution in the overall context. In this paper we model the buffer occupancy process for broadcast networks where there are no retransmissions. Retransmissions exist in unicast communication; hence, modeling of the backoff process is essential for the analysis of unicast communication. In the seminal paper on performance analysis of 802.11 DCF [4] and its extensions [8,9], the backoff process is modeled as a Markov Chain for nodes with saturated MAC queues to obtain the collision probability. The model by Bianchi [4] cannot be used to study the stability of the system, as it assumes saturated MAC queues. Malone et al. [10] extend [4] to unsaturated MAC queues for unicast networks. Similarly, Chen et al. [5] provide a model based on the one in [4] for broadcast traffic assuming saturated MAC queues. The above proposals model the backoff process, while this work models the buffer occupancy process to study the impact of packet arrival rates from the higher layer, packet transmission probabilities at the MAC layer, packet transmission times, and the number of neighbors present in the broadcast network on the packet collision probabilities when the MAC can buffer only a finite number of packets. Qiu et al. [11] study the interference in wireless networks relying on broadcast and unicast communication, but they do not address any stability issues. Choi et al. [6] provide an analytical model to study the hidden node problem in multi-hop networks by providing a two-state Markov Chain for modeling the communication channel. Similarly, we model the channel to be either idle or busy in any given cycle and nodes to have at most one packet in our initial analysis. Torrent-Moreno et al. [14,15] provide insights into the impact of radio propagation models using the probability of reception as a performance metric. We believe our work can be extended to accommodate the channel models considered in [14,15].
6 Conclusion
In Section 3 we provide equations for the transition probabilities and π. A closed form of the equations is desirable for the sensitivity and stability analysis of this system. Simulations show that adapting the packet arrival rates has a greater impact than adapting the transmit probabilities, as π is more sensitive to λ than to β under the simulated parameters. Similarly, an increase in the value of L results in an increase in π; hence, increasing data transmission rates or reducing packet sizes can result in fewer collisions in broadcast networks. The region where these observations hold true can be determined only after obtaining the closed form of π. Similarly, a closed form expression for finite buffers needs to be derived. The current equations for πk are quadratic; hence, an algorithm needs to be derived to obtain the desired root of the equation. In this paper we target the performance evaluation of IEEE 802.11 broadcast networks. A simple two-state Markov chain is used to model the system under low loads. The model is further extended to support higher traffic loads. The proposed models give a good insight into the working of the system that can be used for sensitivity and stability analysis. The simulations show that a stable system may not provide the desired goodput and that adapting the packet arrival rates has a greater impact than adapting the transmit probabilities, motivating the need for the applications to adapt their rates based on the sensed channel load.
References

1. IEEE Trial-Use Standard for Wireless Access in Vehicular Environments - Security Services for Applications and Management Messages, IEEE Std 1609.2-2006 (2006)
2. IEEE Standard for Information technology-Telecommunications and information exchange between systems-Local and metropolitan area networks-Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. June 12 (2007)
3. Bai, F., Krishnan, H., Sadekar, V., Holland, G., ElBatt, T.: Towards characterizing and classifying communication-based automotive applications from a wireless networking perspective. In: 1st IEEE Workshop on Automotive Networking and Applications (AutoNet 2006) (2006)
4. Bianchi, G.: Performance analysis of the IEEE 802.11 distributed coordination function. Selected Areas in Communications, IEEE Journal on 18(3), 535–547 (2000)
5. Chen, X., Refai, H.H., Ma, X.: Saturation performance of IEEE 802.11 broadcast scheme in ad hoc wireless lans. In: Vehicular Technology Conference, VTC-2007 Fall. 2007 IEEE 66th (September 30-October 3, 2007), pp. 1897–1901 (2007)
6. Choi, J.-M., So, J., Ko, Y.-B.: Numerical analysis of IEEE 802.11 broadcast scheme in multihop wireless ad hoc networks. Information Networking, 1–10 (2005)
7. DSRC Industry Consortium. DSRC Technology and the DSRC Industry Consortium (DIC) Prototype Team, White Paper. Tech. rep (2005)
8. Gupta, N., Kumar, P.R.: A Performance Analysis of the 802.11 Wireless LAN Medium Access Control. Communications in Information and Systems 3, 279–304 (2003)
9. Kumar, A., Altman, E., Miorandi, D., Goyal, M.: New insights from a fixed-point analysis of single cell IEEE 802.11 WLANs. IEEE/ACM Trans. Netw. 15(3), 588–601 (2007)
10. Malone, D., Duffy, K., Leith, D.: Modeling the 802.11 distributed coordination function in nonsaturated heterogeneous conditions. IEEE/ACM Trans. Netw. 15(1), 159–172 (2007)
11. Qiu, L., Zhang, Y., Wang, F., Han, M.K., Mahajan, R.: A general model of wireless interference. In: MobiCom 2007: Proceedings of the 13th annual ACM international conference on Mobile computing and networking, pp. 171–182. ACM, New York (2007)
12. Tan, H.-S., Huang, J.: DGPS-based vehicle-to-vehicle cooperative collision warning: Engineering feasibility viewpoints. Intelligent Transportation Systems, IEEE Transactions on 7(4), 415–428 (2006)
13. The CAMP Vehicle Safety Communications Consortium. Vehicle Safety Communications Project Task 3 Final Report: Identify Intelligent Vehicle Safety Applications Enabled by DSRC. Tech. Rep. 809859, National Highway Traffic Safety Administration, U. S. Department of Transportation (USDOT) (March 2005)
14. Torrent-Moreno, M., Corroy, S., Schmidt-Eisenlohr, F., Hartenstein, H.: IEEE 802.11-based one-hop broadcast communications: Understanding transmission success and failure under different radio propagation environments. In: MSWiM 2006: Proceedings of the 9th ACM international symposium on Modeling analysis and simulation of wireless and mobile systems, pp. 68–77. ACM Press, New York (2006)
15. Torrent-Moreno, M., Jiang, D., Hartenstein, H.: Broadcast reception rates and effects of priority access in 802.11-based vehicular ad-hoc networks, pp. 10–18.
Optimal Placement of Mesh Points in Wireless Mesh Networks*

Suk Yu Hui¹, Kai Hau Yeung¹, and Kin Yeung Wong²

¹ Department of Electronic Engineering, City University of Hong Kong
[email protected], [email protected]
² Computer Studies Program, Macao Polytechnic Institute
[email protected]
Abstract. A strategic placement of mesh points (MPs) in a Wireless Mesh Network (WMN) is essential to maximize the throughput of the network. In this paper, we address the problem of MPs placement for throughput optimization in WMN with consideration of routing protocols. Specifically, we formulate the routing paths among the MPs and mesh routers (MRs), calculate the aggregated traffic at the MPs, and suggest the optimal positions for the MPs. Our study also compares the optimal performance of routing protocols by finding their corresponding optimal placements of MPs first.

Keywords: wireless mesh network, gateway placement, throughput optimization.
* This work was supported by City University Strategic Research Grant numbered 7001941.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 848–855, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

In a Wireless Mesh Network (WMN) [1], mesh routers (MRs) connect among themselves to form a wireless backbone. Unlike traditional wireless LANs, which require every wireless access point to connect to the wired network, WMNs require only some MRs to connect to the wired network. Those MRs serve as gateways to the Internet and are called mesh points (MPs). Providing quality Internet access for clients is an important issue in WMNs [2]. Packets sent by the clients to the Internet are routed among the MRs to reach one of the MPs. Through dynamic self-organization and self-configuration, MRs in the network automatically establish and maintain routes among themselves [3]. They may relay traffic on behalf of other MRs that are not within direct wireless transmission range of the MPs. This paradigm is similar to the routing process in a wired network, but the radio capacity of a WMN is usually lower than that of a wired network. In order to support high-bandwidth broadband services, it is desirable to utilize the limited capacity efficiently and optimize the throughput capacity of the WMN [4, 5].

The placement of MPs strongly affects the performance of a WMN, such as its throughput. Since the MPs aggregate the traffic flows, they may become bottlenecks of data transmission if they are misplaced. Zhang and Tsang proposed an algorithm to form an optimal topology that balances the traffic in the network [6]. However, their approach applies to a newly designed WMN, not to a WMN whose topology is given. On the other hand, the authors in [7] consider the placement of MPs in a given network topology. They focused on the impact of link capacity constraints, wireless interference, fault tolerance, and variable traffic demands, and designed a series of placement algorithms. Li et al. proposed a novel grid-based gateway deployment method to solve the problem of gateway placement for throughput optimization in WMNs [8]. Specifically, they studied how to place a given number of MPs in a WMN so that the total throughput achieved is maximized. Their results show that the method achieves better throughput than both random deployment and fixed deployment. As can be seen, detailed planning of the placement of MPs is essential to increasing the system capacity effectively. However, these studies do not take the existing routing protocols into account.

In WMNs, routing protocols are needed so that MRs can forward traffic to the appropriate MP (Internet gateway). Routing protocols for WMNs share the common goal of achieving a higher network throughput. However, most existing studies evaluate the performance of these protocols under a predefined topology, i.e., with the placement of the MPs fixed [9-12]. Such an evaluation cannot reflect the optimal throughput that a routing protocol can achieve. In this paper, we study the optimal placement of MPs in a WMN with respect to different routing protocols. Given a set of routing paths generated by a given routing protocol, our goal is to find the positions of the MPs that optimize the throughput capacity of the network.

The rest of the paper is organized as follows. In Section 2, we present the model of WMNs and mathematically formulate the throughput optimization problem for a given network topology. We then give numerical examples in Section 3. Section 4 concludes the paper.
2 Problem Formulation

Traffic in WMNs is mainly transmitted to or from the Internet. Since Internet access is commonly asymmetric (i.e., there is more downlink traffic than uplink traffic), the goal of our study is to find the placement of MPs that maximizes the downlink throughput. In the following, we formulate the placement problem under two scenarios: 1) a WMN with a single MP, and 2) a WMN with multiple MPs.

2.1 WMN with a Single MP

Suppose that a WMN contains N MRs (see Figure 1). We define a set A in which each element denotes one of the MRs, $A = \{a_1, a_2, \ldots, a_n, \ldots, a_N\}$. These MRs are connected with one another by wireless links to form a wireless backbone network. Each MR is associated with some mobile clients. The MR denoted $a_e$ is the MP, which connects to the wired network and acts as the gateway for the network.

[Figure 1: access points $a_1, \ldots, a_N$ and the MP $a_e$, which has a wired connection to the Internet; wireless connections link the access points to one another and to their mobile stations.]

Fig. 1. A WMN with N mesh routers and one mesh point

In order to forward data from the wired network to the mobile clients, multi-hop routing is performed through the wireless backbone. There are a number of routing protocols for WMNs, such as Routing Information Protocol (RIP), Hybrid Wireless Mesh Protocol (HWMP), and Radio Aware Optimized Link State Routing (RA-OLSR). Our method of finding the optimal placement of the MPs, discussed below, is not restricted to a particular routing algorithm. However, multipath routing is not allowed in our study. Let the matrix $C_e$ denote the result of running the routing algorithm. It specifies the resulting paths from $a_e$ to each node:

$$ C_e = \begin{pmatrix} c_{1,1} & c_{1,2} & \cdots & c_{1,N} \\ c_{2,1} & \ddots & & \vdots \\ \vdots & & c_{i,j} & c_{N-1,N} \\ c_{N,1} & \cdots & c_{N,N-1} & c_{N,N} \end{pmatrix} \tag{1} $$

where

$$ c_{i,j} = \begin{cases} 1 & \text{if the link from } a_i \text{ to } a_j \text{ is on the path from } a_e \text{ to } a_j \\ 0 & \text{otherwise} \end{cases} $$
Note that $c_{i,j} = 0$ means either that there is no wireless connection between $a_i$ and $a_j$, or that the link from $a_i$ to $a_j$ is not on the path from $a_e$ to $a_j$. For a given routing algorithm, $C_e$ can be obtained if all link metrics are known. Let $m_n$ be the number of mobile clients associated with $a_n$, and let $\lambda$ be the incoming frame rate of the downlink traffic from the wired network to each client. We then define $t_n$ as the downlink traffic to the clients of $a_n$:

$$ t_n = \lambda m_n \tag{2} $$

Let $h_n$ be the number of hops from $a_e$ to $a_n$, and let $h_{max}$ be the largest $h_n$ over all n. Define also

$$ d_{i,n} = \begin{cases} 1 & \text{if } h_n - h_i = 1 \\ 0 & \text{otherwise} \end{cases} \tag{3} $$

If $d_{i,n} c_{i,n} = 1$, then $a_n$ is a next hop of $a_i$ in the downlink transmission.
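The hop counts $h_n$ and the indicator $d_{i,n}$ of Eq. (3) can be derived mechanically from a routing matrix. The following is a minimal sketch, assuming 0-based node indices and a small hypothetical four-node matrix that is not taken from the paper; the function names are illustrative only.

```python
from collections import deque

def hop_counts(C, e):
    """Hop count h_n from the MP (index e) to each node, following the 1-entries of C."""
    N = len(C)
    h = [None] * N          # None marks a node not reached from the MP
    h[e] = 0
    queue = deque([e])
    while queue:
        i = queue.popleft()
        for n in range(N):
            if C[i][n] == 1 and h[n] is None:
                h[n] = h[i] + 1
                queue.append(n)
    return h

def next_hop_indicator(h):
    """d[i][n] = 1 when h_n - h_i = 1, as in Eq. (3)."""
    N = len(h)
    d = [[0] * N for _ in range(N)]
    for i in range(N):
        for n in range(N):
            if h[i] is not None and h[n] is not None and h[n] - h[i] == 1:
                d[i][n] = 1
    return d

# Hypothetical example: node 0 is the MP, it forwards to nodes 1 and 2,
# and node 1 forwards to node 3.
C = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
h = hop_counts(C, e=0)
d = next_hop_indicator(h)
```

In this example node 2 and node 3 differ by one hop, so `d[2][3]` is 1, yet `C[2][3]` is 0 because node 3 is reached through node 1. Only the product of the two entries identifies an actual next hop.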
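Next, let $W_i$ be the bandwidth of the wireless link connecting $a_i$ and $a_k$, where $d_{k,i} = 1$ (that is, the link over which $a_i$ receives its downlink traffic). The total carried traffic of each node, $f(a_i)$, is then obtained as:

$$ f(a_i) = \begin{cases} t_i + \sum_{n=1}^{N} f(a_n)\, c_{i,n}\, d_{i,n} & \text{if } i = e \\ t_i + \sum_{n=1}^{N} f(a_n)\, c_{i,n}\, d_{i,n} & \text{if } i \neq e \text{ and } \left( t_i + \sum_{n=1}^{N} f(a_n)\, c_{i,n}\, d_{i,n} \right) \le W_i \\ W_i & \text{if } i \neq e \text{ and } \left( t_i + \sum_{n=1}^{N} f(a_n)\, c_{i,n}\, d_{i,n} \right) > W_i \end{cases} \tag{4} $$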
Note that Eq. (4) is recursive, as $f(a_i)$ depends on $f(a_n)$ for other nodes $n \neq i$. Therefore, we first compute $f(a_n)$ for the nodes with $h_n = h_{max}$, then for those with $h_n = h_{max}-1$, $h_{max}-2$, and so on. When n = e, we obtain the total carried traffic of the WMN at the MP $a_e$. The total carried traffic at the MP is limited when one or more wireless links in the WMN are over-utilized, that is, when the traffic offered to some node $a_n$ exceeds the bandwidth $W_n$ of its upstream link. By putting the MP in different places, we obtain different values of the total carried traffic. The optimal placement of the MP is the position e for which $f(a_e)$ is at least as large as the total carried traffic obtained with the MP at any other position. Numerical examples will be given later to show how these equations can be used. This study also gives the highest incoming frame rate that the network can support without over-utilizing any wireless link. More specifically, we find it by increasing the value of $\lambda$ until the first wireless link in the network becomes over-utilized.

2.2 Multiple MPs Connecting to the Wired Network

To increase the capacity and reliability of a network, multiple MPs are usually employed in a WMN. Let E be the set of MPs in the WMN, where $E \subseteq A$. Given the routing algorithm, each MP in E can calculate the shortest paths from itself to each MR in the network. We then compare the paths from each MP to the same MR. The shortest one among these paths is selected as the routing result and put into the matrix as in Eq. (1). By considering all possible positions of the MPs (i.e., all $\binom{N}{|E|}$ combinations of MP positions), we select the positions that give the maximum total carried traffic. The total carried traffic is the sum of the carried traffic at each MP, i.e.,

$$ \text{Total carried traffic} = \sum_{e \in E} f(a_e) \tag{5} $$
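The bottom-up evaluation of Eq. (4) and the sum in Eq. (5) can be sketched as follows. This is an illustrative sketch, not the authors' implementation. It assumes that the routing matrix holds only tree edges, so that following its 1-entries is equivalent to using the product $c_{i,n} d_{i,n}$. The routing matrices below follow the six-node example of Section 3 with 0-based indices, and the link capacities in `W` are assumed values chosen from the 15λ and 40λ labels of that example. All quantities are in units of λ.

```python
def carried_traffic(C, mps, t, W):
    """Carried traffic f(a_e) at each MP listed in `mps`, following Eq. (4).

    C   : N x N 0/1 routing matrix, C[i][n] = 1 if a_i forwards downlink traffic to a_n
    mps : list of MP indices (the roots of the routing trees)
    t   : per-node client traffic t_n
    W   : dict mapping an undirected link (low index, high index) to its bandwidth
    """
    N = len(C)

    def f(i, parent):
        # Own traffic plus the carried traffic of every next hop (deepest nodes resolve first).
        total = t[i] + sum(f(n, i) for n in range(N) if C[i][n] == 1)
        if parent is None:
            return total                      # i = e: the MP is not capped by a wireless link
        link = (min(parent, i), max(parent, i))
        return min(total, W[link])            # cap at the bandwidth of the upstream link

    return {e: f(e, None) for e in mps}

t = [20] * 6                                   # t_n = 20 (in units of lambda)
# Assumed capacities of the links used below (illustrative values).
W = {(0, 1): 40, (0, 3): 40, (0, 4): 15, (1, 2): 40, (1, 5): 15}

# Single MP at a1: a1 -> a2, a4, a5 and a2 -> a3, a6.
C1 = [
    [0, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
single = carried_traffic(C1, [0], t, W)

# Two MPs at a1 and a2: a1 -> a4, a5 and a2 -> a3, a6.
C12 = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
pair = carried_traffic(C12, [0, 1], t, W)
total_pair = sum(pair.values())                # Eq. (5)
```

Under these assumed capacities the single-MP case evaluates to 95 and the two-MP case to 110, both in units of λ. Repeating the call for every candidate MP position, or every combination of positions, and keeping the maximum gives the placement search described above.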
3 Numerical Examples

The first example finds the optimal placement of a single MP in a WMN. The WMN consists of six MRs, $a_1$ to $a_6$, connected to one another. One of them acts as the MP of the network. As shown in Fig. 2, each node is assigned as the MP (the dark node) in turn, and the routing protocol RIP is then run to find the shortest paths from this MP to the other MRs. The resulting shortest paths are shown as dotted lines in the figure. According to these paths and Eq. (1), the routing results are represented by the six matrices given below.

[Figure 2: six panels, one for each MP position $a_1$ to $a_6$, showing the six MRs and their wireless links; each link is labeled with its capacity, either 15λ or 40λ.]

$$ C_1 = \begin{pmatrix} 0&1&0&1&1&0 \\ 0&0&1&0&0&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} \qquad C_2 = \begin{pmatrix} 0&0&0&1&0&0 \\ 1&0&1&0&1&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} $$

$$ C_3 = \begin{pmatrix} 0&0&0&1&0&0 \\ 1&0&0&0&0&0 \\ 0&1&0&0&1&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} \qquad C_4 = \begin{pmatrix} 0&1&0&0&0&0 \\ 0&0&1&0&0&1 \\ 0&0&0&0&0&0 \\ 1&0&0&0&1&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} $$

$$ C_5 = \begin{pmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 1&1&1&1&0&1 \\ 0&0&0&0&0&0 \end{pmatrix} \qquad C_6 = \begin{pmatrix} 0&0&0&1&0&0 \\ 1&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&1&1&0&1&0 \end{pmatrix} $$

Fig. 2. RIP is running in a WMN with one MP at any position of six MRs
Assume that 20 mobile clients are associated with each MR ($m_n = 20$ for n = 1, 2, ..., 6). The incoming frame rate $\lambda$ of the downlink traffic from the wired network is the same for all clients. It is further assumed that the frame processing rates of the MRs are never smaller than the incoming traffic, that is, $\mu_n \ge \lambda m_n$ for n = 1, 2, ..., 6. According to Eq. (2), the traffic generated to each MR is

$$ t_n = 20\lambda, \quad n = 1, 2, \ldots, 6. \tag{6} $$
Then, the total traffic aggregated at the MP in each location is calculated by Eq. (4). Note that the capacities $W_i$ of the wireless links connecting the MRs vary and are limited to the values shown on the links in the figure. Taking $a_1$ as the MP as an example, its total aggregated traffic is given by Eq. (7). Eq. (7) is recursive: $f_{RIP}(a_3)$ and $f_{RIP}(a_6)$ must be found first, because when $a_1$ is the MP the largest number of hops $h_{max}$ is 2, which occurs when traffic is routed from the MP to either $a_3$ or $a_6$ (i.e., $h_3 = h_6 = 2$). After obtaining $f_{RIP}(a_3)$ and $f_{RIP}(a_6)$, we find $f_{RIP}(a_2)$, $f_{RIP}(a_4)$, and $f_{RIP}(a_5)$ (because $h_2 = h_4 = h_5 = h_{max} - 1 = 1$). Finally, $f_{RIP}(a_1)$ is obtained.

$$ f_{RIP}(a_1) = \begin{cases} t_1 + \sum_{n=1}^{N} f(a_n)\, c_{1,n}\, d_{1,n} & \text{if } 1 = e \\ t_1 + \sum_{n=1}^{N} f(a_n)\, c_{1,n}\, d_{1,n} & \text{if } 1 \neq e \text{ and } \left( t_1 + \sum_{n=1}^{N} f(a_n)\, c_{1,n}\, d_{1,n} \right) \le W_1 \\ W_1 & \text{if } 1 \neq e \text{ and } \left( t_1 + \sum_{n=1}^{N} f(a_n)\, c_{1,n}\, d_{1,n} \right) > W_1 \end{cases} \tag{7} $$
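As a worked instance of this recursion, suppose that $a_3$ and $a_6$ are both reached through $a_2$, and assume the following upstream link capacities, which are illustrative values picked from the 15λ and 40λ labels rather than values stated in the text: $W_2 = W_3 = W_4 = 40\lambda$ and $W_5 = W_6 = 15\lambda$. Evaluating from the deepest nodes upward then gives:

$$
\begin{aligned}
f_{RIP}(a_3) &= \min(20\lambda,\ 40\lambda) = 20\lambda, \qquad f_{RIP}(a_6) = \min(20\lambda,\ 15\lambda) = 15\lambda, \\
f_{RIP}(a_2) &= \min(20\lambda + 20\lambda + 15\lambda,\ 40\lambda) = 40\lambda, \\
f_{RIP}(a_4) &= \min(20\lambda,\ 40\lambda) = 20\lambda, \qquad f_{RIP}(a_5) = \min(20\lambda,\ 15\lambda) = 15\lambda, \\
f_{RIP}(a_1) &= 20\lambda + 40\lambda + 20\lambda + 15\lambda = 95\lambda .
\end{aligned}
$$

Under these assumptions the result agrees with the 95λ reported for $a_1$ in Table 1.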
Table 1 shows the results for all possible placements of the MP. As seen from the second column of the table, if the MP of the WMN using RIP is placed at $a_2$, it can carry the highest volume of downlink traffic for the network (115λ).

Next, we apply another routing protocol, HWMP, to the above WMN. In our example, HWMP is configured in the mode with a root portal, in which most of the traffic is routed to the root (i.e., the MP). It is a proactive routing protocol that uses a radio-aware routing metric to identify an efficient path. The metric of each connection is the number shown on the link in Fig. 3. The minimum-cost paths from the MP to the other MRs are shown as dotted lines, and the costs are shown in the circles.

[Figure 3: six panels, one for each MP position $a_1$ to $a_6$. Each link is labeled with its routing metric and its capacity (for example "8, 40λ" or "30, 15λ"); the circled number at each MR is its minimum path cost from the MP.]

$$ C_1 = \begin{pmatrix} 0&1&0&1&0&0 \\ 0&0&1&0&1&0 \\ 0&0&0&0&0&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} \qquad C_2 = \begin{pmatrix} 0&0&0&1&0&0 \\ 1&0&1&0&1&0 \\ 0&0&0&0&0&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} $$

$$ C_3 = \begin{pmatrix} 0&0&0&1&0&0 \\ 1&0&0&0&1&0 \\ 0&1&0&0&0&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} \qquad C_4 = \begin{pmatrix} 0&1&0&0&0&0 \\ 0&0&1&0&1&0 \\ 0&0&0&0&0&1 \\ 1&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} $$

$$ C_5 = \begin{pmatrix} 0&0&0&1&0&0 \\ 1&0&1&0&0&0 \\ 0&0&0&0&0&1 \\ 0&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix} \qquad C_6 = \begin{pmatrix} 0&0&0&1&0&0 \\ 1&0&0&0&1&0 \\ 0&1&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&1&0&0&0 \end{pmatrix} $$

Fig. 3. HWMP is running in a WMN with one MP at any position of six MRs
By the same approach used to calculate $f_{RIP}(a_n)$, the total carried traffic at the MP located at each MR in the WMN running HWMP is calculated and shown in the last column of Table 1. As shown in the table, the MP should again be placed at $a_2$ to support the largest downlink traffic (120λ). However, the best performance of one routing algorithm may differ from that of another: from Table 1, the best RIP placement carries 115λ while the best HWMP placement carries 120λ. Therefore, the selection of the routing protocol for the network is important. On the other hand, if the MP is placed at $a_4$ in the network using RIP, it can carry only 75λ of downlink traffic, so the MP at $a_2$ carries about 53% more traffic than the MP at $a_4$. Similarly, there is a 100% performance difference when HWMP is used. This indicates that the positions of MPs can substantially influence the overall system throughput.
Table 1. Total carried traffic at the single MP of the WMN

MP    f_RIP(a_e)    f_HWMP(a_e)
a1    95λ           80λ
a2    115λ          120λ
a3    95λ           80λ
a4    75λ           60λ
a5    100λ          60λ
a6    70λ           60λ
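The comparisons drawn from Table 1 can be checked with a few lines of arithmetic. The sketch below copies the table values (in units of λ); the helper names are illustrative, and the relative difference is computed against the lower of the two values, which is the convention that reproduces the 53% and 100% figures quoted in the text.

```python
# Carried traffic from Table 1, in units of lambda.
table1 = {
    "RIP":  {"a1": 95, "a2": 115, "a3": 95, "a4": 75, "a5": 100, "a6": 70},
    "HWMP": {"a1": 80, "a2": 120, "a3": 80, "a4": 60, "a5": 60, "a6": 60},
}

def best_placement(traffic):
    """MP position with the largest carried traffic."""
    return max(traffic, key=traffic.get)

def relative_gain(traffic, better, worse):
    """Extra traffic carried at `better`, as a fraction of the traffic carried at `worse`."""
    return (traffic[better] - traffic[worse]) / traffic[worse]

for protocol, traffic in table1.items():
    best = best_placement(traffic)
    gain = relative_gain(traffic, best, "a4")
    print(f"{protocol}: best MP {best} ({traffic[best]} lambda), {gain:.0%} more than a4")
```

For RIP, the gain of $a_2$ over $a_4$ is 40/75, about 53%. For HWMP it is 60/60, exactly 100%.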
The second example considers two MPs deployed in the WMN. Running RIP again yields one routing matrix for each of the 15 possible pairs of MPs. Note that $C_{e,f}$ represents the case in which $a_e$ and $a_f$ are selected as the MPs.

[Matrices $C_{1,2}$ to $C_{5,6}$: fifteen 6 × 6 routing matrices, one for each pair of MPs.]
The total carried traffic at both MPs, f'_RIP(a_{e,f}), is shown in the second column of Table 2. Then, we repeat the above steps for the WMN using HWMP, and the results f'_HWMP(a_{e,f}) are in the last column of Table 2.

Table 2. Total carried traffic at two MPs of the WMN

MPs       f'_RIP(a_{e,f})    f'_HWMP(a_{e,f})
a1, a2    110λ               120λ
a1, a3    115λ               120λ
a1, a4    95λ                120λ
a1, a5    110λ               100λ
a1, a6    115λ               120λ
a2, a3    115λ               120λ
a2, a4    115λ               120λ
a2, a5    110λ               120λ
a2, a6    120λ               120λ
a3, a4    115λ               120λ
a3, a5    110λ               80λ
a3, a6    95λ                80λ
a4, a5    110λ               100λ
a4, a6    110λ               80λ
a5, a6    105λ               80λ
Similar to the previous results, the placement of the MPs affects the system throughput. As shown in Table 2, the performance difference between the best and the worst placement of the MPs is about 26% for the network using RIP and 50% for the network using HWMP. Furthermore, the best positions of the MPs may differ between networks using different routing algorithms. In the example above, $a_2$ and $a_6$ are the best positions for the MPs of the network using RIP; this combination of MPs can support 120λ of downlink traffic. For the network using HWMP, however, any of the pairs ($a_1$, $a_2$), ($a_1$, $a_3$), ($a_1$, $a_4$), ($a_1$, $a_6$), ($a_2$, $a_3$), ($a_2$, $a_4$), ($a_2$, $a_5$), ($a_2$, $a_6$), or ($a_3$, $a_4$) is a best placement of the MPs.
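The best pairs and the spread between placements can be read off Table 2 programmatically. The sketch below copies the table values (in units of λ); the function names are illustrative, and the spread is measured relative to the worst pair, the same convention as in the single-MP comparison.

```python
# Table 2: (RIP, HWMP) total carried traffic per MP pair, in units of lambda.
table2 = {
    (1, 2): (110, 120), (1, 3): (115, 120), (1, 4): (95, 120),
    (1, 5): (110, 100), (1, 6): (115, 120), (2, 3): (115, 120),
    (2, 4): (115, 120), (2, 5): (110, 120), (2, 6): (120, 120),
    (3, 4): (115, 120), (3, 5): (110, 80),  (3, 6): (95, 80),
    (4, 5): (110, 100), (4, 6): (110, 80),  (5, 6): (105, 80),
}
COLUMN = {"RIP": 0, "HWMP": 1}

def best_pairs(protocol):
    """All MP pairs that reach the maximum total carried traffic."""
    col = COLUMN[protocol]
    top = max(v[col] for v in table2.values())
    return sorted(pair for pair, v in table2.items() if v[col] == top)

def spread(protocol):
    """Difference between best and worst pair, relative to the worst pair."""
    col = COLUMN[protocol]
    values = [v[col] for v in table2.values()]
    return (max(values) - min(values)) / min(values)

rip_best = best_pairs("RIP")      # a single best pair
hwmp_best = best_pairs("HWMP")    # several pairs tie at the maximum
```

With these values RIP has a single best pair, ($a_2$, $a_6$), while nine pairs tie at 120λ under HWMP. The spread is 25/95, about 26%, for RIP and 40/80, that is 50%, for HWMP.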
4 Conclusion

In this paper, we show that the placement of the MP(s) in a WMN is important because it affects the throughput of the network. Numerical examples show a throughput difference of up to 100% between different positions of an MP. Also, the best positions of the MPs may differ between networks using different routing algorithms. Therefore, comparing routing algorithms under one given set of MP positions is not fair. Instead, the optimal performance of each routing algorithm should be found by first finding its corresponding optimal placement of MPs. Lastly, the selection of the routing algorithm remains a key performance factor of the network, because the best performance of one algorithm may differ from that of another.
References

1. Akyildiz, I.F., Wang, X.: A Survey on Wireless Mesh Networks. IEEE Radio Communications 43(9), S23–S30 (2005)
2. Wu, X., Liu, J., Chen, G.: Analysis of Bottleneck Delay and Throughput in Wireless Mesh Networks. IEEE MASS, 765–770 (2006)
3. Jun, J., Sichitiu, M.L.: The Nominal Capacity of Wireless Mesh Networks. IEEE Wireless Communications 10, 8–14 (2003)
4. Gupta, P., Kumar, P.R.: The Capacity of Wireless Networks. IEEE Transactions on Information Theory 46(2), 388–404 (2000)
5. Gamal, A.E., Mammen, J., Prabhakar, B., Shah, D.: Throughput-delay Trade-off in Wireless Networks. In: Proc. IEEE INFOCOM (2004)
6. Zhang, H., Tsang, D.H.K.: Traffic Oriented Topology Formation and Load-balancing Routing in Wireless Mesh Networks. In: Proc. IEEE ICCC, pp. 1046–1052 (2007)
7. Chandra, R., Qiu, L., Jain, K., Mahdian, M.: Optimizing the Placement of Internet TAPs in Wireless Neighborhood Networks. In: ICNP, pp. 271–282 (2004)
8. Li, F., Wang, Y., Li, X.Y.: Gateway Placement for Throughput Optimization in Wireless Mesh Networks. In: Proc. IEEE ICC, pp. 4955–4960 (2007)
9. Draves, R., Padhye, J., Zill, B.: Comparison of Routing Metrics for Static Multi-hop Wireless Networks. In: Proc. of the ACM SIGCOMM (2004)
10. Shen, Q., Fang, X.M.: A Multi-metric AODV Routing in IEEE 802.11s. In: Proc. IEEE ICCT, pp. 1–4 (2006)
11. Zhu, C.H., Lee, M.J., Saadawi, T.: On the Route Discovery Latency of Wireless Mesh Networks. In: Proc. IEEE CCNC, pp. 19–23 (2005)
12. Liu, L.G., Feng, G.Z.: Mean Field Network Based QoS Routing Scheme in Wireless Mesh Networks. In: Proc. IEEE WCMN, vol. 2, pp. 1110–1113 (2005)
A Distributed Channel Access Scheduling Scheme with Clean-Air Spatial Reuse for Wireless Mesh Networks*

Yuan-Chieh Lin¹, Shun-Wen Hsiao¹, Li-Ping Tung², Yeali S. Sun¹, and Meng Chang Chen²

¹ Dept. of Information Management, National Taiwan University, Taipei, Taiwan
{r94001,r93011,sunny}@im.ntu.edu.tw
² Institute of Information Science, Academia Sinica, Taipei, Taiwan
{lptung,mcc}@iis.sinica.edu.tw
Abstract. There are two effective approaches to maximizing network capacity throughput: increasing the number of concurrent non-interfering transmissions by exploiting spatial reuse, and increasing the transmission rates of active senders. These two approaches, however, trade off against each other because of signal interference. In this paper, we propose a distributed channel access scheduling scheme under a Clean-Air Spatial Reuse architecture, which spans both the MAC layer and the network planning plane, to scale a wireless mesh network to high network capacity throughput and large coverage. Simulation results for the network capacity throughput under different levels of Clean-Air Spatial Reuse policy are presented. The results show that scheduling more concurrent transmission pairs in each time slot can usually compensate for the negative effect of using lower transmission rates on the links and results in better throughput performance.
* This research is partly supported by National Science Council under grants NSC 95-2221-E-001-018-MY3 and NSC 95-2221-E-002-192.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 856–864, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction

The wireless mesh network is expected to provide large transmission capacity and to serve as a promising complementary solution to the existing broadband access infrastructure. An important performance metric for evaluating the effectiveness of such a network is the network capacity throughput, defined as the aggregate number of bits that can be successfully received by designated receivers within the network in a certain period of time. The main factor that limits the network capacity throughput of a wireless mesh network is the interference between neighboring nodes when using a shared medium [1]. The network capacity throughput of a wireless mesh network is determined by the number of non-interfering transmissions that can be achieved at the same time (i.e., the transmission concurrency) and by their transmission rates. According to Shannon's Channel Capacity Theorem [2] and the SINR definition, a successful transmission at a higher transmission rate requires a higher SINR at the receiver, which means that the interference at the receiver should be kept low by controlling the number of concurrent transmissions in the neighborhood. Conversely, a successful transmission can also be achieved in a lower-SINR environment by lowering the sender's transmission rate, which allows more concurrent non-interfering transmissions in the neighborhood. In other words, transmission concurrency and senders' transmission rates trade off against each other. Fig. 1 shows the network capacity throughputs under two different transmission rate and scheduling schemes. Note that the transmission rate grows only logarithmically with the SINR. Hence, to accomplish higher network capacity throughput, it is more effective to increase the number of concurrent transmissions linearly and accept a logarithmic decrease in the transmission rates of individual links.
[Figure 1: (a) three concurrent links at 39.6, 31.7, and 39.6 Mbps, 110.9 Mbps in total; (b) six concurrent links at 21.0, 19.1, 26.2, 19.1, 26.2, and 21.0 Mbps, 132.6 Mbps in total.]

Fig. 1. Network capacity throughput comparison under two different transmission rates and scheduling schemes (link bandwidth W = 10 MHz, path loss exponent = 4.0)
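The tradeoff can be made concrete with the Shannon capacity formula, rate = W log2(1 + SINR). The sketch below is illustrative only: the function name and the SINR values are assumptions, and nothing here claims that the per-link rates in Fig. 1 were produced by this exact computation. The per-link rates themselves are copied from the figure, and only their sums are checked.

```python
import math

def shannon_rate_mbps(bandwidth_mhz, sinr_linear):
    """Shannon capacity in Mbps for a bandwidth in MHz and a linear (not dB) SINR."""
    return bandwidth_mhz * math.log2(1.0 + sinr_linear)

# With W = 10 MHz, one link at SINR 15 carries as much as two links at SINR 3:
one_fast_link = shannon_rate_mbps(10, 15)        # 10 * log2(16)
two_slow_links = 2 * shannon_rate_mbps(10, 3)    # 2 * 10 * log2(4)

# Per-link rates read from Fig. 1 (Mbps).
scheme_a = [39.6, 31.7, 39.6]
scheme_b = [21.0, 19.1, 26.2, 19.1, 26.2, 21.0]
total_a = round(sum(scheme_a), 1)
total_b = round(sum(scheme_b), 1)
```

Cutting the SINR from 15 to 3 only halves the link rate, so doubling the number of concurrent links already recovers the loss. This is the logarithmic effect the text describes, and the totals of the two schemes in Fig. 1 show the same pattern: six slower links carry more than three faster ones.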
A number of previous works study how concurrent non-interfering transmissions in a TDMA-based mesh network affect the arrangement of individual link transmission rates and the resulting network capacity throughput. Most of them use techniques such as the compatibility matrix [3] to first find non-interfering transmission pairs and then use centralized algorithms for time slot assignment [4][5][6][7]. Various linear programming and heuristic methods have been proposed for time slot assignment [8][1]. These works require prior knowledge of the traffic requirements and of the interference relationships between wireless nodes, and assume that these do not change frequently. Other drawbacks of completely centralized access scheduling in mesh networks are that optimal and near-optimal scheduling are highly complex and time consuming, and that global information about the traffic demands in the network is needed. This approach may not be computationally efficient for real-time multi-access network scheduling.

Due to the inherent complexity of interference in a mesh network, and without global knowledge of the traffic demand, realizing a practical distributed data transmission schedule raises great challenges in coordinating active transmission pairs to avoid collisions. One solution is the election-based three-way handshake scheme in IEEE 802.16 networks [9]. This scheme incurs a long connection setup delay even for small mesh networks. In [10], the authors show that more than 50 control message exchanges are necessary to complete a single three-way handshake procedure in a 30-node network. In [11], a distributed algorithm is used to dynamically tune each wireless station's power and transmission rate to achieve optimal spatial reuse and network capacity throughput in a CSMA-based network. In [12], the authors investigate the receiver's sensitivity to different transmission rates, transmission ranges, node topologies, and SINR in determining the optimal carrier sense range for a CSMA-based network. These results cannot be directly applied to TDMA-based mesh networks.

In this work, our design goals are twofold. First, we wish to retain the flexibility of distributed channel access scheduling while reducing the complexity of distributed multi-access contention resolution. Second, we want to increase transmission concurrency as much as possible in the control and data slot access, to raise network capacity throughput by exploiting spatial reuse with the support of variable transmission rate capability at the wireless stations. Here, we propose a combined approach, which spans both the MAC layer and the network planning plane, to scale a wireless mesh network to high network capacity throughput and large coverage. We use a regular hexagonal wireless mesh network (as shown in Fig. 2) as an illustrative example throughout the paper. However, we emphasize that the approach is general and applicable to wireless mesh networks with arbitrary topology.

The rest of the paper is organized as follows. In Section 2, we present the distributed three-way handshake protocol for dynamic data transfer schedule setup by active transmission pairs in the mesh network. Section 3 presents the proposed planning model. Section 4 describes the distributed dynamic data transfer scheduling scheme combined with the planning model. Section 5 presents the simulation results for the network capacity throughput performance. Section 6 gives the conclusion.
2 The Three-Way Handshake Protocol for Distributed Data Packet Scheduling

Assume that the wireless network access time is synchronized and slotted. The time slots are organized into a sequence of frames. Each frame consists of a control subframe and a data subframe. The control subframe comprises a number of control opportunity rounds used by active sender-receiver pairs to establish data transfer connections. A round consists of three control slots for the three-way handshake. The data subframe consists of a number of fixed-size data slots. To reserve data slots, an active sender must first obtain a handshake opportunity to initiate the connection setup with the intended receiver, and the two must successfully agree on a feasible data slot transmission schedule (or reservation). An important function of this protocol is to let the "neighbors" of the sender and receiver learn of their data slot reservation. When the neighbors engage in their own data slot scheduling, they exclude these reserved data slots from the negotiation to avoid collisions. The design issue thus becomes how far the control messages should reach, that is, the scope of the "neighbors". To have fewer concurrent transmissions in a data slot, the negotiation information needs to reach as many nodes as possible, and vice versa.
3 The Planning Model

In this paper, a planning model is developed to a) reduce the complexity and overhead of control channel access scheduling, and b) raise network capacity throughput by exploiting transmission concurrency for both the control and the data channel accesses. There are four processes: the specification of the Level-N Clean-Air Spatial Reuse policy, the formation of the concurrent data access groups, the formation of the concurrent control access groups, and the determination of the transmission rate of the stations in each access group.

3.1 The Level-N Clean-Air Spatial Reuse Policy

There are V stations, denoted SS_1, SS_2, …, SS_V, in a TDMA-based mesh network. Each SS_i has at most k stations SS_j within the distance d, which are referred to as its level-1 neighbors. Note that the distance d can be arbitrary and is constrained only by the maximum transmission range. The level-1 neighbors of a station's level-1 neighbors are referred to as its level-2 neighbors, and so on (as shown in Fig. 2). The idea of the Level-N Clean-Air Spatial Reuse policy is to specify the minimum scope (level) between any two concurrent transmissions in the network, hence the name clean air. The policy is used to partition all possible sender-receiver pairs in the network into distinct logical concurrent access groups. Pairs within the same group are allowed to transmit in the same time slot. Hence, given the Level-N Clean-Air Spatial Reuse policy, the assignment of active sender-receiver pairs to an access group must satisfy the following two conditions:

• for every SS assigned to transmit in the slot, all its level-1, level-2, …, up to level-N neighbors, except for the intended receiver, must not be scheduled for transmitting or receiving;
• for every SS scheduled to receive, all its level-1, level-2, …, up to level-N neighbors, except for the intended transmitter, must not be scheduled for transmitting or receiving.
Fig. 2. The wireless mesh network and the Level-N relationship between SSi and its neighbors
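The two Level-N conditions can be checked mechanically once the neighbor levels are known. The following is a minimal sketch, assuming that level-k neighbors are the stations at hop distance k in an adjacency list; the function names and the six-station chain topology are hypothetical and are not taken from the paper. It covers N ≥ 1 only.

```python
from collections import deque

def neighbor_levels(adj, src, max_level):
    """Return {level: set of stations} for levels 1..max_level around src (BFS by hop count)."""
    dist = {src: 0}
    queue = deque([src])
    levels = {k: set() for k in range(1, max_level + 1)}
    while queue:
        u = queue.popleft()
        if dist[u] == max_level:
            continue  # do not expand beyond the requested level
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                levels[dist[v]].add(v)
                queue.append(v)
    return levels

def satisfies_level_n(adj, pairs, n):
    """Check both Level-N conditions (n >= 1) for a set of (sender, receiver) pairs."""
    busy = {s for s, _ in pairs} | {r for _, r in pairs}
    for s, r in pairs:
        # The same test applies to the sender (partner = receiver) and to the receiver.
        for station, partner in ((s, r), (r, s)):
            levels = neighbor_levels(adj, station, n)
            nearby = set().union(*levels.values()) - {partner}
            if nearby & busy:
                return False
    return True

# Hypothetical chain topology 1-2-3-4-5-6 (an assumption for illustration).
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
level1_ok = satisfies_level_n(adj, [(1, 2), (4, 5)], 1)
level2_ok = satisfies_level_n(adj, [(1, 2), (4, 5)], 2)
```

In this chain the pairs (1, 2) and (4, 5) may share a slot under Level-1, but not under Level-2, because station 4 is a level-2 neighbor of station 2.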
To further increase the number of concurrent transmissions in a group, we also define the Level-0 Clean-Air Spatial Reuse policy with the following two conditions:
• for every SS scheduled to transmit, all its level-1 neighbors, except for the intended receiver, must not be scheduled for receiving;
Y.-C. Lin et al.
• for every SS scheduled to receive, all its level-1 neighbors, except for the intended transmitter, must not be scheduled for transmitting.

The formation of the concurrent data and control access groups uses the same Level-N policy to ensure that the data schedules of the stations in the network are consistent.

3.2 The Concurrent Data Access Groups

In this process, we partition all possible sender-receiver pairs in the network into a number of concurrent data access groups according to the specified Level-N Clean-Air Spatial Reuse policy. Pairs within the same group can be scheduled to transmit in the same data slot. (Note that, depending on the traffic demands, the proposed distributed data scheduling scheme in Section 4 allows pairs belonging to different groups to transmit in the same data slot.) Assume all data transmissions occur between a station and one of its level-1 neighbors. We first divide all possible transmission pairs into a number of clusters. Then, the partitioning algorithm is applied to each cluster. It is a greedy algorithm that assigns as many sender-receiver pairs as possible to the same access group, subject to the Level-N Clean-Air Spatial Reuse policy constraint. Fig. 3 shows the initial six clusters, which result in twelve concurrent data access groups for the Level-1 Clean-Air Spatial Reuse policy.
[Fig. 3 panels: (a) initial 6 clusters, labeled Cluster1 to Cluster6; (b) 12 concurrent data access groups, labeled D1 to D12]
Fig. 3. The greedy initial clustering and concurrent data access group assignment for the Level-1 Clean-Air Spatial Reuse policy
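The greedy assignment described in Section 3.2 can be illustrated with a first-fit grouping. This is a sketch under stated assumptions, not the authors' algorithm: the initial clustering step is omitted, the stations form a hypothetical chain in which the hop distance between stations a and b is |a − b|, and the names `greedy_groups` and `chain_compatible` are invented for this example.

```python
def greedy_groups(pairs, compatible):
    """First-fit greedy: put each pair into the first group where it conflicts with no member."""
    groups = []
    for pair in pairs:
        for group in groups:
            if all(compatible(pair, other) for other in group):
                group.append(pair)
                break
        else:
            # No existing group can take the pair, so open a new one.
            groups.append([pair])
    return groups

def chain_compatible(n):
    """Level-n compatibility on a hypothetical chain where hop distance is |a - b|."""
    def compatible(p, q):
        # Every endpoint of one pair must be more than n hops from every endpoint of the other.
        return min(abs(a - b) for a in p for b in q) > n
    return compatible

# Directed links between adjacent stations of a 6-station chain, left to right.
pairs = [(i, i + 1) for i in range(1, 6)]
groups = greedy_groups(pairs, chain_compatible(1))
```

Under the Level-1 constraint this yields three groups, two of which contain two concurrent pairs each. A stricter level would spread the same pairs over more groups.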
3.3 The Concurrent Control Access Group

To regulate the control subframe access, we partition all possible transmission pairs into a number of concurrent control access groups. Sender-receiver pairs in the same group can exchange control messages in the control opportunity round designated to the group. Since all sender-receiver pairs within a control access group must be able to perform the three-way handshake protocol successfully, the “clean-air” constraint must ensure that all their level-1, …, up to level-N neighbors are
non-busy. This is a more stringent requirement on transmission concurrency than in the concurrent data access grouping.

3.4 The Transmission Rate Assignment

In this process, each sender of an access group is assigned a static and deterministic transmission rate so that all sender-receiver pairs in the same group can transmit successfully in the same time slot. Here, we assume that all sender-receiver pairs within the same group are active and that stations transmit at a power level large enough to satisfy the receivers' capture threshold requirements at any achievable transmission rate. According to the adopted Clean-Air Spatial Reuse policy, we calculate each receiver's SINR value and find the maximum achievable transmission rate for the sender.
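The rate assignment step can be sketched as an SINR computation followed by a table lookup. The example below assumes a simple distance-based path loss model (received power proportional to d^(−α)) and equal transmit power at all stations; neither assumption is stated in the text. In the rate table, only the first row (threshold 1.9484 for 10 Mbps) is the value quoted in Section 5; the second row is a made-up placeholder.

```python
def received_power(p_tx, distance, alpha):
    """Received power under an assumed path loss model p_tx * d^(-alpha)."""
    return p_tx * distance ** (-alpha)

def sinr(p_tx, d_signal, d_interferers, alpha, noise):
    """SINR at a receiver given the distance to its sender and to each concurrent sender."""
    signal = received_power(p_tx, d_signal, alpha)
    interference = sum(received_power(p_tx, d, alpha) for d in d_interferers)
    return signal / (noise + interference)

def max_rate(sinr_linear, rate_table):
    """Highest rate whose capture threshold (linear SINR) is met, or None if none is."""
    feasible = [rate for threshold, rate in rate_table if sinr_linear >= threshold]
    return max(feasible) if feasible else None

# (capture threshold as linear SINR, rate in Mbps); the second row is hypothetical.
rate_table = [(1.9484, 10), (5.0, 20)]

value = sinr(p_tx=1.0, d_signal=1.0, d_interferers=[2.0], alpha=2.0, noise=0.0)
rate = max_rate(value, rate_table)
```

Because the access groups are fixed in advance, the set of interferers of each receiver is known, which is what allows the rate to be static.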
4 The Distributed Data Scheduling and Reservation Schemes

Under the planning model, each station has the following knowledge: a) the Clean-Air Spatial Reuse policy employed; b) the concurrent control access group it belongs to; c) the concurrent data access group it belongs to. In addition, it maintains a state vector recording the reservation status of the data subframe. Each element in the state vector has the value “Free” (not reserved, or reserved by pairs of the station's own group), “Busy” (already reserved by other groups), “Send” (reserved for the station's own sending) or “Receive” (reserved for the station's own receiving). The number of slots of the control subframe is assumed to be equal to the number of concurrent control access groups, so that every station has a control message access opportunity on a periodic basis.
[Fig. 4 content, reconstructed from the figure labels]

Legend: DSi: data schedule at station i. Status of a data slot: B: busy; S: send; R: receive; F: free. Message content: <GroupID, list_FreeSlots, amount>.

Data schedules before the SS6 → SS9 reservation (data subframe slots d1 to d6):

Schedule  d1  d2  d3  d4  d5  d6
DS1       S   F   S   F   S   F
DS2       R   B   R   B   R   B
DS3       B   S   B   S   B   F
DS4       F   R   F   R   F   F
DS5       B   S   B   F   B   S
DS6       F   R   F   F   F   R
DS7       F   B   F   B   F   B
DS8       F   B   F   F   F   B
DS9       F   B   F   F   F   B
DS10      F   F   F   F   F   F

Messages exchanged between SS6 (D4) and SS9 (D4):
Request: MSH-DSCH-REQ {D4, (d1, d3, d4, d5), 3}
Grant:   MSH-DSCH-GNT {D4, (d1, d3, d4)}
Confirm: MSH-DSCH-CFM {D4, (d1, d3, d4)}

Updated data schedules:

Schedule  d1  d2  d3  d4  d5  d6
DS5       B   S   B   B   B   S
DS6       S   R   S   S   F   R
DS7       B   B   B   B   F   B
DS9       R   B   R   R   F   B
DS10      B   F   B   B   F   F
Fig. 4. The distributed data scheduling and reservation
An example of distributed data scheduling and reservation under the Level-1 policy is depicted in Fig. 4. Suppose SS6 wants to send three data packets to its level-1 neighbor SS9; both belong to the data access group D4 and the control access group C4. SS6 uses the opportunity round designated to C4 to negotiate a feasible data reservation schedule with SS9. The figure shows the schedule exchange and negotiation between them. Meanwhile, their neighbors are required to update their state vectors accordingly. After establishing the data transfer schedule, SS6 sends its data packets to SS9 in the reserved data slots. Here, SS1 and SS6 can transmit in the same time slots (i.e., d1 and d3) even though they belong to different data access groups.
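The slot negotiation in the example above can be reproduced with a few lines of code. This is a simplified sketch: it assumes that the receiver grants the first free slots common to both state vectors and that a request fails when too few common slots exist (the text does not specify either rule), and it omits the state vector updates at neighboring stations. The function names are hypothetical.

```python
FREE, BUSY, SEND, RECV = "F", "B", "S", "R"

def free_slots(state):
    """1-based numbers of the data slots marked Free in a state vector."""
    return [i for i, s in enumerate(state, start=1) if s == FREE]

def negotiate(sender_state, receiver_state, amount):
    """Return the granted slot numbers, or None if too few common free slots exist."""
    offered = free_slots(sender_state)              # list carried in the request
    receiver_free = set(free_slots(receiver_state))
    granted = [d for d in offered if d in receiver_free][:amount]  # carried in the grant
    if len(granted) < amount:
        return None
    for d in granted:                               # applied once the grant is confirmed
        sender_state[d - 1] = SEND
        receiver_state[d - 1] = RECV
    return granted

# State vectors of SS6 and SS9 before the reservation, as listed in Fig. 4.
ds6 = list("FRFFFR")
ds9 = list("FBFFFB")
granted = negotiate(ds6, ds9, 3)
```

With the state vectors of Fig. 4, the request offers (d1, d3, d4, d5), the grant selects (d1, d3, d4), and the updated vectors match the figure: DS6 becomes S R S S F R and DS9 becomes R B R R F B.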
5 Performance Evaluation

In this section, we present simulation results on the network capacity throughput under different Clean-Air Spatial Reuse policies, which balance the number of concurrent transmissions against the individual sender's transmission rate. Consider a mesh network with 24 nodes which are always backlogged. The frame duration is 10 ms; there are 64 data slots in the data subframe; the bandwidth is 10 MHz. All simulations are implemented in C. The transmission rate requirements are based on the data in [13]. For example, the minimum capture threshold is 1.9484 (2.9 dB) and the corresponding transmission rate is 10 Mbps.

[Fig. 5 plots: three surface panels labeled Level-0, Level-1 and Level-2; axes: Path Loss Exponent (2 to 6), Noise Level (0 to 0.15) and Network Capacity Throughput (0 to 300 Mbps); annotated values: 275 Mbps, 208 Mbps and 157 Mbps]
Fig. 5. The network capacity throughputs for Level-0, -1 and -2 Clean-Air Spatial Reuse policies under different environmental variables
Fig. 5 presents the network capacity throughputs for the Level-0, -1 and -2 Clean-Air Spatial Reuse policies under different environmental conditions. The Level-0 policy achieves the highest network capacity throughput because more concurrent
transmission pairs are scheduled in each time slot, which compensates for the negative effect of using lower transmission rates. However, the Level-0 policy may not have feasible scheduling solutions in environments with a larger path loss exponent, such as metropolitan areas. The values marked with a star sign are the theoretical upper bounds of the throughput calculated according to Shannon's channel capacity theorem.
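For a single link, the Shannon bound mentioned above is C = B · log2(1 + SINR). The sketch below applies it to the figures quoted in this section (10 MHz bandwidth, capture threshold 1.9484) and checks the linear-to-dB conversion; how the paper aggregates per-link bounds into a network-wide bound is not described, so that step is not modeled here.

```python
import math

def linear_to_db(x):
    """Convert a linear power ratio to decibels."""
    return 10 * math.log10(x)

def shannon_capacity_bps(bandwidth_hz, sinr_linear):
    """Shannon capacity of one link: C = B * log2(1 + SINR)."""
    return bandwidth_hz * math.log2(1 + sinr_linear)

threshold_db = linear_to_db(1.9484)              # about 2.9 dB, as quoted in the text
capacity = shannon_capacity_bps(10e6, 1.9484)    # single-link bound at that SINR
```

At the minimum capture threshold, the bound is between 15 and 16 Mbps, which lies above the 10 Mbps rate assigned at that threshold, as expected for an upper bound.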
6 Conclusions

In this paper, we propose an approach to the problem of channel access scheduling in wireless mesh networks that raises network capacity throughput. First, we propose a planning model with three key elements. The Level-N Clean-Air Spatial Reuse policy sets the minimum scope (level) for scheduling concurrent non-interfering transmission pairs in the wireless mesh network. Two algorithms are proposed to divide the network into distinct logical concurrent control and data access groups to raise network capacity throughput. Then each sender is assigned a static transmission rate so that all sender-receiver pairs can transmit successfully. Based on the planning model, a distributed three-way handshake protocol is used for dynamic data scheduling. The combined scheme achieves the flexibility of dynamic and distributed channel access scheduling with reduced complexity (compared to a purely distributed approach) and high network capacity throughput. The simulation results show that scheduling more concurrent transmission pairs in each time slot can usually compensate for the negative effect of using lower transmission rates and achieve better throughput performance.
References

1. Brar, G., Blough, D.M., Santi, P.: Computationally Efficient Scheduling with the Physical Interference Model for Throughput Improvement in Wireless Mesh Networks. In: ACM MobiCom, pp. 2–13 (2006)
2. Goldsmith, A.: Wireless Communications. Cambridge University Press (2005)
3. Nelson, R., Kleinrock, L.: Spatial TDMA: A Collision Free Multihop Channel Access Protocol. IEEE Transactions on Communications 33(9), 934–944 (1985)
4. Cheng, S.-M., Lin, P., Huang, D.-W., Yang, S.-R.: A Study on Distributed/Centralized Scheduling for Wireless Mesh Network. In: IWCMC, pp. 599–604 (2006)
5. Wei, H.-Y., Ganguly, S., Izmailov, R., Hass, Z.J.: Interference-Aware IEEE 802.16 WiMax Mesh Networks. In: IEEE VTC 2005-Spring, pp. 3102–3106 (2005)
6. Kabbani, A., Salonidis, T., Knightly, E.W.: Distributed Low-Complexity Maximum-Throughput Scheduler for Wireless Backhaul Networks. In: IEEE INFOCOM, pp. 2063–2071 (2007)
7. Chen, J., Chi, C., Guo, Q.: An Odd-Even Alternation Mechanism for Centralized Scheduling in WiMAX Mesh Network. In: GLOBECOM, pp. 1–6 (2006)
8. Viswanathan, H., Mukherjee, S.: Throughput-Range Tradeoff of Wireless Mesh Backhaul Networks. IEEE Journal on Selected Areas in Communications 24(3), 593–602 (2006)
9. IEEE Standard for Local and Metropolitan Area Networks Part 16: Air Interface for Fixed Broadband Wireless Access Systems. IEEE Std 802.16-2004 (2004)
10. Cao, M., Ma, W., Zhang, Q., Wang, X., Zhu, W.: Modeling and Performance Analysis of the Distributed Scheduler in IEEE 802.16 Mesh Mode. In: ACM MobiHoc, pp. 78–89 (2005)
11. Kim, T.-S., Lim, H., Hou, J.C.: Improving Spatial Reuse through Tuning Transmit Power, Carrier Sense Threshold, and Data Rate in Multihop Wireless Network. In: ACM MobiCom, pp. 366–377 (2006)
12. Zhai, H., Fang, Y.: Physical Carrier Sensing and Spatial Reuse in Multirate and Multihop Wireless Ad Hoc Networks. In: IEEE INFOCOM, pp. 1–12 (2006)
13. WiMAX Forum, http://www.wimaxforum.org
A Unified Service Discovery Architecture for Wireless Mesh Networks

Martin Krebs, Karl-Heinz Krempels, and Markus Kucay

Department of Computer Science, Informatik 4
RWTH Aachen University, Germany
Ahornstrasse 55, 52074 Aachen
{krebs,krempels,kucay}@cs.rwth-aachen.de

Abstract. This paper proposes a unified architecture for service discovery in Wireless Mesh Networks. In this architecture, routing clients and non-routing clients, which connect via a mesh gateway with Service Proxy (SP), can also participate seamlessly in the service discovery process. Therefore, the multicast DNS (mDNS) protocol is encapsulated in Optimized Link State Routing (OLSR) messages to make mDNS multi hop capable. Service Caches (SC) on wireless mesh routers are added for efficiency reasons. It is discussed that simple flooding of service discovery messages is not efficient and that inspection of messages at the application layer increases the efficiency of message propagation. We present a plug-in for an OLSR-daemon which is widely used in real world deployments. Finally, the measurements performed in the department’s wireless mesh testbed are discussed with results and conclusions.

Keywords: Service Discovery, Wireless Mesh Networks, Cross-Layer Design, OLSR.
1
Introduction
Wireless Mesh Networks (WMNs) [1] are an emerging technology in the direction of future wireless networks. They are a flexible technology to provide wireless high-speed broadband coverage to areas where the installation of wires is not possible or too costly. Example deployment scenarios are community, emergency, disaster recovery or municipal networks. From the user's point of view, automatic service discovery and zero configuration will be a central task in WMNs. Current solutions and proposals for service discovery in ad-hoc networks like SLP [2] or UPnP [3] use simple flooding, which is unacceptable in wireless networks. Many solutions presume the existence of well-defined multicast groups, which is also unrealistic in ad-hoc or mesh networks. A more efficient approach is to use distributed hash tables (DHT) for P2P networks, which can be used to create an overlay network, but this is inappropriate for highly dynamic wireless networks. For complex service discovery operations it is not clear whether any meaningful hash value could even be calculated. It cannot be assumed that a querying client knows the hash value of a service in advance. In this paper we present a complete service discovery approach for WMNs using DNS-SD in combination with the advantages of the OLSR message distribution mechanism. We use caching techniques on the wireless mesh nodes realized through an OLSRD [4] plug-in, providing a cross-layer design, where the service discovery protocol benefits from routing layer information such as hop count or the ETX metric for the services.

This paper is organized as follows: In Section 2 we review related work and existing protocols for service discovery. In Section 3 we discuss service discovery in Wireless Mesh Networks, before our architecture is presented in Section 4. In Section 5 we state the results from our implementation in the wireless mesh testbed. Section 6 comprises the conclusion.

A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 865–876, 2008. © IFIP International Federation for Information Processing 2008
2
Related Work
On the protocol level, a lot of work has been done for service discovery in the Internet, e.g. the Service Location Protocol (SLP) [2], the Simple Service Discovery Protocol (SSDP) [5], DNS-based Service Discovery (DNS-SD) [6] together with multicast DNS (mDNS) [7], or Universal Plug and Play (UPnP) [3]. On the application level, there are the Java-based Jini [8] and UDDI (Universal Description, Discovery and Integration) for web services. These protocols are designed for wired infrastructure networks and are not well suited for Mobile Ad-hoc Networks (MANETs) or Wireless Mesh Networks (WMNs), since they are based on directories or simple flooding. A lot of work has also been done for service discovery in ad-hoc networks. Konark [9] is a service discovery mechanism on the application level for ad-hoc networks. mSLP [10] is an enhancement of SLP, where SLP Directory Agents (DAs) set up a fully meshed structure and exchange service registration states. However, this approach does not scale well in WMNs, because service registration states must be replicated between all servers. This replication causes a high network load. Another promising approach seems to be the integration of service discovery mechanisms in a routing protocol. In many cases the service requests are piggybacked on route request messages. Koodli and Perkins [11] propose a solution for service discovery in ad-hoc networks embedded in a reactive routing algorithm. They describe a service discovery mechanism in on-demand ad-hoc networks that discovers routes to the service at the same time, for example with AODV or DSR. However, most of the work in cross-layer design has been done for reactive routing protocols for efficiency reasons. For proactive routing protocols, there are approaches for cross-layering routing and service discovery like [12], where the authors enable the Session Initiation Protocol (SIP) for MANETs with OLSR.
The authors propose an approach called OLSR with Service Location Extension which is implemented in a simulator where servers regularly advertise their location or clients can query for server locations.
3
Service Discovery
In general, there are two architecture approaches to perform service discovery: directory based and non-directory based approaches. In the non-directory based approach, broadcasting or multicasting is used. Flooding techniques are only suited for small ad-hoc networks or networks with a reliable transport medium where broadcasting causes no problems. The advantage of this architecture is that no administration is needed and that it does not depend on infrastructure components like a central server. In large infrastructure networks a local directory server is used for performance reasons. However, a directory server solution is not advisable in WMNs either, because all operations rely on a single point of failure. Service discovery operations are not possible if the central server or its wireless links are not available. A widely used technique to optimize standard protocol effectiveness in wireless networks is called cross-layer design. When service discovery mechanisms are integrated within OLSR, two approaches are possible: In the first case, service announcements/replies are piggybacked on Topology Control (TC) messages, which are broadcast at a regular interval. This is similar to the proactive routing approach. Here, a service record is immediately available and no service query needs to be initiated. Backward compatibility is not given, because all OLSR nodes must be able to parse the modified TC messages. In the second case, a service discovery protocol is tunneled through OLSR. There is no piggybacking, because a new OLSR message type is defined for the service discovery messages. The normal query/advertisement phases of the original service discovery protocol remain and are separated from TC updates. Backward compatibility is given, because there is no change to the original protocols. OLSR is only needed to distribute and deliver the messages. This approach is well suited for a WMN where routing and non-routing clients can seamlessly perform service discovery. The approach of mDNS messages encapsulated in a new OLSR message type is discussed in the following.
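The tunneling approach amounts to prefixing an unmodified mDNS message with an OLSR message header. The sketch below packs the message header fields defined in RFC 3626 (message type, Vtime, message size, originator address, TTL, hop count, message sequence number) for IPv4 and uses the message type value 222 given in Section 4.4. It is an illustration, not the authors' plug-in code: the enclosing OLSR packet header is omitted, the Vtime byte is passed through as a raw value rather than encoded, and the function names are hypothetical.

```python
import socket
import struct

MDNS_OVER_OLSR_TYPE = 222                      # message type stated in Section 4.4
OLSR_MSG_HEADER = struct.Struct("!BBH4sBBH")   # RFC 3626 message header, IPv4 originator

def encapsulate(mdns_payload, originator_ip, ttl, seq, vtime=0):
    """Prefix an unmodified mDNS message with an OLSR message header."""
    size = OLSR_MSG_HEADER.size + len(mdns_payload)
    header = OLSR_MSG_HEADER.pack(
        MDNS_OVER_OLSR_TYPE, vtime, size,
        socket.inet_aton(originator_ip), ttl, 0, seq & 0xFFFF)  # hop count starts at 0
    return header + mdns_payload

def decapsulate(message):
    """Split an OLSR message into its header fields and the original mDNS payload."""
    msg_type, _vtime, size, originator, ttl, hops, seq = OLSR_MSG_HEADER.unpack_from(message)
    fields = {"type": msg_type, "originator": socket.inet_ntoa(originator),
              "ttl": ttl, "hop_count": hops, "seq": seq}
    return fields, message[OLSR_MSG_HEADER.size:size]

# A 12-byte dummy payload stands in for a real mDNS message.
message = encapsulate(b"\x00" * 12, "10.0.0.1", ttl=3, seq=7)
fields, payload = decapsulate(message)
```

Because the payload is carried unchanged, nodes that do not know the new message type can still forward it with the default OLSR forwarding rules, which is the backward compatibility argument made above.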
Fig. 1. Wireless Mesh Network architecture
3.1
DNS-Based Service Discovery
DNS-based Service Discovery (DNS-SD) [6] offers clients the opportunity to discover a list of named instances of the desired service using only standard DNS messages. DNS-SD can be used in combination with multicast, which is then called multicast DNS (mDNS) [7], or it can be used with any existing DNS server. Moreover, clients can register their services at a DNS server if the DNS server allows dynamic updates. DNS-SD is a widely used technique which is implemented in Apple’s Bonjour Protocol [13]. Instead of requesting "SRV" (Service) resource records, clients first query for a "PTR" record (a pointer from one domain to another in the DNS namespace). In a second step, if the client selects a service instance, a query for the corresponding "SRV" record is processed. For DNS efficiency a server may place additional information in the answers, even if the client originally did not request it. This may help to suppress the client’s following request message. Such a response message includes the requested "PTR" record and, in addition, the "SRV", "TXT" and all address records. A client request for <Service>.
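The two-step lookup described above (a PTR query to browse, then an SRV query to resolve one instance) can be illustrated with an in-memory record set. Everything in this sketch is hypothetical: the service type, instance name, host, port and address are invented for illustration, and no real DNS library is used.

```python
# Hypothetical record set; names, host, port and address are made up.
RECORDS = {
    ("_ipp._tcp.local.", "PTR"): ["Office Printer._ipp._tcp.local."],
    ("Office Printer._ipp._tcp.local.", "SRV"): [("printer.local.", 631)],
    ("Office Printer._ipp._tcp.local.", "TXT"): [{"paper": "A4"}],
    ("printer.local.", "A"): ["192.0.2.10"],
}

def query(name, rtype):
    """Look up resource records by name and type; an empty list means no answer."""
    return RECORDS.get((name, rtype), [])

def browse(service_type):
    """Step 1: a PTR query lists the named instances of a service type."""
    return query(service_type, "PTR")

def resolve(instance):
    """Step 2: SRV and address queries give host, port and addresses of one instance."""
    host, port = query(instance, "SRV")[0]
    return host, port, query(host, "A")

instances = browse("_ipp._tcp.local.")
endpoint = resolve(instances[0])
```

A server that adds the SRV, TXT and address records to its PTR answer saves the client the second round of queries shown in `resolve`.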
Multicast DNS
Multicast DNS (mDNS) [7] is an IETF protocol which allows DNS operations on the local link without a DNS server. For this purpose mDNS uses a part of the DNS namespace, which is available for local use. mDNS needs little setup and even works if no infrastructure is available. Furthermore, mDNS does not change the existing standardized DNS message types such as Operation or Response Codes. The most important difference to classical unicast DNS is that the client sends its DNS message to a multicast address. mDNS works only on the local link, i.e. not beyond routing borders, since multicast is not forwarded by routers by default. The advantage of mDNS is the nature of multicast messages, i.e. all other clients can overhear the message and are able to detect possible conflicts.
4
Proposed Architecture
Our proposed architecture consists of wireless mesh routers, routing mesh clients and non-routing clients (see Fig. 1). Accordingly, it supports two different client access and service discovery modes: In the first mode (described in Fig. 2) non-routing clients connect to the wireless mesh network through an access point. The clients do not implement any mesh routing protocol and they run a backward-compatible mDNS service discovery application. The second mode (described in Fig. 3) provides service discovery support
for routing clients which implement OLSR and an mDNS service discovery application. The multicast DNS (mDNS) protocol is encapsulated in Optimized Link State Routing (OLSR) messages to make mDNS multi-hop capable. When using OLSR for message propagation there is no need to set up and maintain multicast routing, such as constructing multicast routing trees [14]. The presented application layer inspection of service discovery messages makes message propagation more efficient. In this approach we use distributed DNS-SD caches called Service Caches (SC) on wireless mesh routers. Since Optimized Link State Routing (OLSR) [15] is the most popular proactive routing protocol, we use the OLSR implementation OLSRD [4] from UniK for the WMN testbed and implementation.
4.1
Non-routing Clients
Non-routing clients connect to the mesh network through an access point. They do not have access to information from the routing layer, because they do not implement a routing algorithm. We distinguish between the legacy and the modified mDNS application. In the modified application, the client can control the Time-To-Live (TTL) value by using the first octet of the transaction ID of the DNS header, which avoids mesh network-wide flooding and the traffic flow distortions it causes. This is possible because the transaction ID is set to zero in the original mDNS protocol. For legacy clients that do not set the TTL value, the corresponding access point sets the value to 255. Clients can also advertise services, but these service announcements are not routed or broadcast into the mesh network. Instead, they are stored on the corresponding mesh gateway node with a Service Proxy (SP).
4.2
Routing Clients
Routing clients run OLSR and can directly access the internal tables of OLSR. They can extract metrics like hop count or Expected Transmission Count (ETX) for desired services. This is an advantage compared to an approach without cross-layering. Before a client uses a service, it can choose the most suitable service (service selection). This can be either the nearest service with respect to hop count or the service with the best ETX. A routing client can also advertise and browse for services. Moreover, advertisements are possible for any service lease time within a flooding diameter ≤ 255. Mobile clients should only be allowed to advertise their own services with a small lease time, because otherwise records of mobile nodes which have already left the network would remain stored. If a node shuts down normally, it sends an update with the resource record's time to live set to zero, which means that the resource record is deleted. Mobile clients are limited in capacity and therefore have a smaller cache for services.
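Service selection by routing metric reduces to picking the minimum over the discovered candidates. The sketch below assumes that each candidate already carries the hop count and ETX value read from the routing tables; the record layout and the service names are hypothetical. For both metrics a lower value is better.

```python
def select_service(candidates, metric="hops"):
    """Pick the candidate with the lowest value of the chosen routing metric."""
    return min(candidates, key=lambda c: c[metric])

# Hypothetical discovery results annotated with routing layer information.
candidates = [
    {"name": "printer-a", "hops": 2, "etx": 3.1},
    {"name": "printer-b", "hops": 3, "etx": 2.4},
]

nearest = select_service(candidates, metric="hops")
best_links = select_service(candidates, metric="etx")
```

The two metrics can disagree, as in this example: the service that is fewer hops away is reached over links with a higher expected transmission count.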
Fig. 2. Scenario A: service discovery support for non-routing clients
Fig. 3. Scenario B: service discovery support for routing clients
4.3
Mesh Router
When the plug-in is activated on a mesh router, the router can act as a Service Cache (SC). Mesh routers which do not run the plug-in forward the messages according to OLSRD’s default forwarding strategy. Unlike complete directory servers, the SCs are self-organizing and do not need administrative effort. It is recommended to run the caches on fixed mesh nodes with a continuous power supply and enough memory and storage. If OLSR receives encapsulated mDNS messages, the plug-in extracts the service records and stores them in the cache. The corresponding service entry is deleted after the lease time of the service record has expired. A limitation of mDNS is that exact matching or complex queries like in SLP are not possible. A query is always domain based, which means that, for example, a cache responds with all ’printer’ services and not only with the desired ’color printer’. One important architectural decision concerns the answer behavior of the caches. In the original mDNS protocol only authoritative answers are allowed. This means that a cache must not answer with service records which are not
running locally and for which the mDNS stack is not authoritative. For our architecture we decided to allow non-authoritative answering. The strength of the non-authoritative answering behavior is that more services can be discovered within a smaller hop range. This is beneficial for mesh gateways with a service proxy. Clients connected to the mesh gateway can receive all necessary services within one hop. We are also aware of cache replacement strategies [16]. However, this is not the focus of this paper and is not considered further. We decided to use a simple cache replacement strategy: If the cache is full, the oldest entry is discarded.
4.4
Protocol Definition
In our approach we use the OLSR header to make mDNS multi-hop capable. Therefore, every standard mDNS message gets a new OLSR header and its message handling is controlled by OLSRD. We can now also make use of OLSR’s advanced flooding technique called Multipoint Relaying. In the following we describe some OLSR message fields and their importance for our approach:
– Message Type is defined as 222.
– Time-To-Live (TTL) states how many hops the message is allowed to be forwarded. The value is decremented by one every time the message is forwarded. If the value is zero, the message is discarded. A client sets this value to control the network search depth for discovering services.
– Hop Count is incremented by one every time the message is forwarded. This value is also displayed to the client together with the service to indicate how many hops away the service is.
– Message Sequence Number is incremented by one every time a node generates a message. This field is also used to discard duplicated messages.
We also need hop control in non-encapsulated mDNS messages which are created by non-routing clients. Therefore, we use the transaction ID field from the original DNS header. The transaction ID field is set to zero in the original mDNS protocol. For our approach we need a TTL field already in DNS, because otherwise the user application cannot control the number of hops the message is allowed to be forwarded. This modification is only necessary for scenario A (see Fig. 2) where non-routing clients connect to an access point. If the service discovery application is directly integrated within OLSR (see scenario B), this is not needed. Routing clients can directly set the Time-To-Live field value of the OLSR header.
4.5
Message Propagation Strategies
In our architecture we implemented two message propagation strategies: The Simple Flooding mechanism is a straightforward way to propagate service discovery messages. When a client sends a query it can limit its query with the TTL
message field of OLSR to a certain number of hops to avoid flooding the whole network¹. A receiving node answers with its services from the cache. Duplicate information about services is always forwarded and never discarded. Matching answers from the local cache are always sent regardless of any equivalent answers which were already forwarded. In the Discarding Duplicates strategy each node holds an additional forward history hashtable indicating the last time the service record was forwarded. After the HashValidity time τ the entry is deleted from the table. All new incoming messages are discarded as long as there is an entry in the hashtable for the corresponding service. Answers from the local node are always sent, regardless of answers already seen from other nodes. Whereas OLSR discards only duplicate packets [15], our strategy is a form of application layer routing which inspects the message body and discards duplicate information which was sent by multiple nodes. The discarding strategy is only used on fixed mesh routers and not on mobile clients.
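The Discarding Duplicates strategy, together with the TTL and hop count handling from Section 4.4, can be sketched as follows. This is an illustration under assumptions, not the plug-in's code: messages are modeled as dictionaries keyed by a service name, time is passed in explicitly, and, following the default forwarding rule of RFC 3626, a message whose TTL is 1 or less is not relayed. The class and function names are hypothetical.

```python
class ForwardHistory:
    """Remembers when each service record was last forwarded (HashValidity time in seconds)."""

    def __init__(self, validity_s):
        self.validity_s = validity_s
        self.last_forwarded = {}

    def should_forward(self, service_key, now):
        last = self.last_forwarded.get(service_key)
        if last is not None and now - last < self.validity_s:
            return False  # a valid entry exists, so the record is duplicate information
        # No entry, or the entry is older than the validity time and counts as deleted.
        self.last_forwarded[service_key] = now
        return True

def forward(message, history, now):
    """Return the message to relay, or None if it must be dropped."""
    if message["ttl"] <= 1:
        return None  # hop limit reached
    if not history.should_forward(message["service"], now):
        return None  # discarded by the application layer inspection
    return dict(message, ttl=message["ttl"] - 1, hops=message["hops"] + 1)

history = ForwardHistory(validity_s=10.0)
record = {"service": "printer", "ttl": 3, "hops": 0}
first = forward(record, history, now=0.0)
repeat = forward(record, history, now=5.0)
later = forward(record, history, now=10.0)
```

The same record arriving again within the validity time is dropped even if it comes from a different originator, which is what distinguishes this strategy from OLSR's own duplicate packet detection.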
5
Results
5.1
Testbed
For our measurements we use our 36-node wireless mesh testbed. The testbed is located at the Department of Computer Science at RWTH Aachen University. The department complex consists of one four-story and two three-story buildings. The wireless mesh routers are distributed over different offices and floors inside the buildings. The mesh routers are single-board computers (SBCs) based on the WRAP.2E board by PC Engines [18], running a minimal Ubuntu Linux. Each router has two WLAN IEEE 802.11a/b/g interfaces, which are built on Atheros AR5213 XR chips, and two omnidirectional antennas. The first WLAN interface is tuned to channel 1 and runs in ahdemo mode [19] to connect to the mesh network. The second WLAN interface is used for client access and can be tuned to different channels depending on interference with other WLAN access points. All routers are also connected via an Ethernet interface for management purposes. Currently, we are running the proactive and table-driven UniK OLSR daemon (OLSRD) [4] in version 0.5.4 as the routing algorithm. Table 1 shows the classification of the wireless mesh routers of our testbed into scenario classes. The topology and the resulting number of neighbors for each router are given by the placement of the routers in the buildings. For more details and performance measurements about our testbed see [20].
5.2
Measurements
In the following measurements with OLSRD and our service discovery plug-in we are using hysteresis, which adds more robustness to the link sensing but delays neighbor registration. We are also using Multipoint Relaying (MPR) and default
¹ We assume that the TTL is set by the user manually or by some adaptive algorithm which proposes a feasible value, e.g. depending on the topology.
Table 1. Classification of the testbed mesh routers to scenario classes

Scenario class   Number of neighbors   Number of mesh routers
Sparse           n < 5                 6
Medium           5 ≤ n ≤ 10            19
Dense            n > 10                11
RFC [15] values for the message interval. We started the measurement after the warm-up phase of OLSR, once the routes had become mostly stable.

In our first measurement we investigate which Time-To-Live value a client has to use in a query to discover as many services as possible. Therefore, we installed our plug-in on all mesh routers of the testbed. Every node loads a unique service at start-up which has to be discovered by our client. We investigated two scenarios: First, queries are initiated from within a dense topology, and in a second measurement queries are initiated from within a sparse topology. In the next step we started the mDNS client application on a laptop, which sends its DNS-SD queries via unicast to OLSR on the same machine. The query is then encapsulated in an OLSR message and sent to the mesh network. As Fig. 4 shows, the client discovers around 11 services within one hop in the dense scenario. This high number of services results from the high number of direct one-hop neighbors of the client. The measurement shows that about 3 to 4 hops are necessary to discover about 90% of the services in the testbed when querying from a node which is located in a dense region. In the sparse topology scenario the client discovers only a few services within a small number of hops, because of the small number of direct neighbors. More services are discovered with a higher TTL. The result shows that the number of discovered services varies for the same number of hops between different measurement runs. Even though our mesh routers are fixed, the routes are dynamic and can change very often due to link quality issues. In addition, the number of discovered services per hop depends on the topology from which the query is initiated.

In the second measurement we focus on the service discovery message overhead after a cold start and after a warm-up phase. In the first scenario (see Fig. 5(a)) the caches on 36 nodes perform a cold start where they load their unique local service. In the second step a laptop client sends a service request message which triggers all nodes to respond with cached services. We now compare the Simple Flooding strategy against the Discarding Duplicates strategy with a HashValidity time τ = 10 s: With the Simple Flooding strategy the mesh routers have to handle a significantly higher number of messages than with the Discarding Duplicates strategy. In the second scenario (see Fig. 5(b)) we directly continue working with the current state of the first scenario. Due to propagation, every cache now has all 36 services cached. This means all router caches have the same information. A simple query for the corresponding service class leads to a very high message overhead using the Simple Flooding strategy, because every node propagates its matching cache content through the whole mesh network.
Fig. 4. Service discovery measurement (number of services found vs. number of hops, for the dense and sparse scenarios)
Fig. 5. Message overhead of all mesh routers (number of processed messages per router, Simple Flooding vs. Discarding Duplicates): (a) after a cold start, (b) after a warm-up phase
Fig. 6. Mean hop count values of received messages of all mesh routers (Simple Flooding vs. Discarding Duplicates): (a) after a cold start, (b) after a warm-up phase
Figures 6(a) and 6(b) show the average hop count of all received messages after a cold start and after the warm-up phase. The Discarding Duplicates strategy leads to a lower average hop count after the warm-up phase, because all nodes now receive answers from their neighbors. In general, service discovery messages are not forwarded over many hops, because they are discarded if the same information was already sent in the last 10 seconds. For overall performance it is more efficient to receive service response messages from nearby SCs rather than from distant ones. The Discarding Duplicates strategy reduces the number of messages processed by every node compared to Simple Flooding. For further overhead reduction, other techniques need to be applied.
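The discard rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the plug-in's actual code: the class name `DuplicateFilter`, the use of SHA-1 over the raw payload as the hash, and the choice not to refresh the timestamp when a duplicate is dropped are all hypothetical; only the 10-second validity time comes from the text.

```python
import hashlib


class DuplicateFilter:
    """Drop a message if identical content was forwarded within `validity` seconds.

    Hypothetical sketch: the hashing scheme and the refresh policy are assumptions.
    """

    def __init__(self, validity=10.0):
        self.validity = validity      # HashValidity time (tau), in seconds
        self._last_sent = {}          # content hash -> time of last forwarding

    def should_forward(self, payload: bytes, now: float) -> bool:
        key = hashlib.sha1(payload).hexdigest()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.validity:
            return False              # same information sent recently: discard
        self._last_sent[key] = now    # remember when this content was forwarded
        return True
```

A node would call `should_forward` for each service discovery message it is about to relay and drop the message when the call returns `False`.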
6 Conclusion
In this paper we presented a unified approach for service discovery in Wireless Mesh Networks. Furthermore, we implemented this as a plug-in for OLSRD and presented our measurement results from our testbed. We encapsulated the messages from mDNS in OLSR packets. With this technique it is possible to use the multicast based service discovery protocol mDNS in a Wireless Mesh Network without a multicast routing protocol or the need to define a new protocol. This approach benefits from advanced OLSR flooding techniques like Multi Point Relaying or message sequence numbers. We showed that simple flooding of service discovery messages is not efficient and that the presented application layer inspection of messages can further improve message propagation with regard to message overhead and hop count.
Acknowledgment This work was supported by the German National Science Foundation (DFG) within the research excellence cluster Ultra High-Speed Mobile Information and Communication (UMIC).
References
1. Akyildiz, I.F., Wang, X., Wang, W.: Wireless mesh networks: a survey. Computer Networks 47(4), 445–487 (2005)
2. Guttman, E., Perkins, C., Veizades, J., Day, M.: Service Location Protocol, Version 2. RFC Standard 2608 (1999)
3. UPnP Forum: Universal Plug and Play (UPnP), http://www.upnp.org/
4. OLSRD: UniK OLSR Daemon (OLSRD), http://www.olsr.org/
5. Cai, T., Leach, P., Gu, Y., Goland, Y.Y., Albright, S.: Simple Service Discovery Protocol/1.0. IETF Internet Draft (April 1999, work in progress)
6. Cheshire, S., Krochmal, M.: DNS-Based Service Discovery. IETF Internet Draft (August 2006, work in progress)
7. Cheshire, S., Krochmal, M.: Multicast DNS. IETF Internet Draft (August 2006, work in progress)
8. Sun Microsystems Inc.: Jini network technology, http://sun.com/jini/
9. Helal, S., Desai, N., Verma, V., Lee, C.: Konark: A Service Discovery and Delivery Protocol for Ad-Hoc Networks. In: Proc. IEEE Wireless Communications and Networking Conference (WCNC 2003) (March 2003)
10. Zhao, W., Guttman, E.: mSLP - Mesh Enhanced Service Location Protocol. Internet Draft draft-zhao-slp-da-interaction-07.txt
11. Koodli, R., Perkins, C.E.: Service Discovery in On-Demand Ad Hoc Networks. IETF Internet Draft (October 2002, work in progress)
12. Li, L., Lamont, L.: Service Discovery for Support of Real-time Multimedia SIP Applications over OLSR Manets. In: OLSR Interop & Workshop 2004, San Diego, California (August 2004)
13. ZeroConf: Zero Configuration Networking, http://www.zeroconf.org/
14. Jia, W., Cheng, L., Xu, G.: Efficient multicast routing algorithms on mesh networks. In: ICA3PP 2002: Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (2002)
15. Clausen, T., Jacquet, P.: Optimized Link State Routing Protocol (OLSR). IETF Experimental RFC 3626 (October 2003), http://rfc.net/rfc3626.txt
16. Hu, Y.C., Johnson, D.B.: Caching Strategies in On-Demand Routing Protocols for Wireless Ad Hoc Networks. In: MobiCom 2000: Proceedings of the 6th annual international conference on Mobile computing and networking (2000)
17. Chakraborty, D., Joshi, A., Yesha, Y.: Integrating service discovery with routing and session management for ad-hoc networks. Ad Hoc Networks 4(1), 204–224 (2006)
18. PCEngines: WRAP-Wireless Router Application Platform, http://www.pcengines.ch/wrap.htm
19. MADWiFi: Multiband Atheros Driver for WiFi (2007), http://madwifi.org/
20. Zimmermann, A., Guenes, M., Wenig, M., Makram, S.A., Meis, U., Faber, M.: Performance Evaluation of a hybrid Testbed for Wireless Mesh Networks. In: Proceedings of the 4th IEEE International Conference on Mobile Ad-hoc and Sensor Systems (MASS 2007) (October 2007)
Using Predictive Triggers to Improve Handover Performance in Mixed Networks
Huaiyu Liu, Christian Maciocco, and Vijay Kesavan
Communication Technology Lab, Corporate Technology Group, Intel, Hillsboro, OR 97124
{Huaiyu.Liu,Christian.Maciocco,Vijay.S.Kesavan}@intel.com
Abstract. End-users of multi-radio devices expect ubiquitous connectivity anytime and anywhere across heterogeneous networks. A minimal service disruption time while roaming across networks is key for successful deployments. Today’s roaming decisions are reactive and traditionally based on thresholds for signal strength, missed beacons, or other properties, leading to undesirable handover delays. In this paper we propose a novel and energy efficient approach to predict when a device needs to take proactive actions to perform either a horizontal or vertical handover. The significant improvements our solution achieved in reducing service discontinuity time are demonstrated by applying our algorithms on WiFi and WiMax networks, in both enterprise and citywide wireless environments.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 877–888, 2008. © IFIP International Federation for Information Processing 2008
1 Introduction
With the introduction of multi-radio devices and the deployment of multiple access networks (WiFi, WiMax, GSM, CDMA, etc.), referred to as Mixed Networks, end-users expect ubiquitous connectivity for their mobile devices and expect these devices to roam seamlessly across networks with no service disruptions while using demanding applications. Supporting seamless mobility across networks is a challenging task for both the clients and the networks. On the client side, the handover process can be classified into three steps: i) “when and why” should the device transition to a new network, ii) “where” should the device transition to, and iii) “how” should the device transition between networks? The “when and why” step corresponds to what we describe as the triggering process, in which the mobile device receives an indication that it should look to operate on another network, for instance, due to signal strength degradation. The “where” step decides which network to operate on. The “how” step defines the execution of the handover and how the device performs the transition, e.g. either doing a horizontal handover to the same network or a vertical handover to other networks. This paper addresses the “when and why” step. A companion work [4] addresses the “where” step, and a SIP-based solution for the “how” step is discussed in [3]. There are various efforts underway in standards organizations and industry forums to standardize vertical handovers, including IEEE 802.21 [14] and 3GPP SA WG2 [15]. For instance, the IEEE 802.21 Working Group is standardizing a set of media-independent handover (MIH) methods and procedures that facilitate mobility
Fig. 1. Network switching without link triggers
management. This allows the higher layer services or applications to treat all access technologies in a generic manner. 802.21 specifies three media-independent services: event service, command service, and information service. Events may indicate changes in state and transmission behavior of the physical, data and logical link layers, or predict state changes on these layers. The event service maps to the “when and why” step mentioned above. Events defined in 802.21 include Link-Parameters-Change, Link-Up, Link-Down, Link-Going-Down, etc. Link layer triggers provide a mechanism for upper layer entities, such as a Multi-radio Connection Manager, to monitor and detect changes in link quality and link parameters. For example, Link-Going-Down can be used by Connection Manager to search for a new point of attachment before the current point of attachment ceases to carry frames, hence reducing handover delay in vertical handovers. Fig. 1 shows a baseline measurement on a generic operating system without any link layer triggers for a vertical handover, based on our prototype of a WiFi/WiMax multi-radio system [3]. It also shows the breakdown of the overall delay when a SIP audio session is transferred from a WiFi network to a WiMax network after the WiFi signal degrades. (Note that we do not consider authentication to the network here.) Transferring a SIP session across networks involves radio initialization, network address acquisition on the new interface, sending SIP de-registration and re-registration messages for the old and new interfaces respectively (binding update), and media redirection [3]. Our measurements show that it took more than 9 seconds from the time the WiFi signal was lost to the time the SIP client re-established the audio session on a new interface. This includes the time it took to detect loss of connectivity at the application level (since there are no link-layer triggers) and to re-establish the connection on a new interface, assuming that the SIP application does not include any intelligence to proactively detect loss of connectivity. These measurements demonstrate that an early indication of Link-Going-Down (LGD) will significantly help reduce the handover delay. The standards organizations, however, do not define the algorithms for event services, for instance, the “intelligence” to decide when to generate a link layer trigger. In this paper, we address the problem of providing accurate and predictive link layer triggers to enable seamless handovers, horizontal or vertical, and minimize service interruption, and present our algorithms for trigger generation and trigger prediction. Our algorithms are of low complexity. Evaluation of the algorithms is based on real WiFi and WiMax signal traces. Results show that our trigger prediction algorithm is able to predict accurately and early in time. For instance, the WiFi results show that our algorithm accurately predicted 96% of LGD triggers and that, on average, the prediction was made 1.13 seconds before LGD was detected. Note that a LGD trigger would
be generated before loss of connection, and hence would remove the gap between “network lost” and “detect link down” in Fig. 1. The benefit of trigger prediction is to generate warnings of LGD even earlier in time. Accurate and early prediction would enable Connection Manager to take proactive actions and establish alternative connections before the quality of the current connection degrades to an intolerable level, thus minimizing or eliminating service interruption. Although there has been plenty of work on signal strength prediction [5, 7, 8, 9, 10], to the best of our knowledge, there are few works on link trigger prediction. In [6], an analytical model is proposed to set appropriate thresholds for generating link triggers; however, no evaluation on real signal strength traces was presented. The remainder of the paper is structured as follows. In Section 2 we present the trigger generation process. In Section 3 we address trigger prediction, and in Section 4 we evaluate our trigger generation and prediction algorithms on WiFi and WiMax networks. We conclude in Section 5.
2 Trigger Generation
In this section, we discuss how to generate link triggers such as a LGD trigger, or in other words, when an event such as LGD is deemed to happen. In the next section, we discuss how to predict such an event. Typically, signal strength is used to reflect link quality [7]. For example, Received Signal Strength Indication (RSSI) is used in WiFi networks and Carrier to Interference-plus-Noise Ratio (CINR) in WiMax networks. In this paper, we focus our discussion on RSSI measurements. The same algorithm is applicable to other networks as well, as shown by results from WiMax networks in Section 4.
2.1 Trigger Generation Method
In WiFi networks, handovers are typically triggered when RSSI is lower than a preconfigured threshold. We follow this common practice to generate link layer triggers. Table 1 lists the thresholds for trigger generation used in this paper. (According to [1], a RSSI value of -80 corresponds to no effective communication, and -76 is a commonly used threshold to trigger handovers in WiFi networks.) They could be set to other values, and the algorithms proposed in this paper are independent of the values.
Table 1. Thresholds for trigger generation
Link Up threshold (LU_TH) | -60
Link Coming Up threshold (LCU_TH) | -70
Link Going Down threshold (LGD_TH) | -76
Link Down threshold (LD_TH) | -80
Due to shadowing and fading effects, the radio environment is highly time varying and RSSI measurements (raw RSSI) fluctuate severely over time. Typically, a smoothing method is applied to smooth out raw RSSI. In addition, we apply a simple optimization to reduce duplicated triggers, the Link-Status-Update method (shown in Table 2). The basic idea is to maintain the state of the link status in a conservative way, and only generate triggers when the link status changes (e.g. when the link status changes to LGD, a LGD trigger is generated). In Table 2, the top row specifies the value of the current smoothed RSSI, and the first column specifies the previous link status. The other items define the next link status given the previous status and the
current smoothed RSSI value. For instance, if the link status is LGD, then unless the smoothed RSSI goes above LCU_TH, the link status will not change even if it goes above LGD_TH (as shown by the item on the 4th row and the 4th column). Thus, even though the smoothed RSSI may go above and below LGD_TH multiple times, no additional triggers are generated unless it goes up high enough.
Table 2. Link status transition in the Link-Status-Update method
Previous link status | RSSI >= LU_TH | LU_TH > RSSI >= LCU_TH | LCU_TH > RSSI >= LGD_TH | LGD_TH > RSSI >= LD_TH | LD_TH > RSSI
LINK_UP (LU) | LU | LU | LU | LGD | LD
LINK_COMING_UP (LCU) | LU | LCU | LCU | LGD | LD
LINK_GOING_DOWN (LGD) | LU | LCU | LGD | LGD | LD
LINK_DOWN (LD) | LU | LCU | LD | LD | LD
2.2 Signal Strength Smoothing
As mentioned before, it is necessary to apply a smoothing method to smooth out raw RSSI. We next evaluate several smoothing methods to choose one that can both respond quickly to short term changes and capture the long term trend properly. The evaluation is based on the total number of triggers generated, unnecessary triggers, and triggering delays. Intuitively, an unnecessary trigger indicates a false alarm, and triggering delay indicates the level of responsiveness. The larger the triggering delay, the less responsive a method is to the changing trend of RSSI.
• A LGD/LD (or LCU/LU) trigger generated after obtaining the ith raw RSSI is an unnecessary trigger if 60% of the next m raw RSSI values are no lower (or no higher) than LGD_TH/LD_TH (or LCU_TH/LU_TH).
• After obtaining the ith raw RSSI value, a LGD/LD (or LCU/LU) trigger is said to be delayed for one step, or the triggering delay is increased by one step, if 60% of the next m raw RSSI values are below (or above) LGD_TH/LD_TH (or LCU_TH/LU_TH).
In this paper, we set m=30. The smoothing methods we considered are listed below, where x(i) is the ith smoothed RSSI, r(i) is the ith raw RSSI measurement, and N=50. In practice, RSSI values are reported as integers. Hence, all smoothed RSSI values were also converted to integer values in our study (footnote 1).
• Average: the average of the past N raw RSSI values.
• Exponential average: x(i) = αx(i-1) + (1-α)r(i). We set α to 0.9 [1].
• Olympic average: remove the highest n values and lowest n values of the past N values, and take the average of the remaining N-2n values, n=3.
• Median: the median value of r(i-N+1) to r(i).
• Mode: the most frequent value of r(i-N+1) to r(i) (footnote 2).
Footnote 1: For exponential average, we keep x(i) as a real number for calculation accuracy, and return the integer part as the value for the current exponential-average RSSI.
Footnote 2: Actually, to obtain the Mode, we divide the RSSI value space into buckets of 3, e.g., bucket[0]={-40, -41, -42}. Mode is defined to be the middle value of the bucket that contains the most raw RSSI values. The reason for not using the most frequent value directly is that the frequency is usually low and may not be the right representative of r(i-N+1) to r(i).
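The smoothing methods listed above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the caller is assumed to pass the last N raw values as `window`, and the bucket alignment for the mode (anchored so that one bucket is {-40, -41, -42}) is inferred from the single example in the footnote. Ties between buckets are broken by first occurrence, which the paper does not specify.

```python
from collections import Counter
from statistics import median  # the Median method is statistics.median(window)


def average(window):
    """Plain average of the raw RSSI values in the window."""
    return sum(window) / len(window)


def exponential_average(prev_x, raw, alpha=0.9):
    """x(i) = alpha * x(i-1) + (1 - alpha) * r(i); kept as a real number."""
    return alpha * prev_x + (1 - alpha) * raw


def olympic_average(window, n=3):
    """Drop the n lowest and n highest values, then average the rest."""
    kept = sorted(window)[n:len(window) - n]
    return sum(kept) / len(kept)


def bucketed_mode(window, width=3, anchor=-40):
    """Middle value of the most populated bucket of `width` RSSI values.

    Assumption: buckets are aligned so that {-40, -41, -42} is one bucket.
    """
    counts = Counter((anchor - r) // width for r in window)
    bucket = counts.most_common(1)[0][0]
    return anchor - bucket * width - width // 2


def reported(x):
    """Integer part of a smoothed value, as described in the footnote."""
    return int(x)
```

In a per-measurement loop, `exponential_average` would be called with the previous real-valued result, and `reported` applied only to the value handed to trigger generation.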
Fig. 2. Traces of raw RSSI, part I
To evaluate the performance of these smoothing methods in trigger generation, we collected multiple RSSI traces by moving inside Intel’s office buildings. We used an IBM T60 laptop equipped with an Intel 3945 wireless card. Altogether, more than 42K raw RSSI values were collected. Fig. 2 shows the traces, where the x axis is the index of each RSSI measurement (100 ms apart) and the y axis is the value of each measurement. Each sharp increase of RSSI corresponds to switching to another AP. Roaming aggressiveness was set to the lowest level, so that for each AP the laptop associated with, we could collect a complete trace of RSSI from the start of the association to the time the connection was almost lost due to bad signal strength. Using each smoothing method, triggers were generated for each trace in Fig. 2 using the Link-Status-Update method in Table 2. We then compared the total number of triggers, unnecessary triggers, and triggering delays for all smoothing methods. Fig. 3 presents results for LGD triggers. (We see similar trends for other triggers.) Overall, all methods generated a similar number of triggers, Mode tends to generate more unnecessary triggers than the others, and both the exponential average and Mode methods have a much shorter triggering delay than the other methods (footnote 3). We conclude that exponential average fits our needs best. In summary, for trigger generation, we use the Link-Status-Update method to avoid duplicated triggers, and use exponential average as the smoothing method to react faster to the recent trend of signal strength and minimize service interruption.
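As a rough sketch of the Link-Status-Update method, the transition table (Table 2) and thresholds (Table 1) can be encoded directly as a lookup. The function names and the choice of LU as the initial status are assumptions made for illustration; the paper does not state the initial status.

```python
LU_TH, LCU_TH, LGD_TH, LD_TH = -60, -70, -76, -80  # thresholds from Table 1

# Next status for each previous status (key) and RSSI band (tuple index 0..4),
# transcribed from Table 2.
TRANSITIONS = {
    "LU":  ("LU", "LU",  "LU",  "LGD", "LD"),
    "LCU": ("LU", "LCU", "LCU", "LGD", "LD"),
    "LGD": ("LU", "LCU", "LGD", "LGD", "LD"),
    "LD":  ("LU", "LCU", "LD",  "LD",  "LD"),
}


def band(rssi):
    """Index of the Table 2 column that the smoothed RSSI falls into."""
    if rssi >= LU_TH:
        return 0
    if rssi >= LCU_TH:
        return 1
    if rssi >= LGD_TH:
        return 2
    if rssi >= LD_TH:
        return 3
    return 4


def next_status(prev, rssi):
    return TRANSITIONS[prev][band(rssi)]


def generate_triggers(smoothed, initial="LU"):
    """Return (step, new_status) pairs, one per change of link status."""
    status, out = initial, []
    for i, rssi in enumerate(smoothed):
        new = next_status(status, rssi)
        if new != status:          # a trigger is generated only on a change
            out.append((i, new))
            status = new
    return out
```

For the smoothed sequence `[-65, -77, -75, -77, -69, -81]`, the sketch emits a single LGD trigger at step 1 even though the RSSI crosses LGD_TH twice, then LCU at step 4 and LD at step 5.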
Fig. 3. Smoothing methods’ performance comparison, for LGD triggers
Footnote 3: Though in this figure the triggering delays of exponential average and mode are similar, mode has a much higher triggering delay in generating Link-Down triggers.
3 Trigger Prediction
In this section, we discuss how to predict a link trigger, for instance, how to predict that LGD is going to happen. Such a prediction serves as a warning, so that Connection Manager can start preparing for a handover. Once the predicted LGD trigger is actually generated, the necessity for a network handover is confirmed and the on-going session (e.g. VoIP) can be transferred to the new connection. Note that although our discussion focuses on prediction of LGD triggers, the same algorithm is applicable to other link layer triggers as well. For simplicity, in what follows, we use “RSSI” to refer to the exponential average of RSSI, and explicitly use “raw RSSI” otherwise. The objectives for trigger prediction are:
• A large “proactive window”, which is the period from the time a prediction is made to the time the predicted LGD trigger is generated.
• Accurate prediction. (1) No miss: when a LGD trigger is generated, a prediction has been made some time in advance. (2) No false alarm: when a prediction is made, some time later a LGD is indeed generated, and in between the signal strength does not go high enough for the link quality to be considered coming up instead of going down.
Note that if we miss predicting a trigger, we are doing no worse than trigger generation without prediction. However, with false alarms, we may trigger unnecessary preparation for network handovers or even unnecessary handovers. As shown in Table 2, a LGD trigger is generated when the previous link status is Link-Up or Link-Coming-Up and the smoothed RSSI goes below LGD_TH. Hence, to predict the trigger, we first discuss how to predict future RSSI values, and then discuss how to predict triggers based on predicted RSSI values.
3.1 Signal Strength Prediction
The RSSI prediction problem is defined as below, where x[i] denotes the exponential average of RSSI at step i, and x’[i] the predicted value for x[i].
• RSSI prediction: Given x[i-N+1] to x[i], predict x[i+j], j>0.
We call x[i-N+1] to x[i] the data in the history window, and j the prediction step. Typically, RSSI measurements are collected every 100 ms. Hence, predicting one step ahead corresponds to predicting 100 ms into the future. Note that since smoothed RSSI values are reported as integers, predicted RSSI values are also returned as integers. There has been a lot of work on predicting fading signals [5, 7, 8, 9, 10]. Most approaches are based on complex computation, such as autoregressive methods, and are applied to predict raw RSSI. In contrast, for trigger prediction, we only need to predict smoothed RSSI. Moreover, we need a method that provides reasonable accuracy but does not require complex computation. The reason is that the prediction algorithm would be executed each time a new RSSI measurement is obtained, and most likely it would be implemented in the wireless card driver, where simple methods are always preferred. As such, we investigated the following non-complex methods:
• Straight-line: x’[i+j] = k*j + x[i], where k = (x[i] - x[i-N+1])/N (the slope of the line formed by the oldest and newest RSSI values in the history window).
• Step-by-step: First predict x[i+1] by applying the straight-line method to x[i-N+1] to x[i]. Then slide the history window by 1 to include x[i-N+2] to x’[i+1] and predict x[i+2], and so on.
• Least square estimator (LSE) [16].
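The first two methods are simple enough to sketch directly. The Python below is an illustration only: the function names are hypothetical, `history` is assumed to hold the N most recent smoothed values (oldest first), and intermediate values are kept as real numbers rather than being converted to integers at each step, which the paper does not specify for the step-by-step method.

```python
def straight_line(history, j):
    """x'[i+j] = k*j + x[i], with k = (x[i] - x[i-N+1]) / N as in the text."""
    n = len(history)
    k = (history[-1] - history[0]) / n
    return k * j + history[-1]


def step_by_step(history, j):
    """Predict one step at a time, sliding the window over predicted values."""
    window = list(history)
    for _ in range(j):
        nxt = straight_line(window, 1)
        window = window[1:] + [nxt]   # drop the oldest value, append x'[i+1]
    return window[-1]
```

With `history = [-60, -62, -64, -66, -68]` and `j = 5`, the slope is k = -8/5 = -1.6 and the straight-line prediction is -68 + 5 × (-1.6) = -76.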
Fig. 4. RSSI prediction errors
To evaluate the prediction accuracy of the above three methods, we apply them to the traces in Fig. 2 and compare prediction errors (the difference between x[i] and x’[i]). Fig. 4 presents some evaluation results, where the cumulative percentage is calculated based on the prediction errors from all traces. The closer a curve is to the top line (100%), the more accurate the prediction is. First, Fig. 4(a) shows that the larger the prediction step (larger j), the bigger the prediction errors. Apparently, there is a tradeoff between prediction accuracy and large prediction steps. In Section 4, we will discuss choosing an appropriate value for the prediction step in trigger prediction. Second, Fig. 4(b) shows prediction errors for the three prediction methods over all traces for j=5. Interestingly, the straight-line and step-by-step methods (which have similar performance) perform better than LSE (footnote 4). In summary, due to its simplicity and higher accuracy, we choose the straight-line method for RSSI prediction.
3.2 Trigger Prediction
Next, we discuss how to predict a LGD trigger by predicting RSSI. For simplicity, when at step i it is predicted that a LGD trigger will happen, we say that a PreTrigger is generated at step i, and we will use trigger prediction and PreTrigger generation interchangeably. A basic algorithm could be to apply the trigger generation algorithm to predicted RSSI values. However, since RSSI prediction has errors, and the further ahead we predict, the larger the errors, this basic algorithm could result in many false alarms and missed predictions.
Footnote 4: The reason could be that integer values (low precision) are used to represent RSSI values, and also that these RSSI values are smoothed values and are not totally independent of each other.
To improve the basic algorithm, we have a few observations. First, the algorithm should consider both the long term trend and short term changes, in order to avoid false alarms and respond to sharp changes in time. Second, in predicting RSSI values, the size of the history window (parameter N) is important. A large window is good for following the long term trend but less sensitive to sharp changes. On the other hand, a small window reacts faster to sharp changes but may result in more false alarms due to fluctuation of signal strength. We propose to leverage both window sizes in the prediction. To react fast to recent sharp decrease of RSSI values, we predict RSSI values by applying both large and small history windows and take the minimum of the two predicted values to be the predicted RSSI. On the other hand, to catch the long term trend to avoid false alarms, we apply a trend analysis algorithm to RSSI values in the large history window. Only when the predicted RSSI is below LGD_TH and the long term trend of RSSI is also going downward, will we make a prediction (i.e. generate a PreTrigger).
Fig. 5. State machine diagram of Trigger Prediction (PreTrigger) algorithm
The state machine diagram of PreTrigger generation is specified in Fig. 5, where Pred_RSSI(i+j) = Min(x’[i+j]_N1, x’[i+j]_N2), and x’[i]_N denotes the predicted value of the exponential-average RSSI at step i, obtained by using window size N and the straight-line prediction method. We use N1 to denote the large window size and N2 the small window size, and set N1=50 and N2=10. Parameter j is tunable, and we discuss how to choose an appropriate value for j in Section 4. When the state transitions from IDLE to PRE_TRIGGER, a PreTrigger is generated. Later, if a LGD trigger is indeed generated, the state changes to FINAL (i.e. the prediction is accurate). The transition from PRE_TRIGGER to INVALID indicates that after the prediction is made, RSSI actually goes up and the prediction is considered a false alarm. The transition from IDLE to FINAL indicates that a LGD trigger is generated before a prediction could be made, hence a missed prediction. To analyze the recent trend of signal strength, we apply the FFT method. (Refer to [2] for details of the FFT method.) Given a set of RSSI values, x(i-N1+1) to x(i), the trend analysis algorithm will return the trend as downward, upward, or undefined (meaning recent RSSI values do not show a strong trend of either going up or down). Finally, Algorithm 2 presents the overall procedure of trigger generation and prediction. It is executed each time a new RSSI measurement is obtained.
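The transitions named in this paragraph can be sketched as a small step function. This is a simplified reading of the text, not a transcription of Fig. 5: the exact guard conditions in the figure are not reproduced here, so the condition for entering INVALID (smoothed RSSI reaching LCU_TH) is an assumption, the trend is passed in as a precomputed string, and the function and label names are hypothetical.

```python
LCU_TH, LGD_TH = -70, -76  # thresholds from Table 1


def straight_line(history, j):
    """Straight-line prediction: k*j + x[i], k = (x[i] - x[i-N+1]) / N."""
    k = (history[-1] - history[0]) / len(history)
    return k * j + history[-1]


def predicted_rssi(x, j, n1=50, n2=10):
    """Pred_RSSI(i+j): minimum of the large- and small-window predictions."""
    return min(straight_line(x[-n1:], j), straight_line(x[-n2:], j))


def step(state, x, j, trend, lgd_generated):
    """Advance the simplified PreTrigger state machine by one measurement.

    x: smoothed RSSI history (oldest first); trend: "UP", "DOWN" or "UNDEFINED";
    lgd_generated: True if trigger generation just emitted a LGD trigger.
    Returns (new_state, event), where event is None when nothing happened.
    """
    if state == "IDLE":
        if lgd_generated:
            return "FINAL", "missed"            # LGD arrived before a prediction
        if trend == "DOWN" and predicted_rssi(x, j) < LGD_TH:
            return "PRE_TRIGGER", "pretrigger"  # a PreTrigger is generated
    elif state == "PRE_TRIGGER":
        if lgd_generated:
            return "FINAL", "accurate"          # prediction confirmed
        if x[-1] >= LCU_TH:                     # assumed invalidation condition
            return "INVALID", "false_alarm"
    return state, None
```

How the machine returns to IDLE after FINAL or INVALID is left out, since the text does not describe it.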
Algorithm 1. Trend Analysis Algorithm
Step 1: Apply the FFT method to get the trend of RSSI in the following history windows, which will return the trend as UP, DOWN, or UNDEFINED (N1=50 and N2=10):
• Long term trend: the trend of [x(i-N1+1), x(i)].
• Half long term trend: the trend of [x(i-N1/2), x(i)].
• Short term trend: the trend of [x(i-N2+1), x(i)].
Step 2: The recent trend is downward if one of the following is true: long term trend is DOWN; long term trend is UNDEFINED and half long term trend is DOWN; or long term trend is UNDEFINED and short term trend is DOWN. Similarly, the recent trend of RSSI is upward if one of the following is true: long term trend is UP; long term trend is UNDEFINED and half long term trend is UP; or long term trend is UNDEFINED and short term trend is UP. Otherwise, the trend is undefined.
Algorithm 2. Overall Algorithm for Trigger Generation and Prediction
1. Initialize data structures when associated with AP and set i=0.
2. When a new RSSI measurement is obtained, calculate exponential average, x(i).
3. If i < N2-1, wait for more RSSI measurements. Otherwise, continue.
4. Apply the trigger generation algorithm (Section 2) and generate triggers if link status changes.
5. Apply the trigger prediction algorithm (Fig. 5), and generate PreTrigger when entering the PRE_TRIGGER state.
6. i=i+1. Repeat steps 2 to 6.
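Step 2 of Algorithm 1 and the window selection of Step 1 can be sketched as below. The FFT-based trend detector itself is described in [2] and is not reproduced; the three per-window trends are taken as inputs. One case is ambiguous in the text (long term UNDEFINED with the half long term and short term trends pointing in opposite directions); this sketch checks the downward conditions first, which is an assumption, as are the function names.

```python
def trend_windows(x, n1=50, n2=10):
    """History windows of Step 1: long, half-long and short (oldest first)."""
    long_w = x[-n1:]                 # x(i-N1+1) .. x(i)
    half_w = x[-(n1 // 2 + 1):]      # x(i-N1/2) .. x(i)
    short_w = x[-n2:]                # x(i-N2+1) .. x(i)
    return long_w, half_w, short_w


def recent_trend(long_term, half_long_term, short_term):
    """Combine per-window trends ("UP", "DOWN", "UNDEFINED") as in Step 2."""
    shorter = (half_long_term, short_term)
    if long_term == "DOWN" or (long_term == "UNDEFINED" and "DOWN" in shorter):
        return "DOWN"                # assumption: downward is checked first
    if long_term == "UP" or (long_term == "UNDEFINED" and "UP" in shorter):
        return "UP"
    return "UNDEFINED"
```

The result of `recent_trend` is what the PreTrigger logic would consult before generating a prediction.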
4 Evaluation of Trigger Prediction
To evaluate the performance of trigger prediction, we apply Algorithm 2 to real traces of signal strength measurements. For each trace, at each step a measured RSSI value (or CINR for WiMax) is read in from the trace and the algorithm is applied. For all traces, we count the number of accurate predictions, missed predictions, and false alarms. We also evaluate the length of the proactive window of each accurate prediction.
4.1 Results from WiFi Traces
We used eight WiFi traces for evaluation. Two traces were from those in Fig. 2 (the first two at the bottom). Six more were collected and are shown in Fig. 6. The top three were collected by walking from inside an office building to the outside. The bottom three were collected by moving from inside to outside of a house, in different directions and at different speeds. Note that these traces include both sharp decay and slow degradation of signal strength. Altogether, there are 25 LGD triggers generated in the eight traces. Table 3 presents the results for using different prediction steps, j=5 and j=10. With a larger prediction step, we could make earlier predictions (i.e. a larger proactive window). As shown in Table 3, with j=10, for accurate predictions, on average our algorithm was able to predict LGD triggers 1.50 seconds ahead of time. On the other hand, the larger the prediction step, the more missed predictions and false alarms. In Table 3, with j=10, there were 10 canceled triggers (false alarms) and 4 missed predictions, where both numbers are larger than those of j=5. Hence, we recommend setting j=5 to
Fig. 6. Traces of raw RSSI, part II
Table 3. Summary of trigger prediction results from WiFi traces
Metric | j = 5 | j = 10
Total # of LGD triggers from all traces | 25 | 25
# of PreTriggers generated by PreTrigger algorithm | 24 | 31
# of Accurate PreTriggers | 24 | 21
# of canceled PreTriggers | 0 | 10
# of missed PreTriggers | 1 | 4
Average length of proactive windows (seconds) | 1.13 | 1.5
trade off prediction accuracy against a large proactive window. On the other hand, Table 3 demonstrates that with j=5, our algorithm was able to successfully predict 24 out of the 25 triggers, and on average made predictions 1.13 seconds ahead of time (for reference, around half of the predictions were made more than 1 second ahead of time). Hence, by applying our algorithm, a warning that the link is going down could be sent out 1.13 seconds before LGD is detected, thus giving the Connection Manager plenty of time to monitor alternative connections, select one with good quality, and initiate a handover.

4.2 Results from WiMax Traces

We obtained a set of WiMax traces that were collected by driving around in the field in a real deployment of WiMax base stations. In each trace, measurements of downlink signal strength (CINR) were recorded every second (see footnote 5). We extracted 5 traces that altogether span 4 hours and 27 minutes and applied our algorithms. Figure 1 shows two of the traces. From private communication, we learned that in WiMax networks, connection failures are likely to occur when CINR is below 0, and that a value of 9 was used to trigger handover when these traces were collected. Hence, for the WiMax traces, we set LGD_TH=9 and LD_TH=0. Results are presented in Table 4. Altogether there were 30 LGD triggers, and our algorithm successfully predicted 24 of them with an average proactive window of 8.7

5 Due to implementation limitations, measurements could not be taken at a higher frequency.
Using Predictive Triggers to Improve Handover Performance in Mixed Networks
Fig. 7. Sample WiMax traces
steps (see footnote 6). On the other hand, the results show one canceled PreTrigger and 6 missed predictions. We conjecture that the higher percentage of missed predictions in the WiMax traces, compared to the WiFi results, is due to the varying driving speeds at which the WiMax traces were collected. In the current design of our algorithms, we use history windows of fixed size. One enhancement, which we leave for future work, is to use adaptive window sizes that follow the moving speed.

Table 4. Summary of results of trigger prediction for WiMax traces

Total # of LGD triggers from all WiMax traces        30
# of Pre-Triggers generated by our algorithm         25
# of accurate Pre-Triggers                           24
# of canceled Pre-Triggers                           1
# of missed Pre-Triggers                             6
Avg. length of proactive window (# of steps)         8.7
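To make the counts in Tables 3 and 4 easier to compare, the following sketch turns them into two ratios: the share of LGD triggers that were predicted, and the share of generated PreTriggers that were later canceled. The function name and the choice of these two ratios are illustrative assumptions, not metrics defined in the paper. The input numbers are copied from the tables.

```python
def prediction_rates(total_triggers, generated, accurate, canceled):
    """Return (share of LGD triggers predicted, share of PreTriggers canceled)."""
    predicted_share = accurate / total_triggers
    canceled_share = canceled / generated
    return predicted_share, canceled_share


# (total LGD triggers, PreTriggers generated, accurate, canceled), from Tables 3 and 4
results = {
    "WiFi, j=5": (25, 24, 24, 0),
    "WiFi, j=10": (25, 31, 21, 10),
    "WiMax": (30, 25, 24, 1),
}

for name, counts in results.items():
    predicted, canceled = prediction_rates(*counts)
    print(f"{name}: {predicted:.0%} of LGD triggers predicted, "
          f"{canceled:.0%} of PreTriggers canceled")
```

In each column the counts are internally consistent: the generated PreTriggers equal accurate plus canceled, and the missed ones equal the total minus accurate.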
5 Conclusion

For multi-radio devices that expect ubiquitous connectivity across networks, vertical handovers can be lengthy (on the order of multiple seconds), since they typically involve a combination of link layer, IP layer, and application layer operations. In this paper, we proposed a novel approach to predict when a device needs to take proactive actions and prepare for handovers. With our link trigger prediction algorithm, events such as Link-Going-Down could be predicted more than 1 second ahead of time. As indicated by the measurement in Fig. 1, such predictions not only enable a Connection Manager to detect Link-Going-Down before the connection is lost (instead of waiting for OS messages for confirmation), but also give it extra time to turn on another radio, obtain an IP address, and redirect a session. More specifically, with link triggers and trigger predictions, the service interruption, if any, is mainly due to media redirection and should be on the order of milliseconds. We are now integrating our algorithms into our WiFi/WiMax multi-radio prototype system for further validation.

6 For the WiMax results, we do not report the length of the proactive window in seconds, because the CINR values were collected every 1 second, and this frequency may not reflect future implementations. With more frequent samples, we conjecture that the absolute time span of a proactive window may be smaller, but its length in terms of the average number of steps would remain similar.
Another potential benefit of trigger prediction is that it could enable energy-efficient network monitoring and selection for handovers. In general, to make appropriate handover decisions, it is desirable to monitor the quality of other available networks (or available access points in other channels) [1][2]. However, leaving all the radios on all the time for monitoring network quality consumes a lot of power. By applying our work, one can turn on other radios only when a prediction of LGD is made. The early prediction provides a reasonably long time window for monitoring and comparing networks, and also allows radios that are not in use to be turned off most of the time to save power. We plan to investigate this direction further.

Acknowledgments. The authors thank Brent Elliott, Intel Corporation, for acquiring and sharing the WiMax data.
References
[1] Mhatre, V., Papagiannaki, K.: Using smart triggers for improved user performance in 802.11 wireless networks. In: Proceedings of the 4th international conference on Mobile systems, applications and services (MobiSys) (2006)
[2] Guo, C., Guo, Z., Zhang, Q., Zhu, W.: A Seamless and Proactive End-to-End Mobility Solution for Roaming Across Heterogeneous Wireless Networks. IEEE Journal on Selected Areas in Communications (June 2004)
[3] Choong, K.N., Kesavan, V.S., Ng, S.L., de Carvalho, F., Low, A.L.Y., Maciocco, C.: SIP-based IEEE802.21 media independent handover: a BT Intel collaboration. BT Technology Journal 25(2) (2007)
[4] Mitsuya, K., et al.: IEEE802.21 assisted Network Selection for Seamless Handover (submission)
[5] Chien, S.F., Liu, H., Low, A.L.Y., Maciocco, C., Ho, Y.L.: Smart Predictive Trigger for Effective Handover in Wireless Networks (submission)
[6] Woon, S., Golmie, N., Şekercioglu, Y.A.: Effective Link trigger to Improve Handover Performance. In: Proceedings of The 17th Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC) (2006)
[7] Pollini, G.P.: Trends in Handover Design. IEEE Communications Magazine (March 1996)
[8] Eyceoz, T., Hu, S., Duel-Hallen, A.: Performance Analysis of Long range Prediction for Fast fading Channels. In: Proceedings of the 33rd CISS, Baltimore, MD (March 1999)
[9] Piroddi, L., Spinelli, W.: Long-range Nonlinear Prediction: A Case Study. In: Proceedings of IEEE Decision and Control 42nd Conference (June 2006)
[10] Semmelrodt, S., Kattenbach, R.: Investigation of Different Fading Forecast Schemes for Flat Fading Radio Channels. In: Proceedings of IEEE VTC, Orlando, FL (October 2003)
[11] BT 21 Century Network, http://www.btplc.com/
[12] International Telecommunication Union, http://www.itu.int/home/index.html
[13] Telecoms and Internet converged Services and Protocols for Advanced Networks, http://www.etsi.org/tispan/
[14] IEEE P802.21/D01.00 (March 2006)
[15] 3GPP Architecture Enhancements for non-3GPP accesses (3GPP TS 23.402) (2007)
[16] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd431.htm
Medium Access Cooperations for Improving VoIP Capacity over Hybrid 802.16/802.11 Cognitive Radio Networks

Deyun Gao (1), Jianfei Cai (2), and Chuan Heng Foh (2)

(1) School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing 100044, P.R. China
(2) School of Computer Engineering, Nanyang Technological University, Singapore 639798
Abstract. There are some existing works that study the coexistence of 802.16 and 802.11 networks. However, not many of them consider the resource allocation issues in the case of delivering traffic between mobile stations and Internet users through an access point (AP) and a base station (BS) which operate at the same frequency band. In this paper, we design a cooperation mechanism for 802.16 and 802.11 to share the same medium with adaptable resource allocation. The adaptiveness in resource allocation in our design eliminates the potential inefficiency from the cooperation due to bandwidth bottleneck. Targeting VoIP applications, we propose a simple approach for the adaptation that optimizes the resource allocation not only between the IEEE 802.11 AP and the stations but also between the IEEE 802.16 BS and the IEEE 802.11 AP. Numerical results show significant improvement in voice capacity.
1 Introduction

There have been tremendous advances in wireless networks and mobile devices in recent years. It is expected that different radio technologies, including WiFi (IEEE 802.11), UWB (IEEE 802.15.3), WiMAX (IEEE 802.16), ZigBee (IEEE 802.15.4), Bluetooth and 3G, will coexist. With such rapid growth of wireless technologies, spectrum scarcity has become a serious problem as more and more wireless applications compete for very little spectrum. To address this problem, cognitive radio technology was introduced in the late 1990s by Joseph Mitola [1]. Although cognitive radio technology sheds light on spectral reuse, it leaves open the issues of how to efficiently and practically deploy cognitive radios [2]. Recently, cognitive radio has attracted considerable interest from the research community [3,4], where dynamic spectrum utilization is the main focus. Among various wireless networks, 802.16 and 802.11 are the two most important wireless access technologies, which could be integrated to provide Internet connections for end users as described in [5,6]. In general, 802.16 can provide wireless backhaul connectivity to homes and offices while 802.11 offers complementary local area connectivity, such as within a home, an office or a campus. In this paper, we consider hybrid 802.11 and 802.16 cognitive radio networks. Fig. 1 shows a typical scenario, where
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 889–900, 2008. © IFIP International Federation for Information Processing 2008
D. Gao, J. Cai, and C.H. Foh
Fig. 1. A typical scenario of hybrid 802.16 and 802.11 networks (two buildings, each with an 802.11 AP + 802.16 SS, served by one 802.16 BS)
mobile stations are connected to 802.11 APs and the APs are connected to the Internet through an 802.16 BS. Under such a scenario, 802.16 and 802.11 might have to share the spectrum, e.g., the U-NII frequency band at 5 GHz that could be used by both 802.11a and 802.16a. Currently, only a few spectrum sharing methods [7,8,9] have been proposed for the unlicensed bands. In [7], Berlemann et al. proposed to partially block 802.11 stations from accessing the medium so that 802.16 could use the same spectrum. In [8], Jing et al. proposed to utilize the available degrees of freedom in frequency, power and time, and to react to the observations in these dimensions to avoid interference. In [9], Jing and Raychaudhuri proposed to use a common spectrum coordination channel to exchange control information in order to cooperatively adapt key PHY-layer parameters such as frequency and power. None of these existing schemes considers the resource allocation issues that arise when traffic is delivered between mobile stations and Internet users through an AP and a BS that share the same frequency band. In this paper, we design a cooperation mechanism for 802.16 and 802.11 to share the same medium with adaptable resource allocation. We further consider VoIP applications and propose a simple approach to find the optimal resource allocation not only between the AP and the stations but also between 802.16 and 802.11.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of the 802.11 and 802.16 MAC protocols. In Section 3, we propose the medium access cooperation mechanism to coordinate the channel access between 802.11 and 802.16. In Section 4, we discuss how to optimally allocate the resource for VoIP over the hybrid 802.16/802.11 networks. Numerical results are provided in Section 5 and conclusions are given in Section 6.
2 Overview of 802.11 and 802.16 MAC Protocols

2.1 802.11 MAC Protocol

In 802.11 WLANs, the MAC layer defines the procedures for 802.11 stations to share a common radio channel. The legacy 802.11 standard specifies the mandatory distributed coordination function (DCF) and the optional point coordination function (PCF) [10]. DCF is essentially a “listen-before-talk” scheme based on CSMA/CA, while PCF uses polling to provide contention-free transmission. To enhance the QoS support in 802.11,
Medium Access Cooperations for Improving VoIP Capacity
the new IEEE 802.11e [11] introduces the hybrid coordination function (HCF), which includes two medium access mechanisms: enhanced distributed channel access (EDCA) and HCF controlled channel access (HCCA), which can be regarded as extensions of DCF and PCF, respectively. In the IEEE 802.11 MAC protocol, time is divided into superframes, where each superframe consists of two types of phases: the contention free period (CFP) and the contention period (CP). In the legacy 802.11, DCF is used in CPs and PCF is used in CFPs. Likewise, in the 802.11e MAC protocol, EDCA can only be used in CPs, while HCCA can be used in both phases. Fig. 2 illustrates the different periods under HCF. Note that the CAP (controlled access phase) is defined as the time period during which the medium control is centralized. It can be seen that CAPs consist of not only CFPs but also parts of CPs.

Fig. 2. An example of CAPs/CFPs/CPs (timeline between beacons showing the CFP repetition interval, CFPs, CPs, CAPs, and EDCA TXOPs and access by legacy STAs using DCF)
In this research, we assume that EDCA is used in WLANs for the communications between mobile stations and the access point (AP). The EDCA mechanism extends the legacy DCF by introducing multiple access categories (ACs) to serve different types of traffic. In particular, there are four ACs with independent transmission queues in each mobile station. The four ACs, from AC3 down to AC0, are designed to serve voice traffic, video traffic, best effort traffic, and background traffic, respectively. Each AC, basically an enhanced variant of DCF, contends for transmission opportunities (TXOPs) using one set of the EDCA channel access parameters. The major parameters include
– CWmin[AC]: minimal contention window (CW) value for a given AC.
– CWmax[AC]: maximal CW value for a given AC.
– AIFS[AC]: arbitration interframe space. Each AC starts its backoff procedure after the channel is idle for a period of AIFS[AC].
– TXOPlimit[AC]: the limit of consecutive transmission. During a TXOP, a station is allowed to transmit multiple data frames but is limited by TXOPlimit[AC].

2.2 802.16 MAC Protocol

The 802.16 MAC protocol [12] supports point to multipoint (PMP) and mesh network modes, schedules the usage of the air link resource and provides QoS differentiation. In this paper, we focus on the PMP mode, where one base station (BS) and many subscriber stations (SSs) form a cell similar to that in cellular networks. There are two types of duplexing schemes: FDD (Frequency Division Duplex) and TDD (Time Division Duplex). Most WiMAX implementations use TDD.
Fig. 3 shows the frame structure in a typical 802.16 TDD system. Basically, time is divided into frames, and each frame consists of uplink and downlink subframes. A downlink subframe (DL-Subframe) has two major parts: control information and data. There are two important maps in the control information of a DL-Subframe: DL-MAP and UL-MAP, which describe the slot locations for the downlink and uplink subframes. It is through the DL-MAP and UL-MAP fields that the BS allocates resources to SSs. The UL subframe contains an initial ranging field, a bandwidth request field, and burst fields for MAC PDUs. The 802.16 MAC protocol supports both polling and contention-based mechanisms for SSs to send bandwidth requests.

Fig. 3. The frame structure of 802.16 (DL-Subframe: preamble, FCH with DLFP, bursts #1 to #n carrying DL-MAP, UL-MAP, MAC messages and MAC PDUs, followed by TTG; UL-Subframe: contention for initial ranging, contention for BW request, UL bursts #1 to #m, followed by RTG; each UL burst: preamble, MAC PDUs, PAD; each MAC PDU: MAC header, MAC payload, CRC)
The 802.16 MAC protocol is connection-oriented. The QoS requirements of a connection in an SS can be varied by sending requests to the BS. Service differentiation has also been introduced in WiMAX [13], where four service classes are defined:
– Unsolicited grant service (UGS) for CBR traffic such as voice
– Real-time polling service (rtPS) for real-time VBR traffic such as MPEG videos
– Non-realtime polling service (nrtPS) for non-realtime traffic such as FTP
– Best effort (BE).
3 Medium Access Cooperations in Hybrid 802.16/802.11 Networks

In this paper, we consider a typical scenario of hybrid 802.16/802.11 networks, where an 802.16 BS is connected to a few SSs using TDMA/TDD and each SS is an AP communicating with many mobile stations through EDCA. Although the medium access protocols in both 802.16 and 802.11 have been well defined, we still need to design a mechanism to coordinate the medium access between them so that they can operate in the same spectrum. Since a typical superframe in 802.11 is about 100 - 200 ms, much longer than a frame in 802.16 (typically 5 - 20 ms), it is a natural choice to embed 802.16 frames into an 802.11e superframe and use CAPs for the communications between APs/SSs and the BS. In particular, when an AP/SS joins the 802.16 network, the BS periodically allocates some time slots in each frame to the AP/SS. The AP/SS can obtain the frame length
information from the frame header. After that, the AP/SS uses the HCCA mechanism to send one packet, such as an RTS, to inform all the mobile stations of the periodic time intervals of the 802.16 frames, indicated by the network allocation vector (NAV), as shown in Fig. 4. The mobile stations and the AP do not communicate with each other during the periods indicated by NAVs, while in the other periods they communicate using EDCA. In this way, we avoid transmission conflicts between 802.16 and 802.11.

Fig. 4. Medium access cooperations between 802.16 and 802.11 (timeline: periodic 802.16 frames coincide with NAV periods inside an 802.11e superframe delimited by beacons; the remaining time is used for EDCA TXOPs and access by legacy STAs using DCF)
In 802.16, when the traffic conditions change, we need to change the 802.16 frame length to accommodate the new traffic load. All the attached APs/SSs then need to send a new NAV to their associated stations. We would also like to point out that the differentiated services in 802.11e and 802.16 are quite similar. We could directly map each of the four services in 802.11e onto one of the services in 802.16, although the ways of implementing the service differentiation differ: the former adjusts the EDCA parameters and the latter uses prioritized scheduling. Note that our proposed scheme also applies to multiple WLAN cells, each of which connects to one SS. We assume that these WLAN cells are not located within each other's interference range. The interference problems between WLAN cells are outside the focus of this paper. Under such a scenario, the 802.16 BS needs to choose the maximum transmission time requirement among these WLAN cells as the common requirement. Then, the 802.16 BS allocates to each AP/SS time slots that satisfy the common requirement. Each WLAN cell can complete its data transmission in parallel during the allocated time slots.
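The multi-cell rule at the end of this section amounts to taking a maximum. The sketch below is only an illustration of that one step; the function name, the cell identifiers and the use of milliseconds are hypothetical and are not specified in the paper.

```python
def allocate_common_requirement(cell_requirements_ms):
    """Give every AP/SS the largest transmission time requested by any WLAN cell.

    cell_requirements_ms maps a (hypothetical) cell identifier to the
    transmission time that cell needs, in milliseconds (assumed unit).
    """
    if not cell_requirements_ms:
        raise ValueError("at least one WLAN cell is required")
    common = max(cell_requirements_ms.values())
    # Every cell receives the same allocation, so all cells can transmit in parallel.
    return {cell: common for cell in cell_requirements_ms}
```

Because the cells are assumed not to interfere with one another, sizing the allocation for the most demanding cell is sufficient for all of them.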
4 Improved Resource Allocation for VoIP Applications

For VoIP applications in the hybrid 802.16/802.11 networks, each voice call involves one 802.11 mobile user and another user connected to the Internet, and the communications go through one AP and one BS. One of the most important issues is how to optimally allocate the resource among the mobile stations, the AP and the BS so that more VoIP connections can be supported. In our previous work [14], we studied the case of VoIP over WLANs. We found that the AP is the bottleneck for VoIP applications if the legacy DCF mechanism is used directly. We proposed to use EDCA to solve the bottleneck problem. In particular, we gave the AP higher priority than the mobile stations. The experimental results in [14] show the improved voice capacity.
For VoIP over the hybrid 802.16/802.11 networks, the bottleneck problem of the AP becomes even more severe, since it needs to transmit not only all the 802.16 downlink packets to the stations but also all the 802.11 uplink packets to the BS.

4.1 Improved Resource Allocation

In order to appropriately allocate resource, we need to consider four throughputs: the uplink and the downlink throughput of 802.16, denoted as $S_{up}^{16}$ and $S_{dw}^{16}$. We further define $S_{up}^{11}$ and $S_{dw}^{11}$ to be the uplink throughput of EDCA from each station and the downlink throughput of EDCA from the AP, respectively. For simplicity, we assume there is only one SS. The following derivation can be easily extended to the case of multiple SSs. Considering the symmetric property of VoIP traffic, the contention-free resource allocation in 802.16 and the contention-based resource allocation in EDCA, we have

\[
\begin{aligned}
S_{up}^{16} &= S_{dw}^{16} \\
S_{up}^{16} &= N R_{req} \\
S_{up}^{11} (1 - r) &\ge R_{req} \\
S_{dw}^{11} (1 - r) &\ge N R_{req},
\end{aligned}
\tag{1}
\]

where $N$ is the number of voice connections, $R_{req}$ is the one-way voice throughput requirement, and $r$ is the time fraction occupied by 802.16. The adjustable parameters include all of the EDCA parameters such as CWmax, CWmin, retry limits, and others. In this research, we only consider adjusting the CWmin of the AP/SS and fix all the other EDCA parameters. The condition for optimal operation can be described as follows.

\[
\begin{cases}
\text{Maximize } N \in \mathbb{N} \\
\text{subject to } (r - 1)\left[ N S_{up}^{11}(N, W_{dw}) + S_{dw}^{11}(N, W_{dw}) \right] + r \left[ S_{up}^{16}(N) + S_{dw}^{16}(N) \right] \le B
\end{cases}
\tag{2}
\]

where $B$ is the total bandwidth for sharing between 802.16 and 802.11. Since all throughput functions, namely $S_{up}^{16}(N)$, $S_{dw}^{16}(N)$, $S_{up}^{11}(N, W_{dw})$ and $S_{dw}^{11}(N, W_{dw})$, are monotonically increasing functions in terms of $N$ where $N \in \mathbb{N}$, the solution can be practically computed numerically by searching for $N_{max}$ with the following method.

Step 1: Set $N$ to a small initial value.
Step 2: Calculate the aggregate one-way voice traffic load. Then, according to the first two equations in (1), we obtain $S_{up}^{16}$ and $S_{dw}^{16}$. Based on the 802.16 frame structure, we can compute the length of an 802.16 frame (see Section 4.3). Further considering the proposed setup between 802.16 frames and an EDCA superframe shown in Fig. 4, we derive $r$.
Step 3: Based on the obtained $r$ value, we test different values of $W_{dw}$, where $W_{dw} = CW_{min}[dw] + 1$. If we can find a particular $W_{dw}$ for which the corresponding uplink and downlink saturation throughput (see Section 4.2) can satisfy the throughput requirements shown in the two inequalities in (1), we set $N = N + 1$ and go back
to step 2. Otherwise, we stop and set $N_{max} = N - 1$. Note that we use the EDCA saturation throughput, which might not be the actual throughput. We use it because the analysis of the EDCA saturation throughput is much easier and more mature. The obtained voice capacity can be regarded as a lower bound.

4.2 EDCA Saturation Throughput Analysis

Several analytical models [15,14] have been proposed to analyze the performance of EDCA under saturation conditions, where the transmission queue of each station is assumed to be always nonempty. All of the existing EDCA modelling schemes are fundamentally based on Bianchi's work [16], which introduces the use of a Markov chain to model DCF. In our previous work [14], we developed a simplified Markov chain model for the EDCA performance analysis, which takes not only most of the EDCA parameters but also transmission errors into consideration. Directly applying the model in this research, we calculate $S_i$, $i \in \{up, dw\}$, according to the ratio of the time occupied by the transmitted information in a time interval to the average length of a time interval, i.e.

\[
S_i = R_{11} \frac{E[\text{time used for successful transmission in an interval}]}{E[\text{length between two consecutive transmissions}]}
    = R_{11} \frac{P_{i,s} E[P]}{E[I] + E[NC] + E[C]}
\tag{3}
\]

where $R_{11}$ is the physical transmission rate of 802.11, $E[P]$ is the VoIP payload length, $P_{i,s} E[P]$ is the average amount of successfully transmitted payload information, and the average length of a time interval consists of three parts: $E[I]$, the expected value of idle time before a transmission, $E[NC]$, transmission time without collision, and $E[C]$, collision time. The details of the derivation can be found in [14].

4.3 802.16 Throughput Analysis

Considering only one SS attached to a BS in 802.16, we calculate the time length of one frame as

\[
T_{Frame}^{16} = T_{LongPre}^{16} + T_{FCH}^{16} + T_{DLburst}^{16} + T_{TTG}^{16} + T_{InitRang}^{16} + T_{BWrequest}^{16} + T_{ULburst}^{16} + T_{RTG}^{16}
\tag{4}
\]

where each term corresponds to one component in the frame structure shown in Fig. 3. The terms $T_{DLburst}^{16}$ and $T_{ULburst}^{16}$ are further divided into

\[
\begin{aligned}
T_{DLburst}^{16} &= T_{ULburst}^{16} = T_{Pre}^{16} + T_{MAC}^{16} + T_{Pad}^{16} \\
T_{MAC}^{16} &= T_{header}^{16} + T_{subheader}^{16} + L / R_{16} + T_{CRC}^{16},
\end{aligned}
\tag{5}
\]

where $L$ is the payload length in bits. The particular parameter values defined in 802.16 are [17]: six bytes for the header, four bytes for the CRC, three ranging slots with each slot corresponding to eight OFDM symbols, ten bandwidth request slots with each slot corresponding to two OFDM symbols, two OFDM symbols for TTG and RTG, two OFDM symbols for the preamble at the head of the frame, and one OFDM symbol for the PDU preamble.
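With the throughput expressions of Sections 4.2 and 4.3 available, the search of Section 4.1 (Steps 1 to 3) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name and signature are hypothetical, and the EDCA saturation throughputs and the 802.16 time fraction r are passed in as caller-supplied functions because their detailed models ([14] and the frame-length equations) are not reproduced here. The feasibility test is the pair of inequalities in (1).

```python
from typing import Callable, Iterable, Optional, Tuple


def search_max_connections(
    r_req: float,
    time_fraction_16: Callable[[int], float],   # r as a function of N (Step 2)
    s11_up: Callable[[int, int], float],        # per-station uplink throughput S11_up(N, W_dw)
    s11_dw: Callable[[int, int], float],        # AP downlink throughput S11_dw(N, W_dw)
    w_dw_candidates: Iterable[int],
    n_start: int = 1,
    n_limit: int = 1000,                        # safety bound, not part of the paper
) -> Tuple[int, Optional[int]]:
    """Return (N_max, a W_dw that was feasible at N_max, or None)."""
    candidates = list(w_dw_candidates)
    n = n_start                                  # Step 1: small initial N
    best_w = None
    while n <= n_limit:
        r = time_fraction_16(n)                  # Step 2: derive r for this N
        # Step 3: look for a W_dw satisfying both inequalities in (1).
        feasible_w = next(
            (
                w
                for w in candidates
                if s11_up(n, w) * (1 - r) >= r_req
                and s11_dw(n, w) * (1 - r) >= n * r_req
            ),
            None,
        )
        if feasible_w is None:
            return n - 1, best_w                 # stop: N_max = N - 1
        best_w = feasible_w
        n += 1                                   # go back to Step 2 with N + 1
    return n_limit, best_w
```

The search increases N one connection at a time and stops at the first N for which no tested W_dw satisfies both inequalities, which is exactly where the procedure sets N_max = N - 1.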
5 Numerical Results

For the experiments, we adopt the system parameters of the 802.11a and 802.16a physical layers. For EDCA, we set Wup = 32, AIFS[up] = AIFS[dw] = 2, CWmax[up] = CWmax[dw] = 1023 and a maximum retry limit of 7. We assume that the G.711 voice codec is used in the application layer with a packetization interval of 20 ms, so that a raw voice packet is 160 bytes. From the viewpoint of the MAC layer, the frame payload size is 160+40=200 bytes and the data rate is 200 × 8/20 = 80 kbps. First, we assume the physical data rates for 802.16 and 802.11 are 6.91 Mbps and 6 Mbps, respectively. We compare our proposed scheme, which uses priority and cooperation, with two other schemes, where one has no priority and the other has no cooperation. The throughput performance of our proposed scheme is shown in Fig. 5, which depicts the aggregate one-way voice traffic load, the aggregate 802.11 uplink throughput and the 802.11 downlink throughput. Note that the 802.16 uplink and downlink throughput is equal to the aggregate one-way voice traffic load according to our system setup. We would also like to point out that, in Fig. 6, when the number of voice connections is small, the throughput is larger than the input traffic load, which is not realistic. This is because the throughput we plot is the saturation throughput, while the cases with small numbers of voice connections actually operate under unsaturated conditions. From the figure, we can see that, when Wdw = 2, the number of supported voice connections is 12, beyond which either the 802.11 uplink throughput or the downlink throughput becomes less than the traffic load. If we increase Wdw to 3, the number of supported voice connections increases to 14. However, if we further increase Wdw to 4 or larger, the 802.11 downlink throughput decreases, which leads to a reduced number of supported voice connections. Therefore, Wdw = 3 should be the optimal solution and N = 14 is the maximum number of supported voice connections.

Fig. 5. The throughput performance for our proposed scheme using priority and cooperation (throughput in Mbps versus number of voice connections; curves: one-way VoIP traffic load, and 802.11 uplink and downlink for Wdw=2 and Wdw=3)
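The traffic figures used above follow from simple arithmetic, which the following sketch reproduces. The function names are illustrative; the 160-byte payload, the 40 bytes of additional overhead and the 20 ms packetization interval are the values stated in the text.

```python
def voice_rate_kbps(raw_payload_bytes=160, overhead_bytes=40, packet_interval_ms=20):
    """One-way rate of a single voice connection as seen by the MAC layer."""
    frame_bits = (raw_payload_bytes + overhead_bytes) * 8
    return frame_bits / packet_interval_ms  # bits per millisecond equals kbps


def aggregate_load_mbps(n_connections, rate_kbps=None):
    """Aggregate one-way voice traffic load for n_connections connections."""
    if rate_kbps is None:
        rate_kbps = voice_rate_kbps()
    return n_connections * rate_kbps / 1000


print(voice_rate_kbps())        # 80.0 kbps per connection
print(aggregate_load_mbps(14))  # about 1.12 Mbps at N = 14
```

At the reported optimum of N = 14, each direction therefore has to carry about 1.12 Mbps of voice traffic, and at N = 10 the load is 0.8 Mbps.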
For the scheme without priority, we set Wdw = Wup = 32. Fig. 6 shows its throughput performance. It can be seen that the maximum number of supported voice connections in this situation is about five, far fewer than with our proposed scheme. This is because without priority the AP becomes the bottleneck for the communications in
802.11. For the scheme without cooperation, we fix the resource allocation between 802.16 and 802.11 to 50%, i.e. r = 0.5. Fig. 7 shows the throughput performance. It can be seen that the maximum number of supported voice connections in this situation is about 11, which is slightly smaller than the optimal result. However, such a fixed resource allocation could lead to much worse performance, since it does not consider the actual traffic conditions in the 802.11 and 802.16 networks. In contrast, our cooperation mechanism dynamically adjusts r according to the traffic loads, which effectively allocates the resource between 802.11 and 802.16.

Fig. 6. The throughput performance for the scheme without using priority (throughput in Mbps versus number of voice connections; curves: one-way VoIP traffic load, 802.11 uplink and 802.11 downlink for Wdw=32)

Fig. 7. The throughput performance for the scheme without cooperation (throughput in Mbps versus number of voice connections; curves: one-way VoIP traffic load, 802.11 uplink and 802.11 downlink for Wdw=3)
To consider different channel conditions, we vary the 802.16 data rate while fixing the 802.11 data rate to 6 Mbps. Table 1 shows the maximum numbers of supported voice connections under different 802.16 PHY-layer modes. We can see that when the 802.16 data rate is low, it has a great impact on the system performance. On the other hand, when the data rate becomes much higher than the 802.11 data rate, it does not affect the voice
capacity at all. This is because when the 802.16 data rate is high, the resource percentage it needs becomes very small and the voice capacity depends solely on the performance of 802.11. Similar observations can be made from Table 2, where we fix the 802.16 data rate and vary the 802.11 data rate. However, the reason behind this phenomenon is different. In 802.11a WLANs, the physical and MAC overheads are fixed for each frame and the transmission rate has no impact on these overheads. The VoIP frame payload is small and therefore has little impact on the total transmission time of each frame when the transmission rate is large. As a result, the number of stations that the system can support changes little as the 802.11a transmission rate becomes higher.

Table 1. The maximum numbers of supported voice connections under different 802.16 PHY-layer modes

Modulation   Code Rate   Data Rate (Mbps)   Max. Voice Conn.   Wdw   r
BPSK         1/2         6.91               14                 3     0.343
QPSK         1/2         13.82              16                 3     0.205
QPSK         3/4         20.74              18                 2     0.158
16 QAM       1/2         27.65              20                 2     0.135
16 QAM       3/4         41.47              21                 2     0.096
64 QAM       2/3         55.3               21                 2     0.077
64 QAM       3/4         62.21              21                 2     0.070

Table 2. The maximum numbers of supported voice connections under different 802.11a PHY-layer modes

Modulation   Code Rate   Data Rate (Mbps)   Max. Voice Conn.   Wdw   r
BPSK         1/2         6                  14                 3     0.343
BPSK         3/4         9                  16                 2     0.390
QPSK         1/2         12                 19                 2     0.459
QPSK         3/4         18                 22                 2     0.528
16 QAM       1/2         24                 23                 2     0.552
16 QAM       3/4         36                 24                 2     0.575
64 QAM       2/3         48                 25                 2     0.598
64 QAM       3/4         54                 25                 2     0.598
6 Conclusion

In this paper, we proposed the use of CAPs for 802.16 communications and CPs for 802.11 communications in hybrid 802.16/802.11 networks. We have analyzed the resource allocation issues for VoIP over the hybrid networks. By carefully choosing Wdw, we are able to give the AP higher priority than the stations in 802.11, which solves the bottleneck problem in VoIP applications. By adjusting r, we are able to dynamically adjust the resource allocation between 802.16 and 802.11. Preliminary results show a significant improvement in voice capacity.
Acknowledgments The authors gratefully acknowledge the support of the Research Fund of Beijing Jiaotong University under Grant No. 2007RC017, and the support of the National High Technology Research and Development Program of China (863 program) under Grant No. 2007AA01Z241.
References 1. Mitola, I.J., Maguire, J.G.Q.: Cognitive radio: making software radios more personal. IEEE Wireless Commun. Mag. 6(4), 13–18 (1999) 2. Jesuale, N.: Spectrum policy issues for state and local government. International Journal of Network Management 16(2), 89–101 (2006) 3. Haykin, S.: Cognitive radio: brain-empowered wireless communications. IEEE J. Select. Areas Commun. 23(2), 201–220 (2005) 4. Xing, Y., Chandramouli, R., Mangold, S., S.N.: Dynamic spectrum access in open spectrum wireless networks. IEEE J. Select. Areas Commun. 24(3), 626–637 (2006) 5. Garuda, C., Ismail, M.: A multiband CMOS RF front-end for 4G WiMAX and WLAN applications. In: Proc. 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), Island of Kos, Greece, May 2006, pp. 3049–3052 (2006) 6. Intel Corporation, Understanding Wi-Fi and WiMAX as metro-access solutions, Intel White Paper, [Online] (2004), http://www.intel.com/netcomms/technologies/wimax/304471.pdf 7. Berlemann, L., Hoymann, C., Hiertz, G., Walke, B.: Coexistence of IEEE 802.16 and 802.11(a) in unlicensed frequency bands. In: Proc. the 16th Meeting of Wireless World Research Forum (WWRF16), Shanghai, China (April 2006) 8. Jing, X., Mau, S.-C., Raychaudhuri, D., Matyas, R.: Reactive cognitive radio algorithms for co-existence between IEEE 802.11b and 802.16a networks. In: Proc. IEEE Global Telecommunications Conference (GLOBECOM 2005), St Louis, MO, November 2005, vol. 5, pp. 2465–2469 (2005) 9. Jing, X., Raychaudhuri, D.: Spectrum co-existence of IEEE 802.11b and 802.16a networks using the CSCC etiquette protocol. In: Proc. First IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks (DySPAN 2005), Baltimore, Maryland USA, November 2005, pp. 243–250 (2005) 10. Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, IEEE Std. 802.11 (1999) 11. 
Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications Amendment 8: Medium Access Control (MAC) Quality of Service Enhancements, IEEE Std. 802.11e-2005 (2005) 12. IEEE Standard for Local and metropolitan area networks Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems, IEEE Std. 802.16e-2004 (2004) 13. Cicconetti, C., Lenzini, L., Mingozzi, E., Eklund, C.: Quality of service support in IEEE 802.16 networks. IEEE Network 20(2), 50–55 (2006) 14. Gao, D., Cai, J., Chen, C.W.: Capacity analysis of supporting VoIP in IEEE 802.11e EDCA WLANs. In: Proc. the 49th IEEE Global Telecommunications Conference (GLOBECOM 2006), San Francisco, California (November 2006)
D. Gao, J. Cai, and C.H. Foh
15. Kong, Z., Tsang, D., Bensaou, B., Gao, D.: Performance analysis of IEEE 802.11e contention-based channel access. IEEE J. Select. Areas Commun. 22(10), 2095–2106 (2004) 16. Bianchi, G.: Performance analysis of the IEEE 802.11 distributed coordination function. IEEE J. Select. Areas Commun. 18(3), 535–547 (2000) 17. Hoymann, C.: Analysis and performance evaluation of the OFDM-based metropolitan area network IEEE 802.16. Computer Networks 49(3), 341–363 (2005)
ERA: Effective Rate Adaptation for WLANs
Saâd Biaz and Shaoen Wu
Computer Science and Software Engineering Department, Auburn University
Shelby Center for Engineering Technology, Suite 3101, Auburn University, AL, 36849-5347, USA
{biazsaa,wushaoe}@auburn.edu
Abstract. Rate adaptation consists of using the optimal rate for a given channel quality: the poorer the channel quality, the lower the rate should be. Multiple rate adaptation schemes have been proposed and studied so far. First generation rate adaptation schemes perform well in collision-free environments and handle strict channel degradation quite well. In a congestion-dominated environment, these schemes perform poorly because they do not differentiate losses due to channel degradation from those due to collisions, and they unnecessarily decrease the rate in response to collisions. Second generation rate adaptation schemes overcome this limitation. However, these more recent schemes usually require RTS/CTS control frames, which are generally not used because they constitute an overhead that may heavily lower network performance, especially when the frame size is small. This work proposes an effective and practical rate adaptation scheme (ERA) that does not require RTS/CTS control frames. ERA judiciously uses fragmentation, in compliance with the IEEE 802.11 standard, to diagnose the cause of frame loss and to promptly recover from frame losses. In extensive ns-2 simulations, ERA exhibits a significant throughput improvement over other rate adaptation schemes.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 901–913, 2008. © IFIP International Federation for Information Processing 2008

1 Introduction
A wireless medium suffers frame losses from unstable channel conditions resulting from signal fading and/or interference. When channel conditions degrade, a transmitter should use more robust encoding schemes that enable more reliable communications but yield lower data rates: rate adaptation is the determination of the rate most appropriate for the currently estimated wireless channel conditions. Rate adaptation in IEEE 802.11 networks has been studied at length in recent years, and many rate adaptation schemes have been proposed. Most first generation rate adaptation schemes implicitly assume that all packet losses are due to channel degradation, with limited congestion losses [1, 2, 3, 4]. Such an assumption is valid as long as the MAC layer deploys some collision avoidance mechanism, such as the use of RTS/CTS control frames, to eliminate or minimize congestion losses. Some rate adaptation schemes [5, 6] rely on such RTS/CTS control frames to adapt the rate and minimize congestion losses. However, according to the IEEE 802.11 standard, RTS/CTS control frames are optional and are recommended only when data frames are large (by default 2347 bytes in Annex D). Therefore RTS/CTS control frames are usually not enabled,
thus limiting RTS/CTS based rate adaptation schemes. Even with RTS/CTS control frames, congestion losses may still occur and mislead these rate adaptation schemes.
Most rate adaptation schemes systematically respond to frame loss by decreasing their data rate. While such a response is appropriate for channel degradation, it is not appropriate for collisions, for two reasons: 1) a lower rate may exacerbate medium congestion because of longer frame transmission duration and wider transmission range (more interference); 2) a lower rate is wasteful and unnecessary, as channel conditions may well support a higher rate. Thus, in recent years, researchers have designed second generation rate adaptation schemes that diagnose the cause of a loss and react appropriately [7, 8]. But these schemes require the use of RTS/CTS control frames, which are pure overhead. The design of a rate adaptation scheme that differentiates losses without RTS/CTS frames is desirable, but challenging. The authors of this work previously proposed a scheme [9] that does not require RTS/CTS control frames, but that scheme may incorrectly diagnose the cause of a loss.
This paper proposes and evaluates a new effective rate adaptation scheme, dubbed ERA, that does not require RTS/CTS frames, is compatible with the IEEE 802.11 standard, and works efficiently in environments dominated by channel degradation as well as in those dominated by collisions. The strategy in ERA yields two benefits: 1) an accurate diagnosis of the cause of a loss (channel degradation or congestion), and 2) a faster recovery from losses. ERA achieves these benefits by fragmenting the retransmitted lost frame into two fragments: a very short frame and the remainder of the retransmitted frame. For an accurate loss differentiation, ERA exploits the loss rate and the signal-to-noise ratio after a frame loss. ERA maintains the rate unchanged for collisions and judiciously adapts the rate for channel degradation based on a scheme described later.
In extensive ns-2 simulations, ERA is quite effective under both (congestion) collisions and channel degradation, and it outperforms previous rate adaptation schemes with loss differentiation. The primary contributions of this paper are:
– A simple, robust, and effective rate adaptation scheme with loss differentiation that does not require RTS/CTS frames.
– An efficient strategy for recovering from losses.
– An extensive evaluation and analysis of ERA and other previous data rate adaptation schemes.
The rest of this paper is organized as follows. Related work is presented in Section 2. Section 3 reviews the background information on IEEE 802.11 relevant to this work: the basic operation of the distributed coordination function and fragmentation. We then present the rationale of ERA in Section 4. Afterwards, the effective rate adaptation (ERA) scheme is described in Section 5. We evaluate the performance of ERA and other schemes in Section 6 through extensive ns-2 simulations. Section 7 concludes this paper.
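The introduction argues that a lower rate lengthens frame transmissions. As a rough illustration only, the following sketch computes the time needed to send the payload bits of a frame at each of the 802.11g data rates listed in Table 1 of this paper. It deliberately ignores preambles, MAC headers, interframe spaces, and acknowledgments, so the numbers are lower bounds rather than real airtimes; the function name is hypothetical.

```python
# 802.11g data rates used in the simulations of this paper (Table 1), in Mbps.
RATES_MBPS = [6, 9, 12, 18, 24, 36, 48, 54]


def payload_airtime_us(frame_bytes: int, rate_mbps: float) -> float:
    """Microseconds needed to send only the payload bits at the given rate.

    Simplification: preamble, headers, SIFS/DIFS, and ACK time are ignored.
    """
    return frame_bytes * 8 / rate_mbps


if __name__ == "__main__":
    for rate in RATES_MBPS:
        print(f"{rate:>2} Mbps: {payload_airtime_us(1500, rate):7.1f} us")
```

Under these simplifications, a 1500 byte frame occupies the medium nine times longer at 6 Mbps than at 54 Mbps, which is why an unnecessary rate decrease can worsen congestion.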
2 Related Work This section does not cover all previous rate adaptation schemes, but it surveys the most typical schemes relevant to this work. First, we present first generation schemes: these schemes do not differentiate losses due to channel degradation from those due to collisions. Second, we present schemes with loss differentiation.
2.1 First Generation: Rate Adaptation without Loss Differentiation
The Auto Rate Fallback (ARF) scheme by Kamerman and Monteban [1] is the earliest rate adaptation scheme proposed for IEEE 802.11 based wireless networks. It is simple and intuitive. If two consecutive transmissions fail, ARF decreases the rate and resets a timer. If either the timer expires or the sender achieves ten consecutive successful transmissions at rate $r_{old}$, ARF increases the rate to $r_{new}$ and resets the timer. Note that if a transmission fails immediately after a rate increase, ARF falls back to the prior rate $r_{old}$. In a stable environment, ARF keeps trying to increase its rate, leading to periodic failures and rate oscillations.
The Adaptive Auto Rate Fallback (AARF) scheme by Lacage, Manshaei, and Turletti [2] mitigates ARF oscillations by varying the threshold number $N$ of successful transmissions required to increase the rate. AARF increases the data rate after $N$ consecutive successful transmissions and decreases the data rate if the transmission fails twice consecutively. If the first transmission fails at the newly increased rate, AARF falls back to the prior rate just like ARF. In addition, AARF doubles the threshold required for the next rate increase to $2N$. In general, most data rate adaptation schemes, such as SampleRate [10] or Onoe (MadWifi Linux driver implementation), collect statistics about losses in the most recent past and adjust the rate according to some rules.

2.2 Second Generation: Rate Adaptation with Loss Differentiation
Recent rate adaptation schemes focus on differentiating the cause of frame losses between channel degradation and collision. The scheme dubbed Robust Rate Adaptation Algorithm (RRAA) by Wong et al. [8] is a frame loss rate driven scheme with loss differentiation. It keeps track of short term frame loss rates and adapts the rate accordingly.
In order to address the hidden terminal problem, Wong et al. proposed the "Adaptive RTS Filter", which uses RTS frames only when needed to assess new channel conditions and to minimize collisions. The "Adaptive RTS Filter" turns on RTS frames to prevent collision losses, such that the frame loss rate reflects channel degradation.
Kim et al. designed a Collision-Aware Rate Adaptation (CARA) scheme that uses RTS/CTS control frames whenever a frame loss occurs. CARA assumes that, since all RTS frames are transmitted at the lowest rate, they are resilient to channel fading and would fail only due to collisions. On the other hand, if the RTS is transmitted successfully but the transmission of the following data frame at a higher rate fails, then this failure is attributed to channel degradation, and CARA decreases the rate as in ARF. CARA does not consider the scenario where an RTS frame could be lost due to a receiver node moving out of radio range.
In parallel and independently from RRAA and CARA, we proposed Loss Differentiation Rate Adaptation (LDRA) [9]. LDRA retransmits a lost frame at the lowest (basic) rate to diagnose the cause of a frame loss and adjusts the rate appropriately. The LDRA strategy is analytically justified and particularly efficient in low SNR environments. But LDRA does not perform well in a congested environment and relies upon the signal-to-noise ratio (SNR). Previous experiments [10, 11] show a weak correlation between delivery probability at a given rate and SNR.
This work addresses the design of a rate adaptation scheme with 1) loss differentiation, 2) a judicious retransmission strategy, and 3) some collision avoidance. The key challenge is that the new scheme must not use RTS/CTS control frames and must still deliver better performance than the other schemes proposed so far.
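The ARF and AARF rules summarized in Section 2.1 can be sketched as a small state machine. This is a minimal illustration, not the original implementations: the ARF timer is omitted, the class and attribute names are hypothetical, the cap `max_threshold` is not given in the text, and resetting the threshold to its base value after an ordinary rate decrease is an assumption of this sketch.

```python
# 802.11g rates from Table 1 of this paper, in Mbps.
RATES_MBPS = [6, 9, 12, 18, 24, 36, 48, 54]


class AutoRateFallback:
    """ARF-style rate control; adaptive=True adds AARF-style threshold doubling."""

    def __init__(self, adaptive=False, base_threshold=10, max_threshold=50):
        self.adaptive = adaptive
        self.base_threshold = base_threshold
        self.max_threshold = max_threshold  # hypothetical cap, not from the text
        self.threshold = base_threshold     # successes required before an increase
        self.index = 0                      # position in RATES_MBPS
        self.successes = 0
        self.failures = 0
        self.just_increased = False

    @property
    def rate(self):
        return RATES_MBPS[self.index]

    def on_result(self, ok):
        """Update the state with one transmission outcome; return the next rate."""
        if ok:
            self.successes += 1
            self.failures = 0
            self.just_increased = False
            if self.successes >= self.threshold and self.index < len(RATES_MBPS) - 1:
                self.index += 1             # try the next higher rate
                self.successes = 0
                self.just_increased = True
        else:
            self.failures += 1
            self.successes = 0
            if self.just_increased:
                # First frame at the new rate failed: fall back immediately.
                self.index -= 1
                self.failures = 0
                self.just_increased = False
                if self.adaptive:
                    self.threshold = min(2 * self.threshold, self.max_threshold)
            elif self.failures >= 2:
                # Two consecutive failures: decrease the rate.
                self.index = max(self.index - 1, 0)
                self.failures = 0
                self.threshold = self.base_threshold  # assumption of this sketch
        return self.rate
```

With `adaptive=True`, a failed probe at a higher rate makes the next probe wait twice as long, which is the mechanism that damps the oscillations described for plain ARF.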
3 IEEE 802.11 Standard
This section presents key aspects of the IEEE 802.11 standard relevant to this paper.

3.1 Distributed Coordination Function
This work is limited to the widely deployed IEEE 802.11 distributed coordination function (DCF) [12] and considers the DCF basic protocol without RTS/CTS frames preceding data frames. The IEEE 802.11 standard discourages the use of RTS/CTS frames when the data frame length is less than the RTS threshold (2347 recommended). Garg and Kappes [13] showed that, for VoIP traffic with 160 byte frames, the data frame efficiency drops to about 12% in 802.11b [14] networks at 11 Mbps when RTS/CTS control frames are used. Thus, RTS/CTS frames are usually not used.

3.2 Fragmentation in IEEE 802.11
Fragmentation is usually performed for packets that are too large or unfit for current channel conditions. In a lossy environment, a 1500 byte frame may yield poor performance, and its fragmentation into optimally sized frames may dramatically improve performance. As illustrated in Figure 1, a sender divides a frame $F$ into two fragments $Frag_1$ and $Frag_2$. Fragment $Frag_1$ bears a network allocation vector (NAV) $NAV_1$, which reflects the time duration that the sender requires to complete the transmission of the entire frame $F$ (until the reception of $ACK_2$). All stations that hear fragment $Frag_1$ remain silent for duration $NAV_1$. The receiver of $Frag_1$ acknowledges it with an acknowledgment $ACK_1$ that bears a new NAV that also covers the complete transmission of the entire frame $F$. Therefore, in a collision dominated environment, if the first fragment is successfully acknowledged, the channel is cleared for all remaining fragments of the same frame. The probability of collision for the whole frame is reduced to the probability of collision of its first fragment.

Fig. 1. Transmission of fragments in IEEE 802.11
Fig. 2. Throughput under strict channel degradation (throughput in KFrames/s versus transmission distance in m; curves: ARF, CARA, LDRA, CONSTANT)
Fig. 3. Throughput under pure collision (throughput in KFrames/s versus the number of client stations; curves: ARF, LDRA, CARA, CONSTANT)
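The NAV reservation described in Section 3.2 can be illustrated with a short calculation for the two-fragment case. This is a sketch under stated assumptions: the function names are hypothetical, all durations are passed in as parameters, and the numbers in the usage example are placeholders rather than values derived from the standard or from the paper.

```python
def nav_in_first_fragment(sifs_us: float, ack_us: float, frag2_us: float) -> float:
    """Reservation carried by Frag1: from the end of Frag1 to the end of ACK2.

    Sequence covered: SIFS, ACK1, SIFS, Frag2, SIFS, ACK2.
    """
    return 3 * sifs_us + 2 * ack_us + frag2_us


def nav_in_first_ack(sifs_us: float, ack_us: float, frag2_us: float) -> float:
    """Reservation carried by ACK1: from the end of ACK1 to the end of ACK2.

    Sequence covered: SIFS, Frag2, SIFS, ACK2.
    """
    return 2 * sifs_us + frag2_us + ack_us


if __name__ == "__main__":
    # Placeholder durations in microseconds, chosen only for illustration.
    sifs, ack, frag2 = 10.0, 44.0, 300.0
    print(nav_in_first_fragment(sifs, ack, frag2))  # 418.0
    print(nav_in_first_ack(sifs, ack, frag2))       # 364.0
```

Both reservations end at the same instant (the end of $ACK_2$), so a station that hears either $Frag_1$ or $ACK_1$ stays silent until the whole frame has been delivered.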
4 Rationale
Despite the fact that SNR is not a good indicator [10, 11], LDRA yields a significant throughput improvement when there are no collision losses and no hidden terminal problem. The strength of LDRA is due to its retransmission strategy, which provides prompt recovery. Figure 2 illustrates this strength by plotting the throughput between two static stations communicating at different distances. The Ricean model was used to create a variable channel quality. In the presence of channel degradation without collisions, Figure 2 also illustrates, as expected, that 1) rate adaptation is beneficial (the scheme that maintains a constant rate (CONSTANT) yields a poor throughput, close to zero, because its transmissions at the highest rate fail almost all the time), and 2) a scheme without loss differentiation (ARF) performs as well as CARA (which is based on loss differentiation). On the contrary, the CONSTANT rate scheme yields the best throughput in a congested network, as shown in Figure 3. These results suggest that decreasing the rate in response to collisions hurts performance. The good performance of CARA in this figure illustrates another fact: a rate adaptation scheme should not only avoid decreasing its rate in response to collisions, but should also proactively diffuse congestion: CARA uses RTS frames to avoid collisions.
Based on the above observations, a good rate adaptation scheme should 1) accurately distinguish losses due to channel degradation from those due to collisions, 2) react appropriately (decrease the rate only in response to channel degradation), 3) develop an efficient strategy for recovery from losses, and 4) diffuse congestion when collisions occur. So far, RTS/CTS frames have been used [7, 8] to probe the channel and/or alleviate congestion to yield efficient rate adaptation schemes.
However, since RTS/CTS control frames may represent a significant overhead and are usually not used, it is particularly desirable, though challenging, to develop such a rate adaptation scheme without using RTS/CTS frames. To address this challenge, we use fragmentation to probe the channel, to diagnose the cause of a loss, and to develop an efficient retransmission strategy for loss recovery.
5 ERA: An Effective Rate Adaptation
When a frame is lost, ERA deploys effective strategies to diagnose the channel, to adjust the rate, and eventually to recover from the loss. We outline the key elements of ERA, while the precise scheme is described by the flowchart in Figure 4.

5.1 ERA Frame Loss Diagnosis
Let $r_{fail}$ be the rate used for the current lost packet and $r_{previous}$ be the rate of the latest successful transmission before the current failed transmission. We have to consider two cases:
[Flowchart. Node labels recoverable from the figure: Initial rate = intermediate rate (24 Mbps in 802.11g), Succ = 0; Transmit; Frame loss?; Update number of consecutive successes (Succ++); Succ >= Threshold?; Increase rate; Was rate just increased?; Decrease rate; Double Threshold; Fragment lost frame; Transmit short fragment at failure rate; Success?; Differentiated as from collision; Was the retransmission at basic rate?; Transmit the short fragment at basic rate; Success?; Is SNR of ACK different enough from prior one for lower rate; Differentiated as from fading; Decrease rate; Threshold ← 8; Succ = 0.]
Fig. 4. ERA flowchart
– $r_{fail} > r_{previous}$: This means that we just increased the rate to $r_{fail}$ after a successful transmission at rate $r_{previous}$. It seems that the channel does not support the new higher rate $r_{fail}$. We conclude that the loss is likely due to channel degradation at rate $r_{fail}$: the wireless channel condition can only support $r_{previous}$.
– $r_{fail} = r_{previous}$: Since previous transmissions succeeded at the same rate $r_{fail}$, the loss may be due to channel degradation as well as to a collision. We assume that the channel is fairly stable during the short recovery process, which is reasonable because the time needed to recover from a frame loss is short compared to the time scale of channel fading. In order to differentiate, we fragment the lost frame into two fragments: a very short frame and the remainder fragment. The first (short) frame is retransmitted at the same rate $r_{fail}$. If the transmission is successful for this short frame, then the loss was most likely due to a collision. This inference is reasonable for two reasons: 1) the probability of delivery is higher for a smaller frame in a collision prone environment, and 2) the probability of delivery is not significantly higher for a smaller frame in a degraded channel. Otherwise, if the transmission of the short frame is not successful, then the loss is likely due to channel degradation, and the further procedure shown in the flowchart in Figure 4 refines this diagnosis.
A key feature of ERA consists of using the rate $r_{fail}$ to retransmit the short frame rather than using the basic (lowest) rate immediately after the first failure. This feature refines the loss diagnosis in contrast with other schemes, which send short frames at the basic rate immediately after the first failure: if such a transmission is successful, it is unclear whether it succeeded due to the use of a lower (basic) rate (channel degradation) or due to the use of a short frame (collisions).

5.2 ERA Rate Adjustment
As shown in the flowchart in Figure 4, ERA decreases the rate when a transmission fails immediately after a rate increase. Another instance of rate decrease occurs when ERA successfully retransmits the short frame at the basic rate following two consecutive transmission failures at rate $r_{fail}$. ERA uses the same strategy as AARF to increase the rate: the rate is increased after Threshold consecutive successful transmissions. The value Threshold is the number of consecutive successful transmissions required to increase the rate. Threshold should not be too high, to allow a sender to quickly reach the highest available rate, and should not be too small, to avoid rate oscillations and frequent losses due to overly optimistic increases. The value Threshold is dynamically adjusted similarly to AARF: for each frame loss due to channel degradation, ERA doubles the value Threshold up to 64.

5.3 ERA Frame Loss Recovery
The strategy that ERA adopts for frame loss recovery is efficient because it addresses both channel degradation and collisions. To tackle collisions, the fragmentation into a very short frame followed by the remaining fragment increases the probability of delivery.
To tackle channel degradation, ERA uses our LDRA strategy: it uses the basic rate when the short fragment fails at rate $r_{fail}$, a failure that is likely due to channel degradation.
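The diagnosis and adjustment steps of Sections 5.1 and 5.2 can be sketched as follows. This is an illustrative reading, not the authors' implementation. All names are hypothetical. The probes are passed in as callables so that a probe is only "sent" when the decision needs it. Two points are assumptions of this sketch: the text does not spell out the outcome when the basic-rate probe also fails, and the SNR test of Figure 4 is reduced to a single boolean; both are resolved here as one possible reading of the flowchart.

```python
from enum import Enum
from typing import Callable, Tuple


class Cause(Enum):
    COLLISION = "collision"
    FADING = "channel degradation"


def diagnose_loss(rate_just_increased: bool,
                  probe_at_fail_rate: Callable[[], bool],
                  probe_at_basic_rate: Callable[[], bool],
                  snr_dropped: Callable[[], bool]) -> Cause:
    """Classify a frame loss using short-fragment probes."""
    if rate_just_increased:
        # r_fail > r_previous: the new, higher rate is not supported.
        return Cause.FADING
    if probe_at_fail_rate():
        # The short fragment got through at r_fail: likely a collision.
        return Cause.COLLISION
    # Short fragment failed at r_fail: retry it at the basic rate.
    if probe_at_basic_rate() and snr_dropped():
        return Cause.FADING
    # Assumption: every remaining case is treated as a collision.
    return Cause.COLLISION


def adjust(cause: Cause, rate_index: int, threshold: int,
           max_threshold: int = 64) -> Tuple[int, int]:
    """Return (new rate index, new Threshold) after a diagnosed loss."""
    if cause is Cause.FADING:
        # Decrease the rate and double Threshold, capped at 64.
        return max(rate_index - 1, 0), min(2 * threshold, max_threshold)
    # Collisions leave the rate unchanged.
    return rate_index, threshold
```

For example, `diagnose_loss(False, lambda: True, lambda: True, lambda: False)` returns `Cause.COLLISION` without ever calling the basic-rate probe, mirroring the point that ERA first retransmits the short fragment at $r_{fail}$.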
6 Performance Evaluation
The performance of ARF, CARA, and ERA is evaluated and compared under different scenarios: a strictly congested network (no losses due to channel degradation), strict channel degradation (collision free), and the more usual case where mixed losses may occur due to channel degradation as well as collisions. CARA was chosen for comparison as a representative of the schemes, such as RRAA, that use RTS/CTS control frames to address collisions in rate adaptation. Performance evaluation is conducted through extensive simulations using the ns-2 network simulator. IEEE 802.11g [15] is used as the MAC layer because it has more predefined rates. For the physical layer, we use the parameters of the Cisco Aironet 802.11a/b/g CardBus wireless LAN adapter [16] and assume symmetric links. Table 1 summarizes the network configurations and parameters used in these simulations, and Table 2 describes the four main network topologies and the number of stations (including the access point) in each one.

Table 1. Simulation parameters

Simulator                   ns-2 v.30
MAC                         CSMA/CA DCF basic access
MAC data rates              6, 9, 12, 18, 24, 36, 48, 54 Mbps
Channel fading              Two-ray ground or Ricean
Ricean fading factor (K)    4
Traffic                     UDP Constant Bit Rates (CBR)
Frame length                100, 500, 1000, 1500 Bytes
The short fragment length   Same as RTS

Table 2. Simulation topologies

Topology  Number of Stations  Motion Model
1         2                   static
2         2                   random movement
3         17                  static
4         17                  random movement

Figure 5 shows the static topology with 16 stations and an access point. The evaluation is limited to one hop communication: all stations are within the radio coverage of the same access point. Note that these client stations are not necessarily within the radio coverage of each other. The traffic is always from the stations to the access point. Each result figure plots the data averaged over multiple runs with random starting times for the different evaluation environments. Each simulation run consists of 2 minutes of saturated UDP traffic. We evaluate ERA in three environments: channel degradation dominated, collision dominated, and mixed (frame losses due to collisions as well as channel degradation). First, we present the experiments made in a mixed environment.

6.1 Mixed Environments: Collisions and Channel Degradation
First, we evaluate the presumed strength of ERA, i.e., its effectiveness in promptly recovering from losses. Then we measure the throughput (in Kframes/s) for the static network in Figure 5 and for a network of mobile stations.
ERA Retransmission/Recovery Effectiveness. We measure the effectiveness by collecting the fraction of frame losses recovered after only ONE retransmission. For example, if there are 100 losses and forty of them were recovered after only one retransmission, then the fraction is 0.40. The experiments are conducted over a static network with 16 stations and one access point located at the center, as shown in Figure 5. We use topology #3 (see Table 2). We adopt the Ricean fading model to induce dynamic channel conditions. By varying the size of the network layout area, we obtain a range of loss ratios due to collisions: the collision ratio is the number of frame losses due to collisions over the total number of frame losses.
If the collision ratio is zero, the environment is collision free (channel degradation dominated), which is emulated with only one client station. A strictly collision dominated environment has a collision ratio of one, which happens when all the stations are so close to the access point that there is no transmission failure due to channel fading, even with the Ricean model. Figure 6 plots the fraction of losses recovered after only one retransmission on the y-axis, with the collision ratio on the x-axis. Under most conditions, ERA succeeds in recovering 80% of the losses with only one retransmission of the whole lost frame and outperforms CARA under all circumstances. In a channel degradation dominated environment (collision ratio close to zero), CARA does not recover quickly because it often treats the loss as a collision.
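The recovery metrics used in this subsection (the fraction of losses recovered after one retransmission, the mean number of retransmissions, and the collision ratio) are simple ratios. The sketch below shows one way to compute them from per-loss counts; the function names and the sample data are hypothetical and only reproduce the 100-loss example from the text.

```python
from typing import Sequence


def one_retransmission_fraction(retransmissions: Sequence[int]) -> float:
    """Fraction of losses recovered after exactly one retransmission."""
    if not retransmissions:
        raise ValueError("no losses recorded")
    return sum(1 for n in retransmissions if n == 1) / len(retransmissions)


def mean_retransmissions(retransmissions: Sequence[int]) -> float:
    """Average number of retransmissions needed per lost frame."""
    if not retransmissions:
        raise ValueError("no losses recorded")
    return sum(retransmissions) / len(retransmissions)


def collision_ratio(collision_losses: int, total_losses: int) -> float:
    """Losses due to collisions over the total number of frame losses."""
    if total_losses <= 0:
        raise ValueError("total_losses must be positive")
    return collision_losses / total_losses


if __name__ == "__main__":
    # Made-up sample: 100 losses, 40 recovered after one retransmission,
    # the other 60 after two.
    losses = [1] * 40 + [2] * 60
    print(one_retransmission_fraction(losses))  # 0.4
    print(mean_retransmissions(losses))         # 1.6
```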
Fig. 5. Topology of 17 static stations
Fig. 6. Fraction of recovered losses with ONE retransmission (fraction of retransmission at once versus collision ratio; curves: ERA, CARA)
Fig. 7. The mean of retransmission (mean of retransmission versus collision ratio; curves: ERA, CARA)
For the same experiment, we also compute the mean number of retransmissions, plotted on the y-axis of Figure 7. The strength of ERA stems from its refined mechanism to diagnose the cause of a frame loss and adopt the best strategy to recover. For CARA, in contrast, the success of RTS/CTS at the basic rate cannot indicate whether the loss was due to channel fading or to a collision.
Throughput for Static Network. To evaluate ARF, CARA, and ERA under realistic channel conditions with both channel degradation and collision, we use the Ricean fading model on a static network of 16 stations (see Figure 5) and vary the area size to adjust the congestion level. We use topology #3 (see Table 2). CBR traffic with 1000 byte packets is sent from the 16 stations to the access point. Figure 8 plots on the y-axis the throughput achieved by ARF, CARA, and ERA, respectively. As we observe, ERA performs better in heavily congested environments (in smaller areas). This is because ERA implicitly assumes at first that the loss is due to a collision (by transmitting the short fragment at the current rate). Consequently, it favors loss differentiation in collision dominated environments.
Throughput for Mobile Nodes. The throughputs of ARF, CARA, and ERA are evaluated on a mobile network of 17 stations: all 16 client stations move with random velocities to random destinations, with a short pause between two movements, in a 350 m × 350 m area. The access point, located at the center of the network, is static. We use topology #4 (see Table 2). The Ricean fading model is used to simulate channel degradation. The table in Figure 10 shows the parameters used in these simulations. All stations keep transmitting UDP CBR traffic to the access point. The first experiment evaluates the throughputs of all schemes with CBR traffic of different packet lengths: 500 bytes, 1000 bytes, and 1500 bytes. Figure 9 plots the throughputs of ARF, CARA, and ERA for the different packet lengths.
The horizontal axis represents the packet length used for each run. With longer frames, there are more collisions; therefore, ERA has more opportunities to deploy its strength.

Fig. 8. Throughput under both channel degradation and collision (throughput in KFrames/s versus length of the rectangular area in m; curves: ERA, CARA, ARF)
Fig. 9. Throughput of mobile stations (throughput in KFrames/s versus frame length in bytes; curves: ERA, CARA, ARF)
Fig. 10. Network parameters for mobile network simulations

Parameter                 Value
Network Area              350 m × 350 m
Location of Access Point  (175 m, 175 m)
Fading                    Ricean
Velocity                  2 m/s to 30 m/s
Pause Time                1 second
Client Stations           16

We also collect the adaptation transient dynamics of ARF, CARA, and ERA for the mobile stations. Figure 11 displays three plots, for ARF, CARA, and ERA. Each plot corresponds to a two-second period of dynamics for a randomly chosen mobile client station. The x-axis is the time. The y-axis shows the data rate in Mbps. Each plot presents two curves: the first curve, labeled "Channel Rate" (symbol ×), is the rate that the channel can support at that instant (assuming perfect knowledge), and the second curve (symbol +) is the rate selected by ARF, CARA, or ERA. We observe that in such conditions (a mixed environment with both collisions and channel degradation), 1) the volatile nature of the channel makes it difficult for any rate adaptation scheme to select the optimal supported rate, 2) the plot for ERA has more points (which means more frame transmissions) than the two others (ARF and CARA) because ERA results in fewer congestion losses, and 3) ERA selects a rate that is closer to the optimal achievable rate. CARA outperforms ARF mainly due to its ability to reduce collisions with RTS frames.
In the above simulations on different network topologies and configurations, ERA consistently exhibits a substantial performance improvement. These simulation results illustrate ERA's effectiveness in mixed environments (channel degradation and collisions). In order to understand ERA better, we evaluated it separately in an environment strictly dominated by channel degradation and then in an environment strictly dominated by collisions.
Fig. 11. Adaptation dynamics (three panels: ARF, CARA, and ERA; each plots the data rate in Mbps versus time in s, together with the "Channel Rate" curve)
6.2 Channel Degradation Dominated Environment
For these experiments, we use topology #1 (see Table 2), with one static station at a constant distance from an access point. First, we evaluate the throughput of ARF, CARA, and ERA in stable channel environments. Second, we measure the throughput of these three schemes in an environment with dynamic channel conditions simulated with the more general Ricean fading model. Finally, we measure the impact of packet length on the throughput of the three schemes for a station at 300 m from an access point. Due to space limitations, we only present the results gathered in the stable environment and leave the results of the other two scenarios to our technical report [17].
Stable Channel Conditions. In general, a user sits at a desk and has a long work session during which a given constant rate will most often be supported. We approximate such an environment with only one fixed station and one access point. The two-ray ground fading model is used to ensure that a given constant rate is supported: the supported rate under a two-ray ground fading model depends only on distance. Figure 12 plots on the y-axis the throughput achieved by ARF, CARA, and ERA, respectively, when the client is at a constant distance (x-axis) from the access point. ERA outperforms the other schemes due to 1) the effectiveness of its recovery strategy and 2) the mechanism inherited from AARF: adjusting the threshold number (Threshold) of consecutive successful transmissions required to increase the rate.
[Figure 12: throughput (KFrames/s) of ERA, ARF, and CARA versus transmission distance (m, 120 to 480).]
Fig. 12. Throughput under stable channel conditions
[Figure 13: throughput (KFrames/s) of ERA, CARA, and ARF versus the length of the rectangle (m, 80 to 560).]
Fig. 13. Throughput under different levels of collision
[Figure 14: throughput improvement (%) versus the length of the rectangle (m, 80 to 560) for packets of 100, 500, 1000, and 1500 bytes.]
Fig. 14. Impact of packet length on throughput improvement
6.3 Collisions Dominated Environment
For this environment, all simulations are carried out in a network with 17 static stations corresponding to topology #3 (see Table 2): one access point and 16 client stations. The objective is to evaluate the performance of ERA and compare it with that of ARF and CARA in a collision dominated environment. We use the two-ray ground fading model so that the best supported rate is constant (stable channel conditions): with this model, the quality of the channel is determined only by the distance. All 16 client stations are uniformly placed in a rectangular area as
S. Biaz and S. Wu
shown in Figure 5. The congestion level in the network is varied by adjusting the size of the rectangle. All stations keep sending 10 Mbps CBR traffic to the access point at the center of the rectangular area. We measure the throughput and evaluate the impact of the packet length on the performance.
Impact of the Congestion Level. Figure 13 plots on the y-axis the throughputs of ARF, CARA, and ERA, respectively. The x-axis is the length of the rectangle. The packet size is set to 1000 bytes. ERA again performs best, followed by CARA and then ARF. In particular, as collisions increase (smaller rectangular area), ERA achieves a larger improvement thanks to its more accurate loss differentiation and its judicious recovery strategy.
Impact of Packet Length on ERA. For each simulation run, each station sends CBR traffic with packets of length 100 bytes, 500 bytes, 1000 bytes, or 1500 bytes. All stations send packets of the same size. Figure 14 plots the percentage throughput improvement ($100 \times \frac{ERA - CARA}{CARA}$). When the frame size increases, the improvement increases because collisions are more likely with longer frames. The improvement is more significant for smaller rectangles because there are more collisions and ERA has more opportunity to take advantage of its superiority.
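The improvement metric of Figure 14 is a plain relative difference. The following minimal sketch shows the arithmetic only; the function name and the throughput values are hypothetical and are not results from the paper.

```python
def throughput_improvement(era: float, cara: float) -> float:
    """Percentage improvement of ERA over CARA: 100 * (ERA - CARA) / CARA."""
    if cara <= 0:
        raise ValueError("CARA throughput must be positive")
    return 100.0 * (era - cara) / cara


# Hypothetical throughputs in KFrames/s, chosen only to exercise the formula.
print(throughput_improvement(0.30, 0.20))
```

Because the metric is normalized by the CARA throughput, the same absolute gain yields a larger percentage when the baseline throughput is small.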
7 Conclusions and Future Work
We propose a new effective rate adaptation scheme that diagnoses the cause of a loss quite accurately, recovers from it appropriately, and judiciously adjusts the rate without requiring RTS/CTS control frames. Extensive simulations with ns-2 illustrate that ERA performs well in environments dominated by channel degradation and/or collisions. ERA outperforms ARF (a first generation scheme) and CARA, a recent second generation scheme that uses RTS control frames. The strength of ERA is based on 1) the fragmentation of a lost frame into a very short frame followed by the remainder, 2) a refined technique to diagnose the cause of a frame loss, and 3) an adaptive threshold to increase the rate, inherited from AARF. The fragmentation of the lost frame yields a faster recovery in environments dominated by channel degradation and/or collisions. The overhead of fragmentation is well below that of using RTS/CTS control frames. We are currently implementing and evaluating ERA on Linux laptops. The performance of the ERA implementation will be compared with that of the most recent rate adaptation schemes.
Acknowledgement
This work is funded by the National Science Foundation through grants NeTS CNS#0435320 and CRCD/EI CNS#0417565, and by the Vodafone Foundation through a fellowship awarded to Shaoen Wu.
References
1. Kamerman, A., Monteban, L.: WaveLAN II: A high-performance wireless LAN for the unlicensed band. Bell Labs Technical Journal, 118–133 (1997)
2. Lacage, M., Manshaei, M., Turletti, T.: IEEE 802.11 Rate Adaptation: A Practical Approach. In: MSWiM 2004, pp. 126–134 (2004)
3. Qiao, D., Choi, S., Shin, K.: Goodput Analysis and Link Adaptation for IEEE 802.11a Wireless LANs. IEEE Transactions on Mobile Computing 1, 278–292 (2002)
4. Pavon, J., Choi, S.: Link Adaptation Strategy for IEEE 802.11 WLAN via Received Signal Strength Measurement. In: ICC, pp. 1108–1123 (2003)
5. Holland, G., Vaidya, N., Bahl, P.: A rate-adaptive MAC protocol for multi-hop wireless networks. In: ACM MOBICOM 2001 (2001)
6. Sadeghi, B., Kanodia, V., Sabharwal, A., Knightly, E.: Opportunistic Media Access for Multirate Ad Hoc Networks. In: MOBICOM 2002, pp. 24–35 (2002)
7. Kim, J., Kim, S., Choi, S., Qiao, D.: CARA: Collision-Aware Rate Adaptation for IEEE 802.11 WLANs. In: IEEE INFOCOM 2006, Barcelona, Spain (2006)
8. Wong, S., Yang, H., Lu, S., Bharghavan, V.: Robust Rate Adaptation for 802.11 Wireless Networks. In: MobiCom 2006, Los Angeles, California, USA, pp. 146–157 (2006)
9. Wu, S., Biaz, S.: Loss Differentiated Rate Adaptation in Wireless Networks. Technical Report CSSE07-02, Auburn University (2007)
10. Bicket, J.C.: Bit-rate Selection in Wireless Networks. Master’s thesis, Massachusetts Institute of Technology (2005)
11. Aguayo, D., Bicket, J., Biswas, S., Judd, G., Morris, R.: Link-Level Measurements from an 802.11b Mesh Network. In: ACM SIGCOMM (2004)
12. IEEE802.11 (1999), http://standards.ieee.org/getieee802/download/802.11-1999.pdf
13. Garg, S., Kappes, M.: An Experimental Study of Throughput for UDP and VoIP Traffic in IEEE 802.11b Networks. In: WCNC 2003, vol. 3, pp. 1748–1753 (2003)
14. IEEE802.11b (1999), http://standards.ieee.org/getieee802/download/802.11b-1999.pdf
15. IEEE802.11: http://standards.ieee.org/getieee802/download/802.11g-2003.pdf
16. Cisco Systems: http://www.cisco.com/en/US/products/hw/wireless/ps4555/products_data_sheet09186a00801ebc29.html
17. Wu, S., Biaz, S.: ERA: Efficient Rate Adaption Algorithm with Fragmentation. Technical Report CSSE07-04, Auburn University (2007)
On the Probability of Finding Non-interfering Paths in Wireless Multihop Networks
S. Waharte and R. Boutaba
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
{spwaharte,rboutaba}@uwaterloo.ca
Abstract. Multipath routing can improve system performance of capacity-limited wireless networks through load balancing. However, even with a single source and destination, intra-flow and inter-flow interference can void any performance improvement. In this paper, we show that establishing non-interfering paths can, in theory, alleviate this issue. In practice, however, finding non-interfering paths can be quite complex. In fact, we demonstrate that the problem of finding two non-interfering paths for a single source-destination pair is NP-complete. Therefore, an interesting problem is to determine whether, given a network topology, non-interfering multipath routing is appropriate. To address this issue, we provide an analytic approximation of the probability of finding two non-interfering paths. The correctness of the analysis is verified by simulations.
1 Introduction
The landscape of network communications has evolved over the last years with the emergence and deployment of protocols enabling more flexible and reliable means to communicate wirelessly over single or multiple hops. Deploying wireless backbone networks (also referred to as Wireless Mesh Networks) is becoming increasingly popular in view of the reduced initial investment cost and flexibility of deployment. Wireless sensor networks have also received great attention from the research community, which can be explained by their application potential in various areas such as military applications, environment monitoring, etc. [1].
However, dealing with interference in a shared transmission environment remains a challenging task. Environmental noise can be partly responsible for the quality degradation of data transmissions. In addition, the necessity to share the transmission channel among wireless nodes within the same vicinity significantly contributes to decreasing the nominal capacity available to wireless nodes. It therefore becomes crucial to develop appropriate mechanisms to alleviate the effect of these limitations on the system performance.
Multipath routing has been put forward as a solution to mitigate the capacity limitations in wireless networks. Spreading traffic flows over multiple paths has been demonstrated to achieve better load balancing than single path routing, potentially leading to an increase in the nominal achievable throughput [3] [5]. But the resulting performance gain greatly depends on the choice of the routing paths. If the paths are at a distance such that the traffic on one path interferes with the traffic on the other path, the overall throughput improvement becomes negligible [7]. It is also important to factor the cost of path establishment into the choice of the routing approach. As the control overhead increases with the number of paths [6], it can potentially void the benefit of multipath routing. Therefore, estimating the probability of finding non-interfering paths given a network topology can help decide whether, for a particular network, multipath routing is an appropriate approach and whether it can improve the system performance at all.
To support our analysis, we first show that routing over non-interfering paths can lead to a better network utilization. We then study the complexity of finding two non-interfering paths and show that the problem is actually NP-complete. Therefore, in order to reduce the cost of establishing multiple paths, we provide an analytic estimate of the probability of finding two non-interfering paths as a function of the network density. This result can subsequently be used to evaluate the suitability of implementing a non-interfering multipath routing algorithm for a given network topology and to adjust the routing strategy accordingly. We validate the correctness of our analysis through simulations.
The remainder of this paper is organized as follows. We describe our analysis of the complexity of 2-path routing in Section 2. The analytical derivation of the probability of finding two non-interfering paths is presented in Section 3. Section 4 summarizes our contributions and concludes this paper.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 914–921, 2008. © IFIP International Federation for Information Processing 2008
2 Complexity Analysis of Disjoint Path Routing with Interference Constraints
The problem we are studying consists in finding multiple non-interfering paths between a source-destination pair. To analyze the complexity of the problem, we restrict our analysis to the case in which only two paths are set up between a source and a destination node. We refer to this problem as 2-path routing. We prove the NP-completeness of this problem; that is, a polynomial-time solution for this problem is unlikely to exist.
Definition 1. Given a directed graph $G(V, E)$ and two nodes $s, t \in V$, 2-path routing consists in finding two paths $P_1$ and $P_2$ between $s$ and $t$ such that: 1) all the nodes in $P_1$ and all the nodes in $P_2$ form a connected graph; and 2) there exists no edge between a node in $P_1$ and a node in $P_2$.
Theorem 1. 2-path routing is NP-complete.
Proof. Omitted due to space restrictions.
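Although finding such a pair of paths is hard, checking a candidate pair against condition 2 of Definition 1 is cheap. The sketch below illustrates that check; the adjacency-set representation, the function name, and the decision to exclude the shared endpoints $s$ and $t$ from the test are assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # adjacency sets of a directed graph


def are_non_interfering(graph: Graph, p1: List[str], p2: List[str]) -> bool:
    """Check condition 2 of Definition 1 for two s-t paths.

    Assumption: the shared endpoints s and t are excluded from the check,
    since both paths necessarily contain them.
    """
    if p1[0] != p2[0] or p1[-1] != p2[-1]:
        raise ValueError("both paths must share the same source and destination")
    inner1, inner2 = set(p1[1:-1]), set(p2[1:-1])
    if inner1 & inner2:
        return False  # a shared intermediate node means the paths are not disjoint
    for u in inner1:
        for v in inner2:
            # An edge in either direction between the two paths is interference.
            if v in graph.get(u, set()) or u in graph.get(v, set()):
                return False
    return True


# Hypothetical six-node topology with two candidate paths from s to t.
graph = {"s": {"a", "c"}, "a": {"b"}, "b": {"t"}, "c": {"d"}, "d": {"t"}, "t": set()}
path1, path2 = ["s", "a", "b", "t"], ["s", "c", "d", "t"]
print(are_non_interfering(graph, path1, path2))
```

The check runs in time proportional to the product of the two path lengths, which is consistent with the problem being in NP: the difficulty lies in searching for the pair, not in verifying it.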
3 Probability of Finding Two Non-interfering Paths
3.1 General Methodology
Given a source node and a destination node, two paths minimizing the interference should consist of a set of nodes such that the nodes on one path do not
interfere with the nodes on the other path, except for the first-hop nodes and last-hop nodes. We assume that node locations are known. To avoid choosing multiple paths with significantly different lengths (and consequently the burden of handling traffic flows with different end-to-end delays), we define a guard band between the source and the destination and set up the paths outside this band but as close as possible to it (within a distance $\epsilon$ of the guard band). The paths are therefore guaranteed not to interfere. $\epsilon$ is a tunable parameter that can be adjusted depending on the network density. More details on the tuning of this parameter are given at the end of this section. We therefore need to compute:
1. The probability $P_1$ of finding two non-interfering nodes at the first hop.
2. The probability $P_2$ that two paths exist after the first hop and before the last hop, with the constraint that the nodes along one path do not interfere with the nodes along the other.
3.2 Computation of $P_1$
The first condition to satisfy is to find two non-interfering nodes in the transmission area of the source. We assume that for each path the next-hop nodes are located in the half-plane oriented towards the destination (i.e., no backward transmission). For the first hop, this constraint, added to the physical transmission range limitation, restricts the feasible geographic location of the first-hop nodes to half a disk. We therefore need to compute:
1. the probability $P_{11}(k)$ of having $k$ nodes in half a disk;
2. the probability $P_{12}(k)$ that at least two of these $k$ nodes are separated by a distance greater than $R$ (the transmission range).
$P_1$ can then be determined by:
$$P_1 = \sum_{k=1}^{\infty} P_{11}(k) P_{12}(k)$$
Computation of $P_{11}(k)$. We assume that the nodes are uniformly distributed over an area $A$ with a density $\rho$. They have a transmission range $R$ and their positions are independent of each other. For a large number of nodes, the probability that $k$ nodes are located in a given area $A$ can be approximated with a Poisson distribution (Eq. 1) [4].
$$P(\text{number of nodes} = k) = \frac{(\rho A)^k}{k!} e^{-\rho A} \qquad (1)$$
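As a quick numerical illustration of Eq. (1), the sketch below evaluates the Poisson probability for the half-disk of area $\pi R^2 / 2$ described above. The function names and the density and range values are hypothetical, chosen only to exercise the formula.

```python
import math


def prob_k_nodes(k: int, density: float, area: float) -> float:
    """Poisson probability of Eq. (1): P(k nodes in a region of the given area)."""
    mean = density * area
    return mean ** k / math.factorial(k) * math.exp(-mean)


def half_disk_area(radius: float) -> float:
    """Area of the half-disk that can contain the first-hop nodes."""
    return math.pi * radius ** 2 / 2.0


# Hypothetical values: density in nodes per square metre, range in metres.
rho, R = 0.01, 10.0
area = half_disk_area(R)
# Two candidate first-hop nodes are needed, so k = 0 and k = 1 are excluded.
p_at_least_two = 1.0 - prob_k_nodes(0, rho, area) - prob_k_nodes(1, rho, area)
print(p_at_least_two)
```

The complement form avoids summing the infinite series directly when only the probability of having at least two candidate nodes is of interest.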
Computation of $P_{12}(k)$. Let $A$ be the source node and $B$ a node randomly located in the half-disk centered at $A$, with respective polar coordinates $(0, 0)$ and $(r, \theta)$. We refer to the extreme points of the diameter of the half-disk as $A_1$ and $A_2$. In order not to interfere with $B$, a node must satisfy the following conditions:
1. it is within transmission range of $A$;
2. it is not within transmission range of $B$.
To find the probability that at least two nodes randomly located in half a disk do not interfere, we adopt the following method. We choose a point $B$ in the transmission area of node $A$ and determine the probability that there exists at least one node that does not interfere with $B$ (and is therefore located in one of the dashed areas shown in Fig. 1), for all possible positions of $B$. Depending on the position of $B$, two scenarios can occur:
– Case 1: $B$ is located at a distance less than $R$ from either $A_1$ or $A_2$ (Fig. 1 (a)). A single solution area exists. This corresponds to the case where $R \cos(\theta) \geq \frac{r}{2}$.
– Case 2: $B$ is located at a distance greater than $R$ from both $A_1$ and $A_2$ (Fig. 1 (b)). Two solution areas exist. This corresponds to the case where $R \cos(\theta) \leq \frac{r}{2}$.
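The two cases can be told apart either by measuring the distance from $B$ to the endpoints of the diameter or by the closed-form test on $R\cos(\theta)$. The sketch below shows both; placing $A$ at the origin with the endpoints at $(-R, 0)$ and $(R, 0)$, and restricting the closed form to $\theta \in [0, \pi/2]$ (the other quarter following by symmetry), are assumptions made for illustration.

```python
import math


def solution_case(r: float, theta: float, R: float) -> int:
    """Return 1 if B = (r, theta) is within R of an endpoint of the diameter, else 2."""
    bx, by = r * math.cos(theta), r * math.sin(theta)
    d_left = math.hypot(bx + R, by)   # distance to the endpoint at (-R, 0)
    d_right = math.hypot(bx - R, by)  # distance to the endpoint at (+R, 0)
    return 1 if min(d_left, d_right) <= R else 2


def solution_case_closed_form(r: float, theta: float, R: float) -> int:
    """Same classification via R*cos(theta) >= r/2, for theta in [0, pi/2]."""
    return 1 if R * math.cos(theta) >= r / 2.0 else 2


# Hypothetical positions of B inside a half-disk of radius 1.
print(solution_case(0.5, math.pi / 3, 1.0), solution_case(0.9, 1.5, 1.0))
```

The closed form follows from the law of cosines: the squared distance from $B$ to the endpoint at angle 0 is $r^2 + R^2 - 2rR\cos(\theta)$, which is at most $R^2$ exactly when $R\cos(\theta) \geq r/2$.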
[Figure 1: (a) Case 1: one feasible solution area; (b) Case 2: two feasible solution areas. Both panels show node B inside the half-disk centered at A with diameter endpoints A1 and A2.]
Fig. 1. Computation of P12(k) (the non-interfering zones are dashed)
A general formulation of the probability of finding two nodes separated by at least $R$ can be expressed as follows.
Theorem 2. Let us assume that there are $N$ nodes within transmission distance of $A$. The probability $P$ that at least two of these $N$ nodes are at a distance greater than $R$ is:
$$P = 1 - \int_{r=0}^{R} \int_{\theta=0}^{\pi/2} \frac{2}{\pi R} \left(1 - \left(\frac{2 A_{inter}(r, \theta)}{\pi R^2}\right)^{N-1}\right) \partial r \, \partial\theta \qquad (2)$$
where $A_{inter}(r, \theta)$ is the interference area of a node with polar coordinates $(r, \theta)$ in the solution domain (half-disk). To determine the actual value of the interference area $A_{inter}(r, \theta)$, we need to break down the computation into the two cases previously described.
Case 1: One feasible region. Let $N$ be the number of nodes in the half-disk area, obtained by choosing the network density so that the probability of finding at least 2 nodes tends to 1. If we consider one node (Node $B$) among these $N$ nodes, the probability that at least one of the remaining $N - 1$ nodes is at a distance of at least $R$ from $B$ can be determined by computing the complement of the probability that all the nodes are at a distance less than $R$ from $B$.
Let us first calculate the intersection between the coverage area of $A$, restricted to the half-disk oriented towards the destination node, and the coverage area of $B$. The intersection between the disk centered at $A$ and the disk centered at $B$ forms a lens whose area is referred to as $A_{lens}$. This area can be straightforwardly computed geometrically as follows:
$$A_{lens} = 2R^2 \arccos\left(\frac{r}{2R}\right) - \frac{r}{2}\sqrt{4R^2 - r^2} \qquad (3)$$
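Eq. (3) is the standard area of the intersection of two disks of equal radius $R$ whose centres are a distance $r$ apart. A minimal sketch follows; the function name is hypothetical.

```python
import math


def lens_area(r: float, R: float) -> float:
    """Area of the intersection of two disks of radius R with centres r apart (Eq. 3)."""
    if not 0.0 <= r <= 2.0 * R:
        raise ValueError("requires 0 <= r <= 2R")
    return (2.0 * R ** 2 * math.acos(r / (2.0 * R))
            - (r / 2.0) * math.sqrt(4.0 * R ** 2 - r ** 2))


# Limiting cases: coincident centres give the full disk, tangent disks give zero.
print(lens_area(0.0, 1.0), lens_area(2.0, 1.0))
```

Since $B$ lies within transmission range of $A$, only values $0 \leq r \leq R$ arise in the analysis, well inside the domain on which the formula is defined.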
[Figure 2: (a) Case 1: one feasible region, showing points A, B, C, D, E and the angle θ; (b) Case 2: two feasible regions, showing A, B, A1, A2 and the intersection points x1 and x2.]
Fig. 2. Computation of the non-interfering zone (dashed)
We can also observe that since $A$ and $B$ are within transmission range of each other, these points are necessarily located in the lens whose area has been previously computed. In particular, we can establish the following relation:
$$S_{ACD} = S_{BCD} - S_{ABC}$$
The area formed by $BCD$ consists of a disk sector that can be directly computed:
$$S_{BCD} = \frac{\widehat{BCD} \cdot R^2}{2} \qquad (4)$$
To compute $\widehat{BCD}$, let us define $AC = x$. By construction, we have $\widehat{BAC} = \pi - \theta$, $AB = r$ and $BC = R$. Using the law of cosines, we determine $x$:
$$x = -r\cos(\theta) + \sqrt{R^2 - r^2 (\sin(\theta))^2}$$
$\widehat{BCD}$ can therefore be deduced using the same method. To obtain the area of $ABC$, we apply Heron’s formula:
$$s = \frac{x + r + R}{2}, \qquad S_{ABC} = \sqrt{s(s - r)(s - x)(s - R)} \qquad (5)$$
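The triangle computation above can be checked numerically. The sketch below derives $x$ from the law of cosines and compares Heron’s formula with the equivalent expression $\frac{1}{2}\, r\, x \sin(\pi - \theta)$; the function names and the sample values of $R$, $r$, and $\theta$ are hypothetical.

```python
import math


def side_ac(r: float, theta: float, R: float) -> float:
    """Length x = AC from the law of cosines, with angle BAC = pi - theta."""
    return -r * math.cos(theta) + math.sqrt(R ** 2 - (r * math.sin(theta)) ** 2)


def heron_area(a: float, b: float, c: float) -> float:
    """Triangle area from its three side lengths (Eq. 5)."""
    s = (a + b + c) / 2.0
    # Clamp at zero to absorb rounding error for nearly degenerate triangles.
    return math.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))


# Hypothetical geometry: unit transmission range, B at (r, theta).
R, r, theta = 1.0, 0.5, math.pi / 3
x = side_ac(r, theta, R)
area_heron = heron_area(x, r, R)
area_sine = 0.5 * r * x * math.sin(math.pi - theta)
print(area_heron, area_sine)
```

Both expressions describe the same triangle $ABC$, so agreement between them is a useful sanity check on the value of $x$ before it is used in the sector angle.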
By combining Eq. 4 and Eq. 5, we obtain:
$$S_{ACD} = \arccos\left(\frac{R^2 + r^2 - x^2}{2Rr}\right)\frac{R^2}{2} - \sqrt{s(s - r)(s - x)(s - R)} \qquad (6)$$
We can therefore compute the intersection area:
$$S_{inter}(r, \theta) = \frac{A_{lens}}{2} - S_{ACD} + \frac{\theta}{2} R^2 \qquad (7)$$
Finally, the probability that at least one of these $N - 1$ nodes does not fall in this area is:
$$P_{case1} = 1 - \left(\frac{S_{inter}(r, \theta)}{\frac{\pi R^2}{2}}\right)^{N-1} \qquad (8)$$
Case 2: Two feasible regions. In this case, node $B$ is at a distance of at least $R$ from both $A_1$ and $A_2$. Without loss of generality, let us assume that $B$ is located in the same quarter of the disk as $A_1$ (Fig. 2 (b)). The disk centered at $B$ cuts the x-axis at two points $x_1$ and $x_2$ such that $x_1 < x_2$. A solution zone exists in this area only if $|x_1| < R$. Let $x_1'$ be the intersection point with the smallest x-coordinate between the circle centered at $A$ and the circle centered at $B$. The solution area is therefore bounded by $A_1 x_1 x_1'$. By geometric considerations we can observe that:
$$S_{A_1 x_1 x_1'} = S_{A B x_1' A_1} - S_{A B x_1' x_1}$$
By calculating $S_{A B x_1' A_1}$ and $S_{A B x_1' x_1}$, we find the solution area. Due to space limitations, we omit the details of the computation, which follows steps similar to those of the previous case. The probability of finding two non-interfering nodes in this second case is:
$$P_{case2} = 1 - \left(\frac{\pi R^2}{2} - S_{A_1 x_1 x_1'} - S_{A_2 x_2 x_2'}\right)^{N-1} \qquad (9)$$
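The quantity analyzed in the two cases above, the probability that at least two of $N$ nodes in the half-disk are more than $R$ apart, can also be estimated by simulation. The following Monte Carlo sketch is not the authors' simulator: it assumes a fixed number of nodes placed uniformly over the area of the half-disk, and the function names, trial count, and seed are hypothetical.

```python
import math
import random


def sample_half_disk(R: float, rng: random.Random):
    """Uniform point (by area) in the upper half-disk of radius R centred at the origin."""
    radius = R * math.sqrt(rng.random())  # sqrt gives uniform density over the area
    angle = math.pi * rng.random()
    return radius * math.cos(angle), radius * math.sin(angle)


def prob_two_nodes_apart(n_nodes: int, R: float = 1.0,
                         trials: int = 20000, seed: int = 0) -> float:
    """Estimate P(at least two of n_nodes are more than R apart)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pts = [sample_half_disk(R, rng) for _ in range(n_nodes)]
        found = any(
            math.dist(pts[i], pts[j]) > R
            for i in range(n_nodes)
            for j in range(i + 1, n_nodes)
        )
        hits += found
    return hits / trials


for n in (2, 4, 8):
    print(n, prob_two_nodes_apart(n, trials=5000))
```

To compare against a density rather than a fixed node count, the estimate for each $k$ could be weighted by the Poisson probability of Eq. (1), mirroring the sum that defines $P_1$.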
Evaluation. To evaluate the accuracy of the upper bound of our analysis, we compared the results obtained by our derivation with those obtained through simulations, in which we computed the distance between pairs of nodes in a random distribution. We ran the experiments 1000 times for various network densities. The results of the simulations are depicted in Fig. 3. We can see that our analysis provides a close approximation of the probability of finding two non-interfering nodes when compared with simulations.
3.3 Computation of $P_2$
For the subsequent hops along each path, it is sufficient to determine the probability that each node has at least one neighbor towards the destination while respecting the interference constraint, that is to say that the chosen node should
[Figure 3: probability of finding two nodes at least R apart (0.65 to 1) versus network density (0.005 to 0.05), with theory and simulation curves.]
Fig. 3. Probability of finding two non-interfering nodes
not interfere with the nodes on the other path [2]. In addition, for quality of service purposes, it is necessary to limit the number of hops along each path. This can be achieved by constraining the possible location of the next-hop node. Considering that each node has a fixed transmission range, we can adjust the width $\epsilon$ of the band depending on the network density and quality-of-service constraints (Eq. 10). For a network density of 8e-4 nodes/m² and a transmission range of 250 m, a next-hop node is found with a probability of 95% for a value of $\epsilon$ of 15 m.
$$P(\text{at least 1 neighbor}) = 1 - P(\text{no neighbor})$$
$$P(\text{at least 1 neighbor}) = 1 - e^{-\rho R \epsilon}$$
$$\epsilon = \frac{-\ln(1 - P(\text{at least 1 neighbor}))}{\rho R} \qquad (10)$$
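The numerical example given with Eq. (10) can be reproduced directly. In the sketch below the function names are hypothetical; the density, range, and target probability are the values quoted in the text, and the last function applies the per-hop probability to both paths in the form of Eq. (11), with a hop count chosen arbitrarily for illustration.

```python
import math


def guard_band_width(density: float, tx_range: float, target_prob: float) -> float:
    """Invert P = 1 - exp(-rho * R * eps) for eps (Eq. 10)."""
    if not 0.0 < target_prob < 1.0:
        raise ValueError("target probability must lie strictly between 0 and 1")
    return -math.log(1.0 - target_prob) / (density * tx_range)


def prob_next_hop(density: float, tx_range: float, eps: float) -> float:
    """Probability of at least one neighbor in a band of width eps."""
    return 1.0 - math.exp(-density * tx_range * eps)


def prob_two_paths(density: float, tx_range: float, eps: float, hops: int) -> float:
    """Per-hop probability raised to 2(h - 2), as in Eq. (11); requires h >= 2."""
    return prob_next_hop(density, tx_range, eps) ** (2 * (hops - 2))


# Values quoted in the text: 8e-4 nodes/m^2, 250 m range, 95% target.
eps = guard_band_width(8e-4, 250.0, 0.95)
print(eps)                                  # close to the 15 m quoted in the text
print(prob_two_paths(8e-4, 250.0, eps, 5))  # 5 hops is an arbitrary example
```

The exponent $2(h-2)$ makes the path-level probability fall off quickly with the hop count, so a per-hop target well above 95% may be needed for long paths.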
Let $h$ be the number of hops between the source and the destination, $h \geq 2$. The probability of finding 2 totally disjoint paths can be expressed as:
$$P_2 = \left(1 - e^{-\rho R \epsilon}\right)^{2(h-2)} \qquad (11)$$
4 Conclusion
With the increasing deployment of wireless networks, the necessity to cope with limited link capacity becomes more and more stringent. This problem is further exacerbated when transmission occurs over multiple hops. The gain in flexibility in network deployment is counterbalanced by a reduction in available throughput. This is directly linked to the problem of interference. All devices in the same vicinity have to share the transmission medium, which consequently reduces the capacity available to each. To tackle this issue, some research has focused on developing routing protocols that account for interference and improve the network performance. Multipath routing has been proposed as an alternative to single path routing due to its potential to improve the
network throughput by balancing the load more evenly. However, to take full advantage of this routing method, interference has to be accounted for when choosing the routing paths. The contributions of this work are as follows. First, we studied the complexity of finding two non-interfering paths between a given source and destination and proved the NP-completeness of this problem. Then, we analytically derived the probability of finding two non-interfering paths given a certain network density. These results are noteworthy because they can be directly applied to the choice of a routing strategy: the network density, and therefore the probability of finding non-interfering paths, can lead to different routing decisions. The results derived in this paper can consequently enable adaptive routing strategies depending on the network characteristics.
References
1. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: A survey on sensor networks. IEEE Communications Magazine 40(8), 102–114 (2002)
2. Bettstetter, C.: On the minimum node degree and connectivity of a wireless multihop network. In: Proceedings of the 3rd ACM international symposium on Mobile ad hoc networking & computing (MobiHoc 2002), pp. 80–91. ACM, New York (2002)
3. Lee, S.-J., Gerla, M.: Split multipath routing with maximally disjoint paths in ad hoc networks. In: Proceedings of the IEEE International Conference on Communications (ICC), June 2001, vol. 10, pp. 3201–3205 (2001)
4. Papoulis, A.: Probability and Statistics, 1st edn. Prentice-Hall, Englewood Cliffs (1990)
5. Pearlman, M.R., Haas, Z.J., Sholander, P., Tabrizi, S.S.: On the impact of alternate path routing for load balancing in mobile ad hoc networks. In: First Annual Workshop on Mobile and Ad Hoc Networking and Computing, August 2000, pp. 3–10 (2000)
6. Pham, P., Perreau, S.: Performance analysis of reactive shortest path and multi-path routing mechanism with load balance. In: Proceedings of the Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM) (2003)
7. Waharte, S., Boutaba, R.: Totally disjoint multipath routing in multihop wireless networks. In: Proceedings of the IEEE International Conference on Communications (ICC), June 2006, pp. 5576–5581 (2006)
Multi-commodity Flow Problem for Multi-hop Wireless Networks with Realistic Smart Antenna Model
Osama Bazan and Muhammad Jaseemuddin
Department of Electrical and Computer Engineering, Ryerson University, Toronto, Canada
{obazan,jaseem}@ee.ryerson.ca
Abstract. Utilizing smart antennas in multi-hop wireless networks can boost their performance. However, most of the work done in the context of wireless ad hoc and sensor networking assumes ideal or over-simplistic antenna patterns that overrate the performance benefits. In this paper, we compute the optimal throughput of multi-hop wireless networks with a realistic smart antenna model. Our goal is to evaluate the network performance degradation when using a real smart antenna as compared to the ideal model. We derive a generic interference model that can accommodate any antenna radiation pattern. We formulate the problem as a multi-commodity flow problem with novel interference constraints derived from our generic interference model. Our numerical results show that using a real-world smart antenna in a dense network results in up to 55% degradation in throughput as compared to the case of an ideal flat-topped antenna, whereas the throughput degradation is as much as 37% in a sparse network.
Keywords: Smart antennas, Multi-hop wireless networks, Optimization.
1 Introduction
Recently, the use of smart antennas in multi-hop wireless networks has received increasing attention in the research community due to their potential benefits over omni-directional antennas [1]. Smart antennas can increase the spatial reuse, extend the transmission range and reduce power consumption. Performance evaluations [1, 2, 3, 4] have demonstrated a tremendous improvement in the network performance. Most of the evaluations were performed using ideal or over-simplistic antenna patterns that overrate the performance benefits [3]. However, the results from experimental test-beds [4] have shown that realistic smart antennas still outperform omni-directional antennas by a significant factor. The pressing need for enhancing network capacity is a major motivation for using smart antennas. Interference is a dominant factor in limiting the capacity of wireless networks. By controlling the beamforming in a desired direction, interference in other directions is alleviated. This enables multiple simultaneous transmissions to happen in the same vicinity, which increases the network capacity.
There have been several attempts to derive the optimal capacity bounds for multi-hop wireless networks. Gupta and Kumar [5] derived asymptotic capacity bounds. Jain et al. [6] studied the impact of interference on multi-hop wireless networks, using the conflict graph to model interference. Kodialam et al. [7] studied the joint routing and scheduling problem to achieve a given rate vector. In the context of smart antennas, Yi et al. [8] extended the work in [5] to the case of ideal antennas. Spyropoulos et al. [9] presented capacity bounds that are technology-based. Huang et al. [10] formulated the problem as a single commodity flow problem in wireless sensor networks with directional antennas. However, they used a simple interference model and an ideal antenna model. In [11], Peraki et al. computed the maximum stable throughput in wireless networks with directional antennas by solving the minimum cut problem.
In this paper, we compute the optimal throughput of multi-hop wireless networks using a realistic smart antenna model. Our goal is to evaluate the network performance degradation when using a real smart antenna model as compared to the ideal model. We formulate the problem as a multi-commodity flow problem. The contribution of this paper is twofold. First, we develop a generic interference model that can accommodate any antenna radiation pattern. Second, we formulate the multi-commodity problem with novel interference constraints. The differences between directional and omni-directional antennas require the derivation of new interference constraints. We redefine the conflict graph [6] to take those differences into consideration.
In Section 2, we provide basic antenna concepts. The generic interference model is derived in Section 3. We formulate the multi-commodity flow problem with the new interference constraints in Section 4. In Section 5, we present the numerical results. We conclude the paper in Section 6.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 922–929, 2008. © IFIP International Federation for Information Processing 2008
2 Preliminary
In this section, we present a few basic concepts related to smart antennas [12, 13] to introduce the terminology used to derive the interference model. In contrast to the omni-directional antenna, the directional antenna radiates and receives more energy in one direction than in the others. The gain of the antenna in any direction indicates the power in that direction relative to the power of an omni-directional antenna. The peak gain over all directions is a major characteristic of any directional antenna. The antenna radiation pattern specifies the gain values in all directions of space. In this paper, we assume that all nodes in the network lie in a two-dimensional plane, so, without loss of generality, the gain of the antenna is a function of the azimuth angle only. A directional antenna pattern consists of a high gain main lobe and smaller gain side lobes. The axis of the main lobe is known as the boresight of the antenna. Another characteristic of a directional antenna is the half-power beamwidth, which refers to the angle subtended by the directions on either side of the peak gain which
[Figure 1: polar plots of antenna gain versus azimuth angle (−180° to 180°). (a) Realistic directional antenna; (b) Ideal antenna model.]
Fig. 1. Directional antenna radiation patterns
are 3 dB less in gain. An ideal antenna pattern has constant gain in the main lobe and zero gain outside. Figure 1 shows antenna patterns for both realistic and ideal directional antennas. Since the transmission and reception characteristics of the antenna are reciprocal, the directional antenna has both transmission and reception gains. According to the Friis equation [13], the received power $P_r$ at a distance $r$ from a node with transmission power $P_t$ is:
$$P_r = \frac{P_t G_t(\theta_t) G_r(\theta_r)}{K r^{\alpha}}, \qquad (1)$$
where $G_t$ and $G_r$ are the transmitter and receiver gains, $\theta_t$ and $\theta_r$ are the transmitting and receiving angles measured with respect to the boresight, $\alpha$ is the path loss index and $K$ is a constant that depends on the wavelength. A receiver can receive the signal if the received power is greater than or equal to the receiver sensitivity threshold $\Omega$. Hence, the communication range $R$ in a certain direction is:
$$R(\theta_t, \theta_r) = \left(\frac{P_t G_t(\theta_t) G_r(\theta_r)}{K \Omega}\right)^{1/\alpha}. \qquad (2)$$
A smart beamforming antenna combines an antenna array with digital signal processing techniques to dynamically change its antenna pattern. Smart antennas can be classified into switched beam and steered beam systems. In switched beam systems, multiple fixed beam patterns are formed and the transceiver chooses the best among those beams. Hence, they do not guarantee maximum gain due to scalloping [13]. Scalloping is the roll-off of the antenna pattern as a function of the angle from the boresight. In steered beam systems, the boresight of the main lobe can be directed in any direction. Additionally, nulls can be placed in the direction of interfering sources using sophisticated techniques.
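Equations (1) and (2) of this section can be evaluated directly once the gains in the relevant directions are known. The sketch below treats the gains as plain numbers; the function names and all parameter values are hypothetical and chosen only so that the result is easy to check by hand.

```python
import math


def received_power(pt: float, gt: float, gr: float,
                   dist: float, k_const: float, alpha: float) -> float:
    """Received power of Eq. (1) at distance dist."""
    return pt * gt * gr / (k_const * dist ** alpha)


def communication_range(pt: float, gt: float, gr: float,
                        k_const: float, alpha: float, sensitivity: float) -> float:
    """Communication range of Eq. (2) for a sensitivity threshold Omega."""
    return (pt * gt * gr / (k_const * sensitivity)) ** (1.0 / alpha)


# Hypothetical link budget: all values are illustrative, not from the paper.
params = dict(pt=0.1, gt=10.0, gr=1.0, k_const=1e3, alpha=3.0)
omega = 1e-9
reach = communication_range(sensitivity=omega, **params)
print(reach, received_power(dist=reach, **params))
```

By construction, the received power at exactly the communication range equals the sensitivity threshold, which is how Eq. (2) is obtained from Eq. (1).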
3 The Interference Model
In a wireless network, any transmission imposes an interference region in which other transmitters should not transmit to avoid collisions. The interference region generated by directional antenna transmissions is very different from the omni-directional interference region, since it depends not only on the distance between nodes but also on the boresights of the transmitting and receiving antennas and their radiation patterns. In this section, we present a generic interference protocol model to define the conditions for a successful wireless transmission. Most of the previous work on ad hoc networks with directional antennas [2, 3, 4] assumes that a node transmits and receives in a directional mode but listens omni-directionally when it is idle. We adopt this realistic assumption, and hence the communication range is a function of $\theta_t$ only, as $G_r = 1$ in all directions:
$$R(\theta_t) = \left(\frac{P_t G_t(\theta_t)}{K \Omega}\right)^{1/\alpha}. \qquad (3)$$
A wireless link $(i, j)$ from node $i$ to node $j$ exists if
$$d_{ij} \leq R_i(\theta_{ij}), \qquad \frac{-\beta_i}{2} \leq \theta_{ij} \leq \frac{\beta_i}{2}, \qquad (4)$$
where dij is the distance between the nodes, θij is the angle between the boresight of the antenna of node i and the direction of j, and βi is the beamwidth of node i. Here, we assume that the wireless link exists only through the main lobe of the transmitter. Note that equation (4) is generic enough to represent both switched beam and steered beam systems. Since the interference may also occur through the side lobes, the interference range R′ in any direction is given by
\[ R'(\theta_t, \theta_r) = \Delta \, R(\theta_t, \theta_r), \tag{5} \]
where Δ is greater than or equal to 1 and R(θt, θr) is calculated as in (2). In our interference model, both the sender and the receiver are required to be free of interference, to reflect the case of most contention-based MAC protocols. For a successful transmission over an existing wireless link (i, j), any node k (k ≠ i, j) should avoid communicating in a certain direction if one of these conditions applies:
\[ d_{kj} \le R'_k(\theta_{kj}, \theta_{jk}) \quad \text{or} \quad d_{ki} \le R'_k(\theta_{ki}, \theta_{ik}), \tag{6} \]
where any angle θxy is subtended between the boresight of node x and the direction of node y, and θxy ∈ [0, 2π). Note that these conditions do not force node k to always remain silent but allow it to communicate in non-interfering directions. This introduces the spatial reuse factor into the interference model.
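The link condition (4) and the interference condition (6) can be checked geometrically. The sketch below is an illustration under stated assumptions, not the authors' implementation: nodes live in the plane, the receiver gain is 1 in all directions, the antenna follows a hypothetical cone-plus-ball pattern (`main_gain` inside the beam, `side_gain` outside), and `budget` stands for the ratio Pt/(KΩ). All names are made up for the example.

```python
import math

def make_node(x, y, boresight):
    """Hypothetical node record: position, boresight angle and a two-level gain."""
    return {"x": x, "y": y, "boresight": boresight,
            "beamwidth": math.radians(60), "main_gain": 10.0, "side_gain": 0.1}

def distance(a, b):
    return math.hypot(b["x"] - a["x"], b["y"] - a["y"])

def angle_off_boresight(node, target):
    """Signed angle in [-pi, pi] between node's boresight and the direction of target."""
    bearing = math.atan2(target["y"] - node["y"], target["x"] - node["x"])
    diff = bearing - node["boresight"]
    return math.atan2(math.sin(diff), math.cos(diff))

def tx_range(node, theta, budget, alpha):
    """Equation (3) with budget = P_t / (K * Omega) and a cone-plus-ball gain."""
    in_main_lobe = abs(theta) <= node["beamwidth"] / 2
    gain = node["main_gain"] if in_main_lobe else node["side_gain"]
    return (budget * gain) ** (1.0 / alpha)

def link_exists(i, j, budget, alpha):
    """Condition (4): j lies inside the main lobe of i and within its range."""
    theta = angle_off_boresight(i, j)
    return (abs(theta) <= i["beamwidth"] / 2
            and distance(i, j) <= tx_range(i, theta, budget, alpha))

def must_avoid(k, i, j, budget, alpha, delta=1.0):
    """Condition (6): k is within interference range of endpoint j or endpoint i."""
    def reaches(target):
        theta = angle_off_boresight(k, target)
        return distance(k, target) <= delta * tx_range(k, theta, budget, alpha)
    return reaches(j) or reaches(i)
```

With `budget = 20250` and `alpha = 2`, the main-lobe range is 450 m and the side-lobe range is 45 m, so a node 200 m from the receiver blocks the link only when its main lobe points towards one of the link endpoints.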
4
Multi-commodity Flow Problem Formulation
We are interested in the optimal throughput of multi-hop wireless networks with smart antennas. We formulate the problem as a multi-commodity flow
O. Bazan and M. Jaseemuddin
problem with interference constraints. However, the complexity of a generic interference model demands novel interference constraints. Two major differences distinguish our formulation from [6]. First, we consider that the nodes may be equipped with different realistic antenna systems, and we include the benefits of using smart antennas in our formulation. Second, we do not use the independent set constraints, which are computationally expensive and require global information. Instead, our constraints are localized in nature and can be implemented using distributed algorithms. We consider a network graph G(N, L), where N is the set of nodes and L is the set of directed wireless links. A directed link exists between two nodes if (4) is satisfied. Each node is equipped with a real antenna system. Given a set of source-destination pairs, we assume that each source always has data to send, and we are interested in the maximum total throughput the network can support. We assume the network operates in a time-slotted mode; hence, the throughput we obtain provides an upper bound for contention-based MAC protocols.
4.1
The Conflict Graph
We use the conflict graph to obtain the link capacity constraints due to interference. A conflict graph describes how the wireless links share the medium. A vertex in the conflict graph is a link in the network graph, while an edge in the conflict graph indicates that the two vertices (links in the network graph) interfere with each other. In [6], any two links that have a node in common are connected in the conflict graph. This is acceptable when omni-directional antennas are used. However, in the case of smart antennas, a node's link in one direction may not use the same wireless medium as its link in another direction, and hence they should not have an edge in the conflict graph, although they still cannot be active simultaneously due to the node capacity constraints. We modify the conflict graph to reflect the above argument. The main reason behind this modification is that interference from one direction should not affect the links in other directions. This is a major benefit of using smart beamforming antennas. In our modified conflict graph, there is an edge between two links with a common node if and only if they share the same wireless medium. In the case of switched beam systems, all the links transmitted/received by the same beam have edges in the conflict graph.
4.2
The LP Formulation
Consider a wireless network G(N, L) and a set M of commodities, each with a source-destination pair {s_m, t_m}. We denote by $x^m_{ij}$ the amount of flow of the m-th commodity over link (i, j), normalized with respect to the capacity of the channel, and by $f^m$ the flow coming out of source s_m. We denote by Int(i, j) the set of links that interfere with link (i, j) according to the conflict graph. The multi-commodity flow problem can be formulated as the following LP:
\[ \max \sum_{m \in M} f^m \tag{7} \]
subject to
\[
\begin{aligned}
& \sum_{(i,j) \in L} x^m_{ij} - \sum_{(j,i) \in L} x^m_{ji} =
\begin{cases} f^m & i = s_m \\ 0 & i \in N \setminus \{s_m, t_m\} \\ -f^m & i = t_m \end{cases}
&& \forall\, i, m, \\
& \sum_{m \in M} x^m_{ij} + \sum_{(p,q) \in Int(i,j)} \sum_{m \in M} x^m_{pq} \le 1 && \forall\, (i,j), \\
& \sum_{m \in M} \Big( \sum_{(i,j) \in L} x^m_{ij} + \sum_{(j,i) \in L} x^m_{ji} \Big) \le 1 && \forall\, i, \\
& x^m_{ij} \ge 0 && \forall\, (i,j),\, m.
\end{aligned} \tag{8}
\]
The first constraint represents the flow conservation constraints at each node for each commodity. The second constraint is the link capacity constraint dictated by the interference model. The third constraint is the node capacity constraint, in which the sum of the ingoing and outgoing flows should not exceed the channel capacity. This is a necessary condition for scheduling feasibility [7]. Note that the node capacity constraint does not include any interference terms thanks to the beamforming antennas.
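To make the role of each constraint in (8) concrete, the following sketch checks whether a candidate flow assignment is feasible and, if so, returns the objective (7). It is only a feasibility checker written for illustration, not an LP solver and not the CPLEX model used in the paper; the data layout (`flow`, `commodities`, `interferers`) is a hypothetical choice.

```python
def total_throughput(nodes, links, commodities, flow, interferers, tol=1e-9):
    """Return the sum of f^m if `flow` satisfies (8), otherwise None.

    flow[(m, link)] plays the role of x^m_ij, commodities[m] = (s_m, t_m),
    and interferers[link] is the set Int(i, j) taken from the conflict graph.
    """
    def x(m, link):
        return flow.get((m, link), 0.0)

    if any(value < -tol for value in flow.values()):
        return None                                   # non-negativity
    total = 0.0
    for m, (source, sink) in commodities.items():
        def net_out(node):
            out = sum(x(m, l) for l in links if l[0] == node)
            inc = sum(x(m, l) for l in links if l[1] == node)
            return out - inc
        f_m = net_out(source)
        for node in nodes:
            expected = f_m if node == source else -f_m if node == sink else 0.0
            if abs(net_out(node) - expected) > tol:
                return None                           # flow conservation
        total += f_m

    def load(link):
        return sum(x(m, link) for m in commodities)

    for link in links:
        if load(link) + sum(load(l) for l in interferers.get(link, ())) > 1 + tol:
            return None                               # link capacity
    for node in nodes:
        if sum(load(l) for l in links if node in l) > 1 + tol:
            return None                               # node capacity
    return total
```

On a three-node chain a → b → c carrying one commodity, a flow of 0.5 per link is feasible, while 0.6 violates the node capacity constraint at the relay b even when the two links do not interfere.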
5
Numerical Results
In this section, we present some numerical results. Our goals are twofold: to evaluate how realistic antenna models impact the whole interference pattern of a multi-hop wireless network compared to their idealistic counterparts, and then to assess how this difference affects the optimal throughput of the network. We compare the resulting conflict graphs under different antenna models. Then, we compute the maximum flow by solving our multi-commodity flow problem using CPLEX. We study randomly deployed networks in a 1500 m × 1500 m area. The link capacity is normalized to 1. We assume a two-ray propagation model. Each node is equipped with a six-beam switched beam smart antenna system with a peak gain of 10 dBi and a beamwidth of 60°. The maximum transmission range is 450 m. We assume each node sends a flow to a random destination. We consider three different antenna patterns: (i) an ideal antenna pattern; (ii) a cone-plus-ball antenna model [1], which has an ideal main lobe in addition to a low constant gain in all directions outside the main lobe, averaging the side lobes; (iii) a real directional antenna pattern such as the one shown in Fig. 1a. This antenna pattern can be implemented using a uniform circular antenna array with six isotropic elements and a radius of 0.35 wavelength of the channel frequency [12]. Figure 2 shows the average total number of vertices in the conflict graph (directed links in the network graph) for different network densities. As expected, the network has fewer links when a realistic switched beam antenna pattern is used, due to scalloping. Also, the difference between the number of ideal and realistic links increases in denser networks. Figure 3 shows the average vertex degree in the conflict graph. This number represents the average number of links
Fig. 2. Average number of conflict graph vertices with different node densities
Fig. 3. Average vertex degree of the conflict graph with different node densities
Fig. 4. Average maximum flow for different network densities (maximum flow versus total number of nodes; curves: ideal, cone+ball, realistic)
Fig. 5. Average maximum flow of 30-nodes network with different number of flows (maximum flow versus total number of flows; curves: ideal, cone+ball, realistic)
interfering with each link. We can see that, in the case of a 30-node network with a real antenna, the average vertex degree is almost 2.5 times the vertex degree in the case of the ideal model. Figure 3 also shows that the average vertex degree of the conflict graph increases linearly with the density if an ideal antenna is assumed but increases exponentially if a real antenna is used. This implies that the effect of the side lobes is significant and should not be ignored. Figure 4 shows the optimal throughput of random multi-hop wireless networks with switched beam antennas for different network densities. It is clear that using realistic antenna patterns results in significant degradation in the throughput as opposed to the case of ideal antenna patterns. This is because of the significant difference in the interference pattern. In a sparse network (20 nodes), the realistic optimal throughput is 37% less than the ideal, whereas in a dense network (40 nodes) the realistic throughput degrades by as much as 55% relative to the ideal. Next, we consider solving the LP under different traffic loads. Figure 5 shows the average
maximum flow for networks with 30 nodes and various numbers of flows. The results show that the percentage of performance degradation caused by using a realistic antenna pattern is almost the same for different numbers of flows.
6
Conclusions
In this paper, we computed the optimal throughput of multi-hop wireless networks with realistic smart antenna models. We presented a generic interference model to accommodate any antenna pattern. We formulated the problem as a multi-commodity flow problem with novel interference constraints. Our numerical results showed that using a realistic antenna pattern could result in 55% degradation in the throughput as opposed to the ideal case. This implies that the effect of the side lobes is significant and should not be ignored. Our future work includes evaluating the performance improvement over omni-directional antennas and the design tradeoffs of using different smart antenna technologies.
References
1. Ramanathan, R.: On the Performance of Ad Hoc Networks with Beamforming Antennas. In: ACM MobiHoc, pp. 95–105 (2001)
2. Choudhury, R.R., Yang, X., Ramanathan, R., Vaidya, N.H.: On Designing MAC Protocols for Wireless Networks Using Directional Antennas. IEEE Transactions on Mobile Computing 5(5), 477–491 (2006)
3. Takai, M., Martin, J., Ren, A., Bagrodia, R.: Directional Virtual Carrier Sensing for Directional Antennas in Mobile Ad Hoc Networks. In: ACM MobiHoc, pp. 183–193 (2002)
4. Ramanathan, R., Redi, J., Santivanez, C., Wiggins, D., Polit, S.: Ad Hoc Networking with Directional Antennas: A Complete System Solution. In: IEEE WCNC, pp. 1–5 (2006)
5. Gupta, P., Kumar, P.R.: The Capacity of Wireless Networks. IEEE Transactions on Information Theory 46, 388–404 (2000)
6. Jain, K., Padhye, J., Padmanabhan, V., Qiu, L.: Impact of Interference on Multihop Wireless Network Performance. In: ACM MOBICOM, pp. 66–80 (2003)
7. Kodialam, M., Nandagopal, T.: Characterizing Achievable Rates in Multi-hop Wireless Networks: The Joint Routing and Scheduling Problem. In: ACM MOBICOM, pp. 42–54 (2003)
8. Yi, S., Pei, Y., Kalyanaraman, S.: On The Capacity Improvement of Ad Hoc Wireless Networks Using Directional Antennas. In: ACM MobiHoc, pp. 108–116 (2003)
9. Spyropoulos, A., Raghavendra, C.S.: Capacity Bounds for Ad-Hoc Networks Using Directional Antennas. In: IEEE ICC, pp. 348–352 (2003)
10. Huang, X., Wang, J., Fang, Y.: Achieving Maximum Flow in Interference-Aware Wireless Sensor Networks with Smart Antennas. Elsevier Ad Hoc Networks 5(6), 885–896 (2007)
11. Peraki, C., Servetto, S.D.: On the Maximum Stable Throughput Problem in Random Networks with Directional Antennas. In: ACM MobiHoc, pp. 76–87 (2003)
12. Balanis, C.A.: Antenna theory: Analysis and design. Wiley, Chichester (1997)
13. Liberti, J.C., Rappaport, T.S.: Smart Antennas for Wireless Communications. Prentice-Hall, NJ (1999)
Evolutionary Power Control Games in Wireless Networks
Eitan Altman1, Rachid El-Azouzi2, Yezekael Hayel2 and Hamidou Tembine2
1 INRIA, Sophia Antipolis, France.
2 LIA-CERI, Avignon University, France.
Abstract. In this paper, we apply evolutionary games to non-cooperative power control in wireless networks. Specifically, we focus our study on power control in W-CDMA and WiMAX wireless systems. We study competitive power control within a large population of mobiles that interfere with each other through many local interactions. Each local interaction involves a random number of mobiles. A utility function is introduced as the difference between a utility function based on the SIR of the mobile and pricing. The games are not necessarily reciprocal, as the set of mobiles causing interference to a given mobile may differ from the set of those suffering from its interference. We show how the evolution dynamics and the equilibrium behavior (called Evolutionary Stable Strategy - ESS) are influenced by the characteristics of the wireless channel and pricing characteristics.
Keywords: power control, evolutionary games, W-CDMA, WiMAX.
1
Introduction
The evolutionary games formalism is a central mathematical tool developed by biologists for predicting population dynamics in the context of interactions between populations. This formalism identifies and studies two concepts: the ESS (for Evolutionary Stable Strategy) and the Replicator Dynamics. The ESS, first defined in 1972 by the biologist M. Smith [10], is characterized by a property of robustness against invaders (mutations). More specifically, (i) if an ESS is reached, then the proportions of each population do not change in time; (ii) at ESS, the populations are immune from being invaded by other small populations. This notion is stronger than Nash equilibrium, in which it is only requested that a single user would not benefit by a change (mutation) of its behavior. The ESS concept helps to understand mixed strategies in games with symmetric payoffs. A mixed strategy can be interpreted as a composition of the population. An ESS can also be interpreted as a Nash equilibrium of the one-shot game, but a (symmetric) Nash equilibrium is not necessarily an ESS. As is shown in [13], ESS has strong refinement properties of equilibria (proper equilibrium, perfect equilibrium
This work is partially supported by the ANR WINEM no. 06TCOM05 and the INRIA ARC POPEYE. We thank three anonymous reviewers for their many suggestions for improving this paper.
A. Das et al. (Eds.): NETWORKING 2008, LNCS 4982, pp. 930–942, 2008. © IFIP International Federation for Information Processing 2008
etc). Although ESS has been defined in the context of biological systems, it is highly relevant to engineering as well [14]. In the biological context, the replicator dynamics is a model for the change of the size of the population(s) as biologists observe it, whereas in engineering we can go beyond characterizing and modeling existing evolution. The evolution of protocols can be engineered by providing guidelines or regulations for the way to upgrade existing ones and in determining parameters related to the deployment of new protocols and services. There has been a lot of work on non-cooperative modeling of power control using game theory [1,2]. There are two advantages in doing so within the framework of evolutionary games:
– it provides the stronger concept of equilibria, the ESS, which allows us to identify robustness against deviations of more than one mobile, and
– it allows us to apply the generic convergence theory of replicator dynamics, and stability results that we introduce in future sections.
The aim of this paper is to apply evolutionary game models to study the interaction of numerous mobiles in competition in a wireless environment. The power control game in wireless networks is a typical non-cooperative game where each mobile decides on its transmit power in order to optimize its performance. This application of non-cooperative game theory and tools is studied in several articles [6,9,7]. The main difference in this article is the use of evolutionary game theory, which deals with population dynamics and is well adapted for studying the power control game in dense wireless networks. Specifically, we focus our first study on a power control game in a dense wireless ad hoc network where each user transmits using orthogonal codes, as in W-CDMA. In the second scenario, we also consider uplink transmissions, but with inter-cell interference as in a WiMAX cell deployment.
The utility function of each mobile is based on the carrier (signal)-to-interference ratio and a pricing scheme proportional to the transmitted power. We study and predict the evolution of the population between two types of behaviors: aggressive (high power) and peaceful (low power). We identify cases in which, at ESS, only one population prevails (ESS in pure strategies) and others in which an equilibrium between several population types is obtained. We also provide conditions for the uniqueness of the ESS. Furthermore, we study different pricing schemes for controlling the evolution of the population. The paper is organized as follows: we present in the next section some background on evolutionary games. In section 3, we present the evolutionary game model to study the power allocation game in a dense wireless ad hoc network. In section 4, we propose an evolutionary game analysis for the power allocation game in a context of inter-cell interference between WiMAX cells. Numerical results for both wireless network architectures are presented in section 5. Section 6 concludes the paper.
2
Basic Notions on Evolutionary Games
We consider the standard setting of evolutionary games: there is a large population of players; each member of the population has the same finite set of pure strategies S. There are many local interactions at the same time.
E. Altman et al.
Mixed strategies. Denote by Δ(S) the (|S| − 1)-dimensional simplex of $\mathbb{R}^{|S|}$. Let x(t) be the |S|-dimensional vector whose j-th element $x_j(t)$ is the population share of action j at time t. Thus we have $\sum_{j \in S} x_j(t) = 1$ and $x_j(t) \ge 0$. We frequently use an equivalent interpretation where x(t) is a mixed strategy used by all players at time t; by a mixed strategy we mean that a player chooses at time t an action j with probability $x_j(t)$. With either interpretation, at each local interaction occurring at time t a given player can expect that the other player would play action j with probability $x_j(t)$. The vector x(t) will also be called the state of the population.
Fitness. The fitness for a player at a given time is determined by the action k taken by the player at that time, as well as by the actions of the population it interacts with. More precisely, if player 1 chooses the action k when the state of the population is x, then player 1 receives the payoff J(k, x) (called fitness). The function J(k, x) is not necessarily linear in x. Below we denote by $F(y, x) = \sum_{k \in S} y_k J(k, x)$ the fitness for an individual using the strategy y when the state of the population is x.
ESS. The mixed strategy x is an ESS if for every strategy mut ≠ x there exists $\epsilon_{mut} > 0$ such that, for all $\epsilon \in (0, \epsilon_{mut})$,
\[ \epsilon F(x, mut) + (1 - \epsilon) F(x, x) > \epsilon F(mut, mut) + (1 - \epsilon) F(mut, x). \tag{1} \]
We have a sufficient condition for the existence of the ESS. The strategy x is an ESS if it satisfies
\[ F(x, x) \ge F(mut, x) \quad \forall\, mut, \tag{2} \]
and
\[ \forall\, mut \ne x: \quad F(x, x) = F(mut, x) \;\Rightarrow\; F(x, mut) > F(mut, mut). \tag{3} \]
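Conditions (2) and (3) can be spot-checked numerically against a finite list of mutant strategies. The sketch below is an illustration rather than a proof: it assumes a fitness that is linear in the population state, J(k, x) = Σ_j payoff[k][j] x_j, and uses a hypothetical Hawk-Dove payoff matrix that does not come from the paper.

```python
def fitness(y, x, payoff):
    """F(y, x) = sum_k y_k J(k, x), with the linear choice J(k, x) = (payoff x)_k."""
    return sum(
        y[k] * sum(payoff[k][j] * x[j] for j in range(len(x)))
        for k in range(len(y))
    )

def passes_ess_conditions(x, payoff, mutants, tol=1e-9):
    """Check (2) and (3) for candidate x against every strategy in `mutants`."""
    for mut in mutants:
        if max(abs(a - b) for a, b in zip(mut, x)) <= tol:
            continue                      # conditions only concern mut != x
        gap = fitness(x, x, payoff) - fitness(mut, x, payoff)
        if gap < -tol:
            return False                  # (2) fails: mut does better against x
        if abs(gap) <= tol and (
            fitness(x, mut, payoff) - fitness(mut, mut, payoff) <= tol
        ):
            return False                  # (3) fails: x does not beat mut at mut
    return True

# Hypothetical Hawk-Dove game (value 2, cost 4); rows and columns are (hawk, dove).
HAWK_DOVE = [[-1.0, 2.0], [0.0, 1.0]]
GRID = [(m / 100, 1 - m / 100) for m in range(101)]
print(passes_ess_conditions((0.5, 0.5), HAWK_DOVE, GRID))
```

In this toy game the half-hawk population passes both conditions on the grid, while the two pure populations fail condition (2) because the other action earns more against them.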
Dynamics. Evolutionary game theory considers a dynamic scenario where players are constantly interacting with other players and adapting their choices based on the fitness they receive. A strategy having higher fitness than others tends to gain ground: this is formulated through rules describing the dynamics (such as the replicator dynamics or others) of the sizes of populations (of strategies). The replicator equation is given by
\[ \frac{d}{dt} x_k(t) = x_k \left[ J(k, x) - F(x, x) \right], \quad k \in S. \tag{4} \]
There are many population dynamics other than the replicator dynamics which have been used in the context of non-cooperative games. Examples are the excess payoff dynamics, the fictitious play dynamics, gradient methods [8], and the Smith dynamics. Much literature can be found in the extensive survey on evolutionary games in [5].
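The replicator equation (4) can be integrated numerically to watch a population converge. The following is a minimal sketch using an explicit Euler step; the step size, the renormalisation against rounding drift and the two-action `hawk_dove` fitness are assumptions made for the example and are not taken from the paper.

```python
def replicator_step(x, fitness_of, dt):
    """One explicit Euler step of equation (4)."""
    payoffs = [fitness_of(k, x) for k in range(len(x))]
    mean = sum(xk * jk for xk, jk in zip(x, payoffs))       # F(x, x)
    stepped = [xk + dt * xk * (jk - mean) for xk, jk in zip(x, payoffs)]
    total = sum(stepped)
    return [value / total for value in stepped]             # guard against drift

def simulate(x0, fitness_of, dt=0.01, steps=10000):
    """Iterate the Euler step from the initial population state x0."""
    x = list(x0)
    for _ in range(steps):
        x = replicator_step(x, fitness_of, dt)
    return x

def hawk_dove(k, x):
    """Hypothetical fitness J(k, x), linear in x; actions are (hawk, dove)."""
    payoff = [[-1.0, 2.0], [0.0, 1.0]]
    return sum(payoff[k][j] * x[j] for j in range(len(x)))

print(simulate([0.1, 0.9], hawk_dove))
```

Starting from either side, the share of hawks approaches one half in this toy game, whereas a population that starts in a pure state stays there because the right-hand side of (4) vanishes when x_k = 0.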
3
W-CDMA Wireless Network
We study in this section competitive decentralized power control in a wireless network where the mobiles use, as uplink MAC protocol, the W-CDMA technique
Fig. 1. Interferences at the receiver in uplink CDMA transmissions
to transmit to a receiver. We assume that there is a large population of mobiles which are randomly placed over a plane following a Poisson process with density λ. We consider a random number of mobiles interacting locally. When a mobile i transmits to its receiver R(i), all mobiles within a circle of radius R centered at the receiver R(i) cause interference to the transmission from node i to receiver R(i), as illustrated in figure 1. We assume that a mobile is within a circle of a receiver with probability μ. We define a random variable R which will be used to represent the distance between a mobile and a receiver. Let f(r) be the probability density function (pdf) for R. Then we have $\mu = \int_0^R f(r)\,dr$.
Remark 1. If we assume that the receivers or access points are randomly distributed following a Poisson process with density ν, the probability density function is expressed by $f(r) = \nu e^{-\nu r}$.
For uplink transmissions, a mobile has to choose between a higher (H) power level and a lower (L) power level. We denote by P_H the higher power level and by P_L the lower power level. Let s be the population share of strategy H. The signal power P_r received at the receiver is given by $P_r = g P_i\, l(r)$, where g is the antenna gain, l is the attenuation function and α > 2 is the path loss exponent. For the attenuation, the most common function is $l(t) = 1/t^{\alpha}$, with α ranging from 3 to 6. Note that such an l(t) explodes at t = 0, and thus in particular is not correct for a small distance r and a large intensity λ. It therefore makes sense to assume the attenuation to be a bounded function in the vicinity of the antenna. Hence the attenuation function becomes $l(t) = \max(t, r_0)^{-\alpha}$. First we note that the number of transmissions within a circle of radius r_0 centered at the receiver is $\lambda \pi r_0^2$. Then the interference caused by all mobiles in that circle is
\[ I_0(s) = \frac{\lambda \pi g \,(s P_H + (1-s) P_L)}{r_0^{\alpha-2}}. \]
Now we consider a thin ring A_j of width dr at distance $r_j = r_0 + j\,dr$ from the receiver. The signal power received at the receiver from any node i in A_j is $P_r^i = g P_i / r_j^{\alpha}$. Hence the interference caused by all mobiles in A_j is given by
\[ I_j(s) = \begin{cases} 2 g \lambda \pi r_j \, dr \, \dfrac{s P_H + (1-s) P_L}{r_j^{\alpha}} & \text{if } r_j < R, \\[1ex] 2 \mu g \lambda \pi r_j \, dr \, \dfrac{s P_H + (1-s) P_L}{r_j^{\alpha}} & \text{if } r_j \ge R. \end{cases} \]
Hence, the total interference contributed by all nodes at the receiver is
\[ \begin{aligned} I(s) &= I_0(s) + 2 g \lambda \pi (s P_H + (1-s) P_L) \left[ \int_{r_0}^{R} \frac{1}{r^{\alpha-1}}\, dr + \mu \int_{R}^{\infty} \frac{1}{r^{\alpha-1}}\, dr \right] \\ &= \frac{g \lambda \pi (s P_H + (1-s) P_L)}{\alpha-2} \left( \alpha\, r_0^{-(\alpha-2)} - 2 (1-\mu) R^{-(\alpha-2)} \right). \end{aligned} \]
Hence the signal to interference ratio SINR_i is given by
\[ SINR_i(P_i, s, r) = \begin{cases} \dfrac{g P_i / r_0^{\alpha}}{\sigma + \beta I(s)} & \text{if } r \le r_0, \\[1ex] \dfrac{g P_i / r^{\alpha}}{\sigma + \beta I(s)} & \text{if } r \ge r_0, \end{cases} \]
where σ is the power of the thermal background noise and β is the inverse of the processing gain of the system. This parameter weights the effect of interference, depending on the orthogonality between codes used during simultaneous transmissions. In the sequel, we compute the mobile's utility (fitness) depending on its decision but also on the decisions of its interferers. We assume the utility (fitness) of a user choosing power level P_i is expressed by
\[ J(P_i, s) = w \int_0^R \log\big(1 + SINR(P_i, s, r)\big) f(r)\, dr - \eta P_i. \]
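The interference, SINR and fitness expressions of this section can be evaluated numerically. The sketch below is a hedged illustration: the total interference is obtained by evaluating the two ring integrals of the text in closed form, the integral in J is approximated with a midpoint rule, and every parameter value in `MODEL` is hypothetical. The density follows Remark 1, with ν chosen so that μ = 1 − exp(−νR) = 0.5.

```python
import math

def total_interference(s, m):
    """I(s): inner disc of radius r0 plus the two ring integrals of the text."""
    mean_power = s * m["p_high"] + (1 - s) * m["p_low"]
    a = m["alpha"] - 2
    inner = m["lam"] * math.pi * m["g"] * mean_power / m["r0"] ** a
    rings = (m["r0"] ** -a - m["R"] ** -a) / a + m["mu"] * m["R"] ** -a / a
    return inner + 2 * m["g"] * m["lam"] * math.pi * mean_power * rings

def sinr(p, s, r, m):
    """SINR of a mobile at distance r using power p when a share s plays P_H."""
    received = m["g"] * p / max(r, m["r0"]) ** m["alpha"]
    return received / (m["sigma"] + m["beta"] * total_interference(s, m))

def fitness(p, s, m, density, steps=2000):
    """J(p, s), with the integral over [0, R] approximated by the midpoint rule."""
    dr = m["R"] / steps
    integral = sum(
        math.log1p(sinr(p, s, (n + 0.5) * dr, m)) * density((n + 0.5) * dr) * dr
        for n in range(steps)
    )
    return m["w"] * integral - m["eta"] * p

# Hypothetical parameter values, for illustration only.
MODEL = dict(p_high=1.0, p_low=0.5, g=1.0, lam=1 / math.pi, mu=0.5, alpha=4.0,
             r0=1.0, R=2.0, sigma=0.1, beta=0.1, w=1.0, eta=0.2)
NU = math.log(2) / MODEL["R"]

def density(r):
    """f(r) = nu * exp(-nu * r), as in Remark 1."""
    return NU * math.exp(-NU * r)

print(total_interference(1.0, MODEL), fitness(MODEL["p_high"], 0.5, MODEL, density))
```

Because the interference grows with the share s of high-power mobiles, the fitness of either power level decreases in s, which is the mechanism behind the monotonicity result that follows.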
The pricing term ηP_i defines the instantaneous "price" a mobile pays for using a specific amount of power that causes interference in the system, and η is a parameter. This price can be the power consumption cost for sending packets. The total fitness of the population is given by F(s, s) = sJ(P_H, s) + (1 − s)J(P_L, s). We are now looking at the existence and uniqueness of the ESS. For this, we need the following result.
Lemma 1. For every density function f defined on [0, R], the function h : [0, 1] → R defined as
\[ s \longmapsto \int_0^R \log\left( \frac{1 + SINR(P_H, s, r)}{1 + SINR(P_L, s, r)} \right) f(r)\, dr \]
is continuous and strictly monotone.
Proof. The function $s \mapsto \log\left( \frac{1 + SINR(P_H, s, r)}{1 + SINR(P_L, s, r)} \right) f(r)$ is continuous and integrable in r on the interval [0, R]. The function h is therefore continuous. Differentiating under the integral sign, we see that the derivative of h is the function h′ : [0, 1] → R defined as
\[ s \longmapsto \int_0^R \frac{\partial}{\partial s} \log\left( \frac{1 + SINR(P_H, s, r)}{1 + SINR(P_L, s, r)} \right) f(r)\, dr. \]
We show that the term $\frac{\partial}{\partial s} \log\left( \frac{1 + SINR(P_H, s, r)}{1 + SINR(P_L, s, r)} \right)$ is negative. Let $A(s) := \frac{1 + SINR(P_H, s, r)}{1 + SINR(P_L, s, r)}$. The function A can be rewritten as
\[ A(s) = 1 + \frac{g (P_H - P_L)/r^{\alpha}}{\sigma + \beta I(s) + g P_L / r^{\alpha}}, \]
where I(s) = (s(P_H − P_L) + P_L) c(r) and
\[ c(r) = \begin{cases} \dfrac{\lambda \pi g}{\alpha-2} \left( \alpha\, r_0^{-(\alpha-2)} - 2 (1-\mu) R^{-(\alpha-2)} \right) & \text{if } r \ge r_0, \\[1ex] \dfrac{\lambda \pi g}{r_0^{\alpha-2}} & \text{otherwise.} \end{cases} \]
The function A satisfies A(s) > 1 and
\[ A'(s) = -c(r)\, \beta\, (P_H - P_L)\, \frac{g (P_H - P_L)/r^{\alpha}}{\left( \sigma + \beta I(s) + g P_L / r^{\alpha} \right)^2} < 0. \]
Hence,
\[ \frac{\partial}{\partial s} \log\left( \frac{1 + SINR(P_H, s, r)}{1 + SINR(P_L, s, r)} \right) = \frac{\partial}{\partial s} \big( \log A(s) \big) = \frac{A'(s)}{A(s)} < 0, \]
i.e. h′(s) < 0. We conclude that h is strictly decreasing.
Using this lemma, we have the following proposition, which gives the pure strategies depending on the parameters.
Proposition 1. For every density function f, the pure strategy P_H dominates the strategy P_L if and only if
\[ \frac{\eta}{w} (P_H - P_L) < \int_0^R \log\left( \frac{1 + SINR(P_H, P_H, r)}{1 + SINR(P_L, P_H, r)} \right) f(r)\, dr = h(1). \]
For every density function f, the pure strategy P_L dominates the strategy P_H if and only if
\[ \frac{\eta}{w} (P_H - P_L) > \int_0^R \log\left( \frac{1 + SINR(P_H, P_L, r)}{1 + SINR(P_L, P_L, r)} \right) f(r)\, dr = h(0). \]
Proof. We decompose the existence of the ESS into several cases.
1. P_H is preferred to P_L: The higher power level dominates the lower if and only if J(P_H, P_H) > J(P_L, P_H) and J(P_H, P_L) > J(P_L, P_L). These two inequalities imply that
\[ \frac{\eta}{w} (P_H - P_L) < \int_0^R \log\left( \frac{1 + SINR(P_H, P_H, r)}{1 + SINR(P_L, P_H, r)} \right) f(r)\, dr. \]
2. P_L is preferred to P_H: Analogously, the lower power dominates the higher power if and only if J(P_L, P_H) > J(P_H, P_H) and J(P_L, P_L) > J(P_H, P_L), i.e.
\[ \frac{\eta}{w} (P_H - P_L) > \int_0^R \log\left( \frac{1 + SINR(P_H, P_L, r)}{1 + SINR(P_L, P_L, r)} \right) f(r)\, dr. \]
We have this other result, which gives a sufficient condition for the existence of the ESS.
Proposition 2. For every density function f, if $h(1) < \frac{\eta}{w}(P_H - P_L) < h(0)$, then there exists a unique ESS s∗, which is given by $s^* = h^{-1}\left( \frac{\eta}{w} (P_H - P_L) \right)$.
Proof. Suppose that the parameters w, η, P_H and P_L satisfy the inequality $h(1) < \frac{\eta}{w}(P_H - P_L) < h(0)$. Then the game has no dominant strategy. A mixed equilibrium is characterized by J(P_H, s) = J(P_L, s). It is easy to see that this last equation is equivalent to $h(s) = \frac{\eta}{w}(P_H - P_L)$. From Lemma 1, the equation $h(s) = \frac{\eta}{w}(P_H - P_L)$ has a unique solution given by $s^* = h^{-1}\left( \frac{\eta}{w} (P_H - P_L) \right)$. We now prove that this mixed equilibrium is an ESS. To prove this result, we compare s∗J(P_H, mut) + (1 − s∗)J(P_L, mut) and mut J(P_H, mut) + (1 − mut)J(P_L, mut) for all mut ≠ s∗. The difference between the two values is exactly w(s∗ − mut)(h(mut) − h(s∗)). According to Lemma 1, h is a decreasing function. Hence, (s∗ − mut)(h(mut) − h(s∗)) is strictly positive for every strategy mut different from s∗. We conclude that the mixed equilibrium s∗ is an ESS.
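Propositions 1 and 2 suggest a simple numerical recipe: evaluate h at the two end points to detect a dominant power level, and otherwise solve h(s) = (η/w)(P_H − P_L) by bisection, which works because h is strictly decreasing. The following sketch does this under stated assumptions: the integral is approximated with a midpoint rule, and `toy_sinr` is a hypothetical stand-in for the SINR expression of this section, with made-up constants.

```python
import math

def make_h(p_high, p_low, sinr, density, big_r, steps=400):
    """Return h(s), the integral over [0, R] of log((1+SINR_H)/(1+SINR_L)) f(r)."""
    dr = big_r / steps
    def h(s):
        total = 0.0
        for n in range(steps):
            r = (n + 0.5) * dr
            ratio = math.log1p(sinr(p_high, s, r)) - math.log1p(sinr(p_low, s, r))
            total += ratio * density(r) * dr
        return total
    return h

def ess_share(h, target, iterations=60):
    """Solve h(s) = target on [0, 1] by bisection, using that h is decreasing."""
    if target <= h(1.0):
        return 1.0            # P_H dominates (Proposition 1)
    if target >= h(0.0):
        return 0.0            # P_L dominates (Proposition 1)
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if h(mid) > target:
            low = mid         # h is still too large: the root lies to the right
        else:
            high = mid
    return (low + high) / 2

def toy_sinr(p, s, r):
    """Hypothetical SINR: the interference grows linearly with the share s."""
    interference = 1.0 * s + 0.5 * (1 - s)
    return p / max(r, 1.0) ** 4 / (0.1 + 0.1 * interference)

h = make_h(1.0, 0.5, toy_sinr, lambda r: 0.5, 2.0)
target = (h(0.0) + h(1.0)) / 2      # plays the role of (eta / w) * (P_H - P_L)
print(ess_share(h, target))
```

Raising the target, that is raising the price η, moves the solution towards 0, which matches the observation that the ESS s∗ decreases when η increases.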
From the last proposition, we can use the pricing η as a design tool to create an incentive for users to adjust their power control. We observe that the ESS s∗ decreases when η increases. That means the mobiles become less aggressive when the pricing increases, and the system can limit aggressive requests for SIR. In the numerical examples, we show how the pricing function can optimize the overall network throughput. In the next section we study another wireless architecture where interference occurs between mobiles which are located in different cells. This typical interference problem occurs in WiMAX environments.
4
OFDMA-Based IEEE802.16 Network
OFDMA (Orthogonal Frequency Division Multiple Access) is recognized as one of the most promising multiple access techniques in wireless communication systems. This technique is used to improve spectral efficiency and has become an attractive multiple access technique for 4th generation mobile communication systems such as WiMAX. In OFDMA systems, each user occupies a subset of subcarriers, and each carrier is assigned exclusively to only one user at any time. This technique has the advantage of eliminating intra-cell interference (interference between subcarriers is negligible). Hence the transmission is affected by inter-cell interference, since users in adjacent sectors may also have been assigned to the same carrier. If those users in the adjacent sectors transmit with high power, the inter-cell interference may severely limit the SINR achieved by the user. Some forms of coordination between the different cells occupying the spectral resource are studied in [4,3]. The optimal resource allocation requires complete information about the network in order to decide which users in which cells should transmit simultaneously with a given power. All of these results, however, rely on some form of centralized control to obtain gains at various layers of the communication stack. In a realistic network such as WiMAX, centralized multicell coordination is hard to realize in practice, especially in fast-fading environments. We consider an OFDMA system where radio resources are allocated to users based on their channel measurements and traffic requirements. Each carrier within a frame must be assigned to at most one user in the corresponding cell. In this way each carrier assignment can be made independently in each cell. Hence, when a user is assigned to a carrier, the mobile should determine the transmission power to the base station. This power should take into account the interference experienced by the transmitted packet.
Consider the uplink of a multicell system employing the same spectral resource in each cell. Power control is used in an effort to preserve power and to limit interference and fading effects. For users located in a given cell, co-channel interference may therefore come from only a few cells, as illustrated in figure 2. Since the intra-cell interference is negligible, we focus on the users which use a specific carrier. Consider N cells and a large population of mobiles randomly distributed over each channel and each cell. Since in
Fig. 2. Hexagonal cell configuration
OFDMA systems, each carrier is assigned exclusively to only one mobile at any time, we assume that the interactions between mobiles are manifested through many local interactions between N mobiles where N is the set of neighbors of a cell. We can ignore the interaction among more than N mobiles that transmit simultaneously and that can cause interference to each other. Hence, in each slot, interaction occurs only between the mobiles which have been assigned to the same carrier. Let gij denote the average channel gain from user i to cell j. Hence, if a user in cell i transmits with power pi , the received signal strength at cell i is pi gii , while the interference it occurs on cell j is pi gij . Hence, the interference experienced by cell i is given by SIN Ri (p) = σi +gii pi gij pj , where σi is the j=i power of the thermal noise experienced at cell i. The rate achieved by user i is given by ri (p) = log(1 + SIN Ri (p)), where p denotes the power level vector of mobiles choice which are assigned to a specific carrier. We assume that the user’s utility is given by ui (p) = ri (p) − ηpi . The above utility represents the weighted difference between the throughput that can be achieved as expressed by Shannon’s capacity and the power consumption cost. We assume that all mobiles perform the on-off power allocation strategy. In this strategy, each mobile transmits with full power or remains silent. Let Ni be the set of neighbors of a user in cell i. Hence the interference experienced by a user in the cell i is given by SIN Ri (p) = σi + gii pi gji pj , where pj ∈ {0, P }, j∈Ni \{i}
$P$ is the power level of transmission. Let $x_i$ be the proportion of transmitters in cell $i$. The couple $(x_i, 1 - x_i)$ with $x_i \in (0, 1)$ represents the state of cell $i$. We denote by $x_{-i}$ the vector $(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{N_i})$. The fitness of cell $i$ can be defined as follows:
$$f_i(x_i, x_{-i}) = x_i f_i(P, x_{-i}) = x_i \sum_{a_{-i} \in \{0,P\}^{|N_i|-1}} u_i(P, a_{-i})\, x_{-i}(a_{-i}),$$
where
$$x_{-i}(a_{-i}) = \Big(\prod_{j \in T(a_{-i})} x_j\Big) \Big(\prod_{j \in N_i \setminus (T(a_{-i}) \cup \{i\})} (1 - x_j)\Big)$$
and $T(a_{-i}) = \{k \in N_i \setminus \{i\} : p_k = P\}$ is the set of neighbors transmitting. We have $f_i(0, x_{-i}) = 0$ for any multi-strategy $x_{-i}$.
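The SINR, utility and fitness definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the convention that `g[j][i]` holds the gain from user j to cell i, the choice to treat every other cell as a neighbor, and the three-cell parameter values are all hypothetical.

```python
import itertools
import math


def sinr(i, p, g, sigma):
    """SINR at cell i; g[j][i] is the gain from user j to cell i."""
    interference = sum(g[j][i] * p[j] for j in range(len(p)) if j != i)
    return g[i][i] * p[i] / (sigma[i] + interference)


def utility(i, p, g, sigma, eta):
    """u_i(p) = log(1 + SINR_i(p)) - eta * p_i."""
    return math.log(1.0 + sinr(i, p, g, sigma)) - eta * p[i]


def fitness_on(i, x, g, sigma, P, eta):
    """f_i(P, x_{-i}): expected utility of transmitting in cell i.

    Every other cell is treated as a neighbor (assumption of this sketch).
    The sum runs over all on-off profiles a_{-i} of the other cells, each
    weighted by the product of x_j (on) and 1 - x_j (off).
    """
    others = [j for j in range(len(x)) if j != i]
    total = 0.0
    for on_flags in itertools.product((False, True), repeat=len(others)):
        p = [0.0] * len(x)
        p[i] = P
        prob = 1.0
        for j, on in zip(others, on_flags):
            p[j] = P if on else 0.0
            prob *= x[j] if on else 1.0 - x[j]
        total += prob * utility(i, p, g, sigma, eta)
    return total


def fitness(i, x, g, sigma, P, eta):
    """f_i(x_i, x_{-i}) = x_i * f_i(P, x_{-i}), since staying silent yields 0."""
    return x[i] * fitness_on(i, x, g, sigma, P, eta)


# Hypothetical three-cell example: own gain 3.6, cross gain 0.9, noise 0.1.
g = [[3.6, 0.9, 0.9], [0.9, 3.6, 0.9], [0.9, 0.9, 3.6]]
sigma = [0.1, 0.1, 0.1]
P, eta = 1.0, 0.5
x = [0.5, 0.5, 0.5]
print(fitness(0, x, g, sigma, P, eta))
```

The enumeration is exponential in the number of neighbors, which is acceptable for the small neighborhoods considered here (seven cells in figure 2).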
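A multi-strategy $x = (x_i)_{i=1,\ldots,N}$ is neutrally stable if for all $x' \neq x$ there exists $\epsilon_{x'} > 0$ such that, for all $\epsilon \in (0, \epsilon_{x'})$,
$$f_i(x_i, \epsilon x'_{-i} + (1 - \epsilon) x_{-i}) \geq f_i(x'_i, \epsilon x'_{-i} + (1 - \epsilon) x_{-i}) \qquad (5)$$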
for some $i$. The multi-strategy $x$ is evolutionarily stable if the inequality (5) is strict. Hence, an ESS is a neutrally stable strategy, but the converse is not true. See [11] for more details on neutrally stable strategies. If $x$ is a neutrally stable strategy, then $x$ is a Nash equilibrium. The best response (BR) of user $i$ to $x_{-i}$ is given by
$$BR_i(x_{-i}) = \begin{cases} 1 & \text{if } f_i(P, x_{-i}) > 0, \\ 0 & \text{if } f_i(P, x_{-i}) < 0, \\ [0, 1] & \text{if } f_i(P, x_{-i}) = 0. \end{cases}$$
A strategy $x$ is a Nash equilibrium if and only if $x_i \in BR_i(x_{-i})$, $i = 1, \ldots, |N|$. In particular, when $\eta = 0$, the strategy ON is a Nash equilibrium (ON is a dominant strategy). The following proposition gives necessary conditions for pure strategies and strictly mixed strategies to be neutrally stable.

Proposition 3. We decompose the different equilibrium strategies into the following items.
– If the strategy ON ($P$) is a neutrally stable equilibrium, then $\eta \leq \eta_{min}$, where
$$\eta_{min} = \min_{i=1,\ldots,N} \frac{1}{P} \log\left(1 + \frac{g_{ii}}{\frac{\sigma_i}{P} + \sum_{j \in N_i,\, j \neq i} g_{ji}}\right).$$
Moreover, ON becomes a strictly dominant strategy if $\eta < \eta_{min}$.
– If the strategy OFF (0) is a neutrally stable equilibrium, then
$$\eta \geq \max_i \frac{1}{P} \log\left(1 + \frac{g_{ii} P}{\sigma_i}\right) =: \eta_{max}.$$
For $\eta > \eta_{max}$, OFF becomes a strictly dominant strategy (hence, an ESS).
– If $g_{ij} = g$ for all $j \neq i$, $g_{ii} = \bar{g}$ and $\sigma_i = \sigma$, then the game becomes a symmetric game. Hence, there exists a symmetric equilibrium with the proportion $x$ ($= x_i$ for all $i$) of transmitters, which must satisfy
$$Q(x) := \sum_{k=0}^{n-1} a_k \binom{n-1}{k} x^k (1 - x)^{n-k-1} = \eta P, \qquad (6)$$
where $a_k = \log\left(1 + \frac{\bar{g} P}{\sigma + k g P}\right) > 0$ represents the fitness obtained by user $i$ when he transmits, $k$ of the opponents of user $i$ decide to transmit and the $n - 1 - k$ others stay quiet.
– Every strictly mixed equilibrium (symmetric or not) must satisfy $f_i(x_i, x_{-i}) = 0$ for all $i$.
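As a rough illustration of Proposition 3, the two price thresholds and the best-response correspondence can be computed directly from the gains. The sketch below is an assumption-laden example rather than the authors' procedure: the function names are hypothetical, `g[j][i]` is taken to be the gain from user j to cell i, all other cells are treated as neighbors, and the numerical values are illustrative.

```python
import math


def eta_min(g, sigma, P):
    """Threshold below which ON satisfies f_i(P, P, ..., P) >= 0 in every cell."""
    n = len(sigma)
    return min(
        math.log(1.0 + g[i][i] / (sigma[i] / P + sum(g[j][i] for j in range(n) if j != i))) / P
        for i in range(n)
    )


def eta_max(g, sigma, P):
    """Threshold above which OFF satisfies f_i(P, 0, ..., 0) <= 0 in every cell."""
    return max(math.log(1.0 + g[i][i] * P / sigma[i]) / P for i in range(len(sigma)))


def best_response(f_on):
    """BR_i as an interval (low, high) of transmit proportions, given f_i(P, x_{-i})."""
    if f_on > 0:
        return (1.0, 1.0)
    if f_on < 0:
        return (0.0, 0.0)
    return (0.0, 1.0)  # indifferent: any proportion in [0, 1]


# Hypothetical symmetric three-cell example.
g = [[3.6, 0.9, 0.9], [0.9, 3.6, 0.9], [0.9, 0.9, 3.6]]
sigma = [0.1, 0.1, 0.1]
P = 1.0
print(eta_min(g, sigma, P), eta_max(g, sigma, P))
```

Prices between the two printed thresholds are the range in which neither pure strategy passes its necessary condition, so only a mixed equilibrium remains possible.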
Proof. The first two assertions of Proposition 3 follow from the Nash equilibrium conditions: if the strategy ON ($P$) is a neutrally stable equilibrium, then $0 \leq f_i(P, P, \ldots, P)$ for all $i$, and if the strategy OFF (0) is a neutrally stable equilibrium, then $f_i(P, 0, \ldots, 0) \leq 0$ for all $i$. In the symmetric case, the fitness of user $i$ is $x(Q(x) - \eta P)$, where $(x, 1 - x)$ is the state of the cell. Since $Q(1) - \eta P < 0$ and $Q(0) - \eta P > 0$ when $\eta_{min} < \eta < \eta_{max}$, equation (6) has a solution $x$ in $(0, 1)$, which implies the existence of an ESS. Note that the strategy ON is an ESS if $\eta < \eta_{min}$. In order to prove the existence and uniqueness of the ESS in the symmetric case, we study the roots of the polynomial $Q - \eta P$ in the following lemma.

Lemma 1. Let $0 \leq a_{n-1} < a_{n-2} < \ldots < a_0$, and let $\eta$ and $P$ be positive reals satisfying $\eta P \in (a_{n-1}, a_0)$. Then, the polynomial $Q(x) - \eta P$ has a unique root in $(0, 1)$.

Proof. We show existence and uniqueness of the root of the polynomial $Q - \eta P$ in $(0, 1)$.

Existence. $Q$ is a polynomial with real coefficients. Hence, $Q$ is continuous on $[0, 1]$, with $Q(0) = a_0$ and $Q(1) = a_{n-1}$. The image of $[0, 1]$ under $Q - \eta P$ therefore contains $I = [a_{n-1} - \eta P, a_0 - \eta P]$. The inequality $\frac{a_{n-1}}{P} =: \eta_{min} < \eta < \frac{a_0}{P} =: \eta_{max}$ implies that 0 lies in the interior of $I$. Thus, there exists $a \in (0, 1)$ such that $Q(a) - \eta P = 0$.

Uniqueness. We show that $Q$ is strictly monotone on $(0, 1)$. Let $Q'$ be the derivative of $Q$. $Q'$ is given by
$$Q'(x) = -(n-1) a_0 (1-x)^{n-2} + (n-1) a_{n-1} x^{n-2} + \sum_{k=1}^{n-2} a_k \binom{n-1}{k} \frac{d}{dx}\left[x^k (1-x)^{n-1-k}\right],$$
which regroups as
$$Q'(x) = -(n-1) \sum_{k=0}^{n-2} (a_k - a_{k+1}) \binom{n-2}{k} x^k (1-x)^{n-2-k}.$$
The inequalities $a_k - a_{k+1} > 0$, $k = 0, \ldots, n-2$, and $x \geq 0$, $1 - x \geq 0$ imply that $Q'(x) < 0$ on $(0, 1)$. Hence, $Q$ is strictly decreasing on $(0, 1)$. We conclude that $Q - \eta P$ is a bijection from $[0, 1]$ onto $I$ and hence has a unique root in $(0, 1)$.

Given this result, we have the following proposition giving the existence and uniqueness of the ESS in the symmetric game. The proof is immediate from Lemma 1 and Proposition 3.

Proposition 4. The symmetric power control game has a unique strictly mixed equilibrium.
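The monotonicity argument in Lemma 1 can be checked numerically. Differentiating $Q$ term by term and regrouping gives $Q'(x) = -(n-1) \sum_{k=0}^{n-2} (a_k - a_{k+1}) \binom{n-2}{k} x^k (1-x)^{n-2-k}$, which is negative whenever the $a_k$ are strictly decreasing. The sketch below evaluates this expression and compares it with a central finite difference. The function names and the parameter values ($n = 7$, $\bar{g} = 3.6$, $g = 0.9$, $\sigma = 0.1$, $P = 1$) are assumptions made for illustration.

```python
import math


def a_coeffs(n, g_bar, g, sigma, P):
    """a_k = log(1 + g_bar*P / (sigma + k*g*P)) for k = 0, ..., n-1."""
    return [math.log(1.0 + g_bar * P / (sigma + k * g * P)) for k in range(n)]


def Q(x, a):
    """Q(x) = sum_k a_k * C(n-1, k) * x^k * (1-x)^(n-1-k)."""
    m = len(a) - 1
    return sum(a[k] * math.comb(m, k) * x**k * (1.0 - x) ** (m - k) for k in range(m + 1))


def dQ(x, a):
    """Regrouped derivative: -(n-1) * sum_k (a_k - a_{k+1}) * C(n-2, k) * x^k * (1-x)^(n-2-k)."""
    m = len(a) - 1  # m = n - 1
    return -m * sum(
        (a[k] - a[k + 1]) * math.comb(m - 1, k) * x**k * (1.0 - x) ** (m - 1 - k)
        for k in range(m)
    )


a = a_coeffs(7, 3.6, 0.9, 0.1, 1.0)
grid = [i / 100 for i in range(1, 100)]

# Sign check on a grid of interior points.
all_negative = all(dQ(x, a) < 0 for x in grid)

# Agreement with a central finite difference of Q.
h = 1e-6
max_gap = max(abs(dQ(x, a) - (Q(x + h, a) - Q(x - h, a)) / (2 * h)) for x in grid)
print(all_negative, max_gap)
```

A grid check is of course not a proof; it only confirms that the closed-form derivative and the sign claim agree on the sampled points.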
5
Numerical Examples
In this section, we first investigate the impact of the different parameters and pricing on the ESS and the convergence of the replicator dynamics. We also discuss the impact of the pricing function on the system capacity.

W-CDMA Wireless Networks: we show the impact of the density of nodes and pricing on the ESS and the average rate. We assume that the receivers are randomly placed over a plane following a Poisson process with density ν, i.e., $f(r) = \nu e^{-\nu r}$. We recall that the rate of a mobile using power level $P_i$ at the equilibrium is given by $w \int_0^R \log(1 + SINR(P_i, s, r)) f(r)\, dr$. We took $r_0 = 0.2$, $w = 20$, $\sigma = 0.2$, $\alpha = 3$, $\beta = 0.2$, $R = 1$. First, we show the impact of the density of nodes λ on the ESS and the average rate. In figures 3-4, we depict the average rate obtained at the equilibrium and the ESS, respectively, as a function of the density λ. We recall that the interference for a mobile increases when λ increases. We observe that the mobiles become less aggressive when the density increases. In figure 3, we observe that it is important to adapt the pricing as a function of the density of nodes. Indeed, for a low density of nodes, the lower pricing (η = 0.92) gives better results than the higher pricing (η = 0.97). When the density of nodes increases, better performance is obtained with the higher pricing.
Fig. 3. The average rate of a mobile at equilibrium as function of the density of nodes λ for η = 0.92, 0.97 (vertical axis: Throughput; horizontal axis: Density of nodes)

Fig. 4. ESS versus density of nodes λ for η = 0.92, 0.97 (vertical axis: ESS; horizontal axis: Density of nodes)
Our second experiment in W-CDMA studies the convergence to the ESS of the W-CDMA system described in section 3 under replicator dynamics. Figs. 5 and 6 show the fraction of the population using the high power level for different initial states of the population: 0.99, 0.66, 0.25 and 0.03. We observe that the choice of receiver distribution changes the ESS.
Fig. 5. Convergence to the ESS in CDMA system: uniform distribution (vertical axis: proportion of players using the high power level, s(t); horizontal axis: time t; initial states s0 = 0.99, 0.66, 0.25, 0.03)

Fig. 6. Convergence to the ESS in CDMA system: quadratic distribution (vertical axis: proportion of players using the high power level, s(t); horizontal axis: time t; initial states s0 = 0.99, 0.66, 0.25, 0.03)
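To give a feel for this kind of convergence experiment, the sketch below integrates a two-strategy replicator equation, $\dot{s} = s(1-s)(f_{ON}(s) - f_{OFF})$, from the same four initial states. It is only a stand-in: the W-CDMA payoffs of section 3 are not reproduced here, so the symmetric on-off fitness $f_{ON}(s) = Q(s) - \eta P$, $f_{OFF} = 0$ from equation (6) is used instead. The forward-Euler scheme, the step size, the horizon and all parameter values are assumptions of this example.

```python
import math


def Q(x, n, g_bar, g, sigma, P):
    """Expected rate of a transmitter when each of the n-1 opponents is ON with probability x."""
    m = n - 1
    return sum(
        math.log(1.0 + g_bar * P / (sigma + k * g * P))
        * math.comb(m, k) * x**k * (1.0 - x) ** (m - k)
        for k in range(m + 1)
    )


def replicator(s0, eta, n=7, g_bar=3.6, g=0.9, sigma=0.1, P=1.0, dt=0.01, T=60.0):
    """Forward-Euler integration of s' = s(1-s)(Q(s) - eta*P); returns s(T)."""
    s = s0
    for _ in range(int(round(T / dt))):
        s += dt * s * (1.0 - s) * (Q(s, n, g_bar, g, sigma, P) - eta * P)
        s = min(1.0, max(0.0, s))  # keep the state inside [0, 1]
    return s


# Same initial states as in the experiment; eta = 1.0 is an illustrative price.
finals = [replicator(s0, eta=1.0) for s0 in (0.99, 0.66, 0.25, 0.03)]
print(finals)
```

With these illustrative values the price lies strictly between the two pure-strategy thresholds, so all four trajectories are expected to settle at the same interior rest point, where $Q(s) = \eta P$.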
OFDMA-based IEEE802.16 networks. We consider below a numerical example of an OFDMA system, in which we obtain the ESS for several values of η. We assume that $g_{ii} = \bar{g}$ and $g_{ij} = g$ for all $i \neq j$ and $\sigma_i = \sigma$ for all $i$. We consider different values of N (see figure 2, in which N = 7). Let the noise be σ = 0.1 and the power attenuation $g = \bar{g}/4 = 0.9$. In figure 7 (resp. 8), we plot the ESS versus η (resp. power level P). In both figures, the population ratio using the strategy ON is monotone decreasing in η.
Fig. 7. The population ratio using strategy ON at equilibrium as a function of η for n = 7 (vertical axis: Optimal Strategy; horizontal axis: Pricing parameter η)

Fig. 8. The population ratio using strategy ON as a function of power level for n = 3, 4, 5, 6, 10, 11 (vertical axis: Optimal Strategy; horizontal axis: Power Level)
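A curve of the kind shown in Fig. 7 can be approximated by solving equation (6) for each price. Because $Q$ is strictly decreasing on $(0, 1)$, bisection is enough. The sketch below is one possible way to do this, not the procedure used for the published figure: the function names are hypothetical, and since the power level behind Fig. 7 is not stated in the text, $P = 1$ is assumed alongside σ = 0.1, $g = 0.9$ and $\bar{g} = 3.6$.

```python
import math


def Q(x, n, g_bar, g, sigma, P):
    """Left-hand side of equation (6)."""
    m = n - 1
    return sum(
        math.log(1.0 + g_bar * P / (sigma + k * g * P))
        * math.comb(m, k) * x**k * (1.0 - x) ** (m - k)
        for k in range(m + 1)
    )


def symmetric_ess(eta, n, g_bar, g, sigma, P, tol=1e-12):
    """Proportion of transmitters solving Q(x) = eta*P, or a pure strategy at the ends."""
    target = eta * P
    if target >= Q(0.0, n, g_bar, g, sigma, P):
        return 0.0  # price at or above eta_max: OFF
    if target <= Q(1.0, n, g_bar, g, sigma, P):
        return 1.0  # price at or below eta_min: ON
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Q(mid, n, g_bar, g, sigma, P) > target:
            lo = mid  # Q is decreasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Sweep the price from 0.0 to 4.0 for n = 7 (P = 1 is an assumed value).
etas = [k / 10 for k in range(41)]
ess = [symmetric_ess(eta, 7, 3.6, 0.9, 0.1, 1.0) for eta in etas]
for eta, x in zip(etas, ess):
    print(f"eta={eta:.1f}  x={x:.4f}")
```

The sweep stays at full transmission for low prices, decreases through the mixed range, and reaches zero once the price exceeds the upper threshold, which matches the monotone behavior described in the text.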
We see that the parameter η, which can be interpreted as the price per unit of transmitted power, determines whether the ESS is aggressive (in pure strategies) or in mixed strategies. In the latter case, it determines what fraction of the population will use high power at equilibrium. Pricing can thus be used to control interference.
6
Concluding Remarks
We considered power control games with one or several populations of users and studied evolutionarily stable strategies in the interaction of numerous mobiles competing in a wireless environment. We have modeled power control for W-CDMA and OFDMA systems as a non-cooperative population game in which each user needs to decide how much power to transmit to each receiver, and in which many local interactions between users take place at the same time. We have derived the conditions for existence and uniqueness of evolutionarily stable strategies and characterized the distribution of the users over each power level.
References
1. Alpcan, T., Basar, T., Srikant, R., Altman, E.: CDMA uplink power control as a noncooperative game. Wireless Networks 8(6), 659–670 (2002)
2. Sagduyu, Y.E., Ephremides, A.: SINR-Based MAC Games for Selfish and Malicious Users. In: Information Theory and Applications Workshop, San Diego, CA (to appear, January 2007)
3. Hoon, K., Youngnam, H., Jayong, K.: Optimal subchannel allocation scheme in multicell OFDMA systems. In: Proc. IEEE VTC (May 2004)
4. Li, G., Liu, H.: Downlink dynamic resource allocation for multi-cell OFDMA system. In: Proc. IEEE VTC (October 2003)
5. Hofbauer, J., Sigmund, K.: Evolutionary game dynamics. AMS 40(4), 479–519 (2003)
6. Meshkati, F., Poor, H., Schwartz, S., Mandayam, N.: An Energy-Efficient Approach to Power Control and Receiver Design in Wireless Data Networks. IEEE Trans. on Communications 52 (2005)
7. Meshkati, F., Poor, H., Schwartz, S.: Energy-Efficient Resource Allocation in Wireless Networks. IEEE Signal Processing Mag. 58 (2007)
8. Rosen, J.B.: Existence and uniqueness of equilibrium points for concave N-person games. Econometrica 33, 520–534 (1965)
9. Saraydar, C., Mandayam, N., Goodman, D.: Efficient Power Control via Pricing in Wireless Data Networks. IEEE Trans. on Communication 50 (2002)
10. Smith, M.: Game Theory and the Evolution of Fighting. In: Smith, J.M. (ed.) On Evolution, pp. 8–28. Edinburgh Univ. Press, Edinburgh (1972)
11. Samuelson, L.: Evolutionary Games and Equilibrium Selection. MIT Press, Cambridge (1997)
12. Tembine, H., Altman, E., ElAzouzi, R., Hayel, Y.: Evolutionary games with random number of interacting players with application to access control. In: Wiopt (to appear 2008)
13. van Damme, E.: Stability and perfection of Nash equilibria. Springer, Heidelberg (1991)
14. Vincent, T.L., Vincent, T.L.S.: Evolution and control system design. IEEE Control Systems Mag. 20(5), 20–35 (2000)