Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
6003
Fabio Ricciato Marco Mellia Ernst Biersack (Eds.)
Traffic Monitoring and Analysis Second International Workshop, TMA 2010 Zurich, Switzerland, April 7, 2010 Proceedings
Volume Editors

Fabio Ricciato
Università del Salento, Lecce, Italy and
FTW Forschungszentrum Telekommunikation Wien, Austria
E-mail: [email protected]

Marco Mellia
Politecnico di Torino, Italy
E-mail: [email protected]

Ernst Biersack
EURECOM, Sophia Antipolis, France
E-mail: [email protected]
Library of Congress Control Number: 2010923705
CR Subject Classification (1998): C.2, D.4.4, H.3, H.4, D.2
LNCS Sublibrary: SL 5 – Computer Communication Networks and Telecommunications
ISSN 0302-9743
ISBN-10 3-642-12364-3 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-12364-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
The Second International Workshop on Traffic Monitoring and Analysis (TMA 2010) was an initiative of the COST Action IC0703 "Data Traffic Monitoring and Analysis: Theory, Techniques, Tools and Applications for the Future Networks" (http://www.tma-portal.eu/cost-tma-action). The COST program is an intergovernmental framework for European cooperation in science and technology, promoting the coordination of nationally funded research at a European level. Each COST Action aims at reducing the fragmentation of research and opening the European research area to worldwide cooperation. Traffic monitoring and analysis (TMA) is today an important research topic within the field of computer networks, involving many research groups worldwide that are collectively advancing our understanding of the Internet. The importance of TMA research is motivated by the fact that modern packet networks are highly complex and ever-evolving objects; understanding, developing and managing such environments is difficult and expensive in practice. Traffic monitoring is a key methodology for understanding telecommunication technology and improving its operation, and recent advances in this field suggest that evolved TMA-based techniques can play a key role in the operation of real networks. Besides its practical importance, TMA is an attractive research topic for many reasons. First, the inherent complexity of the Internet has drawn many researchers to traffic measurement since the field's pioneering days. Second, TMA offers fertile ground for theoretical and cross-disciplinary research (such as the various analysis techniques being imported into TMA from other fields) while at the same time providing a clear perspective for exploiting the results in real network environments. In other words, TMA research has the potential to reconcile theoretical investigations with practical applications, and to realign curiosity-driven with problem-driven research.
In the spirit of the COST program, the COST-TMA Action was launched in 2008 to promote the building of a research community in the specific field of TMA. Today, it involves research groups from academic and industrial organizations in 24 European countries. The goal of the TMA workshops is to open the COST Action research and discussions to the worldwide community of researchers working in this field. Following the success of the first edition of the TMA workshop in 2009, which gathered around 70 participants in lively interaction during the presentation of the papers, we decided to maintain the same format for this second edition: a single-session, full-day program. TMA 2010 was organized jointly with the 11th Passive and Active Measurement conference (PAM 2010) and was held in Zurich on April 7, 2010. We are grateful to Bernhard Plattner and Xenofontas Dimitropoulos from ETH Zurich for the perfect local organization.
The submission and review process for the two events was handled independently. For TMA 2010, 34 papers were submitted. Each paper received at least three independent reviews by TPC members or external reviewers. Finally, 14 papers were accepted for inclusion in the present proceedings. A few papers were conditionally accepted and shepherded to ensure that the authors addressed, in the final version, the critical points raised by the reviewers. Given the very tight schedule available for the review process, it was not possible to implement a rebuttal phase, but we recommend considering this option for future editions of the workshop. We are planning to implement a comment-posting feature on the TMA portal (http://www.tma-portal.eu/forums) for all the papers included in these proceedings. The goal is to offer readers and authors a ready-to-use channel for posting comments, expressing criticism, requesting and providing clarifications, and sharing any other material relevant to each paper. In the true spirit of the COST Action, we hope in this way to contribute to raising the level of research interaction in this field. We wish to thank all the TPC members and the external reviewers for the great job they did: accurate, qualified reviews are key to building and maintaining a high standard for the TMA workshop series. We are grateful to Springer for agreeing to publish the TMA workshop proceedings. We hope you will enjoy them!
Fabio Ricciato Marco Mellia Ernst Biersack
Organization
Technical Program Committee

Patrice Abry (ENS Lyon, France)
Valentina Alaria (Cisco Systems)
Pere Barlet-Ros (UPC Barcelona, Spain)
Christian Callegari (University of Pisa, Italy)
Ana Paula Couto da Silva (Federal University of Juiz de Fora, Brazil)
Jean-Laurent Costeaux (France Telecom, France)
Udo Krieger (University of Bamberg, Germany)
Youngseok Lee (CNU, Korea)
Michela Meo (Politecnico di Torino, Italy)
Philippe Owezarski (LAAS-CNRS, France)
Aiko Pras (University of Twente, The Netherlands)
Kavé Salamatian (University of Lancaster, UK)
Dario Rossi (TELECOM ParisTech, France)
Matthew Roughan (University of Adelaide, Australia)
Luca Salgarelli (University of Brescia, Italy)
Yuval Shavitt (Tel Aviv University, Israel)
Ruben Torres (Purdue University, USA)
Steve Uhlig (T-labs/TU Berlin, Germany)
Pierre Borgnat (ENS Lyon, France)
Local Organizers

Xenofontas Dimitropoulos (ETH Zurich, Switzerland)
Bernhard Plattner (ETH Zurich, Switzerland)
Technical Program Co-chairs

Fabio Ricciato (University of Salento, Italy)
Marco Mellia (Politecnico di Torino, Italy)
Ernst Biersack (EURECOM)
Table of Contents
Analysis of Internet Datasets

Understanding and Preparing for DNS Evolution .......................... 1
   Sebastian Castro, Min Zhang, Wolfgang John, Duane Wessels, and Kimberly Claffy

Characterizing Traffic Flows Originating from Large-Scale Video Sharing Services .......................... 17
   Tatsuya Mori, Ryoichi Kawahara, Haruhisa Hasegawa, and Shinsuke Shimogawa

Mixing Biases: Structural Changes in the AS Topology Evolution .......................... 32
   Hamed Haddadi, Damien Fay, Steve Uhlig, Andrew Moore, Richard Mortier, and Almerima Jamakovic

Tools for Traffic Analysis and Monitoring

EmPath: Tool to Emulate Packet Transfer Characteristics in IP Network .......................... 46
   Jaroslaw Sliwinski, Andrzej Beben, and Piotr Krawiec

A Database of Anomalous Traffic for Assessing Profile Based IDS .......................... 59
   Philippe Owezarski

Collection and Exploration of Large Data Monitoring Sets Using Bitmap Databases .......................... 73
   Luca Deri, Valeria Lorenzetti, and Steve Mortimer

DeSRTO: An Effective Algorithm for SRTO Detection in TCP Connections .......................... 87
   Antonio Barbuzzi, Gennaro Boggia, and Luigi Alfredo Grieco

Traffic Classification

Uncovering Relations between Traffic Classifiers and Anomaly Detectors via Graph Theory .......................... 101
   Romain Fontugne, Pierre Borgnat, Patrice Abry, and Kensuke Fukuda

Kiss to Abacus: A Comparison of P2P-TV Traffic Classifiers .......................... 115
   Alessandro Finamore, Michela Meo, Dario Rossi, and Silvio Valenti

TCP Traffic Classification Using Markov Models .......................... 127
   Gerhard Münz, Hui Dai, Lothar Braun, and Georg Carle

K-Dimensional Trees for Continuous Traffic Classification .......................... 141
   Valentín Carela-Español, Pere Barlet-Ros, Marc Solé-Simó, Alberto Dainotti, Walter de Donato, and Antonio Pescapé

Performance Measurements

Validation and Improvement of the Lossy Difference Aggregator to Measure Packet Delays .......................... 155
   Josep Sanjuàs-Cuxart, Pere Barlet-Ros, and Josep Solé-Pareta

End-to-End Available Bandwidth Estimation Tools, an Experimental Comparison .......................... 171
   Emanuele Goldoni and Marco Schivi

On the Use of TCP Passive Measurements for Anomaly Detection: A Case Study from an Operational 3G Network .......................... 183
   Peter Romirer-Maierhofer, Angelo Coluccia, and Tobias Witek

Author Index .......................... 199
Understanding and Preparing for DNS Evolution

Sebastian Castro (1,2), Min Zhang (1), Wolfgang John (1,3), Duane Wessels (1,4), and Kimberly Claffy (1)

(1) CAIDA, University of California, San Diego
(2) NZRS, New Zealand
(3) Chalmers University of Technology, Sweden
(4) DNS-OARC

{secastro,mia,johnwolf,kc}@caida.org
[email protected]
Abstract. The Domain Name System (DNS) is a crucial component of today’s Internet. The top layer of the DNS hierarchy (the root nameservers) is facing dramatic changes: cryptographically signing the root zone with DNSSEC, deploying Internationalized Top-Level Domain (TLD) Names (IDNs), and the addition of other new global Top-Level Domains (TLDs). ICANN has stated plans to deploy all of these changes in the next year or two, and there is growing interest in measurement, testing, and provisioning for foreseen (or unforeseen) complications. We describe the Day-in-the-Life annual datasets available to characterize workload at the root servers, and we provide some analysis of the last several years of these datasets as a baseline for operational preparation, additional research, and informed policy. We confirm some trends from previous years, including the low fraction of clients (0.55% in 2009) still generating most misconfigured “pollution”, which constitutes the vast majority of observed queries to the root servers. We present new results on security-related attributes of the client population: an increase in the prevalence of DNS source port randomization, a short-term measure to improve DNS security; and a surprising decreasing trend in the fraction of DNSSEC-capable clients. Our insights on IPv6 data are limited to the nodes that collected IPv6 traffic, which does show growth. These statistics serve as a baseline for the impending transition to DNSSEC. We also report lessons learned from our global trace collection experiments, including improvements to future measurements that will help answer critical questions in the evolving DNS landscape.
1 Introduction
The DNS is a fundamental component of today’s Internet, mapping the domain names used by people to their corresponding IP addresses. The data for this mapping is stored in a tree-structured distributed database where each nameserver is authoritative for a part of the naming tree. The root nameservers play a vital role, providing authoritative referrals to nameservers for all top-level domains, which recursively determine referrals for all host names on the Internet,

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 1–16, 2010.
© Springer-Verlag Berlin Heidelberg 2010
among other infrastructure information. This top (root) layer of the DNS hierarchy is facing three dramatic changes: cryptographically signing the root zone with DNSSEC, deploying Internationalized Top-Level Domain (TLD) Names (IDNs), and addition of other new global Top Level Domains (TLDs). In addition, ICANN and the root zone operators must prepare for an expected increase in IPv6 glue records in the root zone due to the exhaustion of IPv4 addresses. ICANN currently plans to deploy all of these changes within a short time interval, and there is growing interest in measurement, testing, and provisioning for foreseen (or unforeseen) complications. As part of its DNS research activities, in 2002 CAIDA responded to the Root Server System Advisory Committee’s invitation to help DNS root operators study and improve the integrity of the root server system. Based on the few years of trust we had built with these operators, in 2006 we asked them to participate in a simultaneous collection of a day of traffic to (and in some cases from) the DNS root nameservers. We collaborated with the Internet Systems Consortium (ISC) and DNS Operation and Research Center (DNS-OARC) in coordinating four annual large-scale data collection events that took place in January 2006, January 2007, March 2008, and March 2009. While these measurements can be considered prototypes of a Day in the Life of the Internet [8], their original goal was to collect as complete a dataset as possible about the DNS root servers operations and evolution, particularly as they deployed new technology, such as anycast, with no rigorous way to evaluate its impacts in advance. As word of these experiments spread, the number and diversity of participants and datasets grew, as we describe in Section 2. 
In Section 3 we confirm the persistence of several phenomena observed in previous years, establishing baseline characteristics of DNS root traffic, validating previous measurements and inferences, and offering new insights into the pollution at the roots. In Section 4 we focus on the state of deployment of two major security-related aspects of clients querying the root: source port randomization and DNSSEC capability. We extract some minor insights about IPv6 traffic in Section 5 before summarizing overall lessons learned in Section 6.
2 Data Sets
On January 10–11, 2006, we coordinated concurrent measurements of three DNS root server anycast clouds (C, F, and K, see [13] for results and analysis). On January 9–10, 2007, four root servers (C, F, K, and M) participated in simultaneous capture of packet traces from almost all instances of their anycast clouds [5]. On March 18–19, 2008, operators of eight root servers (A, C, E, F, H, K, L, and M), five TLDs (.ORG, .UK, .BR, .SE, and .CL), two Regional Internet Registries (RIRs: APNIC and LACNIC), and seven operators of project AS112 joined this collaborative effort. Two Open Root Server Network (ORSN) servers, B in Vienna and M in Frankfurt, participated in our 2007 and 2008 collection experiments. On March 30–April 1, 2009, the same eight root servers participated in addition to seven TLDs (.BR, .CL, .CZ, .INFO, .NO, .SE, and .UK), three
RIRs (APNIC, ARIN, and LACNIC), and several other DNS operators [9]. To the best of our knowledge, these events deliver the largest simultaneous collection of full-payload packet traces from a core component of the global Internet infrastructure ever shared with academic researchers. DNS-OARC provides limited storage and compute power for researchers to analyze the DITL data, which for privacy reasons cannot leave OARC machines.1 For this study we focus only on the root server DITL data and their implications for the imminent changes planned for the root zone. Each year we gathered more than 24 hours of data so that we could select the 24-hour interval with the least packet loss or other trace damage. The table in Fig. 1 presents summary statistics of the most complete 24-hour intervals of the last three years of DITL root server traces. Figure 1 (right) visually depicts our data collection gaps for UDP (the default DNS transport protocol) and TCP queries to the roots for the last three years. The darker the vertical bar, the more data we had from that instance during that year. The noticeable gaps weaken our ability to compare across years, although some (especially smaller, local) instances may have not received any IPv6 or TCP traffic during the collection interval, i.e., it may not always be a data gap. The IPv6 data gaps were much worse, but we did obtain (inconsistently) IPv6 traces from instances of four root servers (F, H, K, M), all of which showed an increase of albeit low levels of IPv6 traffic over the 2-3 observation periods (see Section 5).
3 Trends in DNS Workload Characteristics
To discover the continental distribution of the clients of each root instance measured, we mapped client IP addresses to their geographic location (continent) using NetAcuity [2]; the locations of the root server instances are available at www.root-servers.org [1]. Not surprisingly, the 3 unicast root servers observed had worldwide usage, i.e., clients from all over the globe. Fifteen (15) of the 19 observed global anycast instances also had globally distributed client populations (exceptions were f-pao1, c-mad1, k-delhi, m-icn (2)). Our observations confirm that anycast is effectively accomplishing its distributive goals, with 42 of the 46 local anycast instances measured serving primarily clients from the continent where they are located (exceptions were f-cdg1, k-frankfurt, k-helsinki, f-sjc1 (3)). We suspect that the few unusual client distributions result from particular BGP routing policies, as reported in Liu et al. [13] and Gibbard [10]. Figure 3 shows fairly consistent and expected growth in mean query rates observed at participating root servers. The geographic distribution of these queries spans the globe and, similar to previous years [13], suggests that anycast at the root servers is performing effectively at distributing load across the now much more globally pervasive root infrastructure.
(1) OARC hosts equipment for researchers who need additional computing resources.
(2) f-pao1 is in Palo Alto, CA; c-mad1 in Madrid, ES; and m-icn in Incheon, South Korea.
(3) f-cdg1 is in Paris, FR, and f-sjc1 in San Jose, CA.
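The locality check described above can be sketched in a few lines of Python. This is a hypothetical sketch rather than the authors' analysis code: the continent labels below are made up for illustration, and the paper derives client locations with NetAcuity, not the toy mapping assumed here.

```python
from collections import Counter

def locality_fraction(client_continents, instance_continent):
    """Fraction of an instance's clients located on the instance's own continent.

    client_continents: one continent code per unique client IP (the paper
    derives these with a geolocation database; any source of labels works).
    """
    counts = Counter(client_continents)
    total = sum(counts.values())
    return counts[instance_continent] / total if total else 0.0

# Hypothetical client mix for a European local anycast instance:
clients = ["EU", "EU", "EU", "AS", "EU", "NA", "EU", "EU"]
print(locality_fraction(clients, "EU"))  # 0.75: mostly same-continent clients
```

A local anycast instance would count as "serving locally" when this fraction dominates, which is what 42 of the 46 local instances exhibit in the paper's data.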
DITL 2007 (roots, 24h)
  IPv4 UDP: instances* C: 4/4, F: 36/40, K: 15/17, M: 6/6; queries: 3.83 B; clients: 2.8 M
  IPv6 UDP: instances* F: 5/40, K: 1/17; queries: 0.2 M; clients: 60
  TCP:      instances* C: 4/4, F: 36/40, K: 14/17, M: 5/6; queries: 0.7 M; clients: 256 K

DITL 2008 (roots, 24h)
  IPv4 UDP: instances* A: 1/1, C: 4/4, E: 1/1, F: 35/41, H: 1/1, K: 16/17, L: 2/2, M: 6/6; queries: 7.99 B; clients: 5.6 M
  IPv6 UDP: instances* F: 10/41, H: 1/1, K: 1/17, M: 4/6; queries: 23 M; clients: 9 K
  TCP:      instances* A: 1/1, E: 1/1, F: 35/41, K: 16/17, M: 5/6; queries: 2.07 M; clients: 213 K

DITL 2009 (roots, 24h)
  IPv4 UDP: instances* A: 1/1, C: 6/6, E: 1/1, F: 35/48, H: 1/1, K: 16/17, L: 2/2, M: 6/6; queries: 8.09 B; clients: 5.8 M
  IPv6 UDP: instances* F: 16/48, H: 1/1, K: 9/17, M: 5/6; queries: 29 M; clients: 16 K
  TCP:      instances* A: 1/1, C: 6/6, E: 1/1, F: 35/48, H: 1/1, K: 16/17, M: 5/6; queries: 3.04 M; clients: 163 K

*observed/total instances
Fig. 1. DITL data coverage for 2007, 2008, 2009. The table summarizes participating root instances, and statistics for the most complete 24-hour collection intervals, including IPv4 UDP, IPv6 UDP, and TCP packets. The plots on the right show data collection gaps for UDP and TCP DNS traffic to the roots for the last three years.
Fig. 2. The geographic distribution of clients querying the root server instances participating in the DITL 2009 (colored according to their continental location). The root server instances are sorted by geographic longitude. Different font styles indicate unicast (green), global anycast (black, bold) and local anycast nodes (black, italic). The figure shows that anycast achieves its goal of localizing traffic, with 42 out of 46 local anycast instances indeed serving primarily clients from the same continent.
Fig. 3. Mean query rate over IPv4 at the root servers participating in DITL from 2006 to 2009. Bars represent average query rates on eight root servers over the four years. The table presents the annual growth rate at participating root servers since 2007. The outlying (41%) negative growth rate for F-root is due to a measurement failure at (and thus no data from) a global F-root (F-SFO) node in 2009.
Fig. 4. DITL distribution of IPv4 UDP queries by types from 2007 to 2009. IPv6related developments caused two notable shifts in 2008: a significant increase in AAAA queries due to the addition of IPv6 glue records to root servers, and a noticeable decrease in A6 queries due to their deprecation.
Figure 4 shows that the most common use of DNS – requesting the IPv4 address for a hostname via A-type queries – accounts for about 60% of all queries every year. More interesting is the consistent growth (at 7 of 8 roots) in AAAA-type queries, which map hostnames to IPv6 addresses, using IPv4 packet transport. IPv6 glue records were added to six root servers in February 2008, prompting a larger jump in 2008 than we saw this year. Many client resolvers, including BIND, will proactively look for IPv6 addresses of NS records, even if they do not have IPv6 configured locally. We further discuss IPv6 in Section 5. Figure 4 also shows a surprising drop in MX queries from 2007 to 2009, which is even more surprising given that the number of clients sending MX queries increased from 0.4 M to 1.4 M over the two data sets. The majority of the moderate-to-heavy-hitter “MX” clients dramatically reduced their per-client MX load on the root system, suggesting that perhaps spammers are getting better at DNS caching.
Fig. 5. Distribution of clients and queries as a function of mean IPv4 query rate order of magnitude for the last three years of DITL data sets (y-axes log scale), showing the persistence of heavy hitters: a few clients (in the two rightmost bins) account for more than 50% of observed traffic. The numbers on the lines are the percentages of queries (upward lines) and clients represented by each bin for DITL 2009 (24-hour) data.
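The rate binning behind Figure 5 can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the bin edges mirror the figure's x-axis, and the 86400-second duration and synthetic query counts are assumptions for illustration.

```python
from collections import Counter

# Bin edges (q/s) and labels matching the figure's order-of-magnitude intervals.
RATE_EDGES = [0.001, 0.01, 0.1, 1.0, 10.0]
RATE_LABELS = ["0-0.001", "0.001-0.01", "0.01-0.1", "0.1-1", "1-10", ">10"]

def rate_bin(total_queries, duration_s=86400):
    """Bin a client by the order of magnitude of its mean query rate (q/s)."""
    rate = total_queries / duration_s
    for edge, label in zip(RATE_EDGES, RATE_LABELS):
        if rate < edge:
            return label
    return RATE_LABELS[-1]

def distributions(queries_per_client, duration_s=86400):
    """Per-bin client counts and total query load, as plotted in Figure 5."""
    clients, queries = Counter(), Counter()
    for q in queries_per_client:
        b = rate_bin(q, duration_s)
        clients[b] += 1
        queries[b] += q
    return clients, queries

# Synthetic example: many quiet clients plus one heavy hitter.
sample = [5] * 1000 + [2_000_000]
c, q = distributions(sample)
print(c[">10"], q[">10"] / sum(q.values()))  # one client carries almost all load
```

Run on the real traces, this kind of binning yields the skew the figure shows: the two lowest-rate bins hold 97.4% of clients but 8.1% of queries.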
Several aspects of client query rates are remarkably consistent across years: the high variation in rate, and the distributions of clients and queries as a function of query rate interval. We first note that nameservers cache responses, including referrals, conserving network resources so that intermediate servers do not need to query the root nameservers for every request. For example, a name server learns that a.gtld-servers.net and others are authoritative for the .com zone, but also learns a time-to-live (TTL) for which this information is considered valid.
Typical TTLs for top-level domains are on the order of 1–2 days. In theory, a caching recursive nameserver only needs to query the root nameservers for an unknown top-level domain or when a TTL expires. However, many previous studies have shown that the root nameservers receive many more queries than they should [23,22,13,7]. Figure 5 shows the distributions of clients and queries binned by average query rate order of magnitude, ranging from 0.001 q/s (queries per second) to >10 q/s. The decreasing lines show the distribution of clients (unique IP addresses) as a function of their mean query rate (left axis), and the increasing lines show the distribution of total query load produced by clients as a function of their mean query rate (right axis). The two bins with the lowest query rates (under 1 query per 100 s) contain 97.4% of the clients, but are only responsible for 8.1% of all queries. In stark contrast, the busiest clients (more than 1 query/sec) are minuscule in number (<0.08%, or 5483 client IPs) but account for 56% of the total query load.

Table 1. The number and fraction of clients, queries, and valid queries in each query rate interval, for a 10% random sample of DITL 2009 clients for each root.

Rate interval   Clients   Queries           Valid queries
<0.001          602 K      23 M ( 2.7%)     8,088 K (47.9%)
0.001-0.01       72 K      49 M ( 5.7%)     5,446 K (32.3%)
0.01-0.1         14 K      79 M ( 9.2%)     2,343 K (13.9%)
0.1-1             3 K     165 M (19.3%)       770 K ( 4.6%)
1-10            565       324 M (37.8%)       206 K ( 1.2%)
>10              71       216 M (25.2%)        32 K ( 0.2%)
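The caching logic just described can be made concrete with a minimal TTL-cache sketch. This is our own illustration, not code from the paper; the 172800-second (2-day) TTL is an assumption chosen to match the typical delegation TTLs discussed above.

```python
import time

class ReferralCache:
    """Minimal TTL cache: why a well-behaved resolver rarely queries the root.

    Once the referral for a TLD (e.g. .com -> a.gtld-servers.net) is cached,
    lookups under that TLD are answered locally until the TTL expires.
    """
    def __init__(self):
        self._entries = {}  # tld -> (nameservers, expiry timestamp)

    def put(self, tld, nameservers, ttl, now=None):
        now = time.time() if now is None else now
        self._entries[tld] = (nameservers, now + ttl)

    def get(self, tld, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(tld)
        if entry and entry[1] > now:
            return entry[0]          # cache hit: no root query needed
        self._entries.pop(tld, None)  # miss or expired: must ask a root server
        return None

cache = ReferralCache()
cache.put("com", ["a.gtld-servers.net"], ttl=172800, now=0)  # assumed 2-day TTL
print(cache.get("com", now=1000))    # within TTL: ['a.gtld-servers.net']
print(cache.get("com", now=200000))  # expired: None, back to the root
```

The gap between this idealized behavior and the observed query load is exactly what makes the pollution results below so striking.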
We next explore the nature of traffic from these hyperbusy clients, which (still) generate mostly DNS pollution in the form of invalid queries. Given the role of caching DNS responses described above, and the far less consistent implementation of caching of negative (NXDOMAIN) results, a high fraction of invalid queries landing at the root is not too surprising – everything else is more consistently cached. Less expected is the extremely high rate of invalid queries, including identical and repeated queries. We believe this behavior is largely due to a combination of firewalls/middleboxes blocking responses and aggressive retransmission implementations at senders behind these firewalls, as described in RFC 4697 [12]. Similar to our previous analyses [23,7], we categorized DNS root pollution into nine groups: (i) unused query class; (ii) A-for-A queries; (iii) invalid TLD; (iv) non-printable characters; (v) queries with '_'; (vi) RFC 1918 PTR [15]; (vii) identical queries; (viii) repeated queries; and (ix) referral-not-cached queries. We classify the remaining queries as legitimate. Since some of the pollution categories require keeping state across the trace, computational limitations prevented us from analyzing pollution for the entire 24-hour traces. Table 1 reflects a set of queries from a random sample of 10% of clients for each root in the 2009 dataset.
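The stateless subset of these categories could be approximated as below. This is our own sketch, not the authors' classifier: the valid-TLD set is a tiny stand-in for the real root zone, the ordering of checks is illustrative, and categories (vii)-(ix) are omitted because they require per-client state across the trace.

```python
import ipaddress
import string

VALID_TLDS = {"com", "net", "org", "edu", "arpa"}  # stand-in for the root zone
PRIVATE_PTR_SUFFIXES = ("10.in-addr.arpa", "168.192.in-addr.arpa")  # RFC 1918 (partial)

def classify_query(qname, qtype, qclass="IN"):
    """Bucket one query into a stateless pollution category, or 'legitimate'."""
    name = qname.rstrip(".")
    if qclass not in ("IN", "CH", "HS", "ANY"):
        return "unused query class"
    try:
        ipaddress.ip_address(name)
        if qtype == "A":
            return "A-for-A"  # asking for the A record of an IP address literal
    except ValueError:
        pass
    if any(c not in string.printable for c in qname):
        return "non-printable characters"
    if qtype == "PTR" and name.endswith(PRIVATE_PTR_SUFFIXES):
        return "RFC 1918 PTR"
    if "_" in qname:
        return "queries with '_'"
    if name.split(".")[-1].lower() not in VALID_TLDS:
        return "invalid TLD"
    return "legitimate (stateless checks only)"

print(classify_query("1.2.3.4", "A"))         # A-for-A
print(classify_query("myrouter.local", "A"))  # invalid TLD
print(classify_query("www.example.com", "A"))
```

Run over a trace, counting the returned labels per client would reproduce the stateless portion of the breakdown shown in Figure 6.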
Fig. 6. Query validity as a function of query rate (2009) of the reduced datasets (queries from a random 10% sample of clients)
Figure 6 reflects this sample set of queries and confirms previous years: over 98% of these queries are pollution. The three rightmost groups in Figure 6 and the corresponding three bottom rows of Table 1, which include moderately and very busy clients, represent less than 0.54% of the client IPs but send 82.3% of the queries observed, with few legitimate queries. A closer look at the pollution class of invalid TLDs (orange bars in Figure 6) reveals that the top 10 most common invalid TLDs have represented 10% of the total(!) query load at the root servers, consistently over the last four years. The most common invalid TLD is always local, followed by (at various rankings within the top 10) generic TLD names such as belkin, lan, home, invalid, domain, localdomain, wpad, corp and localhost, suggesting that misconfigured home routers contribute significantly to the invalid TLD category of pollution.

Table 2. Pollution and total queries of the busiest DITL 2009 clients.

Clients          % of clients   #Pollution/#Total            % of queries
Top 4000         0.07%          4,958 M/4,964 M = 99.9%      61.39%
Top 4000-8000    0.07%            760 M/  762 M = 99.7%       9.42%
Top 8000-32000   0.41%          1,071 M/1,080 M = 99.2%      13.36%
Top 32000        0.55%          6,790 M/6,803 M = 99.8%      84.13%
All clients      100.00%        #Total queries: 8,086 M     100.00%
To explore whether we can safely infer that the 98% pollution in our sample also reflects the pollution level in the complete data set, we examine a different sample: the busiest (“heavy hitter”) clients in the trace. We found that the 32,000 (0.55%) busiest clients accounted for a lower bound of 84% of the pollution queries in the whole trace (Table 2). These busy clients sent on average more
than 1 query every 10 seconds during the 24-hour interval (the 3 rightmost groups in Figures 5 and 6). We also mapped these busy clients to their origin ASes, and found no single AS was responsible for a disproportionate number of either the busy clients or queries issued by those clients. DNS pollution is truly a pervasive global phenomenon. There is considerable speculation on whether the impending changes to the root will increase the levels and proportion of pollution, and the associated impact on performance and provisioning requirements. Again, the DITL data provide a valuable baseline against which to compare future effects.
4 Security-Related Attributes of DNS Clients

We next explore two client attributes related to DNS security and integrity.

4.1 Source Port Randomness
The lack of secure authentication of either the DNS mapping or query process has been well known among researchers for decades, but a discovery last year by Dan Kaminsky [17] broadened consciousness of these vulnerabilities by demonstrating how easy it was to poison (inject false information into) a DNS cache by guessing port numbers on a given connection.4 This discovery rattled the networking and

Fig. 7. CDFs of Source Port Randomness scores across four years of DITL data. Scores <0.62 are classified as Poor, scores in [0.62, 0.86] as Good, and scores >0.86 as Great. DNS source port randomness has increased significantly in the last 4 years, with the biggest jump between 2008 and 2009, likely in response to Kaminsky's demonstration of the effectiveness of port-guessing to poison DNS caches [17].

4 Source port randomness is an important security feature mitigating the risk of different types of spoofing attacks, such as TCP hijacking or TCP reset attacks [20].
S. Castro et al.
operational community, who immediately published and promoted tools and techniques to test and improve the degree of randomization that DNS resolvers apply to DNS source ports. Before Kaminsky's discovery, DITL data indicated that DNS port randomization was typically poor or non-existent [7]. We applied three scores to quantify the evolution of source port randomness from 2006-2009. For each client sending more than 20 queries during the observation interval, we calculated: (i) the number of port number changes/query ratio; (ii) the number of unique ports/query ratio; (iii) bits of randomness as proposed in [21,22]. We then classified scores <0.62 as Poor, scores in the range [0.62, 0.86] as Good, and scores >0.86 as Great. Figure 7 shows some good news: scores improved significantly, especially in the year following Kaminsky's (2008) announcement. In 2009, more than 60% of the clients changed their source port numbers between more than 85% of their queries, which was only the case for about 40% of the clients in 2008 and fewer than 20% in 2007.

4.2 DNSSEC Capability
Although source-port randomization can mitigate the DNS cache poisoning vulnerability inherent in the protocol, it cannot completely prevent hijacking. The longer-term solution proposed for this vulnerability is the IETF-developed DNS Security extensions (DNSSEC) [3] architecture and associated protocols, in development for over a decade but only recently seeing low levels of deployment [19]. DNSSEC adds five new resource record (RR) types: Delegation Signer (DS), DNSSEC Signature (RRSIG), Next-Secure records (NSEC and NSEC3), and DNSSEC key (DNSKEY). DNSSEC also adds two new DNS header flags: Checking Disabled (CD) and Authenticated Data (AD). The protocol extensions support signing zone files and responses to queries with cryptographic keys. Because the architecture assumes a single anchor of trust at the root of the naming hierarchy, pervasive DNSSEC deployment is blocked on cryptographically signing the root zone. Due to the distributed and somewhat convoluted nature of control over the root zone, this development has lagged expectations, but after considerable pressure and growing recognition of the potential cost of DNS vulnerabilities to the global economy, the U.S. government, ICANN, and Verisign are collaborating to get the DNS root signed by 2010. A few countries, including Sweden and Brazil, have signed their own ccTLDs in spite of the root not being signed yet, which has put additional pressure on those responsible for signing the root. Due to the way DNSSEC works, clients will not normally issue queries for DNSSEC record types; rather, these records are automatically included in responses to normal query types, such as A, PTR, and MX. Rather than count queries from the set of DNSSEC types, we explore two other indicators of DNSSEC capability across the client population. First we analyse the presence of EDNS support, a DNS extension that allows longer responses, required to implement DNSSEC.
We also know that if an EDNS-capable query has its DO bit set, the sending client is DNSSEC-capable. By checking the presence and value of the OPT RR pointer, we classify queries and clients into three groups:
Fig. 8. Growth of EDNS support (needed for DNSSEC) measured by DNS queries, especially between 2007 and 2008. In 2009, over 90% of the EDNS-capable queries are also DO enabled, i.e., advertising DNSSEC capability.

Fig. 9. Decrease in EDNS support measured by clients. In contrast to the query evolution, the fraction of EDNS-enabled clients has dropped since 2007. Worse news for DNSSEC, in 2009 only around 60% of the observed EDNS clients were DO enabled, i.e., DNSSEC-capable.
(i) no EDNS; (ii) EDNS version 0 (EDNS0) without the DO bit set; and (iii) EDNS0 with the DO bit set. A fourth type of client is mixed, i.e., an IP address that sources some, but not all, queries with EDNS support. Figure 8 shows clear growth in EDNS support as measured by queries, particularly from 2007 to 2008. Even better news, over 90% of the observed EDNS-capable queries were DO-enabled in 2009. This high level of support for DNSSEC seemed like good news, until we looked at EDNS support in terms of client IP addresses. Figure 9 shows that the fraction of EDNS-capable clients has actually decreased over the last several years, by almost 20%! In 2009, fewer than 30% of clients supported EDNS, and of those only around 60% included DO bits indicating actual DNSSEC capability. We hypothesized that the heavy hitter (hyperbusy) clients had something to do with this disparity, so we grouped clients according to query rate as in Section 3. Figure 10 shows that EDNS support for clients sending few queries dropped significantly after 2007, while busy clients have increased EDNS support. In our 2009 data set, more than half of the EDNS queries were generated by the fewer than 0.1% of clients in the two rightmost categories, sending more than 1 query/sec (cf. Figure 5). Since we have already determined that these busiest clients generate almost no legitimate DNS queries, we conclude that most of the DNSSEC-capable queries are in pollution categories. The category of clients with mixed EDNS support represents 7% (or 396K) of the unique sources in the 2009 dataset. We identified two reasons why clients can show mixed support: (i) several hosts can hide behind the same IP address (e.g., NAT); and (ii) EDNS fallback, i.e., clients fail to receive responses to queries with EDNS support, so they fall back to "vanilla" DNS and retry once more without
Fig. 10. Plotting EDNS support vs. query rate reveals that EDNS support is increasing for busy clients, who mainly generate pollution, but has declined substantially for low-frequency (typical) clients.
EDNS support. A test on a sample of 72K (18%) of the mixed EDNS clients showed that EDNS fallback patterns account for 36% of the mixed clients. EDNS also provides a mechanism to allow clients to advertise UDP buffer sizes larger than the default maximum size of 512 bytes [14]. Traditionally, responses larger than 512 bytes had to be sent using TCP, but EDNS signaling enables transmission of larger responses using UDP, avoiding the potential cost of a query retry using TCP. Figure 11 shows the UDP buffer size value distribution found in the queries signaling EDNS support. Only four different values are observed: 512 bytes was the default maximum buffer size for DNS responses before the introduction of EDNS in RFC 2671 [18]; 1280 bytes is a value suggested for Ethernet networks to avoid fragmentation; 2048 bytes was the default value for certain versions of BIND and derived products; and 4096 bytes is the maximum value permitted by most implementations. Figure 11 reveals a healthy increase in the use of the largest buffer size of 4096 bytes (from around 50% in 2006 to over 90% in 2009), which happened at the expense of queries with a 2048-byte buffer size. The fraction of queries using a 512-byte buffer size is generally below 5%, although it varies over the years, with no consistent pattern across roots. One of the deployment concerns surrounding DNSSEC is that older traffic filtering appliances, firewalls, and other middleboxes may drop DNS packets larger than 512 bytes, forcing operators to manually set the EDNS buffer size to 512 to overcome this limitation. These middleboxes are harmful to the deployment of DNSSEC, since small buffer sizes
Fig. 11. Another capability provided by EDNS is signaling of UDP buffer sizes. For the queries with EDNS support, we analyze the buffer size announced. An increase from 50% to 90% in the largest size can be observed from 2006 to 2009.
combined with the signaling of DNSSEC support (by setting the DO bit on) could increase the amount of TCP traffic due to retries.
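To make the EDNS/DO classification concrete, the sketch below parses a DNS query in wire format using only the standard library and reports its EDNS status. It is an illustrative stand-in for the paper's analysis tooling: the query builder and its field values are assumptions, and the parser handles only a plain query (one question, no answer/authority records, no name compression), the common case at the roots.

```python
import struct

def classify_query(wire):
    """Classify a DNS query in wire format by EDNS support and DO bit.

    Returns "no EDNS", "EDNS0", or "EDNS0+DO".
    """
    _id, _flags, qd, _an, _ns, ar = struct.unpack("!6H", wire[:12])
    pos = 12
    for _ in range(qd):                   # skip the question section
        while wire[pos] != 0:             # walk the label chain
            pos += 1 + wire[pos]
        pos += 1 + 4                      # root label + QTYPE/QCLASS
    for _ in range(ar):                   # look for an OPT pseudo-RR
        rtype, _payload, ttl, rdlen = struct.unpack("!HHIH", wire[pos+1:pos+11])
        if rtype == 41:                   # 41 = OPT (RFC 2671); _payload is
            version = (ttl >> 16) & 0xFF  # the advertised UDP buffer size
            do_bit = bool(ttl & 0x8000)   # DO is the top bit of the flags
            if version == 0:
                return "EDNS0+DO" if do_bit else "EDNS0"
        pos += 11 + rdlen
    return "no EDNS"

def make_query(do=False):
    """Build a minimal A query for example.com carrying an EDNS0 OPT record."""
    header = struct.pack("!6H", 0x1234, 0x0100, 1, 0, 0, 1)
    question = b"\x07example\x03com\x00" + struct.pack("!HH", 1, 1)
    ttl = 0x8000 if do else 0             # ext-rcode=0, version=0, DO flag
    opt = b"\x00" + struct.pack("!HHIH", 41, 4096, ttl, 0)
    return header + question + opt

print(classify_query(make_query(do=True)))  # → EDNS0+DO
```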
5 A First Look at DNS IPv6 Data
Proposed as a solution for IPv4 address exhaustion, IPv6 supports a vastly larger number of endpoint addresses than IPv4, although like DNSSEC its deployment has languished. As of November 2009, eight of the thirteen root servers have been assigned IPv6 addresses [1]. The DITL 2009 datasets are the first with significant (but still quite inconsistent) IPv6 data collection, from four root servers. Table 3 shows IPv6 statistics for the one instance of K-root (in Amsterdam) that captured IPv6 data, without huge data gaps in the collection, for the last three years. Both the IPv6 query count and unique client count are much lower than for IPv4, although growth in both IPv6 queries and clients is evident. Geolocation of DITL 2009 clients reveals that at least 57.9% of the IPv6 clients querying this global root instance are from Europe [16], not surprising since this

Table 3. IPv4 vs. IPv6 traffic on the K-AMS-IX root instance over three DITL years

K-AMS-IX, k-root     2007            2008              2009
                  IPv4    IPv6    IPv4    IPv6     IPv4      IPv6
Query Count       248 M   39 K    170 M   8.21 M   277.56 M  9.96 M
Unique Clients    392 K   48      340 K   6.17 K   711 K     9 K
instance is in Europe, where IPv6 has had significant institutional support. The proportion of legitimate IPv6 queries (vs. pollution) is 60%, far higher than for IPv4, likely related to its extremely low deployment [4,11].
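The per-family client counts in Table 3 can be reproduced with a short sketch using the standard library; the addresses below are hypothetical documentation-range examples, not trace data.

```python
import ipaddress
from collections import Counter

def count_by_family(client_ips):
    """Count unique clients per IP family, as tabulated in Table 3."""
    return Counter(
        "IPv6" if ipaddress.ip_address(ip).version == 6 else "IPv4"
        for ip in set(client_ips)
    )

ips = ["192.0.2.1", "192.0.2.1", "2001:db8::1", "198.51.100.7", "2001:db8::2"]
fam = count_by_family(ips)
print(fam["IPv4"], fam["IPv6"])  # → 2 2
```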
6 Lessons Learned
The Domain Name System (DNS) provides critical infrastructure services necessary for proper operation of the Internet. Despite the essential nature of the DNS, long-term research and analysis in support of its performance, stability, and security is extremely sparse. Indeed, the biggest concern with the imminent changes to the DNS root zone (DNSSEC, new TLDs, and IPv6) is the lack of data with which to evaluate our preparedness, performance, or problems before and throughout the transitions. The DITL project is now four years old, with more participants and types of data each year across many strategic links around the globe. In this paper we focused on a limited slice: the most detailed characterization of traffic to as many DNS root servers as possible, seeking macroscopic insights to illuminate the impending architectural changes to the root zone. We validated previous results on the extraordinarily high levels of pollution at the root nameservers, which continues to constitute the vast majority of observed queries to the roots. We presented new results on security-related attributes of the client population: an increase in the prevalence of DNS source port randomization, and a surprising decreasing trend in the fraction of DNSSEC-capable clients, which serve as a motivating if disquieting baseline for the impending transition to DNSSEC. From a larger perspective, we have gained insights and experience from these global trace collection experiments, which inspire recommended improvements to future measurements that will help optimize the quality and integrity of data in support of answering critical questions in the evolving Internet landscape. We categorize our lessons into three categories: data collection, data management, and data analysis.

Lessons in Data Collection. Data collection is hard.
Radically distributed Internet data collection across every variety of administrative domain, time zone, and legislative framework around the globe is in “pray that this works” territory. Even though this was our fourth year, we continued to fight clock skew, significant periods of data loss, incorrect command line options, dysfunctional network taps, and other technical issues. Many of these problems we cannot find until we analyze the data. We rely heavily on pcap for packet capture and have largely assumed that it does not drop a significant number of packets during collection. We do not know for certain if, or how many, packets are lost due to overfull buffers or other external reasons. Many of our contributors use network taps or SPAN ports, so it is possible that the server receives packets that our collector
does not. Next year we are considering encoding the pcap stats() output as special "metadata" packets at the end of each file.

For future experiments, we also hope to pursue additional active measurements to improve data integrity and support deeper exploration of questions, including sending timestamp probes to root server instances during the collection interval to test for clock skew, fingerprinting heavy-hitter clients for additional information, and probing to assess the extent of DNSSEC support and IPv6 deployment of root server clients. We recommend community workshops to help formulate questions to guide others in conducting potentially broader "Day-in-the-Life" global trace collection experiments [6].

Lessons in Data Management. As DITL grows in number and type of participants, it also grows in its diversity of data "formatting". Before any analysis can begin, we spend months fixing and normalizing the large data set. This curation includes: converting from one type of compression (lzop) to another (gzip), accounting for skewed clocks, filling in gaps of missing data from other capture sources5, ensuring packet timestamps are strictly increasing, ensuring pcap files fall on consistent boundaries and are of a manageable size, removing packets from unwanted sources6, separating data from two sources that are mixed together7, removing duplicate data8, stripping VLAN tags, giving the pcap files a consistent data link type, and removing bogus entries from truncated or corrupt pcap files. Next, we merge and split pcap files again to facilitate subsequent analysis. The establishment of DNS-OARC also broke new (although not yet completely arable) ground for disclosure control models for privacy-protective data sharing. These contributions have already transformed the state of DNS research and data-sharing, and if sustained and extended, they promise to dramatically improve the quality of the lens with which we view the Internet as a whole.
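One of the curation steps mentioned here, verifying that packet timestamps are strictly increasing, can be sketched as follows. This is an illustrative stand-in for the project's actual tooling: it handles only the classic little-endian, microsecond-resolution pcap format, and the two-packet capture is synthetic.

```python
import struct

def timestamps_strictly_increasing(pcap_bytes):
    """Check that packet timestamps in a classic pcap file never go backwards.

    Handles only little-endian, microsecond-resolution pcap (magic
    0xa1b2c3d4); real curation tooling covers the other variants too.
    """
    magic, = struct.unpack("<I", pcap_bytes[:4])
    if magic != 0xA1B2C3D4:
        raise ValueError("not a little-endian classic pcap")
    pos, last = 24, float("-inf")        # skip the 24-byte global header
    while pos + 16 <= len(pcap_bytes):
        sec, usec, incl_len, _orig = struct.unpack("<IIII",
                                                   pcap_bytes[pos:pos + 16])
        ts = sec * 1_000_000 + usec      # compare in integer microseconds
        if ts <= last:
            return False
        last, pos = ts, pos + 16 + incl_len
    return True

# Synthetic two-packet capture (empty packet bodies) for demonstration.
GLOBAL_HDR = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
rec = lambda sec, usec: struct.pack("<IIII", sec, usec, 0, 0)
print(timestamps_strictly_increasing(GLOBAL_HDR + rec(1, 0) + rec(2, 0)))  # → True
```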
But methodologies for curating, indexing, and promoting use of data could always use additional evaluation and improvement. Dealing with extremely large and privacy-sensitive data sets remotely is always a technical as well as policy challenge.

Lessons in Data Analysis. We need to increase the automatic processing of basic statistics (query rate and type, topological coverage, geographic characteristics) to facilitate overview of traces across years. We also need to extend our tools to further analyze IPv6, DNSSEC, and non-root server traces to promote understanding of and preparation for the evolution of the DNS.

5 In 2009, for example, one contributor used dnscap, but their scripts stopped working. They also captured data using WDCAP and were able to fill in some gaps, but the WDCAP data files were not boundary-aligned with the missing pcap files.
6 Another contributor included packets from their nearby non-root nameservers.
7 In 2008, we received data from A-root and Old-J-Root as a single stream.
8 In 2007-09, at least one contributor mistakenly started two instances of the collection script.
References

1. List of root servers, http://www.root-servers.org/ (accessed 2009.11.20)
2. NetAcuity, http://www.digital-element.net (accessed 2009.11.20)
3. Arends, R., Austein, R., Larson, M., Massey, D., Rose, S.: DNS Security Introduction and Requirements. RFC 4033 (2005)
4. CAIDA: Visualizing IPv6 AS-level Internet Topology (2008), http://www.caida.org/research/topology/as_core_network/ipv6.xml (accessed 2009.11.20)
5. CAIDA and DNS-OARC: A Report on DITL Data Gathering (January 9-10, 2007), http://www.caida.org/projects/ditl/summary-2007-01/ (accessed 2009.11.20)
6. CAIDA/WIDE: What Researchers Would Like to Learn from the DITL Project (2008), http://www.caida.org/projects/ditl/questions/ (accessed 2009.11.20)
7. Castro, S., Wessels, D., Fomenkov, M., claffy, k.: A Day at the Root of the Internet. ACM SIGCOMM Computer Communication Review (CCR) (2008)
8. National Research Council: Looking over the Fence: A Neighbor's View of Networking Research. National Academies Press, Washington (2001)
9. DNS-OARC: DNS-DITL 2009 Participants, https://www.dns-oarc.net/oarc/data/ditl/2009 (accessed 2009.11.20)
10. Gibbard, S.: Observations on Anycast Topology and Performance (2007), http://www.pch.net/resources/papers/anycast-performance/anycast-performance-v10.pdf (accessed 2009.11.20)
11. Karpilovsky, E., Gerber, A., Pei, D., Rexford, J., Shaikh, A.: Quantifying the Extent of IPv6 Deployment. In: Moon, S.B., Teixeira, R., Uhlig, S. (eds.) PAM 2009. LNCS, vol. 5448, pp. 13–22. Springer, Heidelberg (2009)
12. Larson, M., Barber, P.: Observed DNS Resolution Misbehavior. RFC 4697 (2006)
13. Liu, Z., Huffaker, B., Brownlee, N., claffy, k.: Two Days in the Life of the DNS Anycast Root Servers. In: Uhlig, S., Papagiannaki, K., Bonaventure, O. (eds.) PAM 2007. LNCS, vol. 4427, pp. 125–134. Springer, Heidelberg (2007)
14. Mockapetris, P.: Domain Names - Implementation and Specification. RFC 1035 (1987)
15. Rekhter, Y., Moskowitz, B., Karrenberg, D., de Groot, G.J., Lear, E.: Address Allocation for Private Internets. RFC 1918 (1996)
16. Team Cymru: IP to ASN Mapping, http://www.team-cymru.org/Services/ip-to-asn.html (accessed 2009.11.20)
17. US-CERT: Vulnerability Note VU#800113: Multiple DNS Implementations Vulnerable to Cache Poisoning, http://www.kb.cert.org/vuls/id/800113 (accessed 2009.11.20)
18. Vixie, P.: Extension Mechanisms for DNS (EDNS0). RFC 2671 (1999)
19. Vixie, P.: Reasons for Deploying DNSSEC (2008), http://www.dnssec.net/why-deploy-dnssec (accessed 2009.11.20)
20. Watson, P.: Slipping in the Window: TCP Reset Attacks (2004), http://osvdb.org/ref/04/04030-SlippingInTheWindow_v1.0.doc (accessed 2009.11.20)
21. Wessels, D.: DNS Port Randomness Test, https://www.dns-oarc.net/oarc/services/dnsentropy (accessed 2009.11.20)
22. Wessels, D.: Is Your Caching Resolver Polluting the Internet? In: ACM SIGCOMM Workshop on Network Troubleshooting (NetTs 2004) (2004)
23. Wessels, D., Fomenkov, M.: Wow, That's a Lot of Packets. In: Passive and Active Measurement Workshop (PAM 2002), Fort Collins, USA (2002)
Characterizing Traffic Flows Originating from Large-Scale Video Sharing Services

Tatsuya Mori, Ryoichi Kawahara, Haruhisa Hasegawa, and Shinsuke Shimogawa

NTT Research Laboratories, 3-9-11 Midoricho, Musashino-city, Tokyo 180-8585, Japan
Abstract. This work attempts to characterize network traffic flows originating from large-scale video sharing services such as YouTube. The key technical contributions of this paper are twofold. We first present a simple and effective methodology that identifies traffic flows originating from video hosting servers. The key idea behind our approach is to leverage the naming/addressing conventions used in large-scale server farms. Next, using the identified video flows, we investigate the characteristics of network traffic flows of video sharing services from a network service provider view. Our study reveals the intrinsic characteristics of the flow size distributions of video sharing services. The origin of these intrinsic characteristics is rooted in the differentiated services provided for free and premium memberships of the video sharing services. We also investigate temporal characteristics of video traffic flows.
1 Introduction

Recent growth in large-scale video sharing services such as YouTube [19] has been tremendously significant. These services are estimated to facilitate hundreds of thousands of newly uploaded videos per day and support hundreds of millions of video views per day. The great popularity of these video sharing services has even led to a drastic shift in the Internet traffic mix. Ref. [5] reported that the share of P2P traffic dropped to 51% at the end of 2007, down from 60% the year before, and that the decline in this traffic share is due primarily to an increase in traffic from web-based video sharing services. We envision that this trend will potentially keep growing; thus, managing the high demand for video services will continue to be a challenging task for both content providers and ISPs. On the basis of these observations, this work attempts to characterize the network traffic flows originating from large-scale video sharing services as the first step toward building a new data-centric network that is suitable for delivering numerous varieties of video services. We target currently prominent video sharing services: YouTube in the US, Smiley videos in Japan [16], Megavideo in Hong Kong [12], and Dailymotion in France [6]. Our analysis is oriented from the perspective of a network service provider, i.e., we aim to characterize the traffic flows from the viewpoints of resident ISPs or other networks that are located at the edges of the global Internet. Our first contribution is identifying traffic flows that originate from several video sharing services. The advantage of our approach lies in its simplicity. It uses source

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 17–31, 2010.
© Springer-Verlag Berlin Heidelberg 2010
T. Mori et al.
IP addresses as the key for identification. To compile a list of IP addresses associated with video sharing services, we analyze a huge amount of access logs collected at several web cache servers. The key idea behind our approach is to leverage the naming/addressing conventions used by large-scale server farms. In many cases, web servers for hosting videos and those for other functions, such as managing text, images, or applications, are isolated. These servers are assigned different sets of IP prefixes, which are often associated with intrinsic hostnames, e.g., "img09.example.com" is likely used for servers that serve image files. We also leverage open recursive DNS servers to associate the extracted hostnames of video hosting servers with their globally distributed IP addresses. Our second contribution is revealing the intrinsic characteristics of video sharing services, which are not covered by conventional web traffic models. The origin of the characteristics is based on the differentiated services provided for free and premium memberships of the video sharing services. We also investigate temporal characteristics of video traffic flows. The remainder of this paper is structured as follows. Section 2 describes the measurement data set we used in this study. We present our classification techniques in Section 3. We then analyze the workload of video sharing services, using the identified traffic flows originating from video hosting servers, in Section 4. Section 5 presents related work. Finally, Section 6 concludes this paper.
2 Data Description

This section describes the two data sets we used in this study. The first data set was web proxy logs, which enable us to collect the IP addresses of video hosting servers used by video sharing services. The second data set was network traffic flows, which enable us to investigate the characteristics of the workload of video sharing services.

2.1 Web Cache Server Logs

We used the IRCache data set [10], which consists of web cache server logs open to the research community. We used the access logs collected from 7 root cache servers located in cities throughout the US. The access logs were collected in September 2009. Since the client IP addresses were anonymized for privacy protection, and the randomization seeds are different among files, we could not count the cumulative number of unique client IP addresses that used the cache servers. We noted, however, that a typical one-day log file for a location consisted of 100-200 unique client IP addresses, which include both actual clients and peered web cache servers deployed by other institutes. Assuming there were no overlaps of client IP addresses among the web cache servers, the total number of unique client IP addresses seen on September 1, 2009 was 805, which was large enough to merit statistical analysis. The one-month web cache logs consisted of 118 M web transactions in total. The 118 M transactions account for 7.8 TB of traffic volume. 89 M transactions returned the HTTP status code "200 OK", and these successfully completed transactions account for 6.2 TB of traffic flows processed on the web cache servers.
Characterizing TraÆc Flows Originating from Large-Scale Video Sharing Services
19
2.2 Network Flow Data

In this work, we define a flow as a unique combination of source/destination IP address, source/destination port number, and protocol. We used network flow data that were collected at an incoming 10-Gbps link of a production network. For each flow, its length in seconds and size in bytes were recorded. The measurement was conducted for 9.5 hours on a weekday in the first quarter of 2009. The format of the network flow data set is: , where " " and "" are the created and modified times of a flow, and "" and "" are the number of packets and bytes of a flow, respectively. " " is the randomized destination (client) IP address. The 5-tuple, composes a flow. The total amount of incoming traffic carried during the measurement period was 4.4 TB, which corresponded to a mean offered traffic rate of 1.03 Gbps. The incoming traffic consisted of 108 M distinct flows that originated from 5.5 M sender IP addresses to 34 K receiver (client) IP addresses. Of these, 40.6 M were incoming web flows. The traffic volume of the web flows was 1.8 TB (mean offered traffic rate of 0.42 Gbps).
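The flow definition above (a unique 5-tuple) can be sketched as follows; the packet records and field layout are hypothetical, not the paper's actual data format.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets into flows keyed by the 5-tuple:
    (src IP, dst IP, src port, dst port, protocol).

    packets: iterable of (src, dst, sport, dport, proto, nbytes) records
    (hypothetical format for illustration).
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, sport, dport, proto, nbytes in packets:
        key = (src, dst, sport, dport, proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += nbytes
    return dict(flows)

pkts = [("203.0.113.5", "192.0.2.9", 80, 51000, "tcp", 1500),
        ("203.0.113.5", "192.0.2.9", 80, 51000, "tcp", 900),
        ("198.51.100.2", "192.0.2.9", 443, 51001, "tcp", 1200)]
flows = aggregate_flows(pkts)
print(len(flows))  # → 2
```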
3 Extracting Sources of Video Flows

We now present the techniques for identifying video flows among the network flow data set. We use a source IP address as the key for identification, i.e., if an incoming flow originates from an IP address associated with a video sharing service, we identify the flow as a video flow. As mentioned earlier, the key idea of this approach is leveraging the naming/addressing conventions used by large-scale web farms, where servers are grouped by their roles, e.g., hosting a massive amount of large video files, hosting a massive number of thumbnails, or providing rich web interfaces. We first show that a video sharing service uses distinct hostnames for each subtype of HTTP content-type, i.e., video, text, image, and application. We then collect hostnames of video hosting servers. Finally, we compile the IP addresses that are associated with the hostnames.

3.1 Classifying Hostnames with Subtypes of HTTP Content-Type

This section studies the naming conventions used in large-scale video sharing services and shows that distinct hostnames are used for each sub-category of HTTP content-type. The web cache server logs are used for the analysis. We also study the basic properties of the objects for each category. We start by looking for a primary domain name for the video sharing service of interest, e.g., YouTube. More specifically, we compare a hostname recorded in the web cache logs with that domain name to see if the hostname in the URL matches the regular expression in perl-derivative, . If we see a match, we regard the object as one associated with YouTube. Although we use the YouTube domain as an example in the following, other prominent video sharing services today can also be explored in a similar way. For brevity, only the results for those services will be shown later.
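The paper's exact regular expression is elided in the text above; the sketch below is an assumed reconstruction of the matching step, treating a hostname as belonging to a service when the primary domain is its suffix.

```python
import re

def belongs_to_service(hostname, primary_domain="youtube.com"):
    """Assumed reconstruction: match the primary domain as a dot-separated
    suffix of the hostname (case-insensitive)."""
    pattern = re.compile(r"(^|\.)" + re.escape(primary_domain) + r"$", re.I)
    return bool(pattern.search(hostname))

for h in ["www.youtube.com", "v3.lscache1.c.youtube.com",
          "notyoutube.com.evil.example"]:
    print(h, belongs_to_service(h))
```

The suffix anchoring matters: a naive substring match would wrongly accept hostnames like "notyoutube.com.evil.example".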
Table 1. Statistics of Content-types in web transactions associated with YouTube

Content-type   No. of transactions   Total volume   Mean size
video               160,180            681 GB        4.3 MB
image                48,756            157 MB        3.2 KB
text                458,357            4.3 GB        9.5 KB
application         109,743            359 MB        3.3 KB
other                35,021             23 MB        678 B
Fig. 1. CDFs of object size for each content-type
We analyzed the 89 M successfully processed HTTP transactions, and found 812 K transactions were associated with YouTube. The total volume of these transactions was 686 GB in the one-month logs. The frequency of content-types for the transactions to the YouTube domain is summarized in Table 1. A content-type is defined as an Internet media type, which is used by several protocols such as SMTP, HTTP, RTP, and SIP. Examples of observed sub-types for each content-type are x-flv/mp4 (video), jpeg/gif (image), html/xml (text), and x-shockwave-flash/javascript (application). This variety of content files together forms the video sharing service. Notice that more than 80% of YouTube web transactions carry non-video data. This indicates that these non-video data are crucial factors from the viewpoint of processing overhead rather than transport overhead. For instance, from the perspective of a network service provider, these non-video transactions consume a lot of resources on network middleboxes, such as firewalls or NATs, which need to keep track of connections. Next, we looked at the correlation between the content-type and size of an object. In addition to the total volume and mean size of objects, we plot the cumulative distribution functions (CDFs) of object sizes for each content-type (see Fig. 1). Notice that the majority of image, text, and application objects are small. For instance, 99% of image and application objects are less than 10 KB, and 99% of text objects are less than 30 KB. In contrast, the size of video objects is heavy-tailed, ranging over 6 orders of magnitude, from less than 1 KB to 500 MB. We note that very small video objects, e.g., less than 20 KB, are likely due to partial or incomplete transmission.
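The per-content-type statistics in Table 1 (count, total volume, mean object size) can be computed with a short sketch; the transaction records below are hypothetical stand-ins for the web cache log entries.

```python
from collections import defaultdict

def size_stats(transactions):
    """Per-content-type count, total volume, and mean object size,
    as tabulated in Table 1.

    transactions: iterable of (content_type, size_bytes) pairs
    (hypothetical input format).
    """
    sizes = defaultdict(list)
    for ctype, size in transactions:
        sizes[ctype].append(size)
    return {c: {"count": len(v), "total": sum(v), "mean": sum(v) / len(v)}
            for c, v in sizes.items()}

txns = [("video", 4_000_000), ("video", 12_000_000), ("image", 3_000),
        ("text", 9_000), ("application", 3_500)]
stats = size_stats(txns)
print(stats["video"]["mean"])  # → 8000000.0
```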
Table 2. Number of distinct hostnames observed for YouTube transactions with HTTP status code of 200 or 206

Content-type   # of hostnames
video          490
image          63
text           23
application    5

Fig. 2. Top 5 hostnames by number in YouTube web transactions for each content-type
3.2 Collecting Hostnames of Server Farm

Using the naming convention studied in the previous subsection, we aim to compile a list of hostnames of video hosting servers from the web cache server logs. Processing the data set is straightforward; however, we note that it is necessary to cope with the side effects of irregular patterns such as the HTTP status code "204 No Content", which is used in cases where the request was successfully processed but the response does not have a message body [10]. To avoid this, we prune the transactions that have HTTP status codes other than "200 OK" or "206 Partial Content", which indicate a successful transaction and the response to a request for a subset of an object's data, respectively. This heuristic eliminates the cases where video hosting servers return text/html content with a "204 No Content" code, a "303" redirection code, or other error codes. The statistics of the collected hostnames for YouTube are shown in Table 2. While video objects are served by a large number of servers, objects of the other content-types are served by a small set of servers. Next, we focus on the number of web transactions per hostname. The top five hostnames for each content-type are shown in Fig. 2. Clearly, the hostnames of the video hosting servers are different from those for the other content-types. We also notice that for video hosting servers, the number of accesses is balanced among the top five servers; this indicates that the video hosting servers are likely accessed through load-balancing mechanisms. We next extract the naming rule of the video hosting servers from the collected hostnames. Table 3 shows the hostnames of the video hosting servers of YouTube, where the asterisks represent numeric variables. In compiling the list, we complement the missing numbers in hostnames. For instance, if we observe two hostnames in a numbered sequence but not one that lies between them, we conjecture that the missing hostname was likely not captured during data measurement and add it to the list. We also test whether the complemented hostname has a valid DNS A record(s). In total, the generalized hostnames of YouTube video hosting servers contribute 998 distinct hostnames.
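The complementation heuristic can be sketched as follows. This is a simplified illustration: the regular expression captures only the first numeric field of a hostname, and the DNS A-record validation step is indicated only as a comment rather than performed.

```python
import re

def complement_hostnames(observed):
    """Fill gaps in numbered hostname sequences: if vN.<suffix> is seen
    for several N, add the missing N in between. In the paper's method a
    complemented candidate is kept only if it has a valid DNS A record."""
    groups = {}  # (prefix, suffix) -> set of observed numbers
    for host in observed:
        m = re.match(r"^(\D*)(\d+)(\..+)$", host)
        if m:
            prefix, num, suffix = m.group(1), int(m.group(2)), m.group(3)
            groups.setdefault((prefix, suffix), set()).add(num)
    complemented = set(observed)
    for (prefix, suffix), nums in groups.items():
        for n in range(min(nums), max(nums) + 1):
            # in practice: keep only if a DNS A lookup for the name succeeds
            complemented.add(f"{prefix}{n}{suffix}")
    return sorted(complemented)

hosts = complement_hostnames(["v1.lscache1.c.youtube.com",
                              "v3.lscache1.c.youtube.com"])
```

Here the unobserved v2.lscache1.c.youtube.com is conjectured and added, mirroring the heuristic in the text.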
22
T. Mori et al.
Table 3. Generalized hostnames of YouTube video hosting servers and number of observed transactions for each class of hostname

Hostnames                     Complemented range        # of observed transactions
v*.lscache*.c.youtube.com     1–24, 1–8                 130,286
v*.cache*.c.youtube.com       1–8, 1–8                  27,485
tc.v*.cache*.c.youtube.com    1–24, 1–8                 1,626
v*.nonxt*.c.youtube.com       1–24, 1–8                 25
lax-v*.lax.youtube.com        1–308                     19
sjl-v*.sjl.youtube.com        1–50 (with exceptions)    19
Table 4. Generalized hostnames of video hosting servers

Service         Hostnames                   Complemented range
Smiley videos   smile-com**.nicovideo.jp    0–6, 0–3
Smiley videos   smile-cub**.nicovideo.jp    0–6, 0–3
Megavideo       www*.megavideo.com          * can be any positive integer
Dailymotion     proxy-**.dailymotion.com    0–9, 0–9
We note that the primary classes of hostnames are significantly biased toward the top two classes, i.e., "v*.lscache*.c.youtube.com" and "v*.cache*.c.youtube.com". The number of hostnames for these classes is 256 (= 24 × 8 + 8 × 8). In the rest of this work, we will use these 256 names as the primary hostnames of YouTube. Using these techniques, we extracted generalized hostnames of video hosting servers for the other video sharing services. The results are summarized in Table 4. Although naming conventions differ among the services, each service uses its own consistent naming rule. We finally note that these lists are snapshots and should be updated periodically.

3.3 Extracting Global IP Addresses

This section aims to extract the global IP addresses that are associated with the hostnames collected in the previous subsection. It is well known that large-scale server farms such as YouTube and Akamai typically use a large number of global IP addresses that are associated with a smaller set of hostnames. This methodology is aimed at balancing the load of globally distributed accesses across the server farms [9]. This addressing convention enables us to associate global IP addresses with the particular hostnames that are used for video hosting. Because of these spatially distributed load-balancing mechanisms, an IP address obtained by looking up the DNS A record of a hostname can differ by location. For example, the Akamai CDN tweaks the DNS mechanism to select the web server closest to a client. Similarly, YouTube uses an HTTP redirection mechanism to introduce load balancing in a dynamic way [20]. Thus, we need to perform globally distributed DNS resolutions to compile a list of IP addresses associated with the list of hostnames. To achieve this objective, we adopted a methodology proposed by Huang et al. in Ref. [9]. They performed globally distributed DNS resolution of 16 M unique web
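The 256 primary hostnames can be enumerated directly from the two dominant classes and their complemented ranges in Table 3. In this sketch, the exact placement of the two numeric variables within the names is an assumption inferred from the observed pattern v*.lscache*.c.youtube.com.

```python
def primary_youtube_hostnames():
    """Enumerate the two dominant hostname classes:
    v{1..24}.lscache{1..8}.c.youtube.com and
    v{1..8}.cache{1..8}.c.youtube.com, giving 24*8 + 8*8 = 256 names."""
    names = [f"v{i}.lscache{j}.c.youtube.com"
             for i in range(1, 25) for j in range(1, 9)]
    names += [f"v{i}.cache{j}.c.youtube.com"
              for i in range(1, 9) for j in range(1, 9)]
    return names

names = primary_youtube_hostnames()
```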
Table 5. Number of IP addresses for video hosting servers

Service         # of addresses
YouTube         2,138
Smiley videos   74
Megavideo       670
Dailymotion     100
hostnames to obtain a complete list of DNS CNAMEs for Akamai CDN servers, which could be used to roughly estimate the scale of the Akamai infrastructure. The key idea of their approach is to leverage open recursive DNS (ORDNS) servers, which resolve DNS queries for any client from anywhere; this enables us to obtain a global view of the system from a single Internet edge site. We use an approach similar to that of Ref. [9] to compile the list of ORDNS servers. In total, we collected more than 5,000 ORDNS servers distributed across 68 countries. For each hostname shown in Tables 3 and 4, we performed DNS resolutions from all the ORDNS servers we collected and compiled the resolved answers. The results are summarized in Table 5. Notice that these services consist of a fairly large number of servers. For example, YouTube has 2,138 unique IP addresses, which is much larger than the original 256 hostnames. Note that the number of global IP addresses does not necessarily correspond to the number of actual video hosting servers, meaning the actual infrastructure could be larger than can be seen from an Internet edge site. In addition, the extracted IP addresses for video hosting servers are mostly different from those for the other media types, i.e., image, text, and application. Thus, the obtained IP addresses of video hosting servers can be used as a simple and effective key to identify the video flows of large-scale video sharing services.
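The aggregation of resolved A records across many vantage points can be sketched as below. The resolution function is injected so it can be backed by any DNS library; the stub resolver and the addresses it returns are purely illustrative, not measurement data.

```python
def resolve_from_vantage_points(hostnames, resolvers, resolve):
    """Aggregate A records for each hostname across many open recursive
    DNS servers. `resolve(hostname, resolver)` returns a list of IP
    address strings (possibly empty on failure)."""
    addresses = {h: set() for h in hostnames}
    for host in hostnames:
        for server in resolvers:
            for ip in resolve(host, server):
                addresses[host].add(ip)
    return addresses

# stub: different vantage points see different server IPs, as with
# location-aware load balancing
def fake_resolve(host, server):
    return {"8.8.8.8": ["203.0.113.1"], "1.1.1.1": ["203.0.113.2"]}[server]

result = resolve_from_vantage_points(["v1.lscache1.c.youtube.com"],
                                     ["8.8.8.8", "1.1.1.1"], fake_resolve)
```

The union over vantage points is what yields more unique IP addresses than hostnames, as in Table 5.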
4 Characterizing Video Flows

In the previous section, we compiled a list of IP addresses of the video hosting servers using web cache server logs. This section uses our network flow data set to characterize traffic flows originating from video sharing services by means of this list of IP addresses. First, we study fundamental statistics of video flows, which play an essential role in understanding the structure of traffic flows. Next, we perform an in-depth analysis of flow size distributions, which exhibit intrinsic characteristics. Finally, we investigate the temporal characteristics of video traffic flows.

4.1 Flow Statistics

We investigate the fundamental statistics of video flows, i.e., flow size, flow rate, and flow duration, which form essential parts of the traffic workload model. Since a large portion of the extracted flows is composed of small flows, which could be incomplete or error flows, we exclude these small flows from our analysis. On the basis of the observation from Fig. 1, we use 20 KB as a threshold for pruning the small flows. As a result, of the 103 K collected YouTube video flows, 60 K flows were removed. We
Table 6. Statistics of observed flows that are larger than 20 KB

Service         # of flows   Mean size   Mean rate   Mean duration
YouTube         43,960       4.1 MB      1.3 Mbps    41.8 sec
Smiley videos   3,438        21.3 MB     2.6 Mbps    139.8 sec
Megavideo       1,354        30.0 MB     1.3 Mbps    232.6 sec
Dailymotion     730          13.7 MB     1.5 Mbps    96.0 sec
All web         5,043,927    0.33 MB     0.9 Mbps    16.5 sec
Table 7. Uploading limitations (as of 1Q 2009)

Service         Free membership                   Premium membership
YouTube         10 minutes or 2 GB per video      20 GB per video (partners)
Smiley videos   40 MB per video                   100 MB per video
Megavideo       100 MB per video                  5 GB per video
Dailymotion     20 minutes or 150 MB per video    –
note that although the number of pruned flows was not small, their contribution to the total traffic volume was negligible: the pruned flows contributed less than 1% of the total traffic volume. We also note that the majority of the pruned flows were incomplete, i.e., most of them were one-packet TCP flows with the SYN/ACK flag set, originating from YouTube video hosting servers; i.e., the video hosting servers were likely port-scanned by some clients. Since we are interested in the impact of YouTube traffic from the network service provider perspective, focusing on flows that deliver actual video data is essential. The basic statistics for these metrics are summarized in Table 6. In general, video flows are larger, faster, and longer than conventional web flows. This observation agrees with the previous work by Plissonneau et al. [15]. Next, we look into the detailed characteristics of each metric. Flow size: The top-left graph of Fig. 3 presents log-log complementary cumulative distribution function (CCDF) plots of the flow size distributions. While the web flows (shown as "All web" in the legend) obey a clear Pareto-like distribution with moderate decay at the tail, the video flows exhibit different characteristics. In general, they are significantly heavy-tailed; for instance, more than 60% of the video flows are larger than 1 MB for all video services, while less than 3% of all web flows are larger than 1 MB. Such significant heavy-tailedness can also be seen in the flows of P2P file-sharing applications. What makes the video flows distinguishable from P2P flows is that they exhibit clear change points in the middle area, where the probability distributions drop sharply, i.e., at 30, 40, and 100 MB. In fact, we find that these change points correspond to the intrinsic capacity limitations of the video sharing services, which are summarized in Table 7.
We can see that the file size limitations agree with the change points of the flow size distributions. For example, the change point of the Megavideo flow size distribution is 100 MB, which is exactly the upload file size limitation for non-premium (free) membership.
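The per-service flow statistics of Table 6 can be reproduced from raw (bytes, seconds) pairs along these lines. This is a sketch: the 20 KB pruning threshold is the one chosen from Fig. 1, the flow tuples are synthetic, and averaging per-flow rates is one plausible definition of mean rate.

```python
def flow_statistics(flows, min_bytes=20_000):
    """Compute mean size, mean rate and mean duration over flows at
    least min_bytes large. Each flow is a (bytes, seconds) pair."""
    kept = [(b, d) for b, d in flows if b >= min_bytes and d > 0]
    n = len(kept)
    mean_size = sum(b for b, _ in kept) / n
    # per-flow rate in bits per second, averaged over flows
    mean_rate = sum(b * 8 / d for b, d in kept) / n
    mean_duration = sum(d for _, d in kept) / n
    return n, mean_size, mean_rate, mean_duration

# synthetic flows: two video-sized flows and one sub-threshold flow
flows = [(4_100_000, 41.8), (10_000, 0.5), (2_000_000, 20.0)]
n, size, rate, dur = flow_statistics(flows)
```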
We conclude the following from these observations:
– Size distributions of video flows are quite heavy-tailed.
– The tail parts of the flow size distributions for video sharing services are constrained by the limits on available service resources for free membership (upload file size).
Flow rate: Next, we look at the flow rate, which is the number of bits divided by the flow duration. The top-right graph of Fig. 3 shows log-log CCDF plots of the flow rate distributions. In contrast to the flow size distributions, we do not see much difference among the flows. An exception is a change point at 3 Mbps for Smiley videos. This observation again agrees with the differentiated service offered by the provider, i.e., premium members enjoy high-speed downloading while free members do not. We note that all the distributions fit log-normal distributions well; for brevity, we omit the results. Flow duration: Finally, we look into the flow duration. The bottom-left graph of Fig. 3 presents log-log CCDF plots of the flow duration distributions. While we have seen the effects of available service capacities on the distributions of size and rate, we do not see such effects on the flow duration, despite the fact that the mean size and mean duration are positively correlated (see Table 6). Note that the actual download time (i.e., flow duration) may depend on other factors, such as the throughput of access links or CPU
Fig. 3. Statistics of YouTube flows; Log-log CCDFs of flow size in bytes (top left), flow rate in Kbps (top right), and flow duration time in seconds (bottom left). Approximation of YouTube flow size distribution (bottom right).
resources of the end-terminal devices, which could be drastically different among the clients. Therefore, we do not see clear change points in the flow duration distributions.

4.2 Characterizing the Size Distributions of Video Flows

In the previous section, we found that the flow size distributions of current video sharing services have an intrinsic property: the tail part of the distributions is constrained by the available service capacity, e.g., the upload file size limitation. In this section, we attempt to characterize the flow size distribution to better understand its structure. We start by approximating the distribution with known distributions. Assume that flows for free and premium membership can be modeled with different distribution models. Since flow sizes for free membership are truncated at a certain threshold, we adopt the discrete truncated Pareto distribution (DTPD) [13] for this class. For flow sizes for premium membership, we adopt the simple discrete Pareto distribution (DPD), which does not include any truncation. Let X be a discrete random variable representing the size of a flow. The probability mass function of the DTPD is given as

  P(X = x) = f(x; α₁, γ₁, θ₁) = (x^(−α₁) − (x+1)^(−α₁)) / (γ₁^(−α₁) − (θ₁+1)^(−α₁)),   γ₁ ≤ x ≤ θ₁,

which satisfies P(X ≥ γ₁) = 1 and P(X > θ₁) = 0. Note that P(X > x) = ((x+1)^(−α₁) − (θ₁+1)^(−α₁)) / (γ₁^(−α₁) − (θ₁+1)^(−α₁)) and P(X = x) = P(X > x−1) − P(X > x). This property of the DTPD enables us to capture both the heavy-tailedness and the constraint at the threshold θ₁. Similarly, the distribution function of the DPD is given as

  P(X = x) = g(x; α₂, γ₂) = (x^(−α₂) − (x+1)^(−α₂)) / γ₂^(−α₂),   x ≥ γ₂.

We now illustrate graphically how the flow size distribution can be approximated with the DTPD and DPD. YouTube is chosen as our case study. We first set γ₁ = 20,000 and θ₁ = 30,000,000 (bytes), which are the minimum flow size (20 KB) we are interested in and the graphically estimated truncation point, respectively. Note that YouTube has two-way constraints, size and time; thus, the truncation point reflects their mix. In general, θ₁ reflects the capacity limitation of a video sharing service. Next, we estimate the parameters of the DTPD and DPD independently, under the simple assumption that the contribution of flows from premium users to the DTPD is negligible. The shape parameter of the DTPD is estimated with maximum likelihood estimation (MLE), using the given parameters γ₁ and θ₁. Note that the estimation process requires numerical calculation to solve the ML equation; see our previous work [13] for the details of the calculation. Finally, we estimate the shape parameter of the DPD for flows larger than θ₁ with the least squares method. We note that the approximated distributions above are, in theory, not continuous at the truncation point. Thus, we cannot use the approximated distributions to derive statistics such as the mean or variance. Our objective is to illustrate that the actual flow size distribution of a video sharing service can be divided into two distinct types of distribution models, the DTPD and the DPD.
The estimated shape parameters are α₁ = 0.008 for the DTPD and α₂ = 2.76 for the DPD. The bottom-right graph of Fig. 3 shows the results. Notice that both the DTPD and the DPD approximate the distribution well. In addition, notice that the shape parameter of the DTPD takes an extremal value: α₁ = 0.008 < 1 indicates that the first and second moments would be divergent if there were no constraint. In fact, more than 10% of the flows are larger than 10 MB. We conjecture that this skewed parameter reflects the effect of the constraints. That is, many free membership users who wish to upload large files need to compress or divide the files so that they fit within the service capacity. Accordingly, many files that were originally larger than the limit are made smaller than the limit; thus, the distribution shows the truncation property. We note that the flow size distribution and the file size distribution are not exactly the same because the former also reflects the popularity of file accesses. However, the characteristics of the flow size distribution should be correlated with those of the file size distribution because a flow basically originates from a file. In summary, we found that the flow size distributions of large-scale video sharing services exhibit significant heavy-tailedness and the truncation property, which are quite different from the properties of conventional web traffic flows.

4.3 Temporal Characteristics

Finally, we focus on the temporal characteristics of video flows. Understanding temporal characteristics such as time variability and the flow arrival process is crucial in building a realistic traffic model. Since our data set consists of 9.5 hours of traffic data, we cannot study the cyclic patterns of traffic, i.e., diurnal or weekly variation as shown in [15]. However, as we shall see shortly, multiple time-scale analysis enables us to explore the temporal structure of video sharing traffic.
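The shape estimation step can be approximated crudely as follows. The paper solves the ML equation numerically [13]; this sketch instead grid-searches the DTPD log-likelihood over α, and the grid bounds are arbitrary illustrative choices.

```python
import math

def dtpd_loglik(alpha, data, gamma, theta):
    """Log-likelihood of truncated-Pareto-distributed integer samples:
    each term is log of (x^-a - (x+1)^-a) / (gamma^-a - (theta+1)^-a)."""
    denom = gamma ** -alpha - (theta + 1) ** -alpha
    return sum(math.log((x ** -alpha - (x + 1) ** -alpha) / denom)
               for x in data)

def fit_dtpd_shape(data, gamma, theta, grid=None):
    """Crude MLE for the DTPD shape parameter alpha by grid search."""
    if grid is None:
        grid = [i / 1000 for i in range(1, 3001)]  # alpha in 0.001 .. 3.0
    return max(grid, key=lambda a: dtpd_loglik(a, data, gamma, theta))

# synthetic samples on the truncated support [20, 50]
data = [20, 21, 22, 25, 30, 40, 50]
a_hat = fit_dtpd_shape(data, 20, 50)
```

A production fit would solve the ML equation with a root finder instead of a grid, but the grid version makes the estimation idea explicit.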
Figure 4 shows the time series of (1) total traffic volume, (2) number of active flows, and (3) number of arriving flows, for traffic flows originating from YouTube servers. We see that the traffic volume and the number of active flows are positively correlated (the correlation coefficient is more than 0.8). In contrast, the number of arriving flows is independent of these two (the correlation coefficient is less than 0.05). Figure 5 shows the auto-correlation functions for the time series of total traffic volume, number of active flows, and number of arriving flows. While traffic volume and the number of active flows exhibit a long-range dependence (LRD) property, the number of arriving flows does not have a time correlation structure, i.e., it exhibits memorylessness. These observations can be explained by the traditional traffic source model, such as that in Ref. [17]; i.e., the aggregation of heavy-tailed source traffic (flows with a heavy-tailed size distribution) exhibits LRD characteristics. Finally, we aim to validate the assumption that the flow arrival process can be modeled as Poisson. Figure 6 shows the probability mass function of the number of arriving flows per time unit (1 sec) in normal and log scales. We also plot the approximated distribution using the Poisson model. While the approximation works well over several orders of magnitude, we observe a small number of outliers, e.g., n > 10. For instance, we observe an event n = 27, meaning 27 distinct flows arrived within one second. According to the Poisson approximation, the expected probability of this event is less than 10⁻¹⁸, which is unlikely to occur within the 9.5-hour measurement period. Thus, it is likely that these extremal
Fig. 4. Time-series of traffic volume (top), number of active flows (middle), and number of arrival flows (bottom), for traffic flows originating from YouTube servers
Fig. 5. Auto-correlation functions for traffic volume, number of active flows, and number of arrival flows
Fig. 6. Probability mass function of the number of arrival flows per sec: normal-scale (top) and log-scale (bottom)
congestion periods are exceptional. On the basis of a careful examination of the flows that constitute these outliers, we conjecture that the outliers are associated with a flash-crowd effect, because the observed client IP addresses are not identical during those time periods. We also applied Pearson's chi-square test to make our observation conclusive. Note that the outliers were removed before applying the statistical test. We tested the null hypothesis that the observed distribution is identical to the Poisson distribution, and concluded that we cannot reject the null hypothesis at the 0.05 level of significance. From these observations, we may conclude that (1) the aggregated traffic of video sharing services can be modeled with the LRD traffic model and (2), after removing outliers, the flow arrival process of YouTube can be well modeled as a Poisson arrival process. We validated that these observations hold for the other services as well and omit the results due to space limitations.
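The outlier-screening argument, that a count n is implausible when its expected number of occurrences over the whole trace is far below one, can be sketched as follows. The threshold and the synthetic counts are illustrative; the paper's actual criterion for removing outliers before the chi-square test may differ.

```python
import math

def poisson_outliers(counts, horizon_seconds, threshold=1e-6):
    """Flag per-second arrival counts whose expected number of
    occurrences over the measurement horizon, under a Poisson model
    fitted to the data, falls below `threshold`."""
    lam = sum(counts) / len(counts)  # MLE of the Poisson rate

    def pmf(n):
        return math.exp(-lam) * lam ** n / math.factorial(n)

    # expected occurrences of count n over the whole trace: pmf(n) * T
    return sorted({n for n in counts
                   if pmf(n) * horizon_seconds < threshold})

# synthetic per-second counts with one flash-crowd-like spike
counts = [2, 3, 1, 4, 2, 3, 2, 2, 3, 1, 27]
outliers = poisson_outliers(counts, horizon_seconds=9.5 * 3600)
```

Under this screen the spike of 27 arrivals is flagged while the ordinary counts survive, mirroring the n = 27 argument in the text.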
5 Related Work

This section reviews prior studies on large-scale video sharing services and compares them to ours. Recently, many researchers have focused on characterizing the workload of large-scale video sharing services [8,2,3,20,7,4,1,15,11]. Huang et al. [8] analyzed the access log of MSN Video [14] and found that 95% of accesses could have been eliminated by using a peer-assisted content delivery system. Cha et al. [2] analyzed the properties of video files on YouTube and derived similar implications. Cheng et al. [3,4] crawled the YouTube site and found that statistics such as video length, access patterns, growth trend, and active life span were quite different from those of traditional video streaming applications. Kang et al. [11] measured and analyzed Yahoo! Video [18] to characterize the workload of the video sharing service. Based on the obtained characteristics, they gave guidelines for SLAs and workload scheduling schemes to improve the resource management efficiency of an online video data center. Abhari and Soraya [1] investigated the YouTube popularity distribution and access patterns through analysis of a vast amount of data collected by crawling the YouTube API. On the basis of these observations, they presented the essential elements of a workload generator that can be used for benchmarking caching mechanisms. While the above works take the video sharing service provider's perspective, the following works are oriented toward network service providers, like ours. Zink et al. [20] analyzed YouTube traffic at a campus network and studied the local popularity characteristics of video files. In Ref. [7], Gill et al. investigated the statistics of user sessions (i.e., flows) on YouTube. They showed that YouTube users transfer more data and have longer think times than traditional web workloads. Their observation on the file transfer volume of YouTube is consistent with our findings. Note that both Refs. [20] and [7] rely on deep packet inspection (DPI) for their analysis.
In general, DPI exploits payload information and is prone to privacy problems. In contrast, our method, which is based on the naming/addressing conventions of large-scale server farms, requires neither payload information nor the IP addresses of end-users. It leverages only server-side IP addresses, which are publicly available information. Also, thanks to its simplicity, the detection method is lightweight and scalable, while conventional DPI requires expensive processing, e.g., wire-rate byte stream matching with stateful inspection of all incoming flows on high-speed links; thus, employing DPI on high-speed links is a difficult task. Plissonneau et al. [15] characterized the impact of YouTube traffic on a French regional ADSL point of presence. They revealed that YouTube video transfers are faster and larger than other large web transfers. Their observations agree with our study of YouTube and of the other video services as well. They also revealed that the performance of video transfers and the network load on the underlying ADSL platform are correlated. In analyzing the data set, they proposed a technique for detecting YouTube video traffic by looking up reverse DNS and a commercial Geo-IP database with some heuristics. We note that our detection method is more comprehensive in extracting the IP address blocks operated by service providers, and can be seen as a generalization of their approach. We finally note that the originality of our work lies in three contributions: (1) proposing a simple and privacy-friendly way of identifying video flows, (2) investigating
multiple large-scale video sharing services simultaneously, and (3) analyzing traffic from the perspective of network service providers, including the temporal analysis of traffic.
6 Conclusion and Future Work

In this work, we characterized traffic originating from large-scale video sharing services from the perspective of a network service provider. We presented a simple methodology that enables us to identify video flows by correlating web cache server logs and network measurement data. The key idea behind our approach is to leverage the addressing/naming conventions used in large-scale server farms. We analyzed the characteristics of the resulting classified video flows and revealed that flows originating from current large-scale video sharing services have intrinsic characteristics, namely significant heavy-tailedness and the truncation property, which have not been observed in existing web traffic models. The origin of these characteristics is rooted in the differentiated service provided for free and premium membership. We also investigated the temporal characteristics of video flows and revealed that the flow arrival process can be modeled as a Poisson arrival process, while the temporal variability exhibits LRD characteristics. Our future work includes an in-depth analysis of the new distribution model. We will study empirically how the model fits actual data. We also aim to understand the mechanism that governs the distributions and its implications for network resource management.
References

1. Abhari, A., Soraya, M.: Workload Generation for YouTube. Multimedia Tools and Applications (June 2009)
2. Cha, M., Kwak, H., Rodriguez, P., Ahn, Y.Y., Moon, S.: I Tube, You Tube, Everybody Tubes: Analyzing the World's Largest User Generated Content Video System. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pp. 1–14 (2007)
3. Cheng, X., Dale, C., Liu, J.: Understanding the Characteristics of Internet Short Video Sharing: YouTube as a Case Study. CoRR, abs/0707.3670 (2007)
4. Cheng, X., Dale, C., Liu, J.: Statistics and Social Network of YouTube Videos. In: IWQoS 2008, pp. 229–238 (2008)
5. Cisco Systems, Inc.: Cisco Visual Networking Index – Forecast and Methodology, 2007–2012 (June 2008)
6. Dailymotion
7. Gill, P., Arlitt, M., Li, Z., Mahanti, A.: Characterizing User Sessions on YouTube. In: Fifteenth Annual Multimedia Computing and Networking Conference, MMCN (2008)
8. Huang, C., Li, J., Ross, K.W.: Can Internet Video-on-Demand Be Profitable? In: ACM SIGCOMM 2007, pp. 133–144 (August 2007)
9. Huang, C., Wang, A., Li, J., Ross, K.W.: Measuring and Evaluating Large-scale CDNs. Microsoft Research Technical Report MSR-TR-2008-106 (2008)
10. IRCache project
11. Kang, X., Zhang, H., Jiang, G., Chen, H., Meng, X., Yoshihira, K.: Measurement, Modeling, and Analysis of Internet Video Sharing Site Workload: A Case Study. In: Proceedings of the IEEE International Conference on Web Services, pp. 278–285 (2008)
12. Megavideo
13. Mori, T., Takine, T., Pan, J., Kawahara, R., Uchida, M., Goto, S.: Identifying Heavy-Hitter Flows from Sampled Flow Statistics. IEICE Transactions 90-B(11), 3061–3072 (2007)
14. MSN Video
15. Plissonneau, L., En-Najjary, T., Urvoy-Keller, G.: Revisiting Web Traffic from a DSL Provider Perspective: the Case of YouTube. In: Proceedings of the 19th ITC Specialist Seminar (October 2008)
16. Smiley Videos
17. Willinger, W., Taqqu, M.S., Sherman, R., Wilson, D.V.: Self-similarity Through High-variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level. IEEE/ACM Trans. Netw. 5(1), 71–86 (1997)
18. Yahoo! Video
19. YouTube
20. Zink, M., Suh, K., Gu, Y., Kurose, J.: Characteristics of YouTube Network Traffic at a Campus Network – Measurements, Models, and Implications. Comput. Netw. 53(4), 501–514 (2009)
Mixing Biases: Structural Changes in the AS Topology Evolution

Hamed Haddadi¹, Damien Fay², Steve Uhlig³, Andrew Moore⁴, Richard Mortier⁵, and Almerima Jamakovic⁶

¹ Royal Veterinary College, University of London
² National University of Ireland, Galway
³ Deutsche Telekom Laboratories/TU Berlin
⁴ University of Cambridge
⁵ University of Nottingham
⁶ TNO ICT, The Netherlands
Abstract. In this paper we study the structural evolution of the AS topology as inferred from two different datasets over a period of seven years. We use a variety of topological metrics to analyze the structural differences revealed in the AS topologies inferred from the two different datasets. In particular, to focus on the evolution of the relationship between the core and the periphery, we make use of a recently introduced topological metric, the weighted spectral distribution. We find that the traceroute dataset has increasing difficulty in sampling the periphery of the AS topology, largely due to limitations inherent to active probing. Such a dataset has too limited a view to properly observe topological changes at the AS-level compared to a dataset largely based on BGP data. We also highlight limitations in current measurements that require a better sampling of particular topological properties of the Internet. Our results indicate that the Internet is changing from a core-centered, strongly customer-provider oriented, disassortative network, to a soft-hierarchical, peering-oriented, assortative network.
1 Introduction
The Internet continuously evolves: new networks are created, old ones disappear, and existing ones grow or merge. At the same time, business dynamics cause interconnections between networks to change. Both these effects cause the underlying topology of the Internet to be in a constant state of flux. Studying the evolution of this topology is important as it impacts a variety of factors relevant to network users and application designers, such as scalability and performance. For example, different network structures affect the speed of propagation of both legitimate (e.g., routing) and illegitimate (e.g., hijacked prefixes) information. Most efforts to understand the structure of the Internet have focused on the Autonomous System (AS) topology. There are over 30,000 ASes today, each representing a single administrative authority with its own network and peering
This work was done while the author was at Max Planck Institute for Software Systems, Germany.
F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 32–45, 2010. c Springer-Verlag Berlin Heidelberg 2010
policies. Thus, the AS topology is a graph reflecting the interconnections between the networks that compose the Internet. Relationships between ASes are typically classified as either customer-provider, sibling-sibling or peer-peer. Note that as the Internet has grown, many larger networks have come to be represented as more than one AS (i.e., to advertise more than one AS number). As a result, the AS topology may contain edges that do not directly represent a business relationship between two distinct networks. However, the AS topology serves as an available, albeit approximate, measure of the complexity of the Internet’s structure at the network level. Characterizing the structure of the AS topology has proven to be difficult, but it is usually simplified to identifying a richly connected core, including the fully meshed tier-1 Internet Service Providers (ISPs), providing connectivity for the large number of smaller ISPs and customer networks at the periphery of the network. These edge ISPs may connect to only a single upstream provider, or may connect to many for resilience, performance and cost reasons. Recent work has shown that the trend is for networks to try to connect directly in the periphery of the Internet, rather than to the core, bypassing the largest providers [8]. However, no direct evidence of a corresponding large-scale change in the topological structure had been shown. In this paper we analyze the evolution of the AS topology using two significant datasets, each generated by a different measurement technique: the Skitter dataset using traceroute, and the UCLA dataset using BGP. We are aware that there are problems with biased measurements in both data sets and one of the aims of this paper is to highlight such differences and biases, which could potentially affect many simulations, protocol designs and publications based on these datasets. However, we still aim to draw conclusions mindful of these drawbacks in this paper. 
We focus on the overall structure of the topology, rather than local features such as node degree, using a recently introduced metric called the weighted spectral distribution (WSD) [7]. This allows us to distinguish topologies with different mixing properties, i.e., how much the core can be differentiated from the periphery of the topology. A clear distinction between the core and the periphery is believed to be one of the strongest features of the Internet topology [19,20].
This paper makes three contributions. First, we explain how the WSD depicts the mixing between core and periphery in the AS topology (Section 2). Second, we find that the AS topology has evolved from a highly hierarchical graph with a clearly distinct core towards a "softer" hierarchy where the core and non-core parts of the topology are less distinct (Section 3). Third, we show how the two different measurement techniques, traceroute and BGP, both provide limited but complementary coverage of the AS topology: the traceroute dataset has increasing difficulty sampling the periphery, while the BGP dataset can improve its sampling of the transit part of the Internet (Section 4). Section 3.1 studies the evolution of the AS topology seen in the Skitter dataset, and Section 3.2 then studies the evolution of the AS topology seen in the UCLA dataset. We compare these views of the AS topology in Section 4, where we also discuss the likely evolution of the "real" AS topology.
We are aware of the problems associated with traceroute sampling, and we are also aware of the efforts in the DIMES project to remedy these issues;¹ however, these data are only available since January 2007, which is not long enough for a thorough comparison of Internet topology evolution.
2 Theoretical Background
The weighted spectral distribution (WSD) is a graph theoretic metric based on the random walk cycles in a graph. A random walk starts at a node, say u, with degree d_u, and transitions to a connected node with probability 1/d_u. After several such steps, say N, if the random walk returns to the starting node, then this is called a random walk cycle of length N. The WSD takes the structure of the graph to be all such random walk cycles as expressed via the normalised Laplacian (roughly speaking, how the graph appears over short walks taken from every node). The normalised Laplacian matrix of a graph, G, is defined as:

L(G)(u, v) = \begin{cases} 1, & \text{if } u = v \text{ and } d_v \neq 0 \\ -\frac{1}{\sqrt{d_u d_v}}, & \text{if } u \text{ and } v \text{ are adjacent} \\ 0, & \text{otherwise} \end{cases}   (1)

Expressing L using the eigenvalue decomposition,

L(G) = \sum_i \lambda_i e_i e_i^T   (2)
where λ_i and e_i are the eigenvalues and eigenvectors of L, respectively.² The WSD is based on the following theorem from [7]:

Theorem 1. The eigenvalues, λ_i, of the normalised Laplacian matrix for an undirected network are related to the random walk cycle probabilities as:

\sum_i (1 - \lambda_i)^N = \sum_C \frac{1}{d_{u_1} d_{u_2} \cdots d_{u_N}}   (3)

where d_{u_i} is the degree of node u_i, and C ranges over the cycles u_1 u_2 ... u_N of length N starting and ending at node u_1, i.e. the N-cycles.

The number of N-cycles is related to various graph properties. The number of 2-cycles is just (twice) the number of edges and the number of 3-cycles is (six times) the number of triangles. Hence \sum_i (1 - \lambda_i)^3 is related to the clustering coefficient.³ An important graph property

¹ http://www.netdimes.org
² These are in general different from the eigenpairs of the walk Laplacian.
³ The clustering coefficient, γ(G), is defined as the average ratio of the number of triangles to the total number of possible triangles:

\gamma(G) = \frac{1}{n} \sum_{i: d_i \ge 2} \frac{T_i}{d_i (d_i - 1)/2}   (4)

where T_i is the number of triangles containing node i and d_i is the degree of node i.
is the number of 4-cycles. A graph which has the minimum number of 4-cycles, for a graph of its density, is quasi-random, i.e., it shares many of the properties of random graphs, including, typically, high connectivity, low diameter, edges distributed uniformly through the graph, and so on. For a proof see [7].

Theorem 1 states that the probability of taking a random walk of length N that returns to the original node is directly related to the weighted eigenvalues of L. This probability captures the 'local structure' of a node, i.e. its local connectivity. Noting that the λ_i are unique⁴ to a graph, it can be seen that the WSD gives a "thumbprint" of the graph structure. In [7] the distribution of eigenvalues, f(λ = k), rather than the eigenvalues themselves, is used to form a graph metric (we refer the reader to [7] for details). Specifically, the weighted spectral distribution is then defined as:

WSD : G \to \mathbb{R}^{|K|}, \quad \{k \in K : (1 - k)^N f(\lambda = k)\}   (5)
where K is the set of bins used to estimate the distribution. Of interest in this paper is the spectral clustering coefficient, ω(G, 3), defined as:

\omega(G, 3) = \sum_{k \in K} (1 - k)^3 f(\lambda = k)   (6)
which gives a measure of the proportion of paths of length 3 in the network which form triangles. As shown in [7], the WSD and ω(G, 3) can be used for estimating the parameters of a topology generator so that it produces graphs which are close (in the WSD sense) to a target graph. It is also shown in [7] (Section V.A) how the WSD represents the core and periphery of a graph in terms of easily identifiable peaks. In this paper, however, we apply the technique to track the evolution of the AS-level graph. The WSD enables us to view the distinct features of the core and periphery more clearly than in the past.
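The definitions above translate directly into a short computation. The following sketch (our illustration, not the authors' implementation; the bin count and the normalisation of f are our assumptions) estimates the WSD and ω(G, 3) for a given graph:

```python
import numpy as np
import networkx as nx

def wsd(G, N=3, bins=50):
    # Eigenvalues of the normalised Laplacian lie in [0, 2]; bin them into
    # an empirical distribution f(lambda = k), then weight each bin centre
    # k by (1 - k)^N, following Eq. (5).
    L = nx.normalized_laplacian_matrix(G).toarray()
    lam = np.linalg.eigvalsh(L)
    f, edges = np.histogram(lam, bins=bins, range=(0.0, 2.0))
    k = 0.5 * (edges[:-1] + edges[1:])          # bin centres
    return k, (1.0 - k) ** N * f / f.sum()

def omega(G, N=3, bins=50):
    # Spectral clustering coefficient, Eq. (6): sum the WSD over all bins.
    _, w = wsd(G, N, bins)
    return float(w.sum())
```

Comparing the resulting (k, w) curves for two monthly snapshots is exactly how the WSD plots in Section 3 are read: peaks in w locate the eigenvalue mass that separates core from periphery.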
3 Evolution of the Internet
In this section we look at the evolution of the Internet as seen through the two datasets, over a total period of more than 7 years, with 3 years during which the two datasets overlap. We rely on a number of topological metrics presented in [10].

3.1 Skitter Topology
The first dataset we study consists of 7 years of traceroute measurements, starting in January 2001, collected by the CAIDA Skitter project [12]. Traceroutes are initiated from several locations in the world toward a large range of destination IP addresses. The IP addresses reported in the traceroutes are mapped to

⁴ This is not strictly true, but the proportion of co-spectral graphs is thought to be insignificant.
Fig. 1. Topological metrics for the Skitter AS topology: (a) number of nodes; (b) number of links; (c) average node degree; (d) clustering coefficient; (e) assortativity coefficient; (f) ω(G, 3).
AS numbers using RouteViews BGP data. We use a monthly union of the set of all unambiguous links collected on a daily basis by the project.⁵
Figure 1 presents the evolution over the 7 years of a set of topological metrics computed on the AS topology of the Skitter dataset. The number of ASes seen by Skitter exhibits abrupt changes during the first 40 months. At the end of those 40 months, changes were made in the way probing was performed.⁶ The large increases in the number of ASes, observed during the first 40 months, are due to new monitors being added to the system. After each increase in the number of ASes a smooth decrease follows, corresponding to a subset of the IP addresses of the Skitter list that no longer respond to probes, e.g., because a firewall starts blocking the probes. The variations in the number of ASes seen by Skitter are not caused by changes in the AS topology itself, but are artifacts of the probing. Such artifacts should be reported and accounted for in topological studies.

⁵ A link may be ambiguous for a variety of reasons, principally due to problems resolving an IP address to an AS number. The Skitter IP address list includes some IP addresses which matched a prefix with two or more origin ASes. This can happen for a number of reasons, such as a provider stripping the customer AS from the AS path. Since it is not known which AS is the true origin, the dataset lists both ASes. We filter out such instances as it is not possible to identify the authenticity of such links.
⁶ These changes were subject to caveats and bugs affecting measurements and, thus, the values of the resulting metrics at month 40. For more information we refer to http://www.caida.org/data/active/skitter_aslinks_dataset.xml/.
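The filtering of ambiguous links described in footnote 5 can be sketched as follows; the data structures (`links` as prefix pairs, `origin_ases` as a prefix-to-origin-set map) are hypothetical simplifications of the actual Skitter/RouteViews pipeline:

```python
def filter_unambiguous(links, origin_ases):
    """Keep only AS links whose endpoints each map to a single origin AS.

    links: iterable of (src_prefix, dst_prefix) pairs seen in traceroutes.
    origin_ases: dict mapping a prefix to the set of origin AS numbers
    observed for it in BGP data (hypothetical encoding).
    """
    kept = []
    for a, b in links:
        origins_a = origin_ases.get(a, set())
        origins_b = origin_ases.get(b, set())
        # Multi-origin prefixes are dropped: the true origin is unknown.
        if len(origins_a) == 1 and len(origins_b) == 1:
            kept.append((next(iter(origins_a)), next(iter(origins_b))))
    return kept
```

A monthly topology is then the union of the kept AS pairs over all days of the month.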
The number of AS edges and the average node degree both follow the behavior of the number of ASes seen. We only observe a large increase in the number of links during the first few months, during which new monitors are added, resulting in new regions of the Internet being covered by Skitter measurements. Given the difficulty of building a list of destination IP addresses that will answer probes and cover most of the ASes, especially at the edge [2], a new monitor will typically discover new ASes close to its location. The AS edges that Skitter no longer observes probably still exist but can no longer be seen by Skitter due to its shrinking probing scope. To be effective in observing topology dynamics, traceroute data collection must update destination lists constantly to give optimal AS coverage.
This limitation of Skitter is visible in the decreasing average node degree. We would expect to see a net increase in the average node degree, as ASes tend to add rather than remove peering links, and the results of the BGP data support this view. If the sample of the AS topology taken by the Skitter measurements were not worsening, we should see an increasing average node degree.
The lower three graphs of Figure 1 present the evolution of the clustering coefficient, the assortativity coefficient,⁷ and the weighted spectrum with N = 3, ω(G, 3) (related to the topology's clustering).⁸ We observe that changes were made to the way Skitter probes the Internet around month 40: the metrics take an unusual value, very small for the clustering and very high for the assortativity. The values of the clustering and the assortativity coefficients fluctuate wildly over the 7 years, as if the sampling of the AS topology by Skitter at the AS level were not stable. Neither the clustering nor the assortativity seems to decrease or increase over the 7 years. The value of ω(G, 3) shows a long-term increasing trend, similar to the decreasing trend in the average node degree.
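Both per-month metrics are standard and can be reproduced on any snapshot graph; a sketch (the snapshot here is a synthetic stand-in, not Skitter data):

```python
import networkx as nx

# Hypothetical monthly snapshot; in the paper, each point in Fig. 1(d)-(e)
# is computed on one month's Skitter AS graph.
G = nx.barabasi_albert_graph(1000, 3, seed=42)

clustering = nx.average_clustering(G)                    # as in Fig. 1(d)
assortativity = nx.degree_assortativity_coefficient(G)   # as in Fig. 1(e)
```

The assortativity coefficient is a Pearson correlation over edge endpoints, so it always lies in [-1, 1]; negative values, as observed for the AS graph, indicate disassortative mixing.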
Although related to the clustering, ω(G, 3) gives different weights to different parts of the topology. The subset of the topology that corresponds to duplicated topological structures, e.g. different ASes at the periphery that connect to the same set of upstream providers, receives a smaller weight than the rest. The increasing ω(G, 3) is likely to be caused by the shrinking network sampled by Skitter, which contains more 3-cycles on average.
Figure 2(a) presents four WSDs sampling the entire duration of the Skitter dataset. Notice the eigenvalues at zero, indicating the presence of several disconnected components. The WSD in January 2002 shows a single peak at λ = 0.4. As time passes, a second peak appears around λ = 0.3. The WSD computed from the Skitter data suggests an Internet moving from a less hierarchical to a more hierarchical topology, as if the core were becoming more dominant. This contradicts current observations that the AS topology is becoming less

⁷ Assortativity is a measure of the likelihood of connection between nodes of similar degrees [14]. It is usually expressed by means of the assortativity coefficient r: assortative networks have r > 0 (disassortative networks have r < 0, resp.) and tend to have nodes that are connected to nodes with similar (dissimilar, resp.) degree.
⁸ See [7] and [9] for a detailed explanation of the mathematical measures and the different datasets.
Fig. 2. Clustering and spectral features of the Skitter topology: (a) WSD for snapshots from January 2002, January 2004, January 2006 and December 2007; (b) k-core proportions.
hierarchical, with increasing numbers of ASes peering at public Internet Exchange Points (IXPs) to bypass the core of the Internet [8].
To understand the unexpectedly dominant core seen in the Skitter dataset, we rely on the k-core metric. A k-core is defined as the maximum connected subgraph, H, of a graph, G, with the property that d_v ≥ k for all v ∈ H. As pointed out by [1] and [3], the k-core exposes the structure of a graph by pruning nodes with successively higher degrees, k, and examining the maximum remaining subgraph; note this is not the same as pruning all nodes with degree k or less. Figure 2(b) shows the proportion of nodes in each k-core as a function of k. There are 84 plots shown, but as can be seen there is little difference between them, demonstrating that the proportion of nodes in each k-core is not fundamentally changing over time. The WSD on the Skitter data is therefore not really observing a more dominant core, but a less well-sampled edge of the AS topology. We provide explicit evidence in Section 4 that Skitter has increasing problems over time sampling the non-core part of the topology. There is a practical explanation for the sampling bias of Skitter: the Skitter dataset is composed of traceroutes rooted at a limited set of locations, so the k-core is expected to be similar to peeling the layers from an onion [1].
From a topology evolution point of view, Skitter's view of the AS evolution is inconclusive, due to its sampling bias. Skitter is not sampling the periphery of the Internet and so cannot see evolutionary changes in the whole AS topology. Based on our evidence, we cannot make claims about the relative change of the core compared to the edge, as we can with the UCLA dataset. We stress that the purpose of this paper is not to blame the Skitter dataset for its limited coverage of the AS topology, as it aims at sampling the router-level topology.
Datasets like Skitter that rely on active probing do provide some topological information not visible from BGP data, as will be shown in Section 4.
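The k-core proportions of Figure 2(b) can be sketched with a standard k-core decomposition; note that networkx's `core_number` ignores the connectivity requirement in the definition above, a simplification of our own:

```python
import networkx as nx
from collections import Counter

def kcore_proportions(G):
    # core_number(v) is the largest k such that v belongs to the k-core,
    # so |k-core| = #{v : core_number(v) >= k}; divide by n to get the
    # per-k proportions plotted in Fig. 2(b).
    core = nx.core_number(G)
    n = G.number_of_nodes()
    sizes = Counter(core.values())
    props, remaining = {}, n
    for k in range(max(core.values()) + 1):
        props[k] = remaining / n
        remaining -= sizes.get(k, 0)
    return props
```

Overlaying `props` for successive monthly snapshots reproduces the family of curves in Figures 2(b) and 4(b): a rightward drift means more nodes survive pruning at high k.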
3.2 UCLA
We now examine the evolution of the AS topology using 52 snapshots, one per month, from January 2004 to April 2008. This dataset, referred to in this paper as the UCLA dataset, comes from the Internet topology collection⁹ maintained by Oliveira et al. [16]. These topologies are updated daily using data sources such as BGP routing tables and updates from RouteViews, RIPE,¹⁰ Abilene¹¹ and LookingGlass servers. Each node and link is annotated with the times it was first and last observed. Note that due to the multiple sources of data used by the UCLA dataset, there is a risk of pollution and bias when combining such differing data sources, which may contain inconsistencies or outdated information.
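Extracting a monthly snapshot from such first/last-observed annotations might look as follows; the record format is a hypothetical simplification of the UCLA collection:

```python
def snapshot(link_records, month):
    """AS links present in a given month.

    link_records: iterable of (u, v, first_seen, last_seen) tuples, a
    hypothetical encoding of the UCLA per-link annotations; months are
    any comparable values (e.g. integers or "YYYY-MM" strings).
    """
    return {(u, v) for u, v, first, last in link_records
            if first <= month <= last}
```

A link is included in every month between its first and last observation, which smooths over days on which a link happens not to be visible in BGP.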
Fig. 3. Topological metrics for the UCLA AS topology: (a) number of nodes; (b) number of links; (c) average node degree; (d) clustering coefficient; (e) assortativity coefficient; (f) ω(G, 3).
Figure 3 presents the evolution of the same set of topological metrics as Figure 1, over 4 years of AS topologies in the UCLA dataset. The UCLA AS topologies display a completely different evolution compared to the Skitter dataset, more consistent with expectations. As the three upper graphs of Figure 3 show, the number of ASes, AS edges, and the average node degree are all increasing, as expected in a growing Internet.
The increasing assortativity coefficient indicates that ASes increasingly peer with ASes of similar degree. The preferential attachment model seems to be less dominant over time. This trend towards a less disassortative network is consistent with more ASes bypassing the tier-1 providers through public IXPs [8],

⁹ http://irl.cs.ucla.edu/topology/
¹⁰ http://www.ripe.net/db/irr.html
¹¹ http://abilene.internet2.edu/
Fig. 4. Clustering and spectral features of the UCLA topology: (a) WSD for snapshots including January 2004, February 2005 and April 2008; (b) k-core proportions.
hence connecting with nodes of similar degree. Another explanation for the increasing assortativity is an improvement in the visibility of non-core edges in BGP data. We will see in Section 4 that the sampling of core and non-core edges by UCLA and Skitter biases the observed AS topology structure.
Contrary to the case of Skitter, ω(G, 3) for UCLA decreases over time. As a weighted clustering metric, ω(G, 3) indicates that the transit part of the AS topology is actually becoming relatively sparser over time compared to the periphery. Increasing local peering with small ASes, aimed at reducing the traffic sent to providers, weakens the hierarchy induced by strict customer-provider relationships, and in turn decreases the number of 3-cycles on which ω(G, 3) is based. If we look closely at Figure 4(a), we see a spectrum with a large peak at λ = 0.3 in January 2004, suggesting a strongly hierarchical topology. As time passes, the WSD becomes flatter with a peak at λ = 0.4, consistent with a mixed topology where core and non-core are not so easily distinguished.
Figure 4(b) shows the proportion of nodes in each k-core as a function of k. There are 52 plots shown as a smooth transition between the first and last plots, emphasized with bold curves. The distribution of k-cores moves to the right over time, indicating that the proportion of nodes with higher connectivity is increasing over time. This adds further weight to the conclusion that the UCLA dataset shows a weakening hierarchy in the Internet, with more peering connections between nodes on average.
4 Reconciling the Datasets
The respective evolutions of the AS topology visible in the Skitter and UCLA datasets differ, as seen from topological metrics. Skitter shows an AS topology that is becoming sparser and more hierarchical, while UCLA shows one that is becoming denser and less hierarchical. Why do these two datasets show such
differences? The explanation lies in the way Skitter and UCLA sample different parts of the AS topology: Skitter sees a far smaller fraction of the complete AS topology than UCLA, and even UCLA does not see the whole AS topology [15]. A far larger number of vantage points than those currently available is likely to be necessary in order to reach almost complete visibility of the AS topology [17].
To check how similar the AS topologies of Skitter and UCLA are, we computed the intersection and the difference between the two datasets in terms of AS edges and ASes, over a two-year period from January 2006 until December 2007. In Table 1 we show the number of AS edges and ASes that Skitter and UCLA have in common during some of these monthly periods (labeled "intersection"), as well as the number of AS edges and ASes contributed to the total by one of the two datasets only (labeled "Skit-only" or "UCLA-only"). We observe a steady increase in the number of total ASes and AS edges seen by the union of the two datasets. At the same time, the intersection between the two datasets decreases. In late 2007, Skitter had visibility of less than 25% of the ASes and less than 10% of the AS edges seen by the two datasets together. As Skitter aims at sampling the Internet at the router level, we should not expect it to have a wide coverage of the AS topology. Such limited coverage is however surprising, given the popularity of this dataset. Note that Skitter sees a small fraction of AS edges that are not seen by the UCLA dataset. This indicates that there is potential in active topology discovery to complement BGP data.
From Table 1, we might conclude that the Skitter dataset is uninteresting. On the contrary, the relatively constant, albeit decreasing, sampling of the Internet core by Skitter gives us a clue about which part of the Internet is responsible for its structural evolution.
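The intersection and difference statistics of Table 1 amount to set operations on normalised edge sets; a sketch (our illustration, with a hypothetical input format):

```python
def coverage_stats(skitter_edges, ucla_edges):
    """Breakdown of AS-edge coverage between two datasets.

    Edges are undirected, so each AS pair is normalised to a sorted
    tuple before comparison (hypothetical representation).
    """
    s = {tuple(sorted(e)) for e in skitter_edges}
    u = {tuple(sorted(e)) for e in ucla_edges}
    total = s | u
    n = len(total)
    return {
        "total": n,
        "intersection": len(s & u) / n,   # "intersection" column
        "skitter_only": len(s - u) / n,   # "Skit-only" column
        "ucla_only": len(u - s) / n,      # "UCLA-only" column
    }
```

By construction the three fractions sum to 1, which is a useful sanity check when reading (or reconstructing) tables of this kind.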
In Table 2 we show the number of AS edges belonging to the tier-1¹² mesh (labeled "T1 mesh") as well as other AS edges in which a tier-1 appears. More than 30% of the AS edges sampled by Skitter cross at least one tier-1 AS, against about 15% for UCLA. Both datasets see almost all AS edges of the tier-1 mesh. Note that the decrease in the number of AS edges in which a tier-1 appears in Skitter is partly related to IP-to-AS mapping issues for multi-origin ASes [8].
The evolutions of the AS topology observed by the Skitter and UCLA datasets are not inconsistent. Rather, the two datasets sample the AS topology differently, leading to different biases. A large fraction of the AS topology sampled by Skitter relates to the core, i.e., edges containing at least one tier-1 AS. With its wider coverage, UCLA observes a different evolution of the AS topology, with a non-core part that grows more than the core. The evolution seen from the UCLA dataset seems more likely to reflect the evolution of the periphery of the AS topology. The non-core part of the Internet is growing and is becoming less and less hierarchical.
We wish to point out that, despite a common trend towards making a union of datasets in our networking community, such a simple addition is not appropriate for the UCLA and Skitter datasets. Each dataset

¹² We rely on the currently accepted list of 12 tier-1 ASes that provide transit-only service: AS174, AS209, AS701, AS1239, AS1668, AS2914, AS3356, AS3549, AS3561, AS5511, AS6461, and AS7018.
Table 1. Statistics on AS and AS edge counts in the intersection of both the Skitter and UCLA datasets, and for each dataset alone

            |          Autonomous Systems             |                 AS Edges
Time        |  Total   Intersect. Skit-only UCLA-only |   Total   Intersect. Skit-only UCLA-only
Jan. 2006   |  25,301    32.6%      0%       67.4%    |  114,847    15.4%      5.3%     79.3%
Mar. 2006   |  26,007    31.6%      0%       68.4%    |  118,786    14.9%      4.4%     80.7%
May. 2006   |  26,694    30.5%      0%       69.5%    |  124,052    13.8%      4.6%     81.5%
Jul. 2006   |  27,396    29.5%      0%       70.5%    |  128,624    13.2%      3.7%     83.1%
Sep. 2006   |  28,108    28.7%      0%       71.3%    |  133,813    12.6%      3.4%     84.0%
Nov. 2006   |  28,885    27.9%      0%       72.1%    |  139,447    12.4%      3.4%     84.2%
Jan. 2007   |  29,444    27.2%      0%       72.8%    |  144,721    11.6%      3.1%     85.3%
Mar. 2007   |  30,236    26.5%      0%       73.5%    |  151,380    11.2%      3.0%     85.8%
May. 2007   |  30,978    25.6%      0%       74.4%    |  157,392    10.5%      2.7%     86.8%
Jul. 2007   |  31,668    25.9%      0%       86.1%    |  166,057    10.0%      3.8%     86.2%
Sep. 2007   |  32,326    24.5%      0%       75.5%    |  168,876     9.7%      2.5%     87.8%
Nov. 2007   |  33,001    23.9%      0%       76.1%    |  174,318     9.5%      2.2%     88.3%
Table 2. Coverage of tier-1 edges by Skitter and UCLA

            |          Skitter            |             UCLA
Time        |  Total   T1 mesh  Other T1  |   Total    T1 mesh  Other T1
Jan. 2006   |  23,805     66      7,498   |  108,720      64     19,149
Mar. 2006   |  22,917     66      7,289   |  113,555      64     19,674
May. 2006   |  22,888     64      7,504   |  118,331      64     20,143
Jul. 2006   |  21,740     65      7,192   |  123,842      64     20,580
Sep. 2006   |  21,400     65      6,974   |  129,228      64     21,059
Nov. 2006   |  22,034     66      7,159   |  134,636      65     21,581
Jan. 2007   |  21,345     65      6,898   |  140,216      65     22,531
Mar. 2007   |  21,366     65      6,774   |  147,000      65     23,194
May. 2007   |  20,738     65      6,694   |  153,156      65     23,769
Jul. 2007   |  22,972     65      6,838   |  159,792      65     24,310
Sep. 2007   |  20,570     64      6,510   |  164,770      65     24,888
Nov. 2007   |  20,466     64      6,430   |  170,431      65     25,480
has its own biases and measurement artifacts. Combining them blindly will only add these biases together, potentially leading to poorer quality data. Further research is required in order to devise a correct methodology that takes advantage of different datasets obtained from different sampling processes.
The above observations suggest that the Internet, once seen as a tree-like, disassortative network with strict power-law properties [6], is moving towards an assortative and highly inter-connected network. Tier-1 providers have always been well connected, but the biggest shift is seen at the Internet's periphery, where content providers and small ISPs are aggressively adding peering links among themselves using IXPs to avoid paying transit charges to tier-1 providers. Content distribution networks are partly the reason behind such changes [13].
A different view of the Internet evolution can be obtained using the WSD, shown in Figures 2(a) and 4(a). One possible cause for this behavior is increased mixing of the core and periphery of the network, i.e. the strict tiered hierarchy is becoming less important in the network structure. This is given further weight by
Mixing Biases: Structural Changes in the AS Topology Evolution
43
studies such as [15], which show that the level of peering between ASes in the Internet has greatly increased during this period, leading to a less core-dominated network. Given that a fraction of AS edges are not visible from current datasets, and that this visibility is biased towards customer-provider relationships, we believe that our observations actually underestimate the changes in the structure of the AS topology. Using a hierarchical, preferential attachment-based model to generate synthetic AS topologies is less and less justified. The AS topology structure is becoming more complex than in the past.
5 Related Work
In this section we outline related work, classified into three groups: evolution of the AS topology, spectral graph analysis of the AS topology, and analysis of the clustering features of the AS topology.
Dhamdhere and Dovrolis [4] rely on available estimation methods for the types of relationships between ASes in order to analyze the evolution of the Internet ecosystem over the last decade. They believe the available historic datasets from RouteViews and RIPE are not sufficient to infer the evolution of peering links, and so they restrict their focus to customer-provider links. They find that, after an exponential growth phase until 2001, the Internet now grows linearly in terms of both ASes and inter-AS links. The growth is mostly due to enterprise networks and content/access providers at the periphery of the Internet. The average path length remains almost constant, mostly due to the increasing multi-homing degree of transit and content/access providers. Relying on geo-location tools, they find that the AS ecosystem is now larger and more dynamic in Europe than in North America. In our paper we have relied on two datasets, covering a more extensive set of links and nodes, in order to focus on the structural growth and evolution of the Internet. We use a large set of graph-theoretic measures in order to focus on the behavior of the topology. Due to inherent issues involved with the inference of node locations and types of relationships [11], we treat the AS topology as an undirected graph.
Shyu et al. [18] study the evolution of a set of topological metrics computed on a set of observed AS topologies. The authors rely on monthly snapshots extracted from BGP RouteViews from 1999 to 2006. The topological metrics they study are the average degree, average path length, node degree, expansion, resilience, distortion, link value, and the Normalized Laplacian Spectrum. They find that the metrics are not stable over time, except for the Normalized Laplacian Spectrum. We explore this metric further by using the WSD.
Oliveira et al. [16] look at the evolution of the AS topology as observed from BGP data. Note that they do not study the evolution of the AS topology structure, only the nodes and links. They propose a model aimed at distinguishing real changes in ASes and AS edges from BGP routing observation artifacts. We use the extended dataset made available by the authors, in addition to 7 years of AS topology data from an alternative measurement method.
6 Conclusions
In this paper we presented a study of two views of the evolving Internet AS topology, one inferred from traceroute data and the other from BGP data. We exposed discrepancies between these two inferred AS topologies and their evolution. We reconciled these discrepancies by showing that the topologies are not directly comparable, as neither method sees the entire Internet topology: BGP data misses some peering links in the core which traceroute observes; traceroute misses many more peering links than BGP in the periphery. However, traceroute and BGP data do provide complementary views of the AS topology.
To remedy the problems of decreasing coverage by the Skitter traceroute infrastructure and the lack of visibility of the core by the UCLA BGP data, significant improvements in fidelity could be achieved with changes to the existing measurement systems. The data collected by the traceroute infrastructure would benefit from greater AS coverage, while the BGP data would benefit from data showing the intra-core connectivity it misses today. We acknowledge the challenges inherent in these improvements but emphasize that, without such changes, the study of the AS topology will forever be subject to the vagaries of imperfect and flawed data. Availability of traceroute data from a larger number of vantage points, as attempted by the DIMES project, will hopefully help remedy these issues. However, even such measurements have to be done on a very large scale, and ideally performed both from the core of the network (like Skitter) and from the edge (like DIMES). Efforts towards better assessment of the biases inherent in the measurements are also necessary.
In an effort to provide a better perspective on the changing structure of the AS topology, we used a wide range of topological metrics, including the newly introduced weighted spectral distribution.
Our analysis suggests that the core of the Internet is becoming less dominant over time, and that edges at the periphery are growing more relative to the core. The practice of content providers and content distribution networks seeking connectivity to greater numbers of ISPs at the periphery, and the rise of multi-homing, both support these observations. Further, we observe a move away from a preferential-attachment, tree-like, disassortative network, toward a network that is flatter, highly inter-connected, and assortative. These findings are also indicative of the need for more detailed and timely measurements of the Internet topology, in order to build on works such as [5], focusing on the economics of structural changes such as institutional mergers, multi-homing and increasing peering relationships.
Acknowledgements. We thank Mickael Meulle for his help with the BGP datasets.
References

1. Alvarez-Hamelin, J.I., Dall'Asta, L., Barrat, A., Vespignani, A.: k-core decomposition of Internet graphs: hierarchies, self-similarity and measurement biases. Networks and Heterogeneous Media 3, 371 (2008)
2. Bush, R., Hiebert, J., Maennel, O., Roughan, M., Uhlig, S.: Testing the reachability of (new) address space. In: Proceedings of the 2007 SIGCOMM Workshop on Internet Network Management, INM 2007 (2007)
3. Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y., Shir, E.: A model of Internet topology using k-shell decomposition. In: PNAS (2007)
4. Dhamdhere, A., Dovrolis, C.: Ten years in the evolution of the Internet ecosystem. In: Proceedings of ACM/Usenix Internet Measurement Conference (IMC) 2008 (2008)
5. Economides, N.: The economics of the Internet backbone. NYU, Law and Research Paper No. 04-033; and NET Institute Working Paper No. 04-23 (June 2005)
6. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the Internet topology. In: Proceedings of ACM SIGCOMM 1999, Cambridge, Massachusetts, United States, pp. 251–262 (1999)
7. Fay, D., Haddadi, H., Uhlig, S., Moore, A.W., Mortier, R., Jamakovic, A.: Weighted Spectral Distribution. IEEE/ACM Transactions on Networking (TON) (to appear)
8. Gill, P., Arlitt, M., Li, Z., Mahanti, A.: The flattening Internet topology: Natural evolution, unsightly barnacles or contrived collapse? In: Claypool, M., Uhlig, S. (eds.) PAM 2008. LNCS, vol. 4979, pp. 1–10. Springer, Heidelberg (2008)
9. Haddadi, H., Fay, D., Jamakovic, A., Maennel, O., Moore, A.W., Mortier, R., Rio, M., Uhlig, S.: Beyond node degree: evaluating AS topology models. Technical Report UCAM-CL-TR-725, University of Cambridge, Computer Laboratory (July 2008)
10. Haddadi, H., Fay, D., Jamakovic, A., Maennel, O., Moore, A.W., Mortier, R., Rio, M., Uhlig, S.: On the importance of local connectivity for Internet topology models. In: 21st International Teletraffic Congress (ITC 21) (2009)
11. Haddadi, H., Iannaccone, G., Moore, A., Mortier, R., Rio, M.: Network topologies: inference, modelling and generation. IEEE Communications Surveys and Tutorials 10(2) (2008)
12.
Huffaker, B., Andersen, D., Aben, E., Luckie, M., claffy, K., Shannon, C.: The skitter as links dataset (2001-2007) 13. Labovitz, C., Iekel-Johnson, S., McPherson, D., Oberheide, F.J.J., Karir, M.: ATLAS Internet Observatory 2009 Annual Report. NANOG47 (June 2009), http://tinyurl.com/yz7xwvv 14. Newman, M.: Assortative mixing in networks. Physical Review Letters 89(20), 871–898 (2002) 15. Oliveira, R., Pei, D., Willinger, W., Zhang, B., Zhang, L.: In search of the elusive ground truth: The Internet’s AS-level connectivity structure. In: ACM SIGMETRICS, Annapolis, USA (June 2008) 16. Oliveira, R., Zhang, B., Zhang, L.: Observing the evolution of Internet AS topology. In: Proceedings of ACM SIGCOMM 2007, Kyoto, Japan (August 2007) 17. Roughan, M., Tuke, S.J., Maennel, O.: Bigfoot, sasquatch, the yeti and other missing links: what we don’t know about the as graph. In: IMC 2008: Proceedings of the 8th ACM SIGCOMM conference on Internet measurement, pp. 325–330. ACM, New York (2008) 18. Shyu, L., Lau, S.-Y., Huang, P.: On the search of Internet AS-level topology invariants. In: Proceedings of IEEE Global Telecommunications Conference, GLOBECOM 2006, Francisco, CA, USA, pp. 1–5 (2006) 19. Subramanian, L., Agarwal, S., Rexford, J., Katz, R.H.: Characterizing the Internet hierarchy from multiple vantage points. In: Proceedings of IEEE Infocom 2002 (June 2002) 20. Zhou, S.: Characterising and modelling the Internet topology, the rich-club phenomenon and the PFP model. BT Technology Journal 24 (2006)
EmPath: Tool to Emulate Packet Transfer Characteristics in IP Network Jaroslaw Sliwinski, Andrzej Beben, and Piotr Krawiec Institute of Telecommunications Warsaw University of Technology Nowowiejska 15/19, 00-665 Warsaw, Poland {jsliwins,abeben,pkrawiec}@tele.pw.edu.pl
Abstract. This paper describes the EmPath tool, which was designed to emulate packet transfer characteristics such as delays and losses in IP networks. The main innovation of this tool is its ability to emulate the transfer of a packet stream while maintaining packet integrity as well as the distributions and correlations of packet delay and loss. In this method, we decide the fate of a new packet (delay or loss) by using conditional probability distributions that depend on the transfer characteristics of the last packet. For this purpose, we build a Markov model with transition probabilities calculated on the basis of measured packet traces. The EmPath tool was implemented as a module of the Linux kernel and its capabilities were examined in a testbed environment. In the paper, we show some results illustrating the effectiveness of the EmPath tool. Keywords: network emulator, traffic modeling, validation, testing.
1
Introduction
Network emulators are attractive tools for supporting the design, validation and testing of protocols and applications. They aim to provide a single node (or a set of nodes) with the ability to introduce the packet transfer characteristics that would be observed in a live network. Therefore, network emulators are often regarded as a “network in a box” solution [1]. The key issue during the design of a network emulator is the method used to represent the network behavior. Among the realizations of emulators we recognize two main techniques: (1) real-time simulation, where the emulator simulates the network to get appropriate treatment of packets, e.g., as proposed in [2,3], or (2) model-based emulation, where the emulator enforces the delay, loss or duplication of packets using a model of the network behavior; the parameters for the model come from measurements, simulations or analysis. We focus on model-based emulation because it is regarded as scalable even for large networks and high link speeds. One of the first widely used emulators was dummynet, whose design and features were presented in [4]. Its main objective was the evaluation of TCP performance with regard to limited bit rate, constant propagation delay and
This work was partially funded by MNiSW grant no. 296/N-COST/2008/0.
F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 46–58, 2010. c Springer-Verlag Berlin Heidelberg 2010
packet losses. This concept was further investigated by several authors, e.g., in [5,6], leading to the implementation of the NIST Net [1] and NetEm [7] tools for Linux kernel versions 2.4 and 2.6, respectively. Both tools model network impairment as a set of independent processes related to packet transfer delays, packet losses, packet reordering and duplication. Due to the assumed independence, they neither maintain the integrity of transferred packet streams nor the autocorrelation within the packet transfer delay process or the packet loss process. Moreover, they lack any cross-correlation between the delay and loss generation processes. On the other hand, studies of packet transfer characteristics in the Internet, e.g., those presented in [8,9,10,11], point out significant dependencies in Internet traffic that result in strong correlation between the delays and losses experienced by transferred packets, as well as long-range dependence. This effect is especially visible for packets sent in a sequence with small inter-packet gaps. Furthermore, the autoregression analysis of Internet traffic performed in [10] suggests that the transfer delay of a given packet strongly characterizes the transfer delay of the consecutive one. The constraints of the NetEm tool, which we briefly presented above, are discussed in more detail in section 2. Following those limitations, we formulate the requirements for the design of a new emulation method. In our method, named EmPath, the delays and losses experienced by transferred packets are modeled as a Markov process. While there are also solutions that use a Markovian description for this purpose, e.g., [12] and [13], our model correlates both the delay and loss processes in one solution. Moreover, our approach uses multiple transition matrices, where each of them is conditioned on the status of the preceding packet.
Contrary to previous works, we do not try to fit the transition matrices using linear programming optimization. We derive the necessary conditional probabilities from delay traces measured in a live network by sending probing packets with small inter-packet gaps. Notice that the correlation depends on the duration of the inter-packet gap. Therefore, for each incoming packet we observe its inter-packet gap and then calculate a number of steps over the transition matrix; the number of steps depends on the inter-packet gap. Finally, we implemented this emulation method as an open source tool for Linux kernel version 2.6. The paper is structured as follows: in section 2, we recall the concept of network emulation and discuss the requirements for useful emulators. Then, we analyze the NetEm tool and show its capabilities and limitations. After that, in section 3, we present the proposed emulation algorithm, which is based on the Markov model, and we focus on implementation issues. In the next section, we show results of exemplary experiments that demonstrate the performance of our tool. Finally, section 5 summarizes the paper and gives a brief outline of further work.
2
The Problem Statement
In this section we recall the concept of network emulation and present its applications for experimentally driven research. Then, we discuss requirements for
[Fig. 1. The concept of network emulation: measured packet transfer characteristics (e.g., packet delay distribution, losses, correlations) serve as input data to the network emulator, through which the tested protocol is run.]
designing a network emulator. From this point of view, we study the effectiveness of the widely used NetEm tool [7]. Our experiments aim to identify NetEm's capabilities and limitations. The obtained results motivated us to design and implement the EmPath emulation tool.
2.1
The Concept of Network Emulation
The concept of network emulation assumes that instead of performing experiments in a live network, we measure the packet transfer characteristics offered by the network and, on that basis, synthetically reproduce the packet transfer process in a single device, called the network emulator. As illustrated in Fig. 1, the network emulator should statistically provide the same values of packet transfer delay, delay variation, packet loss and correlation between consecutive packets as a live network. Therefore, in principle, there should be no difference whether an incoming packet is served by the network or by the emulator. The network emulator is regarded as a convenient tool for supporting experimentally driven research, prototype testing and the design of new protocols [14]. It may be treated as a complementary approach for testing real code in a “semi-synthetic” environment [1]. The key advantages of network emulation compared to simulation techniques and live network trials are the following. First, the network emulator allows testing of prototype implementations (real code and physical equipment) with their software or hardware limitations. Second, the network emulator allows tests to be repeated under the same network conditions, which is practically impossible in the case of live network trials. Last but not least, the network emulator simplifies tests in complex network scenarios since it
reproduces end-to-end characteristics without modeling the details of network elements. In addition, the lab environment does not need access to a live network. On the other hand, the network emulator is not a perfect solution, because it just follows the network behavior that was observed in the past. This is usually performed by gathering packet traces that should be representative for a given experiment. Therefore, the collection of packet traces is a crucial issue. These packet traces may be obtained from a live network, from a testbed environment or even from simulation experiments. As we mentioned above, the network emulator requires a method for accurate replication of the packet transfer process. The method should statistically assure the same service of packets as if they were handled by the network; nevertheless, one must define measurable features that can be used to evaluate the emulator's effectiveness. In this paper, we consider the following conditions: 1. the emulator should provide the same probability distribution of packet transfer delay and packet loss as was observed in the network. In this way, the network emulator provides accurate values of IP performance metrics, e.g., IP packet transfer delay (IPTD), IP packet delay variation (IPDV) and IP packet loss ratio (IPLR), defined in [15]. 2. the emulator should introduce similar autocorrelation of the packet transfer delay process and the packet loss process. The autocorrelation is an important factor that shows how the emulator represents the dependencies between samples in a given realization of the process. Our analysis is focused on correlograms (autocorrelation plots). 3. the emulator should maintain the cross-correlation between emulated processes as experienced by packets in the network. This feature shows how the method captures dependencies between different random processes, i.e., between the packet delay process and the packet loss process.
We measure the cross-correlation by the correlation coefficient [16]. 4. the emulator should maintain packet stream integrity as in the live network. This feature is important, because reordered packets may have a deep impact on protocol performance. We measure the level of packet stream integrity by the IP packet reordered ratio (IPRR) metric defined in [15].
2.2
Case Study: NetEm Tool
In this case study we focus on evaluating the capabilities and limitations of the NetEm emulator [7] available in the Linux operating system. NetEm uses four independent processes to emulate the network behavior: (1) a packet delay process, (2) a packet loss process, (3) a packet duplication process and (4) a reordering process. The packet delay process uses a delay distribution stored in the form of an inverted cumulative distribution function. NetEm offers a few predefined distributions, e.g., uniform, normal and Pareto, but it also allows custom distributions to be provided. They can be created from packet delay traces by the maketables tool available in the iproute2 package.
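The inverted-CDF idea behind such distribution tables can be sketched as follows. This is a hypothetical re-implementation for illustration only, not the actual maketables code; the table size is an assumption:

```python
import random

def build_inverse_cdf_table(delay_samples, table_size=4096):
    """Build an inverted-CDF lookup table from measured delay samples:
    entry k holds the delay at quantile k / table_size."""
    ordered = sorted(delay_samples)
    n = len(ordered)
    return [ordered[min(int(k * n / table_size), n - 1)]
            for k in range(table_size)]

def sample_delay(table):
    """Draw a delay by indexing the table with a uniform random quantile."""
    return table[random.randrange(len(table))]
```

Sampling from the table then reproduces the marginal delay distribution, but, as discussed below, each draw is independent, so no correlation structure survives.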
[Fig. 2. Histogram of packet transfer delay for the original network (probability on a logarithmic scale vs. delay, 0–50 ms).]
Notice that traces with packet transfer delay samples are usually not publicly available. Taking this fact into account, we used “one point” packet traces (with the volume of traffic over time) and performed simulations to obtain delay and loss characteristics. We selected one of the traffic traces available in the MAWI repository [17], i.e., a trace file captured at sample point F of the WIDE network on the 1st of November 2009 (file name 200911011400.dump). This sample point records traffic going in both directions of an inter-continental link. In the network emulation, we were interested in one direction, so we filtered the trace file to include only the packets with destination Ethernet address 00:0e:39:e3:34:00. In the simulations, we used only IP packets (version 4 and 6) without the overhead of Ethernet headers; there were 6 non-IP packets and they were discarded. Finally, the filtered trace file covered around 8.2 × 10^9 bytes sent over 15 minutes; the mean bit rate of the traffic was close to 73 Mbps. Note that the packet trace was collected at a single point in the network. In order to obtain packet transfer delay and loss characteristics we performed a simple simulation experiment. First, we created a topology with 2 nodes connected by a 100 Mbps link with 10 ms propagation delay. The size of the output buffer for each interface was set to 300 packets. Next, we introduced two traffic streams: (1) a background stream based on the prepared packet trace, and (2) a foreground constant bit rate stream using 100 byte packets emitted every 1 ms (bit rate equal to 800 kbps). Since the link capacity in the original packet trace was equal to 150 Mbps, we artificially created a bottleneck where queueing effects appeared. Finally, we recorded the packet delay and loss traces for the probing stream. The obtained histogram of the delay distribution is presented in Fig. 2. On the basis of these delay samples, we prepared a distribution table for the NetEm tool.
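The bottleneck experiment above can be approximated with a minimal single-link FIFO model. This is a simplified sketch, not the simulator actually used by the authors; the function and its parameter names are assumptions:

```python
def simulate_fifo_link(packets, capacity_bps, prop_delay_s, buffer_pkts):
    """Pass (arrival_time_s, size_bytes) packets through one FIFO link.

    Returns a list of per-packet transfer delays (None = dropped).
    The experiment in the text used a 100 Mbps link, 10 ms propagation
    delay and a 300-packet buffer.
    """
    results = []
    queue_free_at = []          # departure times of packets still in the queue
    link_free_at = 0.0
    for t, size in packets:
        # purge packets that have left the queue by time t
        queue_free_at = [d for d in queue_free_at if d > t]
        if len(queue_free_at) >= buffer_pkts:
            results.append(None)               # buffer overflow -> loss
            continue
        start = max(t, link_free_at)           # wait for the link if busy
        link_free_at = start + size * 8 / capacity_bps
        queue_free_at.append(link_free_at)
        results.append(link_free_at + prop_delay_s - t)
    return results
```

Feeding such a model with the background trace plus a 1 ms-spaced probe stream yields the per-probe delay and loss samples from which a distribution table can be built.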
This table was used for tests in a testbed network, which consisted of 3 nodes (PCs) connected in cascade by 1 Gbps links, similarly to the scenario
[Fig. 3. Histograms of packet transfer delay for the NetEm tool: (a) NetEm with default configuration, (b) NetEm with integrity.]
presented in Fig. 1. The middle node ran NetEm (with our custom delay distribution table), while the other two ran MGEN [18] to generate traffic. All nodes were synchronized with a GPS clock with time drift below 100 μs. The traffic emitted by the generator had the same profile as the foreground traffic used in the simulation (constant bit rate, 100 byte IP packets, 1 ms inter-packet gap). Using this setup we performed measurements in two test cases: Case 1: NetEm with default configuration, Case 2: NetEm with enforced packet stream integrity. Fig. 3 shows the histogram of packet transfer delay for both test cases. Moreover, Table 1 presents values of the performance metrics measured for NetEm with reference to the original network. In the first test case, we observe that the delay distribution is maintained up to 23 ms, but above this value the distribution is trimmed; all greater probability mass is assigned to the last value. This effect comes from NetEm's delay distribution storage method, which in the default configuration does not allow values to differ by more than 4 standard deviations from the mean value. This behavior limits the usage of distributions with long tails, which are typically observed in the Internet [8].

Table 1. Results of NetEm tests

                             original    NetEm    NetEm
                             network              with integrity
  mean IPTD [ms]             11.7        10.4     14.7
  stddev of IPTD [ms]        3.3         0.6      3.8
  IPLR [%]                   0.215       0.213    0.213
  IPPR [%]                   0           46       0
  cross-correlation coeff.   0.26        0.00     0.00

Another limitation appeared in the form of a large
[Fig. 4. Correlograms of (a) packet transfer delay and (b) packet loss, comparing the original network, NetEm and NetEm with integrity.]
amount of reordered packets, i.e., more than 46% of the packets changed their order in the packet stream. In the second test case, we changed the default behavior of NetEm to maintain packet stream integrity (we changed the queue type from tfifo to pfifo). This modification caused the delay distribution to change its shape and parameters, see Fig. 3(b) and Table 1. In Fig. 4, we present the correlograms (autocorrelation plots) of the delay and loss processes as observed in both test cases and in the original network. Notice that in both test cases the autocorrelation function of these processes is not maintained. In fact, the loss process is entirely uncorrelated, while the delay process shows minor autocorrelation only in the second test case, caused by enforcing integrity. We also see that not only does each process show no autocorrelation, but there is also no cross-correlation between the packet transfer delay and packet loss processes. In the original network the cross-correlation coefficient equals about 0.26, while for both NetEm tests it is zero (no correlation). The lack of correlation motivated us to perform an additional test with NetEm's built-in correlation feature. Comparing the results of this test with the first test case, we observed no lost packets at all. This effect comes from an incorrect implementation of NetEm's “correlation” model when an inverted cumulative distribution function is used. Therefore, we ignored these results. Through the above validation, we concluded that the NetEm emulation model has important shortcomings. This motivated us to create a new model that mitigates NetEm's limitations and allows a more precise replication of network characteristics.
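The correlation measures used throughout this comparison can be computed directly from the measured delay and loss series. A small plain-Python sketch, assuming equal-length sample arrays:

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of series x at a given lag (one correlogram point)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def cross_correlation(x, y):
    """Pearson correlation coefficient between two equal-length series,
    e.g. a delay series and a 0/1 loss-indicator series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5
```

Applied to the per-probe delay samples and the loss indicators, these two functions yield the correlograms of Fig. 4 and the cross-correlation coefficients of Table 1.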
3
Emulation Method in EmPath Tool
In this section we present the EmPath tool. After describing the proposed algorithm, we focus on its implementation in the Linux kernel.
3.1
The Emulation Algorithm
EmPath uses two random processes to emulate packet transfer. The first process decides whether an incoming packet is lost or transferred, while the second process determines the delay of transferred packets. Specifically, we define these processes as: – the packet loss process, named L(t), which takes the value 1 when an incoming packet is lost and 0 otherwise, – the packet transfer delay process, named D(t), which determines the delay observed by a transferred packet. Next, we define discrete-time series {Li} and {Di}, based on the above processes, using the moments of packet arrivals ti, where i = 1, 2, 3, . . . denotes the packet number, as: – {Li}, where Li = L(ti), – {Di}, where Di = D(ti). We assume that the emulator's decision about an incoming packet depends only on the status of the previous packet and the current inter-packet gap. This assumption allows us to use a discrete-time Markov process with the following generic equation:

(Ln−1, Dn−1, tn − tn−1) → (Ln, Dn).     (1)

Although we could apply more complex models with a broader range of dependencies (beyond the n − 1 state), our choice originates from the usual behavior of queueing systems. Moreover, the implementation of models with “long” memory of the process is unfeasible due to state space explosion. Notice that the generic rule (1) uses real numbers for Dn−1, tn−1 and tn, which for an efficient implementation would require knowing the exact distribution functions. In order to circumvent this limitation, our model uses a simplified representation with quantized values. First, we assume that quantized delay values can be grouped into a number of predefined states (with a relation f(delay) → state). Furthermore, we introduce a special state sloss that is used to emulate the delay of the packet transferred after any lost packet. Next, we treat the packet inter-arrival period tn − tn−1 with a finite time resolution Δ, where all packets arriving within one time unit Δ observe the same result (loss or delay). Finally, for each state s we need to know: – the probability of packet loss ls, – the conditional probability distribution of packet delay ds under the condition that the current packet is not lost (the support set of this distribution is also quantized). Using the above variables, the emulation algorithm for each time unit Δ is summarized as: 1. Check packet loss against ls. 2. If the packet is lost, set the current state to s ← sloss, return the result {loss} and stop the algorithm.
[Fig. 5. Exemplary operation of the emulation algorithm: over successive time units nΔ … (n+4)Δ the algorithm moves through states s0, s1, sloss, s0 and s2, returning {delay_n}, {loss_n+1}, {delay_n+2} and {delay_n+3}.]
3. Generate a new delay from the distribution ds. 4. Update the current state according to the relation s ← f(delay). 5. Return the result {delay} and stop the algorithm. In order to better understand the proposed emulation method, let us consider the example presented in Fig. 5, which shows a few transition steps. Initially (time moment nΔ), the algorithm is in state s0. As a new packet arrives, the decision is made that it will not be lost and that it will observe a delay equal to {delay_n}. Furthermore, this value of delay corresponds to a new state s1 for the algorithm. Consequently, at time moment (n + 1)Δ the algorithm uses another loss probability and delay distribution table. This time, it is decided that the packet should be lost, so the algorithm switches into the state sloss and returns the result {loss_n+1}. Following this scheme, the algorithm switches in the next steps into states s0 and s2 with the respective loss and delay distributions, and returns the exemplary values {delay_n+2} and {delay_n+3}. The emulation algorithm must be performed for each time unit Δ, so the number of iterations is proportional to the duration of the packet inter-arrival time, e.g., a packet inter-arrival time equal to kΔ requires k iterations. This behavior may hamper the emulator's performance, especially when there are long idle periods between arriving packets. In order to improve the emulator's performance, we can calculate a new table with analogous distributions for the time unit 2Δ based on the distributions ls and ds for the time unit Δ. This is similar to a Markov chain when we want to obtain the “two step” transition matrix from the transition matrix
with “one step”, i.e., we calculate the square of the transition matrix. This method can be applied recursively for any time unit of the form 2^nΔ. Consequently, by using multiple tables for different time units we reduce the complexity to a logarithmic level, e.g., for an inter-arrival time of 78Δ it is sufficient to use 4 iterations, as 78Δ = (64 + 8 + 4 + 2)Δ.
3.2
Implementation
The core part of EmPath is implemented as a kernel module of the Linux operating system. Similar to the NetEm tool, our tool is implemented as a packet scheduler that can be deployed on a network interface (the kernel module is named sch_empath). The emulation algorithm is applied to each packet arriving at the interface, which is either dropped or delayed in a first-in first-out queue. While the decision about packet loss can be implemented using a single condition, the representation of the delay distribution requires more attention. As our emulation model assumes multiple distribution tables that are used depending on the current state, we decided to use a less memory-consuming tree-like structure. Moreover, the EmPath kernel module allows for recursive calculation of “two step” distribution tables from any Δ time unit into a 2Δ one. The second part of our tool is an extension of the tc tool in the iproute2 package. The tc tool allows for the deployment and configuration of various packet schedulers from the user space of the Linux operating system. Our extension reads specially formatted files containing packet loss probabilities and packet transfer delay distributions. Notice that the Linux kernel does not support calculations with floating point numbers. Consequently, both the EmPath extension of the tc tool and the EmPath kernel module use fixed point representations with unsigned 32 bit integer numbers. This choice was also motivated by the fact that the default kernel random number generator we use, the function net_random(), provides unsigned 32 bit integer values. We released the EmPath tool as open source software available at http://code.google.com/p/empath/.
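The per-Δ decision loop of section 3.1 and the binary decomposition of the inter-arrival gap can be sketched in user space as follows. This is illustrative only: the state labels, table layout and rng hook are assumptions, not the sch_empath data structures, and the sketch iterates step by step where the kernel module would use precomputed 2^n-step tables:

```python
import random

def binary_steps(k):
    """Decompose a gap of k time units into powers of two,
    e.g. 78 -> [2, 4, 8, 64], matching 78 = 64 + 8 + 4 + 2."""
    return [1 << b for b in range(k.bit_length()) if k >> b & 1]

def emulate_packet(state, tables, f, gap_units, rng=random.random):
    """Advance the Markov model by gap_units steps of size delta and decide
    the fate of one packet. tables[state] = (loss_prob, [(delay, prob), ...]);
    f maps a delay value to the next state; 'loss' is the loss state."""
    result = None
    for _ in range(gap_units):
        loss_prob, delay_dist = tables[state]
        if rng() < loss_prob:
            state, result = 'loss', ('loss', None)
            continue
        u, acc = rng(), 0.0
        for delay, p in delay_dist:          # inverse-transform sampling
            acc += p
            if u <= acc:
                break
        state, result = f(delay), ('ok', delay)
    return state, result
```

With the doubling tables in place, the inner loop collapses to one table lookup per element of binary_steps(gap), giving the logarithmic cost mentioned above.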
4
Evaluation of EmPath Tool
In order to start the evaluation of the EmPath tool, we created tables with conditional probability distributions of packet transfer delay. We used the same packet delay traces for the original network as in the NetEm case study (see section 2.2). The profile of the traffic stream used for sampling the delay and loss processes considered 1 ms intervals, so the time unit was set to Δ = 1 ms. Furthermore, the resolution of the delay distributions was set to 1 ms. Taking into account that the range of packet transfer delay was equal to 30 ms, we decided to use 31 states (including one “loss state”). Therefore, we prepared 31 delay distributions and packet loss values, one for each state.
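This table-preparation step can be sketched as an estimator over a Δ-spaced probe trace. The function below is illustrative: the on-disk format consumed by the tc extension is different, and the state mapping to_state is an assumption:

```python
from collections import defaultdict

def build_tables(trace, to_state, loss_state='loss'):
    """Estimate, for each state, the loss probability l_s and conditional
    delay distribution d_s from a probe trace sampled every delta.
    trace: list of delays (None = packet lost); to_state quantizes a
    delay into a state label, e.g. int() for 1 ms resolution."""
    counts = defaultdict(lambda: [0, defaultdict(int)])  # state -> [losses, delay hist]
    prev_state = None
    for sample in trace:
        if prev_state is not None:
            entry = counts[prev_state]
            if sample is None:
                entry[0] += 1            # loss, conditioned on previous state
            else:
                entry[1][sample] += 1    # delay, conditioned on previous state
        prev_state = loss_state if sample is None else to_state(sample)
    tables = {}
    for state, (losses, hist) in counts.items():
        total = losses + sum(hist.values())
        ok = sum(hist.values())
        dist = {d: c / ok for d, c in hist.items()} if ok else {}
        tables[state] = (losses / total, dist)
    return tables
```

Each entry of the result pairs a per-state loss probability with a delay distribution conditioned on the packet not being lost, matching the (l_s, d_s) structure of section 3.1.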
[Fig. 6. Histograms of packet transfer delay in the EmPath experiment: (a) original network, (b) EmPath emulation.]
The experiment was performed in the same network topology as used for the NetEm case study (3 nodes connected in cascade, where the middle node provided the emulation capabilities). The profile of the measurement traffic remained the same. Fig. 6 presents histograms of the delay for the original network and for the values measured with the EmPath tool (more than 10 million packets were emitted). The shape of the delay distribution, as well as the performance metrics (shown in Table 2), are very similar to the original characteristic. The recovered distribution is shifted to the right due to the assumed 1 ms delay resolution. Moreover, Fig. 7(a) shows that the autocorrelation function of the delay process is similar to that of the original network, which suggests that the proposed model correctly captures this characteristic. For the characteristic of the loss process, the mean value (IPLR) is similar, but the autocorrelation functions (shown in Fig. 7(b)) differ at lower lags. This indicates that the assumed emulation method for packet losses is insufficient on short time scales (the geometric distribution is too simple). One may extend the model to add more states related to losses, e.g., by adding a condition for rare loss events upon other rare loss events. Consequently, this would require us to gather many more samples from the original network when preparing the distribution tables. On the other hand, the cross-correlation coefficient between the delay and loss processes is similar for the original network (0.26) and for the EmPath emulation (0.24). Furthermore, the EmPath tool guaranteed no packet reordering.

Table 2. Selected metrics measured in the EmPath experiment

                             original    EmPath
                             network     emulation
  mean IPTD [ms]             11.7        12.1
  stddev of IPTD [ms]        3.3         3.4
  IPLR [%]                   0.215       0.231
  IPPR [%]                   0           0
  cross-correlation coeff.   0.26        0.24
[Fig. 7. Correlograms of (a) packet transfer delay and (b) packet loss, comparing the original network and the EmPath emulation.]
Notice that while the implementations of the NetEm and EmPath tools are similar in nature, EmPath provides more realistic characteristics and features of the emulated network.
5
Summary
In this paper we investigated the problem of network emulation. Our objective was to emulate the packet transfer characteristics offered by a network path. We performed experiments with the widely used NetEm tool, which is available in the Linux kernel. The obtained results showed several limitations of the NetEm tool, related to the limited range of the packet transfer delay distribution, the lack of correlation between packet transfer delays and packet losses, and the lack of packet stream integrity. These limitations motivated us to design a more effective emulation method and implement our own tool, called EmPath. In our method, the delays and losses experienced by transferred packets are modeled as a Markov process that uses multiple transition matrices, where each of them is conditioned on the status of the preceding packet. Our method reproduces the distribution of packet transfer delay without packet reordering and introduces correlation of the delay and loss processes (autocorrelation and cross-correlation) in a similar way as observed in a live network. The parameters required by EmPath, such as the conditional probability distributions of packet delays and losses, are directly derived from delay traces measured in a live network. We implemented the EmPath tool in the Linux kernel and released it under the GNU Public License. We then performed experiments to verify EmPath's capabilities and possible limitations. The obtained results, as illustrated in the included examples, confirmed the effectiveness of the proposed method and the EmPath tool. In further work, we plan to focus on extending the model of the loss process to include burst losses, which were identified as a slight limitation of the EmPath tool. Moreover, we plan more experiments with delay and loss traces collected in different network environments.
A Database of Anomalous Traffic for Assessing Profile Based IDS

Philippe Owezarski

CNRS; LAAS; 7 Avenue du colonel Roche, F-31077 Toulouse, France
Université de Toulouse; UPS, INSA, INP, ISAE; LAAS; F-31077 Toulouse, France
[email protected]

Abstract. This paper proposes a methodology for evaluating the capabilities of current IDS for detecting attacks targeting networks and their services. This methodology aims to be as realistic as possible and reproducible, i.e. it works with real attacks and real traffic in controlled environments. It relies in particular on a database of attack traces specifically created for this evaluation purpose. By confronting IDS with these attack traces, it is possible to obtain a statistical evaluation of IDS, and to rank them according to their ability to detect attacks without false alarms. For illustration purposes, this paper shows the results obtained with 3 public IDS. It also shows how the attack trace database impacts the results obtained for the same IDS.

Keywords: Statistical evaluation of IDS, attack traces, ROC curves, KDD'99.
1 Motivation

1.1 Problematics
The Internet is becoming the universal communication network, conveying all kinds of information, ranging from simple transfers of binary computer data to the real-time transmission of voice, video, or interactive information. Simultaneously, the Internet is evolving from a single best-effort service to a multiservice network, a major consequence being that it becomes highly exposed to attacks, especially to denial of service (DoS) and distributed DoS (DDoS) attacks. DoS attacks are responsible for large changes in traffic characteristics which may in turn significantly reduce the quality of service (QoS) level perceived by all users of the network. Detecting and reacting against DoS attacks is a difficult task, and current intrusion detection systems (IDS), especially those based on anomaly detection
The author wants to thank all members of the MetroSec project. Thanks in particular to Patrice Abry, Julien Aussibal, Pierre Borgnat, Gustavo Comerlatto, Guillaume Dewaele, Silvia Farraposo, Laurent Gallon, Yann Labit, Nicolas Larrieu and Antoine Scherrer.
F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 59–72, 2010. c Springer-Verlag Berlin Heidelberg 2010
from a traffic profile, often fail to detect DDoS attacks efficiently. This can be explained by several arguments. First, DDoS attacks can take a large variety of forms, so that proposing a common definition is in itself a complex issue. Second, it is commonly observed that Internet traffic under normal conditions naturally presents large fluctuations and variations in its throughput at all scales [PKC96], often described in terms of long memory [ENW96], self-similarity [PW00], or multifractality [FGW98]. Such properties significantly impair anomaly detection procedures by decreasing their statistical performance. Third, Internet traffic may exhibit strong, possibly sudden, yet legitimate variations (flash crowds, FC) that may be hard to distinguish from illegitimate ones. That is why IDS relying on anomaly detection by way of a statistical profile often yield a significant number of false positives, and are not very popular. These tools also lack efficiency when the increase of traffic due to the attack is small. This situation is frequent and extremely important because of the distributed nature of current denial of service attacks (DDoS). These attacks are launched from a large number of corrupted machines (called zombies) under the control of a hacker. Each machine generates a tiny amount of attacking traffic in order to hide it in the large amount of cross Internet traffic. On the other hand, as soon as these multiple sources of attacking traffic aggregate on links or routers on their way to their target, they represent a massive amount of traffic which significantly decreases the performance level of the victim, and of the network it is connected to. Anomaly detection is easy close to the victim, but detecting the attack so late is useless: the targeted resources have already been wasted and the QoS degraded; the attack is therefore successful.
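The asymmetry between source-side and victim-side visibility can be illustrated with a back-of-the-envelope calculation (all numbers below are hypothetical, chosen only to show the orders of magnitude involved):

```python
# Hypothetical DDoS scenario: many zombies, each sending very little.
zombies = 2000
rate_per_zombie_kbps = 50          # tiny, easy to hide in cross traffic
access_link_kbps = 10_000          # 10 Mbit/s link near one zombie
victim_link_kbps = 100_000         # 100 Mbit/s victim access link

# Near a single source: the attack is a negligible fraction of the link.
source_side_share = rate_per_zombie_kbps / access_link_kbps

# Near the victim: the aggregate of all zombies saturates the link.
aggregate_kbps = zombies * rate_per_zombie_kbps
victim_side_share = aggregate_kbps / victim_link_kbps

print(f"per-source share: {source_side_share:.1%}")          # 0.5%
print(f"aggregate share at victim: {victim_side_share:.0%}") # 100%
```

Detecting the 0.5% per-source component, rather than the obvious 100% aggregate at the victim, is the hard problem the rest of the paper targets.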
It is thus essential for IDS to detect attacks close to their sources, when the anomaly results only from the aggregation of the attacking traffic of a few zombies hidden in a massive amount of legitimate traffic.

1.2 The KDD'99 Traditional Evaluation Method
There exist several approaches for statistically evaluating IDS. However, they all have in common the need for a documented attack database. Such a database generally represents the ground truth [RRR08]. Up to now, the most used database has been KDD'99. The principle of a statistical evaluation is to have the IDS under evaluation analyze the documented attack traces, and to count the numbers of true positives, true negatives, false positives, and false negatives. The relations between these values can then be exhibited by means of a ROC (Receiver Operating Characteristic) curve. The ROC technique was used for the first time for evaluating IDS in 1998, in the framework of the DARPA project on the off-line analysis of intrusion detection systems at the MIT Lincoln Laboratory [MIT]. At that time, it was the first intelligible evaluation test applied to multiple IDS using realistic configurations. A small network was set up for this purpose, the aim being to emulate a US Air Force base connected to the Internet. Background traffic was generated, attacks were injected at well-defined points of the network, and the traffic was collected with TCPDUMP. Traffic was grabbed and recorded for 7 weeks,
and served for IDS calibration. Once calibration was performed, 2 weeks of traces containing attacks were used to evaluate the performance of the IDS under study. Several papers describe this experiment, such as Durst et al. [DCW+99], Lippmann et al. [LFG+00] and Lee et al. [LSM99]. This 1998 DARPA work used a wide range of intrusion attempts, tried to simulate realistic normal activity, and produced results that could be shared among all researchers interested in the topic. After an evaluation period, researchers involved in the DARPA project, as well as many others from the same research community, provided a full set of evaluation results with the attack database, which led to some changes in the database, known nowadays under the name KDD'99. These changes mainly dealt with the use of more furtive attacks, the use of target machines running Windows NT, the definition of a security policy for the attacked network, and tests with more recent attacks. Whereas KDD'99 aimed at serving the needs of the IDS research community, some important questions about its usability were raised. McHugh [McH01] published a strong criticism of the procedures used when creating the KDD'99 database, especially of the lack of verification of the realism of the emulated network compared to an actual one. It was followed in 2003 by Mahoney and Chan [MC03], who decided to review the database in detail. Mahoney and Chan showed that the traces were far from simulating realistic conditions, and therefore that even a very simple IDS can exhibit very high performance results, performance that it could never reach in a real environment. For example, they discovered that the trace database includes irregularities such as differences in TTL between attacking and legitimate packets. Unfortunately, despite all the disclaimers about the KDD'99 database for IDS evaluation, it is still massively used. This is mainly due to the lack of other choices and of efforts to provide new attack databases.
These limitations of the KDD'99 database motivated us to build a new one, as we needed to compare the performance of existing profile based IDS with that of the new anomaly detection tools we were designing in the framework of the MetroSec project [MET] (a project of the French ACI program on Security & Computer Science, 2004-2007).

1.3 Contribution
Despite these limitations of the DARPA project contributions to IDS evaluation, the introduction of the ROC technique remains a very simple and efficient solution, massively used since. It consists in combining detection results with the number of testing sessions to issue two values which summarize IDS performance: the detection ratio (number of detected intrusions divided by the number of intrusion attempts) and the false alarm rate (number of false alarms divided by the total number of network sessions). These summaries of detection results then represent one point on the ROC curve for a given IDS. The ROC space is defined by the false alarm and true positive rates on the X and Y axes respectively, which in fact represents the balance between efficiency and cost. The best possible IDS would then theoretically be represented by a single point
curve of coordinates (0, 1) in the ROC space. Such a point means that all attacks were detected and no false alarm was raised. A random detection process would be represented in the ROC space by a straight line going from the bottom left corner (0, 0) to the upper right corner (1, 1) (the line with equation y = x). Points above this line mean that the detection performance is better than that of a random process. Below the line, it is worse, and thus of no real meaning. Given the efficiency and simplicity of the ROC method, we propose to use it as a strong basis for our new profile based IDS evaluation methodology; in this respect it resembles the KDD'99 one. By contrast, the KDD'99 trace database appears to us as completely unsuited to our needs for evaluating the performance of IDS and anomaly detection systems (ADS). This is what our contribution is about. Let us recall here that in the framework of the MetroSec project, we were targeting attacks which can have an impact on the quality of service and performance of networks, i.e. on the quality of packet forwarding. It is not easy to find traffic traces containing this kind of anomaly. It would require access to many traffic capture probes, close to zombies, and launching traffic captures when attacks arise. Of course, hackers do not advertise when they launch attacks, and their zombies are unknown. We therefore had no documented traffic traces at our disposal containing these kinds of anomalies, i.e. traces in which no obvious anomaly appears, but for which we would know that between two dates an attack of a given kind, with a precise intensity (also called magnitude), was perpetrated. The lack of such traces is one of the big issues for researchers in anomaly detection.1 In addition, it is not enough to validate detection methods and tools on one or two traces; that would, for instance, make it impossible to quantify detection mistakes.

1.4 Paper Structure
This paper presents a new method for evaluating and comparing the performance of profile based IDS and ADS, which improves and adapts to new requirements the older KDD'99 method. The main contribution consisted in creating a new database of documented anomalies, among which some are attacks. These traces contain, in addition to anomalies, realistic background traffic having the variability, self-similarity, dependence, and correlation characteristics of real traffic, in which massively distributed attacks can easily hide. The creation process of this trace database, as well as the description of the main components used, are described in Section 2. Then, Section 3 shows, for a given ADS called NADA [FOM07], the differences in evaluation results depending on the anomaly/attack trace database used. Section 4 then shows, using our new evaluation method based on our new trace database (called the MetroSec database), the statistical evaluation results obtained for 3 publicly available IDS or ADS. Last, Section 5 concludes this paper with a discussion of the strengths and weaknesses of our new evaluation method relying on the new MetroSec anomaly database. It also gives information on the database availability.
This lack is all the more important as, for economic and strategic reasons of carriers, or for users' privacy, such data relating to anomalies or attacks are not made public.
2 Generation of Traffic Traces with or without Anomalies, by Way of Reproducible Experiments
2.1 The MetroSec Experimental Platform
One of the contributions of MetroSec was to produce controlled and documented traffic traces, with or without anomalies, for testing and validating intrusion detection methods. For this purpose, we led measurement and experimentation campaigns on a reliable operational network (which we are confident does not carry any anomalies, or at least very few), and generated ourselves the attacks and other kinds of anomalies that mix and interact with the regular background traffic. It is thus possible to define the kinds of attacks we want to launch, to control them (sources, targets, intensities2, etc.), and to associate with the captured traces a ticket indicating the precise characteristics of the perpetrated attacks. In such a context, anomalies are reproducible (we can regenerate the same experimental conditions as often as wanted). This reproducibility makes it possible to multiply such scenarios in order to improve the validation statistics of detection tools, or the accuracy of comparisons between our methods and others. The trace database produced in MetroSec is one of the significant achievements of the project, and guarantees the reliability of the IDS evaluation. The experimental platform used for creating the MetroSec trace database relies on the Renater network, the French network for education and research. Renater is an operational network used by a significantly large community in its professional activity. Because of its design, Renater has the necessary characteristics for our experiments:
– it is largely over-provisioned relative to the amount of traffic it transports. Its OC-48 links provide 2.4 Gbit/s of throughput, whereas a laboratory such as LAAS, having at its disposal a link whose capacity is 100 Mbit/s, generates on average less than 10 Mbit/s of traffic [OBLG08]. As a consequence, Renater provides a service with a constant quality.
Thus, even if we saturate the LAAS access link, the impact of Renater on this traffic and on the provided QoS is transparent. Experimental conditions on Renater are always the same, and our experiments are therefore reproducible;
– Renater integrates two levels of security against attacks coming from outside, but also from inside the network. Practically speaking, we never observed any attack at the measurement and monitoring points we installed in Renater.
The laboratories involved in the generation of traces are ENS in Lyon, LIP6 in Paris, IUT of Mont-de-Marsan, ESSI in Nice, and LAAS in Toulouse. Traffic is
Intensities (or magnitudes) of attacks are defined in this paper as the byte or packet rate of the attacking source(s).
captured at these different locations by workstations equipped with DAG cards [CDG+00] and GPS for very accurate temporal synchronization. In addition, when we want to perform massive attacks, the target is the laasnetexp network at LAAS [OBLG08], a network fully dedicated to risky experiments. We can completely saturate it in order to analyze extreme attacking situations.

2.2 Anomalies Generation
Anomalies studied and generated in the framework of the MetroSec project consist of more or less significant increases of traffic volume. We can distinguish two kinds of anomalies:
– anomalies due to legitimate traffic, such as flash crowds (FC). It is important to mention that such experiments can hardly be fully controlled;
– anomalies due to illegitimate traffic, such as flooding attacks. This traffic, over which we have full control, is generated with several classical attacking tools.
Details about anomaly generation are given in what follows.
• Flash Crowd (FC). To analyze the impact on traffic characteristics of a flooding event due to legitimate traffic variations, we triggered flash crowds on a web server. For the sake of realism, i.e. human randomness, we chose not to generate them with an automatic program, but to ask our academic colleagues to browse the LAAS web server (http://www.laas.fr).
• DDoS attack. The attacks generated for validating our anomaly detection methods consist of DDoS attacks, launched using flooding tools (IPERF, HPING2, TRIN00 and TFN2K). We selected well known attacking tools in order to generate malicious traffic that is as realistic as possible. The IPERF tool [IPE] (under a standard Linux environment) generates UDP flows at variable rates, with variable packet rates and payloads. The HPING2 tool [HPI] generates UDP, ICMP and TCP flows with variable rates (same throughput control parameters as IPERF). Note that with this tool it is also possible to set TCP flags, and thus to generate specific signatures in TCP flows. These two tools were installed on each site of our national distributed platform. Unlike TRIN00 and TFN2K (cf. next paragraph), it is not possible to centralize the control of all the IPERF and HPING2 instances running at the same time, and thus to synchronize the attacking sources.
One engineer on each site is therefore in charge of launching attacks at a predefined time, which induces at the target a progressive increase of the global attack load. TRINOO [Tri] and TFN2K [TFN] are two well known distributed attacking tools. They allow the installation, on different machines, of a program called a zombie (or daemon, or bot). This program is in charge of generating the attack
towards the target. It is remotely controlled by a master program which commands all the bots. It is possible to constitute an attacking army (or botnet) commanded by one or several masters. TFN2K bots can launch several kinds of attacks. In addition to classical flooding attacks using the UDP, ICMP and TCP protocols (sending a large number of UDP, ICMP or TCP packets to the victim), many other attacks are possible. The mixed flooding attack is a mix of UDP flooding, ICMP flooding and TCP SYN flooding. Smurf is an attacking technique based on the concept of amplification: bots use the broadcast address to artificially multiply the number of attacking packets sent to the target, and thereby multiply the power of the attack. TRIN00 bots, on their side, can only perform UDP flooding.

Table 1. Description of attacks in the trace database (columns: tool, attack type, trace duration, attack duration, intensity). The table covers four campaigns: November-December 2004 (HPING, TCP and UDP flooding), June 2005 (IPERF, UDP flooding), March 2006 (TRINOO, UDP flooding), and April-July 2006 (TFN2K: UDP, ICMP and TCP SYN flooding, mixed flooding, and Smurf), with trace durations from 1h to 16h20mn, attack durations from 3mn to 1h30, and intensities ranging from 3.82% to 99.46%.
Attacks launched with the different attacking tools (IPERF, HPING2, TRINOO and TFN2K) were performed by frequently changing the attack characteristics and parameters (duration, DoS flow intensity, size and rate of packets) in order to create different profiles for the attacks to be detected afterwards. The main characteristics of the generated attacks are summarized in Table 1. For each configuration, we captured the traffic before, during and after the attack, in order to surround the DoS period with two normal traffic periods. It is important to recall here that most of the time we tried to generate very low intensity attacks, so that they would not have a significant impact on the global traffic (and therefore would not change the average traffic). This emulates the case of a router receiving the packets from a small number of zombies, and thus represents the most interesting aspect of our problem, i.e. detecting DDoS attacks close to their low intensity sources. Our trace database nowadays contains around thirty captures of such experiments.
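Since each trace ships with a ticket documenting the attack sources and interval, the intensity figures reported in Table 1 can be recomputed from a capture. The sketch below is a hypothetical illustration, assuming intensity is measured as the attacking sources' share of total bytes over the attack interval (the footnote defines it as the byte or packet rate of the attacking sources); the function name, record format and IP addresses are all invented for the example.

```python
def attack_intensity(packets, attack_sources, t_start, t_end):
    """Share of bytes due to attacking sources within [t_start, t_end].

    `packets` is an iterable of (timestamp, src_ip, size_bytes) records,
    e.g. parsed from a capture; `attack_sources` is the set of zombie IPs
    known from the experiment ticket.
    """
    total = attack = 0
    for ts, src, size in packets:
        if t_start <= ts <= t_end:
            total += size
            if src in attack_sources:
                attack += size
    return attack / total if total else 0.0

# Toy trace: two legitimate sources and one zombie.
trace = [
    (0.0, "10.0.0.1", 1500), (0.5, "10.0.0.2", 1500),
    (1.0, "192.0.2.66", 1000), (1.5, "10.0.0.1", 1500),
    (2.0, "192.0.2.66", 1000),
]
print(attack_intensity(trace, {"192.0.2.66"}, 0.0, 2.0))  # about 0.31
```

The same counting with packet counts instead of byte sizes gives the packet-rate variant of the definition.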
3 Comparative Evaluation with Different Anomalies Databases
NADA (Network Anomaly Detection Algorithm) is an anomaly detection tool relying on the use of deltoids for detecting significantly anomalous variations of traffic characteristics. The tool also includes a classification mechanism aimed at determining whether detected anomalies are legitimate, and what their characteristics are. NADA was produced in the framework of the MetroSec project. Thanks to the full control we have over its code, it was easier for us to run experiments with such a tool, and the results were also easier to analyze with full knowledge of the detection tool. For more details, interested readers can refer to [FOM07]. In any case, as the evaluation methodology we are proposing is of the "black box" kind, it is not necessary to know how a tool is designed and developed in order to evaluate it. Let us just say that NADA uses a threshold k which determines whether a deltoid corresponds to an anomalous variation. Setting the k threshold configures the sensitivity of the detection tool, in particular relative to the natural variability of normal traffic. The rest of this section compares NADA's performance evaluation results with our method depending on the anomaly database used, MetroSec or KDD'99.

3.1 Evaluation with the MetroSec Anomalies Database
The statistical evaluation of NADA was performed using the traces with documented anomalies presented in Section 2.2, i.e. a total of 42 different traces. Each of the attack traces contains at least one DDoS attack; some contain up to four attacks of small intensity. Six traffic traces with flash crowds were also used for the NADA evaluation. In addition, the documented traces of the MetroSec database can be grouped according to the attacking tools used for generating the attacks/anomalies, and within each group differentiated according to attack intensities, durations, etc. Such a differentiation is important as it makes it possible to measure the sensitivity of the tool under evaluation, i.e. its capability of detecting small intensity anomalies (which of course cannot be done using the KDD'99 database, which only provides binary results). The intensity and duration of anomalies are two characteristics which have a significant impact on the capability of profile based IDS/ADS to detect them. Whereas strong intensity anomalies are well detected by most detection tools, this is in general not the case for small intensity attacks. Therefore, a suitable method for evaluating the performance of anomaly detection tools must be able to count how many times a tool succeeds or fails in detecting the anomalies contained in the traces, among which some are of low intensity. Figure 1.a shows the ROC curve obtained by evaluating NADA with the MetroSec anomaly database. It shows the detection probability (PD) according to the probability of false alarms (PF). Each point on the curve represents the average of all results obtained by NADA on all the anomalies of the MetroSec database for a given value of the k parameter, i.e. for a given sensitivity level. The curve analysis shows that NADA is significantly more efficient than a random
Fig. 1. Statistical performances of NADA evaluated based on (a) the MetroSec database and (b) the 10% KDD database. Detection Probability (PD) vs. Probability of false alarms (PF), PD = f(PF).

Table 2. Characteristics of the KDD'99 database in terms of numbers of samples

Base          | DoS       | Scan   | U2R | R2L    | Normal
10% KDD       | 391 458   | 4 107  | 52  | 1 126  | 97 277
Corrected KDD | 229 853   | 4 166  | 70  | 16 347 | 60 593
Full KDD      | 3 883 370 | 41 102 | 52  | 1 126  | 972 780
tool, as all points are above the line y = x. Even when the detection probability increases, the NADA ROC curve exhibits a very low false alarm rate. For example, with PD in [60%, 70%], the probability of false alarms is in [10%, 20%], which is a good result.

3.2 KDD'99
The KDD'99 database consists of the set of traces detailed in Table 2. During the International Knowledge Discovery and Data Mining Tools contest [Kd], only 10% of the KDD database was used for the learning phase [HB]. This part of the database contains 22 types of attacks and is a concise version of the full KDD database. The latter contains a greater number of attack examples than normal connections, and the types of attacks are not represented uniformly. Because of their nature, DoS attacks represent the huge majority of the database. On the other hand, the corrected KDD database provides different statistical distributions compared to the "10% KDD" or "Full KDD" databases, and it contains 14 new types of attacks. The NADA evaluation was limited to the 10% KDD database. Several reasons motivated this choice: first, although this database is the simplest, it is also the most used. Second, our intention is to show that the KDD database is not suited for
evaluating current ADS (and we will show that the reduced database is enough to demonstrate it). Last, it is a good first test for observing NADA's behavior with high intensity DoS attacks. Figure 1.b shows the NADA performance ROC curve obtained with the 10% KDD database. It shows the detection probability (PD) according to the probability of false alarms (PF), and its analysis shows that NADA obtained very good results. Applied to the KDD'99 database, NADA exhibits a detection probability close to 90%, and a probability of false alarms around 2%. These results are extremely good, but unfortunately unrealistic if we compare them with the results obtained with the MetroSec database! DoS attacks are detected in a very reliable way, but certainly because the database is excessively simple: around 98% of the attacks are DoS attacks of the same type, presenting in addition very strong intensities. The differences in NADA's performance when applied to the MetroSec and KDD'99 databases underline the importance of the anomaly database for evaluating profile based IDS and ADS. It is obvious that the MetroSec database presents more complex situations for NADA than KDD'99. Therefore, the evaluation results obtained with the MetroSec database are certainly closer to the real performance level of NADA than those obtained with KDD'99.
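The ROC construction used throughout this evaluation, one (PF, PD) point per sensitivity setting, reduces to simple counting over labeled traces. A minimal sketch with made-up detector outcomes (the numbers below are illustrative, not NADA's actual results):

```python
def roc_point(outcomes):
    """One ROC point for one sensitivity setting.

    `outcomes` is a list of (is_attack, raised_alarm) booleans, one per
    labeled trace segment. Returns (false alarm rate PF, detection rate PD).
    """
    tp = sum(1 for a, d in outcomes if a and d)        # detected attacks
    fn = sum(1 for a, d in outcomes if a and not d)    # missed attacks
    fp = sum(1 for a, d in outcomes if not a and d)    # false alarms
    tn = sum(1 for a, d in outcomes if not a and not d)
    pd_ = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    return pf, pd_

# Illustrative outcomes for one threshold k: 10 attack segments,
# 20 normal segments. Sweeping k and re-counting traces the full curve.
outcomes = [(True, True)] * 6 + [(True, False)] * 4 + \
           [(False, True)] * 3 + [(False, False)] * 17
pf, pd_ = roc_point(outcomes)
print(pf, pd_)    # 0.15 0.6
assert pd_ > pf   # above the y = x line: better than a random detector
```

A detector whose points fall on y = x is indistinguishable from random guessing; the ideal detector sits at the single point (0, 1).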
4 Evaluation of 2 Other IDS/ADS with the MetroSec Database
To illustrate the real ability of the statistical evaluation method proposed in this paper to distinguish between the actual capabilities of profile based IDS and ADS, this section shows comparative results between NADA and two other tools or approaches: the Gamma-FARIMA based approach [SLO+07], and the PHAD tool (experimental Packet Header Anomaly Detection) [MC01]. The Gamma-FARIMA approach is also one of the achievements of the MetroSec project. PHAD was selected because, in addition to being freely available, it aims at both detecting and classifying anomalies and attacks, similarly to the objectives of MetroSec. In addition, PHAD, funded by the American NSF, is said to be the ultimate intrusion and anomaly detection tool. Its authors' argumentation mainly relies on tests performed with KDD'99 traces: on these traces, PHAD gives perfect results.
• PHAD. The evaluation of PHAD using the MetroSec database was performed by varying the K threshold; K also represents the probability of generating a correct alarm. Figure 2.a shows the ROC curve obtained. Its analysis shows that PHAD behaves only a little better than a random detection process. Additional analysis to explain the problem of too many false alarms points to a problem with the underlying model and the learning process. It was also observed that, when facing low intensity attacks, PHAD detects none of them. Compared with the performance obtained with the KDD'99 database, there is a big gap. In fact, it seems that PHAD only detects the anomalies of KDD'99, and no others.
Fig. 2. Statistical performances of (a) PHAD and (b) the Gamma-FARIMA approach evaluated thanks to the MetroSec database. Detection Probability (PD) vs. Probability of false alarms (PF), PD = f(PF).
• Gamma-FARIMA. Figure 2.b shows the ROC curve obtained for the Gamma-FARIMA approach evaluated with the MetroSec database. More than 60% of the anomalies are detected with a false alarm rate close to 0. Observing all the ROC curves, the Gamma-FARIMA approach appears to be the most efficient of the 3 when confronted with the MetroSec database. NADA also exhibits good performance, not too far from that of the Gamma-FARIMA approach. These two algorithms, designed and developed in the framework of the MetroSec project, aim at detecting and classifying known and unknown anomalies. Both reach this goal using nevertheless different approaches. While NADA uses simple mathematical functions, the Gamma-FARIMA approach relies on more complex mathematical analyses, which are the source of its advantage. However, NADA's simplicity coupled with its high performance level can be of interest, particularly if NADA has to be combined with a process for identifying malicious packets. On the other hand, PHAD presents a dramatically low performance level when confronted with the MetroSec database (but is excellent when confronted with the KDD'99 one). A deeper analysis of the PHAD algorithm suggests that the use of 33 different parameters, and the assignment of a score to each anomalous occurrence, introduces uncertainty without improving detection accuracy. In addition, these 33 parameters only play a minor role in the detection of most of the attacks [MC01].
5 Conclusion
This paper presented a statistical evaluation method for profile based IDS and ADS that relies on a new traffic trace database containing both legitimate and
illegitimate anomalies. The paper showed that the anomaly database has a major impact on the evaluation results: for example, for NADA and PHAD the results obtained with the KDD'99 and MetroSec traces are completely different. It also showed that KDD'99 does not permit a satisfactory assessment of the accuracy of the different detection tools. It does not confront them with realistic enough (and thus complex enough) conditions, and in general the evaluated tools pass the tests with good marks... marks that are not reproducible once the tools are installed in a real environment. Indeed, the KDD'99 evaluation is somewhat binary: it only shows whether high intensity attacks can be detected. The MetroSec method plays with the intensities and durations of anomalies to determine the levels at which an attack can be detected. All these observations point to one of the big remaining problems. Even if we assume that the MetroSec database is exhaustive with respect to current traffic anomaly phenomena (we tried as much as possible to define generic anomalies along all dimensions of network traffic), how can we ensure its durability? Indeed, even if anomaly classes do not radically change, their shapes and intensities (especially in relation to the evolution of network capacities) will change, and the database, in the more or less long term, will lose some of its realism. Given the strategic value of traffic traces for operators, it is not at all certain that we could continue obtaining traffic traces into which anomalies could be injected. In addition, producing such traces containing anomalies is a very time consuming task which can hardly be supported by a single laboratory or a few laboratories. This is one of the main drawbacks, and no solution to it is currently apparent. The last problem exhibited by this paper is related to the role of the experimental data we are exploiting.
In fact, we use the same data both for designing and for validating/evaluating our tools. Thus PHAD, which works perfectly on the KDD'99 database (it was designed to detect anomalies and attacks by drawing inspiration from KDD'99), shows surprisingly low performance when tested with the MetroSec database. What would the results of the GammaFARIMA or NADA tools be if they were evaluated with anomaly databases other than MetroSec or KDD'99, the databases that inspired their design? Again, the solution would be to have a large number of anomalous trace databases in order to separate, at least experimentally, design from evaluation. But then we fall back into the previously mentioned problem of the lack of exploitable traffic traces.

• Database availability

Our intention is to open the trace database as widely as possible, within the limits imposed by privacy. This privacy issue forces us to keep the traces on LAAS' servers and does not allow us to let anyone download them. Up to now, the database has been made available to COST-TMA members who can visit LAAS for short scientific missions. This possibility will be maintained, and we hope to extend it in the near future: LAAS will then offer storage and computing capacities to all COST-TMA researchers interested in working with the traces. An explicit agreement with the LAAS charter is required.

A Database of Anomalous Traffic for Assessing Profile Based IDS
71

Our intention is to open the trace database to the research community at large. We have already asked what the requirements are for making the traces publicly available on a web/FTP server. Once the details are defined, they will be posted on the MetroSec web page [MET].
References

[CDG+00] Cleary, J., Donnelly, S., Graham, I., McGregor, A., Pearson, M.: Design principles for accurate passive measurement. In: Passive and Active Measurements, Hamilton, New Zealand (April 2000)
[DCW+99] Durst, R., Champion, T., Witten, B., Miller, E., Spagnuolo, L.: Testing and evaluating computer intrusion detection systems. Communications of the ACM 42(7) (1999)
[ENW96] Erramilli, A., Narayan, O., Willinger, W.: Experimental queueing analysis with long-range dependent packet traffic. IEEE/ACM Transactions on Networking 4(2), 209–223 (1996)
[FGW98] Feldmann, A., Gilbert, A.C., Willinger, W.: Data networks as cascades: Investigating the multifractal nature of Internet WAN traffic. In: ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pp. 42–55 (1998)
[FOM07] Farraposo, S., Owezarski, P., Monteiro, E.: NADA - Network Anomaly Detection Algorithm. In: Clemm, A., Granville, L.Z., Stadler, R. (eds.) DSOM 2007. LNCS, vol. 4785, pp. 191–194. Springer, Heidelberg (2007)
[HB] Hettich, S., Bay, S.: The UCI KDD Archive, Department of Information and Computer Science, University of California, Irvine (1999), http://kdd.ics.uci.edu
[HPI] HPING2, http://sourceforge.net/projects/hping2
[IPE] IPERF: The TCP/UDP bandwidth measurement tool, http://dast.nlanr.net/Projects/Iperf/
[Kd] UCI KDD Archive: KDD Cup 1999 datasets, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[LFG+00] Lippmann, R., Fried, D., Graf, I., Haines, J., Kendall, K., McClung, D., Weber, D., Webster, S., Wyschogrod, D., Cunningham, R., Zissman, M.: Evaluating intrusion detection systems: the 1998 DARPA off-line intrusion detection evaluation. In: DARPA Information Survivability Conference and Exposition, pp. 12–26 (2000)
[LSM99] Lee, W., Stolfo, S., Mok, K.: Mining in a data-flow environment: Experience in network intrusion detection. In: Proceedings of the ACM International Conference on Knowledge Discovery & Data Mining (KDD 1999), pp. 114–124 (1999)
[MC01] Mahoney, M., Chan, P.: PHAD: Packet header anomaly detection for identifying hostile network traffic. Technical Report CS-2001-04, Department of Computer Sciences, Florida Institute of Technology (2001)
[MC03] Mahoney, M., Chan, P.: An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection. In: Vigna, G., Krügel, C., Jonsson, E. (eds.) RAID 2003. LNCS, vol. 2820, pp. 220–237. Springer, Heidelberg (2003)
[McH01] McHugh, J.: Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Transactions on Information and System Security 3(4), 262–294 (2001)
[MET] METROSEC, http://www.laas.fr/METROSEC
[MIT] MIT Lincoln Laboratory (2008), http://www.ll.mit.edu/mission/communications/ist/corpora/ideval
[OBLG08] Owezarski, P., Berthou, P., Labit, Y., Gauchard, D.: LaasNetExp: a generic polymorphic platform for network emulation and experiments. In: Proceedings of the 4th International Conference on Testbeds and Research Infrastructures for the Development of Networks & Communities (TRIDENTCOM 2008) (March 2008)
[PKC96] Park, K., Kim, G., Crovella, M.: On the relationship between file sizes, transport protocols, and self-similar network traffic. In: International Conference on Network Protocols, pp. 171–180. IEEE Computer Society, Washington (1996)
[PW00] Park, K., Willinger, W.: Self-similar network traffic: an overview. In: Park, K., Willinger, W. (eds.) Self-Similar Network Traffic and Performance Evaluation, pp. 1–38. Wiley (Interscience Division), Chichester (2000)
[RRR08] Ringberg, H., Roughan, M., Rexford, J.: The need for simulation in evaluating anomaly detectors. Computer Communication Review 38(1), 55–59 (2008)
[SLO+07] Scherrer, A., Larrieu, N., Owezarski, P., Borgnat, P., Abry, P.: Non-Gaussian and long memory statistical characterisations for Internet traffic with anomalies. IEEE Transactions on Dependable and Secure Computing 4(1) (January 2007)
[TFN] TFN2K: An analysis, http://packetstormsecurity.org/distributed/TFN2kAnalysis-1.3.txt
[Tri] Trinoo: The DoS Project's "trinoo" distributed denial of service attack tool, http://staff.washington.edu/dittrich/misc/trinoo.analysis
Collection and Exploration of Large Data Monitoring Sets Using Bitmap Databases

Luca Deri 1,2, Valeria Lorenzetti 1, and Steve Mortimer 3

1 ntop.org, Italy
2 IIT/CNR, Italy
3 British Telecom, United Kingdom

{deri,lorenzetti}@ntop.org, [email protected]
Abstract. Collecting and exploring monitoring data is becoming increasingly challenging as networks become larger and faster. Solutions based both on SQL databases and on specialized binary formats do not scale well as the amount of monitoring information increases. This paper presents a novel approach to the problem: a bitmap database that allowed the authors to implement an efficient solution for both data collection and retrieval. Validation on production networks has demonstrated the advantage of the proposed solution over traditional approaches, making it suitable for efficiently handling and interactively exploring large data monitoring sets.

Keywords: NetFlow, Flow Collection, Bitmap Databases.
1 Introduction

NetFlow [1] and sFlow [2] are the current state-of-the-art standards for building traffic monitoring applications. Both are based on the concept of a traffic probe (agent in sFlow parlance) that analyzes network traffic and produces statistics, known as flow records, which are delivered to a central data collector [3]. As the number of records can be very large, probes can use sampling mechanisms to reduce the workload on both probe and collector. In sFlow, sampling is native to the architecture, so agents can use it to effectively reduce the number of flow records delivered to collectors; the cost is reduced result accuracy, although the resulting estimates have quantifiable accuracy. In NetFlow, sampling (on both packets and flows) reduces the load on routers but leads to inaccuracy [4] [5] [6], hence it is often disabled in production networks. The consequence is that network operators face the problem of collecting and analyzing a large number of flow records. This problem is often solved using a flow collector that stores data in a relational database, or on disk in raw format for maximum collection speed [7] [8]. Both approaches have pros and cons: in general, SQL-based solutions allow users to write powerful and expressive queries at the price of flow collection speed and query response time, whereas raw-file-based solutions are more efficient but provide limited query facilities.

The motivation behind this work is to overcome the limitations of existing solutions and create an efficient alternative to relational databases and raw files. We aim

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 73–86, 2010.
© Springer-Verlag Berlin Heidelberg 2010
to create a new-generation flow collection and storage architecture that exploits state-of-the-art indexing and querying technologies [9], together with a set of tools capable of interactively exploring large volumes of collected traffic data with minimal query response time. The main contributions of this paper include:
• The ability to execute multidimensional queries on arbitrarily large amounts of data with response times in the order of seconds (in many cases, milliseconds).
• A flow record storage architecture that is efficient yet simple in terms of disk space, query response time, and data collection duration.
• A system that operates on raw flow records without first reducing or summarizing them.
• A reduction of the time needed to explore a large dataset, and the possibility to display query results in real time, making the exploration process truly interactive.
The following section presents a survey of relevant flow storage and retrieval architectures, describes their limitations, and lists a set of requirements that a flow management architecture should feature. Section three covers the architecture and design choices of the proposed solution. Section four validates this solution on two production networks, evaluates the implementation performance, and positions this work against popular tools identified during the survey.
2 Related Work and Motivation

Flow collectors are software applications responsible for receiving flow records emitted by network elements such as routers and switches. Their main duty is to make sure that all flow records are received and successfully written to persistent storage. This limits flow record loss and decouples the collection phase from flow analysis, with the drawback of adding some latency, as records are often not processed immediately upon arrival. Tools falling into this category include nfdump [10], flow-tools [11], FlowScan [12], Stager [13] and SiLK [14]. These tools store data in binary flat files, optionally compressed to reduce disk space usage and read time; they typically offer additional tools for filtering, extracting, and summarizing flow records matching specific criteria. As flat files have no indexing, searching always requires a sequential scan of all stored records. To reduce the dataset to be scanned, these tools save flow records in directories covering a fixed time span, which eases temporal selection of records during queries. In essence, the speed advantage of dumping flow records in raw format is paid back at every search operation in terms of the amount of data to read. Another limitation of this family of tools is that their query language is limited compared to SQL: they offer flow-based filtering with minimal aggregation, join, and reporting facilities.

The use of relational databases is fairly popular in most commercial flow collectors, such as Cisco NetFlow Collector and Fluke NetFlow Tracker, and in open-source tools such as Navarro [15] and pmacct [16]. The flexibility of SQL is very useful during report creation and data aggregation, although some researchers
have proposed a specialized flow query language [17]. Unfortunately, relational databases are known to be slower (during both data insertion and querying) and to take more space than raw flow record files [18] [19] [20]. The conclusions of the survey on popular flow management tools are:
• Tools based on raw binary files are efficient when storing flow records (e.g. nfdump can store over 250K records/sec on a dual-core PC) but provide limited flow query facilities.
• Relational databases are slower during both flow record insertion and retrieval, but thanks to SQL they offer very flexible query and reporting facilities.
• On large volumes of collected flow records, the query time of both tool families becomes significant (minutes if not hours [21]) even on high-end computers, making them unsuitable for interactive data exploration.
Given that the performance of state-of-the-art tools is suboptimal, the authors investigated whether there is a better solution to the problem of flow collection and query than raw files and relational databases.

2.1 Towards Column-Oriented Databases with Bitmap Indexes

A database management system typically structures data records using tables with rows and columns, and accelerates queries by means of auxiliary data structures known as database indexes [22]. Relational databases encounter performance issues with large tables, in particular because of the size of the table indexes that must be updated at each record insertion. In the last few years, new regulations requiring ISPs to maintain large archives of user activity (e.g. login/logout/RADIUS/email/WiFi access logs) [23] have stimulated the development of new database types able to efficiently handle billions of records.
Although available since the late 1970s [24], column-oriented databases [25] were niche products until vendors such as Sensage [26] and Sybase [27], and open-source implementations such as FastBit [28] [29] [30], ignited new interest in this technology. A column-oriented database stores its content by column rather than by row, an approach known as vertical organization. The values of each column are stored contiguously, and column-store compression ratios are generally better than those of row stores because consecutive entries in a column are homogeneous [31] [32]. These database systems have been shown to perform more than an order of magnitude better than traditional row-oriented database systems, particularly on read-intensive analytical processing workloads. In fact, column stores are more I/O efficient for read-only queries, since they only have to read from disk (or from memory) those attributes accessed by a query [25].

B-tree indexes are the most popular method for accelerating search operations. They were originally designed for transactional data, where any index must be updated quickly as records are modified and query results are usually limited in size, and they fail to meet the requirements of modern data analysis, such as interactive analysis over large volumes of collected traffic data. Such queries return thousands of records, which with b-trees would require a large number of tree-branching
operations that rely on slow pointer chases in memory and random disk accesses, thus taking a long time. Many popular indexing techniques, such as hash indexes, have similar shortcomings. Considering the peculiarities of network monitoring data, where flow records are read-only and several flow fields have very few unique values, as of today the best indexing method is a bitmap index [33]. These indexes use bit arrays (commonly called bitmaps) and answer queries by performing bitwise logical operations on them. For tasks that demand the fastest possible query processing speed, bitmap indexes perform extremely well, because the intersection of the search results on each variable is a simple AND of the corresponding bitmaps [22].

Given that column-oriented databases with bitmap indexes outperform relational databases, the authors explored their use in the field of flow monitoring and designed a system based on this technology able to efficiently handle flow records. The main requirements of this development work include:
• Ability to save flow records to disk with minimal overhead, allowing no-loss, on-the-fly flow-to-disk storage, as with tools based on raw files.
• Compact data storage to limit disk usage, enabling users to store months of flow records on a cheap hard disk with no need for costly storage systems.
• Immutable stored data (i.e. once saved, it cannot be modified or deleted), a key feature for billing and security systems where non-repudiation is mandatory.
• Ability to perform efficiently on network storage such as NFS (Network File System).
• A simple data archive structure, so that old data can be moved to off-line storage without complex data partitioning solutions.
• Avoidance of complex architectures [34] that are hard to maintain and operate, in favor of a simple tool that can potentially be used by all network administrators.
• On tens of millions of records:
  • Sub-second search time for cardinality searches (e.g. counting the records that satisfy a given criterion). This is a requirement for exploring data in real time and implementing interactive drill-down search.
  • Sub-minute search time when extracting the records matching a given criterion (e.g. the top X hosts and their total traffic on TCP port Y).
• A feature-rich query language comparable to SQL, with the ability to sort, join, and aggregate data while performing mathematical operations on columns (e.g. sum, average, min/max, variance, median, distinct), necessary for complex statistics on flows.

The following sections cover the design and implementation of an extension to nProbe [35], an open-source probe and flow collector, that allows flow records to be stored on disk using a column-oriented database with efficient compressed bitmap indexing. Finally, the performance of the nProbe implementation is evaluated and positioned against the similar tools listed above.
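To make the bitmap-index mechanism described above concrete, here is a minimal sketch in Python (hypothetical code, not FastBit's actual API): one bitmap per distinct value of a low-cardinality column, and a multidimensional query answered by a single bitwise AND.

```python
# Minimal sketch of a bitmap index (hypothetical, not FastBit's API).
# For each distinct value of a low-cardinality column, keep one bitmap
# with bit i set when row i holds that value.

def build_bitmap_index(column):
    index = {}
    for row, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << row      # set bit `row` in this value's bitmap
    return index

# Toy flow records: per-row protocol and destination port columns.
protocols = [6, 17, 6, 6, 17, 6]      # 6 = TCP, 17 = UDP
dst_ports = [80, 53, 443, 80, 53, 80]

proto_idx = build_bitmap_index(protocols)
port_idx = build_bitmap_index(dst_ports)

# "SELECT COUNT(*) WHERE PROTOCOL=6 AND L4_DST_PORT=80" becomes one AND:
matches = proto_idx[6] & port_idx[80]
print(bin(matches), bin(matches).count("1"))  # rows 0, 3, 5 -> 3 matches
```

Real bitmap indexes store these bitmaps compressed (FastBit uses Word-Aligned Hybrid encoding), but the query path is the same: one bitwise operation per range condition, followed by a population count.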
3 Architecture and Implementation

nProbe is an open-source NetFlow probe that supports both NetFlow and sFlow collection, as well as flow conversion between versions (for instance, converting v5 to v9 flows).

Fig. 1. nProbe Flow Record Collection and Export Architecture (sFlow, NetFlow, and captured packets enter nProbe, which exports flows and dumps data to raw files, MySQL, SQLite, or FastBit)
nProbe fully supports NetFlow v9/IPFIX, so dynamic flow templates (i.e. Flexible NetFlow) can be specified when the tool is started. nProbe features flow collection and storage, both on raw files and in relational databases such as MySQL and SQLite. Support for relational databases has always been controversial: nProbe users appreciate the ability to query flow records using SQL, but flow dump to a database is usually activated only at small sites, because enabling database support can lead to flow record loss due to the database processing overhead. This overhead is mostly caused by network latency and multi-user database access, by table index updates during insertion, and by poor database performance when records are searched while data is being inserted. Databases offer mechanisms for mitigating some of these issues, including batch (rather than real-time) insertion, disabling transactions, and defining tables without indexes to avoid the index update overhead.

To overcome the limitations of existing flow management systems, the authors decided to explore the use of column-based databases by implementing an extension to nProbe that allows flows to be stored on disk using FastBit [29]. Strictly speaking, FastBit is not a database but a C++ library that implements efficient bitmap indexing methods. Data is represented as tables with rows and columns. A large table may be partitioned into many data partitions, each stored in a distinct directory, with each column stored as a separate file in raw binary form. The name of the data file is the name of the column. Each data partition contains an extra file named -part.txt holding metadata such as the partition name and the column names. Each column stores its data in uncompressed form, so its size equals that of a raw file dump.
Columns hold values 1, 2, 4, or 8 bytes long; data longer than 8 bytes must be split across two or more columns.
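As an illustration of this on-disk layout, the following toy Python snippet writes a FastBit-style partition directory: one raw binary file per column plus a -part.txt metadata file. The metadata keys shown are illustrative; real partitions are produced by nProbe/FastBit, not by code like this.

```python
import os
import struct
import tempfile

# Toy writer mimicking the FastBit on-disk layout described above:
# one raw binary file per column, plus a -part.txt metadata file.
def write_partition(base, columns):
    os.makedirs(base, exist_ok=True)
    n_rows = len(next(iter(columns.values())))
    for name, values in columns.items():
        with open(os.path.join(base, name), "wb") as f:
            for v in values:                  # 4-byte unsigned integers
                f.write(struct.pack("<I", v))
    with open(os.path.join(base, "-part.txt"), "w") as f:
        # Illustrative metadata keys, not FastBit's exact header format.
        f.write("BEGIN HEADER\n")
        f.write(f"Number_of_rows = {n_rows}\n")
        f.write(f"Number_of_columns = {len(columns)}\n")
        f.write("END HEADER\n")

base = os.path.join(tempfile.mkdtemp(), "2010/03/15/14/35")
write_partition(base, {"PROTOCOL": [6, 17, 6], "L4_DST_PORT": [80, 53, 80]})
print(sorted(os.listdir(base)))
```

Note that each column file occupies exactly rows x width bytes (here 3 x 4 = 12), matching the statement that an uncompressed column takes the same space as a raw dump.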
Compressed bitmap indexes are stored in separate files named after the column with an .idx suffix; each column thus typically has two files, one containing the data and the other the index. Indexes can be created on data "as stored on disk" or on reordered data. This is a main difference with respect to conventional databases: data can first be reordered, column by column, so that the bitmap indexes are built on the reordered data. Note that reordering does not affect query results (i.e. row data is not mixed up when columns are reordered); it just improves index size and query speed.

Data insertion and querying are performed via library calls or a subset of SQL natively supported by the library. In FastBit, the SELECT clause can only contain a list of column names and a few functions, namely AVG, MIN, MAX, SUM, and DISTINCT, each taking a single column name as its argument. The WHERE clause is a set of range conditions joined by logical operators such as AND, OR, XOR, and NOT. The clauses GROUP BY, ORDER BY, and LIMIT, and the operators IN, BETWEEN, and LIKE can also be used in queries. FastBit currently does not support advanced SQL features such as nested queries, operators such as UNION and HAVING, or functions like FIRST, LAST, NOW, and FORMAT.

nProbe creates FastBit partitions according to the flow templates configured (probe mode) or read from incoming flows (collector mode), with columns having the same size as the NetFlow elements they contain. Users can configure the partition duration (in minutes) at runtime; when a partition reaches its maximum duration, a new one is automatically created. Partition names are laid out as a directory tree (e.g.
/year/month/day/hour/minute). Similarly to [36], the authors have developed facilities for rotating partitions, limiting disk space usage while preserving their structure. No FastBit-specific configuration is necessary: nProbe knows the flow format and automatically creates partitions and columns. Data types longer than 64 bits, such as IPv6 addresses, are transparently split across two FastBit columns. For efficiency, flow records are not saved individually but dumped in blocks of 4096 records.

Users can decide to build indexes on all columns or only on selected ones, so as not to waste space on indexes for columns that will never appear in queries. If, while executing a query, FastBit does not find an index for a key column, it builds the index for that column on the fly before executing the query. For efficiency reasons, the authors decided that indexes are built not at every data dump but when a partition is complete (i.e. its duration has elapsed), because building indexes on reordered data is more efficient, both in disk usage and in query response time, than building them on data in insertion order. The drawback of this design choice is that queries can use the indexes only once they have been built, i.e. once the partition has been completely dumped to disk. On the other hand, flow records can be dumped at full speed with no index-build overhead; thus, not counting flow receive/decode overhead, it is possible to save more than one million flow records/sec on a standard Serial ATA (SATA) disk.

Column indexes are loaded entirely into memory during searches, which imposes a limit on the partition size; FastBit itself also limits a partition to 2^32 records.
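For instance, with five-minute partitions, the directory for a given flow timestamp could be derived as follows (a hypothetical naming helper mirroring the /year/month/day/hour/minute scheme; nProbe's actual path format may differ):

```python
from datetime import datetime

def partition_dir(base, ts, minutes=5):
    # Round the timestamp down to the partition duration (e.g. 5 minutes)
    # and build the /year/month/day/hour/minute directory path.
    minute = (ts.minute // minutes) * minutes
    return f"{base}/{ts.year}/{ts.month:02d}/{ts.day:02d}/{ts.hour:02d}/{minute:02d}"

print(partition_dir("/flows", datetime(2010, 3, 15, 14, 37)))
# -> /flows/2010/03/15/14/35
```

Rounding timestamps to the partition duration is what makes temporal selection cheap: answering "last hour" queries is a matter of picking directories, not scanning data.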
It is therefore wise to avoid creating very large partitions; at the same time, creating too many small partitions must also be avoided, as this results in many files on disk whose access overhead
(open, close, and seek time) can dominate the data analysis time. A good compromise is to have partitions that either last a fixed amount of time (e.g. 5 minutes of flow records) or contain a maximum number of records. Typically, for a machine with a few GB of memory, the FastBit developers recommend data partitions containing between 1 million and 100 million records.

Conceptually, a FastBit partition is similar to a table in a relational database, so when a query spans several partitions it is necessary to merge the results, and to collapse them when the DISTINCT SQL clause is used. This task is not performed by FastBit but delegated to utilities developed by the authors:
• fbmerge: a tool for merging several FastBit partitions into one. This tool, now part of the FastBit distribution, is useful when small fine-grained partitions need to be aggregated into a larger one. For instance, if nProbe is configured to create one-minute partitions, at the end of the hour all of them can be aggregated into a one-hour partition. This greatly reduces the number of column files, and hence of disk i-nodes, which is very useful on large disks containing many days or months of collected records.
• fbquery: a tool for querying partitions. It supports an SQL-like syntax and implements, on top of FastBit, useful facilities such as:
  • Aggregation of similar results, data sorting, and result-set limitation (like MySQL's LIMIT).
  • Recursive search on nested directories, so that a single directory containing several partitions can be searched in one shot. This is useful, for instance, when nProbe has dumped five-minute partitions and a user wants to query the last hour, which requires fbquery to read several partitions.
  • Data dump in several formats, such as CSV, XML, and plain text. The data format is driven by the metadata produced by nProbe, so partition columns are printed in their native representation (e.g. an IPV4_DST_ADDR is printed as a dotted IPv4 address, not as a 32-bit unsigned integer).
  • Scriptability in Python, for combining several queries or creating HTML pages that render data in a web browser.

In a nutshell, the authors have used the FastBit library to create an efficient flow collection and storage system. As the library was not designed for handling network flows, the authors implemented some missing features that are a prerequisite for creating comprehensive network traffic reports. The following section evaluates the performance of the proposed solution, compares it against relational databases, and validates it on two large networks, demonstrating that nProbe with FastBit is a mature solution that can be used in a production environment.
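The cross-partition merge step that fbquery performs can be pictured as follows (an illustrative Python sketch, not fbquery's actual C++ implementation), assuming each partition returns a per-key (packets, bytes) aggregate, as a query like Q4 would produce:

```python
from collections import defaultdict

# Illustrative sketch of merging per-partition aggregation results,
# as fbquery must do when a query spans several FastBit partitions.
def merge_partition_results(partition_results, limit=None):
    # Each partition yields a {group_key: (packets, bytes)} mapping;
    # identical keys from different partitions must be collapsed.
    totals = defaultdict(lambda: [0, 0])
    for result in partition_results:
        for key, (pkts, byts) in result.items():
            totals[key][0] += pkts
            totals[key][1] += byts
    # Sort by total bytes, descending, as in "ORDER BY s DESC LIMIT n".
    merged = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return merged[:limit] if limit else merged

part1 = {"10.0.0.1": (10, 1500), "10.0.0.2": (5, 700)}
part2 = {"10.0.0.1": (20, 3000), "10.0.0.3": (1, 60)}
print(merge_partition_results([part1, part2], limit=2))
# -> [('10.0.0.1', [30, 4500]), ('10.0.0.2', [5, 700])]
```

This also shows why many tiny partitions carry a hidden cost: the merge runs outside FastBit, over one result set per partition.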
4 Validation and Performance Evaluation In order to evaluate the FastBit performance, nProbe has been deployed in two different environments:
• Medium ISP: Bolig:net A/S. The average backbone traffic is around 250 Mbit/sec (about 40K pps). The traffic is mirrored onto a Linux PC (Linux Fedora Core 8 32 bit, dual-core Pentium D 3.0 GHz, 1 GB of RAM, two SATA III disks configured with RAID 1) that runs nProbe in probe mode. nProbe computes the flows (NetFlow v9 bidirectional format with a maximum flow duration of 5 minutes) and saves the flow records on disk using FastBit. Each FastBit partition stores one hour of traffic, and on average the probe produces 36 million flow records per day. Before deploying nProbe, records were collected and stored in a MySQL database.
• Large ISP: British Telecom. nProbe is used in collector mode. It receives flow records from 10 peering routers, with a peak export rate of 85K flow records/sec and no flow loss. Each month the records total more than 4 TB of disk space. The application server (dual quad-core Intel processors, 24 GB of memory, Ubuntu Linux 9.10 64 bit) carries out queries on the data stored on an NFS server by the collection server. The NetFlow collection server (single quad-core Intel processor, 8 GB of memory, Ubuntu Linux 9.10 64 bit) stores the FastBit data to the NFS server. Each FastBit partition stores 60 minutes of traffic, occupying about 5.8 GB of disk space when indexed. Before deploying nProbe, flow records were collected using nfdump.

The goal of these two setups is to validate nProbe with FastBit in two different environments and to compare the results with the solutions previously in use. The idea is to compare a regional with a country-wide ISP and verify whether the proposed solution is effective in both scenarios. The code being open-source, it is also important to verify that it is efficient on standard PCs (contrary to solutions based on the costly clusters or server farms mostly used in telco environments), as this is the most common scenario for open-source users.

4.1 FastBit vs.
Relational Databases

The goal of this test is to compare the performance of FastBit with that of MySQL (version 5.1.40, 64 bit), a popular relational database. As the host running nProbe is a critical machine, in order not to interfere with the collection process, two days' worth of traffic was dumped in FastBit format and then transferred to a Core2Duo 3.06 GHz Apple iMac running MacOS 10.6.2. Moving FastBit partitions across machines with different operating systems and word lengths (one 32 bit, the other 64 bit) required no data conversion, as FastBit transparently handles the differences between architectures. This is a useful property, since collector hosts can be based on different operating systems and technologies. To evaluate how the FastBit partition size affects search speed, the hourly partitions were also merged into a single daily directory. To compare the two approaches, five queries were defined:
• Q1: SELECT COUNT(*), SUM(PKTS), SUM(BYTES) FROM NETFLOW
• Q2: SELECT COUNT(*) FROM NETFLOW WHERE L4_SRC_PORT=80 OR L4_DST_PORT=80
• Q3: SELECT COUNT(*) FROM NETFLOW GROUP BY IPV4_SRC_ADDR
• Q4: SELECT IPV4_SRC_ADDR, SUM(PKTS), SUM(BYTES) AS s FROM NETFLOW GROUP BY IPV4_SRC_ADDR ORDER BY s DESC LIMIT 1,5
• Q5: SELECT IPV4_SRC_ADDR, L4_SRC_PORT, IPV4_DST_ADDR, L4_DST_PORT, PROTOCOL, COUNT(*), SUM(PKTS), SUM(BYTES) FROM NETFLOW WHERE L4_SRC_PORT=80 OR L4_DST_PORT=80 GROUP BY IPV4_SRC_ADDR, L4_SRC_PORT, IPV4_DST_ADDR, L4_DST_PORT, PROTOCOL

FastBit partitions were queried using the fbquery tool with the appropriate command-line parameters. All MySQL tests were performed on the same machine, with no network communication between client and server (i.e. the MySQL client and server communicate over a Unix socket). To evaluate the influence of MySQL indexes on query speed, the same tests were repeated with and without indexes. Tests were performed on 68 million flow records containing a subset of the NetFlow fields (source/destination IP, source/destination port, protocol, begin/end time). Table 1 compares the disk space used by MySQL and FastBit; for FastBit, indexes were computed on all columns.

Table 1. FastBit vs. MySQL disk usage (results in GB)
                            No Indexes   With Indexes
MySQL                       1.9          4.2
FastBit, Daily Partition    1.9          3.4
FastBit, Hourly Partition   1.9          3.9
Table 2. FastBit vs MySQL Query Speed (results are in seconds)

         MySQL                     Daily Partitions       Hourly Partitions
Query    No Index   With Indexes   No Cache   Cached      No Cache   Cached
Q1       20.8       22.6           12.8       5.86        10         5.6
Q2       23.4       69             0.3        0.29        1.5        0.5
Q3       796        971            17.6       14.6        32.9       12.5
Q4       1033       1341           62         57.2        55.7       48.2
Q5       1754       2257           44.5       28.1        47.3       30.7
The test outcome demonstrates that FastBit takes approximately the same disk space as MySQL in terms of raw data, whereas MySQL indexes are much larger. Merging FastBit partitions does not usually improve the search speed; instead, queries on merged data require more memory, as FastBit loads a larger index.
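The reason count-style queries need only the index can be illustrated with a toy bitmap index. The sketch below is purely illustrative (FastBit actually uses compressed, WAH-encoded bitmaps, not plain Python integers): one bitmap per distinct column value, with COUNT answered by bit operations alone, never touching the raw records.

```python
# Minimal bitmap-index sketch (illustration only, not FastBit's actual
# compressed implementation). Column values are hypothetical.
ports = [80, 443, 80, 22, 80, 443]   # an L4_DST_PORT-like column

bitmaps = {}
for row, value in enumerate(ports):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << row       # set bit `row` in this value's bitmap

# SELECT COUNT(*) WHERE port = 80  ->  popcount of one bitmap,
# no access to the raw data at all
print(bin(bitmaps[80]).count("1"))   # 3
```

Range and OR predicates combine bitmaps with bitwise `|` before counting, which is why index-only queries stay fast regardless of the raw data volume.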
L. Deri, V. Lorenzetti, and S. Mortimer
The size/duration of a partition mostly depends on the application that will access the data. Small partitions (e.g. 1 or 5 minutes long) make sense for interactive data exploration where drill-down operations are common: a small partition means a small FastBit index, resulting in faster operations and less memory use. On the other hand, querying a long period using small partitions requires fbquery to read several small indexes instead of a single one, which is inefficient on standard disks (i.e. non solid-state drives) due to seek time. In addition, a side effect of multiple partitions is that fbquery needs to merge the results produced on each partition, without relying on FastBit. Note that large partitions also have drawbacks for searches, as indexes cannot be built on them until they have been completely dumped. For this reason, if nProbe saves flow records to a large, one-day-long partition, queries on the current day must be performed without indexes, as the partition has not been completely dumped yet. In a nutshell, there is no single rule for defining the partition duration; in general the partition granularity should be as close as possible to the expected query granularity. The authors suggest using partitions lasting from 1 to 5 minutes, in order to have quick searches even on partitions being written (i.e. on the most recent data), and then merging partitions daily using fbmerge. This avoids exhausting disk i-nodes with index files, while still searching past data efficiently without accessing too many files. In terms of query performance, FastBit clearly outperforms MySQL:
• Queries that only require access to indexes take less than a second, regardless of the query type.
• Queries that require data access are at least an order of magnitude faster than on MySQL and always complete within a minute.
• Index creation on MySQL takes many minutes, which prevents its use in real life when importing data in (near-)realtime, also considering that the indexes take a significant amount of disk space. Contrary to FastBit, MySQL indexes do not speed up these queries: query time with indexes is longer than for the same query on unindexed data.
• Disk speed is an important factor in accelerating queries. In fact, running the same test twice with data already cached in memory significantly decreases the query time, and the use of RAID 0 improved performance further.

4.2 FastBit vs. Raw Files

The goal of this test is to compare FastBit with a popular open-source collection tool named nfdump. Tests have been performed on a large network with terabytes of collected flow data per month. Although nfdump performs well at flow collection, its query performance is sub-optimal on large data sets. One of the main concerns of the network operators is that nfdump queries take so long that they often need to be run overnight before producing results. The explanation of this behavior is that nfdump does not index data, so searching a large time span means reading all the raw data received over that period, which in this setup means GBs (if not TBs) of records. Using FastBit, the average speed improvement is in the order of 20:1. From the operator's point of view this means that queries complete in a reasonable amount of time. For instance, a query written in SQL as
‘SELECT IPV4_SRC_ADDR, L4_SRC_PORT, IPV4_DST_ADDR, L4_DST_PORT, PROTOCOL FROM NETFLOW WHERE IPV4_SRC_ADDR=X OR IPV4_DST_ADDR=X’ on 19 GB of data containing 14 hours of collected flow records takes about 45 seconds with FastBit, a major improvement over nfdump, which takes about 1500 seconds (25 minutes) to complete the same query. As nfdump does not use any index, its execution time is dominated by the time needed to sequentially read the binary data. This means that: query time = (time to sequentially read the raw data) + (record filtering time). The record filtering time is usually small, as nfdump filters are fast and the complexity of the filters, whose syntax is similar to BPF [37] filters, is usually limited. This means that in nfdump the query time is basically the time needed to sequentially read the raw data. The previous query validates this hypothesis: 1500 seconds to read 19 GB of data means an average reading speed of about 12.6 MB/sec, which is the typical speed of a SATA drive. For this reason, this section does not repeat the tests of section 4.1: since the nfdump query time is mostly proportional to the amount of data to read [20], the expected nfdump response time can be computed with some simple math. Also note that the nfdump query language is not SQL-like, so a one-to-one comparison with FastBit and MySQL is not possible. As flow records take a large amount of disk space, it is likely that they will be stored on a SAN (Storage Area Network). When the storage is directly attached to the host by means of fast communication links such as InfiniBand or FibreChannel, the system does not see any speed degradation compared with a directly attached SATA disk. The authors therefore decided to study how the use of network file systems such as NFS affects query results.
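The sequential-read arithmetic above can be checked directly (decimal units assumed):

```python
# Back-of-the-envelope check: nfdump query time ~ raw-data read time.
data_bytes = 19e9       # 19 GB of flow records
query_seconds = 1500    # observed nfdump query time

read_speed = data_bytes / query_seconds / 1e6
print(f"{read_speed:.1f} MB/s")                  # 12.7 MB/s, SATA-class speed
print(f"FastBit speed-up: {query_seconds / 45:.0f}:1")   # 33:1 for this query
```

The ~12.7 MB/s result matches the ~12.6 MB/s figure quoted in the text; this particular query exceeds the 20:1 average improvement reported above.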
A simple model for the time needed to read γ bytes is t = α + β · γ, where α represents the disk access latency and β the per-byte read time (the inverse of the throughput). NFS typically increases α but not β, as the network speed is typically higher than the disk read speed. In the case of nfdump the data is read sequentially, whereas FastBit accesses the raw data based on indexes. FastBit thus performs a number of smaller read operations, each of which pays α. However this extra cost is in the order of milliseconds, so it does not alter the overall comparison. This behavior has been verified by repeating some queries of section 4.1, demonstrating that the use of NFS only marginally affects the total query time.

4.3 FastBit Scalability

The tests have shown that the use of FastBit offers advantages with respect to both relational databases and raw-file-based solutions. In order to understand nProbe scalability when used with FastBit, it is necessary to separate flow collection from flow query. As stated in section 3, index creation happens after the partition has been dumped to disk, hence the dump speed is basically the write speed of the hard drive which, in the case of SATA disks, exceeds 1 million flow records/sec. As shown in section 4.2, a large ISP network produces less than 100’000 flow records/sec; this means that FastBit introduces no bottleneck in flow collection and dump. Flow query requires disk access, therefore the query process is mostly I/O bound. For every query, FastBit reads the whole index file of each column present in the WHERE clause. Then, based on the index search, it reads if necessary (e.g. COUNT(*) does not require it) the
column files containing the real data, performing seeks on files to move to the offsets where the index found a match. Thus a simple model for the query response time is τ = γ + ε + δ, where γ represents the time needed to read all the column indexes present in the WHERE clause, ε is the time to read the matching row data present in the SELECT clause (if any), and δ is the processing overhead. In general δ is very limited with respect to γ and ε. As γ = (index size / disk speed), it takes no more than a couple of seconds. Instead, ε can be pretty large if the data is sparse and several disk seeks are required. Note that δ can grow significantly depending on the query (e.g. when sorting large data sets), and that ε is zero for queries that count (e.g. compute the number of records on port X produced by host Y) or that use mathematical functions such as SUM (e.g. total number of bytes on port X).
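The two cost models above (t = α + β · γ for reads, τ = γ + ε + δ for a whole query) can be instantiated numerically. All parameter values below are hypothetical, chosen only to illustrate the orders of magnitude discussed in the text, not measured values:

```python
# All numbers are illustrative, not measurements from the paper.
ALPHA = 0.008            # per-access latency (s); NFS mainly inflates this
BETA = 1 / 100e6         # per-byte read time (s/byte), i.e. ~100 MB/s disk

def read_time(n_reads, bytes_per_read, alpha=ALPHA, beta=BETA):
    """t = alpha + beta * gamma, paid once per read operation."""
    return n_reads * (alpha + beta * bytes_per_read)

def query_time(index_bytes, match_bytes, n_seeks, delta=0.05):
    """tau = gamma + epsilon + delta."""
    gamma = read_time(1, index_bytes)           # sequential whole-index read
    epsilon = read_time(n_seeks, match_bytes / max(n_seeks, 1))
    return gamma + epsilon + delta

# COUNT(*)-style query: epsilon = 0, only the index is read
print(round(query_time(200e6, 0, 0), 2))        # ~2 s for a 200 MB index

# Sparse matches: the per-seek latency dominates epsilon
print(round(query_time(200e6, 64e3, 1000), 2))  # seeks make this much slower
```

The first case mirrors the "no more than a couple of seconds" claim for index-only queries; the second shows how scattered matches inflate ε through repeated seeks.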
5 Open Issues and Future Work

Tests on various FastBit configurations have shown that the disk is a component with a major impact on the whole system. The authors are planning to use SSD drives in order to see how the query time is affected, in particular while accessing raw data records that require several disk seek operations. One of the main limitations of FastBit is the lack of data compression: it currently compresses only indexes. This is a feature that the authors are planning to add, as it saves disk space and hence reduces the time needed to read the records. Using compression libraries such as QuickLZ, lzop, and FastLZ, it should be possible to implement transparent de/compression while reducing disk space. Another area of interest is the use of FastBit for indexing packets instead of flows. The authors have prototyped an application that parses pcap files and creates a FastBit partition based on various packet fields such as IP address, port, protocol, and flags, plus an extra column containing the file id and the packet offset inside the pcap file. Using a web interface built on top of fbquery, users can search for packets matching given criteria and also retrieve the original packets contained in the original pcap files. Although this work is not rich in features when compared with specialized tools [36], it demonstrates that bitmap indexes are also effective for handling packets, not just flow records. The work described in this paper is the basis for developing interactive data visualization tools based on FastBit partitions. Thanks to recent innovations in web 2.0, libraries such as the Google Visualization API separate visualization from data. Currently, the authors are extending nProbe with an embedded web server that can run FastBit queries on the fly and return the results in JSON format [38]. The idea is to create an interactive query system that can visualize both tabular data (e.g.
flow information) and graphs (e.g. the average number of flow records on port X over the last hour) by means of FastBit queries. This way the user does not have to interact with FastBit tools at all and can focus on data exploration.
6 Final Remarks

The use of nProbe with FastBit is a major step ahead when compared to state-of-the-art tools based on both relational databases and raw data dumps. When searching data
on datasets of a few million records, the query time is limited to a few seconds in the worst case, whereas queries that just use indexes complete within a second. The consequence of this major speed improvement is that it is now possible to query data in real time, avoiding the periodic update of costly counters, as their values can be computed on demand using bitmap indexes. Finally, this work paves the way for new monitoring tools on large data sets that can interactively analyze traffic data in near-realtime, contrary to what usually happens with most tools available today.
Availability. This work is distributed under the GNU GPL license and is available at the ntop home page http://www.ntop.org/nProbe.html.
Acknowledgments. The authors would like to thank K. John Wu
for his help and support with the FastBit library, Anders Kjærgaard Jørgensen for his support during the validation of this work, and Cristian Morariu <[email protected]> for his suggestions during the MySQL tests.
References
1. Claise, B.: NetFlow Services Export Version 9. RFC 3954 (2004)
2. Phaal, P., et al.: InMon Corporation’s sFlow: A Method for Monitoring Traffic in Switched and Routed Networks. RFC 3176 (2001)
3. Quittek, J., et al.: Requirements for IP Flow Information Export (IPFIX). RFC 3917 (2004)
4. Haddadi, H., et al.: Revisiting the Issues on Netflow Sample and Export Performance. In: ChinaCom 2008, pp. 442–446 (2008)
5. Duffield, N., et al.: Properties and Statistics from Sampled Packet Streams. In: Proc. ACM SIGCOMM IMW 2002 (2002)
6. Estan, C., et al.: Building a Better NetFlow. In: Proc. of the ACM SIGCOMM Conference (2004)
7. Chakchai, S.: A Survey of Network Traffic Monitoring and Analysis Tools (2006)
8. Ning, C., Tong-Ge, X.: Study on NetFlow-based Network Traffic Data Collection and Storage. Application Research of Computers 25(2) (2008)
9. Reiss, F., et al.: Enabling Real-Time Querying of Live and Historical Stream Data. In: Proc. of 19th Intl. Conference on Scientific and Statistical Database Management (2007)
10. Haag, P.: Watch your Flows with NfSen and NfDump. In: 50th RIPE Meeting (2005)
11. Fullmer, M., Romig, S.: The OSU Flow-tools Package and Cisco NetFlow Logs. In: Proc. of 14th USENIX LISA Conference (2000)
12. Plonka, D.: FlowScan: A Network Traffic Flow Reporting and Visualization Tool. In: Proc. of 14th USENIX LISA Conference (2000)
13. Øslebø, A.: Stager: A Web Based Application for Presenting Network Statistics. In: Proc. of NOMS 2006 (2006)
14. Gates, C., et al.: More NetFlow Tools: For Performance and Security. In: Proc. 18th Systems Administration Conference, LISA (2004)
15. Navarro, J.P., et al.: Combining Cisco NetFlow Exports with Relational Database Technology for Usage Statistics, Intrusion Detection and Network Forensics. In: Proc. 14th Systems Administration Conference, LISA (2000)
16. Lucente, P.: Pmacct: A New Player in the Network Management Arena. In: RIPE 52 Meeting (2006)
17. Marinov, V., Schönwälder, J.: Design of an IP Flow Record Query Language. In: Hausheer, D., Schönwälder, J. (eds.) AIMS 2008. LNCS, vol. 5127, pp. 205–210. Springer, Heidelberg (2008)
18. Sperotto, A.: Using SQL Databases for Flow Processing. In: Joint EMANICS/IRTF-NMRG Workshop on NetFlow/IPFIX Usage in Network Management (2008)
19. Hofstede, R.: Performance Measurements of NfDump and MySQL and Development of a SURFmap Plug-in for NfSen. Bachelor assignment, University of Twente (2009)
20. Hofstede, R., et al.: Comparison Between General and Specialized Information Sources When Handling Large Amounts of Network Data. Technical Report, University of Twente (2009)
21. Siekkinen, M., et al.: InTraBase: Integrated Traffic Analysis Based on a Database Management System. In: Proc. of E2EMON 2005 (2005)
22. Wu, K., et al.: A Lightning-Fast Index Drives Massive Data Analysis. SciDAC Review (2009)
23. Oltsik, J.: The Silent Explosion of Log Management. CNET News (2008), http://news.cnet.com/8301-10784_3-9867563-7.html
24. Turner, M.J., et al.: A DBMS for Large Statistical Databases. In: Proc. of 5th VLDB Conference (1979)
25. Abadi, D., et al.: Column-Stores vs. Row-Stores: How Different Are They Really? In: Proc. of ACM SIGMOD 2008 (2008)
26. Herrnstadt, O.: Multiple Dimensioned Database Architecture. U.S. Patent Application 20090193006 (2009)
27. Loshin, D.: Gaining the Performance Edge Using a Column-Oriented Database Management System. White Paper (2009)
28. Wu, K., et al.: Compressed Bitmap Indices for Efficient Query Processing. Technical Report LBNL-47807 (2001)
29. Wu, K., et al.: FastBit: Interactively Searching Massive Data. In: Proc. of SciDAC 2009 (2009)
30. Bethel, E.W., et al.: Accelerating Network Traffic Analytics Using Query-Driven Visualization. In: IEEE Symposium on Visual Analytics Science and Technology (2006)
31. Abadi, D., et al.: Integrating Compression and Execution in Column-Oriented Database Systems. In: Proc. of 2006 ACM SIGMOD (2006)
32. Otten, F.: Evaluating Compression as an Enabler for Centralised Monitoring in a Next Generation Network. In: Proc. of SATNAC 2007 (2007)
33. Sharma, V.: Bitmap Index vs. B-tree Index: Which and When? Oracle Technology Network (2005)
34. Chandrasekaran, S., et al.: TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In: Proc. of Conference on Innovative Data Systems Research (2003)
35. Deri, L.: nProbe: An Open Source NetFlow Probe for Gigabit Networks. In: Proc. of Terena TNC 2003 (2003)
36. Desnoyers, P., Shenoy, P.: Hyperion: High Volume Stream Archival for Retrospective Querying. In: Proc. of 2007 USENIX Annual Technical Conference (2007)
37. McCanne, S., Jacobson, V.: The BSD Packet Filter: A New Architecture for User-level Packet Capture. In: Proc. Winter 1993 USENIX Conference (1993)
38. Crockford, D.: JSON: The Fat-Free Alternative to XML. In: Proc. of XML 2006 (2006)
DeSRTO: An Effective Algorithm for SRTO Detection in TCP Connections

Antonio Barbuzzi, Gennaro Boggia, and Luigi Alfredo Grieco
DEE - Politecnico di Bari - V. Orabona, 4 - 70125, Bari, Italy
Ph.: +39 080 5963301; Fax: +39 080 5963410
{a.barbuzzi,g.boggia,a.grieco}@poliba.it
Abstract. Spurious Retransmission Timeouts in TCP connections have been extensively studied in the scientific literature, particularly for their relevance in cellular mobile networks. At present, while several algorithms have been conceived to identify them during the lifetime of a TCP connection (e.g., Forward-RTO or Eifel), no tool is able to accomplish the task with high accuracy by processing off-line traces. The only existing off-line tool is designed to analyze a great amount of traces taken from a single point of observation. In order to achieve higher accuracy, this paper proposes a new algorithm and a tool able to identify Spurious Retransmission Timeouts in a TCP connection, using the dumps of each peer of the connection. The main strengths of the approach are its great accuracy and the absence of assumptions on the characteristics of the TCP protocol. In fact, except for rare cases that cannot be classified with absolute certainty, the algorithm shows neither ambiguous nor erroneous detections. Moreover, the tool is also able to deal with reordering, small windows, and other cases where competitors fail. Our aim is to provide the community with a very reliable tool to: (i) test the working behavior of cellular wireless networks, which are more prone to Spurious Retransmission Timeouts than other technologies; (ii) validate run-time Spurious Retransmission Timeout detection algorithms.
1 Introduction
TCP congestion control [1,2,3] is fundamental to ensure Internet stability. Its main rationale is to control the number of in-flight segments in a connection (i.e., segments sent but not yet acknowledged) using a sliding window mechanism. In particular, TCP sets an upper bound on the number of in-flight segments via the congestion window (cwnd) variable. As is well known, the value of cwnd is progressively increased over time to discover newly available bandwidth, until a congestion episode happens, i.e., 3 Duplicate Acknowledgements (DUPACKs) are received or a Retransmission Timeout (RTO) expires. After a congestion episode, cwnd is suddenly shrunk to avoid a network collapse. TCP congestion control has demonstrated its robustness and effectiveness over the last two decades, especially in wired networks.
F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 87–100, 2010. © Springer-Verlag Berlin Heidelberg 2010
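The window reactions just described can be sketched as a toy model (a deliberately simplified illustration, not a faithful TCP state machine — real stacks also implement slow start, fast recovery, and other refinements):

```python
# Toy cwnd reaction sketch: linear growth per ACK (congestion avoidance),
# halving on 3 DUPACKs, collapse to one segment on an RTO.
def react(cwnd, event):
    if event == "ack":
        return cwnd + 1.0 / cwnd       # additive increase
    if event == "3dupack":
        return max(cwnd / 2.0, 1.0)    # multiplicative decrease
    if event == "rto":
        return 1.0                     # timeout: restart from one segment
    return cwnd

cwnd = 10.0
for event in ["ack", "ack", "3dupack", "ack", "rto"]:
    cwnd = react(cwnd, event)
print(cwnd)   # 1.0: an RTO shrinks the window regardless of its history
```

This asymmetry is exactly why a *spurious* RTO is so costly: the window collapses to one segment even though no loss occurred.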
Recently, the literature has questioned the effectiveness of the basic RTO mechanism originally proposed in [1] and refined in [4]. Such a mechanism measures the Smoothed Round Trip Time (SRTT) and its standard deviation (DEV) and then sets RTO = SRTT + 4 · DEV. The rationale is that, if the RTT exhibits a stationary behavior, the RTO can be considered a safe upper bound for the RTT. Unfortunately, in cellular mobile networks this is not always true. In fact, delay spikes due to retransmissions, fading, and handover can trigger Spurious Retransmission Timeouts (SRTOs) that lead to unnecessary segment retransmissions and useless reductions of the transmission rate [5,6]. Despite the importance of SRTOs, no tool is able to properly identify them with high accuracy by processing off-line traces. The only existing off-line tool is designed to analyze a great amount of traces taken from a single point of observation [7]. To bridge this gap, this paper proposes a new algorithm and a tool able to Detect SRTOs (referred to as DeSRTO) in a TCP connection, using the dumps of each peer of the connection. The main strengths of the approach are its great accuracy and the absence of assumptions on the characteristics of the TCP protocol. In fact, except for rare cases that cannot be classified with absolute certainty, the algorithm shows neither ambiguous nor erroneous detections. Moreover, the tool is also able to deal with reordering, small windows, and other cases where competitors fail. Our aim is to provide the community with a very reliable tool to: (i) test the working behavior of cellular wireless networks, which are more prone to Spurious Retransmission Timeouts than other technologies (but the tool is also well suited for any other kind of network carrying TCP traffic); (ii) validate run-time Spurious Retransmission Timeout detection algorithms.
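The SRTT/DEV estimator behind RTO = SRTT + 4 · DEV can be sketched as follows, using the standard RFC 6298 gains (1/8 for the SRTT, 1/4 for the deviation); the RFC's 1-second lower bound and clock granularity are omitted for brevity:

```python
# Sketch of the SRTT/DEV-based RTO computation (RFC 6298 constants).
class RtoEstimator:
    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, rtt):
        if self.srtt is None:              # first RTT measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        return self.srtt + 4 * self.rttvar    # RTO = SRTT + 4 * DEV

est = RtoEstimator()
for rtt in [0.2, 0.2, 0.2]:               # stationary RTT: DEV shrinks,
    rto = est.update(rtt)                 # so the RTO tightens around SRTT
print(round(rto, 3))                      # 0.425
print(round(est.update(1.0), 3))          # 1.269: a delay spike inflates RTO
```

The last line shows the mechanism's weakness in cellular networks: the RTO only grows *after* a spike is measured, so the first spike (e.g. during a handover) can still expire the timer spuriously.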
To provide a preliminary demonstration of the capabilities of DeSRTO, some results obtained by processing traffic traces collected over a real 3G network are reported. Moreover, to test its effectiveness, a comparison with the Vacirca detection tool [7] is reported. The rest of the paper is organized as follows. Sec. 2 summarizes related work. Sec. 3 describes our algorithm, also reporting some examples of its behavior. Sec. 4 shows the experimental results. Finally, conclusions are drawn in Sec. 5.
2 Related Works
So far, research on SRTOs has produced schemes that can be classified into two families: (i) real-time detection schemes for TCP stacks; (ii) off-line processing tools for SRTO identification. The Eifel Detection [5], the F-RTO [6], and the DSACK [8] schemes belong to the first family. Instead, the tool proposed in [7] and the DeSRTO algorithm presented herein belong to the second family. The goal of the Eifel detection algorithm [5] is to avoid the go-back-N retransmissions that usually follow a spurious timeout. It exploits the information provided by the TCP Timestamps option in order to distinguish whether incoming ACKs are related to a retransmitted segment or not.
The algorithm described in [8] exploits DSACK (Duplicate Selective ACK) information to identify SRTOs. DSACK is a TCP extension used to report the receipt of duplicate segments to the sender. The receipt of a duplicate segment implies that either the packet was replicated by the network, or both the original and the retransmitted packet arrived at the receiver. If all retransmitted segments are acknowledged and recognized as duplicates using DSACK information, the algorithm can conclude that all retransmissions in the previous window of data were spurious and no loss occurred. The F-RTO algorithm [6] modifies the TCP behavior in response to RTOs: in general terms, when a RTO expires, F-RTO retransmits the first unacknowledged segment and waits for subsequent ACKs. If it receives an acknowledgment for a segment that was not retransmitted due to the timeout, the F-RTO algorithm declares a spurious timeout. Possible responses to a spurious timeout are specified in [9], namely the Eifel Response Algorithm. Basically, three actions are specified: (i) the TCP sender sends new data instead of retransmitting further segments; (ii) the TCP stack tries to restore the congestion control state prior to the SRTO; (iii) the RTO estimator is re-initialized by taking into account the round-trip time spike that caused the SRTO (a slightly different approach has also been proposed in [10]). To the best of our knowledge, the only SRTO detection algorithm aimed at the analysis of collected TCP traces so far is the one proposed by Vacirca et al. in [7]. This algorithm is conceived to process a large amount of traces from different TCP connections. Such traces are collected by a single monitoring interface, placed in the middle of a path traversed by many connections. The design philosophy of the algorithm targets strict constraints on execution speed and simplicity.
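The F-RTO decision just described can be sketched as follows. This is a simplified illustration of the core idea only, not the full RFC 5682 state machine with its SACK variants and response options:

```python
# Core F-RTO idea: after retransmitting the first unacknowledged segment,
# an ACK covering data that was never retransmitted reveals that the
# original transmission (not the retransmission) reached the receiver.
def frto_verdict(ack_seq, retransmitted_end):
    """ack_seq: cumulative ACK received after the RTO retransmission;
    retransmitted_end: end of the sequence range retransmitted on timeout."""
    if ack_seq > retransmitted_end:
        return "spurious"          # the original segments got through
    return "inconclusive"          # fall back to conventional recovery

print(frto_verdict(ack_seq=300, retransmitted_end=100))  # spurious
print(frto_verdict(ack_seq=100, retransmitted_end=100))  # inconclusive
```

Unlike the off-line tools discussed next, this decision must be taken on-line by the sender's TCP stack, with no visibility of the receiver side.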
However, since the Vacirca algorithm is based on a monitoring point placed in the middle of the network, it cannot exploit fundamental information available in dumps collected at the connection endpoints that could improve estimation accuracy. The rationale of Vacirca's algorithm is to analyze the ACK stream to identify SRTOs. In case of a Normal RTO (NRTO), a loss causes the transmission of duplicate ACKs by the TCP data receiver, which indicate the presence of a hole in the TCP data stream. On the contrary, in case of a SRTO, no duplicate ACKs are expected, since there is no loss. It is well known that this algorithm does not work properly in the following conditions: (i) packet loss before the monitoring interface; (ii) presence of packet reordering; (iii) small windows; (iv) no segment is sent between the original transmission that caused the RTO and its first retransmission; (v) loss of all the segments transmitted between the original transmission that caused the RTO and its first retransmission; (vi) loss of all ACKs. Some of these cases lead to erroneous conclusions, while others lead to ambiguity. Furthermore, the absence of packet reordering is a fundamental hypothesis for the validity of the Vacirca detection scheme. Our algorithm is instead aimed at the analysis of a single TCP flow, without making any assumption on the traffic characteristics.
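The duplicate-ACK rationale of the Vacirca scheme can be reduced to the following sketch (an illustrative simplification, not the published algorithm, which includes further heuristics):

```python
# Single-monitoring-point heuristic: duplicate ACKs seen before the
# retransmission suggest a real hole in the data stream (normal RTO);
# their absence suggests a spurious one.
def classify(acks_before_retx):
    dup = sum(1 for prev, cur in zip(acks_before_retx, acks_before_retx[1:])
              if cur == prev)
    return "normal" if dup > 0 else "spurious"

print(classify([100, 100, 100]))  # normal: dupACKs signal a missing segment
print(classify([100, 200, 300]))  # spurious: the ACK stream keeps advancing
```

The failure conditions listed above follow directly from this sketch: if the duplicate ACKs are lost before the monitoring point, or reordering produces duplicate ACKs without loss, the heuristic misclassifies the RTO.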
3 Spurious Timeout Identification Algorithm: DeSRTO
In this section, we examine the SRTO concept more closely, reporting some examples of SRTOs. Then, we explain our algorithm.

3.1 What Is a SRTO?
As is well known, every time a data packet is sent, TCP starts a timer and waits, until its expiration, for feedback indicating the delivery of the segment. The length of this timer is the retransmission timeout, i.e., the RTO. The expiration of the RTO timer is interpreted by TCP as an indication of packet loss. The computation of the RTO value is specified in [4]; it is based on the estimated RTT and its variance. The proper setting of the RTO is a tricky operation: if it is too long, TCP wastes a lot of time before realizing that a segment is lost; if it is too short, unnecessary segments are retransmitted, with a useless waste of bandwidth. As already stated, it has been shown that the RTO estimation can be unreliable in some cases, especially when connections cross cellular networks. Therefore, the retransmission procedure can be unnecessarily triggered, even in the absence of packet loss. RFC 3522 names these RTOs spurious (i.e., there is a SRTO), and defines them as follows: a timeout is considered spurious if it would have been avoided had the sender waited longer for an acknowledgment to arrive. Let us clarify the SRTO concept through an example. Fig. 1(a) reports a borderline case, representing the simplest possible case of SRTO. The segment in packet p1 was sent and received successfully, as was the corresponding acknowledgment, i.e., packet p3. However, p3 arrived after the RTO expiration; therefore, the sender uselessly retransmitted, in packet p2, the data contained in the payload of packet p1. Note that if the sender had waited longer, it would have received the acknowledgment contained in packet p3. Thus, as stated by the definition, we are in the presence of a SRTO. The presented example considers a very simple scenario with a small TCP transmission window. A more complex case is reported in Fig.
1(b): the first ACK segment (the one with ack number 100) is lost; but since ACKs are cumulative, the data contained in packet p1 can be acknowledged by any of the subsequent ACKs. Therefore, the correct application of the SRTO definition requires checking that at least one ACK (among those sent between the reception of p1 and the reception of the retransmitted segment) is delivered successfully to the sender.

3.2 DeSRTO
DeSRTO is an algorithm we developed to detect Spurious RTOs. The main difference from its counterparts is that it uses the dumps of both TCP peers, a feature that enhances the knowledge of the network behavior during a RTO event. DeSRTO discriminates SRTOs from NRTOs according to RFC 3522, by reconstructing the journey of the TCP segments involved in the RTO management.
[Figure omitted.] Fig. 1. Spurious RTO examples: (a) a simple example of Spurious RTO; (b) a more complicated example of Spurious RTO.
For a specific RTO event, the algorithm needs to associate a packet containing a TCP segment with the packet(s) containing the acknowledgment(s) sent in reply to that segment. Note that the ACK number refers to a flow of data and not to a specific received packet. Instead, an IP datagram containing a TCP ACK is triggered by a specific TCP segment and can thus be associated with it. Hence, in the sequel, we will adopt the expression “packet A that acknowledges packet B” referring to the specific IP datagram A that contains a TCP ACK related to the reception of packet B. It is important to highlight that the algorithm needs to pair packets unambiguously between the sender and receiver dumps. Since we cannot rely only on the sequence number or on the ACK number of TCP segments, we make use of the identification field in the IP header, which, according to [11], is used to distinguish the fragments of one datagram from those of another. Let us clarify the behavior of DeSRTO by applying it to the SRTO cases described in the previous section. We will start from the simplest scenario in Fig. 1(a). In this instance, DeSRTO proceeds according to the following steps:
1. Identify the packet that caused the RTO (p1) on the sender dump.
2. Find packet p1 in the receiver dump.
3. If p1 is lost, the RTO is declared Normal.
4. Otherwise, find the ACK “associated” to p1 on the receiver dump (namely p3). This packet should exist because it is transmitted by the receiver.
5. Find the packet p3 on the sender dump.
6. If p3 is not present in the sender dump (that is, it was lost), the RTO is declared Normal; otherwise it is declared Spurious.
Now, let us analyze the more complicated case in Fig. 1(b). Of course, the loss of only the first ACK is not enough to come to a conclusion. Therefore, the algorithm checks whether at least one ACK (sent between the reception of p1 and the instant t_RTO^R of the reception of the retransmitted packet) was delivered correctly to the sender. The aim of our algorithm is to detect all types of SRTOs. However, through a deeper analysis of the methodology needed to distinguish all the possible RTO types, we realized that the definition used by RFC 3522 does not allow the practical identification of all possible RTO episodes. In fact, according to the definition, in order to understand what would have happened if the sender had “waited longer”, it is possible to check what the TCP data sender would have done until the RTO event t_RTO^S, and what the TCP data receiver would have done until the reception of the first retransmission triggered by the RTO event, t_RTO^R. Everything that happens after these two instants depends also on how the RTO event has been handled. From t_RTO^R on, the storyline diverges, and the check of the “what if” scenarios can result in significantly different outcomes, since the consequences of the RTO management cannot really be undone: it influences the following events in an unforeseeable way. We can check with certainty what would have happened if we had “waited longer” only as long as we do not need to undo the TCP stack's management of the RTO event. To deal also with these uncertain cases, we define the concept of Butterfly-RTO as follows: a Butterfly-RTO is a RTO whose identification as SRTO or NRTO would require the check of packets at the receiver side after the instant t_RTO^R. In order to highlight the complexity involved in dealing with a Butterfly-RTO, we can consider the RTO example in Fig.
2. Packet p1 is transmitted, but the RTO timer expires before the reception of any ACK related to it (the ACK relative to packet p1 is lost). Packet p2 is the retransmitted packet, with the same sequence number interval as p1. Note that the reception of packet p2 marks the instant t^R_RTO, i.e., the instant of the reception of the retransmitted packet. The following packet (the one with sequence numbers 100:200) is lost, whereas packet pB (the one with sequence numbers 200:300) is delivered correctly. However, due to reordering, it arrives after the instant t^R_RTO and, consequently, p3, the packet that acknowledges pB, is sent after t^R_RTO. At first sight, the case depicted in Fig. 2 is a Spurious RTO, because, had we waited for p3 to arrive, we would not have sent the retransmission p2. Examining the situation more carefully, however, we should note that the story of packet pB could also have been different if packet p2 had never been transmitted. In other terms, the storylines of p2 and pB are coupled, and there is no way to undo the effects of RTO management with absolute certainty. To further stress the concept, we remark that, if the TCP delayed ACK option is enabled, the time instant at which the ACK p3 is transmitted also depends on p2 and not only on pB. Finally, TCP implementations running at different hosts often
DeSRTO: An Effective Algorithm for SRTO Detection in TCP Connections
Fig. 2. An example of Butterfly RTO (sender/receiver sequence diagram with packets p1, p2, pB, and p3, and the instants t^S_RTO and t^R_RTO)
implement RFC specifications in slightly different ways, which makes the problem even worse. In practice, Butterfly-RTOs are negligible in normal network conditions, since reordering is rare and the probability of joint RTO and packet reordering events (needed for a Butterfly-RTO to occur) is very low if the RTO and reordering events are uncorrelated. Note, however, that network malfunctions during handover or channel management (due, for example, to bugs or incorrect settings in network devices) could systematically cause this scenario. To conclude this discussion, we describe how our algorithm solves and/or classifies the uncertain cases. If no reordering happens, DeSRTO searches for all ACKs that acknowledge p1 and have been transmitted within t^R_RTO. If at least one of these ACKs has been received by the sender, the RTO is classified as spurious; otherwise, the RTO is normal. If reordering happens and none of the ACKs transmitted within t^R_RTO has been received, the RTO is classified as Butterfly. Fig. 3 shows a case taken from [7] where detection with the Vacirca tool fails, as stated in [7] itself. The dashed line indicates the point where the monitoring interface used by the Vacirca algorithm is placed, and the sequence of packets passing through the line is the one seen by such a detection tool. In this case, the network experiences packet reordering: specifically, the packet with sequence numbers 100:200 arrives after the packet with sequence numbers 200:300. As reported in [7], the Vacirca tool erroneously classifies the SRTO as normal. In fact, the algorithm sees an ACK (the packet with ACK number 300) after the retransmission of p1 and p3; therefore, it deduces that packet p3 fills the hole in the data sequence. Our tool, instead, is not affected by packet reordering. In fact, by following the same steps (1 - 6) outlined before to explain the simpler SRTO example in Fig.
1(a), it is straightforward to show that DeSRTO is able to classify the RTO as spurious. It is worth noting that the four examples reported in Figs. 1-3 have been chosen to give an idea of the complexity associated with SRTO detection. More SRTO scenarios can happen, depending on reordering, loss of data packets, and so on. The details of DeSRTO's behavior in a general setting are presented in the pseudocode description of the algorithm (see Sec. 3.3).
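The classification rules summarized above (spurious if an ACK of p1 sent before t^R_RTO reached the sender, butterfly if the first delivered ACK was sent after t^R_RTO, normal otherwise) can be condensed into a small decision function. The sketch below is our own illustration, not the tool's actual code; packet timestamps and names are hypothetical.

```python
# Hedged sketch of DeSRTO's RTO classification rules.
# Each ACK of p1 is a (send_time, received_by_sender) pair; t_r_rto is the
# instant at which the retransmission p2 reaches the receiver.

def classify_rto(acks, t_r_rto):
    """Classify an RTO as 'spurious', 'normal', or 'butterfly'.

    acks: (send_time, received) pairs for the ACKs acknowledging p1,
          in chronological order of transmission.
    """
    for send_time, received in acks:
        if received:  # first ACK that actually reached the sender
            return "spurious" if send_time < t_r_rto else "butterfly"
    # No ACK reached the sender: an entire window of ACKs was lost.
    if not acks or max(t for t, _ in acks) < t_r_rto:
        return "normal"
    return "butterfly"

# The simple scenario of Fig. 1: one ACK sent before t^R_RTO got through.
print(classify_rto([(0.8, True)], t_r_rto=1.0))   # spurious
print(classify_rto([(0.8, False)], t_r_rto=1.0))  # normal
```

The chronological scan mirrors the algorithm's handling of ACK reordering on the reverse path: only the first successfully delivered ACK decides the outcome.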
A. Barbuzzi, G. Boggia, and L.A. Grieco
Fig. 3. Example of SRTO with reordering: the Vacirca tool fails, considering it normal (sequence diagram showing the sender, the receiver, Vacirca's monitoring interface, the reordered packets, and the RcvdPacketList)
3.3 The Algorithm in Detail: The Pseudocode
Hereafter, we discuss the DeSRTO pseudocode reported in Algorithm 1. We keep referring to Fig. 1(b) throughout the text. As a general consideration, recall that, to unambiguously couple packets in the sender and receiver dumps, we use the identification field of the IP header [11]. To simplify the notation, we use the following conventional definitions: SndDump is the dump of the TCP flow at the sender side; RcvDump is the dump of the TCP flow at the receiver side; RTOList is the list of all the generated RTOs. Below, in the step-by-step description of the algorithm, the numbers at the beginning of each paragraph refer to the line numbers in the pseudocode of Alg. 1. This description will give further insight into DeSRTO's behavior.

1:3 Initially, RTOList contains a list of RTOs, with the timestamp of each RTO and the sequence number of the TCP segment that caused it. From this information, we can find in SndDump the packet p1 (see Fig. 1(b)) that caused the RTO. The second step requires finding p1 also on the receiver side, in RcvDump.

4:5 If packet p1 is lost, that is, if it is not present in RcvDump, the RTO is declared Normal. Note that the TCP "fast retransmit" case is comprised in this one.

6:13 The first retransmission of the segment encapsulated in the p1 datagram, straight after the RTO event, is found; we call it p2. The packet p2 is also looked up at the receiver side. If p2 is not present in the receiver dump, i.e., it has been lost, the algorithm looks for the first packet transmitted after p2 that successfully arrives at the receiver.

14:21 In SndDump, all the packets sent between the transmission of p1 and the transmission of p2 are found. They are stored in the list SentPktsList. Then, DeSRTO stores in the list RcvdPktsList all the packets in SentPktsList that have been successfully received (this step requires an inspection of RcvDump). The first and the last packets in RcvdPktsList are called pm
and pM, respectively. We will refer to tm as the reception instant of pm, and to tM as the transmission instant of the first ACK transmitted after the reception of pM. Note that pm and pM could differ from p1 and p2, respectively, in case of packet reordering.

22:27 Search RcvDump, in the time interval [tm, tM], for all the ACKs that acknowledge p1. The ACKs found are saved in the list AckList, according to their transmission order.

28:43 For each ACK saved in AckList, the algorithm checks whether it was successfully received by the sender. The search stops as soon as the first successfully received ACK is found. If a received ACK is found, two cases are considered: 1. the ACK was sent before t^R_RTO; 2. the ACK was sent after t^R_RTO. In the first case, the sender received an ACK for p1 after the RTO expiration; therefore, the RTO is declared Spurious. In the second case, we have a Butterfly-RTO. Note that we also account for reordering of the ACKs on the reverse path; therefore, the check on the ACK packets is done in chronological order, from the first sent ACK packet to the last one.

44:48 If none of the ACKs in AckList has been received by the sender, i.e., an entire window of ACKs is lost, two cases are considered: if the greatest timestamp of the packets in AckList is smaller than t^R_RTO (i.e., tp2), the RTO is declared Normal; otherwise, it is declared Butterfly.

3.4 Implementation Details
To verify the effectiveness of our algorithm, DeSRTO has been implemented in the Python programming language. The tool implements exactly the pseudocode described above (the current version is v1.0-beta, freely available at svn://telematics.poliba.it/desrto/tags/desrto_v1). The DeSRTO tool takes as input a list of RTOs (RTOList in the pseudocode), with timestamp and sequence number, and the dumps of the TCP connection at the two peers. The list of RTOs is generated using a Linux kernel patch, included in the repository, that simply logs the sequence number and the timestamp of each RTO. Of course, other methods can be used to obtain a list of RTOs, such as a simple check for the presence of a duplicate transmission without 3 DUPACKs; we plan to implement this as an option in the near future. The dumps of each peer can be truncated in order to discard the TCP payload. Of course, DeSRTO requires that no packets are discarded by the kernel; in fact, if the packets we look for are not found, the analysis would be wrong. An option to deal with flows that go through a NAT has been implemented. The aim of each option is to analyze most of the cases in the background, without the presence of an operator.
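The packet coupling via the IP-header identification field mentioned in Sec. 3.3 could look like the following minimal sketch. This is our own illustration (not the tool's actual code); the packet fields and values are hypothetical.

```python
# Hedged sketch of coupling the same datagram across the sender-side and
# receiver-side dumps using the IP identification field.

def build_index(dump):
    """Index a dump (list of packet dicts) by the IP ID field."""
    return {pkt["ip_id"]: pkt for pkt in dump}

def find_in_receiver(pkt, rcv_index):
    """Return the receiver-side copy of pkt, or None if it was lost."""
    return rcv_index.get(pkt["ip_id"])

snd_dump = [{"ip_id": 7, "seq": 100, "ts": 0.10},
            {"ip_id": 8, "seq": 200, "ts": 0.12}]
rcv_dump = [{"ip_id": 7, "seq": 100, "ts": 0.35}]   # ip_id 8 was lost

idx = build_index(rcv_dump)
print(find_in_receiver(snd_dump[0], idx) is not None)  # True: delivered
print(find_in_receiver(snd_dump[1], idx) is None)      # True: lost
```

Matching on the IP ID rather than on TCP sequence numbers is what lets the algorithm distinguish the original transmission p1 from its retransmission p2, which carries the same sequence number interval.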
Algorithm 1. Pseudocode of DeSRTO

 1: for each rto in RTOList do
 2:   FIND the packet p1 that causes the rto in SndDump
 3:   FIND packet p1 in RcvDump
 4:   if p1 is Lost then
 5:     rto ← NRTO
 6:   else
 7:     FIND packet p2 in SndDump, the first retransmission of packet p1
 8:     FIND p2 in RcvDump
 9:     while p2 is Lost do
10:       FIND tmp, the first packet transmitted after p2 in SndDump
11:       p2 ← tmp
12:       FIND p2 in RcvDump
13:     end while
14:     SET tretr TO the timestamp of p2 on RcvDump
15:     GET all sent packets between p1 and p2 in SndDump, including p1 and not p2
16:     FIND the corresponding received packets in RcvDump
17:     STORE found packets in RcvdPktsList
18:     SET pm TO the first packet in RcvdPktsList in chronological order
19:     SET tm TO the timestamp of pm
20:     SET pM TO the last packet in RcvdPktsList in chronological order
21:     SET tM TO the timestamp of the first ACK transmitted after pM
22:     for EACH sent packet pa in RcvDump FROM tm TO tM do
23:       if pa acknowledges p1 then
24:         STORE pa IN AckList
25:       end if
26:     end for
27:     SET tmax TO the greatest timestamp of the packets in AckList
28:     SET tp2 TO the timestamp of p2 on RcvDump
29:     SET ACK_FOUND TO False
30:     for EACH ack packet a in AckList do
31:       if ACK_FOUND = False then
32:         FIND packet a in SndDump
33:         if a is not LOST then
34:           SET ACK_FOUND TO True
35:           SET t TO the timestamp of a on RcvDump
36:           if t > tp2 then
37:             rto ← ButterflyRTO
38:           else
39:             rto ← SRTO
40:           end if
41:         end if
42:       end if
43:     end for
44:     if ACK_FOUND = False and tmax > tp2 then
45:       rto ← ButterflyRTO
46:     else if ACK_FOUND = False then
47:       rto ← NRTO
48:     end if
49:   end if
50: end for
4 Experimental Results
To test the effectiveness of the tool and its performance, we have considered a series of TCP flows generated using iperf, a commonly used network testing tool that can create TCP data streams (http://dast.nlanr.net/iperf/), in a real 3.5G cellular network (a UMTS network with the HSPA protocol) carrying concurrent real traffic. The testbed is presented in Fig. 4. There are two machines equipped with a Linux kernel patched with the Web100 patch (http://www.web100.org/), a tool that implements the TCP instruments defined in [12], able to log the internal TCP kernel status, and with a kernel patch we developed that logs all the RTO events (see Sec. 3.4). The developed patch has been tested by comparing the reported timeouts against the Web100 output.
Fig. 4. Experimental testbed (a mobile host attached to the 3G core network and a wired host attached to the Internet)
The first PC is connected to the Internet through a wired connection, while the second one is equipped with a UMTS (3.5G) card and is connected to the cellular network. We have generated a series of one-hour-long greedy flows between the two machines using iperf. We have conducted several experiments, with flows originating from both machines, in order to test both directions of the connection. The average transfer rate was 791 kbit/s in download and 279 kbit/s in upload. No experiment experienced packet reordering. In the download case, where the UMTS-equipped machine receives data, the number of detected SRTOs is negligible, also due to the low number of RTOs (actually, most RTOs are due to retransmissions of SYN packets); in the upload case, instead, SRTOs are more common, even if not prevalent. This behavior was expected, due to the asymmetry between uplink and downlink in cellular networks (the downlink usually provides higher bandwidth, higher reliability, and smaller delays than the uplink [13]). Tabs. 1 and 2 show the number of NRTOs and SRTOs detected by DeSRTO and by the Vacirca tool in the upload and in the download cases, respectively. Note that there are no Butterfly RTOs, since no reordering was experienced in the performed experiments.
Table 1. Results reported by the Vacirca tool and DeSRTO for the upload case
 N. | DeSRTO results  | Vacirca tool results
    | SRTO NRTO %SRTO | SRTO NRTO Ambiguous %SRTO %Ambiguous
  1 |    3   23 13,0% |    5  535         4  0,9%       0,7%
  2 |    5   27 18,5% |    5  536         7  0,9%       1,3%
  3 |    5  231  2,2% |   15  711        31  2,0%       4,1%
  4 |    4  305  1,3% |   23  784        43  2,7%       5,1%
  5 |    7  151  4,6% |   24  637        19  3,5%       2,8%
  6 |    5   48 10,4% |   10  502         9  1,9%       1,7%
  7 |    4  343  1,2% |   28  749        69  3,3%       8,2%
  8 |    3   83  3,6% |    4  636      1619  0,2%      71,7%
  9 |    2    9 22,2% |    3  441         2  0,7%       0,4%
 10 |    4  108  3,7% |    9  629        22  1,4%       3,3%
 11 |    1    2 50,0% |    1  348         0  0,3%       0,0%
TOT |   43 1330  3,2% |  127 6508      1825  1,5%      21,6%
To validate the algorithm, we have manually inspected all the RTOs that expired during the experiments and verified their correspondence with the ones revealed by DeSRTO. It is worth highlighting that no false positives or false negatives were found for DeSRTO. This was an expected result, since the algorithm's behavior follows the operational procedure a human would use to find SRTOs. To validate its own algorithm, [7] uses a patched kernel that logs the timeout sequence numbers on the sender side and, on the receiver side, logs the holes in the sequence number space left by the reception of out-of-order segments. In that paper, it is claimed that an out-of-order segment points out a loss, i.e., an NRTO, and that, therefore, all the remaining RTOs are spurious. Note that this technique is more accurate than the use of the Vacirca tool, but it is not free from errors. In fact, besides the intuitive failure in case of reordering, where an out-of-order segment is not a lost packet, this validation technique does not consider RTOs due to lost ACKs: in case a whole ACK window is lost, no hole is logged on the receiver, and an NRTO is wrongly believed to be spurious. Therefore, we think that the validation technique used by [7] was unfeasible for our algorithm; in fact, our algorithm claims to work even in cases where the validation technique used by [7] fails. Thus, the only possible validation technique is the manual inspection of all RTOs. Even if the Vacirca tool and DeSRTO are designed with different targets, a working comparison between the two tools is mandatory, although some differences in the results are expected. For this purpose, we used an implementation of the Vacirca tool available (http://ccr.sigcomm.org/online/?q=node/220) as a patch for tcptrace v.6.6.7. The algorithm was applied to the traces captured on the sender side (the Ethernet interface in the case of download, the UMTS interface in the case of upload).
Even if the location of the monitoring interface is unusual (it is not in the middle of the path), the placement is correct, since the only assumption made in [7] is that no loss occurs between the sender side and the monitoring
Table 2. Results reported by Vacirca Tool and DeSRTO for the download case
 N. | DeSRTO results   | Vacirca tool results
    | SRTO NRTO %SRTO  | SRTO NRTO Ambiguous %SRTO %Ambiguous
  1 |    1    0 100,0% |    0   11         0  0,0%       0,0%
  2 |    2    0 100,0% |    0   12         0  0,0%       0,0%
  3 |    1    0 100,0% |    0   18         0  0,0%       0,0%
  4 |    2    0 100,0% |    1   65         0  1,5%       0,0%
  5 |    1    0 100,0% |    0   10         0  0,0%       0,0%
  6 |    1    0 100,0% |    0    8         0  0,0%       0,0%
  7 |    1    0 100,0% |    0   10         0  0,0%       0,0%
  8 |    1    0 100,0% |    0    8         0  0,0%       0,0%
  9 |    1    0 100,0% |    0   13         0  0,0%       0,0%
 10 |    1    0 100,0% |    0    3         0  0,0%       0,0%
 11 |    1    0 100,0% |    0    6         0  0,0%       0,0%
TOT |   13    0 100,0% |    1  164         0  0,6%       0,0%
interface. The obtained results are reported in Tabs. 1 and 2. The comparison shows many differences. The number of RTOs detected by the Vacirca tool is substantially different from the one reported by our kernel patch or, equivalently, by Web100. On average, the Vacirca-patched version of tcptrace reports a number of RTOs about 6 times greater than the one reported by the kernel, with peaks of 100 times. The number of SRTOs, instead, is more similar between the two tools and, even if the results are significantly different, in some cases the reported values are comparable. It is worth highlighting that the number of ambiguous RTOs reported by the Vacirca tool is sometimes very high, although no packet reordering was experienced on the network in any experiment. Unfortunately, we were not able to formulate any reliable hypothesis on the causes of the results obtained by the Vacirca tool. We found some issues in the use of this tool; details about these problems can be found in [14].
5 Conclusions
In this paper, DeSRTO, a new algorithm to find Spurious Retransmission Timeouts in TCP connections, has been developed. Several examples have been reported to illustrate its behavior in the presence of packet reordering, small windows, and other cases where competitors fail. Except for rare cases that cannot be classified with absolute certainty at all, the algorithm shows neither ambiguous nor erroneous detections. Moreover, the effectiveness of the proposed algorithm has been highlighted with some results of its application to TCP traces collected in a real 3.5G cellular network, comparing its performance with another detection tool available in the literature. Future work will illustrate the application of DeSRTO to data traces in order to analyze the presence of SRTOs and their impact in several network environments.
Acknowledgement. The authors want to thank Dr. F. Ricciato and his team at FTW (Vienna) for suggestions and valuable support during this work, which was funded by projects PS-121 and DIPIS (Apulia Region, Italy) and supported by the TMA COST Action IC0703.
References

1. Jacobson, V.: Congestion avoidance and control. SIGCOMM Comput. Commun. Rev. 18(4), 314–329 (1988)
2. Allman, M., Paxson, V., Stevens, W.: TCP congestion control. RFC 2581 (1999)
3. Floyd, S., Henderson, T., Gurtov, A.: The NewReno modification to TCP's fast recovery algorithm. RFC 3782 (2004)
4. Paxson, V., Allman, M.: Computing TCP's retransmission timer. RFC 2988 (2000)
5. Ludwig, R., Meyer, M.: The Eifel detection algorithm for TCP. RFC 3522, Experimental (April 2003)
6. Sarolahti, P., Kojo, M.: Forward RTO-Recovery (F-RTO): An algorithm for detecting spurious retransmission timeouts with TCP and the Stream Control Transmission Protocol (SCTP). RFC 4138, Experimental (August 2005)
7. Vacirca, F., Ziegler, T., Hasenleithner, E.: An algorithm to detect TCP spurious timeouts and its application to operational UMTS/GPRS networks. Comput. Netw. 50(16), 2981–3001 (2006)
8. Blanton, E., Allman, M.: Using TCP duplicate selective acknowledgements (DSACKs) and Stream Control Transmission Protocol (SCTP) duplicate transmission sequence numbers (TSNs) to detect spurious retransmissions. RFC 3708, Experimental (February 2004)
9. Ludwig, R., Gurtov, A.: The Eifel response algorithm for TCP. RFC 4015, Proposed Standard (February 2005)
10. Blanton, E., Allman, M.: Using spurious retransmissions to adapt the retransmission timeout (July 2007)
11. Postel, J.B.: Internet Protocol. RFC 791 (September 1981)
12. Mathis, M., Heffner, J., Raghunarayan, R.: TCP extended statistics MIB. RFC 4898, Proposed Standard (May 2007)
13. Bannister, J., Mather, P., Coope, S.: Convergence Technologies for 3G Networks: IP, UMTS, EGPRS and ATM. John Wiley & Sons, Chichester (2004)
14. Barbuzzi, A.: Comparison measures between DeSRTO and the Vacirca tool. Technical report (2009), available at http://telematics.poliba.it/DeSRTO_tech_rep.pdf
Uncovering Relations between Traffic Classifiers and Anomaly Detectors via Graph Theory

Romain Fontugne¹, Pierre Borgnat², Patrice Abry², and Kensuke Fukuda³

¹ The Graduate University for Advanced Studies, Tokyo, JP
² Physics Lab, CNRS, ENSL, Lyon, FR
³ National Institute of Informatics / PRESTO JST, Tokyo, JP
Abstract. Network traffic classification and anomaly detection have received much attention in the last few years. However, due to the lack of common ground truth, proposed methods are evaluated through diverse processes that are usually neither comparable nor reproducible. Our final goal is to provide a common dataset with associated ground truth resulting from the cross-validation of various algorithms. This paper deals with one of the substantial issues faced in achieving this ambitious goal: relating outputs from various algorithms. We propose a general methodology based on graph theory that relates outputs from diverse algorithms by taking into account all reported information. We validate our method by comparing the results of two anomaly detectors that report traffic at different granularities. The proposed method successfully identified similarities between the outputs of the two anomaly detectors although they report distinct features of the traffic.
1 Introduction
Keeping network resources available and secure in the Internet is a still unmet challenge. Hence, various network traffic classifiers and anomaly detectors (hereafter both called classifiers) have been recently proposed. However, the evaluation of these classifiers usually lacks rigor, leading to hasty conclusions [1]. Since synthetic data is rather criticized and no common labeled database (like the datasets from the DARPA Intrusion Detection Evaluation Program [2]) is available for backbone traffic, researchers analyze real data and validate their methods by manual inspection, or by comparison with other methods. Our final goal is to provide a reference database by labeling the MAWI archive [3], a publicly available collection of real backbone traffic traces. Due to the difficulties faced in analyzing backbone traffic (e.g., lack of packet payload, asymmetric traffic), we plan to label the MAWI archive by cross-validating results from several methods based on different theoretical backgrounds. This systematic approach permits maintaining an updated database in which recent traffic traces are regularly added and labels are improved with upcoming algorithms. This database aims at helping researchers by providing a ground truth relative to the state of the art.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 101–114, 2010. © Springer-Verlag Berlin Heidelberg 2010
R. Fontugne et al.
However, several complicated issues must be solved to reach our final goal. This article discusses the difficulties faced in relating outputs provided by distinct algorithms, and proposes a methodology to achieve it. This is an important first step for labeling traffic data traces. The main contribution is a general methodology to efficiently compare outputs exhibiting various granularities of the traffic. It uncovers the relations between several outputs by inspecting all the information reported by the classifiers together with the original traffic. The proposed method also inherently groups similar events and permits labeling a large quantity of traffic at once.

1.1 Related Work
Usually, ground truth is built by hand, implying a lot of human work. Several applications have been proposed to assist humans and speed up this laborious task [4,5,6]. For example, GTVS [4] helps researchers by automating several tasks, and its authors claim that a 30-minute trace from a gigabit link can be labeled within days. Since our purpose is to label a database containing 15-minute backbone traffic traces taken every day for 9 years (the MAWI archive), manual labeling is impractical. Alternatively, specialized network interface drivers have been recently proposed [1,7] to label traffic while packets are collected. These drivers trace each packet and retrieve the corresponding application. Although these approaches are really promising for computing confident ground truth at Internet edges, they are not applicable to backbone traffic. Closer to our work, Moore et al. [8] proposed an application combining nine algorithms that analyze different properties of the traffic. This application successfully achieved an accurate classification on full-payload packet traces recording both link directions. TIE [9] is also an application designed to label traffic with several algorithms. It computes sessions — i.e., flows, bi-directional flows, or traffic related to a host — from the original traffic and provides them to encapsulated classifiers. The final label for each session is decided from the labels provided by each classifier. Although these two applications are similar to our work, they do not solve the general problem of relating outputs from distinct algorithms. Indeed, both applications restrict classifiers to labeling only flows, ignoring all classifiers that operate at other granularities (e.g., packet, host...) and their benefits. Thus, they only deal with flows and bypass the problem addressed in this paper. Our work provides a more general approach that permits combining results from any classifier.
This issue has only been touched upon in previous work; for example, Salgarelli et al. [10] also discuss the challenges faced by researchers in comparing the performance of classifiers, and propose unified metrics to measure the quality of an algorithm. Although the need for common metrics in evaluating classifiers is crucial, we stress that these measures are not sufficient to compare classifier outputs. For instance, let A and B be two distinct classifiers with the same true positive score (as defined in [10]: the percentage of flows that the classifier labeled correctly) equal to 50% on a certain dataset. Let us assume that the combination of A and B achieves 75% true positives on the same dataset; then it is interesting to know what kind of traffic A could identify that B could not (and vice versa).
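The A/B example above can be made concrete with a toy computation. The flow counts below are our own illustration (only the 50%/75% figures come from the text): two classifiers with identical true-positive scores can still succeed on partially different flows, and a single score hides this.

```python
# Toy illustration: A and B both score 50% true positives,
# yet their combination covers 75% of the flows.

flows = set(range(100))            # 100 flows in a hypothetical dataset
correct_A = set(range(0, 50))      # flows A labels correctly (50%)
correct_B = set(range(25, 75))     # flows B labels correctly (50%)

tp_A = len(correct_A) / len(flows)
tp_B = len(correct_B) / len(flows)
tp_AB = len(correct_A | correct_B) / len(flows)   # combined coverage

only_A = correct_A - correct_B     # traffic A identifies that B misses
print(tp_A, tp_B, tp_AB, len(only_A))  # 0.5 0.5 0.75 25
```

Knowing which flows populate `only_A` (and its counterpart for B) is exactly the information a scalar metric cannot convey, which motivates relating the classifiers' outputs directly.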
2 Problem Statement
Comparing outputs from several classifiers seems at first glance to be trivial, but in practice it is a baffling problem. The main issue is that classifiers report different features of the traffic that are difficult to compare systematically. Hereafter, we define an event as any classifier decision that categorizes some traffic (i.e., alarms from anomaly detectors or labels from traffic classifiers). Formally, an event e is a set of items e = {t_begin, t_end, f_1, ..., f_h}, where t_begin and t_end are timestamps standing respectively for the beginning and the end of the identified traffic, and the other items f_i, 0 < i ≤ h, each correspond to one of the following five traffic features: {srcIP, dstIP, srcPort, dstPort, protocol}. At least one traffic feature (0 < h) is required to describe the identified traffic. For example, the event e1 = {t_begin: 90s, t_end: 150s, srcPort: 80} refers to one minute of traffic from source port 80. Also, the same traffic feature can occur several times in a single event. For example, the event e2 = {t_begin: 30s, t_end: 90s, srcPort: 53, protocol: udp, protocol: tcp} refers to one minute of UDP or TCP traffic from port 53.

2.1 Granularity of Events
The traffic granularity of reported events results from the diverse traffic abstractions, dimensionality reductions, and theoretical tools employed by classifiers. For example, in the case of anomaly detection:

– Hash-based (sketch) anomaly detectors [11,12] usually report only IP addresses and the corresponding time bin; no other information (e.g., port number) describes the identified anomalies.
– An anomaly detector based on image processing reports an event as a set of IP addresses, port numbers, and timestamps corresponding to a group of packets identified in the analyzed pictures [13].
– Several intrusion detection systems take advantage of clustering techniques to identify anomalous traffic [14]. These methods classify flows into several groups and report the clusters with abnormal properties. Thereby, the events reported by these methods are sets of flows.

These different kinds of events provide distinct details of the traffic that are difficult to compare systematically. A simple way is to digest all of them into a less restrictive form, namely by examining only the source or destination IP addresses (assuming that anomaly detectors report at least one IP address). Comparing only IP addresses permits determining that Event 1, Event 2, and Event 3 in Fig. 1 are similar. However, the port numbers provided by Event 2 and Event 3 indicate that these two events represent distinct traffics. Consequently, an accurate comparison of these two events requires taking port numbers into account as well, but this raises other issues. First, a heuristic is needed to make a decision when the port number is not reported (for example, when comparing Event 1 and Event 2). Second, a fuzzy equality is required to compare Event 4 and Event 5 of Fig. 1. And so forth: inspecting the various traffic features reported by events makes the task harder, although the accuracy of the comparison increases.
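The IP-only-versus-port-aware comparison just discussed can be sketched in code. The two events below are modeled after Fig. 1's Event 2 and Event 3 (same host, different ports); the field names and values are our own hypothetical illustration.

```python
# Sketch of the event model of Sec. 2: a time interval plus a list of
# (feature, value) pairs, here with an explicit source IP.

event2 = {"tbegin": 90, "tend": 150, "srcIP": "10.0.0.1",
          "features": [("srcPort", 80)]}
event3 = {"tbegin": 90, "tend": 150, "srcIP": "10.0.0.1",
          "features": [("srcPort", 53)]}

def same_host(a, b):
    # Digesting events to IP addresses only: the two events look identical...
    return a["srcIP"] == b["srcIP"]

def same_traffic(a, b):
    # ...but the port numbers reveal that they describe distinct traffic.
    return same_host(a, b) and sorted(a["features"]) == sorted(b["features"])

print(same_host(event2, event3))     # True
print(same_traffic(event2, event3))  # False
```

The gap between the two answers is precisely the accuracy lost by the "digest to IP addresses" shortcut, and the missing-feature and fuzzy-equality cases make an exact `same_traffic` test insufficient in general.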
Fig. 1. Event 1, Event 2, and Event 3 report different traffics from the same host. The same port scan is reported by two events: Event 4 identifies only a part of it (the beginning of the port range), whereas Event 5 identifies another part (the end of the port range).
Similar problems arise in the case of traffic classification, where different entities are labeled:

– Usually, flows are directly labeled (e.g., based on clustering techniques [15,16,17]).
– BLINC [18], instead, decides a label for a source (IP address, source port) based on its connection pattern. Also, a recent method [19] labels hosts directly, without any traffic information, by collecting and analyzing information freely available on the web.

Thus, researchers face difficulties in comparing events standing for flows with events representing hosts. A common way is to apply the same label to all flows initiated from the host reported by an event, so that only flows are compared [16]. Unfortunately, this way of comparing these two kinds of traffic classifiers leads to erroneous results. For example, if a host is reported by a classifier as a web client, then all its corresponding flows are cast as HTTP. A simple port-based method also classifies most of these flows as HTTP, but a few of them are labeled as DNS. In this case, we can conclude neither that the port-based method misclassified DNS flows nor that the other classifier failed in classifying this host. Obviously, the transition from an event representing a host to its corresponding flows introduces errors. More sophisticated mechanisms are required to handle these two concepts (flow and host), whereas the synergy between them may provide accurate traffic classification.

2.2 Traffic Assortment
Recent applications and network attacks tend to be distributed over the network and composed of numerous flows. Therefore, classifiers labeling flows output an excessive number of events for a single distributed traffic. Regardless of the quantity of traffic involved by a unique attack, or instance of an application, the whole amount of generated traffic should be handled and annotated as a single entity. In this way, traffic annotations are clarified and highlight the connection patterns of hosts. In some cases, finding these similarities between events requires retrieving the original traffic. For example, let X be an event corresponding to traffic emitted from a single host, and Y an event representing traffic received by another host. X and Y can represent exactly the same traffic but from two different points of view: one reports the source whereas the other one reports the destination of the traffic. The only way to verify whether these events are related to each other is to
Uncovering Relations between Traffic Classifiers and Anomaly Detectors
investigate the analyzed traffic. If all traffic reported by X is also reported by Y , then we conclude that they are strongly related. Obviously, a quantitative measure is also needed to accurately score their similarities.
3 The Proposed Method
We propose a method to relate several events of different granularities by analyzing all their details. The main idea underlying our approach is to discover relations among events from the original traffic (oracle in Fig. 2) and to represent all events and their relations as a graph (graph generator in Fig. 2). Afterwards, coherent groups of similar events are identified in the graph with an algorithm finding community structure (community mining in Fig. 2).
Fig. 2. Overview of the proposed method
3.1 Oracle
The oracle is the interface binding the outputs of specific classifiers to our general methodology. Its role is to retrieve the relation between the original traffic and the reported events. It accepts a query in the form of a packet p and returns a list of events, Rp = {ep0, ..., epn}, consisting of all events from every classifier that are relevant to the query. Formally, let a packet p be a set of five distinct traffic features and a timestamp tp, p = {tp, fp1, ..., fp5}; then Rp consists of all events e = {tbegin, tend, f1, ..., fh} where tbegin ≤ tp ≤ tend and ∃fj such that fj = fpi, with 0 < i ≤ 5 and 0 < j ≤ h. Queries are generated for each packet of the original traffic, so the oracle produces the lists of events matching all packets, R = {Rp1, ..., Rpm} (m is the total number of packets).
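The oracle's matching rule can be sketched in a few lines of Python. This is only an illustration: the dictionary-based event and packet representations are our own assumptions, not the authors' implementation.

```python
# Sketch of the oracle: match a packet against reported events.
# Events and packets are modeled as dicts; the five traffic features
# (addresses, ports, protocol) are kept in a "features" set.

def oracle(packet, events):
    """Return the list R_p of events relevant to packet p: the event's
    time span must cover the packet timestamp, and at least one traffic
    feature must match."""
    t = packet["t"]
    feats = packet["features"]
    return [e for e in events
            if e["t_begin"] <= t <= e["t_end"] and feats & e["features"]]

# Toy query: one event reporting a source IP, one reporting a port.
events = [
    {"t_begin": 0, "t_end": 10, "features": {("srcIP", "10.0.0.1")}},
    {"t_begin": 0, "t_end": 10, "features": {("dstPort", 53)}},
]
packet = {"t": 5, "features": {("srcIP", "10.0.0.1"), ("dstIP", "10.0.0.2"),
                               ("srcPort", 1234), ("dstPort", 80), ("proto", "udp")}}
print(len(oracle(packet, events)))  # 1: the packet matches only the first event
```

Issuing one such query per packet of the trace yields the lists R = {Rp1, ..., Rpm} consumed by the graph generator.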
3.2 Graph Generator
The graph generator collects all responses from the oracle and builds a graph highlighting event similarities. Nodes of the graph represent the events, and those appearing in the same list returned by the oracle are connected to each other by edges. Thus, for any edge of the graph (ex, ey) there is at least one list provided by the oracle, Rpz ∈ R, in which the two connected nodes appear: ex, ey ∈ Rpz. Weights of edges quantify the similarities of events based on the quantity of traffic they have in common. Let c(e1, ..., en) be a function returning the number of lists, Rpz ∈ R, in which all events given as parameters appear
R. Fontugne et al.
together, i.e., e1, ..., en ∈ Rpz. Then the weight of an edge (ex, ey) is computed with the following equation:

w(ex, ey) = c(ex, ey) / min(c(ex), c(ey))

w ranges in (0, 1]: 1 means that the events are strongly related, whereas values close to 0 represent weak relationships. A characteristic of the graphs built by the graph generator is that connected components stand for sets of events representing common traffic. Moreover, connected components consist of sparse and dense parts; hereafter, we define a community as a coherent group of nodes representing similar events.
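The co-occurrence counts and edge weights can be sketched as follows, under the same toy representation of the oracle's output (lists of event identifiers); this is an illustration, not the authors' code.

```python
from itertools import combinations

def edge_weights(R):
    """Compute w(ex, ey) = c(ex, ey) / min(c(ex), c(ey)) from the lists
    of matching events R = [R_p1, ..., R_pm] (events as hashable ids)."""
    single = {}   # c(e): number of lists containing event e
    pair = {}     # c(ex, ey): number of lists containing both events
    for Rp in R:
        uniq = set(Rp)
        for e in uniq:
            single[e] = single.get(e, 0) + 1
        for ex, ey in combinations(sorted(uniq), 2):
            pair[(ex, ey)] = pair.get((ex, ey), 0) + 1
    return {(ex, ey): c / min(single[ex], single[ey])
            for (ex, ey), c in pair.items()}

# "a" and "b" always co-occur -> strong link; "a" and "c" rarely -> weak link.
R = [["a", "b"], ["a", "b"], ["a", "c"], ["a"], ["c"]]
w = edge_weights(R)
print(w[("a", "b")], w[("a", "c")])  # 1.0 0.5
```

Note that normalizing by the smaller of the two counts keeps w at 1 even when one event covers far more packets than the other, as long as the smaller event's traffic is fully shared.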
3.3 Community Mining
The next step is to find community structure [20] in order to identify coherent groups of similar events within the connected components of the graphs constructed by the graph generator. Although many kinds of community structure algorithms have been proposed, we take an interest only in those based on modularity, because there exist versions that perform fast on sparse graphs [21].

Modularity. Newman and Girvan proposed a metric for evaluating the strength of a given community structure [20] based on inter- and intra-community connections; this metric is called the modularity. The main idea underlying the modularity is that the fraction of edges connecting nodes of a single community is expected to be higher than the value of the same quantity in a graph with the same number of nodes but randomly connected. Let e_ij be the fraction of edges connecting nodes of community i to those of community j, such that e_ii is the fraction of edges within community i. Thus, Σ_i e_ii is the total fraction of edges connecting nodes of the same community. This value highlights the connections within communities; a large value represents a good division of the graph into communities. However, it takes the maximum value 1 in the particularly meaningless case in which all nodes are grouped in a single community. Newman et al. enhanced this measure by subtracting from it the value it would take if edges were randomly placed. We note a_i = Σ_j e_ij, the fraction of all edge ends attached to nodes in community i. If the edges are placed at random, the fraction of edges that link nodes within community i is a_i². The modularity is then defined as Q = Σ_i (e_ii − a_i²). If the fractions of edges within communities are similar to those expected in a randomized graph, then the modularity score will be 0, whereas Q close to 1 indicates graphs with strong community structure.
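As a quick numeric check of Q, the definition above can be evaluated directly on a toy graph (our own representation of a partition as a list of node sets, purely for illustration):

```python
def modularity(edges, communities):
    """Q = sum_i (e_ii - a_i^2), with e_ii the fraction of edges inside
    community i and a_i the fraction of edge ends attached to it."""
    m = len(edges)
    comm_of = {n: i for i, c in enumerate(communities) for n in c}
    e_in = [0.0] * len(communities)
    a = [0.0] * len(communities)
    for u, v in edges:
        if comm_of[u] == comm_of[v]:
            e_in[comm_of[u]] += 1 / m
        a[comm_of[u]] += 1 / (2 * m)
        a[comm_of[v]] += 1 / (2 * m)
    return sum(ei - ai ** 2 for ei, ai in zip(e_in, a))

# Two triangles joined by a single edge: a clear community structure.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(round(modularity(edges, [{0, 1, 2}, {3, 4, 5}]), 4))  # 0.3571
```

Grouping all six nodes into a single community gives Q ≈ 0, matching the degenerate case discussed above.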
Since Q represents the quality of the community structure of a graph, researchers have investigated this metric to efficiently partition graphs into communities.

Finding communities. Blondel et al. proposed an algorithm [21] that finds community structure by optimizing the modularity in an agglomerative manner.
Their algorithm starts by assigning a community to each node of the graph; then the following step is repeated iteratively. For each community i, the gain of modularity obtained by merging it with one of its neighbors j is evaluated. The merge of i and j is performed for the maximum gain, but only if this gain is positive; otherwise i is not merged with other communities. Once all nodes have been examined, a new graph is built, and this process is repeated until no further merge can be done. The authors claim that the computational complexity of their algorithm is linear on typical and sparse data. In their experiments, Blondel et al. successfully analyzed a graph with 118 million nodes and 1 billion edges in 152 minutes. The performance of this algorithm allows us to compare thousands of events in a very short time (on the order of seconds).
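The agglomerative idea can be sketched with a brute-force variant that recomputes Q for every candidate merge. This is only an illustration of the merge loop: Blondel et al.'s actual algorithm uses fast incremental gain updates and node-level moves, not the quadratic search below.

```python
def modularity(edges, comms):
    # Q = sum_i (e_ii - a_i^2) for an unweighted graph (see Sec. 3.3).
    m = len(edges)
    comm_of = {n: i for i, c in enumerate(comms) for n in c}
    e_in = [0.0] * len(comms)
    a = [0.0] * len(comms)
    for u, v in edges:
        if comm_of[u] == comm_of[v]:
            e_in[comm_of[u]] += 1 / m
        a[comm_of[u]] += 1 / (2 * m)
        a[comm_of[v]] += 1 / (2 * m)
    return sum(ei - ai ** 2 for ei, ai in zip(e_in, a))

def greedy_communities(edges):
    """Start from singleton communities and repeatedly apply the merge
    with the largest positive modularity gain; stop when no gain is left."""
    comms = [{n} for n in sorted({n for e in edges for n in e})]
    while True:
        q = modularity(edges, comms)
        best_gain, best = 0.0, None
        for i in range(len(comms)):
            for j in range(i + 1, len(comms)):
                trial = [c for k, c in enumerate(comms) if k not in (i, j)]
                trial.append(comms[i] | comms[j])
                gain = modularity(edges, trial) - q
                if gain > best_gain:
                    best_gain, best = gain, trial
        if best is None:
            return comms
        comms = best

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(sorted(sorted(c) for c in greedy_communities(edges)))  # [[0, 1, 2], [3, 4, 5]]
```

On the two-triangle toy graph, the loop merges nodes within each triangle and then stops, since joining the two triangles would decrease Q.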
4 Evaluation
4.1 Data and Processing
The proposed method is preliminarily evaluated by comparing the results of two anomaly detectors based on different theoretical backgrounds. One consists of random projection techniques (sketches) and multiresolution gamma modeling [11]; hereafter we call it the gamma-based method. In a nutshell, the traffic is split into sketches and modeled using Gamma laws at several time scales. Anomalous traffic is detected by reporting distances that are too large from an adaptively computed reference. The sketches are computed twice: the traffic is hashed on source addresses and on destination addresses. Thus, when anomalies are detected this method reports the corresponding source or destination address within a certain time bin. The other anomaly detector is based on an image processing technique called the Hough transform [13] (we call it the Hough-based method). Traffic is monitored in a 2-D scatter plot where each point represents packets, and anomalous traffic appears as "lines". Anomalies are extracted with a line detector (the Hough transform) and the original data are retrieved from the identified points. The output of this method is an aggregated set of packets. These two anomaly detectors were tested on a pcap file of the MAWI archive containing 15 minutes of traffic taken at a trans-Pacific link between Japan and the US (Samplepoint-B, 2004/08/01), corresponding to a period of the Sasser outbreak. In practice, the output of these two anomaly detectors is in the admd¹ format, an XML schema that allows traffic to be annotated in an easy and flexible way. Hence, we implemented an oracle able to read any admd and pcap file to compare the results of both methods.

4.2 Results
In our experiments, 332 events were reported by the gamma-based method and 873 by the Hough-based one, of which respectively 235 and 247 events have been merged by our method.¹

¹ Meta-data format and associated tools for the analysis of pcap data: http://admd.sourceforge.net

Fig. 3. Two simple connected components with two similar events. (a) Both methods detect the same host infected by the Sasser worm. (b) The gamma-based method reports the destination of anomalous traffic whereas the Hough-based one reports the source of it.

The resulting graph consists of 124 connected components (we do not consider components with a single event); we present some typical graph structures in this section. Note that we use the following legend for Figs. 3-7: gray rectangles represent the separation into community structure, green ellipses are events reported by the Hough-based method, and red rectangles are events reported by the gamma-based method. The labels of events are displayed as IPaddress direction;nbPackets, where IPaddress is the IP address reported by the gamma-based method or the prominent IP address of the traffic reported by the Hough-based method; direction is a letter, s or d, indicating whether the identified hosts are rather the sources or the destinations of the reported traffic; and nbPackets is the number of packets in the traffic trace that match the event. We emphasize that the IP addresses shown in these labels are provided only to ease the readability of the figures; they are not the only information considered in the oracle's decisions (the gamma-based method also reports timestamps, and the Hough-based method can provide several IP addresses, port numbers, timestamps, and protocols). The label of an edge linking two events is the weight of the edge w and the number of packets matching both events.

Simple connected components. Figure 3 shows two examples of the simplest connected components built by our method. Figure 3(a) stands for the same Sasser activity reported by both methods. Since the two methods reported anomalous traffic from the same host, the events have obviously been grouped together. The single edge between the two events represents numerous incomplete flows that can be labeled as the same malicious activity. Figure 3(b) displays two events reporting different hosts; one event describes anomalous traffic sent from a source IP (172.92.103.79), whereas the other one exhibits abnormal traffic received by another host (210.133.66.52).
Their relationship is uncovered by the original traffic: all packets (except one) initiated from the source host were sent to the identified destination host. This connected component illustrates the case where both anomaly detectors report the same traffic but from different points of view: one identified its source whereas the other emphasized its destination. In our experiments, we observed 86 connected components containing only 2 events (like those depicted in Fig. 3) where the two linked events are sometimes reported by the same anomaly detector.

Fig. 4. RSync traffic identified by 5 events

Large connected components. The proposed method found 38 connected components consisting of more than 2 events. For example, Fig. 4 shows 5 events grouped in a strongly connected component. All these events report abnormally high traffic volume, and manual inspection of packet headers revealed that they are all RSync traffic. Three hosts are concerned by these events, and the structure of the component helps us understand their connection pattern. The weights of the edges indicate that these events are closely related. Thus, these 5 events are grouped as one community that is reported once. Figure 5 depicts a connected component consisting of 29 events; 27 are from the gamma-based method's output and 2 from the Hough-based one's. All these events report abnormal DNS traffic. The event on the right-hand side of Fig. 5 and the one on the left-hand side (both labeled 200.24.119.113) represent traffic from a DNS server. This server is reported by both methods because it replies to numerous requests during the whole traffic trace. The other events shown in Fig. 5 represent the main clients soliciting this service. By grouping all these events together, our method reports the flooded server and the blamed clients at the same time. By contrast, analyzing the events raised by clients individually, one may miss the distributed character of this network activity (similar to DDoS, flash crowd, or botnet activity) and misinterpret each event.
Fig. 5. DNS traffic reported by many events
Communities in connected components. In the examples presented above, the algorithm finding community structure (see Section 3.3) identified each connected component as a single community. Nevertheless, our method found 11 connected components that are split into several communities (e.g., Figs. 6 and 7); the smallest contains 5 events grouped in 2 communities, and the largest consists of 47 events clustered in 8 communities. These connected components stand for distinct network traffics that are linked by loose events (i.e., events reporting only one traffic feature). Fortunately, the algorithm finding community structure succeeds in cutting connected components into coherent groups of events. An example of a connected component comprising two communities is depicted in Fig. 6. The community on the left-hand side of Fig. 6 stands for high-volume traffic directed to port number 3128 (a proxy server), whereas the community on the right-hand side of Fig. 6 represents NNTP traffic between two hosts. A single packet is responsible for connecting two events from the two communities: a TCP/SYN packet sent from the main host of the left-hand community and aimed at port 3128 of a host belonging to the other community. This is the only traffic observed on port 3128 for the latter host. The proposed method successfully dissociates the two sets of events having no similarities, so they can be handled separately. Figure 7 depicts another example of a connected component split into several communities, this time involving 14 events grouped in 5 communities. All events report uncommon HTTP traffic among numerous hosts. Although all events are connected together, the weights of the edges highlight several dense sets of events. By analyzing the weights of the edges and the degrees of the nodes, the algorithm finding community structure successfully detected these coherent groups of events.

4.3 Discussion
The proposed method enabled us to compare outputs from different kinds of classifiers (e.g., host classification and flow classification), and fulfills our requirement of combining results from many classifiers. Our method is also useful for inspecting the output of a single method. For example, the gamma-based method inherently reports either the source or the destination address of anomalous traffic, but both are sometimes reported in two distinct events. Let T be an anomalous traffic between hosts A and B raising two events, e_x = {t_begin: X, t_end: Y, srcIP: A} and e_y = {t_begin: X, t_end: Y, dstIP: B}; then the proposed method merges these events, as the packets p ∈ T are of the form p = {t_p: Z, srcIP: A, dstIP: B, ...} with X ≤ Z ≤ Y. In our experiments, our method merged 27 events (see Fig. 5) reported by the gamma-based method, increasing the quality of the reported events and reducing the size of the output. Another benefit of the proposed method is to help researchers understand differing results from their algorithms. For instance, while developing an anomaly detector, researchers commonly face the problem of tuning its parameter set. Therefore, researchers usually run their application with numerous parameter settings, and the best parameter set is selected by looking at the highest
Fig. 6. Connected component standing for two distinct traffics
Fig. 7. HTTP traffic represented by a large connected component split in 5 communities
detection rate. Although this process is commonly accepted by the community, a crucial issue still remains. For instance, a parameter set A may give a detection rate similar to that obtained with a parameter set B, but a deeper analysis of the reported events may show that B is more effective for a certain kind of anomaly not detectable with parameter set A (and vice versa). Deciding whether A or B is the better parameter set is then not straightforward. This interesting case
is not solved by simply comparing detection rates. The overlap of the two outputs, as exhibited by our method, would help us first to compare the conditions under which a parameter set is more effective, and second to make the methods collaborate.
5 Conclusion
This article first raised the difficulties in relating the outputs of different classifiers. We proposed a methodology to relate reported events even though they are expressed in different ways and represent distinct granularities of the traffic. Our approach relies on the abstraction level of graph theory: graphs are generated from the events and the original traffic to uncover the similarities between events. An algorithm finding community structure then distinguishes coherent sets of nodes in the graph that stand for sets of similar events. A preliminary evaluation highlighted the flexibility of our method and its effectiveness in clustering events reported by different anomaly detectors. The proposed methodology is a first step in our process of building a common database of annotated backbone traffic. We need more analyses to better understand the basic ability of the proposed method with different datasets and classifiers. In future work we will also adopt a strategy that takes into account the nature of the classifiers to decide the final label with which to annotate the traffic represented by a set of events.
Acknowledgments

We would like to thank V.D. Blondel et al. for providing us with the source code of their community structure finding algorithm. This work is partially supported by MIC SCOPE.
References

1. Szabó, G., Orincsay, D., Malomsoky, S., Szabó, I.: On the validation of traffic classification algorithms. In: Claypool, M., Uhlig, S. (eds.) PAM 2008. LNCS, vol. 4979, pp. 72–81. Springer, Heidelberg (2008)
2. Lippmann, R., Haines, J.W., Fried, D.J., Korba, J., Das, K.: The 1999 DARPA off-line intrusion detection evaluation. Computer Networks 34(4), 579–595 (2000)
3. Cho, K., Mitsuya, K., Kato, A.: Traffic data repository at the WIDE project. In: USENIX 2000 Annual Technical Conference: FREENIX Track, June 2000, pp. 263–270 (2000)
4. Canini, M., Li, W., Moore, A.W., Bolla, R.: GTVS: Boosting the collection of application traffic ground truth. In: Papadopouli, M., Owezarski, P., Pras, A. (eds.) TMA 2009. LNCS, vol. 5537, pp. 54–63. Springer, Heidelberg (2009)
5. Ringberg, H., Soule, A., Rexford, J.: WebClass: adding rigor to manual labeling of traffic anomalies. SIGCOMM CCR 38(1), 35–38 (2008)
6. Fontugne, R., Hirotsu, T., Fukuda, K.: A visualization tool for exploring multi-scale network traffic anomalies. In: SPECTS 2009, pp. 274–281 (2009)
7. Gringoli, F., Salgarelli, L., Cascarano, N., Risso, F., Claffy, K.C.: GT: Picking up the truth from the ground for Internet traffic. SIGCOMM CCR 39(5), 13–18 (2009)
8. Moore, A.W., Papagiannaki, K.: Toward the accurate identification of network applications. In: Dovrolis, C. (ed.) PAM 2005. LNCS, vol. 3431, pp. 41–54. Springer, Heidelberg (2005)
9. Dainotti, A., Donato, W., Pescapé, A.: TIE: A community-oriented traffic classification platform. In: Papadopouli, M., Owezarski, P., Pras, A. (eds.) TMA 2009. LNCS, vol. 5537, pp. 64–74. Springer, Heidelberg (2009)
10. Salgarelli, L., Gringoli, F., Karagiannis, T.: Comparing traffic classifiers. SIGCOMM CCR 37(3), 65–68 (2007)
11. Dewaele, G., Fukuda, K., Borgnat, P., Abry, P., Cho, K.: Extracting hidden anomalies using sketch and non Gaussian multiresolution statistical detection procedures. In: SIGCOMM LSAD 2007, pp. 145–152 (2007)
12. Li, X., Bian, F., Crovella, M., Diot, C., Govindan, R., Iannaccone, G., Lakhina, A.: Detection and identification of network anomalies using sketch subspaces. In: SIGCOMM 2006, pp. 147–152 (2006)
13. Fontugne, R., Himura, Y., Fukuda, K.: Evaluation of anomaly detection method based on pattern recognition. IEICE Trans. on Commun. E93-B(2) (February 2010)
14. Sadoddin, R., Ghorbani, A.A.: A comparative study of unsupervised machine learning and data mining techniques for intrusion detection. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 404–418. Springer, Heidelberg (2007)
15. Erman, J., Arlitt, M., Mahanti, A.: Traffic classification using clustering algorithms. In: SIGCOMM MineNet 2006, pp. 281–286 (2006)
16. Kim, H., Claffy, K., Fomenkov, M., Barman, D., Faloutsos, M., Lee, K.: Internet traffic classification demystified: Myths, caveats, and the best practices. In: CoNEXT 2008 (2008)
17. Bernaille, L., Teixeira, R., Salamatian, K.: Early application identification. In: CoNEXT 2006, pp. 1–12 (2006)
18. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: multilevel traffic classification in the dark. In: SIGCOMM 2005, vol. 35(4) (2005)
19. Trestian, I., Ranjan, S., Kuzmanovic, A., Nucci, A.: Unconstrained endpoint profiling (Googling the Internet). In: SIGCOMM 2008 (2008)
20. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
21. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. (2008)
Kiss to Abacus: A Comparison of P2P-TV Traffic Classifiers

Alessandro Finamore¹, Michela Meo¹, Dario Rossi², and Silvio Valenti²
¹ Politecnico di Torino, Italy
² TELECOM Paristech, France
Abstract. In the last few years the research community has proposed several techniques for network traffic classification. While the performance of these methods is promising, especially for specific classes of traffic and particular network conditions, the lack of accurate comparisons among them makes it difficult to choose the most suitable technique for given needs. Motivated also by the increase of P2P-TV traffic, this work compares Abacus, a novel behavioral classification algorithm specific to P2P-TV traffic, and Kiss, an extremely accurate statistical payload-based classifier. We first evaluate their performance on a common set of traces and then analyze their requirements in terms of both memory occupation and CPU consumption. Our results show that the behavioral classifier can be as accurate as the payload-based one, with a substantial gain in terms of computational cost, although it can deal only with a very specific type of traffic.
1 Introduction

In the last few years, Internet traffic classification has attracted a lot of attention from the research community. This interest is motivated mainly by two reasons. First, accurate traffic classification allows network operators to perform many fundamental activities, e.g., network provisioning, traffic shaping, QoS and lawful interception. Second, traditional classification methods, which rely on either well-known ports or packet payload inspection, have become unable to cope with modern applications (e.g., peer-to-peer) or with the increasing speed of modern networks [1,2]. Researchers have proposed many innovative techniques to address this problem. Most of them exploit statistical properties of the traffic generated by different applications at the flow or host level. These novel methods have the advantage of requiring fewer resources while still being able to identify applications which do not use well-known ports or which exploit encrypted/closed protocols. However, the lack of accurate and detailed comparisons discourages their adoption. In fact, since each study tests its own algorithm on a different set of traces, under different conditions and often using different metrics, it is really difficult for a network operator to identify which method could best fit its needs. In this paper, we face this problem by comparing two traffic classifiers, one of which is specifically targeted at P2P-TV traffic. These applications, which are rapidly gaining a very large audience, are characterized by a P2P infrastructure providing a live streaming video service. As the next generation of P2P-TV services begins to offer

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 115–126, 2010.
© Springer-Verlag Berlin Heidelberg 2010
HD content, the volume of traffic they generate is expected to grow even further, so that their identification is particularly interesting for network operators. The two considered classifiers are the first to handle this kind of traffic, and they exploit original and quite orthogonal approaches. Kiss [3] is a statistical payload-based classifier that bases the classification on the examination of the first bytes of the application-layer payload. It has already been compared with other classifiers in [4], proving to be the best one for this specific class of traffic. Abacus [5], instead, is a behavioral classifier, which derives a statistical representation of the traffic patterns generated by a host by simply counting the number of packets and bytes exchanged with other peers during small time windows. This simple approach can capture several distinctive properties of different applications, allowing their classification. We test the two techniques on a common set of traces, evaluating their accuracy in terms of both true positives (i.e., correct classification of P2P-TV traffic) and true negatives (i.e., correct identification of traffic other than P2P-TV). We also provide a detailed comparison of their features, focusing mostly on the differences which stem from the two approaches. Moreover, we formally investigate the computational complexity by comparing the memory occupation and the computational costs. Results show that Abacus achieves practically the same performance as Kiss, and both classifiers exceed 99% of correctly classified bytes for P2P-TV traffic. Abacus exhibits some problems in terms of flow accuracy for one specific application, for which it nevertheless retains a high byte-wise accuracy. The two algorithms are also very effective when dealing with non-P2P-TV traffic, raising a negligible number of false alarms.
Finally, we find that Abacus outperforms Kiss in terms of computational complexity, while Kiss is a much more general classifier, able to work with a wider range of protocols and network conditions. The paper is organized as follows. In Sec. 2 we present related work. In Sec. 3 we briefly present the two techniques under examination; in Sec. 4 we test them on a common set of traces and compare their performance. We proceed with a more qualitative comparison of the classifiers in Sec. 5, as well as an evaluation of their computational cost. Finally, Sec. 6 concludes the paper.
2 Related Work

Recently, many works have focused on the problem of traffic classification. In fact, traditional techniques like port-based classification or deep packet inspection appear more and more inadequate for dealing with modern networks and applications [1,2]. Therefore, the research community has proposed a rather large number of innovative solutions, consisting notably of several statistical flow-based approaches [6,7,8] and of fewer host-based behavioral techniques [9,10]. The heterogeneity of these approaches, the lack of a common dataset and the lack of a widely approved methodology make a fair and comprehensive comparison of these methods a daunting task [11]. In fact, to date, most of the comparison effort has addressed the investigation of different machine learning techniques [6,7,8], using the same set of features and the same set of traces.
More recently, a few works have specifically taken into account the comparison problem [12,13,14]. The authors of [12] present a qualitative overview of several machine-learning-based classification algorithms. On the other hand, in [13] the authors compare three different approaches (i.e., based on signatures, flow statistics and host behavior) on the same set of traces, highlighting both the advantages and the limitations of the examined methods. A similar study is carried out in [14], where the authors evaluate the spatial and temporal portability of a port-based, a DPI and a flow-based classifier. The work presented in this paper follows the same direction as these comparative studies, but focuses only on P2P-TV applications. In fact, P2P-TV has been attracting many users in the last few years, and consequently also much attention from the research community. However, works on this topic consist mainly of measurement studies of P2P-TV application performance in real networks [15,16]. The two classifiers compared here are the only ones proven to correctly identify this type of traffic. Kiss was already contrasted with a DPI and a flow-based classification algorithm in [4], proving itself the most accurate for this class of traffic. Moreover, in our study we also take into account the computational cost and memory occupation of the algorithms under comparison.
3 Classification Algorithms

This section briefly introduces the two classifiers. Here we focus our attention on the aspects most relevant from a comparison perspective, while we refer the interested reader to [5] and [3] for further details and discussion of parameter settings. Both Kiss and Abacus employ supervised machine learning as their decision process, in particular Support Vector Machines (SVM) [17], which have already proved particularly suited to traffic classification [13]. In the SVM context, the entities to be classified are described by an ordered set of features, which can be interpreted as the coordinates of points in a multidimensional space. Kiss and Abacus differ in the choice of the features. The SVM must be trained with a set of previously labeled points, commonly referred to as the training set. During the training phase, the SVM basically defines a mapping between the original feature space and a new space, usually of higher dimensionality, where the training points can be separated by hyperplanes. In this way, the target space is subdivided into areas, each associated with a specific class. During the classification phase, a point is classified simply by looking for the region which best fits it. Before proceeding with the description of the classifiers, it is worth analyzing their common assumptions. First of all, they both classify endpoints, i.e., pairs (IP address, transport-layer port) on which a given application is running. Second, they currently work only on UDP traffic, since this is the transport-layer protocol generally chosen by P2P-TV applications. Finally, given that they rely on a machine learning process, they follow a similar procedure to perform the classification. As a first step, the engines derive a signature vector from the analysis of the traffic relative to the endpoint they are classifying. Then, they feed the vector to the trained SVM, which in turn gives the classification result.
Once an endpoint has been identified, all the flows which have that endpoint as source or destination are labeled as being generated by the identified application.
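The shared pipeline (signature extraction, a trained classifier, then propagation of the endpoint label to its flows) can be sketched as follows. To keep the example self-contained we substitute a trivial nearest-centroid rule for the actual trained SVM; the class names, signatures and endpoint representation are made up for illustration.

```python
import math

def nearest_centroid_train(training_set):
    """training_set: list of (signature, label). Stand-in for SVM training:
    simply average the signatures of each class into a centroid."""
    sums, counts = {}, {}
    for sig, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(sig))
        for i, x in enumerate(sig):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def classify_endpoint(signature, centroids):
    """Stand-in for the SVM decision: nearest centroid in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(signature, centroids[lab]))

# Hypothetical 3-feature signatures for two P2P-TV applications.
training = [([0.9, 0.1, 0.0], "app-A"), ([0.8, 0.2, 0.0], "app-A"),
            ([0.1, 0.2, 0.7], "app-B"), ([0.0, 0.3, 0.7], "app-B")]
centroids = nearest_centroid_train(training)
label = classify_endpoint([0.85, 0.15, 0.0], centroids)

# Propagate the endpoint label to all flows touching that endpoint.
endpoint = ("10.0.0.1", 4000)
flows = [{"src": ("10.0.0.1", 4000), "dst": ("10.0.0.2", 5000)},
         {"src": ("10.0.0.3", 6000), "dst": ("10.0.0.1", 4000)},
         {"src": ("10.0.0.4", 7000), "dst": ("10.0.0.5", 8000)}]
for f in flows:
    if endpoint in (f["src"], f["dst"]):
        f["label"] = label
print(label, sum("label" in f for f in flows))  # app-A 2
```

In the real classifiers, only the middle step differs: Abacus and Kiss each feed their own signature vectors to a trained SVM instead of the centroid rule used here.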
3.1 Abacus

Preliminary knowledge of the internal mechanisms of P2P-TV applications is needed to fully understand the key idea behind the Abacus classifier. A P2P-TV application performs two different tasks: first, it exchanges video chunks (i.e., small fixed-size pieces of the video stream) with other peers and, second, it participates in the P2P overlay maintenance. The most important aspect is that it must keep downloading the video stream at a steady rate to provide users with a smooth video experience. Consequently, a P2P-TV application maintains a given number of connections with other peers from which it downloads pieces of the video content. Abacus signatures are thus based on the number of contacted peers and the amount of information exchanged with them. In Tab. 4 we report the procedure followed by Abacus to build the signatures. The first step consists in counting the number of packets and bytes received by an endpoint from each peer during a time window of 5 s. Let us focus first on the packet counters. We define a partition of N into B exponentially sized bins I_i, i.e., I_0 = [0, 1], I_i = [2^(i-1)+1, 2^i] and I_B = [2^B, ∞). Then, we place the observed peers into bins according to the number of packets they have sent to the given endpoint. In the pseudo-code we see that a peer can be assigned to a bin by simply computing the logarithm of the associated number of packets. We proceed in the same way for the byte counters (except that we use a different set of bins), finally obtaining two vectors of frequencies, namely p and b. The concatenation of the two vectors is the Abacus signature, which is fed to the SVM for the actual decision process. This simple method highlights the distinct behaviors of the different P2P-TV applications. Indeed, an application which implements an aggressive peer-discovering strategy will receive many single-packet probes, consequently showing large values in the low-order bins.
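The binning just described can be sketched as follows: a hypothetical re-implementation from the description above (with B = 8 bins, an assumption of this sketch), not the authors' code.

```python
import math

B = 8  # number of exponential bins (an assumption for this sketch)

def bin_index(count):
    """Map a per-peer packet (or byte) count to its exponential bin:
    I0 = [0,1], Ii = [2^(i-1)+1, 2^i], and the last bin catches the rest."""
    if count <= 1:
        return 0
    return min(math.ceil(math.log2(count)), B)

def abacus_signature(per_peer_counts):
    """Fraction of peers falling in each bin during one time window."""
    freq = [0.0] * (B + 1)
    for c in per_peer_counts:
        freq[bin_index(c)] += 1 / len(per_peer_counts)
    return freq

# Aggressive peer discovery: many single-packet probes -> mass in bin 0;
# 64-packet video chunks -> mass in bin 6.
probes = [1] * 90 + [64] * 10
sig = abacus_signature(probes)
print(round(sig[0], 2), round(sig[6], 2))  # 0.9 0.1
```

The ceiling of the base-2 logarithm plays the role of the logarithm-based bin assignment mentioned in the pseudo-code; the byte counters would use the same routine with a different bin partition.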
Conversely, an application which downloads the video stream using chunks of, say, 64 packets will exhibit a large value in the 6-th bin. Abacus also provides a simple mechanism to identify applications which are "unknown" to the SVM (i.e., not present in the training set), which in our case means non-P2P-TV applications. Basically, for each class we define a centroid based on the training points, and we label a signature as unknown if its distance from the centroid of the associated class exceeds a given threshold. To evaluate this distance we use the Bhattacharyya distance, which is specific to probability mass functions. All details on the choice of the threshold, as well as of all other parameters, can be found in [5].

3.2 Kiss

The Kiss classifier [3] is instead based on a statistical analysis of the packet payload. In particular, it exploits a Chi-Square-like test to extract statistical features from the first application-layer payload bytes. Considering a window of C segments sent (or received) by an endpoint, the first k bytes of each packet payload are split into G groups of b bits. Then, the empirical distribution O_i^g of the values taken by each group g over the C segments is compared to a uniform distribution E = C/2^b by means of the Chi-Square-like test:

    X_g = Σ_{i=1}^{2^b} (O_i^g − E)^2 / E ,   g ∈ [1, G]   (1)
Kiss to Abacus: A Comparison of P2P-TV Traffic Classifiers
119
Table 1. Datasets used for the comparison

Dataset               Duration  Flows  Bytes  Endpoints
Napa-WUT              180 min    73k   7Gb     25k
Operator 2006 (op06)   45 min   785k   4Gb    135k
Operator 2007 (op07)   30 min   319k   2Gb    114k
This makes it possible to measure the randomness of each group of bits, and to discriminate among constant/random values, counters, etc., as the Chi-Square test takes different values for each of them. The array of the G Chi-Square values defines the application signature. In this paper, we use the first k = 12 bytes of the payload divided into groups of b = 4 bits (i.e., G = 24 features per vector) and C = 80 segments to compute each Chi-Square value. The generated signatures are then fed to a multi-class SVM, similarly to Abacus. As previously stated, a training set is used to characterize each target class, but for Kiss an additional class must be defined to represent the remaining traffic, i.e., the unknown class. In fact, a multi-class SVM always assigns a sample to one of the known classes, in particular to the best-fitting class found during the decision process. Therefore, in this case a trace containing only traffic other than P2P-TV is needed to characterize the unknown class. We already mentioned that Abacus solves this problem by means of a threshold criterion based on the distance of a sample from the centroid of the class. We refer the reader to [3] for a detailed discussion of the Kiss parameter settings and of the selection of traffic to represent the unknown class in the training set.
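Using the parameters above (k = 12 payload bytes, groups of b = 4 bits, C = 80 segments), the Kiss feature extraction can be sketched as follows (an illustrative Python sketch, not the reference implementation):

```python
def kiss_signature(payloads, k=12, b=4):
    """Chi-Square-like Kiss features: split the first k bytes of each of
    the C payloads into G = 8k/b groups of b bits, count the observed
    values per group, and compare them to the uniform reference E."""
    C = len(payloads)
    G = k * 8 // b
    E = C / 2 ** b
    O = [[0] * (2 ** b) for _ in range(G)]   # O[g][v]: value frequencies
    for p in payloads:
        for g in range(G):
            byte = p[(g * b) // 8]
            shift = 8 - b - (g * b) % 8      # take b-bit chunks, MSB first
            O[g][(byte >> shift) & (2 ** b - 1)] += 1
    return [sum((o - E) ** 2 / E for o in O[g]) for g in range(G)]
```

A constant group yields a large value (one cell collects all C observations), while uniformly random bits yield a value close to 2^b - 1, which is what lets the test separate constants, counters, and random fields.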
4 Experimental Results

4.1 Methodology and Datasets

We evaluate the two classifiers on the traffic generated by four popular P2P-TV applications, namely PPLive, TVAnts, SopCast and Joost(1). Furthermore, we use two distinct sets of traces to assess two different aspects of the classifiers. The first set was gathered during a large-scale active experiment performed in the context of the Napa-Wine European project [18]. For each application we conducted an hour-long experiment in which several machines provided by the project partners ran the software and captured the generated traffic. The machines involved were carefully configured so that no other interfering application was running on them; the traces therefore contain P2P-TV traffic only. This set is used both to train the classifiers and to evaluate their performance in identifying the different P2P-TV applications. The second dataset consists of two real-traffic traces collected in 2006 and 2007 on the network of a large Italian ISP. This operator provides its customers with uncontrolled Internet access (i.e., it allows them to run any kind of application, from web browsing to file-sharing), as well as telephony and streaming services over IP. Given the extremely rich set of channels available through the ISP streaming services, customers
(1) Joost became a web-based application in October 2008. At the time we conducted the experiments, it provided VoD and live streaming by means of P2P.
Table 2. Classification results

(a) Flows, Abacus
        pp      tv      sp      jo      un
pp     13.35    0.32    0.06    -      86.27
tv      0.86   95.67    0.15    0.03    3.32
sp      0.33    -      98.04    0.1     1.5
jo      0.06    2.21    -      81.53   16.2
op06    0.1     0.1     1.03    0.06   98.71
op07    0.21    0.03    0.87    0.05   98.84

(a) Flows, Kiss
        pp      tv      sp      jo      un      nc
pp     98.8     -       0.01    -       0.2     1
tv      -      97.3     -       -       0.69    2
sp      -       -      98.82    -       0.21    0.97
jo      -       -       -      86.37    3.63   10
op06    -       0.44    0.08    0.55   92.68    6.25
op07    -       2.13    0.09    1.21   84.07   12.5

(b) Bytes, Abacus
        pp      tv      sp      jo      un
pp     99.33    0.11    -       -       0.56
tv      0.01   99.95    -       -       0.04
sp      0.01    0.09   99.85    0.02    0.03
jo      -       -       -      99.98    0.02
op06    1.02    0.58    0.55    -      97.85
op07    3.03    0.71    0.25    -      96.01

(b) Bytes, Kiss
        pp      tv      sp      jo      un      nc
pp     99.97    -       -       -       0.01    0.02
tv      -      99.96    -       -       0.03    0.01
sp      -       -      99.98    -       0.01    0.01
jo      -       -       -      99.98    0.01    0.01
op06    -       -       0.07    0.08   98.45    1.4
op07    -       0.08    0.74    0.05   96.26    2.87

pp=PPLive, tv=TVAnts, sp=SopCast, jo=Joost, un=unknown, nc=not-classified.
are not inclined to use P2P-TV applications, and indeed no such traffic is present in the traces. We verified this by means of a classic DPI classifier as well as by manual inspection of the traces. This set serves to assess the number of false alarms raised by the classifiers when dealing with non-P2P-TV traffic. We report in Tab. 1 the main characteristics of the traces.

To compare the classification results, we employ the diffinder tool [19], as already done in [4]. This simple software takes as input the logs from different classifiers, each containing the list of flows and the associated classification outcome. It then outputs several aggregate metrics, such as the percentage of agreement between the classifiers in terms of both flows and bytes, as well as a detailed list of the differently classified flows, thus enabling further analysis.

4.2 Classification Results

Tab. 2 reports the accuracy achieved by the two classifiers on the test traces. Each table is organized in a confusion-matrix fashion: rows correspond to the real traffic, i.e., the expected outcome, while columns report the possible classification results. For each table, the upper part refers to the Napa-Wine traces while the lower part refers to the operator traces. The values on the main diagonal of the tables express the recall, a metric commonly used to evaluate classification performance, defined as the ratio of true positives to the sum of true positives and false negatives. The "unknown" column counts the percentage of traffic which was recognized as not being P2P-TV
traffic, while the "not classified" column accounts for the percentage of traffic that Kiss cannot classify, as it needs at least 80 packets per endpoint.

At first glance, both classifiers are extremely accurate in terms of bytes. For the Napa-Wine traces, the percentage of true positives exceeds 99% for all the considered applications. For the operator traces, the percentage of true negatives exceeds 96% for all traces, with Kiss showing a slightly better overall performance. These results demonstrate that even an extremely lightweight behavioral classification mechanism, such as the one adopted by Abacus, can achieve the same precision as an accurate payload-based classifier.

If we consider flow accuracy, we see that for three out of four applications the performance of the two classifiers is comparable. Yet Abacus shows a very low true-positive rate of 13.35% for PPLive, with a rather large number of flows falling into the unknown class. By examining the classification logs, we found that PPLive actually uses several ports on the same host to perform different functions (e.g., one for video transfer, one for overlay maintenance). In particular, from one port it generates many single-packet flows, all directed to different peers, apparently to perform peer discovery. All these flows, which account for a negligible portion of the overall bytes, fall into the first bin of the Abacus signature, which is always classified as unknown. However, from the byte-wise results we can conclude that the video endpoint is always correctly classified. Finally, we observe that Kiss has a lower flow accuracy for the operator traces: the large percentage of flows falling into the "not classified" class means that many flows are shorter than 80 packets. Again, this is only a minor issue, since the Kiss byte accuracy remains very high.
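For reference, the agreement computation performed by a tool like diffinder (Sec. 4.1) boils down to the following sketch (the log format and names here are hypothetical; the actual tool is described in [19]):

```python
def agreement(log_a, log_b):
    """Given two {flow_id: label} mappings produced by two classifiers,
    return the fraction of common flows they agree on, together with the
    differently classified flows for further analysis."""
    common = set(log_a) & set(log_b)
    diffs = {f: (log_a[f], log_b[f]) for f in common if log_a[f] != log_b[f]}
    return 1 - len(diffs) / len(common), diffs
```

The byte-wise agreement is computed analogously by weighting each flow with its byte count.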
5 Comparison

5.1 Functional Comparison

In the previous section we have shown that the classifiers achieve similar performance in the identification of both the target applications and the "unknown" traffic. Nevertheless, they are based on very different approaches, each with its own pros and cons, all of which need to be carefully taken into account. Tab. 3 summarizes the main characteristics of the classifiers, which are reviewed in the following.

The most important difference is the classification technique used. Even if both classifiers are statistical, they work at different levels and clearly belong to different families of classification algorithms. Abacus is a behavioral classifier, since it builds a statistical representation of the pattern of traffic generated by an endpoint, starting from transport-level data. Conversely, Kiss derives a statistical description of the application protocol by inspecting packet-level data, so it is a payload-based classifier. The first consequence of this difference lies in the type and volume of information needed for the classification. In particular, Abacus takes as input just a measurement of the traffic rate of the flows directed to an endpoint, in terms of both bytes and packets. Not only does this represent an extremely small amount of information, but it could also be gathered by a NetFlow monitor, so that no packet trace has to be inspected by the classification engine itself. On the other hand, Kiss must necessarily access the packet
Table 3. Main characteristics of Abacus and Kiss

Characteristic        Abacus                 Kiss
Technique             Behavioral             Stochastic Payload Inspection
Entity                Endpoint               Endpoint/Flow
Input Format          NetFlow-like           Packet trace
Grain                 Fine-grained           Fine-grained
Protocol Family       P2P-TV                 Any
Rejection Criterion   Threshold              Training-based
Training set size     Large (4000 smp.)      Small (300 smp.)
Time Responsiveness   Deterministic (5 sec)  Stochastic (first 80 pkts)
Network Deployment    Edge                   Edge/Backbone
payload to compute its features. This is a more expensive operation, even if only the first 12 bytes are sufficient to achieve a high classification accuracy.

Despite the different input data, both classifiers work at a fine-grained level, i.e., they can identify the specific application related to each flow and not just the class of applications (e.g., P2P-TV). This may appear obvious for a payload-based classifier such as Kiss, but it is one of the strengths of Abacus over other behavioral classifiers, which are usually capable only of coarse-grained classification.

Clearly, Abacus pays for the simplicity of its approach in terms of possible target traffic. In fact, its classification process relies on some specific properties of P2P-TV traffic (i.e., the steady download rate required by the application to provide smooth video playback), which are tied to this particular service. For this reason, Abacus currently cannot be applied to applications other than P2P-TV. On the contrary, Kiss is more general: it makes no particular assumptions on its target traffic and can be applied to any protocol. Indeed, it successfully classifies other kinds of P2P applications, from file-sharing (e.g., eDonkey) to P2P VoIP (e.g., Skype), as well as traditional client-server applications (e.g., DNS).

Another important distinguishing element is the rejection criterion. Abacus defines a hypersphere for each target class and measures the distance of each classified point from the center of the associated hypersphere by means of the Bhattacharyya formula. By employing a threshold-based rejection criterion, a point is labeled as "unknown" when its distance from the center exceeds a given value. Kiss instead exploits a multi-class SVM model where all the classes, including the unknown one, are represented in the training set.
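The Abacus rejection step can be sketched as follows (the negative logarithm of the Bhattacharyya coefficient is one common form of the distance; the exact form and the threshold setting used by Abacus are detailed in [5]):

```python
import math

def is_unknown(signature, centroid, threshold):
    """Threshold-based rejection: a signature is labeled 'unknown' when
    its Bhattacharyya distance from the centroid of the class selected
    by the SVM exceeds the threshold. Both arguments are probability
    mass functions (e.g., normalized Abacus bin frequencies)."""
    bc = sum(math.sqrt(p * q) for p, q in zip(signature, centroid))
    return -math.log(bc) > threshold
```

A signature identical to the centroid has distance 0 and is always accepted; signatures whose mass sits in bins the centroid never populates score a large distance and are rejected.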
While this approach makes Kiss very flexible, the characterization of the classes can be critical, especially for the unknown one, since the training set must contain samples from all possible protocols other than the target ones. We also notice that there is an order of magnitude of difference in the size of the training sets used by the classifiers. In fact, we trained Abacus with 4000 samples per class (although in some tests we observed the same performance even with smaller sets), while Kiss, thanks to the combined discriminative power of the Chi-Square signatures and of the SVM decision process, needs only 300 samples per class. On the other hand, Kiss needs at least 80 packets generated from (or directed to) an endpoint in order to classify it. This may seem a strong constraint, but the results reported
Table 4. Analytical comparison of the resource requirements of the classifiers

Memory allocation
  Abacus: 2F counters
  Kiss:   2^b * G counters

Packet processing
  Abacus:
    EP_state = hash(IP_d, port_d)
    FL_state = EP_state.hash(IP_s, port_s)
    FL_state.pkts  += 1
    FL_state.bytes += pkt_size
  Tot. op. (Abacus): 2 lup + 2 sim
  Kiss:
    EP_state = hash(IP_d, port_d)
    for g = 1 to G do
      P_g = payload[g]
      EP_state.O[g][P_g] += 1
    end for
  Tot. op. (Kiss): (2G+1) lup + G sim

Feature extraction
  Abacus:
    EP_state = hash(IP_d, port_d)
    for all FL_state in EP_state.hash do
      p[log2(FL_state.pkts)]  += 1
      b[log2(FL_state.bytes)] += 1
    end for
    N = count(keys(EP_state.hash))
    for i = 0 to B do
      p[i] /= N
      b[i] /= N
    end for
  Tot. op. (Abacus): (4F+2B+1) lup + 2(F+B) com + 3F sim
  Kiss:
    E = C/2^b (precomputed)
    for g = 1 to G do
      Chi[g] = 0
      for i = 0 to 2^b - 1 do
        Chi[g] += (EP_state.O[g][i] - E)^2
      end for
      Chi[g] /= E
    end for
  Tot. op. (Kiss): 2^(b+1) G lup + G com + (3*2^b + 1) G sim

lup = lookup, com = complex operation, sim = simple operation.
in Sec. 4 actually show that the percentage of unclassified traffic is negligible, at least in terms of bytes. This is due to the adoption of the endpoint-to-flow label propagation scheme, i.e., the propagation of the label of an "elephant" flow to all the "mice" flows of the same endpoint. Except under particular traffic conditions, this labeling technique effectively bypasses the constraint on the number of packets.

Finally, as far as network deployment is concerned, Abacus needs all the traffic received by an endpoint in order to characterize its behavior. Therefore, it is only effective when placed at the edge of the network, where all traffic directed to a host transits. Conversely, in the network core Abacus would likely see only a portion of this traffic, thus gathering an incomplete representation of an endpoint's behavior, which in turn could result in an inaccurate classification. Kiss, instead, is more robust with respect to the deployment position: by inspecting packet payload, it can operate even on a limited portion of the traffic generated by an endpoint, provided that the requirement on the minimum number of packets is satisfied.

5.2 Computational Cost

To complete the comparison of the classifiers, we provide an analysis of their requirements in terms of both memory occupation and computational cost. We follow a theoretical approach and calculate these metrics from the formal algorithm specification. In this way, our evaluation is independent of specific hardware platforms or code optimizations. Tab. 4 compares the costs from an analytical point of view, while Tab. 5 provides a numerical comparison based on a case study.
Table 5. Numerical case study of the resource requirements of the classifiers

                      Abacus                        Kiss
Memory allocation     320 bytes                     384 bytes
Packet processing     2 lup + 2 sim                 49 lup + 24 sim
Feature extraction    177 lup + 96 com + 120 sim    768 lup + 24 com + 1176 sim
Parameter values      B = 8, F = 40                 G = 24, b = 4
Memory footprint is mainly related to the data structures used to compute the statistics. Kiss requires a table of G * 2^b counters for each endpoint, to collect the observed frequencies employed in the Chi-Square computation. With the default parameters, i.e., G = 24 chunks of b = 4 bits, each endpoint requires 384 counters. Abacus, instead, requires two counters for each flow related to an endpoint, so the total amount of memory is not fixed but depends on the number of flows per endpoint. As an example, Fig. 1(a) reports, for the two operator traces, the CDF of the number of flows seen by each endpoint in consecutive windows of 5 seconds, the default duration of the Abacus time window. It can be observed that the 90th percentile in the worst case is nearly 40 flows. Using this value as a worst-case estimate of the number of flows for a generic endpoint, 2 * 40 = 80 counters are required per endpoint. This is very small compared to the Kiss requirements, but for a complete comparison we also need to consider the size of the counters. As Kiss uses windows of 80 packets, its counters take values in the interval [0, 80], so single-byte counters are sufficient: with the default parameters, this means 384 bytes for each endpoint. The counters of Abacus, instead, are not bounded, so, assuming a worst case of 4 bytes per counter, 320 bytes are associated with each endpoint. In conclusion, in the worst case the two classifiers require a comparable amount of memory, but on average Abacus requires less memory than Kiss.

Computational cost can be evaluated by comparing three tasks: the operations performed on each packet, the operations needed to compute the signatures, and the operations needed to classify them. Tab. 4 reports the pseudo-code of the first two tasks for both classifiers, specifying also the total number of operations needed for each task.
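The per-endpoint figures quoted above follow directly from the parameters; a quick check (Python, with the worst-case assumptions stated in the text):

```python
# Kiss: G groups of b bits, one 1-byte counter per possible group value
G, b = 24, 4
kiss_bytes = G * 2 ** b            # 384 single-byte counters -> 384 bytes

# Abacus: 2 counters (packets, bytes) per flow, 4 bytes each,
# worst case F = 40 flows per endpoint in a 5 s window
F = 40
abacus_bytes = 2 * F * 4           # 320 bytes

assert kiss_bytes == 384 and abacus_bytes == 320
```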
The operations are divided into three categories and considered separately, as they have different costs: lup for memory lookups, com for complex operations (i.e., floating-point operations), and sim for simple operations (i.e., integer operations).

Let us first focus on the packet processing part, which presents the tightest constraints from a practical point of view, as it should operate at line speed. In this phase, Abacus needs 2 memory lookups, to access its internal structures, and 2 integer increments per packet. Kiss, instead, needs 2G + 1 = 49 lookups, half of which are accesses to the packet payload, and must then compute G integer increments. Since memory read operations are the most time-consuming, from our estimate we can conclude that Abacus should be approximately 20 times faster than Kiss in the packet processing phase.

The evaluation of the signature extraction process is more complex. First of all, since the number of flows associated with an endpoint is not fixed, the Abacus cost is not deterministic; as in the memory occupation case, we consider 40 flows as a worst-case scenario. For the lookup operations, with B = 8, Abacus requires a total of 177 operations, while Kiss needs 768 operations, i.e., nearly four times as
[Figure 1: two CDF plots, with curves for op06, op07, joost, pplive, sopcast, and tvants; x-axes: flows per 5 s window (a) and time to collect 80 packets (b).]

Fig. 1. Cumulative distribution function of (a) number of flows per endpoint and (b) duration of an 80-packet snapshot for the operator traces
many. As for the arithmetic operations, Abacus needs 96 floating-point and 120 integer operations, while Kiss needs 24 floating-point and 1176 integer operations. Abacus produces one signature every 5 seconds, while Kiss signatures are computed every 80 packets. To estimate the frequency of the Kiss computation, Fig. 1(b) shows the CDF of the time needed to collect 80 packets for an endpoint: on average, a new signature is computed every 2 seconds. This means that Kiss performs the feature computation more frequently, i.e., it is more reactive and possibly more accurate than Abacus, but obviously also more resource-consuming. Finally, the complexity of the classification task depends on the number of features per signature, since both classifiers rely on an SVM decision process. The Kiss signature is composed, by default, of G = 24 features, while the Abacus signature contains 16 features: also from this point of view, Abacus appears lighter than Kiss.
6 Conclusions

In this paper we compared two approaches to the classification of P2P-TV traffic. We provided not only a quantitative evaluation of their performance by testing them on a common set of traces, but also a more insightful discussion of the differences deriving from the two underlying paradigms. The algorithms proved comparable in terms of accuracy in classifying P2P-TV applications, at least as regards the percentage of correctly classified bytes. Differences emerged when we compared the computational cost of the classifiers: in this respect, Abacus outperforms Kiss, because of the simplicity of the features employed to characterize the traffic. Conversely, Kiss is much more general, as it can classify other types of applications as well.

Our work is a first step in the cross-evaluation of the novel algorithms proposed by the research community in the field of traffic classification. We showed how an innovative behavioral method can be as accurate as a payload-based one, and at the same time lighter, making it a perfect candidate for scenarios with hard constraints in terms of computational resources. However, we also showed some limitations in its general applicability, which we would like to address in future work.
Acknowledgements. This work was funded by the EU under the FP7 Collaborative Project "Network-Aware P2P-TV Applications over Wise-Networks" (NAPA-WINE).
References

1. Moore, A.W., Papagiannaki, K.: Toward the Accurate Identification of Network Applications. In: Dovrolis, C. (ed.) PAM 2005. LNCS, vol. 3431, pp. 41–54. Springer, Heidelberg (2005)
2. Karagiannis, T., Broido, A., Brownlee, N., Claffy, K., Faloutsos, M.: Is P2P dying or just hiding? In: IEEE GLOBECOM 2004, Dallas, Texas, USA (2004)
3. Finamore, A., Mellia, M., Meo, M., Rossi, D.: KISS: Stochastic Packet Inspection. In: Traffic Measurement and Analysis (TMA) Workshop at IFIP Networking 2009, Aachen, Germany (May 2009)
4. Cascarano, N., Risso, F., Este, A., Gringoli, F., Salgarelli, L., Finamore, A., Mellia, M.: Comparing P2P-TV traffic classifiers. Submitted to IEEE ICC 2010 (2010)
5. Valenti, S., Rossi, D., Meo, M., Mellia, M., Bermolen, P.: Accurate, Fine-Grained Classification of P2P-TV Applications by Simply Counting Packets. In: Traffic Measurement and Analysis (TMA) Workshop at IFIP Networking 2009, Aachen, Germany (May 2009)
6. Bernaille, L., Teixeira, R., Salamatian, K.: Early application identification. In: Proc. of ACM CoNEXT 2006, Lisboa, PT (December 2006)
7. Williams, N., Zander, S., Armitage, G.: A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification. ACM SIGCOMM Comp. Comm. Rev. 36(5), 7–15 (2006)
8. Erman, J., Arlitt, M., Mahanti, A.: Traffic classification using clustering algorithms. In: Mining Network Data (MineNet) Workshop at ACM SIGCOMM 2006, Pisa, Italy (2006)
9. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: multilevel traffic classification in the dark. ACM SIGCOMM Comp. Comm. Rev. 35(4), 229–240 (2005)
10. Iliofotou, M., Kim, H., Pappu, P., Faloutsos, M., Mitzenmacher, M., Varghese, G.: Graph-based P2P traffic classification at the Internet backbone. In: 12th IEEE Global Internet Symposium (GI 2009), Rio de Janeiro, Brazil (April 2009)
11. Salgarelli, L., Gringoli, F., Karagiannis, T.: Comparing traffic classifiers. ACM SIGCOMM Comp. Comm. Rev. 37(3), 65–68 (2007)
12. Nguyen, T.T.T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys & Tutorials 10(4), 56–76 (2008)
13. Kim, H., Claffy, K., Fomenkov, M., Barman, D., Faloutsos, M., Lee, K.: Internet traffic classification demystified: myths, caveats, and the best practices. In: Proc. of ACM CoNEXT 2008, Madrid, Spain (2008)
14. Li, W., Canini, M., Moore, A.W., Bolla, R.: Efficient application identification and the temporal and spatial stability of classification schema. Computer Networks 53(6), 790–809 (2009)
15. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: A Measurement Study of a Large-Scale P2P IPTV System. IEEE Transactions on Multimedia (December 2007)
16. Li, B., Qu, Y., Keung, Y., Xie, S., Lin, C., Liu, J., Zhang, X.: Inside the New Coolstreaming: Principles, Measurements and Performance Implications. In: IEEE INFOCOM 2008, Phoenix, AZ (April 2008)
17. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, New York (1999)
18. Napa-Wine, http://www.napa-wine.eu/
19. Risso, F., Cascarano, N.: Diffinder, http://netgroup.polito.it/research-projects/l7-traffic-classification
TCP Traffic Classification Using Markov Models

Gerhard Münz, Hui Dai, Lothar Braun, and Georg Carle
Network Architectures and Services, Institute for Informatics
Technische Universität München, Germany
{muenz,braun,carle}@net.in.tum.de, [email protected]

Abstract. This paper presents a novel traffic classification approach which classifies TCP connections with the help of observable Markov models. As traffic properties, the payload length, direction, and position of the first packets of a TCP connection are considered. We evaluate the accuracy of the classification approach with the help of packet traces captured in a real network, achieving higher accuracies than the cluster-based classification approach of Bernaille [1]. As another advantage, the complexity of the proposed Markov classifier is low for both training and classification. Furthermore, the classification approach provides a certain level of robustness against changed usage of applications.
1 Introduction

Network operators are interested in identifying the traffic of different applications in order to monitor and control the utilization of the available network resources. Since the traffic of many new applications cannot be identified by specific port numbers, deep packet inspection (DPI) is the current technology of choice. However, DPI is very costly, as it requires substantial computational resources as well as up-to-date signatures of all relevant applications. Furthermore, DPI is limited to unencrypted traffic. Therefore, traffic classification using statistical methods has become an important area of research.

In this paper, we present a novel classification approach which models the transitions between data packets using Markov models. While most existing Markov-based traffic classification methods rely on hidden Markov models (HMMs), we make use of observable Markov models, where each state directly reflects certain packet attributes, such as the payload length, the packet direction, and the position within the connection. Using training data, separate Markov models are estimated for those applications which we want to identify and distinguish. The classification of new connections is based on the maximum-likelihood method, which selects the application whose Markov model yields the highest a-posteriori probability for the given packet sequence.

We restrict the evaluation of our approach to the classification of TCP traffic. Based on traffic traces captured in our department network, we compare the outcome of the Markov classifier with the results of Bernaille's cluster-based classification approach [1]. Furthermore, we show an example of changed application usage and its effect on the classification accuracy. Last but not least, we assess and discuss the complexity of the Markov classifier.

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 127–140, 2010.
© Springer-Verlag Berlin Heidelberg 2010
128
G. Münz et al.
After giving an overview of existing Markov-based traffic classification approaches in Section 2, we explain our approach in Section 3. Section 4 presents the evaluation results, before Section 5 concludes this paper.
2 Related Work

In the recent past, various research groups have proposed the use of statistical methods for traffic classification. Common to these approaches is that application-specific traffic characteristics are learned from training data. Typically, the considered properties are statistics derived from entire flows or connections, or attributes of individual packets; examples of these two kinds of properties are the average packet length and the length of the first packet of a connection, respectively. Nguyen and Armitage compare various existing approaches in a survey paper [2]. In the following, we give an overview of existing traffic classification approaches which make use of Markov models.

Wright et al. [3] and Dainotti et al. [4] estimate a separate HMM for each application, considering packet lengths and inter-arrival times. In an HMM, the output of a state is not deterministic but randomly distributed according to the emission probability distribution of the state. While the state output is observable, the transitions between states are hidden. Readers looking for a comprehensive introduction to HMMs are referred to Rabiner's tutorial [5]. Wright [3] considers TCP connections and deploys left-right HMMs with a large number of states and discrete emission probability distributions. In contrast, Dainotti [4] generates ergodic HMMs with four to seven states and Gamma-distributed emission probabilities for unidirectional TCP and UDP traffic from clients to servers; packets without payload are ignored. In both cases, traffic classification assigns new connections to the application whose HMM yields the maximum likelihood.

Our approach is motivated by Estevez-Tapiador et al., who use observable ergodic Markov models for detecting anomalies in TCP connections [6]. In an observable Markov model, each state emits a different symbol, which allows deducing the state transitions directly from a series of observations.
In the case of Estevez-Tapiador et al., the Markov model generates the sequence of TCP flag combinations observed in those packets of a TCP connection which are sent from the client to the server. Hence, every state represents a specific combination of TCP flags, and every transition the arrival of a new packet in the same TCP connection. The transition matrix is estimated using training data which is free of anomalies. In the detection phase, anomalies are then revealed by calculating the a-posteriori probability and comparing it with a lower threshold. Estevez-Tapiador et al. use separate Markov models for different applications, which are distinguished by their well-known port numbers.

We adopt and extend the modeling approach of Estevez-Tapiador for classifying TCP connections. The training phase is identical: we estimate distinct Markov models for the different applications using training data. In the classification phase, however, we calculate the a-posteriori probabilities of an observed connection for all Markov models. Thereafter, the connection is assigned
to the application for which the Markov model yields the maximum a-posteriori probability. In contrast to Estevez-Tapiador, we consider both directions of the TCP connection and take payload lengths instead of TCP flag combinations into account. In prior work [7], we achieved good classification results with states reflecting the payload length and PUSH flag of each packet. However, the deployed Markov models did not consider the position of each packet within the connection, although the packet position strongly influences the payload length distribution and the occurrence probability of the PUSH flag. In this paper, we present a new variant of the Markov classifier which is based on left-right Markov models instead of ergodic Markov models. Hence, we are able to incorporate the dependency of the transition probabilities on the packet's position within the TCP connection. The next section explains this approach in more detail.
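The training and maximum a-posteriori classification steps common to this family of classifiers can be sketched as follows (a simplified observable-model variant with add-one smoothing, under our own naming; the classifier presented in the next section uses left-right models whose states also encode the packet position):

```python
import math
from collections import defaultdict

def train(sequences, states):
    """Estimate initial and transition probabilities of an observable
    Markov model from state sequences, with add-one smoothing so that
    unseen transitions keep a small nonzero probability."""
    init = defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        init[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    n = len(states)
    pi = {s: (init[s] + 1) / (len(sequences) + n) for s in states}
    A = {s: {t: (trans[s][t] + 1) / (sum(trans[s].values()) + n)
             for t in states} for s in states}
    return pi, A

def log_likelihood(seq, model):
    """Log a-posteriori probability of a state sequence under a model."""
    pi, A = model
    return math.log(pi[seq[0]]) + sum(
        math.log(A[a][b]) for a, b in zip(seq, seq[1:]))

def classify(seq, models):
    """Assign the sequence to the application whose Markov model yields
    the maximum a-posteriori probability."""
    return max(models, key=lambda app: log_likelihood(seq, models[app]))
```

Because transitions are observable, training reduces to counting, which is what keeps the complexity of this classifier low compared to HMM-based approaches.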
3 TCP Traffic Classification Using Markov Models
Just like other statistical traffic classification approaches, we assume that the communication behavior of an application influences the resulting traffic. Hence, by observing characteristic traffic properties, it should be possible to distinguish applications with different behaviors. One such characteristic property is the sequence of packet lengths observed within a flow or connection, which serves as input to many existing traffic classification methods [3, 1, 4]. We use observable Markov models to describe the dependencies between subsequent packets of a TCP connection. The considered packet attributes are payload lengths (equaling the TCP segment size), packet direction, and packet position within the connection. Considering the TCP payload length instead of the IP packet length has the advantage that the value is independent of any IP and TCP options. Similar to several existing approaches (e.g., [1,4]), we only take into account packets carrying payload. We call these packets “data packets” in the following. The reason for ignoring empty packets is that these are either part of the three-way handshake, which is common to all TCP connections, or they represent acknowledgments. In both cases, the packet transmission is mainly controlled by the transport layer and not by the application. The packet direction denotes whether the packet is sent from the client to the server or vice versa. As client, we always consider the host which initiates a TCP connection. In contrast to our previous work [7], we do not consider any TCP flags although the occurrence of the PUSH flag may be influenced by how the application passes data to the transport layer. However, an experimental evaluation and comparison of TCP implementations showed that the usage of the PUSH flag varies a lot between different operating systems. 
Hence, slight improvements of the classification results which can be achieved by considering the PUSH flag might not be reproducible if other operating systems are deployed. As another difference to our previous work, we take into account the packet position within the TCP connection. This leads to better models since the probability distribution of payload length and direction typically depends on the packet position,
G. Münz et al.
especially in the case of the first packets of the connection. Moreover, the classification accuracy can be increased because payload length and direction at specific packet positions are often very characteristic of an application. For example, the majority of HTTP connections start with a small request packet sent from the client to the server, followed by a longer series of long packets from the server to the client. In Section 4.1, we empirically confirm these assumptions by looking at TCP connections of different applications. In general, a Markov model consists of n distinct states Σ = {σ_1, ..., σ_n}, a vector of initial state probabilities Π = (π_1, ..., π_n), and an n × n transition matrix A = {a_σi,σj}. In our case, each state represents a distinct combination of payload length, packet direction, and packet position within the TCP connection. The initial state reflects the properties of the first packet within the TCP connection. A transition from one state to the next state corresponds to the arrival of a new packet. The next state then describes the properties of the new packet. To obtain a reasonably small number of states, the payload lengths are discretized into a few intervals. We evaluated different interval definitions and found that good classification results can be obtained with a rather small number of intervals. The evaluation results presented in Section 4 are based on the following four intervals: [1,99], [100,299], [300, MSS−1], [MSS]. The value of the maximum segment size (MSS) is often exchanged in a TCP option during the TCP three-way handshake. Alternatively, MSS can be deduced from the maximum observed payload length, provided that the connection contains at least one packet of maximum payload length. A fallback option is to set MSS to a reasonable default value. Another measure to keep the number of states small is to limit the Markov model to a maximum of l data packets per TCP connection.
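This state definition (payload-length interval × direction × position) can be sketched as a small encoding function. The code below is an illustrative sketch, not the authors' implementation; the default MSS of 1460 bytes is an assumption for illustration:

```python
def payload_interval(length, mss=1460):
    """Discretize a TCP payload length into the four intervals
    [1,99], [100,299], [300,MSS-1], [MSS]; mss=1460 is an assumed default."""
    if length >= mss:
        return 3
    if length >= 300:
        return 2
    if length >= 100:
        return 1
    return 0

def state_index(length, client_to_server, position, mss=1460):
    """Map (payload length, direction, packet position) to a state index:
    8 states per stage, i.e. 4 length intervals x 2 directions."""
    direction = 0 if client_to_server else 1
    return position * 8 + direction * 4 + payload_interval(length, mss)
```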
Hence, if a connection contains more than l data packets, we only consider the first l of them. In order to find a good value for l, we evaluated different settings and show the classification results for l = 3, ..., 7 in Section 4. The initial state and transition probabilities are estimated from training data using the following equations:

    π_σi = F0(σi) / Σ_{m=1..n} F0(σm) ;    a_σi,σj = F(σi,σj) / Σ_{m=1..n} F(σi,σm)    (1)
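Equation (1) amounts to simple frequency counting over the training connections. A minimal sketch (hypothetical helper, assuming each training connection is already given as a sequence of state indices):

```python
from collections import Counter

def estimate_model(state_sequences):
    """Estimate initial-state and transition probabilities from
    training connections, following equation (1)."""
    f0 = Counter()   # F0(sigma_i): initial-state frequencies
    f = Counter()    # F(sigma_i, sigma_j): transition frequencies
    for seq in state_sequences:
        f0[seq[0]] += 1
        for s, t in zip(seq, seq[1:]):
            f[(s, t)] += 1
    total0 = sum(f0.values())
    pi = {s: c / total0 for s, c in f0.items()}
    row_sums = Counter()                     # per-state transition totals
    for (s, _), c in f.items():
        row_sums[s] += c
    a = {(s, t): c / row_sums[s] for (s, t), c in f.items()}
    return pi, a
```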
F0(σi) is the number of initial packets matching the state σi. F(σi,σj) is the frequency of transitions from packets described by state σi to packets described by state σj. Since the packet position is reflected in the state definitions, we obtain a left-right Markov model with l stages corresponding to the first l data packets in the TCP connection. In our case, every stage comprises eight states representing four payload length intervals and two directions. An example of such a Markov model with l = 4 stages is given in Figure 1. L = 1, ..., 4 denote the different payload length intervals, C ⇒ S and S ⇒ C the two directions from client to server and server to client. Only transitions from one stage to the next (left to right) may occur, which means that at most 8²(l − 1) out of (8l)² transition matrix elements are nonzero. Apart from the packet position, the states within each of the stages describe
Fig. 1. Left-right Markov model
the same set of packet properties. Therefore, we may alternatively interpret the model as a Markov model with eight states and a time-variant 8 × 8 transition matrix A_t, t = 1, ..., (l − 1). This interpretation enables a much more memory-efficient storage of the transition probabilities than one large 8l × 8l matrix. For every application k, we determine a separate Markov model M(k). For this purpose, the training data must be labeled, which means that every connection must be assigned to one of the applications. In order to obtain reliable estimates of the initial and transition probabilities, the training data must contain a sufficiently large number of TCP connections for each application. On the other hand, it is not necessary that all connections contain at least l data packets since the estimation does not require a constant number of observations for every transition. Instead of individual applications, we may also use a single Markov model for a whole class of applications. This approach is useful if multiple applications are expected to show a similar communication behavior, for example because they use the same protocol. Figure 2 illustrates how the resulting Markov models are used to classify new TCP connections. Given the first l packets of a TCP connection O = {o_1, o_2, ..., o_l}, the log-likelihood for this observation is calculated for all Markov models M(k) with Π(k) = (π(k)_1, ..., π(k)_n) and A(k) = {a(k)_σi,σj} using the following equation:

    log Pr(O|M(k)) = log( π(k)_o1 · Π_{i=1..l−1} a(k)_oi,oi+1 ) = log π(k)_o1 + Σ_{i=1..l−1} log a(k)_oi,oi+1    (2)
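A sketch of the resulting maximum-likelihood classification step (hypothetical code, not the authors' implementation; base-10 logarithms are assumed so that the threshold of −15 used in Section 4 matches three ε-probabilities, and unseen entries fall back to a small ε as described in the text):

```python
import math

def log_likelihood(seq, pi, a, eps=1e-5):
    """log Pr(O|M) as in equation (2): log initial probability plus the
    summed log transition probabilities; unseen entries get probability eps."""
    ll = math.log10(pi.get(seq[0], eps))
    for s, t in zip(seq, seq[1:]):
        ll += math.log10(a.get((s, t), eps))
    return ll

def classify(seq, models, threshold=3 * math.log10(1e-5)):
    """Assign the connection to the application whose Markov model yields
    the maximum log-likelihood; below the threshold it is unclassifiable."""
    best_app, best_ll = None, -math.inf
    for app, (pi, a) in models.items():
        ll = log_likelihood(seq, pi, a)
        if ll > best_ll:
            best_app, best_ll = app, ll
    return best_app if best_ll > threshold else None
```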
Fig. 2. Traffic classification using Markov models
The maximum likelihood classifier then selects the application for which the log-likelihood is the largest. If a connection contains fewer than l data packets, the log-likelihood is calculated for the available number of transitions only. It is possible that a TCP connection to be classified contains an initial state for which π(k)_o1 = 0, or a transition for which a(k)_oi,oi+1 = 0. This means that such an initial state or transition has not been observed in the training data. Thus, the connection does not fit the corresponding Markov model. Furthermore, if an unknown initial state or transition occurs in every model, the connection cannot be assigned to any application. This approach, however, may lead to unwanted disqualifications if the training data does not cover all possible traffic, including very rare transitions. As the completeness of the training data usually cannot be guaranteed, we tolerate a certain amount of non-conformance but punish it with a very low likelihood. For this purpose, we replace all π(k)_σi = 0 and all a(k)_σi,σj = 0 by a positive value ε which is much smaller than any of the estimated non-zero probabilities. Then, we reduce the remaining probabilities to ensure Σ_i π(k)_σi = 1 and Σ_j a(k)_σi,σj = 1. In the evaluation in Section 4, we use ε = 10^−5 = 0.001%, which is very small compared to the smallest possible estimated probability of 1/300 = 0.33% (300 is the number of connections per application in the training data). Despite the uncertainty regarding the completeness of the training data, we want to limit the number of tolerated ε-initial-states and ε-transitions per connection. This is achieved by setting a lower threshold of 3 log ε for the log-likelihood, which corresponds to three unknown transitions, or an unknown initial state plus two unknown transitions. Connections with a log-likelihood below this threshold are considered unclassifiable.
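The ε-substitution and rescaling step can be sketched as follows (hypothetical helper: it shrinks the estimated non-zero probabilities so that, together with ε for each unseen outcome, the distribution again sums to one):

```python
def smooth(probs, n_outcomes, eps=1e-5):
    """Give every unseen outcome probability eps and scale the observed
    probabilities so the full distribution still sums to one."""
    n_unseen = n_outcomes - len(probs)   # outcomes never observed in training
    scale = 1.0 - n_unseen * eps
    return {k: p * scale for k, p in probs.items()}
```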
4 Evaluation

4.1 Training and Test Data
We evaluated the presented traffic classification approach using TCP traffic traces captured in our department network. The traces comprise four classical client-server applications (HTTP, IMAP, SMTP, and SSH) and three peer-to-peer (P2P) applications (eDonkey, BitTorrent, and Gnutella). An accurate assignment of each TCP connection to one of the applications is possible as the HTTP, IMAP, SMTP, and SSH traffic involved our own servers. The P2P traffic, on the other hand, originated or terminated at hosts on which we had installed the corresponding peer-to-peer software; no other network service was running. The training data consists of 300 TCP connections of each application. The evaluation of the classification approach is based on test data containing 500 connections for each application. In order to enable a comparison with the cluster-based classification approach by Bernaille [1], we only consider connections with at least four data packets. In principle, our approach also works for connections with a smaller number of data packets, yet the classification accuracy is likely to decrease in this case. Using boxplots, Figure 3 illustrates the payload length distribution of the first seven data packets in the TCP connections contained in the training data. The packet direction is encoded in the sign: payload lengths of packets sent by the server are accounted with a negative sign. In addition to the seven applications used for classification, there is another boxplot for HTTP connections carrying Adobe Flash video content which will be discussed later in Section 4.5. The upper and lower ends of the boxes correspond to the 25% and 75% quantiles, the horizontal lines in the boxes indicate the medians. The length of the whiskers is 1.5 times the distance between the 25% and 75% quantiles. Crosses mark outliers. As can be seen, two groups of protocols can be distinguished by looking at the first data packet.
In the case of SMTP and SSH, the server sends the first data packet; in all other cases, it is the client. Protocols such as IMAP or SMTP, which specify a dialog in which client and server negotiate certain parameters, are characterized by alternating packet directions. In contrast, the majority of the HTTP connections follow a simple scheme of one short client request followed by a series of large packets returned by the server.

4.2 Evaluation Metrics
As evaluation metrics, we calculate recall and precision for every application k:

    recall_k = (number of connections correctly classified as application k) / (number of connections of application k in the test data)

    precision_k = (number of connections correctly classified as application k) / (total number of connections classified as application k)
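Computed from a list of true labels and classifier outputs, the two metrics look like this (illustrative sketch; None marks unclassifiable connections):

```python
def recall_precision(true_labels, predicted):
    """Per-application recall and precision as defined above."""
    recall, precision = {}, {}
    for k in set(true_labels):
        correct = sum(1 for t, p in zip(true_labels, predicted) if t == p == k)
        in_test = sum(1 for t in true_labels if t == k)
        classified = sum(1 for p in predicted if p == k)
        recall[k] = correct / in_test
        precision[k] = correct / classified if classified else 0.0
    return recall, precision
```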
These two metrics are frequently used for evaluating statistical classifiers. A perfect classifier achieves 100% recall and precision for all applications. Recall is
134
G. M¨ unz et al.
IMAP 1500
1000
1000
500
500
payload length
payload length
HTTP 1500
0
0
−500
−500
−1000
−1000
−1500
1
2
3
4
5
6
−1500
7
1
2
3
data packet
4
1000
1000
500
500
0
−500
−1000
−1000
2
3
4
5
6
−1500
7
1
2
3
data packet
1000
1000
500
500
0
−500
−1000
−1000
3
4
5
6
−1500
7
1
2
3
data packet
1000
1000
500
500
0
−500
−1000
−1000
3
4
data packet
5
6
7
6
7
0
−500
2
4
Flash video over HTTP 1500
payload length
payload length
Gnutella
1
7
data packet
1500
−1500
6
0
−500
2
5
BitTorrent 1500
payload length
payload length
eDonkey
1
4
data packet
1500
−1500
7
0
−500
1
6
SSH 1500
payload length
payload length
SMTP 1500
−1500
5
data packet
5
6
7
−1500
1
2
3
4
data packet
Fig. 3. Payload lengths of first data packets
5
Table 1. Classification results of Markov classifier

            3 stages                4 stages                5 stages
            Recall   Prec.   Uncl.  Recall   Prec.   Uncl.  Recall   Prec.   Uncl.
HTTP        96.00%   97.17%  0.00%  98.80%   95.92%  0.00%  97.20%   97.59%  0.00%
IMAP        94.60%   75.20%  0.00%  94.80%   97.33%  0.00%  95.00%   97.94%  0.20%
SMTP        99.60%   94.86%  0.00%  99.80%   95.23%  0.00%  99.80%   95.23%  0.00%
SSH         99.00%   99.80%  0.00%  99.20%   99.60%  0.00%  99.20%   99.80%  0.20%
eDonkey     55.00%   99.28%  0.00%  87.20%   99.09%  0.00%  89.00%   99.55%  0.00%
BitTorrent  98.80%   86.67%  0.00%  98.80%   89.98%  0.00%  99.40%   91.03%  0.00%
Gnutella    97.20%   95.48%  0.00%  95.40%   97.95%  0.00%  97.20%   97.01%  0.00%
Average     91.46%   92.64%  0.00%  96.29%   96.44%  0.00%  96.69%   96.88%  0.06%

            6 stages                7 stages
            Recall   Prec.   Uncl.  Recall   Prec.   Uncl.
HTTP        97.20%   97.79%  0.20%  97.20%   98.38%  0.20%
IMAP        94.80%   99.79%  0.40%  94.80%   99.79%  0.40%
SMTP        99.60%   95.40%  0.20%  99.60%   95.40%  0.20%
SSH         99.40%   99.40%  0.20%  99.40%   100%    0.40%
eDonkey     93.80%   99.79%  0.00%  97.40%   99.19%  0.00%
BitTorrent  98.40%   93.18%  0.20%  98.40%   96.85%  0.20%
Gnutella    96.60%   96.60%  0.40%  96.80%   96.61%  1.00%
Average     97.11%   97.42%  0.23%  97.66%   98.03%  0.34%
independent of the traffic composition, which means that it does not matter how many connections of the test data belong to application k. In contrast, precision depends on the traffic composition in the test data since the denominator usually increases for larger numbers of connections not belonging to application k. Using test data which contains an equal number of connections for every application, we ensure that the calculated precision values are unbiased. In order to compare different classifiers with a single value, we calculate the overall accuracy, which is usually defined as the number of correctly classified connections divided by the total number of connections in the test data. Since the number of connections per application is constant in our case, the overall accuracy is identical to the average recall. Note that the accuracy values mentioned in this document cannot be directly compared to the accuracies mentioned in many related publications, which are usually based on the unbalanced traffic compositions observed in real networks.

4.3 Classification Results
Table 1 shows the classification results for different numbers of stages l. In addition to recall and precision, the table indicates the percentage of unclassifiable connections for every application. These connections could not be assigned to any application because the maximum log-likelihood is smaller than the lower threshold 3 log 10^−5 = −15. As explained in Section 3, we apply this threshold to sort out connections which differ very much from all Markov models.
As can be seen in the table, the recall values of most applications increase or do not change much if the Markov models contain more stages, which means that more transitions between data packets are considered. The setup with l = 4 stages is an exception because HTTP reaches a much higher and Gnutella a much lower recall value than for the other setups. We inspected this special case and saw that 11 to 13 Gnutella connections are usually misclassified as HTTP traffic and vice versa. If the Markov models contain four stages, however, 21 Gnutella connections are misclassified as HTTP, and only four HTTP connections are misclassified as Gnutella, which leads to unusual recall (and precision) values. Except for Markov models with seven stages, eDonkey is the application with the largest number of misclassified connections. In fact, a large number of eDonkey connections are misclassified as BitTorrent and IMAP traffic. For example, in the case of four stages, 53 eDonkey connections are assigned to BitTorrent, another 11 eDonkey connections to IMAP. These numbers decrease with larger numbers of stages. The example of eDonkey nicely illustrates the relationship between a low recall value for one application and low precision values for other applications: low recall values of eDonkey coincide with low precision values of BitTorrent and IMAP. The recall value of IMAP stays below 95% because 24 IMAP connections are classified as SMTP in all setups. The precision values show little variation and increase gradually with larger numbers of stages. Finally, the number of unclassifiable connections increases for larger numbers of stages. The reason is that more transitions are evaluated, which also increases the probability of transitions which did not appear in the training data. Although we account unknown initial states and transitions with an ε-probability, connections with three or more of these probabilities are sorted out by the given threshold.
Obviously, the number of unclassifiable connections could be reduced by tolerating a larger number of unknown transitions. Alternatively, we could increase the number of connections in the training data in order to cover a larger number of rare transitions. The average recall, which is equal to the overall accuracy, jumps from 91.46% to 96.29% when the number of stages is increased from three to four. At the same time, the average precision increases from 92.64% to 96.44%. Thereafter, both averages increase gradually with every additional stage. Hence, at least four data packets should be considered in the Markov models to obtain good results.

4.4 Comparison with Bernaille's Approach
Bernaille [1] proposed a traffic classification method which uses clustering algorithms to find connections with similar payload lengths and directions in the first data packets. The Matlab code of this method can be downloaded from a website [8]. Bernaille's approach requires that all connections in the test and training data have at least as many data packets as analyzed by the classification method. Furthermore, the results of his work show that the best results can be achieved with three or four data packets. As mentioned in Section 4.1, we prepared our datasets for a comparison with Bernaille by including only connections with at least four data packets.
Table 2. Classification results of Bernaille's classifier

            3 data packets,  4 data packets,  3 data packets,  3 data packets,
            27 clusters      34 clusters      28 clusters      29 clusters
            Recall   Prec.   Recall   Prec.   Recall   Prec.   Recall   Prec.
HTTP        88.60%   96.30%  99.60%   86.16%  88.00%   94.62%  90.20%   95.35%
IMAP        91.00%   96.19%  93.20%   99.79%  92.60%   83.88%  87.80%   90.89%
SMTP        98.80%   95.18%  97.40%   95.49%  90.20%   100%    98.80%   95.37%
SSH         97.20%   98.98%  95.40%   99.58%  97.40%   100%    97.80%   98.79%
eDonkey     95.80%   87.09%  98.80%   98.80%  91.00%   92.11%  100%     89.61%
BitTorrent  88.80%   100%    93.60%   100%    96.80%   95.09%  97.20%   100%
Gnutella    96.40%   85.61%  92.00%   92.37%  96.20%   88.75%  95.20%   97.74%
Average     93.80%   94.19%  95.71%   96.03%  93.17%   93.49%  95.29%   95.39%
The learning phase of Bernaille's classifier is nondeterministic and depends on the random initialization of the cluster centroids. Furthermore, the number of clusters as well as the number of data packets needs to be given as input parameters to the training algorithm. The documentation of Bernaille's Matlab code recommends 30 to 40 clusters and three to four data packets as a good starting point. At the end of the clustering, the algorithm automatically removes clusters to which fewer than three connections of the training data are assigned. A calibration method performs the training of the classifier with different numbers of clusters and data packets and returns the model which achieves the highest classification accuracy with respect to the training data. As recommended by Bernaille, we ran the calibration method to cluster the connections in the training data with 30, 35, and 40 initial cluster centroids and three and four data packets. The best classifier was then used to classify the test data by assigning each of the connections to the nearest cluster. Further improvements, which Bernaille achieved by considering port numbers in addition to cluster assignments [1], were not considered since our approach does not evaluate port numbers either. Table 2 shows the classification results for four different runs of the calibration method. As can be seen, the average recall and precision values do not reach the same level as for the Markov classifier. A possible explanation is that Bernaille's approach does not consider any correlation between subsequent packets. The classification results vary a lot between different runs of the calibration method. Interestingly, we obtain very different results in the third and fourth run although both classifiers use three data packets and a very similar number of clusters. The range of the recall values obtained for an individual application can be very wide. The most extreme example is HTTP with recall values ranging from 88.6% to 99.6%.
In general, we observed that the classification results depend very much on the initialization values of the cluster centroids and not so much on the remaining parameters, such as the number of clusters and data packets.
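The calibration procedure — train with each parameter combination and keep the model that scores best on the training data — can be sketched generically. The code below is a hypothetical stand-in, not Bernaille's actual Matlab code; train_fn and score_fn are assumed callbacks:

```python
import random

def calibrate(train_fn, score_fn, cluster_counts=(30, 35, 40),
              packet_counts=(3, 4), seed=0):
    """Train one clustering model per parameter combination (each run may
    use randomly initialized centroids via rng) and return the model with
    the highest training accuracy."""
    rng = random.Random(seed)
    best_model, best_score = None, float("-inf")
    for n_clusters in cluster_counts:
        for n_packets in packet_counts:
            model = train_fn(n_clusters, n_packets, rng)
            score = score_fn(model)
            if score > best_score:
                best_model, best_score = model, score
    return best_model
```

Because the centroid initialization is random, repeated calibration runs can select different models, which matches the run-to-run variation reported above.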
Table 3. Classification of Flash over HTTP traffic
            tolerant classifier         intolerant classifier
            HTTP    Gnutella  Uncl.     HTTP    Gnutella  Uncl.
4 stages    68.0%   17.4%     14.6%     60.0%   11.0%     29.0%
5 stages    63.8%   16.6%     19.6%     44.6%   14.4%     41.0%
6 stages    60.8%   17.0%     22.2%     43.8%   11.6%     44.6%
7 stages    61.2%   16.0%     22.8%     39.6%   11.6%     48.8%
In contrast to Bernaille's approach, the training of the Markov classifier always yields deterministic models which do not depend on any random initialization. Hence, we do not need to run the training method several times, which is an advantage regarding practical deployment.

4.5 Change of Application Usage
HTTP has become a universal protocol for various kinds of data transports. Many websites now include multimedia contents, such as animated pictures or videos. There are many sites delivering such contents, with www.youtube.com being one of the most popular. A large proportion of these embedded multimedia contents are based on Adobe Flash. Flash typically transfers data in streaming mode, which means that after a short prefetching delay the user can start watching the video without having to wait until the download is finished. In order to assess how our classification approach behaves if the usage of an application changes, we applied the classifier to 500 HTTP connections carrying Flash video content. These connections were captured in our university network and identified by the HTTP content type “video/x-flv”. The boxplots at the bottom right of Figure 3 show the payload length distribution. Compared to the previously analyzed HTTP connections, which did not include any Flash video downloads, the variance in the first four packets is much larger. The request packets sent from the client to the server tend to contain more payload than in the case of other HTTP traffic whereas the second and third packets are often smaller. Traffic classification should be robust against such changes of application usage. In the optimal case, the classifier still classifies the corresponding connections correctly as HTTP traffic. Apart from that, it is also acceptable to label the connections as unclassifiable. On the other hand, the connections should not be assigned to wrong applications. Table 3 shows how the HTTP connections containing Flash video content are classified depending on the number of stages in the Markov models. Apart from tolerant classification with ε = 10^−5 and log-likelihood threshold −15, we tested an intolerant classifier which disqualifies all connections with an unknown initial state or transition.
The tolerant classifier assigns between 60% and 68% of the connections to HTTP and around 17% to Gnutella. Hence, again, similarities between HTTP and Gnutella traffic cause a certain number of misclassified connections. The
remaining connections remain unclassified because the maximum log-likelihood is smaller than 3 log 10^−5. With the intolerant classifier, twice as many connections remain unclassified, mainly on account of connections previously assigned to HTTP. This shows that tolerating non-conforming connections increases the robustness of the classifier against usage changes. Although the tolerant classifier still classifies most of the connections as HTTP traffic, the classification accuracy is degraded. To solve this problem, it suffices to re-estimate the Markov model of HTTP with training data covering the new kind of HTTP traffic. Alternatively, we can add a Markov model which explicitly models Flash over HTTP.

4.6 Complexity
The estimation of initial state and transition probabilities using equations (1) requires counting the frequency of initial packet properties and transitions. If the training data contains C connections of an application, estimating the parameters of the corresponding Markov model with l stages requires at most C · l counter increments plus 7 + 56(l − 1) additions and 8l divisions. In order to classify a connection, the log-likelihood needs to be calculated for every Markov model using equation (2). This calculation requires (N − 1) additions, N being the number of analyzed data packets in the connection. The number of stages l is an upper bound for N. The maximum log-likelihood of all Markov models needs to be determined and checked against the given lower threshold. Hence, for K different applications, we have at most K(l − 1) additions and K comparisons. Other statistical traffic classification approaches typically require more complex calculations. This is particularly true for HMMs, where emission probabilities have to be considered in addition to transition probabilities. Regarding Bernaille's approach, the clustering algorithm determines the assignment probability of every connection in the training data to every cluster. After recalculating the cluster centroids, the procedure is repeated in another iteration. Just one of these iterations is more complex than estimating the Markov models. The classification of a connection requires calculating the assignment probabilities for every cluster. If Gaussian mixture models (GMMs) are used, as in Bernaille's Matlab code, the probabilities are determined under the assumption of multivariate normal distributions, which is more costly than calculating the Markov likelihoods.
5 Conclusion
We presented a novel traffic classification approach based on observable Markov models and evaluated the classification accuracy with the help of TCP traffic traces of different applications. The results show that our approach yields slightly better results than Bernaille's cluster-based classification method [1]. Furthermore, it provides a certain level of robustness with respect to the changed usage of an
application. Moreover, the complexity of our approach is low compared to other statistical traffic classification methods. The classification accuracy depends on the number of stages per Markov model, which corresponds to the maximum number of data packets considered per TCP connection. Based on our evaluation, we recommend Markov models with at least four stages, corresponding to 32 states. Every additional stage gradually improves the accuracy. As an important property, connections whose number of data packets is smaller than the number of stages can still be classified. Hence, the only drawback of maintaining more stages is that more transition probabilities need to be estimated, stored, and evaluated per application. In order to better assess the performance of our classification approach, we intend to apply it to other traffic traces captured in different networks. Beyond that, it will be interesting to consider additional applications since the set of applications regarded in our evaluation is very limited. Finally, we plan to extend the approach to the classification of UDP traffic, which is mainly used for real-time applications.
Acknowledgments. We gratefully acknowledge support from the German Research Foundation (DFG) funding the LUPUS project in which this research work has been conducted.
References

1. Bernaille, L., Teixeira, R., Salamatian, K.: Early application identification. In: Proc. of ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT) 2006, Lisboa, Portugal (2006)
2. Nguyen, T.T.T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys & Tutorials 10, 56–76 (2008)
3. Wright, C., Monrose, F., Masson, G.: HMM profiles for network traffic classification (extended abstract). In: Proc. of Workshop on Visualization and Data Mining for Computer Security (VizSEC/DMSEC), Fairfax, VA, USA, pp. 9–15 (2004)
4. Dainotti, A., de Donato, W., Pescapè, A., Rossi, P.S.: Classification of network traffic via packet-level hidden Markov models. In: Proc. of IEEE Global Telecommunications Conference, GLOBECOM 2008, New Orleans, LA, USA (2008)
5. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, vol. 77, pp. 257–286 (1989)
6. Estevez-Tapiador, J.M., Garcia-Teodoro, P., Diaz-Verdejo, J.E.: Stochastic protocol modeling for anomaly based network intrusion detection. In: Proc. of IEEE International Workshop on Information Assurance, IWIA (2003)
7. Dai, H., Münz, G., Braun, L., Carle, G.: TCP-Verkehrsklassifizierung mit Markov-Modellen. In: 5. GI/ITG-Workshop MMBnet 2009, Hamburg, Germany (2009)
8. Bernaille, L.: Homepage of early application identification (2009), http://www-rp.lip6.fr/~teixeira/bernaill/earlyclassif.html
K-Dimensional Trees for Continuous Traffic Classification

Valentín Carela-Español(1), Pere Barlet-Ros(1), Marc Solé-Simó(1), Alberto Dainotti(2), Walter de Donato(2), and Antonio Pescapé(2)

(1) Department of Computer Architecture, Universitat Politècnica de Catalunya (UPC), {vcarela,pbarlet,msole}@ac.upc.edu
(2) Department of Computer Engineering and Systems, Università di Napoli Federico II, {alberto,walter.dedonato,pescape}@unina.it

Abstract. The network measurement community has proposed multiple machine learning (ML) methods for traffic classification during the last years. Although several research works have reported accuracies over 90%, most network operators still use either obsolete (e.g., port-based) or extremely expensive (e.g., pattern matching) methods for traffic classification. We argue that one of the barriers to the real deployment of ML-based methods is their time-consuming training phase. In this paper, we revisit the viability of using the Nearest Neighbor technique for traffic classification. We present an efficient implementation of this well-known technique based on multiple K-dimensional trees, which is characterized by short training times and high classification speed. This allows us not only to run the classifier online but also to continuously retrain it, without requiring human intervention, as the training data become obsolete. The proposed solution achieves very promising accuracy (> 95%) while looking just at the size of the very first packets of a flow. We present an implementation of this method based on the TIE classification engine as a feasible and simple solution for network operators.
1 Introduction
Gaining information about the applications that generate traffic in an operational network is much more than mere curiosity for network operators. Traffic engineering, capacity planning, traffic management or even usage-based pricing are some examples of network management tasks for which this knowledge is extremely important. Although this problem is still far from a definitive solution, the networking research community has proposed several machine learning (ML) techniques for traffic classification that can achieve very promising results in terms of accuracy. However, in practice, most network operators still use either obsolete (e.g., port-based) or impractical (e.g., pattern matching) methods for traffic identification and classification. One of the reasons for this slow adoption by network operators is the time-consuming training phase involving
This work has been supported by the European Community’s 7th Framework Programme (FP7/2007-2013) under Grant Agreement No. 225553 (INSPIRE Project) and Grant Agreement No. 216585 (INTERSECTION Project).
F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 141–154, 2010. c Springer-Verlag Berlin Heidelberg 2010
most ML-based methods, which often requires human supervision and manual inspection of network traffic flows. In this paper, we revisit the viability of using the well-known Nearest Neighbor (NN) machine learning technique for traffic classification. As we will discuss throughout the paper, this method has a large number of features that make it very appealing for traffic classification. However, it is often discarded given its poor classification speed [15, 11]. In order to address this practical problem, we present an efficient implementation of the NN search algorithm based on a K-dimensional tree structure that allows us not only to classify network traffic online with high accuracy, but also to retrain the classifier on-the-fly with minimum overhead, thus lowering the barriers that hinder the general adoption of ML-based methods by network operators. Our K-dimensional tree implementation only requires information about the length of the very first packets of a flow. This solution provides network operators with the interesting feature of early classification [2, 3]. That is, it allows them to rapidly classify a flow without having to wait until its end, which is a requirement of most previous traffic classification methods [12, 16, 7]. In order to further increase the accuracy of the method along with its classification speed, we combine the information about the packet sizes with the relevant data still provided by the port numbers [11]. We present an actual implementation of the method based on the Traffic Identification Engine (TIE) [5]. TIE is a community-oriented tool for traffic classification that allows multiple classifiers (implemented as plugins) to run concurrently and produce a combined classification result. Given the low overhead imposed by the training phase of the method and the plugins already provided by TIE to set the ground truth (e.g., L7 plugin), the implementation has the unique feature of continuous training.
This feature allows the system to automatically retrain itself as the training data becomes obsolete. We hope that the large advantages of the method (i.e., accuracy (> 95%), classification speed, early classification and continuous training) can give an incentive to network operators to progressively adopt new and more accurate ML-based methods for traffic classification. The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the ML-based method based on TIE. Section 4 analyzes the performance of the method and presents preliminary results of its continuous training feature. Finally, Section 5 concludes the paper and outlines our future work.
2 Related Work
Traffic classification is a classical research area in network monitoring and several previous works have proposed different solutions to the problem. This section briefly reviews the progress in this research field, particularly focusing on those works that used the Nearest Neighbor algorithm for traffic classification. Originally, the most common and simplest technique to identify network applications was based on the port numbers (e.g., those registered by the IANA [9]).
This solution was very efficient and accurate with traditional applications. However, the arrival of new applications (e.g., P2P) that do not use a pre-defined set of ports, or even use ports registered by other applications, made this solution unreliable for classifying current Internet traffic. Deep packet inspection (DPI) constituted the first serious alternative to the well-known ports technique. DPI methods are based on searching for typical signatures of each application in the packet payloads. Although these techniques can potentially be very accurate, the high resource requirements of pattern matching algorithms and their limitations in the presence of encrypted traffic make their use incompatible with the continuously growing amount of data in current networks. Machine learning (ML) techniques were later proposed as a promising solution to the well-known limitations of port- and DPI-based techniques. ML methods extract knowledge of the characteristic features of the traffic generated by each application from a training set. This knowledge is then used to build a classification model. We refer the interested reader to [13], where an extensive comparative study of existing ML methods for traffic classification is presented. Among the different ML-based techniques existing in the literature, the NN method rapidly became one of the most popular alternatives due to its simplicity and high accuracy. In general, given an instance p, the NN algorithm finds the nearest instance (usually using the Euclidean distance) from a training set of examples. NN is usually generalized to K-NN, where K refers to the number of nearest neighbors to take into account. The NN method for traffic classification was first proposed in [14], where a comparison of the NN technique with the Linear Discriminant Analysis method was presented. They showed that NN was able to classify traffic among 7 different classes with an error rate below 10%.
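The generic (naive) K-NN classification step just described can be sketched in a few lines (a toy illustration with made-up feature vectors and labels, not the implementation evaluated in these works):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, p, k=1):
    """Naive K-NN: scan the whole training set (O(N) per query) and
    return the majority label among the k nearest examples."""
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], p))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy training set: (feature vector, application label)
training = [
    ((64, 1448, 1448), "bulk"),
    ((52, 48, 40), "interactive"),
    ((60, 52, 44), "interactive"),
]
print(knn_classify(training, (55, 50, 42), k=1))  # -> interactive
```

The O(N) scan in `knn_classify` is precisely the cost that the K-dimensional tree implementation discussed next is designed to avoid.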
However, the most interesting conclusions about the NN algorithm are found in the works of Williams et al. [15] and Kim et al. [11]. Both works compared different ML methods and showed the pros and cons of the NN algorithm for traffic classification. In summary, NN was shown to be one of the most accurate ML methods, with the additional feature of requiring zero time to build the classification model. However, NN was the ML-based algorithm with the worst results in terms of classification speed. This is the reason why NN is often discarded for online classification. The efficient implementation of the NN algorithm presented in this paper is based instead on the K-dimensional tree, which solves its problems in terms of classification speed, while keeping very high accuracy. Another important feature of the method is its ability to classify network traffic early. This idea is borrowed from the work of Bernaille et al. [2, 3]. This early classification feature allows the method to classify the network traffic by using just the first packets of each flow. Bernaille et al. compared three different unsupervised ML methods (K-Means, GMM and HMM), while in this work we apply this idea to a supervised ML method (NN). As ML-based methods for traffic classification become more popular, new techniques appear in order to evade classification. These techniques, such as
protocol obfuscation, modify the value of the features commonly used by the traffic classification methods (e.g., by simulating the behavior of other applications or padding packets). Several alternative techniques have also been proposed to avoid some of these limitations. BLINC [10] is arguably the most well-known exponent of this alternative branch. Most of these methods base their identification on the behavior of the end-hosts and, therefore, their accuracy is strongly dependent on the network viewpoint where the technique is deployed [11].
3 Methodology
This section describes the ML-based classification method based on multiple K-dimensional trees, together with its continuous training system. We also introduce TIE, the traffic classification system we use to implement our technique, and the modifications made to it in order to allow the method to continuously retrain itself.

3.1 Traffic Identification Engine
TIE [5] is a modular traffic classification engine developed by the Università di Napoli Federico II. This tool is designed to allow multiple classifiers (implemented as plugins) to run concurrently and produce a combined classification result. In this work, we implement the traffic classification method as a TIE plugin. TIE is divided into independent modules that are in charge of the different classification tasks. The first module, Packet Filter, uses the Libpcap library to collect the network traffic. This module can also filter the packets according to BPF or user-level filters (e.g., skip the first n packets, check header integrity or discard packets in a time range). The second module, Session Builder, aggregates packets into flows (i.e., unidirectional flows identified by the classic 5-tuple), biflows (i.e., both directions of the traffic) or host sessions (aggregation of all the traffic of a host). The Feature Extractor module calculates the features needed by the classification plugins. There is a single module for feature extraction in order to avoid redundant calculations for different plugins. TIE provides a multi-classifier engine divided into a Decision Combiner module and a set of classification plugins. On the one hand, the Decision Combiner is in charge of calling several classification plugins when their features are available. On the other hand, this module merges the results obtained from the different classification plugins into a definitive classification result. In order to allow comparisons between different methods, the Output module provides the classification results from the Decision Combiner based on a set of applications and groups of applications defined by the user. TIE supports three different operating modes. The offline mode generates the classification results at the end of the TIE execution.
The real-time mode outputs the classification results as soon as possible, while the cycling mode is a hybrid mode that outputs the classification results every n minutes.
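The module chain just described (Packet Filter, Session Builder, Feature Extractor, classification plugins, Decision Combiner) can be sketched as follows. This is a hypothetical Python rendering for illustration only: the real engine is written in C and its interfaces differ, and the simplified `flow_id` key stands in for the 5-tuple.

```python
def packet_filter(packets, keep=lambda p: True):
    """Packet Filter: drop packets that do not match the (BPF/user-level) filter."""
    return [p for p in packets if keep(p)]

def session_builder(packets):
    """Session Builder: aggregate packets into flows keyed by the 5-tuple
    (collapsed here into a single 'flow_id' field for brevity)."""
    flows = {}
    for p in packets:
        flows.setdefault(p["flow_id"], []).append(p)
    return flows

def feature_extractor(flow_packets, n=7):
    """Feature Extractor: compute the features once, shared by every plugin."""
    return {"pkt_sizes": [p["size"] for p in flow_packets[:n]],
            "dst_port": flow_packets[0]["dst_port"]}

def decision_combiner(features, plugins):
    """Decision Combiner: merge per-plugin verdicts (here, simply the first
    plugin that returns a definite answer wins)."""
    for plugin in plugins:
        label = plugin(features)
        if label is not None:
            return label
    return "unknown"
```

A plugin in this sketch is just a callable from features to a label (or `None`), which mirrors the idea that classifiers are pluggable and share the extracted features.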
3.2 KD-Tree Plugin
In order to evaluate the traffic classification method, while providing a ready-to-use tool for network operators, we implement the K-dimensional tree technique as a TIE plugin. Before describing the details of this new plugin, we introduce the K-dimensional tree technique. In particular, we focus on the major differences with the original NN search algorithm. The K-dimensional tree is a data structure to efficiently implement the Nearest Neighbor search algorithm. It represents a set of N points in a K-dimensional space, as described by Friedman et al. [8] and Bentley [1]. In the naive NN technique, the set of points is represented as a set of vectors, where each position of a vector represents a coordinate of a point (i.e., a feature). Besides these data, the K-dimensional tree implementation also creates a binary tree that recursively takes the median point of the set of points, leaving half of the points on each side. The original NN algorithm iteratively searches for the nearest point i, from a set of points E, to a point p. In order to find the point i, it computes, for each point in E, its distance (e.g., Euclidean or Manhattan distance) to the point p. Likewise, if we are performing a K-NN search, the algorithm looks for the K points nearest to the point p. This search has O(N) time complexity and becomes impractical with the amount of traffic found in current networks. On the contrary, the search in a K-dimensional tree finds the nearest point in O(log N) on average, with the additional one-time cost of O(N log N) to build the binary tree. Besides this notable improvement, the structure also supports approximate searches, which can substantially improve the classification time at the cost of a very small error. The K-dimensional tree plugin that we implement in TIE is a combination of the K-dimensional tree implementation provided by the C++ ANN library and a structure to represent the relevant information still provided by the port numbers.
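Before turning to the per-port organization, the two core k-d tree operations (median-split construction and exact NN search with subtree pruning) can be sketched as follows. This is an illustrative Python version; the plugin itself relies on the C++ ANN library.

```python
import math

def dist(a, b):
    # Euclidean distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_kdtree(points, depth=0):
    """Recursively split on the median along one axis per level (O(N log N))."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, depth=0, best=None):
    """Exact NN search, O(log N) on average: descend toward the target first,
    then visit the far subtree only if it could hold a closer point."""
    if node is None:
        return best
    if best is None or dist(node["point"], target) < dist(best, target):
        best = node["point"]
    axis = depth % len(target)
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, depth + 1, best)
    if abs(diff) < dist(best, target):  # splitting plane closer than best found
        best = nearest(far, target, depth + 1, best)
    return best
```

The pruning test in the last `if` is what turns the naive O(N) scan into an O(log N) average-case search; an approximate search (as supported by ANN) would simply relax that test by a tolerance factor.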
In particular, we create an independent K-dimensional tree for each relevant port. We define relevant ports as those that carry the most traffic. Although the list of relevant ports can be computed automatically, we also provide the user with the option of manually configuring this list. Another configuration parameter is the approximation value, which allows the method to improve its classification speed by performing an approximate NN search. In the evaluation, we set this parameter to 0, which means that the approximation feature is not used. However, higher values of this parameter could substantially improve the classification time in critical scenarios, while still obtaining a reasonable accuracy. Unlike the original NN algorithm, the proposed method requires a lightweight training phase to build the K-dimensional tree structure. Before building the data structure, a sanitation process is performed on the training data. This procedure removes the instances labeled as unknown from the training dataset, under the assumption that they have characteristics similar to those of other, known flows. This assumption is similar to that of ML clustering methods, where unlabeled instances are classified according to their proximity in the feature space to those that are known. The sanitation process also removes repeated or indistinguishable instances.
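The per-port organization can be sketched as follows. This is a simplified stand-in where a linear NN scan replaces each per-port k-d tree (to keep the sketch short); the class name is hypothetical, and the zero-padding of short flows mirrors the null-coordinate handling described later in this section.

```python
import math

def pad(sizes, n):
    """Fill flows shorter than n packets with null coordinates (zeros)."""
    return tuple(sizes[:n]) + (0,) * max(0, n - len(sizes))

class PerPortNN:
    """One search structure per relevant destination port, plus a default one
    shared by every remaining port, as in the multiple-tree design."""

    def __init__(self, relevant_ports, n=7):
        self.n = n
        self.trees = {port: [] for port in relevant_ports}
        self.default = []  # catch-all structure for non-relevant ports

    def train(self, dst_port, pkt_sizes, label):
        bucket = self.trees.get(dst_port, self.default)
        bucket.append((pad(pkt_sizes, self.n), label))

    def classify(self, dst_port, pkt_sizes):
        bucket = self.trees.get(dst_port, self.default)
        if not bucket:
            return "unknown"
        q = pad(pkt_sizes, self.n)
        # Linear scan here; the real plugin queries a per-port k-d tree.
        return min(bucket, key=lambda ex: math.dist(ex[0], q))[1]
```

Splitting the training data by port keeps each per-port structure small (faster queries) and lets the port number contribute discriminative information without being mixed into the distance computation.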
The traffic features used by our plugin are the destination port number and the length of the first n packets of a flow (without considering the TCP handshake). By using only the first n packets, the plugin can classify flows very fast, providing the network operator with the possibility of quickly reacting to the classification results. In order to accurately classify short flows, the training phase also accepts flows with fewer than n packets by filling the empty positions with null coordinates.

3.3 Continuous Training System
In this section, we show the interaction of our KD-Tree plugin with the rest of the TIE architecture, and describe the modifications made to TIE to allow our plugin to continuously retrain itself. Figure 1 shows the data flow of our continuous training system based on TIE. The first three modules are used without any modification, as found in the original version of TIE. Besides the implementation of the new KD-Tree plugin, we significantly modified the Decision Combiner module and the L7 plugin. Our continuous training system follows the original TIE operation mode most of the time. Every packet is aggregated into bidirectional flows while its features are calculated. When the traffic features of a flow (i.e., the first n packet sizes) are available, or upon its expiration, the flow is classified by the KD-Tree plugin. Although the method was tested with bidirectional flows, the current implementation also supports the classification of unidirectional flows. In order to automatically retrain our plugin as the training data becomes obsolete, we need a technique to set the ground truth. TIE already provides the L7 plugin, which implements a DPI technique originally used by TIE for validation purposes. We modified the implementation of this plugin to continuously produce training data (which includes the flow labels, i.e., the ground truth, obtained by L7) for future trainings. While every flow is sent to the KD-Tree plugin through the main path, the Decision Combiner module applies flow sampling to the traffic, which is sent through a secondary path to the L7 plugin. This secondary path is used to (i) set the ground truth for the continuous training system, (ii) continuously check the accuracy of the KD-Tree plugin by comparing its output with that of L7, and (iii) keep the required computational power low by using flow sampling (performing DPI on every single flow would significantly decrease the performance of TIE).
The Decision Combiner module is also in charge of automatically triggering the training of the KD-Tree plugin according to three different events that can be configured by the user: after p packets, after s seconds, or if the accuracy of the plugin compared to the L7 output is below a certain threshold t. The flows classified by the L7 plugin, together with their features (i.e., destination port, n packet sizes, L7 label), are placed in a queue. This queue keeps the last f classified flows or the flows classified during the last s seconds. The training module of the KD-Tree plugin is executed in a separate thread. This way, the KD-Tree plugin can continuously classify the incoming flows without interruption, while it is periodically updated. The training module builds a
completely new multi K-dimensional tree model using the information available in the queue. As future work, we plan to study the alternative solution of incrementally updating the old model with the new information, instead of creating a new model from scratch. In addition, it is possible to automatically update the list of relevant ports by using the training data as a reference.

Fig. 1. Diagram of the Continuous Training Traffic Classification system based on TIE
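The secondary ground-truth path and its retraining triggers can be sketched as follows. This is an illustrative skeleton: the sampling rate, queue length f, and accuracy threshold t mirror the configurable parameters described above, while the `dpi_label` callback stands in for the L7 plugin; class and method names are made up.

```python
import random
from collections import deque

class ContinuousTrainer:
    """Sample flows, label them with a DPI oracle, keep the last f labeled
    flows for the next training, and signal a retrain when the observed
    accuracy of the online classifier drops below a threshold."""

    def __init__(self, dpi_label, sampling_rate=0.1, f=1000, threshold=0.9, seed=None):
        self.dpi_label = dpi_label          # stand-in for the L7 plugin
        self.rate = sampling_rate
        self.queue = deque(maxlen=f)        # last f ground-truth flows
        self.threshold = threshold
        self.hits = 0
        self.seen = 0
        self.rng = random.Random(seed)

    def observe(self, flow_features, predicted_label):
        """Call for every classified flow; returns True when a retrain
        should be triggered on the queued ground-truth flows."""
        if self.rng.random() >= self.rate:  # flow not sampled for DPI
            return False
        truth = self.dpi_label(flow_features)
        self.queue.append((flow_features, truth))
        self.seen += 1
        self.hits += (predicted_label == truth)
        accuracy = self.hits / self.seen
        # require a minimum number of samples before trusting the estimate
        return self.seen >= 50 and accuracy < self.threshold
```

A time- or packet-count-based trigger (the other two events described above) would simply add a clock or counter check alongside the accuracy test; the retrain itself would rebuild the classification model from `self.queue` in a separate thread, as in the actual system.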
4 Results
This section presents a performance evaluation of the proposed technique. First, Subsection 4.1 describes the dataset used in the evaluation. Subsection 4.2 compares the original Nearest Neighbor algorithm with the K-dimensional tree implementation. Subsection 4.3 presents a performance evaluation of the proposed plugin described in Subsection 3.2 and evaluates different aspects of the technique, such as the relevant ports or the number of packet sizes used for the classification. Finally, Subsection 4.4 presents a preliminary study of the impact of the continuous training system on the traffic classification.

4.1 Evaluation Datasets
The dataset used in our performance evaluation consists of 8 full-payload traces collected at the Gigabit access link of the Universitat Politècnica de Catalunya (UPC), which connects about 25 faculties and 40 departments (geographically distributed over 10 campuses) to the Internet through the Spanish Research and Education network (RedIRIS). This link provides Internet access to about 50,000 users. The traces were collected on different days and at different hours, trying to cover traffic as diverse as possible from different applications. Due to privacy issues, we are not able to publish our traces. However, we have made them accessible using the CoMo-UPC model presented in [4]. Table 1 presents the details of the traces used in the evaluation. In order to evaluate the proposed method, we used the first seven traces. Among those traces, we selected a single trace (UPC-II) as the training dataset, as it is the trace that contains the highest diversity in terms of instances from different applications. We limited our training set to one trace in order to leave a meaningful
Table 1. Characteristics of the traffic traces in our dataset

Name      Date      Day  Start Time  Duration  Packets  Bytes  Valid Flows  Avg. Util
UPC-I     11-12-08  Thu  10:00       15 min    95 M     53 G   1936 K       482 Mbps
UPC-II    11-12-08  Thu  12:00       15 min    114 M    63 G   2047 K       573 Mbps
UPC-III   12-12-08  Fri  01:00       15 min    69 M     38 G   1419 K       345 Mbps
UPC-IV    12-12-08  Fri  16:00       15 min    102 M    55 G   2176 K       500 Mbps
UPC-V     14-12-08  Sun  00:00       15 min    53 M     29 G   1346 K       263 Mbps
UPC-VI    21-12-08  Sun  12:00       1 h       175 M    133 G  3793 K       302 Mbps
UPC-VII   22-12-08  Mon  12:30      1 h       345 M    256 G  6684 K       582 Mbps
UPC-VIII  10-03-09  Tue  03:00       1 h       114 M    78 G   3711 K       177 Mbps
number of traces for the evaluation that are not used to build the classification model. Therefore, the remaining traces were used as the validation dataset. The last trace, UPC-VIII, was recorded four months after the trace UPC-II. Given this time difference, we used this trace to perform a preliminary experiment to evaluate the gain provided by our continuous training solution.

4.2 Nearest Neighbor vs. K-Dimensional Tree
Section 3.2 already discussed the main advantages of the K-dimensional tree technique compared to the original Nearest Neighbor algorithm. In order to present numerical results showing this gain, we perform a comparison between both methods. We evaluate the method presented in this paper against the original NN search implemented for validation purposes by the ANN library. Given that the ANN library implements both methods in the same structure, we calculated the theoretical minimum memory resources necessary for the naive NN technique (i.e., # unique examples * # packet sizes * 4 bytes (C++ integer)). We tested both methods with the trace UPC-II (i.e., ≈500,000 flows after the sanitation process) using a 3 GHz machine with 4 GB of RAM. It is important to note that, since we are performing an offline evaluation, we do not approximate the NN search in either the original NN algorithm or the K-dimensional tree technique. For this reason, the accuracy of both methods is the same. Table 2 summarizes the improvements obtained with the combination of the K-dimensional tree technique and the information from the port numbers. Results are shown in terms of classifications per second, depending on the number of packets needed for the classification and the list of relevant ports. There are three possible lists of relevant ports: the unique list, where there are no relevant ports and all the instances belong to the same K-dimensional tree or NN structure; the selected list, which is composed of the set of ports that carry most of the traffic in the UPC-II trace (i.e., ports that receive more than 0.05% of the traffic; 69 ports in the UPC-II trace); and the all list, where all ports found in the UPC-II trace are considered relevant. The first column corresponds to the original NN presented in previous works [11, 14, 15], where
Table 2. Speed comparison (flows/s): Nearest Neighbor vs. K-dimensional tree

             Naive Nearest Neighbor             K-Dimensional Tree
Packet Size  Unique  Selected Ports  All Ports  Unique  Selected Ports  All Ports
1            45578   104167          185874     423729  328947          276243
5            540     2392            4333       58617   77280           159744
7            194     1007            1450       22095   34674           122249
10           111     538             796        1928    4698            48828
Table 3. Memory comparison: Nearest Neighbor vs. K-dimensional tree

             Naive Nearest  K-Dimensional Tree
Packet Size  Neighbor       Unique    Selected Ports  All Ports
1            2.15 MB        40.65 MB  40.69 MB        40.72 MB
5            10.75 MB       52.44 MB  52.63 MB        53.04 MB
7            15.04 MB       56.00 MB  56.22 MB        57.39 MB
10           21.49 MB       68.29 MB  68.56 MB        70.50 MB
all the information is maintained in a single structure. When only one packet is required, the proposed method is ten times faster than the original NN. However, the speed of the original method dramatically decreases as the number of required packets increases, becoming even a hundred times slower than the K-dimensional tree technique. In almost all situations, the introduction of the list of relevant ports substantially increases the classification speed of both methods. Tables 3 and 4 show the extremely low price that the K-dimensional tree technique pays for a notable improvement in classification speed. The results show that the memory resources required by the method, although higher than those of the naive NN technique, are modest. The memory used by the K-dimensional tree is almost independent of the relevant ports parameter and barely affected by the number of packet sizes. Regarding time, we believe that the cost of the training phase is well compensated by the ability to use the method as an online classifier. In the worst case, the method only takes about 20 seconds for the building phase. Since both methods output the same classification results, the data presented in this subsection show that the combination of the relevant ports and the K-dimensional tree technique significantly improves the original NN search, with the only drawback of a (very fast) training phase. This improvement allows us to use this method as an efficient online traffic classifier.

4.3 K-Dimensional Tree Plugin Evaluation
In this section we study the accuracy of the method depending on the different parameters of the KD-Tree plugin. Figure 2(a) presents the accuracy according to the number of packet sizes for the different traces of the dataset. In this case,
Table 4. Building time comparison: Nearest Neighbor vs. K-dimensional tree

             Naive Nearest  K-Dimensional Tree
Packet Size  Neighbor       Unique   Selected Ports  All Ports
1            0 s            13.01 s  12.72 s         12.52 s
5            0 s            16.45 s  16.73 s         15.62 s
7            0 s            17.34 s  16.74 s         16.07 s
10           0 s            19.81 s  19.59 s         18.82 s

Fig. 2. K-dimensional tree evaluation without the support of the relevant ports: (a) accuracy (by flow) without relevant ports support; (b) first packet size distribution in the training trace UPC-II
no information from the relevant ports is taken into account, producing a single K-dimensional tree. With this variation, using only the first two packets, we achieve an accuracy of almost 90%. The accuracy increases with the number of packet sizes until a stable accuracy > 95% is reached with seven packet sizes. In order to show the impact of using the list of relevant ports in the classification, in Figure 2(b) we show the distribution of the first packet sizes for the training trace UPC-II. Although some portions of the distribution are dominated by a single group of applications, most applications have their first packet sizes between 0 and 300 bytes. This collision explains the poor accuracy observed in the previous figure with only one packet. The second parameter of the method, the relevant ports, besides improving the classification speed, appears to alleviate this situation. Figure 3(a) presents the accuracy of the method by number of packets using the set of relevant ports that contains most of the traffic in UPC-II. With the help of the relevant ports, the method achieves an accuracy > 90% using only the first packet size, reaching a stable accuracy of 97% with seven packets. Figure 3(b) presents the accuracy of the method depending on the set of relevant ports with seven packet sizes. We choose seven because, as can be seen in Figures 2(a) and 3(a), increasing the number of packet sizes beyond seven does not improve accuracy but does decrease classification speed. Using all the ports of the training trace UPC-II, the method achieves the highest accuracy on the same trace. However, with the rest of the traces the accuracy substantially decreases, although it always remains above 85%. The reason for this decrease is that using all the ports as relevant ports is very dependent on the scenario and could
present classification inaccuracies with new instances belonging to ports not represented in the training data. Furthermore, the classification accuracy also decreases because it produces fragmentation in the classification model for those applications that use multiple or dynamic ports (i.e., their information is spread among different K-dimensional trees). However, the figure shows that using a set of relevant ports (in our case, the ports that receive more than 0.05% of the traffic), besides increasing the classification speed, also improves accuracy.

Fig. 3. K-dimensional tree evaluation with the support of the relevant ports: (a) accuracy (by flow) with relevant ports support, by number of packet sizes; (b) accuracy (by flow, n=7) by set of relevant ports

Erman et al. pointed out in [6] a common situation among ML techniques: the accuracy measured by flows is much higher than the accuracy measured by bytes or packets. This usually happens because some elephant flows are not correctly classified. Figures 4(a) and 4(b) present the classification results of the method considering also the accuracy by bytes and packets. They show that, unlike other ML solutions, the method is able to keep high accuracy values even with such metrics. This is because the method is very accurate with the P2P and WEB application groups, which represent most of the traffic in our traces in terms of bytes. Finally, we also study the accuracy of the method broken down by application group. In our evaluation we use the same application groups as in TIE. Figure 5 shows that the method is able to classify the most popular groups of applications with excellent accuracy. However, the accuracy for application groups that are not very common substantially decreases. These accuracies have a very low impact on the final accuracy of the method, given that the representation of these groups in the traces used is almost negligible. A possible solution to improve the accuracy for these groups of applications could be the addition of artificial instances of these groups to the training data. Another potential problem is the disguised use of ports by some applications.
Although we have not evaluated this impact in detail, the results show that we can currently still achieve an additional gain in accuracy by considering the port numbers. We have also checked the accuracy by application group with a single K-dimensional tree, and we found that it was always below the results shown in Figure 5. We omit the figure in the interest of brevity. In conclusion, we presented a set of results showing how the K-dimensional tree technique, combined with the still useful information provided by the ports, improves almost all aspects of previous methods based on the NN search.
Fig. 4. K-dimensional tree evaluation with the support of the relevant ports: (a) accuracy (by packet, n=7) by set of relevant ports; (b) accuracy (by byte, n=7) by set of relevant ports

Fig. 5. Accuracy by application group (n=7 and selected list of ports as parameters)
Its only drawback being the need for a (short) training phase, the method is able to perform online classification with very high accuracy: above 90% with only one packet, or above 97% with seven packets.
4.4 Continuous Training System Evaluation
This section presents a preliminary study of the impact of our continuous training traffic classifier. Due to the lack of traces spanning a very long period of time, and because of the intrinsic difficulties in processing such large traces, we simulate a scenario in which the features of the traffic evolve by concatenating the UPC-II and UPC-VIII traces. The UPC-VIII trace, besides belonging to a different time of day, was recorded four months later than UPC-II, which suggests a different traffic mix with different properties. Fixing the number of packet sizes at seven, the results in Table 5 confirm our intuition. On the one hand, using the trace UPC-II as training data to classify the trace UPC-VIII, we obtain an accuracy of almost 85%. On the other hand, after detecting such a decrease in accuracy and retraining the system, we obtain an impressive accuracy of 98.17%.
K-Dimensional Trees for Continuous Traffic Classification
Table 5. Evaluation of the Continuous Training system by training trace and set of relevant ports

Training Trace       UPC-II              First 15 min. UPC-VIII
Relevant Port List   UPC-II    UPC-VIII  UPC-II    UPC-VIII
Accuracy             84.20 %   76.10 %   98.17 %   98.33 %
This result shows the importance of the continuous training feature to maintain a high classification accuracy. Since this preliminary study was performed with traffic traces instead of a live traffic stream, we used the first fifteen minutes of the UPC-VIII trace as the queue length parameter (s) of the retraining process. The results of a second experiment are also presented in Table 5. Instead of retraining the system with new training data, we study whether modifying the list of relevant ports alone is enough to recover the original accuracy. The results show that this solution does not bring any improvement when applied alone. However, the optimal solution is obtained when both the training data and the list of relevant ports are updated and the system is then retrained.
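The retraining loop described above can be sketched as follows. The class name, the accuracy-monitoring window and the 90% threshold are illustrative assumptions for the sketch, not parameters of the TIE-based implementation:

```python
from collections import deque

class ContinuousTrainer:
    """Sketch of a continuous training loop: keep the last s labeled flows
    in a queue, monitor a running accuracy estimate on sampled ground
    truth, and rebuild the model when accuracy drops below a threshold."""

    def __init__(self, build_model, s=10000, threshold=0.90, window=1000):
        self.build_model = build_model      # e.g., trains per-port k-d trees
        self.queue = deque(maxlen=s)        # candidate training data (last s flows)
        self.recent = deque(maxlen=window)  # 1/0 hits on recent labeled flows
        self.threshold = threshold
        self.model = None

    def observe(self, features, true_label):
        if self.model is not None:
            self.recent.append(1 if self.model(features) == true_label else 0)
        self.queue.append((features, true_label))
        acc = sum(self.recent) / len(self.recent) if self.recent else 1.0
        # Retrain when no model exists yet, or when accuracy over a full
        # window has decayed below the threshold.
        if self.model is None or (len(self.recent) == self.recent.maxlen
                                  and acc < self.threshold):
            self.model = self.build_model(list(self.queue))
            self.recent.clear()

# Toy model builder: predict the majority label of the training queue.
def majority(data):
    labels = [label for _, label in data]
    top = max(set(labels), key=labels.count)
    return lambda features: top

trainer = ContinuousTrainer(majority, s=500, threshold=0.9, window=100)
for _ in range(600):
    trainer.observe((1,), "A")    # stable period: model learns "A"
for _ in range(600):
    trainer.observe((2,), "B")    # traffic drifts: accuracy drops, retrains
print(trainer.model((2,)))
```

After the simulated drift, the monitor detects the accuracy drop and rebuilds the model from the queue, which by then is dominated by the new traffic.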
5 Conclusions and Future Work
In this paper, we revisited the viability of using the Nearest Neighbor (NN) algorithm for online traffic classification, which has often been discarded in previous studies due to its poor classification speed. In order to address this well-known limitation, we presented an efficient implementation of the NN algorithm based on a K-dimensional tree data structure, which can be used for online traffic classification with high accuracy and low overhead. In addition, we combined this technique with the relevant information still provided by the port numbers, which further increases its classification speed and accuracy. Our results show that the method can achieve very high accuracy (> 90%) by looking only at the first packet of a flow. When the number of analyzed packets is increased to seven, the accuracy of the method increases beyond 95%. This early classification feature is very important, since it allows network operators to quickly react to the classification results. We presented an actual implementation of the traffic classification method based on the TIE classification engine. The main novelty of the implementation is its continuous training feature, which allows the system to retrain itself automatically as the training data becomes obsolete. Our preliminary evaluation of this unique feature shows very encouraging results. As future work, we plan to perform a more extensive performance evaluation of our continuous training system with long-term executions, in order to show the large advantages of keeping the classification method continuously updated without requiring human supervision.
Acknowledgments. This work was done under the framework of the COST Action IC0703 “Data Traffic Monitoring and Analysis (TMA)” and with the support of the Comissionat per a Universitats i Recerca del DIUE from the Generalitat de Catalunya. The authors thank UPCnet for the traffic traces provided for this study, and the anonymous reviewers for their useful comments.
References

1. Bentley, J.L.: K-d trees for semidynamic point sets, pp. 187–197 (1990)
2. Bernaille, L., Teixeira, R., Salamatian, K.: Early application identification. In: Proc. of ACM CoNEXT (2006)
3. Bernaille, L., et al.: Traffic classification on the fly. ACM SIGCOMM Comput. Commun. Rev. 36(2) (2006)
4. CoMo-UPC data sharing model, http://monitoring.ccaba.upc.edu/como-upc/
5. Dainotti, A., et al.: TIE: a community-oriented traffic classification platform. In: Proceedings of the First International Workshop on Traffic Monitoring and Analysis, p. 74 (2009)
6. Erman, J., Mahanti, A., Arlitt, M.: Byte me: a case for byte accuracy in traffic classification. In: Proc. of ACM SIGMETRICS MineNet (2007)
7. Erman, J., et al.: Identifying and discriminating between web and peer-to-peer traffic in the network core. In: Proc. of WWW Conf. (2007)
8. Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)
9. Internet Assigned Numbers Authority (IANA): as of August 12 (2008), http://www.iana.org/assignments/port-numbers
10. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: multilevel traffic classification in the dark. In: Proc. of ACM SIGCOMM (2005)
11. Kim, H., et al.: Internet traffic classification demystified: myths, caveats, and the best practices. In: Proc. of ACM CoNEXT (2008)
12. Moore, A., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In: Proc. of ACM SIGMETRICS (2005)
13. Nguyen, T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys and Tutorials 10(4) (2008)
14. Roughan, M., et al.: Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification. In: Proc. of ACM SIGCOMM IMC (2004)
15. Williams, N., Zander, S., Armitage, G.: Evaluating machine learning algorithms for automated network application identification. CAIA Tech. Rep. (2006)
16. Zander, S., Nguyen, T., Armitage, G.: Automated traffic classification and application identification using machine learning. In: Proc. of IEEE LCN Conf. (2005)
Validation and Improvement of the Lossy Difference Aggregator to Measure Packet Delays

Josep Sanjuàs-Cuxart, Pere Barlet-Ros, and Josep Solé-Pareta
Department of Computer Architecture, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
{jsanjuas,pbarlet,pareta}@ac.upc.edu
Abstract. One-way packet delay is an important network performance metric. Recently, a new data structure called Lossy Difference Aggregator (LDA) has been proposed to estimate this metric more efficiently than with the classical approaches of sending individual packet timestamps or probe traffic. This work presents an independent validation of the LDA algorithm and provides an improved analysis that results in a 20% increase in the number of packet delay samples collected by the algorithm. We also extend the analysis by relating the number of collected samples to the accuracy of the LDA and provide additional insight on how to parametrize it. Finally, we extend the algorithm to overcome some of its practical limitations and validate our analysis using real network traffic.
1 Introduction
Packet delay is one of the main indicators of network performance, together with throughput, jitter and packet loss. This metric is becoming increasingly important with the rise of applications like voice-over-IP, video conferencing or online gaming. Moreover, in certain environments it is an extremely critical network performance metric; for example, in high-performance computing or automated trading, networks are expected to provide latencies on the order of a few microseconds [1]. Two main approaches have been used to measure packet delays. Active schemes send traffic probes between two nodes in the network and use inference techniques to determine the state of the network (e.g., [2,3,4,5]). Passive schemes are, instead, based on traffic analysis at two points of a network. They are, in principle, less intrusive to the network under study, since they do not inject probes. However, they have often been disregarded, since they require collecting, transmitting and comparing packet timestamps at both network measurement points, thus incurring large overheads in practice [6]. For example, [7] proposes delaying computations to periods of low network utilization if measurement information has to be transmitted over the network under study. The Lossy Difference Aggregator (LDA) is a data structure that has been recently proposed in [1] to enable fine-grained measurement of one-way packet delays using a passive measurement approach with low overhead. The data structure

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 155–170, 2010.
© Springer-Verlag Berlin Heidelberg 2010
J. Sanjuàs-Cuxart, P. Barlet-Ros, and J. Solé-Pareta
is extremely lightweight in comparison with the traditional approaches, and can collect a number of samples that easily outperforms active measurement techniques, where traffic probes interfering with the normal operation of a network can be a concern. The main intuition behind this measurement technique is to sum all packet timestamps in the first and second measurement nodes, and infer the average packet delay by subtracting these values and dividing over the total number of packets. The LDA, though, maintains several separate counters and uses coordinated, hash-based traffic sampling [8] in order to protect against packet losses, which would invalidate the intuitive approach. The complete scheme is presented in Sect. 2. This work constitutes an independent validation of the work presented in [1]. Section 3 revisits the analysis of the algorithm. In particular, Sect. 3.1 investigates the effective number of samples that the algorithm can collect under certain packet loss ratios. This work improves the original analysis, and finds that doubling the sampling rates suggested in [1] maximizes the expectation of the number of samples collected by the algorithm. In Sect. 3.2, we contribute an analysis that relates the effective sample size with the accuracy that the method can obtain, while Sect. 3.3 compares the network overhead of the LDA with pre-existing techniques. For the case when packet loss ratios are unknown, the original work proposed and compared three reference configurations of the LDA in multiple memory banks to target a range of loss ratios. In Sects. 3.4 and 3.5 we extend our improved analysis to the case of unknown packet loss, and we (i) find that such reference configurations are almost equivalent in practice, and (ii) provide improved guidelines on how to dimension the multi-bank LDA. Sect. 4 validates our analysis through simulations, with similar parameters to [1], for the sake of comparability. Finally, in Sect. 
5 we deploy the LDA on a real network scenario. The deployment of the LDA in a real setting presents a series of challenges that stem from the assumptions behind the algorithm as presented in [1]. We propose a simple extension of the algorithm that overcomes some of the practical limitations of the original proposal. At the time of this writing, another analysis of the Lossy Difference Aggregator already exists in the form of a public draft [9]. The authors provide a parallel analysis of the expectation for the sample size collected by the LDA and, coherently with ours, suggest doubling the sampling rates compared to [1]. For the case where packet loss ratios are unknown beforehand, their analysis studies how to dimension the multi-bank LDA to maximize the expectation for the sample size. Optimal sampling rates are determined that maximize sample sizes for tight ranges of expected packet loss ratios. Our analysis differs in that we relate sample size with accuracy, and focus on maximizing accuracy rather than sample size. Additionally, our study includes an analytic overhead comparison with traditional techniques, presents the first real world deployment of the LDA and proposes a simple extension to overcome some of its practical limitations.
2 Background
The Lossy Difference Aggregator (LDA) [1] is a data structure that can be used to calculate the average one-way packet delay between two network points, as well as its standard deviation. We refer to these points as the sender and the receiver, but they need not be the source or the destination of the packets being transmitted; they are merely two network viewpoints along their path. The LDA operates under three assumptions. First, packets are transmitted strictly in FIFO order. Second, the clocks of the sender and the receiver are synchronized. Third, the set of packets observed by the receiver is identical to the one observed by the sender, or a subset of it when there is packet loss. That is, the original packets are not diverted, and no extra traffic is introduced that reaches the receiver. A classic algorithm to calculate the average packet delays in such a scenario would proceed as follows. In both the sender and the receiver, the timestamps of the packets are recorded. After a certain measurement interval, the recorded packet timestamps (or, possibly, a subset of them) are transmitted from the sender to the receiver, which can then compare the timestamps and compute the average delay. Such an approach is impractical, since it involves storing and transmitting large amounts of information. The basic idea behind the LDA is to maintain a pair of accumulators that sum all packet timestamps in the sender and the receiver separately, as well as the total count of packets. When the measurement interval ends, the sender transmits the value of its accumulator to the receiver, which can then compute the average packet delay by subtracting the values and dividing over the total number of packets. The LDA requires the set of packets processed by the sender and the receiver to be identical, since the total packet counts in the sender and the receiver must agree. Thus, it is extremely sensitive to packet loss.
In order to protect against it, the LDA partitions the traffic into b separate streams, and aggregates timestamps for each one separately in both the sender and the receiver. Additionally, for each of the sub-streams, it maintains a packet count. Thus, it can detect packet losses and invalidate the data collected in the corresponding accumulators. When the measurement interval ends, the sender transmits all of the values of the accumulators and counters to the receiver. Then, the receiver discards the accumulators where packet counts disagree, and computes an estimate of the average packet delay using the remainder. Each of the accumulators must aggregate the timestamps from the same set of packets in the sender and the receiver, i.e., both nodes must partition the traffic using the same criteria. In order to achieve this effect, the same pre-arranged, pseudo-random hash function is used in both nodes, and the hash of a packet identifier is used to determine its associated position in the LDA. As packet losses grow high, though, the number of accumulators that are invalidated increases rapidly. As an additional measure against packet loss, the LDA samples the incoming packet stream. In the simplest setting, all of the accumulators apply an equal sampling rate p to the incoming packet stream.
Again, sender and receiver sample incoming packets coordinately using a prearranged pseudo-random hash function [8]. As an added benefit, the LDA data structure can also be mined to estimate the standard deviation of packet delays using a known mathematical trick [10]. We omit this aspect of the LDA in this work, but the improvements we propose will also increase the accuracy in the estimation of the standard deviation of packet delays.
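The scheme described above can be summarized in a short sketch. The hash construction, field sizes and simulation parameters below are illustrative assumptions, not the original implementation:

```python
import random

HASH_MOD = (1 << 61) - 1

def make_hash(seed):
    """Pre-arranged pseudo-random hash, shared by sender and receiver."""
    rnd = random.Random(seed)
    a, b = rnd.randrange(1, HASH_MOD), rnd.randrange(HASH_MOD)
    return lambda pkt_id: (a * pkt_id + b) % HASH_MOD

class LDA:
    """Single-bank Lossy Difference Aggregator: b timestamp accumulators
    plus packet counters, with coordinated hash-based packet sampling."""

    def __init__(self, b, p, seed=7):
        self.b, self.p = b, p
        self.bucket_of = make_hash(seed)     # packet id -> bucket
        self.sample = make_hash(seed + 1)    # packet id -> keep/drop
        self.acc = [0.0] * b                 # timestamp accumulators
        self.cnt = [0] * b                   # packet counters

    def add(self, pkt_id, timestamp):
        if self.sample(pkt_id) < self.p * HASH_MOD:   # coordinated sampling
            i = self.bucket_of(pkt_id) % self.b
            self.acc[i] += timestamp
            self.cnt[i] += 1

def estimate_delay(sender, receiver):
    """Discard buckets whose packet counts disagree; average over the rest."""
    delta = n = 0
    for i in range(sender.b):
        if sender.cnt[i] == receiver.cnt[i] and sender.cnt[i] > 0:
            delta += receiver.acc[i] - sender.acc[i]
            n += sender.cnt[i]
    return delta / n if n else None

# Toy run: constant 5 ms one-way delay, 1% random packet loss.
rnd = random.Random(0)
tx, rx = LDA(b=64, p=1.0), LDA(b=64, p=1.0)
for pkt in range(10000):
    t = pkt * 0.001
    tx.add(pkt, t)
    if rnd.random() > 0.01:        # packet survives the path
        rx.add(pkt, t + 0.005)
print(estimate_delay(tx, rx))
```

Buckets that a lost packet hashed to end up with mismatched counts and are discarded; the surviving buckets still recover the 5 ms average from timestamp sums alone.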
3 Improved Analysis
The LDA is a randomized algorithm that depends on the correct setting of the sampling rates to gather the largest possible number of packet delay samples. The sampling rate p presents a classical tradeoff. The more packets are sampled, the more data the LDA can collect, but the more it is affected by packet loss. Conversely, lower sampling rates provide more protection against loss, but limit the amount of information collected by the accumulators. This section improves the original analysis of the Lossy Difference Aggregator (LDA) introduced in [1] in several ways. First, it improves the analysis of the expected number of packet delay samples the LDA can collect, which leads to the conclusion that sampling rates should be twice the ones proposed in [1]. Second, it relates the number of samples with the accuracy in an easy-to-understand way that makes it obvious that undersampling is preferable to sampling too many packets. Third, it compares the LDA's network overhead with pre-existing passive measurement techniques. Fourth, it provides a better understanding of, and guidelines for, dimensioning multi-bank LDAs.
3.1 Effective Sample Size
In order to protect against packet loss, the LDA uses packet hashes to distribute timestamps across several accumulators, so that losses only invalidate the samples collected by the involved memory positions. Table 1 summarizes the notation used in this section.

Table 1. Notation

variable  name                        variable  name
n         #pkts                       b         #buckets
r         packet loss ratio           μ         average packet delay
p         sampling rate               μ̂         estimate of the avg. delay

Given n packets, b buckets (accumulator-counter pairs) and packet loss probability r, the probability of a bucket staying useful corresponds to the probability that no lost packet hashes to the bucket in the receiver node, which can be computed as (1 − r/b)^n ≈ e^{−n r/b} (according to the law of rare events). Then, the expectation for the number of usable samples, which we call the effective sample size, can be approximated as E[S] ≈ (1 − r) n / e^{n r/b}. In order to provide additional protection against packet losses, the LDA also samples the incoming packets; we can adapt the previous formulation to account for packet sampling as follows:

    E[S] ≈ (1 − r) p n / e^{n r p / b}    (1)

Reference [9] shows that this approximation is extremely accurate for large values of n. The approximation is best as n becomes larger and the probability of sampling a lost packet stays low. Note that this holds in practice; otherwise, the buckets would too often be invalidated. For example, when the absolute number of sampled packet losses is in the order of the number of buckets b, it obtains relative errors around 5 × 10^{−4} for as few as n = 1000 packets. Note however that this formula only accounts for a situation where all buckets use an equal fixed sampling rate p, i.e., a single-bank LDA. Section 3.5 extends this analysis to the multi-bank LDA, while Sect. 4 provides an experimental validation of this formula. Reference [1] provides a less precise approximation for the expected effective sample size. When operating under a sampling rate p = α b/(L + 1), a lower bound E[S] ≥ α (1 − α) R b/(L + 1) is provided, where R corresponds to the number of received packets and L to the number of lost packets; in our notation, R = n (1 − r) and L = n r. Trivially, this bound is maximized when α = 0.5. Therefore, it is concluded that the best sampling rate p that can be chosen is p = 0.5 b/(n r + 1). However, our improved analysis leads to a different value for p by maximizing (1). The optimal sampling rate p that maximizes the effective sample size for any known loss ratio r can be obtained by solving ∂E[S]/∂p = 0, which leads to p = b/(n r) (in practice, we set p = min(b/(n r), 1)). Thus, our analysis approximately doubles the sampling rate compared to [1], i.e., sets α = 1 in their notation, which yields an improvement in the effective sample size of around 20% at no cost. The conclusions of this improved analysis are coherent with the parallel analysis of [9], which also shows that the same conclusions are reached without the approximation in (1). Assuming a known loss ratio and the optimal setting of the sampling rate p = b/(n r), the expectation of the effective sample size is (by substitution of p in (1)):

    E[S_opt] = ((1 − r) / (r e)) b    (2)

In other words, given a known number of incoming packets and a known loss ratio, setting p optimally maximizes the expectation of the sample size at (1 − r)/(r e) samples per bucket. Figure 1 shows how the number of samples that can be collected by the LDA quickly degrades when facing increasing packet loss ratios. Therefore, in a high packet loss ratio scenario, the LDA will require large amounts of memory to preserve the sample size. As an example, in order to sustain the same sample size of a 0.1% loss scenario, the LDA must grow around 50 times
Fig. 1. Expected number of samples collected per bucket under varying packet loss ratios, assuming an ideal LDA that can apply, for each packet loss ratio, the optimal sampling rate
larger under 5% packet loss, and by a factor of around 250 in the case of 20% packet loss. Recall that this analysis assumes that the packet loss ratios are known beforehand, so that the sampling rate can be tuned optimally. When facing unknown loss ratios, the problem becomes harder, since it is not possible to configure p = b/(n r), given that both parameters are unknown. However, this analysis does provide an upper bound on the performance of this algorithm. In any configuration of the LDA, including in multiple banks, the expectation of the effective sample size will be at most ((1 − r)/(r e)) b.
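The closed-form expectation (1) and the improved setting p = b/(n r) can be checked with a small Monte Carlo simulation; the parameters below are illustrative, and bucket selection is modeled as uniform hashing:

```python
import math
import random

def expected_samples(n, r, p, b):
    """E[S] from (1): (1 - r) * p * n * exp(-n * r * p / b)."""
    return (1 - r) * p * n * math.exp(-n * r * p / b)

def simulate(n, r, p, b, runs=100, seed=1):
    """Monte Carlo effective sample size: a bucket that saw a sampled lost
    packet is invalidated; the rest contribute their received counts."""
    rnd = random.Random(seed)
    total = 0
    for _ in range(runs):
        cnt = [0] * b
        bad = [False] * b
        for _ in range(n):
            if rnd.random() >= p:        # not sampled
                continue
            i = rnd.randrange(b)         # hash to a bucket
            if rnd.random() < r:
                bad[i] = True            # a sampled loss kills the bucket
            else:
                cnt[i] += 1
        total += sum(c for c, invalid in zip(cnt, bad) if not invalid)
    return total / runs

n, r, b = 20000, 0.05, 256
p_opt = min(b / (n * r), 1.0)            # improved setting p = b/(n r)
print(expected_samples(n, r, p_opt, b))  # equals (1 - r)/(r e) * b
print(simulate(n, r, p_opt, b))          # close to the prediction
```

At the optimum, each bucket survives with probability e^{-1}, which is where the (1 − r)/(r e) samples-per-bucket figure comes from.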
3.2 Accuracy
It is apparent from the previous subsection that increasing packet loss ratios have a severe impact on the effective sample size that the LDA can obtain. However, the LDA is empirically shown in [1] to obtain reasonable accuracy up to 20% packet loss. How can we reconcile these two facts? The resolution of this apparent contradiction lies in the fact that the accuracy of the LDA does not depend linearly on the sample size; instead, the accuracy gains from larger sample sizes are small. The LDA algorithm estimates the average delay μ from a sample of the overall population of packet delays. According to the central limit theorem, the sample mean is a random variable that converges to a normal distribution as the sample size (S in our notation) grows [11]. The rate of convergence towards normality depends on the distribution of the sampled random variable (in this case, packet delays). If the arbitrary distribution of the packet delays has mean μ and variance σ², and assuming that the sample size S obtained by the LDA is large enough for the normal approximation to be accurate, the sample mean can be considered to be normally distributed, with mean μ and variance σ²/S. This implies that, with 99% confidence, the estimate of the average delay μ̂ computed as the sample average will be within μ ± 2.576 σ/√S and, thus, the relative error will be below 2.576 σ/(μ √S).
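As a quick illustration, the 99% bound can be evaluated for the Weibull delay distribution used throughout this paper (scale α = 0.133, shape β = 0.6), using the standard closed-form Weibull moments:

```python
import math

def weibull_mean_std(scale, shape):
    """Mean and standard deviation of a Weibull(scale, shape) distribution."""
    mean = scale * math.gamma(1 + 1 / shape)
    var = scale ** 2 * (math.gamma(1 + 2 / shape) - math.gamma(1 + 1 / shape) ** 2)
    return mean, math.sqrt(var)

def rel_error_bound(sample_size, scale=0.133, shape=0.6):
    """99% confidence relative error bound: 2.576 * sigma / (mu * sqrt(S))."""
    mu, sigma = weibull_mean_std(scale, shape)
    return 2.576 * sigma / (mu * math.sqrt(sample_size))

print(round(rel_error_bound(2000), 3))   # -> 0.101
print(round(rel_error_bound(8000), 3))   # -> 0.051
```

These values match the figures quoted below: roughly 10% relative error with 2000 samples and 5% with 8000.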
Fig. 2. 99% confidence bound on the relative error of the estimation of the average delay as a function of the obtained sample size (left) and as a function of the packet loss ratio (right), assuming a 1024 bucket ideal LDA, 5 × 106 packets and Weibull (α = 0.133, β = 0.6) distributed packet delays
An observation to be made is that the relative error of the LDA is proportional to 1/√S; that is, halving the relative error requires 4 times as many samples. A point is reached where the return of obtaining additional samples has a negligible practical impact on the relative error. As stated, the accuracy of the LDA depends on the distribution of the packet delays, which are shown to be accurately modeled by a Weibull distribution in [6]; this distribution is used in [1] to evaluate the LDA. Figure 2 plots, as an example, the accuracy as a function of the sample size (left) and as a function of the loss ratio (right) when packet delays are Weibull distributed with scale parameter α = 0.133 and shape β = 0.6, and 5 × 10^6 packets per measurement interval (these parameters have been chosen consistently with [1] for comparability). It can be observed that, in practice, small sample sizes obtain satisfactory accuracies. In this particular case, 2000 samples bound the relative error to around 10%, 8000 lower the bound to 5%, and 25 times as many, to 1%.
3.3 Overhead
Ref. [1] presents an experimental comparison of the LDA with active probing. In this section, we compare the overhead of the LDA with that of a passive measurement approach based on trajectory sampling [8] that sends a packet identifier and a timestamp for each sampled packet. As a basis for comparison, we compute the network overhead of each method per collected sample. Note that, for equal sample sizes, the accuracy of both methods is expected to match, since samples are collected randomly. Traditional techniques incur an overhead directly proportional to the collected number of samples. For example, an active probe will send a packet to obtain each sample. The overhead of a trajectory sampling based technique is also a constant α bytes/sample. For example, a 32 bit hash of a packet plus a 64 bit timestamp sets α = 12.
Fig. 3. Communication overhead of the LDA relative to a traditional trajectory sampling approach, assuming 12 byte per bucket and per sample transmission costs
However, as discussed in the previous section, the sample size collected by the LDA depends on the packet loss ratio. A single-bank, optimally dimensioned LDA requires sending b × β bytes (where β denotes the size of a bucket) to gather ((1 − r)/(r e)) b samples. Thus, the overhead of the LDA is β r e/(1 − r) B/sample; using 64 bit timestamp accumulators and 32 bit counters yields β = 12. The LDA is preferable as long as it has lower overhead, i.e., β r e/(1 − r) < α and, thus, r < α/(β e + α). The values of α and β will vary in real deployments (e.g., timestamps can be compressed in both methods). In the example, where α = β = 12, the LDA is preferable as long as r < 1/(e + 1) ≈ 0.27. Figure 3 compares the overheads of both techniques in such a scenario, and shows the superiority of the LDA for the lowest packet loss ratios and its undesirability for the highest.
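The break-even point follows directly from these expressions; a small sketch, where α and β are the per-sample and per-bucket byte costs defined above:

```python
import math

def lda_overhead(r, beta=12.0):
    """Bytes per sample of an optimally tuned single-bank LDA: b*beta bytes
    sent to collect ((1 - r)/(r e)) * b samples -> beta*r*e/(1 - r)."""
    return beta * r * math.e / (1.0 - r)

def crossover(alpha=12.0, beta=12.0):
    """Loss ratio below which the LDA beats sending alpha bytes per sample:
    solve beta*r*e/(1 - r) < alpha for r."""
    return alpha / (beta * math.e + alpha)

print(round(crossover(), 2))   # -> 0.27, i.e., 1/(e + 1) when alpha = beta
```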
3.4 Unknown Packet Loss Ratio
It has already been established that the optimal choice of the LDA sampling rate is p = b/(n r), which obtains ((1 − r)/(r e)) b samples. However, in practice, both n and r are unknown a priori, since they depend on the network conditions, which are generally unpredictable. Thus, setting p beforehand implies that, inevitably, its choice will be suboptimal. What is the impact of over- and under-sampling, i.e., of setting a conservatively low or an optimistically high sampling rate for the LDA algorithm? We find that undersampling is preferable to oversampling. As explained, the relative error of the algorithm is proportional to 1/√S. Thus, oversampling leads to collecting a high number of samples under low packet loss ratios, and slightly increases the accuracy in such circumstances, but leads to a high percentage of buckets being invalidated under high loss, thus incurring large errors. Conversely, undersampling preserves the sample size under high loss, thus obtaining reasonable accuracy, at the cost of missing the opportunity to collect a much larger sample when losses are low; this, however, has a comparatively lower impact on the accuracy.
Fig. 4. Impact on the sample size (left) and expected relative error (right) of selecting a sub-optimal sampling rate
Figure 4 provides a graphical example of this analysis. In this example we consider, again analogously to [1], Weibull (α = 0.133, β = 0.6) distributed packet delays. We compare the sample sizes and accuracy bounds obtained by different configurations of the LDA using a value of p targeted at loss ratios of 5%, 20% and 80%. All LDA configurations use b = 1024 accumulators. It can be observed that the conservative setting of p for 80% loss underperforms in terms of sample size under the lowest packet loss ratios, but this does not imply an extreme degradation in measurement accuracy. On the contrary, the more optimistic sampling rate settings achieve better accuracy under low loss, but incur extreme accuracy degradation as the loss ratio grows.
3.5 The Multi-bank LDA
So far, the analysis of the LDA has assumed all buckets have a common sampling rate p. However, as exposed in [1], when packet loss ratios are unknown, it is interesting to divide the LDA into multiple banks. A bank is a section of the LDA for which all the buckets use the same sampling rate. Each of the banks can be tuned to a particular sampling rate, so that, intuitively, the LDA is resistant to a range of packet loss ratios. Reference [1] tests three different configurations of the multi-bank LDA, always using (almost) equally sized banks. No systematic analysis is performed on the appropriate bank sizing nor on the appropriate sampling rate for each of the banks; each LDA configuration is somewhat arbitrary and based on intuition. We extend our analysis to the most general multi-bank LDA, where each bucket i independently samples the full packet stream at rate p_i (i.e., our analysis supports all combinations of bank configurations and sampling rates). We adapt (1) accordingly:

    E[S] ≈ Σ_{i=1}^{b} (1 − r) p_i n / e^{n r p_i}    (3)
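Equation (3) is straightforward to evaluate for any vector of per-bucket sampling rates. The following sketch also checks that a uniform rate p_i = p/b reduces to (1); the two-bank configuration is a made-up example, not one of the configurations from [1]:

```python
import math

def multibank_expected_samples(n, r, rates):
    """E[S] from (3): each bucket i samples the full stream at rate p_i."""
    return sum((1 - r) * p * n * math.exp(-n * r * p) for p in rates)

# Sanity check: a uniform rate p_i = p/b reproduces the single-bank formula (1).
n, r, p, b = 5_000_000, 0.05, 0.001, 1024
uniform = [p / b] * b
single_bank = (1 - r) * p * n * math.exp(-n * r * p / b)
print(multibank_expected_samples(n, r, uniform), single_bank)

# Hypothetical two-bank split: half the buckets tuned for 1% loss, half for
# 10% loss, each with the per-bucket optimum p_i = 1/(n * r_target).
two_bank = [1 / (n * 0.01)] * 512 + [1 / (n * 0.10)] * 512
for loss in (0.01, 0.10, 0.50):
    print(loss, round(multibank_expected_samples(n, loss, two_bank)))
```

As expected, the bank tuned for the lower loss ratio dominates the sample size under low loss, and the conservative bank keeps the total from collapsing as losses grow.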
When every bucket uses the same sampling rate, the two equations are equivalent with p_i = p/b (each bucket receives 1/b of the traffic and samples packets at rate p). As for the error bound, the analysis from Sect. 3.2 still holds. We have evaluated the three alternative multi-bank LDA configurations proposed in [1], using the same configuration parameters and distribution of packet delays. Figure 5 compares the accuracy obtained by the three configurations. The figure assumes, again, a Weibull distribution for packet delays, with shape parameter β = 0.6 and scale α = 0.133, and a number of packets n = 5 × 10^6. All configurations use b = 1024 buckets. The first uses two banks, targeted to 0.005 and 0.1 loss; the second, three banks that target 0.001, 0.01 and 0.1 loss; the third, four banks that target 0.001, 0.01, 0.05 and 0.1 loss. The figure shows that, in practice, the three approaches (lda-2, lda-3 and lda-4 in the figure) proposed in [1] perform very similarly, which motivates further discussion on how to dimension multi-bank LDAs. The figure also provides, as a reference, the accuracy obtained by an ideal LDA that, for every packet loss ratio, obtains the best possible accuracy (from (2)). We argue that, consistently with the discussion of Sect. 3.4, in order to support a range of packet loss ratios, the LDA should be primarily targeted towards maintaining accuracy at the worst-case target packet loss ratio. Using this conservative approach has two benefits. First, it guarantees that a target accuracy can be maintained in the worst-case packet loss scenario. Second, it is guaranteed that its accuracy at smaller packet loss ratios is at least as good. However, this rather simplistic approach has an evident flaw: it does not provide significantly higher performance gains in the lowest packet loss scenarios, where a small number of buckets provisioned with high sampling rates would easily gather a huge number of samples.
Based on this intuition, as a rule of thumb, 90% of the LDA buckets could be targeted to the worst-case loss ratio, using the rest of the buckets to increase the accuracy in low packet loss scenarios. A more sophisticated approach to dimensioning a multi-bank LDA is to determine the vector of sampling rates ⟨p1, p2, ..., pb⟩ that performs closest to optimal across a range of loss ratios. We have used numerical optimization to search for a vector of sampling rates that minimizes the maximum difference between the accuracies of the multi-bank LDA and the ideal LDA across a range of packet loss ratios. Additionally, we have restricted the search space to sampling rates that are powers of two for performance reasons [1,9]. We have obtained a multi-bank LDA that targets a range of loss rates between 0.1% and 20% for the given scenario: 5 million packets, Weibull distributed delays, and 1024 buckets. The best solution that our numerical optimizer has found is, consistent with the previous discussion, targeted primarily to the highest loss ratios. Table 2 summarizes the resulting multi-bank LDA. Most notably, a majority (70%) of the buckets use pi = 2^-20, i.e., are targeted to a packet loss ratio of 20%, while fewer (around 20%) use pi = 2^-17, i.e., are optimized for around 2.6% loss. All buckets combined sample around 0.47% of the packets.
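As an illustration of this kind of search, the following sketch performs a randomized minimax optimization under a deliberately simplified effective-sample-size model (a bucket at per-bucket rate p collects about N·p packets and is usable only if none of them is lost). The model, the candidate rates and the search procedure are our assumptions for illustration, not the exact procedure used for Table 2.

```python
import math
import random

N = 5_000_000                               # packets per interval (paper's scenario)
B = 1024                                    # total number of buckets
RATES = [2.0 ** -k for k in range(14, 24)]  # candidate power-of-two per-bucket rates
LOSSES = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2]  # target loss range

def eff_samples(p, loss):
    # Simplified model: a bucket at per-bucket rate p collects ~N*p packets
    # and contributes them only if none of those packets was lost.
    m = N * p
    return m * (1.0 - loss) ** m

def rel_err(samples):
    # Relative error shrinks like 1/sqrt(effective sample size).
    return float("inf") if samples <= 0 else 1.0 / math.sqrt(samples)

def ideal_err(loss):
    # "Ideal" LDA: every bucket tuned to the per-loss optimum m* = -1/ln(1-loss),
    # which yields B * m* * e^-1 expected samples under this model.
    m = -1.0 / math.log(1.0 - loss)
    return rel_err(B * m * math.exp(-1.0))

def worst_ratio(alloc):
    # Minimax objective: worst (config error / ideal error) over the loss range.
    return max(
        rel_err(sum(c * eff_samples(p, l) for p, c in alloc.items())) / ideal_err(l)
        for l in LOSSES)

random.seed(0)
best = {RATES[-1]: B}                       # start: all buckets at the smallest rate
best_score = worst_ratio(best)
for _ in range(2000):                       # crude random search over allocations
    alloc, left = {}, B
    for p in RATES[:-1]:
        c = random.randint(0, left)
        if c:
            alloc[p] = c
        left -= c
    alloc[RATES[-1]] = alloc.get(RATES[-1], 0) + left
    score = worst_ratio(alloc)
    if score < best_score:
        best, best_score = alloc, score
```

By construction the minimax ratio is never below 1, since under this model no allocation can beat the per-loss optimum; the search only narrows the gap to it.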
Validation and Improvement of the Lossy Difference Aggregator
165
Table 2. Per-bucket sampling rates respective to the full packet stream of the numerically optimized LDA for the given scenario. Overall sampling rate is around 0.47%. Rates range from 2^-14 to 2^-21 over the 1024 buckets; most notably, 717 buckets use pi = 2^-20 and 189 use pi = 2^-17.
Fig. 5. Error bounds for several configurations of multi-bank LDA in the 0-1 packet loss ratio range (left) and in the 0-0.2 range (right)
Figure 5 shows the result of this approach (line optimized) when targeting a range of loss rates between 0.1% and 20% for 5 million packets with the mentioned Weibull distribution of delays. The solution our optimizer found has the desirable property of always staying within 3% of the best possible relative error, for any given loss ratio within the target range. These results suggest that there is little room for improvement in the multi-bank LDA parametrization problem. In the parallel analysis of [9], numerical optimization is also mentioned as an alternative to maximize the effective sample size when facing unknown packet loss. Optimal configurations are derived using competitive analysis for some particular cases of tight ranges of target packet loss ratios [l1, l2]. In particular, it is found that both for l2/l1 ≤ 2, and for l2/l1 ≤ 5.5 with a maximum of 2 banks, the optimal configuration is a single-bank LDA with p = (ln l2 − ln l1)/(l2 − l1). We believe that our approach is more practical in that it supports arbitrary packet loss ratios and focuses on preserving accuracy rather than sample size.
4
Validation
In the previous section, we derived formulas for the expected effective sample size of the LDA when operating under various sampling rates, and provided bounds for the expected relative error under typical distributions of the network delays. In this section, we validate the analytical results through simulation. We have chosen the same configuration parameters as in the evaluation of [1]. Thus, this section not only validates our analysis of the LDA algorithm, but also
Fig. 6. Effective sample size (left) and 99th percentile of the relative error (right) obtained from simulations of the LDA algorithm using 5 million packets per measurement interval, and Weibull distributed packet delays
shows consistency with the previous results of [1]. The simulation parameters are as follows: we assume 5 million packets per measurement interval, and Weibull (α = 0.133, β = 0.6) distributed packet delays. In our simulation, losses are uniformly distributed. Note however that, as stated in [1], the LDA is completely agnostic to the packet loss distribution, and only sensitive to the overall packet loss ratio. Thus, other packet loss models (e.g., bursty losses [12]) are supported by the algorithm without requiring any changes. Figure 6 (left) compares the expected sample sizes with the actual results from the simulations. The figure includes the three multi-bank LDA configurations introduced in [1], with expected sample size calculated using (3), and the ideal LDA that achieves the best possible accuracy under each packet loss ratio, obtained from (2). This figure validates our analysis of the algorithm, since effective sample sizes are always around their expected value (while in [1], only a noticeably pessimistic lower bound is presented). On the other hand, Figure 6 (right) plots the 99th percentile of the relative error obtained after 500 simulations for each loss ratio, and compares it to the 99% bound on the error derived from the analysis of Sect. 3.2. The figures confirm the correctness of our analysis for both the effective sample size and the 99% confidence bound on the relative error.
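A scaled-down version of such a simulation for a single-bank LDA can be sketched in a few lines; uniform, independent losses as above, a random bucket choice standing in for packet-ID hashing, and parameters reduced for speed (the figures in the paper use 5 million packets and repeated runs).

```python
import random

def lda_effective_samples(n, p, b, loss, rng):
    # Sender samples each packet with probability p and hashes it to one of b
    # buckets; a bucket is usable only if none of its sampled packets was lost.
    sampled = [0] * b
    lost = [False] * b
    for _ in range(n):
        if rng.random() < p:
            i = rng.randrange(b)          # stand-in for hashing the packet ID
            sampled[i] += 1
            if rng.random() < loss:       # uniform, independent packet loss
                lost[i] = True
    return sum(s for s, bad in zip(sampled, lost) if not bad)

rng = random.Random(42)
s = lda_effective_samples(n=100_000, p=0.1, b=1024, loss=0.01, rng=rng)
# Under the analysis, E[s] is roughly b * m * (1-loss) * exp(-m*loss)
# with m = n*p/b sampled packets per bucket.
```

Repeating this for a grid of loss ratios reproduces the qualitative behavior of Fig. 6 (left): the effective sample size stays close to its expectation rather than to a pessimistic lower bound.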
5
Experiments

5.1
Scenario
In the previous section, we presented a simulation-based validation of our analysis of the LDA that reproduces the evaluation of [1]. In this section we evaluate the algorithm using real network traffic. To the best of our knowledge, this is the first work to evaluate the algorithm in a real scenario.
Our scenario consists of two measurement points: one of the links that connect the Catalan academic network (also known as the Scientific Ring) to the rest of the Internet, and the access link of the Technical University of Catalonia (UPC). In the first measurement point, a commodity PC equipped with a 10 Gb/s Endace DAG card [13] obtains a copy of the inbound traffic via an optical splitter, and filters for incoming packets with a destination address belonging to the UPC network. In the second measurement point, a commodity PC equipped with a 1 Gb/s Endace DAG card [13] analyzes a copy of the traffic that enters UPC, obtained from a port mirror on a switch.

5.2
Deployment Challenges
The deployment of the LDA in a real-world scenario presents important challenges, as its design is built upon several assumptions. First, as stated in [1], the clocks in the two measurement points must be synchronized. We achieve this by synchronizing the internal DAG clocks prior to trace collection. Second, packets in the network are assumed to follow strict FIFO ordering, and the monitors are assumed to be able to inject control packets into the network (by running in the routers themselves) which also observe this strict FIFO ordering and are used to signal measurement intervals. In our setting, packets are not forwarded in strict FIFO order, since different queueing policies are applied to certain traffic. Moreover, injecting traffic to signal the intervals is unfeasible, since the monitors are isolated from the network under study. Third, in the original proposal, the complete set of packets observed in the second monitor (receiver) must have also been observed in the first (sender). In [1], the LDA algorithm is proposed to be deployed in network hardware in a hop-by-hop fashion. However, this assumption severely limits the applicability of the proposal; for example, as is, it cannot be used in our scenario, since receiver observes packets that have been routed through a link to a commercial network that sender does not monitor (we refer to these packets as third party traffic). This limitation could be addressed by using appropriate traffic filters to discern whether each packet has traversed sender (e.g., based on the source MAC address, or the source IP address), but in the most general case this is not possible. In particular, in our network, we lack routing information, and traffic engineering policies make it likely that the same IP networks are routed differently. The problem is that the LDA counters might match by chance when, in receiver, packet losses are compensated by extra packets from the third party traffic.
The LDA would assume that the affected buckets are usable, and introduce severe error. We work around this by introducing a simple extension to the data structure: we attach to each LDA bucket an additional memory position that stores an XOR of all the hashes of the packets aggregated in the corresponding accumulator. Thus, receiver can trivially confirm that the set of packets of each position matches the set of packets aggregated in sender by checking this XOR. From a practical standpoint, using this approach makes third party traffic
count as losses. We use 64-bit hashes and, thus, the probability of the XORs matching by chance is negligible¹.

5.3
Experimental Results
We have simultaneously collected a trace in each of the measurement points in the described scenario, and wrote two CoMo [14] modules to process the traces offline: one that implements the LDA, and another that computes the average packet delays exactly. The traces have a duration of 30 minutes. We have configured 10-second measurement intervals, so that the average number of packets per measurement interval is in the order of 6 × 10^5. We have tested 16 different single-bank configurations of the LDA with b = 1024 buckets and sampling rates ranging from 2^0 to 2^-15. Also, we have used our numerical optimizer to obtain a multi-bank LDA configuration that tolerates up to 80% loss in our scenario. Figure 7 summarizes our results. As noted in the previous discussion, third party traffic that is not seen in sender is viewed as packet loss in receiver. Therefore, our LDAs operate at an average loss rate of around 10%, which roughly corresponds to the fraction of packets arriving from a commercial network link that sender does not monitor. Hence, the highest packet sampling ratios are over-optimistic and collect too much traffic. It can be observed in Fig. 7 (right) that sampling ratios from 2^0 to 2^-4 lose an intolerable number of measurement intervals because all LDA buckets become unusable. Lower sampling rates, though, are totally resistant to the third party traffic. Figure 7 (left) plots the results in terms of accuracy, including only the measurement intervals that were not lost. It can be observed that 2^-6 and 2^-7 are the best settings. This is consistent with the analysis of Sect. 3, which suggests using p = b/(nr) ≈ 0.017 ≈ 2^-6. The figure also includes the performance of our numerically optimized LDA, portrayed as a horizontal line (the multi-bank LDA is a hybrid of the other sampling rates). It performs very similarly to the best static sampling rates.
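Reading the Sect. 3 rule of thumb as p ≈ b/(n·r), the chosen setting can be checked numerically with the scenario's values (b = 1024 buckets, n ≈ 6 × 10^5 packets per interval, r ≈ 0.1 loss); the rounding to a power-of-two exponent mirrors the restriction used by the optimizer.

```python
import math

b, n, r = 1024, 6e5, 0.10        # buckets, packets per interval, loss ratio
p = b / (n * r)                  # rule-of-thumb sampling rate, ~0.017
k = round(math.log2(p))          # nearest power-of-two exponent, -6
```

The nearest power of two is 2^-6, matching the best static setting observed in Fig. 7 (left).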
However, it is important to note that this configuration will consistently perform close to optimal even when losses (or third party traffic) grow to 80%, obtaining errors below 50%, while the error bound for the less flexible single-bank LDA reaches 400%. On average, for each measurement interval, the optimized LDA collected around 3478 samples, while transmitting 1024 × 20 bytes (8 bytes each for the timestamp accumulator and the XOR field, plus 4 for the counter of each bucket), resulting in 5.8 B/sample of network overhead. A traditional technique based on sampling and transmitting packet timestamps would cause a higher overhead; e.g., using 8-byte timestamps and 4-byte packet IDs, it would transmit 12 B/sample. Thus, in this scenario, the LDA reduced the communication overhead by over 50%.
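For concreteness, the extended 20-byte bucket (timestamp accumulator, packet counter, and the XOR field of Sect. 5.2) and the receiver-side usability check can be sketched as follows; the hash function and the field handling are our choices for illustration, not the deployed implementation.

```python
import hashlib

class Bucket:
    # One LDA position: timestamp accumulator (8 B), counter (4 B) and, per the
    # extension above, an XOR of 64-bit packet hashes (8 B).
    def __init__(self):
        self.ts_sum = 0
        self.count = 0
        self.xor = 0

    def add(self, pkt_id: bytes, ts: int):
        self.ts_sum += ts
        self.count += 1
        h = hashlib.blake2b(pkt_id, digest_size=8).digest()  # illustrative 64-bit hash
        self.xor ^= int.from_bytes(h, "big")

def usable(snd: "Bucket", rcv: "Bucket") -> bool:
    # Counters alone can match by chance when losses are compensated by
    # third party packets; the hash XOR detects that case.
    return snd.count == rcv.count and snd.xor == rcv.xor

# Example: packet "b" is lost, but a third party packet "c" restores the count.
snd, rcv = Bucket(), Bucket()
snd.add(b"a", 10); snd.add(b"b", 11)
rcv.add(b"a", 12); rcv.add(b"c", 13)
```

In this sketch a counter match with an XOR mismatch marks the bucket unusable, which is exactly how third party traffic ends up being counted as loss.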
¹ The XORs of the hashes have to be transmitted from sender to receiver, causing extra network overhead. Choosing the smallest hash size that still guarantees high accuracy is left for future work.
Fig. 7. Experimental results: relative error (left, average and 99th percentile) and fraction of measurement intervals lost (right) as a function of the sampling rate, for the single-bank and the optimized multi-bank LDA
6
Conclusions
We have performed a validation of the Lossy Difference Aggregator (LDA) algorithm originally presented in [1]. We have improved the theoretical analysis of the algorithm by providing a formula for the expected sample size collected by the LDA, whereas in [1] only a pessimistic lower bound was presented. Our analysis finds that the sampling rates originally proposed must be doubled. Only three configurations of the more complex multi-bank LDA were evaluated in [1]. We have extended our analysis to multi-bank configurations and explored how to properly parametrize them, obtaining a procedure to numerically search for multi-bank LDA configurations that maximize accuracy over an arbitrary range of packet loss ratios. Our results show that there is little room for additional improvement in the problem of multi-bank LDA configuration. We have validated our analysis through simulation and using traffic from a monitoring system deployed over a large academic network. The deployment of the LDA on a real network presented a number of challenges related to the assumptions behind the original proposal of the LDA algorithm, which does not tolerate packet insertion/diversion and depends on strict FIFO packet forwarding. We propose a simple extension that overcomes these limitations. We have compared the network overhead of the LDA with pre-existing techniques, and observed that it is preferable under zero to moderate loss or addition/diversion of packets (up to ∼25% combined). However, the extra overhead of pre-existing techniques can be justified in some scenarios, since they can provide further information on the packet delay distribution (e.g., percentiles), beyond the average and standard deviation provided by the LDA.
Acknowledgments We thank the anonymous reviewers and Fabio Ricciato for their comments, which led to significant improvements in this work. This research has been partially funded by the Comissionat per a Universitats i Recerca del DIUE de la Generalitat de Catalunya (ref. 2009SGR-1140).
References
1. Kompella, R., Levchenko, K., Snoeren, A., Varghese, G.: Every microsecond counts: tracking fine-grain latencies with a lossy difference aggregator. In: Proc. of ACM SIGCOMM Conf. (2009)
2. Bolot, J.: Characterizing end-to-end packet delay and loss in the Internet. Journal of High Speed Networks 2(3) (1993)
3. Paxson, V.: Measurements and Analysis of End-to-End Internet Dynamics. University of California at Berkeley, Berkeley (1998)
4. Choi, B., Moon, S., Cruz, R., Zhang, Z., Diot, C.: Practical delay monitoring for ISPs. In: Proc. of ACM Conf. on Emerging Network Experiment and Tech. (2005)
5. Sommers, J., Barford, P., Duffield, N., Ron, A.: Accurate and efficient SLA compliance monitoring. ACM SIGCOMM Computer Communication Review 37(4) (2007)
6. Papagiannaki, K., Moon, S., Fraleigh, C., Thiran, P., Diot, C.: Measurement and analysis of single-hop delay on an IP backbone network. IEEE Journal on Selected Areas in Communications 21(6) (2003)
7. Zseby, T.: Deployment of sampling methods for SLA validation with non-intrusive measurements. In: Proc. of Passive and Active Measurement Workshop (2002)
8. Duffield, N., Grossglauser, M.: Trajectory sampling for direct traffic observation. IEEE/ACM Transactions on Networking 9(3) (2001)
9. Finucane, H., Mitzenmacher, M.: An improved analysis of the lossy difference aggregator (public draft), http://www.eecs.harvard.edu/~michaelm/postscripts/LDApre.pdf
10. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58(1) (1999)
11. Rohatgi, V.: Statistical Inference. Dover Publications (2003)
12. Sommers, J., Barford, P., Duffield, N., Ron, A.: Improving accuracy in end-to-end packet loss measurement. ACM SIGCOMM Computer Communication Review 35(4) (2005)
13. Endace: DAG network monitoring cards, http://www.endace.com
14. Barlet-Ros, P., Iannaccone, G., Sanjuàs-Cuxart, J., Amores-López, D., Solé-Pareta, J.: Load shedding in network monitoring applications. In: Proc. of USENIX Annual Technical Conf. (2007)
End-to-End Available Bandwidth Estimation Tools, An Experimental Comparison

Emanuele Goldoni¹ and Marco Schivi²

¹ University of Pavia, Dept. of Electronics, 27100 Pavia, Italy, [email protected]
² University of Pavia, Dept. of Computer Engineering and Systems Science, 27100 Pavia, Italy, [email protected]
Abstract. The available bandwidth of a network path impacts the performance of many applications, such as VoIP calls, video streaming and P2P content distribution systems. Several tools for bandwidth estimation have been proposed in recent years, but there is still uncertainty about their accuracy and efficiency under different network conditions. Although a number of experimental evaluations have been carried out to compare some of these methods, a comprehensive evaluation of all the existing active tools for available bandwidth estimation is still missing. This article introduces an empirical comparison of most of the active estimation tools currently implemented and freely available. Abing, ASSOLO, DietTopp, IGI, pathChirp, Pathload, PTR, Spruce and Yaz have been compared in a controlled environment and in the presence of different sources of cross-traffic. The performance of each tool has been investigated in terms of accuracy, time and traffic injected into the network to perform an estimation.
1
Introduction
Available bandwidth is a fundamental metric for describing the performance of a network path. This parameter is used in many applications, from routing algorithms to congestion control mechanisms and multimedia services. For example, in [1,2] the authors investigated the importance of the available bandwidth for adaptive content delivery in peer-to-peer (P2P) or video streaming systems. The easiest and most effective method for estimating the available bandwidth is active probing: a few test packets are transmitted through the path and are used to infer the network status. The problem of end-to-end estimation has received considerable attention and a number of active probing tools have emerged in recent years [3]. Nevertheless, producing reliable estimations still remains challenging: the measurement process should be accurate, non-intrusive and robust at the same time. Considerable efforts have also been put into comparison projects aiming to analyze the performance of existing methods in different network scenarios. F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 171–182, 2010. © Springer-Verlag Berlin Heidelberg 2010
172
E. Goldoni and M. Schivi
Nevertheless, many issues still remain unresolved and the quest for the best available bandwidth estimation tool is still open [4]. Compared to previous works, this paper proposes the largest comparison of available bandwidth estimation methods. The performance of 9 different tools is investigated in terms of accuracy, time and intrusiveness. All the experiments have been conducted in a low-cost and flexible testbed which could easily be extended to simulate more complex and realistic network topologies. The remainder of the article is organized as follows. Section 2 briefly presents related work on measurement tools and past performance comparisons. In Section 3 we describe the testbed and the methodology adopted for the experimental comparison of the tools. Next, Section 4 includes the preliminary results obtained during the performance tests we performed in our laboratory. Finally, the conclusions drawn from this study are presented in Section 5.
2
Background and Related Work
Many software tools for network bandwidth monitoring have been developed in recent years by both independent scientists and collaborative research projects. Although designed for the same purpose, these tools are based on different principles and implement various techniques. This section briefly introduces the main methodologies proposed in the literature and also describes the previous works carried out to compare them.

2.1
Measurement Techniques
Several active end-to-end measurement tools have been proposed in recent years. Looking at the big picture, these systems infer the available bandwidth of a network path by sending a few packets and analyzing the effects of intermediate nodes and cross-traffic on the probe frames. Examples of probing tools which have emerged in recent years are Pathload [5], IGI/PTR [6], Abing [7], Spruce [8], pathChirp [9], DietTopp [10], Yaz [11], and ASSOLO [12]. These methods differ in the size and temporal structure of their probe streams, and in the way the available bandwidth is derived from the received packets. Spruce [8] uses tens of packet pairs with an input rate chosen to be roughly around the capacity of the path, which is assumed to be known. Moreover, packets are spaced with exponential intervals in order to emulate a Poissonian sampling process. IGI [6] uses a sequence of about 60 unevenly spaced packets to probe the network, and the gap between two consecutive packets is increased until the average output and initial gaps match. Similarly, PTR relies on unevenly spaced packets, but the background traffic is detected through a comparison of the time intervals at the source with those found on the destination side. Abing [7] relies on the packet pair dispersion technique. Typically, 10 or 20 closely spaced probes are sent to one destination as a train. The evaluation of the
observed packet pair delays and the estimation of the available bandwidth are based on a technical analysis of the problems that the frames could meet in the routers or other network devices. Pathload [5] and DietTopp [10] use constant bit-rate streams and change the sending rate every round. Although both tools try to identify the turning point, DietTopp increases the sending rate linearly in successive streams while Pathload varies the probing rate using a binary search scheme. Yaz [11] is a similar estimation tool derived from Pathload, which should report results more quickly and with increased accuracy with respect to its predecessor. PathChirp [9] sends a variable bit-rate stream consisting of exponentially spaced packets. The actual unused capacity is inferred from the rate responsible for increasing delays at the receiver side. ASSOLO [12] is a tool based on the same principle, but it features a different probing traffic profile and uses a filter to improve the accuracy and stability of results. Other works like AB-Shoot [13], S-chirp [14], FEAT [15], BART [16] or MRBART [17] have also been proposed in the past. However, the source code of these tools has never been released publicly, or the methods have been implemented only in simulations. A detailed analysis of the existing estimation techniques is outside the scope of this paper: a proposed taxonomy has been developed in [8], while more information on specific tools can be found in the original papers.

2.2
Past Comparisons
Most tool proponents have compared the performance of their solution against that of other researchers. For example, in [11] Sommers et al. compared Yaz with Pathload and Spruce in a controlled environment, while Strauss and his colleagues [8] investigated the performance of Spruce against IGI and Pathload over hundreds of real Internet paths. Ribeiro et al. [9] tested pathChirp against Pathload and TOPP through emulation. In [12] the performance of pathChirp has been compared to that of ASSOLO in a laboratory network setup. Unfortunately, the works mentioned above covered only a small number of tools, and the scenarios investigated are limited too. A more comprehensive evaluation has been performed by Shriram et al. [18], who compared Abing, pathChirp, Pathload and Spruce on a high-speed testbed and on real-world GigE paths. The specific features of the network paths also allowed the researchers to investigate timing issues related to high-speed links and network interfaces. A similar work has been carried out by Labit et al. [19], who tested Abing, Spruce, Netest/Pipechar, pathChirp and IGI over a real Internet path of the French national monitoring and measurement platform Metropolis. Angrisani et al. [20] compared IGI, pathChirp and Pathload in a testbed equipped with a proper measurement station. The adoption of a performance evaluation methodology relying on electronic instrumentation for time measurements allowed the authors to focus on concurrence, repeatability and bias of the results obtained from the testbed. Furthermore, an optimal setting of each tool has been identified thanks to the experimental activity.
In [21] the authors presented a comparative study of DietTopp, Pathload and pathChirp in a mobile transport network. However, all the results presented have been generated only from simulations using ns2. The ns2 network simulator has also been used by Shriram and Kaur [22] to evaluate the performance of Pathload, pathChirp, Spruce, IGI and Cprobe under different network conditions. Two additional works in this research field are [23] and [24]. In the first paper the authors proposed a comparative analysis of Spruce, Pathload, pathChirp and IGI in a simple testbed, and they analyzed in depth the measurement errors and the uncertainty of the tools. In the latter article, Urvoy-Keller et al. investigated the long-term behavior and the biases of Pathload and Spruce by collecting data from real Internet paths. Finally, Guerrero and Labrador [25] presented a low-cost and flexible testbed, and they evaluated Pathload, IGI, and Spruce in a common environment in the presence of different cross-traffic loads. The same authors included more tools in the performance evaluation in [4], comparing Pathload, pathChirp, Spruce, IGI and Abing. The considered scenarios were extended too, examining varying packet loss rates, cross-traffic packet sizes, link capacities and delays. In addition, the later article pointed out which tools might be the best choices for particular applications and environments. Although great efforts have been made to compare the existing estimation methods, all past works considered only part of the existing measurement tools. The above-mentioned experiments have also been performed considering different scenarios and testbed configurations, thus making the various results not easily comparable. We advocate the need for a unified, flexible and low-cost platform for independent evaluations of measurement tools, and we propose in this paper a testbed solution based on free GPL-licensed software, as an alternative to the one described in [4].
Our study also takes a step further with respect to previous works, since it proposes the largest comparison of available bandwidth estimation tools: the performance of 9 software programs is examined in terms of accuracy, time and intrusiveness. All the tools have been ported to a recent operating system, and the changes required to make older software work on a newer system have been publicly released [26].
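As a concrete complement to the descriptions in Sect. 2.1, the two characteristic probe-spacing disciplines can be sketched in a few lines: exponentially distributed gaps (a Poisson probing process, as Spruce uses) and a chirp whose gaps shrink geometrically (as in pathChirp and ASSOLO). The rates, packet size and spread factor below are illustrative, not the tools' defaults.

```python
import random

def poisson_probe_times(rate_pps, n, rng):
    # Exponential inter-departure gaps yield a Poisson probing process, so
    # probes observe time averages of the path state (PASTA property).
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate_pps)
        times.append(t)
    return times

def chirp_send_times(low_rate, high_rate, pkt_bits, spread=1.2):
    # One chirp: each gap is `spread` times smaller than the previous one, so
    # a single train sweeps an exponential range of instantaneous rates.
    times, t = [0.0], 0.0
    gap = pkt_bits / low_rate               # first gap probes the lowest rate
    while pkt_bits / gap < high_rate:
        t += gap
        times.append(t)
        gap /= spread
    return times

spruce_like = poisson_probe_times(1000.0, 10_000, random.Random(1))
chirp = chirp_send_times(10e6, 200e6, 12_000.0)   # 10 -> 200 Mb/s, 1500-byte probes
```

The chirp covers the whole rate range with a single short train, which is why chirp-based tools tend to be faster and less intrusive than stream-based ones.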
3
Testbed Setup
All the experimental results reported in this paper have been obtained using the simple testbed setup depicted in Figure 1. Our controlled network is based on general-purpose PCs running only open source software. Two low-cost computers running Ubuntu GNU/Linux 8.04 are connected together through a 100 Mbps Fast Ethernet link and serve as routers. Two other machines of the testbed have been used to load the network with a source of controlled traffic originated by the D-ITG [27] traffic generator. Finally, we installed the client and the server of each measurement tool on two additional computers running Ubuntu GNU/Linux 8.04
Fig. 1. Testbed setup used to compare available bandwidth estimation tools
and we connected these two end-hosts to the testbed. We also added two Fast Ethernet switches to the network in order to make the tests more realistic. The two intermediate routers which emulate the multi-hop network path are based on Linux 2.6. The routers also contain iproute2 tc [28], a utility used to configure traffic control mechanisms in the Linux kernel. With tc it is possible to change the capacity of each interface, limiting the output in terms of packets or bytes per second. tc also supports different packet queuing disciplines, and it can emulate the properties of wide area networks, such as variable delay, loss, duplication and re-ordering, through the netem kernel component. The D-ITG traffic generator allowed us to produce traffic at the packet level, replicating stochastic processes in which both inter-departure times and packet sizes are random variables with known properties. D-ITG is also capable of generating traffic at the network, transport, and application layers, and it can also use real traffic traces. In our experiments we loaded the network with Poissonian or constant bit rate (CBR) cross-traffic with rates varying from 0 to 64 Mbps, and we did not introduce any traffic shaping policy. The final topology of the testbed and the scenarios considered are simple and admittedly unrealistic, but sufficient to perform a preliminary evaluation of the various measurement tools. A similar configuration has been used, for example, in [20] and [23], and the resulting system has the same features and flexibility as the testbed proposed in [25]. All the tools considered in this work must be executed on the two terminal hosts of the measured path, using a regular user account (administrator privileges are not required).
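What netem configures in the kernel can be mimicked in a few lines for intuition; the sketch below applies a base delay, uniform jitter and independent random loss to a stream of packet send times (real netem additionally models queueing, reordering, duplication and rate limits, none of which appear here):

```python
import random

def netem_like(send_times, delay_s=0.010, jitter_s=0.002, loss=0.01, rng=None):
    # Each packet is dropped with probability `loss`, otherwise delayed by the
    # base delay plus a uniform jitter term (no reordering or queueing model).
    rng = rng or random.Random()
    arrivals = []
    for t in send_times:
        if rng.random() < loss:
            continue                         # packet dropped
        arrivals.append(t + delay_s + rng.uniform(-jitter_s, jitter_s))
    return arrivals

arrivals = netem_like([i * 0.001 for i in range(10_000)], rng=random.Random(7))
```

On the real testbed this role is played by tc/netem on the routers' interfaces, which lets the same impairments be applied to all forwarded traffic.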
For each measurement tool we left the configuration parameters untouched, using the default values suggested by the authors in the original paper or directly within the software; a list of the main configuration settings for each tested program is given in [26]. Although better results might be obtained using different setups, an optimal tuning of the tools is
outside the scope of this paper. We also ran one measurement tool at a time: as shown in [29], current techniques can offer good estimates when used alone, but they might not work if several estimations severely interfere with each other.
4
Experimental Results
Using the testbed described before, we evaluated Abing, ASSOLO, DietTopp, IGI, pathChirp, Pathload, PTR, Spruce and Yaz in terms of estimation time, overhead and accuracy. For each tool, we considered 5 CBR and 5 additional Poissonian cross-traffic scenarios with varying intensity. We loaded the network using sources of 64, 32, 16 and 8 Mbps and, finally, we turned off the traffic generator. We repeated the measurement process from scratch 10 times before calculating the averaged results for each scenario. The convergence time and the amount of probe traffic transmitted have been calculated considering only actual probing frames. For example, we did not consider the delay of the initial control connection phase which most of the tools use to synchronize the client and the server.

4.1
Accuracy
Figures 2 and 3 show the average results obtained from our experiments. Abing, Spruce and DietTopp provide good estimates in the presence of low-rate cross-traffic, but their accuracy decreases significantly when the network load increases. On the contrary, the stability and the accuracy of the measurements obtained with IGI and PTR increase when the intensity of the cross-traffic is higher. pathChirp constantly overestimates the available bandwidth and its measurements are quite unstable – this is a well-known problem of this tool and similar results have been obtained in [16], [18], [23]. Pathload and Yaz are quite accurate and their results are similar; this is justified by the fact that Yaz is a modified version of Pathload. Comparable results in terms of accuracy are also provided by ASSOLO. It is worth noting that the measured values do not exhibit significant differences with respect to the kind of cross-traffic source – the tools performed in the same way regardless of the use of CBR or Poisson-distributed packets.

4.2 Intrusiveness
Table 1 shows the preliminary data obtained from the testbed network in the presence of a 16 Mbps CBR cross-traffic load, i.e. an available bandwidth of around 80 Mbps. We ran the measurement process for each tool in this scenario and used a network protocol sniffer [30] to evaluate the exact time required to provide an estimation and the amount of probe traffic injected into the path. During the tests we measured only the actual estimation time and the probe traffic, not considering, for example, the delay introduced by a possible initial
Fig. 2. Experimental results (solid line) obtained in the presence of constant bit rate cross-traffic with varying rate from 0 to 64 Mbps (dashed line). Panels, each plotting available bandwidth [Mbps] vs. cross-traffic [Mbps]: (a) Abing, (b) ASSOLO, (c) DietTopp, (d) IGI, (e) pathChirp, (f) Pathload, (g) PTR, (h) Spruce, (i) Yaz

Fig. 3. Experimental results (solid line) obtained in the presence of Poissonian cross-traffic with varying rate from 0 to 64 Mbps (dashed line). Panels as in Fig. 2: (a) Abing, (b) ASSOLO, (c) DietTopp, (d) IGI, (e) pathChirp, (f) Pathload, (g) PTR, (h) Spruce, (i) Yaz
control connection phase. Similarly, we ignored the traffic and the time required by IGI and PTR to measure the initial capacity. For each tool, we considered a single estimation, although tools like ASSOLO and pathChirp usually repeat the process a number of times and produce a final value using filtering techniques. During the first round of tests the average end-to-end delay in our testbed was around 2 milliseconds – a reasonable value for short Fast Ethernet links. We repeated all the experiments enabling the netem module on the two routers in order to emulate a real Internet path with a symmetric One-Way Delay of 125 ms.

Table 1. Estimation time (in seconds) and amount of probe traffic (in Megabytes) associated with the tools analyzed

Tool       Traffic [MB]   Time [s] (OWD = 2 ms)   Time [s] (OWD = 125 ms)
Abing      0.6            1.0                     1.2
ASSOLO     <0.1           0.4                     0.5
DietTopp   7.6            1.7                     1.9
IGI        9.1            0.9                     1.1
pathChirp  <0.1           0.4                     0.5
Pathload   40.6           7.6                     18.2
PTR        9.1            0.9                     1.1
Spruce     0.3            10.0                    10.1
Yaz        6.4            4.3                     8.1
Results show that DietTopp, IGI and PTR are quite fast, but they also inject a significant amount of probe traffic into the network; on the other hand, Spruce can take seconds but is more lightweight. Pathload and Yaz are quite slow and intrusive, while ASSOLO, Abing and pathChirp appear to offer a good trade-off between speed and intrusiveness. It is worth pointing out that ASSOLO, DietTopp, pathChirp, Pathload, Yaz and PTR are based on the concept of self-induced congestion – the search for the available bandwidth is performed by transmitting probe traffic at a rate higher than the unused capacity of the path. The major drawback of this approach is that one or more intermediate queues fill up during the measurement process – the existing network traffic is delayed and some packets could even be discarded by the congested bottleneck. The remaining tools are based instead on the probe gap model, which infers the available bandwidth by observing the gap measured on a packet pair injected into the path. Although this method limits the interference between probe and existing traffic, it has been proved to be less accurate in some network scenarios [31]. The total estimation time of some tools also depends on the Round Trip Time of the observed network path – the results change significantly when iterative programs like Pathload or Yaz are used over links with sizable delays. On the other hand, the impact of the one-way delay on the direct tools is negligible since they rely on a single stream to produce an estimation.
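As a toy illustration of the probe gap model mentioned above (not the exact estimator of any specific tool), the following sketch applies the Spruce-style relation between input and output gaps: a gap widened by cross-traffic at a bottleneck of known capacity C reveals the cross-traffic rate, and hence the available bandwidth. The numbers are hypothetical.

```python
# Probe gap model sketch: C is the known bottleneck capacity, g_in the gap
# between two probes at the sender, g_out the gap measured at the receiver.
# Cross-traffic queued between the probes widens the gap.

def probe_gap_estimate(capacity_mbps, g_in_us, g_out_us):
    """Spruce-style estimator: avail = C * (1 - (g_out - g_in) / g_in)."""
    cross = capacity_mbps * (g_out_us - g_in_us) / g_in_us  # cross-traffic rate
    return capacity_mbps - cross

# On a 100 Mbps link, probes spaced 120 us that arrive 140 us apart suggest
# roughly 100 * (1 - 20/120) = 83.3 Mbps of available bandwidth.
print(round(probe_gap_estimate(100, 120, 140), 1))  # -> 83.3
```

Self-induced-congestion tools instead search for the lowest probing rate at which such gaps (or one-way delays) start to grow, which is why they necessarily fill intermediate queues during the search.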
5 Conclusion
In this work we presented the largest experimental comparison of existing available bandwidth measurement tools on a laboratory testbed. We compared the tools' performance in terms of intrusiveness, response time and accuracy in the presence of different cross-traffic loads. All the tests have been carried out on a flexible and highly customizable laboratory testbed based only on open-source software, low-cost personal computers and simple network devices. Preliminary results show that ASSOLO, Pathload and Yaz are accurate and scale well with increasing traffic loads, while Abing seems to be the best choice from the speed-intrusiveness point of view. Although the programs considered in this work represent the majority of existing active estimation tools, we intend to extend the experimental comparison to more candidates as they become freely available and usable. Ongoing work is devoted to including Netest [32] in our testbed – a promising tool whose source code, however, relies on a home-made and unmaintained build system which does not compile successfully under any modern GNU/Linux environment. BART is another recent tool which is being used for experiments over the European research measurement infrastructure Etomic. However, Ericsson owns the intellectual property rights to BART and the code has not yet been freely released for scientific purposes. The laboratory testbed we used is admittedly quite simple: the single-bottleneck topology and the limited number of links oversimplify reality, and the CBR or Poissonian cross-traffic sources do not fully capture the complexity of actual communication flows. We also did not consider long-term oscillations or biases in the estimations, and the analysis we performed does not include highly congested scenarios. Although quite promising, the preliminary results obtained from our tests are not sufficient to draw any definitive conclusions on how the tools will behave on real networks.
As a further development, we plan to complete the analysis by extending the set of considered scenarios to actual Internet paths, or by using a testbed with a more complex topology loaded with real-world traffic traces.
References

1. Chuan, W., Baochun, L., Shuqiao, Z.: Characterizing Peer-to-Peer Streaming Flows. IEEE JSAC 25(9), 1612–1626 (2007)
2. Favalli, L., Folli, M., Lombardo, A., Reforgiato, D., Schembra, G.: A Bandwidth-Aware P2P Platform for the Transmission of Multipoint Multiple Description Video Streams. In: Proceedings of the Italian Networking Workshop 2009 (2009)
3. Shamsi, J., Brockmeyer, M.: Principles of Network Measurement. In: Misra, S., Misra, S.C., Woungang, I. (eds.) Selected Topics in Communication Networks and Distributed Systems, pp. 1–40. World Scientific, Singapore (2010)
4. Guerrero, C.D., Labrador, M.A.: On the applicability of available bandwidth estimation techniques and tools. Computer Communications 33(1), 11–22 (2010)
5. Jain, M., Dovrolis, C.: Pathload: A measurement tool for end-to-end available bandwidth. In: Proceedings of the 3rd International Workshop on Passive and Active Network Measurement, PAM 2002 (2002)
6. Hu, N., Steenkiste, P.: Evaluation and Characterization of Available Bandwidth Probing Techniques. IEEE JSAC 21(6), 879–894 (2003)
7. Navratil, J., Cottrell, R.L.: ABwE: A Practical Approach to Available Bandwidth. In: Proceedings of the 4th International Workshop on Passive and Active Network Measurement, PAM 2003 (2003)
8. Strauss, J., Katabi, D., Kaashoek, F.: A measurement study of available bandwidth estimation tools. In: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, IMC 2003 (2003)
9. Ribeiro, V., Riedi, R., Baraniuk, R., Navratil, J., Cottrell, L.: PathChirp: Efficient Available Bandwidth Estimation for Network Paths. In: Proceedings of the 4th International Workshop on Passive and Active Network Measurement, PAM 2003 (2003)
10. Johnsson, A., Melander, B., Bjorkman, M.: DietTopp: A First Implementation and Evaluation of a Simplified Bandwidth Measurement Method. In: Proceedings of the 2nd Swedish National Computer Networking Workshop (2004)
11. Sommers, J., Barford, P., Willinger, W.: A Proposed Framework for Calibration of Available Bandwidth Estimation Tools. In: Proceedings of the 11th IEEE Symposium on Computers and Communications, ISCC 2006, pp. 709–718 (2006)
12. Goldoni, E., Rossi, G., Torelli, A.: Assolo, a New Method for Available Bandwidth Estimation. In: Proceedings of the Fourth International Conference on Internet Monitoring, ICIMP 2009, pp. 130–136 (May 2009)
13. Tan, W., Zhanikeev, M., Tanaka, Y.: ABshoot: A Reliable and Efficient Scheme for End-to-End Available Bandwidth Measurement. In: Proceedings of the IEEE Region 10 Conference TENCON 2006, pp. 1–4 (2006)
14. Pasztor, A.: Accurate Active Measurement in the Internet and its Applications. Ph.D. Thesis, University of Melbourne, Department of Electrical and Electronic Engineering (2003)
15. Qiang, W., Liang, C.: FEAT: Improving Accuracy in End-to-end Available Bandwidth Measurement. In: Proceedings of the IEEE Global Telecommunications Conference, GLOBECOM 2006, pp. 1–4 (2006)
16. Ekelin, S., Nilsson, M., Hartikainen, E., Johnsson, A., Mangs, J.-E., Melander, B., Bjorkman, M.: Real-Time Measurement of End-to-End Available Bandwidth using Kalman Filtering. In: Proceedings of the 10th IEEE/IFIP Network Operations and Management Symposium, NOMS 2006, pp. 73–84 (2006)
17. Sedighizad, M., Seyfe, B., Navaie, K.: MR-BART: multi-rate available bandwidth estimation in real-time. In: Proceedings of the 3rd ACM Workshop on Performance Monitoring and Measurement of Heterogeneous Wireless and Wired Networks, PM2HW2N 2008, pp. 1–8 (2008)
18. Shriram, A., Murray, M., Hyun, Y., Brownlee, N., Broido, A., Fomenkov, M., claffy, k.: Comparison of public end-to-end bandwidth estimation tools on high-speed links. In: Dovrolis, C. (ed.) PAM 2005. LNCS, vol. 3431, pp. 306–320. Springer, Heidelberg (2005)
19. Labit, Y., Owezarski, P., Larrieu, N.: Evaluation of active measurement tools for bandwidth estimation in real environment. In: Proceedings of the IEEE/IFIP Workshop on End-to-End Monitoring Techniques and Services, E2EMON 2005, pp. 71–85 (2005)
20. Angrisani, L., D'Antonio, S., Esposito, E., Vardusi, M.: Techniques for available bandwidth measurement in IP networks: a performance comparison. Elsevier Computer Networks 50(3), 332–349 (2006)
21. Castellanos, C.U., Villa, D.L., Teyeb, O.M., Elling, J., Wigard, J.: Comparison of Available Bandwidth Estimation Techniques in Packet-Switched Mobile Networks. In: Proceedings of the 17th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, pp. 1–5 (2006)
22. Shriram, A., Kaur, J.: Empirical Evaluation of Techniques for Measuring Available Bandwidth. In: Proceedings of the 26th IEEE International Conference on Computer Communications, INFOCOM 2007, pp. 2162–2170 (2007)
23. Ait Ali, A., Michaut, F., Lepage, F.: End-to-End Available Bandwidth Measurement Tools: A Comparative Evaluation of Performances. In: Proceedings of the 4th International Workshop on Internet Performance, Simulation, Monitoring and Measurements, IPS-MoMe 2006, pp. 1–14 (2006)
24. Urvoy-Keller, G., En-Najjary, T., Sorniotti, A.: Operational comparison of available bandwidth estimation tools. ACM SIGCOMM Comput. Commun. Rev. 38(1), 39–42 (2008)
25. Guerrero, C.D., Labrador, M.A.: Experimental and Analytical Evaluation of Available Bandwidth Estimation Tools. In: Proceedings of the 31st IEEE Conference on Local Computer Networks 2006, pp. 710–717 (2006)
26. University of Pavia, Networking Lab: Collection of Available Bandwidth Estimation Tools, http://netlab-mn.unipv.it/avail-bw/
27. Avallone, S., Guadagno, S., Emma, D., Pescapé, A., Ventre, G.: D-ITG Distributed Internet Traffic Generator. In: Proceedings of the First International Conference on Quantitative Evaluation of Systems, QEST 2004, pp. 316–317 (2004)
28. Hemminger, S., Kuznetsov, A., et al.: iproute2 utility suite, http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2
29. Croce, D., Mellia, M., Leonardi, E.: The Quest for Bandwidth Estimation Techniques for large-scale Distributed Systems. In: Proceedings of ACM HotMetrics 2009 (2009)
30. Combs, G., et al.: The Wireshark Network Protocol Analyzer, http://www.wireshark.org
31. Lao, L., Dovrolis, C., Sanadidi, M.Y.: The probe gap model can underestimate the available bandwidth of multihop paths. ACM SIGCOMM Comput. Commun. Rev. 36(5), 29–34 (2006)
32. Jin, G., Tierney, B.: Netest: a tool to measure the maximum burst size, available bandwidth and achievable throughput. In: Proceedings of the International Conference on Information Technology: Research and Education, ITRE 2003, pp. 578–582 (2003)
On the Use of TCP Passive Measurements for Anomaly Detection: A Case Study from an Operational 3G Network

Peter Romirer-Maierhofer (1), Angelo Coluccia (2), and Tobias Witek (1)

(1) Forschungszentrum Telekommunikation Wien (FTW), Austria
(2) Università del Salento, Italy
[email protected]
Abstract. In this work we discuss the use of passive measurements of TCP performance indicators in support of network operation and troubleshooting, presenting a case study from a real 3G cellular network. From the analysis of TCP handshaking packets measured in the core network we infer Round-Trip Times (RTT) on both the client and server sides, separately for the UMTS/HSPA and GPRS/EDGE sections. We also keep track of the relative share of packet pairs which did not lead to a valid RTT sample, e.g. due to loss and/or retransmission events, and use this metric as an additional performance signal. In a previous work we identified the risk of measurement bias due to early retransmission of TCP SYNACK packets by some popular servers. In order to mitigate this problem we introduce here a novel algorithm for dynamic classification and filtering of early retransmitters. We present a few illustrative cases of abrupt change observed in the real network, based on which we derive some lessons learned about using such data for detecting anomalies in a real network. Thanks to these measurements we were able to discover a hidden congestion bottleneck in the network under study.
1 Motivations
The evolving nature and functional complexity of a 3G network increase its vulnerability to network problems and errors. Hence, the timely detection and reporting of network anomalies is highly desirable for operators of such networks. Passive packet-level monitoring can be an important instrument for supporting the operation and troubleshooting of 3G networks. A natural approach to validating the health status of a network is the extraction of performance indicators from passive probes, e.g. Round-Trip Time (RTT) percentiles and/or the frequency of retransmissions. These indicators, which we hereafter refer to as "network signals", can be analyzed in real time in order to detect anomalous deviations from the "normal" network performance observed in the past. This approach rests on two fundamental assumptions: i) that the extracted network signals are stable over time under problem-free operation, and ii) that anomalous phenomena generate appreciable deviations in at least one of the observed signals. In an earlier work [3] we demonstrated that passively extracted TCP RTT distributions are relatively

F. Ricciato, M. Mellia, and E. Biersack (Eds.): TMA 2010, LNCS 6003, pp. 183–197, 2010. © Springer-Verlag Berlin Heidelberg 2010
Fig. 1. Monitoring setting
stable over time in the operational network under study. Here, we take the next step and present some cases from a real 3G network where abnormal events were reflected by a sudden change in the analyzed network signals. Our findings are promising regarding the possibility of leveraging passively extracted TCP performance indicators for troubleshooting of real 3G networks. The TCP performance indicators presented in this work are obtained from the passive analysis of TCP handshaking packets. The idea of measuring TCP performance by observing handshaking packets was already presented in 2004 by Benko et al. [1], who reported results from an operational GPRS network. Vacirca et al. [2] reported RTT measurements from an operational 3G network including GPRS and also UMTS. In [3] we have shown that RTT values have decreased considerably due to the introduction of EDGE and HSPA and the consequential increase of radio bandwidth. Several studies [4,5,6,7] presented passive estimation of TCP RTT in wired networks inferred also from TCP DATA/ACK pairs. However, this approach is complicated by loss, reordering and duplication of TCP segments as well as by delayed acknowledgements [4]. Jaiswal et al. [5] measure TCP RTT by keeping track of the current congestion window of a connection by applying finite state machines (FSM). Since the computation of the congestion window differs among different flavors of TCP, the authors suggest the parallel operation of several FSMs, each tailored to a specific TCP flavor. Rewaskar et al. [6] identified Operating System (OS)-specific differences in prominent TCP implementations, which may bias the passive analysis of TCP segment traces if not handled properly. This issue is addressed by implementing four OS-specific state machines to measure TCP RTT while discarding all connections with fewer than 10 transmitted segments. Mellia et al. [7] compute the TCP RTT by applying the moving average estimator standardized in [8].
In the case of short TCP flows, e.g. client HTTP requests, no RTT samples may be collected by this approach [8]. As shown in [4], the RTT inferred from TCP handshake packets is a reasonable approximation of the minimum RTT of the whole connection. Motivated by this result, we elaborate on the use of such RTT measurements for long-term and real-time anomaly detection in an operational 3G network.
Fig. 2. Measurement schemes: (a) Computation of RTT, (b) Ambiguity by retransmission
We believe this method to be much more scalable, since it neither requires the analysis of all packets of a TCP flow nor relies on any knowledge about the involved TCP flavors and/or Operating Systems. Moreover, in contrast to [6,7], this approach does not exclude short TCP flows from RTT measurements. The detection of congestion bottlenecks by passively inferring spurious retransmission timeouts from DATA/ACK pairs was presented in [9]. In this work we show that our simpler approach of extracting RTT just from TCP handshake packets is also suitable for detecting hidden congestion bottlenecks.
2 Measurement Methodology
The measurement setting is depicted in Fig. 1. Packet-level traces are captured on the so-called "Gn interface" links between the GGSN and SGSN — for more information about the 3G network structure refer to [10]. We use the METAWIN monitoring system developed in a previous research project and deployed in the network of a mobile operator in the EU — for more details refer to [11]. By extracting and correlating information from the 3GPP layers (GTP protocol on Gn, see [12]) the METAWIN system enables discrimination of connections originated in the GPRS/EDGE and UMTS/HSPA radio sections. We briefly recap the RTT measurement methodology already introduced in [3]. We only consider TCP connections established in uplink, i.e. initiated by Mobile Stations in the Radio Access Network. By measuring the time span between the arrival of a SYN and the arrival of the associated SYNACK we infer the (semi-)RTT between the Gn link and a remote server in the Internet — denoted by "server-side RTT" in Fig. 2(a). Similarly, we estimate the (semi-)RTT in the Radio Access Network (RAN), between the Gn link and the Mobile Station — referred to as "client-side RTT" — by calculating the time span between the arrival of the SYNACK and the associated ACK. Valid RTT samples may only be estimated from unambiguous and correctly conducted 3-way handshakes. Those cases where the association between packet pairs is ambiguous (e.g. due to retransmission, duplication) have to be discarded. Within a measurement interval (e.g. 5 minutes) valid RTT samples are aggregated into equal-sized bins.
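The semi-RTT extraction described above can be sketched as follows. This is a minimal illustration assuming packets are given as (kind, timestamp) tuples per TCP connection; the data layout is hypothetical and not that of the METAWIN system.

```python
# Sketch of semi-RTT extraction from 3-way handshake packets observed on the
# Gn link; only unambiguous, correctly conducted handshakes yield samples.

def handshake_rtts(packets):
    """Return (server_rtt, client_rtt) in seconds, or None if the handshake
    is ambiguous (e.g. retransmitted SYN or SYNACK) or incomplete."""
    syns = [t for k, t in packets if k == "SYN"]
    synacks = [t for k, t in packets if k == "SYNACK"]
    acks = [t for k, t in packets if k == "ACK"]
    if len(syns) != 1 or len(synacks) != 1 or len(acks) != 1:
        return None  # retransmission or duplication: discard as ambiguous
    server_rtt = synacks[0] - syns[0]   # Gn link <-> remote server
    client_rtt = acks[0] - synacks[0]   # Gn link <-> mobile station (RAN)
    if server_rtt < 0 or client_rtt < 0:
        return None
    return server_rtt, client_rtt

s, c = handshake_rtts([("SYN", 0.000), ("SYNACK", 0.030), ("ACK", 0.280)])
print(round(s * 1000), round(c * 1000))  # -> 30 250 (ms, server / client side)
```

A connection with a retransmitted SYNACK would produce two SYNACK timestamps and be discarded, which is exactly the invalid-sample case discussed in Sect. 2.1.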
The corresponding bin width is 0.1 ms for RTT samples <100 ms and 1 ms for RTT samples ≥100 ms. This two-level binning keeps the total number of bins reasonably low while offering sufficiently accurate resolution both for the lower RTT samples typically measured at the server side and for the higher RTT samples typically measured at the client side. We collect RTT samples separately for TCP handshakes to port 80 and to all other ports. As shown in § 3.4, this differentiation is motivated by the fact that part of the traffic to port 80 might be intercepted by a network-wide proxy.

2.1 Invalid Sample Ratio
We mentioned above the problem of ambiguous association of handshake pairs, which can arise for different reasons. One example involving retransmissions is depicted in Fig. 2(b). When the SYNACK retransmission timer expires before an ACK is received, the server retransmits a second SYNACK and this leads to ambiguity in the SYNACK/ACK association: it cannot be decided whether the ACK packet is acknowledging the first SYNACK (correctly received) or the second one (in case the first one was lost along the path). Similar ambiguities can occur for SYN/SYNACK pairs, if e.g. a SYNACK is lost and the client retransmits a second SYN before the expiration of the SYNACK retransmission timer at the server. In each time interval (e.g. 5 minutes) we record the relative share of ambiguous SYN/SYNACK and SYNACK/ACK pairs, which we denote by IS_SYN and IS_SYNACK respectively. Since a retransmission timeout may expire due to loss of the first SYNACK packet in the Radio Access Network (RAN), i.e. on the client side of the monitored path, the IS_SYNACK indicator correlates with — and can be used as a proxy signal for — the level of packet loss in the radio section. However, as stated above, ambiguous pairs may also be due to other causes, and the actual level of packet loss will in general stay below the value of IS_SYNACK. Similar considerations apply to IS_SYN for the server-side section. Focusing on IS_SYNACK, we will show that the presence of so-called "early retransmitting servers" has a non-negligible influence on this signal. Nonetheless, we expect that an anomalous event raising the network-wide packet loss will be reflected in anomalous deviations of IS_SYNACK. In the following, we introduce an indicator built upon IS_SYNACK which can be used to reveal anomalous loss events in the network. In the generic measurement time bin, for each active user i we denote by m_i the number of invalid samples, i.e.
SYNACKs which could not be univocally associated with an ACK, and by n_i the total number of SYNACKs. A simple indicator for IS_SYNACK can be defined as the total ratio of invalid samples across all terminals:

S_G := (Σ_{i=1}^{I} m_i) / (Σ_{i=1}^{I} n_i)    (1)
where I denotes the total number of active terminals. However, the uneven distribution of n_i — which is typically heavy-tailed — injects a large amount
On the Use of TCP Passive Measurements for Anomaly Detection 0.2
0.2
EGR
S = 10
0.15
Ratio
Ratio
0.15
187
0.1
0.05
0.1
0.05
0 0
12
24
36
48
60
0
72
0
Hours after Day 1 00:00:00
12
24
36
48
60
72
Hours after Day 1 00:00:00
(a) Timeseries of SG
(b) Timeseries of SL (10)
Fig. 3. Timeseries of estimation of invalid sample ratio, 3 days, 5 min bins
of variance into S_G: since most of the terminals have low traffic (low n_i), a few terminals with high traffic (high n_i) and a high loss level (high m_i, due e.g. to bad radio conditions, self-congestion or other terminal-specific reasons) might occasionally inflate the value of S_G. This results in a very noisy signal, which complicates the detection of network-wide anomalies. This is clearly visible in the example of Fig. 3(a), which shows a three-day timeseries of S_G. In a previous work [13], we derived a low-variance indicator by taking the weighted average of the individual (per-terminal) ratios, formally:

S_L(θ) = Σ_{i=1}^{I} w̃_i (m_i / n_i)    (2)

with

w̃_i := ñ_i / Σ_{j=1}^{I} ñ_j,    ñ_i := min(n_i, θ).    (3)
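The two estimators can be sketched in a few lines; the per-terminal counts below are hypothetical, chosen to show how a single heavy hitter inflates the global ratio while the capped weights of Eqs. (2)-(3) keep the signal low.

```python
# Sketch of the invalid-sample-ratio estimators of Eqs. (1)-(3), computed on
# hypothetical per-terminal counts (m_i invalid samples, n_i total SYNACKs).

def s_global(m, n):
    """Eq. (1): total ratio of invalid samples across all terminals."""
    return sum(m) / sum(n)

def s_local(m, n, theta=10):
    """Eqs. (2)-(3): weighted average of per-terminal ratios, with the weight
    of heavy-hitting terminals capped at theta to reduce variance."""
    n_t = [min(ni, theta) for ni in n]
    tot = sum(n_t)
    return sum((nt / tot) * (mi / ni) for nt, mi, ni in zip(n_t, m, n))

# Ten low-traffic terminals with no losses, one heavy hitter with 50% losses:
m = [0] * 10 + [100]
n = [2] * 10 + [200]
print(round(s_global(m, n), 3))  # inflated by the single heavy hitter
print(round(s_local(m, n), 3))   # capped weight keeps the signal low
```

With these numbers S_G is about 0.455 while S_L(10) stays near 0.167, mirroring the variance reduction visible in Fig. 3.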
The cut-off parameter θ must be set heuristically — it is shown in [13] that this setting is not overly critical. In this work we have chosen θ = 10. In Fig. 3(b) we plot S_L(10) for the same time period as in Fig. 3(a). We observe that S_L(10) — which we call Invalid Sample Ratio (ISR) hereafter — provides a much clearer signal than S_G.

2.2 Impact of Early Retransmitted SYNACK Packets
In [3] we identified the risk of a possible measurement bias due to early retransmission of SYNACK. In fact, some popular servers — which we hereafter refer to as early retransmitters — resend the SYNACK after 300-500 ms instead of 3 s, the recommended value for the TCP Retransmission TimeOut (RTO) [14]. This strategy, which aims at being more responsive to ACK losses, causes an excess of spurious SYNACK retransmissions for wireless
Fig. 4. Invalid sample ratio S_L(10), 4 days, 5 min bins: (a) all servers, (b) without early retransmitters
connections with high delays such as GPRS/EDGE, and therefore generates a high number of ambiguous SYNACK/ACK associations like the one outlined in Fig. 2(b). Therefore, we can infer the client-side RTT between the SYNACK of an early retransmitter and the associated ACK only if the ACK packet arrives within a retransmission timeout (RTO) of 300-500 ms. All other RTT samples are discarded due to the ambiguity of the retransmitted SYNACK packets. This is particularly a problem in GPRS, where 50% of the client-side RTT samples are usually above 500 ms (ref. [3]) and hence a significant fraction of ACK packets may not arrive in time before the early retransmission of the SYNACK. As a consequence, measurements involving early retransmitters include only client-side RTT samples below 300-500 ms, while higher RTT samples are invalidated by retransmissions of the SYNACK. From the discussion above, it is clear that early retransmitters may bias summary statistics such as client-side RTT percentiles and the ISR. Although a shifted indicator with a fixed offset bias might not be a problem for the task of detecting abrupt changes, the biasing effect of early retransmitters is more problematic since their traffic share can change over time, resulting in artifactual deviations that do not mirror network anomalies. For that reason, we introduce a simple algorithm for the dynamic classification and filtering of early retransmitters, described in the following section, aimed at mitigating their influence on the final measurements. As an example of the possible bias introduced by early retransmitters, we plot in Fig. 4 two four-day timeseries of the ISR (i.e. S_L(10)) for traffic to port 80, separately for GPRS/EDGE and UMTS/HSPA. We observe that the ISR for UMTS/HSPA is very stable at a value of around 0.03 before filtering the early retransmitters (ref. Fig. 4(a)) and reduces to 0.02 after filtering (ref. Fig. 4(b)). Hence, the presence of early retransmitters introduces a relative error of 50% in the ISR of UMTS/HSPA. In the case of GPRS/EDGE the ISR signal exhibits a cyclic time-of-day variation between around 0.06 and 0.16. Comparing Fig. 4(a) and 4(b), we observe that early retransmitters introduce an absolute error of 0.01-0.02 in the case of GPRS/EDGE.
Fig. 5. Total number of SYNACKs vs. number of unambiguously replied SYNACKs per server, port 80, 5 min bins, 20:00 to 20:55: (a) UMTS/HSPA (residual servers IS_SYNACK = 0.017, early retransmitters IS_SYNACK = 0.131), (b) GPRS/EDGE (residual servers IS_SYNACK = 0.142, early retransmitters IS_SYNACK = 0.201)
2.3 Filtering of Early Retransmitting Servers
In order to mitigate the statistical bias introduced by early retransmitters, we implement a dynamic classification and filtering of these servers, described in the following. For each measurement interval we count the total number of SYNACK denoted by NSY N ACK and the number of SYNACK retransmitted after a retransmission time out <600 ms denoted by NEARLY . We define the EARLY early retransmission ratio r = NNSY . Within an observation period of 1 N ACK minute we compute a ratio ri (k) for each server separately. Finally, a server is classified as early retransmitter if ri (k) > 0.01. If an early retransmitter did not send any SYNACK within the last five observation periods (i.e. 5 minutes), we remove it from the class of early retransmitters. Instead of discarding those measurements which involved early retransmitters, we collect them in a separate class. Let NU denote the number of unambiguously replied SYNACKs. Recall that a SYNACK is unambiguously replied only if a client ACK arrives within the SYNACK retransmission timeout of the destined server (ref. Fig. 2(b)). In Fig. 5 we plot for each server NSY N ACK versus NU over a measurement period of one hour in time bins of 5 minutes for handshakes established to port 80. Measurement points of early retransmitters are represented by a green ’x’, while residual servers are depicted by a red ’+’. We observe two separate clusters in Fig. 5(a) for NU > 700. The points of early retransmitters show a clear offset towards higher values of NSY N ACK , since a significant number of SYNACKs is ambiguously replied due to early retransmission of the SYNACK packets. In contrast to that, the points of residual servers are located along the line where each SYNACK is unambiguously replied by an ACK packet (i.e. NSY N ACK ≈ NU ). Interesting to note, both clusters are overlapping for NU < 700. 
This might be explained by the fact that a few early retransmissions per server already suffice to exceed our classification threshold of r_i(k) > 0.01 if the server sends only a small number of SYNACKs, which leads to false negatives in our classification method.

[Fig. 6. Client-side RTT percentiles in UMTS, different paths, 7 days, 5 min bins. (a) Path A via SGSN and RNC; (b) Path B via RNC only]

Moreover, there might be intervals in which early retransmitters do not retransmit any SYNACKs, because all ACKs arrive before the expiration of the (short) SYNACK retransmission timeout. In such an interval even an early retransmitter would be located along the line N_SYNACK ≈ N_U in Fig. 5(a). The same scatterplot for connections via GPRS/EDGE is depicted in Fig. 5(b). The qualitative shape is comparable to Fig. 5(a); however, the clusters are less clearly separated. In Fig. 5(b) we also observe points where N_SYNACK ≫ N_U for servers not classified as early retransmitters. Note that this is not necessarily due to misclassification of the corresponding servers, since a SYNACK can be replied ambiguously also due to effects other than early retransmission of the SYNACK.
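The per-server classification and aging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the dictionary-based bookkeeping, and the per-period input format are assumptions; the 600 ms cutoff, the 0.01 threshold, and the five-period expiry come from the text.

```python
from collections import defaultdict

EARLY_RTO = 0.6        # SYNACK retransmitted after <600 ms counts as "early"
RATIO_THRESHOLD = 0.01 # r_i(k) above which a server is flagged
IDLE_PERIODS = 5       # observation periods without SYNACKs before unflagging


class EarlyRetransmitterFilter:
    """Dynamic classification of early retransmitting servers."""

    def __init__(self):
        self.flagged = {}  # server -> number of consecutive idle periods

    def update(self, per_server_counts):
        """Process one 1-minute observation period.

        per_server_counts maps server -> (n_synack, n_early) for that period.
        """
        for server, (n_synack, n_early) in per_server_counts.items():
            if n_synack == 0:
                continue
            r = n_early / n_synack            # early retransmission ratio r_i(k)
            if r > RATIO_THRESHOLD:
                self.flagged[server] = 0      # (re)flag and reset idle counter
        # age out flagged servers that sent no SYNACKs in this period
        for server in list(self.flagged):
            if per_server_counts.get(server, (0, 0))[0] == 0:
                self.flagged[server] += 1
                if self.flagged[server] >= IDLE_PERIODS:
                    del self.flagged[server]

    def is_early(self, server):
        return server in self.flagged
```

Samples from flagged servers would then be routed to the separate measurement class rather than discarded, as the text describes.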
3 Measurement Results
In the following we present three illustrative examples of abrupt changes in the network-wide performance signals (i.e., IS_SYNACK and the RTT percentiles) found in an operational 3G network in Austria between June and October 2009. By investigating the root causes of these sudden deviations, we discuss relevant practical issues and show the applicability of these performance signals for detecting anomalies in a real 3G network.

3.1 Client-Side RTT Per Network Area
One interesting feature of the METAWIN monitoring system [11] is the analysis of TCP RTT separately for different SGSN and RNC areas. In Fig. 6 we plot two time series of client-side RTT percentiles for connections established towards TCP port 80 via UMTS/HSPA. The RTT percentiles for a specific SGSN area are depicted in Fig. 6(a), while Fig. 6(b) shows the RTT percentiles measured via an RNC directly connected to the GGSN (see the dashed line labelled "direct tunnel" in Fig. 1). In both cases the percentiles are relatively stable over time, showing statistical fluctuations during night hours, when the traffic load, and consequently the number of RTT samples per measurement bin, is low. As expected, the direct path bypassing the SGSN exhibits a lower client-side RTT.

[Fig. 7. Temporary increase of S_L(10), 1 day time series, 5 min bins]

3.2 Temporary Increase of Packet Loss
In Fig. 7 we report a 24-hour time series of the ISR, separately for UMTS/HSPA and GPRS/EDGE. The estimated ISR of GPRS/EDGE shows a time-of-day variation between slightly above 0.04 in the night hours and around 0.16 in the peak hour after 18:00. However, we observe a sudden increase of the estimated loss probability S_L(10) of UMTS/HSPA starting at 04:00 and lasting until around 08:00. This sudden shift from 0.02 to 0.08, with a distinct spike of around 0.15, is clearly anomalous behavior. A deeper exploration of the phenomenon showed that this anomaly was caused by a temporary network problem associated with the reconfiguration of one GGSN, which led to partial packet loss at a specific site of the network. The presented anomaly is an important confirmation that, as we expected, an anomalous increase in packet loss is reflected by an abnormal deviation in the estimated invalid sample ratio IS_SYNACK.

3.3 Detection of Bottleneck Link
Within our analysis of TCP RTT, we discriminate between server IP addresses allocated to the 3G operator (used e.g. for gateway servers and internal application servers) and all other IP addresses of the public Internet. The server-side percentiles over two consecutive days, in time bins of 5 minutes, for internal servers deployed by the mobile network operator only, are depicted in Fig. 8. The percentiles show a slight time-of-day effect, i.e., the server-side RTT is higher during the peak hours in the evening. This might be explained by two phenomena. First, an increase of the traffic rate may lead to higher link utilization and thus larger delays on the path from/to the internal servers. Second, a higher load at the involved servers may increase their response time and hence also the server-side RTT.

[Fig. 8. Time series of server-side RTT percentiles, internal servers, 2 days, 5 min bins]

Besides the slight time-of-day effect of the percentiles, we observe that 75% of the RTT samples take values below ≈2 ms. However, at around 20:30 on day 1 there is an abrupt change in the RTT percentiles. For instance, the 75th percentile is suddenly shifted from 2 ms to 10 ms. We observe that this shift of the RTT percentiles persists for a period of about two hours. This is clearly anomalous behavior. By also taking into account other signals obtained from the METAWIN system [11], we revealed that this RTT shift coincided with a significant increase of UDP/RTP traffic from the video streaming server during the live broadcast of a soccer match. In fact, this traffic increase, and consequently the abrupt shift in the server-side RTT, was triggered by a significant number of users watching this live broadcast. Note the notch in the RTT percentiles at around 21:15, during the half-time break of the soccer match. Moreover, Fig. 8 shows a second abrupt change in the RTT percentiles on the second day, with a clear spike around 14:00. Similarly to the example of the soccer match, this anomaly was caused by users watching the live broadcast of a Formula One race. Note that the increase of video streaming traffic raised the server-side RTT not only towards the video streaming server, but also towards all internal servers located in the same subnet. Our findings finally pointed at a hidden congestion bottleneck on the path towards these internal servers of the network operator. After we reported our results to the network staff, the problem was fixed by increasing the capacity of the corresponding link. It is interesting to note that congestion caused at the UDP layer was discovered by analyzing just the handshaking packets of TCP at a single monitoring point. This result confirms
the value of leveraging passively extracted TCP RTT for detecting bottleneck links in an operational 3G network.

[Fig. 9. Time series of client-side RTT percentiles, 5 days, 5 min bins. (a) UMTS/HSPA; (b) GPRS/EDGE]

3.4 Activation of Transparent Proxy
Fig. 9 depicts two 5-day time series of client-side RTT percentiles, separately for UMTS/HSPA and GPRS/EDGE. The client-side percentiles of UMTS/HSPA are stable over time and do not show a time-of-day variation. The median RTT is slightly above 100 ms. In the network under study, traffic load steadily increases during day hours (reaching its peak after 20:00) and decreases again during night hours. The fact that the client-side percentiles of UMTS/HSPA are independent of the actual traffic load suggests that the network under study is well provisioned. Also the client-side percentiles of GPRS/EDGE in Fig. 9(b) are stable and independent of the variations of traffic load. However, Fig. 9(b) shows a sudden and persistent shift in the client-side percentiles of GPRS/EDGE in the morning of the third day. For instance, the median RTT is shifted from below 600 ms to around 700 ms. Further analysis revealed that the shift of RTT percentiles observed in Fig. 9(b) was caused by a reconfiguration of the network, specifically the activation of a network-wide proxy mediating TCP connections to port 80 established in the GPRS/EDGE RAN.

For further clarification we now elaborate on the dependence of the client-side RTT on the SYNACK retransmission timeouts of remote servers. Recall from § 2.1 that we cannot compute a valid client-side RTT whenever a SYNACK is retransmitted by a remote server, since this retransmission leads to an ambiguous relation between the two observed SYNACKs and the ACK. A remote server retransmits a SYNACK if it did not receive the client ACK within a specific retransmission timeout (RTO). In other words, we can compute a valid client-side RTT (inferred from an unambiguous SYNACK/ACK pair) if and only if a client ACK arrives at the remote server before the expiration of its SYNACK retransmission timeout, referred to as RTO_SYNACK in Fig. 10(a).

[Fig. 10. Relation of SYNACK retransmission timeout and client-side RTT. (a) Dependence of client-side RTT; (b) CDF of RTO per server class]

Let T_SYNACK denote the time required for a server SYNACK to arrive at our passive probe, and let T_ACK denote the delay a client ACK experiences between our monitoring point and the remote server. The maximum client-side round-trip time RTT_client_max that can be computed before a server invalidates the current sample by retransmitting the SYNACK is then given by

    RTT_client_max = RTO_SYNACK − (T_SYNACK + T_ACK).    (4)
We note that the time delay T_SYNACK + T_ACK is equivalent to our definition of the server-side RTT. In [3] we have shown that, in the network under study, 75% of the server-side RTT values are smaller than 40 ms and 95% are below 200 ms. This RTT is small compared to the SYNACK retransmission timeout of 3 seconds recommended in [14]. From Eq. 4 we observe that RTT_client_max is directly correlated with the setting of RTO_SYNACK at the server. In Fig. 10(b) we plot a CDF of the SYNACK retransmission timeouts inferred from SYNACK retransmissions within a time period of 2 hours, from 19:00 to 21:00, for different classes of servers. The blue dashed line refers to the RTOs of servers classified as early retransmitters. We note that 50% of their SYNACKs were retransmitted after less than 500 ms and around 93% after less than 3 seconds. These short retransmission timeouts bias the client-side RTT percentiles towards lower values, since RTT_client_max is decreased. In order to mitigate this problem we introduced the filtering of early retransmitters. In § 2.3 we have shown that the presence of early retransmitters introduces an error of 50% in our estimation of IS_SYNACK. Moreover, in Fig. 10(b) we observe that residual servers retransmit 30% of SYNACKs after a timeout of less than 1.2 seconds and 45% after less than 3 seconds (ref. black solid line in Fig. 10(b)). In contrast, the network-wide proxy deploys only RTOs above ≈3.2 seconds, which allows for a maximum measurable client-side RTT of around 3 seconds. This RTO setting of the network-wide proxy explains the persistent shift of the client-side RTT in Fig. 9(b): after its activation on day 3, the proxy increased the maximum measurable client-side RTT for all GPRS/EDGE users establishing TCP connections to port 80. This change was reflected in a larger relative share of higher client-side RTT values. In Fig. 9(a) we observe that there is no change in the client-side percentiles of UMTS/HSPA, since the proxy was activated only for users in the GPRS/EDGE RAN.

At this point it becomes obvious that client-side RTT percentiles strongly depend on the RTO settings of the involved remote servers. That means an abrupt change in the client-side RTT percentiles might be triggered not only by network problems, but also by a reconfiguration of the remote servers or by the introduction of a transparent proxy, as in the example of Fig. 9(b). One can even argue that not only early retransmitters, but all servers handling ports other than port 80 bias the RTT percentiles, since these servers also deploy a large relative share of RTOs of less than 3 seconds. One possible approach to mitigate this bias would be to filter the handshakes of those servers which retransmit at least one SYNACK after a timeout of less than 3 seconds. However, in the present example such a filtering strategy would discard around 72% of all samples from our measurements, while only 11.8% of samples are discarded by our filtering of a few early retransmitters. Hence, we regard the filtering of early retransmitters as a reasonable trade-off between measurement accuracy and the completeness of the network-wide RTT measurements. In our work we compensate for the different RTO setting of the network-wide proxy by collecting and analyzing performance signals in two separate classes: one class referring to handshakes to port 80 (i.e., traffic mediated by the network-wide proxy) and a second class for all other ports.
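Eq. (4) and the resulting measurability condition can be made concrete with a short sketch. The 3 s timeout (RFC 1122), the sub-500 ms RTOs of early retransmitters, and the 40 ms server-side RTT are taken from the text; the specific numeric combinations below are illustrative.

```python
def max_client_rtt(rto_synack, rtt_server):
    """Eq. (4): RTT_client_max = RTO_SYNACK - (T_SYNACK + T_ACK),
    where T_SYNACK + T_ACK equals the server-side RTT."""
    return rto_synack - rtt_server


def client_rtt_is_measurable(rtt_client, rto_synack, rtt_server):
    """A client-side sample is valid only if the client ACK reaches the
    server before its SYNACK retransmission timeout expires."""
    return rtt_client < max_client_rtt(rto_synack, rtt_server)


# With a 40 ms server-side RTT (75th percentile in the network under study):
# a standard 3 s timeout leaves 2.96 s of measurable client-side RTT,
# while a 500 ms early retransmitter censors everything above 0.46 s.
print(max_client_rtt(3.0, 0.04))   # 2.96
print(max_client_rtt(0.5, 0.04))   # 0.46
```

This makes the bias mechanism explicit: shrinking RTO_SYNACK truncates the upper tail of the observable client-side RTT distribution, which is exactly why the ≈3.2 s RTO of the transparent proxy shifted the GPRS/EDGE percentiles upward.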
4 Conclusions and Future Work
In this work we have investigated the possibility of exploiting passively extracted TCP performance signals for anomaly detection in an operational 3G network. We have shown that several real anomalies found in an operational network were reflected by abnormal deviations in the performance signals under study, i.e., the invalid sample ratio and the RTT percentiles. Our results show that the RTT measured at the client side depends not only on the current network status, but also on the RTO setting of the remote servers and of intermediate proxies, if present. This calls for additional caution when relying on passive TCP RTT measurements for a comparison of different datasets, e.g. from different networks or taken at different times. In fact, since RTO settings vary across servers, variations in the global RTT distribution do not necessarily relate to network-level changes but might be caused by differences in the traffic mix and/or variations in server popularity. We have proposed a dynamic classification and filtering of early retransmitters in order to mitigate their bias on the ISR and RTT measurements. We are aware that this classification and filtering involves additional computational effort and reduces the complexity gap between the approach based exclusively on the analysis of handshaking packets and the approach of considering all DATA/ACK pairs. On the other hand, if RTT measurements are exploited for the purpose of anomaly detection, relying just on
handshake packets yields a distinct advantage. Since the transmission delay of a packet depends directly on its length, variation in packet length translates into statistical fluctuation of the packet delay, and hence of the associated RTT. By inferring the RTT only from small handshake packets, we exclude these "normal" statistical variations, which would otherwise complicate the task of detecting "abnormal" RTT deviations caused by anomalous events. Our results show that the invalid sample ratio can be used as a complementary signal, e.g. to detect abnormally high levels of packet loss. Finally, thanks to our analysis of TCP RTT percentiles, we have revealed one real instance of a link bottleneck caused by a temporary increase of UDP/RTP traffic. Taken collectively, our results show that such measurements can be fruitfully exploited in support of the operation and troubleshooting of a real-world 3G network. Notably, all the abrupt changes present in the measured time series could be easily detected by means of thresholding and/or very basic change-detection algorithms. In this setting, the main challenge lies not in the design of sophisticated signal processing algorithms for time series, but rather in the extraction of a robust and reliable network signal on which to base the detection. As part of ongoing work, we are now integrating basic alarming thresholds for such signals into the on-line monitoring system.
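A minimal example of the basic change detection alluded to above is sketched below: each new bin of a percentile signal is compared against a sliding baseline. The window length and deviation factor are illustrative assumptions, not parameters from the paper.

```python
from collections import deque


def detect_shifts(signal, window=12, factor=3.0):
    """Flag bins whose value exceeds `factor` times the median of the
    previous `window` bins (e.g. 12 bins of 5 min = 1 hour of history)."""
    history = deque(maxlen=window)
    alarms = []
    for i, value in enumerate(signal):
        if len(history) == window:
            baseline = sorted(history)[window // 2]  # median of the window
            if baseline > 0 and value > factor * baseline:
                alarms.append(i)
        history.append(value)
    return alarms


# e.g. a 75th-percentile RTT series jumping from ~2 ms to ~10 ms, as in Fig. 8
series = [0.002] * 24 + [0.010] * 6 + [0.002] * 10
print(detect_shifts(series))  # -> [24, 25, 26, 27, 28, 29]
```

Even this crude detector catches the kind of persistent 2 ms to 10 ms percentile shift described in § 3.3, supporting the point that signal quality, not algorithmic sophistication, is the limiting factor.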
Acknowledgments

The Telecommunications Research Center Vienna (FTW) is supported by the Austrian Government and the City of Vienna within the competence center program COMET. This work is part of the DARWIN project [11] run at FTW.
References

1. Benko, P., Malicsko, G., Veres, A.: A Large-scale, Passive Analysis of End-to-End TCP Performance over GPRS. In: IEEE INFOCOM 2004 (2004)
2. Vacirca, F., Ricciato, F., Pilz, R.: Large-Scale RTT Measurements from an Operational UMTS/GPRS Network. In: Proc. of WICON 2005, Budapest (July 2005)
3. Romirer-Maierhofer, P., Ricciato, F., D'Alconzo, A., Franzan, R., Karner, W.: Network-wide measurements of TCP RTT in 3G. In: Papadopouli, M., Owezarski, P., Pras, A. (eds.) TMA 2009. LNCS, vol. 5537, pp. 17–25. Springer, Heidelberg (2009)
4. Aikat, J., Kaur, J., Smith, F.D., Jeffay, K.: Variability in TCP round-trip times. In: ACM SIGCOMM IMC 2003, Miami Beach, USA (October 2003)
5. Jaiswal, S., Iannaccone, G., Diot, C., Kurose, J., Towsley, D.: Inferring TCP Connection Characteristics Through Passive Measurements. In: IEEE INFOCOM 2003, San Francisco, USA (April 2003)
6. Rewaskar, S., Kaur, J., Smith, F.D.: A passive state-machine approach for accurate analysis of TCP out-of-sequence segments. ACM SIGCOMM Computer Communication Review 36(3), 51–64 (2006)
7. Mellia, M., Meo, M., Muscariello, L., Rossi, D.: Passive analysis of TCP anomalies. Computer Networks 52(14), 2663–2676 (2008)
8. RFC 2988: Computing TCP's Retransmission Timer (November 2000)
9. Ricciato, F., Vacirca, F., Svoboda, P.: Diagnosis of Capacity Bottlenecks via Passive Monitoring in 3G Networks: an Empirical Analysis. Computer Networks 51(4), 1205–1231 (2007)
10. Bannister, J., Mather, P., Coope, S.: Convergence Technologies for 3G Networks: IP, UMTS, EGPRS and ATM. Wiley, Chichester (2004)
11. METAWIN and DARWIN projects: http://userver.ftw.at/~ricciato/darwin
12. Digital cellular telecommunications system (Phase 2+); Universal Mobile Telecommunications System (UMTS); General Packet Radio Service (GPRS); GPRS Tunnelling Protocol (GTP) across the Gn and Gp interface, 3GPP TS 29.060, Version 8.9.0, Release 8 (October 2009)
13. Coluccia, A., Ricciato, F., Romirer-Maierhofer, P.: On Robust Estimation of Network-wide Packet Loss in 3G Cellular Networks. In: IEEE BWA 2009, Honolulu, USA, November 30 (2009)
14. RFC 1122: Requirements for Internet Hosts - Communication Layers (October 1989)
Author Index

Abry, Patrice 101
Barbuzzi, Antonio 87
Barlet-Ros, Pere 141, 155
Beben, Andrzej 46
Boggia, Gennaro 87
Borgnat, Pierre 101
Braun, Lothar 127
Carela-Español, Valentín 141
Carle, Georg 127
Castro, Sebastian 1
Claffy, Kimberly 1
Coluccia, Angelo 183
Dai, Hui 127
Dainotti, Alberto 141
de Donato, Walter 141
Deri, Luca 73
Fay, Damien 32
Finamore, Alessandro 115
Fontugne, Romain 101
Fukuda, Kensuke 101
Goldoni, Emanuele 171
Grieco, Luigi Alfredo 87
Haddadi, Hamed 32
Hasegawa, Haruhisa 17
Jamakovic, Almerima 32
John, Wolfgang 1
Kawahara, Ryoichi 17
Krawiec, Piotr 46
Lorenzetti, Valeria 73
Meo, Michela 115
Moore, Andrew 32
Mori, Tatsuya 17
Mortier, Richard 32
Mortimer, Steve 73
Münz, Gerhard 127
Owezarski, Philippe 59
Pescapé, Antonio 141
Romirer-Maierhofer, Peter 183
Rossi, Dario 115
Sanjuàs-Cuxart, Josep 155
Schivi, Marco 171
Shimogawa, Shinsuke 17
Sliwinski, Jaroslaw 46
Solé-Pareta, Josep 155
Solé-Simó, Marc 141
Uhlig, Steve 32
Valenti, Silvio 115
Wessels, Duane 1
Witek, Tobias 183
Zhang, Min 1