<xchcont>[ <xseq>[ <xload>[$var], <xchild>[],
    <xmap>[ <xif>[ <xwithtag>[title], <xchild>[], <xconst>[] ] ] ] ]
Fig. 4. code-a
The transformations, CM, are used to manage or use the transformation context. They provide a variable binding mechanism for the Bi-X language. <xstore>[Var] binds the source data to the variable Var, which is valid until it is released by <xfree>[Var]. <xload>[Var] accesses the bound value of a valid variable. The predicate <xwithtag>[str] holds if the source data is an element with tag str, and any transformation can be used as a predicate for <xif>[P, X1, X2]. Using the Bi-X syntax, the Bi-X code needed to perform the transformation for the example given in Section 2 is shown in Figures 3 and 4. The code is divided into two parts for readability. code-a in Figure 3 represents the code given in Figure 4, which extracts the titles from the source data. code-b in Figure 3 represents the code for extracting the author and is not shown to save space. As can be seen from these figures, Bi-X code tends to be longer than that of the one-way XML transformation languages. An XQuery interpreter has been developed to reduce the coding effort [12]. Since the expressive power of Bi-X is almost the same as that of XQuery, a user can write XQuery code for the forward transformation and automatically obtain the equivalent Bi-X code for the bidirectional transformation.

3.2 Bidirectional Property of Bi-X
In this section, the view updating property of the Bi-X language is illustrated informally to help users better understand the results of backward transformation.
A Web Service Architecture for Bidirectional XML Updating
727
That is, given an updated view, what should the updated source document look like after backward transformation? To shorten the presentation, we show only the modifications needed to update XML text contents and tags. More complex updates, such as insertion and deletion, are described elsewhere [12].

During a session of forward and backward transformation, there are two pairs of documents: the original source document and the source document after updating, and the original view and the updated view. Each pair of documents has the same structure since we are interested only in modifications here. The property of Bi-X is defined on the differences between the original and updated documents. The differences are represented as a multiset of pairs, and each pair consists of two different strings, which are either element tags or text contents. A pair represents a modification; that is, the first component is changed to the second one. To represent modifications more precisely, tags and text contents in source documents are assigned unique identifiers, while tags and text contents in xconst are associated with a specific identifier, say c. Identifiers are kept unchanged while transforming source documents and modifying views. A modification is called a bad modification if it contains strings with the c identifier; this means that data originating from the transformation code cannot be modified. The two string components of a good modification must have the same identifier, and no two good modifications in one document can have the same identifier. Two modifications are said to be equal if they make the same changes to strings with the same identifiers. We write diff(od, md) for the differences between the original document, od, and its modified version, md. For two documents with the same structure, the differences can be easily obtained by traversing the document structure and comparing each tag and text content. The view updating property of Bi-X is as follows.
Suppose sd is a source document, X a Bi-X transformation, td the target document obtained from sd by X, and td′ is obtained from td with only good modifications. After backward transformation of td′ using X, the following condition holds: diff(sd, sd′) = diff(td, td′), where sd′ is the updated sd generated by the backward transformation. Intuitively, this property means that, after a backward transformation, the modifications on the views are reflected back to the corresponding tags or text contents in the source documents.
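The diff function just described can be sketched as a parallel traversal over two same-structured documents. The tuple-based document encoding below (an element is a (tag, children) pair, text is a plain string) is a hypothetical illustration, not the Bi-X engine's internal representation.

```python
# Sketch of diff(od, md): traverse two same-structured documents in parallel
# and collect (old, new) pairs where tags or text contents differ (a multiset).
from collections import Counter

def diff(od, md):
    changes = Counter()
    def walk(a, b):
        if isinstance(a, tuple):            # element: (tag, children)
            if a[0] != b[0]:
                changes[(a[0], b[0])] += 1  # tag modification
            for ca, cb in zip(a[1], b[1]):
                walk(ca, cb)
        elif a != b:
            changes[(a, b)] += 1            # text-content modification
    walk(od, md)
    return changes

sd  = ("book", [("title", ["Bi-X"]), ("author", ["Liu"])])
sd2 = ("book", [("title", ["BiXJ"]), ("author", ["Liu"])])
assert diff(sd, sd2) == Counter({("Bi-X", "BiXJ"): 1})
```

The view updating property then says that running diff on the source pair and on the view pair yields the same multiset of modifications.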
4 Communication Protocol
The communication protocol in the data updating process comprises two phases: init and update. They are performed by the init and update services, respectively, provided by the Bi-X service. Between the two phases, the user edits the view on the client. The steps in each phase are illustrated in Figure 5 and described below.
728
Y. Hayashi et al.
Fig. 5. Communication Patterns
Fig. 6. Configuration of Implemented Bi-X Service
Init Phase
Init(1): The client sends an init message to the Bi-X server with two arguments: URI1 for the source data to be transformed and URI2 for the Bi-X code.
Init(2): The Bi-X server requests the files specified by URI1 and URI2 using the HTTP GET method.
Init(3): The machines specified in URI1 and URI2 process the HTTP GET method and return the specified files.
Init(4): The Bi-X server performs the forward transformation and sends the view to the client.

Updating Phase
Update(1): After the data is edited, the client sends an update message to the Bi-X server with three arguments: URI1 for the source data, URI2 for the Bi-X code, and the changed view.
Update(2): The Bi-X server requests the source data to be updated and the code specified in URI1 and URI2 using the HTTP GET method.
Update(3): The machines specified in URI1 and URI2 process the HTTP GET method and return the specified files.
Update(4): The Bi-X server performs the backward transformation to obtain the updated source data and sends it back to URI1 using the HTTP POST method.
Update(5): The Bi-X server performs the forward transformation using the updated source data and sends the new view to the client.
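As a rough client-side illustration, the messages of the two phases can be modeled as plain data. The field layout below is hypothetical; the real argument names and types are defined by the Bi-X service WSDL, not by this sketch.

```python
# Hypothetical client-side view of the two-phase protocol (illustration only).

def init_message(source_uri, code_uri):
    # Init(1): ask the server to fetch source (URI1) and code (URI2),
    # run the forward transformation, and return the view (Init(4)).
    return {"service": "init", "URI1": source_uri, "URI2": code_uri}

def update_message(source_uri, code_uri, changed_view):
    # Update(1): send back the edited view; the server re-fetches source and
    # code, runs the backward transformation, POSTs the new source to URI1
    # (Update(4)), and returns a freshly transformed view (Update(5)).
    return {"service": "update", "URI1": source_uri, "URI2": code_uri,
            "view": changed_view}

msg = init_message("http://content/src.xml", "http://content/code.bix")
assert msg["service"] == "init"
```

The actual transport is SOAP or REST over HTTP, as described in the next section.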
5 Bi-X Service Implementation
We implemented our Bi-X service in Java, using standard Web service technologies such as SOAP [15], the representational state transfer (REST) model [7], and WSDL [16]. The configuration is shown in Figure 6. Its application to a practical case is described in Section 6.
Axis and Tomcat. The Axis2 platform [2] is used to implement SOAP and the REST model. It runs on the Tomcat server engine [3]. Because the Bi-X service uses these standard technologies, its installation requires only that a Bi-X service archive file be registered with the containers. Users can thus easily install the Bi-X service on their own machines.

Bi-X Driver. The Bi-X driver wraps the Bi-X engine, which is a Java implementation of the Bi-X bidirectional transformation language. The driver is also written in Java. It provides the engine with the network communications used to transfer XML documents from and to the content servers. These communications use HTTP GET messages to retrieve XML documents (i.e., source and code) from content servers and HTTP POST messages to place modified XML documents (i.e., the new source) on content servers.

Bi-X Service Port and WSDL. The Bi-X service port and WSDL enable users to invoke methods such as init and update over the Internet. The types of these methods and the data structures of their arguments are specified by the WSDL. Users can easily construct SOAP clients for these methods by feeding the WSDL to an automatic program generator such as WSDL2Java of Axis. Moreover, thanks to Axis2, users can also use REST interfaces for these methods; in that case, they need only a way to access the target URLs to use the Bi-X service.
6 Application Examples
In our architecture, the client and the content servers simply need to satisfy the requirements given in Section 2. Here, we give an example of a client and a content server, with which we have tested several use cases. We also show the usability of our system using one test case that uses the CiteSeer [5] database.

6.1 Client and Content Server Example
A Bi-X service client that calls the methods provided by the server can easily be prepared using standard Web service technologies. All the necessary information can be obtained from the WSDL description of the Bi-X service. For example, a client program can be created using the WSDL2Java tool included in Axis, which generates client stub code for SOAP communication from the WSDL description. The client simply uses this code to invoke a Web service as if it were a regular Java object in the same address space. As the interface for our client, we use Justsystem xfy [10], an "integrated XML application development environment" developed by Justsystem Corporation. An advantage of using xfy in our testing is its ability to handle various kinds of XML vocabularies in an optimized and sophisticated manner. For example, texts in the XHTML vocabulary are directly editable in the xfy browser. We incorporated our client program into xfy so that it works as an xfy plug-in. We create request messages on the xfy interface and send them to the Bi-X server. The results from the server are displayed in the xfy browser. In the current update implementation, the entire document of the changed view
Fig. 7. CiteSeer View on xfy
is sent to the Bi-X server, and its well-formedness is checked in the client. Its validity against a schema is checked in the Bi-X server when the URI of the schema definition file is given. There are two basic requirements for a content server: it must be able to provide XML files and to accept modified files. For example, we can use the eXist XML DB [14] to provide source data. In this case, when receiving a request for source data, the content server extracts the source data from the DB with XQuery and sends it to the transformation engine. When the updated source data is returned, it updates the DB accordingly by executing updating queries prepared by the user. The XQuery in eXist extends standard XQuery with some update statements that can be used to create updating queries.

6.2 CiteSeer
CiteSeer is a scientific literature digital library and search engine that focuses primarily on the literature in computer and information science. It crawls the Web and harvests academic and scientific documents, and it uses autonomous citation indexing to permit querying by citation. The CiteSeer Web site has pages for correcting the information for a given document (title, abstract, summary, author(s), etc.). Any user can submit a correction through a form-based Web interface by editing the contents and submitting them. This kind of application is thus well suited to our view updating system. To test the view updating, we saved part of the original XML data taken from the CiteSeer library and performed view updating using the Bi-X server. Figure 7 shows a snapshot of the view in the xfy browser. We provide the URIs of the source XML file and the Bi-X code needed to transform it, and then press the Start button to invoke the init service. The XHTML view is generated by the Bi-X code and displayed in the xfy browser. The view contains the document
information (title, author, and titles of cited documents) in list format. We edit the information directly in the XHTML view provided by the xfy browser. The modifications are then reflected back to the source by pressing the Update button, which invokes the update service. Thus, users can create a view that includes only the contents of interest in a suitable format by creating an appropriate Bi-X code, edit the contents in the view, and update the source XML data.
7 Related Work
The Bi-X language has a bidirectional transformation style similar to that of Harmony [8] and XEditor [9], which are both domain-specific. Harmony was designed for synchronizing tree-structured data, while XEditor is mainly used for editing tree-structured data. Bi-X extends their capabilities so that it can be used for general-purpose XML processing. The differences between Bi-X and these languages are discussed in detail elsewhere [13]. In the relational database area, there has been some work on bidirectional mapping between a database and XML documents. In the approach of Braganholo et al. [4], the underlying relational database tables are updated directly rather than through views. In that of Knudsen et al. [11], the updates to the query tree are transformed into SQL updates, and then traditional view updating techniques are used to update the relational database. Obviously, these approaches are not suitable for updating native XML repositories. Many XML updating systems that use a database are closely coupled to the database system, so they are not easy to re-implement to work with a different system. The Bi-X server is a generic tool for XML updating, so it can easily be connected to content servers and Web applications and can be reused.
8 Conclusion
In our Web service architecture for bidirectional XML updating, users can update remote source data by editing a target view on the local machine. This view is generated by some transformation of the source data. The user can create a view that includes only the contents of interest in a suitable format by creating an appropriate Bi-X code, edit the contents in the view, and update the source XML data accordingly without coding a backward transformation. Due to the use of standard Web service technologies, the data viewer client and content servers can easily be replaced with ones chosen by users to implement their own applications. There are a number of directions for future research to make the service architecture more practical and usable. Although we considered only discrete updates in this work, concurrency control enabling many updates to be made at the same time would make the architecture more practical. A control policy also needs to be defined for allowing access to the service.
Acknowledgments. We are grateful to Justsystem Corporation for providing us with helpful technical information about xfy. This research is supported by the Comprehensive Development of e-Society Foundation Software program of the Ministry of Education, Culture, Sports, Science and Technology, Japan.
References
1. Abiteboul, S.: On views and XML. In: Proceedings of the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (1999) 1-9
2. Apache Software Foundation: Apache Axis2/Java. http://ws.apache.org/axis2/
3. Apache Software Foundation: Apache Tomcat. http://tomcat.apache.org/
4. Braganholo, V., Davidson, S., Heuser, C.: From XML view updates to relational view updates: old solutions to a new problem. In: Proceedings of the International Conference on Very Large Data Bases (2004) 276-287
5. College of Information Sciences and Technology, The Pennsylvania State University: CiteSeer. http://citeseer.ist.psu.edu/
6. Dayal, U., Bernstein, P.A.: On the correct translation of update operations on relational views. ACM Trans. Database Syst. 7 (1982) 381-416
7. Fielding, R.T.: Architectural styles and the design of network-based software architectures. PhD thesis, University of California (2000)
8. Foster, J.N., Greenwald, M.B., Moore, J.T., Pierce, B.C., Schmitt, A.: Combinators for bi-directional tree transformations: a linguistic approach to the view update problem. In: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (2005) 233-246
9. Hu, Z., Mu, S.-C., Takeichi, M.: A programmable editor for developing structured documents based on bidirectional transformations. In: Proceedings of the 2004 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation (2004)
10. Justsystem Corporation: xfy technology. http://www.xfytec.com
11. Knudsen, S.U., Pedersen, T.B., Thomsen, C., Torp, K.: RelaXML: bidirectional transfer between relational and XML data. In: Proceedings of the 9th International Database Engineering and Applications Symposium (2005) 151-162
12. Liu, D., Hu, Z., Takeichi, M.: Bidirectional interpretation of XQuery. In: Proceedings of the ACM SIGPLAN 2007 Workshop on Partial Evaluation and Program Manipulation (2007)
13. Liu, D., Hu, Z., Takeichi, M., Kakehi, K., Wang, H.: A Java library for bidirectional XML transformation. JSSST Computer Software (to appear)
14. Meier, W.: eXist: Open Source Native XML Database. http://www.exist-db.org/
15. W3C: Simple Object Access Protocol (SOAP) 1.1. http://www.w3.org/TR/soap (2000)
16. W3C: Web Services Description Language (WSDL) 1.1. http://www.w3.org/TR/wsdl (2001)
17. W3C Draft: XML Query (XQuery). http://www.w3.org/XML/Query (2005)
18. W3C Draft: XSL Transformations (XSLT) Version 2.0. http://www.w3.org/TR/xslt20/ (2005)
(α, k)-anonymity Based Privacy Preservation by Lossy Join

Raymond Chi-Wing Wong¹, Yubao Liu², Jian Yin², Zhilan Huang², Ada Wai-Chee Fu¹, and Jian Pei³

¹ Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong ({cwwong,adafu}@cse.cuhk.edu.hk)
² Department of Computer Science, Zhongshan University, China ({liuyubao,issjyin}@mail.sysu.edu.cn, [email protected])
³ School of Computing Science, Simon Fraser University, Canada ([email protected])
Abstract. Privacy-preserving data publication for data mining aims to protect the sensitive information of individuals in published data while minimizing the distortion to the data. Recently, it has been shown that (α, k)-anonymity is a feasible technique when we are given some sensitive attribute(s) and quasi-identifier attributes. In previous work, generalization of the given data table has been used for the anonymization. In this paper, we show that we can project the data onto two tables for publishing in such a way that the privacy protection of (α, k)-anonymity can be achieved with less distortion. Of the two tables, one contains the undisturbed non-sensitive values and the other contains the undisturbed sensitive values. Privacy preservation is guaranteed by the lossy join property of the two tables. We show by experiments that the results are better than those of previous approaches.
1 Introduction
Privacy-preserving data mining is about preserving individual privacy while retaining as much as possible of the information in a dataset to be released for mining. The perturbation approach [2] and the k-anonymity model [14,13,4,1] are two major techniques for this goal. The k-anonymity model assumes a quasi-identifier (QID), which is a set of attributes that may serve as an identifier in the data set. In the simplest case, it is assumed that the dataset is a table and that each tuple corresponds to an individual. For example, in Table 1, attributes Job, Birth and Postcode form a quasi-identifier, where attribute Illness is a sensitive attribute. Privacy may be violated if some quasi-identifier values are unique in the released table. The assumption is that an attacker can have knowledge of another table where the quasi-identifier values are linked with the identities of individuals. Therefore, a join of the released table with this background table will disclose the sensitive data of individuals. A real example is found in the voter registration records in the United States, where the attributes of name, gender, zip code and date of birth are recorded. It has been found that a high percentage of the population can be uniquely identified by the gender, date of birth and zip code [12].

G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 733-744, 2007. © Springer-Verlag Berlin Heidelberg 2007

734
R. Chi-Wing Wong et al.

Table 1. Raw medical data set

  Job                  Birth  Postcode  Illness
  clerk                1975   4350      HIV
  manager              1955   4350      flu
  clerk                1955   5432      flu
  factory worker       1955   5432      fever
  factory worker       1975   4350      flu
  technical supporter  1940   4350      fever

Table 2. A (0.5, 2)-anonymous table of Table 1 by full-domain generalization

  Job  Birth  Postcode  Illness
  *    *      4350      HIV
  *    *      4350      flu
  *    *      5432      flu
  *    *      5432      fever
  *    *      4350      flu
  *    *      4350      fever

Table 3. A (0.5, 2)-anonymous table of Table 1 by local recoding

  Job           Birth  Postcode  Illness
  white-collar  *      4350      HIV
  white-collar  *      4350      flu
  *             1955   5432      flu
  *             1955   5432      fever
  blue-collar   *      4350      flu
  blue-collar   *      4350      fever

Let Q be the quasi-identifier (QID). An equivalence class, called a QID-EC, of a table with respect to Q is a collection of all tuples in the table containing identical values of Q. For instance, Table 2 contains two QID-ECs. The first QID-EC contains the first two and the last two tuples because these tuples contain identical values of Q. Similarly, the second QID-EC contains the third and the fourth tuples. A data set D is k-anonymous with respect to Q if the size of every QID-EC with respect to Q is k or more. As a result, it is less likely that any tuple in the released table can be linked to an individual, and thus personal privacy is preserved. For example, each QID-EC in Table 2 has a size equal to or greater than 2; if k = 2, the data set in Table 2 is said to be k-anonymous. We assume that each attribute follows a generalization hierarchy, in which a value at a lower level has a more specific meaning than a value at a higher level. For instance, Figure 1 shows the generalization hierarchy of attribute Job.

  *
  |- white-collar: clerk, manager
  |- blue-collar:  factory worker, technical supporter

Fig. 1. Generalization hierarchy of attribute Job
In order to achieve k-anonymity, we generalize some values of the quasi-identifier attributes by replacing values at a lower level with values at a higher level according to the generalization hierarchy. Table 2 is a generalization of Table 1.
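The QID-EC grouping and the k-anonymity condition can be sketched in a few lines. The data below re-encodes Table 2, and the function names are illustrative, not from any published implementation.

```python
# Sketch: group tuples into QID-ECs and check k-anonymity on Table 2
# (Job and Birth fully generalized to *).
from collections import defaultdict

def qid_ecs(table, qid):
    groups = defaultdict(list)
    for t in table:
        groups[tuple(t[a] for a in qid)].append(t)
    return groups

def is_k_anonymous(table, qid, k):
    return all(len(g) >= k for g in qid_ecs(table, qid).values())

table2 = [  # Table 2: full-domain generalization of Table 1
    {"Job": "*", "Birth": "*", "Postcode": 4350, "Illness": "HIV"},
    {"Job": "*", "Birth": "*", "Postcode": 4350, "Illness": "flu"},
    {"Job": "*", "Birth": "*", "Postcode": 5432, "Illness": "flu"},
    {"Job": "*", "Birth": "*", "Postcode": 5432, "Illness": "fever"},
    {"Job": "*", "Birth": "*", "Postcode": 4350, "Illness": "flu"},
    {"Job": "*", "Birth": "*", "Postcode": 4350, "Illness": "fever"},
]
qid = ("Job", "Birth", "Postcode")
assert is_k_anonymous(table2, qid, 2)  # two QID-ECs, of sizes 4 and 2
```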
2 (α, k)-anonymity
The k-anonymity model was proposed to prevent the re-identification of individuals in the released data set. However, it does not consider the inference relationship from the quasi-identifier to a sensitive attribute. We assume for simplicity that there is only one sensitive attribute and that some values of this attribute are sensitive values. Suppose all tuples in a QID-EC contain the same sensitive value in the released data set; then, even though the size of the QID-EC is greater than or equal to k, all tuples in this QID-EC are linked to this sensitive value. Therefore, each individual that has the corresponding QID value will be linked to the sensitive value. We call such an attack an inference attack. In order to overcome this attack, [9] and [17] proposed an l-diversity model and an (α, k)-anonymity model, respectively, where α is a real number in [0, 1] and k and l are positive integers. As discussed in [17], it is difficult for users to set the parameters in the l-diversity model. In this paper, we focus on the (α, k)-anonymity model, which generates publishable data that is free from the inference attack. In addition to k-anonymity, this model requires that the frequency (in fraction) of any sensitive value in any QID-EC be no more than α after anonymization.

There are two possible schemes of generalization: global recoding and local recoding. With global recoding [13,8,3,11,7,16,4], all values of an attribute come from the same domain level in the hierarchy; that is, all values come from the same level in the generalization hierarchy. For example, all values in attribute Job are in the lowest level (i.e., clerk, manager, factory worker and technical supporter), or all are in the top level (i.e., *). A global recoding of Table 1 is Table 2. One advantage is that an anonymous view has uniform domains.
However, it may lose more information than local recoding because it suffers from over-generalization. Under the scheme of local recoding [14,13,1,10,6,5,19], values may be generalized to different levels in the domain. For example, Table 3 is a (0.5, 2)-anonymous table by local recoding. In fact, local recoding is the more general model, and global recoding is a special case of it. Note that, in the example, known values are replaced by unknown values (*). This is called suppression, which is a special case of generalization, which is in turn one of the ways of recoding. It is easy to check that generalizing data to form QID-ECs in a released table is one possible way to achieve (α, k)-anonymity. However, it is not the only possible way, and we shall describe another method in the next section.
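The (α, k) condition just described, size at least k and sensitive-value frequency at most α in every QID-EC, can be checked directly on the locally recoded Table 3. The encoding and the function name below are illustrative.

```python
# Sketch of the (α, k) check on Table 3 (local recoding, α = 0.5, k = 2).
from collections import Counter, defaultdict

def satisfies_alpha_k(table, qid, sensitive, alpha, k):
    groups = defaultdict(list)
    for t in table:
        groups[tuple(t[a] for a in qid)].append(t[sensitive])
    for vals in groups.values():
        if len(vals) < k:                                  # k-anonymity
            return False
        if max(Counter(vals).values()) / len(vals) > alpha:  # α condition
            return False
    return True

table3 = [  # Table 3: (0.5, 2)-anonymous by local recoding
    {"Job": "white-collar", "Birth": "*", "Postcode": 4350, "Illness": "HIV"},
    {"Job": "white-collar", "Birth": "*", "Postcode": 4350, "Illness": "flu"},
    {"Job": "*", "Birth": 1955, "Postcode": 5432, "Illness": "flu"},
    {"Job": "*", "Birth": 1955, "Postcode": 5432, "Illness": "fever"},
    {"Job": "blue-collar", "Birth": "*", "Postcode": 4350, "Illness": "flu"},
    {"Job": "blue-collar", "Birth": "*", "Postcode": 4350, "Illness": "fever"},
]
assert satisfies_alpha_k(table3, ("Job", "Birth", "Postcode"), "Illness", 0.5, 2)
```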
Table 4. Temp table

  Job                  Birth  Postcode  Illness  ClassID
  clerk                1975   4350      HIV      1
  manager              1955   4350      flu      1
  clerk                1955   5432      flu      2
  factory worker       1955   5432      fever    2
  factory worker       1975   4350      flu      3
  technical supporter  1940   4350      fever    3

Table 5. NSS table

  Job                  Birth  Postcode  ClassID
  clerk                1975   4350      1
  manager              1955   4350      1
  clerk                1955   5432      2
  factory worker       1955   5432      2
  factory worker       1975   4350      3
  technical supporter  1940   4350      3

Table 6. SS table

  ClassID  Illness
  1        HIV
  1        flu
  2        flu
  2        fever
  3        flu
  3        fever

3 The Lossy Join Approach
In recent work, it has been found that a lossy join of multiple tables is useful in privacy-preserving data publishing [18,15]. The idea is that if two tables with a join attribute are published, the join of the two tables can be lossy, and this lossy join helps to conceal the private information. In this paper, we make use of the idea of lossy join to derive a new mechanism for achieving a similar privacy preservation target as (α, k)-anonymization.

Let us take a look at the example in Table 1. A (0.5, 2)-anonymization is given in Table 3. From this table, we can generate a table Temp, as shown in Table 4: for each equivalence class E in the anonymized table, we assign a unique identifier (ID) to E and to all tuples in E, and then attach the corresponding ID to each tuple in the original raw table. From the Temp table, we can generate two separate tables, Tables 5 and 6, which share the attribute ClassID. If we join these two tables by ClassID, it is easy to see that the join is lossy, and it is not possible to derive the table Temp from the join. The result of joining the two tables is given in Table 7. From the lossy join, each individual is linked to at least 2 values in the sensitive attribute; therefore, the required privacy of individuals can be guaranteed. Also, in the joined table, for each individual, there are at least 2 individuals that are linked to the same bag B of sensitive values, such that, in terms of the sensitive values, they are not distinguishable. For example, the first record in the raw table (QID = (clerk, 1975, 4350)) is linked to the bag {HIV, flu}. The second individual (QID = (manager, 1955, 4350)) is also linked to the same bag B of sensitive values. This is the goal of k-anonymity for the protection of sensitive values.
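The lossy join can be reproduced in a few lines over the toy data of Tables 4 to 6. Each QID-EC of size a contributes a x a rows to the join, so every individual ends up linked to the whole bag of sensitive values of its class. The variable names below are illustrative.

```python
# Sketch of the lossy-join mechanism on the Table 1 example.
temp = [  # Table 4: raw tuples annotated with the QID-EC ClassID
    ("clerk", 1975, 4350, "HIV", 1),
    ("manager", 1955, 4350, "flu", 1),
    ("clerk", 1955, 5432, "flu", 2),
    ("factory worker", 1955, 5432, "fever", 2),
    ("factory worker", 1975, 4350, "flu", 3),
    ("technical supporter", 1940, 4350, "fever", 3),
]
nss = [(job, birth, post, cid) for (job, birth, post, _, cid) in temp]  # Table 5
ss = [(cid, ill) for (*_, ill, cid) in temp]                            # Table 6

# Natural join on ClassID (Table 7): each class of size 2 yields 2*2 rows.
join = [(job, birth, post, ill, cid)
        for (job, birth, post, cid) in nss
        for (cid2, ill) in ss if cid == cid2]
assert len(join) == 12  # 3 classes of size 2 -> 12 rows

# Bag of sensitive values each individual is linked to after the join.
bag = {row[:3]: sorted(ill for (j, b, p, ill, c) in join if (j, b, p) == row[:3])
       for row in nss}
assert bag[("clerk", 1975, 4350)] == ["HIV", "flu"]
```

Note that the clerk and the manager of class 1 are linked to the same bag, which is exactly the indistinguishability argued in the text.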
Table 7. Join result table

  Job                  Birth  Postcode  Illness  ClassID
  clerk                1975   4350      HIV      1
  manager              1955   4350      HIV      1
  clerk                1975   4350      flu      1
  manager              1955   4350      flu      1
  clerk                1955   5432      flu      2
  factory worker       1955   5432      flu      2
  clerk                1955   5432      fever    2
  factory worker       1955   5432      fever    2
  factory worker       1975   4350      flu      3
  technical supporter  1940   4350      flu      3
  factory worker       1975   4350      fever    3
  technical supporter  1940   4350      fever    3

3.1 Contribution
[17] proposed to generate one generalized table that satisfies (α, k)-anonymity. Since the table is generalized, the data in the table is distorted. In this paper, we generalize the definition of (α, k)-anonymity to allow for the generation of two tables instead of one generalized table. In this way, the privacy protection of (α, k)-anonymity can be achieved with less distortion. Of the two tables, one contains the undisturbed non-sensitive values and the other contains the undisturbed sensitive values. Privacy preservation comes from the lossy join property of the two tables. We show in the experiments that the results are better than those of previous approaches [17,18].

The rest of the paper is organized as follows. In Section 4, we revisit (α, k)-anonymity and propose a generalization model of (α, k)-anonymity. In Section 5, we describe how the lossy join can be adapted to the generalized (α, k)-anonymity model. We propose an algorithm that generates two tables satisfying (α, k)-anonymity in Section 6. A systematic performance study is reported in Section 7. The paper is concluded in Section 8.
4 Generalized (α, k)-anonymity
Let us re-examine the objectives of (α, k)-anonymity. With k-anonymity, we want to make sure that when an individual is mapped to some sensitive values, at least k − 1 other individuals are also mapped to the same sensitive values. Let B be a bag of these sensitive values. For example, consider an individual with QID = (clerk, 1975, 4350) in Table 1. With 2-anonymity, since s/he is mapped to the first and the second tuple in Table 3, s/he is mapped to a bag B = {HIV, flu}. There is another individual, with QID = (manager, 1955, 4350) in Table 1, that is also mapped to the same bag B = {HIV, flu} in Table 3. (α, k)-anonymity further ensures that no sensitive value is sufficiently dominating in B, so that an individual cannot be linked to any sensitive value in B with a
high confidence. For instance, with α = 0.5, since B contains HIV and flu, the frequency (in fraction) of each value in B is at most 0.5. Based on this observation, we generalize the definition of (α, k)-anonymity as follows.

Definition 1 (Generalized (α, k)-anonymity). Consider a dataset D in which a set of attributes form the QID. We assume that the adversary only has the knowledge of an external table where the QIDs are linked to individuals. A released data set D′ generated from D satisfies generalized k-anonymity if, whenever an individual is linked to a bag B of sensitive values, at least k − 1 other individuals are also linked to B. In addition, if the frequency (in fraction) of any sensitive value in B is no more than α, then the released data satisfies generalized (α, k)-anonymity.
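Definition 1 can be checked mechanically once we know which bag of sensitive values each individual is linked to in the released data. The encoding below (bags as sorted tuples, so equal bags compare equal) is an illustrative sketch, not part of the paper's algorithm.

```python
# Sketch of checking generalized (α, k)-anonymity from individual-to-bag links.
from collections import Counter

def generalized_alpha_k(bags, alpha, k):
    # bags: {individual_qid: tuple of sensitive values (sorted)}
    share = Counter(bags.values())        # how many individuals share each bag
    for bag, n in share.items():
        if n < k:                         # fewer than k individuals linked to B
            return False
        if max(Counter(bag).values()) / len(bag) > alpha:  # α condition on B
            return False
    return True

bags = {("clerk", 1975, 4350): ("HIV", "flu"),
        ("manager", 1955, 4350): ("HIV", "flu"),
        ("clerk", 1955, 5432): ("fever", "flu"),
        ("factory worker", 1955, 5432): ("fever", "flu")}
assert generalized_alpha_k(bags, 0.5, 2)
```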
5 Generalized (α, k)-anonymity by Lossy Join
Suppose we form an anonymized table in which some QID values are generalized. In the anonymized table, each set of tuples with the same QID values forms a QID-EC. However, instead of publishing one single table A with the generalized values, there is the possibility of separating the sensitive attribute from the non-sensitive attributes and generating two tables by projecting these two sets of attributes. Tuples in the two tables are linked if they belong to the same QID-EC in A. Hence we can publish two tables: (1) the non-sensitive table (NSS table), containing all the non-sensitive attributes together with the QID equivalence class (QID-EC in A) IDs, and (2) the sensitive table (SS table), containing the QID-EC ID and the sensitive attributes. The released tables are annotated with the remark that each tuple in each of the two published tables corresponds to one record in the original single table. This is to ensure that a user will not mistakenly join the two tables and assume that the join result corresponds to the original table.

The schema of the non-sensitive table (NSS table) is as follows, where Class ID corresponds to the QID-EC ID:

  ( Original QID attributes | Class ID )

The schema of the sensitive table (SS table) is as follows:

  ( Class ID | Sensitive attribute )
Let us consider the example in Table 1 again. We propose that Table 5 (NSS) and Table 6 (SS) be published as the anonymized data.

Theorem 1. The resulting published tables NSS and SS satisfy generalized (α, k)-anonymity.

Proof: Given the QID information of individuals in a table TI (which we assume an attacker may possess) and the anonymized table TA (e.g., Table 3), we can
"join" the two tables by matching each QID in TA to its anonymized equivalence class and obtain a table TIA. Since TA satisfies (α, k)-anonymity, when the QID of an individual is linked to a bag B of values in the sensitive attribute, at least k − 1 other QIDs of other individuals are also linked to B. In addition, the frequency (in fraction) of any sensitive value in B is no more than α. Now, suppose the adversary is given tables NSS and SS. Equipped with only table TI, an adversary must join the tables NSS and SS on their common attribute in an attempt to link the QIDs to the sensitive values. Let the join result be table TA′, such as Table 7. Consider any QID-EC with class ID X. Let BX be the bag of sensitive values that X is linked to in TA, and suppose there are a tuples in TA belonging to X. In table TA′, there will be a² tuples generated for X, and BX becomes BX′, in which each entry of BX is duplicated a times. In the a² tuples in TA′, each original QID value in the given table T will now be linked to the bag BX′. Besides, a individuals are involved in X, and each is linked to BX′. The frequency of each sensitive value in BX′ is the same as that in BX in TIA. Hence, the tables NSS and SS release no more information than the table TA in terms of the linkage of an individual to a bag B of sensitive values and in terms of the percentage of each sensitive value in B. This shows that the privacy protection provided by the single anonymized table TA is no stronger than that provided by the NSS and SS tables in terms of (α, k)-anonymity. Since TA satisfies (α, k)-anonymity, tables NSS and SS also satisfy generalized (α, k)-anonymity.

The example shown in Tables 3 to 7 demonstrates the ideas in the proof above. If we publish Tables 5 and 6, we can achieve similar privacy preservation objectives as if we published Table 3 only.
6 Algorithm
Our method includes the following steps.
1. Construct an (α, k)-anonymous table T* from the given raw table (as described in Algorithm 1), and assign each equivalence class in the resulting table a class ID.
2. Add a column for the class ID of the equivalence class to the original raw table, such that, for each tuple, the class ID is the ID of the equivalence class that the tuple belongs to in T*. Call this new table the Temp table. Hence the Temp table contains the raw table plus one extra column.
3. Project the Temp table on the QID attributes and the Class ID column. The resulting table is the NSS table.
4. Project the Temp table on the sensitive attributes and the Class ID column. This results in the SS table.
The top-down approach has been found to be highly effective in k-anonymization [4]. In this approach, the table is first totally anonymized to the unknown values, and then attributes are specialized one at a time until we hit a point where the resulting table violates (α, k)-anonymity. We shall adopt
740
R. Chi-Wing Wong et al.
Algorithm 1. Top-Down Approach for Single Attribute
1: fully generalize all tuples such that all tuples are equal
2: let P be a set containing all these generalized tuples
3: S ← {P}; O ← ∅
4: repeat
5:   S′ ← ∅
6:   for all P ∈ S do
7:     specialize all tuples in P one level down in the generalization hierarchy such that a number of specialized child nodes are formed
8:     unspecialize the nodes which do not satisfy (α, k)-anonymity by moving the tuples back to the parent node
9:     if the parent P does not satisfy (α, k)-anonymity then
10:      unspecialize some tuples in the remaining child nodes so that the parent P satisfies (α, k)-anonymity
11:    for all non-empty branches B of P do S′ ← S′ ∪ {B}
12:    if P is non-empty then O ← O ∪ {P}
13:  S ← S′
14: until S′ = ∅
15: return O
the top-down approach in [17] to tackle the first step of (α, k)-anonymization in the above. The idea of the algorithm is to first generalize all tuples completely so that, initially, all tuples are generalized to one equivalence class. Then, some values in the dataset are specialized in iterations. During the specialization, we must maintain (α, k)-anonymity. The process continues until we cannot specialize the tuples anymore without violating (α, k)-anonymity. The pseudo-code of the top-down approach is shown in Algorithm 1.
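The construction of the NSS and SS tables (steps 2–4 of our method) can be sketched as follows, assuming the (α, k)-anonymous partition from step 1 is supplied as a class-ID assignment function (all names here are ours, for illustration only):

```python
def build_nss_ss(raw, qid_cols, sens_cols, class_id_of):
    """Steps 2-4: tag each raw tuple with its equivalence-class ID,
    then project onto (QID, ClassID) and (sensitive, ClassID)."""
    temp = [dict(t, ClassID=class_id_of(t)) for t in raw]            # Temp table
    nss = [{c: t[c] for c in qid_cols + ["ClassID"]} for t in temp]  # NSS table
    ss = [{c: t[c] for c in sens_cols + ["ClassID"]} for t in temp]  # SS table
    return nss, ss

# Hypothetical raw table; both tuples fall in one equivalence class.
raw = [{"Zip": "43520", "Age": 22, "Disease": "Cancer"},
       {"Zip": "43522", "Age": 25, "Disease": "Flu"}]
nss, ss = build_nss_ss(raw, ["Zip", "Age"], ["Disease"],
                       class_id_of=lambda t: 1)
```

Note that NSS keeps the QID values unmodified; only the link from QID to sensitive value is cut by the projection.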
7 Experimental Results
The system platform we used is: Windows XP, Microsoft SQL Server 2000, Intel Celeron CPU 2.66GHz, 256MB memory, 80GB hard disk. We implemented our proposed algorithm, the (α, k)-anonymity based privacy preservation by lossy join, in C/C++. Let us denote it by Alpha(Lossy). We compared the proposed lossy-join algorithm with two algorithms in the literature. One is the original (α, k)-anonymity algorithm [17], which generalizes the QID and forms only one generalized table. Let us denote this algorithm by Alpha. The other is the anatomy algorithm, which makes use of the lossy join for the anonymization [18]. Let us denote this algorithm by Anatomy. Anatomy also generates two tables with a similar strategy of separating the sensitive data and the QID data. However, the goal of Anatomy is to create QID-EC's which satisfy the l-diversity requirement, without taking care that the created QID-EC's also minimize the effective distortion to the QID values. In other words, Anatomy does not consider the minimization of the variations in the QID values in each QID-EC when two tables are released. Alpha(Lossy) takes care of this issue by the top-down anonymization algorithm and therefore results in less data distortion.
Table 8. Description of Adult Data Set

  Attribute         Distinct Values  Generalizations          Height
1 Age               74               5-, 10-, 20-year ranges  4
2 Work Class        7                Taxonomy Tree            3
3 Education         16               Taxonomy Tree            4
4 Marital Status    7                Taxonomy Tree            3
5 Race              5                Taxonomy Tree            2
6 Sex               2                Suppression              1
7 Native Country    41               Taxonomy Tree            3
8 Salary Class      2                Suppression              1
9 Occupation        14               Taxonomy Tree            2
The source code of this algorithm can be obtained from the author's website http://www.cs.cityu.edu.hk/∼taoyf/paper/vldb06.html. In our experiments, we made some modifications to the ST files generated by the original anatomy algorithm so that the ST table can be loaded into Microsoft SQL Server 2000. Similar to [4,8,17], we adopted the adult data set for the experiment, which can be downloaded from the UC Irvine Machine Learning Repository (http://www.ics.uci.edu/∼mlearn/MLRepository.html). We eliminated the records with unknown values in this data set. The resulting data set contains 45,222 tuples. Nine of the attributes were chosen in our experiments, as shown in Table 8. By default, we set k = 2 and α = 0.33. In Table 8, we set the first eight attributes and the last attribute as the quasi-identifier and the sensitive attribute, respectively. We compare the algorithms in terms of effectiveness for aggregate queries. Similar to [18], the effectiveness of an aggregate query is defined to be its average relative error in answering a query of the following form.

SELECT COUNT(*) FROM Unknown-Microdata
WHERE pred(A1^qi) AND ... AND pred(Aqd^qi) AND pred(A^s)
In the above query, Unknown-Microdata is an original data set or an anonymized data set. qd denotes the number of QID attributes to be queried and A^s denotes the sensitive attribute. For any attribute A, the predicate pred(A) has the form (A = x1 OR A = x2 OR ... OR A = xb) where xi is a random value in the domain of A, for 1 ≤ i ≤ b. The value of b depends on the expected query selectivity s:

b = |A| · s^(1/(qd+1))

where |A| is the domain size of A. If the value of s is set higher, there will be more selection conditions in pred(A). We compare the anonymized tables generated by different algorithms in terms of average relative error, which is defined as follows. We perform the aggregate query with the original data set, called Original. That is,
SELECT COUNT(*) FROM Original
WHERE pred(A1^qi) AND ... AND pred(Aqd^qi) AND pred(A^s)
Let us call the count obtained above act. We execute the aggregate query with the anonymized data set as follows. As algorithm Alpha(Lossy) and algorithm Anatomy generate two tables, namely NSS and SS, we perform the query as follows.

SELECT COUNT(*) FROM SS
WHERE SS.ClassID IN (SELECT NSS.ClassID FROM NSS
    WHERE pred(A1^qi) AND ... AND pred(Aqd^qi)) AND pred(A^s)
Let us call the count obtained above est. As algorithm Alpha generates one anonymized table, we perform the first query by replacing Unknown-Microdata with the anonymized or generalized data. Then, we define the relative error to be |act − est|/act, where act is the actual count derived from the original data and est is the estimated count computed from the anonymized table. In our experiments, we compare all algorithms by varying the following factors: (1) the number of QID attributes d; (2) the query dimensionality qd; (3) the selectivity s; and (4) the dataset cardinality n. For each setting, we performed 1000 queries on the anonymized tables and then reported the average query accuracy. By default, we set qd = 4, s = 0.05 and n = 45222. As we adopt the first eight attributes in Table 8 as the quasi-identifier, the default value of d is 8. We study the effect of the number of QID attributes as shown in Figure 2. The average relative error remains largely unchanged as d varies. Also, algorithm Alpha(Lossy) gives a lower average relative error than algorithm Anatomy and algorithm Alpha. This is because algorithm Alpha(Lossy) includes a minimization step for the distortion in the anonymization but algorithm Anatomy does not. Also, algorithm Alpha(Lossy) does not generalize the table, whereas algorithm Alpha generalizes the table, which makes the average relative error higher. On average, algorithm Anatomy gives a lower average relative error than algorithm Alpha. The reason is similar: algorithm Alpha generalizes the table, which distorts the data considerably, while algorithm Anatomy does not. We also studied the effect of query dimensionality qd as shown in Figure 3. Similarly, even though the average relative error of algorithm Alpha(Lossy) is smaller than that of algorithm Anatomy and algorithm Alpha, qd had little effect on the average relative error. We also varied the selectivity s as shown in Figure 4 and found that the average relative error of all algorithms decreases when s increases.
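The evaluation procedure (act on the original table, est via the IN-subquery on the released pair, then |act − est|/act) can be sketched as follows; the toy tables and predicates here are invented for illustration:

```python
def count_query(rows, preds):
    # act: COUNT(*) where each predicate is "attribute IN value-set"
    return sum(all(r[a] in vals for a, vals in preds.items()) for r in rows)

def estimated_count(nss, ss, qid_preds, sens_attr, sens_vals):
    # est: SELECT COUNT(*) FROM SS WHERE SS.ClassID IN
    #   (SELECT NSS.ClassID FROM NSS WHERE <QID predicates>) AND <sensitive predicate>
    ids = {r["ClassID"] for r in nss
           if all(r[a] in vals for a, vals in qid_preds.items())}
    return sum(1 for r in ss if r["ClassID"] in ids and r[sens_attr] in sens_vals)

def relative_error(act, est):
    return abs(act - est) / act

# Invented toy tables with two equivalence classes.
nss = [{"Age": 22, "ClassID": 1}, {"Age": 25, "ClassID": 1},
       {"Age": 40, "ClassID": 2}, {"Age": 45, "ClassID": 2}]
ss = [{"ClassID": 1, "Disease": "Flu"}, {"ClassID": 1, "Disease": "Cancer"},
      {"ClassID": 2, "Disease": "Flu"}, {"ClassID": 2, "Disease": "Flu"}]
est = estimated_count(nss, ss, {"Age": {22, 25}}, "Disease", {"Flu"})
```

Because the join is lossy, est need not equal act; the experiments below measure how far apart the two counts are on average.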
This is because, when s is larger, each attribute in the aggregate query involves more value matches. That means the actual count is larger. Note that the actual count is the denominator of the average relative error. Besides, if the generalized values in the anonymized table match more aggregate values in the query, the estimated count will be more accurate. Thus, the overall average relative error decreases when s increases. Figure 5 shows the average relative error against the data set cardinality n. We found that the average relative error of all algorithms decreases slightly when n increases. This is because, when n is larger, there is more chance that a tuple can be matched with an existing tuple in the data without much generalization. Similarly, algorithm Alpha(Lossy) gives a lower average relative error compared with algorithm Anatomy and algorithm Alpha.

Fig. 2. Query accuracy vs. the number of QID-attributes d
Fig. 3. Query accuracy vs. query dimensionality qd
Fig. 4. Query accuracy vs. selectivity s
Fig. 5. Query accuracy vs. dataset cardinality n
(Each figure plots the average relative error of Alpha(Lossy), Anatomy and Alpha.)
8 Conclusion
In this paper, we proposed an (α, k)-anonymity based privacy preservation mechanism that reduces information loss by the use of lossy join. Instead of one generalized table, we generate two tables with a shared attribute called ClassID, which corresponds to a unique identifier of an "equivalence class". One table contains the detailed information of the quasi-identifier and ClassID, and the other table contains ClassID and the sensitive attribute. By avoiding the generalization of the quasi-identifier in the first table, we achieve less information loss. We conducted experiments and verified the improvement in information loss.
Acknowledgements: This paper is in part supported by the National Natural Science Foundation of China (60573097), Natural Science Foundation of Guangdong Province (05200302), Research Foundation of Science and Technology Plan Project in Guangdong Province (2005B10101032), and Research Foundation of Disciplines Leading to Doctorate degree of Chinese Universities (20050558017). This research was also supported by the RGC Earmarked Research Grant of HKSAR CUHK 4120/05E.
References
1. G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. In ICDT, pages 246–258, 2005.
2. R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD, pages 439–450, May 2000.
3. R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In ICDE, pages 217–228, 2005.
4. B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In ICDE, pages 205–216, 2005.
5. A. Hundepool. The ARGUS software in the CASC project: CASC project international workshop. In Privacy in Statistical Databases, volume 3050 of Lecture Notes in Computer Science, pages 323–335, Barcelona, Spain, 2004. Springer.
6. A. Hundepool and L. Willenborg. μ- and τ-ARGUS: Software for statistical disclosure control. In Third International Seminar on Statistical Confidentiality, Bled, 1996.
7. V. S. Iyengar. Transforming data to satisfy privacy constraints. In SIGKDD, pages 279–288, 2002.
8. K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In SIGMOD, pages 49–60, 2005.
9. A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In ICDE, 2006.
10. A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, pages 223–228, 2004.
11. P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027, 2001.
12. L. Sweeney. Uniqueness of simple demographics in the U.S. population. Technical report, Carnegie Mellon University, 2000.
13. L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):571–588, 2002.
14. L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
15. K. Wang and B. Fung. Anonymizing sequential releases. In SIGKDD, 2006.
16. K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In ICDM, pages 249–256, 2004.
17. R. Wong, J. Li, A. Fu, and K. Wang. (α, k)-anonymity: An enhanced k-anonymity model for privacy-preserving data publishing. In SIGKDD, 2006.
18. X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In VLDB, 2006.
19. J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu. Utility-based anonymization using local recoding. In SIGKDD, 2006.
Achieving k-Anonymity Via a Density-Based Clustering Method

Hua Zhu and Xiaojun Ye
School of Software, Tsinghua University, Beijing, 100084, P. R. China
[email protected],
[email protected]
Abstract. The key idea of our k-anonymity approach is to cluster the personal data based on density, which is measured by the k-Nearest-Neighbor (KNN) distance. Unlike traditional clustering methods, we add a constraint that each cluster contains at least k records, and provide an algorithm to come up with such a clustering. We also develop more appropriate metrics to measure the distance and information loss, which are suitable for both numeric and categorical attributes. Experimental results show that our algorithm causes significantly less information loss than previously proposed clustering algorithms.
1 Introduction
Society is experiencing exponential growth in the number and variety of data collections containing person-specific information as computer technology, network connectivity and disk storage space become increasingly affordable [9]. Many data holders publish their microdata for different purposes. However, they have difficulties in releasing information that does not compromise privacy. The difficulty is that data quality and data privacy conflict with each other. Recently, a new approach to protecting data privacy called k-anonymity [8] has gained popularity. In a k-anonymized dataset, quasi-identifier attributes that leak information are suppressed or generalized so that each record is indistinguishable from at least (k−1) other records with respect to the quasi-identifier. Since k-anonymity is simple and practical, a number of algorithms have been proposed [5][6]. The objective of this paper is to develop a new approach to achieve k-anonymity, where quasi-identifier attribute values are clustered and then published with these clusters. We view the k-anonymity problem as a clustering issue, and we add a constraint that each cluster contains at least k records, so that it satisfies k-anonymity requirements. The key idea is to cluster records based on density, which is measured by the k-Nearest-Neighbor distance. We develop an algorithm to come up with such a clustering. To measure the information loss, we give some data quality metrics which are suitable for both numeric and categorical attributes. G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 745–752, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Basic Concepts
The process of k-anonymization is to first delete all the direct identifiers, then generalize/suppress the quasi-identifiers by which most individuals may be identified [8], and finally release the modified dataset which satisfies the k-anonymity constraint. For example, Table 1 (left) is a raw microdata table of a hospital and Table 1 (right) is a 2-anonymity view of it.

Table 1. Table of health data. Left: a raw table. Right: a 2-anonymity view.

Left (raw table):
Zip    Gender  Age  Disease
43520  Male    22   Cancer
43522  Male    25   Flu
43518  Male    23   Cancer
43533  Female  21   Obesity
43567  Female  30   Coryza
43562  Female  27   Flu

Right (2-anonymity view):
Zip    Gender  Age      Disease
4352*  Male    [21,25]  Cancer
4352*  Male    [21,25]  Flu
435**  Person  [21,25]  Cancer
435**  Person  [21,25]  Obesity
4356*  Female  [26,30]  Coryza
4356*  Female  [26,30]  Flu
Definition 1 (Quasi-Identifier). A quasi-identifier is a minimal set Q of attributes in table T that can be joined with external information to re-identify individual records (with sufficiently high probability) [8].

Definition 2 (Equivalence Class). An equivalence class of a table with respect to the quasi-identifier is the set of all records in the table containing identical values for the quasi-identifier attributes.

For example, in Table 1 the attribute set {Zip, Gender, Age} is the quasi-identifier. Records 1 and 2 form an equivalence class in Table 1 (right) with respect to the quasi-identifier {Zip, Gender, Age}, and their corresponding item values are identical.

Definition 3 (k-Anonymity). Table T is said to satisfy k-anonymity if and only if each set of values in Q appears at least k times in T [8].

For example, Table 1 (right) is a 2-anonymity view of Table 1 (left) since the minimum size of all equivalence classes is no less than 2. This ensures that even though an intruder knows a particular individual is in the k-anonymous table T, he cannot infer which record in T corresponds to the individual with a probability greater than 1/k. Clustering techniques used for the k-anonymity issue do not require a fixed number of clusters; instead, they need to satisfy a constraint that each cluster contains at least k records [1][3]. We define the k-anonymity clustering issue as follows:

Definition 4 (k-Anonymity Clustering Issue). The k-anonymity clustering issue is to cluster n points into a set of clusters under an information loss metric, such that each cluster contains at least k (k ≤ n) data points and the sum of information loss over all clusters is minimized.
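Definition 3 can be checked mechanically by grouping on the quasi-identifier and requiring every group to have at least k members. A minimal sketch, run here on the 2-anonymity view of Table 1 (right):

```python
from collections import Counter

def is_k_anonymous(table, qid, k):
    """True iff every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(tuple(row[a] for a in qid) for row in table)
    return all(count >= k for count in groups.values())

# The 2-anonymity view of Table 1 (right), as Python dicts.
view = [{"Zip": "4352*", "Gender": "Male", "Age": "[21,25]"},
        {"Zip": "4352*", "Gender": "Male", "Age": "[21,25]"},
        {"Zip": "435**", "Gender": "Person", "Age": "[21,25]"},
        {"Zip": "435**", "Gender": "Person", "Age": "[21,25]"},
        {"Zip": "4356*", "Gender": "Female", "Age": "[26,30]"},
        {"Zip": "4356*", "Gender": "Female", "Age": "[26,30]"}]
assert is_k_anonymous(view, ["Zip", "Gender", "Age"], 2)
```

The same view fails the check for k = 3, since every equivalence class has exactly two members.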
3 Distance and Information Loss Metrics
The distance metrics measure the dissimilarities among data points, and minimizing the information loss for published microdata is the objective of the anonymization issue. Distance metrics should handle records that consist of both numeric and categorical attributes. Earlier works [5][6] described generalizations for a categorical attribute by a taxonomy tree. Consider some samples in Table 2 and a taxonomy tree of attribute workclass in Fig. 1. The leaf nodes depict all the distinct values of attribute workclass. These leaf nodes can be generalized at the next level into self-employed, government, and unemployed. The level of a leaf node is 0 and the level of the root node is hw. Based on the notion of tree height, [3] gives a distance definition between two categorical values.

Table 2. Some sample patient records of a hospital

Age  Workclass           Disease
37   Self-emp-inc        Cancer
22   Self-emp-not-inc    Flu
31   Federal government  Cancer
21   State government    Obesity
54   Local government    Coryza
43   Private             Flu
25   Without pay         Flu
18   Never worked        Cancer
The priority of generalization should be considered such that a generalization near the root gives greater information loss than a generalization far from the root [7]. Thus we reformulate the level weight scheme based on [3]. We define the weight distance between two categorical values as follows:

Definition 5 (Weight Distance Between Two Categorical Values). Let C be a categorical attribute, and let hw be the height of the weight taxonomy tree of C. wi,i+1 (0 ≤ i < hw) is the weight from level i to level i+1. The weight distance between two values v1, v2 ∈ C is defined as:

distCW(v1, v2) = (Σ_{i=0}^{l12−1} wi,i+1) / (Σ_{j=0}^{hw−1} wj,j+1)    (1)

where l12 is the level of the closest common ancestor of v1 and v2. For example, the weight distance in Fig. 1 between Federal and Local is 1/(1 + 2) = 0.33, while the distance between Inc and Without pay is (1 + 2)/(1 + 2) = 1. Generalizing a numeric attribute (such as age in Table 2) is done by discretizing values into a set of disjoint intervals. How to choose possible end points
Fig. 1. A Taxonomy Tree of Attribute workclass
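Equation (1) over the tree of Fig. 1 can be sketched as follows; the level weights w01 = 1 and w12 = 2 are implied by the worked examples in the text (hw = 2):

```python
def dist_cw(weights, l12):
    # Eq. (1): sum of level weights up to the closest common ancestor's
    # level l12, normalised by the total weight of the whole tree.
    return sum(weights[:l12]) / sum(weights)

# weights[i] = w_{i,i+1}; from the text's examples, w01 = 1 and w12 = 2.
weights = [1, 2]
federal_local = dist_cw(weights, 1)   # common ancestor "government" at level 1
inc_withoutpay = dist_cw(weights, 2)  # common ancestor at the root (level 2)
```

This reproduces the two examples: 1/(1 + 2) ≈ 0.33 for Federal vs. Local and (1 + 2)/(1 + 2) = 1 for Inc vs. Without pay.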
determines the granularity of the intervals. Intuitively, the difference between two numeric values indeed represents their distance in the k-anonymity clustering problem. We define the distance between two numeric values as follows:

Definition 6 (Distance Between Two Numeric Values). Let N be a finite numeric attribute domain. The distance between two numeric values v1, v2 is defined as [3]:

distN(v1, v2) = |v1 − v2| / |N|    (2)

where |N| is the size of the numeric attribute domain N. For example, consider the Age attribute in Table 2. The distance between the first two records in the Age attribute is |37 − 22|/|54 − 18| = 0.42.

Definition 7 (Distance Between Two Records). Let C1, C2, ..., Cm, N1, N2, ..., Nn be the quasi-identifier attributes in table T, where Ci (i = 1...m) are the categorical attributes and Nj (j = 1...n) are the numeric attributes. The distance between two records is defined as:

distance(r1, r2) = Σ_{i=1}^{m} distCW(r1[Ci], r2[Ci]) + Σ_{j=1}^{n} distN(r1[Nj], r2[Nj])    (3)
For example, the distance between the first two records from Table 2 is 1/3 + 0.42 = 0.75. Based on the above distance definition between records, information loss for the anonymized table can be defined as follows:

Definition 8 (Information Loss). Let C1, C2, ..., Cm, N1, N2, ..., Nn be the quasi-identifier attributes. Let c be a cluster. We define information loss as follows:

ilCi = Σ_{i=0}^{level(vall)−1} wi,i+1    (4)

ilNj = |vmax − vmin| / |Nj|    (5)

IL(c) = |c| · (Σ_{i=1}^{m} ilCi + Σ_{j=1}^{n} ilNj)    (6)

where ilCi is the information loss for categorical attribute Ci and ilNj is the information loss for numeric attribute Nj. vall is the value of the closest common ancestor of all values in attribute Ci. vmax is the maximal value in Nj and vmin is the minimal value in Nj. |Nj| represents the size of Nj. IL(c) is the information loss of cluster c. Thus, the total information loss of all clusters for the released microdata is:

TotalIL(R) = Σ_{c∈R} IL(c)    (7)

where R is a set of clusters.
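Equations (4)–(7) can be sketched as follows, assuming the ancestor levels and numeric ranges of a cluster have already been computed (function and parameter names are ours):

```python
def il_cluster(n_records, cat_info, num_info):
    """IL(c) of Eq. (6). cat_info: (level(v_all), level-weights) per
    categorical attribute; num_info: (v_max, v_min, |N|) per numeric one."""
    il_cat = sum(sum(w[:lvl]) for lvl, w in cat_info)                    # Eq. (4)
    il_num = sum((vmax - vmin) / size for vmax, vmin, size in num_info)  # Eq. (5)
    return n_records * (il_cat + il_num)                                 # Eq. (6)

def total_il(cluster_infos):
    # Eq. (7): sum of IL(c) over all released clusters
    return sum(il_cluster(n, ci, ni) for n, ci, ni in cluster_infos)

# First two records of Table 2 merged: workclass generalises to
# "self-employed" (level 1, level weights [1, 2]); Age spans [22, 37]
# inside a domain of size |54 - 18| = 36.
loss = il_cluster(2, [(1, [1, 2])], [(37, 22, 36)])
```

Note that Eq. (4) is an unnormalised weight sum, unlike the normalised distance of Eq. (1); the sketch mirrors that asymmetry of Definition 8.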
4 k-Anonymity Clustering Algorithm
The choice of cluster center points can be based on the distribution density of data points. We pick a record whose density is maximal and make it the center of a cluster c. Then we choose k−1 records for c that make the information loss minimal. We note that there are two important issues in the algorithm:

1. The effect of clustering. We introduce a density metric called the k-nearest-neighbor distance, which is defined as follows:

Definition 9 (k-Nearest-Neighbor Distance). Let R be a set of records and r be a record in R. Let distK(i) (0 < i ≤ k) be the minimal k values among all distance(r, rj) (0 < j ≤ |R|), where |R| represents the size of R. Then we define the k-nearest-neighbor distance of r as:

distKNN(r) = (Σ_{i=1}^{k} distK(i)) / k    (8)

Definition 10 (Density). Let distKNN(r) be the k-nearest-neighbor distance of record r. We define the density of r as:

dens(r) = 1 / distKNN(r)    (9)

The larger the density of r is, the smaller the distances between r and the other records around it are. A record with larger density will be made a cluster center with high probability because the resulting cluster has a smaller information loss.

2. The process of clustering. How to choose the next cluster center is another important issue when one iteration has finished, because the next cluster center should be the record with the maximal density among the remaining records, and it should not be among the k-nearest-neighbor records of the current center. Thus we define a principle as follows:
Definition 11 (Principle of Choosing the Next Cluster Center). Let R be a set of records, rc be the center of cluster c and rc_next be the next cluster center. The rc_next ∈ {R − c} chosen must satisfy the following two requirements at the same time:

distance(rc, rc_next) > distKNN(rc) + distKNN(rc_next)    (10)

dens(rc_next) = max{dens(ri), ri ∈ {R − c}}    (11)
So we propose an algorithm called density-based k-anonymity clustering (DBKC). We provide the pseudo code of the algorithm as follows:

Density-Based K-Anonymity Clustering (DBKC)
1: compute the density of each record in R and sort all records in decreasing order of density;
2: choose the first record r (with the maximal density) in R and make it a cluster e1's center;
3: while the size of R > k do
4:   delete r from R;
5:   find the k−1 best records in R, add them to cluster e1 and delete them from R;
6:   find the next cluster center r in R and make it a new cluster e1's center;
7: end while
8: while the size of R > 0 do
9:   insert each remaining record into its best cluster;
10: end while

In lines 1–2, we compute the density of each record and sort them. The density of each record is computed with Definition 10. The sorting algorithm chosen here is quick-sort [4] because of its low time complexity. In lines 3–7, we form one cluster of size k in each iteration. For one cluster center, we find the k−1 best records to add to the cluster in line 5. The best record here is a record ri in R such that IL(e1 ∪ ri) is minimal. Line 6 finds the next cluster center according to Definition 11. After all iterations of lines 3–7, there are fewer than k records left in R, and these remaining records are handled in lines 8–10. We insert each remaining record rj into the best cluster in line 9. The best cluster here is a cluster e1 from the set of clusters formed in lines 3–7 such that IL(e1 ∪ rj) is minimal. For the sake of space, we do not provide the source code of the DBKC algorithm. We analyze the time complexity based on the source code. Computing the density of all records in R needs O((k + log k + 1)n²) ≈ O(n²) (when k ≪ n); sorting all records with quick-sort needs O(n log n). In lines 3–7, the number of executions ET = (n − 1) + (n − 2) + ... + k ≈ n(n − 1)/2, thus ET is in O(n²). Lines 8–9 need fewer than k passes. As a result of the analysis above, the time complexity of the density-based k-anonymity clustering algorithm is O(n²) when k ≪ n.
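A compact sketch of the DBKC loop, with a naive O(n²) density computation; the distance and information-loss functions are passed in, we assume |records| ≥ k, and all names are ours rather than the paper's source code:

```python
def dbkc(records, k, dist, il):
    # Density-based k-anonymity clustering (sketch): repeatedly take the
    # densest remaining record as a cluster centre, greedily add the k-1
    # records minimising information loss, then place any leftovers.
    def dens(r, pool):
        knn = sorted(dist(r, x) for x in pool if x is not r)[:k]
        return float("inf") if sum(knn) == 0 else 1.0 / (sum(knn) / k)

    R, clusters = list(records), []
    while len(R) >= k:                     # lines 3-7 of the pseudo code
        centre = max(R, key=lambda r: dens(r, R))
        cluster = [centre]
        R.remove(centre)
        for _ in range(k - 1):             # line 5: k-1 best records
            best = min(R, key=lambda r: il(cluster + [r]))
            cluster.append(best)
            R.remove(best)
        clusters.append(cluster)
    for r in R:                            # lines 8-10: remaining records
        best = min(clusters, key=lambda c: il(c + [r]))
        best.append(r)
    return clusters

# Toy usage: 1-D records, absolute distance, range of a cluster as loss.
clusters = dbkc([[1], [2], [3], [10]], k=2,
                dist=lambda a, b: abs(a[0] - b[0]),
                il=lambda c: max(x[0] for x in c) - min(x[0] for x in c))
```

The sketch simplifies line 3 to `>= k` so a final exact-size group forms its own cluster, and it omits the Definition 11 distance check on the next centre.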
5 Experimental Results
For the experiments, we adopted the Adult dataset from the UC Irvine Machine Learning Repository [2]. Before the experiments, the Adult dataset was prepared similarly to [1][6]. Eight attributes were chosen as the quasi-identifier; two of them were treated as numeric attributes while the others were treated as categorical attributes. We evaluate the algorithm in terms of two measurements, information loss and execution time, and compare the DBKC algorithm with a k-means algorithm to which only one constraint was added, namely that each cluster contains at least k records. Fig. 2 reports the results of these algorithms and shows that the total information loss of the DBKC algorithm is 2.82 times lower than that of the k-means algorithm averaged over all k values. This result can be explained by the following reasons. First, the choice of the cluster center points in the DBKC algorithm is based on density, while the k-means algorithm used in our experiments chooses center points randomly. Secondly, the DBKC algorithm adds the point closest to the cluster in order to make the information loss lowest, while the k-means algorithm assigns a point to the cluster whose center is nearest to it.
Fig. 2. Experimental results. (a): Information loss metric. (b): Execution time.
As shown in Fig. 2, the execution time of both algorithms decreases with the value of k. Although the execution time of the DBKC algorithm is larger than that of the k-means algorithm, the time complexity of the DBKC algorithm is O(n²) (as discussed in Section 4) and that of the k-means algorithm is also O(n²). The execution time of the DBKC algorithm is acceptable in most cases considering its better performance on information loss, but it is not fully optimized and this is our future work. The experimental results show that the DBKC algorithm is acceptable in terms of information loss and execution time. It is feasible to achieve k-anonymity using clustering methods based on density.
6 Conclusion
In this paper, we study k-anonymity as a clustering problem and propose an algorithm based on density. We define the distance and information loss metrics; in particular, we discuss the advantage of the weight distance for categorical attributes. We experimentally show that our algorithm causes significantly less information loss than the traditional k-means clustering algorithm, and we analyze the difference between the two algorithms. Our future work includes the following. Although the experimental results show that the DBKC algorithm achieves a better compromise between data quality and data privacy, we believe that we can improve the DBKC algorithm's time complexity. The key idea of the DBKC algorithm is based on density, and we use the k-nearest-neighbor distance to measure it; a better density metric may emerge in future work. Because k-anonymity ensures relatively weak privacy protection, the DBKC method should consider new privacy requirements such as l-diversity and personalized privacy preservation in the future. Acknowledgement. This work was supported by NSFC 60673140 and NORPC 2004CB719400.
References
1. G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu: Achieving anonymity via clustering. In PODS'06, (2006) 26–28.
2. C. Blake and C. Merz: UCI repository of machine learning databases (1998).
3. J.-W. Byun, A. Kamra, E. Bertino, and Ninghui Li: Efficient k-anonymity using clustering technique. CERIAS Tech Report (2006).
4. T. H. Cormen, C. E. Leiserson, R. L. Rivest: Introduction to Algorithms, Second Edition, MIT Press (2001).
5. K. LeFevre, D. J. DeWitt, and R. Ramakrishnan: Incognito: Efficient full-domain k-anonymity. In SIGMOD 2005, June (2005) 14–16.
6. B. C. M. Fung, K. Wang, and P. S. Yu: Top-down specialization for information and privacy preservation. In the 21st International Conference on Data Engineering (ICDE) (2005).
7. Jiuyong Li, Raymond Chi-Wing Wong, Ada Fu, and Jian Pei: Achieving k-anonymity by clustering in attribute hierarchical structures. DaWaK 2006, LNCS 4081, (2006) 405–416.
8. L. Sweeney: Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, No. 5 (2002) 571–588.
9. K. Wang, P. S. Yu, and S. Chakraborty: Bottom-up generalization: A data mining solution to privacy protection. In ICDM'04: The Fourth IEEE International Conference on Data Mining, (2004) 249–256.
k-Anonymization Without Q-S Associations

Weijia Yang¹ and Shangteng Huang²
¹ Shanghai Jiao Tong University, Shanghai 200030, China
[email protected]
² Shanghai Jiao Tong University, Shanghai 200030, China
[email protected]
Abstract. Privacy concerns on sensitive data are becoming indispensable in data publishing and knowledge discovery. k-Anonymization provides a way to protect the sensitive data without fabricating the data records. However, the anonymity can be breached by leveraging the associations between quasi-identifiers and sensitive attributes. In this paper, we model the possible privacy breaches as Q-S associations using association and dissociation rules. We enhance the common k-anonymization methods by evaluating the Q-S associations. Moreover, we develop a greedy algorithm for rule hiding in order to remove all the Q-S associations in every anonymity group. Our method can not only protect data from privacy breaches but also minimize the data loss. We also make a comparison between our method and one of the common k-anonymization strategies.
1 Introduction
G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 753–764, 2007. © Springer-Verlag Berlin Heidelberg 2007

Research on privacy-preserving data mining, starting from the work of [1], has been popular in recent years. Randomization is widely applied to the original datasets to hide sensitive values. In this way, most of the data records are "faked", and tuples with real data cannot easily be retrieved. The k-anonymization proposed in [2] provides an alternative way to preserve sensitivity: it uses generalization to hide sensitive values while keeping the data real. Most k-anonymization research [2,3,4,5,6,7,8] focuses on how to detach individuals from their corresponding data records. In doing so, individuals are hidden in groups of size at least k. However, frequent values within a group can break the defense set up by k-anonymization, a problem first addressed in [9]. Furthermore, we find that once matched with a user's prior knowledge, the frequent patterns can lead to even more serious sensitivity leakage. For example, we derive a 5-anonymity dataset in Figure 1(b) from the original data in Figure 1(a). Statistically, users can only identify the correct record of an individual with confidence less than 20%. But even without any prior knowledge, if Tom knows that Jennifer belongs to the first generalization group (all female) in Figure 1(b), then he learns with 80% confidence that Jennifer's salary ≤ 50K. Moreover, with some prior knowledge (Tom knows Jennifer works in a private company), he can be 100% confident about Jennifer's
Fig. 1. (a) Census data. (b) Generalized census data.
income and even her marital status, because the rule "Private→Divorced,≤50K (100%)" holds in the group. A user's prior knowledge may be either negative or positive. Similarly, if Tom knows Michael is not married, he can infer that Michael works in the federal government. In this paper, we model such frequent values and patterns within groups using association and dissociation rules. We lower them during the common anonymization process and then hide them using our algorithm with minimal data loss. This paper is organized as follows. Section 2 reviews the work related to our topic. Basic definitions for k-anonymization are presented in Section 3. In Section 4, we model the problem and present our enhanced anonymization process. Section 5 describes our hiding algorithm, and the experimental results are presented in Section 6. Finally, we summarize the conclusions of our study in Section 7.
2 Related Works
k-Anonymization, proposed in [2], has become a popular direction for protecting sensitive information. Quite a few systems have been developed for this purpose: μ-argus [5], Datafly [2], Incognito [4], and others.
In [3], the problem of optimal k-anonymization was proved to be NP-hard, and various strategies have been developed to approach this goal, such as bottom-up generalization [8], top-down anonymization [7], and the cell-based approach [10]. Recently, the work of [9] first considered a problem with current k-anonymization methods: the associations between the quasi-identifier and the sensitive attributes can break the anonymity. It proposed the concept of "l-diversity" to measure such associations and embedded the measurement into a k-anonymization algorithm. However, their method handles tables with a single sensitive attribute better than tables with several sensitive columns, which is the practical case; tables with highly frequent attribute values are also beyond its reach. Research [11] focused on implementing personalized anonymity requirements by generalizing both the quasi-identifier and the sensitive values; in doing so, it also dissolves the associations mentioned above, but again only for a single sensitive attribute. Association rule hiding was proposed in [12,13], and [13] summarizes the authors' previous methods: SWA, IGA and DSA. Most of these works develop heuristic methods that reduce the confidence or support of the sensitive rules by adding and removing rows.
3 Preliminary
First, we inherit several basic definitions for k-anonymization from the previous works mentioned in Section 2.

Definition 1. (Generalization) Given a domain D consisting of disjoint partitions {Pi} (i = 1 . . . n) with ∪Pi = D, generalizing a value v means returning the unique partition Pi that contains v.

Definition 2. (Quasi-Identifier) Given a table T(A1, A2, . . . , An), if there exists an external table S such that, for every record ti ∈ T, searching the values of ti(Aj, . . . , Am) in S uniquely locates ti, then we call the attribute set {Aj, . . . , Am} a quasi-identifier (i, j, m ≤ n; Aj is not the identifier attribute).

Definition 3. (k-Anonymity) Given a table T(A1, A2, . . . , An) and its quasi-identifier QI, if for every subset C ⊆ QI and every record ti ∈ T there exist at least k − 1 other records with the same values as ti on the attribute set C, then table T satisfies k-anonymity.

Definition 4. (Anonymity-Group) Given a table T and its quasi-identifier QI, an anonymity-group is the set of all records from T with the same values on QI.
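The definitions above can be sketched in a few lines of Python. The fragment below is illustrative only: it checks group sizes for the full quasi-identifier rather than for every subset C ⊆ QI, and the attribute names and records are hypothetical.

```python
from collections import defaultdict

def anonymity_groups(records, qi):
    """Partition records into anonymity-groups (Definition 4): sets of
    records sharing the same values on the quasi-identifier attributes."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in qi)].append(rec)
    return list(groups.values())

def is_k_anonymous(records, qi, k):
    """Simplified k-anonymity test (Definition 3): every anonymity-group
    must contain at least k records."""
    return all(len(g) >= k for g in anonymity_groups(records, qi))

rows = [
    {"sex": "F", "workclass": "Private", "salary": "<=50K"},
    {"sex": "F", "workclass": "Private", "salary": ">50K"},
    {"sex": "F", "workclass": "Private", "salary": "<=50K"},
    {"sex": "M", "workclass": "Federal-gov", "salary": ">50K"},
]
print(is_k_anonymous(rows, ["sex", "workclass"], 2))  # False: the (M, Federal-gov) group has size 1
```

Generalization (Definition 1) would then replace quasi-identifier values with coarser partitions until every group reaches size k.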
4 Enhanced Anonymization Process

4.1 Problem Modeling
From the example in Section 1, we notice that a user's all-positive prior knowledge plays the same role as the antecedent of association rules, while the other
inferable sensitive values form the consequent part. Similarly, knowledge containing a negative part can be represented by dissociation rules. Both types of rules obtained from the sensitive data within an anonymity-group form "inference paths" with respect to its quasi-identifier. These paths allow attackers to infer sensitive values with confidence far in excess of 1/k. We regard anonymity breaking as being of two types: without prior knowledge and with prior knowledge. The first type can be represented as frequent 1-itemsets within an anonymity-group. This is also the case discussed in [9], which checks diversity measurements and tries to make all values of each sensitive attribute evenly distributed in every anonymity-group; however, this may not be feasible for datasets with highly frequent itemsets. We represent the second type of anonymity breaking as association and dissociation rules with high confidence in the anonymity-group. As in the previous example, we have "Married-civ-spouse→Private (67%)" and "¬Married-civ-spouse→Federal-gov (100%)" in the second anonymity-group. Currently, we only deal with dissociation rules of the form ¬A → B; more complex forms will be considered in our future work. We thus solve the problem of anonymity breaking in a different way. Our main idea is to lower the confidence of those association and dissociation rules, as well as the support of the frequent 1-itemsets. With our own rule hiding strategy, we achieve this while generalizing the minimum number of sensitive data cells. The inference probability can therefore be kept below a preset threshold, and datasets with all kinds of distributions can be handled. We combine the two types of anonymity breaking into our formal definition of the "quasi-identifier"-"sensitive attribute" (Q-S for short) association:

Definition 5.
(Q-S Associations) Given an anonymity-group AG, a sensitive attribute set S, and a confidence threshold θ, denote a 1-itemset by m, an association rule by r, and a dissociation rule by dr. The Q-S associations of AG are {m, r, dr | support(m) > θ, confidence(r) > θ, confidence(dr) > θ, and m, r, dr ∈ AG(S)}.

We carry out our anonymization process in two main steps:

1. Enhance the common k-anonymization process by evaluating and lowering the Q-S associations in all anonymity-groups.
2. After the anonymization, hide the remaining Q-S associations in each k-anonymity group by sensitive value generalization.

In the first step, we evaluate the change in the Q-S associations brought by the candidate generalizations in each iteration. Combined with the measurements of anonymity and data loss, this is used to choose the best generalization in each iteration. For rule discovery, we follow an approach similar to that of [14]: we treat the anonymity-groups as
"partitions" [14], looking for rules in every group and then forming the "global" rules from the local ones.

4.2 Data Structure
Each anonymity-group sets up a "tree of inverted file" structure. This structure, together with the attached record ids (outlined with dotted boundaries in Figure 2), is indispensable in the Q-S association hiding step.
Fig. 2. Example tree of inverted file
In Figure 2, we show an example tree structure for the first group in Figure 1(b) (with the support threshold set to 25% and the confidence threshold to 60%). The tree starts from the longest itemsets; we denote the height of the tree by h. Every node represents an itemset (a rectangle for an itemset containing association rules, a rounded rectangle for an itemset associated with dissociation rules), the l-th layer consists of itemsets of length h − l + 1, and the leaf layer consists of the frequent 1-itemsets. For association rules, the nodes in a subtree are the frequent sub-itemsets of its root. Each node also stores the corresponding rules with their confidences (not shown in Figure 2). Rather than having every itemset carry the ids of all its supporting records, we store the ids only in those subtree root nodes none of whose parents are supported by the records. In Figure 2, records T4 and T9 are not stored in any rectangle node below layer 2; rules in child nodes can look up all their supporting rows in their parents. As for a dissociation rule dr: ¬A → B, since A and B are also frequent itemsets [15], the node of dr links itemsets A and B as child nodes in the tree structure, and the records supporting the infrequent itemset {A, B} are attached to it.
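The Q-S associations of Definition 5 that this tree stores can be enumerated directly. The following brute-force Python sketch is illustrative only: it handles frequent 1-itemsets and association rules over the sensitive attributes, omits dissociation rules for brevity, and all data and names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def qs_associations(group, sensitive, theta):
    """Enumerate Q-S associations of one anonymity-group: frequent
    sensitive 1-itemsets, plus association rules A -> B over sensitive
    values with confidence above theta. Brute force; fine for small groups."""
    n = len(group)
    support = Counter()
    for rec in group:
        itemset = sorted((a, rec[a]) for a in sensitive)
        for r in range(1, len(itemset) + 1):        # all non-empty subsets
            for sub in combinations(itemset, r):
                support[frozenset(sub)] += 1
    frequent = {s for s, c in support.items() if len(s) == 1 and c / n > theta}
    rules = {}
    for s, c in support.items():
        if len(s) < 2:
            continue
        for r in range(1, len(s)):                  # every proper antecedent
            for antec in combinations(sorted(s), r):
                antec = frozenset(antec)
                conf = c / support[antec]
                if conf > theta:
                    rules[(antec, s - antec)] = conf
    return frequent, rules

group = [
    {"workclass": "Private",  "marital": "Divorced",      "salary": "<=50K"},
    {"workclass": "Private",  "marital": "Divorced",      "salary": "<=50K"},
    {"workclass": "Self-emp", "marital": "Married",       "salary": "<=50K"},
    {"workclass": "Self-emp", "marital": "Married",       "salary": "<=50K"},
    {"workclass": "Self-emp", "marital": "Never-married", "salary": ">50K"},
]
frequent, rules = qs_associations(group, ["workclass", "marital", "salary"], 0.6)
```

Here the 1-itemset (salary, ≤50K) is frequent (support 80%), and the rule Private → Divorced holds with 100% confidence; both would be flagged as Q-S associations.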
4.3 Anonymization Metric
Let {r1, r2, . . . , rm} and {s1, s2, . . . , sn} represent the Q-S associations of two anonymity-groups AG1 and AG2 that are about to be merged. Suppose rule t of length k (i.e., consisting of k attributes) is one of their common Q-S associations. Let conf(t) denote the confidence of t, antec(t) the antecedent itemset of t, and suppNum(t)
be the number of records supporting t. The new confidence of t in the merged group can then be calculated quickly, without retrieving the dataset, before AG1 and AG2 are actually merged:

new_conf(t) = (suppNum_AG1(t) + suppNum_AG2(t)) / (suppNum_AG1(antec(t)) + suppNum_AG2(antec(t)))    (1)
If t does not exist in AG2, we look for t's antecedent antec(t) and other rules sharing the same itemset as t to calculate its new confidence. Furthermore, when AG2 does not contain a rule with the same k-itemset as t, we search AG2 for the antecedent itemset of t, and have:

new_conf(t) ∈ [ suppNum_AG1(t) / (suppNum_AG1(antec(t)) + suppNum_AG2(antec(t))),
               (suppNum_AG1(t) + suppNum_AG2(antec(t)) · θ) / (suppNum_AG1(antec(t)) + suppNum_AG2(antec(t))) )    (2)
We use "contribution" to quantify the effect a candidate generalization has in lowering each Q-S association.

Definition 6. (Contribution) Given a table T, the confidence threshold θ, and a candidate generalization G, denote all anonymity-groups involved in G as {AGi}. For a single Q-S association t, let n_after = suppNum(antec(t)) · (new_conf(t) − θ) be the number of its records still to be generalized after applying G, and n_before = suppNum_AGi(antec(t)) · (conf_AGi(t) − θ) the number before G. Then 1 − n_after/n_before is G's contribution to reducing t.

When evaluating a candidate generalization, we define the "average Q-S contribution" as the average of the contributions over all Q-S associations involved. We obtain contribution intervals when a specific Q-S association cannot be found in all the anonymity-groups involved. We keep these intervals until overlaps arise when comparing them; only then are the data records in the corresponding groups retrieved to compute the definite values of those contributions. Such data retrieval happens less often as the minimum anonymity [8] (i.e., the minimum size of the anonymity-groups) grows, since every anonymity-group also maintains the global rules in its tree structure. Therefore, for each candidate generalization G, we calculate A(G), the anonymity increase that G will produce (i.e., the increase of the minimum size of the anonymity-groups); DL(G), the data loss after applying G, which can be quantified by entropy increase [8] or by the decrease of distinct values in the taxonomy trees [6]; and Con(G), the average Q-S contribution. We then evaluate G as:

A(G) · Con(G) / DL(G)    (3)
We choose the generalization with the largest value of Equation 3. More methods for evaluating A(G) and DL(G) can be found in the works mentioned in Section 2.
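Equations (1) and (3) and Definition 6 reduce to simple arithmetic over support counts; the following Python sketch makes that concrete (all counts and thresholds are illustrative, not from the paper's experiments):

```python
def merged_confidence(supp1_t, supp2_t, supp1_antec, supp2_antec):
    """Eq. (1): confidence of a common rule t after merging AG1 and AG2,
    computed from per-group support counts alone (no dataset access)."""
    return (supp1_t + supp2_t) / (supp1_antec + supp2_antec)

def contribution(old_conf, supp_antec_before, new_conf, supp_antec_after, theta):
    """Definition 6: 1 - n_after/n_before, the fraction of association t's
    still-to-generalize records that generalization G removes."""
    n_after = supp_antec_after * (new_conf - theta)
    n_before = supp_antec_before * (old_conf - theta)
    return 1 - n_after / n_before

def score(anonymity_increase, avg_contribution, data_loss):
    """Eq. (3): the candidate generalization with the largest score wins."""
    return anonymity_increase * avg_contribution / data_loss

# Rule t has 100% confidence on 4 antecedent records in AG1; in AG2 only
# 1 of 4 antecedent records supports t, so the merged confidence is 5/8.
new_c = merged_confidence(4, 1, 4, 4)    # 0.625
c = contribution(1.0, 4, new_c, 8, 0.5)  # 0.5: half the excess over theta is gone
```

A full implementation would average such contributions over all Q-S associations touched by G to obtain Con(G) before applying Equation 3.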
Since the quasi-identifier uniquely identifies individuals through external databases, the initial anonymity-groups are small. Thus, the time at which we start evaluating the Q-S associations affects the balance among computing cost, memory usage, anonymity, and data loss. We track the minimum anonymity [8] during the anonymization; once it reaches a preset c · k (c ∈ IR), we bring in the Q-S association evaluation. The first round of Q-S association evaluation has the highest computational cost, since the tree of inverted file is set up there. Afterwards, the evaluation is quite fast, because most of the computation is done without touching the original dataset. We demonstrate this in the experiment section.
5 Q-S Association Hiding Algorithm
After the anonymization process, the anonymity-groups still contain Q-S associations, though with relatively low confidence. We therefore generalize sensitive values to hide the Q-S associations completely below the threshold. As mentioned in Section 2, quite a few works on association rule hiding have been presented. However, most of them aim at removing a set of sensitive rules while preserving the remaining rules and introducing few new rules, i.e., at achieving fewer side effects and fewer artifactual patterns [13]. Although we also hide rules in the anonymity-groups, we have different goals and requirements:

1. Hide both association and dissociation rules.
2. Hide all rules exceeding the confidence threshold.
3. Minimize data loss during the sensitive value generalization.
4. Use generalization, rather than adding or deleting rows as in the earlier studies [12,13].
No current work meets all the requirements above; the problem handled by IGA [13] is the closest to ours, and we compare with it in the experiments.

5.1 Hiding Metrics
Since we use generalization to hide Q-S associations, the interest measure of a rule t is evaluated as its minimum confidence, i.e., min suppNum(t) / max suppNum(antec(t)). For example, in Figure 1(b), suppose we generalize the marital-status of record T9 to "Any"; the confidence of the rule "Private, ≤50K → Divorced" evaluated this way decreases from 100% to 50%. In our method, we try to hide all Q-S associations, and each time we reduce the confidence of only one Q-S association by choosing a generalization that generalizes one of its attributes. We greedily choose the attributes to generalize so as to reduce the largest number of other Q-S associations.

Lemma 1. Given an anonymity-group AG, its tree of inverted file T(AG), the sensitive attributes {S1, S2, . . . , Sm}, and the confidence threshold θ, let
NS denote the node of an arbitrary sensitive itemset in T(AG), and SR the set of rules in NS and in the nodes of NS's subtree. Then for every generalization Gi on a rule in SR there exists a generalization Gns on NS such that, when generalizing a fixed number of records, the contribution of Gns to reducing the Q-S associations is no less than that of Gi.

Proof. First, we derive the expressions of the contribution in the different cases. Consider an association rule r in NS: A → B (A ∪ B ⊂ NS and A ∩ B = ∅). Suppose the candidate generalization G for r generalizes an attribute in A, affecting d records that support r. Then the maximum possible number of records supporting A does not change, while the definite number of records supporting A ∪ B decreases by d. Applying the concept of contribution, the generalization G contributes to the confidence reduction of r as:

contribution_G(r) = (conf(r) − (suppNum(r) − d) / (suppNum(r)/conf(r))) / (conf(r) − θ)    (4)
The case of generalizing an attribute in B is similar. As for a dissociation rule dr: ¬A → B, which has itemsets A and B as its child nodes: if we generalize the itemset A, the maximum possible number of records supporting ¬A increases, while the definite number of records supporting dr stays the same. The contribution is then:

contribution_G(dr) = (conf(dr) − suppNum(dr) / (suppNum(dr)/conf(dr) + d)) / (conf(dr) − θ)    (5)
When B is to be generalized, we need only avoid the records attached to dr, as they support A ∪ B. Generalizing records that support dr produces a contribution similar to Equation 4. Child nodes of NS are affected by the candidate generalization, which reduces the confidence of some of the association and dissociation rules within SR. We sum all these contributions as the measure of G's effect in reducing Q-S associations:

wholeContribution_G(r) = Σ_{i=1..|SR|} contribution_G(r_i), r_i ∈ SR    (6)
Therefore, if G' is a candidate generalization for a rule r' in a child node of NS that generalizes the value in attribute Sj, and G takes the same action for r, then wholeContribution_G(r) ≥ wholeContribution_G'(r').
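Equations (4) and (5) can be computed directly from a rule's confidence and support count, since suppNum(antec) = suppNum/conf. A hedged Python sketch (values illustrative, not the authors' implementation):

```python
def contrib_association(conf_r, supp_r, d, theta):
    """Eq. (4): contribution when a generalization touches d of the records
    supporting association rule r; antecedent support is suppNum(r)/conf(r)."""
    new_conf = (supp_r - d) / (supp_r / conf_r)
    return (conf_r - new_conf) / (conf_r - theta)

def contrib_dissociation(conf_dr, supp_dr, d, theta):
    """Eq. (5): contribution when d records of itemset A of dissociation rule
    dr: (not A) -> B are generalized; the support of dr is unchanged while
    the possible support of (not A) grows by d."""
    new_conf = supp_dr / (supp_dr / conf_dr + d)
    return (conf_dr - new_conf) / (conf_dr - theta)

# Hiding a rule with conf 100% and 2 supporting rows by generalizing one
# of them, with theta = 50%, removes the whole excess over the threshold:
print(contrib_association(1.0, 2, 1, 0.5))  # 1.0
```

A contribution of 1.0 means the generalization brings the rule's confidence exactly down to θ; values between 0 and 1 mean part of the excess remains.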
5.2 Hiding Algorithm
Based on Lemma 1, we develop our Q-S association hiding algorithm, shown below.
Algorithm 1: Q-S association hiding algorithm

Data: anonymity-group AG, inverted file tree for AG: T(AG), confidence threshold θ
Result: anonymity-group without Q-S associations AG'
begin
    foreach record ∈ AG do
        store the record id in the nodes of T(AG) using maximum matching
    foreach level l ∈ T(AG) with l > 1 (top-down) do
        foreach node ∈ level l of T(AG) do
            s ← node ∪ {node t | t ∈ subtree of node}
            mr ← {rule r | conf(r) = min conf(ri), ri ∈ node}
            H_mr ← candidate generalizations, one per attribute ∈ mr
            wholeContribution(mr) ← zero vector with length(mr) dimensions
            foreach rule rr ∈ s do
                add contribution_{H_mr}(rr) to wholeContribution(mr)
                if rr is the antecedent of a dissociation rule dr then
                    add contribution_{H_mr}(dr) to wholeContribution(mr)
            attr ← dimension with the maximum value in wholeContribution(mr)
            foreach record row to be generalized do
                if attr in row is not generalized then
                    generalize row on attribute attr
                else if attr is generalized to D and row(attr) ∉ D then
                    generalize attr to a higher position in the hierarchy containing row(attr)
                recompute the confidence of the other rules row supports
    generalize the remaining frequent 1-itemsets
end
The Q-S association hiding algorithm proceeds as follows. First, as in Section 4.2, we attach every record to the tree nodes. Then, starting from the longest rule, we generate the candidate generalizations, one for each of its attributes. We test the candidates against the subtree to build the vector wholeContribution, each of whose dimensions corresponds to one candidate, and select the generalization with the highest contribution sum. A data record may be stored in more than one itemset when those itemsets do not contain each other, which could lead to repeated generalization of the same column value in a record; we therefore check the status of the attribute and decide whether to generalize it to a higher domain or skip the record, and we recompute the confidence of every missing rule (i.e., a rule outside the subtree but supported by the row being generalized). Our algorithm chooses the generalizing attribute by comparing contributions. Although the contribution calculation is limited to the subtree in the current study, it covers most of the generalization effect, especially when handling long itemsets, which rapidly reduce all the Q-S associations they contain.
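The greedy attribute choice at the heart of Algorithm 1 can be sketched in isolation. In the fragment below, a rule is modeled simply as an attribute set with a confidence and a support count, contributions follow Eq. (4) with one touched record, and all names and values are hypothetical:

```python
def contrib(conf_r, supp_r, d, theta):
    """Eq. (4) applied to a rule that loses d supporting records."""
    new_conf = (supp_r - d) / (supp_r / conf_r)
    return (conf_r - new_conf) / (conf_r - theta)

def best_attribute(target_rule, subtree_rules, theta, d=1):
    """Greedy choice from Algorithm 1: for each attribute of the target
    rule, sum the Eq. (4) contributions over every subtree rule containing
    that attribute, and pick the attribute with the largest sum."""
    scores = {
        attr: sum(contrib(r["conf"], r["supp"], d, theta)
                  for r in subtree_rules if attr in r["attrs"])
        for attr in target_rule["attrs"]
    }
    return max(scores, key=scores.get)

rules = [
    {"attrs": {"workclass"}, "conf": 1.0, "supp": 2},
    {"attrs": {"workclass", "marital"}, "conf": 1.0, "supp": 2},
    {"attrs": {"salary"}, "conf": 0.8, "supp": 4},
]
target = {"attrs": {"workclass", "salary"}}
print(best_attribute(target, rules, 0.5))  # workclass: it appears in two high-confidence rules
```

Generalizing "workclass" is preferred here because that single action lowers two rules at once, which is exactly the effect Lemma 1 exploits.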
Moreover, limiting the range of the contribution calculation keeps the memory requirement of our inverted file tree small; otherwise, we would have to associate every row with all the Q-S associations it contains. Let n be the number of Q-S associations; since every rule is generalized after a traversal of its subtree, the time complexity of the algorithm is O(n log n). Algorithm 1 also shows how dissociation rules are handled: when the antecedent node (which is also a child node) of a dissociation rule is affected by a generalization, we evaluate the effect and compute the generalization's contribution to that dissociation rule (as in Lemma 1).
6 Experiment Result
In our experiments, we use the "Adult Database" obtained from [16], which has 14 attributes and 48842 instances. Records with missing attribute values ("?") are removed. Table 1 shows the attributes we adopt, the number of leaf nodes in their hierarchy trees, and the height of those trees. We use different combinations of quasi-identifier and sensitive columns and average the experimental results.

Table 1. The Attributes Adopted (quasi-identifier/sensitive)

Attribute   Education  Occupation  Race  Sex  Workclass  Marital-status  Relationship  Native-country
Leaf Num.   16         14          5     2    8          7               6             41
Height      4          4           3     2    3          3               3             4
Our implementation has two steps, which we test separately. Due to space limits, we cannot list all our experimental results here. For rule hiding, we compare our algorithm with an implementation of the IGA [13] strategy using generalization. To be fair, we only hide the association rules from the datasets. The support and confidence thresholds are set to 20% and 50% respectively, and the hierarchy trees are constructed with height 2. Each time, we choose a different number of attributes and compute the ratio of the cells generalized by our algorithm to those generalized by IGA. Figure 3(a) shows that, under our requirements, our Q-S association hiding algorithm incurs smaller data loss. This is mainly because the item with the highest contribution reduces the largest number of Q-S associations. We then compare the common k-anonymization with our enhanced version. The support and confidence thresholds are 10% and 50%, and k = 250. We implement the strategy of [8] as the common version, and we bring in the Q-S association evaluation at different values of the minimum anonymity [8]: 25, 50, 100, and so on. In Figure 3(b), we compare the "information loss", "performance" and "hiding efficiency" of both methods by calculating the "entropy loss in anonymization", the "execution time after building the inverted file
Fig. 3. Methods comparison. (a) Comparison between Q-S association hiding and IGA. (b) Comparison of k-anonymization between our method and the "bottom-up" strategy.
tree", and the "data loss in hiding step", and then computing the ratios of our method to the common k-anonymization. As shown in Figure 3(b), our method approaches the optimal result of the "bottom-up" strategy as the minimum anonymity grows. Currently, the Q-S associations, information loss and anonymity carry the same weight in choosing the candidate generalizations; therefore, when we start evaluating the Q-S associations at a small minimum anonymity, the anonymization deviates from the optimal result early. Assigning different weights to these three metrics could relieve this, which is one of our future research directions. The inflexion in the "information loss" curve shows the greedy character of "bottom-up", which sometimes prevents it from reaching the global optimum. The "execution time comparison" series shows that when the Q-S associations are evaluated early in the process, performance after the tree construction decreases because of the increasing number of dataset accesses. The "data loss in hiding" series shows that fewer cells have to be hidden when we bring in the Q-S association evaluation early in the anonymization. To balance performance, the optimality of the k-anonymity result, and the number of cells to hide, we find it better to start evaluating the Q-S associations when the minimum anonymity reaches 50 or 100.
7 Conclusion
In this paper, we have introduced an enhanced k-anonymization method that detaches the links between quasi-identifiers and sensitive attributes. We defined such links using frequent 1-itemsets and high-confidence association and dissociation rules within an anonymity-group. We not only evaluated them during the k-anonymization process but also removed them using our Q-S association hiding algorithm. In our research, k-anonymization is combined with rule hiding, which is itself a direction in privacy-preserving data mining. By applying our greedy algorithm, we prevent anonymity breaking via these "inference paths" with minimum data loss.
The k-anonymization method is a promising way to protect sensitive data in data publishing. Although it has limitations, combining it with other techniques may accomplish more. We regard our work as an initial step; further research will include more work on Q-S association modeling and on developing generalization metrics.
References

1. Agrawal, R., Srikant, R.: Privacy-preserving data mining. In: Proc. of the ACM SIGMOD Conference on Management of Data (2000)
2. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5) (2002) 571–588
3. Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proc. of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2004)
4. LeFevre, K., DeWitt, D., Ramakrishnan, R.: Incognito: Efficient full-domain k-anonymity. In: Proc. of the 2005 ACM SIGMOD International Conference on Management of Data (2005)
5. Hundepool, A., Willenborg, L.: μ-argus and τ-argus: Software for statistical disclosure control. In: Proc. of the 3rd International Seminar on Statistical Confidentiality (1996)
6. Iyengar, V.: Transforming data to satisfy privacy constraints. In: Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002)
7. Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: Proc. of the 21st International Conference on Data Engineering (2005)
8. Wang, K., Yu, P.S., Chakraborty, S.: Bottom-up generalization: A data mining solution to privacy protection. In: Proc. of the 4th IEEE International Conference on Data Mining (2004)
9. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: Privacy beyond k-anonymity. In: Proc. of the 22nd International Conference on Data Engineering (2006)
10. Nergiz, M.E., Clifton, C.: Thoughts on k-anonymization. In: Proc. of the 22nd International Conference on Data Engineering Workshops (2006)
11. Xiao, X., Tao, Y.: Personalized privacy preservation. In: Proc. of the 2006 ACM SIGMOD International Conference on Management of Data (2006)
12. Verykios, V.S., Elmagarmid, A., Bertino, E., Saygin, Y., Dasseni, E.: Association rule hiding. IEEE Transactions on Knowledge and Data Engineering 16(4) (2004) 434–447
13. Oliveira, S.R.M., Zaïane, O.: A unified framework for protecting sensitive association rules in business collaboration. International Journal of Business Intelligence and Data Mining 1(3) (2006) 247–287
14. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. In: Proc. of the 21st International Conference on Very Large Data Bases (1995)
15. Wu, X., Zhang, C., Zhang, S.: Mining both positive and negative association rules. In: Proc. of the 19th International Conference on Machine Learning (2002)
16. Hettich, S., Bay, S.D.: The UCI KDD archive. University of California, Irvine, Department of Information and Computer Science (1999)
Protecting and Recovering Database Systems Continuously

Yanlong Wang, Zhanhuai Li, and Juan Xu

School of Computer Science, Northwestern Polytechnical University, No.127 West Youyi Road, Xi'an, Shaanxi, China 710072
{wangyl,xuj}@mail.nwpu.edu.cn, [email protected]
Abstract. Data protection is widely deployed in database systems, but the current technologies (e.g. backup, snapshot, mirroring and replication) cannot restore database systems to an arbitrary point in time. This means that data is less well protected than it ought to be. Continuous data protection (CDP) is a new way to protect and recover data, shifting the data protection focus from backup to recovery. We (1) present a taxonomy of the current CDP technologies and a strict definition of CDP, (2) describe a model of continuous data protection and recovery (CDP-R) built on CDP technology, and (3) report a simple evaluation of CDP-R. We are confident that CDP-R continuously protects and recovers database systems in the face of data loss, corruption and disaster, and that the key techniques of CDP-R are helpful for building a continuous data protection system, which can improve the reliability and availability of database systems and guarantee business continuity.
1 Introduction

With the widespread use of computers, database systems are vital in human life, and the data stored in them is becoming companies' most valuable asset. Although we are careful to defend against all kinds of disasters, they still occur frequently: hardware breaks, software has defects, viruses propagate, buildings catch fire, power fails and people make mistakes [1]. Data corruption and data loss caused by such disasters have become dominant, accounting for up to 80% [2] of data loss. Recent high-profile data losses have raised awareness of the need to plan for recovery and continuity. In particular, it is a challenge that a large number of database systems must be continuously available, and businesses must also be prepared to provide continued service in the event of disasters. Many data protection solutions, including fault-tolerance and disaster-tolerance techniques, have been employed to increase database system availability and to reduce the damage caused by data loss, corruption and disaster [3]. Backup [4] is the most popular solution; it stores vital data on tape or disk. Basic backup includes three modes, full, incremental and differential, all of which can be performed offline or online. In addition, there are several other solutions, such as redundant disk arrays (RAID) [5], mirroring [6], snapshot [7] and replication [8].

G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 765–776, 2007. © Springer-Verlag Berlin Heidelberg 2007
However, conventional backup technologies have many drawbacks. First, offline backup (cold backup) requires the application to be taken down periodically (daily or weekly) or to go completely offline, and although online backup (hot backup) allows backing up while the database is still running, it incurs a performance penalty. Second, backup is time-consuming, and recovering data takes a long time. Third, database systems can only be restored to a pre-determined previous point, and data written between backups is vulnerable to loss. Recent research [1] has shown that data loss or data unavailability can cost up to millions of dollars per hour in many businesses. Other solutions share most of the same drawbacks as backup. The traditional, time-consuming techniques are therefore no longer adequate for today's information age. In order to remove the backup window and resolve recovery point objective (RPO) and recovery time objective (RTO) issues, researchers have proposed continuous data protection (CDP) [9]. CDP represents a major breakthrough in data protection and dramatically shifts the data protection focus from backup to recovery. With CDP continuously capturing and protecting all changes to the important data of database systems, it provides rapid recovery to any desired point in the past when disaster strikes, and access to data at any point in time (APIT) [10] after recovery. CDP offers more flexible RPO and faster RTO than traditional data protection solutions, which were designed to create, manage and store single-point-in-time (SPIT) [11] copies of data, thereby reducing data loss and eliminating costly downtime. CDP has appeared only recently, so it is not yet well understood. In our survey of the approaches used in practice, we found that most current "CDP" technologies are not real CDP but only near-CDP. Our first contribution, then, is a taxonomy of current CDP technologies and a strict definition of CDP.
Our second contribution is the design of a CDP model for database systems, referred to as the continuous data protection and recovery model (CDP-R). It is built at the block level and provides continuous protection and recovery of database systems' data. The final contribution is an evaluation of our CDP-R model, briefly comparing it with other backup technologies.
2 CDP

2.1 Taxonomy

CDP is becoming a hot topic, and there have been research efforts in some large IT companies, research institutions and emerging companies. There are several assessment criteria for CDP designs, and we summarize the basic axes as data protection scheme, design level, storage repository and recovery mechanism.

Data protection scheme. Current CDP systems implement a continuous or near-continuous data protection scheme for retrieving even the most recently saved data:

1. CDP systems: save every change to data as it is made and let administrators or users recover files and other data such as email from any point in time. Examples are Peabody [12], TRAP-Array [13], CPS [14] and TimeData [15].
Protecting and Recovering Database Systems Continuously
2. Near-CDP systems: lack the fine granularity of CDP; they take snapshots of data at specified points in time and only allow customers to retrieve data from those times, not from seconds or even hours ago. Examples are Backup Exec 10d, DPM, Tivoli CDP for Files and LiveServe [16].
CDP systems can recover the primary to any point in time, whereas near-CDP can only provide scheduled point-in-time recovery, so we do not consider near-CDP in this paper.

Design level. CDP systems have been implemented at the block-, file- or application-level against disasters:

1. Block-level CDP systems: operate above the physical storage or logical volume management layer. As data blocks are written to the primary storage, copies of the writes are captured and stored in an independent location. Peabody [12] exposes virtual disks to recover any previous state of their sectors and shares backend storage to reduce the total amount of storage needed. TRAP-Array [13] designs a CDP prototype of a new RAID architecture and stores the timestamped exclusive-ORs of successive writes to provide timely recovery to any point in time. CPS [14] adopts time-addressable storage (TAS) and adds time as a dimension of data storage.
2. File-level CDP systems: operate just above the file system. They capture and store file-system data and metadata events (such as file creation, modification, or deletion). For example, TimeData [15] keeps the protected instances of files in their natural form and recovers files to any point in time at the file level.
3. Application-level CDP systems: operate directly within the specific application that is being protected. Such solutions offer deep integration and are typically either built into the application itself or make use of special application APIs, which grant continuous access to the application's internal state as changes occur.
File- and application-level CDP systems provide CDP only for some fixed file systems or applications. Block-level CDP systems have the advantage of supporting many different applications with the same general underlying approach. They can achieve high performance and help build a multi-platform CDP engine to protect a variety of database systems. The recovery granularity of a block is ideal, and potential data loss is minimal. We discuss CDP at the block level in this paper, although file- and application-level CDP could readily be implemented.

Storage repository. The storage repository provides the ability to store and manage CDP data over time. CDP systems employ either a distinct, dedicated node or the host itself as the storage repository:

1. Distinct storage repository: architected in an independent location where all data changes are stored. The distinct node is available on the LAN, WAN or SAN. This kind of repository is employed by most CDP systems.
2. Self-storage repository: established on the protected host itself, where changed data is written directly onto an independent CDP storage region, as in Peabody [12] and TRAP-Array [13].

We use the distinct storage repository to keep CDP data in the following text.
Recovery mechanism. The recovery mechanism determines the recovery procedure and can be implemented in two modes:

1. Independent recovery: achieved using only the storage repository, where the data includes the initial data set and the changed data set of the primary. Independent recovery makes it possible to reduce the cost of CDP recovery.
2. Dependent recovery: achieved with the storage repository and an initial replica, which increases the complexity of CDP recovery.

We use the independent recovery mechanism in the following text.

2.2 Definition

According to CDP systems and researchers, the SNIA (Storage Networking Industry Association) Continuous Data Protection Special Interest Group (CDP SIG) defines CDP as "a methodology that continuously captures or tracks data modifications and stores changes independent of the primary data, enabling recovery points from any point in the past" [9]. While various CDP systems confuse us, the above definition is too simple to guide the design of a veritable CDP system. To describe CDP rigorously and in detail, we define CDP theoretically in two aspects (i.e., protection and recovery) as follows:
Definition 1. τ^P(t) is the data image/view of the primary P at time t (t ≥ t0), where τ^P(t0) is the initial data image/view at the beginning time t0. If |τ^P(t)| is the data set of the primary P at time t (t ≥ t0), then δ^P(t) = |τ^P(t)| − |τ^P(t − Δt)| is the data set of all the changes of the primary P at time t, where Δt → 0, and δ^P(t1, t2) = {δ^P(t), t1 ≤ t ≤ t2} is the sum of the changed data sets of the primary P from time t1 to time t2. When δ^P(t0, t) is stored to a distinct site, the backup B, the procedure is called continuous data protection (CDP) from t0 to t.
Definition 2. λ^B(t) is the data set of the backup B corresponding to the data image/view of the primary P at time t (t ≥ t0), where λ^B(t0) is the initial data set at the beginning time t0. If the backup B receives the delta δ^P(t) from the primary P at time t, then λ^B(t) = λ^B(t − Δt) + δ^P(t), where Δt → 0, and inductively λ^B(t) = δ^P(t0, t). If |λ^B(t)| is the data set of the backup B after coalescing all the blocks with the same address at time t (t ≥ t0), then when |λ^B(t)| is restored to the primary P and overwrites the blocks of the primary P according to the address of each block, the procedure is called continuous data recovery (CDR) at time t.
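The accumulation and coalescing in Definition 2 can be illustrated with a toy Python sketch, modeling block addresses as dict keys; the function names are ours, not part of the paper's model:

```python
def apply_delta(lam, delta):
    # λ^B(t) = λ^B(t − Δt) + δ^P(t): append the latest change set
    return lam + [delta]

def coalesce_backup(lam):
    # |λ^B(t)|: coalesce blocks with the same address, keeping the newest
    image = {}
    for delta in lam:
        image.update(delta)
    return image

def recover_primary(primary, lam):
    # CDR at time t: overwrite primary blocks from the coalesced backup
    restored = dict(primary)
    restored.update(coalesce_backup(lam))
    return restored
```

Blocks untouched by any delta keep their current contents at the primary, while every address appearing in λ^B(t) is overwritten with its newest version.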
3 CDP-R Model

In order to protect and recover database systems continuously, we set out to design a model of continuous data protection and recovery (CDP-R). The goal of the CDP-R model is to keep a copy of each block-level change of the database system in a distinct storage repository and to keep the data of the database system available despite both hardware and software failures, thereby achieving continuous protection and recovery of database systems. CDP-R model is composed of client, primary and backup, as shown in Fig. 1:
Fig. 1. An overview of CDP-R model
Client provides an intelligent management platform for users to operate database systems and configure the CDP-R model. Primary includes the database system, protector, storage and log; Backup includes the repository, storage and time-index-table. The protector and the repository are the main components of CDP-R model for protecting and recovering database systems, as shown in Fig. 2. The protector continuously captures every change of the primary and sends it to the backup. The repository receives data from the primary and stores it over time in storage.

Fig. 2. Modules of protector and repository: (a) protector — capture-module, encapsulation-module, replication-module, log-module, storage-module; (b) repository — receive-module, index-module, storage-module, recovery-module
3.2 Workflow

Normally, an operation at the primary causes a write record to be written synchronously to the primary log, and the block-level data can then be written to the primary storage. Simultaneously, the CDP-R model performs a three-step workflow:
1. Capture: After the capture-module gets every block-level change of the database system, the encapsulation-module wraps the data block datai in a package with a timestamp ti and other description information disci (including storage address, size, etc.) and then forms a backup record <ti, disci, datai>.
2. Backup: The replication-module replicates every backup record to the backup synchronously or asynchronously. After the receive-module takes the backup record, the storage-module inserts an item into the time-index-table and stores the record in the storage.
3. Retrieve: When the database system needs recovering in case of data loss, corruption or disaster, clients can look up the time-index-table and select a past point. Then we retrieve the primary from the appointed data of the backup and recreate the exact data state as it existed at any point in time.
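As a sketch, the capture step's backup record <ti, disci, datai> might be represented as follows; the field names are illustrative and flatten disci into an address and size, they are not from the paper:

```python
import time
from dataclasses import dataclass

@dataclass
class BackupRecord:
    timestamp: float  # t_i
    address: int      # part of disc_i: storage address of the block
    size: int         # part of disc_i: size of the block
    data: bytes       # data_i

def encapsulate(address, data):
    # Capture: wrap a block-level change with a timestamp and description
    return BackupRecord(time.time(), address, len(data), bytes(data))
```

Each captured block-level change becomes one self-describing record that the replication-module can ship independently.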
The whole capture-backup-retrieve procedure is implemented automatically in the background. Fig. 3 shows the state transitions of a data block in CDP-R model.

Fig. 3. Data block states in CDP-R model. The left part is the state transitions of the data block at the primary and the right part is the state transitions at the backup.
4 Key Technologies

CDP-R model is implemented by three key technologies (referred to as 3R) as follows:

1. Replication: To meet users' needs and fit the network situation, the primary must adopt an appropriate replication protocol and dynamically transmit the backup record to the backup over all types of TCP/IP networks (LAN, WAN, etc.). We implement two replication protocols, i.e., synchronous and asynchronous, and keep the data consistent between the primary and the backup.
2. Repository: To store and conveniently look up every backup record, the backup must manage all backup records with an effective structure and an index dictionary. We architect a delta-chain to store backup records over time, and build a time-index-table to locate every record.
3. Recovery: To deal with a disaster, the primary must recover from the backup. We create an any-point-in-time incremental or full version of the backup, and use it to retrieve the primary rapidly.
To capture and encapsulate the changes of the database system at the primary continuously, CDP-R model can adopt Loadable Kernel Modules (LKM) on Linux or the Windows Driver Model (WDM) on Windows. We won't discuss this technology in more detail here.

4.1 Replication

Replication mode. The replication protocol plays an important role in CDP-R model. It automatically transmits every backup record to the backup. It has two modes, synchronous and asynchronous. We deal with a block-level change of the database system in nine steps and implement the replication protocol in synchronous and asynchronous modes as shown in Fig. 4.
Fig. 4. Replication protocol of CDP-R model. 1-protector captures a block-level change of database system; 2-protector writes the change to the log; 3-protector writes the change to the storage; 4-protector encapsulates the change and sends the backup record to the repository; 5-repository returns the receiving acknowledgement; 6-repository writes the change to the time-index-table; 7-repository writes the change to the storage; 8-repository returns the completing acknowledgement; 9-protector returns success to database system.
We recast the traditional protocol into a new replication protocol and adopt several methods to increase its reliability and efficiency. For example, we write the log/time-index-table before writing the storage. We also execute several steps in parallel and process the block-level changes by pipelining. Each replication mode deals with the block-level changes differently. Synchronous mode ensures that a backup record has been posted to the backup before the database system's request completes at the application level. A database system running an application may experience response time degradation because each backup record incurs the cost of a network round trip, but the backup is always up to date. If a disaster occurs at the primary, data can be recovered from any surviving backup with minimal loss. Asynchronous mode completes an update when it has been recorded in the log and storage at the primary. The response time is shorter, at the cost of the backup being potentially out of date. If a disaster strikes, it is likely that the most recent writes have not reached the backup. Therefore, the decision to use synchronous or asynchronous mode depends on users' requirements, the available network bandwidth, network latency and the number of backup servers.
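The synchronous/asynchronous choice can be sketched minimally in Python; this assumes a caller-supplied send() that posts one backup record to the backup and returns after the completing acknowledgement (the class and its API are ours, not the paper's):

```python
import queue
import threading

class Replicator:
    """Sketch of synchronous vs. asynchronous replication of backup records."""
    def __init__(self, send, synchronous=True):
        self.send = send
        self.synchronous = synchronous
        self.q = queue.Queue()  # send-queue used in asynchronous mode
        if not synchronous:
            threading.Thread(target=self._drain, daemon=True).start()

    def replicate(self, record):
        if self.synchronous:
            self.send(record)   # caller waits: backup is always up to date
        else:
            self.q.put(record)  # caller returns at once: backup may lag

    def _drain(self):
        # Background thread continuously drains the send-queue
        while True:
            record = self.q.get()
            self.send(record)
            self.q.task_done()
```

Synchronous mode pays a network round trip per record; asynchronous mode trades a window of potential data loss for shorter response time, exactly the trade-off described above.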
Data consistency. Data is consistent if the database system using it can be successfully restarted to a known, usable state. That is, data at the backup correctly reflects the data changes at the primary at some point in the past. CDP-R model maintains data consistency by two means:

1. Send-queue and receive-queue: The backup records are queued temporarily in a circular queue to be sent to the backup. When there is a surge in the block-level change rate, this queue may grow and will be continuously drained. After the backup records reach the backup, another circular queue keeps them temporarily and drains as fast as they are written to storage. Both queues try to keep the backup as consistent as the primary and achieve write-order fidelity.
2. Atomic replication and atomic write: While data consistency in synchronous mode is not affected by network failures, in asynchronous mode it tends to be. In asynchronous mode, the completing acknowledgements of some backup records may be lost when network problems occur even though those backup records have already been written to storage at the backup. If we resend those records once the network recovers, the backup may become inconsistent with the primary. Thus, the primary sends them with atomic replication and the backup stores them with atomic write, which avoids the risk of inconsistency.
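The atomic-write idea in point 2 amounts to making the store idempotent, so that retrying a record whose acknowledgement was lost cannot change the backup. A minimal sketch, assuming each record carries a stable identifier (the function is ours, not the paper's API):

```python
def atomic_store(backup, record_id, record):
    # Idempotent store keyed by a stable record id: re-sending a record
    # whose acknowledgement was lost leaves the backup unchanged, so
    # retries in asynchronous mode cannot make the backup inconsistent.
    backup.setdefault(record_id, record)
```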
4.2 Repository

Time index table. When we want to recover the primary, we need to select a past time t and then collect all backup records at that time. According to time t, the time index table (see Fig. 5) is used to build an index dictionary and find the target backup records stored in the storage. It simply maps t to an address in the storage. In order to generate a unique fingerprint for every time, CDP-R model uses the SHA-1 hash function [17] to build a large hash table as the time index table. SHA-1 is a popular, efficient hash algorithm used in many security systems, and its output is a 160-bit hash value. Assuming the granularity of time is a microsecond, random hash values have a uniform distribution, and there is a collection of n different times hashed to 160 bits, the probability p that there will be one or more collisions is bounded by the number of pairs of times multiplied by the probability that a given pair will collide, i.e., p ≤ (n(n − 1)/2) · (1/2^160). If we keep backup records for one year, which is enough for protecting common database systems, then n = 365 × 24 × 60 × 60 × 10^6 ≈ 10^14, and p is less than 10^−20. Obviously, SHA-1 is suitable for CDP-R model and the collision scenario can be ignored. Although it is ideal that every backup record has a unique timestamp, in fact a series of backup records may have the same timestamp. For example, there may be some backup blocks with the same microsecond timestamp in current computer systems. Therefore, the time index table locates the first of a series of backup records with the same timestamp. After receiving a backup record, the repository extracts the timestamp from it and hashes the timestamp with SHA-1. Then it checks whether the item is already in the time index table. If yes, the repository locates the address in the storage and scans the storage forwards to find a free space for the backup record; otherwise,
repository fills a new address into the item and then stores the backup record in the storage according to the new address. Therefore, given a time t, we can collect a series of backup records with that time.

DeltaChain. We present DeltaChain to manage the storage at the backup. DeltaChain is like a linked list composed of a large number of segments, and a segment holds a series of backup records with the same time (see Fig. 5). All of the backup records are stored contiguously, referred to as continuous storage over time, unlike Peabody [12] or TRAP-Array [13]. Continuous storage increases the speed of locating an address and reduces the storage fragments of Peabody, which stores every version of a block contiguously.

Fig. 5. Repository of CDP-R model. Each item <ti, addressi> of the time index table points into the DeltaChain (storage), whose records are Recordij = <discij, dataij>. Every segment is stored continuously. In every segment, records have been coalesced if they have identical descriptions.
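The time index table and its collision bound can be sketched in Python; the `Repository` class and its method names are illustrative, assuming an append-only list stands in for the contiguous DeltaChain storage:

```python
from hashlib import sha1

def collision_bound(n, bits=160):
    # Birthday bound used in the text: p <= n(n-1)/2 * 1/2^bits
    return (n * (n - 1) // 2) / 2 ** bits

class Repository:
    """Sketch of the time index table over a DeltaChain."""
    def __init__(self):
        self.index = {}    # SHA-1(timestamp) -> address of first record
        self.chain = []    # DeltaChain: records stored contiguously

    def store(self, t, desc, data):
        key = sha1(str(t).encode()).hexdigest()
        # First record for this timestamp: remember where its segment starts
        self.index.setdefault(key, len(self.chain))
        self.chain.append((t, desc, data))

    def segment(self, t):
        # Locate the first record for t, then scan the storage forwards
        key = sha1(str(t).encode()).hexdigest()
        start = self.index[key]
        return [r for r in self.chain[start:] if r[0] == t]
```

With one year of microsecond timestamps, `collision_bound(365 * 24 * 60 * 60 * 10**6)` evaluates to well under the 10^−20 bound cited above.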
Segment0 is ready to store all the backup records from the primary at time t0. If DeltaChain is fully initialized by τ^P(t0) and k is equal to the number of all the data blocks of the primary, then the backup records in Segment0 correspond to all the data blocks of the primary at time t0. If DeltaChain is partially initialized by τ^P(t0), then when the data of a backup record (e.g., <ti, discij, dataij>) is replicated to the repository for the first time, a backup record <t0, disc0r, data0r> also has to be replicated to the repository, where disc0r = discij and data0r is the data at the same address before being overwritten. The repository then stores it as the r-th backup record in Segment0 before storing <ti, discij, dataij>. That is, only the data that will be overwritten has to be replicated and stored into Segment0. The other segments have the same function and are used to store ordinary backup records. All the backup records in a segment have the same timestamp. A segment grows as it receives new backup records.
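The partial-initialization rule is essentially copy-on-write into Segment0. A toy sketch (function and parameter names are ours, with segments modeled as dicts keyed by description/address):

```python
def store_partial(segment0, segments, primary_before, t, desc, data):
    """Partial initialization of DeltaChain: the first time an address is
    overwritten, copy its old contents into Segment0 as <t0, desc, old>."""
    if desc not in segment0:
        segment0[desc] = primary_before[desc]  # data_0r at the same address
    segments.setdefault(t, {})[desc] = data    # ordinary record <t, desc, data>
```

Only blocks that are actually overwritten ever reach Segment0, which is what lets partial initialization avoid copying the whole primary image at t0.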
4.3 Recovery

The primary faces several threat categories: data loss, data corruption and data inaccessibility [1]. To limit the scope of this study, we focus on data loss events for the primary and map data corruption and inaccessibility threats into data loss. After a failure, we can adopt one of the following continuous-data-recovery algorithms to restore the primary from the backup to any point in time and make it usable again. When we decide to restore the primary to a past time ti, we find the newest version of each data block in the segments from time t0 to time ti, and send it to the primary to overwrite the data block according to the storage address of the description. The pseudo-code of the recovery algorithms is shown in Table 1.

Table 1. Recovery Algorithms. Full-recovery is used to recover the primary continuously and fully when Segment0 keeps all data of the primary at time t0. Fast-recovery is used to recover the primary continuously and fast when Segment0 only keeps the data of the primary at time t0 that is changed later.

FULL_RECOVERY(t0)                 FAST_RECOVERY(t0)
  S0 := GetSegment(t0);             S0 := GetSegment(t0);
  B := S0;                          B := S0;
  S := S0;                          S := S0;
  repeat                            repeat
    S := GetNextSegment(S);           S := GetNextSegment(S);
    B := Coalesce(B, S);              B := Coalesce(B, S);
  until S == GetSegment(ti);        until S == GetSegment(ti);
  P := Recover(B, NULL);            P := Recover(B, P);
  return SUCCESS;                   return SUCCESS;
In Table 1, the symbol Si denotes the segment with time ti, and S is a temporary variable holding the segment Si. The symbol B denotes the backup records that will be sent back to the primary, and P denotes all the data blocks at the primary. GetSegment(t), GetNextSegment(S), Coalesce(B,S) and Recover(B,P) are APIs supplied by CDP-R model. GetSegment(t) gets the segment for time t, and GetNextSegment(S) gets the segment following the current segment S. Coalesce(B,S) and Recover(B,P) are very important, as shown in Fig. 6:

1. Coalesce(B,S): coalesces the backup records of B and S that have the same description, keeping the backup record of S as the newer version;
2. Recover(B,P): recovers P from B. The data of each backup record of B overwrites the block of P with the same storage address.
Fig. 6. Recovery APIs of CDP-R model: (a) Coalesce(B,S); (b) Recover(B,P)
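The pseudo-code of Table 1 translates directly into a runnable sketch; segments are modeled as dicts keyed by block address, and the function names mirror the paper's APIs (the dict representation is our assumption):

```python
def coalesce(B, S):
    # Records of the later segment S win when descriptions (addresses) collide
    merged = dict(B)
    merged.update(S)
    return merged

def recover(B, P):
    # Overwrite primary blocks with coalesced backup records;
    # P = None corresponds to Recover(B, NULL) in full recovery
    restored = dict(P) if P is not None else {}
    restored.update(B)
    return restored

def full_recovery(segments):
    B = segments[0]           # Segment0: full image at t0
    for S in segments[1:]:    # segments up to the one for t_i
        B = coalesce(B, S)
    return recover(B, None)
```

Fast recovery differs only in the last step, calling `recover(B, P)` so that blocks never touched by any segment keep their surviving contents at the primary.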
5 Evaluation

According to the above introduction, CDP is an innovative data protection technique, different from traditional data protection technologies such as backup, mirroring, snapshot and replication (see Table 2).

Table 2. Data Protection Technologies

                                 Backup      Mirroring   Snapshot    Replication   CDP
Backup window                    large       small       small       small         small
Recovery Point Objective (RPO)   large       small       medium      small         small
Recovery Time Objective (RTO)    large       medium      medium      medium        small
Recovery point                   specified   recent      specified   recent        any
                                 point in    point in    point in    point in      point in
                                 time        time        time        time          time
The CDP-R model supplies a new approach to protecting and recovering databases by adopting the CDP technology, and it can be implemented on any platform, such as Linux, Windows and Unix. Here we give an example to evaluate the CDP-R model based on the Logical Volume Manager on Linux: if a database system (e.g., Oracle) is built on the CDP-R model at 8:00:00 a.m. and the time granularity of CDP-R is one second, then when the database system suffers a disaster at 2:00:00 p.m., we can restore the database system to any past time point between 8:00:00 and 13:59:59. In the CDP-R model, by coalescing the backup records in every segment of the repository, the storage space is reduced by up to 20%. By coalescing the backup records before restoring to the primary, the transmission bandwidth is reduced by up to 42%. In addition, fast recovery is 1 to 1.5 times faster than full recovery.
6 Conclusion and Future Work

Database systems are very important and require 24x7 availability. CDP transforms the backup/restore process to deliver a high availability level for database systems and keep business continuity. CDP is more comprehensive and cost-effective than any other solution, such as backup, snapshot, mirroring and replication. The CDP-R model adopts the CDP technology to solve the data restoration time-gap problem and to make true business continuity a realistic objective. It is presented based on the taxonomy and definition of the CDP technology. The CDP-R model synthesizes the technologies of block-level replication, repository and recovery to offer a complete solution. Therefore, CDP-R can provide days, weeks or months (even years) of protection with microsecond/second/minute/hour granularity. The CDP-R model can also provide business resiliency and the ability to rapidly restore to any point in time on the timeline. In addition, being built at the block level, CDP-R can achieve high performance and satisfy all kinds of database systems. The CDP-R model complies with the needs of database system protection, but there still exists some future work. For example, we need to optimize the structure of
DeltaChain and the recovery algorithms. Furthermore, we are developing a prototype system based on CDP-R model and hope to explore many of these avenues. Acknowledgments. This work is supported by the National Natural Science Foundation of China (60573096).
References

1. Keeton, K., Santos, C.A., Beyer, D., Chase, J.S., Wilkes, J.: Designing for Disasters. In: Proc. of the 3rd USENIX Conf. on File and Storage Technologies (FAST'04) (2004) 59–72
2. Patterson, D., Brown, A., et al.: Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Computer Science Technical Report, U.C. Berkeley (2002)
3. Choy, M., Leong, H.V., Wong, M.H.: Disaster Recovery Techniques for Database Systems. Communications of the ACM (2002) 272–280
4. Chervenak, A.L., Vellanki, V., Kurmas, Z.: Protecting File Systems: A Survey of Backup Techniques. In: Proc. of the Joint NASA and IEEE Mass Storage Conference (1998)
5. Patterson, D.A., Gibson, G., Katz, R.H.: A Case for Redundant Arrays of Inexpensive Disks (RAID). In: Proc. of the ACM SIGMOD International Conference on Management of Data (1988) 109–116
6. Ji, M., Veitch, A., Wilkes, J.: Seneca: Remote Mirroring Done Write. In: Proc. of the 2nd USENIX Conf. on File and Storage Technologies (FAST'03) (2003)
7. Duzy, G.: Match Snaps to Apps. Storage, Special Issue on Managing the Information that Drives the Enterprise (2005) 46–52
8. Zou, H.M., Jahanian, P.: A Real-Time Primary-Backup Replication Service. IEEE Trans. on Parallel and Distributed Systems (1999) 533–548
9. Olson, B.J., et al.: CDP Buyers Guide: An Overview of Today's Continuous Data Protection (CDP) Solutions. SNIA DMF CDP SIG (2005) http://www.snia.org/
10. O'Neill, B.: Any-Point-in-Time Backups. Storage, Special Issue on Managing the Information that Drives the Enterprise (2005)
11. Azagury, A., Factor, M.E., Satran, J.: Point-in-Time Copy: Yesterday, Today and Tomorrow. In: Proc. of the 10th Goddard Conference on Mass Storage Systems and Technologies (2002) 259–270
12. Morrey III, C.B., Grunwald, D.: Peabody: The Time Traveling Disk. In: Proc. of the IEEE Mass Storage Conference, San Diego, CA (2003)
13. Yang, Q., Xiao, W., Ren, J.: TRAP-Array: A Disk Array Architecture Providing Timely Recovery to Any Point-in-Time. In: Proc. of the 33rd Annual International Symposium on Computer Architecture (ISCA'06), Boston, USA (2006)
14. Rowan, M.: Continuous Data Protection: A Technical Overview. Revivio, Inc. (2005) http://www.revivio.com/documents/CDP%20Technical%20Overview.pdf
15. Protecting Transaction Data: What Every IT Pro Should Know. TimeSpring Software Corp. (2004) http://www.timespring.com/Protecting%20Transaction%20Data.pdf
16. Connor, D.: Continuous Data Protection Finds Supporters. Network World (2005) http://www.networkworld.com/news/2005/091605-continuous-data-protection.html
17. National Institute of Standards and Technology: FIPS 180-1, Secure Hash Standard. US Department of Commerce (1995)
Towards Web Services Composition Based on the Mining and Reasoning of Their Causal Relationships* Kun Yue, Weiyi Liu, and Weihua Li Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, 650091 Kunming, P.R. China
[email protected]
Abstract. In this paper, a probabilistic graphical modeling approach for Web services is proposed, and the Web services Bayesian network (WSBN) is constructed by mining the historical invocations among them. Further, the semantic guidance to Web services composition is generated based on the Markov blanket and causality reasoning in the WSBN. Preliminary experiments and performance analysis show that our approach is effective and feasible. Keywords: Web Services, composition, Bayesian network, Markov blanket.
1 Introduction

To implement automatic Web services composition, an underlying model, a corresponding reasoning approach, and a measure of service associations are indispensable [1, 2, 3, 4]. Thus, the guidance for services composition can be obtained, and then the composition can be carried out automatically. Different approaches have been proposed to address this problem, most of which work at the syntactic level of the services themselves, are annotated with ontologies, or are based on keyword retrieval [12, 13, 14, 15, 16]. Actually, many services have nothing to do with the actual provision although they have matching syntactic or keyword descriptions [4]. This requires that the composition be done at the semantic level, and reasoning among the given services is necessary too. Therefore, towards automatic Web services composition, we should first develop a model to represent the implied semantic relationships among given services, from which composition guidance can be derived. Intuitively, by mining distributed historical service invocations, we can discover the knowledge or behavior rules and learn the implied model of the given services. In real paradigms, statistical computation is one of the frequently adopted approaches, and the Bayesian network (BN) [5] is an effective model that can be used to represent the causal relationships implied among Web services. It is known that BNs are graphical representations of probabilistic relationships between variables. They are widely used in nondeterministic knowledge representation and reasoning under conditions of uncertainty [5, 6, 7]. Modeling Web services based on a BN not only
This work is supported by the Natural Science Foundation of Yunnan Province (No. 2005F0009Q), the Cultivating Scheme for Backbone Teachers in Yunnan University, and the Chun-Hui Project of the Educational Department of China (No. Z2005-2-65003).
G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 777–784, 2007. © Springer-Verlag Berlin Heidelberg 2007
can describe the causal dependencies with a graph structure, but also gives a quantitative measure of these dependencies. In this paper, we focus on discovering causal relationships for elementary services, described as operations in WSDL documents. An approach to the probabilistic graphical modeling of Web services is proposed, and the method for constructing the Web services Bayesian network, denoted WSBN, is presented. The Markov blanket (MB) of a variable X consists of X's parents, X's children, and the parents of X's children in a BN. Actually, the MB describes the direct causes, direct effects, and the direct effects of the direct causes of a variable [9, 10, 11]. In this paper, we develop composition guidance for elementary services making use of the idea of MBs and the corresponding reasoning mechanisms in the WSBN [5, 9, 11]. With preliminary experiments and performance analysis, the effectiveness and feasibility of the proposed method are verified. The remainder of the paper is organized as follows: Section 2 introduces related work. Section 3 gives the method for constructing the WSBN. Section 4 presents the algorithm for developing the semantic guidance of services composition. Section 5 shows the experimental results. Section 6 concludes and discusses future work.
2 Related Work

Similarity search for Web services is discussed in [4]. Firstly, approaches to modeling Web services based on predefined rules and expert knowledge are discussed in [12, 13, 14, 15]. A lot of research work is oriented to specific applications on Web services architectures [16, 17]. Secondly, approaches to modeling Web services based on messages, events, activities and procedures are discussed in [2, 18]. However, both of these two classes of approaches are established on predefined domain knowledge, which does not always make sense and is difficult to update and refine incrementally. BNs have been used in many different intelligent applications [5, 6, 7]. Cheng et al. proposed a method for learning a BN from data based on information theory [8]. The concept of the Markov blanket and its discovery are discussed in [5, 9, 10, 11]. Recently, there has been some research work on BN-based applications for Web services. In the semantic Web, BNs can be constructed from ontologies by expanding OWL with probabilities [19]. A BN representing given domain knowledge is used to evaluate cost factors versus benefit factors of services [20]. In addition, Web services metadata are obtained based on the naïve Bayesian classifier [21]. To our knowledge, the dynamic characteristics and inherent causal dependencies are rarely considered in these BN-based applications for Web services.
3 Modeling Elementary Services Based on the Bayesian Network

In the following, we first give the definition of elementary services.

Definition 1. Let ES={S1, S2, …, Sn} be the set of ordered elementary services in a given domain, in which Si (1≤i≤n) is a separate elementary service represented as an operation in the corresponding WSDL document.
Towards Web Services Composition

Fig. 1. Three basic types in Web services compositions: sequential, conditional, and parallel invocations among elementary services a, b, c and d.
Fig. 1 shows the invocations of these three types with respect to elementary services a, b, c and d. Now we give the following definition to describe service invocations uniformly.

Definition 2. Let P=(id, ps, cs, τb, τe) represent a direct invocation between two elementary services in a composition procedure, and let T be a temporal domain of timestamps, in which id identifies a service composition procedure; ps and cs are the parent and child services in the invocation respectively, ps∈ES and cs∈ES; τb and τe are the begin and end times of the invocation from ps to cs respectively, and τb, τe∈T. For any two instances p1 and p2 of P, if p1.id=p2.id and p1.cs=p2.ps, then p1.τe=p2.τb. For example, (1, a, b, τ1, τ2), (1, b, c, τ2, τ4) and (1, c, d, τ4, τ5) are instances of P containing direct invocations from the same procedure. In this paper, for given elementary services, we construct the semantic model from their historical invocations based on the BN.

Definition 3. A Bayesian network is a directed acyclic graph in which the following properties hold [5]: a set of random variables makes up the nodes of the network; a set of directed links connects pairs of nodes, where an arrow from node X to node Y means that X has a direct influence on Y; each node has a conditional probability table (CPT) that quantifies the effects that its parents have on the node, where the parents of node X are all those nodes that have arrows pointing to X. A BN represents the joint probability distribution as a product by the chain rule:

P(x1, …, xn) = ∏_{i=1..n} P(xi | Parents(xi)).
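As a small illustration of the chain rule, the joint probability of an assignment to binary service nodes can be evaluated as the product of each node's CPT entry. The network structure and CPT numbers below are illustrative assumptions, not taken from the paper:

```python
# Sketch: evaluating the BN chain rule P(x1,...,xn) = prod_i P(xi | Parents(xi))
# for binary nodes. Structure and CPT values are made up for illustration.

# parents[X] lists the parents of X; cpt[X] maps a tuple of parent values
# to P(X=1 | parent values).
parents = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["c"]}
cpt = {
    "a": {(): 0.6},
    "b": {(0,): 0.1, (1,): 0.8},
    "c": {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "d": {(0,): 0.2, (1,): 0.7},
}

def joint(assignment):
    """P(x1,...,xn) as the product of P(xi | Parents(xi))."""
    p = 1.0
    for var, value in assignment.items():
        pv = tuple(assignment[u] for u in parents[var])
        p1 = cpt[var][pv]                 # P(var=1 | parent values)
        p *= p1 if value == 1 else 1.0 - p1
    return p

print(joint({"a": 1, "b": 1, "c": 1, "d": 1}))  # 0.6 * 0.8 * 0.9 * 0.7
```

Summing `joint` over all 16 assignments yields 1, as required of a valid joint distribution.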
Based on the general definition of BNs, we will construct the elementary Web services Bayesian network (WSBN), G=(ES, BE), to describe their implied causal relationships, in which ES represents the node set including given elementary services, and BE is the corresponding set of directed edges. 3.1 Fixpoint Deduction of Elementary Services Associations
The fixpoint of an initial data set derives a fixed structure by a monotonic and iterative computation, through which some indirect service associations can be deduced [22]. We adopt the basic idea of the fixpoint to obtain all the service associations completely by deduction on the instances of P.

Definition 4. Let ℒ=(id, ps, cs, τb, τe) represent all associations (direct and indirect) between any two elementary services, where id, ps, cs, τb, τe are defined as those of P in Definition 2.
K. Yue, W. Liu, and W. Li
From Definition 4, P⊆ℒ ultimately holds since only direct associations are described in P. In order to obtain ℒ taking P as input, a recursive function is defined.

Definition 5. Let the function f from (ℒ, P) to ℒ be

ℒ = f(ℒ, P) = π1,2,8,4,10(P ⋈1=1∧3=2∧5=4 ℒ) ∪ P,    (3-1)

where P=(id, ps, cs, τb, τe), and π and ⋈ represent the projection and join operations respectively, similar to those in relational algebra. Initially, ℒ is empty, i.e., ℒ=Φ. Since P is given as a constant, equation 3-1 can be simplified to

ℒ = f(ℒ).    (3-2)
Clearly, f gives the recursive rule defining the fixpoint computation [22]. The computation of f is iterative, each step building on the previous result, and f is monotonic. The instances of ℒ are composed of two parts: the direct associations in P, and the indirect ones derived using equation 3-1. By the above method, we can obtain the unique fixpoint given P, as argued by Theorem 1.

Theorem 1. ℒ that satisfies equation 3-2 is the least fixpoint of f. □

For space limitations, the proof is omitted. By the monotonicity of f, we have f↑i(Φ) ⊆ f↑i+1(Φ). Thus, let Ii be the instances of ℒ after the i-th iteration, so that Ii ⊆ Ii+1, and suppose Ii+1 = Ii ∪ δi+1, where δi+1 is the incremental part. For any iteration in this process, the obtained instances of ℒ must be included in the results of the next iteration. As well, we have δi+1 = π1,2,8,4,10(P ⋈1=1∧3=2∧5=4 δi) ∪ P. For the invocations of the first composition procedure given following Definition 2, (1, a, c, τ1, τ4) and (1, b, d, τ2, τ5) will be obtained after the first iteration, and (1, a, d, τ1, τ5) after the second iteration.
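The fixpoint deduction above can be sketched as follows. This is an illustrative Python version, not the paper's implementation; tuples follow the layout of Definition 2, and the symbolic timestamps `"t1"`…`"t5"` are stand-ins for τ1…τ5:

```python
# Sketch of the fixpoint deduction (equation 3-1): repeatedly join the
# direct invocations P with the current association set L until no new
# associations appear. Tuples follow Definition 2: (id, ps, cs, tb, te).
def fixpoint_associations(P):
    L = set(P)                        # direct associations are in L
    while True:
        new = set()
        for (id1, ps1, cs1, tb1, te1) in P:
            for (id2, ps2, cs2, tb2, te2) in L:
                # join condition 1=1 ∧ 3=2 ∧ 5=4: same procedure, P's child
                # is L's parent, and the invocation times are contiguous
                if id1 == id2 and cs1 == ps2 and te1 == tb2:
                    new.add((id1, ps1, cs2, tb1, te2))
        if new <= L:                  # least fixpoint reached
            return L
        L |= new

P = {(1, "a", "b", "t1", "t2"), (1, "b", "c", "t2", "t4"), (1, "c", "d", "t4", "t5")}
L = fixpoint_associations(P)
```

On this input, the first iteration derives (1, a, c, t1, t4) and (1, b, d, t2, t5), and the second derives (1, a, d, t1, t5), matching the example in the text.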
3.2 Constructing the Elementary Web Services Bayesian Network
Based on the existing theory and approaches, the WSBN will be constructed considering the specialty of Web services. It is well known that the most challenging and time-consuming operations are the tests of conditional independence (CI tests). In this paper, we adopt conditional mutual information to test whether X is independent of Y given Z, computed by the following equation:

I(X, Z, Y) = Σ_{x∈X, y∈Y, z∈Z} P(x, y, z) log( P(x, y | z) / (P(x | z) P(y | z)) ).    (3-3)

If I(X, Z, Y)≤ε, then X is conditionally independent of Y given Z, where ε is a given threshold. However, we note that P(x, y, z), P(x|z) and P(y|z) in equation 3-3 cannot be computed directly from the sample data preprocessed by the fixpoint deduction. Thus, we first transform the sample data by augmenting the traces of service invocations. Let MIST=(m(i, j))|ℒ|×n (1≤i≤|ℒ|, 1≤j≤n) be the spanning matrix of traces of invoked elementary services, in which m(i, j)=1 if Sj is in the trace of the i-th row of ℒ, and m(i, j)=0 otherwise. Fig. 2 gives an example of MIST.
MIST =

      a  b  c  d
    [ 1  1  0  0 ]
    [ 0  1  1  0 ]
    [ 0  0  1  1 ]
    [ 1  1  1  0 ]
    [ 0  1  1  1 ]
    [ 1  1  1  1 ]

Fig. 2. A spanning matrix
Fig. 3. The constructed WSBN
According to the general method for constructing a BN [5], the WSBN constructed from the MIST in Fig. 2 is shown in Fig. 3, where the CPTs of c and d are omitted.
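The CI test of equation 3-3 can be sketched from empirical counts over MIST-style binary rows. This is an illustrative computation, not the paper's code; variable names and the sample data are assumptions:

```python
# Sketch: empirical conditional mutual information I(X, Z, Y) over binary
# samples (equation 3-3). Each sample is a dict of 0/1 values per variable.
from collections import Counter
from math import log

def cmi(samples, X, Y, Z):
    n = len(samples)
    nxyz = Counter((s[X], s[Y], s[Z]) for s in samples)
    nxz = Counter((s[X], s[Z]) for s in samples)
    nyz = Counter((s[Y], s[Z]) for s in samples)
    nz = Counter(s[Z] for s in samples)
    total = 0.0
    for (x, y, z), c in nxyz.items():
        p_xyz = c / n                    # P(x, y, z)
        p_xy_z = c / nz[z]               # P(x, y | z)
        p_x_z = nxz[(x, z)] / nz[z]      # P(x | z)
        p_y_z = nyz[(y, z)] / nz[z]      # P(y | z)
        total += p_xyz * log(p_xy_z / (p_x_z * p_y_z))
    return total
```

Declaring X conditionally independent of Y given Z then amounts to checking `cmi(samples, X, Y, Z) <= eps` for the chosen threshold ε: perfectly correlated variables give a clearly positive value, while variables independent given Z give a value near zero.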
4 Generating Services Composition Guidance Based on the WSBN

Let us consider the WSBN of elementary services {a, b, c, d, e, f, g}, shown in Fig. 4 (the CPTs are ignored here). If c is one of the beginning services of a composition procedure, we can observe that e is likely to be concerned, since e is the child associated with c directly. As well, d is also likely to be concerned in the composition procedure, since it is another parent of e. We want to obtain composition guidance that is universally suitable for the three basic types, composed of the current node's children and the other parent nodes of these children, step by step. Fortunately, the Markov blanket in the WSBN guarantees that these two kinds of nodes are causally associated with the given node from the viewpoint of service invocation, while not associated with other nodes due to conditional independence.

Fig. 4. A WSBN structure

Definition 6. A Markov blanket (MB) S of an element α∈U (U is the set of elements in the BN) is any subset of elements for which I(α, S, U − S − {α}) holds and α∉S.
The union of the following three types of neighbors is sufficient for forming a Markov blanket of node α: the direct parents of α, the direct successors of α, and all direct parents of α's direct successors [5]. The elements in the Markov blanket of an elementary service S (S∈ES) are causally associated with S. The invocation guidance is desired to demonstrate the immediate and subsequent services for each step. Additionally, the causal relationships among given services cannot be reversed when it comes to service invocations. Thus, we consider the associated services of S given by the MB except its ancestors in the WSBN. For the WSBN in Fig. 4, c is directly associated with e and d, since e is c's child and d is e's other parent.

Definition 7. Let YS={Y1, Y2, …, Ym} be the children of S (S∈ES), and let Fj be the set of parents of Yj (1≤j≤m). Let SN(S) = YS ∪ F1 ∪ … ∪ Fm − {S} be the set of service neighbors of S. That is, SN(S) = MB(S) − Parent(S), and each element in SN(S) is called a service neighbor of S.
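Computing SN(S) = MB(S) − Parent(S) can be sketched as below. The edge set is an assumed reconstruction of the Fig. 4 WSBN as far as the text describes it (a→c, b→d, c→e, d→e, e→f, e→g):

```python
# Sketch: service neighbors SN(S) = children(S) plus the other parents of
# those children, i.e., MB(S) minus Parent(S) and S itself.
# The edge set below is an assumed reading of the Fig. 4 WSBN structure.
edges = {("a", "c"), ("b", "d"), ("c", "e"), ("d", "e"), ("e", "f"), ("e", "g")}

def children(node):
    return {v for (u, v) in edges if u == node}

def parent_nodes(node):
    return {u for (u, v) in edges if v == node}

def service_neighbors(s):
    ys = children(s)                      # direct successors of s
    other_parents = set()
    for y in ys:
        other_parents |= parent_nodes(y) - {s}   # other parents of each child
    return ys | other_parents

print(sorted(service_neighbors("c")))  # ['d', 'e']
```

Under this edge set, SN(c) = {e, d}: e is c's child and d is e's other parent, matching the example in the text.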
For example, SN(c)={e, d}. Moreover, we always want to give the most probable or most associated services in each step instead of all possible ones. Although SN(S) gives the associated services of S, the strength of these associations is not quantified. We note the following facts for the nodes in SN(S): (1) For each Yj in YS, we consider the probability that Yj may be invoked when S is invoked, P(Yj=1|S=1), which can be obtained directly from the CPT in the WSBN. (2) If Yj is likely to be invoked, we consider each f in Fj and the probability P(f=1|Yj=1). It is the posterior probability that can be computed by the Bayes formula based on the corresponding CPTs in the WSBN:

P(f=1 | Yj=1) = P(Yj=1 | f=1) P(f=1) / P(Yj=1),

in which P(Yj=1|f=1) and the
marginal probabilities P(f=1) and P(Yj=1) can be easily computed from the CPTs.

Definition 8. A service neighbor sn in SN(S) is active if (1) sn∈YS and P(sn=1|S=1)>ta1, or (2) sn∈Fj and P(Yj=1|S=1)>ta1 and P(sn=1|Yj=1)>ta2, where ta1 and ta2 are two given threshold values.

Definition 9. Given a WSBN G=(ES, BE), let SCG=(GB, GS, GE) be the services composition guidance, a subgraph of G, in which (1) GB is the set of beginning elementary services, GB⊆ES; (2) GS is the set of elementary services in SCG, GS⊆ES, and for each elementary service S in GS−GB, there is an elementary service S' (S≠S') in GS such that S is an active service neighbor of S'; (3) GE is the set of directed edges in SCG.
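The posterior in fact (2) follows directly from the Bayes formula, with the marginal P(Yj=1) expanded by total probability over f. A small numeric sketch; all probability values are made up for illustration:

```python
# Sketch: P(f=1 | Yj=1) = P(Yj=1 | f=1) * P(f=1) / P(Yj=1), where the
# marginal P(Yj=1) is expanded over f. The numbers are illustrative only.
p_f1 = 0.5                  # marginal P(f=1)
p_y1_given_f1 = 0.8         # from the CPT of Yj
p_y1_given_f0 = 0.2

# marginal P(Yj=1) by total probability over f
p_y1 = p_y1_given_f1 * p_f1 + p_y1_given_f0 * (1 - p_f1)   # 0.5

posterior = p_y1_given_f1 * p_f1 / p_y1
print(posterior)  # 0.8
```

With ta2 = 0.6, this f would count toward making the corresponding neighbor active, since 0.8 > 0.6.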
Algorithm 1 gives the recursive method for generating the SCG from the WSBN.

Algorithm 1. GenerateSCG(G, GB): Generate SCG from the WSBN G
Initially, GS=GB and GE=Φ
1. for each S in GB do                   // starting from the elements in GB
2.   for each ys in SN(S) do             // consider the elements in MB(S)−Parent(S)
3.     if ys∈YS and ys is active then    // if S's child is active
4.       GS←GS∪{ys}, GE←GE∪{(S, ys)}
5.       for each fys in Fys do          // consider the other parents of ys
6.         if fys is active then
7.           GS←GS∪{fys}, GE←GE∪{(fys, ys)}, GenerateSCG(G, {fys})
8.       GenerateSCG(G, {ys})
9. output SCG
By Algorithm 1, the services composition guidance can be generated. SN(S) can be obtained in O(n2) time. Thus, Algorithm 1 runs in O(n5) time in the worst case. In practice, less than O(n5) time is needed, since the directed edges in the WSBN are much fewer than those of the complete graph on ES.
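Algorithm 1 can be sketched in Python as below. Two assumptions to note: the edge set is the same assumed reading of Fig. 4 (a→c, b→d, c→e, d→e, e→f, e→g); `is_active` is a stub standing in for the threshold tests of Definition 8; and a visited set is added as a termination guard (the pseudocode's mutual recursion between a child's parents would otherwise not terminate):

```python
# Sketch of Algorithm 1 (GenerateSCG). `is_active` stands in for the
# threshold tests of Definition 8; the visited set is an added guard,
# not part of the paper's pseudocode. Edges are an assumed Fig. 4 reading.
edges = {("a", "c"), ("b", "d"), ("c", "e"), ("d", "e"), ("e", "f"), ("e", "g")}

def children(node):
    return {v for (u, v) in edges if u == node}

def parents(node):
    return {u for (u, v) in edges if v == node}

def generate_scg(gb, is_active=lambda n: True):
    gs, ge, visited = set(gb), set(), set()

    def visit(s):
        if s in visited:                   # guard against revisiting nodes
            return
        visited.add(s)
        for ys in children(s):             # S's children in SN(S)
            if is_active(ys):
                gs.add(ys); ge.add((s, ys))
                for fys in parents(ys) - {s}:   # other parents of ys
                    if is_active(fys):
                        gs.add(fys); ge.add((fys, ys))
                        visit(fys)
                visit(ys)

    for s in gb:
        visit(s)
    return gs, ge

gs, ge = generate_scg({"c"})
```

Starting from GB={c} with every neighbor active, the guidance covers {c, d, e, f, g} with edges {(c, e), (d, e), (e, f), (e, g)}; the ancestors a and b are excluded, as intended.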
5 Experimental Results

In this section, we mainly show the performance of constructing the WSBN. The experiments were run on a machine with a 1.4 GHz Pentium 4 processor and 512 MB of main memory, running Windows 2000 Server. The code was written in Java, and JDBC-ODBC was used to communicate with DB2 (UDB 7.0). The elementary Web services and their
invocations were generated by our program based on the real City-Travel services provided by e-commerce Inc. [23], and revised considering the instances in [16]. Given 6 elementary services, the performance of generating MIST and constructing the WSBN is shown in Fig. 5 and Fig. 6 respectively. Clearly, the time for generating MIST increases sharply with the number of services composition procedures. Meanwhile, for 50 services composition procedures and an increasing number of elementary services, the performance of preprocessing when generating MIST and of constructing the WSBN is shown in Fig. 7 and Fig. 8 respectively. We note that for a fixed number of composition procedures, the performance of generating MIST decreases only slightly as the number of elementary services grows, while the performance of constructing the WSBN on the generated MIST decreases considerably.
Fig. 5. Generating MIST on 6 services
Fig. 7. Generating MIST on increased elementary services
Fig. 6. Constructing WSBN on 6 services
Fig. 8. Constructing WSBN on increased elementary services
Generally, the performance of our proposed method depends on the number of given elementary services and the size of historical services composition procedures. The experimental results show that our proposed approach is effective and feasible.
6 Conclusions and Future Work

In this paper, we propose an approach to the probabilistic graphical modeling of Web services based on the Bayesian network, and propose services composition guidance based on Markov blankets in the WSBN. The proposed approach can be applied to Web services clustering, intelligent services management, etc. Moreover, behavior modeling of Web services, describing their inherent hierarchical, temporal and logical dependencies, can be built upon the WSBN. These research issues are our future work.
References
1. Yue, K., Wang, X., Zhou, A.: The Underlying Techniques for Web Services: A Survey. J. Software, Vol. 15, 3 (2004) 428–442
2. Dustdar, S., Schreiner, W.: A Survey on Web Services Composition. Int. J. Web and Grid Services, Vol. 1, 1 (2005) 1–30
3. Hull, R., Su, J.: Tools for Design of Composite Web Services. SIGMOD (2004) 958–961
4. Dong, X., Halevy, A., Madhavan, J., Nemes, E., Zhang, J.: Similarity Search for Web Services. VLDB (2004)
5. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA (1988)
6. Pearl, J.: Propagation and Structuring in Belief Networks. Artificial Intelligence, Vol. 29, 3 (1986) 241–288
7. Heckerman, D., Wellman, M.P.: Bayesian Networks. Communications of the ACM, Vol. 38, 3 (1995) 27–30
8. Cheng, J., Bell, D., Liu, W.: Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory. 6th ACM Conf. on Information and Knowledge Management (1997)
9. Pearl, J.: Evidential Reasoning Using Stochastic Simulation of Causal Models. Artificial Intelligence, Vol. 32 (1987) 245–257
10. Margaritis, D., Thrun, S.: Bayesian Network Induction via Local Neighborhoods. Technical Report CMU-CS-99-134, Carnegie Mellon University (1999)
11. Tsamardinos, I., Aliferis, C.F., Statnikov, A.: Algorithms for Large Scale Markov Blanket Discovery. 16th Int. FLAIRS Conf. (2003)
12. Narayanan, S., McIlraith, S.A.: Simulation, Verification and Automated Composition of Web Services. WWW (2002) 77–88
13. Tosic, V., Pagurek, B., Esfandiari, B., Patel, K.: On the Management of Compositions of Web Services. OOPSLA (2001)
14. Peer, J.: Bringing Together Semantic Web and Web Services. Int. Semantic Web Conf. (2002) 279–291
15. Feier, C., Roman, D., Polleres, A., Domingue, J., Stollberg, M., Fensel, D.: Towards Intelligent Web Services: The Web Service Modeling Ontology (WSMO). Int. Conf. on Intelligent Computing (2005)
16. Benatallah, B., Dumas, M., Sheng, Q., Ngu, A.: Declarative Composition and Peer-to-Peer Provisioning of Dynamic Services. ICDE (2002) 297–308
17. Amer-Yahia, S., Kotidis, Y.: A Web-Services Architecture for Efficient XML Data Exchange. ICDE (2004) 523–534
18. Bultan, T., Fu, X., Hull, R., Su, J.: Conversation Specification: A New Approach to Design and Analysis of E-Service Composition. WWW (2003)
19. Helsper, E.M., van der Gaag, L.C.: Building Bayesian Networks Through Ontologies. 15th European Conf. on Artificial Intelligence (2003)
20. Zhang, G., Bai, C., Lu, J., Zhang, C.: Bayesian Network Based Cost Benefit Factor Inference in E-Services. ICTITA (2004)
21. Heß, A., Kushmerick, N.: Automatically Attaching Semantic Metadata to Web Services. IIWeb (2003)
22. van Emden, M., Kowalski, R.: The Semantics of Predicate Logic as a Programming Language. JACM, Vol. 23, 4 (1976) 733–742
23. Web Services: Design, Travel, Shopping. http://www.ec-t.com
A Dynamically Adjustable Rule Engine for Agile Business Computing Environments

Yonghwan Lee1, Junaid Ahsenali Chaudhry2, Dugki Min1, Sunyoung Han1, and Seungkyu Park2

1 School of Computer Science and Engineering, Konkuk University, Hwayang-dong, Kwangjin-gu, Seoul, 133-701, Korea
{yhlee,dkmin,syhan}@konkuk.ac.kr
2 Graduate School of Information and Communication, Ajou University, Woncheon-dong, Paldal-gu, Suwon, 443-749, Korea
{junaid,sparky}@ajou.ac.kr
Abstract. Most agile applications have to deal with dynamic changes in the processes of automated business policies, procedures, and logic. As a solution for such dynamic changes, rule-based software development is used. With the increasing complexity of modern business systems, business rules have become harder to express and hence require additional, specially designed scripting languages. The high cost of modifying or updating those rules is our motivation in this paper. We propose a compilation-based, dynamically adjustable rule engine aimed at rich rule expression and performance enhancement. Because of the immense complications among and within business rules, we use the Java language instead of scripting languages to create and modify rules, which also gives us the benefit of a standardized syntax. The engine separates the condition from the action at run time, which makes rule modification easier and quicker. According to the experimental results, the proposed dynamically adjustable rule engine shows promising results when compared with contemporary script-based solutions.
1 Introduction

The revolution in computer systems and the torrent of applications are led by growth in enabling technologies. For 20 years, systems have been growing annually by roughly a factor of 2 (disk capacity), 1.6 (Moore's Law), and 1.3 (personal networking; modem to Digital Subscriber Line (DSL)), respectively. The cost of managing today's complex systems is far more than the actual cost of the systems. Among those applications (i.e., mission-critical applications and automated processing of business policies, procedures, and business logic), time is decisive. Better representation, organization and management of business processes in agile computing have helped optimize and fine-tune processes with the help of computer systems. Moreover, as the software industry has developed rapidly in various forms with ever shorter software life cycles, companies need to produce highly competitive applications with many features such as user adaptation, customization, software reusability, timeliness, low maintenance and fault-free service. Component-oriented software
G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 785–796, 2007. © Springer-Verlag Berlin Heidelberg 2007
engineering has stepped up, and component-based software systems are growing in popularity. When software is divided into many dynamically connected components, the cost of immediate adjustment to new business processes or rearrangement of existing processes climbs high. So it is essential to develop software components that are extensible and flexible, adapting to the diverse requirements imposed on each component's development and maintenance. Many researchers have proposed a variety of adaptation methods for software components, emphasizing extensibility and adaptability. However, applying those solutions in real-time applications decreases performance, which is the motivation of our work. To address this weak point, techniques of rule-based component development have been proposed. For extensibility and adaptability of components, these techniques separate business variability [1] from a component's internal code by keeping rules separate. Upon the occurrence of requirement changes, a new requirement can be satisfied by changing the rules without changing the components. However, this technology usually needs an additional scripting language to describe rule expressions, which is limited in expressing complex business rules. Also, such script-based rule handling is not suitable for systems that require high performance. In this paper, we propose a compilation-based rule engine for performance enhancement and improved rule expression, to cope with dynamic systems requiring runtime adjustments. Unlike the interpretation-based rule engines proposed as contemporary solutions, our rule engine does not require any additional scripting language for expressing rules, resulting in better compilation time and overall performance.
Moreover, the solution we propose is able to use existing libraries for the condition/action code of rules in legacy systems, such as string, number, and logical expressions, so that it can not only express complex condition or action statements but also easily integrate with existing systems developed in Java. In agile business computing environments, computing systems have become highly dynamic and complex. Our rule-based, dynamically adjustable mechanism is an appropriate solution for bringing the benefits of automatic computing, trustworthy management, consistency, and easy maintenance to rule-based systems. The remainder of this paper is organized as follows: In section 2 we present a scenario and functional features for better understanding. In section 3 we present the architecture of the proposed rule engine. We describe performance and compare the features of JSR-94 and the proposed rule engine in section 4. We discuss related work in section 5, and lastly we conclude this paper along with future work in section 6.
2 Solution of the Dynamically Adjustable Rule Engine

In order to apply a changing rule to a dynamically adjustable rule engine, it is essential that the rule engine be adaptable enough to cope with regular updates and changes. The main procedure of our dynamically adjustable rule engine is that a rule writer composes the condition and action parts of a rule expression in the Java language. The condition code and action code of a rule expression are converted into condition and action objects with hook methods, respectively, which are put into an object
A Dynamically Adjustable Rule Engine for Agile Business Computing Environments
787
pool. After finding a specific rule, our rule engine takes the condition and action objects specified by the rule’s configuration from the object pool for rule execution. Processing a sample scenario is introduced in the following subsections. 2.1 A Sample Scenario of the Dynamically Adjustable Rules Figure 1 shows the application example of customer’s credit rule. Suppose that there is a rule of the customer’s credit in import and export business domain.
Fig. 1. Application Example of Customer’s Credit Rule
Let us consider a simple credit rule: “If a customer’s credit limit is greater than the invoice amount and the status of the invoice is ‘unpaid’, the credit limit decreases by taking off the invoice amount and the status of the invoice becomes ‘paid’.”
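The separation of this rule into a condition object with a compare hook and an action object with an execute hook can be sketched as follows. The paper generates Java classes for this; the Python sketch below, with made-up class and field names, only illustrates the structure:

```python
# Sketch: the customer's credit rule split into a condition object with a
# "compare" hook and an action object with an "execute" hook. Class and
# field names are illustrative, not the paper's generated Java code.
class Customer:
    def __init__(self, credit_limit):
        self.credit_limit = credit_limit

class Invoice:
    def __init__(self, amount, status="unpaid"):
        self.amount = amount
        self.status = status

class CreditRuleCondition:
    def compare(self, customer, invoice):        # condition hook method
        return (customer.credit_limit > invoice.amount
                and invoice.status == "unpaid")

class CreditRuleAction:
    def execute(self, customer, invoice):        # action hook method
        customer.credit_limit -= invoice.amount
        invoice.status = "paid"

def run_rule(condition, action, customer, invoice):
    """The engine's core step: fire the action only if the condition holds."""
    if condition.compare(customer, invoice):
        action.execute(customer, invoice)

customer, invoice = Customer(1000), Invoice(300)
run_rule(CreditRuleCondition(), CreditRuleAction(), customer, invoice)
print(customer.credit_limit, invoice.status)  # 700 paid
```

Because condition and action live in separate objects, either can be replaced at run time without touching the other, which is the point of the compilation-based design described below.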
Fig. 2. Rule Expression for the Customer’s Credit Rule with a Rule Editor
In this scenario, the process of applying the dynamically adjustable rules can be divided into 3 phases: 1) the rule expression phase, 2) the rule initialization phase, 3) and the rule execution phase. During the rule expression phase, a rule writer writes condition and action parts of the customer’s credit rule using a rule editor as in
figure 2. After writing the rule, the rule writer saves the customer’s credit rule-related information to a rule base in form of an XML file. Figure 3 shows an example of XML-based rule base for the customer’s credit rule.
Fig. 3. XML-based Rule Base for Customer’s Credit Rule Expression
Fig. 4. Condition and Action Class Generation using Template Method Pattern
During the rule initialization phase, the rule engine makes Java source files from the Java code of the condition and action in figure 2, compiles them, makes instances of the classes and deploys them to the object pool. During the rule execution phase, if the rule application domain sends request event messages to the rule engine, the rule engine extracts the event identifier from the request event message. The rule engine finds the rule from a rule base by matching the event identifier. The rule engine takes the condition and action objects from the object pool and invokes the hook methods of the condition and action objects. In figure 2, the rule identifier is the unique name for finding the specified rule, and the rule priority specifies the order of executing rules. It is also possible to use existing libraries specified in the CLASSPATH. If necessary, a rule writer can write multiple action codes for a rule.
2.2 Code Generation and Operation in the Rule Engine

In order to generate condition and action classes, the rule engine uses the template method pattern. Figure 4 shows the class diagram for applying the template method pattern to our rule engine. The names of the hook methods for the condition and action classes are "Compare" and "Execute", respectively. Figure 5 shows condition or action code generated automatically through the template method pattern. The condition and action objects are made from the CreditRuleCondition and CreditRuleAction classes and put into an object pool to be used for executing the rule. When a rule application sends request events for rule execution to the rule engine, the rule engine extracts the event identifier from the request event message. The event identifier is the string "domain name: task identifier: rule name". The rule engine finds the rule from a rule base by matching the event identifier. The matched rule has a rule configuration, such as rule identifier, rule name, condition or action class name, and rule priority. The rule engine takes the condition and action objects from the object pool and invokes the hook methods of the condition and action objects.
Fig. 5. Condition and Action Code Generation for Customer’s Credit Rule
3 Software Architecture of the Rule Engine

In the previous section, we studied a sample scenario with its processing flow. This section introduces the architecture of the dynamically adjustable rule engine, which operates based on compilation. We also present the flow of the initialization process in the rule engine and the execution process of rules. In figure 6, we show the software architecture of the proposed rule engine. The rule engine is mainly comprised of three parts: the Admin Console, the Rule Repository, and the Core Modules. The Admin Console is
the toolkit for expressing and managing rules. The Rule Repository saves the XML-based rule information expressed by the toolkit. The Core Modules are in charge of finding, parsing, and executing rules. There are a number of modules in the Core Modules. The responsibility of the Rule Engine is to receive request messages from a client and to execute rules. To find an appropriate rule, it sends the request message to the Rule Parser. The Rule Parser extracts the event identifier from the request message, compares it with the event identifiers in a parsing table, and finds the rule. The event identifier is the string "domain name: task identifier: rule name". After finding the rule, the Rule Engine knows the names of the condition and action objects from the rule's configuration and obtains references to them from the ObjectPool Manager.
Fig. 6. Software Architecture of the proposed Rule Engine
The Rule Parser is responsible for finding rules. The ObjectPool Manager manages the condition and action objects specified in rule expressions. The RuleInfor Manager performs CRUD (Create, Read, Update, and Delete) actions on the Rule Repository. The JavaCode Builder makes Java source files, compiles them, makes instances of the classes, and deploys them to the object pool. The Condition and Action Objects are the objects made from the condition and action code of rule expressions. The Rule Engine is required to initialize before executing rules. In figure 7, we show the collaboration diagram for the flow of the rule engine initialization process. The Rule Engine sends an initialization request to the RuleInfor Manager. The RuleInfor Manager reads rule information from the Rule Repository and saves it to a buffer. Recursively, the RuleInfor Manager extracts the condition and action code of rules, makes object instances, and deploys them to the object pool through the ObjectPool Manager. After the Rule Engine initializes the condition and action parts of the rules, it calls the Rule Parser to build a parsing table. The Rule Parser gets pairs of rule identifiers and names from the RuleInfor Manager, and builds the parsing table with them for finding appropriate rules.
Fig. 7. Process Flow for Rule Initialization
Figure 8 presents the collaboration diagram to show the flow for rule execution. A client sends request messages to the Rule Engine. The Rule Engine saves it to a buffer through the EventBuffer Manager and then gets the request message with highest priority from the EventBuffer Manager.
Fig. 8. Process Flow for Rule Execution
The Rule Engine calls the Rule Parser to find the rule matched with the rule identifier. The Rule Parser searches the parsing table to find appropriate rules. After finding the rule, the Rule Engine calls the ObjectPool Manager to get the condition and action objects specified in the found rule and then calls the "Compare" hook method of the condition object. If the result of invoking the condition object is true, the Rule Engine calls the "Execute" hook method of the action object. If a rule has many action objects, the Rule Engine calls them according to the order of the action objects specified in the rule expression. The rule engine also supports forward-chaining rule execution: it allows the action of one rule to trigger the condition of other rules.
4 Performance of the Rule Engine

In this section, we show the experimental performance results of the compilation-based rule engine proposed in this paper. We use Microsoft Windows Server 2003 as the operating system, WebLogic 6.1 with SP 7 as the web application server, and Oracle 9i as the relational database. For load generation, the WebBench 5.0 tool is employed. TPS (Transactions per Second) and execution time are used as the metrics of performance measurement. For performance comparison in a J2EE environment, we use a servlet object as a client of the rule engine.

4.1 Experimental Environment

Before showing the performance results, we introduce the workloads that were used in the experiments. Generally, business rules are classified into business process rules and business domain rules. Business domain rules define the characteristics of variability and the variability methods that analyze these characteristics for an object. Business process rules define the occupation type, sequence, and processing conditions necessary to process an operation; the variability regulations for process flows are defined as business process rules. Table 1 shows the workload configuration for the experiments. Among the five rules, two are business process rules and two are business domain rules. In an e-business environment, as business domain rules are used more frequently than business process rules, we give more weight to the business domain rules.

Table 1. Workload for Experiments
Index | Rule Name            | Rule Type    | Weight
1     | Log-In               | -            | 5%
2     | Customer Credit      | Process Rule | 15%
3     | Customer Age         | Domain Rule  | 30%
4     | Interest Calculation | Process Rule | 15%
5     | Role Checking        | Domain Rule  | 35%
The "Customer Age" rule checks a customer's age according to the request. The "Interest Calculation" rule calculates interest according to the interest rates. The "Role Checking" rule enforces the assertion that "an authorized user can access certain resources": the rule engine takes role information from the customer's profile and decides whether the requested jobs are accepted or not.

4.2 Performance Comparison

The performance of the proposed rule engine is compared with the Java Rule Engine API (JSR-94) in figure 9. The proposed rule engine achieved 395 transactions per second
A Dynamically Adjustable Rule Engine for Agile Business Computing Environments
(TPS) under the maximum workload, while JSR-94 achieved at most 150 TPS. The proposed rule engine thus processes 245 more transactions per second than JSR-94. We believe the proposed rule engine achieved 2.5 times better performance than JSR-94 because of its emphasis on features such as ease of extensibility and a high level of adjustability for the rules used in a system. To compare the performance of the sub-modules of the two rule engines, Figure 10 shows their load analysis. Since the proposed rule engine operates on compilation-based rule processing, its object-generation module may take a long execution time, but there is not a big difference in performance for this module. Moreover, the proposed rule engine achieves better performance in parsing and executing rules, because it divides the condition and action classes into separate parts, which makes it easy to call rules from an object pool at run time. In addition, one does not have to define a separate condition statement for multiple actions: the proposed rule engine provides the facility of defining more than one action for one condition, which can help with fault tolerance in a hybrid environment.
Fig. 9. Performance Comparisons with JSR-94
Fig. 10. Comparison of Load of Two Rule Engines
4.3 Feature Comparison

In Table 2, we compare the features of the two rule engines. In contrast to JSR-94, the proposed rule engine expresses each business rule by a business task unit. If there are one or more rules in a task, each rule is identified by a unique rule name.
Y. Lee et al.

Table 2. Feature Comparison between the Two Rule Engines

Performance (Max TPS):
  JSR-94 Rule Engine: 150 TPS
  The Proposed Rule Engine: 395 TPS (2.5 times better performance)

Rule Expression:
  JSR-94 Rule Engine: a rule expression is confined to the JESS script rule language
  The Proposed Rule Engine: requires learning the Java language; can express complex business rules using Java

Reusability of Existing Libraries / Integration with Existing Systems:
  JSR-94 Rule Engine: impossible; needs additional rule expressions for integrating existing systems; an application domain expert can write rules more easily
  The Proposed Rule Engine: possible by using the CLASSPATH in rule expressions; easier to integrate with existing systems in the Java language; any Java coder can write rules easily

Easy to Learn:
  JSR-94 Rule Engine: needs an additional script-based rule language
  The Proposed Rule Engine: learning an additional rule language is not required

Dynamic Change of Business Rules:
  JSR-94 Rule Engine: possible
  The Proposed Rule Engine: possible (an object pool mechanism for condition and action objects enables dynamic change of rules)

Separation of Condition and Action Parts:
  JSR-94 Rule Engine: no
  The Proposed Rule Engine: yes; the condition and action parts of rules are separated so that updates are easier to manage and multiple actions can be taken against one condition

Ease of Embedment:
  JSR-94 Rule Engine: low
  The Proposed Rule Engine: high

Condition/Action Dependability:
  JSR-94 Rule Engine: yes, causes rule evaluation to block until a condition becomes true or an event is raised
  The Proposed Rule Engine: no; since conditions and events are ‘physically’ separate from each other, the proposed engine has an edge on time constraints
The proposed rule engine uses the Java language for writing business rules, without any additional script language for expressing rules. Although it might seem odd to assume that the user must have knowledge of the Java language, we foresee that business rules, when converted into Java, eliminate fuzziness and bring clarity to the conditions and actions. Moreover, the syntax of Java is the same everywhere in the world, so it is easier to embed the proposed rule engine into applications facing diverse environments. However, we aim to build a GUI-based front end for the rule engine proposed in this paper as future work. When executing a business rule, the proposed rule engine does not need a step for matching rule conditions. In other words, after finding the required business rule in the rule base, the proposed rule engine executes it without parsing the rule or matching the rule conditions, owing to the Java-based rule expression. The proposed rule engine converts the condition and action codes of a rule into condition and action objects, respectively, and puts them into an object pool to improve performance and dynamic changeability. Thus, it can execute a newly changed business rule without restarting itself.
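The separation of condition and action objects in an object pool can be illustrated with a small sketch. The class and method names below are hypothetical, and the sketch is written in Python for brevity even though the engine itself expresses rules in Java:

```python
class Rule:
    """A compiled rule: one condition object, possibly several action objects."""
    def __init__(self, name, condition, actions):
        self.name = name
        self.condition = condition      # callable: facts -> bool
        self.actions = list(actions)    # callables: facts -> result

class RulePool:
    """Object pool keyed by rule name; rules can be swapped at run time."""
    def __init__(self):
        self._rules = {}

    def register(self, rule):
        # Replacing an entry changes the rule without restarting the engine.
        self._rules[rule.name] = rule

    def execute(self, name, facts):
        rule = self._rules[name]
        if rule.condition(facts):
            # One condition may trigger more than one action.
            return [action(facts) for action in rule.actions]
        return []

pool = RulePool()
pool.register(Rule(
    "Customer Credit",
    condition=lambda f: f["credit_score"] >= 600,
    actions=[lambda f: "approve", lambda f: "log approval"],
))
```

Because the condition is a separate object, it is evaluated once even when several actions hang off it, mirroring the multiple-actions-per-condition facility described above; re-registering a rule under the same name models dynamic rule change without a restart.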
5 Related Works

The Business Rules Group [2] defines a business rule as “a statement that defines and constrains some aspect of business”. It is intended to assert business structure or to control or influence the behavior of the business. The Object Management Group (OMG) is working on Business Rules Semantics [3]. Several classifications of different rule types have emerged [2, 4, 5]. In [4], business rules are classified into four different types: integrity rules, derivation rules, reaction rules, and deontic assignments. A well-known algorithm for matching rule conditions is RETE [6]. For business rule expression, a rule markup language is needed. Currently, BRML (Business Rule Markup Language) [7], the Rule Markup Language (RuleML) [8], and the Semantic Web Rule Language (SWRL) [9] have been proposed as rule markup languages. IBM took the initiative of developing the Business Rule Markup Language (BRML) for its Electronic Commerce Project [7]. BRML is an XML encoding which represents a broad subset of KIF. The Simple Rule Markup Language (SRML) [10] is a generic rule language consisting of a subset of language constructs common to the popular forward-chaining rule engines. Another rule markup approach is the Semantic Web Rule Language (SWRL), a member submission to the W3C. It is a combination of the OWL DL and OWL Lite sublanguages of the OWL Web Ontology Language [9]. SWRL includes an abstract syntax for Horn-like rules in both of its sublanguages. Most recently, the Java Community Process finished the final version of its Java Rule Engine API. JSR-94 (Java Specification Request) was developed in November 2000 to define a runtime API for different rule engines for the Java platform. The API prescribes a set of fundamental rule engine operations based on the assumption that clients need to be able to execute a basic multiple-step rule engine cycle (parsing the rules, adding objects to an engine, firing rules, and getting the results) [11].
It does not describe the content representation of the rules. The Java Rule API is already supported (at least partially) by a number of rule engine vendors (cf. Drools [12], ILOG [13] or JESS [14]) to support interoperability.
6 Concluding Remarks

As business applications become complex and changeable, a rule-based mechanism is needed for automatic adaptive computing as well as trustworthy and easy maintenance. For this purpose, we propose a compilation-based rule engine that can easily express business rules in Java code. It needs no additional script language for expressing rules. It can create and execute condition and action objects at run time. Moreover, it can use existing libraries for the condition or action codes of rules (e.g., String, Number, and Logical Expression), so it can not only express complex condition or action statements but also easily integrate with existing systems developed in Java. As a result, the compilation-based rule engine proposed in this paper shows better performance than JSR-94, a widely used interpretation-based rule engine. According to our experiments, the proposed rule engine processes 245 more transactions per second than JSR-94. We intend to test the performance of the rule
engine proposed in this research with different weights and under different conditions. This will not only give us a better idea of the working capacity of the outcome of this research, but will also clarify the application areas for this rule engine. Moreover, we intend to develop a GUI that could assist users who have limited knowledge of Java in working with this rule engine.
References
1. Lars Geyer and Martin Becker, "On the Influence of Variabilities on the Application-Engineering Process of a Product Family", Proceedings of SPLC2, 2002.
2. The Business Rules Group. Defining Business Rules – What Are They Really? http://www.businessrulesgroup.org/first paper/br01c0.htm, July 2000.
3. B. von Halle. Business Rules Applied. Wiley, 1st edition, 2001.
4. K. Taveter and G. Wagner. Agent-Oriented Enterprise Modeling Based on Business Rules. In Proceedings of the 20th Int. Conf. on Conceptual Modeling (ER2001), LNCS, Yokohama, Japan, November 2001. Springer-Verlag.
5. S. Russell and P. Norvig. Artificial Intelligence – A Modern Approach. Prentice Hall, second edition, 2003.
6. C. Forgy. RETE: a fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence, 19(1):17–37, 1982.
7. IBM T.J. Watson Research. Business Rules for Electronic Commerce Project. http://www.research.ibm.com/rules/home.html, 1999.
8. RuleML Initiative. Website. http://www.ruleml.org.
9. W3C. OWL Web Ontology Language Overview. http://www.w3.org/TR/owl-features/. W3C Recommendation, 10 February 2004.
10. ILOG. Simple Rule Markup Language (SRML). http://xml.coverpages.org/srml.html, 2001.
11. Java Community Process. JSR 94 – Java Rule Engine API. http://jcp.org/aboutJava/communityprocess/final/jsr094/index.html, August 2004.
12. Drools. Java Rule Engine. http://www.drools.org.
13. ILOG. Website. http://www.ilog.com.
14. JESS. Java Rule Engine. http://herzberg.ca.sandia.gov/jess.
A Formal Design of Web Community Interactivity

Chima Adiele
University of Lethbridge, Lethbridge, Alberta, Canada
[email protected]
Abstract. Web Communities (WCs) are emerging as business enablers in the electronic marketplace. As the size of the community becomes increasingly large, there is a tendency for some members of the community to use resources provided by the community without necessarily making any contribution. It is, therefore, necessary to determine members’ contributions towards sustaining the community. In this paper, we present a formal framework to dynamically measure the interactivity of members, and indeed the interactivity level of the community. This formal foundation is necessary to eliminate ad hoc approaches that characterize existing solutions, and provide a sound foundation for this new research area. We design an efficient interactivity algorithm, and also implement a prototype of the system. Keywords: Formal specification, Web communities, and interactivity lifecycle.
1 Introduction

A Web community (WC) is a Web-enabled communication and social interaction between a group of people that have common interests. Rheingold [1] envisions a WC as a social phenomenon that has no business dimension. Recent advances in information and communication technologies, however, have given impetus to WCs as business enablers in the digital marketplace. Many organizations leverage virtual communities to attract new and retain old customers by identifying the needs and beliefs of their customer base, and hence, create value through intention-based customer relationships [2,3]. The main thrust of this paper is to provide a formal framework to measure the interactivity of members in a WC, and also determine the community’s interactivity level. Interactivity relates to the level of participation of a member in a given community, and the usefulness of such contributions to the needs of the community. To achieve the envisioned objectives, we leverage algebraic signatures to formally specify the components of the interactivity model to provide a sound foundation. The use of formal and theoretical foundations is particularly important for this new research area to guarantee the correctness and completeness of the system. We design an interactivity model that uses a common term vocabulary (CTV) to automatically filter irrelevant messages from the community. Automatically filtering irrelevant messages eliminates the manual process that is time consuming, labour intensive, and error prone. In addition, we provide an efficient interactivity algorithm and implement a prototype of the system.

G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 797–804, 2007. © Springer-Verlag Berlin Heidelberg 2007
C. Adiele
The remaining part of this paper is structured as follows. In Section 2, we provide background information on our specification, and also discuss related work. Section 3 examines the dynamics of interactivity, and hence, presents a formal framework for a WC interactivity. We design an interactivity algorithm in Section 4, while Section 5 concludes the paper and provides insight into future work.
2 Background

The specification in this paper uses set notations (∩, ∪, ⊆, ⊇, ∈, ℕ) to describe structural components, and predicate logic to describe pre- and post-conditions for any requirements. Pre- and post-conditions are stated as predicates. A simple predicate usually has one or more arguments and is of the form P(x), where x is an argument used in the predicate P. The universal (∀) and existential (∃) quantifiers are in common use. Every declaration must satisfy a given constraint. In general, a quantified statement can be written in one of two forms:

1. <declaration(s)> • <predicate>
2. <declaration(s)> | <constraint> • <predicate>

The symbols “|” and “•”, which are part of the syntax, mean “satisfying” and “such that”, respectively. To create compound predicates, statements can be nested and combined using one or more logical connectives, such as: and (∧), or (∨), not (¬), conditional (⇒), and bi-conditional (⇔). The formal specification of a requirement in this paper follows the general format of a quantified statement.

There are some previous research efforts that are tangentially related to our work. Lave and Wenger [4], and Menegon and D’Andrea [5] observe that members of a community develop shared practice by interacting around problems, solutions, and insights, and by building a common store of knowledge. Blanchard and Markus [6] argue that “the success of community support platforms depends on the active participation of a significant percentage of the community members”. Community participation is necessary for sustained interactivity. Some research efforts [7,6] have examined the effects of size and under-contribution in online communities. These works suggest ways of using concepts from social psychology to motivate contributions. In this paper, we provide a formal framework to dynamically measure the interactivity of members, and indeed the interactivity level of the community.
3 Formal Framework for a WC Interactivity

To discuss the formal framework of a WC interactivity model, we first examine its dynamics. We use the interactivity lifecycle in Figure 1 to discuss the dynamics of WC interactivity. It is a multi-user, Web-based system designed to provide a WC where members can interact and exchange ideas. The system has several servers in a server farm to manage and display the different types of media (text, images, audio, and video). Video frames need to be transmitted quickly and in synchrony, but at relatively low resolution, to support video conferencing. Video contents may be compressed in a store,
so the video server may handle video compression and decompression into different formats. There is also an audio server that facilitates teleconferencing. Both the audio and video servers are used to manage the subset of conferencing activities. The other activities (such as posting messages, reading messages, replying to messages, etc.) in the WC fall under message activities. There are different data servers used to manage messages and display members’ interactivity records. These data servers provide support for extensive queries and scripting facilities to enable members to interact.
Fig. 1. WC interactivity Diagram
To address the issue of posting irrelevant messages that have nothing to do with the subject of discussion, some communities moderate posted messages. Manually moderating messages in large communities can be time consuming, labour intensive, and error prone. Therefore, there is a need to automate the process of filtering messages that are posted in a given community. We leverage a CTV to automatically filter messages before they are posted. A CTV is an ontology that contains primitive terms in a given domain and does not prescribe any structure for its designers [8]. When a member writes a message, that message has to pass through a filter mechanism. The filter mechanism, which uses the CTV, is an accepting device that either accepts a message, in which case it is posted, or rejects it otherwise [9].

3.1 Formal Foundation

Members’ loyalty to the community varies according to their level of participation in the community. Adiele and Ehikioya [8] identified three categories of membership, namely executive, senior, and ordinary members, with corresponding degrees of participation. Butler [7] identified similar categories of membership, namely leaders, active users, and silent users. Accordingly, we classify members into three groups: Leading Members (LM), members that make substantial contributions to the community by posting, responding to, and reading messages on a regular basis; Active Members (AM), members that make some contributions to the community, far fewer than the contributions of LM; and Non-active Members (NM), members that make minimal or no contributions at all to the community.
We model members’ participation as a function of their class of membership. Accordingly, the following inequalities hold:

LM_num ≤ AM_num ≤ NM_num  (“num” is the number of members)   (1)

LM_cont ≥ AM_cont ≥ NM_cont  (“cont” is the contributions of members)   (2)
Let MEMBER be the basic type for members of a WC. Let Mem be a non-empty power set of members (i.e., Mem: ℙ₁ MEMBER). There are three classes of membership, divided according to members’ participation levels over a specified time window [7]. Let LM, AM, and NM represent the sets of leading members, active members, and non-active members, respectively. LM, AM, and NM are the three classes of membership, and every member can only belong to one class at a given time.

∀m_i : MEMBER | m_i ∈ Mem •
  ∃LM, AM, NM : MEMBER | LM, AM, NM ⊂ Mem •
  (LM ∪ AM ∪ NM) = Mem ∧ (LM ∩ AM ∩ NM) = ∅   (3)
Every member in the community is unique. We capture this uniqueness formally as follows:

∀m_i, m_j : MEMBER | m_i, m_j ∈ Mem •
  m_i = m_j ⟹ i = j   (4)

Activity: In a WC, a member performs certain actions, which we call activities, to contribute to the community. Different sets of activities have different parameters of measurement. For example, we count the number of messages that a member has posted, read, or replied to in order to determine the member’s contributions from messaging, while we measure the time a member spends on video conferencing or teleconferencing to determine the member’s contributions from conferencing. We refer to the former as message activities and the latter as conferencing activities. Let MA represent the set of message activities and CA the set of conferencing activities. Thus,

(MA ∪ CA) = A  and  (MA ∩ CA) = ∅   (5)
Let ACTIVITY be the basic type for activities in which members can participate (a formal definition of Participate is given in (7)) and A a power set of activities, such that A: ℙ₁ ACTIVITY.

Definition 1: An activity, a_i, is an action that a member, m_j, undertakes in a WC to contribute to the community.

In every WC, an activity a_i ∈ A has a measure of importance. That importance is captured by the weight w_i. The weight of an activity is assigned relative to the importance of the activity in a given community. Let W be the set of weights for a corresponding set of activities A. Let VALUE be the basic type of values. The product of a_i and w_j
represents the value of the activity in a given community. We define a function Value that returns the value of each activity.

Value : ACTIVITY × WEIGHT → VALUE
∀a_i : ACTIVITY | a_i ∈ A • ∃w_j : WEIGHT | w_j ∈ W •
  Value(a_i, w_j) = (a_i ∗ w_j)   (6)
We define a function Participate that returns the activity a member participates in.

Participate : MEMBER → ACTIVITY
∀m_j : MEMBER | m_j ∈ Mem • ∃a_i : ACTIVITY | a_i ∈ A •
  Participate(m_j) = a_i   (7)
A member can only participate in one activity at a given time instance. Let t be a time instance of type TIME; we capture this constraint formally:

∀t : TIME • (∃m_j : MEMBER ∧ ∃₁ a_i : ACTIVITY) •
  Participate(m_j) = a_i   (8)
To participate in a WC, a member has to log in to the system. We define the status of members to facilitate the Login operation: Status = {ON, OFF}. Formally,

Login : MEMBER
Login(m_j) = TRUE ⇔ ∀m_j : MEMBER | m_j ∈ Mem • Status = ON   (9)
A member who logs into the system can also log out at will. The definition of Logout follows.

Logout : MEMBER
Logout(m_j) = TRUE ⇔ ∀m_j : MEMBER | m_j ∈ Mem • Status = OFF   (10)

To simplify our exposition and facilitate understandability, we discuss a subset of the activities. For example: start posts (sP) for a message that begins a thread; reply posts (rP) for a message that responds to another message, thus building the thread; and reads (R) for messages read by a member. Let MESSAGE be the basic type for messages and MA a set of message activities for messages posted or replied to in a WC. MA is a subset of activities, such that MA = {sP, rP}, where MA ⊂ A. Let tM be the total number of messages, such that sP + rP ≤ tM. We specify a generic CTV that provides an enterprise-wide definition of terms (called context labels) to automate the process of filtering messages. The CTV is organized hierarchically using linguistic relations to show how terms relate to one another. To capture these linguistic relations, we let CONTEXT-LABEL be the basic type for context labels. Let LRI = {synon, hyper, hypon, meron} be the set of linguistic relationship identifiers, where synon, hyper, hypon, and meron are synonym, hypernym, hyponym, and meronym, respectively. To define the CTV, we first define a context label, cl, as a primitive term (word) that has a unique meaning in the real world. A formal definition of a linguistic relation follows. Let ℜ be a linguistic relation; then:

ℜ : CL × CL → LRI   (11)
Definition 2: A CTV is a pair (CL, ℜ), where CL is a set of context labels and ℜ is a linguistic relation which shows that, given cl_i, cl_j ∈ CL, the relationship between cl_i and cl_j is one of {synon, hyper, hypon, meron} (i.e., ℜ(cl_i, cl_j) ∈ LRI).

Definition 3: A filter mechanism, FM, is an accepting device which uses the CTV to parse the words in a message; if the message meets a given acceptance standard, the message is accepted, otherwise it is rejected. To represent this partial function formally, we let DATABASE be the basic type of databases. Only messages parsed by the filter mechanism are posted.

FM : MESSAGE × CTV → DATABASE   (12)
We define a function Update that updates the database. To enable us to define the function Update, we give the signature of Write, a function that writes into the database.

Write : DATABASE → DATABASE   (13)

Update : MEMBER × ACTIVITY → DATABASE
∀m_i : MEMBER | m_i ∈ Mem •
  ∃a_i : ACTIVITY | a_i ∈ A ∧ tM : MESSAGE •
  Update(m_i, a_i) ⟹
    ∀a_i : ACTIVITY | (a_i = sP) ⟹ Write(sP + 1) ∨
    ∀a_i : ACTIVITY | (a_i = rP ∧ (tM = sP ∪ rP ∧ sP ∩ rP = ∅)) ⟹ Write(rP + 1) ∨
    ∀a_i : ACTIVITY | (a_i = R ∧ R < tM) ⟹ Write(R + 1)   (14)
Interactivity: Let WC be a Web community; there exist a set of members Mem and a set of activities A, such that a member m_i ∈ Mem participates in activities a_i ∈ A.

Definition 4: The interactivity of a member m_j of a WC for a given time window W (written I_WI) is the sum of the values v_k of the activities that m_j participates in over the width of W. Formally,

Interactivity I_WI : VALUE •
  ∀m_j : MEMBER | m_j ∈ Mem •
  (∃a_i : ACTIVITY | (a_i ∈ A ∧ S : TIME) •
  Participate(m_j) = a_i ∧ ∃w_k : WEIGHT | w_k ∈ W) •
  I_WI = Σ_S (Value(a_i, w_k))   (15)

Definition 4 represents the interactivity of a member in a WC. We extend this definition to obtain the interactivity of a community. The interactivity of a community, I_WC, is the sum of the individual interactivities I_WI over the size of the community, CS. Formally,

Interactivity I_WC : VALUE •
  I_WC = Σ_CS (I_WI)   (16)
4 Overview of the System In this section, we present an interactivity algorithm that describes how to capture the interactivity of members in a WC. We also describe a prototype of the system.
Algorithm: Measure-Interactivity
Input: unique member's ID (Mid) and member's activities
Output: member's interactivity level
1.  while login(Mid)
2.    participate in activity a_i
3.    if a_i ∈ MA and a_i = R then
4.      search(messages); read(messages);
5.      computeInteractivity(Mid);
6.    else if a_i is any of (sP, rP, Res) then
7.      filter(messages); updateDB( );
8.    else if a_i ∈ CM then
9.      T1 = startTime(conferencing);
10.     T2 = stopTime(conferencing);
11.     T = T2 − T1;
12.     computeInteractivity(Mid);
13.     updateDB(messages);
14. end(while)
15. end.

We implemented a prototype of the WC on a client-server architecture using Apache server 1.3.34 (Unix) as our Web server and JavaScript as our main development language for the application server. Apache HTTP Server is a stable, efficient, and portable open-source HTTP web server for Unix-like systems. JavaScript permits easy vertical migration in future and allows platform independence. We used CSS to specify the presentation of elements on the Web page independently of the document structure. At the back end, we used MySQL version 4.1.0 as the database, and the application uses the SQL query language to manipulate the database. Our prototype uses PHP to connect the client to the database server and to run queries on the database from the client side. Figure 2(a) is a screen shot of a discussion group showing messages that members posted. When a member posts a message, the filter mechanism uses the CTV to parse the message. Figure 2(b) shows how a member can search for posted messages.
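The weighted-sum interactivity of Definition 4, which the computeInteractivity step in the Measure-Interactivity algorithm is responsible for, can be sketched as follows. The activity weights and function names are hypothetical examples of our own, not taken from the prototype:

```python
# Hypothetical weights per activity type (sP = start post, rP = reply, R = read,
# conf = conferencing, weighted per minute).
WEIGHTS = {"sP": 3.0, "rP": 2.0, "R": 1.0, "conf": 0.5}

def member_interactivity(activities, weights=WEIGHTS):
    """I_WI: sum of Value(a_i, w_k) over a member's activities in the window.

    `activities` is a list of (activity_type, amount) pairs, e.g. a count of
    posts, or minutes spent conferencing.
    """
    return sum(amount * weights[kind] for kind, amount in activities)

def community_interactivity(members):
    """I_WC: sum of the individual interactivities over the community."""
    return sum(member_interactivity(acts) for acts in members.values())

members = {
    "alice": [("sP", 2), ("rP", 5), ("R", 20)],   # a leading member
    "bob":   [("R", 3), ("conf", 10)],            # an active member
    "carol": [],                                  # a non-active member
}
```

With these weights, alice scores 2·3 + 5·2 + 20·1 = 36 and bob scores 3·1 + 10·0.5 = 8, which matches the LM ≥ AM ≥ NM ordering of contributions in inequality (2).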
(a)
(b)
Fig. 2. (a) Messages Posted in a Discussion Group; (b) Members Search for Messages Posted
Messages are indexed in the database according to subjects and titles. The system has an efficient search mechanism to enable members to search for messages and respond to them.
5 Conclusions In this paper, we formally specified components of an interactivity model to measure the contributions of members of a WC. The use of formal and theoretical foundations is particularly important for this new research area which, in the recent past, has been characterized mostly by ad-hoc solutions. We also designed an interactivity algorithm and provided a prototype of the Web community. The model we presented dynamically measures individual member’s interactivity, and indeed, the interactivity level of the community. These measurements will enable us to understand the dynamics of the community and also facilitate the classification of members into different groups according to their levels of participation. This classification provides a framework to address individual member’s needs and reward deserving members.
References
1. Rheingold, H.: The Virtual Community: Homesteading on the Electronic Frontier. Revised edition. MIT Press (2000)
2. Boczkowski, P.J.: Mutual shaping of users and technology in a national virtual community. Journal of Communications 49(2) (1999) 86–109
3. Romm, C., Pliskin, N., Clarke, R.: Virtual communities: Towards an integrative three-phase model. International Journal of Information Management 17(4) (1997) 261–271
4. Lave, J., Wenger, E.: Situated Learning: Legitimate Peripheral Participation. Cambridge University Press (1991)
5. Menegon, F., D’Andrea, V.: Social processes and technology in an online community of practices. In: Proceedings of the International Conference on Web-based Communities (WBC2004) (2004) 115–122
6. Blanchard, A.L., Markus, M.L.: Sense of virtual community: Maintaining the experience of belonging. In: Proceedings of the 35th Hawaii International Conference on System Sciences (HICSS-35'02) (2002)
7. Butler, B.: Membership size, communication activity and sustainability: a resource-based model of on-line social structures. Information Systems Research 12(4) (2001) 346–362
8. Adiele, C., Ehikioya, S.A.: Towards a formal data management strategy for a web-based community. Int. J. Web Based Communities 1(2) (2005) 226–242
9. Adiele, C., Ehikioya, S.A.: Algebraic signatures for scalable web data integration for electronic commerce transactions. Journal of Electronic Commerce Research 6(1) (2005) 56–74
Towards a Type-2 Fuzzy Description Logic for Semantic Search Engine*

Ruixuan Li, Xiaolin Sun, Zhengding Lu, Kunmei Wen, and Yuhua Li
College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
[email protected], [email protected], {zdlu,kmwen,yhli3}@hust.edu.cn
Abstract. Classical description logics are limited to dealing with crisp concepts and relationships, which makes it difficult to represent and process imprecise information in real applications. In this paper we present a type-2 fuzzy version of ALC and describe its syntax, semantics, and reasoning algorithms, as well as an implementation of the logic with type-2 fuzzy OWL. Compared with type-1 fuzzy ALC, a system based on type-2 fuzzy ALC can define imprecise knowledge more exactly by using a membership degree interval. To evaluate the ability of type-2 fuzzy ALC to handle vague information, we apply it to a semantic search engine for building the fuzzy ontology and carry out experiments comparing it with other search schemes. The experimental results show that the type-2 fuzzy ALC based system can increase the number of relevant hits and improve the precision of the semantic search engine. Keywords: Semantic search engine, Description logic, Type-2 fuzzy ALC, Fuzzy ontology.
1 Introduction

As the foundation of the semantic web [1,2], ontology plays a very important role in many applications, such as semantic search [3]. As one of the logical underpinnings of ontology, description logics (DLs) [4] represent the knowledge of an application domain by defining the relevant concepts of the domain (terminology) and using these concepts to specify properties of objects and individuals belonging to the domain (the world description). As one of the family of knowledge representation (KR) formalisms, the powerful descriptive ability of DLs makes it easy to express information in different application domains [5]. Established by the W3C in 2004, OWL (Web Ontology Language) [2,6] has become the standard knowledge representation markup language for the semantic web.
This work is supported by National Natural Science Foundation of China under Grant 60403027, Natural Science Foundation of Hubei Province under Grant 2005ABA258, Open Foundation of State Key Laboratory of Software Engineering under Grant SKLSE05-07, and a grant from Huawei Technologies Co., Ltd.
G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 805–812, 2007. © Springer-Verlag Berlin Heidelberg 2007
R. Li et al.
Expected to be applied in the semantic web, semantic search extends the search engine with ontology. Using general ontologies, most current semantic search engines handle information retrieval in the semantic web based on classic DLs. The University of Maryland proposed SHOE [7,8], which can find semantic annotations in web pages. Tap [9,10,11], developed by Stanford University and IBM, applies semantic web technology to Google and augments the results in order to increase the quality of retrieval. Swoogle [12,13,14] is designed for information retrieval in structured documents such as RDF (Resource Description Framework), OWL, and so on. At present, more and more semantic search systems are designed based on ontologies supported by classic DLs. But classical DLs can only define crisp concepts and properties, and the certain reasoning of classic DLs means that an inference only answers "True" or "False", which cannot solve the fuzziness problem of ontology systems in the real world. Therefore, fuzzy DLs have been designed to extend the classic DLs and make them more applicable to ontology systems. At present, most fuzzy logic systems (FLSs) are based on type-1 fuzzy sets, which were proposed by Zadeh in 1965 [15]. However, it was quite late before fuzzy sets were applied to DLs and ontology systems. Without a reasoning algorithm, Meghini proposed a preliminary fuzzy DL as a tool for modeling multimedia document retrieval [16]. Straccia presented the formalized Fuzzy ALC (FALC) [17] in 2001, which is a type-1 fuzzy extension of ALC. Before long, Straccia extended SHOIN(D), the DL corresponding to the standard ontology description language OWL DL, to a fuzzy version [18,19]. However, there are some limits to type-1 fuzzy sets: for example, imprecision cannot be described clearly by a single crisp value, which results in the loss of fuzzy information.
To address the problem mentioned above, we propose a type-2 fuzzy ALC and apply it to a semantic search engine. The contributions of this paper are as follows. First, we present the syntax and semantics of a type-2 fuzzy extension of ALC, which can represent and reason about fuzzy information with OWL, a formalized ontology description language. Besides the format of the axioms defined in type-2 fuzzy ALC, a reasoning algorithm is also proposed for semantic search. Finally, we design and implement a semantic search engine based on type-2 fuzzy ALC and carry out experiments to evaluate the performance of the proposed search scheme. The rest of the paper is organized as follows. Section 2 reviews related research and the basic concepts of DLs, classic ALC and type-1 fuzzy ALC. Section 3 presents the format of type-2 fuzzy ALC and the method of reasoning in type-2 fuzzy DL. Approaches for applying the type-2 fuzzy DL to handle the descriptions in fuzzy ontologies for a semantic search engine with OWL are addressed in Section 4, followed by conclusions and future research directions.
2 Basic Concepts

ALC concepts and roles are built as follows. We use the letter A for the set of atomic concepts, C for the set of complex concepts defined by descriptions, and R for the set of
Towards a Type-2 Fuzzy Description Logic for Semantic Search Engine
807
roles. Starting with (1) A, B ∈ A, (2) C, D ∈ C and (3) R ∈ R, the concept terms in a TBox can be defined inductively in the following format: C ⊑ f (A, B, R, ⊓, ⊔, ∀, ∃, ⊥, ⊤) (partial definition) and C ≡ f (A, B, R, ⊓, ⊔, ∀, ∃, ⊥, ⊤) (full definition). ⊥ and ⊤ are two special atomic concepts named the "bottom concept" and the "universe concept". The syntax and semantics of the ALC constructors are presented in [4].

For the reasons mentioned above, a classic DL such as ALC cannot deal with imprecise descriptions. To solve this problem in DLs, Straccia presented FALC, an extension of ALC with fuzzy features, to support fuzzy concept representation. Because Straccia used a certain number to describe the fuzzy concepts and individuals in FALC, we call this FALC type-1 FALC [17].
3 Type-2 Fuzzy ALC

3.1 Imprecise Axioms in Type-2 Fuzzy ALC

Different from type-1 fuzzy sets, type-2 fuzzy sets use an interval to express the membership. Each grade of membership is an uncertain number in the interval [0,1]. We denote the membership in type-2 fuzzy sets by μ_Ã instead of the μ_A of type-1, defined as follows:

    μ_Ã(x) = [μ_A^L(x), μ_A^U(x)]    (1)
In (1), μ_A^L(x), μ_A^U(x): U → [0,1] and ∀x ∈ U, μ_A^L(x) ≤ μ_A^U(x). We call μ_A^L(x) and μ_A^U(x) the primary membership and the secondary membership, and x is an instance in the fuzzy set U. Obviously, a type-2 fuzzy set reduces to a type-1 fuzzy set when the primary membership equals the secondary one, so a type-1 fuzzy set is embedded in a type-2 fuzzy set.

There are two fuzzy parts in the type-2 fuzzy ALC presented in this paper: the imprecise terminological axioms (TBox) and the fuzzy individual memberships (ABox). To build a DL system, the first thing to be done in creating the TBox is to define the necessary atomic concepts and roles with symbols. The base symbols certainly exist in the DL system, but the name symbols may not. In other words, the atomic concepts defined by different axioms may be imprecise, which means that an axiom may not hold absolutely in a type-2 fuzzy ALC TBox. For example, given two base symbols named Animal and FlyingObject, we can define the atomic concept Bird in the TBox with axiom (2):

    Bird_[0.9,0.95] ≡ Animal ⊓ FlyingObject    (2)

Axiom (2) means that the probability that Bird can be described as the conjunction of Animal and FlyingObject is between 0.90 and 0.95.
Because of the certainty of the base symbols, the probabilities of the atomic concepts Animal and FlyingObject are both 1, i.e., in the interval [1,1]. Instead of Animal_[1,1] we concisely write the certain atomic concept as Animal, without the [1,1]. Type-2 fuzzy ALC represents the vagueness of an atomic concept with two properties, fuzzy:LowerDegree and fuzzy:UpperDegree, which describe μ_A^L(x) and μ_A^U(x). Because every atomic concept (role) can be considered independent, we can calculate the values of fuzzy:LowerDegree and fuzzy:UpperDegree of a fuzzy concept even when we do not know them beforehand. For example, suppose we want to define an atomic concept Meat-eatingBird with the base symbol Meat-eatingObject via axiom (3):

    Meat-eatingBird ≡ Bird_[0.9,0.95] ⊓ Meat-eatingObject    (3)

When we apply the triangular norms T(a,b) = ab / [1 + (1−a)(1−b)] and S(a,b) = (a+b) / (1+ab), we can get the value of fuzzy:LowerDegree (fuzzy:UpperDegree) of Meat-eatingBird with the equation μ^L(Meat-eatingBird) = T(μ^L(Bird), μ^L(Meat-eatingObject)). As mentioned above, μ^L(Bird) = 0.9 and μ^L(Meat-eatingObject) = 1, so μ^L(Meat-eatingBird) = (0.9 × 1) / [1 + (1−1)(1−0.9)] = 0.9. The membership of the atomic concept Meat-eatingBird is therefore in the interval [0.9,0.95]. We call this the transitivity of type-2 fuzzy ALC.

In addition to the fuzzy TBox, uncertainty also exists in the ABox of a fuzzy DL. The assertion Bird_[0.9,0.95](penguin)_[0.65,0.90] means that the degree to which penguin can be considered an instance of Bird_[0.9,0.95] is in [0.65,0.90] in the given DL. Similar to FALC, the ABox assertions have the form C^I(d) = [a,b], in which 0 ≤ a ≤ b ≤ 1. Take the atomic concept Bird_[0.9,0.95] for example: Bird(penguin) being satisfied in the ABox has two preconditions: (1) the concept Bird is satisfied in the TBox; (2) penguin belongs to Bird in the ABox. So we can conclude that μ^L(Bird(penguin)) = T(μ^L(Bird), μ^L(penguin ∈ Bird)) = T(0.90, 0.65) ≈ 0.565 (and similarly for μ^U(Bird(penguin))). So the ABox can be denoted by a set
of equations of the form C_[a,b](x) = [c,d], where C = f (A, B, R, ⊓, ⊔, ∀, ∃, ⊥, ⊤); for example, Bird_[0.9,0.95](penguin) = [0.65,0.95], or Bird_[0.9,0.95](penguin)_[0.65,0.98].

3.3 The Syntax and Semantics of Type-2 Fuzzy ALC
We define A, C and R as the sets of atomic concepts, complex concepts, and roles. C ⊓ D, C ⊔ D, ¬C, ∀R.C and ∃R.C are fuzzy concepts. A fuzzy interpretation in type-2 fuzzy ALC is a pair I = (∆^I, ·^I), where ·^I is an interpretation function that maps fuzzy concepts and roles into membership degree intervals: C^I: ∆^I → [a,b] and R^I: ∆^I × ∆^I → [a,b], where a and b must satisfy 0 ≤ a ≤ b ≤ 1. The syntax and semantics of type-2 fuzzy ALC are shown in Table 1. Different from FALC, in type-2 fuzzy ALC the truth values are not numbers in [0,1] but pairs of the form [a,b], which must satisfy the inequation 0 ≤ a ≤ b ≤ 1.
Table 1. The syntax and semantics of type-2 fuzzy ALC constructors

  Constructor                      Syntax               Semantics
  Top (Universe)                   ⊤                    ∆^I
  Bottom (Nothing)                 ⊥                    Φ
  Atomic Concept                   A_[a,b]              A^I_[a,b] ⊆ ∆^I
  Atomic Role                      R_[a,b]              R^I_[a,b] ⊆ ∆^I × ∆^I
  Conjunction                      C_[a,b] ⊓ D_[c,d]    (C ⊓ D)^I_[T(a,c),T(b,d)]
  Disjunction                      C_[a,b] ⊔ D_[c,d]    (C ⊔ D)^I_[S(a,c),S(b,d)]
  Negation                         ¬C_[a,b]             C^I_[1−b,1−a]
  Value restriction                ∀R_[a,b].C_[c,d]     ∀y.S(R_[1−b,1−a](x,y), C_[c,d](y))
  Full existential quantification  ∃R_[a,b].C_[c,d]     ∃y.T(R_[a,b](x,y), C_[c,d](y))
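The interval semantics of Table 1, combined with the triangular norms T(a,b) = ab/[1+(1−a)(1−b)] and S(a,b) = (a+b)/(1+ab) from Section 3.1, can be sketched in a few lines of Python. This is our own illustrative sketch, not the paper's implementation; the function names are ours.

```python
# Interval semantics of the type-2 fuzzy ALC constructors (Table 1),
# using the triangular norms T and S given in Section 3.1.

def t_norm(a, b):
    return a * b / (1 + (1 - a) * (1 - b))

def s_norm(a, b):
    return (a + b) / (1 + a * b)

def conj(c, d):
    # (C ⊓ D) is assigned the interval [T(a,c), T(b,d)]
    return (t_norm(c[0], d[0]), t_norm(c[1], d[1]))

def disj(c, d):
    # (C ⊔ D) is assigned the interval [S(a,c), S(b,d)]
    return (s_norm(c[0], d[0]), s_norm(c[1], d[1]))

def neg(c):
    # ¬C flips and reflects the interval: [1-b, 1-a]
    return (1 - c[1], 1 - c[0])

# Transitivity example from Section 3.1:
# Meat-eatingBird ≡ Bird_[0.9,0.95] ⊓ Meat-eatingObject_[1,1]
bird = (0.90, 0.95)
meat_eating_object = (1.0, 1.0)       # certain base symbol
print(conj(bird, meat_eating_object))  # → (0.9, 0.95)

# ABox example: penguin belongs to Bird with degree [0.65, 0.90]
penguin_in_bird = (0.65, 0.90)
lower, upper = conj(bird, penguin_in_bird)
print(round(lower, 3))                 # → 0.565
```

Running the sketch reproduces the two worked examples in the text: the Meat-eatingBird interval [0.9, 0.95] and μ^L(Bird(penguin)) ≈ 0.565.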
3.4 Reasoning in Type-2 Fuzzy ALC

Instead of testing subsumption of concept descriptions directly, tableau algorithms use negation to reduce subsumption to the (un)satisfiability of concept descriptions: C ⊑ D iff C ⊓ ¬D = ⊥. The fuzzy tableau begins with an ABox A₀ = {C_[a,b](x)_[c,d]} to check the (un)satisfiability of the concept C_[a,b]. Since ALC has no number restrictions, five rules are presented:

⋂-rule: if A contains C_[a,b](x)_[c,d] and C_[e,f](x)_[g,h]: if [a,b] ⋂ [e,f] ≠ Φ and [c,d] ⋂ [g,h] ≠ Φ, the algorithm extends A to A' = A − {C_[a,b](x)_[c,d], C_[e,f](x)_[g,h]} ∪ {C_[S0(a,e),T0(b,f)](x)_[S0(c,g),T0(d,h)]}; else A' = A − {C_[a,b](x)_[c,d], C_[e,f](x)_[g,h]}.

⊓-rule: if A contains (C'_[e,f] ⊓ C''_[g,h])_[a,b](x)_[c,d] = (C' ⊓ C'')_[T(T(e,f),a),T(T(g,h),b)](x)_[c,d], but not both C'_[e,f](x)_[c,d] and C''_[g,h](x)_[c,d], the algorithm extends A to A' = A ∪ {C'_[e,f](x)_[c,d], C''_[g,h](x)_[c,d]}.

⊔-rule: if A contains (C'_[e,f] ⊔ C''_[g,h])_[a,b](x)_[c,d] = (C' ⊔ C'')_[S(S(e,f),a),S(S(g,h),b)](x)_[c,d], but neither C'_[e,f](x)_[c,d] nor C''_[g,h](x)_[c,d], the algorithm extends A to A' = A ∪ {C'_[e,f](x)_[c,d]} or A'' = A ∪ {C''_[g,h](x)_[c,d]}.

∃-rule: if A contains (∃R_[e,f].C_[g,h])(x)_[c,d], but no individual z such that R_[e,f](x,z)_[c,d] and C_[g,h](z)_[c,d], the algorithm extends A to A' = A ∪ {R_[e,f](x,y)_[c,d], C_[g,h](y)_[c,d]}, where y is an individual not occurring in A before.

∀-rule: if A contains (∀R_[e,f].C_[g,h])(x)_[c,d] and R_[e,f](x,y)_[c,d], but not C_[g,h](y)_[c,d], the algorithm extends A to A' = A ∪ {C_[g,h](y)_[c,d]}.

Given two limit values T_L and T_U, the way to decide whether an ABox in type-2 fuzzy ALC is unsatisfiable differs from the typical tableau, in that
μ^{L(U)}(C) ≤ T_L ⇔ C_[0,0] and μ^{L(U)}(C) ≥ T_U ⇔ C_[1,1]. The tableau process stops when any of the following conditions is established: (1) an obvious clash (⊥(x), (C ⊓ ¬C)(x), etc.) is found during the algorithm; (2) all rules (⊓-rule, etc.) have been executed; (3) a fuzzy clash occurs during the algorithm (e.g., C_[0,0](x) = [c,d]; C_[a,b](x) = [c,d] and C_[c,d](x) = [a,b] with a ≤ b ≤ T_L; or C_[a,b](x) and C_[c,d](x) whose intervals [a,b] and [c,d] do not overlap).
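The interval tests in the stopping conditions can be sketched concretely. This is a hedged illustration of our own, not the paper's reasoner: the helper names and the concrete limit values T_L and T_U are ours.

```python
# Sketch of the fuzzy-clash test (condition 3) and the threshold collapse
# μ(C) ≤ T_L ⇔ C_[0,0], μ(C) ≥ T_U ⇔ C_[1,1] from Section 3.4.

T_L, T_U = 0.1, 0.9   # example limit values; the paper leaves them as parameters

def overlaps(i, j):
    """True when intervals i = [a, b] and j = [c, d] intersect."""
    return max(i[0], j[0]) <= min(i[1], j[1])

def fuzzy_clash(interval1, interval2):
    """Two assertions on the same individual whose membership
    intervals do not overlap form a fuzzy clash."""
    return not overlaps(interval1, interval2)

def collapse(interval):
    """Collapse near-impossible / near-certain intervals onto the
    crisp intervals [0,0] and [1,1] via the limit values."""
    if interval[1] <= T_L:
        return (0.0, 0.0)
    if interval[0] >= T_U:
        return (1.0, 1.0)
    return interval

print(fuzzy_clash((0.2, 0.4), (0.5, 0.7)))  # → True (disjoint intervals)
print(fuzzy_clash((0.2, 0.6), (0.5, 0.7)))  # → False
print(collapse((0.92, 0.95)))               # → (1.0, 1.0)
```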
4 The Semantic Search Engine Based on Type-2 Fuzzy Ontology

4.1 Architecture of the Type-2 Fuzzy Semantic Search Engine

Natural language in daily communication often carries imprecise information. We call queries that include fuzzy concepts fuzzy queries. To handle such fuzzy queries, semantic search engines based on ontologies must build their knowledge bases on fuzzy ontologies, as in the fuzzy semantic search engine proposed in this paper. Fig. 1 shows the architecture of the type-2 fuzzy semantic search engine, in which the user's keywords, fuzzy keywords, or semantic queries flow through the type-2 fuzzy ontology questioner/answerer and the type-2 fuzzy ontology analyzer, whose individuals feed the keywords generator, and finally into the search engine over the index domain, which returns the results.

Fig. 1. Architecture of the type-2 fuzzy semantic search engine
In this framework, users can pose their queries in two ways: they can query the type-2 fuzzy ontology analyzer with keywords or fuzzy keywords, or they can search the ontology by issuing a semantic query to the type-2 fuzzy ontology questioner (answerer) with keywords or other interfaces. Users can thus interact with the ontology directly, using the recalls formed by individuals or classes to make their queries precise; these are then sent to the type-2 fuzzy ontology analyzer. The analyzer generates individuals that satisfy the query and sends these answers to the keywords generator, which composes proper keywords. Finally, the traditional search engine finds the results in the index with these keywords and returns the hits to the users.
4.2 Experiments and Analysis
Based on the framework introduced above, we have implemented the type-2 fuzzy search engine. Supported by the fuzzy ontology reasoner, the semantic search engine based on type-2 fuzzy ALC can improve the relevance of the responses to a query. The experiment was carried out over all resources available at Huazhong University of Science and Technology, including almost 7000 web pages indexed from different departments and 2400 documents. The type-2 fuzzy ontology analyzer, answerer, keywords generator and the search engine are all implemented in Java, and the ontology was built with Protégé.

We chose a group of keywords to retrieve information from the indexes, then picked out the relevant hits (hits that are relevant to the retrieval) from the result set and counted their average. Fig. 2 shows that a semantic search engine based on ontology (classic or fuzzy) greatly expands the relevant hits when there is no imprecise information in the keywords, because the ontology generates more keywords from its individuals. However, the number of relevant hits of the search engine based on a classic ontology decreases rapidly when we add more fuzzy keywords, such as "very" and "young", to the keyword group. Compared to a classic ontology, the semantic search engine based on the type-2 fuzzy ontology accommodates fuzzy keywords much better. For that reason we also carried out an experiment on the precision (the fraction of the retrieved documents that is relevant) of the semantic search engine. Fig. 3 shows that the precision of the semantic search engine based on a classic ontology increases more slowly than that of the one based on the type-2 fuzzy ontology as the number of nodes in the ontology increases. This means that the precision of the search engine is improved when type-2 fuzzy ALC is applied.
Fig. 2. Relevant hits vs. proportion of imprecise keywords (traditional search engine; with classic ontology; with type-2 fuzzy ontology)

Fig. 3. Precision vs. number of nodes in ontology (classic ontology; type-2 fuzzy ontology)
5 Conclusions and Future Work

As the foundation of type-2 fuzzy DLs, type-2 fuzzy ALC is introduced in this paper together with its syntax, semantics, reasoning algorithm and application. Compared with type-1 fuzzy
ALC, type-2 fuzzy ALC can deal with imprecise knowledge much better. Besides semantic search, many applications based on DLs, such as trust management, need to handle fuzzy information; our approach can be applied in those domains to enrich their representation and reasoning abilities. Future work includes research on type-2 fuzzy ALCN and SHOIN(D) and their reasoning algorithms.
References

1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American 284(5) (2001) 34-43
2. Horrocks, I., Patel-Schneider, P.F., van Harmelen, F.: From SHIQ and RDF to OWL: The Making of a Web Ontology Language. Journal of Web Semantics 1(1) (2003) 7-26
3. Guha, R., McCool, R., Miller, E.: Semantic Search. In: Proceedings of the 12th International World Wide Web Conference (WWW 2003). Budapest, Hungary (2003) 700-709
4. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press (2003) 47-100
5. Calvanese, D., Lenzerini, M., Nardi, D.: Unifying Class-Based Representation Formalisms. Journal of Artificial Intelligence Research 11(2) (1999) 199-240
6. Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F. (eds.): OWL Web Ontology Language Reference (2004)
7. Heflin, J.D.: Towards the Semantic Web: Knowledge Representation in a Dynamic Distributed Environment. PhD Thesis, University of Maryland (2001)
8. Heflin, J., Hendler, J.: Searching the Web with SHOE. In: AAAI-2000 Workshop on AI for Web Search. Austin, Texas, USA (2000)
9. Guha, R., McCool, R.: TAP: A Semantic Web Test-bed. Journal of Web Semantics 1(1) (2003) 32-42
10. Guha, R., McCool, R.: The TAP Knowledge Base. http://tap.stanford.edu/
11. Guha, R., McCool, R.: TAP: Towards a Web of Data. http://tap.stanford.edu/
12. Ding, L., Finin, T., Joshi, A., et al.: Swoogle: A Search and Metadata Engine for the Semantic Web. In: CIKM'04. Washington DC, USA (2004)
13. Finin, T., Mayfield, J., Joshi, A., et al.: Information Retrieval and the Semantic Web. In: Proceedings of the 38th Hawaii International Conference on System Sciences (2005)
14. Mayfield, J., Finin, T.: Information Retrieval on the Semantic Web: Integrating Inference and Retrieval. In: 2004 SIGIR Workshop on the Semantic Web. Toronto (2004)
15. Zadeh, L.A.: Fuzzy Sets. Information and Control 8(3) (1965) 338-353
16. Meghini, C., Sebastiani, F., Straccia, U.: Reasoning about the Form and Content for Multimedia Objects. In: Proceedings of the AAAI 1997 Spring Symposium on Intelligent Integration and Use of Text, Image, Video and Audio. California (1997) 89-94
17. Straccia, U.: Reasoning within Fuzzy Description Logics. Journal of Artificial Intelligence Research 14 (2001) 137-166
18. Straccia, U.: Transforming Fuzzy Description Logics into Classical Description Logics. In: Proceedings of the 9th European Conference on Logics in Artificial Intelligence. Lisbon (2004) 385-399
19. Straccia, U.: Towards a Fuzzy Description Logic for the Semantic Web. In: Proceedings of the 1st Fuzzy Logic and the Semantic Web Workshop. Marseille (2005) 3-18
A Type-Based Analysis for Verifying Web Application*

Woosung Jung 1, Eunjoo Lee 2,**, Kapsu Kim 3, and Chisu Wu 1

1 School of Computer Science and Engineering, Seoul National University, Korea
{wsjung,wuchisu}@selab.snu.ac.kr
2 Department of Computer Engineering, Kyungpook National University, Korea
[email protected]
3 Department of Computer Education, Seoul National University of Education, Korea
[email protected]
Abstract. Web applications have become standard in several areas; however, they tend to be poorly structured and lack strongly-typed support. In this paper, we present a web application model and a process to extract the model using static and dynamic analysis. We show recurring problems regarding types and structure in web applications and formally describe algorithms to verify those problems. Finally, we show the potential of our approach via tool support.

Keywords: Web application model, analysis, verification.
1 Introduction

It has become more and more important to verify and validate web applications, because web applications have become standard in business and public areas [1]. Since web applications lack strongly-typed support, the type checking problem for web applications has arisen. Several studies have been conducted on the verification of web applications using type information [2][3][4][5]; however, they concentrate on testing web applications and overlook the kinds of errors that occur frequently in the use of forms and resources. In this paper, we present some practical recurring problems concerning frame structure, form-parameter types, form-parameter names, and resource types. We convert them into type problems and solve them formally. First, we define a model for web applications; then we formalize the algorithms for verifying the raised problems using the model. A tool has been implemented to apply our approach.

The remainder of this paper is organized as follows: Section 2 defines a model. In Section 3, we present four problems that are checked and define the verification
This work was supported by the Brain Korea 21 Project and by the Korea Science and Engineering Foundation(KOSEF) grant funded by the Korea government(MOST) (No. R012006-000-11150-0). ** Corresponding author. G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 813–820, 2007. © Springer-Verlag Berlin Heidelberg 2007
814
W. Jung et al.
algorithm formally. Section 4 describes the results of the type checking problems obtained with the tool we implemented. Finally, in Section 5, conclusions and suggestions for future work are presented.
2 Web Application Model

We illustrate the web application model as an ER diagram (Fig. 1). UML notations are adopted in many studies; however, we choose the ER model because it enables seamless modeling and verification using stored procedures in SQL. It is also more appropriate for applying the fixed point theory that we utilize.
Fig. 1. Web application model in the ER-diagram
The entities in the DB schema are classified as follows:
• Static page structure: rINCLUDE, ePage, rFRAME, ePackage, eComponent, rCONTAIN, rUSE, and eResource
• Page behavior: eServerCase, eServerCaseParam, rNAVIGATE, and eNavigateParam
• Database: eField, eTable, and eDatabase
• Server-side allocation: eScope, rALLOCVAR, eVariable, rALLOCPARAM, and rALLOCDB
• Predefined environment: dComponentType, dComponentTypeCategory, dComponentTypeConstraint, dTypeCategory, and dType
3 Checking Algorithms with the Model

In this section, we introduce four frequent errors that happen in web applications and show checking algorithms.

3.1 Frame-Type Checking

If a user can navigate from a frame of a web page p back to one of its upper pages, the frame page may repeat throughout the entire web page. This is mostly caused by a wrong 'target' in a frame tag or by errors in the navigational structure. We call this kind of error a "frame-type error". We define the domain for frame-type checking in Fig. 2.

W: WebApplication, s ∈ Page, S ∈ 2^Page
P(W) = [[P]]W = {p | p ∈ Page, p is a page of W}
frameowner: 2^Page → 2^Page, [[frameowner]] = λS.{p | p ∈ S, frameset(p) ≠ φ}
frameset: Page → 2^Page, [[frameset]] = λs.{p | p ∈ Page, p is a frame page of s}
NavigationTargets: 2^Page → 2^Page, [[NavigationTargets]] = λS.{p | p ∈ Page, p is reachable from some p' ∈ S with one navigation}

Fig. 2. Domain definition for checking frame-type
We mark all pages that have frames as 'visited' to assure that a frame in a page cannot navigate to its upper pages, including itself. This test is conducted on all web pages. When no page can navigate to its upper pages, we can say that the frame-type of the web application is sound. Figure 3 shows the algorithm for frame-type checking.

for each p ∈ frameowner(P(W)) do
  P(W).visited = false
  T = NavigationTargets(frameset(p))
  if T = φ then <<Exit>>
  else if p ∈ T then <<Frame-type Error>>
  else
    frameset(p).visited = true
    T' = NavigationTargets(T)
    if T' = φ then <<Exit>>
    else if p ∈ T' then <<Frame-type Error>>
    else T.visited = true
  …
end of for

Fig. 3. An algorithm for checking frame-type
We can describe the semantics of frame-type checking using part of the algorithm, as in Fig. 4.
[[FrameTypeCheck]] = if S = φ then <<Exit>>
  else if p ∈ [[NavigationTargets]]S then <<Frame-type Error>>
  else let S.visited = true in [[FrameTypeCheck]] end

Fig. 4. Semantics for checking frame-type
We regard the semantics as an equation X = F(X); then the algorithm can be described using a fixed point (Fig. 5).

[[FrameTypeCheck]] = fix F, F: Page × 2^Page → Page × 2^Page
  = fix(λX.(λS.(λp. if S = φ then <<Exit>>
      else if p ∈ [[NavigationTargets]]S then <<Frame-type Error>>
      else let S.visited = true in X end)))

Fig. 5. An algorithm for checking frame-type using fixed point

To check the soundness of the frame-type, [[FrameTypeCheck]] is executed on all pages that have frames, with frameset(p) as the initial value. That is, the algorithm is summarized as follows:

∀p ∈ [[frameowner]]([[P]]W).[[FrameTypeCheck]]
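The frame-type check of Fig. 3 can be sketched as a breadth-first reachability test: starting from the frames of each frame-owning page p, follow navigation links and report an error if p itself becomes reachable. This is our own illustrative sketch under an assumed graph encoding, not the paper's tool; the function and variable names are ours.

```python
# Sketch of the frame-type check (Fig. 3) over an assumed encoding:
# frames maps each page to the set of its frame pages, nav maps each
# page to the set of pages reachable from it with one navigation.

def frame_type_errors(frames, nav):
    errors = []
    for p, frameset in frames.items():
        seen = set()
        frontier = set(frameset)
        while frontier:
            targets = set()
            for q in frontier:
                targets |= nav.get(q, set())
            if p in targets:              # a frame navigated back to its owner
                errors.append(p)
                break
            frontier = targets - seen     # mark as 'visited' and continue
            seen |= frontier
    return errors

# Example mirroring Section 4: page 1 owns frames 100-102,
# and there is a navigation path 100 -> 110 -> 111 -> 1.
frames = {1: {100, 101, 102}}
nav = {100: {110}, 110: {111}, 111: {1}}
print(frame_type_errors(frames, nav))     # → [1]
```

The 'visited' set plays the role of the fixed-point iteration in Fig. 5: once the frontier stops growing without reaching p, the check exits soundly.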
3.2 Resource-Type Checking

Resource-type checking tests for mismatches of resource types. Each component in a web application has type constraints on the resources it uses. For example, only an image type can be used in an image tag; if an 'AVI' resource is used there, a resource-type error is generated. Resource-type errors are not revealed automatically because web applications are not compiled, and such errors are difficult to find in large web applications; nonetheless, this kind of checking is not supported in existing web applications. We define the domain for resource-type checking in Fig. 6.

c ∈ Comp, r ∈ Res
component: Page → 2^Comp, [[component]] = λs.{c | c ∈ Comp, c is a component of page s}
resource: Comp → 2^Res, [[resource]] = λc.{r | r ∈ Res, r is a resource used by component c}

Fig. 6. Domain definition for checking resource-type

We define the function that checks the resource type of a component in the web application (Fig. 7).

[[ResourceTypeCheck]]c = if ([[resource]]c).type ∉ ([[constraint]]c).type then <<Resource-type Error>>

Fig. 7. A function for checking resource-type
Figure 8 shows the algorithm for resource-type checking.

for each p ∈ P(W) do
  for each c ∈ component(p) do
    [[ResourceTypeCheck]]c
  end of for
end of for

Fig. 8. An algorithm for checking resource-type

The algorithm is summarized as follows:

∀p ∈ [[P]]W, ∀c ∈ [[component]]p.[[ResourceTypeCheck]]c
If no resource-type errors occur during checking, we can say that the web application W is sound with regard to resource-type.

3.3 Form-Parameter Name Checking

When parameters are submitted by 'GET' or 'POST' on the client side, the server pages may try to use parameters that were not submitted by the client or that have names different from the submitter's. For example, a form variable 'name' in one web page is submitted but used as 'nama' in another web page. This happens frequently in practice, yet it is difficult to find this kind of error on the web. We can uncover parameter-name mismatch errors by static analysis based on the form. We define the domain for form-parameter name checking in Fig. 9.

t ∈ Case, n ∈ NavigationCase
case: Page → 2^Case, [[case]] = λp.{t | t ∈ Case, t is a case that can happen in page p}
Navigation: Page → 2^NavigationCase, [[Navigation]] = λp.{n | n ∈ NavigationCase, n is a navigation case that can happen in page p}
NavigationParam: NavigationCase → 2^Param, [[NavigationParam]] = λn.{m | m ∈ Param, m is a submitted parameter in navigation case n}
CaseParam: Case → 2^Param, [[CaseParam]] = λt.{m | m ∈ Param, m is an expected parameter in case t}

Fig. 9. Domain definition for checking form-parameter names

We define a function that checks the form-parameter names (Fig. 10).

[[FormNameCheck]] = if CaseParam(t).name ∉ NavigationParam(n).name then <<Form-parameter Name Error>>

Fig. 10. A function for checking form-parameter names
Figure 11 shows the algorithm for form-parameter name checking. If form-name errors do not happen for any web page, we can say that the web application W is sound in form-parameter names.
for each p ∈ P(W) do
  for each n ∈ Navigation(p) do
    for each t ∈ case(n.TargetPage) do
      [[FormNameCheck]]
    end of for
  end of for
end of for

Fig. 11. An algorithm for checking form-parameter names
The algorithm is summarized as follows:
∀p ∈ [[P]]W, ∀n ∈ [[Navigation]]p, ∀t ∈ [[case]](n.TargetPage).[[FormNameCheck]]

3.4 Form-Parameter Type Checking

In addition to parameter names, parameter types can be considered in form-parameter checking. Figure 12 describes a way to check type mismatches between a parameter m1 on the server and m2 on the client.

[[FormTypeCheck]]<m1, m2> = if m1.name = m2.name and m1.type ≠ m2.type then <<Form-parameter Type Error>>

Fig. 12. A function for checking form-parameter types
Figure 13 shows the algorithm for form-parameter type checking.
for each p ∈ P(W) do
  for each n ∈ Navigation(p) do
    for each t ∈ case(n.TargetPage) do
      for each m1 ∈ CaseParam(t), m2 ∈ NavigationParam(n) do
        [[FormTypeCheck]]<m1, m2>
      end of for
    end of for
  end of for
end of for

Fig. 13. An algorithm for checking form-parameter types
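The two form-parameter checks of Figs. 10 and 12 can be sketched together: compare the parameters a page submits (NavigationParam) with those the target page expects (CaseParam). This is a hedged sketch of our own; the dict encoding of parameters (name → type id) and the function names are ours, not the paper's.

```python
# Sketch of the form-parameter checks: expected and submitted are
# dicts mapping parameter name -> type id (as in "name:2, addr:2").

def form_name_errors(expected, submitted):
    """Names the target page expects but the client never submits (Fig. 10)."""
    return sorted(set(expected) - set(submitted))

def form_type_errors(expected, submitted):
    """Parameters whose names match but whose types differ (Fig. 12)."""
    return sorted(n for n in expected
                  if n in submitted and expected[n] != submitted[n])

# Example mirroring Section 4: page 3 sends name:2, addr:2 but
# page 4 expects nama:2, addr:2 -- a form-parameter name error.
print(form_name_errors({'nama': 2, 'addr': 2}, {'name': 2, 'addr': 2}))  # → ['nama']

# Page 1 sends id:2, pwd:1 but page 2 expects id:2, pwd:2 -- a type error.
print(form_type_errors({'id': 2, 'pwd': 2}, {'id': 2, 'pwd': 1}))        # → ['pwd']
```

The two printed results correspond to the "Wrong parameter names - nama:2" and "Wrong parameter types - pwd:1<>2" tool outputs shown in Section 4.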
4 Implementation

We implemented a tool to support the static analysis of the web application model and applied it to a sample web application. Figure 14 is a screenshot that illustrates the results of the verification. The tool supports the four kinds of error checking stated in Section 3. Furthermore, the tool reports information about the errors, including locations, reasons, and hints for debugging. In particular, for a frame-type error it shows not only the page containing the frames, but also the navigational paths that may trigger the error. The
right side of the top shows the test results, which contain the number of errors and the validity of each type. The body of Fig. 14 shows the details of the result.
Fig. 14. The result of the analysis
We explain part of the result in the following:

• Frame-type error
The following excerpt from Fig. 14 indicates the frame-type error. It shows that page 1 has pages 100, 101 and 102 as its frames and that there is a navigation path from page 100 back to page 1 via 110 and 111. This results in a frame-type error: page 1 is nested, which is undesirable.

* Error :: Frame-type : [Page 1] has Frame( [Page 100], [Page 101], [Page 102] )
Page navigation: 100 → 110 → 111 → 1
• Resource-type error
The following result indicates that component 1 in page 1 may use types 10, 11 and 12; however, it uses a resource of type 20.

* Error :: Resource-type : [Page 1]'s [Component 1]
Supported type: 10, 11, 12
Used Resource with type error: [Resource 1]:20
• Form-parameter name error
"Navigation 3" in the first line of this example indicates that there is a navigation from page 3 to page 4 and that the navigation id is 3. Page 3 submits two form parameters, name and addr, to page 4. The attached number '2' (name:2, addr:2) is their type, but page 4 receives them as 'nama' and 'addr', which reveals a wrong parameter name, 'nama'.
* Error :: Forms Input Name : Navigation 3, [Page 3] -> [Page 4] (Case 2)
[Page 3] send ( name:2, addr:2 )
[Page 4] receive ( nama:2, addr:2 )
Wrong parameter names - nama:2

• Form-parameter type error
Page 1 submits two parameters, id and pwd, to page 2 with their types. In this example, the type of pwd differs between page 1 and page 2, which results in a parameter-type error.

* Error :: Forms Input Type : Navigation 1, [Page 1] -> [Page 2] (Case 1)
[Page 1] send ( id:2, pwd:1 )
[Page 2] receive ( id:2, pwd:2 )
Wrong parameter types - pwd:1<>2
5 Conclusion

We have proposed a method of verifying web applications using a typed approach. We defined a model of web applications and formally presented algorithms to verify, by static analysis, several type problems in web applications, including form parameters, frame structure (frame-type), and resource types. The proposed model can serve as a reference for obtaining a web application structure with type information. The formally presented algorithms provide a type-verification method for problems that occur frequently in the field, and the verification cost is decreased because the checking processes are executed with tool support. In future work, we will identify and verify other verification problems found in web applications using the model. Finally, we will extend our work to define a framework supporting model-driven development of web applications.
References

1. Tonella, P., Ricca, F.: A 2-Layer Model for the White-Box Testing of Web Applications. In: Proc. of the 6th IEEE International Workshop on Web Site Evolution (2004)
2. Harmelen, F., Meer, J.: WebMaster: Knowledge-based Verification of Web Pages. In: Proc. of the 12th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (1999)
3. Despeyroux, T., Trousse, B.: Semantic Verification of Web Sites Using Natural Semantics. In: Proc. of the 6th Conference on Content-Based Multimedia Information Access (2000)
4. Despeyroux, T.: Practical Semantic Analysis of Web Sites and Documents. In: Proc. of the 13th Conference on World Wide Web (2004)
5. Draheim, D., Weber, G.: Strongly Typed Server Pages. In: Proc. of Next Generation Information Technologies and Systems (2002)
6. http://www.antlr.org/
7. http://tidy.sourceforge.net/
Homomorphism Resolving of XPath Trees Based on Automata*

Ming Fu and Yu Zhang 1,2

1 Department of Computer Science & Technology, University of Science & Technology of China, Hefei, 230027, China
2 Laboratory of Computer Science, Chinese Academy of Sciences, Beijing, 100080, China
[email protected], [email protected]
Abstract. As a query language for navigating XML trees and selecting a set of element nodes, XPath is ubiquitous in XML applications. One important issue for XPath queries is containment checking, which is known to be co-NP complete. The homomorphism relationship between two XPath trees, which can be decided in PTIME, is a sufficient but not necessary condition for the containment relationship. We propose a new tree structure to depict XPath based on the level of the tree node, and adopt a method of sharing the prefixes of multiple trees to incrementally construct the most effective automata, named XTHC (XPath Trees Homomorphism Checker). XTHC takes an XPath tree and checks the homomorphism relationship between an arbitrary tree among the multiple trees and the input tree; the input tree is transformed into events that drive the automata. Moreover, we consider and narrow the discrepancy between the homomorphism relationship and the containment relationship as much as possible.

Keywords: XPath tree, containment, homomorphism, automata.
1 Introduction
XML has become the standard for exchanging a wide variety of data on the Web and elsewhere. An XML document is essentially a directed labeled tree. XPath[1] is a simple and popular query language for navigating XML trees and extracting information from them. An XPath expression p is said to contain another XPath expression q, denoted by q ⊆ p, if and only if, for any XML document D, the result set of p queried on D contains the result set of q. Containment checking is thus one of the most important issues for XPath queries. Query containment is crucial in many contexts, such as query optimization and reformulation, information integration, integrity checking, etc. However, [2] shows that containment in the fragment XP{[ ],*,//} is co-NP complete. The authors proposed a complete algorithm for containment, whose complexity is EXPTIME. The authors also proposed a sound but incomplete PTIME*
This work is supported by the National Natural Science Foundation of China under Grant No. 60673126, and the Foundation of Laboratory of Computer Science, Chinese Academy of Science under Grant No. SYSKF0502.
G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 821–828, 2007. © Springer-Verlag Berlin Heidelberg 2007
algorithm based on homomorphism. This algorithm may return false negatives because the homomorphism relationship between two XPath trees is a sufficient but not necessary condition for the containment relationship. In many practical situations, however, containment can be replaced by homomorphism. The homomorphism algorithms proposed in [2][3] mainly focus on resolving the containment problem between two XPath expressions. In [3] the authors proposed hidden-conditioned homomorphism to further narrow the discrepancy between homomorphism and containment, building on [2]. However, these works considered the homomorphism relationship only between two XPath trees. In practice we may need to verify the homomorphism relationship between an arbitrary tree in a set of XPath trees and an input XPath tree, for example when filtering redundant queries in a large query set. Checking the trees one by one with the homomorphism algorithm is inefficient, because common prefixes and branches among the trees cause redundant computation. Although a method handling this was discussed in [4], it returns false negatives for some XPath expressions that do have a containment relationship, such as p = /a//*/b, q = /a/*//b, etc. In this paper, we propose an efficient automata-based method to check homomorphism from multiple trees to a single XPath tree. We also narrow the discrepancy between homomorphism and containment as much as possible. Our major contributions are:
1) We propose the fixed tree and the alterable tree to describe XPath trees, and define homomorphism based on them.
2) We define the XTHC machine, a kind of indexed incremental automaton with prefix-sharing over multiple trees, and our method yields optimal automata.
3) We propose an algorithm to check homomorphism from multiple trees to a single tree based on the XTHC machine.
4) The experimental results demonstrate both the practicability and the efficiency of our techniques.
The rest of this paper is organized as follows.
Section 2 gives some basic notations and definitions. Section 3 is the major part of our work, that is, how to construct XTHC machine and how to use XTHC to resolve the homomorphism problem. The last two sections present the experimental evaluation and conclusions, respectively.
2 Preliminaries
Each XPath expression has a corresponding XPath tree. The XPath tree given in [2] uses each node test in the XPath expression as a node in the tree, and classifies its edges into child-edges and descendant-edges according to the type of axis in the XPath expression. This description is straightforward and easy to understand, but difficult to extend: if there is any backward axis (parent-axis or ancestor-axis) in the XPath expression, this method is no longer applicable. We now give a different description of the XPath tree, in which the level information between two adjacent node tests is abstracted from the type of the axis between them and recorded at the corresponding node in the XPath tree. Our work is limited to XP{[ ],*,//} expressions only. Definition 1: For a given XP{[ ],*,//} expression q, we construct an XPath tree T. The root of T is independent of q. Every node test n in q is described by a non-root node v. The relationship between v and its parental node v' is denoted by L(v)=[a, b],
where a and b are the minimum and maximum numbers of levels between v and v' respectively. The relationship between nodes in tree T is given as: 1) If n is a root node test, i.e. /n or //n, there exists an edge in T between the node v that corresponds to n and the root r, edge(r, v), where r is the parental node of v. For /n, L(v)=[1, 1]; for //n, L(v)=[1, ∞]. 2) If n is not a root node test, there is an adjacent node test n' in q that satisfies n'/n, n'[n], n'//n or n'[.//n]; therefore, there exists an edge in T between v and v' (corresponding to n and n' respectively), where v' is the parental node of v. For n'/n or n'[n], L(v)=[1, 1]; for n'//n or n'[.//n], L(v)=[1, ∞]. Definition 2: Given an XPath tree T, let NODES(T) be the set of nodes in T, EDGES(T) the set of edges in T, and ROOT(T) the root node of T. If there exists v ∈ NODES(T) whose outdegree is greater than 1, or whose outdegree or indegree is 0, node v is called a key node of the XPath tree T. ∀ edge(x,y) ∈ EDGES(T), where x,y ∈ NODES(T), edge(x,y) implies x is the parental node of y. If nid is the unique identifier of node y and ln is the label of node y, we denote node y by nid[a,b], where [a,b] equals L(y). Informally, key nodes in an XPath tree are branching nodes (nodes with outdegree greater than 1), leaves, and the root. An XPath expression often contains wildcard location steps without predicates, which are represented as non-branching ‘*’ nodes, as in the expression /a/*//*/b. We can remove those wildcard nodes from the XPath tree for simplification, but have to revise the L(v) value of each non-wildcard node v that is a descendant of a removed wildcard node. Fig. 1(a) illustrates the two XPath trees of the expression /a/*//*/b before and after removing non-branching wildcard nodes, where L(b) is revised.
In the following context, all XPath trees are those trees from which the non-branching wildcard nodes are removed.
Fig. 1. (a) XPath tree /a/*//*/b; (b) XPath tree /a/*/b[.//*/c]//d
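The wildcard-removal step just described can be sketched as follows. This is an informal illustration, not the paper's implementation: the Node class and the simplify helper are hypothetical, and ∞ is encoded as math.inf.

```python
import math

INF = math.inf  # stands for the unbounded level bound "∞"

class Node:
    """XPath tree node: a label and the level interval L(v) = [a, b]."""
    def __init__(self, label, a, b, children=None):
        self.label, self.a, self.b = label, a, b
        self.children = children or []

def simplify(node):
    """Remove non-branching wildcard nodes bottom-up, folding each removed
    node's level interval into its single child's interval."""
    node.children = [simplify(c) for c in node.children]
    while node.label == '*' and len(node.children) == 1:
        child = node.children[0]
        child.a += node.a          # minimum levels add along the chain
        child.b += node.b          # INF + x stays INF
        node = child
    return node

# /a/*//*/b as in Fig. 1(a): a[1,1] -> *[1,1] -> *[1,INF] -> b[1,1]
tree = Node('/', 0, 0, [Node('a', 1, 1,
           [Node('*', 1, 1, [Node('*', 1, INF, [Node('b', 1, 1)])])])])
tree = simplify(tree)
leaf = tree.children[0].children[0]
print(leaf.label, leaf.a, leaf.b)   # b's interval becomes [3, INF]
```

After simplification only a[1,1] and b[3,∞] remain, matching the revised L(b) shown in Fig. 1(a).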
Definition 3: Given an XPath tree T, let CNODES(T) be the set of alterable nodes and FNODES(T) the set of fixed nodes, with NODES(T) = {ROOT(T)} ∪ CNODES(T) ∪ FNODES(T). ∀ n ∈ NODES(T) with n ≠ ROOT(T) and L(n) = [a,b]: if a=b, then n ∈ FNODES(T); if b= ∞, then n ∈ CNODES(T). When CNODES(T) is not empty, the XPath tree T is an alterable tree; otherwise it is a fixed tree. As an example, the XPath tree of the XPath expression /a/*/b[.//*/c]//d is shown in Fig. 1(b). The level interval between node x2 and its parental node is L(x2)=[2,2]; by Definition 3, node x2 is a fixed node. The level interval
between node x3 and its parental node is L(x3)=[2, ∞], and node x3 is an alterable node, so the corresponding XPath tree is an alterable tree. Definition 4: A function h: NODES(p) → NODES(q) is a homomorphism from XPath tree p to XPath tree q if: 1) h(ROOT(p)) = ROOT(q); 2) for each x ∈ NODES(p), LABEL(x)='*' or LABEL(x) = LABEL(h(x)); 3) for each edge(x,y) ∈ EDGES(p), where x,y ∈ NODES(p), L(x,y) ⊇ L(h(x),h(y)). Fig. 2 shows the homomorphism mapping h from XPath tree p to XPath tree q for the XPath expressions /a/*//b and /a[c]//*/*//b.
Fig. 2. Homomorphism mapping h: p → q
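Definition 4 admits a direct recursive check: map the roots to each other, then map every child of a p-node to some child of the corresponding q-node, requiring a label match (or '*') and interval containment L(x,y) ⊇ L(h(x),h(y)). A backtracking sketch with hypothetical helper names (not the paper's code); the trees below are the simplified versions of the Fig. 2 expressions:

```python
import math

INF = math.inf

class Node:
    """XPath tree node with label, level interval [a, b] and children."""
    def __init__(self, label, a=1, b=1, children=None):
        self.label, self.a, self.b = label, a, b
        self.children = children or []

def homomorphic(p, q):
    """Is there a homomorphism h from tree p to tree q (Definition 4)?"""
    def match(x, y):
        # condition 2: labels agree unless the p-node is a wildcard
        if x.label not in ('*', y.label):
            return False
        # condition 3: every p-edge maps to some q-edge whose interval
        # is contained in the p-edge's interval
        return all(any(cx.a <= cy.a and cy.b <= cx.b and match(cx, cy)
                       for cy in y.children)
                   for cx in x.children)
    return match(p, q)   # condition 1: roots map to each other

# p = /a/*//b  simplified to a[1,1] -> b[2,INF]
# q = /a[c]//*/*//b  simplified to a[1,1] -> {c[1,1], b[3,INF]}
p = Node('/', 0, 0, [Node('a', 1, 1, [Node('b', 2, INF)])])
q = Node('/', 0, 0,
         [Node('a', 1, 1, [Node('c', 1, 1), Node('b', 3, INF)])])
print(homomorphic(p, q))   # True: hence q ⊆ p
print(homomorphic(q, p))   # False: the [c] predicate has no image in p
```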
3 Homomorphism Resolution Based on XTHC Machine
3.1 Construction of the Basic XTHC Machine
We incrementally construct an NFA with prefix-sharing over the set of XPath trees P={p1,p2…pn}. Each node nid[a,b] in an XPath tree is mapped to an automata fragment in the NFA, and such a fragment has a unique start state and a unique end state. There are two cases when constructing the fragment for the node nid[a,b]: 1. When a=b, nid[a,b] is a fixed node, and the constructed automata fragment is shown in Fig. 3(a). The states s-1 and s+a-1 are the start and end states of the fragment, respectively. Since a represents the minimum number of levels between node nid[a,b] and its parental node, starting from state s-1 we construct in turn a-1 states along arcs labeled ‘*’, which are called extended states; we then construct state s+a-1 along the arc labeled ln from state s+a-2. Obviously, extended states exist in the automata fragment built from nid[a,b] when a>1.
Fig. 3. (a) The automata fragment corresponding to the fixed node nid[a,a]; (b) the automata fragment corresponding to the alterable node nid[a, ∞]
2. When b= ∞, nid[a,b] is an alterable node, and many kinds of automata fragments can be constructed; one example is shown in Fig. 3(b). As in case 1, we first construct a-1 extended states and the end state s+a-1, starting from state s-1. Since b= ∞, a self-looping arc labeled ‘*’ must be added at one or more of state s-1 and the following a-1 extended states. The chain consisting of the start state and the extended states is called the extended state-chain. Fig. 3(b) shows only one self-looping arc, at the last state of the extended state-chain. Obviously,
an automata fragment corresponding to an alterable node nid[a,b] (a>1) in an XPath tree p is optimal if and only if there is only one state in the fragment that has a self-looping arc, and this state is the last state along the extended state-chain. Definition 5: Suppose the NFA constructed from the set P of XPath trees is A, called the XTHC machine. We create the following two index tables for each state s in A: 1) LP(s): the list of leaf nodes. ∀p ∈ P, for each leaf node nl in p, if s is the last state constructed from nl, then nl ∈ LP(s). LP(s) is non-empty only when s is a leaf state. 2) LB(s): the list of branching nodes. ∀p ∈ P, for each branching node nb in p, if s is the last state constructed from nb, then nb ∈ LB(s). LB(s) is non-empty only when s is a branching state. Fig. 4(b) is the XTHC machine constructed from the XPath trees p1, p2, and p3 shown in Fig. 4(a); pi.x represents node x in XPath tree pi, and a state is denoted by a circle. An arc implies a state transition, where dashed lines represent transitions of descendant-axis type and solid lines represent transitions of child-axis type. A label on an arc is a node test. State S1 has an arc to itself since it has a transition of descendant-axis type.
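The per-node fragment construction can be sketched as follows, under the assumption that states are numbered consecutively and arcs are stored per state; the NFA class is illustrative, and prefix-sharing across multiple trees is omitted.

```python
import math

INF = math.inf

class NFA:
    """Toy NFA: arcs[s] is a list of (symbol, next_state); '*' is a wildcard arc."""
    def __init__(self):
        self.arcs = []
        self.start = self.new_state()

    def new_state(self):
        self.arcs.append([])
        return len(self.arcs) - 1

    def add_fragment(self, start, label, a, b):
        """Build the fragment for node nid[a,b]; return its end state."""
        s = start
        # a-1 extended states reached via '*' arcs (the minimum levels)
        for _ in range(a - 1):
            t = self.new_state()
            self.arcs[s].append(('*', t))
            s = t
        # alterable node (b = INF): a single self-loop at the last state
        # of the extended state-chain keeps the fragment optimal
        if b == INF:
            self.arcs[s].append(('*', s))
        end = self.new_state()
        self.arcs[s].append((label, end))  # the arc labeled ln
        return end

nfa = NFA()
end = nfa.add_fragment(nfa.start, 'b', 3, INF)   # alterable node b[3,INF]
print(len(nfa.arcs))       # 4 states: start, two extended, end
print(nfa.arcs[end - 1])   # self-loop plus the labeled arc to the end state
```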
Fig. 4. (a) The XPath tree set P; (b) the XTHC machine constructed from the XPath tree set P
Definition 6: A basic non-deterministic XTHC machine A is defined as A = (Qs, Σ, δ, qs0, F, B, Ss), where
• Qs is the set of NFA states;
• Σ is the set of input symbols;
• qs0 is the initial (or start) NFA state of A, i.e. the root state;
• δ is the set of state transition functions; it contains at least the NFA state transition function tforward: Qs × Σ → 2^Qs;
• F ⊆ Qs is the set of final states, which is also the set of leaf states;
• B ⊆ Qs is the set of branching states;
• ∀ qs ∈ Qs, we call qs an NFA state of A; LP(qs) and LB(qs) are the two index tables of qs (see Definition 5);
• Ss is the stack for state transitions; each stack frame of Ss is a subset of Qs.
3.2 Running an XTHC Machine
To resolve the homomorphism relationship using an XTHC machine, a depth-first traversal of the input XPath tree is required to generate SAX events. These events are used as input to drive the XTHC machine. Four types of events are generated during the depth-first traversal of the input XPath tree p: startXPathTree, startElement, endElement and endXPathTree. They are generated as follows: 1) send a startXPathTree event when entering the root of p; 2) send a startElement event when entering a non-root node of p; 3) send an endElement event when tracing back to a non-root node of p; 4) send an endXPathTree event when tracing back to the root of p. Since a and b are not always 1 in a node nid[a,b] of an XPath tree, more than one event is sent when entering or tracing back to node nid[a,b]: 1) the startElement event sequence sent when a=b is shown in Fig. 5(a); 2) the startElement event sequence sent when b= ∞ is shown in Fig. 5(b). In particular, some restrictions apply to a startElement(‘//’) event: 1) it occurs only when node nid[a,b] is an alterable node; 2) a state transition driven by this event occurs only at a state s in the extended state-chain corresponding to the alterable node, and there is a unique state transition: tforward(s, ‘//’) → s. Similarly, more than one endElement event is sent when tracing back to node nid[a,b] in the tree, as shown in Fig. 5(c) and 5(d).
Fig. 5. (a) The startElement (“SE” for short) event sequence of the fixed node nid[a,a]; (b) the startElement event sequence of the alterable node nid[a, ∞]; (c) the endElement (“EE” for short) event sequence of the fixed node nid[a,a]; (d) the endElement event sequence of the alterable node nid[a, ∞]
Fig. 6 shows the rules for processing SAX events in an XTHC machine. The homomorphism relationship between a tree pi in a set of XPath trees P={p1,p2,…,pn} and an input tree q can be resolved by running the XTHC machine. While the XTHC machine is running, ∀ p ∈ P, homomorphism information between each node v in p and the nodes in the input tree q is recorded. Let v ∈ p, and let a be the label of node u in the input XPath tree q. We define the following three operations to mark, deliver and reset information about the mapping in the XPath tree p:
Homomorphism Resolving of XPath Trees Based on Automata
827
1) mark(v, u): when the XTHC machine is running at a leaf state qs (qs ∈ F), ∀ v ∈ LP(qs), mark on v the information about the mapping from the leaf node v to the node u in the input XPath tree q;
2) deliver(v): when the machine traces back to a key state qs (qs ∈ F ∪ B), ∀ v ∈ LB(qs) ∪ LP(qs), if mapping information was marked on node v, deliver the mapping information of v to the nearest ancestor key node of v in the XPath tree;
3) reset(v): when the machine traces back to a key state qs (qs ∈ F ∪ B), ∀ v ∈ LB(qs) ∪ LP(qs), reset the mapping information on node v.

startXPathTree()
    push(Ss, {qs0}); other initialization

startElement(a)
    qsset = {};                      // current NFA state set
    u = getCurrentInputNode();
    for each qs in peek(Ss)
        merge tforward(qs, a) into qsset
    push(Ss, qsset);
    for each qs in qsset
        if (qs ∈ F)
            for each v in LP(qs)
                mark(v, u);

endElement(a)
    qsset = pop(Ss);
    for each qs in qsset
        if (qs ∈ B or qs ∈ F) {
            for each v in LB(qs) or LP(qs)
                if exist mapping of v {
                    deliver(v); reset(v);
                }
        }

endXPathTree()
    pop(Ss);
Fig. 6. The processing rules of SAX events in XTHC
The time complexity of the algorithm resolving homomorphism from one XPath tree p to another XPath tree q is O(|p||q|²) [2]. Therefore, the time complexity from each tree p in a set of XPath trees P={p1,p2,…,pn} to q is O(n|p||q|²) without prefix-sharing automata. With prefix-sharing automata, however, the time complexity is O(m|q|²), where m is the number of states in the NFA. When the XPath trees in P have common branches and prefixes, n|p| is much greater than m; therefore, resolving homomorphism from multiple XPath trees to one single XPath tree using prefix-sharing automata is much more efficient.
4 Experiments
An algorithm resolving homomorphism based on the XTHC machine (XHO) was implemented in Java. The experimental platform is Windows XP on a Pentium 4 CPU with a frequency of 1.6 GHz and 512 MB of memory. We compared several algorithms: the homomorphism algorithm (HO) [2], the complete algorithm in a canonical model (CM), the branch homomorphism algorithm (BHO) [4], and the proposed XHO algorithm. We checked the scope of each algorithm in resolving containment of XPath expressions (see Table 1, where T/F represents p containing/not containing q), and the running time of these algorithms (see Fig. 7). The experiments show that XHO is as capable as existing homomorphism algorithms. Furthermore, XHO supports containment calculation from multi-XPath expressions to
one single XPath expression. Although BHO also supports such calculation, it may give incorrect results in some cases, as shown in Table 1. BHO gives results that are rather different from the correct results given by CM. Compared to BHO, XHO gives a smaller discrepancy between containment and homomorphism.

Table 1. Some pairs of XPath trees for experiments and containment results

No    p                        q                                HO  BHO  XHO  CM
no.1  /a//*[.//c]//d           /a//b[c]//d                      T   T    T    T
no.2  /a/*/*/c                 /a/b[c]/e/c                      T   T    T    T
no.3  /a//b[*//c]/b/c          /a//b[*//c]/b[b/c]//c            T   T    T    T
no.4  /a//*/b                  /a/*//b                          T   F    T    T
no.5  /a/*[.//b]//c            /a//*/b/c                        F   F    T    T
no.6  /a[a//b[c/*//d]/b/c/d]   /a[a//b[c/*//d]/b[c//d]/b/c/d]   F   F    F    T
no.7  /a/*/*/*/c               /a//*/b//b/c                     F   F    F    F
no.8  /a//b[c]/d               /a/b[.//c]//d                    F   F    F    F

(Fig. 7 plots the running time, in milliseconds, of HO, BHO, XHO and CM on test cases no.1–no.8.)
Fig. 7. The experimental results for some homomorphism algorithms
5 Conclusion
This paper presents an algorithm that resolves containment between multiple XPath expressions and one single XPath expression through homomorphism. While keeping the calculation of multiple containment relationships efficient, we also reduce the discrepancy between containment and homomorphism. The algorithm correctly calculates containment for a special class of XPath expressions. Experiments showed that our algorithm is more complete than conventional homomorphism algorithms. Future research will address how to resolve homomorphism between one XPath tree and multiple XPath trees simultaneously.
References
[1] World Wide Web Consortium. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath, W3C Recommendation, November 1999.
[2] G. Miklau and D. Suciu. Containment and equivalence for a fragment of XPath. Journal of the ACM, 51(1):2-45, 2004.
[3] Yuguo Liao, Jianhua Feng, Yong Zhang and Lizhu Zhou. Hidden conditioned homomorphism for XPath fragment containment. In DASFAA 2006, LNCS 3882, 454-467, 2006.
[4] Sanghyun Yoo, Jin Hyun Son and Myoung Ho Kim. Maintaining homomorphism information of XPath patterns. IASTED-DBA 2005, 192-197, 2005.
An Efficient Overlay Multicast Routing Algorithm for Real-Time Multimedia Applications
Shan Jin, Yanyan Zhuang, Linfeng Liu, and Jiagao Wu
Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing 210096, China
{kingsoftseu,zhuangyanyan,liulf,jgwu}@seu.edu.cn
Abstract. Multicast services can be deployed either at the network layer or at the application layer. Implementations of application-level multicast often provide more sophisticated features, and can provide multicast services where IP multicast is not available. In this paper, we consider the degree- and delay-constrained routing problem in overlay multicast for real-time multimedia applications, and propose an efficient Distributed Tree Algorithm (DTA). With DTA, end hosts can trade off between minimizing end-to-end delay and reducing local resource consumption by adjusting the heuristic parameters, and then self-organize into a scalable and robust multicast tree dynamically. By adopting distributed and tree-first schemes, a newcomer can adapt to different situations flexibly. The algorithm terminates when the newcomer reaches a leaf node or joins the tree successfully. Simulation results show that the multicast tree has a low node rejection rate when appropriate values of the heuristic parameters are chosen. Keywords: overlay multicast, routing algorithm, heuristic parameter.
1 Introduction
Multicast is a basic communication service for many new network applications, such as real-time multimedia transmission. In practice, however, full deployment of IP multicast [1] has long been postponed in the Internet for both technical and economic reasons [2]. Researchers have questioned whether the network layer is the appropriate place to implement multicast functionality; therefore, overlay multicast [3] has been proposed as an alternative to IP multicast. Overlay multicast deploys multicast services on hosts instead of core routers. The advantage is that multicast services become easier to deploy, since there is no need to change the existing IP network infrastructure. From the architectural point of view, overlay multicast systems can be classified into host-based architectures (like ALMI [4] and HMTP [5]) and proxy-based architectures (like Overcast [6] and Scattercast [7]). Both architectures face problems of the same nature in overlay multicast routing. The overlay multicast routing problem in this paper is studied based on the host-based architecture, taking the common features of both architectures into consideration. Since overlay multicast routing performance is usually not as efficient as that of network-layer multicast, it is crucial to study degree- and delay-constrained overlay
G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 829–836, 2007. © Springer-Verlag Berlin Heidelberg 2007
multicast routing algorithms for real-time multimedia applications. In centralized algorithms [4, 8], a server, which is supposed to know the path latency between any pair of nodes in an overlay network, constructs the multicast tree according to an objective function. However, these algorithms do not consider the dynamic behavior of multicast members, and ignore the problems of algorithm complexity and single-point failure. In contrast, distributed algorithms that use local routing optimization offer great extensibility and dynamic flexibility. These algorithms can be classified into mesh-first [7, 9] and tree-first [5, 6] strategies. Studies show that none of the protocols above considers the strict delay constraint of real-time multimedia applications, and how multicast routing performance is affected by dynamic end hosts also lacks sufficient study [10]. We introduce a novel distributed overlay multicast routing algorithm named the Distributed Tree Algorithm (DTA). The algorithm adopts a tree-first strategy in order to enhance multicast routing performance effectively and save system maintenance cost. By adjusting appropriate heuristic parameters, DTA can improve multicast routing performance and reduce the node rejection rate considerably.
2 Problem Formulations and Design Objectives The overlay multicast network is a logical network built on top of the Internet unicast infrastructure. It can be depicted as a complete directed graph, G = (V, E), where V is the set of vertices and E = V × V is the set of edges. Each vertex in V represents a host. The directed edge from node i to node j in G represents a logical channel corresponding to a unicast path from host i to host j in the physical topology. The data delivery path will be a directed spanning tree T of G rooted at the source host, with the edges directed away from the root.
Definition 1: dmax(v) ∈ N: The out-degree constraint of host v in the overlay tree.
Definition 2: l(u, v) ∈ R+: The unicast latency from host u to host v.
Definition 3: delay(r, v) ∈ R+: The overlay latency from root r to host v. It is the sum of all the unicast latencies along the path from r to v in the spanning tree T.
We consider two optimization objectives: one seeks to minimize the maximum overlay latency in a multicast tree to reduce the session latency, taking the degree constraint at individual nodes into consideration; the other optimizes the bandwidth usage at each host to reduce the likelihood of bottleneck nodes and constructs a tree satisfying the constraint on the maximum overlay latency. dused(v) denotes the degree already used by node v, dres(v) = dmax(v) - dused(v) denotes the residual degree of v, S denotes the set of all hosts in the tree, and L denotes the upper bound of the session latency. The two objectives are then formulated as follows:
Problem 1 Minimum Maximum-Latency Degree-Bounded Directed Spanning Tree Problem (MMLDB): Given a complete directed graph G = (V, E), a degree constraint dmax(v) for each vertex v ∈ V, and a latency l(u, v) for each edge e(u, v) ∈ E, find a directed spanning tree T of G rooted at host r that minimizes the maximum delay(r, v), such that the degree constraint dused(v) ≤ dmax(v) is satisfied at each node.
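Definition 3 can be computed by walking parent links toward the root; a minimal sketch with hypothetical parent/latency tables (not the paper's implementation):

```python
def overlay_delay(parent, latency, r, v):
    """delay(r, v): the sum of the unicast latencies l(u, w) along the
    tree path from root r to host v, walked upward via parent links."""
    d = 0.0
    while v != r:
        u = parent[v]
        d += latency[(u, v)]
        v = u
    return d

# Example tree r -> b -> c with l(r, b) = 10 ms and l(b, c) = 5 ms
parent = {'b': 'r', 'c': 'b'}
latency = {('r', 'b'): 10.0, ('b', 'c'): 5.0}
print(overlay_delay(parent, latency, 'r', 'c'))   # 15.0
```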
Problem 2 Residual-Balanced Degree and Latency-Bounded Directed Spanning Tree Problem (RBDLB): Given a complete directed graph G = (V, E), a degree bound dmax(v) for each vertex v ∈ V, and a latency l(u, v) for each edge e(u, v) ∈ E, find a directed spanning tree T of G rooted at host r that maximizes the minimum dres(v), satisfying both the degree constraint dused(v) ≤ dmax(v) at each node and the latency constraint of the session, max_{v∈S} delay(r, v) ≤ L.
Both the MMLDB and RBDLB problems are NP-complete [8]. Our design of DTA makes a trade-off between minimizing end-to-end delay and reducing local resource consumption. As a result, both of the desired objectives are met.
3 Design of DTA Each node only needs to maintain a local status set, {dmax(v), dused(v), delay(r, v), Children(v), parent(v), l(parent(v), v)}. Children(v) denotes the set of v’s children and parent(v) denotes v’s parent, l(parent(v), v) is the unicast latency from v’s parent to v itself, which can be acquired by an end-to-end measuring tool. 3.1 Creating a Multicast Group Each multicast group has a Rendezvous Point (RP) from which new members can learn about membership of the group so as to bootstrap themselves. The construction of a multicast group is as follows: 1) The host that sends out data acts as the creator, as well as the root, of the tree T once a multicast session commences. It sends to RP a CREATEREQUEST message. 2) When receiving the CREATEREQUEST message, RP adds the QoS parameters to its group list, then sends out a CREATEACK message to the corresponding requesting host. 3.2 Joining a Multicast Group A newcomer v sends to RP a QUERYREQUEST message, containing the multicast group ID. On receiving the request message, RP checks its root list for the specific item, say r, of that group, then sends QUERYACK message containing r’s IP address and the corresponding QoS parameters to v. Then v sets r as its tentative parent pt and asks r for the list of r’s children. Next, v queries r and its children for their latencies and bandwidth information to constitute its potential parents set PP(pt) defined in Definition 4 (see below). From all nodes in PP(pt), v picks a local optimal parent according to function (1): Local Optimal Parent Selection (LOPS) Function. If the local optimal parent is not the tentative parent pt, v replaces the old pt with this parent, and repeats this process until a local optimal parent, u for instance, perseveres in its role as the tentative parent. Then v makes u its parent by sending JOINREQUEST message to u. 
On the contrary, if there is no potential parent of v, i.e., PP(pt) is empty, v selects a local optimal grandparent from pt’s children and sets this grandparent as a new tentative parent according to function (2): the Local Optimal Grandparent Selection (LOGS) Function, then repeats the joining process.
Definition 4 PP(pt): Newcomer v’s potential parents set. PP(pt) = {n | dused(n) < dmax(n) ∧ delay(r, n) + l(n, v) ≤ L, n ∈ {pt} ∪ Children(pt)}.
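The set PP(pt) of Definition 4 translates directly into code; the dictionaries below are hypothetical stand-ins for the local status sets that members exchange:

```python
def potential_parents(pt, v, children, d_used, d_max, delay_r, latency, L):
    """PP(pt): nodes in {pt} ∪ Children(pt) with residual degree whose
    adoption of newcomer v keeps the session latency within the bound L."""
    return [n for n in [pt] + children[pt]
            if d_used[n] < d_max[n] and delay_r[n] + latency[(n, 'v')] <= L]

children = {'g': ['i', 'j']}
d_used = {'g': 2, 'i': 4, 'j': 1}
d_max = {'g': 4, 'i': 4, 'j': 4}          # i has no residual degree
delay_r = {'g': 20.0, 'i': 30.0, 'j': 30.0}
latency = {('g', 'v'): 15.0, ('i', 'v'): 5.0, ('j', 'v'): 80.0}
print(potential_parents('g', 'v', children, d_used, d_max,
                        delay_r, latency, 100.0))
# only ['g']: i is degree-saturated, j would exceed the latency bound
```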
Considering the two situations in which PP(pt) is non-empty or empty, DTA applies the LOPS-Function or the LOGS-Function mentioned above, respectively. The two functions are given as follows:
Fig. 1. An Illustration of LOPS-Function and LOGS-Function
Local Optimal Parent Selection (LOPS) Function:

    Pfunc(pt') = min_{m∈PP(pt)} Pfunc(m).    (1)

Pfunc(m) reflects the efficiency of selecting a node from PP(pt) as a candidate for the newcomer’s parent. It can be expressed as follows:

    Pfunc(m) = ρ · dused(m)/dmax(m) + (1 − ρ) · l(m, v)/max_{n∈PP(pt)} l(n, v),  ρ ∈ [0, 1].
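Function (1) and the Pfunc formula can be sketched as follows; the data structures are hypothetical, not the authors' implementation:

```python
def pfunc(m, v, pp, d_used, d_max, latency, rho):
    """Pfunc(m) = rho * d_used(m)/d_max(m)
                + (1 - rho) * l(m, v) / max_{n in PP(pt)} l(n, v)."""
    l_max = max(latency[(n, v)] for n in pp)
    return (rho * d_used[m] / d_max[m]
            + (1 - rho) * latency[(m, v)] / l_max)

def lops(pp, v, d_used, d_max, latency, rho):
    """Function (1): the local optimal parent minimizes Pfunc over PP(pt)."""
    return min(pp, key=lambda m: pfunc(m, v, pp, d_used, d_max, latency, rho))

pp = ['g', 'i']
d_used, d_max = {'g': 1, 'i': 3}, {'g': 4, 'i': 4}
latency = {('g', 'v'): 10.0, ('i', 'v'): 20.0}
print(lops(pp, 'v', d_used, d_max, latency, rho=0.5))   # 'g' wins on both
                                                        # criteria
```

Setting rho toward 1 weights residual degree (resource balance); toward 0 it weights closeness to the newcomer.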
As shown in Fig. 1, v is a newcomer and g is its current tentative parent pt. PP(pt) = {g} ∪ Children(g) = {g, i, j, k}. v is now enquiring the degrees and latencies of all members in PP(pt) to calculate the values of the corresponding Pfunc(). The term l(m, v)/max_{n∈PP(pt)} l(n, v) reflects how close node m is to node v: a smaller value denotes a shorter distance from a node in PP(pt) to v. The term dused(m)/dmax(m) reflects how many end-system resources a node in PP(pt) has used by now: a smaller value denotes a smaller percentage of the resources that have been used. The weight ρ is a heuristic factor. We can trade off between minimizing end-to-end delay and reducing local resource consumption by adjusting the value of ρ within [0, 1].
Local Optimal Grandparent Selection (LOGS) Function:

    Gfunc(pt') = max_{q∈Children(pt)} Gfunc(q).    (2)
Gfunc(q) is a kind of forecast of the joining action, and it can be expressed as follows:

    Gfunc(q) = (max_{m∈Children(pt)} l(pt, m) / l(pt, q)) · (dused(q) / max_{n∈Children(pt)} dmax(n)) · θ^t(q),  θ ∈ (0, 1), t(q) = 0, 1, 2, 3, …
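Function (2) can be sketched the same way; the table t that tracks the selection counts t(q) is an assumption about the bookkeeping, and the data structures are hypothetical:

```python
def gfunc(q, pt, children, d_used, d_max, latency, theta, t):
    """Gfunc(q) = (max_{m in C(pt)} l(pt, m) / l(pt, q))
               * (d_used(q) / max_{n in C(pt)} d_max(n)) * theta**t[q]."""
    cs = children[pt]
    l_max = max(latency[(pt, m)] for m in cs)
    d_cap = max(d_max[n] for n in cs)
    return (l_max / latency[(pt, q)]) * (d_used[q] / d_cap) * theta ** t[q]

def logs(pt, children, d_used, d_max, latency, theta, t):
    """Function (2): pick the grandparent maximizing Gfunc, then bump t(q)
    so repeated selections are damped by the theta**t(q) factor."""
    best = max(children[pt],
               key=lambda q: gfunc(q, pt, children, d_used, d_max,
                                   latency, theta, t))
    t[best] += 1
    return best

children = {'g': ['i', 'j']}
latency = {('g', 'i'): 10.0, ('g', 'j'): 20.0}
d_used, d_max = {'i': 2, 'j': 2}, {'i': 4, 'j': 4}
t = {'i': 0, 'j': 0}
print(logs('g', children, d_used, d_max, latency, theta=0.5, t=t))  # 'i'
print(t)   # i's count is bumped, damping its next Gfunc value
```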
In Fig. 1, suppose g is v’s current tentative parent pt. A bigger value of (max_{m∈Children(g)} l(g, m) / l(g, q)) · (dused(q) / max_{n∈Children(g)} dmax(n)) denotes a relatively small latency from node q to its parent g, and that q itself has a relatively large number of children. As a result, the tree’s radius can be decreased and the node rejection rate will fall. θ^t(q) is a balancing factor with its value within (0, 1). t(q) records the number of times that node q (q ∈ Children(pt)) has been selected as a local optimal grandparent. A smaller value of θ makes a newcomer more likely to select a different node as its local optimal grandparent than the last time. Multiplying by θ for t(q) times, thus obtaining θ^t(q), prevents one single node from being selected as the local optimal grandparent all the time, which would deteriorate the overall performance of the multicast tree. If none of g, i, j and k meets the degree and latency constraints, i.e., the set PP(g) = Φ, v will use the LOGS-Function to evaluate i, j, k in Children(g) in order to decide which one will be the new tentative parent. To summarize, a newcomer tries to find a “good” parent by searching a certain part of the tree. It stops when it joins the tree successfully or reaches a leaf node. The detailed algorithm is shown as follows:
Joining Algorithm

v finds root r by querying RP, let pt = r;
while PP(pt) == Φ
    Gfunc(pt') = 0;
    foreach q ∈ Children(pt)
        if Gfunc(q) > Gfunc(pt')
            Gfunc(pt') = Gfunc(q); pt' = q;
    if Gfunc(pt') == 0
        v returns JOINFAIL message to RP;
    pt = pt';
while true
    Pfunc(pt') = +∞;
    foreach m ∈ PP(pt)
        if Pfunc(m) < Pfunc(pt')
            Pfunc(pt') = Pfunc(m); pt' = m;
    if pt' == pt
        v establishes a unicast tunnel to pt;
        v returns JOINSUCCEED message to RP;
    pt = pt';
3.3 State Maintenance and Leaving a Multicast Group
State in DTA is refreshed by periodic message exchanges between neighbors. Every child sends a REFRESH message to its parent, and the parent replies by sending a KEEPALIVE message back. Each member calculates the round-trip time (rtt) from these two messages. If a member can no longer reach its parent, or the rtt no longer meets the latency constraint, the joining algorithm is triggered.
When a member leaves a group, it sends a LEAVEREQUEST message to its parent and children, from whom it receives LEAVEACK messages. Its parent simply deletes the leaving member from its children list, but the children of the leaving member must find new parents. A child looks for a new parent with the help of the joining algorithm. If the root is leaving, it notifies RP and its children by sending a CANCELGROUP message to them. RP then deletes the group information of this root from its group list. The other members in the tree pass the message on to their neighbors, and then all of them leave the group.
4 Performance Evaluation

4.1 Performance Metrics and Simulation Setup
We have carried out simulations to evaluate the performance of DTA with respect to the node rejection rate, defined as follows:

Definition 5 (Node Rejection Rate Rr). Rr = n / N, where n denotes the number of nodes rejected by DTA and N denotes the total number of nodes.

Our simulations are based on a network that consists of 1000 routers. The network has a random flat topology generated using the Waxman model [11]. The communication delay between neighbor routers, assigned in the range [1 ms, 50 ms], is directly proportional to their geometric distance. Additional nodes are generated as regular hosts and are randomly attached to these routers. The node degree is uniformly distributed between 4 and 8. Each node experiences 100 rounds of simulation and the average value is recorded as the experimental result.

4.2 Simulation Results and Analyses
Fig. 2 and Fig. 3 show the node rejection rate versus the session delay constraint. There are 50 regular hosts that want to join the multicast group one by one in Fig. 2 and 200 in Fig. 3. We set the value of θ to 0.2, 0.5, and 0.8 respectively and adjust ρ's value among 0.0, 0.3, 0.7 and 1.0 for each value of θ. Each curve in a chart denotes a different value combination of ρ and θ, written as (ρ, θ). From all these charts, we can see that the rejection rate decreases as the session delay constraint increases. Furthermore, different combinations of (ρ, θ) also have an impact on system performance. Firstly, DTA approximates the minimizing-local-resource-consumption strategy (RBDLB) when ρ is closer to 1, whereas it approximates the minimizing-end-to-end-delay strategy (MMLDB) when ρ approaches 0. The node rejection rate cannot be decreased remarkably if only one of the two strategies is considered, i.e., if ρ equals 0 or 1. Therefore, an appropriate value must be set to trade off between the two strategies. Secondly, if θ has a larger value, a newcomer is more likely to select some specific members as its local optimal grandparent, which can overload the local area, and the rejection rate will increase as a result; if θ is smaller, a newcomer is likely to select its local optimal grandparent among all its potential grandparents with relatively equal probability, but some of the preferable ones may be
An Efficient Overlay Multicast Routing Algorithm
835
ignored and the rejection rate could also increase as a result. From the six charts we can see that DTA always performs best when (ρ, θ) is set to (0.3, 0.5). This result illustrates that our optimization strategy in DTA is much closer to the end-to-end delay optimization strategy (ρ = 0.0). By comparing Fig. 2 and Fig. 3, it is clear that the optimization objectives are better achieved when the number of multicast group members is larger. Therefore, DTA is more suitable for large-scale overlay multicast applications.
Fig. 2. Node rejection rate of DTA vs. Session delay upper bound. Group size = 50.
Fig. 3. Node rejection rate of DTA vs. Session delay upper bound. Group size = 200.
Fig. 4. Node rejection rate of DTA vs. Group size. Session delay upper bound = 600ms, 1400ms, 2000ms.
Fig. 4 shows the node rejection rate versus the multicast group size when θ is set to 0.5. We can see that the combination (0.3, 0.5) again yields the best performance. If the session delay constraint is set to too low a value (the chart on the left) or too high a value (the chart on the right), then changes in the value of ρ have less impact on the performance. But when we set the session delay constraint to 1400 ms (the chart in the middle), a better choice of (ρ, θ) has a notable effect.
5 Conclusion

We study tree-first overlay multicast routing and propose an efficient distributed routing algorithm named DTA. Our algorithm seeks to make a trade-off between minimizing end-to-end delay and minimizing local resource consumption. Simulations show that the performance of DTA under node degree and end-to-end delay constraints is quite satisfactory when (ρ, θ) is properly selected. Further algorithm improvements and a discussion of the best value of (ρ, θ) are left for future work.

Acknowledgments. This research is supported by the Natural Science Foundation of China (Grant No. 90604003).
References
1. Deering, S.E., Cheriton, D.R.: Multicast Routing in Datagram Internetworks and Extended LANs. In ACM Transactions on Computer Systems, Vol. 8. (1990) 85–110
2. Diot, C., Levine, B.N., Lyles, B., Kassem, H., Balensiefen, D.: Deployment Issues for the IP Multicast Service and Architecture. In IEEE Network, Vol. 14. (2000) 78–88
3. El-Sayed, A., Roca, V., Mathy, L.: A Survey of Proposals for an Alternative Group Communication Service. In IEEE Network, Vol. 17. (2003) 46–51
4. Pendarakis, D., Shi, S., Verma, D., Waldvogel, M.: ALMI: An Application Level Multicast Infrastructure. In Proceedings of 3rd USENIX Symposium on Internet Technologies and Systems, San Francisco (2001) 49–60
5. Zhang, B., Jamin, S., Zhang, L.: Host Multicast: A Framework for Delivering Multicast to End Users. In Proceedings of the IEEE INFOCOM, New York (2002) 1366–1375
6. Jannotti, J., Gifford, D.K., Johnson, K.L., Kaashoek, M.F., O'Toole, J.W., Jr.: Overcast: Reliable Multicasting with an Overlay Network. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation, San Diego (2000) 197–212
7. Chawathe, Y.: Scattercast: An Adaptable Broadcast Distribution Framework. In Multimedia Systems, Vol. 9. (2003) 104–118
8. Shi, S.Y., Turner, J.S.: Multicast Routing and Bandwidth Dimensioning in Overlay Networks. In IEEE Journal on Selected Areas in Communications, Vol. 20. (2002) 1444–1455
9. Chu, Y.H., Rao, S.G., Zhang, H.: A Case for End System Multicast. In Proceedings of the ACM SIGMETRICS, Santa Clara (2000) 1–12
10. Wu, J.G., Yang, Y.Y., Chen, Y.X., Ye, X.G.: Delay Constraint Supported Overlay Multicast Routing Protocol. In Journal on Communications, Vol. 26. (2005) 13–20
11. Zegura, E.W., Calvert, K.L., Bhattacharjee, S.: How to Model an Internetwork. In Proceedings of the IEEE INFOCOM, San Francisco (1996) 594–602
Novel NonGaussianity Measure Based BSS Algorithm for Dependent Signals

Fasong Wang¹, Hongwei Li¹, and Rui Li²

¹ School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, P.R. China
[email protected], [email protected]
² School of Sciences, Henan University of Technology, Zhengzhou 450052, P.R. China
[email protected]
Abstract. The purpose of this paper is to develop novel Blind Source Separation (BSS) algorithms that are able to separate dependent source signals from their linear mixtures. Most of the proposed algorithms for solving the BSS problem rely on an independence, or at least uncorrelatedness, assumption on the source signals. Here, we show that maximization of a nonGaussianity (NG) measure can separate statistically dependent source signals, where the novel NG measure is given by the Hall Euclidean distance. The proposed separation algorithm reduces to the famous FastICA algorithm. Simulation results show that the proposed separation algorithm is able to separate the dependent signals and yields ideal performance.
1 Introduction
Blind source separation (BSS) is typically based on the assumption that the observed signals are linear superpositions of underlying hidden source signals. When the source signals are mutually independent, BSS can be solved using the so-called independent component analysis (ICA) method, which has attracted considerable attention in the signal processing and neural network fields; several efficient algorithms have been proposed (see for an overview, e.g., [1-2]). Despite the success of standard ICA in many applications, the basic assumptions of ICA may not hold in some real-world situations, especially in biomedical signal processing and image processing, and therefore standard ICA cannot give the expected results. In fact, by definition, standard ICA algorithms are not able to estimate statistically dependent original sources. Some authors [3] have proposed approaches which take advantage of the nonstationarity of such sources in order to achieve better performance than the classical methods, but they still require independence or uncorrelatedness. Some extended data models have also been developed to relax the independence assumption in the standard ICA model, such as multidimensional ICA [4], independent subspace analysis [5] and the subband decomposition ICA (SDICA) model [6]. G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 837–844, 2007. © Springer-Verlag Berlin Heidelberg 2007
838
F. Wang, H. Li, and R. Li
As mentioned in [7], in the dependent-sources situation we cannot resort to minimizing the mutual information (MI); on the other hand, we can maximize NG to recover the dependent sources. In this paper, based on a generalization of the central limit theorem (CLT) to special dependent variables, we tackle the generalized ICA model, the dependent component analysis problem, by maximizing an NG measure. The NG measure of an arbitrary standardized probability density is defined by the L2 norm, in the L2 space, of the difference between the given density and the standard normal density. This paper is organized as follows: Section 2 briefly introduces the dependent BSS model and the NG measure; in Section 3, we describe the novel NG measure using the Hall distance in detail; in Section 4, we use the NG measure to derive the proposed separation algorithm and show that it is equivalent to the FastICA algorithm; simulations illustrating the good performance of the proposed method are given in Section 5; finally, Section 6 concludes the paper.
2 Dependent BSS Model and NG Measure

2.1 Dependent BSS Model
For our purposes, the problem of BSS can be formulated as x(t) = As(t) + n(t), where s(t) = [s1(t), s2(t), ..., sn(t)]^T is the unknown n-dimensional source vector. The matrix A ∈ R^{m×n} is an unknown full-column-rank mixing matrix with m ≥ n. The observed mixtures x(t) = [x1(t), x2(t), ..., xm(t)]^T are called sensor outputs, and n(t) = [n1(t), n2(t), ..., nm(t)]^T is a vector of additive noise, assumed to be zero in this paper. The task of BSS is to estimate the mixing matrix A, or its pseudo-inverse separating (unmixing) matrix W = A^+, in order to estimate the original source signals s(t), given only a finite number of observation data. Two indeterminacies cannot be resolved in BSS without some a priori knowledge: the scaling and permutation ambiguities. Thus, (Â, ŝ) and (A, s) are said to be related by a waveform-preserving relation. A key factor in BSS is the assumption about the statistical properties of the sources, such as statistical independence; that is the reason why BSS is often confused with ICA. In this paper, we exploit weaker conditions for the separation of sources, assuming that they have statistically dependent properties. Throughout this paper the following assumptions are made unless stated otherwise: 1) the mixing matrix A is of full column rank; 2) the source signals are statistically dependent with zero mean; 3) the additive noise n(t) = 0. So the BSS model of this paper is simplified as

x(t) = As(t). (1)

2.2 NG Measure
In ICA applications, NG measures are used based on the following fundamental idea: the outputs of a linear mixing process that preserves variances have
Novel NonGaussianity Measure Based BSS Algorithm for Dependent Signals
839
higher entropies than the inputs [7]. This general statement can be precisely expressed in mathematical terms as the CLT, which tells us that a linear mixture of N independent signals with finite variances becomes asymptotically Gaussian (or more nearly Gaussian). Since the CLT is not valid for arbitrary sets of dependent variables, we must be aware that we may not always recover the original sources using maximum NG criteria. [7] gives a very special condition on sources for which the linear combinations of dependent signals are not more Gaussian than the components, and therefore the maximum NG criterion fails; fortunately this is not the case in most real-world scenarios. The NG measure of an arbitrary standardized PDF is defined by the L2 norm, in the L2 space, of the difference between the given density and the normal density. This can be interpreted as the square distance, with respect to some measure, between the two functions in the space of square-integrable functions. Let x be a random variable with PDF f(x). We attempt to compute f's departure from Gaussianity by comparing it with its normal Gaussian counterpart $g(x) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{x^2}{2})$. If one regards f and g as elements of the function space of PDFs, the deviation of f from normality may be evaluated by an L2 metric defined with some positive measure on the real line, μ(x):

$D = \int_{-\infty}^{\infty} (f(x) - g(x))^2 w(x)\,dx,$ (2)

where w(x) is given by w(x) = dμ(x)/dx. This definition corresponds to the integrated square difference between the functions f and g, measured with the weight function w(x). Although we leave w(x) unspecified at this point, we assume that we choose w such that the integral converges for most reasonable densities. We expand the function f(x) in the integral (2) in terms of Hermite polynomials, a set of orthogonal functions on the entire real line with respect to an appropriate Gaussian weight. Following the notation in [8], two distinct families of Hermite polynomials, for n = 0, 1, 2, ..., are generated by the derivatives of the Gaussian PDF,

$He_n(x) = (-1)^n e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2/2}, \qquad H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2},$ (3)

and $H_n(x) = \sqrt{2^n}\, He_n(\sqrt{2}\,x)$. Following standard practice, we refer to the first set as Chebyshev-Hermite polynomials, and the second as Hermite polynomials. The first few polynomials are: H0(x) = 1, H1(x) = 2x, H2(x) = 4x^2 - 2, H3(x) = 8x^3 - 12x, H4(x) = 16x^4 - 48x^2 + 12. Chebyshev-Hermite and Hermite polynomials satisfy the orthogonality relationships

$\int_{-\infty}^{\infty} He_n(x)\,He_m(x)\,g(x)\,dx = \delta_{nm}\, n!,$ (4)

$\int_{-\infty}^{\infty} H_n(x)\,H_m(x)\,g^2(x)\,dx = \delta_{nm}\, 2^{n-1} n! / \sqrt{\pi},$ (5)
with respect to the weight functions g(x) for the Chebyshev-Hermite polynomials He_n(x), and g^2(x) for the Hermite polynomials H_n(x). We will give a nonGaussianity index based on the squared functional distance [9]. The index is defined by a different form of orthogonal series expansion for an arbitrary density f(x), written in terms of either Chebyshev-Hermite or Hermite polynomials.
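The orthogonality relations (4) and (5) can be checked numerically. The sketch below uses NumPy's Gauss-Hermite quadrature (nodes and weights for the weight e^{-u^2}); the substitution x = √2·u handles the Chebyshev-Hermite case, and the function names are ours.

```python
# Numerical check of the orthogonality relations (4) and (5).
import numpy as np
from numpy.polynomial.hermite import hermgauss, hermval   # physicists' H_n
from numpy.polynomial.hermite_e import hermeval           # probabilists' He_n
from math import sqrt, pi

u, w = hermgauss(40)   # quadrature for integrals of p(u) * exp(-u^2)

def inner_he(n, m):
    # ∫ He_n(x) He_m(x) g(x) dx, evaluated via the substitution x = sqrt(2) u
    e = np.eye(max(n, m) + 1)
    f = hermeval(sqrt(2) * u, e[n]) * hermeval(sqrt(2) * u, e[m])
    return float(np.sum(w * f)) / sqrt(pi)

def inner_h(n, m):
    # ∫ H_n(x) H_m(x) g(x)^2 dx, with g(x)^2 = exp(-x^2) / (2*pi)
    e = np.eye(max(n, m) + 1)
    f = hermval(u, e[n]) * hermval(u, e[m])
    return float(np.sum(w * f)) / (2 * pi)
```

For instance, inner_he(3, 3) returns 3! = 6 and inner_h(2, 2) returns 2·2!/√π, matching (4) and (5).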
3 Hall Euclidean Distance Based Novel NG Measure
From the point of view of the L2 metric space, perhaps the most natural weight is the uniform function w(x) = 1, which treats every point on the entire real line democratically. Hall [9] proposed such an index based on the L2 Euclidean distance, L2(1), from the standard normal, called the Hall distance:

$D_H^2 = \int_{-\infty}^{\infty} (g(x) - f(x))^2\,dx.$ (6)
If f is a square-integrable function (g certainly is, since g^2 is proportional to a Gaussian with variance 1/√2), this integral is convergent. In such a case, we may expand f in terms of Hermite polynomials as follows:

$f(x) = g(x) \sum_{n=0}^{\infty} \frac{b_n}{\sqrt{\kappa_n}} H_n(x),$ (7)

where $b_n = \frac{1}{\sqrt{\kappa_n}} \int_{-\infty}^{\infty} f(x) H_n(x) g(x)\,dx$ and $\kappa_n = 2^{n-1} n! / \sqrt{\pi}$ is the normalization constant. This form of Hermite expansion is sometimes called the Gauss-Hermite series. Unlike the Gram-Charlier series, the polynomials used here are the Hermite polynomials (not Chebyshev-Hermite) and the Gaussian weight appears in both the decomposition and the reconstruction formulae. The Gauss-Hermite coefficients can also be considered as expectation values,

$b_n = E\left[\frac{1}{\sqrt{\kappa_n}} H_n(X) g(X)\right] \approx \frac{1}{\sqrt{\kappa_n}\,T} \sum_{t=1}^{T} H_n(x_t) g(x_t),$ (8)

and thus can be estimated from the samples x_t. In particular, one expects that these coefficients are robust against outliers, as large values of |x_t| are attenuated by the tails of the Gaussian. If we substitute the series representation (7) into the L2 metric formula (6), and use the orthogonality conditions (5), we see that the Hall distance is

$D_H^2 = (b_0 - \sqrt{\kappa_0})^2 + \sum_{n=1}^{\infty} b_n^2.$ (9)
Again, the L2 distance is expressed as the sum of squared Hermite coefficients, with a zeroth order correction because the origin is taken to be the standard normal. In general, we do not know a priori the first few terms of the sum as we did in the Gram-Charlier case, because the coefficients bn are no longer directly linked to moments. However, this is only a minor computational disadvantage considering the benefit of the robustness gained by this formulation.
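The estimator (8) and the truncated form of (9) can be sketched in a few lines of NumPy; the truncation order and the function name are our illustrative choices, not the paper's.

```python
# Estimate the Gauss-Hermite coefficients b_n from samples (eq. 8) and the
# Hall distance (eq. 9), truncated at a finite order.
import numpy as np
from numpy.polynomial.hermite import hermval
from math import factorial, sqrt, pi

def hall_distance(samples, order=6):
    """Truncated D_H^2 = (b_0 - sqrt(k_0))^2 + sum_{n=1..order} b_n^2
    for standardized samples, with k_n = 2^(n-1) n! / sqrt(pi)."""
    x = np.asarray(samples, dtype=float)
    g = np.exp(-x ** 2 / 2) / sqrt(2 * pi)        # standard Gaussian pdf
    d2 = 0.0
    for n in range(order + 1):
        kn = 2.0 ** (n - 1) * factorial(n) / sqrt(pi)
        cn = np.zeros(n + 1)
        cn[n] = 1.0                                # coefficients selecting H_n
        bn = np.mean(hermval(x, cn) * g) / sqrt(kn)
        d2 += (bn - sqrt(kn)) ** 2 if n == 0 else bn ** 2
    return d2
```

For a large standard-normal sample the distance is close to zero (b_0 ≈ √κ_0 and b_n ≈ 0 for n ≥ 1), while a standardized non-Gaussian sample, e.g. uniform, yields a clearly positive value.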
4 Proposed Algorithm of the Dependent Sources

4.1 Preprocessing
In order to apply the maximum NG method to dependent source separation, we must restrict the separating matrix W so that the separated signals y_i have unit variance. A simple way to do this is first to apply a spatial whitening filter to the mixtures x, and then to parameterize the new separation matrix as one composed of unit-norm rows. We implement this spatial filter using the Karhunen-Loeve transformation (KLT) [10], reaching a new set of spatially uncorrelated data, z = VΛ^{-1/2}V^T x, where V is a matrix of eigenvectors of the covariance matrix R_xx = E[xx^T] and Λ is a diagonal matrix containing the eigenvalues of R_xx, which are assumed to be non-zero. Now, if we define y = Uz, the new separation matrix U must have unit-norm rows, which follows from the assumption of unit variance of the variables y_i (R_yy = E[yy^T] = UU^T). The "real" (original) separation matrix W can then be calculated from y = Uz as follows:

W = UVΛ^{-1/2}V^T. (10)
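The KLT whitening step, z = VΛ^{-1/2}V^T x, can be sketched with NumPy; the function name and interface are illustrative.

```python
# Whitening sketch: z = V Λ^{-1/2} V^T x, so that the sample covariance of z
# is the identity. x is an (m, T) array of zero-mean observations.
import numpy as np

def whiten(x):
    rxx = x @ x.T / x.shape[1]                  # sample covariance R_xx
    lam, v = np.linalg.eigh(rxx)                # R_xx = V Λ V^T (Λ > 0 assumed)
    q = v @ np.diag(1.0 / np.sqrt(lam)) @ v.T   # Q = V Λ^{-1/2} V^T
    return q @ x, q
```

Once U is estimated on the whitened data z, the overall separating matrix follows (10) as W = U Q.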
Note that source estimates may be permuted or sign-changed versions of the sources (the scale ambiguity disappears since the sources are assumed to have unit variance).

4.2 The Main Algorithm
As mentioned in [7], in the dependent-sources situation we cannot resort to minimizing the MI; on the other hand, we can maximize NG to recover the dependent sources. So we view BSS algorithms as de-Gaussianization methods based on other definitions of the L2 measurement, such as the Hall distance (6). For the reasons stated above, we choose the Euclidean metric L2(1) to define a non-Gaussianity index. Note that each component x_i is a standardized random variable, E[x(t)] = 0 and E[x(t)x^T(t)] = I. A natural extension of the L2 measurement is then given by the sum of the L2(1) NG indices of the x_i across all n dimensions,

$D_H^2(\mathbf{x}) = \sum_{i=1}^{n} D_H^2(x_i),$ (11)

where $D_H^2(x_i) = (b_0(x_i) - \sqrt{\kappa_0})^2 + \sum_{k=1}^{\infty} b_k^2(x_i)$. In particular, if we truncate the sum by taking only the 0-th order terms for each x_i, we can show

$D_H^2(\mathbf{x}) \approx \sum_{i=1}^{n} (b_0(x_i) - \sqrt{\kappa_0})^2 \approx \frac{1}{\kappa_0} \sum_{i=1}^{n} \left(E[g(x_i)] - E[g(z)]\right)^2.$ (12)
Here, x_i is a standardized random variable with an unknown density f_i, z is a standard Gaussian random variable and g is the standard Gaussian PDF.
This truncated form of the multidimensional L2(1) distance is equivalent to an ICA contrast due to Hyvärinen, and the fixed-point iteration algorithm called FastICA was introduced in [2]. The main procedure of the basic form of the one-unit FastICA algorithm can then be summarized as follows:

step 1. Choose an initial (e.g. random) weight vector u.
step 2. Let u⁺ = E{z g(uᵀz)} − E{g′(uᵀz)}u.
step 3. Let u = u⁺/‖u⁺‖.
step 4. If not converged, go back to step 2.
The one-unit algorithm estimates just one of the components. To estimate several components, we need to run the one-unit FastICA algorithm using several units (e.g. neurons) with weight vectors u1, ..., un. To prevent different vectors from converging to the same maxima we must decorrelate the outputs u1ᵀz, ..., unᵀz after every iteration. A simple way of achieving decorrelation is a deflation scheme based on a Gram-Schmidt-like decorrelation. This means that we estimate the components one by one. When we have estimated p components, i.e. p vectors u1, ..., up, we run the one-unit fixed-point algorithm for u_{p+1}, and after every iteration step subtract from u_{p+1} the "projections" (u_{p+1}ᵀuj)uj, j = 1, ..., p, of the previously estimated p vectors, and then renormalize u_{p+1}:

step 1. Let u_{p+1} = u_{p+1} − Σ_{j=1}^{p} (u_{p+1}ᵀuj)uj.
step 2. Let u_{p+1} = u_{p+1} / √(u_{p+1}ᵀu_{p+1}).
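A minimal NumPy sketch of the one-unit iteration and the deflation step above, assuming z is already whitened with shape (n, T). The paper writes the update nonlinearity as the Gaussian pdf g with a dropped prime; here we take the classical FastICA "gauss" nonlinearity g(y) = y·exp(−y²/2), which is proportional to the derivative of the Gaussian pdf. That reading, and all names, are our assumptions.

```python
import numpy as np

def gauss_nl(y):
    # FastICA 'gauss' nonlinearity and its derivative
    e = np.exp(-y ** 2 / 2)
    return y * e, (1.0 - y ** 2) * e

def one_unit(z, u0, n_iter=200, tol=1e-10):
    """One-unit fixed-point iteration; returns a unit-norm weight vector."""
    u = u0 / np.linalg.norm(u0)
    for _ in range(n_iter):
        y = u @ z
        gy, dgy = gauss_nl(y)
        u_new = (z * gy).mean(axis=1) - dgy.mean() * u   # u+ update
        u_new /= np.linalg.norm(u_new)
        if abs(abs(u_new @ u) - 1.0) < tol:              # converged up to sign
            return u_new
        u = u_new
    return u

def deflate(u, found):
    # Gram-Schmidt-like deflation against previously estimated vectors
    for uj in found:
        u = u - (u @ uj) * uj
    return u / np.linalg.norm(u)
```

On a whitened two-source mixture, the first unit converges to one source direction and deflating a second vector against it yields the orthogonal complement.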
5 Simulation Results
In order to confirm the validity of the proposed Hall distance based BSS algorithm, simulations using Matlab were carried out with four source signals which have different waveforms. The input signals were generated by mixing the four simulated sources with a 4 × 4 random mixing matrix whose elements were distributed uniformly. The sources and mixtures are displayed in Figs. 1(a) and (b), respectively. The source signals' correlation values are shown in Table 1.

Table 1. The Correlation Values Between Source Signals

          source 1  source 2  source 3  source 4
source 1  1         0.6027    0.3369    0.4113
source 2  0.6027    1         0.4375    0.4074
source 3  0.3369    0.4375    1         0.5376
source 4  0.4113    0.4074    0.5376    1
So the sources are not i.i.d. signals; nevertheless, the proposed NG measurement based BSS algorithm can separate the desired signals properly.
Next, for comparison we ran the mixed signals through different BSS algorithms: the JADE algorithm [11], the SOBI algorithm [1], the TDSEP algorithm [12] and the AMUSE algorithm [1]. Under the same convergence conditions, the proposed algorithm, which we call NG-FastICA, was compared with them; performance was measured using a performance index called the cross-talking error index E, defined as [1]

$E = \sum_{i=1}^{N}\left(\sum_{j=1}^{N}\frac{|p_{ij}|}{\max_k |p_{ik}|} - 1\right) + \sum_{j=1}^{N}\left(\sum_{i=1}^{N}\frac{|p_{ij}|}{\max_k |p_{kj}|} - 1\right),$
where p_{ij} are the entries of the performance matrix P = WA. The separation results for the four different sources are shown in Table 2 for the various BSS algorithms (averaged over 100 Monte Carlo simulations).

Table 2. The results of the separation for various BSS algorithms

Algorithm  JADE    SOBI    TDSEP   AMUSE   NG-FastICA
E          0.4118  0.7844  0.4052  0.6685  0.3028
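The cross-talking error index above (the Amari index) is straightforward to compute from the performance matrix P = WA; a small sketch, where the function name is ours:

```python
import numpy as np

def crosstalk_error(p):
    """Amari-style cross-talking error; 0 iff P is a scaled permutation."""
    p = np.abs(np.asarray(p, dtype=float))
    rows = (p / p.max(axis=1, keepdims=True)).sum(axis=1) - 1.0  # row term
    cols = (p / p.max(axis=0, keepdims=True)).sum(axis=0) - 1.0  # column term
    return float(rows.sum() + cols.sum())
```

The index vanishes exactly when each row and each column of |P| has a single dominant entry and nothing else, i.e. when the separation is perfect up to scaling and permutation.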
The waveforms of the source signals, the mixed signals and the separated signals are shown in Fig. 1(c) (the first 512 observations are given).

Fig. 1. The source signals (a), observed signals (b) and experiment results (c) showing the separation of correlated sources using the proposed NG-FastICA Algorithm
6 Conclusion

In this paper, we developed a novel Blind Source Separation (BSS) algorithm that is able to separate dependent source signals from their linear mixtures.
Most of the proposed algorithms for solving the BSS problem rely on an independence, or at least uncorrelatedness, assumption on the source signals; this is the independent component analysis approach. Here, we showed that maximization of the nonGaussianity (NG) measure can separate statistically dependent source signals, with the novel NG measure given by the Hall Euclidean distance. The proposed separation algorithm reduces to the famous FastICA algorithm. Simulation results show that the proposed separation algorithm is able to separate dependent signals and yields ideal performance.
Acknowledgment This work is partially supported by National Natural Science Foundation of China(Grant No.60672049) and the Science Foundation of Henan University of Technology under Grant No.06XJC032.
References
1. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley & Sons, New York (2002)
2. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, New York (2001)
3. Hyvarinen, A.: Blind source separation by nonstationarity of variance: a cumulant-based approach. IEEE Trans. Neural Networks 12(6) (2001) 1471-1474
4. Cardoso, J.F.: Multidimensional independent component analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'98), Seattle, WA (1998) 1941-1944
5. Hyvarinen, A., Hoyer, P.O.: Emergence of phases and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation 12(5) (2000) 1705-1720
6. Zhang, K., Chan, L.W.: An adaptive method for subband decomposition ICA. Neural Computation 18(1) (2006) 191-223
7. Caiafa, C.F., Proto, A.N.: Separation of statistically dependent sources using an L2-distance non-Gaussianity measure. Signal Processing 86(11) (2006) 3404-3420
8. Yokoo, T., Knight, B.W., Sirovich, L.: L2 De-gaussianization and independent component analysis. In Proc. 4th Int. Sym. on ICA and BSS (ICA2003), Japan (2003) 757-762
9. Hall, P.: Polynomial Projection Pursuit. Annals of Statistics 17 (1989) 589-605
10. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Second ed., John Wiley & Sons, New York (2000)
11. Cardoso, J.F.: High-order contrasts for independent component analysis. Neural Computation 11(1) (1999) 157-192
12. Ziehe, A., Muller, K.R.: TDSEP - an efficient algorithm for blind separation using time structure. In Proc. ICANN'98 (1998) 675-680
HiBO: Mining Web's Favorites

Sofia Stamou, Lefteris Kozanidis, Paraskevi Tzekou, Nikos Zotos, and Dimitris Cristodoulakis

Computer Engineering and Informatics Department, Patras University, 26500 Patras, Greece
{stamou,kozanid,tzekou,zotosn,dxri}@ceid.upatras.gr
Abstract. HiBO is a bookmark management system that incorporates a number of Web mining techniques and offers new ways to search, browse, organize and share Web data. One of the most challenging features that HiBO incorporates is the automated hierarchical structuring of bookmarks that are shared across users. One way to go about organizing shared files is to use one of the existing collaborative filtering techniques, identify the common patterns in the user preferences and organize bookmarked files accordingly. However, collaborative filtering suffers from some intrinsic limitations, the most critical of which is the complexity of the collaborative filtering algorithms, which inevitably leads to latency in updating the user profiles. In this paper, we address the dynamic maintenance of personalized views of shared files from a bookmark management system perspective and we study ways of assisting Web users in sharing their information space with the community. To evaluate the contribution of HiBO, we applied our Web mining techniques to manage a large pool of bookmarked pages that are shared across community members. Results demonstrate that HiBO has a significant potential in assisting users to organize and manage their shared data across web-based social networks.

Keywords: Hierarchical Structures, Web Data Management, Bookmarks, System Architecture, Personalization.
1 Introduction

Millions of people today access the plentiful Web content to locate information that is of interest to them. However, as the Web grows larger there is an increasing need to help users keep track of the interesting Web pages that they have visited so that they can get back to them later. One way to address this need is by maintaining personalized local URL repositories, widely known as bookmarks [15]. Bookmarks, also called favorites in Internet Explorer, enable users to store the location (address) of a Web page so that they can revisit it in the future without the need of remembering the page's exact address. People use bookmarks for various reasons [1]: some bookmark URLs for fast access, others bookmark URLs with long names that they find hard to remember, yet others bookmark their favorite Web pages in order to share them with a community of users with similar interests. G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 845–856, 2007. © Springer-Verlag Berlin Heidelberg 2007
846
S. Stamou et al.
As the number of pages available on the Web keeps growing, so does the number of pages stored in personal Web repositories. Moreover, although users frequently visit their bookmarked URLs, they rarely delete them, which practically results in users keeping stale links in their personal Web repositories. As a consequence, people tend to maintain large, and possibly overwhelming, bookmark collections [16]. However, keeping a flat list of bookmark URLs is insufficient for tracking down previously visited pages, especially when dealing with a long list of favorites. As the size of personal repositories increases, the need for organizing and managing bookmarks becomes prevalent. To assist users in organizing their bookmark URLs in a meaningful and useful manner, there exist quite a few bookmark management systems offering a variety of functionalities to their users. These functionalities enable users to store their bookmarks into folders and subfolders named for the sites they are found in or for the information they contain, as well as to organize the folders in a tree-like structure. Moreover, commercial bookmark management tools, e.g. BlinkPro [2], Bookmark Tracker [3], Check and Get [4], iKeepBookmarks [5], provide users with a broad range of advanced features like detection of duplicate bookmarks and/or dead links; importing, exporting and synchronizing bookmarks across different Web browsers (Mozilla, Internet Explorer, Opera, Netscape); updating bookmarks; and so forth. In this paper, we present HiBO, an intelligent system that automatically organizes bookmarks into a hierarchical structure. HiBO is a powerful bookmark management system that exploits a multitude of Web mining techniques and offers a wide range of advanced services. Most importantly, HiBO is a non-commercial research project for managing the proliferating data in people's personal Web repositories without any user effort.
The main difference between HiBO and the other available bookmark management systems (cf. [11], [14], [15]) is that HiBO uses a built-in subject hierarchy for automatically organizing bookmarks within both the users' local and shared Web repositories. The only input that our approach requires is a hierarchy of topics that one would like to use and a list of bookmark URLs that one would like to organize into these topics. Through the exploitation of the hierarchy, HiBO delivers personalized views of the shared files and eventually assists Web users in sharing their information space with the community. The remainder of the paper is organized as follows: we begin our discussion with the description of HiBO's architecture. In Section 3, we give a detailed description of the functionalities and services that our bookmark management system offers. Experimental results are presented in Section 4. We finally review related work and conclude the paper in Section 6.
2 Overview of HiBO Architecture

HiBO evolved in the framework of a large research project that aimed at the automatic construction of Web directories through the use of subject hierarchies. The subject hierarchy that HiBO uses contains a total of 475 topics organized into 14 top-level topics, borrowed from the top categories of the Open Directory Project (ODP) [6]. At a high level, the way in which HiBO organizes bookmarks proceeds as follows: firstly, HiBO downloads all the Web pages that have been bookmarked by a user
HiBO: Mining Web’s Favorites
847
and processes them one by one in order to identify the important terms inside every page. The important terms of a page are linked together, formulating a lexical chain. Then, our system uses the subject hierarchy and the lexical chains to compute a suitable topic to assign to every page. Finally, HiBO sorts the Web pages organized into topics in terms of their relevance to the underlying topics. More specifically, given a URL (bookmark), HiBO performs a sequence of tasks as follows: (i) download the URL and parse the HTML page, (ii) segment the textual content of the page into shingles and extract the page's thematic words using the lexical chaining technique [8], (iii) map thematic words to the hierarchy's concepts and traverse the hierarchy's matching nodes upwards until reaching one or more topic nodes, (iv) compute a relevance score of the page to each of the matching topics, (v) index the URL in the topic of the greatest relevance score. Figure 1 illustrates HiBO's architecture.
Fig. 1. Overview of HiBO architecture and functionality
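The five-step pipeline above can be sketched end to end as follows. This is a toy illustration under our own assumptions, not HiBO’s code: the shingling parameters and the vocabulary lookup standing in for lexical chaining are invented for the example.

```python
import re

def shingles(text, k=3):
    # (ii) split the page text into overlapping k-word shingles
    words = re.findall(r"[a-z]+", text.lower())
    return [tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))]

def thematic_words(text, vocabulary):
    # toy stand-in for lexical chaining: keep shingle words found in a
    # WordNet-like vocabulary
    return {w for sh in shingles(text) for w in sh if w in vocabulary}

def categorize(text, vocabulary, topic_terms):
    # (iii)-(v): map thematic words to topics, score each matching topic by
    # the fraction of thematic words it subsumes, and pick the best topic
    chain = thematic_words(text, vocabulary)
    scores = {t: len(chain & terms) / len(chain)
              for t, terms in topic_terms.items() if chain & terms}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A page whose chain mostly matches one topic’s descendant terms ends up indexed under that topic, mirroring step (v).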
In particular, after downloading and segmenting a Web page into shingles, HiBO generates a lexical chain for the page as follows: it selects a set of candidate terms from the page and for each candidate term it finds an appropriate chain, relying on the type of links that are used in WordNet [7] for connecting the candidate term to the other terms that are already stored in existing lexical chains. If such a chain is found, HiBO inserts the term in the chain and updates the latter accordingly. Lexical chains are then scored in terms of their elements’ depth and similarity in WordNet, and their elements are mapped to the hierarchy’s nodes. For each of the hierarchy’s matching nodes, HiBO follows their hypernymy links until reaching a top level topic in which to categorize the Web page. Finally, HiBO sorts the Web pages categorized in each topic in terms of both the pages’ conceptual similarity to one another and their relevance to the underlying topic. In estimating the pages’ conceptual similarity, HiBO compares the elements in a page’s lexical chain to the elements in the lexical chains of the other pages in the same topic, based on the assumption that the more elements the chains of
848
S. Stamou et al.
two pages have in common, the more correlated the pages are to each other. On the other hand, in computing the pages’ relevance to the hierarchy’s topics, HiBO relies on the pages’ lexical chains scores and the fraction of the chains’ elements that match a given topic in the hierarchy. Based on this general and open architecture, HiBO explores a variety of Web mining techniques and provides users with a number of advanced functionalities that are presented below.
3 HiBO Functionalities

Organizing Bookmarks: Besides the conventional way to organize bookmarks into a hierarchy of user-defined folders and subfolders, HiBO also incorporates a built-in subject hierarchy and a classification module, which automatically assigns every bookmarked page to a suitable topic in the hierarchy. HiBO’s classification module is activated by the user and helps the latter structure her bookmarks in a meaningful yet manageable structure, instead of simply keeping a flat list of favorite URLs. The subject hierarchy upon which HiBO currently operates is the one introduced in the work of [19]. Nevertheless, HiBO’s architecture is flexible enough to incorporate any hierarchy of topics that one would like to use. For automatically classifying bookmarks into the hierarchy’s topics, HiBO adopts the TODE classification technique, reported in [20]. At a very high level, the TODE classification scheme proceeds as follows: First, it processes the bookmarked pages one by one, identifies the most important terms inside every page and links them together, creating “lexical chains” [8]. Thereafter, it maps the lexical elements in every page’s chain to the hierarchy’s concepts and, if a matching is found, it traverses the hierarchy’s nodes upwards until it reaches a top level topic. To account for chain elements matching multiple hierarchy topics, TODE computes for every page a Relatedness Score (RScore) to each of the matching topics. RScore indicates the expressiveness of each of the hierarchy’s topics in describing the bookmarked pages’ contents. Formally, the relatedness score of a page pi (represented by the lexical chain Ci) to the hierarchy’s topic Tk is determined by the fraction of words in the page’s chain that are descendants (i.e. specializations) of Tk, and is given by:

RScoreK(pi) = |thematic words in pi matching TK| / |thematic words in pi| .    (1)
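With a page’s thematic words and a topic’s descendant terms represented as sets, Equation (1) reduces to a set intersection. The sketch below is our own illustration (the set-based data layout is an assumption, not HiBO’s internal representation):

```python
def rscore(thematic_words, topic_descendants):
    # Eq. (1): fraction of the page's thematic words that are descendants
    # (specializations) of the topic in the hierarchy
    if not thematic_words:
        return 0.0
    matching = thematic_words & topic_descendants
    return len(matching) / len(thematic_words)
```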
In the end, HiBO employs the topical category for which a bookmark has the highest of all its RScores to describe that page’s thematic content. By enabling the bookmarks’ automatic organization into a built-in hierarchical navigable structure, HiBO assists the user, who may be overwhelmed by the amount of her favorite pages, to organize and manage them instantly. Hierarchically organized bookmarks are stored locally on the user’s site for future reference. Moreover, HiBO supports personalized bookmark organization by enabling the user to define the set of topics in which bookmarks will be organized. These topics can be either a subset of the hierarchy’s topics or any other topic that the user decides. In case the user adds a new topic category in HiBO, she also needs to indicate a topic in HiBO’s built-in hierarchy with which the newly inserted topic correlates. Through
the HiBO interface, the user can view the topics available in HiBO as well as the number of bookmarks in each topic. The user can navigate through the hierarchical tree to locate bookmarks related to specific topics. In the case of shared bookmarks across a user community, HiBO supports personalized bookmark management by providing different views across users or user groups. Personalized views allow the user to decide on the classification scheme in which her shared bookmarks will be displayed. For instance, a user might choose to view the bookmarks she shares with a Web community organized in her self-selected categories or, alternatively, organized in the system’s built-in subject hierarchy. Optionally, a user might decide to view her shared bookmarks organized in the categories defined by another member of the community, whom she trusts. To enable personalized views on shared bookmarks, HiBO’s classification module re-assigns user favorites to the categories preferred by the user (self, community or system defined) following the categorization process described above. Additionally, HiBO enables bookmark organization by their file types. Searching Bookmarks: HiBO incorporates a powerful search mechanism that allows users to explore bookmark collections. The queries that HiBO supports are of the following types: topic-specific search, site/domain search, temporal search and keyword search. Similarly to querying a search engine for finding information on the Web, querying HiBO for locating information within one’s Web favorites enables users to issue queries and retrieve bookmark URLs that are relevant to the respective queries. Upon keyword-based search, the user submits a natural language query and the system’s search mechanism looks for bookmarked pages that contain any of the user-typed keywords, simply by employing traditional IR string-matching techniques.
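The keyword search just described amounts to plain string matching over the bookmarked pages’ text. A minimal sketch follows; the dictionary layout mapping URLs to page text is our assumption for illustration:

```python
def keyword_search(query, bookmarks):
    # return URLs of bookmarked pages whose text contains any user-typed
    # keyword (simple substring matching, as in traditional IR)
    keywords = query.lower().split()
    return [url for url, text in bookmarks.items()
            if any(k in text.lower() for k in keywords)]
```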
Additionally, HiBO incorporates a query refinement module introduced in the work of [12] and provides information seekers with alternative query formulations. Alternative query wordings are determined based on the semantic similarity that they exhibit to the user-selected keywords in the WordNet hierarchy. Refined queries are visualized in a graphical representation, as illustrated in Figure 2, and allow the user to pick any of the system-suggested terms either for reformulating a query that returns few or no relevant pages, or for crystallizing an under-specified information need.
Fig. 2. A refined query graph example
Moreover, HiBO supports topic-specific searches by allowing users to select the topical category (e.g. folder) out of which they wish to retrieve search results. Topic-specific searches greatly resemble the process of querying particular categories in
Web Directories, in the sense that the user first selects, among the topics offered in the HiBO hierarchy, the one that is of interest to her, and thereafter issues and executes the query against the index of the selected topic. Search results can be ranked according to the query-bookmark similarity values combined with any of the measures described in the following paragraph. If the user selects multiple ranking measures, then results are ranked by the product of their values. Conversely, if the user does not pick a particular ranking measure, results are ranked by the semantic similarity between the query keywords (either organic, i.e. user typed, or refined, i.e. system suggested) and the terms appearing in the bookmark pages that match the respective query. Ranking Bookmarks: HiBO provides several options for sorting the bookmarks listed in each of the hierarchy’s topics as well as for sorting bookmarks that are retrieved in response to a user query. For ranking bookmark URLs that are retrieved in response to some query q, HiBO relies on the semantic similarity between the query itself and the bookmark pages that contain any of the query terms. To measure the semantic similarity between the terms in a query and the terms in the pages that match the given query, we use the similarity measure presented in [18], which is established on the hypothesis that the more information two concepts share in common, the more similar they are. The information shared by two concepts is indicated by the information content of their most specific common subsumer. Formally, the semantic similarity between words w1 and w2, linked in WordNet via a relation r, is given by:

simr(w1, w2) = -log P( mscs(w1, w2) ) .    (2)
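Equation (2) can be rendered as a toy computation; the ancestor lists and frequency counts below are invented stand-ins for WordNet and its corpus statistics, and are only meant to show the shape of the calculation:

```python
import math

def resnik_sim(w1, w2, ancestors, freq, total):
    # Eq. (2): similarity = -log P(mscs), where mscs is the most specific
    # common subsumer of w1 and w2 and P(c) its corpus probability
    # (ancestor lists are assumed ordered from most to least specific)
    common = [c for c in ancestors[w1] if c in set(ancestors[w2])]
    if not common:
        return 0.0
    mscs = common[0]
    return -math.log(freq[mscs] / total)
```

The rarer the shared subsumer (the lower P(mscs)), the higher the similarity score, matching the information-content intuition.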
The measure of the most specific common subsumer (mscs) depends on: (i) the length of the shortest path from the root to the most specific common subsumer of w1 and w2 and (ii) the density of concepts on this path. Based on the semantic similarity values between the query terms and the terms in a page, we compute the average Query-Page similarity (QPsim) as:

QPsim(q(t), P(t)) = ( Σ_{p=1..|P(t)|} sim(q(t), P(t)) ) / |P(t)| .    (3)
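Given any term similarity function (Eq. 2, say), Equation (3) suggests a simple ranking loop. The sketch below and its data layout are our own illustration, not HiBO’s implementation:

```python
def rank_by_qpsim(query_terms, pages, sim):
    # Eq. (3): for each page, average the similarity between query terms and
    # the page terms with nonzero similarity to them, then sort descending
    scored = []
    for url, terms in pages.items():
        sims = [sim(q, t) for q in query_terms for t in terms if sim(q, t) > 0]
        qpsim = sum(sims) / len(sims) if sims else 0.0
        scored.append((qpsim, url))
    scored.sort(reverse=True)
    return [url for _, url in scored]
```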
where q (t) denotes the terms in a query and P (t) denotes the terms in P that have some degree of similarity to the query terms. The greater the similarity value between the terms in a bookmark page and the terms in a query, the higher the ranking that the page will be given for that query. On the other side of the spectrum, for ordering bookmarks in the hierarchy’s topics, the default ranking that HiBO uses is the DirectoryRank (DR) metric [13], which determines the bookmarks’ importance to particular topics as a combination of two factors: the bookmarks’ relevance to their assigned topics and the semantic correlation that the bookmarks in the same topic exhibit to each other. In the DR scheme, a page’s importance with respect to some topic is perceived as the amount of information that the page communicates about the topic. More precisely, to compute DR with respect to some topic T, we first compute the degree of the pages’ relatedness to topic
T. Formally, the relatedness score of a page p (represented by a set of thematic terms1) to a hierarchy’s topic T is defined as the fraction of the page’s thematic words that are specializations of the concept describing T in the HiBO hierarchy, as given by Equation (1). The semantic correlation between pages p1 and p2 is determined by the degree of overlap between their thematic words, i.e. the common thematic words in p1 and p2, as given by:

Sim(p1, p2) = 2 · |common words in p1 and p2| / ( |words in p1| + |words in p2| ) .    (4)
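Equation (4) is the Dice coefficient over the two pages’ thematic word sets; a minimal sketch (set inputs are our assumption):

```python
def page_sim(words1, words2):
    # Eq. (4): Dice overlap of two pages' thematic word sets
    w1, w2 = set(words1), set(words2)
    if not w1 and not w2:
        return 0.0
    return 2 * len(w1 & w2) / (len(w1) + len(w2))
```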
DR defines the importance of a page in a topic to be the sum of its topic relatedness score and its overall correlation to the fraction of pages with which it correlates in the given topic. Formally, consider that page pi is indexed in topic Tk with some RScore k(i) and let p1, p2, …, pn be the pages in Tk with which pi semantically correlates, with scores of Sim(p1, pi), Sim(p2, pi), …, Sim(pn, pi), respectively. Then the DR of pi is given by:

DR_Tk(pi) = RScore_k(i) + [ Sim(p1, pi) + Sim(p2, pi) + ... + Sim(pn, pi) ] / n .    (5)
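Equation (5) combines the two ingredients directly; a one-function sketch assuming the RScore and the pairwise similarities have already been computed:

```python
def directory_rank(rscore_k, sims):
    # Eq. (5): DR = RScore plus the mean similarity to the n pages in the
    # topic with which the page semantically correlates
    if not sims:
        return rscore_k
    return rscore_k + sum(sims) / len(sims)
```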
where n corresponds to the total number of pages in topic Tk with which pi semantically correlates. Moreover, HiBO offers personalized bookmark sorting options such as the ordering of pages by their bookmark date or by their last update, as well as the ordering of bookmarks in terms of their popularity, where popularity is determined by the frequency with which a user or group of users sharing files (re)visit bookmarks. Sharing Bookmarks: Besides offering bookmark management services to individuals, HiBO constitutes a social bookmark network, as it allows community members to share their Web favorites. In this perspective, HiBO operates as a bookmark recommendation system, since it not only gathers and distributes individually collected URLs but also organizes and processes them in a multi-faceted way. In particular, besides offering personalized views of shared bookmarks (cf. Organizing Bookmarks paragraph), HiBO enables users to annotate their preferred Web data, share their annotations with other members of the network and comment on others’ annotations. To assist Web users in exploiting the knowledge accumulated in the bookmarks of others, HiBO goes beyond traditional collaborative filtering techniques and applies a multitude of Web mining techniques that exploit the hierarchical structure of the shared bookmarks. Such Web mining techniques range from the automatic classification of bookmark pages into a shared topical hierarchy, to the structuring of shared files according to their links and content similarity. Shared bookmarks’ dynamic categorization is achieved through the utilization of the TODE categorization scheme, whereas bookmarks’ structuring is supported by the different ranking algorithms that HiBO incorporates.
Additionally, HiBO provides recommendation services to its users, as it examines common patterns in the bookmarks of different community members and suggests interesting sites to users who might not have realized that they share common interests with others. HiBO communicates its recommendations in the form of
1 The thematic terms in a page p are the lexical elements that formulate the lexical chain of p.
highlighted URLs that are associated with one’s favorites, which are either stored in the system’s hierarchy or retrieved in response to some query. Keeping Bookmarks Fresh: Based on the observation that users rarely refresh their personal Web repositories, we equipped HiBO with a powerful update mechanism, which aims at keeping the bookmark index fresh. By fresh we mean that the index does not contain obsolete links among one’s bookmarks, and that it reflects the current content of bookmarked pages. The update mechanism that HiBO uses performs a dual task: on the one hand it records the users’ clickthrough data on their bookmarks, and on the other it submits periodic requests to a built-in crawler for re-downloading the content of the bookmarked URLs. In case the system identifies bookmarks that have not been accessed for a long time, it posts a request to the user asking if she still wants to keep those bookmarks in her collection and/or if she still wants to share those bookmarks with other community members. Upon the user’s negative answer, the system deletes those rarely visited URLs from the bookmark index and updates the latter accordingly, i.e. it re-orders pages etc. Similarly, if the system detects invalid, broken or obsolete URLs within a user’s personal repository, it issues a notification to the user, who decides what to do with those links (either delete them, expunge them from her shared files, or keep them anyway). Furthermore, if the system detects a significant change in the current content of pages that had been bookmarked by a user some time ago, it alerts the latter that her bookmarked URLs do not reflect the current content of their respective pages. It is then up to the user to decide whether she wants to keep the old or the new content of a bookmarked page.
For content change detection, HiBO relies on the semantic similarity module discussed above, and uses a number of heuristics for deciding whether a page has significantly changed and the user therefore needs to be notified. Although HiBO’s update mechanism operates on a single user’s site, it indirectly impacts the rest of the community members, in the sense that changes in one’s personal Web repository will be reflected in her shared files. Note that the update mechanism that HiBO embodies is optional: the user might decide not to activate it and therefore not be disturbed by the issued update alerts and notifications.
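One plausible heuristic in this spirit can be sketched as follows. The overlap measure and the threshold value are our own assumptions for illustration; the paper does not disclose HiBO’s actual change-detection rules:

```python
def content_changed(old_words, new_words, threshold=0.5):
    # flag a bookmark as significantly changed when the Dice overlap between
    # the old and freshly crawled thematic words falls below the threshold
    old_w, new_w = set(old_words), set(new_words)
    if not old_w and not new_w:
        return False
    dice = 2 * len(old_w & new_w) / (len(old_w) + len(new_w))
    return dice < threshold
```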
4 Experimental Setup

To evaluate HiBO’s effectiveness in managing and organizing Web favorites, we launched a fully functional version of our bookmark management system and contacted 25 postgraduate students from our school, asking them to donate their bookmarks. Donating bookmarks requires that users register to the system by providing a valid e-mail address; they then receive a personal code, which is used in all their transactions with the system. Upon receipt of the code, users obtain full rights on their personal bookmarks and can also indicate the HiBO community with which they wish to share their preferred URLs. In the experiments reported here, all our 25 users formulated a single Web community sharing bookmarks. When users donate bookmarks, we use their agents to determine which browser and platform they are using in order to parse the files accordingly. We also use an SQL database server at the backend of the system, where we store all the information handled by HiBO, i.e. users and
user groups, URLs, bookmarks’ structure at the user site, the subject hierarchy, time stamps, clickthrough data, queries, etc. In our experiments, we used a total set of 3,299 bookmarks donated by our subjects and we evaluated HiBO’s performance in automatically categorizing bookmarks in the system’s hierarchy, by comparing its classification accuracy to the accuracy of a Bayesian classifier and a Support Vector Machine (SVM) classifier. We also investigated the effectiveness of HiBO’s ranking mechanisms in offering personalized rankings. Table 1 summarizes some statistics on our experimental dataset.

Table 1. Statistics on the experimental dataset

# of bookmark URLs: 3,299
# of users: 25
# of topics considered: 86
# of queries: 48
Avg. # of bookmarks per user: 131.96
Avg. # of shared bookmarks per user: 58
Avg. # of topics per user: 21
Avg. # of shared topics: 9.4
Avg. # of queries per user: 7.5
Avg. # of visited pages per query: 5.8
Avg. # of useful pages per query: 3.5
Avg. # of terms per refined query: 3.8
To evaluate HiBO’s efficiency in categorizing bookmarks to the hierarchy’s topics, we picked a random set of 1,350 pages from our experimental data that span 18 topics in the Open Directory that are also among our hierarchy’s topics, and we applied our categorization scheme. Obtained results were compared to the results delivered by both the SVM and the Bayesian classifier, which we trained with 90% of the same dataset. Classification results are reported in Table 2, where we can see clearly that HiBO’s classifier significantly outperforms both the Bayesian and SVM classifiers, reaching a 90.70% overall classification accuracy. In Table 3, we illustrate the different ranking measures of HiBO, using the results of both browsing and searching for spam. For comparison, we also present the pages that Google considers “important” to the query spam. Although Google uses a number of non-disclosed factors for computing the importance of a page, with PageRank [17] being at the core, we assume that a combination of content and link analysis is employed. Obtained results demonstrate the differences between the two HiBO rankings examined. In particular, the rankings delivered by DR sort bookmark pages in terms of their content importance to the underlying topic, i.e. Spam. As we can see from the reported data, our DR ranking ranks highly pages of practical interest compared to the pages retrieved from Google, which are general sites that mainly provide definitions of spam. On the other hand, the similarity ranking orders the bookmarked pages that are retrieved in response to the query spam in terms of their content’s semantic closeness to the semantics of the query. As such, the results retrieved by HiBO contain pages whose contents, even if the pages are not categorized in the topic Spam, exhibit substantial semantic similarity to the issued query. Recall that our experiments
were conducted on a set of bookmarks that are shared across our subjects, and as such the reported results are influenced by our users’ interests. This is exemplified by the appearance of Spam Filter for Outlook, Block Referrer Spam and Spam Fixer in the top ten results of the DR and Similarity rankings respectively; sites that are naturally favored by computer science students as they contain information that is of practical use to them.

Table 2. Average classification accuracy between HiBO and Bayesian classifiers

Topic           | HiBO classifier | Bayesian classifier | SVM classifier
Dance           | 97.05%          | 69.46%              | 71.58%
Music           | 94.37%          | 74.38%              | 78.49%
Artists         | 86.45%          | 83.59%              | 82.64%
Photography     | 81.68%          | 55.28%              | 69.03%
Architecture    | 79.77%          | 69.89%              | 72.11%
Art History     | 93.33%          | 78.47%              | 68.58%
Comics          | 95.45%          | 29.46%              | 45.24%
Costumes        | 89.06%          | 72.43%              | 69.77%
Design          | 90.79%          | 69.29%              | 55.08%
Literature      | 89.70%          | 59.26%              | 49.91%
Movies          | 94.59%          | 71.04%              | 68.97%
Performing Arts | 87.34%          | 68.08%              | 65.06%
Collecting      | 92.87%          | 67.17%              | 53.88%
Writing         | 91.84%          | 69.56%              | 60.42%
Graphics        | 92.68%          | 79.80%              | 71.53%
Drawing         | 91.34%          | 59.55%              | 58.16%
Plastic Arts    | 90.86%          | 64.36%              | 62.07%
Mythology       | 93.58%          | 68.22%              | 64.93%
Overall         | 90.70%          | 67.18%              | 64.85%
Table 3. Ordering bookmarks for spam

HiBO DR:
1. Block Referrer Spam
2. Referrer Log Spamming
3. Spam Assassin
4. Stop Spam with Sneakmail 2.0
5. Anti-Spam
6. A Plan for Spam
7. Death to Spam
8. Spam Filter for Outlook
9. The Spam Weblog
10. Damn Spam

HiBO Similarity:
1. Witchvox Article – That Pesky and Obnoxious Spam
2. Outlook Express Tutorial: Filter - how to stop spam
3. Message Cleaner – Stop viruses and spam emails now
4. The Spammeister guide to spam
5. Spamhuntress – Spam Cleaning for Blogs
6. Discuss Sam Forums - Learn how to eliminate and prevent spam
7. SpamFixer
8. Spam Email Discussion List
9. Emailabuse.org
10. Spamcop.net

Google:
1. www.spam.com
2. Fight Spam on the Internet
3. Spam - Wikipedia
4. E-mail Spam - Wikipedia
5. FTC - Spam - Home Page
6. Coalition Against Unsolicited Commercial Email
7. SpamAssassin
8. Spam Cop
9. What is Spam - Webopedia
10. Spam Laws
5 Related Work
Bookmarks are essentially pointers to URLs that one would like to store in a personal Web repository for future reference and/or fast access. Today there exist many commercial bookmark management tools2, providing users with a variety of functionalities in an attempt to assist them in organizing the list of their Web favorites [2] [3] [4] [5]. With the recent advent of social bookmarking, bookmarks3 “have become a means for users sharing similar interests to locate new websites that they might not have otherwise heard of; or to store their bookmarks in such a way that they are not tied to one specific computer”. In this light, there currently exist several Web sites that collect, share and process bookmarks. These include Simpy, Furl, Del.icio.us, Spurl, Backflip, CiteULike and Connotea, and are reviewed by Hammond et al. [9]. Such social networks of bookmarks are perceived as recommendation systems in the sense that they process shared files and, based on a combinational analysis of the files themselves and their contributors in the network, they suggest to other network members interesting sites submitted by a different community member. From a research point of view, there have been several studies on how shared bookmarks can be efficiently organized to serve communities. The work of [21] falls in this area and introduces GiveALink, an application that explores semantic similarity as a means to process collected data and determine similarity relations among all its users. Likewise, [10] suggests a novel distributed collaborative bookmark system, called CoWing, which aims at helping people organize their shared bookmark files. To that end, the authors introduce the utilization of a bookmark agent, which learns the user’s strategy in classifying bookmarks and, based on that knowledge, fetches new bookmarks that match the local user’s information need. In light of the above, we perceive our work on HiBO to be complementary to existing approaches.
However, one aspect that differentiates our system from available bookmark management systems is that HiBO provides a built-in subject hierarchy that enables the automatic classification of bookmark URLs on the side of either an individual user or a group of users. Through the subject hierarchy, HiBO ensures the dynamic maintenance of personalized views of shared files and as such it assists Web users in sharing their information space with the community.
6 Concluding Remarks
In this paper we presented HiBO, a bookmark management system that automatically manages, orders, retrieves and mines the data that is either stored in Web users’ personal Web repositories or shared across community members. An obvious advantage of our system when compared to existing bookmark management tools is that HiBO uses a built-in subject hierarchy for dynamically grouping bookmarks thematically without any user effort. Another advantage of HiBO is the ordering of bookmarks into the hierarchy’s topics in terms of their content importance to the underlying topics. Currently, we are working on privacy issues so as to motivate Web users to donate their Web favorites to HiBO and thereby launch a powerful bookmark mining system to the community.
2 For a complete list of available bookmark management systems we refer the reader to http://dmoz.org/Computers/Internet/On_the_Web/Web_Applications/Bookmark_Managers/
3 http://en.wikipedia.org/wiki/Bookmark_%28computers%29
References

1. Abrams, D., Baecker, R. and Chignell, M. Information Archiving with Bookmarks: Personal Web Space Construction and Organization. In Proceedings of the Human Computer Interaction Conference, 1998, pp. 41-48.
2. BlinkPro: Powerful Bookmark Manager. http://www.bookmarksplus.com/
3. Bookmark Tracker. http://www.bookmarktracker.com/
4. Check and Get. http://activeurls.com/en/
5. iKeepBookmarks. http://www.ikeepbookmarks.com/
6. Open Directory Project: http://dmoz.org
7. WordNet 2.0: http://www.cogsci.princeton.edu/~wn/
8. Barzilay, R. and Elhadad, M. Lexical chains for text summarization. In Advances in Automatic Text Summarization. MIT Press, 1999.
9. Hammond, T., Hannay, T., Lund, B. and Scott, J. Social Bookmarking Tools (I): A General Review. D-Lib Magazine, 11(4): doi:10.1045/april2005-hammond, 2005.
10. Kanawati, R., Malek, M., Klusch, M. and Zambonelli, F. CoWing: A Collaborative Bookmark Management. In Lecture Notes in Computer Science, ISSN 0302-9743, 2001.
11. Karousos, N., Panaretou, I., Pandis, I. and Tzagarakis, M. Babylon Bookmarks: A Taxonomic Approach to the Management of WWW Bookmarks. In Proceedings of the Metainformatics Symposium, 2002, pp. 42-48.
12. Kozanidis, L., Tzekou, P., Zotos, N., Stamou, S. and Christodoulakis, D. Ontology-Based Adaptive Query Refinement. To appear in Proceedings of the 3rd International Conference on Web Information Systems and Technologies, 2007.
13. Krikos, V., Stamou, S., Ntoulas, A., Kokosis, P. and Christodoulakis, D. DirectoryRank: Ordering Pages in Web Directories. In Proceedings of the 7th ACM International Workshop on Web Information and Data Management (WIDM), Bremen, Germany, 2005.
14. Li, W.S., Vu, Q., Chang, E., Agrawal, D., Hirata, K., Mukherjea, S., Wu, Y.L., Bufi, C., Chang, C.K., Hara, Y., Ito, R., Kimura, Y., Shimazu, K. and Saito, Y. PowerBookmarks: A System for Personalizable Web Information Organization, Sharing and Management. In Proceedings of the ACM SIGMOD Conference, 1999, pp. 565-567.
15. Maarek, Y. and Shaul, I. Automatically Organizing Bookmarks per Contents. In Proceedings of the 5th Intl. World Wide Web Conference, 1996.
16. McKenzie, B. and Cockburn, A. An Empirical Analysis of Web Page Revisitation. In Proceedings of the 34th Hawaii Intl. Conference on System Sciences, 2001.
17. Page, L., Brin, S., Motwani, R. and Winograd, T. The PageRank citation ranking: Bringing order to the web. Available at: http://dbpubs.stanford.edu:8090/pub/1999-66, 1998.
18. Resnik, Ph. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th Intl. Joint Conference on Artificial Intelligence, 1995, pp. 448-453.
19. Stamou, S. and Christodoulakis, D. Integrating Domain Knowledge into a Generic Ontology. In Proceedings of the 2nd Meaning Workshop, Italy, 2005.
20. Stamou, S., Ntoulas, A., Krikos, V., Kokosis, P. and Christodoulakis, D. Classifying Web Data in Directory Structures. In Proceedings of the 8th Asia-Pacific Web Conference (APWeb), Harbin, China, 2006, pp. 238-249.
21. Stoilova, L., Holloway, T., Markines, B., Maguitman, A. and Menczer, F. GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation. In Proceedings of the LinkKDD Conference, Chicago, IL, USA, 2005.
Frequent Variable Sets Based Clustering for Artificial Neural Networks Particle Classification Xin Jin and Rongfang Bie* College of Information Science and Technology, Beijing Normal University, Beijing 100875, P.R. China [email protected], [email protected]
Abstract. Particle classification is one of the major analyses in high-energy particle physics experiments. We design a classification framework combining classification and clustering for particle physics experiment data. The system involves classification by a set of Artificial Neural Networks (ANNs), each using a distinct subset of samples selected from the general set. We use frequent variable sets based clustering for partitioning the training samples into several natural subsets; standard back-propagation ANNs are then trained on them. The final decision for each test case is a two-step process: first, the nearest cluster is found for the case, and then the decision is made by the ANN classifier trained on that specific cluster. Comparisons with other classification and clustering methods show that our method is promising.
1 Introduction

Classification (i.e. supervised learning) is a fundamental task in data mining. A classifier, built from the labeled training samples described by a set of features/attributes, is a function that chooses a class label (from a group of predefined labels) for test samples. Major classification algorithms include Artificial Neural Network (ANN) [2, 3, 11], Nearest Neighbor [17, 13], Naïve Bayes [1, 20], etc. Clustering (i.e. unsupervised learning) is another fundamental task in data mining [18]. Cluster analysis partitions unlabeled samples into a number of groups using a measure of distance, so that the samples in one group are similar while samples belonging to different groups are not [15, 16, 19]. Many clustering algorithms have been proposed, among which k-means is one of the most popular [27]. Particle classification is an important analysis in particle physics experiments. The traditional method separates distinct particle events by application of a series of cuts, which act on projections of the high-dimensional event parameter space onto orthogonal axes [11]. This procedure often fails to yield the optimum separation of distinct event classes. In this paper, we investigate the use of data mining technology for particle classification. We describe a clustering method, FVC, especially designed for particle analysis, and then present a classification framework combining ANNs and FVC to improve high-energy particle classification performance. *
Corresponding author.
G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 857–867, 2007. © Springer-Verlag Berlin Heidelberg 2007
The remainder of this paper is organized as follows: We first describe an ANN classifier in Section 2. Section 3 describes the clustering method FVC. Section 4 describes the classification system combining ANNs and clustering. Section 5 describes methods for comparison. Section 6 presents the dataset, four evaluation measures and the experiment results. Conclusions are presented in Section 7.
2 Artificial Neural Networks

An Artificial Neural Network (ANN) is a network of perceptrons, each of which computes an output from multiple inputs by forming a linear combination according to its input weights and then passing the result through some activation function [4, 5]. Among the many proposed ANN models, MLP, the multilayer feedforward network with a backpropagation learning mechanism, is the most widely used [6]. An MLP consists of an input layer of source nodes, one or more hidden layers of computation nodes, and an output layer of nodes. Data propagates through the network layer by layer. Fig. 1 shows the data flow of an MLP with two hidden layers.
Fig. 1. Data-flow graph of a two hidden layer MLP
Define X as a vector of inputs and Y as a vector of outputs (Y may also be one-dimensional). Y is typically obtained by:

Y = W2 fa(W1 ⋅ X + B1) + B2    (1)
W1 denotes the weight vector of the first layer and B1 the bias vector of the input layer; W2 and B2 are those of the output layer. fa denotes the activation function. The classification problem of the MLP can be defined as follows: given a training set of features-class/input-output pairs (xi, ci), the MLP learns a model, the classifier, for the dependency between them by adapting the weights and biases to their optimal values for the given training set. The squared reconstruction error is commonly used as the criterion to be optimized. MLP training consists of iterating two steps: (1) Forward - the predicted class corresponding to the given input is evaluated. (2) Backward - partial derivatives of the cost function with respect to the different parameters are propagated back through the network. The process stops when the weights and biases converge.
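The forward/backward iteration described above can be sketched for a single-hidden-layer network. This is a minimal NumPy illustration, not the authors' implementation; the layer sizes, sigmoid activation, and learning rate are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 3 inputs, 4 hidden units, 1 output (sizes are illustrative).
rng = np.random.default_rng(0)
W1, B1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, B2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    """Forward step: evaluate the prediction per Eq. (1)."""
    h = sigmoid(W1 @ x + B1)          # hidden activation fa(W1.x + B1)
    return W2 @ h + B2, h

def train_step(x, target, lr=0.1):
    """One forward/backward iteration minimizing the squared error."""
    global W1, B1, W2, B2
    y, h = forward(x)
    err = y - target                  # d(cost)/dy for 0.5*(y - t)^2
    # Backward step: propagate partial derivatives through the layers.
    gW2, gB2 = np.outer(err, h), err
    dh = (W2.T @ err) * h * (1 - h)   # chain rule through the sigmoid
    gW1, gB1 = np.outer(dh, x), dh
    W2 -= lr * gW2; B2 -= lr * gB2
    W1 -= lr * gW1; B1 -= lr * gB1
    return float(err[0] ** 2)
```

Repeated calls to `train_step` on the training pairs drive the error down until the weights and biases converge.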
Frequent Variable Sets Based Clustering for ANN Particle Classification
3 Frequent Variable Sets Based Clustering

In this section we describe a partitional clustering method, Frequent Variable Sets based Clustering (FVC), designed to deal with the special characteristics of high-energy particle data. It builds on frequent itemset mining and on the work of Fung B. et al. [15], who developed hierarchical document clustering using frequent itemsets. Frequent itemsets are a basic concept in association rule mining [8, 14]. Many algorithms have been developed for that task, including the well-known Apriori [10] and FP-growth [9].

Frequent item-based high-energy particle clustering partitions the particles according to the variables detected for them. Since we are dealing with particles rather than transactions, we use the notion of variable sets instead of item sets. A variable is any attribute describing a particle within physics experiments (high-energy particle collisions, for example), and a particle can have some variables detected and others undetected due to inevitable changes in the experimental environment or other reasons. Therefore, even particles of the same kind may have different sets of detected variables. We thus assume that if we can cluster particles into groups where each group has its own specific experimental environment, then the classification model built from the particles in one group will be a better distinguisher than a model built from the whole set of particles. (Note that each group will contain particles of different classes, because the group-forming process is not based on the classes of the particles: particles of different classes may be under the same experimental situation and thus have the same set of detected and undetected variables.) Traditional clustering methods, k-means for example, just group points that are close in distance and are therefore not suitable for finding variable-oriented groups.
Instead of clustering in the original high-dimensional space (for the data used in this paper, the original space is 78-dimensional), FVC considers only the low-dimensional frequent variable sets as cluster candidates. Strictly speaking, a frequent variable set (or variableset) is not a cluster (candidate) but only the description of one, or the representational centroid of the cluster. The corresponding cluster itself consists of the set of particles containing all variables of the frequent variable set.

3.1 Binarizing

Original particle data have numeric attributes/variables; in order to find frequent variable sets we first convert them to binary attributes (1 for a detected variable and 0 for an undetected one). For a particle, if a variable has a value other than 0, we take the variable to be detected for that particle (or we can say that it occurs in the particle) and convert the value to 1. If a variable has value 0 for the particle, we take it to be undetected and leave it at 0. Some variables are peculiar in that their value is very close to 0 (0.0001, for example) for one particle and exactly 0 for another; for the latter particle it is then hard to know whether the 0 means undetected. We simply assume that the variable is also detected for that particle and set its value to 1. Table 1 shows example data composed of four particles with five attributes/variables. Table 2 shows the converted data and its transaction representation.
Table 1. Original data. V1,..., V5 are five variables, P1,..., P4 are four particles.

ID   V1       V2       V3       V4       V5
P1   1.3546   0        2.5553   0.0001   0
P2   1.7865   2.3322   0        0        0
P3   0        0        0.0001   2.5343   2.3444
P4   0        0        0        4.7865   2.2211

Table 2. Binarized data and its transaction representation

ID   V1   V2   V3   V4   V5        ID   Transaction
P1   1    0    1    1    0         P1   V1, V3, V4
P2   1    1    1    1    0         P2   V1, V2, V3, V4
P3   0    0    1    1    1         P3   V3, V4, V5
P4   0    0    1    1    1         P4   V3, V4, V5
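The binarization of Section 3.1 can be sketched as follows, reproducing Tables 1 and 2. The near-zero threshold `eps` for the "peculiar variable" rule is our assumption; the paper does not give a numeric cutoff:

```python
def binarize(columns, eps=1e-3):
    """1 = detected, 0 = undetected. If a variable takes a tiny nonzero value
    (below eps) for some particle, its zeros are treated as detected too --
    the 'peculiar variable' rule of Section 3.1 (eps is our assumption)."""
    out = {}
    for var, values in columns.items():
        peculiar = any(0 < abs(v) < eps for v in values)
        out[var] = [1 if (v != 0 or peculiar) else 0 for v in values]
    return out

def to_transactions(binary, particle_ids):
    """Transaction representation: the set of detected variables per particle."""
    return {pid: {var for var, bits in binary.items() if bits[i]}
            for i, pid in enumerate(particle_ids)}

# Table 1, stored column-wise (one list of values per variable).
columns = {"V1": [1.3546, 1.7865, 0, 0],
           "V2": [0, 2.3322, 0, 0],
           "V3": [2.5553, 0, 0.0001, 0],
           "V4": [0.0001, 0, 2.5343, 4.7865],
           "V5": [0, 0, 2.3444, 2.2211]}
tx = to_transactions(binarize(columns), ["P1", "P2", "P3", "P4"])
# tx["P1"] == {"V1", "V3", "V4"}, reproducing Table 2.
```

Note how V3 and V4, which each take the value 0.0001 for one particle, are marked detected for every particle, matching the third and fourth columns of Table 2.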
3.2 Representational Frequent Variablesets

Some variables occur in only some particles, while others occur in all particles. Let P = {P1,..., Pn} be a set of particles and A = {V1, V2,...} be all the variables occurring in the particles of P. Each particle Pi can be represented by the set of variables occurring in it. For any set of variables (or variableset) S, define C(S) as the set of particles containing all variables in S. If only a proper subset of S occurs in a particle, that particle is not in C(S).

Define Fi as a representational frequent variableset: a variableset whose variables all appear together in more than a minimum and less than a maximum fraction of the whole particle set P. A minimum support (minsupp, as a percentage of all particles) and a maximum support (maxsupp, as a percentage of all particles) are specified for this purpose. Define F = {F1,..., Fm} to be the set of all representational frequent variablesets in P with respect to minsupp and maxsupp; the variables of each Fi occur together in at least minsupp and at most maxsupp percent of the |P| particles:

F = {Fj ⊆ A | (maxsupp × |P|) ≥ |C(Fj)| ≥ (minsupp × |P|)}    (2)

where |P| is the number of particles. A representational frequent variable is a variable that belongs to a representational frequent variableset. A representational frequent k-variableset is a representational frequent variableset containing k variables.

Our definition of representational frequent differs from the traditional definition of frequent in association rule mining, where only minsupp is used. We introduce maxsupp in order to avoid overly frequent variable sets: such variables occur in so many particles that they are not suitable for representing different kinds of particles (i.e., they are not representational). To find representational frequent variablesets we first use a standard frequent itemset mining algorithm, such as Apriori or FP-growth, to find all frequent variablesets, and then remove those whose support is beyond maxsupp and those containing any item/variable whose support is beyond maxsupp. For example, let minsupp be 10% and maxsupp 35%, and suppose that variable V1's support is 90%,
V2's support is 30%, and variableset {V1, V2} has a support of 30%. Then the frequent 1-variableset {V2} is representational frequent, but the frequent 2-variableset {V1, V2} is not, since {V1} is not representational frequent.

The method described above is simple but not optimized; we also provide an optimized way of mining representational frequent variablesets: modify Apriori by adding a maxsupp threshold when finding frequent itemsets/variablesets. At the step of generating candidate frequent k-variablesets Ck from frequent (k-1)-variablesets Lk-1, we remove those frequent (k-1)-variablesets whose support is beyond maxsupp. This reduces the size of Ck and directly yields representational frequent variablesets.

3.3 Obtaining Clusters

For each representational frequent variableset, we construct an initial cluster containing all the particles that contain this variableset. One property of initial clusters is that all particles in a cluster contain all the items in the representational frequent variableset that defines the cluster; that is, these variables are mandatory for the cluster. We use this defining representational frequent variableset as the representational centroid to identify the cluster. Initial clusters are not hard/disjoint, because one particle may contain several representational frequent variablesets, so overlapping clusters must be merged. The merging proceeds in two steps.

(I) Merging fully overlapped (or redundant) clusters. If two initial clusters fully overlap, that is, they have different representational centroids but the same set of particles, we merge them and choose the largest representational centroid as the resulting centroid. For example, if two representational frequent variables V1 and V2 are highly correlated (i.e., they always come together), then the three clusters constructed by {V1}, {V2} and {V1, V2} respectively will be merged, and the resulting centroid is {V1, V2}.

(II) Merging partially overlapped clusters. If two initial clusters partially overlap, we assign the particles in the overlapping area to the cluster with the largest representational centroid. For example, if a particle belongs to two initial clusters, {V1, V2, V5} and {V8, V14}, we assign the particle to {V1, V2, V5}.

The overall FVC clustering algorithm proceeds as follows.
1. Binarize the data.
2. Mine all representational frequent variablesets as the initial representational centroids and construct initial clusters.
3. Assign all points/particles to their representational centroids.
4. Merge overlapped clusters into disjoint clusters.
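The steps above can be sketched end to end. This is a simplified illustration, not the authors' implementation: variablesets are enumerated by brute force instead of with Apriori/FP-growth (feasible only for small variable counts), and the over-frequent-variable filter of Section 3.2 is applied up front:

```python
from itertools import combinations

def fvc(transactions, minsupp, maxsupp):
    """Cluster particles by representational frequent variablesets
    (brute-force sketch of the FVC steps)."""
    n = len(transactions)
    def cover_of(vs):
        return {p for p, t in transactions.items() if set(vs) <= t}
    all_vars = sorted(set().union(*transactions.values()))
    # A variable occurring in more than maxsupp of the particles cannot be
    # representational, so sets containing it are excluded up front.
    usable = [v for v in all_vars if len(cover_of({v})) <= maxsupp * n]
    centroids = {}
    for k in range(1, len(usable) + 1):
        for vs in combinations(usable, k):
            cover = frozenset(cover_of(vs))
            if minsupp * n <= len(cover) <= maxsupp * n:
                # Step (I): fully overlapped clusters merge into the one
                # with the largest representational centroid.
                if cover not in centroids or len(vs) > len(centroids[cover]):
                    centroids[cover] = frozenset(vs)
    # Step (II): a particle in several clusters goes to the largest centroid.
    clusters = {}
    for p in transactions:
        cands = [vs for cover, vs in centroids.items() if p in cover]
        if cands:
            clusters.setdefault(max(cands, key=len), set()).add(p)
    return clusters
```

On the Table 2 transactions with minsupp = 0.4 and maxsupp = 0.6, V3 and V4 are filtered out as too frequent and the particles split into two clusters described by {V1} and {V5}.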
4 Classification Combining ANNs and FVC

We design a classification system combining ANNs and FVC. The system, which we call Clustering-ANNs, performs classification with a set of ANNs, each using a distinct subset of samples selected from the general set by the clustering algorithm FVC. More
specifically, we use FVC to partition the training samples into several subsets, then train a standard back-propagation ANN for each subset. The final decision for a test case is a two-step process: first, the nearest cluster is found for the case, and then the decision is made by the ANN classifier trained on that cluster. The reason for applying FVC before the ANNs is that FVC can partition the particles into groups according to their different experimental situations. Particles of different classes may be under the same experimental situation and thus have the same set of detected and undetected variables, so each group will contain particles of different classes.
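The two-step decision can be sketched as follows. The "nearest cluster" criterion (overlap between the particle's detected variables and the representational centroid) is our assumption, as the paper does not specify a distance; the constant classifiers are hypothetical stand-ins for the trained ANNs:

```python
def nearest_cluster(detected_vars, centroids):
    """Route a particle to the representational centroid sharing the most
    variables with it (the overlap criterion is our assumption)."""
    return max(centroids, key=lambda c: len(c & detected_vars))

def classify(detected_vars, features, centroids, ann_per_cluster):
    """Two-step Clustering-ANNs decision: pick a cluster, then use its ANN."""
    cluster = nearest_cluster(detected_vars, centroids)
    return ann_per_cluster[cluster](features)

# Toy usage with stand-in 'ANNs' (hypothetical constant classifiers).
centroids = [frozenset({"V1"}), frozenset({"V5"})]
anns = {centroids[0]: lambda f: 1, centroids[1]: lambda f: 0}
```

A particle with V1 detected is routed to the first cluster's classifier, one with V5 to the second.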
5 Methods for Comparison

In this section we describe several classification methods used for comparison.

5.1 Probability Learning

Naïve Bayes is a successful probability learning method that has been used in many applications [24, 25, 26]. For Naïve Bayes based particle classification, we assume the particle data is generated by a parametric mixture model. Since the true parameters of the mixture model are not known, Naïve Bayes estimates them from labeled training samples. Given a set of training particles L = {p1,..., pN}, where N is the number of training samples, Naïve Bayes uses maximum likelihood to estimate each class prior as the fraction of training particles belonging to class ci. The particle classification problem can then be described as follows: assuming each particle belongs to exactly one class (1 or 0 in our case), for a given particle p we search for the class ci that maximizes the posterior probability by applying Bayes' rule. The method assumes that the features of a particle are independent of each other. Fig. 2 shows the Naïve Bayes classifier for the 2-class, m-feature particle data.
Fig. 2. Naïve Bayes classifier for the 2-class and m-feature particle data
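A minimal sketch of such a classifier follows. The Gaussian per-feature likelihood is our assumption; the paper does not state which event model the classifier of Fig. 2 uses:

```python
import math
from collections import defaultdict

def train_nb(samples):
    """samples: list of (feature_tuple, label). Returns class priors and, per
    class, per-feature (mean, std) for Gaussian likelihoods (assumed model)."""
    by_class = defaultdict(list)
    for x, c in samples:
        by_class[c].append(x)
    priors = {c: len(xs) / len(samples) for c, xs in by_class.items()}
    stats = {}
    for c, xs in by_class.items():
        feat_stats = []
        for col in zip(*xs):
            mu = sum(col) / len(col)
            sd = max(1e-6, (sum((v - mu) ** 2 for v in col) / len(col)) ** 0.5)
            feat_stats.append((mu, sd))
        stats[c] = feat_stats
    return priors, stats

def predict_nb(x, priors, stats):
    """Pick the class maximizing the (log) posterior under independence."""
    def log_post(c):
        lp = math.log(priors[c])
        for v, (mu, sd) in zip(x, stats[c]):
            lp += -math.log(sd * math.sqrt(2 * math.pi)) - (v - mu) ** 2 / (2 * sd * sd)
        return lp
    return max(priors, key=log_post)
```

Working in log space keeps the product of m per-feature likelihoods numerically stable.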
5.2 Memory Learning

Memory-based learning is a non-parametric inductive learning paradigm that stores training instances in a memory structure on which predictions for new instances are based [22]. It assumes that reasoning is based on direct reuse of stored experiences
rather than on knowledge, such as models, abstracted from experience. The similarity between a new instance and a sample in memory is computed using a distance metric. In the experiments we use the nearest neighbor (NN) classifier, a memory-based learning method with a Euclidean distance metric [23]. Applied to particle physics data, NN treats all particles as points in the m-dimensional space (where m is the number of variables); given an unseen particle, the algorithm classifies it by the nearest training particle.

5.3 Hard Partitional Clustering

Hard partitional clustering techniques create a one-level (unnested) partitioning of the data points. Defining k as the desired number of clusters, partitional approaches find all k clusters at once. There are many such techniques, among which the k-means algorithm is the most widely used [21]. One of the basic ideas of k-means is that a center point can represent a cluster. In particular, we use the centroid, which is the mean (or median) point of a group of points. The basic k-means clustering technique is summarized below.
1. Select k points as the initial centroids.
2. Assign all points to the closest centroid.
3. Re-compute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids don't change or change little.
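The four steps above can be sketched in pure Python (a minimal illustration with Euclidean distance and mean centroids; the seeded random initialization is our choice):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means following steps 1-4 of Section 5.3."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 1: initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: nearest centroid
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]      # step 3: recompute means
        if new == centroids:                          # step 4: stop when stable
            break
        centroids = new
    return centroids, clusters
```

With well-separated data the assignments stop changing after a few iterations and the loop exits early.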
6 Experiments

6.1 Datasets

The high-energy particle physics dataset we used is publicly available on the KDD Cup website [7]. It contains 50000 binary-labeled particles, with 78 attributes per particle. Since attributes 20, 21, 22, 29, 44, 45, 46 and 55 have many missing values, which may degrade classification performance, we simply ignore these attributes. The particles fall into two classes: positive (1) and negative (0).

6.2 Evaluation Methods

We use four performance measures [12] for the particle classification problem:

Accuracy (ACC, to maximize): the number of cases predicted correctly, divided by the total number of cases.

Area Under the ROC Curve (AUC, to maximize): ROC is a plot of true positive rate vs. false positive rate as the prediction threshold sweeps through all possible values; AUC is the area under this curve. AUC measures how many times one would have to swap samples with their neighbors to repair the sort. AUC = 1 indicates perfect prediction, where all positive samples are sorted above all negative samples. AUC = 0.5 indicates random prediction, where there is no relationship between the predicted values and the actual values. AUC below 0.5 indicates an inverse relationship between predicted and actual values.

SLAC Q-Score (SLQ, to maximize): researchers at the Stanford Linear Accelerator Center (SLAC) devised SLQ, a domain-specific performance metric, to measure the
performance of predictions made for particle physics problems. SLQ breaks the prediction interval into a series of bins; in our experiments we use 100 equally sized bins within the 0 to 1 interval.

Cross-Entropy (CXE, to minimize): CXE measures how close predicted values are to actual values. It assumes the predicted values are probabilities on the interval 0 to 1 that indicate the probability that a sample belongs to a certain class:

CXE = −Sum(A · log(P) + (1 − A) · log(1 − P))    (3)

where A is the actual class (in our case, 0 or 1) and P is the predicted probability that the sample belongs to the class. Mean CXE (the sum of the CXE for each sample divided by the total number of samples) is used to make CXE independent of data set size.

6.3 Results

6.3.1 Illustration with a Random Subset of the Data

We first provide an intuitive comparison between FVC and k-means. Fig. 3 shows the result of FVC clustering on 100 randomly selected particles. Each column in the figure corresponds to a variable and each row to a particle; there are 65 columns and 100 rows. White in a grid cell means that the variable is detected for a particle, while black means it is not. The number of clusters is decided automatically by FVC according to the nature of the data. In the experiments, we found that the original particles (as shown in Fig. 3) are partitioned into three natural groups, as shown in Fig. 4. We can see that FVC found natural groups in the dataset.
Fig. 3. The original 100 particles. The X-axis denotes the variables. The Y-axis denotes the particles.
Fig. 4. FVC Clustering results on the 100 particles. The X-axis denotes the variables. The Y-axis denotes the particles.
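The ACC and mean-CXE measures of Section 6.2 can be sketched as follows. We write CXE in the standard negated form, so that lower is better and 0 is a perfect score; the probability clipping is our addition to avoid log(0):

```python
import math

def accuracy(actual, predicted_labels):
    """ACC: fraction of cases predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted_labels)) / len(actual)

def mean_cxe(actual, probs, eps=1e-15):
    """Mean cross-entropy over the samples; probs are predicted P(class = 1).
    Negated so that the quantity is minimized, per the 'to minimize' label."""
    total = 0.0
    for a, p in zip(actual, probs):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += a * math.log(p) + (1 - a) * math.log(1 - p)
    return -total / len(actual)
```

Confident correct predictions drive mean CXE toward 0, while confident wrong ones are penalized heavily.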
6.3.2 Results on the Whole Dataset

Full experiments were done on the whole dataset of 50000 samples. We use 10-fold cross-validation for estimating classification performance, so the four measures, ACC, AUC, SLQ and CXE, are averaged over the 10 runs. Table 3 shows the results.

The results show that ANN is better than Nearest Neighbor and Naive Bayes for particle classification. By combining clustering and ANNs, the proposed scheme Clustering-ANNs achieves even better performance. Kmeans-ANNs is slightly better than a single ANN on ACC and SLQ. By using the clustering algorithm FVC, which is especially designed for particles, we obtain the best performance on all four measures. The reason FVC-ANNs is better than a single ANN is that FVC can cluster the particle data into different groups according to the different experimental characteristics exhibited in high-energy physics experiments. Different groups found by FVC have different sets of variables, so a more appropriate ANN can be trained for each group; this is better than just using one uniform ANN for all particles.

Table 3. Classification performance results of traditional classifiers and Clustering-ANNs (Kmeans-ANNs and FVC-ANNs); results in bold type are the best performance

Methods            ACC     AUC     SLQ     CXE
Nearest Neighbor   0.653   0.730   0.253   1.033
Naive Bayes        0.684   0.747   0.194   0.988
ANN                0.701   0.788   0.270   0.801
Kmeans-ANNs        0.703   0.788   0.272   0.800
FVC-ANNs           0.719   0.801   0.293   0.787
7 Conclusion

In this paper we describe a particle-oriented clustering method, Frequent Variable Sets based Clustering (FVC), and a framework, Clustering-ANNs, for the high-energy particle physics classification problem. The system performs classification with a set of artificial neural networks (ANNs), each using a distinct subset of samples selected from the general set by a clustering algorithm. We use FVC clustering to partition the training samples into several subsets, then standard back-propagation ANNs are trained on them. Comparisons with other popular classification methods, Nearest Neighbor and Naive Bayes, show that ANN is the best single classifier for particle physics classification, and that the proposed method FVC-ANNs achieves even better performance.
Acknowledgments

The authors gratefully acknowledge the support of the National Science Foundation of China (Grant No. 60273015 and No. 10001006).
References
1. Jason D. Rennie, et al.: Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Twentieth International Conference on Machine Learning, August 22 (2003)
2. Christopher Bishop: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
3. Ken-ichi Funahashi: On the Approximate Realization of Continuous Mappings by Neural Networks. Neural Networks, 2(3):183-192 (1989)
4. Simon Haykin: Neural Networks - A Comprehensive Foundation, 2nd ed. Prentice-Hall, Englewood Cliffs (1998)
5. Sepp Hochreiter and Jürgen Schmidhuber: Feature Extraction Through LOCOCODE. Neural Computation, 11(3):679-714 (1999)
6. Kurt Hornik, Maxwell Stinchcombe, and Halbert White: Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5):359-366 (1989)
7. KDD Cup 2004, http://kodiak.cs.cornell.edu/kddcup/index.html (2004)
8. Hipp, J., Guntzer, U., Nakhaeizadeh, G.: Algorithms for Association Rule Mining - a General Survey and Comparison. ACM SIGKDD Explorations, Vol. 2, pp. 58-64 (2000)
9. J. Han, J. Pei, and Y. Yin: Mining Frequent Patterns without Candidate Generation. In Proc. of ACM SIGMOD'00 (2000)
10. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. Proc. VLDB '94, Santiago de Chile, Chile, pp. 487-499 (1994)
11. Marcel Kunze: Application of Artificial Neural Networks in the Analysis of Multi-Particle Data. In Proceedings of the CORINNEII Conference (1994)
12. KDD Cup 2004 - Description of Performance Metrics: http://kodiak.cs.cornell.edu/kddcup/metrics.html (2006)
13. A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. P. Hardin, S. Levy: A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis. Bioinformatics (2004)
14. J. Hipp, U. Guntzer, and G. Nakhaeizadeh: Algorithms for Association Rule Mining - a General Survey and Comparison. ACM SIGKDD Explorations, 2(1):58-64, July (2000)
15. Fung, B., Wang, K., Ester, M.: Large Hierarchical Document Clustering Using Frequent Itemsets. Proc. SIAM International Conference on Data Mining 2003 (SDM '03), San Francisco, CA, May (2003)
16. Florian Beil, Martin Ester, Xiaowei Xu: Frequent Term-based Text Clustering. KDD: 436-442 (2002)
17. Aha, D., and D. Kibler: Instance-based Learning Algorithms. Machine Learning, Vol. 6, 37-66 (1991)
18. I. Witten and E. Frank: Data Mining - Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)
19. R. C. Dubes and A. K. Jain: Algorithms for Clustering Data. Prentice Hall College Div, Englewood Cliffs, NJ, March (1998)
20. Karl-Michael Schneider: A Comparison of Event Models for Naive Bayes Anti-Spam E-Mail Filtering. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, 307-314, April (2003)
21. Xin Jin, Anbang Xu, Rongfang Bie, Ping Guo: Kernel Independent Component Analysis for Gene Expression Data Clustering. ICA 2006: 454-461 (2006)
22. Aha, D., and D. Kibler: Instance-based Learning Algorithms. Machine Learning, Vol. 6, 37-66 (1991)
23. Piotr Indyk: Nearest Neighbors in High-dimensional Spaces. In Jacob E. Goodman and Joseph O'Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 39. CRC Press, 2nd edition (2004)
24. George H. John and Pat Langley: Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345. Morgan Kaufmann, San Mateo (1995)
25. Xiaoyong Chai, Lin Deng, Qiang Yang, Charles X. Ling: Test-Cost Sensitive Naive Bayes Classification. ICDM 2004: 51-58 (2004)
26. Peter A. Flach and Nicolas Lachiche: Naive Bayesian Classification of Structured Data. Machine Learning, 57(3):233-269, December (2004)
27. H. Wang, et al.: Clustering by Pattern Similarity in Large Data Sets. SIGMOD, 394-405 (2002)
Attributes Reduction Based on GA-CFS Method Zhiwei Ni, Fenggang Li, Shanling Yang, Xiao Liu, Weili Zhang, and Qin Luo School of Management, Hefei University of Technology, Hefei 230009, China
Abstract. The selection and evaluation of attributes is of great importance for knowledge-based systems; it is also a critical factor affecting system performance. Using a genetic operator as the search approach and a correlation-based heuristic strategy as the evaluation mechanism, this paper presents a GA-CFS method to select the optimal subset of attributes from a given case library. On this basis, classification performance is evaluated with the C4.5 algorithm combined with k-fold cross validation. The comparative experimental results indicate that the proposed method is capable of identifying the subset most relevant for classification and prediction, dramatically reducing the representation space of the attributes while hardly decreasing classification precision.

Keywords: Attributes reduction, correlation-based feature selection (CFS), genetic algorithm (GA), k-fold cross validation.
1 Introduction

In the research fields of machine learning and data mining, significant attention has been paid to attributes reduction and evaluation. As an important task for knowledge-based systems, its key problem is how to identify the subset most related to a given target while clearing away irrelevant and redundant attributes. Performing this task successfully reduces the data dimensionality and the hypothesis space, enabling the algorithm to execute faster and more efficiently.

Attributes reduction and evaluation is also an NP-hard problem, so how to select a valid searching method is a critical aspect to investigate. Genetic algorithms differentiate themselves from other searching methods by their particular genetic operators, and can be well applied to the problem of attribute searching. Another important factor in system design is how to measure the weight of attributes for classification and prediction. The correlation-based heuristic method can evaluate the degree of association among attributes and measure the contribution of attributes (or subsets) to classification; it can serve as the evaluation criterion for attributes reduction.

This paper proposes a GA-CFS method combining a genetic algorithm with correlation-based evaluation. The proposed method solves not only the problem of searching efficiency caused by the "combinatorial explosion" of attribute combinations, but also the problem of measuring correlation among attributes. Some researchers have implemented attributes reduction [1-3] using genetic mechanisms without combining it

G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 868–875, 2007. © Springer-Verlag Berlin Heidelberg 2007
with correlation-based feature reduction. The problem of how to find the (approximately) optimal subset of attributes for a given case library is what the authors are devoted to in this paper.

The remainder of this paper is organized as follows. The next section describes the searching and evaluating strategies of attributes reduction and its formalization. Section 3 briefly describes the genetic algorithm. Section 4 focuses on the process of attributes reduction using GA-CFS, based on the genetic algorithm and correlation-based evaluation. Section 5 verifies the performance of GA-CFS by combining the C4.5 algorithm with k-fold cross validation, using data from the UCI repository of the University of California. Finally, section 6 concludes this paper and points out future work.
2 Attributes Reduction

Attributes reduction is the selection of a subset from the attribute space that significantly influences prediction or classification results. Its goal is to find the attributes or subsets with the most discriminative ability. In general, attributes reduction includes two parts: (1) a searching strategy in the attribute space; (2) an evaluation strategy for the selected attribute subset. Both are indispensable parts of the process.

2.1 Searching Strategy and Evaluation Strategy of Attributes Reduction

Attributes reduction is a combinatorial optimization problem. It has high complexity and requires an efficient searching algorithm. Each searching state can be mapped to a subset of the searching space. An n-dimensional data set has a potential state space of 2^n, so selecting the starting point of the search and the searching strategy is very important. Usually, we use heuristic searching strategies instead of exhaustive ones to obtain an approximately optimal subset. Searching strategies for attributes include: best first [4], forward reduction, stochastic searching, exhaustive searching, genetic algorithms [1, 2], ordering methods, etc.

From the viewpoint of an evaluation function, attributes evaluation scores every potential attribute and then selects the attributes with the highest scores as the optimal subset. The evaluation function directly influences the final subset: different evaluation functions yield different subsets. Commonly used attributes evaluation methods include: information gain [5], gain ratio [6], correlation-based evaluation [7], principal component analysis, chi-square evaluation, etc. In a genetic algorithm, the attributes evaluation plays the role of the evaluation (fitness) function.
2.2 Formalization of Attributes Reduction

Considering the attribute set as an attribute vector, reduction is the process of selecting a subset whose cardinality is M from an attribute set whose cardinality is N (M ≤ N).
Let FN be the original attribute set and FM the selected subset. Then, with respect to the optimized subset, the conditional probability P(Ci | FM = fM) of each decision class Ci should be as close as possible to that conditioned on FN:

∀Ci : P(Ci | FM = fM) ≅ P(Ci | FN = fN)    (1)

where fM denotes the specific attribute vector of the attribute set FM and fN that of the attribute set FN. The process of attributes reduction is the process of searching for the optimal or approximately optimal FM.
3 Genetic Algorithm

The genetic algorithm is a searching approach [8] based on natural selection and the natural genetic mechanism. Following nature's strategy of "survival of the fittest", the algorithm uses random genetic operators to generate several new solutions, eliminates the poorer ones, and keeps the better and more promising ones. The information in the fittest solutions is constantly exploited to search new, unknown areas of the search space. In its effective use of historical information to direct each search step toward the most promising direction, the genetic algorithm is similar to simulated annealing and tabu search. As a result, the genetic algorithm is not only a random searching approach but a directed random searching approach. A genetic algorithm can be formally defined as an 8-tuple:
GA = (P(0), N, l, s, g, p, f, t)

where:
P(0) = (y1(0), y2(0),..., yN(0)) ∈ I^N denotes the initial population;
N is a positive integer denoting the number of individuals in a population;
l is a positive integer denoting the length of the symbol string (chromosome);
I = Σ^l is the set of all symbol strings of length l over an alphabet Σ; if binary coding is used, then Σ = {0,1};
s : I^N → I^N represents the selection strategy;
g denotes the genetic operators, which usually include the reproduction operator Or : I → I, the crossover operator Oc : I × I → I × I, and the mutation operator Om : I → I;
f : I → R+ is the fitness function;
t : I^N → {0,1} is the termination criterion.
Attributes Reduction Based on GA-CFS Method
871
The genetic algorithm presented by Holland initially adopted binary coding, that is, Σ = {0, 1}. Generally speaking, however, it can be extended to any data structure. According to the needs of the practical problem, Σ can be a 0-1 bit string, as well as integer vectors, Lisp expressions, or neural networks. In this paper, we use a binary-coded string to denote the attribute vector: the code '0' denotes that the corresponding attribute does not appear in the selected subset, while '1' denotes the opposite. The settings of the genetic operators are given in Section 4.2.
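As an illustration, the encoding and decoding of an attribute subset as a bit string might be sketched as follows (a minimal sketch; the function and variable names are ours, not from the paper):

```python
# Each chromosome is a bit string of length l = |F_N|: bit i is 1 iff
# attribute i of the original set F_N appears in the candidate subset F_M.

def decode(chromosome, attribute_names):
    """Map a 0/1 chromosome to the selected attribute subset F_M."""
    return [name for bit, name in zip(chromosome, attribute_names) if bit == 1]

def encode(subset, attribute_names):
    """Map a subset of attribute names back to a 0/1 chromosome."""
    chosen = set(subset)
    return [1 if name in chosen else 0 for name in attribute_names]

names = ["a1", "a2", "a3", "a4", "a5"]
chrom = [1, 0, 1, 1, 0]
print(decode(chrom, names))                # ['a1', 'a3', 'a4']
print(encode(["a1", "a3", "a4"], names))   # [1, 0, 1, 1, 0]
```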
4 Attributes Reduction Based on GA-CFS Method

4.1 CFS Evaluation Method

The CFS (correlation-based feature selection) evaluation method for attribute reduction is a heuristic algorithm [7] that evaluates the 'merit' of a subset of attributes. Its main considerations are the class-prediction ability of each single attribute and the correlations among the attributes. The heuristic is based on the following hypothesis: the attributes belonging to a good subset FM are highly correlated with the class Ci, while the attributes themselves are uncorrelated with each other. Irrelevant attributes in the subset are hardly related to the classification, so they can be ignored. Redundant attributes can also be eliminated, since each of them is certain to correspond to some other highly correlated attribute. The degree of acceptance of an attribute depends on its ability to predict the classification in areas of the case library space where other attributes cannot. The CFS evaluation function of a subset is defined as follows:
M_s = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}    (2)

where M_s is the heuristic 'merit' when the subset includes k attributes, \overline{r_{cf}} is the average attribute-class correlation (f ∈ S), and \overline{r_{ff}} is the average attribute-attribute correlation. For continuous-valued data, the correlation between attributes can be calculated as follows:
r_{XY} = \frac{\sum xy}{n\,\sigma_X \sigma_Y}    (3)

where σ_X and σ_Y denote the standard deviations of the continuous-valued attributes X and Y.
872
Z. Ni et al.
If one of the two attributes is continuous and the other is discrete, the correlation can be calculated as follows:

r_{XY} = \sum_{i=1}^{k} p(X = x_i)\, r_{X_{bi}Y}    (4)

where X_{bi} is a binary indicator variable: X_{bi} = 1 if X = x_i, and X_{bi} = 0 otherwise.
If both attributes are discrete, the correlation can be calculated as follows:

r_{XY} = \sum_{i=1}^{k}\sum_{j=1}^{l} p(X = x_i, Y = y_j)\, r_{X_{bi}Y_{bj}}    (5)
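Given the average correlations, Eq. (2) is then straightforward to compute. A minimal sketch (the helper name is ours):

```python
import math

def cfs_merit(k, avg_rcf, avg_rff):
    """Heuristic merit of a k-attribute subset, Eq. (2):
    Ms = k * r_cf / sqrt(k + k*(k-1)*r_ff)."""
    return k * avg_rcf / math.sqrt(k + k * (k - 1) * avg_rff)

# A subset whose attributes predict the class well (high r_cf) but are
# mutually uncorrelated (low r_ff) scores higher than a redundant subset.
print(round(cfs_merit(5, 0.6, 0.1), 4))
print(round(cfs_merit(5, 0.6, 0.9), 4))
```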
According to the above formulations, the correlation between attributes can be calculated whether they are discrete or continuous. The merit value then serves as the attribute-reduction criterion in each step of the genetic search, until the termination criterion of the algorithm is met.

4.2 Settings of the Genetic Operators

In order to obtain the attribute reduction with the genetic algorithm, the following operations need to be set:

1. Initialization of the population. Select N random initial points to form a population; the number N of individuals in a population is the population size. Each chromosome of the population is coded as a binary string. The chromosomes encode the parameters being optimized, and each initial individual denotes an initial solution.
2. Selection. Select appropriate individuals according to the roulette-wheel selection strategy. Selection should embody the principle of 'survival of the fittest': on the basis of the fitness value of each individual, the best individuals are selected into the next-generation population for reproduction.
3. Crossover. With crossover probability pc, new individuals are generated. This makes the search effective in the solution space while limiting the destruction of effective schemata. Crossover is a mechanism for information exchange between two chromosomes.
4. Mutation. According to the given mutation probability pm, some individuals are selected randomly from the population and mutated according to a certain strategy. Mutation is an important factor in enlarging population diversity; it enhances the ability of the genetic algorithm to find optimal solutions.
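Steps 1-4 above can be sketched as one generation of the search loop (a simplified sketch with our own naming; in GA-CFS the fitness function would be the CFS merit of the subset a chromosome encodes):

```python
import random

def roulette_select(population, fitness):
    """Pick one individual with probability proportional to its fitness."""
    scores = [fitness(ind) for ind in population]
    pick = random.uniform(0, sum(scores))
    acc = 0.0
    for ind, s in zip(population, scores):
        acc += s
        if pick <= acc:
            return ind
    return population[-1]

def next_generation(population, fitness, pc=0.66, pm=0.033):
    """One generation: roulette selection, one-point crossover, bit-flip mutation."""
    new_pop = []
    while len(new_pop) < len(population):
        a = roulette_select(population, fitness)[:]
        b = roulette_select(population, fitness)[:]
        if random.random() < pc:                      # crossover
            cut = random.randrange(1, len(a))
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        for child in (a, b):
            for i in range(len(child)):               # mutation
                if random.random() < pm:
                    child[i] = 1 - child[i]
            new_pop.append(child)
    return new_pop[:len(population)]

pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
pop = next_generation(pop, fitness=lambda c: 1 + sum(c))   # toy fitness
print(len(pop), len(pop[0]))  # 20 8
```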
4.3 Evaluation Method of the Attribute Reduction

In order to evaluate the performance of the attribute subset FM selected by the GA-CFS method, which combines GA with correlation-based attribute reduction, this paper uses the C4.5 algorithm [6] together with k-fold cross validation to verify the classification performance of FM. Meanwhile, we compare this classification performance with that of the original attribute set FN. The C4.5 algorithm is an improvement of ID3 [5]. It can deal with continuous-valued attributes, missing and noisy attribute values, pruning of the decision tree, the creation of rules, etc. Its core idea is an information-entropy-based ranking strategy for attributes.

K-fold cross validation is also called rotation estimation. It randomly divides the whole case library S into k non-overlapping, equal-sized subsets (S1, S2, ..., Sk). The classification model is trained and tested k times: for each t ∈ {1, 2, ..., k}, (S - St) is the training subset and St the test subset. The cross-validation accuracy is obtained by averaging the k test accuracies:

CVA = \frac{1}{k}\sum_{i=1}^{k} A_i    (6)

where CVA denotes the cross-validation accuracy, k denotes the number of subsets used, and Ai is the accuracy on the i-th subset. In the experiment described next in this paper, k = 10 [9].
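The k-fold procedure behind Eq. (6) can be sketched as follows (a stand-in majority-class classifier is used here purely for illustration; the paper itself uses C4.5):

```python
import random

def kfold_accuracy(cases, labels, train_fn, k=10, seed=0):
    """Rotation estimation, Eq. (6): average test accuracy over k folds."""
    idx = list(range(len(cases)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k non-overlapping subsets
    accs = []
    for t in range(k):
        test = set(folds[t])
        train = [i for i in idx if i not in test]
        model = train_fn([cases[i] for i in train], [labels[i] for i in train])
        hits = sum(model(cases[i]) == labels[i] for i in test)
        accs.append(hits / len(folds[t]))
    return sum(accs) / k                    # CVA

def majority_trainer(train_cases, train_labels):
    """Stand-in for C4.5: always predict the training set's majority class."""
    maj = max(set(train_labels), key=train_labels.count)
    return lambda case: maj

data = [[i] for i in range(100)]
labels = [0] * 70 + [1] * 30
print(round(kfold_accuracy(data, labels, majority_trainer), 2))  # 0.7
```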
5 Experimental Results and Analysis

In order to evaluate the effectiveness of GA-based attribute reduction, we use GA-CFS, which combines the genetic algorithm with the correlation-based heuristic, and compare the selected attribute sets with the sets before reduction. By observing the change in the number of attributes, the change in accuracy, and the related performance values of the subsets, we can assess the performance of the algorithm proposed in this paper. Our GA-CFS approach is implemented in Java, and the experiments were conducted on a Pentium(R) 4 CPU 2.80GHz with 256MB RAM running Windows 2000. In the experiment, we select 4 data sets from the UCI ML database repository of the University of California. Detailed information is given in Table 1.

Table 1. Data sets used in the experiment

Data set        Num. of cases   Num. of attributes   Attribute deficiency (%)   Num. of classes
Anneal          798             38                   73.2                       5
Arrhythmia      452             279                  0.32                       13
Breast_cancer   286             9                    0.3                        2
Sick            3,772           30                   5.4                        2
The parameter settings of the genetic algorithm are as follows: population size N = 20, crossover probability Pc = 0.66, mutation probability Pm = 0.033, and the maximum number of iterations is 20.
We use the C4.5 algorithm to compute the classification accuracy before and after attribute reduction, and k-fold cross validation to verify the computation of the classification accuracy. The reported results are averages over 10 executions. The experimental results are given in Table 2 and Table 3.

Table 2. Comparison before and after attribute selection

Data set        Num. of attributes   Num. of attributes   Reduction of       Correlation value
                before selection     after selection      attributes (%)     of subset
Anneal          38                   11                   71.05              0.48012
Arrhythmia      279                  98                   64.87              0.07147
Breast_cancer   9                    5                    44.44              0.09672
Sick            30                   4                    86.67              0.23491
Table 3. Comparison of the classification accuracy before and after attribute selection

Data set        Accuracy before    Accuracy after     Decrease of
                selection (%)      selection (%)      accuracy (%)
Anneal          98.57              97.97              0.61
Arrhythmia      65.65              66.04              -0.59
Breast_cancer   74.28              73.08              1.62
Sick            98.72              97.39              1.35
The experimental results indicate that, in terms of attribute reduction, using GA-CFS to select the subset reduces the attributes in the 4 data sets by at least 44.44% and at most 86.67%, as shown in Table 2. The reduction in dimensionality is therefore considerable. From the change in classification accuracy of the 4 data sets after attribute reduction shown in Table 3, we can see that the accuracy of the anneal data set decreases by less than 1%, the breast_cancer and sick data sets decrease by about 1%, and the arrhythmia data set even increases.
[Figure: bar chart per data set comparing the attribute-reduction ratio (71.05, 64.87, 44.44, 86.67%) with the decrease in accuracy (0.61, -0.59, 1.62, 1.35%)]
Fig. 1. Comparison of the ratio between reduction of attributes and decrease of accuracy before and after attribute reduction
By analyzing the data sets above, we can conclude that, compared with the original attributes, the proposed attribute reduction method for optimized subset selection reduces the attributes by about 70% on average, while the accuracy decreases by only about 1%, as shown in Fig. 1. Hence, the proposed GA-CFS algorithm achieves much better outcomes: it reduces the number of attributes dramatically while hardly decreasing the classification accuracy.
6 Conclusions and Future Work

Attribute reduction and evaluation is an important task for knowledge-based systems. It can identify the attributes most related to the problems of the system, clear away irrelevant attributes, reduce the representation space of the case library, decrease the complexity of systems, and improve their performance. We have proposed a GA-CFS method that guides the evolution of the population until an approximately optimal subset is found. We have implemented the search approach with genetic operators, introducing a correlation-based subset evaluation method as the fitness function. By using the C4.5 algorithm combined with k-fold cross validation to evaluate its performance, we have concluded that the GA-CFS method can identify the subset most relevant to classification and prediction, reducing the representation space of the attributes dramatically while hardly decreasing the classification accuracy. In the future, we would like to do some benchmark work on attribute reduction in relation to theories and techniques such as Rough Sets (RS), Principal Component Analysis (PCA), and entropy-based attribute reduction. We believe this would benefit the use of the various attribute reduction methods.
References

1. Yuan, C.A., Tang, C.J., Zuo, J., et al.: Attribute reduction function mining algorithm based on gene expression programming. In: 5th International Conference on Machine Learning and Cybernetics, Aug. 13-16, Vols. 1-7 (2006) 1007-1012
2. Hsu, W.H.: Genetic wrappers for feature selection in decision tree induction and variable ordering in Bayesian network structure learning. Information Sciences, vol. 163 (2004) 103-122
3. Zhao, Y., Liu, W.Y.: GA-based feature selection method. Computer Engineering and Applications, vol. 15 (2004) 52-54
4. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence (1997) 273-324
5. Quinlan, J.R.: Induction of decision trees. Machine Learning, vol. 1, no. 1 (1986) 81-106
6. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
7. Hall, M.A.: Correlation-based feature selection for discrete and numeric class machine learning. In: Proc. of the 17th International Conference on Machine Learning (2000)
8. Zhou, M., Sun, S.D.: GA Principle and Application. National Defense Industry Press, Beijing (1999)
9. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Wermter, S., Riloff, E., Scheler, G. (eds.): The Fourteenth International Joint Conference on Artificial Intelligence (IJCAI), Morgan Kaufmann, San Francisco, CA (1995) 1137-1145
Towards High Performance and High Availability Clusters of Archived Stream* Kai Du, Huaimin Wang, Shuqiang Yang, and Bo Deng School of Computer Science, National University of Defense Technology Changsha 410073, China [email protected], [email protected], [email protected], [email protected]
Abstract. Some burgeoning web applications, such as web search engines, need to track, store and analyze massive real-time user access logs with 24*7 high availability. The traditional high-availability approaches for general-purpose transaction applications are often not efficient enough to store these high-rate, insertion-only archived streams. This paper presents an integrated approach to storing these archived streams in a database cluster and recovering them quickly. The approach is based on our simplified replication protocol and a high-performance data loading and query strategy. The experiments show that our approach achieves efficient data loading and querying, and obtains shorter recovery time than traditional database cluster recovery methods.
1 Introduction

Some burgeoning applications have appeared which need high availability and extra-high performance for data insertion operations. Records of web behavior, such as records of personal search behavior in search engines, online stock transactions or call details, are classical archived streams [11]. For instance, Google can improve users' search experiences based on Personalized Search [3]. This information should be written into a large database in real time and queried repeatedly when the user uses the search engine again. All of these archived-stream applications have the following common characteristics:

- A round-the-clock Internet company needs 24*7 high availability. However, high availability is a great challenge for a large-scale Internet company like Google, since a large number of machines are needed.
- High-rate data streams need a high-performance, near real-time record insertion method. Google processes about 4,200 requests every second [4] and needs a high-performance insertion program to record all the users' behavior.
- The recorded data can be viewed as historical data, because it will not be updated any more but only queried repeatedly after being stored.
* Supported by the National Grand Fundamental Research 973 Program of China under Grant No.2005CB321804, and the National Science Fund for Distinguished Young Scholar of China under Grant No.60625203. G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 876–883, 2007. © Springer-Verlag Berlin Heidelberg 2007
We call these applications log-intensive web applications. [11] is the first work to optimize querying over live and archived streams, but it does not study insertion performance or system availability. [14] studies the availability of an updatable data warehouse filled with rarely-updated data; it is based on the general-purpose 2PC, which is not efficient enough for high-rate archived streams. The first contribution of this paper is to optimize insertion operations by writing no online log or archived log in the databases and committing data in bulk. The second is a simple consistency protocol based on the no-update feature of the data. The third is an efficient recovery method for high-rate insertions. The remainder of this paper is organized as follows: Section 2 gives the problem statement and related work; Section 3 presents transaction processing and the consistency protocol; Section 4 introduces the recovery approach; Section 5 presents the experiments; Section 6 concludes.
2 Problem Statement and Related Work

Consider the classical log-intensive applications: while users are accessing web sites, all the users' behavior may be stored, and groups of record items are generated at all times. These record items must be stored in real time and queried by subsequent web accesses. A highly available and efficient system, such as a database cluster, needs to be built for these applications. A database cluster is m database servers, each having its own processors and disks and running a "black-box" DBMS [1]. The "Read One Write All Available" policy [2] is usually adopted: when a read request is received, it is dispatched to any one of the available nodes. In [8], bulk loading is adopted to optimize insertion performance; however, it does not address availability. The primary/secondary replica protocol [9] in commercial databases [10, 12] ships update logs from the primary to the secondary; the extra log IO decreases insertion performance in log-intensive applications. 2PC [2] keeps all replicas up to date, but has poor performance due to its force-written logs and poor recovery performance based on the complex ARIES [7, 14]. In order to avoid force-writes, ClustRa [13] uses a neighbor logging technique, in which a node logs records to main memory both locally and on a remote neighbor; HARBOR [14] avoids logs by revising the 2PC protocol, but the revised 2PC is still too complex for insertion-intensive, no-update applications. [15, 16] are not based on 2PC and propose a simple protocol, but they need to maintain an undo/redo log. The objective of this paper is to design an efficient integrated approach to the problem of high availability and high performance for these log-intensive applications. The basic idea is to insert the data in bulk without an online log in the databases, and to set a consistency fence for every table in the data processing phase.
3 Transaction Processing

All recovery approaches are based on transaction processing. This section introduces the details of insertion and query processing.
3.1 System Framework: Transaction Types and Unique External Timestamp

As discussed in Section 1, all transactions in log-intensive workloads can be classified into two types, since there are no update transactions: insertion transactions, which insert high-rate data into the databases, and query transactions, which query the massive, non-updated historical data. The following measures are adopted to reach our objectives:

1) Buffer the data and insert it into the database in bulk. The experiments show bulk insertions always outperform standard single insertions by an order of magnitude.
2) Write no online logs in the databases for insertions.
3) Insert multiple objects in parallel. The dependency between insertions on different objects can be eliminated by simply canceling the foreign key constraints.
4) Develop recovery methods that do not rely on database logs.

According to 1), a coordinator is added on top of the database cluster to buffer and insert data in bulk (Fig. 1). For every table, an insertion thread is always running; since the coordinator processes the same data more easily than any underlying database, one thread per table is enough. For a query request, a query thread dynamically starts and ends with that request. The insertion threads refresh the meta-information TF and ANS (introduced in Section 3.2), and the query threads read it on time. Another mechanism, the unique external timestamp, is designed to implement the consistency protocol. Since a record data item usually has a time field log_time, we can construct a unique id for every record by adding a field log_number, which differentiates records with the same log_time. Thus every record has a virtual unique identifier log_id obtained by binding log_time and log_number. A similar allied timestamp is also used in [14]; however, there it is generated in the database core when the insertion is committed, which destroys the autonomy of the underlying databases.
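The unique external timestamp might be built as follows (a sketch with our own naming; a per-log_time counter is one plausible way to assign log_number):

```python
from collections import defaultdict

class LogIdGenerator:
    """Assign each record a unique log_id = (log_time, log_number).
    log_number is a per-log_time counter, so records sharing the same
    log_time remain totally ordered."""
    def __init__(self):
        self._counters = defaultdict(int)

    def next_id(self, log_time):
        n = self._counters[log_time]
        self._counters[log_time] += 1
        return (log_time, n)

gen = LogIdGenerator()
print(gen.next_id(100))  # (100, 0)
print(gen.next_id(100))  # (100, 1)
print(gen.next_id(101))  # (101, 0)
# Tuples compare lexicographically, matching the intended temporal order:
print((100, 1) < (101, 0))  # True
```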
3.2 Insertion and Query Processing

The data insertion processing is illustrated in Fig. 2. The data is buffered into the input buffer B-in (① in Fig. 2), and when B-in is full, it is swapped with the output buffer B-out (② in Fig. 2). The data in B-out is then written into multiple database replicas simultaneously (③ in Fig. 2). After the insertion thread receives the replies of all replicas (④ in Fig. 2), it refreshes the Time Fence (TF) and the Available Node Set (ANS) (⑤ in Fig. 2). Only if the insertion thread meets a database replica failure does it write B-out into local log files (⑥ in Fig. 2). Before the failed replica is recovered, a group of insertion log files is maintained. The Time Fence (TF) is the log_id of the latest record inserted into the database. Every table has a TF; it is used to synchronize the query threads and insertion threads. From the above analysis, it is obvious that, unlike [14], no logs are generated on the coordinator node or the database nodes. Since the volume of the log is at least as large as the data in a database, this method saves at least 50% of the IO of the normal fashion. It is also more efficient than [15], which stores logs both on the middleware and on the database nodes. The processing of queries includes two steps. Step one rewrites the SQL: in order to synchronize the result sets of the database replicas, an extra condition on log_id is added according to the TF of every table. The rewriting rule is shown in Table 1. Thus all query threads have a uniform logical view of the data in the replicas even though the same data may not be inserted synchronously by an insertion thread. Step two dispatches the rewritten SQL to an available replica in the ANS. This can be done according to some load-balancing policy, such as the current number of requests.
[Figure: the coordinator sits above the database replicas; insertion and query threads share the TF and ANS meta-information]
Fig. 1. System Framework

[Figure: insertion data flow. Legend: ① buffer data in B-in; ② move data to B-out; ③ write data to DBs; ④ reply to manager; ⑤ refresh TF and ANS; ⑥ write logs (on failure)]
Fig. 2. Insertion Processing
Table 1. Rewriting Query Rule

Original:   SELECT tuples FROM table_a WHERE original_predicates;
Rewritten:  SELECT tuples FROM table_a WHERE original_predicates AND log_time < TF[table_a].log_time AND log_number ...
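A sketch of the rewriting step (our own helper name; the tail of the rewritten condition is truncated in Table 1, so the lexicographic fence on (log_time, log_number) below is our assumption):

```python
# Hypothetical sketch: append a time-fence predicate so a query only sees
# records at or below the table's TF. The exact boundary comparison in the
# paper's Table 1 is truncated; the lexicographic form here is an assumption.

def rewrite_query(table, predicates, tf_time, tf_number):
    fence = (f"(log_time < {tf_time} OR "
             f"(log_time = {tf_time} AND log_number <= {tf_number}))")
    return f"SELECT * FROM {table} WHERE ({predicates}) AND {fence};"

sql = rewrite_query("table_a", "user_id = 42", 1700000000, 17)
print(sql)
```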
3.3 Replication Protocol

The replication protocol keeps copies (replicas) consistent despite updates [5]. 2PC or its variation [14] can be used to synchronize the data, but its communication overhead makes it too expensive. Recently, some efficient eager replication protocols [6] can partly solve the problems of throughput and scalability, but not latency. All these general-purpose protocols are too complex for the simple transaction semantics of log-intensive workloads, and inefficient because of SQL logging and complex locking. In log-intensive workloads, the atomicity and consistency of an insertion transaction are guaranteed by a table's TF. When table_a's insertion thread receives the replies of every replica, it must wait until it obtains an exclusive (write) lock on table_a's TF; after that, it can refresh table_a's TF and the ANS. Before a query thread rewrites a query SQL, it must wait until it obtains a share (read) lock on table_a's TF. Thus committed data will not be seen until all replicas have committed it. This simply guarantees the atomicity of insertion transactions, because no query will see the data before the TF is changed.
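The fence protocol can be sketched with a per-table lock guarding the TF (a simplification with our own naming; a plain mutex stands in for the share/exclusive lock pair described above):

```python
import threading

class TimeFence:
    """Per-table fence: the insertion thread publishes a new TF only after
    every replica has acknowledged the batch; query threads read the TF to
    bound their queries. A single mutex stands in for the paper's
    share/exclusive lock pair."""
    def __init__(self):
        self._lock = threading.Lock()
        self._tf = (0, 0)  # (log_time, log_number)

    def publish(self, log_id, acks, replicas):
        # only advance the fence once all replicas have committed the batch
        if acks < replicas:
            raise RuntimeError("cannot refresh TF before all replicas ack")
        with self._lock:           # exclusive access while refreshing
            self._tf = log_id

    def read(self):
        with self._lock:           # shared access in the paper; mutex here
            return self._tf

fence = TimeFence()
fence.publish((100, 5), acks=3, replicas=3)
print(fence.read())  # (100, 5)
```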
4 Recovery Approach

The recovery approach is based on the insertion data log files (generated in step ⑥ of Fig. 2). We design a recovery algorithm at the granularity of tables. The algorithm consists of a recovery manager thread rm_thread and many recovery threads recovery_thread(node_id, table_id). The rm_thread always runs in the background and monitors which failed database needs to be recovered. If it finds one, it creates one recovery_thread for every table on that database. After a recovery_thread recovers a table, it informs the rm_thread. The recovery procedure of every recovery_thread is divided into the three phases described in Section 4.1.

4.1 Recovery from Instance Failure

(1) Phase 1: Recover from the Latest Save Point. When an insertion is pushed to a replica, the data is directly written in pieces into the data files of the database. When the database meets an instance failure, one part of the data of the insertion request has been stored in the database, while the other part, in memory, is lost. In order to keep the stored data and avoid duplicating it, we need the log_id of the latest stored record, which we call "the latest save point (LSP)". The LSP can be obtained with this standard SQL clause:

SELECT MAX(log_time), MAX(log_number) INTO LSP.log_time, LSP.log_number FROM table_a;
Just as mentioned in Section 3.2, we can leverage the oldest insertion log file of the log group. The pseudo code is:

LOAD DIRECT FILE = the oldest file of table_id WHERE log_time ≤ LSP.log_time AND log_number < LSP.log_number;

Thus all the data left in the oldest insertion log file is loaded into the recovering database. The other insertion log files can then be directly loaded into the database. From the above procedure, we can see that both the recovery of multiple tables in one database and the recovery of multiple failed databases can be done in parallel.

(2) Phase 2: Catch Up with the Data Logs. This phase is a subsequent and simpler step. The pseudo code is:

LOAD DIRECT FILE = other files of table_id;

In this phase, we can optimize the recovery by merging several small files into big ones. This improves recovery performance by decreasing the number of accesses to the recovering database. The size of every merged file is determined by the network, disk and CPU load on both sides. The effect of merging is shown in Section 5.

(3) Phase 3: Catch Up with the Current Insertion. After loading all the log files of table_id, the recovery_thread informs the rm_thread and the insertion_thread(table_id). The insertion_thread then pushes the current insertion to the database on node_id. After the insertion_thread has completed this insertion, it refreshes the TF of table_id and adds the recovered database into the ANS of table_id. From that time on, insertion and query transactions can be sent to the table table_id of the recovered database.
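The three phases can be sketched as a single per-table recovery routine (our own simplification: loading a file is abstracted into a callback, log records are log_id tuples, and the merge threshold is a record count):

```python
# Hypothetical sketch of recovery_thread(node_id, table_id): replay the
# oldest log file from the latest save point (phase 1), bulk-load the
# remaining files merged into larger batches (phase 2), then rejoin the
# live insertion stream (phase 3).

def recover_table(log_files, lsp, load_fn, rejoin_fn, merge_size=2):
    # Phase 1: from the oldest file, load only records past the save point.
    oldest, rest = log_files[0], log_files[1:]
    load_fn([rec for rec in oldest if rec > lsp])
    # Phase 2: merge small files into bigger batches to cut DB round trips.
    batch = []
    for f in rest:
        batch.extend(f)
        if len(batch) >= merge_size:
            load_fn(batch)
            batch = []
    if batch:
        load_fn(batch)
    # Phase 3: inform the insertion thread so the replica rejoins the ANS.
    rejoin_fn()

loaded, joined = [], []
files = [[(1, 0), (1, 1), (2, 0)], [(3, 0)], [(4, 0)]]
recover_table(files, lsp=(1, 1), load_fn=loaded.extend,
              rejoin_fn=lambda: joined.append(True))
print(loaded)  # [(2, 0), (3, 0), (4, 0)]
print(joined)  # [True]
```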
Once all recovering databases have recovered the table table_id, the insertion_thread(table_id) no longer writes log files.

4.2 Recovery from Media Failure

When a database meets a media failure, e.g., some data files cannot be read or written, the recovery procedure is implemented in the following two steps: 1) recover the data files based on the current and historical partitions (not introduced in detail here); 2) recover the instance as in Section 4.1.
5 Experiments

In these experiments, a database cluster of three database nodes and one coordinator node is built. All four nodes have two Xeon 2G CPUs, 4G RAM and two 70G SCSI disks, and run Redhat AS 3.0. The three database nodes are installed with Oracle 10.1.0.4, and all the code is written in GNU C++. The experimental data comes from the access records of a commercial search engine; a record item's size is about 329 bytes.

5.1 Runtime and Recovery Performance

Figures 3 and 4 show runtime performance; Figures 5 and 6 show recovery performance. From Fig. 3, we can draw three conclusions: 1) the optimized loading's performance is 50-100 times that of the standard INSERT SQL and the 2PC used in [14]; 2) when a database node writes online and archived logs, only online logs, or no logs, the average time ratio is about 1.43:1.14:1; 3) the insertion time is proportional to the size of the data. In Fig. 4, the bulk size is 80MB and the time is the average processing time under multiple users. Three scenarios are simulated: writing online logs on the databases and the coordinator (which happens when a database node fails), writing online logs on the databases only, and writing no logs. The ratio is 1.28:1.11:1.
[Figure: insertion time vs. bulk size for no log, online log, online & archived log, and conventional INSERT]
Fig. 3. Insertion performance and bulk size

[Figure: average insertion time vs. number of concurrent users for db & coordinator log, db log, and no log]
Fig. 4. Insertion performance and number of users
In Fig. 5, we compare the classical ARIES recovery method with ours. The results show that when the recovered data size is less than 4.5MB, ARIES is better, but after that point our method performs better. When the recovered data size is small, the startup cost of our method exceeds that of ARIES; later, the complexity of ARIES leads to a long recovery time. Fig. 6 shows the time of the three recovery phases: the startup time in phase 1 and the catching-up time in phase 3 are constant, while the loading time in phase 2 is proportional to the amount of data to be recovered.

5.2 Performance During Failure and Recovery

The transaction processing performance during database failure and recovery is another problem to be discussed. In Fig. 7, the x-axis is time, the left y-axis is the insertion performance in MB/s, and the right y-axis is the query performance in completed transactions per second.
[Figure: recovery time vs. recovered data size for our method and ARIES]
Fig. 5. Recovery performance and recovered data size

[Figure: recovery time of phases 1-3 vs. recovered data size]
Fig. 6. Decomposition of recovery time

[Figure: insertion performance (MB/s) and query performance (TRXs/s) over time through the normal phase, db crash, db restart, recovery phases 1 & 2, phase 3 and db online]
Fig. 7. Transaction processing performance during failure and recovery
Before the 10th second, the system runs in the normal state. At the 10th second, one of the three databases fails; the coordinator detects this, and the DBA restarts the database at the 15th second. During this period, the insertion performance decreases a little, by about 13%, because the log files must be stored on the coordinator's disk, while the query performance decreases by about 31% because one of the three nodes cannot process query requests. From the 15th second to the 25th second, recovery phases 1 and 2 complete, and the performance stays as at the 15th second because the recovery does not decrease the online performance. From the 26th to the 27th second, phase 3 completes, and the performance returns to the normal level. From Fig. 7, we can see that there is no sharp performance degradation, because other transactions are not interrupted when one database fails.
6 Conclusion

In this paper we have studied the problem of how to store and recover high-rate archived streams in a database cluster. For log-intensive applications, we present an optimized data insertion method based on reducing the disk IO cost, together with a simple and efficient consistency protocol. The experimental results show that our approach achieves efficient data loading and querying and obtains shorter recovery time than traditional database cluster recovery methods.
References

1. S. Gançarski, H. Naacke, E. Pacitti, P. Valduriez: Parallel Processing with Autonomous Databases in a Cluster System. CoopIS, 2002.
2. J. Gray, A. Reuter: Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1992.
3. Google Personalized Search. http://www.google.com/psearch
4. http://news.com.com/Google,+eBay+Strategic+bedfellows/2100-1024_3-6110304.html
5. J. Gray, P. Helland, P. O'Neil, D. Shasha: The Dangers of Replication and a Solution. ACM SIGMOD, 1996.
6. M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, G. Alonso: Transaction Replication Techniques: a Three Parameter Classification. SRDS, 2000.
7. C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, P. Schwarz: ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM TODS, 17(1):94-162, 1992.
8. Y. D. Cai, R. Aydt, R. J. Brunner: Optimized Data Loading for a Multi-Terabyte Sky Survey Repository. Super Computing, 2005.
9. B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira: Replication in the Harp file system. SOSP, pages 226-238. ACM Press, 1991.
10. Microsoft Corp.: Log shipping. http://www.microsoft.com/technet/prodtechnol/sql/2000/reskit/part4/c1361.mspx
11. S. Chandrasekaran, M. Franklin: Remembrance of Streams Past: Overload-Sensitive Management of Archived Streams. VLDB, 2004.
12. Oracle Inc.: Oracle Database 10g Oracle Data Guard. http://www.oracle.com/technology/deploy/availability/htdocs/DataGuardOverview.html
13. S.-O. Hvasshovd, Ø. Torbjørnsen, S. E. Bratsberg, P. Holager: The ClustRa telecom database: High availability, high throughput, and real-time response. VLDB, 1995.
14. E. Lau, S. Madden: An Integrated Approach to Recovery and High Availability in an Updatable, Distributed Data Warehouse. VLDB, 2006.
15. R. Jiménez-Peris, M. Patiño-Martínez, G. Alonso: An algorithm for non-intrusive, parallel recovery of replicated data and its correctness. SRDS, 2002.
16. B. Kemme: Database Replication for Clusters of Workstations. PhD dissertation, Swiss Federal Institute of Technology, Zurich, Germany, 2000.
Continuously Matching Episode Rules for Predicting Future Events over Event Streams

Chung-Wen Cho (Department of Computer Science, National Tsing Hua University, Taiwan, R.O.C.), Ying Zheng (Department of Computer Science, Fudan University, China), and Arbee L.P. Chen (Department of Computer Science, National Chengchi University, Taiwan, R.O.C.)
[email protected]
Abstract. Predicting future events has great importance in many applications. Generally, rules with predicate events and consequent events are mined, and current events are then matched with the predicate ones to predict the occurrence of the consequent events. Many previous works focus on the rule mining problem; however, little emphasis has been placed on the problem of matching predicate events. As events often arrive in a stream, designing an efficient and effective event predictor becomes challenging. In this paper, we give a clear definition of this problem and propose our own method. We develop an event filter and incrementally maintain parts of the matching results. By running a series of experiments, we show that our method is efficient and effective in the stream environment. Keywords: Continuous query, episode, event stream, prediction.
1 Introduction In many applications, events such as specific TCP connections in an intrusion detection system [10] are recorded for predicting future events. Generally speaking, there are two steps in the event prediction problem. The first step is to derive event associations, represented as rules, from the past events. The second is to use the discovered rules to predict future events given a recent record of events. We now explain these two steps by an example and show the motivation of our work. Fig. 1 shows an example of a discovered rule in the form α⇒β, where α is called the predicate and β the consequent. α and β are both represented by directed acyclic graphs, where each vertex represents an event, and each edge from vertex v to vertex u indicates that the event corresponding to vertex v should occur before that corresponding to vertex u. To be specific, according to the predicate α in Fig. 1, event a should precede events b and c, and event b should precede event d. Additionally, there are two time bounds, associated with the rule and with the predicate, respectively. For example, in Fig. 1, if all the events occur within the time bound of 7 time units in accord with the specified temporal orders in the predicate, we can predict that all the events in the rule will, with a certain probability, appear within 11 timestamps
G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, pp. 884–891, 2007. © Springer-Verlag Berlin Heidelberg 2007
according to their temporal orders indicated in the rule. The first step of the event prediction problem mines such rules with two time bounds (called episode rules) from the past events. In the second step, whether all the events in the predicate have appeared according to their specified partial orders within the time bound (denoted the rule matching problem) is determined in order to predict the occurrence of the events in the corresponding consequent. For example, suppose the events arriving from timestamp 1 to 9 are as depicted in Fig. 2, and we are to match the episode rule in Fig. 1. Notice that events a, b, c, and d occur within the time interval [3,8), which satisfies the temporal constraint in the predicate. Thus, we should raise an alarm that the consequent, event f, may occur within the time interval [8,14) with a certain probability. We refer to the occurrences of the events matching the predicate as a predicate episode occurrence.

[Fig. 1. An episode rule: the predicate is a DAG with edges a→b, a→c, and b→d and time bound 7; the consequent is event f, and the rule's time bound is 11.]

[Fig. 2. A stream of events: (a,1), (d,2), (a,3), (d,4), (b,5), (c,6), (d,7), (c,8), (d,9).]
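The running example can be sketched in code; the names below are illustrative, not from the paper, and the computation simply instantiates the predicted interval [te, ts+ωαβ) described above.

```python
# The predicate episode of Fig. 1 (edges a->b, a->c, b->d), with the
# predicate time bound 7 and the rule time bound 11; the stream of Fig. 2.
predicate_edges = {("a", "b"), ("a", "c"), ("b", "d")}
w_alpha, w_alphabeta = 7, 11
stream = [("a", 1), ("d", 2), ("a", 3), ("d", 4), ("b", 5),
          ("c", 6), ("d", 7), ("c", 8), ("d", 9)]

def predicted_interval(occurrence, w_ab):
    """For a predicate occurrence with interval [ts, te), the consequent
    is predicted to occur within [te, ts + w_ab)."""
    ts = min(t for _, t in occurrence)
    te = max(t for _, t in occurrence) + 1   # half-open end time
    return (te, ts + w_ab)

# the occurrence {(a,3), (b,5), (c,6), (d,7)} predicts f in [8, 14)
print(predicted_interval([("a", 3), ("b", 5), ("c", 6), ("d", 7)],
                         w_alphabeta))      # -> (8, 14)
```

With the other occurrence {(a,1), (b,5), (c,6), (d,7)}, the same computation gives the smaller interval (8, 12), matching the redundancy discussed in Section 1.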
The discovery of episode rules has been widely discussed over the past few years [6], [7]. However, little attention has been given to solving the important phase of episode rule matching. Since events arrive as streams in all the applications mentioned above, efficiently matching a number of episode rules in this environment becomes an important and difficult task. In this paper, we address the problem of continuously matching episode rules over a stream of events. The main challenges of this problem are stated below. 1) Many predicate episode occurrences of a rule can exist simultaneously by sharing the same occurrences of events over the stream. However, only the occurrences that give non-repetitive predictions of the occurrences of the consequents are what we are concerned with. Take the episode rule in Fig. 1 and the stream of events in Fig. 2 as an example. From the two predicate episode occurrences, {(a,1), (b,5), (c,6), (d,7)} and {(a,3), (b,5), (c,6), (d,7)} (we use (e,t) to denote the event e with occurring time t), we predict [8,12) and [8,14) as the occurring time interval of event f, respectively. Since the predicted interval [8,12) is included in [8,14) and becomes trivially redundant, the occurrence {(a,1), (b,5), (c,6), (d,7)} can be ignored. 2) The structure of the episodes can be complex. High precision should be emphasized to effectively deal with all the possible combinations of events within the specified time bounds when matching the episodes. 3) There are a large number of predicate episodes in different rules to be matched simultaneously. Moreover, events usually come in bursts, and there is only limited time to perform the matches for all the rules. A prompt episode detector is hence required. Our problem is related to three research topics. 1) Mining graph patterns from event or graph data sets [5], [6], [11].
The goals of these papers are essentially different from ours, since we aim at continuous queries while they target the mining process. 2) Efficient graph indexing for pattern searching [3], [4]. Nevertheless, all these methods are applied to static graph database searching, which is very different from our work of continuous retrieval in the streaming environment. 3) Graph filtering in the stream environment [8], [9] and querying temporal relations over DBMSs [2]. These works are similar to ours. However, we retrieve episodes within the specified time bounds and must avoid repetitive reports of predicted intervals, so we cannot directly apply these algorithms to our problem. In this paper, we give a clear definition of the rule matching problem for event prediction, where the concepts of minimal episode occurrence, latest episode occurrence, and rejected event occurrence are introduced to address the first challenge mentioned above. With the constraints in our problem definition, the retrieval of only user-required episode occurrences is assured. We then propose the method ToFel to solve this problem. ToFel makes use of the topological characteristics of the predicate episode and develops its own pruning criteria. More specifically, ToFel finds the predicate episode occurrences by incrementally maintaining parts of the user-required episode occurrences, and thus avoids backward scans of the stream. It constructs one event filter for each predicate episode to be matched. The filters continuously monitor the newly arrived events and keep only those which are likely to be parts of predicate episode occurrences. By running a series of experiments with respect to different scales and distributions of the query set and the stream, we show that ToFel is efficient and effective in the stream environment. The remainder of the paper is organized as follows. Section 2 gives a detailed description of the problem statement. Section 3 presents our rule matching algorithm. The experimental results are discussed in Section 4.
We give the conclusion and future directions of our work in Section 5.
2 Problem Statement The episode is a widely used representation for the associations of events. In this section, we first give the definitions related to the episode, and then present the basic concepts concerning the rule matching problem. Episode: An episode is a directed acyclic graph g, where each vertex corresponds to an event, and each directed edge (u,v) indicates that the event corresponding to u must precede that corresponding to v. We call this precedence a temporal and transitive order p between vertex u and vertex v. Denote V(g) the vertex set, E(g) the edge set, and ε(v) the event corresponding to vertex v. A sink of the graph g is defined as a vertex with out-degree 0. For convenience, we focus on episodes whose vertices correspond to distinct events. However, our techniques can be extended to episodes containing two or more vertices corresponding to an identical event. Episode occurrence: An event stream can be represented as Ŝ = <(a1,t1), (a2,t2), …, (an,tn), …>, where (ai,ti) represents that event ai occurs at time ti, i = 1, 2, …, n, …, and ti ≤ ti+1 for all i. An event sequence S = <(a'1,t'1), (a'2,t'2), …, (a'm,t'm)> over Ŝ is a subsequence of Ŝ, where t'1 < t'2 < … < t'm. We define
the start time and end time of S as t'1 and t'm+1, respectively. Given an episode α with its time bound ωα, an episode occurrence (or simply occurrence) of α over Ŝ is an event sequence S with time interval [ts,te) satisfying: 1) there is a one-to-one mapping between the events of S and the vertices of α such that each event (a'j,t'j) corresponds to a vertex v with ε(v) = a'j; 2) for every edge (u,v) in E(α), the event corresponding to u occurs before that corresponding to v; and 3) ts and te are the start time and end time of S, with te − ts ≤ ωα.
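A brute-force check of the occurrence conditions can be sketched as follows; this is a hypothetical helper, under the paper's simplifying assumption that each vertex is labeled by a distinct event name.

```python
def is_occurrence(events, edges, vertices, w):
    """Check whether `events` (one (event, time) pair per vertex) is an
    occurrence of the episode within time bound w."""
    time_of = dict(events)
    # 1) every vertex is matched by exactly one event of the sequence
    if len(time_of) != len(events) or set(time_of) != set(vertices):
        return False
    # 2) every edge (u, v) demands that u's event precede v's event
    if any(time_of[u] >= time_of[v] for u, v in edges):
        return False
    # 3) the occurrence must fit in the time bound: te - ts <= w,
    #    with te = (latest time) + 1 for the half-open interval
    ts, te = min(time_of.values()), max(time_of.values()) + 1
    return te - ts <= w

edges = {("a", "b"), ("a", "c"), ("b", "d")}   # the predicate of Fig. 1
print(is_occurrence([("a", 3), ("b", 5), ("c", 6), ("d", 7)],
                    edges, {"a", "b", "c", "d"}, 7))   # -> True
```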
Episode Rule: An episode rule R is a 5-tuple (α, β, ωα, ωαβ, conf). Here, α and β are episodes representing the predicate and consequent of R, respectively. ωα and ωαβ (ωα < ωαβ) correspond to the time bounds of α and αβ, respectively, where αβ is the episode satisfying that each vertex in α p any vertex in β. The interpretation of R is that if α has an occurrence O with interval [ts,te), β will occur during the interval [te,ts+ωαβ) with probability conf. We denote [te,ts+ωαβ) as the predicted interval of the occurrence O. Given a set of episode rules, our problem is to continuously retrieve the episode occurrences of each predicate α within its time bound ωα over the event stream, and to give non-repetitive information about the predicted intervals for the consequent β. We now introduce the concepts of mi-latest occurrence and rejected event occurrence, and give a clear definition of the rule matching problem. Definition 1. Minimal occurrence. A minimal occurrence O of a predicate episode α is an occurrence with predicted interval [ts1,te1) satisfying that there does not exist any other occurrence of α with predicted interval [ts2,te2) s.t. ts1 ≤ ts2 and te1 ≤ te2. Definition 2. Latest event occurrence. (e,t) is the latest occurrence of an event e in a time interval [ts,te) if t is the largest occurring time of e within [ts,te) on the event stream Ŝ. Definition 3. Latest occurrence. An occurrence O of α is called the latest occurrence of α in [ts,te) if both of the following conditions hold: 1) let vj1, vj2, …, vjx be the sink vertices of α; (ε(vy),ty) is the latest occurrence of ε(vy) in the time interval [ts,te), y = j1, …, jx; 2) for a non-sink vertex vk of α, let vk1, vk2, …, vkm be the children of vk; (ε(vk),tk) is the latest occurrence of ε(vk) in the interval [ts, min{tk1, …, tkm}).
Property 1. Let O be a latest occurrence of a predicate episode α with time interval [ts,te). If there exist minimal occurrences of α with end time equal to te, O is one of the minimal occurrences of α. For example, consider Fig. 1 and Fig. 2, where the latest occurrences of events a, b, c, and d in the interval [1,8) are (a,3), (b,5), (c,6), and (d,7), respectively. The latest occurrence of the predicate episode in the interval [1,8) is the occurrence O = <(a,3), (b,5), (c,6), (d,7)>, which is also a minimal occurrence. Definition 4. Mi-latest occurrence. The mi-latest occurrence of a predicate episode α is defined as an occurrence which is both a minimal occurrence of α and a latest occurrence of α. We define the rule matching problem by the concept of mi-latest occurrence: the rule matching problem is to give the predicted intervals of only the mi-latest occurrences to the user. Definition 5. The rejected event occurrence. Given a predicate episode α with vertices v1, v2, …, vn and its mi-latest occurrence O = <(ε(v1),t1), (ε(v2),t2), …, (ε(vn),tn)> on the event stream Ŝ, the rejected event occurrences deduced from O are defined recursively as follows: 1) (ε(v1),t1) is a rejected event occurrence; 2) let vi1, vi2, …, vim be the children of vi, 1 ≤ i, i1, i2, …, im ≤ n; if (ε(vi),ti) is a rejected event occurrence, then (ε(vij),tij) is a rejected event occurrence, ∀j = i1, i2, …, im, if there is no occurrence of ε(vi) in the interval (ti,tij). The essence of the rejected event occurrences deduced from the mi-latest occurrence O is that they cannot be part of any other mi-latest occurrence that appears later than O. Lemma 1. Given a latest occurrence O = <(e1,t1), (e2,t2), …, (en,tn)> of a predicate episode α, if (ei,ti) is not a rejected event occurrence, ∀1≤i≤n, then O is a minimal occurrence of α (for the detailed proofs of the lemmas in this paper, please refer to our technical report [1]).
To conclude this section, only the latest occurrences containing no rejected event occurrences are the mi-latest occurrences we are looking for. This is the basis of our approach, whose correctness is guaranteed as long as such occurrences are always targeted during the rule matching process.
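A brute-force sketch of computing the latest occurrence in a window [ts, te) follows directly from the definitions above: each sink takes the latest occurrence of its event in the window, and each non-sink vertex takes the latest occurrence of its event strictly before the earliest time chosen for its children. The function below is our illustration, not the paper's algorithm, and again assumes vertices are labeled by their distinct event names.

```python
def latest_occurrence(stream, edges, vertices, ts, te):
    """Compute the latest occurrence of an episode in window [ts, te)
    by resolving children before their parents."""
    children = {v: [u for (p, u) in edges if p == v] for v in vertices}
    times = {}

    def resolve(v):
        if v not in times:
            # a vertex may only use times before all of its children
            bound = min(map(resolve, children[v]), default=te)
            cand = [t for (e, t) in stream if e == v and ts <= t < bound]
            if not cand:
                raise ValueError(f"no occurrence of {v} in the window")
            times[v] = max(cand)
        return times[v]

    for v in vertices:
        resolve(v)
    return sorted(times.items(), key=lambda p: p[1])

# The running example: the episode of Fig. 1 over the stream of Fig. 2
stream = [("a", 1), ("d", 2), ("a", 3), ("d", 4), ("b", 5),
          ("c", 6), ("d", 7), ("c", 8), ("d", 9)]
print(latest_occurrence(stream, {("a", "b"), ("a", "c"), ("b", "d")},
                        {"a", "b", "c", "d"}, 1, 8))
# -> [('a', 3), ('b', 5), ('c', 6), ('d', 7)]
```

This reproduces the latest occurrence O = <(a,3), (b,5), (c,6), (d,7)> of the example in the interval [1,8).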
3 The Proposed Approach: ToFel In the following, we present ToFel for matching a given episode rule R = (α, β, ωα, ωαβ, conf). ToFel builds queues of event occurrences that are likely to be parts of mi-latest occurrences of α and maintains them at each timestamp. We first discuss which event occurrences should be kept in the queues, and under which conditions the stored occurrences should be removed from the queues while continuously monitoring the stream. For each vertex of α, we implement a queue to store its corresponding event occurrences. Let Qv be the queue for vertex v∈V(α). Intuitively, when any event ε(v) arrives at time t, we should keep this event occurrence, as it may contribute to a mi-latest occurrence of α together with event occurrences arriving later. As time passes and more and more events arrive, we maintain the queues and keep only the useful occurrences. Since the queues are to store only the occurrences likely to contribute to the results, the occurrences whose occurring time t' satisfies t' + ωα ≤ t (the current time) should be removed. In this condition, the maintenance of the queue is invoked; we call this kind of invocation of queue maintenance a time-out invocation. Besides, as suggested in Definition 5 and Lemma 1, once we find a mi-latest occurrence, we should adjust the queues by removing the rejected event occurrences. This condition is called a rejected-event invocation. Both invocation forms are important for the correctness of our answer as well as for space saving. Definition 6. The nearest parent occurrence. Given any two event occurrences (ε(v),t) and (ε(u),t'), where u, v∈V(α), if v is a parent of u, t < t', and there is no other occurrence of ε(v) in the interval (t,t'), then (ε(v),t) is called the nearest parent occurrence of (ε(u),t').
Property 2. Let v be a sink vertex of episode α. There is at most one occurrence (ε(v),t) kept in Qv, and (ε(v),t) is the latest occurrence of ε(v) so far. Lemma 2. Let v1, v2, …, vn be the vertices of episode α and vi, vi+1, …, vn be the sinks of α. If there is a mapping occurrence of vj kept in Qvj, ∀i≤j≤n, there must exist a mi-latest occurrence O of α with interval [t1,tn+1), and O must be <(ε(v1),t1), (ε(v2),t2), …, (ε(vn),tn)>, where tn+1−t1≤ωα, and (ε(vk),tk) is the 1st element in Qvk, ∀1≤k≤n. The correctness of ToFel can be proved as follows. Whenever a new mi-latest occurrence exists, its last element must correspond to a sink of α. Therefore, when each sink occurrence comes, we check whether there exists a mi-latest occurrence by Lemma 2. Moreover, we can prove that the time complexity of ToFel is O(n) [1].
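The time-out invocation above can be illustrated with a minimal per-vertex queue; this is our sketch (class name and interface are ours, not the paper's), covering only time-out pruning and not the rejected-event invocation, under the convention that an entry at time t' can no longer fit any window of length ωα once t' + ωα ≤ t.

```python
from collections import deque

class VertexQueue:
    """A queue of candidate event occurrences for one vertex of the
    predicate episode, with stale entries dropped on demand."""
    def __init__(self, w_alpha):
        self.w = w_alpha
        self.q = deque()            # (event, time) pairs in time order

    def push(self, event, t):
        self.q.append((event, t))

    def expire(self, now):
        # time-out invocation: an occurrence at time t' cannot belong
        # to any window [ts, te) with te - ts <= w once t' + w <= now
        while self.q and self.q[0][1] + self.w <= now:
            self.q.popleft()

q = VertexQueue(w_alpha=7)
q.push("a", 1)
q.push("a", 3)
q.expire(now=8)        # (a,1) is dropped since 1 + 7 <= 8
print(list(q.q))       # -> [('a', 3)]
```

Because occurrences arrive in time order, each entry is appended and removed at most once, which is consistent with the constant per-event maintenance cost the paper claims.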
4 Experimental Results In this section, we evaluate the performance of DirectMatch [1] and ToFel through a series of experiments on synthetic data produced by the synthetic data generator [1]. We vary several parameters to evaluate the running time of our method as well as its scalability with respect to the structure of the episode and the size of the dataset. For the parameter settings, please refer to [1]. Fig. 3 shows the average execution time at each timestamp with respect to the number of episode rules to be matched. Although the time increases with the number of queries, ToFel always outperforms DirectMatch, and the growth in the running time of ToFel is significantly smaller than that of DirectMatch. This can be explained by the fact that when matching an episode, ToFel considers only the events likely to form the mi-latest occurrences, while DirectMatch repeatedly retrieves the kept events, many of which are not even relevant to the episode. We also show the performance with respect to the number of vertices in the episode in Fig. 4. The result shows a slow increase in CPU time as well as the smaller time requirement of ToFel compared with DirectMatch. Finally, we compare the scalability of the two approaches with respect to the size of the event stream. As shown in Fig. 5, both approaches have a constant average running time at each timestamp no matter how the size of the stream changes.

[Fig. 3. Running time (10⁻³ sec) for different query numbers (0.5K–5K). Fig. 4. Running time (10⁻³ sec) for different AveVertex values (11–20). Fig. 5. Running time (10⁻³ sec) for different dataset sizes (100K–1000K events). Each figure compares DirectMatch and ToFel.]
5 Conclusion and Future Work In this paper, we propose a novel, deterministic, and efficient approach to continuously match episode rules over event streams for predicting future events. We introduce the concepts of mi-latest occurrence and rejected event occurrence such that no repetitive predicted intervals are reported. Besides, we build and continuously maintain queues of the events which are likely to contribute to the desired occurrences, updating them efficiently whenever a new event arrives. This leads to a prompt reaction toward the desired reports of episode occurrences, even when events burst at one timestamp. Moreover, a series of experiments demonstrates the high performance of our approach in processing time as well as its stability with respect to the number of queries, the number of vertices, and the size of the event stream. For future work, we will focus on utilizing the common substructures among the predicate episodes so as to process a batch of them simultaneously and more efficiently. Acknowledgments. This work was partially supported by the NSC Program for Advanced Technologies and Applications for Next Generation Information Networks (II) under grant number NSC 95-2752-E-007-004-PAE, and by the NSC under contract number 95-2627-E-004-002-.
References
1. Cho, C.W., Zheng, Y., Chen, A.L.P.: Continuously Matching Episode Rules for Predicting Future Events over Event Streams. Tech. Report CS-1006-05, Department of Computer Science, National Tsing Hua University, October 2006.
2. Chomicki, J.: History-less Checking of Dynamic Integrity Constraints. In: Proceedings of the 8th International Conference on Data Engineering, 1992, 557-564.
3. Giugno, R., Shasha, D.: GraphGrep: A Fast and Universal Method for Querying Graphs. In: 16th International Conference on Pattern Recognition, 2002, 112-115.
4. He, H., Singh, A.K.: Closure-Tree: An Index Structure for Graph Queries. In: Proceedings of the 22nd International Conference on Data Engineering, 2006, p. 38.
5. Hsieh, C.E., Wu, Y.H., Chen, A.L.P.: Discovering Frequent Tree Patterns over Data Streams. In: Proceedings of the 6th SIAM International Conference on Data Mining, 2006.
6. Mannila, H., Toivonen, H., Verkamo, A.I.: Discovering Frequent Episodes in Sequences. In: Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining, 1995, 210-215.
7. Mannila, H., Toivonen, H., Verkamo, A.I.: Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery, 1(3), 1997, 259-289.
8. Olteanu, D., Kiesling, T., Bry, F.: An Evaluation of Regular Path Expressions with Qualifiers against XML Streams. In: Proceedings of the 19th International Conference on Data Engineering, 2003, 702-704.
9. Peng, F., Chawathe, S.S.: XPath Queries on Streaming Data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003, 431-442.
10. Qin, M., Hwang, K.: Frequent Episode Rules for Internet Anomaly Detection. In: IEEE International Symposium on Network Computing and Applications, 2004, 161-168.
11. Yan, X., Han, J.: gSpan: Graph-Based Substructure Pattern Mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining, 2002, 721-724.