Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5776
Willem Jonker and Milan Petković (Eds.)
Secure Data Management
6th VLDB Workshop, SDM 2009
Lyon, France, August 28, 2009
Proceedings
Volume Editors

Willem Jonker
Philips Research Europe
High Tech Campus 34, 5656 AE Eindhoven, The Netherlands
and
University of Twente, Department of Computer Science
P.O. Box 217, 7500 AE Enschede, The Netherlands
E-mail: [email protected]

Milan Petković
Koninklijke Philips Electronics N.V., Philips Research Laboratories
High Tech Campus 34, 5656 AE Eindhoven, The Netherlands
E-mail: [email protected]
Library of Congress Control Number: 2009933478
CR Subject Classification (1998): E.3, H.2.7, K.4.4, K.6.5, C.2
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-04218-X Springer Berlin Heidelberg New York
ISBN-13 978-3-642-04218-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12752188 06/3180 543210
Preface
New emerging technologies place new requirements on security and data management. As data become accessible anytime and anywhere, it becomes much easier to gain unauthorized access to them. Furthermore, the use of new technologies has brought some privacy concerns: it has become simpler to collect, store, and search personal information, thereby endangering people's privacy. Research in secure data management is therefore gaining importance, attracting the attention of both the data management and the security research communities. The interesting problems range from traditional topics, such as access control and general database security, via privacy protection, to new research directions, such as cryptographically enforced access control and encrypted databases.

This year, the call for papers attracted 24 papers from both universities and industry. For presentation at the workshop, the Program Committee selected 10 full papers (41% acceptance rate). These papers are collected in this volume, which we hope will serve as useful research and reference material.

The papers in the proceedings are grouped into three sections. The first section focuses on database security, which remains an important research area. The papers in this section address several interesting topics, including query optimization in encrypted databases, database provenance, database intrusion detection, and confidence-policy-compliant query evaluation. The second section changes the focal point to the topic of access control. The papers in this section deal with provenance access control, an access control model for collaborative editors, self-modifying access control policies, and enforcing access control on XML documents. The third section focuses on privacy protection, addressing the privacy issues around location-based services and anonymity/diversity for the micro-data release problem.

We wish to thank all the authors of submitted papers for their high-quality submissions. We would also like to thank the Program Committee members as well as the additional referees for doing an excellent review job. Finally, let us acknowledge Luan Ibraimi, who helped in the technical preparation of the proceedings.

July 2009
Willem Jonker
Milan Petković
Organization
Workshop Organizers
Willem Jonker, Philips Research/University of Twente, The Netherlands
Milan Petković, Philips Research/Eindhoven University of Technology, The Netherlands
Program Committee
Gerrit Bleumer, Francotyp-Postalia, Germany
Ljiljana Brankovic, University of Newcastle, Australia
Sabrina De Capitani di Vimercati, University of Milan, Italy
Ernesto Damiani, University of Milan, Italy
Eric Diehl, Thomson Research, France
Lee Dong Hoon, Korea University, Korea
Jeroen Doumen, Irdeto, The Netherlands
Csilla Farkas, University of South Carolina, USA
Eduardo Fernández-Medina, University of Castilla-La Mancha, Spain
Elena Ferrari, University of Insubria, Italy
Simone Fischer-Hübner, Karlstad University, Sweden
Tyrone Grandison, IBM Almaden Research Center, USA
Dieter Gollmann, Technische Universität Hamburg-Harburg, Germany
Marit Hansen, Independent Centre for Privacy Protection, Germany
Min-Shiang Hwang, National Chung Hsing University, Taiwan
Mizuho Iwaihara, Kyoto University, Japan
Sushil Jajodia, George Mason University, USA
Ton Kalker, HP Labs, USA
Marc Langheinrich, Università della Svizzera italiana (USI), Switzerland
Nguyen Manh Tho, Vienna University of Technology, Austria
Nick Mankovich, Philips Medical Systems, USA
Sharad Mehrotra, University of California at Irvine, USA
Stig Frode Mjølsnes, Norwegian University of Science and Technology, Norway
Eiji Okamoto, University of Tsukuba, Japan
Sylvia Osborn, University of Western Ontario, Canada
Günther Pernul, University of Regensburg, Germany
Birgit Pfitzmann, IBM Watson Research Lab, Switzerland
Bart Preneel, KU Leuven, Belgium
Kai Rannenberg, Goethe University Frankfurt, Germany
Andreas Schaad, SAP Labs, France
Nicholas Sheppard, University of Calgary, Canada
Jason Smith, Queensland University of Technology, Australia
Morton Swimmer, John Jay College of Criminal Justice/CUNY, USA
Clark Thomborson, University of Auckland, New Zealand
Sheng Zhong, State University of New York at Buffalo, USA
Additional Referees
Hans Hedbom, Karlstad University, Sweden
Luan Ibraimi, Twente University, The Netherlands
Leonardo Martucci, Karlstad University, Sweden
Mike Radmacher, Goethe University Frankfurt, Germany
Falk Wagner, Goethe University Frankfurt, Germany
Lei Zhang, George Mason University, USA
Table of Contents
Database Security

Query Optimization in Encrypted Relational Databases by Vertical Schema Partitioning ..... 1
Mustafa Canim, Murat Kantarcioglu, and Ali Inan

Do You Know Where Your Data’s Been? – Tamper-Evident Database Provenance ..... 17
Jing Zhang, Adriane Chapman, and Kristen LeFevre

Database Intrusion Detection Using Role Profiling with Role Hierarchy ..... 33
Garfield Zhiping Wu, Sylvia L. Osborn, and Xin Jin

Query Processing Techniques for Compliance with Data Confidence Policies ..... 49
Chenyun Dai, Dan Lin, Murat Kantarcioglu, Elisa Bertino, Ebru Celikel, and Bhavani Thuraisingham

Access Control

An Access Control Language for a General Provenance Model ..... 68
Qun Ni, Shouhuai Xu, Elisa Bertino, Ravi Sandhu, and Weili Han

A Flexible Access Control Model for Distributed Collaborative Editors ..... 89
Abdessamad Imine, Asma Cherif, and Michaël Rusinowitch

On the Construction and Verification of Self-modifying Access Control Policies ..... 107
David Power, Mark Slaymaker, and Andrew Simpson

Controlling Access to XML Documents over XML Native and Relational Databases ..... 122
Lazaros Koromilas, George Chinis, Irini Fundulaki, and Sotiris Ioannidis

Privacy Protection

Longitude: Centralized Privacy-Preserving Computation of Users’ Proximity ..... 142
Sergio Mascetti, Claudio Bettini, and Dario Freni

L-Cover: Preserving Diversity by Anonymity ..... 158
Lei Zhang, Lingyu Wang, Sushil Jajodia, and Alexander Brodsky

Author Index ..... 173
Query Optimization in Encrypted Relational Databases by Vertical Schema Partitioning

Mustafa Canim, Murat Kantarcioglu, and Ali Inan
The University of Texas at Dallas, Richardson, TX 75083
{mxc054000,muratk,axi061000}@utdallas.edu
Abstract. Security and privacy concerns, as well as legal considerations, force many companies to encrypt the sensitive data in their databases. However, storing the data in encrypted format entails significant performance penalties during query processing. In this paper, we address several design issues related to querying encrypted relational databases. The experiments we conducted on benchmark datasets show that excessive decryption costs during query processing result in a CPU bottleneck. As a solution, we propose a new method based on schema decomposition that partitions the sensitive and non-sensitive attributes of a relation into two separate relations. Our method improves system performance dramatically by overlapping disk IO latency with CPU-intensive operations (i.e., encryption/decryption).
1 Introduction
Sensitive data ranging from medical records to credit card information are increasingly being stored in databases and data warehouses. At the same time, there are increasing concerns related to the security and privacy of such stored data. For example, according to a recent New York Times article [1], records of more than a hundred million individuals have been leaked from databases in the last couple of years. One of the most recent incidents was reported by a medical center in San Jose, which notified around 185,000 current and former patients about the theft of their personal information contained on two computers stolen from its offices during a burglary [2]. Although criminals have so far not taken considerable advantage of such disclosures, the need for better protection techniques for sensitive data within databases is obvious. Common techniques such as access control and firewalls do not provide enough security against hackers that use zero-day exploits, nor protection from insider attacks. Once a hacker gets administrator access to a server that stores the critical data, he can easily bypass the database access control system and reach all database files. Although it brings some extra cost, encrypting sensitive data is considered an effective last line of defense to counter such attacks [3]. In addition, legal considerations [4] and recent legislation such as California's Database Security Breach Notification Act [5] require companies to encrypt sensitive data. To assist its customers with legislation compliance hurdles, Microsoft
recently developed a new SQL Server that comes with built-in encryption support [6]. IBM also offers similar functionality in its DB2 server, in which data is encrypted (and decrypted) using a row-level function [7]. Clearly, unless encryption keys are compromised, a hacker (or a malicious employee) who controls the system will not be able to read any sensitive data stored on the hard disk.

As an alternative to encryption within database systems, one could use encrypted hard drives to guarantee the privacy of data. With the help of built-in hardware, the data is stored encrypted within the hard drives; the technology thus provides drive-level encryption [8]. The major drawback of this solution is that it does not provide the advanced key management capabilities that are actively used in multi-user encrypted databases [9]. Handling role-based key management with hardware key management systems is not practical yet. In addition, both customers and database vendors are looking for privacy solutions that come as a stand-alone framework within database products. This is particularly important for systems such as storage area networks and grid-based storage environments, where the database administrator does not have full control over all hard drives. Therefore, database encryption is still a very important security mechanism.

1.1 Threat Model
In this paper, we assume a threat model similar to the one considered in [10], where the database server is trusted and only the disk is vulnerable to compromise. We consider an adversary that can only see the files stored on the disk but not the access patterns. In this case, we just need to satisfy security under chosen-plaintext attacks. In other words, assuming the security of the underlying block cipher (e.g., AES [11]), we need to guarantee that any polynomial-time adversary will have negligible probability of inferring any information about the sensitive data solely by looking at the disk contents. Specifically, we assume that (see Figure 1):

– The storage system used by the database system is vulnerable to compromise. Only the files stored in the storage system are accessible to the attacker.
– The query engine and authentication server are trusted. We assume that queries are executed on a trusted query engine.
– All sensitive data will be stored encrypted. Previous work on inference control shows that the probability distribution of sensitive attributes, if available, may create unintended inference channels [12]. Therefore, in addition to sensitive data, all secondary sources of information that can reveal sensitive data or its probability distribution (e.g., log files, indexes) will also be stored encrypted.

1.2 Motivation
Encrypted storage of sensitive data in databases can be achieved at various levels of granularity. Tuple-level, page-level, and column-level encryption are the best-known options for this purpose.
Fig. 1. Information flow and trust model for querying the encrypted data (the query engine and the authentication/query transformation component are trusted; the hard disk is not)

Fig. 2. Left deep join tree of TPC-H query 10
In tuple-level encryption, each tuple is encrypted and decrypted separately. If the database needs to retrieve only some of the tuples, there is no need to decrypt all tuples in the table. The major drawback of this technique is that selective encryption of sensitive attributes causes fragmentation of encrypted data within pages. Since records are kept consecutively, the encrypted sensitive attributes of the tuples are scattered across the page and cannot be decrypted all at once, which makes the technique inefficient [13].

Page-level encryption corresponds to a mechanism where a particular page is completely decrypted whenever it is accessed. It is the most convenient granularity option if all attributes of a table are sensitive [13]. Additionally, the page-level approach does not require major changes in the design of the database. Since the page structure is not changed, it can be implemented between the buffer manager and file manager layers with a slight modification. However, page-level encryption is not preferable if a table includes both sensitive and non-sensitive attributes, because non-sensitive attributes are unnecessarily encrypted along with sensitive ones.

Column-level encryption can be implemented by the mini-page approach proposed in [13], based on the work of Ailamaki et al. [14]. In this technique, when a tuple is inserted into a page, its attributes are partitioned and stored in corresponding mini pages within the same page. Hence, the sensitive attributes of records are kept together within the page. The most important aspect of this technique is that it allows selective decryption of sensitive data. However, this approach has not been popular in conventional databases, since it requires major changes in the storage engine. Additionally, accessing the sensitive and non-sensitive attributes of the records inside a page incurs an extra cost, since the attributes are not stored together.

In this paper, we propose a new technique to encrypt sensitive data. Our method does not suffer from any of the above disadvantages and even provides efficiency in terms of query execution time. Instead of keeping sensitive and non-sensitive attributes in the same table, we propose partitioning the table completely and storing them in two separate tables. By doing so, we can both prevent unnecessary decryption operations and reduce the number of pages retrieved to
the buffer pool. Whenever a query needs to access only non-sensitive attributes, we do not need to retrieve the encrypted parts of the relations into the buffer pool. For queries that require accessing both sensitive and non-sensitive attributes, we can retrieve the partitioned tuples from the encrypted and unencrypted relations and join them. Despite the cost associated with these join operations, the overall query evaluation performance can be boosted, since partitioning also prevents a CPU bottleneck. A detailed analysis of this issue is presented in Section 3.

We now introduce vertical partitioning over an example schema that contains only one relation. Consider a company that stores the following information about its customers: Customer (TupleID, Name, BirthDate, Address, Phone, SSN, CreditCardNumber). Let us assume that the database administrator designates the SSN and CreditCardNumber fields as sensitive attributes and requests the DBMS to store these attributes encrypted. If the DBMS only permits page-level encryption, one obvious solution is to encrypt the entire Customer relation. The solution that we propose is vertically partitioning Customer into two sub-relations containing only non-sensitive and only sensitive attributes, respectively. These two options are listed below:

– Option1: Storing the relation without decomposition and encrypting the entire table.
– Option2: Partitioning the relation into two sub-relations such that:
  Customer1 (TupleID, Name, BirthDate, Address, Phone)
  Customer2 (TupleID, SSN, CreditCardNumber)
  and encrypting relation Customer2 but not Customer1.

If the data file for the Customer relation is 7000 pages long, then, assuming all attributes are of equal length, the Customer1 and Customer2 relations should fit in roughly 5000 and 3000 pages respectively. Please note that a sensitive attribute might be part of the primary key (i.e., SSN for Customer), in which case partitioning requires replacing the primary key with an additional attribute unique across the records (i.e., the TupleID attribute of Customer). Next, we briefly compare the two options through a workload that consists of three queries over the Customer relation:

– Query1: The Name and Phone attributes of customers are accessed (only non-sensitive attributes).
– Query2: The TupleID and SSN attributes of customers are accessed (only sensitive attributes).
– Query3: The Name and SSN attributes of customers are accessed (both sensitive and non-sensitive attributes).

When Option1 is employed, regardless of the sensitivity of the accessed attributes, for each query the DBMS fetches and decrypts all 7000 pages of the Customer relation. Therefore, non-sensitive attributes are decrypted unnecessarily. If Option2 is selected, Query1 can be answered using only Customer1, because the sensitive attributes are irrelevant. Since Customer1 is not encrypted, there is no associated decryption cost. Also recall that Customer consists of 7000 pages, while Customer1 is assumed to fit in 5000 pages. Therefore, the overall cost of
evaluating Query1 under Option2 should even be lower than the cost over an unencrypted version of Customer, due to less IO latency.

Choosing Option2 as the method of encryption is advantageous for Query2 as well. Instead of decrypting the entire Customer relation (7000 pages), we can process the query using only Customer2 (3000 pages). The savings for this particular example are quite significant: around 57% (4000 pages) less IO and decryption cost with Option2.

Query3 involves attributes from both Customer1 and Customer2. Therefore, evaluating Query3 requires fetching all pages of both, decrypting Customer2, and additionally joining Customer1 and Customer2 on the primary key, TupleID. For Query3, the IO costs of Option2 are always higher than those of Option1, because the primary key is stored redundantly (once in each vertical partition) to ensure a lossless join of the partitions. The decryption costs of Option2, on the other hand, will be lower, since non-sensitive attributes are not encrypted under Option2. Please note that the join cost is specific to Option2. Overall, the effectiveness of vertical partitioning depends primarily on the trade-off between encrypting non-sensitive attributes and joining the partitions.

The decision of vertical partitioning depends heavily on the types and frequencies of the queries in the query workload. If the majority of the queries do not require joining the partitions (i.e., Query1 and Query2), then vertical partitioning might significantly improve the performance. On the other hand, if the sensitive and non-sensitive attributes of the table in question have close affinity with each other (i.e., Query3), then the overall workload performance might be poor, because most of the queries in the given workload will require a significant number of costly join operations.
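To make Option2 concrete, here is a minimal sketch of the decomposed schema in SQLite. The table and column names follow the running example, but SQLite, the view-based reconstruction, and the query strings are our illustration only: the paper's prototype modifies MySQL-InnoDB, and the encryption of Customer2 happens at the page level, below the SQL layer.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Non-sensitive partition: stored in plaintext.
CREATE TABLE Customer1 (
    TupleID   INTEGER PRIMARY KEY,
    Name      TEXT,
    BirthDate TEXT,
    Address   TEXT,
    Phone     TEXT
);
-- Sensitive partition: the pages of this table would be encrypted.
CREATE TABLE Customer2 (
    TupleID          INTEGER PRIMARY KEY REFERENCES Customer1(TupleID),
    SSN              TEXT,
    CreditCardNumber TEXT
);
-- Lossless reconstruction of the original Customer relation for
-- queries (like Query3) that touch both partitions.
CREATE VIEW Customer AS
SELECT c1.TupleID, c1.Name, c1.BirthDate, c1.Address, c1.Phone,
       c2.SSN, c2.CreditCardNumber
FROM Customer1 c1 JOIN Customer2 c2 ON c1.TupleID = c2.TupleID;
""")

# Query1 touches only the plaintext partition, so no decryption is needed:
conn.execute("SELECT Name, Phone FROM Customer1")
# Query3 pays the extra TupleID join through the view:
conn.execute("SELECT Name, SSN FROM Customer")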
1.3 Contributions of This Study
Throughout the paper we discuss the following problems and suggest solutions:

Preventing the CPU bottleneck by vertical partitioning: Avoiding the encryption of non-sensitive attributes is not the only advantage of vertical partitioning. Our experimental results reveal that separating non-sensitive attributes from sensitive ones also provides higher CPU utilization during the pipelined execution of a query involving multiple relations with sensitive attributes. Details of these experiments are discussed in Section 3.

Partitioning single relations: As we have pointed out in Section 1.2, the decision to partition a relation depends primarily on how frequently the partitions will be joined. We model the decision problem of whether partitioning a relation is efficient as an optimization problem in Section 4.

Partitioning the relations of a schema: At the schema level, the partitioning decision is also determined by interactions among different relations. Therefore, considering the workload over the entire schema makes more sense than solving the decision problem for each relation independently. We formally present this problem and propose a heuristic solution in Section 5.
2 Related Work
Querying encrypted data under the untrusted-database-server attack model was first suggested in [15]. Hacigumus et al. [15] suggested partitioning the client's attribute domains into sets of intervals. The correspondence between the intervals and the original values is kept at the client site, and encrypted tables with interval information are stored in the database. Efficient querying of the data is made possible by mapping the original values in range and equality queries to the corresponding interval values. In subsequent work, Hore et al. [16] analyzed how to create the optimum intervals for minimum privacy loss and maximum efficiency. In [17], the potential attacks on interval-based approaches were explored, and models were developed to analyze the trade-off between efficiency and disclosure risk. This line of work differs from our current problem because we assume that the database server is trusted and only the disk is untrusted.

In [18], Aggarwal et al. suggested allowing the client to partition its data across two (and more generally, any number of) logically independent database systems that cannot communicate with each other. The partitioning of data is performed in such a fashion as to ensure that the exposure of the contents of any one database does not result in a violation of privacy. The client executes queries by transmitting appropriate sub-queries to each database, and then piecing together the results at the client side.

In [10], Agrawal et al. suggested a method of order-preserving encryption (OPES) for efficient range query processing on encrypted data. Unfortunately, OPES only works for numeric data and is not trivial to extend to discrete data. In [13], Iyer et al. suggested data structures to store and process sensitive and non-sensitive data efficiently. The basic idea was to group encrypted attributes on one mini page (or one part of the tuple) so that all encrypted attributes of a given table can be decrypted together. In [19], Elovici et al. suggested a different way of doing tuple-level encryption. The vertical partitioning problem itself has been studied in detail in [20], [21], and [22]. However, none of these studies consider the relationship between cryptographic operations and the vertical partitioning problem.

Our work differs from the previous work in many respects. Unlike the previous work, we discuss the CPU bottleneck problem in encrypted databases with experimental observations, show how to utilize the page-level approach for selective decryption of sensitive data, and propose a new workload-dependent vertical schema decomposition technique to mitigate the negative impacts of cryptographic operations.
3 Preventing CPU Bottleneck by Vertical Partitioning
Most conventional database systems use pipelining for query processing. Pipelining improves query evaluation efficiency by joining the tuples of intermediate results with the tuples of the outer relations without waiting for the completion of all intermediate join operations. Join operations are therefore executed by reading pages of each table simultaneously.
To observe the impact of cryptographic operations on real database systems, we implemented a cryptography layer within the MySQL-InnoDB storage engine [23] and conducted several experiments using the TPC-H dataset and queries [24]. We repeated our experiments with different buffer sizes and different selectivity ratios. Our experimental results suggest that the CPU becomes a bottleneck if all tables accessed by the query are encrypted. This, in turn, translates into a significant increase (50-60%) in query processing time.

If all relations are composed of sensitive attributes only, then this increase is inevitable. On the other hand, if the relations include both sensitive and non-sensitive attributes, vertical partitioning into sensitive and non-sensitive sub-relations eliminates this problem effectively. In our experiments, we observed that, rather than the amount of data being decrypted, the number of encrypted tables being joined has more impact on query execution time. By separating the non-sensitive attributes from sensitive ones, the number of encrypted relations joined during query execution can be reduced. If a query accesses only the non-sensitive attributes of a relation, partitioning will prevent retrieving the encrypted attributes of that relation. Hence, we get a considerable improvement in the execution of a given query workload.
3.1 Details of the CPU Bottleneck Experiments
In the following example, we illustrate how pipelining prevents the CPU from becoming a bottleneck. Figure 2 shows the join tree of TPC-H query 10 [24]. Suppose all attributes of the LineItem table are sensitive and the remaining three tables include both sensitive and non-sensitive attributes. Let q be a query which accesses only the non-sensitive attributes of these three tables and some attributes of the LineItem table. If we apply partitioning, the sensitive and non-sensitive attributes of the Orders, Customer, and Nation tables will be stored in separate tables. During the pipelined execution of query q, only the pages of the LineItem table will be decrypted. That is, only one out of four pages will be decrypted within unit time, due to concurrent reads.

On the other hand, if partitioning is not applied, all four tables will be decrypted, as they all contain some sensitive attribute. Hence, accessing the non-sensitive attributes of Orders, Customer, and Nation requires decryption of both sensitive and non-sensitive attributes. This time, during the pipelined execution of query q, any page read from any of these four tables will be waiting for decryption, i.e., four out of four pages will be decrypted within unit time. In our experiments, we observed that decrypting all retrieved pages overloads the CPU, which in turn increases query execution time. As stated in the above example, partitioning reduces the number of pages decrypted within unit time. However, in some cases, keeping non-sensitive and sensitive attributes in the same table might yield a better outcome in terms of overall workload execution performance. We will discuss this issue in Sections 4 and 5 in detail. In order to quantify the increase in query execution time caused by the CPU bottleneck, we prepared a test environment using MySQL and
TPC-H data. In this implementation, we used AES as the block cipher and employed the OpenSSL library [25] for cryptographic operations. To implement the cryptographic layer, we modified two components of the MySQL-InnoDB source code: the file manager (fil0fil.c) was modified to encrypt dirty data pages before they are written to disk, and the buffer manager (buf0buf.c) was modified to decrypt pages retrieved into memory. As the mode of operation, we used counter (CTR) mode to encrypt the data. For a detailed discussion of key management and transaction management issues, please refer to [26], as we followed the same procedure. All experiments were conducted on a 2.79 GHz Intel Pentium D machine with 2 GB memory on the NT platform.

We prepared three database instances, each with 1 GB of TPC-H data and composed of 8 distinct tables. In our experiments, we used only four of these eight tables: LineItem, Orders, Customer, and Nation. These four tables occupy 85% of all the file space in each database. In the first instance we did not encrypt any of these four tables. In the second instance we encrypted only the LineItem table, which occupies almost 80% of all the file space occupied by these four tables. In the third instance all four tables were encrypted. After preparing these instances, we ran Query 10 of the TPC-H benchmark, which joins these four tables, and measured the execution times over the three database instances. To make sure that these results are independent of the database buffer pool size, we repeated the same experiment with different buffer pool sizes. The results of the experiments are given in Figure 3.

In Figure 3, the queries run on the unencrypted and LineItem-only encrypted database instances take almost the same execution time. On average, the execution time for the unencrypted instance is 5% less than the latter, but the two series are barely distinguishable. Notice that although the LineItem table occupies 80% of the database file space, the associated decryption cost only introduces 5% overhead. However, if the other three tables (which occupy only the remaining 20%) are encrypted as well, the decryption cost becomes 50-60%. We conclude that the overhead resulting from decryption is not directly proportional to the amount of data. Pages of the LineItem table can be decrypted while pages of the other three tables are being retrieved from disk. But if partitioning is not applied, IO latency cannot be parallelized with CPU-intensive decryption operations, and the CPU becomes the bottleneck.

Figure 4 presents a cross section of query execution times when the buffer pool size is fixed at 400 MB. The query execution time in the LineItem-only (all-tables) encrypted database instance is 523 (853) seconds, which is 5% (71%) more than the query execution time over the unencrypted database instance. Due to pipelined query execution, the time spent on cryptographic operations in the LineItem-only instance is almost completely overlapped with page read operations.
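The cryptographic layer described above can be pictured with the following minimal sketch. This is not the paper's C code: the Python cryptography library, the function names, and in particular deriving the CTR counter block from the page number are our assumptions for illustration; the paper's actual key and counter management follows [26].

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

PAGE_SIZE = 16 * 1024  # InnoDB's default page size

def _ctr(key: bytes, page_no: int) -> Cipher:
    # One keystream per page, derived here from the page number alone.
    # A real system must never reuse a counter for two different page
    # images; proper counter/key management is out of scope for this sketch.
    nonce = page_no.to_bytes(16, "big")
    return Cipher(algorithms.AES(key), modes.CTR(nonce))

def encrypt_page(key: bytes, page_no: int, page: bytes) -> bytes:
    enc = _ctr(key, page_no).encryptor()   # conceptually, the fil0fil.c hook on flush
    return enc.update(page) + enc.finalize()

def decrypt_page(key: bytes, page_no: int, page: bytes) -> bytes:
    dec = _ctr(key, page_no).decryptor()   # conceptually, the buf0buf.c hook on read
    return dec.update(page) + dec.finalize()

key = os.urandom(32)            # AES-256 key
page = os.urandom(PAGE_SIZE)    # a dirty page about to be written to disk
assert decrypt_page(key, 7, encrypt_page(key, 7, page)) == page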
3.2 Mitigating the Join Cost Due to the Partitioning
As we discussed above, vertical partitioning effectively prevents the CPU bottleneck problem. However, partitioning every relation that includes both sensitive and non-sensitive attributes may not always be the best solution. For those tables where
Fig. 3. Multi Table Partitioning Over Indexed Tables (query execution time vs. MySQL-InnoDB buffer pool size; series: Nothing Encrypted (0% of all data), LineItem Table Encrypted (80% of all data), All Tables Encrypted (100% of all data))

Fig. 4. Effect of CPU Bottleneck on Encrypted Query Processing (query execution time for the three instances at a fixed buffer pool size)
the affinity between the attributes is very high, it might be preferable to keep the attributes together. To make an optimal decision, we propose workload-dependent approaches for single-table and multi-table queries. In Section 4, we describe the partitioning issue in detail and analyze the workload-dependent vertical partitioning approach for single-table queries. In Section 5, we discuss the same notion for multi-table queries, show that finding the optimal partitioning is not tractable, and propose a heuristic for making an effective partitioning decision.
4 Partitioning a Single Relation
We have discussed the advantages of partitioning a relation in Section 1. In this section, we formally define the problem and provide our experimental results.

4.1 Formal Definition of the Problem
Given a relation R = {A_1, A_2, ..., A_n}, suppose, without loss of generality, that the attributes A_j, A_{j+1}, ..., A_n are sensitive attributes that should be stored encrypted, whereas the remaining attributes are non-sensitive. Let E(r) denote the relation r in encrypted format. We will consider two transformations for storing the tuples of R: E(R) ← T_0(R) and (R_0, E(R_1)) ← T_1(R), where R_0 = {A_1, ..., A_{j-1}} and R_1 = {A_j, ..., A_n}. Here E(R) represents the encryption of the unpartitioned relation R, while R_0 and E(R_1) represent the partitions of R that contain the unencrypted non-sensitive attributes (R_0) and the encrypted sensitive attributes (E(R_1)), respectively.

Suppose there is a workload Γ = {Q_1, Q_2, ..., Q_κ} defined on the relation R, and let w_t be the weight of query Q_t in the workload. We denote the minimum cost of running query Q_t over the transformed relations as C_t(T_b(R)). For example, C_1(T_0(R)) denotes the minimum cost of running Q_1 over the transformed relation E(R), whereas C_1(T_1(R)) denotes the minimum cost of running Q_1 over the set of transformed relations R_0, E(R_1) (note that T_0(R) = E(R) and T_1(R) = (R_0, E(R_1))).
Let TC^UnPart be the overall query evaluation cost for a given workload when the relation R is unpartitioned, and TC^Part the overall cost when R is partitioned. Using the above notation, we can define TC^UnPart and TC^Part as follows:

TC^UnPart = Σ_{t=1}^{κ} w_t · C_t(T_0(R))
TC^Part = Σ_{t=1}^{κ} w_t · C_t(T_1(R))

For a given workload, if TC^UnPart < TC^Part, then partitioning does not improve the overall performance of the workload, because the cost of the join operations suppresses the savings from cryptographic operations. On the other hand, TC^UnPart ≥ TC^Part implies that partitioning is advantageous, since it reduces the overall execution time.
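The decision rule fits in a few lines. The sketch below assumes a cost oracle (e.g., optimizer estimates or measured runs) for each query under each physical design; the function names are ours, not the paper's.

def should_partition(workload, cost_unpart, cost_part):
    """Decide T_1(R) vs. T_0(R) for a single relation R.

    workload     -- iterable of (query, weight) pairs
    cost_unpart  -- query -> estimated cost C_t(T_0(R))
    cost_part    -- query -> estimated cost C_t(T_1(R))
    """
    tc_unpart = sum(w * cost_unpart(q) for q, w in workload)
    tc_part = sum(w * cost_part(q) for q, w in workload)
    # Partition exactly when the decryption savings outweigh the extra
    # joins between R_0 and E(R_1), i.e., when TC^Part <= TC^UnPart.
    return tc_part <= tc_unpart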
4.2 Experiment Results
To observe the effectiveness of partitioning on single-table queries, we conducted experiments using the TPC-H dataset. We observed that if the majority of the queries in a given workload do not require joining sensitive and non-sensitive attributes, then decomposition improves performance significantly.

In this experiment, we generated two instances of a 1 GB database using the TPC-H dataset and used the LineItem table of these instances, since LineItem is the largest table of the TPC-H dataset; it occupies almost 80% of the whole database. In the first instance, we encrypted all attributes of the LineItem table and stored the table without partitioning. In the second instance, the LineItem table was partitioned such that half of the attributes are stored in LineItem_1 in plaintext and half of the attributes are stored in LineItem_2 in encrypted format. In terms of storage, LineItem_1 occupies 467 MB whereas LineItem_2 occupies 532 MB of disk space.

To build a workload, we prepared three types of queries using TPC-H query 6 as the basis. Query type 1 accesses only non-sensitive attributes of the LineItem table, whereas query type 2 accesses only sensitive attributes. Query type 3 accesses both sensitive and non-sensitive attributes of the LineItem table. Queries 1 and 2 are very similar to query 3; the only difference is that they do not require joining sensitive and non-sensitive attributes. After running these three queries in both database instances, we combined the results into various query workload scenarios of 100 queries each.

Figure 5 represents a query workload where 20 queries require accessing both sensitive and non-sensitive attributes (type 3 queries). The remaining 80 queries include both query types 1 and 2; therefore, 20% of the queries require a join operation while 80% do not. As seen in Figure 5, if the number of type 3 queries is low, then partitioning the tables significantly reduces the overall query execution time. In Figure 6, we can see that if the number of type 3 queries increases, then partitioning becomes less effective. As shown, if the number of type 3 queries is less than 40%, partitioning will still be effective; otherwise, keeping relations unpartitioned is a better choice.
Fig. 5. Various distributions of type 1 and type 2 queries; 20% of the queries are type 3 queries (workload execution time vs. ratio of Q1 queries; series: With Partitioning, Without Partitioning)
Fig. 6. Various distributions of type 3 queries (workload execution time vs. ratio of Q3 queries; series: With Partitioning, Without Partitioning)

5 Partitioning Multiple Tables
The partitioning decision for single-table queries is rather simple: given a relation R and a workload, should we partition the relation or not? The same decision for multiple tables additionally requires considering the interaction among different tables. Therefore, given n relations, we need to evaluate 2^n different combinations of per-table partitioning decisions. In theory, if the overall query execution time for each of these combinations were available, choosing the best decision would be simple: select the combination that requires the least amount of time. However, this strategy is not practical in real-world applications, since it is not tractable. In the following subsection we provide a formal representation of the problem and show that it is an instance of binary integer programming.
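A brute-force baseline makes the intractability concrete: it enumerates all 2^n assignments and keeps the cheapest. The cost oracle workload_cost is an assumed stand-in for estimating (or measuring) the weighted workload cost of one physical design; nothing here is from the paper's implementation.

from itertools import product

def exhaustive_search(v, workload_cost):
    """Try all 2^v assignments (b_1, ..., b_v); b_i = 1 partitions R^i.

    workload_cost(assignment) estimates the weighted workload cost of
    the physical design described by one 0/1 tuple.
    """
    return min(product((0, 1), repeat=v), key=workload_cost)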
5.1 Formal Definition of the Problem
Given the i-th relation R^i = {A^i_1, A^i_2, ..., A^i_n}, suppose, without loss of generality, that the attributes A^i_j, A^i_{j+1}, ..., A^i_n are sensitive attributes that should be stored encrypted, whereas the remaining attributes are non-sensitive. Let E(r) denote the relation r in encrypted format. We will consider two transformations for storing the tuples of R^i: E(R^i) ← T_0(R^i) and (R^i_0, E(R^i_1)) ← T_1(R^i), where R^i_0 = {A^i_1, ..., A^i_{j-1}} and R^i_1 = {A^i_j, ..., A^i_n}. Here E(R^i) represents the encryption of the unpartitioned relation R^i, while R^i_0 and E(R^i_1) represent the partitions of R^i that contain the unencrypted non-sensitive attributes (R^i_0) and the encrypted sensitive attributes (E(R^i_1)), respectively.

Suppose there is a workload Γ = {Q_1, Q_2, ..., Q_κ} defined on the relations R^1, ..., R^v, and let w_t be the weight of query Q_t in the workload. We denote the minimum cost of running query Q_t over the transformed relations as C_t(T_{b_1}(R^1), T_{b_2}(R^2), ..., T_{b_v}(R^v)). For example, C_1(T_0(R^1), T_1(R^2)) denotes the minimum cost of running Q_1 over the set of transformed relations E(R^1), R^2_0, E(R^2_1) (note that T_0(R^1) = E(R^1) and T_1(R^2) = (R^2_0, E(R^2_1))).
Using the above notation, we can define the optimum partitioning strategy as a minimization problem:

min_{b_1, b_2, ..., b_v} Σ_{t=1}^{κ} w_t · C_t(T_{b_1}(R^1), T_{b_2}(R^2), ..., T_{b_v}(R^v))
subject to b_j ∈ {0, 1}, 1 ≤ j ≤ v

The above optimization is an instance of the binary integer programming problem, which is known to be NP-hard [27]. In the next section, we discuss a simple heuristic approach.

5.2 One Step at a Time (OSAT) Heuristic
Instead of evaluating all possible combinations of partitioning decisions, we propose a greedy heuristic called "one step at a time" (OSAT). Under this heuristic, relations are evaluated one by one in a particular order, and once the decision to partition a specific relation is made, this decision is taken into account when the remaining relations are evaluated.

Assume that we have a set of relations S = {R^1, R^2, ..., R^n}, each with both sensitive and non-sensitive attributes. Given a workload, we decide whether to partition each relation one by one. First, we evaluate relation R^1. While evaluating this relation, we assume that all remaining relations are unpartitioned. We then estimate the total execution time of the given workload for both the partitioned and the unpartitioned versions of this relation. Depending on the result, say, we decide to partition R^1. Once we decide to partition R^1, that decision is used for subsequent evaluations: when we evaluate R^2, we assume that R^1 is already partitioned and the remaining relations are still unpartitioned. This process continues until the decision for R^n is made. The order of evaluation is determined by descending table size: we sort the tables with respect to their sizes and then evaluate each table in this order.
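The following sketch implements OSAT as described above, visiting relations in order of decreasing size and freezing each decision before moving on. As with the exhaustive baseline sketched earlier, workload_cost is an assumed cost oracle (here taking the current decision map), not an API from the paper.

def osat(relations_by_size, workload_cost):
    """Greedy 'one step at a time' pass over the relations.

    relations_by_size -- relation identifiers sorted by descending size
    workload_cost     -- map of {relation: 0 or 1} -> estimated workload cost
    """
    assignment = {r: 0 for r in relations_by_size}   # start fully unpartitioned
    for r in relations_by_size:
        cost_unpart = workload_cost(assignment)
        assignment[r] = 1                            # tentatively partition r
        cost_part = workload_cost(assignment)
        if cost_unpart < cost_part:                  # partitioning did not pay off
            assignment[r] = 0                        # revert; decision is now frozen
    return assignment                                # 2 cost evaluations per relation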
5.3 Experiment Results
To show that OSAT is an effective heuristic, we conducted experiments using the TPC-H dataset. We observed that the heuristic finds almost the same partitioning strategy as the optimal solution. In addition, we observed that constructing the database schema with the optimum partitioning strategy boosts the overall query workload performance tremendously.

In this experiment, we assumed that four tables of the database include both sensitive and non-sensitive attributes: Supplier, Customer, LineItem, and Orders. For each of these tables, half of the attributes are assumed to be sensitive, so that the sub-relations are balanced in terms of size. We constructed the workload with TPC-H queries 5, 7, and 10, since they access these four tables. As workloads, we prepared six plans with different distributions of these three queries.
Table 1. Query execution times (seconds) for TPC-H queries 5, 7, and 10 and the workload execution times on different partitioning scenarios. Columns S, L, O, and C correspond to the Supplier, LineItem, Orders, and Customer tables; a "P" denotes that the table is partitioned.

S L O C   Query 5  Query 7  Query 10   W-1    W-2    W-3    W-4    W-5    W-6
- - - -    531      671      543      57900  56500  60460  60340  57540  56260
P - - -    532      670.5    543      57905  56520  60455  60345  57575  56300
- P - -     48       36       34       3740   3860   3780   3920   4160   4140
- - P -    419      653.5    508.5    53410  51065  56310  55415  50725  49275
- - - P    530      670      552      58300  56900  60660  60440  57640  56460
P P - -     57       36.5     34       3935   4140   3985   4215   4625   4600
P - P -    420.5    651.5    508.5    53380  51070  56240  55360  50740  49310
P - - P    533      668.5    552      58315  56960  60645  60455  57745  56580
- P P -     50.5     30       34       3610   3815   3530   3695   4105   4145
- P - P     39       26       32       3160   3290   3040   3110   3370   3430
- - P P    418      653      517.5    53825  51475  56535  55540  50840  49485
P P P -     60       30       34       3800   4100   3720   3980   4580   4620
P P - P     48       25       32       3310   3540   3170   3330   3790   3860
P - P P    418      651      517.5    53765  51435  56435  55440  50780  49445
- P P P     42       20       33       3090   3310   2830   2920   3360   3490
P P P P     51.5     20       33       3280   3595   3020   3205   3835   3965
To be able to run the queries, we prepared a database instance in which each of the four tables is stored both partitioned and unpartitioned. Among the partitioned tables, only the partitions that contain the sensitive attributes are encrypted; in the unpartitioned ones, all attributes of these four tables are stored encrypted. We implemented a "query rewriter" to generate queries with respect to the different partitioning combinations. Since we used four tables, there are 2^4 = 16 different partitioning scenarios for a given query, and since we have 3 different queries, there are 3 × 16 = 48 different queries to run. We ran those 48 queries and measured their execution times. The results are given in Table 1 in the columns "Query 5", "Query 7", and "Query 10". Columns S, L, O, and C correspond to the Supplier, LineItem, Orders, and Customer tables respectively; an entry "P" in a column denotes that the corresponding table is partitioned. Using these execution times, we calculated the overall workload execution times for 6 different distribution scenarios, denoting the i-th workload plan as W-i. The results are also shown in Table 1.

For four of the distribution scenarios, the exhaustive search technique suggested partitioning the Customer, LineItem, and Orders tables but not the Supplier table. For the remaining two scenarios, only the LineItem and Customer tables were suggested for partitioning. On the other hand, for all six scenarios, OSAT suggested partitioning Customer, LineItem, and Orders. Therefore, in 4 out of 6 scenarios both approaches found the same partitioning strategy, and when we analyze the total running times of the 2 differing scenarios, the total execution times are very close. If the database is created according to the partitioning found by the heuristic, it takes 18995 seconds to run all the given workloads; with the exhaustive search technique, it takes 18920 seconds. Assuming that the exhaustive search approach gives the optimal solution, the relative error of the heuristic is 4.22/1000. We can therefore empirically conclude that the OSAT heuristic can be used instead of exhaustive search, since its running time is linear and it finds an almost optimal partitioning strategy.
The second important observation is that the overall workload execution times for the optimal partitioning are significantly lower than those for the unpartitioned schemas. Effective schema partitioning thus improves workload performance considerably: the average running time of the given workload scenarios is 58166.7 seconds if none of the tables is partitioned, versus 3165.8 seconds if the tables are partitioned using OSAT, a reduction in overall running time of 94.56 percent.

The rationale behind this improvement is two-fold. First, different partitioning strategies cause the database server to choose different query execution plans and join orders. Some of these choices yield improvements while others increase the execution cost. Especially in encrypted databases, joining the tables in the correct order has a great impact on query execution performance; hence, trying different partitioning combinations helps find the best query execution plan. Second, vertical partitioning reduces the amount of data that is decrypted: separating the non-sensitive attributes from the sensitive ones leads to considerable performance improvements.
6 Conclusions
In this paper, we proposed a vertical partitioning approach to prevent unnecessary cryptographic operations over non-sensitive attributes. Experiments conducted on the TPC-H benchmark dataset reveal that another advantage of vertical partitioning is preventing the CPU from becoming the bottleneck during query execution. Our analysis indicates that, due to pipelined execution, the overhead resulting from cryptographic operations is not directly proportional to the amount of decrypted data but rather to the number of encrypted relations involved in a query. While vertical partitioning introduces only 5% overhead, evaluating the same query over the unpartitioned database instance takes 50-60% longer. In our method, the decision to vertically partition a relation depends on how frequently the partitions are joined to answer a query. When extended to the entire schema, making these decisions for each relation also requires considering the interactions among the relations. While exhaustive search implies an exponential search space, our proposed heuristic has linear complexity and achieves a 0.4% error rate in comparison to the optimum partitioning strategy. Overall, heuristic partitioning improves query execution time by 94.5% on average.
References
1. Zeller Jr., T.: An ominous milestone: 100 million data leaks. New York Times (December 18, 2006)
2. Computerworld: Stolen computers contain data on 185,000 patients (April 2005), http://www.computerworld.com/databasetopics/data/story/0,10801,100961,00.html
3. Trinanes, J.A.: Database security in high risk environments. Technical report, governmentsecurity.org (2005), http://www.governmentsecurity.org/articles/DatabaseSecurityinHighRiskEnvironments.php
4. HIPAA: Standard for privacy of individually identifiable health information. Federal Register 67(157), 53181–53273 (2002)
5. Peace, S.: California Database Security Breach Notification Act (September 2002), http://info.sen.ca.gov/pub/01-02/bill/sen/sb_1351-1400/sb_1386_bill_20020926_chaptered.html
6. Microsoft: Security features in Microsoft SQL Server 2005. Technical report, Microsoft Corporation (2005), http://www.microsoft.com/sql/2005/productinfo/
7. IBM: IBM data encryption for IMS and DB2 databases. Technical report, IBM Corporation (2006), http://www-306.ibm.com/software/data/db2imstools/db2tools/ibmencrypt.html
8. Seagate: DriveTrust technology: A technical overview (October 2006), http://www.seagate.com/docs/pdf/whitepaper/TP564_DriveTrust_Oct06.pdf
9. Damiani, E., De Capitani di Vimercati, S., Foresti, S., Jajodia, S., Paraboschi, S., Samarati, P.: Key management for multi-user encrypted databases. In: StorageSS 2005: Proceedings of the 2005 ACM Workshop on Storage Security and Survivability, pp. 74–83. ACM, New York (2005)
10. Agrawal, R., Kiernan, J., Srikant, R., Xu, Y.: Order-preserving encryption for numeric data. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France (June 13-18, 2004)
11. NIST: Advanced Encryption Standard (AES). Technical Report NIST Special Publication FIPS-197, National Institute of Standards and Technology (2001), http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
12. Adam, N.R., Worthmann, J.C.: Security-control methods for statistical databases: a comparative study. ACM Comput. Surv. 21(4), 515–556 (1989)
13. Iyer, B., Mehrotra, S., Mykletun, E., Tsudik, G., Wu, Y.: A framework for efficient storage security in RDBMS. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 147–164. Springer, Heidelberg (2004)
14. Ailamaki, A., DeWitt, D.J., Hill, M.D., Skounakis, M.: Weaving relations for cache performance. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 169–180. Morgan Kaufmann Publishers Inc., San Francisco (2001)
15. Hacigumus, H., Iyer, B.R., Li, C., Mehrotra, S.: Executing SQL over encrypted data in the database-service-provider model. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 4-6, pp. 216–227 (2002), http://doi.acm.org/10.1145/564691.564717
16. Hore, B., Mehrotra, S., Tsudik, G.: A privacy-preserving index for range queries. In: Proceedings of the 30th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco (2004)
17. Damiani, E., De Capitani di Vimercati, S., Jajodia, S., Paraboschi, S., Samarati, P.: Balancing confidentiality and efficiency in untrusted relational DBMSs. In: Proceedings of the 10th ACM Conference on Computer and Communications Security, pp. 93–102. ACM Press, New York (2003), http://doi.acm.org/10.1145/948109.948124
18. Aggarwal, G., Bawa, M., Ganesan, P., Garcia-Molina, H., Kenthapadi, K., Motwani, R., Srivastava, U., Thomas, D., Xu, Y.: Two can keep a secret: A distributed architecture for secure database services. In: CIDR, pp. 186–199 (2005)
19. Elovici, Y., Shmueli, E., Waisenberg, R., Gudes, E.: A structure preserving database encryption scheme. In: Jonker, W., Petković, M. (eds.) SDM 2004. LNCS, vol. 3178, pp. 28–40. Springer, Heidelberg (2004), http://www.extra.research.philips.com/sdm-workshop/RonenSDM.pdf
20. Cornell, D.W., Yu, P.S.: An effective approach to vertical partitioning for physical design of relational databases. IEEE Trans. Softw. Eng. 16(2), 248–258 (1990)
21. Agrawal, S., Narasayya, V., Yang, B.: Integrating vertical and horizontal partitioning into automated physical database design. In: SIGMOD 2004: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 359–370. ACM, New York (2004)
22. Navathe, S., Ceri, S., Wiederhold, G., Dou, J.: Vertical partitioning algorithms for database design. ACM Trans. Database Syst. 9(4), 680–710 (1984)
23. Innobase: InnoDB, Transactional Storage Engine, http://www.innodb.com/
24. TPC: TPC-H, Decision Support Benchmark, http://www.tpc.org/tpch/
25. Cox, M., Engelschall, R., Henson, S., Laurie, B.: The OpenSSL Project, http://www.openssl.org/
26. Canim, M., Kantarcioglu, M.: Design and analysis of querying encrypted data in relational databases. In: Barker, S., Ahn, G.-J. (eds.) Data and Applications Security 2007. LNCS, vol. 4602, pp. 177–194. Springer, Heidelberg (2007)
27. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York (1990)
Do You Know Where Your Data’s Been? – Tamper-Evident Database Provenance

Jing Zhang¹, Adriane Chapman², and Kristen LeFevre¹

¹ University of Michigan, Ann Arbor, MI 48109
{jingzh,klefevre}@umich.edu
² The MITRE Corporation, McLean, VA 22102
[email protected]
(Approved for Public Release; Distribution Unlimited (09-1348).)
Abstract. Database provenance chronicles the history of updates and modifications to data, and has received much attention due to its central role in scientific data management. However, the use of provenance information still requires a leap of faith. Without additional protections, provenance records are vulnerable to accidental corruption, and even malicious forgery, a problem that is most pronounced in the loosely-coupled multi-user environments often found in scientific research. This paper investigates the problem of providing integrity and tamper-detection for database provenance. We propose a checksum-based approach, which is well-suited to the unique characteristics of database provenance, including non-linear provenance objects and provenance associated with multiple fine granularities of data. We demonstrate that the proposed solution satisfies a set of desirable security properties, and that the additional time and space overhead incurred by the checksum approach is manageable, making the solution feasible in practice.
1 Introduction

Provenance describes the history of creation and modification of data. Problems of recording, storing, and querying provenance information are increasingly important in data-intensive scientific environments, where the value of scientific data is fundamentally tied to the method by which the data was created, and by whom [3,7,10,11,14,19,29]. In de-centralized and multi-user environments, we observe that individuals who obtain and use data (data recipients) often still need to make a leap of faith. They need to trust that the provenance information associated with the data accurately reflects the process by which it was created and refined. Unfortunately, provenance records can be corrupted accidentally, and they can even be vulnerable to malicious forgery.

To this point, little research has focused on providing integrity for database provenance. While recent work considered a similar problem in the context of file systems [22], the proposed solutions are not directly applicable to databases. In particular, Hasan et al. [22] only considered provenance that could be expressed as a totally-ordered chain of operations on an atomic object (e.g., a file). In databases, however, we observe that
* Approved for Public Release; Distribution Unlimited (09-1348).
Fig. 1. Sample Provenance Scenario (nodes: TrustUsRx (aggregate); Pamela (Update Patient #4555's Endocrine value); Paul (Set all Age, Weight values); Good Stewards Lab (Set all White_Count values); Perfect Saints Lab (Set all Endocrine values))
provenance is often expressed in terms of a partially-ordered set of operations on compound objects (e.g., records, tables, etc.). This is best illustrated with an example.

Example 1. A pharmaceutical company, TrustUsRx, wants to show that their new drug is safe and effective. TrustUsRx delivers the result of their clinical trial (with accompanying provenance information) to the FDA for approval. The provenance information indicates that the patients' ages and weights were originally collected by PCP Paul. Endocrine activity measurements were produced by the Perfect Saints Clinic, but then PCP Pamela amended the Endocrine value for patient #4555. White blood cell counts were determined by blood samples sent to GoodStewards Labs. Finally, all of the patient data was aggregated by TrustUsRx. The provenance of this final aggregate data is shown in Figure 1. Given the company's pecuniary incentives, the FDA wants to verify that this provenance information has not been tampered with or forged.

This example highlights the two major problems that are not addressed by Hasan et al. [22]. First, each patient record is a compound object; it contains several attributes (e.g., Age, Weight, Endocrine, and White Count), which were obtained through different methods, and have different provenance. Thus, we cannot treat records or tables as atomic; instead, a fine-grained approach is needed. Second, the modifications to the data do not form a totally-ordered (linear) sequence of operations (reads, writes, and updates). Instead, due to aggregation operations (e.g., the aggregation performed by TrustUsRx), the provenance associated with the final (compound) object delivered to the FDA is actually a DAG (non-linear provenance).

Throughout this paper, we will consider an abstract set of participants (users, processes, transactions, etc.) that contribute to one or more data objects through insertions, deletions, updates, and aggregations [5,7,9,17,19]. Information about these modifications is collected and stored in the form of provenance records. Various system architectures have been proposed for collecting and maintaining provenance records, from attaching provenance to the data itself as a form of annotation [5,7] to depositing provenance in one or more repositories [10,11,14,19,29]. Thus, one of our chief goals is to develop a cross-platform solution for providing tamper-evident provenance. Since provenance is often collected and shared in a de-centralized and loosely-organized manner, it is impractical to use secure logging tools that rely, for example, on trusted hardware [32] or other systems-level assumptions about secure operation [37].
Occasionally, a data recipient will request and obtain one or more of these data objects. In keeping with the vision of provenance, each data object is accompanied by a provenance object. Our goal is to collect enough additional information to provide cryptographic proof to the data recipient that the provenance object has not been maliciously altered or forged.

1.1 Contributions and Paper Overview

This is the first in-depth study of integrity and tamper-evidence for database provenance. While related work has focused on security (integrity and confidentiality) for file system provenance [22], we extend the prior work in the following important ways:

– Non-Linear Provenance: Database operations often involve the integration and aggregation of objects. One might consider treating an object produced in this way as if it were new (with no history), but this discards the history of the objects taken as input to the aggregation. Thus, in databases it is common to model provenance in terms of a DAG, or non-linear provenance.
– Compound Objects: In databases, it is critical to think of provenance associated with multiple granularities of data, rather than to simply associate provenance with atomic objects. For example, in the relational data model, each table, row, and cell has associated provenance, and the provenance of these objects is inter-related.

The remainder of this paper is organized as follows: In Section 2, we lay the groundwork by describing the database provenance model and integrity threat model. We then develop tamper-evident provenance tools for atomic and compound objects (Sections 3 and 4). Finally, an extensive performance evaluation (Section 5) indicates that the additional time and space overhead required for tamper-evidence (beyond that of standard provenance tracking) is often small enough to be feasible in practice.
2 Preliminaries

We begin with the preliminary building blocks for our work, which include the basic provenance model and integrity threat model. Throughout this paper, we will consider a database, D, consisting of a set of data objects. Each object has a unique identifier, which we will denote using a capital letter, and a value. We will use the notation A.val to refer to the current value of object A. We assume that the database supports the following common operations:

– Insert(A, val): Add a new object A to D with initial value val.
– Delete(A): Remove an existing object A from D.
– Update(A, val′): Update the value of A to new value val′.
– Aggregate({A1, ..., An}, B): Combine objects A1, ..., An to form new object B.

2.1 Provenance Model

With the exception of deletion, each operation is documented in the form of a provenance record. (For the purposes of this paper, after an object has been deleted, its provenance object is no longer relevant¹.) We model each provenance record as a quadruple
¹ This is not essential, but does enable some optimizations.
Fig. 2. An Example of Non-linear Provenance
of the form (seqID, p, {(A1, v1), ..., (An, vn)}, (A, v)). p identifies the participant who performed the operation. {(A1, v1), ..., (An, vn)} describes the (set of) input object(s) and their values. (A, v) describes the output object and its value.² seqID is necessary to describe the relative order of provenance records associated with specific objects. In particular, if two provenance records rec1 and rec2 involve the same object (with the same id) as either input or output, then rec1.seqID < rec2.seqID indicates that the operation described by rec1 occurred before the operation described by rec2.

Definition 1 (Provenance Object). The provenance of a data object, A, consists of a set of provenance records, which are partially-ordered by seqID. (Alternatively, it is easy to think of the provenance object as a DAG.) Each data object A always has a single most recent provenance record, with greatest seqID.

For simplicity, we will assume that seqID values are assigned in the following way: When a new object is inserted, its initial seqID = 0. On each subsequent update, we add one to the seqID. Finally, for each aggregation operation, we add 1 to the maximum seqID of any input object. This is illustrated with a simple example.

Example 2. Consider the example provenance object (for data object D) shown in Figure 2. This information indicates that participant p2 originally inserted objects A and B, with initial values a1 and b1, respectively. Each of these objects was updated several times. The original version of object A, and an updated version of B, were aggregated together to form C. Finally, D was created by aggregating C and a later version of A. Also, notice that the DAG shown in the figure is induced by the sequence ID values associated with each provenance record.

2.2 Threat Model

In the absence of additional protections, the provenance records and objects described in the previous section are vulnerable to illegal and unauthorized modifications that can
² This is certainly not the only possible way of describing an operation. We selected this model for the purposes of this work because we found it to be quite general. In contrast to provenance models that logically log the operation that was performed (e.g., a selection, or a sum), this simple model captures black-box operations (e.g., user-defined functions) and even nondeterministic functions. On the other hand, our proposed integrity scheme is easily translated to a provenance model that simply logs the white-box operations that have been performed.
go undetected. Throughout this paper, our goal is to develop an efficient scheme for detecting such modifications. In this section, we outline our threat model and desired guarantees, which are a variation of those described by Hasan et al. [22]. In particular, consider a data object A and its associated provenance object P. Suppose that P accurately reflects the provenance of A, but that a group of one or more attackers would like to falsify history by modifying A and/or P. In the worst case, the attackers themselves are insiders (participants). We set out the following desired guarantees with respect to a single attacker:

R1: An attacker (participant) cannot modify the contents of other participants' provenance records (input and/or output values) without being detected by a data recipient.
R2: An attacker cannot remove other participants' provenance records from any part of P without being detected by a data recipient.
R3: An attacker cannot insert provenance records (other than the most recent one) into P without being detected.³
R4: If an attacker modifies (updates) A without submitting a proper provenance record to P documenting the update, then this will be detected by a data recipient.
R5: An attacker cannot attribute provenance object P (for data object A) to some other data object, B, without being detected by a data recipient.

In short, we must be able to detect an attack that results from modifying any provenance record that has an immediate successor. Also, we must be able to detect any attack that causes the last provenance record in P to mismatch the current state of object A. In addition, it may be the case that multiple participants collude to attack the provenance object. In this case, we seek to make the following guarantees:

R6: Two colluding attackers cannot insert provenance records for non-colluding participants between them without being detected by a data recipient.
R7: Two colluding attackers cannot selectively remove provenance records of non-colluding participants between them without being detected by a data recipient.

Finally,

R8: Participants cannot repudiate provenance records.

It is important to point out the distinction between these threats and two related threat models. First, notice that our goal is to detect tampering; we do not consider denial-of-service type attacks, in which, for example, an attacker deletes or maliciously modifies data and/or provenance objects to prevent the information from being used. Second, we do not address the related problem of forged authorship (piracy), in which an attacker copies a data object and claims to be the original creator of the data object.

2.3 Cryptography Basics

We will make use of some basic cryptographic primitives. We assume a suitable public-key infrastructure, and that each participant is authenticated by a certificate authority.

– Hash Functions: We will use a cryptographic hash function (e.g., SHA-1 [1] or MD5 [33]), which we will denote h(). Generally speaking, h() is considered secure
³ A participant can always append a provenance record with increasing seqID when the participant executes a corresponding database operation. In this case, the provenance record must properly document the operation in order to comply with requirement R4.
if it is computationally difficult for an adversary to find a collision (i.e., messages m1 ≠ m2 such that h(m1) = h(m2)).
– Public Key Signatures: We assume that each participant p has a public and secret key, denoted PKp and SKp. p can sign a message m by first hashing m, and then encrypting h(m) with this secret key. We denote this as SSKp(m). RSA is a common public key cryptosystem [34].
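To make these two primitives concrete, the following is a minimal Java sketch of h() and SSKp() using the standard java.security APIs. (The prototype of Section 5 uses MessageDigest and Cipher directly; java.security.Signature bundles the same hash-then-encrypt steps.) The class and method names here are ours, not part of the paper's implementation.

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

final class CryptoPrimitives {

    // h(m): cryptographic hash of a message (SHA-1, as in [1]).
    static byte[] h(byte[] m) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(m);
    }

    // SSKp(m): participant p hashes m and signs it with secret key SKp.
    static byte[] sign(PrivateKey skP, byte[] m) throws Exception {
        Signature s = Signature.getInstance("SHA1withRSA");
        s.initSign(skP);
        s.update(m);
        return s.sign();
    }

    // The data recipient checks a signature against the public key PKp.
    static boolean verify(PublicKey pkP, byte[] m, byte[] sig) throws Exception {
        Signature s = Signature.getInstance("SHA1withRSA");
        s.initVerify(pkP);
        s.update(m);
        return s.verify(sig);
    }

    // Each participant holds an RSA key pair issued under the assumed PKI.
    static KeyPair newParticipantKeys() throws Exception {
        KeyPairGenerator g = KeyPairGenerator.getInstance("RSA");
        g.initialize(1024); // matches the 1024-bit keys used in Section 5.1
        return g.generateKeyPair();
    }
}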
3 Provenance Integrity for Atomic Objects

We begin with the simple case in which we have a database D comprised entirely of atomic objects. In this case, we propose to provide tamper-evidence by adding a provenance checksum to each provenance record. In the case of linear provenance (operations consisting of only insertions, updates, and deletions), we take an approach similar to that proposed by Hasan et al. [22], and we begin by recapping this approach. Then, we extend the idea to aggregation operations (non-linear provenance). Consider each database operation resulting in a provenance record (insert, update, and aggregate), and the additional checksum associated with the provenance record:

Insert: Suppose that participant p inserts an object A with value val. The checksum C0 is constructed as

C0 = SSKp(0 | h(A, val) | 0)

Update: Now consider the provenance record collected during an update in which participant p changes the value of object A from val to val′. Suppose that the checksum of the previous operation on A is Ci−1. The checksum for the update is

Ci = SSKp(h(A, val) | h(A, val′) | Ci−1)

Aggregate: Finally, consider the provenance record collected as the result of an aggregation operation that takes as input objects A1, ..., An (with values val1, ..., valn, respectively) and produces an object B with value val. Assume that the input objects are sorted according to a globally-defined order (e.g., numeric or lexical). We denote the checksums for the previous operations on A1, ..., An as C1, ..., Cn. The checksum is

C = SSKp(h(h(A1, val1) | h(A2, val2) | · · · | h(An, valn)) | h(B, val) | C1 | C2 | · · · | Cn)

Example 3. Consider again the non-linear provenance from Figure 2. Figure 3 shows (in tabular form) the provenance records augmented with checksums. Consider the data recipient who obtains object D and the provenance object P defined by these records. She can verify that P and D have not been maliciously altered by checking that all of the following conditions hold:

1. D matches the output field in the most recent provenance record.
2. Beginning with the earliest checksums (i.e., those associated with provenance records having the smallest seqID values among all provenance records with the same output object), recompute the checksum using the input and output fields of the provenance record (and the previous checksum if applicable). Check to make sure that each stored checksum matches the computed checksum.
seqID | Participant | Input | Output | Checksum
0 | p2 | {} | (A, a1) | C1 = SSKp2(0 | h(A, a1) | 0)
0 | p2 | {} | (B, b1) | C2 = SSKp2(0 | h(B, b1) | 0)
1 | p1 | {(A, a1)} | (A, a2) | C3 = SSKp1(h(A, a1) | h(A, a2) | C1)
1 | p2 | {(B, b1)} | (B, b2) | C4 = SSKp2(h(B, b1) | h(B, b2) | C2)
2 | p2 | {(A, a2)} | (A, a3) | C5 = SSKp2(h(A, a2) | h(A, a3) | C3)
2 | p3 | {(A, a1), (B, b2)} | (C, c1) | C6 = SSKp3(h(h(A, a1) | h(B, b2)) | h(C, c1) | C1 | C4)
3 | p1 | {(A, a3), (C, c1)} | (D, d1) | C7 = SSKp1(h(h(A, a3) | h(C, c1)) | h(D, d1) | C5 | C6)

Fig. 3. Non-Linear Provenance Example with Integrity Checksums
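As an illustration, the three checksum rules can be sketched in a few lines of Java on top of the CryptoPrimitives sketch from Section 2.3. The byte-level encoding of an (id, value) pair and the single-byte "0" placeholder are our own assumptions; any fixed, unambiguous encoding would do.

import java.security.PrivateKey;

final class AtomicChecksums {
    static final byte[] ZERO = {0};

    // h(A, val): hash of an object identifier together with its value.
    static byte[] hashObj(String id, String val) throws Exception {
        return CryptoPrimitives.h((id + "," + val).getBytes());
    }

    // Insert: C0 = SSKp(0 | h(A, val) | 0)
    static byte[] insert(PrivateKey skP, String a, String val) throws Exception {
        return CryptoPrimitives.sign(skP, concat(ZERO, hashObj(a, val), ZERO));
    }

    // Update: Ci = SSKp(h(A, val) | h(A, val') | Ci-1)
    static byte[] update(PrivateKey skP, String a, String oldVal, String newVal,
                         byte[] prev) throws Exception {
        return CryptoPrimitives.sign(skP, concat(hashObj(a, oldVal), hashObj(a, newVal), prev));
    }

    // Aggregate: C = SSKp(h(h(A1,v1)|...|h(An,vn)) | h(B, val) | C1 |...| Cn),
    // with ids[] already sorted by the globally-defined order.
    static byte[] aggregate(PrivateKey skP, String[] ids, String[] vals,
                            String b, String bVal, byte[][] prev) throws Exception {
        byte[] inputs = new byte[0];
        for (int i = 0; i < ids.length; i++) inputs = concat(inputs, hashObj(ids[i], vals[i]));
        byte[] msg = concat(CryptoPrimitives.h(inputs), hashObj(b, bVal));
        for (byte[] c : prev) msg = concat(msg, c);
        return CryptoPrimitives.sign(skP, msg);
    }

    static byte[] concat(byte[]... parts) {
        int n = 0;
        for (byte[] p : parts) n += p.length;
        byte[] out = new byte[n];
        int off = 0;
        for (byte[] p : parts) { System.arraycopy(p, 0, out, off, p.length); off += p.length; }
        return out;
    }
}

Under this sketch, C3 of Figure 3 would be produced as update(skP1, "A", "a1", "a2", C1).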
3.1 Checksum Security

In this section, we will briefly explain how the provenance checksums provide the integrity guarantees outlined in Section 2.2. Property R1 is guaranteed because each input and output is cryptographically hashed, and then signed by the acting participant. Thus, in order to modify the input/output values that are part of the provenance record without being detected, an attacker would need to either forge another participant's signature or find a hash collision. Also, attacks that require inserting or deleting provenance records (R2, R3, R6, R7) can be detected because each checksum contains the previous checksum(s) (defined by seqID). Moreover, consider a data recipient who receives a data object A and associated provenance object P. By comparing A to the output field of the most recent provenance record in P, in combination with the other checks, the data recipient can verify that a participant (attacker) has not modified the object without submitting proper provenance (R4) and that the provenance has not been reassigned to a different data object (R5). Finally, non-repudiation (R8) is guaranteed by participants' signatures on provenance checksums.

3.2 Local vs. Global Checksum Chaining

Finally, notice that when there are multiple data objects (each with associated provenance), we chose to "chain" provenance checksums on a per-object basis, rather than constructing a single global chain. While both approaches would satisfy our integrity goals, in a (potentially distributed) multi-user environment with many data objects, there are strong practical arguments in favor of the local-chaining approach. In particular, if we elected to construct a global chain, we would have to enforce a particular global sequence on entries into the provenance table, which would become a bottleneck. Consider, for example, two participants p1 and p2, who are working on objects A and B. Using the global approach, the two participants would have to enforce a total order on their provenance records (e.g., using locking). In contrast, using the per-object approach, the participants can construct provenance chains (and checksums) for the two objects in parallel. Also, we find that local chaining is more resilient to failure. If the provenance associated with object A is corrupted, this does not preclude a data recipient from verifying
the provenance of another object B (provided that B did not originate from an aggregation operation that took A as input).
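For concreteness, the recipient's check of condition 2 from Example 3 can be sketched as follows: walk the records in seqID order, rebuild the signed byte string for each record from its stored input/output fields and the previously verified checksum(s), and verify it under the signer's public key. This sketch reuses the helper classes above; how the input/output hashes are stored alongside each record is our own assumption.

import java.security.PublicKey;

final class ChecksumVerifier {

    // Verify one update record: recompute SSKp's message from the stored
    // fields and check the stored checksum against the participant's key.
    static boolean verifyUpdate(PublicKey pkP, byte[] hInput, byte[] hOutput,
                                byte[] prevChecksum, byte[] storedChecksum) throws Exception {
        byte[] msg = AtomicChecksums.concat(hInput, hOutput, prevChecksum);
        return CryptoPrimitives.verify(pkP, msg, storedChecksum);
    }
}

Any mismatch signals one of the attacks R1-R8; and, because chaining is per-object, a corrupted chain for A leaves the chains of unrelated objects verifiable, as argued above.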
4 Provenance Integrity for Compound Data Objects

In the previous section, we described a checksum-based scheme for providing integrity for provenance (linear and non-linear) describing atomic objects. In this section, we expand the approach to the case where objects are compound (contain other objects).

4.1 Extended Data Model

Throughout the rest of this paper, we will expand our data model to include richer and more realistic structure. In particular, instead of modeling the database D as an unorganized set of objects, we will model the database abstractly in terms of a set of trees (a forest). This abstraction allows us to express provenance information associated with varying levels of data granularity in two common data models: relational and tree-structured XML. In the relational model, we can use a tree to express varying granularities of data (e.g., tables, rows, and cells). Using this abstraction, we expand the idea of an atomic data object to be a triple of the form (id, value, {child ids}), where id uniquely identifies the object, value is the atomic value associated with the object, and {child ids} identifies the set of other objects of which this object is the parent in D. We will also refer to any set of atomic objects such that the child relationships form a tree as a compound object. We will use the notation subtree(A) to refer to the compound object defined by the subtree rooted at A. We assume that the database supports the following primitive operations:

– Insert(A, val, parent): Add a new atomic object to D with value = val. The parent field is optional, and indicates the id of A's parent. (For simplicity, the primitive operation only supports insertions and deletions of leaf objects. However, more complex operations can be expressed using multiple primitive operations, as described in Section 4.4.)
– Delete(A): Remove an existing (leaf) atomic object A from D.
– Update(A, val′): Update the value field of object A to new value val′.
– Aggregate({A1, ..., An}, B): Combine subtree(A1), ..., subtree(An) to produce a new compound object rooted at B. For simplicity, we assume that the resulting root B has no parent in D.

Example 4. As a simple example, consider the compound object shown in Figure 4, which contains atomic objects A, B, C, and D (with values a, b, c, d).

4.2 Extended Provenance Model

The execution of each primitive operation is documented in the form of a provenance record. In this case, we extend the provenance records slightly; specifically, the input and output of each operation can be a compound (rather than atomic) object:

(seqID, p, {subtree(A1), ..., subtree(An)}, subtree(A))
Fig. 4. Example compound object: node (A, a, {B, C}) has children (B, b, {D}) and (C, c, {}); node (D, d, {}) is the child of B

Fig. 5. Example compound hash value:
hA = h((A, a, {B, C}) | hB | hC)
hB = h((B, b, {D}) | hD)
hC = h((C, c, {}))
hD = h((D, d, {}))
A provenance object consists of a set of provenance records of this extended form, which are partially-ordered by seqID as before. When it comes to compound objects, a unique challenge arises because provenance among objects is naturally not independent. For example, consider a relational database, and a participant who updates a particular cell. Intuitively, if we are collecting provenance for cells, rows, and tables, a record of this update should be maintained in the provenance of the cell, but also for the row and table. The extended provenance model captures this through the idea of provenance inheritance. Conceptually, when an update or insert is applied to an atomic object A, we collect the standard provenance record for A: (seqID, p, {subtree(A)}, subtree(A)′), where subtree(A) denotes the subtree rooted at A before the update (in the case of insertions, this is empty), and subtree(A)′ denotes the subtree rooted at A after the update. In addition, when an object A is inserted, updated, or deleted, we must also collect, for each ancestor B of A, the provenance record (seqID, p, {subtree(B)}, subtree(B)′). Of course, this conceptual methodology is not an efficient means of collecting and storing inherited provenance. Efficient collection and storage of fine-grained provenance is beyond the scope of this paper; however, this problem has been studied in prior work. For example, [7,11] describe a set of optimizations that can be used.

4.3 Extended Provenance Checksums

Finally, in order to provide provenance integrity for compound objects, we must extend the signature scheme described in Section 3. We accomplish this using an extended signature scheme related to Merkle Hash Trees [25]. Consider the provenance record (seqID, p, {subtree(A)}, subtree(A)′) collected as the result of an update (or inherited update) on compound object subtree(A). Suppose also that the checksum of the previous (actual or inherited) operation on subtree(A) is Ci−1. We will construct the following checksum for this provenance record:

Ci = SSKp(h(subtree(A)) | h(subtree(A)′) | Ci−1)

Notice that this checksum includes hashes computed over full compound objects (i.e., h(subtree(A))). While we could use any blocked hashing function for this purpose, we elected to define the hash function recursively, which allows us to reuse hashes computed for one complex object when computing the checksums necessary for inherited provenance records.
For example, in Figure 5, hA is the hash value for subtree(A) from Figure 4. hA is calculated by hashing the concatenation of A and hB and hC. Of course, the order of hB and hC is important; different orderings will lead to different values of hA. In order to ensure that the checksums are always consistent, we require a well-defined total order over atomic objects. In the case of XML, an order naturally exists. In the case of relational databases, where a pre-defined order is not always present (e.g., for the rows that are part of a table), we impose the order based on the object keys. Notice also that an update of object B would generate a provenance record for B, but also an inherited provenance record for A. We are able to reuse h(subtree(B)) when computing h(subtree(A)).

Economical Approach. A Basic version of this algorithm will hash all nodes in the input subtree(A), and hash all nodes in the output subtree(A)′. Even reusing h(subtree(B)) when computing h(subtree(A)), this approach requires two walks over the entire tree rooted at A. A more economical approach is to compute the hashes of the input nodes in subtree(A), and only re-compute a hash if the node has changed. In the worst case, this still could require 2 traversals of the tree. However, in the best case, it would be 1 traversal of the tree for computing the hash of the input and 1 traversal of the height of the tree to compute the hash of the output.

Checksum Guarantees. These extended checksums provide the same guarantees as described earlier (Section 2.2). The analysis is essentially the same as in Section 3.1; the only important addition is to observe that the extended hash value constructed for a compound object is also difficult for an adversary to reverse.

4.4 Complex Operations

At the most basic level, provenance records (including checksums) are defined and collected at the level of primitive operations (insert, update, and aggregate), and in the case of compound objects, updates are inherited upward whenever a descendant object is inserted, updated, or deleted. Of course, in practice, it may not be necessary to collect provenance records for every primitive operation. Instead, we can group together a sequence of insert, update, and delete operations to form a complex operation (which we assume produces a modified complex object). This is based on the idea of transactional storage described in [7]. In this case, for every object A, and its ancestors still present in the database after a series of operations, we collect the provenance record (seqID, p, {subtree(A)}, subtree(A)′). The checksum associated with this record is exactly the same as described in the last section.
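Before turning to the experiments, the recursive hash of Section 4.3 (cf. Figure 5) and its Economical recomputation can be sketched as follows. The Node layout, the per-node hash cache, and the dirty flag (set on a node and its ancestors when the node changes) are our own assumptions.

import java.security.MessageDigest;
import java.util.List;

final class SubtreeHash {
    static class Node {
        String id, value;
        List<Node> children;  // empty for leaves; kept in the total order of Sec. 4.3
        byte[] cachedHash;    // hash of subtree(this), if still valid
        boolean dirty;        // marked on this node and its ancestors on update
    }

    // h(subtree(A)) = h((id, value, {child ids}) | h(child1) | ... | h(childk))
    static byte[] hash(Node n) throws Exception {
        if (!n.dirty && n.cachedHash != null) return n.cachedHash; // reuse unchanged subtrees
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder childIds = new StringBuilder();
        for (Node c : n.children) childIds.append(c.id).append(',');
        md.update((n.id + "," + n.value + ",{" + childIds + "}").getBytes());
        for (Node c : n.children) md.update(hash(c)); // recursive, fixed order
        n.cachedHash = md.digest();
        n.dirty = false;
        return n.cachedHash;
    }
}

With this cache in place, an update of cell B in Figure 4 dirties only B and its ancestors, so recomputing hA touches one root-to-leaf path rather than the whole tree, which is exactly the best case described under the Economical approach.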
5 Experiments

This section briefly describes our experimental evaluation, the goal of which is to better understand the time and space overhead introduced by generating and storing checksums. Our experiments reveal that these costs are often low enough to be feasible in practice.
Table 1. Synthetic Tables And Databases

(a) Synthetic Tables
Table No. | Num. Attr. | Num. Rows | Attr. types
1 | 8 | 4000 | all integer
2 | 9 | 3000 | all integer
3 | 10 | 2000 | all integer
4 | 5 | 5000 | all integer

(b) Synthetic Databases
Combination of tables | Num. of nodes
1 | 36002
1,2 | 66000
1,2,3 | 88004
1,2,3,4 | 118006

Table 2. Complex Operations for Each Experiment

Experimental Setup | Complex Operations
A | 1 update on 1 cell
A | 400n updates on 400n cells in 400n rows (n = 1, ..., 10)
A | 4000n updates on 4000n cells in 4000 rows (n = 2, ..., 8)
B | 500 deletes of rows
B | 500 inserts of rows
B | 4000 updates of cells in 500 rows
B | 4000 updates of cells in 4000 rows
C | 500 operations: 96 (19.2%) deletes, 189 (37.8%) inserts, 215 (43%) updates
C | 500 operations: 183 (36.6%) deletes, 152 (30.4%) inserts, 165 (33%) updates
C | 500 operations: 285 (57%) deletes, 106 (21.2%) inserts, 109 (21.8%) updates
C | 500 operations: 391 (78.2%) deletes, 49 (9.8%) inserts, 60 (12%) updates
5.1 Experimental Setup

Our experimental setup includes two databases. First, we have a back-end database, which contains the user data about which we collect provenance. Second, we have a provenance database. We will assume that both databases are relational. For the purposes of fine-grained provenance, we will view the back-end database as a tree of depth 4, with a single root node, and subsequent levels representing tables, rows, and cells. Our main goal is to measure the additional time and space cost incurred by collecting integrity checksums, as opposed to the cost of collecting provenance itself, which has been studied extensively in prior work. Thus, for each complex operation, our experiments record: SeqID (int), Participant (int), Oid (int), Checksum (binary(128)).

For the experiments, we generated synthetic back-end databases, each consisting of one or more synthetic data tables, as described in Table 1(a). We also constructed a set of synthetic complex operations on the back-end database, as described in Table 2.

Our hardware and software configuration consists of a Celeron 3.06GHz machine with 1.96G RAM running Windows XP and Java SE Runtime Environment (JRE) version 6 update 13. Our provenance collection and checksumming code is written in Java, and connected to a MySQL database (v5.1) using MySQL Connector/J. For hashing, we use java.security.MessageDigest (algorithm "SHA"), which generates a 20-byte message digest; for encryption, we use javax.crypto.Cipher (algorithm "RSA"), which produces a 128-byte signature (given a 1024-bit key). For all performance experiments, we report the average across 100 runs, including 95% confidence intervals.
Fig. 6. Average Hashing Time For A Database (x-axis: Number of Nodes In Database, at 36002, 66002, 88004, 118006; y-axis: Hashing Time (ms))

Fig. 7. Hashing The Output Tree Using Basic and Economical Approaches (x-axis: Number Of Cells Updated, from 1 to 32000; y-axis: Hashing Time (ms); series: Basic, Economical)
5.2 Experimental Results

We conducted several experiments, which illustrate the effect of database size on hashing time, the difference between basic and economical hashing, and the effect of operation types on checksum generation.

Hashing. To understand the effect of the back-end database size on hashing time, we use four databases of increasing size, as listed in Table 1(b). The time to hash each database is shown in Figure 6. It can be seen that the time grows roughly linearly with the number of nodes (and thus the size of the database).

To compare the Basic and Economical hashing approaches described in Section 4.3, we use a back-end database with one synthetic table (4000 rows and 8 integer-valued attributes). We used the complex operations in Experimental Setup A (Table 2), which consist entirely of updates, with increasing numbers of cells updated as part of the operation. As expected, the hashing time remains approximately constant when using the basic approach; however, the economical hashing time increases with the number of updated cells.

Of course, a pressing question is whether these techniques can scale to a much larger database (i.e., larger than available memory). To do this, we can read one row at a time, hashing the row and the cells in it, and updating the table's hash value with the row's hash value. When all rows are read and hashed, we get the final hash value of the table and update the database's hash value with the table's hash value. When all tables are hashed, we get the final hash value of the database. As a simple experiment, we hashed a relational database with a single table named "Title". This table had 18,962,041 rows and two fields: Document ID (integer) and Title (varchar). (The total number of nodes was thus 56,886,125.) The time to hash this database was 1226.7 seconds (excluding the time of writing the hash values to disk); i.e., hashing a node took 0.02156 milliseconds on average. Although it is not an apples-to-apples comparison, this average per-node hashing time is within one order of magnitude of that when the whole tree fits into memory.

Effects of Different Operations. Recall from Section 4.2 that, in the fine-grained provenance model, if a node n has x ancestors, and we delete n, then we must produce x (inherited) checksums. Alternatively, if we inserted or updated n, this would produce a total of x + 1 (actual and inherited) checksums.
Fig. 8. Time Overheads for Complex Operations of All-Deletes, All-Inserts and All-Updates (x-axis: Complex Operations: del 500 rows; ins 500 rows; upd 4000 cells in 500 rows; upd 4000 cells in 4000 rows; y-axis: Time Overhead (s); series: Hashing, Encryption, Inserting Checksum)

Fig. 9. Space Overheads for Complex Operations of All-Deletes, All-Inserts and All-Updates (same x-axis as Fig. 8; y-axis: Checksum Data Size (KB); series: Checksum data size)

Fig. 10. Time Overhead for Complex Operations Combining Deletes, Inserts and Updates (x-axis: Percentage of Deletes in A Complex Operation: 0.19, 0.37, 0.57, 0.78; y-axis: Time Overhead (s); series: Hashing, Encryption, Inserting Checksum)

Fig. 11. Space Overhead for Complex Operations Combining Deletes, Inserts and Updates (same x-axis as Fig. 10; y-axis: Checksum Data Size (KB); series: Checksums data size)
To analyze this relationship between operations and checksum overhead, we used the complex operations in Experimental Setup B, and we ran these operations on a database with one synthetic table consisting of 4000 rows and 8 integer-valued attributes. From Figure 8, we can see that the time overhead for the all-deletes operation is the smallest. The time overheads for the all-inserts and all-updates operations are similar to one another. Figure 9 shows the space overhead of storing the (actual and inherited) checksums for these four complex operations. As expected, the space overhead is much larger for inserts and updates, as these produce more total provenance records and checksums.

In addition, we conducted some experiments for complex operations containing combinations of insert, update, and delete primitives. Figure 10 shows the time spent hashing trees, encrypting, and inserting checksums while running Experimental Setup C. As expected, the time overhead decreases as the percentage of deletes in the complex operation increases. Similarly, Figure 11 shows that the space overhead also decreases as the number of deletions increases.
6 Related Work

Issues surrounding provenance have been studied in database systems [4,5,9,8], workflow systems [10,14,17,20,29], scientific applications [3,7,18,19] and general prove-
nance issues [11,30]. However, this paper is the first to provide platform-independent support for verifying the integrity of provenance associated with data at multiple granularities, and through aggregation.

The closest work to ours was described by Hasan et al. [21,22], and focused on security problems (integrity and confidentiality) that arise when tracking and storing provenance in a file system (e.g., PASS [27]). While our work utilizes a similar threat model and integrity checksum approach, we must deal with a significantly more complicated data model (compound objects) and provenance model (non-linear provenance objects) in order to apply these techniques in the database setting.

A recent vision paper by Miklau and Suciu [26] considered the problem of data authenticity on the web, and described a pair of operations (signature and citation) for tracking the authenticity of derived data. One of the main differences between that work and ours is the structure of participants' transformations. The previous work assumed that transformations were structured in a limited way (specifically, as conjunctive queries), whereas we consider arbitrary black-box transformations.

The general problem of logging and auditing for databases has become increasingly important in recent years. Research in this area has focused on developing queryable audit logs (e.g., [2]) and tamper-evident logging techniques (e.g., [32,36,37]). In addition, there has been considerable recent interest in developing authenticated data structures to verify the integrity of query results in dictionaries, outsourced databases, and third-party data publishing (e.g., [15,16,24,28,31]).

Finally, the provenance community has begun to think about security issues surrounding provenance records and annotations. [6,38] motivate the need for, and complications of, security in provenance systems. Several systems protect provenance information from unauthorized access: [39] for provenance in a SOA environment, and [23] for annotations. Meanwhile, several groups are interested in securely releasing information. First, [13] use the history of data ownership to determine if a user may access information. Second, [12] provide views of provenance information based on the satisfaction of an access control policy. Finally, [35] describe the particular requirements that provenance mandates for access control, and propose an extension to attribute-based access control to satisfy these requirements.
7 Conclusion

In this paper, we initiated a study of tamper-evident database provenance. Our main technical contribution is a set of simple protocols for proving the correctness and authenticity of provenance. This is the first paper dealing with the provenance and security issues that arise specifically in databases, including non-linear provenance resulting from aggregation and provenance expressed for data at multiple levels of granularity. Through an extensive experimental evaluation, we showed that the additional performance overhead introduced by these protocols can be small enough to be viable in practice.
Acknowledgements

This work was supported by NSF grant IIS 0741620, NIH grant U54 DA021519, and a grant from the Horace H. Rackham Graduate School.
References

1. Secure hash standard. Federal Information Processing Standards Publication (FIPS PUB) 180(1) (April 1995)
2. Agrawal, R., Bayardo, R., Faloutsos, C., Kiernan, J., Rantzau, R., Srikant, R.: Auditing compliance with a hippocratic database. In: VLDB (2004)
3. Annis, J., Zhao, Y., Vöckler, J.-S., Wilde, M., Kent, S., Foster, I.: Applying chimera virtual data concepts to cluster finding in the sloan sky survey. In: Proceedings of the ACM/IEEE Conference on Supercomputing (2002)
4. Benjelloun, O., Das Sarma, A., Halevy, A., Widom, J.: ULDBs: Databases with uncertainty and lineage. In: VLDB (2006)
5. Bhagwat, D., Chiticariu, L., Tan, W.-C., Vijayvargiya, G.: An annotation management system for relational databases. In: VLDB (2004)
6. Braun, U., Shinnar, A., Seltzer, M.: Securing provenance. In: USENIX (July 2008)
7. Buneman, P., Chapman, A., Cheney, J.: Provenance management in curated databases. In: ACM SIGMOD (2006)
8. Buneman, P., Cheney, J., Vansummeren, S.: On the expressiveness of implicit provenance in query and update languages. In: Schwentick, T., Suciu, D. (eds.) ICDT 2007. LNCS, vol. 4353, pp. 209–223. Springer, Heidelberg (2006)
9. Buneman, P., Khanna, S., Tan, W.-C.: What and where: A characterization of data provenance. LNCS (2001)
10. Callahan, S.P., Freire, J., Santos, E., Scheidegger, C.E., Silva, C.T., Vo, H.T.: VisTrails: Visualization meets data management. In: ACM SIGMOD (2006)
11. Chapman, A., Jagadish, H.V., Ramanan, P.: Efficient provenance storage. In: ACM SIGMOD (2008)
12. Chebotko, A., Chang, S., Lu, S., Fotouhi, F., Yang, P.: Scientific workflow provenance querying with security views. In: WAIM (2008)
13. Cirillo, A., Jagadeesan, R., Pitcher, C., Riely, J.: Tapido: Trust and authorization via provenance and integrity in distributed objects. In: Drossopoulou, S. (ed.) ESOP 2008. LNCS, vol. 4960, pp. 208–223. Springer, Heidelberg (2008)
14. Davidson, S., Cohen-Boulakia, S., Eyal, A., Ludascher, B., McPhillips, T., Bowers, S., Freire, J.: Provenance in scientific workflow systems. IEEE Data Engineering Bulletin 32(4) (2007)
15. Devanbu, P., Gertz, M., Kwong, A., Martel, C., Nuckolls, G., Stubblebine, S.: Flexible authentication of XML documents. Journal of Computer Security 12(6) (2004)
16. Devanbu, P., Gertz, M., Martel, C., Stubblebine, S.: Authentic third-party data publication. In: Proceedings of the IFIP 11.3 Workshop on Database Security (2000)
17. Foster, I., Vockler, J., Wilde, M., Zhao, Y.: Chimera: A virtual data system for representing, querying, and automating data derivation. In: SSDBM (July 2002)
18. Frew, J., Metzger, D., Slaughter, P.: Automatic capture and reconstruction of computational provenance. Concurr. Comput.: Pract. Exper. 20(5), 485–496 (2008)
19. Groth, P., Miles, S., Fang, W., Wong, S., Zauner, K.-P., Moreau, L.: Recording and using provenance in a protein compressibility experiment. In: IEEE International Symposium on High Performance Distributed Computing (2005)
20. Groth, P., Miles, S., Moreau, L.: PReServ: Provenance recording for services. In: Proceedings of the UK OST e-Science Second All Hands Meeting 2005 (AHM 2005)
21. Hasan, R., Sion, R., Winslett, M.: Introducing secure provenance: Problems and challenges. In: International Workshop on Storage Security and Survivability (2007)
22. Hasan, R., Sion, R., Winslett, M.: The case of the fake picasso: Preventing history forgery with secure provenance. In: FAST (2009)
23. Khan, I., Schroeter, R., Hunter, J.: Implementing a secure annotation service. In: Moreau, L., Foster, I. (eds.) IPAW 2006. LNCS, vol. 4145, pp. 212–221. Springer, Heidelberg (2006)
24. Li, F., Hadjieleftheriou, M., Kollios, G., Reyzin, L.: Dynamic authenticated index structures for outsourced databases. In: ACM SIGMOD (2006)
25. Merkle, R.: A certified digital signature. In: Proceedings of the 9th Annual International Cryptology Conference (1989)
26. Miklau, G., Suciu, D.: Managing integrity for data exchanged on the web. In: WebDB (2005)
27. Muniswamy-Reddy, K., Holland, D., Braun, U., Seltzer, M.: Provenance-aware storage systems. In: USENIX (2006)
28. Naor, M., Nissim, K.: Certificate revocation and certificate update. In: USENIX (1998)
29. Oinn, T., Greenwood, M., Addis, M., Alpdemir, M.N., Ferris, J., Glover, K., Goble, C., Goderis, A., Hull, D., Marvin, D., Li, P., Lord, P., Pocock, M.R., Senger, M., Stevens, R., Wipat, A., Wroe, C.: Taverna: lessons in creating a workflow environment for the life sciences: Research articles. Concurr. Comput.: Pract. Exper. 18(10) (2006)
30. Open provenance model (2008), http://twiki.ipaw.info/bin/view/Challenge/OPM
31. Pang, H., Jain, A., Ramamritham, K., Tan, K.: Verifying completeness of relational query results in data publishing. In: ACM SIGMOD (2005)
32. Peha, J.M.: Electronic commerce with verifiable audit trails. In: Internet Society (1999)
33. Rivest, R.: The MD5 message digest algorithm (1992)
34. Rivest, R., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21(2) (1978)
35. Rosenthal, A., Seligman, L., Chapman, A., Blaustein, B.: Scalable access controls for lineage. In: Workshop on the Theory and Practice of Provenance (2009)
36. Schneier, B., Kelsey, J.: Secure audit logs to support computer forensics. ACM Transactions on Information and System Security 2(2) (1999)
37. Snodgrass, R., Yao, S., Collberg, C.: Tamper detection in audit logs. In: VLDB (2004)
38. Tan, V., Groth, P., Miles, S., Jiang, S., Munroe, S., Tsasakou, S., Moreau, L.: Security issues in a SOA-based provenance system. In: Moreau, L., Foster, I. (eds.) IPAW 2006. LNCS, vol. 4145, pp. 203–211. Springer, Heidelberg (2006)
39. Tsai, W.T., Wei, X., Chen, Y., Paul, R., Chung, J.-Y., Zhang, D.: Data provenance in SOA: security, reliability, and integrity. Service Oriented Computing and Applications (2007)
Database Intrusion Detection Using Role Profiling with Role Hierarchy

Garfield Zhiping Wu1, Sylvia L. Osborn1, and Xin Jin2

1 Department of Computer Science, The University of Western Ontario
{zwu58,sylvia}@csd.uwo.ca
2 Microsoft Corporation
[email protected]
Abstract. Insider threats cause the majority of computer system security problems. An anomaly-based intrusion detection system (IDS), which can profile normal behaviors for all users and detect anomalies when a user's behaviors deviate from his/her profiles, can be effective in protecting computer systems against insider threats. Although many IDSes have been developed at the network or host level, there are still very few IDSes specifically tailored to database systems. We build our anomaly-based database IDS using two different profiling methods: one is to build profiles for each individual user (user profiling), and the other is to mine profiles for roles (role profiling) when role-based access control (RBAC) is supported by the database management system (DBMS). Detailed comparative evaluations between role profiling and user profiling are conducted, and we also analyze the reasons why role profiling is more effective and efficient than user profiling. Another contribution of our work is that we introduce role hierarchies into the database IDS and considerably reduce the false positive rate without increasing the false negative rate.

Keywords: Insider threats, Intrusion detection, RBAC, Database security, Role profiling.
1 Introduction
With the digitalization of the world, a considerable amount of invaluable data has been stored in databases; however, there are always many unauthorized attempts to access, manipulate, and even destroy information, deliberately or by accident. Traditionally, mechanisms like authentication, authorization and encryption are applied to ensure the security of the data, but unfortunately, they are often not adequate. A significant fact is that most of the security problems of computer systems are caused by insider threats [15]. However, traditional database security mechanisms do little to prevent the malicious actions or misuses of legitimate users as long as they can log into the system successfully. Therefore, complementary database security mechanisms that can monitor legitimate insiders continuously are necessary.
An IDS has proved promising in detecting actions that attempt to compromise the confidentiality, integrity or availability of a digital resource [5]. Since the 1980s, many IDSes have been developed, most of which work at either the network level or the host level (operating system level). However, malicious actions or misuses at the database level, such as SQL-injection attacks, do not often result in network anomalies; additionally, such actions are not monitored by a host-based IDS if they are issued by legitimate insiders. In short, behaviors that are abnormal for databases are not necessarily also anomalous for networks or operating systems [1,7]. Consequently, current network-based or host-based IDSes, which do not take into account the characteristics of the DBMS, cannot effectively detect abnormal behaviors towards database applications. This forms our motivation to design an IDS specifically tailored to database systems.

This paper presents our anomaly-based database IDS. Assuming RBAC is supported by the DBMS, both user profiling and role profiling are applied to characterize the normal behaviors of inside users by building profiles for users or roles, respectively, against which all new actions are examined. If the new access patterns deviate too much from the normal ones, an alarm is raised. We then conduct detailed experimental evaluations between the two systems using user profiling and role profiling. The primary purpose of our work is to illustrate that RBAC can play an important role in a database IDS, because using role profiling can make the IDS more effective and efficient; the underlying reasons are also analyzed. In order to reduce the false positives of our IDS using role profiling, we introduce role hierarchies into the system, and the experimental results reveal that this method largely reduces the false positives while the false negatives are maintained at an extremely low level.

The rest of the paper is organized as follows. Section 2 surveys some related work in this area. In Section 3, background information necessary for understanding the rest of our work is introduced. We then present our approach to generating both training and testing data in Section 4. Section 5 describes the system architecture, the profiling approaches and the classifier we use in detail, followed by Section 6, in which the experimental results and relevant analysis are presented. We conclude in Section 7, where our future work is discussed as well.
2 Related Work
While a lot of work has been done on network-based and host-based IDSes, very limited work has been done on database IDSes. As with network-based and host-based IDSes, there are two main techniques for database IDSes: misuse detection and anomaly detection. Misuse detection extracts special patterns (signatures) of known attacks in advance, and a new action is recognized as an intrusion whenever it matches any attack signature. This method can detect known intrusions very well but does little when new, unknown attacks happen. Anomaly detection, on the other hand, typically works by defining normal behaviors in advance (creating profiles using attack-free behavior samples) and then comparing new actions with the normal profiles. An alarm is raised if the
new action deviates too much from the normal behaviors. Anomaly detection has the potential to deal with previously unseen attacks, but it often suffers from high false positives.

Among the limited work, DIDAFIT is the first IDS specifically designed for a DBMS [10,8]. DIDAFIT, as a misuse-based database IDS, mines legitimate fingerprints instead of fingerprinting illegal SQL statements, and an alarm is raised when a new transaction cannot find its corresponding legitimate fingerprint. Another system, IIDD, uses a similar approach to detect illegal transactions issued by applications [4]. However, both DIDAFIT and IIDD cannot detect unseen attacks, as mentioned above. Lee et al. present another misuse-based system, only applicable to real-time databases, in [9] by making use of time signatures. Information theory is exploited by Bertino et al. in [2] in order to detect a specific type of attack towards databases called query flood.

Proposed in [3], DEMIDS uses the notion of a distance measure to examine whether a user's new action is outside his/her normal work scope. This system takes advantage of domain knowledge encoded in a certain database schema while building user profiles, which limits its general applicability. Valeur et al. develop an anomaly-based IDS that constructs profiles reflecting normal database transactions performed by web-based applications using a number of different statistical models [16]. The drawback of this approach is that it is limited to detecting only three specific classes of SQL-based attacks.

[1] and [7], the two papers inspiring us most, are the only work taking role information into consideration. However, the main purpose of using role profiling instead of user profiling in that work is to reduce the number of profiles the IDS must maintain, without considering possible performance differences. Only 8 simple application roles are assumed and no role hierarchy is built, while there is always a complex role hierarchy for the database system of a large organization. Moreover, although the false negatives are low, the false positives are relatively high. Our work, therefore, includes building a more realistic role hierarchy for a company's database, comparing user profiling and role profiling systematically, and exploiting the role hierarchy to decrease the false positives.

Finally, although we believe our approach, as well as some other database IDSes, can perform very well in detecting anomalous behaviors, and that IDSes should play an important role in database security, we must point out that an IDS is not a replacement for traditional database security mechanisms but a supplement to them.
3 Preliminaries

3.1 RBAC and Role Hierarchy
Role-based access control is a neutral and flexible approach that is able to simulate both discretionary access control (DAC) and mandatory access control (MAC) [14]. In RBAC, permissions are not granted to users directly; instead, they are assigned to roles. A user gets the permissions associated with the role or roles he/she is assigned to. In [6,13], the concept of role hierarchy is
defined for RBAC models. A role in the role hierarchy inherits all permissions associated with the roles junior to it. For example, in a company's database system, suppose the role Sales Manager is senior to another role Sales Representative; in this case, the former role inherits all permissions the latter one has. While a role is usually the reflection of a job position, a senior job position does not have to be associated with a role also senior in the role hierarchy.
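As a small illustration, the inheritance rule can be sketched in Java as a reachability test over the hierarchy graph; the adjacency-map encoding below is our own choice, not something prescribed by the paper.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RoleHierarchy {
    // role -> roles immediately junior to it
    private final Map<String, Set<String>> juniors = new HashMap<>();

    void addSeniorJunior(String senior, String junior) {
        juniors.computeIfAbsent(senior, k -> new HashSet<>()).add(junior);
    }

    // All roles equal or junior to r, i.e., whose permissions r inherits.
    Set<String> equalOrJunior(String r) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(r);
        while (!stack.isEmpty()) {
            String cur = stack.pop();
            if (seen.add(cur))
                for (String j : juniors.getOrDefault(cur, Set.of())) stack.push(j);
        }
        return seen;
    }

    // r1 is equal or senior to r2 iff r2 is reachable downward from r1.
    boolean equalOrSenior(String r1, String r2) {
        return equalOrJunior(r1).contains(r2);
    }
}

With the edges of Fig. 1 loaded, equalOrSenior("Sales Manager in EU", "Sales Rep in EU") would return true, so the former holds every permission of the latter.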
3.2 AdventureWorks
AdventureWorks, based on which our system is built and tested, is a sample database provided with SQL Server 2005 by Microsoft [12]. It is the database of a fictitious company that manufactures bicycles and sells them to the North American, European and Pacific markets. This database contains 290 users and 69 tables in total. As SQL Server 2005 supports RBAC, we design 32 roles with a role hierarchy for various job functions. Our system focuses on the Sales and Marketing Departments, which have 18 and 9 employees, respectively. In the database, each employee has only one legal account. In addition, 12 roles belong to these two departments, including 2 abstract roles that no users are assigned to. The role hierarchy of these two departments is shown in Fig. 1. 17 tables are referenced by Sales and Marketing users, and we number these tables from 0 to 16.
3.3 Raw Data Collection
Although using the log of the DBMS is a quite direct and easy approach to collecting data, we prefer to use our own data collection mechanism. That is because the data obtained otherwise cannot be fully trusted (e.g., a database administrator (DBA) may change the logs), and our own data collection mechanism makes some further extensions to our IDS feasible (e.g., achieving real-time detection). For each transaction issued by a user, information on 6 features is collected: the EmployeeID and associated RoleID of the user issuing it, the issued time, the IP address the transaction comes from, the access type (direct or through an application) and the SQL statement. These will be further parsed for training and testing.

Fig. 1. Role hierarchy of the Sales and Marketing Departments (the roles are VP Sales; Sales Manager in EU, NA, and PA; Sales Rep in EU, NA, and PA; Sales Rep Basic; Marketing Manager; Marketing Specialist; Marketing Assistant; and Sales&Marketing Basic)
4 Data Set Generation

4.1 Training and Testing Data Set
We initially design 6 applications for the Marketing Department and 25 for the Sales Department, based on the scenarios of AdventureWorks; the users and the permissions to invoke the applications are then assigned to the corresponding roles. A user is then allowed to invoke certain applications according to the permissions he/she has. All transactions are presumed to be issued through the applications in our current system, but our IDS can easily be extended to monitor users who interact with the database directly, such as DBAs.

In the company, a day is divided into three work shifts - Day [7:00:00 ∼ 15:00:00), Evening [15:00:00 ∼ 23:00:00) and Night [23:00:00 ∼ 7:00:00) - and each user only works in his/her work shift. Meanwhile, we also assume each department has its own IP address space; for example, 192.168.1.0 ∼ 192.168.1.255 and 192.168.2.0 ∼ 192.168.2.255 belong to the Sales Department and the Marketing Department, respectively.

For each legitimate transaction, first, an employee in either the Sales or Marketing Department is picked out randomly. After that, we randomly choose a time within the corresponding work shift (e.g., 10:16:08), an IP address (e.g., 192.168.1.79) within the employee's department's IP space, and one application the user can invoke legally among all applications he/she is permitted to invoke. Finally, we assume that the user is not always interested in all attributes the application he/she invokes can access, so a non-empty subset of the attributes is randomly generated. In this way, a transaction is manufactured.
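A minimal sketch of this generation procedure is given below; the arrays standing in for the employee list, shift assignments, department IP prefixes, and permitted applications are placeholders for the AdventureWorks assignments, not the actual generator code.

import java.util.Random;

final class LegitTransactionGenerator {
    static final Random RNG = new Random();

    // shiftOf[i] is "Day", "Evening" or "Night"; deptPrefixOf[i] is, e.g., "192.168.1".
    static String generate(int[] employeeIds, String[] shiftOf,
                           String[] deptPrefixOf, String[][] permittedAppsOf) {
        int i = RNG.nextInt(employeeIds.length);               // random employee
        String time = randomTimeIn(shiftOf[i]);                // within his/her shift
        String ip = deptPrefixOf[i] + "." + RNG.nextInt(256);  // department IP space
        String[] apps = permittedAppsOf[i];
        String app = apps[RNG.nextInt(apps.length)];           // a legally invocable app
        // a non-empty random subset of the app's attributes would be drawn here
        return employeeIds[i] + "," + time + "," + ip + "," + app;
    }

    static String randomTimeIn(String shift) {
        int start = shift.equals("Day") ? 7 : shift.equals("Evening") ? 15 : 23;
        int sec = RNG.nextInt(8 * 3600);                       // each shift lasts 8 hours
        int h = (start + sec / 3600) % 24;
        return String.format("%02d:%02d:%02d", h, (sec / 60) % 60, sec % 60);
    }
}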
4.2 Intrusion Data Set
When RBAC is supported, we accept the assumption of [1,7] that a transaction (Ri, Ti) (meaning that a user assigned to Ri issues the transaction Ti) becomes an anomaly if it is changed to (Rj, Ti) (i ≠ j). So the first step of generating intrusions is to manufacture a set of legitimate transactions using the methods described in Section 4.1. Taking the role hierarchy into consideration, we then change each transaction's associated RoleID to another one that is neither equal nor senior to the original one. The reason the new role cannot be senior to the old one is that a senior role has all the permissions the junior one has, as presented in Section 3.1. For example, Marketing Manager is senior to Marketing Analyst (see Fig. 1), so if (Marketing Analyst, TMA) is legal, (Marketing Manager, TMA) must not be treated as an intrusion. A sketch of this relabeling step is given below.
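A small sketch of the relabeling, assuming a precomputed mapping from each role to the set of roles that are equal or senior to it; the helper names are ours.

```python
import random

def make_intrusion(transaction, all_roles, seniors_or_equal):
    """Relabels a legitimate transaction (Ri, Ti) as (Rj, Ti), where Rj is
    neither equal nor senior to Ri; seniors_or_equal[r] is the set of roles
    that dominate r in the role hierarchy, including r itself."""
    original = transaction["role"]
    candidates = [r for r in all_roles if r not in seniors_or_equal[original]]
    forged = dict(transaction)
    forged["role"] = random.choice(candidates)
    return forged
```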
For the user profiling IDS, we assume role information is unavailable; therefore, we simply change the EmployeeID to a different one when generating an intrusion.
5 System Description
We have now cast our task as a classification problem. The next challenge is to find a classifier that achieves relatively low false positive/negative rates. We also want the computational cost of the classifier to be acceptable, especially in detection mode, because we expect a short latency when an intrusion occurs. This section describes the system architecture of our IDS and the classifier we use in detail.

5.1 System Architecture
Fig. 2 shows the main components of our system, as well as its working process. The Data Generator generates data for both training and testing. The Data Collector collects transactions containing the features listed in Section 3.3. Each collected transaction is then passed to the Parser, which further parses the transaction and forms the features needed for training or detection. Its duties include mapping the exact time the transaction is issued to the corresponding work shift and the exact IP address the transaction comes from to the corresponding DepartmentID. It also converts the SQL statement feature into four features: query type, referenced tables, the number of attributes in the answer, and area constraints. We use a string to represent the referenced tables according to each table's number. Table 1 and Table 2 illustrate a collected transaction and its format after being parsed; a sketch of this parsing step follows Table 2. Note that the two Data Collectors in the training module and the detection module have the same functions, and so do the two Parsers. Both the Trainer and Detector use a Naive Bayes classifier to build profiles (training) and examine newly issued transactions (detection). The fundamentals of a Naive Bayes classifier and how it works for training and detection in our IDS are presented in Section 5.2.

Fig. 2. System working process
Table 1. An example of a collected transaction in raw format

  Collected feature   Feature value
  EmployeeID          287
  RoleID              9
  Time                10:32:09AM
  IP address          192.168.1.95
  AccessType          1 (through application)
  SQL statement       SELECT S.Name FROM Sales.Store AS S
                      JOIN Sales.Customer AS Cu ON S.CustomerID = Cu.CustomerID
                      JOIN Sales.SalesTerritory AS Te ON Te.TerritoryID = Cu.TerritoryID
                      WHERE Cu.CustomerType = 'S' AND Te.[Group] = 'Europe'
Table 2. An example of a transaction after being parsed

  Feature                                   Feature value
  EmployeeID (ignored for role profiling)   287
  RoleID (ignored for user profiling)       9
  WorkShift                                 Day
  DepartmentID                              3 (Sales Department)
  AccessType                                1 (through application)
  QueryType                                 SELECT
  ReferencedTables                          00000111000000000
  NumberOfAttributes                        1
  AreaConstraint                            Europe
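A minimal sketch of the Parser's mapping steps described above (work shift, DepartmentID, query type, and the referenced-tables bit string), assuming hypothetical raw field names and omitting the attribute count and area constraint for brevity; the Marketing DepartmentID is an assumption of ours.

```python
import re

DEPT_BY_PREFIX = {"192.168.1.": 3, "192.168.2.": 2}  # Marketing id assumed

def work_shift(hour):
    # Day [7,15), Evening [15,23), Night [23,7), per Section 4.1.
    if 7 <= hour < 15:
        return "Day"
    if 15 <= hour < 23:
        return "Evening"
    return "Night"

def parse_transaction(raw, table_numbers, num_tables=17):
    """Maps a raw transaction (Table 1) to part of its parsed form (Table 2).
    table_numbers maps a table name such as 'Sales.Store' to its 0-16 id."""
    dept = next(d for p, d in DEPT_BY_PREFIX.items()
                if raw["ip"].startswith(p))
    # Referenced tables encoded as a 17-character bit string, one per table.
    bits = ["0"] * num_tables
    for name, num in table_numbers.items():
        if re.search(re.escape(name), raw["sql"], re.IGNORECASE):
            bits[num] = "1"
    return {"EmployeeID": raw["employee"], "RoleID": raw["role"],
            "WorkShift": work_shift(raw["time"].hour),
            "DepartmentID": dept, "AccessType": raw["access_type"],
            "QueryType": raw["sql"].split()[0].upper(),
            "ReferencedTables": "".join(bits)}
```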
5.2 Classifier
Similar to [1,7], we chose a Naive Bayes classifier [11] to build profiles in training mode and to detect intrusions in detection mode, because of the following advantages. First, its computational cost is quite low. Second, although a Naive Bayes classifier greatly simplifies reality by assuming that all features of a class are completely unrelated to each other (the independence assumption), it usually performs much more accurately than one would expect. The reason is probably that, as a probabilistic classifier using Maximum A-Posteriori (MAP) estimation, a Naive Bayes classifier can reach the right classification without computing accurate class probabilities, as long as the correct class has a higher probability than any other class. Finally, a Naive Bayes classifier is robust to noise [11].

We present the mathematical background of a Naive Bayes classifier in the following paragraphs. Abstractly speaking, the probability model for a classifier is a conditional model:

  p(C | F1, ..., Fn)    (1)
In this model, C is a class, H is the hypothesis space (C ∈ H), and Fi is one of the features that compose an instance x of the data. The goal of the classifier is to find the most probable class when an instance x consisting of features (f1, ..., fn) is given. A decision rule must be combined with the classifier for classification; the most common decision rule is Maximum A-Posteriori (MAP). Using MAP, the corresponding classifier is defined as follows:

  classify(f1, ..., fn) = argmax_{c ∈ H} p(C = c | F1 = f1, ..., Fn = fn)    (2)

However, calculating p(C | F1, ..., Fn) directly is very difficult, and in many cases actually infeasible. Therefore, we reformulate p(C | F1, ..., Fn) using Bayes' theorem:

  p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn)    (3)

We notice that the denominator of (3) is a constant when an instance x with features Fi is given, so we are only interested in the numerator, which can be rewritten by the chain rule:

  p(C) p(F1, ..., Fn | C) = p(C) p(F1 | C) p(F2, ..., Fn | C, F1)
                          = p(C) p(F1 | C) p(F2 | C, F1) p(F3, ..., Fn | C, F1, F2)
                          = p(C) p(F1 | C) p(F2 | C, F1) ... p(Fn | C, F1, ..., Fn-1)    (4)

The next step is to apply the independence assumption to simplify Equation (4). According to this assumption, Fi is completely independent of Fj when i ≠ j, so we have p(Fi | C, Fj) = p(Fi | C). Equation (4) then simplifies to:

  p(C) p(F1, ..., Fn | C) = p(C) ∏_{i=1..n} p(Fi | C)    (5)

The classifier can now be rewritten as:

  classify(f1, ..., fn) = argmax_{c ∈ H} p(C = c | F1 = f1, ..., Fn = fn)
                        = argmax_{c ∈ H} p(C) p(F1, ..., Fn | C) / p(F1 = f1, ..., Fn = fn)
                        = argmax_{c ∈ H} p(C) ∏_{i=1..n} p(Fi | C)    (6)
The general principle of a Naive Bayes classifier is stated above. When it is applied to our IDS, the features of each transaction obtained from the Parser form the features F1 to Fn of the Naive Bayes classifier, and each role (for role profiling) or each user (for user profiling) is a class. In training mode, we calculate, for each class, the probability of each observed value of each feature, based on the training samples. The detection task then becomes finding the most probable role or user to have issued a newly given transaction, and checking whether it equals the one originally associated with the transaction. If not, the transaction is flagged as anomalous.
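The following minimal sketch illustrates this training/detection scheme under our own assumptions (categorical features, empirical frequency estimates, and the zero-probability rule described in Section 5.3); all class and function names are ours, not the authors'.

```python
from collections import Counter, defaultdict

class NaiveBayesProfiler:
    """Per-class (role or user) profiles as empirical feature-value
    frequencies, classified with the MAP rule of Eq. (6)."""

    def fit(self, transactions, labels):
        # transactions: list of dicts {feature: value}; labels: class ids.
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # feature_counts[c][f][v] = #training transactions of class c
        # in which feature f took value v.
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        for x, c in zip(transactions, labels):
            for f, v in x.items():
                self.feature_counts[c][f][v] += 1

    def _score(self, x, c):
        # Unnormalised p(C=c) * prod_i p(Fi | C=c); an unseen feature value
        # contributes 0, matching the rule stated in Section 5.3.
        p = self.class_counts[c] / self.total
        for f, v in x.items():
            p *= self.feature_counts[c][f][v] / self.class_counts[c]
        return p

    def map_class(self, x):
        # Returns (most probable class, its score); a zero score for every
        # class means the transaction matches no profile at all.
        scores = {c: self._score(x, c) for c in self.class_counts}
        return max(scores.items(), key=lambda kv: kv[1])

def is_anomalous(profiler, transaction, original_class):
    best, score = profiler.map_class(transaction)
    return score == 0 or best != original_class
```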
5.3 User Profiling vs. Role Profiling
Using Naive Bayes as our classifier, user profiling and role profiling are applied to build the two IDSes, respectively. For the user profiling IDS, we assume role information is unavailable, so we build profiles and detect intrusions based on the EmployeeID. The role profiling IDS is quite similar; the only difference is that we use the RoleID instead of the EmployeeID to construct profiles and detect anomalies. The profiling work itself is straightforward: we calculate the probability of each observed value of each feature for every user or role and record them in files (the profiles). In detection mode, the probability of a new value that cannot be found in the corresponding profile is simply set to 0; if p(C = c | F1 = f1, ..., Fn = fn) is 0 for every possible c (a user or a role) in the hypothesis space, an alarm is raised directly. Otherwise, as stated above, whenever the original user or role (denoted ORIGINAL_U or ORIGINAL_R) differs from the most probable one obtained using the classifier and the MAP decision rule (denoted MAP_U or MAP_R), an anomaly is detected.

5.4 Role Profiling with Role Hierarchy
We now present our novel intrusion detection approach, which takes the role hierarchy into consideration. It is based on the role profiling described in Section 5.3, with one extra rule applied. Under plain role profiling, when a role RH issues a transaction and the Naive Bayes classifier with the MAP rule says the most probable issuing role is another role RL with RL ≠ RH, an alarm is raised. However, we notice that if RL ≺ RH (meaning RL and RH are comparable and RL is junior to RH in the role hierarchy), the alarm is probably a false positive. The reason is that RH inherits all permissions RL has, and when a user assigned to RH exploits the permissions that RH inherits from RL, we can consider that this user is acting as a member of RL. In this case, the real role associated with a transaction is the role equal or junior to the role to which the issuing user is assigned. We also point out that we are interested in the real role only when MAP_R ≠ ORIGINAL_R: when a user assigned to RH is taking advantage of RL's permissions and acting as a member of RL, there is still a possibility that MAP_R = RH, and that situation is certainly legitimate. In summary, the new IDS performs exactly the same as the IDS using role profiling (see Section 5.3 for details) when MAP_R = ORIGINAL_R. When MAP_R ≠ ORIGINAL_R, the system checks whether MAP_R ≺ ORIGINAL_R, and an alarm is raised only if MAP_R is not junior to ORIGINAL_R; a sketch of this rule is given below. In the following sections, we use the terms user profiling, simple role profiling (no role hierarchy), and advanced role profiling (with role hierarchy) to denote the three profiling approaches described above.
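A compact sketch of this extra rule, assuming the role hierarchy is given as a map from each role to its immediately senior roles; the names are ours.

```python
def is_junior(role, senior, parents):
    """True if `role` is strictly junior to `senior` in the role hierarchy.
    `parents` maps each role to the set of roles immediately senior to it."""
    frontier = set(parents.get(role, ()))
    seen = set()
    while frontier:
        r = frontier.pop()
        if r == senior:
            return True
        if r not in seen:
            seen.add(r)
            frontier |= set(parents.get(r, ()))
    return False

def raise_alarm(map_role, original_role, parents):
    # Simple role profiling alarms whenever map_role != original_role;
    # advanced role profiling additionally tolerates map_role being junior
    # to original_role, since the senior role inherits its permissions.
    if map_role == original_role:
        return False
    return not is_junior(map_role, original_role, parents)
```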
6 Experimental Evaluation and Related Analysis
This section presents the detailed results of our tests. The primary objectives are to measure the false positive and false negative rates of our IDS when using user profiling, simple role profiling, and our advanced role profiling, respectively, and to evaluate the overhead due to intrusion detection. A significant principle of the testing is to keep the comparisons fair. Our test methodology is as follows. First, we evaluate the three profiling approaches by measuring the false positives and false negatives each approach causes. Training sets containing different numbers of transactions are used to build profiles with user profiling, simple role profiling, and advanced role profiling, respectively. We then generate 800 illegitimate transactions for false negative testing and 1688 legitimate transactions for false positive testing. Data generation is described in detail in Section 4. Second, we pay attention to the overhead imposed on the original database system. Although we plan to design a system architecture supporting real-time detection in future work, currently we simply measure the detection time for various numbers of transactions and their response time without the IDS deployed. The potential overhead should be light if the detection time is short compared with the response time. Finally, we focus on training time.

6.1 False Negatives/Positives Test Results and Discussion
Fig. 3 shows the false negative testing results. Generally speaking, every profiling approach achieves a low false negative rate, which shows that they can detect most anomalies.

Fig. 3. False negative rates test
Table 3. False positive rate test in detail

  Training transactions   UP (%)   SRP (%)   ARP (%)
  50                      89.514   47.808    43.009
  125                     84.005   29.147    22.156
  250                     79.150   17.180     7.761
  500                     76.777   14.277     3.791
  1000                    73.104   12.441     1.718
  2000                    73.756   12.026     1.007
  4000                    73.341   11.552     0.889
  8000                    73.222   11.197     0.237
Fig. 4. False positive rates test
Moreover, the false negative rate remains 0 even with as few as 50 training transactions for both simple role profiling and advanced role profiling. We present the false positive rates for user profiling (denoted UP), simple role profiling (SRP), and advanced role profiling (ARP) in Table 3 and Fig. 4. While many anomaly-based IDSes can achieve relatively low false negative rates, a low false positive rate remains a difficult objective for this category of IDSes. As expected, user profiling results in much poorer performance (it appears almost useless in practice) compared with the other two alternatives. Unsurprisingly, the false positive rate of simple role profiling is very close to the false positive rate reported in [1,7], where similar role profiling is used. Our advanced role profiling approach, however, largely improves the performance of the IDS: its false positive rate drops dramatically as the number of training samples grows, reaching as low as 0.24% with a training set of 8000 transactions.
It is important to explain why role profiling performs much better than user profiling, and especially why it yields lower false positive rates. Before discussing the reasons, we introduce the concept of kindred users, defined as follows:

Definition 1. Kindred users in role-based access control are users who are assigned to the same role.

Apparently, kindred users have many permissions in common and often perform many similar operations. Therefore, their probabilities for the values of the features can be quite close to each other, and when such a user issues a transaction, there is a good chance that the Naive Bayes classifier mis-recognizes it as the behavior of one of his/her kindred users. We can also explain this in a simplified way. For instance, suppose users U1 ~ Un are kindred users assigned to role Rx, and Ui (1 ≤ i ≤ n) invokes application APP_a more often than the other kindred users do in the training set (the permission to execute APP_a is only assigned to Rx); then, in detection mode, when any other Uj (1 ≤ j ≤ n and j ≠ i) invokes APP_a, the IDS will match the transaction to Ui and a false alarm is raised. In summary, user profiling causes many false positives due to mis-matchings between kindred users when we assume role information is unavailable. Furthermore, even when RBAC is indeed not supported, there will be users who have similar operational duties, and mis-matchings can occur frequently among them.

We have explained why user profiling results in more false positives and found that mis-matchings among kindred users are the primary reason. Is it then possible that there is also a high mis-matching rate among roles? Unfortunately, this may happen if we do not design roles reasonably and have too many "similar roles". Therefore, a good design of roles is a prerequisite for satisfactory behavior of the IDS. The success of our advanced role profiling strongly supports our earlier statement that IDSes should be a supplemental mechanism rather than a replacement for traditional security mechanisms, because an IDS can perform much better when working together with other mechanisms. Security is a hybrid problem, and we must also consider many non-technological aspects such as policy making and human factors. One principle we must remember is that we should not expect any single security mechanism to perform perfectly for data protection.

We have stated why our advanced role profiling approach reduces false positives. Here we give a more detailed explanation of how the false positives removed by advanced role profiling arise. When a role RH is senior to another role RL, and a user Ui assigned to RH is exploiting the permissions that RH inherits from RL, Ui can be viewed as an acting member of RL. Certainly, Ui can do this legally, and the system raises a false alarm if it categorizes the transaction into RL and does not check the role hierarchy. We can use another simplified case to exemplify this issue: assume two users Uh and Ul are members of the roles RH and RL, respectively, and RL holds the permission to invoke application APP_b; according to the training data, Uh invokes APP_b m times and Ul does so n times (m < n). We then find that when Uh later invokes APP_b,
it will be mis-matched to RL and an alarm raised. Therefore, it is worthwhile to additionally check the role hierarchy so that the IDS is prevented from raising this category of false alarms.

Another crucial observation is that the false positive rates decrease much faster with fewer than 1000 training transactions than with more than 1000. Indeed, with more than 1000 training samples, the false positive rates of all three profiling approaches remain stable even as the training set keeps growing. Clearly, 1000 is a dividing line (this number probably varies in other IDSes), and we call this number of training samples the training threshold. This indicates two points: first, we need enough training data, exceeding the training threshold, to achieve the ideal behavior of the IDS; second, we can use it to find a reasonable trade-off between detection capability and the costs of data collection and training.

6.2 Overhead and Training Time
The performance test of our IDS is conducted to quantify the overhead. We generate five testing sets containing different numbers of transactions; in each set, 10% are intrusions while the others are normal. Using these testing sets, we first measure the query response time of the original database system without the IDS deployed; we then measure the system's examining time when using user profiling, simple role profiling, and advanced role profiling, respectively. The results (measured in seconds) are shown in Table 4.

Table 4. Performance test (in Sec)

  Samples   Query      UP     SRP    ARP
  1000       227.313   0.625  0.625  0.625
  2000       457.375   0.141  0.094  0.094
  3000       678.321   0.203  0.141  0.156
  4000       945.578   0.281  0.188  0.203
  5000      1080.239   0.328  0.250  0.250
Obviously, the system's examining time is very short compared with the query response time, whichever of the three profiling approaches is used, owing to the low computational cost of a Naive Bayes classifier. Additionally, role profiling requires less examining time than user profiling because the hypothesis space of the former is smaller than that of the latter (the number of roles is smaller than the number of users). Moreover, advanced role profiling costs slightly more time than simple role profiling because of the extra step of checking the role hierarchy; we find, however, that this extra checking time is quite short (sometimes even too short to be measured).

Although a longer training time is bearable for an IDS, shorter training times are still preferable. We present the comparison of the training times in Fig. 5. Note that the training data collected is separated by the RoleIDs/EmployeeIDs in advance. Unsurprisingly, role profiling requires less training time due to the fewer profiles that need to be built.

Fig. 5. Training time
7 Conclusions and Future Work
Database security problems seriously plague organizations that store invaluable data in their databases. An intrusion detection system provides another layer of security; however, few IDSes are specifically tailored to the DBMS, so their performance in detecting intrusions or misuse of databases is quite poor. Therefore, in this paper we present an anomaly-based IDS that takes the characteristics of a DBMS into account; our system is able to monitor every inside user continually so that insider threats can be substantially lessened. We apply a Naive Bayes classifier to build profiles in advance, and each new behavior is compared against the profiles in order to find the most probable role/user (MAP_role/user) that may have issued the transaction. An alarm is raised whenever MAP_role/user differs from the one originally associated with the transaction (checking the role hierarchy when using advanced role profiling). Although RBAC is supported in the DBMS we use, we initially built profiles for users (user profiling) by assuming role information is unavailable; we then took advantage of RBAC and built our system using simple role profiling (without role hierarchy) and advanced role profiling (with role hierarchy). Our evaluations illustrate that role profiling is in general more effective (lower false positives/negatives) and more efficient (less training time) than user profiling; the results also show that advanced role profiling can further improve the behavior of the IDS, since false positives are greatly reduced. We also point out that our IDS can monitor the DBAs as well by building profiles for the role DBA, although we have not done so yet. Additionally, another advantage of role profiling worth mentioning is that we do not have to re-train the IDS when a new user joins the system; we simply assign the new user to the corresponding role. Our method has shown its promising applicability.

We are currently extending our work in various directions. The first is to pay extra attention to the DBAs. The DBAs have more flexible behaviors than average inside users, so our current system may produce higher false positives/negatives when examining the DBAs' behaviors. However, there are some unique characteristics of DBAs; for example, they are usually not expected to access the detailed data within user tables. We can apply such additional rules for DBAs to improve our IDS. Second, we plan to develop a system architecture in which new transactions can be checked in real time and no overhead is imposed on the original database system. Extra servers may be necessary to support the new architecture. Finally, part of our future work will be related to investigating response mechanisms. We hope that the IDS can take appropriate actions automatically, even when it is the DBA who conducts an attack.
References

1. Bertino, E., Kamra, A., Terzi, E., Vakali, A.: Intrusion detection in RBAC-administered databases. In: ACSAC, pp. 170–182. IEEE Computer Society, Los Alamitos (2005)
2. Bertino, E., Leggieri, T., Terzi, E.: Securing DBMS: Characterizing and detecting query floods. In: Zhang, K., Zheng, Y. (eds.) ISC 2004. LNCS, vol. 3225, pp. 195–206. Springer, Heidelberg (2004)
3. Chung, C.Y., Gertz, M., Levitt, K.N.: DEMIDS: A misuse detection system for database systems. In: van Biene-Hershey, M.E., Strous, L. (eds.) IICIS, IFIP Conference Proceedings, vol. 165, pp. 159–178. Kluwer, Dordrecht (1999)
4. Fonseca, J., Vieira, M., Madeira, H.: Integrated intrusion detection in databases. In: Bondavalli, A., Brasileiro, F., Rajsbaum, S. (eds.) LADC 2007. LNCS, vol. 4746, pp. 198–211. Springer, Heidelberg (2007)
5. Heady, R., Luger, G., Maccabe, A., Servilla, M.: The architecture of a network level intrusion detection system. Technical report, University of New Mexico, Department of Computer Science (August 1990)
6. American National Standards Institute: For information technology – role-based access control. ANSI INCITS 359 (January 2004)
7. Kamra, A., Terzi, E., Bertino, E.: Detecting anomalous access patterns in relational databases. VLDB Journal 17(5), 1063–1077 (2008)
8. Lee, S.Y., Low, W.L., Wong, P.Y.: Learning fingerprints for a database intrusion detection system. In: Gollmann, D., Karjoth, G., Waidner, M. (eds.) ESORICS 2002. LNCS, vol. 2502, pp. 264–280. Springer, Heidelberg (2002)
9. Lee, V.C.S., Stankovic, J.A., Son, S.H.: Intrusion detection in real-time database systems via time signatures. In: IEEE Real Time Technology and Applications Symposium, pp. 124–133 (2000)
10. Low, W.L., Lee, J., Teoh, P.: DIDAFIT: Detecting intrusions in databases through fingerprinting transactions. In: ICEIS, pp. 121–128 (2002)
11. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
12. Microsoft MSDN: AdventureWorks sample OLTP database, http://msdn.microsoft.com/en-us/library/ms124659.aspx (February 2009)
13. Nyanchama, M., Osborn, S.L.: The role graph model. In: ACM Workshop on Role-Based Access Control (1995)
14. Osborn, S.L., Sandhu, R., Munawer, Q.: Configuring role-based access control to enforce mandatory and discretionary access control policies. ACM Transactions on Information and System Security 3(2), 85–106 (2000)
15. Parker, D.B.: Crime by Computer, 1st edn. Charles Scribner's Sons, New York (1976)
16. Valeur, F., Mutz, D., Vigna, G.: A learning-based approach to the detection of SQL attacks. In: Julisch, K., Krügel, C. (eds.) DIMVA 2005. LNCS, vol. 3548, pp. 123–140. Springer, Heidelberg (2005)
Query Processing Techniques for Compliance with Data Confidence Policies

Chenyun Dai (1), Dan Lin (2), Murat Kantarcioglu (3), Elisa Bertino (1), Ebru Celikel (3), and Bhavani Thuraisingham (3)

(1) Department of Computer Science, Purdue University; {daic,bertino}@cs.purdue.edu
(2) Department of Computer Science, Missouri University of Science and Technology; [email protected]
(3) Department of Computer Science, The University of Texas at Dallas; {muratk,ebru.celikel,bhavani.thuraisingham}@utdallas.edu
Abstract. Data integrity and quality are critical issues in many data-intensive decision-making applications. In such applications, decision makers need to be provided with high quality data on which they can rely with high confidence. A key issue is that obtaining high quality data may be very expensive. We thus need flexible solutions to the problem of data integrity and quality. This paper proposes one such solution based on four key elements. The first element is the association of a confidence value with each data item in the database. The second element is the computation of the confidence values of query results by using lineage propagation. The third element is the notion of confidence policies. Such a policy restricts access to query results by specifying the minimum confidence level required for use in a certain task by a certain subject. The fourth element is an approach to dynamically increment data confidence levels so as to return query results that satisfy the stated confidence policies. In particular, we propose several algorithms for incrementing data confidence levels while minimizing the additional cost. Our experimental results demonstrate the efficiency and effectiveness of our approach.
1 Introduction

Nowadays, it is estimated that more than 90% of the business records being created are electronic [1]. These electronic records are commonly used by companies and organizations to profile customers' behaviors, to improve business services, and to make tactical and strategic decisions. As such, the quality of these records is crucial [2]. Approaches like data validation and record matching [18] have been proposed to obtain high quality data and maintain data integrity. However, improving data quality may incur additional, non-negligible costs. For example, to verify a customer address, a company may need to compare its records about the customer with other available sources, which may charge fees. To verify the financial status of a startup company, a venture capital company may have to acquire reports from a certified organization or even send auditors to the startup company, which adds time and financial costs. As for health care, cancer registry and administrative data are often readily
available at reasonable costs; patient and physician survey data are more expensive, while medical record data are often the most expensive to collect and are typically quite accurate [11]. In other words, the cost of obtaining accurate data can be very high or even unaffordable for some companies or organizations. It is also important to notice that the required level of data quality depends on the purpose for which the data are to be used. For example, for tasks which are not critical to an organization, like computing a statistical summary, data with a medium confidence level may be sufficient, whereas when an individual in an organization has to make a critical decision, data with high confidence are required. As an example, Malin et al. [11] give some interesting guidelines: for the purpose of hypothesis generation and identifying areas for further research, data about cancer patients' disease and primary treatment need not be highly accurate, as treatment decisions are not likely to be made on the basis of these results alone; however, for evaluating the effectiveness of a treatment outside of the controlled environment of a research study, accurate data are desired. While identifying the purposes of data use is the task of field experts, the question for computer scientists here is how to design a system that can take such input and provide data meeting the confidence level required for each data use. In particular, how do we specify which tasks require high-confidence data? In situations where we do not have enough data at a high enough confidence level to allow a user to complete a task, how can we improve the confidence of the data to the desired level with minimum cost? Yet another question could be: given a huge data volume, which portion of the data should be selected for quality improvement? When dealing with large data volumes, it is really hard for a human to quickly find an optimal solution that meets the decision requirement at minimal cost. As we will see, the problem is NP-hard.

To solve the above problems, we propose a comprehensive framework based on four key elements (see Figure 1). The first element is the association of confidence values with data in the database. A confidence value is a numeric value ranging from 0 to 1, which indicates the trustworthiness of the data. Confidence values can be obtained by using techniques like those proposed by Dai et al. [5], which determine the confidence value of a data item based on various factors, such as the trustworthiness of data providers and the way in which the data has been collected. The second element is the computation of the confidence values of query results based on the confidence values of each data item and lineage propagation techniques [6]. The third and fourth elements, which are the novel contributions of this paper, deal respectively with the notion of confidence policy and with strategies for incrementing the confidence of query results at query processing time. The notion of confidence policy is novel. Such a policy specifies the minimum confidence level that is required for use of a given data item in a certain task by a certain subject. As a complement to the traditional access control mechanism that applies to base tuples in the database before any operation, the confidence policy restricts access to the query results based on their confidence level.
Such an access control mechanism can be viewed as a natural extension to the Role-based Access Control (RBAC) [7] which has been widely adopted in commercial database systems. Therefore, our approach can be easily integrated into existing database systems.
Fig. 1. System Framework (components: Confidence Assignment, Query Evaluation, Policy Evaluation, Strategy Finding, Data Quality Improvement, and the Database, connected by the numbered flows (1) Query, (2) Query, (3) data, (4) Intermediate results, (5) Request more results, (6) Cost, (7) Request more results, (8) Request improvement, (9) Increase confidence, (10) Results)
Since some query results will be filtered out by the confidence policy, a user may not receive enough data to make a decision and may want to improve the data quality. To meet the user's need, we propose an approach for dynamically incrementing data confidence levels; this approach is the fourth element of our solution. In particular, our approach selects an optimal strategy which determines which data should be selected and by how much their confidence should be increased to satisfy the confidence level stated by the confidence policies. We assume that each data item in the database is associated with a cost function that indicates the cost of improving the confidence value of this data item. Such a cost function may depend on various factors, like time and money. We develop several algorithms to compute the minimum cost for such confidence increments.

It is important to compare our solution to the well-known Biba Integrity Model [4], which represents the reference integrity model in the context of computer security. The Biba model is based on associating an integrity level with each user (we use the term 'user' for simplicity in the presentation; the discussion also applies to the more general notion of 'subject') and data item. The set of levels is a partially ordered set. Access to a data item by a user is permitted only if the integrity level of the data is 'higher' than the integrity level of the user. Despite its theoretical interest, the Biba Integrity Model is rigid in that it neither distinguishes among the different tasks that are to be executed by users nor addresses how integrity levels are assigned to users and data. Our solution has some major differences with respect to the Biba Integrity Model. First, it replaces 'integrity levels' with confidence values and provides an approach to determine those values [5]. Second, it provides policies by which one can specify the confidence required for use of certain data in certain tasks. As such, our solution supports fine-grained integrity tailored to specific data and tasks. Third, it provides an approach to dynamically adjust data confidence levels so as to provide users with query replies that comply with the confidence policies.

Our contributions are summarized as follows.
– We propose the first systematic approach to data use based on confidence values of data items.
– We introduce the notion of confidence policy and confidence policy compliant query evaluation, based on which we propose a framework for query evaluation.
– We develop algorithms to minimize the cost of adjusting the confidence values of data in order to meet the requirements specified in confidence policies.
– We have carried out performance studies which demonstrate the efficiency of our system.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the notion of policy compliant query evaluation and presents the related architectural framework. Section 4 provides detailed algorithms, whereas Section 5 reports experimental results. Finally, Section 6 outlines some conclusions and future work.
2 Related Work

Work related to our approach falls into two categories: (i) access control policies; and (ii) lineage calculation. For access control in a relational DBMS, most existing access control models, like RBAC [7] and Privacy-aware RBAC [14], perform authorization checking before every data access. Our confidence policy is complementary to such conventional access control enforcement and applies to query results. Many efforts [16,3,8,9,15] have been devoted to tracking the provenance of query results, i.e., recording the sequence of steps taken in a workflow system to derive datasets, and computing the confidence values of query results. For example, Widom et al. have developed a database management system, Trio [15], which combines data, accuracy, and lineage (provenance). However, none of these systems provides a comprehensive, policy-based solution for addressing the use of data based on confidence values for different tasks and roles. Perhaps the most closely related work is by Missier et al. [13], who propose a framework for the specification of users' quality processing requirements, called quality views. These views can be compiled and embedded within the data processing environment. The function of these views is, to some extent, similar to that of our confidence policies. However, such a system is less flexible since it does not include a data quality increment component, which is, instead, a key component of our system.
3 Policy Compliant Query Evaluation

In this section, we first introduce an illustrative example and then present our policy compliant query evaluation framework.

3.1 An Illustrative Example

To illustrate our approach, we consider a scenario in a venture capital company which is able to offer a wide range of asset finance programs to meet the funding requirements
of startup companies. Suppose that in this venture capital company there is a database with two relations having the following schemas:

  Proposal(Company: string, Proposal: string, Funding: real);
  CompanyInfo(Company: string, Income: real).

An instantiation of the tables is given with sample tuples and their confidence values in Table 1 and Table 2. In the example, the variable pNo denotes the confidence of the tuple with numeric identifier equal to No.

Table 1. Proposal

  No   Company    Proposal   Funding   Confidence
  01   Creative   Plan A     1.2M      p01 = 0.2
  02   Star       Plan B     500K      p02 = 0.3
  03   Star       Plan C     400K      p03 = 0.4

Table 2. CompanyInfo

  No   Company    Income   Confidence
  11   Creative   10M      p11 = 0.2
  12   Foto       1M       p12 = 0.9
  13   Star       2M       p13 = 0.1

Assume that the venture capital company has a certain amount of funds available and is looking for financial information about a company with a proposal that requires less than one million dollars. Such a query can be expressed by the following relational algebra expression:

  Candidate = (Π_Company σ_{Funding < 1M} Proposal) ⋈ CompanyInfo

The corresponding query results are shown in Table 3.

Table 3. Candidate

  No   Company   Income   Confidence
  38   Star      2M       p38 = 0.058, λ(38) = (02 ∨ 03) ∧ 13

By definition [6], the lineage (or data provenance) of each query result is the part of the database that contributes to this result. The confidence of a query result is thus computed according to the confidence of its lineage and the query structure. For example, the lineage (denoted as λ) for base tables (i.e., Proposal and CompanyInfo) consists of their individual tuples: λ(01)=01, λ(02)=02 and λ(03)=03 for relation Proposal, and λ(11)=11, λ(12)=12 and λ(13)=13 for relation CompanyInfo. The lineage calculation for the resulting tuple 38 is represented as a Directed Acyclic Graph (DAG) in Figure 2. We ensure that no dependencies on base tuples occur, that is, only safe plans [6] are allowed.

Suppose that this database has some confidence policies. Policy P1 states that the data that a secretary uses for analysis purposes must have a confidence value higher than 0.05. According to P1, any user under the secretary role would be able to obtain the result shown in Table 3.
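A small numeric sketch of this lineage-based propagation for the running example, assuming independent base tuples; the helper names are ours.

```python
def p_or(pa, pb):
    # Disjunctive lineage: p(a or b) = pa + pb - pa*pb (independent tuples).
    return pa + pb - pa * pb

def p_and(pa, pb):
    # Conjunctive lineage: p(a and b) = pa * pb (independent tuples).
    return pa * pb

base = {"02": 0.3, "03": 0.4, "13": 0.1}
# lambda(38) = (02 or 03) and 13
p25 = p_or(base["02"], base["03"])   # 0.58
p38 = p_and(p25, base["13"])         # 0.058: passes P1 (0.05), fails P2 (0.06)
```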
Fig. 2. DAG for the Sample Plan (result tuple 38, p38 = 0.058, is derived from node 25 = 02 ∨ 03, p25 = 0.58, and base tuple 13, p13 = 0.1; node 25 is in turn derived from base tuples 02, p02 = 0.3, and 03, p03 = 0.4)
The reason is that the confidence level of the tuple in the query result is above the minimum required confidence value, that is, p38 > 0.05. Another policy P2 states that the data used by a manager who has to make an investment decision must have a confidence value higher than 0.06. The confidence threshold in P2 is higher than the value in P1 since the data usage in P2, i.e., investment, is more critical than the data usage in P1, i.e., analysis. According to policy P2, a user under the role definition of manager will not be able to access the query result because the calculated confidence level, that is, p38 = 0.058, is smaller than the minimum confidence level 0.06 required for such a role when performing investment decision tasks. In our example, no result is returned to the manager.

In order to let the manager obtain some useful information from his query, one solution is to improve the confidence level of the base tuples, which may however introduce some cost. Thus, our goal is to find an optimal strategy that has minimum cost. Assume that the costs of incrementing the confidence level by 0.1 (10%) for each of the tuples 02 and 03 are 100 and 10, respectively. Consider the example again. If we increase the confidence level of the base tuple 02 from 0.3 to 0.4, we have p25 = p02∨03 = p02 + p03 − p02 · p03 = 0.64, and as a result p38 becomes p38 = p25∧13 = p25 · p13 = 0.064, which is above the threshold. Alternatively, if we increase the confidence level of the base tuple 03 from 0.4 to 0.5, we obtain p25 = 0.65 and p38 = 0.065, which is also above the threshold. However, the first solution is more expensive because acquiring 10% more confidence for tuple 02 is 10 times more costly than for tuple 03. Therefore, among the two alternatives, we choose the second one. The increment cost and the data whose confidence needs to be improved are reported to the manager. If the manager agrees with the suggestion given by the system, actions are taken to improve the data quality and new query results are returned to the manager.

3.2 PCQE Framework

The PCQE framework consists of five main components: confidence assignment, query evaluation, policy evaluation, strategy finding, and data quality improvement. We elaborate on the data flow within our framework. Initially, each base tuple is assigned a confidence value by the confidence assignment component, which corresponds to the first element of our approach as mentioned in the introduction. A user inputs query information in the form ⟨Q, pu, perc⟩, where Q is a normal SQL query, pu is the purpose for issuing the query, and perc is the percentage of results that the user expects to
receive after the policy enforcement. Then, the query evaluation component computes the query Q and the confidence level of each query result based on the confidence values of base tuples. This component corresponds to the second element. The intermediate results are sent to the policy evaluation component. The policy evaluation component first selects the confidence policy associated with the role of user U, his query purpose, and the data U wants to access, and then checks each query result against the selected confidence policy. Only the results with confidence value higher than the threshold specified in the confidence policy are immediately returned to the user. If fewer than perc results satisfy the confidence policy, the policy evaluation component sends a request message to the strategy finding component. The strategy finding component then computes an optimal strategy for increasing the confidence values of the base tuples and reports the cost to the user. If the user agrees about the cost, the strategy finding component informs the data quality improvement component, which takes actions to improve the data quality and then updates the database. The strategy finding and data quality improvement components correspond to the fourth element. Finally, new results are returned to the user.

Confidence Policy. A confidence policy specifies the minimum confidence that has to be assured for certain data, depending on the user accessing the data and the purpose of the data access. In its essence, a confidence policy contains three components: a subject specification, denoting a subject or set of subjects to whom the policy applies; a purpose specification, denoting why certain data are accessed; and a confidence level, denoting the minimum level of confidence that has to be assured by the data covered by the policy when the subject (set of subjects) to whom the policy applies requires access to the data for the purpose specified in the policy. Correspondingly, we have the following three sets: R, Pu, and R+. R is a set of roles used for subject specification. In our system, a user is a human being and a role represents a job function or job title within the organization that the user belongs to. Pu is a set of data usage purposes identified in the system. R+ denotes the non-negative real numbers. The definition of a confidence policy is then the following.

Definition 1 [Confidence Policy]. Let r ∈ R, pu ∈ Pu, and β ∈ R+. A confidence policy is a tuple ⟨r, pu, β⟩, specifying that when a user under a role r issues a database query q for purpose pu, the user is allowed to access the results of q only if these results have a confidence value higher than β.

Policies P1 and P2 from our running example are expressed as follows.
– P1: ⟨Secretary, analysis, 0.05⟩.
– P2: ⟨Manager, investment, 0.06⟩.

Confidence Increment. In some situations, the policy evaluation component may filter out all intermediate results if the confidence levels of these results are lower than the threshold specified in the confidence policy. To increase the amount of useful information returned to users, our system allows users to specify a minimum percentage (denoted as θ) of results they want to receive. The strategy finding component then computes the cost for increasing the confidence values of the tuples in the base tables
so that at least θ percent of the query results have a confidence value above the threshold. The problem is formalized as follows. Let Q be a query, and let λ1, λ2, ..., λn be the results of Q before policy checking. Such results are referred to as intermediate results hereafter. Each λi (1 ≤ i ≤ n) is computed from a set of base tuples denoted as Λ0_i = {λ0_i1, ..., λ0_ik}. The confidence value of λi is represented as a function Fλi(p(λ0_i1), p(λ0_i2), ..., p(λ0_ik)), where p(λ0_ij) is the confidence level of base tuple λ0_ij (1 ≤ j ≤ k). In our running example, the function F is F(p02, p03, p13) = (p02 + p03 − p02 · p03) · p13. Suppose that the minimum percentage given by a user is θ and the percentage of current results with confidence value higher than the threshold β is θ' (θ' < θ). To meet the user requirements, we need to increase at least (θ − θ') · n results. Let Λ denote the set of results whose confidence value needs to be increased. We then formalize the confidence increment problem as the following constraint optimization problem:

  minimize    cost = Σ_{λ0_x ∈ Λ0} c_{λ0_x}( p*(λ0_x) − p(λ0_x) )

  subject to  |Λ| ≥ (θ − θ') · n
              Fλi( p*(λ0_i1), p*(λ0_i2), ..., p*(λ0_iki) ) ≥ β   for λi ∈ Λ
              p*(λ0_ij) ∈ [ p(λ0_ij), 1 ]   for j = 1, ..., ki

where Λ0 = ∪_{λi ∈ Λ} Λ0_i is the union of the base tuples for the query results in Λ, and c_{λ0_x}(p*(λ0_x)) computes the cost for increasing the confidence value of base tuple λ0_x from p(λ0_x) to p*(λ0_x). Fi is usually a nonlinear function, and the problem of solving nonlinear constraints over integers or reals is known to be NP-hard [12]. The above definition can easily be extended to a more general scenario in which a user issues multiple queries within a short time period.
Query Processing Techniques for Compliance with Data Confidence Policies
57
...... λ ( pλ )
λ ( pλ +0.1)
λ01 ( pλ ) λ02 ( pλ )
λ01 ( pλ ) λ02 ( pλ +0.1)
λ01 ( pλ ) λ02 (1.0)
λ01 ( pλ ) λ02 ( pλ +0.1) λ03 ( pλ )
λ01 ( pλ ) λ02 ( pλ +0.1) λ03 ( pλ +0.1)
λ01 ( pλ ) λ02 ( pλ +0.1) λ03 (1.0)
0 1
0 1
0 2
0 1
0 2
0 3
0 1
...... 0 1
0 2
...... 0 1
0 2
0 3
0 1
0 1
λ01 (1.0)
0 1
0 1
0 2
Fig. 3. Search Tree
of the first base tuple λ01 . The values we can select for λ01 range from pλ01 (its initial confidence level) to 1 (or its maximum possible confidence level). The minimum distance between two values, i.e., the granularity, depends on the application requirements. In this example, the granularity is set to 0.1. After we assigns a confidence value to λ01 , we generate its successors by considering the second base tuple λ02 . Similarly, we assign the confidence value for λ02 and generate its successors by considering the third base tuple λ03 . After each assignment step, we compute the confidence value of each intermediate query result and the cost. If more than (θ − θ ) · n intermediate query results have confidence values higher than the threshold, this assignment is successful and the corresponding cost will be used as an upper bound during the subsequent search. Later on, at each node of the search tree, we compare the current cost with the upper bound. If the current cost is higher, we do not need to consider the successors of this node. If a new successful assignment with lower cost is found, the upper bound will be replaced with this lower cost. In the worst case, the computation complexity is O(dk ) where k is the number of base tuples and d is the number of values can be selected for each base tuple. As mentioned, our problem is NP-hard and we therefore aim at finding heuristics that help reducing the search space for most cases. We first consider the base tuple ordering since various studies [17] have shown that the search order of the tuples largely affects performance. For our problem, we need an ordering that can quickly lead to a solution with minimum cost. We know that a query result is usually associated with multiple base tuples and different base tuples are associated with different cost functions. Intuitively, the base tuples with less cost have higher probability to be included in the final solution. Therefore, we would like to increase confidence of such base tuples first. The ordering is obtained by sorting the base tuples in a descending order of their minimum cost (denoted as costβ ) for enabling at least one intermediate result to satisfy the requirement. In some cases, even when the confidence value of a base tuple has been increased to 1 (or its maximum possible confidence level), none of the query results has β the required confidence value. For such base tuples, we adjust its costβ to (Fcost , max /β) where Fmax is the maximum confidence value that the query result obtains when the confidence value of this base tuple is 1. We summarize our first heuristics as follows.
58
C. Dai et al.
Heuristics 1. Let λ0i and λ0j be two base tuples. If costβi > costβj , then λ0i will be the ancestor of λ0i in the search tree. The next heuristics takes advantage of the non-monotonic increasing property of confidence functions of intermediate results. When increasing the confidence value of a base tuple only benefits the intermediate results with a confidence value already above the threshold, we can prune its right siblings. It is easy to prove that the optimal solution does not exist in the pruned branches. Heuristics 2. Let λ0c (p∗λ0 ) be the current node in the search tree, and λ1 , ..., λj be the c intermediate results associated with λ0c . If ∀i ∈ {1, ..., j}, Fλi ≥ β, then prune the right siblings of λ0c (p∗λ0 ). c
There is another useful heuristics that can quickly detect whether it is necessary to continue searching. That is, if increasing the confidence values of all remaining base tuples to 1 still cannot yield a solution, there is no need to check the values of the remaining base tuples. Heuristics 3. Let λ01 (p∗λ0 ), ... , λ0c (p∗λ0 ) be the nodes at the current path of the search c 1 tree. Let λ0c+1 , ..., λ0j be the base tuples after λ0c and their confidence values be 1. If |{Fλi |Fλi (p∗λ0 , ..., p∗λ0 , 1, ..., 1) > β}|< (θ − θ ) · n, then prune all branches below c 1 the node λ0c (p∗λ0 ). c
Similar to Heuristics 3, we can check if any confidence increment of the remaining base tuples will result in a higher cost than the current minimum cost. If so, there is also no need to continue searching this branch. Heuristics 4. Let λ0c (p∗λ0 ) be the current node in the search tree. Let costc , costmin c be the current cost and the cost of current optimal solution respectively. If costc + min{costλ0j (δ)} > costmin (j > c), then prune all branches below node λ0c (p∗λ0 ). c
4.2 Greedy Algorithm When dealing with large datasets, the heuristic algorithm may not be able to provide an answer within a reasonable execution time. Therefore, we seek approximation solutions and develop a two-phase greedy algorithm. The first phase keeps increasing the confidence values of base tuples while the second phase reduces unnecessary increments. We first elaborate on the procedure in the first phase. The basic idea is to iteratively compute gain of each base tuple by increasing its confidence value by δ, and then select the one with the maximum gain value. If there is only one intermediate result λ, gain is defined for each base tuple as shown in equation 1, where ΔFλ is the increase of the confidence value of λ when the confidence value of the base tuple λ0 is increased by δ, and cλ0 is the corresponding cost. gain =
ΔFλ cλ0
(1)
Query Processing Techniques for Compliance with Data Confidence Policies gain
59
0
λ2 0
λ3 0
λ1
0
δ
p
Fig. 4. Gain
A simple example is shown in Figure 4. Among the three base tuples λ01 , λ02 and λ03 , λ01 yields the maximum gain when its confidence level is increased by δ, and therefore λ01 will be selected at this step. It is worth noting that once the confidence value of a base tuple is changed, the confidence function of the corresponding intermediate result is also changed. We need to recompute gain at each step. When there are multiple intermediate results, we update gain function as follows. gain∗ =
Σλ∈Λ ΔFλ cλ0
(2)
As shown in equation 2, gain∗ takes into account overall increment of the confidence levels of the query results. The selection procedure continues until there are more than (θ − θ ) · n intermediate results (denoted as Λ) with confidence values above the threshold. The set of base tuples whose confidence levels have been increased is denoted as Λ0in . The first phase is an aggressive increasing phase and sometimes it may raise the confidence too much for some base tuples. For example, it may increase the confidence of a base tuple which has a maximum gain value at some step but does not contribute to any result tuple in the final answer set Λ. As a remedy, the second phase tries to find such base tuples and reduce the increment of their confidence values, and hence reduces the overall cost. The second phase can be seen as a reverse procedure of the first phase. In particular, we first sort base tuples in Λ0in in an ascending order of their latest gain∗ values. The intuition behind such sorting is that the base tuple with minimum gain∗ costs most for the same amount of increment on the intermediate result tuples, and hence we reduce its confidence value first. Then, for each base tuple, we keep reducing its confidence value by δ until it reaches its original confidence value or the reduction decreases the number of satisfied result tuples. Step 1: λ00 (+δ) − λ1 (0.55), λ2 (0.3), λ3 (0.1) Step 2: λ01 (+δ) − λ2 (0.4), λ3 (0.2), λ4 (0.3) Step 3: λ02 (+δ) − λ3 (0.45), λ5 (0.3), λ6 (0.35) Step 4: λ01 (+δ) − λ2 (0.6), λ3 (0.55), λ4 (0.4) Step 5: λ02 (−δ) − λ3 (0.5), λ5 (0.25), λ6 (0.2) Fig. 5. Example for the Greedy Algorithm
Procedure Greedy(Λ0, num, β)
Input: Λ0 is a set of base tuples, num is the number of required query results, and β is the confidence threshold
//- - - - - - - - - - - 1st Phase - - - - - - - - - - -
1.  success ← NULL; L ← NULL
2.  while (|success| < num) do
3.      max ← 0
4.      for each tuple λ0i in Λ0 do
5.          compute gain*_i
6.          if gain*_i > max then
7.              pick ← i; max ← gain*_i
8.      L ← L ∪ {λ0pick}
9.      increase confidence of λ0pick by δ
10.     compute confidence of affected result tuples
11.     success ← result tuples with confidence value above β
//- - - - - - - - - - - 2nd Phase - - - - - - - - - - -
12. C ← L
13. sort C by gain* in ascending order
14. for each tuple λ0i in C do
15.     while (|success| ≥ num and p*_λ0i > p_λ0i) do
16.         decrease λ0i's confidence by δ
17.     if (|success| < num) then
18.         increase λ0i's confidence by δ

Fig. 6. The Two-Phase Greedy Algorithm
To exemplify, we step through the example shown in Figure 5. Suppose that we need to increase the confidence values of at least (θ − θ′)·n = 3 intermediate results and that the threshold is 0.5. At each step, we compute the gain values obtained by increasing the confidence value of each base tuple by δ. The first step selects a base tuple λ00, which has the maximum gain. The change of the confidence value of λ00 results in changes of the confidence values of three intermediate result tuples λ1, λ2, and λ3; the number in brackets denotes the new confidence value. The second step selects another base tuple λ01, which affects the intermediate result tuples λ2, λ3 and λ4. By the fourth step, we have three results λ1, λ2 and λ3 with confidence values above the threshold 0.5. Then the second phase starts. As shown by Step 5, decreasing the confidence value of λ02 by δ still keeps the confidence values of λ1, λ2 and λ3 above the threshold. In the end, the algorithm suggests increasing the confidence value of λ00 by δ and that of λ01 by 2δ. Figure 6 outlines the entire algorithm. Let l1 be the number of iterations of the outer loop in the first phase. The second phase uses quicksort. The time complexity of the algorithm is O(k(l1 + log k)), where k is the total number of base tuples.
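For concreteness, the following is a condensed Python sketch of the two-phase procedure of Figure 6. It is an illustration under simplifying assumptions, not the authors' implementation: each intermediate result's confidence is assumed to be a supplied function of the confidences of its base tuples, increments are assumed to keep helping until enough results are satisfied, and all names and data layouts are illustrative.

def two_phase_greedy(base, results, delta, num, beta):
    """Sketch of Fig. 6.
    base:    {id: {"p": original confidence, "cost": cost of one +delta step}}
    results: {id: (ids of associated base tuples, confidence function)}
    Returns  {base tuple id: number of delta increments kept}."""
    inc = {b: 0 for b in base}                 # increments applied so far

    def conf(r):                               # current confidence of result r
        ids, f = results[r]
        return f(*(base[b]["p"] + inc[b] * delta for b in ids))

    def n_satisfied():
        return sum(1 for r in results if conf(r) > beta)

    def gain(b):                               # equation (2): total confidence
        before = sum(conf(r) for r in results) # increase per unit of cost
        inc[b] += 1
        after = sum(conf(r) for r in results)
        inc[b] -= 1
        return (after - before) / base[b]["cost"]

    last_gain = {}
    while n_satisfied() < num:                 # phase 1: aggressive increase
        pick = max(base, key=gain)
        last_gain[pick] = gain(pick)
        inc[pick] += 1

    for b in sorted(last_gain, key=last_gain.get):   # phase 2: refinement,
        while inc[b] > 0:                            # smallest gain* first
            inc[b] -= 1
            if n_satisfied() < num:            # this reduction broke a result:
                inc[b] += 1                    # undo it and move on
                break
    return inc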
4.3 Divide-and-Conquer Algorithm

The divide-and-conquer (D&C) algorithm is motivated by scalability concerns. Its key idea is to divide the problem into small pieces, search for the optimal solution of each piece, and then combine the results in a greedy way. We expect the D&C algorithm to combine the advantages of the heuristic and greedy algorithms.

We now present the details of the D&C algorithm. The first task is to partition the problem into sub-problems, for which we need a partitioning criterion. Observe that some base tuples are independent of each other in the sense that they do not contribute to the same set of intermediate query results; such base tuples form natural groups. The following example shows that concentrating the confidence increments on a group of base tuples may lead to a solution more quickly than spreading them over independent base tuples. In the example, there are three intermediate results λ1, λ2 and λ3 with confidence values below the threshold 0.5, and at least two results must be reported. Result tuples λ1 and λ2 are associated with the same base tuples λ01, λ02 and λ03, while λ3 is associated with the base tuple λ04. Suppose that the ordering produced by a heuristic or greedy algorithm is λ03, λ04, λ02, λ01. Figure 7 shows the first three steps of the confidence increments, where the number in brackets indicates the new confidence value of a result tuple after the change of the confidence value of a base tuple. Observe that if we exchange the order of λ04 and λ02, we obtain an answer more quickly. This illustrates the benefit of concentrating confidence increments on base tuples in the same group.

Ideally, all base tuples would be partitioned into a set of almost equal-size natural groups, and the search could then be carried out in each group independently. However, this situation rarely occurs. More commonly, most base tuples are related to each other because their corresponding intermediate result sets overlap. The question then is how to determine which base tuples are most related, so that they can be placed in the same group. This is essentially a graph partitioning problem: each intermediate result tuple is a node, and two nodes are connected by an edge if the corresponding result tuples share at least one base tuple. Figure 8 shows an example graph of seven result tuples. For instance, λ1 and λ2 have three common base tuples, while λ2 and λ3 share only one base tuple. Our goal is to partition the graph into disjoint subgraphs that satisfy two requirements. First, the number of base tuples associated with the result tuples in the same group should not exceed a threshold; this ensures that each sub-problem is solvable within reasonable (or user-specified) time. Second, the sum of the weights on the edges connecting any two subgraphs should be minimized; this reduces the duplicate search over base tuples belonging to two groups.

Step 1: λ03 − λ1 (0.3), λ2 (0.4)
Step 2: λ04 − λ3 (0.4)
Step 3: λ02 − λ1 (0.5), λ2 (0.6)

Fig. 7. An Example of Partitioning Effect
Fig. 8. An Example of Partitioning (a graph over result tuples λ1–λ7, where edge weights are the numbers of base tuples shared by the connected result tuples)
Unfortunately, finding an optimal graph partitioning is also an NP-complete problem. It has been studied extensively, and a variety of heuristic and greedy algorithms have been proposed [10]. Since in our case the partitioning is only a first phase, most existing approaches are too expensive and would introduce too much overhead. Therefore, we propose a lightweight yet effective approach specific to our problem. Initially, each node forms its own group. We keep merging the two groups connected by the edge with the maximum weight. After each merge, the weight on the edge between a node and the new group is the sum of the weights on the edges between that node and all nodes in the group. The process stops when the maximum weight is less than a given threshold γ. For example, the graph in Figure 8 is partitioned into two groups when γ = 2 (see Figure 9).

Step 1: Merge λ4 and λ6 (maximum weight = 5)
Step 2: Merge λ1 and λ5 (maximum weight = 4)
Step 3: Merge λ4, λ6 and λ7 (maximum weight = 4)
Step 4: Merge λ1, λ5 and λ2 (maximum weight = 3)
Step 5: Merge λ4, λ6, λ7 and λ3 (maximum weight = 2)

Fig. 9. A Graph Partitioning Example
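A minimal Python sketch of this merging procedure is given below, assuming the graph is supplied as a weight map over node pairs; the data layout and function name are illustrative, not part of the original presentation.

def partition(edges, gamma):
    """Lightweight partitioning sketch: treat every node as its own group,
    repeatedly merge the two groups joined by the heaviest edge, and stop
    once the heaviest remaining edge weighs less than gamma.
    edges: {(u, v): weight} between distinct result-tuple nodes."""
    w = {frozenset((frozenset((u,)), frozenset((v,)))): wt
         for (u, v), wt in edges.items()}
    groups = {g for pair in w for g in pair}
    while w:
        pair, wt = max(w.items(), key=lambda kv: kv[1])
        if wt < gamma:                        # stopping rule from the text
            break
        a, b = pair
        merged = a | b
        groups = (groups - {a, b}) | {merged}
        # Re-wire edges: the weight between a node and the new group is the
        # sum of its former weights to the two merged groups.
        rewired = {}
        for p, pwt in w.items():
            if p == pair:
                continue
            p = frozenset(merged if g in (a, b) else g for g in p)
            rewired[p] = rewired.get(p, 0) + pwt
        w = rewired
    return groups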
After the partitioning, we apply the greedy algorithm to each group. Let x be the number of result tuples associated with a group, and y the required number of result tuples for the entire query. If x is smaller than y, the greedy algorithm finds a solution for all x result tuples; if x is larger than y, the greedy algorithm stops as soon as y result tuples have confidence above the threshold. Next, we further carry out a heuristic search in each group that contains fewer than τ base tuples, where the parameter τ is determined by the performance of the heuristic algorithm; the results obtained from the greedy algorithm serve as initial cost upper bounds. The last step is a result combination and refinement step. A subtlety during the combination is the handling of base tuples that overlap between groups. When we combine answers from such groups, we select the maximum confidence value of each overlapping base tuple; this guarantees that the combined answer does not reduce the confidence values of the result tuples in the answer set of any individual group. After the combination, the total number of satisfied result tuples may exceed the required number, or the confidence values of the result tuples may be much higher than the threshold; both introduce additional cost.
Procedure D&C(Λ0, num, β, γ)
Input: Λ0 is a set of base tuples, num is the number of required query results, β is the confidence threshold, and γ is the graph partitioning threshold
1.  for each intermediate result tuple λi do
2.      group Gi is the set of base tuples associated with λi
3.      for each intermediate result tuple λj (j ≠ i) do
4.          wij ← |Gi ∩ Gj|
5.  select the two groups with the maximum weight wmax
6.  while wmax ≥ γ do
7.      merge the selected two groups
8.      adjust the weights on the affected edges
9.      select the two groups with the maximum weight wmax
10. for each group Gi do
11.     invoke Greedy()
12.     if |Gi| < τ then
13.         invoke Heuristic Algorithm()
14. result combination and refinement

Fig. 10. The Divide-and-Conquer Algorithm
Therefore, we carry out a refinement process similar to the second phase of the greedy algorithm: it starts from the base tuple with the minimum gain* and stops when any further confidence reduction would leave fewer satisfied result tuples than required. An overview of the entire algorithm is shown in Figure 10. The complexity of our graph partitioning algorithm is O(n²), where n is the total number of intermediate result tuples. The complexity of the remaining part of the D&C algorithm is the same as that of the greedy and heuristic algorithms, with the size of the entire dataset replaced by the size of each group. The complexity of the result combination and refinement step is O(k log k).

Finally, we note that it is easy to extend the three algorithms, i.e., the heuristic, greedy, and divide-and-conquer algorithms, to support multiple queries. Two aspects are important for such an extension. First, the search space has to be extended to include all distinct base tuples associated with all queries. Second, instead of checking whether a solution has been found for one query, we need to check whether a solution has been found for all queries.
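The combination rule described above (taking the maximum confidence value of each overlapping base tuple) can be sketched in a few lines of Python; the per-group answer format is an assumption made for illustration.

def combine(group_answers):
    """Combination step sketch: each per-group answer maps a base tuple id
    to the confidence value that group's search settled on. For a base
    tuple shared by several groups we keep the maximum, so no group's
    satisfied result tuples can lose confidence in the combined answer."""
    combined = {}
    for answer in group_answers:
        for base_tuple, p in answer.items():
            combined[base_tuple] = max(p, combined.get(base_tuple, 0.0))
    return combined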
5 Performance Study

5.1 Experimental Settings

Our experiments are conducted on an Intel Core 2 Duo (2.66 GHz) Dell machine with 4 GB of main memory. We use synthetic datasets in order to cover all general scenarios. First, we generate a set of base tuples and assign to each tuple a randomly generated confidence value around 0.1 and a cost function.
Table 4. Parameters and Their Settings

Parameter                           Setting
Data size                           10, 1K, 10K, ..., 100K (default 10K)
No. of base tuples per result       5, 10, 25, 50, 100 (default 5)
Confidence increment step δ         0.1
Percentage of required results θ    50%
Confidence level β                  0.6
The types of cost functions include the binomial, exponential and logarithm functions. Then we associate a certain number of base tuples with each result tuple. Since our focus is on the policy evaluation and strategy finding components, we use randomly generated DAGs to represent queries. Table 4 gives an overview of the parameters used in the experiments together with their default values. "Data size" is the total number of distinct base tuples associated with the results of a single query. "No. of base tuples per result" is the average number of base tuples associated with each result tuple. "Confidence increment step" is the confidence value by which the chosen base tuple is increased at each step. "Percentage of required results" is a user input parameter (θ) giving the percentage of results that a user expects to receive after the policy checking. Unless specified otherwise, we use a 10K dataset where each result tuple is associated with 5 base tuples and the percentage of required results is 50%.

5.2 Algorithm Analysis

Heuristic Algorithm. These experiments assess the impact of the four heuristics on the search performance using a small dataset with 10 base tuples. Each query requires at least three results with a confidence value above 0.6, and each result is linked to 5 base tuples. Figure 11 (a) and (d) show the performance when different heuristics are used: H1 (Heuristics 1), H2 (Heuristics 2), H3 (Heuristics 3), H4 (Heuristics 4). "Naive" means that only the current optimal cost is used as an upper bound, and "All" means that all heuristics are applied. From Figure 11 (a), we observe that the response time when applying any one of the four heuristics is lower than that of "Naive". When all heuristics are applied, the performance improves by a factor of about 60. This behavior can be explained as follows: compared to an arbitrary ordering, H1 provides a much better base tuple ordering that quickly leads to the optimal solution, while H2, H3 and H4 reduce unnecessary searches. In Figure 11 (d), we use the minimum cost computed by the greedy algorithm as the initial upper bound for the heuristic algorithm. The search performance improves in all cases, because the upper bound provided by the greedy algorithm helps prune the search space from the very beginning of the search; since it is a nearly optimal solution, it is tighter than most upper bounds found during the search.

Two-phase Greedy Algorithm. The second phase of the greedy algorithm refines the result. It may reduce the minimum cost but requires additional processing
Fig. 11. Experimental Results: (a) response time of the heuristic algorithm without the greedy bound (Naïve, H1, H2, H3, H4, All); (b) response time of the one-phase vs. two-phase greedy algorithm for data sizes 1K–9K; (c) response time of the heuristic, greedy, and divide-and-conquer algorithms for data sizes 10–100K (logarithmic scale); (d) response time of the heuristic algorithm using the greedy bound; (e) cost of the one-phase vs. two-phase greedy algorithm; (f) cost of the heuristic, greedy, and divide-and-conquer algorithms
time. This set of experiments aims to study whether the second phase is beneficial, by comparing the performance of the greedy algorithm with and without the second phase. Figure 11 (b) and (e) show the results when the data size is varied from 1K to 10K. From Figure 11 (b), we observe that both versions of the greedy algorithm have similar response times, which means that the overhead introduced by the second phase is negligible; this conforms to the complexity of the second phase. As for the minimum cost (Figure 11 (e)), the two-phase algorithm clearly outperforms the one-phase algorithm: after the second phase, the minimum cost is reduced by more than 30%. These results confirm the effectiveness of the second phase. In the subsequent experiments, the greedy algorithm refers only to the two-phase algorithm.

5.3 Overall Performance Comparison

In this section, we compare the three algorithms in terms of both response time and minimum cost, and evaluate their scalability. The data size is varied from 10 to 100K. The number of base tuples per result is set to 5 for data sizes below 5K; for data sizes from 10K to 100K, this parameter is set to 1/1000 of the data size. Figure 11 (c) reports the performance. It is not surprising that the heuristic algorithm can only handle very small datasets (fewer than one hundred base tuples) within reasonable time, because its complexity is exponential in the worst case. The greedy algorithm has the shortest response time when the dataset is small, but is then overtaken by the D&C algorithm. The gap between the greedy and D&C algorithms widens as the data size increases; in particular, the greedy algorithm takes hours for datasets larger than 50K. The reason is that the graph partitioning phase of the D&C algorithm introduces some overhead for small datasets, and hence it requires more
time than the greedy algorithm. However, as the dataset grows, the advantage of the partitioning becomes more and more significant. Thus, the D&C algorithm scales best among the three algorithms. Another interesting observation is that the response time decreases when the data size changes from 5K to 10K. A possible reason is that the groups are relatively larger in the 10K dataset than in the 5K dataset, and hence fewer heuristic searches and more greedy searches are involved, which results in shorter response time. Figure 11 (f) compares the minimum cost computed by all algorithms. The minimum cost increases with the data size, since more result tuples need to be reported and more base tuples need to be considered in a larger dataset. The heuristic algorithm yields the optimal solution, as it is based on an exhaustive search. The other two algorithms perform very similarly and have only slightly higher cost than the optimum; this demonstrates their accuracy.
6 Conclusion

This paper proposes the first systematic approach to using data based on the confidence values associated with the data. We introduce the notion of confidence policy compliant query evaluation, based on which we develop a framework for query evaluation. We propose three algorithms for dynamically incrementing data confidence values so that the returned query results satisfy the stated confidence policies while minimizing the additional cost. Experiments evaluate both the efficiency and the effectiveness of our approach. Since actually improving data quality may take some time, the user can submit the query in advance of the expected time of data use, and statistics can be used to tell the user "how much time" in advance the query needs to be issued. We will investigate this topic in future work.
References

1. http://www.arma.org/erecords/index.cfm
2. Ballou, D., Madnick, S.E., Wang, R.Y.: Assuring information quality. Journal of Management Information Systems 20(3), 9–11 (2004)
3. Barbará, D., Garcia-Molina, H., Porter, D.: The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering 4(5), 487–502 (1992)
4. Bishop, M.: Computer Security: Art and Science, ch. 6. Addison-Wesley Professional, Reading (2003)
5. Dai, C., Lin, D., Bertino, E., Kantarcioglu, M.: An approach to evaluate data trustworthiness based on data provenance. In: Jonker, W., Petković, M. (eds.) SDM 2008. LNCS, vol. 5159, pp. 82–98. Springer, Heidelberg (2008)
6. Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. In: Proc. VLDB, pp. 864–875 (2004)
7. Ferraiolo, D.F., Sandhu, R., Gavrila, S., Kuhn, D.R., Chandramouli, R.: Proposed NIST standard for role-based access control. ACM Trans. Inf. Syst. Secur. 4(3), 224–274 (2001)
8. Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems 15(1), 32–66 (1997)
9. Green, T.J., Tannen, V.: Models for incomplete and probabilistic information. In: Grust, T., Höpfner, H., Illarramendi, A., Jablonski, S., Mesiti, M., Müller, S., Patranjan, P.-L., Sattler, K.-U., Spiliopoulou, M., Wijsen, J. (eds.) EDBT 2006. LNCS, vol. 4254, pp. 278–296. Springer, Heidelberg (2006)
10. Hendrickson, B., Leland, R.: A multilevel algorithm for partitioning graphs. In: Supercomputing (1995)
11. Malin, J.L., Keating, N.L.: The cost-quality trade-off: Need for data quality standards for studies that impact clinical practice and health policy. Journal of Clinical Oncology 23(21), 4581–4584 (2005)
12. McAllester, D.: The rise of nonlinear mathematical programming. ACM Computing Surveys, 68 (1996)
13. Missier, P., Embury, S., Greenwood, M., Preece, A., Jin, B.: Quality views: Capturing and exploiting the user perspective on data quality. In: VLDB, pp. 977–988 (2006)
14. Ni, Q., Trombetta, A., Bertino, E., Lobo, J.: Privacy-aware role based access control. In: Proceedings of the 12th ACM Symposium on Access Control Models and Technologies (2007)
15. Sarma, A.D., Theobald, M., Widom, J.: Exploiting lineage for confidence computation in uncertain and probabilistic databases. Technical Report, Stanford InfoLab (2007)
16. Simmhan, Y.L., Plale, B., Gannon, D.: A survey of data provenance in e-science. SIGMOD Record 34(3), 31–36 (2005)
17. Tsang, E.: Foundations of Constraint Satisfaction. Academic Press, London (1993)
18. Wang, R.Y., Storey, V.C., Firth, C.P.: A framework for analysis of data quality research. TKDE 7(4), 623–640 (1995)
An Access Control Language for a General Provenance Model

Qun Ni¹, Shouhuai Xu², Elisa Bertino¹, Ravi Sandhu³, and Weili Han⁴

¹ Purdue University, Department of Computer Science, West Lafayette, IN, USA
{ni,bertino}@cs.purdue.edu
² UT San Antonio, Department of Computer Science, San Antonio, TX, USA
[email protected]
³ UT San Antonio, Institute for Cyber Security, San Antonio, TX, USA
[email protected]
⁴ Fudan University, Software School, Shanghai, China
[email protected]

Abstract. Provenance access control has been recognized as one of the most important components in an enterprise-level provenance system. However, it has received little attention in the context of data security research. One important challenge in provenance access control is the lack of an access control language that supports its specific requirements, e.g., the support of both fine-grained policies and personal preferences, and decision aggregation from different applicable policies. In this paper, we propose an access control language tailored to these requirements.
1 Introduction
Provenance, a documented history of an object, has already been widely used in the scientific and grid computing domains to properly document workflows, data generation, and processing. Access control over provenance is of the highest importance for many critical organizations [1], either because the fulfillment of their duties relies on secure provenance management or because the protection of provenance is required by laws or regulations. In a national security agency, the improper disclosure of the source or the ownership of a piece of classified information may result in great and irreversible losses [2]. In a pharmaceutical company, the source of data and the processing executed on the data may be sensitive or valuable; in the absence of an access control mechanism for protecting such information, malicious or faulty insiders could steal it [1]. Additionally, many compliance regulations require proper archives and audit logs for electronic records [1]; e.g., HIPAA mandates proper logging of accesses and updates to the histories of medical records. Therefore, provenance access control is considered to be the primary issue in provenance security [3]. Unfortunately, despite the large number of research efforts focusing on the management of provenance [4, 5, 6, 7, 8], only a few of these efforts have investigated the problem of securing provenance [3, 2, 9, 10, 1], and none of them focuses on access control.
The problem of access control for provenance is complicated by the fact that, given a request to access some provenance information, different access control policies, possibly from different sources, may apply (see Figure 1): organizational high-level security policies, departmental fine-grained access control policies, and privacy laws and regulations. Moreover, individuals who contributed to the information, referred to as originators, may specify personal preferences on its disclosure. Whether a given access request is allowed thus depends on the decisions of all of these policies. We therefore need a language able to support the specification of fine-grained policies, privacy policies, and preferences, equipped with a flexible access control decision aggregation mechanism. The goal of this paper is to propose such a comprehensive access control language addressing the specific requirements of provenance access control, e.g., fine granularity, privacy awareness, and originator control. Our contributions include:

– A novel provenance model that captures the characteristics of previously proposed provenance models and is the basis for analyzing the requirements for provenance access control.
– A language tailored to fine-grained provenance access control and originator preferences.
– A simple yet flexible evaluation mechanism for decision aggregation.

The rest of this paper is organized as follows. Section 2 introduces our general provenance model and analyzes the requirements of provenance access control. Based on this provenance model, Section 3 develops the access control language model. Section 4 discusses the flow of the access control decision process. Section 5 shows how originator preferences are taken into account in access control decisions. Section 6 illustrates our approach with several examples. Section 7 discusses related work. Section 8 outlines some conclusions and directions for future work.
Fig. 1. Different policies may be applicable (an access request against a provenance store is evaluated against privacy laws and regulations such as HIPAA, organizational high-level policies, departmental fine-grained policies, and the preferences of persons involved in the provenance; the individual decisions are aggregated into a final Permit, granting access, or Deny, refusing it)
2 A Provenance Model
In order to develop an access control language for provenance, the first step is to analyze the requirements for a provenance access control model. Our analysis is based on the sensitivity of the different entities in a provenance model, i.e., a model that describes how provenance is represented.

2.1 The Model
Unfortunately, there is currently no standard for representing provenance, in spite of some initial attempts such as the Open Provenance Model [11] and the Architecture for Provenance Systems [3]. Existing proposals [4, 5, 6, 7] focus on different application domains (scientific data provenance vs. electronic health records), have different forms (relational vs. XML), or serve different purposes (storage vs. querying). Several systems for managing data lineage and provenance are used in the context of scientific processes, e.g., Chimera [12], myGRID [8], and ESSW [13], and some workflow systems [14] are able to generate provenance information as well.

Provenance is already well understood in the field of art history, where it refers to the trusted, documented history of an art object [3]. Given a documented history, the object attains an authority that allows scholars to understand and appreciate its importance and context relative to other objects; art objects that do not have a trusted, proven history may be treated as fakes. The same provenance concept can be applied to data acquired, generated, manipulated, and distributed by computer systems and applications. One of our primary objectives is thus to define a provenance representation suitable for such data. Hence, in this context, we give the following definition of the provenance of data (see Fig. 2). Our provenance model can capture and describe the provenance models proposed by the aforementioned approaches.

Definition 1 (Provenance). The provenance of a piece of data is the documentation of messages, operations, actors, preferences, and context that led to that piece of data.

An operation is a manipulation performed on or caused by some data, referred to as input messages, and resulting in other data, referred to as output messages. Messages represent data flows between operations.

Fig. 2. A provenance model (data flows into operations as input messages and out of them as output messages; operations are associated with actors, context, and preferences)
Applications, database commands, and web services are typical examples of operations, while copy-and-paste, emails, and inter-process communication between UNIX processes are typical messages. The piece of data with which the provenance is associated is the output of the last operation in the provenance.

Operations and messages are operated by actors, which can be application logics, workflow templates, or human beings. In some situations, information about actors, e.g., the physical therapist who administered a treatment for a patient's musculoskeletal disorder, is also a necessary component of provenance. This observation motivates the introduction of actor records in our provenance model.

Context refers to additional data that is independent of the input messages of an operation but affects the content of its output messages, e.g., operation states and operation parameters. Some operations are stateful or rely on the values of external context variables. In some circumstances, the internal states of a stateful operation and the values of external variables may be necessary in order to understand the functionality or performance of the operation, and therefore the nature of its result [3]. Moreover, in scientific computations, the parameters used in some operations, for example the parameters of a classification algorithm, are crucial for the final output [6]; such information should also be included in the provenance as context records.

Most existing provenance studies do not consider security and privacy requirements concerning the utilization of provenance, especially requirements concerning actors. Security and privacy are however crucial when provenance contains information of commercial value or of a legally sensitive nature (e.g., a proprietary algorithm). Usually such sensitive information is very specific; its protection requirements may depend on the application domain and can often only be determined by the involved actors. Thus there is a need for a provenance model able to address such requirements, limiting access to operation or message content based on access restrictions specified by the corresponding actor [9]. This observation motivates preference records, which are designed for actors to specify personal preferences that control whether and how other actors may utilize operation and message records.

If we consider the connections resulting from actors, context, and preferences to be special messages, these records generally form a directed acyclic graph (DAG) with messages as edges and the other records as nodes. Such graphs may have cycles when representing the provenance of a workflow; however, we can always rewrite a graph with cycles into a DAG by replicating edges and nodes [7].

2.2 Provenance Records
Provenance is represented by a set of provenance records stored in a provenance store. Such a store can be implemented by various systems, e.g., relational DBMSs or XML document management systems. Based on the proposed provenance data model (see Fig. 2), we define five kinds of provenance records: operation records, message records, actor records, preference records, and context records. To remain general, we leave out unspecified details about the implementation of these records. However, to illustrate the usage of provenance records and the access control requirements of a provenance store, it is necessary to consider some of those details. In what follows, we discuss the details of each record type that are relevant to the definition of provenance access control. The schema of each record is shown in Fig. 3; in the graphical representation, PK means "primary key" and FK means "foreign key". Each record consists of several attributes, some of which are optional in that their value may be null. A basic assumption in the provenance model is that each piece of data and each provenance record is uniquely identified by one identification attribute, referred to as the ID attribute. Message, operation, actor, and preference records have a timestamp attribute, which is useful for time-restricted provenance queries and preference evaluation (their use will be discussed in Section 5).

Fig. 3. Provenance record schemata:
– Operation(ID [PK], Context ID [FK], Actor ID [FK], Description, Output, Timestamp)
– Actor(ID [PK], Name, Role, Timestamp)
– Context(ID [PK], State, Parameter)
– Message(ID [PK], Source ID [FK], Destination ID [FK], Actor ID [FK], Description, Content, Carrier, Timestamp)
– Preference(ID [PK], Actor ID [FK], Target, Condition, Effect, Obligations, Timestamp)

Operation record attributes include ID, actor ID, context ID, description, output, and timestamp. The level of detail of the description attribute depends on the application: it may define a function precisely by pseudo-code, or even by source code, or it may be only a function name. The output attribute describes the output of the operation; its value usually represents the connection between provenance records and data records.

Message record attributes include ID, actor ID, source ID, destination ID, description, content, carrier, and timestamp. Again, the specific details of the description attribute depend on the application. A message record is not a copy of a real message between two operations; it is just the provenance of the real message. For the purpose of provenance completeness, the content attribute is expected to contain the full information transmitted by the real message. However, other choices are possible. If the intermediate data transferred in the real
message have been stored elsewhere, the reference ID of the intermediate data can be stored in the content attribute instead of the data itself. Moreover, if the destination operation is reversible, so that the intermediate data can be reproduced, the data need not be stored in the content attribute either. The carrier attribute indicates the channel over which the message was transferred, e.g., email, which may be sensitive and useful in some cases, e.g., digital forensics [1].

Actor record attributes include ID, name, and role. Actors usually have names and roles; a role, as in role-based access control, is a job function of the actor. One may ask why we do not use actor information directly from human resource databases. A human being may have different roles over his/her career, and may therefore have different versions of actor records, with different roles, for different operation, message, and preference records. This is the crucial reason why we cannot rely only on the actor records stored in human resource databases: such a record usually stores only the latest actor information, whereas an actor record in a provenance store needs to record the complete historical actor information.

Context record attributes include ID, state, and parameter. The content of a context record heavily depends on the application domain. Usually each operation record has at most one context record, so it may seem preferable to embed the context record in the operation record. We choose to separate the two for two reasons. First, the schema and size of context records vary across operation records: some operation records have no context record, while others may have a complex one. Second, two different operation records may share the same context record. Context records are the only provenance records that do not need timestamps, because their timestamps are determined by their parent operation records.

Preference record attributes include ID, actor ID, target, condition, effect, obligations, and timestamp. A preference is used to record the access preferences of the actor of the operation or message. Sometimes it is also useful to record the preferences expressed by the subject of the operation/message, for example a patient in the case of healthcare applications. The actor ID attribute records the author of the preference record. Authors are usually actors: a patient may specify his/her preference, but the preference is usually recorded by a nurse or a doctor. The target attribute specifies the subject and the exact record at which the preference aims; each target of a preference record references exactly one operation record or message record. Details of the target and other attributes, e.g., conditions, are elaborated in Section 3.

Because provenance is a documented history of a piece of data, it has been pointed out [10] that a provenance store is immutable. However, it is reasonable to allow some actors to rationally change their preferences on their own records. There are two approaches to supporting such selective updates: one is to allow changes to overwrite previous values; the other is to use versioning, associating a timestamp with each preference record. Given a query, if two
preferences from the same actor evaluate to one permit and one deny, the result from the latest preference record takes precedence. We adopt the latter solution because:

– previous preference records are part of the data provenance and have value as well;
– the immutability property reduces the complexity of access control on provenance, because we only need to focus on querying and do not need to worry about changes, such as updates and deletions, to existing provenance records. This property also makes it possible to store provenance on "Write Once, Read Many" (WORM) devices, which may greatly help in protecting provenance integrity.

As shown in Fig. 3, there are relations between the records that compose a provenance DAG. The actor ID in an operation, message, or preference record references the primary key of an actor record. The source ID and destination ID in a message record reference the primary keys of two operation records. The context ID in an operation record references the primary key of a context record. The record field, together with the restriction field (Section 3), in the target of a preference record references the primary key of either an operation record or a message record. Another basic assumption in our provenance model is that, at each time instant, a piece of data is manipulated by at most one operation. A piece of data can be manipulated several times by the same or different operations, and this operation history builds the provenance of the piece of data.
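The versioning rule (latest applicable preference takes precedence, older versions remain stored as provenance) can be illustrated with a minimal Python sketch; the record layout is a simplified, hypothetical subset of the schema in Fig. 3.

from dataclasses import dataclass

@dataclass(frozen=True)           # records are immutable, like the store
class Preference:                 # illustrative subset of the Fig. 3 schema
    actor_id: int
    target: str                   # the operation/message record aimed at
    effect: str                   # e.g. "permit" or "deny"
    timestamp: float

def latest_preferences(prefs):
    """All preference versions stay in the store, but for each
    (actor, target) pair only the record with the latest timestamp is
    evaluated; its result takes precedence over older versions."""
    latest = {}
    for p in prefs:
        key = (p.actor_id, p.target)
        if key not in latest or p.timestamp > latest[key].timestamp:
            latest[key] = p
    return list(latest.values())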
2.3 Provenance Records for Medical Data
We now discuss the application of the proposed provenance model to representing the provenance of medical records generated by the Diabetes Quality Improvement Program workflow shown in Fig. 4, where CDC refers to Comprehensive Diabetes Care. The medical records and the relevant provenance records generated by the workflow are shown in Fig. 5. The first column shows the medical records, e.g., register, eye exam, etc. (actor records excepted); the other provenance records are shown on the right side. For simplicity, we do not show some attributes, e.g., the timestamps in actor records, and some records, e.g., context records.
Fig. 4. Diabetes QI Program workflow (a diabetic adult patient's first visit in the calendar year triggers an HBA1c lab result, an eye exam, a blood pressure measurement, and kidney function monitoring, which together determine whether the patient is CDC quality measure compliant)

Fig. 5. Medical records and Provenance records (example Register, Eye_exam, HBA1c, Blood_Pressure, Kidney_Function, and CDC medical records for patients Alice and Bob, together with the corresponding Operation, Message, Actor, and Preference provenance records generated by the workflow)
Based on the records reported in Fig. 5, we make the following observations:

– Each medical record is generated by one operation at a specific time, and can be uniquely identified by the output attribute (with two fields) of the operation's record.
– Some message records have values in their content attributes that reference medical records, and others do not.
– The message records and the operation records connected by them form two independent DAGs whose structure is exactly the same as that of the workflow of interest (Fig. 4).
– Actor records are referenced from operation, message, and preference records.
– Each preference record references exactly one message record or operation record.

All records other than preference records are easily understood; the meaning of preference records will become clearer in Section 5. One important design choice in our provenance model is that no provenance pointer needs to be included in the original data item to indicate the location of the relevant provenance records: given an item ID, we can directly retrieve its provenance from the provenance store. An advantage of our model is that it does not require adjusting the schemata of existing datasets. It is well known that database administrators usually "hate" such adjustments.
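The following Python sketch illustrates this design choice under stated assumptions: operations are assumed to carry a two-field output attribute, e.g. ("CDC", 8), and messages are assumed to expose "src" and "dst" operation IDs; these dictionary keys are hypothetical, introduced only for illustration.

def provenance_of(operations, messages, record_name, record_id):
    """Locate the operation whose output attribute identifies the data
    item, then gather the provenance DAG by walking input messages back
    to ancestor operations. No pointer in the data item is needed."""
    producing = [op for op in operations
                 if op["output"] == (record_name, record_id)]
    seen, frontier = set(), [op["id"] for op in producing]
    while frontier:
        op_id = frontier.pop()
        if op_id in seen:
            continue
        seen.add(op_id)
        # Input messages of this operation lead to its ancestor operations.
        frontier += [m["src"] for m in messages if m["dst"] == op_id]
    return [op for op in operations if op["id"] in seen]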
2.4 Desiderata for a Provenance Access Control Model
Based on previous work on securing provenance [10, 3, 1] and on query examples in provenance management [6, 5, 15], we identify some important requirements for a provenance access control mechanism, discussed in what follows.

First, provenance access control must be fine-grained. Because of the differing sensitivity of provenance records, an organization may want to ensure that certain portions of the provenance records are accessible only to certain parties, e.g., a few treatments in privacy-sensitive electronic healthcare records [9], the sources of information in a document classified by the Central Intelligence Agency [2], or a proprietary algorithm applied to some segments of scientific data [1]. Such a requirement calls for the ability to confine a query to a very limited scope with respect to the subjects and/or objects of access control. Moreover, it may also be useful to ensure that certain subjects are authorized to access only the subset of the provenance records that are necessary for a specific purpose or, more generally, for any type of context¹ in which provenance representations can be useful [3]. This requires the ability to express authorizations with context restrictions.

Second, provenance access control may have to constrain data accesses in order to address both security and privacy. A typical example is electronic healthcare records, which essentially contain both the original data and their provenance. If we consider the final medical result of the treatment of a patient to be a piece of data, its provenance usually contains observations, procedures, tests, prescriptions, and information flows between patients, doctors, practitioners, and nurses. Therefore, accesses to electronic healthcare records should not only comply with organizational security policies based on well-known principles such as "need to know" and "least privilege", but also comply with privacy regulations such as HIPAA.

Third, provenance access control may need both originator control [16, 17] (ORGCON) and usage control [18, 19] (UCON). ORGCON is an access control proposal that requires recipients to gain the originator's approval before re-disseminating an originally disseminated digital object, or a new digital object that includes the originally distributed digital objects. Motivated by digital rights management, UCON is an access control model that confines the usage of re-disseminated digital objects. As mentioned in Section 2.1, provenance access control must be able to take into account actors' preferences about how the relevant records may be utilized, which is indeed similar to originator (actor) control over the usage of records. One challenge arising from this requirement is that provenance access control should provide a meaningful and usable method to integrate decisions from both organizational policies and actor preferences, of which multiple versions may exist. Another challenge is the need for a mechanism ensuring that regulations, e.g., HIPAA, always take precedence over preferences when there is a conflict.

¹ Here the meaning of the term "context" is different from that of the term "context" in which an actor performs some operation.
3 An Access Control Language for Provenance Stores
In this section we propose an access control language, based on our provenance model, that addresses the requirements discussed in the previous section. The language supports the specification of both actor preferences and organizational access control policies.

3.1 The Language
The proposed language is graphically represented in Fig. 6. Its main components are target, condition, effect, and obligations, which are discussed in what follows.
Fig. 6. An Access Control Language (a Policy has exactly one Target, one Effect, optional Obligations, and an optional Condition; the Target comprises one or more Subjects and Records, an optional Scope, and an optional Restriction)
3.2 Target
Since a provenance store is immutable, only two operations can be supported: append and read. We believe that in a provenance-aware system the append operation should be performed automatically by applications rather than by users, like log operations in database systems; the privilege to stop or start appending by an application is controlled not by regular users but by administrators. Therefore, our access control language focuses only on the query (read) operation on provenance records.

The target specifies the set of subjects and records to which the policy is intended to apply. Because the provenance store is immutable and the access on which we focus is query (read), the type of access, e.g., read or append, is intentionally omitted from the target. The subject element can be the name of any collection of users, e.g., actor or professor, or the special user collection anyuser, which represents all users. The record element can be the name of any collection of provenance records, e.g., operation, of some attributes of records, e.g., operation.body, or the special record collection anyrecord. The following example shows a policy target that applies to access requests from any user and to all information contained in the description attribute of operation records.

<target>
  <subject>anyuser</subject>
  <record>operation.description</record>
</target>
The (optional) restriction element can further refine the applicability established by the target through the specification of predicates on subject attributes (combined with anyuser) and/or record attributes (Section 2.4). The following target applies to users with a doctor role and to the description of operation records created before the year 2009.

<target>
  <subject>anyuser</subject>
  <record>operation.description</record>
  <restriction>anyuser.role == doctor AND operation.timestamp <= 1.1.2009</restriction>
</target>
Provenance represents the lineage of a piece of data; thus, when we define a provenance access control policy, we often want the ability to specify one policy for all provenance records related to a piece of data. This requirement is captured by the optional element scope. Two predefined values, "transferable" and "non-transferable" (the default), can be specified in this element. "Transferable" means that the target contains not only the set of records defined by the other elements of the target, but also all ancestors of these records; "non-transferable" means that the target contains only the set of records defined by the other elements. If the scope element is absent from a target, the "non-transferable" semantics is adopted. The target in the following example covers an operation record and its antecedent records, namely those that resulted in the CDC record of patient Alice (see Fig. 5).

<target>
  <subject>anyuser</subject>
  <record>operation</record>
  <restriction>operation.id == 11</restriction>
  <scope>transferable</scope>
</target>
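A minimal Python sketch of the "transferable" expansion is given below; it assumes ancestor links in the provenance DAG are available as a mapping from a record id to the ids of its parents, an interface introduced here purely for illustration.

def expand_transferable(target_records, parents_of):
    """"Transferable" scope semantics: the covered set is the records
    selected by the other target elements plus all their ancestors in
    the provenance DAG."""
    covered, frontier = set(), list(target_records)
    while frontier:
        r = frontier.pop()
        if r in covered:
            continue
        covered.add(r)
        frontier.extend(parents_of.get(r, ()))
    return covered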
3.3 Condition
A condition is a boolean expression that describes optional context requirements (Section 2.4) that confine the applicable access requests, e.g., the access purpose, limitations on access time and location, or verification of the record originator's license. System or context variables usually appear in the condition expression. The following condition restricts an applicable access to be executed only from a given machine, e.g., obelix, and for a given access purpose, e.g., research.

system.machineid == obelix AND purpose == research
Restrictions and conditions are both boolean expressions, and both are crucial for achieving fine-grained access control. They are mapped onto different components not only because they address different policy aspects, i.e., the target scope and the context requirements, but also because they have a different impact on the aggregation of decisions from different applicable policies. We elaborate on this issue in Section 4.

3.4 Effect
The effect of a policy indicates the policy author's intended consequence of a "true" evaluation of the policy. In the current version of the language, the effect can take one of the following values: Absolute Permit, Deny, Necessary Permit, and Finalizing Permit. The motivation and semantics of these four effects are discussed in Section 4. The following example shows a policy without obligations. The policy requires that any doctor who accesses the description field of operation records created before the year 2009 can only do so from machine obelix, and that the access purpose must be research only.

<policy ID=1>
  <target>
    <subject>anyuser</subject>
    <record>operation.description</record>
    <restriction>anyuser.role == doctor AND operation.timestamp < 1.1.2009</restriction>
  </target>
  <condition>system.machineid == obelix AND purpose == research</condition>
  <effect>necessary permit</effect>
</policy>
3.5 Obligations
An obligation is an operation, specified in a policy, that must be executed before the condition of the policy is evaluated, in conjunction with the enforcement of an authorization decision, or after the execution of the access. There are at least two cases in which obligations are needed in an access control language for provenance stores. First, an actor may require any user of his/her records to obtain his/her agreement before accessing these records, or to inform him/her after the access; he/she may do so by adding a pre-obligation or a post-obligation [20] to his/her preference. Second, one may have to comply with regulations that include obligations, e.g., HIPAA, and organizational policies may also require obligations. Due to space limitations, we do not elaborate on the obligations component and only provide an example; we refer the interested reader to our previous work [20]. The following example shows an obligation specifying that the actor of the record has to be informed about each data access within 10 days from the date of the access. Commonly used obligations are "inform actors or originators of an access" and "obtain (either advance or later) approval from actors or originators". Operations in an obligation may have their own specific parameters; we adopt a simplified representation of operations.
<obligations>
  <obligation>
    inform the actor of the record
    10 days
    access
  </obligation>
</obligations>

4 Policy Evaluation
Abstractly, we may consider the two-dimensional space defined by a provenance store query, referred to as a query space, to be a tuple of a singleton user and a set of records, represented by the pair (user_q, records_q). If we consider the tuple of the sets of subjects and records defined by a target to be a target space, represented by (subjects_t, records_t), then we have the following definition.

Definition 2 (Applicable Policy). A policy is applicable to a query if and only if the component-wise intersection of the target space of the policy and the query space generates neither an empty user set nor an empty record set.

In other words, a policy is applicable to a query if and only if its target space contains both the query user and a subset of the query records. Given a query, only the conditions, effects, and obligations of applicable policies are evaluated in its final authorization decision. The evaluation of an applicable policy is its condition evaluation result, either true or false. The evaluation sequence depends on the effects of the applicable policies, as shown in Fig. 7. The final decision depends on both the effects and the condition evaluation results of the applicable policies.

Fig. 7. Policy Evaluation Flow (a query is checked against the applicable absolute permit, deny, necessary permit, and finalizing permit policies in decreasing order of priority; each policy's condition evaluation result, true or false, determines Permit or Deny as described below)

An applicable policy with an absolute permit effect has the highest priority. Given a query, if at least one applicable absolute permit policy evaluates to true, the query is permitted regardless of the effects of the other applicable policies. The motivation for the absolute permit is that provenance queries required by law enforcement institutions or national security agencies should be able to circumvent the limitations specified by actor preferences and organizational policies.

A policy with a deny effect has the second priority. Given a query, if no applicable absolute permit policy evaluates to true and at least one applicable deny policy evaluates to true, the query is denied regardless of the effects of the other applicable policies. The motivation is that negative policies may significantly reduce the total number of policies required in practice, which in turn may reduce policy administration costs. The popular "deny takes precedence" principle is adopted in the language.

A policy with a necessary permit effect has the third priority. Given a query, if no applicable absolute permit policy evaluates to true, no applicable deny policy evaluates to true, and at least one applicable necessary permit policy evaluates to false, the query is denied regardless of the effects of any other applicable policies. The motivation for the necessary permit comes from actor preferences and from regulation compliance in organizational policies. When an actor specifies his/her preferences on some operation or message records,
When an actor specifies his/her preferences on some operation or message records, he/she, like a donor who regulates the usage of his/her funds, usually specifies only some necessary conditions that should be satisfied by future usages of the relevant records. To comply with regulations, e.g. HIPAA, an organization has to specify corresponding access control policies that are usually not sufficiently fine-grained but necessary for all relevant queries. A necessary permit is useful in these cases. It should be noted that we can write a deny policy which is semantically equivalent to a given necessary permit policy by negating the condition and changing the effect; for any query, the final decisions are then the same. However, since some regulations are more naturally expressed by a deny policy and some preferences are more naturally expressed by a necessary permit, we intentionally do not merge these two effects into one.

A policy with a finalizing permit effect has the lowest priority. Given a query, if no applicable absolute permit policy evaluates to true, no applicable deny policy evaluates to true, no applicable necessary permit policy evaluates to false, and at least one applicable finalizing permit policy evaluates to true, then the query is permitted; otherwise, the query is denied. One goal in classifying positive authorization policies into necessary permit policies and finalizing permit policies is to achieve flexibility and convenience in the administration of policies at different granularity levels. For instance, the Chief Security and Privacy Officer (CSPO) of an organization may specify a binding set of basic regulations for all departments with respect to the access to
provenance information within the organization. These regulations can in turn be refined by the various departments [21]. One way to address this requirement is to let the CSPO define some applicable necessary permit policies with which all the departments have to comply. However, these necessary permit policies cannot authorize access requests by themselves, because they are not sufficiently fine-grained. Each department can then define its own fine-grained finalizing permit policies. The decision for a query is then obtained by composing the decisions from all applicable policies, with at least one decision coming from the finalizing permit policies of one department. In other words, if all applicable necessary permit policies allow the access request and at least one finalizing permit policy allows the request, the query is authorized. Positive norms and negative norms are based on a similar idea [22].
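As an illustration only, the following Python sketch mirrors the evaluation flow of Fig. 7; the effect labels and the condition(query) accessor are our own notational assumptions.

# A sketch of the evaluation flow of Fig. 7 (illustrative names only).
def evaluate(applicable_policies, query):
    def true_with(effect):
        return any(p.effect == effect and p.condition(query)
                   for p in applicable_policies)
    if true_with("absolute permit"):
        return "permit"          # highest priority: circumvents everything
    if true_with("deny"):
        return "deny"            # deny takes precedence over permits below
    if any(p.effect == "necessary permit" and not p.condition(query)
           for p in applicable_policies):
        return "deny"            # some necessary condition is not satisfied
    if true_with("finalizing permit"):
        return "permit"          # at least one finalizing permit holds
    return "deny"                # otherwise the query is denied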
5 Originator Preferences
Our access control language can be applied to specify originator preferences, that is, to support originator control. Compared to organizational policies, originator preferences are usually very specific to a particular record and its fields. The following originator preference specifies that the description information of operation record 12345678 cannot be accessed for either reverse engineering or reselling purposes. Fig. 5 shows other preference record examples.

<preference ID=1>
  <!-- element names reconstructed; the original listing layout was lost in extraction -->
  <subject>anyuser</subject>
  <object>operation.description</object>
  <restriction>operation.ID == 12345678</restriction>
  <condition>purpose == reverse engineering OR purpose == reselling</condition>
  <effect>deny</effect>
  <timestamp>1.29.2009</timestamp>
</preference>
The timestamp plays a key role in the evaluation of originator preferences. When multiple preferences exist, the evaluation criterion is that only the latest applicable preference is evaluated. The final authorization decision depends on the latest applicable preference and all applicable organizational policies: given a query, the applicable organizational policies and the applicable preference are evaluated together. The semantics of the different effects in user preferences is the same as that of policies. Given a record to be queried, if only necessary permit preferences are specified on this record and there are no corresponding organizational finalizing permit policies, then, according to the evaluation flow introduced in Section 4, the record cannot be disclosed. This behavior is reasonable and meets our expectations.
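Under the same notational assumptions as the sketch in Section 4, and assuming each preference carries a timestamp field, the combination rule can be sketched as follows.

# Sketch: only the latest applicable preference is evaluated, together
# with the applicable organizational policies (`timestamp` is an assumption).
def decide(applicable_policies, applicable_preferences, query):
    latest = max(applicable_preferences, key=lambda p: p.timestamp,
                 default=None)
    combined = applicable_policies + ([latest] if latest else [])
    return evaluate(combined, query)   # evaluate() from the Section 4 sketch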
6 Additional Examples
In this section, we show how our access control language can specify access control policies from some recently published papers, and thus meet the access control requirements identified there.
As mentioned in Section 2.4, electronic health records are a hybrid representation of data and relevant provenance, and thus need to be protected to assure privacy. Three crucial components in privacy regulations are the access or usage purpose, obligations, and conditions [23]. The proposed access control language directly supports conditions and obligations. Purpose requirements can be specified as predicates in conditions (as a matter of fact, we have already used them in previous examples). Someone may argue that the language cannot prevent policy authors from specifying invalid purposes. We believe, however, that specifying valid purposes in conditions is the duty of the policy authors, and thus it is reasonable that the language itself is only capable of expressing purposes in policies. Perhaps policy analysis may enable us to identify invalid purposes in policies; this is left for future work. In conjunction with effects, purpose predicates can directly model the following common cases of purpose requirements in privacy regulations:

– case 1: some records can only be used for some specific purposes;
– case 2: some records can be used for some specific purposes;
– case 3: some records should not be used for some purposes.

These cases can be represented by the following three policy fragments.

<policy ID=1>
  <!-- condition/effect element names reconstructed; elided parts appear as ... in the original -->
  ...
  <condition>purpose == research OR purpose == development</condition>
  <effect>necessary permit</effect>
  ...
<policy ID=2>
  ...
  <condition>purpose == research OR purpose == development</condition>
  <effect>finalizing permit</effect>
  ...
<policy ID=3>
  ...
  <condition>purpose == marketing</condition>
  <effect>deny</effect>
  ...
The evaluation flow introduced in Section 4 can be directly applied to integrate the decisions from privacy policies. An employee's performance review [10] is an example where the provenance is more sensitive than the data. Generally, employees are permitted, and usually encouraged, to read their performance review. However, an employee is not told who had input in writing the review. Thus the employee can see the data but not the provenance of that data. The following policy forbids any subject from accessing the messages that lead to the review document about him/her; thus no source can be disclosed.
<policy ID=1>
  <!-- element names reconstructed; the original listing layout was lost in extraction -->
  <subject>anyuser</subject>
  <object>operation</object>
  <condition>operation.output.record == review AND anyuser.name == review.objectname</condition>
  <effect>deny</effect>
</policy>
Braun et al. [10] argued that provenance is poorly served by traditional data security models because these models focus on individual (provenance) data items, whereas provenance focuses on the relationships between those items. These relationships and data items form a DAG, and both nodes and edges need to be protected. In addition, Braun et al. also suggested that one may need to hide the participation of an operation. We agree with these requirements, but we do not agree that traditional data security models cannot secure provenance. With an appropriate provenance model, like the one proposed in this paper, the relationships between data items (message records) and the participation of an operation (actor records) may be secured by traditional-style access control policies, as shown by the examples in this paper.

As indicated by Hasan et al. [1], the ownership history of documents (e.g., the chain associating a user or users with a document) may also be sensitive. A query for the source of a piece of data may be executed recursively on provenance records to generate the chain. With appropriate access control policies on message records, we can easily achieve protection at different granularities on the chain, based on the specific protection requirements:

– If we need to disclose the original sources to a specific subject without disclosing the details in the chain, e.g. operations and actors, we can address this requirement by only allowing the subject to access the Source ID and Destination ID of the relevant message records.
– If we need to hide the actor information in the chain from a specific subject, we can deny the subject access to the Actor ID in the records of the chain.
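To make the recursive chain construction concrete, here is a hedged Python sketch; the store accessor get_messages and the per-field check allowed are hypothetical helpers, not part of the proposed language.

# Sketch of a field-filtered recursive chain query over message records;
# `get_messages` and `allowed` are hypothetical helpers.
def source_chain(subject, record_id, store):
    chain = []
    for msg in store.get_messages(destination_id=record_id):
        visible = {f: msg[f] for f in ("SourceID", "DestinationID", "ActorID")
                   if allowed(subject, msg, f)}   # per-field access control
        chain.append(visible)
        if "SourceID" in visible:                 # recurse toward the origin
            chain += source_chain(subject, visible["SourceID"], store)
    return chain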
7 Related Work
The proposed access control language has been influenced by the XACML language [24]. One distinctive feature of our provenance access control is the need to aggregate authorization decisions from different policies with different purposes, e.g. organizational policies, user preferences in different versions, and privacy regulations. Because XACML does not distinguish between conditions and restrictions, it is not suitable for the aggregation required by the management of provenance. In addition, its rule evaluation truth table and policy combining algorithms have several shortcomings [25]. The purpose handling has been inspired by the Privacy-aware Role-Based Access Control model [23]; however, our purpose handling is more flexible than in that earlier approach.
Security issues in the context of provenance management have been only briefly discussed in a few prior papers [3, 2, 9, 10, 1]. Groth et al. [3, 9] discussed security requirements for Service Oriented Architectures and proposed some abstract frameworks providing security mechanisms, including access control, for provenance stores. Braun and Shinnar [2], in the context of the PASS (Provenance-Aware Storage Systems) project [26], discussed a security model for provenance which consists of two separate models: one for protecting the structure or workflow (i.e., which ancestors and descendants are accessible to which users) and the other for specifying which node attributes are accessible to which users. Braun et al. [10] later argued the need for new security models for provenance management. In particular, they highlighted two properties of provenance information, its "DAG nature" and its immutability, which distinguish provenance information from traditional data items and from tree-structured data. Hasan et al. [1] discussed research challenges in securing a provenance chain and proposed a lifecycle model for provenance; they also analyzed possible applications of secure provenance. Compared to such work, ours is the only one focusing on the analysis of requirements for provenance access control and providing a comprehensive access control language to meet these requirements.

Very recently, the problem of secure provenance management has also been investigated in the broader context of information networks, which abstract distributed information sharing [27, 28]. With respect to the domains investigated and mentioned above (e.g., scientific databases, grid computing, workflows, health care applications), this new problem domain introduces new challenges. For example, provenance in scientific databases records the modifications that have been applied to a specific data item, whereas workflow systems operate within the boundary of a single enterprise. In contrast, information networks capture the movement and processing of data beyond any single database or enterprise: each node can maintain a provenance store for all the data messages it has received, so the provenance store is collectively maintained by all the nodes of an information network that can cross many enterprises. The access control language presented in this paper can serve as a useful mechanism in this framework.
8 Conclusion and Future Work
Our proposal for provenance access control is still at its first stage, and many interesting problems are left open. In the evaluation of provenance access control policies, decisions with uncertainty about the result of target evaluation or condition evaluation may arise. There are at least two cases in which a policy evaluation may generate uncertainty. First, because the predicates in a policy may refer to the content of other data, other provenance stores, or system variables, it might not be possible to evaluate them due to a lack of privileges. There are several design choices for the issue of which privilege is used for policy evaluation: the
administrative privilege, the query issuer privilege, and the policy author privilege. Obviously the administrative privilege, which may give excessive power to all table owners (they are policy authors), may result in severe security breaches. Rosenthal et al. [29] suggest that policies should be evaluated under the privilege of the query issuers rather than that of the policy authors. In contrast, Olson et al. [30] suggest that policies should be evaluated under the privilege of the policy authors rather than that of the query issuers. In either approach, it is possible that predicates in policies cannot be successfully evaluated due to a lack of privileges. Second, external factors, such as software vulnerabilities or hardware failures, may also prevent predicates from being evaluated correctly. In both situations, uncertain decisions (neither permit nor deny), in which we do not know the exact decision, are inevitable. The D-algebra [25] can be applied to deal with policy evaluation in the presence of uncertainty.

Delegation of access control rights, which is one important requirement for provenance access control [3, 10], has not been addressed in this paper. We prefer policy-based delegation management, consider delegation management policies to be meta-policies on access control policies, and will investigate this issue in our future work.

Because of the semantics of the different effects and of the predicates used in conditions and restrictions, inappropriate policy specifications may generate conflicting or redundant policies [31]. Detecting these abnormal policies is essentially a SAT problem. Fortunately, the problem size is usually very small regardless of the number of policies: only policies with overlapping target spaces that share variables in their predicates need to be checked. Various heuristic techniques have already been developed [31]; we need a version tailored to provenance access control policies as well.

Acknowledgement. This work is supported in part by AFOSR MURI award FA9550-08-1-0265.
References

[1] Hasan, R., Sion, R., Winslett, M.: Introducing secure provenance: problems and challenges. In: Proceedings of the 2007 ACM Workshop on Storage Security And Survivability (StorageSS), pp. 13–18 (2007)
[2] Braun, U., Shinnar, A.: A security model for provenance. Technical Report TR04-06, Harvard University Computer Science (January 2006)
[3] Groth, P., Jiang, S., Miles, S., Munroe, S., Tan, V., Tsasakou, S., Moreau, L.: An architecture for provenance systems. Technical report, University of Southampton (November 2006)
[4] Benjelloun, O., Sarma, A.D., Halevy, A.Y., Theobald, M., Widom, J.: Databases with uncertainty and lineage. VLDB J. 17(2), 243–264 (2008)
[5] Buneman, P., Chapman, A., Cheney, J.: Provenance management in curated databases. In: SIGMOD 2006, pp. 539–550 (2006)
[6] Chapman, A., Jagadish, H.V., Ramanan, P.: Efficient provenance storage. In: [32], pp. 993–1006
[7] Heinis, T., Alonso, G.: Efficient lineage tracking for scientific workflows. In: [32], pp. 1007–1018
[8] Moreau, L., Groth, P.T., Miles, S., Vázquez-Salceda, J., Ibbotson, J., Jiang, S., Munroe, S., Rana, O.F., Schreiber, A., Tan, V., Varga, L.Z.: The provenance of electronic data. Commun. ACM 51(4), 52–58 (2008)
[9] Tan, V., Groth, P., Miles, S., Jiang, S., Munroe, S., Tsasakou, S., Moreau, L.: Security issues in a SOA-based provenance system. In: Moreau, L., Foster, I. (eds.) IPAW 2006. LNCS, vol. 4145, pp. 203–211. Springer, Heidelberg (2006)
[10] Braun, U., Shinnar, A., Seltzer, M.: Securing provenance. In: HotSec 2008 (2008)
[11] Moreau, L., Plale, B., Miles, S., Goble, C., Missier, P., Barga, R., Simmhan, Y., Futrelle, J., McGrath, R., Myers, J., Paulson, P., Bowers, S., Ludaescher, B., Kwasnikowska, N., den Bussche, J.V., Ellkvist, T., Freire, J., Groth, P.: The open provenance model (v1.01). Technical report, University of Southampton (2008)
[12] Foster, I.T., Vöckler, J.S., Wilde, M., Zhao, Y.: Chimera: A virtual data system for representing, querying, and automating data derivation. In: SSDBM, pp. 37–46. IEEE Computer Society, Los Alamitos (2002)
[13] Janee, G., Mathena, J., Frew, J.: A data model and architecture for long-term preservation. In: Larsen, R.L., Paepcke, A., Borbinha, J.L., Naaman, M. (eds.) JCDL, pp. 134–144. ACM, New York (2008)
[14] Callahan, S.P., Freire, J., Scheidegger, C.E., Silva, C.T., Vo, H.T.: Towards provenance-enabling ParaView. In: Freire, J., Koop, D., Moreau, L. (eds.) IPAW 2008. LNCS, vol. 5272, pp. 120–127. Springer, Heidelberg (2008)
[15] Buneman, P., Khanna, S., Tan, W.-C.: Why and where: A characterization of data provenance. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, pp. 316–330. Springer, Heidelberg (2001)
[16] Abrams, M.D., Smith, G.W.: A generalized framework for database access controls. In: DBSec, pp. 171–178 (1990)
[17] McCollum, C.D., Messing, J.R., Notargiacomo, L.: Beyond the pale of MAC and DAC - defining new forms of access control. In: IEEE Symposium on Security and Privacy, pp. 190–200 (1990)
[18] Park, J., Sandhu, R.S.: Towards usage control models: beyond traditional access control. In: SACMAT, pp. 57–64 (2002)
[19] Park, J., Sandhu, R.S.: Originator control in usage control. In: POLICY, pp. 60–66. IEEE Computer Society, Los Alamitos (2002)
[20] Ni, Q., Bertino, E., Lobo, J.: An obligation model bridging access control policies and privacy policies. In: Ray, I., Li, N. (eds.) SACMAT, pp. 133–142. ACM, New York (2008)
[21] Raub, D., Steinwandt, R.: An algebra for enterprise privacy policies closed under composition and conjunction. In: Müller, G. (ed.) ETRICS 2006. LNCS, vol. 3995, pp. 130–144. Springer, Heidelberg (2006)
[22] Barth, A., Datta, A., Mitchell, J.C., Nissenbaum, H.: Privacy and contextual integrity: Framework and applications. In: IEEE Symposium on Security and Privacy, pp. 184–198. IEEE Computer Society, Los Alamitos (2006)
[23] Ni, Q., Trombetta, A., Bertino, E., Lobo, J.: Privacy-aware role based access control. In: Lotz, V., Thuraisingham, B.M. (eds.) SACMAT, pp. 41–50. ACM, New York (2007)
[24] Moses, T. (ed.): eXtensible Access Control Markup Language (XACML) Version 2.0. OASIS Open (February 2005)
[25] Ni, Q., Bertino, E., Lobo, J.: D-algebra for composing access control policy decisions. In: ASIACCS (2009)
[26] Muniswamy-Reddy, K., Holland, D., Braun, U., Seltzer, M.: Provenance-aware storage systems. In: Proceedings of the 2006 USENIX Annual Technical Conference, pp. 43–56 (2006)
[27] Xu, S., Ni, Q., Bertino, E., Sandhu, R.: A characterization of the problem of secure provenance management. In: Workshop on Assured Information Sharing, affiliated with the 2009 IEEE Intelligence and Security Informatics (ISI 2009) (2009)
[28] Xu, S., Sandhu, R., Bertino, E.: TIUPAM: A framework for trustworthiness-centric information sharing. In: Third IFIP WG 11.11 International Conference on Trust Management (TM 2009) (2009)
[29] Rosenthal, A., Sciore, E.: Abstracting and refining authorization in SQL. In: Jonker, W., Petković, M. (eds.) SDM 2004. LNCS, vol. 3178, pp. 148–162. Springer, Heidelberg (2004)
[30] Olson, L.E., Gunter, C.A., Madhusudan, P.: A formal framework for reflective database access control policies. In: Ning, P., Syverson, P.F., Jha, S. (eds.) ACM Conference on Computer and Communications Security, pp. 289–298. ACM, New York (2008)
[31] Ni, Q., Lin, D., Bertino, E., Lobo, J.: Conditional privacy-aware role based access control. In: Biskup, J., López, J. (eds.) ESORICS 2007. LNCS, vol. 4734, pp. 72–89. Springer, Heidelberg (2007)
[32] Wang, J.T.L. (ed.): Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2008), Vancouver, BC, Canada, June 10-12. ACM, New York (2008)
A Flexible Access Control Model for Distributed Collaborative Editors

Abdessamad Imine 1, Asma Cherif 1, and Michaël Rusinowitch 2

1 Nancy University and INRIA Nancy-Grand Est, France, {imine,asma}@loria.fr
2 INRIA Nancy-Grand Est, France, [email protected]

(This work has been supported by the AVANTSSAR Project FP7 216471.)
Abstract. Distributed Collaborative Editors (DCE) provide computer support for simultaneously modifying shared documents, such as articles, wiki pages and programming source code, by dispersed users. Controlling access in such systems is still a challenging problem, as they need dynamic access changes and low-latency access to shared documents. In this paper, we propose a flexible access control model where the shared document and its authorization policy are replicated in the local memory of each user. To deal with latency and dynamic access changes, we use an optimistic access control technique in which the enforcement of authorizations is retroactive. We show that naive coordination between updates of both copies can create security holes in the shared document, by permitting illegal modifications or rejecting legal modifications. Finally, we present a prototype for managing authorizations in collaborative editing work which may easily be deployed on P2P networks.

Keywords: Secure Data Management, Authorization and Access Control, Collaborative Editing Systems.
1 Introduction
Distributed Collaborative Editors (DCE) belong to a particular class of distributed systems that enables several dispersed users to form a group for editing documents (e.g. Google Docs). To ensure data availability, the shared documents are replicated on the site of each participating user. Each user modifies his copy locally and then sends the update to the other users. DCE are distributed systems that have to consider human interactions. They are therefore characterized by the following requirements: (i) High local responsiveness: the system has to be as responsive as a single-user editor [3,14,15]; (ii) High concurrency: users must be able to concurrently and freely modify any part of the shared document at any time [3,14]; (iii) Consistency: users must eventually see a converged view of all copies [3,14] in order to support the WYSIWIS (What You See Is What I See) principle; (iv) Decentralized coordination: all concurrent updates
must be synchronized in a decentralized fashion in order to avoid a single point of failure; (v) Scalability: a group must be dynamic, in the sense that users may join or leave the group at any time.

Motivations. One of the most challenging problems in DCE is balancing the competing goals of collaboration and access control to shared information [16]. Indeed, interaction in collaborative editors is aimed at making the shared document available to all who need it, whereas access control seeks to ensure this availability only to users with proper authorization. Moreover, the requirements of DCE include high responsiveness of local updates. However, when an access control layer is added, high responsiveness is lost if every update must be granted by some authorization coming from a distant user (such as a central server). The major cause of latency in access control-based collaborative editors is the use of a single shared data-structure containing the access rights, stored on a central server: controlling an access then consists in locking this data-structure and verifying whether the access is valid. Furthermore, unlike traditional single-user models, collaborative applications have to allow for dynamic changes of access rights, as users can join and leave the group in an ad-hoc manner.

Contributions. To overcome the latency problem, we propose a flexible access control model based on replicating the access data-structure at every site. Thus, a user owns two copies: the shared document and the access data-structure. This replication clearly enables users to gain performance: when they want to manipulate (read or update) the shared document, the manipulation is granted or denied by checking only the local copy of the access data-structure. Duplicating the access rights also makes it possible to support the dynamic access changes that DCE require. To do so, our model enables only one user, called the administrator, to modify the shared access data-structure. Updates locally generated by the administrator are then broadcast to the other users. We choose dynamic access changes initiated by a single user in order to avoid the occurrence, and hence the resolution, of conflicting changes. The shared document's updates and the access data-structure's updates are applied in different orders at different user sites. The absence of safe coordination between these different updates may cause security holes (i.e., permitting illegal updates or rejecting legal updates on the shared document). Inspired by the optimistic security concept introduced in [8], we propose an optimistic approach that tolerates momentary violations of access rights but ensures that the copies are subsequently restored to valid states with respect to the stabilized access control policy. To the best of our knowledge, this is the first effort towards developing an optimistic access control model for DCE that is based on replicating the shared document and its authorization policy.

Outline of the paper. This paper is organized as follows: Section 2 discusses related work. Section 3 presents the ingredients of our collaboration model. In Section 4, we investigate the issues raised by replicating the shared document and its access data-structure. Section 5 presents our concurrency control algorithm for managing optimistic access control-based collaborative editing sessions.
Section 6 describes our prototype and its experimental evaluation. Section 7 summarizes our contributions and sketches future work.
2 Related Work
A survey on access control for collaborative systems can be found in [16]. We recall only some representative approaches and their shortcomings. A collaborative environment has to manage frequent changes of access rights by users. Access Control Lists (ACL) and Capability Lists (CL) cannot support dynamic changes of permissions very well. Hence, the administrator of a collaborative environment often sets stricter permissions, as multiple users with varying levels of privileges will try to access shared resources [12]. Role-Based Access Control (RBAC) [11] overcomes some of the problems with dynamic changes of rights. RBAC has the notion of a session, which is a per-user abstraction [5]. However, the “session” concept also prevents dynamic reassignment of roles, since user roles cannot be changed within a single session: users have to authenticate again to obtain new roles. Spatial Access Control (SAC) has been proposed to solve this problem of role migration within a session [2]. Instead of splitting users into groups as in RBAC, SAC divides the collaborative environment into abstract spaces. However, a SAC implementation needs prior knowledge of the practices used in the collaborative system, in order to produce a set of rules that is generic enough to match most of the daily access patterns. In all these approaches, every access needs to check the underlying access data-structures; this requires locking these data-structures and reduces the performance of collaborative work.

The majority of work on replicating authorization policies appears in the database area [10,1,17]. For maintaining authorization consistency, these works generally rely on concurrency control techniques that are suitable for database systems. As outlined in [3], these techniques are inappropriate for DCE. Nevertheless, [10] is related to our work, as it employs an optimistic approach: changes in authorizations can arrive in different orders at different sites. Unlike in our approach, conflicting authorizations may appear, as updates are initiated by several sites.
3 Our Collaboration Model
In the following, we present the ingredients of our model.

3.1 Shared Data Object
It is known that collaborative editors manipulate shared objects that admit a linear structure [3,13,15]. This structure can be modelled by the list abstract data type. The type of the list elements is a parameter that can be instantiated by each needed type. For instance, an element may be regarded as a character, a paragraph, a page, an XML node, etc. In [15], it has been shown that this linear structure can easily be extended to a range of multimedia documents, such as Microsoft Word and PowerPoint documents.
Definition 1. [Cooperative Operations]. The shared document state can be altered by the following set of cooperative operations: (i) Ins(p, e), where p is the insertion position and e the element to be added at position p; (ii) Del(p, e), which deletes the element e at position p; (iii) Up(p, e, e′), which replaces the element e at position p by the new element e′.

It is clear that combinations of these operations enable us to define more complex ones, such as cut/copy and paste, which are intensively used in professional text editors.
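As a minimal illustration of Definition 1, the three operations can be realized on a Python list with 1-based positions; the function name do merely mirrors the Do(o, D) notation used later in Section 5.

# Sketch of the cooperative operations of Definition 1 (1-based positions).
def do(op, state):
    kind, p = op[0], op[1]
    if kind == "Ins":                 # Ins(p, e): insert e at position p
        return state[:p-1] + [op[2]] + state[p-1:]
    if kind == "Del":                 # Del(p, e): delete element e at p
        return state[:p-1] + state[p:]
    if kind == "Up":                  # Up(p, e, e'): replace e at p by e'
        return state[:p-1] + [op[3]] + state[p:]

# Example: do(("Ins", 2, "f"), list("efecte")) yields list("effecte").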
3.2 Shared Policy Object
We consider an access control model based on authorization policies. An authorization policy specifies the operations a user can execute on a shared document. Three sets are used for specifying authorization policies:

1. S is the set of subjects. A subject can be a user or a group of users.
2. O is the set of objects. An object can be the whole shared document, an element or a group of elements of this shared document.
3. R is the set of access rights. Each right is associated with an operation that a user can perform on the shared document. Thus, we consider the rights of reading an element (rR), inserting an element (iR), deleting an element (dR) and updating an element (uR).

We deal only with dynamic changes of the iR, dR and uR rights. The read right is out of the scope of this paper, but we give an outlook on it in our future work.

Definition 2. [Policy]. A policy is a function that maps a set of subjects and a set of objects to a set of signed rights. We denote this function by P : P(S) × P(O) → P(R) × {+, −}, where P(S), P(O) and P(R) are the power sets of subjects, objects and rights, respectively. The sign “+” represents a right attribution and the sign “−” represents a right revocation.

We represent a policy P as an indexed list of authorizations. Each authorization P_i is a quadruple ⟨S_i, O_i, R_i, ω_i⟩ where S_i ⊆ S, O_i ⊆ O, R_i ⊆ R and ω_i ∈ {−, +}. An authorization is said to be positive (resp. negative) when ω = + (resp. ω = −). Negative authorizations are just used to accelerate the checking process. We use a first-match semantics: when an operation o is generated, the system checks o against its authorizations one by one, starting from the first authorization and stopping when it reaches the first authorization l that matches o. If no matching authorization is found, o is rejected.

Definition 3. [Administrative Operations]. The state of a policy is represented by a triple ⟨P, S, O⟩ where P is the list of authorizations. The administrator can alter the policy state by the following set of administrative operations: (i) AddUser/DelUser to add/remove a user in S; (ii) AddObj/DelObj to add/remove an object in O; (iii) AddAuth(p, l)/DelAuth(p, l) to add/remove authorization l at position p. An administrative operation r is called restrictive iff r = AddAuth(p, l) and l is negative, or r = DelAuth(p, l).
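A minimal sketch of this first-match semantics, assuming each authorization is a quadruple (S_i, O_i, R_i, sign) whose first three components are Python sets:

# Sketch of the first-match check over an indexed list of authorizations.
def check(policy, subject, obj, right):
    for subjects, objects, rights, sign in policy:
        if subject in subjects and obj in objects and right in rights:
            return sign == "+"   # the first matching authorization decides
    return False                 # no match: the operation is rejected

# Example with the policy <{s2}, {doc}, {dR}, +> used later in Section 4:
# check([({"s2"}, {"doc"}, {"dR"}, "+")], "s2", "doc", "dR") returns True.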
3.3 Collaboration Protocol
In our collaboration protocol, we consider that a user maintains two copies: the shared document and its access policy object. Each group consists of one administrator and several users. Only the administrator can specify authorizations in the policy object; he can also directly modify the shared document. As for the users, they only modify the shared document with respect to the local policy object. Our collaboration protocol proceeds as follows:

1. When a user manipulates the local copy of the shared document by generating a cooperative operation, this operation is granted or denied by checking only the local copy of the policy object.
2. Once granted and executed, local operations are broadcast to the other users. A user has to check whether or not the remote operations are authorized by its local policy object before executing them.
3. When the administrator modifies his local policy object by adding or removing authorizations, he sends these modifications to the other users in order to update their local copies. Note that the administrator site does not coordinate concurrent cooperative operations.
4. We assume that messages are sent via a secure and reliable communication network, and that users are identified and authenticated by the administrator in order to associate accesses correctly with these users.

Even though our access control model is simple, we show in the following that policy enforcement is very tricky.
4 Consistency and Security Issues
The replication of the shared document and the policy object is twofold beneficial: first, it ensures the availability of the shared document; second, it allows for flexibility in access right checking. However, this replication may lead to violations of access rights, which may in turn break one of the most important requirements of DCE: the consistency of the shared document's copies. Indeed, cooperative and administrative operations are performed in different orders on different copies of the shared document and the policy object. In the following, we investigate the issues raised by the use of the collaboration protocol described in Section 3.3, and we informally present our solutions to address these issues.
4.1 Out-of-Order Execution of Cooperative Operations
What happens if cooperative operations arrive in arbitrary orders, even with a stable policy object? Consider the scenario in Figure 1.(a), where two users work on a shared document represented by a sequence of characters and have the same policy object (they are authorized to insert and delete characters). The characters are addressed from 1 to the end of the document. Initially, both copies hold the string “efecte”. User 1 executes operation o_1 = Ins(2, f) to insert the character ‘f’ at position 2.
Fig. 1. Serialization of concurrent cooperative operations: (a) incorrect integration; (b) correct integration
Concurrently, user 2 performs o_2 = Del(6, e) to delete the character ‘e’ at position 6. When o_1 is received and executed at site 2, it produces the expected string “effect”. But at site 1, o_2 does not take into account that o_1 has been executed before it, and it produces the string “effece”. The result at site 1 differs from the result at site 2, and it apparently violates the intention of o_2, since the last character ‘e’, which was intended to be deleted, is still present in the final string. To maintain consistency of the shared document, even though the policy object remains unchanged, we use the Operational Transformation (OT) approach proposed in [3]. In general, it consists of an application-dependent transformation algorithm, called IT, such that for every possible pair of concurrent operations, the application programmer has to specify how to integrate these operations regardless of reception order. In Figure 1.(b), we illustrate the effect of IT on the previous example. At site 1, o_2 needs to be transformed in order to include the effect of o_1: o_2′ = IT(Del(6, e), Ins(2, f)) = Del(7, e). The deletion position of o_2 is incremented because o_1 has inserted a character at position 2, which is before the character deleted by o_2. It should be noted that OT enables us to ensure consistency for any number of concurrent operations, which can be executed in arbitrary order [9,7] (i.e., no global order is necessary). For managing collaborative editing work in a decentralized and scalable fashion, we reuse an OT-based framework that is not presented here due to space limitations; for more details see e.g. [4]. Our objective here is to develop on top of this framework a security layer for controlling access to the shared documents.
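For illustration, a partial Python sketch of IT for Ins/Del pairs follows; it covers only the position-shift cases used in the example above and omits the tie-breaking rules a complete transformation function needs.

# Partial sketch of IT for 3-tuple operations ("Ins"/"Del", position, element).
def it(op, other):
    kind, p, e = op
    okind, q, _ = other
    if okind == "Ins" and q <= p:
        return (kind, p + 1, e)   # other inserted before op: shift right
    if okind == "Del" and q < p:
        return (kind, p - 1, e)   # other deleted before op: shift left
    return op                     # otherwise op is unaffected

# it(("Del", 6, "e"), ("Ins", 2, "f")) yields ("Del", 7, "e"), as in Fig. 1.(b).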
4.2 Out-of-Order Execution of Cooperative and Administrative Operations
Performing cooperative and administrative operations in different orders at every user site may inevitably lead to security holes. To highlight these issues, we present three scenarios below. First scenario: Consider a group composed of an administrator adm and two standard users s_1 and s_2. Initially, the three sites have the same shared document “abc” and the same policy object, where s_1 is authorized to insert characters (see Figure 2).
Fig. 2. Divergence caused by introducing administrative operations
Suppose that adm revokes the insertion right of s_1 and sends this administrative operation to s_1 and s_2 so that it is applied on their local policy copies. Concurrently, s_1 executes a cooperative operation Ins(1, x) to derive the state “xabc”, as it is granted by its local policy. When adm receives the operation of s_1, it is ignored (as it is not granted by the local policy of adm), and the final state thus remains “abc”. As s_2 receives the insert operation of s_1 before its revocation, it reaches the state “xabc”, which remains unchanged even after the revocation operation has been executed. We are in the presence of data inconsistency (the state of adm is different from the state of s_1 and s_2) even though the policy object is the same at all sites. The new policy object is not uniformly enforced among all sites because of the out-of-order execution of administrative and cooperative operations. Thus, security holes may be created: for instance, some sites can accept cooperative operations that are illegal with respect to the new policy (e.g. sites s_1 and s_2). As our objective is to deploy such DCE in a P2P environment, a solution based on enforcing a total order between both kinds of operations is discarded, as it would require a central server. Achieving our objective therefore raises a critical question: how should the new policy be enforced with respect to concurrent cooperative operations? It should be noted that this enforcement may be delayed by either the latency of the network or malicious users. To solve this problem, we apply the principles of optimistic security [8] in such a way that the enforcement of the new policy may be retroactive with respect to concurrent cooperative operations. In this case, only illegal operations are undone. For instance, in Figure 2, Ins(1, x) should be undone at s_1 and s_2 after the execution of the revocation.

Second scenario: Suppose now that we use some technique to detect concurrency relations between administrative and cooperative operations.
Fig. 3. Necessity of the admin log
In the scenario of Figure 3, three users initially see the same document “abc”, and they use the same policy object P = ⟨{s_2}, {doc}, {dR}, +⟩. First, adm revokes the deletion right of s_2 by removing an authorization from P (P becomes empty). Concurrently, s_2 performs Del(1, a) to obtain the state “bc”. Once the revocation arrives at s_2, it updates the local policy copy and enforces the new policy by undoing Del(1, a) and restoring the state to “abc”. How should the remote operation Del(1, a) be integrated at adm and s_1? If, before executing this operation, we check it directly against the local policy at adm, it will be rejected (the policy is empty). Some time after receiving and ignoring Del(1, a), adm decides to grant the deletion right to s_2 again. At s_1, the execution of both administrative operations leads to P = ⟨{s_2}, {doc}, {dR}, +⟩. If, before executing Del(1, a), we check it directly against the local policy of s_1, it will be granted, and its execution will lead to data inconsistency. This security hole comes from the fact that the generation context of Del(1, a) at s_2 (the local policy against which it was checked) is different from the current execution context at adm and s_1 (due to preceding executions of concurrent administrative operations). Intuitively, our solution consists in capturing the causal relations between cooperative operations and the policy copies on which they are generated. In other words, every local policy copy maintains a monotonically increasing counter that is incremented by every administrative operation performed on this copy. If each granted cooperative operation is associated with the local counter of the policy object at the time of its creation, then we can correctly integrate it at every remote site. However, when the counter of a cooperative operation is smaller than the policy copy's counter of another site, this operation needs to be checked against the preceding concurrent administrative operations before its execution. Therefore, we propose in our model to store administrative operations in a log at every site
in order to validate remote cooperative operations in the appropriate context. For instance, in the scenario of Figure 3, we can deduce that Del(1, a) will be ignored at s_1 by simply checking it against the first revocation.

Third scenario: Using the above solution, the administrative operations are totally ordered, as only the administrator modifies the policy object and we associate a monotonically increasing counter with every version of this object. Consider the scenario illustrated in Figure 4, where s_1 is initially authorized to insert any character. When adm revokes the insertion right of s_1, he has already seen the effect of the insertion by s_1. If s_2 receives the revocation before the insertion, it will ignore this insertion, as it is checked against the revocation. It is clear that the insertion may be delayed at s_2 either by the latency of the network or by a malicious user. We observe that there is a causal relation at adm between the insertion and the revocation. This causal relation is not respected at s_2, and the out-of-order execution of operations creates a security hole, as s_2 rejects a legal insertion. Before it is received at the administrator site, we consider a cooperative operation to be tentative. Our solution therefore adds an administrative operation that does not modify the policy object but increments the local counter; this operation validates each received and accepted cooperative operation at the administrator site. Consequently, every administrative operation is concurrent to all tentative operations. The policy modifications made after the validation of a cooperative operation are executed after this operation at all sites, as administrative operations are totally ordered. In the scenario of Figure 4, the revocation received at s_2 is not executed until the validation of the insertion is received. This avoids blocking legal operations and data divergence.
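The counter-and-log mechanism of the last two scenarios can be summarized by the following sketch, which simplifies the Check Remote(q, L) test formalized in Section 5; check_local and permits are hypothetical helpers.

# Sketch: integrating a remote cooperative request q at a site whose local
# policy copy is at `version` and whose administrative log is L.
def accept_remote(q, version, policy, L):
    if q.v == version:
        return check_local(policy, q.o)   # same policy context as the sender
    # q was generated on an older policy copy: re-check it against the
    # administrative requests logged since version q.v
    return all(permits(r, q.o) for r in L if r.v > q.v)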
Fig. 4. Validation of operations
5 Concurrency Control Algorithm
Now we formally present the different components of our algorithm. We also give its asymptotic time complexity.
5.1 Cooperative and Administrative Requests
We define a cooperative request q as a tuple (c, r, a, o, v, f) where: (i) c is the identity of the collaborator site (or the user) issuing the request; (ii) r is its serial number (the concatenation of q.c and q.r is defined as the request identity of q); (iii) a is the identity of the preceding cooperative request, according to the dependency relation described in [4] (if a is null, the request does not depend on any other request); (iv) o is the cooperative operation (see Definition 1) to be executed on the shared state; (v) v is the version number of the policy copy on which the operation was granted; (vi) f is the kind of the cooperative request (tentative, valid or invalid). We consider three kinds of cooperative requests:

1. tentative: when an operation is locally accepted, it is stored as a request waiting for validation from the administrator;
2. valid: the request has been generated by a given site and validated by the local policy of the administrator;
3. invalid: the request is not confirmed by the receiver's local policy; it is then stored in the log and flagged in order to memorize its reception.

To detect causal dependency and concurrency relations between cooperative requests, we use a technique proposed in [4] which allows for dynamic groups, as it is independent of the number of users (unlike the vector-timestamp-based technique [3]). This technique builds a dependency tree where each request q only has to store, in q.a, the identity of the request it directly depends on. For more details, see [4].

We consider an administrative request r as a triple r = (id, o, v) where: (i) id is the identity of the administrator; (ii) o is the administrative operation (see Definition 3); (iii) v is the last version number of the policy object. As only the administrator specifies authorizations in the policy object, the administrative requests are totally ordered. Indeed, each policy copy maintains a monotonically increasing counter that is stored (in the version component v) and incremented by every administrative operation performed on this copy.

As seen in Section 4, it is crucial to correctly deal with the out-of-order execution of cooperative and administrative requests in order to avoid security holes. Let q and r be a cooperative and an administrative request, respectively: (i) q depends causally on r iff q.v > r.v, i.e. q has already seen the effect of r; (ii) if q is tentative, then it is concurrent to r, i.e. the administrator had not yet seen the effect of q when it generated r.
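For concreteness, the two request structures can be written as plain records; this is a sketch, and the field types are assumptions.

# Sketch of the request tuples of Section 5.1 as Python records.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CoopRequest:
    c: int               # identity of the issuing site
    r: int               # serial number; (c, r) identifies the request
    a: Optional[Any]     # identity of the preceding request, or None
    o: Any               # cooperative operation (Ins/Del/Up)
    v: int               # policy version at generation time
    f: str               # "tentative", "valid" or "invalid"

@dataclass
class AdminRequest:
    id: int              # administrator identity
    o: Any               # administrative operation
    v: int               # policy version number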
5.2 Control Procedure
In our approach, a group consists of one administrator site and N user sites (where N varies over time) starting a collaboration session from the same initial document state D0. Each site stores all cooperative requests in a log H and all administrative requests (AddAuth and DelAuth) in a log L. Our concurrency control procedure is given in Algorithm 1. It should be noted that Algorithm 1 is mainly based on the framework proposed in [4]. This framework relies on (i) using the OT approach [3] in order to execute cooperative requests in any order, and (ii) using a particular class of logs, called canonical, where insertion requests are stored before deletion requests in order to ensure data convergence.

Main:
  Initialization
  while not aborted do
    if there is an input o then
      if o is cooperative then
        Generate coop request
      else if i = admin then
        Generate admin request
      end if
    else
      Receive request
      Receive coop request
      Receive Admin request
    end if
  end while

Initialization:
  D ← D0                {actual state of the site}
  s ← identification of the local site
  version ← 0           {initial policy version of the local site}
  H ← []                {cooperative log}
  L ← []                {administrative log}
  F ← []                {cooperative requests buffer}
  Q ← []                {administrative requests buffer}
  compteurCoopOp ← 0

Receive Request:
  if there is a cooperative request q from the network then
    F ← F + q
  end if
  if there is an administrative request r from the network then
    Q ← Q + r
  end if

Algorithm 1: Concurrency control algorithm at the i-th site
Generation of a local cooperative request. In Algorithm 2, when an operation o is locally generated, it is first checked against the local policy object (i.e., using the
Boolean function Check Local). If it is granted locally, it is immediately executed on its generation state (i.e., Do(o, D) computes the state resulting from executing operation o on state D). Once the request q is formed, it is considered valid when the issuer is the administrator and tentative otherwise. Function ComputeBF(q, H) is called to detect inside H whether or not q causally depends on a preceding cooperative request. Integrating q after H may result in a non-canonical log; to transform [H; q] into canonical form, we use function Canonize. Finally, the request q′ (the result of ComputeBF) is propagated to all sites in order to be executed on the other copies of the shared document. For more details on functions ComputeBF and Canonize, see [4].

Generate coop request:
  if Check Local(o) then
    D ← Do(o, D)
    compteurCoopOp ← compteurCoopOp + 1
    if i = admin then
      q ← (s, compteurCoopOp, o, null, Valid, version)
    else
      q ← (s, compteurCoopOp, o, null, Tentative, version)
    end if
    q′ ← ComputeBF(q, H)
    H ← Canonize(q′, H)
    broadcast q′
  end if

Algorithm 2: Cooperative request generation at a site s
Reception of a cooperative request. Each site uses a queue F to store the remote requests coming from other sites. A request q generated at site i is added to F when it arrives at site j (with i ≠ j). In Algorithm 3, to preserve causal dependency with respect to preceding administrative and cooperative requests, q is extracted from the queue when it is causally ready (i.e., q.v ≤ version and the cooperative requests preceding q have already been integrated at site j). Using function Check Remote(q, L), q is checked against the administrative log L to verify whether or not it is granted. If q is received by the administrator, it is validated, and a validation request is generated and broadcast to the other sites. Next, function ComputeFF(q, H) is called to compute the transformed form q′ to be executed on the current state D; this function is given in [4]. Finally, the transformed form of q, namely q′, is executed on the current state, and function Canonize is called in order to turn [H; q′] into canonical form again.
Receive coop Request(q):
  if q is causally ready then
    F ← F − q
    if Check Remote(q, L) then
      if i = adm then
        q.f ← valid
        r ← Generate Admin Request(Validate(q))
      end if
    else
      q.f ← invalid
    end if
    q′ ← ComputeFF(q, H)
    D ← Do(q′, D)
    Canonize(q′, H)
  end if

Algorithm 3: Cooperative request reception at a site s
Generation and reception of administrative requests. In Algorithm 4, the policy copy maintains a version counter that is incremented by each request generated by the administrator and performed on this copy. Such a request is then broadcast to the other users to enforce the new policy. When a received request r is causally ready (i.e., r.v = version + 1 and, if r is a validation of a cooperative request q, this request has already been executed on this site), it is extracted from Q. If r.o is AddAuth or DelAuth: (i) it is performed on the policy copy; and (ii) it undoes the tentative cooperative requests that are no longer granted by the new policy. However, if r is a validation of a cooperative request q, then it sets q to valid.
Generate admin Request:
  version ← version + 1
  apply the modification to the policy
  r ← (admin, version, o)
  broadcast r

Receive Admin Request(r):
  if r is causally ready then
    Q ← Q − r
    if r.o is an AddAuth or DelAuth then
      apply the modification to the policy
      if r is restrictive then
        H ← Undo(q, H) for all tentative requests q concerned by r
      end if
    else
      j ← GetIndex(r.q)   {index of the cooperative request to validate}
      H[j].f ← valid
    end if
    version ← version + 1
  end if

Algorithm 4: Generation and reception of administrative requests
Asymptotic Time Complexities. Let H_du be the set of all deletion/update requests in H. In the worst case, when a cooperative request q is an insertion and has no dependency inside H (see [4]): (i) functions ComputeFF(q, H) and ComputeBF(q, H)
have the same complexity, O(|H|), and (ii) function Canonize(q, H) has complexity O(|H_du|). Hence, the complexity of Generate coop request is O(|H| + |H_du| + |P_v|) = O(2|H| + |P_v|) (where P_v is the list of authorizations at version v), and the complexity of Receive coop request is O(|L| + |H| + |H_du|) = O(|L| + 2|H|) (where L is the administrative log). Consequently, our concurrency control algorithm is not expensive and scales well, as all functions have linear behaviour. However, to enforce a new authorization policy we use the function Undo(q, H), whose complexity is O(|H|^2) when all of H's requests are tentative and must be undone by request r. In practice, Undo is not expensive if we assume that the transmission time of requests is very short: in this case, most tentative requests will have been validated by the administrator, and there will be few requests to undo between two versions of the policy object.
5.3 Illustrative Example
To highlight the features of our concurrency control algorithm, we present a slightly more complicated scenario in Figure 5, where the solid (dotted) arrows describe the integration order (the validation of tentative requests). We have an administrator adm and two users s_1 and s_2 starting the collaboration with the initial state D_0 = “abc” and the initial policy version (v_i^0 = v_0), characterized by the policy P_i^0 = ⟨(All, Doc, {iR, dR, rR, uR}, +)⟩ (for i = adm, 1, 2). The notations All and Doc designate the set of all users and the whole document, respectively. Initially, the cooperative and administrative logs of each site are empty (H_i^0 = L_i^0 = [] for i = adm, 1, 2). The sites generate three concurrent cooperative requests, respectively: q_0.o = Ins(2, y), q_1.o = Del(2, b) and q_2.o = Ins(3, x). After integrating q_0, q_1 and q_2, s_1 generates q_3.o = Del(1, a). As for s_2, it generates q_4.o = Del(2, x) after integrating q_1 and q_2. Finally, adm generates the administrative request r.o = AddAuth(1, (s_1, Doc, dR, −)). At the end of the collaboration, the three sites converge to the final state “ayc”. We describe the integration of the requests in three steps.

Step 1. At adm, the execution of q_0 produces D_0^1 = “aybc” and H_0^1 = [q_0]. When q_2 and q_1 arrive, they are transformed by ComputeFF(). This results in D_0^3 = “ayxc” and H_0^3 = [q_0; q_2; q_1], where q_2 and q_1 have been transformed into Ins(4, x) and Del(4, b). These requests are validated and sent to s_1 and s_2. At s_1, the execution of q_1 gives D_1^1 = “ac” and H_1^1 = [q_1]. Once received and granted by the local policy, q_2 and q_0 are transformed, and the obtained log is modified twice by Canonize(), as insertions must appear before deletions. We get D_1^3 = “ayxc” and H_1^3 = [q_2; q_0; q_1], where q_1 has been transformed into Del(3, b). Executing q_2 and q_1 at s_2 produces D_2^2 = “axc” and H_2^2 = [q_2; q_1]. The sites adm, s_1 and s_2 then generate r, q_3 and q_4, respectively, which are propagated as follows.

Step 2. At adm, r is restrictive and produces P_0^1 = ⟨(s_1, Doc, dR, −), (All, Doc, {iR, dR, rR, uR}, +)⟩, L_0^1 = [r] and v_0^1 = v_0 + 1. Indeed, it revokes the deletion right of s_1.
Fig. 5. Collaboration scenario between an administrator and two sites
At s1, the execution of q3 after H13 results in D14 = “yxc”. To broadcast q3 with a minimal generation context, function ComputeBF() is called to detect causal dependencies inside H13. The obtained log is H14 = [q2; q0; q1; q3]. At s2, q4 is executed after H22 and produces D23 = “ac” and H23 = [q2; q1; q4]. Using ComputeBF() makes it possible to detect that q4 depends on q2, as q4 removes the character inserted by q2. When q0 arrives, its integration produces D24 = “ayc” and H24 = [q2; q0; q1; q4] (with q1.o = Del(3, b) and q4.o = Del(3, x)). This log is the result of Canonize().

Step 3. At adm, when q3 is checked against L10 it is rejected, but it is stored in the invalid form q3∗, which has no effect on the local document state. The resulting log is H05 = [q0; q2; q1; q3∗]. When q4 arrives, it is only transformed against q1 and q3∗, as it depends on q2. This results in D06 = “ayc” and H06 = [q0; q2; q1; q3∗; q4] with q4.o = Del(3, x). At s1, the integration of q4 produces D15 = “yc” and H15 = [q2; q0; q1; q3; q4]. Integrating r results in L11 = [r] and v11 = v0 + 1. Enforcing the new policy requires undoing q3, as it is a tentative (not yet validated) request. The inverse of q3, denoted q̄3, is first generated with q̄3.o = Ins(1, a). Next, q̄3 is transformed against q4, giving q̄3′, whose execution results in D16 = “ayc”. Finally, the log is modified to H16 = [q2; q0; q1; q3; q̄3; q4′], where q4′ is the form of q4 as if q3 had not been executed. At s2, the reception of r results in L21 = [r] and v21 = v0 + 1. Request q3 is invalidated (q3.f = invalid) and stored in the log without being executed. This results in H25 = [q2; q0; q1; q4; q3∗].
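The position shifts above (for instance, q1.o = Del(2, b) taking the form Del(3, b) in H13) follow the usual inclusion-transformation case analysis for character-wise insertions and deletions. The following Java sketch is our own illustration of that case analysis, not the paper's ComputeFF/ComputeBF code; the class and method names are invented, and tie-breaking between equal positions (which real OT systems resolve with site identifiers) is deliberately simplified.

final class Op {
    final boolean isIns; // true = Ins(pos, ch), false = Del(pos, ch)
    final int pos;       // 1-based character position, as in the example
    final char ch;
    Op(boolean isIns, int pos, char ch) { this.isIns = isIns; this.pos = pos; this.ch = ch; }
    static Op ins(int p, char c) { return new Op(true, p, c); }
    static Op del(int p, char c) { return new Op(false, p, c); }
    @Override public String toString() { return (isIns ? "Ins(" : "Del(") + pos + ", " + ch + ")"; }
}

final class Transform {
    // Rewrite o1 so that it can be applied after o2 has been executed.
    static Op it(Op o1, Op o2) {
        if (o2.isIns) {                  // o2 inserted a character: shift right if needed
            if (o1.pos < o2.pos) return o1;
            return new Op(o1.isIns, o1.pos + 1, o1.ch);
        } else {                         // o2 deleted a character: shift left if needed
            if (o1.pos <= o2.pos) return o1;
            return new Op(o1.isIns, o1.pos - 1, o1.ch);
        }
    }

    public static void main(String[] args) {
        // Del(2, b) integrated after Ins(2, y); compare q1.o = Del(3, b) in H13.
        System.out.println(it(Op.del(2, 'b'), Op.ins(2, 'y'))); // prints Del(3, b)
    }
}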
6
Implementation and Evaluation
A prototype DCE based on our flexible access control model has been implemented in Java. It supports the collaborative editing of html pages and is deployed on the P2P JXTA platform (see Figure 6). In our prototype, a user can create an html page from scratch by opening a new collaboration group; he is then the administrator of this group. Other users may join the group to participate in editing the html page, and they may leave the group at any time. The administrator can dynamically add and remove authorizations for accessing the shared document, according to the contribution and the competence of the users participating in the group. Using the JXTA platform, users exchange their operations in real time in order to support the WYSIWIS (What You See Is What I See) principle. Experiments are necessary to understand what the asymptotic complexities mean when interactive constraints are present in the system. For our performance evaluation, we consider the following times: (i) t1 is the execution time of Generate Coop Request(); (ii) t2 is the execution time of Receive Coop Request(). We assume that the transmission time between sites is negligible. In general, it is established that an OT-based DCE must provide t1 + t2 < 100 ms [6]. Since both Algorithms 2 and 3 call function Canonize, their performance is mostly determined by the percentage of insertion requests inside
Fig. 6. p2pEdit tool
Fig. 7. Time processing of Insert Requests
the log. The management of the policy may affect the performance of the system since, in Algorithms 2 and 3, we have to explore either the policy or the administrative log, both of which are edited by the administrator. In our experiments we suppose that the policy is not optimized (i.e., it contains authorization redundancies). Figure 7 shows three experiments with different percentages of insertions inside log H. These measurements reflect the times t1, t2 and their sum. The execution time falls within 100 ms for all |H| ≤ 5000 if H contains 0% insertions, and for all |H| ≤ 9000 if H contains 100% insertions, a bound that is not achieved by the SDT and ABT algorithms [6].
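Whether an implementation meets the 100 ms bound is checked empirically. A minimal harness of the kind used for such measurements is sketched below; it is our own scaffold, with the two Runnables standing in for the prototype's actual Generate Coop Request and Receive Coop Request entry points.

public final class OtBenchmark {
    /** Times a single action in milliseconds. */
    static double timeMs(Runnable action) {
        long t0 = System.nanoTime();
        action.run();
        return (System.nanoTime() - t0) / 1e6;
    }

    public static void main(String[] args) {
        // Placeholders: wire these to the editor's real generate/receive calls.
        Runnable generate = () -> { /* Generate_Coop_Request(op) */ };
        Runnable receive  = () -> { /* Receive_Coop_Request(q)   */ };

        double t1 = timeMs(generate);
        double t2 = timeMs(receive);
        System.out.printf("t1=%.3f ms, t2=%.3f ms, within bound: %b%n",
                          t1, t2, t1 + t2 < 100.0);
    }
}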
7
Conclusion
In this paper, we have proposed a new framework for controlling access in collaborative editing work. It is based on optimistic replication of the shared document and its authorization policy. We have shown how naive coordination between updates of the two copies may create security holes. Finally, we have provided some performance evaluations to show the applicability of our MAC model to distributed collaborative editing. In future work, we plan to deal with the optimistic access control of the read right. Then, we intend to investigate the impact of our work when using delegation of
The experiments were performed under Ubuntu Linux (kernel 2.6.24-19) with an Intel Pentium 4 2.60 GHz CPU and 768 MB of RAM.
administrative requests between the group users. As the length of local (administrative and cooperative) logs increases rapidly during collaboration sessions, we plan to address the garbage collection problem.
References

1. Bertino, E., Bettini, C., Ferrari, E., Samarati, P.: A decentralized temporal authorization model. In: SEC, pp. 271–280 (1996)
2. Bullock, A., Benford, S.: An access control framework for multi-user collaborative environments. In: GROUP 1999, pp. 140–149. ACM, New York (1999)
3. Ellis, C.A., Gibbs, S.J.: Concurrency Control in Groupware Systems. In: SIGMOD Conference, vol. 18, pp. 399–407 (1989)
4. Imine, A.: Coordination model for real-time collaborative editors. In: Field, J., Vasconcelos, V.T. (eds.) COORDINATION 2009. LNCS, vol. 5521, pp. 225–246. Springer, Heidelberg (2009)
5. Jaeger, T., Prakash, A.: Requirements of role-based access control for collaborative systems. In: RBAC 1995, p. 16. ACM, New York (1996)
6. Li, D., Li, R.: An operational transformation algorithm and performance evaluation. Computer Supported Cooperative Work 17(5-6), 469–508 (2008)
7. Lushman, B., Cormack, G.V.: Proof of correctness of Ressel's adOPTed algorithm. Information Processing Letters 86(3), 303–310 (2003)
8. Povey, D.: Optimistic security: a new access control paradigm. In: NSPW 1999: Proceedings of the 1999 Workshop on New Security Paradigms, pp. 40–45. ACM, New York (2000)
9. Ressel, M., Nitsche-Ruhland, D., Gunzenhauser, R.: An Integrating, Transformation-Oriented Approach to Concurrency Control and Undo in Group Editors. In: ACM CSCW 1996, Boston, USA, November 1996, pp. 288–297 (1996)
10. Samarati, P., Ammann, P., Jajodia, S.: Maintaining replicated authorizations in distributed database systems. Data Knowl. Eng. 18(1), 55–84 (1996)
11. Sandhu, R.S., Coyne, E.J., Feinstein, H.L., Youman, C.E.: Role-based access control models. Computer 29(2), 38–47 (1996)
12. Shen, H., Dewan, P.: Access control for collaborative environments. In: CSCW 1992, pp. 51–58. ACM, New York (1992)
13. Sun, C., Ellis, C.: Operational transformation in real-time group editors: issues, algorithms, and achievements. In: ACM CSCW 1998, pp. 59–68 (1998)
14. Sun, C., Jia, X., Zhang, Y., Yang, Y., Chen, D.: Achieving Convergence, Causality-preservation and Intention-preservation in real-time Cooperative Editing Systems. ACM Trans. Comput.-Hum. Interact. 5(1), 63–108 (1998)
15. Sun, C., Xia, S., Sun, D., Chen, D., Shen, H., Cai, W.: Transparent adaptation of single-user applications for multi-user real-time collaboration. ACM Trans. Comput.-Hum. Interact. 13(4), 531–582 (2006)
16. Tolone, W., Ahn, G.-J., Pai, T., Hong, S.-P.: Access control in collaborative systems. ACM Comput. Surv. 37(1), 29–41 (2005)
17. Xin, T., Ray, I.: A lattice-based approach for updating access control policies in real-time. Inf. Syst. 32(5), 755–772 (2007)
On the Construction and Verification of Self-modifying Access Control Policies David Power, Mark Slaymaker, and Andrew Simpson Oxford University Computing Laboratory Wolfson Building, Parks Road, Oxford OX1 3QD United Kingdom
Abstract. Typically, access control policies are either static or depend on independently maintained external state to achieve some notion of dynamism. While it is possible to fully verify the properties of static policies, any reference to external state will necessarily limit the scope of such verification. In this paper we explore the feasibility of describing self-modifying policies which contain both rules for granting access and rules for the modification of the policy. Policy level constraints are used to define validity. Using these constraints it becomes possible to verify both the current state of the policy and any possible future states. A working prototype is described which utilises a relational model finder to perform the verification. The prototype is capable of generating instances of failure cases and presenting them via a simple user interface.
1
Introduction
One of the fundamental reasons for using an access control policy is to provide a single point of reference for authorisation decisions. Having a single point of reference simplifies the management of authorisation decisions and allows analysis to be performed independently of the rest of the system. Such policies are particularly beneficial for large systems that may have multiple points of entry and may be maintained by large numbers of developers. In many systems, access control policies make reference to state that is external to the policy. This environmental state may be as simple as a Boolean value or could be significantly more complex, such as the role relationships in a Role Based Access Control (RBAC) system. In such contexts, the maintenance of the environment needs to be carefully managed as it is potentially just as significant as the policy itself. In this paper we explore the feasibility of defining policies that are capable of describing changes to their own state and thus eliminating the reliance on environmental data. Once this has been achieved, it becomes possible to fully analyse potential changes to authorisation decisions. Not all changes to a policy can be made via state changes—sometimes there are changes to the policy rules themselves. We also explore the feasibility of the self-modification of policy rules. The motivation for our work resides in the disconnect between data management legislation and guidelines (written at a high level of abstraction, assuming
context-sensitive decisions and policy updates to be made to reflect changes in state) and policy implementation (at a much lower level of abstraction, requiring manual changes as and when appropriate). We are concerned with the capture, enforcement and verification of such higher-order policies: given the potential for access control policies to change automatically, requirements for correctness increase significantly. Policies following this paradigm have been constructed and deployed within our existing web services framework, sif (see, for example, [1]); our focus in this paper is the theoretical underpinnings of the approach. With the potential to change both the rules and state within a policy it is important to be able to define what is and what is not an appropriate policy. To do this, constraints are defined in terms of both the state and the rules of the policy. Like rules, constraints themselves may change over time, so we also explore the feasibility of self-modification of constraints. To explore these ideas we have written a simple policy language based on relational logic in which state changes are written in a declarative style. To analyse the policy we have written a tool which is capable of request evaluation, constraint checking and scenario analysis. The tool uses the kodkod [2] Java libraries which provide facilities for constraint checking and model finding. We start, in Section 2 by providing a brief overview of previous, related work in this area, as well as the fundamentals of the relational logic that we use.
2
Background
Access control policies are critical components in any security-critical system, and the consideration of means of formally analysing access control policies is an active area of research. For example, one of the early uses of the Alloy modelling language [3] was the modelling and analysis of RBAC systems [4]. The eXtensible Access Control Markup Language (XACML) has also been the subject of formal analysis using the process algebra Communicating Sequential Processes (CSP) [5], the RW language [6] and SAT solvers [7]. The behaviour of access control policies in a dynamic environment has been analysed in [9]. A language that supports dynamic policy updating is described in [10]. The Security Policy Assertion Language (SecPAL) does not support dynamic policy updating but the implications of such changes on a related language are explored in [8]. This paper differs from previous work in that the modifications to the policy are described as part of the policy itself: the dynamic state is part of the policy, as are the rules for the modification of the policy. Further, we describe how to automatically verify the properties of these dynamic policies and have produced a tool to perform the verification. The relational model we use is the same as that used by the Alloy Analyzer; the syntax is also similar. The only type of value supported is a relation, which is a set of tuples (referred to as a tupleset ). The tuples themselves are constructed from atoms which cannot be further divided. Assuming atoms, atom1 and atom2,
<atom1, atom2> is a tuple of arity two and {<atom1, atom2>, <atom2, atom1>} is an arity two tupleset containing two elements.
An expression is a statement which describes a relational value; a formula is a statement which is either true or false. Basic expressions include the names of relations, the names of variables and anonymous relations—which are written as tuplesets. Constant expressions include univ and none, which are the arity one tuplesets containing all atoms and no atoms respectively. The basic set operations union (+), intersection (&) and set difference (-) are all supported. The relational product operator (->) takes two relations of arity n and m and produces a relation of arity n+m by combining all possible pairs of tuples, e.g. {<atom1>} -> {<atom2>} = {<atom1, atom2>}. The composition operator (.) takes two relations of arity n and m and produces a relation of arity n+m-2 by combining tuples where the last atom in the left-hand tuple matches the first atom in the right-hand tuple, with the matching atoms being discarded, e.g. {<atom1, atom2>} . {<atom2, atom3>} = {<atom1, atom3>}. Basic formulae include the constants true and false, and the basic logic operators conjunction (&&), disjunction (||) and negation (!). As relations are the only data type, there is no set membership operator—the subset operator (in) is used instead. There are also multiplicity logic operators, including no and one, stating that relations contain no tuples or exactly one tuple respectively. Variable declarations take the form [name_1: mult_1 exp_1, ...] where name_i is the name of the variable, mult_i is the multiplicity of the values the variable can take and exp_i is an expression describing the tupleset which the variable must be a subset of. The syntax of a quantified expression is (quantifier variables formula) with quantifier being either some (existential) or all (universal), variables being a variable declaration, and formula being a logical formula.
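As a concrete illustration of these two operators, the following self-contained Java sketch models tuplesets as sets of atom lists and implements product and composition exactly as just defined. It is our own illustration, not kodkod's implementation, and the atom names are invented.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class Tuplesets {
    // A tuple is a fixed list of atoms; a tupleset is a set of same-arity tuples.
    static Set<List<String>> product(Set<List<String>> r, Set<List<String>> s) {
        Set<List<String>> out = new LinkedHashSet<>();
        for (List<String> a : r)
            for (List<String> b : s) {
                List<String> t = new ArrayList<>(a);
                t.addAll(b);                                  // arity n + m
                out.add(t);
            }
        return out;
    }

    static Set<List<String>> compose(Set<List<String>> r, Set<List<String>> s) {
        Set<List<String>> out = new LinkedHashSet<>();
        for (List<String> a : r)
            for (List<String> b : s)
                if (a.get(a.size() - 1).equals(b.get(0))) {   // join on the matching atom
                    List<String> t = new ArrayList<>(a.subList(0, a.size() - 1));
                    t.addAll(b.subList(1, b.size()));         // arity n + m - 2
                    out.add(t);
                }
        return out;
    }

    public static void main(String[] args) {
        Set<List<String>> ua = Set.of(List.of("alice", "doctor"));
        Set<List<String>> pa = Set.of(List.of("doctor", "prescribe"));
        System.out.println(compose(ua, pa)); // [[alice, prescribe]]   (UA . PA)
        System.out.println(product(ua, pa)); // one arity-4 tuple      (UA -> PA)
    }
}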
3
Policy Language
In this section we illustrate how the different parts of a policy file may be described. Policies are stored as XML files and conform to an XML schema. Understandably, the XML format that is used is verbose and ill-suited to presentation here; as such, to increase readability, the representation used here is based on the screen representation of the policy. The first part of the policy file is the current state of the policy. Each line represents a different relation and consists of the name of the relation, the arity of the relation and the current value of the relation expressed as a tupleset.

State
User(1) = {}
Role(1) = {}
Perm(1) = {}
UA(2) = {}
RH(2) = {}
PA(2) = {}
The request section of a policy consists of a list of variables which will form a request. Each line represents a single variable and consists of the name of
the variable and an expression from which the variable is drawn. There is a restriction that the expression must have an arity of one, but otherwise they can be arbitrarily complex. There is no explicit type system, but it is possible to use the expressions to loosely type the variables, as shown below.

Request
user : User
perm : Perm
The decision-making logic of the policy is contained in the policy rule, which is a logical formula that can make reference both to the state relations and the request variables.

Rule
(user -> perm) in (UA . PA)
To be able to validate a policy, it is first necessary to define what constitutes a valid policy. The constraint section of the policy contains a logical formula which holds for all valid policies. The formula may contain references to the state relations and also to the policy rule via the formula reference ref[Rule].

Constraint
!(some [user: one User] |
  (some [perm: one Perm] | perm = {<Prescribe>} && ref[Rule]) &&
  (some [perm: one Perm] | perm = {<Administer>} && ref[Rule]))
Changes to policies are made via messages which are defined in a similar way to the request. Each message has a name and a number of variable declarations. To introduce new values into the system there is a special expression which is represented as new. In the XML representation there is no ambiguity between relations, variables and new expressions as they each have their own tags.

Message(AddUser)
user : new

Message(DisjointPermission)
perm1 : Perm
perm2 : Perm
Each message has a handler that describes the change to the policy. There are three kinds of handlers: state-modifying handlers, rule-modifying handlers and constraint-modifying handlers. Each handler section contains the name of the message it responds to, the kind of handler it is, and a formula. For state handlers the formula contains references to the before and after state relations, with the after state relations being distinguished by the prime (’) symbol.

Handler(AddUser) : State
User’ = (User + user) && UA’ = UA && RH’ = RH &&
PA’ = PA && Role’ = Role && Perm’ = Perm
For rule and constraint-modifying handlers, the formula replaces the current rule or constraint formula. Rule-modifying handlers may refer to the current rule using the formula reference ref[Rule], at execution the current rule will
be inserted in place of the reference. Constraint-modifying handlers may refer to the current constraint using the formula reference ref[Constraint]; these references will be replaced by the current constraint at execution. Constraint-modifying handlers may also contain rule references, but these will not be replaced when the handler is executed.

Handler(DisjointPermission) : Constraint
!(some [user: one User] |
  (some [perm: one Perm] | perm = perm1 && ref[Rule]) &&
  (some [perm: one Perm] | perm = perm2 && ref[Rule]))
&& ref[Constraint]
4
Verification
To evaluate a request, the constraint checking capabilities of the kodkod libraries are used. There are a number of steps to this process. First, a universe of atoms needs to be created, which is done by taking all of the atoms that are defined in the policy. Next, relations are defined for the state of the policy and an instance is created by associating the state relations to their tuplesets. The message variables are read from the user interface and all instances of the variables in the rule are replaced with anonymous relation expressions. For example, the request with user={} and perm={} would change the rule formula from Section 3 to ({} -> {}) in (UA . PA). The kodkod libraries do not support anonymous relations, so these are given names of the form &i and added to the instance. In the example given, the &1(1) = {} and &2(1) = {} relations are added to the instance and the formula becomes (&1 -> &2) in (UA . PA). The formula is then evaluated for the defined instance to yield a Boolean result. The process for checking the policy constraint is almost identical—without the need to substitute the request variables into the formula.

The effect of sending a message will depend on what kind of handler responds to it. For rule and constraint-modifying handlers, a new formula is exchanged with the old one: the new formula is created by taking the handler formula and performing two sets of substitutions. The message variables are read from the user interface and all instances of the variables are replaced with anonymous relation expressions. Rule formula references are replaced by the current rule formula for rule-modifying handlers, and constraint formula references are replaced by the current constraint formula for constraint-modifying handlers. For example, the effect of sending a DisjointPermission message with perm1={<Prescribe>} and perm2={<Administer>} with an existing constraint of true would be

!(some [user: one User] |
  (some [perm: one Perm] | perm = {<Prescribe>} && ref[Rule]) &&
  (some [perm: one Perm] | perm = {<Administer>} && ref[Rule]))
&& true
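The request-evaluation step just described can be pictured in kodkod terms roughly as follows. This is our reconstruction from the description above, assuming the kodkod 1.x Java API (Relation, Universe, Bounds, Solver); the atoms and relation values are invented for illustration and are not the paper's actual data.

import java.util.List;
import kodkod.ast.Formula;
import kodkod.ast.Relation;
import kodkod.engine.Solution;
import kodkod.engine.Solver;
import kodkod.instance.Bounds;
import kodkod.instance.TupleFactory;
import kodkod.instance.TupleSet;
import kodkod.instance.Universe;

public final class RequestEval {
    public static void main(String[] args) {
        // Universe built from the policy's atoms (illustrative names).
        Universe u = new Universe(List.of("DrX", "Doctor", "Prescribe"));
        TupleFactory f = u.factory();

        Relation ua   = Relation.binary("UA");
        Relation pa   = Relation.binary("PA");
        Relation user = Relation.unary("&1"); // anonymous relation for a request variable
        Relation perm = Relation.unary("&2");

        TupleSet uaTs = f.noneOf(2);   uaTs.add(f.tuple("DrX", "Doctor"));
        TupleSet paTs = f.noneOf(2);   paTs.add(f.tuple("Doctor", "Prescribe"));
        TupleSet uTs  = f.noneOf(1);   uTs.add(f.tuple("DrX"));
        TupleSet pTs  = f.noneOf(1);   pTs.add(f.tuple("Prescribe"));

        // Exact bounds play the role of the known instance.
        Bounds b = new Bounds(u);
        b.boundExactly(ua, uaTs);
        b.boundExactly(pa, paTs);
        b.boundExactly(user, uTs);
        b.boundExactly(perm, pTs);

        // (&1 -> &2) in (UA . PA)
        Formula rule = user.product(perm).in(ua.join(pa));

        Solution sol = new Solver().solve(rule, b);
        System.out.println("access granted: " + (sol.instance() != null));
    }
}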
For state-modifying handlers, the model finding capabilities of the kodkod libraries are used. Rather than define an instance with known tuplesets for the
relations, bounds are placed on the relations and the libraries attempt to find an instance that makes the formula true. The bounds on a relation take the form of two tuplesets which form a lower and upper bound: the final relation must be a superset of the lower bound and a subset of the upper bound. If the message variables are drawn from new expressions then the user interface prompts the user for a unique atom for each variable. The universe is created using the policy atoms, as before, with the addition of any new atoms that have been input. Two sets of relations are created, one for the current state and one for the new state. The new state relations are identified by being primed—the User relation becoming User’. For the current state relations, the lower and upper bounds are identical, being set to the current value of the relation. For the new state relations, the lower bound is an empty tupleset of the appropriate arity and the upper bound is the set of all possible tuples of the appropriate arity. The handler formula constrains the possible values of the new relations and the libraries use this formula to find a matching instance. For example, the message AddUser with the variable user={} will result in a unique instance being found, with the new relations having the following values:

User’(1) = {}
Role’(1) = {}
Perm’(1) = {}
UA’(2) = {}
RH’(2) = {}
PA’(2) = {}
To describe the effect of changing rules and constraints it is necessary to introduce some new syntax. The constraint and rule of the starting policy are represented as Constraint and Rule respectively; in addition, the formula from the handler in question will be represented as Handler. The variable declaration [name1 : one exp1, name2 : one exp2, ...] associated with the message being handled is referred to as VarDecls. The use of the prime syntax will be extended to formulae, which has the effect of priming all of the relations and formula references within the formula. When the use of multiple primes becomes impractical the alternative syntax ’(n) will be used, so formula’’ can also be represented as formula’(2). When dealing with formula references we will use a substitution syntax formula[form => ref], where the formula form replaces the reference ref in the formula formula. In total we have considered four different ways to verify the validity of the handlers. The first and simplest test is to look at the effect of a single handler starting with the current state: we use the criteria that if the after state does not meet the constraint then an error has occurred. For rule-modifying handlers, neither the state nor the constraint are changed, so the only way the constraint can fail is if it references the new rule. To test this we execute the following, where Handler[Rule => ref[Rule]] is the new rule formula: (some VarDecls | !Constraint[Handler[Rule => ref[Rule]] => ref[Rule]])
The new constraint becomes Handler[Constraint => ref[Constraint]] for constraint-modifying handlers. This is then evaluated in the context of the existing rule.
(some VarDecls | !(Handler[Constraint => ref[Constraint]])[Rule => ref[Rule]])
For state-modifying handlers, two sets of state relations are created: the current state is known and the bounds are set appropriately; the after state is unknown and is effectively unbounded. The handler formula restricts the values of the after state relations and the constraint is tested against the after state. (some VarDecls | Handler) && !Constraint’[Rule’ => ref[Rule’]]
For all three types of handlers, if a failure case is found the message variables are output to the user interface, along with (as appropriate) the new rule, constraint or after state. The following is the output of testing a handler that adds the pair (user -> role) to the UA relation, with a constraint that prevents a user being able to both prescribe and administer. Note that the $ symbol is added to the names of all variables by the libraries.

Variables
$user(1) = {}
$role(1) = {}

New State
User(1) = {}
Role(1) = {}
Perm(1) = {}
UA(2) = {}
RH(2) = {}
PA(2) = {}
The second type of test was to make no assumption about the current state of the policy and instead test all possible states that meet the current constraint. The universe of atoms is constructed from all of the atoms explicitly mentioned in the rule, constraint, message and handler, as well as atoms to represent the input to new expressions and a number of extra atoms. It is generally not possible to calculate the exact number of extra atoms required, so currently the user interface always uses 16. The test formulae are the same as for the first test, with the addition of Constraint[Rule => ref[Rule]] to only allow states that meet the initial constraint. For a state-modifying handler, the test formula is as follows: (some VarDecls | Handler) && !Constraint’[Rule’ => ref[Rule’]] && Constraint[Rule => ref[Rule]]
The result of running the test on the same UA modifying handler as was used for the first test can be seen below. It should be noted that, as Prescribe and Administer are explicitly referenced in the constraint, they have been added to the universe of atoms, along with the extra atoms which have names of the form Ei.

Variables
$user(1) = {<E9>}
$role(1) = {<E7>}
Old State
User(1) = {<E9>}
Role(1) = {<E7>}
Perm(1) = {}
UA(2) = {}
RH(2) = {}
PA(2) = {<E7,Administer><E7,Prescribe>}

New State
User(1) = {<E9>}
Role(1) = {<E7>}
Perm(1) = {}
UA(2) = {<E9,E7>}
RH(2) = {}
PA(2) = {<E7,Administer><E7,Prescribe>}
In the third test, multiple state-modifying handlers are tested over a number of steps. The initial state relations are created and bound to their current tuplesets. For each additional step, a new set of state relations is created with the appropriate number of primes appended to their name. The universe of atoms includes all the atoms referenced by the state, rule, constraint, messages and handlers. In addition, each new expression has a specific new atom associated with it for each step. To reduce the execution time, the bounds placed on the state relations only include the new atoms for the current and previous steps. In the following, the formula (some VarDecls_i | Handler_i’(j-1)) represents the successful execution of the ith handler in the jth step. In each step, at least one handler must succeed and the constraint must not hold after the last step. It is possible that the constraint fails after an earlier step, but it is envisaged that the number of steps would be increased starting at 1.

((some VarDecls_1 | Handler_1) ||
 (some VarDecls_2 | Handler_2) || ... ) &&
...
((some VarDecls_1 | Handler_1’(n-1)) ||
 (some VarDecls_2 | Handler_2’(n-1)) || ... ) &&
!Constraint’(n)[Rule’(n) => ref[Rule’(n)]]
Due to the nested nature of the variable declarations, their values are not accessible after evaluation; however, the values of the relations at each step are known. Using these before and after states, the exact path of execution is calculated by evaluating (some VarDecls i | Handler i) for each handler separately at each step. As these are simple evaluations, calculating the path is much quicker than the main step. The final test is a version of the third test that also includes rule and constraint-modifying handlers. As the rule and constraint-modifying handlers result in formula changes, it is not possible to run the test in a single step. Instead, a depth-first search is made with each possible path being explored. The final evaluation is of the following form: (some Decl(n) | Combined(n) && !Constraint(n)[Rule(n) => ref[Rule’(n)]])
Here, Decl(n) is the variable declaration for all of the handlers, Combined(n) is the combined formula for the state relations, Constraint(n) is the final constraint, and Rule(n) is the final rule. The parts of the final evaluation are built up recursively, starting with an empty declaration, a combined formula which is true, and the original rule and constraint. If a step involves a rule handler, then the updated values of the four parts are as follows:

Rule(n) = Handler’(n)[Rule(n-1)’ => ref[Rule’(n)]]
Constraint(n) = Constraint(n-1)’
Combined(n) = Combined(n-1) && State’(n) = State’(n-1)
Decl(n) = Decl(n-1) + VarDecls
Here, State’(n) = State’(n-1) is shorthand for the formula which states rel_i’(n) = rel_i’(n-1) for all state relations rel_i. The + symbol is overloaded to combine variable declarations. As the same variable names may be used in multiple steps, the variable names of each handler are made unique by adding &n to their names. Constraint-modifying handlers are dealt with similarly:

Rule(n) = Rule(n-1)’
Constraint(n) = Handler’(n)[Constraint(n-1)’ => ref[Constraint’(n)]]
Combined(n) = Combined(n-1) && State’(n) = State’(n-1)
Decl(n) = Decl(n-1) + VarDecls
The formula of a state-modifying handler will already contain references to primed state relations; for that reason it needs to be primed n-1, not n, times.

Rule(n) = Rule(n-1)’
Constraint(n) = Constraint(n-1)’
Combined(n) = Combined(n-1) && Handler’(n-1)
Decl(n) = Decl(n-1) + VarDecls
When a path is found for which there exists a valid instance, the process stops and the values of the variables and state relations are output to the user interface. As the variable names have been made unique, they can be paired up with the step in which they were introduced. In the next section we will look at examples of the output of the four tests for two different versions of a simple RBAC system.
5
Examples
In this section we will use an example policy to demonstrate the different types of verification. The policy utilises a version of RBAC described in [11]. There are three arity one relations: User, Role and Perm which represent users, roles and permissions respectively. We will use these as pseudo-types throughout the policy. There are also three arity two relations: UA, RH and PA. UA represents the relationship between users and roles, PA represents the relationship between roles and permissions, and RH represents the role hierarchy, with, for a given pair of roles, the first role being senior to the second role.
State
User(1) = {}
Role(1) = {}
Perm(1) = {}
UA(2) = {}
RH(2) = {}
PA(2) = {}
The request involves a user drawn from User and a permission drawn from Perm. The rule states that the user/permission pair is a member of the composition of UA and PA, and, as such, the role hierarchy RH is ignored. Request user : User perm : Perm Rule (user -> perm) in (UA . PA)
The constraint is used to maintain the integrity of the pseudo-types; it states that User, Role and Perm are disjoint, and that UA, RH and PA are subsets of relational products of the pseudo-types. Constraint (User & Role) = none && (Role & Perm) = none && (Perm & User) = none && UA in (User -> Role) && RH in (Role -> Role) && PA in (Role -> Perm)
For each of the state relations there are add and subtract messages. For User, Role and Perm adding introduces new atoms; for UA, RH and PA adding involves existing atoms. The message and handler for the AddUser are shown below. Message(AddUser) user : new Handler(AddUser) : State User’ = (User + user) && Role’ = Role && Perm’ = Perm && UA’ = UA && RH’ = RH && PA’ = PA
The handler for SubUser is defined in a similar manner. Handler(SubUser) : State User’ = (User - user) && Role’ = Role && Perm’ = Perm && UA’ = UA && RH’ = RH && PA’ = PA
However, this can lead to problems. For example, running the first test on SubUser finds the following failure instance.

Variables
$user(1) = {}

New State
User(1) = {}
Role(1) = {}
Perm(1) = {}
UA(2) = {}
RH(2) = {}
PA(2) = {}
The problem here is that UA still contains a (user, role) pair for the removed user, which is no longer a member of User -> Role. Once the error has been found, it is a simple matter to modify the handler formula to remove such tuples.

Handler(SubUser) : State
User’ = (User - user) && Role’ = Role && Perm’ = Perm &&
UA’ = (UA - (user -> univ)) && RH’ = RH && PA’ = PA
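The repaired handler subtracts every UA pair whose first column is the removed user. The same cascade, written against plain Java sets (our illustration, with invented atom names):

import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public final class SubUserDemo {
    public static void main(String[] args) {
        Set<String> user = new LinkedHashSet<>(Set.of("u1", "u2"));
        Set<Map.Entry<String, String>> ua = new LinkedHashSet<>();
        ua.add(Map.entry("u1", "doctor"));
        ua.add(Map.entry("u2", "nurse"));

        String removed = "u1";
        user.remove(removed);                          // User' = User - user
        ua.removeIf(p -> p.getKey().equals(removed));  // UA' = UA - (user -> univ)

        // Without the removeIf line, <u1, doctor> would remain in UA,
        // violating the typing constraint UA in (User -> Role).
        System.out.println(user + " " + ua);
    }
}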
There are two rule handlers that allow the role hierarchy to be turned on or off. With the role hierarchy in operation, users with senior roles gain the permissions of junior roles. With the relations in the initial state, all users with the doctor role gain the permissions of the nurse role. The unary prefix operator (*) creates the reflexive transitive closure of a relation of arity two.

Handler(ResetSimple) : Rule
(user -> perm) in (UA . PA)

Handler(ResetHierarchy) : Rule
(user -> perm) in (UA . *RH . PA)
There are also a number of rule-modifying handlers that explicitly give or remove permissions from users, effectively ignoring their roles. (It should be noted that this is not part of RBAC and this functionality is only included as an example of what is possible within this paradigm.) These handlers contain rule formula references. All four messages contain a variable user1 which refers to the user in question; in addition Allow and Block have a variable perm1 which refers to the permission in question. Handler(AllowAll) : Rule (user = user1) || ref[Rule] Handler(BlockAll) : Rule !(user = user1) && ref[Rule] Handler(Allow) : Rule (user = user1 && perm = perm1) || ref[Rule] Handler(Block) : Rule !(user = user1 && perm = perm1) && ref[Rule]
These handlers all produce rules of increasing size; the effect of calling Allow followed by Block is not to revert to the original rule but instead to produce the following, assuming that the user in question was Nurse Y and the permission was Prescribe.

!(user = {<NurseY>} && perm = {<Prescribe>}) &&
((user = {<NurseY>} && perm = {<Prescribe>}) || (user -> perm) in (UA . PA))
There are three constraint handlers: ResetConstraint, which resets the constraint; DisjointRole, which prevents a user simultaneously holding two specific roles; and DisjointPermission, which prevents a user simultaneously having two specific permissions. The handlers for DisjointRole, which uses the variables role1 and role2, and DisjointPermission, which uses the variables perm1 and perm2, are shown below. Duplicate roles or permissions are ignored to avoid finding trivial failure cases.

Handler(DisjointRole) : Constraint
(role1 = role2 ||
 !(some [user: one User] | (role1 + role2) in (user . UA)))
&& ref[Constraint]

Handler(DisjointPermission) : Constraint
(perm1 = perm2 ||
 !(some [user: one User] |
   (some [perm: one Perm] | (perm = perm1) && ref[Rule]) &&
   (some [perm: one Perm] | (perm = perm2) && ref[Rule])))
&& ref[Constraint]
Running the fourth test on the policy described with a depth of one does not find any failure instances; however, setting the depth to two yields one of many possible failure cases. In this case, Dr X is given the role of Nurse and then the constraint is changed to exclude the possibility of a user being both a Doctor and a Nurse.

Handler(AddUA) : State
$user&1(1) = {<DrX>}
$role&1(1) = {<Nurse>}

Handler(DisjointRole) : Constraint
$role1&2(1) = {<Doctor>}
$role2&2(1) = {<Nurse>}
It is possible to rewrite the example policy to give the same functionality while only using state-modifying handlers. To do this, a number of extra state relations are required. To replace the rule-modifying handlers, six additional relations have been added. The AllowedAll and BlockedAll relations are used to keep track of the users that have been given all or no permissions. Similarly, Allowed and Blocked keep track of the permissions that have been explicitly given to or taken from a user. A new pseudo-type, Rule, is used to identify alternative rules, with the current one being stored as CurrentRule.

AllowedAll(1) = {}
BlockedAll(1) = {}
Allowed(2) = {}
Blocked(2) = {}
Rule(1) = {<Rule1>, <Rule2>}
CurrentRule(1) = {<Rule1>}
As there are now no rule-modifying handlers, the rule is fixed but must encapsulate the alternative rules. The rule identified as Rule1 is the original rule; the
rule identified as Rule2 takes account of role hierarchies. The permissions that have been explicitly given to or taken from specific users are handled as part of the same fixed rule:

Rule
!(user in BlockedAll) && !((user -> perm) in Blocked) &&
(user in AllowedAll || (user -> perm) in Allowed ||
 (CurrentRule = {<Rule1>} && (user -> perm) in (UA . PA)) ||
 (CurrentRule = {<Rule2>} && (user -> perm) in (UA . *RH . PA)))
To be able to have a single constraint, the disjoint roles and permissions are also represented by state relations: DisjointedRoles(2) = {} DisjointedPermissions(2) = {}
The constraint is significantly more complex: there are extra typing constraints for each of the new relations; there is the constraint that there must be exactly one rule in operation; and there are the disjoint role and permission constraints. Constraint (User & Role) = none && (Role & Perm) = none && (Perm & User) = none && UA in (User -> Role) && RH in (Role -> Role) && PA in (Role -> Perm) && AllowedAll in User && BlockedAll in User && Allowed in (User -> Perm) && Blocked in (User -> Perm) && CurrentRule in Rule && one CurrentRule && (Rule & (User + Role + Perm)) = none && !(some [role1: one Role, role2: one Role, user: one User] | (role1 -> role2) in DisjointedRoles && !(role1 = role2) && (role1 + role2) in (user . UA)) && !(some [perm1: one Perm, perm2: one Perm, user: one User] | (perm1 -> perm2) in DisjointedPermissions && !(perm1 = perm2) && (some [perm: one perm1] | ref[Rule]) && (some [perm: one perm2] | ref[Rule]))
The state-modifying handlers all need additional clauses to preserve the values of the new state relations, but are otherwise unchanged. The rule-modifying handlers become state-modifying handlers, which modify the values of the six new rule-related state relations. For example, ResetHierarchy now modifies the CurrentRule relation.

Handler(ResetHierarchy) : State
UA’ = UA && RH’ = RH && PA’ = PA && User’ = User &&
Role’ = Role && Perm’ = Perm && Allowed’ = Allowed &&
Blocked’ = Blocked && AllowedAll’ = AllowedAll &&
BlockedAll’ = BlockedAll && Rule’ = Rule &&
CurrentRule’ = {<Rule2>} &&
DisjointedRoles’ = DisjointedRoles &&
DisjointedPermissions’ = DisjointedPermissions
Likewise, the constraint-modifying handlers become state-modifying handlers which modify the values of the two new constraint related state relations. For example, DisjointPermission now modifies the DisjointPermissions relation.
Handler(DisjointPermission) : State
UA’ = UA && RH’ = RH && PA’ = PA && User’ = User &&
Role’ = Role && Perm’ = Perm && Allowed’ = Allowed &&
Blocked’ = Blocked && AllowedAll’ = AllowedAll &&
BlockedAll’ = BlockedAll && Rule’ = Rule &&
CurrentRule’ = CurrentRule &&
DisjointedRoles’ = DisjointedRoles &&
DisjointedPermissions’ = (DisjointedPermissions + (perm1 -> perm2))
As all the handlers now modify the state, it is possible to test the whole policy with the third test—which is significantly more efficient than the fourth test which was used for the original policy. Again it takes two steps to find a failure instance, as with the original policy; however, a different failure is found, involving the ResetHierarchy handler, which enables doctors to administer, followed by the DisjointPermission handler with variables of Administer and Prescribe.

Handler(ResetHierarchy) : State

Handler(DisjointPermission) : State
$perm1(1) = {<Administer>}
$perm2(1) = {<Prescribe>}
6
Discussion
We have described a policy language and an associated tool that allow the capture and automatic verification of self-modifying access control policies. The motivation behind the capture of such policies is to raise the level of abstraction, and to allow policies to evolve on the basis of environmental changes. The policy language uses a relational data model and a relational model finder is used to find instances of failures. Four different types of verification were described with correctness defined by a policy constraint. Two example policies, based on a simple RBAC system, were described. The second example used a subset of the language which allowed for more efficient verification. In three of the four tests, the number of atoms required can be calculated exactly. In the second test, the exact number of atoms required is not known, leading to uncertainty if no failing instance can be found. The fourth test requires a number of calculations that is exponential in the number of steps being tested, making it less attractive than the third test. The explicit modification of rules and constraints gives rise to new types of policies to be written, which may be closer to the level at which guidelines derived from legislation may be written. If the handler formula contains a reference to the previous rule or constraint the size of the formula can grow. By moving some aspects of rule and constraint modification into the policy state, formula growth can be avoided. The other disadvantage associated with the rule- and constraint-modifying handlers is the increased complexity of verification. The disadvantage of moving all modifications into the policy state is that there are more state relations to maintain. As state-modifying handlers only
place constraints on possible after states, each state relation has to be explicitly constrained to avoid non-determinism. A move to explicit assignment of state relations would allow the assumption that relations have not changed unless stated otherwise; this would have the potential to decrease the size of statemodifying handlers dramatically. There are two main avenues of further work. First, the current language only supports one data type, requiring any higher order type, such as a set of sets, to be constructed via the composition of two or more relations. As the libraries we are using are the same ones that are used by the Alloy Analyzer, it should, in theory, be possible to support the Alloy type system. Our second avenue of future work pertains to policy construction. Currently, the policy language is not suitable for general use by the typical policy writer. To be more widely applicable, an appropriate interface that is capable of translating higher order concepts into logical statements will need to be developed.
References

1. Slaymaker, M.A., Power, D.J., Russell, D., Simpson, A.C.: On the facilitation of fine-grained access to distributed healthcare data. In: Jonker, W., Petković, M. (eds.) SDM 2008. LNCS, vol. 5159, pp. 169–184. Springer, Heidelberg (2008)
2. Torlak, E., Jackson, D.: Kodkod: A relational model finder. In: Grumberg, O., Huth, M. (eds.) TACAS 2007. LNCS, vol. 4424, pp. 632–647. Springer, Heidelberg (2007)
3. Jackson, D.: Alloy: a lightweight object modelling notation. ACM Transactions on Software Engineering and Methodology 11, 256–290 (2002)
4. Zao, J., Wee, H., Chu, J., Jackson, D.: RBAC schema verification using lightweight formal model and constraint analysis. In: Proceedings of the 8th ACM Symposium on Access Control Models and Technologies, SACMAT (2003)
5. Bryans, J.: Reasoning about XACML policies using CSP. In: Proceedings of the 2005 Workshop on Secure Web Services, pp. 28–35 (2005)
6. Zhang, N., Guelev, D.P., Ryan, M.: Synthesising verified access control systems through model checking. Journal of Computer Security 16, 1–61 (2007)
7. Hughes, G., Bultan, T.: Automated verification of access control policies using a SAT solver. International Journal on Software Tools for Technology Transfer (STTT) 10, 503–520 (2008)
8. Becker, M.Y., Nanz, S.: A logic for state-modifying authorization policies. In: Biskup, J., López, J. (eds.) ESORICS 2007. LNCS, vol. 4734, pp. 203–218. Springer, Heidelberg (2007)
9. Dougherty, D.J., Fisler, K., Krishnamurthi, S.: Specifying and reasoning about dynamic access-control policies. In: Furbach, U., Shankar, N. (eds.) IJCAR 2006. LNCS (LNAI), vol. 4130, pp. 632–646. Springer, Heidelberg (2006), doi:10.1007/11814771
10. Crescini, V.F., Zhang, Y.: PolicyUpdater: a system for dynamic access control. International Journal of Information Security 5, 145–165 (2006)
11. Power, D.J., Slaymaker, M.A., Simpson, A.C.: On formalizing and normalizing role-based access control systems. The Computer Journal (2008), doi:10.1093/comjnl/bxn016
Controlling Access to XML Documents over XML Native and Relational Databases Lazaros Koromilas, George Chinis, Irini Fundulaki, and Sotiris Ioannidis FORTH-ICS, Greece {koromil,gchinis,fundul,sotiris}@ics.forth.gr
Abstract. In this paper we investigate the feasibility and efficiency of mapping XML data and access control policies onto relational and native XML databases for storage and querying. We developed a re-annotation algorithm that computes the XPath query which designates the XML nodes to be re-annotated when an update operation occurs. The algorithm uses XPath static analysis and our experimental results show that our re-annotation solution is on the average 7 times faster than annotating the entire document. Keywords: XML, access control, XML to relational mapping.
1
Introduction
XML has become an extremely popular format for publishing, exchanging and sharing data by users on the Web. Often this data is sensitive in nature and therefore it is necessary to ensure selective access, based on access control policies. For this purpose flexible access control frameworks must be built that permit secure XML querying while at the same time respecting the access control policies. Furthermore such a framework must be efficient enough to scale with the number of documents, users, and queries. In this paper we study how to control access to XML documents stored in a relational database and in a native XML store. Prior work proposed the use of RDBMS for storing and querying XML documents [1], to combine the flexibility and the usability of XML with the efficiency and the robustness of a relational schema. In this paper we examine the feasibility and efficiency of using the above approach to enforce access control policies. In particular, we study how to control access on XML documents following the materialized approach, in which the XML document is stored in a database along with annotations attached to the nodes; these specify whether a node is accessible or not. We evaluate our approach using (i) a native XML storage system and (ii) a relational database where the XML documents are shredded à la ShreX [8]. Specifically we:
– propose a method to annotate XML documents stored in a relational database and in an XML database;
– discuss an optimization procedure based on XPath containment that removes redundant access control rules from a policy;
– develop a re-annotation technique that allows us to re-compute the annotations of a portion of the nodes in an XML document if a document update occurs; and finally
– we discuss results of extensive experiments that compare annotation and re-annotation techniques for the relational and the XML cases.
This is the first attempt to compare the use of relational and XML databases to store annotated (with accessibility information) XML documents. Annotation-based enforcement techniques have been considered in [3, 7] for rule-based policies. More sophisticated techniques for storing and querying annotations have been investigated [26, 27]. The related problem of optimizing security checks during query evaluation with respect to an annotated document was investigated in [5]. XML access control over relational databases has also been studied in [23]. Our work is different in that we use annotations (materialized approach), whereas Lee et al. check the accessibility of the document on-the-fly. [20] discusses a “function-based” model that translates policy rules to functions (e.g. Java methods) which are subsequently called to check the policy whenever a part of the document is accessed. Security views [10, 16] address the problem of information leaks in the presence of read-only security policies and queries. Security views contain just the information a user is allowed to read; queries to the view can be translated efficiently to queries on the underlying data, foregoing expensive view materialization and maintenance. However, previous work on annotation-based security policies, such as compressed accessibility maps, does not address the problem of keeping the annotations consistent with the policy when the document or policy changes. These techniques have not yet been used directly to provide access control in an XML database system; it appears that doing so would require modifying the database system internals.
1.1
Motivating Example
Before we formally discuss our approach, we present an example from the medical domain. Consider the XML DTD of Figure 1 that is used to represent information for hospitals, their departments, staff and patients. We choose a node and edge labeled graph representation for the XML DTD where nodes in the graph are the element types of the XML DTD, and edges represent the content models of an element type (sequence, choice). Dashed arrows connecting a node with its children nodes capture the choice, whereas straight lines capture the sequence content model. Edges are labeled with *, + and ? to denote the occurrence indicators in the XML DTD (“zero or more”, “one or more” and “optional” respectively). In the graph, a valid hospital instance of this schema contains one or more departments (dept+). Each department holds information about patients (patients) and its staff (staffinfo). There may be zero or more patients (patient*) and zero or more staff members (staff*).
Table 1. Hospital policy rules
Rule  Resource                      Effect
R1    //patient                     +
R2    //patient/name                +
R3    //patient[treatment]          −
R4    //patient[treatment]/name     +
R5    //patient[.//experimental]    −
R6    //regular                     +
R7    //regular[med=“celecoxib”]    +
R8    //regular[bill > 1000]        +
Fig. 1. Hospital schema
A patient has an identifier (psn), a registered name (name) and an optional treatment (treatment?). The treatment may be either conventional (regular?) or experimental (experimental?); it can also be unspecified (an empty element). Regular treatments have a medication (med) and a bill (bill), whereas experimental treatments are associated with a medical exam (test) and a bill (bill). Staff members are doctors (doctor) or nurses (nurse). In either case they have an identifier, a name and a phone number (sid, name and phone respectively). A sample partial instance of the hospital schema is presented in Figure 2. For the sake of simplicity we focus on the patients element of a department and show
Fig. 2. Partial hospital document
name (+) "joy smith"
Controlling Access to XML Documents
125
three different patients. We will be using this document together with the access control rules of Table 1 in the examples in the remainder of this paper. Table 1 shows the access control rules specified for the hospital XML DTD. Each rule has the form (resource, effect) where resource is an XPath expression that designates the nodes in the XML document concerned by the rule and effect specifies whether the node is accessible (sign “+”) or inaccessible (sign “−”). Rule R1 says that all patient nodes are accessible whereas rule R3 specifies that patient nodes that have a treatment are not. Rules R4 and R2 specify that the names of patients that have a treatment and patients in general are accessible. Patients under experimental treatment are not accessible according to rule R5 . Rule R6 gives access to all regular treatment nodes; in addition rules R7 and R8 are more specific and specify that regular treatment nodes that have a medication (med) with value “celecoxib” or a bill (bill) with a value greater than 1000 respectively are accessible. This set of rules is associated with a conflict resolution policy and default semantics [15, 17, 14]. The former specify the accessibility of a node in the case in which it is in the scope of access control rules with opposite signs. The later determines the default accessibility of a node. In our example we consider that the conflict resolution policy is deny overrides (the rule that denies access to a node overrides the one that grants access to it) and the default semantics is deny (nodes are inaccessible by default). We say that an XML node is in the scope of an access control rule, if it is in the result of the evaluation of the resource (i.e., XPath expression) part of the rule on the XML document. Figure 2 shows the annotated XML document where annotations designate whether a node is accessible (label “+”) or not (label “−”). Note that the elements for which no access control rule is specified are annotated with “−” (denied access by default). The first and second patient elements are not accessible: both elements have a treatment subelement and according to rule R3 are not accessible (note that R3 overrides R1 due to the conflict resolution policy). On the other hand, the third patient is accessible, since it is in the scope of rule R1 and not in the scope of either R3 or R5 .
2 2.1
Preliminaries XML Trees
We model XML documents as rooted unordered trees with labels from the set Σ ∪ D ∪ {∗}. Σ is a finite set of element names, D a data domain and ∗ is the wildcard (matches any label). We represent an XML document as a tree T = (VT , ET , RT , λT ), where (i) VT is the set of nodes in T , (ii) ET ⊆ VT × VT is the set of edges, (iii) λT : VT → Σ ∪ D maps nodes to element names from Σ and values in D and iv) RT is a distinguished node in VT , called the root node. 2.2
XPath
The fragment of XPath that we will be using in queries and access control rules is defined as follows:
126
L. Koromilas et al.
Paths p ::= axis :: ntst | p[q] | p/p Qualifiers q ::= p | q and q | p = d Axes axis ::= child | descendant Node Test ntst ::= l | ∗ where l is an element label from Σ, and d a value in D. The expressions are built using only the child and descendant axes of XPath and conditions which test for the existence of sub-elements or constants in the subtree of an element. We use the standard abbreviated form of XPath expressions. For example, /a//b[∗] is an abbreviation of /child::a/descendant-or-self::node()/child::b[child::∗] . For p an absolute XPath expression (a path expression starting with “/”), and T an XML tree, we write [[p]](T ) to denote the set of nodes of T obtained from evaluating expression p on the root node of T . The semantics of XPath expressions are defined in [2, 12, 25]. We say that an XPath expression p is contained in another expression q (denoted by p q), if for every XML tree T , [[p]](T ) ⊆ [[q]](T ). We say that two XPath expressions are disjoint (denoted by p ◦◦ q) if their intersection is empty. That is, for every T , [[p]](T ) ∩ [[q]](T ) = ∅. Otherwise, we say p and q overlap (denoted by p ◦◦ q).
3 Access Control Framework
An XML access control policy is defined by a set of rules that specify who has access to which data. We adopt the more or less agreed-upon definition of an access control rule as a tuple of the form (requester, resource, action, effect, propagation) where:
– requester refers to the user or a set of users concerned by the authorization;
– resource refers to the data that the requester is (or is not) authorized to access;
– action refers to the action (read, write, modify, delete, etc.) that the requester is (or is not) allowed to perform on the resource;
– effect specifies whether the rule grants ("+" sign) or denies ("−" sign) access to the resource; and finally
– propagation (the scope) defines whether the rule applies to the node only, or to its subtree [11].
In this paper we assume that the requester and action parameters are fixed and concentrate on the resource and effect components. We define the scope of a rule to be the XML node itself (explicit rules). Implicit rules are not considered here (no accessibility inheritance). We will refer to the access control rules that grant access to a node (effect = "+") as positive and to those that deny access to it (effect = "−") as negative. For simplicity we define an access control rule R to be a tuple of the form R = (resource, effect) with resource an XPath expression in the fragment discussed
in Section 2 and effect ∈ {+, −}. We say that a node n in the XML tree T is in the scope of an access control rule r = (resource, effect) if n ∈ [[resource]](T). We define an access control policy P to be a tuple of the form P = (ds, cr, A, D) where:
– ds is the default semantics, ds ∈ {+, −},
– cr is the conflict resolution policy, cr ∈ {+, −},
– A is the set of positive access control rules, and
– D is the set of negative rules.
As previously discussed, conflict resolution specifies the accessibility of a node in the case in which it is in the scope of access control rules with opposite signs. Default semantics determine whether a node in the XML tree is accessible or inaccessible by default. Intuitively, an access control policy P restricts the set of nodes of an XML tree T returned as the answer to the user query. The semantics of an access control policy P for an XML tree T is the set of accessible nodes of T. We denote with [[P]](T) the semantics of a policy P for an XML tree T. Table 2 defines the semantics of a policy P = (ds, cr, A, D), where U(T), [[D]](T) and [[A]](T) are the nodes of tree T, the nodes that are in the scope of some negative rule, and the nodes in the scope of some positive rule of policy P, respectively.

Table 2. Semantics of an access control policy P = (ds, cr, A, D)

[[(+, +, A, D)]](T) = U(T) − ([[D]](T) − [[A]](T))
[[(−, +, A, D)]](T) = [[A]](T)
[[(+, −, A, D)]](T) = U(T) − [[D]](T)
[[(−, −, A, D)]](T) = [[A]](T) − [[D]](T)
In the case in which the default semantics is allow and the conflict resolution is allow overrides, the accessible nodes are all the nodes in T except those that are in the scope of a negative rule and not in the scope of a positive rule (U(T) − ([[D]](T) − [[A]](T))). In the case in which the default semantics is deny and the conflict resolution policy is allow overrides, the accessible nodes are exactly those that are in the scope of some positive access control rule ([[A]](T)). On the other hand, if the default semantics is allow and the conflict resolution is deny overrides, the accessible nodes are all the XML nodes in T except those that are in the scope of some negative rule (U(T) − [[D]](T)). Finally, in the case that occurs most often in practice, where the conflict resolution policy is deny overrides and the default semantics is deny, the accessible nodes are those that are in the scope of some positive rule except those that are also in the scope of some negative rule ([[A]](T) − [[D]](T)).
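As an illustration (not from the paper), the four cases of Table 2 can be rendered as a few lines of Python over node-id sets; the example nodes below are arbitrary:

def accessible_nodes(ds, cr, scope_A, scope_D, all_nodes):
    """Return [[P]](T) for policy P = (ds, cr, A, D).

    ds, cr    -- default semantics / conflict resolution, '+' or '-'
    scope_A   -- union of the scopes of the positive rules, [[A]](T)
    scope_D   -- union of the scopes of the negative rules, [[D]](T)
    all_nodes -- U(T), every node of the tree
    """
    if ds == '+' and cr == '+':
        return all_nodes - (scope_D - scope_A)
    if ds == '-' and cr == '+':
        return scope_A
    if ds == '+' and cr == '-':
        return all_nodes - scope_D
    return scope_A - scope_D          # ds == '-' and cr == '-'

# Example: deny default, deny overrides (the common case).
U = {'patients', 'p1', 'p2', 'name1', 'name2', 'treat1'}
A = {'p1', 'p2', 'name1', 'name2'}    # scope of positive rules
D = {'p1'}                            # scope of negative rules
assert accessible_nodes('-', '-', A, D, U) == {'p2', 'name1', 'name2'}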
4 System Architecture
We now present the architecture of the access control system and describe the functionality of its key components. The core of the system comprises the optimizer, annotator, reannotator and requester modules.
Fig. 3. System components
Module optimizer, shown in Figure 3, is responsible for detecting and removing the redundant access control rules from the access control policy. We discuss this idea in detail in Section 5.1. The annotator module is responsible for computing the queries used to annotate the XML document with accessibility information. More specifically, it takes as input an XML access control policy P and computes the SQL and XQuery queries that will be used to annotate with accessibility information the relational representation of the XML document and the XML document stored in the native XML store, respectively. These queries implement the semantics of an access control policy as presented in Section 3 and will be discussed in detail in Section 5.2. The reannotator module is responsible for computing the SQL and XQuery queries to re-annotate the already annotated XML document when a document update has occurred. The idea is that when an update occurs (i.e., a node is deleted from or inserted in the document), the accessibility of a node might change: for instance, if the treatment element of a patient element is deleted, then the latter becomes accessible (Table 1, rules R1 and R3). In this case, we need to consider re-annotating only the patient elements in the XML document. The re-annotation algorithm is discussed in detail in Section 5.3. The requester module is the front-end of our system. A user request is sent by the requester to the relational and native XML stores for evaluation and, depending on the result of the evaluation, it either returns the requested data or denies the user request. To store an XML document in a relational database, we first need to create the relational tables used to store the XML document and, in a second phase, produce a relational representation of the XML document using these tables. We employ ShreX [8, 1] to obtain the relational representation of XML documents. ShreX is a system that handles the translation of XML data into relational tables. This includes relational schema creation, document loading and database querying. It takes as input an XML Schema [9, 24, 4] and produces a mapping to create relational tables, so that XML documents that are valid instances of the given XML Schema can be stored. ShreX is also responsible for translating XPath [6] queries to SQL queries, which are then evaluated on the relational representation of the XML document.
In our system we control access to read-only queries expressed as XPath expressions. We follow an all-or-nothing semantics for query answering: if all the nodes requested by the XPath expression are accessible (i.e., annotated with “+” sign), then we return the requested nodes. Otherwise, we deny access to the user request.
5 Controlling Access to XML Documents
In this section we discuss in detail our approach to controlling access to XML documents stored in a relational and a native XML store.

5.1 Access Control Policy Optimization
The first step is to remove redundant rules from the access control policy. This redundancy elimination is performed by the optimizer module presented in Section 4. Given a policy P, we say that an access control rule R is redundant if there exists some access control rule R′ such that
1. R, R′ are both either positive (in A) or negative (in D) and
2. R is contained in R′.
We say that an access control rule R = (resource, effect) is contained in a rule R′ = (resource′, effect) iff resource is contained in resource′. An XPath expression p is contained in an XPath expression p′ (p ⊑ p′) iff the set of nodes obtained by evaluating p on any XML tree T is a subset of the set of nodes obtained by evaluating p′ on T [18, 22, 19]. Algorithm Redundancy-Elimination shown in Figure 4 takes as input a set of access control rules S and returns a subset S′ of S which is free of redundant rules. The idea is the following: redundancy elimination is performed separately for the sets of positive and negative rules (A and D). The resulting redundancy-free sets of rules are combined to obtain a revised policy. We employ the containment algorithm of [18, 13]. Containment for fragments of XPath such as XP(/, //, ∗, []) has been studied in [18] and for larger fragments in [19] (see [22] for a survey). For our motivating example, the redundancy-free access control policy is shown in Table 3. Rule R4 is removed because //patient[treatment]/name is contained in //patient/name (it is contained in R2). Similarly, rules R7, R8 are contained in R6. Rule R3 is contained in R1; however, it is not eliminated because the two have different effects.

Table 3. Redundancy-free policy

Rule  Resource                     Effect
R1    //patient                    +
R2    //patient/name               +
R3    //patient[treatment]         −
R5    //patient[.//experimental]   −
R6    //regular                    +
Redundancy-Elimination(rules)
Ensure: ∀r1 ≠ r2 ∈ rules: r1 ⋢ r2
1: for all r ∈ rules do
2:   for all r′ ∈ rules where r′ ≠ r do
3:     if r ⊑ r′ then
4:       rules ← rules − {r}
5:     else if r′ ⊑ r then
6:       rules ← rules − {r′}
7:     else
8:       {neither contains the other; do nothing}
9:     end if
10:  end for
11: end for
12: return rules

Fig. 4. Eliminating redundant access control rules
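A small Python rendering of Fig. 4, illustrative only: the containment test is assumed to be supplied externally (e.g. by the checker of [13]); here a trivial equality test stands in for it in the usage example:

def redundancy_elimination(rules, contains):
    """Drop every rule whose resource is contained in another rule of the
    same sign. `rules` is a list of (resource, effect) pairs;
    contains(p, q) answers whether XPath q is contained in XPath p."""
    kept = list(rules)
    i = 0
    while i < len(kept):
        res, eff = kept[i]
        redundant = any(
            j != i and kept[j][1] == eff and contains(kept[j][0], res)
            for j in range(len(kept))
        )
        if redundant:
            del kept[i]     # rule i is subsumed by a same-sign rule
        else:
            i += 1
    return kept

rules = [('//patient', '+'), ('//patient/name', '+'),
         ('//patient/name', '+')]                      # duplicate rule
dedup = redundancy_elimination(rules, lambda p, q: p == q)
assert dedup == [('//patient', '+'), ('//patient/name', '+')]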
5.2 Annotating XML Documents with Accessibility Information

Annotation-Queries
Require: Policy P = (A, D, ds, cr)
Ensure: Annotation query
1: for all r ∈ A do
2:   grants ← grants UNION r
3: end for
4: for all r ∈ D do
5:   denys ← denys UNION r
6: end for
7: if ds = '−' then
8:   if cr = '−' then
9:     toupdate ← query(grants EXCEPT denys)
10:  else {cr = '+'}
11:    toupdate ← query(grants)
12:  end if
13: else {ds = '+'}
14:  if cr = '−' then
15:    toupdate ← query(denys)
16:  else {cr = '+'}
17:    toupdate ← query(denys EXCEPT grants)
18:  end if
19: end if
20: return toupdate

Fig. 5. Computing annotation queries
To annotate an XML document independently of where it is stored, we must first compute the annotation queries that implement the semantics of the XML access control policy P. Algorithm Annotation-Queries takes as input an XML access control policy P and computes the SQL and XQuery queries that will be used to annotate the relational and XML databases with accessibility information, implementing the policy semantics as described in Table 2. In the relational case, the resource part of an access control rule (an XPath expression) is translated into an equivalent SQL query q using the ShreX [8, 1] translation. The resource parts of the rules that grant (resp. deny) access are unioned using the relational UNION (XQuery union) operator. Depending on the default semantics and conflict resolution policy, the relational EXCEPT (XQuery except) operator is used to express the annotation query that implements the semantics of the access control policy. In the relational case, the components in the UNION query are the SQL queries translated from the XPath expressions using the chosen XML-to-relational mapping; the sketch below illustrates the construction.
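This Python-style sketch is illustrative only: translate stands in for the ShreX XPath-to-SQL translation, and both rule sets are assumed non-empty:

def annotation_query(positive, negative, ds, cr, translate):
    """Build the SQL query selecting the tuples whose sign must be set.
    positive/negative are the resource XPaths of the A and D rule sets."""
    grants = ' UNION '.join(translate(r) for r in positive)
    denys  = ' UNION '.join(translate(r) for r in negative)
    if ds == '-':
        # deny default: annotate the accessible nodes
        return f'({grants}) EXCEPT ({denys})' if cr == '-' else grants
    # grant default: annotate the inaccessible nodes
    return denys if cr == '-' else f'({denys}) EXCEPT ({grants})'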
Table 4. Relational representation of the XML document of Fig. 2

patients          patient           psn                    name
id  pid   s       id  pid  s        id  pid  val  s        id  pid  val        s
1   null  −       2   1    −        3   2    033  −        8   2    john doe   +
                  9   1    −        10  9    042  −        15  9    jane doe   +
                  16  1    +        17  16   099  −        18  16   joy smith  +

treatment         regular           experimental           med
id  pid  s        id  pid  s        id  pid  s             id  pid  val         s
4   2    −        5   4    +        12  11   −             6   5    enoxaparin  −
11  9    −

bill                      test
id  pid  val   s          id  pid  val                  s
7   5    700   +          13  12   regression hypnosis  +
14  12   1600  +
Relational Approach. To store an XML document in a relational database, the tree (specified in our case with an XML DTD) must first be mapped into an equivalent relational schema. Using this mapping, the XML document is then shredded and loaded into the relational tables. In the context of XML, we need to capture access control information at the XML node level. To satisfy this requirement, we map each element type in the XML DTD to a relational table. More specifically, each element type E with attributes A1, A2, ..., An in the XML DTD is mapped to a table ET(id, pid, A1, A2, ..., An, s), where id is the primary key for ET and pid is a foreign key that refers to the relational table E′T to which the parent element type E′ of E is mapped. Finally, s is an additional column that stores the access permission for the tuple (i.e., node in the XML document). The value of an Ai column is the value of the Ai attribute of the XML node. For nodes whose type is a base type such as string or integer, we define tables of the form ET(id, pid, A1, ..., An, v, s) where v is the value of the XML node. The id key is unique not only within the table but throughout the entire database; we will call this key the 'universal identifier'. For our motivating example we define one table of the form ET(id, pid, s) per element type in the XML DTD shown in Figure 1. Table 4 shows the output of the XML-to-relational mapping for the XML document shown in Figure 2, where for each node in the XML document whose element type is E we create one tuple in table ET. The accessibility of each tuple (i.e., corresponding XML node) is initialized to the default semantics of the policy. To annotate the tuples in the relational store given a policy P, we must first find the tuples that are in the semantics of P (i.e., are accessible according to policy P) and then perform the necessary update operation. To obtain the tuples to be annotated, we run the SQL query obtained by executing algorithm Annotation-Queries. Our annotation algorithm is a two-phase algorithm: in the first phase we find the ids of all the tuples that need to be annotated, and in the second phase, for each such tuple, we run the update query that changes the value of the s column. For example, consider the policy shown in Table 1. The translated SQL queries for the rules of the policy are given below. For rule R1 the produced query Q1 is:
SELECT pat1.id FROM patients pats1, patient pat1 WHERE pats1.id = pat1.pid;
For rule R3 the corresponding query Q3 is:

SELECT pat1.id
FROM patients pats1, patient pat1, treatment treat1
WHERE pats1.id = pat1.pid AND pat1.id = treat1.pid
For rule R7 the corresponding query Q7 is:

SELECT med1.id
FROM patients pats1, patient pat1, treatment treat1, regular regular1, med med1
WHERE pats1.id = pat1.pid AND pat1.id = treat1.pid AND treat1.id = regular1.pid
  AND regular1.id = med1.pid AND med1.v = 'celecoxib'
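The CREATE TABLE statements behind these queries follow the per-element-type mapping described above. A small Python helper sketches their generation; the names and column types are illustrative assumptions, not the system's actual DDL:

def create_table_ddl(element, attributes=(), has_value=False):
    """Emit the ET(id, pid, A1..An [, v], s) table for one element type."""
    cols = ["id INTEGER PRIMARY KEY", "pid INTEGER"]
    cols += [f"{a} VARCHAR" for a in attributes]
    if has_value:
        cols.append("v VARCHAR")      # value column for base-typed nodes
    cols.append("s CHAR(1)")          # access permission: '+' or '-'
    return f"CREATE TABLE {element} ({', '.join(cols)});"

print(create_table_ddl("patient"))                 # id, pid, s
print(create_table_ddl("med", has_value=True))     # id, pid, v, s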
Finally, given that the default semantics is deny and the conflict resolution is deny overrides, the SQL query that implements the semantics of the redundancy-free policy in Table 3 is: ((Q1 UNION Q2 UNION Q6) EXCEPT (Q3 UNION Q5)), where a query Qi is defined for access control rule Ri. Note that the result of the SQL query is a set of identifiers of tuples that are in the semantics of the access control policy, i.e., are accessible. In the relational context, to update a relational tuple we need to know the name of the table that the tuple belongs to. The universal identifier does not provide us with that kind of information. Consequently, to identify the table that a tuple (i.e., the identifier of a tuple) belongs to, we iterate over all tables of the database. For each table the algorithm computes the intersection between the universal identifiers of the tuples included in the table and those computed by applying the SQL query that implements the semantics of policy P. The tuples with primary key in the computed intersection are updated to reflect the accessibility of a node. The annotation process, from the creation of the annotation queries to the addition of accessibility information to the relational tuples, is shown in Algorithm Annotate (Fig. 6); a sketch of this step in Python follows the figure.

Native XML. In the case of a native XML store, the annotation process is straightforward. We choose to store accessibility annotations for XML elements in the form of an XML attribute sign that takes value "+" (if the node is accessible) or "−" otherwise. The idea is the following: we employ algorithm Annotation-Queries to obtain the XQuery expression that implements the semantics of the access control policy (i.e., determines the accessible nodes). To minimize the amount of information stored, we choose to annotate the accessible (inaccessible) nodes for policies with deny (grant) default semantics, respectively. The modification of the sign attribute of the nodes is performed with function xmlac:annotate() shown below.
Annotate(P)
Require: Policy P, relational DB D
Ensure: D annotated according to P
1: sqlquery ← Annotation-Queries(P)   {first, produce the SQL query}
2: S ← query(sqlquery, D)             {execute it to compute the set S of tuple ids}
3: for all table ∈ schema do
4:   ids ← query(SELECT id FROM table)
5:   upids ← ids ∩ S
6:   for all upid ∈ upids do          {produce the SQL update queries}
7:     query(UPDATE table SET s = '+' WHERE id = upid)
8:   end for
9: end for

Fig. 6. Annotation Algorithm for Relational DB
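A hypothetical rendering of this algorithm against a DB-API driver such as the psycopg2 module mentioned in Section 6; connection handling and table names are assumptions:

def annotate(conn, sqlquery, schema_tables):
    """Two-phase annotation: collect accessible tuple ids, then update s.
    conn is a DB-API connection, e.g. from psycopg2.connect(...)."""
    cur = conn.cursor()
    cur.execute(sqlquery)                       # phase 1: ids in [[P]](T)
    accessible = {row[0] for row in cur.fetchall()}
    for table in schema_tables:                 # table names come from the
        cur.execute(f"SELECT id FROM {table}")  # schema, not from user input
        upids = {row[0] for row in cur.fetchall()} & accessible
        for upid in upids:                      # phase 2: flip the sign column
            cur.execute(f"UPDATE {table} SET s = %s WHERE id = %s",
                        ("+", upid))
    conn.commit()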
The function takes as input the XML node to be annotated (n) and the annotation label (val). If the node does not have a sign attribute, then the attribute is inserted along with its value; otherwise, the current value is updated.

function xmlac:annotate($n as element(), $val as xs:string) {
  if (count($n/@sign) = 0)
  then do insert attribute sign { $val } into $n
  else do replace value of $n/@sign with $val
};
For instance, for the motivating example (the policy of Table 3) we run the following query to annotate the XML nodes, where each Ri stands for the resource expression of rule Ri:

for $n in doc("xmlgen")((R1 union R2 union R6) except (R3 union R5))
return xmlac:annotate($n, "+")
5.3 Re-annotation
When a database is frequently updated, the cost of keeping the annotations consistent with the access control policy becomes considerably high. The simple approach to tackling this problem is to delete all annotations and annotate from scratch, a process that incurs a large processing cost. In this section we discuss how we can identify the access control rules that should be triggered to re-annotate the nodes whose access permission changed due to the update. Intuitively, the nodes that must be re-annotated are all the nodes that are in the scope of an access control rule that specifies a condition (filter) on a node that is modified (inserted or deleted) by the update operation. For instance, consider the following example: suppose that the treatment subelement of a patient
Depend(P)
1: for all r ∈ P do
2:   r.visited ← false
3: end for
4: for all r ∈ P do
5:   Depend-Resolve(r, dlist)
6:   r.depends ← dlist
7: end for

Depend-Resolve(r, dlist)
1: r.visited ← true
2: for all n ∈ r.neighbours do {explore (r, n)}
3:   if n.visited = false then
4:     dlist ← dlist ∪ {n}
5:     Depend-Resolve(n, dlist)
6:   end if
7: end for

Fig. 7. Dependency resolution algorithm
element is deleted. Recall that access control rule R3 of our motivating example states that patients with treatment are inaccessible. In this case, we should consider for re-annotation all patient elements, since rule R3 that was used in their annotation is no longer applicable. To determine this set of rules we employ XPath containment tests between the rules and the update query. We discuss this in more detail in the following. As a tool to discover the access control rules that must be considered for re-annotating the access permissions of a node, we compute their dependency graph. The graph captures interdependencies between the access control rules: for every rule R in a policy P that has in its scope a node n, the dependency graph stores all the rules R′ of opposite sign that also have in their scope node n. The graph allows us to get in constant time all the rules that should be considered for re-annotating an XML node. The dependency graph is represented as a list of adjacency lists, where each member of the list corresponds to an access control rule in a policy P. We associate with each rule r the attribute neighbours, which stores the adjacency list (i.e., dependency graph) for r, and the attribute visited, which notes that the rule has been visited during the execution of the algorithm. Algorithm Depend computes the dependency graph as follows: we iterate over all rules in a policy P to discover the dependencies that arise. Each call to Depend-Resolve for rule r initiates a DFS-like recursive traversal that finds all dependent rules for r. In line 2 of algorithm Depend-Resolve the r.neighbours variable denotes the adjacency list for r. In these adjacency lists, each entry n is a neighbor of another entry r iff r has a containment relation with n: r ⊑ n, n ⊑ r, or r and n overlap. For instance, consider the rules R1 and R3 of the motivating example (Table 3). We can see that R3 is contained in R1 (//patient[treatment] ⊑ //patient), as the former returns patients with a treatment subelement whereas the latter returns all patients. Consequently, after this process rule R3 will be included in the dependency list of rule R1 and vice versa. We should clarify that we are interested in dependencies between rules that have opposite effects, in contrast to the offline policy optimization where we eliminated rules of the same effect. We consider that the updates are XPath expressions that specify the location of the nodes to be inserted or deleted. When an update u occurs we must determine the XML nodes that must be re-annotated. The idea is that the nodes that must be re-annotated are in the scope of the access control rules that are "related to" the update u. To discover this set of rules we run the Trigger
algorithm, which tests containment between the update query and the expansion of the policy rules, and then adds the dependent rules based on the previously constructed dependency graph. The complexity of this algorithm is O(n · h), where n is the number of rules and h the height of the XML document tree.

Trigger(P, u)
1: rules ← ∅
2: for all p ∈ P do
3:   X ← Expand(p)
4:   for all x ∈ X do
5:     if x ⊑ u or u ⊑ x or x overlaps u then
6:       rules ← rules ∪ {p}
7:     end if
8:   end for
9: end for
10: for all r ∈ rules do
11:   rules ← rules ∪ r.depends
12: end for
13: return rules

Fig. 8. Trigger algorithm

The need for rule expansion and dependency resolution can be supported with a simple example. Consider the XML tree in Figure 2 and the accompanying policy of Table 1. Rules R1 and R3 say that all patients are accessible except those that have a treatment as a child element. Also consider that the incoming update query specifies the deletion of //patient/treatment nodes. After this operation one would expect that all patient elements are now accessible. To make this happen we should consider triggering the positive rule //patient (R1) for the re-annotation process. This is accomplished in two steps: (i) rule R3 expands to

//patient[treatment] −→ { //patient, //patient/treatment }
the latter part of which matches the query; (ii) the dependency resolution finds that positive rule R1 is a dependent of R3 (by means of containment) and consequently is included in the set of rules to consider. If the expansion had not taken place, the positive rule R1 would not have been triggered and thus the previous annotations would have incorrectly been preserved. This rule expansion does not cover the case in which the XPath expressions of the access control rules contain predicates with descendant axes. Consider the hospital document and rules R1 and R5, and an update that deletes all treatment elements (//treatment) and their subtrees. The query will not trigger any rule that does not contain the treatment tag. This is not correct, because patient elements should now be accessible, as there is no descendant experimental element under patient anymore. To deal with this problem, we need to replace all descendant axes that occur inside a predicate of an access control rule with relative paths using only the child axis. With the schema information these replacements are finite. Rule R5 now expands to

//patient[.//experimental] −→ { //patient, //patient//experimental } −→ { //patient, //patient/treatment/experimental }
After the expansion, rule R5 is triggered by the query. This also triggers rule R1 because of containment, and the accessibility of nodes is updated correctly. The full picture of the re-annotation process can be perceived as a sequence of the steps described previously. The idea is the following: we first obtain the set of triggered rules by calling Trigger. We then produce an annotation query Q for this set of rules (using algorithm Annotation-Queries) and finally re-annotate the nodes that are in the semantics of Q as accessible. A condensed sketch of this pipeline is given below.
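The following Python sketch condenses the pipeline just described. It is illustrative only: expand, related (containment in either direction, or overlap), annotation_query and run are hypothetical stand-ins for the components of Sections 5.1-5.3, and rule objects are assumed to carry the .depends list computed by algorithm Depend:

def trigger(policy_rules, u, expand, related):
    """Select the rules 'related to' update query u (cf. Fig. 8)."""
    fired = {r for r in policy_rules
             if any(related(x, u) for x in expand(r))}
    for r in list(fired):
        fired |= set(r.depends)       # add dependent rules of opposite sign
    return fired

def reannotate(policy_rules, u, expand, related, annotation_query, run):
    """Re-annotate only the nodes affected by update u."""
    fired = trigger(policy_rules, u, expand, related)
    if fired:
        run(annotation_query(fired))  # Annotation-Queries over fired rules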
6 Implementation
Our system transforms XML data and stores them in a relational database after shredding them using ShreX1. We modified ShreX to better interface with external code modules. For uniformity of evaluation we decided to use the MonetDB2 database. MonetDB offers the advantage of providing both an XQuery and an SQL module. This permitted us to directly compare the two methods using the same engine. We also chose to evaluate our system using PostgreSQL3. We used the PL/Python feature, which enables PostgreSQL functions to be written in Python. The core of the application that handles all the input and transactions was also written in Python. We used the py-psycopg2 module for PostgreSQL and the MonetSQLdb module distributed with the MonetDB project. In some cases we used object serialization and disk storage to keep an algorithm's computation or a procedure's output for future use, for example for document shredding, which is a very time-consuming process, or for containment comparisons, which are an issue mostly because that part of our current implementation is in Java and we must pay the cost of JVM initialization.
7 Evaluation
7.1 Setup
To evaluate our system and runtime environment we used the following parameters: (i) size of the XML document, (ii) size of the policy, and (iii) coverage of the policy, and we designed our experiments to measure: (i) loading time, (ii) annotation time, (iii) response time, and (iv) re-annotation time. The XML data were generated with xmlgen from the XMark project [21]. We should also note that we modified the xmlgen code that generates the XML (and, as a consequence, the conforming schema) in an effort to eliminate all recursive paths. This is crucial for the specific shredding procedure to work properly. With xmlgen we generated a set of documents of variable sizes (see Table 5). The sizes of the respective SQL files are also displayed. We manually designed policies with variable coverage, that is, we crafted several policy files to force our system to annotate increasingly larger portions

1 http://shrex.sourceforge.net/
2 http://monetdb.cwi.nl/
3 http://www.postgresql.org/
of the data.4 In this fashion we obtain information about the system's behavior when managing a small or a large number of annotations. We refer to these policies as the coverage policy dataset. We shred the XML files into text files containing SQL INSERT statements representing the data. Loading time is the time needed to run these SQL files on a relational database. Similarly, with respect to native XML storage, loading time refers to the time needed to load the document from the XML file into the XQuery database. The annotation process for the relational store consists of evaluating the query obtained by algorithm Annotation-Queries, performing set operations on its results to determine the tuples that need updating, and finally executing UPDATE queries if needed, as discussed in Section 5.2. Annotation time is the time required for these actions to complete. In a similar manner, for the XML store, we measure the time needed to evaluate the Annotation-Queries with MonetDB/XQuery. Response time is the time needed to check whether a user has access to the data they request. Finally, re-annotation time is the time spent to get the database to a consistent state after an update occurs. We ran our experiments on a Dell OptiPlex 755 desktop with an Intel Core 2 Duo E8400 CPU @ 3.00GHz and 3GB of memory, running FreeBSD 7.0.

Table 5. Documents generated with xmlgen and their sizes

factor   XML size   SQL size
0.0001   19K        33K
0.001    85K        149K
0.01     804K       1.6M
0.1      7.9M       17M
1.0      79M        78M
2.0      158M       140M
10.0     793M       310M

7.2 Experimental Results
We ran a series of experiments to evaluate the efficiency and feasibility of the system. Loading the documents in native XML form is fairly quick to complete, whereas running all the equivalent INSERTs is over one order of magnitude
avg loading time 2500
xquery monetsql postgres
6
xquery monetsql postgres
5 time (sec)
2000 time (sec)
7
1500 1000
4 3 2
500
1
0 0
1
2
3
4
5
6
7
8
9
document size (xmlgen f)
Fig. 9. Loading time comparison 4
10
0 0.0001
0.001
0.01
0.1
1
document size (xmlgen f)
Fig. 10. Response time comparison
4 We evaluated the actual coverage percentages with XQuery after each document annotation.
slower. Of the two relational databases, we found that PostgreSQL performs about twice as fast as MonetDB/SQL when inserting data (see Figure 9). In Figure 10 we show the performance on client requests. We ran 55 different queries (of the same complexity as the coverage policy dataset) and calculated their average response time for each document. The required time is roughly proportional to the document size. MonetDB/SQL performs better than PostgreSQL on large documents, but compared to XQuery they both perform 34 times slower on average (and growing). Figure 11 presents the results, for variable-coverage policies, of the actual annotation process on all the database systems we used. There is a small performance gain on small documents when using relational databases, but in the long run
avg reannotation time
1000
10 1
reannot fannot
350 300 time (sec)
100 time (sec)
400
f0.0001 f0.001 f0.01 f0.1 f1
250 200 150 100
0.1
50 0.01
0 25 30 35 40 45 50 55 60 65 70 doc coverage (%)
0
(a) MonetDB/XQuery
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 document factor
(a) MonetDB/XQuery
avg annotation time
avg reannotation time
1000
400
f0.0001 f0.001 f0.01 f0.1 f1
10 1
reannot fannot
350 300 time (sec)
100 time (sec)
2
250 200 150 100
0.1
50 0.01
0 25 30 35 40 45 50 55 60 65 70 doc coverage (%)
0
(b) MonetDB/SQL
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 document factor
(b) MonetDB/SQL
avg annotation time 1000
10 1
reannot fannot
350 300 time (sec)
time (sec)
avg reannotation time 400
f0.0001 f0.001 f0.01 f0.1 f1
100
2
250 200 150 100
0.1
50 0.01
0 25 30 35 40 45 50 55 60 65 70 doc coverage (%)
0
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 document factor
2
(c) PostgreSQL
(c) PostgreSQL
Fig. 11. Annotation time comparison
Fig. 12. Reannotation vs full annotation
the MonetDB/XQuery database performs best at annotation. There is little difference between MonetDB/SQL and PostgreSQL. Optimization results in the case of re-annotation are shown in Figure 12. We ran the same 55 queries (derived from the coverage dataset) as delete updates and calculated the average re-annotation time per document. The re-annotation time does not actually depend on the document size. On XQuery, re-annotation becomes efficient as a technique for documents of tens of MBs or larger, where it is 5 times faster than full annotation. On relational databases it is almost always more efficient to perform partial re-annotation; on average it is 9 and 7 times faster on MonetDB/SQL and PostgreSQL respectively. Comparing relational and native XML re-annotation, the latter is about twice as fast on average.
8 Conclusions
In this paper we have studied the problem of enforcing access control on XML documents stored in relational and native XML databases. We have presented a novel re-annotation algorithm that computes the XPath query designating the XML nodes to be re-annotated when an update operation occurs. We performed exhaustive experiments to evaluate the effectiveness and efficiency of the proposed solutions. We concluded that performing access control on XML documents stored in native XML databases outperforms the relational-based solution. Schema-aware optimizations should be further studied, as they can extend our mechanism to support larger XPath fragments and produce more accurate results. As future work, we also plan to extend our framework to handle access control for update operations (inserts and deletes).
Acknowledgments

This work was supported in part by the Marie Curie Actions – Reintegration Grants project PASS. We would like to thank James Cheney for interesting discussions during the first steps of this work and Loreto Bravo for insightful comments.
References

1. Amer-Yahia, S., Du, F., Freire, J.: A comprehensive solution to the XML-to-relational mapping problem. In: Proc. of the 6th Annual ACM Int'l Workshop on Web Information and Data Management, pp. 31–38. ACM, New York (2004)
2. Benedikt, M., Fan, W., Kuper, G.: Structural properties of XPath fragments. Theoretical Computer Science 336(1), 3–31 (2005)
3. Bertino, E., Ferrari, E.: Secure and selective dissemination of XML documents. ACM Transactions on Information and System Security 5(3), 290–331 (2002)
4. Biron, P.V., Malhotra, A.: XML Schema Part 2: Datatypes Second Edition. W3C Recommendation (October 2004), http://www.w3.org/TR/xmlschema-2/
5. Cho, S.R., Amer-Yahia, S., Lakshmanan, L.V.S., Srivastava, D.: Optimizing the secure evaluation of twig queries. In: Proc. of the 28th Int'l Conf. on Very Large Data Bases, pp. 490–501. VLDB Endowment (2002)
6. Clark, J., DeRose, S., et al.: XML Path Language (XPath) Version 1.0. W3C Recommendation (1999), http://www.w3c.org/TR/xpath
7. Damiani, E., Di Vimercati, S.C., Paraboschi, S., Samarati, P.: A fine-grained access control system for XML documents. ACM Transactions on Information and System Security (TISSEC) 5(2), 169–202 (2002)
8. Du, F., Amer-Yahia, S., Freire, J.: ShreX: Managing XML documents in relational databases. In: Proc. of the 30th Int'l Conf. on Very Large Data Bases, pp. 1297–1300. VLDB Endowment (2004)
9. Fallside, D.C., Walmsley, P.: XML Schema Part 0: Primer Second Edition. W3C Recommendation (October 2004), http://www.w3.org/TR/xmlschema-0/
10. Fan, W., Chan, C.Y., Garofalakis, M.: Secure XML querying with security views. In: Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), Paris, France, pp. 587–598 (2004)
11. Fundulaki, I., Marx, M.: Specifying access control policies for XML documents with XPath. In: Proc. of the 9th ACM Symposium on Access Control Models and Technologies, pp. 61–69. ACM, New York (2004)
12. Gottlob, G., Koch, C., Pichler, R., Segoufin, L.: The complexity of XPath query evaluation and XML typing. Journal of the ACM 52(2), 284–335 (2005)
13. Haj-Yahya, K.: XPath-Containment Checker (2005), http://www.ifis.uni-luebeck.de/projects/XPathContainment
14. Ioannidis, S.: Security policy consistency and distributed evaluation in heterogeneous environments. PhD thesis, Philadelphia, PA, USA (2005)
15. Jajodia, S., Samarati, P., Subrahmanian, V.S.: A Logical Language for Expressing Authorizations. In: Proc. IEEE Computer Society Symposium on Security and Privacy, pp. 31–42 (1997)
16. Kuper, G., Massacci, F., Rassadko, N.: Generalized XML security views. Int'l Journal of Information Security 8(3), 173–203 (2009)
17. Lupu, E.C., Sloman, M.S.: Conflict Analysis for Management Policies. In: Proc. of the 5th IFIP/IEEE Int'l Symposium on Integrated Network Management (IM), San Diego, CA (1997)
18. Miklau, G., Suciu, D.: Containment and equivalence for a fragment of XPath. Journal of the ACM 51(1), 2–45 (2004)
19. Neven, F., Schwentick, T.: XPath containment in the presence of disjunction, DTDs, and variables. LNCS, pp. 315–329 (2003)
20. Qi, N., Kudo, M., Myllymaki, J., Pirahesh, H.: A function-based access control model for XML databases. In: Proc. of the 14th ACM Int'l Conf. on Information and Knowledge Management, pp. 115–122. ACM, New York (2005)
21. Schmidt, A., Waas, F., Kersten, M., Carey, M.J., Manolescu, I., Busse, R.: XMark: A benchmark for XML data management. In: Proc. of the 28th Int'l Conf. on Very Large Data Bases, pp. 974–985. VLDB Endowment (2002)
22. Schwentick, T.: XPath query containment. SIGMOD Record 33(1), 101 (2004)
23. Tan, K.L., Lee, M.L., Wang, Y.: Access control of XML documents in relational database systems. In: Int'l Conf. on Internet Computing, pp. 185–191. Citeseer (2001)
24. Thompson, H.S., Beech, D., Maloney, M., Mendelsohn, N.: XML Schema Part 1: Structures Second Edition. W3C Recommendation (October 2004), http://www.w3.org/TR/xmlschema-1/
25. Wadler, P.: Two semantics for XPath. Technical report (2000)
26. Yu, T., Srivastava, D., Lakshmanan, L.V.S., Jagadish, H.V.: A compressed accessibility map for XML. ACM Transactions on Database Systems (TODS) 29(2), 363–402 (2004)
27. Zhang, H., Zhang, N., Salem, K., Zhuo, D.: Compact access control labeling for efficient secure XML query evaluation. Data & Knowledge Engineering 60(2), 326–344 (2007)
Longitude: Centralized Privacy-Preserving Computation of Users' Proximity

Sergio Mascetti, Claudio Bettini, and Dario Freni

Università degli Studi di Milano, DICo – EveryWare Lab
Abstract. A "friend finder" is a Location Based Service (LBS) that informs users about the presence of participants in a geographical area. In particular, one of the functionalities of this kind of application reveals the users that are in proximity. Several implementations of the friend finder service already exist but, to the best of our knowledge, none of them provides a satisfactory technique to protect users' privacy. While several techniques have been proposed to protect users' privacy for other types of spatial queries, these techniques are not appropriate for range queries over moving objects, like those used in friend finders. Solutions based on cryptography in decentralized architectures have been proposed, but we show that a centralized service has several advantages in terms of communication costs, in addition to supporting current business models. In this paper, we propose a privacy-aware centralized solution based on an efficient three-party secure computation protocol, named Longitude. The protocol allows a user to know whether any of her contacts is close by without revealing any location information to the service provider. The protocol also ensures that user-defined minimum privacy requirements with respect to the location information revealed to other buddies are satisfied. Finally, we present extensive experimental work that shows the applicability of the proposed technique and its advantages over alternative proposals.
1 Introduction
Location-aware social networks are social network applications in which the geographical position of participants can be used to enable new services. The ability to access social network applications through mobile devices and the availability of precise positioning technologies are likely to make this new generation of social networks very popular. As in many social networks, each user is part of one or more groups of users, called friends or buddies. Among the enabled services, proximity services alert a user when one of her buddies is in the vicinity, possibly enacting other activities like the visualization on a map of the approximate position, or the start of a communication session. These services, often called friend finders, are currently available on the Internet, with specific client-side software or plugins to be installed on mobile devices1. From a data management point
1 Examples are Google Latitude, Loopt, iPoki.
of view, a proximity service involves the computation of a sequence of range queries over a set of moving entities issued by a moving user, where the range is a distance threshold value decided by the user. Currently available services are based on a centralized architecture in which location updates are acquired from mobile devices2 by a location server in the service provider (SP) infrastructure; proximity is computed based on the acquired locations. These services are not currently offered by mobile phone operators, which usually have approximate location information about their users from the network infrastructure. Indeed, in this paper we assume that the SP is an untrusted entity that has no location information about users except the one acquired through the service itself. We also consider "proximity" a user-dependent concept: user A defines a distance threshold δA, and considers any buddy B as being in proximity if the following condition holds:

dist(loc(A), loc(B)) ≤ δA    (1)
where dist(loc(A), loc(B)) denotes the Euclidean distance between the reported locations of A and B. Optimization strategies for location updates and proximity computations have been proposed [1], but they are not the focus of this paper. The problem we are considering is to offer such a service, in an analogous centralized architecture, while providing formal guarantees to each user about her location privacy. The launch of friend finder services on the Internet has indeed generated a lot of concern among potential users about the release of precise location data, which in current systems can be easily associated with user identities, as demonstrated by privacy research in LBS [2,4]. There are actually two different concerns: the first is related to the location data disclosed to the SP. Indeed, many users do not have complete trust in the SP, and they are also concerned about their location data being possibly accessed later on SP data stores by untrusted parties. Hence, the first privacy requirement we aim to satisfy is not to reveal any location information to the SP. The second concern regards the location data disclosed to the buddies: a user may wish not to provide her exact location to her buddies, although she may be willing to reveal whether she is in proximity. In general, the level of location privacy can be represented by the uncertainty that an external entity has about the position of the user, and this uncertainty can be formally represented as a geographic area in which no point can be ruled out as a possible position of the user. In principle, each user should be able to express her privacy preferences by specifying, for each other user (or class of users perceived as adversaries), a partition of the geographical space defining the minimal uncertainty regions that she wants to be guaranteed. For example, Alice specifies that Bob should never be able to find out the specific building where Alice is within the campus, i.e., the entire campus area is a minimal uncertainty region. Current services have very limited support to fine-tune the location privacy with respect to buddies.
2 While a variety of positioning technologies and communication infrastructures can be used, here we assume GPS-enabled devices with always-on 3G data connections.
Considering the related research literature, the available privacy-preserving solutions for location-based services are not straightforwardly applicable to this problem, since they are either focused on guaranteeing the anonymity of requests or limited to k-NN (Nearest Neighbor) queries or range queries over static resources. At the time of writing, we are aware of only a few proposals for privacy-aware proximity detection [6,7,5]. The benefits and vulnerabilities of applying distance-preserving transformations have been investigated in privacy-preserving data mining [3]. In the specific topic of privacy-aware proximity services, distance-preserving transformations have been used to hide the user positions from the SP [6]. However, this specific solution seems to be subject to vulnerabilities, since the SP acquires information on the exact distance between two buddies and hence the SP can exclude some places as their possible locations. On the contrary, Longitude does not preserve the exact distance but, instead, preserves what we call the modular distance, which prevents the disclosure of any location information to the SP. The release of location information to the SP is also avoided in the three protocols proposed in [7]. They are based on a two-party secure computation exploiting public key cryptography. These solutions suggest a decentralized architecture, and each user has to contact every buddy each time she needs to know which ones are in proximity; this can result in high communication costs for large numbers of buddies. In our approach we take advantage of the presence of the SP to significantly reduce the communication costs of a user. An experimental comparison in terms of service precision and communication cost with the algorithm named Pierre3 is shown in Section 4. An important conceptual difference from both of these papers is that we more formally consider the privacy with respect to buddies, through the definition and enforcement of user-defined minimal uncertainty regions. In their approach, the location information revealed to buddies only depends on the proximity threshold used in the queries, while in our case it also depends on the minimal uncertainty region defined by each user. This leads to two advantages: an explicit privacy guarantee of the protocol regarding buddies, and a better quality of service, as will become clearer from the details of the protocol. Finally, in our previous work on this topic [5] we presented an obfuscation-based solution in which the SP is allowed to acquire user location information, but only limited to a user-defined precision. The SP-Filtering procedure proposed in [5] provides an approximate answer to the proximity problem exploiting spatial granularities, and may also be used as a pre-processing step of Longitude, as well as of the algorithms in [7]. Longitude is based on a three-party secure computation involving only communication between each buddy and the server. Each time a user location is sent to the SP, it is first generalized to a two-dimensional area A whose dimension depends on the user's privacy requirement with respect to buddies. Our solution considers a mapping from the two-dimensional space in which users move into
3 This has been selected since it is the algorithm implemented by the authors in the NearbyFriend service.
a toroidal space. A solid transformation is applied to the projection of A in the toroidal space and the result is then sent to the SP. Each user shares a (possibly different) secret with each of her buddies that determines the solid transformation. The SP computes proximity in the toroidal space and communicates the result to the participating buddies, which can then compute the proximity in the two-dimensional space. In order to avoid correlations in time, the above transformation changes at each run. The knowledge of the distance in the toroidal space does not disclose to the SP any location information about the buddies. The contributions of this paper can be summarized as follows:
– We design a protocol enabling a service to compute users' proximity in a centralized architecture, without the service provider acquiring any location information;
– We allow each user to specify minimum privacy requirements with respect to the location data released to other buddies, and show the correctness of the protocol with respect to these privacy requirements;
– We experimentally evaluate the precision and performance of the proposed protocol using a realistic simulation of user movements. We also experimentally compare our solution with the only available implementation of a privacy-preserving friend finder.
The main goal of this paper is to illustrate an innovative technique for privacy-aware proximity computation and not to illustrate all the technical details of the protocol. Although extensions can be applied to deal with more involved scenarios, the basic protocol we describe does not assume particularly powerful adversaries. For example, our aim is not to counter complex cryptanalytic attacks, and we assume that potential adversaries can only acquire the messages exchanged as part of the protocol, without a-priori knowledge about particular distributions of individuals or spatio-temporal trajectories. The rest of the paper is organized as follows. In Section 2 we illustrate the Longitude protocol, and in Section 3 its formal properties are analyzed. In Section 4 we report experimental results, and in Section 5 we conclude the paper, pointing out some interesting future work.
2 Privacy Preserving Proximity Computation
In this section we first describe how buddies can specify their privacy requirements; we then illustrate the Longitude protocol.

2.1 Minimum Privacy Requirements
We assume that each user A can specify her minimum privacy requirements with respect to other buddies by defining a particular grid GA partitioning the spatial domain, such that each cell cA of the grid represents a minimum uncertainty region for A. Cells can either be shaped as squares or rectangles, and all the cells within a grid are required to have the same shape and the same dimension.
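As a toy illustration (not from the paper), such a grid can be realized by snapping coordinates to cell corners; the function name, cell sizes and coordinates below are arbitrary assumptions:

def cell_of(x, y, w, h):
    """Return the grid cell containing (x, y), identified by its
    lower-left corner, for rectangular cells of width w and height h."""
    return (x // w) * w, (y // h) * h

# A coarse, campus-sized cell versus the finest possible grid:
assert cell_of(437, 212, 500, 500) == (0, 0)      # whole-campus cell
assert cell_of(437, 212, 1, 1) == (437, 212)      # pixel-level grid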
We conservatively assume that the minimum privacy requirement defined by a user is public, and hence that it is possibly known by the SP or by the other buddies. The two extreme cases, in which a user requires no privacy protection or maximum privacy protection, can be naturally modeled. For example, if a user A does not want her privacy to be protected with respect to other buddies (in this case A can tolerate other buddies knowing her location at the maximum available precision), then A will set GA to the finest grid (the one having as cells the basic elements, or pixels, of the spatial domain). Similarly, if A wants to impose the maximum privacy protection, then A sets GA to the grid having a single cell covering the entire spatial domain. The first privacy requirement identified in Section 1, regarding the information released to the SP, can be formalized by considering this maximum-protection grid as a protection against the SP. This requirement would clearly be violated if locations were sent to the SP; but the second requirement, regarding buddies, would easily be violated as well. Indeed, suppose A has a buddy B who sets a value of δB in a way such that the circular region of radius δB (centered at B's location) is properly contained in a cell of GA. Then, if A happens to enter that circular region, the SP will notify B, and A's minimum privacy requirement would be violated.

2.2 The Longitude Protocol
For the sake of simplicity, the protocol will be illustrated considering a user A issuing a proximity request with respect to a single buddy B, but the extension to multiple buddies is trivial, and the experiments in Section 4 consider a large number of buddies. The main steps of the Longitude protocol are the following: each time A wants to check whether B is in proximity, A runs the encryptLocation procedure to encrypt the cell cA of GA where A is located and sends it to the SP. Referring to the intuitive protocol description given in the introduction, the encryption is equivalent to projecting cA into the toroidal space and then applying the solid transformation. Upon receiving the request, the SP sends a message to B requiring a location update. B runs the encryptLocation procedure to encrypt the cell cB of GB where B is located and sends the result to the SP. Note that B uses the same key as A, generated from their common secret. The encryption function is designed in such a way that the SP, upon receiving a request from A and an answer from B with cells encrypted with the same key, can compute, through the computeProximity procedure, the distance in the toroidal space, which we call the modular distance. The SP compares the modular distance with the proximity threshold and sends the result as a boolean value to the requester A, which computes whether B is in proximity or not through the procedure getResult. The encryption function is such that, if A sends her location cell to the SP using the same encryption key at different instants while being in different cells, and the SP is aware of this, it can possibly learn some information about the movement of A, and hence about her location. For this reason, A changes the encryption key each time she communicates her location cell to the SP.
The following is a simple protocol to achieve this, but many optimizations and different solutions can be devised without affecting the main results of this paper. We assume A and B share a secret K; the actual key used to encrypt the location information is composed of a pair of integers generated with a pseudo-random number generator (PRNG) with seed K. An integer i, locally stored by A, is incremented at each proximity request, and it is used to select the generated keys with index 2i and 2i + 1. Its value is also included in the proximity request, since the locations of other buddies will need to be encrypted with a key selected according to i.
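A minimal sketch of this key-selection scheme (not the paper's exact construction): a seeded PRNG stands in for whatever cryptographically strong generator an implementation would use, and the seed, index and space sizes are illustrative:

import random

def key_for_index(K, i, size_x, size_y):
    """Derive the i-th key pair from shared secret K and reduce it to
    the translation shift actually used by encryptLocation."""
    rng = random.Random(K)                       # same seed on both sides
    stream = [rng.randrange(2**32) for _ in range(2 * i + 2)]
    kx, ky = stream[2 * i], stream[2 * i + 1]    # indices 2i and 2i+1
    return kx % size_x, ky % size_y

# Both A and B derive identical shifts for the same index i.
assert key_for_index(1234, 5, 1000, 800) == key_for_index(1234, 5, 1000, 800)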
2.3 The encryptLocation Procedure
The procedure is schematically illustrated as Procedure 1. It is used to issue requests for proximity as well as to send responses to location requests by the SP. The inputs are the location l of the user running the procedure, the grid G chosen to protect the privacy of the user, the seed K, the parameter lastIndex that takes the value of i (i.e., the index of the last key generated by the PRNG by the user running the procedure), and the optional parameter newIndex that is only defined when the procedure is used to respond to a proximity request issued by another buddy; in this case, the value of newIndex is the index of the key used by the issuing buddy. If the procedure is used to issue a request for proximity, the index is incremented. If the procedure is used to send a response to a location request, it first checks if the index used by the buddy issuing the request has ever been used. If this is the case, using the same index again could compromise the user's privacy and the procedure simply terminates, hence ignoring the request incoming from the SP. Otherwise, the key with this index is generated with the PRNG.

Procedure 1. encryptLocation
Input: a location l, a grid G, the seed K, the value lastIndex, the optional value newIndex.
Procedure:
1: if (issuing request for proximity) then
2:   i = lastIndex + 1
3: else {responding to a proximity request}
4:   if (newIndex ≤ lastIndex) then return
5:   i = newIndex
6: end if
7: kx is the 2i-th number generated by the PRNG with seed K
8: ky is the (2i + 1)-th number generated by the PRNG with seed K
9: ki = ⟨kx, ky⟩
10: c is the cell of G that contains the location l
11: c′ = Eki(c)
12: send ⟨i, c′⟩ to the SP
13: store i {for the next execution}
Fig. 1. Examples of modular translations of a point and of a cell: (a) translation of a point, (b) contiguous points, (c) non-contiguous points. Eki(c) is represented in gray.
The next three steps consist in defining the encryption key ki as the pair of integers kx and ky generated with the PRNG. Then, the cell c in which the user is located is encrypted using the function E with parameter ki. Finally, the result is sent to the SP together with the value of i, which is also stored on the client for the next run of the procedure. Before describing the encryption function, we first introduce some notation. In our approach we assume that users move in a two-dimensional space W which consists in a rectangular grid of sizex × sizey points. For each point p ∈ W, we denote with px and py the projection of p on the x and y axis, respectively. The encryption function E we propose is based on a "modular translation". The idea is to apply, to each point of c, a translation followed by a modulus operation in such a way that no point is moved outside W. For example, if a point is moved by the translation right above the top boundary of W, the modulus operation moves it right above the bottom boundary of W and hence still within W (see Figure 1(a)). The translation shift value is represented by α = ⟨αx, αy⟩, which is computed from the key ki = ⟨kx, ky⟩ as follows: αx = kx mod sizex, αy = ky mod sizey. The encryption function Eki is then specified as:

Eki(cA) = ⋃_{p∈cA} { ⟨(px + αx) mod sizex, (py + αy) mod sizey⟩ }
In practice, c′A = Eki(cA) is computed by applying a transformation to each point of cA. On the x axis, the transformation consists in shifting the point by αx and then applying the modulus sizex. On the y axis the transformation is analogous. It is worth noting that, depending on α and cA, Eki(cA) can be a set of contiguous points (see Figure 1(b)) as well as a set of non-contiguous points (see Figure 1(c)).
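A minimal sketch of the modular translation, assuming a cell is represented as a set of (x, y) integer points of W:

```python
def modular_translate(cell, key, size_x, size_y):
    # E_k: shift every point of the cell by (alpha_x, alpha_y),
    # wrapping around the boundaries of W on both axes
    kx, ky = key
    ax, ay = kx % size_x, ky % size_y
    return {((px + ax) % size_x, (py + ay) % size_y) for (px, py) in cell}
```

Because the shift is the same for every point, a contiguous cell may wrap around a boundary and come out as non-contiguous fragments, exactly the situation of Figure 1(c).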
2.4 The computeProximity Procedure
The computeProximity procedure (see Procedure 2) is run by the SP when it receives two locations encrypted with the same key.
Procedure 2. computeProximity
Input: ⟨i, c′A⟩ received from A, which issued a proximity request, and ⟨i, c′B⟩ received from B, which is responding to the request.
Procedure:
1: dist = mmd(c′A, c′B) {minimum modular distance}
2: send the boolean value (dist ≤ δA) to A
Fig. 2. Examples of modular distance: (a) on the vertical axis; (b) on the horizontal axis; (c) on both axes.
The first step of the procedure consists in computing the "minimum modular distance" between c′A and c′B as follows:

mmd(c′A, c′B) = min over p ∈ c′A, p′ ∈ c′B of modDist(p, p′)

where modDist is the modular distance between p and p′. Intuitively, the modular distance is the Euclidean distance computed as if W were "circular" on both axes. For example, consider two points p and p′ (see Figure 2(a)) with the same horizontal position, such that p is close to the top boundary of W and p′ is close to the bottom boundary. The Euclidean distance of the two points is about sizey, while the modular distance is close to zero. The same holds for the other axis (see Figure 2(b)) and also for the combination of the two axes (see Figure 2(c)). Formally, given two points p and p′, let Δx = |px − p′x| and Δy = |py − p′y|; the modular distance is defined as:

modDist(p, p′) = min( √(Δx² + Δy²), √((sizex − Δx)² + Δy²), √(Δx² + (sizey − Δy)²), √((sizex − Δx)² + (sizey − Δy)²) )

The final step of computeProximity consists in comparing the minimum modular distance between c′A and c′B with δA, the proximity threshold of A. The boolean value of this comparison is sent to A.
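A sketch of the two distance functions and of Procedure 2. Since the square root is monotone, minimizing Δx against sizex − Δx and Δy against sizey − Δy independently yields the same value as the four-term minimum above:

```python
from math import hypot

def mod_dist(p, q, size_x, size_y):
    # Euclidean distance on a torus: W is "circular" on both axes
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    return hypot(min(dx, size_x - dx), min(dy, size_y - dy))

def mmd(cell_a, cell_b, size_x, size_y):
    # minimum modular distance between two (encrypted) cells
    return min(mod_dist(p, q, size_x, size_y)
               for p in cell_a for q in cell_b)

def compute_proximity(cell_a, cell_b, delta_a, size_x, size_y):
    # Procedure 2, run by the SP: True iff mmd(c'_A, c'_B) <= delta_A
    return mmd(cell_a, cell_b, size_x, size_y) <= delta_a
```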
Procedure 3. getResult
Input: the boolean value res received from the SP, the cell c where the user running the procedure is located, the certainty region CR of the user running the protocol, the user B who responded to the proximity request.
Procedure:
1: if (res = True AND c ⊆ CR) then
2:   B is in proximity
3: else
4:   B is not in proximity
5: end if
Fig. 3. Example of the certainty region CRA
2.5 The getResult Procedure
In the getResult procedure (see Procedure 3), user A, who is running the procedure, decides whether B is in proximity or not. This result is obtained by considering the boolean value received from the SP and the relative position of the cell cA, where A is located, with respect to a region called the "certainty region" of A. This region, denoted by CRA, is the set of points of W that are farther than δA from the boundaries of W (see Figure 3). The correctness of the result computed by the getResult procedure, as well as the approximation introduced by the protocol and its safety, are discussed in Section 3.
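A sketch of the client-side decision, assuming the certainty-region test simply checks that every point of the cell is at least δA away from each boundary of W:

```python
def in_certainty_region(cell, delta, size_x, size_y):
    # CR_A: points of W farther than delta_A from every boundary of W
    return all(delta <= px <= size_x - delta and delta <= py <= size_y - delta
               for (px, py) in cell)

def get_result(res, cell, delta, size_x, size_y):
    # Procedure 3: declare proximity only when the SP answered True and
    # the cell lies inside CR_A, which rules out the modular-shift error
    return res and in_certainty_region(cell, delta, size_x, size_y)
```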
3 Analysis of the Longitude Protocol
In this section we first discuss the safety of the Longitude protocol with respect to privacy protection, and then we analyze its correctness and the approximation it introduces. We first introduce a formal proposition that will be used in the protocol analysis.

Proposition 1. Given two cells cA and cB and a key ki, the encryption function E is such that:

mmd(cA, cB) = mmd(Eki(cA), Eki(cB))

Proposition 1 intuitively states that the encryption function E presented in Section 2.3 does not alter the minimum modular distance between cA and cB: both cells are translated by the same shift (αx, αy), and the modular distance is invariant under a common modular translation.
3.1 Safety
We first analyze the privacy that the Longitude protocol provides to a user with respect to another buddy and with respect to the SP, under the assumption that the SP and the buddies do not collude. Then, we discuss the location information that is disclosed in case collusion occurs.

During the execution of the protocol, the only message that A receives containing information related to the location of a buddy B is the boolean value received from the SP as a response to A's request for the proximity of B. When A receives True from the SP (i.e., mmd(c′A, c′B) ≤ δA), due to Proposition 1, A learns that B is located in a cell cB of GB such that mmd(cA, cB) ≤ δA. Since A knows cA and GB, she can compute the set of cells where B is possibly located. Formally, A cannot exclude that B is located in any cell c of GB such that mmd(cA, c) ≤ δA. Analogously, when A receives False from the SP, A cannot exclude that B is located in any cell c of GB such that mmd(cA, c) > δA. Consequently, the minimum privacy requirement of B with respect to A is guaranteed.

As for the privacy protection with respect to the SP, it is easily seen from the protocol that the SP only learns the minimum modular distance between c′A and c′B and hence, due to Proposition 1, the minimum modular distance between cA and cB. Since this knowledge does not disclose any information about the location of A and B, we can conclude that Longitude guarantees that the SP does not acquire any information about the location of the buddies.

We now turn to consider collusion. If a user B considers all buddies as untrusted, he will probably use the same (coarse) grid for everybody. In this case, even if buddies collude, the minimum privacy requirements are guaranteed. However, if user B has different degrees of trust in her buddies (hence using different grids), and these buddies collude, the location of B could be discovered with high precision by intersecting the location information about B acquired by the colluding buddies. This can be easily avoided by imposing the following constraint on the relationship among the spatial grids used as privacy preferences: cells from different grids never partially overlap. In this case, the location of B is never disclosed with a precision higher than the finest grid among those defined for the colluding buddies. In other words, the minimum privacy requirement defined for the most trusted buddy among the colluding ones is guaranteed.

Collusion with the SP is not likely in the service model we are considering, since the SP is considered untrusted, while a certain degree of trust is assumed among the participating buddies, who indeed share a secret. In the worst case, in which the trust model is broken by a buddy A of B colluding with the SP, the SP can obtain and share with A the cell cB where B is located each time B sends this information encrypted with the secret seed K shared with A. Note that the minimum privacy requirement with respect to A is still guaranteed, and that the SP can only obtain the same location information about B available to A.
3.2 Service Precision
We now discuss the correctness of Longitude in terms of the service precision it provides. If A receives False from the SP then, according to the computeProximity
procedure, mmd(c′A, c′B) > δA. Due to Proposition 1, this means that mmd(cA, cB) > δA. Since mmd(cA, cB) is a lower bound on the real distance between A and B, it is guaranteed that B is not in proximity of A. Vice versa, if A receives True, it is not possible for A to conclude that B is in proximity, since two forms of approximation are introduced. We now explain the reason for these approximations and our choice of the conditions under which the protocol declares B's proximity; we will show in Section 4, through extensive experiments, the impact these approximations have in practice.

One form of approximation, which we call the modular-shift error, is due to the fact that the encryption function does not preserve the distance. Indeed, as shown in Figure 4(a), it can happen that, while cA is close to cB, c′A is far from c′B. This would imply that, when the SP sends True to A (i.e., mmd(c′A, c′B) ≤ δA), A does not actually know whether B is in proximity or not. However, it is easily seen that when cA is in the certainty region CRA, mmd(cA, cB) is equal to the minimum distance between cA and cB. In this case A can exclude the modular-shift error. Consequently, A knows that minDist(cA, cB) ≤ δA, and considers B as in proximity whenever True is returned by the SP and cA is contained in CRA (lines 1–2 of the getResult procedure). If True is returned but cA is not contained in CRA, then A cannot conclude that B is in proximity. As we shall see in our experimental results, this case is very rare and, as a practical and efficient solution, procedure getResult reports in this particular case that B is not in proximity. Clearly, this leads to some possible false negative responses. A technical solution to avoid this approximation, at some extra cost, is to apply a P2P protocol between A and B whenever this case arises [5].

The second form of approximation, which we call cell approximation, is due to the fact that B may not be in proximity of A even if minDist(cA, cB) ≤ δA. Figure 4(b) shows an example of this situation. The consequence of cell approximation is that, even if A knows that minDist(cA, cB) ≤ δA, she cannot be sure whether dist(loc(A), loc(B)) ≤ δA. Nevertheless, in this case A assumes B to be in proximity. This can lead to some false positive cases. In our experimental evaluation we show that for many practically useful grids GA and GB, cell approximation only minimally affects the quality of service.
Fig. 4. Two forms of approximation introduced by the Longitude protocol: (a) modular-shift error; (b) cell approximation.
4 Experimental Evaluation
We performed an extensive experimental evaluation of our solution and compared it with the Pierre protocol proposed in [7]. In our tests we evaluated the service precision, measured as the percentage of correct answers given by the protocol; the privacy, measured in terms of the size of the region in which an adversary cannot exclude any of the points as a possible location of a user; and the system costs, in terms of communication and computational costs.

Experimental setting. For the tests, we used an artificial dataset of user movements obtained with the MilanoByNight simulation (http://everywarelab.dico.unimi.it/lbs-datasim). We carefully tuned the simulator in order to reflect a typical deployment scenario of a friend finder service, i.e., 100,000 potential users of this service moving from their homes to entertainment places on the road network of Milan during a weekend night. All the test results shown in this section are obtained as average values computed over 1,000 users, each of them using the service during the 4 hours of the simulation. Locations are sampled every 2 minutes. The total size of the map is 215 km² and the average density is 465 users/km². Our techniques were implemented in Java, and tests were performed on a 64-bit Windows Server 2003 machine with a 2.4 GHz Intel Core 2 Quad processor and 4 GB of shared RAM.

To represent different levels of privacy, we considered eleven different levels of grids. A grid of level 0 consists of 1024 × 1024 cells having a square shape with an edge of about 15 meters. The grid covers the whole map. Grids of level l are obtained by grouping together 2^l cells of the level-0 grid on each dimension. For example, the grids of level 2 are obtained by grouping together 4 × 4 cells of the level-0 grid, starting from the cell positioned at the lower left corner. The grid of level 10 contains only one cell covering the entire map.

Table 1. Parameter values

Parameter                 | Values
δ                         | 125 m, 250 m, 500 m, 1000 m
Level of GA grid          | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Average number of buddies | 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
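The grid hierarchy described above lends itself to a direct computation. The following sketch is illustrative only, using the 15 m level-0 edge of this setup; it returns the index of the level-l cell containing a point:

```python
def grid_cell(point, level, base_edge=15.0):
    # a level-l cell groups 2**l level-0 cells per dimension,
    # so its edge is base_edge * 2**level metres
    edge = base_edge * (2 ** level)
    return int(point[0] // edge), int(point[1] // edge)
```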
For the sake of simplicity, in our tests we assume that all the users share the same parameters. In particular, in each test we fix a single value of δ (the proximity threshold) and of GA for all the users. Table 1 shows the most relevant parameters of our experimental evaluation; default values are denoted in bold.

Evaluation of service precision. Figure 5(a) shows the service precision for different levels of GA using our protocol. It can be observed that for small values of GA the percentage of correct answers is close to 100%. In particular, using our parameters, the service precision is always above 95% when the size of a cell of GA is less than 1 km².
Fig. 5. Service precision: (a) percentage of correct answers (Precision %) for different levels of GA; (b) service precision (Precision %) of Longitude and Pierre for different proximity thresholds (m).
In Figure 5(b) we compare the service precision of the Pierre protocol and of Longitude for different proximity thresholds. The idea of the Pierre protocol is that the plane is divided into a grid of cells, where the edge of a cell is equal to the proximity threshold δA requested by the issuing user A. After a two-party secure computation between A and another user B, A learns whether B is located in the same grid cell or in one of the adjacent ones. The Pierre protocol is subject to a form of approximation similar to the cell approximation. However, in the case of the Pierre protocol, the approximation depends on the value of δA. Consequently, as shown in Figure 5(b), the precision of the Pierre protocol decreases for large values of the proximity threshold. In contrast, the precision of the Longitude protocol is not significantly affected by the proximity threshold.

Evaluation of privacy. Although the minimum privacy requirement is always guaranteed when using Longitude, it is desirable for a user to obtain as much privacy as possible. We measure the privacy as the size of the uncertainty region, i.e., the size of the region in which an adversary cannot exclude any of the points as a possible location of a user; the larger this region is, the better. Figure 6(a) shows the privacy obtained by a user A for different levels of GA with respect to another user B when the SP notifies B that A is in proximity. We can observe that Longitude always achieves more privacy than the minimum required. Even when using a GA of level zero, which is the minimum possible privacy requirement, the average area of uncertainty is around 0.85 km². This is approximately the area of the circle centered at B's location and having radius equal to the proximity threshold.

Evaluation of system costs. To evaluate the computational costs, we analyze the computation time needed when a user updates her location, both on the client and the server side. The main parameter affecting this cost is the number of buddies. In Figure 6(b) we can observe that, as expected, the computation time grows linearly with the number of buddies. It should be observed that the computation time is around 0.1 milliseconds with 100 buddies on the desktop machine we used in our tests.
Fig. 6. Evaluation of privacy and of performance: (a) evaluation of privacy (uncertainty area in km² for different levels of the GA grid, Longitude vs. the minimum privacy requirement); (b) client-side computation time (ms) for a growing number of buddies; (c) average number of sent/received messages per user (Longitude vs. Pierre) for a growing number of buddies.
We are confident that the computational cost remains sustainable also on high-end mobile devices. We also measured the server-side computation time and observed that it grows linearly with the number of buddies and that, even when a user has 100 buddies, the server-side computation time is less than 0.3 ms.

The metric we considered to evaluate the communication costs is the total number of messages exchanged by each user (see Figure 6(c)). It can be observed that the number of messages sent by the Pierre and the Longitude protocols grows linearly with the number of buddies. However, the number of messages required by the Pierre protocol is almost double with respect to Longitude. This is due to the fact that, with n denoting the number of buddies, a user needs 2n messages when issuing a proximity query using Pierre, and another 2n messages to reply to the proximity requests issued by all the buddies. On the contrary, when using Longitude, a user needs only 2 messages when issuing a proximity query, while 2n messages are still needed to communicate the encrypted locations requested by the SP for all the buddies.
5 Conclusion and Future Work
We believe that the Longitude protocol presented and validated in this paper can be a technical solution for users who would like to enjoy proximity services,
but do not necessarily trust the service providers or the security of their infrastructure, as well as for those who want to have more control over the information released to buddies.

The solution proposed in this paper is based on a centralized architecture that enables optimizations that are not possible with a P2P solution. Some optimizations have already been proposed, in a centralized architecture, to enhance the system performance at the cost of revealing some location information to the SP (e.g., the SP-Filtering protocol presented in [5]). The centralized architecture can also be exploited to provide other forms of optimization that do not reveal any location information to the SP. One such optimization is based on the following idea: if the users in a set form a clique (i.e., each pair of users in the set are buddies), then a single "group key" can be used instead of a key for each pair of buddies. This can significantly reduce the number of locations that need to be sent to the SP, hence reducing computation and communication costs. We leave the evaluation of the performance improvement obtained with this optimization as future work.

Several other issues deserve further investigation. The specification of privacy preferences in terms of spatial granularities requires the study of a user interface that should be at the same time intuitive and effective in graphically showing the uncertainty regions. From a technical point of view, several details need a deeper investigation, including the choice of an adequate PRNG, the refinement of the protocol to provide protection against sophisticated cryptanalysis, as well as time constraints on successive runs of the protocol to prevent attacks based on historical correlation. Another direction we are considering is the extension of our architecture to a hybrid architecture in which the Longitude protocol is coupled with P2P algorithms to improve the service precision in particular situations.
Acknowledgments

The authors would like to thank the reviewers for their very helpful comments. This work was partially supported by the Italian MIUR under grants PRIN-2007F9437X and InterLink II04C0EC1D, and by the National Science Foundation under grant CNS-0716567.
References

1. Amir, A., Efrat, A., Myllymaki, J., Palaniappan, L., Wampler, K.: Buddy tracking – efficient proximity detection among mobile friends. Pervasive and Mobile Computing 3(5), 489–511 (2007)
2. Kalnis, P., Ghinita, G., Mouratidis, K., Papadias, D.: Preventing location-based identity inference in anonymous spatial queries. IEEE Transactions on Knowledge and Data Engineering 19(12), 1719–1733 (2007)
3. Liu, K., Giannella, C., Kargupta, H.: An attacker's view of distance preserving maps for privacy preserving data mining. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 297–308. Springer, Heidelberg (2006)
4. Mascetti, S., Bettini, C., Freni, D., Wang, X.S.: Spatial generalization algorithms for LBS privacy preservation. Journal of Location Based Services 2(1), 179–207 (2008)
5. Mascetti, S., Bettini, C., Freni, D., Wang, X.S., Jajodia, S.: Privacy-aware proximity based services. In: Proc. of the 10th International Conference on Mobile Data Management, pp. 31–40. IEEE Computer Society, Los Alamitos (2009)
6. Ruppel, P., Treu, G., Küpper, A., Linnhoff-Popien, C.: Anonymous user tracking for location-based community services. In: Hazas, M., Krumm, J., Strang, T. (eds.) LoCA 2006. LNCS, vol. 3987, pp. 116–133. Springer, Heidelberg (2006)
7. Zhong, G., Goldberg, I., Hengartner, U.: Louis, Lester and Pierre: Three protocols for location privacy. In: Borisov, N., Golle, P. (eds.) PET 2007. LNCS, vol. 4776, pp. 62–76. Springer, Heidelberg (2007)
L-Cover: Preserving Diversity by Anonymity

Lei Zhang¹, Lingyu Wang², Sushil Jajodia¹, and Alexander Brodsky¹

¹ Center for Secure Information Systems, George Mason University, Fairfax, VA 22030, USA
{lzhang8,jajodia,brodsky}@gmu.edu
² Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
[email protected]
Abstract. To release micro-data tables containing sensitive data, generalization algorithms are usually required for satisfying given privacy properties, such as k-anonymity and l-diversity. It is well accepted that k-anonymity and l-diversity are proposed for different purposes, and the latter is a stronger property than the former. However, this paper uncovers an interesting relationship between these two properties when the generalization algorithms are publicly known. That is, preserving l-diversity in micro-data generalization can be done by preserving a new property, namely l-cover, which is to satisfy l-anonymity in a special way. The practical impact of this discovery is that it may potentially lead to better heuristic generalization algorithms, in terms of efficiency and data utility, which remain safe even when publicized.
1 Introduction

The micro-data release problem has attracted much attention due to increasing concerns over personal privacy. Various generalization techniques have been proposed to transform a micro-data table containing sensitive information so as to satisfy given privacy properties, such as k-anonymity [16] and l-diversity [2]. For example, in the micro-data table shown in Table 1, suppose each patient's medical condition is to be kept confidential. The attributes can thus be classified into three classes, namely, identity (Name), quasi-identifiers (ZIP, Age), and sensitive value (Condition). Clearly, simply hiding the identity (Name) when releasing the table is not sufficient. A tuple and its sensitive value may still be linked to a unique identity through the quasi-identifiers, if the combination (ZIP, Age) happens to be unique [16]. To prevent such a linking attack, the table needs to be generalized to satisfy k-anonymity. For example, if generalization (A) in Table 2 is released, then any linking attack can at best link an identity to a group of two tuples with the same combination (ZIP, Age).

We can also see from the example that k-anonymity by itself is not sufficient, since linking an identity to the second group will reveal the condition of that identity to be cancer. This problem is addressed in generalization (B), by satisfying both k-anonymity and l-diversity. However, the situation is worse in practice. As recently pointed out by Zhang et al. [22], an adversary can still deduce that both Clark and Diana have cancer, if it is publicly known that generalization (A) is first considered (but not released) before generalization (B) is considered and released.
Table 1. An Example of Micro-Data Table

Name  | ZIP   | Age | Condition
Alice | 22030 | 60  | flu
Bob   | 22031 | 50  | tracheitis
Clark | 22032 | 40  | cancer
Diana | 22035 | 35  | cancer
Ellen | 22045 | 34  | pneumonia
Fen   | 22055 | 33  | gastritis
Table 2. Two Potential Table Generalizations ZIP Age Condition 22030∼22031 50∼60 flu tracheitis 22032∼22035 35∼40 cancer cancer 22045∼22055 33∼34 pneumonia gastritis (A)
ZIP Age Condition 22030∼22032 40∼60 flu tracheitis cancer 22035∼22055 33∼35 cancer pneumonia gastritis (B)
This complication makes the problem of micro-data release more challenging when the algorithm used to compute the disclosed data is assumed to be publicly known. In [22], the authors give a comprehensive study of how a privacy property can be guaranteed in this situation, and they also prove that it is an NP-hard problem to optimize data utility while guaranteeing the l-diversity property. How to design heuristic algorithms is also discussed, and one heuristic generalization algorithm that remains safe when publicized is presented in [22]. However, as shown in [22], the proposed algorithm is not practical because of the limited data utility it provides.

In this paper, we uncover an interesting relationship between k-anonymity and l-diversity, which can be used to design better generalization algorithms in terms of efficiency and data utility, while guaranteeing the property of l-diversity even when the algorithm itself is publicized. More specifically, our contribution is twofold.

First, we propose a novel strategy for micro-data release. That is, instead of trying to select the generalization with the "best" data utility from all possible generalizations, we first restrict our possible selections to a subset of all possible generalizations, and then optimize the data utility within that subset. Certainly, we guarantee only a local optimality in the restricted set instead of the global optimality. We prove that, as long as the restricted subset satisfies certain properties, the security/privacy of the publicized result will not be affected by whether the applied generalization algorithm is publicized or not.

Second, we introduce the property of l-cover, defined on a set of generalizations, which is an anonymity-like property obtained when exchanging the roles of identity and sensitive value in a micro-data table. We prove that, in order to guarantee the property of l-diversity on the released data, it is sufficient to have the above subset of generalizations satisfy the property of l-cover. We also show, through examples, that in practice
we do not need to compute the entire subset of generalizations that satisfies l-cover. Instead, we only need to construct anonymity groups of size l for sensitive values, which can be done efficiently in advance, and check whether a candidate generalization breaks these groups when optimizing the data utility. Therefore, this technique can potentially be used to design more practical heuristic generalization algorithms compared to the algorithms proposed in [22].

Organization. In Section 2, we define our model and examine relevant concepts. In Section 3, we propose a novel strategy for micro-data release. In Section 4, we formalize the concept of l-cover and employ it to compute safe generalizations. We discuss related work in Section 5 and draw conclusions in Section 6.
2 The Model

We first define our notations for the micro-data table and generalization. We then discuss privacy properties and how they may disclose information.

2.1 Micro-data Table and Generalization

A micro-data table is a relation T(i, q, s), where i, q, and s are called the identity, quasi-identifier, and sensitive value, respectively (see Table 3 for a list of important notations used in this paper). Note that both q and s can be a sequence of attributes. We use I, Q, S for the projections Πi(T), Πq(T), and Πs(T), respectively. Unless explicitly stated otherwise, all projections in this paper preserve duplicates; therefore, both Q and S are actually multisets. Let Riq, Rqs, Ris denote the three projections Πi,q(T), Πq,s(T), and Πi,s(T), respectively.

As typically assumed, I, Q and the relation Riq are considered public knowledge. We also assume S to be publicly known, since any released generalization will essentially disclose S anyway (we do not consider suppression). On the other hand, the relations Ris and Rqs both remain secret until a generalization is released. Between them, Ris is considered the private information, whereas Rqs is considered the utility information.

Table 3. A List of Important Notations

T(i, q, s) or T     | Micro-data table
I, Q, S             | Projections Πi(T), Πq(T), Πs(T)
Riq, Rqs, Ris       | Projections Πi,q(T), Πq,s(T), Πi,s(T)
GQ                  | Quasi-identifier generalization
GT = (GQ, GS, GRqs) | Table generalization
GQL                 | Locally safe set
GQA                 | Candidate set
T (a set of tables) | Disclosure set
The goal of a generalization algorithm is usually to disclose as much information about Rqs as possible, while still guaranteeing the secrecy or uncertainty of information about Ris. We need to explicitly distinguish the two stages, in order, of a generalization process, namely, quasi-identifier generalization and table generalization, as formalized in Definition 1. The key difference is the following. A quasi-identifier generalization GQ only generalizes the publicly known Q, and thus contains information only from the publicly known Q. On the other hand, a table generalization GT generalizes the entire micro-data table, containing information from both the secret relations Ris and Rqs. This difference will be critical to our further discussion.

Definition 1. (Quasi-Identifier Generalization and Table Generalization) Given a micro-data table T(i, q, s), we define
– a quasi-identifier generalization GQ as any partition on Q, and
– a table generalization GT as a triple (GQ, GS, GRqs), where
  • GQ is a quasi-identifier generalization,
  • GS is a partition on S, and
  • GRqs ⊆ GQ × GS is a one-to-one relation, such that for all (gq, gs) ∈ GRqs, there exists a one-to-one relation R ⊆ gq × gs satisfying R ⊆ Rqs.

2.2 Privacy Properties

k-Anonymity. The concept of k-anonymity [16] mainly concerns the size of each group in GQ. More precisely, a table generalization GT satisfies k-anonymity if ∀gq ∈ GQ, |gq| ≥ k. Notice that this condition depends only on GQ. Therefore, if GQ is publicly known, anyone may determine whether a table generalization GT computed based on GQ satisfies k-anonymity, even without knowing the entire GT. As a result, we have the following claim, which is straightforward.

Claim 1. Given a micro-data table T and a quasi-identifier generalization GQ, to disclose the fact that a table generalization GT computed based on GQ violates k-anonymity does not provide any additional information about T.

l-Diversity. The concept of l-diversity [2] concerns the diversity of sensitive values that can be linked to each identity. In particular, we shall focus on entropy l-diversity, which requires the entropy of the values in a multiset S to be no less than log l, that is,

−Σ s∈BagToSet(S) (count(s, S)/|S|) · log(count(s, S)/|S|) ≥ log l,

where count(s, S) is the number of appearances of s in S and BagToSet(S) is the set of distinct values in S. For a table generalization GT = (GQ, GS, GRqs), l-diversity is applied to each group in GS. That is, GT satisfies entropy l-diversity if every gs ∈ GS satisfies entropy l-diversity.

Clearly, unlike k-anonymity, l-diversity depends not only on GQ but also on GS and GRqs. Therefore, without knowing a table generalization GT, it is impossible to check whether GT satisfies l-diversity simply based on the knowledge about the corresponding GQ. In other words, the fact that a table generalization GT violates l-diversity may provide additional information about T, even though GT itself is not known.
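The entropy check is straightforward to implement; a minimal sketch:

```python
from collections import Counter
from math import log

def entropy_l_diverse(values, l):
    # entropy l-diversity of a multiset of sensitive values:
    # -sum_s (count(s)/n) * log(count(s)/n) >= log(l)
    n = len(values)
    entropy = -sum((c / n) * log(c / n) for c in Counter(values).values())
    return entropy >= log(l)
```

For instance, entropy_l_diverse(['flu', 'tracheitis'], 2) holds, whereas entropy_l_diverse(['cancer', 'cancer'], 2) does not, since the latter multiset has zero entropy.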
Such a disclosure is in the form of knowledge about unsafe groups, which is formalized in Definition 2.

Definition 2. (Unsafe Group) Given a micro-data table T, a multiset of quasi-identifiers Q′ ⊆ Q is said to be an unsafe group with respect to entropy l-diversity if |Q′| ≥ l and S′ = {s : (q, s) ∈ Rqs, q ∈ Q′} does not satisfy entropy l-diversity.

Clearly, no multiset S′ of size less than l can ever satisfy entropy l-diversity. The following claim is then straightforward.

Claim 2. Given a micro-data table T and a quasi-identifier generalization GQ, to disclose the fact that a table generalization GT computed based on GQ violates entropy l-diversity will
– not provide additional information about T, if GT also violates l-anonymity;
– provide the additional information about T that there exists at least one unsafe group in GQ, if GT satisfies l-anonymity.

P-Safety. As illustrated in Section 1, when a generalization algorithm is publicly known, enforcing k-anonymity and l-diversity on the released table generalization is not sufficient. More specifically, if an algorithm is known to have considered i − 1 table generalizations computed based on GQ1, GQ2, ..., GQi−1 before it finally releases GT = (GQi, GSi, GRqsi), then l-diversity can no longer be evaluated on each group gs ∈ GSi alone. For instance, for generalization (B) in Table 2, when we evaluate l-diversity on each group of three conditions, we are actually assuming that an adversary can only guess the secret micro-data table (that is, Table 1) from generalization (B) alone. Therefore, any table not in conflict with generalization (B) will be a valid guess. However, if it is a known fact that generalization (A) has been considered but not released, the adversary can drop any guessed table if it would make generalization (A) satisfy the required l-diversity.

More generally, the concept of disclosure set depicts the set of all possible guesses about a secret micro-data table T when the generalization algorithm is publicly known [22]. The concept of P-safety (where P can be any privacy property, such as l-diversity) then defines the correct way of evaluating a privacy property based on the disclosure set. We repeat the definitions of disclosure set and P-safety as follows.

Definition 3. (Disclosure Set and P-Safety) Given a micro-data table T and a generalization algorithm that will consider the quasi-identifier generalizations GQ1, GQ2, ..., GQn in the given order for satisfying a given privacy property P, we say
– the disclosure set T of a table generalization GT is the set of all micro-data tables for which the generalization algorithm will also output GT;
– a table generalization GT is P-safe if, for all identities i′ ∈ I, P is satisfied on the multiset Si′ = {s′ : (i′, s′) ∈ Πi,s(T′), T′ ∈ T}, where T is the disclosure set of GT.
The concept of P-safety guarantees the desired privacy property to be satisfied even when the applied generalization algorithm is publicly known. However, the cost is high: to find an optimal table generalization that satisfies a privacy property, such as entropy l-diversity, is generally an NP-hard problem [22]. Also, as discussed in [22], it is even hard to design an efficient heuristic algorithm that provides practical data utility. In the rest of this paper, we propose a different and more efficient strategy to address this issue.
3 A Novel Strategy for Micro-data Release

We now consider a different strategy for micro-data release that decouples privacy preservation from data utility optimization. Roughly speaking, instead of optimizing the data utility over all possible quasi-identifier generalizations, it will first find a subset of the generalizations that satisfies two conditions: (1) every quasi-identifier generalization in the subset will yield a table generalization satisfying the given privacy property; (2) the given privacy property will still hold even if the whole subset of quasi-identifier generalizations is known to satisfy the first condition. Once such a subset of quasi-identifier generalizations is found, any data utility optimization can be done freely inside this collection, without worrying about whether the generalization algorithm is publicized or not.

Note that a subset of generalizations of the above form can be so large that computing it explicitly is not practical. In the next section, we will show that, in practice, we can replace such computations by verifying whether a candidate generalization satisfies a proposed new property, which can be done efficiently.

3.1 Locally Safe Set

First, we consider the collection of quasi-identifier generalizations each of which generalizes the given micro-data table into a safe table generalization, as formalized in Definition 4.

Definition 4. (Locally Safe Set) Given a micro-data table T and a desired entropy l-diversity, the locally safe set of quasi-identifier generalizations is the set GQL = {GQ : the table generalization of T computed based on GQ satisfies entropy l-diversity}.

Consider the example shown in Figure 1, which depicts the multisets of quasi-identifiers Q and sensitive values S of a micro-data table. Assume entropy 2-diversity is the desired privacy property. We can then compute the locally safe set GQL; for example, GQL includes GQ = {{q1, q3}, {q2, q4}, {q5, q6}}.

Next, assume an adversary has full knowledge about GQL itself (note that the knowledge about GQ ∈ GQL is different from the knowledge about the table generalization computed based on GQ). If this knowledge does not violate the desired entropy l-diversity, then any optimization of the generalization function for best data utility will never violate the desired entropy l-diversity. The reason is that the optimization process is now simulatable: the adversary, with the knowledge about GQL and the publicly known data utility metric, can repeat the optimization process and obtain the same result. In other words, we have the following claim, which is straightforward:
Fig. 1. An example of Q and S
Claim 3. If disclosing the locally safe set GQL does not violate the desired entropy l-diversity, then any optimization of the generalization function for best utility within GQL will not violate entropy l-diversity.

However, the knowledge about GQL may indeed violate entropy l-diversity. First of all, by Claim 1, we know that to disclose all quasi-identifier generalizations whose corresponding table generalizations satisfy l-anonymity will not disclose any information. We call this the candidate set of quasi-identifier generalizations.

Definition 5. (Candidate Set) Given a micro-data table T and a desired entropy l-diversity, the candidate set of quasi-identifier generalizations is the set GQA = {GQ : the table generalization of T computed based on GQ satisfies l-anonymity}.

Therefore, the knowledge about GQL is equivalent to knowing GQA \ GQL, which is the set of quasi-identifier generalizations whose corresponding table generalizations satisfy l-anonymity but violate entropy l-diversity. By Claim 2, the knowledge about GQL may therefore violate entropy l-diversity. For example, in Figure 1, if we disclose GQL, anyone can notice that any partition of Q that contains the subset {q1, q2}, or {q1, q2, qx} for any qx ∈ Q, will never appear in GQL. On the other hand, for any other two-element set {qx, qy} (qx, qy ∈ Q), there always exists at least one GQ ∈ GQL such that {qx, qy} ∈ GQ. Since the multiset of sensitive values S is public knowledge, anyone knows there is only one value s1 that appears twice in S. Therefore, anyone can determine the facts: (q1, s1), (q2, s1) ∈ Rqs.

3.2 Globally Safe Set

We now study the condition under which the knowledge about GQL is safe. Consider the example shown in Figure 2 and assume entropy 2-diversity. Clearly, there are two sets of quasi-identifiers, each of which contains two elements, that will not appear in the locally safe set GQL: these are {q1, q2} and {q3, q6}. Interestingly, this time the knowledge about GQL will only indicate that one of the following two facts holds:
– (q1, s1), (q2, s1), (q3, s2), (q6, s2) ∈ Rqs
– (q1, s2), (q2, s2), (q3, s1), (q6, s1) ∈ Rqs
Fig. 2. Another example of Q and S
Since the above two facts are equally likely to be true, the knowledge about GQL does not violate entropy 2-diversity in the sense of Definition 6. In the definition, the set of tables T can be regarded as the disclosure set (see Section 2.2) of GQL; that is, T is the set of micro-data tables not in conflict with the fact that GQL is the locally safe set.

Definition 6. (Globally Safe Set) Given a micro-data table T and a set of quasi-identifier generalizations GQ, let T be the set of tables such that for all T′ ∈ T,
– T′ has the same I, Q, S, and Riq as T does, and
– the table generalization of T′ computed based on every GQ ∈ GQ satisfies entropy l-diversity.
We say GQ is a globally safe set of quasi-identifier generalizations if ∀i′ ∈ I, l-diversity is satisfied on the multiset Si′ = {s′ : (i′, s′) ∈ Πi,s(T′), T′ ∈ T}.

By Claim 3, if a locally safe set of quasi-identifier generalizations GQL happens to be also globally safe, then any optimization of the generalization function for best data utility within GQL will not violate entropy l-diversity. However, we also know from the above discussion that GQL is not always globally safe. Therefore, we need to further restrict the optimization of the generalization function to subsets of GQL that are globally safe.
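On toy instances such as the examples above, Definition 6 can be checked by brute force. The following sketch is illustrative only; it reuses entropy_l_diverse from the sketch in Section 2.2 and, since Riq is public and one-to-one, identifies each identity with its quasi-identifier. It enumerates every assignment of the public multiset S to Q under which all partitions in the given set stay entropy l-diverse, then tests l-diversity over the values each quasi-identifier could take:

```python
from itertools import permutations

def diverse_under(GQ, assign, l):
    # does the table generalization induced by partition GQ satisfy
    # entropy l-diversity under the assignment q -> sensitive value?
    return all(entropy_l_diverse([assign[q] for q in group], l)
               for group in GQ)

def globally_safe(Q, S, GQ_set, l):
    # Definition 6 by exhaustive search (exponential; toy inputs only)
    surviving = []
    for perm in set(permutations(S)):
        assign = dict(zip(Q, perm))
        if all(diverse_under(GQ, assign, l) for GQ in GQ_set):
            surviving.append(assign)
    return all(entropy_l_diverse([a[q] for a in surviving], l) for q in Q)
```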
4 l-Cover

We now introduce the concept of l-cover for finding globally safe sets of quasi-identifier generalizations. Recall that our strategy has two stages:
1. Find a globally safe set GQ ⊆ GQL.
2. Compute a table generalization GT based on some GQ ∈ GQ such that GT has the optimal data utility.
Correspondingly, we have two forms of l-cover: weak l-cover and l-cover.

4.1 Weak l-Cover

In Figure 2, we can observe that two values, s1 and s2, both appear twice in the multiset S = {s1, s1, s2, s2, s3, s4}.
Fig. 3. Cover from sensitive values with different number of appearances
Therefore, the corresponding two sets of quasi-identifiers {q1, q2} and {q3, q6} provide a cover for each other, in the sense that they cannot be distinguished based on the knowledge about GQL. In this case, sensitive values having exactly the same number of appearances in S cover each other. However, this is not a necessary condition. Consider another example, shown in Figure 3, and assume entropy 2-diversity. We have the locally safe set GQL = {GQ1, GQ2, GQ3}, where GQ1 = {{q1, q3}, {q2, q4}}, GQ2 = {{q1, q4}, {q2, q3}}, and GQ3 = {{q1, q2, q3, q4}}. We can observe that {q1, q2} never appears in any of the quasi-identifier generalizations in GQL, due to their identical sensitive value s1. Moreover, the set {q3, q4} never appears either. Therefore, {q3, q4} becomes a cover of {q1, q2}, even though their corresponding sensitive values have different numbers of appearances. More generally, we define the concept of cover as follows.

Definition 7. (Cover) Given a set of quasi-identifier generalizations GQ on Q, we say Q′ ⊆ Q and Q′′ ⊆ Q provide cover for each other if
– Q′ ∩ Q′′ = ∅, and
– there exists a bijection fcover : Q′ → Q′′ satisfying the following: for any Qx ∈ GQ, GQ ∈ GQ, there exists GQ′ ∈ GQ such that (Qx \ (Q′ ∪ Q′′)) ∪ fcover(Qx ∩ Q′) ∪ fcover⁻¹(Qx ∩ Q′′) ∈ GQ′. Note that if Qx ∩ (Q′ ∪ Q′′) = ∅, then GQ′ = GQ naturally exists.

Interestingly, the way we provide a cover for a set of quasi-identifiers Q′ is similar to providing "anonymity" to sensitive values. In other words, the concept of cover is similar to anonymity if we exchange the roles of identity and sensitive value in the micro-data table. Therefore, analogous to k-anonymity, we have the metric of weak l-cover in Definition 8.

Definition 8. (Weak l-Cover) Given a micro-data table T and a set of quasi-identifier generalizations GQ, GQ is said to satisfy weak l-cover if, for any Q′ ⊆ Q satisfying ∃s′ ∈ S, Q′ = {q′ : (q′, s′) ∈ Rqs}, we have
– there exist at least l − 1 covers of Q′: Q1, Q2, ..., Ql−1, and
– ∀j ≠ j′, Qj ∩ Qj′ = ∅.

Claim 4 states that weak l-cover is a sufficient condition for a globally safe set (this condition is also suspected to be necessary). Intuitively, each sensitive value (and its number of appearances) is blended into the sensitive values of its l − 1 or more covers. The knowledge about the quasi-identifier generalizations GQ thus will not violate l-diversity.
Claim 4. A set of quasi-identifier generalizations GQ is a globally safe set with respect to entropy l-diversity if GQ satisfies weak l-cover.

Proof Sketch: Consider the set of tables T and the multiset of sensitive values Si′ as defined in Definition 6. Let s be the most frequent element in Si′. Based on the definition of weak l-cover, the set of quasi-identifiers Q′ = {q′ : q′ ∈ Q, (q′, s) ∈ Rqs} has l − 1 covers, Q1, ..., Ql−1, each of which has a corresponding bijection fi : Q′ → Qi (1 ≤ i ≤ l − 1). Therefore, for any table T′ ∈ T satisfying (i′, q′, s) ∈ T′, there must exist T1 ∈ T such that (i′, q′, s1) ∈ T1 and s ≠ s1, where s1 is the value associated with fi(q′) in T′. Similarly, we can obtain s2, ..., sl−1 corresponding to each cover of Q′. Therefore, there exist at least l − 1 other different sensitive values that have the same number of appearances as s does in Si′. The property of entropy l-diversity is thus satisfied.

From Claim 3 and Claim 4, we immediately have the following.

Claim 5. A generalization algorithm will not violate entropy l-diversity while optimizing data utility within a set of quasi-identifier generalizations GQ that satisfies weak l-cover.

Note that, among the previous examples, those shown in Figure 2 and Figure 3 satisfy l-cover, whereas the one in Figure 1 does not. Therefore, a (deterministic) generalization algorithm may violate entropy l-diversity when it attempts to disclose a quasi-identifier generalization with optimal data utility for the micro-data table shown in Figure 1, even if the corresponding table generalization is not yet disclosed.

4.2 l-Cover

From the previous discussions, we will optimize data utility within a globally safe set of quasi-identifier generalizations. Once this optimization process finishes, we will need to compute and release a table generalization based on the optimal quasi-identifier generalization. However, such a disclosure introduces additional knowledge about the secret micro-data table, and may violate the desired entropy l-diversity.

First, consider the example shown in Figure 3, which has a locally safe set GQL that is also globally safe. Assume the optimization of the generalization function has found that, inside GQL, the quasi-identifier generalization GQ1 = {{q1, q3}, {q2, q4}} is optimal. We can thus compute the table generalization shown in Table 4.

Table 4. Table Generalization for Figure 3

Quasi-Identifier | Sensitive Value
q1               | s1
q3               | s2
q2               | s1
q4               | s3

With this table generalization disclosed, the set of quasi-identifiers {q3, q4} is still a cover of {q1, q2}. That is, an adversary still cannot tell which of them is associated with
both appearances of the sensitive value s1. Therefore, in this particular case, the table generalization in Table 4 can be safely released.

However, this is not always the case. Releasing a table generalization may violate the privacy property that has been satisfied in the process of finding a globally safe set and optimizing data utility. Consider the example shown in Figure 2. Assume that GQ1 = {{q1, q3}, {q2, q4}, {q5, q6}} is the optimal quasi-identifier generalization. Based on GQ1, we can compute the table generalization shown in Table 5.

Table 5. Table Generalization for Figure 2

Quasi-Identifier | Sensitive Value
q1               | s1
q3               | s2
q2               | s1
q4               | s3
q5               | s2
q6               | s4

Recall that during the discussion about Figure 2, we have shown that the locally safe set of quasi-identifier generalizations GQL is also globally safe. More specifically, {q1, q2} and {q3, q6} provide cover for each other; an adversary thus cannot tell which of them is associated with s1 and which with s2. However, if the table generalization in Table 5 is disclosed, then clearly, since {q5, q6} is associated with {s2, s4}, q6 cannot be associated with s1 in the micro-data table. Therefore, the following must be true: (q1, s1), (q2, s1), (q3, s2), (q6, s2) ∈ Rqs, which violates entropy l-diversity.

In the above example, the table generalization in Table 5 contains extra information that is not part of the knowledge about GQL. Therefore, the table generalization computed based on GQ1 cannot be safely released, even though GQL is globally safe. To prevent such cases, we should not consider quasi-identifier generalizations like GQ1 for the optimization of data utility. Instead, the optimization process should be confined to a subset of the globally safe set GQL that satisfies a stronger condition, as formalized in Definition 9.

Definition 9. (l-Cover) Given a micro-data table T and a set of quasi-identifier generalizations GQ, GQ is said to satisfy l-cover if, for any Q′ ⊆ Q satisfying ∃s′ ∈ S, Q′ = {q′ : (q′, s′) ∈ Rqs}, we have
– there exist at least l − 1 covers of Q′: Q1, Q2, ..., Ql−1,
– ∀j ≠ j′, Qj ∩ Qj′ = ∅, and
– ∀GQ ∈ GQ, ∀Qx ∈ GQ, |Qx ∩ Q′| = |Qx ∩ Qj| (j = 1, 2, ..., l − 1).

The property of l-cover basically requires a set of quasi-identifier generalizations GQ to satisfy both weak l-cover and an additional condition, namely, that the disclosure of a table generalization computed based on any GQ ∈ GQ will not include any extra information that is not part of the knowledge about GQ. Any such table generalization can thus be safely released. More formally, we have the following.

Claim 6. Given a micro-data table T, a set of quasi-identifier generalizations GQ satisfying l-cover, and any GQ ∈ GQ, let GT = (GQ, GS, GRqs) be the table
generalization of T computed based on GQ. Also, let T be the set of all tables having the same I, Riq, Q, S and the same table generalization GT when computed based on GQ. We then have that entropy l-diversity is satisfied on the multiset Si′ = {s′ : (i′, s′) ∈ Πi,s(T′), T′ ∈ T} for all i′ ∈ I.

Proof Sketch: The proof of this claim is similar to that of Claim 4, except that, by Definition 9, the disclosure of the table generalization GT does not allow an adversary to disregard any quasi-identifier generalization in GQ.

In addition, we can make another interesting observation about those sensitive values that appear exactly once in S: as long as each group in the table generalization has more than l different sensitive values, the sensitive values that appear only once are protected by l-cover. Therefore, in practice we only need to be concerned with the sensitive values that appear multiple times. From Claim 5 and Claim 6, the following argument is now straightforward.

Claim 7. A generalization algorithm will not violate entropy l-diversity while disclosing a table generalization with the optimal data utility, if the optimization process is confined to a set of quasi-identifier generalizations that satisfies l-cover.

Since the locally safe set of quasi-identifier generalizations GQL does not always satisfy l-cover, to preserve the property of entropy l-diversity we may need to find a subset of GQL that does. Note that, to avoid the huge complexity of computing the entire GQL, we can: (1) construct in advance l covers for any sensitive value that appears more than once; (2) check whether a given generalization is contained in a GQL that satisfies l-cover, by checking whether the property of l-cover can be violated by the given generalization, based on the previously constructed l-covers and the generalizations that have already been considered. We shall leave detailed methods and the study of the corresponding performance to our future work. Nonetheless, by following this approach, we will not face the NP-hard problem of preserving entropy l-diversity with publicized algorithms pointed out in [22], and we expect to obtain "better" heuristic algorithms, in terms of data utility and efficiency, that guarantee entropy l-diversity when publicized.
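A fragment of this candidate check can be sketched as follows. It is illustrative only and covers just the third condition of Definition 9 (even intersection sizes), assuming the cover groups — each a list [Q′, Q1, ..., Ql−1] of disjoint quasi-identifier sets — have already been constructed:

```python
def breaks_cover_groups(GQ, cover_groups):
    # reject a candidate partition GQ if one of its groups Qx intersects
    # the members of a pre-built cover group with uneven sizes, which
    # would violate |Qx ∩ Q'| = |Qx ∩ Qj| in Definition 9
    for Qx in GQ:
        for covers in cover_groups:
            sizes = {len(set(Qx) & set(c)) for c in covers}
            if len(sizes) > 1:
                return True
    return False
```

A candidate partition that passes this test must still satisfy the cover condition of Definition 7, so this is a necessary filter rather than a complete decision procedure.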
5 Related Work

The initial works [1,3,7,9,10] were concerned with conducting data census while protecting the privacy of sensitive information in disclosed tables. Two approaches, data swapping [6,14,19] and data suppression [11], were suggested to protect data, but could not quantify how well the data is protected. The work [5] gave a formal analysis of the information disclosure in data exchange. The work [16] showed that publishing data sets even without identifying attributes can cause privacy breaches, and suggested a new notion of privacy called k-anonymity. Achieving k-anonymity with the best data utility was proved to be NP-hard [13]. A similar measure, called blending in a crowd, was proposed by [18]. The work [21] proposed a new generalization framework based on the concept of "personalized anonymity." In addition, many works, e.g., [4,15,16,12,17,8], proposed efficient algorithms for k-anonymity. The work [2] discussed the deficiency of k-anonymity as a measure of privacy, proposed an alternative property of l-diversity
to ensure privacy protection in micro-data disclosure, and demonstrated that algorithms developed for k-anonymity can also be used for l-diversity. The above works, however, did not take into account that the disclosure algorithm and sequence may be known to the adversary. The work [22] provides a comprehensive analysis of both safety and complexity of disclosure algorithms for micro-data disclosure under such an assumption. Another work [20] tackles a similar issue but in a more specific problem setting.
6 Conclusion

We have uncovered the similarity between k-anonymity and l-diversity under a novel strategy for micro-data release. More specifically, we have proposed to confine the optimization of the generalization function for best data utility to a globally safe subset of all possible quasi-identifier generalizations. This approach decoupled privacy preservation from data utility optimization, which essentially simplified both. To find a globally safe set, we have provided the concept of l-cover and shown that to satisfy this novel property is basically to satisfy l-anonymity in a special way. This result may lead to "better" heuristic algorithms than existing solutions in terms of data utility and efficiency, while guaranteeing data privacy with publicized algorithms. Our future work will focus on algorithm design and performance study.
Acknowledgment

Lei Zhang and Sushil Jajodia were partially supported by the National Science Foundation under grants CT-0716567, CT-0716323, and CT-0627493, and by the Air Force Office of Scientific Research under grants FA9550-07-1-0527 and FA9550-08-1-0157. We thank the anonymous reviewers for their valuable comments to improve this paper.
References

1. Dobra, A., Feinberg, S.E.: Bounding entries in multi-way contingency tables given a set of marginal totals. In: Foundations of Statistical Inference: Proceedings of the Shoresh Conference 2000. Springer, Heidelberg (2003)
2. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: Privacy beyond k-anonymity. In: Proceedings of the 22nd IEEE International Conference on Data Engineering, ICDE 2006 (2006)
3. Slavkovic, A., Feinberg, S.E.: Bounds for cell entries in two-way tables given conditional relative frequencies. In: Domingo-Ferrer, J., Torra, V. (eds.) PSD 2004. LNCS, vol. 3050, pp. 30–43. Springer, Heidelberg (2004)
4. Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., Zhu, A.: k-anonymity: Algorithms and hardness. Technical report, Stanford University (2004)
5. Miklau, G., Suciu, D.: A formal analysis of information disclosure in data exchange. In: SIGMOD (2004)
6. Duncan, G.T., Feinberg, S.E.: Obtaining information while preserving privacy: A Markov perturbation method for tabular data. In: Joint Statistical Meetings, Anaheim, CA (1997)
7. Fellegi, I.P.: On the question of statistical confidentiality. Journal of the American Statistical Association 67(337), 7–18 (1993)
8. LeFevre, K., DeWitt, D., Ramakrishnan, R.: Incognito: Efficient full-domain k-anonymity. In: SIGMOD (2005)
9. Cox, L.H.: Solving confidentiality protection problems in tabulations using network optimization: A network model for cell suppression in the U.S. economic censuses. In: Proceedings of the International Seminar on Statistical Confidentiality (1982)
10. Cox, L.H.: New results in disclosure avoidance for tabulations. In: International Statistical Institute Proceedings (1987)
11. Cox, L.H.: Suppression, methodology and statistical disclosure control. Journal of the American Statistical Association (1995)
12. Sweeney, L.: k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5), 557–570 (2002)
13. Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: ACM PODS (2004)
14. Diaconis, P., Sturmfels, B.: Algebraic algorithms for sampling from conditional distributions. Annals of Statistics (1998)
15. Samarati, P.: Protecting respondents' identities in microdata release. IEEE TKDE, pp. 1010–1027 (2001)
16. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, CMU, SRI (1998)
17. Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: ICDE (2005)
18. Chawla, S., Dwork, C., McSherry, F., Smith, A., Wee, H.: Toward privacy in public databases. In: Kilian, J. (ed.) TCC 2005. LNCS, vol. 3378, pp. 363–385. Springer, Heidelberg (2005)
19. Dalenius, T., Reiss, S.: Data swapping: A technique for disclosure control. Journal of Statistical Planning and Inference 6, 73–85 (1982)
20. Wong, R.C.-W., Fu, A.W.-C., Wang, K., Pei, J.: Minimality attack in privacy preserving data publishing. In: VLDB 2007: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 543–554 (2007)
21. Xiao, X., Tao, Y.: Personalized privacy preservation. In: SIGMOD (2006)
22. Zhang, L., Jajodia, S., Brodsky, A.: Information disclosure under realistic assumptions: Privacy versus optimality. In: ACM Conference on Computer and Communications Security, CCS (2007)
Author Index
Bertino, Elisa 49, 68
Bettini, Claudio 142
Brodsky, Alexander 158
Canim, Mustafa 1
Celikel, Ebru 49
Chapman, Adriane 17
Cherif, Asma 89
Chinis, George 122
Dai, Chenyun 49
Freni, Dario 142
Fundulaki, Irini 122
Han, Weili 68
Imine, Abdessamad 89
Inan, Ali 1
Ioannidis, Sotiris 122
Jajodia, Sushil 158
Jin, Xin 33
Kantarcioglu, Murat 1, 49
Koromilas, Lazaros 122
LeFevre, Kristen 17
Lin, Dan 49
Mascetti, Sergio 142
Ni, Qun 68
Osborn, Sylvia L. 33
Power, David 107
Rusinowitch, Michaël 89
Sandhu, Ravi 68
Simpson, Andrew 107
Slaymaker, Mark 107
Thuraisingham, Bhavani 49
Wang, Lingyu 158
Wu, Garfield Zhiping 33
Xu, Shouhuai 68
Zhang, Jing 17
Zhang, Lei 158