Association Rule Hiding for Data Mining
ADVANCES IN DATABASE SYSTEMS Volume 41
Series Editors Ahmed K. Elmagarmid Purdue University West Lafayette, IN 47907
Amit P. Sheth Wright State University Dayton, OH 45435
For other titles published in this series, please visit www.springer.com/series/5573
Association Rule Hiding for Data Mining
by
Aris Gkoulalas-Divanis
IBM Research GmbH - Zurich, Rueschlikon, Switzerland
Vassilios S. Verykios University of Thessaly, Volos, Greece
Aris Gkoulalas-Divanis Information Analytics Lab IBM Research GmbH - Zurich Saumerstrasse 4 8803 Rueschlikon Switzerland
Vassilios S. Verykios Department of Computer and Communication Engineering University of Thessaly Glavani 37 & 28th Octovriou Str. GR 38221 Volos Greece
ISSN 1386-2944
ISBN 978-1-4419-6568-4
e-ISBN 978-1-4419-6569-1
DOI 10.1007/978-1-4419-6569-1
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010927402
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Dedicated to my parents, Aspa and Dimitris. — Aris Gkoulalas–Divanis.
Dedicated to Jenny and Aggelos. — Vassilios S. Verykios.
Preface
Since its inception, privacy preserving data mining has been an active research area of increasing popularity in the data mining community. This line of research investigates the side effects of existing data mining technology that stem from its intrusion into the privacy of individuals and organizations. From a general point of view, privacy issues related to the application of data mining can be classified into two main categories, namely data hiding and knowledge hiding. Data hiding methodologies are related to the data per se, aiming to remove confidential or private information from the data prior to its publication. Knowledge hiding methodologies, on the other hand, are concerned with sanitizing data whose mining by the existing data mining tools would otherwise disclose confidential and private knowledge.
In this book, we provide an extensive survey of a specific class of privacy preserving data mining methods that belong to the knowledge hiding thread and are collectively known as association rule hiding methods. “Association rule hiding” (a term commonly used for brevity instead of the longer title “frequent itemset and association rule hiding”) was first mentioned in 1999 in a workshop paper presented by Atallah et al. In this work, the authors tried to apply general ideas about the implications of data mining for the security and privacy of information, first discussed by Clifton and Marks in 1996, to the association rule mining framework proposed by Agrawal and Srikant. Clifton and Marks, following the suggestions of D.E. O’Leary (1991), who was the very first to point out the security and privacy breaches that originate from data mining algorithms, indicated the need to consider different data mining approaches under the prism of preserving the privacy of information, and proposed different ways to accomplish this.
Since then, a large body of research has emerged that involves novel approaches for hiding sensitive association rules in the data. Due to the combinatorial nature of the problem, the proposed methodologies range from simple, time- and memory-efficient heuristics and border-based approaches to exact hiding algorithms that offer guarantees on the quality of the computed hiding solution, albeit at an increased computational cost. The focus of this book is mostly on the latter type of approaches, since it is also the most recent.
Book Organization
The book consists of 21 chapters, which are organized in four parts. Each part ends with a brief summary that reviews the material covered in that part. We tried to keep each chapter of the book self-contained so as to provide maximum reading flexibility.
The first part of the book presents some fundamental concepts for the understanding of the problem of association rule hiding, as well as the key classes of proposed solution methodologies. It is composed of five chapters. Chapter 1 introduces the reader to association rule hiding and motivates this line of research. Next, Chapter 2 formally sets out the problem and provides the necessary notation and terminology for its proper understanding, while Chapter 3 sheds light on the different classes of association rule hiding methodologies that have been investigated over the years. Following that, Chapter 4 briefly discusses privacy preserving methodologies in related areas of research, specifically in the areas of classification, clustering and sequence mining. Last, Chapter 5 summarizes the contents of the first part.
The second part of the book contains three chapters and covers the heuristic class of association rule hiding methodologies. Two main directions of heuristic methodologies have been investigated over the years: support-based and confidence-based distortion schemes (presented in Chapter 6), which operate by adding specific items to, or removing them from, transactions of the original database, and support-based and confidence-based blocking approaches (covered in Chapter 7), which replace original values with question marks, thereby reducing the confidence of attackers regarding the existence (or nonexistence) of specific items in transactions of the original database. Chapter 8 summarizes the contents of the second part of the book.
The use of the theory of borders of the frequent itemsets to support association rule hiding is the key principle behind the border-based class of approaches, presented in detail in the third part of the book. This part comprises four chapters. Chapter 9 elucidates the process of border revision and presents a set of algorithms that can be directly applied to the association rule mining framework to allow for the efficient computation of the original and the revised borders of the frequent itemsets. Following that, Chapters 10 and 11 present two specific border-based methodologies for association rule hiding and demonstrate how they operate. A short summary of this part is given in Chapter 12.
The last part of the book is devoted to exact association rule hiding methodologies, which are the only ones to offer guarantees on the quality of the identified hiding solution. Since this line of research is also the most recent one, this class of approaches is intentionally presented in finer detail. This part comprises seven chapters. Chapter 13 presents the first work to move from pure heuristics towards exact knowledge hiding by combining elements of the two worlds. Next, Chapters 14, 15 and 16 study in detail three exact association rule hiding methodologies that formulate the problem of association rule hiding as a pure optimization problem and offer quality guarantees on the identified hiding solution. A notable drawback of exact hiding methodologies, mainly attributed to their optimization nature, is their high computational and memory requirements. Chapter 17 discusses a decomposition and parallelization framework that has been recently developed to ameliorate these disadvantages by decomposing large optimization problems into smaller subproblems that can be solved concurrently, without sacrificing the quality of the computed hiding solution. Following that, Chapter 18 presents a systematic layered approach for quantifying the privacy that is offered by the exact hiding algorithms. The proposed approach allows data owners to decide on the level of privacy they wish to achieve as a tradeoff against the distortion that the hiding process induces in the database. In this way, the exact algorithms can effectively shield all the sensitive association rules with the least possible damage to data utility. Chapter 19 summarizes the contributions of this part.
The Epilogue comprises two chapters that wrap up the discussion presented in this book. In Chapter 20, we give a summary of the presented material and emphasize the most important topics that were covered. Finally, Chapter 21 elaborates on a number of interesting and promising directions for future research in the area of association rule hiding.
Intended Audience
We believe that this book will be suitable for course instructors and for undergraduate and postgraduate students studying association rule hiding, knowledge hiding and privacy preserving data mining at large. Moreover, it is expected to be a valuable companion to researchers and professionals working in this research area, as it provides an overview of the current research accomplishments by presenting them under a new perspective, building its way up from theory to practice and from simple heuristic methodologies to more advanced exact knowledge hiding approaches that have been proposed for association rule hiding. Finally, practitioners working on the development of association rule hiding methodologies for database systems can benefit from this book by using it as a reference guide to decide upon the hiding approach that best fits their needs, as well as to gain insight into the eccentricities and peculiarities of the existing methodologies.
How to Study This Book
The order of presentation is also the proposed reading order of the material. This way, the reader can start from the basics and incrementally build his or her way up to more advanced topics, including the different classes of hiding approaches, the underlying principles of the proposed methodologies and their main differences. However, depending on the reader’s expertise in the area, it is possible to focus directly on the topic of interest and thus skip either the first part of the book or the parts devoted to other classes of approaches. Evidently, if the reader wishes to go deep into the details of a certain topic, then the provided references should be consulted.
More precisely, undergraduate students can focus on the first two parts of the book to gain a good understanding of the fundamental concepts related to association rule hiding and grasp the main characteristics of some of the most popular heuristic methodologies in this area. Postgraduate students and researchers will find the third and fourth parts of the book more interesting, as they cover more recent developments in association rule hiding. Specifically, the fourth part of the book reviews the most recent line of research in association rule hiding, which is expected to benefit radically from advances in the area of optimization techniques. Course instructors and researchers are encouraged to study all the material in order to select the most suitable parts of the book required for class presentation or further research in the area.
Acknowledgments
This work aims to summarize the most interesting research accomplishments that took place in the area of association rule hiding during the ten years of its existence. Since this is the first book on the market that specifically targets association rule hiding, we would like to express our deep gratitude to Ahmed K. Elmagarmid, Chris Clifton, Yucel Saygin, Osmar R. Zaïane, Charu Aggarwal and Francesco Bonchi for cordially embracing our book proposal and for offering constructive comments that helped us improve the overall quality of the manuscript. We are also indebted to Susan Lagerstrom-Fife and Jennifer Maurer from Springer for their great support towards the preparation and completion of this work. Their editing suggestions were valuable in improving the organization, readability and appearance of the manuscript. Finally, we would like to express our deep love and gratitude to our families for their understanding throughout the duration of this project. We really hope that this book will serve as a valuable resource to researchers, graduate students and professors interested in the area of association rule hiding and privacy preserving data mining at large, as well as a motivating companion to senior undergraduate students who wish to study the theory and methods in this research area.
Aris Gkoulalas-Divanis, Vanderbilt University, Nashville, USA. Vassilios S. Verykios, University of Thessaly, Volos, GREECE. January 2010
Contents

Part I Fundamental Concepts
1 Introduction
  1.1 Privacy Preserving Data Mining
  1.2 Association Rule Hiding
  1.3 Research Challenges
  1.4 Summary
2 Background
  2.1 Terminology and Preliminaries
  2.2 Problem Formulation and Statement
    2.2.1 Goals of Association Rule Hiding Methodologies
    2.2.2 Problem Statement
3 Classes of Association Rule Hiding Methodologies
  3.1 Classification Dimensions
  3.2 Classes of Association Rule Hiding Algorithms
4 Other Knowledge Hiding Methodologies
  4.1 Classification Rule Hiding
  4.2 Privacy Preserving Clustering
  4.3 Sequence Hiding
5 Summary

Part II Heuristic Approaches
6 Distortion Schemes
7 Blocking Schemes
8 Summary

Part III Border Based Approaches
9 Border Revision for Knowledge Hiding
10 BBA Algorithm
  10.1 Hiding Goals
  10.2 Hiding Process
    10.2.1 Weighing Border Itemsets
    10.2.2 Hiding a Sensitive Itemset
    10.2.3 Order of Hiding Itemsets
  10.3 Algorithm
11 Max–Min Algorithms
  11.1 Hiding a Sensitive Itemset
  11.2 Order of Hiding Itemsets
  11.3 Max–Min 1 Algorithm
  11.4 Max–Min 2 Algorithm
12 Summary

Part IV Exact Hiding Approaches
13 Menon’s Algorithm
  13.1 Exact Part
    13.1.1 CSP Formulation
    13.1.2 CSP Decomposition
  13.2 Heuristic Part
14 Inline Algorithm
  14.1 Privacy Model
  14.2 Solution Methodology
    14.2.1 Problem Size Minimization
    14.2.2 Reduction to CSP and the BIP solution
    14.2.3 An Example
  14.3 Experiments and Results
    14.3.1 Datasets
    14.3.2 Evaluation Methodology
    14.3.3 Experimental Results
    14.3.4 Discussion on the Efficiency of the Inline Algorithm
15 Two–Phase Iterative Algorithm
  15.1 Background
  15.2 Removal of Constraints from Infeasible CSPs
  15.3 An Example
  15.4 Experimental Evaluation
16 Hybrid Algorithm
  16.1 Knowledge Hiding Formulation
    16.1.1 Hybrid Hiding Methodology
    16.1.2 A Running Example
  16.2 Main Issues Pertaining to the Hiding Methodology
    16.2.1 Size of Database Extension
    16.2.2 Optimal Solutions in the Hybrid Methodology
    16.2.3 Revision of the Borders
    16.2.4 Problem Size Reduction
    16.2.5 Handling Suboptimality
  16.3 Hybrid Solution Methodology
    16.3.1 Problem Size Minimization
    16.3.2 Adjusting the Size of the Extension
    16.3.3 Formulation and Solution of the CSP
    16.3.4 Minimum Extension and Validity of Transactions
    16.3.5 Treatment of Suboptimality in Hiding Solutions
    16.3.6 Continuing the Running Example
  16.4 A Partitioning Approach to Improve Scalability
    16.4.1 Partitioning Methodology
    16.4.2 An Example
  16.5 Experimental Evaluation
17 Parallelization Framework for Exact Hiding
  17.1 The Parallelization Framework
    17.1.1 Structural Decomposition of the CSP
    17.1.2 Decomposition of Large Independent Components
    17.1.3 Parallel Solving of the Produced CSPs
  17.2 Computational Experiments and Results
18 Quantifying the Privacy of Exact Hiding Algorithms
19 Summary

Part V Epilogue
20 Conclusions
21 Roadmap to Future Work

References
Index
List of Tables

2.1 Notation table.
9.1 An original database $D_O$.
9.2 Frequent itemsets for database $D_O$ at mfreq = 0.3.
13.1 An example of a constraints-by-transactions matrix.
14.1 The original database $D_O$ used in the example.
14.2 The intermediate form of database $D_O$ used in the example.
14.3 The three exact (and optimal) solutions for the CSP of the example.
14.4 The characteristics of the three datasets.
14.5 The experimental results for the three datasets.
15.1 The original database $D_O$.
15.2 The intermediate database of $D_O$.
15.3 The database $D_H$.
15.4 Intermediate form of database $D_H$.
15.5 Database D produced by the two–phase iterative approach.
16.1 Sanitized database D as a mixture of the original database $D_O$ and the applied extension $D_X$.
16.2 Frequent itemsets for $D_O$ and $D_X$ at msup = 3.
16.3 The intermediate form of database $D_X$.
16.4 Database $D_X$ after the solution of the CSP.
16.5 Database $D_X$ as the union of $D_1$ and $D_2$ produced by the partitioning approach.
16.6 The characteristics of the four datasets.
17.1 The original database $D_O$.
17.2 The intermediate form of database $D_O$.
17.3 Constraints matrix for the CSP.
18.1 An example of a sanitized database D produced by the inline algorithm [23], which conceals the sensitive itemsets S = {B, CD} at a frequency threshold of mfreq = 0.3.
List of Figures

2.1 Database D along with its itemsets and related association rules.
2.2 Examples of two lattices for a database with (i) I = {a, b, c}, and (ii) I = {a, b, c, d}. In the first lattice we also demonstrate the positive and the negative borders when the database D is the same as in Fig. 2.1 and msup = 4.
3.1 A taxonomy of association rule hiding approaches along four dimensions.
3.2 The three classes of association rule hiding algorithms.
9.1 An itemsets’ lattice demonstrating (i) the original border and the sensitive itemsets, and (ii) the revised border.
13.1 CSP formulation for the exact part of Menon’s algorithm [47].
13.2 The original CSP formulation for the example.
13.3 The first independent block for the CSP of Figure 13.2.
13.4 The second independent block for the CSP of Figure 13.2.
14.1 The architectural layout for the inline approach.
14.2 The CSP formulation as an optimization process.
14.3 The Constraints Degree Reduction approach.
14.4 The CSP formulation for dataset $D_O$.
15.1 The architectural layout for the two–phase iterative approach.
15.2 The CSP for the second stage of the two–phase iterative algorithm.
15.3 The two phases of iteration for the considered example.
15.4 Distance between the inline and the two–phase iterative algorithm.
16.1 CSP formulation as an optimization process.
16.2 The Constraints Degree Reduction approach.
16.3 Expansion of the CSP to ensure validity of transactions.
16.4 The constraints in the CSP of the running example.
16.5 Quality of the produced hiding solutions.
16.6 Scalability of the hybrid hiding algorithm.
16.7 Size of extension $D_X$.
16.8 Number of constraints in the CSP.
16.9 Distance of the three hiding schemes.
16.10 Performance of the partitioning approach.
17.1 Decomposing large CSPs to smaller ones.
17.2 CSP formulation for the presented example.
17.3 Equivalent CSPs for the provided example.
17.4 An example of decomposition using articulation points.
17.5 Three-way decomposition using weighted graph partitioning.
17.6 An example of parallel solving after decomposition.
17.7 Performance gain through parallel solving, when omitting the V part of the CSP.
17.8 Performance gain through parallel solving of the entire CSP.
18.1 A layered approach to quantifying the privacy that is offered by the exact hiding algorithms.
18.2 The modified CSP for the inline algorithm that guarantees increased safety for the hiding of sensitive knowledge.
List of Algorithms

9.1 Computation of the large itemsets and the original negative border.
9.2 Computation of the positive border (original and revised) $Bd^{+}(F)$.
9.3 Computation of the revised negative border $Bd^{-}(F'_D)$ of D.
9.4 Hiding of all the sensitive itemsets and their supersets.
10.1 The Border Based Approach of Sun & Yu [66, 67].
11.1 The Max–Min 1 Algorithm of Moustakides & Verykios [50, 51].
11.2 The Max–Min 2 Algorithm of Moustakides & Verykios [50, 51].
13.1 The Decomposition Algorithm proposed in [47].
13.2 The Intelligent Sanitization Approach of [47].
14.1 Relaxation Procedure in V.
16.1 Validation of Transactions in $D_X$.
16.2 Relaxation Procedure in $V = Bd^{+}(F'_D)$.
Part I
Fundamental Concepts
Chapter 1
Introduction
The significant advances in data collection and data storage technologies have provided the means for the inexpensive storage of enormous amounts of transactional data in data warehouses that reside in companies and public sector organizations. Apart from the benefit of using this data per se (e.g., for keeping up-to-date profiles of the customers and their purchases, maintaining a list of the available products, their quantities and price, etc.), the mining of these datasets with the existing data mining tools can reveal invaluable knowledge that was unknown to the data holder beforehand. The extracted knowledge patterns can provide insight to the data holders and prove invaluable in important tasks, such as decision making and strategic planning. Moreover, companies are often willing to collaborate with other entities that conduct similar business, towards the mutual benefit of their businesses. Significant knowledge patterns can be derived and shared among the partners through the collaborative mining of their datasets. Furthermore, public sector organizations and civilian federal agencies usually have to share a portion of their collected data or knowledge with other organizations having a similar purpose, or even make this data and/or knowledge public in order to comply with certain regulations. For example, in the United States, the National Institutes of Health (NIH) [2] endorses research that leads to significant findings which improve human health and provides a set of guidelines which sanction the timely dissemination of NIH-supported research findings for use by other researchers.
At the same time, the NIH acknowledges the need to maintain privacy standards and, thus, requires NIH-sponsored investigators to disclose data collected or studied in a manner that is “free of identifiers that could lead to deductive disclosure of the identity of individual subjects” [2] and deposit it to the Database of Genotype and Phenotype (dbGaP) [45] for broad dissemination.
Another example of the benefits of data sharing comes from the business world. Wal-Mart, a major retailer in the United States, and Procter & Gamble (P&G), an international manufacturer, decided in 1988 to share information and knowledge across their mutual supply chains in order to better coordinate their common activities. As Grean & Shaw discuss in [31], the partnership of the two companies improved their business relationship, reduced the need for inventories and thereby drove down the associated costs, and led to a large increase in the joint sales of the two companies. For example, by mining Wal-Mart’s customer sales data, P&G helped Wal-Mart focus on selling items that were preferred by its customers and eliminate other products that were rarely sold. By 1990 the two companies had significantly improved their common business towards their mutual profitability. The large amount of customer data owned by Wal-Mart paved the way for more business collaborations between the company and external partners in the following years. Based on its business strategy, Wal-Mart shared (or sometimes sold) portions of its customer sales data with large market-research companies. This action, however, later proved to be a bad idea. As pointed out in [35], the business managers of Wal-Mart soon found out that the dissemination of customer sales data to their partners sometimes led to a leakage of strategic business knowledge to business competitors. Indeed, the strategic sales information of Wal-Mart was found in several cases to have been used for the preparation of industry-wide reports that were broadly disseminated, even to Wal-Mart’s business competitors. The disclosure of business trade secrets soon led Wal-Mart to decide that the sharing of its data was harming its own business.
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_1, © Springer Science+Business Media, LLC 2010
As is evident from the previous discussion, there exists an extended set of application scenarios in which collected data or knowledge patterns extracted from the data have to be shared with other (possibly untrusted) entities to serve owner-specific or organization-specific purposes.
The sharing of data and/or knowledge may come at a cost to privacy, for two main reasons: (i) if the data refers to individuals (e.g., as in detailed patient-level clinical data derived from electronic medical records or customers’ market basket data collected from a supermarket) then its disclosure can violate the privacy of the individuals who are recorded in the data if their identity is revealed to untrusted third parties or if sensitive knowledge about them can be mined from the data, and (ii) if the data concerns business (or organizational) information, then disclosure of this data or any knowledge extracted from the data may potentially reveal sensitive trade secrets, whose knowledge can provide a significant advantage to business competitors and thus can cause the data owner to lose business to his or her peers. The aforementioned privacy concerns in the course of data mining are significantly amplified by the fact that simple de-identification¹ of the original data prior to its mining has been proven insufficient in several cases to guarantee a privacy-aware outcome. Indeed, intelligent analysis of the data through inference-based attacks may uncover sensitive patterns to untrusted entities. This can be mainly achieved by utilizing external and publicly available sources of information (e.g., the yellow pages, patient demographics and discharge summaries, or other public reports) in conjunction with the released data or knowledge to re-identify individuals or uncover hidden knowledge patterns [21,49]. Thus, compliance with privacy regulations requires the incorporation of advanced and sophisticated privacy preserving methodologies.
¹ Data de-identification refers to the process of removing obvious identifiers from the data (e.g., names, social security numbers, addresses, etc.) prior to its disclosure. A typical de-identification strategy for patient-specific data is based on the Safe Harbor standard of the Health Insurance Portability and Accountability Act (HIPAA) [1], whereby records are stripped of a number of potential identifiers, such as personal names and geocodes.
1.1 Privacy Preserving Data Mining
Privacy preserving data mining is a relatively new research area in the data mining community, counting approximately a decade of existence. It investigates the side-effects of data mining methods that originate from the penetration into the privacy of individuals and organizations. Since the pioneering work of Agrawal & Srikant [8] and Lindell & Pinkas [43] in 2000, several approaches have been proposed in the research literature for offering privacy in data mining. The majority of the proposed approaches can be classified along two principal research directions: (i) data hiding approaches and (ii) knowledge hiding approaches. The first direction collects methodologies that investigate how the privacy of raw data, or information, can be maintained before the course of mining the data. The approaches of this category aim at the removal of confidential or private information from the original data prior to its disclosure and operate by applying techniques such as perturbation, sampling, generalization or suppression, transformation, etc. to generate a sanitized counterpart of the original dataset. Their ultimate goal is to enable the data holder to receive accurate data mining results when it is not provided with the real data, or to adhere to specific regulations pertaining to microdata publication (e.g., as is the case of publishing patient-specific data). The second direction of approaches involves methodologies that aim to protect the sensitive data mining results (i.e., the extracted knowledge patterns) that were produced by the application of data mining tools on the original database, rather than the raw data itself.
This direction of approaches mainly deals with distortion and blocking techniques that prohibit the leakage of sensitive knowledge patterns in the disclosed data, as well as with techniques for downgrading the effectiveness of classifiers in classification tasks, such that the produced classifiers do not reveal any sensitive knowledge.
1.2 Association Rule Hiding
In this book, we focus on the knowledge hiding thread of privacy preserving data mining and study a specific class of approaches which are collectively known as frequent itemset and association rule hiding approaches. Other classes of approaches under the knowledge hiding thread include classification rule hiding, clustering model hiding, sequence hiding, and so on. An overview of these methodologies can be found in [4, 28, 70, 73]. Some of these methodologies are also surveyed as part of Chapter 4 of this book.
Association rules are implications that hold in a transactional database under certain user-specified parameters that account for their significance. Significant association rules provide knowledge to the data miner as they effectively summarize the data, while uncovering any hidden relations (among items) that hold in the data. The term “association rule hiding” was first mentioned in 1999 by Atallah et al. [10] in a workshop paper on knowledge and data engineering. The authors of this work studied the problem of modifying the original database in a way that certain association rules (termed “sensitive”) disappear without, however, seriously affecting the original data and the nonsensitive rules. They proposed a number of solutions, such as fuzzification of the original database, limiting access to the original data, and releasing samples instead of the entire database. Due to the combinatorial nature of the problem, all of the solutions that the authors proposed, as well as the vast majority of solutions that have been proposed since then, were based on heuristics. Heuristic solutions, although computationally efficient and scalable, suffer from local optima issues: they optimize a specific function in each step of the algorithm, which does not guarantee finding an optimal hiding solution to the whole problem. More recently, a new direction of exact approaches to association rule hiding has been proposed. These approaches have increased time and memory requirements but in return they offer quality guarantees on the identified solution.
A Motivating Example
The following example scenario, borrowed from the work of Verykios et al. [72], complements the real world examples we presented earlier and motivates the necessity of applying association rule hiding algorithms to protect sensitive association rules against disclosure. Let us suppose that we are negotiating with Dedtrees Paper Company, as purchasing directors of BigMart, a large supermarket chain. They offer their products at reduced prices, provided that we agree to give them access to our database of customer purchases. We accept the deal and Dedtrees starts mining our data. By using an association rule mining tool, they find that people who purchase skim milk also purchase Green Paper. Dedtrees now runs a coupon marketing campaign offering a 50 cents discount on skim milk with every purchase of a Dedtrees product. The campaign cuts heavily into the sales of Green Paper, which increases its prices to us, based on the lower sales. During our next negotiation with Dedtrees, we find out that with reduced competition they are unwilling to offer us a low price. Finally, we start losing business to our competitors, who were able to negotiate a better deal with Green Paper. In other words, the aforementioned scenario indicates that BigMart should sanitize competitive information (and other important corporate secrets of course) before delivering their database to Dedtrees, so that Dedtrees does not monopolize the paper market. Similar motivating examples for association rule hiding are discussed in [18, 54, 66].
1.3 Research Challenges
The association rule hiding problem can be considered as a variation of the well established database inference control problem [21] in statistical and multilevel databases. The primary goal in database inference control is to block access to sensitive information that can be obtained through nonsensitive data and inference rules. In association rule hiding, we consider that it is not the data itself but rather the sensitive association rules that create a breach to privacy. Given a set of association rules, which are mined from a specific data collection and are considered to be sensitive by an application specialist (e.g., the data owner), the task of association rule hiding is to properly modify (or, as it is usually called, sanitize²) the original data so that any association rule mining algorithm that may be applied to the sanitized version of the data (i) will be incapable of uncovering the sensitive rules under certain parameter settings, and (ii) will be able to mine all the nonsensitive rules that appeared in the original dataset (under the same or higher parameter settings) and no other rules. The challenge that arises in the context of association rule hiding can thus be properly stated as follows: How can we modify (sanitize) the transactions of a database in a way that all the nonsensitive association rules that are found when mining this database can still be mined from its sanitized counterpart (under certain parameter settings), while, at the same time, all the sensitive rules are guarded against disclosure and no other (originally nonexistent) rules can be mined? Association rule hiding algorithms are especially designed to provide a solution to this challenging problem.
They accomplish this by introducing a small distortion to the transactions of the original database in a way that blocks the production of the sensitive association rules in its sanitized counterpart, while still allowing the mining of the nonsensitive knowledge. What differentiates the quality of one association rule hiding methodology from that of another is the actual distortion that is caused to the original database as a result of the hiding process. Ideally, the hiding process should be accomplished in such a way that the nonsensitive knowledge remains, to the highest possible degree, intact. Another very interesting problem that has been investigated recently, although not targeted at addressing privacy issues per se, offers a special solution to the association rule hiding problem. The problem is known as inverse frequent itemset mining [48].
² A dataset is said to be sanitized when it appropriately protects the sensitive knowledge from being mined, under certain parameter settings. Similarly, a transaction of a dataset is sanitized when it no longer supports any sensitive itemset or rule. Last, an item is called sanitized when it is altered in a given transaction to accommodate the hiding of the sensitive knowledge.
1.4 Summary
Since its first appearance as a problem in 1999, association rule hiding has been extensively studied by the data mining research community, leading to a large body of significant research work over the past years. The approaches that have been developed span from simple, time-efficient and memory-efficient heuristics that select transactions and items to sanitize, to more complicated and sophisticated solutions which conceive the hiding process as an optimization problem and solve it by using specific optimization techniques. The following chapter provides the background along with the necessary terminology for the formal definition of the problem. Next, in Chapter 3, we present a brief taxonomy of the various association rule hiding methodologies that have been proposed over the years, along four orthogonal dimensions. Each class of approaches is further presented in detail in the corresponding part of the book. Specifically, heuristic methodologies are covered in the second part, while the third part deals with border-based approaches. The most recent class of approaches that involves exact hiding methodologies is presented in the fourth part of the book. Finally, Chapter 4 discusses some methodologies for sensitive knowledge hiding in research areas related to that of association rule hiding and in particular in the areas of classification, clustering and sequence hiding. Last, Chapter 5 summarizes the first part of the book.
Chapter 2
Background
In this chapter we provide the background and terminology that are necessary for the understanding of association rule hiding. Specifically, in Section 2.1, we present the theory behind association rule mining and introduce the notion of the positive and the negative borders of the frequent itemsets. Following that, Section 2.2 explicitly states the goals of association rule hiding methodologies, discusses the different types of solutions that association rule hiding algorithms can produce, and delivers the formal problem statement for association rule hiding and its popular variant, frequent itemset hiding.
2.1 Terminology and Preliminaries
Association rule mining is the process of discovering sets of items (also known as itemsets) that frequently co-occur in a transactional database so as to produce significant association rules that hold for the data. Each association rule is defined as an implication of the form I ⇒ J, where I, J are frequently occurring itemsets in the transactional database, for which I ∩ J = ∅ (i.e., I and J are disjoint). The itemset I ∪ J that leads to the generation of an association rule is called the generating itemset. An association rule consists of two parts: the Left Hand Side (LHS) or antecedent, which is the part on the left of the arrow of the rule (here I), and the Right Hand Side (RHS) or consequent, which is the part on the right of the arrow of the rule (here J). Two metrics, known as support and confidence, are incorporated into the task of association rule mining to drive the generation of association rules and expose only those rules that are expected to be of interest to the data owner. In particular, the measure of support eliminates rules that are not adequately backed up by the transactions of the dataset and thus are expected to be uninteresting, i.e. occurring simply by chance. On the other hand, confidence measures the strength of the relation between the itemsets of the rule as it quantifies the reliability of the inference made by the rule [68]. A low value of confidence in a rule I ⇒ J shows that it is rather rare for itemset J to be present in transactions that contain itemset I.
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_2, © Springer Science+Business Media, LLC 2010
Table 2.1: Notation table.
Notation                    Description
ℐ = {i1, i2, ..., iM}       Universe of literals (items) with cardinality M
|R|                         Cardinality of a set R
I, J, ...                   Itemsets produced from ℐ
k–itemset                   An itemset of length (equiv. size) k
Tn = (tid, I)               A transaction with unique identifier tid and itemset I
D = {T1, T2, ..., TN}       A database consisting of N transactions
Tnm                         The m-th item of the n-th transaction in a database
I ⇒ J                       An association rule between itemsets I and J
DI                          The supporting transactions of itemset I in database D
℘( )                        The powerset of a set of literals
P = ℘(ℐ)                    The set of all possible itemsets (patterns) extracted from ℐ
sup(I, D), sup(I)           Support of itemset I in database D
freq(I, D), freq(I)         Frequency of itemset I in database D
msup, mfreq, mconf          Minimum support/frequency/confidence threshold
FD                          Set of all frequent itemsets in database D
Bd+(FD), Bd+                Positive border of FD
Bd−(FD), Bd−                Negative border of FD
Bd(FD)                      Border of FD
S                           Set of sensitive itemsets (patterns)
R                           Set of mined association rules
RS                          Set of sensitive association rules from R
Association rule mining, introduced by Agrawal et al. [5, 7] in 1993, operates by first mining all the itemsets that are frequent in the database and then using these itemsets to derive association rules that are strong enough to be considered interesting. The process of frequent itemset mining is defined as follows¹:
Let ℐ = {i1, i2, ..., iM} be a finite set of literals, called items, where M denotes the cardinality of the set. Any subset I ⊆ ℐ is called an itemset over ℐ. A k–itemset is an itemset of length (equiv. of size) k, i.e. an itemset consisting of k items. A transaction Tn over ℐ is a pair Tn = (tid, I), where I is the itemset and tid is a unique identifier, used to distinguish among transactions that correspond to the same itemset. A transactional database D = {T1, T2, ..., TN} over ℐ is an N × M table consisting of N transactions over ℐ carrying different identifiers, where entry Tnm = 1 if and only if the m-th item (m ∈ [1, M]) appears in the n-th transaction (n ∈ [1, N]). Otherwise, Tnm = 0. A transaction T = (tid, J) is said to support an itemset I over ℐ if I ⊆ J. Let DI denote the supporting transactions of itemset I in database D. Furthermore, let S be a set of items. Notation ℘(S) denotes the powerset of S, which is the set of all subsets of S. Given the universe of all items ℐ in D, we use notation P = ℘(ℐ) to refer to all possible itemsets that can be produced from ℐ. Given an itemset I over ℐ in D, sup(I, D) denotes the number of transactions T ∈ D that support I and freq(I, D) denotes the fraction of transactions in D that support I². An itemset I is called large or frequent in database D if and only if its frequency in D is at least equal to a minimum frequency threshold mfreq, set by the owner of the data. Equivalently, I is large in D if and only if sup(I, D) ≥ msup, where msup = mfreq × N. The set of frequent itemsets for a database D is denoted as FD = {I ⊆ ℐ : freq(I, D) ≥ mfreq}. All the itemsets having a frequency lower than mfreq (equivalently a support lower than msup) are called infrequent and their set is ℘(ℐ) − FD.
The second step in the process of association rule mining typically involves the identification of all the significant association rules that hold among the derived frequent itemsets. An association rule I ⇒ J is significant if it holds in database D with a confidence that is at least equal to a minimum confidence threshold mconf set by the owner of the data, i.e. when sup(I ∪ J, D) / sup(I, D) ≥ mconf. The support of this rule in D is equal to that of its generating itemset, i.e. it is equal to sup(I ∪ J, D). An example of association rule mining (borrowed from [72]) is shown in Fig. 2.1.
¹ The reader can refer to the work of Agrawal et al. [5, 7] for a more detailed presentation of the association rule mining framework as well as for a set of computationally efficient algorithms for the derivation of the association rules. Moreover, [68] provides very good coverage of this topic.
tid   Itemset
T1    abc
T2    abc
T3    abc
T4    ab
T5    a
T6    ac

Itemset   Support
a         6
b         4
c         4
ab        4
ac        4
bc        3
abc       3

Rule      Confidence   Support
b ⇒ a     100%         4
b ⇒ c     75%          3
c ⇒ a     100%         4
c ⇒ b     75%          3
b ⇒ ac    75%          3
c ⇒ ab    75%          3
ab ⇒ c    75%          3
ac ⇒ b    75%          3
bc ⇒ a    100%         3
Fig. 2.1: Database D along with its itemsets and related association rules.
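The numbers in Fig. 2.1 can be reproduced with a short brute-force sketch of the two mining steps described above. This is only an illustration, not one of the efficient algorithms of [5, 7]; the function names are hypothetical, and the thresholds msup = 3 and mconf = 75% are assumptions chosen because they yield exactly the nine rules listed in the figure.

```python
from itertools import combinations

# Database D of Fig. 2.1, one set of items per transaction id.
D = {
    "T1": {"a", "b", "c"}, "T2": {"a", "b", "c"}, "T3": {"a", "b", "c"},
    "T4": {"a", "b"}, "T5": {"a"}, "T6": {"a", "c"},
}

def sup(itemset, db):
    """Support: number of transactions in db containing every item of itemset."""
    needed = set(itemset)
    return sum(1 for items in db.values() if needed <= items)

def frequent_itemsets(db, msup):
    """Brute-force scan of the lattice; returns {itemset: support} for sup >= msup."""
    universe = sorted(set().union(*db.values()))
    found = {}
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            s = sup(combo, db)
            if s >= msup:
                found[frozenset(combo)] = s
    return found

def association_rules(db, msup, mconf):
    """Rules I => J whose generating itemset is frequent and whose
    confidence sup(I ∪ J) / sup(I) is at least mconf."""
    freq = frequent_itemsets(db, msup)
    rules = {}
    for generating, s in freq.items():
        for k in range(1, len(generating)):
            for lhs in combinations(sorted(generating), k):
                lhs = frozenset(lhs)
                conf = s / freq[lhs]  # lhs is frequent because its superset is
                if conf >= mconf:
                    rules[(lhs, generating - lhs)] = (conf, s)
    return rules

# Assumed thresholds (not stated in the figure): msup = 3, mconf = 0.75.
for (lhs, rhs), (conf, s) in association_rules(D, msup=3, mconf=0.75).items():
    print("".join(sorted(lhs)), "=>", "".join(sorted(rhs)), f"{conf:.0%}", s)
```

Under these assumed thresholds the rule a ⇒ b is not reported, since its confidence is 4/6, which is below 75%.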
Border Theory
The theory of the borders of the frequent itemsets is also very important in our discussion, as both border-based and exact hiding methodologies rely on a process inspired by this theory. Let FD be the set of all frequent itemsets in D, and P = ℘(ℐ) be the set of all patterns in the lattice of D, where ℐ is the universe of items (e.g., see Fig. 2.2). The positive border of FD, denoted as Bd+(FD), consists of all the maximally frequent patterns in P, i.e. all the patterns in FD all of whose proper supersets are infrequent. Formally, Bd+(FD) = {I ∈ FD | for all J ∈ P with I ⊂ J we have that J ∉ FD}. Respectively, the negative border of FD, denoted as Bd−(FD), consists of all the minimally infrequent patterns in P, i.e. all the patterns in P\FD all of whose proper subsets are frequent. Formally, Bd−(FD) = {I ∈ P\FD | for all J ⊂ I we have that J ∈ FD}. Finally, the border of FD, denoted as Bd(FD), is the union of these two sets: Bd(FD) = Bd+(FD) ∪ Bd−(FD)³. For example, assuming that msup = 4 in the database of Fig. 2.1, we have that Bd+(FD) = {ab, ac}, Bd−(FD) = {bc} and Bd(FD) = {ab, ac, bc}. These borders are presented in Fig. 2.2(i). Borders allow for a condensed representation of the itemsets’ lattice, identifying the key itemsets which separate all frequent patterns from their infrequent counterparts. More details on the theory of borders as well as its underlying concepts can be found in the work of Mannila & Toivonen [46].
² We will use sup(I) and freq(I) instead of sup(I, D) and freq(I, D), respectively, for notational convenience, when database D is obvious in our context.
Fig. 2.2: Examples of two lattices for a database with (i) ℐ = {a, b, c}, and (ii) ℐ = {a, b, c, d}. In the first lattice we also demonstrate the positive and the negative borders when the database D is the same as in Fig. 2.1 and msup = 4.
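The border example above can be checked with a small sketch that enumerates the lattice of the database of Fig. 2.1 directly from the two formal definitions. The function name borders is hypothetical, and the sketch assumes that the empty itemset counts as frequent, so that an infrequent single item would belong to the negative border.

```python
from itertools import combinations

# Transactions of database D from Fig. 2.1.
D = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"},
     {"a", "b"}, {"a"}, {"a", "c"}]

def sup(itemset, db):
    """Number of transactions that contain the whole itemset."""
    return sum(1 for t in db if itemset <= t)

def borders(db, msup):
    """Return (positive border, negative border) of the frequent itemsets."""
    universe = sorted(set().union(*db))
    lattice = [frozenset(c) for k in range(1, len(universe) + 1)
               for c in combinations(universe, k)]
    frequent = {x for x in lattice if sup(x, db) >= msup}

    def is_frequent(x):
        # Assumption of this sketch: the empty itemset is treated as frequent.
        return len(x) == 0 or x in frequent

    # Maximally frequent: no frequent proper superset exists.
    positive = {x for x in frequent if not any(x < y for y in frequent)}
    # Minimally infrequent: checking the subsets with one item removed is
    # enough, because every subset of a frequent itemset is itself frequent.
    negative = {x for x in lattice
                if x not in frequent and all(is_frequent(x - {i}) for i in x)}
    return positive, negative

pos, neg = borders(D, msup=4)
print(sorted("".join(sorted(x)) for x in pos))  # positive border
print(sorted("".join(sorted(x)) for x in neg))  # negative border
```

The exhaustive enumeration is exponential in the number of items, so it is only suitable for toy lattices such as the ones in Fig. 2.2.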
2.2 Problem Formulation and Statement
Having presented the necessary background for association rule mining, in this section we formally set out the problem of association rule hiding. First, in Section 2.2.1, we highlight the goals of association rule hiding algorithms, rank these goals in terms of importance of being satisfied, and discuss the side-effects that are introduced when each of them is left unsatisfied in the sanitized database. Following that, in Section 2.2.2, we deliver the formal problem statement.
³ For notational convenience, we will use Bd+ and Bd− to refer to the positive and the negative border of a set of itemsets, respectively, when the set of frequent itemsets is obvious in our context.
2.2.1 Goals of Association Rule Hiding Methodologies
Association rule hiding methodologies aim at sanitizing the original database in a way that at least one of the following goals is accomplished:
1. No rule that is considered as sensitive from the owner’s perspective and can be mined from the original database at pre-specified thresholds of confidence and support, can be also revealed from the sanitized database, when this database is mined at the same or at higher thresholds,
2. All the nonsensitive rules that appear when mining the original database at pre-specified thresholds of confidence and support can be successfully mined from the sanitized database at the same thresholds or higher, and
3. No rule that was not derived from the original database when the database was mined at pre-specified thresholds of confidence and support, can be derived from its sanitized counterpart when it is mined at the same or at higher thresholds.
The first goal requires that all the sensitive rules disappear from the sanitized database, when the database is mined under the same thresholds of support and confidence as the original database, or at higher thresholds. A hiding solution that achieves the first goal is termed feasible as it accomplishes the hiding task. The second and the third goals involve the nonsensitive rules that may be lost or generated as a side-effect of the employed sanitization process. Specifically, the second goal simply states that there should be no lost rules in the sanitized database, meaning that all the nonsensitive rules that were mined from the original database should also be mined from its sanitized counterpart at the same (or higher) levels of confidence and support. The third goal, on the other hand, states that no false rules (also known as ghost rules) should be produced when mining the sanitized database at the same (or higher) levels of confidence and support.
A false (ghost) rule is an association rule that was not among the ones mined from the original database and thus it constitutes an artifact that was generated by the hiding process. Based on these three goals, the sanitization process of a hiding algorithm has to be accomplished in a way that minimally affects the original database, preserves the general patterns and trends, and conceals all the sensitive association rules. A solution that addresses all three goals (i.e., is feasible and introduces no side-effects) is called exact. Exact hiding solutions that cause the least possible distortion (modification) to the original database are called ideal or optimal. Lastly, non-exact but feasible solutions are called approximate. As a final remark, we should point out that association rule hiding methodologies usually differ in the way they rank the aforementioned goals (especially the second and the third goal) in terms of importance of having them satisfied. With respect to the first goal, it is interesting to notice that for any database and any set of sensitive association rules there exists a feasible hiding solution, i.e. a solution that effectively hides all the sensitive association rules in the database. This means that the first goal can always be accomplished irrespective of the specific properties of the database or the peculiarities of the hiding problem. The most trivial way to identify a feasible
hiding solution in a database is to select one item from the generating itemset of each sensitive rule and delete it from all transactions of the database.
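The trivial feasible solution just described can be sketched in a few lines on the database of Fig. 2.1. The function name trivial_sanitize, the choice of b ⇒ c as the sensitive rule, and the rule of picking the alphabetically first item as the one to delete are all assumptions made for illustration.

```python
# Database D of Fig. 2.1, one set of items per transaction id.
D = {
    "T1": {"a", "b", "c"}, "T2": {"a", "b", "c"}, "T3": {"a", "b", "c"},
    "T4": {"a", "b"}, "T5": {"a"}, "T6": {"a", "c"},
}

def sup(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    needed = set(itemset)
    return sum(1 for items in db.values() if needed <= items)

def trivial_sanitize(db, sensitive_rules):
    """Delete one item of each sensitive rule's generating itemset
    from all transactions (the item choice here is arbitrary)."""
    victims = {min(lhs | rhs) for lhs, rhs in sensitive_rules}
    return {tid: items - victims for tid, items in db.items()}

# Hypothetical sensitive rule b => c, with generating itemset {b, c}.
sensitive = [(frozenset("b"), frozenset("c"))]
D_san = trivial_sanitize(D, sensitive)

print(sup("bc", D_san))  # the generating itemset is no longer supported
print(sup("ab", D_san))  # side-effect: a nonsensitive itemset is lost too
print(sup("ac", D_san))  # itemsets without the deleted item are unaffected
```

The solution is feasible for any support threshold of at least one transaction, but it is far from exact: every itemset and rule that involves the deleted item disappears as well, which is why more careful hiding strategies are needed.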
2.2.2 Problem Statement
Having presented the goals of association rule hiding methodologies, we now proceed to present the problem statement. Association rule hiding has been widely researched along two principal directions (henceforth referred to as variants). The first variant involves approaches that aim at hiding specific association rules among those mined from the original database. The second variant, on the other hand, collects methodologies that aim at hiding specific frequent itemsets from those found when applying frequent itemset mining to the original database. The two variants of the problem are very similar in nature. Indeed, concealing the sensitive association rules by hiding their generating itemsets is a common strategy that is adopted by the majority of researchers. By ensuring that the itemsets that lead to the generation of a sensitive rule become insignificant in the disclosed database, the data owner can be certain that his or her sensitive knowledge is adequately protected from untrusted third parties. In what follows, we lay out the formal statement for each variant of the problem by introducing the problem statement both in the context of association rule mining and that of frequent itemset mining.
Variant 1: Hiding sensitive itemsets
We assume that we are provided with a database DO, consisting of N transactions, and a threshold mfreq set by the owner of the data. After performing frequent itemset mining in DO with mfreq, we obtain a set of frequent patterns, denoted as FDO, among which a subset S contains patterns which are considered to be sensitive from the owner’s perspective. Given the set of sensitive itemsets S, the goal of frequent itemset hiding methodologies is to construct a new, sanitized database D from DO, which protects the sensitive itemsets S from disclosure, while minimally affecting the nonsensitive itemsets existing in FDO (i.e., the itemsets in FDO − S). The hiding of a sensitive itemset corresponds to a lowering of its statistical significance, depicted in terms of support, in the resulting database. To hide a sensitive itemset, the privacy preserving algorithm has to modify the original database DO in such a way that when the sanitized database D is mined at the same (or a higher) level of support, the frequent itemsets that are discovered are all nonsensitive.
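To make this variant concrete, the following is a minimal sketch of a naive greedy distortion on the database of Fig. 2.1. It is not one of the published hiding algorithms; the function names, the sensitive set S = {bc}, the threshold msup = 3, and the rule for choosing which item to delete are assumptions made for illustration.

```python
from itertools import combinations

# Original database DO (the transactions of Fig. 2.1).
D_O = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"},
       {"a", "b"}, {"a"}, {"a", "c"}]

def sup(itemset, db):
    """Number of transactions that contain the whole itemset."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, msup):
    """Brute-force set of all itemsets with support >= msup."""
    universe = sorted(set().union(*db))
    return {frozenset(c) for k in range(1, len(universe) + 1)
            for c in combinations(universe, k)
            if sup(frozenset(c), db) >= msup}

def hide_itemsets(db, sensitive, msup):
    """Naive greedy sanitization: for each sensitive itemset, delete one of
    its items from supporting transactions until its support is below msup."""
    sanitized = [set(t) for t in db]      # work on a copy of the database
    for s in sensitive:
        victim = min(s)                   # arbitrary choice of item to delete
        for t in sanitized:
            if sup(s, sanitized) < msup:  # already hidden, stop distorting
                break
            if s <= t:
                t.discard(victim)
    return sanitized

S = [frozenset("bc")]                     # hypothetical sensitive itemset
D = hide_itemsets(D_O, S, msup=3)
before, after = frequent_itemsets(D_O, 3), frequent_itemsets(D, 3)
print(sup(S[0], D))                       # support of bc after sanitization
print(sorted("".join(sorted(x)) for x in before - after))  # itemsets hidden
```

In this run a single item deletion suffices, and the only itemset hidden besides bc is its superset abc, which cannot remain frequent once bc is infrequent.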
Variant 2: Hiding sensitive association rules
We assume that we are provided with a database DO, consisting of N transactions, and thresholds mfreq and mconf set by the owner of the data. After performing association rule mining in DO using thresholds mfreq and mconf, we obtain a set of association rules, denoted as R, among which a subset RS of R contains rules which are considered to be sensitive from the owner’s perspective. Given the set of sensitive association rules RS, the goal of association rule hiding methodologies is to construct a new, sanitized database D from DO, which protects the sensitive association rules RS from disclosure, while minimally affecting the nonsensitive rules existing in R (i.e., those in R − RS). The hiding of a sensitive association rule corresponds to a lowering of its significance, depicted in terms of support or confidence, in the resulting database. To hide a sensitive rule, the privacy preserving algorithm modifies the original database DO in such a way that when the sanitized database D is mined at the same (or higher) levels of confidence and support, the association rules that are discovered are all nonsensitive.
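The three goals of Section 2.2.1 can be checked for a candidate sanitized database by comparing the rule sets mined before and after sanitization. The sketch below does this for the database of Fig. 2.1; the helper names, the sensitive rule b ⇒ c, the thresholds msup = 3 and mconf = 0.75, and the hand-picked sanitized database (item c deleted from the first transaction) are all assumptions made for illustration.

```python
from itertools import combinations

def sup(itemset, db):
    """Number of transactions that contain the whole itemset."""
    return sum(1 for t in db if itemset <= t)

def mine_rules(db, msup, mconf):
    """Brute-force set of rules (lhs, rhs); assumes msup >= 1."""
    items = sorted(set().union(*db))
    rules = set()
    for k in range(2, len(items) + 1):
        for combo in combinations(items, k):
            generating = frozenset(combo)
            s = sup(generating, db)
            if s < msup:
                continue
            for r in range(1, k):
                for lhs in combinations(combo, r):
                    lhs = frozenset(lhs)
                    if s / sup(lhs, db) >= mconf:
                        rules.add((lhs, generating - lhs))
    return rules

def side_effects(original, sanitized, sensitive):
    """Violations of goals 1-3: unhidden sensitive rules, lost rules, ghost rules."""
    return {
        "unhidden": sensitive & sanitized,
        "lost": (original - sensitive) - sanitized,
        "ghost": sanitized - original,
    }

D_O = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"},
       {"a", "b"}, {"a"}, {"a", "c"}]
# Hand-picked sanitized database: item c removed from the first transaction.
D = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c"},
     {"a", "b"}, {"a"}, {"a", "c"}]

R = mine_rules(D_O, msup=3, mconf=0.75)
R_S = {(frozenset("b"), frozenset("c"))}   # hypothetical sensitive rule
effects = side_effects(R, mine_rules(D, 3, 0.75), R_S)
print({name: len(found) for name, found in effects.items()})
```

In this example the sensitive rule is hidden and no ghost rules appear, but six nonsensitive rules are lost, so the solution is feasible yet only approximate in the terminology of Section 2.2.1.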
Chapter 3
Classes of Association Rule Hiding Methodologies
In this chapter, we present a taxonomy of frequent itemset and association rule hiding algorithms after having reviewed a large collection of independent works in this research area. The chapter is organized as follows. Section 3.1 presents a set of four orthogonal dimensions that we used to classify the existing methodologies by taking into consideration a number of parameters related to their workings. Following that, Section 3.2 outlines the three principal classes of association rule hiding methodologies that have been proposed over the years and discusses the main properties of each class of approaches.
Fig. 3.1: A taxonomy of association rule hiding approaches along four dimensions.
3.1 Classification Dimensions
Figure 3.1 presents a set of four orthogonal dimensions based on which we classified the existing association rule hiding algorithms. As a first dimension, we consider whether the hiding algorithm uses the support or the confidence of the rule to drive the hiding process. In this way, we separate the hiding algorithms into support-based and confidence-based. Support-based algorithms hide a sensitive association rule by decreasing the support of either the rule antecedent or the rule consequent, or by lowering the support of the rule’s generating itemset up to the point that the support of the rule drops below the minimum support threshold. Confidence-based methodologies, on the other hand, reduce the confidence of the sensitive rule either by increasing the support of the rule antecedent or by decreasing the support of the rule consequent, until the rule becomes uninteresting for association rule mining.
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_3, © Springer Science+Business Media, LLC 2010
The second dimension in our proposed classification is related to the type of modification of the raw data that is caused by the hiding algorithm. The two possible forms of modification are the distortion and the blocking of the original data values. Distortion is the process of replacing 1’s by 0’s and/or 0’s by 1’s (i.e., excluding or including specific items) in selected transactions of the original database, while blocking refers to replacing original values by question marks (called unknowns). The main difference between the two data modification strategies is that in distortion-based algorithms the data recipient cannot be certain about the truthful existence (or nonexistence) of any specific item in any transaction of the database, since any item in the database could have been included or excluded to accommodate association rule hiding. On the contrary, blocking-based algorithms make a distinction between distorted (or potentially distorted) items and unaffected items of the original database, since only the (potentially) distorted items are replaced by unknowns. This property makes blocking-based algorithms especially useful in life-critical applications, where the distinction between “false” and “unknown” can be vital. On the negative side, blocking causes a fuzzification of the support and the confidence of an association rule due to the incorporation of unknowns, and this complicates the process of mining the nonsensitive significant association rules from the sanitized database.
The third dimension refers to whether a single rule or a set of rules can be hidden in every iteration of the hiding algorithm.
Based on this criterion we differentiate association rule hiding algorithms into single rule and multiple rule schemes. Single rule hiding algorithms consider only one association rule at a time and modify the raw data appropriately to hide this rule. In these approaches, the sequence in which the rules are examined in order to be hidden can be of major importance. On the other hand, multiple rule hiding schemes identify items whose modification impacts more than one sensitive association rule and apply those item modifications that contribute to the hiding of multiple rules.
The fourth, and last, dimension of the proposed taxonomy deals with the nature of the hiding algorithm, which in a broad sense can be either heuristic or exact. Heuristic techniques rely on the optimization of certain goals in the hiding process, but they do not guarantee optimality of the hiding solution. The formulation of the association rule hiding problem, introduced in Chapter 2, implies that there are three specific goals that need to be attained by every association rule hiding algorithm. The first goal, which is the most important, is to hide as many sensitive rules as possible. The second and third goals involve hiding the sensitive rules while minimizing the possible side-effects. As side-effects in the hiding process, we consider (i) the number of original data items affected by the hiding process, (ii) the number of nonsensitive rules which were accidentally hidden during the hiding process, and (iii) the number of ghost rules which were generated by the hiding process. Different hiding algorithms give different priorities to the satisfaction of these goals, producing in this way a list of hiding primitives.
Exact techniques, on the other hand, rely on formulating the association rule hiding problem in such a way that a solution can be found that satisfies all the goals. Of course, there is a possibility that an exact approach fails to give a solution, and for this reason, some of the goals may need to be relaxed. However, this relaxation process is still part of the exact approach, which makes it radically different from the heuristic approaches. Moreover, the data owner has control over the side-effects that are introduced to the database due to the approximation strategy that is applied.
Fig. 3.2: The three classes of association rule hiding algorithms.
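The side-effects enumerated in Section 3.1 can be measured once the rule sets of the original and the sanitized database are available. The following is a minimal sketch under the assumption that rules are represented as plain strings and transactions as sets of items; the function names and the toy data are hypothetical.

```python
def side_effects(rules_original, rules_sanitized, sensitive):
    """Compare the rules mined before and after sanitization."""
    nonsensitive = rules_original - sensitive
    return {
        # sensitive rules that can still be mined (first goal not met)
        "hiding_failures": sensitive & rules_sanitized,
        # nonsensitive rules that were accidentally hidden
        "lost_rules": nonsensitive - rules_sanitized,
        # rules that did not exist in the original database
        "ghost_rules": rules_sanitized - rules_original,
    }


def items_affected(original_db, sanitized_db):
    """Number of item occurrences that were added or removed."""
    return sum(len(set(o) ^ set(s)) for o, s in zip(original_db, sanitized_db))


# Toy rule sets (assumed to be the output of some mining step).
R = {"a->b", "a->c", "b->c", "c->d"}
R_S = {"a->b"}
R_sanitized = {"a->c", "c->d", "d->a"}
report = side_effects(R, R_sanitized, R_S)

D_O = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
D = [{"a", "c"}, {"a", "b"}, {"a", "c", "d"}]
print(report, items_affected(D_O, D))
```

A heuristic algorithm would typically report these quantities after the fact, whereas an exact approach, as described above, encodes them as constraints before any modification is made.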
3.2 Classes of Association Rule Hiding Algorithms
Association rule hiding algorithms can be divided into three classes, namely heuristic approaches, border-based approaches and exact approaches. The first class of approaches involves efficient, fast algorithms that selectively sanitize a set of transactions from the original database to hide the sensitive association rules. Due to their efficiency and scalability, heuristic approaches have been the focus of attention for the majority of researchers in the data mining area. However, there are several circumstances in which these algorithms suffer from undesirable side-effects that lead them to identify approximate hiding solutions. This is due to the fact that heuristics always aim at taking locally best decisions with respect to the hiding of the sensitive knowledge which, however, are not necessarily also globally best. As a result, heuristics fail to provide guarantees with respect to the quality of the identified hiding solution. Some of the most interesting heuristic methodologies for association rule hiding are presented in the next part of the book.
The second class of approaches considers association rule hiding through the modification of the borders in the lattice of the frequent and the infrequent itemsets of the original database. Borders [46] capture those itemsets of a lattice that control the position of the borderline separating the frequent itemsets from their infrequent counterparts. Data modification, applied to the original database to facilitate sensitive knowledge hiding, has an immediate effect on the support of the various itemsets and, subsequently, on the borders of the sanitized database. Border-based algorithms hide the sensitive association rules by tracking the border of the nonsensitive frequent itemsets and greedily applying the data modifications that have minimal impact on the quality of the border to accommodate the hiding of the sensitive rules. The algorithms in this class differ in the methodology they follow to enforce the new, revised borders in the sanitized database. The theory of border revision is critical for the understanding of the border-based approaches and is extensively covered in Chapter 9. Following that, Chapters 10 and 11 present two popular border-based association rule hiding methodologies that have been proposed.
The third class of approaches contains non-heuristic algorithms which conceive the hiding process as a constraint satisfaction problem (an optimization problem) that they solve by using integer programming. The main difference of these approaches, when compared to those of the two previous classes, is the fact that the sanitization process is capable of offering quality guarantees for the computed hiding solution. The modeling of the hiding problem as an optimization problem enables the algorithms of this category to identify optimal hiding solutions that minimally distort the original database and introduce no side-effects to the hiding process. Exact hiding approaches can be considered as the descendants of border-based methodologies. They operate by first applying border revision to compute a small portion of itemsets from the original database whose status (i.e., frequent vs. infrequent) in the sanitized database plays a crucial role in the quality of the hiding solution. Having computed those itemsets, the exact methodologies incorporate unknowns into the original database and generate inequalities that control the status of selected itemsets of the border. These inequalities, along with an optimization criterion that requires minimal modification of the original database to facilitate sensitive knowledge hiding, formulate an optimization problem whose solution (if one exists) is guaranteed to lead to optimal hiding.
It is important to mention that, unlike the two previous classes of approaches that operate in a heuristic manner, exact hiding methodologies model the hiding problem in a way that allows them to simultaneously hide all the sensitive knowledge as an atomic operation (from the viewpoint of the hiding process). On the negative side, these approaches are usually several orders of magnitude slower than the heuristic ones, mainly due to the time that is taken by the integer programming solver to solve the optimization problem.
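To give a feeling for the optimization view taken by the exact approaches, the following toy sketch treats every occurrence of a sensitive item in a supporting transaction as a binary decision (keep or remove) and searches for the smallest set of removals that satisfies all constraints. It is only an illustration under strong assumptions: exhaustive enumeration stands in for the integer programming solver, the itemsets to be preserved are supplied by hand instead of being derived through border revision, and all names are hypothetical.

```python
from itertools import combinations


def count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def exact_hide(db, sensitive, preserve, min_count):
    """Smallest set of item removals that hides `sensitive` and keeps `preserve` frequent.

    Returns (removals, sanitized_db), or None when the constraints cannot be met.
    """
    # One binary variable per (transaction index, item) occurrence that could be removed.
    candidates = sorted({(i, item)
                         for s in sensitive
                         for i, t in enumerate(db) if s <= t
                         for item in s})
    for k in range(len(candidates) + 1):          # fewest modifications first
        for removals in combinations(candidates, k):
            new_db = [set(t) for t in db]
            for i, item in removals:
                new_db[i].discard(item)
            hidden = all(count(s, new_db) < min_count for s in sensitive)
            kept = all(count(p, new_db) >= min_count for p in preserve)
            if hidden and kept:
                return removals, new_db
    return None


db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
sensitive = [{"a", "b"}]
preserve = [{"a"}, {"b"}, {"c"}, {"a", "c"}, {"b", "c"}]
removals, sanitized = exact_hide(db, sensitive, preserve, min_count=3)
print(removals)
```

In this example, any removal from the first two transactions would make a nonsensitive itemset infrequent, so the only feasible single-item removals touch the third transaction. The enumeration grows exponentially with the number of candidate occurrences, which reflects, in miniature, why exact approaches are slower than heuristics.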
Chapter 4
Other Knowledge Hiding Methodologies
Association rule hiding algorithms aim at protecting sensitive knowledge captured in the form of frequent itemsets or association rules. However, (sensitive) knowledge may appear in various forms directly related to the data mining algorithm that exposed it. Consequently, a set of hiding approaches has been proposed over the years to allow for the safeguarding of sensitive knowledge exposed by data mining tasks such as clustering, classification and sequence mining. In this chapter, we briefly discuss some state-of-the-art approaches for the hiding of sensitive knowledge that is depicted in any of the aforementioned formats.
4.1 Classification Rule Hiding
Classification rule hiding has been studied to a substantially lesser extent than association rule hiding. Similarly to association rule hiding methodologies, classification rule hiding algorithms consider a set of classification rules as sensitive and aim to protect them. Research in the area of classification rule hiding has developed along two main directions: suppression-based techniques and reconstruction-based techniques.
Suppression-based techniques aim at reducing the confidence of a sensitive classification rule (measured in terms of the owner’s belief regarding the holding of the rule when given the data) by distorting the values of certain attributes in the database that belong to transactions related to the existence of the rule. Chang & Moskowitz [15] were the first to address the inference problem caused by the downgrading of the data in the context of classification decision rules. Through a blocking technique, called parsimonious downgrading, the authors block the inference channels that lead to the identification of the sensitive classification rules by selectively modifying transactions so that missing values appear in the released database. An immediate consequence is the lowering of the confidence regarding the holding of the sensitive rules. Wang, et al. [74] propose a heuristic approach that fully eliminates all the sensitive inferences, while effectively handling overlapping rules. The algorithm operates by identifying the set of attributes that influence the existence of each sensitive rule the most and then removing them from those supporting transactions that affect the nonsensitive rules the least. Chen & Liu [16] present a random rotation perturbation technique to preserve the multidimensional geometric characteristics of the original database with respect to task-specific information. As an effect, in the sanitized database the sensitive knowledge is adequately protected against disclosure, while the utility of the data is preserved to a large extent.
Reconstruction-based approaches, inspired by the work of [17, 61] and introduced by Natwichai, et al. [52], offer an alternative to suppression-based techniques. These approaches aim at reconstructing the original database by using only supporting transactions of the nonsensitive rules. As discussed in [71], reconstruction-based approaches are advantageous when compared to heuristic data modification algorithms, since they hardly introduce any side-effects to the hiding process. They operate as follows. First, they perform rule-based classification on the original database to enable the data owner to identify the sensitive rules. Then, they construct a decision tree classifier that contains only nonsensitive rules, approved by the data owner. The produced database remains similar to the original one, except for the sensitive part, while the difference between the two databases is proven to decrease as the number of rules increases. Natwichai, et al. [53] propose a methodology that further improves the quality of the reconstructed database. This is accomplished by extracting additional characteristic information from the original database with regard to the classification and by improving the decision tree building process.
Furthermore, with the aid of information gain, the usability of the released database is ameliorated even in the case of hiding many sensitive rules with high discernibility in records classification. A similar approach to that of [53] was proposed by Katsarou, et al. [40]. The proposed methodology operates by modifying transactions supporting both sensitive and nonsensitive classification rules in the original database and then using the supporting transactions of the nonsensitive rules to produce its sanitized counterpart.
4.2 Privacy Preserving Clustering
The area of privacy preserving clustering collects methodologies that aim to protect the underlying attribute values and thus assure the privacy of individuals who are recorded in the data, when the data is shared for clustering purposes. Achieving privacy preservation when sharing data for clustering is a challenging task, since the privacy requirements should be met while the clustering results remain valid. The methodologies that have been proposed so far can be separated into two broad categories: the transformation-based approaches and the protocol-based approaches.
Transformation-based approaches are directly related to the distortion-based approaches of association rule hiding. They operate by performing a data transformation of the original database that maintains the similarity among the various pairs of attributes. In most of the cases, these methodologies are independent of the clustering algorithm that is used. In the transformed space, the similarity between the distorted attribute pairs still supports the computation of accurate results, which allow for correct clustering of the various transactions. Some interesting approaches that fall in this category include the work of Oliveira & Zaïane [56, 57].
Protocol-based approaches, on the other hand, assume a distributed scenario where many data owners want to share their data for clustering purposes without compromising the privacy of their data by revealing any sensitive knowledge. The algorithms of this category make an assumption regarding the partitioning of the data among the interested, collaborating parties and are typically the privacy preserving versions of commonly used clustering algorithms, such as K-means [68]. The proposed protocols control the information that is communicated among the different collaborating parties and guarantee that no sensitive knowledge can be learned from the model. Approaches in this category include the work of Jha, et al. [38] and the work of Jagannathan, et al. [37], among others.
A somewhat different kind of approach that targets density-based clustering is presented in the work of Silva & Klusch [19]. The authors propose a kernel-based distributed clustering algorithm that uses an approximation of density estimation in an attempt to make the reconstruction of the original database harder. Each site computes a local density estimate for the data it holds and transmits it to a trusted third party. Subsequently, the trusted party builds a global density estimate and returns it to the collaborating peers. By making use of this estimate, the sites can locally execute density-based clustering.
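The exchange of density estimates described in the last paragraph can be pictured with a deliberately simplified sketch. It assumes one-dimensional data, an unnormalized Gaussian kernel, and a grid of sample positions agreed upon by all sites; none of these details is taken from [19], and all names are hypothetical. Only the sampled densities, never the raw points, leave a site.

```python
import math


def local_density(points, grid, bandwidth):
    """Unnormalized Gaussian kernel density of one site's points, sampled on the shared grid."""
    return [sum(math.exp(-((g - x) / bandwidth) ** 2 / 2) for x in points)
            for g in grid]


def global_density(local_estimates):
    """Trusted party: add up the local estimates position by position."""
    return [sum(values) for values in zip(*local_estimates)]


def density_peaks(grid, density):
    """Grid positions that are local maxima; each site can use them as cluster centers."""
    return [grid[i] for i in range(1, len(grid) - 1)
            if density[i - 1] < density[i] > density[i + 1]]


grid = [i * 0.5 for i in range(13)]          # shared sample positions 0.0 .. 6.0
site_a = [1.0, 1.2, 0.9]                     # private data of site A
site_b = [5.0, 5.1, 1.1]                     # private data of site B

estimates = [local_density(site_a, grid, 0.5), local_density(site_b, grid, 0.5)]
combined = global_density(estimates)         # computed by the trusted third party
print(density_peaks(grid, combined))
```

Each site can then assign its own points to the nearest peak of the returned global estimate, which is one simple way to carry out the local clustering step.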
4.3 Sequence Hiding
The hiding of sensitive sequences is one of the most recent and challenging research directions in privacy preserving data mining, particularly due to the tight relation that exists between sequential and mobility data¹. The underlying problem has the same principles as association rule hiding, in the sense that a set of sensitive sequential patterns needs to be hidden from a database of sequences in a way that causes the least side-effects to their nonsensitive counterparts.
Abul, et al. [3] propose a sequential pattern hiding methodology which assumes that every sensitive sequence is associated with a disclosure threshold that defines the maximum number of sequences in the sanitized database that are allowed to support it. The sequence sanitization operation is based on the use of unknowns to mask selected elements in the sequences of the original dataset. The proposed algorithm operates as follows. For each sensitive sequence, the algorithm searches all the sequences of the original database to identify those in which the sensitive sequence is a subsequence². For every such sequence of the original database, the algorithm examines in how many different ways the sensitive sequence occurs as a subsequence of it. Each “different way” (also called a matching) is identified by the positions of the elements of the sequence that participate in the generation of the sensitive sequence. Consequently, for each element of the sequence coming from the original dataset, the algorithm maintains a counter depicting the number of matchings in which it is involved. To sanitize the sequence, the algorithm iteratively identifies the element of the sequence which has the highest counter (i.e., it is involved in the most matchings) and replaces it with an unknown, until the sensitive sequence is no longer a subsequence of the sanitized one. As a result of this operation, the sensitive sequence becomes unsupported by the sanitized sequence.
In order to enforce the requested disclosure threshold, the algorithm applies this sanitization operation in the following manner. For each sensitive sequence, all the sequences of the original dataset are sorted in ascending order based on the number of different matchings that they have with the sensitive sequence. Then, the algorithm sanitizes the sequences in this order, until the required disclosure threshold is met in the privacy-aware version of the original dataset. The authors have developed extensions of this approach for the handling of temporal constraints, such as min gap, max gap and max window.
¹ Privacy preserving data mining of user mobility data is a very active research topic that has been studied in the context of EU-funded IST projects such as Geographic Privacy-aware Knowledge Discovery and Delivery (GeoPKDD, http://www.geopkdd.eu) and Mobility, Data Mining, and Privacy (MODAP, http://www.modap.org).
² A sequence S1 is a subsequence of another sequence S2 if it can be obtained by deleting some elements from S2.
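The matching-based sanitization of sequences described in Section 4.3 can be sketched as follows. This is a simplified illustration and not the implementation of [3]: matchings are enumerated by brute force, ties between equally loaded positions are broken by taking the leftmost one, and the symbol used for unknowns as well as all function names are assumptions.

```python
from itertools import combinations

UNKNOWN = "?"   # assumed marker for a masked element


def matchings(sequence, pattern):
    """All position tuples at which `pattern` occurs as a subsequence of `sequence`."""
    return [idx for idx in combinations(range(len(sequence)), len(pattern))
            if all(sequence[i] == p for i, p in zip(idx, pattern))]


def sanitize(sequence, pattern):
    """Mask the most heavily involved element until no matching remains."""
    seq = list(sequence)
    while True:
        found = matchings(seq, pattern)
        if not found:
            return seq
        counters = [0] * len(seq)
        for idx in found:
            for i in idx:
                counters[i] += 1
        seq[counters.index(max(counters))] = UNKNOWN


def hide_pattern(database, pattern, threshold):
    """Sanitize sequences, fewest matchings first, until at most `threshold` support the pattern."""
    db = [list(s) for s in database]
    supporting = [i for i, s in enumerate(db) if matchings(s, pattern)]
    supporting.sort(key=lambda i: len(matchings(db[i], pattern)))
    for i in supporting[:max(0, len(supporting) - threshold)]:
        db[i] = sanitize(db[i], pattern)
    return db


database = ["abcab", "ab", "acb", "bca"]
released = hide_pattern(database, "ab", threshold=1)
print(["".join(s) for s in released])
```

With a disclosure threshold of 1, the two supporting sequences with a single matching are masked, while the sequence with three matchings is released unchanged.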
Chapter 5
Summary
Association rule hiding is a subarea of privacy preserving data mining that focuses on the privacy implications originating from the application of association rule mining to large public databases. In this first part of the book, we provided the basics for the understanding of the problem, which investigates how sensitive association rules can escape the scrutiny of malevolent data miners by modifying certain values in the original database, and presented some related problems on the knowledge hiding thread. Specifically, in the first two chapters we motivated the problem of association rule hiding, presented the necessary background for its understanding and derived the problem statement along two popular variants of the problem: frequent itemset hiding and association rule hiding. In Chapter 3, we provided a classification of the association rule hiding algorithms to facilitate the organization that we follow for the presentation of the methodologies in the rest of this book. Our proposed taxonomy partitions the association rule hiding methodologies along four orthogonal directions based on the employed hiding strategy, the data modification strategy, the number of rules that are concurrently hidden, and the nature of the algorithm. Elaborating on the last direction, we identified three classes of association rule hiding approaches, namely heuristic-based, border-based and exact approaches, and discussed the differences among them. Last, in Chapter 4 we examined the problem of knowledge hiding in the related research areas of clustering, classification and sequence mining. For each of these areas we briefly discussed some state-of-the-art approaches that have been proposed.
Part II
Heuristic Approaches
Chapter 6
Distortion Schemes
In this chapter, we review some popular support-based and confidence-based distortion algorithms that have been proposed in the association rule hiding literature. Distortion-based approaches operate by selecting specific items to include in (or exclude from) selected transactions of the original database in order to facilitate the hiding of the sensitive association rules. Two of the most commonly employed strategies for data distortion involve the swapping of values between transactions [10, 20], as well as the deletion of specific items from the database [54]. In the rest of this chapter, we present an overview of these approaches along with other methodologies that also fit in the same class.
Atallah, et al. [10] were the first to propose an algorithm for the hiding of sensitive association rules through the reduction in the support of their generating itemsets. The authors propose the construction of a lattice-like graph [65] in the database. Through this graph, the hiding of a large itemset, related to the existence of a sensitive rule, is achieved by a greedy iterative traversal of its immediate subsets, selection of the subset that has the maximum support among all candidates (and is therefore less likely to be hidden) and consideration of this itemset as the new candidate to be hidden. By iteratively following these steps, the algorithm identifies the 1–itemset ancestor of the initial sensitive itemset with the highest support. Then, by identifying the supporting transactions for both the initial candidate and the currently identified 1–itemset, the algorithm removes the 1–itemset (item) from the supporting transaction which affects the least number of 2–itemsets. Subsequently, the algorithm propagates the results of this action to the affected itemsets in the graph.
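The greedy descent just described can be sketched in a few lines. The sketch makes several simplifying assumptions that are not taken from [10]: supports are recomputed by scanning the database instead of being maintained in a lattice-like graph, a 2–itemset counts as affected only if it is currently frequent, and ties are broken by the first candidate encountered. All names are hypothetical.

```python
def count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def select_item(sensitive, db):
    """Descend through immediate subsets, always following the one with maximum support."""
    current = frozenset(sensitive)
    while len(current) > 1:
        current = max((current - {x} for x in sorted(current)),
                      key=lambda s: count(s, db))
    return next(iter(current))


def hide_itemset(sensitive, db, min_count):
    """Remove the selected item from supporting transactions until the itemset is infrequent."""
    db = [set(t) for t in db]
    sensitive = frozenset(sensitive)
    while count(sensitive, db) >= min_count:
        item = select_item(sensitive, db)
        supporting = [t for t in db if sensitive <= t]

        def affected(t):
            # frequent 2-itemsets that would lose support if `item` left transaction t
            return sum(1 for other in t - {item}
                       if count({item, other}, db) >= min_count)

        min(supporting, key=affected).discard(item)
    return db


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"a", "c"}, {"b", "c"}, {"b", "c"}]
sanitized = hide_itemset({"a", "b"}, db, min_count=3)
print(sanitized)
```

Here item b has the higher support and is selected for removal; it is taken out of the second transaction, because removing it from the first one would also reduce the support of the frequent 2–itemset {b, c}.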
When hiding a set of sensitive rules, the algorithm first sorts the corresponding large itemsets based on their support and then proceeds to hide them in a one-by-one fashion, using the methodology presented above. One of the most significant contributions of this work is the proof that the authors provide regarding the NP-hardness of finding an optimal sanitization of a dataset. On the negative side, the proposed approach does not take into consideration the extent of loss in support for the large itemsets, as long as they remain frequent in the sanitized database.
Dasseni, et al. [20] generalize the hiding problem in the sense that they consider the hiding of both sensitive frequent itemsets and sensitive association rules. The authors propose three single rule heuristic hiding algorithms that are based on the reduction of either the support or the confidence of the sensitive rules, but not both. In all three approaches, the goal is to hide the sensitive rules while minimally affecting the support of the nonsensitive itemsets. The first two algorithms reduce the confidence of the sensitive association rule either (i) by increasing the support of the rule antecedent, through transactions that partially support it, until the rule confidence drops below the minimum confidence threshold, or (ii) by decreasing the frequency of the rule consequent through transactions that support the rule, until the rule confidence is below the minimum threshold. The third algorithm, on the other hand, decreases the frequency of a sensitive association rule by decreasing the support of either the antecedent or the rule consequent, until either the confidence or the support lies below the corresponding minimum threshold. A basic drawback of the proposed methodologies is the strong assumption that none of the items appearing in a sensitive association rule appears in any other sensitive rule. Under this assumption, hiding the sensitive rules one at a time or all together makes no difference. Moreover, the proposed methodologies fail to avoid undesired side-effects, such as lost rules and ghost rules.
Verykios, et al. [72] extend the previous work of Dasseni, et al. [20] by improving the association rule hiding algorithms of [20] and evaluating their performance under different sizes of input datasets and different sets of sensitive rules. In addition to that, the authors propose two novel heuristic algorithms that incorporate the third strategy presented above. The first of these algorithms protects the sensitive knowledge by hiding the item having the maximum support from the minimum length transaction (i.e., the one with the least supporting items).
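The core step of this first algorithm can be sketched as follows, under the assumptions that the victim item is re-selected after every removal and that ties are broken by the first candidate encountered; neither detail is stated above, and the function name and toy data are hypothetical.

```python
def hide_generating_itemset(itemset, db, min_count):
    """Remove the maximum-support item from the shortest supporting transaction, repeatedly."""
    db = [set(t) for t in db]
    itemset = frozenset(itemset)
    while True:
        supporting = [t for t in db if itemset <= t]
        if len(supporting) < min_count:
            return db                      # the itemset is now infrequent, hence hidden
        # item of the sensitive itemset with the highest support in the current database
        victim_item = max(sorted(itemset),
                          key=lambda x: sum(1 for t in db if x in t))
        # supporting transaction with the fewest items
        shortest = min(supporting, key=len)
        shortest.discard(victim_item)


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"b", "c"}]
sanitized = hide_generating_itemset({"a", "b"}, db, min_count=2)
print(sanitized)
```

Choosing the item with maximum support and the shortest transaction is a way of limiting collateral damage: a high-support item can better afford the loss, and a short transaction supports fewer other itemsets.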
The hiding of the generating itemsets of the sensitive rules is performed in a decreasing order of size and support and in a one-by-one fashion. Similarly to the first algorithm, the second algorithm first sorts the generating itemsets with respect to their size and support, and then hides them in a round-robin fashion as follows. First, for each generating itemset, a random ordering of its items and of its supporting transactions is attained. Then, the algorithm proceeds to remove the items from the corresponding transactions in a round-robin fashion, until the support of the sensitive itemset drops below the minimum support threshold and the itemset is thus hidden. The intuition behind hiding in a round-robin fashion is fairness, and the proposed algorithm, although rather naïve, serves as a baseline for conducting a series of experiments.
Oliveira & Zaïane [54] were the first to introduce multiple rule hiding approaches. The proposed algorithms are efficient and require two scans of the database, regardless of the number of sensitive itemsets to hide. During the first scan, an index file is created to speed up the process of finding the sensitive transactions and to allow for an efficient retrieval of the data. In the second scan, the algorithms sanitize the database by selectively removing the least amount of individual items that accommodate the hiding of the sensitive knowledge. An interesting novelty of this work is the fact that the proposed methodology takes into account not only the impact of the sanitization on hiding the sensitive patterns, but also the impact related to the hiding of nonsensitive knowledge. Three item restriction-based algorithms (known as MinFIA, MaxFIA, and IGA) are proposed that selectively remove items from transactions that support the sensitive rules. The first algorithm, MinFIA, operates as follows. For every restrictive pattern, it identifies the supporting transactions and the item having the smallest support in the pattern (called the victim item). Then, by using a user-supplied disclosure threshold, it first sorts the identified transactions in ascending order of degree of conflict and then selects the number of transactions (among them) that need to be sanitized. Finally, from each selected transaction the algorithm removes the victim item. The MaxFIA algorithm proceeds exactly as MinFIA, the only difference being that it selects as the victim item the one that has the maximum support in the sensitive association rule. Finally, IGA aims at clustering the restricted patterns into groups that share the same itemsets. By identifying overlapping clusters, the algorithm hides the corresponding sensitive patterns at once (based on the sensitive itemsets they share), thus reducing the distortion that is induced to the database when hiding the sensitive knowledge.
A more efficient approach than that of [54] and the work of [20, 63, 64] was introduced by Oliveira & Zaïane in [55]. The proposed algorithm, called SWA, is an efficient, scalable, one-scan heuristic which aims at providing a balance between the needs for privacy and knowledge discovery in association rule hiding. It hides multiple rules in only one pass through the dataset, regardless of its size or the number of sensitive rules that need to be protected. The algorithm proceeds in five steps that are applied to every group of K transactions (thus formulating a window of size K) read from the original database. Initially, the nonsensitive transactions are separated from the sensitive ones and copied directly to the sanitized database. For each sensitive rule, the item having the highest frequency is selected and its supporting transactions are identified.
Then, a disclosure threshold ψ, potentially different for each sensitive rule, is used to capture the severity characterizing the release of the rule. Based on this threshold, SWA computes the number of supporting transactions that need to be sanitized for each rule and then sorts them in ascending order of length. For each selected transaction, the item with the highest frequency as identified before is removed and then the transaction is copied to the sanitized database. SWA is experimentally shown to outperform the state-of-the-art heuristic approaches in terms of concealing all the sensitive rules, while maintaining high data utility of the released database.
Amiri [9] proposes three effective, multiple association rule hiding heuristics that outperform SWA by offering higher data utility and lower distortion, at the expense of increased computational cost. Although similar in philosophy to the previous approaches, the three proposed methodologies do a better job in modeling the overall objective of a rule hiding algorithm. The first approach, called Aggregate, computes the union of the supporting transactions for all sensitive itemsets. Among them, the transaction that supports the most sensitive and the least nonsensitive itemsets is selected and expelled from the database. The same process is repeated until all the sensitive itemsets are hidden. Similarly to this approach, the Disaggregate approach aims at removing individual items from transactions, rather than removing the entire transaction. It achieves that by computing the union of all transactions supporting sensitive itemsets and then, for each transaction and supporting item, by calculating the number of sensitive and nonsensitive itemsets that will be affected if this item is removed from the transaction. Finally, it selects to remove the item from the transaction that will affect the highest number of sensitive and the least number of nonsensitive itemsets. The third approach, called Hybrid, is a combination of the two previous algorithms; it employs the Aggregate approach to identify the sensitive transactions and the Disaggregate approach to selectively delete items from these transactions, until all the sensitive knowledge is appropriately concealed.
Wu, et al. [79] propose a sophisticated methodology that removes the assumption of [20] regarding the disjoint relation among the items of the various sensitive rules. Using set theory, the authors formalize a set of constraints related to the possible side-effects of the hiding process and allow item modifications to enforce these constraints. However, the existing correlation among the rules can make it impossible to hide the sensitive knowledge without violating any constraints. For this reason, the user is permitted to specify which of the constraints he/she considers more significant and relax the rest. A drawback of the approach is the simultaneous relaxation (without the user’s consent) of the constraint regarding the hiding of all the sensitive itemsets. To accommodate rule hiding, the new scheme defines a class of allowable modifications that are represented as templates and are selected in a one-by-one fashion. A template contains the item to be modified, the applied operation, the items to be preserved or removed from the transaction and coverage information regarding the number of rules that are affected. Based on this information, the algorithm can select and apply only the templates that are considered beneficial, since they cause the least side-effects to the sanitized database.
Pontikakis, et al. [59] propose two distortion-based heuristics to selectively hide the sensitive association rules.
The proposed schemes use efficient data structures for the representation of the association rules and effectively prioritize the selection of transactions for sanitization. However, in both algorithms the proposed hiding process may introduce a number of side-effects, either by generating rules which were previously unknown, or by eliminating existing nonsensitive rules. The first algorithm, called Priority-based Distortion Algorithm (PDA), reduces the confidence of a sensitive association rule by changing 1’s to 0’s in items belonging to the rule’s consequent. The second algorithm, called Weight-based Sorting Distortion Algorithm (WDA), concentrates on the optimization of the hiding process in an attempt to achieve the least side-effects and the minimum complexity. This is achieved through the use of priority values assigned to transactions based on weights. Both PDA and WDA are experimentally shown to produce hiding solutions of comparable (or slightly better) quality than the ones produced by the algorithms of [64], generally introducing few side-effects. However, both algorithms are computationally demanding, with PDA typically requiring twice the time of the hiding methodologies in [64] to hide the sensitive knowledge.

Wang & Jafari [76, 77] propose two data modification algorithms that aim at the hiding of predictive association rules, i.e. rules containing the sensitive items on their left hand side (rule antecedent). Both algorithms rely on the distortion of a portion of the database transactions to lower the confidence of the sensitive association rules. The first strategy, called ISL, decreases the confidence of a sensitive rule by increasing the support of the itemset in its left hand side. The second approach,
called DSR, reduces the confidence of the rule by decreasing the support of the itemset in its right hand side (rule consequent). Both algorithms experience the item ordering effect under which, based on the order in which the sensitive items are hidden, the produced sanitized databases can be different. Moreover, the DSR algorithm is usually more effective when the sensitive items have a high support. Compared to the work of Saygin, et al. [63, 64], the algorithms presented in [76, 77] require a reduced number of database scans and have an efficient pruning strategy. However, by construction, they are assigned the task of hiding all the rules containing the sensitive items on their left hand side, while the algorithms in the work of Saygin, et al. [63, 64] can hide any type of sensitive association rule.

Lee, et al. [42] introduce a data distortion approach that operates by first constructing a sanitization matrix from the original data and then multiplying the original database (represented as a transactions-by-items matrix) with the sanitization matrix in order to obtain the sanitized database. The applied matrix multiplication strategy follows a new definition that aims to enforce the suppression of selected items from transactions of the original database, thus reducing the support of the sensitive itemsets. Along these lines, the authors develop three sanitization algorithms: Hidden-First (HF), Non-Hidden-First (NHF) and HPCME (Hiding sensitive Patterns Completely with Minimum side Effect on nonsensitive patterns). The first algorithm takes a drastic approach to eliminate the sensitive knowledge from the original database and is shown to lead to hiding solutions that suffer from the loss of nonsensitive itemsets. The second algorithm focuses on the preservation of the nonsensitive patterns and thus may fail to hide all the sensitive knowledge from the database.
Last, the third algorithm tries to combine the advantages of HF and NHF in order to hide all sensitive itemsets with minimal impact on the nonsensitive ones. To achieve this goal, the algorithm introduces a restoration probability factor, which it uses to decide when the preservation of the nonsensitive patterns does not affect the hiding of the sensitive ones, and thus to take the appropriate action.
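To make the two confidence-based strategies of Wang & Jafari discussed earlier in this chapter more concrete, the following is a minimal sketch of a single ISL-style and a single DSR-style modification. It is not the authors' algorithms: the function names, the toy database, the choice of which transaction to modify, and the rule a → b are all assumptions made for illustration. It only shows how each kind of change moves conf(X → Y) = sup(X ∪ Y) / sup(X).

```python
from fractions import Fraction

def support(db, itemset):
    """Number of transactions (sets of items) that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def confidence(db, lhs, rhs):
    """conf(lhs -> rhs) = sup(lhs | rhs) / sup(lhs); assumes sup(lhs) > 0."""
    return Fraction(support(db, lhs | rhs), support(db, lhs))

def dsr_step(db, lhs, rhs):
    """DSR-style change: remove one consequent item from one transaction supporting the rule."""
    for t in db:
        if (lhs | rhs) <= t:
            t.discard(min(rhs))  # arbitrary but deterministic item choice (assumption)
            return True
    return False

def isl_step(db, lhs, rhs):
    """ISL-style change: add the antecedent to one transaction that lacks it and lacks rhs."""
    for t in db:
        if not lhs <= t and not rhs <= t:
            t.update(lhs)
            return True
    return False

# Hypothetical toy database and sensitive rule a -> b.
base = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a"}, {"c"}, {"c"}]
lhs, rhs = {"a"}, {"b"}

dsr_db = [set(t) for t in base]
dsr_step(dsr_db, lhs, rhs)

isl_db = [set(t) for t in base]
isl_step(isl_db, lhs, rhs)

print(confidence(base, lhs, rhs))    # 3/4
print(confidence(dsr_db, lhs, rhs))  # 1/2
print(confidence(isl_db, lhs, rhs))  # 3/5
```

In this toy case the DSR-style step lowers the numerator (the rule's support), whereas the ISL-style step raises the denominator (the antecedent's support) and introduces an item that was not in the original data.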
Chapter 7
Blocking Schemes
An interesting line of research in the class of heuristic methodologies for association rule hiding regards approaches that block certain original values of a database by introducing unknowns (i.e., by replacing the real values of some items in selected transactions of the original database with question marks “?”). Unlike data distortion schemes, blocking methodologies do not add any false information to the original database and thus provide a much safer alternative for critical real life applications, where the distinction between “false” and “unknown” can be vital. Due to the introduction of unknowns, the support and the confidence of association rules that are mined from the sanitized database become fuzzified to an interval (rather than being an exact value) and can no longer be safely estimated. Blocking approaches, similarly to distortion approaches, can be partitioned into support-based and confidence-based, depending on whether they use the support or the confidence of the association rules to drive the hiding process. In this chapter, we review two methodologies that belong to this class of approaches.

Saygin, et al. [63, 64] are the first to propose the use of unknowns, instead of transforming 1’s to 0’s and the opposite, for the hiding of sensitive association rules. In their work, the authors introduce three simple heuristic approaches that materialize the idea of blocking. The first approach relies on reducing the support of the generating itemsets of the sensitive association rules below the minimum support threshold, while the other two approaches reduce the confidence of the rule below the minimum confidence threshold. The definitions of both the support and the confidence measures are extended to capture the notion of an interval instead of being crisp values, while the algorithms consider both 0 and 1 values to use for hiding (in some proportion) so that it is difficult for an adversary to conclude upon the value concealed behind a question mark.
A universal safety margin is applied to capture how far below the minimum thresholds the new support and confidence of a sensitive rule should lie in order for the rule to be considered safely hidden. An important contribution of this work, apart from the proposed methodologies, is a discussion regarding the effect of the algorithms on the hiding of the sensitive knowledge, the possibility of reconstruction of the hidden patterns by an adversary, and the importance of choosing an adequate safety margin when hiding the sensitive rules.

A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_7, © Springer Science+Business Media, LLC 2010
Pontikakis, et al. [58] argue that the main disadvantage of blocking is the fact that the dataset, apart from the blocked values (i.e., the incorporated unknowns), is not distorted. Thus, an adversary can disclose the hidden association rules simply by identifying those generating itemsets that contain question marks and lead to rules with a maximum confidence that lies above the minimum confidence threshold. If the number of these rules is small, then the probability of identifying the sensitive ones among them becomes high. To avoid this serious shortcoming of previous approaches, the authors propose a blocking algorithm that purposely generates rules that did not exist in the original dataset (i.e., ghost rules) and whose generating itemsets contain unknowns. Thus, the identification of the sensitive association rules becomes harder, since the adversary is unable to tell which of the rules that have a maximum confidence above the minimum threshold are the sensitive ones and which are the ghost ones. However, the introduction of ghost rules leads to a decrease in the data quality of the sanitized database. In order to balance the trade-off between the level of privacy and data utility, the proposed algorithm incorporates a safety margin that corresponds to the extent of sanitization that can be performed on the database. The higher the safety margin, the better the protection of the sensitive association rules and the worse the data utility of the resulting sanitized database.
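The interval semantics that blocking introduces can be illustrated with a short sketch. It assumes one common way of bounding the measures: the minimum support counts only transactions in which all items of the itemset are 1, the maximum support also counts transactions in which some of them are "?", and the confidence bounds are obtained by combining these extremes. The function names and the toy database are hypothetical, and the resulting confidence bounds are deliberately loose.

```python
def support_interval(db, itemset):
    """Return (min_support, max_support) counts of `itemset` in a blocked database.

    Each transaction is a dict mapping item -> 1, 0 or "?"; missing items count as 0.
    """
    lo = hi = 0
    for t in db:
        values = [t.get(i, 0) for i in itemset]
        if all(v == 1 for v in values):
            lo += 1  # supports the itemset whatever the unknowns hide
        if all(v in (1, "?") for v in values):
            hi += 1  # could support the itemset if every "?" hides a 1
    return lo, hi

def confidence_interval(db, lhs, rhs):
    """Loose bounds on conf(lhs -> rhs) derived from the support intervals."""
    rule_lo, rule_hi = support_interval(db, lhs | rhs)
    lhs_lo, lhs_hi = support_interval(db, lhs)
    min_conf = rule_lo / lhs_hi if lhs_hi else 0.0
    max_conf = min(1.0, rule_hi / lhs_lo) if lhs_lo else 1.0
    return min_conf, max_conf

# Hypothetical blocked database over items a and b.
blocked = [
    {"a": 1, "b": 1},
    {"a": 1, "b": "?"},
    {"a": 1, "b": 0},
    {"a": "?", "b": 1},
]

print(support_interval(blocked, {"a", "b"}))       # (1, 3)
print(confidence_interval(blocked, {"a"}, {"b"}))  # (0.25, 1.0)
```

An adversary of the kind described above would flag a → b as a possible hidden rule whenever the upper bound (here 1.0) lies above the minimum confidence threshold, which is why the number of such candidate rules matters.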
Chapter 8
Summary
In the second part of the book, we presented some state-of-the-art algorithms that have been proposed for association rule hiding and belong to the class of heuristic-based approaches. The heuristic class of approaches collects computationally and memory efficient algorithms that operate in a series of steps by optimizing certain subgoals to drive the hiding process. We partitioned the current heuristic approaches into two main categories: distortion-based schemes, which operate by altering certain items in selected transactions from 1’s to 0’s (and vice versa), and blocking-based schemes, which replace certain items in selected transactions with unknowns, in order to facilitate association rule hiding. Each category of approaches was further partitioned into support-based and confidence-based methodologies, depending on whether the algorithm uses the support or the confidence of the rule to drive the hiding process.

A large amount of research has been conducted over the past years, leading to several interesting heuristic methodologies being proposed for association rule hiding. In Chapters 6 and 7 it was our intention to cover a selection of the existing methodologies by considering the most representative ones for this domain. As a final remark, we should note that, as is also evident from the number of presented works in each category, research on heuristic methodologies for association rule hiding has mostly concentrated on distortion-based approaches rather than blocking techniques. However, blocking techniques are certainly preferable to conventional distortion methodologies in several real life scenarios, and for this reason we expect such approaches to attract more scientific interest in the years to come.
Moreover, the combination of distortion and blocking techniques is also a prominent research direction, which may lead to solutions that further minimize the data loss that is introduced to the original database to account for the hiding of the sensitive knowledge.
Part III
Border Based Approaches
Chapter 9
Border Revision for Knowledge Hiding
In this chapter, we highlight the process of border revision, which plays a significant role in both border-based and exact approaches for association rule hiding. Simply stated, the process of border revision captures those itemsets of the original database which need to remain frequent and those that need to become infrequent in the sanitized version of the database, in order to allow furnishing an optimal hiding solution. After presenting the theory behind border revision, we introduce a set of algorithms (adopted from [27]) that can be applied for the efficient computation of both the original and the revised borders in a transactional database, when given a minimum support threshold.

Table 9.1: An original database $D_O$.

a  b  c  d  e  f
1  1  0  0  0  1
1  1  1  1  0  0
1  0  1  0  0  1
1  0  0  0  0  0
0  1  0  0  1  0
1  1  1  1  1  0
0  0  0  1  0  0
1  1  1  0  1  0
0  1  1  0  0  0
1  0  1  1  1  0
Borders [46] allow for a condensed representation of the frequent itemsets in a database, effectively identifying those key itemsets in the lattice which separate the frequent patterns from their infrequent counterparts. The process of border revision facilitates minimum harm in the hiding of the sensitive itemsets. In what follows, we use a simple example to demonstrate how this process works.

Consider database $D_O$ of Table 9.1. Applying frequent itemset mining in $D_O$ using mfreq = 0.3 leads to the set of frequent itemsets shown in Table 9.2. Among these itemsets, assume that $S = \{e, ae, bc\}$ are the sensitive ones that have to be protected.

Table 9.2: Frequent itemsets for database $D_O$ at mfreq = 0.3.

Frequent itemsets                                      Support
{a}                                                    7
{b}, {c}                                               6
{ac}                                                   5
{d}, {e}, {ab}, {bc}                                   4
{ad}, {ae}, {be}, {cd}, {ce}, {abc}, {acd}, {ace}      3

Fig. 9.1: An itemsets’ lattice demonstrating (i) the original border and the sensitive itemsets, and (ii) the revised border.

Figure 9.1 demonstrates the process of border revision for the problem at hand. In this figure, near each itemset we depict its support in the original (Figure 9.1(i)) and the sanitized (Figure 9.1(ii)) database, which facilitates optimal hiding of the sensitive knowledge. Notice that although the sanitized database is not yet constructed, the process of border revision allows us to compute the borderline of the optimal hiding solution. As one can observe, there are four possible scenarios involving the status of an itemset I before and after the application of border revision:

C1  Itemset I was frequent in $D_O$ and remains frequent in D.
C2  Itemset I was infrequent in $D_O$ and remains infrequent in D.
C3  Itemset I was frequent in $D_O$ and became infrequent in D.
C4  Itemset I was infrequent in $D_O$ and became frequent in D.
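The supports of Table 9.2 can be reproduced from Table 9.1 with a few lines of code. The sketch below is a brute-force enumeration written for this six-item example only (it is not Apriori and would not scale); the names `D_O` and `frequent_itemsets` are chosen here for illustration, and mfreq = 0.3 over 10 transactions is taken as a minimum support count of 3.

```python
from itertools import combinations

# Table 9.1, one string of items per transaction (row).
D_O = ["abf", "abcd", "acf", "a", "be", "abcde", "d", "abce", "bc", "acde"]

def frequent_itemsets(db, msup):
    """Return {itemset: support} for every itemset with support >= msup.

    Brute force over all item combinations: fine for six items, not for real data.
    """
    transactions = [frozenset(t) for t in db]
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            sup = sum(1 for t in transactions if candidate <= t)
            if sup >= msup:
                result["".join(combo)] = sup
    return result

F = frequent_itemsets(D_O, msup=3)  # mfreq = 0.3 over 10 transactions
for itemset, sup in sorted(F.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0])):
    print(itemset, sup)
```

The output lists the 16 frequent itemsets of Table 9.2, from {a} with support 7 down to the eleven itemsets with support 3.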
The original border (Figure 9.1(i)) corresponds to the hyperplane that partitions the universe of itemsets into two groups: the frequent itemsets of $F_{D_O}$ (depicted on the left of the original borderline) and their infrequent counterparts in $P - F_{D_O}$ (shown on the right of the borderline). The optimal hiding process is then defined as the revision of the original border so that the revised border excludes from the frequent itemsets the sensitive ones and their supersets, i.e. all those itemsets that appear in

\[ S_{max} = \{ I \in F_{D_O} \mid \exists J \in S_{min},\ J \subseteq I \}, \]

where $S_{min} = \{ I \in S \mid \text{for all } J \subset I,\ J \notin S \}$ contains all the minimal sensitive itemsets from S.

Given the four possible scenarios for an itemset I before and after the application of border revision, we deduce that in an optimal hiding scenario, where no side-effects are introduced by the hiding algorithm, C2 should always hold, while C4 must never hold. On the contrary, C1 must hold for all the itemsets in $F_{D_O} - S_{max}$, while C3 must hold for all the itemsets in $S_{max}$. Thus, the hiding of the sensitive itemsets can be pictured as a movement of the original borderline in the lattice to a new position that adheres to the optimal set $F'_D = F_{D_O} - S_{max}$.

Continuing our example, given that $S = \{e, ae, bc\}$ (shown in squares in Figure 9.1(i)), we have that $S_{min} = \{e, bc\}$ (shown in bold in Figure 9.1(i)), and $S_{max} = \{e, ae, be, bc, ce, abc, ace\}$. Based on $D_O$, in Figure 9.1(i) we present the itemsets of the original positive border (double underlined) and the original negative border (single underlined). Ideally, the frequent itemsets in the sanitized database D will be exactly those in $F'_D = \{a, b, c, d, ab, ac, ad, cd, acd\}$. The revised borderline, along with the corresponding revised borders for D and scenarios C1 and C3 that pertain to an exact hiding solution, is depicted in Figure 9.1(ii).
Specifically, in this figure, the itemsets of the revised positive border are double underlined, while those of the revised negative border are single underlined. To enhance the clarity of the figure, only a small portion of the itemsets involved in C2 is shown.

What is then needed is a way to modify the transactions of the original database $D_O$ in order to have them support the revised border in database D. To capture the optimal borderline, which is the revised border for the original database $D_O$ based on the problem at hand, one can use border theory and identify the key itemsets which separate all the frequent patterns from their infrequent counterparts. As proven in [23, 24], these key itemsets correspond to the union of the revised positive border $Bd^+(F'_D)$ and the revised negative border $Bd^-(F'_D)$.

In what follows, we provide some efficient algorithms for computing the positive and the negative borders, both for capturing the original and the revised borderline. Algorithm 9.1 provides a straightforward way to compute the negative border. It achieves this by incorporating the computation into the Apriori algorithm of [7]. The new algorithm has extra code in the NB-Apriori and the GetLargeItems procedures to compute the border. On the other hand, the Apriori-Gen procedure is the same as in the original version of Apriori. The proposed method for the computation of the negative border relies on Apriori’s candidate generation scheme, which uses frequent (k − 1)-itemsets to produce candidate large k-itemsets. Thus, it identifies the first infrequent candidates in the lattice, all of whose subsets are found to
Algorithm 9.1 Computation of the large itemsets and the original negative border.

procedure NB-Apriori(D_O, msup)
    F_1 ← GetLargeItems(D_O, msup)
    for k = 2; F_{k−1} ≠ ∅; k++ do
        C_k ← Apriori-Gen(F_{k−1}, msup)
        for each transaction t ∈ D_O do             ▷ for all transactions in D_O
            C_t ← subset(C_k, t)                    ▷ get candidate subsets of t
            for each candidate c ∈ C_t do
                c.count++
            end for
        end for
        F_k ← {c ∈ C_k | c.count ≥ msup}
        Bd−(F_{D_O}) ← Bd−(F_{D_O}) ∪ {c ∈ C_k | c.count < msup}
    end for
end procedure

procedure GetLargeItems(D_O, msup)
    for T_i ∈ D_O do                                ▷ traverse all transactions
        for x ∈ T_i do                              ▷ traverse all items in the transaction
            x.count++
        end for
    end for
    for each item x do
        if x.count ≥ msup then
            x ∈ F_1                                 ▷ item x is frequent
        else
            x ∈ Bd−(F_{D_O})                        ▷ add this item to the negative border
        end if
    end for
end procedure
be frequent. In this algorithm, we use notation $F_k$ to refer to the large k-itemsets from the original database $D_O$.

Having identified the original negative border, the next step is to compute the original positive border $Bd^+(F_{D_O})$. Algorithm 9.2 presents a level-wise (in the length of the large itemsets) approach to accomplish this computation. Assume that $F_{D_O}$ is the set of frequent itemsets identified by using Apriori. With each itemset in $F_{D_O}$ we associate a counter, initialized to zero. The algorithm first sorts these itemsets in decreasing length and then, for all itemsets of the same length, say k, it identifies their large (k − 1)-subsets and increases their counters by one. The value of k iterates from the length of the largest identified frequent itemset down to 1. Finally, the algorithm performs one pass through all the counters of the itemsets and collects the large itemsets having a value of zero in the associated counter. These constitute the positive border $Bd^+(F_{D_O})$. Due to its very nature, this algorithm is suitable for computing both the original and the revised positive borders. For this reason, we use notation $Bd^+(F)$ to abstract the reference to the particular border that is computed, namely the original or the revised border.

A way of computing the revised negative border that adheres to an exact hiding solution, in which $F'_D = F_{D_O} - S_{max}$, is presented in Algorithm 9.3. In this
Algorithm 9.2 Computation of the positive border (original and revised) Bd+(F).

procedure PB-Computation(F)
    count{0...|F|} ← 0                              ▷ initialize counters
    F_sort = reverse-sort(F)
    for each k-itemset f ∈ F_sort do
        for all (k − 1)-itemsets q ∈ F_sort do
            if q ⊂ f then
                q.count++
            end if
        end for
    end for
    for each f ∈ F_sort do
        if f.count = 0 then
            f ∈ Bd+(F)                              ▷ add itemset to Bd+(F)
        end if
    end for
end procedure
Algorithm 9.3 Computation of the revised negative border Bd−(F'_D) of D.

procedure INB-Computation(F)
    for k = 1; F_k ≠ ∅; k++ do
        if k = 1 then
            for each item x ∈ F_1 do
                if x ∉ F'_D then
                    x ∈ Bd−(F'_D)                   ▷ x is infrequent
                end if
            end for
        else if k = 2 then
            for x ∈ F_1 do
                for y ∈ F_1 do
                    if (x < y) ∧ (x ⋈ y ∉ F_2) then
                        (x ⋈ y) ∈ Bd−(F'_D)
                    end if
                end for
            end for
        else                                        ▷ for k > 2
            for x ∈ F_{k−1} do
                for y ∈ F_{k−1} do
                    if (x_1 = y_1) ∧ ... ∧ (x_{k−1} < y_{k−1}) then
                        z = x ⋈ y                   ▷ z is the join of x and y
                        if z ∉ F'_D ∧ ∄ r_{k−1} ⊂ z : r_{k−1} ∉ F'_D then
                            z ∈ Bd−(F'_D)
                        end if
                    end if
                end for
            end for
        end if
    end for
end procedure
algorithm we move top-down in the lattice to identify infrequent itemsets all of whose proper subsets are frequent in $F'_D$. First, we examine all 1-itemsets (i.e., items). If any of these itemsets is infrequent, it should be included in the revised negative border. Then, we examine all 2-itemsets by properly joining (symbol ⋈ denotes a join) the frequent 1-itemsets. Again, if the produced 2-itemset does not exist in $F'_D$, we include it in $Bd^-(F'_D)$. To examine k-itemsets (where k > 2), we first construct them by properly joining frequent (k − 1)-itemsets (as in Apriori) and then check to see if the produced itemset is frequent in $F'_D$. If the itemset is reported as infrequent, we then examine all its proper (k − 1)-subsets. If none of these is infrequent, then the itemset belongs to the revised negative border, so we include it in $Bd^-(F'_D)$.

Algorithm 9.4 Hiding of all the sensitive itemsets and their supersets.

procedure HideSS(F_{D_O}, S_max)
    for each s ∈ S_max do                           ▷ for all sensitive itemsets
        for each f ∈ F_{D_O} do                     ▷ for all large itemsets
            if s ⊆ f then                           ▷ the large itemset is sensitive
                F_{D_O} ← F_{D_O} − {f}             ▷ remove itemset f
            end if
        end for
    end for
    Return: F'_D ← F_{D_O}
end procedure
Last, Algorithm 9.4 presents the hiding process in which we identify $F'_D$ by removing from $F_{D_O}$ all the sensitive itemsets and their supersets. To do so, we iterate over all the sensitive itemsets and their supersets (i.e., set $S_{max}$), and the large itemsets in $F_{D_O}$, and we identify all those large itemsets that are supersets of the sensitive ones. Then, we remove these itemsets from the list of frequent itemsets, thus constructing a new set $F'_D$ with the remaining large itemsets in $F_{D_O}$.
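As a cross-check of the running example, the sets discussed in this chapter can be computed directly from their definitions. The sketch below does not follow the level-wise Algorithms 9.1 to 9.4; it uses plain set comprehensions instead, which is simpler but less efficient. All function names are hypothetical, and the item universe a to f passed to `negative_border` is an assumption (item f is infrequent already in the original database, so it appears in both negative borders under this definition).

```python
def fs(itemsets):
    """Turn strings such as "acd" into frozensets of items."""
    return {frozenset(i) for i in itemsets}

def minimal_sensitive(sensitive):
    """S_min: sensitive itemsets with no proper subset that is also sensitive."""
    return {i for i in sensitive if not any(j < i for j in sensitive)}

def hide(frequent, s_min):
    """F'_D: frequent itemsets that are not supersets of a minimal sensitive itemset."""
    return {f for f in frequent if not any(s <= f for s in s_min)}

def positive_border(frequent):
    """Maximal itemsets of `frequent` (no proper superset inside the set)."""
    return {f for f in frequent if not any(f < g for g in frequent)}

def negative_border(frequent, items):
    """Minimal itemsets outside `frequent`: all their proper subsets are in `frequent`."""
    border = {frozenset([x]) for x in items if frozenset([x]) not in frequent}
    for f in frequent:
        for x in items:
            z = f | {x}
            if x not in f and z not in frequent and all(z - {y} in frequent for y in z):
                border.add(z)
    return border

F_DO = fs(["a", "b", "c", "d", "e", "ab", "ac", "ad", "ae", "bc", "be",
           "cd", "ce", "abc", "acd", "ace"])          # Table 9.2
S = fs(["e", "ae", "bc"])                              # sensitive itemsets

S_min = minimal_sensitive(S)
F_rev = hide(F_DO, S_min)

def show(itemsets):
    return sorted("".join(sorted(i)) for i in itemsets)

print(show(S_min))                             # ['bc', 'e']
print(show(F_rev))                             # ['a', 'ab', 'ac', 'acd', 'ad', 'b', 'c', 'cd', 'd']
print(show(positive_border(F_rev)))            # ['ab', 'acd']
print(show(negative_border(F_rev, "abcdef")))  # ['bc', 'bd', 'e', 'f']
```

Calling `positive_border(F_DO)` and `negative_border(F_DO, "abcdef")` in the same way gives the original borders of the example.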
Chapter 10
BBA Algorithm
Sun & Yu [66, 67] in 2005 proposed the first frequent itemset hiding methodology that relies on the notion of the border [46] of the nonsensitive frequent itemsets to track the impact of altering transactions in the original database. By evaluating the impact of each candidate item modification on the itemsets of the revised positive border, the algorithm greedily applies those modifications (item deletions) that cause the least impact on the border itemsets. As already covered in the previous chapter, the border itemsets implicitly dictate the status (i.e., frequent vs. infrequent) of every itemset in the database. Consequently, the quality of the borders directly affects the quality of the sanitized database that is produced by the hiding algorithm.

The heuristic strategy that was proposed in [66, 67] assigns a weight to each itemset of the revised positive border (which is the original positive border after it has been reshaped by the removal of the sensitive itemsets) in an attempt to quantify its vulnerability to being affected by an item deletion. The assigned weights are dynamically computed during the sanitization process as a function of the current support of the corresponding itemsets in the database. To hide a sensitive itemset, the algorithm calculates the expected impact of each candidate item deletion on the itemsets of the revised positive border, by computing the sum of the weights of the revised positive border itemsets that will be affected. Then, the algorithm determines the optimal deletion candidate item, which is the item whose deletion has minimal impact on the revised positive border, and deletes this item from a set of carefully selected transactions. The proposed strategy aims to minimize the number of nonsensitive frequent itemsets that are affected by the hiding of the sensitive knowledge, and it also attempts to maintain the relative support of the nonsensitive frequent itemsets in the sanitized database.
The rest of this chapter is organized as follows. In Section 10.1, we explicitly state the objectives that drive the hiding process of the BBA algorithm. Following that, Section 10.2 provides the strategy that is employed for the hiding of a sensitive itemset in a way that bears minimal impact on the revised positive border, and presents the order in which the sensitive itemsets are selected to be hidden. Finally, Section 10.3 delivers the pseudocode of the algorithm.
10.1 Hiding Goals

In Chapter 2 (Section 2.2.1) we presented the main goals of association rule hiding methodologies. In the context of frequent itemset hiding, these goals can be restated as follows: (i) all the sensitive itemsets that are mined from the original database should be hidden so that they cannot be mined from its sanitized counterpart under the same (or a higher) threshold of support, (ii) all the nonsensitive frequent itemsets of the original database should be preserved in the sanitized database so that they can be mined under the same threshold of support, and (iii) no ghost itemsets are introduced to the sanitized database, i.e. no itemset that was not among the nonsensitive frequent ones mined from the original database can be mined from the sanitized database under the same or a higher threshold of support.

To achieve the first goal, the BBA algorithm operates by deleting specific items belonging to the sensitive itemsets from a set of carefully selected transactions, until the support of the sensitive itemsets drops below the minimum support threshold and thus the itemsets are hidden. To satisfy the second goal, the algorithm attempts to minimize the number of nonsensitive itemsets that are lost due to the enforced sanitization process, by tracking the effect of each candidate item deletion on the itemsets of the revised positive border. Last, it is trivial to observe that compliance of the hiding algorithm with the third goal is achieved by construction, since the algorithm operates by applying only item deletions and thus no infrequent itemsets of the original database can become frequent in its sanitized counterpart.

Apart from the main goals presented above, the BBA hiding algorithm also aims to maintain the relative support of the nonsensitive frequent itemsets in the sanitized database.
Clearly, based on the items that are selected for deletion from transactions of the original database, as well as the transactions from which these items are removed to facilitate the hiding of the sensitive knowledge, the nonsensitive itemsets can be affected to a substantially different degree. Maintaining the relative support of the nonsensitive frequent itemsets is an important feature of the BBA algorithm, as it allows more accurate mining of the sanitized database at different support thresholds. As an example of this property of BBA, assume two nonsensitive frequent itemsets I and J of the original database $D_O$ for which it holds that $sup(I, D_O) > sup(J, D_O)$. Now assume that in the sanitized database D of $D_O$ it holds that $sup(I, D) < sup(J, D)$. Evidently, if we mine D with a minimum support threshold $msup = sup(J, D)$, then itemset I will be reported as infrequent. However, this itemset should have been frequent in D based on $D_O$. Generally, we would like the hiding process to preserve to the highest possible extent the difference in the support of the nonsensitive frequent itemsets, i.e. for two nonsensitive frequent itemsets I and J of $D_O$ it should hold that $sup(I, D_O) - sup(J, D_O)$ is as close as possible to $sup(I, D) - sup(J, D)$.
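The goals of this section can be checked mechanically once the frequent itemsets of the original and the sanitized database are known. The following sketch is one hypothetical way to do so: `evaluate_hiding` and `support_gap_drift` are names invented here, the sanitized mining result and the sanitized supports are made-up values used only to exercise the functions, and ghost itemsets are reported separately from sensitive itemsets that failed to be hidden.

```python
def evaluate_hiding(freq_orig, freq_sanitized, sensitive):
    """Compare the frequent itemsets mined before and after sanitization.

    All arguments are sets of frozensets. Returns the itemsets that violate
    goals (i), (ii) and (iii) of this section, in that order.
    """
    to_hide = {f for f in freq_orig if any(s <= f for s in sensitive)}
    not_hidden = to_hide & freq_sanitized          # goal (i): sensitive but still frequent
    lost = (freq_orig - to_hide) - freq_sanitized  # goal (ii): nonsensitive but lost
    ghosts = freq_sanitized - freq_orig            # goal (iii): newly frequent
    return not_hidden, lost, ghosts

def support_gap_drift(sup_orig, sup_sanitized, i, j):
    """|(sup(I, D_O) - sup(J, D_O)) - (sup(I, D) - sup(J, D))| for itemsets i and j."""
    return abs((sup_orig[i] - sup_orig[j]) - (sup_sanitized[i] - sup_sanitized[j]))

def fs(itemsets):
    return {frozenset(i) for i in itemsets}

F_DO = fs(["a", "b", "c", "d", "e", "ab", "ac", "ad", "ae", "bc", "be",
           "cd", "ce", "abc", "acd", "ace"])             # Table 9.2
S = fs(["e", "ae", "bc"])
# Hypothetical mining result on some sanitized database in which {acd} was lost.
F_D = fs(["a", "b", "c", "d", "ab", "ac", "ad", "cd"])

not_hidden, lost, ghosts = evaluate_hiding(F_DO, F_D, S)
print(not_hidden, lost, ghosts)  # set() {frozenset({'a', 'c', 'd'})} set()  (item order may vary)

# Hypothetical supports of {ac} and {ab} before and after sanitization.
print(support_gap_drift({"ac": 5, "ab": 4}, {"ac": 3, "ab": 4}, "ac", "ab"))  # 2
```

In the last line the ordering of the two itemsets is reversed by the sanitization (5 > 4 originally, 3 < 4 afterwards), which is exactly the situation the example in the text warns about.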
10.2 Hiding Process

The BBA algorithm concentrates on the itemsets of the revised positive border to evaluate the impact of each candidate item deletion on the quality of the database and, subsequently, select the item whose deletion introduces the least side-effects to the hiding process. Given an itemset I that belongs to the set of minimal sensitive itemsets $S_{min}$, the authors of [66, 67] define the set $C_I$ that contains all pairs (T, i) of transactions T and items i ∈ I from the original database $D_O$ that can be used to lower the support of itemset I. $C_I$ is called the set of hiding candidates for itemset I and is formally represented as $C_I = \{(T, i) \mid T \in D_I \wedge i \in I\}$, where $D_I$ is the set of supporting transactions for itemset I. Using this notation, pair $(T_o, i_o)$ represents the deletion of candidate item $i_o$ from transaction $T_o$ of the original database $D_O$, to facilitate the lowering of the support of a sensitive itemset through item $i_o$. When a hiding candidate is deleted, all other hiding candidates that involve the same sensitive itemset and the same transaction are removed from set $C_I$, since their potential deletion will not affect the support of the sensitive itemset. Thus, by construction, set $C_I$ consists of hiding candidates, the deletion of each of which leads to a decrement of one in the support of a sensitive itemset. As a result, the BBA algorithm has to select the minimum number of hiding candidates that are associated with each itemset in $S_{min}$ and apply the corresponding deletions to facilitate the hiding of this itemset. The selected hiding candidates will be the ones that affect the itemsets of the positive border, as well as their relative support, the least. The aim of the proposed algorithm is to apply the necessary item deletions so that the support of each sensitive itemset in the sanitized database D is just below the minimum support threshold and, thus, the itemset is properly hidden.
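For the running example of Chapter 9, the set of hiding candidates can be enumerated with a short sketch. The function names are hypothetical, transactions are identified by their 1-based row number in Table 9.1 (an assumption made here for readability), and the minimum support is taken as a count of 3. The second function only counts how many supporting transactions must be altered for the support to end up just below the threshold; it does not choose among candidates, which is the role of the weighting scheme.

```python
D_O = ["abf", "abcd", "acf", "a", "be", "abcde", "d", "abce", "bc", "acde"]  # Table 9.1

def hiding_candidates(db, itemset):
    """C_I: all (transaction number, item) pairs whose transaction supports I."""
    target = set(itemset)
    return [(tid, item)
            for tid, t in enumerate(db, start=1)
            if target <= set(t)
            for item in sorted(target)]

def deletions_needed(db, itemset, msup):
    """Supporting transactions to alter so that the support drops to msup - 1."""
    support = sum(1 for t in db if set(itemset) <= set(t))
    return max(0, support - msup + 1)

print(hiding_candidates(D_O, "bc"))
# [(2, 'b'), (2, 'c'), (6, 'b'), (6, 'c'), (8, 'b'), (8, 'c'), (9, 'b'), (9, 'c')]
print(deletions_needed(D_O, "bc", msup=3))  # 2
```

Since {bc} has support 4 and the threshold is 3, two of the four supporting transactions have to lose either b or c; once a candidate from a transaction is applied, the other candidate of that transaction becomes irrelevant, as described above.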
The hiding of a sensitive itemset may cause a decrement to the support of some nonsensitive frequent itemsets that share common items with the sensitive one. Depending on the properties of the database and the minimum support threshold that is used, it is also possible that some nonsensitive frequent itemsets become lost as a side-effect of the hiding process. As is evident, the loss of one or more nonsensitive frequent itemsets causes a retreat of the borderline that separates the frequent from the infrequent itemsets in the (under sanitization) database. The retreat of the borderline directly affects the itemsets that participate in the revised positive border. Thus, when an itemset of the revised positive border becomes infrequent, the border should be recomputed to account for this change as well as to allow for accurate subsequent decisions to be made by the sanitization process. However, such computations may be costly and, in order to avoid this extra cost, the BBA algorithm operates under the assumption that the new border, after the loss of nonsensitive frequent itemsets, will be a good approximation of the old one in most cases¹.

¹ An interesting observation that should be made at this point is that the need of recomputing the revised positive border is dictated by the greedy nature of the BBA algorithm. Since item deletions are applied in a one-by-one fashion, each item deletion may cause the loss of one or more itemsets that belong to the revised positive border, which subsequently needs to be recalculated to accommodate this change. Contrary to border based approaches, exact hiding algorithms (see Chapters 14, 15 and 16) do not suffer from this shortcoming, since they apply all necessary item
50
10 BBA Algorithm
In the sections that follow, we first examine the way that the proposed border based algorithm weighs the itemsets of the revised positive border to quantify the impact of an item deletion. Then, we present the process that is followed by BBA for the hiding of a sensitive itemset with minimal impact on the revised positive border. Last, we present the ordering scheme that is used by BBA in order to hide all the sensitive itemsets of Smin in the sanitized database D that it produces.
10.2.1 Weighing Border Itemsets As we presented earlier, the deletion of a hiding candidate may impact itemsets from the revised positive border, possibly causing some of these items to become lost in the sanitized database D. To ameliorate this problem, BBA employs a weighting scheme that allows it to select, at each point, the item deletion that causes minimal impact on the itemsets of the revised positive border. To achieve this, each itemset of the revised positive border is weighted based on its vulnerability of being affected by an applied item deletion. In the proposed weighting scheme, larger weights are assigned to border itemsets that are more vulnerable and, thus, should have a lower priority of being affected. The authors of [66, 67] define the weight of a border itemset in the following way: ( sup(I,D )−sup(I,D0 )+1 O , sup(I, D0 ) ≥ msup + 1 + sup(I,DO )−msup w(I ∈ Bd ) = 0 λ + msup − sup(I, D ) , 0 ≤ sup(I, D0 ) ≤ msup where D0 is the database during the sanitization process, sup(I, D0 ) is the current support of itemset I in the database, and λ is an integer with a value greater than the number of itemsets that participate to the revised positive border. Although we will not discuss the properties of the proposed weighting function in detail2 , we briefly note that this function is designed to (i) encourage item deletions that minimally affect frequent itemsets of the revised positive border, (ii) prevent item deletions that will cause frequent itemsets of the revised positive border to be lost, and (iii) prevent item deletions that will cause an extra loss in the support of already lost itemsets from the revised positive border (thus try to keep the lost border itemsets near the revised borderline). Moreover, the rate in which the weights of itemsets that belong to the revised positive border increases, is designed to allow for maintaining the relative support of the itemsets in the sanitized database.
modifications as an atomic operation. This means that all possible item modifications are applied at the same time and thus there is no need to recompute the border. 2 The reader is encouraged to refer to [67] for a thorough discussion of the properties and the rationale behind the employed weighting scheme.
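To make the behavior of the weighting function concrete, the following minimal Python sketch evaluates it for a single border itemset. The function name border_weight, its parameter names, and the numeric values are illustrative assumptions; supports are taken to be absolute counts.

```python
def border_weight(sup_orig, sup_curr, msup, lam):
    """Weight of a border itemset, following the two-branch definition above.

    sup_orig: support in the original database D_O
    sup_curr: current support in the database D' under sanitization
    msup:     minimum support threshold
    lam:      integer larger than the size of the revised positive border
    """
    if sup_curr >= msup + 1:
        # Weight grows as the current support moves away from the original.
        return (sup_orig - sup_curr + 1) / (sup_orig - msup)
    # At or below the threshold: a large weight discourages further losses.
    return lam + msup - sup_curr


# Hypothetical itemset with original support 10, msup = 4 and lambda = 5.
w_untouched = border_weight(10, 10, 4, 5)   # (10 - 10 + 1) / 6
w_near_edge = border_weight(10, 5, 4, 5)    # (10 - 5 + 1) / 6
w_at_msup = border_weight(10, 4, 4, 5)      # 5 + 4 - 4
w_below = border_weight(10, 3, 4, 5)        # 5 + 4 - 3
```

The weight of the first branch stays at most 1 while the itemset remains above the threshold, and jumps to at least λ once the second branch applies, which is what steers the algorithm away from such deletions.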
10.2.2 Hiding a Sensitive Itemset
BBA accomplishes the hiding of a sensitive itemset by applying a series of item deletions dictated by a set of hiding candidates, each of which results in a decrement of one in the support of the sensitive itemset. Under the weighting scheme that was discussed earlier, each itemset of the revised positive border carries a weight that denotes its vulnerability to being affected. Using these weights, the proposed algorithm has to select the hiding candidate that has minimal effect on the itemsets of the revised positive border and enforce the corresponding item deletion. It is important to observe that the hiding of a sensitive itemset may affect only those itemsets of the revised positive border with which it shares at least one common item. This is due to the fact that the hiding of a sensitive itemset corresponds to a reduction in the support of some of the items it contains, while the support of the rest of the items (in the universe of all possible items from I) will remain unaffected. The authors of [66, 67] define the set Bd+|I of possibly affected itemsets J from the revised positive border Bd+ (due to the hiding of an itemset I) as the affected border, where Bd+|I = {J | J ∈ Bd+ ∧ I ∩ J ≠ ∅}³. Given a hiding candidate c = (To, io) for a sensitive itemset I, the border based approach calculates the impact of deleting c as the sum of the weights of the itemsets in the revised positive border that will be affected by this item’s deletion. Clearly, these itemsets are the subset of those in Bd+|I which contain item io. By applying this strategy, the BBA algorithm computes in each iteration the impact of each hiding candidate and selects for deletion the one bearing the minimum impact.
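The selection of a hiding candidate can be sketched in a few lines of Python. The toy database, the border itemsets and their weights are hypothetical. The sketch also assumes that the impact is the sum of the weights of the affected border itemsets, and it counts an itemset as affected only when the chosen transaction actually supports it, since only then does the deletion lower its support; this last restriction is our reading and is not spelled out in the text above.

```python
def affected_border(sensitive, border):
    """Bd+|I: border itemsets sharing at least one item with `sensitive`."""
    return [J for J in border if sensitive & J]


def impact(candidate, db, border_i, weights):
    """Sum of weights of affected-border itemsets hit by deleting (T, i)."""
    tid, item = candidate
    return sum(weights[J] for J in border_i
               if item in J and J <= db[tid])


db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c", "d"}]
I = frozenset({"a", "b"})
border = [frozenset({"a", "c"}), frozenset({"b", "c"}),
          frozenset({"b", "d"}), frozenset({"c", "d"})]
# Hypothetical weights, e.g. as produced by the weighting function.
weights = {border[0]: 0.5, border[1]: 1.0, border[2]: 2.0, border[3]: 1.5}

border_i = affected_border(I, border)          # {c, d} is not affected
candidates = [(tid, i) for tid, t in enumerate(db) if I <= t
              for i in sorted(I)]
# Pick the candidate with minimal impact; ties are broken by candidate order.
best = min(candidates, key=lambda c: (impact(c, db, border_i, weights), c))
```

Here transaction 1 supports I but none of the border itemsets, so deleting an item from it has zero impact and is preferred.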
10.2.3 Order of Hiding Itemsets
The order in which the sensitive itemsets from Smin are hidden by the border based algorithm plays an important role in the quality of the constructed sanitized database D⁴. This is due to the fact that the affected borders for two or more sensitive itemsets may contain common itemsets of the revised positive border. Assuming two overlapping affected borders Bd+|I and Bd+|J (for two sensitive itemsets I, J ∈ Smin), enforcing the hiding candidates for I may change the weight of some itemsets in Bd+|J and thus affect subsequent decisions that are taken for the hiding of J. As a result, the authors propose that the itemsets in Smin are hidden in decreasing order of length, breaking ties in increasing order of support.
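As a small illustration of this ordering, the Python sketch below sorts a hypothetical set Smin by decreasing length and then by increasing support. The toy database and the helper name support are assumptions.

```python
def support(itemset, db):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c", "d"}]
s_min = [frozenset({"a", "b"}), frozenset({"b", "c", "d"}),
         frozenset({"a", "c"})]

# Longer itemsets first; among equal lengths, lower support first.
order = sorted(s_min, key=lambda I: (-len(I), support(I, db)))
```

The three-item itemset is processed first, followed by {a, c} (support 2) and then {a, b} (support 3).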
3 It is interesting to contrast this formula to (14.5), which is used by the inline algorithm to select itemsets whose status (frequent vs. infrequent) needs to be controlled by the hiding algorithm.
4 Sun & Yu [66, 67] evaluate the quality of the sanitized database from the viewpoint of lost nonsensitive itemsets in the hiding process. An alternative way of evaluating the quality of the sanitized database is employed by Gkoulalas–Divanis & Verykios in [23]. The proposed algorithm uses the number of item deletions that were necessary for the hiding of the sensitive itemsets.
Algorithm 10.1 The Border Based Approach of Sun & Yu [66, 67].
function BBA(Original database DO, frequent itemsets FDO, sensitive itemsets S, minimum support threshold msup)
    D′ ← DO
    Compute Smin and Bd+
    Sort itemsets in Smin in decreasing order of length and increasing order of support
    for each I ∈ Smin do
        Compute Bd+|I and w(I ∈ Bd+|I)
        Initialize the set C|I of hiding candidates for itemset I        ▷ candidate selection
        for i = 0; i < sup(I, D′) − msup + 1; i++ do
            Find hiding candidate c = (To, io) with minimal impact in C
            C ← C − {(T, i) | T = To}
            Update w(I ∈ Bd+|I)
        end for
        Update database D′
    end for
    Return: sanitized database D ← D′
end function
10.3 Algorithm
Algorithm 10.1 summarizes the border based approach for the hiding of the sensitive itemsets. At each step the algorithm selects the hiding candidate that minimally affects the itemsets of the revised positive border and applies the corresponding item deletion. The details of the candidate selection process (see the inner for-loop) are beyond the scope of this chapter. The interested reader can refer to [67].
Chapter 11
Max–Min Algorithms
Moustakides & Verykios in [50, 51] propose two border based methodologies that rely on the max–min criterion for the hiding of the sensitive itemsets. Both methodologies use the revised positive border of the frequent itemsets to track the impact of each tentative item modification that helps towards the hiding of a sensitive itemset. Then, they select to apply those item modifications to the original database that effectively conceal all the sensitive knowledge, while minimally affecting the itemsets of the revised positive border and, consequently, the nonsensitive frequent itemsets. For each item of a sensitive itemset, the Max–Min algorithms identify the set of itemsets from the revised positive border which depend on it, and select among them the ones that are supported the least. Then, from among all minimum supported border itemsets (coming from the previously computed sets for the different items of the sensitive itemset), the itemset with the highest support is selected as it is the one with the maximum distance from the borderline that separates the frequent from the infrequent itemsets. This itemset, henceforth called the max–min itemset, determines the item through which the hiding of the corresponding sensitive itemset will take place. The proposed Max–Min algorithms delete this item from selected transactions in a way that the support of the max–min itemset is minimally affected. When hiding multiple sensitive itemsets, the algorithms perform the sanitization in a one-by-one fashion starting from the sensitive itemsets that have lower supports and moving towards itemsets of higher support. Through experimental evaluation, both methodologies are shown to provide superior hiding solutions when compared to the BBA algorithm of [66,67], while also being less computationally demanding. 
In the rest of this chapter, we explain the main ideas behind the max–min optimization criterion that is employed by the Max–Min algorithms to effectively conceal the sensitive knowledge. Moreover, we shed light on the workings of the two algorithms, and elaborate on their commonalities and differences. Specifically, in Section 11.1 we lay out the underlying principles of the max–min criterion and discuss the process that is followed by both algorithms for the hiding of a sensitive itemset. Following that, in Section 11.2 we clarify the order in which the Max–Min algorithms process the sensitive itemsets to effectively hide them in the sanitized database. Last, Sections 11.3 and 11.4 present the different steps that are involved in the operation of each of these two approaches and demonstrate why the Max–Min 2 variant outperforms Max–Min 1 in the majority of hiding scenarios.
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_11, © Springer Science+Business Media, LLC 2010
11.1 Hiding a Sensitive Itemset
Similarly to the border based algorithm of Sun & Yu [66, 67] (covered in Chapter 10), the Max–Min algorithms proposed by Moustakides & Verykios [50, 51] aim to hide the sensitive itemsets in a way that minimizes the impact of the sanitization process on the itemsets of the revised positive border. Both algorithms take as input the itemsets of the revised positive border and the sensitive itemsets that participate in the revised negative border. Assuming that a sensitive itemset I needs to be hidden from the original database DO, the Max–Min algorithms have to determine the item i ∈ I that will be modified in selected transactions of the database to accommodate the hiding of itemset I. Each item of a sensitive itemset is called a tentative victim item, since it can be used for hiding this itemset. Among the tentative victim items, the one that is selected by the hiding algorithms (here, for example, item i) is called the victim item, since it will be deleted from transactions of the original database to facilitate the hiding of the sensitive itemset. The deletion of an item i ∈ I from transactions of the original database, as already discussed in Chapter 10, may affect itemsets that belong to the revised positive border, rendering some of them infrequent in the sanitized database. To limit this side-effect, for every tentative victim item j ∈ I the Max–Min algorithms compute the set of itemsets Bd+|j of the revised positive border Bd+ that contain this item. The itemsets in Bd+|j are termed tentative victim itemsets since their status (frequent vs. infrequent) may be affected by the deletion of item j. From each computed set Bd+|j, j ∈ I, the itemsets with the minimum support (henceforth referred to as the minimum support itemsets) are selected, since they lie closest to the borderline between the negative and the positive border and, thus, are the most vulnerable to being accidentally lost.
Among all minimum support itemsets coming from the Bd+|j sets for the different items j ∈ I, the Max–Min algorithms select the itemset that has the maximum support. The rationale behind this criterion is that this itemset, termed the max–min itemset, has the maximum distance from the borderline among all minimum support itemsets. The max–min itemset determines the tentative victim item that will be deleted from transactions of the database to facilitate the hiding of the sensitive itemset I. The goal of the Max–Min algorithms is to modify the victim item that is indicated by the max–min itemset, in a way that minimally affects the support of this itemset. When more than one max–min itemset exists, the corresponding set of itemsets, known as the max–min set, reflects the possible victim items that can be used to hide the same sensitive itemset. At this point, the two Max–Min algorithms operate in a different way to select the victim item from the tentative ones and proceed to its sanitization.
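The max–min criterion can be illustrated with a short Python sketch. The function name max_min, the dictionary that maps each itemset of the revised positive border to its current support, and the example supports are all hypothetical. Items that appear in no border itemset are simply skipped, which is an assumption of this sketch rather than a rule stated in [50, 51].

```python
def max_min(sensitive, border_support):
    """Return (victim item, max-min itemset) for one sensitive itemset.

    border_support maps each itemset of Bd+ to its current support.
    """
    minima = {}
    for j in sorted(sensitive):
        # Bd+|j: tentative victim itemsets, i.e. border itemsets containing j.
        bd_j = {J: s for J, s in border_support.items() if j in J}
        if bd_j:
            # Minimum support itemset of Bd+|j (ties broken deterministically).
            j_min = min(bd_j, key=lambda J: (bd_j[J], sorted(J)))
            minima[j] = (j_min, bd_j[j_min])
    # Among the minima, take the itemset with the maximum support.
    victim = max(minima, key=lambda j: minima[j][1])
    return victim, minima[victim][0]


I = frozenset({"a", "b"})
border_support = {
    frozenset({"a", "c"}): 5, frozenset({"a", "d"}): 3,
    frozenset({"b", "c"}): 4, frozenset({"b", "e"}): 6,
}
victim, mm_itemset = max_min(I, border_support)
```

For item a the minimum support itemset is {a, d} with support 3, for item b it is {b, c} with support 4; the larger of the two minima belongs to b, which therefore becomes the victim item.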
11.2 Order of Hiding Itemsets
Given a set of sensitive itemsets, the order in which these itemsets are hidden may have a significant impact on the quality of the sanitized database. Thus, when hiding multiple sensitive itemsets, the proposed Max–Min algorithms operate by first sorting the sensitive itemsets in ascending order of support and then hiding them one by one, starting from the itemset that is least supported in the original database DO. The rationale behind the selected order is that sensitive itemsets with a low support are easier to hide as they are already positioned closer to the borderline than their higher support counterparts. An interesting observation (proven in [51]) is that while the sensitive itemset with the minimum support is hidden, its support constantly remains lower than the support of the rest of the sensitive itemsets in the database. This behavior is attributed to the fact that the Max–Min algorithms decrease the support of the sensitive itemset that is selected for hiding by one in each iteration, as they operate by deleting the victim item from one transaction of the database. As a result, the support of any other sensitive itemset may decrease by at most one in each iteration of the algorithm, if this itemset is affected by the employed item deletion. Still, however, the support of these itemsets will continue to be higher than that of the currently selected sensitive itemset. Even so, the original ordering of the sensitive itemsets in terms of support can be significantly affected during the sanitization process. This is because the support of the sensitive itemsets that are not yet selected for hiding may change to a different degree based on the victim items that are selected by the hiding algorithm and the transactions from which these items are deleted.
Consequently, the Max–Min algorithms need to recompute the supports of all the remaining sensitive itemsets after hiding the currently selected itemset and choose to hide the sensitive itemset that has the new minimum value of support. Last, in the case of a tie among the sensitive itemsets in terms of support, the Max–Min algorithms give preference to the hiding of longer sensitive itemsets over shorter ones. This preference is justified by the authors on the grounds of the degree of freedom that such a choice bestows for the selection of the victim item among the tentative ones.
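The selection rule described in this section can be written compactly in Python. The helper names support and next_to_hide and the toy database are assumptions; in an actual run the function would be called again after each itemset is hidden, so that the supports are recomputed on the current database.

```python
def support(itemset, db):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def next_to_hide(remaining, db):
    """Pick the sensitive itemset with minimum current support;
    ties are broken in favor of the longer itemset."""
    return min(remaining, key=lambda I: (support(I, db), -len(I)))


db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c", "d"}]
S = [frozenset("ab"), frozenset("ac"), frozenset("abc")]

chosen = next_to_hide(S, db)
```

Both {a, c} and {a, b, c} have support 2 in this example, so the longer itemset {a, b, c} is selected first.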
11.3 Max–Min 1 Algorithm
Max–Min 1 is a straightforward heuristic algorithm that applies the max–min criterion for the hiding of the sensitive itemsets. Until all the sensitive itemsets are hidden from DO, the algorithm selects the sensitive itemset I that currently has the minimum value of support (breaking ties in favor of maximum length) and, for each tentative victim item j ∈ I of this itemset, it computes the tentative victim itemsets Bd+|j from the revised positive border Bd+. Then, Max–Min 1 undergoes a series of iterations until the selected sensitive itemset is properly hidden in the database. In each iteration, the algorithm computes the current max–min itemset using the minimum support itemsets from Bd+|j (for each j ∈ I) and removes the victim item i ∈ I (corresponding to the set Bd+|i) that is determined by the current max–min itemset from the first transaction in the database that supports the sensitive itemset I. In the case that the max–min itemset participates in more than one set of tentative victim itemsets and different victim items can be used for the hiding of the sensitive itemset, Max–Min 1 arbitrarily selects the tentative victim item to use for hiding. After removing the victim item from a transaction that supports the selected sensitive itemset, the algorithm revises the affected sets of tentative victim itemsets Bd+|i and proceeds to the next iteration until the itemset is hidden. The same process repeats until all the sensitive itemsets in S are hidden from DO, at which point the sanitized database D is returned. Algorithm 11.1 summarizes the operation of the Max–Min 1 algorithm.
Algorithm 11.1 The Max–Min 1 Algorithm of Moustakides & Verykios [50, 51].
function MAX–MIN 1(Original database DO, revised positive border Bd+, sensitive itemsets S, minimum support threshold msup)
    D′ ← DO
    while S ≠ ∅ do
        Select I ∈ S with minimum support, breaking ties in favor of maximum length
        For each tentative victim item j ∈ I compute its tentative victim itemsets Bd+|j
        while sup(I, D′) ≥ msup do
            Compute the max–min itemset, splitting ties arbitrarily
            Remove victim item i ∈ I, determined by the max–min itemset, from the first transaction that supports the sensitive itemset I
            Revise the tentative victim itemsets
        end while
        Remove I from S
    end while
    Return: sanitized database D ← D′
end function
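A compact Python sketch of the Max–Min 1 loop is given below under several simplifying assumptions: transactions are sets, supports are recomputed from scratch in every iteration (standing in for the revision of the tentative victim itemsets), the border is supplied as a fixed list, ties are broken by item order rather than arbitrarily, and items that occur in no border itemset are skipped. The function names and the toy data are hypothetical.

```python
def support(itemset, db):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def max_min_victim(sensitive, border, db):
    """Victim item: the one whose least supported border itemset
    has the largest support (the max-min criterion)."""
    best = None
    for j in sorted(sensitive):
        sups = [support(J, db) for J in border if j in J]
        if sups and (best is None or min(sups) > best[0]):
            best = (min(sups), j)
    # Fallback if no item occurs in the border (assumption of this sketch).
    return best[1] if best else sorted(sensitive)[0]


def max_min_1(db, border, sensitive_sets, msup):
    db = [set(t) for t in db]              # work on a copy, D' <- D_O
    remaining = list(sensitive_sets)
    while remaining:
        # Minimum current support, ties in favor of maximum length.
        I = min(remaining, key=lambda s: (support(s, db), -len(s)))
        while support(I, db) >= msup:
            victim = max_min_victim(I, border, db)
            first = next(t for t in db if I <= t)   # first supporter of I
            first.discard(victim)
        remaining.remove(I)
    return db


db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c", "d"}, {"a", "c"}]
border = [frozenset("ac"), frozenset("bc")]
sanitized = max_min_1(db, border, [frozenset("ab")], msup=2)
```

In this run the victim item changes between iterations: item a is removed first, which lowers the support of {a, c}, so the second deletion goes through item b instead and both border itemsets remain frequent.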
11.4 Max–Min 2 Algorithm
The Max–Min 2 algorithm provides a more sophisticated approach to knowledge hiding than that of Max–Min 1, by reducing the side-effects of the hiding process on the nonsensitive itemsets. While the basic features of Max–Min 1 are also adopted by Max–Min 2, this algorithm takes special care to improve the selection of the transactions of the original database that will be sanitized to accommodate the hiding of the sensitive knowledge. To this end, Max–Min 2 takes into consideration three special case scenarios that are based on the properties of the identified max–min itemsets, which are discussed in the following.
The first case scenario regards the existence of more than one max–min itemset, all of which are derived from the same tentative victim item j. In this case scenario, Max–Min 2 tries to reduce the support of the sensitive itemset through j, without affecting the support of any max–min itemset. If this is possible, then the authors in [50, 51] prove that no other itemset from the minimum support itemsets will be affected. Reducing the support of a sensitive itemset without affecting the support of the max–min itemsets from Bd+|j can be accomplished only if (i) the max–min itemsets are not subsets of the sensitive itemset, and (ii) there exist transactions in the database which support the sensitive itemset without, however, supporting the max–min itemsets. In order to get this information, Max–Min 2 computes for every sensitive itemset and for every max–min itemset the transactions of the original database that support them. Assuming that LI and LU are the resulting lists for the sensitive itemset I and the max–min itemset U, respectively, their difference LI − LU denotes the set of transactions from DO that support the sensitive itemset I without supporting the max–min itemset U. If the size of this set is at least equal to sup(I) − msup + 1, then there exists a sufficient number of transactions for hiding I without reducing the support of the max–min itemset.
The second case scenario involves the existence of more than one max–min itemset, where the max–min itemsets correspond to different tentative victim items. In this case scenario, Max–Min 2 iterates over the sets of tentative victim itemsets Bd+|j (for the different items j) that contain a max–min itemset, and examines whether the support of the sensitive itemset can be reduced through j without affecting the support of any of the corresponding tentative victim itemsets that belong to Bd+|j. If this is possible, then the authors of [50, 51] prove that the support of no other itemset in Bd+|i (i ≠ j) will be affected as a result of this process.
To examine whether the support of the sensitive itemset can be reduced through j without affecting the support of its corresponding tentative victim itemsets in Bd+|j, the authors identify the transactions in the database that support the sensitive itemset without supporting any max–min itemset in Bd+|j. Deleting item j from such transactions leads to a reduction in the support of the sensitive itemset, without affecting the support of any nonsensitive itemset. Provided that a sufficient number of transactions exist with this property, the hiding of the sensitive itemset can be achieved with minimal distortion of the original database.
Last, when the second case scenario is inapplicable, Max–Min 2 iterates over all possible pairs of the Bd+|j sets, to identify transactions that support the minimum support itemsets of the first list and are affected by the corresponding victim item, while not supporting any of the max–min itemsets that belong to the second list. If such transactions exist, then the corresponding victim item is deleted from them. Otherwise, the victim item is deleted from transactions that support the minimum support itemsets of the first list. Algorithm 11.2 presents the details pertaining to the operation of the Max–Min 2 algorithm.
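The transaction-list test of the first case scenario can be sketched in Python as follows. Transactions are referenced by their index, the helper names supporting_ids and safe_transactions are hypothetical, and the number of required deletions is taken to be sup(I) − msup + 1, i.e. the count that brings the support of I below msup; this count is an assumption of the sketch.

```python
def supporting_ids(itemset, db):
    """L_X: indices of the transactions that support `itemset`."""
    return {tid for tid, t in enumerate(db) if itemset <= t}


def safe_transactions(sensitive, max_min_itemsets, db):
    """L_I - L_U: transactions supporting the sensitive itemset
    without supporting any of the given max-min itemsets."""
    l_i = supporting_ids(sensitive, db)
    l_u = set().union(*(supporting_ids(U, db) for U in max_min_itemsets))
    return l_i - l_u


db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"},
      {"a", "b", "c", "d"}, {"a", "b", "e"}]
I = frozenset("ab")
U = frozenset("ac")          # hypothetical max-min itemset from Bd+|a

safe = safe_transactions(I, [U], db)


def enough(msup):
    """True if I can be hidden using only the safe transactions."""
    needed = len(supporting_ids(I, db)) - msup + 1
    return len(safe) >= needed
```

With msup = 3 two deletions are needed and the two safe transactions suffice; with msup = 2 three deletions are needed, so at least one transaction that supports U would have to be modified as well.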
Algorithm 11.2 The Max–Min 2 Algorithm of Moustakides & Verykios [50, 51].
function MAX–MIN 2(Original database DO, revised positive border Bd+, sensitive itemsets S, minimum support threshold msup)
    D′ ← DO
    while S ≠ ∅ do
        Select I ∈ S with minimum support, breaking ties in favor of maximum length
        For each tentative victim item j ∈ I compute its tentative victim itemsets Bd+|j
        while sup(I, D′) ≥ msup do
            Compute the max–min itemset, splitting ties arbitrarily
            if ∃ j ∈ I : max–min ⊆ Bd+|j and ∀i ≠ j : Bd+|i ∩ max–min = ∅ then
                if L ← LI − LU ≠ ∅ then
                    delete j from a transaction of list L
                else
                    delete j from a transaction of list LI
                end if
            else
                K ← {tentative victim items j ∈ I : Bd+|j ∩ max–min ≠ ∅}
                for each k ∈ K do
                    U ← max–min ∩ Bd+|k
                    if L ← LI − LU ≠ ∅ then
                        delete k from a transaction of list L
                        break
                    end if
                end for
                if L = ∅ then
                    for each k1 ∈ K do
                        for each k2 (≠ k1) ∈ K do
                            U1 ← max–min ∩ Bd+|k1
                            U2 ← max–min ∩ Bd+|k2
                            L ← (LU1 ∩ LI) − (LU2 ∩ LI)
                            if L ≠ ∅ then
                                delete k1 from a transaction of list L
                            else
                                delete k1 from a transaction of list LU1 ∩ LI
                            end if
                        end for
                    end for
                end if
            end if
        end while
        Remove I from S
    end while
    Return: sanitized database D ← D′
end function
Chapter 12
Summary
In the third part of the book, we examined two popular border-based approaches that have been recently proposed for the hiding of sensitive frequent itemsets. The first approach, called BBA, was presented in Chapter 10 and was developed by Sun & Yu in [66, 67]. This approach uses the border of the nonsensitive frequent itemsets to track the impact of altering items of transactions in the original database. The second approach, called Max–Min, was proposed by Moustakides & Verykios in [50] and was covered in Chapter 11. It involves two heuristic algorithms that rely on the max–min criterion and use the revised positive border to hide the sensitive knowledge with minimal impact on the nonsensitive frequent itemsets. By restricting the impact of any tentative item modifications to the itemsets of the revised positive border, both BBA and Max–Min manage to identify hiding solutions with fewer side-effects than pure heuristic approaches. This is because tracking the impact of item modifications on the itemsets of the borders provides a measure of the side-effects that the hiding algorithm induces on the original database (in terms of lost itemsets) and thus serves as an optimization criterion that guides the hiding process towards high quality solutions.
As a concluding remark, we should point out that although border-based approaches provide an improvement over pure heuristic approaches, they still rely on heuristics to decide upon the item modifications that they apply to the original database. As a result, in many cases these methodologies are unable to identify optimal hiding solutions, although such solutions may exist for the problem at hand.
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_12, © Springer Science+Business Media, LLC 2010
Part IV
Exact Hiding Approaches
Chapter 13
Menon’s Algorithm
Menon et al. in [47] are the first to propose a frequent itemset hiding methodology that consists of an exact part and a heuristic part to facilitate the hiding of the sensitive knowledge. The exact part uses the original database to formulate a Constraints Satisfaction Problem (CSP)¹ in the universe of the sensitive itemsets, with the objective of identifying the minimum number of transactions that have to be sanitized for the proper hiding of the sensitive knowledge. The optimization process of solving the CSP is driven by a criterion function that is inspired by the measure of accuracy [42, 60], essentially penalizing the hiding algorithm based on the number of transactions (instead of items) that are sanitized from the original database to accommodate the hiding of the sensitive itemsets. The constraints imposed in the integer programming formulation aim to capture the transactions of the database that need to be sanitized for the hiding of each sensitive itemset. An integer programming solver is then used to find the solution of the CSP that optimizes the objective function and to derive the value of the objective. In turn, the heuristic algorithm uses this information to perform the actual sanitization of the database.
An important contribution of the authors, apart from the algorithm itself, is a discussion of the possibility of parallelizing the exact part of the algorithm, which is also the most time-consuming. As demonstrated, based on the underlying properties of the database to be sanitized, it is possible for the produced CSP to be decomposed into parts that can be solved independently. Bearing in mind the exponential complexity of solving a CSP, this process can drastically reduce the runtime that is required for the hiding of the sensitive knowledge. The rest of this chapter is organized as follows.
In Section 13.1, we discuss the exact part of the proposed hiding methodology that involves formulating a CSP over the sensitive itemsets and subsequently solving it through the application of integer programming. The goal of the formulated CSP is to maximize the accuracy of the produced sanitized database by identifying the minimum number of transactions that need to be modified to hide the sensitive knowledge. Apart from the formulation of the CSP, in this section we also shed light on a decomposition approach that was proposed in [47] to improve the efficiency of its solving. Following that, Section 13.2 delivers two simple heuristic strategies that can be used to perform the actual hiding of the sensitive itemsets from the original database.
\[
\begin{aligned}
\text{minimize} \quad & \sum_{T_n \in D_S} x_i \\
\text{subject to} \quad & \sum_{T_n \in D_S} a_{ij} x_i \geq sup(I, D_O) - msup + 1, && \forall I \in S \\
& x_i \in \{0, 1\}, && \forall i \in D_O
\end{aligned}
\]
Fig. 13.1: CSP formulation for the exact part of Menon’s algorithm [47].
1 A CSP [62] is defined by a set of variables and a set of constraints, where each variable has a non-empty domain of potential values. The constraints, on the other hand, involve a subset of the variables and specify the allowable combinations of valid values that these variables can attain. An assignment that does not violate the set of constraints is called consistent. A solution of a CSP is a complete assignment of values to the variables that satisfies all the constraints. In CSPs we usually wish to maximize or minimize an objective function subject to a number of constraints. CSPs can be solved by using various techniques, such as linear and non-linear programming [44].
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_13, © Springer Science+Business Media, LLC 2010
13.1 Exact Part
In this section we shed light on the exact part of Menon’s algorithm [47], which involves the formulation of a constraints satisfaction problem. The solution of this CSP provides the minimum number of transactions from the original database that have to be sanitized to conceal the sensitive knowledge. In what follows, Section 13.1.1 elaborates on the properties of the CSP and provides its formulation. After formulating the CSP, an integer programming solver, such as ILOG CPLEX [36], GNU GLPK [29] or XPRESS-MP [32], can be used to solve it and derive the value of the objective function along with the values of the participating variables. Since the produced CSP may involve too many constraints and, as it reduces to the set-covering problem [47], is NP-complete [22], Section 13.1.2 delivers a decomposition approach that can be employed in certain cases to improve the runtime of solving the CSP and enhance the efficiency of the hiding algorithm.
13.1.1 CSP Formulation
The goal of the exact part of Menon’s algorithm [47] is to identify the minimum number of transactions from the original database DO that have to be sanitized so that every sensitive itemset is hidden in database D. As is evident, the minimum number of transactions that have to be sanitized for the hiding of a sensitive itemset is equal to the current support that this itemset has in the database minus the minimum support threshold plus one. When this number of transactions no longer support the sensitive itemset, the support of the itemset in D drops below the minimum support threshold and thus the itemset is hidden. This observation led the authors of [47] to formulate the hiding process as a CSP that contains one constraint for each (sensitive) itemset in S, effectively regulating which of the supporting transactions of the itemset will have to be sanitized to bring its support just below the minimum support threshold. To allow for an optimal solution of the CSP, the objective criterion that is selected requires the least number of transactions, among those supporting the sensitive knowledge in S, to be marked for sanitization. Figure 13.1 presents the formulation of the CSP that is considered in [47]. In this figure, we use the notation DS to capture the transactions of DO that support sensitive itemsets from S, and we adopt the parameters aij and xi from [47], which are defined as follows:
\[
a_{ij} =
\begin{cases}
1, & \text{if transaction } T_i \in D_S \text{ supports the } j\text{-th itemset in } S \\
0, & \text{otherwise}
\end{cases}
\tag{13.1}
\]
\[
x_i =
\begin{cases}
1, & \text{if transaction } T_i \in D_S \text{ has to be sanitized} \\
0, & \text{otherwise}
\end{cases}
\tag{13.2}
\]
As one can observe, the objective function ∑Tn∈DS xi of the CSP requires the least number of ones for parameter xi, which translates to the minimum number of transactions in DS that have to be sanitized to conceal the sensitive knowledge. Since this objective penalizes the hiding algorithm for every transaction that is marked for sanitization, it maximizes the accuracy of the released database. Moreover, the first constraint of the CSP requires that at least (sup(I, DO) − msup + 1) transactions are sanitized for the hiding of each sensitive itemset in S, which are selected among the supporting ones. Last, the second constraint of the CSP requires that the xi’s computed by the integer programming solver are binary variables, essentially capturing which transactions from DS will have to be sanitized to effectively conceal the sensitive knowledge. Given the proposed CSP formulation, an interesting observation is that the data owner can select to hide each sensitive itemset to a different degree in the sanitized database. This can be easily accomplished by using a different minimum support threshold, say msupI, for each sensitive itemset I ∈ S that replaces the common msup threshold in the right hand side of the first constraint of the CSP.
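For a very small instance, the CSP can be solved by exhaustive search, which makes the role of the objective and of the constraints easy to inspect. The Python sketch below is only a brute-force stand-in for an integer programming solver and is exponential in the size of DS; the function name solve_csp and the toy data are assumptions, and supports are absolute counts.

```python
from itertools import combinations


def solve_csp(db, sensitive_sets, msup):
    """Return a minimum-size set of transaction indices (x_i = 1) such that,
    for every sensitive itemset I, at least sup(I) - msup + 1 of the
    chosen transactions support I."""
    # D_S: transactions supporting at least one sensitive itemset.
    d_s = [tid for tid, t in enumerate(db)
           if any(I <= t for I in sensitive_sets)]
    # Right-hand side of the first constraint, one value per itemset.
    need = {I: sum(1 for t in db if I <= t) - msup + 1
            for I in sensitive_sets}
    # Try solutions of increasing size, so the first feasible one is optimal.
    for k in range(len(d_s) + 1):
        for chosen in combinations(d_s, k):
            if all(sum(1 for tid in chosen if I <= db[tid]) >= need[I]
                   for I in sensitive_sets):
                return set(chosen)
    return None


db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c", "d"}, {"a", "c"}]
S = [frozenset("ab"), frozenset("bc")]
to_sanitize = solve_csp(db, S, msup=2)
```

Both sensitive itemsets have support 3 here, so each constraint asks for two sanitized supporters; transactions 0 and 3 support both itemsets, and marking exactly these two satisfies both constraints with the minimum objective value of 2.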
13.1.2 CSP Decomposition

For very large databases the CSP formulation presented in Figure 13.1 may contain too many constraints that have to be simultaneously satisfied and, thus, be tough to solve. In such cases, it is important to examine if the CSP can be decomposed
into smaller, manageable blocks (CSPs) that can be solved independently (and possibly concurrently), while providing the exact same solution as that of the original CSP. Due to the exponential nature of solving the CSP, such a decomposition strategy, if successful, can substantially reduce the runtime that is required by the integer programming solver to solve the CSP, and thus significantly reduce the total running time of the hiding algorithm. The basic idea behind the proposed decomposition strategy relies on the following properties of the formulated CSP:

Type of optimization criterion. The objective function of the CSP requires the minimization of a sum of non-negative values. Thus, it holds that

\[ \min\Big(\sum_{i=1}^{N} x_i\Big) = \min\Big(\sum_{i=1}^{b_1} x_i\Big) + \min\Big(\sum_{i=b_1+1}^{b_2} x_i\Big) + \ldots + \min\Big(\sum_{i=b_n+1}^{N} x_i\Big) \]

where b_1 < b_2 < ... < b_n are integers in (1, N) defining splitting points in [1, N].²

² One can easily notice that the same property holds in the case of maximizing a sum of variables x_i that can attain non-negative values, i.e. it holds that

\[ \max\Big(\sum_{i=1}^{N} x_i\Big) = \max\Big(\sum_{i=1}^{b_1} x_i\Big) + \max\Big(\sum_{i=b_1+1}^{b_2} x_i\Big) + \ldots + \max\Big(\sum_{i=b_n+1}^{N} x_i\Big) \]

where b_1 < b_2 < ... < b_n are integers in (1, N) defining splitting points in [1, N]. This interesting property is also discussed in Chapter 17, since it plays a key role in the parallelization process of exact knowledge hiding algorithms.

Type of constraints/variables. Based on the problem at hand there may exist sets of variables x_i and their related constraints c_j, which are independent from one another. As an example, if c1 : x1 + x2 ≥ 1 and c2 : x3 + x4 + x5 ≥ 2 are two constraints of the CSP, then these constraints can be solved independently since they involve different x_i variables and thus do not interfere with one another.

To identify if a given CSP is decomposable, the methodology that is proposed by Menon, et al. [47] involves (i) the construction of a constraints-by-transactions matrix for the problem at hand, and (ii) the examination of the structure of this matrix to determine if any rearrangement of its rows and/or columns leads to the identification of independent blocks. A block is called independent if it involves a set of constraints (equivalently rows) and a set of transactions (equivalently columns) that do not participate in any other block.

An example of a constraints-by-transactions matrix is presented in Table 13.1. As one can observe from the structure of this matrix, matrix M is decomposable since the rearrangement of its rows and columns leads to the appearance of two independent blocks: the block formed by rows c1, c3 and columns T6, T7, T3, and the block formed by rows c2, c5, c4 and columns T1, T2, T5, T4. As a result of this structure, the original CSP that involved constraints c1 ... c5 and transactions T1 ... T7 can be decomposed into two smaller CSPs, one having constraints c1, c3 and transactions T3, T6, T7, and the other having
Table 13.1: An example of a constraints-by-transactions matrix M.

c/T   T6  T7  T3  T1  T2  T5  T4
c1     1   1   1   0   0   0   0
c3     0   1   0   0   0   0   0
c2     0   0   0   1   0   1   0
c5     0   0   0   1   0   1   1
c4     0   0   0   0   1   0   1
minimize    x1 + x2 + x3 + x4 + x5 + x6 + x7
subject to  x3 + x6 + x7 ≥ 2
            x7 ≥ 1
            x1 + x5 ≥ 1
            x1 + x4 + x5 ≥ 3
            x2 + x4 ≥ 2
            x1, x2, x3, x4, x5, x6, x7 ∈ {0, 1}

Fig. 13.2: The original CSP formulation for the example.
constraints c2, c4, c5 and transactions T1, T2, T4, T5. The original CSP along with its two independent blocks is presented in Figures 13.2, 13.3 and 13.4, respectively.

minimize    x3 + x6 + x7
subject to  x3 + x6 + x7 ≥ 2
            x7 ≥ 1
            x3, x6, x7 ∈ {0, 1}

Fig. 13.3: The first independent block for the CSP of Figure 13.2.

Algorithm 13.1 (proposed in [47]) is a clustering strategy that can be employed for the decomposition of the original CSP through the identification of its independent blocks. The algorithm begins by considering that each sensitive itemset belongs to a separate block and then merges different blocks that share supporting transactions for their respective sensitive itemsets. Specifically, for each transaction in D_S the algorithm identifies the sensitive itemsets that it supports and merges the corresponding blocks into a single block. In the end, the independent blocks along with the corresponding transactions and sensitive itemsets are returned to the user. The
minimize    x1 + x2 + x4 + x5
subject to  x1 + x5 ≥ 1
            x1 + x4 + x5 ≥ 3
            x2 + x4 ≥ 2
            x1, x2, x4, x5 ∈ {0, 1}

Fig. 13.4: The second independent block for the CSP of Figure 13.2.

Algorithm 13.1 The Decomposition Algorithm proposed in [47].
 1: function DECOMPOSE(Database D_S, sensitive itemsets S)
 2:   for each itemset I ∈ S do
 3:     C(I) ← {I}        ▷ Initially every sensitive itemset belongs to a separate block
 4:     p_j ← j           ▷ j is the position of I in the (arbitrary) ordering of the itemsets in S
 5:   end for
 6:   for each transaction T_i ∈ D_S do
 7:     r ← p_k where k is the index of the first nonzero value in M for T_i
 8:     for the rest nonzero elements of T_i in M do
 9:       C(I_r) ← C(I_r) ∪ C(I_{p_j})
10:       C(I_{p_j}) ← ∅
11:       p_j ← r
12:     end for
13:   end for
14:   Return: nonempty independent blocks C(I)
15: end function
algorithm operates in time that is linear with respect to the number of supporting transactions for the sensitive itemsets. To illustrate the operation of this algorithm, consider the example of Table 13.1 that involves 5 sensitive itemsets (with constraints c1 ... c5) which are supported by a total of 7 transactions. As a preprocessing step, Algorithm 13.1 (lines 3–4) considers that each sensitive itemset belongs to a different independent block, thus we have that: C(I1) = {I1}, ..., C(I5) = {I5}. The indexing pointers p are accordingly initialized based on the order of the constraints in matrix M, i.e. p1 = 1, p2 = 3, p3 = 2, p4 = 5, p5 = 4. At this point we enter the first iteration of the algorithm, which involves transaction T6. The first nonzero element in the column of T6 is that for c1 and thus we have k = r = 1. Since there is no other nonzero element for T6 we proceed to the next iteration of the algorithm, which involves transaction T7. For this transaction we have r = p1 = 1. Since the next element of transaction T7 is nonzero, we have that C(I1) ← C(I1) ∪ C(I3) = {I1, I3} (since p2 = 3) and then C(I3) ← ∅, p2 ← 1. Following that, we move to T3 which again has only one nonzero element and thus requires no special handling. Next, for T1 we have that r ← p3 = 2 and C(I2) = {I2, I5}. In the same way, Algorithm 13.1 identifies that the constraints-by-transactions matrix M of Table 13.1 can be decomposed into two independent blocks: {c1, c3} (corresponding to sensitive itemsets I1, I3) and {c2, c4, c5} (corresponding to sensitive itemsets I2, I4, I5).

Algorithm 13.2 The Intelligent Sanitization Approach of [47].
  function INTELLIGENT-SANITIZE(Database D_S, sensitive itemsets S)
    for each transaction T_i ∈ D_S do
      identify all itemsets S_j ∈ S that it supports
      while S_j ≠ ∅ do
        remove the item of T_i that appears most often in S_j
        remove the itemsets of S_j that contain this item
      end while
    end for
    Return: sanitized transactions T_i
  end function

A last remark about the proposed decomposition strategy is that it has no impact on the quality of the identified solution of the original CSP. However, there may exist hiding scenarios in which the corresponding constraints-by-transactions matrix cannot be decomposed. For such cases the authors of [47] discuss some quick procedures that can be employed to tackle the problem, albeit at a cost to the quality of the attained solution. Moreover, in Chapter 17 we discuss a framework that can be employed for the decomposition and the parallelization of CSPs that are produced by exact hiding algorithms.
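The block-merging idea behind the decomposition can be sketched in Python as follows. This is written in the spirit of Algorithm 13.1 rather than as a literal transcription of it: instead of the pointer array p, it relabels every member of a merged block, which is a simplification made here for clarity. The function name `decompose` and the variable names are illustrative assumptions; the matrix is the one of Table 13.1.

```python
def decompose(matrix, constraints, transactions):
    """Group constraints (rows) that share supporting transactions (columns)."""
    block_of = {c: c for c in constraints}      # constraint -> label of its block
    members = {c: {c} for c in constraints}     # block label -> its constraints
    for col in range(len(transactions)):
        rows = [constraints[r] for r in range(len(constraints)) if matrix[r][col]]
        if not rows:
            continue
        target = block_of[rows[0]]              # block of the first nonzero entry
        for c in rows[1:]:
            source = block_of[c]
            if source == target:
                continue
            # Merge the whole source block into the target block.
            for moved in members.pop(source):
                block_of[moved] = target
                members[target].add(moved)
    blocks = []
    for cons in members.values():
        txs = {transactions[col] for col in range(len(transactions))
               if any(matrix[constraints.index(c)][col] for c in cons)}
        blocks.append((sorted(cons), sorted(txs)))
    return sorted(blocks)

# Constraints-by-transactions matrix of Table 13.1 (rows and columns as listed there).
CONSTRAINTS = ["c1", "c3", "c2", "c5", "c4"]
TRANSACTIONS = ["T6", "T7", "T3", "T1", "T2", "T5", "T4"]
M = [[1, 1, 1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 1, 1],
     [0, 0, 0, 0, 1, 0, 1]]

for cons, txs in decompose(M, CONSTRAINTS, TRANSACTIONS):
    print(cons, txs)
```

Each returned pair lists the constraints of one independent block together with the transactions whose variables appear in them, so every block can be handed to the solver as a separate CSP.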
13.2 Heuristic Part

The solution of the CSP that was formulated in the exact part of Menon's approach yields a set of transactions from D_O that have to be sanitized in order to conceal the sensitive knowledge. In this section, we shed light on the actual process that is followed by [47] for the sanitization of these transactions. The authors present two simple heuristic strategies that take as input the transactions that were marked for sanitization based on the solution of the CSP, and output a database D in which the sensitive knowledge S from D_O is properly hidden.

The first strategy, known as the blanket approach, is inspired by the work of Oliveira & Zaïane [54] and operates by deleting all items but one from each transaction that was previously marked for sanitization. Although this strategy manages to hide all the sensitive knowledge from D_O, it generally introduces significant and unnecessary side-effects to D.

The second strategy, called the intelligent approach, induces significantly less harm to the original database by focusing on retaining the majority of items which appear in each transaction that is marked for sanitization. Specifically, the algorithm operates as follows. First, for each transaction T_i ∈ D_S, the proposed algorithm finds the set of sensitive itemsets S_j from S that this transaction supports. To sanitize the transaction, the intelligent approach removes the item of this transaction that appears in the largest number of itemsets in S_j. The sensitive itemsets of S_j that contain this item are then no longer supported by the transaction and are thus removed from S_j. Item removal continues in this way until S_j becomes empty, and the same process is repeated for the rest of the transactions in D_S, at which point the database is sanitized. Algorithm 13.2 provides the details pertaining to the operation of the intelligent approach.
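The greedy step of the intelligent approach can be sketched for a single marked transaction as follows. The function name `intelligent_sanitize`, the sample transaction and the sensitive itemsets are hypothetical, and ties between equally frequent items are broken alphabetically, which is an assumption since the text does not specify a tie-breaking rule.

```python
from collections import Counter

def intelligent_sanitize(transaction, sensitive_itemsets):
    """Greedily drop items from one marked transaction until it supports no sensitive itemset."""
    remaining = set(transaction)
    supported = [s for s in sensitive_itemsets if s <= remaining]
    while supported:
        counts = Counter(item for s in supported for item in s)
        # Most frequent item first; alphabetical tie-break (assumption, not from [47]).
        victim = min(counts, key=lambda item: (-counts[item], item))
        remaining.discard(victim)
        supported = [s for s in supported if victim not in s]
    return remaining

# Hypothetical marked transaction and sensitive itemsets.
T = {"a", "b", "c", "d", "e"}
S = [frozenset("ab"), frozenset("bc"), frozenset("de")]
print(sorted(intelligent_sanitize(T, S)))
```

Here item b is removed first because it appears in two of the supported sensitive itemsets, so a single deletion hides both of them.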
Chapter 14
Inline Algorithm
In this chapter, we present in detail the first algorithm proposed that does not rely on any heuristics to secure the sensitive knowledge derived by association rule mining. Similarly to Menon's algorithm (covered in Chapter 13), the inline algorithm of [23] aims to hide the sensitive frequent itemsets of the original database D_O that can lead to the production of sensitive association rules. As a first step, we will introduce the notion of distance between two databases along with a measure for quantifying it. The quantification of distance provides us with important knowledge regarding the minimum data modification that must be applied to the original database to hide the sensitive itemsets, while minimally affecting the nonsensitive ones. By trying to minimize the distance between the original database and its sanitized counterpart, the inline algorithm formulates the hiding process as an optimization problem in which distance is the optimization criterion. Specifically, the hiding process is modeled as a CSP which is subsequently solved by using a technique called Binary Integer Programming (BIP). The attained solution is such that the distance measure is minimized. It is important to mention that since the inline algorithm is non-heuristic, it does not suffer from local minima issues that would lead the hiding algorithm to suboptimal (i.e., locally, but not globally, best) solutions. As a result, this methodology is guaranteed to identify hiding solutions of superior quality when compared to state-of-the-art heuristics and border-based approaches.

The remainder of this chapter is organized as follows. Section 14.1 introduces the privacy model that is employed by the inline algorithm for the identification of exact and optimal hiding solutions. Next, in Section 14.2, we demonstrate the solution methodology that is adopted.
Finally, Section 14.3 presents some experiments that demonstrate the effectiveness of the approach towards solving the problem. More experiments comparing the inline algorithm to other state-of-the-art approaches are included in Chapters 15 and 16.
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_14, © Springer Science+Business Media, LLC 2010
14.1 Privacy Model

In Chapter 9, we highlighted the process of border revision and we presented a set of algorithms which enable the computation of the revised borders that pertain to an exact hiding solution. The inline methodology that is covered in this chapter aims to enforce the revised borders when constructing the sanitized version D of the original database D_O. The sanitization approach that is followed is based on the modification of selected T_nm values from one to zero (1 → 0), thus excluding certain items from selected transactions of D_O. Since no transactions are added to or removed from the original database, the sanitized version D of database D_O will have the same number of transactions as the original one. Therefore, it holds that |D| = |D_O|.

Using F_D to denote the set of frequent itemsets in the sanitized database D, the goal that must be accomplished for the hiding of all the sensitive itemsets in D can be expressed as follows: ∀S ∈ S_min : S ∉ F_D, where S_min is the set of minimal sensitive itemsets, formally introduced in Chapter 9. It is trivial to prove that this goal can be achieved for any database D_O, provided that sufficient distortion is introduced to the database.¹ However, in reality one would like to hide all sensitive itemsets while minimally affecting the original database. As stated in Chapter 9, the minimum harm in the hiding of the sensitive knowledge can be quantified as F'_D = F_{D_O} − S_max, meaning that in an optimal hiding scenario, the sanitized database D will contain all frequent itemsets of D_O except for the sensitive ones.²

¹ This property, along with a naïve approach for solving the association rule hiding problem, was briefly discussed in Chapter 2 (Section 2.2.1).
² We should point out that notation F'_D refers to the optimal set of frequent itemsets that should appear in the sanitized database D to facilitate exact knowledge hiding, while F_D denotes the actual set of frequent itemsets that can be mined from D.

By only removing certain items from selected transactions in D_O, the inline algorithm ensures that database D will contain no ghost itemsets, i.e. nonsensitive itemsets that did not appear among the frequent ones in the original database. However, the exclusion of items from transactions may lead to nonsensitive frequent itemsets being accidentally lost in the sanitized database D. Thus, the inline algorithm has to take the appropriate measures to minimize the loss of frequent itemsets in D as a side-effect of the sanitization process.

As we presented in Chapter 13, Menon, et al. [47] employ the measure of accuracy in the optimization criterion of the produced CSP in order to derive the hiding solution. Maximizing accuracy is equivalent to minimizing the number of transactions from D_O that need to be modified to accommodate the hiding of the sensitive knowledge. The inline algorithm, on the other hand, takes a more radical approach to quantifying the harm induced to the original database by penalizing the number of actual item modifications rather than the number of transactions that are affected by the hiding process. Indeed, a hiding algorithm may alter fewer transactions but to a much higher degree than another one that alters more transactions but to a much lower degree. A basic drawback of the accuracy measure is that it penalizes all potential modifications of a frequent itemset, say for instance both 'abcde' → ∅ and 'abcde' → 'abcd', to the same extent. To address this shortcoming, a novel global measure of quality, called distance, is proposed in [23], which aims to capture the proximity between the original and the sanitized database. Distance is formally defined as follows:

\[ dist(D_O, D) = \sum_{n \in [1,N],\, m \in [1,M]} T_{nm} \; - \sum_{n \in [1,N],\, m \in [1,M]} T'_{nm} = \left|\{ T_{nm} \neq T'_{nm} \}\right| \tag{14.1} \]

where T_nm refers to the original database D_O and T'_nm to its sanitized version D, N corresponds to the number of transactions in D_O and D, and M to the number of items. Due to the fact that the inline algorithm supports only item deletions from selected transactions, minimizing the outcome of the above formula is exactly the same as maximizing the number of ones left in D. Thus, the above formula can be restated as follows:

\[ \min\{dist(D_O, D)\} = \max\Big( \sum_{n \in [1,N],\, m \in [1,M]} T'_{nm} \Big) \tag{14.2} \]
Formula (14.2) captures the optimization criterion that is used by the inline algorithm for the formulation of the CSP, discussed in detail in the following section. The distance metric allows the selection of the optimal hiding solution among the exact ones (in the case that many exact hiding solutions exist for the problem at hand), or the approximation of the optimal solution to the best possible extent (in the case that an exact hiding solution cannot be found). Since only the support of the minimal sensitive itemsets, i.e. the itemsets in S_min, has to be reduced for the hiding of the sensitive knowledge, the T_nm values in D_O that will be zeroed out in D will have to correspond to items appearing in transactions that support itemsets in S_min.

Having defined the optimization criterion that has to be minimized, the next target is to identify the set of itemsets to which this criterion will be applied. Solving the entire problem involving all |℘(I)| − 1 = 2^M − 1 itemsets in the lattice of D_O is well known to be NP-hard (cf. [10, 47]). Therefore, the inline algorithm has to apply the optimization criterion to only a subset of the transactions of the original database. Prior to discussing the selection process that is adopted by the inline algorithm, we present the two possible constraints that can hold for a frequent itemset I in D. An itemset I from D_O will continue to be frequent in D if and only if

\[ sup(I, D) \geq msup \;\Rightarrow\; \sum_{n=1}^{N} \prod_{i_m \in I} T'_{nm} \geq msup \tag{14.3} \]

and will be infrequent otherwise, i.e. when

\[ sup(I, D) < msup \;\Rightarrow\; \sum_{n=1}^{N} \prod_{i_m \in I} T'_{nm} < msup \tag{14.4} \]

Based on the above inequalities, one can observe that any exact solution will necessarily satisfy them for the entire problem (i.e., for all the itemsets in the lattice of D). Specifically, all itemsets appearing in F'_D should remain frequent in D and therefore they should satisfy inequality (14.3). On the contrary, all frequent itemsets in S_max should be hidden in D and therefore they must satisfy inequality (14.4). If at least one inequality does not hold for an itemset in F'_D, then the identified solution is non-exact but approximate.

Fig. 14.1: The architectural layout for the inline approach.

An important observation that was made in [23] is that the two types of inequalities, namely (14.3) and (14.4), are actually not of the same importance. Specifically, while it is crucial to ensure that inequality (14.4) holds for all sensitive itemsets in D, such that they are properly hidden, inequality (14.3) just facilitates the minimization of side-effects in the hiding process. As we present in the following section, the inline algorithm considers the difference in the significance of these two types of inequalities in order to compute the best approximate solutions, when optimal ones cannot be found for the problem at hand.
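To make the distance measure of (14.1) and the support expressions of (14.3) and (14.4) concrete, the following Python sketch evaluates them on a small binary matrix. The functions `distance` and `support`, the 4 × 3 database and the threshold are hypothetical and serve only as an illustration; items are addressed by column index.

```python
def distance(original, sanitized):
    """Number of cells that differ between two equally sized 0/1 matrices, as in (14.1)."""
    return sum(o != s
               for row_o, row_s in zip(original, sanitized)
               for o, s in zip(row_o, row_s))

def support(item_columns, matrix):
    """sup(I, D): sum over transactions of the product of the cells of the items in I."""
    total = 0
    for row in matrix:
        prod = 1
        for m in item_columns:
            prod *= row[m]
        total += prod
    return total

# Hypothetical database with 4 transactions and 3 items.
D_O = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1],
       [1, 0, 1]]
# Sanitized version: the second item is deleted from the second transaction.
D = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1],
     [1, 0, 1]]
MSUP = 2                      # assumed minimum support count
SENSITIVE = (0, 1)            # itemset made of the first two items

print(distance(D_O, D))                                   # one item deletion
print(support(SENSITIVE, D_O), support(SENSITIVE, D))     # support before and after
```

Because only deletions are performed, the distance equals the drop in the number of ones, and the single deletion moves the itemset from satisfying (14.3) to satisfying (14.4).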
14.2 Solution Methodology

As is proven in [23], the NP-hard problem of the 2^M − 1 inequalities can be reduced to a size that is solvable while yielding the exact same set of solutions. In what follows, we discuss how this reduction is possible. Then, we demonstrate that, based on the reduced set of produced inequalities, the hiding process of the inline algorithm can be formulated as a CSP that is solved by using BIP. The architectural layout of the inline algorithm is presented in Fig. 14.1.
14.2.1 Problem Size Minimization

Although the entire problem of the 2^M − 1 constraints cannot be solved due to its exponential growth, the authors of [23] demonstrate that it is possible to minimize the number of inequalities that participate in the CSP, without any loss in the quality of the attained solution. Specifically, certain inequalities of the system (referring to the status of itemsets in D) can be removed from the CSP of the 2^M − 1 inequalities, without affecting its set of solutions. The rationale behind this claim is that in a typical case there exist many overlapping itemsets in D_O and therefore a large number of overlapping inequalities are produced. Identifying such inequalities helps drastically reduce the size of the problem up to a point that it becomes solvable.

As we presented earlier, a one-to-one correspondence exists between itemsets and produced inequalities that participate in the CSP. Given that C is the total set of affected itemsets, such that C = {C1, C2, ..., C_{2^M − 1}}, we denote by L_C the set of solutions of the corresponding inequalities from C. The set of solutions for the system of inequalities will be the intersection of the solution sets produced for the individual inequalities. Thus, we can write

\[ L_C = \bigcap_{i=1}^{|C|} L_{C_i} \]

Based on the above notation, the set of solutions that corresponds to all itemsets in D_O is denoted as L_{℘(I)\{∅}}. Given an inequality (produced, for example, by an itemset C2) with a solution set being a proper subset of the solution set of another inequality, say C1, we can deduce that the latter inequality can be removed from any system containing the first inequality, without affecting the global solution of this system. In this case we have that L_{C2} ⊂ L_{C1} and we state that C1 is a (generalized) cover of C2. By exploiting cover relations, the authors of [23] manage to reduce the problem of satisfying all the inequalities produced by all the itemsets in D_O to the examination of a much smaller set

\[ C = \{ I \in Bd^{+}(F'_D) : I \cap I^{S_{min}} \neq \emptyset \} \cup S_{min} \tag{14.5} \]

where notation I^{S_min} depicts the items of the itemsets in S_min. Thus, the condition I ∩ I^{S_min} ≠ ∅ selects those itemsets of the revised positive border that contain at least one of the items appearing in the minimal sensitive itemsets (i.e., in S_min). Set C provides us with the optimal hiding solution L_C, if one exists. The reduction of the set of all inequalities to those involving the itemsets in C is proven as part of [23].
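The construction of the set in (14.5) can be sketched in Python on the small database that is used in the example of Section 14.2.3 (Table 14.1). The sketch mines frequent itemsets by exhaustive enumeration and obtains the ideal set by dropping every frequent superset of a sensitive itemset, which is a simplification assumed here in place of the border revision algorithms of Chapter 9; all function names are illustrative.

```python
from itertools import combinations

# Database of Table 14.1, one set of items per transaction.
DB = [set("AC"), set("ACD"), set("CD"), set("B"),
      set("ABCD"), set("D"), set("C"), set("AB")]
MSUP = 0.2 * len(DB)          # mfreq = 0.2 over 8 transactions

def support(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, msup):
    """All itemsets with support at least msup (exhaustive, small inputs only)."""
    items = sorted(set().union(*db))
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c), db) >= msup}

def positive_border(itemsets):
    """Maximal itemsets: those with no proper superset in the collection."""
    return {i for i in itemsets if not any(i < j for j in itemsets)}

def constraint_itemsets(db, msup, s_min):
    """Return (V, C) of (14.5) for the minimal sensitive itemsets s_min."""
    f_orig = frequent_itemsets(db, msup)
    f_ideal = {i for i in f_orig if not any(s <= i for s in s_min)}
    sensitive_items = frozenset().union(*s_min)
    v = {i for i in positive_border(f_ideal) if i & sensitive_items}
    return v, v | set(s_min)

V, C = constraint_itemsets(DB, MSUP, [frozenset("AB")])
print(sorted("".join(sorted(i)) for i in V))
print(sorted("".join(sorted(i)) for i in C))
```

Only the itemsets in the returned set C contribute inequalities to the CSP, which is what keeps the problem size manageable.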
14.2.2 Reduction to CSP and the BIP solution

After the formalization of the hiding process, the whole problem can be regarded as a CSP, whose solution will provide the sanitized database D. CSPs can be solved by using various techniques, such as linear and non-linear programming [44]. In the current context all variables are binary in nature as they refer to items participating in specific transactions; this fact provides an important advantage, as we will present later on. To solve the CSP, the inline algorithm first transforms it to an optimization problem and then applies a technique which is known as BIP [32]. The employed formulation enables the solution of the sanitization problem in D_O and is capable of identifying the optimal solution (if one exists). In the case of problems where exact solutions are infeasible, a relaxation of the inline algorithm (by using a heuristic targeted at the selection and removal of inequalities) is adopted that allows the identification of a good approximate solution. Figure 14.2 presents the CSP formulation as an optimization problem.

maximize    ∑_{u_nm ∈ U} u_nm
subject to  ∑_{T_n ∈ D_I} ∏_{i_m ∈ I} u_nm < msup,  ∀I ∈ S_min
            ∑_{T_n ∈ D_I} ∏_{i_m ∈ I} u_nm ≥ msup,  ∀I ∈ V
where V = {I ∈ Bd+(F'_D) : I ∩ I^{S_min} ≠ ∅}

Fig. 14.2: The CSP formulation as an optimization process.

As one can notice, the degree of the constraints participating in the problem formulation is a priori unknown, and can be as large as the number of items M in the given dataset. This fact prohibits the solution of the CSP in the exact format that is presented in Figure 14.2. Fortunately though, due to the binary nature of the variables involved in the CSP, one can replace inequalities that contain products of two or more u_nm variables with a set of inequalities, none of which contains a product of binary variables. In addition, this can be accomplished in a way that when all the new inequalities are solved, the solution that is attained is the same as the one of solving the initial inequality. The side-effect of this methodology is that it increases the number of constraints that participate in the system of inequalities. On the other hand, the resulting inequalities are very simple and allow for fast solutions, and thus for an efficient solution of the CSP. To achieve this transformation, a number of temporary binary variables Ψ_i need to be introduced, as shown in Figure 14.3.

Replace
    ∑_{T_n ∈ D_z} Ψ_i ≶ msup,  Ψ_i = ∏_{i_m ∈ z} u_nm = u_{n z_1} × ... × u_{n z_|z|}
with
    ∀i:  Ψ_i ≤ u_{n z_1}
         Ψ_i ≤ u_{n z_2}
         ...
         Ψ_i ≤ u_{n z_|z|}
         Ψ_i ≥ u_{n z_1} + u_{n z_2} + ... + u_{n z_|z|} − |z| + 1
and
    ∑_i Ψ_i ≶ msup
where Ψ_i ∈ {0, 1}

Fig. 14.3: The Constraints Degree Reduction approach.

After applying the Constraints Degree Reduction (CDR) approach, all constraints become linear with no coefficients. A linear optimization solver can then be applied to provide the optimal solution L_C of the CSP.

The final case that needs to be examined is what happens if the optimization problem yields no solution, i.e. the produced CSP is unsolvable. To handle such problems where an exact solution cannot be attained, the inline algorithm relies on the properties of the produced CSP. As one can observe, the integer programming problem that contains only the constraints imposed by the sensitive itemsets in S_min will always have a solution. This is due to the fact that all inequalities are of the same type, so in the worst case scenario zeroing out all the involved u_nm variables will hide the sensitive knowledge. However, such a solution is certainly not desirable. To identify a good hiding solution, the inline algorithm removes a small portion of the constraints imposed by the itemsets in V (see Figure 14.2) so that the CSP of the remaining inequalities becomes solvable. The question that is now raised is how to select the inequalities to remove.

Algorithm 14.1 Relaxation Procedure in V.
  procedure SELECTREMOVE(Constraints C_R, V, D)      ▷ C_{R_i} ↔ V_i
    C_R^maxlen ← argmax_i {|R_i|}                    ▷ R_i ∈ V
    cr_msup ← min_{C_R^maxlen, i} (sup(R_i ∈ V, D))
    for each c ∈ C_R^maxlen do
      if sup(R_i, D) = cr_msup then
        Remove(c)                                    ▷ remove constraint from the CSP
      end if
    end for
  end procedure

Algorithm 14.1 provides a simple heuristic for the selection and removal of inequalities from the CSP of Figure 14.2. It applies a relaxation process by removing all constraints that correspond to maximal size and minimum support itemsets in V. The rationale behind this heuristic is that the hidden itemsets in D (due to the side-effects of hiding itemsets in S_max) will be the first that would be hidden in D_O if the support threshold were increased, since their current support is also low. After removing a subset of the constraints that participate in the CSP, the remaining problem may become solvable; the relaxation process is therefore applied iteratively to the CSP until a solution is attained. Last, it is important to note that each repetition of the relaxation process may potentially lead to the removal of more than one constraint from the original CSP.
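The selection rule of the relaxation heuristic can be sketched in Python as shown below. The names `select_remove` and `relax` are illustrative, supports are computed in whatever database is passed in, and `solvable` is a hypothetical callback standing in for an attempt to solve the CSP with the remaining constraints; none of these details are prescribed by [23].

```python
def support(itemset, db):
    """Number of transactions of db that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

def select_remove(v_itemsets, db):
    """Itemsets of V whose constraints are dropped in one relaxation round:
    those of maximal size that also have the minimum support among them."""
    max_len = max(len(i) for i in v_itemsets)
    longest = [i for i in v_itemsets if len(i) == max_len]
    min_sup = min(support(i, db) for i in longest)
    return {i for i in longest if support(i, db) == min_sup}

def relax(v_itemsets, db, solvable):
    """Repeat the removal step until the caller-supplied test reports a solvable CSP."""
    remaining = set(v_itemsets)
    while remaining and not solvable(remaining):
        remaining -= select_remove(remaining, db)
    return remaining

# Database of Table 14.1 and the set V = {B, ACD} of the example in Section 14.2.3.
DB = [set("AC"), set("ACD"), set("CD"), set("B"),
      set("ABCD"), set("D"), set("C"), set("AB")]
V = {frozenset("B"), frozenset("ACD")}

print(select_remove(V, DB))
```

If the loop ends with no constraints from V left, only the constraints of the sensitive itemsets remain, and that problem always has a solution as argued above.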
14.2.3 An Example

To demonstrate the core approach excluding the relaxation process, we borrow the example that is used in [23]. Consider the transactional database that is presented in Table 14.1. Using a minimum frequency threshold mfreq = 0.2 (i.e., msup = 0.2 × 8 = 1.6), the set of frequent itemsets that are mined from this database is: F_{D_O} = {A, B, C, D, AB, AC, AD, CD, ACD}. Now suppose that one wishes to hide the sensitive itemset AB, so that S = {AB}. As a first step, the inline algorithm computes the ideal set of frequent itemsets for D by using Algorithm 9.4. That is: F'_D = {A, B, C, D, AC, AD, CD, ACD}. The revised positive border for D will then be: Bd+(F'_D) = {B, ACD}. To produce the constraints for the CSP, the inline algorithm has to consider all itemsets appearing in C (see (14.5)). In this example, we have that: V = Bd+(F'_D) = {B, ACD}. The next step is to substitute, in all transactions supporting the sensitive itemsets, the current T_nm values of the sensitive items with u_nm binary variables. This constitutes the intermediate form of database D_O, which is presented in Table 14.2.

Table 14.1: The original database D_O used in the example.

A  B  C  D
1  0  1  0
1  0  1  1
0  0  1  1
0  1  0  0
1  1  1  1
0  0  0  1
0  0  1  0
1  1  0  0

Table 14.2: The intermediate form of database D_O used in the example.

A    B    C  D
1    0    1  0
1    0    1  1
0    0    1  1
0    1    0  0
u51  u52  1  1
0    0    0  1
0    0    1  0
u81  u82  0  0

Then, for each itemset in (V ∪ S_min), the inline algorithm generates a constraint, as follows:

\[ 1 + u_{52} + u_{82} \geq msup \;\Rightarrow\; u_{52} + u_{82} \geq 0.6 \tag{14.6} \]
\[ 1 + u_{51} \geq msup \;\Rightarrow\; u_{51} \geq 0.6 \tag{14.7} \]
\[ u_{51}u_{52} + u_{81}u_{82} < msup \;\Rightarrow\; u_{51}u_{52} + u_{81}u_{82} < 1.6 \tag{14.8} \]

The first two inequalities correspond to the itemsets in V, whereas the third one regards the sensitive itemset in S (here S_min = S). From these constraints, only the third one contains products of u_nm variables and therefore needs to be replaced. Based on Figure 14.3 the inline algorithm replaces this constraint, while introducing two new temporary binary variables, Ψ1 and Ψ2, as follows:

\[ \overbrace{u_{51}u_{52}}^{\Psi_1} + \overbrace{u_{81}u_{82}}^{\Psi_2} < 1.6 \;\Rightarrow\; \begin{cases} \Psi_1 \leq u_{51}, & \Psi_2 \leq u_{81} \\ \Psi_1 \leq u_{52}, & \Psi_2 \leq u_{82} \\ \Psi_1 \geq u_{51} + u_{52} - 1, & \Psi_2 \geq u_{81} + u_{82} - 1 \end{cases} \]

maximize    u51 + u52 + u81 + u82
subject to  u52 + u82 ≥ 0.6
            u51 ≥ 0.6
            Ψ1 ≤ u51
            Ψ1 ≤ u52
            Ψ2 ≤ u81
            Ψ2 ≤ u82
            Ψ1 ≥ u51 + u52 − 1
            Ψ2 ≥ u81 + u82 − 1
            Ψ1 + Ψ2 < 1.6
where {u51, u52, u81, u82, Ψ1, Ψ2} ∈ {0, 1}

Fig. 14.4: The CSP formulation for dataset D_O.

Table 14.3: The three exact (and optimal) solutions for the CSP of the example.

Solution  u51  u52  u81  u82
l1        1    0    1    1
l2        1    1    0    1
l3        1    1    1    0

Table 14.4: The characteristics of the three datasets.

Dataset          N       M      Avg tlen
BMS–WebView–1    59,602  497    2.50
BMS–WebView–2    77,512  3,340  5.00
Mushroom         8,124   119    23.00

By using this process, all constraints of the CSP that involve products of binary variables become linear. The resulting CSP is presented in Figure 14.4 and is solved by using BIP. The solution of the CSP leads to three optimal hiding solutions, presented in Table 14.3, among which one is selected to derive the sanitized database D.
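The CSP of Figure 14.4 is small enough to be checked by exhaustive enumeration. The Python sketch below does so in place of a BIP solver, which is an assumption made purely for illustration; the function name `feasible` and the variable names are not taken from [23].

```python
from itertools import product

MSUP = 1.6   # mfreq * |D_O| = 0.2 * 8

def feasible(u51, u52, u81, u82, psi1, psi2):
    """Linearized constraints of Fig. 14.4."""
    return (u52 + u82 >= 0.6          # itemset B stays frequent, see (14.6)
            and u51 >= 0.6            # itemset ACD stays frequent, see (14.7)
            and psi1 <= u51 and psi1 <= u52
            and psi2 <= u81 and psi2 <= u82
            and psi1 >= u51 + u52 - 1
            and psi2 >= u81 + u82 - 1
            and psi1 + psi2 < MSUP)   # itemset AB becomes infrequent, see (14.8)

best_value, best_solutions = -1, set()
for u51, u52, u81, u82, psi1, psi2 in product((0, 1), repeat=6):
    if not feasible(u51, u52, u81, u82, psi1, psi2):
        continue
    value = u51 + u52 + u81 + u82
    if value > best_value:
        best_value, best_solutions = value, {(u51, u52, u81, u82)}
    elif value == best_value:
        best_solutions.add((u51, u52, u81, u82))

# Each tuple is (u51, u52, u81, u82); the distance is the number of zeroed variables.
print(best_value, sorted(best_solutions))
```

The enumeration yields an optimal objective value of 3, i.e. a distance of 1 from D_O, attained by the three assignments listed in Table 14.3.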
14.3 Experiments and Results

The inline algorithm has been tested on real world datasets using different parameters such as minimum support threshold and number/size of sensitive itemsets to hide [23]. In this section, we present the datasets that were used, their special characteristics, the selected parameters and the attained experimental results.

14.3.1 Datasets

All real datasets that were used to evaluate the inline algorithm are publicly available through the FIMI repository (http://fimi.cs.helsinki.fi/). Datasets BMS–WebView–1 and BMS–WebView–2 contain click stream data from the Blue Martini Software, Inc. and were used for the KDD Cup 2000 [41]. The mushroom dataset was prepared by Roberto Bayardo (University of California, Irvine) [11]. These datasets demonstrate varying characteristics in terms of the number of transactions and items and the average transaction lengths. Table 14.4 summarizes them.

The primary bottleneck of the inline approach was the time taken to run the frequent itemset mining algorithm as well as the time needed to solve the produced CSPs. The thresholds of minimum support were properly selected to ensure a large number of frequent itemsets. Among these itemsets, some were randomly selected and characterized as sensitive. Several experiments were conducted to hide sensitive itemsets of size 1, 2, 4 and 5. Since the sanitization methodology of the inline algorithm is item-based, the higher the number of items in a sensitive itemset, the more u_nm variables are involved and therefore the more constraints need to be solved. To avoid the explosion in the number of constraints, two pruning techniques were enforced. Firstly, all tautologies, i.e. inequalities that always hold, were removed. Secondly, the remaining inequalities were partitioned into sets, the overlapping sets were identified and only the most specific inequality was kept from each overlapping set. Both pruning techniques can be easily applied and relieve the solver from a good portion of unnecessary work. All experiments were conducted on a PC running Linux on an Intel Pentium D, 3.2 GHz processor. The integer programs were solved by using ILOG CPLEX 9.0 [36].
14.3.2 Evaluation Methodology The inline algorithm was evaluated using the metric of distance between databases DO and D. The lower the distance, the better the quality of the hiding solution. Since the sensitive itemsets are successfully hidden even in non-exact solutions, the sanitized database D can in any case be safely released to untrusted third parties, as it effectively protects all the sensitive knowledge. As a final comment on the evaluation procedure that was employed in [23], we should mention that it is not sensible to compare the attained experimental results against the widely known metric of accuracy (as used in other state-of-the-art algorithms). The reason is that in the inline algorithm it is the distance (i.e., the number of item modifications) and not the accuracy (i.e., the number of transactions of DO that are modified) that is minimized. These two goals are highly distinct, since multiple item modifications may occur in the same transaction and yet produce a penalty of one when using the accuracy measure, although the actual damage induced to the database is much higher. Thus, the metric of accuracy would be inappropriate to evaluate the outcome of the presented methodology.
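The difference between the two measures can be illustrated with a minimal Python sketch. The three-transaction database and the two candidate sanitizations below are made-up toy data (they do not come from [23]), and the function names are hypothetical; the transaction-level count is only a stand-in for the accuracy measure discussed above.

```python
def distance(original, sanitized):
    """Item-level distance: number of binary entries that differ."""
    return sum(a != b
               for row_o, row_s in zip(original, sanitized)
               for a, b in zip(row_o, row_s))


def modified_transactions(original, sanitized):
    """Transaction-level count: each altered transaction costs one,
    no matter how many of its items were changed."""
    return sum(row_o != row_s for row_o, row_s in zip(original, sanitized))


# Toy binary database (rows are transactions, columns are items).
D_O = [[1, 1, 1, 1],
       [1, 0, 1, 0],
       [0, 1, 1, 1]]

# Candidate A removes three items from a single transaction.
D_A = [[1, 0, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 1, 1]]

# Candidate B removes one item from each of two transactions.
D_B = [[1, 1, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 1, 1]]

print(distance(D_O, D_A), modified_transactions(D_O, D_A))
print(distance(D_O, D_B), modified_transactions(D_O, D_B))
```

In this toy case the transaction-level count prefers candidate A, whereas the item-level distance prefers candidate B, which is why results optimized for one measure should not be judged by the other.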
Table 14.5: The experimental results for the three datasets.
Dataset    Hiding Scenario   unm vars   Distance
BMS–1      HS1               157        128
BMS–1      HS2               359        301
BMS–1      HS3               60         19
BMS–1      HS4               120        12
BMS–1      HS5               120        8
BMS–1      HS6               228        21
BMS–1      HS7               150        24
BMS–2      HS1               12         3
BMS–2      HS2               36         18
BMS–2      HS3               24         3
BMS–2      HS4               40         8
BMS–2      HS5               40         4
BMS–2      HS6               80         12
Mushroom   HS1               32         8
Mushroom   HS2               68         20
Mushroom   HS3               64         12
Mushroom   HS4               128        16
Mushroom   HS5               128        8
Mushroom   HS6               256        24
Mushroom   HS7               160        16
14.3.3 Experimental Results The inline algorithm was tested in the following hiding scenarios: hiding 1 1–itemset (HS1), hiding 2 1–itemsets (HS2), hiding 1 2–itemset (HS3), hiding 2 2–itemsets (HS4), hiding 1 4–itemset (HS5), hiding 2 4-itemsets (HS6), and hiding 1 5–itemset (HS7). Table 14.5 summarizes the attained experimental results. The number of unm variables participating in the CSP provides an estimate of the worst-case scenario in the context of the inline algorithm; it is equivalent to the maximum distance between DO and D. The fourth column shows the actual distance of the two databases as reported by the solver. It corresponds to the actual number of items that were hidden from individual transactions of DO .
14.3.4 Discussion on the Efficiency of the Inline Algorithm At first glance, a serious shortcoming of the inline algorithm seems to be the time that is required for solving the produced CSPs, especially due to the number of constraints introduced by the CDR approach. However, because the produced constraints involve binary variables and no products among these variables exist, it turns out that the solution time (although certainly much higher than that of heuristic approaches) remains acceptable. To support this claim, in [23] the authors created a dataset of 10,000 transactions with 10 items each, where all items participate in all transactions. Threshold msup was set to 1.0 and all itemsets from the lattice of the original database were extracted. Then, 2 1–itemsets and 2 4–itemsets, all consisting of disjoint items, were selected to be the sensitive ones and were hidden. To ensure that the problem had a feasible solution, it was relaxed by using only the constraints of set Smin that always lead to a satisfiable CSP. Based on this formulation, the total number of unm variables participating in the problem was 100,000. To split their products for the 2 4–itemsets, another 20,000 Ψi variables were introduced. The total number of constraints was 100,004 and the optimal solution was 99,996 (since 4 items, one belonging to each sensitive itemset, need to be zeroed in a transaction). CPLEX was capable of identifying this solution within 889 seconds (approx. 15 min), which is rather low for such a demanding problem. Of course, adding the constraints from set V increases time complexity. However, in all conducted experiments the solver was able to rapidly decide if the optimization problem was infeasible. Thus, in such cases, the relaxation procedure (presented in Algorithm 14.1) can be applied to allow for the production of a solvable CSP and the identification of a good approximate hiding solution.
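The variable counts quoted for this synthetic benchmark can be re-derived with a few lines of Python. This is only a back-of-the-envelope sketch under two assumptions that the text does not spell out: that one Ψ variable is introduced per transaction for each sensitive itemset containing more than one item, and that, with every transaction supporting every item, an itemset is hidden as soon as one transaction stops supporting it. All names are hypothetical.

```python
transactions = 10_000
sensitive_lengths = [1, 1, 4, 4]   # two 1-itemsets and two 4-itemsets

# The sensitive itemsets are disjoint, so they cover 1 + 1 + 4 + 4 = 10 items,
# and every one of these items is supported by every transaction.
sensitive_items = sum(sensitive_lengths)
u_vars = sensitive_items * transactions

# Assumption: one auxiliary variable per transaction for every sensitive
# itemset whose product term involves more than one u variable.
psi_vars = sum(1 for length in sensitive_lengths if length > 1) * transactions

# Assumption: zeroing a single item of each sensitive itemset suffices,
# so the maximized sum of the u variables loses exactly four 1s.
optimum = u_vars - len(sensitive_lengths)

print(u_vars, psi_vars, optimum)
```

Under these assumptions the sketch reproduces the 100,000 unm variables, the 20,000 Ψi variables and the optimum of 99,996 reported above; it does not attempt to re-derive the number of constraints.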
Chapter 15
Two–Phase Iterative Algorithm
In this chapter, we present a two–phase iterative algorithm (proposed in [27]) that extends the functionality of the inline algorithm of [23] (Chapter 14) to allow for the identification of exact hiding solutions for a wider spectrum of problem instances. A problem instance is defined as the set of (i) the original dataset DO, (ii) the minimum frequency threshold mfreq that is considered for its mining, and (iii) the set of sensitive itemsets S that have to be protected. Since the inline algorithm allows only supported items in DO to become unsupported in D, there exist problem instances that admit an exact hiding solution which the inline approach is incapable of finding. This can be observed in the experiments provided in Section 15.4, as well as in the experimental evaluation of [26]. The remainder of this chapter is organized as follows. In Section 15.1 we present the theoretical background behind the two–phase iterative process (see Figure 15.1; large box shows the iteration). Section 15.2 highlights some important aspects of the methodology that influence the quality of the hiding solution that is found. In Section 15.3, we present a specific problem instance for which the inline algorithm fails to provide an exact hiding solution and demonstrate how the two–phase iterative algorithm manages to identify it. Finally, Section 15.4 presents the experimental evaluation of the two–phase iterative algorithm.
15.1 Background The two–phase iterative algorithm of [27] consists of two phases that iterate until either (i) an exact solution of the given problem instance is identified, or (ii) a prespecified number of subsequent iterations $\ell$ have taken place. Threshold $\ell$ is called the limiting factor and must remain low enough to allow for a computationally efficient solution. The first phase of the algorithm utilizes the inline algorithm to hide the sensitive knowledge. If it succeeds, then the process is terminated and database D is returned. This phase causes the retreat of the positive border of DO in the lattice, thus excluding from FD the sensitive itemsets and their supersets. If the first phase is unable to identify an exact solution, the algorithm proceeds to the second phase, which implements the dual counterpart of the inline algorithm. Specifically, from the produced (infeasible) CSP of the first phase, the two–phase iterative algorithm [27] proceeds to remove selected inequalities as in Algorithm 14.1 but in a “one-to-one” rather than in a “batch” fashion¹, until the underlying CSP becomes feasible. The second phase results in the expansion of the positive border, and the iteration of the two phases aims at the approximation (to the highest possible extent) of the optimal borderline in the sanitized database D.

Fig. 15.1: The architectural layout for the two–phase iterative approach.

In what follows, let H denote the set of itemsets for which the corresponding inequalities were removed from the CSP to allow for a feasible solution. Obviously, $H \subseteq Bd^+(F'_D)$. Since the produced CSP is now feasible, the algorithm proceeds to identify the hiding solution and to compose the corresponding sanitized database D. This database is bound to suffer from the side-effect of hidden (nonsensitive) frequent patterns. The purpose of the second phase is to try to make these lost (due to the side-effects) itemsets of set H frequent again by increasing their support in D. However, the support increment should be accomplished in a careful manner to ensure that both the sensitive itemsets, as well as all the itemsets that do not participate in the optimal set $F'_D$, remain infrequent.

Let DH denote this intermediate (for the purposes of the second phase) view of the sanitized database D. In this database the itemsets of H should become frequent and the outcome database should adhere as much as possible to the properties of the ideal set $F'_D$. As one can observe, the second phase of the approach has as a consequence the movement of the revised positive border downwards in the lattice to include an extended set of itemsets, making them frequent. As a result, this two–phase iterative process, when viewed vertically in the itemset lattice, resembles the oscillation of a pendulum, where one stage follows the other with the hope that the computed revised positive border will converge at some point to the ideal one.

We now proceed to analyze the mechanism that takes place as part of the second phase of this “oscillation”. Since the goal of this phase is to make the itemsets of H frequent in DH, the modification of this database will be based only on item inclusions on a selected subset of transactions. Following the inline approach [23], the candidate items for inclusion are only those that appear among the itemsets in H, i.e. those in the universe of $I^H$. All these items will be substituted in the transactions of DH where they are unsupported, by the corresponding unm variables. This will produce an intermediate form of database DH. Following that, a CSP is constructed in which the itemsets that are controlled belong in set

$$C = H \cup Bd^-(F'_D) \tag{15.1}$$

¹ The selection among inequalities corresponding to the same “batch” is performed arbitrarily.

A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_15, © Springer Science+Business Media, LLC 2010
Finally, the optimization criterion is altered to denote that a minimization (i.e. the least amount of 1s), rather than a maximization, of the binary variables unm will lead to the best possible solution. Figure 15.2 presents the form of the CSP created in the second phase of this two–phase iterative hiding process. As mentioned earlier, these two phases are executed in an iterative fashion until convergence to the exact hiding solution is achieved or a pre-specified number of oscillations $\ell$ take place. In this figure, $D_{H,\{I\}}$ refers to the transactions of DH that support itemset I.

$$\text{minimize} \sum_{u_{nm} \in U} u_{nm}$$
$$\text{subject to} \begin{cases} \sum_{T_n \in D_{H,\{I\}}} \prod_{i_m \in I} u_{nm} < msup, & \forall I \in Bd^-(F'_D) \\ \sum_{T_n \in D_{H,\{I\}}} \prod_{i_m \in I} u_{nm} \geq msup, & \forall I \in H \end{cases}$$

Fig. 15.2: The CSP for the second stage of the two–phase iterative algorithm.
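The alternation of the two phases described in this section can be summarized by the following control-flow sketch in Python. It is not the implementation of [27]: the two phase functions are hypothetical callables that each return a candidate database together with a flag stating whether the solution is exact, and the sketch assumes that each phase operates on the output of the previous one.

```python
from typing import Callable, Tuple, TypeVar

DB = TypeVar("DB")
Phase = Callable[[DB], Tuple[DB, bool]]


def two_phase_iterative(original: DB, first_phase: Phase,
                        second_phase: Phase, limit: int) -> Tuple[DB, bool]:
    """Alternate the two phases until one of them reports an exact
    solution or `limit` iterations (the limiting factor) have passed."""
    current, exact = original, False
    for _ in range(limit):
        # Phase 1 (assumed): item removals, as in the inline algorithm.
        current, exact = first_phase(current)
        if exact:
            break
        # Phase 2 (assumed): item inclusions that try to restore the
        # itemsets of H without re-exposing the sensitive ones.
        current, exact = second_phase(current)
        if exact:
            break
    return current, exact
```

How each phase builds and solves its CSP, and how exactness is decided, is deliberately left outside the sketch; only the "pendulum" structure and the role of the limiting factor are shown.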
15.2 Removal of Constraints from Infeasible CSPs An aspect of the two–phase iterative approach that requires further investigation regards the constraint selection and removal process that turns an infeasible CSP into a feasible one. Ref. [27] considers the eviction process of Algorithm 14.1 in an attempt to maximize the probability of yielding a feasible CSP after the removal of only a few constraints (inequalities). This is achieved by removing the strictest constraints, i.e. those involving the maximum number of binary variables (equivalently, the maximum length itemsets from DO). To ensure that the removal of these constraints will cause minimal loss of nonsensitive knowledge to the database, the selected constraints are the ones that involve itemsets of low support in DO, as those itemsets would be the first to be hidden if the support threshold was increased.
Although this is a reasonable heuristic, it may not always lead to the optimal selection, i.e. the one where the minimum number of constraints is selected for eviction and their removal from the CSP causes the least distortion to the database. Thus, in what follows, we present a mechanism that was proposed in [27] which allows the identification of the best possible set of inequalities to formulate the constraints of the feasible CSP. Let a constraint set be a set of inequalities corresponding to the itemsets of the positive border, as taken from the original (infeasible) CSP. Then, as discussed in [27], there exists a “1–1” correspondence between constraint sets and itemsets, where each item is mapped to a constraint (and vice versa). A frequent itemset is equivalent to a feasible constraint set, i.e. a set of constraints that has a solution satisfying all the constraints in the set. The downwards closure property of the frequent itemsets also holds in the case of constraint sets, since all the subsets (taken by removing one or more inequalities) of a feasible constraint set are also feasible constraint sets and all the supersets of an infeasible constraint set are also infeasible constraint sets. Furthermore, a maximal feasible constraint set (in relation to a maximal frequent itemset) is a feasible constraint set all of whose supersets are infeasible constraint sets. The constraints removal process, explained earlier, can be considered as the identification of a maximal feasible constraint set among the inequalities of the revised positive border. These constraints will be the ones that will participate in the (feasible) CSP.
Due to the correspondence that exists between itemsets and constraint sets, one could use techniques that are currently applied on frequent itemset mining algorithms (such as pruning approaches) to efficiently identify the maximal constraint sets and possibly further select one among them (assuming that there exists some metric that captures the quality of the different constraint sets). However, the decision of whether a constraint set is feasible or not is a computationally demanding process and for this reason a much simpler heuristic is used instead by the two–phase iterative algorithm.
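The notion of a maximal feasible constraint set can be illustrated by exhaustive search on a tiny system. The sketch below is not the mechanism of [27]; it assumes a toy CSP with the same shape as the inequalities used in the example of Section 15.3: one mandatory inequality that hides the sensitive itemset CD, and two removable inequalities for the border itemsets ACE and ADE. Feasibility is decided by brute force over all binary assignments, which is only viable for a handful of variables.

```python
from itertools import combinations, product

VARS = ("u23", "u24", "u33", "u34", "u53", "u54", "u103", "u104")


def hides_cd(u):
    """Mandatory inequality: the sensitive itemset must become infrequent."""
    return (u["u23"] * u["u24"] + u["u33"] * u["u34"]
            + u["u53"] * u["u54"] + u["u103"] * u["u104"]) < 2


# Removable inequalities, one per itemset of the revised positive border.
BORDER = {
    "ACE": lambda u: u["u23"] + u["u53"] >= 2,
    "ADE": lambda u: u["u24"] + u["u54"] >= 2,
}


def is_feasible(constraint_set):
    """True if some binary assignment satisfies the mandatory inequality
    together with every inequality named in `constraint_set`."""
    for bits in product((0, 1), repeat=len(VARS)):
        u = dict(zip(VARS, bits))
        if hides_cd(u) and all(BORDER[name](u) for name in constraint_set):
            return True
    return False


def maximal_feasible_sets():
    names = sorted(BORDER)
    feasible = [frozenset(c)
                for k in range(len(names) + 1)
                for c in combinations(names, k) if is_feasible(c)]
    # Keep only the feasible sets that have no feasible proper superset.
    return [s for s in feasible if not any(s < other for other in feasible)]


print(maximal_feasible_sets())
```

For this toy system the full set {ACE, ADE} is infeasible while each single inequality is feasible, so there are two maximal feasible constraint sets; the downward closure is visible in the fact that the empty set is feasible as well.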
15.3 An Example In what follows we provide an example that was borrowed from [27]. Consider database DO depicted in Table 15.1. When mining this database for frequent itemsets, using mfreq = 0.2, we have that $F_{D_O}$ = {A, B, C, D, E, AC, AD, AE, CD, CE, DE, ACD, ACE, ADE, CDE, ACDE}. Suppose that the sensitive knowledge corresponds to the frequent itemset S = {CD}. Given S we compute Smax = {CD, ACD, CDE, ACDE} that corresponds to the sensitive itemsets and their frequent supersets. The revised positive border will then contain the itemsets of $Bd^+(F'_D)$ = {B, ACE, ADE}. The intermediate form of database DO (generated using the inline approach) is shown in Table 15.2.

Table 15.1: The original database DO.
A     B     C     D     E
1     0     1     0     0
1     0     1     1     1
0     0     1     1     0
0     1     0     0     1
1     0     1     1     1
0     0     0     1     1
0     0     1     0     0
1     1     0     0     0
1     0     1     0     0
0     0     1     1     0

Table 15.2: The intermediate database of DO.
A     B     C     D     E
1     0     1     0     0
1     0     u23   u24   1
0     0     u33   u34   0
0     1     0     0     1
1     0     u53   u54   1
0     0     0     1     1
0     0     1     0     0
1     1     0     0     0
1     0     1     0     0
0     0     u103  u104  0

Table 15.3: The database DH.
A     B     C     D     E
1     0     1     0     0
1     0     1     1     1
0     0     1     0     0
0     1     0     0     1
1     0     0     1     1
0     0     0     1     1
0     0     1     0     0
1     1     0     0     0
1     0     1     0     0
0     0     1     0     0

From the database of Table 15.2, using set $V = \{I \in Bd^+(F'_D) : I \cap I^S \neq \emptyset\}$ = {ACE, ADE}, we produce the following set of inequalities for C = V ∪ S = {ACE, ADE, CD} that are incorporated to the CSP:

$$ACE:\; u_{23} + u_{53} \geq 2 \tag{15.2}$$
$$ADE:\; u_{24} + u_{54} \geq 2 \tag{15.3}$$
$$CD:\; u_{23}u_{24} + u_{33}u_{34} + u_{53}u_{54} + u_{103}u_{104} < 2 \tag{15.4}$$
The first two inequalities correspond to itemsets ACE and ADE, which must remain frequent in D, while the last one reflects the status of the sensitive itemset CD, which must become infrequent in the sanitized outcome. It is easy to prove that if these inequalities are incorporated to the CSP of Figure 14.2, then the produced CSP is unsolvable: the first two inequalities force u23 = u24 = u53 = u54 = 1, which makes the left-hand side of the third one at least 2. Thus, Algorithm 14.1 is used to relieve the CSP of inequalities of set V = {ACE, ADE}. However, as mentioned earlier, the two–phase iterative algorithm does not use the “batch” mode of operation of this algorithm (which would remove both itemsets in V) but instead, it uses the “one-by-one” mode of operation that selects to remove one among the inequalities returned in the same “batch”. As already stated, at this point the selection among the inequalities of the same batch is arbitrary. Suppose that the inequality corresponding to itemset ACE of the revised positive border is selected to be removed from the CSP. The resulting CSP is then solvable. Among the alternative possible solutions assume that the one having u34 = u53 = u104 = 0 and u23 = u24 = u33 = u54 = u103 = 1 is selected. Since the criterion function of the CSP requires the maximization of the number of binary variables that are set to ‘1’, the solution that is produced has the minimum possible number of variables set to ‘0’. This solution (having a distance of 3) is presented in Table 15.3; since it is not exact, we name our database DH and proceed to the second phase of the iteration.

Table 15.4: Intermediate form of database DH.
A     B     C     D     E
1     0     1     0     u15
1     0     1     1     1
u31   0     1     0     u35
u41   1     u43   0     1
1     0     u53   1     1
u61   0     u63   1     1
u71   0     1     0     u75
1     1     u83   0     u85
1     0     1     0     u95
u101  0     1     0     u105
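The first-phase CSP of this example is small enough to be checked by exhaustive enumeration. The following Python sketch is not how [27] solves it (a BIP solver is used there); it simply assumes that the inequality of ACE has been removed, keeps the inequalities of ADE and CD, and maximizes the number of variables set to 1 by brute force. Variable names mirror the entries of Table 15.2.

```python
from itertools import product

VARS = ("u23", "u24", "u33", "u34", "u53", "u54", "u103", "u104")


def satisfies(u):
    """Inequalities kept after the inequality of ACE has been removed."""
    ade = u["u24"] + u["u54"] >= 2
    cd = (u["u23"] * u["u24"] + u["u33"] * u["u34"]
          + u["u53"] * u["u54"] + u["u103"] * u["u104"]) < 2
    return ade and cd


assignments = [dict(zip(VARS, bits))
               for bits in product((0, 1), repeat=len(VARS))]
feasible = [u for u in assignments if satisfies(u)]

# Maximizing the number of 1s minimizes the number of removed items.
best = max(sum(u.values()) for u in feasible)
optimal = [u for u in feasible if sum(u.values()) == best]
distance = len(VARS) - best

# The assignment selected in the text.
chosen = {"u23": 1, "u24": 1, "u33": 1, "u34": 0,
          "u53": 0, "u54": 1, "u103": 1, "u104": 0}

print(best, distance, len(optimal), chosen in optimal)
```

The enumeration confirms that at most five of the eight variables can stay at 1, i.e. that a distance of 3 is the best attainable once ACE is sacrificed, and that the assignment chosen in the text is one of several equally good optima.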
As we observe, in database DH of Table 15.3, the itemset of H = {ACE} is hidden, since its support has dropped below the minimum support threshold. In an attempt to make this itemset frequent again, the two–phase iterative algorithm creates an intermediate form of database DH in which, in all transactions that do not support items A, C, or E, the corresponding zero entries are substituted with binary variables unm. This leads to the database presented in Table 15.4. As a following step, the inequalities for the itemsets of set C are produced by taking into account the itemsets of sets H and $Bd^-(F'_D)$ = {AB, BC, BD, BE, CD}, which are not foreign to $I^H$ = {A, C, E}. Thus, we have the following set of produced inequalities:

$$(ACE):\; u_{15} + u_{31}u_{35} + u_{41}u_{43} + u_{53} + u_{61}u_{63} + u_{71}u_{75} + u_{83}u_{85} + u_{95} + u_{101}u_{105} \geq 1 \tag{15.5}$$
$$(AB):\; u_{41} < 1 \Rightarrow u_{41} = 0 \tag{15.6}$$
$$(BC):\; u_{43} + u_{83} < 2 \Rightarrow u_{43} = 0 \lor u_{83} = 0 \tag{15.7}$$
$$(BE):\; u_{85} < 1 \Rightarrow u_{85} = 0 \tag{15.8}$$
$$(CD):\; u_{53} + u_{63} < 1 \Rightarrow u_{53} = u_{63} = 0 \tag{15.9}$$
Notice that no inequality is produced from itemset BD since this itemset is foreign to set $I^H$. The produced set of inequalities is solvable, yielding two exact solutions, each of which requires only one binary variable to become ‘1’: either u15 or u95. These two solutions are actually the same, since the involved transactions (1st or 9th) do not differ from one another and both solutions apply the same modification to either of them. Table 15.5 presents the solution where u15 = 1.

Table 15.5: Database D produced by the two–phase iterative approach.
A     B     C     D     E
1     0     1     0     1
1     0     1     1     1
0     0     1     0     0
0     1     0     0     1
1     0     0     1     1
0     0     0     1     1
0     0     1     0     0
1     1     0     0     0
1     0     1     0     0
0     0     1     0     0
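The second phase of the example can be replayed end to end with a short brute-force sketch in Python. It is an illustration under stated assumptions rather than the BIP formulation of [27]: the databases are transcribed from Tables 15.1 and 15.3, a binary slot is created for every zero entry of DH in the columns A, C and E, and all 2^15 fillings are enumerated. Function and variable names are hypothetical.

```python
from itertools import combinations, product

ITEMS = "ABCDE"
MSUP = 2  # mfreq = 0.2 over 10 transactions


def as_db(text):
    return [set(t) for t in text.split()]


D_O = as_db("AC ACDE CD BE ACDE DE C AB AC CD")   # Table 15.1
D_H = as_db("AC ACDE C BE ADE DE C AB AC C")      # Table 15.3


def support(itemset, db):
    items = set(itemset)
    return sum(items <= t for t in db)


def frequent(db):
    return {"".join(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k) if support(c, db) >= MSUP}


# Ideal outcome: everything frequent in D_O except CD and its supersets.
IDEAL = {i for i in frequent(D_O) if not set("CD") <= set(i)}
H = ["ACE"]
NEG_BORDER = ["AB", "BC", "BD", "BE", "CD"]

# One binary slot per zero entry of D_H in the columns of I^H = {A, C, E}.
SLOTS = [(n, item) for n, t in enumerate(D_H) for item in "ACE" if item not in t]


def fill(bits):
    db = [set(t) for t in D_H]
    for (n, item), bit in zip(SLOTS, bits):
        if bit:
            db[n].add(item)
    return db


def is_valid(db):
    return (all(support(i, db) >= MSUP for i in H)
            and all(support(i, db) < MSUP for i in NEG_BORDER))


valid = [bits for bits in product((0, 1), repeat=len(SLOTS)) if is_valid(fill(bits))]
fewest = min(sum(bits) for bits in valid)
cheapest = [bits for bits in valid if sum(bits) == fewest]
# Report each cheapest solution as (transaction number, inserted item) pairs.
minimal = [[(n + 1, item) for (n, item), bit in zip(SLOTS, bits) if bit]
           for bits in cheapest]
exact = all(frequent(fill(bits)) == IDEAL for bits in cheapest)

print(len(SLOTS), fewest, sorted(minimal), exact)
```

The enumeration agrees with the discussion above: fifteen variables are introduced, a single insertion suffices, the only two cheapest choices are item E in the 1st or in the 9th transaction, and in both cases the frequent itemsets of the resulting database coincide with the ideal set.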
Figure 15.3 shows the original borderline along with the borderlines of the databases produced as an output of the two phases of the algorithm. Notice that the output of the second phase in the first iteration of the algorithm yields an exact solution. Thus, for $\ell$ = 1 the two–phase iterative algorithm has provided an exact solution that was missed by the inline algorithm of [23].
15.4 Experimental Evaluation In this section, we provide some experimental results that test the two–phase iterative approach against the inline algorithm. The two algorithms were tested in [27] on three real world datasets using different parameters such as minimum support threshold and number/size of sensitive itemsets to hide. All these datasets are publicly available through the FIMI repository [30] and are summarized in Table 14.4.
Fig. 15.3: The two phases of iteration for the considered example.
The primary bottleneck that was experienced in most experiments involved the time that was taken to run the frequent itemset mining algorithm, as well as the time needed to solve the formulated CSPs through the application of BIP. In all tested settings, the thresholds of minimum support were properly selected to ensure an adequate number of frequent itemsets and the sensitive itemsets to be hidden were selected randomly among the frequent ones. All the experiments were conducted on a PC running Linux on an Intel Pentium D, 3.2 GHz processor equipped with 4 GB of main memory. All integer programs were solved using ILOG CPLEX 9.0 [36]. A limiting factor of $\ell$ = 5 was used in all presented experiments in order to control the number of iterations that the algorithm is allowed to execute. After the execution of the first iteration of the two–phase iterative algorithm, the attained solution as well as the respective impact on the dataset are recorded. Then, the two–phase iterative algorithm proceeds to up to $\ell$ subsequent iterations. If the algorithm fails to identify an exact solution after the $\ell$ runs, then the stored solution from the first iteration of the algorithm is selected to produce the sanitized database D. This way, the runtime of the two–phase iterative algorithm is kept at a tractable level (through the use of $\ell$), and the algorithm is adapted so that its hiding solutions are never worse than those of the inline algorithm of [23]. In what follows, let notation a × b denote the hiding of a itemsets of length b, e.g. 2 × 1 refers to the hiding of two items.
Fig. 15.4: Distance between the inline and the two–phase iterative algorithm. [Three column charts, titled “Inline vs 2-Phase Iterative Approach for BMS-1”, “Inline vs 2-Phase Iterative Approach for BMS-2” and “Inline vs 2-Phase Iterative Approach for Mushroom”, each plotting the distance (y-axis) attained by the inline and the 2-phase iterative algorithms for the hiding scenarios 1x2, 2x2, 1x3, 2x3, 1x4 and 2x4 (x-axis).]
Figure 15.4 presents the performance comparison of the two algorithms. A star above a column in the graphs indicates that an exact hiding solution could not be found and the algorithm resorted to an approximate solution. Based on the presented experimental results, the following observations can be made. First, by construction, the two–phase iterative algorithm [27] is never inferior to the inline algorithm [23], since its worst performance equals the performance of the inline scheme. Second, as the experiments indicate, there are several settings in which the two–phase iterative algorithm finds an optimal hiding solution (with a small increment in the distance) that is missed by the inline algorithm. Third, by construction, the two–phase iterative algorithm can capture all the exact solutions that were also identified by the inline approach. These facts make the two–phase iterative algorithm superior to the inline algorithm.
Chapter 16
Hybrid Algorithm
Gkoulalas–Divanis & Verykios in [26] introduce the first exact methodology to strategically perform sensitive frequent itemset hiding based on a new notion of hybrid database generation. This approach broadens the regular process of data sanitization (as introduced in [10] and adopted by the overwhelming majority of researchers [10, 47, 55]) by applying an extension to the original database instead of either modifying existing transactions (directly or through the application of transformations), or rebuilding the dataset from scratch to accommodate sensitive knowledge hiding. The extended portion of the database contains a set of carefully crafted transactions that manage to lower the importance of the sensitive patterns to a degree that they become uninteresting from the perspective of the data mining algorithm, while minimally affecting the importance of the nonsensitive ones. The hiding process is guided by the need to maximize the data utility of the sanitized database by introducing the least possible amount of side-effects. The released database, which consists of the initial part (original database) and the extended part (database extension), can guarantee the protection of the sensitive knowledge, when mined at the same or higher support as the one used in the original database. Extending the original database for sensitive itemset hiding is proven to provide optimal solutions to an extended set of hiding problems, compared to previous approaches, as well as to lead to hiding solutions of typically superior quality [26]. In this chapter we provide the essential theoretical background for the understanding of the hybrid algorithm and elucidate the workings of this methodology. Following previous work [23, 27], the approach presented in this chapter is exact in nature; provided that a hiding solution that causes no side-effects to the sanitized database exists, the hybrid algorithm is guaranteed to find it.
The remainder of this chapter is organized as follows. In Section 16.1 we provide the basic steps of the methodology, while Section 16.2 sheds light on the main issues that pertain to its successful application. Following that, Section 16.3 presents the various aspects of the hybrid solution methodology and Section 16.4 introduces a partitioning approach which substantially improves the scalability of the hybrid algorithm. Finally, Section 16.5 contains an experimental evaluation of the hybrid algorithm contrasting it to the exact algorithm of [23] (Chapter 14) and the border based approaches of [66, 67] (Chapter 10) and [50, 51] (Chapter 11).
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_16, © Springer Science+Business Media, LLC 2010
16.1 Knowledge Hiding Formulation This section sets out the hiding methodology that is employed by the hybrid algorithm, along with a running example which helps towards its proper understanding.
16.1.1 Hybrid Hiding Methodology To properly introduce the hiding methodology one needs to consider the existence of three databases, all depicted in binary format. They are defined as follows:
• Database DO is the original database which, when mined at a certain support threshold msup, leads to the disclosure of some sensitive knowledge in the form of sensitive frequent patterns. This sensitive knowledge needs to be protected.
• Database DX is a minimal extension of DO that is created by the hiding algorithm during the sanitization process to facilitate knowledge hiding.
• Database D is the union of database DO and the applied extension DX and corresponds to the sanitized outcome that can be safely released.
Suppose that database DO consists of N transactions. By performing frequent itemset mining in DO, using a support threshold msup set by the owner of the data, a set of frequent patterns FDO is discovered, among which a subset S contains patterns which are sensitive from the owner’s perspective. The goal of the hybrid hiding algorithm is to create a minimal extension to the original database DO in a way that the final, sanitized database D protects the sensitive itemsets from disclosure. The database extension can by itself be considered as a new database DX, since it consists of a set of transactions in the same space of items I as the ones of DO. Among alternative hiding solutions that may exist, the target of the hybrid algorithm is to protect the sensitive knowledge, while minimally affecting the nonsensitive itemsets appearing in FDO. This means that all the nonsensitive itemsets in FDO should continue to appear as frequent among the mined patterns from D, when performing frequent itemset mining using the same or a higher support threshold. The hiding of a sensitive itemset is equivalent to a degradation of its statistical significance, in terms of support, in the result database.
The hybrid algorithm [26] first applies border revision to identify the revised borders for D, then computes the minimal size for the extension DX and, by using the itemsets of the revised borders, defines a CSP that is solved using BIP. In this way, all possible assignments of itemsets to the transactions of DX are examined and the optimal assignment is bound to be found. Although the properties of the produced CSP allow for an acceptable runtime of the hiding algorithm, there are cases
in which the partitioning approach of Section 16.4 becomes useful to accommodate very large problem sizes.

Table 16.1: Sanitized database D as a mixture of the original database DO and the applied extension DX.
      a  b  c  d  e  f
DO    1  1  0  0  0  1
      1  1  1  1  0  0
      1  0  1  0  0  1
      1  0  0  0  0  0
      0  1  0  0  1  0
      1  1  1  1  1  0
      0  0  0  1  0  0
      1  1  1  0  1  0
      0  1  1  0  0  0
      1  0  1  1  1  0
DX    1  0  0  0  0  0
      1  0  1  1  0  0
      1  0  1  1  0  0
      1  1  0  0  0  0

Table 16.2: Frequent itemsets for DO and D at msup = 3.
Frequent itemset in DO                                      Support
{a}                                                         7
{b}, {c}                                                    6
{ac}                                                        5
{d}, {e}, {ab}, {bc}                                        4
{ad}, {ae}, {be}, {cd}, {ce}, {abc}, {acd}, {ace}           3

Frequent itemset in D                                       Support
{a}                                                         11
{c}                                                         8
{b}, {ac}                                                   7
{d}                                                         6
{ab}, {ad}, {cd}, {acd}                                     5
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
{e}, {bc}                                                   4
{ae}, {be}, {ce}, {abc}, {ace}                              3
16.1.2 A Running Example To shed light into the workings of the hybrid algorithm, in what follows we borrow the running example of [26]. Suppose that we are provided with database DO of Table 16.1. Applying frequent itemset mining in DO using mfreq = 0.3, leads to the set of large itemsets FDO appearing in the upper part of Table 16.2. Among these itemsets, S = {e, ae, bc} denotes the sensitive knowledge that has to be protected. The hybrid hiding algorithm aims at the creation of a database extension DX to DO (see Table 16.1) that allows the hiding of the sensitive knowledge, while keeping the nonsensitive patterns frequent in the sanitized outcome. Table 16.1 summarizes the target of the hybrid hiding algorithm. The union of the two datasets, DO and DX , corresponds to the sanitized outcome D that can be safely released. Thus, the primary goal of the hiding algorithm is to construct the privacy aware extension DX of DO such that (i) it contains the least amount of transactions that are necessary to ensure the proper hiding of all the sensitive knowledge in DO , and (ii) it introduces no side-effects to the hiding process. As one can observe in the lower part of Table 16.2, all the sensitive itemsets of DO along with their supersets are infrequent in D (shown under the dashed line), while all the nonsensitive itemsets of DO remain frequent. Since DO is extended, in order to ensure that the nonsensitive patterns will remain frequent in D, the hiding algorithm needs to appropriately increase their support in the sanitized database.
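The claims made for the running example can be checked mechanically. The Python sketch below transcribes the transactions of Table 16.1 (first ten rows for DO, last four for DX), mines both DO and D at mfreq = 0.3 by exhaustive enumeration, and compares the outcome with the nonsensitive part of FDO. The helper names are hypothetical and the enumeration is only practical for such a small item universe.

```python
from fractions import Fraction
from itertools import combinations

ITEMS = "abcdef"
MFREQ = Fraction(3, 10)

# Transactions transcribed from Table 16.1.
D_O = [set(t) for t in "abf abcd acf a be abcde d abce bc acde".split()]
D_X = [set(t) for t in "a acd acd ab".split()]
D = D_O + D_X

SENSITIVE = [set("e"), set("ae"), set("bc")]


def support(itemset, db):
    items = set(itemset)
    return sum(items <= t for t in db)


def frequent(db):
    """Itemsets whose support reaches mfreq times the database size."""
    threshold = MFREQ * len(db)
    return {"".join(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k) if support(c, db) >= threshold}


f_original = frequent(D_O)
f_sanitized = frequent(D)

# Nonsensitive knowledge: frequent itemsets of D_O that are neither
# sensitive nor supersets of a sensitive itemset.
expected = {i for i in f_original
            if not any(s <= set(i) for s in SENSITIVE)}

print(sorted(f_sanitized, key=lambda i: (len(i), i)))
print(f_sanitized == expected)
```

With 14 transactions in D the frequency threshold corresponds to a support of 4.2, so the itemsets with support 4 or 3 in the lower part of Table 16.2 fall below it, and the mined result coincides with the nonsensitive itemsets of DO.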
16.2 Main Issues Pertaining to the Hiding Methodology The hybrid hiding methodology produces a sanitized database D that corresponds to a mixture of the original transactions in DO and a set of synthetic transactions, artificially created to prohibit the leakage of the sensitive knowledge. For security reasons, all the transactions in D are assumed to be randomly ordered so that it is difficult for an adversary to distinguish between the real ones and those that were added by the hiding algorithm to secure the sensitive knowledge. There are several issues of major importance, involving the hiding methodology, that need to be examined. To continue, let P denote the size of database D, N is the size of database DO , and Q is the size of the extension DX .
16.2.1 Size of Database Extension Since database DO is extended by DX to construct database D, an initial and very important step in the hiding process is the computation of the size of DX (i.e., the necessary amount of transactions that need to be added to those of DO to facilitate the hiding of the sensitive knowledge). A lower bound on this value can be established based on the sensitive itemset in S which has the highest support (breaking
ties arbitrarily). The rationale here is as follows: by identifying the sensitive itemset with the highest support, one can safely decide upon the minimum number of transactions that must not support this itemset in DX, so that it becomes infrequent in D. This number, theoretically, is sufficient to allow the hiding of all the other itemsets participating in S and all its supersets, and corresponds to the minimum number of transactions that DX must have to properly secure the sensitive knowledge. Theorem 16.1 demonstrates how this lower bound Q is established [26].
Theorem 16.1. Let IM ∈ S such that for all I ∈ S it holds that sup(IM, DO) ≥ sup(I, DO). Then, the minimum size of DX to allow the hiding of the sensitive itemsets from S in D equals:
$$Q = \left\lfloor \frac{sup(I_M, D_O)}{mfreq} - N \right\rfloor + 1 \qquad (16.1)$$
Proof. We only need to prove that any itemset I ∈ S will become hidden in D if and only if Q > sup(I, DO)/mfreq − N, provided that I is not supported in DX. Since itemset I must be infrequent in database D, the following condition holds:
$$sup(I, D) < msup \implies sup(I, D_O) + sup(I, D_X) < mfreq \cdot P$$
Moreover, since sup(I, DX) ≥ 0 and (by construction) P = N + Q, we have that
$$sup(I, D_O) < mfreq \cdot (N + Q) \implies \frac{sup(I, D_O)}{mfreq} < N + Q$$
The last inequality was relaxed by removing the term sup(I, DX). Since the left-hand side is a sum of non-negative terms and a non-negative term was removed, the inequality continues to hold. Rearranging gives Q > sup(I, DO)/mfreq − N. This establishes (16.1), since the sensitive itemset with the highest support requires the largest number of transactions (not supporting it) in the extension DX in order to be properly hidden. The lower bound on the number of necessary transactions for DX thus equals the floor value of sup(IM, DO)/mfreq − N plus one. Moreover, as expected, itemsets having lower support than IM may be supported by some transactions of database DX, as long as they are infrequent in D.
Equation (16.1) provides the absolute minimum size of DX to accommodate the sensitive knowledge hiding. However, as presented in [26] and discussed in the following, this lower bound may, under certain circumstances, be insufficient to allow for the identification of an exact hiding solution, even if such a solution exists. This situation may occur if, for instance, the number of transactions returned by (16.1) is too small to allow for consistency among the different requirements imposed upon the status (frequent vs. infrequent) of the various itemsets appearing in D. Later on, we present a way to overcome this limitation.
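The bound of Theorem 16.1 is easy to check numerically. The following is a minimal sketch, not code from [26]: the function names are hypothetical, and the values N = 10 and sup(IM, DO) = 4 are assumptions chosen for illustration, since Table 16.1 is not reproduced here. Exact rational arithmetic is used so that the floor is not affected by floating point rounding.

```python
from fractions import Fraction
from math import floor


def min_extension_size(max_sensitive_support: int, mfreq: Fraction, n: int) -> int:
    """Lower bound Q of (16.1): floor(sup(I_M, D_O) / mfreq - N) + 1."""
    return floor(Fraction(max_sensitive_support) / mfreq - n) + 1


def is_hidden(support: int, mfreq: Fraction, n: int, q: int) -> bool:
    """An itemset that no transaction of D_X supports is infrequent in D
    when its support in D_O is below mfreq * (N + Q)."""
    return support < mfreq * (n + q)


# Assumed illustrative values: N = 10 transactions, sup(I_M, D_O) = 4, mfreq = 0.3.
MFREQ = Fraction(3, 10)
Q = min_extension_size(4, MFREQ, 10)
print(Q)                               # 4
print(is_hidden(4, MFREQ, 10, Q))      # True: 4 < 0.3 * 14
print(is_hidden(4, MFREQ, 10, Q - 1))  # False: 4 >= 0.3 * 13
```

The last two lines illustrate the "if and only if" of the proof: one transaction fewer than Q leaves the itemset frequent.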
16.2.2 Optimal Solutions in the Hybrid Methodology Having identified the size Q of database DX, the next step is to properly construct these transactions to facilitate knowledge hiding. Since the actual values of all the items in the database extension are unknown at this point, the hiding algorithm represents them with binary variables that will be instantiated later on in the process. In what follows, let uqm be the binary variable corresponding to the m-th item of transaction Tq ∈ DX (q ∈ [1, Q], m ∈ [1, M]), when DO is in the sanitization process. Under this formulation, the goal of the hybrid hiding algorithm [26] becomes to optimally adjust all the binary variables involved in all the transactions of DX in order to hide the sensitive itemsets, while minimally affecting the nonsensitive ones in a way that they remain frequent in the sanitized outcome. As already presented in Chapter 2, this is the notion of an exact hiding solution. In a typical hiding scenario, distinct feasible solutions are of different quality. Thus, an optimization criterion needs to be incorporated into the hiding strategy to guide the algorithm to the best possible among all the feasible solutions. The metric of distance, introduced in [23], is also applied by the hybrid methodology to quantify the notion of "harm" caused to the original dataset by the sanitization process. In the context of the hybrid hiding algorithm, the distance between database DO and its sanitized version D is measured based on the extension DX as follows:
$$dist(D_O, D) = \sum_{q \in [1, Q],\, m \in [1, M]} u_{qm} \qquad (16.2)$$
As one can observe, the minimum impact of D can be quantified as the minimum distance between DO and D. Thus, the objective of the hiding algorithm becomes to appropriately set the uqm variables such that all the sensitive knowledge is hidden, while the distance is minimized. An interesting property of the distance measure is that it allows the hiding algorithm to ensure high quality in the sanitized database D and to identify the optimal solution, if one exists. The notion of an optimal solution, in the context of the hybrid algorithm of [26], is presented in Definition 16.2. Based on the notion of distance and the size of the produced extension DX of database DO , the database quality can be defined as follows: Definition 16.1. (Database quality) Given the sanitized database D, its original version DO and the produced extension DX , the quality of database D is measured both in the size of DX and in the number of binary variables set to ‘1’ in the transactions of DX (i.e., the distance metric). In both cases, lower values correspond to better solutions in terms of quality. Through (16.1) the hybrid hiding algorithm is capable of identifying the lower bound in the size of DX that is necessary to accommodate the hiding of the sensitive knowledge in DO . Furthermore, through the use of (16.2) as an optimization criterion, the hybrid hiding approach is guaranteed to identify a feasible solution having minimum impact on database D. Given the previous definitions, we proceed to define the notion of an optimal solution in the context of the hybrid algorithm.
Definition 16.2. (Optimal solution) A solution to the hiding of the sensitive itemsets is considered as optimal if it has the minimum distance among all the existing exact solutions and is obtained through the minimum expansion of DO . In that sense, optimal is a solution that is both minimal (with respect to the distance and the size of the extension) and exact.
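As a small illustration of (16.2) and Definition 16.1, the sketch below computes the distance of a candidate extension and orders candidates by quality. The lexicographic ordering (extension size first, distance second) is an assumption made here for illustration; the definition only states that lower values of both are better. The two candidate extensions are made up.

```python
from typing import List, Tuple

Extension = List[List[int]]  # Q rows, each holding M binary entries u_qm


def distance(dx: Extension) -> int:
    """dist(D_O, D) of (16.2): the number of entries of D_X set to 1."""
    return sum(sum(row) for row in dx)


def quality_key(dx: Extension) -> Tuple[int, int]:
    """Sort key for Definition 16.1 (assumed ordering: size, then distance)."""
    return (len(dx), distance(dx))


# Two hypothetical extensions over the items a..f.
candidate_a = [[1, 1, 0, 0, 0, 0], [1, 0, 1, 1, 0, 0]]
candidate_b = [[1, 1, 1, 0, 0, 0], [1, 0, 1, 1, 0, 0]]

print(distance(candidate_a), distance(candidate_b))  # 5 6
best = min([candidate_a, candidate_b], key=quality_key)
print(best is candidate_a)                           # True
```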
16.2.3 Revision of the Borders The concept of border revision, introduced by Sun and Yu [66] and covered in detail in Chapter 9, provides the underlying mechanism for the specification of the values of the uqm variables to 1s or 0s, in a way that minimizes the impact on D. In order to maintain high quality on the result database, the hybrid hiding algorithm should select such item modifications in DX that have minimal impact on the border of the nonsensitive itemsets. The first step of the hybrid hiding methodology rests on the identification of the revised borders for D. The hiding algorithm relies on both the revised positive and the revised negative borders, denoted as Bd+(F′D) and Bd−(F′D), respectively. After identifying the revised borders, the hiding process has to perform all the required minimal adjustments of the transactions in DX to enforce the existence of the new borderline in the result database D.
Continuing the example of Section 16.1.2, consider the lattice of Figure 9.1, corresponding to database DO of Table 16.1, when applying frequent itemset mining at mfreq = 0.3. Near each itemset we depict its support in the original (Figure 9.1(i)) and the sanitized (Figure 9.1(ii)) database. Based on the original hyperplane, the following borders can be identified for DO: Bd+(FDO) = {abc, be, acd, ace} and Bd−(FDO) = {f, bd, de, abe, bce}. The original borders are presented in Figure 9.1(i), where the itemsets belonging to the positive border are double underlined and the ones belonging to the negative border are single underlined. As one can notice, in the lattice of this figure all frequent itemsets lie at the left of the respective border, while their infrequent counterparts are on the right. Given that S = {e, ae, bc} (shown in squares in Figure 9.1(i)), we have that Smin = {e, bc} (shown in bold in Figure 9.1(i)) and Smax = {e, ae, be, bc, ce, abc, ace}.
Ideally, the frequent itemsets in the sanitized database D will be exactly those in F′D = {a, b, c, d, ab, ac, ad, cd, acd}. The revised borderline along with the corresponding borders for D and scenarios C1 and C3 that must hold for an exact solution, are depicted in Figure 9.1(ii). Scenario C2 corresponds to the itemsets lying on the right of the original border. To enhance the clarity of the figure only a small portion of the itemsets involved in C2 are shown in Figure 9.1. The revised borders that pertain to an optimal solution are: Bd+(F′D) = {ab, acd} (double underlined in Figure 9.1(ii)) and Bd−(F′D) = {e, f, bc, bd} (single underlined in Figure 9.1(ii)). What is thus needed is a way to enforce the existence of the revised borders in D.
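The borders quoted above can be recomputed from the sets of frequent itemsets alone. The sketch below is an illustration, not the border revision procedure of [66]: it takes the positive border to be the maximal frequent itemsets and the negative border to be the minimal infrequent itemsets, and it assumes that the input set is downward closed. The helper names are hypothetical.

```python
from itertools import combinations


def parse(names):
    """Turn strings such as 'acd' into frozensets of single-character items."""
    return {frozenset(name) for name in names}


def positive_border(frequent):
    """Maximal frequent itemsets: no proper superset is frequent."""
    return {i for i in frequent if not any(i < j for j in frequent)}


def negative_border(frequent, items):
    """Minimal infrequent itemsets: infrequent, with all immediate subsets
    frequent (the empty set is treated as frequent)."""
    border = set()
    max_len = max(len(i) for i in frequent) + 1
    for size in range(1, max_len + 1):
        for combo in combinations(sorted(items), size):
            cand = frozenset(combo)
            if cand in frequent:
                continue
            if all(len(cand) == 1 or (cand - {x}) in frequent for x in cand):
                border.add(cand)
    return border


def show(itemsets):
    return sorted("".join(sorted(i)) for i in itemsets)


ITEMS = "abcdef"
F_REVISED = parse(["a", "b", "c", "d", "ab", "ac", "ad", "cd", "acd"])

print(show(positive_border(F_REVISED)))         # ['ab', 'acd']
print(show(negative_border(F_REVISED, ITEMS)))  # ['bc', 'bd', 'e', 'f']
```

Applying the same two functions to the frequent itemsets of DO implied by the original borders yields {abc, be, acd, ace} and {f, bd, de, abe, bce}, in agreement with the text.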
16.2.4 Problem Size Reduction To enforce the computed revised border and identify the exact hiding solution, a mechanism is needed to regulate the status (frequent vs. infrequent) of all the itemsets in D. Let C be the minimal set of border itemsets used to regulate the values of the various uqm variables in DX. Moreover, suppose that I ∈ C is an itemset, whose behavior we want to regulate in D. Then, itemset I will be frequent in D if and only if sup(I, DO) + sup(I, DX) ≥ mfreq × (N + Q), or equivalently if:
$$sup(I, D_O) + \sum_{q=1}^{Q} \prod_{i_m \in I} u_{qm} \geq mfreq \times (N + Q) \qquad (16.3)$$
and will be infrequent otherwise, when
$$sup(I, D_O) + \sum_{q=1}^{Q} \prod_{i_m \in I} u_{qm} < mfreq \times (N + Q) \qquad (16.4)$$
Inequality (16.3) corresponds to the minimum number of times that an itemset I has to appear in the extension DX to remain frequent in D. On the other hand, inequality (16.4) provides the maximum number of times that an itemset I can appear in DX in order to remain infrequent in database D. To identify an exact solution to the hiding problem, every possible itemset in P, according to its position in the lattice with respect to the revised border, must satisfy either (16.3) or (16.4). However, solving the entire system of the 2^M − 1 inequalities is well known to be NP-hard [10, 47]. Therefore, one should restrict the problem to capture only a small subset of these inequalities, thus leading to a problem size that is computationally manageable. The problem formulation proposed in [26] achieves this in a similar way as the inline algorithm [23], i.e. by reducing the number of the participating inequalities that need to be satisfied. Moreover, by carefully selecting the itemsets in set C, the hiding algorithm ensures that the solution attained is exactly the same as the one obtained by solving the entire system of inequalities. This is accomplished by exploiting cover relations existing among the itemsets in the lattice due to the monotonicity of support [7].
Definition 16.3. (Cover/Generalized cover) Given itemsets I, J, K ∈ P, itemset I is defined as a cover of K if and only if I ⊃ K and there exists no proper itemset J such that I ⊃ J ⊃ K. Itemset I is a generalized cover of K if and only if I ⊃ K. Given two itemsets I, J ∈ P, J covers I if and only if J ⊃ I, i.e. if itemset J is a (generalized) cover of itemset I.
The basic premises for formulating set C are the following [26]. All frequent itemsets in F′D should remain frequent in D, thus inequality (16.3) must hold. On the contrary, all frequent itemsets in Smin should satisfy inequality (16.4) in order to be hidden in D.
By definition, the hiding of the itemsets in Smin will cause the hiding of the itemsets in Smax . Additionally, all infrequent itemsets in DO should
remain infrequent in D. For these reasons, set C is chosen appropriately to consist of all the itemsets of the revised border. The hybrid hiding algorithm is capable of ensuring that if the inequalities (16.3) and (16.4) are satisfied for all the itemsets in set C, then the produced solution is exact and is identical to the solution involving the whole system of the 2^M − 1 inequalities.
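To make inequalities (16.3) and (16.4) concrete, the following sketch evaluates them for a fully instantiated extension. It is only an illustration: the extension rows, the supports sup(ab, DO) = sup(e, DO) = 4 and the size N = 10 are assumed values, and the function names are hypothetical. Itemsets are given as lists of column indices over the items a..f.

```python
from fractions import Fraction


def support_in_extension(cols, dx):
    """Sum over the rows of D_X of the product of u_qm over the itemset's items,
    i.e. the number of rows holding 1 in every column of the itemset."""
    return sum(1 for row in dx if all(row[m] for m in cols))


def is_frequent_in_d(sup_do, cols, dx, mfreq, n):
    """Inequality (16.3); when it fails, inequality (16.4) holds instead."""
    return sup_do + support_in_extension(cols, dx) >= mfreq * (n + len(dx))


# Hypothetical extension with Q = 4 rows over the items a, b, c, d, e, f.
DX = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
MFREQ, N = Fraction(3, 10), 10  # assumed values

print(is_frequent_in_d(4, [0, 1], DX, MFREQ, N))  # {ab}: 4 + 1 >= 4.2 -> True
print(is_frequent_in_d(4, [4], DX, MFREQ, N))     # {e}:  4 + 0 <  4.2 -> False
```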
16.2.5 Handling Suboptimality Since an exact solution may not always be feasible, the hybrid hiding algorithm should be capable of identifying high quality approximate solutions. There are two possible scenarios that may lead to non-existence of an exact solution. Under the first scenario, DO itself does not allow for an optimal solution due to the various supports of the participating itemsets. Under the second scenario, database DO is capable of providing an exact solution but the size of the database extension is insufficient to satisfy all the required inequalities of this solution. To tackle the first case, the hybrid hiding algorithm follows the paradigm of the inline approach [23] by assigning different degrees of importance to different inequalities. To be more precise, it is crucial to ensure that (16.4) holds for all sensitive itemsets in D, so that they are properly protected from disclosure, whereas satisfying (16.3) for an itemset only serves to keep the impact of the sanitization process on DO as small as possible. This inherent difference in the significance of the two inequalities, along with the fact that solving the system of all inequalities of the form (16.4) always leads to a feasible solution (i.e., for any database DO), allows the relaxation of the problem when needed and the identification of a good approximate solution. To overcome the second issue, the hiding algorithm incorporates the use of a safety margin threshold, which produces a further expansion of DX by a certain number of transactions. These transactions must be added to the ones computed by using (16.1). The introduction of a safety margin can be justified as follows. Since equation (16.1) provides the lower bound on the size of database DX, it is possible that the artificially created transactions are too few to accommodate the proper hiding of knowledge.
This situation may occur due to conflicting constraints imposed by the various itemsets regarding their status in D. These constraints require more transactions (or to be more precise, more item modifications) in order to be met. Thus, a proper safety margin will allow the algorithm to identify an exact solution if such a solution exists. Moreover, as is demonstrated in Section 16.3.4, the additional extension of DX , due to the incorporation of the safety margin, can be restricted to the necessary degree. A portion of the transactions in DX are selected and removed at a later point, thus reducing its size and allowing an exact solution. Therefore, the only side-effect of using a safety margin in the hiding process, is an inflation in the number of constraints and associated binary variables in the problem formulation, leading to a minuscule overhead in the runtime of the hiding algorithm.
16.3 Hybrid Solution Methodology In the following sections, we present how [26] minimizes the problem size by regulating the status of only an essential portion of itemsets from P. Moreover, we present the solution that was proposed regarding the size of the extension DX and the formulation of the hiding process as a CSP solved by using BIP. Finally, the critical issues of how to ensure validity of transactions in D and how to handle suboptimality in the hiding process are addressed.
16.3.1 Problem Size Minimization Cover relations governing the itemsets in the lattice of DO ensure that the formulated set of itemsets C has an identical solution to the one of solving the system of all 2^M − 1 inequalities for D. Since the status of each itemset in C needs to be regulated in DX, a one-to-one correspondence exists between itemsets and produced inequalities in the system.
Definition 16.4. (Solution of a system of inequalities from C) Given set C = {C1, C2, ..., C_(2^M − 1)}, let LC be the set of solutions of the corresponding inequalities from C. The set of solutions for the system of inequalities will be the intersection of the solutions produced by each inequality. Thus,
$$L_C = \bigcap_{r=1}^{R} L_{C_r}$$
Based on this notation, the set of solutions that corresponds to all itemsets in D is denoted as L_(P\{∅}). Given an inequality, for instance C2, produced by an itemset I with a solution set being a proper subset of the solution set of another inequality, say C1, one can deduce that the latter inequality can be removed from any system containing the first inequality, without affecting the global solution of this system. In this case LC2 ⊂ LC1 and C1 is a (generalized) cover of C2. The following theorems identify itemsets that should be represented in the CSP formulation and prove that the optimal hiding solution LC in the case of the hybrid algorithm can be found based on the union of the borders Bd+(F′D) and Bd−(F′D).
Theorem 16.2. If I ∈ F′D then ∀J ⊂ I we have that LI ⊂ LJ.
Proof. Consider the inequality produced by I, which should be frequent in D, i.e.
$$\sum_{q=1}^{Q} \prod_{i_m \in I} u_{qm} \geq msup - sup(I, D_O) \qquad (16.5)$$
Suppose that this inequality holds for a combination of uqm values, corresponding to a solution l ∈ LI . If all the uqm variables in the extension DX are substituted with their values from l, then every bit will gain a specific value. Let DI denote the supporting transactions of itemset I in database D and index p ∈ [1, P] be used to
capture transactions in D. Since J ⊂ I, it holds that DI ⊂ DJ. Then, provided that
$$A = \sum_{T_q \in (D_J \setminus D_I)} \prod_{i_m \in J} u_{qm}$$
the following must hold for database D:
$$A + sup(J, D_O) - sup(I, D_O) \geq 0 \qquad (16.6)$$
From (16.5) and (16.6) we have that:
$$A + \sum_{q=1}^{Q} \prod_{i_m \in I} u_{qm} \geq msup - sup(J, D_O)$$
and therefore
$$\sum_{q=1}^{Q} \prod_{i_m \in J} u_{qm} \geq msup - sup(J, D_O)$$
Theorem 16.3. If FD is the set of all frequent itemsets in D, then FD = ∪_{I ∈ Bd+(FD)} ℘(I).
Proof. Due to the definition of the positive border it holds that Bd+(FD) ⊆ FD. Based on the downward closure property of the frequent itemsets, all subsets of a frequent itemset are also frequent. Therefore, all subsets of Bd+(FD) must be frequent, which means that ∪_{I ∈ Bd+(FD)} ℘(I) ⊆ FD. To prove the theorem, we need to show that FD ⊆ ∪_{I ∈ Bd+(FD)} ℘(I) or (equivalently) that if I ∈ FD, then I ∈ ∪_{K ∈ Bd+(FD)} ℘(K) ⇒ ∃J ∈ Bd+(FD) : I ⊆ J. Assume that this condition does not hold. Then, it must hold that I ∈ FD and that ∄J ∈ Bd+(FD) : I ⊆ J. From the last two conditions we conclude that I ∉ Bd+(FD) since, if I ∈ Bd+(FD) then the second condition does not hold for J = I. Therefore, and due to the definition of the positive border, we have that ∃J1 ∈ FD : I ⊂ J1. Thus, |J1| > |I| ⇒ |J1| ≥ |I| + 1 ⇒ |J1| ≥ 1. Moreover, since ∄J ∈ Bd+(FD) : I ⊆ J it means that J1 ∉ Bd+(FD), and since J1 ∈ FD, ∃J2 ∈ FD : J1 ⊂ J2, therefore |J2| > |J1| ⇒ |J2| ≥ |J1| + 1. Finally, since |J1| ≥ 1 we conclude that |J2| ≥ 2. As it can be noticed, using induction it is easy to prove that ∀k ∈ ℕ* : ∃Jk ∈ FD : |Jk| ≥ k. This means that there is no upper bound in the size of frequent itemsets, a conclusion that is obviously wrong. Therefore, the initial assumption that I ∈ FD and ∄J ∈ Bd+(FD) : I ⊆ J does not hold, which means that FD ⊆ ∪_{I ∈ Bd+(FD)} ℘(I).
Theorems 16.2 and 16.3 prove the cover relations that exist between the itemsets of Bd+(F′D) and those of F′D. In the same manner one can prove that the itemsets of Bd−(F′D) are generalized covers for all the itemsets of P\(Bd+(F′D) ∪ {∅}). Therefore, the itemsets of the positive and the negative borders cover all the itemsets in P. As a result, the following Corollary to Theorems 16.2 and 16.3 holds.
Corollary 16.1. (Optimal solution set C) The exact hiding solution, which is identical to the solution of the entire system of the 2^M − 1 inequalities, can be attained based on the itemsets of set
$$C = Bd^+(F'_D) \cup Bd^-(F'_D) \qquad (16.7)$$
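$$
\begin{aligned}
\text{minimize} \quad & \sum_{q \in [1, Q+SM],\, m \in [1, M]} u_{qm} \\
\text{subject to} \quad & \sum_{q=1}^{Q+SM} \prod_{i_m \in I} u_{qm} < thr, \quad \forall I \in Bd^-(F'_D) \\
& \sum_{q=1}^{Q+SM} \prod_{i_m \in I} u_{qm} \geq thr, \quad \forall I \in Bd^+(F'_D) \\
\text{where} \quad & thr = mfreq \cdot (N + Q + SM) - sup(I, D_O)
\end{aligned}
$$
Fig. 16.1: CSP formulation as an optimization process.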
Based on (16.7) the itemsets of the revised borders Bd+(F′D) and Bd−(F′D) can be used to produce the inequalities, which will lead to an exact hiding solution.
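Theorem 16.3 can be checked on the running example: expanding the revised positive border into the union of the powersets of its members should reproduce F′D. The sketch below does this for Bd+(F′D) = {ab, acd}. The empty set is left out of the expansion, a convention chosen here because the chapter lists only non-empty itemsets; the function names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(itemset):
    """All non-empty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    }


def expand_positive_border(border):
    """Union of the (non-empty) powersets of the border itemsets."""
    frequent = set()
    for itemset in border:
        frequent |= nonempty_subsets(itemset)
    return frequent


border = {frozenset("ab"), frozenset("acd")}
expanded = expand_positive_border(border)
print(sorted("".join(sorted(i)) for i in expanded))
# ['a', 'ab', 'ac', 'acd', 'ad', 'b', 'c', 'cd', 'd']
```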
16.3.2 Adjusting the Size of the Extension Equation (16.1) provides the absolute minimum number of transactions that need to be added in DX to allow for the proper hiding of the sensitive itemsets of DO. However, this lower bound can, under certain circumstances, be insufficient to allow for the identification of an exact solution¹, even if one exists [26]. To circumvent this problem, one needs to expand the size Q of DX, as determined by (16.1), by a certain number of transactions. A threshold, called Safety Margin (SM), is incorporated for this purpose. Safety margins can be either predefined or computed dynamically, based on particular properties of database DO and/or other parameters regarding the hiding process. In any case, the target of using a safety margin is to ensure that an adequate number of transactions participate in DX, so that an exact solution (if one exists) will not be lost due to the small size of the produced extension. Since each transaction in DX introduces M new binary variables that need to be tuned when solving the system of inequalities from C, one would ideally want a number of transactions for DX that is large enough to allow for an exact solution, yet as low as possible to avoid unnecessary variables and constraints participating in the hiding process. Supposing that the value of Q is adjusted based on (16.1) and a sufficiently large safety margin is used, the methodology of Section 16.3.4 minimizes the size of DX after the sanitization process to allow for an ideal solution.
¹ On the contrary, the lower bound is always sufficient to allow for an approximate solution of the set of inequalities produced by the itemsets in C.
16.3.3 Formulation and Solution of the CSP The CSP formulation that is adopted by the hybrid hiding algorithm is presented in Fig. 16.1, while Fig. 16.2 demonstrates the CDR approach that is applied to eliminate any products of binary variables in the considered constraints. Assuming that DX is large enough, and that DO allows for an exact solution, the above hiding formulation is capable of identifying it. However, DO may not always allow for an exact solution. Section 16.3.5 presents an approach used in [26] for dealing with suboptimality, while the following section examines the issue of validity in the transactions of DX and presents an algorithm for removing unnecessary transactions.
$$
\begin{aligned}
\text{Replace} \quad & \sum_{T_q \in D_X} \Psi_s \lessgtr thr, \qquad \Psi_s = \prod_{i_m \in T_q} u_{qm} \\
\text{With} \quad & \forall i: \;
\begin{cases}
c_1: \Psi_s \leq u_{q1} \\
c_2: \Psi_s \leq u_{q2} \\
\quad \vdots \\
c_Z: \Psi_s \leq u_{qm}
\end{cases} \\
& \Psi_s \geq u_{q1} + u_{q2} + \ldots + u_{qm} - |Z| + 1 \\
\text{And} \quad & \sum_{s} \Psi_s \lessgtr thr \\
\text{where} \quad & \Psi_s \in \{0, 1\}
\end{aligned}
$$
Fig. 16.2: The Constraints Degree Reduction approach.
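The linear constraints of Fig. 16.2 are meant to force each auxiliary variable Ψs to equal the product of the binary variables it replaces. The brute-force check below is a sketch written for this purpose (the function names are hypothetical): for every 0/1 assignment of k variables, it confirms that exactly one binary value of Ψ satisfies the linear constraints and that this value is the product.

```python
from itertools import product


def cdr_feasible(psi, u):
    """Linear constraints of Fig. 16.2 for a single product term:
    psi <= u_i for every i, and psi >= sum(u) - |Z| + 1, with |Z| = len(u)."""
    return all(psi <= x for x in u) and psi >= sum(u) - len(u) + 1


def linearization_is_exact(k):
    """True if, for every binary assignment of k variables, the only feasible
    binary psi equals the product of the variables."""
    for u in product((0, 1), repeat=k):
        feasible = [psi for psi in (0, 1) if cdr_feasible(psi, u)]
        if feasible != [int(all(u))]:
            return False
    return True


print(all(linearization_is_exact(k) for k in range(1, 7)))  # True
```

The check succeeds because a single zero among the variables forces Ψ ≤ 0, while all ones force Ψ ≥ 1.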
16.3.4 Minimum Extension and Validity of Transactions The incorporation of the safety margin threshold in the hiding process may lead to an unnecessary extension of DX. Fortunately, as discussed in [26], it is possible to identify and remove the extra portion of DX that is not needed, thus minimizing the size of database D to the necessary limit. To achieve that, one needs to rely on the notion of null transactions appearing in database DX. Definition 16.5. (Null transaction) A transaction Tq is defined as null or empty if it does not support any valid itemset in the lattice. Null transactions do not support any pattern from P\{∅}.
Apart from the lack of providing any useful information, null (or empty) transactions are easily identifiable, thus produce a privacy breach to the hybrid hiding methodology. Such transactions may exist in D due to two reasons: (i) an unnecessarily large safety margin SM, or (ii) a large value of Q that is essential for proper hiding. In the first case, these transactions need to be removed from DX , while in the second case the null transactions need to be validated, since Q denotes the lower bound in the number of transactions to ensure proper hiding. The methodology proposed in [26] proceeds as follows. After solving the CSP of Figure 16.1, all the null transactions appearing in DX are identified. Suppose that Qinv such transactions exist. The size of database DX will then equal the value of Q plus the safety margin SM. This means that the valid transactions in DX will be equal to v = Q + SM − Qinv . To ensure minimum size of DX , the hybrid hiding algorithm keeps only k null transactions, such that: k = max(Q − v, 0) ⇒ k = max(Qinv − SM, 0)
(16.8)
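The pruning step governed by (16.8) can be sketched as follows. This is an illustration rather than the implementation of [26]: the function names are hypothetical, and the extension is assumed to hold exactly Q + SM rows. The sample rows are those of Table 16.4, obtained with Q = 1 and SM = 3.

```python
def null_transactions_to_keep(q, sm, q_inv):
    """k of (16.8): with v = Q + SM - Q_inv valid rows, keep max(Q - v, 0) null rows."""
    valid = q + sm - q_inv
    return max(q - valid, 0)


def prune_extension(dx, q, sm):
    """Drop the null rows of D_X that exceed the k allowed by (16.8).
    Assumes len(dx) == q + sm."""
    valid_rows = [row for row in dx if any(row)]
    null_rows = [row for row in dx if not any(row)]
    k = null_transactions_to_keep(q, sm, len(null_rows))
    return valid_rows + null_rows[:k]


# Rows of Table 16.4 (Q = 1, SM = 3); the last transaction is null.
DX = [
    [1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

print(null_transactions_to_keep(q=1, sm=3, q_inv=1))  # 0
print(len(prune_extension(DX, q=1, sm=3)))            # 3
```

Any null rows that remain (k > 0) still have to be validated, which is the task of Algorithm 16.1.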
As a second step, the hiding algorithm needs to ensure that the k empty transactions that remain in DX become valid prior to releasing database D to public. A heuristic is applied for this purpose, which effectively replaces null transactions of DX with transactions supporting itemsets of the revised positive border. After solving the CSP of Figure 16.1, the outcome is examined to identify null transactions. Then, Algorithm 16.1 is applied to replace the null transactions with valid ones, supporting itemsets of Bd+(F′D). Notice, that a round robin heuristic is applied based on the relative frequencies of the itemsets in the revised positive border. Algorithm 16.1 has a computational complexity of O(|Bd+(F′D)| · (|J| + log |Bd+(F′D)|)), where |J| is the number of null transactions in DX and |Bd+(F′D)| is the size of the revised positive border. The heuristic approach for the validation of transactions is preferable in the general case, when SM > 0.
$$
\begin{aligned}
\text{minimize} \quad & \sum_{q \in [1, Q+SM],\, m \in [1, M]} u_{qm} \\
\text{subject to} \quad & \sum_{q=1}^{Q+SM} \prod_{i_m \in I} u_{qm} < thr, \quad \forall I \in Bd^-(F'_D) \\
& \sum_{q=1}^{Q+SM} \prod_{i_m \in I} u_{qm} \geq thr, \quad \forall I \in Bd^+(F'_D) \\
& \forall T_q \in D_X: \; \sum_{i_m \in I} u_{qm} \geq 1 \\
\text{where} \quad & thr = mfreq \cdot (N + Q + SM) - sup(I, D_O)
\end{aligned}
$$
Fig. 16.3: Expansion of the CSP to ensure validity of transactions.
A different approach is preferable when the minimization of distance between databases DO and D is of crucial importance and the requested safety margin is small or zero. In such cases, an exact solution consisting of valid transactions can
Algorithm 16.1 Validation of Transactions in DX.
procedure VALIDATE(DO, DX, Bd+(F′D))
    J ← {Tq ∈ DX | Tq is null}
    min ← 1
    for each (I ∈ Bd+(F′D)) do
        ds(I) ← sup(I, D) − sup(I, DO)
        if (ds(I) < min) then
            min ← ds(I)
        end if
    end for
    for (i = 0; i < |Bd+(F′D)|; i++) do
        ds(I) ← int(ds(I)/min)
    end for
    Sds ← reverse_sort(ds)
    Sds(|Bd+(F′D)|) ← 1
    k ← |J|
    while (k > 0) do
        for (i = 0; i < |Bd+(F′D)|; i++) do
            for (j = Sds(i); j ≥ Sds(i + 1); j−−) do
                if (k > 0) then
                    k ← k − 1
                    Sds(i) ← Sds(i) − 1
                    REPLACE(J(k), T{Sds_i})
                else return
                end if
            end for
        end for
    end while
end procedure
be attained if, for each transaction in database DX , a constraint is added to the CSP to enforce its validity in the final solution. The new CSP formulation is depicted in Figure 16.3. An immediate disadvantage of this approach is the increment in the number of constraints of the produced CSP.
16.3.5 Treatment of Suboptimality in Hiding Solutions Similarly to the inline approach of [23], there are certain problem instances where exact hiding solutions do not exist and one must seek a good approximate solution. To identify such a solution, the difference in importance between (16.3) and (16.4) is crucial. Since the target of a hiding methodology is to secure sensitive knowledge, the holding of (16.4) is of major importance. This inherent difference in the significance of the two inequalities, along with the fact that the system of all inequalities of the form (16.4) always has a feasible solution, enables the relaxation of the problem when needed and the identification of good approximate solutions.
Table 16.3: The intermediate form of database DX.
a    b    c    d    e    f
u11  u12  u13  u14  u15  u16
u21  u22  u23  u24  u25  u26
u31  u32  u33  u34  u35  u36
u41  u42  u43  u44  u45  u46
Algorithm 16.2 Relaxation Procedure in V = Bd+(F′D).
procedure SELECTREMOVE(Constraints CR, V, DO)          ▷ CR_i ↔ V_i
    CR_maxlen ← argmax_i {|Ri|}
    crmsup ← min_{CR_maxlen, i} (sup(Ri ∈ V, DO))
    for each c ∈ CR_maxlen do
        if sup(Ri, DO) = crmsup then
            Remove(c)                                   ▷ remove constraint
        end if
    end for
end procedure
Algorithm 16.2 applies a relaxation process that selectively removes inequalities of type (16.3) up to a point that the resulting CSP is solvable. In this algorithm, CR is the set of constraints imposed by the itemsets of Bd+(F′D) and |Ri| denotes the size of itemset Ri in CR. In a realistic situation, only a small portion of constraints will have to be discarded. Thus, emphasis is given in [26] on constraints involving maximal size and minimum support itemsets, appearing in database DO.
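One pass of the relaxation procedure might be sketched as below. This is an interpretation of Algorithm 16.2, not code from [26]: each constraint is identified with the border itemset that produces it, and the supports passed in are assumed values rather than figures read from Table 16.1. Among the constraints of maximal itemset length, those whose itemset has the minimum support in DO are dropped.

```python
def select_remove(constraints, support):
    """Return the constraints left after one relaxation pass.

    constraints: list of frozensets, the itemsets R_i of the revised positive border.
    support: dict mapping each itemset to its support in D_O.
    """
    max_len = max(len(r) for r in constraints)
    longest = [r for r in constraints if len(r) == max_len]
    crmsup = min(support[r] for r in longest)
    return [
        r for r in constraints
        if not (len(r) == max_len and support[r] == crmsup)
    ]


# Revised positive border of the running example, with assumed supports.
AB, ACD = frozenset("ab"), frozenset("acd")
remaining = select_remove([AB, ACD], {AB: 4, ACD: 3})
print(["".join(sorted(r)) for r in remaining])  # ['ab']
```

In practice the pass would be repeated, re-solving the CSP after each removal, until the relaxed problem becomes feasible.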
16.3.6 Continuing the Running Example Consider database DO of Table 16.1. As part of Section 16.2.3 the original and the revised borders for the hiding of the sensitive itemsets in S = {e, ae, bc}, when mfreq = 0.3, were computed. To proceed, one needs to identify the minimum extension of DO, namely the minimum size of DX that facilitates sensitive knowledge hiding. Using (16.1) for the sensitive itemset with the highest support (here either {e} or {bc}), we have that Q = 4. Excluding, for brevity, the use of a safety margin, DO needs to be extended by 4 transactions. Table 16.3 depicts database DX prior to the adjustment of the various uqm variables involved. To produce the needed constraints, consider all the itemsets appearing in C, where C = Bd+(F′D) ∪ Bd−(F′D) = {e, f, bc, bd, ab, acd}. Based on Figure 16.1 the CSP depicted in Figure 16.4 is constructed, where the validity of all transactions of DX
$$
\begin{aligned}
Bd^+(F'_D): \quad & \{ab\}: \; u_{11}u_{12} + u_{21}u_{22} + u_{31}u_{32} + u_{41}u_{42} \geq 0.2 \\
& \{acd\}: \; u_{11}u_{13}u_{14} + u_{21}u_{23}u_{24} + u_{31}u_{33}u_{34} + u_{41}u_{43}u_{44} \geq 1.2 \\
Bd^-(F'_D): \quad & \{e\}: \; u_{15} + u_{25} + u_{35} + u_{45} < 0.2 \\
& \{f\}: \; u_{16} + u_{26} + u_{36} + u_{46} < 2.2 \\
& \{bc\}: \; u_{12}u_{13} + u_{22}u_{23} + u_{32}u_{33} + u_{42}u_{43} < 0.2 \\
& \{bd\}: \; u_{12}u_{14} + u_{22}u_{24} + u_{32}u_{34} + u_{42}u_{44} < 2.2 \\
T_q \in D_X: \quad & u_{11} + u_{12} + u_{13} + u_{14} + u_{15} + u_{16} \geq 1 \\
& u_{21} + u_{22} + u_{23} + u_{24} + u_{25} + u_{26} \geq 1 \\
& u_{31} + u_{32} + u_{33} + u_{34} + u_{35} + u_{36} \geq 1 \\
& u_{41} + u_{42} + u_{43} + u_{44} + u_{45} + u_{46} \geq 1
\end{aligned}
$$
Fig. 16.4: The constraints in the CSP of the running example.

Table 16.4: Database DX after the solution of the CSP.
a  b  c  d  e  f
1  1  1  1  1  0
0  1  0  0  1  0
1  0  1  1  1  0
0  0  0  0  0  0
has to be ensured. Notice that if a safety margin was used, the last set of constraints (regarding the validity of transactions in DX) would have been excluded from the CSP to allow for the minimization of the size of DX at a later point. After formulating the CSP the CDR approach of Figure 16.2 is applied. Solving the produced CSP using BIP creates the transactions of database DX presented in Table 16.1. The hiding solution of Table 16.1 is exact. Notice that with a minimal extension of 4 transactions, all the sensitive knowledge was hidden, leaving other patterns unaffected in D. This solution is optimal with a distance of dist(DO, D) = 9. To highlight some other aspects of the hybrid hiding methodology, consider the hiding of S = {abc} in the same database DO. In this case, using (16.1) we have that Q = 1. However, as it turns out, a single transaction is insufficient to provide an exact solution to the formulated CSP. Suppose that a safety margin of 3 was used, leading to the construction of 4 transactions in database DX. Due to the use of the safety margin in the CSP formulation, the set of constraints corresponding to the validity of the transactions in database DX are excluded, thus, the formulation of Figure 16.1 is used instead. Table 16.4 shows database DX after the solution of the CSP. As one can notice, the last transaction of DX is null. Based on (16.8), k = max(1 − 3, 0) = 0, so the empty transaction from DX can be safely removed, without
jeopardizing the hiding of the sensitive patterns. The resulting database D_X contains the 3 valid transactions and pertains to an exact hiding solution. As one can notice in the presented examples, the solver adjusts the u_{qm} variables in a way that all the inequalities pertaining to an exact solution are satisfied, while the number of zero entries in D_X is maximized. Consider for instance the database D_X of Table 16.1. To hide itemset {e}, this itemset must not be supported by any transaction in D_X. Thus, all the respective variables become zero. Now, consider the 1–itemset {f} of the revised negative border. Since no other itemset in C supports it, all its entries in D_X can be safely turned into zero. Finally, itemset {ab} has to be supported by at least one transaction in D_X to remain frequent in D. In any case, since the optimization criterion of the CSP requires that the minimum number of u_{qm} variables be set to ‘1’, the solver ensures that the sanitized database D lies as close as possible to D_O, while providing an exact solution.
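The behaviour described above can be checked on the constraints of Fig. 16.4 with a small exhaustive search. The sketch below is only an illustration: it replaces the BIP solver with brute-force enumeration, it copies the support bounds as printed in Fig. 16.4, and the names `AT_LEAST`, `LESS_THAN`, `support`, `satisfies` and `solve` are hypothetical helpers rather than part of the hybrid algorithm.

```python
from itertools import combinations, combinations_with_replacement

ITEMS = "abcdef"

# Support bounds inside D_X, copied from Fig. 16.4.
AT_LEAST = {frozenset("ab"): 0.2, frozenset("acd"): 1.2}
LESS_THAN = {frozenset("e"): 0.2, frozenset("f"): 2.2,
             frozenset("bc"): 0.2, frozenset("bd"): 2.2}


def support(itemset, rows):
    """Number of transactions in `rows` that contain every item of `itemset`."""
    return sum(1 for row in rows if itemset <= row)


def satisfies(rows):
    """True when all border constraints of Fig. 16.4 hold for `rows`."""
    return (all(support(i, rows) >= b for i, b in AT_LEAST.items())
            and all(support(i, rows) < b for i, b in LESS_THAN.items()))


def solve(q=4):
    """Exhaustively find q non-empty transactions with the fewest ones."""
    # Non-empty subsets only, so the validity constraints hold by construction.
    candidates = [frozenset(c) for r in range(1, len(ITEMS) + 1)
                  for c in combinations(ITEMS, r)]
    # Supports only grow when rows are added, so a row that breaks an upper
    # bound on its own can never be part of a feasible extension.
    candidates = [t for t in candidates
                  if all(support(i, [t]) < b for i, b in LESS_THAN.items())]
    # Row order is irrelevant, so multisets of candidate rows suffice.
    feasible = (rows for rows in combinations_with_replacement(candidates, q)
                if satisfies(rows))
    return min(feasible, key=lambda rows: sum(len(t) for t in rows))


best = solve()
print(sum(len(t) for t in best))  # 9
```

The minimum of nine entries set to one agrees with the distance dist(D_O, D) = 9 reported for the exact solution: one transaction must support {ab}, two must support {acd}, and the fourth only needs a single item to be valid.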
16.4 A Partitioning Approach to Improve Scalability

To achieve an exact hiding solution, the hybrid algorithm has to regulate the values of all items in every transaction of D_X. The number of variables involved may, under certain circumstances, be very large, leading to a large increase in the runtime of the BIP algorithm. Fortunately, the scalability of the hiding process can be improved without relaxing the requirement for an exact hiding solution. To achieve this, the hiding process is first decomposed into two parts of approximately equal size. Then, each part is solved separately and, last, the individual solutions are coupled together.
16.4.1 Partitioning Methodology

Let I = I_P ∪ I_{\bar{P}} be a decomposition of the universe of all M items of D_O into two disjoint subsets, such that I_P contains all items that appear in the sensitive itemsets along with possibly other items from I, and I_{\bar{P}} contains the rest of the items in I. First, the revised borders for database D are identified, as in Section 16.2.3, and set C is formulated (as in (16.7)). Each itemset in C is then uniquely assigned to one of the sets I_P, I_{\bar{P}}, based on the items it supports. The itemsets assigned to I_P must not support any items of I_{\bar{P}}, whereas itemsets assigned to I_{\bar{P}} may support items of I_P. The ideal scenario is a partitioning of the items into I_P, I_{\bar{P}} such that (i) the two partitions are of approximately equal size, and (ii) as few as possible of the itemsets assigned to I_{\bar{P}} support items from I_P. After formulating the two partitions and assigning the itemsets from C, relation (16.1) and the safety margin are applied to decide on the size of D_X. The goal is to properly construct D_X based on the principles presented earlier. Starting from set I_P, consider each transaction in D to be translated in this subspace; I_P regulates the assignment of the u_{qm} variables for all transactions T_q ∈ D_X and all items m ∈
Table 16.5: Database D_X as the union of D_1 and D_2 produced by the partitioning approach (the rows of D_1 are listed first, followed by the rows of D_2).

a  b  c  d  e  f
0  0  0  0  0  0
0  0  0  0  0  0
0  0  0  0  0  0
1  1  0  0  0  0
1  0  1  1  0  0
1  0  1  1  0  0

I_P. Given the proper itemsets from C that support items in this subspace, one can adjust these binary variables in a way that the revised border is preserved for the corresponding items. In the same manner, the borderline in the subspace of I_{\bar{P}} is preserved. In particular, for the itemsets in I_{\bar{P}} that also support items from I_P, by considering the previously assigned values of the u_{qm} variables in the subspace of I_P one can proceed to properly assign the u_{qm} variables in the subspace of I_{\bar{P}}, so that the newly formulated BIP has a solution. If the current transactions in D_X are insufficient to satisfy the new constraints imposed by the itemsets in I_{\bar{P}}, database D_X has to be appropriately expanded. However, this expansion can be undone at a later point by removing the null transactions (as in Section 16.3.4) once the overall hiding solution has been obtained. Since the formulated CSP sets the minimum possible number of u_{qm} variables to ‘1’, the borderline will be adjusted through this process without causing any harm to the exact solution, apart from potentially using a larger part of the extension. Due to the exponential execution time of the integer programming solver, the benefit of this decomposition is substantial, since it allows for reasonably low execution times even when performing knowledge hiding in very large databases.
16.4.2 An Example

In what follows, we present an example borrowed from [26], which demonstrates the operation of the partitioning approach in the database of Table 16.1. Assume that for this database we have I_P = {a, b, c, e} and I_{\bar{P}} = I \ I_P. The itemsets of C = {e, f, ab, bc, bd, acd} are then partitioned as follows: C = {e, ab, bc} ∪ {f, bd, acd}. Solving the first CSP creates database D_1 of Table 16.5, which allows the satisfaction of the constraint regarding itemset {acd} participating in the second CSP. Thus, after the solution of the first CSP, database D_X has to be extended. Suppose that a safety margin of SM = 2 is used. Then, the second CSP leads to database D_2, as shown in Table
16.5. Overall, the produced extension contains three null transactions. Using (16.8), k = max(3 − 2, 0) = 1, which means that only one null transaction must be kept. Thus, after removing the two null transactions of D_1, Algorithm 16.1 is employed to validate the third transaction. As one can notice, database D_X is exact but not ideal, since there exists no 1–itemset (i.e., item) in Bd^+(F'_D).
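The assignment of the itemsets of C to the two partitions in this example can be reproduced with a few lines of code. This is a minimal sketch under the assumption that an itemset goes to I_P exactly when all of its items lie in I_P; the function names `partition_itemsets` and `crossing` are hypothetical and do not come from [26].

```python
def partition_itemsets(itemsets, items_p):
    """Split C into the itemsets lying entirely inside I_P and all the others."""
    items_p = set(items_p)
    inside = [s for s in itemsets if set(s) <= items_p]
    rest = [s for s in itemsets if not set(s) <= items_p]
    return inside, rest


def crossing(itemsets_p_bar, items_p):
    """Itemsets assigned to the complement that still support items of I_P."""
    return [s for s in itemsets_p_bar if set(s) & set(items_p)]


C = ["e", "f", "ab", "bc", "bd", "acd"]
c_p, c_p_bar = partition_itemsets(C, "abce")
print(c_p, c_p_bar)               # ['e', 'ab', 'bc'] ['f', 'bd', 'acd']
print(crossing(c_p_bar, "abce"))  # ['bd', 'acd']
```

The second function counts the itemsets that couple the two subproblems, which is the quantity that criterion (ii) of Section 16.4.1 tries to keep small.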
16.5 Experimental Evaluation

In this section we provide the results of the experiments conducted in [26] that test the hybrid algorithm on real datasets under different parameters, such as the minimum support threshold and the number and size of the sensitive itemsets to hide. All these datasets are available through FIMI [30] and their properties are summarized in Table 16.6.

Table 16.6: The characteristics of the four datasets.

Dataset          N (transactions)   M (items)   Avg tlen (average transaction length)
BMS–WebView–1    59,602             497         2.5
BMS–WebView–2    77,512             3,340       5.0
Mushroom         8,124              119         23.0
Chess            3,196              76          37.0

The thresholds of minimum support were appropriately set to ensure an adequate number of frequent itemsets. Several experiments were conducted for hiding up to 20 sensitive 10–itemsets. In all conducted experiments, the CSP was formulated based on Figure 16.3 and a safety margin of 10 transactions was used. The sensitive itemsets for sanitization were randomly selected among the frequent ones. The hybrid algorithm was implemented in Perl and C, and the experiments were conducted on a PC running 64-bit Linux on an Intel Pentium D 3.2 GHz processor equipped with 4 GB of RAM. All integer programs were solved using ILOG CPLEX 9.0 [36].

Let the notation a × b denote the hiding of a itemsets of length b. Figure 16.5 provides a comparison of the hiding solutions identified by the hybrid algorithm against three state-of-the-art approaches: the Border Based Approach (BBA) of Sun & Yu [66, 67], the Max–Min 2 algorithm of Moustakides & Verykios [50] and the inline algorithm of Gkoulalas–Divanis & Verykios [23], in terms of the side-effects introduced by the hiding process. As one can notice, the hybrid algorithm consistently outperforms the three other schemes, with the inline approach being the second best. In most tested cases, the heuristic algorithms failed to identify a solution bearing minimum side-effects, while the inline approach demonstrated on several occasions that an exact solution could not be attained without extending the dataset.

Fig. 16.5: Quality of the produced hiding solutions. [Three panels: Mushroom dataset, Chess dataset, BMS–WebView datasets (BMS–WebView–1 and BMS–WebView–2); horizontal axis: hiding scenarios; vertical axis: side-effects; series: BBA, MaxMin2, Inline, Hybrid.]

Figure 16.6 presents a comparison of the four algorithms in terms of runtime cost. As expected, the scalability of the border-based approaches is better than that of the exact approaches, since the runtime of the latter is primarily determined by the time required by the BIP solver to solve the CSP. However, as shown in Figure 16.5, the runtime cost of solving the CSPs of the exact methodologies is worthwhile, since the quality of the attained solutions is bound to be high.

Fig. 16.6: Scalability of the hybrid hiding algorithm. [Three panels: Mushroom dataset, Chess dataset, BMS–WebView datasets (BMS–WebView–1 and BMS–WebView–2); horizontal axis: hiding scenarios; vertical axis: runtime (in seconds); series: BBA, MaxMin2, Inline, Hybrid.]
Fig. 16.7: Size of extension D_X. [Chess and Mushroom datasets; horizontal axis: size of D_X; vertical axis: hiding scenarios.]

Fig. 16.8: Number of constraints in the CSP. [Chess and Mushroom datasets; horizontal axis: constraints; vertical axis: hiding scenarios.]
Figure 16.7 presents the dependence that exists between the size of the extension and the sensitive itemsets. In this figure the hiding scenarios are separated into groups of four. In each group, the hiding of more itemsets of the same length includes all the sensitive itemsets that were selected for the previous hiding scenario. For instance, five of the ten 10–itemsets of the 10 × 10 hiding scenario are the ones hidden in the 5 × 10 hiding scenario. In the chess dataset, the a × 7 itemsets are selected to have lower supports than their counterparts participating in the a × 10 hiding scenarios. On the other hand, in the mushroom dataset, the group of a × 10 itemsets reflects itemsets that lie near the border, whereas the a × 5 itemsets are highly supported in the dataset.

Figure 16.8 follows the layout of Figure 16.7 and presents the relation between the number of constraints in the CSP and the number and size of the sensitive itemsets. As shown, the hiding of more itemsets of the same size leads to the production of more inequalities for the CSP, since in the typical case the size of the negative border is augmented. A similar relation exists between the minimum support threshold of the dataset and the sensitive itemsets to be hidden. The lower the minimum support threshold of the dataset, the more itemsets become frequent. Supposing that one wishes to hide the exact same itemsets, if the minimum support threshold is reduced then in the typical case more transactions are necessary in D_X to ensure that the revised positive border will be preserved in D. Thus, more inequalities have to be included in the CSP to accomplish this goal. On the other hand, an increase in the minimum support threshold typically leads to smaller problems and thus to better performance of the hiding algorithm.

Fig. 16.9: Distance of the three hiding schemes. [Mushroom dataset; horizontal axis: hiding scenarios; vertical axis: distance; series: BBA, MaxMin2, Inline.]

Figure 16.9 presents the distance (i.e., the number of item modifications) between the original and the sanitized database that is required by each algorithm to facilitate knowledge hiding. Since the two border-based approaches and the inline algorithm operate in a similar fashion (i.e., by selecting transactions of the original database
and excluding some items), it makes sense to compare them in terms of the produced distances. From the comparison it is evident that the inline approach minimizes the number of item modifications, a result that can be attributed to the optimization criterion of the generated CSPs. In contrast, the hybrid hiding algorithm does not alter the original dataset but instead uses a database extension to (i) leave the sensitive itemsets unsupported so that they are hidden in D, and (ii) adequately support the itemsets of the revised positive border so that they remain frequent in D. For this reason, the item modifications (0s → 1s) that are introduced by the hybrid hiding algorithm in D_X should not be attributed to the hiding task of the algorithm but rather to its power to preserve the revised positive border and thus eliminate the side-effects. This important difference between the hybrid algorithm and the other three approaches makes it hard to compare them in terms of item modifications. However, due to the common way in which both the inline and the hybrid approaches model the CSPs, the property of minimum distortion of the original database is bound to hold for the hybrid hiding algorithm.

Fig. 16.10: Performance of the partitioning approach. [BMS–WebView–1 dataset; horizontal axis: hiding scenarios; vertical axis: runtime (in seconds); series: Hybrid, Partitioning.]

Figure 16.10 presents the performance gain that is accomplished when using the partitioning approach. As one can notice, splitting the CSP into two parts significantly improves the performance of the hybrid hiding algorithm.

An interesting insight from the experiments is the fact that the hybrid approach, when compared to the inline algorithm [23] and the border-based approaches of [50, 66], can better preserve the quality of the border and produce superior solutions. Indeed, the hybrid approach introduces the fewest side-effects among the four tested algorithms. On the other hand, the hybrid approach is worse in terms of scalability than its competitors, due to the large number of u_{qm} variables and the associated constraints of the produced CSP. Moreover, depending on the properties of the dataset used, there are cases where a substantial number of transactions has to be added to the original dataset to facilitate knowledge hiding. This situation
usually occurs when hiding itemsets with a very high support and/or when using a very low minimum support threshold. In most of these cases, approaches like the ones presented in Section 16.4 become essential to allow for a solution of the hiding problem. It is important, however, to mention that as long as the given hiding problem decomposes to a CSP that remains computationally manageable and a sufficiently large safety margin is used, the hybrid approach is bound to identify a solution that bears the fewest side-effects to the original dataset. Thus, the power of the hybrid hiding methodology is that it guarantees the fewest side-effects for a broader set of hiding problems than the inline algorithm of [23].
Chapter 17
Parallelization Framework for Exact Hiding
In this chapter, we elaborate on a novel framework, introduced in [25], that is suitable for the decomposition and parallelization of the exact hiding algorithms covered in Chapters 14, 15 and 16. The framework operates in three phases to decompose the CSPs that are produced by the exact hiding algorithms into a set of smaller CSPs that can be solved in parallel. In the first phase, the original CSP is structurally decomposed into a set of independent CSPs, each of which is assigned to a different processor. In the second phase, for each independent CSP a decision is made on whether it should be further decomposed into a set of dependent CSPs, through a function that questions the gain of any further decomposition. In the third and last phase, the solutions of the various CSPs, produced as part of the decomposition process, are appropriately combined to provide the solution of the original CSP (i.e., the one prior to the decomposition). The generality of the framework allows it to efficiently handle any CSP that consists of linear constraints involving binary variables and whose objective is to maximize (or minimize) the summation of these variables. Together with existing approaches for the parallel mining of association rules [6, 34, 80], the framework of [25] can be applied to parallelize the most time-consuming steps of the exact hiding algorithms.

The remainder of this chapter is structured as follows. In Section 17.1 we present the properties of the parallelization framework. Then, Section 17.2 contains the experimental evaluation and demonstrates the benefit of using this framework towards speeding up the hiding process.
17.1 The Parallelization Framework

Performing knowledge hiding by using the inline [23], the two–phase iterative [27] or the hybrid approach [26] allows for the identification of exact hiding solutions, whenever such solutions exist. However, the cost of identifying an exact hiding solution is usually high due to the time required for solving the involved CSP.
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_17, © Springer Science+Business Media, LLC 2010
In what follows, we present a framework, proposed in [25], that can be applied as part of the sanitization process of exact hiding algorithms to reduce their computational cost. The framework operates in three phases: (i) the structural decomposition phase, (ii) the decomposition of large individual components phase, and (iii) the parallel solving phase for the produced CSPs. Each phase is discussed in detail below.
17.1.1 Structural Decomposition of the CSP

The number of constraints in a CSP can be very large, depending on the properties of the database, the minimum support threshold that was used for the mining of the frequent itemsets, as well as the number and length of the sensitive itemsets that need to be hidden. Moreover, the fact that various initial constraints may incorporate products of u_{nm} variables, and thus need to be replaced by numerous linear inequalities (using the CDR approach), makes the whole BIP problem harder to solve. There is, however, a useful property of the CSPs that can be exploited to improve their solution time: decomposition. Based on the divide and conquer paradigm, a decomposition approach allows the fragmentation of a large problem into numerous smaller ones, the independent solution of these new subproblems, and the subsequent aggregation of the partial solutions to attain the same overall solution as the one obtained by solving the entire problem.

The property of the CSPs that allows considering such a strategy lies in the optimization criterion that is used. Indeed, one can easily notice that the criterion of maximizing (or, equivalently, minimizing) the summation of the binary u_{nm} variables is satisfied when as many u_{nm} variables as possible are set to one (equivalently, to zero). This can be established independently for each subproblem, provided that the constraints that participate in the CSP allow for an appropriate decomposition. The approach that is followed in [25] for the initial decomposition of the CSP is similar to the decomposition structure identification algorithm presented in [47], although applied at the level of “constraints” rather than “transactions”. As demonstrated in Figure 17.1, the output of structural decomposition, when applied on the original CSP, is a set of smaller CSPs that can be solved independently. An example will better demonstrate how this process works. Consider database D_O presented in Table 17.1.
Fig. 17.1: Decomposing large CSPs to smaller ones.

Table 17.1: The original database D_O.

A  B  C  D
1  1  0  0
1  1  0  0
1  0  0  0
1  1  0  0
0  1  0  1
1  0  1  1
0  0  1  1
0  0  1  1
1  0  0  0
0  0  0  1

By performing frequent itemset mining in D_O using frequency threshold mfreq = 0.3, we compute the following set of frequent itemsets: F_{D_O} = {A, B, C, D, AB, CD}. Suppose that one wishes to hide the sensitive itemsets in S = {B, CD}, using for instance the inline approach¹. Then, it holds that:

S_max = {B, AB, CD}              (17.1)
Bd^+(F'_D) = {A, C, D}           (17.2)
V = {C, D} ⊂ Bd^+(F'_D)          (17.3)

¹ We need to mention that it is of no importance which methodology will be used to produce the CSP, apart from the obvious fact that some methodologies may produce CSPs that are better decomposable than those constructed by other approaches. However, the structure of the CSP also depends on the problem instance and thus it is difficult to know in advance which algorithm is bound to produce a better decomposable CSP.

The intermediate form of this database is shown in Table 17.2 and the CSP formulation based on the inline algorithm is presented in Figure 17.2.

Table 17.2: The intermediate form of database D_O.

A  B       C       D
1  u_{12}  0       0
1  u_{22}  0       0
1  0       0       0
1  u_{42}  0       0
0  u_{52}  0       1
1  0       u_{63}  u_{64}
0  0       u_{73}  u_{74}
0  0       u_{83}  u_{84}
1  0       0       0
0  0       0       1

maximize ( u_{12} + u_{22} + u_{42} + u_{52} + u_{63} + u_{64} + u_{73} + u_{74} + u_{83} + u_{84} )
subject to
  u_{12} + u_{22} + u_{42} + u_{52} < 3
  u_{63}u_{64} + u_{73}u_{74} + u_{83}u_{84} < 3
  u_{63} + u_{73} + u_{83} ≥ 3
  u_{64} + u_{74} + u_{84} ≥ 1

Fig. 17.2: CSP formulation for the presented example.

Table 17.3 highlights the various constraints c_r along with the variables that they control. As one can observe, the various constraints of the CSP can be clustered into disjoint sets based on the variables that they involve. In this example, there are two such clusters of constraints, namely M_1 = {c_1} and M_2 = {c_2, c_3, c_4}. Notice that none of the variables in each cluster of constraints is contained in any other cluster. Thus, instead of solving the entire problem of Figure 17.2, one can equivalently solve the two subproblems that are presented in Figure 17.3, yielding, when combined, an optimal solution of the initial CSP, e.g., u_{12} = u_{22} = u_{63} = u_{64} = u_{73} = u_{74} = u_{83} = 1 and u_{42} = u_{52} = u_{84} = 0 (the strict inequalities of c_1 and c_2 allow at most two of the four variables of c_1, and at most two of u_{64}, u_{74}, u_{84}, to be set to one).

Table 17.3: Constraints matrix for the CSP.

      u_{12}  u_{22}  u_{42}  u_{52}  u_{63}  u_{64}  u_{73}  u_{74}  u_{83}  u_{84}
c_1   X       X       X       X
c_2                                   Y       Y       Y       Y       Y       Y
c_3                                   Y               Y               Y
c_4                                           Y               Y               Y

maximize ( u_{12} + u_{22} + u_{42} + u_{52} )
subject to
  u_{12} + u_{22} + u_{42} + u_{52} < 3
where u_{12}, u_{22}, u_{42}, u_{52} ∈ {0, 1}

and

maximize ( u_{63} + u_{64} + u_{73} + u_{74} + u_{83} + u_{84} )
subject to
  u_{63}u_{64} + u_{73}u_{74} + u_{83}u_{84} < 3
  u_{63} + u_{73} + u_{83} ≥ 3
  u_{64} + u_{74} + u_{84} ≥ 1
where u_{63}, u_{64}, u_{73}, u_{74}, u_{83}, u_{84} ∈ {0, 1}

Fig. 17.3: Equivalent CSPs for the provided example.
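The structural decomposition of this example can be reproduced with a short script. The sketch below is an illustration only: it clusters the constraints of Figure 17.2 by shared variables and replaces the BIP solver with brute-force enumeration, which is feasible here because the CSP has just ten binary variables. The names `CONSTRAINTS`, `clusters` and `maximize` are hypothetical and are not taken from [25].

```python
from itertools import product

# Each constraint: (name, variables it involves, predicate over an assignment).
CONSTRAINTS = [
    ("c1", ("u12", "u22", "u42", "u52"),
     lambda v: v["u12"] + v["u22"] + v["u42"] + v["u52"] < 3),
    ("c2", ("u63", "u64", "u73", "u74", "u83", "u84"),
     lambda v: v["u63"] * v["u64"] + v["u73"] * v["u74"] + v["u83"] * v["u84"] < 3),
    ("c3", ("u63", "u73", "u83"),
     lambda v: v["u63"] + v["u73"] + v["u83"] >= 3),
    ("c4", ("u64", "u74", "u84"),
     lambda v: v["u64"] + v["u74"] + v["u84"] >= 1),
]


def clusters(constraints):
    """Group constraints that are linked through shared variables."""
    groups = []  # list of (set of variables, list of constraints)
    for c in constraints:
        merged_vars, merged_cons = set(c[1]), [c]
        remaining = []
        for g_vars, g_cons in groups:
            if g_vars & merged_vars:      # shares a variable: merge the group
                merged_vars |= g_vars
                merged_cons = g_cons + merged_cons
            else:
                remaining.append((g_vars, g_cons))
        groups = remaining + [(merged_vars, merged_cons)]
    return groups


def maximize(variables, constraints):
    """Brute-force the feasible 0/1 assignment with the largest number of ones."""
    best = None
    for values in product((0, 1), repeat=len(variables)):
        v = dict(zip(variables, values))
        if all(check(v) for _, _, check in constraints):
            if best is None or sum(values) > sum(best.values()):
                best = v
    return best


groups = clusters(CONSTRAINTS)
print([[name for name, _, _ in cons] for _, cons in groups])
# [['c1'], ['c2', 'c3', 'c4']]

all_vars = sorted({x for _, vs, _ in CONSTRAINTS for x in vs})
whole = maximize(all_vars, CONSTRAINTS)
parts = [maximize(sorted(g_vars), cons) for g_vars, cons in groups]
print(sum(whole.values()), [sum(p.values()) for p in parts])  # 7 [2, 5]
```

The objective of the full CSP equals the sum of the objectives of the two independent subproblems, which is the property that the structural decomposition relies on.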
17.1.2 Decomposition of Large Independent Components

The structural decomposition of the original CSP allows one to break the original large problem into numerous smaller subproblems that can be solved independently, thus significantly reducing the runtime that is necessary to attain the overall hiding solution. However, both (i) the number of subproblems and (ii) the size of each subproblem depend entirely on the underlying CSP and on the structure of the constraints matrix. As a result, there exist problem instances that are not decomposable and other instances that exhibit a notable imbalance in the size of the produced components. Thus, in what follows, we discuss two methodologies from [25] that allow us to decompose large individual components that are non-separable through the structural decomposition approach. In both methodologies, the goal is to minimize the number of variables that are shared among the newly produced components, which are now dependent.
17.1.2.1 Decomposition Using Articulation Points

To decompose an independent component, one needs to identify the smallest number of u_{nm} variables which, when discarded from the various inequalities of this CSP, produce a CSP that is structurally decomposable. The following strategy can be employed to identify the u_{nm} variables that will be discarded. First, an undirected graph G(V, E) is generated, in which each vertex v ∈ V corresponds to a u_{nm} variable and each edge e ∈ E connects vertices that participate in the same constraint. Graph G can be constructed in linear time and provides an easy way to model the network of constraints and involved variables in the input CSP. Since the input CSP is not structurally decomposable, graph G is necessarily connected.
Fig. 17.4: An example of decomposition using articulation points.
After creating the constraints graph G, [25] identifies all its articulation points (also known as cut-vertices). The rationale behind this process is that the removal of a cut-vertex will disconnect graph G, and the best cut-vertex u_{nm} will be the one that leads to the largest number of connected components in G. Each of these components will then itself constitute a new subproblem to be solved independently from the others. As presented in [25], the computation of the best articulation point can be achieved in linear time O(V + E) through a DFS traversal of the graph. After identifying the best articulation point, the next step is to remove the corresponding u_{nm} variable from graph G. Then, each component of the resulting graph corresponds to a new subproblem (i.e., a new CSP) that can be derived in linear time and be solved independently. To provide the same solution as the original CSP, the solutions of the various created subproblems need to be cross-examined. This process is discussed in Section 17.1.3. A final case to be addressed is the one in which no single cut-vertex can be found in the graph. For this case, an empirical approach is proposed in [25], based on the premise that nodes having high degrees in graph G are more likely than others to correspond to cut-vertices. The runtime of the empirical approach is also linear in the number of vertices and edges of graph G. Figure 17.4 demonstrates an example of decomposition using articulation points. In this graph, we denote as “cut-vertex” the vertex which, when removed, leads to a disconnected graph having the maximum number of connected components (here 3).
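A minimal sketch of this step is given below, assuming the standard DFS-based (Tarjan-style) computation of articulation points; it is not the implementation of [25], and it omits the degree-based fallback for graphs without a cut-vertex. The function names and the small example CSP (variables `h`, `a`, ..., `e`) are hypothetical. For a connected graph, removing a vertex v leaves one component per DFS child whose subtree cannot reach above v, plus one more component for the part containing the parent of v when v is not the DFS root.

```python
from itertools import combinations


def constraint_graph(constraints):
    """Adjacency sets: two variables are linked when they share a constraint."""
    adj = {}
    for variables in constraints:
        for v in variables:
            adj.setdefault(v, set())
        for a, b in combinations(variables, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def best_cut_vertex(adj):
    """Return (vertex, parts) for the articulation point whose removal yields
    the most connected components, or (None, 1) if the graph has none.
    Assumes that the graph is connected."""
    disc, low, parts = {}, {}, {}
    counter = [0]

    def dfs(v, parent):
        disc[v] = low[v] = counter[0]
        counter[0] += 1
        separated = 0  # DFS subtrees that are attached to the rest only via v
        for w in adj[v]:
            if w not in disc:
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] >= disc[v]:
                    separated += 1
            elif w != parent:
                low[v] = min(low[v], disc[w])
        # A non-root vertex also leaves behind the part containing its parent.
        parts[v] = separated + (0 if parent is None else 1)

    dfs(next(iter(adj)), None)
    vertex = max(parts, key=parts.get)
    return (vertex, parts[vertex]) if parts[vertex] > 1 else (None, 1)


# Hypothetical CSP: every constraint is given by the variables it involves.
adj = constraint_graph([("h", "a", "b"), ("h", "c", "d"), ("h", "e")])
print(best_cut_vertex(adj))  # ('h', 3)
```

The recursive DFS is adequate for small constraint graphs; for graphs with many thousands of variables an iterative traversal would be needed to stay within Python's recursion limit.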
Fig. 17.5: Three-way decomposition using weighted graph partitioning.
17.1.2.2 Decomposition Using Weighted Graph Partitioning

A serious disadvantage of decomposition using articulation points is that it provides very limited control over (i) the number of components into which the CSP will eventually split, and (ii) the size of each of these components. This may lead to low CPU utilization in a parallel solving environment. For this reason, [25] proposes an alternative decomposition strategy that can be employed to break the original CSP into as many CSPs as necessary, based on the number of available processors. The proposed decomposition strategy relies on modern graph partitioning algorithms to provide a high-quality decomposition. The decomposition strategy operates as follows. A constraints graph is generated by assigning each u_{nm} variable of the original CSP to a vertex and each constraint c to a number of edges e_c forming a clique in the graph (thereby denoting the dependence of the u_{nm} variables involved). This graph, henceforth denoted as G^W, is a weighted alternative of graph G and is associated with two types of weights: one for each vertex u ∈ V^W and one for each edge e ∈ E^W. The weight of a vertex corresponds to the number of constraints in which it participates in the CSP. The weight of an edge, on the other hand, reflects the number of constraints in the CSP in which the two vertices that it connects appear together. Using a weighted graph partitioning
algorithm, such as [39], one can decompose the weighted graph into as many parts as the number of available processors that can be used to solve them concurrently. The rationale behind the weighting scheme is to ensure that the connectivity between vertices belonging to different parts is minimal. Figure 17.5 demonstrates a three-way decomposition of the original CSP, using weighted graph partitioning.
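The construction of the weighted graph G^W can be sketched as follows. The example reuses the second subproblem of Figure 17.3 and only builds the two weight maps; the partitioning step itself is left to an external graph partitioner and is not shown. The function name `weighted_constraint_graph` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations


def weighted_constraint_graph(constraints):
    """Return (vertex_weights, edge_weights) for a list of variable tuples."""
    vertex_w, edge_w = Counter(), Counter()
    for variables in constraints:
        # Vertex weight: number of constraints in which the variable appears.
        vertex_w.update(variables)
        # Every constraint contributes a clique over its variables; the edge
        # weight counts the constraints in which both endpoints appear.
        edge_w.update(frozenset(p) for p in combinations(sorted(variables), 2))
    return vertex_w, edge_w


CONSTRAINTS = [
    ("u63", "u64", "u73", "u74", "u83", "u84"),  # c2
    ("u63", "u73", "u83"),                       # c3
    ("u64", "u74", "u84"),                       # c4
]
vertex_w, edge_w = weighted_constraint_graph(CONSTRAINTS)
print(vertex_w["u63"])                    # 2
print(edge_w[frozenset(("u63", "u73"))])  # 2
print(edge_w[frozenset(("u63", "u64"))])  # 1
```

Heavier edges mark pairs of variables that co-occur in several constraints, so a partitioner that minimizes the total weight of the cut edges tends to keep such variables in the same part.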
17.1.3 Parallel Solving of the Produced CSPs

Breaking a dependent CSP into a number of components (using one of the strategies mentioned earlier) is a procedure that should be carried out only if the size of the CSP is large enough to be worth the cost of the decomposition. For this reason, it is necessary to define a function F_S to calculate the size of a CSP and a threshold above which the CSP should be decomposed. The authors of [25] chose function F_S to be a weighted sum of the number of u_{nm} variables involved in the CSP and the number of associated constraints C. The weights are problem-dependent. Thus,

F_S = w_1 · |u_{nm}| + w_2 · |C|          (17.4)
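As a small illustration of (17.4), the decision on whether to decompose a CSP further can be written as below. The weights and the threshold are problem-dependent, so the values used in the example are arbitrary assumptions, and the function names are hypothetical.

```python
def csp_size(num_variables, num_constraints, w1=1.0, w2=1.0):
    """F_S of (17.4): weighted sum of the variable and constraint counts."""
    return w1 * num_variables + w2 * num_constraints


def should_decompose(num_variables, num_constraints, threshold, w1=1.0, w2=1.0):
    """Decompose only when the size of the CSP exceeds the given threshold."""
    return csp_size(num_variables, num_constraints, w1, w2) > threshold


# Second subproblem of Fig. 17.3: 6 variables and 3 constraints.
print(csp_size(6, 3))                       # 9.0
print(should_decompose(6, 3, threshold=8))  # True
```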
Having defined function FS , the problem solving strategy proceeds as follows. First, structural decomposition is applied on the original CSP and each produced component is distributed to an available processor. These components can be solved independently of each other. The final solution (i.e., the value of the objective for the original CSP) will equal the sum of the values of the individual objectives; thus, the master node that holds the original CSP should wait to accumulate the solutions returned by the servicing nodes. Each servicing node in the system is allowed to choose other nodes and assign them some computation. Whenever a node assigns job to other nodes, it waits to receive the results of the processing and then enforces the needed function to create the overall outcome (as if it had solved the entire job itself). After receiving an independent component, each processor applies the function FS to its assigned CSP and decides whether it is essential to proceed with further decomposition. If this is the case, then it proceeds to its decomposition using one of the two schemes presented earlier (i.e., decomposition using articulation points or weighted graph partitioning) and assigns the newly created CSPs, each to an available processor. A mechanism that keeps track of the jobs distribution to processors and their status (i.e., idle vs. occupied) is applied to allow for the best possible CPUs utilization. The same procedure continues until all constructed CSPs are below the user-defined size threshold and therefore do not need to be further decomposed. At any given time, the processors contain a number of independent components and a number of mutually dependent components. In the case of the independent components, as mentioned earlier, the value of the objective function for the overall CSP is attained by summing up the values of the individual objectives. However, in the case of dependent CSPs the situation is more complex. To handle such circum-
17.1 The Parallelization Framework
127
Fig. 17.6: An example of parallel solving after decomposition.
stances, let border unm be a variable that appears in two or more dependent CSPs. Such a variable was either the best articulation point selected by the first strategy, or a vertex at the boundary of two different components identified by the graph partitioning algorithm. Border variables must be checked for all the values they can attain, so that the result is exactly the solution of the original CSP. Assuming p such variables, there exist 2^p possible value assignments that need to be checked. For each assignment, one solves the corresponding CSPs, whose objective functions and constraints contain, apart from the unm variables of the non-border cases, the values that the currently tested assignment gives to the p border variables. After solving the CSPs for an assignment, its objective value is computed by summing up the resulting objective values. The final solution then corresponds to the maximum value among the summations produced by the possible assignments.
To make matters clearer, assume that at some point in time a processor receives a CSP that needs to be solved and finds that its size is greater than the minimum size threshold; thus the CSP has to be decomposed. Suppose that applying the first strategy yields a decomposition that involves two border variables u1 and u2 and leads to two components, CA and CB. Each component is then assigned to a processor, and all possible assignments of the border variables have to be tested. Thus, there exist four different problem instances, namely C00, C01, C10, C11, where Cxy denotes the instance in which u1 = x and u2 = y; the assignments of the remaining variables are unknown. Given two processors, the first processor needs to solve these four instances for CA, whereas the second one needs to solve them for CB. Now suppose that the objective values for CA,00 and CB,00 were found. The objective value for problem instance C00 will then be the sum of these two objectives. To calculate the overall objective value and identify the solution of the original CSP, one needs to identify the maximum among the objective values of all problem instances. An example of parallel solving after the application of a decomposition strategy is presented in Figure 17.6, where the solution of the initial CSP is obtained by examining, for all involved CSPs, the two potential values of the selected cut-vertex h (i.e., solving each CSP for h = 0 and h = 1). The overall objective corresponds to the maximum of the two objectives, which is justified by the binary nature of variable h.
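The enumeration over border variables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the objective tables OBJ_A and OBJ_B, and their numbers are invented here, and each lookup stands in for a call to an integer programming solver on one component with the border variables fixed.

```python
from itertools import product


def solve_with_border_variables(components, num_border_vars):
    """Test every 0/1 assignment of the border variables and keep the best total.

    `components` is a list of callables. Each one takes a tuple of
    border-variable values and returns the optimal objective value of its
    sub-CSP under that assignment (hypothetical stand-in for a solver call).
    """
    best_value, best_assignment = None, None
    # p binary border variables give 2**p assignments to test.
    for assignment in product((0, 1), repeat=num_border_vars):
        # Once the border variables are fixed the components are independent,
        # so their objective values are simply added.
        total = sum(solve(assignment) for solve in components)
        if best_value is None or total > best_value:
            best_value, best_assignment = total, assignment
    return best_value, best_assignment


# Invented objective values for components C_A and C_B, indexed by (u1, u2).
OBJ_A = {(0, 0): 3, (0, 1): 5, (1, 0): 4, (1, 1): 6}
OBJ_B = {(0, 0): 7, (0, 1): 2, (1, 0): 5, (1, 1): 3}

best_value, best_assignment = solve_with_border_variables(
    [lambda a: OBJ_A[a], lambda a: OBJ_B[a]], num_border_vars=2
)
print(best_value, best_assignment)  # 10 (0, 0)
```

In a parallel setting each callable would run on its own processor, and only the per-assignment objective values would be sent back for the final maximum.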
17.2 Computational Experiments and Results
In this section, we provide the results of a set of experiments that were conducted in [25] to test the structural decomposition phase of the presented parallelization framework. The framework was tested using CSPs produced by the inline approach of [23], when it was applied to hide sensitive knowledge coming from three real datasets, summarized in Table 14.4. In all tested settings, the minimum support thresholds were selected to ensure an adequate number of frequent itemsets, and the sensitive itemsets to be hidden were selected randomly among the frequent ones. Several experiments were conducted for the hiding of sensitive 2–itemsets, 3–itemsets, and 4–itemsets. All experiments were conducted on a PC running Linux on an Intel Pentium D 3.2 GHz processor equipped with 4 GB of main memory. All integer programs were solved using ILOG CPLEX 9.0 [36]. All presented experiments assume the existence of the necessary resources that would lead to a full-scale parallelization of the initial CSP. This means that if the original CSP can potentially break into P independent parts, then at least P processors are assumed to be available to independently undertake and solve each resultant CSP.
Thus, the overall runtime of the hiding algorithm will equal the summation of (i) the runtime of the serial algorithm that produced the original CSP, (ii) the runtime of the Structure Identification Algorithm (SIA) that decomposed the original CSP into numerous independent parts, (iii) the time that is needed to communicate each of the resulting CSPs to an available processor, (iv) the time needed to solve the largest of these CSPs, (v) the communication time needed to return the attained solutions to the original processor (hereon called “master”) that held the whole problem, and finally (vi) the time needed by the master processor to calculate the summation of the objective values returned in order to compute the overall solution of the problem. That is:
$$T_{overall} = T_{HA} + T_{SIA} + T_{spread} + T_{solve} + T_{gather} + T_{aggregate} \tag{17.5}$$
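As a small illustration of Equation (17.5), the following Python sketch adds up the six terms, taking the solve time to be that of the largest (slowest) component, as item (iv) states. The function name and the numbers in the usage line are hypothetical and are not measurements from the experiments.

```python
def overall_runtime(t_ha, t_sia, t_spread, component_solve_times, t_gather, t_aggregate):
    """Overall runtime of Equation (17.5) under full-scale parallelization."""
    # With one processor per component, the solve phase lasts as long as the
    # slowest component takes.
    t_solve = max(component_solve_times)
    return t_ha + t_sia + t_spread + t_solve + t_gather + t_aggregate


# Invented timings (seconds) for three components solved in parallel.
print(overall_runtime(10, 2, 1, [30, 45, 12], 1, 1))  # 60
```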
Fig. 17.7: Performance gain through parallel solving, when omitting the V part of the CSP. [Three panels: BMS–1, BMS–2 and Mushroom datasets. Each panel plots TSIA, Tsolve and Tserial, i.e., parallel (TSIA + Tsolve) vs serial runtime in seconds, for the hiding scenarios 2x2, 2x3, 2x4, 3x2, 3x3 and 3x4.]
The following experiments capture the runtime of (ii) and (iv), namely TSIA and Tsolve, since the communication overhead (Tspread + Tgather) and the overall solution calculation overhead (Taggregate) are considered to be negligible when compared to these runtimes. Moreover, the runtime of (i) does not change in the case of parallelization, and therefore its measurement in these experiments is of no importance. To compute the benefit of parallelization, we include in the results the runtime Tserial of solving the entire CSP without prior decomposition. The first set of experiments (presented in Figure 17.7) breaks the initial CSP into a controllable number of components by excluding all the constraints involving itemsets from set V (see Figure 14.2). Thus, to break the original CSP into P parts, one needs to identify P mutually exclusive (in the universe of items) itemsets to hide. However, based on the number of supporting transactions for each of these itemsets in DO, the size of each produced component may vary significantly. As one can observe in Figure 17.7, the time that was needed for the execution of the SIA algorithm and the identification of the independent components is low when compared to the time needed for solving the largest of the resulting CSPs. Moreover, by comparing the time needed for the serial and the parallel solving of the CSP, one can notice how beneficial the decomposition strategy is in reducing the runtime that is required by the hiding algorithm. For example, in the 2 × 2 hiding scenario for BMS–1, serial solving of the CSP requires 218 seconds,
while parallel solving requires 165 seconds. This means that by solving the CSP in parallel using two processors, the solution time is reduced by 53 seconds.
The second set of experiments, presented in Figure 17.8, includes the V part of the CSP produced by the inline algorithm. As one can observe, there are certain situations (e.g., 3 × 3 in BMS–WebView–1 or 3 × 2 in BMS–WebView–2) in which the original CSP cannot be decomposed (i.e., Tsolve = 0). In all such cases, one has to apply either the decomposition approach using articulation points or the weighted graph partitioning algorithm, in order to parallelize the hiding process.
Fig. 17.8: Performance gain through parallel solving of the entire CSP. [Three panels: BMS–1, BMS–2 and Mushroom datasets. Each panel plots TSIA, Tsolve and Tserial, i.e., parallel (TSIA + Tsolve) vs serial runtime in seconds, for the hiding scenarios 2x2, 2x3, 3x2 and 3x3.]
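The gain reported for the 2 × 2 BMS–1 scenario can also be read in relative terms. The arithmetic below uses only the two runtimes quoted in the text (218 and 165 seconds); the symbol T_parallel is shorthand introduced here, assumed to stand for the parallel runtime TSIA + Tsolve.

```latex
\begin{align*}
T_{serial} - T_{parallel} &= 218 - 165 = 53 \text{ s} \\
\frac{T_{serial}}{T_{parallel}} &= \frac{218}{165} \approx 1.32 \\
\frac{T_{serial} - T_{parallel}}{T_{serial}} &= \frac{53}{218} \approx 24.3\%
\end{align*}
```

That is, under this reading, two processors give a speedup of roughly 1.3, well below the ideal factor of 2, which is consistent with the remark that the sizes of the produced components may vary significantly.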
Chapter 18
Quantifying the Privacy of Exact Hiding Algorithms
The exact hiding algorithms that were presented in Chapters 14, 15, and 16 are all based on the principle of minimum harm (distortion), which requires the minimum amount of modifications to be made to the original database to facilitate sensitive knowledge hiding. As a result, in most cases (depending on the problem instance at hand), the sensitive itemsets are expected to be positioned just below the revised borderline in the computed sanitized database D. However, the choice of the minimum support threshold based on which the hiding is performed can lead to radically different solutions, some of which are bound to be superior to others in terms of offered privacy. In this chapter, we present a layered approach, originally proposed in [27], which enables the owner of the data to quantify the privacy that the employed exact hiding algorithm offers on a given database. Assuming that an adversary has no knowledge regarding which of the infrequent itemsets in D are the sensitive ones, this approach can compute the disclosure probability for the hidden sensitive itemsets, given the sanitized database D and a minimum support or frequency threshold msup/mfreq.
A. Gkoulalas-Divanis and V.S. Verykios, Association Rule Hiding for Data Mining, Advances in Database Systems 41, DOI 10.1007/978-1-4419-6569-1_18, © Springer Science+Business Media, LLC 2010
Fig. 18.1: A layered approach to quantifying the privacy that is offered by the exact hiding algorithms.
Figure 18.1(i) demonstrates the layered approach of [27] as applied on a sanitized database D. The support-axis (shown vertically in the figure) is partitioned into two regions with respect to the minimum support threshold msup that is used for the mining of the frequent itemsets in D. In the upper region (above msup), Layer 0 contains all the frequent itemsets that are found in D after the application of a frequent itemset mining algorithm like Apriori [7]. The value of MSF indicates the maximum support of a frequent itemset in D. The region starting just below msup contains all the infrequent itemsets, including the sensitive ones, provided that they were appropriately covered up by the applied hiding algorithm. The region below msup is further partitioned into three layers, defined as follows:
Layer 1 This layer spans from the infrequent itemsets having a maximum support (MSI) to the sensitive itemsets with a maximum support (MSS), excluding the latter ones. It models the “gap” that may exist below the borderline, either due to the use of a margin of safety to better protect the sensitive knowledge (as is the typical case in various hiding approaches, e.g. [63]), or due to the properties of the original database DO and the sensitive itemsets that were selected to be hidden. This layer is assumed to contain ψ itemsets.
Layer 2 This layer spans from the sensitive itemsets having a maximum support (MSS) to the sensitive itemsets with the minimum support (mSS), inclusive. It contains all the sensitive knowledge that the owner wishes to protect, possibly along with some nonsensitive infrequent itemsets. This layer is assumed to contain s itemsets out of which S are the sensitive ones.
Layer 3 This layer collects the rest of the infrequent itemsets, starting from the one having the maximum support just below mSS and ending at the infrequent itemset with the minimum support (mSI), inclusive. This layer is assumed to contain a total of r itemsets.
Given the layered partitioning of the itemsets in D with respect to their support values, the quality of a hiding algorithm depends on the position of the various infrequent itemsets in Layers 1, 2 and 3. Specifically, let x denote the distance (from msup) below the borderline where an adversary tries to locate the sensitive knowledge (e.g., by mining database D using support threshold msup – x). Then, estimator Ẽ provides the mean probability of sensitive knowledge disclosure and is defined in [27] as follows:
$$\tilde{E} = \begin{cases} 0, & x \in [0 \ldots \psi] \\[1ex] \dfrac{S \cdot \frac{\mathit{msup} - x}{\psi + MSS - mSS + 1}}{\psi + s \cdot \frac{\mathit{msup} - x}{\psi + MSS - mSS + 1}}, & x \in (\psi \ldots (\psi + MSS - mSS + 1)] \\[2ex] \dfrac{S}{\psi + s + r \cdot \frac{\mathit{msup} - x}{\mathit{msup} - mSI + 1}}, & x \in ((MSS - mSS + 1) \ldots (\mathit{msup} - mSI + 1)] \end{cases} \tag{18.1}$$
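The piecewise estimator of Equation (18.1) can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the function and parameter names are introduced here, the branches are tested in the order in which they are listed (which resolves any overlap between the printed intervals), and the example values are obtained by counting supports in the database of Table 18.1 rather than taken from the text.

```python
from fractions import Fraction


def disclosure_estimate(x, msup, psi, s, S, r, MSS, mSS, mSI):
    """Mean disclosure probability of Equation (18.1), as read from the text.

    x is the distance below msup at which the adversary mines the database;
    psi, s and r are the sizes of Layers 1, 2 and 3, and S is the number of
    sensitive itemsets in Layer 2.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    layer2_end = psi + MSS - mSS + 1
    if x <= psi:
        # The adversary has not reached the layer with the sensitive itemsets.
        return Fraction(0)
    if x <= layer2_end:
        share = Fraction(msup - x, layer2_end)
        return (S * share) / (psi + s * share)
    if x <= msup - mSI + 1:
        share = Fraction(msup - x, msup - mSI + 1)
        return Fraction(S) / (psi + s + r * share)
    raise ValueError("x lies below the least supported infrequent itemset")


# Assumed example values: msup = 3, empty Layer 1, Layer 2 = {B, AB, CD} with
# B and CD sensitive, Layer 3 = {AC, AD, ACD} with support 1.
EXAMPLE = dict(msup=3, psi=0, s=3, S=2, r=3, MSS=2, mSS=2, mSI=1)
print(disclosure_estimate(1, **EXAMPLE))  # 2/3
```

Exact fractions are used instead of floats so that the result for x = 1 can be compared directly with the value 2/3 quoted for this example.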
By computing Ẽ for the sanitized database D, the owner of DO can gain an in-depth understanding of the degree of protection that is offered on the sensitive knowledge in D. Furthermore, he or she may decide how much lower (with respect to the support) the sensitive itemsets should be located in D, such that they are adequately covered up. As a result, a hiding methodology can be applied to the original database DO to produce a sanitized version D that meets the newly imposed privacy requirements. Given the presented exact approaches to sensitive knowledge hiding, such a methodology can be implemented in two steps, as follows:
1. The database owner uses the probability estimator Ẽ to compute the value of x that guarantees maximum safety of the sensitive knowledge.
2. An exact knowledge hiding approach is selected and extra constraints are added to the formulated CSP to ensure that the support of the sensitive knowledge in the generated sanitized database will be below msup − x.
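$$\begin{aligned} \text{maximize} \quad & \sum_{u_{nm} \in U} u_{nm} \\ \text{subject to} \quad & \sum_{T_n \in D_X} \prod_{i_m \in X} u_{nm} < \mathit{msup} - x, \quad \forall X \in S_{min} \\ & \sum_{T_n \in D_R} \prod_{i_m \in R} u_{nm} \geq \mathit{msup}, \quad \forall R \in V \end{aligned}$$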
Fig. 18.2: The modified CSP for the inline algorithm that guarantees increased safety for the hiding of sensitive knowledge.
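The two families of constraints in Figure 18.2 can be checked against a candidate sanitized database with a short Python sketch. This only verifies a given solution; it does not solve the CSP. The helper names are introduced here, the transactions follow Table 18.1, and the set V used below (the single items A, C and D) is an assumption made purely for illustration.

```python
def support(database, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for transaction in database if itemset <= transaction)


def satisfies_modified_csp(database, sensitive, must_stay_frequent, msup, x):
    """Check both constraint families of the modified CSP on `database`."""
    # Sensitive itemsets must fall strictly below msup - x.
    hidden = all(support(database, X) < msup - x for X in sensitive)
    # Itemsets of V must keep a support of at least msup.
    preserved = all(support(database, R) >= msup for R in must_stay_frequent)
    return hidden and preserved


# Transactions of Table 18.1, written as the items they contain.
ROWS = ["AB", "AB", "A", "A", "D", "ACD", "CD", "C", "A", "D"]
DATABASE = [frozenset(row) for row in ROWS]
SENSITIVE = [frozenset("B"), frozenset("CD")]
V = [frozenset("A"), frozenset("C"), frozenset("D")]  # hypothetical choice of V

print(satisfies_modified_csp(DATABASE, SENSITIVE, V, msup=3, x=0))  # True
print(satisfies_modified_csp(DATABASE, SENSITIVE, V, msup=3, x=1))  # False
```

With x = 0 the database passes, since both sensitive itemsets have support 2; with x = 1 it fails, which corresponds to the extra distortion that the stricter constraints would require.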
For example, in the case of the inline approach, the CSP of Figure 18.2 guarantees that these requirements hold¹. Another possibility is to apply a post-processing algorithm that increases the support of the infrequent itemsets of Layer 3 in the sanitized database D, such that they move to Layer 2 (thus increasing the concentration of itemsets in the layer that contains the sensitive ones). On the negative side, all these methodologies for increasing the safety of the sensitive knowledge decrease the quality of the sanitized database with respect to its original counterpart. This brings up one of the most commonly discussed topics in knowledge hiding: hiding quality vs. usability of the sanitized database offered by the hiding algorithm.
Table 18.1: An example of a sanitized database D produced by the inline algorithm [23], which conceals the sensitive itemsets S = {B,CD} at a frequency threshold of mfreq = 0.3.
A B C D
1 1 0 0
1 1 0 0
1 0 0 0
1 0 0 0
0 0 0 1
1 0 1 1
0 0 1 1
0 0 1 0
1 0 0 0
0 0 0 1
Figure 18.1(ii) demonstrates the operation of the layered approach of [27] for the example database of Table 18.1. As expected, due to the minimum harm that is introduced by the exact hiding algorithms, both sensitive itemsets B and CD are located just under the borderline. In this example, the size of Layer 1 is zero (i.e., ψ = 0). Based on the estimator Ẽ, the probability of an adversary identifying the sensitive knowledge is found to be 2/3 when using x = 1 (equivalently, when mining the database using msup = 2). Since the probability of sensitive knowledge disclosure is high, the owner of the data could either (i) use the CSP formulation of Figure 18.2 to constrain the support of the sensitive itemsets to at most 1, or (ii) apply a methodology that increases the support of some of the itemsets in Layer 3 so that they move to Layer 2 (i.e., obtain a support of 2). Both approaches are bound to introduce extra distortion to the original database DO from which D was produced, but will also provide better protection of the sensitive knowledge.
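The layer contents behind this example can be recomputed directly from Table 18.1. The Python sketch below is illustrative only: the variable names are introduced here, itemsets with zero support are ignored (an assumption), and the final ratio simply counts how many of the itemsets in Layer 2 are sensitive.

```python
from fractions import Fraction
from itertools import combinations

ITEMS = "ABCD"
# Rows of Table 18.1, one string of 0/1 flags per transaction (columns A, B, C, D).
ROWS = ["1100", "1100", "1000", "1000", "0001",
        "1011", "0011", "0010", "1000", "0001"]
DATABASE = [frozenset(i for i, bit in zip(ITEMS, row) if bit == "1") for row in ROWS]


def supports(database, items):
    """Support of every itemset over `items` that occurs at least once."""
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = sum(1 for t in database if frozenset(combo) <= t)
            if count > 0:
                result["".join(combo)] = count
    return result


SUPPORT = supports(DATABASE, ITEMS)
msup = 3                      # mfreq = 0.3 over 10 transactions
sensitive = {"B", "CD"}
MSS = max(SUPPORT[k] for k in sensitive)
mSS = min(SUPPORT[k] for k in sensitive)

frequent = {k for k, v in SUPPORT.items() if v >= msup}
layer1 = {k for k, v in SUPPORT.items() if MSS < v < msup}
layer2 = {k for k, v in SUPPORT.items() if mSS <= v <= MSS}
layer3 = {k for k, v in SUPPORT.items() if v < mSS}

print(sorted(frequent), sorted(layer2), sorted(layer3))
print(Fraction(len(sensitive & layer2), len(layer2)))  # 2/3
```

Layer 1 comes out empty, in agreement with ψ = 0, and two of the three itemsets with support 2 are sensitive, which matches the disclosure probability of 2/3 quoted for x = 1.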
¹ We should also point out that the owner of the data can decide to hide different sensitive itemsets to a different degree, thus considering some of these itemsets as more sensitive than others. To achieve that, he or she can adjust the support threshold (i.e., the right side of the inequalities) in the corresponding constraints of the CSP of Figure 18.2 that involve these sensitive itemsets.
Chapter 19
Summary
In this part of the book, we studied the third class of association rule hiding approaches, which involves methodologies that lead to superior hiding solutions when compared to heuristic and border-based algorithms. The methodologies of this class operate by (i) using the process of border revision (Chapter 9) to model the optimal hiding solution, (ii) constructing a constraint satisfaction problem using the itemsets of the computed revised borders, and (iii) solving the constraint satisfaction problem by applying integer programming. Transforming the hiding problem to an optimization problem guarantees the minimum amount of side-effects in the computed hiding solution and offers minimal distortion of the original database to accommodate sensitive knowledge hiding. Unlike the previous classes of approaches, exact hiding algorithms perform the sanitization process by considering all relevant data modifications to select the one that leads to an optimal hiding solution for the problem at hand.
In Chapter 13, we surveyed Menon’s algorithm [47], which is the first approach to incorporate an exact part in the hiding process, effectively identifying the minimum number of transactions from the original database that have to be sanitized for the hiding of the sensitive knowledge. Following that, in Chapters 14, 15 and 16 we presented three exact hiding methodologies that have been recently proposed to compute optimal hiding solutions. The optimality of the hiding solution that is computed by the exact hiding algorithms comes at a cost to the computational complexity of the sanitization process. In Chapter 17, we elaborated on a parallelization framework that was proposed in [25] to improve the scalability of exact hiding algorithms. Through the use of this framework, the exact algorithms can scale to large problem instances, provided that the necessary number of processors is available.
Last, in Chapter 18, we presented a layered approach to the quantification of the privacy that is offered by the exact hiding methodologies. By using this approach, the data owner can achieve the desired level of privacy in the hiding of the sensitive itemsets at a minimal distortion of his or her data. Moreover, he or she can decide the extent to which each sensitive itemset will be hidden in the sanitized database by tuning the corresponding constraints in the CSP that is produced by the exact hiding methodology.
Part V
Epilogue
Chapter 20
Conclusions
The serious privacy concerns raised by the sharing of large transactional databases with untrusted third parties for association rule mining purposes soon brought into existence the area of association rule hiding, a very popular subarea of privacy preserving data mining. Association rule hiding focuses on the privacy implications originating from the application of association rule mining to shared databases and aims to provide sophisticated techniques that effectively block access to sensitive association rules that would otherwise be revealed when mining the data. The research in this area has progressed mainly along three principal directions: (i) heuristic-based approaches, (ii) border-based approaches, and (iii) exact hiding approaches. Taking into consideration the rich work proposed so far, in this book we tried to collect the most significant research findings since 1999, when this area was brought into existence, and effectively cover each principal line of research. The detail of our presentation was intentionally finer on exact hiding approaches, since they comprise the most recent direction, offering increased quality guarantees in terms of distortion and side-effects introduced by the hiding process.
The first part of the book serves as an introduction to the area of association rule hiding by motivating the problem at hand, discussing its predominant research challenges and variations, and providing the necessary background for its proper understanding. The distinctive characteristics of the proposed methods led us to propose a taxonomy that partitions them along four orthogonal dimensions, based on the employed hiding strategy, the data modification strategy, the number of rules that are concurrently hidden, and the nature of the algorithm.
By using this partitioning, we devoted each subsequent part of the book to a specific class of approaches, hence presenting heuristic-based approaches in Part II, border-based approaches in Part III, and exact hiding approaches in Part IV of the book. The majority of the proposed methodologies for association rule hiding are of heuristic nature in order to effectively tackle the combinatorial nature of the problem. The basic property of heuristic algorithms that makes them attractive is their computational and memory efficiency, which allows them to scale well even to very large datasets. A wide variety of heuristics have been proposed over the years. In the second part of the book, we partitioned the heuristic methodologies along two main
categories, namely distortion-based schemes and blocking-based schemes, and further partitioned each of these categories into support-based and confidence-based approaches. Among the two main categories, blocking schemes consider item modification (mostly item inclusion) to be unacceptable without proper indication in the sanitized database, as such modifications can render the data useless at a transaction level. To tackle this issue, blocking approaches incorporate unknowns to mask potentially modified items in the sanitized database, such that the data recipient is able to distinguish any counterfeit entries from the correct ones. Regardless of the specific mechanism that is employed by each heuristic approach, all heuristic algorithms operate by taking locally best decisions, which often do not translate to globally optimal hiding solutions.
Border-based approaches were covered in the third part of the book and, although they are also of heuristic nature, they manage to offer increased quality guarantees when compared to plain heuristics. The main characteristic of this family of approaches is that they all employ the revised positive border, which is the original border shaped up by the removal of the sensitive itemsets, in order to track (and minimize) the impact of sanitizing selected transactions to facilitate the hiding of the sensitive knowledge. By applying those item modifications that affect the support of the itemsets in the revised positive border the least, border-based algorithms are generally capable of identifying hiding solutions with fewer side-effects than most plain heuristic approaches, while causing substantially less distortion to the original database to facilitate the hiding of the sensitive knowledge.
Last, the exact class of hiding approaches was covered in the fourth part of the book. This is the most recent line of research, which collects algorithms that offer guarantees on the quality of the computed hiding solution.
On the negative side, these algorithms suffer from an increased computational complexity cost. Exact hiding approaches operate by transforming the hiding problem into an equivalent optimization problem, where the objective is to minimize the distortion that is caused to the original database (in terms of item or transaction modifications) to facilitate the hiding of all the sensitive knowledge with the least side-effects. Unlike the previous classes of approaches, exact algorithms operate on the original database by considering all possible solutions for the problem at hand, in order to find the one that optimizes the criterion function. To improve the poor scalability of these approaches, we presented a framework that aims to parallelize the hiding process by decomposing the original optimization problem into numerous blocks. Moreover, we discussed an approach that has been proposed to tune the privacy that is offered by the exact hiding methodologies on a per-itemset level, effectively allowing the owner of the data to decide on the disclosure threshold for each sensitive itemset.
In conclusion, over the past years we have witnessed a plethora of novel methodologies being published in top-tier conferences and journals that propose interesting solutions to the problem of association rule hiding. In this book, we aimed to provide an overview of the most important state-of-the-art research accomplishments by presenting them under a new perspective, unifying their theory and shedding light on the operation of each methodology. We would like to emphasize that it is our opinion that this domain has certainly not yet reached a consensus that
would not justify any future research. On the contrary, we feel that there are several open problems that need to be addressed as well as a lot of room for improvement of the current hiding methodologies. Moreover, in view of the advent of the new emerging technologies that require the hiding of frequent itemsets under the prism of application-specific requirements, sophisticated hiding solutions need to be designed that operate under the imposed requirements.
Chapter 21
Roadmap to Future Work
There is a plethora of open issues related to the problem of association rule hiding that are still under investigation. First of all, the emergence of sophisticated exact hiding approaches of high complexity, especially for very large databases, calls for efficient parallel approaches that improve the runtime of these algorithms. Parallel approaches allow for decomposition of the constraint satisfaction problem into numerous components that can be solved independently. The overall solution is then attained as a function of the objectives of the individual solutions. A framework for decomposition and parallelization of exact hiding approaches has been recently proposed in [25] and is covered in Chapter 17 of the book. Although this framework improves the runtime of solving the constraint satisfaction problem that is produced by the exact hiding algorithms, we are confident that further optimizations can be achieved by exploiting the inherent characteristics of the constraints that are involved in the CSP. Also, different optimization techniques can be adopted to allow searching the space of possible solutions for the CSP in a more advanced way.
Regarding the use of unknowns in blocking association rule hiding algorithms, a lot more research is needed to provide sophisticated hiding solutions that take advantage of the capabilities offered by their use. Evidence has shown that the use of unknowns in several real life scenarios is much more preferable than the use of conventional distortion techniques. This is true because distortion techniques fail to provide a distinction between the real values in the dataset and the ones that were distorted by the hiding algorithm in order to allow for its proper sanitization. Therefore, it is our belief that future research in association rule hiding should aim at providing sophisticated and efficient solutions that make use of unknowns.
A different research direction concerns the use of database reconstruction approaches to generate a database from scratch that is compatible with only the nonsensitive frequent itemsets or a given set of association rules. Prominent research efforts towards this direction include the work of several researchers in the field of inverse frequent set mining [17, 33, 48, 78]. However, it was recently proved that this is an NP-hard problem [12–14]. Ongoing work considers yet another solution, which is to append to the original database a synthetically generated database part so
that the sensitive knowledge is hidden in the combined database which is disclosed to the public [26]. Another thread of research in the area of association rule hiding involves sanitization algorithms that operate on databases that are regularly updated with new transactions instead of being static [75]. In such cases, sanitization of the new database from scratch becomes impractical and thus should be avoided. Last, the problem of association rule hiding has been recently considered in the context of data streams, where unique challenges arise due to the elevated processing requirements of this type of data [69]. Other interesting future trends include, but are certainly not limited to, (i) the extension of the border revision idea to cover the direct hiding of association rules, instead of their indirect hiding through their generative itemsets, (ii) the introduction of techniques for correlation rule hiding, which is a more general field than the one of association rules hiding, (iii) the provision and unification of more advanced measures for the comparison of the different hiding strategies, and (iv) the inception of spatiotemporal privacy preserving rule hiding methodologies that will prohibit the leakage of sensitive rules related to “sensitive” spatial and/or temporal information. The hiding of spatiotemporal patterns is currently a hot research topic since it imposes greater challenges than the traditional knowledge hiding approaches. Finally, future work specifically targeted in the area of exact frequent itemsets hiding, should try to address the following research problems. First, it should investigate the possibility of further reducing the size of the constraint satisfaction problems that are constructed by the exact hiding methodologies, while still guaranteeing the optimality of the hiding solution. 
We envision the reduction in the size of the constraint satisfaction problems to be not in terms of how many itemsets are controlled (i.e., each itemset should continue to be represented in the constraint satisfaction problem to allow for an exact hiding solution) but in terms of the form of the inequalities and the associated unknowns that are necessary for controlling the status (frequent vs. infrequent) of these itemsets in the sanitized database. Second, future work in this research area should aim at the proposal of methodologies for the computation of the exact number of transactions that need to be appended to the original database by the hybrid algorithm [26], in order to facilitate exact knowledge hiding. Such methodologies will allow for a reduction in the size of the constraints satisfaction problem that is produced by the hybrid algorithm, as well as the elimination of the post-processing phase regarding the validity of the transactions in the generated database extension.
References
1. U.S. Dept. of Health and Human Services: Standards for privacy of individually identifiable health information; Final Rule, 2002. Federal Register 45 CFR: Pt. 160 and 164. 2. National Institutes of Health: Final NIH statement on sharing research data, 2003. NOT–OD– 03–032. 3. O. Abul, M. Atzori, F. Bonchi, and F. Giannotti. Hiding sequences. In Proceedings of the 23rd International Conference on Data Engineering Workshops (ICDEW), pages 147–156, 2007. 4. C. C. Aggarwal and P. S. Yu. Privacy Preserving Data Mining: Models and Algorithms. Springer–Verlag, 2008. 5. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 207–216, 1993. 6. R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(1):962–969, 1996. 7. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Databases (VLDB), pages 487–499, 1994. 8. R. Agrawal and R. Srikant. Privacy preserving data mining. SIGMOD Record, 29(2):439–450, 2000. 9. A. Amiri. Dare to share: Protecting sensitive knowledge with data sanitization. Decision Support Systems, 43(1):181–191, 2007. 10. M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. S. Verykios. Disclosure limitation of sensitive rules. In Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX), pages 45–52, 1999. 11. R. Bayardo. Efficiently mining long patterns from databases. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998. 12. T. Calders. Computational complexity of itemset frequency satisfiability. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 143–154, 2004. 13. T. Calders. 
The complexity of satisfying constraints on databases of transactions. Acta Informatica, 44(7):591–624, 2007. 14. T. Calders. Itemset frequency satisfiability: Complexity and axiomatization. Theoretical Computer Science, 394(1):84–111, 2008. 15. L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 Workshop on New Security Paradigms (NSPW), pages 82–89, 1998. 16. K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the 5th IEEE International Conference on Data Mining, pages 589–592, 2005.
17. X. Chen, M. Orlowska, and X. Li. A new framework of privacy preserving data sharing. In Proceedings of the 4th IEEE International Workshop on Privacy and Security Aspects of Data Mining, pages 47–56, 2004.
18. C. W. Clifton and D. Marks. Security and privacy implications of data mining. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 15–19, Feb. 1996.
19. J. C. da Silva and M. Klusch. Inference on distributed data clustering. In Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 610–619, 2005.
20. E. Dasseni, V. S. Verykios, A. K. Elmagarmid, and E. Bertino. Hiding association rules by using confidence and support. In Proceedings of the 4th International Workshop on Information Hiding, pages 369–383, 2001.
21. C. Farkas and S. Jajodia. The inference problem: A survey. SIGKDD Explorations, 4(2):6–11, 2002.
22. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP–Completeness (Series of Books in the Mathematical Sciences). W. H. Freeman, January 1979.
23. A. Gkoulalas-Divanis and V. S. Verykios. An integer programming approach for frequent itemset hiding. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM), pages 748–757, 2006.
24. A. Gkoulalas-Divanis and V. S. Verykios. A hybrid approach to frequent itemset hiding. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pages 297–304, 2007.
25. A. Gkoulalas-Divanis and V. S. Verykios. A parallelization framework for exact knowledge hiding in transactional databases. In Proceedings of the 23rd International Information Security Conference (SEC), pages 349–363, 2008.
26. A. Gkoulalas-Divanis and V. S. Verykios. Exact knowledge hiding through database extension. IEEE Transactions on Knowledge and Data Engineering, 21(5):699–713, 2009.
27. A. Gkoulalas-Divanis and V. S. Verykios. Hiding sensitive knowledge without side effects. Knowledge and Information Systems, 20(3):263–299, 2009.
28. A. Gkoulalas-Divanis and V. S. Verykios. Privacy Preserving Data Mining: How far can we go?, pages 1–21. Handbook of Research on Data Mining in Public and Private Sectors: Organizational and Governmental Applications. IGI Global, 2009. Accepted.
29. GLPK. GNU GLPK 4.32 User’s Manual. Free Software Foundation, Inc., Boston, MA, 2008. Available at http://www.gnu.org/software/glpk/glpk.html/.
30. B. Goethals. The FIMI repository, 2003. Available at http://fimi.cs.helsinki.fi/.
31. M. Grean and M. J. Shaw. Supply-Chain Partnership between P&G and Wal-Mart, chapter 8, pages 155–171. Integrated Series in Information Systems. Springer–Verlag, 2002.
32. C. Gueret, C. Prins, and M. Sevaux. Applications of Optimization with Xpress–MP. Dash Optimization Ltd., 2002.
33. Y. Guo, Y. Tong, S. Tang, and D. Yang. A fp-tree-based method for inverse frequent set mining. In Proceedings of the 23rd British National Conference on Databases (BNCOD), pages 152–163, 2006.
34. E. H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 277–288, 1997.
35. C. T. Heun. Wal-Mart and Other Companies Reassess their Data-Sharing Strategies. Information Week, May 2001.
36. I. ILOG. CPLEX 9.0 User’s Manual. Mountain View, CA, Oct. 2005. Available at http://www.ilog.com/.
37. G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. A new privacy–preserving distributed k–clustering algorithm. In Proceedings of the 2006 SIAM International Conference on Data Mining (SDM), 2006.
38. S. Jha, L. Kruger, and P. McDaniel. Privacy preserving clustering. In Proceedings of the 10th European Symposium on Research in Computer Security (ESORICS), pages 397–417, 2005.
39. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
40. A. Katsarou, A. Gkoulalas-Divanis, and V. S. Verykios. Reconstruction–based classification rule hiding through controlled data modification. In Proceedings of the 5th IFIP Conference on Artificial Intelligence Applications and Innovations (AIAI), 2009.
41. R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD–Cup 2000 organizers’ report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000. http://www.ecn.purdue.edu/KDDCUP.
42. G. Lee, C. Y. Chang, and A. L. P. Chen. Hiding sensitive patterns in association rules mining. In Proceedings of the 28th International Computer Software and Applications Conference (COMPSAC), pages 424–429, 2004.
43. Y. Lindell and B. Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3):36–54, 2000.
44. D. Luenberger. Introduction to Linear and Non–Linear Programming. Addison–Wesley, 1973.
45. M. D. Mailman, M. Feolo, Y. Jin, et al. The NCBI dbGaP database of genotypes and phenotypes. Nature Genetics, 39:1181–1186, 2007.
46. H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
47. S. Menon, S. Sarkar, and S. Mukherjee. Maximizing accuracy of shared databases when concealing sensitive patterns. Information Systems Research, 16(3):256–270, 2005.
48. T. Mielikäinen. On inverse frequent set mining. In Proceedings of the 2nd Workshop on Privacy Preserving Data Mining, pages 18–23, 2003.
49. M. Morgenstern. Controlling logical inference in multilevel database systems. In Proceedings of the IEEE Symposium on Security and Privacy, pages 245–255, 1988.
50. G. V. Moustakides and V. S. Verykios. A max–min approach for hiding frequent itemsets. In Workshops Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), pages 502–506, 2006.
51. G. V. Moustakides and V. S. Verykios. A maxmin approach for hiding frequent itemsets. Data and Knowledge Engineering, 65(1):75–89, 2008.
52. J. Natwichai, X. Li, and M. Orlowska. Hiding classification rules for data sharing with privacy preservation. In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery (DAWAK), pages 468–477, 2005.
53. J. Natwichai, X. Li, and M. Orlowska. A reconstruction–based algorithm for classification rules hiding. In Proceedings of the 17th Australasian Database Conference (ADC), pages 49–58, 2006.
54. S. R. M. Oliveira and O. R. Zaïane. Privacy preserving frequent itemset mining. In Proceedings of the 2002 IEEE International Conference on Privacy, Security and Data Mining (CRPITS), pages 43–54, 2002.
55. S. R. M. Oliveira and O. R. Zaïane. Protecting sensitive knowledge by data sanitization. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pages 211–218, 2003.
56. S. R. M. Oliveira and O. R. Zaïane. Achieving privacy preservation when sharing data for clustering. In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM), 2004.
57. S. R. M. Oliveira and O. R. Zaïane. Privacy–preserving clustering by object similarity–based representation and dimensionality reduction transformation. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), pages 21–30, 2004.
58. E. Pontikakis, Y. Theodoridis, A. Tsitsonis, L. Chang, and V. S. Verykios. A quantitative and qualitative analysis of blocking in association rule hiding. In Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society (WPES), pages 29–30, 2004.
59. E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios. An experimental study of distortion–based techniques for association rule hiding. In Proceedings of the 18th Conference on Database Security (DBSEC), pages 325–339, 2004.
60. M. Reddy and R. Y. Wang. Estimating data accuracy in a federated database environment. In Proceedings of 6th International Conference on Information Systems and Management of Data (CISMOD), pages 115–134, 1995.
61. S. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Databases (VLDB), 2002.
62. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice–Hall, 2nd edition, 2003.
63. Y. Saygin, V. S. Verykios, and C. W. Clifton. Using unknowns to prevent discovery of association rules. ACM SIGMOD Record, 30(4):45–54, 2001.
64. Y. Saygin, V. S. Verykios, and A. K. Elmagarmid. Privacy preserving association rule mining. In Proceedings of the 2002 International Workshop on Research Issues in Data Engineering: Engineering E–Commerce/E–Business Systems (RIDE), pages 151–163, 2002.
65. D. A. Simovici and C. Djeraba. Mathematical Tools for Data Mining: Set Theory, Partial Orders, Combinatorics. Springer Publishing Company, Incorporated, 2008.
66. X. Sun and P. S. Yu. A border–based approach for hiding sensitive frequent itemsets. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), pages 426–433, 2005.
67. X. Sun and P. S. Yu. Hiding sensitive frequent itemsets by a border–based approach. Computing science and engineering, 1(1):74–94, 2007.
68. P. N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison–Wesley, 2005.
69. T. G. U. Gunay. Association rule hiding over data streams. Information Technology and Control, 38:125–134.
70. J. Vaidya, C. W. Clifton, and Y. M. Zhu. Privacy Preserving Data Mining. Springer–Verlag, 2006.
71. V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis. State–of–the–art in privacy preserving data mining. ACM SIGMOD Record, 33(1):50–57, 2004.
72. V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni. Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4):434–447, 2004.
73. V. S. Verykios and A. Gkoulalas-Divanis. A Survey of Association Rule Hiding Methods for Privacy, chapter 11, pages 267–289. Privacy Preserving Data Mining: Models and Algorithms. Springer Berlin Heidelberg, 2008.
74. K. Wang, B. C. M. Fung, and P. S. Yu. Template–based privacy preservation in classification problems. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), pages 466–473, 2005.
75. S.-L. Wang. Maintenance of sanitizing informative association rules. Expert Systems with Applications, 36(2):4006–4012, 2009.
76. S. L. Wang and A. Jafari. Using unknowns for hiding sensitive predictive association rules. In Proceedings of the 2005 IEEE International Conference on Information Reuse and Integration (IRI), pages 223–228, 2005.
77. S.-L. Wang, B. Parikh, and A. Jafari. Hiding informative association rule sets. Expert Systems with Applications, 33(2):316–323, 2007.
78. Y. Wang and X. Wu. Approximate inverse frequent itemset mining: Privacy, complexity, and approximation. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), pages 482–489, 2005.
79. Y. H. Wu, C. M. Chiang, and A. L. P. Chen. Hiding sensitive association rules with limited side effects. IEEE Transactions on Knowledge and Data Engineering, 19(1):29–42, 2007.
80. O. R. Zaïane, M. El-Hajj, and P. Lu. Fast parallel association rule mining without candidacy generation. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 665–668, 2001.
Index
Accuracy measure, 63, 65, 72
Affected border, 51
Aggregate approach, 31
Architectural layout, 74, 83
Articulation points, 123, 124
Association rule, 6, 9
    confidence, 9, 11
    support, 9, 10

Binary integer programming, 71
Blanket approach, 69
Border, 12
    negative border, 11
    positive border, 11
Border revision, 41
Border theory, 11
Border variable, 127

Challenges, 7
Classification rule hiding, 5, 8, 21
    parsimonious downgrading, 21
    reconstruction approaches, 21, 22
    suppression approaches, 21, 22
Constraint set, 86
Constraints degree reduction, 76, 105
Constraints graph, 124, 125
Constraints matrix, 123
Constraints satisfaction problem (def), 63
Constraints-by-transactions matrix, 66, 69
Correlation rule hiding, 144
Cover, see Generalized cover

Data disclosure, 3–5, 7
Data hiding, 5
Data sanitization, vii, 5–7, 12–15
Data sharing, 3, 4
Data streams, 144
Data utility, ix, 22, 31, 36, 93
Database inference control, 7
Database quality, 98
Database reconstruction, 143
De-identification, 4
Dedication, v
Disaggregate approach, 31
Distance measure, 73, 98

Empty transaction, see Null transaction
Estimator, 133

Feasible constraint set, 86
Fuzification, 6, 18, 35

Generalized cover, 75, 100
Generating itemset, 11, 14, 17

Hidden-first algorithm, 33
Hiding candidate, 49–51
Hiding goals, 13
Hiding side-effects
    ghost rules, 13
    lost rules, 13
Hiding solution
    approximate, 13
    exact, 13
    feasible, 13, 14
    optimal, 13, 98

Independent block, 66
Independent components, 123, 126
Inference, 4, 7, 9, 21, 22
Integer programming, 20, 63
Integer programming solver, 64
Inverse frequent itemset mining, 7, 143
Item ordering effect, 33
Itemset lattice, 41, 42

Knowledge hiding, 5, 8

Lattice, see Itemset lattice
Limiting factor, 83

Max–min criterion, 53, 55
Max–min itemset, 53, 54
Minimal sensitive itemsets, 43, 49, 72
Modification strategy
    blocking, 5, 18, 35
    distortion, 5, 13, 18, 29
Motivating examples
    Deadtrees & BigMart, 6
    National Institutes of Health, 3
    Wal-mart & Procter and Gamble, 3
Multiple rule hiding, 18, 30

Non-hidden first algorithm, 33
NP-hard, 64, 73, 74, 100
Null transaction, 105

Optimal borderline, 84
Optimal hiding scenario, 72
Optimal set, 43
Optimal solution, 98
Optimal solution set, 103
Optimization, viii, 20
Original border, 43
Oscillation, 85

Partitioning approach, 110
Priority-based distortion, 32
Privacy issues, 7
Privacy preserving clustering
    density-based approaches, 23
    protocol-based approaches, 22, 23
    transformation approaches, 22, 23
Privacy preserving data mining, vii, ix, 5
Problem instance (def), 83
Problem variants, 14, 15

Relative support, 47–50, 106
Relaxation process, see Constraints degree reduction
Revised border, 43

Safety margin, 35, 36, 101, 104–106, 132
Sequence hiding, 23, 24
Single rule hiding, 18, 30, 31
Structural decomposition, 120

Taxonomy, 17, 19
Tentative victim item, 54

Unknowns, 18, 35

Validity of transactions, 106

Weight-based sorting distortion, 32